OIAI Generative Text Evaluation

Welcome 👋

The task is to evaluate which model produces the most useful answer to the user's question. A good answer should be factually correct, directly relevant, complete enough to satisfy the user, clearly written, and safe.

Do not reward unnecessary length. Do not penalize concise answers if they fully answer the question. Penalize unsupported claims, hallucinations, evasiveness, unnecessary verbosity, and failure to follow instructions.

How it works

  1. You'll see a question and two anonymized answers, Answer A and Answer B, from two different models. You won't be told which model produced which answer.
  2. Pick the better answer, or mark both as good (equally good) or both as bad.
  3. Optionally add a comment explaining your choice.
  4. Click Next › to move on. Your progress saves automatically, so you can stop and resume anytime using the same name.
  5. When you're done, you may optionally leave overall feedback at the bottom.

Choose what to compare

Pick the two answer sets you want to evaluate against each other.
