The task is to evaluate which model produces the most useful answer to the user's question. A good answer should be factually correct, directly relevant, complete enough to satisfy the user, clearly written, and safe.
Do not reward unnecessary length. Do not penalize concise answers if they fully answer the question. Penalize unsupported claims, hallucinations, evasiveness, unnecessary verbosity, and failure to follow instructions.
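The criteria above could be encoded as a simple pairwise comparison, sketched below. This is only one possible reading of the rubric: the criterion keys, the 0-2 rating scale, the equal weighting, and the names `Judgement` and `prefer` are all assumptions made for illustration and do not come from the instructions themselves.

```python
from dataclasses import dataclass, field
from typing import Dict

# Criterion keys are hypothetical labels for the qualities named in the rubric.
REWARDED = ("correct", "relevant", "complete", "clear", "safe")
PENALIZED = ("unsupported_claims", "hallucinations", "evasiveness",
             "verbosity", "instruction_failures")


@dataclass
class Judgement:
    """Ratings a grader assigned to one answer (assumed scale: 0 to 2 each)."""
    rewarded: Dict[str, int] = field(default_factory=dict)
    penalized: Dict[str, int] = field(default_factory=dict)

    def score(self) -> int:
        # Answer length is deliberately not an input, so a longer answer
        # cannot score higher merely by being longer.
        gained = sum(self.rewarded.get(k, 0) for k in REWARDED)
        lost = sum(self.penalized.get(k, 0) for k in PENALIZED)
        return gained - lost


def prefer(a: Judgement, b: Judgement) -> str:
    """Return "A", "B", or "tie" depending on which judgement scores higher."""
    score_a, score_b = a.score(), b.score()
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


# A concise answer that fully answers the question...
concise = Judgement(rewarded={k: 2 for k in REWARDED})
# ...versus an equally strong one that pads and makes an unsupported claim.
padded = Judgement(rewarded={k: 2 for k in REWARDED},
                   penalized={"verbosity": 1, "unsupported_claims": 1})
winner = prefer(concise, padded)
```

Equal weights keep the sketch readable. A real grader would likely weight factual correctness and safety more heavily than clarity.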
Pick the two answer sets you want to evaluate against each other.
data/ and check config.json, then restart the server.