Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Today’s paper proposes evaluating the quality of LLM generations with a Panel of LLM Evaluators (PoLL), composed of multiple smaller language models, instead of a single large model such as GPT-4. This approach aims to reduce intra-model bias, cost, and latency compared to relying on one powerful model as the sole judge.
Method Overview
The PoLL method works by having multiple language models from different model families (e.g., OpenAI’s GPT, Anthropic’s Claude, Cohere’s Command) independently score the output of a test model. Their individual scores are then aggregated through a voting function, such as max voting or average pooling.
For the experiments, the authors used a PoLL composed of three models: GPT-3.5 (OpenAI), Claude 3 Haiku (Anthropic), and Command R (Cohere). On question answering tasks, where judgments are binary (correct/incorrect), they used max voting; for the Chatbot Arena task, where judges assign 1-5 scores, they used average pooling.
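To make the aggregation concrete, here is a minimal Python sketch. It assumes "max voting" means taking the most common verdict across the panel (with three judges and binary labels, this is a majority vote), and the per-judge outputs are invented for illustration; this is not the authors' exact implementation.

```python
from collections import Counter
from statistics import mean

def max_vote(judgments):
    # "Max voting" is read here as plurality: return the most common
    # verdict across the panel. With three judges and binary labels,
    # this amounts to a majority vote.
    return Counter(judgments).most_common(1)[0][0]

def average_pool(scores):
    # Average pooling for graded (e.g., 1-5) judge scores.
    return mean(scores)

# Hypothetical per-judge outputs for a single test-model answer:
qa_votes = ["correct", "correct", "incorrect"]  # GPT-3.5, Haiku, Command R
arena_scores = [4, 5, 3]

print(max_vote(qa_votes))          # -> "correct"
print(average_pool(arena_scores))  # -> 4
```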
The different settings they evaluated include:
1) Single-hop QA: Models retrieve evidence and generate an answer, which is judged against a gold reference.
2) Multi-hop QA: Models must perform multiple retrieval steps to gather sufficient evidence to answer.
3) Chatbot Arena: Pairwise comparison of outputs from two models on open-ended prompts (see the sketch after this list).
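Since the panel assigns each answer a 1-5 score that is average-pooled, the sketch below shows one plausible way to turn pooled scores for two candidate answers into a pairwise verdict. The decision rule and the example scores are illustrative assumptions, not the paper's exact protocol.

```python
from statistics import mean

def pairwise_winner(scores_a, scores_b):
    # Average-pool each side's per-judge 1-5 scores, then pick the
    # higher; equal pooled scores are reported as a tie. This rule is
    # an illustrative assumption, not the paper's exact protocol.
    pooled_a, pooled_b = mean(scores_a), mean(scores_b)
    if pooled_a == pooled_b:
        return "tie"
    return "model_a" if pooled_a > pooled_b else "model_b"

# Hypothetical panel scores (GPT-3.5, Haiku, Command R) for two answers:
print(pairwise_winner([4, 5, 4], [3, 4, 4]))  # -> "model_a"
```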
Results
Across all three settings, spanning six datasets, PoLL correlated better with human judgments than GPT-4 or any other single-model judge. PoLL also exhibited less intra-model bias because it aggregates judgments across different model families.
The authors found that GPT-4 could be an unreliable judge, with its performance varying significantly based on the exact prompt given. Explicit instructions not to "overthink" improved GPT-4's correlation with humans.
PoLL is also seven to eight times cheaper to run than GPT-4, and it can be faster in practice because the smaller judge models can be queried in parallel, as in the sketch below.
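As a rough illustration of why three calls need not mean three times the latency, the snippet below fans requests out concurrently, so wall-clock time is roughly one judge's round trip rather than the sum over the panel. Here score_with_judge is a hypothetical stand-in for real provider API calls, with simulated latency and a dummy score.

```python
import time
from statistics import mean
from concurrent.futures import ThreadPoolExecutor

JUDGES = ["gpt-3.5", "claude-3-haiku", "command-r"]  # the panel

def score_with_judge(judge, question, answer):
    # Hypothetical stand-in for an API call to one judge model; a real
    # implementation would call the provider's chat endpoint and parse
    # a 1-5 score out of the response.
    time.sleep(0.5)  # simulate network latency
    return 4.0       # dummy score for illustration

def poll_score(question, answer):
    # Query all judges concurrently: total wall-clock time is roughly
    # one round trip (~0.5 s here), not len(JUDGES) round trips.
    with ThreadPoolExecutor(max_workers=len(JUDGES)) as pool:
        scores = list(pool.map(
            lambda judge: score_with_judge(judge, question, answer),
            JUDGES))
    return mean(scores)  # average pooling over the panel

print(poll_score("What is the capital of France?", "Paris"))  # -> 4.0
```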
Conclusion
Using a Panel of LLM Evaluators composed of smaller, diverse models is an effective way to evaluate LLM outputs, reducing bias, cost, and latency compared to relying on a single large judge model. For more information, please consult the full paper.
Congrats to the authors for their work!
Verga, Pat, et al. "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models." ArXiv, 29 Apr. 2024, arxiv.org/abs/2404.18796v1.