55-LLM blind peer evaluation reveals systematic same-family bias in LLM judges
English summary
An open evaluation pitted 55 LLMs from 11 developer families against 198 hand-written prompts; the models then blind-graded each other, yielding 22,254 judgments after self-ratings were excluded. All eight families with sufficient data showed statistically significant same-family rating bias: Qwen judges rated other Qwen models 0.91 points higher, while Mistral judges rated other Mistral models 1.02 points lower, the largest bias in absolute terms. The remaining families fell between xAI (+0.75) and Meta (−0.68). Aggregate leaderboards obscured category-level variation: six different models topped the nine categories, and code tasks provoked the highest judge disagreement. The full dataset, code, and prompts are MIT-licensed, and the author outlines next steps, including anchoring to ground truth and isolating judge bias from response quality.
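To make the same-family bias figure concrete, here is a minimal Python sketch of one way such a number could be computed. It is an assumption, not the author's published method: bias is taken as the mean score a judge family gives to same-family models (self-ratings excluded) minus the mean score it gives to models from other families. The record layout and the function names `same_family_bias` and `bonferroni_significant` are hypothetical, and the released code may well use a different estimator, for example one that controls for response quality.

```python
from collections import defaultdict
from statistics import mean


def same_family_bias(judgments):
    """Estimate per-family same-family rating bias.

    judgments: iterable of (judge, judge_family, target, target_family, score).
    This tuple layout is an assumption, not the dataset's actual schema.
    Returns {family: mean(same-family scores) - mean(other-family scores)}.
    """
    same = defaultdict(list)
    other = defaultdict(list)
    for judge, judge_family, target, target_family, score in judgments:
        if judge == target:
            continue  # self-ratings are excluded, as in the evaluation
        bucket = same if judge_family == target_family else other
        bucket[judge_family].append(score)
    # Only report families that have both kinds of judgments.
    return {
        family: mean(same[family]) - mean(other[family])
        for family in same
        if other.get(family)
    }


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values stay below alpha after Bonferroni correction.

    p_values: {family: p_value}; each p-value is compared with
    alpha / (number of tests).
    """
    threshold = alpha / len(p_values)
    return {family: p < threshold for family, p in p_values.items()}
```

A positive value means a family's judges score their siblings higher than outsiders; a negative value means they score them lower. Because this naive difference does not separate judge bias from genuine quality differences between families, it should be read as an illustration only.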
Key points
All eight developer families with enough data exhibited significant same-family rating bias (p < 0.05, with most results surviving Bonferroni correction). Qwen judges rated Qwen models 0.91 points higher, while Mistral judges rated Mistral models 1.02 points lower.
The experiment involved 55 models, 198 hand-written questions, and 22,254 blind peer judgments. Code, dataset, and prompts are openly released under the MIT license.
Aggregate leaderboards hide category-specific performance: six different models took first place across the nine categories, and code tasks showed nearly double the judge disagreement of meta-alignment tasks, making single-judge code evaluation especially unreliable.
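The category-level disagreement comparison in the key points can be illustrated with a short Python sketch. The source does not say how judge disagreement was measured, so this assumes one plausible metric: the population standard deviation of the scores that different judges gave the same response, averaged within each category. The record layout and the function name `disagreement_by_category` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def disagreement_by_category(judgments):
    """Average per-response score spread for each category.

    judgments: iterable of (category, response_id, score).
    This tuple layout is an assumption, not the dataset's actual schema.
    """
    scores_by_response = defaultdict(list)
    for category, response_id, score in judgments:
        scores_by_response[(category, response_id)].append(score)

    spreads = defaultdict(list)
    for (category, _response_id), scores in scores_by_response.items():
        if len(scores) >= 2:  # spread needs at least two judges
            spreads[category].append(pstdev(scores))

    return {category: mean(values) for category, values in spreads.items()}
```

Under this metric, a category whose value is about twice another's would match the "nearly double" pattern reported for code versus meta-alignment tasks, and it suggests why averaging several judges matters more in high-spread categories.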