55-LLM blind peer evaluation reveals systematic same-family bias in LLM judges
An open evaluation ran 55 LLMs from 11 developer families on 198 hand-written prompts; the models then blind-graded each other's responses across 22,254 judgments, excluding self-ratings. All eight families with sufficient data showed statistically significant same-family rating bias: Qwen judges favored other Qwen models by +0.91 points, while Mistral judges penalized other Mistral models by −1.02 points, the largest absolute bias. Other families ranged from xAI (+0.75) to Meta (−0.68). Aggregate leaderboards obscured category-level variation, with six different models topping nine categories, and code tasks provoked the highest judge disagreement. The full dataset, code, and prompts are MIT-licensed, and the author outlines next steps that include anchoring scores to ground truth and isolating judge bias from response quality.
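To make the bias figures concrete, here is a minimal sketch of one way a same-family bias score could be computed. It is not the author's code. It assumes bias is defined as the mean score a family's judges give to other models in the same family, minus the mean score they give to models from other families. The function name, the `(judge, target, score)` record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def same_family_bias(judgments, family_of):
    """Return {family: mean same-family score - mean other-family score}.

    judgments: iterable of (judge_model, target_model, score) tuples.
    family_of: dict mapping model name -> developer family.
    Both shapes are assumptions made for this illustration.
    """
    same = defaultdict(list)   # scores given to same-family models
    other = defaultdict(list)  # scores given to other-family models
    for judge, target, score in judgments:
        if judge == target:
            continue  # self-ratings are excluded, as in the evaluation
        fam = family_of[judge]
        bucket = same if family_of[target] == fam else other
        bucket[fam].append(score)
    # Only report families that have both kinds of judgments.
    return {
        fam: mean(same[fam]) - mean(other[fam])
        for fam in same
        if other.get(fam)
    }


# Toy data (invented): two families with two models each.
FAMILY_OF = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
JUDGMENTS = [
    ("a1", "a2", 8.0),   # A judging A
    ("a1", "b1", 6.0),   # A judging B
    ("a1", "b2", 7.0),
    ("b1", "b2", 5.0),   # B judging B
    ("b1", "a1", 7.0),   # B judging A
    ("b1", "a2", 6.0),
    ("a1", "a1", 10.0),  # self-rating, skipped
]

bias = same_family_bias(JUDGMENTS, FAMILY_OF)
# bias == {"A": 1.5, "B": -1.5}
```

A raw difference like this mixes judge bias with genuine quality differences between families, which is the confound the author lists as a next step to isolate. A fuller analysis would also need a significance test, which this sketch omits.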