AI intelligence feed

R LOCALLLAMA · Jun 28, 2026 · Highlight

55-LLM blind peer evaluation reveals systematic same-family bias in LLM judges

An open evaluation ran 55 LLMs from 11 developer families on 198 hand-written prompts, then had the models blind-grade each other's responses, yielding 22,254 judgments with self-ratings excluded. All eight families with sufficient data showed statistically significant same-family rating bias. Qwen judges favored other Qwen models by +0.91 points, while Mistral judges penalized other Mistral models by −1.02 points, the largest absolute bias. Other families ranged from xAI (+0.75) to Meta (−0.68). Aggregate leaderboards obscured category-level variation: six different models topped the nine categories, and code tasks provoked the highest judge disagreement. The full dataset, code, and prompts are MIT-licensed, and the author outlines next steps that include anchoring to ground truth and isolating judge bias from response quality.
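As a rough illustration of what a same-family bias figure can mean, the sketch below takes the mean score a judge family gives to other models of its own family and subtracts the mean score it gives to everyone else. This is an assumed, simplified definition, not the author's published method; the function name, the tuple layout, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def same_family_bias(judgments, family_of):
    """Return {family: mean same-family score - mean other-family score}.

    judgments: iterable of (judge_model, target_model, score) tuples.
    family_of: dict mapping model name -> developer family.
    Both shapes are assumptions made for this sketch.
    """
    same = defaultdict(list)
    other = defaultdict(list)
    for judge, target, score in judgments:
        if judge == target:
            continue  # self-ratings are excluded, as in the evaluation
        family = family_of[judge]
        bucket = same if family_of[target] == family else other
        bucket[family].append(score)
    # Report only families that have both kinds of judgment.
    return {
        family: mean(same[family]) - mean(other[family])
        for family in same
        if other.get(family)
    }


# Toy data (hypothetical models and scores, not from the dataset).
family_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
judgments = [
    ("a1", "a2", 8.0),
    ("a1", "b1", 6.0),
    ("a1", "b2", 7.0),
    ("b1", "b2", 5.0),
    ("b1", "a1", 7.0),
    ("a1", "a1", 10.0),  # self-rating, ignored
]
bias = same_family_bias(judgments, family_of)
```

A raw difference like this mixes judge bias with genuine differences in response quality, which is why separating the two is listed among the author's next steps.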

R LOCALLLAMA · Jun 28, 2026 · Highlight

Quantized Gemma 4-31B MTP Draft Rates: Q5_K_S Highest, IQ4_XS/IQ3_M Nearly Identical, IQ2_M Accepts 84.5% at n=1

A community experiment measured MTP (multi-token prediction) speculative decoding acceptance rates for the Gemma 4-31B-it trunk quantized to Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M, each paired with its MTP drafter. Single-token draft acceptance (n=1) fell from 88.5% (Q5_K_S) to 84.5% (IQ2_M); at n=4, it dropped to 66.7% and 61.2%, respectively. IQ4_XS and IQ3_M performed nearly identically across all depths. The greatest speed gains occur with n=2 on CUDA, while Apple Metal benefits only marginally from n=1. The IQ2_M trunk requires about 12 GB of memory, enabling speculative decoding on consumer GPUs.
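To put the acceptance rates in perspective, here is a minimal back-of-the-envelope sketch. It assumes the reported rate is the fraction of drafted tokens accepted at a given draft depth n, and that each verification pass of the trunk also emits one token of its own (a correction or a bonus token), as in standard speculative decoding. Both assumptions, and the function name, are illustrative and are not taken from the experiment.

```python
def tokens_per_verify_step(draft_len: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per trunk verification pass.

    Assumes acceptance_rate = accepted draft tokens / drafted tokens,
    plus one token produced by the trunk itself on every pass.
    """
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    return 1.0 + draft_len * acceptance_rate


# Rates reported in the summary above.
q5_n1 = tokens_per_verify_step(1, 0.885)   # Q5_K_S, n=1
q5_n4 = tokens_per_verify_step(4, 0.667)   # Q5_K_S, n=4
iq2_n1 = tokens_per_verify_step(1, 0.845)  # IQ2_M, n=1
iq2_n4 = tokens_per_verify_step(4, 0.612)  # IQ2_M, n=4
```

This counts tokens only. It ignores the cost of running the drafter and of verifying longer drafts, so a higher figure at n=4 does not imply a faster wall-clock result, which is consistent with the report that the largest speed gains came at n=2.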

AI signal, minus the noise.
