DeepSeek v4 Pro achieves top coding scores: 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench. However, CAISI’s multi-domain evaluation places it roughly 8 months behind the US frontier, in contrast to DeepSeek’s own claim of 2 months. The discrepancy is attributed to the difference between narrow coding benchmarks and the broader demands of cybersecurity and abstract reasoning. The frontier has also moved on, with closed models such as Fable 5 recently released. For local users, quantized versions of the model may deliver different real-world agent performance than the full 1.6T-parameter Pro configuration.
A user attempted to benchmark Google's new Eloquent local dictation app but found that it dropped about half of dictations, returning only a small fraction of spoken words. In 15 of 50 tests, the app provided a complete transcript with a word error rate of ~24%, comparable to Qwen3-ASR's ~21%. However, for the majority of attempts, the output was severely incomplete, with clips of 20+ words often yielding just 5-10 words. The user suspects the underlying chat-style AI model sometimes refuses to transcribe and instead responds with an apology, a behavior observed when running Gemma 3n directly on the same audio. The issue highlights a fundamental usability problem with the chat-based transcription approach.
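The word error rate figures quoted above can be reproduced with a standard word-level edit distance. The following is a minimal sketch, not the user's actual scoring script: the function name `word_error_rate` and the lowercase-and-split-on-whitespace normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization (lowercasing, whitespace split) is an assumption;
    real ASR benchmarks usually also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A transcript that keeps only the first half of the words is scored
# as one deletion per missing word.
full = "the quick brown fox jumps over"
truncated = "the quick brown"
print(word_error_rate(full, truncated))
```

Under this metric a clip of 20 words that comes back as 5 correct words is charged 15 deletions, which is why truncated outputs need to be separated from complete transcripts before an average WER is meaningful.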
A long-time user of local LLMs argues that the LocalLLaMA community routinely overstates how close local models are to frontier closed models. They note that while large open models from DeepSeek, MiniMax, and others exist, the accessible mid-sized models cannot replace Claude or similar systems for serious agentic work. In their view, benchmarks are misleading: real-world coding and multi-step tasks expose a significant gap, with local models requiring excessive steering and corrections. The user asks whether anyone truly believes a local model can replace a frontier model for serious agentic tasks, or whether the community’s enthusiasm is driven mainly by privacy, tinkering, or roleplay.
A Reddit user conducted a summarization benchmark using human-annotated summaries and an LLM judge. Among models in the 30B parameter range, Qwen 3 achieved the highest score, outperforming Gemma 4, which ranked second. The user speculated that newer Qwen versions might be increasingly optimized for agentic tasks, potentially impacting pure summarization performance, though Qwen 3 still led in this real-world annotation evaluation.
The paper 'Predictable Compression Failures' (ICML 2026) addresses hallucinations in evidence-grounded QA by modeling order sensitivity as permutation dispersion and deriving an Expectation-level Decompression Law (EDFL). It defines a fixed ISR=1 answer/abstain gate that requires no threshold tuning, achieving 0.0–0.7% hallucination at ~24% abstention and 80.5% accuracy on held-out tests. The newly released ntkMirror implements this gate for local open-weight models in a training-free manner, using order-marginal verification across multiple evidence permutations. A fused kernel speeds up the forward passes over permutations by 2.6–10× with bit-identical fp32 results. New hallucination detection benchmarks on Qwen2.5 and Gemma models show AUROC up to 0.96 on SciFact, and the gate raises the grounded fraction from 50% to 75–90% at the cost of dropping 10–20% of valid claims.
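The idea of scoring a claim across several evidence orderings can be illustrated with a small sketch. This is not ntkMirror's API and it does not compute the ISR or the EDFL bound: `order_marginal_score`, `toy_scorer`, and the use of a population standard deviation as the dispersion measure are all hypothetical stand-ins chosen for illustration.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[List[str], str], float]


def order_marginal_score(
    evidence: Sequence[str],
    claim: str,
    scorer: Scorer,
    n_permutations: int = 8,
    seed: int = 0,
) -> Tuple[float, float]:
    """Score a claim under several random evidence orderings.

    Returns (mean score, dispersion), where dispersion is the population
    standard deviation of the per-ordering scores (an assumed measure).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = list(evidence)
        rng.shuffle(shuffled)
        scores.append(scorer(shuffled, claim))
    return statistics.fmean(scores), statistics.pstdev(scores)


def toy_scorer(ordered_evidence: List[str], claim: str) -> float:
    """Hypothetical order-sensitive scorer standing in for a model call.

    It rewards a supporting passage more when it appears earlier.
    """
    for rank, passage in enumerate(ordered_evidence):
        if claim.lower() in passage.lower():
            return 1.0 / (rank + 1)
    return 0.0


evidence = [
    "Aspirin reduces fever in adults.",
    "The study enrolled 200 participants.",
    "Results were published in 2019.",
]
mean, dispersion = order_marginal_score(evidence, "aspirin reduces fever", toy_scorer)
print(mean, dispersion)
```

In a real setup the scorer would be a forward pass of the local model over the permuted context, which is the step the fused kernel mentioned above is reported to accelerate. A scorer that ignores ordering would give zero dispersion.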
A Reddit user ran MMLU_PRO and HumanEval benchmarks on Gemma 4 26B IT using MLX on a Mac M5 Pro to compare regular 4-bit, 6-bit, and QAT 8-bit quantizations. The QAT 8-bit model scored 90% on HumanEval, significantly lower than the 98% achieved by the regular 6-bit model, while MMLU_PRO differences were not statistically significant. The results suggest that the unquantized QAT model is inferior to the unquantized original model and that replacing higher-bit quants (like 5‑bit or 6‑bit) with QAT versions may not be justified. The tests had limited sample sizes (50 and 100 questions) and may not generalize to other Gemma 4 sizes such as 31B, 12B, or E2/4B.
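To get a feel for how much uncertainty sample sizes of 50 or 100 questions carry, a Wilson score interval can be computed for each reported accuracy. This is a rough sketch rather than the user's analysis: the summary does not say which benchmark used which sample size, so the counts below (45/50 and 49/50 for the 90% and 98% HumanEval scores) are assumptions, and the helper names are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Assumed counts: 90% and 98% on a 50-question subset.
qat_8bit = wilson_interval(45, 50)
regular_6bit = wilson_interval(49, 50)
print(qat_8bit, regular_6bit, intervals_overlap(qat_8bit, regular_6bit))
```

Overlapping intervals are only a coarse indicator, not a formal test; a paired comparison on the same questions (for example McNemar's test) would be more appropriate if per-question results were available.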