A Reddit user reports that after extensive testing on three low-end laptops (Intel i3, 8GB RAM, integrated GPU), Qwen3-VL-2B in Q4_K_M GGUF quantization reliably extracts data from images to JSON, outperforming Qwen3-VL-4B and Qwen3.5 2B. The user notes this model is absent from major benchmarks like Artificial Analysis and the Open LLM Leaderboard, which list the 4B version instead. The post questions why it is ignored and asks if any other model can handle the task on similarly constrained devices like phones or Raspberry Pis. No quantitative benchmarks or replication details are provided.
A LocalLLaMA community member completed a multi-GPU build using an existing RTX 5090 and a newly acquired RTX PRO 5000, achieving 80GB of total VRAM. The 9950X3D system also includes 192GB RAM and 17TB storage, powered by a 1300W PSU. The user originally planned to buy an RTX PRO 6000 for $8.5K with a hoped-for NVIDIA Inception discount, but after a 3-month wait the application was rejected and the product price surged to $13.5K. They instead purchased the last available RTX PRO 5000 in their country with the saved funds. The rig is now used for large quantized LLMs (Q8) and multi-GPU ComfyUI workflows.
A developer released claude_converter, an open-source tool that converts Claude Code session .jsonl files into the messages format accepted by fine-tuning frameworks like TRL/SFTTrainer, Axolotl, and LLaMA-Factory (ShareGPT format). It includes a clean_messages() helper to strip tool-use blocks and an inspect_session() function for token counts and breakdowns. The tool has zero dependencies and can be installed via `uv pip install claude-converter`. Users are advised to filter sessions to only those where the final assistant turn solved the problem before training.
A local AI enthusiast built a personal voice assistant with premium capabilities including voice verification, wake words, continuous conversation, Home Assistant control, Hermes Agent integration, and deep research. The system runs on a custom server with four modified RTX 4090s (192GB VRAM total), 128GB DDR5 RAM, and a 3000W PSU powered via a 240V/30A dryer line. After testing large models like Qwen 397B, MiniMax M3, Nemotron 3 Ultra, and GLM 4.7/5.2, the user found that Google's Gemma 4 31B QAT outperforms them all and is significantly faster for its size. The assistant is deployed across the house using conference speaker-mics, with heat managed by a laundry room exhaust fan.
A community experiment measured MTP speculative decoding acceptance rates for Gemma 4-31B-it trunk quantized to Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M, paired with its MTP drafter. Single-token draft acceptance (n=1) fell from 88.5% (Q5_K_S) to 84.5% (IQ2_M); at n=4, it dropped to 66.7% and 61.2% respectively. IQ4_XS and IQ3_M performed nearly identically across all depths. The greatest speed gains occur with n=2 on CUDA, while Apple Metal benefits only marginally from n=1. The IQ2_M trunk requires about 12 GB memory, enabling speculative decoding on consumer GPUs.
A user tested Ornith 35B by asking it to create a quick 3D game via the Claude Code harness, and the model succeeded after three prompts. In the same test, Qwen3.5-35b-A3B failed to produce the game even after multiple prompts. The report is an anecdotal coding comparison with no systematic evaluation or metrics provided. It suggests Ornith 35B performed better on this specific task.