Google DeepMind’s DiffusionGemma 26B A4B IT is an open-weights multimodal model that uses discrete diffusion to generate text from text, image, and video inputs. It is a mixture-of-experts (MoE) model with 25.2B total parameters, of which 3.8B are active, supports a 256K context window, and reaches over 1,100 tokens per second on NVIDIA H100 GPUs. NVIDIA has quantized the model to NVFP4 precision using its Model Optimizer and made it available on Hugging Face for commercial and non-commercial use. The model also features a configurable thinking mode, native function calling, and multilingual support across 35+ languages.
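To put the parameter counts in perspective, here is a rough back-of-the-envelope sketch in Python. It is not taken from the model card. It assumes that every weight is stored as a 4-bit value with one 8-bit scale shared per block of 16 values (the usual description of NVFP4), and it ignores activations, KV cache, and any layers left in higher precision. The helper names are illustrative.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS = 25.2e9    # total parameters
ACTIVE_PARAMS = 3.8e9    # parameters active per token (MoE)

# Assumption: 4-bit values plus one 8-bit scale per 16-value block.
# Real checkpoints often keep some layers in higher precision, so this
# is an estimate of weight storage only, not a measured footprint.
BITS_PER_VALUE = 4
SCALE_BITS = 8
BLOCK_SIZE = 16


def effective_bits_per_weight() -> float:
    """Average storage per weight, including the shared block scale."""
    return BITS_PER_VALUE + SCALE_BITS / BLOCK_SIZE


def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
nvfp4_gb = weight_gigabytes(TOTAL_PARAMS, effective_bits_per_weight())
bf16_gb = weight_gigabytes(TOTAL_PARAMS, 16)

print(f"active fraction: {active_fraction:.1%}")
print(f"approx. NVFP4 weights: {nvfp4_gb:.1f} GB vs. 16-bit: {bf16_gb:.1f} GB")
```

Under these assumptions, roughly 15% of the weights are active per token, and the quantized weights take a little over a quarter of the space of a 16-bit copy.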
A Reddit user reports using Qwen3-VL 8B via Ollama for OCR of handwritten letters, achieving decent results. They ask the LocalLLaMA community for other local models that might perform better for handwriting OCR.
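For readers who want to try the same setup, the following is a minimal sketch of sending an image to a vision model through Ollama's local REST API. The model tag `qwen3-vl:8b`, the prompt wording, and the file name are assumptions rather than details from the post. Check `ollama list` for the tag installed on your machine.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen3-vl:8b"  # assumed tag; verify with `ollama list`


def build_ocr_request(image_bytes: bytes, model: str = MODEL_TAG) -> dict:
    """Build a non-streaming /api/generate payload with one base64 image."""
    return {
        "model": model,
        "prompt": "Transcribe the handwritten text in this image exactly as written.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def transcribe(image_path: str) -> str:
    """Send the image to a locally running Ollama server and return its text."""
    with open(image_path, "rb") as f:
        payload = build_ocr_request(f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # "letter.jpg" is a hypothetical file name.
    print(transcribe("letter.jpg"))
```

Keeping the payload builder separate from the network call makes it easy to swap in another model tag when comparing candidates on the same set of letters.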
Lemonade v10.7 introduces local omni-modal chat supporting image generation and editing by combining multiple backends and models; its LMX-Omni virtual models are now compatible with Open WebUI and other OpenAI-compatible clients. The release adds a lemonade bench CLI tool to collect standardized LLM performance data across llama.cpp, FastFlowLM, and vLLM. Cross-vendor support expands with CUDA backends for llama.cpp and stable-diffusion.cpp (sd-cpp) and a Vulkan backend for sd-cpp, enabling GPU acceleration on AMD, Apple Silicon, NVIDIA, and Intel systems. The project is now organized into six working groups, four of them led by non-AMD contributors, and this release involved 19 contributors.
SCAIL-2 is an open-source model for end-to-end controlled character animation that removes dependence on intermediate pose representations. It was trained on 60K synthetic motion pairs using several teacher models (SCAIL-Preview, Wan-Animate, MoCha) and a Unified Motion Transfer Interface. The model enables animating a reference character from a driving video, supports cross-identity character replacement and multi-character scenarios, and extends to animal-driving. Additionally, it offers zero-shot support for advanced control intermediates like SAM3D-Body mesh rendering.
The founder of Omi Health released Omi Med STT v1, an open-weight (CC-BY-4.0) fine-tune of NVIDIA Parakeet TDT 0.6B v2 specialized for medical speech, with a local runtime that auto-selects backends (MLX on Apple Silicon, NeMo on CUDA, GGUF on CPU). On a held-out benchmark of 1,513 medical clips (7.18 hours), it achieves a medical word error rate (M-WER) of 2.37% and an overall WER of 8.30% while running at 145× realtime on an A10, significantly outperforming the base model and most open local ASR options. Among local models it trails only VibeVoice-ASR 9B on M-WER, while beating it on WER and speed. It comes close to cloud-based medical transcription services such as ElevenLabs Scribe v2 (M-WER 1.39%) and AssemblyAI (1.81%), with the structural latency advantage of on-device processing. Training used 127 hours of audio (71% real, 29% synthetic), and the benchmark was confirmed to have zero overlap with the training data. The key weakness is drug-name accuracy (4.75% drug WER), which is targeted for improvement in v2.
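For reference, standard word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that textbook definition in Python. The summary does not say how M-WER or drug WER are scored, so this is plain WER only, with simplistic lowercase-and-split normalization as an assumption. The second helper turns a realtime factor into wall-clock processing time.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def processing_seconds(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given realtime factor."""
    return audio_hours * 3600 / realtime_factor


if __name__ == "__main__":
    # Made-up example sentence, not taken from the benchmark.
    print(word_error_rate("take 5 mg of lisinopril", "take 5 mg of listen april"))
    print(f"{processing_seconds(7.18, 145):.0f} s for the 7.18-hour benchmark")
```

In the made-up example, the misheard drug name costs two edits against five reference words, which shows how a single drug-name error can dominate the score of a short clip. At 145× realtime, the full 7.18-hour benchmark would take about three minutes of compute.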