Tutorials | Source: MEDIUM ARTIFICIAL INTELLIGENCE | Importance: 2/5
The article describes a perception-and-memory stack for edge devices that operates at microwatt power levels, emphasizing privacy and reversible computation. It is pitched as an alternative approach to machine vision that runs entirely on-device, for scenarios where cloud connectivity is unavailable or undesirable. The brief Medium teaser does not disclose specific hardware, benchmarks, or deployment details, which suggests the full piece is a tutorial or opinion piece.
This Towards Data Science tutorial discusses using vision language models to parse charts, diagrams, and other visual elements from PDF documents. It shows how these models extend beyond text-only parsing, allowing retrieval-augmented generation (RAG) systems to incorporate image-based information. The post focuses on practical integration of visual context into enterprise document intelligence workflows.
The tutorial shows how to parse PDFs locally using the Docling tool, preserving table cells, OCR text, captions, and headings. The output matches cloud-grade document structure without any cloud upload, API keys, or per-page billing. This approach enables privacy-preserving document intelligence for RAG pipelines by converting PDFs into richly structured data ready for ingestion.
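The local conversion step might look roughly like the sketch below. It assumes the `docling` package is installed and uses its `DocumentConverter` class. The `split_by_heading` helper and the `report.pdf` file name are hypothetical illustrations, not taken from the tutorial.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown locally with Docling (no cloud upload)."""
    # Imported lazily so the helper below still works without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(Path(pdf_path))
    return result.document.export_to_markdown()


def split_by_heading(markdown: str) -> list[dict]:
    """Hypothetical helper: group Markdown lines into heading-scoped chunks."""
    chunks = []
    current = {"heading": "", "text": []}
    for line in markdown.splitlines():
        if line.startswith("#"):
            # A new heading closes the previous chunk, if it holds anything.
            if current["heading"] or current["text"]:
                chunks.append(current)
            current = {"heading": line.lstrip("#").strip(), "text": []}
        elif line.strip():
            current["text"].append(line)
    if current["heading"] or current["text"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["text"])} for c in chunks]
```

Calling `split_by_heading(pdf_to_markdown("report.pdf"))` would yield heading-scoped chunks that a RAG pipeline could embed. Splitting on headings is only one possible chunking choice, and this naive version would also split on `#` lines inside code blocks.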
Tutorials | Source: SIMON WILLISON | Importance: 2/5
Simon Willison's browser-based audio conversation tool, originally built in December 2024 to test the OpenAI WebRTC realtime audio API, has been updated. It now supports the GPT‑Realtime‑2 model, which OpenAI promotes as its first voice model with GPT‑5-class reasoning and a knowledge cutoff of September 30, 2024. A new feature lets users paste in document context, enabling interactive voice Q&A about the provided content. The update makes the newer model available for experimentation before it has appeared in the ChatGPT iPhone app.
Zyphra has released Zamba2-VL, a family of open vision-language models in three sizes: 1.2B, 2.7B, and 7B parameters. Each model uses a hybrid backbone that combines a Mamba2 state-space model with a small number of shared transformer blocks, replacing dense attention to achieve near-linear inference scaling. The models pair a Qwen2.5-VL vision encoder with this backbone, supporting single- and multi-image understanding and grounding. Across 14 benchmarks, Zamba2-VL shows strong visual counting and document understanding (e.g., 90.9 on DocVQA for the 2.7B model) but lags larger baselines on knowledge-heavy reasoning benchmarks such as MMMU and MathVista. Its main advantage is an order-of-magnitude lower time-to-first-token than comparable Transformer VLMs, which is particularly beneficial for long multimodal inputs and on-device deployment. Weights are released under the Apache 2.0 license on Hugging Face, with inference code available.
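To illustrate why a state-space backbone scales close to linearly, the toy sketch below contrasts a scalar linear recurrence with the number of query-key pairs that causal dense attention scores. It is a generic illustration, not Mamba2 or Zyphra's code. The function names and the parameters `a`, `b`, and `c` are assumptions chosen for the example.

```python
def linear_ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    ys = []
    for x in xs:
        # One constant-cost state update per token, so total work is O(n).
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pair_count(n):
    """Query-key pairs that causal dense attention scores for n tokens."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 10_000):
        steps = len(linear_ssm_scan([1.0] * n))
        print(n, steps, attention_pair_count(n))
```

The recurrence carries a fixed-size state from token to token, while the attention pair count grows quadratically with sequence length. That difference is the intuition behind the lower prefill cost on long inputs.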
Tutorials | Source: MEDIUM LARGE LANGUAGE MODELS | Importance: 2/5
GELATO investigates extending a strong pre-trained text embedding model to handle multimodal data rather than training a new model from scratch. The text encoder remains frozen (the 'text tower') while separate modality-specific encoders are trained to align images, audio, or other modalities into the same embedding space. This 'frozen towers' strategy leverages existing text understanding and avoids retraining the core model. The blog post outlines the method and its motivation for efficient multimodal representation learning.
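One common way to align a new encoder to a frozen embedding space is a contrastive (InfoNCE-style) loss. The summary does not say which objective GELATO uses, so the sketch below is an assumption. It is written in pure Python for readability, and the `info_nce` function and `temperature` value are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(modality_embs, text_embs, temperature=0.07):
    """Mean cross-entropy of matching each modality embedding to its paired text.

    text_embs stand in for outputs of the frozen text tower (constants);
    in training, only the modality encoder producing modality_embs would
    receive gradients. Assumed objective, not confirmed by the source.
    """
    m = [l2_normalize(v) for v in modality_embs]
    t = [l2_normalize(v) for v in text_embs]
    loss = 0.0
    for i, mi in enumerate(m):
        logits = [sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over all candidate texts.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        loss += log_denominator - logits[i]
    return loss / len(m)
```

The loss is near zero when each image or audio embedding sits closest to its own paired text embedding, and it grows when pairs are mismatched. Because the text side never changes, existing text indexes would stay valid under this scheme.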