Thinkgap feed

AI signal, minus the noise.

Curated items are read from the processed items table and served as a bilingual feed.

8 items

TOWARDSDATASCIENCE · Jun 16, 2026

RAG Question Parsing: Splitting User Input into Retrieval and Generation Briefs Before Pipeline Execution

A Towards Data Science tutorial by Angela Shi argues that user questions in RAG systems deserve the same careful parsing as documents. The technique splits a raw question into a 'retrieval brief' that specifies what to find and a 'generation brief' that defines how to use the retrieved context. This pre-processing step decouples searching from answer formation, improving both retrieval precision and answer quality. The approach is illustrated for enterprise document intelligence use cases.
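The split described above can be pictured with a small sketch. This is not the tutorial's implementation (the summary does not show one, and a real system would likely use an LLM for this step); it is a rule-based stand-in in which the class names `RetrievalBrief` and `GenerationBrief`, the stopword list, and the year filter are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class RetrievalBrief:
    """What to find: search keywords plus optional metadata filters."""
    keywords: list[str]
    filters: dict[str, str] = field(default_factory=dict)


@dataclass
class GenerationBrief:
    """How to use the retrieved context when forming the answer."""
    task: str
    output_format: str


# Hypothetical word lists, just large enough for the demo question.
STOPWORDS = {"the", "a", "an", "of", "in", "for", "as", "and", "to"}
FORMAT_WORDS = {"table", "bullet", "bullets", "list", "summarize", "summary"}


def parse_question(question: str) -> tuple[RetrievalBrief, GenerationBrief]:
    text = question.lower()

    # Generation side: what shape should the answer take?
    if "table" in text:
        output_format = "table"
    elif "bullet" in text or "list" in text:
        output_format = "bullets"
    else:
        output_format = "paragraph"
    task = "summarize" if "summar" in text else "answer"

    # Retrieval side: a year filter and the remaining content words.
    filters: dict[str, str] = {}
    year = re.search(r"\b(19|20)\d{2}\b", text)
    if year:
        filters["year"] = year.group(0)
    tokens = re.findall(r"[a-z]+", text)
    keywords = [t for t in tokens if t not in STOPWORDS and t not in FORMAT_WORDS]

    return RetrievalBrief(keywords, filters), GenerationBrief(task, output_format)


retrieval, generation = parse_question(
    "Summarize the termination clauses in the 2023 supplier contracts as a table"
)
print(retrieval)
print(generation)
```

The point of the sketch is the separation: the retriever only ever sees `retrieval`, so formatting words such as "table" never pollute the search, while the generator receives `generation` alongside whatever context comes back.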

TOWARDSDATASCIENCE · Jun 14, 2026

Vision LLMs are PDF Parsers Too: Reading Charts and Diagrams for RAG

This Towards Data Science tutorial discusses using vision language models to parse charts, diagrams, and other visual elements from PDF documents. It shows how these models extend beyond text-only parsing, allowing retrieval-augmented generation (RAG) systems to incorporate image-based information. The post focuses on practical integration of visual context into enterprise document intelligence workflows.

TOWARDSDATASCIENCE · Jun 13, 2026

After Benchmarking 100K-Row Aggregation Tasks, Author Builds Deterministic Engine to Replace RAG for Computation Queries

In this blog post, the author benchmarks retrieval-augmented generation (RAG) pipelines against a deterministic full-scan engine on aggregation tasks over 100,000 rows. The results show that larger context windows do not improve accuracy; they make errors harder to detect. The author concludes that computation-heavy queries must be routed away from RAG entirely and builds a system that sends them to the deterministic engine instead.
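A minimal sketch of the routing idea, assuming a keyword-based classifier and an in-memory list of rows. The trigger words, function names, and sample data are hypothetical and are not taken from the post; the author's actual router and engine are not described in this summary.

```python
import re
from statistics import mean

# Hypothetical trigger words; a production router would need a more careful classifier.
AGGREGATION_PATTERN = re.compile(
    r"\b(sum|total|average|mean|count|how many|max|min)\b", re.IGNORECASE
)


def route(question: str) -> str:
    """Send computation-style questions to the full-scan engine, the rest to RAG."""
    return "full_scan" if AGGREGATION_PATTERN.search(question) else "rag"


def full_scan_aggregate(rows, column, op):
    """Deterministic aggregation: reads every row, with no sampling and no LLM."""
    values = [row[column] for row in rows]
    ops = {"sum": sum, "mean": mean, "count": len, "max": max, "min": min}
    return ops[op](values)


# Synthetic 100,000-row dataset for the demo.
rows = [{"region": "EU", "amount": i % 100} for i in range(100_000)]

total = full_scan_aggregate(rows, "amount", "sum")
print(route("What is the total amount across all orders?"), total)
print(route("What does the refund policy say?"))
```

Because the aggregate is computed over every row, the result is the same on every run, which is the property the benchmark found missing when the rows were passed through a context window.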

TOWARDSDATASCIENCE · Jun 13, 2026

Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload

The tutorial shows how to parse PDFs locally using the Docling tool, preserving table cells, OCR text, captions, and headings. The output matches the document structure of cloud parsing services, with no cloud upload, API keys, or per-page billing. This approach enables privacy-preserving document intelligence for RAG pipelines by converting PDFs into richly structured data ready for ingestion.

TOWARDSDATASCIENCE · Jun 12, 2026

When PyMuPDF Can’t See the Table: Parse PDFs for RAG with Azure Layout

This tutorial from the Enterprise Document Intelligence series shows how Azure Document Intelligence’s layout model extracts relational tables from PDFs where PyMuPDF falls short. The Azure approach preserves native table cells and works on scanned pages via integrated OCR. It also retrieves captions and headings without relying on regular expressions. The post presents it as a better parsing step than PyMuPDF for retrieval-augmented generation (RAG) pipelines.

TOWARDSDATASCIENCE · Jun 11, 2026

Tutorial: Parse PDFs Into Relational DataFrames (Lines, Pages, TOC, Images) for RAG

This Towards Data Science tutorial presents a PDF parsing method that outputs relational DataFrames instead of flat text. It extracts structured elements including lines, pages, table of contents, images, cross-references, captions, spans, and a parsing summary. The relational shape is designed to improve retrieval-augmented generation (RAG) workflows by preserving document structure. The post is part of the 'Enterprise Document Intelligence' series.
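To make the "relational shape" concrete, here is a small sketch in which each table is a list of records linked by a shared key. The column names (`page_id`, `line_id`, `toc_id`) and the sample rows are assumptions for illustration, not the tutorial's schema, and plain Python records stand in for DataFrames to keep the example dependency-free.

```python
# Each "table" is a list of records; page_id is the shared key (hypothetical schema).
pages = [
    {"page_id": 1, "width": 595, "height": 842},
    {"page_id": 2, "width": 595, "height": 842},
]
toc = [
    {"toc_id": 1, "title": "Introduction", "page_id": 1},
    {"toc_id": 2, "title": "Results", "page_id": 2},
]
lines = [
    {"line_id": 1, "page_id": 1, "text": "This report covers Q1."},
    {"line_id": 2, "page_id": 2, "text": "Revenue grew 4%."},
    {"line_id": 3, "page_id": 2, "text": "Figure 1: Revenue by region."},
]


def lines_with_section(lines, toc):
    """Join each line to the TOC entry that starts on the same page."""
    section_by_page = {entry["page_id"]: entry["title"] for entry in toc}
    return [
        {**line, "section": section_by_page.get(line["page_id"])}
        for line in lines
    ]


chunks = lines_with_section(lines, toc)
for chunk in chunks:
    print(chunk["section"], "|", chunk["text"])
```

With flat text, the link between a line and its section is lost at extraction time; keeping the key lets a retriever attach the section title to each chunk. In pandas, the same join would typically be a `merge` on `page_id`.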