Glossary

Chunking (RAG)

Chunking is the process of splitting documents into smaller, meaningful segments before storing them in a vector database. The size and method of chunking directly affect the quality of retrieval in a RAG system, since the AI model can only work with the chunks it receives.

How It Works

You can't feed a 200-page document to a language model and expect it to find the one paragraph that answers a question. Chunking solves this by breaking documents into smaller pieces that can be individually indexed, searched, and retrieved.

The simplest approach is fixed-size chunking: split every document into 500-token pieces. This is fast and easy but often cuts sentences in half or splits related information across two chunks. Better approaches are aware of document structure. They split on paragraph boundaries, section headers, or semantic shifts in topic.

Chunk size matters more than most teams realize. Too small, and each chunk lacks enough context to be useful. Too large, and you retrieve a lot of irrelevant text along with the relevant part. Most production systems use chunks between 256 and 1024 tokens, with overlap of 10-20% between consecutive chunks so that sentences at the boundaries aren't lost.
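As a concrete illustration, here is a minimal sketch of fixed-size chunking with overlap, assuming tiktoken for token counting. The sizes match the ranges above but are starting points, not tuned values:

```python
# Minimal sketch of fixed-size token chunking with overlap.
# Assumes tiktoken is installed; counts use the cl100k_base encoding.
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_size-token pieces, with `overlap` tokens shared
    between consecutive chunks so boundary sentences appear in both."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap  # must be positive
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```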

Advanced chunking strategies include hierarchical chunking (small chunks for retrieval, larger parent chunks for context passed to the LLM), semantic chunking (using an embedding model to detect where topics shift), and document-aware chunking (using headings, tables, and formatting to identify natural boundaries). Agentic chunking goes further: an LLM reads the document and proposes boundaries based on meaning, at higher ingestion cost but better downstream recall.
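To make the semantic-chunking idea concrete, here is a minimal sketch, assuming a pre-split sentence list and an `embed` function supplied by whatever embedding model you use: start a new chunk wherever the cosine similarity between adjacent sentences drops below a threshold.

```python
# Sketch of semantic chunking: split where the topic shifts, detected as a
# drop in cosine similarity between adjacent sentence embeddings.
# `embed` is a placeholder for your embedding model.
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    if not sentences:
        return []
    vecs = np.array([embed(s) for s in sentences], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize for cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vecs[i] @ vecs[i - 1]) < threshold:  # topic shift: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```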

The right strategy depends on the document type. Plain prose chunks cleanly at paragraph boundaries. Technical manuals with tables, code blocks, and figures need structure-aware parsing or those elements get mangled. Chat transcripts chunk best by speaker turn, not token count. One-size-fits-all chunking is how RAG systems end up with poor retrieval on a subset of documents nobody notices until a customer complains.

Getting chunking right is one of the highest-impact optimizations in a RAG pipeline. Before tuning your embedding model or retrieval algorithm, make sure your chunks are the right size and preserve the meaning of the original documents. When retrieval quality is bad, chunking is the first thing to look at.

In Practice

Most teams use LangChain's RecursiveCharacterTextSplitter or LlamaIndex's node parsers for chunking. For complex PDFs with tables and figures, Unstructured.io and LlamaParse handle the parsing step before chunking. Semantic chunking is available in both LlamaIndex (SemanticSplitterNodeParser) and LangChain (SemanticChunker), typically using the same embedding model that powers retrieval.
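A typical starting configuration with LangChain's token-aware splitter might look like this (assuming the `langchain-text-splitters` package; `document_text` is a placeholder for your parsed text):

```python
# Token-based recursive splitting with LangChain; a common starting point,
# not a tuned configuration.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used for length counting
    chunk_size=512,               # target tokens per chunk
    chunk_overlap=50,             # tokens shared between consecutive chunks
)
chunks = splitter.split_text(document_text)
```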

Typical configuration: 512-token chunks with 50-token overlap for general knowledge bases, 256-token chunks for dense technical content, and 1024-token chunks for narrative documents like legal contracts. Hierarchical setups often use 128-token child chunks for retrieval and 2048-token parent chunks for context. Metadata attached to each chunk includes the source document ID, section heading, page number, and ingestion timestamp, all filterable at query time.
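An illustrative chunk record carrying that metadata might look like the following; the field names are hypothetical, since the exact schema depends on your vector store:

```python
# Illustrative chunk record; field names are examples, not a required schema.
chunk_record = {
    "id": "doc-8821#chunk-014",
    "text": "...",  # the chunk content
    "metadata": {
        "source_doc_id": "doc-8821",
        "section_heading": "Renewal Options",
        "page_number": 37,
        "ingested_at": "2024-11-02T14:03:00Z",
    },
}
```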

The ingestion workflow looks like this. A document lands in S3 or blob storage. A parser extracts text and preserves structural cues (headings, tables, lists). A chunker splits the text using the configured strategy. Each chunk goes through an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or BGE-large) and gets upserted to Pinecone, Weaviate, Qdrant, or pgvector with its metadata. Re-ingestion on document updates typically uses content hashing to skip unchanged chunks and save embedding costs.
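A sketch of that loop, with `parse`, `chunk`, `embed`, and the `index` client as placeholders for your own parser, chunker, embedding model, and vector store; the content-hash check is what skips unchanged chunks on re-ingestion:

```python
# Sketch of the ingestion loop with content hashing to avoid re-embedding
# unchanged chunks. All callables are placeholders for your components.
import hashlib

def ingest(doc_bytes: bytes, doc_id: str, parse, chunk, embed, index, seen_hashes: set):
    text = parse(doc_bytes)                  # extract text + structural cues
    for i, piece in enumerate(chunk(text)):  # configured chunking strategy
        digest = hashlib.sha256(piece.encode("utf-8")).hexdigest()
        if digest in seen_hashes:            # unchanged since the last run
            continue                         # skip embedding to save cost
        seen_hashes.add(digest)
        index.upsert(                        # placeholder vector-store client
            id=f"{doc_id}#chunk-{i:04d}",
            vector=embed(piece),
            metadata={"doc_id": doc_id, "content_hash": digest},
        )
```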

Worked Example

A commercial real estate firm builds a RAG system over 4,800 lease agreements, each 40-120 pages long. The first attempt uses fixed 500-token chunks with no overlap. Retrieval quality is mediocre: questions about renewal options often return the wrong clauses because lease documents reuse similar language across sections, and key tables get split across chunks.

The team switches to structure-aware chunking via LlamaParse. Each section of the lease (Term, Rent, Use, Maintenance, Insurance, Renewal) becomes a parent node. Sub-clauses under each section become 256-token child chunks with 40-token overlap. Tables are extracted as Markdown and kept intact inside their own chunks with the table's caption as metadata. Each chunk is tagged with the tenant ID, property ID, and section name.

At query time, a question like "When does tenant X's lease auto-renew?" is embedded and used to retrieve the top five child chunks, filtered to section=Renewal and tenant_id=X. The hierarchical retriever expands each child to its parent chunk before handing context to Claude Sonnet. Recall on a curated eval set of 120 questions jumps from 0.71 to 0.93. Average context tokens per query drops from 4,200 to 2,800 because the filtered retrieval is more precise, cutting per-query cost by roughly a third.
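A sketch of what that query path might look like, with `embed`, `index`, `parent_store`, and `llm` as hypothetical stand-ins for the firm's actual components:

```python
# Sketch of filtered retrieval plus parent expansion ("small-to-big").
# `index.query` and the hit format are illustrative, not a specific client API.
def answer(question: str, tenant_id: str, embed, index, parent_store, llm) -> str:
    hits = index.query(
        vector=embed(question),
        top_k=5,
        filter={"section": "Renewal", "tenant_id": tenant_id},
    )
    # Retrieve with precise child chunks, but hand larger parents to the LLM.
    parent_ids = {h["metadata"]["parent_id"] for h in hits}
    context = "\n\n".join(parent_store[p] for p in parent_ids)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```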

What People Get Wrong

Myth

Smaller chunks always give better retrieval.

Reality

Smaller chunks give more precise matches but often lose the context needed to actually answer the question. A 64-token chunk might match a query perfectly and still be useless because it's missing the surrounding paragraph. Production systems often use small chunks for retrieval and pass larger parent chunks to the LLM (hierarchical retrieval). Optimizing chunk size is a tradeoff, not a monotonic better-is-smaller rule.

Myth

Chunk overlap doesn't matter much.

Reality

It matters a lot for boundary cases. Without overlap, a sentence that spans a chunk boundary gets split across two chunks, and neither chunk may retrieve well for a query about that sentence. 10-20% overlap is standard. Zero overlap is fine only when you chunk on natural boundaries (paragraph, section header) that won't split meaning.

Myth

Semantic chunking is always better than recursive text splitting.

Reality

Semantic chunking has higher ingestion cost (an embedding call per candidate boundary) and sometimes produces chunks that are the wrong size for your retrieval and LLM budget. For clean, well-structured documents, recursive splitting on paragraphs and section headers often matches or beats semantic chunking at a fraction of the cost. Try the simpler approach first and measure before paying for the sophisticated one.

Related Solutions

Multimodal RAG Systems
AI Knowledge Base

Need help implementing this?

We build production AI systems for enterprises. Tell us what you are working on and we will scope it in 30 minutes.