Glossary

Chunking (RAG)

Chunking is the process of splitting documents into smaller, meaningful segments before storing them in a vector database. The size and method of chunking directly affect the quality of retrieval in a RAG system, since the AI model can only work with the chunks it receives.

How It Works

You cannot feed a 200-page document to a language model and expect it to find the one paragraph that answers a question. Chunking solves this by breaking documents into smaller pieces that can be individually indexed, searched, and retrieved.

The simplest approach is fixed-size chunking: split every document into 500-token pieces. This is fast and easy but often cuts sentences in half or splits related information across two chunks. Better approaches are structure-aware: they split on paragraph boundaries, section headers, or semantic shifts in topic.
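As a minimal sketch of fixed-size chunking, the function below splits a document into pieces of roughly N tokens. For simplicity it approximates a "token" as a whitespace-separated word; a production system would count tokens with the embedding model's own tokenizer instead.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of up to `chunk_size` tokens.

    A "token" here is approximated by a whitespace-separated word.
    Note the failure mode described above: chunk boundaries fall
    wherever the count runs out, regardless of sentence structure.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Because the split points are purely positional, a sentence that straddles a 500-word boundary ends up half in one chunk and half in the next, which is exactly why structure-aware methods exist.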

Chunk size matters more than most teams realize. Too small, and each chunk lacks enough context to be useful. Too large, and you retrieve a lot of irrelevant text along with the relevant part. Most production systems use chunks between 256 and 1024 tokens, with overlap between consecutive chunks so that sentences at the boundaries are not lost.
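The overlap idea above can be sketched as a sliding window: each chunk repeats the last few tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. The sizes below (512 tokens with 64 of overlap) are illustrative defaults, not recommendations, and "tokens" are again approximated by words.

```python
def overlapping_chunks(text: str, chunk_size: int = 512,
                       overlap: int = 64) -> list[str]:
    """Split text into chunks of up to `chunk_size` tokens, where each
    chunk shares its first `overlap` tokens with the end of the
    previous chunk, so boundary sentences are not lost."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # the final window already reached the end
    return chunks
```

The trade-off is storage and retrieval redundancy: with 64 tokens of overlap on 512-token chunks, roughly 12% of the corpus is indexed twice.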

Advanced chunking strategies include hierarchical chunking (small chunks for retrieval, larger parent chunks for context), semantic chunking (using an embedding model to detect where topics shift), and document-aware chunking (using headings, tables, and formatting to identify natural boundaries).
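Of the strategies above, document-aware chunking is the easiest to sketch. The function below splits a Markdown document at heading lines so that each chunk covers one section. It is an illustrative sketch, not a full parser: it does not handle `#` lines inside code fences, tables, or other edge cases a real pipeline would need to consider.

```python
import re

def heading_chunks(markdown_text: str) -> list[str]:
    """Document-aware splitting: start a new chunk at each Markdown
    heading (a line beginning with 1-6 '#' characters), so chunk
    boundaries follow the document's own structure."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown_text.splitlines():
        # A heading starts a new chunk, unless we are at the very top.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

Each chunk keeps its heading, which doubles as useful context for both the embedding model and the language model at answer time.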

Getting chunking right is one of the highest-leverage optimizations in a RAG pipeline. Before tuning your embedding model or retrieval algorithm, make sure your chunks are the right size and preserve the meaning of the original documents.

Related Solutions

Multimodal RAG Systems
AI Knowledge Base

Need help implementing this?

We build production AI systems for enterprises. Tell us what you are working on and we will scope it in 30 minutes.