Glossary

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a technique where an AI model first retrieves relevant documents from a knowledge base, then uses those documents as context to generate an accurate response. It reduces hallucination by grounding answers in real data.

How It Works

Large language models know a lot, but they don't know your company's internal data. They also make things up when they're unsure. RAG solves both problems by adding a retrieval step before generation.

Here's how it works. A user asks a question. The system converts that question into a vector embedding and searches a vector database for the most relevant documents. Those documents get passed to the LLM as context, along with the original question. The LLM then generates an answer based on the retrieved content instead of relying only on its training data.
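
A minimal sketch of that loop, assuming the OpenAI Python SDK for embeddings and generation and a hypothetical vector_db object standing in for whatever vector store you use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, vector_db, top_k: int = 5) -> str:
    # 1. Embed the user's question.
    q_vec = client.embeddings.create(
        model="text-embedding-3-large", input=question
    ).data[0].embedding

    # 2. Retrieve the most similar chunks (vector_db.search is a stand-in
    #    for your vector store's query call).
    chunks = vector_db.search(vector=q_vec, top_k=top_k)

    # 3. Generate an answer grounded in the retrieved chunks.
    context = "\n\n".join(c["text"] for c in chunks)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```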

The result is answers grounded in your actual documents, policies, product specs, or whatever you have indexed. This matters for enterprise use cases where accuracy is non-negotiable. A support agent pulling from your knowledge base needs to cite real policies, not generate plausible-sounding ones.

RAG architectures vary in complexity. A basic setup has an embedding model, a vector database, and an LLM. More advanced systems add re-ranking (scoring retrieved documents for relevance with a cross-encoder), chunking strategies (splitting documents into the right-sized pieces), hybrid search (combining BM25 keyword scoring with semantic similarity), and query rewriting (expanding or reformulating the user's question before retrieval).
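
To make the hybrid-search idea concrete, here is a small reciprocal rank fusion (RRF) sketch that merges a BM25 keyword ranking with a semantic ranking; both inputs are assumed to be lists of document IDs ordered best-first by your keyword index and vector store:

```python
def reciprocal_rank_fusion(keyword_ranked, semantic_ranked, k=60):
    # Each document's fused score is the sum of 1 / (k + rank) across both
    # rankings; k=60 is the damping constant from the original RRF paper.
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: "policy-7" ranks high in both lists, so it wins the fused ranking.
print(reciprocal_rank_fusion(
    ["policy-7", "memo-2", "spec-9"],
    ["spec-1", "policy-7", "memo-2"],
))
```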

Compared to fine-tuning, RAG is faster to set up and easier to update. When your data changes, you re-index the documents. No model retraining needed. The tradeoff: RAG only helps when the answer exists in your corpus. If the model's reasoning style or output format is wrong, fine-tuning is the right fix. If your corpus is small or your users ask broad questions, retrieval quality becomes the bottleneck and no amount of prompt tuning will save you.

RAG also isn't always the cheapest option. Each query runs an embedding call, a vector search, optional re-ranking, and a large context prompt. At high volume, that stacks up. Teams that measure cost per query often cache embeddings, cap top-k, and drop re-rankers on low-stakes queries.
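
A sketch of two of those controls, caching embeddings and skipping the re-ranker on low-stakes queries; vector_db and reranker are stand-ins for your own store and cross-encoder, and the low-stakes heuristic is a placeholder:

```python
from functools import lru_cache

from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=50_000)
def cached_embed(text: str) -> tuple:
    # Repeated questions hit the in-process cache instead of a paid API call.
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return tuple(resp.data[0].embedding)

def is_low_stakes(question: str) -> bool:
    # Placeholder heuristic; real systems often use a small classifier here.
    return len(question.split()) < 6

def retrieve(question: str, vector_db, reranker=None):
    candidates = vector_db.search(vector=list(cached_embed(question)), top_k=20)
    if reranker is None or is_low_stakes(question):
        return candidates[:5]  # cap results and skip the cross-encoder
    return reranker(question, candidates, top_n=5)
```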

In Practice

Most production RAG systems today are built on a small set of components. LangChain and LlamaIndex provide the orchestration layer. Embeddings come from OpenAI's text-embedding-3-large (3072-dim), Cohere embed-v3, or open models like BGE and E5 running on a GPU. Vectors land in Pinecone, Weaviate, Qdrant, Milvus, or pgvector. Generation uses Claude Sonnet, GPT-4o, or an open model like Llama 3.1 served via vLLM.

Typical configuration numbers: chunk size around 500 tokens with 50-token overlap, top-k=20 retrieval followed by a Cohere Rerank call that narrows to top 5, and a context budget of roughly 8k tokens handed to the model. Teams building for regulated industries add a citation-verification pass where each claim in the answer is checked against the retrieved chunks before the response leaves the system.
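
A sketch of that chunking step, using tiktoken for token counting; the 500/50 numbers mirror the configuration above rather than universal defaults:

```python
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
    # Split text into ~500-token windows that overlap by 50 tokens, so a
    # sentence cut at one boundary still appears intact in the next chunk.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```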

A common workflow looks like this.

Ingest: documents are parsed (often with Unstructured.io or LlamaParse for PDFs), chunked, embedded, and upserted to the vector index with metadata for filtering.

Query: the user question is embedded, filtered by tenant or document type, searched, re-ranked, and the top chunks are injected into a prompt template.

Response: the LLM answers with inline citations, and a post-processor verifies the citations before returning the answer to the user.
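
A sketch of the ingest and query halves, assuming the Pinecone Python client and reusing the chunk_text and cached_embed helpers from the earlier sketches; the index name, metadata fields, and filter are illustrative:

```python
import hashlib

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")   # assumes an existing Pinecone index
index = pc.Index("kb-chunks")

def ingest(doc_text: str, doc_id: str, tenant: str, doc_type: str):
    # Chunk, embed, and upsert with metadata so queries can filter later.
    vectors = []
    for i, chunk in enumerate(chunk_text(doc_text)):       # from the sketch above
        vectors.append({
            "id": hashlib.sha1(f"{doc_id}-{i}".encode()).hexdigest(),
            "values": list(cached_embed(chunk)),            # from the sketch above
            "metadata": {"text": chunk, "doc_id": doc_id,
                         "tenant": tenant, "doc_type": doc_type},
        })
    index.upsert(vectors=vectors)

def query(question: str, tenant: str, top_k: int = 20):
    # The metadata filter restricts the search to one tenant's documents.
    return index.query(
        vector=list(cached_embed(question)),
        top_k=top_k,
        filter={"tenant": {"$eq": tenant}},
        include_metadata=True,
    )
```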

Worked Example

A pharmaceutical sales rep asks a compliance assistant, "What's the recommended dosing for 65-year-old patients on our new antihypertensive?" The rep's question is embedded into a 3072-dim vector using OpenAI's text-embedding-3-large.

The system queries a Pinecone index holding 12,000 chunks from FDA-approved drug labels, clinical study reports, and internal medical affairs memos. Top-k=20 chunks come back by cosine similarity. A Cohere Rerank call scores those 20 against the original question and keeps the top 5, which cover the geriatric dosing table, renal adjustment guidance, and a cardiology sub-study.
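
That second stage, sketched with the Cohere Python SDK (model name illustrative); candidate_chunks would be the text of the 20 matches returned by the vector search:

```python
import cohere

co = cohere.Client("YOUR_COHERE_KEY")

def rerank_top5(question: str, candidate_chunks: list[str]) -> list[str]:
    # Re-score the retrieved chunks with a cross-encoder and keep the best 5.
    response = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=candidate_chunks,
        top_n=5,
    )
    return [candidate_chunks[r.index] for r in response.results]
```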

Those 5 chunks, plus the question and a system prompt that says "answer only from the provided sources and cite each claim," go to Claude Sonnet. The model returns a 120-word answer citing section 2.3 of the approved label for the 5mg starting dose and section 8.5 for the renal caution. A post-processor confirms each cited passage actually appears in the retrieved chunks. The rep gets a grounded answer in about 1.8 seconds. No hallucinated dosage, and a compliance audit trail is logged automatically.
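
A simplified sketch of that post-processing check, assuming the model cites sources with bracketed section labels like [2.3] and that each retrieved chunk carries its label:

```python
import re

def verify_citations(answer: str, retrieved_chunks: dict[str, str]) -> list[str]:
    # Return the citation labels in the answer that match no retrieved chunk.
    # retrieved_chunks maps a label (e.g. "2.3") to the chunk text it came from.
    # A non-empty result means the answer should be blocked or regenerated.
    cited_labels = re.findall(r"\[([\d.]+)\]", answer)
    return [label for label in cited_labels if label not in retrieved_chunks]

# Example: "[8.5]" is flagged because no chunk labeled 8.5 was retrieved.
chunks = {"2.3": "Geriatric patients: initiate at 5 mg once daily..."}
print(verify_citations(
    "Start at 5 mg once daily [2.3]; use caution in renal impairment [8.5].",
    chunks,
))
```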

What People Get Wrong

Myth

RAG replaces the need for fine-tuning.

Reality

Fine-tuning changes how a model behaves. RAG changes what it knows. If your problem is the model's tone, output format, or domain-specific reasoning patterns, fine-tuning is the right tool. If the problem is that the model doesn't know your policies or documents, RAG is. Most serious production systems use both: a fine-tuned model for style and structure, RAG for fresh, verifiable facts.

Myth

Better embeddings solve retrieval problems.

Reality

Retrieval quality usually breaks on chunking, not embeddings. A 500-token chunk cutting mid-table or mid-code-block will retrieve poorly with any embedding model. Fix your parsing and chunking first. Add a re-ranker second. Upgrading from a cheap embedding model to a premium one is usually the last optimization worth making, and often shows smaller gains than teams expect.

Myth

A bigger context window removes the need for RAG.

Reality

You can stuff 200k tokens into Claude's context, but you'll pay for every token on every call and the model still exhibits the lost-in-the-middle effect where mid-context information gets ignored. RAG stays useful even with long-context models because it cuts cost, cuts latency, and forces the model to attend to the right few pages instead of scanning a whole library.

Related Solutions

Multimodal RAG Systems
AI Knowledge Base
