How to Measure RAG System Accuracy (And Why Most Teams Get It Wrong)
Most RAG evaluations miss what matters. Here are the metrics that actually predict production quality and how to set up an evaluation pipeline that catches problems before your users do.
I review a lot of RAG systems that teams have built internally. The most common evaluation strategy I see is someone on the team asking the system ten questions and saying "looks good." That is not evaluation. RAG evaluation needs to be systematic, repeatable, and grounded in metrics that predict real-world accuracy. Most teams skip this because it feels like extra work. Then they ship a system that answers 60% of questions correctly and wonder why adoption stalls.
I've built RAG systems for legal teams processing 10,000+ documents, financial services firms, and healthcare companies where a wrong answer could mean a regulatory violation. In every case, the evaluation pipeline was the difference between a system that earned trust and one that got abandoned within three months.
Here is how to measure RAG accuracy in a way that actually tells you whether your system is production-ready.
Why "it looks right" is not an evaluation strategy
RAG systems fail in subtle ways. The answer reads well. The language is confident. The format is clean. But the actual content is wrong. Maybe the system pulled from an outdated document. Maybe it retrieved the right section but misinterpreted a conditional clause. Maybe it answered a slightly different question than what was asked.
Human spot-checking catches obvious failures but misses these subtle ones. And subtle failures are what erode trust in production. A user who gets a confident wrong answer once will stop using the system entirely. I saw this happen at a financial services client. Their internal RAG tool had 88% accuracy on simple lookups but dropped to 52% on multi-step questions. The team didn't know this because they only tested simple questions. Users figured it out within two weeks and went back to emailing the compliance team directly.
The other problem with manual spot-checking is that it doesn't scale with changes. Every time you update chunking parameters, swap an embedding model, or add new documents, you need to re-evaluate. If evaluation is a manual process, it doesn't happen. If it doesn't happen, regressions sneak in.
Three metrics that matter
I evaluate every RAG system on three dimensions: faithfulness, relevance, and coverage. Each measures a different failure mode. If you only track one, you'll miss entire categories of problems.
Faithfulness
Faithfulness measures whether the generated answer is grounded in the retrieved documents. Can every claim in the answer be traced back to a specific passage in the source material? A faithfulness score of 0.9 means 90% of the claims in the answer are supported by the retrieved context.
Low faithfulness means the model is hallucinating. It is generating plausible-sounding text that goes beyond what the documents actually say. For enterprise use cases, especially legal, compliance, and financial, faithfulness below 0.85 is a dealbreaker. I target 0.92+ for production systems.
The tricky thing about faithfulness failures is that they look correct to anyone who isn't a domain expert. The model might say a policy allows 30 days for appeal when the actual document says 15 days. The sentence structure is perfect. The answer feels authoritative. But the number is hallucinated. This is why automated faithfulness scoring matters so much.
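To make the mechanics concrete, here is a minimal sketch of claim-level faithfulness scoring. In a real pipeline the `claim_supported` callable would be an LLM judge call; the substring check below is only a stand-in for illustration, and the example claims are invented.

```python
def faithfulness_score(claims, context, claim_supported):
    """Fraction of answer claims the judge says are grounded in the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim_supported(claim, context))
    return supported / len(claims)

# Stand-in judge: naive substring containment. A production judge is an
# LLM call that checks semantic entailment, not string matching.
def naive_judge(claim, context):
    return claim.lower() in context.lower()

context = "Appeals must be filed within 15 days of the written decision."
claims = [
    "Appeals must be filed within 15 days",  # supported by the context
    "Appeals may be filed within 30 days",   # hallucinated number
]
print(faithfulness_score(claims, context, naive_judge))  # 0.5
```

The design point is that the score is computed per claim, not per answer: an answer with one hallucinated number among four correct claims scores 0.75, which surfaces exactly the subtle failure mode described above.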
Relevance
Relevance measures whether the answer addresses the actual question. The system might retrieve accurate information and generate a faithful summary, but if it answers a different question than what was asked, the output is useless.
This happens more than you would expect. A question about "employee termination policy in California" retrieves the general termination policy and ignores the California-specific addendum. The answer is faithful to the source but irrelevant to the question. I target relevance scores of 0.88+ for production.
Relevance failures usually trace back to the retrieval layer. The system found chunks that are semantically similar to the query but don't actually contain the specific information needed. Better metadata filtering and re-ranking fix most relevance issues. If your retrieval layer consistently returns the right documents but the wrong sections, your chunking is too coarse.
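The metadata-filtering fix can be as simple as discarding candidate chunks whose metadata doesn't match the query's scope before (or after) vector search. A minimal sketch, with hypothetical field names, using the California example above:

```python
def filter_by_metadata(chunks, required_meta):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["meta"].get(k) == v for k, v in required_meta.items())
    ]

chunks = [
    {"id": "term-policy-general", "meta": {"topic": "termination", "region": "all"}},
    {"id": "term-policy-ca", "meta": {"topic": "termination", "region": "CA"}},
]

# A question about California termination policy should be answered from
# the CA-specific addendum, not the general policy.
ca_chunks = filter_by_metadata(chunks, {"topic": "termination", "region": "CA"})
print([c["id"] for c in ca_chunks])  # ['term-policy-ca']
```

In practice the filter runs inside the vector store (most support metadata predicates), but the logic is the same: semantic similarity alone would happily return the general policy.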
Coverage
Coverage measures whether the system found all the relevant information. If the correct answer requires information from three different documents and the system only retrieves two, the answer will be incomplete. Coverage is the hardest metric to measure because you need to know the complete ground truth answer.
Low coverage usually points to retrieval problems. Either the chunking split relevant information across too many small pieces, or the retrieval strategy missed documents that use different terminology for the same concept. I target coverage of 0.85+ in production.
One pattern I see often: the knowledge base has the information spread across a policy document, an FAQ page, and an internal memo. The RAG system retrieves the policy document but misses the other two. The answer is partially correct but incomplete. Users don't know what's missing, so they trust the incomplete answer. That's worse than returning no answer at all.
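Because your evaluation dataset records which source documents contain each answer, coverage reduces to set arithmetic over document IDs. A minimal sketch (document names invented):

```python
def coverage_score(retrieved_doc_ids, required_doc_ids):
    """Fraction of ground-truth source documents the retriever actually found."""
    required = set(required_doc_ids)
    if not required:
        return 1.0  # nothing required, trivially complete
    return len(set(retrieved_doc_ids) & required) / len(required)

# The answer lives in three documents, but retrieval surfaced only one of them.
required = ["policy-2024", "faq-page", "internal-memo"]
retrieved = ["policy-2024", "unrelated-doc"]
print(round(coverage_score(retrieved, required), 2))  # 0.33
```

A score of 0.33 here flags exactly the policy/FAQ/memo pattern: the answer will read as faithful and relevant while silently missing two-thirds of the source material.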
Building your evaluation dataset
You need a set of question-answer pairs where the expected answer is known and verified by a domain expert. I recommend 50-100 pairs for an initial evaluation. That is enough to surface systematic problems without requiring months of prep.
Get these from your actual users. Pull the last 100 questions submitted to your support team, your legal team, or whatever group will use the system. Have domain experts write the correct answers and note which source documents contain the information. This gives you ground truth for all three metrics.
Categorize your questions by difficulty. Simple factual lookups ("What is our vacation policy?") should score near-perfect. Multi-hop questions ("Which departments exceeded their Q3 budget and what was their Q2 trend?") will score lower. Having both types tells you where to focus optimization.
I also include a set of 10-15 "unanswerable" questions where the knowledge base genuinely doesn't contain the answer. A good RAG system should say "I don't have enough information to answer this" instead of generating a plausible guess. If your system confidently answers unanswerable questions, your faithfulness guardrails need work.
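One way to structure these pairs so all three metrics can be computed later (field names are my own convention, not a standard, and the example content is invented):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_answer: str   # written and verified by a domain expert
    source_doc_ids: list   # documents containing the ground truth (for coverage)
    difficulty: str        # "simple" or "multi_hop"
    answerable: bool = True  # False for known-unanswerable questions

dataset = [
    EvalCase(
        question="What is our vacation policy?",
        expected_answer="Full-time employees accrue 15 days per year.",
        source_doc_ids=["hr-handbook-2024"],
        difficulty="simple",
    ),
    EvalCase(
        question="What is the CEO's personal cell number?",
        expected_answer="I don't have enough information to answer this.",
        source_doc_ids=[],
        difficulty="simple",
        answerable=False,
    ),
]
```

Keeping `source_doc_ids` and `answerable` on every case is what lets the same dataset drive coverage scoring and the unanswerable-question check, instead of maintaining separate test sets.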
Automated evaluation with LLM-as-judge
Manually scoring 100 question-answer pairs across three metrics is slow. The practical approach is to use an LLM as an automated evaluator. You give the judge model the question, the generated answer, the retrieved context, and the ground truth answer, then ask it to score faithfulness, relevance, and coverage on a 0-1 scale.
I use GPT-4o or Claude as the judge model, with structured output to get numeric scores plus explanations. The explanations are critical. A score of 0.6 on faithfulness means nothing without knowing which specific claims were unsupported.
There's a calibration step most teams skip. Before trusting your automated judge, have human experts score the same 30-40 pairs. Compare the human scores to the LLM scores. If the correlation is below 0.85, refine your judge prompt. Adding 3-5 scored examples to the judge prompt improves correlation by 15-20%. The examples show the judge what a 0.5 versus a 0.9 looks like for your domain.
Tools like RAGAS and DeepEval automate this pipeline. RAGAS gives you faithfulness, answer relevance, and context precision out of the box. DeepEval adds custom metric support. Both integrate with LangChain and LlamaIndex. I use RAGAS for most projects because it requires the least setup, but switch to DeepEval when I need domain-specific metrics.
Run automated evaluation on every change to your pipeline. New chunking strategy? Run the eval suite. Different embedding model? Run the eval suite. This is the only way to know if a change actually improved things. I set this up as a CI step so it runs automatically on every pull request that touches the RAG pipeline.
Human evaluation for high-stakes domains
Automated evaluation is necessary but not sufficient for domains where wrong answers have real consequences. Legal, medical, and financial RAG systems need human evaluation on top.
I set up a weekly review cycle where domain experts evaluate a random sample of 20-30 production queries. They score each answer on a simple scale: correct, partially correct, or incorrect. They also flag any answer that is technically correct but could be misleading in context. This catches failure modes that automated metrics miss.
The human review also feeds back into the evaluation dataset. Every incorrect answer becomes a new test case. Over three to six months, you build a dataset of 300-500 pairs that covers the real distribution of questions your users ask. That dataset becomes one of the most valuable artifacts in your AI infrastructure.
Continuous monitoring in production
RAG accuracy drifts over time. New documents get added that conflict with old ones. The distribution of user questions shifts. Source systems change their formatting and the parser breaks silently. I've seen systems lose 15 percentage points of accuracy over 90 days without a single code change, purely because the knowledge base evolved.
I monitor three signals in production. First, retrieval quality: are the retrieved chunks actually relevant to the query? I log the retrieval scores and flag any query where the top result scores below 0.7 similarity. Second, answer confidence: I ask the generation model to rate its own confidence and flag low-confidence answers for human review. Third, user feedback: even a simple thumbs up/thumbs down on each answer gives you a direct signal of accuracy from the people who know best.
Set up automated alerts when any of these signals trend downward over a rolling 7-day window. A 10% drop in retrieval scores usually means something changed in the knowledge base. Catch it at the monitoring layer and you fix it in hours. Miss it and you find out weeks later when adoption craters.
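The rolling-window alert logic is simple enough to sketch in a few lines; in production this would sit on top of your logging pipeline, and the scores below are synthetic:

```python
def rolling_mean(values, window):
    """Mean of the last `window` values."""
    tail = values[-window:]
    return sum(tail) / len(tail)

def retrieval_alert(daily_scores, window=7, max_drop=0.10):
    """Alert when the current window's mean retrieval score falls more than
    max_drop (relative) below the previous window's mean."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two windows
    current = rolling_mean(daily_scores, window)
    baseline = rolling_mean(daily_scores[:-window], window)
    return current < baseline * (1 - max_drop)

# Two healthy weeks, then the knowledge base changes and scores sag.
healthy = [0.82, 0.80, 0.81, 0.83, 0.79, 0.80, 0.82]
degraded = [0.70, 0.68, 0.71, 0.69, 0.70, 0.72, 0.69]
print(retrieval_alert(healthy + degraded))  # True
```

Comparing window means rather than single days is the point: one bad day is noise, but a week trending 10% below the prior week is almost always a knowledge-base or parser change.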
What good looks like
Based on the production RAG systems I've built and evaluated, here are the benchmarks I aim for:
- Faithfulness: 0.92+ (no more than 8% of claims unsupported by source material)
- Relevance: 0.88+ (answers address the actual question asked)
- Coverage: 0.85+ (retrieval finds the large majority of relevant information)
- User satisfaction (thumbs-up rate): 85%+
- Retrieval precision at k=5: 0.75+ (at least 4 of the top 5 chunks are relevant)
- Unanswerable question detection: 90%+ (system correctly declines to answer)
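The retrieval-precision benchmark in the list above is easy to compute directly. A minimal sketch, with invented chunk IDs:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant) / k

retrieved = ["c1", "c7", "c3", "c9", "c4"]   # in ranked order
relevant = {"c1", "c3", "c4", "c5"}          # per the eval dataset
print(precision_at_k(retrieved, relevant))   # 0.6 -> below the 0.75 target
```

Note the denominator is k, not the number of retrieved chunks, so a retriever that returns fewer than k results is penalized rather than rewarded.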
If your system scores below these thresholds, the fix is almost always in the retrieval layer. Better chunking, better metadata, hybrid search, and re-ranking will move the needle more than switching LLMs or tweaking prompts. Measure first, then optimize the component that is actually failing.
We set up evaluation pipelines as part of every RAG build. If you have an existing RAG system that feels unreliable, the first step is getting real numbers on these three metrics. Once you know where the system is failing, the fix is usually straightforward.