How to evaluate a RAG system before you trust it
Retrieval-augmented generation demos beautifully and fails quietly. Here is the evaluation loop I use to know whether a RAG system is actually good before it goes anywhere near a user.
Retrieval-augmented generation is the easiest impressive demo in AI and one of the hardest things to trust in production. You wire an LLM to your documents, ask it three questions you already know the answers to, watch it nail all three, and ship. Then a real user asks the fourth question, the one you never tried, and the system answers with total confidence and total nonsense. Nobody notices for a week.
The reason is simple. A RAG system has two independent ways to fail, and a happy-path demo tests neither of them properly. If you are putting one in front of customers, staff, or a board, you need an evaluation loop, not a vibe check. Here is the one I use.
The two failure modes
Every RAG answer is the product of two steps: retrieval (find the right source passages) and generation (write an answer from them). They fail for different reasons and you have to measure them separately.
- Retrieval failure. The right passage was never pulled from your index, so the model is answering from nothing. No amount of prompt tuning fixes this. The information simply was not in the context.
- Generation failure. The right passage was retrieved, but the model ignored it, contradicted it, or padded the answer with plausible invention. This is the classic hallucination, and it is a generation problem, not a retrieval one.
If you only look at the final answer, a wrong output tells you nothing about which half broke. Splitting the two is the whole game.
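A minimal sketch of that split, assuming each test question comes with labelled source passages and a graded answer. The `EvalResult` fields and the rule "any missing gold passage counts as a retrieval failure" are my assumptions for illustration, not a standard:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    retrieved_ids: list[str]  # passage ids the retriever returned
    gold_ids: set[str]        # passage ids that should support the answer
    answer_correct: bool      # graded against the known-good answer


def classify(result: EvalResult) -> str:
    """Attribute a wrong answer to the half of the pipeline that broke."""
    if result.answer_correct:
        return "ok"
    # Assumption: if any required passage is missing from the context,
    # blame retrieval, because the model never saw the full evidence.
    if not result.gold_ids <= set(result.retrieved_ids):
        return "retrieval_failure"
    # All the evidence was in the context and the answer is still wrong.
    return "generation_failure"
```

Counting the two labels across a test run tells you whether to spend the next week on the index or on the prompt.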
Build a gold set first
Before any metric, you need a small set of questions with known-good answers and, crucially, the specific source passages that should be used to answer them. Fifty to a hundred well-chosen questions beat a thousand lazy ones. Cover the boring cases, the edge cases, the questions with no answer in the corpus (the system should say so), and the questions that look similar but have different answers.
This gold set is the single highest-leverage artefact in the project. It is what turns "seems good" into a number you can move.
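One possible shape for a gold set entry, as a sketch. The `GoldItem` name, its fields, the passage ids, and the sample questions are all made up for illustration; the only idea taken from the text is that each question carries a known answer and the passages that support it, and that unanswerable questions are included on purpose:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldItem:
    question: str
    expected_answer: Optional[str]  # None means the corpus has no answer
    gold_passage_ids: frozenset[str] = frozenset()
    tags: tuple[str, ...] = ()      # e.g. "boring", "edge", "unanswerable"


def validate(items: list[GoldItem]) -> list[str]:
    """Return a list of labelling problems; empty means the set is consistent."""
    problems = []
    for item in items:
        if item.expected_answer is not None and not item.gold_passage_ids:
            problems.append(f"answerable but no gold passages: {item.question!r}")
        if item.expected_answer is None and item.gold_passage_ids:
            problems.append(f"unanswerable but has gold passages: {item.question!r}")
    return problems


# Hypothetical entries: two look-alike questions and one with no answer.
GOLD_SET = [
    GoldItem("What is the refund window for annual plans?", "30 days",
             frozenset({"billing-faq#3"}), ("boring",)),
    GoldItem("What is the refund window for monthly plans?", "14 days",
             frozenset({"billing-faq#4"}), ("near-duplicate",)),
    GoldItem("Do you offer on-premise hosting?", None,
             frozenset(), ("unanswerable",)),
]
```

Running `validate(GOLD_SET)` before every evaluation run is a cheap way to catch labelling mistakes before they distort the metrics.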
Metrics that actually tell you something
For retrieval, measure whether the correct passages come back and how high they rank:
- Recall@k: of the passages that should have been retrieved, what fraction appeared in the top k? If recall@5 is low, your answers are capped no matter how good the model is.
- MRR (mean reciprocal rank) / rank position: how near the top is the first correct passage? Context windows are finite, and models tend to use passages near the start of the context more reliably than ones buried deeper, so rank matters, not just presence.
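Both retrieval metrics fit in a few lines. This is a sketch using the standard definitions; the function names and the convention of passing passage ids as strings are my own choices:

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold passages that appear in the top k retrieved ids."""
    if not gold:
        raise ValueError("recall is undefined when there are no gold passages")
    return len(gold & set(retrieved[:k])) / len(gold)


def reciprocal_rank(retrieved: list[str], gold: set[str]) -> float:
    """1 / rank of the first gold passage, or 0.0 if none was retrieved."""
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, gold) pairs, one per question."""
    if not runs:
        raise ValueError("need at least one question")
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
```

Unanswerable questions have no gold passages, so leave them out of these two metrics and score them separately.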
For generation, measure whether the answer is actually supported by what was retrieved:
- Faithfulness (groundedness): is every claim in the answer traceable to a retrieved passage? This is your hallucination detector.
- Answer relevance: does it actually address the question, or is it a fluent dodge?
- Completeness: did it use all the relevant evidence that was retrieved, or leave half of it on the table?
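Faithfulness is commonly turned into a number by splitting the answer into individual claims and counting how many are supported by the retrieved passages. The sketch below covers only the arithmetic; it assumes the per-claim verdicts come from somewhere else (a human or a judge model), and treating an answer with no claims as fully faithful is my own convention:

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of claims in an answer that are backed by a retrieved passage."""
    if not claim_supported:
        # Assumption: an answer that makes no claims (e.g. an abstention)
        # cannot contradict the evidence, so it scores 1.0.
        return 1.0
    return sum(claim_supported) / len(claim_supported)
```

An answer with three supported claims and one invented one scores 0.75, which is easier to track over time than a pass/fail label.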
You can score faithfulness and relevance with a stronger model acting as judge, but calibrate the judge against human ratings on a slice first. An unchecked LLM judge is just another confident guess.
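One way to do that calibration is to have humans and the judge label the same slice, then compute an agreement statistic. The sketch below uses Cohen's kappa, which is my choice here rather than something the workflow requires; plain percentage agreement also works but flatters a judge when one label dominates:

```python
def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected if both raters labelled at random with their own
    # observed label frequencies.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the judge tracks the human raters closely; a kappa near 0 means it agrees no more often than chance would, and its scores should not be trusted yet.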
Test the "I don't know" case on purpose
The most dangerous behaviour is a confident answer to a question your corpus cannot support. Put unanswerable questions in your gold set and check that the system abstains instead of inventing. A RAG system that never says "the documents do not cover this" is not being helpful. It is being unfalsifiable.
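A rough sketch of that check. The marker phrases and function names are hypothetical, and substring matching is a crude stand-in: a system that returns an explicit abstain flag, or a judge model, would be more reliable than string matching.

```python
# Hypothetical phrases; adjust to whatever your system actually says.
ABSTAIN_MARKERS = ("do not cover", "don't know", "not in the documents")


def is_abstention(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_report(items: list[tuple[bool, str]]) -> dict[str, float]:
    """items holds (answerable, system_answer) pairs from the gold set run."""
    unanswerable = [ans for answerable, ans in items if not answerable]
    answerable = [ans for ok, ans in items if ok]
    if not unanswerable or not answerable:
        raise ValueError("need both answerable and unanswerable questions")
    return {
        # Should be high: the system declined when the corpus had no answer.
        "correct_abstention_rate":
            sum(map(is_abstention, unanswerable)) / len(unanswerable),
        # Should be low: the system declined although an answer existed.
        "false_abstention_rate":
            sum(map(is_abstention, answerable)) / len(answerable),
    }
```

Tracking both rates matters, because a system that abstains on everything scores perfectly on the first and is useless on the second.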
Make it a loop, not a launch
The point of all this is not a one-off report. It is a harness you can rerun on every change: a new embedding model, a different chunking strategy, a reranker, a prompt tweak. Each change becomes an experiment with a number attached, so you can see whether you moved the metric or just moved it around.
Wire it into CI and it stops regressions before they ship. If you change the chunk size and recall drops four points, the build tells you before a user does. That is the difference between a system you hope is working and one you know is.
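A sketch of the gate itself, assuming the harness writes its metrics as a name-to-score mapping and that a stored baseline from the last accepted run is available. The function name, the metric names, and the 0.01 tolerance are illustrative choices:

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """List every metric that fell more than `tolerance` below its baseline."""
    failures = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current run")
        elif value < base - tolerance:
            failures.append(f"{name}: {base:.3f} -> {value:.3f}")
    return failures
```

In a CI job you would call this after the evaluation run, print the failures, and exit with a non-zero status if the list is not empty. The tolerance is there to absorb run-to-run noise; set it from the variation you actually observe rather than from a guess.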
I keep a small, public evaluation harness that shows this shape end to end, from gold set to metrics table to CI: github.com/CharlieHulme/rag-eval-harness.
What good looks like
A RAG system you can trust is not the one that aced the demo. It is the one where you can state, with a number behind it, how often it retrieves the right evidence, how often it stays faithful to that evidence, and how it behaves when the answer is not there at all. Get those three under measurement and you can improve them deliberately. Skip them and you are shipping confidence, not correctness.
Build the evaluation before you build the trust. Everything else in RAG is downstream of it.