RAG metrics: how to measure & optimize your retrieval pipeline
A user asks your support chatbot "how do I reset my password?" and it pulls the right help doc, generates a clear answer, and responds in under a second. The next user asks "what's your refund policy?" and the chatbot retrieves three barely relevant pages, hallucinates a 30-day guarantee you don't offer, and takes four seconds to do it. Without metrics, both interactions look the same from the outside. You only find out something's broken when a customer complains or, worse, acts on bad information.
RAG metrics tell you exactly where that second interaction went wrong. Was it a retrieval problem (wrong documents), a generation problem (the LLM made things up), or an infrastructure problem (too slow, too expensive)?
But here's the thing most teams get wrong: they pick metrics before they pick architecture, then wonder why scores plateau no matter how much they tune. The reality is that architectural decisions like chunk size, index type, and embedding model constrain which metrics you can even optimize.
This guide breaks down the metrics that matter for production RAG systems, how they connect to your retrieval architecture, and where teams typically get stuck.
What are RAG metrics & why your retrieval architecture shapes them
RAG metrics fall into three connected categories:
- Retrieval quality measures whether you found the right documents.
- Generation fidelity measures whether the LLM used them correctly.
- System reliability measures whether it holds up under real-world constraints like latency and cost.
These categories interact more than most teams expect. If your retriever misses relevant context, prompt tweaking alone is unlikely to save your faithfulness scores. And classic text overlap metrics like Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) often don't tell you whether retrieval did its job or whether the model stayed grounded in what it retrieved.
Architecture also shapes which metrics you can realistically move. Chunk size, for example, forces a precision/recall tradeoff: smaller chunks reduce noise but can fragment information across many results, making it harder to retrieve everything you need in the top K.
Larger chunks preserve more local context, though there isn't strong evidence that any single "ideal" range (including common 1,024–2,048 token heuristics) is universally optimal. The RAGChecker framework bundles context precision and recall metrics, but in practice your chunking strategy tends to set a ceiling on the balance you can reach. That's why it often makes sense to consider your architectural decisions before locking in which metrics to optimize.
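The chunking tradeoff is easy to see in the chunker itself. Here's a minimal fixed-size chunker with overlap (character-based for simplicity; production systems typically chunk by tokens, and the sizes below are illustrative defaults, not recommendations):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks with overlap, so facts near a
    boundary appear in two chunks instead of being split across them.
    Smaller chunk_size -> less noise per chunk but more fragmentation;
    larger chunk_size -> more local context but more irrelevant text."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "0123456789" * 50  # 500-character stand-in document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))                         # -> 3
print(chunks[0][-50:] == chunks[1][:50])   # overlap region is duplicated -> True
```

Overlap is the usual hedge against fragmentation: it costs extra index space and some redundancy in retrieved context, in exchange for fewer facts lost at chunk boundaries.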
How to measure retrieval quality: context relevance, recall & ranking
Retrieval metrics answer two questions: did you find the right documents, and did you rank them well? Most teams start with binary relevance metrics because they're the simplest to interpret, then layer in ranking-aware and RAG-specific metrics as their pipeline matures.
Binary relevance metrics
The simplest way to evaluate retrieval is to ask: was each document relevant or not? Binary metrics treat relevance as a yes/no judgment, which makes them easy to compute and interpret.
Precision@K measures the fraction of the top K retrieved results that are actually relevant. If you retrieve 10 documents and 7 are useful, your Precision@10 is 0.7. High precision means less noise for the LLM to sort through.
Recall@K flips the perspective: of all the relevant documents that exist, what fraction did you find in the top K? High recall means you're less likely to miss important context.
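Both metrics reduce to counting hits against a labeled relevance set. A minimal sketch, assuming you have ranked document IDs from your retriever and a ground-truth set of relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved doc IDs that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant doc IDs that appear in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked retriever output
relevant = {"d1", "d3", "d8"}               # ground-truth relevant set

print(precision_at_k(retrieved, relevant, 5))  # 2 hits in top 5 -> 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant found -> 0.667
```

Note that swapping the positions of `d3` and `d9` wouldn't change either score, which is exactly the order-blindness discussed next.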
The limitation of both is that they don't care about order. A result set with the best document at position 10 scores the same as one with it at position 1.
Ranking-aware metrics
Most RAG systems pass only the top few retrieved chunks to the LLM, so a relevant document buried at position 8 might never make it into the context window. Ranking-aware metrics account for this:
- Mean Reciprocal Rank (MRR) measures where the first relevant document appears. An MRR of 1.0 means it's always at position 1. Best suited for factoid or navigational queries where users seek a single correct answer.
- Normalized Discounted Cumulative Gain (NDCG@K) is the standard metric for ranked lists with graded relevance. Unlike binary metrics, NDCG works with graded relevance judgments (e.g., a scale of 0–3 or 0–5 depending on the dataset), rewarding highly relevant documents that appear early and penalizing good ones buried lower.
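Both metrics can be computed directly from ranked results and relevance labels. A sketch using the standard formulations (binary relevance for MRR, a doc-ID-to-grade mapping for NDCG):

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries: average of
    1/rank of the first relevant document per query (0 if none found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, gains, k):
    """NDCG@k with graded relevance; `gains` maps doc ID -> grade (e.g., 0-3).
    Gains are discounted by log2(rank + 1), then normalized by the
    ideal (best possible) ordering."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# First relevant doc at rank 2 for query 1, rank 1 for query 2
print(mrr([["d2", "d1"], ["d5", "d3"]], [{"d1"}, {"d5"}]))  # -> 0.75
```

The log discount is why NDCG punishes burying a grade-3 document at rank 8 far more than swapping two documents near the top.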
If your pipeline depends on surfacing the right document first, MRR is your go-to. If multiple documents contribute to answer quality and some matter more than others, NDCG@K gives you a more complete picture.
RAG-specific context metrics
Traditional information retrieval (IR) metrics were designed for search engines, not pipelines that feed context into an LLM. The Retrieval Augmented Generation Assessment (RAGAS) framework closes that gap with metrics built specifically for RAG:
- Context precision evaluates whether relevant chunks appear early in results.
- Context recall measures how well retrieved context covers all necessary information to answer the question, compared to a ground truth reference.
- Context relevance measures how much of the retrieved context is actually useful, penalizing redundancy and off-topic text.
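RAGAS computes these with LLM judgments of each retrieved chunk, but the rank-weighting idea behind context precision can be sketched in pure Python. This is a simplification that assumes you already have a binary relevance flag per retrieved chunk, in rank order:

```python
def context_precision(relevance_flags):
    """Average of precision@k taken at each position holding a relevant
    chunk. Relevant chunks that appear earlier pull the score up;
    relevant chunks buried late pull it down."""
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

print(context_precision([1, 0, 1]))  # relevant chunks early -> ~0.833
print(context_precision([0, 1, 1]))  # same chunks, worse ranks -> ~0.583
```

The two calls retrieve the same number of relevant chunks; only the ordering differs, which is exactly what this metric is designed to expose.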
Which metrics should you prioritize? It depends on your retrieval stage and product goal. In practice, the "best" metric bundle varies by use case, so pick the few that match what you're optimizing: coverage, cleanliness, or top-1 ranking.
How to measure RAG answer quality: faithfulness, relevance & hallucinations
Once you've measured retrieval, you'll want to evaluate what the LLM does with the context it receives.
Faithfulness
Faithfulness measures whether the generated answer sticks to the retrieved context or hallucinates information that isn't there.
RAGAS scores this as a ratio: Faithfulness = valid statements / total statements. In other words: how many claims in the answer are actually supported by what you retrieved?
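The arithmetic is simple once the per-claim verdicts exist. In RAGAS, an LLM both extracts the claims from the answer and verifies each one against the retrieved context; here the verdicts are supplied directly for illustration:

```python
def faithfulness(claim_verdicts):
    """claim_verdicts: one boolean per claim extracted from the answer,
    True if the claim is supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# 3 of 4 claims in the answer are supported by the retrieved context
print(faithfulness([True, True, False, True]))  # -> 0.75
```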
Answer relevance
Answer relevance measures how directly the response addresses the original query. RAGAS does this with a bit of a trick: it generates questions from the answer, then measures cosine similarity between those generated questions and the original.
It works well for "did you answer the question?" checks, but it can also ding answers that wander, repeat themselves, or add extra (even if correct) detail.
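The scoring step reduces to averaging cosine similarities between embeddings. A sketch with toy vectors, where a real system would use its embedding model to produce `original_q_vec` and `generated_q_vecs` (both names are just placeholders here):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(original_q_vec, generated_q_vecs):
    """Mean cosine similarity between the original query's embedding and
    embeddings of questions generated back from the answer."""
    sims = [cosine_sim(original_q_vec, g) for g in generated_q_vecs]
    return sum(sims) / len(sims)

# Toy 2-d embeddings: one generated question matches, one doesn't
print(answer_relevance([1, 0], [[1, 0], [0, 1]]))  # -> 0.5
```

An answer that fully addresses the query yields generated questions close to the original (similarities near 1.0); a wandering answer yields scattered questions and a lower mean.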
Hallucination detection
If you want to catch hallucinations specifically, Natural Language Inference (NLI) checks can outperform lexical overlap metrics like ROUGE, particularly for longer-form content. The common pattern is to split the answer into sentences, score each sentence for entailment against the retrieved sources, then aggregate into an overall consistency score. Adding knowledge graph preprocessing on top of NLI models can improve hallucination detection accuracy, though results vary depending on your corpus and model choices.
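The split-score-aggregate pattern looks like this in outline. Note that `entailment_fn` is a stand-in for a real NLI model (e.g., a cross-encoder fine-tuned on NLI data); the fake scorer below exists only to make the sketch runnable:

```python
def consistency_score(sentences, sources, entailment_fn, threshold=0.5):
    """Score each answer sentence for entailment against the retrieved
    sources, then report the fraction of supported sentences.
    `entailment_fn(sentence, sources)` should return the probability
    that the sources entail the sentence."""
    if not sentences:
        return 0.0
    scores = [entailment_fn(s, sources) for s in sentences]
    supported = sum(1 for p in scores if p >= threshold)
    return supported / len(sentences)

sentences = [
    "You can reset your password from the login page.",
    "We offer a 30-day money-back guarantee.",
]
sources = "To reset your password, click 'Forgot password' on the login page."

# Stub scorer for the demo; a real NLI model replaces this
def fake_entailment(sentence, sources):
    return 0.9 if "password" in sentence else 0.1

print(consistency_score(sentences, sources, fake_entailment))  # -> 0.5
```

Sentence-level aggregation is what lets this approach localize the hallucination: the score drops and you know which sentence lacked support.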
Teams are also using LLMs as evaluators ("LLM-as-a-judge"). That can reduce the need for expensive human annotations, but it comes with its own gotchas. RAGAS' initial validation used two annotators (with ~95% agreement on faithfulness), though some researchers have noted this sample is statistically limited. And judge variability means your absolute scores can shift depending on which judge model you use and how you prompt it.
Why you should track latency, cost & safety alongside RAG accuracy
A RAG system that gets perfect accuracy scores but takes 10 seconds to respond and costs $50 per 1,000 queries isn't production-ready. In production, you end up juggling the quality-cost-speed tradeoff.
Latency
Response times for RAG systems can vary widely, often falling in the multi-second range. Characterization studies show retrieval accounting for 45–47% of time-to-first-token (TTFT) latency and approximately 41% of end-to-end latency. The retrieval and preprocessing stages together consume the majority of your latency budget before the LLM even starts generating.
Cost
Token consumption drives most RAG costs. The biggest lever is retrieving fewer, better chunks—five highly relevant passages typically outperform twenty marginally relevant ones, at a fraction of the token cost. Semantic caching offers another reduction by returning cached responses for semantically similar queries instead of calling the LLM again.
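The mechanics of a semantic cache fit in a few lines. A toy in-memory version, where `embed` is a placeholder for a real embedding model and the 0.9 threshold is an illustrative choice, not a recommendation:

```python
import numpy as np

class SemanticCache:
    """Return a stored answer when a new query's embedding is close
    enough to a previously cached query's embedding."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # query -> vector
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # (embedding, answer) pairs

    def get(self, query):
        q = self.embed(query)
        for vec, answer in self.entries:
            sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return answer  # cache hit: skip retrieval and the LLM call
        return None  # cache miss: run the full pipeline, then put()

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

A production version would store embeddings in a vector index rather than scanning a list, and would pair cached answers with a TTL so stale responses age out.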
Safety
Production systems need safety guardrails, such as checks on user input and filters on generated output. Guardrail metrics matter on two fronts: they verify the guardrails actually catch unsafe or off-policy content, and they confirm the added checks don't degrade latency or answer quality for everyone else.
How to align your RAG & retrieval architecture with the metrics you care about
Your index type, reranking strategy, and infrastructure choices shape which metrics you can optimize and what tradeoffs you'll face. Here's how the most common architectural decisions play out.
Index type shapes the recall-latency boundary
Hierarchical Navigable Small World (HNSW) indexes are a common choice for high-recall vector search. Billion-scale benchmarks show HNSW configurations achieving 0.84–0.99 recall@10, but the tradeoff is memory: HNSW graphs are memory-hungry at that scale. When memory is the binding constraint, teams often turn to alternatives like Inverted File with Product Quantization (IVFPQ), which trades some recall for a much smaller footprint. The right choice depends on your scale and infrastructure constraints.
Reranking boosts precision at a latency cost
Cross-encoder reranking can improve retrieval quality, though the lift depends on your corpus and queries. The tradeoff is latency: reranking adds processing time that may or may not fit your response-time budget. Interestingly, smaller reranking pools can win on both quality and speed; a larger candidate pool doesn't always produce better results.
Hybrid retrieval combines precision & recall
Hybrid BM25 plus vector approaches often show improved performance across RAG benchmarks. The combination captures both semantic similarity and exact keyword matches—important when your documents contain technical terminology or specific identifiers that pure vector search can miss.
Redis supports hybrid retrieval natively, combining vector search with BM25 full-text search in a single query through the Redis Query Engine. In a benchmark against OpenSearch, Redis demonstrated up to 18x faster single-client vector search and up to 52x higher multi-client query throughput. At billion scale, Redis achieves 90% recall at 200ms median latency.
Beyond retrieval, Redis handles semantic caching alongside operational data—reducing LLM API calls by up to 68.8% in workloads with high semantic repetition.
Moving from metrics to production
RAG metrics are diagnostic tools that tell you where your pipeline is breaking and what architectural tradeoffs you're making. The metrics you choose should follow from your architecture, not the other way around.
Start by defining your constraints: latency budget, memory footprint, cost thresholds. Select an architecture that operates within those bounds. Then choose metrics aligned with what that architecture can realistically achieve. Reference-free evaluation frameworks like RAGAS work well for offline or pre-production evaluation where ground truth isn't available. For production, combine them with dedicated observability or monitoring tools. Use reference-based metrics for offline benchmarking when labeled ground truth exists.
Redis fits naturally into this workflow as the retrieval layer for production RAG systems. The unified platform handles vector search, caching, and operational data together, so your retrieval layer and semantic cache don't require separate infrastructure.
Try Redis free to see how it fits your RAG architecture, or talk to our team about optimizing your retrieval pipeline.
Get started with Redis today
Speak to a Redis expert and learn more about enterprise-grade Redis today.
