AI context windows: Why context quality beats context size

June 07, 2026 · 9 minute read

Jim Allen Wallace

Your AI agent has a 128K token context window. You're adding in retrieved documents, conversation history, tool outputs, and system instructions. But the answers are getting worse.

You're not alone. Most agent failures in production today are context failures. The model works fine. The information you feed it doesn't. A context window full of outdated policies, irrelevant documents, and contradictory facts can produce confidently wrong answers, even with a capable underlying model.

Context quality, the degree to which information in an LLM's context window is relevant, accurate, and structured for the task at hand, is emerging as a major constraint on agent performance. Context engineering, the practice of curating what goes into that window, is becoming a core skill for teams building LLM apps.

This article covers what separates good context from bad, the failure modes that degrade it, why precision beats volume, and how retrieval quality shapes everything downstream.

What separates good context from bad

Good context pulls its weight; bad context just crowds the window. That distinction matters because LLMs process every token through transformer attention, where tokens interact across the entire window. A window of n tokens produces on the order of n² pairwise attention relationships, so every token you add doesn't just take up space: it adds roughly 2n more relationships for the model to resolve.
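To make that growth concrete, here is a minimal sketch that counts pairwise relationships for a few window sizes. It assumes full, unmasked self-attention (causal masking roughly halves the count but keeps the quadratic shape), and the function names are illustrative rather than taken from any library.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs under full self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


def added_pairs(n_tokens: int, extra: int = 1) -> int:
    """Pairs added by appending `extra` tokens to a window of n_tokens."""
    return attention_pairs(n_tokens + extra) - attention_pairs(n_tokens)


for n in (1_000, 32_000, 128_000):
    # Each row: window size, total pairs, pairs added by one more token.
    print(n, attention_pairs(n), added_pairs(n))
```

One extra token in a 128,000-token window adds 256,001 pairs, which is why trimming low-value passages pays off more as the window fills.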

So what does "good" actually look like? It comes down to a few measurable properties:

Relevance: Retrieved information directly addresses the query.
Completeness: The context contains enough information to produce a correct answer.
Faithfulness: The model's response stays grounded in the retrieved evidence.
High signal-to-noise ratio: Every passage contributes useful signal, not filler.
Freshness: The information reflects the current state of the world, not a snapshot from six months ago.

When those properties hold up together, the model has a much better shot at producing a correct answer. Bad context violates one or more of them, and the consequences aren't subtle.

Two failures show up clearly in the research: noise and excess volume. Noise means retrieved passages that don't address the query, and in one study, pure noise performed far worse than no retrieval at all (8.0% exact match versus 23.6%). Volume hurts even when everything retrieved is relevant. A separate study saw accuracy drop from 94% to 61% as context filled from 40% to 80% of the window. In both cases, less context with the right information beat more of it.

Context failure modes: poisoning, distraction, confusion, clash & rot

Context doesn't degrade in one uniform way. It breaks through distinct failure modes, and knowing what they look like is the first step to keeping them out of your agent's window.

Context poisoning

Context poisoning happens when incorrect, hallucinated, or adversarially injected information enters the context and the model treats it as ground truth. In retrieval-augmented generation (RAG), one paper on multi-hop retrieval found that LLMs often struggled to reject irrelevant content once it was already in the context window.

This gets worse in multi-step agent workflows, where an error early in the process becomes an established fact later on. The output sounds coherent and confident while the underlying data is wrong.

Context distraction

Context distraction happens when accumulated history, tool outputs, and summaries overwhelm fresh reasoning. The content doesn't need to be false to cause damage. One evaluation saw this after 100,000 tokens in an agent playing Pokémon, where the agent fell into behavioral loops rather than exploring new strategies.

Even with perfect retrieval, performance drops as input length increases: the measured decline ranged from 13.9% to 85% depending on the setup in that evaluation. Length alone can become a performance liability.

Context confusion

Context confusion is a narrower version of distraction: irrelevant tools and documents clutter the window, causing the model to pick the wrong tool or blend unrelated sources. Registering every available tool definition instead of filtering to the task-relevant ones is a common trigger.
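One way to avoid registering every tool is to filter the registry per task before building the prompt. The sketch below is a hedged illustration: the `Tool` class, the tag-based matching, and the tool names are all hypothetical, and a production system might match on embeddings or routing rules instead of tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    tags: frozenset


def select_tools(tools, task_tags, limit=5):
    """Keep tools that share at least one tag with the task, best overlap first."""
    task_tags = set(task_tags)
    scored = [(len(tool.tags & task_tags), tool) for tool in tools]
    relevant = [(score, tool) for score, tool in scored if score > 0]
    # Highest overlap first; ties broken by name so the order is deterministic.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [tool for _, tool in relevant[:limit]]


registry = [
    Tool("search_orders", "Look up an order", frozenset({"orders", "support"})),
    Tool("deploy_service", "Deploy a service", frozenset({"deploy", "engineering"})),
    Tool("refund_payment", "Issue a refund", frozenset({"orders", "billing"})),
]

selected = select_tools(registry, {"orders", "billing"})
print([tool.name for tool in selected])
```

Only the definitions in `selected` would be passed to the model, so the deployment tool never competes for attention on a billing task.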

Context clash

Context clash happens when contradictory information lands in the same window, which is hard for a model to resolve. Work on evidence conflicts found that performance declines as misinformation increases, with greater conflicting evidence producing more uncertainty. The model may arbitrarily pick one source, or hallucinate a synthesis of both.

Context rot

Taken together, these failure modes accumulate into a broader problem: context rot. As context fills across an agent's lifespan, recall can degrade even when the total token count stays within technical limits. Context rot is the umbrella under which poisoning, distraction, confusion, and clash operate together.

Every token & its position should earn its place

Keeping those failure modes out starts with token selection, and position matters as much as relevance. In a 20-document question-answering setup, accuracy dropped 20% when the relevant document sat in the middle rather than at the beginning or end. That U-shaped attention curve, with strong primacy and recency but a weak middle, showed up across models with 4K, 16K, and 32K windows.
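If position matters this much, one common mitigation is to place the strongest passages at the edges of the prompt and let the weakest sit in the middle. The sketch below assumes the input list is already sorted best-first by some relevance score; the function name is illustrative, and this reordering is not something the cited study prescribes.

```python
def order_for_edges(ranked_docs):
    """Reorder best-first docs so the strongest sit at the start and end.

    Even-ranked positions (0, 2, 4, ...) fill the front in order; odd-ranked
    positions fill the back in reverse, which pushes the weakest documents
    toward the middle of the final list.
    """
    front, back = [], []
    for index, doc in enumerate(ranked_docs):
        (front if index % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(order_for_edges(ranked))
```

Here `d1` lands first, `d2` lands last, and `d5`, the weakest, ends up in the middle where attention is thinnest.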

Off-topic content hurts reasoning, too. One benchmark on reasoning degradation injected irrelevant passages into test problems and scaled them from 1 to 15. Grok-3-Beta's step accuracy fell from 43% to 19%, and GPT-4.1's fell from 26% to 2% under the same conditions. All six models tested showed degradation. Passages that look credible but lack answers can compete with real signal in ways obviously off-topic content doesn't.

The practical implication is simple: you improve context quality by making every token count.

How retrieval quality determines context quality

Making every token count is a retrieval problem. Retrieval is the upstream supplier of what the model reasons over: if that layer sends the wrong documents, prompting alone usually can't recover cleanly downstream. In one evaluation, retrieval changes alone cut measured hallucinations by a third, from 18.5 to 12.2, with no changes to the model or prompts. A separate study on retrieval quality makes the same point: the generator depends on access to relevant evidence, so retrieval often sets the ceiling on RAG performance.

Precision over volume: more context isn't better

As more passages are retrieved, recall rises while precision falls, the usual retrieval trade-off. The relevant information is somewhere in the pile, but the model can't reliably extract it. Stronger dense retrievers make this worse: they surface semantically close but factually wrong passages, called hard negatives, which damage reasoning more than obviously irrelevant ones. In one study of this effect, a better retriever without a precision layer made the problem worse.

Re-ranking is the fix. A cross-encoder re-orders candidate documents by relevance before they enter the context window: retrieve broadly for recall, then re-rank tightly for precision.
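The retrieve-then-re-rank pattern can be sketched in a few lines. The scorer here is an assumption: `overlap_score` is a toy term-overlap function standing in for a real cross-encoder so the example runs without model downloads. In practice `score_pair` would call a trained model that scores each (query, document) pair.

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Score every (query, doc) pair and keep the top_k highest-scoring docs."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query terms found in doc."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0


# Stage 1 (broad recall) is assumed to have produced these candidates.
candidates = [
    "refund policy for damaged items",
    "shipping times for europe",
    "how to request a refund for damaged goods",
]

top = rerank("refund for damaged items", candidates, overlap_score, top_k=2)
print(top)
```

The first stage can afford to over-fetch because only the `top_k` survivors of the second stage reach the context window.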

Hybrid search closes retrieval gaps

Pure vector and pure keyword retrieval each miss what the other catches. Dense retrieval captures semantic meaning but can miss exact terminology. Best Matching 25 (BM25) matches keywords precisely but misses paraphrases. Hybrid search combines both using Reciprocal Rank Fusion (RRF), which merges the two result lists by rank position rather than raw score, sidestepping their incompatible scoring scales.
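RRF itself is short enough to sketch directly. Each document earns 1 / (k + rank) from every list it appears in, and the sums decide the fused order. The constant k = 60 is the commonly used default, not a value from this article, and the document IDs below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of doc IDs into one best-first list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank is used, so BM25 and cosine scales never mix.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_results = ["a", "b", "c"]   # hypothetical keyword ranking
dense_results = ["c", "a", "d"]  # hypothetical vector ranking

print(reciprocal_rank_fusion([bm25_results, dense_results]))
```

Documents that both retrievers rank highly ("a" and "c" here) rise to the top, while documents found by only one retriever still make the list.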

A benchmark across 23,088 queries over 7,318 financial documents measured the impact. BM25 alone outperformed dense retrieval at Recall@5 (0.644 vs. 0.587) in that benchmark, challenging the assumption that semantic search always dominates. Hybrid plus re-ranking measured 0.816, a 39.0% improvement over dense retrieval alone. Mean reciprocal rank (MRR)@3 rose from 0.433 with hybrid RRF to 0.605 once re-ranking was added.

Metadata filtering scopes the search space

Metadata filtering helps before ranking even starts: it narrows the candidate pool ahead of vector search, so embedding comparison runs against fewer documents. Restricting search to engineering documents when an engineering user asks about deployment processes produces a smaller, more relevant set.

Freshness keeps stale context out

Even on-topic, precisely ranked context breaks decisions when it's stale, and those failures produce confident wrong answers rather than visible errors: a compliance breach, a failed transaction, a recommendation for a discontinued product. Vector retrieval has no built-in recency signal. Similarity scoring is independent of when a document was written, so a shipping policy from two years ago can outrank the current one.

Freshness is better treated as a design constraint: timestamp metadata for time-bounded retrieval, re-embedding triggers when source documents change, and freshness service-level agreements (SLAs) for features serving real-time decisions.
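Two of those constraints, time-bounded retrieval and re-embedding triggers, can be sketched with the standard library alone. The field names (`updated_at`, `content_hash`), the 90-day cutoff, and the sample policies are assumptions for illustration, not a prescribed schema.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fresh_only(docs, max_age, now=None):
    """Drop documents whose last update is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [doc for doc in docs if now - doc["updated_at"] <= max_age]


def needs_reembedding(doc, current_text):
    """True when the source text no longer matches what was embedded."""
    return content_hash(current_text) != doc["content_hash"]


now = datetime(2026, 6, 7, tzinfo=timezone.utc)
docs = [
    {"id": "shipping-v2", "updated_at": now - timedelta(days=10),
     "content_hash": content_hash("Ships in 2 days.")},
    {"id": "shipping-v1", "updated_at": now - timedelta(days=730),
     "content_hash": content_hash("Ships in 5 days.")},
]

recent = fresh_only(docs, max_age=timedelta(days=90), now=now)
print([doc["id"] for doc in recent])
print(needs_reembedding(docs[0], "Ships in 1 day."))
```

The two-year-old policy is excluded before similarity scoring runs, and a changed source text flags the stored embedding for a refresh.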

Together, precision-focused re-ranking, hybrid retrieval, metadata filtering, and freshness constraints shape whether context quality holds up downstream. Everything that reaches the model flows through this layer.

Where a real-time context engine fits

Context problems usually trace back to the retrieval layer, so it has to be fast and current enough to trust. It works best when it provides three properties to keep context clean: speed, freshness, and precision. That means returning results fast enough for real-time agent interactions, staying synchronized with source data as it changes, and combining vector similarity with metadata filtering and hybrid search to surface the documents that actually answer the query.

Redis Iris is designed around those three properties. It's Redis' real-time context engine, combining retrieval, memory, caching, and data integration on an in-memory core that supports sub-millisecond latency for many core operations. On the retrieval side, Context Retriever gives agents structured paths through business data, and the underlying Redis Query Engine supports hybrid search that combines BM25 full-text scoring with vector retrieval through score fusion, plus metadata filtering for scoping results by time, category, or access level. In a billion-vector benchmark, Redis reported 90% precision at roughly 200 ms median latency on top-100 nearest-neighbor retrieval with 50 concurrent queries.

Iris covers the freshness and memory layers too. Redis Data Integration continuously syncs updates from source databases into the context layer, addressing the stale-context problem at its source. Redis Agent Memory supports short-term working memory and long-term recall across sessions, and Redis LangCache handles semantic caching by recognizing when queries carry similar meaning and serving cached responses instead of making redundant LLM calls.
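The general idea behind semantic caching can be sketched as a similarity lookup over stored query embeddings. This is a generic illustration under stated assumptions, not the Redis LangCache API: the `SemanticCache` class, the 0.9 threshold, and the two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a cached response when a new query embedding is close enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached response)

    def put(self, query_vec, response):
        self.entries.append((query_vec, response))

    def get(self, query_vec):
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:
            score = cosine(query_vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold counts as a miss, so the caller invokes the LLM.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "Our return window is 30 days.")

print(cache.get([0.99, 0.05]))  # near-duplicate query
print(cache.get([0.0, 1.0]))    # unrelated query
```

The threshold is the main design choice: set it too low and the cache serves answers to questions nobody asked, which is its own form of context poisoning.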

Many teams end up managing separate systems for vectors, caching, and memory. Iris brings these together, reducing architectural complexity that can create context quality gaps in the first place. In the 2025 Stack Overflow survey, Redis ranked as the most-used data management tool for AI agents, with 43% adoption.

Good context starts upstream

Bigger context windows won't fix bad context. What matters is whether the model gets information that's relevant, current, internally consistent, and worth the tokens it consumes. If your system keeps surfacing stale documents, conflicting versions, or low-signal filler, the model has to reason on a shaky foundation. Fast, fresh retrieval keeps that foundation clean, and Redis Iris fits the job: one context engine combining hybrid retrieval, agent memory, semantic caching, and real-time data integration.

If you're building AI agents or RAG pipelines and context quality is on your mind, try Redis free to see how Iris works with your data. Or talk to our team about designing your context layer for precision, freshness, and speed from the start.
