Token efficiency: getting more signal into the context window

July 01, 2026 · 10 minute read

Jim Allen Wallace

You've probably hit this counterintuitive moment: you give your model more context to work with, expecting better answers, and the answers get worse. More tokens were supposed to mean more information, more grounding, fewer hallucinations. Instead, your agent starts repeating itself, picking the wrong tool, or confidently making things up.

This isn't a fluke. It reflects how transformer attention can behave as context grows, and it's an important thing to understand when you're building retrieval-augmented generation (RAG) or agentic systems. Token efficiency, getting the highest-signal tokens into the window and keeping the low-signal ones out, often matters more than raw context length.

This guide covers why more tokens can hurt reasoning, where low-signal tokens come from, how to select for high-signal context, and why the infrastructure behind retrieval shapes the whole thing.

Why more tokens can mean worse reasoning

LLMs don't read your context evenly. They pay more attention to what sits at the start and end of the window, and less to whatever's stuck in the middle. So as you cram more tokens in, the odds that the model actually uses the right ones drop.

The classic example is the "lost in the middle" effect: a U-shaped performance curve where accuracy tanks when the answer is buried in the middle of the context. GPT-3.5-Turbo's multi-document question answering accuracy dropped by more than 20% when the relevant document moved to the middle. At its lowest point, the model performed worse than if you'd given it no documents at all.

Position isn't the only problem. Sheer length hurts too. In one 2025 test, accuracy dropped substantially compared to the short-context case even with perfect retrieval, and the drop held even when the filler was just whitespace instead of natural language. Blank space in the window was enough to degrade the answer.

The gap between advertised and effective context length is wide. The NoLiMa benchmark tested 13 models with 128K-token support, and 11 of 13 dropped below 50% of their short-context baselines at 32K tokens. Treat the effective window as a fraction of what the spec sheet promises.

The reason is baked into the architecture. Transformers create n² pairwise relationships across n tokens, so attention gets stretched thin as the window fills. The model has an attention budget, and every token you add draws it down.
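To put rough numbers on that growth, here is a minimal sketch. The function names are illustrative, and the "uniform share" is a deliberate simplification: real attention weights are learned and far from uniform, so treat it only as a picture of how thin an evenly spread budget would get.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs that full self-attention scores for n tokens."""
    return n_tokens * n_tokens


def uniform_share(n_tokens: int) -> float:
    """Weight each token would get if one query spread its attention evenly.

    Simplifying assumption for illustration only; real attention is not uniform.
    """
    return 1.0 / n_tokens


for n in (1_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):>17,} pairs, "
          f"uniform share {uniform_share(n):.6%}")
```

Going from 1K to 128K tokens multiplies the token count by 128 but the number of pairs by 128², which is the quadratic growth the paragraph above describes.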

Context rot & the named failure modes

This slow degradation has a name: context rot. Models get less reliable as the window fills, especially when distractors crowd out the relevant information. More context isn't a neutral addition. It changes how the model reasons. Underneath context rot sit four specific failure modes worth knowing by name, since they show up constantly in agentic systems.

Context poisoning: A bad context keeps getting referenced, so the same error compounds across turns.
Context distraction: The window grows so long the model over-focuses on it and ignores what it learned in training. In agents, this shows up as repeating past actions instead of taking the next step.
Context confusion: Superfluous information drags the response down, often because every available tool definition got loaded instead of just the task-relevant ones.
Context clash: New information conflicts with what's already in the prompt, so the model either picks one source or hallucinates a synthesis of both.

These aren't edge cases. If you're running multi-turn agents with tools, you'll likely hit at least one of them, and knowing the vocabulary helps you diagnose what went wrong.

Where low-signal tokens come from

Context rot happens because most of what fills a context window never earns its place there. High-signal tokens are the ones that directly help the model answer the question in front of it: the specific passage that contains the answer, the tool definition for the tool it actually needs, the user's current instruction. Low-signal tokens are everything else, competing for attention without contributing to the answer: stale conversation turns, unrelated tool schemas, boilerplate, near-duplicate chunks, and filler text.

Every part of an LLM interaction, the system prompt, retrieved documents, conversation history, and tool output, competes for space in one shared token budget. A few predictable sources produce most of the noise:

Poorly chunked RAG content: Oversized chunks bundle too much text, making it harder for vector search to pinpoint the relevant section and forcing you to pull more chunks to answer a query.
Conversation history: In multi-turn apps, the chat log grows turn over turn and can pollute the query sent to your vector database, pulling back less relevant documents.
Tool outputs: Tools that return raw data instead of high-signal information eat the budget quickly. Common culprits are greedy file reads that pull in an entire file when only a few lines are needed, hundreds of Model Context Protocol (MCP) tool schemas loaded at once, and intermediate results that accumulate across steps.
Verbose system prompts: Boilerplate instructions and redundant formatting displace useful signal before the agent has done anything.

Spot the source of the noise, and you know where to make cuts.
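One way to spot the source is to audit the window before sending it. The sketch below is a rough illustration, not a production tool: the part names and sample strings are hypothetical, and it approximates tokens by counting whitespace-separated words, which will differ from what a real tokenizer reports.

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: whitespace-separated words.

    Assumption for illustration; a real tokenizer gives different counts.
    """
    return len(text.split())


def budget_report(parts: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (source, tokens, share_of_total), largest consumer first."""
    counts = {name: approx_tokens(text) for name, text in parts.items()}
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    rows = [(name, n, n / total) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical prompt assembly for one agent step.
prompt_parts = {
    "system_prompt": "You are a helpful assistant. " * 20,
    "tool_schemas": "tool name description parameters " * 150,
    "history": "user said something assistant replied " * 60,
    "retrieved_chunks": "the passage that answers the question " * 10,
}

for source, tokens, share in budget_report(prompt_parts):
    print(f"{source:<18}{tokens:>6} tokens  {share:6.1%}")
```

In this made-up example the tool schemas dominate the window while the retrieved passage is the smallest slice, which is the kind of imbalance the list above warns about.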

What high-signal token selection looks like

Good context engineering means finding the smallest possible set of high-signal tokens that give the model the best shot at a good answer. A handful of techniques do most of the work.

Reranking

Reranking targets the position problem from earlier: instead of hoping the right chunk survives being buried in the middle of the window, it moves the most relevant chunks to the front before the context is assembled. A reranking model, usually a cross-encoder, scores each query-document pair for relevance, so you can retrieve broadly for recall and then keep only the best few matches instead of stuffing all of them into the window.

Mean Reciprocal Rank (MRR) measures how close to the top of the results list the first relevant chunk lands, averaged across every query in a test set: a score near 1.0 means the correct chunk shows up first almost every time, while a low score means it's often buried several results down.

On a benchmark over 1,200 filings from the Securities and Exchange Commission (SEC), cross-encoder reranking pushed Mean Reciprocal Rank from 0.160 to 0.750, meaning the correct chunk landed near the top of the results far more consistently after reranking than before. The trade-off is added latency, but it also means you can send the model fewer chunks overall, since the ones you do send are more likely to be the ones it actually needs.
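The retrieve-then-rerank flow and the Mean Reciprocal Rank metric can be sketched in a few lines. This is a toy under stated assumptions: the documents and query are invented, and `overlap_score` is a simple word-overlap stand-in for a real cross-encoder, which would be a trained model scoring each query-document pair.

```python
from typing import Callable, Sequence


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         relevant: Sequence[str]) -> float:
    """Average of 1/rank of the first relevant id per query (0 if absent)."""
    total = 0.0
    for ranked_ids, gold in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(rankings)


def rerank(query: str, docs: dict[str, str],
           score: Callable[[str, str], float], keep: int) -> list[str]:
    """Score every (query, doc) pair, then keep the top `keep` doc ids."""
    ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
    return ranked[:keep]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the doc."""
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)


# Hypothetical corpus and first-stage retriever order.
docs = {
    "a": "quarterly revenue grew on cloud demand",
    "b": "the board approved a share buyback",
    "c": "risk factors include currency exposure and revenue concentration",
}
query = "currency exposure risk factors"
first_stage = ["a", "b", "c"]
reranked = rerank(query, docs, overlap_score, keep=2)

print("MRR before:", mean_reciprocal_rank([first_stage], ["c"]))
print("MRR after: ", mean_reciprocal_rank([reranked], ["c"]))
```

Note that `keep=2` also shrinks what reaches the model: the reranker both reorders and trims, which is where the token savings come from.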

Hybrid search with metadata filtering

Hybrid search catches what pure vector search misses, and that matters for token efficiency because a missed match forces you to either widen the search and pull in more, lower-signal chunks or hand the model incomplete context. It combines dense vector search for semantic meaning with sparse keyword search for exact term matching, usually merged with Reciprocal Rank Fusion. Vector search alone can miss rare terms, IDs, codes, and proper nouns that carry weight in medical, legal, and financial domains, and missing them often means retrieving several extra candidate chunks just to compensate. Layering in metadata filters by date, source, or document type trims outdated or untrusted material before it ever reaches the model, so the tokens you spend go toward relevant content instead of chunks the model has to sift through and discard.
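Reciprocal Rank Fusion itself is small enough to sketch. In the example below, each ranked list contributes 1 / (k + rank) to a document's fused score; k = 60 is a commonly used default, not a value taken from this article. The document ids, metadata fields, and the choice to filter after fusion are all illustrative assumptions (many systems push filters into the query instead).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a doc's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def filter_by_metadata(doc_ids: list[str], metadata: dict[str, dict],
                       min_year: int, allowed_sources: set[str]) -> list[str]:
    """Keep only docs that are recent enough and come from a trusted source."""
    return [
        d for d in doc_ids
        if metadata[d]["year"] >= min_year
        and metadata[d]["source"] in allowed_sources
    ]


# Hypothetical results from a dense (vector) and a sparse (keyword) retriever.
dense = ["d1", "d2", "d3"]
sparse = ["d4", "d2", "d1"]
metadata = {
    "d1": {"year": 2019, "source": "filings"},
    "d2": {"year": 2024, "source": "filings"},
    "d3": {"year": 2024, "source": "forum"},
    "d4": {"year": 2025, "source": "filings"},
}

fused = reciprocal_rank_fusion([dense, sparse])
context_ids = filter_by_metadata(fused, metadata, 2023, {"filings"})
print(fused, context_ids)
```

Documents that both retrievers agree on rise to the top, while the filter drops the stale and untrusted ones before any tokens are spent on them.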

Context compression

Prompt compression shrinks the context itself, stripping low-information tokens out before they ever reach the model, so the window holds more signal per token instead of more raw text. The LLMLingua family from Microsoft Research uses a small model to score token perplexity and drop the tokens carrying the least information. In one benchmark, LLMLingua reported a 20x compression ratio with only 1.5% performance loss on reasoning tasks. Moderate compression sometimes improves performance by abstracting the important information instead of dragging along everything verbatim. It's the same principle behind reranking and hybrid search: fewer, better-chosen tokens beat a bigger pile of them.
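The core idea, dropping the tokens that carry the least information, can be illustrated with a toy. To be clear about the substitution: this is not LLMLingua and does not use a language model. It swaps in a unigram frequency table built from a tiny, made-up reference string and scores each token by its surprisal under that table, as a stand-in for the perplexity a small model would provide.

```python
import math
from collections import Counter


def token_surprisal(token: str, counts: Counter, total: int) -> float:
    """-log p(token) under an add-one-smoothed unigram model."""
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return -math.log((counts[token] + 1) / (total + vocab))


def compress(text: str, reference: str, keep_ratio: float) -> str:
    """Keep the highest-surprisal tokens, preserving their original order."""
    counts = Counter(reference.lower().split())
    total = sum(counts.values())
    tokens = text.split()
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank positions by surprisal, highest first; ties keep earlier tokens.
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: -token_surprisal(tokens[i].lower(), counts, total),
    )
    kept_positions = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in kept_positions)


# Hypothetical reference text of common words; a real system would rely on
# a language model rather than a frequency table like this.
reference = "the of a is to and in that it the is and in it the"
text = "the invoice total is 4200 dollars and it is due in march"

print(compress(text, reference, keep_ratio=0.5))
```

Common function words score low and get dropped first, while the rare, content-bearing tokens survive. A model-based scorer makes the same kind of decision with far better judgment about which tokens are predictable from context.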

Pruning & ordering

Where you place tokens matters almost as much as which ones you include. Since the model favors the start and end of the window, pruning irrelevant context and ordering what remains for maximum attention both improve results. Applying RAG to tool descriptions, so you fetch only the relevant tools, can improve tool selection accuracy. A small-to-big pattern, retrieving precise chunks then expanding to surrounding context, can improve answer quality with little added latency.

One pattern ties these together: place key instructions at the beginning and end, and cut entire conversational turns rather than trimming mid-message. Mid-message trimming can destroy semantic coherence because the model loses the surrounding meaning. Positioning helps, but it doesn't erase length-based degradation, so selection still does the heavy lifting.
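Both halves of that pattern can be sketched briefly. The helper names are hypothetical, the chunk labels and chat turns are invented, and the default token counter is a whitespace word count standing in for a real tokenizer.

```python
from typing import Callable


def order_for_attention(chunks_best_first: list[str]) -> list[str]:
    """Alternate ranked chunks between the front and back of the window,
    so the strongest sit at the edges and the weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def trim_history(turns: list[str], budget: int,
                 count_tokens: Callable[[str], int] = lambda t: len(t.split()),
                 ) -> list[str]:
    """Drop whole turns, oldest first, until the rest fits the budget.

    Turns are never cut mid-message: a turn is either kept or removed.
    """
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return kept[::-1]  # restore chronological order


ranked_chunks = ["best", "second", "third", "fourth", "fifth"]
print(order_for_attention(ranked_chunks))

turns = [
    "user: hello there",
    "assistant: hi how can I help",
    "user: summarize the Q3 filing",
]
print(trim_history(turns, budget=12))
```

The ordering helper puts the top-ranked chunk first and the runner-up last, and the trimming helper keeps the most recent turns intact. Neither removes the cost of a long window, so both work best after selection has already cut the candidate set down.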

Semantic caching: saving tokens before they're spent

The most token-efficient call is often the one you never make. Semantic caching stores LLM responses indexed by vector embeddings of the queries that produced them. When a new query arrives, its vector embedding is compared against the cached ones, and if similarity clears a threshold, the cached response comes back without invoking the LLM at all. "What's the weather like today?" and "How's the weather right now?" resolve to the same entry.

This saves both input and output tokens, which sets it apart from prompt caching that only trims the input side. In high-repetition workloads, semantic caching has cut LLM inference costs by up to 73% without code changes. Hit rates are stronger for workloads with semantic repetition like support FAQs and documentation queries, and lower for highly creative or unique prompts. In one Redis example, vector search adds under 100ms compared with an LLM call that can take 1–5 seconds in that workload, though latency depends on index size, query pattern, and deployment.
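The lookup logic can be sketched in plain Python. This is a minimal illustration, not the Redis LangCache API: the `SemanticCache` class, its 0.9 threshold, and the three-dimensional vectors are all hypothetical, the vectors are hand-written stand-ins for what an embedding model would produce, and the linear scan stands in for a real vector index.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Stores (query embedding, response) pairs and serves near matches."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, query_vec: Sequence[float], response: str) -> None:
        self._entries.append((list(query_vec), response))

    def get(self, query_vec: Sequence[float]) -> Optional[str]:
        """Return the closest cached response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for vec, response in self._entries:  # linear scan, for clarity only
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


# Hand-written vectors standing in for real query embeddings.
weather_today = [0.90, 0.10, 0.00]  # "What's the weather like today?"
weather_now = [0.85, 0.20, 0.05]    # "How's the weather right now?"
refund_policy = [0.00, 0.10, 0.95]  # an unrelated question

cache = SemanticCache(threshold=0.9)
cache.put(weather_today, "cached weather answer")

print(cache.get(weather_now))    # near match: served from the cache
print(cache.get(refund_policy))  # no match: caller would invoke the LLM
```

The threshold is the main tuning knob. Set it too low and unrelated queries get someone else's answer; set it too high and paraphrases miss the cache and pay for a full LLM call.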

The infrastructure question behind signal quality

Retrieval speed decides whether the right tokens actually arrive in time. Slow retrieval is what pushes teams to pre-load huge context blocks in the first place, and that's exactly where the noise comes from.

The math adds up fast. With a 40–60ms penalty per hop across ten steps, slow retrieval stacks into roughly 400–600ms of added latency. Across dozens of retrievals or repeated agent steps, that becomes seconds.
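The arithmetic is simple enough to check directly. The sketch below assumes every hop runs sequentially and pays the same penalty; the 50-hop case is a hypothetical stand-in for "dozens of retrievals".

```python
def added_latency_ms(per_hop_ms: float, hops: int) -> float:
    """Total added latency when each sequential hop pays the same penalty."""
    return per_hop_ms * hops


for hops in (10, 50):
    low = added_latency_ms(40, hops)
    high = added_latency_ms(60, hops)
    print(f"{hops} hops: {low:.0f}-{high:.0f} ms added")
```

Ten hops land in the 400–600ms range from the text, and fifty hops at the same per-hop cost reach two to three seconds.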

Redis is a real-time data platform with a memory-first architecture that supports vector search and semantic caching with sub-millisecond latency for many core in-memory operations used in AI workloads. Fast retrieval lets your app fetch smaller, higher-signal context at query time instead of pre-loading a giant prompt. Redis Iris, a real-time context engine built on Redis, packages that retrieval into four components: Redis Context Retriever for navigable access to business data, Redis Agent Memory for short- and long-term context across sessions, Redis LangCache for semantic caching, and Redis Data Integration for keeping that data current. Instead of stitching together a separate vector database, cache, and memory service, your app pulls fresh, high-signal context from one system.

For AI workloads, that consolidation is the point: the right tokens, retrieved fast enough to matter.

Signal beats size

The pattern across every section is the same: fewer, better-chosen tokens outperform a full window of noise. More tokens stretch attention thin, surface the named failure modes, and quietly degrade reasoning. Selecting for signal through reranking, hybrid search, compression, pruning, and caching improves output quality and cuts cost at the same time.

Infrastructure is what makes that selection possible. If retrieval is slow, teams fall back on pre-loading giant prompts and hoping the model sorts it out. Redis Iris keeps vector search, semantic caching, and agent memory on the hot path, so your app can pull the right tokens at query time instead of over-stuffing the window.

If you're building RAG or agentic systems and watching your context budget evaporate, it's worth seeing how this works with your own workload. Try Redis Iris to experiment with vector search and semantic caching, or talk to our team about improving your AI infrastructure.
