Five context-engineering principles that survive production

June 14, 2026 · 9 minute read

Jim Allen Wallace

Your agent passes every test you throw at it. The demo is clean, the eval suite is green, and you ship. Then a week later, the support tickets start: agents looping, costs spiking, responses that confidently cite facts that were never true. The model didn't change. The context around it did.

Context engineering is the set of strategies for curating and maintaining the optimal set of tokens, or information, during LLM inference. That sounds tidy until you're running thousands of sessions a day and watching token bills climb for reasons your prompt-level thinking never accounted for. The principles that keep context reliable under real traffic don't show up in demos. They show up in production, when sessions get long, tool outputs pile up, and retrieval starts pulling in noise. This article covers five of them: how to budget cost across a whole run, where to filter tool outputs, how to layer static and dynamic context, why retrieval is a budget decision, and why evals often miss context failures.

Principle 1: Treat the whole run as the cost unit, not the turn

You optimize a prompt, shave 30% off its token count, and your bill barely moves. That's the first sign you're budgeting the wrong unit. The cost that matters isn't the turn, it's the whole run.

Here's why. Each API call is billed separately, but every turn re-sends the full conversation history as input. Per-turn input therefore grows roughly linearly with the number of turns, and cumulative run cost, which is the sum of those growing inputs, grows roughly quadratically: one cost model shows that doubling conversation length roughly quadruples the spend. As sessions lengthen, the main cost driver becomes re-processing history, not producing new tokens.
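A quick back-of-the-envelope sketch makes the shape of that curve concrete. It assumes every turn adds the same number of tokens to the history (a simplification), counts input tokens only, and ignores any provider-side prompt-caching discounts. Under those assumptions, a run of n turns at t tokens per turn bills t × n(n+1)/2 input tokens in total. The function name and the 500-token figure are illustrative, not taken from any specific cost model.

```python
def run_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed across a run when every call re-sends the full history."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # this turn's new messages join the history
        total += history            # the whole history is billed as input on this call
    return total


short_run = run_input_tokens(turns=10, tokens_per_turn=500)
long_run = run_input_tokens(turns=20, tokens_per_turn=500)
ratio = round(long_run / short_run, 2)
print(short_run, long_run, ratio)  # 27500 105000 3.82
```

Doubling the run from 10 to 20 turns multiplies billed input by about 3.8 here, and the ratio approaches 4 as runs get longer.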

Tool outputs make this worse. File reads, terminal output, and search results from earlier turns get re-sent on every call too. By turn 10, your context can carry thousands of tokens of old tool output the agent may never reference again, and you pay to reprocess all of it each turn.

Not all of that cost is unavoidable. Some of it is repeated work: the same question, or a paraphrase of it, hitting the model again. Semantic caching targets that slice. It recognizes when a new query means the same thing as a cached one, even if the wording differs, and returns the stored response instead of calling the model again. Run as a managed layer, this can take a real bite out of inference spend on repetitive traffic. Redis LangCache, the semantic caching service in the Redis Iris context engine, reported up to 73% lower inference costs in Redis benchmarks on high-repetition workloads.

Caching has limits worth knowing up front. For multi-turn agents, cache keys and similarity thresholds need to account for session state, because follow-up questions can otherwise be falsely matched to earlier, unrelated prompts. It works best for repeated standalone intents or normalized sub-requests.

Principle 2: Filter tool outputs at ingestion, not after they bloat the window

Once you're thinking about the whole run, the next question is where the tokens go. More often than not, the answer is tool outputs. Search and API responses are often the biggest single cost in an agent's context.

APIs predate LLMs and were built for deterministic systems, so they often return more than the model needs. A single API call can dump a wall of nested JSON when the agent needed three fields. Multiply that across a multi-step loop and you've buried the signal under accumulated noise.

The instinct is to compress later: summarize the conversation once it gets long. That's the wrong order of operations. Filtering at ingestion is more effective, because you only keep what's needed for the next step. Compressing after the fact risks losing details that mattered and does little to stop bad data from entering the context at all. Filtering at ingestion also reduces the risk of context poisoning, where an early error gets preserved as truth and compounds across later reasoning. Production teams use a few concrete patterns:

Offload large responses. Store oversized tool responses outside the prompt and substitute a pointer plus a short preview.
Truncate noisy content. Bound tool outputs with a token policy and trim long text before it enters the next turn.
Clear stale results. Once a tool result sits deep in the message history, the agent rarely needs the raw result again, so replace it with a short placeholder.

These share the same logic: cut at the source, keep the working set lean. When trimming, preserving whole turns usually keeps coherence better than cutting through the middle of a message.
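The patterns above can be sketched in a few small helpers. This is one possible shape, not a reference implementation: the 4-characters-per-token estimate is a rough heuristic, the `tool-output://` pointer scheme and the in-memory `offloaded` dict are hypothetical stand-ins for real external storage, and the message format (dicts with `role` and `content`) is assumed.

```python
import json


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def project_fields(payload: dict, fields: list[str]) -> dict:
    """Keep only the fields the next step needs."""
    return {k: payload[k] for k in fields if k in payload}


class ToolOutputFilter:
    def __init__(self, max_tokens: int = 200, preview_chars: int = 120):
        self.max_tokens = max_tokens
        self.preview_chars = preview_chars
        self.offloaded: dict[str, str] = {}  # pointer -> full text (stand-in for a store)

    def ingest(self, call_id: str, text: str) -> str:
        """Return what should enter the context: the text itself, or a pointer plus preview."""
        if estimate_tokens(text) <= self.max_tokens:
            return text
        pointer = f"tool-output://{call_id}"
        self.offloaded[pointer] = text
        return f"[stored at {pointer}] preview: {text[:self.preview_chars]}..."


def clear_stale(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent `keep_last` tool results with a placeholder."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": "[tool result cleared]"} if i in stale else m
        for i, m in enumerate(messages)
    ]


raw = {"id": 7, "status": "shipped", "eta": "2026-06-20", "history": ["x"] * 500}
slim = project_fields(raw, ["id", "status", "eta"])
tool_filter = ToolOutputFilter(max_tokens=50)
context_entry = tool_filter.ingest("call-1", json.dumps(raw))  # too large: pointer + preview
```

Note that `clear_stale` replaces whole messages rather than cutting inside one, which matches the point about preserving whole turns.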

Principle 3: Keep static and dynamic context in separate layers

Filtering tells you what to keep. Layering tells you how to organize it. A lot of agent reliability problems trace back to one thing: instructions, retrieved data, and history all get collapsed into a single undifferentiated blob.

The cleaner pattern separates context into distinct layers. The simplest split is static versus dynamic. Static context is the fixed, cacheable part: system instructions, persona, and behavioral rules. Dynamic context is the task-dependent part: retrieved knowledge, conversation history, and working state. The practical motivation is straightforward. Some context rarely changes and benefits from caching, while other context needs to be fresh every turn.

Production teams often go further and run three layers: static, dynamic, and session. When everything is one string, you can't tell whether a failure came from a bad instruction, a stale retrieval, or a corrupted history. Separate layers let you update and debug each one independently.

Stateful agent architectures can model this directly. Short-term, thread-scoped memory tracks the ongoing conversation within a session, while long-term memory survives across sessions. The key move is that storing something and sending it to the model are separate decisions. The agent can hold everything it's tracking in a structured state, then expose only the relevant part to the LLM on a given turn, keeping the rest available without paying to put it in the prompt.
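A small sketch of that separation, with storage on one side and prompt assembly on the other. The `AgentState` dataclass, the section labels, and the `build_prompt` parameters are all hypothetical names chosen for illustration; a real agent framework will have its own state model. The static layer is placed first on the assumption that a stable prefix is what makes it cacheable.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Everything the agent tracks. Being stored here does not mean being sent."""
    static_rules: str                                       # static layer: instructions, persona
    history: list[str] = field(default_factory=list)        # session layer: conversation so far
    working: dict[str, str] = field(default_factory=dict)   # dynamic layer: working state
    retrieved: list[str] = field(default_factory=list)      # dynamic layer: fresh each turn


def build_prompt(state: AgentState, user_msg: str,
                 history_turns: int = 2, working_keys: tuple[str, ...] = ()) -> str:
    """Expose only the relevant slice of each layer, in clearly labeled sections."""
    exposed_state = "\n".join(
        f"{k}={state.working[k]}" for k in working_keys if k in state.working
    )
    recent = state.history[-history_turns:] if history_turns else []
    sections = [
        ("STATIC", state.static_rules),   # stable prefix goes first
        ("RETRIEVED", "\n".join(state.retrieved)),
        ("STATE", exposed_state),
        ("HISTORY", "\n".join(recent)),
        ("USER", user_msg),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)


state = AgentState(
    static_rules="Be concise.",
    history=["u: hi", "a: hello", "u: order?", "a: which one?"],
    working={"order_id": "A1", "internal_note": "escalated twice"},
    retrieved=["Order A1 shipped."],
)
prompt = build_prompt(state, "Where is it?", history_turns=2, working_keys=("order_id",))
```

Because each layer lands in its own labeled section, a bad answer can be traced to the section that fed it, and `internal_note` stays in state without costing prompt tokens.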

This is also where a unified data layer helps. The same store can hold conversation history, structured working state, and long-term semantic memory through vector search, keeping the layers distinct without scattering them across separate systems to sync. Redis Iris is built around this idea, bringing agent memory, context retrieval, and data integration into a single platform so the layers stay coherent as the agent runs.

Principle 4: Retrieval is a budget decision, not a fetch-everything default

Layering organizes what you have. Retrieval decides what comes in. And the default instinct, pull as much relevant context as possible, is backwards. More retrieved documents don't guarantee better answers. They can make answers worse.

The foundational finding here is the "lost in the middle" problem: model performance is highest when relevant information sits at the start or end of the input, and degrades when the model has to use information buried in the middle. Stuffing more into the window pushes the good stuff toward that dead zone. And the reason extra documents backfire is counterintuitive: related-but-wrong documents do more damage than obviously irrelevant ones, because they look plausible. One study of RAG retrieval found that adding documents semantically close to the query, but not actually answering it, degraded accuracy more than adding random noise. Stronger retrievers can make this worse by surfacing hard negatives, semantically close but factually wrong passages.

The cost side reinforces the relevance side. Longer windows cost more to process, and accuracy can drop as the window fills. You can pay more to get less reliable answers. That's the opposite of what fetching everything is supposed to buy you.

A structural mitigation is to retrieve broadly for recall, then re-rank tightly for precision. A cross-encoder re-scores candidate documents against the query before they enter the context window, so low-scoring hard negatives can be cut before they reach the model. In a retrieval-augmented generation (RAG) pipeline, your app embeds the query, the vector store returns a broad candidate set, a re-ranker trims it to the few that matter, and only those reach the model.
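The budget step at the end of that pipeline might look like the sketch below. It assumes the broad recall stage has already produced `candidates`, and it takes the re-ranker as a plain callable so any cross-encoder can be plugged in. The `select_context` function, the `min_score` cutoff, and the hard-coded scores in the usage example are hypothetical; they stand in for real model output.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def select_context(query: str, candidates: list[str],
                   rerank: Callable[[str, str], float],
                   token_budget: int, min_score: float = 0.5) -> list[str]:
    """Re-rank a broad candidate set, then admit documents until the budget is spent."""
    ranked = sorted(((rerank(query, doc), doc) for doc in candidates),
                    key=lambda pair: pair[0], reverse=True)
    chosen, used = [], 0
    for score, doc in ranked:
        if score < min_score:
            break  # everything after this point scores lower still
        cost = estimate_tokens(doc)
        if used + cost > token_budget:
            continue  # skip what doesn't fit; a smaller document may still fit
        chosen.append(doc)
        used += cost
    return chosen


answer = "Annual plans can be refunded within 30 days of purchase."
hard_negative = "Our annual report covers plans for new offices."
off_topic = "Monthly plans renew on the first of each month."

# Pretend cross-encoder scores, hard-coded for illustration only.
fake_scores = {answer: 0.92, hard_negative: 0.41, off_topic: 0.15}

picked = select_context("refund window for annual plans",
                        [hard_negative, off_topic, answer],
                        rerank=lambda q, d: fake_scores[d],
                        token_budget=50)
```

Only the passage that actually answers the question is admitted. The hard negative shares vocabulary with the query, which is why a score cutoff on top of the token budget matters.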

The retrieval layer fetches and ranks; the app decides the budget. Redis supports vector search, full-text search, and hybrid retrieval with metadata filtering. For vector search, Redis supports FLAT and Hierarchical Navigable Small World (HNSW) index types for exact and approximate nearest-neighbor search, plus SVS-VAMANA indexes for larger-scale vector workloads, depending on dataset size, index configuration, and latency budget.

Principle 5: Context failures can be invisible to standard evals

Here's the principle that ties the rest together: even if you nail cost, filtering, layering, and retrieval, your test suite probably won't tell you when they start to fail. Context degradation is easy to miss in standard evaluations.

The reason is structural. Most LLM benchmarks focus on short-form questions and rarely test performance when context is near capacity. Your agent passes because the test is short. Then it degrades over a long session, and nothing in your eval suite catches it. Some agent behaviors only emerge over multiple turns: the agent maintains context correctly for five turns and fails on turn six.

The numbers on this gap are stark. One evaluation covered web-agent sessions spanning 25,000 to 150,000 tokens, where success rates dropped from 40–50% to under 10%, with agents getting stuck in loops and losing track of the original task. A separate long-context evaluation raised context length from 32K to 256K tokens and watched accuracy fall from 29% to 3%.

This is context rot, the performance degradation that happens when models process increasingly long inputs. What makes it dangerous is that it's silent. The model keeps running without error while its accuracy declines, losing track of guidelines and variables across long sessions. No exception, no failed assertion, just slowly worse output.

Catching it means watching context quality in production, not just in pre-deployment tests. Most observability setups instrument spans and traces but don't track session-level quality, which is exactly where degradation hides. A better pattern is to evaluate at the session level across dimensions like coherence, context retention, and goal achievement, and to feed real production traces back into your offline test sets.
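One lightweight way to start is to bucket production turns by how full the context was and compare success rates across buckets. The sketch below assumes each turn has already been labeled pass or fail by some grader (a human review, a rubric, or an LLM judge), which is the hard part and is out of scope here. The function names, the bucket size, and the 20-point drop threshold are illustrative choices.

```python
from collections import defaultdict


def success_by_context_bucket(turns: list[tuple[int, bool]],
                              bucket_size: int = 32_000) -> dict[int, float]:
    """Map each context-size bucket (by its lower bound) to the success rate inside it."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for context_tokens, ok in turns:
        buckets[context_tokens // bucket_size].append(ok)
    return {b * bucket_size: sum(v) / len(v) for b, v in sorted(buckets.items())}


def degraded(rates: dict[int, float], max_drop: float = 0.2) -> bool:
    """Flag when any larger-context bucket falls well below the smallest one."""
    values = list(rates.values())
    return len(values) > 1 and values[0] - min(values[1:]) > max_drop


# (context tokens at that turn, did the turn meet its goal?) from production traces
turns = [(1_000, True), (2_000, True), (40_000, True),
         (50_000, False), (100_000, False), (120_000, False)]

rates = success_by_context_bucket(turns)
alert = degraded(rates)
```

A short-context eval suite would only ever populate the first bucket, which is why it stays green while the later buckets decay.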

Context engineering is mostly a data problem

The five principles share a root cause. When an agent that passed every demo starts looping, hallucinating, or burning tokens in production, the model is rarely what changed. The context around it was mismanaged. That makes these data problems, not model problems, and a bigger context window doesn't solve them. It just raises the ceiling on how much you can get wrong before it shows.

That makes the data layer the place reliability is won or lost. Enterprise data is fragmented across dozens of systems, and stitching together a separate vector database, cache, memory store, and session store tends to recreate the very sprawl that context engineering is trying to solve. Redis Iris is a context engine for AI agents that consolidates those pieces into one place that sits between an agent and the data it needs to act. The memory-first architecture behind Redis' caching speed keeps retrieval and memory inside an agent's latency budget, and Redis Flex tiers RAM and SSD so longer memories don't carry RAM-only costs.

Try Redis Iris to see how context retrieval, agent memory, and data integration hold up on your workload, or talk to our team about building production agents.
