Resource Center

Events & webinars Blog Videos Glossary Resources Architecture Diagrams Demo Center

Resource Center

Events & webinars Blog Videos Glossary Resources Architecture Diagrams Demo Center

Back to blog

Blog

Why a bigger context window won't fix your agent's memory

June 17, 20269 minute read

Jim Allen Wallace

Summarize with AI

Context windows have grown fast. Models that once capped out at a few thousand tokens now advertise hundreds of thousands, and the natural assumption was that the agent memory problem would shrink as the window grew. Stuff more into the prompt, the thinking went, and your agent stops forgetting.

It doesn't work out that way. A bigger prompt holds more context for a single call, but agents don't fail because one call ran out of room. They fail on continuity, the ability to carry what they learned in one session into the next. And a larger window does nothing for that, because a context window and agent memory are two different things. One is a per-call input buffer the model reads fresh every time. The other is a system you build around the model so it can recall what happened yesterday, last week, or three sessions ago. Stretching the first doesn't give you the second.

This article covers why bigger windows delay the wall without removing it, why re-sending full history is a recurring cost, and what an actual persistence layer looks like.

Does a bigger context window fix agent memory?

A bigger window raises the token ceiling, but the model's attention starts to slip long before you reach it. Running a session longer is real and useful. The catch is that the hard limit usually isn't what breaks first, quality degrades inside the window you already have.

Language models don't read their context evenly. Performance tends to be highest when the relevant information sits at the very beginning or end of the input, and it drops when the model has to pull from the middle. This is the well-documented "lost in the middle" effect, and it bites harder than it sounds. In one set of multi-document question-answering tests, when the answer was buried mid-context, GPT-3.5-Turbo scored worse than its closed-book baseline, the score it gets when you hand it no documents at all. Feeding the model the right answer in the wrong position left it worse off than feeding it nothing.

Position is only half of it. Length degrades quality on its own, and the gap between a model's claimed window and its usable one is bigger than most teams expect. The Long-Context Evaluation Beyond Literal Matching (NoLiMa) benchmark tested 13 models that all claim support for at least 128K tokens. At 32K tokens, 11 of 13 models scored below half of what they managed on the same task with a short input. Most held up only to roughly 2K tokens once the task required real reasoning rather than spotting a matching keyword.

Put those two together, position and length, and you get the failure mode AI and machine learning teams increasingly call context rot: the model gets less reliable as the input grows. Raw context size matters less than context quality, and the rot sets in well before you hit the token limit. A bigger window buys you room, not reliability.

Redis Iris serves agent context in milliseconds

Redis Iris connects memory, live data, and retrieval in one place.

Try Redis Iris

Why do long context windows cost more on every call?

That attention problem gets more painful when you look at the bill, because the same long context you're stuffing in is something you pay for repeatedly. LLM API pricing runs per token, input and output priced separately, so the more you put in the window, the more each call costs. What turns that into a recurring bill rather than a one-time one is how the APIs handle state.

Standard LLM API calls are stateless. The provider doesn't hold onto your conversation for the next request, so what feels like an ongoing chat in the UI is really the full history being re-sent every message. Your app code re-attaches the prior turns to each call, and the model reads the whole thread from scratch again. Per-call input grows roughly linearly as the conversation gets longer, but the cumulative cost grows closer to quadratically, because every turn pays to re-send every turn before it.

Agentic workloads are especially exposed. Agents make far more calls than a human chatting, and each call drags along tool outputs and accumulated history, so input tokens dwarf output tokens and input price becomes the main cost driver. Every turn, the agent pays to re-read and re-understand the project state it already worked out the turn before, and the longer the task runs, the more that rebuilding costs.

This is where caching changes the math. When a query matches one the system has effectively answered before, you serve the stored answer straight back and skip the model call entirely, so you pay nothing to regenerate it. Semantic caching matches on meaning rather than exact wording, so "What are the features of Product A?" and "Tell me about Product A's features" resolve to the same cached response. The catch is that a cache needs a cache invalidation strategy so stale answers don't get served after the underlying data changes. When a meaningful share of queries repeat known intent, serving them from cache beats paying to re-send the same context to the model over and over.

Context window vs. agent memory: what's the difference?

Caching trims repeated calls, but it doesn't reach the deeper issue: even a perfectly managed window forgets everything once the session ends. That's not a bug to patch with more tokens. It's how these systems are built.

Statelessness is a deliberate property of how these models are served, not an oversight. Each inference call is independent, with nothing retained from the last one, which is part of how providers serve enormous request volumes without per-user bookkeeping. A standard context window holds the system prompt, chat history, retrieved documents, and tool results for one call, then discards all of it. There's no cross-session write path inside the window. Preferences set on turn 1, constraints added on turn 3, and decisions made on turn 7 are all state the app has to persist. None of that data carries into model memory when the session closes.

A useful way to frame it: the context window is working space, not the store of record. It's fast and immediately available to the model on every call, but it's scoped to the current task and clears when that task ends. Anything that needs to outlive the session has to live in a layer built to hold it.

This is why a bigger window doesn't help much with continuity. Large windows help within a session. They don't create continuity across sessions. Even a million-token window fills eventually, and when it does, older context falls off the edge. The model doesn't forget gracefully; it can lose older context completely. It helps to separate the tiers so the distinction is concrete:

Short-term memory: the current conversation, held directly in the context window for the duration of a session.
Working memory: active material the model is manipulating right now, the context window plus whatever relevant pieces you've retrieved into it.
Long-term memory: information that spans sessions, stored outside the model in an external database or vector store and retrieved when relevant, often for retrieval-augmented generation patterns.

Only the first two live in the window. The third one typically lives in a separate system outside the model, and that's the part a bigger window can't give you.

Build agents that remember, not agents that guess

Redis Iris gives every agent fresh context and long-term memory.

Try Redis Iris

How to build a persistent memory layer for agents

Memory isn't a switch you flip on the model; it's a system you stand up beside it. Whatever the agent appears to "know" on a given call is just what your code loaded into the window before the call ran. Take that loading step away and the agent is back to a blank slate, every time.

That maps onto the tiers above: the window covers short-term and working memory, and the long-term tier has to live somewhere the model can't reach on its own. Because the model carries no state between calls, the application layer has to fetch the relevant long-term context and place it in the window at inference time. That fetch-and-inject step is the layer, and it's yours to run.

Why the memory layer outlasts your model choice

The big practical advantage of treating persistence as its own layer is that it's independent of which model or window you pick. The frameworks reflect this directly. LangGraph designed checkpointing to be database-agnostic checkpointing from the start. Even product-level memory usually isn't a special model; it works by saving facts outside the model and injecting relevant saved context into the window before a session. Swap your model, change your window size, and the persistence layer keeps working.

Selective retrieval beats dumping full history

Dumping all past interactions back into the prompt to fake persistence creates the same two problems. It grows the context, increases cost, and reintroduces the attention degradation we already covered. Selective retrieval is usually the better pattern. One memory-augmented system reported 81.95% benchmark accuracy using only 1,294 tokens per query, around 5% of full context. Another memory-augmented system used roughly 1,764 tokens per conversation against 26,031 for full-context, with 91% lower 95th-percentile (p95) latency. Memory-augmented systems can outperform raw full-context approaches even when the conversation fits inside the window, partly because of the lost-in-the-middle effect.

Agent memory is an infrastructure problem, not a window size

A bigger window doesn't make the model better at using the space it has, it doesn't stop you paying to re-send history on every turn, and it doesn't survive the end of a session. Memory is a property of the system around the model, not the model itself, and that holds no matter which model, window size, or framework you run.

Once memory becomes an infrastructure problem, the question shifts to what kind of infrastructure can keep up. Agent memory has to be fetched on every call without slowing the response, which is exactly the workload Redis is built for. Its in-memory architecture is designed for low-latency cache and state access, and Redis Iris packages that foundation as a context engine, the layer that sits between an agent and the data it needs and feeds the right context at the right time. Iris brings the pieces of the memory stack into one place: short-term memory through in-memory data structures, long-term memory through vector search, operational state through hashes and JSON, and semantic caching to cut repeat inference.

Redis Agent Memory, one of the services in Iris, handles this directly with a two-tier design: short-term interaction history alongside persistent long-term memory for preferences and prior sessions. It builds on the open-source Redis Agent Memory Server, with Model Context Protocol (MCP) integration and configurable LLM provider support. If your app team already runs Redis for caching or sessions, the memory layer for your agents may be closer than you think. If you're building agents that need to remember across sessions, start with the persistence layer, not a bigger prompt.

Try Redis free to build it against your own workload, or talk to our team about designing the context layer for your agents.

Fresh context, every call

Redis Iris keeps agent data current so answers stay accurate.

Try Redis Iris

Get started with Redis today

Speak to a Redis expert and learn more about enterprise-grade Redis today.

Try for free Talk to sales