
December 12, 2025 · 17 minute read
Jim Allen Wallace

From prompting to programming: A guide to context engineering for AI agents

As you start building AI agents, you move past simple chatbots and into a world of complex, automated workflows. But you may face challenges like inconsistent performance and frustrating failures. Your agent hallucinates, forgets key details mid-conversation, or gets stuck in a loop. You might assume the AI model is the problem, but a different culprit is often to blame: the context.

This article breaks down context engineering, explaining what it is, why it represents a critical shift from simple prompt engineering, and how you can implement it to build more reliable, capable, and efficient AI systems.

Key takeaways

  • Shift from Prompting to Programming: Building reliable AI agents requires moving beyond crafting single prompts (prompt engineering) to systematically designing the entire information flow—including instructions, history, retrieved data, and tools—that an LLM uses to perform a task. This discipline, called context engineering, is the key to creating consistent and capable AI systems.
  • Performance and Cost Are Core Goals: Effective context engineering is critical for reducing latency and cost. By strategically retrieving, summarizing, and caching information (e.g., semantic caching) before it reaches the LLM, you minimize expensive token usage and ensure the real-time performance necessary for interactive AI applications. For users, a slow agent is a broken agent.
  • Complex Agents Require a Dedicated Memory Architecture: LLMs are stateless, so creating sophisticated agents that can handle multi-step tasks requires a dedicated memory layer. This involves managing short-term memory (the current conversation) and long-term memory (user preferences across sessions) to make the agent feel context-aware and personalized.
  • A Unified Memory Layer Outperforms a Fragmented One: While it's common to stitch together separate databases for vector search (RAG), caching, and user data, this approach creates complexity and performance bottlenecks. A unified, multi-model data platform that handles all of these functions simplifies the architecture, eliminates data synchronization issues, and reduces latency by keeping all necessary context in one place.

What is context engineering?

Context engineering is the discipline of systematically designing and managing the information that you provide to a Large Language Model (LLM) to help it accomplish a task. It treats the flow of information not as a single instruction, but as a dynamic system that runs before the model ever generates a response.

While often compared to prompt engineering, the two are fundamentally different. Prompt engineering focuses on crafting the perfect turn of phrase for a single input, while context engineering is about building the entire information architecture that ensures a model performs well consistently. Prompt engineering is what you do inside the context window; context engineering is how you decide what fills that window in the first place. As AI leader Andrej Karpathy puts it, context engineering is "the delicate art and science of filling the context window with just the right information for the next step."

This "right information" consists of several key components that form the complete picture for the AI model:

  • System Prompt and Instructions: The high-level rules, persona, and guidelines that define the agent's behavior.
  • User Prompt: The immediate question or task from the user.
  • State/History (Short-Term Memory): The record of the current conversation, allowing the agent to track recent interactions.
  • Retrieved Information (RAG): External, up-to-date knowledge pulled from documents, databases, or APIs to provide fresh, relevant information.
  • Long-Term Memory: Persistent knowledge about a user's preferences or past interactions, stored across sessions.
  • Few-Shot Examples: This involves providing concrete, in-line examples of the desired behavior directly in the context. While the System Prompt provides general rules, few-shot examples show the model exactly what a good input-output pair looks like. This is one of the most effective ways to guide model behavior for specific, repeatable tasks.
    • For instance, if you want the model to extract specific data from text, you would provide 2-3 examples of sample text and the correctly extracted JSON object.
  • Available Tools: The definitions and schemas for any functions or APIs the agent can call, like searching a database or sending an email.
  • Structured Output: A defined format, like a JSON object, that constrains the model's response.
  • Real-time Environmental Context: This is dynamic, real-world information that the model wouldn't otherwise have access to. This grounds the agent in the user's immediate environment, which is essential for many practical tasks.
    • The most common examples include the current date and time, the user's physical location, or other sensor data (e.g., the current weather).
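To make this concrete, here is a minimal sketch of how an agent might assemble these components into a single request. The helper names and the generic chat-message format are illustrative, not a specific provider's API:

```python
# A sketch of context assembly: system rules, few-shot examples, retrieved knowledge,
# memory, real-time context, and the user prompt combined into one message list.
from datetime import datetime, timezone

def build_context(system_prompt, few_shot_examples, history, retrieved_docs,
                  long_term_memory, user_prompt):
    """Assemble the context components into a message list for a chat-style LLM."""
    messages = [{"role": "system", "content": system_prompt}]

    # Few-shot examples: concrete input/output pairs that demonstrate the task.
    for example in few_shot_examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})

    # Long-term memory and retrieved knowledge (RAG), injected as grounding context.
    grounding = "\n".join(
        ["Known user preferences: " + "; ".join(long_term_memory)]
        + ["Reference: " + doc for doc in retrieved_docs]
    )
    messages.append({"role": "system", "content": grounding})

    # Short-term memory: the running conversation for this session.
    messages.extend(history)

    # Real-time environmental context plus the user's immediate request.
    now = datetime.now(timezone.utc).isoformat()
    messages.append({"role": "user", "content": f"[current time: {now}]\n{user_prompt}"})
    return messages
```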

Why context engineering is the next frontier in AI

The shift toward context engineering is driven by a practical reality: as AI models become more powerful, the root of failure is often the information they are given, not their ability to reason. As Google DeepMind’s Philipp Schmid stated, "Most agent failures are not model failures anymore, they are context failures." Adopting a systematic approach to context provides several key benefits.

Benefit 1: Improved reliability and accuracy

An LLM's context window is finite. Some models have a limit of a few thousand tokens (roughly 1,500-3,000 words), while others can handle over a million. If you simply stuff this window with raw data, critical details can get lost or ignored. This leads to "context poisoning," where irrelevant information distracts the model and causes it to hallucinate or make mistakes. Effective context engineering ensures the model receives the right information at the right time, grounding its responses in facts and dramatically improving the accuracy of AI applications.

Benefit 2: Reduced latency and cost

Every token processed by an LLM incurs both a financial cost and a latency cost; larger context windows are more expensive and slower. Manually crafting a perfect, massive prompt for every API call is inefficient. Context engineering introduces programmatic solutions, like semantic caching, where responses to similar queries are reused instead of being regenerated. By strategically retrieving and compressing information before it hits the model, you reduce token usage, lower operational costs, and deliver faster responses to users.

Benefit 3: Enhanced capabilities for complex workflows

Simple, single-shot prompts are not enough for building sophisticated agent systems. To create agentic AI that can handle multi-step tasks (like planning a trip, analyzing data, and then generating a report) you need a system that can manage state, orchestrate tool calls, and learn from its actions. This is where context engineering shines. It provides the architectural foundation for these complex workflows, enabling developers to build powerful and autonomous AI agents.

Benefit 4: Enhanced personalization and adaptability

Beyond just being accurate, AI interactions need to be relevant to the individual user. Context engineering is the mechanism for achieving true personalization. By creating a system that can dynamically pull in a user's history, stated preferences, or even their real-time behavior within a session, the AI can move from giving generic answers to providing bespoke, highly relevant assistance.

  • Before Context Engineering: A travel bot asks you where you want to go every single time.
  • After Context Engineering: The same travel bot remembers you prefer window seats, have a specific airline loyalty program, and recently searched for trips to Italy. Its first suggestion is, "Looking for a flight to Florence with Alitalia? I found a great window seat option that earns you double miles."

This adaptive capability is crucial for creating applications that feel genuinely helpful and build long-term user engagement.

The real-time imperative: Why performance is a pillar of context engineering

For interactive AI agents, slow is the same as broken. A context engineering strategy that delivers accurate answers but takes several seconds to do so will fail in the real world, as users abandon unresponsive chatbots and clunky AI applications. This is why performance (specifically latency) must be treated as a primary goal of context engineering, not just a secondary benefit. For any agent that interacts with a human, the "time-to-first-token" and overall response speed are as critical to the user experience as the quality of the final answer.

This forces an AI engineer to think in terms of a "context budget" that includes not just token count but also time. Your agent has only milliseconds to assemble the right context before the user perceives a lag. Staying within this budget requires moving beyond simple prompt engineering and into programmatic performance optimization. Techniques that were once considered back-end optimizations are now central to building effective agent systems:

  • Pre-computation: Generating and storing embeddings for RAG ahead of time, so retrieval is instantaneous.
  • Efficient Retrieval: Using a high-performance vector database to execute semantic searches in sub-millisecond timeframes.
  • Strategic Caching: Storing and quickly reusing not just exact-match queries but also the results of semantically similar prompts to avoid costly, high-latency LLM calls.
  • Context Compression: Before sending retrieved information to the LLM, use a programmatic technique (or a smaller, faster LLM) to remove redundant or less relevant sentences. This is more surgical than summarization.
    • For example, if you retrieve five documents, a compression step might extract only the top N most relevant sentences from each, significantly reducing the token count without losing the factual essence needed to answer the query.
  • Parallelized Data Fetching: Instead of sequentially retrieving information (1. get user history, 2. search vector DB, 3. call an external API), execute these data-gathering steps in parallel. The total time to assemble the context is then dictated by the single longest retrieval task, not the sum of all of them. This is a classic back-end optimization that is critical for minimizing latency in agentic systems.
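As a small illustration of that last point, here is a sketch of parallelized context fetching with Python's asyncio. The three fetchers are placeholders for your real data sources; the point is that total assembly time tracks the slowest source rather than the sum of all three:

```python
# A sketch of parallelized data fetching for context assembly.
import asyncio

async def fetch_user_history(user_id):
    await asyncio.sleep(0.05)          # stand-in for a cache or database lookup
    return ["prefers window seats"]

async def search_vector_db(query):
    await asyncio.sleep(0.08)          # stand-in for a semantic search call
    return ["Return policy: items can be returned within 30 days."]

async def call_external_api(user_id):
    await asyncio.sleep(0.12)          # stand-in for a slow third-party API
    return {"order_status": "delivered"}

async def assemble_context(user_id, query):
    # Launch all three retrievals concurrently instead of one after another.
    history, docs, api_data = await asyncio.gather(
        fetch_user_history(user_id),
        search_vector_db(query),
        call_external_api(user_id),
    )
    return {"history": history, "documents": docs, "external": api_data}

context = asyncio.run(assemble_context("user-123", "return the blue shirt"))
```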

Poor context engineering isn't just inaccurate; it's slow and expensive. By making low-latency data retrieval and assembly a core pillar of your design, you ensure your AI agents are not only intelligent but also usable in demanding, real-time use cases.

Core strategies for effective context engineering

Moving from theory to practice involves several key strategies that form the foundation of a robust context engineering system.

1. Master Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a cornerstone of modern AI systems. Instead of relying solely on its training data, the model first retrieves relevant context from an external knowledge source. This is where a vector database becomes essential. By converting documents, user data, and other information into numerical representations (embeddings), a vector database allows an agent to perform a semantic search: finding information based on conceptual meaning, not just keyword matches. This ensures the LLM receives highly relevant, timely, and factual data, making its responses more accurate and trustworthy.
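As a rough sketch of the retrieval step, the example below scores a query against pre-computed document embeddings with cosine similarity. The embed() function is a placeholder for whatever embedding model you actually use, and a real system would delegate the search to a vector database rather than scanning in memory:

```python
# A minimal sketch of semantic retrieval for RAG: embed the query, score it against
# pre-computed document embeddings, and return the top matches.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in practice, call an embedding model (e.g. a sentence-transformer).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

documents = [
    "Our return policy allows refunds within 30 days of delivery.",
    "Q3 revenue grew 12% year over year.",
    "Support hours are 9am-5pm Eastern, Monday through Friday.",
]
doc_vectors = np.stack([embed(d) for d in documents])  # pre-computed ahead of time

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query and every document embedding.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

print(retrieve("How long do I have to return an item?"))
```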

2. Build a dedicated memory layer

LLMs are inherently stateless; they don't remember past conversations without help. A dedicated memory layer solves this by providing the agent with both short-term and long-term memory.

  • Short-term memory keeps track of the current session's conversation history, intermediate thoughts, and the outputs of tool calls. This is often managed within the agent's state.
  • Long-term memory persists information across sessions, such as user preferences, key facts, or summaries of past interactions. This is typically implemented using a fast, external data store like an in-memory database or a knowledge graph.

This memory architecture is what makes an AI feel context-aware and personalized, transforming a generic chatbot into a helpful assistant.
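A minimal sketch of this two-tier design is shown below, with a plain dictionary standing in for the external long-term store (in production this would be a database or key-value store that survives across sessions):

```python
# A sketch of a two-tier agent memory layer: short-term memory lives with the
# session; long-term memory persists across sessions in an external store.
class AgentMemory:
    def __init__(self, user_id, long_term_store):
        self.user_id = user_id
        self.long_term = long_term_store           # persists across sessions
        self.short_term = []                       # current conversation only

    def remember_turn(self, role, content):
        self.short_term.append({"role": role, "content": content})

    def remember_preference(self, key, value):
        self.long_term.setdefault(self.user_id, {})[key] = value

    def build_context(self):
        prefs = self.long_term.get(self.user_id, {})
        return {"preferences": prefs, "history": self.short_term[-10:]}  # keep it bounded

store = {}                                         # stand-in for a persistent store
memory = AgentMemory("user-123", store)
memory.remember_preference("seat", "window")
memory.remember_turn("user", "Find me a flight to Florence.")
print(memory.build_context())
```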

3. Orchestrate agentic workflows

For complex tasks, you often need to go beyond a single request-response cycle. This is where orchestration frameworks for building stateful agent systems come into play. Open-source Python libraries like LangGraph allow developers to define agent behaviors as a graph, where nodes represent functions (like calling a tool or an LLM) and edges control the flow of logic. This structure makes it possible to build cyclical, multi-step workflows where the agent can reason, act, and even self-correct. The state of the workflow (including conversation history and tool outputs) is managed explicitly, giving developers fine-grained control over the agent's execution. This is particularly powerful for creating multi-agent systems, where different agents with specialized contexts can collaborate on a complex problem.
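Here is a minimal sketch of that pattern, assuming a recent LangGraph release; the node functions are placeholders for real retrieval and LLM calls, and the conditional edge is what makes the self-correcting loop possible:

```python
# A sketch of a cyclical agent workflow: retrieve -> generate -> (retry or end).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    documents: list[str]
    answer: str
    attempts: int

def retrieve(state: AgentState) -> dict:
    # Placeholder for a vector-database lookup.
    return {"documents": ["Q3 revenue grew 12% year over year."],
            "attempts": state.get("attempts", 0) + 1}

def generate(state: AgentState) -> dict:
    # Placeholder for the LLM call that drafts an answer from retrieved context.
    return {"answer": f"Based on {len(state['documents'])} documents: ..."}

def should_retry(state: AgentState) -> str:
    # Self-correction: loop back to retrieval until there is an answer or attempts run out.
    return END if state["answer"] or state["attempts"] >= 3 else "retrieve"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_retry)
app = graph.compile()
result = app.invoke({"question": "How did revenue do in Q3?",
                     "documents": [], "answer": "", "attempts": 0})
```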

4. Practice intelligent context management

More context is not always better. A key part of context engineering is the iterative process of pruning, summarizing, and shaping the context before sending it to the model.

  • Pruning: Techniques exist to programmatically remove irrelevant sentences or documents from the retrieved context, reducing noise and token count.
  • Summarization: As conversations grow, you can use an LLM to summarize older parts of the history, preserving key information while freeing up space in the context window.
  • Isolating Context: For highly complex tasks, it's often better to break the problem down and have a specialized agent focus on a smaller, isolated context rather than giving one agent all the information at once.

This discipline of context management prevents the model from getting distracted and helps it focus on the most important information for the task at hand.
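As one illustration, here is a sketch of the summarization step, where summarize() stands in for a call to a smaller, faster LLM and the token count is a rough word-based approximation rather than a real tokenizer:

```python
# A sketch of history management: keep recent turns verbatim, fold older turns
# into a single summary message once the history exceeds a token budget.
def count_tokens(messages):
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder: in practice, ask an LLM to condense these turns into a few sentences.
    return "Summary of earlier conversation: " + " / ".join(m["content"][:40] for m in messages)

def manage_history(history, budget=2000, keep_recent=6):
    if count_tokens(history) <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(older)}] + recent
```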

5. Implement a Robust Evaluation Framework

Building a context-driven AI system is an iterative process. Without a way to measure performance, it's impossible to improve. A dedicated evaluation framework is crucial for understanding how well your context engineering strategies are working. This involves:

  • Component-Level Testing: Is your retrieval system (RAG) actually fetching relevant documents? Metrics like hit rate and Mean Reciprocal Rank (MRR) can measure retrieval quality independent of the final LLM output.
  • End-to-End Evaluation: Does the agent's final response accurately reflect the provided context? Frameworks like RAGAs (Retrieval-Augmented Generation Assessment) measure the faithfulness (how well the answer is supported by the context) and answer relevancy.
  • Task-Based Metrics: For agentic workflows, the ultimate measure is success. Did the agent successfully book the flight, summarize the report, or complete its assigned multi-step task?
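For the component-level metrics above, here is a minimal sketch of hit rate and MRR over a hand-labeled test set; the field names are illustrative:

```python
# A sketch of component-level retrieval evaluation: hit rate and Mean Reciprocal Rank.
# Each case pairs a query's ranked retrieval results with the id of the relevant document.
def hit_rate(cases, k=5):
    hits = sum(1 for c in cases if c["relevant_id"] in c["retrieved_ids"][:k])
    return hits / len(cases)

def mean_reciprocal_rank(cases):
    total = 0.0
    for c in cases:
        if c["relevant_id"] in c["retrieved_ids"]:
            rank = c["retrieved_ids"].index(c["relevant_id"]) + 1
            total += 1.0 / rank
    return total / len(cases)

cases = [
    {"retrieved_ids": ["doc-7", "doc-2", "doc-9"], "relevant_id": "doc-2"},
    {"retrieved_ids": ["doc-4", "doc-1", "doc-3"], "relevant_id": "doc-8"},
]
print(f"hit@5 = {hit_rate(cases):.2f}, MRR = {mean_reciprocal_rank(cases):.2f}")
```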

From simple retrieval to semantic caching

Semantic caching is an optimization layer that sits alongside or in front of a RAG pipeline. While RAG excels at finding and supplying relevant knowledge to an LLM, it still requires a full retrieval and generation cycle for every query. That means if a user asks the same conceptual question in five slightly different ways, the system repeats the entire process five times.

Standard caching fails here because it relies on exact matches; "Tell me about our Q3 revenue" and "What was our revenue in the third quarter?" are treated as two entirely separate requests.

Semantic caching solves this by using vector embeddings to capture the meaning of a query, not just its text. Here's how it works:

  1. When a query is received, it's converted into a vector embedding.
  2. The system performs a similarity search against a cache of previously answered queries and their embeddings.
  3. If a conceptually similar query is found within a predefined threshold, the cached response is served instantly, bypassing the expensive and time-consuming LLM call.
  4. If no similar query is found, the request proceeds to the LLM, and the new question and its response are cached for future use.
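A minimal sketch of that lookup logic is shown below, with embed() and call_llm() as placeholders for your embedding model and LLM provider, and a plain in-process list standing in for a real cache:

```python
# A sketch of a semantic cache: serve a stored answer when a new query is
# conceptually similar to one answered before; otherwise call the LLM and cache it.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))   # placeholder embedding
    return rng.random(384)

def call_llm(query: str) -> str:
    return f"LLM answer to: {query}"                          # placeholder generation

cache: list[tuple[np.ndarray, str]] = []                      # (query embedding, response)

def answer(query: str, threshold: float = 0.85) -> str:
    q = embed(query)
    for vec, response in cache:
        sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= threshold:          # conceptually similar query seen before
            return response           # serve the cached answer, skip the LLM call
    response = call_llm(query)
    cache.append((q, response))       # store for future lookups
    return response
```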

This approach offers two transformative benefits. First, it dramatically reduces costs and latency by minimizing redundant calls to models from providers like OpenAI or Anthropic. Second, it improves consistency, ensuring that questions with the same underlying intent receive the same vetted answer, which is crucial for building trust in enterprise use cases.

The unseen data plane of agent orchestration

As agentic AI moves from simple chains to complex graphs, the framework that defines the logic is only half the battle. Orchestration libraries like LangChain and LangGraph provide the "control plane": the set of rules and pathways that govern an agent's behavior. But for these workflows to function at scale, they need a "data plane": a high-speed, reliable infrastructure layer for managing state and messaging between steps.

This unseen data plane is responsible for the practical realities of running complex agent systems. Consider a multi-agent system where one agent retrieves data, a second analyzes it, and a third generates a report. If Agent A retrieves a customer's order history and Agent B needs to analyze purchase patterns, where does that intermediate data live while Agent B is working? If Agent B's analysis triggers Agent C to generate a personalized recommendation, how does Agent C know the analysis is complete and ready? And if Agent C crashes halfway through generating the recommendation, can the workflow resume from where it failed, or does everything restart from Agent A?

The data plane is what handles the handoff of information between them. It answers critical questions for any production system:

  • How are the intermediate results of an asynchronous tool call stored while the main agent waits?
  • How do multiple, distributed AI agents communicate with each other reliably and with low latency?
  • What happens if one agent in a multi-step workflow fails? Is the state persisted so the job can be resumed without starting over?

Answering these questions requires more than a Python script; it requires robust infrastructure. A high-performance data plane often relies on real-time technologies like streams or pub/sub messaging to handle asynchronous communication and an in-memory data store to manage state persistence for long-running tasks. Without this foundation, even the most elegantly designed agent logic will crumble under the complexity and concurrency of real-world demands.
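As a small illustration, the sketch below uses a stream for agent-to-agent handoff and a hash for persisted workflow state. The key names are illustrative, and it assumes redis-py with a running Redis instance:

```python
# A sketch of a simple data plane: persist intermediate results so a workflow can
# resume after a failure, and signal the next agent over a stream.
import json
import redis

r = redis.Redis(decode_responses=True)

def hand_off(workflow_id: str, step: str, payload: dict) -> None:
    # Persist the intermediate result so a later failure doesn't restart the whole job.
    r.hset(f"workflow:{workflow_id}",
           mapping={"last_step": step, step: json.dumps(payload)})
    # Notify downstream agents asynchronously instead of an in-process call.
    r.xadd("agent:handoffs", {"workflow": workflow_id, "completed_step": step})

def resume(workflow_id: str) -> dict:
    # On restart, reload whatever state has already been produced.
    return r.hgetall(f"workflow:{workflow_id}")

hand_off("wf-42", "order_history", {"orders": ["blue shirt"]})
print(resume("wf-42"))
```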

Architecting the agent's mind: Unified vs. fragmented memory

Building an effective memory layer for AI agents presents a fundamental architectural choice: should an agent's mind be a collection of disparate, specialized tools, or should it be a single, integrated system? Many teams start with a fragmented approach, stitching together separate technologies for each function: a vector database for RAG, a relational database for user profiles, and a separate cache for short-term conversation history.

While this seems logical, the fragmented approach introduces significant challenges that hinder performance and slow down development:

  • Data Synchronization Issues: Keeping data consistent across multiple databases is a classic engineering problem. When a user's preferences change, does that update propagate correctly from the relational store to the context being used in the LLM?
  • Performance Bottlenecks: Every time context has to be pulled from a different system (fetching conversation history from one, user data from another, and documents from a third) it introduces network latency. This "data-shipping" between services becomes a major bottleneck for real-time performance.
    • Example: Imagine a customer interacts with an AI support agent: "I want to return the blue shirt I bought last week."
      • In a Fragmented System:
        • The agent first queries a cache (e.g., Redis) to get the recent conversation history. (1st network hop)
        • It then queries a relational database (e.g., PostgreSQL) to identify the user and look up their recent orders. (2nd network hop)
        • Finding the order, it calls a shipping API to get the tracking status. (3rd network hop)
        • To answer the "return" part of the query, it might query a vector database containing the company's policy documents. (4th network hop)
        • The application logic must now stitch these four disparate pieces of information together before sending the context to the LLM. The cumulative latency makes the agent feel slow.
      • In a Unified System:
        • The agent makes a single call to the unified platform. A single query can retrieve the user's profile, their order history (stored as a document), and perform a vector search on the return policy documents. The data is co-located, eliminating multiple network hops and simplifying the logic dramatically. The resulting context is assembled instantly, making the agent feel responsive and intelligent.

That said, a fragmented approach has legitimate advantages. Teams can select specialized, best-in-class tools for each function: a vector database optimized purely for semantic search, a relational database for transactional integrity, a purpose-built cache for speed. For teams with existing infrastructure investments or specific performance requirements for individual components, this flexibility can be valuable. The trade-off is clear: engineering complexity, data synchronization overhead, and latency issues shift onto your team.

The alternative is a unified memory layer: a single, multi-model platform that can handle an agent's diverse memory needs in one low-latency environment. In this model, the same database can store and process conversation history as JSON objects, manage long-term memory of user preferences as hashes, and serve as a vector database for semantic search and RAG. This simplifies the architecture, eliminates data synchronization headaches, and boosts performance by keeping all relevant context in one place.
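A minimal sketch of that access pattern, with illustrative key names and redis-py against a single running instance: the profile and recent conversation come back in one round trip, and a semantic search over policy documents would run against the same store rather than a separate vector database.

```python
# A sketch of unified context assembly: co-located data, one network round trip.
import redis

r = redis.Redis(decode_responses=True)

def assemble_context(user_id: str) -> dict:
    pipe = r.pipeline()
    pipe.hgetall(f"user:{user_id}:profile")            # long-term preferences (hash)
    pipe.lrange(f"user:{user_id}:history", -10, -1)    # last 10 conversation turns
    profile, history = pipe.execute()                  # single network round trip
    # Conversation history could equally be stored as JSON, and a vector search over
    # policy documents would run against this same instance via its search capability.
    return {"profile": profile, "history": history}
```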

The future of reliable AI agents is context

Building powerful and reliable AI agents is less about finding a "magic prompt" and more about the engineering of context.

As AI systems are deployed in more critical workflows, context engineering will mature into its own discipline, with best practices, frameworks, and toolchains. As the underlying models become commodities, differentiation will come from the pipeline around them: retrieval systems, memory layers, orchestration frameworks, and data planes.

If prompting and model training defined this past era of AI, context engineering will define the next. This means mastering context engineering will become mission-critical for product managers, engineers, and technical leaders who want to build AI agents that are fast, accurate, efficient, and genuinely useful.


Get started with Redis today

Speak to a Redis expert and learn more about enterprise-grade Redis today.