What is prompt caching? LLM speed & cost guide

March 10, 2026 · 9 minute read
Jim Allen Wallace

If you're building with large language models (LLMs) in production, you've probably noticed two things: latency spikes that make your app feel sluggish, and token costs that climb faster than you expected. Most of these problems come down to redundant computation, and the right caching strategy can cut both latency and spend without changing your models.

Prompt caching stores the computational state from an LLM's attention layers so the model can skip redundant prefill work on repeated prompt prefixes. The result: lower time-to-first-token (TTFT) and cheaper input costs on every request that hits the cache for a shared prefix.

This guide covers how prompt caching works at the model layer, how it differs from regular and semantic caching, where each approach fits in your architecture, and how to combine them with Redis for maximum cost and latency reduction.

Why LLM apps get slow & expensive at scale

Every LLM request goes through two latency phases: time to first token (TTFT), which measures how long the model takes to start responding, and time to last token (TTLT), which captures the full generation time. Both get worse as your prompts get longer. A long system prompt increases TTFT because the model processes every token through its attention mechanism before producing any output. That "prefill" computation is expensive, and it runs on every single request.

Then there's the cost side. Across major providers, output tokens typically cost several times more than input tokens, with ratios ranging from 3x to 5x for standard models and up to 8x for premium or reasoning models. A 10,000-token system prompt repeated across 50,000 monthly conversations adds up fast, and that's before you count the output tokens you're paying a premium for.
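To make that concrete, here's a back-of-envelope sketch of the input-side spend in the example above. The $3-per-million-input-tokens price and the 90% cached-read discount are illustrative assumptions, not any specific provider's rates:

```python
# Back-of-envelope input-token cost for a repeated system prompt.
# Assumed (illustrative) pricing: $3.00 per 1M input tokens,
# with cached reads billed at 10% of the base input price.
PRICE_PER_M_INPUT = 3.00
CACHED_READ_MULTIPLIER = 0.10

system_prompt_tokens = 10_000
conversations_per_month = 50_000

def monthly_input_cost(cache_hit_rate: float) -> float:
    """Monthly input-token spend for the shared prefix alone."""
    total_tokens = system_prompt_tokens * conversations_per_month
    cached = total_tokens * cache_hit_rate
    uncached = total_tokens - cached
    billed = uncached + cached * CACHED_READ_MULTIPLIER
    return billed / 1_000_000 * PRICE_PER_M_INPUT

print(f"No caching:   ${monthly_input_cost(0.0):,.2f}")  # $1,500.00
print(f"90% hit rate: ${monthly_input_cost(0.9):,.2f}")  # $285.00
```

Even at these modest assumed prices, the shared prefix alone accounts for hundreds of dollars a month, and caching recovers most of it.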

At scale, these costs compound alongside operational complexity: more concurrent users, more state to manage, more systems to coordinate. The good news is that a layered caching strategy can address both the latency and cost problems. And it starts with understanding prompt caching.

What is prompt caching in LLMs?

When an LLM processes your prompt, it generates key-value (KV) cache entries in its attention layers—mathematical representations of the relationships between tokens. Normally, the model recomputes this KV cache on every request. Prompt caching stores it so the model can skip that computation on subsequent requests that share the same prefix. The model still generates a fresh response every time; it's the redundant prefill work that gets cut. This is a provider-managed feature built into the LLM API, not something you build yourself.

The main constraint is prefix matching. Prompt caching works by comparing the beginning of your current prompt against what's already cached. If the cached prefix and your new prompt are exactly identical (token-for-token) up to a certain point, the model reuses the cached computation for that portion and only processes new tokens from where the match ends. A single token change anywhere in the prefix breaks the match from that point forward.
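The prefix-matching rule can be illustrated with a toy token comparison. Real providers match at the KV-cache level, not over token lists like this, but the principle is the same:

```python
def shared_prefix_length(cached: list[str], new: list[str]) -> int:
    """Count how many leading tokens are identical in both sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["You", "are", "a", "helpful", "assistant", ".", "Summarize", "this", "report"]
new    = ["You", "are", "a", "helpful", "assistant", ".", "Summarize", "this", "email"]

print(shared_prefix_length(cached, new))  # 8 -- only the final token needs fresh prefill
```

Change a token near the start instead, and the shared prefix collapses to almost nothing, which is why stable content belongs at the front of the prompt.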

Major LLM providers each handle this differently. Anthropic offers both automatic caching and explicit cache_control markers, with cache reads priced at 0.1x the base input cost—a 90% discount. OpenAI's prompt caching is automatic on prompts over 1,024 tokens, with cached-input discounts that vary by model and go up to 90% on newer models. Optional parameters like prompt_cache_retention (for extended 24-hour caching) and prompt_cache_key (for routing control) are available for optimization. Google supports context caching through both the Gemini Developer API (Google AI Studio) and Vertex AI, with implicit caching enabled by default on Gemini 2.5 models. Cache discounts and implementation details vary by provider and model.
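As a sketch of what Anthropic's explicit markers look like, here is the shape of a Messages API request body with a `cache_control` block on the system prompt. No network call is made here; the model id and prompt text are placeholders:

```python
# Shape of an Anthropic Messages API request with explicit prompt caching.
# This only builds the payload -- model id and prompt text are placeholders.
LONG_SYSTEM_PROMPT = "You are a support agent. Follow these policies. " * 400

payload = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks everything up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Only this part varies per request, so it sits after the cached prefix.
        {"role": "user", "content": "How do I reset my password?"}
    ],
}

print(payload["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Check your provider's documentation for current cache minimums, TTLs, and pricing before relying on a specific discount.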

How does prompt caching actually speed up LLM apps?

Once you know what prompt caching stores, the next question is what you get back: lower TTFT and cheaper input tokens. The performance gains scale with prompt length:

  • A 1,024-token prompt saw 7% TTFT improvement, while prompts over 150,000 tokens hit 67% faster TTFT. The longer your shared prefix, the bigger the payoff.
  • In one book-chat benchmark, a 100,000-token cached prompt reduced TTFT by ~79% and cached input token costs by 90%.
  • Anthropic's documentation claims up to 85% latency reduction for long prompts.
  • Bedrock preview materials cite similar directional numbers—up to 85% lower latency and up to 90% lower costs on supported models.

The takeaway across providers: prompt caching targets input-side computation. It reduces TTFT and cuts the cost of repeated prefixes, but you still pay full price for output tokens. The biggest savings come from long, stable prefixes that get reused across many requests. Some engineering teams treat cache hit rate like an uptime metric, declaring SEVs when it drops.

How is prompt caching different from regular & semantic caching?

Prompt caching is one of three caching layers you'll use in production. They operate at different levels of the LLM stack and are meant to work together, not replace each other.

  • Regular (exact-match) caching stores full LLM responses keyed by an exact string hash. If someone asks the identical question twice, word for word, you return the stored response instantly. Natural language rarely repeats exactly, though, so hit rates for user-facing apps tend to be low. This layer works best for templated or programmatic queries.
  • Semantic caching converts queries into vector embeddings (numerical representations of meaning) and compares them against cached vectors using cosine similarity. If the similarity exceeds a configured threshold, the cached response is returned without calling the LLM at all. "Tell me about our Q3 revenue" and "What was our revenue in the third quarter?" would hit the same cache entry, saving you the full cost of that LLM call.
  • Prompt caching operates at the model layer and doesn't bypass the LLM—you still pay for output tokens. What it cuts is the redundant prefill computation on shared input prefixes.

The key cost difference: semantic caching bypasses LLM calls entirely on cache hits, saving both input and output token costs. Prompt caching only reduces input-side costs. That makes semantic caching generally more cost-effective for workloads where users ask similar questions in different ways, while prompt caching helps more with genuinely novel queries that share a long prefix. Redis supports both exact-match and semantic caching with vector search, so you can run all three layers from a single platform.

Where should you use prompt caching in your LLM architecture?

Because prompt caching relies on prefix matching, it works best when you structure prompts with stable content first and variable content last. The more of your prefix that stays identical across requests, the higher your cache hit rate.

A common ordering that tends to maximize cache reuse:

  1. Tool/function definitions: Most stable, rarely change
  2. System prompt: Stable per deployment
  3. Reference documents: Stable per session or task
  4. Conversation history: Grows, but older turns stay fixed
  5. User query: Almost always changes, so it goes last

This ordering is one of the simplest ways to improve cache hit rate, and it's worth designing around early rather than retrofitting later.
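The ordering above amounts to a simple assembly rule: concatenate stable sections first, variable sections last. A minimal sketch (section names and contents are illustrative):

```python
# Minimal prompt builder following the stable-first ordering above.
def build_prompt(tool_defs: str, system: str, docs: str,
                 history: list[str], user_query: str) -> str:
    stable = "\n\n".join([tool_defs, system, docs])  # identical across requests
    turns = "\n".join(history)                       # older turns stay fixed
    return f"{stable}\n\n{turns}\n\n{user_query}"    # the query always changes, so it goes last

history = ["user: hi", "assistant: hello"]
p1 = build_prompt("TOOLS", "SYSTEM", "DOCS", history, "What is caching?")
p2 = build_prompt("TOOLS", "SYSTEM", "DOCS", history, "What is latency?")

# Both requests share everything up to the user query -- that whole
# prefix is eligible for cache reuse.
print(p2.startswith(p1[: -len("What is caching?")]))  # True
```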

RAG pipelines

Prompt caching tends to work well in retrieval-augmented generation (RAG) setups where multiple users query the same knowledge base. Caching the system instructions and retrieved document chunks means the model skips prefill on the shared context for each new question. The payoff is highest when users ask several questions about the same document. When retrieved chunks change with every query, though, the prefix changes too, and cache reuse drops.

Multi-turn chatbots

System instructions in chatbots often run to thousands of tokens of behavioral guidelines, and they stay the same across every turn. Caching that prefix and letting conversation history and user messages stay dynamic is one of the simpler wins. This is especially valuable in long conversations, where session costs can vary widely depending on cache hit rate and token usage.

Agentic systems

In long-horizon agentic systems, the system prompt is typically where teams see the most consistent caching benefits because it's both large and stable. More dynamic components like tool outputs and retrieved context tend to vary across runs, which can reduce cache reuse given the prefix-matching constraint. Caching the system prompt is still worth it; just don't expect the same hit rates you'd see in a chatbot with a fixed prefix.

Cache-breaking anti-patterns

Watch for subtle cache breakers: timestamps in system prompts ("Today is {{date}}"), session identifiers in static sections, user-specific information in the prompt header, and dynamic tool definitions that change per user. Even a capitalization change can wipe out thousands of tokens of cached computation, so it's worth auditing your prompts for anything that changes between requests in sections you expect to be stable.
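The timestamp anti-pattern is easy to demonstrate by measuring the shared prefix between two days' worth of requests (the system prompt here is a placeholder):

```python
from os.path import commonprefix

SYSTEM = "You are a support agent with these policies... " * 50  # long, stable

def bad_prompt(date: str, query: str) -> str:
    # Anti-pattern: the date at the top invalidates the entire cached prefix daily.
    return f"Today is {date}.\n{SYSTEM}\n{query}"

def good_prompt(date: str, query: str) -> str:
    # Date moved after the stable section; the long prefix stays byte-identical.
    return f"{SYSTEM}\nToday is {date}.\n{query}"

q = "How do I get a refund?"
bad_shared = len(commonprefix([bad_prompt("2026-03-10", q),
                               bad_prompt("2026-03-11", q)]))
good_shared = len(commonprefix([good_prompt("2026-03-10", q),
                                good_prompt("2026-03-11", q)]))

print(bad_shared, good_shared)  # a handful of characters vs. thousands
```

One misplaced template variable turns a multi-thousand-token cacheable prefix into a guaranteed cache miss.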

How to combine prompt caching with semantic caching

Once prompt caching is handling your shared prefixes, you can stack it with response-level caching to cover more of your traffic. Production systems that combine these layers into a caching hierarchy tend to get the broadest cost and latency coverage.

The layers stack like this: exact-match caching catches identical repeats, semantic caching catches paraphrased queries via vector similarity, and prompt caching optimizes the novel queries that still need the LLM. On cache hits, the first two layers bypass LLM calls entirely—the third reduces the cost of calls that have to happen. Together, they cover the full spectrum of query patterns.

Redis fits naturally across all three layers. Redis LangCache is a fully managed semantic caching service with integrated embedding generation, configurable similarity controls, and built-in cache hit rate monitoring. Teams that want more control can use RedisVL's SemanticCache, a self-managed Python library with distance threshold tuning and time-to-live (TTL)-based expiration. Redis also integrates with LangChain and LangGraph for vector storage and related AI workflows via its ecosystem integrations.

Teams typically start with a high similarity threshold and adjust based on their query patterns. Note that RedisVL's SemanticCache uses cosine distance (where lower = more similar), so a 0.95 cosine similarity translates to a 0.05 distance threshold. Higher similarity thresholds reduce false hits but lower cache reuse; lower thresholds catch more queries but risk serving incorrect responses. The right value depends on your domain and query distribution.
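The similarity-to-distance conversion is trivial but easy to get backwards, so it's worth pinning down:

```python
def similarity_to_distance(similarity: float) -> float:
    """Convert a cosine-similarity threshold (higher = more similar)
    to a cosine-distance threshold (lower = more similar)."""
    return 1.0 - similarity

print(f"{similarity_to_distance(0.95):.2f}")  # 0.05
```

With RedisVL, this value would be passed as the `distance_threshold` argument to `SemanticCache` (which requires a running Redis instance, so a live call isn't shown here).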

This layered approach tends to provide the most value for workloads with meaningful semantic overlap in queries—customer support, FAQ bots, and internal tools are good examples. For workloads with less repetition, the exact-match and prompt caching layers still deliver value, and semantic caching can be added later as query patterns become clearer.

Faster LLM apps require layered caching

Each caching layer solves a different part of the cost and latency problem. Stacking them into a layered architecture covers the full range of query patterns, from exact repeats to paraphrased questions to genuinely novel requests.

Redis combines vector search, semantic caching, and in-memory data structures in a single platform with sub-millisecond latency—so your semantic cache, session state, vector storage, and operational data all run on the same infrastructure. Whether you're building chatbots, RAG pipelines, or agentic systems, the same platform scales across all of them.

Try Redis free to test semantic caching with your own query patterns, or talk to the team about optimizing your LLM infrastructure costs.

Get started with Redis today

Speak to a Redis expert and learn more about enterprise-grade Redis today.