What is semantic caching?
You're watching your large language model (LLM) API bills climb as users ask the same questions differently. "What's machine learning?" and "Can you explain ML?" hit your LLM twice at double the cost. Semantic caching fixes this by recognizing semantically equivalent queries and returning cached responses instead of making duplicate API calls.
Semantic caching can reduce API calls by up to 68.8% and improve response latency by 40-50%. But here's the catch: poorly configured semantic caching can serve wrong answers to users when similarity thresholds aren't properly tuned.
Semantic caching explained
Semantic caching matches semantically equivalent queries regardless of phrasing, unlike traditional caching which requires identical queries (exact key-value matching). For example, "How do I reset my password?" and "I forgot my password, how do I change it?" share the same cached response. When someone asks a question, the system converts it to a vector embedding that captures the meaning, then checks whether a similar question has already been answered.
How semantic caching works step-by-step
When a query arrives, semantic caching follows five steps: vectorize the query, search for similar cached embeddings, check the best match against your similarity threshold, return the cached response on a hit or call the LLM and store the result on a miss, and expire stale entries with a TTL.
Step 1: Convert the query to a vector
First, the system converts the incoming query into a vector embedding. The embedding model, whether that's OpenAI's text-embedding-3-small, a Hugging Face model, or something fine-tuned, captures the query's semantic meaning as a numerical vector.
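As a concrete illustration, here's a minimal sketch of this step assuming the official OpenAI Python client and text-embedding-3-small; any embedding model that returns a vector fills the same role.

```python
# Minimal sketch: convert an incoming query into an embedding vector.
# Assumes the official OpenAI Python client with OPENAI_API_KEY set in the environment;
# a Hugging Face or fine-tuned model works the same way.
from openai import OpenAI

client = OpenAI()

def embed(query: str) -> list[float]:
    """Return the embedding vector for a single query string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    return response.data[0].embedding

vector = embed("How do I reset my password?")
print(len(vector))  # text-embedding-3-small produces 1536-dimensional vectors
```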
Step 2: Search for similar cached embeddings
The vector store runs a similarity search comparing the incoming query vector against the cached query vectors, typically using an approximate nearest neighbor algorithm like HNSW (Hierarchical Navigable Small World) to find semantically similar queries. When the best match clears your threshold (measured with cosine similarity, Euclidean/L2 distance, or inner product), the system returns the cached answer.
You can implement this using RedisVL for vector operations and the Redis Query Engine for hybrid search. Redis handles the vector similarity search directly: HNSW indexing, cosine similarity calculations, and sub-millisecond retrieval all run in the same instance as your operational data. RedisVL's SemanticCache interface wraps this workflow, and for teams that want semantic caching without building custom orchestration, Redis LangCache provides a fully managed service.
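Here's a minimal sketch of that workflow using RedisVL's SemanticCache against a local Redis instance. Note that RedisVL expresses the threshold as a vector distance rather than a similarity score, and the import path and constructor arguments may differ slightly between versions.

```python
# Minimal sketch using RedisVL's SemanticCache; exact arguments may vary by version.
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llmcache",                     # index name / key prefix in Redis
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,              # vector *distance*: smaller = stricter match
)

# Store a query/response pair after a fresh LLM call...
cache.store(
    prompt="How do I reset my password?",
    response="Go to Settings > Security > Reset password.",
)

# ...then check a differently phrased query against the cache.
hits = cache.check(prompt="I forgot my password, how do I change it?")
if hits:
    print(hits[0]["response"])  # cache hit: reuse the stored answer
```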
Serving from the cache this way can cut response latency by 40-50%, because the vector similarity search operates on embeddings stored in the same Redis instance you already run.
Step 3: Check against your similarity threshold
Production deployments use threshold values between 0.7 and 0.95 to determine when two queries are "similar enough" to share a cached response. The threshold directly controls precision-recall tradeoff: lower thresholds increase cache hits but risk serving incorrect answers, while higher thresholds reduce wrong answers but miss valid caching opportunities.
Getting this threshold right matters more than you'd think. Set it too low and you'll serve the same answer to "What's Python?" and "What's Java?"—not great. Set it too high and you'll miss legitimate cache hits, reducing the cache's effectiveness.
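To make the tradeoff concrete, here's the core decision in plain NumPy. The 0.9 value is illustrative only; tune it against representative queries from your own logs.

```python
# Sketch of the threshold check using cosine similarity (NumPy only).
import numpy as np

THRESHOLD = 0.9  # illustrative; tune empirically

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_cache_hit(query_vec: np.ndarray, cached_vec: np.ndarray) -> bool:
    # Lower THRESHOLD -> more hits but more wrong answers; higher -> fewer of both.
    return cosine_similarity(query_vec, cached_vec) >= THRESHOLD
```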
Step 4A: Return the cached response (cache hit)
The system retrieves the cached LLM response without touching your expensive inference endpoint. Response times can drop from roughly 2.7 seconds to 0.3 seconds, an 89% improvement, and your API costs stay flat.
Step 4B: Call the LLM and store the result (cache miss)
No similar query exists in the cache, so the system calls the LLM, gets a fresh response, generates an embedding for the new query, and stores the query-response pair in the cache with a time-to-live (TTL). This query now benefits all future similar requests.
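Putting the hit and miss paths together looks roughly like this. The sketch reuses the cache object from the earlier RedisVL example; call_llm is a hypothetical placeholder for your chat-completion call.

```python
# Sketch of the full hit/miss flow around the semantic cache.
# `cache` is the SemanticCache from the earlier sketch; `call_llm` is a placeholder.
def answer(query: str) -> str:
    hits = cache.check(prompt=query)
    if hits:
        return hits[0]["response"]                 # cache hit: no LLM call

    response = call_llm(query)                     # cache miss: pay for one inference
    cache.store(prompt=query, response=response)   # future similar queries now hit
    return response
```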
Step 5: Set expiration policies (TTL)
Cache entries need TTLs because information gets stale. Time-sensitive data needs shorter TTLs (30-300 seconds) while evergreen content like documentation queries can persist longer (3600+ seconds). When entries expire, the next similar query generates a fresh response.
Most caching systems, including Redis, handle this natively through TTL mechanisms that automatically evict expired entries without manual intervention.
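With plain redis-py, for example, a TTL is just an argument on the write, and Redis evicts the entry automatically when it lapses. The keys and values below are illustrative.

```python
# TTL sketch with redis-py: Redis evicts each entry automatically when its TTL lapses.
import redis

r = redis.Redis(host="localhost", port=6379)

# Short TTL for time-sensitive answers, longer for evergreen content.
r.set("cache:response:weather-london", "Cloudy, 12°C", ex=120)                     # 2 minutes
r.set("cache:response:reset-password", "Go to Settings > Security...", ex=86400)   # 1 day

print(r.ttl("cache:response:weather-london"))  # seconds remaining until eviction
```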
Why semantic caching matters
The cost savings compound quickly. If you're spending $80k per quarter on OpenAI calls and semantic caching achieves a 30-40% cache hit rate, that's $24-32k in potential savings per quarter. Your system also becomes more scalable: you can handle increased query volumes without proportionally increasing LLM costs.
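The back-of-the-envelope math is simple enough to sanity-check in a few lines; the figures below just restate the example above.

```python
# Rough savings estimate using the numbers from the example above.
quarterly_llm_spend = 80_000       # USD spent on LLM calls per quarter
cache_hit_rate = 0.35              # midpoint of a 30-40% semantic cache hit rate
estimated_savings = quarterly_llm_spend * cache_hit_rate
print(f"~${estimated_savings:,.0f} saved per quarter")  # ~$28,000
```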
Semantic caching also improves user experience beyond just cost. Cached responses return in milliseconds instead of seconds, eliminating the loading states and delays users associate with AI applications. This speed advantage matters most during peak usage: when your LLM provider throttles requests or charges premium rates, your cache keeps delivering instant responses.
Practical considerations
Semantic caching matches based on similarity thresholds, not exact equivalence, which creates risk of serving wrong answers. Poorly configured systems can produce false positive rates as high as 99%, meaning your cache confidently serves incorrect answers.
The root cause is insufficient testing and validation. Most teams configure thresholds based on intuition rather than empirical testing with production-like queries. Start with a conservative threshold (0.9 or higher), test against 100+ representative queries, measure both precision and recall, then adjust based on actual performance data.
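One way to run that empirical test is an offline replay of labeled production queries. A minimal sketch, assuming labeled_queries is a list of (query, expected_response_or_None) pairs and cache is the semantic cache from the earlier sketches:

```python
# Sketch of an offline precision/recall evaluation for a semantic cache.
def evaluate(cache, labeled_queries):
    true_hits = false_hits = missed_hits = 0
    for query, expected in labeled_queries:
        hits = cache.check(prompt=query)
        if hits:
            if expected is not None and hits[0]["response"] == expected:
                true_hits += 1       # correct cached answer served
            else:
                false_hits += 1      # cache served a wrong or unwanted answer
        elif expected is not None:
            missed_hits += 1         # valid caching opportunity missed

    precision = true_hits / (true_hits + false_hits) if (true_hits + false_hits) else 1.0
    recall = true_hits / (true_hits + missed_hits) if (true_hits + missed_hits) else 1.0
    return precision, recall
```

Rerun the evaluation at several candidate thresholds and keep the lowest one that still meets your precision target, since that maximizes recall without sacrificing correctness.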
Choose your embedding model strategically
Don't automatically reach for the largest model. Smaller domain-specific embeddings often outperform massive general-purpose models for specialized use cases. Fine-tuned smaller models achieve 5-10% precision improvement over their base versions and can even surpass OpenAI's embeddings in specialized domains.
Balance three factors when selecting your embedding model:
- Quality: Accurately captures semantic similarities in your domain
- Speed: Inference latency matters for every query
- Domain alignment: Domain-specific or fine-tuned models frequently outperform general-purpose alternatives
Your cache architecture determines success more than embedding model choice. Effective architectures separate concerns: use exact matching for high-frequency queries (login flows, common FAQs), semantic matching for long-tail queries, and implement circuit breakers that fall back to LLM calls when cache confidence is low. Semantic cache optimization techniques can improve both hit rates and accuracy.
Partition your cache by domain or use case. Customer support queries shouldn't share a namespace with product documentation queries—different domains have different similarity patterns. Domain-specific fine-tuned models can outperform general-purpose models by 2-9% in precision, but proper partitioning and fallback logic matter more than model selection alone.
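With RedisVL-style caches, partitioning can be as simple as one cache instance (and index) per domain, each with its own threshold. The names and values below are illustrative.

```python
# Sketch: one semantic cache per domain so support and docs queries never match each other.
from redisvl.extensions.llmcache import SemanticCache

support_cache = SemanticCache(
    name="llmcache_support",
    redis_url="redis://localhost:6379",
    distance_threshold=0.10,
)
docs_cache = SemanticCache(
    name="llmcache_docs",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,   # thresholds can be tuned per domain
)
```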
Monitor your cache performance
Track these metrics to validate your cache effectiveness:
- Cache hit rates (carefully distinguish exact matches from semantic matches)
- Similarity score distributions across queries
- Embedding generation latency
- Cache lookup latency
- Response quality for cached versus fresh responses
Set up automated alerts for anomalies like sudden drops in cache hit rates, unusual similarity score patterns, or increased latency. Sample cached responses weekly and manually verify correctness against expected answers.
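A lightweight way to start is a set of counters plus a log of similarity scores that you export to whatever monitoring stack you already run. The helper below is an assumed sketch, not a specific library API.

```python
# Sketch of lightweight cache metrics: counters plus a similarity-score log.
from collections import Counter

metrics = Counter()
similarity_scores: list[float] = []

def record_lookup(hit: bool, exact: bool, similarity: float | None, latency_ms: float) -> None:
    metrics["lookups"] += 1
    metrics["lookup_latency_ms_total"] += latency_ms
    if hit:
        metrics["exact_hits" if exact else "semantic_hits"] += 1
    if similarity is not None:
        similarity_scores.append(similarity)  # watch this distribution for drift
```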
Semantic caching patterns
Production implementations typically combine two patterns in a layered approach. The first layer uses exact key caching to handle high-frequency identical queries with zero false positives. When a query arrives, the system checks for an exact match before running expensive semantic matching. This works well for common queries that users phrase identically, like "reset password" or "track my order."
The second layer applies semantic similarity caching for queries that vary in phrasing but share the same intent. This pattern converts queries to vector embeddings, runs similarity searches against cached embeddings, applies threshold-based matching logic, and either returns cached responses or calls the LLM and stores the result. Semantic caching provides the most value for FAQ systems, customer support chatbots, and organizational knowledge bases where users ask similar questions in different ways.
One effective technique adds a confidence buffer to reduce errors. Rather than serving every cache hit that meets your threshold, only return results when similarity exceeds the threshold by a comfortable margin. If your threshold is 0.9, only serve cache hits above 0.92. This buffer reduces wrong answers while maintaining high cache hit rates. The layered approach optimizes for both precision and recall while minimizing embedding computation costs.
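Here's a sketch of that layered lookup with the confidence buffer applied on the semantic layer. exact_store, embed, search_cache, and call_llm are assumed helpers standing in for your own exact-match cache, embedding model, vector search, and LLM client.

```python
# Sketch of the layered pattern: exact match first, then semantic match with a buffer.
THRESHOLD = 0.90
BUFFER = 0.02        # only serve semantic hits comfortably above the base threshold

def lookup(query: str) -> str:
    # Layer 1: exact key cache for identically phrased, high-frequency queries.
    if query in exact_store:
        return exact_store[query]

    # Layer 2: semantic cache for differently phrased queries with the same intent.
    match = search_cache(embed(query))     # returns (similarity, response) or None
    if match and match[0] >= THRESHOLD + BUFFER:
        return match[1]

    # Circuit breaker: low confidence falls back to a fresh LLM call,
    # and the new answer is stored for future queries.
    response = call_llm(query)
    exact_store[query] = response
    return response
```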
Should you implement semantic caching now?
You should implement semantic caching when four conditions are simultaneously true:
- You get repeat queries. Users ask similar questions in different ways, and identical responses are acceptable. This pattern works for FAQ systems, customer support chatbots, RAG implementations, and organizational knowledge bases.
- Your cost-benefit math works. LLM inference costs exceed semantic caching infrastructure costs (vector database, embeddings, maintenance).
- Your infrastructure exists and is properly architected. You need vector database capability, embedding generation, monitoring systems, and cache invalidation strategies. Your architecture should partition caches by domain, implement fallback logic for low-confidence matches, and include automated testing pipelines that validate cache accuracy before deployment. Don't build semantic caching before you have these pieces properly designed and tested.
- You can test with production-like data. Collect 100-500 representative queries from your production logs, label what the correct cached response should be, then measure your cache's precision (percentage of cache hits that are correct) and recall (percentage of valid cache opportunities your system captures). Aim for 95%+ precision before deploying to production.
If any of these conditions doesn't hold, or you're missing one of the four core components (embedding model, vector store, similarity threshold tuning, and cache architecture design), fix that first. Semantic caching done poorly causes more problems than it solves.
Redis provides straightforward semantic caching capabilities
Semantic caching delivers measurable results when implemented correctly. Redis makes semantic caching straightforward by combining vector search with production-grade caching infrastructure in a single platform.
RedisVL's SemanticCache interface and the managed Redis LangCache service handle the complexity of orchestrating embeddings, similarity search, and response storage. Your embeddings live alongside your operational data. No separate vector database required.
Redis supports multiple indexing algorithms (HNSW for speed, FLAT for simplicity), all three distance metrics (cosine, Euclidean, inner product), and native TTL expiration. You get official integrations with LangChain, LlamaIndex, and support across multiple client libraries including redis-py, NRedisStack, node-redis, jedis, and go-redis.
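If you'd rather manage the index yourself, here's a sketch of creating an HNSW vector index directly with redis-py and the Redis Query Engine. Field names, dimensions, and the cache: prefix are illustrative.

```python
# Sketch: create an HNSW vector index for cached prompts directly with redis-py.
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)

r.ft("llmcache-idx").create_index(
    fields=[
        TextField("prompt"),
        TextField("response"),
        VectorField(
            "embedding",
            "HNSW",                       # or "FLAT" for smaller caches
            {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)
```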
Try Redis free to test semantic caching with your workload, or meet with our team if you need help architecting your semantic cache for production.
