We’re seeing Retrieval Augmented Generation (RAG) become the de facto standard architecture for GenAI applications that require access to private data. Nevertheless, some may wonder why it’s important to have real-time access to this data. The answer is quite simple: you don’t want your application to slow down when you add AI to your stack.
So, what is a fast application? Paul Buchheit (the creator of Gmail) coined The 100ms Rule. It says every interaction should be faster than 100ms. Why? 100ms is the threshold “where interactions feel instantaneous.”
Let’s examine what a typical RAG-based architecture looks like and what latency boundaries each component currently has as well as the expected end-to-end latency.
Based on this analysis, a GenAI application built using the above architecture should expect an average end-to-end response time of roughly 1,232ms (about 1.2 seconds). This means you’ll probably lose your end users’ interest after a few interactions.
To build a real-time GenAI application that comes closer to the 100ms Rule, you need to rethink your data architecture.
To deal with the above challenges, Redis offers three main datastore capabilities for AI that will enable real-time RAG:
Redis has supported vector data types and vector search capabilities since before the term GenAI was coined. The Redis vector search algorithm uses highly efficient in-memory data structures and a dedicated search engine, resulting in search that is up to 50 times faster (we will shortly release our comprehensive benchmark results) and document retrieval that is two orders of magnitude faster. Later in this blog, we’ll show how real-time vector search can significantly improve the end-to-end user experience.
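To make this concrete, here is a minimal sketch of Redis vector search using redis-py’s search commands. The index name, field names, toy 4-dimensional vectors, and local connection details are illustrative assumptions, and it assumes a Redis deployment with the search and vector features enabled (e.g., Redis Stack):

```python
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Create an index over hashes with prefix "doc:", holding text plus an HNSW vector field (run once).
r.ft("docs_idx").create_index(
    fields=[
        TextField("content"),
        VectorField("embedding", "HNSW",
                    {"TYPE": "FLOAT32", "DIM": 4, "DISTANCE_METRIC": "COSINE"}),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Store a document; the embedding is raw float32 bytes (normally produced by an embedding model).
r.hset("doc:1", mapping={
    "content": "Florence is known for its Renaissance art and architecture.",
    "embedding": np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32).tobytes(),
})

# KNN query: return the 3 documents whose embeddings are closest to the query vector.
q = (
    Query("*=>[KNN 3 @embedding $vec AS score]")
    .return_fields("content", "score")
    .sort_by("score")
    .dialect(2)
)
results = r.ft("docs_idx").search(
    q, query_params={"vec": np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32).tobytes()}
)
for doc in results.docs:
    print(doc.content, doc.score)
```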
Traditional caching techniques in Redis (and elsewhere) rely on keyword matching, which struggles to capture the semantic similarity between similar queries to LLM-based services, resulting in very low hit rates. An existing keyword-based cache won’t detect the semantic similarity between “give me suggestions for a comedy movie” and “recommend a funny movie”, leading to a cache miss. Semantic caching goes beyond exact matches: it uses vector embeddings to capture the meaning of a query. Even if the wording differs, the cache can recognize that a query is contextually similar to a previous one and return the corresponding response (if it has it). According to a recent study, 31% of queries to LLMs can be cached (in other words, 31% of queries are contextually repeatable), which can significantly improve response time in GenAI apps running on RAG-based architectures while dramatically reducing LLM costs.
You can think of the semantic cache as the new caching layer for LLMs. Built on vector search, a semantic cache can deliver significant performance and deployment-cost benefits, as we’ll explain in the following sections.
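As a rough sketch of how this is wired into application code, here is a semantic cache check using RedisVL’s SemanticCache extension. The cache name, distance threshold, connection URL, and the `call_llm()` helper are illustrative assumptions, and the exact parameter names may vary between RedisVL releases (by default RedisVL vectorizes prompts with a local embedding model):

```python
from redisvl.extensions.llmcache import SemanticCache

# Semantic cache backed by Redis vector search; prompts within the distance
# threshold of a previously cached prompt count as a hit.
llmcache = SemanticCache(
    name="llmcache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,  # how close two prompts must be to count as "the same question"
)

def answer(prompt: str) -> str:
    # 1. Look for a semantically similar, previously answered prompt.
    if hits := llmcache.check(prompt=prompt):
        return hits[0]["response"]  # cache hit: skip the LLM entirely
    # 2. Cache miss: call the LLM (call_llm is a hypothetical helper) and store the result.
    response = call_llm(prompt)
    llmcache.store(prompt=prompt, response=response)
    return response

# "recommend a funny movie" can now be served from the cached answer to
# "give me suggestions for a comedy movie" if their embeddings are close enough.
```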
LLM Memory is the record of all previous interactions between the LLM and a specific user; think of it as the session store for the LLM, except it can also carry information across different user sessions. Implemented using existing Redis data structures and vector search, LLM Memory can be incredibly valuable. Here’s a simple example:
Without LLM Memory:
User: “I’m planning a trip to Italy. What are some interesting places to visit?”
LLM: “Italy has many beautiful cities! Here are some popular tourist destinations: Rome, Florence, Venice…”
With LLM Memory:
User: “I’m planning a trip to Italy. I’m interested in art and history, not so much crowded places.” (Let’s assume this is the first turn of the conversation)
LLM: “Since you’re interested in art and history, how about visiting Florence? It’s known for its Renaissance art and architecture.” (LLM uses conversation history to identify user preference and suggests a relevant location)
User: “That sounds great! Are there any museums I shouldn’t miss?”
LLM (referencing conversation history): “The Uffizi Gallery and the Accademia Gallery are must-sees for art lovers in Florence.” (LLM leverages conversation history to understand the user’s specific interests within the context of the trip)
In this example, LLM memory (or conversation history) allows the LLM to personalize its response based on the user’s initial statement. It avoids generic recommendations and tailors its suggestions to the user’s expressed interests, leading to a more helpful and engaging user experience.
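Here is a bare-bones sketch of LLM Memory built on plain Redis data structures: each conversation turn is appended to a per-session Redis list, and the most recent turns are pulled back in to ground the next prompt. The key names and turn limit are illustrative assumptions; RedisVL’s LLM Memory capability adds semantic (vector) search over this history on top of the same idea:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def remember(session_id: str, role: str, content: str) -> None:
    # Append one conversation turn to the session's history list.
    r.rpush(f"llm_memory:{session_id}", json.dumps({"role": role, "content": content}))

def recall(session_id: str, last_n: int = 10) -> list[dict]:
    # Fetch the most recent turns to include as context in the next prompt.
    return [json.loads(m) for m in r.lrange(f"llm_memory:{session_id}", -last_n, -1)]

remember("user-42", "user", "I'm planning a trip to Italy. I'm interested in art and history.")
remember("user-42", "assistant", "How about Florence? It's known for its Renaissance art.")
# Later turns can reference what the user already said:
print(recall("user-42"))
```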
To explain real-time RAG with Redis’ capabilities for AI, nothing beats a diagram and a short explanation:
There are two scenarios we should look at: a semantic cache hit and a cache miss. To understand how real-time RAG applications perform end-to-end, let’s analyze each one.
As the diagram above shows, only two components are actually involved in the cache-hit scenario: the GenAI app itself and Redis, serving as the semantic cache.
RAG architectures based on Redis have an average end-to-end response time of 389ms, around 3.2x faster than non-real-time RAG architectures and much closer to Paul Buchheit’s 100ms Rule. This lets existing and new applications add LLM components to their stack with minimal performance impact, if any.
Apart from making sure your fast applications stay fast, a Redis-based real-time RAG architecture offers these other benefits:
The blog analyzes the response time of RAG-based architectures and explains how Redis can provide real-time end-user experiences in complex, fast-changing LLM environments. If you want to try everything discussed here, we recommend Redis Vector Library (RedisVL), a Python-based client for AI applications that uses Redis capabilities for real-time RAG (Semantic Caching, LLM Memory, and Vector Database). RedisVL works with your Redis Cloud instance or your self-deployed Redis Stack.
In this appendix, you’ll find details on how we calculated end-to-end response times for RAG (real-time and non-real-time). It’s based on a comprehensive benchmark we did and will soon publish, which we ran across four types of vector datasets:
More info on these datasets will be available once the benchmark is out.
For the non-real-time RAG, we averaged the results across all disk-based databases (special-purpose and general-purpose). Because of the significant skew in the data, we took the normalized median value across four different sets of tests and all the vendors under test.
| Component | Latency |
| --- | --- |
| Network round-trip | (20 + 50) / 2 = 35ms |
| LLM | (50 + 500) / 2 = 275ms |
| GenAI app (assuming 20ms when no other services are called and 100ms otherwise) | (20 + 100) / 2 = 60ms |
| Vector database (assuming one vector/hybrid search + 10 document retrievals) | One vector search query (median value across low and high loads): 63ms. 10x document retrieval: 10 × 50ms = 500ms. Total: 563ms |
| Agent processing (we assume 1/3 of the calls to the LLM trigger agent processing, which adds another application call, another round of data retrieval, and another LLM call) | 33% × (LLM + App + Vector DB) |
| Total | 35 + (275 + 60 + 563) × 2/3 + (275 + 60 + 563) × 2 × 1/3 ≈ 1,232ms |
For the real-time RAG, we looked at two scenarios: cache hit (using semantic caching) and cache miss. Based on this research, we calculated a weighted average assuming that 30% of queries would hit the cache (and 70% would miss it). We took the Redis median latency value across all the datasets under test in the benchmark.
Cache hit (best case):

| Component | Latency |
| --- | --- |
| Network round-trip | (20 + 50) / 2 = 35ms |
| GenAI app (assuming a cache hit reduces app processing time by 33%) | 40ms |
| Redis semantic caching | One vector search query (median value across low and high loads): 24.6ms. 1x document retrieval: 1 × 0.5ms. Total: ~25ms |
| Total | 35 + 40 + 25 = 100ms |
Cache miss (worst case):

| Component | Latency |
| --- | --- |
| Network round-trip | 35ms |
| LLM (based on historical context, we assume LLM processing improves by roughly 20% thanks to a shorter prompt, with far fewer tokens, that is more accurate and relevant) | (40 + 400) / 2 = 220ms |
| GenAI app (assuming 20ms when no other services are called and 100ms otherwise) | (20 + 100) / 2 = 60ms |
| Redis | Semantic cache miss: 24.6ms. LLM Memory search (24.6ms) + 5 context retrievals (5 × 0.5ms) = 27.1ms. Vector search (24.6ms) + 5 context retrievals (5 × 0.5ms) = 27.1ms. Total: 24.6 + 27.1 + 27.1 ≈ 79ms |
| Agent processing (we assume 1/3 of the calls to the LLM trigger agent processing, which adds another application call, another round of data retrieval, and another LLM call) | 33% × (LLM + App + Redis) |
| Total | 35 + (220 + 60 + 79) × 2/3 + (220 + 60 + 79) × 2 × 1/3 ≈ 513ms |
The weighted average of cache hits and misses is calculated as follows: 30% × 100ms + 70% × 513ms ≈ 389ms.
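For anyone who wants to check the arithmetic, the totals above reduce to a few lines of Python (all values are taken from the tables in this appendix):

```python
# Non-real-time RAG: 2/3 of requests make one LLM + app + vector-DB pass, 1/3 make two.
network = 35
one_pass = 275 + 60 + 563                                          # LLM + app + vector database
non_real_time = network + one_pass * 2/3 + one_pass * 2 * 1/3      # ≈ 1,232ms

# Real-time RAG, semantic cache hit: network + app + Redis semantic cache.
cache_hit = 35 + 40 + 25                                            # = 100ms

# Real-time RAG, cache miss: same 2/3 vs 1/3 split, with Redis replacing the vector DB.
one_pass_rt = 220 + 60 + 79                                         # LLM + app + Redis
cache_miss = network + one_pass_rt * 2/3 + one_pass_rt * 2 * 1/3    # ≈ 513.7ms (shown as 513ms above)

# Weighted average with a 30% cache-hit rate (using the rounded table values).
weighted = 0.3 * 100 + 0.7 * 513                                    # ≈ 389ms
print(non_real_time, cache_hit, cache_miss, weighted)
```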