{
  "id": "llmcache",
  "title": "Cache LLM Responses",
  "url": "https://redis.io/docs/latest/develop/ai/redisvl/0.15.0/user_guide/llmcache/",
  "summary": "",
  "content": "\n\nThis guide demonstrates how to use RedisVL's `SemanticCache` to cache LLM responses based on semantic similarity. Semantic caching reduces API costs and latency by retrieving cached responses for semantically similar prompts instead of making redundant API calls.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n- Installed RedisVL: `pip install redisvl`\n- A running Redis instance ([Redis 8+](https://redis.io/downloads/) or [Redis Cloud](https://redis.io/cloud))\n- An OpenAI API key for the examples\n\n## What You'll Learn\n\nBy the end of this guide, you will be able to:\n- Set up and configure a `SemanticCache`\n- Store and retrieve cached LLM responses\n- Customize semantic similarity thresholds\n- Configure TTL policies for cache expiration\n- Implement access controls with tags and filters for multi-user scenarios\n\nFirst, import [OpenAI](https://platform.openai.com) to use their API for responding to user prompts. The following code creates a simple `ask_openai` helper function that sends a question to the model and returns its answer.\n\n\n```python\nimport os\nimport getpass\nimport time\nimport numpy as np\n\nfrom openai import OpenAI\n\n\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"False\"\n\napi_key = os.getenv(\"OPENAI_API_KEY\") or getpass.getpass(\"Enter your OpenAI API key: \")\n\nclient = OpenAI(api_key=api_key)\n\ndef ask_openai(question: str) -\u003e str:\n    # gpt-4o-mini is a chat model, so call the Chat Completions endpoint\n    response = client.chat.completions.create(\n      model=\"gpt-4o-mini\",\n      messages=[{\"role\": \"user\", \"content\": f\"Answer the following question simply: {question}\"}],\n      max_tokens=200\n    )\n    return response.choices[0].message.content.strip()\n```\n\n\n```python\n# Test\nprint(ask_openai(\"What is the capital of France?\"))\n```\n\n    The capital of France is Paris.\n\n\n## Initializing ``SemanticCache``\n\nOn initialization, ``SemanticCache`` automatically creates a search index in Redis for the semantic cache content.\n\n\n```python\nimport warnings\nwarnings.filterwarnings('ignore')\n\nfrom redisvl.extensions.cache.llm import 
SemanticCache\nfrom redisvl.utils.vectorize import HFTextVectorizer\n\nllmcache = SemanticCache(\n    name=\"llmcache\",                                          # underlying search index name\n    redis_url=\"redis://localhost:6379\",                       # redis connection url string\n    distance_threshold=0.1,                                   # semantic cache distance threshold (Redis COSINE [0-2], lower is stricter)\n    vectorizer=HFTextVectorizer(\"redis/langcache-embed-v1\"),  # embedding model\n)\n```\n\n\n```python\n# look at the index specification created for the semantic cache lookup\n!rvl index info -i llmcache\n```\n\n    \n    \n    Index Information:\n    ╭───────────────┬───────────────┬───────────────┬───────────────┬───────────────╮\n    │ Index Name    │ Storage Type  │ Prefixes      │ Index Options │ Indexing      │\n    ├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤\n    | llmcache      | HASH          | ['llmcache']  | []            | 0             |\n    ╰───────────────┴───────────────┴───────────────┴───────────────┴───────────────╯\n    Index Fields:\n    ╭─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────╮\n    │ Name            │ Attribute       │ Type            │ Field Option    │ Option Value    │ Field Option    │ Option Value    │ Field Option    │ Option Value    │ Field Option    │ Option Value    │\n    ├─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┤\n    │ prompt          │ prompt          │ TEXT            │ WEIGHT          │ 1               │                 │                 │                 │                 │                 │                 │\n    │ response        │ response        │ TEXT 
           │ WEIGHT          │ 1               │                 │                 │                 │                 │                 │                 │\n    │ inserted_at     │ inserted_at     │ NUMERIC         │                 │                 │                 │                 │                 │                 │                 │                 │\n    │ updated_at      │ updated_at      │ NUMERIC         │                 │                 │                 │                 │                 │                 │                 │                 │\n    │ prompt_vector   │ prompt_vector   │ VECTOR          │ algorithm       │ FLAT            │ data_type       │ FLOAT32         │ dim             │ 768             │ distance_metric │ COSINE          │\n    ╰─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────╯\n\n\n## Basic Cache Usage\n\n\n```python\nquestion = \"What is the capital of France?\"\n```\n\n\n```python\n# Check the semantic cache -- should be empty\nif response := llmcache.check(prompt=question):\n    print(response)\nelse:\n    print(\"Empty cache\")\n```\n\n    Empty cache\n\n\nOur initial cache check should be empty since we have not yet stored anything in the cache. 
Below, store the `question`, the\ncorrect `response`, and any arbitrary `metadata` (as a Python dictionary) in the cache.\n\n\n```python\n# Cache the question, answer, and arbitrary metadata\nllmcache.store(\n    prompt=question,\n    response=\"Paris\",\n    metadata={\"city\": \"Paris\", \"country\": \"france\"}\n)\n```\n\n\n\n\n    'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'\n\n\n\nNow check the cache again with the same question and with a semantically similar question:\n\n\n```python\n# Check the cache again\nif response := llmcache.check(prompt=question, return_fields=[\"prompt\", \"response\", \"metadata\"]):\n    print(response)\nelse:\n    print(\"Empty cache\")\n```\n\n    [{'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}, 'key': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'}]\n\n\n\n```python\n# Check for a semantically similar result\nquestion = \"What actually is the capital of France?\"\nllmcache.check(prompt=question)[0]['response']\n```\n\n\n\n\n    'Paris'\n\n\n\n## Customize the Distance Threshold\n\nFor most use cases, the right semantic similarity threshold is not a fixed quantity. 
Depending on the choice of embedding model,\nthe properties of the input query, and even the business use case, the threshold might need to change.\n\nThe distance threshold uses Redis COSINE distance units [0-2], where 0 means identical and 2 means completely different.\n\nFortunately, you can adjust the threshold at any point, as shown below:\n\n\n```python\n# Widen the semantic distance threshold (allow less similar matches)\nllmcache.set_threshold(0.5)\n```\n\n\n```python\n# Try to trick the cache by asking around the point;\n# this prompt still slips just under the new threshold\nquestion = \"What is the capital city of the country in Europe that also has a city named Nice?\"\nllmcache.check(prompt=question)[0]['response']\n```\n\n\n\n\n    'Paris'\n\n\n\n\n```python\n# Invalidate the cache completely by clearing it out\nllmcache.clear()\n\n# Should be empty now\nllmcache.check(prompt=question)\n```\n\n\n\n\n    []\n\n\n\n## Utilize TTL\n\nRedis supports optional TTL policies that expire individual keys at a future point in time.\nThis allows you to focus on your data flow and business logic without bothering with complex cleanup tasks.\n\nA TTL policy set on the `SemanticCache` allows you to temporarily hold onto cache entries. Below, the TTL policy is set to 5 seconds.\n\n\n```python\nllmcache.set_ttl(5) # 5 seconds\n```\n\n\n```python\nllmcache.store(\"This is a TTL test\", \"This is a TTL test response\")\n\ntime.sleep(6)\n```\n\n\n```python\n# Confirm that the cache entry has expired on its own by now\nresult = llmcache.check(\"This is a TTL test\")\n\nprint(result)\n```\n\n    []\n\n\n\n```python\n# Reset the TTL to null (long-lived data)\nllmcache.set_ttl()\n```\n\n## Simple Performance Testing\n\nNext, measure the speedup obtained by using ``SemanticCache``. 
The ``time`` module measures the time taken to generate responses with and without ``SemanticCache``.\n\n\n```python\ndef answer_question(question: str) -\u003e str:\n    \"\"\"Helper function to answer a simple question using OpenAI with a wrapper\n    check for the answer in the semantic cache first.\n\n    Args:\n        question (str): User input question.\n\n    Returns:\n        str: Response.\n    \"\"\"\n    results = llmcache.check(prompt=question)\n    if results:\n        return results[0][\"response\"]\n    else:\n        answer = ask_openai(question)\n        return answer\n```\n\n\n```python\nstart = time.time()\n# asking a question -- openai response time\nquestion = \"What was the name of the first US President?\"\nanswer = answer_question(question)\nend = time.time()\n\nprint(f\"Without caching, a call to openAI to answer this simple question took {end-start} seconds.\")\n\n# add the entry to our LLM cache\nllmcache.store(prompt=question, response=\"George Washington\")\n```\n\n    Without caching, a call to openAI to answer this simple question took 3.055630922317505 seconds.\n\n\n\n\n\n    'llmcache:67e0f6e28fe2a61c0022fd42bf734bb8ffe49d3e375fd69d692574295a20fc1a'\n\n\n\n\n```python\n# Calculate the avg latency for caching over LLM usage\ntimes = []\n\nfor _ in range(10):\n    cached_start = time.time()\n    cached_answer = answer_question(question)\n    cached_end = time.time()\n    times.append(cached_end-cached_start)\n\navg_time_with_cache = np.mean(times)\nprint(f\"Avg time taken with LLM cache enabled: {avg_time_with_cache}\")\nprint(f\"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%\")\n```\n\n    Avg time taken with LLM cache enabled: 0.05140402317047119\n    Percentage of time saved: 98.32%\n\n\n\n```python\n# check the stats of the index\n!rvl stats -i llmcache\n```\n\n    \n    Statistics:\n    ╭─────────────────────────────┬────────────╮\n    │ Stat Key                    │ Value      
│\n    ├─────────────────────────────┼────────────┤\n    │ num_docs                    │ 1          │\n    │ num_terms                   │ 19         │\n    │ max_doc_id                  │ 4          │\n    │ num_records                 │ 36         │\n    │ percent_indexed             │ 1          │\n    │ hash_indexing_failures      │ 0          │\n    │ number_of_uses              │ 19         │\n    │ bytes_per_record_avg        │ 62.3888893 │\n    │ doc_table_size_mb           │ 0.00782489 │\n    │ inverted_sz_mb              │ 0.00214195 │\n    │ key_table_size_mb           │ 1.14440917 │\n    │ offset_bits_per_record_avg  │ 8          │\n    │ offset_vectors_sz_mb        │ 2.67028808 │\n    │ offsets_per_term_avg        │ 0.77777779 │\n    │ records_per_doc_avg         │ 36         │\n    │ sortable_values_size_mb     │ 0          │\n    │ total_indexing_time         │ 1.366666   │\n    │ total_inverted_index_blocks │ 21         │\n    │ vector_index_sz_mb          │ 3.01630401 │\n    ╰─────────────────────────────┴────────────╯\n\n\n\n```python\n# Clear the cache AND delete the underlying index\nllmcache.delete()\n```\n\n## Cache Access Controls, Tags \u0026 Filters\nWhen running complex workflows with similar applications, or handling multiple users it's important to keep data segregated. 
Building on RedisVL's support for complex and hybrid queries, you can tag and filter cache entries using custom-defined `filterable_fields`.\n\nThe following example stores multiple users' data in the cache with similar prompts and ensures that only the correct user's information is returned:\n\n\n```python\nprivate_cache = SemanticCache(\n    name=\"private_cache\",\n    filterable_fields=[{\"name\": \"user_id\", \"type\": \"tag\"}]\n)\n\nprivate_cache.store(\n    prompt=\"What is the phone number linked to my account?\",\n    response=\"The number on file is 123-555-0000\",\n    filters={\"user_id\": \"abc\"},\n)\n\nprivate_cache.store(\n    prompt=\"What's the phone number linked in my account?\",\n    response=\"The number on file is 123-555-1111\",\n    filters={\"user_id\": \"def\"},\n)\n```\n\n\n```python\nfrom redisvl.query.filter import Tag\n\n# Define a filter on the user ID\nuser_id_filter = Tag(\"user_id\") == \"abc\"\n\nresponse = private_cache.check(\n    prompt=\"What is the phone number linked to my account?\",\n    filter_expression=user_id_filter,\n    num_results=2\n)\n\nprint(f\"found {len(response)} entry \\n{response[0]['response']}\")\n```\n\n    found 1 entry \n    The number on file is 123-555-0000\n\n\n\n```python\n# Cleanup\nprivate_cache.delete()\n```\n\nYou can define multiple `filterable_fields` on a cache and build complex filter expressions over these fields, as well as over the default fields already present.\n\n\n```python\n\ncomplex_cache = SemanticCache(\n    name='account_data',\n    filterable_fields=[\n        {\"name\": \"user_id\", \"type\": \"tag\"},\n        {\"name\": \"account_type\", \"type\": \"tag\"},\n        {\"name\": \"account_balance\", \"type\": \"numeric\"},\n        {\"name\": \"transaction_amount\", \"type\": \"numeric\"}\n    ]\n)\ncomplex_cache.store(\n    prompt=\"what is my most recent checking account transaction under $100?\",\n    response=\"Your most recent transaction was for $75\",\n    filters={\"user_id\": 
\"abc\", \"account_type\": \"checking\", \"transaction_amount\": 75},\n)\ncomplex_cache.store(\n    prompt=\"what is my most recent savings account transaction?\",\n    response=\"Your most recent deposit was for $300\",\n    filters={\"user_id\": \"abc\", \"account_type\": \"savings\", \"transaction_amount\": 300},\n)\ncomplex_cache.store(\n    prompt=\"what is my most recent checking account transaction over $200?\",\n    response=\"Your most recent transaction was for $350\",\n    filters={\"user_id\": \"abc\", \"account_type\": \"checking\", \"transaction_amount\": 350},\n)\ncomplex_cache.store(\n    prompt=\"what is my checking account balance?\",\n    response=\"Your current checking account is $1850\",\n    filters={\"user_id\": \"abc\", \"account_type\": \"checking\"},\n)\n```\n\n\n```python\nfrom redisvl.query.filter import Num\n\nvalue_filter = Num(\"transaction_amount\") \u003e 100\naccount_filter = Tag(\"account_type\") == \"checking\"\ncomplex_filter = value_filter \u0026 account_filter\n\n# check for checking account transactions over $100\ncomplex_cache.set_threshold(0.3)\nresponse = complex_cache.check(\n    prompt=\"what is my most recent checking account transaction?\",\n    filter_expression=complex_filter,\n    num_results=5\n)\nprint(f'found {len(response)} entry')\nprint(response[0][\"response\"])\n```\n\n    found 1 entry\n    Your most recent transaction was for $350\n\n\n## Next Steps\n\nNow that you understand semantic caching, explore these related guides:\n\n- [Cache Embeddings](10_embeddings_cache.ipynb) - Cache embedding vectors for faster repeated computations\n- [Manage LLM Message History](07_message_history.ipynb) - Store and retrieve conversation history\n- [Query and Filter Data](02_complex_filtering.ipynb) - Learn more about filter expressions for cache access control\n\n## Cleanup\n\n\n```python\ncomplex_cache.delete()\n```\n",
  "tags": [],
  "last_updated": "2026-04-21T14:39:33+02:00"
}
