{
  "id": "java-jedis",
  "title": "Redis semantic cache with Jedis",
  "url": "https://redis.io/docs/latest/develop/use-cases/semantic-cache/java-jedis/",
  "summary": "Build a Redis-backed semantic cache for LLM responses in Java with Jedis and DJL (PyTorch)",
  "tags": [
    "docs",
    "develop",
    "stack",
    "oss",
    "rs",
    "rc"
  ],
  "last_updated": "2026-06-01T09:32:08+01:00",
  "children": [
    {
      "id": "readme",
      "summary": "",
      "title": "",
      "url": "https://redis.io/docs/latest/develop/use-cases/semantic-cache/java-jedis/readme/"
    }
  ],
  "page_type": "content",
  "content_hash": "c956c9dfb61cdb260fde2f06241fd1646ce7700db43569184292b6b50fc63dd9",
  "sections": [
    {
      "id": "overview",
      "title": "Overview",
      "role": "overview",
      "text": "This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Java with [Jedis](https://redis.io/docs/latest/develop/clients/jedis) and [DJL (Deep Java Library)](https://djl.ai/) running the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) encoder locally on PyTorch. It includes a local web server built with the JDK's standard `com.sun.net.httpserver.HttpServer` so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up."
    },
    {
      "id": "overview",
      "title": "Overview",
      "role": "overview",
      "text": "Each cache entry is stored as a single Redis [Hash](https://redis.io/docs/latest/develop/data-types/hashes) at `cache:<id>`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search](https://redis.io/docs/latest/develop/ai/search-and-query) index covers the embedding field and every metadata field, so one [`FT.SEARCH`](https://redis.io/docs/latest/commands/ft.search) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins.\n\nThe lookup is thresholded: [`FT.SEARCH`](https://redis.io/docs/latest/commands/ft.search) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL.\n\nThe embedder is [DJL](https://djl.ai/) loading the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) PyTorch model from the DJL model zoo. This is the same 384-dimensional encoder the [Python example](https://redis.io/docs/latest/develop/use-cases/semantic-cache/redis-py) and the [Node.js example](https://redis.io/docs/latest/develop/use-cases/semantic-cache/nodejs) use. Embeddings produced by the three implementations are semantically equivalent — paraphrase distances differ only at the fourth decimal place — so a cache populated by one demo can be queried by another against the same Redis instance.\n\nThat gives you:\n\n* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`](https://redis.io/docs/latest/commands/ft.search).\n* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one.\n* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another.\n* Bounded memory: every entry has an [`EXPIRE`](https://redis.io/docs/latest/commands/expire) TTL, and a database-level [eviction policy](https://redis.io/docs/latest/develop/reference/eviction) (LRU / LFU) caps the cache size under pressure."
    },
    {
      "id": "how-it-works",
      "title": "How it works",
      "role": "content",
      "text": "A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**."
    },
    {
      "id": "hit-path-the-goal",
      "title": "Hit path (the goal)",
      "role": "content",
      "text": "1. The application calls `embedder.encodeOne(prompt)` to turn the incoming text into a 384-dimensional `float[]`.\n2. `cache.lookup(queryVec, tenant, locale, modelVersion, \"ok\", threshold)` runs [`FT.SEARCH`](https://redis.io/docs/latest/commands/ft.search) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance.\n3. If the distance is at or below the threshold, the cache returns a `CacheHit` containing the cached response. The helper also runs an [`HINCRBY`](https://redis.io/docs/latest/commands/hincrby) on `hit_count` and an [`EXPIRE`](https://redis.io/docs/latest/commands/expire) refresh inside a [`MULTI/EXEC`](https://redis.io/docs/latest/commands/multi), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing.\n4. The LLM is not called at all. The application returns the cached response to the user."
    },
    {
      "id": "miss-path",
      "title": "Miss path",
      "role": "content",
      "text": "When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then:\n\n1. Calls the LLM with the prompt.\n2. Calls `cache.put(prompt, response, embedding, tenant, locale, modelVersion, ...)`. The same embedding the lookup used is reused — no re-encode. The helper writes the Hash with [`HSET`](https://redis.io/docs/latest/commands/hset) and an [`EXPIRE`](https://redis.io/docs/latest/commands/expire) TTL inside a single [`MULTI/EXEC`](https://redis.io/docs/latest/commands/multi) so the entry never lands without a TTL on a partial failure.\n3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit."
    },
    {
      "id": "the-cache-helper",
      "title": "The cache helper",
      "role": "content",
      "text": "The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow\n([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/RedisSemanticCache.java)):\n\n[code example]"
    },
    {
      "id": "data-model",
      "title": "Data model",
      "role": "content",
      "text": "Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper packs the `float[]` with a `ByteBuffer` in `ByteOrder.LITTLE_ENDIAN`, which matches the bytes Redis Search reads and is identical to the encoding the Python and Node ports write.\n\n[code example]\n\nThe Redis Search index schema treats every field as queryable in its natural type:\n\n[code example]"
    },
    {
      "id": "the-query",
      "title": "The query",
      "role": "content",
      "text": "The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. In Jedis:\n\n[code example]\n\n`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter."
    },
    {
      "id": "the-mock-llm",
      "title": "The mock LLM",
      "role": "content",
      "text": "To make the latency and token savings visible without requiring an API key, `MockLLM.java` provides a deterministic stand-in\n([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/MockLLM.java)):\n\n[code example]\n\nThe mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — an HTTP call to OpenAI, Anthropic, a self-hosted vLLM endpoint, anything — without changing the cache helper."
    },
    {
      "id": "pre-seeding-the-cache",
      "title": "Pre-seeding the cache",
      "role": "content",
      "text": "In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `SeedCache.java` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit\n([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/SeedCache.java)):\n\n[code example]\n\nThe seed list stores the canonical phrasing of each question (\"What is your return policy?\"). Paraphrases of any of these prompts (\"How do I return an item?\", \"Can I get a refund?\") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model."
    },
    {
      "id": "the-interactive-demo",
      "title": "The interactive demo",
      "role": "content",
      "text": "`DemoServer.java` runs an HTTP server built on the JDK's `com.sun.net.httpserver.HttpServer` — no Spring, no Jetty, no embedded framework. The HTML page lets you:\n\n* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index.\n* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query.\n* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache.\n* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited.\n* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction.\n\nThe server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. The HTML page is shared with the Python, Node.js, and Go demos; the build embeds `index.html` from the project root as a classpath resource so the jar runs from any working directory. Endpoints:\n\n| Endpoint        | What it does                                                                  |\n|-----------------|-------------------------------------------------------------------------------|\n| `GET  /state`   | Index info and the full list of cached entries.                               |\n| `POST /query`   | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back.       |\n| `POST /reset`   | Drop every cached entry and re-seed from the FAQ list.                        |\n| `POST /drop`    | Delete a single cached entry by id.                                           |"
    },
    {
      "id": "run-the-demo-locally",
      "title": "Run the demo locally",
      "role": "content",
      "text": "1.  Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example\n    directory:\n\n    [code example]\n\n2.  Make sure a Redis instance with the Redis Search module is running locally on\n    port 6379. [Redis Stack](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack) or\n    [Redis 8 with Search](https://redis.io/docs/latest/develop/ai/search-and-query) both work.\n\n3.  Build the project with Maven. This pulls Jedis, DJL, and the PyTorch native\n    libraries. The first build takes a couple of minutes:\n\n    [code example]\n\n4.  Run the demo. The first run also downloads the `sentence-transformers/all-MiniLM-L6-v2`\n    PyTorch weights into the local DJL cache (~90 MB); every subsequent run is offline:\n\n    [code example]\n\n    Or with `mvn`:\n\n    [code example]\n\n5.  Open <http://localhost:8089> and try some queries:\n\n    * **\"What is your return policy?\"** — exact match against the seed, distance ≈ 0,\n      hit at any threshold.\n    * **\"How fast is delivery?\"** — paraphrase of the shipping seed; distance\n      around 0.30, hit at the default threshold of 0.5.\n    * **\"How do I return an item?\"** — slightly looser paraphrase of the returns\n      seed; distance around 0.49, still a hit at the default threshold. Slide\n      the threshold down to 0.4 to see this one flip to a miss.\n    * **\"What payment methods do you accept?\"** — unrelated to anything in the\n      seed; distance > 0.8, so you'll see a miss, the mock LLM kicks in for\n      ~1.5 s, the new answer is cached, and a follow-up of the same question\n      is now an immediate hit.\n    * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any\n      seeded question — the result flips to a miss because the cache entries\n      live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`.\n\nThe server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python and Node demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, `--llm-latency-ms` to make the mock LLM faster or slower for the demo, or `--port` to listen on a different port."
    }
  ],
  "examples": [
    {
      "id": "the-cache-helper-ex0",
      "language": "java",
      "code": "import redis.clients.jedis.JedisPooled;\nimport com.redis.semcache.RedisSemanticCache;\nimport com.redis.semcache.LocalEmbedder;\nimport com.redis.semcache.LookupResult;\nimport com.redis.semcache.CacheHit;\n\nJedisPooled jedis = new JedisPooled(\"localhost\", 6379);\nLocalEmbedder embedder = LocalEmbedder.create();   // sentence-transformers/all-MiniLM-L6-v2\n\nRedisSemanticCache cache = new RedisSemanticCache(\n        jedis,\n        \"semcache:idx\",\n        \"cache:\",\n        384,\n        0.5,    // cosine distance, lower = stricter\n        3600    // TTL in seconds (one hour)\n);\n\n// One-time index setup (idempotent).\ncache.createIndex();\n\n// 1) Embed the prompt.\nString prompt = \"How do I return an item?\";\nfloat[] queryVec = embedder.encodeOne(prompt);\n\n// 2) Look up under a metadata scope. The TAG filter and the KNN\n//    travel together in one FT.SEARCH.\nLookupResult result = cache.lookup(\n        queryVec, \"acme\", \"en\", \"gpt-4.5-2026\", \"ok\", null);\n\nString response;\nif (result instanceof CacheHit hit) {\n    response = hit.response();\n    System.out.printf(\"hit (%.3f): %s%n\", hit.distance(), response);\n} else {\n    // 3a) Miss — call the LLM. (Use your real client here.)\n    response = callLlm(prompt);\n\n    // 3b) Cache the new entry. Reuses the same embedding bytes the\n    //     lookup used, so we don't pay the encoder twice.\n    cache.put(\n            prompt,\n            response,\n            queryVec,\n            \"acme\",\n            \"en\",\n            \"gpt-4.5-2026\",\n            \"ok\",\n            null,   // ttl override (null = default)\n            null    // entry id (null = generated)\n    );\n}",
      "section_id": "the-cache-helper"
    },
    {
      "id": "data-model-ex0",
      "language": "text",
      "code": "cache:7c3f8a1b9e02\n  prompt=How do I return an item?\n  response=You can return any unworn item within 30 days...\n  tenant=acme\n  locale=en\n  model_version=gpt-4.5-2026\n  safety=ok\n  created_ts=1715990400.123\n  hit_count=4\n  embedding=<384 × float32 little-endian bytes>",
      "section_id": "data-model"
    },
    {
      "id": "data-model-ex1",
      "language": "text",
      "code": "FT.CREATE semcache:idx\n  ON HASH PREFIX 1 cache:\n  SCHEMA\n    prompt         TEXT\n    response       TEXT\n    tenant         TAG\n    locale         TAG\n    model_version  TAG\n    safety         TAG\n    created_ts     NUMERIC SORTABLE\n    hit_count      NUMERIC SORTABLE\n    embedding      VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE",
      "section_id": "data-model"
    },
    {
      "id": "the-query-ex0",
      "language": "java",
      "code": "Query q = new Query(\n        \"(@tenant:{acme} @locale:{en} @model_version:{gpt\\\\-4\\\\.5\\\\-2026} @safety:{ok})\"\n                + \"=>[KNN 1 @embedding $vec AS distance]\")\n        .returnFields(\"prompt\", \"response\", \"tenant\", \"locale\",\n                \"model_version\", \"hit_count\", \"distance\")\n        .setSortBy(\"distance\", true)\n        .limit(0, 1)\n        .addParam(\"vec\", LocalEmbedder.toBytes(queryVec))\n        .dialect(2);\n\nSearchResult result = jedis.ftSearch(\"semcache:idx\", q);",
      "section_id": "the-query"
    },
    {
      "id": "the-mock-llm-ex0",
      "language": "java",
      "code": "import com.redis.semcache.MockLLM;\n\nMockLLM llm = new MockLLM(\"gpt-4.5-2026\", 1500.0);\nMockLLM.Response response = llm.complete(\"What is your return policy?\");\n// response.response()      — the templated answer text\n// response.latencyMs()     — wall-clock time the call took\n// response.totalTokens()   — estimated prompt + completion tokens",
      "section_id": "the-mock-llm"
    },
    {
      "id": "pre-seeding-the-cache-ex0",
      "language": "java",
      "code": "import com.redis.semcache.SeedCache;\n\ncache.createIndex();\nSeedCache.seed(cache, embedder, \"acme\", \"en\", \"gpt-4.5-2026\");",
      "section_id": "pre-seeding-the-cache"
    },
    {
      "id": "run-the-demo-locally-ex0",
      "language": "bash",
      "code": "git clone https://github.com/redis/docs.git\n    cd docs/content/develop/use-cases/semantic-cache/java-jedis",
      "section_id": "run-the-demo-locally"
    },
    {
      "id": "run-the-demo-locally-ex1",
      "language": "bash",
      "code": "mvn -q package",
      "section_id": "run-the-demo-locally"
    },
    {
      "id": "run-the-demo-locally-ex2",
      "language": "bash",
      "code": "java -jar target/semantic-cache-jedis.jar",
      "section_id": "run-the-demo-locally"
    },
    {
      "id": "run-the-demo-locally-ex3",
      "language": "bash",
      "code": "mvn -q exec:java",
      "section_id": "run-the-demo-locally"
    }
  ]
}
