How to Improve LLM UX: Speed, Latency & Caching
Your large language model (LLM)-powered app might be smart, but if it feels slow, users won't stick around. Around one second is a typical threshold for maintaining a user's flow of thought, and ~10 seconds is a common upper bound for holding attention. If your app falls outside that window, it's worth investing in speed.
Some fixes are architectural, some are perceptual, and the most effective strategy combines both. This article covers what makes LLM apps feel slow, how to diagnose where latency comes from, techniques for reducing both real and perceived delay, and how to tie UX improvements to business outcomes.
What makes LLM apps feel slow & frustrating
It's not just the wait. Slowness in LLM apps compounds across multiple dimensions that feed off each other. Here are the four that matter most, starting with the raw numbers and moving into the subtler forces that make latency feel even worse than it is.
Raw latency thresholds
How long users will tolerate waiting depends on the interaction type. UX degrades noticeably above the 4-second mark, though conversational fillers help reduce perceived delay.
Google's Core Web Vitals targets are under 2.5 seconds for Largest Contentful Paint (LCP) and under 200 milliseconds for Interaction to Next Paint (INP). Your page shell can usually meet these thresholds. But end-to-end RAG latency across retrieval, tool calls, and model generation often lands between 2 and 30 seconds depending on pipeline complexity.
Context switching
But raw latency only tells part of the story. LLM latency isn't a single wait; it's lots of little waits sprinkled across the flow. You type, and then:
- The UI acknowledges the input (or doesn't).
- The app fetches context (docs, chat history, user profile).
- The model starts streaming (or sits silent).
- The answer shows up, but now you have to scroll, copy, re-ask, or correct.
Every extra pause nudges users out of "conversation mode" and into "tool mode," where they're thinking about the app, not the problem they're trying to solve.
No feedback during generation
Those micro-waits get even worse when users have no idea what's happening behind the scenes. Most LLM apps go silent between receiving a prompt and streaming a response. No spinner, no status update, no sign of progress. That silence reads as: Is it broken? Did it even send?
The problem isn't just that users are waiting; it's that they don't know why they're waiting or how long they'll be waiting. That uncertainty makes even short delays feel broken. Even if your backend latency doesn't change, adding clear progress signals can make the experience feel dramatically less frustrating.
Output that arrives late & still isn't usable
Finally, all that waiting compounds into real frustration when the answer itself falls short. Nothing feels slower than waiting 20 seconds for an answer you can't trust.
Hallucinations, missing citations, vague summaries, or "here are 12 options" responses force users to ask follow-ups, re-run with more constraints, copy/paste into another tool, or abandon the workflow entirely. So improving UX often means both shaving milliseconds and reducing the number of turns it takes to get to "done."
Why app speed is the foundation of LLM UX
LLMs create a new expectation: the UI is conversational, so users assume responses should feel conversational too. That matters because most LLM apps aren't one request. They're a chain of prompt rewriting, retrieval augmented generation (RAG), tool calls, safety checks, and streaming. If each step adds even a small delay, end-to-end latency balloons. One way to stay ahead of this: treat latency like a product feature. Measure it, budget it, and design around it.
How to diagnose LLM UX & performance bottlenecks
Before you optimize anything, get clear on where time is actually going. "The model is slow" is a common first instinct, but in many cases the real bottleneck is somewhere else in the pipeline.
Break latency into buckets
Track (and log) timing for each stage. The exact breakdown will depend on your architecture, but a useful starting point looks something like:
- Client time: input handling, rendering, markdown formatting, syntax highlighting.
- Network time: DNS lookups, Transport Layer Security (TLS) handshakes, queuing, regional distance.
- Gateway time: auth, rate limiting, request shaping.
- Orchestrator time: prompt building, tool selection, retries.
- Retrieval time: vector search + metadata filters + reranking.
- Model time: time to first token (TTFT) + generation.
- Post-processing time: citations, safety filters, structured output validation.
Not every app will have all of these stages, and some will have additional ones. The point isn't to follow this list exactly. Without some version of this breakdown, you can't tell whether to optimize the UI, retrieval, model settings, or your own glue code.
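One lightweight way to get this breakdown is a timing context manager wrapped around each stage. This is a minimal sketch: the stage names and the `time.sleep` calls are placeholders for your own retrieval and model calls.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Hypothetical stages -- replace the sleeps with your real calls.
with stage("retrieval"):
    time.sleep(0.01)   # e.g. vector search + reranking
with stage("model"):
    time.sleep(0.02)   # e.g. waiting on the model API

slowest = max(timings, key=timings.get)
```

Logging `timings` per request gives you the per-stage distribution you need before deciding where to optimize.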
Measure the right UX metrics
Raw end-to-end latency is useful, but LLM UX often comes down to a few more specific numbers:
- Time to first token (TTFT): how fast the app shows any useful sign of life.
- Tokens per second (TPS): how fast the answer streams once it starts.
- Turn count to success: how many back-and-forths it takes for users to get what they want.
TTFT is the "feels fast" metric. TPS is the "doesn't drag" metric. Turn count is the "it actually helped me" metric. Which one matters most depends on your use case. A code assistant and a customer support bot have different tolerance profiles.
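TTFT and TPS fall out of the same streaming loop, so they're cheap to instrument. A sketch, assuming chunks arrive as text and using a naive whitespace split in place of your tokenizer's real counts:

```python
import time

def measure_stream(chunks):
    """Return (TTFT, TPS) for an iterable of streamed text chunks.

    Token counting here is a naive whitespace split; swap in your
    tokenizer's counts for real measurements.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start   # first visible output
        n_tokens += len(chunk.split())
    total = time.perf_counter() - start
    return ttft, (n_tokens / total if total > 0 else 0.0)

def fake_stream():
    """Simulated model stream: first chunk after ~50 ms."""
    time.sleep(0.05)
    yield "Here is"
    yield " the answer."

ttft, tps = measure_stream(fake_stream())
```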
Find self-inflicted slowness
A lot of LLM latency is self-inflicted. Some common patterns worth checking for:
- Sending huge chat histories every turn.
- Re-embedding the same content repeatedly.
- Doing retrieval when you don't need it.
- Calling tools serially when they could run in parallel.
- Doing expensive formatting or validation before streaming anything.
The fastest token is the one you never generate. The second-fastest is the one you stream early, provided you don't cut so much that users need extra turns to get a usable answer.
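The re-embedding pattern in particular has a cheap fix: key embeddings by a content hash so identical text is only ever embedded once. A minimal sketch, where `fake_embed` stands in for a real (slow, billed) embedding call:

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding identical text by keying on a content hash.

    `embed_fn` stands in for a real embedding API call.
    """

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._store = {}
        self.hits = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self.embed_fn(text)
        return self._store[key]

calls = []
def fake_embed(text):
    calls.append(text)              # count how often we "pay" for embedding
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
v1 = cache.get("refund policy, section 2")
v2 = cache.get("refund policy, section 2")  # second call served from cache
```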
How to reduce real latency in your LLM apps
Now for the part everyone wants: how to make it actually faster.
Stream early, even if you can't answer yet
If you can stream tokens, it's usually worth doing. If you can't stream the real answer yet, a short status line that sets expectations can help:
- "Searching your docs…"
- "Pulling recent account activity…"
- "Running calculations…"
Conversational fillers reduce perceived delay under higher-latency conditions. You don't need to fake typing; you just need to avoid silence.
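One way to wire this up is a generator that yields status events before any model output reaches the client. In this sketch, `search_docs` and `generate` are hypothetical stand-ins for your retrieval and model calls:

```python
import time

def search_docs(question):
    """Stand-in for a slow retrieval call."""
    time.sleep(0.05)
    return ["doc-1"]

def generate(question, docs):
    """Stand-in for model token streaming."""
    yield "Here's "
    yield "what I found."

def respond(question):
    """Emit status events immediately, then real tokens."""
    yield {"type": "status", "text": "Searching your docs…"}
    docs = search_docs(question)
    yield {"type": "status", "text": "Drafting an answer…"}
    for token in generate(question, docs):
        yield {"type": "token", "text": token}

events = list(respond("How do refunds work?"))
```

The client renders status events as transient text and replaces them once real tokens arrive, so the user never stares at a blank screen during retrieval.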
Cut prompt size like it's a build artifact
Long prompts tend to increase latency and cost, and they can lower quality by burying the model in irrelevant context.
Practical ways to shrink prompts:
- Summarize older conversation turns.
- Keep only the most recent tool outputs.
- Store "user preferences" separately (don't resend them every turn).
- Retrieve small, relevant chunks instead of dumping whole docs.
Not all of these will apply to every app, but most teams find at least one or two easy wins here.
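Summarizing older turns is often the easiest of these to bolt on. A sketch of the pattern, where `summarize` stands in for an LLM summarization call (the default here just truncates each old turn):

```python
def trim_history(messages, keep_recent=4, summarize=None):
    """Keep the latest turns verbatim; collapse older ones into one
    summary message. `summarize` stands in for an LLM call; the
    default fallback truncates each old turn.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize is None:
        summary = " / ".join(m["content"][:40] for m in old)
    else:
        summary = summarize(old)
    header = {"role": "system",
              "content": f"Summary of earlier conversation: {summary}"}
    return [header] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
trimmed = trim_history(history)
```

Ten turns collapse to one summary message plus the four most recent turns, which keeps prompt size roughly constant as conversations grow.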
Speed up retrieval (RAG) without wrecking accuracy
Retrieval augmented generation (RAG) improves LLM responses by fetching relevant context from your data before the model generates. It can also become one of your biggest bottlenecks, depending on dataset size and query complexity.
Common wins:
- Precompute vector embeddings for content you use often.
- Use metadata filters to shrink the candidate set (tenant ID, product area, time range).
- Cache retrieval results when the same questions repeat.
For retrieval-heavy workloads, in-memory data platforms can make a big difference because they keep vectors, cache, and metadata in one place instead of spreading them across separate systems. Redis delivers sub-millisecond access for cache, session, and state reads, and can provide low-millisecond retrieval for vector search and semantic caching (workload-dependent). Having all of this in one system can cut network hops and integration overhead compared to multi-vendor stacks, depending on how your services are deployed.
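The metadata-filter win is worth seeing concretely: apply the cheap filter first, then score only the survivors. This is an in-process toy index for illustration; in production the same pattern maps to a filtered vector query (e.g. a tag filter on a Redis or other vector index) rather than a Python loop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filtered_search(index, query_vec, tenant, top_k=2):
    """Cheap metadata filter first, vector scoring second."""
    candidates = [d for d in index if d["tenant"] == tenant]
    candidates.sort(key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return candidates[:top_k]

index = [
    {"id": "a", "tenant": "t1", "vec": [1.0, 0.0]},
    {"id": "b", "tenant": "t2", "vec": [1.0, 0.0]},   # filtered out early
    {"id": "c", "tenant": "t1", "vec": [0.0, 1.0]},
]
hits = filtered_search(index, [1.0, 0.1], tenant="t1")
```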
Cache the expensive stuff
Caching isn't glamorous, but it works, especially when your app has repetition. The key is designing cache keys and invalidation carefully to avoid serving stale or user-mismatched answers.
- Response caching: reuse final answers for identical prompts.
- Semantic caching: reuse answers for meaningfully similar prompts.
- Tool-result caching: reuse outputs from slow internal APIs.
- Embedding caching: don't re-embed the same text.
The right caching strategy depends on your query patterns. If you're already using Redis for sessions or caching, this is often a small step up in capability, not a whole new architecture.
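Semantic caching is the least obvious of the four, so here's the core idea in miniature. The embedding below is a toy bag-of-words stand-in for a real model, and the threshold is an assumption you'd tune on real traffic: too low serves wrong answers, too high misses valid hits.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Reuse an answer when a new query embeds close to a cached one."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []   # list of (embedding, answer)

    def get(self, query):
        q = toy_embed(query)
        for emb, answer in self.entries:
            if cosine(q, emb) >= self.threshold:
                return answer
        return None

    def put(self, query, answer):
        self.entries.append((toy_embed(query), answer))

cache = SemanticCache()
cache.put("how do I reset my password", "Go to Settings > Security > Reset.")
hit = cache.get("how do i reset my password please")   # similar wording
miss = cache.get("what plans do you offer")            # different question
```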
Run independent work in parallel
A lot of LLM pipelines run serially because it's easier to code. But if steps don't depend on each other, parallelizing them is worth considering:
- Fetch user profile & permissions while you run retrieval.
- Start tool calls while the model is drafting.
- Validate structured outputs while streaming continues.
The trade-off: more concurrency can increase load and make failure modes more interesting. Still, for UX, parallel work is often one of the cleanest ways to cut end-to-end time.
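The profile-plus-retrieval case above is a one-liner with `asyncio.gather`. The service calls here are simulated with sleeps; the point is that two ~50 ms steps finish in ~50 ms total instead of ~100 ms serial.

```python
import asyncio
import time

async def fetch_profile(user_id):
    """Stand-in for a profile/permissions service call (~50 ms)."""
    await asyncio.sleep(0.05)
    return {"user": user_id, "can_query": True}

async def run_retrieval(question):
    """Stand-in for vector search (~50 ms)."""
    await asyncio.sleep(0.05)
    return ["chunk-1", "chunk-2"]

async def handle(user_id, question):
    # Independent steps run concurrently instead of back-to-back.
    profile, chunks = await asyncio.gather(
        fetch_profile(user_id), run_retrieval(question)
    )
    return profile, chunks

start = time.perf_counter()
profile, chunks = asyncio.run(handle("u1", "refund policy"))
elapsed = time.perf_counter() - start   # ~0.05 s, not 0.10 s serial
```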
How to reduce perceived latency with better interaction design
You won't always get end-to-end latency under a couple of seconds. Sometimes you really do need 8–20 seconds (multi-step tools, heavy retrieval, complex reasoning). That doesn't mean the experience has to feel bad.
Acknowledge instantly
At minimum, show the user their message immediately, disable duplicate submits, and show an in-progress state. This is basic web UX, but it's where many LLM apps fall short.
Make partial output useful
Don't stream fluff. Stream something the user can act on: an outline, the first step of a plan, the first relevant citation, or a clarification question. If the user can do something with the first 2–3 seconds of output, they'll typically wait longer for the rest.
Use progressive disclosure for long answers
Long answers feel slower because they're visually heavy. A short summary up front, expandable sections, "show sources" toggles, or "open in editor" for big outputs can all help. You're not hiding content; you're keeping the UI from looking like a wall of text. The right pattern depends on how your users typically consume the output.
Give users control during generation
If generation takes time, users tend to want escape hatches: the ability to stop generating, regenerate, continue, or ask a follow-up while the model is still finishing. Control reduces frustration even when latency is unchanged, because the user feels like a participant rather than a spectator.
How to connect UX improvements to business outcomes
Latency is a product metric because it drives behavior. Lower TTFT tends to increase engagement. People are more likely to try a second prompt instead of bouncing. Higher answer usefulness often reduces turn count, which can mean fewer retries and fewer support tickets. And faster retrieval can enable more context per request, which may improve outcomes without increasing model size.
To make this legible to non-technical stakeholders, it helps to tie improvements to a funnel metric you already track: completion rate of a workflow, conversion for a key action, support deflection, or time-to-resolution in an internal tool. Even small UX wins tend to compound when the app is used dozens of times per day.
Get started with Redis
If your LLM app feels slow, the first step is measuring the pipeline end to end. Most teams find the biggest wins in three places: streaming and UI feedback (TTFT), retrieval speed and relevance, and avoiding repeated work with caching.
The architectural side matters, but so does perception. Users don't experience your system as "model latency" or "retrieval time." They experience silence, uncertainty, and extra turns. Addressing both the real delays and the felt delays is what separates apps people tolerate from apps people come back to.
Redis is a real-time data platform optimized for low-latency in-memory operations. It delivers sub-millisecond access for cache and session reads, and low-millisecond retrieval for vector search depending on workload and deployment. Vector search, semantic caching, session data, and operational state all live in one system, which can reduce retrieval latency and simplify delivering partial results compared to stitching together separate databases. Redis LangCache handles semantic caching out of the box, recognizing when queries are semantically similar despite different wording. For retrieval-heavy apps with repeat traffic patterns, that can mean faster cache hits and fewer redundant LLM calls without adding architectural complexity.
Try Redis free to see how it works with your workload, or talk to our team about what it looks like in your stack.
