
How to optimize machine learning inference costs and performance

January 27, 2026
8 minute read
Redis
Fionce Siow

If you're building Large Language Model (LLM) apps, Retrieval-Augmented Generation (RAG) systems, or any production AI feature, you've probably noticed inference costs spiraling faster than expected. Or maybe your response times feel sluggish despite throwing more GPUs at the problem. Here's what most teams miss: when you're running LLM inference at low batch sizes, your bottleneck isn't compute; it's memory bandwidth. Your GPU is sitting there waiting for data instead of crunching numbers. That changes your optimization approach.

This article covers how inference costs add up, what causes performance bottlenecks in LLM workloads, and practical optimization strategies. You'll learn why semantic caching delivers some of the highest ROI and how Redis helps production teams cut inference costs while improving response times.

How machine learning inference works in production

Your inference deployment pattern shapes your optimization strategy. Real-time, batch, and asynchronous inference have completely different cost and performance tradeoffs.

  • Real-time inference uses persistent REST API endpoints serving individual requests with millisecond latency requirements. Think chatbots, recommendation engines, and any feature where users expect instant responses.
  • Batch inference processes multiple requests together with relaxed latency (hours or days), optimizing for throughput and cost efficiency rather than individual request speed. Think monthly reports, bulk data processing, and offline analytics.
  • Asynchronous inference handles queue-based processing for large payloads that take time—video analysis, large document processing, or anything that doesn't need immediate responses.

Each pattern requires different optimization strategies, and most production systems use a combination of all three.

Why LLM inference is different (and harder)

LLM inference differs from traditional ML patterns. Many traditional ML workloads can scale roughly proportionally with added compute (up to a point), whereas LLM inference is more frequently limited by memory bandwidth and KV (key-value) cache size.

The bottleneck isn't just processing power. LLMs need to store context from previous tokens to generate the next one, creating what's called a KV cache. This cache holds attention keys and values for every token you've processed so far, and it grows with your conversation length.

Based on standard KV cache scaling in recent inference studies, large models running with 8K context windows at batch size 32 can reach tens to hundreds of gigabytes of KV cache memory—on the same order as or larger than the model weights themselves—depending on precision and architecture.
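
To make that concrete, here's a rough sizing sketch. The architecture numbers below (80 layers, 8 grouped-query KV heads, head dimension 128, FP16) are assumptions for a hypothetical 70B-class model, not figures for any specific one:

```python
# Back-of-envelope KV cache sizing. Per token, each layer stores one key and
# one value vector per KV head, so total size scales linearly with context
# length and batch size.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return per_token * seq_len * batch_size

size_gb = kv_cache_bytes(
    n_layers=80, n_kv_heads=8, head_dim=128,   # hypothetical 70B-class model with GQA
    seq_len=8192, batch_size=32,
) / 1e9
print(f"KV cache: ~{size_gb:.0f} GB")  # ~86 GB with these assumptions
# Without grouped-query attention (64 KV heads instead of 8), the same setup
# would need roughly 8x more.
```

Double the context length or the batch size and the cache doubles with it, which is why long-context, high-concurrency workloads hit memory limits long before they hit compute limits.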

For large language models running in production, memory bandwidth often becomes the primary constraint rather than raw compute, especially at low batch sizes.

How inference costs add up

LLM pricing has compressed substantially. As of early 2026, publicly listed prices for GPT-4o are around $2.50 per million input tokens, and GPT-4o-mini around $0.15 per million, though exact pricing may vary by provider and change over time.

In general, prices are falling, but at production scale with millions of daily requests, the bill still compounds. A single major application serving millions of users might process hundreds of millions of tokens monthly, turning per-token costs into significant annual spending. And a meaningful portion of those queries are semantically similar to ones you've already answered, so you're paying for computations you've already performed.
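
As a rough illustration of how that compounds, here's the arithmetic for a hypothetical workload. Only the $2.50-per-million GPT-4o input price comes from the figures above; the token volumes and the output-token price are assumptions you should replace with your own numbers:

```python
# Illustrative monthly LLM spend. Token volumes and the $10/M output price are
# assumptions for the sake of the arithmetic; only the $2.50/M input price is
# taken from the published GPT-4o figure cited above.

input_tokens = 400_000_000    # hypothetical monthly input volume
output_tokens = 80_000_000    # hypothetical monthly output volume

monthly = input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10.00
print(f"~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year")  # ~$1,800/month, ~$21,600/year

# If 30% of those queries could be served from a semantic cache, roughly 30%
# of that bill disappears before you touch the model at all.
```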

What causes inference bottlenecks in AI apps

When your LLM generates text, it needs to constantly fetch data from memory to process the next token. At low batch sizes (serving individual user requests in real time), your GPU often spends more time waiting on memory than actually computing. The hardware can handle far more operations per second than the memory system can feed it data.

This creates a counterintuitive situation. Adding more compute power won't help if memory can't keep up. Higher batch sizes can shift the bottleneck back toward compute, but most real-time applications can't wait to accumulate large batches: users expect instant responses.
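
A quick back-of-envelope estimate shows why. At batch size 1, every decode step has to stream the full set of model weights (plus the KV cache) through memory. The model size and bandwidth below are illustrative, not benchmarks:

```python
# Lower bound on per-token decode latency when memory bandwidth is the
# bottleneck. Assumes a 70B-parameter model in FP16 served at batch size 1 on
# a GPU with ~3.35 TB/s of HBM bandwidth (H100-class); both are illustrative.

weights_gb = 70e9 * 2 / 1e9   # 70B params * 2 bytes (FP16) -> ~140 GB
bandwidth_gb_per_s = 3350     # ~3.35 TB/s

ms_per_token = weights_gb / bandwidth_gb_per_s * 1000
print(f"~{ms_per_token:.0f} ms per token just reading weights")  # ~42 ms
# Larger batches amortize that weight traffic across many requests, which is
# why throughput-oriented batch serving shifts the bottleneck back to compute.
```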

For many commercial LLM APIs, output tokens are priced higher and decoding is often the main driver of incremental latency, so each additional output token tends to cost more (in both time and money) than an additional input token. A 500-token prompt processes quickly, but generating even 50 tokens of output takes longer.

How to improve inference performance

Inference optimization works at three levels: changing the model itself, adjusting how you serve requests, and choosing the right infrastructure. Start with the highest-impact, lowest-effort wins, which are usually caching and serving strategies.

Model-level optimization options

These techniques modify your model to reduce computational requirements. They require upfront investment but deliver ongoing efficiency gains.

  • Quantization: This reduces model precision from FP32 down to FP16 or INT8, using fewer bits to represent each number, which speeds up math operations and reduces memory usage. Lower precision can hurt accuracy on complex reasoning tasks. Quantization works best when requests can be batched, and the savings compound at scale.
  • Pruning: Pruning removes redundant model parameters. Think of it as trimming fat from your model. You reduce weights that don't contribute much to accuracy. You'll need to retrain or fine-tune after pruning, which takes time and compute resources upfront, but the result is a smaller, faster model.
  • Knowledge distillation: This transfers knowledge from large "teacher" models to smaller "student" models. The smaller model learns to mimic the teacher's outputs, typically maintaining most capabilities while requiring fewer resources for inference.

Each technique involves tradeoffs between speed and accuracy, so test with your actual workload before deploying to production.
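
As an example of the first technique, here's a minimal quantization sketch using Hugging Face transformers with bitsandbytes. The model name is a placeholder and the config API may differ between library versions; treat it as a starting point and validate accuracy on your own eval set:

```python
# Weight-only INT8 quantization at load time (roughly halves memory vs FP16).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder: swap in the checkpoint you serve

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",   # requires the accelerate package
)

inputs = tokenizer("Summarize our refund policy in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```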

Serving-level strategies

These approaches change how you handle requests without modifying the model itself.

  • Dynamic batching: Incoming requests are grouped together to maximize GPU utilization. Instead of processing one request at a time, you wait briefly to collect a batch of requests and process them together. This introduces some latency—a request might wait 50-100ms for the batch to fill—but dramatically improves throughput.
  • Model cascading: Queries are routed to different-sized models based on complexity. Simple questions like "What's the weather?" go to a fast, cheap model. Complex reasoning tasks get sent to your premium model.
  • Speculative decoding: A smaller draft model predicts what comes next, then has a larger model verify those predictions. When the draft model guesses correctly, you get faster generation. When it's wrong, you fall back to the standard approach.

The real power comes from combining techniques. Caching plus batching plus model cascading can deliver cost reductions compared to a naive implementation. Your actual savings depend on your starting point: teams already running optimized setups will see smaller gains than those just beginning optimization.
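
Model cascading, for example, can be as simple as a routing function in front of your LLM client. The heuristic, model names, and call_llm() stub below are placeholders; production routers usually rely on a small classifier or the cheap model's own confidence rather than a length check:

```python
# Toy model-cascading router: cheap model for simple queries, premium model
# for everything else.

CHEAP_MODEL = "small-fast-model"       # placeholder
PREMIUM_MODEL = "large-capable-model"  # placeholder

def call_llm(model: str, prompt: str) -> str:
    # Stub for illustration: replace with your actual client call.
    return f"[{model}] response to: {prompt[:40]}..."

def looks_complex(query: str) -> bool:
    # Naive complexity proxy: long queries or explicit reasoning cues.
    return len(query.split()) > 40 or any(
        kw in query.lower() for kw in ("explain why", "compare", "step by step")
    )

def route(query: str) -> str:
    model = PREMIUM_MODEL if looks_complex(query) else CHEAP_MODEL
    return call_llm(model=model, prompt=query)

print(route("What's the weather?"))                           # routed to the cheap model
print(route("Compare these two pricing plans step by step"))  # routed to the premium model
```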

Why caching matters for LLM inference optimization

Traditional exact-match caching only helps when queries are character-for-character identical. It misses most of the redundancy in natural language, where the same intent shows up in many different phrasings.

How semantic caching works

Semantic caching transforms queries into dense vector embeddings and performs similarity search to identify semantically equivalent cached queries. When similarity scores exceed your threshold, you return the cached response. When no match is found, you call the LLM, then cache the query embedding plus response for future requests.

The difference: exact match caching works through string comparison and only returns cached responses for identical queries. Semantic caching uses an encoder model to create embeddings, then performs approximate nearest neighbor (ANN) search to identify semantically equivalent cached queries even when phrasing differs.
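
Here's a minimal in-process sketch of that loop, using sentence-transformers for embeddings and a plain cosine-similarity check. The model choice and the 0.85 threshold are illustrative; tune the threshold on your own traffic to balance hit rate against wrong-answer risk:

```python
# Toy semantic cache: embed queries, compare against cached embeddings, return
# the stored response when similarity clears a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
cache: list[tuple[np.ndarray, str]] = []            # (query embedding, cached response)

def embed(text: str) -> np.ndarray:
    v = encoder.encode(text)
    return v / np.linalg.norm(v)   # unit-normalize so dot product = cosine similarity

def lookup(query: str, threshold: float = 0.85) -> str | None:
    q = embed(query)
    for emb, response in cache:
        if float(np.dot(q, emb)) >= threshold:
            return response        # cache hit: skip the LLM call entirely
    return None                    # cache miss: call the LLM, then store()

def store(query: str, response: str) -> None:
    cache.append((embed(query), response))

store("How do I reset my password?", "Go to Settings > Security > Reset password.")
print(lookup("How can I reset my password?"))   # likely a hit: close paraphrase
print(lookup("What's your refund policy?"))     # miss: unrelated intent
```

The linear scan is fine for a toy, but it falls over at scale, which is exactly what the ANN search infrastructure described below is for.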

In well-provisioned production systems, cache lookups can typically be kept well under 100ms end-to-end, while direct LLM calls often take hundreds to several thousand milliseconds depending on model and workload.

In AWS chatbot setups, semantic caching delivered responses up to 15x faster compared to direct LLM calls for queries with cache hits. In one published evaluation, AWS reported cost reductions up to 86% while maintaining 91% answer accuracy—but these results depend heavily on dataset characteristics and similarity threshold tuning. Workloads with high query redundancy (like FAQ-style chatbots) tend to see larger gains than diverse, search-like queries.

What you need for production caching

Production semantic caching systems typically require three core components: ANN search infrastructure for efficient similarity search, lightweight embedding models optimized for low-latency generation, and vector storage supporting high-throughput similarity operations.

Redis supports HNSW (Hierarchical Navigable Small World) indexing, a graph-based ANN algorithm optimized for large-scale datasets. HNSW-based indexes in Redis are designed to support very large (up to billion-scale) vector collections and can achieve single-digit millisecond query latencies in well-provisioned environments.
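
As a sketch of what that looks like with redis-py, the snippet below creates an HNSW vector index and runs a KNN lookup. The index parameters (DIM 384, M, EF_CONSTRUCTION) and key names are illustrative, and the exact API may vary across redis-py versions:

```python
# HNSW vector index for cached query embeddings, plus a KNN lookup.
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

try:
    r.ft("cache_idx").create_index(
        [
            TextField("response"),
            VectorField("embedding", "HNSW", {
                "TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE",
                "M": 16, "EF_CONSTRUCTION": 200,   # illustrative HNSW parameters
            }),
        ],
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
except redis.ResponseError:
    pass   # index already exists

# Store one cached entry: embedding as raw float32 bytes alongside the response.
vec = np.random.rand(384).astype(np.float32)   # stand-in for a real query embedding
r.hset("cache:1", mapping={"embedding": vec.tobytes(), "response": "cached answer"})

# Find the 3 nearest cached embeddings to an incoming query vector.
q = (
    Query("*=>[KNN 3 @embedding $vec AS score]")
    .sort_by("score")                 # COSINE score is a distance: lower = more similar
    .return_fields("response", "score")
    .dialect(2)
)
for doc in r.ft("cache_idx").search(q, query_params={"vec": vec.tobytes()}).docs:
    print(doc.score, doc.response)
```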

Build optimized inference without the infrastructure complexity

Most teams stitching together inference optimization end up managing separate tools for semantic caching, vector search, and operational data. One system for embeddings, another for similarity search, maybe a third for traditional caching. Each has different APIs, failure modes, and operational overhead. It works, but you're spending more time on infrastructure than optimization.

Your infrastructure matters as much as your model selection. In a study of mobile retail and travel sites, 0.1‑second improvements were associated with ~8–10% conversion lifts. When traffic spikes occur or your feature gains traction, infrastructure reliability influences whether you capture that revenue opportunity or lose it to timeout errors and degraded user experience.

Companies managing 50+ models with teams of 20+ people often need a platform-based approach rather than fragmented point tools. A unified platform can reduce integration complexity and coordination costs while delivering consistent performance across your stack.

Redis: One platform for semantic caching and vector search

Redis handles semantic caching, vector search, and in-memory operations in one platform. Vector embeddings live alongside your cached responses with sub-10ms similarity search performance. When a query comes in, Redis checks for semantic matches through HNSW indexing, returns cached results for hits, and stores new embeddings for misses. No coordinating multiple systems.

The platform scales to a billion vectors, with sub-10ms similarity search across millions of embeddings in certain production configurations. LangCache, Redis' managed semantic caching service, can reduce infrastructure management overhead. For workloads with high semantic redundancy, internal and partner testing observed cost reductions of 30% or more.

Because Redis operates in-memory with persistence options, your cache state can survive restarts, avoiding rebuilding embeddings in many cases. Native integrations with LangChain, LlamaIndex, and LangGraph mean you can implement semantic caching without custom infrastructure work. The same platform supports RAG pipelines, chatbots, and agent memory systems.
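
For example, a LangChain app can turn on Redis-backed semantic caching with a few lines. Import paths have moved between LangChain releases (langchain, langchain_community, langchain_redis), so adjust to the version you run; the embedding model and threshold here are illustrative:

```python
# Wire a Redis-backed semantic cache into a LangChain app so repeated or
# paraphrased prompts are answered from cache instead of hitting the API.
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisSemanticCache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

set_llm_cache(
    RedisSemanticCache(
        redis_url="redis://localhost:6379",
        embedding=OpenAIEmbeddings(),
        score_threshold=0.2,   # lower = stricter matching; tune on your own traffic
    )
)

llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke("What's your refund policy?").content)    # cache miss: calls the API
print(llm.invoke("How do refunds work for you?").content)  # likely served from the cache
```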

Ready to optimize your inference costs? Try Redis free to test semantic caching with your actual workload, or meet with our team to discuss your specific architecture requirements.
