Dynamic batching: a practical how-to guide

June 21, 2026 · 9 minute read

Jim Allen Wallace

You're load-testing a new inference endpoint before rollout. Traffic looks healthy on the client side, but your GPU dashboard tells a different story: utilization stuck at low single digits while requests arrive one at a time. That gap between what your hardware can do and what it's actually doing is the problem dynamic batching is built to address. Inference servers process requests far more efficiently in groups than one by one, but most clients send requests individually. Dynamic batching bridges that gap by combining separate requests into batches on the server side at runtime.

This guide covers what dynamic batching is, why inference servers need it, how it trades latency for throughput on GPUs, where its limits show up for large language models, and how semantic caching cuts repeated work before it reaches the batch queue.

What dynamic batching is & why inference servers need it

Dynamic batching is a server-side mechanism that combines multiple individual inference requests into a single batch at runtime, so clients don't have to pre-form those batches. Inference servers like Triton can group requests into a batch on the fly, and TorchServe and Ray Serve follow the same pattern: aggregate requests, then run them through the model together.

This matters because of GPU underutilization. When you serve single requests sequentially, most of the GPU sits idle between operations. Machine learning (ML) and deep learning (DL) frameworks are designed for batch requests, so feeding them one input at a time wastes hardware. Triton's own docs call batching the most beneficial way to increase GPU utilization.

The throughput gains are real. In a Triton DenseNet Open Neural Network Exchange (ONNX) benchmark, a tuned configuration with dynamic batching and four model instances per GPU raised throughput from 167.5 inferences per second under the default configuration to 323.1. The catch: 99th-percentile (p99) latency climbed from 12.7 ms to 35.8 ms. That trade is the heart of batching, and it's worth understanding why the hardware behaves this way.
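To put that trade in proportion, here is a quick back-of-the-envelope calculation using only the four figures quoted above. The variable names are ours, chosen for illustration.

```python
# Figures quoted above for the Triton DenseNet ONNX benchmark.
default_throughput = 167.5  # inferences/second, default configuration
tuned_throughput = 323.1    # inferences/second, dynamic batching + 4 instances per GPU
default_p99_ms = 12.7       # p99 latency, default configuration
tuned_p99_ms = 35.8         # p99 latency, tuned configuration

# Ratio of tuned to default for each metric.
throughput_gain = tuned_throughput / default_throughput
latency_growth = tuned_p99_ms / default_p99_ms

print(f"throughput: {throughput_gain:.2f}x")
print(f"p99 latency: {latency_growth:.2f}x")
```

Throughput roughly doubles (about 1.93x) while p99 latency grows by about 2.82x, so in this benchmark the tail latency cost rises faster than the throughput benefit.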

How batching trades latency for throughput on GPUs

Batching helps so much because of a single hardware fact: GPU inference is usually bottlenecked on memory bandwidth, not compute. To run a model, the GPU has to load weights from high-bandwidth memory. Combining multiple requests lets it reuse that weight-loading work across more inputs instead of paying the same cost for each request, one at a time.

This is the roofline model in action. A kernel is memory-bound when it's limited by data transfer and compute-bound when it's limited by arithmetic throughput. At batch size 1, transformer decode often has low arithmetic intensity, so the hardware can't keep its compute units busy. Packing more requests per weight load raises that intensity and pushes utilization up, which is why batching pays off under real traffic.
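A rough sketch of that roofline reasoning, assuming a single dense layer `y = x @ W` in 16-bit precision. The layer size and the hardware peaks below are hypothetical round numbers picked for illustration, not the specs of any particular GPU.

```python
def arithmetic_intensity(batch_size: int, d_in: int, d_out: int,
                         bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for y = x @ W, with x: [batch, d_in] and W: [d_in, d_out]."""
    # One multiply and one add per weight, per input row.
    flops = 2 * batch_size * d_in * d_out
    # Weights are loaded once and shared by every row in the batch.
    weight_bytes = d_in * d_out * bytes_per_value
    # Inputs are read and outputs written once per row.
    activation_bytes = batch_size * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def is_memory_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """A kernel is memory-bound when its intensity sits below the roofline ridge point."""
    return intensity < peak_flops / peak_bandwidth


# Hypothetical accelerator: 300 TFLOP/s of compute, 2 TB/s of memory bandwidth.
PEAK_FLOPS = 300e12
PEAK_BANDWIDTH = 2e12

for batch in (1, 64, 256):
    ai = arithmetic_intensity(batch, d_in=4096, d_out=4096)
    bound = "memory" if is_memory_bound(ai, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
    print(f"batch={batch:<4} intensity={ai:7.2f} FLOP/byte -> {bound}-bound")
```

Under these assumptions, intensity at batch size 1 is just under 1 FLOP per byte, far below the ridge point of 150, and it grows almost linearly with batch size until the layer crosses into compute-bound territory.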

The trade-off is straightforward once you see the numbers, but benchmark scope matters. In a TensorRT image inference sweep, larger batches reduced per-image processing time through amortization, yet aggregate throughput fell as batch size grew in that test, and end-to-end latency rose monotonically. In dynamic batching more broadly, the throughput benefit shows up when that amortization improves GPU utilization under real traffic.

Gains aren't infinite, and they're not uniform across models. In one analysis of theoretical maximum throughput across batch sizes, the lightweight ResNet-50 model showed a more than 200% increase from larger batches, while RoBERTa-B gained only 26.5%. Once a heavy model saturates GPU resources, extra batching yields diminishing returns. So bigger batches usually mean more throughput, up to the point where you hit memory pressure or saturate the hardware.

Static vs. dynamic batching: the timeout window that balances both

The throughput-latency trade-off raises an obvious question: how do you control it? The answer is the timeout window, and it's the main thing separating static batching from dynamic batching.

Static batching pushes the work onto the client. Your client assembles a fixed-size batch before sending it, and the server just runs whatever shows up. That's fine for offline jobs like processing a large document corpus overnight, where latency isn't a concern. Dynamic batching, by contrast, assembles batches on the server at runtime, which is the better fit for latency-sensitive production deployments like generating images in response to user input.

Most serving frameworks follow the same pattern: wait briefly to accumulate a larger batch, but dispatch early if the target batch size fills up first. When a preferred-size batch can't be formed from what's currently queued, the batcher keeps waiting only as long as no request is delayed past the configured maximum queue delay. If a new request arrives in time to complete the batch, the batch is dispatched immediately. If the delay expires first, the batch is dispatched as is.

The mental model is simple. Maximum batch size sets the upper bound, and the timeout sets how long the server is allowed to wait. A shorter timeout protects latency but leaves more GPU capacity unused, while a longer timeout improves the odds of fuller batches at the cost of queueing delay. The win is that you can tune that balance on the server without changing every client.
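Here is a minimal simulation of that mental model, not any framework's actual batcher. It assumes the server is always free to run a batch the moment it is dispatched (model execution time is ignored), and the function name and parameters are ours.

```python
from typing import List, Sequence, Tuple

# (dispatch time in ms, arrival times in ms of the requests in the batch)
Batch = Tuple[float, List[float]]


def simulate_dynamic_batching(arrivals_ms: Sequence[float],
                              max_batch_size: int,
                              max_queue_delay_ms: float) -> List[Batch]:
    """Group arrivals into batches using a size cap and a queue-delay timeout."""
    batches: List[Batch] = []
    current: List[float] = []
    deadline = 0.0
    for t in sorted(arrivals_ms):
        # The oldest queued request hit its deadline before this arrival:
        # the partial batch was dispatched as is.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        # The first request of a new batch starts the timeout clock.
        if not current:
            deadline = t + max_queue_delay_ms
        current.append(t)
        # A full batch is dispatched immediately, without waiting for the timeout.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


arrivals = [0, 1, 2, 10, 11, 30]
long_wait = simulate_dynamic_batching(arrivals, max_batch_size=4, max_queue_delay_ms=5)
short_wait = simulate_dynamic_batching(arrivals, max_batch_size=4, max_queue_delay_ms=1)
print(long_wait)
print(short_wait)
```

With the 5 ms timeout, the six requests form three batches. With the 1 ms timeout they form four smaller ones, each dispatched sooner. That is the balance in miniature: the only thing that changed is a server-side setting.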

Continuous batching & where dynamic batching's limits show

Dynamic batching works well for models with fixed-length outputs, but it hits a wall with autoregressive LLMs. The reason is structural: each LLM request generates output tokens one at a time, and you don't know how many tokens a request needs until it emits an end-of-sequence token. Requests in the same batch finish at unpredictably different times.

With request-level dynamic batching, every request waits for the longest sequence in the batch to finish before any slot frees up, so short requests get stuck behind long ones. Slots whose generations finish early sit idle until then, which underutilizes the GPU. Fixed batches can avoid some decode stalls, but they still force new requests to wait until the current batch completes, which inflates time-to-first-token.

Continuous batching addresses this by scheduling at the granularity of individual model iterations instead of full requests. A completed request is evicted from the batch immediately, and a new request fills the slot without waiting for the rest of the batch. With iteration-level scheduling, a newly arrived request can be considered after waiting only a single iteration. Continuous batching now shows up in vLLM, TensorRT-LLM, and other serving stacks, sometimes under the name in-flight batching.
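The difference in scheduling granularity can be sketched with a toy model. It assumes every request is already queued, every decode iteration costs the same regardless of how many slots are filled, and prefill is ignored. It is an illustration of the idea, not how vLLM or TensorRT-LLM schedule internally, and the output lengths are made up.

```python
from collections import deque
from typing import List


def request_level_iterations(output_lengths: List[int], slots: int) -> int:
    """Batches run to completion: each batch costs as many iterations as its longest request."""
    total = 0
    for start in range(0, len(output_lengths), slots):
        batch = output_lengths[start:start + slots]
        total += max(batch)
    return total


def iteration_level_iterations(output_lengths: List[int], slots: int) -> int:
    """Finished requests free their slot right away and a waiting request takes it."""
    waiting = deque(output_lengths)
    running: List[int] = []  # remaining iterations for each in-flight request
    iterations = 0
    while waiting or running:
        # Fill any free slots before the next iteration.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        iterations += 1
        # Requests with one iteration left finish now and are evicted.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return iterations


# Hypothetical output lengths in tokens, served with two batch slots.
lengths = [2, 8, 3, 2, 7, 2]
request_level = request_level_iterations(lengths, slots=2)
iteration_level = iteration_level_iterations(lengths, slots=2)
print(request_level, iteration_level)
```

For these lengths the request-level schedule needs 18 iterations and the iteration-level schedule needs 14, because short requests no longer hold a slot hostage until the longest one in their batch finishes.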

The throughput difference is large. One benchmark reported a 36.9× throughput improvement over FasterTransformer at the same latency level on GPT-3 175B, and a later study measured 2–4× higher throughput than FasterTransformer and Orca at the same latency, with the gap widening for longer sequences and larger models.

The distinction between the two approaches really comes down to scheduling granularity, and each has a trade-off. One comparison reported vLLM's iteration-level scheduling at up to 24× higher throughput than a request-level dynamic batching system under high concurrency, while the request-level system showed lower time-to-first-token at low concurrency. Continuous batching tends to maximize throughput with more variable per-request latency, while request-level dynamic batching offers more predictable latency. Continuous batching isn't free of limits either: long prefills can still stall decode steps.

The requests you batch best are the ones you never send

All this batching machinery only helps with requests that actually reach the server. The cheapest request to process is the one you never process at all. That's where caching comes in, and for repetitive LLM workloads, semantic caching is one of the most useful levers you have.

Redis is a real-time data platform for AI workloads that pairs sub-millisecond latency with vector search and semantic caching, so repeated LLM requests can be served before they ever reach the inference queue. That matters because LLM traffic tends to repeat itself: around 31% of queries in one analysis were contextually repeatable. Traditional exact-match caching struggles here because it can't tell that "give me suggestions for a comedy movie" and "recommend a funny movie" mean the same thing, so it logs a miss and sends both to the model. Every duplicate adds depth to your batch queue and burns GPU time you didn't need to spend.

Semantic cache hits cut that repeat traffic before the batcher ever has to schedule it, which leaves fewer real requests competing for GPU time.

How semantic caching cuts repeated requests before the batch queue

Semantic caching intercepts queries before they reach the LLM pipeline by recognizing when two queries mean the same thing, even when the wording differs. Under the hood, it converts each incoming query into vector embeddings, searches a vector store for previously cached entries above a similarity threshold, and returns the cached response on a match. On a miss, it calls the LLM and stores the result for next time.
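The lookup flow above can be sketched in a few lines. This is a toy, in-memory stand-in rather than the Redis or RedisVL implementation: the linear scan takes the place of a vector store search, the hand-written three-dimensional vectors take the place of a real embedding model, and `fake_llm` takes the place of a model call. All names here are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # (query embedding, cached response)

    def lookup(self, query: str) -> Optional[str]:
        """Return the closest cached response if it clears the similarity threshold."""
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: ToySemanticCache, call_llm: Callable[[str], str]) -> str:
    cached = cache.lookup(query)
    if cached is not None:
        return cached               # hit: the model is never called
    response = call_llm(query)      # miss: call the model and remember the result
    cache.store(query, response)
    return response


# Hand-written stand-in embeddings; a real system would use an embedding model.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "give me suggestions for a comedy movie": [0.9, 0.1, 0.0],
    "recommend a funny movie": [0.85, 0.2, 0.0],
    "how do I reset my password": [0.0, 0.1, 0.95],
}
llm_calls: List[str] = []


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"response to: {query}"


cache = ToySemanticCache(embed=TOY_EMBEDDINGS.__getitem__, threshold=0.9)
first = answer("give me suggestions for a comedy movie", cache, fake_llm)
second = answer("recommend a funny movie", cache, fake_llm)
third = answer("how do I reset my password", cache, fake_llm)
print(len(llm_calls))
```

Three queries arrive but only two reach the stand-in model: the second movie question is close enough to the first to reuse its stored response, while the password question misses and is stored for next time.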

This is different from using a vector store for retrieval-augmented generation. Semantic caching stores complete LLM responses, not document chunks, and the goal is to skip the LLM on a hit rather than feed retrieved context into one. An exact-match cache would treat three differently worded VPN setup questions as three unique queries and call the model three times. A semantic cache can recognize them as equivalent and reuse a stored response when the similarity score and app boundaries make that reuse safe.

The setting that matters most is the similarity threshold. Set it too loose and you can serve wrong answers; too tight and your hit rate drops. Teams typically validate cache hits against domain risk, vector embedding behavior, and traffic patterns, often with metadata boundaries like tenant, locale, and model version.
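One way to reason about that setting is to score a labeled sample of query pairs and see what each candidate threshold would have done. The sketch below assumes you already have such a sample. The scores and labels are invented for illustration, and the function name is ours.

```python
from typing import List, Tuple

# (similarity score, whether a human judged the two queries equivalent)
LabeledPair = Tuple[float, bool]


def evaluate_threshold(pairs: List[LabeledPair], threshold: float) -> Tuple[float, float]:
    """Return (hit rate, share of hits that would have served a wrong answer)."""
    hits = [(score, same) for score, same in pairs if score >= threshold]
    wrong = sum(1 for _, same in hits if not same)
    hit_rate = len(hits) / len(pairs)
    false_hit_rate = wrong / len(hits) if hits else 0.0
    return hit_rate, false_hit_rate


# Hypothetical labeled sample, not measured data.
sample: List[LabeledPair] = [
    (0.98, True), (0.95, True), (0.93, False), (0.90, True),
    (0.86, False), (0.80, True), (0.72, False), (0.60, False),
]

for threshold in (0.85, 0.90, 0.94):
    hit_rate, false_hit_rate = evaluate_threshold(sample, threshold)
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.3f} false_hits={false_hit_rate:.2f}")
```

On this made-up sample, the loosest threshold hits on 62.5% of pairs but 40% of those hits are wrong, while the tightest serves no wrong answers at a 25% hit rate. Running the same evaluation per tenant, locale, or model version is one way to respect the metadata boundaries mentioned above.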

The payoff shows up in both latency and cost. Redis LangCache benchmarks reported cache hits up to 15x faster and inference costs up to 73% lower in high-repetition workloads. How close you get depends heavily on how repetitive your query patterns actually are.

Throughput & avoided requests are two separate budgets

Dynamic batching helps inference servers run more work per GPU cycle by grouping requests at runtime, while continuous batching pushes that scheduling idea down to the iteration level for autoregressive LLMs. Caching works in front of that stack, removing repeat work before the inference server sees it. The two are orthogonal, so they stack: batching is the better lever for high-throughput, latency-tolerant traffic, while semantic caching pays off most for FAQ bots, helpdesks, and internal knowledge assistants where query repetition runs high.

Keeping data in memory for sub-millisecond access makes Redis the natural home for the semantic caching layer that sits in front of your inference stack. Because vector search, caching, and operational data live behind one API, you can run semantic caching with vector search inside your latency budget without stitching together a separate vector database and cache. RedisVL gives Python devs a SemanticCache with a configurable distance threshold and a vector embedding model, and Redis LangCache offers a fully managed version.

If you're spending more than you'd like on LLM inference, the requests you avoid are the cheapest wins available. Try Redis free to see how semantic caching works with your workload, or talk to our team about optimizing your AI infrastructure.
