Tokenization in LLMs: What every AI app developer needs to know
Every time you send a prompt to an LLM, your text gets chopped into tokens before anything else happens. Tokens are discrete integer IDs that the model uses to look up the vectors it actually processes, and that conversion step directly affects how much you pay, how fast your app responds, and how much context you can fit into a single request.
Tokenization is easy to overlook until it shows up in your bill or your context window runs out. But if you're building apps on top of LLMs, understanding how that conversion works gives you real control over cost and performance. This guide covers how tokenization works, how tokens relate to vector embeddings, and ways to reduce its impact on your app's speed and budget.
What is tokenization? The text-to-numbers pipeline
LLMs don't process raw text. They work with tokens, common character sequences drawn from a fixed vocabulary, and learn statistical relationships between them. Your input gets split into these tokens — sometimes at word boundaries, sometimes mid-word, depending on the tokenizer.
A useful developer approximation: about 1 token per 4 characters, or roughly three-quarters of a word. The sentence "Hello, how are you?" takes up more tokens than you'd guess by counting words, because punctuation and spacing get their own tokens too.
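That heuristic is easy to turn into a quick pre-flight estimate. This is a rough sketch of the 4-characters-per-token rule of thumb, not a real tokenizer; use your provider's tokenizer for exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic.

    This is only an approximation for budgeting; actual counts depend
    on the model's tokenizer and vary by language and content.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, how are you?"))  # 19 chars -> ~5 tokens
```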
A typical tokenization pipeline has four stages (some implementations fuse or reorder these, but the concepts are consistent):
- Pre-tokenization: Split raw text on whitespace and punctuation rules. "Hello, world" becomes ["Hello", ",", "world"].
- Subword segmentation: Apply learned merge rules from the model's vocabulary. Common words stay whole, but longer or rarer words get broken into pieces. "Transformers" becomes ["Transform", "ers"].
- Vocabulary lookup: Each piece maps to an integer ID. ["Transform", "ers"] becomes [9602, 364].
- Vector embedding lookup: Those integer IDs get converted into dense float vectors that the transformer actually processes.
These four stages are the full path from human-readable text to the vectors the model consumes.
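The four stages can be sketched end to end. Everything here is illustrative: the five-entry vocabulary, the 2-dimensional embeddings, and the greedy longest-prefix segmentation are stand-ins for what real models learn during training:

```python
import re

# Toy vocabulary and embedding table; real models learn both during training.
VOCAB = {"Hello": 0, ",": 1, "world": 2, "Transform": 3, "ers": 4}
EMBEDDINGS = {i: [0.1 * i, -0.1 * i] for i in VOCAB.values()}  # 2-dim for illustration

def pre_tokenize(text):
    # Stage 1: split on whitespace and punctuation.
    return re.findall(r"\w+|[^\w\s]", text)

def segment(word):
    # Stage 2: greedy longest-prefix match against the vocabulary
    # (a stand-in for learned BPE merge rules).
    pieces, rest = [], word
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in VOCAB:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            pieces.append(rest[0])  # unknown character falls through on its own
            rest = rest[1:]
    return pieces

def tokenize(text):
    pieces = [p for w in pre_tokenize(text) for p in segment(w)]
    ids = [VOCAB[p] for p in pieces if p in VOCAB]   # Stage 3: vocabulary lookup
    vectors = [EMBEDDINGS[i] for i in ids]           # Stage 4: embedding lookup
    return pieces, ids, vectors

pieces, ids, _ = tokenize("Transformers")
print(pieces, ids)  # ['Transform', 'ers'] [3, 4]
```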
This pipeline also creates an important limitation worth understanding early. Because tokenization happens before the model sees anything, the model operates on token-sized units rather than individual characters. A word like "strawberry" may be a single token in many tokenizers, which means the model has no built-in way to inspect its individual letters. This is one reason token-based processing can contribute to failures on tasks like character counting, spelling, and some arithmetic.
There's also a hard coupling between a tokenizer and its model. Each model works best with the specific vocabulary it was trained on, so swapping in a different tokenizer can produce different token ID sequences and unreliable output. When you're choosing an LLM for your app, the tokenizer comes as a package deal.
Three methods of tokenization
The pipeline above handles the mechanics, but the key design decision is step two: how text gets segmented into tokens. There are three approaches, each with different trade-offs.
Word-level tokenization
Word-level tokenization treats each unique word as a token. The problem is vocabulary size: every surface form needs its own entry, so the vocabulary balloons and still can't cover everything. In traditional word-level tokenization, any word not in the vocabulary may be mapped to an unknown token, and morphological variations like "run," "running," and "ran" are treated as separate, unrelated tokens.
Character-level tokenization
Character-level tokenization makes each character its own token. This handles any input, but sequences get much longer. Since the transformer's attention mechanism scales quadratically with sequence length, this increases compute cost.
Subword tokenization
Subword tokenization splits text into units between word-level and character-level. Frequent sequences get their own token, while rarer words decompose into smaller recognizable pieces. This gives you a bounded vocabulary with more efficient handling of morphological variation across languages, which is why subword methods became a practical default for many large language models.
The three subword algorithms
Three algorithms implement subword tokenization:
- Byte Pair Encoding (BPE) iteratively merges the most frequent character pair in the training corpus. It's deterministic and the basis for most production tokenizers. Byte-level BPE starts from byte values rather than Unicode characters, so any character can be represented without producing an unknown token.
- WordPiece selects merges using a likelihood-based scoring criterion rather than raw frequency alone.
- Unigram is probabilistic. It scores and prunes vocabulary items based on how well they tokenize training data.
These differences matter because they change how text gets split, how large the vocabulary becomes, and how consistently rare words get represented. Vocabulary size also affects efficiency. Models with larger vocabularies tend to produce fewer tokens for the same input.
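The frequency-merge idea behind BPE is concrete enough to sketch. This toy training loop learns merge rules from the classic "low/lower/lowest" example corpus; it illustrates the algorithm's core step, not a production implementation (which would start from bytes and handle pre-tokenization boundaries):

```python
from collections import Counter

def merge_pair(word, pair, merged):
    """Replace every occurrence of an adjacent pair with its merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def bpe_merges(corpus_words, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus_words]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge_pair(w, best, best[0] + best[1]) for w in words]
    return merges

corpus = ["low", "low", "lower", "lowest"]
print(bpe_merges(corpus, 2))  # [('l', 'o'), ('lo', 'w')]
```

After two merges the tokenizer already treats "low" as a single unit, which is exactly how frequent sequences earn their own vocabulary entries.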
Build fast, accurate AI apps that scale
Get started with Redis for real-time AI context and retrieval
Tokenization vs. vector embeddings
Tokenization and vector embeddings are different stages of the same pipeline, but they get mixed up often. Tokenization produces integer IDs like [9602, 364]. Vector embeddings convert those IDs into dense float vectors like [[0.23, -0.11, ...], [...]] that capture meaning in numerical space.
If you're building retrieval-augmented generation (RAG) apps, vector embeddings play two roles you'll care about. Token vector embeddings are internal to the LLM — each token ID maps to a learned vector during inference. Retrieval vector embeddings are produced by separate embedding models and stored in a vector database for similarity search. They encode the semantic meaning of entire passages into a single dense vector.
Both types connect back to token cost. Every retrieved chunk that gets concatenated into your prompt adds to your token count, because the tokenization step converts those text chunks into the billable units you pay for. A 100-token passage might be stored as one retrieval vector embedding, but the LLM still tokenizes every character of that passage when it lands in the prompt.
Once retrieval becomes part of the path from user query to final prompt, infrastructure speed matters too. Redis supports vector search and lets your app store and retrieve vector embeddings alongside operational data without adding a separate vector database to the stack.
Why tokenization directly shapes your app's performance
That distinction matters because tokenization isn't just a preprocessing detail. In production, tokens shape both cost and latency, which makes them a hard constraint on what your app can do.
Cost scales with token count
Most major LLM providers bill per token, and output tokens often cost more than input. That pricing gap matters because the more generation you ask for, the faster costs rise.
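A small calculator makes the asymmetry tangible. The per-million-token prices below are hypothetical placeholders, not any provider's actual rates; substitute your provider's current pricing:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m=3.00, output_price_per_m=15.00):
    """Estimate per-request cost in dollars.

    The default prices are hypothetical placeholders; check your
    provider's current rates. The point is the asymmetry: output
    tokens often cost several times more than input tokens.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A long prompt with a short answer vs. a short prompt with a long answer:
print(request_cost(4000, 500))   # input-heavy request
print(request_cost(500, 4000))   # output-heavy request costs ~3x more
```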
Latency can scale worse than linearly
The transformer's self-attention mechanism scales roughly quadratically with sequence length. Practically, longer sequences can make memory consumption and inference time grow as O(N²) for standard attention implementations, though the exact scaling depends on the model and sequence-length regime. Optimizations like FlashAttention reduce memory overhead at the systems level, but the compute cost of standard attention remains quadratic.
Reasoning tokens are expensive
Some providers offer reasoning modes where the model generates intermediate reasoning tokens that use context space and get billed as output tokens, but whose full content is not visible in the API response. If the context window fills before producing visible output, you still get billed for everything consumed.
Put together, token count affects both what you pay and how long users wait.
The token budget mental model
That pressure is easier to manage with a mental model: treat the context window as a budget. Every token draws from that same finite pool, whether it comes from your system prompt, user input, retrieved context, or model output.
Four consumers compete for that budget:
- System prompt: Role definitions and instructions sent on every request, creating a fixed per-request tax.
- User input: The actual query, variable in length and largely outside your control.
- Retrieved context (RAG): External knowledge injected into the prompt. RAG helps with knowledge accuracy, but it doesn't eliminate context pressure. It shifts the problem into retrieval design.
- Model output: The generated response, billed at the premium output rate.
Those buckets interact directly. When one grows, something else has to shrink, or your quality and cost drift in the wrong direction. The trade-off is straightforward: every extra token in one bucket reduces room for the others. Quality can also degrade before you hit the hard limit, so don't treat the context ceiling as your safe quality target. That's why prompt design, retrieval, and output controls have to work together.
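As a sketch of that accounting, here's a pre-flight budget check. The 128,000-token context window is an illustrative assumption; substitute your model's documented limit, and note that output tokens must be reserved up front because they draw from the same pool:

```python
def fits_budget(system_tokens, user_tokens, retrieved_tokens,
                max_output_tokens, context_window=128_000):
    """Check whether a request fits the model's context window.

    context_window is an illustrative placeholder; use your model's
    documented limit. Output tokens draw from the same pool as the
    prompt, so reserve them before the request is sent.
    """
    used = system_tokens + user_tokens + retrieved_tokens + max_output_tokens
    return used <= context_window, context_window - used

ok, remaining = fits_budget(1_500, 800, 6_000, 4_000)
print(ok, remaining)  # True 115700
```

When `remaining` goes negative, something has to shrink before the request goes out, and that forced choice is the budget trade-off in code form.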
Anthropic's engineering team has a useful framing for the discipline around managing this budget: context engineering, the set of strategies for curating the optimal set of tokens during inference. It's the natural progression beyond prompt engineering. Your job isn't just writing instructions. It's actively managing what occupies the finite context window at inference time.
Make your AI apps faster and cheaper
Cut costs by up to 90% and lower latency with semantic caching powered by Redis.
Reducing tokenization's impact on app speed & cost
With that budget model in place, the goal is simple: send fewer tokens and avoid paying for the same work twice. These optimizations compound, from easiest to most impactful.
Prompt tightening & provider caching
The fastest wins require no new infrastructure. Remove unnecessary verbosity from system prompts, specify response format and length limits explicitly, and structure prompts with static content first to improve provider-native cache reuse.
Smarter retrieval
Fixed-size document chunks can break semantic context and waste tokens. A useful rule of thumb: smaller chunks help precision-focused search, larger chunks help completeness, but you still need to respect token boundaries. The right strategy depends on your workload.
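One common way to soften the fixed-size problem is overlapping chunks, sketched below. Words stand in for tokens here for simplicity; in production you'd size chunks against the embedding model's actual tokenizer output, and the chunk size and overlap values are illustrative defaults to tune per workload:

```python
def chunk_by_tokens(words, chunk_size=100, overlap=20):
    """Split a word sequence into overlapping chunks.

    Words stand in for tokens in this sketch. The overlap preserves
    context that a hard boundary would otherwise sever, at the cost
    of some duplicated tokens in the index.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(250)]
chunks = chunk_by_tokens(words)
print(len(chunks), [len(c) for c in chunks])  # 3 [100, 100, 90]
```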
Semantic caching
Semantic caching intercepts LLM requests before they hit the provider API and checks whether a semantically equivalent query has already been answered. Unlike exact-match caching, it uses vector embeddings to match by meaning, so "What's machine learning?" and "Can you explain ML?" can map to the same cached response.
The catch is speed. Your cache lookup has to be much faster than the LLM API call for the user experience to improve. In benchmarks, Redis LangCache reported up to 70% lower LLM costs, with cache hits returning up to 15x faster than fresh inference. The payoff gets bigger as token usage rises, especially in agentic workloads.
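The core mechanic is simple enough to sketch. This minimal cache compares query embeddings by cosine similarity; the character-frequency "embedding" and the 0.9 threshold are illustrative stand-ins for a real embedding model and a tuned threshold, and a production system would use an indexed vector store rather than a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Minimal semantic cache sketch: match queries by meaning, not exact text."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # stand-in for a real embedding model
        self.threshold = threshold  # illustrative; tune against your traffic
        self.entries = []           # (embedding, response) pairs

    def get(self, query):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy embedding: character-frequency vector (a real system uses a model).
def toy_embed(text):
    text = text.lower()
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz "]

cache = SemanticCache(toy_embed)
cache.put("what is machine learning", "ML is ...")
print(cache.get("what is machine learning?"))  # near-identical query hits
```

The toy embedding only catches near-identical phrasings; the whole point of a real embedding model is that paraphrases like "Can you explain ML?" also land above the threshold.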
These tactics work best as a stack, not as isolated tricks. Cut prompt bloat first, then make retrieval more selective, then cache what repeats.
Token budgets shape LLM app performance
Understanding how tokens work — from subword vocabularies to the attention costs longer sequences trigger — gives you real levers to pull on cost, latency, and context capacity. The best results come from managing the whole path from retrieval to prompt assembly instead of tuning one piece in isolation.
Redis LangCache applies semantic caching as a fully managed service, with minimal code changes to integrate. Your vector embeddings and cached responses live in the same Redis environment, so instead of managing separate systems for caching, vector search, and app state, you consolidate infrastructure and keep retrieval fast.
Try Redis free to test semantic caching and vector search with your own workloads, or talk to our team about optimizing your AI infrastructure costs.
Get started with Redis today
Speak to a Redis expert and learn more about enterprise-grade Redis today.
