Performance
Learn how Redis vector sets behave under load and how to optimize for speed and recall
Query performance
Vector similarity queries using the VSIM command are threaded by default. Redis uses up to 32 threads to process these queries in parallel.
- VSIM performance scales nearly linearly with available CPU cores.
- Expect ~50,000 similarity queries per second for a 3M-item set with 300-dim vectors using int8 quantization.
- Performance depends heavily on the EF parameter:
  - Higher EF improves recall, but slows down search.
  - Lower EF returns faster results with reduced accuracy.
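As a minimal sketch of how the EF tradeoff shows up in a query, the helper below builds the argument list for a VSIM call with an optional EF value. The function name `build_vsim_args`, the key `word_embeddings`, and the sample vector are hypothetical; only the VSIM argument order (`VALUES`, `WITHSCORES`, `COUNT`, `EF`) is taken from the command syntax.

```python
from typing import List, Optional, Sequence


def build_vsim_args(
    key: str,
    query: Sequence[float],
    count: int = 10,
    ef: Optional[int] = None,
) -> List[object]:
    """Build the argument list for a VSIM query (hypothetical helper)."""
    args: List[object] = ["VSIM", key, "VALUES", len(query), *query]
    args += ["WITHSCORES", "COUNT", count]
    if ef is not None:
        # A larger EF explores more candidates: better recall, slower search.
        args += ["EF", ef]
    return args


# Same query, two exploration factors: one tuned for speed, one for recall.
fast = build_vsim_args("word_embeddings", [0.1, 0.2, 0.3], ef=50)
accurate = build_vsim_args("word_embeddings", [0.1, 0.2, 0.3], ef=500)
print(fast)
print(accurate)

# With redis-py and a running server, a list like this could be sent as:
#   import redis
#   r = redis.Redis(decode_responses=True)
#   r.execute_command(*fast)
```

Timing both variants against your own data is the most reliable way to see how much latency a given recall improvement costs.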
Insertion performance
Inserting vectors with the VADD command is more computationally expensive than querying:
- Insertion is single-threaded by default.
- Use the CAS option to offload candidate graph search to a background thread.
- Expect a few thousand insertions per second on a single node.
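The following sketch shows where the CAS option sits in a VADD call. The helper `build_vadd_args`, the key, and the element name are hypothetical; the option order (`CAS` before the quantization keyword) follows the VADD command syntax.

```python
from typing import List, Optional, Sequence

QUANT_MODES = ("Q8", "BIN", "NOQUANT")


def build_vadd_args(
    key: str,
    vector: Sequence[float],
    element: str,
    cas: bool = False,
    quant: Optional[str] = None,
) -> List[object]:
    """Build the argument list for a VADD call (hypothetical helper)."""
    if quant is not None and quant not in QUANT_MODES:
        raise ValueError(f"unknown quantization mode: {quant}")
    args: List[object] = ["VADD", key, "VALUES", len(vector), *vector, element]
    if cas:
        # CAS moves the candidate graph search to a background thread.
        args.append("CAS")
    if quant is not None:
        args.append(quant)
    return args


print(build_vadd_args("word_embeddings", [0.1, 0.2, 0.3], "apple", cas=True))

# With redis-py and a running server, a list like this could be sent as:
#   import redis
#   r = redis.Redis(decode_responses=True)
#   r.execute_command(*build_vadd_args("word_embeddings", [0.1, 0.2, 0.3], "apple", cas=True))
```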
Quantization effects
Quantization greatly impacts both speed and memory:
- Q8 (default): 4x smaller than FP32, high recall, high speed
- BIN (binary): 32x smaller than FP32, lower recall, fastest search
- NOQUANT (FP32): Full precision, slower performance, highest memory use
Use the quantization mode that best fits your tradeoff between precision and efficiency.
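To make the 4x and 32x figures concrete, here is a rough back-of-the-envelope calculation for a set of 3M vectors with 300 components each. It counts only the raw vector payload (32, 8, or 1 bit per component) and ignores graph links, element names, and any per-vector metadata, so actual memory use will be higher. The function name is hypothetical.

```python
# Bits stored per vector component in each mode.
BITS_PER_COMPONENT = {"NOQUANT": 32, "Q8": 8, "BIN": 1}


def raw_vector_bytes(num_vectors: int, dim: int, mode: str) -> float:
    """Estimate the raw vector payload in bytes, excluding all overhead."""
    return num_vectors * dim * BITS_PER_COMPONENT[mode] / 8


for mode in ("NOQUANT", "Q8", "BIN"):
    megabytes = raw_vector_bytes(3_000_000, 300, mode) / 1e6
    print(f"{mode}: {megabytes:.1f} MB")
# NOQUANT: 3600.0 MB
# Q8: 900.0 MB
# BIN: 112.5 MB
```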
The examples below show how the different modes affect a simple vector.
Note that even with NOQUANT mode, the values change slightly due to floating point rounding.
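As a rough illustration of these effects, the sketch below simulates the three modes in plain Python without talking to a server. It assumes that Q8 scales each vector by its largest absolute value into the signed 8-bit range and that BIN keeps only the sign of each component; these are simplifying assumptions, not a statement of the exact internal algorithm. The sample vector and function names are hypothetical.

```python
import struct
from typing import List, Sequence


def to_fp32(vec: Sequence[float]) -> List[float]:
    """Round each value to the nearest 32-bit float, as FP32 storage would."""
    return [struct.unpack("f", struct.pack("f", x))[0] for x in vec]


def simulate_q8(vec: Sequence[float]) -> List[float]:
    """Quantize to signed 8-bit integers with max-abs scaling, then map back to floats."""
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0:
        return [0.0] * len(vec)
    scale = max_abs / 127  # assumed scheme: largest component maps to +/-127
    return [round(x / scale) * scale for x in vec]


def simulate_bin(vec: Sequence[float]) -> List[float]:
    """Keep only the sign of each component."""
    return [1.0 if x > 0 else -1.0 for x in vec]


original = [1.262185, 1.958231]
print("NOQUANT:", to_fp32(original))   # very close to the input, but not identical
print("Q8:     ", simulate_q8(original))  # small per-component error
print("BIN:    ", simulate_bin(original))  # magnitudes are lost entirely
```

On a live server, the VEMB command returns the stored approximation of an element's vector, which is the way to check the real values.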
Deletion performance
Deleting large vector sets using the DEL command can cause latency spikes:
- Redis must unlink and restructure many graph nodes.
- Latency is most noticeable when deleting millions of elements.
Save and load performance
Vector sets save and load the full HNSW graph structure:
- Reloading from disk is fast because there's no need to rebuild the graph.
Example: A 3M vector set with 300 components loads in ~15 seconds.
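For capacity planning, the figure above works out to roughly 200,000 vectors per second. The sketch below extrapolates from that single data point; it assumes load time scales linearly with the number of vectors at the same dimensionality and on comparable hardware, which is an assumption rather than a measured result. The function name is hypothetical.

```python
REFERENCE_VECTORS = 3_000_000  # from the example: 3M vectors, 300 components
REFERENCE_SECONDS = 15.0       # approximate load time from the example


def estimate_load_seconds(num_vectors: int) -> float:
    """Linear extrapolation of load time from the single reference point."""
    vectors_per_second = REFERENCE_VECTORS / REFERENCE_SECONDS
    return num_vectors / vectors_per_second


print(estimate_load_seconds(10_000_000))  # 50.0
```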
Summary of tuning tips
Factor | Effect on performance | Tip |
---|---|---|
EF | Slower queries but higher recall | Start low (for example, 200) and tune upward |
M | More memory per node, better recall | Use defaults unless recall is too low |
Quant type | Binary is fastest, FP32 is slowest | Use Q8 or BIN unless full precision needed |
CAS | Faster insertions with threading | Use when high write throughput is needed |
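The "start low and tune upward" advice for EF can be sketched as a simple search loop. The code below is one possible approach, not a prescribed procedure: `recall_at_k` and `tune_ef` are hypothetical helpers, and `measure_recall` is a callback you would supply. In a real setup that callback could run VSIM with the given EF and compare the results against the same query run with the TRUTH option, which performs an exact linear scan. The recall values in the demo are synthetic placeholders, not benchmark results.

```python
from typing import Callable, Optional, Sequence


def recall_at_k(approx: Sequence[str], exact: Sequence[str]) -> float:
    """Fraction of the exact top-k results that the approximate search also returned."""
    if not exact:
        return 1.0
    return len(set(approx) & set(exact)) / len(exact)


def tune_ef(
    measure_recall: Callable[[int], float],
    target: float,
    start: int = 200,
    max_ef: int = 3200,
) -> Optional[int]:
    """Double EF from `start` until recall reaches `target`, or give up at `max_ef`."""
    ef = start
    while ef <= max_ef:
        if measure_recall(ef) >= target:
            return ef
        ef *= 2
    return None


# Synthetic stand-in for real measurements, used only to exercise the loop.
fake_recall = {200: 0.91, 400: 0.96, 800: 0.99}
chosen = tune_ef(lambda ef: fake_recall[ef], target=0.95, start=200, max_ef=800)
print(chosen)  # 400
```

Doubling keeps the number of measurement rounds small; a finer step between the last failing and first passing value can narrow the choice further.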