Bloom filters and other probabilistic data structures for Redis
RedisBloom contains a set of useful probabilistic data structures. Probabilistic data structures allow developers to control the accuracy of returned results while gaining performance and reducing memory. These data structures are ideal for analyzing streaming data and large datasets.
Use these data structures to answer a set of common questions concerning data streams:
- HyperLogLog: How many unique values appeared so far in the data stream?
- Bloom filter and Cuckoo filter: Did the value v already appear in the data stream?
- Count-min sketch: How many times did the value v appear in the data stream?
- Top-K: What are the k most frequent values in the data stream?
- t-digest can be used to answer these questions:
- What fraction of the values in the data stream are smaller than a given value?
- How many values in the data stream are smaller than a given value?
- Which value is smaller than p percent of the values in the data stream? (what is the p-percentile value)?
- What is the mean value between the p1-percentile value and the p2-percentile value?
- What is the value of the n-th smallest / largest value in the data stream? (what is the value with [reverse] rank n)?
Answering each of these questions accurately can require a huge amount of memory, but if you are willing to sacrifice accuracy, you can reduce the memory requirements drastically. Each of these data structures allows you to set a controllable tradeoff between accuracy and memory consumption.
Bloom vs. Cuckoo filters
Bloom filters typically exhibit better performance and scalability when inserting items (so if you're often adding items to your dataset, then a Bloom filter may be ideal). Cuckoo filters are quicker on check operations and also allow deletions.
Using t-digest is simple and straightforward:
Creating a sketch and adding observations
TDIGEST.CREATE key [COMPRESSION compression]initializes a new t-digest sketch (and error if such key already exists). The
COMPRESSIONargument is used to specify the tradeoff between accuracy and memory consumption. The default is 100. Higher values mean more accuracy.
TDIGEST.ADD key value...adds new observations (floating-point values) to the sketch. You can repeat calling TDIGEST.ADD whenever new observations are available.
Estimating fractions or ranks by values
TDIGEST.CDF key value...to retrieve, for each input value, an estimation of the fraction of (observations smaller than the given value + half the observations equal to the given value).
TDIGEST.RANK key value...is similar to TDIGEST.CDF, but used for estimating the number of observations instead of the fraction of observations. More accurately it returns, for each input value, an estimation of the number of (observations smaller than a given value + half the observations equal to the given value).
TDIGEST.REVRANK key value...is similar to TDIGEST.RANK, but returns, for each input value, an estimation of the number of (observations larger than a given value + half the observations equal to the given value).
Estimating values by fractions or ranks
TDIGEST.QUANTILE key fraction...returns, for each input fraction, an estimation of the value (floating point) that is smaller than the given fraction of observations.
TDIGEST.BYRANK key rank...returns, for each input rank, an estimation of the value (floating point) with that rank.
TDIGEST.BYREVRANK key rank...returns, for each input reverse rank, an estimation of the value (floating point) with that reverse rank.
Estimating trimmed mean
TDIGEST.TRIMMED_MEAN key lowFraction highFractionto retrieve an estimation of the mean value between the specified fractions.
This is especially useful for calculating the average value ignoring outliers. For example - calculating the average value between the 20th percentile and the 80th percentile.
Sometimes it is useful to merge sketches. For example, suppose we measure latencies for 3 servers, and we want to calculate the 90%, 95%, and 99% latencies for all the servers combined.
TDIGEST.MERGE destKey numKeys sourceKey... [COMPRESSION compression] [OVERRIDE]merges multiple sketches into a single sketch.
destKeydoes not exist - a new sketch is created.
destKeyis an existing sketch, its values are merged with the values of the source keys. To override the destination key contents, use
Retrieving sketch information
TDIGEST.MAX keyto retrieve the minimal and maximal values in the sketch, respectively.
Both return nan when the sketch is empty.
Both commands return accurate results and are equivalent to
TDIGEST.BYRANK key 0and
TDIGEST.BYREVRANK key 0respectively.
TDIGEST.INFO keyto retrieve some additional information about the sketch.
Resetting a sketch
TDIGEST.RESET keyempties the sketch and re-initializes it.
- Space/Time Trade-offs in Hash Coding with Allowable Errors by Burton H. Bloom.
- Scalable Bloom Filters
- RedisBloom Quick Start Tutorial
- Developing with Bloom Filters
- RedisBloom on Redis Enterprise
- Probably and No: Redis, RedisBloom, and Bloom Filters
- RedisBloom – Bloom Filter Datatype for Redis
Mailing List / Forum
Got questions? Feel free to ask at the RedisBloom forum.