Hash vs JSON Storage
Out of the box, Redis provides a variety of data structures that can adapt to your domain specific applications and use cases. In this notebook, we will demonstrate how to use RedisVL with both Hash and JSON data.
Before running this notebook, be sure to
- Have installed redisvland have that environment active for this notebook.
- Have a running Redis Stack or Redis Enterprise instance with RediSearch > 2.4 activated.
For example, you can run Redis Stack locally with Docker:
docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
Or create a FREE Redis Cloud.
# import necessary modules
import pickle
from redisvl.redis.utils import buffer_to_array
from redisvl.index import SearchIndex
# load in the example data and printing utils
data = pickle.load(open("hybrid_example_data.pkl", "rb"))
from jupyterutils import result_print, table_print
table_print(data)
| user | age | job | credit_score | office_location | user_embedding | last_updated | 
|---|---|---|---|---|---|---|
| john | 18 | engineer | high | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?' | 1741627789 | 
| derrick | 14 | doctor | low | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?' | 1741627789 | 
| nancy | 94 | doctor | high | -122.4194,37.7749 | b'333?\xcd\xcc\xcc=\x00\x00\x00?' | 1710696589 | 
| tyler | 100 | engineer | high | -122.0839,37.3861 | b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?' | 1742232589 | 
| tim | 12 | dermatologist | high | -122.0839,37.3861 | b'\xcd\xcc\xcc>\xcd\xcc\xcc>\x00\x00\x00?' | 1739644189 | 
| taimur | 15 | CEO | low | -122.0839,37.3861 | b'\x9a\x99\x19?\xcd\xcc\xcc=\x00\x00\x00?' | 1742232589 | 
| joe | 35 | dentist | medium | -122.0839,37.3861 | b'fff?fff?\xcd\xcc\xcc=' | 1742232589 | 
Hash or JSON -- how to choose?
Both storage options offer a variety of features and tradeoffs. Below we will work through a dummy dataset to learn when and how to use both.
Working with Hashes
Hashes in Redis are simple collections of field-value pairs. Think of it like a mutable single-level dictionary contains multiple "rows":
{
    "model": "Deimos",
    "brand": "Ergonom",
    "type": "Enduro bikes",
    "price": 4972,
}
Hashes are best suited for use cases with the following characteristics:
- Performance (speed) and storage space (memory consumption) are top concerns
- Data can be easily normalized and modeled as a single-level dict
Hashes are typically the default recommendation.
# define the hash index schema
hash_schema = {
    "index": {
        "name": "user-hash",
        "prefix": "user-hash-docs",
        "storage_type": "hash", # default setting -- HASH
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {"name": "office_location", "type": "geo"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }
        }
    ],
}
# construct a search index from the hash schema
hindex = SearchIndex.from_dict(hash_schema, redis_url="redis://localhost:6379")
# create the index (no data yet)
hindex.create(overwrite=True)
# show the underlying storage type
hindex.storage_type
<StorageType.HASH: 'hash'>
Vectors as byte strings
One nuance when working with Hashes in Redis, is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example of that can be seen below:
# show a single entry from the data that will be loaded
data[0]
{'user': 'john',
 'age': 18,
 'job': 'engineer',
 'credit_score': 'high',
 'office_location': '-122.4194,37.7749',
 'user_embedding': b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?',
 'last_updated': 1741627789}
# load hash data
keys = hindex.load(data)
!rvl stats -i user-hash
Statistics:
╭─────────────────────────────┬────────────╮
│ Stat Key                    │ Value      │
├─────────────────────────────┼────────────┤
│ num_docs                    │ 7          │
│ num_terms                   │ 6          │
│ max_doc_id                  │ 7          │
│ num_records                 │ 44         │
│ percent_indexed             │ 1          │
│ hash_indexing_failures      │ 0          │
│ number_of_uses              │ 1          │
│ bytes_per_record_avg        │ 40.2954559 │
│ doc_table_size_mb           │ 7.27653503 │
│ inverted_sz_mb              │ 0.00169086 │
│ key_table_size_mb           │ 2.48908996 │
│ offset_bits_per_record_avg  │ 8          │
│ offset_vectors_sz_mb        │ 8.58306884 │
│ offsets_per_term_avg        │ 0.20454545 │
│ records_per_doc_avg         │ 6.28571414 │
│ sortable_values_size_mb     │ 0          │
│ total_indexing_time         │ 0.25799998 │
│ total_inverted_index_blocks │ 18         │
│ vector_index_sz_mb          │ 0.02023315 │
╰─────────────────────────────┴────────────╯
Performing Queries
Once our index is created and data is loaded into the right format, we can run queries against the index with RedisVL:
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag, Text, Num
t = (Tag("credit_score") == "high") & (Text("job") % "enginee*") & (Num("age") > 17)  # codespell:ignore enginee
v = VectorQuery(
    vector=[0.1, 0.1, 0.5],
    vector_field_name="user_embedding",
    return_fields=["user", "credit_score", "age", "job", "office_location"],
    filter_expression=t
)
results = hindex.query(v)
result_print(results)
| vector_distance | user | credit_score | age | job | office_location | 
|---|---|---|---|---|---|
| 0 | john | high | 18 | engineer | -122.4194,37.7749 | 
| 0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 | 
# clean up
hindex.delete()
Working with JSON
JSON is best suited for use cases with the following characteristics:
- Ease of use and data model flexibility are top concerns
- Application data is already native JSON
- Replacing another document storage/db solution
# define the json index schema
json_schema = {
    "index": {
        "name": "user-json",
        "prefix": "user-json-docs",
        "storage_type": "json", # JSON storage type
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {"name": "office_location", "type": "geo"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }
        }
    ],
}
# construct a search index from the json schema
jindex = SearchIndex.from_dict(json_schema, redis_url="redis://localhost:6379")
# create the index (no data yet)
jindex.create(overwrite=True)
# note the multiple indices in the same database
!rvl index listall
13:02:56 [RedisVL] INFO   Indices:
13:02:56 [RedisVL] INFO   1. user-json
Vectors as float arrays
Vectorized data stored in JSON must be stored as a pure array (python list) of floats. We will modify our sample data to account for this below:
json_data = data.copy()
for d in json_data:
    d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype='float32')
# inspect a single JSON record
json_data[0]
{'user': 'john',
 'age': 18,
 'job': 'engineer',
 'credit_score': 'high',
 'office_location': '-122.4194,37.7749',
 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5],
 'last_updated': 1741627789}
keys = jindex.load(json_data)
# we can now run the exact same query as above
result_print(jindex.query(v))
| vector_distance | user | credit_score | age | job | office_location | 
|---|---|---|---|---|---|
| 0 | john | high | 18 | engineer | -122.4194,37.7749 | 
| 0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 | 
Cleanup
jindex.delete()
Working with nested data in JSON
Redis also supports native JSON objects. These can be multi-level (nested) objects, with full JSONPath support for updating/retrieving sub elements:
{
    "name": "Specialized Stump jumper",
    "metadata": {
        "model": "Stumpjumper",
        "brand": "Specialized",
        "type": "Enduro bikes",
        "price": 3000
    },
}
Full JSON Path support
Because Redis enables full JSON path support, when creating an index schema, elements need to be indexed and selected by their path with the desired name AND path that points to where the data is located within the objects.
By default, RedisVL will assume the path as $.{name} if not provided in JSON fields schema. If nested provide path as $.object.attribute
As an example:
from redisvl.utils.vectorize import HFTextVectorizer
emb_model = HFTextVectorizer()
bike_data = [
    {
        "name": "Specialized Stump jumper",
        "metadata": {
            "model": "Stumpjumper",
            "brand": "Specialized",
            "type": "Enduro bikes",
            "price": 3000
        },
        "description": "The Specialized Stumpjumper is a versatile enduro bike that dominates both climbs and descents. Features a FACT 11m carbon fiber frame, FOX FLOAT suspension with 160mm travel, and SRAM X01 Eagle drivetrain. The asymmetric frame design and internal storage compartment make it a practical choice for all-day adventures."
    },
    {
        "name": "bike_2",
        "metadata": {
            "model": "Slash",
            "brand": "Trek",
            "type": "Enduro bikes",
            "price": 5000
        },
        "description": "Trek's Slash is built for aggressive enduro riding and racing. Featuring Trek's Alpha Aluminum frame with RE:aktiv suspension technology, 160mm travel, and Knock Block frame protection. Equipped with Bontrager components and a Shimano XT drivetrain, this bike excels on technical trails and enduro race courses."
    }
]
bike_data = [{**d, "bike_embedding": emb_model.embed(d["description"])} for d in bike_data]
bike_schema = {
    "index": {
        "name": "bike-json",
        "prefix": "bike-json",
        "storage_type": "json", # JSON storage type
    },
    "fields": [
        {
            "name": "model",
            "type": "tag",
            "path": "$.metadata.model" # note the '$'
        },
        {
            "name": "brand",
            "type": "tag",
            "path": "$.metadata.brand"
        },
        {
            "name": "price",
            "type": "numeric",
            "path": "$.metadata.price"
        },
        {
            "name": "bike_embedding",
            "type": "vector",
            "attrs": {
                "dims": len(bike_data[0]["bike_embedding"]),
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }
        }
    ],
}
/Users/tyler.hutcherson/Documents/AppliedAI/redis-vl-python/.venv/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
13:02:58 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: mps
13:02:58 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
Batches: 100%|██████████| 1/1 [00:00<00:00,  7.23it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 12.93it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 14.10it/s]
# construct a search index from the json schema
bike_index = SearchIndex.from_dict(bike_schema, redis_url="redis://localhost:6379")
# create the index (no data yet)
bike_index.create(overwrite=True)
bike_index.load(bike_data)
['bike-json:01JY4J9M48CXF7F4Y6HRGEMT9B',
 'bike-json:01JY4J9M48RRY6F80HR82CVZ5G']
from redisvl.query import VectorQuery
vec = emb_model.embed("I'd like a bike for aggressive riding")
v = VectorQuery(
    vector=vec,
    vector_field_name="bike_embedding",
    return_fields=[
        "brand",
        "name",
        "$.metadata.type"
    ]
)
results = bike_index.query(v)
Batches: 100%|██████████| 1/1 [00:00<00:00, 11.72it/s]
Note: As shown in the example if you want to retrieve a field from json object that was not indexed you will also need to supply the full path as with $.metadata.type.
results
[{'id': 'bike-json:01JY4J9M48RRY6F80HR82CVZ5G',
  'vector_distance': '0.519989132881',
  'brand': 'Trek',
  '$.metadata.type': 'Enduro bikes'},
 {'id': 'bike-json:01JY4J9M48CXF7F4Y6HRGEMT9B',
  'vector_distance': '0.657624304295',
  'brand': 'Specialized',
  '$.metadata.type': 'Enduro bikes'}]
Cleanup
bike_index.delete()