Choose a Storage Type

Redis provides a variety of data structures that can adapt to your domain-specific applications. This guide demonstrates how to use RedisVL with both Hash and JSON storage types, helping you choose the right approach for your use case.

Prerequisites

Before you begin, ensure you have:

What You'll Learn

By the end of this guide, you will be able to:

  • Understand the differences between Hash and JSON storage types
  • Define schemas for both Hash and JSON storage
  • Load and query data using each storage type
  • Access nested JSON fields using JSONPath expressions
  • Choose the right storage type for your application
# import necessary modules
import pickle

from redisvl.redis.utils import buffer_to_array
from redisvl.index import SearchIndex


# load the example data and printing utilities
from jupyterutils import result_print, table_print

with open("hybrid_example_data.pkl", "rb") as f:
    data = pickle.load(f)

table_print(data)
user     age  job            credit_score  office_location    user_embedding                              last_updated
john     18   engineer       high          -122.4194,37.7749  b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'  1741627789
derrick  14   doctor         low           -122.4194,37.7749  b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'  1741627789
nancy    94   doctor         high          -122.4194,37.7749  b'333?\xcd\xcc\xcc=\x00\x00\x00?'           1710696589
tyler    100  engineer       high          -122.0839,37.3861  b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?'  1742232589
tim      12   dermatologist  high          -122.0839,37.3861  b'\xcd\xcc\xcc>\xcd\xcc\xcc>\x00\x00\x00?'  1739644189
taimur   15   CEO            low           -122.0839,37.3861  b'\x9a\x99\x19?\xcd\xcc\xcc=\x00\x00\x00?'  1742232589
joe      35   dentist        medium        -122.0839,37.3861  b'fff?fff?\xcd\xcc\xcc='                    1742232589

Hash or JSON: How to Choose

Both storage options offer different features and tradeoffs. This section walks through a sample dataset to illustrate when and how to use each option.

Working with Hashes

Hashes in Redis are simple collections of field-value pairs. Think of a hash as a mutable, single-level dictionary that contains multiple "rows":

{
    "model": "Deimos",
    "brand": "Ergonom",
    "type": "Enduro bikes",
    "price": 4972,
}

Hashes are best suited for use cases with the following characteristics:

  • Performance (speed) and storage space (memory consumption) are top concerns
  • Data can be easily normalized and modeled as a single-level dict

Hashes are typically the default recommendation.
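
As a rough illustration of the second point, a lightly nested record can often be normalized into a single-level dict before it is stored as a hash. The `flatten` helper, the underscore-joined key names, and the `specs` nesting below are hypothetical and are not part of RedisVL:

```python
def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Collapse nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative record: the nesting under "specs" is made up for this sketch
nested = {
    "model": "Deimos",
    "brand": "Ergonom",
    "specs": {"type": "Enduro bikes", "price": 4972},
}

flat = flatten(nested)
print(flat)
```

If a record does not flatten cleanly like this (for example, it holds lists of objects), that is usually a hint that JSON storage may be the better fit.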

# define the hash index schema
hash_schema = {
    "index": {
        "name": "user-hash",
        "prefix": "user-hash-docs",
        "storage_type": "hash", # default setting -- HASH
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {"name": "office_location", "type": "geo"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }

        }
    ],
}
# construct a search index from the hash schema
hindex = SearchIndex.from_dict(hash_schema, redis_url="redis://localhost:6379")

# create the index (no data yet)
hindex.create(overwrite=True)
# show the underlying storage type
hindex.storage_type
<StorageType.HASH: 'hash'>

Vectors as byte strings

One nuance when working with Hashes in Redis is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example can be seen below:

# show a single entry from the data that will be loaded
data[0]
{'user': 'john',
 'age': 18,
 'job': 'engineer',
 'credit_score': 'high',
 'office_location': '-122.4194,37.7749',
 'user_embedding': b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?',
 'last_updated': 1741627789}
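
The byte string in `user_embedding` is simply the three float32 values packed back to back. A minimal sketch using only the standard library's `struct` module, assuming little-endian byte order (which matches the sample record shown above):

```python
import struct

vector = [0.1, 0.1, 0.5]

# "<3f" means little-endian, three 32-bit floats (12 bytes in total)
packed = struct.pack(f"<{len(vector)}f", *vector)

print(packed)       # b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'
print(len(packed))  # 12
```

The result is identical to the `user_embedding` value of the first sample record. In practice the bytes usually come from an array library rather than from `struct`, but the layout is the same.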
# load hash data
keys = hindex.load(data)
!rvl stats -i user-hash
Statistics:
╭─────────────────────────────┬────────────╮
│ Stat Key                    │ Value      │
├─────────────────────────────┼────────────┤
│ num_docs                    │ 7          │
│ num_terms                   │ 6          │
│ max_doc_id                  │ 7          │
│ num_records                 │ 44         │
│ percent_indexed             │ 1          │
│ hash_indexing_failures      │ 0          │
│ number_of_uses              │ 1          │
│ bytes_per_record_avg        │ 39.0681800 │
│ doc_table_size_mb           │ 0.00837230 │
│ inverted_sz_mb              │ 0.00163936 │
│ key_table_size_mb           │ 3.50952148 │
│ offset_bits_per_record_avg  │ 8          │
│ offset_vectors_sz_mb        │ 8.58306884 │
│ offsets_per_term_avg        │ 0.20454545 │
│ records_per_doc_avg         │ 6.28571414 │
│ sortable_values_size_mb     │ 0          │
│ total_indexing_time         │ 0.55204    │
│ total_inverted_index_blocks │ 18         │
│ vector_index_sz_mb          │ 0.02820587 │
╰─────────────────────────────┴────────────╯

Performing Queries

Once our index is created and data is loaded into the right format, we can run queries against the index with RedisVL:

from redisvl.query import VectorQuery
from redisvl.query.filter import Tag, Text, Num

t = (Tag("credit_score") == "high") & (Text("job") % "enginee*") & (Num("age") > 17)  # codespell:ignore enginee

v = VectorQuery(
    vector=[0.1, 0.1, 0.5],
    vector_field_name="user_embedding",
    return_fields=["user", "credit_score", "age", "job", "office_location"],
    filter_expression=t
)


results = hindex.query(v)
result_print(results)
vector_distance  user   credit_score  age  job       office_location
0                john   high          18   engineer  -122.4194,37.7749
0.109129190445   tyler  high          100  engineer  -122.0839,37.3861
# clean up
hindex.delete()
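
To see why only john and tyler were returned by the query above, the filter and the ranking can be mimicked in plain Python. This is only a conceptual sketch and not how Redis evaluates queries: the records are a hand-copied subset of the sample data with embeddings decoded to floats, `cosine_distance` is a hypothetical helper, and the prefix match on `job` is approximated with `str.startswith`.

```python
import math

# Hand-copied subset of the sample data (embeddings decoded to floats)
records = [
    {"user": "john", "age": 18, "job": "engineer", "credit_score": "high",
     "user_embedding": [0.1, 0.1, 0.5]},
    {"user": "derrick", "age": 14, "job": "doctor", "credit_score": "low",
     "user_embedding": [0.1, 0.1, 0.5]},
    {"user": "nancy", "age": 94, "job": "doctor", "credit_score": "high",
     "user_embedding": [0.7, 0.1, 0.5]},
    {"user": "tyler", "age": 100, "job": "engineer", "credit_score": "high",
     "user_embedding": [0.1, 0.4, 0.5]},
]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vector = [0.1, 0.1, 0.5]

# The same three conditions as the filter expression, combined with AND
matches = [
    r for r in records
    if r["credit_score"] == "high"
    and r["job"].startswith("enginee")
    and r["age"] > 17
]

# Rank the surviving records by distance to the query vector (closest first)
ranked = sorted(
    (cosine_distance(query_vector, r["user_embedding"]), r["user"])
    for r in matches
)

for distance, user in ranked:
    print(user, round(distance, 6))
```

john's embedding equals the query vector, so his distance is effectively zero, while tyler's works out to roughly 0.1091, in line with the `vector_distance` column in the results.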

Working with JSON

JSON is best suited for use cases with the following characteristics:

  • Ease of use and data model flexibility are top concerns
  • Application data is already native JSON
  • Replacing another document storage/db solution
# define the json index schema
json_schema = {
    "index": {
        "name": "user-json",
        "prefix": "user-json-docs",
        "storage_type": "json", # JSON storage type
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {"name": "office_location", "type": "geo"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }

        }
    ],
}
# construct a search index from the json schema
jindex = SearchIndex.from_dict(json_schema, redis_url="redis://localhost:6379")

# create the index (no data yet)
jindex.create(overwrite=True)

Vectors as Float Arrays

Vectorized data stored in JSON must be a pure array (Python list) of floats. The following code modifies the sample data to use this format:

# copy each record so the original byte-string data is left untouched
json_data = [dict(d) for d in data]

for d in json_data:
    d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype='float32')
# inspect a single JSON record
json_data[0]
{'user': 'john',
 'age': 18,
 'job': 'engineer',
 'credit_score': 'high',
 'office_location': '-122.4194,37.7749',
 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5],
 'last_updated': 1741627789}
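
The long decimals in `user_embedding` are expected: each value is a 32-bit float widened to a Python float. The sketch below reproduces the decoding with the standard library's `struct` module, assuming little-endian float32 data. It mirrors the output above but is not necessarily how `buffer_to_array` is implemented. It also shows why the conversion is needed, since JSON has no byte-string type:

```python
import json
import struct

raw = b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'

# Each float32 occupies 4 bytes, so 12 bytes decode to 3 values
count = len(raw) // 4
decoded = list(struct.unpack(f"<{count}f", raw))
print(decoded)  # [0.10000000149011612, 0.10000000149011612, 0.5]

# A list of floats serializes to JSON without trouble...
as_json = json.dumps({"user_embedding": decoded})
print(as_json)

# ...whereas raw bytes cannot be serialized
try:
    json.dumps({"user_embedding": raw})
    bytes_serializable = True
except TypeError:
    bytes_serializable = False
print(bytes_serializable)  # False
```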
keys = jindex.load(json_data)
# we can now run the exact same query as above
result_print(jindex.query(v))
vector_distance  user   credit_score  age  job       office_location
0                john   high          18   engineer  -122.4194,37.7749
0.109129190445   tyler  high          100  engineer  -122.0839,37.3861

Cleanup

jindex.delete()

Working with nested data in JSON

Redis also supports native JSON objects. These can be multi-level (nested) objects, with full JSONPath support for updating/retrieving sub elements:

{
    "name": "Specialized Stump jumper",
    "metadata": {
        "model": "Stumpjumper",
        "brand": "Specialized",
        "type": "Enduro bikes",
        "price": 3000
    },
}

Full JSONPath support

Because Redis supports full JSONPath, each field in a JSON index schema is identified by two things: the name used to refer to it in queries, and the path that points to where the data lives within the object.

By default, RedisVL assumes the path $.{name} when a JSON field in the schema does not provide one. For nested data, provide the path explicitly as $.object.attribute.
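
To make the path convention concrete, here is a small sketch that resolves simple dotted paths such as `$.metadata.model` against a Python dict and falls back to `$.{name}` when a field has no path. The `field_path` and `resolve_path` helpers are hypothetical, and they handle only plain dotted paths, not the full JSONPath syntax that Redis supports:

```python
def field_path(field: dict) -> str:
    """Return the field's explicit path, or the default "$.{name}"."""
    return field.get("path") or f"$.{field['name']}"


def resolve_path(document: dict, path: str):
    """Follow a plain dotted path like "$.a.b" through nested dicts."""
    if not path.startswith("$."):
        raise ValueError(f"expected a path starting with '$.', got {path!r}")
    value = document
    for key in path[2:].split("."):
        value = value[key]
    return value


bike = {
    "name": "Specialized Stump jumper",
    "metadata": {
        "model": "Stumpjumper",
        "brand": "Specialized",
        "type": "Enduro bikes",
        "price": 3000,
    },
}

fields = [
    {"name": "name", "type": "text"},  # no path: defaults to $.name
    {"name": "model", "type": "tag", "path": "$.metadata.model"},
    {"name": "price", "type": "numeric", "path": "$.metadata.price"},
]

for field in fields:
    path = field_path(field)
    print(field["name"], "->", path, "=", resolve_path(bike, path))
```

The field name and the path are independent: `model` is a short alias for data that actually lives at `$.metadata.model`.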

As an example:

from redisvl.utils.vectorize import HFTextVectorizer

emb_model = HFTextVectorizer()

bike_data = [
    {
        "name": "Specialized Stump jumper",
        "metadata": {
            "model": "Stumpjumper",
            "brand": "Specialized",
            "type": "Enduro bikes",
            "price": 3000
        },
        "description": "The Specialized Stumpjumper is a versatile enduro bike that dominates both climbs and descents. Features a FACT 11m carbon fiber frame, FOX FLOAT suspension with 160mm travel, and SRAM X01 Eagle drivetrain. The asymmetric frame design and internal storage compartment make it a practical choice for all-day adventures."
    },
    {
        "name": "bike_2",
        "metadata": {
            "model": "Slash",
            "brand": "Trek",
            "type": "Enduro bikes",
            "price": 5000
        },
        "description": "Trek's Slash is built for aggressive enduro riding and racing. Featuring Trek's Alpha Aluminum frame with RE:aktiv suspension technology, 160mm travel, and Knock Block frame protection. Equipped with Bontrager components and a Shimano XT drivetrain, this bike excels on technical trails and enduro race courses."
    }
]

bike_data = [{**d, "bike_embedding": emb_model.embed(d["description"])} for d in bike_data]

bike_schema = {
    "index": {
        "name": "bike-json",
        "prefix": "bike-json",
        "storage_type": "json", # JSON storage type
    },
    "fields": [
        {
            "name": "model",
            "type": "tag",
            "path": "$.metadata.model" # note the '$'
        },
        {
            "name": "brand",
            "type": "tag",
            "path": "$.metadata.brand"
        },
        {
            "name": "price",
            "type": "numeric",
            "path": "$.metadata.price"
        },
        {
            "name": "bike_embedding",
            "type": "vector",
            "attrs": {
                "dims": len(bike_data[0]["bike_embedding"]),
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }

        }
    ],
}
# construct a search index from the json schema
bike_index = SearchIndex.from_dict(bike_schema, redis_url="redis://localhost:6379")

# create the index (no data yet)
bike_index.create(overwrite=True)
bike_index.load(bike_data)
['bike-json:01KHKJ5WW3DJE0X6E85GG27V0X',
 'bike-json:01KHKJ5WW3DJE0X6E85GG27V0Y']
from redisvl.query import VectorQuery

vec = emb_model.embed("I'd like a bike for aggressive riding")

v = VectorQuery(
    vector=vec,
    vector_field_name="bike_embedding",
    return_fields=[
        "brand",
        "name",
        "$.metadata.type"
    ]
)


results = bike_index.query(v)

Note: As shown in the example, to retrieve a field from a JSON object that was not indexed, you must supply its full path, as with $.metadata.type.

results
[{'id': 'bike-json:01KHKJ5WW3DJE0X6E85GG27V0Y',
  'vector_distance': '0.519988954067',
  'brand': 'Trek',
  '$.metadata.type': 'Enduro bikes'},
 {'id': 'bike-json:01KHKJ5WW3DJE0X6E85GG27V0X',
  'vector_distance': '0.65762424469',
  'brand': 'Specialized',
  '$.metadata.type': 'Enduro bikes'}]

Next Steps

Now that you understand Hash vs JSON storage, explore these related guides:

# Cleanup
bike_index.delete()