JSON vs. hash storage
Storing JSON and hashes with RedisVL
Out of the box, Redis provides a variety of data structures that can be used for your domain specific applications and use cases. In this document, you will learn how to use RedisVL with both hash and JSON data.
Before beginning, be sure of the following:
- You have installed RedisVL and have that environment activated.
- You have a running Redis instance with the search and query capability.
# import necessary modules
import pickle
from redisvl.redis.utils import buffer_to_array
from jupyterutils import result_print, table_print
from redisvl.index import SearchIndex
# load in the example data and printing utils
data = pickle.load(open("hybrid_example_data.pkl", "rb"))
table_print(data)
user | age | job | credit_score | office_location | user_embedding |
---|---|---|---|---|---|
john | 18 | engineer | high | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?' |
derrick | 14 | doctor | low | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?' |
nancy | 94 | doctor | high | -122.4194,37.7749 | b'333?\xcd\xcc\xcc=\x00\x00\x00?' |
tyler | 100 | engineer | high | -122.0839,37.3861 | b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?' |
tim | 12 | dermatologist | high | -122.0839,37.3861 | b'\xcd\xcc\xcc>\xcd\xcc\xcc>\x00\x00\x00?' |
taimur | 15 | CEO | low | -122.0839,37.3861 | b'\x9a\x99\x19?\xcd\xcc\xcc=\x00\x00\x00?' |
joe | 35 | dentist | medium | -122.0839,37.3861 | b'fff?fff?\xcd\xcc\xcc=' |
Hash or JSON - how to choose?
Both storage options offer a variety of features and tradeoffs. Below, you will work through a dummy dataset to learn when and how to use both data types.
Working with hashes
Hashes in Redis are simple collections of field-value pairs. Think of it like a mutable, single-level dictionary that contains multiple "rows":
{
"model": "Deimos",
"brand": "Ergonom",
"type": "Enduro bikes",
"price": 4972,
}
Hashes are best suited for use cases with the following characteristics:
- Performance (speed) and storage space (memory consumption) are top concerns.
- Data can be easily normalized and modeled as a single-level dictionary.
Hashes are typically the default recommendation.
# define the hash index schema
hash_schema = {
"index": {
"name": "user-hash",
"prefix": "user-hash-docs",
"storage_type": "hash", # default setting -- HASH
},
"fields": [
{"name": "user", "type": "tag"},
{"name": "credit_score", "type": "tag"},
{"name": "job", "type": "text"},
{"name": "age", "type": "numeric"},
{"name": "office_location", "type": "geo"},
{
"name": "user_embedding",
"type": "vector",
"attrs": {
"dims": 3,
"distance_metric": "cosine",
"algorithm": "flat",
"datatype": "float32"
}
}
],
}
# construct a search index from the hash schema
hindex = SearchIndex.from_dict(hash_schema)
# connect to local redis instance
hindex.connect("redis://localhost:6379")
# create the index (no data yet)
hindex.create(overwrite=True)
# show the underlying storage type
hindex.storage_type
<StorageType.HASH: 'hash'>
Vectors as byte strings
One nuance when working with hashes in Redis is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example of this can be seen below:
# show a single entry from the data that will be loaded
data[0]
{'user': 'john',
'age': 18,
'job': 'engineer',
'credit_score': 'high',
'office_location': '-122.4194,37.7749',
'user_embedding': b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'}
# load hash data
keys = hindex.load(data)
$ rvl stats -i user-hash
Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key │ Value │
├─────────────────────────────┼─────────────┤
│ num_docs │ 7 │
│ num_terms │ 6 │
│ max_doc_id │ 7 │
│ num_records │ 44 │
│ percent_indexed │ 1 │
│ hash_indexing_failures │ 0 │
│ number_of_uses │ 1 │
│ bytes_per_record_avg │ 3.40909 │
│ doc_table_size_mb │ 0.000767708 │
│ inverted_sz_mb │ 0.000143051 │
│ key_table_size_mb │ 0.000248909 │
│ offset_bits_per_record_avg │ 8 │
│ offset_vectors_sz_mb │ 8.58307e-06 │
│ offsets_per_term_avg │ 0.204545 │
│ records_per_doc_avg │ 6.28571 │
│ sortable_values_size_mb │ 0 │
│ total_indexing_time │ 0.587 │
│ total_inverted_index_blocks │ 18 │
│ vector_index_sz_mb │ 0.0202332 │
╰─────────────────────────────┴─────────────╯
Performing queries
Once the index is created and data is loaded into the right format, you can run queries against the index:
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag, Text, Num
t = (Tag("credit_score") == "high") & (Text("job") % "enginee*") & (Num("age") > 17)
v = VectorQuery([0.1, 0.1, 0.5],
"user_embedding",
return_fields=["user", "credit_score", "age", "job", "office_location"],
filter_expression=t)
results = hindex.query(v)
result_print(results)
vector_distance | user | credit_score | age | job | office_location |
---|---|---|---|---|---|
0 | john | high | 18 | engineer | -122.4194,37.7749 |
0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 |
# clean up
hindex.delete()
Working with JSON
Redis also supports native JSON objects. These can be multi-level (nested) objects, with full JSONPath support for retrieving and updating sub-elements:
{
"name": "bike",
"metadata": {
"model": "Deimos",
"brand": "Ergonom",
"type": "Enduro bikes",
"price": 4972,
}
}
JSON is best suited for use cases with the following characteristics:
- Ease of use and data model flexibility are top concerns.
- Application data is already native JSON.
- Replacing another document storage/database solution.
Full JSON Path support
Because Redis enables full JSONPath support, when creating an index schema, elements need to be indexed and selected by their path with the desired name
and path
that points to where the data is located within the objects.
$.{name}
if not provided in JSON fields schema.# define the json index schema
json_schema = {
"index": {
"name": "user-json",
"prefix": "user-json-docs",
"storage_type": "json", # JSON storage type
},
"fields": [
{"name": "user", "type": "tag"},
{"name": "credit_score", "type": "tag"},
{"name": "job", "type": "text"},
{"name": "age", "type": "numeric"},
{"name": "office_location", "type": "geo"},
{
"name": "user_embedding",
"type": "vector",
"attrs": {
"dims": 3,
"distance_metric": "cosine",
"algorithm": "flat",
"datatype": "float32"
}
}
],
}
# construct a search index from the JSON schema
jindex = SearchIndex.from_dict(json_schema)
# connect to a local redis instance
jindex.connect("redis://localhost:6379")
# create the index (no data yet)
jindex.create(overwrite=True)
# note the multiple indices in the same database
$ rvl index listall
20:23:08 [RedisVL] INFO Indices:
20:23:08 [RedisVL] INFO 1. user-json
#### Vectors as float arrays
Vectorized data stored in JSON must be stored as a pure array (e.g., a Python list) of floats. Modify your sample data to account for this below:
```python
import numpy as np
json_data = data.copy()
for d in json_data:
d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype=np.float32)
# inspect a single JSON record
json_data[0]
{'user': 'john',
'age': 18,
'job': 'engineer',
'credit_score': 'high',
'office_location': '-122.4194,37.7749',
'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]}
keys = jindex.load(json_data)
# we can now run the exact same query as above
result_print(jindex.query(v))
vector_distance | user | credit_score | age | job | office_location |
---|---|---|---|---|---|
0 | john | high | 18 | engineer | -122.4194,37.7749 |
0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 |
Cleanup
jindex.delete()