Vector similarity
Use Redis as a vector database
The unstructured data problem
Today, about 80% of the data organizations generate is unstructured: data that either does not have a well-defined schema or cannot be restructured into a familiar columnar format. Typical examples of unstructured data include free-form text, images, videos, and sound clips. The amount of unstructured data is expected to grow in the coming decades.
Unstructured data is high-dimensional and noisy, making it more challenging to analyze and interpret using traditional methods. But it is also packed with information and meaning.
Traditionally, unstructured data is processed to extract specific features, effectively turning it into structured data. Once in the realm of structured data, you can search the data with SQL queries (if stored in a relational database) or with a text search engine.
The approach of transforming unstructured data into structured data has a few issues. First, engineering features out of unstructured data can be computationally expensive and error-prone, significantly delaying when you can effectively use the data. Second, some fidelity and information may be lost in the extraction/transformation process, because unique, latent features can't be easily categorized or quantified.
Enter vector databases
One approach to dealing with unstructured data is to vectorize it. Vectorizing means converting something like a text passage, an image, a video, or a song into a flat sequence of numbers that represents that piece of data. These vectors are representations of the data in N-dimensional space. Vectorizing makes it possible to use linear algebra techniques to compare, group, and operate on the data. This is the foundation of a vector database: the ability to store and operate on vectors. The approach is not new; what has changed is how far the techniques for generating the vectors have advanced.
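As a minimal sketch of what comparing vectors means, the following pure-Python snippet computes the cosine similarity between a few made-up three-dimensional vectors. The vectors and their names are illustrative assumptions, not real embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for vectorized items.
kids_bike = [0.9, 0.1, 0.0]
toddler_bike = [0.8, 0.2, 0.1]
road_bike = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(kids_bike, toddler_bike))
# Vectors pointing in different directions score close to 0.
print(cosine_similarity(kids_bike, road_bike))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is one reason it is a common choice for comparing embeddings.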
Using machine learning embeddings as vectors
Traditional methods for converting unstructured textual data into vector form include Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). For categorical data, one-hot encoding is a commonly used approach. Hashing and feature extraction techniques, such as edge detection, texture analysis, or color histograms, have been employed for high-dimensional data like images.
While powerful in their own right, these approaches reveal limitations when confronted with high-dimensional and intricate data forms like long text passages, images, and audio. Consider, for example, how a text passage could be restructured through sentence rearrangement, synonym usage, or alterations in narrative style. Such simple modifications could effectively sidestep techniques like Bag-of-Words, preventing systems using the generated encodings from identifying text passages with similar meanings.
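The synonym problem can be illustrated with a small Bag-of-Words sketch. The sentences below are invented for illustration, and the tokenizer is a deliberately naive whitespace split rather than anything a production system would use.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring word order."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)


original = bag_of_words("cheap bicycle for small children")
paraphrase = bag_of_words("affordable bike suited to little kids")
unrelated = bag_of_words("cheap flights for small groups")

# Same meaning, no shared words: Bag-of-Words sees no similarity at all.
print(cosine_similarity(original, paraphrase))
# Different meaning, three shared words: Bag-of-Words rates it as closer.
print(cosine_similarity(original, unrelated))
```

Because the encoding only counts surface tokens, the paraphrase scores lower than the unrelated sentence, which is the kind of failure that learned embeddings aim to avoid.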
This is where advancements in machine learning, particularly deep learning, make their mark. Machine learning models have facilitated the rise of embeddings as a widely embraced method for generating dense, low-dimensional vector representations. Given a suitable model, the generated embeddings can encapsulate complex patterns and semantic meanings inherent in data, thus overcoming the limitations of their traditional counterparts.
Generate embeddings for the bikes dataset
To investigate vector similarity, you'll use a subset of the bikes dataset, a relatively simple synthetic dataset. The dataset has 11 bicycle records in a JSON file named bikes.json and includes the fields model, brand, price, type, specs, and description. The description field is particularly interesting for our purposes since it consists of a free-form textual description of a bicycle.
Before getting started with the code examples
Code examples are currently provided for Redis CLI and Python. For Python, you will need to create a virtual environment and install the following Python packages:
- redis
- pandas
- sentence-transformers
- (optional) tabulate; this package is used by Pandas to convert dataframe tables to Markdown
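You'll also need the following imports, shown here at the top of the complete example script that the rest of this tutorial walks through step by step: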
import json
import time
import numpy as np
import pandas as pd
import redis
import requests
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer
url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()
json.dumps(bikes[0], indent=2)
client = redis.Redis(host="localhost", port=6379, decode_responses=True)
res = client.ping()
# >>> True
pipeline = client.pipeline()
for i, bike in enumerate(bikes, start=1):
    redis_key = f"bikes:{i:03}"
    pipeline.json().set(redis_key, "$", bike)
res = pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]
res = client.json().get("bikes:010", "$.model")
# >>> ['Summit']
keys = sorted(client.keys("bikes:*"))
# >>> ['bikes:001', 'bikes:002', ..., 'bikes:011']
descriptions = client.json().mget(keys, "$.description")
descriptions = [item for sublist in descriptions for item in sublist]
embedder = SentenceTransformer("msmarco-distilbert-base-v4")
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
VECTOR_DIMENSION = len(embeddings[0])
# >>> 768
pipeline = client.pipeline()
for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, "$.description_embeddings", embedding)
pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]
res = client.json().get("bikes:010")
# >>>
# {
#   "model": "Summit",
#   "brand": "nHill",
#   "price": 1200,
#   "type": "Mountain Bike",
#   "specs": {
#     "material": "alloy",
#     "weight": "11.3"
#   },
#   "description": "This budget mountain bike from nHill performs well...",
#   "description_embeddings": [
#     -0.538114607334137,
#     -0.49465855956077576,
#     -0.025176964700222015,
#     ...
#   ]
# }
schema = (
    TextField("$.model", no_stem=True, as_name="model"),
    TextField("$.brand", no_stem=True, as_name="brand"),
    NumericField("$.price", as_name="price"),
    TagField("$.type", as_name="type"),
    TextField("$.description", as_name="description"),
    VectorField(
        "$.description_embeddings",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIMENSION,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=["bikes:"], index_type=IndexType.JSON)
res = client.ft("idx:bikes_vss").create_index(
    fields=schema, definition=definition
)
# >>> 'OK'
info = client.ft("idx:bikes_vss").info()
num_docs = info["num_docs"]
indexing_failures = info["hash_indexing_failures"]
# print(f"{num_docs} documents indexed with {indexing_failures} failures")
# >>> 11 documents indexed with 0 failures
query = Query("@brand:Peaknetic")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950', 'description_embeddings': ...
query = Query("@brand:Peaknetic").return_fields("id", "brand", "model", "price")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950'}, Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
query = Query("@brand:Peaknetic @price:[0 1000]").return_fields(
    "id", "brand", "model", "price"
)
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
queries = [
    "Bike for small kids",
    "Best Mountain bikes for kids",
    "Cheap Mountain bike for kids",
    "Female specific mountain bike",
    "Road bike for beginners",
    "Commuter bike for people over 60",
    "Comfortable commuter bike",
    "Good bike for college students",
    "Mountain bike for beginners",
    "Vintage bike",
    "Comfortable city bike",
]
encoded_queries = embedder.encode(queries)
len(encoded_queries)
# >>> 11
def create_query_table(query, queries, encoded_queries, extra_params=None):
    # Run the same query once per encoded query vector and collect the rows.
    results_list = []
    for i, encoded_query in enumerate(encoded_queries):
        result_docs = (
            client.ft("idx:bikes_vss")
            .search(
                query,
                {
                    "query_vector": np.array(
                        encoded_query, dtype=np.float32
                    ).tobytes()
                }
                | (extra_params or {}),
            )
            .docs
        )
        for doc in result_docs:
            # Convert the returned cosine distance into a similarity score.
            vector_score = round(1 - float(doc.vector_score), 2)
            results_list.append(
                {
                    "query": queries[i],
                    "score": vector_score,
                    "id": doc.id,
                    "brand": doc.brand,
                    "model": doc.model,
                    "description": doc.description,
                }
            )
    # Optional: convert the table to Markdown using Pandas
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(
        by=["query", "score"], ascending=[True, False], inplace=True
    )
    queries_table["query"] = queries_table.groupby("query")["query"].transform(
        lambda x: [x.iloc[0]] + [""] * (len(x) - 1)
    )
    queries_table["description"] = queries_table["description"].apply(
        lambda x: (x[:497] + "...") if len(x) > 500 else x
    )
    return queries_table.to_markdown(index=False)
query = (
    Query("(*)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
create_query_table(query, queries, encoded_queries)
# >>> | Best Mountain bikes for kids | 0.54 | bikes:003... (+ 32 more results)
hybrid_query = (
    Query("(@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
create_query_table(hybrid_query, queries, encoded_queries)
# >>> | Best Mountain bikes for kids | 0.3 | bikes:008... (+22 more results)
range_query = (
    Query(
        "@vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}"
    )
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .paging(0, 4)
    .dialect(2)
)
create_query_table(
    range_query, queries[:1], encoded_queries[:1], {"range": 0.55}
)
# >>> | Bike for small kids | 0.52 | bikes:001 | Velorim |... (+1 more result)
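The create_query_table helper in the script above computes each score as 1 - float(doc.vector_score). With the COSINE distance metric, the value returned by the query is a distance (smaller means closer), so subtracting it from 1 turns it into a similarity where larger means closer. The sketch below isolates that conversion; the document IDs and distance values are made-up placeholders, not real query output.

```python
# Hypothetical (id, distance) pairs; distances arrive as strings because
# the client in this tutorial is created with decode_responses=True.
raw_results = [
    ("bikes:003", "0.46"),
    ("bikes:001", "0.48"),
    ("bikes:010", "0.61"),
]


def to_similarity(distance):
    """Convert a cosine distance string into a rounded similarity score."""
    return round(1 - float(distance), 2)


scored = [(doc_id, to_similarity(distance)) for doc_id, distance in raw_results]

# Sorting by ascending distance is the same as sorting by descending score.
for doc_id, score in scored:
    print(doc_id, score)
```

This is also why the queries sort by vector_score in ascending order: the closest matches have the smallest distance.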
Loading the JSON bikes dataset
Let's load the bikes dataset as a JSON array using the following Python 3 code:
url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()
Inspect the bikes JSON
Let's inspect the content of the JSON array in table form:
model | brand | price | type | specs | description |
---|---|---|---|---|---|
Jigger | Velorim | 270 | Kids bikes | {'material': 'aluminium', 'weight': '10'} | Small and powerful, the Jigger is the best rid... |
Hillcraft | Bicyk | 1200 | Kids Mountain Bikes | {'material': 'carbon', 'weight': '11'} | Kids want to ride with as little weight as pos... |
Chook air 5 | Nord | 815 | Kids Mountain Bikes | {'material': 'alloy', 'weight': '9.1'} | The Chook Air 5 gives kids aged six years and... |
... |
Let's take a look at the structure of one of our bike JSON documents:
json.dumps(bikes[0], indent=2)
{
  "model": "Jigger",
  "brand": "Velorim",
  "price": 270,
  "type": "Kids bikes",
  "specs": {
    "material": "aluminium",
    "weight": "10"
  },
  "description": "Small and powerful, the Jigger is the best ride for the smallest of tikes! ...
}
Generating text embeddings using SentenceTransformers
You will use the SentenceTransformers framework to generate embeddings for the bike descriptions. Sentence-BERT (SBERT) is a modification of the BERT model that produces consistent and contextually rich sentence embeddings. SBERT improves tasks like semantic search and text grouping by allowing for efficient and meaningful comparison of sentence-level semantic similarity.
Selecting a suitable pre-trained model
You must pick a model suited to the task at hand when generating embeddings. In this case, you want to query for bicycles using short sentences against the longer bicycle descriptions. This is referred to as asymmetric semantic search, often employed in cases where the search query and the documents being searched are of a different nature or structure. Suitable models for asymmetric semantic search include pre-trained MS MARCO models. MS MARCO models are optimized for understanding real-world queries and retrieving relevant responses. They are widely used in search engines, chatbots, and other AI applications. At the time this tutorial was written, the highest performing MS MARCO model tuned for cosine similarity available in the SentenceTransformers package is msmarco-distilbert-base-v4.
Let's load the model using the SentenceTransformer class:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('msmarco-distilbert-base-v4')
Let's grab the description from the first bike in the JSON array:
from textwrap import TextWrapper
sample_description = bikes[0]['description']
wrapped_sample_description = TextWrapper(width=120).wrap(sample_description)
print(wrapped_sample_description)
['Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids’ pedal bike on the',
'market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring',
'to go. We say rare because this smokin’ little bike is not ideal for a nervous first-time rider, but it’s a true giddy',
...]
To generate the vector embeddings, use the encode method:
embedding = embedder.encode(sample_description)
Let's take a peek at the first 5 elements of the generated vector:
print(embedding.tolist()[:5])
[0.20076484978199005, -0.1300073117017746, 0.3081613779067993, 0.2062796652317047, -0.3692358434200287]
Let's look at the length of the vector embeddings generated by the model.
print(len(embedding))
768
The chosen model generates vector embeddings of length 768 regardless of the length of the input text.
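The fixed length also determines how many bytes each vector occupies when it is serialized as 32-bit floats, which is how the example script passes query vectors to Redis (np.array(..., dtype=np.float32).tobytes()). The sketch below uses only the standard library's array module in place of NumPy, and the all-zero vector is a placeholder assumption standing in for a real embedding.

```python
from array import array

VECTOR_DIMENSION = 768  # length reported by the model in this tutorial

# Placeholder values; a real embedding would come from embedder.encode().
placeholder_embedding = [0.0] * VECTOR_DIMENSION

# array("f") stores C single-precision floats (4 bytes each on common
# platforms), matching the FLOAT32 element type.
packed = array("f", placeholder_embedding).tobytes()

print(len(packed))
print(array("f").itemsize)
```

On platforms where a C float is 4 bytes, each 768-dimensional vector serializes to 768 × 4 = 3072 bytes.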
Storing our bikes in Redis
Redis Stack setup
There are many ways to install and run Redis. See Install Redis Stack for more information.
Now that you know how to vectorize the bike descriptions, it's time to start working with Redis.
Redis Python client
To interact with Redis, install the redis-py client library, which encapsulates the commands to work with OSS Redis as well as Redis Stack. For an overview of how to use redis-py, see the Redis Python Guide.
Create a redis-py client and test the server
Instantiate the Redis client, connecting to localhost on Redis' default port, 6379. By default, Redis returns binary responses; to decode them, pass the decode_responses parameter set to True:
client = redis.Redis(host="localhost", port=6379, decode_responses=True)
Let's use Redis' PING command to check that Redis is up and running:
res = client.ping()
# >>> True
Storing the bikes as JSON documents in Redis
Redis Stack includes a JSON data type. Like any other Redis data type, the JSON data type allows you to use Redis commands to save, update, and retrieve JSON values. The bikes data is already loaded in memory as the bikes JSON array. You will iterate over bikes, generate a suitable Redis key for each bike, and store each one in Redis using the JSON.SET command. You'll do this using a pipeline to minimize round-trip times:
pipeline = client.pipeline()
for i, bike in enumerate(bikes, start=1):
    redis_key = f"bikes:{i:03}"
    pipeline.json().set(redis_key, "$", bike)
res = pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]
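To see why a pipeline helps, the following sketch models round trips with a toy in-memory client. The ToyClient and ToyPipeline classes are hypothetical stand-ins written for this illustration and are not part of redis-py; they only count how many times a batch of commands would cross the network. The sample bike data is made up.

```python
# Toy model of command pipelining. ToyClient and ToyPipeline are hypothetical
# stand-ins for illustration only; they are not redis-py classes.


class ToyClient:
    def __init__(self):
        self.store = {}
        self.round_trips = 0  # how many times we "talk" to the server

    def send(self, commands):
        """Apply a batch of (key, value) writes in a single round trip."""
        self.round_trips += 1
        replies = []
        for key, value in commands:
            self.store[key] = value
            replies.append(True)
        return replies

    def pipeline(self):
        return ToyPipeline(self)


class ToyPipeline:
    def __init__(self, client):
        self.client = client
        self.queued = []

    def set(self, key, value):
        # Nothing is sent yet; the command is only buffered locally.
        self.queued.append((key, value))

    def execute(self):
        # All buffered commands travel together, and the replies come back
        # as one list, in the order the commands were queued.
        replies = self.client.send(self.queued)
        self.queued = []
        return replies


sample_bikes = [{"model": f"Model {n}"} for n in range(1, 12)]  # made-up data

# Without a pipeline: one round trip per document.
unpipelined = ToyClient()
for i, bike in enumerate(sample_bikes, start=1):
    unpipelined.send([(f"bikes:{i:03}", bike)])

# With a pipeline: queue everything, then send once.
pipelined = ToyClient()
pipe = pipelined.pipeline()
for i, bike in enumerate(sample_bikes, start=1):
    pipe.set(f"bikes:{i:03}", bike)
replies = pipe.execute()

print(unpipelined.round_trips, pipelined.round_trips, len(replies))
```

Both toy clients end up holding the same eleven documents; the pipelined one simply gets there with a single batch, which is why the real execute call returns one list with a reply per queued command.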
Let's retrieve a specific value from one of the bike documents in Redis using a JSONPath expression:
res = client.json().get("bikes:010", "$.model")
# >>> ['Summit']
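The result is a list rather than a bare string because a JSONPath expression can match more than one value. The sketch below mimics that behaviour for the simplest case, dotted paths such as $.specs.material, on a plain Python dictionary. The resolve_path helper is hypothetical and covers only this tiny subset of JSONPath; it is not how Redis evaluates paths. The sample document reuses the field values this guide shows for bikes:010.

```python
def resolve_path(document, path):
    """Return every value matching a simple '$.a.b' style path, as a list.

    Hypothetical helper for illustration: it handles only '$' and dotted
    field names, not wildcards, filters, or array indexes.
    """
    if path == "$":
        return [document]
    if not path.startswith("$."):
        raise ValueError("only paths of the form $.field.subfield are supported")
    current = document
    for name in path[2:].split("."):
        if not isinstance(current, dict) or name not in current:
            return []  # in this sketch, no match yields an empty list
        current = current[name]
    return [current]


bike = {
    "model": "Summit",
    "brand": "nHill",
    "price": 1200,
    "type": "Mountain Bike",
    "specs": {"material": "alloy", "weight": "11.3"},
}

print(resolve_path(bike, "$.model"))           # ['Summit']
print(resolve_path(bike, "$.specs.material"))  # ['alloy']
print(resolve_path(bike, "$.color"))           # []
```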
Vectorize all of the bike descriptions
To vectorize all the descriptions in the database, first collect the Redis keys for all the bikes:
keys = sorted(client.keys("bikes:*"))
# >>> ['bikes:001', 'bikes:002', ..., 'bikes:011']
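Sorting gives a predictable order here because the keys were created with a zero-padded, three-digit suffix, so lexicographic order matches numeric order. The short sketch below needs no Redis connection and contrasts padded and unpadded keys; the unpadded variant is hypothetical and shown only for comparison.

```python
# Keys in the format used by this guide: zero-padded to three digits.
padded = [f"bikes:{i:03}" for i in range(1, 12)]

# Hypothetical alternative without padding, for comparison only.
unpadded = [f"bikes:{i}" for i in range(1, 12)]

# Pretend the keys came back in some arbitrary order.
shuffled_padded = list(reversed(padded))
shuffled_unpadded = list(reversed(unpadded))

print(sorted(shuffled_padded)[:3])    # ['bikes:001', 'bikes:002', 'bikes:003']
print(sorted(shuffled_unpadded)[:4])  # ['bikes:1', 'bikes:10', 'bikes:11', 'bikes:2']
```

With padding, the sorted keys line up with the order in which the bikes were stored, which matters later when keys and embeddings are paired by position.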
Next, use the keys as a parameter to the JSON.MGET command, along with the JSONPath expression $.description, to collect the descriptions in a list. Then, pass the list to the encode method to get a list of vectorized embeddings:
descriptions = client.json().mget(keys, "$.description")
descriptions = [item for sublist in descriptions for item in sublist]
embedder = SentenceTransformer("msmarco-distilbert-base-v4")
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
VECTOR_DIMENSION = len(embeddings[0])
# >>> 768
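The list comprehension in the snippet above flattens the reply: with a JSONPath argument, each key contributes its own list of matches, so the raw result is a list of single-item lists. The sketch below reproduces that step on a mocked reply (the description strings are made up) and then uses the standard struct module to estimate how many bytes one 768-dimensional FLOAT32 vector occupies when packed. It is an illustration under those assumptions and does not call Redis or the embedding model.

```python
import struct

# Mocked shape of the reply: one list of matches per key (made-up text).
mocked_reply = [
    ["A small bike for young children."],
    ["A lightweight bike for commuting."],
    ["A sturdy bike for mountain trails."],
]

# Same flattening expression as in the guide.
descriptions = [item for sublist in mocked_reply for item in sublist]
print(descriptions)

# Size of one packed vector: 768 dimensions, 4 bytes per 32-bit float.
dimension = 768  # value reported for the model in this guide
vector = [0.0] * dimension  # placeholder values, not a real embedding
packed = struct.pack(f"<{dimension}f", *vector)
print(len(packed))  # 3072
```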
Now you can add the vectorized descriptions to the JSON documents in Redis using the JSON.SET command to insert a new field in each of the documents under the JSONPath $.description_embeddings. Once again, you'll do this using a pipeline:
pipeline = client.pipeline()
for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, "$.description_embeddings", embedding)
pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]
Inspect one of the vectorized bike documents using the JSON.GET command:
res = client.json().get("bikes:010")
# >>>
# {
#   "model": "Summit",
#   "brand": "nHill",
#   "price": 1200,
#   "type": "Mountain Bike",
#   "specs": {
#     "material": "alloy",
#     "weight": "11.3"
#   },
#   "description": "This budget mountain bike from nHill performs well...",
#   "description_embeddings": [
#     -0.538114607334137,
#     -0.49465855956077576,
#     -0.025176964700222015,
#     ...
#   ]
# }
When storing a vector embedding as part of a JSON datatype, the embedding is stored as a JSON array, in our case under the field description_embeddings, as shown. Note: in the example above, the array was shortened considerably for the sake of readability.
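To make the two representations concrete, the following sketch contrasts the JSON array form used inside the document with the packed FLOAT32 byte string that the query examples in this guide pass as the query_vector parameter. It uses only the standard library and a made-up three-element vector in place of a real 768-dimensional embedding. The helper names are hypothetical. Little-endian byte order is an assumption, chosen to match what NumPy's tobytes() produces on typical x86 and ARM machines.

```python
import json
import struct

# Hypothetical, shortened embedding; the real ones in this guide have 768 elements.
embedding = [-0.5, 0.25, 0.125]


def to_json_array(vector):
    """Serialize a vector the way it appears inside a JSON document."""
    return json.dumps(vector)


def to_float32_blob(vector):
    """Pack a vector as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_blob(blob):
    """Inverse of to_float32_blob: unpack bytes back into a list of floats."""
    count = len(blob) // 4  # each FLOAT32 value occupies 4 bytes
    return list(struct.unpack(f"<{count}f", blob))


json_form = to_json_array(embedding)
blob_form = to_float32_blob(embedding)

print(json_form)       # text form, as stored under $.description_embeddings
print(len(blob_form))  # 4 bytes per dimension
```

The JSON form is human-readable, while the byte form is compact: a 768-dimensional FLOAT32 vector packs into 3072 bytes.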
Making the bikes collection searchable
Redis Stack provides a powerful search engine that introduces commands to create and maintain search indexes for both collections of HASHES and JSON documents.
To create a search index for the bikes collection, use the FT.CREATE command:
1️⃣ FT.CREATE idx:bikes_vss ON JSON
2️⃣   PREFIX 1 bikes: SCORE 1.0
3️⃣   SCHEMA
4️⃣     $.model AS model TEXT WEIGHT 1.0 NOSTEM
5️⃣     $.brand AS brand TEXT WEIGHT 1.0 NOSTEM
6️⃣     $.price AS price NUMERIC
7️⃣     $.type AS type TAG SEPARATOR ","
8️⃣     $.description AS description TEXT WEIGHT 1.0
9️⃣     $.description_embeddings AS vector VECTOR FLAT 6 TYPE FLOAT32 DIM 768 DISTANCE_METRIC COSINE
schema = (
    TextField("$.model", no_stem=True, as_name="model"),
    TextField("$.brand", no_stem=True, as_name="brand"),
    NumericField("$.price", as_name="price"),
    TagField("$.type", as_name="type"),
    TextField("$.description", as_name="description"),
    VectorField(
        "$.description_embeddings",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIMENSION,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=["bikes:"], index_type=IndexType.JSON)
res = client.ft("idx:bikes_vss").create_index(
    fields=schema, definition=definition
)
# >>> 'OK'
More detail on each step:
1. Specifies the name of the index, idx:bikes_vss, which indexes keys of type JSON.
2. The keys being indexed are found using the bikes: key prefix.
3. The SCHEMA keyword marks the beginning of the schema field definitions.
4. Declares that the field in the JSON document at the JSONPath $.model will be indexed as a TEXT field, allowing full-text search queries (with stemming disabled).
5. The $.brand field will also be treated as a TEXT schema field.
6. The $.price field will be indexed as a NUMERIC field, allowing numeric range queries.
7. The $.type field will be indexed as a TAG field. Tag fields allow exact-match queries and are suitable for categorical values.
8. The $.description field will also be indexed as a TEXT field.
9. Finally, the vector embeddings in $.description_embeddings are indexed as a VECTOR field and assigned to the alias vector.
Here's a breakdown of the VECTOR schema field definition to better understand the inner workings of vector similarity in Redis:
- FLAT: Specifies the indexing method, which can be FLAT or HNSW. FLAT (brute-force indexing) provides exact results but at a higher computational cost, while HNSW (Hierarchical Navigable Small World) is a more efficient method that provides approximate results with lower computational overhead. The 6 that follows FLAT in the command is the number of attribute arguments that come next (three name-value pairs).
- TYPE: Set to FLOAT32. Currently supported types are FLOAT32 and FLOAT64.
- DIM: The length or dimension of the embeddings, which you determined previously to be 768.
- DISTANCE_METRIC: One of L2, IP, COSINE.
  - L2 stands for Euclidean distance, a straight-line distance between the vectors. Preferred when the absolute differences, including magnitude, matter most.
  - IP stands for inner product; IP measures the projection of one vector onto another. It reflects both the angle between the vectors and their magnitudes, rather than their absolute positions in the vector space.
  - COSINE stands for cosine similarity, a normalized form of inner product. This metric measures only the angle between two vectors, making it magnitude-independent.
  - For our querying purposes, the direction of the vectors carries more meaning (indicating semantic similarity), and the magnitude is largely influenced by the length of the documents, so COSINE similarity is chosen. Also, our chosen embedding model is fine-tuned for COSINE similarity.
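The differences between the three metrics, and what a brute-force FLAT search amounts to, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how Redis implements it: the function names, document keys, and two-dimensional vectors are all made up. The cosine value is expressed as a distance (1 minus the similarity), which is the form the query examples in this guide convert back with 1 - vector_score.

```python
import math


def l2_distance(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Dot product; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


def flat_knn(query, vectors, k):
    """Brute-force search: score every stored vector, keep the k closest."""
    scored = [(cosine_distance(query, vec), key) for key, vec in vectors.items()]
    return sorted(scored)[:k]


# Hypothetical 2-dimensional "embeddings".
vectors = {
    "doc:a": [1.0, 0.0],
    "doc:b": [10.0, 0.0],  # same direction as doc:a, ten times the magnitude
    "doc:c": [0.0, 1.0],   # orthogonal to the other two
}
query = [2.0, 0.0]

for key, vec in vectors.items():
    print(key, l2_distance(query, vec), inner_product(query, vec),
          cosine_distance(query, vec))

print(flat_knn(query, vectors, k=2))
```

Here doc:a and doc:b point in the same direction as the query, so their cosine distance is identical even though their L2 distances and inner products differ widely. That magnitude-independence is the property that makes COSINE a good fit for comparing descriptions of different lengths.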
Check the state of the index
After the FT.CREATE command creates the index, the indexing process is automatically started in the background. In a short amount of time, all eleven JSON documents should be indexed and ready to be searched. To validate that, use the FT.INFO command to check some information and statistics of the index. Of particular interest are the number of documents successfully indexed and the number of failures:
FT.INFO idx:bikes_vss
info = client.ft("idx:bikes_vss").info()
num_docs = info["num_docs"]
indexing_failures = info["hash_indexing_failures"]
# print(f"{num_docs} documents indexed with {indexing_failures} failures")
# >>> 11 documents indexed with 0 failures
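Because indexing runs in the background, a script that queries immediately after creating the index may see fewer documents than expected. One way to guard against that is a small polling helper, sketched below. The name wait_for_index and its parameters are hypothetical. It reads only the num_docs and hash_indexing_failures fields used above, and it takes the info lookup as a callable so the logic can be tried without a server. Against a live server you would pass something like lambda: client.ft("idx:bikes_vss").info().

```python
import time


def wait_for_index(fetch_info, expected_docs, timeout=5.0, interval=0.1):
    """Poll until the expected number of documents has been indexed.

    fetch_info is any zero-argument callable returning a dict shaped like
    the FT.INFO reply (values may be ints or numeric strings).
    """
    deadline = time.monotonic() + timeout
    while True:
        info = fetch_info()
        num_docs = int(info["num_docs"])
        failures = int(info["hash_indexing_failures"])
        if failures > 0:
            raise RuntimeError(f"{failures} documents failed to index")
        if num_docs >= expected_docs:
            return num_docs
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {num_docs} of {expected_docs} documents indexed"
            )
        time.sleep(interval)


# Stand-in for a live server: the first reply reports 4 documents, the second 11.
fake_replies = iter(
    [
        {"num_docs": "4", "hash_indexing_failures": "0"},
        {"num_docs": "11", "hash_indexing_failures": "0"},
    ]
)
indexed = wait_for_index(lambda: next(fake_replies), expected_docs=11, interval=0.01)
print(f"{indexed} documents indexed")
```

Raising on the first reported failure, rather than continuing to wait, is a design choice of this sketch. A document that fails to index will not succeed on a later poll, so waiting longer would only delay the error.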
Structured data searches with Redis
The index idx:bikes_vss indexes the structured fields of our JSON documents: model, brand, price, and type. It also indexes the unstructured free-form text description and the generated embeddings in description_embeddings. Before diving deeper into Vector Similarity Search (VSS), you need to understand the basics of querying a Redis index. The Redis command of interest is FT.SEARCH. Like a SQL SELECT statement, an FT.SEARCH statement can be as simple or as complex as needed.
Here are a few simple queries that give enough context to complete the VSS examples. For example, to retrieve all bikes where the brand is Peaknetic, use the following command:
FT.SEARCH idx:bikes_vss '@brand:Peaknetic'
query = Query("@brand:Peaknetic")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950', 'description_embeddings': ...
query = Query("@brand:Peaknetic").return_fields("id", "brand", "model", "price")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950'}, Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
query = Query("@brand:Peaknetic @price:[0 1000]").return_fields(
"id", "brand", "model", "price"
)
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
queries = [
"Bike for small kids",
"Best Mountain bikes for kids",
"Cheap Mountain bike for kids",
"Female specific mountain bike",
"Road bike for beginners",
"Commuter bike for people over 60",
"Comfortable commuter bike",
"Good bike for college students",
"Mountain bike for beginners",
"Vintage bike",
"Comfortable city bike",
]
encoded_queries = embedder.encode(queries)
len(encoded_queries)
# >>> 11
def create_query_table(query, queries, encoded_queries, extra_params={}):
    results_list = []
    for i, encoded_query in enumerate(encoded_queries):
        result_docs = (
            client.ft("idx:bikes_vss")
            .search(
                query,
                {
                    "query_vector": np.array(
                        encoded_query, dtype=np.float32
                    ).tobytes()
                }
                | extra_params,
            )
            .docs
        )
        for doc in result_docs:
            vector_score = round(1 - float(doc.vector_score), 2)
            results_list.append(
                {
                    "query": queries[i],
                    "score": vector_score,
                    "id": doc.id,
                    "brand": doc.brand,
                    "model": doc.model,
                    "description": doc.description,
                }
            )
    # Optional: convert the table to Markdown using Pandas
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(
        by=["query", "score"], ascending=[True, False], inplace=True
    )
    queries_table["query"] = queries_table.groupby("query")["query"].transform(
        lambda x: [x.iloc[0]] + [""] * (len(x) - 1)
    )
    queries_table["description"] = queries_table["description"].apply(
        lambda x: (x[:497] + "...") if len(x) > 500 else x
    )
    queries_table.to_markdown(index=False)
query = (
Query("(*)=>[KNN 3 @vector $query_vector AS vector_score]")
.sort_by("vector_score")
.return_fields("vector_score", "id", "brand", "model", "description")
.dialect(2)
)
create_query_table(query, queries, encoded_queries)
# >>> | Best Mountain bikes for kids | 0.54 | bikes:003... (+ 32 more results)
hybrid_query = (
Query("(@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]")
.sort_by("vector_score")
.return_fields("vector_score", "id", "brand", "model", "description")
.dialect(2)
)
create_query_table(hybrid_query, queries, encoded_queries)
# >>> | Best Mountain bikes for kids | 0.3 | bikes:008... (+22 more results)
range_query = (
Query(
"@vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}"
)
.sort_by("vector_score")
.return_fields("vector_score", "id", "brand", "model", "description")
.paging(0, 4)
.dialect(2)
)
create_query_table(
range_query, queries[:1], encoded_queries[:1], {"range": 0.55}
)
# >>> | Bike for small kids | 0.52 | bikes:001 | Velorim |... (+1 more result)
This command returns all matching documents. Because each document includes its vector embedding, the output is verbose. To return only specific fields from the JSON documents, for example the document id, brand, model, and price, you could use:
FT.SEARCH idx:bikes_vss '@brand:Peaknetic' RETURN 4 id brand model price
query = Query("@brand:Peaknetic").return_fields("id", "brand", "model", "price")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950'}, Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
In this query, you are searching against a schema field of type TEXT.
If you want a list of bikes under $1000, you can add a numeric range clause to the query, because the price field is indexed as NUMERIC:
query = Query("@brand:Peaknetic @price:[0 1000]").return_fields(
    "id", "brand", "model", "price"
)
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
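To make the semantics of a combined text and numeric filter concrete, here is a minimal pure-Python sketch. It is not how Redis evaluates queries internally. It assumes that space-separated clauses are intersected (logical AND) and that the numeric range [0 1000] includes both endpoints. The `matches` helper is hypothetical, exact string comparison stands in for full-text term matching, and the three sample records reuse brand, model, and price values that appear in this guide's outputs.

```python
# Sample records reusing values shown in this guide's example outputs.
bikes = {
    "bikes:008": {"brand": "Peaknetic", "model": "Soothe Electric bike", "price": 1950},
    "bikes:009": {"brand": "Peaknetic", "model": "Secto", "price": 430},
    "bikes:010": {"brand": "nHill", "model": "Summit", "price": 1200},
}


def matches(bike, brand, min_price, max_price):
    """Hypothetical stand-in for '@brand:<brand> @price:[<min> <max>]'.

    Both conditions must hold, and the price bounds are inclusive.
    """
    return bike["brand"] == brand and min_price <= bike["price"] <= max_price


selected = [key for key, bike in bikes.items() if matches(bike, "Peaknetic", 0, 1000)]
print(selected)
```

Only the document that satisfies both the brand clause and the price range is selected, which is the behavior the combined query relies on.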
Semantic searching with VSS
Now that the bikes collection is stored and properly indexed in Redis, you can query it using short query prompts. Arrange your queries in a list so you can execute them in bulk:
queries = [
    "Bike for small kids",
    "Best Mountain bikes for kids",
    "Cheap Mountain bike for kids",
    "Female specific mountain bike",
    "Road bike for beginners",
    "Commuter bike for people over 60",
    "Comfortable commuter bike",
    "Good bike for college students",
    "Mountain bike for beginners",
    "Vintage bike",
    "Comfortable city bike",
]
You need to encode the query prompts to query the database using VSS. Just like you did with the descriptions of the bikes, you'll use the SentenceTransformers model to encode the queries:
encoded_queries = embedder.encode(queries)
len(encoded_queries)
# >>> 11
Constructing a pure K-nearest neighbors (KNN) VSS query
Start with a KNN query. KNN is a foundational algorithm used in VSS, where the goal is to find the most similar items to a given query item. Using the chosen distance metric, the KNN algorithm calculates the distance between the query vector and each vector in the database. It then returns the K items with the smallest distances to the query vector. These are the most similar items.
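The exhaustive comparison described above can be sketched in a few lines of standard-library Python. This is an illustration only, not Redis's implementation: the function names `cosine_distance` and `knn` and the tiny two-dimensional vectors are made up for the example, and cosine distance is taken to be 1 minus the cosine similarity.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query_vector, vectors, k):
    """Return the k (key, distance) pairs with the smallest distance."""
    scored = [
        (key, cosine_distance(query_vector, vec)) for key, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Made-up 2-dimensional vectors; the embeddings in this guide have 768 dimensions.
vectors = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.0, 1.0],
    "doc:c": [1.0, 1.0],
    "doc:d": [-1.0, 0.0],
}

nearest = knn([1.0, 0.1], vectors, k=2)
for key, distance in nearest:
    print(key, round(distance, 3))
```

The query vector points almost in the same direction as `doc:a`, so that document comes first, followed by `doc:c`, which lies at roughly 45 degrees from it.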
The syntax for vector similarity KNN queries is (*)=>[<vector_similarity_query>], where (*) (the * meaning all) is the filter query for the search engine. By replacing it with a narrower filter, you can reduce the search space on which the KNN algorithm operates.
- The $query_vector represents the query parameter you'll use to pass the vectorized query prompt.
- The results will be sorted by vector_score, the name that the AS clause gives to the computed distance. The name is derived from the name of the field indexed as a vector, in our case vector (the alias for $.description_embeddings), by appending _score to it.
- Our query will return the vector_score, the ids of the matched documents, and the $.brand, $.model, and $.description.
- Finally, to utilize a vector similarity query with the FT.SEARCH command, you must specify DIALECT 2 or greater.
query = (
    Query('(*)=>[KNN 3 @vector $query_vector AS vector_score]')
    .sort_by('vector_score')
    .return_fields('vector_score', 'id', 'brand', 'model', 'description')
    .dialect(2)
)
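Because the filter, the number of neighbors, and the field names all live inside one query string, it can help to see how the pieces fit together. The following sketch assembles that string with a hypothetical helper, `build_knn_query`, which is not part of redis-py; it only does string formatting and assumes the `(<filter>)=>[KNN <k> @<field> $<param> AS <alias>]` layout used in this guide.

```python
def build_knn_query(k, vector_field, param_name, score_alias, filter_expr="*"):
    """Assemble '(<filter>)=>[KNN <k> @<field> $<param> AS <alias>]'.

    Hypothetical helper for illustration; it performs no validation.
    """
    return (
        f"({filter_expr})=>"
        f"[KNN {k} @{vector_field} ${param_name} AS {score_alias}]"
    )


# Pure KNN: the '*' filter matches every document in the index.
pure_knn = build_knn_query(3, "vector", "query_vector", "vector_score")
print(pure_knn)

# Filtered KNN: only documents matching the filter are candidates.
filtered_knn = build_knn_query(
    3, "vector", "query_vector", "vector_score", filter_expr="@brand:Peaknetic"
)
print(filtered_knn)
```

The resulting strings can be passed to `Query(...)` in the same way as the literal query string used in this guide.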
Pass the vectorized query as $query_vector to the search function to execute the query. The following code shows an example of creating a Python NumPy array from a vectorized query prompt (encoded_query) as a single-precision floating-point array and converting it into a compact, byte-level representation that can be passed as a Redis parameter:
client.ft("idx:bikes_vss").search(query, {"query_vector": np.array(encoded_query, dtype=np.float32).tobytes()}).docs
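To see what the byte-level representation looks like, the sketch below packs a small made-up vector with the standard-library `struct` module instead of NumPy. It assumes a little-endian machine, where this layout matches what `np.array(..., dtype=np.float32).tobytes()` produces: four bytes per value, in order. The 4-dimensional vector is illustrative only.

```python
import struct

# Made-up 4-dimensional vector; these values are exactly representable in float32.
encoded_query = [0.5, -0.25, 0.125, 1.0]

# Pack as little-endian 32-bit floats: 4 bytes per dimension.
query_blob = struct.pack(f"<{len(encoded_query)}f", *encoded_query)
print(len(query_blob))  # number of bytes in the blob

# Unpacking the blob recovers the original values.
restored = list(struct.unpack(f"<{len(query_blob) // 4}f", query_blob))
print(restored)
```

For a FLOAT32 field, the blob length should therefore be four times the dimension declared in the index schema.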
With the template for the query in place, use Python to execute all query prompts in a loop, passing the vectorized query prompts. Notice that for each result, the script calculates the score as 1 - doc.vector_score. Because cosine distance is used as the metric, the items with the smallest distance are the most similar to the query. Subtracting the distance from 1 converts it into a cosine similarity, so higher scores mean closer matches.
Then loop over the matched documents and create a list of results that can be converted into a Pandas table to visualize the results:
import json
import time
import numpy as np
import pandas as pd
import redis
import requests
from redis.commands.search.field import (
NumericField,
TagField,
TextField,
VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer
url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()
json.dumps(bikes[0], indent=2)
client = redis.Redis(host="localhost", port=6379, decode_responses=True)
res = client.ping()
# >>> True
pipeline = client.pipeline()
for i, bike in enumerate(bikes, start=1):
    redis_key = f"bikes:{i:03}"
    pipeline.json().set(redis_key, "$", bike)
res = pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]
res = client.json().get("bikes:010", "$.model")
# >>> ['Summit']
keys = sorted(client.keys("bikes:*"))
# >>> ['bikes:001', 'bikes:002', ..., 'bikes:011']
descriptions = client.json().mget(keys, "$.description")
descriptions = [item for sublist in descriptions for item in sublist]
embedder = SentenceTransformer("msmarco-distilbert-base-v4")
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
VECTOR_DIMENSION = len(embeddings[0])
# >>> 768
pipeline = client.pipeline()
for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, "$.description_embeddings", embedding)
pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]
res = client.json().get("bikes:010")
# >>>
# {
# "model": "Summit",
# "brand": "nHill",
# "price": 1200,
# "type": "Mountain Bike",
# "specs": {
# "material": "alloy",
# "weight": "11.3"
# },
# "description": "This budget mountain bike from nHill performs well..."
# "description_embeddings": [
# -0.538114607334137,
# -0.49465855956077576,
# -0.025176964700222015,
# ...
# ]
# }
schema = (
    TextField("$.model", no_stem=True, as_name="model"),
    TextField("$.brand", no_stem=True, as_name="brand"),
    NumericField("$.price", as_name="price"),
    TagField("$.type", as_name="type"),
    TextField("$.description", as_name="description"),
    VectorField(
        "$.description_embeddings",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIMENSION,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=["bikes:"], index_type=IndexType.JSON)
res = client.ft("idx:bikes_vss").create_index(
    fields=schema, definition=definition
)
# >>> 'OK'
info = client.ft("idx:bikes_vss").info()
num_docs = info["num_docs"]
indexing_failures = info["hash_indexing_failures"]
# print(f"{num_docs} documents indexed with {indexing_failures} failures")
# >>> 11 documents indexed with 0 failures
query = Query("@brand:Peaknetic")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950', 'description_embeddings': ...
query = Query("@brand:Peaknetic").return_fields("id", "brand", "model", "price")
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950'}, Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
query = Query("@brand:Peaknetic @price:[0 1000]").return_fields(
    "id", "brand", "model", "price"
)
res = client.ft("idx:bikes_vss").search(query).docs
# print(res)
# >>> [Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
queries = [
    "Bike for small kids",
    "Best Mountain bikes for kids",
    "Cheap Mountain bike for kids",
    "Female specific mountain bike",
    "Road bike for beginners",
    "Commuter bike for people over 60",
    "Comfortable commuter bike",
    "Good bike for college students",
    "Mountain bike for beginners",
    "Vintage bike",
    "Comfortable city bike",
]
encoded_queries = embedder.encode(queries)
len(encoded_queries)
# >>> 11
def create_query_table(query, queries, encoded_queries, extra_params=None):
    """Run each encoded query and return the matches as a Markdown table."""
    results_list = []
    for i, encoded_query in enumerate(encoded_queries):
        result_docs = (
            client.ft("idx:bikes_vss")
            .search(
                query,
                {
                    "query_vector": np.array(
                        encoded_query, dtype=np.float32
                    ).tobytes()
                }
                | (extra_params or {}),
            )
            .docs
        )
        for doc in result_docs:
            # Convert the cosine distance into a similarity score
            vector_score = round(1 - float(doc.vector_score), 2)
            results_list.append(
                {
                    "query": queries[i],
                    "score": vector_score,
                    "id": doc.id,
                    "brand": doc.brand,
                    "model": doc.model,
                    "description": doc.description,
                }
            )

    # Optional: convert the table to Markdown using Pandas
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(
        by=["query", "score"], ascending=[True, False], inplace=True
    )
    # Show each query text only on the first row of its group
    queries_table["query"] = queries_table.groupby("query")["query"].transform(
        lambda x: [x.iloc[0]] + [""] * (len(x) - 1)
    )
    # Truncate long descriptions to keep the table readable
    queries_table["description"] = queries_table["description"].apply(
        lambda x: (x[:497] + "...") if len(x) > 500 else x
    )
    return queries_table.to_markdown(index=False)
query = (
    Query("(*)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
create_query_table(query, queries, encoded_queries)
# >>> | Best Mountain bikes for kids | 0.54 | bikes:003... (+ 32 more results)
hybrid_query = (
    Query("(@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
create_query_table(hybrid_query, queries, encoded_queries)
# >>> | Best Mountain bikes for kids | 0.3 | bikes:008... (+22 more results)
range_query = (
    Query(
        "@vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}"
    )
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .paging(0, 4)
    .dialect(2)
)
create_query_table(
    range_query, queries[:1], encoded_queries[:1], {"range": 0.55}
)
# >>> | Bike for small kids | 0.52 | bikes:001 | Velorim |... (+1 more result)
The query results show the individual queries' top 3 matches (our K parameter) along with the bike's id, brand, and model for each query. For example, for the query "Best Mountain bikes for kids", the highest similarity score (0.54) and therefore the closest match was the 'Nord' brand 'Chook air 5' bike model, described as:
"The Chook Air 5 gives kids aged six years and older a durable and uberlight mountain bike for their first experience on tracks and easy cruising through forests and fields. The lower top tube makes it easy to mount and dismount in any situation, giving your kids greater safety on the trails. The Chook Air 5 is the perfect intro to mountain biking."
From the description, this bike is an excellent match for younger children, and the MS MARCO model-generated embeddings seem to have captured the semantics of the description accurately.
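If you only want the single closest match per query rather than the full table, the rows collected for the table can be reduced in plain Python. The sketch below is illustrative: the rows are shaped like the dictionaries appended to results_list, only the 0.54 / bikes:003 entry is taken from the output above, and the other scores and ids are placeholders.

```python
# Rows shaped like the entries in results_list. Only the first row reflects
# the output shown above; the remaining rows are made-up placeholders.
rows = [
    {"query": "Best Mountain bikes for kids", "score": 0.54, "id": "bikes:003"},
    {"query": "Best Mountain bikes for kids", "score": 0.40, "id": "placeholder-a"},
    {"query": "Best Mountain bikes for kids", "score": 0.30, "id": "placeholder-b"},
]

# Keep the highest-scoring row for each query.
best_by_query = {}
for row in rows:
    current = best_by_query.get(row["query"])
    if current is None or row["score"] > current["score"]:
        best_by_query[row["query"]] = row

for query_text, row in best_by_query.items():
    print(f"{query_text}: {row['id']} (score {row['score']})")
```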