# Vector similarity

Use Redis as a vector database

## The unstructured data problem

Today, about 80% of the data organizations generate is unstructured: data that either does not have a well-defined schema or cannot be restructured into a familiar columnar format. Typical examples of unstructured data include free-form text, images, videos, and sound clips. The amount of unstructured data is expected to grow in the coming decades.

Unstructured data is high-dimensional and noisy, making it more challenging to analyze and interpret using traditional methods. But it is also packed with information and meaning.

Traditionally, unstructured data is processed to extract specific features, effectively turning it into structured data. Once in the realm of structured data, you can search the data with SQL queries (if stored in a relational database) or with a text search engine.

The approach of transforming unstructured data into structured data has a few issues. First, engineering features out of unstructured data can be computationally expensive and error-prone, significantly delaying when you can effectively use the data. Second, some fidelity and information may be lost in the extraction/transformation process, because unique, latent features can't be easily categorized or quantified.

## Enter vector databases

One approach to dealing with unstructured data is to vectorize it. Vectorizing means converting something like a text passage, an image, a video, or a song into a flat sequence of numbers that represents that particular piece of data. These vectors are representations of the data in N-dimensional space. Vectorizing makes it possible to use linear algebra techniques to compare, group, and operate on the data. This is the foundation of a vector database: the ability to store and operate on vectors. The approach itself is not new; what has changed is how far the techniques for generating the vectors have advanced.

## Using machine learning embeddings as vectors

Traditional methods for converting unstructured textual data into vector form include Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). For categorical data, one-hot encoding is a commonly used approach. Hashing and feature extraction techniques, such as edge detection, texture analysis, or color histograms, have been employed for high-dimensionality data like images.

While powerful in their own right, these approaches reveal limitations when confronted with high-dimensional and intricate data forms like long text passages, images, and audio. Consider, for example, how a text passage could be restructured through sentence rearrangement, synonym usage, or alterations in narrative style. Such simple modifications could effectively sidestep techniques like Bag-of-Words, preventing systems using the generated encodings from identifying text passages with similar meanings.
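To make this limitation concrete, here is a minimal sketch in plain Python. The example sentences and the helper names `bag_of_words` and `cosine_similarity` are illustrative assumptions, and the tokenizer is deliberately naive. It compares Bag-of-Words vectors for a sentence, a close copy of it, and a paraphrase that uses different words for the same idea:

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative sentences, not taken from the bikes dataset.
original = "a light bike for small children"
near_copy = "a light bike for small kids"
paraphrase = "lightweight bicycle made with little kids in mind"

print(cosine_similarity(bag_of_words(original), bag_of_words(near_copy)))
print(cosine_similarity(bag_of_words(original), bag_of_words(paraphrase)))
```

The near copy shares most of its tokens with the original, so its similarity is high. The paraphrase shares none, so its Bag-of-Words similarity is zero even though a human reader would consider the two sentences close in meaning.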

This is where advancements in machine learning, particularly deep learning, make their mark. Machine learning models have facilitated the rise of embeddings as a widely embraced method for generating dense, low-dimensional vector representations. Given a suitable model, the generated embeddings can encapsulate complex patterns and semantic meanings inherent in data, thus overcoming the limitations of their traditional counterparts.

### Generate embeddings for the bikes dataset

To investigate vector similarity, you'll use a subset of the bikes dataset, a relatively simple synthetic dataset. The dataset has 11 bicycle records in a JSON file named `bikes.json` and includes the fields `model`, `brand`, `price`, `type`, `specs`, and `description`. The `description` field is particularly interesting for our purposes since it consists of a free-form textual description of a bicycle.

#### Before getting started with the code examples

Code examples are currently provided for Redis CLI and Python. For Python, you will need to create a virtual environment and install the following Python packages:

1. redis
2. pandas
3. sentence-transformers
4. (optional) tabulate; this package is used by Pandas to convert dataframe tables to Markdown

You'll also need the following imports:

Let's load the bikes dataset as a JSON array using the following Python 3 code:
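A minimal sketch of this step, assuming `bikes.json` sits in the current working directory (the dataset may equally be loaded from another location). The helper name `load_bikes` is hypothetical:

```python
import json
from pathlib import Path


def load_bikes(path="bikes.json"):
    """Read the bikes dataset from a local JSON file and return it as a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        bikes = json.load(f)
    if not isinstance(bikes, list):
        raise ValueError("expected the file to contain a JSON array of bike records")
    return bikes
```

Calling `bikes = load_bikes()` would then give the `bikes` list that the rest of the examples refer to.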

#### Inspect the bikes JSON

Let's inspect the content of the JSON array in table form:

| model | brand | price | type | specs | description |
|---|---|---|---|---|---|
| Jigger | Velorim | 270 | Kids bikes | {'material': 'aluminium', 'weight': '10'} | Small and powerful, the Jigger is the best rid... |
| Hillcraft | Bicyk | 1200 | Kids Mountain Bikes | {'material': 'carbon', 'weight': '11'} | Kids want to ride with as little weight as pos... |
| Chook air 5 | Nord | 815 | Kids Mountain Bikes | {'material': 'alloy', 'weight': '9.1'} | The Chook Air 5 gives kids aged six years and... |
| ... | ... | ... | ... | ... | ... |

Let's take a look at the structure of one of our bike JSON documents:

```json
{
    "model": "Jigger",
    "brand": "Velorim",
    "price": 270,
    "type": "Kids bikes",
    "specs": {
        "material": "aluminium",
        "weight": "10"
    },
    "description": "Small and powerful, the Jigger is the best ride for the smallest of tikes! ...
}
```

#### Generating text embeddings using SentenceTransformers

You will use the SentenceTransformers framework to generate embeddings for the bikes descriptions. Sentence-BERT (SBERT) is a BERT model modification that produces consistent and contextually rich sentence embeddings. SBERT improves tasks like semantic search and text grouping by allowing for efficient and meaningful comparison of sentence-level semantic similarity.

#### Selecting a suitable pre-trained model

You must pick a suitable model based on the task at hand when generating embeddings. In this case, you want to query for bicycles using short sentences against the longer bicycle descriptions. This is referred to as asymmetric semantic search, often employed in cases where the search query and the documents being searched are of a different nature or structure. Suitable models for asymmetric semantic search include pre-trained MS MARCO models. MS MARCO models are optimized for understanding real-world queries and retrieving relevant responses. They are widely used in search engines, chatbots, and other AI applications. At the time this tutorial was written, the highest performing MS MARCO model tuned for cosine similarity available in the SentenceTransformers package was `msmarco-distilbert-base-v4`.

Let's load the model by instantiating the `SentenceTransformer` class:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v4')
```

Let's grab the description from the first bike in the JSON array:

```python
from textwrap import TextWrapper

sample_description = bikes[0]['description']
wrapped_sample_description = TextWrapper(width=120).wrap(sample_description)
print(wrapped_sample_description)
# Output:
# ['Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids’ pedal bike on the',
#  'market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring',
#  'to go. We say rare because this smokin’ little bike is not ideal for a nervous first-time rider, but it’s a true giddy',
#  ...]
```

To generate the vector embeddings, use the model's `encode` method:

```python
embedding = embedder.encode(sample_description)
```

Let's take a peek at the first 5 elements of the generated vector:

```python
print(embedding.tolist()[:5])
# Output:
# [0.20076484978199005, -0.1300073117017746, 0.3081613779067993, 0.2062796652317047, -0.3692358434200287]
```

Let's look at the length of the vector embeddings generated by the model.

```python
print(len(embedding))
# Output:
# 768
```

The chosen model generates vector embeddings of length `768` regardless of the length of the input text.

## Storing our bikes in Redis

### Redis Stack setup

There are many ways to install and run Redis. See Install Redis Stack for more information.

Now that you know how to vectorize the bikes descriptions, it's time to start working with Redis.

### Redis Python client

To interact with Redis, install the redis-py client library, which encapsulates the commands to work with OSS Redis as well as Redis Stack. For an overview of how to use `redis-py`, see the Redis Python Guide.

#### Create a `redis-py` client and test the server

Instantiate the Redis client, connecting to localhost on Redis's default port `6379`. By default, Redis returns binary responses; to decode them, pass the `decode_responses` parameter set to `True`:

Let's use Redis' PING command to check that Redis is up and running:

### Storing the bikes as JSON documents in Redis

Redis Stack includes a JSON data type. Like any other Redis data type, the JSON data type allows you to use Redis commands to save, update, and retrieve JSON values. The bikes data is already loaded in memory as the `bikes` JSON array. You will iterate over `bikes`, generate a suitable Redis key for each bike, and store each bike in Redis using the JSON.SET command. You'll do this using a pipeline to minimize the round-trip times:
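A minimal sketch of this step. It assumes `client` is a redis-py client connected to a server with JSON support. The zero-padded key format (`bikes:001`, `bikes:002`, and so on) is an illustrative choice, and the helper name `store_bikes` is hypothetical:

```python
def store_bikes(client, bikes, key_prefix="bikes:"):
    """Queue one JSON.SET per bike on a pipeline and send them in one round trip.

    `client` is assumed to be a redis-py client; `key_prefix` and the
    zero-padded numbering are assumptions made for this sketch.
    """
    pipeline = client.pipeline()
    keys = []
    for i, bike in enumerate(bikes, start=1):
        redis_key = f"{key_prefix}{i:03}"  # e.g. bikes:001
        pipeline.json().set(redis_key, "$", bike)  # "$" addresses the document root
        keys.append(redis_key)
    pipeline.execute()  # nothing is sent to the server until this call
    return keys
```

Because the commands are only queued until `execute()` is called, the whole collection is written with a single network round trip instead of one per bike.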

Let's retrieve a specific value from one of the JSON bikes in Redis using a JSONPath expression:

### Vectorize all of the bike descriptions

To vectorize all the descriptions in the database, first collect all the Redis keys for the bikes.

Next, use the keys as a parameter to the JSON.MGET command, along with the JSONPath expression `$.description` to collect the descriptions in a list. Then, pass the list to the `encode` method to get a list of vectorized embeddings:

Now you can add the vectorized descriptions to the JSON documents in Redis using the `JSON.SET` command to insert a new field in each of the documents under the JSONPath `$.description_embeddings`. Once again, you'll do this using a pipeline:
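The three steps in this section (collect the keys, fetch and encode the descriptions, write the embeddings back) can be sketched together as follows. This assumes `client` is a redis-py client, `embedder` is any object with an `encode` method such as the `SentenceTransformer` model loaded earlier, and every bike document has exactly one `description`. The helper name `add_description_embeddings` is hypothetical:

```python
def add_description_embeddings(client, embedder, key_pattern="bikes:*"):
    """Fetch every description, encode it, and store the vector next to it."""
    keys = sorted(client.keys(key_pattern))

    # JSON.MGET with a JSONPath returns one list of matches per key.
    # Flattening assumes each document has exactly one description.
    nested = client.json().mget(keys, "$.description")
    descriptions = [item for matches in nested for item in matches]

    embeddings = embedder.encode(descriptions)

    pipeline = client.pipeline()
    for key, embedding in zip(keys, embeddings):
        # Convert to plain Python floats so the vector serializes as a JSON array.
        vector = [float(x) for x in embedding]
        pipeline.json().set(key, "$.description_embeddings", vector)
    pipeline.execute()
    return keys
```

Sorting the keys keeps the order of `keys`, `descriptions`, and `embeddings` aligned, which is what allows `zip` to pair each vector with the right document.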

Inspect one of the vectorized bike documents using the `JSON.GET` command:

When storing a vector embedding as part of a JSON datatype, the embedding is stored as a JSON array, in our case, under the field `description_embeddings` as shown. Note: in the example above, the array was shortened considerably for the sake of readability.

### Making the bikes collection searchable

Redis Stack provides a powerful search engine that introduces commands to create and maintain search indexes for both collections of HASHES and JSON documents.

To create a search index for the bikes collection, use the FT.CREATE command:

More detail on each step:

1. Specify the name of the index, `idx:bikes_vss`, which indexes keys of type `JSON`.
2. The keys being indexed are found using the `bikes:` key prefix.
3. The `SCHEMA` keyword marks the beginning of the schema field definitions.
4. The field at the JSONPath `$.model` is indexed as a `TEXT` field with stemming disabled, allowing full-text search queries.
5. The `$.brand` field is also treated as a `TEXT` schema field.
6. The `$.price` field is indexed as a `NUMERIC` field, allowing numeric range queries.
7. The `$.type` field is indexed as a `TAG` field. Tag fields allow exact-match queries and are suitable for categorical values.
8. The `$.description` field is also indexed as a `TEXT` field.
9. Finally, the vector embeddings in `$.description_embeddings` are indexed as a `VECTOR` field and assigned to the alias `vector`.
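As a rough sketch, the nine steps above can be written out as the raw argument list of an `FT.CREATE` command. The helper name `build_ft_create_args` is hypothetical, the index name `idx:bikes_vss` is the one used later in this tutorial, and the `AS` aliases for the non-vector fields are assumptions. Building the arguments as a plain list makes each step visible:

```python
def build_ft_create_args(index_name="idx:bikes_vss", prefix="bikes:", dim=768):
    """Return FT.CREATE arguments mirroring the numbered steps (a sketch)."""
    return [
        "FT.CREATE", index_name, "ON", "JSON",         # 1. index name, JSON keys
        "PREFIX", "1", prefix,                         # 2. one key prefix
        "SCHEMA",                                      # 3. schema definitions start
        "$.model", "AS", "model", "TEXT", "NOSTEM",    # 4. full text, no stemming
        "$.brand", "AS", "brand", "TEXT",              # 5. full text
        "$.price", "AS", "price", "NUMERIC",           # 6. numeric range queries
        "$.type", "AS", "type", "TAG",                 # 7. exact-match categories
        "$.description", "AS", "description", "TEXT",  # 8. full text
        "$.description_embeddings", "AS", "vector",    # 9. vector field, alias "vector"
        "VECTOR", "FLAT", "6",                         # 6 = number of arguments that follow
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]


print(" ".join(build_ft_create_args()))
```

With a live server, a list like this could be sent through redis-py's generic `client.execute_command(*build_ft_create_args())`. The library also offers higher-level schema helpers for the same purpose.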

Here's a breakdown of the `VECTOR` schema field definition to better understand the inner workings of vector similarity in Redis:

• `FLAT`: Specifies the indexing method, which can be `FLAT` or `HNSW`. FLAT (brute-force indexing) provides exact results but at a higher computational cost, while HNSW (Hierarchical Navigable Small World) is a more efficient method that provides approximate results with lower computational overhead.
• `TYPE`: Set to `FLOAT32`. Currently supported types are `FLOAT32` and `FLOAT64`.
• `DIM`: The length or dimension of the embeddings, which you determined previously to be `768`.
• `DISTANCE_METRIC`: One of `L2`, `IP`, `COSINE`.
  • `L2` stands for Euclidean distance, a straight-line distance between the vectors. Preferred when the absolute differences, including magnitude, matter most.
  • `IP` stands for inner product; `IP` measures the projection of one vector onto another, so it is sensitive to both the angle between the vectors and their magnitudes.
  • `COSINE` stands for cosine similarity, a normalized form of inner product. This metric measures only the angle between two vectors, making it magnitude-independent.
  • For our querying purposes, the direction of the vectors carries more meaning (indicating semantic similarity), and the magnitude is largely influenced by the length of the documents, so `COSINE` similarity is chosen. Also, our chosen embedding model is fine-tuned for `COSINE` similarity.
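The difference between the three metrics is easiest to see on a toy example. The sketch below uses plain Python and two made-up 2-dimensional vectors that point in the same direction but differ in magnitude. The function names are illustrative, and cosine distance is taken here as 1 minus cosine similarity:

```python
import math


def l2_distance(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Sum of element-wise products; grows with the magnitude of either vector."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity; depends only on the angle between the vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


short_doc = [1.0, 2.0]
long_doc = [3.0, 6.0]  # same direction as short_doc, three times the magnitude

print("L2:    ", l2_distance(short_doc, long_doc))
print("IP:    ", inner_product(short_doc, long_doc))
print("COSINE:", cosine_distance(short_doc, long_doc))
```

The L2 distance and the inner product both change when one vector is scaled, while the cosine distance stays at zero (up to floating-point rounding) because the direction is unchanged.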

#### Check the state of the index

After the `FT.CREATE` command creates the index, the indexing process is automatically started in the background. In a short amount of time, all eleven JSON documents should be indexed and ready to be searched. To validate that, use the FT.INFO command to check some information and statistics of the index. Of particular interest are the number of documents successfully indexed and the number of failures:
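A hedged sketch of that check, assuming `client` is a redis-py client created with `decode_responses=True` and that the `FT.INFO` reply exposes the `num_docs` and `hash_indexing_failures` entries (the exact entry names can differ between server versions). The helper name `index_status` is hypothetical:

```python
def index_status(client, index_name="idx:bikes_vss"):
    """Return (documents indexed, indexing failures) reported by FT.INFO."""
    info = client.ft(index_name).info()
    # Values may arrive as strings, so convert them before comparing.
    num_docs = int(info["num_docs"])
    failures = int(info["hash_indexing_failures"])
    return num_docs, failures
```

For the bikes collection, you would expect this to report eleven documents and zero failures once background indexing has finished.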

### Structured data searches with Redis

The index `idx:bikes_vss` indexes the structured fields of our JSON documents `model`, `brand`, `price`, and `type`. It also indexes the unstructured free-form text `description` and the generated embeddings in `description_embeddings`. Before diving deeper into Vector Similarity Search (VSS), you need to understand the basics of querying a Redis index. The Redis command of interest is FT.SEARCH. Like a SQL `select` statement, an `FT.SEARCH` statement can be as simple or as complex as needed.

Here are a few simple queries that give enough context to complete the VSS examples. For example, to retrieve all bikes where the `brand` is `Peaknetic`, use the following command:

This command will return all matching documents. With the inclusion of the vector embeddings, that's a little too verbose. If you only wanted to return specific fields from the JSON documents, for example, the document `id`, the `brand`, `model` and `price`, you could use:

In this query, you are searching against a schema field of type `TEXT`.

If you want a list of bikes under \$1000, you can add a numeric range clause to the query since the `price` field is indexed as `NUMERIC`:
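The three queries described in this section can be sketched as raw `FT.SEARCH` argument lists. The helper name `build_search_args` is hypothetical, and the exclusive upper bound `(1000` is one way to express "under 1000", chosen here as an assumption:

```python
def build_search_args(index_name, query, return_fields=None):
    """Assemble FT.SEARCH arguments, optionally limiting the returned fields."""
    args = ["FT.SEARCH", index_name, query]
    if return_fields:
        # RETURN takes the number of fields followed by the field names.
        args += ["RETURN", str(len(return_fields)), *return_fields]
    return args


INDEX_NAME = "idx:bikes_vss"
fields = ["id", "brand", "model", "price"]

# 1. All bikes whose brand matches "Peaknetic" (full documents returned).
all_peaknetic = build_search_args(INDEX_NAME, "@brand:Peaknetic")

# 2. The same query, returning only selected fields.
peaknetic_brief = build_search_args(INDEX_NAME, "@brand:Peaknetic", fields)

# 3. Adding a numeric range clause; "(" makes the upper bound exclusive.
peaknetic_under_1000 = build_search_args(
    INDEX_NAME, "@brand:Peaknetic @price:[0 (1000]", fields
)

for args in (all_peaknetic, peaknetic_brief, peaknetic_under_1000):
    print(args)
```

Each list could be sent to a live server with redis-py's generic `client.execute_command(*args)`. Separating clauses with a space, as in the third query, combines them so that a document must satisfy both.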

### Semantic searching with VSS

Now that the bikes collection is stored and properly indexed in Redis, you can query it using short query prompts. Arrange your queries in a list so you can execute them in bulk:

You need to encode the query prompts to query the database using VSS. Just like you did with the descriptions of the bikes, you'll use the SentenceTransformers model to encode the queries:

#### Constructing a pure K-nearest neighbors (KNN) VSS query

Start with a KNN query. KNN is a foundational algorithm used in VSS, where the goal is to find the most similar items to a given query item. Using the chosen distance metric, the KNN algorithm calculates the distance between the query vector and each vector in the database. It then returns the K items with the smallest distances to the query vector. These are the most similar items.
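To illustrate the idea, here is a minimal brute-force KNN sketch in plain Python. It is an illustration of the algorithm as described above, not a description of how Redis implements it internally. The toy 2-dimensional vectors, the keys, and the function names are made up for the example:

```python
import heapq
import math


def cosine_distance(a, b):
    """1 minus cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query_vector, vectors, k):
    """Return the k (distance, key) pairs closest to query_vector.

    `vectors` maps a document key to its vector. Every vector is compared
    against the query, which is what an exhaustive (FLAT-style) search does.
    """
    scored = ((cosine_distance(query_vector, v), key) for key, v in vectors.items())
    return heapq.nsmallest(k, scored)


# Toy data: real embeddings in this tutorial have 768 dimensions.
vectors = {
    "bikes:001": [1.0, 0.0],
    "bikes:002": [0.0, 1.0],
    "bikes:003": [0.9, 0.1],
    "bikes:004": [-1.0, 0.0],
}

for distance, key in knn([1.0, 0.1], vectors, k=2):
    print(key, round(distance, 4))
```

The two vectors that point in nearly the same direction as the query come back first, while the perpendicular and opposite vectors are left out.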

The syntax for vector similarity KNN queries is `(*)=>[<vector_similarity_query>]` where the `(*)` (the `*` meaning all) is the filter query for the search engine. That way, one can reduce the search space by filtering the collection on which the KNN algorithm operates.

• The `$query_vector` represents the query parameter you'll use to pass the vectorized query prompt.
• The results will be sorted by `vector_score`, the name the query assigns (with `AS vector_score`) to the distance computed against the field indexed as a vector, in our case `vector` (the alias for `$.description_embeddings`).
• Our query will return the `vector_score`, the `id`s of the matched documents, and the `$.brand`, `$.model`, and `$.description`.
• Finally, to utilize a vector similarity query with the `FT.SEARCH` command, you must specify DIALECT 2 or greater.

```python
from redis.commands.search.query import Query

query = (
    Query('(*)=>[KNN 3 @vector $query_vector AS vector_score]')
    .sort_by('vector_score')
    .return_fields('vector_score', 'id', 'brand', 'model', 'description')
    .dialect(2)
)
```

Pass the vectorized query as `$query_vector` to the search function to execute the query. The following code shows an example of creating a Python NumPy array from a vectorized query prompt (`encoded_query`) as a single precision floating point array and converting it into a compact, byte-level representation that can be passed as a Redis parameter:

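```python
import numpy as np

# Serialize the query embedding as raw float32 bytes, matching the index TYPE.
query_vector = np.array(encoded_query, dtype=np.float32).tobytes()
docs = client.ft(INDEX_NAME).search(query, {'query_vector': query_vector}).docs
```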

With the template for the query in place, use Python to execute all query prompts in a loop, passing the vectorized query prompts. Notice that for each result, the script calculates the `vector_score` as `1 - doc.vector_score`. Because cosine "distance" is used as the metric, the items with the smallest "distance" are closer and therefore more similar to the query.

Then loop over the matched documents and create a list of results that can be converted into a Pandas table to visualize the results:
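The per-result conversion can be sketched as a small helper. It assumes each matched document exposes `id`, `vector_score`, `brand`, and `model` as attributes, with `vector_score` arriving as a string. The helper name `to_result_rows` is hypothetical, and the stand-in document below uses a placeholder id and a distance chosen only for illustration:

```python
from types import SimpleNamespace


def to_result_rows(query_text, docs):
    """Turn matched documents into plain dicts, one row per match."""
    rows = []
    for doc in docs:
        rows.append(
            {
                "query": query_text,
                # Convert cosine distance into a similarity-style score.
                "score": round(1 - float(doc.vector_score), 2),
                "id": doc.id,
                "brand": doc.brand,
                "model": doc.model,
            }
        )
    return rows


# Stand-in for one entry of `search(...).docs`; the id and distance are placeholders.
example_docs = [
    SimpleNamespace(id="bikes:example", vector_score="0.46", brand="Nord", model="Chook air 5")
]

print(to_result_rows("Best Mountain bikes for kids", example_docs))
```

A list of rows like this can be handed to `pandas.DataFrame(rows)` to get the table form used for inspecting results.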

The query results show the individual queries' top 3 matches (our K parameter) along with the bike's id, brand, and model for each query. For example, for the query "Best Mountain bikes for kids", the highest similarity score (`0.54`) and therefore the closest match was the 'Nord' brand 'Chook air 5' bike model, described as:

"The Chook Air 5 gives kids aged six years and older a durable and uberlight mountain bike for their first experience on tracks and easy cruising through forests and fields. The lower top tube makes it easy to mount and dismount in any situation, giving your kids greater safety on the trails. The Chook Air 5 is the perfect intro to mountain biking."

From the description, this bike is an excellent match for younger children, and the MS MARCO model-generated embeddings seem to have captured the semantics of the description accurately.