Tokenization
Controlling text tokenization and escaping
Full-text search works by comparing words, URLs, numbers, and other elements of the query against the text in the searchable fields of each document. However, it would be very inefficient to compare the entire text of the query against the entire text of each field over and over again, so the search system doesn't do this. Instead, it splits the document text into short, significant sections called tokens during the indexing process and stores the tokens as part of the document's index data.
During a search, the query system also tokenizes the query text and then simply compares the tokens from the query against the tokens stored for each document. Finding a match like this is much more efficient than pattern-matching on the whole text and also lets you use stemming and stop words to improve the search even further. See this article about Tokenization for a general introduction to the concepts.
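For example, in a minimal redis-cli sketch (the index name `books-idx` and the key `doc:1` are placeholders), a one-word query matches a document because that word matches one of the document's stored tokens, even though the full field text is quite different:

```
> FT.CREATE books-idx ON HASH PREFIX 1 doc: SCHEMA body TEXT
OK
> HSET doc:1 body "Redis Stack adds full-text search to Redis"
(integer) 1
> FT.SEARCH books-idx "stack" NOCONTENT
1) (integer) 1
2) "doc:1"
```

Note that the query term `stack` matches even though the document text contains the capitalized `Stack`; lowercasing during tokenization is described in the rules below.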
Redis Stack uses a very simple tokenizer for documents and a slightly more sophisticated tokenizer for queries. Both allow a degree of control over string escaping and tokenization.
The sections below describe the rules for tokenizing text fields and queries. Note that Tag fields are essentially text fields but they use a simpler form of tokenization, as described separately in the Tokenization rules for tag fields section.
Tokenization rules for text fields
- All punctuation marks and whitespace (besides underscores) separate the document and queries into tokens. For example, any character of `,.<>{}[]"':;!@#$%^&*()-+=~` will break the text into terms, so the text `foo-bar.baz...bag` will be tokenized into `[foo, bar, baz, bag]` (as shown in the redis-cli sketch after this list).
- Escaping separators in both queries and documents is done by prepending a backslash to any separator. For example, the text `hello\-world hello-world` will be tokenized as `[hello-world, hello, world]`. In most languages you will need an extra backslash to signify an actual backslash when formatting the document or query, so the actual text entered into redis-cli will be `hello\\-world`.
- Underscores (`_`) are not used as separators in either documents or queries, so the text `hello_world` will remain as-is after tokenization.
- Repeated spaces or punctuation marks are stripped.
- Latin characters are converted to lowercase.
- A backslash before the first digit of a number will tokenize the digits as a term, which in turn causes a preceding `-` sign to be interpreted as NOT rather than as a negative sign. Similarly, add a backslash before `.` if you are searching for a float. For example, `-20 -> {-20}` vs `-\20 -> {NOT{20}}`.
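The first two rules above can be seen in action in the following redis-cli sketch (the index and key names are illustrative). The punctuated text is split into four tokens, any one of which matches by itself, while the escaped text is indexed as a single token:

```
> FT.CREATE tok-idx ON HASH PREFIX 1 tok: SCHEMA txt TEXT
OK
> HSET tok:1 txt "foo-bar.baz...bag"
(integer) 1
> FT.SEARCH tok-idx "baz" NOCONTENT
1) (integer) 1
2) "tok:1"
> HSET tok:2 txt "hello\\-world"
(integer) 1
> FT.SEARCH tok-idx "hello\\-world" NOCONTENT
1) (integer) 1
2) "tok:2"
```

Searching for just `world` would not match `tok:2`: because the separator was escaped, the field was indexed as the single token `hello-world` rather than as `hello` and `world`.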
Tokenization rules for tag fields
Tag fields interpret a text field as a list of tags delimited by a separator character (a comma `,` by default). The tokenizer simply splits the text wherever it finds the separator and so most punctuation marks and whitespace are valid characters within each tag token. The only changes that the tokenizer makes to the tags are:
- Trimming whitespace at the start and end of the tag. Other whitespace in the tag text is left intact.
- Converting Latin alphabet characters to lowercase. You can override this by adding the `CASESENSITIVE` option in the indexing schema for the tag field.
This means that when you define a tag field, you don't need to escape any characters, except in the unusual case where you want leading or trailing spaces to be part of the tag text. However, you do need to escape certain characters in a query against a tag field. See Query syntax for more information about this.
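As an illustrative sketch (the index, key, and field names are placeholders), the session below shows both points: the tags are trimmed and lowercased at indexing time with no escaping needed, but the query must escape the space and the hyphen inside the tag filter (with doubled backslashes, again because of redis-cli string formatting):

```
> FT.CREATE genre-idx ON HASH PREFIX 1 book: SCHEMA tags TAG SEPARATOR ","
OK
> HSET book:1 tags "Science Fiction , time-travel"
(integer) 1
> FT.SEARCH genre-idx "@tags:{science\\ fiction}" NOCONTENT
1) (integer) 1
2) "book:1"
> FT.SEARCH genre-idx "@tags:{time\\-travel}" NOCONTENT
1) (integer) 1
2) "book:1"
```

If the schema instead declared the field as `TAG CASESENSITIVE`, the tags would be stored with their original capitalization and the query would have to match it exactly.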