How to improve the performance of searches over large datasets?

Last updated 22 Mar 2024

Question

How to improve the performance of searches over large datasets?

Answer

When using FT.SEARCH with sorting and paging (LIMIT <offset> <count>), the full search query is executed on every call: results are gathered, sorted, and then truncated to the requested page. Re-running the query for each page becomes expensive over large datasets.

When using FT.AGGREGATE with a similar query and the cursor API, the result set is constructed once, on the initial query, and a cursor ID is returned that can be read later with FT.CURSOR READ <index> <cursor-id>. Subsequent calls return further results from the same query until the cursor is exhausted. This scales better to larger datasets.

Avoid LOAD with FT.AGGREGATE commands where possible, as it hurts the performance of aggregated queries.

Using FT.SEARCH

FT.CREATE po ON JSON PREFIX 1 "original:" SCHEMA "$.datetime" as datetime NUMERIC SORTABLE
FT.SEARCH po "@datetime:[1619290184.788929 1682362184.788929]" WITHSCORES SORTBY datetime DESC LIMIT 0 50

Subsequent queries:

FT.SEARCH po "@datetime:[1619290184.788929 1682362184.788929]" WITHSCORES SORTBY datetime DESC LIMIT 50 50
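A client driving this paging programmatically just advances the LIMIT offset by the page size on each call. The sketch below illustrates the offset arithmetic only; page_limits is a hypothetical helper, not part of any Redis client library, and the index name po and page size of 50 are taken from the example above.

```python
# Offset-based paging: each (offset, count) pair below corresponds to one
# full FT.SEARCH execution on the server, e.g.
#   FT.SEARCH po "@datetime:[...]" SORTBY datetime DESC LIMIT <offset> <count>

def page_limits(page_size, max_results):
    """Yield (offset, count) pairs for successive LIMIT clauses."""
    for offset in range(0, max_results, page_size):
        yield offset, page_size

print(list(page_limits(50, 150)))  # [(0, 50), (50, 50), (100, 50)]
```

Note that the server repeats the whole search and sort for every one of these pages, which is exactly the cost the cursor API avoids.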

Using FT.AGGREGATE

FT.CREATE pp ON JSON PREFIX 1 "proposed:" SCHEMA "$.datetime" as datetime NUMERIC SORTABLE
FT.AGGREGATE pp "@datetime:[1619290321.959789 1682362321.959789]" SORTBY 2 @datetime DESC MAX 1000 WITHCURSOR COUNT 50

Subsequent reads are performed with the cursor ID, which is returned from the initial query and fetches the next batch of results from the same query.

FT.CURSOR READ pp <cursor-id> COUNT 50
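The read loop can be sketched in Python. This is a minimal illustration, not an official client API: client stands for any object exposing an execute_command method (redis-py provides one), and StubClient is a hypothetical stand-in that simulates server replies so the loop runs without a live Redis instance.

```python
# Drain an FT.AGGREGATE cursor until exhaustion (cursor ID 0 means done).

def read_all(client, index, query, count=50):
    """Run FT.AGGREGATE WITHCURSOR, then FT.CURSOR READ until the cursor is 0."""
    results, cursor = client.execute_command(
        "FT.AGGREGATE", index, query, "WITHCURSOR", "COUNT", count)
    rows = list(results[1:])  # first element of the reply is the result count
    while cursor != 0:
        results, cursor = client.execute_command(
            "FT.CURSOR", "READ", index, cursor, "COUNT", count)
        rows.extend(results[1:])
    return rows

class StubClient:
    """Simulates a server returning two batches, then an exhausted cursor."""
    def __init__(self):
        self.batches = [([2, "row1", "row2"], 99), ([1, "row3"], 0)]
    def execute_command(self, *args):
        return self.batches.pop(0)

print(read_all(StubClient(), "pp", "@datetime:[0 +inf]"))
# ['row1', 'row2', 'row3']
```

Against a live server the reply shape is the same pair (a results array followed by the cursor ID), so only the stub would change; the sorting and batching work stay on the server, done once per query rather than once per page.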

Benchmark both approaches against your own large datasets and compare average latency, CPU usage, and throughput.

References

Refer to the documentation to learn more about aggregations.