How to improve the performance of searches over large datasets?
Last updated: 22 Mar 2024
Question
How to improve the performance of searches over large datasets?
Answer
When using FT.SEARCH with sorting and paging (LIMIT <offset> <count>), a new search query is executed every time, and the full result set is gathered and then truncated to the requested size. This can be inefficient over large datasets.
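To see why, here is a minimal plain-Python sketch (no Redis involved; the data and names are illustrative, not real index contents) of what offset-based paging does: every page repeats the full search-and-sort and discards everything before the offset.

```python
# Illustrative stand-in for an indexed dataset (not real Redis data).
records = [{"id": i, "datetime": 1619290184 + i} for i in range(10_000)]

def search_page(offset, count):
    """Mimics FT.SEARCH ... SORTBY datetime DESC LIMIT <offset> <count>:
    the full result set is rebuilt and sorted on every call, and only
    the requested slice is kept."""
    ranked = sorted(records, key=lambda r: r["datetime"], reverse=True)
    return ranked[offset:offset + count]

page1 = search_page(0, 50)    # first page: full sort, keep rows 0-49
page2 = search_page(50, 50)   # second page: full sort again, keep rows 50-99
```

The cost of the repeated sort grows with the dataset, not with the page size, which is what hurts at scale.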
When using FT.AGGREGATE with a similar query and the cursor API, the query results are constructed once, on the initial query, and a cursor ID is returned that can be read later with FT.CURSOR READ <index> <cursor-id>. Subsequent calls return more and more results from the same query until the cursor is exhausted. This helps with larger datasets.
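The cursor model can be sketched the same way (plain Python, illustrative names only, no Redis involved): the expensive sort happens once when the cursor is opened, and each subsequent read just hands back the next chunk.

```python
# Illustrative stand-in for an indexed dataset (not real Redis data).
records = [{"id": i, "datetime": 1619290184 + i} for i in range(10_000)]

def open_cursor(count):
    """Mimics FT.AGGREGATE ... WITHCURSOR COUNT <count>: the result set is
    materialized once, then FT.CURSOR READ-style calls yield successive
    chunks until the cursor is exhausted."""
    ranked = sorted(records, key=lambda r: r["datetime"], reverse=True)
    for i in range(0, len(ranked), count):
        yield ranked[i:i + count]

cursor = open_cursor(50)
first_batch = next(cursor)    # rows 0-49
second_batch = next(cursor)   # rows 50-99, served without re-sorting
```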
Make sure LOAD is not used with FT.AGGREGATE commands, as it hurts the performance of aggregated queries.
Using FT.SEARCH
FT.CREATE po ON JSON PREFIX 1 "original:" SCHEMA "$.datetime" as datetime NUMERIC SORTABLE
FT.SEARCH po "@datetime:[1619290184.788929 1682362184.788929]" WITHSCORES SORTBY datetime DESC LIMIT 0 50
Subsequent queries:
FT.SEARCH po "@datetime:[1619290184.788929 1682362184.788929]" WITHSCORES SORTBY datetime DESC LIMIT 50 50
Using FT.AGGREGATE
FT.CREATE pp ON JSON PREFIX 1 "proposed:" SCHEMA "$.datetime" as datetime NUMERIC SORTABLE
FT.AGGREGATE pp "@datetime:[1619290321.959789 1682362321.959789]" SORTBY 2 @datetime DESC MAX 1000 WITHCURSOR COUNT 50
Subsequent reads are performed with the cursor-id returned by the initial query, fetching additional result sets from the same query:
FT.CURSOR READ pp <cursor-id> COUNT 50
Test both approaches when dealing with large datasets and evaluate the average latency, CPU usage, and throughput.
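As a rough proxy for that comparison, the sketch below (plain Python, illustrative only) counts how many full sorts each strategy performs to serve the same ten pages; in a real benchmark, that sort count is what shows up as the latency and CPU difference.

```python
# Illustrative stand-in for an indexed dataset (not real Redis data).
records = [{"id": i, "datetime": 1619290184 + i} for i in range(10_000)]
sort_calls = {"offset": 0, "cursor": 0}

def offset_paging(pages, count):
    """One full sort per page, as with repeated FT.SEARCH ... LIMIT calls."""
    out = []
    for p in range(pages):
        sort_calls["offset"] += 1
        ranked = sorted(records, key=lambda r: r["datetime"], reverse=True)
        out.append(ranked[p * count:(p + 1) * count])
    return out

def cursor_paging(pages, count):
    """One sort total, as with FT.AGGREGATE ... WITHCURSOR + FT.CURSOR READ."""
    sort_calls["cursor"] += 1
    ranked = sorted(records, key=lambda r: r["datetime"], reverse=True)
    return [ranked[i * count:(i + 1) * count] for i in range(pages)]

same_pages = offset_paging(10, 50) == cursor_paging(10, 50)  # identical output
```

Both strategies return the same pages; the offset approach simply pays for the sort ten times instead of once.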
References
Refer to the documentation to learn more about aggregations.