What is the Redis Enterprise Cooldown Feature?

Last updated 18 Apr 2024

Question

Why do Redis Enterprise clusters and databases have a cooldown period? The drawback of the cooldown is that if a second node fails while it is in effect, a master shard is left without a replica until the cooldown ends. So, what problem does the cooldown avoid?

Answer

As documented:

Both the cluster and the database have cooldown periods. After node failure, the cluster cooldown period prevents another replica migration due to another node failure for any database in the cluster until the cooldown period ends (default: one hour).

The cooldown is a protection mechanism against cascading failures. Consider a workload that causes node-level problems: one node has already failed, its replica shards have been reconfigured on other nodes, and the same workload then brings down an additional node. If new replica shards came up immediately, their massive resource usage would quickly kill the node they came up on. In a three-node cluster, this would quickly lead to quorum loss.

For this reason, the cooldown period is set to one hour by default.
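
The gating behavior described above can be illustrated with a minimal sketch. This is not Redis Enterprise's implementation; the function name `migration_allowed` and its parameters are hypothetical and only model the idea that a new replica migration is refused until the cooldown period since the previous one has elapsed.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default cooldown of one hour, as stated in this article.
DEFAULT_COOLDOWN = timedelta(hours=1)


def migration_allowed(
    last_migration: Optional[datetime],
    now: datetime,
    cooldown: timedelta = DEFAULT_COOLDOWN,
) -> bool:
    """Hypothetical cooldown gate: True if a new replica migration may start.

    last_migration is the time of the previous migration, or None if
    no migration has happened yet.
    """
    if last_migration is None:
        return True
    # Refuse further migrations until the cooldown period has ended.
    return now - last_migration >= cooldown


first_failure = datetime(2024, 4, 18, 12, 0)
# A second node failure 10 minutes later does not trigger a migration.
print(migration_allowed(first_failure, first_failure + timedelta(minutes=10)))
# Once the hour has passed, migrations are allowed again.
print(migration_allowed(first_failure, first_failure + timedelta(hours=1)))
```

In this sketch, the second failure inside the cooldown window leaves the affected master shards without a replica, which is the trade-off raised in the question.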

To reduce the risk of quorum loss and a consequent cluster outage, you can deploy a five-node cluster instead of a three-node cluster. A five-node cluster still has a quorum even after two node failures.
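
The arithmetic behind this recommendation can be checked with a short sketch, assuming quorum means a strict majority of the cluster's nodes. The helper names `quorum_size` and `tolerated_failures` are illustrative and not part of any Redis tool.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """Number of nodes that can fail while a majority remains."""
    return total_nodes - quorum_size(total_nodes)


for n in (3, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
# 3 nodes: quorum=2, tolerated failures=1
# 5 nodes: quorum=3, tolerated failures=2
```

Under this assumption, a three-node cluster loses quorum on the second node failure, while a five-node cluster keeps it.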

References

Refer to the Cooldown periods section to learn how to configure the cooldown feature.