Considerations about Consistency and Data Loss in a CRDB Regional Failure

Last updated 18, Apr 2024

Question

What amount of data could be lost in a regional outage, and what other consistency issues may occur using an Active-Active Geo Distributed database?

Terminology

Critical path. In project management, the critical path is the longest sequence of tasks that must be completed to complete a project. DR. Disaster recovery (DR) is an organization's ability to respond to and recover from an event that negatively affects business operations. RTO. The Recovery Time Objective is the amount of time after a disaster when business operation is retaken or resources are again available for use.

Answer

Assuming there is no data loss in the replication between the master and replica shards within a CRDB replica, the concern should not be on data loss but on data consistency since it's possible that some change that was committed before the failure hasn't been reconciled to the other CRDB replica/s. Since it will eventually be reconciled, it is not a function of data loss.

If AOF persistence is not enabled on a CRDB replica, those changes that didn't get replicated to other CRDB replicas will be lost if the replica experiences an outage: recovery is executed from the latest backup plus the changes that are synchronized from other CRDB replicas. In such a case, the amount of lost data is a function of the network latency and replication speed, and the data size (and probably data types) to estimate how much data can be lost.

As a side note, even with AOF persistence enabled, there may be a loss of data:

  • In a CRDB replica, persistence is executed by replica shards (default configuration), and replication to replica shards is asynchronous, so there is a delay between the command execution and the time data is safely stored on the disk
  • Append Only File, usually, is configured to perform a fsync to disk every second, which adds up one more second to the data loss

Customers using CRDB for DR should be monitoring the replication lag between the clusters, and stress testing to understand lag at the expected peak load. It is fine to say that the data is not lost on switching to the DR location, but if the primary location is down for an "extended" period, for some applications, the un-replicated data in the original location is as good as lost. The latency between regions may be in the order of seconds (1, 2..). For shopping cart applications this means that the last few items added to shopping carts may go missing.

Recovery objectives

The users set recovery objectives and are typically part of a larger scope than just one cache/database. In the scenario where a session cache recovers in 5 minutes, and the transactional data system of records (SOR) recovers in 2 hours, the RTO of the session cache would be irrelevant. It would only matter if it was part of the critical path. CRDB databases can be used to replicate data to a DR site, but how long it would take for the application to failover to the DR CRDB replica is mostly up to the customer. Think of CRDB as support for business continuity instead of DR. The industry is coming around to the differences between these terms but it is still common for everything to be referred to as DR. This is not entirely correct since a CRDB deployment can avoid a recovery altogether.

References

Refer to the section "High availability and disaster recovery" in this article.