How to Restore a Corrupted Active-Active Replica?

Last updated 18 Apr 2024

Question

One or more replicas may fail in an Active-Active geo-distributed setup (CRDB). Issues may be caused by a cluster failure, by a logical corruption that makes a database replica unusable, or even by human error, such as running the flush or remove-instance commands of the crdb-cli tool.

Answer

Depending on the type of problem, different solutions can be used to reestablish a database replica in a cluster.

Data has been flushed

If data has been cleanly flushed from the CRDB database using the crdb-cli crdb flush command, the only option to recover it is to restore from a backup.
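
For reference, this is the command that performs a clean flush; the GUID is the illustrative one used in the examples below, and once the task finishes the data cannot be recovered without a backup.

crdb-cli crdb flush --crdb-guid b6a68af2-978a-40d4-9eae-c9f471420da8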

Replica removal or purge

Suppose the replica has gone missing because of a manual removal or purge with the remove-instance or purge-instance commands, while a good database replica still exists in another Redis Enterprise cluster. In that case, the solution is straightforward: it is sufficient to recreate the replica in the original cluster, the one from which it was removed. The crdb-cli tool comes to the rescue with the add-instance command, which must be executed from a cluster where a good replica is running, in two steps: collect the crdb-guid identifier, then add the missing instance.
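
The crdb-guid can be collected with crdb-cli crdb list, executed on a cluster that still holds a healthy replica. The output below is only illustrative; the database name, FQDNs, and columns may differ in your environment.

crdb-cli crdb list
CRDB-GUID                             NAME    REPL-ID  CLUSTER-FQDN
b6a68af2-978a-40d4-9eae-c9f471420da8  mycrdb  2        clusteraa2.local

Then, from that same cluster, add the missing instance back to the original cluster (here clusteraa1.local):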

crdb-cli crdb add-instance --crdb-guid b6a68af2-978a-40d4-9eae-c9f471420da8 --instance fqdn=clusteraa1.local,username=redis@redis.com,password=redis
Task 5b30c439-c2c7-4c5c-b86f-8cdcadfafe94 created
---> CRDB GUID Assigned: crdb:b6a68af2-978a-40d4-9eae-c9f471420da8
---> Status changed: queued -> started
---> Status changed: started -> finished

Corruption of a replica

If a database replica suffers from data corruption or other issues whose root cause cannot be fully diagnosed, the situation can be addressed by removing the replica and adding it back, as in the previous scenario. This operation triggers a full or delta synchronization from the healthy replicas. This scenario assumes that the cluster is still operational and the problem is confined to the database level.
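
A minimal sketch of the remove-and-re-add sequence follows. The GUID, instance ID, and credentials are illustrative; the instance ID of the affected replica can be retrieved with crdb-cli crdb get --crdb-guid <guid>, and the commands must be run from a cluster that holds a healthy replica.

crdb-cli crdb remove-instance --crdb-guid b6a68af2-978a-40d4-9eae-c9f471420da8 --instance-id 2
crdb-cli crdb add-instance --crdb-guid b6a68af2-978a-40d4-9eae-c9f471420da8 --instance fqdn=clusteraa1.local,username=redis@redis.com,password=redis

Once the tasks finish, the re-added replica synchronizes its data from the remaining healthy instances.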

Incident affecting one or more clusters

If one or more clusters suffered an incident, but there is still an operational cluster with a replica of the CRDB database, it will be sufficient to recover the affected clusters. This is the recommended solution in the document Recover a failed database:

For Active-Active databases that still have live instances, we recommend that you recover the configuration for the failed instances and let the data update from the other instances.
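
In practice, once the failed cluster itself has been restored, the local instance can be recovered with a configuration-only recovery so that its data is repopulated from the live instances. A minimal sketch, assuming the database ID is db:1 (the exact rladmin syntax may vary between versions; see the reference below):

rladmin recover db db:1 only_configuration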

Major incidents affecting all the clusters

If none of the clusters is operational because of a major incident, the only remedy is to restore a single cluster from backup, as suggested in the documentation.

For Active-Active databases where all instances need to be recovered, we recommend that you recover one instance with the data and only recover the configuration for the other instances. The empty instances then update themselves from the recovered data.
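
A minimal sketch of this recommendation, assuming the clusters themselves have already been recovered and the database ID is db:1 (the exact rladmin syntax may vary between versions):

rladmin recover db db:1
rladmin recover db db:1 only_configuration

The first command, run on the cluster that holds the recoverable data (persistence or backup files), recovers the instance together with its data; the second, run on every other cluster, recovers the configuration only, so those instances can then update themselves from the recovered one.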

Additional considerations

A cluster-wide logical corruption (data is corrupted in one replica and replicated to the rest of the CRDB mesh) requires a full recovery from backup; in this case the only valid solution may be to rebuild the clusters from the last good backup. It is worth remembering that AOF may not protect the data from logical corruption, because editing the file to perform a point-in-time recovery can be difficult. Because of this complication, RDB snapshots are the usual solution. Administrators need to schedule backups according to the acceptable level of risk, keeping in mind that the changes made between the last backup and the corruption may not be recoverable.

Data loss could be minimized by restoring a backup of the AOF files, which may allow restoring to a point in time closer to the corruption.
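
As a quick way to verify which persistence option is configured for the Active-Active database, the full database configuration can be dumped and filtered; the GUID is illustrative and the exact field name may vary between versions.

crdb-cli crdb get --crdb-guid b6a68af2-978a-40d4-9eae-c9f471420da8 | grep -i persistence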

References

Recover a failed database