Redis vs ElastiCache: High availability comparison
When US-EAST-1 goes haywire, everyone feels it. Dashboards light up. Queues back up. Engineers open incident channels and brace for a long night.
For many teams, the problem isn’t just the outage itself. It’s what happens after: failovers that take longer than expected, caches that go out of sync, and data that doesn’t look quite right.
That was the story for teams running Amazon ElastiCache during the most recent regional disruption. A service built for speed and scale struggled to stay consistent when the region it depended on went offline.
Outages like that expose the difference between managed availability and true resilience. Some systems recover. Others never go down in the first place.
What high availability really means
Every vendor talks about uptime, but uptime alone doesn’t tell the whole story. What matters is how quickly a system can fail over, how much data it preserves during that process, and how much intervention is required to get back to normal.
In many cache and database systems, failover still happens in tens of seconds or minutes. Those recovery windows might look small on paper but can disrupt real-time apps, queues, and user sessions across entire regions.
Redis Cloud was built to change that. Its failovers complete in single-digit seconds, even under heavy load. The difference comes from automation and intelligent routing through the Redis Cloud proxy, which seamlessly manages shard transitions behind the scenes. In most cases, clients don’t need to reconnect at all.
If an entire region becomes unavailable, client connections need to shift to the next available region. (More on that in a bit.) But for everyday operations such as planned maintenance, shard replacement, or shard failover, apps keep running without interruption or code changes.
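In practice, most Redis client libraries can be configured to retry transparently on brief connection errors, so a single-digit-second shard failover never surfaces to the app at all. Here’s a minimal sketch using redis-py’s built-in retry support (the endpoint, port, and password are placeholders):

```python
from redis import Redis
from redis.backoff import ExponentialBackoff
from redis.retry import Retry
from redis.exceptions import ConnectionError, TimeoutError

# Retry up to 3 times with exponential backoff so a brief shard
# failover is absorbed by the client instead of raising an error.
retry = Retry(ExponentialBackoff(cap=1.0, base=0.05), retries=3)

r = Redis(
    host="redis-12345.example.cloud.redislabs.com",  # placeholder endpoint
    port=12345,
    password="...",                                   # placeholder credentials
    retry=retry,
    retry_on_error=[ConnectionError, TimeoutError],
)

r.set("session:42", "active", ex=3600)
print(r.get("session:42"))
```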
Availability, in this context, means continuity. Redis Cloud is designed not just to resume quickly but to keep serving requests the entire time.
Durability and data integrity under pressure
High availability keeps your application online. Durability ensures your data survives whatever caused the failover in the first place.
Redis Cloud uses Append-Only File (AOF) persistence and RDB snapshotting to maintain durability during failure and recovery. AOF records every write operation as it happens, creating a continuous log that can be replayed instantly when an instance restarts. Combined with in-memory replication, this approach minimizes the risk of lost data even if a node fails entirely.
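In Redis Cloud, persistence is enabled per database from the console or API. On a self-managed Redis instance, the same mechanics map to configuration directives you can set at runtime, which is a useful way to see what AOF and RDB actually do. A rough sketch, assuming a self-managed instance where CONFIG SET is permitted:

```python
from redis import Redis

r = Redis(host="localhost", port=6379)  # self-managed instance, for illustration only

# Turn on AOF: every write is appended to a log and fsynced once per second.
r.config_set("appendonly", "yes")
r.config_set("appendfsync", "everysec")

# Keep RDB snapshotting as well: snapshot after 1 change in an hour,
# 100 changes in 5 minutes, or 10,000 changes in a minute.
r.config_set("save", "3600 1 300 100 60 10000")

# Trigger a background snapshot on demand.
r.bgsave()
```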
ElastiCache, by contrast, relies only on RDB snapshot-based durability. Snapshots occur at intervals, typically once per hour, which can leave up to 60 minutes of writes unprotected. During a regional disruption or hardware failure, that gap can translate into stale or missing data when systems come back online.
Some workloads can tolerate this. If Redis is used only as a read-through cache for data stored elsewhere, persistence may not be essential for correctness. But it still matters for speed. Without persistence, a restarted cache must repopulate from its source, creating a surge of traffic that can overwhelm downstream databases and extend recovery times. With AOF and RDB, Redis Cloud restarts warm, serving requests immediately and allowing dependent systems to stabilize faster.
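The read-through pattern makes that trade-off concrete: every cache miss falls through to the source database, so a cache that restarts empty turns its entire working set into misses at once. A simplified sketch (the `query_database` function and TTL are placeholders):

```python
import json
from redis import Redis

r = Redis(host="localhost", port=6379)

def query_database(user_id: str) -> dict:
    # Placeholder for the real (and much slower) source-of-truth lookup.
    return {"id": user_id, "plan": "pro"}

def get_user(user_id: str, ttl: int = 300) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # warm cache: served from memory
    user = query_database(user_id)         # cold cache: every miss hits the database
    r.set(key, json.dumps(user), ex=ttl)   # repopulate for subsequent reads
    return user
```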
Many modern Redis deployments also store ephemeral, real-time data such as sessions, queues, state, and buffers that do not exist anywhere else. In those cases, durability makes Redis dependable for maintaining application continuity.
For those apps, and others that handle real-time data, data gaps aren’t acceptable. High availability and data durability have to work together.
The architecture behind real HA
Redis Cloud delivers high availability in multiple layers, starting with automated local failover that keeps apps running through shard-level disruptions. Redis Cloud manages this process transparently, rerouting traffic within a region in single-digit seconds without requiring client reconnection. For most workloads, this level of HA already exceeds what ElastiCache provides.
Side-by-side comparison of Redis Cloud and node-based ElastiCache for standard HA:
| Redis Cloud high availability | ElastiCache high availability |
|---|---|
| Automatic local failover handled by the Redis Cloud proxy | Failover via DNS update; client must reconnect |
| Single-digit-second failover under load | Tens of seconds to minutes, depending on DNS TTL, propagation, and health checks |
| No disruption during failover | Failover depends on connection timeout and retry logic (2-6 second disruption) |
| AOF and RDB persistence available for durability | Snapshot-only persistence (possible data loss between backups) |
For mission-critical applications that demand continuous operation across regions, Redis Cloud also offers Active-Active replication. This configuration uses Conflict-free Replicated Data Types (CRDTs) to keep multiple writable regions synchronized at all times. Even if an entire region becomes unavailable, the others continue processing writes with full consistency while connectivity is restored.
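From the application’s perspective, each deployment simply writes to the Active-Active endpoint in its own region, and CRDT replication reconciles the copies in the background. A conceptual sketch (the regional endpoints and credentials are hypothetical):

```python
from redis import Redis

# Each regional deployment connects to its nearest Active-Active endpoint.
# Both endpoints accept writes; replication reconciles them asynchronously.
us_east = Redis(host="my-db.us-east-1.example.redis.cloud", port=12000, password="...")
eu_west = Redis(host="my-db.eu-west-1.example.redis.cloud", port=12000, password="...")

# Counters behave as CRDTs in Active-Active databases: increments from
# both regions are preserved rather than overwriting each other.
us_east.incrby("page:views", 3)
eu_west.incrby("page:views", 5)

# Once replication catches up, both regions observe the merged total (8).
```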
Active-Active provides the highest level of HA Redis Cloud can offer, but it isn’t required for every workload. It involves additional network coordination and inter-region data transfer, which are factors to consider when designing for global continuity.
By contrast, ElastiCache Global Datastore uses an Active-Passive approach: one writable primary and up to two read-only replicas. Failover requires manually promoting a replica, reestablishing DNS routes, and then repopulating and reattaching the recovered region, with each step adding delay and risk.

Active-Active systems keep every region available, so operations continue without failover. Active-Passive systems depend on a single writable primary. When it fails, replicas must be promoted.
Side-by-side comparison of Redis Cloud Active-Active and ElastiCache Global Datastore:
| Redis Cloud Active-Active | ElastiCache Global Datastore |
|---|---|
| Local latency on reads and writes | Higher write latency |
| Write to local region | Write to primary region only |
| Instant “failover” (all nodes always active) | Manual failover |
| Automatic re-sync after regional recovery | Manual rebuild or re-association required for recovered regions |
| Five-nines (99.999%) uptime SLA | Four-nines (99.99%) uptime SLA |
| Strong eventual consistency | Eventual consistency |
| Read your own writes | Stale reads |
| Built-in conflict resolution | No defined conflict resolution |
| Up to 10 regions | Up to 3 regions |
Consistency and split-brain protection
Redis Cloud’s Active-Active architecture is built on CRDTs, which give it natural split-brain resilience.
In a temporary network partition between regions:
- Each region continues to accept reads and writes locally.
- All writes are recorded as CRDT operations with metadata tracking.
- When connectivity is restored, Redis Cloud automatically merges changes deterministically, making sure all replicas converge to the same consistent state.
- There’s no data loss and no manual conflict resolution.
This design means that even if two regions operate independently for a period, they can reconcile safely once the partition heals. Apps don’t need custom reconciliation logic or manual clean-up.
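As a concrete illustration, suppose both regions add members to the same set while a partition is open. Because sets in Active-Active databases follow add-wins CRDT semantics, the healed set converges to the union of both sides with no application-side reconciliation. A sketch with hypothetical regional endpoints:

```python
from redis import Redis

us_east = Redis(host="my-db.us-east-1.example.redis.cloud", port=12000, password="...")
eu_west = Redis(host="my-db.eu-west-1.example.redis.cloud", port=12000, password="...")

# While the inter-region link is down, each region keeps accepting writes.
us_east.sadd("cart:42", "sku-111", "sku-222")    # written in us-east-1
eu_west.sadd("cart:42", "sku-333")               # written concurrently in eu-west-1

# After the partition heals, CRDT replication merges the operations:
# both regions converge on {"sku-111", "sku-222", "sku-333"} with no
# custom reconciliation code and no lost writes.
```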
ElastiCache Global Datastore, by contrast, doesn’t use CRDTs or bi-directional replication. It’s a single-writer, asynchronous replication model: one primary region handles writes, while secondary regions are read-only. If the primary region becomes unreachable, a secondary region can be manually promoted to become the new primary. The original primary is effectively abandoned; it remains isolated, and any writes that occurred during the partition are lost. Replication doesn’t resume automatically, and the old primary must be manually recovered and re-added to a new global datastore configuration.
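For comparison, that promotion is an explicit operator action, roughly along these lines with boto3 (the datastore ID, replication group ID, and regions are placeholders, and DNS and application reconfiguration still follow):

```python
import boto3

client = boto3.client("elasticache", region_name="us-west-2")

# Manually promote the us-west-2 replication group to primary for the
# global datastore; writes made to the old primary during the partition
# are not recovered.
client.failover_global_replication_group(
    GlobalReplicationGroupId="my-global-datastore",     # placeholder
    PrimaryRegion="us-west-2",
    PrimaryReplicationGroupId="my-replica-group-west",  # placeholder
)
```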
So, while Redis Cloud treats partitions as temporary states that self-heal, Global Datastore treats them as failures that require human intervention.
Always available, by design
When US-EAST-1 goes haywire, caching might seem like the least of your problems. But it quickly becomes clear how much everything depends on it. Authentication tokens, cached database queries, message queues, user sessions, and API responses often live in cache layers for speed and scale.
When ElastiCache stalls, those dependencies start to surface. Services that look healthy on their own begin to slow or fail because the data they rely on isn’t being refreshed.
Failing over to another region doesn’t solve it, because ElastiCache replicas remain read-only until manually promoted, and any data written to the isolated primary after the partition is permanently lost.
To get caching back, teams have to promote a secondary, update DNS records, reconfigure connection strings, restart affected apps, and verify data consistency before resuming normal operations.
When US-EAST-1 finally comes back online, there’s another decision to make: rebuild ElastiCache in its original location or keep operating from the failover region. Re-establishing replication and confirming data consistency adds another long list of tasks to the recovery effort. For a moderately complex application with dozens of microservices, this can mean hours of coordinated recovery work across multiple teams.
With Redis Cloud, the same event is handled automatically. If shards or nodes fail, Redis Cloud detects the failure and reroutes traffic to a newly promoted primary in seconds. If the entire region is offline, Active-Active databases in other regions continue serving reads and writes without interruption. When the region recovers, Redis automatically synchronizes changes across all regions using CRDT-based replication, ensuring data integrity without manual steps.
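On the client side, shifting to another region during a full regional outage can be as simple as falling back to the next Active-Active endpoint, since every region accepts writes. A sketch with hypothetical endpoints:

```python
from redis import Redis
from redis.exceptions import ConnectionError, TimeoutError

# Ordered by preference: local region first, other Active-Active regions after.
ENDPOINTS = [
    ("my-db.us-east-1.example.redis.cloud", 12000),
    ("my-db.us-west-2.example.redis.cloud", 12000),
    ("my-db.eu-west-1.example.redis.cloud", 12000),
]

def connect() -> Redis:
    """Return a connection to the first reachable Active-Active region."""
    for host, port in ENDPOINTS:
        try:
            r = Redis(host=host, port=port, password="...", socket_timeout=2)
            r.ping()   # verify the region is actually reachable
            return r
        except (ConnectionError, TimeoutError):
            continue   # region unreachable; try the next one
    raise RuntimeError("no Redis Cloud region reachable")

r = connect()
r.set("checkout:state:42", "pending")
```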
When a region goes dark, you already have enough problems to recover from. Redis Cloud makes sure caching isn’t one of them.
Next steps
Ready to migrate from ElastiCache? See our guide on three proven migration strategies.
Read more about multi-cloud deployments to protect against cloud outages and lock-in: Redis vs. Valkey for multi-cloud deployments
Or book a meeting and talk to a Redis expert about your high availability requirements.
Get started with Redis today
Speak to a Redis expert and learn more about enterprise-grade Redis today.
