This tutorial covers quickly deploying a complete monitoring stack, including Prometheus and Grafana, for your Redis Software instance. We’ll then dive into understanding the metrics, setting up alerting, and expanding your dashboards further.
The tutorial should take about an hour. The target audience is system administrators and DevOps professionals with experience using Redis Software.
This tutorial relies on an existing Redis Software instance. You may follow our quickstart guide for testing environments or our installation documentation for production environments.
The Redis Field Engineering team provides a turnkey solution that sets up Prometheus + Grafana with pre-configured dashboards. You can quickly set it up by running the following commands:
# Clone the repository
git clone https://github.com/redis-field-engineering/redis-enterprise-observability.git
# Navigate to the v2 kickstart directory
cd redis-enterprise-observability/grafana_v2/kickstart_v2
# Run the setup script with your cluster FQDN, the dashboard directory, and password
./setup.sh your-cluster-fqdn.example.com ../dashboards/grafana_v9-11/software/basic very-secret-password

You must use very-secret-password in the setup script above. If you use a different password, the script won't run properly. This password is configured in the docker-compose.yml file.
This script automatically:
After the setup completes:
1. Check that Prometheus is collecting metrics:
2. View your dashboards in Grafana:
To see the metrics as they start to come in, you may want to make the time range smaller in the top right of the dashboard, e.g. “Last 5 Minutes”.
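If you want to confirm that metrics are flowing before the dashboards populate, you can run a simple query in the Prometheus console. The two queries below are a minimal sanity check using metrics covered later in this tutorial; they assume the exporters are up and your database is receiving read traffic.

# Node health indicator; should return one series per node once scraping works
node_metrics_up
# Total read requests seen by database endpoints; non-zero under read traffic
sum(endpoint_read_requests)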
This setup uses the Redis Software Metrics Stream Engine. The table below compares the original metrics v1 with metrics v2.
| Feature | Metrics v1 | Metrics v2 |
|---|---|---|
| Precision | Millisecond averages | Microsecond histograms |
| Real-time | Snapshot-based | Stream-based |
| Maintenance visibility | Limited | Full visibility during failovers |
| Required version | Any | v7.8.2+ |
Why v2 matters: Real-time monitoring with sub-millisecond precision and visibility during all operations, including maintenance windows.
V1 approach (snapshot-based):
# V1 provided pre-calculated averages
bdb_avg_latency
V2 approach (stream-based with PromQL):
# V2 calculates percentiles from histogram data (divide by 1000 to convert microseconds to milliseconds)
histogram_quantile(0.95, sum by (le, db) (irate(endpoint_read_requests_latency_histogram_bucket[1m]))) / 1000
The metrics take a couple of minutes to appear in Prometheus. If you're seeing no data and you're sure you're getting read requests, wait a few minutes and refresh, or increase the query's time window, e.g. 1m -> 5m.
The v2 metrics stream engine provides finer-grained control over metric queries, including the ability to filter or aggregate results by quantile — for example, extracting the p95 (95th percentile) latency metric for specific operations.
This update gives you greater flexibility and precision in querying metrics, leveraging PromQL aggregation functions for powerful custom analysis.
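For example, the same histogram can be queried at a different quantile, or narrowed to a single database by filtering on the db label. The db value below is a placeholder for illustration.

# 99th percentile read latency in milliseconds for a single database
# (replace db="1" with your own database ID)
histogram_quantile(0.99, sum by (le, db) (irate(endpoint_read_requests_latency_histogram_bucket{db="1"}[1m]))) / 1000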
For a side-by-side comparison of queries when moving from v1 to v2, check out our Transition from Prometheus v1 to Prometheus v2 documentation.
Let's dive into what the key metrics mean and how to interpret dashboard data for operational decision-making.
Your monitoring stack includes several pre-built dashboards. Below are three important ones to get familiar with, and what to focus on in each:
Cluster status dashboard - Your starting point for cluster health:
Database status dashboard - Application-focused metrics:
Node dashboard - Infrastructure details:
Memory utilization can be calculated using the following v2 metrics in Prometheus or Grafana.
# Memory utilization percentage per shard
avg by (cluster,db,redis)(redis_server_used_memory) / avg by (cluster,db,redis)(redis_server_maxmemory) * 100
# Database-level memory utilization (aggregated across shards)
sum by (cluster,db)(redis_server_used_memory{role="master"}) / (avg by(cluster,db)(db_memory_limit_bytes) / max by(cluster,db)(db_replication_factor))
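As a sketch of how you might turn the shard-level query into a quick check, the expression below returns only the shards above an example 80% threshold; the threshold is an arbitrary value to adapt to your environment.

# Shards currently above 80% of their configured maxmemory (example threshold)
avg by (cluster,db,redis)(redis_server_used_memory)
  / avg by (cluster,db,redis)(redis_server_maxmemory) * 100 > 80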
What to look for:
Latency performance can be calculated using the following metrics in Prometheus or Grafana.
# 95th percentile read latency (milliseconds)
histogram_quantile(0.95, sum by (le, db) (irate(endpoint_read_requests_latency_histogram_bucket[1m]))) / 1000
# 95th percentile write latency (milliseconds)
histogram_quantile(0.95, sum by (le, db) (irate(endpoint_write_requests_latency_histogram_bucket[1m]))) / 1000
# Combined 95th percentile latency for all operations (milliseconds)
histogram_quantile(0.95, sum by (le, db) (
irate(endpoint_read_requests_latency_histogram_bucket[1m]) +
irate(endpoint_write_requests_latency_histogram_bucket[1m]) +
irate(endpoint_other_requests_latency_histogram_bucket[1m])
)) / 1000
Example performance targets:
These targets are based on typical Redis Software performance. Your specific thresholds may vary based on network, hardware, and app requirements.
Redis Software monitors CPU at three levels:
Since Redis shards are single-threaded, high shard CPU utilization often indicates hot keys or data distribution problems, while high proxy CPU utilization suggests connection issues. You'll need to establish appropriate thresholds based on your specific environment and performance requirements.
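As a rough sketch of how to spot an overloaded shard, the query below surfaces the busiest process groups by CPU. It assumes the process exporter bundled with the setup above groups Redis shard processes under a groupname label, which is an assumption about the kickstart configuration.

# Top 5 process groups by CPU; an outlier here while other shards stay idle
# often points at a hot key or skewed data distribution (groupname label assumed)
topk(5, sum by (groupname) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode=~"system|user"}[1m])))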
Hot keys - One shard shows high CPU utilization while others are idle:
Large keys - High network utilization with CPU spikes:
Slow operations - Commands taking excessive time:
For caching workloads, monitor:
# Cache read hit ratio (database level), can also be done for writes
(
sum by (db) (irate(redis_server_keyspace_read_hits{role="master"}[1m])) /
(sum by (db) (irate(redis_server_keyspace_read_hits{role="master"}[1m])) +
sum by (db) (irate(redis_server_keyspace_read_misses{role="master"}[1m])))
) * 100
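The write-side equivalent is a minimal variation of the same query, assuming your version exposes the corresponding redis_server_keyspace_write_hits and redis_server_keyspace_write_misses metrics.

# Cache write hit ratio (database level), assuming the write hit/miss metrics are available
(
  sum by (db) (irate(redis_server_keyspace_write_hits{role="master"}[1m])) /
  (sum by (db) (irate(redis_server_keyspace_write_hits{role="master"}[1m])) +
   sum by (db) (irate(redis_server_keyspace_write_misses{role="master"}[1m])))
) * 100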
Suggested target ranges:
Resources:
If you're looking to dive deeper, the following resources explain these metrics in greater detail.
We can also set up intelligent alerts that notify you before issues impact your applications. The setup script that we ran earlier added Prometheus Alertmanager and populated it with Redis-specific alert rules.
There are three primary suggested areas of monitoring and alerting:
Alerts will appear in the Prometheus console under the Alerts tab.
The alerts used for this tutorial can be found at redis-enterprise-observability/prometheus_v2/rules/alerts.yml. If you'd like to test edits, you'll need to bring down the dashboard Docker container and run the setup script again.
Here's an example alert for high latency:
# Example: High read latency alert
- alert: RedisHighReadLatency
  expr: histogram_quantile(0.95, sum by (le, db) (irate(endpoint_read_requests_latency_histogram_bucket[1m]))) / 1000 > 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Redis database {{ $labels.db }} has high read latency"
    description: "95th percentile read latency is {{ $value }}ms"
Once you have an alert like the one above, add its file name to the rule_files section of the prometheus.yml file so Prometheus loads it; the alert then appears in the Prometheus console under the Alerts tab.
You can also configure Alertmanager for your notification channels, such as Slack, PagerDuty, email, etc. Here's an example of what the receiver configuration might look like:
# alertmanager.yml example
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#redis-alerts'
        title: 'Redis Software Alert'
        text: 'Alert: {{ .GroupLabels.alertname }}'
  - name: 'email-alerts'
    email_configs:
      - to: 'alerts@example.com'
        headers:
          Subject: 'Redis Software Alert'
Prevent false positives:
- Use a for: duration on each alert (2-5 minutes for most alerts)
- Use avg_over_time() to smooth noisy metrics, as shown below
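As a sketch of the avg_over_time() idea, the expression below smooths a gauge over ten minutes before comparing it to a threshold; the metric, the window, and the 20 GiB value are example choices, not recommendations.

# Alert only when the 10-minute average of available memory drops below ~20 GiB
# (both the window and the threshold are example values)
avg_over_time(node_available_memory_bytes[10m]) < 20 * 1024 * 1024 * 1024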
Resources:
To dive deeper, check out the following resources on alerting:
Lastly, let’s get familiar with what's possible with custom dashboards and v2 metrics for specialized monitoring needs.
Redis Software v2 metrics (available in v7.8.2+) provide comprehensive monitoring capabilities. Note: v2 metrics are currently in preview, and only a partial list of metrics is available so far.
You can test some of these queries in Prometheus to see their outputs.
Database endpoint monitoring:
# Client connection tracking (use rates to make these counters meaningful)
irate(endpoint_client_connections[1m])
irate(endpoint_client_disconnections[1m])
irate(endpoint_client_connection_expired[1m])
irate(endpoint_client_establishment_failures[1m])
# Number of active connections to Redis database
endpoint_client_connections - endpoint_client_disconnections - endpoint_proxy_disconnections
# Request rates by type
irate(endpoint_read_requests[1m])
irate(endpoint_write_requests[1m])
Node resource monitoring:
# Available system resources
node_available_flash_bytes
node_available_memory_bytes
# Node health and certificate monitoring
node_metrics_up
node_cert_expires_in_seconds
# Network throughput
irate(node_network_receive_bytes_total[1m])
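For example, the certificate metric above lends itself to a simple expiry check; the 30-day window below is an arbitrary example value.

# Certificates expiring within the next 30 days (window is an example value)
node_cert_expires_in_seconds < 30 * 24 * 3600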
Cluster status tracking:
# Cluster health indicators
generation
has_quorum
is_primary
Replication monitoring:
# Replication data flow
irate(database_syncer_ingress_bytes[1m])
# Sync status tracking
database_syncer_current_status{syncer_type="replicaof"}
Redis shard performance:
# Memory and processing
redis_server_used_memory
irate(redis_server_total_commands_processed[5m]) # Commands per second
# Shard health
redis_server_up
# CPU usage (via the process exporter)
irate(namedprocess_namegroup_thread_cpu_seconds_total{mode=~"system|user"}[1m])
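A small example of combining these shard metrics: total commands per second per database, counting primary shards only so replicas aren't double-counted. This is a sketch based on the labels used earlier in this tutorial.

# Commands per second per database, primary shards only
sum by (db) (irate(redis_server_total_commands_processed{role="master"}[5m]))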
Grafana provides several powerful features for creating sophisticated Redis Software dashboards:
Resources:
Throughout this tutorial, you:
Official docs:
Learning & certification:
Community support:
Production considerations:
Advanced integrations:
Your Redis Software observability foundation is now in place. The monitoring stack can grow with your deployment and provide continuous insights into your Redis operations.
Reach out to the Redis team if you're looking for help expanding further.