Redis troubleshooting pocket guide
Last updated: 18 Apr 2024
Symptoms
Latency issues or other problems, or as a routine health check.
Changes
Configuration changes to the software or to the system, as well as changes in the workload or dataset size, may introduce latency.
Identify issues on Redis hosts
- Check that disk space is not excessively consumed using "df -h". Check that the size of the log directory has not grown using "du -sh /var/opt/redislabs/log/", then proceed to check other possible causes.
- Check that RAM and CPU are not excessively consumed. RAM and CPU utilization should not exceed 80%. The host resources must be available exclusively to the Redis software.
- Verify that swap memory is either not in use or not configured using "free".
- The host clock should be in sync with a time server. Verify using "timedatectl", "ntpq -p", or "chronyc sources".
- Check the output of "env" and remove the https_proxy/http_proxy variables if they exist: "unset https_proxy".
- Review system logs, including the syslog or journal, for any error messages, warnings, or critical events.
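The RAM and swap checks in the list above can be scripted. The following is a minimal sketch, assuming bash and the column layout printed by recent procps versions of "free -m" (a "Mem:" row and a "Swap:" row with the total in column 2 and the used amount in column 3). The function name check_memory and its optional limit argument are illustrative and are not part of any Redis tooling.

```shell
# Hypothetical helper: reads "free -m" output on stdin and flags RAM usage
# at or above a limit (default 80, per the guideline above) and any swap use.
# Assumes the modern "free" layout: total in column 2, used in column 3.
check_memory() {
  local limit="${1:-80}"
  awk -v limit="$limit" '
    $1 == "Mem:" {
      pct = ($2 > 0) ? int($3 * 100 / $2) : 0
      printf "RAM used: %d%% (%s)\n", pct, (pct >= limit + 0 ? "WARN" : "ok")
    }
    $1 == "Swap:" {
      printf "Swap used: %d MB (%s)\n", $3, ($3 > 0 ? "WARN" : "ok")
    }
  '
}

# Example:
#   free -m | check_memory 80
```

Older versions of "free" print an extra "-/+ buffers/cache" row and count caches as used memory, so the percentage would need adjusting there.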
Identify potential issues caused by security hardening
- Temporarily disable any security or hardening software and check whether the problem goes away. Examples: SELinux, Cylance, McAfee, Dynatrace, ...
- The Linux user "redislabs" must have read/write access to the /tmp folder. Verify using "su - redislabs -s /bin/bash -c 'touch /tmp/test'".
- A non-permissive umask can cause issues. If the umask differs from the default 022, it might prevent normal operation. Consult your sysadmin and revert to the default umask.
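The umask comparison above can be sketched as a small bash function. This is an illustration only: the name is_default_umask is hypothetical, and it relies on bash's base#number arithmetic so that "022" and "0022" are treated as the same octal value.

```shell
# Hypothetical helper: succeeds when the given umask equals the default 022.
# Accepts "022" or "0022"; returns 2 if the argument is not a 3- or 4-digit
# octal string.
is_default_umask() {
  [[ "$1" =~ ^[0-7]{3,4}$ ]] || return 2
  [ "$((8#$1))" -eq "$((8#022))" ]
}

# Example, checking the umask seen by the redislabs user:
#   is_default_umask "$(su - redislabs -s /bin/bash -c 'umask')" \
#     || echo "umask differs from 022"
```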
Identify Redis cluster issues
- Execute "supervisorctl status" and verify that all processes are in a RUNNING state.
- Execute "rlcheck" and verify that no errors appear.
- Execute "rladmin status issues_only" and verify that no issues appear.
- Execute "rladmin status shards" and verify that the used memory of shards participating in the same database is balanced and that no shard exceeds 25 GB.
- Execute "rladmin cluster running_actions" and verify that no tasks appear.
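On nodes with many supervised processes, the first check in the list above can be filtered automatically. A minimal sketch, assuming the usual "supervisorctl status" layout in which the process name is the first column and the state is the second; the function name not_running is hypothetical.

```shell
# Hypothetical helper: reads "supervisorctl status" output on stdin, prints
# the name and state of every process whose state is not RUNNING, and
# returns non-zero if at least one such process was found.
not_running() {
  awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Example:
#   supervisorctl status | not_running && echo "all processes RUNNING"
```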
Troubleshooting connectivity
- Check whether the Redis endpoint can be resolved on the client machine: "dig <endpoint>". If the resolution fails, check whether the Redis endpoint can be resolved on one of the cluster nodes: "dig @localhost <endpoint>". If that resolution succeeds, the problem is with the organizational DNS.
- To identify any issue with the client app, check connectivity from the client machine to the database using redis-cli: "redis-cli -h <endpoint> -p <port> -a <password> info" or "redis-cli -h <endpoint> -p <port> -a <password> --tls --insecure --cert <cert file> --key <key file> ping". If that fails, check connectivity to the database using redis-cli from one of the cluster nodes. If that fails, the issue is with the network. Consult your sysadmin.
- Verify that the client connects using the database name and not an IP address.
- Verify that the database is configured with an eviction policy and key expiration to avoid out-of-memory (OOM) conditions.
- Verify that access to the database is not blocked by a firewall on the client side or the Redis side: "iptables -L", "ufw status", "firewall-cmd --list-all".
- Additional details can be found in the related document about testing client connections.
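The redis-cli ping check above is easy to wrap so that it can run first from the client machine and then from a cluster node. A minimal sketch in bash: the function name replies_pong is hypothetical, and the placeholders are the same ones used in the guide.

```shell
# Hypothetical helper: runs the command given as arguments and succeeds only
# if its standard output is exactly "PONG". stderr is discarded so that the
# warning redis-cli prints when -a is used does not interfere.
replies_pong() {
  local reply
  reply="$("$@" 2>/dev/null)" || return 1
  [ "$reply" = "PONG" ]
}

# Example (substitute real values for the placeholders):
#   replies_pong redis-cli -h <endpoint> -p <port> -a <password> ping \
#     || echo "no PONG received from this machine"
```

Passing the whole command as arguments keeps the helper independent of whether the TLS options are needed.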
Troubleshooting latency
Server-side
- Ensure that the memory used in the database does not reach the configured database max memory limit. More details can be found in the document about database memory limits.
- Try to correlate the time of the latency with any surge in the following metrics:
  - number of connections
  - used memory
  - evicted keys, expired keys
- Check the output of "slowlog get <number of entries to display>" for slow commands such as KEYS or HGETALL. Use alternative commands: SCAN, SSCAN, HSCAN, ZSCAN.
- Keys with large memory footprints can cause latency. To identify these keys, compare the key names that appear in the output of "slowlog get" with the big keys reported by the following commands:
  redis-cli -h <endpoint> -p <port> -a <password> --memkeys
  redis-cli -h <endpoint> -p <port> -a <password> --bigkeys
- Additional diagnostic steps can be found in the following links:
  https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/latency/
  https://redis.io/docs/latest/operate/rs/clusters/logging/redis-slow-log/
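For the first server-side check (used memory versus the configured limit), the ratio can be computed from the output of the INFO command. This is a rough sketch that assumes the output contains "used_memory" and "maxmemory" fields, as in open source Redis; whether a given database endpoint reports both fields is an assumption to verify. The function name memory_pct is hypothetical.

```shell
# Hypothetical helper: reads "INFO memory" output on stdin and prints used
# memory as a whole percentage of maxmemory, or "no limit" when maxmemory
# is 0 or missing. INFO lines end in CRLF, so carriage returns are stripped.
memory_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { limit = $2 }
    END {
      if (limit + 0 > 0) printf "%d%%\n", used * 100 / limit
      else print "no limit"
    }
  '
}

# Example (substitute real values for the placeholders):
#   redis-cli -h <endpoint> -p <port> -a <password> info memory | memory_pct
```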
Client-side
- Check that there is no memory or CPU pressure on the client host.
- Check that the client uses a connection pool instead of frequently opening and closing connections.
- Check that the client does not erroneously open multiple connections, which can put pressure on the client or the server.
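One way to spot a client that opens more connections than expected is to count connections per source address on the database side. A minimal sketch, assuming the CLIENT LIST command is available on the database and prints one "addr=<ip>:<port>" field per connection, as in open source Redis; the function name connections_per_client is hypothetical.

```shell
# Hypothetical helper: reads "CLIENT LIST" output on stdin and prints
# "<client address> <connection count>" per source address, busiest first.
connections_per_client() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^addr=/) {
        a = substr($i, 6)          # drop the "addr=" prefix
        sub(/:[0-9]+$/, "", a)     # drop the ":port" suffix
        count[a]++
      }
  }
  END { for (a in count) print a, count[a] }' | sort -k2,2nr -k1,1
}

# Example (substitute real values for the placeholders):
#   redis-cli -h <endpoint> -p <port> -a <password> client list | connections_per_client
```

A count that keeps growing for one address, or many connections with a very small "age" value, suggests the client is not reusing pooled connections.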