
Redis Software Developer Observability Playbook

Introduction#

Figure 1. Dashboard showing relevant statistics for a Node

Figure 2. Dashboard showing an overview of cluster metrics

Core cluster resource monitoring#

Memory#

| Metric name | Definition | Unit |
|---|---|---|
| Memory usage percentage | Percentage of used memory relative to the configured memory limit for a given database | Percentage |
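If the cluster's metrics are scraped into Prometheus, the same percentage can be computed from the bdb_used_memory and bdb_memory_limit metrics that the alert triggers later in this playbook rely on; a minimal PromQL sketch:

# Used memory as a percentage of the configured memory limit, per database
round((bdb_used_memory / bdb_memory_limit) * 100)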

Figure 3. Dashboard displaying high-level cluster metrics

Memory Thresholds#

Caching workloads#

Read latency#

Cache hit ratio and eviction#

Eviction policies#

| Name | Description |
|---|---|
| noeviction | New values aren't saved when the memory limit is reached. When a database uses replication, this applies to the primary database |
| allkeys-lru | Keeps most recently used keys; removes least recently used (LRU) keys |
| allkeys-lfu | Keeps frequently used keys; removes least frequently used (LFU) keys |
| volatile-lru | Removes least recently used keys with the expire field set to true |
| volatile-lfu | Removes least frequently used keys with the expire field set to true |
| allkeys-random | Randomly removes keys to make space for the new data added |
| volatile-random | Randomly removes keys with the expire field set to true |
| volatile-ttl | Removes keys with the expire field set to true and the shortest remaining time-to-live (TTL) value |
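As a quick illustration of these policies in practice: on open source Redis the eviction policy can be changed at runtime with CONFIG SET, while in Redis Software it is configured per database through the admin console or REST API. A minimal sketch against an open source instance using the redis-py client (host, port, and the chosen policy are illustrative):

import redis

# Connect to a Redis instance (hypothetical host/port)
r = redis.Redis(host="localhost", port=6379)

# On open source Redis, switch the eviction policy at runtime.
# allkeys-lru keeps recently used keys and evicts the least recently used ones.
r.config_set("maxmemory-policy", "allkeys-lru")

# Verify the active policy
print(r.config_get("maxmemory-policy"))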

Eviction policy guidelines#

Non-caching workloads#

| Issue | Possible causes | Remediation |
|---|---|---|
| Redis memory usage has reached 100% | This may indicate an insufficient Redis memory limit for your application's workload | For non-caching workloads (where eviction is unacceptable), immediately increase the memory limit for the database. You can do this through the Redis Software console or its API, or contact Redis support for assistance. For caching workloads, monitor performance closely and confirm that an eviction policy is in place. If the application's performance starts to degrade, you may need to increase the memory limit, as described above. |
| Redis has stopped accepting writes | Memory is at 100% and no eviction policy is in place | Increase the database's total amount of memory. If this is a caching workload, consider enabling an eviction policy. In addition, determine whether the application can set a reasonable TTL (time-to-live) on some or all of the data being written to Redis (see the sketch after this table). |
| Cache hit ratio is steadily decreasing | The application's working set size may be steadily increasing, or the application may be misconfigured (e.g., generating more than one unique cache key per cached item) | If the working set size is increasing, consider increasing the memory limit for the database. If the application is misconfigured, review its cache key generation logic. |
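Setting a TTL when data is written lets Redis expire entries on its own rather than relying solely on eviction. A minimal sketch using the redis-py client (key name, value, and TTL are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379)

# Write a cache entry that expires automatically after one hour (3600 seconds)
r.set("cache:user:42:profile", '{"name": "example"}', ex=3600)

# Inspect the remaining time-to-live in seconds (-1 means no TTL, -2 means the key no longer exists)
print(r.ttl("cache:user:42:profile"))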

CPU#

| Metric name | Definition | Unit |
|---|---|---|
| Shard CPU | CPU time portion spent by database shards | Percentage, up to 100% per shard |
| Proxy CPU | CPU time portion spent by the cluster's proxies | Percentage, up to 100% per proxy thread |
| Node CPU (user and system) | CPU time portion spent by all user-space and kernel-level processes | Percentage, up to 100% per node CPU |

Figure 4. Dashboard displaying CPU usage

CPU Thresholds#

Figure 5. Display showing Proxy CPU usage

Figure 6. Dashboard displaying an ensemble of Node CPU usage data

CPU Troubleshooting#

| Issue | Possible causes | Remediation |
|---|---|---|
| High CPU utilization across all shards of a database | This usually indicates that the database is under-provisioned in terms of the number of shards. A secondary cause may be that the application is running too many inefficient Redis operations. You can detect slow Redis operations by enabling the slow log in the Redis Software UI. | First, rule out inefficient Redis operations as the cause of the high CPU utilization; see Slow operations for details. If inefficient Redis operations are not the cause, increase the number of shards in the database. |
| High CPU utilization on a single shard, with the remaining shards having low CPU utilization | This usually indicates a master shard with at least one hot key. Hot keys are keys that are accessed extremely frequently (e.g., more than 1000 times per second). | Hot key issues generally cannot be resolved by increasing the number of shards. To resolve this issue, see Hot keys and the sketch after this table. |
| High Proxy CPU | There are several possible causes of high proxy CPU. First, review the behavior of connections to the database: frequent cycling of connections, especially when TLS is enabled, can cause high proxy CPU utilization. This is especially true when you see more than 100 connections per second per thread; such behavior is almost always a sign of a misbehaving application. Second, review the total number of operations per second against the cluster: if you see more than 50k operations per second per thread, you may need to increase the number of proxy threads. | In the case of high connection cycling, review the application's connection behavior. In the case of high operations per second, increase the number of proxy threads. |
| High Node CPU | You will typically detect high shard or proxy CPU utilization before you detect high node CPU utilization. Use the remediation steps above to address those first. If you still see high node CPU utilization, you may need to increase the number of nodes in the cluster. | Consider increasing the number of nodes in the cluster and rebalancing the shards across the new nodes. This is a complex operation and should be done with the help of Redis support. |
| High System CPU | Most of the issues above will show up as user-space CPU utilization. High system CPU utilization, however, may indicate a problem at the network or storage level. | Review network bytes in and network bytes out to rule out any unexpected spikes in network traffic. You may need to perform deeper network diagnostics to identify the cause of the high system CPU utilization; for example, with high rates of packet loss, you may need to review network configurations or even the network hardware. |
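One way to confirm a suspected hot key on a busy shard is to sample key access frequencies. The sketch below assumes the database uses an LFU eviction policy (allkeys-lfu or volatile-lfu), which is what makes OBJECT FREQ available; the sample size and connection details are illustrative:

import redis

r = redis.Redis(host="localhost", port=6379)

# OBJECT FREQ is only available when maxmemory-policy is an LFU policy;
# with other policies the call returns an error.
frequencies = {}
for i, key in enumerate(r.scan_iter(count=1000)):
    if i >= 10000:  # bound the sample on large keyspaces
        break
    frequencies[key] = r.object("freq", key)

# Print the ten most frequently accessed keys in the sample
for key, freq in sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(key, freq)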

Connections#

Connections Troubleshooting#

| Issue | Possible causes | Remediation |
|---|---|---|
| Fewer connections to Redis than expected | The application may not be connecting to the correct Redis database, or there may be a network partition between the application and the Redis database | Confirm that the application can successfully connect to Redis. This may require consulting the application logs or the application's connection configuration. |
| Connection count continues to grow over time | Your application may not be releasing connections. The most common cause of such a connection leak is a manually implemented connection pool or a connection pool that is not properly configured. | Review the application's connection configuration (see the pooling sketch after this table) |
| Erratic connection counts (e.g., spikes and drops) | Application misbehavior (thundering herds, connection cycling, etc.) or networking issues | Review the application logs and network traffic to determine the cause of the erratic connection counts |
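Most Redis client libraries provide a connection pool that caps and reuses connections instead of opening a new one per request. A minimal sketch using the redis-py client (the endpoint, pool size, and timeouts are illustrative and should be tuned to your workload):

import redis

# A bounded pool reuses connections and prevents unbounded connection growth
pool = redis.ConnectionPool(
    host="redis.example.com",   # hypothetical database endpoint
    port=6379,
    max_connections=50,         # hard cap on concurrent connections
    socket_timeout=5,
    socket_connect_timeout=5,
)

r = redis.Redis(connection_pool=pool)
r.ping()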

Figure 7. Dashboard displaying connections

Network ingress/egress#

Unbalanced database endpoint#

Synchronization#

Figure 8. Dashboard displaying connection metrics between zones

Database performance indicators#

Latency#

Figure 9. Dashboard display of latency metrics
Figure 10. Display showing a noticeable spike in latency

Latency Troubleshooting#

| Possible cause | How to confirm | Remediation |
|---|---|---|
| Slow database operations | Confirm whether there are excessive slow operations in the Redis slow log | If possible, reduce the number of slow operations being sent to the database. If this is not possible, consider increasing the number of shards in the database. |
| Increased traffic to the database | Review the network traffic and the database operations per second chart to determine whether increased traffic is causing the latency | If the database is underprovisioned due to increased traffic, consider increasing the number of shards in the database |
| Insufficient CPU | Check whether CPU utilization is increasing | Confirm that slow operations are not causing the high CPU utilization. If the high CPU utilization is due to increased load, consider adding shards to the database. |
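For reference, the bdb_avg_latency metric used by the latency alerts later in this playbook is reported in seconds; a minimal PromQL sketch for viewing it in milliseconds per database:

# Average database operation latency, converted from seconds to milliseconds
round(bdb_avg_latency * 1000)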

Cache hit rate#

Figure 11. Dashboard showing the cache hit ratio along with read/write misses

| Metric name | Definition |
|---|---|
| bdb_read_hits | The number of successful read operations |
| bdb_read_misses | The number of read operations returning null |
| bdb_write_hits | The number of write operations against existing keys |
| bdb_write_misses | The number of write operations that create new keys |
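These counters can be combined into a read hit ratio; the expression below is the same one used by the low cache hit rate alert later in this playbook:

# Percentage of read operations that found an existing key
(100 * bdb_read_hits) / (bdb_read_hits + bdb_read_misses)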

Cache hit rate troubleshooting#

Key eviction rate#

Figure 12. Dashboard displaying object evictions

Proxy performance#

Proxy policies#

| Policy | Description |
|---|---|
| Single | There is only a single proxy bound to the database. This is the default database configuration and is preferable in most use cases. |
| All Master Shards | There are multiple proxies bound to the database, one on each node that hosts a database master shard. This mode fits most use cases that require multiple proxies. |
| All Nodes | There are multiple proxies bound to the database, one on each node in the cluster, regardless of whether there is a shard from this database on the node. This mode should be used only in special cases, such as using a load balancer. |

Figure 13. Dashboard displaying proxy thread activity

| Total cores | Redis (ROR) | Redis on Flash (ROF) |
|---|---|---|
| 1 | 1 | 1 |
| 4 | 3 | 3 |
| 8 | 5 | 3 |
| 12 | 8 | 4 |
| 16 | 10 | 5 |
| 32 | 24 | 10 |
| 64/96 | 32 | 20 |
| 128 | 32 | 32 |

Data access anti-patterns#

Slow operations#

Slow operations troubleshooting#

Figure 14. Redis Cloud dashboard showing slow database operations

| Issue | Remediation |
|---|---|
| The KEYS command shows up in the slow log | Find the application that is issuing the KEYS command and replace it with a SCAN command (see the sketch after this table). In an emergency, you can alter the ACLs for the database user so that Redis rejects the KEYS command altogether. |
| The slow log shows a significant number of slow, O(n) operations | If these operations are being issued against large data structures, the application may need to be refactored to use more efficient Redis commands. |
| The slow log contains only O(1) commands, and these commands are taking several milliseconds or more to complete | This likely indicates that the database is underprovisioned. Consider increasing the number of shards and/or nodes. |
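KEYS blocks the shard while it walks the entire keyspace in a single call, whereas SCAN iterates incrementally. A minimal sketch using the redis-py client (the key pattern and batch size are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379)

# Anti-pattern: KEYS walks the whole keyspace in one blocking call
# keys = r.keys("session:*")

# Preferred: SCAN iterates the keyspace in small batches without blocking the shard
for key in r.scan_iter(match="session:*", count=500):
    print(key)

# The slow log can confirm whether blocking commands are still being issued
for entry in r.slowlog_get(10):
    print(entry)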

Hot keys#

Hot keys troubleshooting#

Remediation#

Large keys#

Large keys troubleshooting#

Remediation#

Alerting#

Configuring Prometheus#

prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "error_rules.yml"
  - "alerts.yml"
Prometheus alerts
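The rule_files entry above refers to alerts.yml, which holds the alert definitions. A minimal sketch of such a file, wiring up one of the triggers from the list below (the alert name, duration, labels, and annotation text are illustrative):

alerts.yml
groups:
  - name: redis-enterprise-alerts
    rules:
      - alert: DatabaseAverageLatencyWarning
        # Trigger taken from the list of alerts below
        expr: round(bdb_avg_latency * 1000) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average latency has reached a warning level"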

List of alerts#

| Description | Trigger |
|---|---|
| Average latency has reached a warning level | round(bdb_avg_latency * 1000) > 1 |
| Average latency has reached a critical level indicating system degradation | round(bdb_avg_latency * 1000) > 4 |
| Absence of any connection indicates improper configuration or a firewall issue | bdb_conns < 1 |
| A flood of connections has occurred that will impact normal operations | bdb_conns > 64000 |
| Absence of any requests indicates improperly configured clients | bdb_total_req < 1 |
| Excessive number of client requests indicates configuration and/or programmatic issues | bdb_total_req > 1000000 |
| The database in question will soon be unable to accept new data | round((bdb_used_memory/bdb_memory_limit) * 100) > 98 |
| The database in question will be unable to accept new data in two hours | round((bdb_used_memory/bdb_memory_limit) * 100) < 98 and (predict_linear(bdb_used_memory[15m], 2 * 3600) / bdb_memory_limit) > 0.3 and round(predict_linear(bdb_used_memory[15m], 2 * 3600)/bdb_memory_limit) > 0.98 |
| Database read operations are failing to find entries more than 50% of the time | (100 * bdb_read_hits)/(bdb_read_hits + bdb_read_misses) < 50 |
| In situations where TTL values are not set, this indicates a problem | bdb_evicted_objects > 1 |
| Replication between nodes is not in a satisfactory state | bdb_replicaof_syncer_status > 0 |
| Record synchronization between nodes is not in a satisfactory state | bdb_crdt_syncer_status > 0 |
| The amount by which replication lags behind events is worrisome | bdb_replicaof_syncer_local_ingress_lag_time > 500 |
| The amount by which object replication lags behind events is worrisome | bdb_crdt_syncer_local_ingress_lag_time > 500 |
| The number of active nodes is less than expected | count(node_up) != 3 |
| Persistent storage will soon be exhausted | round((node_persistent_storage_free/node_persistent_storage_avail) * 100) <= 5 |
| Ephemeral storage will soon be exhausted | round((node_ephemeral_storage_free/node_ephemeral_storage_avail) * 100) <= 5 |
| The node in question is close to running out of memory | round((node_available_memory/node_free_memory) * 100) <= 15 |
| The node in question has exceeded expected levels of CPU usage | round((1 - node_cpu_idle) * 100) >= 80 |
| The shard in question is not reachable | redis_up == 0 |
| The master shard is not reachable | floor(redis_master_link_status{role="slave"}) < 1 |
| The shard in question has exceeded expected levels of CPU usage | redis_process_cpu_usage_percent >= 80 |
| The master shard has exceeded expected levels of CPU usage | redis_process_cpu_usage_percent{role="master"} > 0.75 and redis_process_cpu_usage_percent{role="master"} > on (bdb) group_left() (avg by (bdb)(redis_process_cpu_usage_percent{role="master"}) + on(bdb) 1.2 * stddev by (bdb) (redis_process_cpu_usage_percent{role="master"})) |
| The shard in question has an unhealthily high level of connections | redis_connected_clients > 500 |

Appendix A: Grafana dashboards#

Software#

Workflow#

Cloud#