Redis almost never fails the way a web server fails. There's no clean 500, no connection-refused that your error tracker lights up on. Instead it degrades: memory creeps toward maxmemory and the eviction policy starts quietly dropping keys, a replica drops off and you lose your read scaling, latency climbs under load, or maxclients gets hit and new connections are rejected. Through all of it your application keeps returning 200, because a cache miss falls through to the database and the page still renders — just slower, and with sessions or rate limits silently misbehaving.
A normal uptime check on your homepage is blind to this, the same way it's blind to a dead database or a lagging Postgres replica. This guide covers what "healthy" actually means for Redis and Amazon ElastiCache, how to expose that through an HTTP health endpoint, and how to monitor it externally with CronAlert so a degrading cache pages you before it becomes a customer-facing outage.
Liveness is not health
The reflex is to reach for PING. It's a fine liveness probe — Redis returns PONG when the process is up and the event loop is responsive — but liveness answers "is it running," not "is it healthy." A Redis instance can answer PING in under a millisecond while:
- Memory is nearly exhausted. used_memory is approaching maxmemory, and the next write triggers an eviction (or an out-of-memory error if the policy is noeviction).
- Keys are being evicted. evicted_keys is climbing, which means data your app assumed was cached is being thrown away. Session stores, rate limiters, and idempotency keys silently break.
- A replica has dropped. connected_slaves fell below what you expect, so you've lost read scaling and your failover target.
- Connections are being rejected. maxclients is hit, and new clients get errors while existing ones look fine.
So a real health check does two things: it confirms responsiveness with PING, and it reads a handful of INFO fields and applies thresholds. That distinction — liveness versus deep health — is the same one covered in the complete guide to HTTP health check endpoints, applied to a cache.
The metrics that matter
INFO returns dozens of fields. These are the high-signal ones worth alerting on:
- Memory pressure: used_memory as a fraction of maxmemory. Above ~80% you're close to eviction or OOM. This is the single most important number.
- Eviction rate: a rising evicted_keys means you're already losing cached data. A steady non-zero rate is sometimes acceptable for a pure LRU cache; a sudden spike is not.
- Replication health: connected_slaves and, on each replica, master_link_status:up. A dropped link is your replication-lag equivalent for Redis.
- Rejected connections: rejected_connections climbing means maxclients is saturated.
- Blocked clients and latency: blocked_clients, plus command latency you can sample with a timed PING round-trip.
- Hit ratio: keyspace_hits vs keyspace_misses. A collapsing hit ratio means the cache has stopped doing its job and your database is taking the load.
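Several of these fields (evicted_keys, rejected_connections, keyspace_hits, keyspace_misses) are cumulative counters since server start, so a "rising" value only shows up when you compare two samples. The following is a minimal sketch of that comparison, assuming you already have two parsed INFO snapshots as dictionaries; the function name info_deltas and the sample numbers are hypothetical, not part of any Redis client.

```python
def info_deltas(prev: dict, curr: dict, seconds: float) -> dict:
    """Turn two INFO snapshots into rates for the interval between them."""

    def delta(field: str) -> int:
        # A restart resets the counters to zero; clamp so a restart
        # never produces a negative rate.
        return max(0, int(curr.get(field, 0)) - int(prev.get(field, 0)))

    hits = delta("keyspace_hits")
    misses = delta("keyspace_misses")
    lookups = hits + misses
    return {
        "evictions_per_sec": delta("evicted_keys") / seconds,
        "rejected_per_sec": delta("rejected_connections") / seconds,
        # None means no lookups happened in the interval, which is
        # not the same thing as a 0% hit ratio.
        "hit_ratio": hits / lookups if lookups else None,
    }


# Illustrative snapshots taken 60 seconds apart (made-up numbers).
prev = {"evicted_keys": 100, "rejected_connections": 0,
        "keyspace_hits": 9_000, "keyspace_misses": 1_000}
curr = {"evicted_keys": 700, "rejected_connections": 0,
        "keyspace_hits": 9_300, "keyspace_misses": 1_700}

rates = info_deltas(prev, curr, seconds=60)
```

Computing the hit ratio over the interval, rather than from the lifetime totals, is the design choice that matters here: lifetime totals can still look healthy long after the ratio has collapsed.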
Exposing Redis health through an HTTP endpoint
CronAlert — like any external HTTP monitor — can't and shouldn't connect to Redis directly. Redis speaks RESP over TCP, not HTTP, and a production cache (especially ElastiCache) lives inside a private VPC subnet that isn't publicly routable by design. The pattern, the same one used for database health endpoints, is a small authenticated HTTP endpoint inside your app (or a sidecar) that runs the checks and translates the result into a status code and a status string.
A Node.js / Express example using ioredis:
    const express = require('express');
    const Redis = require('ioredis');

    const app = express();
    const redis = new Redis(process.env.REDIS_URL);

    const MEMORY_WARN = 0.8;    // 80% of maxmemory
    const PING_BUDGET_MS = 50;  // round-trip latency budget

    // Parse the "key:value" lines of INFO output. Section headers such as
    // "# Memory" contain no colon and are skipped.
    function parseInfo(text) {
      const out = {};
      for (const line of text.split('\n')) {
        const i = line.indexOf(':');
        if (i > 0) out[line.slice(0, i)] = line.slice(i + 1).trim();
      }
      return out;
    }

    app.get('/healthz/redis', async (req, res) => {
      // Fail closed: if HEALTH_TOKEN is unset, a request with no header
      // would otherwise compare undefined to undefined and be let through.
      const expected = process.env.HEALTH_TOKEN;
      if (!expected || req.headers['x-health-token'] !== expected) {
        return res.status(401).json({ status: 'unauthorized' });
      }
      try {
        // Liveness: time a PING round-trip.
        const started = Date.now();
        const pong = await redis.ping();
        const pingMs = Date.now() - started;

        // Deep health: read INFO and apply thresholds.
        const info = parseInfo(await redis.info());
        const usedMemory = Number(info.used_memory);
        // maxmemory is 0 when no limit is configured; treat that as unlimited.
        const maxMemory = Number(info.maxmemory) || Infinity;
        const memoryRatio = usedMemory / maxMemory;
        const slaves = Number(info.connected_slaves || 0);

        const problems = [];
        if (pong !== 'PONG') problems.push('ping_failed');
        if (pingMs > PING_BUDGET_MS) problems.push('slow');
        if (memoryRatio > MEMORY_WARN) problems.push('memory_pressure');
        if (Number(process.env.EXPECTED_SLAVES || 0) > slaves) problems.push('replica_down');

        const healthy = problems.length === 0;
        res.status(healthy ? 200 : 503).json({
          status: healthy ? 'healthy' : 'unhealthy',
          ping_ms: pingMs,
          memory_ratio: Math.round(memoryRatio * 100) / 100,
          connected_slaves: slaves,
          problems,
        });
      } catch (err) {
        // Generic message only: never echo driver errors or connection details.
        res.status(503).json({ status: 'unhealthy', error: 'redis check failed' });
      }
    });

    app.listen(3000);
The same shape in Python / FastAPI with redis-py:
    import os
    import time

    import redis
    from fastapi import FastAPI, Header, Response

    app = FastAPI()
    r = redis.from_url(os.environ["REDIS_URL"])

    MEMORY_WARN = 0.8     # 80% of maxmemory
    PING_BUDGET_MS = 50   # round-trip latency budget

    @app.get("/healthz/redis")
    def redis_health(response: Response, x_health_token: str = Header(default="")):
        if x_health_token != os.environ["HEALTH_TOKEN"]:
            response.status_code = 401
            return {"status": "unauthorized"}
        try:
            # Liveness: time a PING round-trip.
            started = time.monotonic()
            pong = r.ping()
            ping_ms = (time.monotonic() - started) * 1000

            # Deep health: redis-py returns INFO as a parsed dict.
            info = r.info()
            used = info["used_memory"]
            # maxmemory is 0 when no limit is configured; treat that as unlimited.
            max_mem = info.get("maxmemory") or float("inf")
            ratio = used / max_mem
            slaves = info.get("connected_slaves", 0)

            problems = []
            if not pong:
                problems.append("ping_failed")
            if ping_ms > PING_BUDGET_MS:
                problems.append("slow")
            if ratio > MEMORY_WARN:
                problems.append("memory_pressure")
            if int(os.environ.get("EXPECTED_SLAVES", 0)) > slaves:
                problems.append("replica_down")

            healthy = not problems
            response.status_code = 200 if healthy else 503
            return {"status": "healthy" if healthy else "unhealthy",
                    "ping_ms": round(ping_ms, 1),
                    "memory_ratio": round(ratio, 2),
                    "problems": problems}
        except Exception:
            response.status_code = 503
            return {"status": "unhealthy", "error": "redis check failed"}
Two design points carry over from the database-health-endpoint guidance: require a token so the endpoint isn't a public information leak about your infrastructure, and never return raw error text or the connection string in the body — emit a generic "redis check failed", not the driver's stack trace.
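On the token point, one common hardening step is to compare the header in constant time and to fail closed when no token is configured. The sketch below shows one way to do that with the standard library; the helper name token_ok is hypothetical and is not part of the examples above.

```python
import hmac
from typing import Optional


def token_ok(presented: Optional[str], expected: Optional[str]) -> bool:
    """Return True only when a token is configured and the header matches it."""
    # Fail closed: an unset or empty expected token must never authorize anyone.
    if not expected or presented is None:
        return False
    # compare_digest takes time independent of where the first mismatch is,
    # which avoids leaking the token through response timing.
    return hmac.compare_digest(presented.encode(), expected.encode())


checks = [
    token_ok("s3cret", "s3cret"),   # matching token
    token_ok("guess", "s3cret"),    # wrong token
    token_ok("", ""),               # nothing configured, nothing sent
    token_ok(None, "s3cret"),       # header missing
]
```

In a handler like the FastAPI one above, the call would look something like token_ok(x_health_token, os.environ.get("HEALTH_TOKEN")), returning 401 when it is False.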
Monitoring the endpoint with CronAlert
- Create a monitor pointing at https://api.yourapp.com/healthz/redis, with the x-health-token value added as a custom request header so the check authenticates.
- Set the expected status code to 200. When any threshold is breached the endpoint returns 503 and the status-code check fires. This alone catches the breach.
- Add keyword/content monitoring (Pro plan) to require the body contain "healthy", including the double quotes. The quotes matter if the match is a plain substring match: the bare word healthy also appears inside unhealthy. This catches the deceptive case where a proxy rewrites the 503 into a 200 with an unhealthy body. See keyword monitoring.
- Set the interval to 1 minute on a paid plan. Memory pressure and eviction storms develop in seconds under a write spike; a 3-minute gap can miss the window where you could still act.
- Route the alert to the right place — a degrading cache that's about to take the database down with it is on-call-worthy. See incident response workflows for routing to PagerDuty or Opsgenie.
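To make the status-code and keyword steps concrete, here is a small model of the two checks. It is a sketch of the logic only, not CronAlert's implementation, and it assumes the keyword check is a plain substring match; the function name monitor_verdict and the sample bodies are hypothetical.

```python
import json


def monitor_verdict(status_code: int, body: str, keyword: str = '"healthy"') -> str:
    """Model an external check: expected status first, then required keyword."""
    if status_code != 200:
        return "down: unexpected status"
    if keyword not in body:
        return "down: keyword missing"
    return "up"


healthy_body = json.dumps({"status": "healthy", "problems": []})
unhealthy_body = json.dumps({"status": "unhealthy", "problems": ["memory_pressure"]})

# Normal cases: the status code alone separates healthy from unhealthy.
normal_up = monitor_verdict(200, healthy_body)
normal_down = monitor_verdict(503, unhealthy_body)

# Deceptive case: something upstream turned the 503 into a 200.
masked_quoted = monitor_verdict(200, unhealthy_body)                   # keyword '"healthy"'
masked_bare = monitor_verdict(200, unhealthy_body, keyword="healthy")  # bare word
```

With the quoted keyword the masked response is still reported as down. With the bare word it is reported as up, because "unhealthy" contains "healthy" as a substring.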
ElastiCache specifics
Amazon ElastiCache (Redis or Valkey) is the same cache with a managed wrapper: you can't SSH to the node, and AWS already exposes deep metrics through CloudWatch. That changes the division of labor, not the strategy.
- Keep the HTTP health endpoint for fast, external, vantage-point-independent detection. Your app already has a Redis client and a VPC route to the cluster; the endpoint is cheap.
- Lean on CloudWatch alarms for node-level depth: DatabaseMemoryUsagePercentage, Evictions, CurrConnections, ReplicationLag, CPUUtilization, and SwapUsage. These come for free and see things the client can't.
- Use the cluster's configuration endpoint (for Cluster Mode Enabled) rather than a single node address in your health check, so a node failover doesn't make the check itself fail spuriously.
- Mind the multi-AZ failover. During a failover the primary endpoint repoints to a promoted replica; a brief blip is expected. Set CronAlert's consecutive-check threshold so a single failover blip doesn't page, but a sustained failure does.
The two layers complement each other: CloudWatch sees the node from the inside, while CronAlert sees the endpoint from the outside, through the same app-to-cache path your users depend on, and keeps reporting even if the box hosting your app (and any metrics agent running on it) is itself in trouble. This is the same external-vantage-point argument that makes monitoring from Cloudflare's edge valuable for any internal service.
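The consecutive-check threshold mentioned for failovers amounts to a small state machine: count failures in a row, alert once when the count reaches the threshold, and reset on the first success. The sketch below illustrates that idea under those assumptions; the class name and the threshold of 3 are hypothetical and do not describe how CronAlert implements it.

```python
from typing import Optional


class ConsecutiveFailureGate:
    """Alert only after `threshold` failed checks in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.alerting = False

    def observe(self, check_passed: bool) -> Optional[str]:
        if check_passed:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "recover"
            return None
        self.failures += 1
        # Fire once when the threshold is reached, not on every later failure.
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None


gate = ConsecutiveFailureGate(threshold=3)
# One isolated failure (a failover blip), then a sustained outage, then recovery.
results = [True, False, True, False, False, False, False, True]
events = [gate.observe(ok) for ok in results]
```

The isolated failure produces no event; the third consecutive failure produces a single "alert", and the next passing check produces "recover".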
What a degraded cache actually breaks
It's worth being concrete about why this matters, because the failure modes are indirect:
- Sessions vanish. If Redis is your session store and it starts evicting, users get logged out mid-action. The site is "up," but people can't stay signed in.
- Rate limiters misfire. Counters in Redis that get evicted reset to zero, so clients that should be throttled get a fresh allowance; and if the limiter fails closed when Redis errors out, legitimate traffic gets blocked instead.
- The database absorbs the miss. A collapsing hit ratio sends every request to Postgres or MySQL. The cache outage becomes a database overload, which is how a cache problem turns into a real outage — and why pairing this with database health monitoring matters.
- Queues back up. If you use Redis as a broker (Sidekiq, BullMQ, Celery with a Redis broker), a degraded Redis stalls background workers and jobs pile up invisibly.
Common pitfalls
- Checking only PING. The most common mistake: a green liveness probe on a cache that's evicting keys. Always read a few INFO thresholds too.
- Heavy health checks. Don't run KEYS * (or a full SCAN sweep) on a large keyspace from the health endpoint. KEYS is O(N) and blocks the server while it walks every key. DBSIZE is O(1) and safe if you need a key count. PING plus INFO is cheap; keep it that way.
- No external vantage point. A monitoring script on the same host as the cache loses the signal exactly when the host dies. External monitoring survives the failure of the thing it watches.
- Treating eviction as always-bad. For a pure LRU cache with maxmemory-policy allkeys-lru, some eviction is normal. Alert on the rate and on memory pressure, not on the mere existence of evictions. For a session or queue store where eviction means data loss, any eviction is a problem, so set the threshold accordingly.
- Ignoring failover blips. Managed failovers cause brief endpoint errors. Use consecutive-check verification so a planned or transient failover doesn't generate a false positive.
Where this fits in a broader strategy
Redis health monitoring is one deep check among several. Pair it with a database connectivity endpoint, replication-lag monitoring, third-party dependency checks, and standard uptime monitoring on the user-facing endpoints. Each covers a layer the others can't: the front door, the database, the cache, the replicas, and the external services. For teams on Kubernetes, the same Redis health endpoint doubles as a readiness probe so the platform stops routing traffic to a pod whose cache connection has gone bad.
Frequently asked questions
How do you health-check a Redis instance?
Start with PING for liveness, then read INFO fields — used_memory vs maxmemory, evicted_keys, connected_slaves, rejected_connections — and apply thresholds. Expose the combined result through an authenticated HTTP endpoint that returns 200 when healthy and 503 when a threshold is breached, so an external monitor can watch it.
Can CronAlert connect to Redis directly?
No. Redis speaks RESP over TCP, not HTTP, and ElastiCache lives in a private subnet. Expose a small HTTP health endpoint that runs the checks and returns a status code plus a "status" string; CronAlert checks that endpoint from outside your infrastructure.
What Redis metrics should trigger an alert?
Memory usage approaching maxmemory, a rising eviction rate, a dropped replica, rejected connections, and climbing command latency. On ElastiCache, also alarm on CloudWatch's DatabaseMemoryUsagePercentage, Evictions, CurrConnections, and ReplicationLag.
How is monitoring ElastiCache different from self-hosted Redis?
The cache is the same, but ElastiCache is managed: no shell access, and AWS publishes deep metrics via CloudWatch. Keep your HTTP health endpoint for fast external detection and use CloudWatch alarms for node-level depth. CronAlert watches the endpoint so you get a signal independent of the host's own health.
Why not just monitor my website instead of the cache?
Because a degraded cache makes the site slow and inconsistent rather than fully down — sessions vanish, rate limiters misfire, the database takes the miss load — all while the homepage keeps returning 200. You need a check aimed at the cache's actual health.
Monitor your cache with CronAlert
Expose a Redis health endpoint that checks PING, memory pressure, and replication, then let CronAlert watch it from outside your infrastructure. Create a free account (25 monitors, no credit card), point a monitor at your endpoint, add keyword monitoring on Pro to catch deceptive 200s, and route the alert to your on-call channel. The next time your cache starts evicting keys or loses a replica, you'll hear it from CronAlert — not from a wave of logged-out users and a database melting under cache misses.
Related reading: how to monitor your database health endpoint, monitoring Postgres replication lag, the complete guide to HTTP health check endpoints, monitoring third-party dependencies, and monitoring background workers and queue consumers.