How to Monitor Redis and ElastiCache Endpoints

Redis almost never fails the way a web server fails. There's no clean 500, no connection-refused that your error tracker lights up on. Instead it degrades: memory creeps toward maxmemory and the eviction policy starts quietly dropping keys, a replica drops off and you lose your read scaling, latency climbs under load, or maxclients gets hit and new connections are rejected. Through all of it your application keeps returning 200, because a cache miss falls through to the database and the page still renders — just slower, and with sessions or rate limits silently misbehaving.

A normal uptime check on your homepage is blind to this, the same way it's blind to a dead database or a lagging Postgres replica. This guide covers what "healthy" actually means for Redis and Amazon ElastiCache, how to expose that through an HTTP health endpoint, and how to monitor it externally with CronAlert so a degrading cache pages you before it becomes a customer-facing outage.

Liveness is not health

The reflex is to reach for PING. It's a fine liveness probe — Redis returns PONG when the process is up and the event loop is responsive — but liveness answers "is it running," not "is it healthy." A Redis instance can answer PING in under a millisecond while:

Memory is nearly exhausted. used_memory is approaching maxmemory, and the next write triggers an eviction (or an out-of-memory error if the policy is noeviction).
Keys are being evicted. evicted_keys is climbing, which means data your app assumed was cached is being thrown away. Session stores, rate limiters, and idempotency keys silently break.
A replica has dropped. connected_slaves fell below what you expect, so you've lost read scaling and your failover target.
Connections are being rejected. maxclients is hit, and new clients get errors while existing ones look fine.

So a real health check does two things: it confirms responsiveness with PING, and it reads a handful of INFO fields and applies thresholds. That distinction — liveness versus deep health — is the same one covered in the complete guide to HTTP health check endpoints, applied to a cache.

The metrics that matter

INFO returns dozens of fields. These are the high-signal ones worth alerting on:

Memory pressure — used_memory as a fraction of maxmemory. Above ~80% you're close to eviction or OOM. This is the single most important number.
Eviction rate — a rising evicted_keys means you're already losing cached data. A steady non-zero rate is sometimes acceptable for a pure LRU cache; a sudden spike is not.
Replication health — connected_slaves and, on each replica, master_link_status:up. A dropped link is your replication-lag equivalent for Redis.
Rejected connections — rejected_connections climbing means maxclients is saturated.
Blocked clients and latency — blocked_clients, plus command latency you can sample with a timed PING round-trip.
Hit ratio — keyspace_hits vs keyspace_misses. A collapsing hit ratio means the cache has stopped doing its job and your database is taking the load.
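Two of these numbers are derived rather than read directly. The hit ratio comes from two counters, and the eviction rate needs two INFO snapshots taken some seconds apart, because evicted_keys is a cumulative counter. A minimal Python sketch, assuming info is a dict like the one redis-py's info() returns; the function names and sample values are illustrative, not part of any library:

```python
def hit_ratio(info):
    """Fraction of key lookups served from the cache, or None before any lookups."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


def evictions_per_second(earlier, later, elapsed_s):
    """Eviction rate between two INFO snapshots taken elapsed_s seconds apart."""
    delta = int(later.get("evicted_keys", 0)) - int(earlier.get("evicted_keys", 0))
    # A negative delta means the counter was reset (for example by a restart).
    return max(delta, 0) / elapsed_s


# Illustrative snapshots, not real measurements.
before = {"keyspace_hits": 900, "keyspace_misses": 100, "evicted_keys": 40}
after = {"keyspace_hits": 1500, "keyspace_misses": 500, "evicted_keys": 100}

print(hit_ratio(after))                         # 0.75
print(evictions_per_second(before, after, 30))  # 2.0
```

Because keyspace_hits and keyspace_misses are cumulative too, the ratio above is a long-run average. To catch a sudden collapse, apply the same two-snapshot subtraction to both counters before dividing.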

Exposing Redis health through an HTTP endpoint

CronAlert — like any external HTTP monitor — can't and shouldn't connect to Redis directly. Redis speaks RESP over TCP, not HTTP, and a production cache (especially ElastiCache) lives inside a private VPC subnet that isn't publicly routable by design. The pattern, the same one used for database health endpoints, is a small authenticated HTTP endpoint inside your app (or a sidecar) that runs the checks and translates the result into a status code and a status string.

A Node.js / Express example using ioredis:

const express = require('express');
const Redis = require('ioredis');

const app = express();
// Fail fast rather than queue commands while Redis is unreachable or stalled,
// so the health check answers well inside the monitor's timeout.
const redis = new Redis(process.env.REDIS_URL, {
  connectTimeout: 1000,
  commandTimeout: 1000,
  maxRetriesPerRequest: 1,
});

const MEMORY_WARN = 0.8;   // 80% of maxmemory
const PING_BUDGET_MS = 50; // round-trip latency budget

// Turn the "key:value" lines of INFO into an object; section headers have no colon.
function parseInfo(text) {
  const out = {};
  for (const line of text.split(/\r?\n/)) {
    const i = line.indexOf(':');
    if (i > 0) out[line.slice(0, i)] = line.slice(i + 1).trim();
  }
  return out;
}

app.get('/healthz/redis', async (req, res) => {
  // Fail closed: if HEALTH_TOKEN is unset, a request without the header must not pass.
  const token = process.env.HEALTH_TOKEN;
  if (!token || req.headers['x-health-token'] !== token) {
    return res.status(401).json({ status: 'unauthorized' });
  }

  try {
    // Liveness: time a PING round-trip.
    const started = Date.now();
    const pong = await redis.ping();
    const pingMs = Date.now() - started;

    // Deep health: read INFO and derive the values the thresholds need.
    const info = parseInfo(await redis.info());
    const usedMemory = Number(info.used_memory);
    const maxMemory = Number(info.maxmemory) || Infinity; // maxmemory 0 means no limit
    const memoryRatio = usedMemory / maxMemory;
    const slaves = Number(info.connected_slaves || 0);

    const problems = [];
    if (pong !== 'PONG') problems.push('ping_failed');
    if (pingMs > PING_BUDGET_MS) problems.push('slow');
    if (memoryRatio > MEMORY_WARN) problems.push('memory_pressure');
    if (Number(process.env.EXPECTED_SLAVES || 0) > slaves) problems.push('replica_down');

    const healthy = problems.length === 0;
    res.status(healthy ? 200 : 503).json({
      status: healthy ? 'healthy' : 'unhealthy',
      ping_ms: pingMs,
      memory_ratio: Math.round(memoryRatio * 100) / 100,
      connected_slaves: slaves,
      problems,
    });
  } catch (err) {
    // Never echo the driver error or connection string back to the caller.
    res.status(503).json({ status: 'unhealthy', error: 'redis check failed' });
  }
});

app.listen(3000);

The same shape in Python / FastAPI with redis-py:

import os
import time

import redis
from fastapi import FastAPI, Header, Response

app = FastAPI()
# Short socket timeouts keep the check from hanging when Redis is unreachable.
r = redis.from_url(
    os.environ["REDIS_URL"],
    socket_connect_timeout=1.0,
    socket_timeout=1.0,
)
MEMORY_WARN = 0.8      # 80% of maxmemory
PING_BUDGET_MS = 50    # round-trip latency budget

@app.get("/healthz/redis")
def redis_health(response: Response, x_health_token: str = Header(default="")):
    # Fail closed: an unset HEALTH_TOKEN must not match an empty header.
    expected = os.environ.get("HEALTH_TOKEN")
    if not expected or x_health_token != expected:
        response.status_code = 401
        return {"status": "unauthorized"}

    try:
        # Liveness: time a PING round-trip.
        started = time.monotonic()
        pong = r.ping()
        ping_ms = (time.monotonic() - started) * 1000

        # Deep health: redis-py returns INFO as a dict with numeric values parsed.
        info = r.info()
        used = info["used_memory"]
        max_mem = info.get("maxmemory") or float("inf")  # 0 means no limit
        ratio = used / max_mem
        slaves = info.get("connected_slaves", 0)

        problems = []
        if not pong: problems.append("ping_failed")
        if ping_ms > PING_BUDGET_MS: problems.append("slow")
        if ratio > MEMORY_WARN: problems.append("memory_pressure")
        if int(os.environ.get("EXPECTED_SLAVES", 0)) > slaves:
            problems.append("replica_down")

        healthy = not problems
        response.status_code = 200 if healthy else 503
        return {"status": "healthy" if healthy else "unhealthy",
                "ping_ms": round(ping_ms, 1),
                "memory_ratio": round(ratio, 2),
                "connected_slaves": slaves,
                "problems": problems}
    except Exception:
        # Never echo the driver error or connection string back to the caller.
        response.status_code = 503
        return {"status": "unhealthy", "error": "redis check failed"}

Two design points carry over from the database-health-endpoint guidance: require a token so the endpoint isn't a public information leak about your infrastructure, and never return raw error text or the connection string in the body — emit a generic "redis check failed", not the driver's stack trace.
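For the token check itself, a constant-time comparison avoids leaking how many leading characters matched, and an unset token should fail closed rather than match an empty header. A minimal sketch using hmac.compare_digest from Python's standard library; the helper name token_ok is hypothetical:

```python
import hmac


def token_ok(presented, expected):
    """Return True only if a non-empty expected token matches the presented one."""
    if not expected:
        # No token configured: refuse everyone rather than accept an empty header.
        return False
    # compare_digest does not return early at the first differing byte.
    return hmac.compare_digest((presented or "").encode(), expected.encode())


print(token_ok("s3cret", "s3cret"))  # True
print(token_ok("", ""))              # False
```

In the FastAPI handler this would take the place of a plain != comparison, with expected read from the HEALTH_TOKEN environment variable.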

Monitoring the endpoint with CronAlert

Create a monitor pointing at https://api.yourapp.com/healthz/redis, with the x-health-token value added as a custom request header so the check authenticates.
Set the expected status code to 200. When any threshold is breached the endpoint returns 503 and the status-code check fires. This alone catches the breach.
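Add keyword/content monitoring (Pro plan) to require that the body contain the full field "status":"healthy", not just the word healthy, which also appears inside "unhealthy" and would match a failing body. Adjust the keyword for whitespace if your framework pretty-prints JSON. This catches the deceptive case where a proxy rewrites the 503 into a 200 with an unhealthy body. See keyword monitoring.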
Set the interval to 1 minute on a paid plan. Memory pressure and eviction storms develop in seconds under a write spike; a 3-minute gap can miss the window where you could still act.
Route the alert to the right place — a degrading cache that's about to take the database down with it is on-call-worthy. See incident response workflows for routing to PagerDuty or Opsgenie.
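What a monitor evaluates on each check can be sketched in a few lines of Python. This illustrates the status-code-plus-content logic and is not CronAlert's implementation; the function name and the sample bodies are assumptions. It also shows why a content match should target the whole status field: the bare word healthy is a substring of unhealthy.

```python
import json


def check_passes(status_code, body):
    """Pass only on HTTP 200 with a JSON body whose status field is exactly 'healthy'."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


ok_body = '{"status":"healthy","problems":[]}'
bad_body = '{"status":"unhealthy","problems":["memory_pressure"]}'

print(check_passes(200, ok_body))   # True
print(check_passes(503, bad_body))  # False
print(check_passes(200, bad_body))  # False: a proxy rewrote the 503 into a 200
print("healthy" in bad_body)        # True: a bare-substring match would be fooled
```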

ElastiCache specifics

Amazon ElastiCache (Redis or Valkey) is the same cache with a managed wrapper: you can't SSH to the node, and AWS already exposes deep metrics through CloudWatch. That changes the division of labor, not the strategy.

Keep the HTTP health endpoint for fast, external, vantage-point-independent detection. Your app already has a Redis client and a VPC route to the cluster; the endpoint is cheap.
Lean on CloudWatch alarms for node-level depth: DatabaseMemoryUsagePercentage, Evictions, CurrConnections, ReplicationLag, CPUUtilization, and SwapUsage. These come for free and see things the client can't.
Use the cluster's configuration endpoint (for Cluster Mode Enabled) rather than a single node address in your health check, so a node failover doesn't make the check itself fail spuriously.
Mind the multi-AZ failover. During a failover the primary endpoint repoints to a promoted replica; a brief blip is expected. Set CronAlert's consecutive-check threshold so a single failover blip doesn't page, but a sustained failure does.
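The consecutive-check idea is simple enough to sketch. The class below is a hypothetical Python illustration of the logic, not CronAlert's code: it signals an alert only once a chosen number of checks in a row have failed, and any success resets the count.

```python
class ConsecutiveFailureGate:
    """Signal an alert only after `threshold` failed checks in a row."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, check_ok):
        """Record one check result; return True while the alert condition holds."""
        self.failures = 0 if check_ok else self.failures + 1
        return self.failures >= self.threshold


gate = ConsecutiveFailureGate(threshold=3)
# One failed check during a failover, a recovery, then a sustained failure.
results = [True, False, True, False, False, False]
print([gate.record(ok) for ok in results])
# [False, False, False, False, False, True]
```

The trade-off is detection delay: at a 1-minute interval with a threshold of 3, the third failure arrives about two minutes after the first, so pick the smallest threshold that rides out a normal failover.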

The two layers complement each other: CloudWatch sees the node from the inside, while CronAlert sees the endpoint from the outside. The external check therefore also fires when the host running your app, or its network path to the cache, is in trouble, which node-level metrics cannot show. This is the same external-vantage-point argument that makes monitoring from Cloudflare's edge valuable for any internal service.

What a degraded cache actually breaks

It's worth being concrete about why this matters, because the failure modes are indirect:

Sessions vanish. If Redis is your session store and it starts evicting, users get logged out mid-action. The site is "up," but people can't stay signed in.
Rate limiters misfire. Rate-limit counters in Redis that get evicted start over from zero, so clients that should be throttled get a fresh allowance. If the limiter instead fails closed when Redis errors, everyone gets rate-limited.
The database absorbs the miss. A collapsing hit ratio sends every request to Postgres or MySQL. The cache outage becomes a database overload, which is how a cache problem turns into a real outage — and why pairing this with database health monitoring matters.
Queues back up. If you use Redis as a job queue or broker (Sidekiq, BullMQ, Celery with a Redis broker), a degraded instance stalls background workers and jobs pile up invisibly.

Common pitfalls

Checking only PING. The most common mistake — a green liveness probe on a cache that's evicting keys. Always read a few INFO thresholds too.
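Heavy health checks. Don't run KEYS * or other O(N) commands against a large keyspace from the health endpoint: KEYS walks every key and blocks the server while it runs. DBSIZE, by contrast, is O(1) and safe if you want a key count. PING plus INFO is cheap; keep it that way.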
No external vantage point. A monitoring script on the same host as the cache loses the signal exactly when the host dies. External monitoring survives the failure of the thing it watches.
Treating eviction as always-bad. For a pure LRU cache with maxmemory-policy allkeys-lru, some eviction is normal. Alert on the rate and on memory pressure, not on the mere existence of evictions. For a session or queue store where eviction means data loss, any eviction is a problem — set the threshold accordingly.
Ignoring failover blips. Managed failovers cause brief endpoint errors. Use consecutive-check verification so a planned or transient failover doesn't generate a false positive.
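The eviction pitfall comes down to choosing the threshold by what the instance stores. A hedged Python sketch; the role names and the default rate limit are assumptions for illustration, not recommended values:

```python
def eviction_problem(role, evictions_per_s, cache_max_per_s=50.0):
    """Return True if the observed eviction rate is a problem for this kind of instance."""
    if role in ("session", "queue"):
        # Eviction here means lost sessions or jobs, so any eviction is a problem.
        return evictions_per_s > 0
    if role == "cache":
        # A pure LRU cache evicts routinely; only an unusually high rate matters.
        return evictions_per_s > cache_max_per_s
    raise ValueError(f"unknown role: {role}")


print(eviction_problem("cache", 2.0))    # False
print(eviction_problem("session", 2.0))  # True
```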

Where this fits in a broader strategy

Redis health monitoring is one deep check among several. Pair it with a database connectivity endpoint, replication-lag monitoring, third-party dependency checks, and standard uptime monitoring on the user-facing endpoints. Each covers a layer the others can't: the front door, the database, the cache, the replicas, and the external services. For teams on Kubernetes, the same Redis health endpoint doubles as a readiness probe so the platform stops routing traffic to a pod whose cache connection has gone bad.

Frequently asked questions

How do you health-check a Redis instance?

Start with PING for liveness, then read INFO fields — used_memory vs maxmemory, evicted_keys, connected_slaves, rejected_connections — and apply thresholds. Expose the combined result through an authenticated HTTP endpoint that returns 200 when healthy and 503 when a threshold is breached, so an external monitor can watch it.

Can CronAlert connect to Redis directly?

No. Redis speaks RESP over TCP, not HTTP, and ElastiCache lives in a private subnet. Expose a small HTTP health endpoint that runs the checks and returns a status code plus a "status" string; CronAlert checks that endpoint from outside your infrastructure.

What Redis metrics should trigger an alert?

Memory usage approaching maxmemory, a rising eviction rate, a dropped replica, rejected connections, and climbing command latency. On ElastiCache, also alarm on CloudWatch's DatabaseMemoryUsagePercentage, Evictions, CurrConnections, and ReplicationLag.

How is monitoring ElastiCache different from self-hosted Redis?

The cache is the same, but ElastiCache is managed: no shell access, and AWS publishes deep metrics via CloudWatch. Keep your HTTP health endpoint for fast external detection and use CloudWatch alarms for node-level depth. CronAlert watches the endpoint so you get a signal independent of the host's own health.

Why not just monitor my website instead of the cache?

Because a degraded cache makes the site slow and inconsistent rather than fully down — sessions vanish, rate limiters misfire, the database takes the miss load — all while the homepage keeps returning 200. You need a check aimed at the cache's actual health.

Monitor your cache with CronAlert

Expose a Redis health endpoint that checks PING, memory pressure, and replication, then let CronAlert watch it from outside your infrastructure. Create a free account (25 monitors, no credit card), point a monitor at your endpoint, add keyword monitoring on Pro to catch deceptive 200s, and route the alert to your on-call channel. The next time your cache starts evicting keys or loses a replica, you'll hear it from CronAlert — not from a wave of logged-out users and a database melting under cache misses.