How to Monitor Message Queues (RabbitMQ, SQS, Kafka) for Backlog and Availability

Q: What should you monitor on a message queue?

Four signals: broker availability (can clients connect and publish), queue depth (number of messages waiting, against a per-queue threshold), oldest message age (how stale the head of the queue is — depth alone misleads on high-volume queues), and consumer presence (a queue with zero consumers will never drain no matter how healthy the broker is). Depth and age catch backlog; availability and consumer count catch the causes.

Q: How do you monitor queue depth externally?

Expose a small authenticated health endpoint in your app or a sidecar service that queries the broker — RabbitMQ's management API, SQS's GetQueueAttributes, Kafka's consumer group offsets — compares the numbers to per-queue thresholds, and returns 200 when healthy and 503 when any queue exceeds its limits. Then point an external monitor like CronAlert at that endpoint so you find out even when the problem is the network, DNS, or the box the broker shares with your metrics stack.

Q: What is a healthy queue depth threshold?

There is no universal number: a queue processing 10,000 messages a minute can healthily hold thousands, while 50 messages in a low-volume webhook queue can mean processing stopped an hour ago. Set thresholds per queue from its normal drain rate: alert when depth exceeds what consumers can clear within your tolerance window (for example, 10 minutes of backlog). Pair depth with oldest-message age, which is volume-independent and harder to fool.

Q: How do you monitor Kafka consumer lag?

Consumer lag is the gap between the latest offset in a partition and the consumer group's committed offset. Query it with the AdminClient API (or the kafka-consumer-groups CLI), sum lag across partitions per group, and expose it through a health endpoint with per-group thresholds. Growing lag with active consumers means they can't keep up; lag with zero group members means the consumers are gone — both should alert.

Q: Is monitoring the broker enough, or do I need to monitor consumers too?

Both, and they fail independently. A perfectly healthy broker happily accumulates messages forever if consumers crash — broker monitoring won't catch that, which is why depth, age, and consumer-count checks exist. And consumers can look alive while the broker rejects publishes. Monitor the broker through a health endpoint and give the consumers themselves heartbeat monitors that alert when processing stops.

Message queues fail politely. A web server that dies throws connection errors at users until someone notices; a queue just... accumulates. The broker stays reachable, producers keep publishing successfully (200s all around), and somewhere downstream a consumer has crashed, a queue is fifty thousand messages deep, and every one of them is an unsent receipt, an unprocessed payment event, or an export a customer is refreshing the page for. Nothing in the request path ever errors. The queue is the buffer that's supposed to absorb failure — which is exactly what makes its own failure invisible.

This guide covers how to monitor the queue itself — RabbitMQ, Amazon SQS, and Kafka — for the two things that matter: availability (can clients connect and publish?) and backlog (is anything actually draining?). It pairs with the background worker monitoring guide, which covers the consumer side; this post is about the broker and the queues in it.

The four signals that matter

Broker availability. Can a client connect, authenticate, and publish? Process-up is not the same as accepting-work: RabbitMQ can be running while it blocks publishing connections because of a memory or disk alarm; Kafka can be up while a partition's in-sync replicas have fallen below min.insync.replicas, so writes with acks=all are rejected.
Queue depth. How many messages are waiting? Depth is the headline backlog number, but it needs per-queue thresholds — 5,000 messages is Tuesday for a high-volume events queue and a five-alarm fire for your password-reset email queue.
Oldest message age. How long has the head of the queue been waiting? Age is the better signal on busy queues because it's volume-independent: depth 50,000 with age 30 seconds is a healthy firehose; depth 200 with age 2 hours means processing stopped 2 hours ago.
Consumer presence. A queue with zero consumers will never drain, no matter how healthy everything else is. RabbitMQ exposes consumer counts per queue; Kafka exposes group membership. Zero consumers on a queue that should always have them is an instant alert.

The first signal catches a broken broker. The other three catch the far more common case: a healthy broker faithfully storing an ever-growing pile of work nobody is doing.

The pattern: a queue health endpoint

Brokers don't speak browser-friendly HTTP with your thresholds baked in, so the play is the same one used for Redis and database health: a small authenticated endpoint in your app (or a sidecar) that queries the broker, applies your per-queue thresholds, and collapses the answer into a clean 200/503 that an external monitor can act on. Keep it side-effect free, protect it with a token, and never echo raw broker errors or connection strings in the response.

app.get('/health/queues', async (req, res) => {
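  const expected = process.env.HEALTH_TOKEN;
  // Fail closed: if HEALTH_TOKEN is unset, a request with no header must not
  // pass (undefined !== undefined is false, which would skip this branch).
  if (!expected || req.headers['x-health-token'] !== expected) {
    return res.status(401).json({ status: 'unauthorized' });
  }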

  const THRESHOLDS = {
    'emails':         { maxDepth: 500,    maxAgeSec: 300,  minConsumers: 1 },
    'webhooks-out':   { maxDepth: 1000,   maxAgeSec: 600,  minConsumers: 1 },
    'analytics':      { maxDepth: 100000, maxAgeSec: 3600, minConsumers: 1 },
  };

  const problems = [];
  for (const [queue, limits] of Object.entries(THRESHOLDS)) {
    try {
      const stats = await getQueueStats(queue); // broker-specific, below
      if (stats.depth > limits.maxDepth) problems.push(`${queue}:depth=${stats.depth}`);
      if (stats.oldestAgeSec > limits.maxAgeSec) problems.push(`${queue}:age=${stats.oldestAgeSec}s`);
      if (stats.consumers < limits.minConsumers) problems.push(`${queue}:no_consumers`);
    } catch {
      problems.push(`${queue}:broker_unreachable`);
    }
  }

  const healthy = problems.length === 0;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    problems,
  });
});

The getQueueStats implementation is the only broker-specific part:

RabbitMQ

The management plugin exposes everything over HTTP: GET /api/queues/%2f/emails (%2f is the URL-encoded default vhost, /) returns messages (depth, counting both ready and unacknowledged messages), consumers, and message rates. For broker availability there's also the aliveness check, GET /api/aliveness-test/%2f, which declares a test queue, publishes, and consumes: a true end-to-end "can work flow through this broker" probe. It is deprecated in recent RabbitMQ releases in favor of the /api/health/checks/* endpoints (for example /api/health/checks/alarms), so check which ones your version supports. Oldest-message age isn't directly exposed; approximate it by publishing a timestamped canary message your consumer echoes back, or track it in the consumer. Also watch for memory and disk alarms in /api/nodes: an alarmed node blocks publishers, which looks like a hang, not an error.
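One possible shape for a RabbitMQ getQueueStats, as a sketch rather than a definitive implementation. It assumes Node 18+ (global fetch and AbortSignal.timeout) and a management API reachable at a hypothetical RABBIT_API_URL, with credentials in hypothetical RABBIT_USER and RABBIT_PASS variables. The messages and consumers fields come from the management API's queue object, and mem_alarm and disk_free_alarm from /api/nodes; the helper names are invented for illustration.

```javascript
// Map a management-API queue object to the shape the health endpoint expects.
function rabbitStatsFromApi(queueInfo) {
  return {
    depth: queueInfo.messages ?? 0,      // ready + unacknowledged
    consumers: queueInfo.consumers ?? 0,
    oldestAgeSec: undefined,             // not exposed here; undefined skips the age check
  };
}

// Names of nodes reporting a memory or disk alarm (input: body of GET /api/nodes).
function alarmedNodes(nodes) {
  return nodes
    .filter((n) => n.mem_alarm || n.disk_free_alarm)
    .map((n) => n.name);
}

// Hypothetical wiring: RABBIT_API_URL, RABBIT_USER, RABBIT_PASS are assumptions.
async function getQueueStats(queue, vhost = '/') {
  const base = process.env.RABBIT_API_URL; // e.g. http://rabbit.internal:15672
  const auth = Buffer.from(
    `${process.env.RABBIT_USER}:${process.env.RABBIT_PASS}`
  ).toString('base64');
  const url =
    `${base}/api/queues/${encodeURIComponent(vhost)}/${encodeURIComponent(queue)}`;
  const res = await fetch(url, {
    headers: { Authorization: `Basic ${auth}` },
    // A blocked or overloaded broker should fail the check, not hang it.
    signal: AbortSignal.timeout(5000),
  });
  if (!res.ok) throw new Error(`management API returned ${res.status}`);
  return rabbitStatsFromApi(await res.json());
}
```

Keeping the mapping separate from the HTTP call makes the threshold logic testable without a live broker. Any thrown error lands in the health endpoint's catch branch and is reported as broker_unreachable.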

Amazon SQS

GetQueueAttributes gives you ApproximateNumberOfMessages (depth) and ApproximateNumberOfMessagesNotVisible (in flight). Oldest-message age is not a queue attribute; it comes from CloudWatch as the ApproximateAgeOfOldestMessage metric in the AWS/SQS namespace, so SQS still hands you the best signal directly, just through a second API. There's no consumer count; infer consumer health from age, or give the consumers their own heartbeats (below). One SQS-specific essential: monitor the dead-letter queue. Depth > 0 on a DLQ means messages are failing repeatedly and have been shunted aside. That's a processing bug, and nothing alerts on it by default.
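A minimal sketch of the SQS side, under a few assumptions: fetchAttributes is a hypothetical async function that returns the Attributes map from GetQueueAttributes (for instance a thin wrapper around the AWS SDK), and the oldest-message age is obtained separately by the caller and passed in. SQS returns attribute values as strings, so the sketch parses them.

```javascript
// Convert the string-valued Attributes map from GetQueueAttributes to numbers.
function sqsStatsFromAttributes(attributes, oldestAgeSec) {
  return {
    depth: Number.parseInt(attributes.ApproximateNumberOfMessages ?? '0', 10),
    inFlight: Number.parseInt(
      attributes.ApproximateNumberOfMessagesNotVisible ?? '0', 10
    ),
    oldestAgeSec,          // supplied by the caller; undefined skips the age check
    consumers: undefined,  // SQS has no consumer count; undefined skips that check
  };
}

// A dead-letter queue is unhealthy as soon as it holds anything at all.
function dlqProblems(dlqName, attributes) {
  const { depth } = sqsStatsFromAttributes(attributes);
  return depth > 0 ? [`${dlqName}:dlq_depth=${depth}`] : [];
}

// Hypothetical wiring: fetchAttributes(queueUrl) resolves to the Attributes map.
async function getQueueStats(queueUrl, fetchAttributes, oldestAgeSec) {
  return sqsStatsFromAttributes(await fetchAttributes(queueUrl), oldestAgeSec);
}
```

In the health endpoint, the strings returned by dlqProblems could be appended to the same problems array as the depth and age breaches.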

Kafka

Kafka's "depth" is consumer lag: the gap between each partition's latest offset and the consumer group's committed offset. Query both with the AdminClient API, sum per group, and threshold per group in the health endpoint. Two distinct failure shapes: lag growing while the group has active members means consumers can't keep up (scale them); lag with an empty group means the consumers are gone entirely. Check broker availability separately via a metadata request, and watch under-replicated partition counts if you produce with acks=all: once a partition's in-sync replicas drop below min.insync.replicas, under-replication turns into rejected publishes.
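The lag arithmetic can be sketched independently of any client library. The input shapes below (arrays of { partition, offset }) and the function names are assumptions for illustration; adapt them to whatever your admin client returns. Treating a partition with no committed offset as fully lagging is a deliberately conservative choice, and it overestimates if retention has already trimmed the log.

```javascript
// Sum lag across partitions for one consumer group on one topic.
// endOffsets:       [{ partition, offset }] latest offset per partition
// committedOffsets: [{ partition, offset }] the group's committed offset per partition
// Offsets may be strings or numbers; a negative committed offset means the
// group has never committed on that partition.
function totalLag(endOffsets, committedOffsets) {
  const committed = new Map(
    committedOffsets.map((p) => [p.partition, BigInt(p.offset)])
  );
  let lag = 0n;
  for (const { partition, offset } of endOffsets) {
    const end = BigInt(offset);
    const pos = committed.get(partition) ?? -1n;
    if (pos < 0n) {
      lag += end;        // no commit yet: count the whole partition
    } else if (end > pos) {
      lag += end - pos;
    }
  }
  return Number(lag);    // plain number so it serializes cleanly to JSON
}

// Distinguish the two failure shapes: empty group vs. consumers falling behind.
function lagProblem(group, lag, memberCount, maxLag) {
  if (memberCount === 0 && lag > 0) return `${group}:no_members,lag=${lag}`;
  if (lag > maxLag) return `${group}:lag=${lag}`;
  return null;
}
```

Note that an empty group alerts on any lag at all, regardless of maxLag, since nothing is left to drain it.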

Don't forget the consumers themselves

Depth and age tell you work isn't draining; they don't tell you why, and they take a threshold-crossing's worth of time to fire. The fastest signal that processing stopped comes from the workers: give each consumer a heartbeat monitor that it pings after each batch, so a crashed or wedged consumer alerts within minutes — usually before the backlog threshold trips. The background worker guide covers this side in depth, including Sidekiq, Celery, and BullMQ specifics; for long-running consumers doing batch work, see batch job monitoring. Queue-side and consumer-side checks are complementary: the heartbeat names the broken worker, the depth check catches the cases heartbeats can't (a worker that's alive but erroring every message straight to the DLQ).
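A consumer-side heartbeat can be as small as the sketch below. The ping function is injected, so nothing here depends on a particular monitoring API; in practice it would issue an HTTP request to whatever heartbeat URL your monitor provides (HEARTBEAT_URL in the usage comment is a hypothetical variable). The throttle is an assumption added so that a busy consumer does not ping on every single batch.

```javascript
// Returns a beat() function that calls ping() at most once per minIntervalMs.
// beat() returns true when a ping was attempted and false when throttled.
function makeHeartbeat(ping, minIntervalMs = 60_000, now = Date.now) {
  let lastPing = -Infinity;
  return function beat() {
    const t = now();
    if (t - lastPing < minIntervalMs) return false; // pinged recently, skip
    lastPing = t;
    try {
      // A failed ping must never break message processing, so swallow
      // both synchronous throws and rejected promises.
      const result = ping();
      if (result && typeof result.catch === 'function') result.catch(() => {});
    } catch {
      // ignored on purpose
    }
    return true;
  };
}

// Usage inside a consumer loop (HEARTBEAT_URL is hypothetical):
//   const beat = makeHeartbeat(() => fetch(process.env.HEARTBEAT_URL));
//   for await (const batch of batches) {
//     await handle(batch);
//     beat(); // only reached after the batch succeeded
//   }
```

Calling beat() only after a batch succeeds is the point of the design: a consumer that is alive but stuck, or failing every batch, stops pinging and the monitor goes stale.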

Setting it up in CronAlert

Create a monitor on /health/queues with the x-health-token custom header. Expect 200; the 503 from any threshold breach fires the alert, and the problems array in the body names the queue.
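Add keyword monitoring (Pro) requiring "status":"healthy" in the body, so a proxy or error page returning 200 can't mask a failure. Match the full key-value pair rather than the bare word healthy, which is also a substring of unhealthy. See keyword monitoring.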
Monitor the broker's own HTTP surface where it has one (for RabbitMQ, the aliveness-test endpoint or, on newer releases where it is deprecated, the /api/health/checks/* endpoints, behind appropriate auth) as a second, app-independent opinion on broker availability.
Add a heartbeat monitor per consumer group, pinged after each successful batch.
Split critical and bulk queues into separate monitors if their urgency differs: the payments queue pages PagerDuty, the analytics queue posts to Slack. Routing by severity is the core of incident response workflows.
Watch response time on the health endpoint. A management API that takes 8 seconds to answer is often the first symptom of a broker under memory pressure.

Common pitfalls

Monitoring only the broker process. "RabbitMQ is running" and "messages are flowing" are different facts. Most queue incidents happen with a perfectly green broker.
One global depth threshold. Thresholds must be per-queue, derived from each queue's normal drain rate — otherwise the busy queue cries wolf and the quiet one never alerts.
Depth without age. High-volume queues legitimately run deep. Age catches a stalled queue regardless of volume; use both.
Ignoring the dead-letter queue. A growing DLQ is a processing bug that no availability check will ever flag. Monitor DLQ depth with a threshold of approximately zero.
Health checks with side effects. Don't publish to production queues on every check (canary messages excepted, and tagged as such). Read stats; don't generate load.
Internal-only monitoring. If the broker's network is down, the metrics stack on the same network is usually down too. An external check is the opinion that survives — the same argument as for every other internal service you monitor.
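To make the per-queue threshold advice concrete, here is one way to derive limits from a measured drain rate. This is a sketch under stated assumptions: deriveThresholds is a hypothetical helper, the drain rate is something you measure from your own consumers, and using the tolerance window as the age limit is a simplification you may want to tune per queue.

```javascript
// drainPerMin:  messages the consumers normally clear per minute (measured)
// toleranceMin: minutes of backlog you are willing to carry before alerting
function deriveThresholds(drainPerMin, toleranceMin = 10) {
  return {
    // Depth that would take longer than the tolerance window to clear.
    maxDepth: Math.ceil(drainPerMin * toleranceMin),
    // A message older than the window means the queue is already behind.
    maxAgeSec: toleranceMin * 60,
    minConsumers: 1,
  };
}

// Example: a firehose and a quiet queue end up with very different limits.
const EXAMPLE_THRESHOLDS = {
  analytics: deriveThresholds(10000), // maxDepth 100000
  emails: deriveThresholds(5),        // maxDepth 50
};
```

The same 10-minute tolerance yields a depth limit of 100,000 for one queue and 50 for the other, which is why a single global number cannot work.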

Frequently asked questions

What should you monitor on a message queue?

Broker availability, queue depth, oldest-message age, and consumer presence. Availability catches a broken broker; the other three catch the more common failure — a healthy broker accumulating work nobody is processing. Add dead-letter queue depth as a fifth signal where DLQs exist.

How do you monitor queue depth externally?

Expose an authenticated health endpoint that queries the broker (RabbitMQ management API, SQS GetQueueAttributes, Kafka consumer offsets), applies per-queue thresholds, and returns 200/503. Point CronAlert at it so backlog alerts fire even when the problem is the broker's network or host.

What is a healthy queue depth threshold?

Whatever your consumers can drain within your tolerance window — there's no universal number. Derive each queue's threshold from its normal drain rate (e.g., alert at 10 minutes' worth of backlog), and pair it with oldest-message age, which works regardless of volume.

How do you monitor Kafka consumer lag?

Compare each partition's latest offset to the consumer group's committed offset via the AdminClient API, sum per group, and expose the result through a health endpoint with per-group thresholds. Growing lag with active members means scale the consumers; any lag with an empty group means the consumers are gone.

Is monitoring the broker enough, or do I need to monitor consumers too?

You need both — they fail independently. The broker check catches publish-side failures; depth and age catch drain-side failures; consumer heartbeats name the broken worker fastest. Queue-side tells you that work is piling up, consumer-side tells you why.

Monitor your message queues with CronAlert

Queues are where failures go to hide: the request path stays green while the real work backs up silently. Create a free account (25 monitors, no credit card), expose a queue health endpoint with per-queue depth and age thresholds, point a monitor at it, give each consumer group a heartbeat, and put the DLQ on watch. The next time a consumer dies on a Friday night, you'll get paged at message 51 — not find fifty thousand on Monday.