Background workers are the part of a system that breaks at 2am on a Saturday, gets noticed at 9am on Monday when the support inbox is full, and turns out to have been broken since the Friday deploy. The reason is straightforward: nobody is watching them.
A web server that goes down throws obvious signals. Requests time out. Load balancers mark instances unhealthy. Error rates spike. Dashboards turn red. Someone notices within minutes. A background worker that goes down is a process that has stopped reading from a queue — nothing in the application emits an error, because the application doesn't know the worker died. The queue just grows. Jobs sit longer. Customers eventually notice that their password resets, welcome emails, or scheduled exports never arrived, and by then you have a backlog of thousands of jobs and an angry support queue.
This is a fixable problem. The pattern is similar to heartbeat monitoring for cron jobs — you instrument the worker so it tells you it's alive, and you supplement that with queue-depth probes that tell you it's keeping up. This post walks through the patterns for Sidekiq, Celery, BullMQ, RabbitMQ consumers, SQS workers, and any other background processing system.
The two signals that matter
External monitoring of a background worker reduces to two questions:
- Is the worker alive? The process is running, connected to the queue, and reading jobs. A crashed pod, an OOM kill, a deploy that didn't restart the worker — all of these fail this check.
- Is the worker keeping up? The queue depth is bounded, jobs are processed within SLA, and there's no backlog growing without limit. A slow downstream API, a poison-pill job blocking a single worker, or an autoscaling miss all fail this check while the previous one still passes.
Both questions need answers in production. Monitoring only one leaves a class of failures invisible. A worker that's alive but can't drain its queue is just as broken as a worker that's dead.
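As a rough illustration of how the two questions combine, the sketch below classifies a worker pool from two measurements. The names `WorkerSignals` and `classify` and the default windows are assumptions made for this example, not part of any library or of CronAlert.

```python
from dataclasses import dataclass


@dataclass
class WorkerSignals:
    seconds_since_heartbeat: float  # time since the last ping reached the monitor
    oldest_job_age_seconds: float   # age of the oldest job still waiting in the queue


def classify(signals, heartbeat_window=120.0, sla_seconds=300.0):
    """Map the two measurements onto the two questions (thresholds are illustrative)."""
    if signals.seconds_since_heartbeat > heartbeat_window:
        return "dead"            # question 1 fails: no recent proof of life
    if signals.oldest_job_age_seconds > sla_seconds:
        return "falling_behind"  # question 2 fails while question 1 still passes
    return "healthy"


print(classify(WorkerSignals(seconds_since_heartbeat=30.0, oldest_job_age_seconds=900.0)))
```

The ordering matters: a dead worker usually also has an aging queue, so the liveness check runs first to report the more specific cause.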
Pattern 1: Worker heartbeats
The simplest, most reliable pattern. Every worker pings an external monitor on a regular interval — every minute, every five minutes, whatever matches your alerting tolerance. If the monitor doesn't see a ping within the expected window, it alerts.
For Sidekiq, the cleanest place to put the ping is in a recurring job that runs every minute via sidekiq-cron or sidekiq-scheduler:
require 'net/http'

class WorkerHeartbeat
  include Sidekiq::Worker

  def perform
    # Outbound ping; the monitor alerts when these stop arriving
    Net::HTTP.get(URI("https://api.cronalert.com/ping/<your-monitor-id>"))
  end
end

# config/sidekiq.yml or sidekiq-scheduler.yml
:schedule:
  worker_heartbeat:
    cron: '* * * * *'
    class: WorkerHeartbeat
For Celery, the same pattern with Celery Beat:
from celery import Celery
from celery.schedules import crontab
import requests

app = Celery('myapp')

@app.task
def worker_heartbeat():
    # A timeout keeps a hung request from tying up the worker slot
    requests.get('https://api.cronalert.com/ping/<your-monitor-id>', timeout=10)

app.conf.beat_schedule = {
    'worker-heartbeat': {
        'task': 'myapp.tasks.worker_heartbeat',
        'schedule': crontab(minute='*'),
    },
}
For BullMQ:
import { Queue, Worker } from 'bullmq';

// `redis` is your existing Redis connection (or connection options), defined elsewhere
const heartbeatQueue = new Queue('heartbeat', { connection: redis });

await heartbeatQueue.add('ping', {}, {
  repeat: { pattern: '* * * * *' },
});

new Worker('heartbeat', async () => {
  await fetch('https://api.cronalert.com/ping/<monitor-id>');
}, { connection: redis });
The principle is the same regardless of stack: enqueue a heartbeat job on a schedule, have a worker process it, and have the worker make an outbound HTTP call to an external monitor. If any step fails (the scheduler crashed, the worker crashed, the network is down), the monitor doesn't get pinged and the alert fires.
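On the monitor side, the decision comes down to comparing the time of the last ping against the expected window. The sketch below shows that comparison in isolation. The function name `heartbeat_overdue` and the 30-second grace period are assumptions for illustration, and this is not a description of how CronAlert implements it.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping, now,
                      interval=timedelta(minutes=1),
                      grace=timedelta(seconds=30)):
    """Return True when no ping has arrived within interval + grace."""
    return now - last_ping > interval + grace


last_ping = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
# Two minutes of silence against a one-minute schedule: overdue
print(heartbeat_overdue(last_ping, last_ping + timedelta(minutes=2)))
```

The grace period absorbs scheduler jitter and slow network round trips, so a ping that arrives a few seconds late does not page anyone.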
Why this is more robust than checking the process
You could check whether the process is running via a Kubernetes liveness probe or a systemd unit. That tells you the process exists. It does not tell you the process is doing useful work. A worker that's running but stuck on a deadlock, blocked on a network call, or unable to connect to Redis is technically alive but not actually processing jobs.
The heartbeat pattern requires the worker to do work — pull a job off the queue, execute it, hit an external URL. If the worker can't do any of those things, the heartbeat fails. It's a much higher-fidelity signal than kill -0 $PID.
Pattern 2: Queue-depth probes
Heartbeats catch dead workers. They don't catch slow workers. A worker that's processing one job per minute when the producer is generating ten per minute will pass every heartbeat (the worker is alive, processing jobs, pinging the monitor), but the queue grows by nine jobs every minute, without bound. At higher volumes, that is how you end up with 50,000 pending jobs and a multi-hour backlog before anyone notices.
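The arithmetic behind that growth is simple enough to sketch. The helper below assumes constant arrival and processing rates, which real traffic will not have, and the name `backlog_after` is made up for this example.

```python
def backlog_after(minutes, produced_per_min, processed_per_min, start_depth=0):
    """Queue depth after `minutes`, assuming constant rates (a simplification)."""
    net_growth = produced_per_min - processed_per_min
    # Depth cannot go below zero when the consumer is faster than the producer
    return max(0, start_depth + net_growth * minutes)


# Ten jobs in, one job out, over one hour and over one day
print(backlog_after(60, 10, 1))
print(backlog_after(24 * 60, 10, 1))
```

With the rates from the paragraph above, the queue gains 540 jobs per hour while every heartbeat still passes.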
The fix is a queue-depth probe. Expose a small HTTP endpoint inside your application that reports current queue depth as JSON or plain text, then have an external uptime monitor hit that endpoint on a schedule and alert if the depth exceeds a threshold.
Sidekiq queue-depth endpoint
require 'sidekiq/api'

class HealthController < ApplicationController
  def queue_depth
    stats = Sidekiq::Stats.new
    render json: {
      enqueued: stats.enqueued,
      retry_size: stats.retry_size,
      dead_size: stats.dead_size,
      processed: stats.processed,
      failed: stats.failed,
    }
  end
end

# config/routes.rb
get '/healthz/queue-depth', to: 'health#queue_depth'
Then create a CronAlert keyword monitor pointed at /healthz/queue-depth with an authentication header, and use a regex-style assertion to verify the enqueued count is below the threshold. The keyword monitoring guide covers the assertion patterns.
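To make the threshold comparison concrete, here is a minimal sketch of the check an external monitor effectively performs on the response body. It illustrates the logic only; it is not CronAlert's assertion syntax, and `depth_exceeds` is a hypothetical helper.

```python
import json


def depth_exceeds(body, threshold, field="enqueued"):
    """Parse the endpoint's JSON body and compare one counter to a threshold."""
    payload = json.loads(body)
    return int(payload[field]) > threshold


# Sample body in the shape returned by the Sidekiq endpoint above
sample = '{"enqueued": 1200, "retry_size": 3, "dead_size": 0}'
print(depth_exceeds(sample, threshold=1000))
```

The same function can watch `retry_size` or `dead_size` by changing `field`, which is one reason to return several counters from a single endpoint.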
Celery queue-depth endpoint (Redis-backed)
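from flask import Flask, jsonify
from celery import Celery
import redis

app = Flask(__name__)
celery = Celery('myapp', broker='redis://...')
# Client for the same Redis instance the broker uses (URL elided, as above)
broker = redis.Redis.from_url('redis://...')

@app.route('/healthz/queue-depth')
def queue_depth():
    # inspect() reports tasks that workers already hold, not messages still waiting in the broker
    inspect = celery.control.inspect()
    active = inspect.active() or {}
    reserved = inspect.reserved() or {}
    return jsonify({
        # With the Redis broker each queue is a Redis list; 'celery' is the default queue name
        'pending': broker.llen('celery'),
        'active': sum(len(v) for v in active.values()),
        'reserved': sum(len(v) for v in reserved.values()),
    })
RabbitMQ queue-depth endpoint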
RabbitMQ exposes queue stats through its management plugin. Don't expose the management API to the public internet directly; instead, have your application proxy it through an authenticated endpoint:
const express = require('express');
const fetch = require('node-fetch'); // node-fetch v2; on Node 18+ the built-in fetch works without this line
const app = express();

app.get('/healthz/queue-depth', async (req, res) => {
  // Reject callers that don't present the shared health-check token
  if (req.headers.authorization !== `Bearer ${process.env.HEALTH_TOKEN}`) {
    return res.sendStatus(401);
  }
  // %2F is the URL-encoded default vhost "/"
  const r = await fetch('http://rabbitmq:15672/api/queues/%2F/myqueue', {
    headers: { Authorization: 'Basic ' + Buffer.from('user:pass').toString('base64') },
  });
  const queue = await r.json();
  res.json({ depth: queue.messages, ready: queue.messages_ready });
});
SQS queue-depth endpoint
import os

import boto3
from flask import Flask, jsonify

app = Flask(__name__)
sqs = boto3.client('sqs')

@app.route('/healthz/queue-depth')
def queue_depth():
    attrs = sqs.get_queue_attributes(
        QueueUrl=os.environ['QUEUE_URL'],
        AttributeNames=['ApproximateNumberOfMessages', 'ApproximateNumberOfMessagesNotVisible'],
    )['Attributes']
    return jsonify({
        'available': int(attrs['ApproximateNumberOfMessages']),
        'in_flight': int(attrs['ApproximateNumberOfMessagesNotVisible']),
    })
The pattern is identical across stacks: expose a small JSON endpoint that surfaces queue depth, hit it from an external monitor, alert when depth exceeds a threshold. See the database health endpoint guide for the broader pattern.
Pattern 3: Queue lag (job age) monitoring
Raw queue depth is a useful signal, but it's the wrong unit for SLA-sensitive work. A queue with 5000 pending jobs that processes 1000 per minute clears in five minutes, which is fine. The same depth with a 10-jobs-per-minute consumer takes 500 minutes to clear, a backlog of more than eight hours, which is catastrophic. The same number, two completely different operational realities.
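The comparison above is a single division, sketched here so the two cases can be checked side by side. It assumes no new jobs arrive while the queue drains, and `drain_minutes` is a name invented for this example.

```python
def drain_minutes(depth, jobs_per_minute):
    """Minutes to clear `depth` jobs at a steady rate, ignoring new arrivals."""
    if jobs_per_minute <= 0:
        return float("inf")  # a stalled consumer never drains the queue
    return depth / jobs_per_minute


print(drain_minutes(5000, 1000))  # the fast consumer
print(drain_minutes(5000, 10))    # the slow consumer
```

The same depth yields 5 minutes in one case and 500 minutes in the other, which is why a depth threshold alone cannot express an SLA.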
A better signal is "how old is the oldest job in the queue." If your SLA for welcome emails is "deliver within five minutes," then the relevant alert is "the oldest pending welcome-email job is more than five minutes old." That maps directly to the SLA and doesn't depend on guessing depth thresholds.
# Sidekiq oldest-job age (inside a controller action)
require 'sidekiq/api'
queue = Sidekiq::Queue.new('default')
# Queue#latency is the age in seconds of the oldest job, or 0 when the queue is empty.
# queue.first is not the oldest job: Sidekiq pushes new jobs onto the head of the list.
age_seconds = queue.latency
render plain: age_seconds.to_i
Then your CronAlert monitor checks the response body and alerts if the number exceeds your SLA in seconds. This is also the right metric to drive autoscaling decisions: "scale up when oldest job exceeds N seconds" is more meaningful than "scale up when depth exceeds N."
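As a sketch of the autoscaling idea, the snippet below reads the plain-text age returned by an endpoint like the one above and decides whether to add a worker. The function names, the one-worker step, and the cap are all assumptions for illustration; a real autoscaler would also need cooldowns and a scale-down rule.

```python
def parse_age(body):
    """The endpoint returns the age as plain text, e.g. '412'."""
    return int(body.strip())


def desired_workers(current, oldest_job_age, scale_up_after, max_workers):
    """Add one worker when the oldest job is older than the limit, up to a cap."""
    if oldest_job_age > scale_up_after:
        return min(current + 1, max_workers)
    return current


age = parse_age("412\n")
print(desired_workers(current=4, oldest_job_age=age, scale_up_after=300, max_workers=10))
```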
Putting all three together
A complete background-worker monitoring setup looks like this:
- Worker heartbeat monitor — fires when no ping arrives within the expected interval. Catches dead workers, failed deploys, OOM kills.
- Queue-depth monitor — fires when depth exceeds 3-5x normal peak. Catches "worker is alive but can't keep up." Use the keyword-match assertion against the depth endpoint.
- Job-age monitor — fires when oldest job exceeds your SLA. Catches SLA breaches before customers complain.
- Dead-letter queue monitor — fires when the DLQ depth exceeds zero (or a small number). Catches poison-pill jobs and persistent failures.
On CronAlert, that's three or four monitors per worker pool. With multi-region quorum, you avoid pages from a single-region transient network blip. With consecutive-check verification, you avoid pages from a momentary depth spike during a deploy.
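The two noise-reduction ideas can be sketched generically. This is the general technique, not a description of CronAlert's internals, and the function names, region labels, and defaults are assumptions.

```python
def quorum_failed(region_results, quorum):
    """region_results maps region name -> True if the check failed from that region."""
    failures = sum(1 for failed in region_results.values() if failed)
    return failures >= quorum


def should_alert(history, consecutive=2):
    """history lists check outcomes, newest last; True means the check failed."""
    if len(history) < consecutive:
        return False
    return all(history[-consecutive:])


# One region sees a failure, two do not: no quorum, so no page
print(quorum_failed({"us": True, "eu": False, "ap": False}, quorum=2))
# A single failed check after a success: wait for confirmation
print(should_alert([False, True]))
```

Both filters trade a little detection latency for fewer false pages, so the consecutive count should stay small for SLA-sensitive queues.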
Common pitfalls
Heartbeat job runs on a different worker pool than the one you're monitoring
If your heartbeat job is configured to run on a dedicated low-priority queue, and the workers for that queue are separate from the ones doing real work, you're monitoring the wrong workers. The heartbeat keeps pinging while the main worker pool is dead. Make sure the heartbeat job runs on the same worker pool you care about, or run a heartbeat per pool.
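One way to get a heartbeat per pool with Celery Beat is to generate one schedule entry per queue and route each entry to that queue through `options`. The sketch below only builds the dictionary. It assumes the `worker_heartbeat` task has been changed to accept a monitor ID argument, and the queue names and monitor IDs are hypothetical.

```python
def heartbeat_schedule(pools, interval_seconds=60.0):
    """Build beat_schedule entries; pools maps queue name -> monitor id."""
    return {
        f"heartbeat-{queue}": {
            "task": "myapp.tasks.worker_heartbeat",
            "schedule": interval_seconds,   # a number of seconds is a valid schedule
            "args": (monitor_id,),          # assumes the task takes a monitor id
            "options": {"queue": queue},    # route the job to the pool being monitored
        }
        for queue, monitor_id in pools.items()
    }


schedule = heartbeat_schedule({"default": "monitor-a", "mailers": "monitor-b"})
print(sorted(schedule))
```

Assigning the result to `app.conf.beat_schedule` gives each pool its own ping, so a dead `mailers` pool goes silent even while `default` keeps reporting.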
Queue depth endpoint is unauthenticated
A /healthz/queue-depth endpoint that tells anyone who asks "we have 50,000 unprocessed jobs" leaks potentially sensitive information. Put it behind a bearer token or signed query string, and configure the monitor to send the auth header. CronAlert supports custom request headers on every monitor.
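A bearer-token check is short in any framework. The sketch below is framework-agnostic and uses a constant-time comparison from the standard library; `authorized` is a hypothetical helper, and where the expected token comes from (an environment variable, a secrets manager) is left to the application.

```python
import hmac


def authorized(authorization_header, expected_token):
    """Return True only for 'Bearer <token>' headers carrying the expected token."""
    if not authorization_header or not expected_token:
        return False  # a missing header or an unset token never authorizes
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(token.encode(), expected_token.encode())


print(authorized("Bearer s3cret", "s3cret"))
```

Refusing to authorize when the expected token is empty guards against a misconfigured deployment silently accepting an empty credential.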
Thresholds set too tight
A queue depth threshold of "any pending jobs" alerts every deploy, every traffic spike, every legitimate burst. Set it to 3-5x normal peak depth, and use job-age monitoring for SLA-sensitive signals. The goal is to alert on "the worker is broken," not on "the worker has work to do."
Heartbeat ping happens before the job actually runs
Some implementations call the ping at the start of the heartbeat job, before any real work. That ping fires even if the rest of the job then hangs or times out. Move the ping to the end of the job, after the worker has actually executed something, so a stuck worker doesn't keep reporting healthy.
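The ordering can be sketched with the ping as the last statement of the job. Here `do_work` stands in for whatever proves the worker is functional (a Redis round trip, a trivial query); both it and the helper names are assumptions, and the URL placeholder is the one used earlier in this post.

```python
import urllib.request

PING_URL = "https://api.cronalert.com/ping/<your-monitor-id>"


def send_ping(url, timeout=10):
    """Make the outbound call and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_heartbeat(do_work, ping=send_ping, url=PING_URL):
    """Do the real work first; the ping is only reached if that work finishes."""
    do_work()          # if this raises or hangs, the line below never runs
    return ping(url)
```

Passing `ping` as a parameter keeps the job testable without network access, and it makes the ordering guarantee easy to assert in a unit test.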
No alert for the dead-letter queue
Most queue systems route failed jobs to a DLQ after some retry count. The DLQ is the place where unprocessable jobs go to be ignored. If nobody is watching it, persistent bugs accumulate silently and you find out about them when a customer escalates. A simple "DLQ depth > 0" monitor surfaces the class of failures that retry logic was supposed to handle but didn't.
Frequently asked questions
Why is monitoring background workers harder than monitoring web servers?
Web servers fail loudly through HTTP errors and load balancer health checks. Background workers fail quietly — a crashed worker just stops pulling jobs from the queue, and your application has no way to know until somebody notices the work isn't getting done. The signal has to come from outside the worker.
Should I use heartbeats or queue-depth monitoring?
Both. Heartbeats catch "the worker is dead." Queue depth catches "the worker is alive but can't keep up." They surface different failure modes and a healthy production system monitors both.
How do I monitor queue depth without exposing queue credentials?
Expose an internal HTTP endpoint that reads queue depth from inside your app, secure it with a bearer token, and have your external uptime monitor hit that endpoint. Queue credentials stay inside the app; only a single read-only depth value leaves it.
What's a reasonable queue-depth alerting threshold?
3-5x normal peak depth, not "any pending jobs." Pair with job-age monitoring for SLA-sensitive work, which maps more directly to customer impact than raw count.
Do I need to monitor every queue?
Monitor the queues with SLA implications. Worker process heartbeats are universal — every worker pool should have one. Queue-depth and lag alerts should be scoped to the queues whose lag affects customers.
Get started
Background-worker monitoring is one of those things that costs an hour to set up correctly and saves you a weekend's incident response several times a year. Create a free CronAlert account, add a heartbeat monitor for each worker pool, expose a queue-depth endpoint in your app, and add a keyword monitor against it. You'll catch the next silent worker outage before it becomes a customer-facing incident.
Related reading: heartbeat monitoring for cron jobs, long-running batch job monitoring, database health endpoint, HTTP health check endpoints, keyword monitoring, microservices uptime monitoring, and serverless function monitoring.