Replication lag is one of the most dangerous failure modes in a Postgres deployment precisely because nothing looks broken. The primary accepts writes, your website returns 200, your dashboards render. Meanwhile a read replica has fallen minutes behind, your read-heavy endpoints are serving stale data, and the standby you're counting on for failover has quietly become a liability — promote it and you lose every transaction it hadn't received yet.
A normal uptime check on your homepage is blind to all of this, the same way a homepage check misses a dead database. This guide covers how to measure replication lag inside Postgres, how to expose it through a health endpoint, and how to monitor that endpoint externally with CronAlert so a lagging replica pages you before your users notice the stale data.
What replication lag actually is
Postgres streaming replication works by shipping the Write-Ahead Log (WAL) from a primary server to one or more replicas, which replay it to stay in sync. Lag is the gap between "committed on the primary" and "replayed on the replica." It's measured two ways, and you want both:
- Byte lag (volume). How many bytes of WAL the replica is behind the primary's current write position. Tells you how much data is in flight. A replica falling hundreds of megabytes behind is a problem even if it's catching up quickly.
- Time lag (staleness). How many seconds old the most recently replayed transaction is. This is the number that maps directly to user impact: "data on this replica is N seconds out of date." Most alerting keys on time lag because it answers the question that matters.
Some lag is always present and normal. The failure is unbounded growth — lag that climbs and doesn't recover — or a replica that stops reporting entirely, which usually means replication has broken rather than merely slowed.
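One way to tell unbounded growth from ordinary jitter is to look at the trend across recent samples rather than at a single reading. The sketch below is a minimal illustration. The function name lag_is_growing, the five-sample window, and the one-second floor are assumptions chosen for the example, not values that come from Postgres or CronAlert.

```python
from typing import Sequence


def lag_is_growing(
    samples: Sequence[float],
    min_samples: int = 5,        # assumption: how many consecutive readings to inspect
    floor_seconds: float = 1.0,  # assumption: ignore growth below this level as noise
) -> bool:
    """Return True when the last `min_samples` readings rise strictly and
    the newest reading is at or above `floor_seconds`."""
    if len(samples) < min_samples:
        return False  # not enough history to call it a trend
    recent = list(samples[-min_samples:])
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and recent[-1] >= floor_seconds


if __name__ == "__main__":
    print(lag_is_growing([0.2, 0.3, 0.1, 0.4, 0.2]))   # jitter around a low baseline
    print(lag_is_growing([1.0, 2.0, 4.0, 8.0, 16.0]))  # lag that climbs and does not recover
```

A strictly-rising test is deliberately crude. A real deployment might prefer a slope over a longer window, but the idea is the same: alert on the shape of the curve, not only on its current value.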
Why it's worth alerting on
- Stale reads. If your application routes read traffic to replicas (a common scaling pattern), a lagging replica serves out-of-date data. Users update a setting and don't see it change; a just-placed order doesn't appear in their history. Read-after-write consistency silently breaks.
- Failover data loss. The whole point of a standby is to take over when the primary fails. Promote a replica that's 90 seconds behind and you've just discarded 90 seconds of committed transactions — orders, payments, signups — with no way to recover them.
- Broken replication. A replica that has stopped replaying WAL entirely isn't lagging, it's dead as a standby. You're now one primary failure away from an outage with no usable failover target, and you won't know unless you're watching.
None of these trip a normal uptime check, because the primary keeps serving traffic and returning 200 throughout. This is the same blind spot that makes database health endpoints and deep health checks for microservices necessary — you have to monitor the thing that's actually at risk, not the front door.
Measuring lag inside Postgres
Byte lag, queried on the primary
On the primary, pg_stat_replication has one row per connected replica. Compare the primary's current WAL position to each replica's replay_lsn:
SELECT
  client_addr,
  application_name,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
A row missing from this result for a replica you expect to be connected is itself a signal: that replica has disconnected from the primary.
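To make the byte-lag number less opaque, it helps to see what an LSN is. Postgres prints an LSN as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position in the WAL, and pg_wal_lsn_diff() subtracts one position from the other. The Python sketch below reproduces that arithmetic outside the database. The helper names parse_lsn and lsn_diff_bytes are hypothetical, introduced only for this illustration.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(current_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL between two positions (the same subtraction pg_wal_lsn_diff performs)."""
    return parse_lsn(current_lsn) - parse_lsn(replay_lsn)


if __name__ == "__main__":
    # 0x5000000 - 0x4000000 = 0x1000000 bytes, i.e. 16 MiB of WAL not yet replayed.
    print(lsn_diff_bytes("0/5000000", "0/4000000"))
```

In practice you would let the database do this with pg_wal_lsn_diff(). The sketch is only useful when you have two LSN strings in a log or a dashboard and want the gap in bytes.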
Time lag, queried on the replica
On a replica, compute how many seconds behind the last replayed transaction is:
SELECT
  CASE
    WHEN pg_is_in_recovery() AND pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
      THEN 0
    ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
  END AS replication_lag_seconds;
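The CASE guards against a false positive on a low-traffic system: if no new transactions are arriving, the age of pg_last_xact_replay_timestamp() keeps growing even though the replica is perfectly caught up. When the received and replayed LSNs are equal, the replica has replayed everything it has received, so report zero lag regardless of how long ago the last transaction was. The guard only compares against WAL the replica has received, so it cannot see a replica that has stopped receiving WAL altogether; the primary-side byte-lag query covers that case.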
Exposing lag through a health endpoint
External monitoring tools can't (and shouldn't) connect to your database directly. The pattern — the same one in the complete guide to HTTP health check endpoints — is to expose the metric through a small, authenticated HTTP endpoint that runs the query and translates the result into a status code and a status string.
Here's a Node.js / Express example that checks the replica's time lag and returns 503 when it exceeds a threshold:
const express = require('express');
const { Pool } = require('pg');

const app = express();
const replicaPool = new Pool({ connectionString: process.env.REPLICA_URL });

const WARN_SECONDS = 5;
const CRITICAL_SECONDS = 30;

app.get('/healthz/replication', async (req, res) => {
  // Reject the request if the token is unset or does not match.
  const token = process.env.HEALTH_TOKEN;
  if (!token || req.headers['x-health-token'] !== token) {
    return res.status(401).json({ status: 'unauthorized' });
  }
  try {
    const { rows } = await replicaPool.query(`
      SELECT CASE
        WHEN pg_is_in_recovery()
          AND pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
          THEN 0
        ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
      END AS lag_seconds;
    `);
    const raw = rows[0].lag_seconds;
    // NULL means no replay timestamp exists (for example, the node is not a replica).
    // Number(null) is 0, which would look healthy, so fail closed instead.
    if (raw === null) throw new Error('lag unavailable');
    const lag = Number(raw);
    const healthy = lag < CRITICAL_SECONDS;
    res.status(healthy ? 200 : 503).json({
      status: healthy ? 'healthy' : 'unhealthy',
      lag_seconds: Math.round(lag * 1000) / 1000,
      degraded: lag >= WARN_SECONDS,
    });
  } catch (err) {
    // Keep the body generic; never echo driver errors to the caller.
    res.status(503).json({ status: 'unhealthy', error: 'replication check failed' });
  }
});

app.listen(3000);
A Python / FastAPI version follows the same shape:
import os

import psycopg
from fastapi import FastAPI, Header, Response

app = FastAPI()
CRITICAL_SECONDS = 30


@app.get("/healthz/replication")
def replication_health(response: Response, x_health_token: str = Header(default="")):
    if x_health_token != os.environ["HEALTH_TOKEN"]:
        response.status_code = 401
        return {"status": "unauthorized"}
    try:
        with psycopg.connect(os.environ["REPLICA_URL"]) as conn:
            lag = conn.execute("""
                SELECT CASE
                    WHEN pg_is_in_recovery()
                        AND pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
                        THEN 0
                    ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
                END;
            """).fetchone()[0]
        # float() raises on NULL (None), which the except below reports as unhealthy.
        lag = float(lag)
        healthy = lag < CRITICAL_SECONDS
        response.status_code = 200 if healthy else 503
        return {"status": "healthy" if healthy else "unhealthy",
                "lag_seconds": round(lag, 3)}
    except Exception:
        # Keep the body generic; never echo driver errors to the caller.
        response.status_code = 503
        return {"status": "unhealthy", "error": "replication check failed"}
Both return a clear "status" string and an HTTP status code that an external monitor can key on. Two design points carry over from the database-health-endpoint guidance: require a token so the endpoint isn't a public information leak, and never expose connection strings or raw error text in the body — return a generic "replication check failed", not the driver's stack trace.
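The threshold logic in both handlers is easier to test when it is separated from the web framework and the database. The following sketch pulls it into a pure function. The name classify_lag is hypothetical, the warn and critical values mirror the constants used above, and treating a missing value as unhealthy is a design assumption: pg_last_xact_replay_timestamp() returns NULL when no transaction has been replayed or when the server is not in recovery, and it seems safer to fail closed than to report zero lag.

```python
from typing import Optional, Tuple

WARN_SECONDS = 5.0       # same warning level as the Express example
CRITICAL_SECONDS = 30.0  # same critical level as both examples


def classify_lag(
    lag_seconds: Optional[float],
    warn: float = WARN_SECONDS,
    critical: float = CRITICAL_SECONDS,
) -> Tuple[int, dict]:
    """Map a lag reading to (HTTP status code, JSON-serialisable body)."""
    if lag_seconds is None:
        # No reading available: fail closed with a generic message.
        return 503, {"status": "unhealthy", "error": "replication check failed"}
    lag = float(lag_seconds)
    healthy = lag < critical
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "lag_seconds": round(lag, 3),
        "degraded": lag >= warn,
    }
    return (200 if healthy else 503), body


if __name__ == "__main__":
    for reading in (0.2, 12.0, 45.0, None):
        print(reading, classify_lag(reading))
```

A handler can then run the query, pass the single value to classify_lag, and copy the returned status code and body into the response, which keeps the decision logic unit-testable without a replica.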
Monitoring the endpoint with CronAlert
With the endpoint live, point an external monitor at it. CronAlert checks it on a schedule from outside your infrastructure, so you get the alert even if the box hosting the endpoint is itself in trouble.
- Create a monitor pointing at https://api.yourapp.com/healthz/replication. Add the x-health-token value as a custom request header so the check authenticates.
- Set the expected status code to 200. When lag crosses your critical threshold the endpoint returns 503, and the status-code check fires. This alone catches the breach.
- Add keyword/content monitoring (Pro plan) to require the body contain "healthy". If the keyword match is a plain substring match, include the double quotes in the keyword, because the bare word healthy also appears inside unhealthy. This is the belt-and-suspenders layer that catches a misconfigured proxy rewriting the 503 to a 200 with an unhealthy body. See keyword monitoring.
- Set the interval to 1 minute on a paid plan. Replication lag can grow fast under a write spike; a 3-minute interval may let it climb dangerously between checks.
- Route the alert appropriately. A lagging replica is an on-call-worthy event — route it to PagerDuty, Opsgenie, or Slack depending on severity. See incident response workflows for the routing pattern.
The combination of the status-code check (hard breach) and keyword monitoring (soft breach) gives you the same two-layer detection that works for any deep health check: the code catches the obvious failure, the keyword catches the deceptive 200.
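The two layers can be pictured as a small decision function. This is only an illustration of the logic, not CronAlert's implementation; the function name evaluate_check and its return strings are made up for the example. It also shows why the keyword should include the surrounding double quotes when matching is a plain substring test.

```python
from typing import Optional


def evaluate_check(
    status_code: int,
    body: str,
    expected_status: int = 200,
    required_keyword: Optional[str] = '"healthy"',  # quotes included on purpose
) -> str:
    """Apply a status-code check first, then an optional keyword check."""
    if status_code != expected_status:
        return "down: unexpected status code"
    if required_keyword is not None and required_keyword not in body:
        return "down: keyword missing"
    return "up"


if __name__ == "__main__":
    deceptive = '{"status":"unhealthy","lag_seconds":95.0}'
    # A proxy rewrote the 503 to 200: the quoted keyword still catches it.
    print(evaluate_check(200, deceptive))
    # The bare word matches inside "unhealthy", so the check would wrongly pass.
    print(evaluate_check(200, deceptive, required_keyword="healthy"))
```

The second call is the failure mode to avoid: a keyword that is a substring of the unhealthy response gives no protection at all.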
Monitor each replica separately
A single aggregate health endpoint can hide a problem. If you have three replicas and one falls behind, an endpoint that reports "the worst replica's lag" is correct but doesn't tell you which one. Expose a per-replica endpoint (or include the replica identifier in the response body) and create a separate CronAlert monitor per replica. That way the alert names the specific standby that's lagging, and you can see at a glance in the dashboard whether it's one replica or all of them — a single lagging replica is a node problem, all of them lagging is usually a primary write-volume or network problem.
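If you do keep an aggregate view alongside the per-replica monitors, it can at least name the lagging nodes and distinguish "one replica" from "all of them". The sketch below assumes you already have a mapping of replica name to lag in seconds. The function name diagnose, the replica names, and the 30-second default are illustrative assumptions.

```python
from typing import Dict, Optional


def diagnose(lags: Dict[str, Optional[float]], critical: float = 30.0) -> dict:
    """Summarise which replicas are lagging.

    A value of None means the replica did not report at all and is
    treated as lagging.
    """
    lagging = sorted(
        name for name, lag in lags.items() if lag is None or lag >= critical
    )
    if not lagging:
        scope = "none"
    elif len(lagging) == len(lags):
        scope = "all"   # usually points at the primary's write volume or the network
    else:
        scope = "some"  # usually points at the individual node
    return {"lagging": lagging, "scope": scope}


if __name__ == "__main__":
    print(diagnose({"replica-a": 0.4, "replica-b": 95.0, "replica-c": 0.6}))
    print(diagnose({"replica-a": 120.0, "replica-b": 95.0, "replica-c": None}))
```

With a single replica the "all" and "some" cases collapse into one, so the scope field is only informative once there are at least two standbys.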
Choosing thresholds
- Baseline first. Watch the lag_seconds value during normal operation, including peak write hours, before setting a threshold. A database that normally lags 200ms should alert far sooner than one that routinely lags several seconds under load.
- Warn early, page late. A common split is a warning at a few seconds and a paging alert at 30–60 seconds. The warning gives you a chance to investigate (a long-running migration, a write spike, a slow disk) before it becomes user-facing.
- Alert on silence. A replica that stops responding to the health endpoint, or whose row vanishes from pg_stat_replication, has likely broken replication entirely. That's more urgent than slow lag; treat a missing replica as critical.
- Account for maintenance. Lag spikes are expected during large data migrations, VACUUM FULL, or index rebuilds. Use a maintenance window to suppress alerts during planned heavy-write operations so you don't train your team to ignore the alert.
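As a rough illustration of "baseline first", the sketch below derives candidate thresholds from a list of lag_seconds samples collected during normal operation. Everything specific here is an assumption made for the example: the function names, the use of a nearest-rank 99th percentile, the 3x and 10x multipliers, and the floors. Treat the output as a starting point for discussion, not a recommendation.

```python
import math
from typing import Sequence, Tuple


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (pct is an integer between 1 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_thresholds(
    baseline: Sequence[float],
    warn_multiplier: float = 3.0,   # assumption: warn at 3x the baseline p99
    page_multiplier: float = 10.0,  # assumption: page at 10x the baseline p99
    min_warn: float = 1.0,          # assumption: never warn below 1 second
    min_page: float = 30.0,         # floor taken from the 30-60 second paging range
) -> Tuple[float, float]:
    """Return (warn_seconds, page_seconds) derived from baseline lag samples."""
    p99 = percentile(baseline, 99)
    warn = max(min_warn, warn_multiplier * p99)
    page = max(min_page, page_multiplier * p99)
    return warn, page


if __name__ == "__main__":
    quiet = [0.2] * 99 + [3.0]  # mostly 200 ms with one outlier
    busy = [4.0] * 50           # routinely several seconds under load
    print(suggest_thresholds(quiet))
    print(suggest_thresholds(busy))
```

The point of the floors is to stop a very quiet baseline from producing a threshold so tight that ordinary jitter pages someone.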
Common pitfalls
- Idle-database false positives. Without the pg_is_in_recovery() / LSN-equality guard, a quiet replica reports growing lag just because no transactions are arriving. Always include the guard.
- Querying lag on the wrong node. pg_last_xact_replay_timestamp() only makes sense on a replica; pg_stat_replication only has data on the primary. Point each query at the right node.
- Expensive health checks. Keep the query trivial: it's a handful of function calls, not a table scan. A health endpoint that adds load is part of the problem. The same discipline applies as in any health endpoint design.
- No external vantage point. Monitoring lag from a script running on the same database host means you lose the signal exactly when the host dies. External monitoring from Cloudflare's edge survives the failure of the thing it's watching.
- Treating logical and physical replication the same. The queries above are for physical streaming replication. Logical replication (and tools like the managed replication on RDS or Aurora) exposes lag through different views: pg_stat_subscription and provider-specific metrics. Adapt the query; the monitoring pattern (endpoint → status code + keyword → external check) is identical.
Where this fits in a broader monitoring strategy
Replication lag monitoring is one deep check among several. Pair it with a database connectivity health endpoint (can we reach the primary at all?), third-party dependency monitoring for the services your app relies on, and standard uptime monitoring on the user-facing endpoints. Together they cover the layers a single homepage check can't: the front door, the database connection, the replica freshness, and the external dependencies. For teams running on Kubernetes, the same endpoint doubles as a readiness probe so the platform stops routing read traffic to a lagging replica automatically.
Frequently asked questions
What is Postgres replication lag?
It's the delay between a write being committed on the primary and that write becoming visible on a replica. It's measured in bytes (how far behind the replica's WAL position is) and in seconds (how old the last replayed transaction is). Some lag is normal; unbounded growth means stale reads and failover data loss.
How do you measure replication lag in Postgres?
On the primary, compare pg_current_wal_lsn() to each replica's replay_lsn in pg_stat_replication for byte lag. On a replica, use EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) for time lag, guarded by a recovery/LSN-equality check to avoid idle-database false positives.
How do you monitor replication lag with an uptime tool?
Expose lag through an authenticated HTTP endpoint that returns 503 plus an unhealthy status string when lag exceeds your threshold, and 200 with a healthy string otherwise. Point CronAlert at the endpoint: the status-code check catches the breach and keyword monitoring confirms the body. This turns an internal database metric into an external alert without exposing the database.
What is a safe replication lag threshold?
Set it relative to your baseline. Common practice is to warn at a few seconds and page at 30–60 seconds of time lag, alert on a replica that stops reporting entirely, and for byte lag alert when a replica falls more than a few hundred megabytes behind.
Why is monitoring replication lag important?
Because it causes stale reads, failover data loss, and silently broken standbys — all of which are invisible to a normal uptime check, since the primary keeps returning 200 the whole time the replica is falling behind.
Monitor your replicas with CronAlert
Expose a replication-lag health endpoint, then let CronAlert watch it from outside your infrastructure. Create a free account (25 monitors, no credit card), point a monitor at your endpoint, add keyword monitoring on Pro to catch deceptive 200s, and route the alert to your on-call channel. The next time a replica starts falling behind, you'll hear about it from CronAlert — not from a customer wondering why their data looks out of date.
Related reading: how to monitor your database health endpoint, the complete guide to HTTP health check endpoints, monitoring microservices for uptime, monitoring third-party dependencies, and keyword monitoring.