How to Monitor Postgres Replication Lag

Replication lag is one of the most dangerous failure modes in a Postgres deployment precisely because nothing looks broken. The primary accepts writes, your website returns 200, your dashboards render. Meanwhile a read replica has fallen minutes behind, your read-heavy endpoints are serving stale data, and the standby you're counting on for failover has quietly become a liability — promote it and you lose every transaction it hadn't received yet.

A normal uptime check on your homepage is blind to all of this, the same way a homepage check misses a dead database. This guide covers how to measure replication lag inside Postgres, how to expose it through a health endpoint, and how to monitor that endpoint externally with CronAlert so a lagging replica pages you before your users notice the stale data.

What replication lag actually is

Postgres streaming replication works by shipping the Write-Ahead Log (WAL) from a primary server to one or more replicas, which replay it to stay in sync. Lag is the gap between "committed on the primary" and "replayed on the replica." It's measured two ways, and you want both:

Byte lag (volume). How many bytes of WAL the replica is behind the primary's current write position. Tells you how much data is in flight. A replica falling hundreds of megabytes behind is a problem even if it's catching up quickly.
Time lag (staleness). How many seconds old the most recently replayed transaction is. This is the number that maps directly to user impact: "data on this replica is N seconds out of date." Most alerting keys on time lag because it answers the question that matters.

Some lag is always present and normal. The failure is unbounded growth — lag that climbs and doesn't recover — or a replica that stops reporting entirely, which usually means replication has broken rather than merely slowed.

Why it's worth alerting on

Stale reads. If your application routes read traffic to replicas (a common scaling pattern), a lagging replica serves out-of-date data. Users update a setting and don't see it change; a just-placed order doesn't appear in their history. Read-after-write consistency silently breaks.
Failover data loss. The whole point of a standby is to take over when the primary fails. Promote a replica that's 90 seconds behind and you've just discarded 90 seconds of committed transactions — orders, payments, signups — with no way to recover them.
Broken replication. A replica that has stopped replaying WAL entirely isn't lagging, it's dead as a standby. You're now one primary failure away from an outage with no usable failover target, and you won't know unless you're watching.

None of these trip a normal uptime check, because the primary keeps serving traffic and returning 200 throughout. This is the same blind spot that makes database health endpoints and deep health checks for microservices necessary — you have to monitor the thing that's actually at risk, not the front door.

Measuring lag inside Postgres

Byte lag, queried on the primary

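On the primary, pg_stat_replication has one row per connected replica. Compare the primary's current WAL position to each replica's replay_lsn. These are the PostgreSQL 10+ names; older releases used pg_current_xlog_location(), pg_xlog_location_diff(), and replay_location: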

SELECT
    client_addr,
    application_name,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

A row missing from this result for a replica you expect to be connected is itself a signal: that replica has disconnected from the primary.
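To make the byte-lag arithmetic concrete, here is a minimal Python sketch of the subtraction that pg_wal_lsn_diff performs, plus a check for replicas missing from the result. It assumes LSNs arrive in their text form (two hexadecimal numbers separated by a slash). The helper names and the replica names are hypothetical, chosen only for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def byte_lag(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (the same subtraction as pg_wal_lsn_diff)."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replay_lsn)


def missing_replicas(expected: set, rows: list) -> set:
    """Replicas we expect to be connected but that have no row in pg_stat_replication."""
    connected = {row["application_name"] for row in rows}
    return expected - connected


if __name__ == "__main__":
    # Hypothetical result of the pg_stat_replication query above.
    rows = [{"application_name": "replica-a", "replay_lsn": "0/3000000"}]
    print(byte_lag("0/3000400", rows[0]["replay_lsn"]))        # 1024
    print(missing_replicas({"replica-a", "replica-b"}, rows))  # {'replica-b'}
```

Keeping the missing-replica check next to the byte-lag calculation means a disconnected standby shows up as an explicit finding instead of as an absent number.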

Time lag, queried on the replica

On a replica, compute how many seconds old the most recently replayed transaction is:

SELECT
    CASE
        WHEN pg_is_in_recovery() AND pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
            THEN 0
        ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
    END AS replication_lag_seconds;

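The CASE guards against a false positive on a low-traffic system: if no new transactions are arriving, the age of pg_last_xact_replay_timestamp() keeps growing even though the replica is fully caught up. When the received and replayed LSNs are equal, the replica has replayed everything it has received, so the query reports zero lag. Two caveats follow from that. First, the guard only compares against what the replica has received, so a replica whose WAL receiver has disconnected can also report zero; pair this query with the pg_stat_replication check on the primary. Second, on a primary, or on a replica that has not replayed any transaction yet, pg_last_xact_replay_timestamp() is NULL, so the query returns NULL instead of a number.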
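The branch logic of that CASE expression can be mirrored in a few lines of Python, which makes the idle-replica guard and the NULL case easier to see. This is an illustrative sketch only: the function name and its parameters are hypothetical stand-ins for the values the SQL functions return.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_lag_seconds(
    in_recovery: bool,
    receive_lsn: Optional[str],
    replay_lsn: Optional[str],
    last_replay_at: Optional[datetime],
    now: datetime,
) -> Optional[float]:
    """Mirror of the SQL CASE: 0 when caught up, else the age of the last replayed commit."""
    if in_recovery and receive_lsn is not None and receive_lsn == replay_lsn:
        return 0.0  # everything received so far has been replayed
    if last_replay_at is None:
        return None  # a primary, or a replica that has replayed nothing yet
    return (now - last_replay_at).total_seconds()


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    idle_since = now - timedelta(minutes=10)
    # Idle but caught up: reports 0.0, not 600.0.
    print(replication_lag_seconds(True, "0/5000000", "0/5000000", idle_since, now))
    # Received WAL not yet replayed: reports the age of the last replayed commit (42.0).
    print(replication_lag_seconds(True, "0/5000400", "0/5000000",
                                  now - timedelta(seconds=42), now))
    # Run against a primary: no lag value at all (None).
    print(replication_lag_seconds(False, None, None, None, now))
```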

Exposing lag through a health endpoint

External monitoring tools can't (and shouldn't) connect to your database directly. The pattern — the same one in the complete guide to HTTP health check endpoints — is to expose the metric through a small, authenticated HTTP endpoint that runs the query and translates the result into a status code and a status string.

Here's a Node.js / Express example that checks the replica's time lag and returns 503 when it exceeds a threshold:

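const crypto = require('crypto');
const express = require('express');
const { Pool } = require('pg');

const app = express();
// Connect to the replica: the lag query is only meaningful there.
const replicaPool = new Pool({ connectionString: process.env.REPLICA_URL });

const WARN_SECONDS = 5;
const CRITICAL_SECONDS = 30;

// Constant-time token comparison. Rejects every request when HEALTH_TOKEN is
// unset, so a missing env var cannot match a missing header.
function tokenMatches(provided) {
  const expected = process.env.HEALTH_TOKEN;
  if (!expected || typeof provided !== 'string') return false;
  const a = Buffer.from(provided);
  const b = Buffer.from(expected);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

app.get('/healthz/replication', async (req, res) => {
  if (!tokenMatches(req.headers['x-health-token'])) {
    return res.status(401).json({ status: 'unauthorized' });
  }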

  try {
    const { rows } = await replicaPool.query(`
      SELECT CASE
        WHEN pg_is_in_recovery()
             AND pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
        THEN 0
        ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
      END AS lag_seconds;
    `);

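    // A NULL result means this node is a primary or has replayed nothing yet.
    // Number(null) is 0, which would read as healthy, so reject it explicitly.
    const raw = rows[0].lag_seconds;
    const lag = raw === null ? NaN : Number(raw);
    if (!Number.isFinite(lag)) throw new Error('lag is not a number');
    const healthy = lag < CRITICAL_SECONDS;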

    res.status(healthy ? 200 : 503).json({
      status: healthy ? 'healthy' : 'unhealthy',
      lag_seconds: Math.round(lag * 1000) / 1000,
      degraded: lag >= WARN_SECONDS,
    });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: 'replication check failed' });
  }
});

app.listen(3000);

A Python / FastAPI version follows the same shape:

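import hmac
import os

import psycopg
from fastapi import FastAPI, Header, Response

app = FastAPI()
CRITICAL_SECONDS = 30

@app.get("/healthz/replication")
def replication_health(response: Response, x_health_token: str = Header(default="")):
    expected = os.environ.get("HEALTH_TOKEN", "")
    # Reject when the token is unset; otherwise compare in constant time.
    if not expected or not hmac.compare_digest(x_health_token.encode(), expected.encode()):
        response.status_code = 401
        return {"status": "unauthorized"}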

    try:
        with psycopg.connect(os.environ["REPLICA_URL"]) as conn:
            lag = conn.execute("""
                SELECT CASE
                  WHEN pg_is_in_recovery()
                       AND pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
                  THEN 0
                  ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
                END;
            """).fetchone()[0]

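        # NULL (None) means a primary or a replica that has replayed nothing yet;
        # float(None) raises, so that case falls through to the 503 below.
        lag = float(lag)
        healthy = lag < CRITICAL_SECONDS
        response.status_code = 200 if healthy else 503
        return {"status": "healthy" if healthy else "unhealthy",
                "lag_seconds": round(lag, 3)}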
    except Exception:
        response.status_code = 503
        return {"status": "unhealthy", "error": "replication check failed"}

Both return a clear "status" string and an HTTP status code that an external monitor can key on. Two design points carry over from the database-health-endpoint guidance: require a token so the endpoint isn't a public information leak, and never expose connection strings or raw error text in the body — return a generic "replication check failed", not the driver's stack trace.
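Because the threshold decision is the part most worth unit-testing, one option is to factor it out of the request handler into a pure function. The sketch below is a hypothetical refactoring, not part of either example above; it reuses the same 5-second and 30-second values and treats a missing lag value as a failed check.

```python
import math
from typing import Optional, Tuple

WARN_SECONDS = 5.0
CRITICAL_SECONDS = 30.0


def health_response(lag_seconds: Optional[float]) -> Tuple[int, dict]:
    """Map a lag reading to (HTTP status code, JSON body)."""
    # No usable reading (NULL from the query, or NaN): fail closed with a generic body.
    if lag_seconds is None or math.isnan(lag_seconds):
        return 503, {"status": "unhealthy", "error": "replication check failed"}

    healthy = lag_seconds < CRITICAL_SECONDS
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "lag_seconds": round(lag_seconds, 3),
        "degraded": lag_seconds >= WARN_SECONDS,  # warning band, still HTTP 200
    }
    return (200 if healthy else 503), body


if __name__ == "__main__":
    print(health_response(0.2))   # (200, {'status': 'healthy', 'lag_seconds': 0.2, 'degraded': False})
    print(health_response(12.0))  # (200, {'status': 'healthy', 'lag_seconds': 12.0, 'degraded': True})
    print(health_response(95.0))  # (503, {'status': 'unhealthy', 'lag_seconds': 95.0, 'degraded': True})
    print(health_response(None))  # (503, {'status': 'unhealthy', 'error': 'replication check failed'})
```

With the decision isolated like this, the handler only has to run the query and pass the result through, and the boundary cases (exactly 30 seconds, no reading at all) can be tested without a database.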

Monitoring the endpoint with CronAlert

With the endpoint live, point an external monitor at it. CronAlert checks it on a schedule from outside your infrastructure, so you get the alert even if the box hosting the endpoint is itself in trouble.

Create a monitor pointing at https://api.yourapp.com/healthz/replication. Add the x-health-token value as a custom request header so the check authenticates.
Set the expected status code to 200. When lag crosses your critical threshold the endpoint returns 503, and the status-code check fires. This alone catches the breach.
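Add keyword/content monitoring (Pro plan) to require that the body contain "status":"healthy", quotes included. Match the full key-value pair, not the bare word: "unhealthy" contains "healthy", so a plain substring match on the bare word would also pass on an unhealthy body. Both example endpoints above should serialize the pair without spaces, but confirm against your real response. This is the belt-and-suspenders layer that catches a misconfigured proxy rewriting the 503 to a 200 with an unhealthy body. See keyword monitoring.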
Set the interval to 1 minute on a paid plan. Replication lag can grow fast under a write spike; a 3-minute interval may let it climb dangerously between checks.
Route the alert appropriately. A lagging replica is an on-call-worthy event — route it to PagerDuty, Opsgenie, or Slack depending on severity. See incident response workflows for the routing pattern.

The combination of the status-code check and keyword monitoring gives you the same two-layer detection that works for any deep health check: the code catches the obvious failure, the keyword catches the deceptive 200.
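The two layers can be modeled in a few lines of Python to show why the choice of keyword matters. This sketch assumes the keyword check is a plain substring match, which is an assumption about the monitor and should be verified against its documentation. The function name and sample bodies are hypothetical.

```python
def check_passes(status_code: int, body: str,
                 expected_status: int = 200,
                 keyword: str = '"status":"healthy"') -> bool:
    """True only when both layers pass: status code matches and keyword is present."""
    return status_code == expected_status and keyword in body


if __name__ == "__main__":
    ok = '{"status":"healthy","lag_seconds":0.2}'
    bad = '{"status":"unhealthy","lag_seconds":95.0}'

    print(check_passes(200, ok))   # True
    print(check_passes(503, bad))  # False: the status-code layer fires
    print(check_passes(200, bad))  # False: the keyword layer catches the rewritten 200
    # Pitfall: the bare word matches inside "unhealthy", so this deceptive 200 passes.
    print(check_passes(200, bad, keyword="healthy"))  # True
```

The last case is the reason to match on the full key-value pair, or to pick status strings that are not substrings of each other.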

Monitor each replica separately

A single aggregate health endpoint can hide a problem. If you have three replicas and one falls behind, an endpoint that reports "the worst replica's lag" is correct but doesn't tell you which one. Expose a per-replica endpoint (or include the replica identifier in the response body) and create a separate CronAlert monitor per replica. That way the alert names the specific standby that's lagging, and you can see at a glance in the dashboard whether it's one replica or all of them — a single lagging replica is a node problem, all of them lagging is usually a primary write-volume or network problem.
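The "one replica versus all of them" reading can be expressed as a small triage helper over per-replica lag values. This is a sketch under stated assumptions: the function name, the replica names, and the 30-second default are hypothetical, and None stands for a replica that is not reporting at all.

```python
from typing import Dict, Optional


def diagnose(lag_by_replica: Dict[str, Optional[float]],
             critical_seconds: float = 30.0) -> str:
    """Classify per-replica lag readings; None means the replica is not reporting."""
    if not lag_by_replica:
        return "no replicas configured"

    bad = sorted(name for name, lag in lag_by_replica.items()
                 if lag is None or lag >= critical_seconds)
    if not bad:
        return "all replicas healthy"
    if len(bad) == len(lag_by_replica):
        return "all replicas lagging: suspect primary write volume or network"
    return "node problem on: " + ", ".join(bad)


if __name__ == "__main__":
    print(diagnose({"replica-a": 0.3, "replica-b": 0.5}))
    # all replicas healthy
    print(diagnose({"replica-a": 0.3, "replica-b": 95.0, "replica-c": 0.4}))
    # node problem on: replica-b
    print(diagnose({"replica-a": 40.0, "replica-b": None}))
    # all replicas lagging: suspect primary write volume or network
```

With a single replica the two cases cannot be told apart, so this kind of triage only adds information once there are at least two standbys to compare.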

Choosing thresholds

Baseline first. Watch the lag_seconds value during normal operation, including peak write hours, before setting a threshold. A database that normally lags 200ms should alert far sooner than one that routinely lags several seconds under load.
Warn early, page late. A common split is a warning at a few seconds and a paging alert at 30–60 seconds. The warning gives you a chance to investigate (a long-running migration, a write spike, a slow disk) before it becomes user-facing.
Alert on silence. A replica that stops responding to the health endpoint, or whose row vanishes from pg_stat_replication, has likely broken replication entirely. That's more urgent than slow lag — treat a missing replica as critical.
Account for maintenance. Lag spikes are expected during large data migrations, VACUUM FULL, or index rebuilds. Use a maintenance window to suppress alerts during planned heavy-write operations so you don't train your team to ignore the alert.
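One way to turn "baseline first" into numbers is to derive the thresholds from recorded lag_seconds samples. The sketch below is only a starting heuristic: the 99th-percentile choice, the 3x and 10x multipliers, and the 2-second and 30-second floors are illustrative assumptions, not recommendations from this guide, and the function name is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def suggest_thresholds(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (warn_seconds, page_seconds) derived from baseline lag samples.

    The multipliers and floors are illustrative assumptions; tune them for your workload.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")

    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(samples, n=100)[98]
    warn = max(2.0, 3 * p99)    # floor: never warn below 2 seconds
    page = max(30.0, 10 * p99)  # floor: never page below 30 seconds
    return warn, page


if __name__ == "__main__":
    quiet_db = [0.2] * 200  # normally lags about 200 ms
    busy_db = [5.0] * 200   # routinely lags several seconds under load
    print(suggest_thresholds(quiet_db))
    print(suggest_thresholds(busy_db))
```

Collect the samples across peak write hours so the percentile reflects load, and revisit the result after any change that shifts the baseline, such as new hardware or a new write-heavy feature.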

Common pitfalls

Idle-database false positives. Without the pg_is_in_recovery() / LSN-equality guard, a quiet replica reports growing lag just because no transactions are arriving. Always include the guard.
Querying lag on the wrong node. pg_last_xact_replay_timestamp() returns NULL on a primary, so it is only meaningful on a replica; pg_stat_replication only has rows on the node that is sending WAL, which is the primary (or an upstream standby in a cascading setup). Point each query at the right node.
Expensive health checks. Keep the query trivial: it's a handful of built-in function calls, not a table scan. A health endpoint that adds load is part of the problem. The same discipline applies as in any health endpoint design.
No external vantage point. Monitoring lag from a script running on the same database host means you lose the signal exactly when the host dies. External monitoring from Cloudflare's edge survives the failure of the thing it's watching.
Treating logical and physical replication the same. The queries above are for physical streaming replication. Logical replication (and tools like the managed replication on RDS or Aurora) exposes lag through different views — pg_stat_subscription and provider-specific metrics. Adapt the query; the monitoring pattern (endpoint → status code + keyword → external check) is identical.

Where this fits in a broader monitoring strategy

Replication lag monitoring is one deep check among several. Pair it with a database connectivity health endpoint (can we reach the primary at all?), third-party dependency monitoring for the services your app relies on, and standard uptime monitoring on the user-facing endpoints. Together they cover the layers a single homepage check can't: the front door, the database connection, the replica freshness, and the external dependencies. For teams running on Kubernetes, the same endpoint doubles as a readiness probe so the platform stops routing read traffic to a lagging replica automatically.

Frequently asked questions

What is Postgres replication lag?

It's the delay between a write being committed on the primary and that write becoming visible on a replica. It's measured in bytes (how far behind the replica's WAL position is) and in seconds (how old the last replayed transaction is). Some lag is normal; unbounded growth means stale reads and failover data loss.

How do you measure replication lag in Postgres?

On the primary, compare pg_current_wal_lsn() to each replica's replay_lsn in pg_stat_replication for byte lag. On a replica, use EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) for time lag, guarded by a recovery/LSN-equality check to avoid idle-database false positives.

How do you monitor replication lag with an uptime tool?

Expose lag through an authenticated HTTP endpoint that returns 503 plus an unhealthy status string when lag exceeds your threshold, and 200 with a healthy string otherwise. Point CronAlert at the endpoint: the status-code check catches the breach and keyword monitoring confirms the body. This turns an internal database metric into an external alert without exposing the database.

What is a safe replication lag threshold?

Set it relative to your baseline. Common practice is to warn at a few seconds and page at 30–60 seconds of time lag, alert on a replica that stops reporting entirely, and for byte lag alert when a replica falls more than a few hundred megabytes behind.

Why is monitoring replication lag important?

Because it causes stale reads, failover data loss, and silently broken standbys — all of which are invisible to a normal uptime check, since the primary keeps returning 200 the whole time the replica is falling behind.

Monitor your replicas with CronAlert

Expose a replication-lag health endpoint, then let CronAlert watch it from outside your infrastructure. Create a free account (25 monitors, no credit card), point a monitor at your endpoint, add keyword monitoring on Pro to catch deceptive 200s, and route the alert to your on-call channel. The next time a replica starts falling behind, you'll hear about it from CronAlert — not from a customer wondering why their data looks out of date.

Related reading: how to monitor your database health endpoint, monitoring Redis and ElastiCache endpoints, uptime monitoring for multi-region architectures (where replication lag is the data-safety signal behind every failover), the complete guide to HTTP health check endpoints, monitoring microservices for uptime, monitoring third-party dependencies, monitoring scheduled database backups with heartbeats (the other database job that fails silently), and keyword monitoring.