DNS Monitoring: Catch Resolution Failures, Propagation Delays, and Hijacks

Q: What does DNS monitoring actually detect?

Five specific failure modes: a) the authoritative nameserver stopped responding (NS-level outage), b) the resolver returned NXDOMAIN or SERVFAIL when it shouldn't have (record corruption, registrar issue), c) a record returned the wrong value (someone changed the A or CNAME without coordination, or a hijack), d) a record's TTL is too long for safe failover, and e) the registration is approaching expiry. Standard HTTP uptime checks catch most of these as 'the site is down,' but DNS-specific monitoring tells you why and lets you alert before HTTP checks notice.

Q: How is DNS monitoring different from HTTP uptime monitoring?

HTTP monitoring is end-to-end: if the site responds, the entire chain (DNS, TLS, server, app) is working. DNS monitoring isolates one specific link. The benefit is alert fidelity — when DNS-specific monitoring alerts, you know exactly where to look. HTTP-only monitoring tells you something is wrong but not where; you spend the first ten minutes of an incident running dig manually. DNS monitoring also catches problems HTTP can't see: the record is correct in your authoritative DNS but a popular resolver (Google 8.8.8.8, Cloudflare 1.1.1.1) is serving stale data.

Q: Should I monitor my DNS records or my registrar?

Both, for different reasons. Monitor DNS records (resolution from public resolvers) to catch operational issues: record changes, propagation, resolver staleness. Monitor your registrar for domain-level issues: pending expiry (alert at 30 and 7 days out), unexpected NS changes, transfer-lock status. Domain expiry is the single highest-impact 'how did we miss this' failure mode in DNS; an expired domain takes the entire business offline and recovery is measured in days, not minutes. A recurring WHOIS expiry check is non-optional for any domain that matters, and it needs to run often enough (daily is cheap) that the 7-day warning cannot slip between checks.

Q: How long does DNS propagation actually take?

Propagation isn't really 'global propagation'; it's individual resolver caches expiring at their own pace based on the record's TTL. If your A record has a TTL of 3600, a resolver that cached the old value will keep serving it for up to 3600 seconds after you changed it. There's no magic 24-48 hour window — that figure assumes long TTLs and slow registrars. Plan TTLs deliberately: 60 seconds for records you might need to change in an emergency (origin servers, CDN switches), 86400 seconds for records that almost never change (MX, SPF). For a planned cutover, lower the TTL 24+ hours before the change so the cached values expire by the time you flip.

Q: Can I detect DNS hijacks with monitoring?

Yes — monitor for unexpected changes in the resolved value. The pattern is to record the expected A/AAAA/CNAME/NS values for each critical record and alert when a query returns something different. CronAlert's keyword monitoring can match a known IP or CNAME target in a special status endpoint that returns the resolved value. More commonly, DNS monitoring services (DNSimple's monitoring, NS1, Cloudflare DNS analytics) have purpose-built record-change detection. The signal you want is 'the record we got back doesn't match the record we expect to be authoritative right now.' Hijacks usually flip NS records or A records to attacker-controlled values; you want to know within minutes, not when a customer notices.

A lot of "the site is down" incidents aren't really about the site. They're about DNS. The web server is fine, the database is fine, the CDN is fine — but the domain doesn't resolve, or it resolves to the wrong IP, or it resolves correctly from your laptop but not from a chunk of the internet. The site appears down to users; the team is staring at green dashboards and wondering what's broken.

DNS monitoring is the missing layer for these incidents. Many of the common causes of website downtime involve DNS in some form: expired registrations, record-update propagation lag, registrar-side outages, resolver-cache divergence, NS changes that didn't propagate. This post walks through what's worth monitoring at the DNS layer, how to wire it into a normal HTTP-based uptime monitoring setup, and the specific failure modes you can catch with a few extra checks.

What can fail in DNS

To monitor DNS effectively you need to know where it can break. The chain from "user types a domain" to "browser connects to a server" has at least seven places to fail:

Domain registration expires. If you don't pay your registrar, the registrar yanks the domain. Recovery is days, not minutes.
Registrar-side incident. Your registrar's nameserver delegation breaks — usually because of an internal incident or an account-level dispute.
Authoritative nameserver outage. The nameservers that hold your zone (Cloudflare, Route 53, NS1, your provider) become unreachable.
Record misconfiguration. Someone updates an A record to the wrong IP, or changes a CNAME target without testing, or accidentally deletes an MX record.
Resolver-cache staleness. Your authoritative DNS shows the new value, but Google's 8.8.8.8 (or any other big public resolver) is still serving the old one because of TTLs that haven't expired.
DNSSEC validation failure. The signature chain is broken — the domain looks intact but resolvers refuse to return answers.
NS hijack. The delegation at the registrar is modified to point to attacker-controlled nameservers. Less common than other failure modes but catastrophic when it happens.

HTTP uptime monitoring catches most of these eventually, because most of them result in "the site doesn't load." But the alert is "site is down," not "DNS is broken." The first ten minutes of the incident are spent running dig manually to figure out which layer failed. DNS-specific monitoring removes that lag.

What to monitor at the DNS layer

1. Resolution from public resolvers

The first DNS check is the basic one: resolve yourdomain.com from a public resolver and confirm it returns an answer. The implementation pattern with CronAlert is to expose a small endpoint on a different domain that does a DNS lookup of your primary domain and returns the resolved IP, then point a keyword monitor at it. For example:

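// /dns-check/yourdomain
import dns from "node:dns/promises";

// dns.resolve4() alone uses the host's configured resolver; pin public resolvers explicitly.
const resolver = new dns.Resolver();
resolver.setServers(["8.8.8.8", "1.1.1.1"]);

export async function GET() {
  try {
    const a = await resolver.resolve4("yourdomain.com");
    return new Response(`OK: ${a.join(",")}`, { status: 200 });
  } catch (err) {
    // err.code is a DNS error code such as ENOTFOUND, ESERVFAIL, or ETIMEOUT
    return new Response(`FAIL: ${err.code}`, { status: 500 });
  }
}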

Run this endpoint on a domain whose DNS is provided by a different vendor than the one you're monitoring. If your primary domain is on Cloudflare, run the check on a domain hosted by AWS Route 53 or Vercel. That way the check itself doesn't share a failure mode with what it's checking. Then point a CronAlert keyword monitor at it expecting the substring OK: in the body.

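For deeper coverage, run the check from multiple regions to catch resolver-cache divergence; sometimes Google 8.8.8.8 in one region serves a stale value that Google 8.8.8.8 in another region doesn't. Because the lookup happens wherever the endpoint runs, this only works if the endpoint itself is deployed in several regions; pointing multi-region monitors at a single-region endpoint just repeats the same lookup. With the endpoint distributed, multi-region monitoring covers this naturally.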

2. Specific record values

Monitoring just "did the resolution succeed" misses the case where DNS returns the wrong answer. A more targeted check confirms the resolved value matches what you expect:

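// Body of the same GET handler as the basic check, with the expected values pinned
const expected = ["1.2.3.4", "5.6.7.8"]; // placeholder IPs; use your real A records
const got = await dns.resolve4("yourdomain.com");
// Order-insensitive exact match: nothing expected is missing, nothing unexpected is present
const match = expected.every((ip) => got.includes(ip)) &&
  got.every((ip) => expected.includes(ip));
return new Response(match ? "OK" : `MISMATCH: ${got.join(",")}`, { status: match ? 200 : 500 });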

This catches accidental record changes — someone updates the A record to the wrong IP, or a CNAME flip points to an old origin, or in the worst case, a hijack swaps your record to an attacker-controlled IP. The monitor goes red within one check interval. Run this for the records that genuinely matter: apex A/AAAA, your CNAME for www, your api hostname, and your MX records if email is critical.
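The same comparison can be generalized to CNAME and MX targets, where case and trailing dots can differ between what you configured and what a resolver returns. The following is a minimal sketch, not a definitive implementation; the helper names `normalizeRecord`, `diffRecords`, and `recordStatusBody` are hypothetical, and the resolved values are assumed to have been fetched already.

```javascript
// Normalize a DNS name or IP for comparison: trim, lowercase, drop one trailing dot.
function normalizeRecord(value) {
  return value.trim().toLowerCase().replace(/\.$/, "");
}

// Compare expected and resolved record sets, ignoring order and duplicates.
function diffRecords(expected, got) {
  const want = new Set(expected.map(normalizeRecord));
  const have = new Set(got.map(normalizeRecord));
  const missing = [...want].filter((v) => !have.has(v));
  const unexpected = [...have].filter((v) => !want.has(v));
  return { match: missing.length === 0 && unexpected.length === 0, missing, unexpected };
}

// Build the response body that a keyword monitor would match on.
function recordStatusBody(expected, got) {
  const { match, missing, unexpected } = diffRecords(expected, got);
  return match
    ? "OK"
    : `MISMATCH: missing=[${missing.join(",")}] unexpected=[${unexpected.join(",")}]`;
}
```

For MX records, Node's `resolveMx` returns objects with `priority` and `exchange` fields, so map the result to the `exchange` strings before passing it in. Reporting missing and unexpected values separately makes it easier to tell an accidental delete from a swapped target when the alert arrives.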

3. NS delegation

Domain hijacks usually happen at the NS level — the attacker takes over your registrar account, points NS to nameservers they control, and serves whatever they want. Monitor the NS records as carefully as the A records:

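const expected = ["ns1.example-dns.com", "ns2.example-dns.com"];
const got = (await dns.resolveNs("yourdomain.com")).map((ns) => ns.toLowerCase());
// Require an exact set match so an extra, unexpected nameserver also fails the check
const ok = expected.every((ns) => got.includes(ns)) &&
  got.every((ns) => expected.includes(ns));
return new Response(ok ? "OK" : `NS_CHANGED: ${got.join(",")}`, { status: ok ? 200 : 500 });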

NS records change rarely; an unexpected change should immediately page on-call. For high-value domains, also consider a registrar-level change-notification setting and/or a registry lock that requires manual intervention to change NS.

4. Domain expiry

Domain expiry is the highest-impact, most-preventable DNS failure. An expired domain takes everything offline and recovery requires interaction with the registrar — sometimes during a redemption-grace-period window with extra fees. The mitigation is trivial:

Set the registrar's auto-renew on, billed to a card that doesn't expire before the domain.
Use a registrar account with billing-failure notifications routed to a real human, not a defunct distribution list.
Run a periodic WHOIS-based expiry check that alerts when the domain is fewer than 30 days from expiry, and again at fewer than 7 days.

The third item is the monitoring-vendor-friendly version. Build a small endpoint that runs a WHOIS lookup (most languages have a WHOIS library; whois-json for Node is one option) and returns the days-to-expiry. CronAlert then keyword-monitors it for "DAYS_LEFT_OK" and fires an alert when the body shifts to "DAYS_LEFT_CRITICAL".
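The days-to-expiry logic behind such an endpoint might look like the sketch below. It deliberately leaves out the WHOIS lookup itself, since the name and format of the expiry field vary by registry and library; it assumes you have already parsed the expiry into a `Date`. The function names and the intermediate `DAYS_LEFT_WARNING` label are assumptions added for illustration, not part of any CronAlert API.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days from `now` until `expiry` (negative once the domain has expired).
function daysUntil(expiry, now = new Date()) {
  return Math.floor((expiry.getTime() - now.getTime()) / MS_PER_DAY);
}

// Map days-to-expiry onto the keyword the monitor matches.
// Thresholds follow the 30-day and 7-day tiers described above.
function expiryStatus(expiry, now = new Date()) {
  const days = daysUntil(expiry, now);
  if (days < 7) return `DAYS_LEFT_CRITICAL: ${days}`;
  if (days < 30) return `DAYS_LEFT_WARNING: ${days}`;
  return `DAYS_LEFT_OK: ${days}`;
}
```

A keyword monitor that expects `DAYS_LEFT_OK` would start alerting at the 30-day mark, and the body tells the responder which tier has been reached. Passing `now` explicitly keeps the function easy to test.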

5. SSL certificate expiry

Not strictly a DNS check, but closely related: many outages that people remember as "DNS issues" are actually expired SSL certificates causing connection failures. CronAlert's SSL certificate monitoring catches these directly during normal HTTPS checks. Don't conflate the two; a correct DNS record pointing at a server with an expired cert still produces a "site is down" experience.

6. DNSSEC validation

If you've enabled DNSSEC (and most teams haven't, unless they're regulated), monitor for validation failures. The signature chain breaks on key rollovers when DS records aren't updated correctly at the parent zone. Symptom: SERVFAIL from validating resolvers. Public resolvers like Cloudflare's 1.1.1.1 validate by default; you can compare answers from a validating resolver and a non-validating one to detect breakage.
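The comparison between a validating and a non-validating resolver can be reduced to a small decision function. This is a sketch under assumptions: each lookup outcome is represented as `{ ok: true }` or `{ ok: false, code }`, where `code` is the error code from Node's `dns` module (`ESERVFAIL` for SERVFAIL). How you obtain the non-validating answer is left open, and the status labels are hypothetical.

```javascript
// Classify a pair of lookup outcomes for the same name.
// validating:    outcome from a resolver that performs DNSSEC validation
// nonValidating: outcome from a resolver that does not
function classifyDnssec(validating, nonValidating) {
  if (validating.ok && nonValidating.ok) return "OK";
  if (!validating.ok && nonValidating.ok) {
    // SERVFAIL only from the validating side is the classic broken-chain symptom.
    return validating.code === "ESERVFAIL" ? "DNSSEC_SUSPECT" : "VALIDATING_RESOLVER_ERROR";
  }
  if (validating.ok && !nonValidating.ok) return "NON_VALIDATING_RESOLVER_ERROR";
  return "RESOLUTION_FAILED"; // both sides failed: not DNSSEC-specific
}
```

The result is labeled "suspect" rather than "broken" because a SERVFAIL from one resolver can also have transient causes; a repeat of the same result over two or three check intervals is a stronger signal.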

How DNS monitoring fits into HTTP uptime monitoring

DNS monitoring shouldn't replace HTTP uptime monitoring; it complements it. The pattern that works:

Primary HTTP checks on the actual customer-facing surfaces (homepage, app, API), running at 1-minute intervals from multiple regions. These catch the end-to-end "is the site reachable" case.
DNS resolution checks running at 5-minute intervals on the domains that matter. These shorten the diagnosis of DNS-specific failure modes: the 1-minute HTTP check may fire first, but the DNS check tells you which layer failed, and it can flag wrong or stale answers that an HTTP check never sees.
WHOIS expiry check running daily. Anything more frequent is wasteful; anything less frequent risks missing a short-warning expiry.
SSL certificate expiry bundled into HTTPS checks (CronAlert does this automatically).

The combined alert routing should distinguish between layers. An HTTP failure with DNS green is a server-side problem; an HTTP failure with DNS red is a DNS problem. Routing to the same on-call is fine; tagging by layer in the alert body cuts triage time.
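The layer tagging described here could be sketched as follows. The function names, the tag format, and the `dns-degraded` label for the "DNS red, HTTP green" case are assumptions for illustration; the inputs are simply the latest pass/fail state of the HTTP check and the DNS check for the same domain.

```javascript
// Decide which layer an incident most likely belongs to.
function classifyLayer({ httpOk, dnsOk }) {
  if (httpOk && dnsOk) return "healthy";
  if (!httpOk && !dnsOk) return "dns";
  if (!httpOk && dnsOk) return "server";
  // DNS check red while HTTP still succeeds, e.g. the monitor is on a cached answer.
  return "dns-degraded";
}

// Build an alert title that carries the layer tag.
function alertTitle(name, state) {
  const layer = classifyLayer(state);
  return layer === "healthy" ? `[ok] ${name}` : `[layer:${layer}] ${name} is failing`;
}
```

Putting the tag at the front of the title means the responder sees the likely layer before opening the alert, which is where the triage time is saved.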

TTLs and propagation

DNS "propagation" is one of the most-misunderstood concepts in operations. There's no global propagation event; each individual resolver caches a record for the record's TTL, and serves the cached value until the TTL expires. If your A record has a TTL of 86400 (24 hours), a resolver that fetched the record yesterday will keep serving the old value for up to 24 hours after you change it — even though your authoritative nameservers are serving the new value immediately.

The practical implications:

Lower TTLs ahead of planned changes. At least one full period of the current TTL before a planned cutover (24+ hours for an 86400-second record), lower the TTL on the records you'll change to 60-300 seconds. After the change is stable, raise them back.
Match TTL to change-frequency. Records that almost never change (MX, SPF, the apex A for an established domain) can stay at 86400. Records that might need to change in an emergency (CDN switches, origin failover) belong at 60-300.
Don't trust "propagation checkers." The various web tools that claim to show "global DNS propagation" check a small set of resolvers, usually ones that honor TTLs strictly. Real-world resolvers vary wildly. The only ground truth for what you published is the authoritative nameserver's response.
Monitor what users actually resolve. If most of your users are in regions where their ISP runs an aggressively-caching resolver, a 60-second TTL doesn't help — those resolvers ignore short TTLs. Monitor from realistic resolvers, not just from your laptop's resolver.
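The timing behind "lower the TTL first, then cut over" is simple arithmetic, sketched below. The function and field names are hypothetical, and the result holds only for resolvers that honor TTLs; resolvers that override short TTLs will lag behind these times.

```javascript
// Plan a cutover given when the TTL was lowered and the TTL values involved.
function planCutover({ ttlLoweredAt, oldTtlSeconds, loweredTtlSeconds }) {
  // A cache filled just before the TTL was lowered keeps the old TTL until it expires,
  // so wait one full old-TTL period before changing the record value.
  const earliestCutover = new Date(ttlLoweredAt.getTime() + oldTtlSeconds * 1000);
  // If the record is changed at earliestCutover, TTL-honoring caches hold the
  // old value for at most one lowered-TTL period after that.
  const convergedBy = new Date(earliestCutover.getTime() + loweredTtlSeconds * 1000);
  return { earliestCutover, convergedBy };
}
```

For a record lowered from 86400 to 300 seconds, this gives a 24-hour wait before the change and a 5-minute window of stale answers after it, instead of a 24-hour window had the TTL been left alone.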

Alerting on DNS issues

DNS-failure alerts deserve their own routing rules because the on-call response is different from "the server is down":

NS changes page immediately. An unexpected NS change is either a hijack or a misconfigured registrar action — both are urgent.
Resolution failures from one region but not others get a chat-channel alert rather than a page. Resolver-cache divergence usually self-resolves; you want to see it but not be woken up.
Resolution failures from all regions page immediately. Genuine NS-level outage.
Mismatched record values page immediately. Either a misconfigured deploy or a hijack — both want a fast human response.
Domain expiry warnings (30 / 14 / 7 days) go to a chat channel and an email. Page only at 1 day if no one has acknowledged the earlier warnings.
SSL expiry warnings follow the same tiered pattern.
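The routing rules above can be written down as a single function, which makes them reviewable and testable. This is one possible encoding, not a CronAlert feature; the event shape (`type`, `failedRegions`, `totalRegions`, `daysLeft`, `acknowledged`) and the destination labels are assumptions. Partial regional failures (more than one region but not all) are treated like the single-region case here, which is a judgment call.

```javascript
// Map a DNS-related alert event to a destination: "page", "chat", or "chat+email".
function routeAlert(event) {
  switch (event.type) {
    case "ns_changed":
    case "record_mismatch":
      return "page"; // hijack or misconfiguration: both need a fast human response
    case "resolution_failure":
      // Every region failing looks like an NS-level outage; fewer is usually cache divergence.
      return event.failedRegions >= event.totalRegions ? "page" : "chat";
    case "domain_expiry":
    case "ssl_expiry":
      // Tiered warnings go to chat and email; page at 1 day left only if nobody acknowledged.
      return event.daysLeft <= 1 && !event.acknowledged ? "page" : "chat+email";
    default:
      return "chat";
  }
}
```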

Wire DNS alerts into the same incident-response process as other uptime alerts — see incident response for small teams. The general fight against alert fatigue applies; DNS alerts don't deserve a separate inbox, but they do deserve clear labelling so the right runbook gets pulled up.

Common DNS incident patterns

A few specific incidents that DNS monitoring catches faster than HTTP-only monitoring:

The CNAME flip. A team migrates from one CDN to another, updates the www CNAME, and watches it propagate, but misses one of the apex records. www and the other subdomains work; the bare domain doesn't. DNS-record-value monitoring catches this within minutes.
The accidental delete. A junior engineer cleans up "unused" records in the DNS panel; an MX record disappears; email starts bouncing. Specific-record monitoring catches it before email piles up.
The provider outage. Cloudflare DNS, Route 53, or NS1 has a regional outage. Resolution fails from some areas of the internet. Multi-region resolution checks catch the regionality directly.
The forgot-to-renew. Auto-renew was disabled when the team migrated billing. Domain enters expired-but-grace state. Customers can still load the site — until they can't. WHOIS monitoring catches this 30+ days out.
The resolver-cache poisoning. Rare but real. A deliberate poisoning attack causes a popular resolver to serve incorrect records, or a misconfigured DNSSEC change causes validating resolvers to return SERVFAIL while others still answer. Resolution from multiple resolvers (8.8.8.8, 1.1.1.1, 9.9.9.9) catches divergence.
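Detecting divergence across resolvers comes down to grouping resolvers by the answer they returned. The sketch below assumes the answers have already been collected into an object keyed by resolver IP (for example with one `Resolver` from `node:dns/promises` per server, configured via `setServers`); the function name `findDivergence` is hypothetical.

```javascript
// answersByResolver maps a resolver IP to the list of A records it returned.
// Resolvers are grouped by their sorted answer set; more than one group means divergence.
function findDivergence(answersByResolver) {
  const groups = new Map(); // canonical answer string -> resolvers that returned it
  for (const [resolver, answers] of Object.entries(answersByResolver)) {
    const key = [...answers].sort().join(",");
    groups.set(key, [...(groups.get(key) ?? []), resolver]);
  }
  return { diverged: groups.size > 1, groups: Object.fromEntries(groups) };
}
```

Divergence is a prompt to investigate, not proof of poisoning: records served with geo-based or weighted routing can legitimately differ between resolvers, so this check fits best on records that are expected to return the same fixed set everywhere.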

DNS monitoring tools and approaches

A few options for the actual implementation:

CronAlert keyword monitoring against a self-hosted DNS-check endpoint, as described above. Works for any record type and is fully under your control. Pairs naturally with HTTP uptime monitoring on the same dashboard.
DNS provider built-in monitoring. Cloudflare, NS1, and DNSimple have varying levels of native record-change and resolution monitoring. Use these in addition to external checks — they catch some issues earlier but share a fate with their own platform.
Specialized DNS monitoring services. DNSCheck, DNSimple's monitoring, ThousandEyes (enterprise) provide purpose-built DNS analysis. Worth it for very large or compliance-driven setups; overkill for most.
WHOIS monitoring tools. Domain-monitor.com and WhoisXML offer dedicated expiry alerts. A self-hosted WHOIS check with CronAlert covers the same ground if you're already in the CronAlert ecosystem.

The tradeoff is the usual one: a self-hosted DNS-check endpoint behind CronAlert keeps your monitoring in one place; a specialized DNS-monitoring service goes deeper but adds a vendor. For most teams the self-hosted approach hits the right balance.

Frequently asked questions

What does DNS monitoring actually detect?

Authoritative nameserver outages, NXDOMAIN/SERVFAIL on records that should resolve, wrong record values (misconfigurations or hijacks), excessive TTLs, and approaching domain expiry. HTTP-only monitoring catches most as "site down" but doesn't tell you why.

How is DNS monitoring different from HTTP uptime monitoring?

HTTP monitoring is end-to-end and tells you "something is wrong somewhere." DNS monitoring isolates the DNS layer specifically, which speeds up triage and catches DNS-only issues like resolver-cache divergence that HTTP doesn't see directly.

Should I monitor my DNS records or my registrar?

Both. DNS records to catch operational issues; the registrar to catch domain expiry and unexpected NS changes. Domain expiry is the single highest-impact DNS failure mode and warrants its own dedicated check.

How long does DNS propagation actually take?

Up to the record's TTL — there's no global propagation event. Lower TTLs ahead of planned changes; long TTLs are fine for stable records.

Can I detect DNS hijacks with monitoring?

Yes — monitor specific record values (A, CNAME, NS) and alert on unexpected changes. Hijacks usually flip NS or A records to attacker-controlled values; record-value monitoring catches this within one check interval.

Add DNS checks alongside your HTTP monitors

DNS monitoring is a small addition to a normal uptime monitoring setup — usually one or two extra endpoints and a daily WHOIS check — that pays off the first time DNS is the problem. Create a free CronAlert account and start with a basic resolution check on your apex domain, then layer in record-value monitoring and expiry checks as the setup matures.

Related reading: causes of website downtime, SSL certificate monitoring, multi-region monitoring, uptime monitoring for multi-region architectures (where DNS is the failover mechanism itself), keyword monitoring, and how to reduce false positive alerts.