Every multi-tenant SaaS eventually has this incident: the dashboards are green, the status page says operational, and a customer is on the phone insisting they're completely down. They're right. A shard died, a tenant migration half-finished, or their custom domain's certificate expired — and because it only affected some tenants, every aggregate signal averaged it away. "The app is up" is a meaningless sentence in a multi-tenant system. Up for whom?
The naive fix — a monitor per tenant — collapses immediately at SaaS scale: you can't hand-manage five thousand monitors, and you don't need to. Tenants don't fail individually; they fail in groups, along the seams of your architecture. Monitor the seams. This guide shows how: per-shard health checks, synthetic canary tenants, and individual monitors for the customers whose downtime costs the most. It builds on the general SaaS uptime monitoring guide — read that first if you're starting from zero.
Why aggregate monitoring lies in multi-tenant systems
Multi-tenancy means shared infrastructure with per-tenant blast radii. The failures that hurt are precisely the ones aggregate checks can't see:
- Sharded databases. Tenants 1–2,000 live on shard A, the rest on shard B. Shard B's failover hangs: half your customers are down, your /healthz (wired to shard A, or to whichever shard the health check's test tenant lives on) is green, and your success-rate graph shows a 50% dip that on-call reads as "degraded," not "half our customers see nothing but 500s."
- Per-tenant routing. acme.example.com resolves a tenant from the hostname before any query runs. A bug in tenant resolution, a wildcard DNS change, or cache poisoning takes out tenant subdomains while app.example.com works perfectly.
- Custom domains. Enterprise tenants bring portal.acme.com: their DNS, your certificate automation. Each domain is an independent failure unit: one expired cert is one customer fully down and zero signal anywhere in your stack.
- Tenant-specific state. Migrations that succeed on 4,990 tenant schemas and fail on 10. Feature flags, plan-based config, a tenant whose data shape hits an edge case. The app is fine; those ten tenants get 500s on every page.
- Noisy neighbors. One tenant's bulk import saturates a shared shard, and every tenant on it degrades. Average latency barely moves; that shard's tenants time out.
Notice the pattern: each failure follows an architectural seam — shard, cell, domain, tenant config. That's what makes the monitoring problem tractable.
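To make the dilution concrete, here is a small worked example. The traffic numbers and the field names (shard, requests, errors) are invented for illustration; the point is only that the same counters give a mild-looking aggregate and a total outage per shard.

```javascript
// Hypothetical one-minute request counters, grouped by shard.
// Shard B is completely down but carries only 10% of the traffic.
const traffic = [
  { shard: 'A', requests: 9000, errors: 0 },
  { shard: 'B', requests: 1000, errors: 1000 },
];

// Fraction of requests that failed; an idle shard counts as 0.
function errorRate({ requests, errors }) {
  return requests === 0 ? 0 : errors / requests;
}

// What a platform-wide dashboard sees: all shards summed together.
function aggregateErrorRate(shards) {
  const totals = shards.reduce(
    (acc, s) => ({
      requests: acc.requests + s.requests,
      errors: acc.errors + s.errors,
    }),
    { requests: 0, errors: 0 }
  );
  return errorRate(totals);
}

// What each tenant population actually experiences.
function perShardErrorRates(shards) {
  return Object.fromEntries(shards.map((s) => [s.shard, errorRate(s)]));
}

console.log(aggregateErrorRate(traffic)); // 0.1
console.log(perShardErrorRates(traffic)); // { A: 0, B: 1 }
```

A 10% aggregate error rate reads as "degraded" on a dashboard, while every tenant on shard B is seeing a 100% failure rate.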
Layer 1: one deep health check per shard or cell
Whatever your unit of tenant infrastructure is — database shard, cell, cluster — give it a health endpoint and an external monitor. The check should exercise the unit the way tenant traffic does: connect to that shard, run a tenant-scoped query against it, touch its cache. The database health endpoint pattern applies directly; the multi-tenant twist is that you need one per shard, not one for "the database":
app.get('/healthz/shard/:shardId', async (req, res) => {
  // Reject callers without the shared token; an unset HEALTH_TOKEN
  // must not let header-less requests through.
  const token = process.env.HEALTH_TOKEN;
  if (!token || req.headers['x-health-token'] !== token) {
    return res.status(401).json({ status: 'unauthorized' });
  }
  const problems = [];
  try {
    // Resolve the connection inside the try block so an unknown or
    // unreachable shard reports 503 instead of an unhandled rejection.
    const shard = getShardConnection(req.params.shardId);
    // A real tenant-scoped query, not SELECT 1: it exercises the
    // schema and the tenant-resolution path.
    await shard.query(
      'SELECT id FROM tenants WHERE shard_id = $1 LIMIT 1',
      [req.params.shardId]
    );
  } catch {
    problems.push('shard_query_failed');
  }
  const healthy = problems.length === 0;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    shard: req.params.shardId,
    problems,
  });
});
One CronAlert monitor per shard, named so the alert is the diagnosis: Shard B — tenant DB. Ten shards is ten monitors, covering every tenant's infrastructure layer. When shard B fails over badly at 2am, the alert names shard B — and your support team knows which customers are affected before the first ticket arrives.
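Keeping one monitor per shard is easier when the definitions are generated from the shard list rather than typed by hand. The sketch below only builds the definitions; the field names (name, url, method, headers, expectedStatus) and the base URL are assumptions for illustration, not the CronAlert API schema, so map them onto whatever your monitoring tool actually accepts.

```javascript
// Build one monitor definition per shard from a list of shard IDs.
// The object shape here is illustrative, not a real API payload.
function buildShardMonitors(baseUrl, shardIds, healthToken) {
  return shardIds.map((shardId) => ({
    // Name the monitor so the alert itself identifies the shard.
    name: `Shard ${shardId.toUpperCase()} - tenant DB`,
    url: new URL(
      `/healthz/shard/${encodeURIComponent(shardId)}`,
      baseUrl
    ).toString(),
    method: 'GET',
    // Same header the health endpoint checks before running its query.
    headers: { 'x-health-token': healthToken },
    expectedStatus: 200,
  }));
}

const monitors = buildShardMonitors(
  'https://app.example.com', // assumed application origin
  ['a', 'b'],
  'token-from-your-secret-store'
);
console.log(monitors.map((m) => `${m.name} -> ${m.url}`));
```

Generating the list from the same source of truth that defines your shards means a newly added shard cannot silently go unmonitored.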
Layer 2: a canary tenant per shard
Shard checks prove the infrastructure works; they don't prove the tenant path works — hostname → tenant resolution → shard routing → tenant-scoped page render. The cleanest probe for that path is a canary tenant: a synthetic tenant you create on each shard, existing solely to be monitored.
- Create canary-a.example.com on shard A, canary-b.example.com on shard B, and so on, provisioned through your real signup flow, so the canary also catches provisioning regressions.
- Monitor each canary's login page (or a lightweight tenant-scoped page) externally. Use keyword monitoring to require tenant-rendered content (a string that only appears when tenant resolution and a shard query actually succeeded) so an error page or a generic 200 can't pass.
- Don't use real tenants as probes. Their data shape isn't yours to depend on, their analytics shouldn't include your checks, and their consent shouldn't be assumed. The canary is yours; instrument it freely.
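The keyword requirement in the list above is normally a setting in the monitoring tool, but the logic is worth seeing spelled out. This is a minimal sketch of that rule; the function name and the marker string are hypothetical, and the marker is assumed to be rendered from a row stored on the canary's shard.

```javascript
// Decide whether a canary response proves the tenant path worked.
// A 200 alone is not enough: the body must contain a marker that is
// only rendered after tenant resolution and a shard query succeed.
function evaluateCanaryResponse(statusCode, body, requiredKeyword) {
  if (statusCode !== 200) {
    return { ok: false, reason: `unexpected_status_${statusCode}` };
  }
  if (!body.includes(requiredKeyword)) {
    return { ok: false, reason: 'keyword_missing' };
  }
  return { ok: true, reason: null };
}

// Hypothetical marker the canary tenant's login page renders from its
// own tenant record.
const marker = 'canary-tenant-ok:shard-a';

console.log(evaluateCanaryResponse(200, `<h1>Sign in</h1><!-- ${marker} -->`, marker));
console.log(evaluateCanaryResponse(200, '<h1>Welcome</h1>', marker));
console.log(evaluateCanaryResponse(503, 'Service Unavailable', marker));
```

The second case is the one plain status checks miss: a generic page served with a 200 while tenant resolution is broken.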
A canary failing while its shard check passes is a precise signal: infrastructure fine, tenant path broken — routing, resolution, or config. That distinction is the difference between paging the DBA and paging the app team.
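That triage rule is small enough to encode directly in alert handling. The sketch below assumes you can read the latest result of both checks for a shard; the team names (database-oncall, app-oncall) are placeholders for your own rotations.

```javascript
// Combine the shard health check and the canary check for one shard
// into a diagnosis and a paging target.
function diagnose({ shardCheckOk, canaryOk }) {
  if (shardCheckOk && canaryOk) {
    return { layer: 'none', page: null };
  }
  if (!shardCheckOk) {
    // The infrastructure check failed; the canary result adds nothing,
    // because the tenant path depends on the shard being healthy.
    return { layer: 'infrastructure', page: 'database-oncall' };
  }
  // Shard is healthy but the canary fails: routing, resolution, or config.
  return { layer: 'tenant-path', page: 'app-oncall' };
}

console.log(diagnose({ shardCheckOk: true, canaryOk: false }));
console.log(diagnose({ shardCheckOk: false, canaryOk: false }));
```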
Layer 3: named monitors for the customers that matter most
Layers 1 and 2 cover classes of failure. Your top customers deserve coverage as individuals — their subdomain or custom domain, monitored directly:
- Anyone with a contractual SLA gets a monitor on their actual entry point. When Acme's CTO asks for their uptime last quarter, you answer with their tenant's history — not a platform average that papered over the day their shard was down. This is the per-customer flavor of SLA compliance, and the check history doubles as the evidence for SLA reporting.
- Every custom domain you can afford to watch, highest-value first. HTTPS checks surface certificate problems automatically on every check — and expiring tenant certs are the single most common per-tenant outage. The customer's own DNS breaking isn't your fault, but you want to be the one who tells them.
- Create these monitors programmatically. Wire monitor creation into tenant onboarding via the CronAlert API — when a customer adds a custom domain, the deploy pipeline adds the monitor (the same automate-on-deploy idea as monitoring in CI/CD). Coverage that depends on a human remembering decays; coverage in the provisioning path scales with the feature.
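A provisioning hook for the last item might look like the sketch below. The createMonitor argument stands in for whatever client you use to call your monitoring API; its payload shape (name, url, tags) is an assumption for illustration, not the documented CronAlert request format.

```javascript
// Called from the custom-domain onboarding path. `createMonitor` is an
// injected function (hypothetical) that sends the definition to your
// monitoring provider and returns its result.
function onCustomDomainAdded({ tenantId, domain }, createMonitor) {
  const hostname = domain.trim().toLowerCase();
  // Minimal sanity check so a malformed value never becomes a monitor.
  if (!/^[a-z0-9.-]+$/.test(hostname) || !hostname.includes('.')) {
    throw new Error(`invalid custom domain: ${domain}`);
  }
  return createMonitor({
    name: `Custom domain: ${hostname} (tenant ${tenantId})`,
    // HTTPS entry point, so certificate failures fail the check.
    url: `https://${hostname}/`,
    tags: ['custom-domain', `tenant:${tenantId}`],
  });
}

// Usage with a stand-in client that just records what it was given.
const created = [];
const fakeClient = (definition) => {
  created.push(definition);
  return definition;
};
onCustomDomainAdded({ tenantId: 'acme', domain: 'Portal.Acme.com' }, fakeClient);
console.log(created[0].url); // https://portal.acme.com/
```

Injecting the client keeps the hook testable and leaves the real API call, authentication, and retries in one place.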
The arithmetic is friendly: 10 shards + 10 canaries + your top 30 customers is 50 monitors covering five thousand tenants, with the control plane (marketing site, login, billing, API — per the SaaS guide) on top. If you run a higher tier, agencies solve the same shape of problem — many isolated client properties, grouped monitoring — and the agency guide's organizational patterns transfer directly.
Routing and communication
- Severity by blast radius. Shard or canary failure means a tenant population is down: page PagerDuty/Opsgenie. A single customer's custom-domain failure: high-priority Slack plus an automatic heads-up to that account's team. See incident response workflows for the wiring.
- Status pages need care in multi-tenant systems. A public "all operational" while shard B's customers are down destroys trust. At minimum, post partial-outage incidents that say some customers are affected; ideally, give SLA-tier customers a status page scoped to the components they actually depend on. See status page setup.
- Tell support which tenants are affected. The alert that says "shard B down" should reach the support channel with the customer list (or a link to one). Half the cost of a partial outage is support confidently telling an affected customer everything is fine.
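The routing rules above can be expressed as one function that maps a failing monitor to a severity, a set of channels, and the message support should see. Everything here is a sketch: the monitor fields (kind, shardId, domain, tenantId), the channel names, and the tenant lookup table are assumed, not part of any particular tool.

```javascript
// Route a failing monitor by blast radius and attach the affected tenants.
function routeAlert(monitor, tenantsByShard) {
  switch (monitor.kind) {
    case 'shard':
    case 'canary': {
      // A whole tenant population is down: page, and tell support who.
      const affected = tenantsByShard[monitor.shardId] ?? [];
      return {
        severity: 'page',
        channels: ['pager', 'support'],
        supportMessage:
          `Shard ${monitor.shardId} is failing; ` +
          `${affected.length} tenants affected: ${affected.join(', ')}`,
      };
    }
    case 'custom-domain':
      // One customer is down: high priority, plus their account team.
      return {
        severity: 'high',
        channels: ['slack', 'account-team'],
        supportMessage: `${monitor.domain} is failing for tenant ${monitor.tenantId}`,
      };
    default:
      return {
        severity: 'high',
        channels: ['slack'],
        supportMessage: `${monitor.name} is failing`,
      };
  }
}

// Hypothetical shard-to-tenant lookup, e.g. loaded from your tenant registry.
const tenantsByShard = { b: ['acme', 'globex'] };
console.log(routeAlert({ kind: 'shard', shardId: 'b' }, tenantsByShard));
```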
Common pitfalls
- One health check wired to one shard. Your /healthz proves whichever shard it happens to touch. Every shard needs its own check, or shard failures are invisible until customers call.
- Trusting the aggregate. Error rates and average latency dilute per-tenant outages by design. A 10%-of-tenants outage is a 100% outage for those tenants; monitor failure domains, not averages.
- Using a real customer as the probe. Their data isn't a fixture and their analytics aren't yours to pollute. Build canaries.
- Custom domains nobody monitors. Each one is an independent cert and DNS failure unit owned half by you, half by the customer. Automate monitor creation at domain onboarding.
- Forgetting the canary is also a tenant. Migrations and tenant scripts will hit it. That's a feature — the canary failing after a migration is your ten-broken-tenants alarm — but exempt it from cleanup jobs that delete "inactive" tenants.
- A status page that only knows global truth. "All systems operational" is a lie to the customer on the dead shard. Communicate partial outages as partial.
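The canary exemption is easy to forget because cleanup jobs are usually written long before canaries exist. A minimal sketch of the guard, assuming each tenant record carries an isCanary flag and a lastActiveAt date (both field names are hypothetical) and a 90-day idle threshold chosen for the example:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Pick tenants eligible for inactive-tenant cleanup, never canaries.
function selectInactiveTenants(tenants, now, maxIdleDays = 90) {
  const cutoff = now.getTime() - maxIdleDays * DAY_MS;
  return tenants.filter(
    (t) => !t.isCanary && t.lastActiveAt.getTime() < cutoff
  );
}

const now = new Date('2025-06-01T00:00:00Z');
const tenants = [
  { id: 'acme', isCanary: false, lastActiveAt: new Date('2025-05-30T00:00:00Z') },
  { id: 'dormant-co', isCanary: false, lastActiveAt: new Date('2024-12-01T00:00:00Z') },
  // Canaries never log in like a customer, so they always look idle.
  { id: 'canary-a', isCanary: true, lastActiveAt: new Date('2024-01-01T00:00:00Z') },
];

console.log(selectInactiveTenants(tenants, now).map((t) => t.id)); // [ 'dormant-co' ]
```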
Frequently asked questions
Why doesn't normal uptime monitoring work for multi-tenant SaaS?
Because availability is per-tenant, not global. Shards, subdomains, custom domains, and tenant config each fail for a subset of tenants while aggregate checks and dashboards stay green. A monitor on app.example.com validates shared infrastructure, not the shard your biggest customer lives on.
How do you monitor per-tenant health without a monitor per tenant?
Monitor failure domains: one deep health check per shard, one canary tenant per shard for the tenant path, and named monitors for top customers' entry points. Tenants fail in groups along architectural seams, so tens of monitors cover thousands of tenants.
What is a canary tenant?
A synthetic tenant created on each shard purely to be monitored. It exercises real tenant resolution, routing, and tenant-scoped queries, so an external check against it proves the full tenant path works — without depending on any real customer's data or polluting their analytics.
How do you monitor tenant custom domains and their SSL certificates?
Monitor each high-value custom domain individually over HTTPS — certificate errors surface automatically on every check — and create the monitors programmatically when the domain is onboarded, so coverage scales with the feature rather than with anyone's memory.
Should each enterprise customer get their own monitor?
If they have a contractual SLA or material churn risk, yes. Their monitor's history is also the per-tenant uptime record you'll need the day they ask you to prove their SLA was met — their tenant's uptime, not your platform average.
Monitor every tenant's reality with CronAlert
"Up for whom?" is the only availability question that matters in a multi-tenant system. Create a free account (25 monitors, no credit card), add a health check per shard, provision a canary tenant on each, and put your top customers' domains under direct watch — then wire new custom domains into monitor creation via the API. The next time a shard dies at 2am, you'll know which one, which customers, and what to tell them — before the first angry ticket lands.
Related reading: uptime monitoring for SaaS, monitoring your database health endpoint, using uptime data for SLA compliance, uptime monitoring for agencies, and managing monitors with the REST API.