Uptime reports are easy to generate and hard to use. Every monitoring tool produces them. Every dashboard shows percentages, response times, and incident counts. Most teams glance at them once a week, nod, and close the tab.

The reports become useful only when you know which numbers actually predict trouble, which ones are vanity metrics, and what action each one is supposed to drive. Otherwise an uptime report is just decoration — a thing you screenshot for the quarterly review and forget about until next quarter.

This post walks through how to read a CronAlert uptime report (or any uptime report) in a way that changes engineering behavior. What to look at first, what to ignore, what each metric is and is not telling you, and how to turn a report into a list of three things to do this week.

What's in an uptime report

Most uptime reports include some combination of:

  • Uptime percentage — the headline number. Time the monitor saw the URL as up, divided by total time, over a given window.
  • Incidents — list of distinct outages with start time, duration, and resolution status.
  • Response time — average, median, or p95 latency over the period, often charted.
  • Per-region breakdown — if you run multi-region checks, uptime split by region.
  • Status code distribution — counts of 200s, 4xxs, 5xxs, timeouts, DNS errors.
  • SSL events — certificate errors and expiration warnings.
  • Alerts fired — number of notifications sent across all channels.

CronAlert exposes all of these on the monitor detail page and via API for paid plans, so you can pull a report into a notebook, a status page, or a quarterly slide deck. The question isn't whether the data exists — it's which slice of the data is worth your attention.
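
If you do pull the data into a notebook, the shape of that pull is simple. The endpoint path, query parameters, and field names below are illustrative placeholders, not the documented CronAlert API; check the API reference for the real ones.

```python
import json
import urllib.request

# Illustrative placeholders -- swap in the real API host, path, and token
# from the CronAlert API reference; none of these names are authoritative.
API_TOKEN = "your-api-token"
MONITOR_ID = "mon_123"
URL = f"https://api.cronalert.example/v1/monitors/{MONITOR_ID}/report?window=30d"

req = urllib.request.Request(URL, headers={"Authorization": f"Bearer {API_TOKEN}"})
with urllib.request.urlopen(req) as resp:
    report = json.load(resp)

# Field names are assumptions about the payload shape.
print("uptime %: ", report["uptime_percent"])
print("incidents:", len(report["incidents"]))
print("p95 (ms): ", report["response_time"]["p95_ms"])
```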

The headline percentage is the least useful number

Counter-intuitively, the uptime percentage is usually the worst metric to lead with. Three reasons:

It compresses too much. 99.5% over a month means 3.6 hours of downtime, but it doesn't tell you whether that was one bad afternoon or a flapping monitor that bounced 200 times. The action you take is completely different in those two cases.

It's lagging. A monitor that's been in and out of incidents for the last three days but pristine for the prior 27 can still show a monthly number in the high 90s. The percentage tells you what already happened, not what's currently broken.

It's framed wrong. Stakeholders see "99.5%" and think "great, almost perfect." Engineers see the same number and know it falls short of three nines, which is a common floor for an API. The number doesn't communicate what it should.

The headline number is fine for an executive summary. It's not where the work lives.
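
The one thing the percentage is good for is a quick translation into a downtime budget, which is what the 3.6-hours figure above is doing. A minimal sketch of that conversion:

```python
# Convert an uptime percentage into the downtime it allows over a window.
# 99.5% over a 30-day month works out to the 3.6 hours mentioned above.
def downtime_allowed(uptime_percent: float, window_hours: float) -> float:
    """Hours of downtime implied by an uptime percentage over a window."""
    return (1 - uptime_percent / 100) * window_hours

for pct in (99.0, 99.5, 99.9, 99.99):
    hours = downtime_allowed(pct, 30 * 24)  # 30-day month
    print(f"{pct}%  ->  {hours:.1f} h/month  ({hours * 60:.0f} min)")
```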

Read the incident list first

The single most useful section of an uptime report is the incident list. Five questions to ask of it:

  1. How many incidents fired? Three is normal. Twenty is a flapping problem or a bad monitor configuration.
  2. How long did each one last? A 30-second blip and a 90-minute outage are very different problems even if both count against your uptime.
  3. How were they resolved? Mostly "auto-resolved" points to transient issues, possibly false positives. "Manually acknowledged" means a human had to do something.
  4. Which monitors had the most incidents? One monitor with 80% of the incidents is your problem. The rest are noise.
  5. Did any incident page someone after hours? Off-hours pages are the most expensive incidents you have, even when they're short.

A report with one 12-minute incident, manually acknowledged, on a monitor that's normally clean, is a healthy month. A report with 22 sub-minute auto-resolved incidents on the same monitor is a configuration problem — see how to reduce false positives and why false positives happen at all.
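
Distinguishing those two months is mechanical once the incident list is in a structured form. A rough sketch, assuming each incident record carries a monitor name, a duration, a resolution type, and an off-hours flag (the field names and the five-incident threshold are illustrative):

```python
from collections import Counter

# Illustrative incident records; field names mirror a typical export.
incidents = (
    [{"monitor": "api", "duration_s": 45, "resolution": "auto", "off_hours": False}] * 8
    + [{"monitor": "www", "duration_s": 720, "resolution": "manual", "off_hours": True}]
)

per_monitor = Counter(i["monitor"] for i in incidents)

for name, count in per_monitor.items():
    mine = [i for i in incidents if i["monitor"] == name]
    short_auto = sum(1 for i in mine if i["duration_s"] < 60 and i["resolution"] == "auto")
    off_hours = sum(1 for i in mine if i["off_hours"])
    # Illustrative heuristic: a pile of sub-minute auto-resolved incidents
    # reads as flapping; a few manually acknowledged ones reads as a real,
    # and probably healthy, month.
    verdict = "flapping / config problem" if short_auto >= 5 else "looks healthy"
    print(f"{name}: {count} incidents, {short_auto} short+auto, {off_hours} off-hours -> {verdict}")
```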

Response time: look at p95, not the average

Average response time hides everything interesting. A site that normally answers in 200ms but mixes in the occasional 30-second timeout can still report an average of a few hundred milliseconds, a number that looks fine and tells you nothing.

Read the p95 instead. P95 is the response time below which 95% of requests fall. If your p95 is 800ms and your average is 200ms, you have a tail latency problem that the average is happily hiding from you. Sustained p95 climbs are the earliest leading indicator of trouble — slow before broken, every time.
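
The gap is easy to see on even a toy set of samples. A minimal sketch, with latencies made up to mirror the pattern above (mostly fast checks, a slow tail):

```python
import statistics

# Illustrative samples: most checks answer quickly, a few hit a slow tail.
latencies_ms = [150] * 94 + [1_000] * 6

average = statistics.mean(latencies_ms)
p95 = statistics.quantiles(latencies_ms, n=100)[94]   # 95th-percentile cut point

print(f"average: {average:.0f} ms")   # looks close to a normal check
print(f"p95:     {p95:.0f} ms")       # exposes the tail the average hides
```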

Two specific patterns to watch for (a short sketch for spotting both follows the list):

  • Steady climb week-over-week. A p95 that creeps up 20ms per week for six weeks is a database that's growing past its working set, a connection pool that's saturating, or a third-party dependency that's degrading. None of these will cause an outage today; all of them will cause one eventually.
  • Sudden step change. A p95 that jumps 200ms overnight usually points to a recent deploy. Cross-reference your deploy log against the change point in the chart and you'll find it.
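
A rough sketch of spotting both patterns from a series of weekly p95 values (the thresholds are illustrative and worth tuning to your own traffic):

```python
# Weekly p95 values in ms; this example shows a steady climb.
weekly_p95_ms = [410, 430, 455, 470, 495, 520]

deltas = [b - a for a, b in zip(weekly_p95_ms, weekly_p95_ms[1:])]

if any(d > 150 for d in deltas):
    week = deltas.index(max(deltas)) + 1
    print(f"step change after week {week}: cross-reference the deploy log")
elif all(d > 0 for d in deltas):
    print(f"steady climb, +{sum(deltas)}ms over the window: check data growth, "
          "pool saturation, and third-party dependencies")
else:
    print("no obvious trend in this window")
```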

For a deeper take on response-time monitoring versus user-perceived performance, see synthetic monitoring versus real user monitoring.

Per-region breakdown: where the geography problems hide

If you run multi-region checks, the per-region table is where the regional infrastructure problems that the headline number papers over actually show up.

A site at 99.9% globally but 98.5% in Asia is telling you something specific: the Asia probe is seeing failures the others aren't. Common causes are CDN cache miss patterns that hit the origin from far away, a regional DNS provider issue, a third-party dependency that's flaky from one geography, or a misconfigured WAF rule that geo-blocks legitimate probe traffic.

Per-region breakdowns are also the data you need to argue for or against expansion. If your Asia uptime is fine but your p95 from Asia is 4x the global average, that's the case for adding a regional CDN edge or a multi-region deploy. Without the data, the conversation is just opinions.
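
Making that case is easier when the comparison is explicit. A small sketch, with region names, numbers, and thresholds all illustrative:

```python
# Per-region numbers pulled from a report; everything here is illustrative.
regions = {
    "us-east": {"uptime_pct": 99.92, "p95_ms": 240},
    "eu-west": {"uptime_pct": 99.90, "p95_ms": 310},
    "asia":    {"uptime_pct": 98.50, "p95_ms": 980},
}

global_p95 = sum(r["p95_ms"] for r in regions.values()) / len(regions)

for name, r in regions.items():
    flags = []
    if r["uptime_pct"] < 99.5:           # illustrative uptime floor
        flags.append("uptime below floor")
    if r["p95_ms"] > 1.5 * global_p95:   # illustrative latency multiplier
        flags.append("latency outlier")
    note = " <-- " + ", ".join(flags) if flags else ""
    print(f"{name}: {r['uptime_pct']}% uptime, p95 {r['p95_ms']}ms{note}")
```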

See multi-region uptime monitoring for how to set this up if you haven't already.

Status code distribution: the boring section that finds real bugs

The status code chart is usually a single-color bar that's 99%+ green. Most teams glance at it and move on. The interesting part is the other 1%, which is worth tallying explicitly (a short sketch follows the list).

  • Rising 5xx rate. Server errors that didn't trip the alert threshold but appeared more often this month than last. Probably a flaky background job or a non-fatal exception path that's becoming more common.
  • Sudden 4xx spike. Client errors usually don't cause alerts (correctly), but a sudden spike in 401s can mean an auth provider issue, and a sudden spike in 429s means rate limiting. Both are worth investigating before they escalate.
  • DNS errors at all. DNS failures are rare and almost always meaningful: a misconfigured record, a registrar issue, an expired domain in a related zone. Even one or two per month deserve a look.
  • Timeouts that are not 5xxs. Timeouts mean the server didn't respond within the timeout window. They look different from 503s in the chart and can indicate a different problem — usually a long-running query, a deadlock, or a downstream dependency hang.
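
Comparing that "other 1%" against the previous period is a short script. A sketch, assuming raw check results with a status code and an optional error type (the field names and sample counts are made up):

```python
from collections import Counter

def bucket(check):
    """Collapse a raw check result into the categories worth watching."""
    if check.get("error") == "dns":
        return "dns error"
    if check.get("error") == "timeout":
        return "timeout"
    return f"{check['status'] // 100}xx"

# Illustrative check results for two periods.
this_month = ([{"status": 200}] * 980 + [{"status": 503}] * 12
              + [{"status": 429}] * 5 + [{"error": "timeout", "status": 0}] * 3)
last_month = [{"status": 200}] * 995 + [{"status": 503}] * 4

now, before = Counter(map(bucket, this_month)), Counter(map(bucket, last_month))
for key in sorted(set(now) | set(before)):
    print(f"{key:>9}: {before.get(key, 0):>4} last month -> {now.get(key, 0):>4} this month")
```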

For the meaning of each status code in monitoring context, see HTTP status codes explained.

Alerts fired: the canary for alert fatigue

The alerts-fired count is the easiest number to ignore and one of the most predictive of an unhealthy alerting setup. The signal is the ratio of alerts to incidents (a small sketch of the check follows the list).

  • Alerts ≈ incidents. Healthy. Each real incident fires an alert, each alert maps to a real incident.
  • Alerts >> incidents. You have a flapping monitor or duplicate channel routing — the same incident is paging multiple times or multiple channels are paging for the same event. Time to consolidate.
  • Alerts < incidents. Some incidents are not paging. Either you have alert routing turned off on important monitors or there's a configuration bug. Audit the alert channel attachments.
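
The classification above fits in a few lines. A minimal sketch, with cutoffs that are illustrative rather than canonical:

```python
# Classify the alert-to-incident ratio described above.
def alert_health(alerts_fired: int, incidents: int) -> str:
    if incidents == 0:
        return "no incidents; nothing to compare"
    ratio = alerts_fired / incidents
    if ratio > 2:
        return "noisy: flapping monitor or duplicate channel routing"
    if ratio < 1:
        return "gaps: some incidents never paged -- audit channel attachments"
    return "healthy: alerts roughly track incidents"

print(alert_health(alerts_fired=41, incidents=6))   # noisy
print(alert_health(alerts_fired=5, incidents=5))    # healthy
print(alert_health(alerts_fired=2, incidents=5))    # gaps
```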

The full playbook for keeping this number in line is in how to reduce alert noise without missing real outages.

Three things to do with a report this week

A report is only useful if it generates action. Every monthly review should produce three concrete items:

  1. Fix the noisy monitor. Look at incident counts per monitor. Whichever monitor is generating the most alerts relative to its real failures is your top fix this week. Tighten the consecutive-check threshold, add multi-region quorum, or add a maintenance window if the noise is from scheduled work.
  2. Investigate the climbing p95. Find the monitor whose p95 has climbed most over the period. Cross-reference deploys, traffic, and dependencies. Either explain the climb or open a ticket to investigate.
  3. Close the gap on the longest incident. Whichever incident lasted longest is the one to write up. What was the time-to-detect? Time-to-acknowledge? Time-to-resolve? Each gap has a remediation: better alerting closes detection, better paging closes acknowledgment, better runbooks close resolution. Pick the slowest one and fix it (a sketch of this breakdown follows the list).
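
That gap analysis is just timestamp arithmetic once the four timestamps are in hand. A small sketch, with illustrative times and field names:

```python
from datetime import datetime

# Illustrative timestamps for the month's longest incident.
incident = {
    "impact_started": datetime(2024, 5, 14, 2, 10),   # when it actually broke
    "alert_fired":    datetime(2024, 5, 14, 2, 19),   # monitoring noticed
    "acknowledged":   datetime(2024, 5, 14, 2, 47),   # a human responded
    "resolved":       datetime(2024, 5, 14, 3, 58),   # service healthy again
}

gaps = {
    "time-to-detect":      incident["alert_fired"] - incident["impact_started"],
    "time-to-acknowledge": incident["acknowledged"] - incident["alert_fired"],
    "time-to-resolve":     incident["resolved"] - incident["acknowledged"],
}

# The largest gap is the one to fix: detection -> better alerting,
# acknowledgment -> better paging, resolution -> better runbooks.
for name, gap in gaps.items():
    print(f"{name}: {int(gap.total_seconds() // 60)} min")
print("slowest gap:", max(gaps, key=lambda k: gaps[k]))
```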

Teams that follow this pattern see the report drive a measurable change in uptime numbers within a couple of months. Teams that just look at the percentage see the same number every month forever.

Reports for different audiences

The same data has to work for different readers. The format that helps engineering doesn't work for executives, customers, or auditors.

For engineering

Per-monitor incident lists, p95 charts, status code breakdowns, alert-to-incident ratios. The goal is action: which monitors need fixing, which trends need investigating. Weekly cadence.

For executives and stakeholders

Headline uptime number, total incident count, longest incident duration, biggest customer-facing impact. The goal is reassurance and risk visibility — what happened, what's the trajectory, what would change the trajectory. Monthly cadence.

For customers

A public status page is the right surface, not a report. Current state, recent incidents, a 90-day uptime history. CronAlert generates this automatically from monitor data. For enterprise customers under SLA contracts, supplement the status page with a quarterly summary — see how to use uptime data for SLA reporting.

For auditors and compliance

Auditors care about completeness and verifiability. Export check results via API, archive them on a defined retention schedule, and produce a report that maps incidents to your incident response process. The data structure matters more than the visuals.

Frequently asked questions

How often should I review uptime reports?

Weekly for triage, monthly for trends, quarterly for SLA reconciliation. Weekly catches noisy monitors before they get normalized. Monthly catches gradual degradation. Quarterly is the formal SLA cadence.

What is a good uptime percentage?

It depends on what the service is. 99.5% is fine for a marketing site, poor for a customer-facing API. The benchmark is your SLA commitment and the cost of downtime for your specific business — not a generic "more nines is better."

Why does my uptime percentage differ between tools?

Different check intervals, different locations, and different definitions of "down" produce different numbers. A 5-minute interval misses outages a 1-minute interval catches. A single-region check sees different failures than a multi-region one. Align the methodology before comparing.

Should I share uptime reports with customers?

Yes, but as a status page, not a raw report. For SLA-bound enterprise customers, supplement with a quarterly summary. Don't send the full engineering report — it has too much detail and the wrong framing for an external audience.

What's the difference between uptime and availability?

Casually they're synonyms. Strictly, uptime is "monitor saw the URL respond" and availability is "customers successfully completed actions." The gap is closed by monitoring the endpoints customers actually use, not just the homepage. See how to monitor API endpoints.

Start reading reports that change behavior

Every CronAlert monitor produces an uptime report from the moment you create it — incident list, response time charts, per-region breakdown, status code distribution, all on the monitor detail page. Create an account, set up your first monitors, and start with a simple weekly habit: open the dashboard, find the noisiest monitor, and fix one thing about it.

The compounding from that habit is what separates teams whose uptime number drifts up over time from teams whose number drifts down. The data is the same. The behavior is what differs.