Every incident retro eventually produces a number: "MTTR was 47 minutes." It sounds rigorous. The problem is that the same incident, measured by four people, can produce four different MTTRs — because "MTTR" is quietly an abbreviation for at least four distinct things, and almost nobody says which one they mean. Layer on MTTA, MTTD, and MTBF, and you have an alphabet soup that's easy to quote in a slide and easy to misuse in a decision.

This guide cuts through it: what each metric actually measures, why MTTR is secretly four metrics, how to calculate them from real data instead of memory, what "good" does and doesn't mean, and which ones a small team should actually track. The goal isn't to collect numbers — it's to find the part of your incident lifecycle that's slow and fix it.

The incident timeline, and where each metric lives

Every incident is a sequence of moments. The metrics are just the gaps between them:

  1. The service breaks. Something fails — a deploy, a dependency, a full disk. The clock that matters to customers starts here, whether or not anyone knows yet.
  2. You detect it. A monitor fires, or a customer emails. The gap from break to detection is MTTD — Mean Time To Detect.
  3. Someone acknowledges it. A human takes ownership of the alert. The gap from alert to acknowledgment is MTTA — Mean Time To Acknowledge.
  4. Service is restored. Users are working again. The gap from break (step 1) to restoration is MTTR — Mean Time To Recovery.
  5. It happens again. The healthy stretch between this incident and the next is captured by MTBF — Mean Time Between Failures.

Breaking the timeline into stages is the whole point: each gap has different causes and different fixes. Slow detection is a monitoring problem. Slow acknowledgment is an on-call/alerting problem. Slow recovery is a runbook/automation problem. Frequent failures are an engineering-quality problem. A single blended number hides which one is actually hurting you.
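As a minimal sketch of the timeline above, the gaps can be averaged separately once each incident has four timestamps. The `Incident` record, its field names, and the sample data are assumptions made for illustration, and the sketch treats the alert time as equal to the detection time.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    broke_at: datetime         # step 1: the service actually failed
    detected_at: datetime      # step 2: the monitor fired (assumed to be when the alert went out)
    acknowledged_at: datetime  # step 3: a human took ownership
    restored_at: datetime      # step 4: users are working again


def gap_minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def stage_means(incidents: list[Incident]) -> dict[str, float]:
    """Average each timeline gap separately instead of blending them."""
    return {
        "MTTD": mean(gap_minutes(i.broke_at, i.detected_at) for i in incidents),
        "MTTA": mean(gap_minutes(i.detected_at, i.acknowledged_at) for i in incidents),
        "MTTR": mean(gap_minutes(i.broke_at, i.restored_at) for i in incidents),
    }


# Two made-up incidents, for illustration only.
INCIDENTS = [
    Incident(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 4),
             datetime(2024, 5, 1, 10, 6), datetime(2024, 5, 1, 10, 40)),
    Incident(datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 14, 10),
             datetime(2024, 5, 9, 14, 20), datetime(2024, 5, 9, 15, 0)),
]

print(stage_means(INCIDENTS))  # {'MTTD': 7.0, 'MTTA': 6.0, 'MTTR': 50.0}
```

Keeping the three averages in one dictionary makes it easy to see which stage dominates: in this made-up data, most of the 50 minutes is spent after acknowledgment, which points at recovery rather than detection.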

MTTR is four metrics wearing one acronym

Here's the trap. "MTTR" gets expanded as all of these, and they measure different spans:

| Expansion | Measures | From → to |
| --- | --- | --- |
| Mean Time To Respond | How fast a human starts working it | Alert → response begins |
| Mean Time To Repair | Hands-on fixing time | Work begins → fix applied |
| Mean Time To Recovery | Total customer-facing downtime | Break → service restored |
| Mean Time To Resolve | Including root-cause fix & cleanup | Break → permanently resolved |

The same incident might have a 5-minute time-to-respond, a 25-minute time-to-repair, a 40-minute time-to-recovery, and a 3-day time-to-resolve (because the permanent fix shipped three days later). All four are "MTTR." So whenever someone quotes one, the only useful follow-up is: measured from what event to what event? For customer impact, Mean Time To Recovery (break to restored) is usually the one that matters most, and it's the definition this guide uses unless noted.
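To make the "from what event to what event" question concrete, here is a small sketch that computes all four spans from one set of timestamps. The timestamps and event names are hypothetical, chosen so the results match the 5-minute, 25-minute, 40-minute, and 3-day figures in the example.

```python
from datetime import datetime

# Hypothetical event timestamps for a single incident.
TIMES = {
    "broke": datetime(2024, 3, 4, 9, 0),
    "alerted": datetime(2024, 3, 4, 9, 8),
    "response_began": datetime(2024, 3, 4, 9, 13),
    "fix_applied": datetime(2024, 3, 4, 9, 38),
    "restored": datetime(2024, 3, 4, 9, 40),
    "permanently_resolved": datetime(2024, 3, 7, 9, 0),
}

# Each "MTTR" expansion is just a different (start event, end event) pair.
SPANS = {
    "respond": ("alerted", "response_began"),
    "repair": ("response_began", "fix_applied"),
    "recovery": ("broke", "restored"),
    "resolve": ("broke", "permanently_resolved"),
}


def all_spans_minutes(times: dict[str, datetime]) -> dict[str, float]:
    return {
        name: (times[end] - times[start]).total_seconds() / 60
        for name, (start, end) in SPANS.items()
    }


print(all_spans_minutes(TIMES))
# {'respond': 5.0, 'repair': 25.0, 'recovery': 40.0, 'resolve': 4320.0}
```

Writing the start and end events down next to each name is the point of the exercise: the number is only meaningful once the pair is explicit.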

How to calculate MTTR (and why the inputs are the hard part)

The formula is simple:

MTTR = total downtime across incidents ÷ number of incidents

Three incidents lasting 20, 40, and 60 minutes give 120 minutes ÷ 3 = a 40-minute MTTR. The arithmetic is trivial. The honesty of the inputs is not.
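The same arithmetic as a short sketch; the function name is an illustrative choice, and the guard for an empty list is an added assumption, since the average is undefined with zero incidents.

```python
def mttr_minutes(downtimes_min: list[float]) -> float:
    """Total downtime divided by number of incidents."""
    if not downtimes_min:
        raise ValueError("MTTR is undefined with zero incidents")
    return sum(downtimes_min) / len(downtimes_min)


print(mttr_minutes([20, 40, 60]))  # 40.0
```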

The number that wrecks most MTTR calculations is the start time. Teams instinctively start the clock when someone noticed, but the incident began when the service actually broke, which may have been long before. If you start counting at detection, you've silently dropped the detection delay from your MTTR and made recovery look better than it was. That's exactly why detection is a separate metric: measuring MTTD on its own, alongside a break-to-restored MTTR, is what tells you whether your problem is "we're slow to notice" or "we're slow to fix."
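To see how much the start time matters, this sketch computes the average twice over the same incidents: once from the break and once from detection. The detection lags are made-up values; the break-to-restore durations reuse the 20, 40, and 60 minutes from the example.

```python
from statistics import mean

# (detection_lag_min, break_to_restore_min) per incident; hypothetical values.
INCIDENTS = [(5, 20), (15, 40), (10, 60)]


def mttr_from_break(incidents):
    """Honest clock: starts when the service broke."""
    return mean(total for _, total in incidents)


def mttr_from_detection(incidents):
    """Flattering clock: starts when someone noticed."""
    return mean(total - lag for lag, total in incidents)


def mttd(incidents):
    return mean(lag for lag, _ in incidents)


print(mttr_from_break(INCIDENTS))      # 40
print(mttr_from_detection(INCIDENTS))  # 30
print(mttd(INCIDENTS))                 # 10
```

The 10-minute difference between the two averages is exactly the mean detection lag, which is the part of the outage that disappears from the report when the clock starts at detection.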

This is where real monitoring data beats reconstructed memory. A monitor that records the precise timestamp of the first failed check and the first recovered check gives you defensible incident boundaries — break time and restore time — without anyone guessing after the fact. CronAlert's uptime reports and incident history capture those timestamps automatically, which means your MTTR is computed from when the service was actually down, not when a human happened to look. Reconstructing incident windows from Slack scrollback is how MTTR quietly becomes fiction.
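As a rough sketch of how incident boundaries fall out of check results, the function below turns a time-ordered list of `(timestamp, ok)` pairs into `(first_failed, first_recovered)` windows. The input format is a hypothetical one invented for this example, not CronAlert's export format or API.

```python
from datetime import datetime, timedelta


def incident_windows(checks):
    """checks: time-ordered (timestamp, ok) pairs from one monitor.

    Returns a list of (first_failed_check, first_recovered_check) tuples.
    An incident still open at the end of the data is left out.
    """
    windows = []
    started = None
    for ts, ok in checks:
        if not ok and started is None:
            started = ts                   # first failed check: incident begins
        elif ok and started is not None:
            windows.append((started, ts))  # first recovered check: incident ends
            started = None
    return windows


# One check per minute starting at 12:00 (made-up results).
START = datetime(2024, 6, 1, 12, 0)
RESULTS = [True, True, False, False, False, True, True, False, True]
CHECKS = [(START + timedelta(minutes=n), ok) for n, ok in enumerate(RESULTS)]

for began, ended in incident_windows(CHECKS):
    print(began.time(), ended.time(), (ended - began).total_seconds() / 60)
# 12:02:00 12:05:00 3.0
# 12:07:00 12:08:00 1.0
```

One caveat worth stating: the real break happened somewhere between the last good check and the first failed one, so the check interval bounds how precise these boundaries can be.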

What "good" means (and the benchmark trap)

There is no universal "good MTTR," and chasing someone else's number is a mistake. Recovery time is dominated by architecture and what "recovery" even requires:

  • A stateless web app behind a load balancer might recover in minutes via an automated rollback.
  • A corrupted primary database might take hours of careful restore-and-verify — and rushing it is how you turn an outage into data loss.
  • A failure in a third-party dependency you don't control has an MTTR that's partly out of your hands. (Which is its own reason to monitor your dependencies, so you at least know whose problem it is.)

Comparing your MTTR to an industry "average" tells you almost nothing, because that average blends wildly different systems. The metric that does mean something is your own trend. Is MTTR falling quarter over quarter as you add runbooks, better detection, and rollback automation? A team that halves its own MTTR is unambiguously improving; a team that matches an arbitrary benchmark with no idea why is just lucky. Track the slope, not the absolute.
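One simple way to watch the slope is the percent change from each quarter to the next. The quarterly figures below are invented for illustration, and percent change is only one reasonable way to express a trend.

```python
def quarter_over_quarter(mttr_by_quarter: list[float]) -> list[float]:
    """Percent change from each quarter to the next; negative means MTTR fell."""
    return [
        round((curr - prev) / prev * 100, 1)
        for prev, curr in zip(mttr_by_quarter, mttr_by_quarter[1:])
    ]


# Hypothetical MTTR in minutes for four consecutive quarters.
QUARTERLY_MTTR_MIN = [80, 60, 45, 40]

changes = quarter_over_quarter(QUARTERLY_MTTR_MIN)
improving = all(change < 0 for change in changes)

print(changes)    # [-25.0, -25.0, -11.1]
print(improving)  # True
```

With only a handful of incidents per quarter, a single long outage can swing the average, so a trend like this is best read alongside the incident count.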

How to actually improve each metric

Because the metrics map to distinct stages, you improve them with distinct moves:

Lower MTTD — detect faster

Faster detection is the single highest-leverage fix for most teams, because every minute of slow detection is a minute added to every downstream metric. Detection improves with real external monitoring at a sensible interval, checks that catch "up but wrong" (a keyword check for content failures, not just status codes), and verification that suppresses false positives so alerts stay trustworthy. If your detection channel is "a customer emailed us," your MTTD is measured in hours and nothing downstream can save you.
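A minimal sketch of the two ideas in that paragraph: a check that requires both a success status and an expected keyword, and a verification rule that alerts only after consecutive failures. The function names and the default of two confirmations are assumptions for illustration, not any monitoring product's behavior.

```python
def check_ok(status_code: int, body: str, keyword: str) -> bool:
    """Pass only if the response is a 2xx AND contains the expected keyword."""
    return 200 <= status_code < 300 and keyword in body


def should_alert(recent_results: list[bool], confirmations: int = 2) -> bool:
    """Alert only when the last `confirmations` checks all failed.

    A single failed check is treated as a possible false positive.
    """
    if len(recent_results) < confirmations:
        return False
    return not any(recent_results[-confirmations:])


print(check_ok(200, "<h1>Welcome back</h1>", "Welcome"))  # True
print(check_ok(200, "<h1>Error</h1>", "Welcome"))         # False: up but wrong
print(should_alert([True, False]))                        # False: not yet confirmed
print(should_alert([True, False, False]))                 # True
```

The trade-off is explicit here: each extra confirmation removes some false positives but adds one check interval to detection time.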

Lower MTTA — acknowledge faster

Acknowledgment time is an alerting-design problem. Route alerts to a channel a human is actually watching, escalate if the first responder doesn't ack within minutes (PagerDuty, Opsgenie, Splunk On-Call all do this), and fight alert fatigue hard — a team drowning in noisy alerts acknowledges the real one slowly because it looks like all the others.
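The escalation idea can be sketched as a small policy table: the longer an alert goes unacknowledged, the more people are paged. The tiers, delays, and role names are hypothetical, and this is not the configuration format of PagerDuty, Opsgenie, or Splunk On-Call.

```python
# (minutes without an ack, who gets paged at that point); hypothetical policy.
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering lead"),
]


def who_has_been_paged(minutes_unacked: float, policy=ESCALATION_POLICY) -> list[str]:
    """Everyone who should have been paged by now for an unacknowledged alert."""
    return [target for after, target in policy if minutes_unacked >= after]


print(who_has_been_paged(0))   # ['primary on-call']
print(who_has_been_paged(7))   # ['primary on-call', 'secondary on-call']
print(who_has_been_paged(20))  # ['primary on-call', 'secondary on-call', 'engineering lead']
```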

Lower MTTR — recover faster

Recovery speed comes from preparation, not heroics. Write runbooks for your likely failure modes so the responder follows steps instead of improvising. Automate rollback so reverting a bad deploy is one command, not a debugging session. Practice: a team that has run the restore procedure once recovers far faster than one reading the docs for the first time mid-outage. This preparation is the operational core of incident response for small teams, and it appears in more structured form in incident response workflows.

Raise MTBF — fail less often

The long game: fewer incidents in the first place. This is where the blameless postmortem pays off — each incident, honestly analyzed, removes a class of future incident. MTBF rising over time is the signal that your postmortems are actually changing the system, not just documenting failures.
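Following this guide's description of MTBF as the healthy stretch between incidents, a sketch of the calculation averages the gap from one incident's restore time to the next incident's break time. The incident windows are made-up data, and other definitions of MTBF exist (for example, total uptime divided by the number of failures).

```python
from datetime import datetime


def mtbf_hours(windows):
    """windows: time-ordered (broke_at, restored_at) pairs.

    Averages the healthy gap between each restore and the next break.
    """
    gaps = [
        (next_broke - prev_restored).total_seconds() / 3600
        for (_, prev_restored), (next_broke, _) in zip(windows, windows[1:])
    ]
    if not gaps:
        raise ValueError("MTBF needs at least two incidents")
    return sum(gaps) / len(gaps)


# Three hypothetical one-hour incidents in January.
WINDOWS = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    (datetime(2024, 1, 11, 1, 0), datetime(2024, 1, 11, 2, 0)),
    (datetime(2024, 1, 31, 2, 0), datetime(2024, 1, 31, 3, 0)),
]

print(mtbf_hours(WINDOWS))  # 360.0 (gaps of 240 and 480 hours)
```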

Which metrics a small team should track

Don't track all four with equal ceremony. For a team of two to ten:

  • Track MTTD and MTTR seriously. They cover the parts of an incident customers feel, and they're the two you can most directly improve with monitoring and runbooks.
  • Track MTTA once you have a real on-call rotation. Before that, "who's awake" is the de facto answer and the metric adds little.
  • Glance at MTBF as a trend. It's useful as a reliability direction but easy to game — reclassify a "blip" as a non-incident and MTBF improves on paper while nothing got better.

And the cardinal rule: never optimize a metric in a way that makes you hide real incidents. The fastest way to a great-looking MTTR is to stop counting the incidents that went badly. Metrics exist to find the slow stage and fix it — the moment they become a performance score, people start managing the number instead of the reliability, and you've lost the entire point. Keep them blameless, keep them honest, and watch the trend.

Frequently asked questions

What is MTTR?

Most often Mean Time To Recovery: the average time from when an incident begins to when service is fully restored, calculated as total downtime ÷ number of incidents. But "MTTR" is also used for Repair, Respond, and Resolve, which measure different spans — so always ask "from what event to what event?"

What's the difference between MTTR, MTTA, MTTD, and MTBF?

They're different gaps on the incident timeline: MTTD is break → detected, MTTA is alert → acknowledged, MTTR is break → restored, and MTBF is the healthy stretch between incidents. Splitting them lets you improve detection, acknowledgment, recovery, and frequency independently.

How do you calculate MTTR?

Sum total downtime across incidents and divide by the number of incidents (three incidents of 20, 40, 60 minutes → 40-minute MTTR). The hard part is honest start/end timestamps — start the clock when the service broke, not when you noticed, which is why detection is tracked separately. Monitoring data gives you accurate boundaries; memory doesn't.

What is a good MTTR?

There's no universal number — it depends on your architecture and what recovery requires. Comparing to industry averages is a trap. The meaningful target is your own downward trend over time as you add runbooks, faster detection, and rollback automation.

Which incident metrics should a small team track?

Mainly MTTD and MTTR — they cover what customers feel and what you can most directly improve. Add MTTA once you have an on-call rotation, glance at MTBF as a trend, track your own slope rather than benchmarks, and never optimize a metric by hiding real incidents.

Measure recovery from real data, not memory

Incident metrics are only as honest as their timestamps. If you start the clock when a customer complained, every number downstream is fiction. Create a free CronAlert account to get precise first-failure and recovery timestamps on every monitor — the raw material for an MTTR and MTTD you can actually trust, plus the fast detection that shrinks both. Then improve the slope: better detection, tighter alerting, practiced recovery.

Related reading: incident response for small teams, writing a blameless postmortem, reducing alert fatigue, and turning checks into uptime reports.