Most incident response writing focuses on the playbook: who triages, who communicates, when to roll back. That part matters, but it's only half the system. The other half is the mechanical wiring inside your monitoring tool that turns the playbook from "twelve steps and a panic" into "the right people are paged, the status page is updated, and the timeline is recording" within thirty seconds of the first failed check.

This guide is about that wiring. We've already published a small-team incident response playbook; this post walks through how to configure CronAlert so the playbook actually runs end-to-end without anyone having to remember the twelve steps. It covers severity-routed alert channels, escalation chains, status-page automation, postmortem-ready timelines, and the API hooks that connect the workflow to the rest of your stack.

The workflow at a glance

Before configuring anything, it helps to know what the assembled workflow looks like. Each step happens automatically if it's wired correctly:

  1. Detect. A monitor fails. Consecutive-check verification and multi-region quorum filter false positives before anything fires.
  2. Open the incident. CronAlert opens an incident record with a timeline that starts capturing every state transition.
  3. Route alerts by severity. Low-severity monitors notify a chat channel. High-severity monitors notify the on-call rotation in addition to chat.
  4. Update the status page. If the monitor is attached to a public status page, the affected service is marked degraded or down automatically.
  5. Escalate. If the on-call tool doesn't get an acknowledgment within the configured window, it escalates to secondary, then to a manager.
  6. Respond. Responders add narrative timeline updates via the API or dashboard as they work the incident.
  7. Recover. When the monitor recovers, the incident is auto-resolved, the status page is restored, and the recovery alert closes the on-call page.
  8. Postmortem. Three days later, the timeline view is what you write the postmortem from.

The rest of this post walks through configuring each layer, in roughly the order you'd build them out.

Step 1: Filter false positives before they enter the workflow

Every minute of the workflow runs faster if the alert is real. A workflow that pages the on-call at 3am for a single-region transient blip will be ignored within a week. Two settings handle the vast majority of false positives:

  • Consecutive-check verification. CronAlert requires N consecutive failed checks before opening an incident. The default is 2 for 1-minute intervals, 1 for 3-minute intervals. Bumping it to 3 for noisy monitors (anything that historically flaps without real impact) trades a small amount of detection latency for a large reduction in pages.
  • Multi-region quorum. On Team and Business plans, monitors can require failure in N-of-M regions before alerting. A single bad probe location — a Cloudflare colo with a transient routing issue, a CDN cache miss — won't trigger the workflow. See multi-region monitoring for the configuration.

Both settings live on the monitor edit page. Tune them per monitor: the marketing site doesn't need 3-region quorum, the payments API probably does. For background on the math, see the false positive alerts guide.
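As a rough illustration of how the two filters combine, the sketch below counts a check round as failed only when a quorum of regions fail, and opens an incident only after N consecutive failed rounds. The function names and data shapes are hypothetical; this is a model of the idea, not CronAlert's internal implementation.

```python
from typing import Dict, List


def round_failed(region_results: Dict[str, bool], quorum: int) -> bool:
    """A check round counts as failed when at least `quorum` regions failed.

    `region_results` maps a region name to True if the check passed there.
    """
    failures = sum(1 for ok in region_results.values() if not ok)
    return failures >= quorum


def should_open_incident(
    rounds: List[Dict[str, bool]], quorum: int, threshold: int
) -> bool:
    """Open an incident only if the most recent `threshold` rounds all failed."""
    if threshold < 1 or len(rounds) < threshold:
        return False
    return all(round_failed(r, quorum) for r in rounds[-threshold:])


# One region blips for a single round: filtered out by a 2-of-3 quorum.
blip = [{"us": True, "eu": False, "ap": True}]

# Two or more regions fail in two consecutive rounds: treated as a real outage.
outage = [
    {"us": False, "eu": False, "ap": True},
    {"us": False, "eu": False, "ap": False},
]

print(should_open_incident(blip, quorum=2, threshold=2))
print(should_open_incident(outage, quorum=2, threshold=2))
```

The first call reports no incident and the second reports one. Raising `threshold` or `quorum` makes the filter stricter at the cost of slower detection.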

Step 2: Encode severity as channel attachment

CronAlert intentionally doesn't have a global "severity" field. Severity is encoded by which channels are attached to which monitors, which makes the routing explicit per-monitor and visible at a glance. The pattern that works:

  • Slack #alerts — attached to every monitor. Everything goes here for situational awareness.
  • Slack #alerts-critical — attached to production-critical monitors only. Same chat, separate channel, separate notification settings on each team member's client.
  • On-call channel (PagerDuty / Opsgenie / Splunk On-Call) — attached to monitors that warrant paging someone at 3am. Configure the integration via PagerDuty webhook, Opsgenie webhook, or Splunk On-Call REST endpoint.
  • Email — attached as a fallback, mostly for non-engineering stakeholders who want a passive notification.

To decide what's critical, ask: "If this fails at 3am on a Sunday, do we want someone woken up?" If yes, attach the on-call channel. If no — staging, internal tools, dashboards — chat only. The internal tools monitoring guide goes deeper on the staging-vs-prod split.
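One way to keep this routing reviewable is to write it down as data and check it. The sketch below is a hypothetical inventory (the monitor and channel names are invented for illustration, and nothing here calls CronAlert); it only shows how severity falls out of channel attachment.

```python
from typing import Dict, List, Set

# Hypothetical channel identifiers mirroring the pattern above.
CHAT = "slack-alerts"
CHAT_CRITICAL = "slack-alerts-critical"
ON_CALL = "on-call"

# Hypothetical monitors. There is no severity field: severity is implied
# entirely by which channels are attached.
MONITOR_CHANNELS: Dict[str, Set[str]] = {
    "payments-api": {CHAT, CHAT_CRITICAL, ON_CALL},
    "checkout-web": {CHAT, CHAT_CRITICAL},
    "staging-dashboard": {CHAT},
}


def pages_someone(monitor: str) -> bool:
    """True if a failure of this monitor would wake up the on-call."""
    return ON_CALL in MONITOR_CHANNELS[monitor]


def missing_chat(channels_by_monitor: Dict[str, Set[str]]) -> List[str]:
    """Monitors that break the 'everything goes to #alerts' rule."""
    return sorted(m for m, chans in channels_by_monitor.items() if CHAT not in chans)


print([m for m in MONITOR_CHANNELS if pages_someone(m)])
print(missing_chat(MONITOR_CHANNELS))
```

A check like `missing_chat` could run in CI against an exported monitor list, so a monitor created without the baseline chat channel is caught before it matters.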

Step 3: Configure escalation in the on-call tool

CronAlert handles routing to the on-call tool; the on-call tool handles escalation. The split is intentional — escalation policy is about people, schedules, and time-zone math, and there's a whole category of tools that do nothing else. Don't try to encode escalation inside the monitoring tool.

The minimum viable escalation policy:

  1. Page primary on-call. They have 5 minutes to acknowledge.
  2. Escalate to secondary. If primary doesn't acknowledge, page secondary after 5 minutes.
  3. Escalate to manager. If neither acknowledges, page the engineering manager after another 10 minutes.
  4. Open a channel. If still unacknowledged at 30 minutes, automatically open a Slack incident channel and ping the wider engineering group.

Every on-call tool supports this shape. The exact configuration is in your tool's docs; the point is to set it up before you need it. The alert fatigue guide covers how to tune the escalation timing so you don't over-page during normal incidents.
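To sanity-check the timings before entering them into your on-call tool, it can help to lay the policy out as a schedule. The sketch below encodes the four steps above as minutes since the first page; the structure and names are illustrative only and do not correspond to any particular tool's configuration format.

```python
from typing import List, Tuple

# (minutes since the first page, who is notified), taken from the policy above.
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (5, "secondary on-call"),        # primary had 5 minutes to acknowledge
    (15, "engineering manager"),     # another 10 minutes after secondary
    (30, "Slack incident channel"),  # still unacknowledged at 30 minutes
]


def notified_by(minutes_unacknowledged: int) -> List[str]:
    """Everyone who has been notified once the page has gone this long unacknowledged."""
    return [who for at, who in ESCALATION_STEPS if at <= minutes_unacknowledged]


for minute in (0, 5, 15, 30):
    print(minute, notified_by(minute))
```

Reading the output top to bottom shows the worst case: if nobody acknowledges, the wider engineering group is pulled in half an hour after the first page.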

Step 4: Auto-update the status page

Status pages are part of the workflow, not a separate manual task. Customers should know about an incident before they have to email support to ask. CronAlert status pages can be configured to automatically reflect the state of attached monitors:

  1. Create a status page in Settings → Status Pages.
  2. Attach the monitors that represent customer-facing services. The marketing site might be a separate service from the API; group monitors per customer-visible component.
  3. Enable Auto-create incidents from monitors. When an attached monitor goes down, an incident is opened on the status page automatically with the affected component marked degraded. When the monitor recovers, the incident is auto-resolved.

Auto-incidents handle the "something is down right now" signal. For narrative updates — root cause, workarounds, ETAs — responders add updates manually as they work. The fastest way to do that is via the API.
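The auto-incident behavior can be pictured as a small state machine per component. The sketch below is a hypothetical model, not CronAlert's implementation; in particular, it assumes a component stays degraded until every attached monitor has recovered, which may differ from how the product handles components with several monitors.

```python
from typing import Set


class StatusPageComponent:
    """Hypothetical model of one customer-visible component on a status page."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.failing: Set[str] = set()  # attached monitors currently down

    @property
    def status(self) -> str:
        return "degraded" if self.failing else "operational"

    @property
    def incident_open(self) -> bool:
        # Assumption: one incident per component, open while anything is failing.
        return bool(self.failing)

    def on_monitor_event(self, monitor: str, is_up: bool) -> None:
        """Apply a monitor state change reported by the monitoring tool."""
        if is_up:
            self.failing.discard(monitor)
        else:
            self.failing.add(monitor)


api = StatusPageComponent("API")
api.on_monitor_event("api-health", is_up=False)
print(api.status, api.incident_open)
api.on_monitor_event("api-health", is_up=True)
print(api.status, api.incident_open)
```

The narrative updates described above sit on top of this: the state machine only answers "is it down right now", never "why".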

Step 5: Wire the timeline into the rest of your stack

Once an incident is open, the timeline grows automatically: status transitions, regional spread, every alert fired and which channel it went to. The missing piece is the narrative — what the responder thinks happened, what they tried, what the workaround is. Two API endpoints make this fast:

# Add a narrative update to an open incident
curl -X POST https://cronalert.com/api/incidents/<incident_id>/updates \
  -H 'Authorization: Bearer <api_key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "investigating",
    "message": "Database failover in progress. ETA 10 minutes. Reads degraded; writes failing."
  }'

# List incident updates (used to render a postmortem timeline)
curl https://cronalert.com/api/incidents/<incident_id>/updates \
  -H 'Authorization: Bearer <api_key>'

Wire this into a Slack slash command, a CLI script, or your incident-response tool. A responder typing /incident-update Database failover in progress in Slack should produce a status page update in under five seconds. Three real workflows we've seen:

  • Slack slash command. /incident-update <message> POSTs to a small Cloudflare Worker that forwards to the CronAlert API. Update appears on the status page; customers see it without anyone leaving Slack.
  • CLI for terminal responders. A cronalert update wrapper that reads the active incident from a local config file. Engineers debugging on the terminal don't context-switch to add updates.
  • MCP integration. Via the CronAlert MCP server, an AI assistant can post timeline updates conversationally. "Claude, add an update to the active incident saying we've identified the root cause" works.
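For the slash-command and CLI variants, the core is the same small request builder. The sketch below mirrors the curl example above using only the Python standard library; the URL path, headers, and JSON fields come from that example, while the function names, the `inc_123` incident ID, and the key are placeholders. It builds the request without sending it, so it runs offline.

```python
import json
import urllib.request

API_BASE = "https://cronalert.com/api"  # base URL taken from the curl examples


def parse_slash_command(text: str) -> str:
    """Clean up the text Slack passes after /incident-update."""
    message = text.strip()
    if not message:
        raise ValueError("usage: /incident-update <message>")
    return message


def build_update_request(
    incident_id: str, api_key: str, message: str, status: str = "investigating"
) -> urllib.request.Request:
    """Build (but do not send) the POST that adds a narrative update."""
    body = json.dumps({"status": status, "message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/incidents/{incident_id}/updates",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder incident ID and API key, for illustration only.
req = build_update_request(
    "inc_123", "example-key", parse_slash_command("  Database failover in progress ")
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```

Keeping the builder separate from the send makes it easy to unit-test the payload and to reuse it from a Worker, a CLI, or a chat bot.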

Step 6: Recovery and auto-resolution

When the monitor recovers, the workflow runs in reverse:

  • CronAlert sends a recovery alert through every channel that fired the initial alert. The on-call tool closes the incident automatically (assuming entity-ID-based deduplication is configured — covered in the PagerDuty, Opsgenie, and Splunk On-Call guides).
  • The status page auto-resolves the incident and restores the component to "operational."
  • The incident timeline is closed but remains accessible — every transition, every alert, every narrative update is preserved.

Don't close incidents manually unless you need to. Manual closures lose the recovery timestamp and break the auto-deduplication on subsequent flaps. Let the workflow close them.
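The deduplication idea behind auto-closing is that the trigger and the recovery carry the same stable key, so the on-call tool can match them. The sketch below is a generic illustration; the payload fields and key format are hypothetical, and the real field names depend on your on-call tool and are covered in its integration guide.

```python
from typing import Dict


def dedup_key(monitor_id: str) -> str:
    """Stable key per monitor, so repeated flaps map to the same alert."""
    return f"cronalert-monitor-{monitor_id}"


def alert_event(monitor_id: str, is_up: bool) -> Dict[str, str]:
    """Hypothetical event payload sent to the on-call tool."""
    return {
        "action": "resolve" if is_up else "trigger",
        "dedup_key": dedup_key(monitor_id),
    }


down = alert_event("42", is_up=False)
up = alert_event("42", is_up=True)

# The recovery carries the same key as the trigger, so the page can be closed.
print(down["dedup_key"] == up["dedup_key"])
```

If the key were derived from something that changes per alert, such as a timestamp, the recovery would never match and every flap would open a fresh page.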

Step 7: The postmortem-ready timeline

The reason all of the above is wired up is so that three days later, when you sit down to write the postmortem, the timeline is already written. Open the incident, read the timeline top to bottom, and you have:

  • When detection happened (first failed check timestamp).
  • Which regions failed and in what order (multi-region spread).
  • Every alert fired and which channel received it.
  • Every status-page update the responder added.
  • When recovery happened.

The postmortem template writes itself from these inputs: What broke. When we noticed. How we communicated. What we fixed. What we learned. If your team is small, fifteen minutes per postmortem is enough — the small-team playbook has the format. The point is that the data is captured in real time, not reconstructed from Slack scrollback at the end of the week.
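Turning the list of updates into a postmortem draft can be a few lines of code. The sketch below assumes each update is a dictionary with `created_at`, `status`, and `message` keys; that response shape is an assumption for illustration, so adjust the field names to whatever the updates endpoint actually returns.

```python
from typing import Dict, List


def render_timeline(updates: List[Dict[str, str]]) -> str:
    """Format one line per update, oldest first.

    Assumes `created_at` is an ISO 8601 UTC string in a consistent format,
    which sorts correctly as plain text.
    """
    ordered = sorted(updates, key=lambda u: u["created_at"])
    return "\n".join(
        f'{u["created_at"]}  [{u["status"]}]  {u["message"]}' for u in ordered
    )


# Invented sample data in the assumed shape.
updates = [
    {
        "created_at": "2024-05-01T03:12:00Z",
        "status": "identified",
        "message": "Database failover in progress.",
    },
    {
        "created_at": "2024-05-01T03:02:00Z",
        "status": "investigating",
        "message": "Elevated API errors, looking into it.",
    },
]

print(render_timeline(updates))
```

The output can be pasted straight under the "When we noticed" and "How we communicated" headings of the postmortem.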

Workflow tuning over time

The first version of the workflow will be wrong in small ways. After every incident, ask three workflow-tuning questions in the postmortem:

  • Did the right people get paged? If a critical monitor only fired to chat, attach the on-call channel. If a non-critical monitor woke someone up, remove the on-call channel.
  • Did the status page reflect reality fast enough? If customers emailed before the status page updated, attach more monitors to the status page.
  • Did the timeline capture what mattered? If the postmortem requires reading Slack scrollback, the timeline is missing data. Add an update step to the playbook or wire more updates into your slash command.

Tune one thing per incident. Don't try to redesign the workflow during a 2am page. Capture the friction in the postmortem and adjust the configuration on a calm Tuesday afternoon.

Common workflow mistakes

Every monitor pages the on-call rotation

The single biggest cause of on-call burnout. Attach the on-call channel only to monitors whose failure justifies waking someone up at 3am. Everything else goes to chat. If you're not sure, default to chat-only and promote monitors to on-call after the first incident where chat-only wasn't enough.

Status page is manually updated during incidents

Manual updates during an incident are a context-switch nobody has time for. Auto-incidents from monitors handle the binary up/down signal; the slash-command or CLI update flow handles the narrative. If you find yourself logging into the status page admin UI during an incident, the workflow needs more automation.

No consecutive-check threshold or quorum

Single-check failures fire the entire workflow: paging, status page update, escalation timer. By the time the responder logs in, the monitor has already recovered. Set the threshold to 2 minimum, 3 for noisy monitors. On 1-minute intervals, the 1-2 minutes of added detection latency is well worth not running the workflow against transient blips.
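The latency cost of a threshold is simple arithmetic. The sketch below assumes checks run at a fixed interval with no immediate retry after a failure, which is a simplification; under that assumption each extra required failure adds one interval of delay.

```python
def added_detection_latency_seconds(threshold: int, interval_seconds: int) -> int:
    """Extra wait after the first failed check before an incident can open."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return (threshold - 1) * interval_seconds


# 1-minute checks: a threshold of 2 adds one minute, a threshold of 3 adds two.
print(added_detection_latency_seconds(2, 60))
print(added_detection_latency_seconds(3, 60))
```

On longer intervals the same threshold costs proportionally more, which is worth keeping in mind before raising it on a monitor that checks every 3 minutes.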

Postmortems written from memory three days later

If responders aren't adding timeline updates during the incident, the postmortem will be guesswork. Make the update flow frictionless — a slash command, a CLI, an MCP integration — so adding an update takes the same effort as sending a Slack message.

Frequently asked questions

What's the difference between the playbook and the workflow?

The playbook is the human process; the workflow is the wiring inside the monitoring tool that makes the playbook fast. You need both — start with the playbook, then configure the workflow to support it.

How do I route alerts by severity in CronAlert?

By attaching different channels to different monitors. There's no global severity field — severity is encoded per-monitor by what channels you attach. Critical monitors get the on-call channel; everything else gets chat.

Can CronAlert auto-update my status page?

Yes. Enable auto-create incidents on the status page, attach the relevant monitors, and the page updates automatically when monitors fail and recover. Narrative updates go through the API or dashboard.

What goes into a postmortem-ready timeline?

First-detected timestamp, regional spread, every alert fired and where it went, narrative updates added by responders, and recovery timestamp. CronAlert captures four of those automatically; the narrative updates come from your update flow.

How do I avoid burning out the on-call from a noisy workflow?

Set the consecutive-check threshold to 2 or 3, enable multi-region quorum on critical monitors, and route only true on-call-worthy monitors to the rotation. The alert fatigue guide goes deeper.

Get started

Wiring up the full workflow takes about an hour the first time, less for subsequent monitors as you reuse the channel and status-page configuration. Create a free CronAlert account, configure one critical monitor with the full pipeline (chat + on-call + status page + auto-incident), and walk through a manual test by temporarily pointing the monitor at https://httpstat.us/500. Watch the alert fire, the status page update, the escalation timer start, and the recovery close everything cleanly. Then duplicate the pattern across the rest of your production monitors.

Related reading: incident response playbook for small teams, avoiding alert fatigue, reducing false positive alerts, PagerDuty alerts, Opsgenie alerts, Splunk On-Call alerts, and operationalizing SLA compliance.