Long-running batch jobs are the unglamorous workhorses of every data pipeline. Nightly ETL. dbt runs. Airflow DAGs. Spark jobs. Database backups. Index rebuilds. Customer data exports. They run for hours, sometimes days, and the failure modes they exhibit are the ones traditional uptime monitoring covers worst.
The reason is structural. A normal cron job runs for seconds or a few minutes and pings a heartbeat URL when it finishes. Cron-job heartbeat monitoring works because the run duration is short relative to the expected schedule — if the hourly job didn't ping in 80 minutes, something is broken. A six-hour ETL can't follow that pattern. A single end-of-job heartbeat means the monitoring tool has no idea whether the job is running fine, stuck on a query, or quietly burning through cloud credits for the last four hours. You need different patterns.
This post walks through the patterns that work for batch jobs: paired start/finish pings, max-duration alerts, milestone monitoring, partial-failure detection, and how the patterns map onto Airflow, dbt, Spark, and any other long-running workload.
Why short-cron patterns don't translate
Three things break when you apply short-cron monitoring to long-running jobs:
- The expected interval is too coarse. A "ping every 24 hours" heartbeat tolerates a job that's been hung for 23 hours. By the time the alert fires, downstream data is a day late.
- "Not late yet" is ambiguous. If the nightly job normally takes four hours and sometimes seven, is six hours a problem? Without a max-duration signal, the only answer the monitor has is "nope, the heartbeat window hasn't expired yet."
- Partial failures look like successes. A dbt run that errored on three models out of two hundred but kept going will still hit the end-of-run hook. A single "I finished" ping can't communicate "I finished but the data is wrong."
The fix is to instrument batch jobs with more granular signals: separate start and finish events, an explicit duration ceiling, and milestone or status pings that reflect the real shape of the work.
Pattern 1: Paired start and finish pings
The minimum viable batch-job monitor is two events: "I'm starting now" and "I'm done now." A monitor watches both, calculates the gap, and alerts on three conditions:
- The start ping never arrived. The scheduler didn't trigger, the runner is dead, the host is down. Alert fires.
- The finish ping never arrived. Started but never completed. Alert fires after a max-duration window.
- The finish ping arrived too late. The job did complete but took 8 hours instead of the expected 4. Alert fires, optionally as a warning rather than a page.
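The three conditions above can be sketched as a small decision function. This is a minimal illustration of the logic, not CronAlert's implementation; `RunState`, `classify_run`, and all parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class RunState(Enum):
    PENDING = "pending"                # start grace window still open
    RUNNING = "running"                # started, still inside max duration
    OK = "ok"                          # finished within the expected duration
    MISSED_START = "missed_start"      # condition 1: start ping never arrived
    NEVER_FINISHED = "never_finished"  # condition 2: finish ping never arrived
    FINISHED_LATE = "finished_late"    # condition 3: finished, but too slowly


def classify_run(
    scheduled: datetime,
    now: datetime,
    start_ping: Optional[datetime],
    finish_ping: Optional[datetime],
    start_grace: timedelta,
    expected_duration: timedelta,
    max_duration: timedelta,
) -> RunState:
    """Map the two pings of one scheduled run onto the three alert conditions."""
    if start_ping is None:
        # No start ping yet: only a problem once the grace window has closed.
        if now > scheduled + start_grace:
            return RunState.MISSED_START
        return RunState.PENDING
    if finish_ping is None:
        # Started but not finished: alert once the duration ceiling is exceeded.
        if now > start_ping + max_duration:
            return RunState.NEVER_FINISHED
        return RunState.RUNNING
    if finish_ping - start_ping > expected_duration:
        return RunState.FINISHED_LATE
    return RunState.OK
```

With an expected duration of 4 hours and a ceiling of 6, a run that finishes after 8 hours classifies as `FINISHED_LATE`, which is a natural candidate for a warning rather than a page.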
CronAlert supports this with two monitors per batch job: one heartbeat for the start and one for the finish, each with its own ping URL. A simple shell wrapper handles the instrumentation:
#!/usr/bin/env bash
set -euo pipefail
START_URL="https://api.cronalert.com/ping/<start-monitor-id>"
FINISH_URL="https://api.cronalert.com/ping/<finish-monitor-id>"
FAIL_URL="https://api.cronalert.com/ping/<finish-monitor-id>/fail"
# Best-effort start ping: a monitoring outage must not block the job.
curl -fsS "$START_URL" > /dev/null || true
# Report success or failure based on the job's exit code.
if ./run_nightly_etl.sh; then
  curl -fsS "$FINISH_URL" > /dev/null
else
  curl -fsS "$FAIL_URL" > /dev/null
  exit 1
fi
The wrapper fires the start ping, runs the job, then fires either the finish or the fail URL based on the exit code. The || true on the start ping is intentional — if the ping fails for some reason, the job should still run; the monitoring blind spot is preferable to a job that didn't run because the monitor was unreachable.
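For jobs launched from Python rather than a shell, the same wrapper logic might look like the sketch below. The URLs are the placeholders from the shell version; `send_ping`, `run_monitored`, and `main` are hypothetical names, and the `ping` argument exists only so the network call can be swapped out.

```python
import subprocess
import sys
import urllib.request
from typing import Callable, Sequence

# Placeholder URLs taken from the shell wrapper; the monitor IDs are not filled in.
START_URL = "https://api.cronalert.com/ping/<start-monitor-id>"
FINISH_URL = "https://api.cronalert.com/ping/<finish-monitor-id>"
FAIL_URL = "https://api.cronalert.com/ping/<finish-monitor-id>/fail"


def send_ping(url: str, timeout: float = 5.0) -> bool:
    """Best-effort GET: returns False instead of raising if the ping fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except Exception:
        # Deliberately broad: a monitoring problem must never crash the wrapper.
        return False


def run_monitored(command: Sequence[str], ping: Callable[[str], bool] = send_ping) -> int:
    """Fire the start ping, run the job, then ping finish or fail by exit code."""
    ping(START_URL)  # result ignored, mirroring `|| true` in the shell version
    exit_code = subprocess.run(list(command)).returncode
    ping(FINISH_URL if exit_code == 0 else FAIL_URL)
    return exit_code


def main() -> int:
    # Example invocation: python wrapper.py ./run_nightly_etl.sh
    return run_monitored(sys.argv[1:])
```

Unlike the shell version, this sketch returns the job's own exit code instead of a fixed 1, which can help when the scheduler distinguishes failure types.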
Pattern 2: Max-duration alerts
The finish heartbeat catches "the job never finished" only after its grace window expires. For long-running jobs that grace window has to be large enough to tolerate normal slow runs — which means it's also large enough to miss several hours of a stuck job.
The fix is a separate max-duration signal. Set the finish-ping monitor's expected interval to "start time + max acceptable duration" rather than "start time + normal duration" or the full schedule interval. If the nightly ETL normally takes 4 hours and 6 hours is the longest run you can tolerate, set the finish ping's grace window to 6 hours, not 24. The alert then fires 2 hours after the job passes its normal run time, not 20.
The principle generalizes: the alerting window for batch jobs should be set against the job's maximum tolerable duration, not its average duration. Average duration is a description of normal; max tolerable duration is a description of "we should know about it." Use the latter.
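The arithmetic behind that choice is simple enough to write down. A small sketch, with the hypothetical helper `time_until_alert` measuring how long a problem sits unnoticed for a given finish window:

```python
from datetime import timedelta


def time_until_alert(elapsed: timedelta, finish_window: timedelta) -> timedelta:
    """Time between a point `elapsed` into the run and the finish-window alert.

    `finish_window` is measured from the start ping. If the window has already
    closed, the alert is due immediately and the result is zero.
    """
    return max(finish_window - elapsed, timedelta(0))


normal_run = timedelta(hours=4)

# A job that passes its normal 4-hour run time:
tight = time_until_alert(normal_run, timedelta(hours=6))   # 2 hours until the alert
loose = time_until_alert(normal_run, timedelta(hours=24))  # 20 hours until the alert
```

The same function covers the hung-job case: a job that hangs one hour in waits 23 hours for an alert under a 24-hour window, and 5 hours under a 6-hour one.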
Pattern 3: Milestone pings inside a long job
Start and finish alerts catch "the job started" and "the job finished," but they don't catch "the job is halfway through and the second half is stuck." For jobs with clear internal stages — extract, transform, load; or extract, validate, write, publish — a milestone ping per stage gives you visibility into where a failure occurred.
ping() {
  curl -fsS "https://api.cronalert.com/ping/$1" > /dev/null || true
}
ping "$JOB_START_ID"
extract_data
ping "$EXTRACT_DONE_ID"
transform_data
ping "$TRANSFORM_DONE_ID"
load_data
ping "$LOAD_DONE_ID"
ping "$JOB_FINISH_ID"
Each milestone monitor has its own expected window. When the alert fires, it tells you which stage hung, which is the first useful piece of information for diagnosing the problem: knowing that "extract finished but transform didn't" eliminates half the possible failure modes immediately.
Don't go overboard with milestone monitors. Five milestones per job is informative; fifty is alert fatigue. Pick the stages that have meaningfully different failure modes — networked extracts, expensive transforms, downstream-visible loads — and skip the trivial ones.
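On the monitoring side, per-stage windows reduce to a lookup for the first stage that is overdue. The sketch below is one way to express that; `first_stalled_stage`, the stage names, and the deadlines are illustrative assumptions, not part of any CronAlert API.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def first_stalled_stage(
    deadlines: List[Tuple[str, datetime]],
    received: Dict[str, datetime],
    now: datetime,
) -> Optional[str]:
    """Return the earliest stage whose ping is missing past its deadline.

    `deadlines` must be listed in pipeline order. Returns None when no stage
    is overdue, either because everything pinged or nothing is late yet.
    """
    for stage, deadline in deadlines:
        if stage not in received and now > deadline:
            return stage
    return None


# Hypothetical deadlines for a job that starts at 02:00.
deadlines = [
    ("extract", datetime(2024, 1, 1, 3, 0)),
    ("transform", datetime(2024, 1, 1, 5, 30)),
    ("load", datetime(2024, 1, 1, 7, 0)),
]
received = {"extract": datetime(2024, 1, 1, 2, 40)}

stalled = first_stalled_stage(deadlines, received, now=datetime(2024, 1, 1, 6, 0))
```

Here extract pinged on time and transform is 30 minutes past its deadline, so the stalled stage is `"transform"`; load is not reported because its own deadline has not passed.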
Pattern 4: Partial-failure detection
A job that ran to completion but errored on some of its work is still a problem. A dbt run that skipped three models. An ETL that wrote rows but failed the post-write data quality check. A backup that completed but couldn't verify the integrity of the final archive.
The fix is to make the finish ping carry status information that a keyword monitor can assert on. Instead of a bare GET to the ping URL, send a POST with a small JSON body:
curl -fsS -X POST \
-H 'Content-Type: application/json' \
-d '{
"status": "success",
"models_run": 197,
"models_failed": 3,
"rows_inserted": 4231809
}' \
"$FINISH_URL"
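Then configure the CronAlert monitor as a keyword monitor that checks the ping body for "status":"success" and "models_failed":0. If the keyword is matched as a literal string, it has to reproduce the payload's spacing exactly, so either send compact JSON or include the space in the keyword. A finish ping that arrives but reports failed models, like the payload above with its three failures, trips the keyword check and fires an alert. See the keyword monitoring guide for the assertion patterns.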
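A sketch of how the payload and the assertion line up, assuming the keyword check is a plain substring match (an assumption about the monitor's behavior, not something confirmed here). `build_finish_body` and `passes_keyword_check` are hypothetical helpers.

```python
import json
from typing import Iterable

# Keywords written without spaces, so the body must be serialized compactly.
REQUIRED_KEYWORDS = ('"status":"success"', '"models_failed":0')


def build_finish_body(status: str, models_run: int, models_failed: int, rows_inserted: int) -> str:
    """Serialize the finish payload as compact JSON (no spaces after : or ,)."""
    payload = {
        "status": status,
        "models_run": models_run,
        "models_failed": models_failed,
        "rows_inserted": rows_inserted,
    }
    return json.dumps(payload, separators=(",", ":"))


def passes_keyword_check(body: str, keywords: Iterable[str] = REQUIRED_KEYWORDS) -> bool:
    """Simulate a keyword monitor: every keyword must appear verbatim in the body."""
    return all(keyword in body for keyword in keywords)


partial = build_finish_body("success", 197, 3, 4231809)  # three models failed
clean = build_finish_body("success", 200, 0, 4231809)    # nothing failed
```

`passes_keyword_check(partial)` is false and `passes_keyword_check(clean)` is true. Note that Python's default `json.dumps` output puts a space after each colon, so the same keywords would fail against it even for a clean run; the serializer and the keywords have to agree.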
Patterns by tool
Airflow
Airflow DAGs are a natural fit for this pattern because the DAG run itself is a unit with a clear start and end. The cleanest instrumentation uses on_success_callback and on_failure_callback at the DAG level:
from airflow.models import DAG
import requests

def ping_success(context):
    # DAG run succeeded: ping the finish monitor.
    requests.get("https://api.cronalert.com/ping/<finish-id>", timeout=5)

def ping_failure(context):
    # DAG run failed: ping the finish monitor's fail endpoint.
    requests.get("https://api.cronalert.com/ping/<finish-id>/fail", timeout=5)

dag = DAG(
    'nightly_etl',
    schedule='0 2 * * *',
    on_success_callback=ping_success,
    on_failure_callback=ping_failure,
    catchup=False,
)
Add a start ping in the first task of the DAG. For per-task milestone pings, attach the same callbacks to specific operators that have SLA implications. Resist the urge to instrument every task: DAG-level callbacks plus a handful of milestone tasks is plenty for almost any pipeline.
dbt
dbt has on-run-start and on-run-end hooks in dbt_project.yml. Combine them with a post-run macro that posts the run summary to a ping URL:
# dbt_project.yml
on-run-start:
- "{{ post_to_cronalert('start') }}"
on-run-end:
- "{{ post_to_cronalert('end', results) }}"
The macro reads results (the dbt run results object), counts errors and warnings, and POSTs a JSON body to the CronAlert finish URL. The keyword monitor asserts on the body for zero errors. Documenting the macro and the keyword assertion together in the repo makes the contract clear for future engineers.
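If sending HTTP from inside a macro is not practical in your setup, one alternative is to build the same summary in a wrapper script after dbt exits, from the `target/run_results.json` artifact. The sketch below assumes that artifact's top-level `results` list with a `status` field per node; `summarize_run_results`, `load_run_results`, and the output field names are hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict


def summarize_run_results(run_results: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse per-node statuses into the small summary sent with the finish ping."""
    statuses = Counter(node.get("status") for node in run_results.get("results", []))
    # "error" covers failed models; "fail" covers failed tests.
    errors = statuses.get("error", 0) + statuses.get("fail", 0)
    return {
        "status": "success" if errors == 0 else "error",
        "nodes_total": sum(statuses.values()),
        "errors": errors,
        "warnings": statuses.get("warn", 0),
        "skipped": statuses.get("skipped", 0),
    }


def load_run_results(path: str = "target/run_results.json") -> Dict[str, Any]:
    """Read the artifact dbt writes at the end of a run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The resulting dictionary can be serialized and POSTed to the finish URL, with the keyword monitor asserting on the `errors` field.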
Spark
Spark jobs run inside a driver that has access to the application lifecycle. Add a SparkListener for application start, application end, and stage failures, with each listener firing a ping. For batch jobs submitted via spark-submit, wrap the submission in the shell pattern from earlier — start ping, run, finish ping conditional on exit code.
Generic batch scripts
The shell wrapper from Pattern 1 is the universal pattern. Anything that runs from cron, systemd, a CI runner, or a Kubernetes Job can be wrapped in a short shell script that handles start, finish, and failure pings without modifying the underlying job at all.
Common pitfalls
Setting the max-duration alert to the average run time
Normal runs take 4 hours, but some take 5.5 hours legitimately. A 4-hour ceiling fires false positives every other week. Set the ceiling against the worst acceptable run, not the typical one.
Pinging start from outside the job
If the scheduler pings "start" and then the job fails to launch (image missing, runner down), the start ping fires but no work happens. Move the start ping inside the job's entry point so it only fires if the job actually starts executing.
Routing every retry attempt to paging
Most batch jobs have transient failures — a flaky downstream API, a brief network blip, a transient lock. Routing every retry attempt to your on-call paging trains people to ignore alerts. Send transient failures to a logging or metrics destination; only page on final failure after retries are exhausted.
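One way to keep retries out of the paging path is to route them through separate callbacks. A minimal sketch with hypothetical names: `on_transient` would write to logs or metrics, while `on_final_failure` and `on_success` would send the fail and finish pings.

```python
import time
from typing import Callable


def run_with_retries(
    job: Callable[[], None],
    max_attempts: int,
    on_transient: Callable[[int, Exception], None],
    on_final_failure: Callable[[Exception], None],
    on_success: Callable[[], None],
    backoff_seconds: float = 0.0,
) -> bool:
    """Run `job` up to `max_attempts` times; only the last failure is escalated."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            job()
        except Exception as exc:
            if attempt == max_attempts:
                on_final_failure(exc)  # retries exhausted: this one pages
                return False
            on_transient(attempt, exc)  # log or count it, do not page
            time.sleep(backoff_seconds)
        else:
            on_success()  # fire the finish ping exactly once
            return True
    return False  # unreachable; keeps type checkers satisfied
```

A job that fails twice and then succeeds produces two transient records and one success ping, and nobody is paged.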
Monitoring every task in a 200-task DAG
Two hundred monitors of unequal importance produce one signal: alert fatigue. Monitor the DAG run, then add task-level monitors only for the handful of tasks with customer-visible SLAs.
Ignoring the silent recovery
A job that failed Monday and ran fine Tuesday should produce a "recovered" signal, not silence. CronAlert sends recovery notifications by default when the finish ping arrives correctly after a failure. Make sure your alerting destinations surface those — the all-clear is a meaningful signal in incident postmortems.
Combining the patterns
A full setup for a nightly ETL might look like this:
- Start heartbeat — expected daily at 02:00 +/- 15 minutes. Catches scheduler failures and runner death.
- Finish heartbeat — expected daily by 08:00. Catches stuck or hung jobs.
- Milestone heartbeats — extract done by 03:00, transform done by 05:30, load done by 07:00. Catches stage-level hangs and pinpoints the failing stage.
- Keyword check on the finish payload: assert "errors": 0 in the JSON body. Catches partial failures that completed without erroring out.
Combined with CronAlert's consecutive-check verification and the right routing through PagerDuty or Splunk On-Call, you get stage-level visibility into a job that runs for six hours, and the right person gets paged when it fails.
Frequently asked questions
How is batch-job monitoring different from cron-job heartbeat monitoring?
Cron heartbeats assume a single short job and a single end-of-run ping. Batch jobs run too long for that to be enough — they need paired start/finish events, max-duration alerts, and milestone pings for the parts of the work that matter.
What's the difference between a heartbeat alert and a max-duration alert?
Heartbeats fire when a ping doesn't arrive in time. Max-duration alerts fire when a job has been running for too long, even though the heartbeat hasn't yet expired. Long jobs need both.
Should I monitor every Airflow task?
No. Monitor the DAG run and the handful of tasks with customer-visible SLAs. Everything else lives in Airflow's UI.
How do I detect partial failures?
Send the finish ping as a POST with status JSON, then use keyword monitoring to assert on the body. A successful POST with "errors": 5 trips the assertion and fires an alert.
How do I handle retries?
Ping only after retries are exhausted or the job ultimately succeeds. Route transient retry failures to logging/metrics, not to paging — every retry-as-page trains the team to ignore alerts.
Get started
Pick your most important batch job, the one whose silent failure would be discovered by an angry user. Create a free CronAlert account, add a start heartbeat and a finish heartbeat, wire them in with the shell wrapper from Pattern 1, and set the finish window to the maximum tolerable duration. The next time the job runs long, you'll know before anyone else does.
Related reading: cron-job heartbeat monitoring, background worker monitoring, keyword monitoring, database health endpoint, avoiding alert fatigue, and incident response for small teams.