How is monitoring a long-running batch job different from monitoring a cron job?

A normal cron job runs for seconds or minutes and pings a heartbeat URL when it finishes. The pattern works because the run duration is small relative to the expected interval: if the daily backup hasn't pinged in 25 hours, something is wrong. Long-running batch jobs break this pattern. A nightly ETL that normally takes four hours and sometimes takes seven is not 'late' at six hours. A Spark job that runs for eight hours and sends a single end-of-job heartbeat leaves a blind spot of several hours if it dies halfway through, because nothing reports the failure until the finish window expires. The fix has three parts: pair a start ping with a finish ping, alert when the gap between them exceeds a maximum duration, and supplement with milestone pings for the parts of the job that matter.

What's the difference between a max-duration alert and a heartbeat alert?

A heartbeat alert fires when a check-in doesn't arrive within the expected window: 'I haven't heard from you, so something killed the process.' A max-duration alert fires when the start check-in did arrive but the job has been running too long: 'You started six hours ago and you're still going; that's wrong.' Long-running jobs need both. The heartbeat catches 'the job never started' or 'the runner died'; the max-duration alert catches 'the job started fine and is silently grinding away on a stuck query.' Without max-duration alerts, a hung job can keep its scheduler happy for days without anyone noticing the data is stale.

Should I monitor each task in an Airflow DAG individually or just the DAG run?

Monitor the DAG run for completion; monitor specific tasks for SLA-sensitive milestones. A DAG with 200 tasks that takes six hours doesn't benefit from 200 monitors — most of those tasks are uninteresting from a monitoring perspective, and the noise drowns out the signal. The DAG run itself needs a heartbeat (did it start) and a max-duration alert (did it finish on time). Beyond that, monitor only the tasks that have a customer-facing or downstream SLA — 'export to S3 is done' if a downstream report depends on it, 'data quality check passed' if anyone cares about correctness. Everything else stays in Airflow's own UI, where it belongs.

What happens if my batch job partially fails — finishes but with errors?

A job that 'finished' but rolled back half its work, retried 90% successfully, or skipped errored rows is still a failure from a monitoring perspective. The fix is to make the finish ping conditional on success — only ping success if the job actually succeeded, ping a separate 'partial-failure' or 'failed' URL otherwise, and let the monitor's keyword assertions check the body of the ping payload to distinguish full success from partial. CronAlert keyword monitoring lets you assert on response body content, so a single endpoint per job can communicate the difference between 'all done' and 'done but degraded' to the monitor.

How do I monitor a batch job that retries automatically?

Decide which signal you want to alert on — the underlying transient failure, or the final outcome after retries. For most teams the final outcome is what matters, so the heartbeat fires only after retries are exhausted or the job ultimately succeeds. If you also need visibility into transient failures (to spot a flaky downstream getting worse over time), send those to a separate logging or metrics destination, not to your paging channel. Routing every retry attempt to your on-call paging is one of the fastest ways to train people to ignore alerts.

How to Monitor Long-Running Batch Jobs for Uptime and Completion

Long-running batch jobs are the unglamorous workhorses of every data pipeline. Nightly ETL. dbt runs. Airflow DAGs. Spark jobs. Database backups. Index rebuilds. Customer data exports. They run for hours, sometimes days, and the failure modes they exhibit are the ones traditional uptime monitoring covers worst.

The reason is structural. A normal cron job runs for seconds or a few minutes and pings a heartbeat URL when it finishes. Cron-job heartbeat monitoring works because the run duration is short relative to the expected schedule — if the hourly job didn't ping in 80 minutes, something is broken. A six-hour ETL can't follow that pattern. A single end-of-job heartbeat means the monitoring tool has no idea whether the job is running fine, stuck on a query, or quietly burning through cloud credits for the last four hours. You need different patterns.

This post walks through the patterns that work for batch jobs: paired start/finish pings, max-duration alerts, milestone monitoring, partial-failure detection, and how the patterns map onto Airflow, dbt, Spark, and any other long-running workload.

Why short-cron patterns don't translate

Three things break when you apply short-cron monitoring to long-running jobs:

The expected interval is too coarse. A "ping every 24 hours" heartbeat tolerates a job that's been hung for 23 hours. By the time the alert fires, downstream data is a day late.
"Not late yet" is ambiguous. If the nightly job normally takes four hours and sometimes seven, is six hours a problem? Without a max-duration signal, the only answer the monitor has is "nope, the heartbeat window hasn't expired yet."
Partial failures look like successes. A dbt run that errored on three models out of two hundred but kept going will still hit the end-of-run hook. A single "I finished" ping can't communicate "I finished but the data is wrong."

The fix is to instrument batch jobs with more granular signals: separate start and finish events, an explicit duration ceiling, and milestone or status pings that reflect the real shape of the work.

Pattern 1: Paired start and finish pings

The minimum viable batch-job monitor is two events: "I'm starting now" and "I'm done now." A monitor watches both, calculates the gap, and alerts on three conditions:

The start ping never arrived. The scheduler didn't trigger, the runner is dead, the host is down. Alert fires.
The finish ping never arrived. Started but never completed. Alert fires after a max-duration window.
The finish ping arrived too late. The job did complete but took 8 hours instead of the expected 4. Alert fires, optionally as a warning rather than a page.

CronAlert supports this with two monitors per batch job: one heartbeat for the start and one for the finish, each pointed at its own ping URL. A simple shell wrapper handles the instrumentation:

#!/usr/bin/env bash
set -euo pipefail

# Replace the <...> placeholders with your own monitor IDs.
START_URL="https://api.cronalert.com/ping/<start-monitor-id>"
FINISH_URL="https://api.cronalert.com/ping/<finish-monitor-id>"
FAIL_URL="https://api.cronalert.com/ping/<finish-monitor-id>/fail"

# Pings are best-effort: "|| true" keeps a monitoring outage from failing the job.
curl -fsS --max-time 10 "$START_URL" > /dev/null || true

if ./run_nightly_etl.sh; then
  curl -fsS --max-time 10 "$FINISH_URL" > /dev/null || true
else
  curl -fsS --max-time 10 "$FAIL_URL" > /dev/null || true
  exit 1
fi

The wrapper fires the start ping, runs the job, then fires either the finish or the fail URL based on the exit code. The || true on every ping (together with a curl timeout) is intentional: if the monitor is unreachable, the job should still run and still report its own exit status. A monitoring blind spot is preferable to a job that didn't run, or that the scheduler retried after a successful run, because a ping failed.
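For jobs launched from Python rather than bash, the same wrapper might look roughly like the sketch below. It assumes the same placeholder URLs as the shell version; `http_ping` and `run_with_pings` are illustrative names, and the `ping` parameter exists only so the network call can be swapped out in tests.

```python
import subprocess
import sys
import urllib.request

# Placeholder URLs, same shape as the shell wrapper; substitute real monitor IDs.
START_URL = "https://api.cronalert.com/ping/<start-monitor-id>"
FINISH_URL = "https://api.cronalert.com/ping/<finish-monitor-id>"
FAIL_URL = "https://api.cronalert.com/ping/<finish-monitor-id>/fail"


def http_ping(url, timeout=10):
    """Best-effort GET: swallow network errors so monitoring never fails the job."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except (OSError, ValueError):
        pass


def run_with_pings(command, ping=http_ping):
    """Ping start, run the command, then ping finish or fail based on exit code."""
    ping(START_URL)
    exit_code = subprocess.run(command).returncode
    ping(FINISH_URL if exit_code == 0 else FAIL_URL)
    return exit_code
```

A scheduler entry point could then call `sys.exit(run_with_pings(["./run_nightly_etl.sh"]))` so the job's own exit code is what the scheduler sees.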

Pattern 2: Max-duration alerts

The finish heartbeat catches "the job never finished" only after its grace window expires. For long-running jobs that grace window has to be large enough to tolerate normal slow runs — which means it's also large enough to miss several hours of a stuck job.

The fix is a separate max-duration signal. Set the finish-ping monitor's deadline to "start time + max acceptable duration" rather than "start time + normal duration" (too tight) or a full schedule interval (too loose). If the nightly ETL normally takes 4 hours and 6 hours is the slowest run you will tolerate, set the finish window to 6 hours after the start, not 24. The monitor then fires two hours after the job exceeds its normal run time, instead of twenty hours after.

The principle generalizes: the alerting window for batch jobs should be set against the job's maximum tolerable duration, not its average duration. Average duration is a description of normal; max tolerable duration is a description of "we should know about it." Use the latter.
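To make the rule concrete, here is a minimal sketch of the decision a monitor effectively makes for one run. The function name, the state labels, and the timestamps are all illustrative assumptions, not CronAlert's internals.

```python
from datetime import datetime, timedelta


def classify_run(start, finish, now, max_duration):
    """Return the state of one run given its ping timestamps (None = not received)."""
    if start is None:
        return "not_started"
    deadline = start + max_duration
    if finish is not None:
        return "finished_late" if finish > deadline else "finished_on_time"
    return "overdue" if now > deadline else "running"


started = datetime(2024, 1, 1, 2, 0)
ceiling = timedelta(hours=6)  # worst acceptable run, not the 4-hour average

print(classify_run(started, None, datetime(2024, 1, 1, 7, 0), ceiling))   # running
print(classify_run(started, None, datetime(2024, 1, 1, 8, 30), ceiling))  # overdue
```

With a 24-hour window instead of a 6-hour ceiling, the second call would still report "running", which is exactly the blind spot described above.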

Pattern 3: Milestone pings inside a long job

Start and finish alerts catch "the job started" and "the job finished," but they don't catch "the job is halfway through and the second half is stuck." For jobs with clear internal stages — extract, transform, load; or extract, validate, write, publish — a milestone ping per stage gives you visibility into where a failure occurred.

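set -euo pipefail  # stop at the first failed stage so later milestones never fire

# Best-effort ping: a monitoring outage should never fail the job.
ping() {
  curl -fsS --max-time 10 "https://api.cronalert.com/ping/$1" > /dev/null || true
}

# extract_data, transform_data, and load_data stand in for the job's real stages.
ping "$JOB_START_ID"
extract_data
ping "$EXTRACT_DONE_ID"
transform_data
ping "$TRANSFORM_DONE_ID"
load_data
ping "$LOAD_DONE_ID"
ping "$JOB_FINISH_ID"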

Each milestone monitor has its own expected window. When the alert fires it tells you which stage hung, which is the first useful piece of information for diagnosing the problem — knowing that "extract finished but transform didn't" eliminates half the possible failure modes immediately.

Don't go overboard with milestone monitors. Five milestones per job is informative; fifty is alert fatigue. Pick the stages that have meaningfully different failure modes — networked extracts, expensive transforms, downstream-visible loads — and skip the trivial ones.
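The diagnostic value of milestones can be sketched as a small lookup: given each stage's deadline and the set of pings received so far, find the first stage that is overdue. The stage names and deadlines below are illustrative, and `first_overdue_stage` is a hypothetical helper rather than anything CronAlert exposes.

```python
from datetime import datetime


def first_overdue_stage(deadlines, received, now):
    """Return the earliest stage whose deadline has passed without a ping, else None.

    deadlines: list of (stage_name, deadline) in pipeline order.
    received: set of stage names whose milestone ping has arrived.
    """
    for stage, deadline in deadlines:
        if stage not in received and now > deadline:
            return stage
    return None


day = datetime(2024, 1, 1)
deadlines = [
    ("extract", day.replace(hour=3)),
    ("transform", day.replace(hour=5, minute=30)),
    ("load", day.replace(hour=7)),
]
# Extract pinged, transform did not, and it is now 06:00.
print(first_overdue_stage(deadlines, {"extract"}, day.replace(hour=6)))  # transform
```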

Pattern 4: Partial-failure detection

A job that ran to completion but errored on some of its work is still a problem. A dbt run that skipped three models. An ETL that wrote rows but failed the post-write data quality check. A backup that completed but couldn't verify the integrity of the final archive.

The fix is to make the finish ping carry status information that a keyword monitor can assert on. Instead of a bare GET to the ping URL, send a POST with a small JSON body:

curl -fsS -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "success",
    "models_run": 197,
    "models_failed": 3,
    "rows_inserted": 4231809
  }' \
  "$FINISH_URL"

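Then configure the CronAlert monitor as a keyword monitor that checks the ping body for "status": "success" and "models_failed": 0. Write each assertion exactly as the payload serializes it, including the space after the colon, because a keyword check is typically a literal text match. A finish ping that arrives but reports failed models, like the example above, trips the keyword check and fires an alert. See the keyword monitoring guide for the assertion patterns.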
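One way to keep the payload and the assertion in sync is to generate the body in code and derive the status field from the counts. The sketch below assumes the keyword check behaves like a plain substring match, which is an assumption about the monitor; `build_finish_payload`, `keywords_present`, and the "partial_failure" label are illustrative names.

```python
import json


def build_finish_payload(models_run, models_failed, rows_inserted):
    """Derive status from the counts so the two fields can never disagree."""
    status = "success" if models_failed == 0 else "partial_failure"
    return json.dumps({
        "status": status,
        "models_run": models_run,
        "models_failed": models_failed,
        "rows_inserted": rows_inserted,
    })


def keywords_present(body, keywords):
    """Plain substring check, standing in for a keyword monitor's assertion."""
    return all(keyword in body for keyword in keywords)


# json.dumps puts a space after each colon by default, so the keywords do too.
REQUIRED = ['"status": "success"', '"models_failed": 0']

print(keywords_present(build_finish_payload(200, 0, 4231809), REQUIRED))  # True
print(keywords_present(build_finish_payload(197, 3, 4231809), REQUIRED))  # False
```

If the payload were serialized compactly, for example with `separators=(",", ":")`, the same keywords would stop matching, so the serializer and the assertion strings are best documented side by side.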

Patterns by tool

Airflow

Airflow DAGs are a natural fit for this pattern because the DAG run itself is a unit with a clear start and end. The cleanest instrumentation uses on_success_callback and on_failure_callback at the DAG level:

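from datetime import datetime

import requests
from airflow import DAG

def ping_success(context):
    # Called once when the whole DAG run succeeds.
    requests.get("https://api.cronalert.com/ping/<finish-id>", timeout=5)

def ping_failure(context):
    # Called once when the DAG run fails.
    requests.get("https://api.cronalert.com/ping/<finish-id>/fail", timeout=5)

dag = DAG(
    'nightly_etl',
    schedule='0 2 * * *',  # Airflow 2.4+; older releases use schedule_interval
    start_date=datetime(2024, 1, 1),  # placeholder; use your pipeline's start date
    on_success_callback=ping_success,
    on_failure_callback=ping_failure,
    catchup=False,
)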

Add a start ping in the first task of the DAG. For per-task milestone pings, attach similar callbacks, each pointed at its own milestone ping URL, to the specific operators that have SLA implications. Resist the urge to instrument every task: DAG-level callbacks plus a handful of milestone tasks is plenty for almost any pipeline.
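For task-level milestones, a small callback factory avoids writing one function per URL. This is a sketch under assumptions: `make_ping_callback` and the `<export-done-id>` placeholder are hypothetical, and the `send` parameter exists only so the HTTP call can be replaced in tests. The returned function takes a single `context` argument, which is the shape Airflow callbacks expect.

```python
import urllib.request


def make_ping_callback(url, send=None):
    """Build an Airflow-style callback (one `context` argument) that pings `url`."""
    def default_send(target):
        urllib.request.urlopen(target, timeout=5).close()

    sender = send or default_send

    def callback(context):
        try:
            sender(url)
        except Exception as exc:  # never let monitoring break the pipeline
            print(f"ping to {url} failed: {exc}")

    return callback


# Hypothetical milestone monitor; attach it to one SLA-sensitive task, e.g.
# PythonOperator(..., on_success_callback=export_done)
export_done = make_ping_callback("https://api.cronalert.com/ping/<export-done-id>")
```

Swallowing the exception is a deliberate choice here: a failed ping surfaces as a missed milestone on the monitoring side instead of as a broken callback inside Airflow.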

dbt

dbt has on-run-start and on-run-end hooks in dbt_project.yml. Combine them with a post-run macro that posts the run summary to a ping URL:

# dbt_project.yml
on-run-start:
  - "{{ post_to_cronalert('start') }}"
on-run-end:
  - "{{ post_to_cronalert('end', results) }}"

The macro reads results (the dbt run results object), counts errors and warnings, and POSTs a JSON body to the CronAlert finish URL. The keyword monitor asserts on the body for zero errors. Documenting the macro and the keyword assertion together in the repo makes the contract clear for future engineers.
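If posting from inside a macro is not practical in your setup, a hedged alternative is to summarize dbt's `target/run_results.json` artifact from a wrapper script after the run. The sketch below assumes the artifact's top-level `results` list with a `status` field per node; the payload field names and `summarize_run_results` itself are illustrative choices, not part of dbt or CronAlert.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_run_results(path="target/run_results.json"):
    """Count node statuses in dbt's run_results.json and build a ping payload."""
    results = json.loads(Path(path).read_text())["results"]
    counts = Counter(result["status"] for result in results)
    # Models report "error"; tests report "fail". Treat both as errors.
    errors = counts.get("error", 0) + counts.get("fail", 0)
    return {
        "status": "success" if errors == 0 else "partial_failure",
        "errors": errors,
        "skipped": counts.get("skipped", 0),
        "total": len(results),
    }
```

The returned dictionary can be serialized with `json.dumps` and sent as the body of the finish ping, so the keyword assertion on zero errors works the same way as with the macro approach.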

Spark

Spark jobs run inside a driver that has access to the application lifecycle. Add a SparkListener for application start, application end, and stage failures, with each listener firing a ping. For batch jobs submitted via spark-submit, wrap the submission in the shell pattern from earlier — start ping, run, finish ping conditional on exit code.

Generic batch scripts

The shell wrapper from Pattern 1 is the universal pattern. Anything that runs from cron, systemd, a CI runner, or a Kubernetes Job can be wrapped in a short shell script that handles start, finish, and failure pings without modifying the underlying job at all.

Common pitfalls

Setting the max-duration alert to the average run time

Normal runs take 4 hours, but some legitimately take 5.5 hours. A 4-hour ceiling fires a false positive on every one of those slow but healthy runs. Set the ceiling against the worst acceptable run, not the typical one.
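One rough way to pick a starting ceiling is to take the slowest run you consider healthy and add a margin. The sketch below is only that: the 10% headroom and the sample durations are made-up assumptions, and `suggest_ceiling_hours` is a hypothetical helper.

```python
def suggest_ceiling_hours(healthy_durations, headroom=1.1):
    """Ceiling = slowest known-healthy run plus a margin (the margin is an assumption)."""
    if not healthy_durations:
        raise ValueError("need at least one healthy run duration")
    return max(healthy_durations) * headroom


recent = [3.9, 4.1, 4.0, 5.5, 4.2]  # hours; illustrative history, not real data
print(round(suggest_ceiling_hours(recent), 2))  # roughly 6 hours
```

Treat the result as a first guess to review against what downstream consumers can tolerate, since the right ceiling is ultimately a business decision and not a statistic.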

Pinging start from outside the job

If the scheduler pings "start" and then the job fails to launch (image missing, runner down), the start ping fires but no work happens. Move the start ping inside the job's entry point so it only fires if the job actually starts executing.

Routing every retry attempt to paging

Most batch jobs have transient failures — a flaky downstream API, a brief network blip, a transient lock. Routing every retry attempt to your on-call paging trains people to ignore alerts. Send transient failures to a logging or metrics destination; only page on final failure after retries are exhausted.
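The split between transient and final signals can be sketched as a retry loop with two separate hooks. All names here are illustrative: `on_transient` would write to logs or metrics, while `on_final` would send the heartbeat or fail ping. Backoff between attempts is omitted to keep the sketch short.

```python
def run_with_retries(job, max_attempts, on_transient, on_final):
    """Run job() up to max_attempts times; report only the final outcome to paging.

    on_transient(attempt, exc): called for each failed attempt (logs/metrics).
    on_final(succeeded): called exactly once (the heartbeat or fail ping).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job()
        except Exception as exc:
            on_transient(attempt, exc)
        else:
            on_final(True)
            return True
    on_final(False)
    return False
```

However many attempts fail along the way, the paging channel sees exactly one event per run, which keeps a flaky downstream from turning into a stream of pages.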

Monitoring every task in a 200-task DAG

Two hundred monitors of unequal importance produce one signal: alert fatigue. Monitor the DAG run, then add task-level monitors only for the handful of tasks with customer-visible SLAs.

Ignoring the silent recovery

A job that failed Monday and ran fine Tuesday should produce a "recovered" signal, not silence. CronAlert sends recovery notifications by default when the finish ping arrives correctly after a failure. Make sure your alerting destinations surface those — the all-clear is a meaningful signal in incident postmortems.

Combining the patterns

A full setup for a nightly ETL might look like this:

Start heartbeat — expected daily at 02:00 +/- 15 minutes. Catches scheduler failures and runner death.
Finish heartbeat — expected daily by 08:00. Catches stuck or hung jobs.
Milestone heartbeats — extract done by 03:00, transform done by 05:30, load done by 07:00. Catches stage-level hangs and pinpoints the failing stage.
Keyword check on the finish payload — assert "errors": 0 in the JSON body. Catches partial failures that completed without erroring out.

Combined with CronAlert's consecutive-check verification and the right routing through PagerDuty or Splunk On-Call, you have full visibility into a job that runs for six hours and complete confidence the right person gets paged when it fails.

Frequently asked questions

How is batch-job monitoring different from cron-job heartbeat monitoring?

Cron heartbeats assume a single short job and a single end-of-run ping. Batch jobs run too long for that to be enough — they need paired start/finish events, max-duration alerts, and milestone pings for the parts of the work that matter.

What's the difference between a heartbeat alert and a max-duration alert?

Heartbeats fire when a ping doesn't arrive in time. Max-duration alerts fire when a job has been running for too long, even though the heartbeat hasn't yet expired. Long jobs need both.

Should I monitor every Airflow task?

No. Monitor the DAG run and the handful of tasks with customer-visible SLAs. Everything else lives in Airflow's UI.

How do I detect partial failures?

Send the finish ping as a POST with status JSON, then use keyword monitoring to assert on the body. A successful POST with "errors": 5 trips the assertion and fires an alert.

How do I handle retries?

Ping only after retries are exhausted or the job ultimately succeeds. Route transient retry failures to logging/metrics, not to paging — every retry-as-page trains the team to ignore alerts.

Get started

Pick your most important batch job, the one whose silent failure would be discovered by an angry user. Create a free CronAlert account, add a start heartbeat and a finish heartbeat, wire them in with the shell wrapper from Pattern 1, and set the finish window to the maximum tolerable duration. The next time the job runs long, you'll know before anyone else does.