Teams ship AI features with retry logic, fallbacks, and prompt evals — and no monitoring. Then OpenAI has a degraded afternoon, or the Anthropic key hits a quota cliff, or p95 latency quietly triples after a model update, and the first signal is a customer asking why the assistant has been "thinking" for two minutes.

LLM APIs are third-party dependencies with extra failure modes stacked on top: they degrade without going down, they throttle per-account, they fail mid-stream after returning a 200, and their normal latency is so variable that "slow" and "broken" blur together. This guide covers how to monitor the providers, your own AI endpoints, and the gap between them.

How AI APIs fail (it's rarely a clean outage)

  • Degradation, not downtime. The API answers, but p95 goes from 4 seconds to 40. For a user-facing feature that is an outage; on the provider's status page it is often "operational" or, at best, "elevated error rates" half an hour later.
  • 429s at your traffic level. Rate limits are per-account and per-model. Your peak hour can hit a throttling wall that affects nobody else on the planet — invisible on any status page, very visible to your users.
  • Account-level cutoffs. Spend limits, expired payment methods, revoked or rotated keys, org policy changes. The provider is fine; you are down. Teams discover these on weekends with depressing regularity.
  • Mid-stream failures. Streaming responses return 200 immediately, then die halfway through generation. Any monitor that only looks at the status code scores this as success.
  • Silent model changes. Deprecations, snapshot retirements, and behavior shifts after upgrades. Not strictly an uptime problem, but the operational handling is the same: detect fast, fall back gracefully.

The common thread: most of these never register as "down" from the provider's perspective. Your monitoring has to measure your integration, not their marketing.
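To make the taxonomy concrete, here is a minimal sketch that maps one observed call outcome to the failure modes above. The function name, the field names (httpStatus, latencyMs, streamCompleted), and the 8-second threshold are all assumptions chosen for illustration; they are not part of any provider SDK, and providers differ in which status codes they use for account problems.

```javascript
// Hypothetical helper: classify one observed LLM call outcome.
// Field names and the default threshold are illustrative assumptions.
function classifyOutcome({ httpStatus, latencyMs, streamCompleted = true }, slowMs = 8000) {
  if (httpStatus === 429) return 'rate_limited';        // per-account throttling
  if (httpStatus === 401 || httpStatus === 403) {
    return 'account';                                   // commonly a key or permission problem
  }
  if (httpStatus >= 500) return 'provider_error';       // the provider itself is failing
  if (httpStatus >= 400) return 'request_error';        // e.g. a model name that no longer exists
  if (!streamCompleted) return 'mid_stream_failure';    // 200 first, then the stream died
  if (latencyMs > slowMs) return 'degraded';            // answered, but too slowly to be useful
  return 'ok';
}
```

Only 'provider_error' is likely to show up on a status page; every other non-ok label is visible only from your side of the integration.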

Layer 1: monitor the provider directly

Still worth doing — it is how you distinguish "their problem" from "our problem" in the first minute of an incident. Both major providers expose machine-readable status:

  • OpenAI: https://status.openai.com/api/v2/status.json — Statuspage JSON; check it with keyword monitoring for "indicator":"none", and alert when the keyword disappears.
  • Anthropic: https://status.anthropic.com/api/v2/status.json — same Statuspage format, same pattern.
  • Azure OpenAI / Bedrock / Vertex: covered by their respective cloud providers' status feeds, which are notoriously slow to confirm incidents. All the more reason for Layer 2.

Treat these as confirmation signals, not detection signals. The detection layer is yours.
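If you would rather poll the status JSON from your own code than from a keyword monitor, parsing the body is sturdier than string matching, because it does not depend on how the JSON is whitespace-formatted. The sketch below assumes the Statuspage response shape (a top-level status object with an indicator field); the function names are hypothetical, and the fetch wrapper assumes Node 18 or later for global fetch and AbortSignal.timeout.

```javascript
// Sketch: read the indicator out of a Statuspage-style status.json body.
// Returns 'unknown' for anything that is not the expected shape.
function providerIndicator(bodyText) {
  try {
    const parsed = JSON.parse(bodyText);
    const indicator = parsed?.status?.indicator;
    return typeof indicator === 'string' ? indicator : 'unknown';
  } catch {
    return 'unknown'; // body was not JSON (error page, proxy response, ...)
  }
}

// Hypothetical wrapper around the pure function above (Node 18+).
async function checkProviderStatus(url) {
  const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
  if (!res.ok) return 'unknown';
  return providerIndicator(await res.text());
}
```

Anything other than 'none' is worth attaching to an incident as context, but it should not be the thing that wakes you up.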

Layer 2: a canary endpoint that exercises a real completion

The single highest-value monitor for an AI feature is a small authenticated route in your own app that makes a real but tiny LLM call and collapses the result into a status code — the same health-endpoint pattern used for databases and queues, applied to your AI dependency:

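import express from 'express';
import Anthropic from '@anthropic-ai/sdk';

const app = express();
const client = new Anthropic();   // reads ANTHROPIC_API_KEY from the environment

app.get('/health/ai', async (req, res) => {
  // Reject when the token is missing on either side, so an unset
  // HEALTH_TOKEN cannot match an absent header.
  const expected = process.env.HEALTH_TOKEN;
  if (!expected || req.headers['x-health-token'] !== expected) {
    return res.status(401).json({ status: 'unauthorized' });
  }

  const start = Date.now();
  try {
    const completion = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',   // cheapest/fastest tier
      max_tokens: 8,
      messages: [{ role: 'user', content: 'Reply with the word OK.' }],
    }, { timeout: 10_000, maxRetries: 0 });   // no SDK retries: report the first failure

    const latency = Date.now() - start;
    const text = completion.content[0]?.text ?? '';

    // An empty reply or a slow reply both count as degraded.
    if (!text || latency > 8000) {
      return res.status(503).json({ status: 'degraded', latency_ms: latency });
    }
    return res.json({ status: 'ok', latency_ms: latency });
  } catch (err) {
    // Separate throttling from every other failure in the alert body.
    const status = err.status === 429 ? 'rate_limited' : 'error';
    return res.status(503).json({ status, latency_ms: Date.now() - start });
  }
});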

Design notes that matter:

  • Use the cheapest, fastest model tier and a few output tokens. At one check per minute this costs cents per month and keeps the latency baseline tight enough to alert on.
  • Use your production API key (or one in the same org/project). The point is to inherit your account's rate limits, quota, and billing state — a separate "monitoring key" hides exactly the failures you want to catch.
  • Return degraded on slow, not just on error. The latency threshold turns "technically up but unusable" into an alert. For more on choosing thresholds, see the guide to setting timeouts and response-time thresholds.
  • Distinguish 429 in the response body. When the alert fires, "rate_limited" versus "error" is the difference between raising your limits and opening a support ticket.

Point a CronAlert monitor at the endpoint with the token in a custom request header, a 15-second timeout (longer than the canary's own 10-second SDK timeout), and keyword matching on "status":"ok". At 1-minute intervals you will typically know about a degraded provider 20-40 minutes before their status page admits it.
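For testing the endpoint before wiring up a monitor, the same check can be approximated in a few lines. This is only a sketch of the pass/fail rule described above, not how CronAlert is implemented; the function names are hypothetical, and it assumes Node 18 or later for global fetch and AbortSignal.timeout.

```javascript
// Pure decision step: healthy only on HTTP 200 with the keyword present.
function canaryIsHealthy(httpStatus, bodyText) {
  return httpStatus === 200 && bodyText.includes('"status":"ok"');
}

// Hypothetical probe: sends the token header and gives up after 15 seconds.
async function probeCanary(url, token) {
  try {
    const res = await fetch(url, {
      headers: { 'x-health-token': token },
      signal: AbortSignal.timeout(15_000),
    });
    return canaryIsHealthy(res.status, await res.text());
  } catch {
    return false; // a timeout or network failure counts as unhealthy
  }
}
```

Checking both the status code and the keyword guards against a proxy or error page that happens to answer 200 on the canary's behalf.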

Streaming: the failure mode status codes can't see

If your feature streams tokens to users, a completed-response canary is not enough — streams fail after the 200. The fix is to make the canary consume a short stream to completion internally and only report healthy if the final chunk arrived. Externally it is still a clean 200/503 for the monitor; internally it exercises the exact path that breaks.
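A minimal sketch of that stream-consuming check follows. It takes any async iterable of events and reports healthy only if a terminal event arrives before a deadline. The function name and options are hypothetical; the default terminal type of message_stop assumes Anthropic-style streaming events, so adjust it for whatever your provider emits as its final chunk.

```javascript
// Sketch: consume a stream of events and report whether it finished cleanly.
// Returns true only if the terminal event arrives before the deadline.
async function streamCompleted(events, { timeoutMs = 10_000, terminalType = 'message_stop' } = {}) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs); // stalled stream
  });
  const consume = (async () => {
    try {
      for await (const event of events) {
        if (event.type === terminalType) return true;
      }
      return false; // stream ended without the terminal event
    } catch {
      return false; // stream threw mid-generation
    }
  })();
  try {
    return await Promise.race([consume, deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

In the canary route, you would pass in the result of a streaming call (for the Anthropic SDK, client.messages.create with stream: true) and return 503 when the function resolves to false. The sketch only stops waiting on a stalled stream; a production version would also abort the underlying request.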

If you relay streams to browsers over server-sent events, the relay is its own failure surface — buffering proxies, idle timeouts, HTTP/2 quirks. The SSE monitoring guide covers that half of the problem.

Fallbacks, and monitoring the fallback

The standard resilience pattern for AI features is provider fallback: primary model fails or times out, route to a second provider or a smaller self-hosted model. Two monitoring implications teams miss:

  • Monitor the fallback path even when it is idle. A fallback that has silently broken — expired key, deprecated model name — is discovered at the worst possible moment, during a primary outage. Give it its own canary at a slower interval.
  • Alert when the fallback activates, not just when both paths fail. Sustained fallback traffic means degraded quality or higher cost, and it means your primary is having problems your users have not noticed yet. A heartbeat or webhook from your routing layer to a Slack channel does the job.
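Both points can be sketched in one small routing function: try the primary with a timeout, fall back on failure, and fire a notification whenever the fallback path is taken. Everything here is illustrative. The names completeWithFallback, withTimeout, and onFallback are hypothetical, primary and fallback stand for whatever functions wrap your two providers, and onFallback is where a heartbeat or Slack webhook call would go.

```javascript
// Reject if the promise does not settle within ms.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Sketch: primary first, fallback on error or timeout, notify on activation.
async function completeWithFallback(prompt, { primary, fallback, onFallback, timeoutMs = 10_000 }) {
  try {
    const text = await withTimeout(primary(prompt), timeoutMs);
    return { via: 'primary', text };
  } catch (err) {
    // Fire-and-forget: a slow or broken alert hook must not delay the fallback.
    Promise.resolve().then(() => onFallback?.(err)).catch(() => {});
    const text = await withTimeout(fallback(prompt), timeoutMs);
    return { via: 'fallback', text };
  }
}
```

Returning which path served the request makes it cheap to count sustained fallback traffic. If both paths fail, the function rejects, which is the case your canary and paging alert already cover.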

Routing the alerts

Calibrate urgency to what AI is in your product:

  • AI is the product (assistant, copilot, support bot replacing humans): page on-call like a checkout outage. Provider incident or not, you own the user experience.
  • AI is a feature (summaries, suggestions, autocomplete): Slack alert, plus graceful degradation in the product — hide the feature or queue the work rather than surfacing errors. The alert fatigue guide covers the routing mechanics.
  • Either way, post it on your status page. "AI features degraded due to an upstream provider incident" deflects tickets and is more honest than silence. A status page with a dedicated component for AI features makes this a 30-second update.
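If your alert routing lives in code, the policy above is small enough to write down as one function. This is a sketch only: the aiRole values and channel names ('pager', 'slack') are hypothetical labels, not identifiers from any alerting product.

```javascript
// Sketch: turn "what is AI in this product?" into a routing decision.
// aiRole is 'product' or 'feature'; status is the canary's status string.
function routeAlert({ aiRole, status }) {
  if (status === 'ok') {
    return { channels: [], updateStatusPage: false, degradeGracefully: false };
  }
  const isProduct = aiRole === 'product';
  return {
    channels: isProduct ? ['pager', 'slack'] : ['slack'], // page only when AI is the product
    updateStatusPage: true,                               // post it either way
    degradeGracefully: !isProduct,                        // hide the feature or queue the work
  };
}
```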

Frequently asked questions

How do I monitor OpenAI or Anthropic API availability?

Monitor their status JSON for confirmation, and run your own canary endpoint that makes a tiny real completion with your production credentials for detection. The canary catches account-level and degradation failures that status pages never show.

Why is a status page not enough for monitoring an LLM provider?

It lags incidents by 15-45 minutes, undercounts partial degradation, and is blind to your account's rate limits, quota, and key state — which fail your users just as hard as a provider outage.

What timeout should I set when monitoring AI endpoints?

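Size it to the canary, not the workload: a few-token completion on a fast model normally takes 1-3 seconds, so a 10-15 second timeout cleanly separates a degraded response from a normal one. Never monitor with production-sized prompts; they cost more, and their latency varies too much to alert on.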

How do I monitor streaming LLM responses?

Have the canary consume a short stream to completion and report 503 if the final chunk never arrives. Status-code-only monitoring scores a dead stream as a success.

Should AI feature outages page on-call?

Page if AI is the product; Slack plus graceful degradation if it is a feature. The unacceptable option is the default one — finding out from support tickets.

Monitor your AI stack with CronAlert

AI features inherit every failure mode of a third-party API and add degradation, throttling, and mid-stream death on top. A status-feed monitor, a canary endpoint with a latency threshold, and a fallback-path check cover the whole surface — about 20 minutes of setup. Create a free account (25 monitors, no credit card), and the next provider incident will be something you announce to your users instead of something they announce to you.

Related reading: monitoring third-party dependencies, API endpoint monitoring, monitoring server-sent events, and the complete guide to HTTP health check endpoints.