How to Monitor AI APIs and LLM Features (OpenAI, Anthropic, and Your Own Endpoints)

Q: How do I monitor OpenAI or Anthropic API availability?

Use two layers. Monitor the provider's status feed for confirmed incidents (both publish Statuspage-style JSON endpoints you can check with an HTTP monitor), and monitor your own integration with a lightweight canary endpoint in your app that makes a tiny real completion request (a few tokens against the cheapest model) and returns 200 or 503. The canary catches what status pages miss: regional degradation, your key being rate limited or suspended, and latency that has degraded past the point of usability, all of which can happen while the status page is still green.

Q: Why is a status page not enough for monitoring an LLM provider?

Provider status pages typically confirm incidents 15-45 minutes after they start, mark partial degradation as operational, and say nothing about problems specific to your account — exhausted quota, an expired payment method, a revoked key, or rate limits that only bite at your traffic level. Your users experience all of those identically to a full outage. Direct monitoring of your own integration is the only signal that covers your actual failure surface.

Q: What timeout should I set when monitoring AI endpoints?

Much higher than for normal APIs, but derived from your canary, not your full workload. A canary request that generates a handful of tokens against a fast model normally completes in 1-3 seconds, so a 10-15 second timeout catches real degradation without triggering false positives on normal variance. Do not monitor with production-sized prompts: when a request normally takes 30 seconds, there is no room to distinguish slow from broken.

Q: How do I monitor streaming LLM responses?

Streams fail differently: the request returns 200 and starts streaming, then dies mid-response — a failure mode status-code monitoring can never see. Have your canary endpoint consume a short stream to completion internally and report whether it received the final chunk. Externally, the monitor just sees the canary's 200 or 503. If you serve server-sent events to your own users, monitor that path the same way.

Q: Should AI feature outages page on-call?

Depends on whether AI is the product or a feature. If your core value proposition is the AI (a writing assistant, a support bot replacing your inbox), treat it like checkout — page immediately. If AI augments an otherwise functional product (summaries, suggestions), route alerts to Slack and lean on graceful degradation so the feature hides or queues instead of erroring. The worst setup is the common one: no monitoring at all, where you learn about provider outages from your own support tickets.

Teams ship AI features with retry logic, fallbacks, and prompt evals — and no monitoring. Then OpenAI has a degraded afternoon, or the Anthropic key hits a quota cliff, or p95 latency quietly triples after a model update, and the first signal is a customer asking why the assistant has been "thinking" for two minutes.

LLM APIs are third-party dependencies with extra failure modes stacked on top: they degrade without going down, they throttle per-account, they fail mid-stream after returning a 200, and their normal latency is so variable that "slow" and "broken" blur together. This guide covers how to monitor the providers, your own AI endpoints, and the gap between them.

How AI APIs fail (it's rarely a clean outage)

Degradation, not downtime. The API answers, but p95 goes from 4 seconds to 40. For a user-facing feature that is an outage; on the provider's status page it is often "operational" or, at best, "elevated error rates" half an hour later.
429s at your traffic level. Rate limits are per-account and per-model. Your peak hour can hit a throttling wall that affects nobody else on the planet — invisible on any status page, very visible to your users. Our guide to monitoring API rate limits and 429s covers headroom heartbeats that warn you on approach.
Account-level cutoffs. Spend limits, expired payment methods, revoked or rotated keys, org policy changes. The provider is fine; you are down. Teams discover these on weekends with depressing regularity.
Mid-stream failures. Streaming responses return 200 immediately, then die halfway through generation. Any monitor that only looks at the status code scores this as success.
Silent model changes. Deprecations, snapshot retirements, and behavior shifts after upgrades. Not strictly an uptime problem, but the operational handling is the same: detect fast, fall back gracefully.

The common thread: most of these never register as "down" from the provider's perspective. Your monitoring has to measure your integration, not their marketing.

Layer 1: monitor the provider directly

Still worth doing — it is how you distinguish "their problem" from "our problem" in the first minute of an incident. Both major providers expose machine-readable status:

OpenAI: https://status.openai.com/api/v2/status.json — Statuspage JSON; check it with keyword monitoring for "indicator":"none", and alert when the keyword disappears.
Anthropic: https://status.anthropic.com/api/v2/status.json — same Statuspage format, same pattern.
Azure OpenAI / Bedrock / Vertex: covered by the cloud provider's status feeds, which are notoriously slow to confirm — all the more reason for Layer 2.

Treat these as confirmation signals, not detection signals. The detection layer is yours.
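If you want more than keyword matching, the same check can be scripted. The following is a minimal sketch, assuming the feed follows the usual Statuspage v2 shape with a status.indicator field whose healthy value is "none". The function names classifyProviderStatus and checkProviderStatus are hypothetical, and the fetch helper assumes a Node.js version with global fetch and AbortSignal.timeout.

```javascript
// Interpret a Statuspage-style status.json payload.
// Assumed shape: { status: { indicator: 'none' | 'minor' | ..., description } }.
function classifyProviderStatus(payload) {
  const indicator = payload?.status?.indicator;
  if (typeof indicator !== 'string') {
    // Unexpected payload: report "unknown" rather than guessing.
    return { healthy: false, indicator: 'unknown' };
  }
  return { healthy: indicator === 'none', indicator };
}

// Hypothetical helper: fetch the feed and classify it. A fetch failure or a
// non-2xx response is reported as "unknown", not as a provider incident.
async function checkProviderStatus(url, timeoutMs = 10_000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    if (!res.ok) return { healthy: false, indicator: 'unknown' };
    return classifyProviderStatus(await res.json());
  } catch {
    return { healthy: false, indicator: 'unknown' };
  }
}
```

Keeping "unknown" separate from a real indicator value avoids treating a status-page outage as a provider incident, which fits the role of this layer as confirmation and not detection.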

Layer 2: a canary endpoint that exercises a real completion

The single highest-value monitor for an AI feature is a small authenticated route in your own app that makes a real but tiny LLM call and collapses the result into a status code — the same health-endpoint pattern used for databases and queues, applied to your AI dependency:

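import express from 'express';
import Anthropic from '@anthropic-ai/sdk';

const app = express();
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

app.get('/health/ai', async (req, res) => {
  // Fail closed: if HEALTH_TOKEN is unset, a request with no header must not pass.
  const token = process.env.HEALTH_TOKEN;
  if (!token || req.headers['x-health-token'] !== token) {
    return res.status(401).json({ status: 'unauthorized' });
  }

  const start = Date.now();
  try {
    const completion = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',   // cheapest/fastest tier
      max_tokens: 8,
      messages: [{ role: 'user', content: 'Reply with the word OK.' }],
    }, { timeout: 10_000, maxRetries: 0 }); // no SDK retries, so a 429 or timeout surfaces immediately

    const latency = Date.now() - start;
    const text = completion.content[0]?.text ?? '';

    // Empty output or a slow answer counts as degraded, not healthy.
    if (!text || latency > 8000) {
      return res.status(503).json({ status: 'degraded', latency_ms: latency });
    }
    return res.json({ status: 'ok', latency_ms: latency });
  } catch (err) {
    // Separate throttling from every other failure so the alert is actionable.
    const status = err.status === 429 ? 'rate_limited' : 'error';
    return res.status(503).json({ status });
  }
});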

Design notes that matter:

Use the cheapest, fastest model tier and a few output tokens. At one check per minute this costs cents per month and keeps the latency baseline tight enough to alert on.
Use your production API key (or one in the same org/project). The point is to inherit your account's rate limits, quota, and billing state — a separate "monitoring key" hides exactly the failures you want to catch.
Return degraded on slow, not just on error. The latency threshold turns "technically up but unusable" into an alert. More on choosing thresholds in our guide to setting timeouts and response-time thresholds.
Distinguish 429 in the response body. When the alert fires, "rate_limited" versus "error" is the difference between raising your limits and opening a support ticket.

Point a CronAlert monitor at the endpoint with the token in a custom request header, a 15-second timeout, and keyword matching on "status":"ok". At 1-minute intervals you will usually know about a degraded provider within a minute or two, well ahead of a status page that tends to lag by 15-45 minutes.
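To make the external check concrete, here is a rough sketch of what such a monitor effectively evaluates: HTTP 200 plus the keyword, within the timeout. The names isCanaryHealthy and probeCanary are hypothetical and this is not CronAlert's implementation. It assumes a Node.js version with global fetch and AbortSignal.timeout.

```javascript
// Healthy means: HTTP 200 and the body contains the exact keyword.
function isCanaryHealthy(statusCode, bodyText) {
  return statusCode === 200 && bodyText.includes('"status":"ok"');
}

// Hypothetical probe: send the token header and apply a 15-second timeout.
async function probeCanary(url, token) {
  try {
    const res = await fetch(url, {
      headers: { 'x-health-token': token },
      signal: AbortSignal.timeout(15_000),
    });
    return isCanaryHealthy(res.status, await res.text());
  } catch {
    return false; // a timeout or network failure counts as unhealthy
  }
}
```

The keyword match is whitespace-sensitive. Express serializes res.json output compactly by default, so "status":"ok" matches as written, but pretty-printed JSON (for example with the "json spaces" setting enabled) would not.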

Streaming: the failure mode status codes can't see

If your feature streams tokens to users, a completed-response canary is not enough — streams fail after the 200. The fix is to make the canary consume a short stream to completion internally and only report healthy if the final chunk arrived. Externally it is still a clean 200/503 for the monitor; internally it exercises the exact path that breaks.
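A minimal sketch of the "consume to completion" check follows. It is written against a generic async iterable of events, so it does not depend on any one SDK. streamCompleted and isFinalEvent are hypothetical names. With Anthropic's streaming API the terminal event has type message_stop; other providers signal completion differently, so the predicate is an assumption you supply.

```javascript
// Drain a stream and report healthy only if the terminal event was seen
// before the deadline. Mid-stream errors and early endings both return false.
async function streamCompleted(events, isFinalEvent, timeoutMs = 10_000) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });

  const drain = (async () => {
    let sawFinal = false;
    for await (const event of events) {
      if (isFinalEvent(event)) sawFinal = true;
    }
    return sawFinal; // stream ended; healthy only if it ended properly
  })();

  try {
    return await Promise.race([drain, deadline]);
  } catch {
    return false; // the stream threw after the initial 200
  } finally {
    clearTimeout(timer);
  }
}
```

Inside the canary route, the boolean result would map to 200 or 503 in the same way as the non-streaming check. Note that this sketch stops waiting at the deadline but does not cancel the underlying request; a real implementation would likely also abort it.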

If you relay streams to browsers over server-sent events, the relay is its own failure surface — buffering proxies, idle timeouts, HTTP/2 quirks. The SSE monitoring guide covers that half of the problem.

Fallbacks, and monitoring the fallback

The standard resilience pattern for AI features is provider fallback: primary model fails or times out, route to a second provider or a smaller self-hosted model. Two monitoring implications teams miss:

Monitor the fallback path even when it is idle. A fallback that has silently broken — expired key, deprecated model name — is discovered at the worst possible moment, during a primary outage. Give it its own canary at a slower interval.
Alert when the fallback activates, not just when both paths fail. Sustained fallback traffic means degraded quality or higher cost, and it means your primary is having problems your users have not noticed yet. A heartbeat or webhook from your routing layer to a Slack channel does the job.
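The second point can be sketched as a small routing wrapper that emits a signal whenever the fallback path is taken. This is an illustration under assumptions, not a prescribed design: completeWithFallback, primary, fallback, and onFallback are hypothetical names, and onFallback stands in for whatever heartbeat or webhook call your routing layer makes.

```javascript
// Try the primary provider; on failure, signal the activation and use the fallback.
// primary and fallback are async functions that perform the completion.
async function completeWithFallback(primary, fallback, onFallback) {
  try {
    return { provider: 'primary', result: await primary() };
  } catch (primaryError) {
    // Fire and forget: a failing notifier must never fail or delay the user request.
    try {
      Promise.resolve(onFallback(primaryError)).catch(() => {});
    } catch {
      // onFallback threw synchronously; ignore it for the same reason
    }
    // If the fallback also fails, the error propagates to the caller.
    return { provider: 'fallback', result: await fallback() };
  }
}
```

Returning which provider served the request lets the caller log or count it, which makes sustained fallback traffic visible as a trend as well as an alert.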

Routing the alerts

Calibrate urgency to what AI is in your product:

AI is the product (assistant, copilot, support bot replacing humans): page on-call like a checkout outage. Provider incident or not, you own the user experience.
AI is a feature (summaries, suggestions, autocomplete): Slack alert, plus graceful degradation in the product — hide the feature or queue the work rather than surfacing errors. The alert fatigue guide covers the routing mechanics.
Either way, post it on your status page. "AI features degraded due to an upstream provider incident" deflects tickets and is more honest than silence. A status page with a dedicated component for AI features makes this a 30-second update.

Frequently asked questions

How do I monitor OpenAI or Anthropic API availability?

Monitor their status JSON for confirmation, and run your own canary endpoint that makes a tiny real completion with your production credentials for detection. The canary catches account-level and degradation failures that status pages never show.

Why is a status page not enough for monitoring an LLM provider?

It lags incidents by 15-45 minutes, undercounts partial degradation, and is blind to your account's rate limits, quota, and key state — which fail your users just as hard as a provider outage.

What timeout should I set when monitoring AI endpoints?

Size it to the canary, not the workload: a few-token completion on a fast model is normally 1-3 seconds, so 10-15 seconds separates degraded from normal. Never monitor with production-sized prompts.

How do I monitor streaming LLM responses?

Have the canary consume a short stream to completion and report 503 if the final chunk never arrives. Status-code-only monitoring scores a dead stream as a success.

Should AI feature outages page on-call?

Page if AI is the product; Slack plus graceful degradation if it is a feature. The unacceptable option is the default one — finding out from support tickets.

Monitor your AI stack with CronAlert

AI features inherit every failure mode of a third-party API and add degradation, throttling, and mid-stream death on top. A status-feed monitor, a canary endpoint with a latency threshold, and a fallback-path check cover the whole surface — about 20 minutes of setup. Create a free account (25 monitors, no credit card), and the next provider incident will be something you announce to your users instead of something they announce to you.