Your code can be perfect. Your servers can be healthy. Every internal monitor can be green. And your site can still be effectively down because Stripe stopped accepting charges, or Twilio stopped delivering SMS, or AWS us-east-1 is having one of its bad days.

Modern applications depend on a stack of third-party services, and your overall uptime is capped by theirs. The standard advice — "subscribe to their status page" — is necessary but not sufficient. Status pages often lag the actual outage by 15-30 minutes, are sometimes vague, and arrive after your customers have started filing tickets.

This post walks through how to monitor third-party dependencies so you know about their outages on your own timeline: what to watch, how to hit their APIs without burning rate limits, how to design integrations that fail gracefully, and how to communicate the difference between "our problem" and "their problem" to your users.

Why dependency outages catch teams off guard

Most uptime monitoring is built around "is the site responding 200." When the site is up, it is up. But a third-party dependency outage usually does not take down the site — it takes down a feature. The homepage loads. The login works. The dashboard renders. But checkout silently fails, or password reset emails never arrive, or push notifications stop sending.

The result is a partial outage that registers nowhere: not on a homepage uptime monitor, not on the vendor's status page (they have not confirmed it yet), and not internally as a code regression (because no code changed). The first signal is a customer ticket, then five more, then the support team is buried.

The fix is two parallel monitoring strategies running in tandem: monitor the dependencies directly so you know when they are degraded, and monitor your own integration code paths so you know when your specific use of the dependency is failing.

The dependency map

Before monitoring, list out what you actually depend on. Most teams underestimate this until they sit down and write it. A typical SaaS stack:

  • Payments: Stripe, Paddle, Lemon Squeezy, PayPal.
  • Email: SendGrid, Postmark, Resend, Mailgun, AWS SES.
  • SMS / voice: Twilio, MessageBird, Vonage.
  • Authentication: Auth0, Clerk, WorkOS, Okta, Better Auth (self-hosted).
  • Push notifications: APNs, FCM, OneSignal, Pushover.
  • Cloud infrastructure: AWS (multiple services per account), Cloudflare, Fastly, Vercel, Netlify, Heroku.
  • DNS: Cloudflare, Route 53, NS1, Google Cloud DNS.
  • Search: Algolia, Meilisearch, Elastic Cloud, Typesense.
  • Database hosting: Supabase, Neon, PlanetScale, Aiven, Render.
  • CDN and asset hosting: Cloudflare R2, AWS S3, Cloudinary, Bunny.
  • Webhooks-out infrastructure: Hookdeck, Svix.
  • Analytics: PostHog, Mixpanel, Amplitude, Google Analytics.
  • Monitoring / logging: Sentry, Datadog, Honeycomb, Logtail.

For each dependency, ask two questions: what fails when this is down, and how would I know? Often the answers are "checkout" and "I would not, until customers told me." That is the gap you are filling.

Two layers of monitoring

Layer 1: Direct vendor monitoring

Hit a public endpoint of the vendor's API on a schedule and log the result. This tells you whether the vendor is responding to anyone, regardless of your specific integration.

Most major vendors expose an explicit health or status endpoint. CronAlert can monitor any of these as a regular HTTP monitor:

  • Stripe: https://status.stripe.com — Stripe hosts its own status page. Stripe does not document a public unauthenticated health endpoint, so for a direct API probe use a cheap authenticated call such as GET /v1/balance.
  • Twilio: https://status.twilio.com/api/v2/status.json — Statuspage.io API, returns current incident status.
  • SendGrid: https://status.sendgrid.com/api/v2/status.json
  • AWS: the public AWS Health Dashboard at https://health.aws.amazon.com; the account-specific AWS Health API requires a Business, Enterprise On-Ramp, or Enterprise support plan.
  • Cloudflare: https://www.cloudflarestatus.com/api/v2/status.json
  • GitHub: https://www.githubstatus.com/api/v2/status.json — useful if your CI or webhooks depend on GitHub.
  • OpenAI / Anthropic: public status pages with JSON APIs at status.openai.com and status.anthropic.com.

The Statuspage.io JSON format is standard across many vendors. Use a keyword check on the response body for a string like "indicator":"none" (all systems normal). When the indicator changes to minor, major, or critical, the keyword check fails and you get an alert — usually before the vendor sends an email update.
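As a concrete sketch of that keyword check, the snippet below parses a Statuspage-style status.json body and flags anything other than "none" (evaluateStatus and checkVendor are hypothetical helper names, not CronAlert APIs):

```typescript
// Shape per the Statuspage API:
// { status: { indicator: "none" | "minor" | "major" | "critical", description: string } }
type StatuspageBody = { status?: { indicator?: string; description?: string } };

function evaluateStatus(raw: string): { healthy: boolean; indicator: string } {
  const body = JSON.parse(raw) as StatuspageBody;
  const indicator = body.status?.indicator ?? "unknown";
  // "none" means all systems operational; anything else should trip an alert.
  return { healthy: indicator === "none", indicator };
}

// On a 5-10 minute schedule, fetch the vendor's status JSON and evaluate it.
async function checkVendor(url: string): Promise<void> {
  const res = await fetch(url);
  const { healthy, indicator } = evaluateStatus(await res.text());
  if (!healthy) {
    console.error(`vendor degraded: indicator=${indicator}`); // route to your alerting
  }
}
```

A plain keyword check on the raw body works just as well; parsing the JSON only buys you the indicator value for the alert message.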

Set the check interval to 5-10 minutes for most dependencies. You cannot fix a vendor outage faster than they can; you just need to know it is happening.

Layer 2: Synthetic checks of your integration

Direct vendor monitoring tells you "Stripe is up." Synthetic integration checks tell you "Stripe charges work for our specific account, with our specific API keys, against our specific webhook endpoint."

Build a small synthetic flow that exercises each critical integration end-to-end. Examples:

  • Payment flow: Create a minimal test-mode charge against Stripe (the minimum charge amount is $0.50 for USD) and verify the webhook receives the event.
  • Email flow: Send a transactional email to a dedicated test inbox (a Mailbox.org or Fastmail account works) and check that it arrived within 60 seconds.
  • SMS flow: Send a Twilio SMS to a test number that exposes received messages via API, then verify receipt.
  • Auth flow: Run a test login through the auth provider and confirm a session cookie is set.
  • Search flow: Query the search index for a known document and verify the result.

Each of these flows lives behind a /healthz/integration-name endpoint on your own server, returning 200 if the flow succeeded and a non-200 with details if it failed. CronAlert hits the endpoint on a schedule and alerts on failures. This pattern is essentially a deep health check applied to external dependencies.
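A minimal, framework-agnostic sketch of that endpoint pattern (healthz and the checkFn it wraps are hypothetical names; the synthetic flow itself is whatever exercises your integration):

```typescript
// Result the monitor sees: 200 when the synthetic flow passed,
// 503 with details when it failed.
type HealthResult = { status: number; body: string };

async function healthz(checkFn: () => Promise<void>): Promise<HealthResult> {
  const started = Date.now();
  try {
    await checkFn(); // e.g. create a test-mode charge and wait for its webhook
    return { status: 200, body: JSON.stringify({ ok: true, ms: Date.now() - started }) };
  } catch (err) {
    // Non-200 plus a reason, so the alert message is actionable.
    const error = err instanceof Error ? err.message : String(err);
    return { status: 503, body: JSON.stringify({ ok: false, error }) };
  }
}
```

Mount one of these per integration (/healthz/stripe, /healthz/sendgrid, and so on) and point a monitor at each, so a failure alert already names the broken dependency.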

The big advantage: this catches account-specific issues that a vendor's overall status page would never reflect. Your Stripe API keys got rotated and the new ones never deployed. Your SendGrid sender identity got revoked because someone clicked a spam complaint. Your Twilio account hit its monthly cap. None of these show up on the vendor's status page; all of them break your integration.

Specific patterns for major vendors

Stripe

Monitor: the status page at status.stripe.com (direct), or a cheap authenticated call such as GET /v1/balance, plus a synthetic flow that creates a test-mode PaymentIntent and verifies the webhook callback. Alert on any failed probe, and on missing webhook delivery within 60 seconds.

The most common Stripe-related failures teams miss: webhooks getting silently disabled because too many recent deliveries failed, API keys getting rotated without the new ones deploying, and Stripe Connect platform actions failing while direct charges still work. A synthetic flow catches all three.

Twilio

Monitor: status.twilio.com JSON, plus a synthetic flow that sends an SMS to a test number you control. Alert on any non-"operational" status component you depend on (Programmable SMS, Voice, Verify).

Twilio outages tend to be regional or carrier-specific — the API is up but messages to a specific carrier are queuing. Watching the JSON for component-level status catches these.
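The same Statuspage API also exposes per-component status at /api/v2/components.json, which is how you scope alerts to just the components you depend on. A sketch (the component names are illustrative):

```typescript
// Shape per the Statuspage API: each component has a name and a status of
// "operational", "degraded_performance", "partial_outage", or "major_outage".
type Component = { name: string; status: string };

function degradedComponents(raw: string, watched: string[]): Component[] {
  const { components } = JSON.parse(raw) as { components: Component[] };
  // Only surface components you actually use that are not fully operational.
  return components.filter(
    (c) => watched.includes(c.name) && c.status !== "operational"
  );
}
```
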

SendGrid / Postmark / Resend

Monitor: the vendor's status JSON, plus a synthetic email send to a test inbox you can read programmatically. Alert if the email does not arrive within 60 seconds.

Email vendor outages are particularly bad because they affect password resets and other "I cannot log in to fix this" workflows. The blast radius is high; the customer impact is hard to recover from. Synthetic monitoring of the full deliver-and-receive loop is the only reliable signal.

AWS

Monitor: the AWS Health API (per-region, per-service; note it requires a Business, Enterprise On-Ramp, or Enterprise support plan), plus actual usage of each service from your app's health endpoint. Alert on any "issue" or "outage" status for services you use.

AWS regional events (us-east-1 is the famous one) cascade in non-obvious ways. A region having issues with Lambda might also have issues with API Gateway, S3, and DynamoDB at the same time, but each shows up as a separate signal. Direct API calls from your own infrastructure are often the fastest indicator.

Cloudflare

Monitor: the Cloudflare status JSON. Plus, since CronAlert runs on Cloudflare's edge, multi-region uptime checks already give you signal on edge-network issues for the regions you check from.

Cloudflare incidents are unusual because the same provider that is having issues might also be the provider running the monitoring. If you are deeply on Cloudflare, consider a secondary monitoring source (UptimeRobot or a different provider) for the specific case of "Cloudflare control plane is down." Or just rely on the status page and accept that monitoring during a Cloudflare outage is best-effort.

Auth providers (Auth0, Clerk, WorkOS)

Monitor: the vendor's status JSON, plus a synthetic login that obtains and validates a token. Auth outages have outsized impact because no one can log in to anything during them, including your support team.

Have a fallback plan. Some teams keep an emergency local-auth path that works only for staff accounts during auth provider outages, so the team can still access the admin panel to communicate. See internal tools monitoring for the related angle.

Designing for graceful degradation

Monitoring is one half of the answer. The other half is making your code resilient to dependency outages.

Circuit breakers and timeouts

Every third-party call should have an aggressive timeout — 2 to 5 seconds for synchronous user-facing calls, 30 seconds for background jobs. Without this, a slow vendor degrades into a hung application: every web worker is blocked waiting on Stripe, the connection pool fills up, and your site goes down because your dependency went slow.
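A minimal timeout wrapper along those lines, assuming a Promise-based client (withTimeout is a hypothetical helper):

```typescript
// Race the vendor call against a timer; clear the timer afterwards so a
// stray timeout does not keep the process alive.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// e.g. await withTimeout(stripeClient.paymentIntents.create(...), 3_000)
```

Note that Promise.race only abandons the slow call; it does not cancel the underlying request. Clients that accept an AbortSignal (fetch with AbortSignal.timeout(ms) in Node 17.3+) can actually tear the connection down, which is preferable when available.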

Wrap critical calls in a circuit breaker (libraries like opossum for Node, resilience4j for JVM, circuitbreaker for Go). After N consecutive failures, the breaker opens and subsequent calls fail fast for a configurable period. This protects you from cascading failures during a vendor outage.
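The libraries above are the battle-tested option; the mechanics are small enough to sketch by hand (states: closed, open, then allowing trial requests after a cooldown):

```typescript
// Minimal circuit breaker: trips open after `threshold` consecutive failures,
// fails fast while open, and allows requests again after `cooldownMs`.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold: number, private cooldownMs: number) {}

  // Call before the vendor call; false means fail fast (or serve a fallback).
  allowRequest(now = Date.now()): boolean {
    if (this.failures < this.threshold) return true; // closed
    return now - this.openedAt >= this.cooldownMs;   // open until cooldown elapses
  }

  recordSuccess(): void {
    this.failures = 0; // close the breaker
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    // Re-arm the cooldown on every failure at or past the threshold,
    // so a failed trial request keeps the breaker open.
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```
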

Queue and retry for non-user-facing flows

Email sends, push notifications, webhook deliveries, and analytics events should go through a queue with retry rather than firing inline. When SendGrid is down, the queue absorbs the load; when SendGrid recovers, the queue drains. The user does not see the outage at all.
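The retry schedule is the detail that matters: exponential backoff with a cap keeps the queue from hammering a recovering vendor. A sketch (backoffMs is a hypothetical helper; the drain loop is pseudocode):

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, ... up to 5 minutes.
function backoffMs(attempt: number, baseMs = 1_000, capMs = 300_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Drain-loop pseudocode: a worker pops a job, tries the send, and on failure
// requeues it to run again after backoffMs(job.attempts):
//
//   while ((job = queue.pop())) {
//     try { await sendEmail(job); }
//     catch { queue.pushAfter(job, backoffMs(job.attempts++)); }
//   }
```

Adding random jitter to each delay is a common refinement, so a burst of failed jobs does not all retry in the same instant.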

Fail open vs fail closed

Decide explicitly per dependency, not by default:

  • Fail closed for state-mutating user actions where partial success is unrecoverable. Payment processing — do not let the user complete checkout if Stripe is unreachable. Account changes, billing operations, signups that require email verification.
  • Fail open with a banner for read-mostly flows. Search autocomplete, recommendations, recent activity feeds, optional integrations. Show a clean fallback state with a message ("Search is temporarily unavailable, try again in a few minutes").
  • Fail open silently for telemetry. Analytics, logging, error reporting. Never let your analytics outage cause the user-facing app to fail.
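One way to keep the per-dependency decision explicit in code is a policy wrapper; a sketch with illustrative names:

```typescript
type FailurePolicy = "closed" | "open-banner" | "open-silent";

// Wrap a vendor call in an explicit failure policy:
// - "closed": rethrow so the calling flow aborts (payments, signups)
// - "open-banner": return the fallback flagged as degraded, so the UI shows a notice (search)
// - "open-silent": return the fallback unflagged (analytics, telemetry)
async function callWithPolicy<T>(
  policy: FailurePolicy,
  call: () => Promise<T>,
  fallback: T
): Promise<{ value: T; degraded: boolean }> {
  try {
    return { value: await call(), degraded: false };
  } catch (err) {
    if (policy === "closed") throw err;
    return { value: fallback, degraded: policy === "open-banner" };
  }
}
```

The point is that every call site names its policy; a dependency can no longer fail open by accident just because someone forgot a try/catch.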

Telling customers what is happening

When a dependency outage causes a partial outage, customers want to know what is broken and what is not. The two failure modes:

  • Silent partial outage: "We are open for business" while checkout is broken. Customers who hit the broken flow think it is them, give up, and may not return.
  • Vague generic outage: "We are experiencing issues" while everything except one feature is fine. Drives away customers who could complete their workflow.

The fix is a status page that distinguishes specific feature impact:

Investigating: Customers may experience errors completing checkout. All other features (browsing, account management, dashboard) are operating normally. We are tracking a Stripe incident that may be related: status.stripe.com.

Update the status page automatically from your monitoring when possible — CronAlert's status page integrates with monitor state so a downed monitor reflects on the status page within seconds. Be specific, link to the vendor's status page so customers can verify, and update again when the vendor confirms.

Frequently asked questions

Why isn't subscribing to my vendor's status page enough?

Status pages often lag the actual outage by 15-30 minutes because a human has to confirm the issue and write the update. By the time the update posts, your customers have been hitting errors for half an hour. Status pages are useful as confirmation, not as your primary signal. Synthetic checks running on your schedule are faster.

Which third-party dependencies should I monitor first?

The ones whose failure mode is total — payments, auth, transactional email, SMS for 2FA. These have direct revenue or signup impact when they break. Then add infrastructure (AWS, CDN, DNS) and high-traffic data services. Skip dependencies that are invisible to users.

How do I monitor a third-party API without burning my rate limits?

Use the vendor's public status feed or dedicated health endpoint when one exists; these are unauthenticated and separate from your API quota. For vendors without one, hit a low-cost read endpoint at a 5-10 minute interval. Real-time monitoring of vendors is unnecessary; you cannot fix their outage faster than they can.

Should I fail open or fail closed when a dependency is down?

Fail closed for unrecoverable state changes (payments, signups). Fail open with a banner for read-mostly features (search, recommendations). Fail open silently for telemetry. The decision should be explicit per dependency.

How do I tell my customers when a third-party dependency is down?

Update your status page immediately, even if the vendor has not confirmed. Be specific about scope ("checkout is unavailable due to a payment provider issue") rather than generic ("experiencing issues"). Link to the vendor's status page once they catch up.

Start monitoring your dependencies

Most outage postmortems eventually point at a third-party service. The teams that handle them well are the teams that knew about the outage at the same time as their customers, not 30 minutes after.

Create a free CronAlert account — monitor any vendor's public status JSON with keyword checks, monitor your synthetic integration health endpoints from multiple regions, and route alerts to the right team for each kind of failure. The free plan covers 25 monitors, which is enough for most teams' core dependency map.

For the broader incident response framework, see incident response for small teams. For the cost calculation that justifies this work, how to calculate the cost of downtime.