A monolith goes down, and you know it -- one process dies, the whole application is gone. Microservices fail differently. One service starts returning 500s. Another service retries, backs off, and then times out. A third service that depends on the second starts queuing requests. The user sees a page that is half-loaded: their profile picture appears, but their order history is a spinner that never resolves. Your top-level health check still returns 200 because the gateway process is alive. Nothing is technically "down." Everything is broken.

Monitoring microservices for uptime is harder than monitoring a monolith because failure is partial, cascading, and often invisible to any single check. This guide covers the patterns that actually work: health check design, dependency monitoring, service mesh observability, and why external monitoring from outside your infrastructure is the piece that ties everything together. If you are new to uptime monitoring, start with our intro to uptime monitoring.

Why microservices need a different monitoring strategy

In a monolith, a health check endpoint can verify the entire application in one request. It checks the database, the cache, the file system, and returns a single status. If it is green, the application works. If it is red, something is wrong.

Microservices break this assumption. Each service has its own health, its own dependencies, and its own failure modes. A healthy order service does not mean the payment service is healthy. A healthy API gateway does not mean any of the backends behind it are healthy. You need to monitor at multiple layers:

  • Individual service health -- is each service running and able to serve requests?
  • Dependency health -- can each service reach its databases, caches, and downstream services?
  • End-to-end path health -- can a real user complete a full workflow that spans multiple services?
  • Infrastructure health -- are the API gateway, service mesh, DNS, and load balancers all routing correctly?

Internal probes cover the first two. External monitoring covers the last two. You need both.

Health check patterns that actually work

Not all health checks are equal. The difference between a useful health endpoint and a useless one determines whether you catch outages in seconds or discover them from customer complaints.

Shallow health checks (liveness)

A shallow check returns 200 if the process is running. It does not check databases, downstream services, or anything external. Its only job is to answer: "Is this process alive and accepting HTTP connections?"

GET /healthz
200 OK

Shallow checks are fast (sub-millisecond) and should never fail unless the process itself has crashed or is deadlocked. Use these for container orchestration -- Kubernetes liveness probes, load balancer health checks, and service mesh sidecar routing. They should not be your external monitoring target because they tell you almost nothing about whether the service actually works.
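
To make the distinction concrete, here is a minimal sketch of a shallow liveness endpoint using Python's standard library. The route and port are illustrative; the point is that the handler touches nothing outside the process:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class LivenessHandler(BaseHTTPRequestHandler):
    """Shallow check: returns 200 without touching any dependency.
    If this responds at all, the process is alive and accepting
    HTTP connections -- nothing more is claimed."""

    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep orchestrator probes out of the request log

def serve(port: int = 8080):
    # Run in the service's main process, so a crash or deadlock
    # takes the endpoint down with it.
    HTTPServer(("", port), LivenessHandler).serve_forever()
```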

Deep health checks (readiness)

A deep check verifies that the service can do its job. It tests database connectivity, cache availability, and optionally the reachability of critical downstream services. A deep check that fails means the service is running but cannot serve real requests.

GET /health/ready
200 OK
{
  "status": "ok",
  "db": "connected",
  "cache": "connected",
  "orderService": "reachable"
}

Deep checks are what you want for external monitoring. They catch the case where the process is alive but a database connection pool is exhausted, a Redis instance has restarted, or a downstream service has moved to a new address. Use keyword monitoring on these endpoints to verify the response body contains "status":"ok" -- not just that the endpoint returns 200.
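
A deep check can be sketched as a function that runs a set of dependency probes and maps the results to a status code and body. The probe names and response shape below are illustrative, mirroring the example response above:

```python
import json

def readiness_payload(checks: dict) -> tuple:
    """Run each dependency probe; any failure makes the whole
    endpoint return 503 so external monitors catch it.
    `checks` maps a dependency name to a zero-argument callable
    returning True when the dependency is reachable."""
    results = {}
    healthy = True
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that raises counts as failing
        results[name] = "connected" if ok else "failing"
        healthy = healthy and ok
    body = json.dumps({"status": "ok" if healthy else "failing", **results})
    return (200 if healthy else 503), body
```

A keyword monitor on this endpoint would match `"status": "ok"` in the body, not just the status code.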

Composite health endpoints

For architectures with many internal services, create a composite health endpoint on your API gateway or a dedicated health aggregator. This endpoint calls the deep health check of every critical service and returns an aggregated status:

GET /health/system
200 OK
{
  "status": "degraded",
  "services": {
    "auth": "ok",
    "orders": "ok",
    "payments": "failing",
    "notifications": "ok"
  }
}

This gives you a single URL to monitor externally that covers your entire service fleet. Return 200 only when all critical services are healthy. Return 503 when any critical service is failing. Return 200 with a "degraded" status when non-critical services are down -- and use keyword monitoring to differentiate between "ok" and "degraded."
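
That aggregation rule -- 503 when a critical service fails, 200 with "degraded" when only non-critical ones do -- can be sketched in a few lines. The service names and the critical/non-critical split are assumptions for illustration:

```python
import json

def composite_status(services: dict, critical: set) -> tuple:
    """Map per-service statuses to one HTTP code and JSON body.
    `services` maps service name to "ok" or "failing";
    `critical` names the services whose failure means a 503."""
    failing = {name for name, status in services.items() if status != "ok"}
    if failing & critical:
        code, overall = 503, "failing"
    elif failing:
        code, overall = 200, "degraded"  # keyword monitoring tells these apart
    else:
        code, overall = 200, "ok"
    return code, json.dumps({"status": overall, "services": services})
```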

Timeout your dependency checks. A deep health check that waits 30 seconds for a dead database to respond is worse than no health check at all. Set aggressive timeouts (500ms-1s) on each dependency check within your health endpoint. If a dependency does not respond in time, report it as failing and move on.
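
One way to enforce those timeouts, sketched in Python with a worker thread so a hung dependency cannot stall the endpoint (the 500ms default mirrors the guidance above):

```python
from concurrent.futures import ThreadPoolExecutor

def probe_with_timeout(probe, timeout_s: float = 0.5) -> bool:
    """Run one dependency probe, but never wait longer than
    timeout_s. A dead database should make the check report
    failure fast, not hang the whole health endpoint."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return bool(pool.submit(probe).result(timeout=timeout_s))
    except Exception:
        return False  # timeout and probe error both count as failing
    finally:
        pool.shutdown(wait=False)  # never block on a hung probe
```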

Service mesh considerations

If you run a service mesh like Istio, Linkerd, or Consul Connect, you already have internal observability -- mTLS between services, retry policies, circuit breakers, and distributed tracing. It is tempting to think the mesh handles monitoring. It does not.

A service mesh monitors traffic within the mesh. It knows that Service A's requests to Service B are failing. But it does not know whether a user on the public internet can reach Service A in the first place. The mesh operates inside your infrastructure. External monitoring operates from outside it.

Specific gaps that a service mesh does not cover:

  • Ingress gateway failures. The mesh sidecar is healthy on every pod, but the ingress gateway that bridges external traffic into the mesh has crashed or is misconfigured. Internal mesh traffic flows fine. External traffic gets a connection refused.
  • DNS and certificate issues. Your domain's DNS record is wrong, or the TLS certificate on the edge has expired. The mesh does not manage external DNS or edge certificates -- those are outside its scope.
  • CDN and WAF interference. A CDN rule is caching error responses, or a WAF is blocking legitimate requests. Traffic never reaches the mesh at all.
  • Mesh control plane outages. If the Istio control plane goes down, sidecar configurations stop updating. Existing connections may continue working, but new deployments and routing changes fail silently. The mesh cannot alert you about its own control plane being down.

Use your service mesh for internal observability and traffic management. Use external multi-region monitoring to verify the full path from user to service is working. They are complementary, not interchangeable.

Dependency monitoring and cascade detection

The hardest failures to detect in a microservices architecture are cascading failures. Service A depends on Service B, which depends on Service C. Service C's database runs out of connections. Service C starts responding slowly. Service B's requests to Service C start timing out. Service B's thread pool fills up waiting for Service C. Service A's requests to Service B start timing out. Now three services are failing, but the root cause is a single database connection pool.

To catch cascading failures early:

  • Monitor leaf services, not just edge services. The service at the bottom of your dependency tree is the one most likely to cause a cascade. If you only monitor the API gateway, you will not know which downstream service caused the failure.
  • Check response times, not just status codes. A cascade often starts with latency increases before it escalates to errors. A service returning 200 in 5 seconds instead of its usual 50ms is the early warning sign. CronAlert records response times for every check, giving you visibility into gradual degradation.
  • Monitor your databases and caches externally. If your service exposes its database health in its deep health check, an external monitor on that endpoint catches database failures even before the service itself starts returning errors.
  • Set up monitors for each critical dependency path. If your checkout flow hits auth, inventory, payments, and shipping in sequence, monitor each of those services individually. When checkout breaks, you immediately see which dependency is the root cause.
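
The latency-based early warning above can be sketched as a simple rule: compare recent response times against a historical baseline. The median-of-samples approach and the threshold factor are illustrative choices, not CronAlert behavior:

```python
def latency_alert(samples_ms: list, baseline_ms: float, factor: float = 3.0) -> bool:
    """Flag gradual degradation: True when the median of recent
    response times exceeds factor x the historical baseline, even
    if every response was still a 200."""
    median = sorted(samples_ms)[len(samples_ms) // 2]
    return median > baseline_ms * factor
```

A service with a 50ms baseline that starts answering in 5 seconds trips this rule long before error rates climb.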

Circuit breakers protect your services. Monitoring protects your users. A circuit breaker stops Service A from being dragged down by Service B's failure. But it does not notify anyone that Service B is down. You still need alerting via Slack, email, or webhook to know something is wrong and trigger your incident response process.

What to monitor externally in a microservices architecture

You cannot (and should not) point an external monitor at every internal service. Here is what to target:

  1. Every public API route. Each distinct route that users or clients hit through your API gateway needs its own monitor. /api/v1/users, /api/v1/orders, /api/v1/products -- each one. A routing change can break one path while leaving others functional.
  2. The composite health endpoint. One monitor that covers the aggregate health of all internal services. This is your early warning for any internal service failure.
  3. Authentication endpoints. If your auth service is down, nothing works -- even if every other service is healthy. Monitor your login, token refresh, and OAuth callback endpoints specifically.
  4. Webhook receivers. If your architecture ingests events from external services (Stripe webhooks, GitHub events, partner callbacks), monitor those endpoints. A dead webhook receiver means lost events and silent data inconsistency.
  5. Status pages and public dashboards. If you run a status page, monitor it too. The irony of a status page that is down during an outage is real and avoidable.

Automating monitor lifecycle for microservices

Microservices architectures change frequently. New services get deployed, old ones get decommissioned, routes get added and removed. Manual monitor management does not scale past a handful of services.

Integrate monitor creation into your deployment pipeline using CronAlert's REST API:

curl -X POST https://cronalert.com/api/v1/monitors \
  -H "Authorization: Bearer ca_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-service-health",
    "url": "https://api.example.com/checkout/health",
    "method": "GET",
    "expectedStatusCode": 200,
    "keyword": "\"status\":\"ok\""
  }'

Build this into your CI/CD pipeline so that every new service automatically gets a monitor, and decommissioned services have their monitors cleaned up. Store monitor definitions alongside your service code -- if the health endpoint URL changes, the monitor definition updates in the same commit.
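
A sketch of that automation in Python, using only the POST endpoint shown in the curl example above; the payload builder and the `{service}-health` naming convention are project-level assumptions, not CronAlert requirements:

```python
import json
import urllib.request

API = "https://cronalert.com/api/v1/monitors"  # POST endpoint from the curl example

def monitor_definition(service: str, base_url: str) -> dict:
    """Build the monitor payload for a service's deep health
    endpoint. Store this alongside the service's code so URL
    changes ship in the same commit."""
    return {
        "name": f"{service}-health",
        "url": f"{base_url}/{service}/health",
        "method": "GET",
        "expectedStatusCode": 200,
        "keyword": '"status":"ok"',
    }

def create_monitor(api_key: str, payload: dict) -> None:
    # Called from the deploy pipeline after a successful rollout.
    req = urllib.request.Request(
        API,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```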

For Kubernetes-based microservices, you can also derive monitors from your ingress definitions. See our Kubernetes monitoring guide for the full approach to automating monitor creation from kubectl get ingress output.

Alert routing for microservices teams

In a microservices organization, different teams own different services. The payments team should not get woken up because the notification service is down, and vice versa. Route alerts to match your ownership model:

  • Service-level alert channels. Each service's monitors alert to the owning team's Slack channel. The checkout team gets checkout alerts. The auth team gets auth alerts.
  • Severity-based escalation. Critical services (auth, payments) alert via Slack and email simultaneously. Less critical services alert via email only during business hours.
  • Composite health alerts go to platform/SRE. The aggregate health endpoint alert goes to the platform team or SRE on-call, since cascading failures need someone with cross-service visibility.
  • Use separate channels for staging and production. Staging alerts should be visible but not urgent. Production alerts should interrupt. CronAlert lets you configure different alert channels per monitor to make this separation clean.

Multi-region monitoring for distributed services

Microservices deployed across multiple regions add another dimension of complexity. Your US deployment might be healthy while the EU deployment has a broken database migration. Geographic routing means users in different regions hit different instances of the same service.

CronAlert's multi-region monitoring checks from 5 locations simultaneously -- US East, US West, EU West, EU Central, and AP Southeast. For microservices with global traffic, this catches region-specific failures that a single-region monitor would miss entirely. You can configure alerts to fire when a specific number of regions fail, filtering out transient network issues while catching real regional outages.

FAQ

Should I monitor every microservice individually or just the public-facing ones?

Monitor every public-facing endpoint externally. For internal services, expose their health through a composite health endpoint on your API gateway or a dedicated health aggregator service, and monitor that externally. You do not need a separate external monitor for each internal service -- but you need their status surfaced through an endpoint that external monitoring can reach. The goal is that no critical service can fail without an external monitor detecting it.

How do I monitor microservices behind an API gateway?

Point your monitors at the gateway's public routes, not internal service addresses. This tests the full path including gateway routing, authentication middleware, rate limiting, and load balancing. Create a monitor for each critical route through the gateway. If the gateway returns a 503 because a backend service is down, your monitor catches it. If the gateway itself is misconfigured and routing to the wrong backend, your keyword monitoring catches that too.

What is the difference between a shallow and deep health check for microservices?

A shallow health check returns 200 if the service process is running -- it checks nothing else. A deep health check verifies the service can actually do its job by testing database connections, cache availability, and downstream service reachability. Use shallow checks for liveness (container orchestration and load balancer routing). Use deep checks for readiness and external monitoring. Both should have aggressive timeouts so a single slow dependency does not make the health check itself hang.

How many monitors do I need for a microservices architecture?

At minimum: one per public-facing endpoint plus one for a composite health endpoint. A typical microservices application with 5-10 public routes and a health aggregator needs 6-11 monitors. CronAlert's free plan includes 25 monitors with 3-minute checks, which covers most microservices architectures. Larger systems benefit from the Pro plan (100 monitors, 1-minute intervals) or Team plan (500 monitors) as your service count grows.

Start monitoring your microservices from the outside

Internal health checks, service meshes, and circuit breakers handle failure within your infrastructure. They do not tell you whether a real user, hitting your public URL from a real browser on a real network, can actually use your product. That is the gap external monitoring fills -- and in a microservices architecture, where failures are partial and cascading, it is the most important signal you have.

Create a free CronAlert account to start monitoring your microservices endpoints. The free plan covers 25 monitors with 3-minute checks and alerts via email, Slack, Discord, and webhook. When you need 1-minute intervals, multi-region checks, or more monitors, plans start at $4/month.