System Design | Medium
What is the difference between liveness and readiness probes, and why does it matter for service discovery?
Liveness and readiness probes look similar but answer two completely different questions — and confusing them is one of the most common causes of microservice outages.
Liveness — "is this process still alive?"
- If it fails: the kubelet restarts the container.
- What it checks: can the process respond to a basic HTTP request? Is it deadlocked or otherwise hung without having exited?
- Should NOT check: downstream dependencies (DB, message broker, other services). Otherwise a transient DB blip gets a perfectly healthy container restarted for no reason.
app.MapHealthChecks("/health/live", new() {
    Predicate = c => c.Tags.Contains("live") // only the "self" check
});
Readiness — "can this pod serve traffic right now?"
- If it fails: Kubernetes removes the pod from the Service's Endpoints list but doesn't restart it.
- What it checks: the dependencies you need to handle a real request, such as the DB connection, cache, and critical downstreams.
- Why this matters for discovery: an unready pod receives no traffic through the Service, so callers that resolve it via the Service never reach it. When it recovers, it's added back automatically.
app.MapHealthChecks("/health/ready", new() {
    Predicate = c => c.Tags.Contains("ready") // DB + downstreams
});
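The two endpoints above filter by tag, so the checks have to be registered with those tags somewhere. A minimal sketch of that wiring, assuming an ASP.NET Core minimal-hosting app: the check names "self" and "database" and the DatabaseHealthCheck class with its PingDatabaseAsync placeholder are hypothetical and not part of any library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    // "self" does no I/O: if the process can answer at all, it is alive.
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
    // Hypothetical dependency check, tagged so only the readiness endpoint runs it.
    .AddCheck<DatabaseHealthCheck>("database", tags: new[] { "ready" });

var app = builder.Build();

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("live")
});
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready")
});

app.Run();

// Hypothetical readiness check; the name and the ping are illustrative only.
public sealed class DatabaseHealthCheck : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await PingDatabaseAsync(cancellationToken);
            return HealthCheckResult.Healthy("Database reachable.");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database unreachable.", ex);
        }
    }

    // Placeholder: swap in a real round trip, such as a "SELECT 1" query.
    private static Task PingDatabaseAsync(CancellationToken cancellationToken)
        => Task.CompletedTask;
}
```

With the default options an Unhealthy result is returned as HTTP 503 and a Healthy one as 200, which lines up with how a Kubernetes HTTP probe decides pass or fail (any status from 200 to 399 passes).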
Why the split matters
Imagine your DB has a 10-second hiccup, and your probes are tuned tightly enough to notice it:
- Same probe for both: every pod fails liveness → Kubernetes restarts them all → cold-start storm → a real outage that outlasts the hiccup.
- Split probes: every pod fails readiness → drained from the Service → no traffic reaches pods that cannot serve it. When the DB recovers, readiness passes → pods are added back. Zero restarts. Self-healing.
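Whether the hiccup in the scenario above actually trips a probe depends on the probe's periodSeconds and failureThreshold settings. The sketch below works through that arithmetic under a simplified model that ignores timeoutSeconds and scheduling jitter. The function OutageNeededToTrip is a hypothetical helper, not a Kubernetes API.

```csharp
using System;

// Simplified model: a probe fires every periodSeconds, and the kubelet acts
// after failureThreshold consecutive failures.
// Earliest:   the outage must cover failureThreshold probe instants, which
//             span (failureThreshold - 1) periods in the best-aligned case.
// Guaranteed: any window of failureThreshold full periods always contains
//             failureThreshold probe instants.
static (int Earliest, int Guaranteed) OutageNeededToTrip(int periodSeconds, int failureThreshold)
{
    if (periodSeconds <= 0 || failureThreshold <= 0)
        throw new ArgumentOutOfRangeException(nameof(periodSeconds), "Both values must be positive.");

    return ((failureThreshold - 1) * periodSeconds, failureThreshold * periodSeconds);
}

// Kubernetes defaults: periodSeconds = 10, failureThreshold = 3.
var defaults = OutageNeededToTrip(periodSeconds: 10, failureThreshold: 3);
Console.WriteLine($"defaults: may trip after {defaults.Earliest}s, certain after {defaults.Guaranteed}s");
// prints "defaults: may trip after 20s, certain after 30s"

// A tighter, made-up configuration that a 10-second hiccup would trip.
var tight = OutageNeededToTrip(periodSeconds: 5, failureThreshold: 2);
Console.WriteLine($"tight: may trip after {tight.Earliest}s, certain after {tight.Guaranteed}s");
// prints "tight: may trip after 5s, certain after 10s"
```

Under this model a 10-second blip goes unnoticed with default settings but restarts every pod under the tighter ones. That is the reason to keep dependency checks out of liveness rather than relying on lenient thresholds.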
Plus the startup probe (enabled by default since Kubernetes 1.18, GA in 1.20)
For slow-starting services (EF Core migrations, heavy DI, JIT warmup), add a startupProbe. It runs first; liveness/readiness only kick in after startup passes. This prevents a slow start being mistaken for a crash.
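On the application side, one way to back a startupProbe is a health check that stays unhealthy until initialization finishes. The sketch below follows that pattern under some assumptions: the /health/startup path, the "startup" tag, the WarmupService class, and the five-second delay are all made up for illustration, and the pod spec's startupProbe would need to point at the same path.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Hosting;

var builder = WebApplication.CreateBuilder(args);

// Register one shared instance so the worker and the endpoint see the same flag.
builder.Services.AddSingleton<StartupHealthCheck>();
builder.Services.AddHostedService<WarmupService>();
builder.Services.AddHealthChecks()
    .AddCheck<StartupHealthCheck>("startup", tags: new[] { "startup" });

var app = builder.Build();

app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("startup")
});

app.Run();

// Reports Unhealthy until StartupCompleted is set to true.
public sealed class StartupHealthCheck : IHealthCheck
{
    private volatile bool _completed;

    public bool StartupCompleted
    {
        get => _completed;
        set => _completed = value;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(_completed
            ? HealthCheckResult.Healthy("Startup finished.")
            : HealthCheckResult.Unhealthy("Still starting."));
}

// Hypothetical stand-in for slow initialization work (migrations, cache warmup).
public sealed class WarmupService : BackgroundService
{
    private readonly StartupHealthCheck _startup;

    public WarmupService(StartupHealthCheck startup) => _startup = startup;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken); // made-up duration
        _startup.StartupCompleted = true;
    }
}
```

Keeping the flag in a dedicated check means the liveness endpoint can stay a trivial self-check even while the service is still warming up.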
Rule of thumb
- Liveness = self-check only. Cheap, fast, never calls out to another service.
- Readiness = checks every dependency that matters for serving traffic.
- Startup = grace period for slow boots.
Get this right and your service discovery is self-healing. Get it wrong and you'll page on-call every time a DB twitches.