
Retries amplify outages when they turn a local failure into a system-wide load spike. They feel safe (“just try again”), but under stress they create positive feedback loops.
Here’s the anatomy and the failure modes.
The core dynamic (retry storms)
- Partial failure occurs: a dependency slows down or returns errors (DB, cache, auth, downstream API).
- Clients retry automatically: often immediately, often all at once.
- Load increases exactly where capacity is lowest: retries stack on top of in-flight work.
- Latency rises → timeouts rise → more retries: the feedback loop kicks in.
- Healthy components get dragged down: thread pools, queues, and connection pools saturate.
- The outage spreads: what was a brownout becomes a blackout.
This is why retries are a load multiplier, not a resilience feature by default.
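To see why the loop runs away rather than settling, here is a toy model with made-up numbers: a dependency degrades from 1,000 rps of capacity to 600 rps, users keep sending 900 rps, and every timed-out request is retried once.

```python
# Toy model of the retry feedback loop (illustrative numbers, not a benchmark).
# Assumption: every timed-out request is retried exactly once, immediately.

capacity = 600.0   # rps the degraded dependency can actually serve
baseline = 900.0   # rps arriving from users, unchanged during the incident

offered = baseline
for step in range(6):
    served = min(offered, capacity)
    timed_out = offered - served            # excess load times out
    print(f"t={step}: offered={offered:.0f} rps, timeouts={timed_out:.0f} rps")
    offered = baseline + timed_out          # next interval: new traffic + retries
```

Offered load climbs every interval even though user demand never changed; with more than one retry per timeout it climbs faster still.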
Common amplification patterns
1. Retry × fan-out
One request fans out to N downstream calls.
- 1 user request → 5 backend calls
- Each backend call is attempted up to 3 times
- Worst-case load = 15 downstream calls per user request
During an outage, this multiplier applies to every request, and it compounds at every additional layer that also retries.
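The same arithmetic as a tiny helper (the function name is just for illustration):

```python
def worst_case_calls(fan_out: int, attempts_per_call: int) -> int:
    """Downstream calls generated by one user request when every
    backend call fails and exhausts all of its attempts."""
    return fan_out * attempts_per_call

print(worst_case_calls(fan_out=5, attempts_per_call=3))  # 15, the example above
print(worst_case_calls(fan_out=5, attempts_per_call=5))  # 25, after someone "just bumps the retries"
```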
2. Synchronized retries (thundering herd)
- Clients share the same timeout (e.g., 1s)
- They all retry at the same moment
- The result is a traffic spike every timeout interval
This often shows up as sawtooth latency graphs.
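A quick way to see the synchronization effect is to simulate it. This sketch uses made-up numbers: 1,000 clients whose first attempt fails at the same instant, comparing a fixed 1 s retry delay against the same retries spread with jitter.

```python
import random
from collections import Counter

# 1,000 clients whose first attempt fails at t=0.
# Fixed 1s timeout: every retry lands in the same 100ms bucket.
# Jitter over [0, 1s): the same retries spread across the whole second.
clients = 1000
fixed_retry_times = [1.0 for _ in range(clients)]
jittered_retry_times = [random.uniform(0.0, 1.0) for _ in range(clients)]

def worst_bucket(times, bucket=0.1):
    """Count retries per 100ms bucket and return the busiest one."""
    buckets = Counter(int(t / bucket) for t in times)
    return max(buckets.values())

print("worst 100ms bucket, fixed timeout:", worst_bucket(fixed_retry_times))     # 1000
print("worst 100ms bucket, jittered:     ", worst_bucket(jittered_retry_times))  # ~100
```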
3. Retrying non-idempotent work
- Writes, payments, state transitions
- Retries cause duplication, contention, or compensating work
- The system slows further due to cleanup and locks
Even if correctness is preserved, performance collapses.
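The usual mitigation is to make the retried operation idempotent, so a duplicate attempt is deduplicated rather than re-executed. A minimal in-memory sketch; the `charge_card` function and its key store are hypothetical, and real systems persist the keys in a database or dedicated store:

```python
# Idempotency-key sketch: a retried request with the same key returns the
# stored result instead of re-running the side effect.
_results: dict[str, dict] = {}   # in-memory for illustration only

def charge_card(idempotency_key: str, amount_cents: int) -> dict:
    if idempotency_key in _results:          # retry of work we already did
        return _results[idempotency_key]
    result = {"status": "charged", "amount_cents": amount_cents}  # real side effect goes here
    _results[idempotency_key] = result
    return result

first = charge_card("req-123", 5000)
retry = charge_card("req-123", 5000)   # duplicate retry: no second charge
assert first is retry
```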
4. Retries hide failures from callers
- Upstream sees “slowness” instead of errors
- Load shedding never triggers
- Traffic keeps flowing into a degraded system
This delays human and automated response.
5. Backpressure inversion
- Downstream is overloaded
- Upstream retries harder
- The system punishes the component that is already failing
This is the opposite of graceful degradation.
Why retries feel correct—but aren’t
Retries optimize for success probability of a single request, not for:
- Fleet health
- Tail latency
- Recovery time
- Blast radius
At scale, local optimism becomes global pessimism.
Signals you’re in retry amplification
- Error rates flat, but latency skyrockets
- CPU low, but thread pools exhausted
- Traffic higher during the outage than during normal operation
- Downstream QPS > upstream QPS
- Recovery takes longer than failure onset
Retries often make outages longer, not just worse.
Design rules to prevent amplification
1. Retry budgets
Cap retries as a fraction of baseline traffic.
- Example: retries ≤ 10% of successful requests
- When the budget is exhausted → fail fast
This preserves the overall shape of traffic even when everything downstream is failing.
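One way to implement this is a token-style budget that successful requests refill and retries spend. A minimal sketch, assuming the 10% ratio from the example above (not any particular library's API):

```python
class RetryBudget:
    """Retries may consume at most ~10% of the tokens earned by successful
    requests. When the budget is empty, fail fast instead of retrying."""

    def __init__(self, ratio: float = 0.10, max_tokens: float = 100.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def on_success(self) -> None:
        # Each success earns a fraction of a retry token.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self) -> bool:
        # Spend one token per retry; refuse when exhausted.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

budget = RetryBudget()
# In the request path:
#   if call_failed and budget.can_retry(): retry once
#   else: return the error immediately (fail fast)
```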
2. Exponential backoff + jitter (mandatory)
- No fixed intervals
- No immediate retries
- Randomization breaks synchronization
Without jitter, backoff is mostly theater.
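A minimal sketch of exponential backoff with full jitter; the base and cap values are placeholders to tune per dependency:

```python
import random
import time

def backoff_with_full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay before retry `attempt` (1-based), in seconds.
    Full jitter: pick uniformly from [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(call, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise                                   # out of attempts: surface the error
            time.sleep(backoff_with_full_jitter(attempt))  # never retry immediately
```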
3. Fail fast when dependency is unhealthy
- Circuit breakers
- Adaptive timeouts
- Error-rate-based shedding
Errors are cheaper than timeouts.
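A stripped-down circuit breaker shows the fail-fast path; the threshold, cooldown, and single-probe half-open policy here are simplifying assumptions:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, fail fast while open,
    and let a probe through after a cooldown (half-open)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True          # half-open: allow a probe request
        return False             # open: fail fast, no call, no timeout spent

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```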
4. Push retries to the edge
- Fewer layers retrying
- Ideally, only one place retries
- Never retry blindly at every hop
Retries compose badly: attempts multiply across layers, as the arithmetic below shows.
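Worst-case attempts are the product of the attempt counts at each hop. With hypothetical per-layer numbers:

```python
from math import prod

# Worst-case amplification when every layer retries independently:
# total attempts = product of attempts at each hop.
attempts_per_layer = [3, 3, 3]        # e.g., edge, service, and data layer each try 3 times
print(prod(attempts_per_layer))       # 27x load on the deepest dependency

# Retrying only at the edge keeps it linear:
print(prod([3, 1, 1]))                # 3x
```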
5. Distinguish retryable vs terminal failures
- Timeouts → retryable by default
- Overload signals (429, 503 with Retry-After) must stop retries
Respect backpressure explicitly.
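A sketch of one reasonable classification for HTTP-style responses; the exact status-code policy is an assumption to adapt to your protocol:

```python
def is_retryable(status: int | None, timed_out: bool) -> bool:
    """Timeouts and plain 5xx are retryable; explicit overload (429, 503)
    and other 4xx are terminal for this layer: stop retrying and surface
    the error so shedding and Retry-After handling can happen upstream."""
    if timed_out:
        return True
    if status in (429, 503):
        return False            # explicit backpressure: do not add load
    if status is None:
        return False
    if 400 <= status < 500:
        return False            # terminal client error
    return status >= 500        # transient server error
```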
6. Prefer hedging over retries (carefully)
- Send a second request before the first times out
- Only when capacity exists
- Cancel the loser
This reduces tail latency without retry storms—but is dangerous if misused.
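A minimal asyncio sketch of a hedged call; the hedge delay (roughly p95 latency) is a placeholder, and the capacity check is deliberately omitted:

```python
import asyncio

async def hedged(call, hedge_after_s: float = 0.2):
    """Start one attempt; if it hasn't finished after `hedge_after_s`,
    start a second. First result wins; the loser is cancelled so it
    stops consuming capacity."""
    first = asyncio.create_task(call())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()
    second = asyncio.create_task(call())
    done, pending = await asyncio.wait({first, second},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                      # cancel the loser
    return done.pop().result()
```

Only hedge idempotent work, and gate the second attempt on spare capacity (for example, the retry budget above); otherwise hedging is just a politer retry storm.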
Mental model to keep
Retries trade load for per-request success probability.
During outages, load is the scarce resource.
If you want, I can:
- Map this to a concrete architecture (RPC, Kafka, HTTP, gRPC)
- Walk through a real outage timeline with metrics
- Compare retries vs circuit breakers vs hedged requests
- Help design a retry policy that doesn’t melt the system
Just tell me where you want to go next.

