[Image: AWS cloud map showing US DynamoDB latency and retry options]

Retries amplify outages when they turn a local failure into a system-wide load spike. They feel safe (“just try again”), but under stress they create positive feedback loops.

Here’s the anatomy and the failure modes.

The core dynamic (retry storms)

  1. Partial failure occurs

    • A dependency slows down or returns errors (DB, cache, auth, downstream API).

  2. Clients retry automatically

    • Often immediately, often all at once.

  3. Load increases exactly where capacity is lowest

    • Retries stack on top of in-flight work.

  4. Latency rises → timeouts rise → more retries

    • Feedback loop kicks in.

  5. Healthy components get dragged down

    • Thread pools, queues, connection pools saturate.

  6. Outage spreads

    • What was a brownout becomes a blackout.

This is why retries are a load multiplier, not a resilience feature by default.
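
To make the multiplier concrete, here is a toy model of that feedback loop (every number below is invented for illustration): real demand stays flat, but every failed attempt comes back as extra offered load on the next tick.

# Toy discrete-time model of the feedback loop above. Numbers are made up,
# and failed attempts are always retried on the next tick (no retry cap).
def simulate_retry_storm(ticks=8, new_per_tick=100, capacity=90):
    pending_retries = 0
    for t in range(ticks):
        offered = new_per_tick + pending_retries  # real demand + retry traffic
        served = min(offered, capacity)           # dependency's hard limit
        failed = offered - served                 # everything over the limit fails...
        pending_retries = failed                  # ...and comes back next tick
        print(f"tick {t}: offered={offered} served={served} failed={failed}")

simulate_retry_storm()
# Offered load ratchets upward every tick even though real demand never changed:
# the retries themselves are the extra load.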

Common amplification patterns

1. Retry × fan-out

One request fans out to N downstream calls.

  • 1 user request → 5 backend calls

  • Each backend call retries 3 times

  • Effective load = 15 calls per user request

During an outage, this amplification compounds across every retrying layer; the sketch below makes the arithmetic explicit.
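
A back-of-the-envelope helper makes the compounding visible. The first call below reproduces the 15-call example above; the second is a hypothetical three-layer stack where every hop fans out to 2 calls and tries each of them 3 times.

# Worst-case attempts reaching the bottom layer when every layer fans out
# and retries independently (back-of-the-envelope only).
def worst_case_attempts(fanout_per_layer, attempts_per_call):
    total = 1
    for fanout, attempts in zip(fanout_per_layer, attempts_per_call):
        total *= fanout * attempts
    return total

print(worst_case_attempts([5], [3]))               # 15, as in the example above
print(worst_case_attempts([2, 2, 2], [3, 3, 3]))   # 216 attempts from one user request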

2. Synchronized retries (thundering herd)

  • Clients share the same timeout (e.g., 1s)

  • They all retry at the same moment

  • Causes periodic traffic spikes every timeout interval

This often shows up as sawtooth latency graphs.
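
A small sketch of why the spikes line up (all timings invented): clients that start within the same 100 ms window and share a fixed 1 s timeout all retry in the same 100 ms window, while adding random jitter spreads the same retries over roughly a second.

import random
from collections import Counter

random.seed(0)
# 1,000 clients start within the same 100 ms window and share a fixed 1 s timeout.
starts = [random.uniform(0.0, 0.1) for _ in range(1000)]
fixed_retries = [s + 1.0 for s in starts]                              # retry right at timeout
jittered_retries = [s + 1.0 + random.uniform(0.0, 1.0) for s in starts]

def peak_per_bucket(times, bucket=0.1):
    """Largest number of retries landing in any single time bucket."""
    return max(Counter(int(t / bucket) for t in times).values())

print("peak retries per 100 ms, fixed timeout:", peak_per_bucket(fixed_retries))
print("peak retries per 100 ms, with jitter:  ", peak_per_bucket(jittered_retries))
# The fixed-timeout herd lands almost entirely in one bucket; jitter flattens it.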

3. Retrying non-idempotent work

  • Writes, payments, state transitions

  • Retries cause duplication, contention, or compensating work

  • System slows further due to cleanup and locks

Even if correctness is preserved, performance collapses.
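
The usual mitigation is to make the operation idempotent so a retry replays the result rather than the side effect. A minimal sketch with an invented in-memory store and a client-chosen idempotency key (not any particular payment API):

import uuid

# Hypothetical in-memory idempotency store; a real one must be durable,
# shared across instances, and have an expiry policy.
_results_by_key: dict[str, dict] = {}

def charge(idempotency_key: str, account: str, amount_cents: int) -> dict:
    # A retry carrying the same key returns the original outcome instead of
    # charging the account twice.
    if idempotency_key in _results_by_key:
        return _results_by_key[idempotency_key]
    result = {"charge_id": str(uuid.uuid4()), "account": account,
              "amount_cents": amount_cents, "status": "captured"}
    _results_by_key[idempotency_key] = result
    return result

key = str(uuid.uuid4())                # chosen once by the caller, reused on retry
first = charge(key, "acct-42", 500)
retried = charge(key, "acct-42", 500)  # caller timed out and tried again
assert first == retried                # no duplicate charge

This fixes correctness, not load: the retry still consumes a request slot, so it still needs a budget and backoff.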

4. Retries hide failures from callers

  • Upstream sees “slowness” instead of errors

  • Load shedding never triggers

  • Traffic keeps flowing into a degraded system

This delays human and automated response.

5. Backpressure inversion

  • Downstream is overloaded

  • Upstream retries harder

  • The system punishes the component that is already failing

This is the opposite of graceful degradation.

Why retries feel correct—but aren’t

Retries optimize for success probability of a single request, not for:

  • Fleet health

  • Tail latency

  • Recovery time

  • Blast radius

At scale, local optimism becomes global pessimism.

Signals you’re in retry amplification

  • Error rates flat, but latency skyrockets

  • CPU low, but thread pools exhausted

  • Traffic higher during outages than normal

  • Downstream-to-upstream QPS ratio well above its normal fan-out baseline

  • Recovery takes longer than failure onset

Retries often make outages longer, not just worse.
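
These signals are cheap to compute from counters you almost certainly already have. A rough sketch (the metric names and thresholds are invented placeholders, not a standard):

# Rough retry-amplification check. Inputs and thresholds are placeholders for
# whatever your telemetry actually exposes and whatever your baseline looks like.
def amplification_warnings(upstream_qps: float, downstream_qps: float,
                           retry_qps: float, success_qps: float,
                           normal_fanout: float = 5.0) -> list[str]:
    warnings = []
    if upstream_qps > 0 and downstream_qps / upstream_qps > 2 * normal_fanout:
        warnings.append("downstream/upstream QPS ratio is over 2x its normal fan-out")
    if success_qps > 0 and retry_qps / success_qps > 0.10:
        warnings.append("retries exceed 10% of successes: budget blown")
    return warnings

print(amplification_warnings(upstream_qps=1_000, downstream_qps=12_000,
                             retry_qps=2_500, success_qps=4_000))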

Design rules to prevent amplification

1. Retry budgets

Cap retries as a fraction of baseline traffic.

  • Example: retries ≤ 10% of successful requests

  • When budget exhausted → fail fast

This preserves system shape under failure.
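
A minimal sketch of a retry budget (the 10% ratio, the per-window reset, and the class shape are illustrative choices, not any library's API):

import threading

class RetryBudget:
    """Allow retries only up to a fixed fraction of recent successful requests.

    Minimal sketch: counters are reset by the caller per window; a production
    version would use a sliding window or token bucket.
    """
    def __init__(self, ratio: float = 0.10):
        self.ratio = ratio
        self.successes = 0
        self.retries = 0
        self._lock = threading.Lock()

    def record_success(self) -> None:
        with self._lock:
            self.successes += 1

    def try_acquire_retry(self) -> bool:
        with self._lock:
            if self.retries < self.successes * self.ratio:
                self.retries += 1
                return True
            return False        # budget exhausted: fail fast instead of retrying

    def reset_window(self) -> None:
        with self._lock:
            self.successes = 0
            self.retries = 0

budget = RetryBudget()
for _ in range(100):
    budget.record_success()
print(budget.try_acquire_retry())   # True: within the 10% budget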

2. Exponential backoff + jitter (mandatory)

  • No fixed intervals

  • No immediate retries

  • Randomization breaks synchronization

Without jitter, backoff is mostly theater.
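
A common shape for this is capped exponential backoff with full jitter, roughly as sketched below (the base and cap values are illustrative):

import random

def backoff_with_full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay before retry number `attempt` (1, 2, 3, ...): exponential growth
    capped at `cap`, then a uniform random draw so clients desynchronize."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

for attempt in range(1, 5):
    print(f"retry {attempt}: wait {backoff_with_full_jitter(attempt):.3f}s before sending")

Drawing uniformly over the whole interval (rather than adding a small random offset to a fixed delay) is what actually breaks the synchronization.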

3. Fail fast when dependency is unhealthy

  • Circuit breakers

  • Adaptive timeouts

  • Error-rate-based shedding

Errors are cheaper than timeouts.
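
A bare-bones circuit breaker sketch (the failure threshold and cooldown are invented; production versions track state per endpoint and limit half-open probes):

import time

class CircuitBreaker:
    """Open after too many consecutive failures; fail fast while open;
    allow a probe after a cooldown. Minimal sketch, not production code."""
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                          # closed: normal operation
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True                          # half-open: let a probe through
        return False                             # open: fail fast, no retry

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()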

4. Push retries to the edge

  • Fewer layers retrying

  • Ideally only one place retries

  • Never retry blindly at every hop

Retries compose badly.

5. Distinguish retryable vs terminal failures

  • Timeouts → retryable by default

  • Overload signals (429, or 503 with Retry-After) must stop immediate retries; if the server supplies Retry-After, wait at least that long

Respect backpressure explicitly.
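
For HTTP, the classification can be a small pure function, sketched here (the groupings are a sensible default, not a universal rule):

# Decide whether an HTTP outcome is retryable, and how long to wait first.
def retry_decision(status: int | None, retry_after: str | None) -> tuple[bool, float]:
    if status is None:
        return True, 0.0            # timeout / connection error: retryable, with backoff
    if status in (429, 503):
        if retry_after is not None:
            try:
                return True, float(retry_after)   # defer at least this long
            except ValueError:
                return False, 0.0   # Retry-After was an HTTP-date; simplest: stop
        return False, 0.0           # explicit overload, no hint: respect backpressure
    if 400 <= status < 500:
        return False, 0.0           # terminal: retrying will not change the answer
    if status >= 500:
        return True, 0.0            # transient server error: retryable, with backoff
    return False, 0.0               # 2xx/3xx: nothing to retry

print(retry_decision(None, None))   # (True, 0.0)
print(retry_decision(429, "2"))     # (True, 2.0)
print(retry_decision(404, None))    # (False, 0.0)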

6. Prefer hedging over retries (carefully)

  • Send a second request before the first times out

  • Only when capacity exists

  • Cancel the loser

This reduces tail latency without retry storms—but is dangerous if misused.
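
A minimal asyncio sketch of that shape (the 300 ms hedge delay is arbitrary and `fetch` is a placeholder for your own coroutine): the hedge is sent only if the first attempt is still pending after the delay, and the loser is cancelled.

import asyncio

async def hedged(fetch, hedge_delay: float = 0.3):
    """Run `fetch()` once; if it is still pending after `hedge_delay`, start a
    second copy; return whichever finishes first and cancel the other.
    `fetch` is a placeholder coroutine factory, e.g. lambda: client.get(url)."""
    first = asyncio.create_task(fetch())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay)
    if done:
        return first.result()                     # fast path: no hedge sent
    second = asyncio.create_task(fetch())         # hedge only after the delay
    done, pending = await asyncio.wait({first, second},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                             # cancel the loser to free capacity
    return done.pop().result()

Gate the hedge behind the same retry budget as above (“only when capacity exists”); otherwise hedging is just a politer retry storm.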

Mental model to keep

Retries trade correctness probability for load.
During outages, load is the scarce resource.

Where to go next

  • Map this to a concrete architecture (RPC, Kafka, HTTP, gRPC)

  • Walk through a real outage timeline with metrics

  • Compare retries vs circuit breakers vs hedged requests

  • Design a retry policy that doesn’t melt the system
