A single failed request retried immediately is harmless. Ten thousand clients retrying a struggling service in lockstep is a denial-of-service attack you wrote yourself.
This is the retry storm: a service slows, timeouts fire, every caller retries at the same moment, and that synchronized wave of load finishes off whatever was still standing. The retries don’t help the system recover — they prevent it from recovering.
The fix isn’t “retry less.” It’s to make retries disagree about when to fire. Exponential backoff lengthens the wait after each failed attempt, so repeated retries thin out over time; jitter (a randomized delay) breaks the lockstep so the waves never re-form. Add a retry budget, a cap on the fraction of traffic allowed to be retries, so a broad outage can’t be amplified into a bigger one.
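A minimal sketch of those three pieces together may make this concrete. Everything here is illustrative rather than taken from a particular library: the names `backoff_delay`, `RetryBudget`, and `call_with_retries` are hypothetical, as are the defaults (100 ms base, 10 s cap, 10% budget). The jitter shown is the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential ceiling.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full-jitter delay in seconds for a zero-based attempt number.

    The ceiling doubles with each attempt (exponential backoff) up to `cap`;
    the actual delay is uniform in [0, ceiling], so clients that failed at
    the same instant do not retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


class RetryBudget:
    """Permit a retry only while retries stay within `ratio` of requests.

    Assumption: this toy version counts lifetime totals. A production
    version would more likely use a sliding window or a token bucket.
    """

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Refuse if one more retry would exceed the allowed fraction.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def call_with_retries(op, budget, max_attempts=4, sleep=time.sleep):
    """Call `op`, retrying on any exception while attempts and budget remain."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise  # give up: surface the original error to the caller
            sleep(backoff_delay(attempt))
```

The budget is meant to be shared by every caller in a process, so that when most requests are failing the retries are refused and the errors surface instead of multiplying the load.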
And retries are only safe when the operation is idempotent. Otherwise you’re not recovering, you’re duplicating.
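One common way to make a non-idempotent operation safe to retry is an idempotency key: the client generates a key once per logical operation and reuses it on every retry, and the server remembers the result for each key. The sketch below is a toy in-memory version under that assumption; `PaymentService` and its `charge` method are hypothetical names, not a real API.

```python
import uuid


class PaymentService:
    """Toy server that deduplicates requests by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency key -> response already returned
        self.charges = []  # side effects actually performed

    def charge(self, key, amount):
        # A repeated key means a retry: return the stored response
        # instead of performing the side effect a second time.
        if key in self.results:
            return self.results[key]
        self.charges.append(amount)
        response = {"charge_id": len(self.charges), "amount": amount}
        self.results[key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())  # generated once, reused for every retry
first = service.charge(key, 25)
retry = service.charge(key, 25)  # e.g. the first response was lost in transit
```

Here `retry` is the same response as `first`, and `service.charges` holds a single entry: the retry recovered the result without duplicating the charge.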