You don't have five reliability problems, you have one loop
Flapping autoscalers, retry storms, the on-call death spiral, SLOs that quietly rot. They're the same handful of feedback structures wearing different hats. Here's how to spot which loop you're feeding before you patch the symptom again.
A backend gets slow. Clients time out and retry. Each retry adds load, so the backend gets slower, so more clients time out, so they retry harder. The original blip is long gone. The thing that’s keeping you down now is the recovery behavior itself.
If you’ve run anything at scale, you’ve watched this. You’ve also probably watched a different incident where the autoscaler couldn’t settle: it provisioned, the load had already moved, it over-corrected, then it hunted back and forth across the right answer while pages fired. And you’ve watched a third thing that doesn’t look like an incident at all, where your two best on-call engineers quietly transferred to a calmer team, the survivors picked up the slack, and within two quarters the rotation was on fire.
The obvious read is that these are three separate problems with three separate fixes. Tune the retry policy. Tune the scaling thresholds. Fix the staffing. That read is not wrong, but it stops one level too early, and the level it misses is the one that matters for where you put your attention.
The pattern: it’s the same small set of structures
All three are feedback loops. Not as a metaphor. As the actual mechanism producing the behavior.
This is the core claim of system dynamics, a modeling discipline Jay Forrester built at MIT in the late 1950s. Forrester was an electrical engineer, the guy who invented magnetic-core memory, and he got handed a General Electric appliance plant stuck in boom-and-bust production cycles. He showed the cycles weren’t coming from the market. They were generated by management’s own decision rules (System Dynamics Society; MIT News obituary, 2016). The thesis that fell out of that: structure drives behavior. The way a system is wired tends to produce its characteristic failures regardless of who’s on shift.
The vocabulary is small enough to carry in your head. Stocks are accumulations that can’t change instantly: a request queue, your pool of on-call engineers, the backlog of deferred fixes. Flows are the rates that fill or drain them. Feedback loops come in two flavors. A reinforcing loop amplifies, growth feeding growth or collapse feeding collapse. A balancing loop is goal-seeking, like a thermostat comparing the room to a setpoint and correcting. And delays are what make balancing loops misbehave, because a correction aimed at where the system was overshoots where the system is now.
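A minimal sketch of the stock-and-flow idea, using the request queue as the stock. The function name, the tick-based time step, and the rates are all hypothetical, chosen only to show that a stock changes by inflow minus outflow and never jumps.

```python
def simulate_queue(arrivals_per_tick, service_per_tick, ticks, initial_depth=0.0):
    """Integrate one stock (queue depth) from its inflow and outflow.

    Illustrative toy: constant rates, discrete ticks, no randomness.
    """
    depth = initial_depth
    history = [depth]
    for _ in range(ticks):
        inflow = arrivals_per_tick
        # The outflow can't drain more than what is queued plus what just arrived.
        outflow = min(service_per_tick, depth + inflow)
        depth = depth + inflow - outflow
        history.append(depth)
    return history


# Arrivals exceed service by 20 per tick, so the stock grows by 20 per tick.
print(simulate_queue(120, 100, 5))  # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
```

Drop arrivals below the service rate and the queue drains at the difference, but only at that rate. That lag between fixing the flow and seeing the stock recover is the part people forget mid-incident.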
Now look back at the three scenarios with those labels on.
The retry storm is a reinforcing loop where the mitigation is the cause. The AWS Builders’ Library lays it out cleanly: in a five-deep call chain where each layer retries three times, load at the deepest dependency multiplies by 3⁵, which is 243 times the original (Brooker, AWS Builders’ Library). Aleksey Charapko’s work on metastable failures generalizes the ugly part: the original trigger fades, but the sustaining loop keeps the system pinned down, and every efficiency feature you added, caching, autoscaling, failover, strengthens that loop once you’re already degraded (Charapko). That’s why backoff with jitter and one-layer-only retries are the fix. They reduce the loop’s gain.
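The arithmetic and the fix can be sketched in a few lines. The helper names and the default `base` and `cap` values below are assumptions for illustration, not anything from the AWS article; the delay function follows the general "capped exponential backoff with full jitter" shape, where the sleep is drawn uniformly between zero and a capped exponential ceiling.

```python
import random


def worst_case_amplification(attempts_per_layer, depth):
    """Requests hitting the deepest dependency per original request,
    if every layer multiplies the calls it makes by attempts_per_layer."""
    return attempts_per_layer ** depth


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based).

    Capped exponential backoff with full jitter. `base` and `cap`
    are hypothetical defaults; tune them for your own system.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


print(worst_case_amplification(3, 5))  # 243: every layer retries
print(worst_case_amplification(3, 1))  # 3: only one layer retries
```

Retrying at one layer cuts the exponent. Jitter spreads the retries out in time so they stop arriving in synchronized waves. Both act on the same thing: how hard the loop pushes back on the backend.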
The flapping autoscaler is a balancing loop with a delay. Kubernetes’ Horizontal Pod Autoscaler measures load, decides to scale, then waits for pods to become ready, and that reaction takes on the order of a minute or two. (That figure is from practitioner analysis, not the official docs, and it’s implementation-dependent, so treat it as a ballpark.) The delay is inherent to any physical scaling process. Under volatile load the correction lands after the problem has moved, and a loop that’s balancing in intent behaves like an amplifier. Tighter utilization targets make it worse, because they shrink the margin and push the system toward the region where small perturbations tip it.
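To see the delay do the damage, here is a deliberately naive toy. It is not how the Horizontal Pod Autoscaler works. The controller, the `gain`, and the tick-based delay are all assumptions: each tick it orders a correction proportional to the gap it currently sees, ignores capacity already on order, and the correction lands `delay_ticks` later.

```python
from collections import deque


def simulate_scaler(demand, delay_ticks, gain=0.5, ticks=40, initial_capacity=10.0):
    """Toy balancing loop: chase `demand` with corrections that arrive late."""
    capacity = initial_capacity
    # Corrections that have been ordered but have not landed yet.
    in_flight = deque([0.0] * delay_ticks)
    history = [capacity]
    for _ in range(ticks):
        # Naive rule: react to the gap seen right now, ignoring in-flight orders.
        in_flight.append(gain * (demand - capacity))
        capacity += in_flight.popleft()
        history.append(capacity)
    return history


no_delay = simulate_scaler(demand=20.0, delay_ticks=0)
delayed = simulate_scaler(demand=20.0, delay_ticks=3)
print(max(no_delay) <= 20.0)  # True: climbs toward the target, never past it
print(max(delayed) > 20.0)    # True: same rule, same gain, now it overshoots
```

Nothing changed between the two runs except the delay. The rule is still goal-seeking. It is just aiming at a gap that stopped existing three ticks ago.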
The on-call spiral is a reinforcing attrition loop. Google’s SRE Workbook documents it: high pager load drives experienced engineers to transfer out, the survivors absorb more load per shift, more transfers follow. Google’s rule of thumb, no more than two actionable incidents per on-call shift, is a deliberate attempt to hold that loop below escape velocity (Google SRE).
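A rough sketch of that attrition loop, under loudly stated assumptions: the per-engineer load threshold, the `quit_sensitivity` factor, and the weekly time step are all made up for illustration and are not taken from the SRE Workbook. Headcount is fractional, so read it as an expected value.

```python
def simulate_rotation(engineers, incidents_per_week, weeks,
                      tolerable_load=2.0, quit_sensitivity=0.05):
    """Toy reinforcing loop: load above the tolerable level drives departures,
    and departures raise the load on whoever is left."""
    history = [engineers]
    for _ in range(weeks):
        if engineers <= 0:
            break
        load_per_engineer = incidents_per_week / engineers
        excess = max(0.0, load_per_engineer - tolerable_load)
        # Hypothetical rule: departures scale with how far load exceeds tolerable.
        leavers = min(engineers, quit_sensitivity * excess * engineers)
        engineers -= leavers
        history.append(engineers)
    return history


print(simulate_rotation(8, 16, 10))  # at the threshold: headcount holds at 8
print(simulate_rotation(8, 20, 10))  # above it: each week loses more than the last
```

In this toy, at or below the threshold nothing happens. Just above it, every departure makes the next one more likely, which is the "escape velocity" behavior the load cap is meant to prevent.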
Three incidents, three war rooms, two loop structures. That’s the pattern.
Once you see it, you can’t stop seeing it
Technical debt is a stock. Ward Cunningham’s 1992 debt metaphor maps directly: debt accumulates, and its “interest” is the extra time every future change costs in degraded code, and that interest grows as the stock grows (Fowler). A Stripe study cited by Stack Overflow’s blog put developer time on debt-related work near 17.5 hours a week (Stack Overflow Blog, 2023). I haven’t found a peer-reviewed paper that formally models tech debt as a stock-and-flow, so treat that mapping as inferential. It’s a clean inference, though, and the drain-on-capacity behavior is exactly what every team living it describes.
Alert fatigue is a feedback loop where the volume meant to improve detection degrades it: too many non-actionable pages desensitize people, they start skimming, real incidents get masked by the noise (Google SRE). DORA’s lead-time and MTTR numbers are, structurally, measurements of how fast your feedback loops close: elite teams close them in hours or less, low performers in weeks or longer (DORA / Four Keys).
And the recurring shapes have names. Peter Senge cataloged them as systems archetypes in The Fifth Discipline. “Shifting the Burden” is the one that should make every IT manager wince: a quick fix relieves the symptom but atrophies the capacity to fix the root cause. Restarting a leaking service instead of finding the leak. On-call heroics that prevent the team ever building the automation that would make heroics unnecessary. “Eroding Goals” is the SLO that gets quietly widened, the error budget relaxed, “done” redefined downward, one reasonable-looking increment at a time. Fair warning: these IT-specific mappings are structural analogy and practitioner observation, not peer-reviewed studies. Plenty of SRE teams arrived at the same insights without ever saying Senge’s name. Take them as a lens, not a citation.
What this changes about where you focus
Here’s the prioritization payoff, and it’s the whole reason to bother with any of this.
When you label the loop, you stop reaching for the wrong lever. Donella Meadows spent an essay on exactly this, “Leverage Points: Places to Intervene in a System,” published sometime in the late 1990s (the year gets cited inconsistently, so I won’t pin it) (Meadows, donellameadows.org). Her counterintuitive finding: the places we instinctively push, the parameters and thresholds and numbers, are the weakest levers there are. The strong ones are the system’s information flows, its rules, and its goals.
In operational terms, that’s the difference between adding another heroic responder to the on-call rotation and changing the rule that lets the rotation eat your toil-reduction time. The first pushes harder on the same lever and feeds the loop. The second changes the structure. One of those is a durable fix and one of those is next quarter’s incident.
So the question to carry into a postmortem, a capacity review, or a “why does this keep happening” meeting is not “who messed up” and not even “what do we tune.” It’s: which loop am I feeding? If a fix keeps failing, you are almost certainly pushing on a parameter inside a loop whose structure guarantees the behavior comes back. The cheapest thing you can do is the marker-on-a-whiteboard version. Draw the variables, draw the arrows, mark which loops reinforce and which balance, and look for the one place where a structural change, not a harder push, breaks the cycle. (This is the same instinct behind treating executive accountability as a loop, which I wrote about in “the feedback loop is broken”; same shape, different layer of the org.)
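The "mark which loops reinforce and which balance" step has a mechanical rule behind it: walk the loop, count the links where the effect moves opposite to the cause, and an even count means reinforcing while an odd count means balancing. A small sketch, with a hypothetical function name and variable labels that are one plausible reading of the three scenarios above, not the only one:

```python
def loop_polarity(links):
    """Classify a closed causal loop as 'reinforcing' or 'balancing'.

    Each link is (cause, effect, sign): "+" means the effect moves the same
    way as the cause, "-" means it moves the opposite way.
    """
    for (_, effect, _), (next_cause, _, _) in zip(links, links[1:] + links[:1]):
        if effect != next_cause:
            raise ValueError(f"links do not form a closed loop at {effect!r}")
    negatives = sum(1 for _, _, sign in links if sign == "-")
    # An even number of sign flips sends a push back in its original direction.
    return "reinforcing" if negatives % 2 == 0 else "balancing"


retry_storm = [
    ("backend latency", "client timeouts", "+"),
    ("client timeouts", "retries", "+"),
    ("retries", "backend load", "+"),
    ("backend load", "backend latency", "+"),
]
autoscaler = [
    ("utilization", "desired replicas", "+"),
    ("desired replicas", "ready replicas", "+"),  # this link carries the delay
    ("ready replicas", "utilization", "-"),
]
on_call = [
    ("pages per engineer", "transfers out", "+"),
    ("transfers out", "engineers on rotation", "-"),
    ("engineers on rotation", "pages per engineer", "-"),
]

for name, loop in [("retry storm", retry_storm), ("autoscaler", autoscaler), ("on-call", on_call)]:
    print(name, "->", loop_polarity(loop))
```

The code is trivial on purpose. The work is in arguing about which variables and signs belong on the whiteboard. Once those are written down, the polarity is just a parity check.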
What to watch, and what this isn’t
The honest limit, and it comes from inside the field. John Sterman’s 2002 lecture “All Models Are Wrong” works out the trap (Sterman, MIT). A diagram with two dozen variables can emit confident-looking trajectories that feel like the future has been calculated when it’s only been hypothesized. Fitting a model to history doesn’t prove it’s right. So the deliverable here is the thinking, surfacing and arguing the loop, not a simulation you trust.
This is the part the consulting decks get wrong. “Draw stocks and flows and your problems vanish” is a sales pitch, not a method. The loop diagram is a Rorschach test until it’s grounded in something you can check, and for a lot of operational problems there’s a sharper tool than a causal loop. When you have measurable state and a setpoint and an actuator, control theory beats hand-drawn loops and ships with guarantees they can’t offer. When you want steady-state throughput, queueing theory answers in closed form. When there’s one obvious bottleneck, Theory of Constraints asks the one question worth asking and you’re done.
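As one example of "closed form," here is the textbook M/M/1 queue: a single server, Poisson arrivals, exponential service times. Picking that model is my assumption for illustration; a real service may fit a different one, and the rates below are made up.

```python
def mm1_steady_state(arrival_rate, service_rate):
    """Steady-state metrics for an M/M/1 queue.

    Returns (utilization, mean requests in system, mean time in system).
    Rates are per second; a steady state exists only when arrivals < service.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative, service rate positive")
    if arrival_rate >= service_rate:
        raise ValueError("no steady state: arrivals meet or exceed service capacity")
    utilization = arrival_rate / service_rate
    mean_in_system = arrival_rate / (service_rate - arrival_rate)
    mean_time_in_system = 1.0 / (service_rate - arrival_rate)
    return utilization, mean_in_system, mean_time_in_system


for arrivals in (80, 95):
    rho, in_system, seconds = mm1_steady_state(arrivals, 100)
    print(f"utilization={rho:.2f} in_system={in_system:.1f} time={seconds:.2f}s")
```

Under these assumptions, going from 80% to 95% utilization takes mean time in system from 0.05 s to 0.2 s. No diagram required, which is the point: when the question is this well-posed, use the formula.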
What system dynamics is good for is the thing in your head before you pick a tool: recognizing that your retry storm, your flapping scaler, and your bleeding rotation are three faces of two structures, and that the lever you keep reaching for is the weak one. You don’t need software for that. You need to name the loop.
PatchDayAlert is built on the same instinct one level down. When three vendors patch the same component in the same month, that’s not three coincidences, it’s a structure. That’s the read we’re trying to give you every morning: not just what shipped, but what pattern it belongs to.
Sources
- Origin of System Dynamics — System Dynamics Society
- Jay Forrester obituary — MIT News — 2016-11-19
- Timeouts, Retries and Backoff with Jitter — AWS Builders’ Library (Marc Brooker)
- Metastable Failures in Distributed Systems — Aleksey Charapko
- On-Call — Google SRE Workbook
- TechnicalDebt — Martin Fowler
- If you want to address tech debt, quantify it first — Stack Overflow Blog — 2023-08-24
- Monitoring Distributed Systems — Google SRE Book
- Using the Four Keys to measure your DevOps performance — Google Cloud Blog
- Leverage Points: Places to Intervene in a System — Donella Meadows
- All Models Are Wrong — John Sterman, System Dynamics Review — 2002