The most dangerous sentence in a code comment is “this should never happen”
From Therac-25 to CrowdStrike, the same pattern keeps producing catastrophic failures: an engineer reasons that a condition is impossible, skips the guard, and the system outgrows the assumption.
CrowdStrike bricked 8.5 million Windows machines on July 19, 2024 because a rapid-response content update defined 21 input fields but provided only 20. The Content Validator, the system built to catch exactly this class of error, didn’t catch it. The kernel driver read memory out of bounds. Because Falcon loads as a Boot-Start driver, every affected machine entered a crash loop requiring hands-on recovery.
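The failure class is easy to sketch. Below is a hypothetical Python illustration of the pattern, not CrowdStrike’s actual kernel C: a consumer trusts the field count a template promises, so an update that supplies one field too few sends an index past the end of the data.

```python
# Hypothetical illustration of the failure class, not CrowdStrike's
# kernel code. A template promises a fixed field count; the guarded
# accessor checks that promise before trusting any index derived from it.

EXPECTED_FIELDS = 21  # the update defined 21 input fields

def read_field(fields: list[str], index: int) -> str:
    """Reject the record up front if the count doesn't match the template."""
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"template promised {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields[index]

update = [f"field{i}" for i in range(20)]  # only 20 fields provided
try:
    read_field(update, 20)  # asking for the 21st field
except ValueError as err:
    print("rejected before any out-of-bounds access:", err)
```

In a memory-safe language the missing guard would at least fail loudly; in unguarded kernel-mode C, indexing the 21st entry of a 20-entry structure is a silent out-of-bounds read until the machine crashes.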
That’s the obvious read: a bad update, a missed check, an ugly day. The more interesting detail is what this incident has in common with a 1996 rocket explosion, a 2012 trading meltdown, a 2021 certificate expiry, and a 2019 cloud outage. The pattern underneath is the same every time, and it’s not a testing gap. It’s an assumption about impossibility that quietly became load-bearing.
The pattern has a consistent shape
An engineer encounters an edge case. They reason that the conditions required to trigger it are implausible. They skip the protective handling, or build it but don’t test it against real-world evolution. The assumption hardens into convention as subsequent engineers inherit the codebase and treat the gap as intentional. Then the system outgrows the context in which the assumption was made.
The Ariane 5 overflow that destroyed a $370 million rocket came from reused code that worked perfectly on Ariane 4, where the trajectory stayed within safe ranges. Knight Capital lost $440 million in 45 minutes when a deployment missed one of eight servers and reactivated nine-year-old deprecated trading logic. The Therac-25 killed three patients when software inherited a race condition that hardware interlocks had silently absorbed in the prior model. Different domains, same arc: the assumption was correct in the original context, and the original context changed.
What the data actually shows is that this pattern doesn’t cluster in one industry. It runs across safety-critical hardware, financial systems, time-dependent infrastructure, cloud platforms, and security boundaries. Each domain produces its own version.
Where the assumptions were built and where they broke
The Ariane 5 explosion in June 1996 is the textbook version. Engineers reused inertial reference code from Ariane 4 that converted a 64-bit floating-point horizontal velocity into a 16-bit signed integer. On Ariane 4, the trajectory never produced values large enough to overflow. Three variables were left unguarded because engineers believed them to be “physically limited” to safe ranges. On Ariane 5, with its steeper trajectory, the overflow killed the vehicle thirty-seven seconds into flight.
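The arithmetic is simple to reproduce. Here is a hedged sketch in Python (the original was Ada, and the function names are invented); struct reinterprets the low 16 bits the way an unchecked narrowing conversion does, which is exactly the silent wraparound that killed the vehicle.

```python
# Sketch of an unguarded 64-bit-float-to-16-bit-signed narrowing
# (hypothetical names; the flight code was Ada). Keeping only the low
# 16 bits reproduces the silent wraparound of an unchecked conversion.
import struct

INT16_MIN, INT16_MAX = -32768, 32767

def narrow_unguarded(horizontal_velocity: float) -> int:
    # Assume the value is "physically limited" to a safe range:
    # reinterpret the low 16 bits as a signed integer, no check.
    v = int(horizontal_velocity) & 0xFFFF
    return struct.unpack("<h", struct.pack("<H", v))[0]

def narrow_guarded(horizontal_velocity: float) -> int:
    # The guard the three unprotected variables lacked.
    v = int(horizontal_velocity)
    if not INT16_MIN <= v <= INT16_MAX:
        raise OverflowError(f"velocity {v} exceeds the 16-bit range")
    return v

print(narrow_unguarded(20000.0))  # 20000: an Ariane 4-era value fits
print(narrow_unguarded(40000.0))  # -25536: an Ariane 5-era value wraps
```

The unguarded version is correct for every input the old trajectory could produce, which is precisely why the missing check survived review.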
Knight Capital’s August 2012 loss shows the same structure in financial software. A trading feature called Power Peg had been deprecated since 2003. When engineers reused its flag bit for new functionality, they assumed the old logic was disconnected. One of eight servers never received the updated deployment. Dead code woke up. A broken position-reporting subsystem provided no signal that anything was wrong until $440 million was gone.
The Therac-25, 1985 to 1987, is the version where people die. A race condition existed in the Therac-20, but hardware interlocks swallowed the error every time it occurred. When the manufacturer removed those safeguards for the Therac-25’s software-only safety model, the latent bug became lethal. Six patients received massive overdoses; three died. The manufacturer’s initial response was a memo asserting that an overdose was “physically impossible.”
Certificates expire, clocks jump, counters wrap
Time-based failures deserve their own category because they’re the most knowable version of this pattern. The date is on the calendar. The counter’s bit width is in the spec. And they still happen.
When Let’s Encrypt’s DST Root CA X3 root certificate expired on September 30, 2021, it had been communicated years in advance. The workaround for older Android devices was already in place via cross-signing. None of that prevented OpenSSL 1.0.x on older Linux systems from rejecting the entire certificate chain at the moment of expiration. The library’s path-building logic terminated on the first expired root rather than trying alternatives. IoT devices, API clients, and embedded systems with no auto-update path stopped trusting any Let’s Encrypt-issued certificate. Not because the cert was wrong, but because the validation code’s assumption about how certificate chains are walked was wrong.
Cloudflare’s New Year’s Eve 2016 leap-second crash is almost elegant in its minimalism. A DNS resolver assumed that time.Now() could never return a value earlier than a prior call. When the leap second caused the system clock to step backward, negative response times were passed to a function that panics on negative input. The fix was one character: changing == 0 to <= 0. The bug had been invisible because its trigger condition occurs once every few years, at a moment announced months in advance.
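The shape of that bug fits in a few lines. This is a hypothetical Python sketch of the pattern, not Cloudflare’s Go source: a weight derived from a measured duration, guarded only against the exactly-zero case.

```python
# Hypothetical sketch of the bug shape, not Cloudflare's Go code: an
# upstream-selection weight derived from measured RTT, where the guard
# only catches the exactly-zero case and a backward clock step slips by.

def select_weight(rtt_seconds: float) -> float:
    if rtt_seconds == 0:       # the original guard
        rtt_seconds = 1e-3     # fall back to a floor value
    return 1.0 / rtt_seconds   # negative rtt -> negative weight downstream

def select_weight_fixed(rtt_seconds: float) -> float:
    if rtt_seconds <= 0:       # the one-character fix: == becomes <=
        rtt_seconds = 1e-3
    return 1.0 / rtt_seconds

# The wall clock steps backward across the leap second:
start, end = 100.0, 99.2
rtt = end - start                # -0.8 seconds, "impossible" by assumption
print(select_weight(rtt))        # -1.25: negative weight slips through
print(select_weight_fixed(rtt))  # 1000.0: clamped to the floor
```

In the real incident the negative value reached a function that panics on negative input; the sketch just shows how an `== 0` guard encodes the assumption that durations can never be negative.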
The April 2019 GPS week rollover was even more predictable. The 10-bit week counter wraps every 1,024 weeks. The first rollover happened in 1999. The second was calculable to the day. Honeywell published service bulletins in March 2019. Operators didn’t patch. A KLM Boeing 777 was grounded for seven hours while technicians dealt with navigation systems showing dates from 1999.
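The ambiguity and its resolution are both mechanical. Here is a sketch with a hypothetical helper (not any vendor’s firmware): the broadcast week number is only meaningful mod 1,024, so a receiver must anchor it to a baseline date; anchoring to a recent build date resolves the ambiguity, while a fixed 1980 epoch time-travels.

```python
# Sketch of the 10-bit GPS week ambiguity (hypothetical helper, not
# receiver firmware). The broadcast week is only mod 1024; resolving it
# against a baseline date picks the right 1,024-week epoch.
from datetime import datetime, timedelta

GPS_EPOCH = datetime(1980, 1, 6)   # GPS week 0 began here
WEEK_MODULUS = 1024                # 10-bit counter

def resolve_week(broadcast_week: int, baseline: datetime) -> datetime:
    """Return the first week at or after `baseline` matching the broadcast value."""
    base_weeks = (baseline - GPS_EPOCH).days // 7
    # Smallest full week number congruent to broadcast_week mod 1024
    # that is not earlier than the baseline.
    k = (base_weeks - broadcast_week + WEEK_MODULUS - 1) // WEEK_MODULUS
    return GPS_EPOCH + timedelta(weeks=broadcast_week + k * WEEK_MODULUS)

# A receiver anchored to the original epoch reads week 0 as 1980...
print(resolve_week(0, GPS_EPOCH))             # 1980-01-06
# ...while one anchored to early 2019 maps the same broadcast correctly.
print(resolve_week(0, datetime(2019, 3, 1)))  # 2019-04-07
```

The unpatched receivers were, in effect, running the first version: correct for two decades, then wrong by exactly 1,024 weeks.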
When redundancy fails in the same direction
Cloud infrastructure is built on the premise that correlated failure across independent systems is vanishingly unlikely. Three incidents show how that premise breaks.
On February 28, 2017, an AWS engineer debugging a billing issue for S3 mistyped a capacity-removal command and removed a far larger set of servers than intended. S3’s subsystems had been designed to tolerate significant capacity loss. But that guarantee was built and tested against the S3 of 2010, not 2017. The removed servers took down the index layer and placement layer simultaneously. Both required full cold restarts, procedures that had never been validated at current scale because they hadn’t been needed in years. The AWS Service Health Dashboard couldn’t report the outage because it too depended on S3.
Fastly’s June 2021 outage followed a different path to the same conclusion. A single customer pushed a valid, well-formed configuration change that hit an undiscovered bug introduced 27 days earlier. Within minutes, 85 percent of the network returned errors. A single customer action propagated globally rather than being blast-radius-limited. Fastly’s own post-mortem acknowledged: “Even though there were specific conditions that triggered this outage, we should have anticipated it.”
Google’s June 2019 network incident exposed the subtlest version. Google’s network was designed to “fail static,” maintaining routing without its control plane for a short window. Three concurrent conditions converged: a configuration error, an eligibility mistake, and a software bug in maintenance automation that allowed it to deschedule independent clusters across different physical locations simultaneously. Geographic separation, the core redundancy guarantee, was defeated not by a physical event but by software that didn’t know it was supposed to respect geographic boundaries.
Security vulnerabilities in code paths that “can’t be reached”
Log4Shell (CVE-2021-44228, CVSS 10.0) was a feature, not a bug, that existed for eight years. Apache Log4j had supported JNDI lookups in log messages since 2013. The implicit assumption was that log messages are inert data and that logging isn’t a trust boundary. That assumption collapsed when attackers realized any attacker-controlled string written to a log could trigger remote code execution.
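The trust-boundary collapse can be shown with a toy, deliberately not Log4j: a logger that expands ${...} lookup tokens found in the message itself lets the data being logged drive execution, while an inert logger treats the same string as opaque.

```python
# Toy illustration of the Log4Shell pattern, not Log4j. A logger that
# expands ${...} lookups inside the message lets attacker-controlled
# input select and run code; an inert logger treats it as data.
import re

# Stand-in lookup table; Log4j's lookups included JNDI, which could
# fetch and execute remote code.
LOOKUPS = {"env:USER": lambda: "alice"}

def log_with_lookups(message: str) -> str:
    # Dangerous: the logged *data* drives lookup execution.
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda m: LOOKUPS.get(m.group(1), lambda: "?")(),
        message,
    )

def log_inert(message: str) -> str:
    # Safe: the message is opaque; nothing inside it is interpreted.
    return message

user_input = "${env:USER} logged in"     # attacker-controlled string
print(log_with_lookups(user_input))      # lookup executes: "alice logged in"
print(log_inert(user_input))             # printed verbatim
```

Replace the benign environment lookup with a network-backed one and the toy becomes the CVE: any string an attacker can get written to a log becomes an instruction.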
The XZ Utils backdoor (CVE-2024-3094, CVSS 10.0) exploited an attack surface that didn’t exist in any single component. Standard OpenSSH has no dependency on liblzma. But many Linux distributions patch sshd to support systemd-notify, which pulls in libsystemd, which transitively depends on liblzma. The attack surface only existed in distribution-specific patching decisions nobody thought to audit as a security boundary. Attribution remains partially uncertain; the identity behind “Jia Tan” has not been publicly confirmed.
Cisco’s Smart Install (CVE-2018-0171, CVSS 9.8) was designed for zero-touch switch provisioning with no authentication, because authentication would defeat the zero-touch premise. That design assumption collapsed whenever the management interface was internet-reachable. Over 150,000 devices exposed the port publicly. In 2025, CISA confirmed a Russian state-sponsored group was still exploiting this seven-year-old vulnerability.
What this means for prioritization
Reliability engineering has a name for the mechanism: normalization of deviance, from Diane Vaughan’s analysis of the Challenger disaster. The arc is consistent. An edge case is identified, reasoned to be implausible, left unguarded, and inherited by engineers who treat the gap as intentional. The Google SRE book distills the corrective to a single sentence: “If you haven’t tried it, assume it’s broken.”
The operational implication for anyone setting patch priorities is that the most dangerous vulnerabilities aren’t always the ones with the highest CVSS scores. They’re the ones in components that nobody modeled as an attack surface because the code path was “unreachable,” the configuration was “impossible,” or the feature was “deprecated.” Log4Shell sat in a logging library. The XZ backdoor sat in a compression dependency. Smart Install sat in a provisioning protocol.
When you’re triaging, the question worth asking isn’t just “how severe is this?” It’s “how load-bearing was the assumption that this couldn’t happen?” The answer usually tells you more about blast radius than the score does. That’s the kind of context that belongs next to a CVSS number in a daily digest, and it’s what PatchDay Alert is built to surface.
Sources
- Channel File 291 Incident Root Cause Analysis (CrowdStrike)
- ARIANE 5 Flight 501 Failure: Report by the Inquiry Board
- SEC Charges Knight Capital With Violations of Market Access Rule
- An Investigation of the Therac-25 Accidents (Leveson & Turner, IEEE Computer 1993)
- DST Root CA X3 Expiration (September 2021) (Let's Encrypt)
- How and Why the Leap Second Affected Cloudflare DNS
- Honeywell Service Information Letter Briefing: GPS Week Rollover
- Summary of the Amazon S3 Service Disruption (US-EAST-1)
- Summary of June 8 Outage (Fastly)
- Google Cloud Networking Incident #19009
- CVE-2021-44228 (Log4Shell) (NVD)
- XZ Utils Backdoor (Wikipedia)
- Russian Government Cyber Actors Targeting Networking Devices, Critical Infrastructure (FBI/IC3)
- The Challenger Launch Decision (Diane Vaughan, University of Chicago Press)
- Testing for Reliability (Google SRE Book)