PatchDayAlert
Analysis · 12 min read · 2,403 words · By Colten Anderson

You're measuring how many alerts fired. The number that matters is how many you acted on

Most monitored environments fire far more alerts than anyone investigates, and almost nobody tracks the gap. That gap is where the one real alert dies.


The obvious read on alert overload is that it’s a tuning problem. Too many alerts, the thresholds are too sensitive, so you raise them, mute the noisy rules, and the queue gets quieter. Everyone feels better. The dashboards go green.

The more interesting detail is what that move actually does to your detection. Raising a threshold doesn’t just reduce noise. It raises the miss rate. You’ve traded a loud queue you can see for a silent miss you can’t. And almost no shop measures the thing that would tell them which trade they just made: of every alert that fired, what fraction required a human to actually do something.

That ratio has a name, several names, and that’s part of the problem. But the gap it describes is consistent across every source in the record. Teams are buried, they’re measuring volume instead of action, and the one alert that mattered dies inside the pile of the ones that didn’t.

The data shows teams measuring the wrong thing

Start with scale. Splunk’s State of Security 2025, fielded by Oxford Economics across 2,058 security leaders in nine countries, found 59% say they receive too many alerts. The same report puts another 55% on spending too much time chasing false positives, though that figure rides in the press coverage rather than the summary writeup, so treat it as one notch less direct than the 59%. The report also found 57% lose investigation time to data-management gaps and 46% spend more time maintaining tools than investigating threats. That’s the shape: teams too buried in the queue to do the work the queue exists to surface.

The behavioral data is worse than the self-reported overwhelm. Trend Micro’s 2021 SOC survey of 2,303 respondents found 43% occasionally or frequently turn alert notifications off entirely, and 40% admit they ignore incoming alerts. That study is four years old and should be read as foundational, not current, but turning the alarm off is a more honest signal than any overwhelm percentage.

Here’s the part that ties it together. The IDC and Tines Voice of Security 2025 white paper found only 23% of security teams are measured by alert count at all, and frames that as the failure. Count is a volume measure. It tells you how loud the queue is, not whether the queue is doing its job. Teams that do track a number are tracking the wrong one, and most aren’t tracking anything.

A few widely circulated figures, alerts-per-day counts and a much-repeated false-positive rate, trace back to single-vendor telemetry or no primary source at all. They’re directional color, not spine. Splunk and Trend Micro carry the weight here.

The metric exists, and it’s computable

The reason actionability stays unmeasured isn’t that it’s hard to compute. It’s that there’s no single canonical definition, so it never became a default field in anyone’s tooling. It travels under several names: actionability rate, signal-to-noise, alert-to-incident conversion. The definitions disagree mainly on what counts as “action”: some count any acknowledgment, while stricter ones require a documented incident or change before an alert scores.

The foundational framing is Rob Ewaschuk’s My Philosophy on Alerting, written at Google and later folded into the SRE book. It sets a floor: an alert that’s a false positive more than 10% of the time merits real scrutiny, and one less than 50% accurate is broken. That roughly 90% precision bar is the earliest widely circulated quantitative standard for alert quality. Ewaschuk’s companion test defines “actionable” as a gate rather than a score: a page should fire only for a condition that’s urgent, actionable, and genuinely visible to a user or about to be. The SRE book’s Monitoring Distributed Systems chapter restates it as four conditions every page must meet: urgent, important, actionable, real.

The SRE Workbook’s alerting chapter turns the philosophy into math: precision, recall, detection time, reset time. It shows a naive error-rate alert firing up to 144 times a day while the service still meets its SLO. That’s the whole argument in one example. Volume is a meaningless denominator until you pair it with an actionability numerator.
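A minimal sketch of that math, in Python. The counts are made up for illustration, and the 10-minute evaluation window is an assumption chosen because it reproduces the 144-a-day ceiling; nothing here is the Workbook's own code.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of fired alerts that corresponded to a real event."""
    fired = true_positives + false_positives
    return true_positives / fired if fired else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of real events that produced an alert."""
    real = true_positives + false_negatives
    return true_positives / real if real else 0.0


def max_alerts_per_day(window_minutes: int) -> int:
    """Upper bound on fires per day if a rule can re-fire once per window."""
    return (24 * 60) // window_minutes


# Hypothetical month for one rule: 9 real pages, 81 false alarms, 1 miss.
p = precision(true_positives=9, false_positives=81)   # 0.1
r = recall(true_positives=9, false_negatives=1)       # 0.9
ceiling = max_alerts_per_day(window_minutes=10)       # 144 (assumed window)
```

High recall with 10% precision is the loud-queue case: almost nothing is missed, and almost nothing that fires is worth opening.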

The instrumentation behind it is a disposition logged against every alert. OneUptime’s writeup gives the cleanest published formula: actionability rate equals the alerts that required investigation, remediation, or escalation, divided by total alerts. Dispositions are keyed as actionable, false_positive, auto_resolved, duplicate, or flapping, and the practitioner target sits north of 70%. The cost is nearly zero: one disposition field per alert, then an aggregation. That it’s this cheap and still this rare is the finding.

One distinction worth holding onto: mean time to acknowledge (MTTA) tells you how fast a human engaged. Actionability tells you whether engaging was warranted. A team can post fast MTTA on alerts that are mostly junk, which is a different failure than slow MTTA on real pages. And the 70% target is a practitioner benchmark, not a standards-body number. So is the 90% precision floor. They’re defensible starting lines, not laws. One classification call moves the whole ratio: whether an auto-resolved alert scores as signal or noise. There’s no consensus on that, and the choice changes the number materially. Decide it on purpose.
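The aggregation can be sketched in a few lines. This assumes each alert carries exactly one of the disposition labels listed above; the function name, the `auto_resolved_is_signal` flag, and the sample counts are hypothetical, not taken from any vendor's tooling.

```python
from collections import Counter

# Disposition labels as listed above.
DISPOSITIONS = {"actionable", "false_positive", "auto_resolved",
                "duplicate", "flapping"}


def actionability_rate(dispositions, auto_resolved_is_signal=False):
    """Share of alerts that warranted human action.

    auto_resolved_is_signal is the classification call: set it once,
    on purpose, and keep it fixed so the trend stays comparable.
    """
    counts = Counter(dispositions)
    unknown = set(counts) - DISPOSITIONS
    if unknown:
        raise ValueError(f"unknown dispositions: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    signal = counts["actionable"]
    if auto_resolved_is_signal:
        signal += counts["auto_resolved"]
    return signal / total


# Made-up sample of 100 alerts.
sample = (["actionable"] * 30 + ["false_positive"] * 40
          + ["auto_resolved"] * 20 + ["duplicate"] * 6 + ["flapping"] * 4)

strict = actionability_rate(sample)                                  # 0.3
lenient = actionability_rate(sample, auto_resolved_is_signal=True)   # 0.5
```

Same queue, same month, and the number moves from 30% to 50% on one definitional choice, which is why the choice needs to be written down.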

The toolkit, and how each piece backfires

Once you accept actionability as the target, the standard levers are obvious: tune thresholds, deduplicate, suppress, route by severity, retire dead rules. Each works. Each has a failure mode most teams never audit for, and the failures share a shape. They all create silence that looks like health.

Threshold tuning is the most quietly dangerous. Move a CPU alert from 80% to 90% and you haven’t only reduced noise, you’ve also raised the miss rate. Ewaschuk’s rule is blunt: above a 10% false-positive rate the alert is suspect, above 50% it should be deleted, not re-tuned, because a chronically noisy alert is usually measuring the wrong thing. The fix is often to alert on the symptom the user actually feels instead of the resource metric underneath it.
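A toy illustration of that trade, using invented samples. Each tuple is a CPU peak and whether a real incident was underway; the values are made up to show the shape, not drawn from any dataset.

```python
# Made-up samples: (peak CPU percent, real incident underway?).
samples = [(82, False), (85, True), (88, False), (91, True),
           (93, False), (97, True), (84, False), (86, True)]


def outcome(threshold):
    """Count what fires, what fires falsely, and what is silently missed."""
    fired = [(cpu, real) for cpu, real in samples if cpu >= threshold]
    false_alarms = sum(1 for _, real in fired if not real)
    missed = sum(1 for cpu, real in samples if real and cpu < threshold)
    return {"fired": len(fired), "false_alarms": false_alarms,
            "missed": missed}


at_80 = outcome(80)   # {'fired': 8, 'false_alarms': 4, 'missed': 0}
at_90 = outcome(90)   # {'fired': 3, 'false_alarms': 1, 'missed': 2}
```

The queue at 90 looks healthier on every volume chart. The two missed incidents don't appear on any of them.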

Deduplication and grouping attack the cascade, where one database failure spawns dozens of downstream alerts. Collapsing those is correct. The failure surfaces when the window is too wide or the key too coarse. A storage issue and an unrelated auth outage then merge into one incident even though they are genuinely separate failures. The responder fixes the first, closes the ticket, never sees the second. Rootly’s documentation draws the line: deduplication collapses the same alert into one timeline; grouping combines similar but distinct ones. Conflate them in a single rule and a real incident gets buried inside a resolved one.
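The coarse-key failure is easy to reproduce in a sketch. The `Alert` shape, the key functions, and the window logic below are hypothetical and far simpler than any real incident platform; they only show how one rule can merge two unrelated failures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    minute: int  # minutes since an arbitrary reference time


def dedup_key(alert):
    # Same service and same check: repeats of one alert.
    return (alert.service, alert.check)


def group(alerts, key, window_minutes):
    """Bucket alerts by key; a bucket stays open for window_minutes
    after its first alert, then the next match starts a new bucket."""
    buckets = []
    open_bucket = {}  # key value -> index into buckets
    for alert in sorted(alerts, key=lambda a: a.minute):
        k = key(alert)
        idx = open_bucket.get(k)
        if idx is not None and alert.minute - buckets[idx][0].minute <= window_minutes:
            buckets[idx].append(alert)
        else:
            open_bucket[k] = len(buckets)
            buckets.append([alert])
    return buckets


alerts = [
    Alert("storage", "disk_latency", 0),
    Alert("storage", "disk_latency", 2),
    Alert("auth", "login_errors", 5),
]

precise = group(alerts, dedup_key, window_minutes=10)        # 2 incidents
coarse = group(alerts, lambda a: "any", window_minutes=60)   # 1 incident
```

With the precise key, the repeated storage alert collapses and the auth outage stays its own incident. With the coarse key and a wide window, all three land in one bucket, and closing it closes the auth outage too.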

Suppression is the highest-risk technique because it manufactures silence by design. PagerDuty’s guidance is explicit that suppression mutes the notification while the underlying event still fires, so you’re betting nothing else breaks during the window. The classic failure, documented in upstat’s suppression guide, is the orphaned rule: a two-hour migration window gets configured, the migration ends, nobody removes the rule, and a network issue two weeks later gets silently swallowed. A blanket suppression with no severity exemption mutes a Sev-1 as readily as a heartbeat. incident.io names the cultural version: suppress non-actionable alerts en masse instead of retiring them, and teams habituate to reading silence as health.
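Both failures, the orphaned rule and the muted Sev-1, can be designed out at the rule level. A minimal sketch, assuming severity is an integer where 1 is most severe; the `Suppression` fields and `is_suppressed` function are hypothetical and don't mirror PagerDuty's or any other vendor's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Suppression:
    matcher: str            # hypothetical: the service name this rule mutes
    starts: datetime
    expires: datetime       # mandatory, so a forgotten rule stops on its own
    max_severity: int = 2   # most severe level the rule may mute; Sev-1 exempt


def is_suppressed(rule, service, severity, now):
    """Return True only inside the window, for the matched service,
    and never for anything more severe than rule.max_severity."""
    if severity < rule.max_severity:   # lower number means more severe
        return False
    if not (rule.starts <= now < rule.expires):
        return False
    return service == rule.matcher


start = datetime(2025, 1, 6, 22, 0)
rule = Suppression("network", start, start + timedelta(hours=2))

during = is_suppressed(rule, "network", 3, start + timedelta(hours=1))      # True
sev1 = is_suppressed(rule, "network", 1, start + timedelta(hours=1))        # False
two_weeks = is_suppressed(rule, "network", 3, start + timedelta(days=14))   # False
```

Making `expires` a required field is the design choice that matters: nobody has to remember to remove the rule, because an expired rule mutes nothing.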

Severity routing is correct practice, and its failure mode is ownership erosion. Route an alert to a ticket queue with no SLA, no triage owner, and no review cadence, and you’ve routed it to /dev/null with extra steps. The SRE book’s Bigtable example is the cautionary tale: voluminous email alerts ate the team’s triage time until they disabled them entirely to get focus back.

Retiring dead alerts is the most defensible lever and the one teams avoid most. The standard is simple. If people respond to a page with “I looked, nothing was wrong,” demote or delete it. Review any rule exercised less than once a quarter. The failure here is purely cultural. Deletion requires someone to own the judgment that an alert is wrong, and in most shops nobody wrote the rule down as theirs, nobody is measured on its false-positive rate, and nobody calls the review. The ruleset drifts into a record of past system states instead of a model of current risk.
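The review itself can be a scheduled query rather than a meeting nobody calls. A sketch under stated assumptions: the rule records, field names, and rule names are invented, a quarter is approximated as 90 days, and the 50% "nothing was wrong" cutoff is an illustrative choice, not a published standard.

```python
from datetime import date, timedelta


def review_candidates(rules, today, quiet_days=90, noise_ratio=0.5):
    """Flag rules that have gone quiet for a quarter, or whose pages
    mostly ended in 'I looked, nothing was wrong'.

    rules: dicts with name, last_fired (date or None), pages, nothing_wrong.
    """
    flagged = {}
    for rule in rules:
        reasons = []
        last = rule["last_fired"]
        if last is None or (today - last) > timedelta(days=quiet_days):
            reasons.append("not exercised this quarter")
        if rule["pages"] and rule["nothing_wrong"] / rule["pages"] >= noise_ratio:
            reasons.append("mostly 'nothing was wrong'")
        if reasons:
            flagged[rule["name"]] = reasons
    return flagged


# Hypothetical ruleset.
rules = [
    {"name": "disk_full", "last_fired": date(2025, 5, 20),
     "pages": 4, "nothing_wrong": 0},
    {"name": "legacy_batch_lag", "last_fired": date(2024, 11, 1),
     "pages": 0, "nothing_wrong": 0},
    {"name": "cpu_high", "last_fired": date(2025, 5, 30),
     "pages": 20, "nothing_wrong": 15},
]

flagged = review_candidates(rules, today=date(2025, 6, 1))
# flags legacy_batch_lag (quiet) and cpu_high (15 of 20 pages were nothing)
```

The output is a list of candidates, not a verdict. Someone still has to own the decision to demote or delete, which is the part the script can't supply.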

The fatigue is a response-degradation mechanism

The cost of getting this wrong shows up in two registers. The first is human, and it’s not just morale. Tines’ Voice of the SOC Analyst (2022, 468 U.S. analysts at 500-plus-employee firms) found 71% reporting burnout, nearly half “very burned out,” and 64% likely to switch jobs within the year. That’s exhaustion, not disengagement: 69% of those same analysts called their teams understaffed, which is overload reported by the people living it. That survey is dated now; a like-methodology update would be stronger.

The attrition cycle is structural. When a burned-out analyst leaves, the institutional knowledge of which alert classes are junk and which are real walks out the door. The replacement starts from zero, the tuning debt resets, and signal-versus-noise judgment degrades. That’s the mechanism by which fatigue becomes a breach vector. It isn’t a dramatic failure. It’s a gradual habituation that makes a team statistically more likely to wave off the one alert that mattered.

The strongest evidence that this is a law and not a complaint comes from outside security, so label it as an analog. Clinical alarm fatigue is the same problem with a deeper research base and mortal outcomes. AHRQ’s Making Healthcare Safer III cites FDA data linking 566 patient deaths to monitoring-device alarms over 2005 to 2008, with false-alarm rates of 72 to 99%. A UCSF ICU study documented 2.5 million alarms across five units in 31 days, and staff physically turning speakers to the wall to cope. Critically, all 17 cardiopulmonary arrests in that window were preceded by legitimate alarms. The signal fired. Volume had trained people not to act on it. A 2025 surgical-ICU study of 201 nurses found a statistically significant negative correlation between alarm-fatigue scores and error tendency, with fatigue accounting for roughly 14.5% of the variance in error likelihood. The clinical field measured desensitization, found it real, and tied it to the outcomes the alarms exist to prevent. That’s the proof of concept for what SOC research only sees qualitatively.

Target and Equifax: the alert fired and nobody acted

The canonical proof case in security is Target’s 2013 breach. Six months earlier, Target had stood up a $1.6M FireEye platform. On November 30, FireEye detected exfiltration malware on the point-of-sale systems and threw its top-grade alert, flagged “malware.binary,” with the destination servers in the payload, per Bloomberg Businessweek’s reconstruction from more than ten former employees. Symantec independently flagged the same server. Target’s Bangalore SOC received the alerts and escalated to Minneapolis. Then nothing happened. Target had also disabled FireEye’s auto-delete, so the system was set to alert and wait for a human action that never came. The Senate Commerce Committee’s Kill Chain report confirmed Target missed multiple chances to stop the attack at the detection stage, before one card left the network. Forty million card numbers went anyway.

Equifax 2017 is the same failure with a different mechanism. Instead of an alert dismissed, the monitoring was structurally blind: a tool that decrypted and inspected outbound traffic ran with an expired SSL certificate for 19 months, so it couldn’t see exfiltration. The House Oversight Committee report found the device inactive for that span and noted Equifax had let 300-plus certificates expire. The breach ran 76 days. Detection, when it came, was accidental: an engineer renewed the cert during routine maintenance and the tool immediately began flagging suspicious traffic. The committee called the breach “entirely preventable.”

Target’s signal reached humans who did nothing. Equifax’s signal never reached humans because a silent monitor is indistinguishable from a working one until you need it. Both organizations had functioning technology and a broken process layer, and neither had a mechanism to verify the detection layer was actually working: no escalation SLA forcing a response to a top-urgency alert, no expiration tracking for security-critical infrastructure.
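Neither missing mechanism is exotic. A rough sketch of both checks, assuming you already keep a certificate inventory and a page log; the device names, field names, 30-day warning window, and 15-minute SLA are all hypothetical placeholders.

```python
from datetime import date, timedelta


def expiring(inventory, today, warn_days=30):
    """inventory: mapping of device name -> certificate not-after date."""
    report = {}
    for device, not_after in inventory.items():
        if not_after < today:
            report[device] = "expired"
        elif not_after - today <= timedelta(days=warn_days):
            report[device] = "expiring soon"
    return report


def unanswered_pages(pages, now_minute, sla_minutes=15):
    """Return ids of Sev-1 pages still unacknowledged past the SLA.

    pages: dicts with id, severity, fired_minute, acked.
    """
    return [p["id"] for p in pages
            if p["severity"] == 1 and not p["acked"]
            and now_minute - p["fired_minute"] > sla_minutes]


# Hypothetical inventory and page log.
inventory = {
    "traffic-inspector": date(2024, 1, 15),
    "vpn-gateway": date(2025, 6, 20),
    "web-frontend": date(2026, 3, 1),
}
cert_report = expiring(inventory, today=date(2025, 6, 1))
# {'traffic-inspector': 'expired', 'vpn-gateway': 'expiring soon'}

pages = [
    {"id": "A1", "severity": 1, "fired_minute": 0, "acked": False},
    {"id": "A2", "severity": 1, "fired_minute": 50, "acked": False},
    {"id": "A3", "severity": 3, "fired_minute": 0, "acked": False},
]
overdue = unanswered_pages(pages, now_minute=60)   # ['A1']
```

The first check is aimed at the Equifax shape, a monitor that went blind quietly. The second is aimed at the Target shape, a top-urgency alert that sat waiting for a human.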

What to watch

The signal worth watching isn’t a cleaner queue. It’s the actionability ratio’s trend line, per rule. You don’t need the 70% target to be right to read it. A rule whose actionability rate is drifting toward zero is a rule that’s quietly training your team to ignore it, and that drift is visible months before it swallows the one alert that matters.

The thing to watch in your own environment is the gap between alerts fired and incidents opened, sliced by rule. Where that gap is widest and the rule still exists, you’re looking at a candidate for the Target failure: an alert firing faithfully into a void. Whether anyone owns the judgment to retire it tells you more about your detection posture than your alert volume ever will.
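Both views, the per-rule gap and the per-rule trend, fall out of the same three columns. A sketch assuming each alert record is a (rule, month, opened_incident) tuple; the rule names and counts are invented for illustration.

```python
from collections import defaultdict


def per_rule_gap(alerts):
    """Return (rule, fired, opened, rate) rows, widest gap first."""
    fired = defaultdict(int)
    opened = defaultdict(int)
    for rule, _month, incident in alerts:
        fired[rule] += 1
        opened[rule] += int(incident)
    rows = [(rule, fired[rule], opened[rule], opened[rule] / fired[rule])
            for rule in fired]
    # Lowest conversion first; ties broken by higher volume.
    return sorted(rows, key=lambda row: (row[3], -row[1]))


def monthly_rate(alerts, rule):
    """Actionability rate for one rule, month by month."""
    fired = defaultdict(int)
    opened = defaultdict(int)
    for r, month, incident in alerts:
        if r == rule:
            fired[month] += 1
            opened[month] += int(incident)
    return {month: opened[month] / fired[month] for month in sorted(fired)}


# Made-up alert log.
alerts = (
    [("cpu_high", "2025-03", True)] * 4 + [("cpu_high", "2025-03", False)] * 6
    + [("cpu_high", "2025-04", True)] * 2 + [("cpu_high", "2025-04", False)] * 8
    + [("cpu_high", "2025-05", False)] * 10
    + [("db_replication", "2025-05", True)] * 3
    + [("db_replication", "2025-05", False)] * 1
)

gaps = per_rule_gap(alerts)
# [('cpu_high', 30, 6, 0.2), ('db_replication', 4, 3, 0.75)]
trend = monthly_rate(alerts, "cpu_high")
# {'2025-03': 0.4, '2025-04': 0.2, '2025-05': 0.0}
```

In this invented log, `cpu_high` is the rule to look at: the widest gap, the highest volume, and a rate that has slid to zero over three months.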

PatchDayAlert tracks which CVEs are actually being exploited versus which just score high, so the alerts you escalate are the ones with a body behind them. That’s the same discipline, one tier up: signal you can act on, not volume you have to wade through.
