The .de outage was a TLD postmortem, not a patch you missed
DENIC's signing pipeline shipped unverifiable signatures for roughly two-thirds of the .de zone during a routine ZSK rotation on May 5. Nothing in your environment caused it, and nothing in your environment could have prevented it. Here's what you can still change at your resolver.
On Tuesday May 5, somewhere around 21:36 local time in Frankfurt, a German online shop noticed that order confirmations had stopped going out. The mail server was up. The web tier was up. The SaaS provider’s status page was green. Inbound mail to the shop’s .de address was simply failing to resolve from anywhere outside Germany. A DevOps engineer in their on-call channel posted the kind of message that always precedes a long evening: “Is DNS broken or is it just me?”
It wasn’t them. It was everybody. For the next two to five hours, depending on which resolver you pointed at, the entire .de top-level domain returned SERVFAIL to any resolver doing DNSSEC validation. Bahn.de, Spiegel.de, Amazon.de, DHL, N26, Hetzner, IONOS, Strato, Sparkassen, Web.de. The full list of what broke is longer than the list of .de services that kept working.
There is nothing in this story you can fix with a patch. That is also the point.
What actually broke
DENIC, the registry that runs .de, deployed the third generation of its DNSSEC signing infrastructure in April 2026. It is built on Knot DNS plus in-house orchestration code plus a fleet of Hardware Security Modules holding the private keys. On May 5 the registry executed a routine Zone Signing Key rotation. A defect in the in-house code, which DENIC’s own postmortem describes as “not fully covered by the test scenarios and was therefore not identified as defective during test runs or in ‘cold’ parallel operation prior to commissioning,” fired during that rotation.
Instead of generating one ZSK pair and replicating it across the HSMs, the buggy code generated three independent key pairs, one per HSM. Only one of those public keys (keytag 33834) was written into the zone’s DNSKEY record. The other two HSMs then went ahead and signed records with private keys whose corresponding public keys were never published anywhere.
The result, in the plainest possible terms: roughly one-third of the RRSIGs in the .de zone were verifiable. The other two-thirds were cryptographically valid signatures over the right data with the wrong key, which is to say, garbage to any resolver that checks. The SOA and NSEC3 records were among the broken set. Blackfort Technology’s analysis confirmed an RRSIG with keytag 33834 over the NSEC3 record failing verification on every validating resolver. Heise reported all six authoritative nameservers serving the same defective signature within minutes.
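Worth a note for next time: because the broken RRSIGs carried the published key tag, eyeballing key tags would not have caught this. Only cryptographic verification does. Here is a minimal spot-check with standard BIND tooling; `a.nic.de` is one of the zone’s real authoritative servers, and any resolver that honors the CD bit works for the second command:

```sh
# Published DNSKEYs: with +multiline, dig annotates each record with
# its computed key tag ("key id = NNNNN"), so you can see which keys
# the zone claims to be signed with.
dig @a.nic.de de. DNSKEY +norecurse +multiline

# Actual verification: delv (ships with BIND 9.10+) fetches with the
# CD bit set and validates the whole chain from the root trust anchor
# itself, naming the record set that fails to verify instead of
# handing back a bare SERVFAIL.
delv @8.8.8.8 de. SOA +rtrace
```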
Early on May 5, Cloudflare’s writeup described the failure as “incorrect DNSSEC signatures” from “a routine key rotation,” without naming the multi-HSM root cause. That account is fine for a resolver-side view. DENIC’s postmortem, published May 10, is the authoritative read on the mechanism: three HSMs, three keys, one published, two-thirds of signatures unverifiable. Where the two accounts disagree on detail, DENIC’s controls the answer because the failure happened inside their pipeline. Cloudflare was downstream of it like everyone else.
DENIC’s internal monitoring detected anomalies. The alerts, per the postmortem, “were not processed correctly.” That is a process failure stacked on top of a code failure. The combination is what turned a deployment defect into a multi-hour public outage.
The timeline that mattered
All times UTC, May 5 to May 6 2026.
- ~19:30 DENIC publishes the malformed zone. SERVFAIL begins on 1.1.1.1 almost immediately.
- 19:36 UptimeRobot logs the first measurable alert spike. Volume peaks near 10,000 alerts per minute.
- ~20:03 to 20:36 Public outage reports surface for N26, DHL, IONOS.
- ~20:15 DENIC attempts a partial remediation, re-signing only the SOA. It does not fix the zone. Cascading inconsistencies follow.
- ~21:28 DENIC publicly acknowledges the disruption.
- 22:17 Cloudflare deploys an override for `.de` on 1.1.1.1 and on the internal resolver its CDN uses to reach customer origins. Impact ends for Cloudflare users.
- 00:08 (May 6) DENIC begins distributing a correctly signed zone using keytag 32911, the previous ZSK, as the rollback key.
- 01:15 (May 6) Full resolution across resolvers.
Cloudflare users got back to work after 2 hours 47 minutes. Everyone validating who did not apply a manual override waited closer to 5 hours 45 minutes.
What the blast radius actually was
.de holds roughly 18 million registered domains. The DNSSEC-signed cohort at the second-level layer is small, around 648,000 domains per The Register’s figures, roughly 3.6 percent of the zone. That number is also a red herring. The signatures that broke were on .de itself, on the TLD’s own DNSKEY and NSEC3 and SOA. Resolvers could not build a chain of trust to the zone. That broke delegation lookups for every .de domain, signed at the SLD or not.
A user querying an unsigned bahn.de from a validating resolver still has to walk down through .de’s delegation. If .de is BOGUS, the answer is SERVFAIL regardless of whether bahn.de knows what DNSSEC is.
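You can watch that mechanism directly with dig’s `+cd` flag, which sets the DNS “checking disabled” bit and asks a validating resolver for the answer without validation. What follows is a reconstruction of the during-outage behavior, not captured output:

```sh
# Validation on (the default): the resolver cannot build a chain of
# trust through .de, so even the unsigned bahn.de dies with SERVFAIL.
dig bahn.de A @8.8.8.8

# CD bit set: same resolver, same name, validation skipped. The
# authoritative servers were healthy all along, so this is NOERROR.
dig bahn.de A @8.8.8.8 +cd
```

That pair is also the fastest triage you can run from a laptop: if the second query answers and the first does not, the problem is validation, not the authoritative servers.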
What users actually saw depended entirely on the resolver in front of them:
| Resolver | Validation | What the user saw |
|---|---|---|
| Cloudflare 1.1.1.1 | Yes | SERVFAIL with EDE Code 22, “No Reachable Authority” (which Cloudflare later admitted obscured the real cause) |
| Google 8.8.8.8 | Yes (strict) | SERVFAIL / DNSSEC Bogus |
| Quad9 9.9.9.9 | Yes (strict) | SERVFAIL / DNSSEC Bogus |
| Most German ISP resolvers | Partial or none | NOERROR. The site loaded. |
| Any resolver with a warm cache and serve-stale enabled | N/A | NOERROR from stale cache, until the TTL ran out |
The grim irony for German users is that the people most likely to keep working through the outage were the ones on their ISP’s non-validating resolver, the configuration security people have been quietly trying to discourage for a decade.
What the mitigations were, and weren’t
Cloudflare’s 22:17 UTC override is the action that ended impact for most public web traffic. They are careful to call it an “existing override rule mechanism” rather than a formal RFC 7646 Negative Trust Anchor, but the operational effect is the one RFC 7646 describes: validate everything else, treat .de as if it were unsigned. Cloudflare also applied the same override to the internal resolver their CDN uses to fetch from customer origins, which is a separate stack from 1.1.1.1 and would otherwise have kept failing for Cloudflare-hosted .de sites even after 1.1.1.1 recovered.
RFC 7646 names “operational failure of a TLD” as the canonical case for an NTA. This is exactly that. The tradeoff is real: validation is the protection DNSSEC provides against cache poisoning, and turning it off for a zone removes that protection until the override is lifted. With a publicly confirmed registry-side failure and no plausible self-service fix, the call to favor availability is the right one.
What DENIC did, on their side, was not an immediate rollback. The first remediation attempt at 20:15 was a partial re-signing of the SOA that did not fix the zone. The correct rollback to keytag 32911 did not go out until 00:08 the next morning, more than four and a half hours after the malformed zone was published. DENIC suspended all future ZSK rollovers pending the investigation. That is the right reaction. It is also a reaction that does not help anyone who was offline that night.
What you can change at your resolver
This is the part the Field Notes Desk cares about. If the only honest takeaway is “your TLD registry can ruin your evening and there is nothing you can do,” that’s a Hacker News thread, not an operations post. There are at least four levers you can pull at your resolver layer that change the blast radius next time. Pick the one you don’t have yet.
- Run your own validating recursive resolver, with NTA capability already configured. Unbound, BIND 9.11+, and PowerDNS Recursor all support local negative trust anchors. For Unbound it is `domain-insecure` plus the `unbound-control insecure_add` runtime command. For BIND it is `rndc nta`, runtime-settable with no config reload since 9.11. For the Recursor it is `rec_control add-nta`. The point is not to keep an NTA in place. The point is to be able to add one for `.de` in 30 seconds during an incident, without a change-management ticket. Document the procedure now, while nothing is on fire; a runbook sketch follows this list.
- Turn on serve-stale. RFC 8767. Unbound has `serve-expired` and `serve-expired-ttl`. BIND has `stale-answer-enable`. PowerDNS Recursor has `serve-stale-extensions`. This is the mechanism that quietly absorbed the outage for many users who never noticed it. It is roughly five lines of config, sketched below. If your resolver does not have it on, fix that this week.
- Monitor DNS resolution itself, not just HTTPS endpoints. A simple synthetic that queries a known `.de` name (or any name under a TLD you care about) through a validating path and a non-validating one will distinguish “my domain is down” from “this TLD’s DNSSEC is broken everywhere” in the first ten minutes; a probe sketch closes this section. UptimeRobot saw the alert spike at 19:36 UTC, almost two hours before DENIC publicly acknowledged the problem. You can see that signal too if you are looking for it.
- Know which of your services have a hard dependency on a single TLD. `.se` had a comparable DNSSEC failure in 2022. `.ru` had one in 2024. Now `.de`. The shared-fate pattern is not theoretical, it is recurring on roughly a two-year cycle. If a critical service of yours only answers on a single ccTLD, that is a decision worth making explicit rather than inheriting.
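Here is the first lever as runbook lines. A minimal sketch, assuming Unbound with `unbound-control` enabled and a stock BIND 9.11+; the two-hour lifetime is illustrative:

```sh
# Unbound: treat .de as unsigned, at runtime, no restart, no config
# change. Does not survive a restart; add 'domain-insecure: "de."' to
# unbound.conf only if the override must persist across one.
unbound-control insecure_add de.
# ...and roll it back once the registry ships a good zone:
unbound-control insecure_remove de.

# BIND 9.11+: a real RFC 7646 negative trust anchor, with a built-in
# expiry (one hour by default) so it cannot be forgotten forever.
rndc nta -lifetime 2h de
rndc nta -dump        # list active NTAs
rndc nta -remove de   # lift it early once .de validates again
```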
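The second lever, in Unbound syntax since that is the shortest of the three. The option names are real; the numbers are illustrative starting points, not recommendations:

```
server:
    # RFC 8767: on upstream resolution failure, answer from expired
    # cache instead of returning SERVFAIL.
    serve-expired: yes
    # How far past expiry a record may still be served, in seconds
    # (one day here).
    serve-expired-ttl: 86400
    # TTL stamped on stale answers handed to clients.
    serve-expired-reply-ttl: 30
```

The limit is in the name: stale answers exist only for names the cache had already resolved, which is exactly why this absorbed the outage on busy resolvers and did nothing on cold ones.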
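And the third lever does not need a vendor. A sketch in plain shell, assuming `dig` is installed; it collapses the bullet’s two paths into one validating resolver queried twice, once with the CD bit, which isolates the same signal. The probe name and resolver are stand-ins for whatever you actually depend on:

```sh
#!/bin/sh
# Distinguish "name is down" from "DNSSEC above the name is broken":
#   validated query fails, +cd query succeeds -> validation failure upstream
#   both fail                                 -> the name/zone is actually down
NAME="denic.de"      # stand-in: any name under the TLD you care about
RESOLVER="8.8.8.8"   # stand-in: any validating resolver that honors CD

validated=$(dig +short +time=3 +tries=1 "$NAME" A @"$RESOLVER")
unvalidated=$(dig +short +time=3 +tries=1 +cd "$NAME" A @"$RESOLVER")

if [ -z "$validated" ] && [ -n "$unvalidated" ]; then
    echo "ALERT: $NAME answers only with validation disabled: DNSSEC breakage upstream"
elif [ -z "$validated" ]; then
    echo "ALERT: $NAME does not resolve at all"
fi
```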
If you do only one of those, do the first. The next time a TLD signing pipeline breaks, the question your team will be asking is “can we add an NTA right now,” and the answer should be yes, with a documented command and a known rollback, not a debate about whether it’s safe to disable validation in production. PatchDay Alert will keep flagging registry-layer incidents the same way we flag CVEs, because the failure mode is genuinely the same shape: shared fate, narrow window, and the operational work happens at your resolver.
Sources
- Cloudflare: When DNSSEC goes wrong: how we responded to the .de TLD outage
- DENIC: Analysis of the DNS outage on 5 May 2026
- The Register: Denic sorry for DNSSEC error that crashed Germany's internet
- Blackfort Technology: DNSSEC Failure in the .de Zone
- Heise Online: Problems with .de domains: What is known so far
- UptimeRobot: Inside the .de DNS Outage
- RFC 7646: Definition and Use of DNSSEC Negative Trust Anchors
- RFC 8767: Serving Stale Data to Improve DNS Resiliency