5 Ways GitHub Spent April Lighting Itself On Fire
GitHub logged ten separate outages in one month, including one fixed by turning DNS off and on again. Here are the five most absurd.
There’s a specific facial expression a sysadmin makes when they look at Slack at 9:47 a.m. on a Wednesday and see the words “Actions is degraded.” It’s not anger. Anger requires energy. It’s the expression of a person who has, against their better judgment, started to believe in things again, and is now being reminded that belief is a luxury good not covered by their compensation package.
In April 2026, GitHub gave that expression away for free. Ten separate times.
For scale: GitHub’s January 2026 availability report listed two incidents. March listed four. April listed ten, which GitHub itself acknowledged in a CTO post titled, in the corporate dialect of someone whose pants are visibly on fire, “An update on GitHub availability.” What follows are the five most absurd, ranked by how much they made me put down my coffee and stare at the wall.
#5. Two Separate Things Broke At The Same Time On April 1, And No, This Is Not An April Fools’ Joke
GitHub kicked off the month with two unrelated incidents on the same day. On April Fools’ Day. The first was Code Search going down for nearly nine hours after an automated change was applied “too aggressively” during a messaging-system upgrade: 100% query failure for over two hours, stale results for another six. The second was a failed credential rotation that took the Audit Log service offline for 28 minutes and served 5xx errors to roughly 4,400 users.
Two automated systems. Two surfaces. One day. Zero of these were jokes, even though the calendar would have suggested otherwise.
The thing about an April 1 outage is that nobody believes you. You post in the channel that Code Search is down, three people reply with the laugh-react, and one person, who has been at the company eleven years and has seen the laugh-react before, just types “for real?” and waits. That coworker is the only one who matters. Everyone else finds out the hard way around hour four, when they realize they cannot find the file they need to ship the feature they were supposed to demo at 2 p.m.
The postmortem is in GitHub’s April 2026 availability report, written in the bloodless prose of people who have been told, repeatedly, by legal, to use the bloodless prose.
#4. A DNS Management Tool Responded To Bad Input By Deleting A Working DNS Record
I want you to sit with this one for a second.
On April 13, GitHub Pages went down for 1 hour 37 minutes, generating roughly 17.5 million failed requests. The cause was an automated DNS-management tool. Its upstream data source returned intermittent failures. The tool’s response to receiving bad data was, and I cannot find a way to say this that makes it less stupid, to delete a working DNS record.
This is the software equivalent of a smoke detector that, upon detecting smoke, sets the house on fire as a precaution. It is fail-deadly behavior wearing the lanyard of fail-safe behavior. Somewhere there is an engineer who wrote this code, a second engineer who reviewed it, and a third engineer who deployed it, and what I want more than anything in the world is to be a fly on the wall of the meeting where they all sat down to discuss the incident report.
GitHub’s postmortem flags that detection on this one took “approximately 53 minutes,” which they describe as “longer than we would have liked.” A truer phrasing: “We did not notice that we had pointed a chainsaw at our own foot for the better part of an hour.”
The deeper joke is that this is the failure mode of a lot of modern infrastructure tooling. Everything is automated, nothing is supervised, and the failure modes are creative. The tool didn’t crash. The tool didn’t refuse the input. The tool confidently and competently and at production-grade reliability did the worst possible thing it could do with the information it had. Ten out of ten on execution, zero out of ten on judgment.
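For contrast, here is roughly what the boring version looks like. This is a minimal Python sketch, not GitHub's tool: `Reconciler`, `UpstreamError`, `fetch_desired`, and the 10% deletion guard are all hypothetical names and numbers invented for illustration. The only idea it demonstrates is that a failed or suspicious read from upstream leaves the live records alone.

```python
from dataclasses import dataclass, field


class UpstreamError(Exception):
    """Raised when the (hypothetical) source of truth cannot be read reliably."""


@dataclass
class Reconciler:
    # Records currently being served, e.g. {"pages.example.com": "192.0.2.10"}.
    live: dict = field(default_factory=dict)
    # Assumed guard: refuse any sync that would delete more than this share of records.
    max_delete_fraction: float = 0.1

    def sync(self, fetch_desired) -> str:
        """Apply the upstream's desired state, unless the read looks untrustworthy."""
        try:
            desired = fetch_desired()
        except UpstreamError:
            # Bad or missing data is not an instruction. Keep serving what works.
            return "skipped: upstream unavailable"

        to_delete = set(self.live) - set(desired)
        if self.live and len(to_delete) / len(self.live) > self.max_delete_fraction:
            # A sudden mass deletion is more likely a bad read than a real change.
            return "skipped: deletion guard tripped"

        self.live = dict(desired)
        return "applied"
```

The design choice is that "do nothing" is the default outcome and deletion has to earn its way past two checks. A real tool would also want to alert a human on every skipped sync, which this sketch leaves out.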
#3. Copilot Had Two Outages On April 9. At The Same Time. Stacked.
On April 9, GitHub Copilot Coding Agent fell over in two waves, and the second wave hit before the first had cleared, which is the technical definition of “having a worse day than you thought possible.” A rate-limiting bug was scoped globally instead of per-installation, which means the rate limiter, instead of throttling individual misbehaving customers, throttled the entire planet. A client update simultaneously tripled or quadrupled traffic, hit the broken limiter, and queue wait times peaked at 54 minutes.
Twenty-two thousand seven hundred workflow creations failed. GitHub counts this as two separate incidents in the official tally, which is how you get to ten. If you count it as one, it’s nine. If you count it the way the on-call engineer who handled both waves counts it, it’s probably closer to forty.
A global rate limiter scoped wrong is the Samuel L. Jackson scene in Jurassic Park: “ah ah ah, you didn’t say the magic word.” You ask politely for a token. The system, in the voice of a smug raccoon, tells you no. You ask again. It tells you no. You wait 54 minutes. It still tells you no. Eventually you go outside. You had forgotten outside was a thing.
The thing about Copilot specifically is that knowledge workers have wired the autocomplete directly into their cerebral cortex and now cannot remember how to do their jobs without it. When Copilot goes down, what you discover, with mounting horror, is how much of your craft you have outsourced to a confident parrot. A misconfigured rate limiter on a Wednesday morning is now a class of event that ruins the productivity of millions of people, which is a sentence that would have read like science fiction in 2019 and reads like a Tuesday now.
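For the record, the gap between "throttle the misbehaving customer" and "throttle the planet" can be as small as one dictionary key. The sketch below is a plain token-bucket limiter in Python, assuming nothing about GitHub's actual implementation: `PerInstallationLimiter`, `installation_id`, and the capacity and refill numbers are hypothetical. The point is only that each installation gets its own bucket, so one noisy caller cannot drain everyone else's.

```python
import time


class TokenBucket:
    """Classic token bucket: refills continuously, spends one token per request."""

    def __init__(self, capacity: int, refill_per_sec: float, now: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerInstallationLimiter:
    """One bucket per installation ID (hypothetical key), created on first use."""

    def __init__(self, capacity: int = 5, refill_per_sec: float = 1.0, clock=time.monotonic):
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock
        self._buckets = {}

    def allow(self, installation_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(installation_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, now)
            self._buckets[installation_id] = bucket
        return bucket.allow(now)
```

The globally scoped version of this bug is what you get if every caller is looked up under the same key: the first installation to triple its traffic empties the only bucket there is.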
#2. The DNS Cascade Of April 23, Which Was Fixed By Turning It Off And On Again
This is the one. This is the headliner.
On April 23, between 16:03 and 17:30 UTC, GitHub’s DNS infrastructure failed and took five services with it: Copilot, Webhooks, Git Operations, Actions, and Migrations. The cause, per GitHub’s postmortem, was “a recently introduced traffic-balancing mechanism” that “malfunctioned under specific load patterns.” For good measure, a completely separate bug the same day let Merge Queue silently lose data on 2,804 pull requests, which the postmortem treats as its own distinct disaster. One Thursday, two unrelated catastrophes.
You’ll love the fix. GitHub first tried a configuration rollback. The configuration rollback did not work. Recovery, and I am quoting here, “required restarting the affected DNS infrastructure.” That is the corporate-speak version of the IT Crowd opening sequence. A subsidiary of one of the most valuable software companies in the United States, in the year of our Lord 2026, paid out nine figures of engineering salary and resolved its most embarrassing root cause of the month by turning DNS off and on again.
It gets better. StatusGator issued an Early Warning seven minutes before GitHub confirmed, based purely on user reports. Downdetector had accumulated two thousand user complaints before the official githubstatus.com page acknowledged anything beyond degraded Copilot, Codespaces, and Packages. The status page is technically a status page in the same way the obituary section is technically a news source. By the time it tells you, it’s already over and the survivors have left for the buffet.
This is a documented pattern. A separate community discussion notes that Downdetector “consistently indicates problems several minutes before the GitHub status changes.” If you have ever wondered why the status page is green while everything is on fire, here’s your answer: the status page is not a monitor. It is a press release with a CSS animation.
#1. Six Hundred Thousand IP Addresses Took Down Search For Six And A Half Hours, And It Was Not Even A Top Story
The #1 entry isn’t really a GitHub story. It’s the story of where the entire industry is going, and GitHub just happens to be the part of the industry that finally got too big to ignore.
On April 27, between 16:15 and 22:46 UTC, GitHub’s Elasticsearch cluster behind public search was overloaded by, per The Register quoting GitHub directly, “likely a botnet attack.” Six hundred thousand unique IPs, generating traffic equivalent to thirty percent of GitHub’s daily search volume compressed into a four-hour burst, engineered specifically to dodge public API rate limits. Up to 65% of searches timed out for the first hour and forty-five minutes. Issues, Pull Requests, Projects, Repositories, Actions, Package Registry, and Dependabot Alerts all started failing in cascading order because every one of them reads from search at runtime.
Six and a half hours. Six. And. A. Half. Hours. The longest, broadest outage of the month, by a wide margin, and it barely made the news.
The villain is AI scrapers, and the easy take is “AI bad,” and the easy take is also boring. The specifics are more interesting:
- Sourcehut’s Drew DeVault has documented spending 20-100% of his weekly time mitigating LLM crawlers, eventually blocking entire cloud provider IP ranges in self-defense.
- GNOME’s GitLab deployed Anubis, a proof-of-work JavaScript challenge, and found that only about 3% of requests passed it, implying that roughly 97% of its traffic was automated.
- Read the Docs cut daily bandwidth from 800GB to 200GB by blocking AI crawlers, saving roughly $1,500 a month.
- Wikimedia Commons saw a 50% bandwidth surge attributed to training scrapers.
- Cloudflare puts the scale at 50 billion AI-crawler requests per day across its network.
What happened to GitHub on April 27 has happened to Sourcehut every week for the last year and a half. The only difference is GitHub has a press team and Sourcehut has Drew DeVault, who has been screaming about this for 18-plus months, and who has now been vindicated in the most expensive way possible: by Microsoft.
The lesson, if you want one: the read side of your code-hosting infrastructure is now a hostile environment. Any service that exposes an expensive endpoint to anonymous traffic is being treated, by a meaningful fraction of the population of Earth, as a buffet line. “We’ll figure it out when it becomes a problem” became “it is a problem now” while you weren’t looking, and the people not looking the hardest were the people running the largest code host in the world.
So What Do You Do, You Specifically, On A Tuesday Morning
There’s no five-step hardening checklist that would have saved you from any of this; cascading-DNS-cluster-of-doom problems happen at a layer you don’t own. But three small, cheap, embarrassingly-low-tech moves disproportionately matter:
- Pull-mirror your important repos to Gitea or Codeberg. Pulls every eight hours by default, runs without GitHub Actions, gives you a read-only fallback when GitHub is on fire. Cheapest hedge in the stack.
- Put a queue in front of your webhook receiver. A dumb SQS or Redis Streams or RabbitMQ layer that accepts the payload, returns 200 in under ten seconds, and lets a worker process it on a schedule that doesn’t care if GitHub is having a moment. The pattern Shortcut documented in 2019, somehow still not the default.
- Stop trusting githubstatus.com as an early-warning signal. On April 23, StatusGator’s Early Warning beat GitHub’s own confirmation by a documented seven minutes; external monitors that watch user reports routinely observe larger leads than that. If your alerting is keyed to GitHub’s status page, your alerting is keyed to GitHub’s PR department.
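Of the three, the webhook queue is the one that most benefits from a sketch. The Python below uses only the standard library and makes several assumptions: `SECRET`, `accept`, `drain`, and `Receiver` are hypothetical names, and the in-process `queue.Queue` is a stand-in for a durable queue such as SQS, Redis Streams, or RabbitMQ. The signature check follows GitHub's documented `X-Hub-Signature-256` scheme (HMAC-SHA256 over the raw request body, prefixed with `sha256=`).

```python
import hashlib
import hmac
import queue
from http.server import BaseHTTPRequestHandler

SECRET = b"replace-me"        # hypothetical shared webhook secret
DELIVERIES = queue.Queue()    # stand-in for a durable queue


def signature_ok(body: bytes, header: str, secret: bytes) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), (header or "").encode())


def accept(body: bytes, header: str, secret: bytes = SECRET, q=DELIVERIES) -> int:
    """Request path: verify, enqueue, answer. Nothing slow happens here."""
    if not signature_ok(body, header, secret):
        return 401
    q.put(body)
    return 200


class Receiver(BaseHTTPRequestHandler):
    """Minimal HTTP endpoint; could be served with http.server.HTTPServer."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        status = accept(body, self.headers.get("X-Hub-Signature-256", ""))
        self.send_response(status)
        self.end_headers()


def drain(handle, q=DELIVERIES) -> int:
    """Worker side: process everything queued so far; returns the count handled."""
    handled = 0
    while True:
        try:
            body = q.get_nowait()
        except queue.Empty:
            return handled
        handle(body)
        handled += 1
```

The receiver does no real work before answering, which is what keeps it inside the ten-second window; the worker calling `drain` can be as slow, retry-happy, or asleep as it likes. An in-memory queue loses deliveries on restart, so treat it as a placeholder for whichever durable queue you already run.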
And the self-hosted runner thing, since someone is about to bring it up: GitHub’s own April 2025 blog post is explicit that self-hosted runners do not improve availability against control-plane outages, because all runners depend on the same dispatch infrastructure. Self-hosting helps you when GitHub-managed compute breaks. It does not help you when GitHub can’t tell your runner what to do, which is most of what broke in April. You still want it for long-running and sensitive jobs. You don’t want it because you think it’s a parachute.
If your build is currently failing, congratulations: you’re part of a venerable tradition. If your build is passing, treasure it like the brief miracle it is. GitHub’s CTO has now promised “availability first” going forward, which is the sort of phrase a company uses when the previous strategy is most accurately described as “availability eventually, on a good day, if the bots are sleeping, and provided someone remembers to power-cycle DNS.”
Set your monitors. Mirror your repos. Queue your webhooks. And maybe, just maybe, do not let your DNS management tooling treat a flaky upstream source as permission to delete things. “Do not delete production records on bad input” is the kind of advice that should not require a postmortem to discover.
But here we are. Together. Refreshing the status page.
Sources
- GitHub availability report: April 2026
- GitHub availability report: January 2026
- GitHub availability report: March 2026
- An update on GitHub availability (CTO)
- GitHub outage on April 23, 2026 (StatusGator)
- GitHub Community discussion #179905
- GitHub says sorry and says it will do better as uptime slips (The Register) — 2026-04-29
- FOSS infrastructure is under attack by AI companies (The Libre News)
- Open source devs are fighting AI crawlers with cleverness and vengeance (TechCrunch) — 2025-03-27
- AI crawlers cause Wikimedia Commons bandwidth demands to surge 50% (TechCrunch) — 2025-04-02
- Trapping misbehaving bots in an AI Labyrinth (Cloudflare) — 2025-03-19
- When to choose GitHub-hosted runners or self-hosted runners (GitHub) — 2025-04-15
- More reliable webhooks with queues (Shortcut Engineering) — 2019-08
- Mirror to Codeberg