Stories
Postmortems, Incident Reports, and Stories from Real-World Failures.
Algolia #
Atlassian #
Authzed #
AWS #
- Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region - November 25, 2020
- Summary tweet thread of the incident. Threadreader link
Bungie #
Cloudflare #
- Details of the Cloudflare outage on July 2, 2019
- Cloudflare outage on July 17, 2020
- A Byzantine failure in the real world - Cloudflare API availability incident on 2020-11-02, Twitter thread
- Cloudflare outage on June 21, 2022
Celer Bridge #
DataDog #
DataSpring #
- Datacenter and tornado
- The website is in Czech; use a translation service to read it. archive.today link
- This is a story of how a data center dealt with a tornado. It is a good reminder to verify your offsite backups, review your disaster recovery plan, and conduct disaster recovery dry runs.
Deno #
DoorDash #
Facebook #
- October 4, 2021: Facebook Group (Facebook, Instagram, WhatsApp, Oculus) Outage.
- Understanding How Facebook Disappeared from the Internet
- What happened on the Internet during the Facebook outage
- Update about the October 4th outage - Facebook Engineering
- More details about the October 4 outage - Facebook Engineering
- What Happened to Facebook, Instagram, WhatsApp? Krebs on Security
- Why was Facebook down for five hours? - YouTube - Ben Eater explains the Facebook outage in detail with a demo.
- This outage had side effects on the whole internet; the most common one was ISPs getting DoSed with DNS queries for Facebook domains.
Fastly #
- Summary of June 8 outage - Fastly - June 8, 2021, global outage
Garmin #
GitHub #
- October 21 post-incident analysis
- April service disruptions analysis - May 22, 2020
- An update on recent service disruptions - March 16, 2022
GitLab #
Google Cloud #
- An update on Sunday’s service disruption - June 3, 2019
- Google OAuth access was unavailable - December 14, 2020
- Global: Experiencing Issue with Cloud networking - November 16, 2021
- London (europe-west2) cooling system failure - July 19, 2022
- Oracle Cloud also saw a cooling-related failure on the same day in its London data center.
Grafana Labs #
- How a GCP Persistent Disk Incident Snowballed into a 23-Hour Outage – and Taught Us Some Important Lessons
- How we responded to a 2-hour outage in our Grafana Cloud Hosted Prometheus service
- How a production outage in Grafana Cloud’s Hosted Prometheus service was caused by a bad etcd client setup
- How adding Kubernetes label selectors caused an outage in Grafana Cloud Logs — and how we resolved it
Independent Stories #
- Debugging a misbehaving distributed system, by Erin, alt: Threadreader
- Ask HN: Tell me an engineering war story from your career
- Impact of an upstream outage - a chain of issues caused by a Launchpad outage
Indian Registry for Internet Names and Numbers (IRINN) #
KLAYswap #
- KLAYswap Incident Report - Feb 03, 2022; see also a more detailed analysis: Post Mortem of KlaySwap Incident through BGP Hijacking - EN
Level 3 Communications (CenturyLink) #
Loom.com #
Netflix #
Nomad Bridge #
n8n #
Oracle Cloud #
- Datacenter cooling infrastructure failed in UK South - July 19, 2022
- Following unseasonably high temperatures in the UK South (London) region, two cooler units in the data center experienced a failure when they were required to operate above their design limits. As a result, temperatures in the data center began to climb, causing a subset of Compute infrastructure to go into protective shut down.
- Google Cloud also had a cooling-related failure in London (europe-west2).
Roblox #
Rogers Communications #
Salesforce #
Stripe #
- Stripe was down for all those with Stripe Tax enabled, and status updates on the @stripestatus Twitter page
Slack #
- Users are unable to connect to Slack - Tuesday, May 12, 2020, and Twitter thread by copyconstruct
- Slack’s Outage on January 4th, 2021
- May 12, 2020 Outage - A Terrible, Horrible, No-Good, Very Bad Day at Slack - Slack Engineering
- Double Trouble with Datastores - Slack’s Incident on February 22, 2022