Stories

Postmortems, Incident Reports, and Stories from Real-World Failures.

Algolia #

Cloudflare #

Celer Bridge #

Celer Network Bridge dapp incident analysis by Coinbase

DataDog #

Datadog Outage Affects Multiple Regions for a Day. Also see the Reddit discussion.

DataSpring #

Datacenter and tornado
- The website is in Czech; use a translation service to read it. An archive.today link is also available.
- This is a story of how a data center dealt with a tornado. It is a good reminder to verify your offsite backups, review your disaster recovery plan, and conduct disaster recovery dry runs.

Deno #

May 30 incident update - May 30, 2022

DoorDash #

How to Handle Kubernetes Health Checks - health checks outage on Black Friday

Facebook #

October 4, 2021: Facebook Group (Facebook, Instagram, WhatsApp, Oculus) Outage.
- Understanding How Facebook Disappeared from the Internet
- What happened on the Internet during the Facebook outage
- Update about the October 4th outage - Facebook Engineering
- More details about the October 4 outage - Facebook Engineering
- What Happened to Facebook, Instagram, WhatsApp? Krebs on Security
- Why was Facebook down for five hours? - YouTube - Ben Eater explains the Facebook outage in detail with a demo.
- This outage had side effects on the whole internet; the most common one was ISPs getting DoSed with DNS queries for Facebook domains.

Fastly #

Summary of June 8 outage - Fastly - June 8, 2021, global outage

Garmin #

Garmin’s multi-day service outage, thread by Osma Ahvenlampi

GitHub #

GitLab #

GitLab.com database incident - January 31st 2017
- Postmortem of database outage of January 31

Google Cloud #

An update on Sunday’s service disruption - June 3, 2019
Google OAuth access was unavailable - December 14, 2020
- Relevant Twitter thread 1 and thread 2
Global: Experiencing Issue with Cloud networking - November 16, 2021
- HN Thread
- A bug introduced 6 months ago brought Google’s Cloud Load Balancer to its knees
London (europe-west2) cooling system failure - July 19, 2022
- Oracle Cloud also saw a cooling-related failure in its London data center on the same day.

Grafana Labs #

Independent Stories #

Indian Registry for Internet Names and Numbers (IRINN) #

Missing IRINN route objects & outage! - 6 July 2020, Anurag Bhatia

KLAYswap #

KLAYswap Incident Report - Feb 03, 2022. Also see a more detailed analysis: Post Mortem of KlaySwap Incident through BGP Hijacking - EN

Level 3 Communications (CenturyLink) #

August 30th, 2020: Analysis of CenturyLink/Level 3 Outage

Loom.com #

In Place AWS Elasticache Redis Upgrade went wrong

Netflix #

Containers taking out nodes - Twitter thread by Sargun Dhillon. Threadreader link

Nomad Bridge #

Nomad Bridge incident analysis by Coinbase

n8n #

Ask HN: Azure has run out of compute – anyone else affected?

Oracle Cloud #

Datacenter cooling infrastructure failed in UK South - July 19, 2022
- Following unseasonably high temperatures in the UK South (London) region, two cooler units in the data center experienced a failure when they were required to operate above their design limits. As a result, temperatures in the data center began to climb causing a subset of Compute infrastructure to go into protective shut down.
- Google Cloud also had a cooling-related failure in London (europe-west2) on the same day.

Roblox #

Roblox 73-hour outage - October 28th, 2021

Rogers Communications #

Rogers Communications outage in Canada - July 8, 2022
- Causes - Wikipedia

Salesforce #

Multi-Instance Service Disruption on May 11-12, 2021
- That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix • The Register - in the news

Stripe #

Stripe was down for all users with Stripe Tax enabled; status updates were posted on the @stripestatus Twitter page.
