Learn
Learn about building resilient systems #
A collection of resources to learn about failures and failure modes of software systems.
Books #
Books for those who are interested in Datacenters and Datacenter Design:
- The Practice of System and Network Administration - Thomas A. Limoncelli, Christina J. Hogan, Strata R. Chalup
- The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines - Luiz André Barroso, Jimmy Clidaras, Urs Hölzle (open access; a PDF is available)
Blog Posts #
Blog posts on failures, reliability, testing, and other relevant topics
- Chaos Engineering — Review Lineage Driven Failure Injection (LDFI)
- Failure Modes and Continuous Resilience - Adrian Cockcroft, also see this thread
- Post Mortem - The Cloudflare Blog, lists postmortems from Cloudflare
- How we’re building a production readiness review process at Grafana Labs
Talks #
Talks on how systems fail, demos of systems, and other wisdom on how we can build better systems:
- Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill
- Bryan Cantrill - Docker in Production: Tales From the Engine Room
- Keynote: High Reliability Infrastructure Migrations - Julia Evans, Software Engineer, Stripe
- Orchestrated Chaos: Applying Failure Testing Research at Scale
- Orchestrating Chaos Applying Database Research in the Wild - Peter Alvaro
- Testing Cloud-Native Databases with Chaos Mesh - Siddon Tang
- The Hurricane’s Butterfly: Debugging Pathologically Performing Systems
- SREcon24 Americas - System Performance and Queuing Theory - Concepts and Application
Tools & Projects #
Tools and projects focused on failures and failure modes of software systems.
Chaos Toolkit #
The Open Source Platform for Chaos Engineering
Chaos Monkey #
Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
Learning from Incidents in Software #
Incidents are costly. Without spending time analyzing and determining the conditions that exist for an incident to take place, we won’t learn how to successfully remove or recover from these conditions in the future.
Let’s help each other learn.
SNAFU catchers #
A consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.
Resilience engineering papers #
Contains notes about people active in resilience engineering as well as some influential researchers who are no longer with us.
Kubernetes Failure Stories #
A compiled list of links to public failure stories related to Kubernetes.
Chaos Mesh #
Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments.
Debugging stories - Dan Luu #
A collection of links to various debugging stories.
A List of Post-mortems! - Dan Luu #
A collection of postmortems.
Testing Distributed Systems #
Curated list of resources on testing distributed systems
k6 - A modern load testing tool #
Research #
Research on failures and how to test, build, and operate reliable systems:
- Gray failure: the Achilles’ heel of cloud-scale systems - the morning paper
- Report from the SNAFU catchers Workshop on Coping With Complexity
Fault Isolation using Shuffle Sharding #
- AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
- Shuffle Sharding: Massive and Magical Fault Isolation
- Shuffle Sharding - Cortex
- Level 300: Fault Isolation with Shuffle Sharding
- Patterns for the Distributed Systems in Cloud — Part 1
- Uno, DDoS, Tres — The magic of Shuffle sharding
- Great thread from Colm MacCárthaigh, alt: thread reader
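For readers new to the idea, here is a minimal sketch of shuffle sharding, not taken from any of the resources above. Each customer is mapped to a small, deterministic, pseudo-random subset of workers, so a request that poisons one customer's workers fully takes out only customers whose whole subset falls inside it. The worker and customer names, the shard size, and the helper functions `shuffle_shard` and `fully_affected` are hypothetical, chosen only for illustration.

```python
import hashlib
import random


def shuffle_shard(customer_id: str, workers: list[str], shard_size: int) -> frozenset[str]:
    """Pick a deterministic pseudo-random subset of workers for one customer."""
    # Seed a private RNG from a stable hash so the same customer always
    # maps to the same shard, regardless of process or PYTHONHASHSEED.
    digest = hashlib.sha256(customer_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


def fully_affected(bad_workers: frozenset[str], shards: dict[str, frozenset[str]]) -> list[str]:
    """Customers whose entire shard lies inside the set of failed workers."""
    return [customer for customer, shard in shards.items() if shard <= bad_workers]


# Hypothetical fleet: 8 workers, 20 customers, 2 workers per customer.
workers = [f"worker-{i}" for i in range(8)]
customers = [f"customer-{i}" for i in range(20)]
shards = {c: shuffle_shard(c, workers, 2) for c in customers}

# Suppose customer-0 sends a "poison pill" that takes down both of its workers.
bad = shards["customer-0"]
print("failed workers:", sorted(bad))
print("customers with no healthy worker left:", fully_affected(bad, shards))
```

With 8 workers and a shard size of 2 there are 28 distinct possible shards, so a customer loses all of its workers only if it happens to share the exact same pair. Splitting the same fleet into 4 fixed shards of 2 workers would instead take out every customer on the poisoned shard.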
Systems #
Real-world failure stories and incident postmortems of widely used systems
PostgreSQL #
Kafka #
- Kafkapocalypse: a postmortem on our service outage
- Stories from the Front: Lessons Learned from Supporting Apache Kafka
- How to Lose Messages on a Kafka Cluster - Part 1
Kubernetes #
- Compilation of public failure/horror stories related to Kubernetes
- 10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You - Laurent Bernaille
- Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performance, also see DNS Lookups in Kubernetes
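As a rough illustration of why `ndots:5` matters, the sketch below models the order in which a glibc-style stub resolver tries candidate names. It is a simplified model, not the resolver's actual code; other resolvers, for example musl, behave differently. The function `candidate_queries` and the search list shown are assumptions for illustration; check a pod's own `/etc/resolv.conf` for the real values.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a glibc-style resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search domains.
        return [absolute] + expanded
    # Fewer dots than ndots: walk the search list first, absolute name last.
    return expanded + [absolute]


# Assumed search list for a pod in the "default" namespace.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("api.example.com", search):
    print(query)
```

In this model an external name such as `api.example.com` has only two dots, so three lookups against cluster-internal suffixes are attempted before the name itself is tried. Writing the name with a trailing dot (`api.example.com.`) skips the search list.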