Learn about building resilient systems #
A collection of resources for learning about failures and failure modes of software systems.
Blog Posts #
Blog posts on failures, reliability, testing, and other relevant topics.
Post Mortem - The Cloudflare Blog, a list of postmortems from Cloudflare
Talks on how systems fail, demos of systems, and other wisdom on how we can build better systems.
Tools & Projects #
Tools and projects focused on failures, and failure modes of software systems.
The Open Source Platform for Chaos Engineering
Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
Incidents are costly. Without spending time analyzing and determining the conditions that exist in order for an incident to take place, we won’t learn how to successfully remove or recover from these conditions in the future.
Let’s help each other learn.
A consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.
Contains notes about people active in resilience engineering, as well as some influential researchers who are no longer with us.
A compiled list of links to public failure stories related to Kubernetes.
Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments.
Collection of links to various debugging stories.
A collection of postmortems.
Curated list of resources on testing distributed systems.
Research on failures and how to test, build, and operate reliable systems.
Fault Isolation using Shuffle Sharding #
- AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
- Shuffle Sharding: Massive and Magical Fault Isolation
- Shuffle Sharding - Cortex
- Level 300: Fault Isolation with Shuffle Sharding
- Patterns for the Distributed Systems in Cloud — Part 1
- Uno, DDoS, Tres — The magic of Shuffle sharding
- Great thread from Colm MacCárthaigh, alt: thread reader
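
As a rough illustration of the idea behind the resources above, the following Python sketch gives each customer a small, pseudo-random subset of workers, so that a misbehaving customer can only affect the workers in its own shard. The helper names (`shuffle_shard`, `fully_affected`), the worker and customer names, and the hash-seeded selection are assumptions made for this example; they are not taken from any of the linked implementations.

```python
import hashlib
import math
import random


def shuffle_shard(customer_id, workers, shard_size):
    """Return a stable pseudo-random subset of workers for one customer."""
    # Seed a private RNG from a hash of the customer id so the same
    # customer always maps to the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


def fully_affected(shards, bad_customer):
    """Customers (including bad_customer) whose whole shard lies inside
    the shard of bad_customer, i.e. who would lose every worker."""
    poisoned = shards[bad_customer]
    return sorted(c for c, shard in shards.items() if shard <= poisoned)


workers = [f"worker-{i}" for i in range(8)]       # hypothetical fleet
customers = [f"customer-{i}" for i in range(20)]  # hypothetical tenants
shards = {c: shuffle_shard(c, workers, shard_size=2) for c in customers}

# With 8 workers and shards of size 2 there are C(8, 2) = 28 distinct shards.
print("possible shards:", math.comb(len(workers), 2))
print("fully affected by customer-0:", fully_affected(shards, "customer-0"))
```

Customers whose shard only partially overlaps the poisoned one keep at least one healthy worker, which is why this pattern is usually paired with client-side retries.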
Real-world failure stories and incident postmortems of widely used systems.
- Kafkapocalypse: a postmortem on our service outage
- Stories from the Front: Lessons Learned from Supporting Apache Kafka
- How to Lose Messages on a Kafka Cluster - Part 1
- Compilation of public failure/horror stories related to Kubernetes
- 10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You - Laurent Bernaille
- Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances, also see DNS Lookups in Kubernetes
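
To make the `ndots:5` item above more concrete, here is a simplified Python model of how a glibc-style resolver expands a name using the `search` list in `/etc/resolv.conf`. The search domains are assumed values of the kind typically seen in a pod in the `default` namespace, `candidate_queries` is a hypothetical helper, and real resolvers differ in their details.

```python
def candidate_queries(name, search, ndots):
    """List the names a resolver would try, in order, for one lookup."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first; the name itself comes last.
    return expanded + [name]


# Assumed search list for a pod in the "default" namespace.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("example.com", search, ndots=5):
    print(query)
```

Under these assumptions, `example.com` results in three cluster-local lookups before the name is tried as given, while writing `example.com.` with a trailing dot skips the search list entirely.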