Tools and projects focused on failures and failure modes of software systems.
The Open Source Platform for Chaos Engineering
Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
Incidents are costly. Unless we spend time analyzing and determining the conditions that must exist for an incident to take place, we won’t learn how to remove or recover from those conditions in the future.
Let’s help each other learn.
A consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.
Contains notes about people active in resilience engineering, as well as some influential researchers who are no longer with us.
A compiled list of links to public failure stories related to Kubernetes.
Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments.
Collection of links to various debugging stories.
A collection of postmortems.
Curated list of resources on testing distributed systems