Learn about building resilient systems #

A collection of resources for learning about failures and failure modes of software systems.

Blog Posts #

Blog posts on failures, reliability, testing, and other relevant topics.

Talks #

Talks on how systems fail, demos of systems, and other wisdom on how we can build better systems.

Tools & Projects #

Tools and projects focused on failures and failure modes of software systems.

Chaos Toolkit #

The Open Source Platform for Chaos Engineering

Chaos Monkey #

Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
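
The idea of tolerating random instance failures can be illustrated with a minimal in-memory sketch. This is not Chaos Monkey's actual implementation or API; `pick_victim`, `run_chaos_round`, the `groups` mapping, and the `probability` parameter are all hypothetical names chosen for illustration, and "terminating" an instance here only removes it from a Python set.

```python
import random


def pick_victim(instances, probability, rng):
    """Return one instance to terminate, or None if this group is skipped."""
    # Skip empty groups, and skip this round with chance (1 - probability).
    if not instances or rng.random() >= probability:
        return None
    # Sort first so the choice is reproducible for a seeded rng.
    return rng.choice(sorted(instances))


def run_chaos_round(groups, probability, rng=None):
    """Terminate at most one random instance per group (hypothetical model).

    groups: dict mapping a group name to a set of instance ids.
    Returns a dict of group name -> terminated instance id.
    """
    if rng is None:
        rng = random.Random()
    terminated = {}
    for name, instances in groups.items():
        victim = pick_victim(instances, probability, rng)
        if victim is not None:
            instances.remove(victim)  # stand-in for a real termination call
            terminated[name] = victim
    return terminated


if __name__ == "__main__":
    fleet = {"web": {"i-1", "i-2", "i-3"}, "db": {"i-9"}}
    print(run_chaos_round(fleet, probability=0.5, rng=random.Random(42)))
    print(fleet)
```

Passing a seeded `random.Random` makes a round reproducible, which is useful when checking that the surviving instances still serve traffic.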

Learning from Incidents in Software #

Incidents are costly. Unless we spend time analyzing and determining the conditions that allow an incident to take place, we won’t learn how to remove or recover from those conditions in the future.

Let’s help each other learn.

SNAFU catchers #

A consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.

Resilience engineering papers #

Contains notes about people active in resilience engineering, as well as some influential researchers who are no longer with us.

Kubernetes Failure Stories #

A compiled list of links to public failure stories related to Kubernetes.

Chaos Mesh #

Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments.

Debugging stories - Dan Luu #

Collection of links to various debugging stories.

A List of Post-mortems! - Dan Luu #

A collection of postmortems.

Testing Distributed Systems #

Curated list of resources on testing distributed systems

k6 - A modern load testing tool #

Research #

Research on failures and how to test, build, and operate reliable systems.

Fault Isolation using Shuffle Sharding #
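
As a minimal sketch of the general idea behind shuffle sharding, assuming each customer is mapped to a small, deterministic, pseudo-random subset of workers so that a fault triggered by one customer is confined to that subset: the function names, the SHA-256 based seeding, and the worker names below are illustrative assumptions, not taken from the linked resource.

```python
import hashlib
import math
import random


def shuffle_shard(customer_id, workers, shard_size):
    """Pick a deterministic pseudo-random subset of workers for a customer."""
    # Derive a stable seed from the customer id so the same customer
    # always lands on the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort so the result does not depend on the input ordering.
    return frozenset(rng.sample(sorted(workers), shard_size))


def possible_shards(worker_count, shard_size):
    """Number of distinct shards that can be drawn from the worker pool."""
    return math.comb(worker_count, shard_size)


if __name__ == "__main__":
    pool = [f"w{i}" for i in range(8)]
    for customer in ("customer-a", "customer-b", "customer-c"):
        print(customer, sorted(shuffle_shard(customer, pool, 2)))
    print("distinct shards:", possible_shards(8, 2))
```

With 8 workers and a shard size of 2 there are 28 distinct shards, so two customers are fully exposed to each other's failures only if they happen to draw the same pair.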

Systems #

Real world failure stories and incident postmortems of widely used systems

PostgreSQL #

Kafka #

Kubernetes #

YugabyteDB #