Learn

Learn about building resilient systems #

A collection of resources to learn about failures and failure modes of software systems.

Books #

Books for those who are interested in datacenters and datacenter design:

The Practice of System and Network Administration - Thomas A. Limoncelli, Christina J. Hogan, Strata R. Chalup
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines - Luiz André Barroso, Jimmy Clidaras, Urs Hölzle (open access; a PDF is freely available)

Blog Posts #

Blog posts on failures, reliability, testing, and other relevant topics.

Talks #

Talks on how systems fail, demos of systems, and other wisdom on how we can build better systems.

Tools & Projects #

Tools and projects focused on failures and failure modes of software systems.

Chaos Toolkit #

The Open Source Platform for Chaos Engineering

Chaos Monkey #

Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
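
To make the idea of a random instance failure concrete, here is a minimal sketch of the underlying technique: pick one instance from a group at random, terminate it, and check whether the group still meets a minimum size. This is not Chaos Monkey's actual code or API. `InstanceGroup`, `terminate_random_instance`, and the instance ids are hypothetical names, and the "instances" are an in-memory set rather than real machines.

```python
import random
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class InstanceGroup:
    """Hypothetical in-memory stand-in for a group of service instances."""
    name: str
    instances: Set[str] = field(default_factory=set)

    def terminate(self, instance_id: str) -> None:
        # Removing the id simulates the instance going away.
        self.instances.discard(instance_id)

    def is_healthy(self, min_instances: int = 1) -> bool:
        # Assumption: "healthy" simply means enough instances remain.
        return len(self.instances) >= min_instances


def terminate_random_instance(group: InstanceGroup, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen instance.

    Returns the terminated id, or None if the group is already empty.
    """
    if not group.instances:
        return None
    # Sort first so the choice is reproducible for a given seed.
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run can be repeated
    group = InstanceGroup("web", {"i-001", "i-002", "i-003"})
    victim = terminate_random_instance(group, rng)
    print(f"terminated {victim}; healthy: {group.is_healthy(min_instances=2)}")
```

In a real experiment the health check would be an observation of the running service (for example, error rate or latency) rather than a count of remaining instances.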

Learning from Incidents in Software #

Incidents are costly. Without spending time analyzing and determining the conditions that exist for an incident to take place, we won’t learn how to successfully remove or recover from these conditions in the future.

Let’s help each other learn.

SNAFU catchers #

A consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.