Welcome to Failure Modes


Failure Modes is a collection of literature on how and why software systems fail :boom:


Running things in production is hard, and running distributed systems is extra hard.

Failure Modes is an effort to curate resources and stories from the community so that we can learn and get better at running large-scale software in production.

See the announcement blog post.


Please send a pull request to extend this collection.

Submissions can be anything: incident postmortems, blog posts, projects, talks, tweets, research, and so on.

Huge thanks to our contributors :bowing_man: :bowing_woman: :tada:

