Welcome to Failure Modes


Failure Modes is a collection of literature on how and why software systems fail :boom:


Running things in production is hard, and running distributed systems is extra hard.

Failure Modes is an effort to curate resources and stories from the community so that we can all learn and get better at running large-scale software in production.

See the announcement blog post.


Please send a pull request to extend this collection.

Contributions can be anything: incident postmortems, blog posts, projects, talks, tweets, research, and more.

Huge thanks to our contributors :bowing_man: :bowing_woman: :tada:

Keep in touch

Subscribe to the Failure Modes Newsletter to get blog posts, talks, notes, and research on building and running production systems in your inbox.

Have suggestions or questions? Reach out on Twitter: @electron0zero