Reliability
Retrospective Prime Directive
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
https://retrospectivewiki.org/index.php?title=The_Prime_Directive
Rolling out on-call
Some great resources:
- https://wiki.dbbs.co/view/suggested-reading-for-on-call
- https://increment.com/on-call/
- https://www.honeycomb.io/blog/oncall-and-sustainable-software-development
- https://response.pagerduty.com
Examples of good incident communication to customers
- https://github.com/danluu/post-mortems (a collection of post-mortems)
- https://blog.cloudflare.com/cloudflare-incident-on-october-30-2023
- https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage