Jade Rubick

Reliability

Retrospective Prime Directive

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

https://retrospectivewiki.org/index.php?title=The_Prime_Directive

Rolling out on-call

Some great resources:

  • https://wiki.dbbs.co/view/suggested-reading-for-on-call
    • https://thenewstack.io/advice-management-teams-enrolling-changes-on-call-systems/
    • https://codywilbourn.com/2018/03/22/sustainable-on-call/
    • https://www.gremlin.com/community/tutorials/how-to-establish-a-high-severity-incident-management-program/
  • https://increment.com/on-call/
  • https://www.honeycomb.io/blog/oncall-and-sustainable-software-development
  • https://response.pagerduty.com

Examples of good incident communication to customers

  • https://github.com/danluu/post-mortems (a collection of post-mortems)
  • https://blog.cloudflare.com/cloudflare-incident-on-october-30-2023
  • https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage

Setting up rotations on Slack

  • Using Slack Workflows and Google Sheets for free rotations
Last updated Oct 31, 2025 21:25.