Jade Rubick

Reliability

Retrospective Prime Directive

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

https://retrospectivewiki.org/index.php?title=The_Prime_Directive

Rolling out on-call

Some great resources:

  • https://wiki.dbbs.co/view/suggested-reading-for-on-call
    • https://thenewstack.io/advice-management-teams-enrolling-changes-on-call-systems/
    • https://codywilbourn.com/2018/03/22/sustainable-on-call/
    • https://www.gremlin.com/community/tutorials/how-to-establish-a-high-severity-incident-management-program/
  • https://increment.com/on-call/
  • https://www.honeycomb.io/blog/oncall-and-sustainable-software-development
  • https://response.pagerduty.com

Examples of good incident communication to customers

  • https://github.com/danluu/post-mortems (a collection of post-mortems)
  • https://blog.cloudflare.com/cloudflare-incident-on-october-30-2023
  • https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage

Setting up rotations on Slack

  • Using Slack Workflows and Google Sheets for free rotations
Last updated Jul 11, 2025 20:52.