Getting Good at System Failure Analysis


Delivered at DevOpsDays Chicago in September 2017.


Every failure is a mystery to be solved. Solving those mysteries is a skill that can be honed. Let’s talk about how to get better at figuring out what’s up when things go wrong! This is a talk full of both high-level advice and concrete tips from somebody who loves fixing weird production issues.

What does it mean to be good at debugging production issues? That’s the question we’ll explore in this talk! I’ll be sharing a grab bag of the postures, practices, tips, and tricks I’ve learned from years hanging out near production.

Production systems are not always designed for operability, and yet we still need to fix them. My goal, then, is to share techniques that apply across a range of operational maturity levels. The talk breaks down into a few sections:

  • Adopting a productive attitude towards failures
  • Learning to love logs, wherever you may find them
  • Guerrilla systems thinking and domain modeling
  • Code reading for failure analysis
  • Collaborating to remediate and solve production issues

Production failure analysis has been one of the most rewarding skills that I’ve built up in my career. I hope that after this talk you’ll walk away with a few tools and, more importantly, be inspired to get better at responding to failures.