Lorin Hochstein, Staff Software Engineer for Reliability at Airbnb and former Netflix Chaos team member, discusses how real-world software failures provide deeper insights than synthetic fault injection tools like Chaos Monkey. Key themes include the distinction between robustness (designing for known failures) and resilience (handling unknown failures), how adding reliability mechanisms increases system complexity and creates new failure modes ('Lorin's Law'), the importance of blameless post-mortems that assume rational actors, the underappreciated role of organizational complexity in incidents, and why resilience engineering principles haven't spread widely in the industry. He also advocates for storytelling as a mechanism for spreading incident knowledge within organizations.
Table of contents
Transcript
How Did You Become A Reliability Engineer? [01:52]
The Limits of Chaos Monkey and Fault Injection [03:35]
Real Incidents Provide the Real Learning [06:22]
How do Architects Learn From Real Incidents [06:59]
Advanced Failure Mitigation Can Lead To More Failures [10:38]
Homeostasis and Failures Due to Resource Saturation [12:17]
Risk Mitigation and Tradeoffs [15:29]
Socio-technical Constraints [17:10]
The Build vs. Buy Decision and Organizational Complexity [19:06]
Robustness vs. Resilience [20:53]
We Make the Same Mistakes Over and Over Again [23:10]
The Blameless Culture and Personal Responsibility [26:09]
Lack of Competence Should Show Up in Everyday Work [31:32]
Software Reliability Principles Are Not Widespread [33:26]
The Importance of Storytelling [37:48]
The Architect’s Questionnaire [39:37]
About the Author