Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Lorin Hochstein, Staff Software Engineer for Reliability at Airbnb and former Netflix Chaos team member, discusses how real-world software failures provide deeper insights than synthetic fault injection tools like Chaos Monkey. Key themes include the distinction between robustness (designing for known failures) and resilience (handling unknown failures), how adding reliability mechanisms increases system complexity and creates new failure modes ('Lorin's Law'), the importance of blameless post-mortems that assume rational actors, the underappreciated role of organizational complexity in incidents, and why resilience engineering principles haven't spread widely in the industry. He also advocates for storytelling as a mechanism for spreading incident knowledge within organizations.

#distributed-systems

Mar 31 • 52m read time • From infoq.com

Table of contents

Transcript
How Did You Become A Reliability Engineer? [01:52]
The Limits of Chaos Monkey and Fault Injection [03:35]
Real Incidents Provide the Real Learning [06:22]
How do Architects Learn From Real Incidents [06:59]
Advanced Failure Mitigation Can Lead To More Failures [10:38]
Homeostasis and Failures Due to Resource Saturation [12:17]
Risk Mitigation and Tradeoffs [15:29]
Socio-technical Constraints [17:10]
The Build vs. Buy Decision and Organizational Complexity [19:06]
Robustness vs. Resilience [20:53]
We Make the Same Mistakes Over and Over Again [23:10]
The Blameless Culture and Personal Responsibility [26:09]
Lack of Competence Should Show Up in Everyday Work [31:32]
Software Reliability Principles Are Not Widespread [33:26]
The Importance of Storytelling [37:48]
The Architect’s Questionnaire [39:37]
About the Author
