SRE Book List
Site Reliability Engineering
by Betsy Beyer, Chris Jones, Niall Richard Murphy, Jennifer Petoff
Released April 2016
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781491929124
https://sre.google/sre-book/table-of-contents/
AI Summary
# Site Reliability Engineering: Book Summary
## Overview
Site Reliability Engineering describes Google's approach to managing large-scale systems and services. The book explains how Google's Site Reliability Engineering (SRE) team combines software engineering and systems engineering to build and maintain scalable, reliable systems.
## Key Concepts
- SRE applies software engineering principles to operations and infrastructure problems
- SRE teams cap operations work at 50% of their time, leaving at least 50% for development work
- SRE focuses on automation over manual operations
- Error budgets are used to balance reliability with innovation
- Monitoring and alerting are fundamental to running reliable services
## Major Topics
### Risk and Reliability
The book discusses how to manage risk and reliability through:
- Setting appropriate availability targets
- Using error budgets to make risk-based decisions
- Implementing monitoring and alerting effectively
- Creating incident response procedures
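
As a rough illustration of how an availability target translates into an error budget, here is a minimal Python sketch. The function names, the 30-day window, and the choice to measure the budget in minutes of downtime are assumptions made for this example, not definitions taken from the book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target allows.

    `slo` is a fraction such as 0.999; `window_days` is an assumed window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent budget in minutes; a negative value means the
    target was missed for the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # Budget for a 99.9% target over the assumed 30-day window.
    print(round(error_budget_minutes(0.999), 1))
    # Budget left after a hypothetical 30 minutes of downtime.
    print(round(budget_remaining(0.999, downtime_minutes=30.0), 1))
```

Under these assumptions a 99.9% target leaves about 43.2 minutes of budget per 30 days. A team could treat the remaining budget as room for risky changes and slow releases once it is spent.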
### Operations
Key operational aspects covered include:
- On-call rotations and incident management
- Effective troubleshooting practices
- Change management and release processes
- Capacity planning
### Engineering
Engineering practices discussed include:
- Building reliable distributed systems
- Load balancing and handling overload
- Data processing pipelines
- Configuration management
- Testing for reliability
## Culture and Processes
The book emphasizes important cultural aspects:
- Blameless postmortem culture
- Focus on automation and reducing toil
- Clear incident management procedures
- Knowledge sharing and documentation
## Key Takeaways
- Reliability is a fundamental feature that requires ongoing engineering effort
- Automation is crucial for managing systems at scale
- Clear processes and culture are as important as technical solutions
- Balancing reliability against innovation is essential
- Learning from incidents through postmortems drives improvement
The Site Reliability Workbook
by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
Released July 2018
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492029502
https://sre.google/workbook/table-of-contents/
AI Summary
# The Site Reliability Workbook: Book Summary
The Site Reliability Workbook is a practical guide that builds upon Google's first Site Reliability Engineering book. This workbook provides detailed implementation guidance and real-world examples for putting SRE principles into practice.
## Key Topics Covered
- How SRE relates to DevOps and how they complement each other rather than compete
- Implementing Service Level Objectives (SLOs) and error budgets to measure and maintain reliability
- Setting up effective monitoring and alerting systems based on SLOs
- Identifying and eliminating toil through automation and process improvements
- Managing on-call rotations and incident response effectively
- Creating a postmortem culture focused on learning from failures
- Designing reliable systems using Non-Abstract Large System Design (NALSD)
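
One way to picture SLO-based alerting is through a burn rate: the observed error ratio divided by the error ratio the SLO permits. The sketch below is a simplified illustration under that definition; the function names and the paging threshold are hypothetical, and a production setup would typically evaluate more than one time window.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be spent exactly at the end of
    the SLO window; higher values mean it runs out sooner.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    error_ratio = bad_events / total_events
    return error_ratio / (1.0 - slo)


def should_page(rate: float, threshold: float) -> bool:
    """Page when the burn rate meets or exceeds an assumed threshold."""
    return rate >= threshold


if __name__ == "__main__":
    # Hypothetical traffic sample: 50 failed requests out of 10,000.
    rate = burn_rate(bad_events=50, total_events=10_000, slo=0.999)
    print(round(rate, 2), should_page(rate, threshold=2.0))
```

Here the sample error ratio is 0.5% against an allowed 0.1%, so the budget is burning roughly five times faster than the SLO can sustain.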
## Real-World Examples
The book includes detailed case studies from both Google and other companies like Evernote, The Home Depot, and PagerDuty, demonstrating how SRE principles can be adapted for different organizational contexts and scales.
## Key Takeaways
- SRE practices can be implemented successfully at organizations of any size
- Focus on incremental improvements rather than attempting complete transformations
- Use data and SLOs to drive decisions about reliability
- Build a culture of blameless postmortems and continuous learning
- Invest in automation while maintaining a balance between operations and development work
Seeking SRE
by David N. Blank-Edelman
Released September 2018
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781491978863
Becoming SRE
by David N. Blank-Edelman
Released March 2024
ISBN: 9781492090557
High Performance SRE
by Anchal Arora, Mayank Mishra
Released February 2024
ISBN: 9789355516718
Establishing SRE Foundations
by Vladyslav Ukis
Released September 2022
Publisher(s): Addison-Wesley Professional
ISBN: 9780137424887
Becoming a Rockstar SRE
by Jeremy Proffitt, Rod Anami
Released April 2023
ISBN: 9781803239224
Building Secure and Reliable Systems
by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield
Released March 2020
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492083122
https://google.github.io/building-secure-and-reliable-systems/raw/toc.html
Implementing Service Level Objectives
by Alex Hidalgo
Released August 2020
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492076766
The Practice of Cloud System Administration
by Christina Hogan, Strata Chalup, Thomas Limoncelli
Released September 2014
ISBN: 9780321943187
Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations
by Gene Kim, Jez Humble, Nicole Forsgren, PhD
Released March 2018
ISBN: 9781942788331
Real-World SRE
by Nat Welch
Released August 2018
ISBN: 9781788628884
Systems Performance: Enterprise and the Cloud
by Brendan Gregg
Released November 2013
ISBN: 9780133390094
97 Things Every SRE Should Know
by Emil Stolarsky, Jaime Woo
Released December 2020
ISBN: 9781492081494
Observability Engineering
by Charity Majors, Liz Fong-Jones, George Miranda
Released June 2022
ISBN: 9781492076445
Chaos Engineering: System Resiliency in Practice
by Casey Rosenthal, Nora Jones
Released May 2020
ISBN: 9781492043867
Chaos Engineering: Site reliability through controlled disruption
by Mikolaj Pawlikowski
Released January 2021
Publisher(s): Manning Publications
ISBN: 9781617297755
Database Reliability Engineering
by Laine Campbell, Charity Majors
Released December 2017
ISBN: 9781491925942
Site Reliability Engineering (SRE) Handbook
by Stephen Fleming
Released November 2018
ISBN: 9781790150052
DevOps and Site Reliability Engineering (SRE) Handbook
by Stephen Fleming
Released November 2018
ISBN: 9781790238408
The Linux Programming Interface
by Michael Kerrisk
Released October 2010
ISBN: 9781593272203
Reliable Machine Learning: Applying SRE Principles to ML in Production
by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood
Released September 2022
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098106225
The Art of Site Reliability Engineering (SRE) with Azure
by Unai Huete Beloki
Released September 2022
Publisher(s): Apress
ISBN: 9781484287033
Hands-On Guide to AgileOps
by Navin Sabharwal, Raminder Rathore, Udita Agrawal
Released December 2021
Publisher(s): Apress
ISBN: 9781484275054
Hands-on Site Reliability Engineering
by Shamayel Mohammed Farooqui, Vishnu Vardhan Chikoti
Released July 2021
Publisher(s): bpb
ISBN: 9789391030339