
SRE Book List

Site Reliability Engineering

by Betsy Beyer, Chris Jones, Niall Richard Murphy, Jennifer Petoff
Released April 2016
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781491929124

Site Reliability Engineering

https://sre.google/sre-book/table-of-contents/

AI Summary

# Site Reliability Engineering: Book Summary

## Overview

Site Reliability Engineering describes Google's approach to managing large-scale systems and services. The book explains how Google's Site Reliability Engineering (SRE) team combines software engineering and systems engineering to build and maintain scalable, reliable systems.

## Key Concepts

- SRE applies software engineering principles to operations and infrastructure problems
- SRE teams aim to cap operations work at 50% of their time, leaving at least 50% for development work
- SRE focuses on automation over manual operations
- Error budgets are used to balance reliability with innovation
- Monitoring and alerting are fundamental to running reliable services
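
The error-budget idea in the list above reduces to simple arithmetic: whatever the availability target leaves over (1 minus the target) is the budget of allowed failures. The following is a minimal sketch of that arithmetic, not code from the book; the function names and the request-based framing are assumptions made for illustration.

```python
def error_budget_requests(slo_target: float, total_requests: int) -> int:
    """Failed requests permitted over a period, given an availability target.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over one million requests.
    print(error_budget_requests(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A team using this kind of calculation might slow feature releases when the remaining fraction approaches zero and release more freely when most of the budget is intact.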

## Major Topics

### Risk and Reliability

The book discusses how to manage risk and reliability through:

- Setting appropriate availability targets
- Using error budgets to make risk-based decisions
- Implementing monitoring and alerting effectively
- Creating incident response procedures

### Operations

Key operational aspects covered include:

- On-call rotations and incident management
- Effective troubleshooting practices
- Change management and release processes
- Capacity planning

### Engineering

Engineering practices discussed include:

- Building reliable distributed systems
- Load balancing and handling overload
- Data processing pipelines
- Configuration management
- Testing for reliability

## Culture and Processes

The book emphasizes important cultural aspects:

- Blameless postmortem culture
- Focus on automation and reducing toil
- Clear incident management procedures
- Knowledge sharing and documentation

## Key Takeaways

- Reliability is a fundamental feature that requires ongoing engineering effort
- Automation is crucial for managing systems at scale
- Clear processes and culture are as important as technical solutions
- Balance between reliability and innovation is essential
- Learning from incidents through postmortems drives improvement

The Site Reliability Workbook

by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
Released July 2018
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492029502

The Site Reliability Workbook

https://sre.google/workbook/table-of-contents/

AI Summary

# Summary: The Site Reliability Workbook

The Site Reliability Workbook is a practical guide that builds upon Google's first Site Reliability Engineering book. This workbook provides detailed implementation guidance and real-world examples for putting SRE principles into practice.

## Key Topics Covered

- How SRE relates to DevOps, and how the two complement rather than compete with each other
- Implementing Service Level Objectives (SLOs) and error budgets to measure and maintain reliability
- Setting up effective monitoring and alerting systems based on SLOs
- Identifying and eliminating toil through automation and process improvements
- Managing on-call rotations and incident response effectively
- Creating a postmortem culture focused on learning from failures
- Designing reliable systems using Non-Abstract Large System Design (NALSD)
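
SLO-based alerting, mentioned in the list above, is often expressed in terms of a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below is one possible illustration under stated assumptions, not the Workbook's implementation. The function names are hypothetical, and the default threshold of 14.4 is an assumed example value (it corresponds to spending 2% of a 30-day budget in one hour, since 0.02 × 30 × 24 = 14.4).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends exactly the whole budget over the SLO period.
    """
    allowed_error_ratio = 1.0 - slo_target
    if allowed_error_ratio <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed_error_ratio


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window exceed the threshold.

    Requiring the short window as well avoids paging for an incident
    that has already stopped burning budget. Window lengths are left
    to the caller and are not modelled here.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Illustrative values: 99.9% SLO, 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
```

Alerting on burn rate rather than on raw error counts ties each page directly to the risk of missing the SLO, which is the decision-making link the list describes.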

## Real-World Examples

The book includes detailed case studies from Google and from other companies such as Evernote, The Home Depot, and PagerDuty, demonstrating how SRE principles can be adapted to different organizational contexts and scales.

## Key Takeaways

- SRE practices can be implemented successfully at organizations of any size
- Focus on incremental improvements rather than attempting complete transformations
- Use data and SLOs to drive decisions about reliability
- Build a culture of blameless postmortems and continuous learning
- Invest in automation while maintaining a balance between operations and development work

Seeking SRE

by David N. Blank-Edelman
Released September 2018
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781491978863

Seeking SRE

Becoming SRE

by David N. Blank-Edelman
Released March 2024
ISBN: 9781492090557

Becoming SRE

High Performance SRE

by Anchal Arora Mishra
Released February 2024
ISBN: 9789355516718

High Performance SRE

Establishing SRE Foundations

by Vladyslav Ukis
Released September 2022
Publisher(s): Addison-Wesley Professional
ISBN: 9780137424887

Establishing SRE Foundations

Becoming a Rockstar SRE

by Jeremy Proffitt, Rod Anami
Released April 2023
ISBN: 9781803239224

Becoming a Rockstar SRE

Building Secure and Reliable Systems

by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield
Released March 2020
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492083122

Building Secure and Reliable Systems

https://google.github.io/building-secure-and-reliable-systems/raw/toc.html

Implementing Service Level Objectives

by Alex Hidalgo
Released August 2020
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492076766

Implementing Service Level Objectives

The Practice of Cloud System Administration

by Christina Hogan, Strata Chalup, Thomas Limoncelli
Released September 2014
ISBN: 9780321943187

Practice of Cloud System Administration

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

by Gene Kim, Jez Humble, Nicole Forsgren, PhD
Released March 2018
ISBN: 9781942788331

Accelerate

Real-World SRE

by Nat Welch
Released August 2018
ISBN: 9781788628884

Real-World SRE

Systems Performance: Enterprise and the Cloud

by Brendan Gregg
Released November 2013
ISBN: 9780133390094

Systems Performance: Enterprise and the Cloud

97 Things Every SRE Should Know

by Emil Stolarsky, Jaime Woo
Released December 2020
ISBN: 9781492081494

97 Things Every SRE Should Know

Observability Engineering

by Charity Majors, Liz Fong-Jones, George Miranda
Released June 2022
ISBN: 9781492076445

Observability Engineering

Chaos Engineering: System Resiliency in Practice

by Casey Rosenthal, Nora Jones
Released May 2020
ISBN: 9781492043867

Chaos Engineering

Chaos Engineering: Site reliability through controlled disruption

by Mikolaj Pawlikowski
Released January 2021
ISBN: 9781617297755

Chaos Engineering

Database Reliability Engineering

by Laine Campbell, Charity Majors
Released December 2017
ISBN: 9781491925942

Database Reliability Engineering

Site Reliability Engineering (SRE) Handbook

by Stephen Fleming
Released November 2018
ISBN: 9781790150052

Site Reliability Engineering Handbook

DevOps and Site Reliability Engineering (SRE) Handbook

by Stephen Fleming
Released November 2018
ISBN: 9781790238408

DevOps and Site Reliability Engineering Handbook

The Linux Programming Interface

by Michael Kerrisk
Released October 2010
ISBN: 9781593272203

The Linux Programming Interface

Reliable Machine Learning: Applying SRE Principles to ML in Production

by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood
Released September 2022
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098106225

Reliable Machine Learning

The Art of Site Reliability Engineering (SRE) with Azure

by Unai Huete Beloki
Released September 2022
Publisher(s): Apress
ISBN: 9781484287033

The Art of Site Reliability Engineering

Hands-On Guide to AgileOps

by Navin Sabharwal, Raminder Rathore, Udita Agrawal
Released December 2021
Publisher(s): Apress
ISBN: 9781484275054

Hands-On Guide to AgileOps

Hands-on Site Reliability Engineering

by Shamayel Mohammed Farooqui, Vishnu Vardhan Chikoti
Released July 2021
Publisher(s): BPB Publications
ISBN: 9789391030339

Hands-on Site Reliability Engineering
