SRE Learning Materials

Introduction

Site Reliability Engineering (SRE) is a relatively new discipline that has been gaining popularity. SRE teams are responsible for the reliability, performance, and efficiency of complex production systems. Becoming an SRE requires a solid grounding in computer science fundamentals and a working knowledge of distributed systems, networking, and cloud infrastructure. This post covers the learning materials available at the School of SRE, along with related resources, that can help you build those skills.

Online Courses

The School of SRE offers several self-paced online courses covering a wide range of SRE topics. They combine lectures, hands-on labs, and quizzes to reinforce what you learn. Available courses include “Introduction to SRE”, “Distributed Systems”, “Cloud Infrastructure”, “Networking”, and “Monitoring and Alerting.” These courses suit readers who prefer a structured learning experience and want to earn certificates upon completion. The links below collect the School of SRE together with related checklists, books, and articles.

  • https://linkedin.github.io/school-of-sre/
  • https://github.com/bregman-arie/sre-checklist
  • https://github.com/upgundecha/howtheysre
  • https://github.com/bregman-arie/devops-exercises
  • https://sre.google/books/
  • https://docs.microsoft.com/en-us/azure/site-reliability-engineering/resources/books
  • https://www.oreilly.com/library/view/seeking-sre/9781491978856/
  • https://opensource.com/article/18/10/sre-startup
  • https://stackpulse.com/blog/site-reliability-engineering-sre-what-why-and-5-best-practices/
  • https://www.usenix.org/blog/what-is-sre-how-does-it-relate-to-devops-lisa18
  • https://www.bmc.com/blogs/sre-vs-devops/
  • https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows
  • https://www.atlassian.com/incident-management/kpis/error-budget
  • https://devopsinstitute.com/choosing-the-right-service-level-indicators/
  • https://www.observability.splunk.com/en_us/infrastructure-monitoring/guide-to-sre-and-the-four-golden-signals-of-monitoring.html
  • https://www.enov8.com/blog/site-reliability-engineering-sre-top-10-best-practice/
  • https://www.blameless.com/blog/5-best-practices-nailing-postmortems
  • https://learnxinyminutes.com/

Books & Courses

  • (Book) Site Reliability Engineering - https://landing.google.com/sre/book/index.html
  • (Book) Site Reliability Workbook - https://landing.google.com/sre/workbook/toc/
  • (Book) Building Secure and Reliable Systems - https://landing.google.com/sre/resources/foundationsandprinciples/srs-book/
  • (Course) Intro to DevOps - https://www.udacity.com/course/intro-to-devops--ud611
  • (Course) Google Cloud Platform for Systems Operations - https://www.coursera.org/specializations/gcp-sysops
  • (Course) Measuring and Managing Reliability - https://www.coursera.org/learn/site-reliability-engineering-slos

Operating Systems

  • (Course) Introduction to Operating Systems - https://www.udacity.com/course/introduction-to-operating-systems--ud923
  • (Course) Advanced Operating Systems - https://www.udacity.com/course/advanced-operating-systems--ud189

Automation

  • (Tutorial) Ansible - https://www.digitalocean.com/community/tutorials/configuration-management-101-writing-ansible-playbooks
  • (Course) Terraform - https://www.udemy.com/course/learn-devops-infrastructure-automation-with-terraform/

Distributed Systems

  • (Tutorial) Introduction to Distributed Systems Design - http://www.hpcs.cs.tsukuba.ac.jp/~tatebe/lecture/h23/dsys/dsd-tutorial.html

Networking

  • (Book) Understanding Linux Network Internals - http://shop.oreilly.com/product/9780596002558.do

Programming Languages

Python

  • (Book) Learn Python 3 The Hard Way - https://learnpythonthehardway.org/python3/
  • (Course) Developing Scalable Apps in Python - https://www.udacity.com/course/developing-scalable-apps-in-python--ud858

Go

  • (Book) The Go Programming Language - https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440
  • (Webinar) Go Language for Ops and Site Reliability Engineering - https://www.youtube.com/watch?v=Q_H4hrUez80
  • (Hands On) GopherLabs - https://gopherlabs.kubedaily.com/

Production Web App

  • (Tutorial) Building for Production: Web Applications - https://www.digitalocean.com/community/tutorial_series/building-for-production-web-applications
  • (Book) Production Ready Microservices - https://www.amazon.com/gp/product/1491965975/

Monitoring and Logging

  • (Course) Monitoring and Alerting with Prometheus - https://www.udemy.com/course/monitoring-and-alerting-with-prometheus/
  • (Book) Prometheus: Up & Running - https://www.amazon.com/Prometheus-Infrastructure-Application-Performance-Monitoring/dp/1492034142

Continuous Integration | Continuous Delivery

  • (Course) Learn DevOps: Continuously Deliver Better Software - https://www.udemy.com/course/learn-devops-continuously-deliver-better-software/

Containers

  • (Course) Docker for DevOps - https://www.udemy.com/course/docker-tutorial-for-devops-run-docker-containers/

Web Servers

Nginx

  • (Course) Nginx Fundamentals - https://www.udemy.com/course/nginx-fundamentals/

Cluster Management

Kubernetes

  • (Tutorial) Kubernetes Bootcamp - https://kubernetes.io/docs/tutorials/kubernetes-basics/
  • (Course) Scalable Microservices with Kubernetes - https://www.udacity.com/course/scalable-microservices-with-kubernetes--ud615
  • (Tutorial) Kubernetes Tutorial for Beginners - https://spacelift.io/blog/kubernetes-tutorial

Cloud

Amazon AWS

  • (Tutorial) Amazon AWS - https://aws.amazon.com/getting-started/tutorials/

Post-Mortem

  • Post-Mortem Template - https://sre.google/sre-book/example-postmortem/

Websites

  • https://highscalability.com
  • https://sreweekly.com
  • https://sre.news

DevOps | SRE Roadmap

  • DevOps Roadmap - https://roadmap.sh/devops

SRE Interview

  • https://github.com/michaelkkehoe/sre-interview