
Terminology and glossary of Site Reliability Engineering


Site Reliability Engineering (SRE) is a discipline that applies software engineering to operations in order to build and run large-scale, distributed systems, with an emphasis on reliability, scalability, and maintainability. As with any field, SRE has its own terminology, acronyms, and jargon that can confuse newcomers. This blog post covers some of the most common terms and concepts used in SRE.

Glossary Summary

As SRE grows, practitioners need a common vocabulary for the key concepts of the discipline. This glossary gives an overview of some of the most important SRE terms, including availability, error budget, and incident response. Knowing these concepts helps SRE teams understand the challenges in their work and develop more effective strategies for improving the reliability and availability of their systems. New terms will emerge as the field evolves, so practitioners should keep up with the latest developments.

  • SLA
  • SLO
  • SLI
  • RTO
  • RPO
  • MTTR
  • MTTF
  • MTBF
  • WAF
  • Shared responsibility model
  • CI/CD
  • Continuous Delivery
  • Continuous Deployment
  • Agile
  • Scrum
  • Kanban
  • IaC
  • Performance testing
  • Load testing
  • Stress testing
  • Fault injection testing
  • Chaos Engineering
  • Disaster recovery testing
  • Multiregion
  • Zone redundancy
  • Availability zone
  • Endpoint monitoring
  • Throttling
  • Circuit breaker pattern
  • Idempotent task
  • Avoid affinity
  • Vertical scaling (up/down)
  • Horizontal scaling (out/in)
  • Autoscaling
  • Application Gateway
  • Load Balancer
  • Hot standby
  • Cold standby
  • Active-active
  • Push model
  • Pull model
  • DSC
  • Pipeline
  • Artifacts
  • Blue-Green Deployment
  • Canary Deployments
  • Ring-based deployments
  • A/B Testing
  • Key Vault
  • Observability
  • Monitoring
  • Error budget



SLA, SLO, and SLI

A Service Level Agreement (SLA) is a contract between a service provider and its customers that defines the level of service that will be provided. SLAs typically specify the availability, response time, and other key metrics that the service provider must meet, along with the consequences, such as service credits, if those commitments are missed.

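A Service Level Objective (SLO) is a target value for a service level, measured by an SLI, that the service provider sets for itself. SLOs are usually set stricter than the commitments in the SLA, so that the team has room to react before a contractual breach occurs. SLOs are used to measure the performance of the service and to ensure that the provider meets its commitments to customers. The SLO should be set at a level that is achievable and meaningful to the customer.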

A Service Level Indicator (SLI) is a quantitative measurement of one aspect of a service's performance, such as availability, error rate, or request latency. SLIs are used to track the performance of the service and to identify areas where improvements can be made. The SLI should be a metric that is meaningful to the customer and that can be tracked over time.
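
To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests); the function names, the request counts, and the 99.9% target are illustrative assumptions, not values from any particular service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Request-based availability SLI: the fraction of requests served successfully."""
    if total_requests < 0 or good_requests < 0 or good_requests > total_requests:
        raise ValueError("expected 0 <= good_requests <= total_requests")
    if total_requests == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Hypothetical numbers: 999,500 good requests out of 1,000,000.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The same pattern applies to other SLIs, for example the fraction of requests answered under a latency threshold; only the definition of a "good" request changes.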

Error Budget and Toil

An error budget is the amount of unreliability, for example downtime or failed requests, that a service can accumulate over a given period without violating its SLO. It is the complement of the SLO: a 99.9% availability target leaves a budget of 0.1%. The error budget is a key concept in SRE because it provides a way to balance reliability and innovation. If the error budget is exhausted, the team should focus on improving reliability rather than adding new features until the budget recovers.
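
The arithmetic behind an error budget can be sketched as follows. This example assumes a time-based budget over a fixed window (here 30 days, an illustrative choice); the function names and the downtime figure are hypothetical.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed unavailability in the window: (1 - SLO) * window length."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after observed downtime; a negative result means the SLO is violated."""
    return error_budget(slo_target, window) - downtime


window = timedelta(days=30)
budget = error_budget(0.999, window)  # roughly 43.2 minutes for 99.9% over 30 days
left = budget_remaining(0.999, window, downtime=timedelta(minutes=30))
print(f"budget: {budget}, remaining: {left}")
```

A team could gate risky releases on the remaining budget: while it is positive, feature work continues; once it reaches zero, reliability work takes priority.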

Toil refers to repetitive, manual work that is necessary to keep a service running but produces no lasting improvement. Toil is a problem in SRE because it takes time away from more valuable work, such as improving the reliability of the service. SRE teams should strive to minimize toil by automating repetitive tasks and by eliminating unnecessary work.

Incident Management and Postmortem

Incident management is the process of responding to incidents and outages. The goal of incident management is to minimize the impact of the incident on customers and to restore service as quickly as possible. It typically involves detecting the incident, mitigating its impact, identifying the cause, and communicating with customers.
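
How quickly service is restored is often tracked with MTTR, one of the acronyms in the glossary list. MTTR has several expansions in practice; the sketch below assumes "mean time to restore", measured from detection to restoration. The incident timestamps and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) over a list of (detected, restored) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = []
    for detected, restored in incidents:
        if restored < detected:
            raise ValueError("an incident cannot be restored before it is detected")
        durations.append(restored - detected)
    # sum() needs a timedelta start value because the default start is the int 0.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: one 30-minute and one 90-minute outage.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 12, 22, 15), datetime(2024, 2, 12, 23, 45)),
]
print(f"MTTR: {mean_time_to_restore(incidents)}")
```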

A postmortem is the process of analyzing incidents and outages to identify their root causes and to prevent them from happening again. Postmortems are a key part of SRE because they provide a way to learn from incidents and to improve the reliability of the service. Postmortems should be conducted in a blameless manner and should focus on identifying ways to improve the system.

  1. Licensed under CC BY-NC 4.0
  2. Copyright issue feedback, replace # with @
  3. Not all commands and scripts have been tested in a production environment; use them at your own risk
  4. No personal information is collected
  5. Partial content rewritten by AI, verified by humans