Site Reliability Engineering
What is SRE?
Site Reliability Engineering (SRE) is an approach to managing and operating large-scale, complex software systems. It emerged as a discipline within the field of software engineering to address the growing need for reliable and scalable infrastructure. SRE combines software engineering principles with operational expertise to ensure service reliability, performance, and availability. By treating infrastructure and application configurations as part of the software release cycle, SREs can effectively manage and maintain complex systems.
The need for SRE arose due to the increasing complexity of modern software systems, which often involve distributed architectures, cloud platforms, and rapid deployment cycles. As organizations strive to provide highly available and reliable services, SRE has become instrumental in aligning development and operations teams, fostering collaboration, and establishing resilient systems that can handle the demands of today’s digital landscape.
A Day of an SRE
- Improved Reliability: SRE methodologies significantly enhance the reliability of software systems. By treating reliability as a product trait, SRE teams ensure that systems remain accessible even during unexpected outages. Furthermore, they address performance issues that could discourage customers and reduce revenue.
- Reduced Downtime: Online businesses rely heavily on their websites for revenue, so any operational interruption can damage a brand’s reputation and put sales at risk. SRE reduces these risks by keeping systems running with as little interruption as possible. Proactive monitoring and alerting help SRE teams detect and resolve issues early, before they grow into major outages.
- Increased Scalability: SRE can enhance system scalability. Automated infrastructure handling and deployment enable SRE teams to quickly allocate new resources as demand increases.
- Quick Recovery: SRE also reduces the mean time to recovery (MTTR) after incidents, ensuring fast problem resolution and minimizing business impact.
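As a rough illustration of the MTTR figure mentioned above, the sketch below averages the time between detection and recovery over a set of incidents. The `Incident` record, its field names, and the sample timestamps are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the incident was detected and when service recovered.
    detected_at: datetime
    recovered_at: datetime


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average of (recovered_at - detected_at) across all incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum((i.recovered_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Two made-up incidents lasting 30 and 90 minutes.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 2, 9, 22, 15), datetime(2024, 2, 9, 23, 45)),
]
print(mean_time_to_recovery(incidents))  # 1:00:00
```

Tracking this value per quarter is one simple way to see whether recovery is actually getting faster.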
SRE Basic Concepts
Service level indicators
Service level indicators (SLIs) are a foundational concept in SRE. All the other concepts build on top of SLIs. In the book Site Reliability Engineering: How Google Runs Production Systems, an SLI is succinctly defined as “a service level indicator - a carefully defined quantitative measure of some aspect of the level of service that is provided.”
SLI (%) = (good events / valid events) × 100
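The ratio above can be computed directly from two counters. This is a minimal sketch; the function name `sli_percent` and the request counts are assumptions made for illustration, not values from any real system.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


# Hypothetical numbers: 999,532 of 1,000,000 valid requests succeeded.
availability = sli_percent(good_events=999_532, valid_events=1_000_000)
print(f"{availability:.2f}%")  # 99.95%
```

Which events count as "valid" is a design decision. For example, requests rejected because the client sent malformed input are often excluded from the denominator.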
SLIs for different Systems
System Type | Relevant SLIs | Questions Answered by SLIs |
---|---|---|
User-facing serving systems | Availability, latency, throughput | Could we respond to the request? How long did it take to respond? How many requests could be handled? |
Storage systems | Latency, availability, durability | How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it? |
Big data systems | Throughput, end-to-end latency | How much data is being processed? How long does it take the data to progress from ingestion to completion? |
Service Level Objectives
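Whereas SLIs measure the level of service that users actually receive, service level objectives (SLOs) set the target values those measurements are expected to meet. The table below shows how much downtime each availability target allows, assuming a 365-day year and a 90-day quarter.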
Reliability level | Per year | Per quarter | Per 30 days |
---|---|---|---|
90% | 36.5 days | 9 days | 3 days |
95% | 18.25 days | 4.5 days | 1.5 days |
99% | 3.65 days | 21.6 hours | 7.2 hours |
99.5% | 1.83 days | 10.8 hours | 3.6 hours |
99.9% | 8.76 hours | 2.16 hours | 43.2 minutes |
99.95% | 4.38 hours | 1.08 hours | 21.6 minutes |
99.99% | 52.6 minutes | 12.96 minutes | 4.32 minutes |
99.999% | 5.26 minutes | 1.30 minutes | 25.9 seconds |
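The figures in the table follow from multiplying the window length by the fraction of time the service is allowed to be unavailable. The sketch below reproduces that arithmetic; the function name `allowed_downtime` is an assumption for this example, and the window lengths (365-day year, 90-day quarter, 30 days) are the ones the table's values imply.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime permitted in `window` for an availability target of `slo_percent`."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window * (1 - slo_percent / 100)


# Print the 30-day column for a few of the targets in the table.
for slo in (99.0, 99.9, 99.99):
    per_30_days = allowed_downtime(slo, timedelta(days=30))
    minutes = per_30_days.total_seconds() / 60
    print(f"{slo}% -> {minutes:.2f} minutes per 30 days")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the higher targets are so much more expensive to meet.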
Error Budget
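error budget = maximum service level (100%) − SLO target
For example, a 99.9% SLO leaves an error budget of 0.1%, which corresponds to 43.2 minutes of downtime per 30 days.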
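An error budget can also be expressed in events rather than time. The following is a sketch under the assumption that the maximum service level is 100%; the function names and the request counts are hypothetical.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget as a percentage, taking 100% as the maximum service level."""
    return 100.0 - slo_percent


def remaining_budget_events(slo_percent: float, valid_events: int, bad_events: int) -> float:
    """Bad events still allowed in the window before the SLO is breached."""
    allowed_bad = valid_events * error_budget_percent(slo_percent) / 100
    return allowed_bad - bad_events


# Hypothetical window: 1,000,000 requests, a 99.9% SLO, 400 failures so far.
remaining = remaining_budget_events(99.9, valid_events=1_000_000, bad_events=400)
print(round(remaining))  # 600
```

A negative result means the budget for the window is already overspent.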
SRE Concept Pyramid
General Practices
SRE Principles
- Hire only coders.
- Have an SLA for your service.
- Measure and report performance against the SLA.
- Use Error Budgets and gate launches on them.
- Have a common staffing pool for SRE and Developers.
- Have excess Ops work overflow to the Dev team.
- Cap SRE operational load at 50 percent.
- Share 5 percent of Ops work with the Dev team.
- Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
- Aim for a maximum of two events per oncall shift.
- Do a postmortem for every event.
- Postmortems are blameless and focus on process and technology, not people.
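The principle of gating launches on error budgets can be sketched as a simple check. This is only an illustration: `can_launch` and its rule (block launches once the budget is fully spent) are assumptions, and real policies usually add review steps and exceptions.

```python
def budget_consumed_fraction(slo_percent: float, measured_sli_percent: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    budget = 100.0 - slo_percent
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (100.0 - measured_sli_percent) / budget


def can_launch(slo_percent: float, measured_sli_percent: float) -> bool:
    """Hypothetical gate: allow launches only while some budget remains."""
    return budget_consumed_fraction(slo_percent, measured_sli_percent) < 1.0


print(can_launch(99.9, 99.95))  # True: about half the budget is left
print(can_launch(99.9, 99.80))  # False: the budget is overspent
```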
Terminology
SOW - Scope of Work
SOW of SRE
- Organizational positioning
- Monitoring construction
- Change management
- Exception response
- Stability governance
- Incident review
- Capacity management
- Cost control
- Activity support
Infrastructure Life Cycle
Lifecycle Management
- Configuration
- Startup and shutdown
- Queue draining
- Software upgrades
- Backups and restores
- Redundancy
- Replicated databases
- Hot swaps
- Toggles for individual features
- Graceful degradation
- Access controls and rate limits
- Data import controls
- Monitoring
- Auditing
- Debug instrumentation
- Exception collection
Incident Management
Incidents are inevitable in any system. One of the primary roles of an SRE team is to manage these incidents. When such events occur, the SRE team quickly identifies the problem, determines its origin, and implements solutions.
SRE teams use various tools and methodologies to handle incidents. Monitoring and alerting systems support early detection and allow for prompt responses. Postmortem analyses are then conducted to identify the root causes of incidents and to implement measures that prevent their recurrence.
Incident management often involves the following steps:
- Detection: Using monitoring tools to promptly identify incidents.
- Escalation: If the initial team is unable to resolve an issue, it is escalated to a senior member.
- Diagnosis: Determining the root cause of the incident.
- Mitigation: Taking steps to minimize the impact of the incident.
- Resolution: Achieving a final resolution and implementing measures to prevent future recurrences.
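The steps above can be modeled as an ordered set of stages. The sketch below is a simplification that treats them as a strictly linear sequence, whereas in practice escalation happens only when the initial team cannot resolve the issue. The names `IncidentStage` and `next_stage` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class IncidentStage(IntEnum):
    # Stages in the order listed above.
    DETECTION = 1
    ESCALATION = 2
    DIAGNOSIS = 3
    MITIGATION = 4
    RESOLUTION = 5


def next_stage(stage: IncidentStage) -> Optional[IncidentStage]:
    """Return the stage that follows `stage`, or None once resolved."""
    if stage is IncidentStage.RESOLUTION:
        return None
    return IncidentStage(stage + 1)


# Walk one incident through every stage in order.
stage: Optional[IncidentStage] = IncidentStage.DETECTION
while stage is not None:
    print(stage.name)
    stage = next_stage(stage)
```

Recording a timestamp at each transition gives the raw data needed for metrics such as time to detect and time to recover.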
Roles in a team
Interview
- Linux Questions
- Python Questions
- Cloud Questions
- System Design Rounds
- Incident management Rounds
- Code review Rounds
- Programming Questions and a few programs to practice
- Basic Troubleshooting
- Tools in DevOps
- Tips and a few final words
Reference
- Becoming SRE: First Steps Toward Reliability for You and Your Organization (David N. Blank-Edelman)
- High Performance SRE: Automation, error budgeting, RPAs, SLOs, and SLAs with site reliability engineering (Mishra, Anchal Arora)
- Becoming a Rockstar SRE: Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems (Jeremy Proffitt, Rod Anami)
- Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations (Vladyslav Ukis)
- Scaling Google Cloud Platform: Run Workloads Across Compute, Serverless PaaS, Database, Distributed Computing, and SRE (Swapnil Dubey)
- The Art of Site Reliability Engineering (SRE) with Azure: Building and Deploying Applications That Endure (Unai Huete Beloki)
- Hands-On Guide to AgileOps: A Guide to Implementing Agile, DevOps, and SRE for Cloud Operations (Navin Sabharwal, Raminder Rathore, Udita Agrawal)
- The Site Reliability Workbook: Practical Ways to Implement SRE (Betsy Beyer et al.)
- Seeking SRE: Conversations About Running Production Systems at Scale (David N. Blank-Edelman)
- Chaos Engineering: Site reliability through controlled disruption (Mikolaj Pawlikowski)
- Hands-on Site Reliability Engineering (Shamayel Mohammed Farooqui, Vishnu Vardhan Chikoti)
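- Large-Scale Website Operations: From System Administration to SRE [大型网站运维:从系统管理到SRE] (顾贤杰, 徐赟, 颜中冠)
- Digital Operations: The Digital Transformation of IT Operations Architecture [数字化运维:IT运维架构的数字化转型]
- The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, Chinese revised edition [凤凰项目:一个IT运维的传奇故事(修订版)] (Gene Kim, Kevin Behr, George Spafford)
- Site Reliability Engineering: How Google Runs Production Systems, Chinese edition [SRE:Google运维解密]
- Digital Transformation of Operations: Building a Four-in-One Digital Operations System [运维数字化转型:构建四位一体的数字化运维体系] (彭华盛)
- Operations Architecture for Small and Medium-Sized Banks: Insights and Practice [中小银行运维架构:解密与实战] (李丙洋, 刘正配, 罗丹, 邹天涌, et al.)
- Linux Operations Automation (Shell and Ansible) [Linux自动化运维(Shell与Ansible)] (杨寅冬)
- Exposed: Linux Enterprise Operations in Practice [曝光:Linux企业运维实战] (吴光科)
- Quick Start to Operations Automation with Python [Python自动化运维快速入门] (郑征)
- High-Performance Linux Server Operations in Practice: Shell Programming, Monitoring and Alerting, Performance Tuning, and Case Studies [高性能Linux服务器运维实战:shell编程、监控告警、性能优化与实战案例] (高俊峰)
- The Way of Intelligent Operations: Applied Practice Based on AI Technology [智能运维之道——基于AI技术的应用实践] (钱兵)
- Ansible Operations Automation: Techniques and Best Practices [Ansible自动化运维:技术与最佳实践] (陈金窗, 沈灿, 刘政委)
- Financial-Grade IT Architecture and Operations: Cloud Native, Distributed Systems, and Security [金融级IT架构与运维:云原生、分布式与安全] (魏新宇)
- DevOps and Operations Automation in Practice [DevOps和自动化运维实践] (余洪春)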
- https://www.infracloud.io/blogs/sre-best-practices
- https://github.com/bregman-arie/sre-checklist
- https://sre.google/books/
- https://www.itprotoday.com/it-operations/how-become-site-reliability-engineer-step-step-guide
- SRE in the Real World: https://blog.relyabilit.ie/sre-in-the-real-world/
- https://medium.com/automation-avengers/demystifying-sre-devops-and-platform-engineering-understanding-the-differences-5ef37b5d813
- https://github.com/chowmean/InterviewPreperationForDevOpsAndSRE
- https://semaphoreci.com/blog/site-reliability-engineering
- https://github.com/alibaba/SREWorks
- https://github.com/jumpserver/jumpserver
- https://www.mindmeister.com/app/map/3687575216