Better Practices of Site Reliability Engineering

General Practices

  1. Hire only coders.
  2. Have an SLA for your service.
  3. Measure and report performance against the SLA.
  4. Use Error Budgets and gate launches on them.
  5. Have a common staffing pool for SRE and Developers.
  6. Have excess Ops work overflow to the Dev team.
  7. Cap SRE operational load at 50 percent.
  8. Share 5 percent of Ops work with the Dev team.
  9. Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
  10. Aim for a maximum of two events per oncall shift.
  11. Do a postmortem for every event.
  12. Postmortems are blameless and focus on process and technology, not people.
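
Practice 4 can be illustrated with a small sketch. The function names and the gating rule below (block launches once the budget is fully spent) are assumptions made for illustration; the list itself only says that launches should be gated on the error budget.

```python
def error_budget_remaining(slo_percent: float, good_events: int, valid_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget has been used; 0 or below means it is exhausted.
    Hypothetical helper: the budget is taken to be the share of valid
    events that the SLO allows to be bad.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if valid_events <= 0 or not 0 <= good_events <= valid_events:
        raise ValueError("need 0 <= good_events <= valid_events and valid_events > 0")
    allowed_bad = valid_events * (100 - slo_percent) / 100
    actual_bad = valid_events - good_events
    return 1 - actual_bad / allowed_bad


def launch_allowed(slo_percent: float, good_events: int, valid_events: int) -> bool:
    """Assumed gating rule: allow a launch only while some budget remains."""
    return error_budget_remaining(slo_percent, good_events, valid_events) > 0


if __name__ == "__main__":
    # 500 bad events against roughly 1,000 allowed by a 99.9% SLO.
    print(launch_allowed(99.9, 999_500, 1_000_000))
    # 2,000 bad events against the same allowance.
    print(launch_allowed(99.9, 998_000, 1_000_000))
```

A real policy might use a softer rule, such as slowing launches as the budget runs low, rather than a hard cutoff at zero.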

SLI = (Good events / Valid events) × 100
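
As a minimal sketch, the SLI formula can be computed directly. The function name `sli_percent` and the input checks are assumptions added here, not part of the original notes.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events over valid events, times 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100 * good_events / valid_events


if __name__ == "__main__":
    # Example: 9,990 good requests out of 10,000 valid requests.
    print(sli_percent(9_990, 10_000))
```

What counts as a "good" or "valid" event depends on the service, for example successful requests out of all well-formed requests.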

Reliability level   Per year       Per quarter      Per 30 days
90%                 36.5 days      9 days           3 days
95%                 18.25 days     4.5 days         1.5 days
99%                 3.65 days      21.6 hours       7.2 hours
99.5%               1.83 days      10.8 hours       3.6 hours
99.9%               8.76 hours     2.16 hours       43.2 minutes
99.95%              4.38 hours     1.08 hours       21.6 minutes
99.99%              52.6 minutes   12.96 minutes    4.32 minutes
99.999%             5.26 minutes   1.30 minutes     25.9 seconds
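
The figures in the table above appear to be the downtime allowed in each window at a given reliability level. The sketch below reproduces them, assuming a 365-day year and a 90-day quarter (the window lengths that match the listed values); `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta

# Assumed window lengths in days; these reproduce the values in the table.
WINDOW_DAYS = {"year": 365, "quarter": 90, "30 days": 30}


def allowed_downtime(reliability_percent: float, window_days: int) -> timedelta:
    """Return the downtime permitted in a window at the given reliability level."""
    if not 0 <= reliability_percent <= 100:
        raise ValueError("reliability_percent must be between 0 and 100")
    unavailable_fraction = (100 - reliability_percent) / 100
    return timedelta(days=window_days * unavailable_fraction)


if __name__ == "__main__":
    for level in (90, 99, 99.9, 99.99, 99.999):
        row = {name: allowed_downtime(level, days) for name, days in WINDOW_DAYS.items()}
        print(level, row)
```

For instance, `allowed_downtime(99.9, 30)` comes out at roughly 43.2 minutes, in line with the 99.9% row.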

SOW (Scope of Work)

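  1. Organizational positioning
  2. Monitoring build-out
  3. Change management
  4. Incident response
  5. Stability governance
  6. Incident postmortems
  7. Capacity management
  8. Cost control
  9. Event support (launches, campaigns, and other peak-traffic events)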