SRE: Types of Failures

Software Failures
- Software Crashes
  1. Automated Restarts and Escalation
  2. Automated Crash Data Collection and Analysis
- Software Hangs
- Query of Death
Physical Failures
- Parts and Components
  1. RAM
  2. Disks
  3. Power Supplies
  4. Network Interfaces
  5. Machines
  6. Load Balancers
  7. Racks
  8. Datacenters
Overload Failures
- Traffic Surges
  1. Dynamic Resource Allocation
  2. Load Shedding
- DoS and DDoS Attacks
- Scraping Attacks
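
Load shedding, listed above as a response to traffic surges, means deliberately rejecting some requests so that the rest can still be served. The following is a minimal sketch, assuming a fixed cap on concurrent requests; the `LoadShedder` class, its method names, and the HTTP-style status codes are hypothetical and chosen only for illustration. It is not thread-safe as written.

```python
class LoadShedder:
    """Rejects new requests once the number in flight reaches a fixed limit.

    Hypothetical illustration: a real service would also need locking
    and would usually derive the limit from measured capacity.
    """

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        # Shed the request if the service is already at capacity.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        # Called when a request finishes, freeing one slot.
        if self.in_flight > 0:
            self.in_flight -= 1


def handle(shedder, do_work):
    """Runs do_work() if capacity allows; otherwise reports overload."""
    if not shedder.try_acquire():
        return 503  # "service unavailable": the request was shed
    try:
        do_work()
        return 200
    finally:
        shedder.release()
```

With `LoadShedder(2)`, two calls to `try_acquire()` succeed and a third is refused until one of the earlier requests calls `release()`.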
Human Error
Other Failures
- Halting failures: A component simply stops. Other components can detect the failure only by timeout, because the stopped component either stops sending “I’m alive” (heartbeat) messages or fails to respond to requests. A computer that freezes is an example of a halting failure.
- Fail-stop failures: A halting failure accompanied by some kind of notification to other components. A network file server telling its clients that it is about to go down is a fail-stop failure.
- Omission failures: A message fails to be sent or received, primarily because of a lack of buffer space, and is discarded without notifying either the sender or the receiver. This can happen when routers become overloaded.
- Network failures: A network link breaks.
- Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
- Timing failures: A temporal property of the system is violated. Examples include clocks on different computers that are used to coordinate processes drifting out of sync, and a message being delayed longer than a threshold period.
- Byzantine failures: A component keeps running but behaves arbitrarily or incorrectly. This category covers several types of faulty behavior, including data corruption or loss and failures caused by malicious programs.
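
The timeout-based detection described for halting failures can be sketched as follows. This is a minimal illustration, assuming each component periodically reports a heartbeat to a monitor; the `HeartbeatMonitor` class, the component names, and the 5-second timeout are all hypothetical. The clock is injectable so that the example runs without sleeping.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and flags silent ones."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable time source (assumption for testing)
        self.last_seen = {}

    def record_heartbeat(self, component):
        # Remember when this component last said "I'm alive".
        self.last_seen[component] = self.clock()

    def suspected_failures(self):
        # A component is only *suspected* to have halted: a timeout cannot
        # distinguish a stopped component from a slow one or a lost message.
        now = self.clock()
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Usage with a fake clock so no real waiting is needed.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.record_heartbeat("web-1")
monitor.record_heartbeat("web-2")
fake_now[0] = 4.0
monitor.record_heartbeat("web-2")  # web-2 is still alive
fake_now[0] = 7.0
print(monitor.suspected_failures())  # ['web-1']
```

A fail-stop component would not need this mechanism, because it notifies its peers before going down; the monitor matters for components that stop silently.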
References
- Security Chaos Engineering: Sustaining Resilience in Software and Systems, Kelly Shortridge, Aaron Rinehart
- Chaos Engineering: Site reliability through controlled disruption, Mikolaj Pawlikowski
- Learning Chaos Engineering: Discovering and Overcoming System Weaknesses Through Experimentation, Russ Miles
- The DevOps Toolkit: Kubernetes Chaos Engineering, Viktor Farcic, Darin Pope
- The Practice of Cloud System Administration
- https://github.com/chaos-mesh/chaos-mesh
- https://netflix.github.io/chaosmonkey
- https://github.com/chaosblade-io/chaosblade
- https://www.cncf.io/blog/2023/07/19/building-resilience-with-chaos-engineering-and-litmus/