SRE: Types of Failures

Software Failures

Software Crashes
1. Automated Restarts and Escalation
2. Automated Crash Data Collection and Analysis
Software Hangs
Query of Death
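
The "Automated Restarts and Escalation" item above can be illustrated with a minimal sketch. It assumes a hypothetical `supervise` helper, a `max_restarts` budget, and an `escalate` callback that stands in for paging a human; none of these names come from the source, and a real supervisor would also add backoff between restarts.

```python
def supervise(start_process, max_restarts, escalate):
    """Run start_process, restarting it after a crash.

    Once the crash count exceeds max_restarts, stop restarting and hand
    the problem to a human via the (hypothetical) escalate callback.
    """
    crashes = 0
    while True:
        try:
            return start_process()
        except Exception as error:  # treat any exception as a crash
            crashes += 1
            if crashes > max_restarts:
                escalate(f"giving up after {crashes} crashes: {error}")
                return None


# A process that crashes twice and then runs to completion.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("crash")
    return "ok"

def always_crashes():
    raise RuntimeError("boom")

alerts = []
result = supervise(flaky, max_restarts=3, escalate=alerts.append)
failed = supervise(always_crashes, max_restarts=2, escalate=alerts.append)
```

Here the flaky process recovers within its restart budget, so nothing is escalated, while the process that always crashes exhausts its budget and produces one alert.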

Parts and Components
1. RAM
2. Disks
3. Power Supplies
4. Network Interfaces
5. Machines
6. Load Balancers
7. Racks
8. Datacenters

Halting failures: A component simply stops. The failure can be detected only by timeout, because the component either stops sending “I’m alive” (heartbeat) messages or fails to respond to requests. A computer freezing is a halting failure.
Fail-stop failures: A halting failure accompanied by a notification to other components. A network file server telling its clients that it is about to go down is a fail-stop failure.
Omission failures: A message fails to be sent or received, primarily because of a lack of buffer space, so it is discarded with no notification to either the sender or the receiver. This can happen when routers become overloaded.
Network failures: A network link breaks.
Network partition failures: A network fragments into two or more disjoint sub-networks. Messages can be sent within each sub-network, but messages between sub-networks are lost. A network failure can cause this.
Timing failures: A temporal property of the system is violated. For example, the clocks on different computers that coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
Byzantine failures: This category captures several types of faulty behavior, including data corruption or loss and failures caused by malicious programs.
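
The timeout-based detection described under halting failures can be sketched as a small heartbeat monitor. This is an illustration only: `HeartbeatMonitor`, `FakeClock`, the component names, and the 5-second timeout are assumptions made for the example, not part of the source.

```python
import time


class HeartbeatMonitor:
    """Records the last heartbeat per component and flags silent ones."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the demo needs no real waiting
        self.last_seen = {}

    def record_heartbeat(self, component):
        self.last_seen[component] = self.clock()

    def suspected_halted(self):
        """Return components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


class FakeClock:
    """Hypothetical manual clock used only to drive the demonstration."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=clock)
monitor.record_heartbeat("db")
monitor.record_heartbeat("cache")
clock.advance(4.0)
monitor.record_heartbeat("db")  # "db" keeps reporting; "cache" goes silent
clock.advance(3.0)
halted = monitor.suspected_halted()
```

A monitor like this can only suspect a halt. A slow network or an overloaded component looks the same as a stopped one, which is why a fail-stop notification, when available, is the more reliable signal.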

