Chaos Scenario Checklist

Types of Failures


Software Failures

  • Software Crashes
      1. Automated Restarts and Escalation
      2. Automated Crash Data Collection and Analysis
  • Software Hangs
  • Query of Death

Physical Failures

  • Parts and Components
      1. RAM
      2. Disks
      3. Power Supplies
      4. Network Interfaces
      5. Machines
      6. Load Balancers
      7. Racks
      8. Datacenters

Overload Failures

  • Traffic Surges
      1. Dynamic Resource Allocation
      2. Load Shedding
  • DoS and DDoS Attacks
  • Scraping Attacks

Human Error

Other Failures

  • Halting failures: A component simply stops. Because it gives no notification, the failure can only be detected by timeout: the component either stops sending “I’m alive” (heartbeat) messages or fails to respond to requests. A computer freezing is a halting failure.
  • Fail-stop failures: A halting failure accompanied by some kind of notification to other components. A network file server telling its clients that it is about to go down is a fail-stop failure.
  • Omission failures: A message fails to be sent or received, primarily because of a lack of buffer space, so it is discarded with no notification to either the sender or the receiver. This can happen when routers become overloaded.
  • Network failures: A network link breaks.
  • Network partition failures: A network fragments into two or more disjoint sub-networks. Messages can be sent within each sub-network, but messages between sub-networks are lost. This can occur as a result of a network failure.
  • Timing failures: A temporal property of the system is violated. Examples include clocks on different computers, used to coordinate processes, drifting out of synchronization, and a message being delayed longer than a threshold period.
  • Byzantine failures: A component keeps running but behaves arbitrarily or incorrectly. This covers several types of faulty behavior, including data corruption or loss and failures caused by malicious programs.
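
The timeout-based detection described under halting failures can be sketched roughly as follows. This is a minimal illustration, not a production failure detector: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the component names are all hypothetical, and the clock is injectable only so the example can run without sleeping.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Suspects a component has halted when its heartbeats stop arriving.

    Hypothetical sketch: names and the timeout value are illustrative only.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s  # assumed threshold; tune per system
        self.clock = clock          # injectable so demos/tests need not sleep
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, component: str) -> None:
        """Record an "I'm alive" message from a component."""
        self.last_seen[component] = self.clock()

    def suspected(self) -> List[str]:
        """Return components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock: "db" goes silent while "cache" keeps reporting.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.heartbeat("db")
monitor.heartbeat("cache")
fake_now[0] = 2.0
monitor.heartbeat("cache")
fake_now[0] = 4.0
print(monitor.suspected())  # ['db']
```

A timeout alone cannot tell a halted component from one that is merely slow or unreachable, so a chaos scenario built on this idea would likely need to exercise both cases.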


  • The Practice of Cloud System Administration