Jim Gray, Why Do Computers Stop and What Can Be Done About It?,
Proc. SRDS, Jan. 1986, pp. 3-12

(Summary by George Candea)

Roughly speaking, the paper is about using hardware fault tolerance (HWFT) techniques to achieve software fault tolerance (SWFT). It analyzes failure statistics of the Tandem NonStop system and reports the following causes of outage: administration (42%), software (25%), hardware (18%), and environment (14%). It concludes that the key to high availability (HA) is tolerating operations faults (e.g., by self-configuration) and SW faults. As hardware gets more reliable, SW and system administration become more complex and unreliable. SW patches outnumber HW fixes by orders of magnitude.

Defines availability as MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. Advocates hierarchical modularity, redundant fail-fast modules, and prompt fault detection for high hardware availability. SWFT needs the analogous ingredients: hierarchical decomposition of the system into fail-fast SW modules (for quick fault detection) that communicate via messages (for isolation), process pairs (to tolerate HW and transient SW faults), and transactions (for data and message integrity). Goes over 5 types of process pairs: lockstep, state checkpointing, automatic checkpointing, delta checkpointing, and persistence.
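
As a quick worked example of the availability formula, the sketch below computes MTBF / (MTBF + MTTR) for two cases. The function name and the hour values are illustrative assumptions, not figures from the paper; the point is only that shrinking MTTR raises availability even when MTBF is unchanged.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only for illustration.
base = availability(mtbf_hours=1000.0, mttr_hours=10.0)        # 1000 / 1010
fast_repair = availability(mtbf_hours=1000.0, mttr_hours=1.0)  # 1000 / 1001

print(f"base:        {base:.4%}")
print(f"fast repair: {fast_repair:.4%}")
```

Under these assumed values, cutting repair time by 10x moves availability from roughly 99.0% to roughly 99.9%, the same gain a 10x longer MTBF would give.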

Conjectures that faults in production SW are mostly Heisenbugs, i.e., transient faults that tend to disappear when the operation is retried; hence availability can be increased through restart + retry. In an experiment, over 99% of bugs were transient. Gray also notes that communication lines (SW+HW) are the most unreliable part of a distributed computer system.
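
The restart + retry idea can be sketched in a few lines. This is a minimal single-process illustration, not Tandem's process-pair mechanism: `TransientFault`, `make_flaky`, and `run_with_restart` are hypothetical names, and the "Heisenbug" is simulated by an operation that fails a fixed number of times before succeeding.

```python
class TransientFault(Exception):
    """Stands in for a Heisenbug: a fault that may vanish on retry."""


def make_flaky(failures_before_success: int):
    """Build a simulated operation that fails fast N times, then succeeds."""
    state = {"calls": 0}

    def operation() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            # Fail fast: report the fault at once instead of limping along.
            raise TransientFault(f"simulated fault on call {state['calls']}")
        return "ok"

    return operation


def run_with_restart(operation, max_attempts: int = 3):
    """Retry the operation from scratch on each transient fault.

    Returns (result, attempts_used); re-raises the last fault if every
    attempt fails, since a fault that never clears is not transient.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientFault as err:
            last_error = err
    raise last_error


result, attempts = run_with_restart(make_flaky(failures_before_success=2))
print(result, attempts)
```

If the fault persists through every attempt, the wrapper gives up and re-raises, which is the behavior one would want for a deterministic bug that retry cannot mask.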