Roughly speaking, the paper is about using HWFT techniques to achieve SWFT. It analyzes failure statistics of the Tandem NonStop system and reports the following causes of outage: administration (42%), software (25%), hardware (18%), and environment (14%). It concludes that the key to HA is tolerating operations faults (e.g., by self-configuration) and SW faults. As hardware gets more reliable, SW and system administration become more complex and unreliable. SW patches outnumber HW fixes by orders of magnitude.
Defines availability as MTBF / (MTBF + MTTR). Advocates hierarchical modularity, redundant fail-fast modules, and prompt fault detection for high hardware availability. SWFT needs: hierarchical system decomposition into fail-fast SW modules (for quick fault detection) that communicate via messages (isolation) + process pairs (to tolerate HW and transient SW faults) + transactions (for data and message integrity). Goes over 5 types of process pairs: lockstep, state checkpointing, automatic checkpointing, delta checkpointing, and persistence.
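To make the availability definition concrete, here is a minimal Python sketch. The function name and the sample MTBF/MTTR figures are illustrative assumptions, not numbers from the paper. It shows that halving MTTR buys the same availability as doubling MTBF, which is one way to read the emphasis on fail-fast modules and prompt fault detection.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as defined in the notes above."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    if mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF + MTTR must be positive")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
baseline = availability(mtbf_hours=1000.0, mttr_hours=1.0)
doubled_mtbf = availability(mtbf_hours=2000.0, mttr_hours=1.0)
halved_mttr = availability(mtbf_hours=1000.0, mttr_hours=0.5)

if __name__ == "__main__":
    print(f"baseline:     {baseline:.6f}")
    print(f"doubled MTBF: {doubled_mtbf:.6f}")
    print(f"halved MTTR:  {halved_mttr:.6f}")
```

Both improvements give the same ratio, since 2000 / 2001 equals 1000 / 1000.5.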
Conjectures that faults in production SW are mostly Heisenbugs (transient faults that disappear when the operation is retried); hence availability can be increased through restart + retry. In an experiment, over 99% of bugs were transient. The paper also says communication lines (SW+HW) are the most unreliable part of a distributed computer system.
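The restart + retry idea can be sketched as follows. This is a simplified single-process illustration, not the Tandem process-pair mechanism. `TransientFault`, `retry_on_transient`, and `make_flaky` are hypothetical names introduced here, and the transient fault is simulated deterministically so the example is repeatable.

```python
class TransientFault(Exception):
    """Hypothetical marker for a fault expected to vanish on retry (a Heisenbug)."""


def retry_on_transient(operation, max_attempts=3):
    """Call operation(); retry only when it fails fast with TransientFault.

    Returns (result, attempts_used). Any other exception propagates at once,
    since a deterministic bug would simply fail again on retry.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientFault:
            if attempt == max_attempts:
                raise  # out of retries: report the fault to the caller


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    remaining = failures_before_success

    def operation():
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TransientFault("simulated transient fault")
        return "ok"

    return operation


if __name__ == "__main__":
    result, attempts = retry_on_transient(make_flaky(2), max_attempts=3)
    print(result, attempts)
```

A real system would also reinitialize the failed module's state before retrying; the sketch omits that step and only shows the control flow.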