A NonStop Kernel
Bartlett, 1981
Summary by Ed Swierk

Goal: an expandable system for general-purpose transaction processing, capable of running in the presence of a single (hardware) fault in a processor, a bus, or an I/O controller.

Types of hardware failures:
- permanent failures: detected relatively easily, but risk contaminating data before detection
- intermittent failures: harder to detect, so a much higher risk of data contamination
- failures due to external interference (A/C failure, sysadmin error): cause the entire system to crash

Interesting features:
* Lots of hardware redundancy
  - a network of 1-255 nodes, connected by redundant buses
  - each node has 2-16 processors, connected by redundant interprocessor buses
  - I/O devices can have multiple controllers
* No memory shared between processors (for data)
* Extensive use of message passing
  - transparency: the communication mechanism is independent of the location of a process
  - fault tolerance: bus transmission errors can be hidden from the application
  - every request has a positive acknowledgement with a 1-second timeout
  - the OS handles error recovery on message failures; only unrecoverable errors (e.g. the other process does not exist) propagate to the application
  - each processor sends a keepalive to every other processor every 1 second
* Process pairs provide software redundancy
  - each primary process checkpoints its state to a backup process to allow transparent failover

Analysis:
- Few performance numbers, no fault-tolerance numbers
- Designing applications for multiple processors and as process pairs is difficult
- At the time, address space was more of a constraint than computational power for transaction-processing applications, and extra processors were added solely for the additional address space; 32-bit architectures were being developed to solve this problem.

Unanswered questions:
- How did they arrive at the network parameters: retry after 1 second, 3-message transmission window? What effect do these have on performance?
- They claim that correctness of the recovery scheme is more important than efficiency, because the error rate is very low. What if recovery turns out to be so slow that the system grinds to a halt?
- How reliable is the hardware, anyway? Is all this redundancy necessary?

Other notes: http://www.cs.berkeley.edu/~gribble/osprelims/summaries/NonStop.html
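The retry-with-positive-acknowledgement scheme described under "Extensive use of message passing" can be sketched as follows. This is a hypothetical illustration, not the actual Guardian OS interface: the names `send_with_ack`, `BusError`, and the retry cap of 3 (loosely inspired by the 3-message transmission window mentioned above) are all assumptions; only the 1-second acknowledgement timeout comes from the summary.

```python
ACK_TIMEOUT = 1.0   # the paper's 1-second positive-acknowledgement timeout
MAX_RETRIES = 3     # hypothetical retry cap, not specified by the paper

class BusError(Exception):
    """Unrecoverable failure that propagates to the application."""

def send_with_ack(transmit, max_retries=MAX_RETRIES, timeout=ACK_TIMEOUT):
    """Retry a transmission until a positive acknowledgement arrives.

    `transmit(timeout)` returns True when the acknowledgement is received;
    raising TimeoutError models a lost message or lost ack. The OS hides
    transient bus errors from the application by retrying; only persistent
    failure surfaces as BusError.
    """
    for attempt in range(max_retries):
        try:
            if transmit(timeout):
                return attempt + 1  # number of transmissions needed
        except TimeoutError:
            continue  # transient loss: retry transparently
    raise BusError("no acknowledgement after retries")

# Usage: a transmit function whose first two attempts lose the ack.
attempts = {"n": 0}
def flaky(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return True

print(send_with_ack(flaky))  # succeeds on the third transmission: 3
```

The point of the sketch is the division of labor the summary describes: the retry loop lives in the OS layer, so the application sees either a successful send or a single unrecoverable error.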
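The process-pair mechanism can be illustrated with a minimal sketch, again under assumed names (`ProcessPair`, `update`, `fail_primary` are illustrative, not Tandem's API). The invariant it shows is the one the summary states: the primary checkpoints each state change to the backup, so a primary failure loses no acknowledged work.

```python
class ProcessPair:
    """Primary/backup pair: the primary checkpoints its state to the
    backup so the backup can take over transparently on failure."""

    def __init__(self):
        self.primary_state = {}
        self.backup_state = {}

    def update(self, key, value):
        # Apply the change on the primary, then checkpoint it to the
        # backup before acknowledging, so failover never loses
        # acknowledged updates.
        self.primary_state[key] = value
        self.backup_state[key] = value  # checkpoint message to backup

    def fail_primary(self):
        # On primary failure, the backup's checkpointed state becomes
        # the current state; clients keep talking to "the process".
        self.primary_state = dict(self.backup_state)

pair = ProcessPair()
pair.update("balance", 100)
pair.fail_primary()
print(pair.primary_state["balance"])  # 100: state survives the failure
```

A real implementation would ship checkpoints over the interprocessor bus and cope with a checkpoint lost mid-flight; the sketch deliberately ignores that to show only the checkpoint-then-acknowledge ordering, which is what makes the failover transparent.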