A NonStop Kernel
Bartlett, 1981
Summary by Ed Swierk

Goal: an expandable system for general-purpose transaction processing, capable of running in the presence of a single (hardware) fault in a processor, a bus, or an I/O controller.

Types of hardware failures:
- permanent failures: detected relatively easily, but risk contaminating data before detection
- intermittent failures: harder to detect, so a much higher risk of data contamination
- failures due to external interference (A/C failure, sysadmin error): cause the entire system to crash

Interesting features:
* Lots of hardware redundancy
  - a network of 1-255 nodes, connected by redundant buses
  - each node has 2-16 processors, connected by redundant interprocessor buses
  - I/O devices can have multiple controllers
* No memory shared between processors (for data)
* Extensive use of message passing
  - transparency: the communication mechanism is independent of the location of a process
  - fault tolerance: bus transmission errors can be hidden from the application
  - every request has a positive acknowledgement with a 1-second timeout
  - the OS handles error recovery on message failures; only unrecoverable errors (e.g. the other process does not exist) propagate to the application
  - each processor sends a keepalive to every other processor every 1 second
* Process pairs provide software redundancy
  - each primary process checkpoints its state to a backup process to allow transparent failover

Analysis:
- Few performance numbers, no fault-tolerance numbers
- Designing applications for multiple processors and as process pairs is difficult
- At the time, address space was more of a constraint than computational power for transaction-processing applications, and extra processors were added solely for the additional address space; 32-bit architectures were being developed to solve this problem.

Unanswered questions:
- How did they arrive at the network parameters: retry after 1 second, 3-message transmission window? What effect do these have on performance?
- They claim that correctness of the recovery scheme is more important than efficiency, because the error rate is very low. What if recovery turns out to be so slow that the system grinds to a halt?
- How reliable is the hardware, anyway? Is all this redundancy necessary?

Other notes: http://www.cs.berkeley.edu/~gribble/osprelims/summaries/NonStop.html
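The retry-with-positive-acknowledgement scheme described under "Extensive use of message passing" can be sketched as follows. This is a hypothetical illustration, not the actual Guardian OS interface: the names `send_with_ack`, `BusError`, and the retry cap of 3 (loosely inspired by the 3-message transmission window mentioned above) are all assumptions; only the 1-second acknowledgement timeout comes from the summary.

```python
ACK_TIMEOUT = 1.0   # the paper's 1-second positive-acknowledgement timeout
MAX_RETRIES = 3     # hypothetical retry cap, not specified by the paper

class BusError(Exception):
    """Unrecoverable failure that propagates to the application."""

def send_with_ack(transmit, max_retries=MAX_RETRIES, timeout=ACK_TIMEOUT):
    """Retry a transmission until a positive acknowledgement arrives.

    `transmit(timeout)` returns True when the acknowledgement is received;
    raising TimeoutError models a lost message or lost ack. The OS hides
    transient bus errors from the application by retrying; only persistent
    failure surfaces as BusError.
    """
    for attempt in range(max_retries):
        try:
            if transmit(timeout):
                return attempt + 1  # number of transmissions needed
        except TimeoutError:
            continue  # transient loss: retry transparently
    raise BusError("no acknowledgement after retries")

# Usage: a transmit function whose first two attempts lose the ack.
attempts = {"n": 0}
def flaky(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return True

print(send_with_ack(flaky))  # succeeds on the third transmission: 3
```

The point of the sketch is the division of labor the summary describes: the retry loop lives in the OS layer, so the application sees either a successful send or a single unrecoverable error.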
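The process-pair mechanism can be illustrated with a minimal sketch, again under assumed names (`ProcessPair`, `update`, `fail_primary` are illustrative, not Tandem's API). The invariant it shows is the one the summary states: the primary checkpoints each state change to the backup, so a primary failure loses no acknowledged work.

```python
class ProcessPair:
    """Primary/backup pair: the primary checkpoints its state to the
    backup so the backup can take over transparently on failure."""

    def __init__(self):
        self.primary_state = {}
        self.backup_state = {}

    def update(self, key, value):
        # Apply the change on the primary, then checkpoint it to the
        # backup before acknowledging, so failover never loses
        # acknowledged updates.
        self.primary_state[key] = value
        self.backup_state[key] = value  # checkpoint message to backup

    def fail_primary(self):
        # On primary failure, the backup's checkpointed state becomes
        # the current state; clients keep talking to "the process".
        self.primary_state = dict(self.backup_state)

pair = ProcessPair()
pair.update("balance", 100)
pair.fail_primary()
print(pair.primary_state["balance"])  # 100: state survives the failure
```

A real implementation would ship checkpoints over the interprocessor bus and cope with a checkpoint lost mid-flight; the sketch deliberately ignores that to show only the checkpoint-then-acknowledge ordering, which is what makes the failover transparent.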