Fault-Tolerance Under Unix(tm)
A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle
---------------------------------------------------------

The paper describes a backup/restore mechanism that provides fail-over
for services within a cluster of machines. The authors primarily target
hardware faults (a machine crashes). The target services include both
user processes and system processes (e.g., file servers).

Does _NOT_ assume:
	* programs can be modified to work with this system
	* we can have unutilized hardware *just* for backup

Assumptions:
	* backed-up user processes are deterministic
	* real-time recovery is not important

Similar to Harp, all messages to a process are broadcast to both the
primary process and its backup. If the primary fails, the backup replays
all messages received since the last explicit synchronization.
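
The replay idea can be sketched as follows. This is a minimal illustration,
not the paper's implementation: the Counter and Backup classes and their
method names are hypothetical, and the sketch relies on the assumption
(listed above) that the backed-up process is deterministic.

```python
class Counter:
    """A deterministic toy process: its state is a running total."""

    def __init__(self, total=0):
        self.total = total

    def handle(self, msg):
        self.total += msg


class Backup:
    """Hypothetical backup: saves messages received since the last sync."""

    def __init__(self):
        self.checkpoint = 0   # process state as of the last sync
        self.saved = []       # messages delivered since that sync

    def deliver(self, msg):
        # Every message sent to the primary is also delivered here.
        self.saved.append(msg)

    def sync(self, primary_state):
        # After a sync, older messages are no longer needed for recovery.
        self.checkpoint = primary_state
        self.saved.clear()

    def recover(self):
        # Restart from the checkpoint and replay the saved messages in order.
        proc = Counter(self.checkpoint)
        for msg in self.saved:
            proc.handle(msg)
        return proc


primary, backup = Counter(), Backup()
for msg in [1, 2, 3]:
    primary.handle(msg)
    backup.deliver(msg)
backup.sync(primary.total)          # explicit synchronization
for msg in [10, 20]:
    primary.handle(msg)
    backup.deliver(msg)
recovered = backup.recover()        # primary "crashes" here
```

Because the process is deterministic, the recovered copy reaches the same
state the primary had when it failed.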

Major technical parts of the paper:
	* all communication channels are resilient to failure, and are
	  restored to the backup trivially. This includes open files,
	  network connections, etc.
	* synchronization done in two parts: all dirty pages sent to page
	  server (remote VM server), and machine-independent process info
	  is sent to the backup.
	* must synchronize before handling asynchronous interactions (UNIX
	  signals)
	* crash detection: it's assumed that whole machines crash, not
	  processes!  Uses "check on your neighbor" detection.
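
One way to picture "check on your neighbor" detection is the sketch below.
The ring ordering, the function names, and the alive set are assumptions
made for illustration; these notes do not record the topology or protocol
the paper actually uses.

```python
def monitors(machines):
    """Map each machine to the neighbor it checks (assumed ring order)."""
    n = len(machines)
    return {machines[i]: machines[(i + 1) % n] for i in range(n)}


def detect_crashes(machines, alive):
    """Return {crashed_machine: reporter}.

    Only a live machine can report, and it reports only on its own
    neighbor. Whole-machine crashes are the only failure modeled.
    """
    watch = monitors(machines)
    return {
        target: watcher
        for watcher, target in watch.items()
        if watcher in alive and target not in alive
    }


cluster = ["a", "b", "c", "d"]
reports = detect_crashes(cluster, alive={"a", "b", "d"})
```

In this sketch a machine that crashes together with its watcher goes
unreported in a single pass, which is one reason a real protocol needs
more than this.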

Misc Notes:
	* many services are constrained to run on certain machines due to
	  the availability of hardware (e.g., printer server)
	* use a special file system to enable easy recovery.

Comments:
---------

I thought this paper was poorly written, with too many details and not
enough emphasis on the high-level concepts to pull it all together.

Though not explicitly mentioned, this also recovers from software faults.

Also, because write messages issued by a backup during recovery are
discarded, you don't have to worry about whether replayed messages are
free of side effects.
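
A minimal sketch of that discard step, assuming the backup knows how many
writes the primary had already sent since the last synchronization. The
ReplayChannel class and the already_sent count are hypothetical names
used only for illustration.

```python
class ReplayChannel:
    """Hypothetical outgoing channel used by a backup during recovery."""

    def __init__(self, already_sent):
        self.to_discard = already_sent   # writes the primary already sent
        self.delivered = []              # writes that actually go out

    def write(self, msg):
        if self.to_discard > 0:
            # The primary sent this message before it crashed; drop the
            # duplicate so receivers never see it twice.
            self.to_discard -= 1
            return False
        self.delivered.append(msg)
        return True


channel = ReplayChannel(already_sent=2)
for msg in ["w1", "w2", "w3"]:
    channel.write(msg)
```

Only the third write leaves the machine; the first two are regenerated
during replay purely to rebuild the process state.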

I thought it was interesting that they shield the user process from almost
all local state (the time on the local machine, etc).

Discussion:
-----------

* How valid is it today to assume that software failures don't occur?
(e.g., the crash detection mechanism here only catches whole-machine
failures).

* You still need underutilization of the cluster to be able to have a
backup.  (Harvest/Yield).

* Is the performance degradation reasonable?

* How could this paper have been better written?