Fault-Tolerance Under Unix(tm)
A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle
---------------------------------------------------------

The paper describes a backup/restore mechanism that is used to provide fail-over for services within a cluster of machines. They primarily target hardware faults (a machine crashes). Their target services include user processes and system processes (e.g., file servers).

Does _NOT_ assume:
* programs can be modified to work with this system
* we can have unutilized hardware *just* for backup

Assumptions:
* backed-up user processes are deterministic
* real-time recovery is not important

Similar to Harp, all messages to a process are broadcast to the primary process and its backup. In case of a failure of the primary, the backup replays all messages since the last explicit synchronization.

Major technical parts of the paper:
* all communication channels are resilient to failure, and are restored to the backup trivially. This includes open files, network connections, etc.
* synchronization is done in two parts: all dirty pages are sent to the page server (remote VM server), and machine-independent process info is sent to the backup.
* Must synch before handling asynchronous interactions (UNIX signals)
* Crash detection: it's assumed that whole machines crash, not processes! Uses "Check on your neighbor" detection.

Misc Notes:
* many services are constrained to run on certain machines due to the availability of hardware (e.g., printer server)
* uses a special file system to enable easy recovery.

Comments:
---------
I thought this paper was poorly written, with too many details and not enough emphasis on the high-level concepts to pull it all together. Though not explicitly mentioned, this also recovers from software faults. Also, because write messages from a backup during recovery are discarded, you don't have to worry about messages being side-effect free.
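
To make the replay-and-discard point concrete, here is a minimal Python sketch of the idea as these notes describe it. It is not the paper's implementation: the names Counter, Backup, note_write, and demo are hypothetical, and the scheme of counting the primary's writes so that the backup can suppress the same number during replay is an assumption made for illustration. It relies on the process being deterministic, as listed under Assumptions.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Toy deterministic process: adds each input and emits the running total."""
    total: int = 0

    def handle(self, msg: int) -> int:
        self.total += msg
        return self.total


@dataclass
class Backup:
    """Hypothetical backup: holds the last synced state plus a message log."""
    snapshot: Counter = field(default_factory=Counter)
    pending: list = field(default_factory=list)  # inputs since the last sync
    writes_seen: int = 0  # outputs the primary already sent since the last sync

    def deliver(self, msg: int) -> None:
        # Every input to the primary is also queued here.
        self.pending.append(msg)

    def note_write(self) -> None:
        # Assumed bookkeeping: count each output the primary sends.
        self.writes_seen += 1

    def sync(self, state: Counter) -> None:
        # Explicit synchronization: adopt the primary's state, drop the log.
        self.snapshot = copy.deepcopy(state)
        self.pending.clear()
        self.writes_seen = 0

    def recover(self, send) -> Counter:
        # Replay logged inputs against the synced state. Outputs the primary
        # already sent are discarded so the outside world sees them only once.
        state = copy.deepcopy(self.snapshot)
        to_discard = self.writes_seen
        for msg in self.pending:
            out = state.handle(msg)
            if to_discard > 0:
                to_discard -= 1
            else:
                send(out)
        return state


def demo():
    sent = []  # everything the outside world observes
    primary, backup = Counter(), Backup()

    def broadcast(msg, primary_alive=True):
        backup.deliver(msg)
        if primary_alive:
            sent.append(primary.handle(msg))
            backup.note_write()

    broadcast(1)
    broadcast(2)
    backup.sync(primary)
    broadcast(3)                       # handled and answered by the primary
    broadcast(4, primary_alive=False)  # primary crashed before handling this
    recovered = backup.recover(sent.append)
    return sent, recovered.total
```

In demo(), the backup recomputes the reply to message 3 during replay but suppresses it, because the primary had already sent it; only the reply to message 4 goes out. This is why duplicate side effects are not a concern in this model.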
I thought it was interesting that they shield the user process from almost all local state (the time on the local machine, etc).

Discussion:
-----------
* How valid is it today to assume that software failures don't occur? (e.g., the crash detection mechanism here only catches whole machine failures).
* You still need underutilization of the cluster to be able to have a backup. (Harvest/Yield).
* Is the performance degradation reasonable?
* How could this paper have been better written?