Hive: Fault Containment in Shared-Memory Multiprocessors
=========================================================
Chapin et al., SOSP '95

Big Picture
-----------
How should we design an operating system for large-scale SMPs so that
faults in one small part of the system do not bring the entire system
down?

Motivation
----------
In earlier days, large-scale SMP machines were built to provide more
FLOPS/MIPS and make number-crunching parallel programs go faster. One
program employed all the resources of the machine, and it was
acceptable if a fault crashed the entire system: checkpointing
sufficed.

Of late, SMPs are being used for general-purpose computing, especially
high-end servers. In this new scenario, multiple largely independent
processes use the system, and it is not acceptable for a fault in one
part of the system to bring the entire system down. This calls for the
OS to provide "fault containment".

Challenge
---------
Fault containment in distributed shared-nothing architectures is
relatively easy. Hive's challenge lies in the fact that the whole
system must present the traditional "single system" image, with
flexible sharing of processors, memory and other resources. Of great
concern is the "wild writes" problem, where a faulty node can corrupt
other nodes' memory.

Solutions
---------
Hive employs both hardware and software mechanisms for fault
containment. The authors were fortunate in that their hardware
platform, FLASH, was being designed in conjunction with their OS.

Hive is structured as a set of cells. Each cell is assigned a fixed
set of processors (along with their memory and I/O devices) and
manages those resources essentially as a (semi-)independent
multiprocessor OS. A cell is the smallest unit of the system that is
lost in case of a hardware/software failure. However, individual cells
are not visible to the end user; the cells cooperate to provide the
required single-system view.

Different cells communicate via RPC. The kernels never "write"
directly into each other's memory; remote "reads" are allowed.
However, at user level, for efficiency reasons, it is desirable that
processors be able to both "read" and "write" remote memory.

Hive withstands operating-system failures as follows:
- Corrupt messages: RPC sanity checks suffice.
- Remote reads: it is the reading cell's responsibility to defend
  itself against reading corrupt memory. A "careful reference"
  protocol (an elaborate procedure for reading memory and performing
  consistency checks on the values read) is employed.
- Remote writes: kernels never write into each other's memory. Such
  "wild writes" during a kernel crash are caught by a hardware
  "firewall".

Hive "firewall":
- Each physical page within a cell has an associated bitmap which
  indicates which other cells are allowed "write" access to it.
- The bits are off for the pages of all local processes that do not
  share any page with any other cell. Hive also tries to protect as
  many of the shared pages as possible. See the paper for the policies
  that update these bits [I haven't read Section 5 in detail].
- "Wild writes" by kernels into remote user pages cannot be caught.
  [Such a mechanism could be provided, but it would make individual
  user-level writes very expensive.] As a consequence, when a cell
  crashes, Hive takes the pessimistic view that all remote pages that
  were vulnerable are corrupt, and shuts down the remote processes
  using those pages.
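To make the firewall concrete, here is a toy model (in C) of the
per-page write-permission bitmap as I understand it. All the names
here (struct page_info, firewall_allow, firewall_check_write) are made
up for illustration; in FLASH the check is performed by the memory
system on every remote write, not by kernel software like this.

    /* Toy model of the per-page write-permission bitmap described above.
     * Hypothetical names; in FLASH the firewall check happens in the
     * memory system on remote writes, not in software like this. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t cell_mask_t;        /* one bit per potential writer cell */

    struct page_info {
        cell_mask_t write_allowed;       /* bit i set => cell i may write this page */
    };

    /* Grant or revoke write access to one page for one remote cell. */
    static void firewall_allow(struct page_info *pg, int cell, int allow)
    {
        if (allow)
            pg->write_allowed |= (cell_mask_t)1 << cell;
        else
            pg->write_allowed &= ~((cell_mask_t)1 << cell);
    }

    /* The decision the hardware firewall makes on each remote write:
     * reject it unless the writing cell's bit is set for the target page. */
    static int firewall_check_write(const struct page_info *pg, int writer_cell)
    {
        return (int)((pg->write_allowed >> writer_cell) & 1);
    }

    int main(void)
    {
        struct page_info pg = { 0 };     /* private page: no remote writers allowed */

        printf("cell 3 may write: %d\n", firewall_check_write(&pg, 3)); /* 0 */

        firewall_allow(&pg, 3, 1);       /* page becomes shared with cell 3 */
        printf("cell 3 may write: %d\n", firewall_check_write(&pg, 3)); /* 1 */

        firewall_allow(&pg, 3, 0);       /* cell 3 crashed: revoke its access */
        printf("cell 3 may write: %d\n", firewall_check_write(&pg, 3)); /* 0 */
        return 0;
    }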
Failure detection and recovery
------------------------------
- All cells monitor themselves at regular intervals using heuristic
  checks. A failed check triggers recovery immediately: all processors
  save state and then run a distributed consensus protocol to identify
  the failed cell(s) and issue their reboot (a rough sketch of this
  loop appears at the end of these notes). The frequency of these
  checks presents a tradeoff between fault containment and
  performance.
- Recovery is also triggered by failed RPC calls, hardware error
  detection, and failed "careful read" calls.

Issues
------
- One small process should not have its state distributed over too
  many cells. This impacts the global memory manager's page-allocation
  policy.
- Huge applications using lots of system-wide resources do not benefit
  from this scheme; small to medium-sized applications do.
- The mechanisms add extra hardware (the firewall) and extra software
  (RPC, careful reads), which slows the system down. The paper claims
  the slowdown is less than about 20%, and it increases as processes
  "span" more and more cells.

Problems
--------
Teodosiu recently published his thesis on fault containment. It offers
the following insights:
- Commercial SMPs like the SGI Origin 2000 do not provide hardware
  fault containment.
- Applying the Hive approach to an existing off-the-shelf OS would
  require considerable redesign and reimplementation effort: all
  kernel subsystems would have to be reexamined and modified. The Hive
  authors modified close to 100K lines of IRIX source code to get the
  bare-minimum functionality on which they could run meaningful
  experiments. Teodosiu estimates that close to 1 million lines would
  have to be modified to provide a "proper" Hive-style IRIX OS on
  FLASH.
- Another problem is that cells have to perform detailed checks on
  every message they receive. This adds around 20% overhead, which is
  acceptable, but Teodosiu suspects that the overhead would be higher
  if they were to implement "difficult" features like tasks spanning
  multiple cells and transparent migration of processes across cells,
  features that buyers of an SMP would definitely use.
- An alternative approach is to build a cellular virtual machine atop
  which other OSes run. Cellular Disco is one such implementation,
  described in a paper by Govil, Teodosiu, and others in SOSP '99.

Discussion
----------
Is it not possible to identify functional units (servers) and give
each its own smaller SMP box, letting them communicate over high-speed
LANs? Do we really need a 128-way SMP running all the servers
together? What are the tradeoffs?
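Sketch: failure-detection loop
------------------------------
Going back to the "Failure detection and recovery" section above, here
is a rough, hypothetical sketch of the kind of watchdog loop a cell
might run. The names (cell_self_check, start_recovery) and the
structure are mine, not Hive's; the real heuristic checks and the
distributed consensus protocol are only stubbed out as prints.

    /* Illustrative watchdog loop for the failure-detection mechanism
     * described earlier. All names are hypothetical stubs; Hive's real
     * checks and consensus protocol are far more involved. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Heuristic self-checks, e.g. validating key kernel data structures.
     * Here we simply pretend corruption is detected on the third pass. */
    static bool cell_self_check(int pass)
    {
        return pass < 3;
    }

    /* On a failed check (or a failed RPC, a failed careful read, or a
     * hardware-detected error), every processor saves its state, then the
     * surviving cells agree on which cells failed and reboot them. */
    static void start_recovery(void)
    {
        printf("saving processor state\n");
        printf("running distributed consensus to identify failed cells\n");
        printf("issuing reboot of failed cells\n");
    }

    int main(void)
    {
        /* Check period: shorter means tighter containment but more overhead. */
        const unsigned check_period_sec = 1;

        for (int pass = 1; ; pass++) {   /* a real kernel watchdog loops forever */
            if (!cell_self_check(pass)) {
                start_recovery();
                break;
            }
            sleep(check_period_sec);
        }
        return 0;
    }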