Causes and impacts of failures and failure behaviors
Summary by Andy Huang of highlights of various papers, including:
- [Oppe02] David Oppenheimer and David A. Patterson. Why do Internet
services fail and what can be done about it?
- [Xu99] Jun Xu, Zbigniew Kalbarczyk and Ravishankar K. Iyer. Networked
Windows NT System Field Failure Data Analysis. In Proceedings of IEEE
Pacific Rim Int'l Symp. on Dependable Computing (PRDC), Hong Kong, China,
Dec. 1999.
- [Murp00] Brendan Murphy, Bjorn Levidow. Windows 2000 Dependability.
Microsoft Research Technical Report, MSR-TR-2000-56, June 2000.
- [Murp95] Brendan Murphy, Ted Gent. Measuring System and Software
Reliability Using an Automated Data Collection Process. Quality and
Reliability Engineering International, Vol 11, pp. 341-353, 1995.
- [Sull92] Mark Sullivan, Ram Chillarege. A Comparison of Software Defects
in Database Management Systems and Operating Systems. In Proceedings of IEEE
Twenty-Second Annual International Symposium on Fault-Tolerant Computing,
July 8-10, 1992, Boston, Massachusetts, USA.
- [Gray90] Jim Gray. A Census of Tandem System Availability Between 1985 and
1990. Tandem Technical Report 90.1 (Part Number 33579), January 1990.
Focus on increasing TBF for maintenance (i.e., making it less frequent) and
on reducing TTR for HW/SW system failures, because
- they are the leading contributors to total downtime,
- maintenance occurs frequently, but has a low MTTR [Xu99, Murp00, Murp95],
- system failures are rare, but have a very high MTTR [Xu99], and
- application failures occur just as often as system failures, but they have
a much lower MTTR, and thus don't contribute much to total downtime [Xu99]
(see the downtime arithmetic sketch below).
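A minimal sketch of the downtime arithmetic behind these points: each
failure class contributes (events per year) x (MTTR) to total downtime. The
rates and repair times below are hypothetical, chosen only to mirror the
qualitative pattern in [Xu99], not taken from the paper's data.

    #include <stdio.h>

    /* Downtime contribution = event frequency x mean time to repair.
     * All numbers are hypothetical illustrations, not [Xu99] data. */
    int main(void) {
        struct { const char *cause; double per_year; double mttr_h; }
        c[] = {
            { "maintenance",         24.0, 0.50 }, /* frequent, quick      */
            { "system failure",       2.0, 6.00 }, /* rare, slow to repair */
            { "application failure",  2.0, 0.25 }  /* rare, quick          */
        };
        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += c[i].per_year * c[i].mttr_h;
        for (int i = 0; i < 3; i++) {
            double d = c[i].per_year * c[i].mttr_h;
            printf("%-20s %5.1f h/yr (%4.1f%% of downtime)\n",
                   c[i].cause, d, 100.0 * d / total);
        }
        return 0;
    }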
- Future reading: Read David's full-length paper (in two weeks) for
information on TTR/TTD and on which errors have high impact in Internet
services. Front-ends fail the most, but do they recover quickly because they
are stateless?
- The root cause of many SW system failures is overlay bugs, which are most
often triggered by boundary conditions (e.g., workload, unusual parameters,
HW/SW config) [Sull92].
- Note on isolation: There may not be as much fault isolation between nodes
in a cluster as one might think. Not only are failures of a single node
clustered in time, but so are failures among several nodes [Xu99].
What this means for using VMs:
- Failure detection: We'd like to improve TTD (and thus TTR) for the OS, but
there may not be much to gain from introspection at the VMM level [Mendel],
especially if we're looking for overlay bugs. It may be more suitable to
attack the problem at the compiler or C-library level (e.g., check-summed
data structure libraries; see the sketch below).
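A minimal sketch of what a check-summed data structure library might look
like; everything here (names, the simple additive checksum) is hypothetical
and only illustrates how a library-level check could catch a small overlay
that VMM-level introspection would miss.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical check-summed record: the checksum is recomputed on
     * every update and verified on every read, so a stray overwrite of
     * the payload is caught at the next access instead of propagating. */
    struct checked_rec {
        char     payload[32];
        uint32_t checksum;
    };

    static uint32_t sum(const char *p, size_t n) {
        uint32_t s = 0;
        while (n--) s = s * 31 + (uint8_t)*p++;
        return s;
    }

    static void rec_set(struct checked_rec *r, const char *v) {
        memset(r->payload, 0, sizeof r->payload);
        strncpy(r->payload, v, sizeof r->payload - 1);
        r->checksum = sum(r->payload, sizeof r->payload);
    }

    static int rec_ok(const struct checked_rec *r) {
        return r->checksum == sum(r->payload, sizeof r->payload);
    }

    int main(void) {
        struct checked_rec r;
        rec_set(&r, "hello");
        r.payload[3] = 'X';          /* simulate a small overlay */
        printf("integrity: %s\n", rec_ok(&r) ? "ok" : "CORRUPTED");
        return 0;
    }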
- Isolation: Server consolidation is already being done [Mendel], but not to
the extreme level we're considering. At the same time, the NT clusters paper
noted that nodes in a cluster are not always as fault-isolated as we think.
Causes of failures in various system types
- - Internet services [Oppe02]: Except in read-mostly services, the
component that causes the most user-visible failures is the stateless
front-end (vs. the stateful back-end or the network). The complexity of the
custom software running on front-ends and the associated complexity of
configuring and administering them explain why operator error is the leading
cause of failure, followed by node software problems.
- - NT Clusters [Xu99]: The leading contributors to downtime are system
HW/SW failures and maintenance-related tasks. System failures occur much
less frequently (4% of failures), but their high MTTRs make them the highest
contributors to downtime (32%). Planned maintenance, configurations, and
installations have shorter MTTRs, but occur much more frequently, and thus
account for 24% of total downtime. Application software failures occur
as often as system software failures, but they are recovered from more
quickly and account for only 1% of total downtime.
- - Windows NT [Murp00]: System failures accounted for only 14% of all
outages; of those, 43% were due to core NT defects, 32% device drivers and
3rd-party drivers, 13% hardware, and 12% anti-virus software.
- Improving system dependability requires one to address all system outages
(not just bugs), irrespective of their cause. For example, 65% of outages
are planned (preventative reboots, OS install, app install/config, OS
config, hardware install/config), so in Win2k, many installation and
configuration changes no longer require reboots. Another observation is that
recovery time has a bigger impact on availability than reliability does;
accordingly, chkdsk's performance in Win2k was improved by 4-8x (see the
availability arithmetic sketch below).
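A minimal sketch of that availability arithmetic, with hypothetical MTBF and
MTTR values: steady-state availability is MTBF / (MTBF + MTTR), so halving
MTTR buys roughly as much availability as doubling MTBF.

    #include <stdio.h>

    /* Steady-state availability = MTBF / (MTBF + MTTR).
     * Hypothetical numbers; only the comparison matters. */
    static double avail(double mtbf_h, double mttr_h) {
        return mtbf_h / (mtbf_h + mttr_h);
    }

    int main(void) {
        double mtbf = 1000.0, mttr = 2.0; /* hours */
        printf("baseline:  %.5f\n", avail(mtbf, mttr));
        printf("2x MTBF:   %.5f\n", avail(2 * mtbf, mttr));
        printf("half MTTR: %.5f\n", avail(mtbf, mttr / 2));
        return 0;
    }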
- - Tandem Non-Stop [Gray90]: Between 1985 and 1990, hardware and
maintenance improved greatly (from 50% of outages to 10%) while software
(33% to 60%) and operations/system management (9% to 15%) did not improve.
Present software has a 30-year MTBF, and this doesn't include planned
operations outages, which are really a hidden form of software outage.
- - VAX [Murp95]: In 1993, system management (e.g., incorrect app
installation and system configuration) was the leading cause of crashes
(over 50%). Further, 90% of system interruptions were due to operator
shutdowns.
- - Characterization of software defects [Sull92]: Error type is the
low-level programming mistake that led to the failure (e.g., pointer
management, type mismatch). Error trigger is the environment that caused the
defective code to be executed (e.g., boundary conditions, timing). Note: in
this study, bugs were separated into overlay bugs (i.e., memory overwrites)
and regular bugs.
- * Error type: In overlay bugs, the most common error types were memory
allocation errors, copying overruns, and pointer management errors (>
50%). The error types with the highest impact were memory allocation and
pointer management errors (~ 50% of the high impact errors). Finally, most
overlays are small in size and are close to their intended destination
(rather than large "wild stores"), which makes fault detection
difficult and error propagation more likely (many subsystems damaged by an
overlay use the corrupted data before failing and have an opportunity to
propagate the error). Many of the non-overlay errors were
concurrency-related (e.g., deadlocks, synchronization) and often appeared in
network and device protocols (a small copying-overrun sketch follows the
error trigger note below).
- * Error trigger: Boundary conditions (e.g., unusual parameters, unique
HW/SW config, and workload) accounted for the largest number of errors. Code
reuse can partially explain this: over time, modules are used in ways the
original designer never considered, which makes module tests less effective,
since tests run on the old module by the original programmer don't stress
aspects of the module used by newer clients. Bug fixes, recovery code, and
timing accounted for most of the other errors, with recovery code accounting
for the largest fraction of high impact errors.
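A minimal sketch of the kind of small, near-destination overlay [Sull92]
describes: an off-by-one copying overrun that corrupts only the adjacent
field, so the structure still looks plausible and the bad data can propagate
before anything visibly fails. The struct and field names are illustrative,
not drawn from the paper.

    #include <stdio.h>
    #include <string.h>

    /* Two adjacent fields: an off-by-one overrun of `name` silently
     * clobbers the first byte of `flags` -- a small overlay close to its
     * intended destination, not a large "wild store". */
    struct record {
        char name[8];
        char flags[4];
    };

    int main(void) {
        struct record r;
        memset(&r, 0, sizeof r);
        memcpy(r.flags, "rw-", 4);
        /* BUG: 9 bytes copied into an 8-byte field. */
        memcpy(r.name, "too-long!", 9);
        /* r.flags[0] is now '!'; later readers use the corrupted value,
         * propagating the error before any failure is detected. */
        printf("flags = %s\n", r.flags);
        return 0;
    }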
Failure impact/behavior in various system types
- - NT Clusters [Xu99]: Single-node outages are clustered closely in time. A
reason for this may be that most crashes are due to cumulative problems
(e.g., memory leaks or file system errors) and result in incomplete system
cleanup, which may require several reboots to completely fix. I think the
explanation for clustered reboots is that most reboots are due to
maintenance, which may require several reboots. Windows 2000 resolves this
problem by eliminating the need to reboot after most HW/SW maintenance
operations [Murp00]. It was also observed that "errors can propagate
from one machine to another, and a single error may affect multiple
computation nodes." Statistics show that most of the time, a cluster is in
a fully functional state, but once the cluster enters a state with a server
down, there is a non-negligible probability that the system will stay in
this state or that more servers will fail (see the sketch below).
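A minimal sketch of one way to read that observation, as a small
discrete-time Markov chain over cluster states (all up / one down / multiple
down). The transition probabilities are hypothetical, not estimates from
[Xu99]; they only illustrate how a "sticky" degraded state inflates the
time spent degraded.

    #include <stdio.h>

    /* Hypothetical 3-state model: 0 = all up, 1 = one server down,
     * 2 = multiple down. P[i][j] = per-step probability of i -> j. */
    int main(void) {
        double P[3][3] = {
            { 0.98, 0.02, 0.00 },
            { 0.30, 0.55, 0.15 },  /* degraded: likely to stay or worsen */
            { 0.10, 0.30, 0.60 }
        };
        double pi[3] = { 1.0, 0.0, 0.0 }, next[3];
        for (int step = 0; step < 10000; step++) { /* power iteration */
            for (int j = 0; j < 3; j++) {
                next[j] = 0.0;
                for (int i = 0; i < 3; i++)
                    next[j] += pi[i] * P[i][j];
            }
            for (int j = 0; j < 3; j++) pi[j] = next[j];
        }
        printf("steady state: up %.3f, one down %.3f, multiple %.3f\n",
               pi[0], pi[1], pi[2]);
        return 0;
    }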
- - Tandem Non-Stop [Gray90]: "...once a system starts failing, it is
in jeopardy. Human error rates are relatively high; recovery procedures are
complex, and are often not well-tested. Recovery software suffers from
similar complexity and limited testing. Latent faults further increase the
chance of multiple faults..."
- - VAX [Murp95]: There are two distinct periods of reliability: the period
immediately after installation (where certain problems occur that can be
permanently corrected) and the steady-state level of reliability.
[email protected]