Causes and impacts of failures and failure behaviors
Summary by Andy Huang of highlights of various papers, including:
- [Oppe02] David Oppenheimer and David A. Patterson. Why do Internet
services fail and what can be done about it?
- [Xu99] Jun Xu, Zbigniew Kalbarczyk and Ravishankar K. Iyer. Networked
Windows NT System Field Failure Data Analysis. In Proceedings of IEEE
Pacific Rim Int'l Symp. on Dependable Computing (PRDC), Hong Kong, China,
Dec. 1999.
- [Murp00] Brendan Murphy, Bjorn Levidow. Windows 2000 Dependability.
Microsoft Research Technical Report, MSR-TR-2000-56, June 2000.
- [Murp95] Brendan Murphy, Ted Gent. Measuring System and Software
Reliability Using an Automated Data Collection Process. Quality and
Reliability Engineering International, Vol 11, pp. 341-353, 1995.
- [Sull92] Mark Sullivan, Ram Chillarege. A Comparison of Software Defects
in Database Management Systems and Operating Systems. In Proceedings of IEEE
Twenty-Second Annual International Symposium on Fault-Tolerant Computing,
July 8-10, 1992, Boston, Massachusetts, USA.
- [Gray90] Jim Gray. A Census of Tandem System Availability Between 1985 and
1990. Tandem Technical Report 90.1 (Part Number 33579), January 1990.
Focus on increasing TBF for maintenance (i.e., making it less frequent) and
on reducing TTR for HW/SW system failures, because
- they are the leading contributors to total downtime,
- maintenance occurs frequently, but has a low MTTR [Xu99, Murp00, Murp95],
- system failures are rare, but have a very high MTTR [Xu99], and
- application failures occur just as often as system failures, but they have
a much lower MTTR, and thus don't contribute much to total downtime [Xu99]
(see the downtime arithmetic sketch below).
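A minimal sketch of the downtime arithmetic behind these points: each
failure class contributes (events per year) x (MTTR) to total downtime. The
rates and repair times below are hypothetical, chosen only to mirror the
qualitative pattern in [Xu99], not taken from the paper's data.

    #include <stdio.h>

    /* Downtime contribution = event frequency x mean time to repair.
     * All numbers are hypothetical illustrations, not [Xu99] data. */
    int main(void) {
        struct { const char *cause; double per_year; double mttr_h; }
        c[] = {
            { "maintenance",         24.0, 0.50 }, /* frequent, quick      */
            { "system failure",       2.0, 6.00 }, /* rare, slow to repair */
            { "application failure",  2.0, 0.25 }  /* rare, quick          */
        };
        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += c[i].per_year * c[i].mttr_h;
        for (int i = 0; i < 3; i++) {
            double d = c[i].per_year * c[i].mttr_h;
            printf("%-20s %5.1f h/yr (%4.1f%% of downtime)\n",
                   c[i].cause, d, 100.0 * d / total);
        }
        return 0;
    }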
- Future reading: Read David's full-length paper (in two weeks) for
information on TTR/TTD and on which errors have high impact in Internet
services. Front-ends fail the most, but do they recover quickly because they
are stateless?
- The root cause of many SW system failures is overlay bugs, which are most
often triggered by boundary conditions (e.g., workload, unusual parameters,
HW/SW config) [Sull92].
- Note on isolation: There may not be as much fault isolation between nodes
in a cluster as one might think. Not only are failures of a single node
clustered in time, but so are failures among several nodes [Xu99].
What this means for using VMs:
- Failure detection: We'd like to improve TTD (and thus TTR) for the OS, but
there may not be much to gain from introspection at the VMM level [Mendel],
especially if we're looking for overlay bugs. It may be more suitable to
attack the problem at the compiler or C-library level (e.g., check-summed
data structure libraries; see the sketch below).
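A minimal sketch of what a check-summed data structure library might look
like; everything here (names, the simple additive checksum) is hypothetical
and only illustrates how a library-level check could catch a small overlay
that VMM-level introspection would miss.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical check-summed record: the checksum is recomputed on
     * every update and verified on every read, so a stray overwrite of
     * the payload is caught at the next access instead of propagating. */
    struct checked_rec {
        char     payload[32];
        uint32_t checksum;
    };

    static uint32_t sum(const char *p, size_t n) {
        uint32_t s = 0;
        while (n--) s = s * 31 + (uint8_t)*p++;
        return s;
    }

    static void rec_set(struct checked_rec *r, const char *v) {
        memset(r->payload, 0, sizeof r->payload);
        strncpy(r->payload, v, sizeof r->payload - 1);
        r->checksum = sum(r->payload, sizeof r->payload);
    }

    static int rec_ok(const struct checked_rec *r) {
        return r->checksum == sum(r->payload, sizeof r->payload);
    }

    int main(void) {
        struct checked_rec r;
        rec_set(&r, "hello");
        r.payload[3] = 'X';          /* simulate a small overlay */
        printf("integrity: %s\n", rec_ok(&r) ? "ok" : "CORRUPTED");
        return 0;
    }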
- Isolation: Server consolidation is already being done [Mendel], but not to
the extreme level we're considering. At the same time, the NT clusters paper
noted that nodes in a cluster are not always as fault-isolated as we think.
Causes of failures in various system types
- - Internet services [Oppe02]: Except in read-mostly services, the
component that causes the most user-visible failures is the stateless
front-end (vs. the stateful back-end or the network). The complexity of the
custom software running on front-ends and the associated complexity of
configuring and administering them explain why operator error is the leading
cause of failure, followed by node software problems.
- - NT Clusters [Xu99]: The leading contributors to downtime are system
HW/SW failures and maintenance-related tasks. System failures occur much
less frequently (4% of failures), but their high MTTRs make them the highest
contributors to downtime (32%). Planned maintenance, configurations, and
installations have shorter MTTRs, but occur much more frequently, and thus
account for 24% of total downtime. Application software failures occur
as often as system software failures, but they are recovered from more
quickly and account for only 1% of total downtime.
- - Windows NT [Murp00]: System failures accounted for only 14% of all
outages; of those, 43% were due to core NT defects, 32% device drivers and
3rd-party drivers, 13% hardware, and 12% anti-virus software.
- Improving system dependability requires one to address all system outages
(not just bugs), irrespective of their cause. For example, 65% of outages
are planned (preventative reboots, OS install, app install/config, OS
config, hardware install/config), so in Win2k, many installation and
configuration changes no longer require reboots. Another observation is that
recovery time has a bigger impact on availability than reliability does;
accordingly, chkdsk's performance in Win2k was improved by 4-8x (see the
availability arithmetic sketch below).
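A minimal sketch of that availability arithmetic, with hypothetical MTBF and
MTTR values: steady-state availability is MTBF / (MTBF + MTTR), so halving
MTTR buys roughly as much availability as doubling MTBF.

    #include <stdio.h>

    /* Steady-state availability = MTBF / (MTBF + MTTR).
     * Hypothetical numbers; only the comparison matters. */
    static double avail(double mtbf_h, double mttr_h) {
        return mtbf_h / (mtbf_h + mttr_h);
    }

    int main(void) {
        double mtbf = 1000.0, mttr = 2.0; /* hours */
        printf("baseline:  %.5f\n", avail(mtbf, mttr));
        printf("2x MTBF:   %.5f\n", avail(2 * mtbf, mttr));
        printf("half MTTR: %.5f\n", avail(mtbf, mttr / 2));
        return 0;
    }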
- - Tandem Non-Stop [Gray90]: Between 1985 and 1990, hardware and
maintenance improved greatly (from 50% of outages to 10%) while software
(33% to 60%) and operations/system management (9% to 15%) did not improve.
Present software has a 30-year MTBF, and this doesn't include planned
operations outages, which are really a hidden form of software outage.
- - VAX [Murp95]: In 1993, system management (e.g., incorrect app
installation and system configuration) was the leading cause of crashes
(over 50%). Further, 90% of system interruptions were due to operator
shutdowns.
- - Characterization of software defects [Sull92]: Error type is the
low-level programming mistake that led to the failure (e.g., pointer
management, type mismatch). Error trigger is the environment that caused the
defective code to be executed (e.g., boundary conditions, timing). Note: in
this study, bugs were separated into overlay bugs (i.e., memory overwrites)
and regular bugs.
- * Error type: In overlay bugs, the most common error types were memory
allocation errors, copying overruns, and pointer management errors (>
50%). The error types with the highest impact were memory allocation and
pointer management errors (~ 50% of the high impact errors). Finally, most
overlays are small in size and are close to their intended destination
(rather than large "wild stores"), which makes fault detection
difficult and error propagation more likely (many subsystems damaged by an
overlay use the corrupted data before failing and have an opportunity to
propagate the error). Many of the non-overlay errors were
concurrency-related (e.g., deadlocks, synchronization) and often appeared in
network and device protocols (a small copying-overrun sketch follows the
error trigger note below).
- * Error trigger: Boundary conditions (e.g., unusual parameters, unique
HW/SW config, and workload) accounted for the largest number of errors. Code
reuse can partially explain this: over time, modules are used in ways the
original designer never considered, which makes module tests less effective,
since tests run on the old module by the original programmer don't stress
aspects of the module used by newer clients. Bug fixes, recovery code, and
timing accounted for most of the other errors, with recovery code accounting
for the largest fraction of high impact errors.
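A minimal sketch of the kind of small, near-destination overlay [Sull92]
describes: an off-by-one copying overrun that corrupts only the adjacent
field, so the structure still looks plausible and the bad data can propagate
before anything visibly fails. The struct and field names are illustrative,
not drawn from the paper.

    #include <stdio.h>
    #include <string.h>

    /* Two adjacent fields: an off-by-one overrun of `name` silently
     * clobbers the first byte of `flags` -- a small overlay close to its
     * intended destination, not a large "wild store". */
    struct record {
        char name[8];
        char flags[4];
    };

    int main(void) {
        struct record r;
        memset(&r, 0, sizeof r);
        memcpy(r.flags, "rw-", 4);
        /* BUG: 9 bytes copied into an 8-byte field. */
        memcpy(r.name, "too-long!", 9);
        /* r.flags[0] is now '!'; later readers use the corrupted value,
         * propagating the error before any failure is detected. */
        printf("flags = %s\n", r.flags);
        return 0;
    }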
Failure impact/behavior in various system types
- - NT Clusters [Xu99]: Single-node outages are clustered closely in time. A
reason for this may be that most crashes are due to cumulative problems
(e.g., memory leaks or file system errors) and result in incomplete system
cleanup, which may require several reboots to completely fix. I think the
explanation for clustered reboots is that most reboots are due to
maintenance, which may require several reboots. Windows 2000 resolves this
problem by eliminating the need to reboot after most HW/SW maintenance
operations [Murp00]. It was also observed that "errors can propagate
from one machine to another, and a single error may affect multiple
computation nodes." Statistics show that most of the time, a cluster is in
a fully functional state, but once the cluster enters a state with a server
down, there is a non-negligible probability that the system will stay in
this state or that more servers will fail (see the sketch below).
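A minimal sketch of one way to read that observation, as a small
discrete-time Markov chain over cluster states (all up / one down / multiple
down). The transition probabilities are hypothetical, not estimates from
[Xu99]; they only illustrate how a "sticky" degraded state inflates the
time spent degraded.

    #include <stdio.h>

    /* Hypothetical 3-state model: 0 = all up, 1 = one server down,
     * 2 = multiple down. P[i][j] = per-step probability of i -> j. */
    int main(void) {
        double P[3][3] = {
            { 0.98, 0.02, 0.00 },
            { 0.30, 0.55, 0.15 },  /* degraded: likely to stay or worsen */
            { 0.10, 0.30, 0.60 }
        };
        double pi[3] = { 1.0, 0.0, 0.0 }, next[3];
        for (int step = 0; step < 10000; step++) { /* power iteration */
            for (int j = 0; j < 3; j++) {
                next[j] = 0.0;
                for (int i = 0; i < 3; i++)
                    next[j] += pi[i] * P[i][j];
            }
            for (int j = 0; j < 3; j++) pi[j] = next[j];
        }
        printf("steady state: up %.3f, one down %.3f, multiple %.3f\n",
               pi[0], pi[1], pi[2]);
        return 0;
    }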
- - Tandem Non-Stop [Gray90]: "...once a system starts failing, it is
in jeopardy. Human error rates are relatively high; recovery procedures are
complex, and are often not well-tested. Recovery software suffers from
similar complexity and limited testing. Latent faults further increase the
chance of multiple faults..."
- - VAX [Murp95]: There are two distinct periods of reliability: the period
immediately after installation (where certain problems occur that can be
permanently corrected) and the steady-state level of reliability.
[email protected]