Causes and impacts of software bugs

Networked Windows NT System Field Failure Data Analysis. Jun Xu, Zbigniew Kalbarczyk and Ravishankar K. Iyer. In Proceedings of IEEE Pacific Rim Intl' Symp. on Dependable Computing (PRDC), Hong Kong, China, Dec. 1999.

Analysis of Failures in Windows NT Systems. Mahesh Chittur Kalyanakrishnan. Masters Thesis, University of Illinois at Urbana-Champaign, 1998.

Summary by Emre Kiciman

Xu99 analyses a broad sample of NT system logs (>500 machines) across multiple application domains (email, enterprise resource management, etc.). In contrast, Ka98 provides a more detailed analysis and description of its method, but analyses a smaller sample of NT system logs (>70 machines), all from email servers. Both Ka98 and Xu99 come from the same group at UIUC.

Ka98:

The logs analysed in Ka98 are event logs from ~70 Windows NT mail servers in a large enterprise. The events recorded include error messages and boot events. Boot events are logged after system boot. The log does not contain explicit shutdown or reboot events. The time to restart is therefore estimated as the difference between the timestamp of a boot event and that of the event immediately preceding it. If multiple reboots occur within an hour of each other, Ka98 assumes that a malfunction is continuing, and considers the machine "down" until the final reboot.
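The downtime estimate described above can be sketched roughly as follows. This is only an illustration of the rule as summarised here, not Ka98's actual tooling: the `(timestamp, kind)` tuple layout, the `"boot"` label, and the function name `estimate_outages` are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def estimate_outages(events, merge_window=timedelta(hours=1)):
    """Estimate outage intervals from a time-sorted list of (timestamp, kind).

    kind == "boot" marks a boot event; any other value is an ordinary event.
    The tuple layout and the "boot" label are assumptions for this sketch.
    """
    outages = []  # list of (down_start, down_end) pairs
    for i in range(1, len(events)):
        ts, kind = events[i]
        if kind != "boot":
            continue
        if outages and ts - outages[-1][1] <= merge_window:
            # Another boot within the window of the previous boot: treat it as
            # the same malfunction and extend the outage to this reboot.
            outages[-1] = (outages[-1][0], ts)
        else:
            # Downtime is assumed to start at the last event before the boot.
            outages.append((events[i - 1][0], ts))
    return outages

# Made-up log for illustration only.
log = [
    (datetime(1998, 1, 5, 10, 0), "error"),
    (datetime(1998, 1, 5, 10, 30), "boot"),
    (datetime(1998, 1, 5, 10, 45), "error"),
    (datetime(1998, 1, 5, 11, 0), "boot"),   # within an hour: same outage
    (datetime(1998, 1, 5, 15, 0), "info"),
    (datetime(1998, 1, 5, 15, 20), "boot"),  # separate outage
]
outages = estimate_outages(log)
total_down = sum((end - start for start, end in outages), timedelta())
```

With this made-up log, the first two boots collapse into one outage from 10:00 to 11:00, and the third boot forms a separate 20-minute outage.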

When a problem occurred, average down-time was found to be almost 2 hours. Less than 20% of the machines had up-times > 99.9%. Average up-time was 99.35%. About 10% of the machines had uptimes worse than 90% (!). These availability numbers are not based on application availability, but on whether the machine is up. 40% of restarts occurred during working hours.

The probable cause of a reboot is determined by analysing 1 hour of events prior to the reboot, looking for serious error messages (e.g., a failed network connection). Operator error is not accounted for. The probable causes are summarized in the following table:

Category	Frequency	Percentage
Unknown	308	28.00
Connectivity Problems	241	21.91
Normal Reboots/power-off (no indication of any problem)	178	16.18
Crucial Application Failure	152	13.82
Hardware or firmware problems	105	9.55
Normal shutdowns (for maintenance or unknown)	63	5.73
Problem with a software component	42	3.82
Total reboots	1100	100
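The one-hour lookback used to assign these categories could be sketched as below. The keyword rules are hypothetical placeholders, since this summary does not record which messages Ka98 treated as serious; only the category names come from the table.

```python
from datetime import datetime, timedelta

# Hypothetical keyword -> category rules, invented for this sketch.
RULES = [
    ("network", "Connectivity Problems"),
    ("disk", "Hardware or firmware problems"),
    ("service terminated", "Crucial Application Failure"),
]

def probable_cause(events, reboot_time, window=timedelta(hours=1)):
    """Return the category of the first matching event in the window
    before reboot_time, or "Unknown" if nothing matches.

    events is a time-sorted list of (timestamp, message) pairs (assumed layout).
    """
    start = reboot_time - window
    for ts, message in events:
        if start <= ts < reboot_time:
            text = message.lower()
            for keyword, category in RULES:
                if keyword in text:
                    return category
    return "Unknown"

# Made-up events for illustration only.
sample = [
    (datetime(1998, 1, 5, 9, 0), "Disk warning"),            # outside the window
    (datetime(1998, 1, 5, 10, 30), "Network connection failed"),
]
cause = probable_cause(sample, datetime(1998, 1, 5, 11, 0))
```

A reboot with no matching message in the preceding hour falls into "Unknown", which is consistent with that being the largest category in the table.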

By modeling the machine's behavior as a state machine, and analysing the logs of each machine, Ka98 analyses the common transitions among a functional state, a restart state, and various failure states. This analysis finds that only 40% of reboots appeared to put a machine into a functional state. Other interesting results: 1) most problems that occurred in a functional machine were network problems and persisted across reboots, and 2) most disk errors did not lead to restarts or more serious failures; the machine seemed to continue functioning normally.
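One way to picture the state-machine analysis is as counting transitions between consecutive states in a machine's log. The sketch below assumes the log has already been reduced to a sequence of state labels; the label names and the function `transition_probabilities` are made up for illustration and are not taken from Ka98.

```python
from collections import Counter

def transition_probabilities(states):
    """Estimate P(next state | current state) from a sequence of state labels."""
    pair_counts = Counter(zip(states, states[1:]))
    source_totals = Counter()
    for (src, _dst), n in pair_counts.items():
        source_totals[src] += n
    return {
        (src, dst): n / source_totals[src]
        for (src, dst), n in pair_counts.items()
    }

# Made-up state sequence for one machine.
sequence = ["functional", "network_error", "reboot",
            "functional", "reboot", "network_error"]
probs = transition_probabilities(sequence)
```

In this toy sequence, half of the reboots lead back to the functional state and half lead straight into a network error, which is the kind of figure behind the "only 40% of reboots" observation.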

Additionally, Ka98 found that some failures did propagate or were otherwise correlated across the network, though most seemed to be localized to individual machines. Single points of failure (the Windows PDC and master browser) could be reliability bottlenecks.

Xu99:

Xu99 provides less detail on their methodology (e.g., how the NT event logs were analysed to determine downtime). A significant difference is that Xu99 had access to operator-annotated reboot logs.

Interesting results from Xu99:

Software and hardware failures account for 22% and 10% of downtime, but only 3% and 1% of all reboots.
Application software is responsible for only 1% of downtime and 3% of reboots.
Reboots on single or multiple machines tend to occur in bursts.
Average availability is over 99%, but strong indication of error propagation across the network.
The majority of outages (58%) were due to unknown causes (unclassified). These accounted for 36% of downtime.
The MTBF for the system of 500 machines was 1.37 hours. MTTR was 0.25 hours. Unclassified errors occurred the most often, having an MTBF of 2.3 hours.
Operator error was classified as an "other" cause, which in total accounted for 4% of outages and 7% of downtime.
Machines undergo periodic behavior: days of stability (from 5 to 10, dependent on application domain) followed by many reboots.
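As a rough sanity check on the MTBF and MTTR figures above, one can apply the standard steady-state formula availability = MTBF / (MTBF + MTTR). The sketch below makes two assumptions that Xu99, as summarised here, does not confirm: that reboots are spread evenly over the 500 machines (the bursts noted above suggest they are not), and that the 0.25-hour MTTR applies per machine.

```python
def per_machine_mtbf(system_mtbf_hours, machines):
    # Assumption: failures are spread evenly, so each machine fails
    # `machines` times less often than the system as a whole.
    return system_mtbf_hours * machines

def steady_state_availability(mtbf_hours, mttr_hours):
    # Standard steady-state availability formula.
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = per_machine_mtbf(1.37, 500)                 # about 685 hours
availability = steady_state_availability(mtbf, 0.25)
```

Under these assumptions the per-machine availability comes out above 99.9%, which is at least compatible with the reported average of over 99%.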

Both Xu99 and Ka98 agree that existing logging techniques generate much useful information, but there is significant room for improvement. Both find some failure propagation, though Xu99 finds more. Both agree that tools to help maintainability and repair are essential.

Questions:

What is the cause of the discrepancies between Xu99 and Ka98? Average availability is 92% in Ka98, but >99% in Xu99. The major cause of outages in Ka98 is network problems, while Xu99's largest known cause is maintenance. Are these discrepancies because the event logs came from different sources? Or are they due to different analysis techniques?