Networked Windows NT System Field Failure Data Analysis. Jun Xu, Zbigniew Kalbarczyk and Ravishankar K. Iyer. In Proceedings of IEEE Pacific Rim Intl. Symp. on Dependable Computing (PRDC), Hong Kong, China, Dec. 1999.

Analysis of Failures in Windows NT Systems. Mahesh Chittur Kalyanakrishnan. Masters Thesis, University of Illinois at Urbana-Champaign, 1998.

Summary by Emre Kiciman

Xu99 analyses a broad sample of NT system logs (>500 machines) across multiple application domains (email, enterprise resource management, etc.). In contrast, Ka98 analyses a smaller sample of NT system logs (>70 machines), all from email servers, but provides a more detailed analysis and description of its method.  Both Ka98 and Xu99 come from the same group at UIUC.


Ka98:

The logs analysed in Ka98 are event logs from ~70 Windows NT mail servers in a large enterprise.  The events recorded include error messages and boot events.  Boot events are logged after system boot.  The log does not contain explicit shutdown or reboot events, so the time to restart is estimated as the difference between the timestamp of a boot event and that of the event immediately preceding it.  If multiple reboots occur within an hour of each other, Ka98 assumes that a malfunction is continuing, and considers the machine "down" until the final reboot.
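
The downtime estimate described above can be sketched roughly as follows. This is only an illustration of the rule as summarized here, not the code used in Ka98: the event representation (a sorted list of `(timestamp, kind)` tuples), the `"boot"` label, and the function name are all assumptions, and a boot with no preceding event is simply skipped.

```python
from datetime import datetime, timedelta

# Reboots closer together than this are treated as one continuing outage.
MERGE_WINDOW = timedelta(hours=1)

def estimate_downtimes(events):
    """Return (down_start, down_end) intervals from a sorted event list.

    events: list of (datetime, kind) tuples, where kind == "boot" marks a
    boot event (hypothetical representation of the NT event log).
    """
    intervals = []
    for i, (ts, kind) in enumerate(events):
        if kind != "boot" or i == 0:
            continue  # only boots with a preceding event yield an estimate
        prev_ts = events[i - 1][0]
        if intervals and ts - intervals[-1][1] <= MERGE_WINDOW:
            # Another reboot within an hour: the machine is considered
            # down until this later reboot.
            intervals[-1] = (intervals[-1][0], ts)
        else:
            # Downtime runs from the last event before the boot to the boot.
            intervals.append((prev_ts, ts))
    return intervals
```

Summing `end - start` over the returned intervals and dividing by the observation period gives a per-machine downtime fraction under these assumptions.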

When a problem occurred, average down-time was found to be almost 2 hours.  Fewer than 20% of the machines had up-times > 99.9%.  Average up-time was 99.35%. About 10% of the machines had up-times worse than 90% (!). These availability numbers are based on whether the machine is up, not on whether the application is available.  40% of restarts occurred during working hours.
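
To put these percentages in perspective, an up-time figure can be converted into hours of downtime per year. The helper below is a simple arithmetic sketch (the function name is made up, and a 365-day year is assumed); it is not a calculation taken from Ka98.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year

def annual_downtime_hours(uptime_percent):
    """Hours of downtime per year implied by an up-time percentage."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR

for pct in (99.9, 99.35, 90.0):
    print(f"{pct}% up-time -> {annual_downtime_hours(pct):.1f} hours down per year")
```

Under this assumption, 99.9% up-time allows a little under 9 hours of downtime a year, the 99.35% average corresponds to about 57 hours, and 90% corresponds to 876 hours (more than a month).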

The probable cause of a reboot is determined by analysing 1 hour of events prior to the reboot, looking for serious error messages (e.g., a failed network connection).  Operator error is not accounted for. The probable causes are summarized in the following table:

Category                                                  Frequency   Percentage
Unknown                                                         308        28.00
Connectivity Problems                                           241        21.91
Normal Reboots/power-off (no indication of any problem)         178        16.18
Crucial Application Failure                                     152        13.82
Hardware or firmware problems                                   105         9.55
Normal shutdowns (for maintenance or unknown                     63         5.73
Problem with a software component                                42         3.82
Total reboots                                                  1100       100
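
The one-hour look-back used to assign a probable cause can be sketched as below. The keyword rules are purely hypothetical stand-ins (the summary does not say which messages Ka98 treated as serious), as are the event representation and function name; only the one-hour window and the category names come from the text above.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(hours=1)  # window of events examined before a reboot

# Hypothetical keyword rules, checked in order; not taken from Ka98.
RULES = [
    ("Connectivity Problems", ("network", "connection")),
    ("Hardware or firmware problems", ("disk", "scsi")),
    ("Crucial Application Failure", ("service terminated",)),
]

def probable_cause(boot_time, events):
    """Classify a reboot from the messages logged in the hour before it.

    events: iterable of (datetime, message) tuples.
    """
    window = [
        msg.lower()
        for ts, msg in events
        if boot_time - LOOKBACK <= ts < boot_time
    ]
    for category, keywords in RULES:
        if any(k in msg for msg in window for k in keywords):
            return category
    return "Unknown"  # nothing serious found in the window
```

A scheme like this cannot see operator actions, which is consistent with the note that operator error is not accounted for.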

By modeling each machine's behavior as a state machine and analysing its logs, Ka98 studies the common transitions among a functional state, a restart state, and various failure states.  This analysis finds that only 40% of reboots appeared to put a machine into a functional state.  Other interesting results:  1) most problems that occurred in a functional machine were network problems and persisted across reboots, and 2) most disk errors did not lead to restarts or more serious failures; the machine seemed to continue functioning normally.
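
A state-machine analysis of this kind amounts to counting transitions between consecutive states in each machine's log. The sketch below assumes each log has already been reduced to a sequence of state labels; the labels and function names are illustrative and are not the states defined in Ka98.

```python
from collections import Counter

def transition_counts(state_sequences):
    """Count (from_state, to_state) pairs over all machines' sequences."""
    counts = Counter()
    for seq in state_sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive state pairs
    return counts

def transition_probability(counts, src, dst):
    """Fraction of transitions leaving src that go to dst."""
    total = sum(n for (s, _), n in counts.items() if s == src)
    return counts[(src, dst)] / total if total else 0.0

# Hypothetical per-machine state sequences.
sequences = [
    ["functional", "network_error", "restart", "network_error",
     "restart", "functional"],
    ["functional", "disk_error", "functional"],
]
counts = transition_counts(sequences)
print(transition_probability(counts, "restart", "functional"))
```

With real data, the restart-to-functional fraction computed this way would be the counterpart of the 40% figure reported above.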

Additionally, Ka98 found that some failures did propagate or were otherwise correlated across the network, though most seemed to be localized to individual machines.  Single points of failure (the Windows PDC and master browser) could be reliability bottlenecks.




Xu99:

Xu99 provides less detail on their methodology (e.g., how the NT event logs were analysed to determine downtime).  A significant difference is that Xu99 had access to operator-annotated reboot logs.

Interesting results from Xu99:

Both Xu99 and Ka98 agree that existing logging techniques generate much useful information, but there is significant room for improvement.  Both find some failure propagation, though Xu99 finds more.  Both agree that tools to help maintainability and repair are essential.

Questions:

What is the cause of the discrepancies between Xu99 and Ka98? Average availability is 92% in Ka98, but >99% in Xu99.  The major cause of outages in Ka98 is network problems, while Xu99's largest known cause is maintenance. Do these discrepancies arise because the event logs came from different sources, or because the two studies used different analysis techniques?