Summary by Emre Kiciman
Xu99 analyses a broad sample of NT system logs (>500 machines) across
multiple application domains (email, enterprise resource management, etc.).
In contrast, Ka98 provides a more detailed analysis and method description,
while analysing a smaller sample of NT system logs (>70 machines) of
email servers. Both Ka98 and Xu99 come from the same group at UIUC.
The logs analysed in Ka98 are event logs from ~70 Windows NT mail servers
in a large enterprise. The events recorded include error messages and
boot events. Boot events are logged after system boot. The log
does not contain explicit shutdown or reboot events. A time to restart
is estimated as the difference between the timestamps of a boot event and
the event immediately preceding it. If multiple reboots occur within
an hour of each other, Ka98 assumes that a malfunction is continuing, and
considers the machine "down" until the final reboot.
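The restart-time estimate described above can be sketched roughly as follows. This is only an illustration of the rule as summarized here, not Ka98's actual tooling: the `(timestamp, kind)` event tuples, the `"boot"` label, and the function name `estimate_outages` are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Reboots closer together than this are treated as one continuing outage.
MERGE_WINDOW = timedelta(hours=1)

def estimate_outages(events):
    """Return (down_start, up_time) intervals for one machine.

    events: list of (datetime, kind) tuples sorted by time, where kind is
    "boot" for a boot event (hypothetical label) and anything else otherwise.
    """
    outages = []
    for i, (ts, kind) in enumerate(events):
        if kind != "boot" or i == 0:
            continue
        # The log has no shutdown event, so the last event seen before the
        # boot is taken as the start of the down period.
        down_start = events[i - 1][0]
        if outages and ts - outages[-1][1] <= MERGE_WINDOW:
            # Another reboot within an hour: the machine is considered
            # down until this later reboot.
            outages[-1] = (outages[-1][0], ts)
        else:
            outages.append((down_start, ts))
    return outages
```

Note that this estimate can overstate down-time when a machine sat idle (logging nothing) for a while before it went down, since the preceding event may be much older than the actual shutdown.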
When a problem occurred, the average down-time was almost 2 hours. Fewer than 20% of the machines had up-times above 99.9%. Average up-time was 99.35%. About 10% of the machines had up-times worse than 90% (!). These availability numbers are based not on application availability, but on whether the machine is up. 40% of restarts occurred during working hours.
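To put these percentages in perspective, machine-level availability can be computed as the fraction of the observation period not covered by an outage. The sketch below assumes outages are given as `(down_start, up_time)` pairs; the function names are hypothetical and this is not taken from Ka98.

```python
from datetime import datetime, timedelta

def availability(period_start, period_end, outages):
    """Fraction of the observation period during which the machine was up.

    outages: list of (down_start, up_time) datetime pairs inside the period.
    """
    total = (period_end - period_start).total_seconds()
    down = sum((up - start).total_seconds() for start, up in outages)
    return 1.0 - down / total

def yearly_downtime_hours(avail):
    """Hours of down-time per 365-day year implied by an availability figure."""
    return (1.0 - avail) * 365 * 24
```

Under this definition, an average up-time of 99.35% corresponds to roughly 57 hours of down-time per machine per year, while 99.9% corresponds to just under 9 hours.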
The probable cause of a reboot is determined by analysing
1 hour of events prior to the reboot, looking for serious error messages
(e.g., a failed network connection). Operator error is not accounted
for. The probable causes are summarized in the following table:
|Normal Reboots/power-off (no indication of any problem)
|Crucial Application Failure
|Hardware or firmware problems
|Normal shutdowns (for maintenance or unknown
|Problem with a software component
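The one-hour lookback used to assign a probable cause might look something like the sketch below. The keyword rules, their priority order, and the function name are purely hypothetical; the summary does not say how Ka98 ranks or matches error messages.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(hours=1)  # window of events examined before each reboot

def probable_cause(events, boot_time, rules,
                   default="no indication of any problem"):
    """Pick a probable cause for the reboot logged at boot_time.

    events: list of (datetime, message) tuples for one machine.
    rules:  list of (keyword, cause) pairs in priority order (hypothetical).
    """
    window = [msg.lower() for ts, msg in events
              if boot_time - LOOKBACK <= ts < boot_time]
    for keyword, cause in rules:
        if any(keyword in msg for msg in window):
            return cause
    # Nothing serious was logged in the hour before the reboot.
    return default

# Illustrative rules only; real event-log messages will differ.
EXAMPLE_RULES = [
    ("disk", "hardware or firmware problem"),
    ("service terminated", "crucial application failure"),
]
```

Because only logged events are inspected, a reboot triggered by an operator mistake would fall into the default category here, which matches the caveat that operator error is not accounted for.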
By modeling each machine's behavior as a state machine and analysing its logs, Ka98 analyses the common transitions among a functional state, a restart state, and various failure states. This analysis finds that only 40% of reboots appeared to put a machine into a functional state. Other interesting results: 1) most problems that occurred in a functional machine were network problems and persisted across reboots, and 2) most disk errors did not lead to restarts or more serious failures; the machine seemed to continue functioning normally.
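One simple way to carry out this kind of state-machine analysis is to label each log segment with a state and count how often one state follows another. The sketch below assumes the labelling has already been done; the state names and the function name are illustrative, not Ka98's.

```python
from collections import Counter, defaultdict

def transition_probabilities(state_sequence):
    """Estimate P(next state | current state) from one observed sequence.

    state_sequence: list of state labels in time order, e.g. "functional",
    "reboot", "network_error" (hypothetical labels).
    """
    counts = defaultdict(Counter)
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values())
                for nxt, n in followers.items()}
        for state, followers in counts.items()
    }
```

With such a table, the share of reboots that lead back to a functional state is simply the entry for the transition from the restart state to the functional state.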
Additionally, Ka98 found that some failures did propagate or were otherwise
correlated across the network, though most seemed to be localized to individual
machines. Single points of failure (the Windows PDC and master browser)
could be reliability bottlenecks.
What is the cause of the discrepancies between Xu99 and Ka98? Average availability is 92% in Ka98, but >99% in Xu99. The major cause of outages in Ka98 is network problems, while Xu99's largest known cause is maintenance. Do these discrepancies arise because the event logs came from different sources? Or are they due to different analysis techniques?