Causes and impacts of failures and failure behaviors

Summary by Andy Huang of highlights of various papers, including:

Focus on reducing TBF due to maintenance and TTR for HW/SW system failures, because

  1. they are the leading contributors to total downtime,
  2. maintenance occurs frequently, but has a low MTTR [Xu99, Murph00, Murp95],
  3. system failures are rare, but have a very high MTTR [Xu99], and
  4. application failures occur just as often as system failures, but they have a much lower MTTR, and thus don't contribute much to total downtime [Xu99].
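
The reasoning in the list above comes down to simple arithmetic: a category's contribution to total downtime is roughly its event count multiplied by its mean time to repair. The sketch below illustrates this with made-up numbers. The counts and MTTR values are hypothetical, chosen only to mirror the qualitative claims (frequent but quick maintenance, rare but slow system failures, application failures as frequent as system failures but quick to repair); they are not taken from [Xu99] or the other cited papers, and the function names are illustrative.

```python
# Hypothetical figures that only mirror the qualitative claims in the list;
# they are NOT measurements from [Xu99] or any other cited paper.
EVENTS = {
    # category: (events per year, mean time to repair in hours)
    "maintenance": (50, 0.5),         # frequent, low MTTR
    "system failure": (5, 6.0),       # rare, high MTTR
    "application failure": (5, 0.2),  # as rare as system failures, low MTTR
}


def downtime_hours(events):
    """Yearly downtime per category, estimated as count * MTTR."""
    return {name: count * mttr for name, (count, mttr) in events.items()}


def downtime_share(events):
    """Fraction of total downtime attributable to each category."""
    hours = downtime_hours(events)
    total = sum(hours.values())
    return {name: h / total for name, h in hours.items()}


if __name__ == "__main__":
    shares = downtime_share(EVENTS)
    # Print categories from largest to smallest share of downtime.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {share:6.1%}")
```

With these assumed inputs, maintenance and system failures together account for almost all of the downtime, while application failures contribute only a small fraction, which is the pattern the list describes.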

What it means for using VMs for:

Causes of failures in various system types

Failure impact/behavior in various system types