Back to index
Safeware: System Safety and Computers
Nancy G. Leveson, University of Washington
One-line summary:
Nature of Risk: (likelihood) x (severity)
  -  Shift of responsibility from individual to groups/experts
       -  The shift from craftsman to assembly line (and the accompanying
          increase in the complexity of production systems) has engendered
          a shift in safety responsibility from employee to employer, and
          by extension, from members of the general public to "experts".
       -  "Whereas in the past, component failure was cited as the major
          factor in accidents, today more accidents result from dangerous
          design characteristics and interactions among components."
          (Paraphrased from Willie Hammer, Product Safety Management and
          Engineering.  Englewood Cliffs, NJ: Prentice-Hall, 1980.)

  -  Can't measure a system before it is built/designed, so risk
     assessment tries to quantify risk up front.  "Acceptable risk" is a
     threshold level, but it is often set without regard to the people who
     actually bear the risk.  Ex: the Pinto gas tank was known to be prone
     to rupture and fire in rear-end collisions, but the occasional
     burning car was treated as acceptable risk rather than pay the cost
     of reengineering.  (Toy sketch of the arithmetic below.)
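     A minimal sketch of the (likelihood) x (severity) arithmetic and the
     acceptable-risk cutoff.  The hazards, numbers, and threshold below
     are invented for illustration, not data from the book:

          # Up-front risk assessment: risk = likelihood x severity, compared
          # against an "acceptable risk" threshold.  The threshold itself is a
          # value judgment made by the assessor, often without consulting the
          # people who bear the risk.
          ACCEPTABLE_RISK = 0.01   # cutoff chosen by the assessor (illustrative)

          hazards = [
              # (description, likelihood per year, severity on a 0-1 scale)
              ("relief valve sticks closed", 0.001, 0.9),
              ("operator display freezes",   0.05,  0.3),
              ("sensor drifts out of range", 0.2,   0.1),
          ]

          for name, likelihood, severity in hazards:
              risk = likelihood * severity
              verdict = "acceptable" if risk <= ACCEPTABLE_RISK else "needs mitigation"
              print(f"{name:28s} risk={risk:.4f} -> {verdict}")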
  
  -  Role of computers and software
       -  Technology eliminates some hazards, but creates new ones that
          are more pervasive and harder to find and eliminate than the
          ones that were reduced.
       -  Computers make this worse by enabling interactive,
          tightly-coupled, error-prone designs: "by increasing the
          complexity of the processes that can be controlled, [computer
          programs] have increased the scope for the introduction of
          conventional errors" (Trevor Kletz, "Wise after the event,"
          Control and Instrumentation 20(10):57-59, Oct. 1988).
       -  Operators are often isolated from the physical system (for
          safety reasons, or because the scale of the system is
          incomprehensible to an individual, as on an oil rig) and must
          rely exclusively on indirect, physically unverifiable
          information, or are "out of the loop" until too late.  Ex: in
          1985 a China Airlines jet lost power in its right engine, but
          the autopilot automatically corrected; the crew didn't find out
          until it was too late and the autopilot could no longer
          compensate.
       -  We learn more from bridges that collapse than from those that
          stand.  The increased pace of change leads to decreased
          opportunity to learn from failure.  Empirical design rules
          derived by trial and error are being replaced by reliance on a
          priori hazard identification and reliable-system-building
          techniques.
       -  Software specifies only the logical steps to be achieved,
          abstracted away from how they are physically achieved (unlike
          mechanical systems).  Proven system-safety engineering
          techniques haven't been applied to software, nor can they be,
          because of the essence of the medium.
       -  SW "flexibility" encourages redefinition of tasks to shift
          responsibility from HW to SW late in the process.  Interfaces in
          HW systems tend to be simpler than SW ones because physical
          constraints make the costs of complex HW interfaces immediate
          and obvious.
       -  SW is logically brittle rather than physically brittle like HW,
          so it is harder to see how easily it can be broken, especially
          since partial success is trivial to achieve (compared to HW) and
          rarely indicative of complete success.

  -  Software myths exploded:
       -  The cost of computers is not lower than the cost of
          electromechanical/analog devices once maintenance is
          considered.  The Space Shuttle software is about 800 Kbytes but
          costs about $100M/yr to maintain.
       -  Software is not easy to change if safety is to be preserved.
          Also, software errors tend to lurk dormant for a long time --
          there is no "infant mortality" for software, so the belief that
          safety programs can be "phased out" after a few years is
          fundamentally misguided.  (There is no way to measure the number
          of accidents a good program has avoided.)
       -  Increasing SW reliability will not increase safety: often the SW
          is working as designed, and the hazard comes from unexpected
          component interactions.  "Safety is a system property, not a
          software property."  SW is considered safety-critical only if a
          SW failure alone can lead to an accident, but in real accidents
          this is rarely what happens.  (See the sketch after this list.)
       -  Reusing software does not increase safety: it may actually
          decrease it, because of the complacency it engenders.  Also,
          most SW is specially constructed for its application, and
          reusing it elsewhere is probably dangerous (Therac-6 SW
          ported to the Therac-25).
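     A hedged sketch of the "reliability is not safety" point above: the
     controller below meets its specification exactly (it is perfectly
     "reliable"), yet the system is still hazardous because the spec
     assumed a working sensor.  The scenario and numbers are invented,
     not from the book:

          PRESSURE_LIMIT = 100.0   # spec: open the relief valve at/above this reading

          def relief_valve_open(reported_pressure):
              """Requirement: command the relief valve open iff the *reported*
              pressure is at or above PRESSURE_LIMIT.  This code satisfies the
              requirement exactly -- it never "fails"."""
              return reported_pressure >= PRESSURE_LIMIT

          # A stuck sensor keeps reporting 42.0 while the real pressure climbs.
          real_pressure = [90.0, 105.0, 130.0]
          stuck_sensor  = [42.0, 42.0, 42.0]

          for actual, reported in zip(real_pressure, stuck_sensor):
              valve = relief_valve_open(reported)
              hazardous = actual >= PRESSURE_LIMIT and not valve
              print(f"actual={actual:6.1f} reported={reported:5.1f} "
                    f"valve_open={valve} hazardous={hazardous}")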
       
 
  -  Hierarchical view of accident causation: mechanisms ("the car hit
     the tree"), conditions ("it was raining and the car didn't have
     antilock brakes"), and root causes.
       -  Root causes: inherent weaknesses that can also contribute to
          future accidents.  Important because accidents rarely repeat
          themselves in the same manifestation, so addressing root causes
          matters more than "patching".  (E.g., the Mars Pathfinder
          thread-priority inversion; toy sketch below.)  "Often one finds
          triple locks on those doors from which horses have been stolen,
          while other doors are wide open."
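     A toy scheduler sketch of the priority-inversion root cause mentioned
     above (not the actual Pathfinder/VxWorks code; the tasks, priorities,
     and tick counts are invented).  Without priority inheritance the
     medium task starves the low-priority lock holder, so the
     high-priority task never runs; enabling inheritance removes the root
     cause rather than patching a symptom:

          HIGH, MED, LOW = 3, 2, 1

          def simulate(priority_inheritance, ticks=8):
              """Each tick, run the highest-priority *runnable* task."""
              lock_held_by_low = True   # scenario: low grabbed the shared mutex first
              low_remaining = 3         # ticks low still needs inside the critical section
              high_remaining = 1        # ticks high needs once it can take the mutex
              trace = []
              for t in range(ticks):
                  high_ready = t >= 1 and high_remaining > 0
                  high_blocked = high_ready and lock_held_by_low
                  low_prio = LOW
                  if priority_inheritance and high_blocked:
                      low_prio = HIGH   # low inherits the priority of the task it blocks
                  runnable = []
                  if high_ready and not high_blocked:
                      runnable.append((HIGH, "high"))
                  if t >= 1:
                      runnable.append((MED, "medium"))   # CPU-bound, never blocks
                  if low_remaining > 0:
                      runnable.append((low_prio, "low"))
                  _, winner = max(runnable)
                  trace.append(winner)
                  if winner == "low":
                      low_remaining -= 1
                      if low_remaining == 0:
                          lock_held_by_low = False       # release the mutex
                  elif winner == "high":
                      high_remaining -= 1
              return trace

          print("no inheritance:  ", simulate(False))  # high never gets to run
          print("with inheritance:", simulate(True))   # low finishes, then high runs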
         
       -  Most accidents involve two or more low-probability events
          happening in the worst possible combination.
       -  Overreliance on redundancy assumes the low-probability failures
          are independent, which they often are not.  Ex: in the
          Challenger accident, the primary O-ring's failure created the
          conditions that led to failure of the secondary (backup)
          O-ring.  (Quick arithmetic below.)
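     Quick arithmetic on why redundancy only buys what the independence
     assumption delivers (the numbers are illustrative, not Challenger
     data):

          p = 0.001                       # chance a single seal fails on a given flight

          independent_both_fail = p * p   # what the redundancy argument assumes
          print(f"independent assumption: {independent_both_fail:.1e}")   # 1.0e-06

          # Common-cause coupling: if the primary's failure creates conditions
          # under which the backup also fails, say, 30% of the time, the joint
          # probability is dominated by that conditional term:
          p_backup_given_primary_failed = 0.3
          coupled_both_fail = p * p_backup_given_primary_failed
          print(f"with common cause:      {coupled_both_fail:.1e}")       # 3.0e-04, ~300x larger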
         
       -  Overreliance on redundancy for a poorly-designed device is worse
          than no device at all, since it engenders complacency.
       -  Technologies are developed because they have clear and
          quantifiable benefits, but the risks are often ambiguous and
          elusive, so up-front hazard management is often replaced by
          "downstream protection" (adding safety to a completed design).
          Note the similar situation in computer/network security!
       
 
  -  Organizational factors
       -  Large engineered systems usually reflect the structure,
          management, culture, etc. of the organization that created them.
       -  Other factors: diffusion of responsibility and authority in
          large organizations; lack of independence or lowly status of
          safety personnel, or requiring them to report to the
          organization whose safety is being checked.
       -  Need a reference channel that communicates safety goals/policies
          downstream, and an independent measuring channel that
          communicates the actual state of affairs upstream.
 
 
Intro. System Safety/Definitions and Models
Elements of a Safeware Program