Back to index
Safeware: System Safety and Computers
Nancy G. Leveson, University of Washington
One-line summary:
Nature of Risk: (likelihood) x (severity)
  -  Shift of responsibility from individual to groups/experts
       -  The shift from craftsman to assembly line (and the accompanying
          increase in the complexity of production systems) has engendered
          a shift in safety responsibility from employee to employer, and
          by extension, from members of the general public to "experts".
       -  "Whereas in the past, component failure was cited as the major
          factor in accidents, today more accidents result from dangerous
          design characteristics and interactions among components."
          (Paraphrased from Willie Hammer, Product Safety Management and
          Engineering.  Englewood Cliffs, NJ: Prentice-Hall, 1980.)

  -  Can't measure a system before it is built/designed, so risk
     assessment tries to quantify risk up front.  "Acceptable risk" is a
     threshold level, but it is often set without regard to the people who
     actually bear the risk.  Ex: the Pinto gas tank was known to be prone
     to rupture and fire in rear-end collisions, but the occasional
     burning car was treated as acceptable risk rather than pay the cost
     of reengineering.  (Toy sketch of the arithmetic below.)
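     A minimal sketch of the (likelihood) x (severity) arithmetic and the
     acceptable-risk cutoff.  The hazards, numbers, and threshold below
     are invented for illustration, not data from the book:

          # Up-front risk assessment: risk = likelihood x severity, compared
          # against an "acceptable risk" threshold.  The threshold itself is a
          # value judgment made by the assessor, often without consulting the
          # people who bear the risk.
          ACCEPTABLE_RISK = 0.01   # cutoff chosen by the assessor (illustrative)

          hazards = [
              # (description, likelihood per year, severity on a 0-1 scale)
              ("relief valve sticks closed", 0.001, 0.9),
              ("operator display freezes",   0.05,  0.3),
              ("sensor drifts out of range", 0.2,   0.1),
          ]

          for name, likelihood, severity in hazards:
              risk = likelihood * severity
              verdict = "acceptable" if risk <= ACCEPTABLE_RISK else "needs mitigation"
              print(f"{name:28s} risk={risk:.4f} -> {verdict}")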
  
  -  Role of computers and software
       -  Technology eliminates some hazards, but creates new ones that
          are more pervasive and harder to find and eliminate than the
          ones that were reduced.
       -  Computers make this worse by enabling interactive,
          tightly-coupled, error-prone designs: "by increasing the
          complexity of the processes that can be controlled, [computer
          programs] have increased the scope for the introduction of
          conventional errors" (Trevor Kletz, "Wise after the event,"
          Control and Instrumentation 20(10):57-59, Oct. 1988).
       -  Operators are often isolated from the physical system (for
          safety reasons, or because the scale of the system is
          incomprehensible to an individual, as on an oil rig) and must
          rely exclusively on indirect, physically unverifiable
          information, or are "out of the loop" until too late.  Ex: in
          1985 a China Airlines jet lost power in its right engine, but
          the autopilot automatically corrected; the crew didn't find out
          until it was too late and the autopilot could no longer
          compensate.
       -  We learn more from bridges that collapse than from those that
          stand.  The increased pace of change leads to decreased
          opportunity to learn from failure.  Empirical design rules
          derived by trial and error are being replaced by reliance on a
          priori hazard identification and reliable-system-building
          techniques.
       -  Software specifies only the logical steps to be achieved,
          abstracted away from how they are physically achieved (unlike
          mechanical systems).  Proven system-safety engineering
          techniques haven't been applied to software, nor can they be,
          because of the essence of the medium.
       -  SW "flexibility" encourages redefinition of tasks to shift
          responsibility from HW to SW late in the process.  Interfaces in
          HW systems tend to be simpler than SW ones because physical
          constraints make the costs of complex HW interfaces immediate
          and obvious.
       -  SW is logically brittle rather than physically brittle like HW,
          so it is harder to see how easily it can be broken, especially
          since partial success is trivial to achieve (compared to HW) and
          rarely indicative of complete success.

  -  Software myths exploded:
       -  The cost of computers is not lower than the cost of
          electromechanical/analog devices once maintenance is
          considered.  The Space Shuttle software is about 800 Kbytes but
          costs about $100M/yr to maintain.
       -  Software is not easy to change if safety is to be preserved.
          Also, software errors tend to lurk dormant for a long time --
          there is no "infant mortality" for software, so the belief that
          safety programs can be "phased out" after a few years is
          fundamentally misguided.  (There is no way to measure the number
          of accidents a good program has avoided.)
       -  Increasing SW reliability will not increase safety: often the SW
          is working as designed, and the hazard comes from unexpected
          component interactions.  "Safety is a system property, not a
          software property."  SW is considered safety-critical only if a
          SW failure alone can lead to an accident, but in real accidents
          this is rarely what happens.  (See the sketch after this list.)
       -  Reusing software does not increase safety: it may actually
          decrease it, because of the complacency it engenders.  Also,
          most SW is specially constructed for its application, and
          reusing it elsewhere is probably dangerous (Therac-6 SW
          ported to the Therac-25).
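     A hedged sketch of the "reliability is not safety" point above: the
     controller below meets its specification exactly (it is perfectly
     "reliable"), yet the system is still hazardous because the spec
     assumed a working sensor.  The scenario and numbers are invented,
     not from the book:

          PRESSURE_LIMIT = 100.0   # spec: open the relief valve at/above this reading

          def relief_valve_open(reported_pressure):
              """Requirement: command the relief valve open iff the *reported*
              pressure is at or above PRESSURE_LIMIT.  This code satisfies the
              requirement exactly -- it never "fails"."""
              return reported_pressure >= PRESSURE_LIMIT

          # A stuck sensor keeps reporting 42.0 while the real pressure climbs.
          real_pressure = [90.0, 105.0, 130.0]
          stuck_sensor  = [42.0, 42.0, 42.0]

          for actual, reported in zip(real_pressure, stuck_sensor):
              valve = relief_valve_open(reported)
              hazardous = actual >= PRESSURE_LIMIT and not valve
              print(f"actual={actual:6.1f} reported={reported:5.1f} "
                    f"valve_open={valve} hazardous={hazardous}")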
       
 
  -  Hierarchical view of accident causation: mechanisms ("the car hit
     the tree"), conditions ("it was raining and the car didn't have
     antilock brakes"), and root causes.
       -  Root causes: inherent weaknesses that can also contribute to
          future accidents.  Important because accidents rarely repeat
          themselves in the same manifestation, so addressing root causes
          matters more than "patching".  (E.g., the Mars Pathfinder
          thread-priority inversion; toy sketch below.)  "Often one finds
          triple locks on those doors from which horses have been stolen,
          while other doors are wide open."
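     A toy scheduler sketch of the priority-inversion root cause mentioned
     above (not the actual Pathfinder/VxWorks code; the tasks, priorities,
     and tick counts are invented).  Without priority inheritance the
     medium task starves the low-priority lock holder, so the
     high-priority task never runs; enabling inheritance removes the root
     cause rather than patching a symptom:

          HIGH, MED, LOW = 3, 2, 1

          def simulate(priority_inheritance, ticks=8):
              """Each tick, run the highest-priority *runnable* task."""
              lock_held_by_low = True   # scenario: low grabbed the shared mutex first
              low_remaining = 3         # ticks low still needs inside the critical section
              high_remaining = 1        # ticks high needs once it can take the mutex
              trace = []
              for t in range(ticks):
                  high_ready = t >= 1 and high_remaining > 0
                  high_blocked = high_ready and lock_held_by_low
                  low_prio = LOW
                  if priority_inheritance and high_blocked:
                      low_prio = HIGH   # low inherits the priority of the task it blocks
                  runnable = []
                  if high_ready and not high_blocked:
                      runnable.append((HIGH, "high"))
                  if t >= 1:
                      runnable.append((MED, "medium"))   # CPU-bound, never blocks
                  if low_remaining > 0:
                      runnable.append((low_prio, "low"))
                  _, winner = max(runnable)
                  trace.append(winner)
                  if winner == "low":
                      low_remaining -= 1
                      if low_remaining == 0:
                          lock_held_by_low = False       # release the mutex
                  elif winner == "high":
                      high_remaining -= 1
              return trace

          print("no inheritance:  ", simulate(False))  # high never gets to run
          print("with inheritance:", simulate(True))   # low finishes, then high runs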
         
       -  Most accidents involve two or more low-probability events
          happening in the worst possible combination.
       -  Overreliance on redundancy assumes the low-probability failures
          are independent, which they often are not.  Ex: in the
          Challenger accident, the primary O-ring's failure created the
          conditions that led to failure of the secondary (backup)
          O-ring.  (Quick arithmetic below.)
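     Quick arithmetic on why redundancy only buys what the independence
     assumption delivers (the numbers are illustrative, not Challenger
     data):

          p = 0.001                       # chance a single seal fails on a given flight

          independent_both_fail = p * p   # what the redundancy argument assumes
          print(f"independent assumption: {independent_both_fail:.1e}")   # 1.0e-06

          # Common-cause coupling: if the primary's failure creates conditions
          # under which the backup also fails, say, 30% of the time, the joint
          # probability is dominated by that conditional term:
          p_backup_given_primary_failed = 0.3
          coupled_both_fail = p * p_backup_given_primary_failed
          print(f"with common cause:      {coupled_both_fail:.1e}")       # 3.0e-04, ~300x larger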
         
       -  Overreliance on redundancy for a poorly-designed device is worse
          than no device at all, since it engenders complacency.
       -  Technologies are developed because they have clear and
          quantifiable benefits, but the risks are often ambiguous and
          elusive, so up-front hazard management is often replaced by
          "downstream protection" (adding safety to a completed design).
          Note the similar situation in computer/network security!
       
 
  -  Organizational factors
       -  Large engineered systems usually reflect the structure,
          management, culture, etc. of the organization that created them.
       -  Other factors: diffusion of responsibility and authority in
          large organizations; lack of independence or lowly status of
          safety personnel, or requiring them to report to the
          organization whose safety is being checked.
       -  Need a reference channel that communicates safety goals/policies
          downstream, and an independent measuring channel that
          communicates the actual state of affairs upstream.
 
 
Intro. System Safety/Definitions and Models
Elements of a Safeware Program