Safeware: System Safety and Computers
Nancy G. Leveson, University of Washington
One-line summary: Safety is a system property, not a component
property; computers and software eliminate some hazards but introduce
new ones that are harder to find, so safety must be engineered into
the system up front rather than added to a completed design.
Nature of Risk: risk = (likelihood of a hazard occurring) x (severity of its consequences)
- Shift of responsibility from individual to groups/experts
- Shift from craftsman to assembly line (and the accompanying increase
in complexity of production systems) has engendered a shift in
safety responsibility from employee to employer and, by
extension, from members of the general public to "experts".
- "Whereas in the past, component failure
was cited as the major factor in accidents, today more accidents
result from dangerous design characteristics and interactions
among components". (Paraphrased from Willie Hammer, Product
Safety Management and Engineering. Englewood Cliffs, NJ:
Prentice-Hall, 1980.)
- Can't measure a system before it's built/designed, so risk
assessment tries to quantify risk up front. Acceptable
risk is a threshold level, but it is often set without regard to
the people who actually bear the risk and are supposed to benefit
from the protection. Ex: Pinto gas tanks were known to be prone
to rupture and catch fire in rear-end collisions, but the
occasional burning car was considered an acceptable risk rather
than paying the cost of reengineering. (See the sketch below.)
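A minimal sketch, in C, of the kind of up-front threshold comparison
described in the item above. The hazard name, probability, cost, and
"acceptable" threshold are all hypothetical numbers chosen only to
illustrate risk = likelihood x severity and the question of who gets
to set the threshold:

    #include <stdio.h>

    /* Illustrative only: all names and numbers are hypothetical. */
    struct hazard {
        const char *name;
        double likelihood_per_year;   /* estimated probability of occurrence */
        double severity_cost;         /* estimated loss if it occurs */
    };

    int main(void) {
        struct hazard h = { "fuel tank rupture in rear-end collision",
                            1e-5, 2.0e6 };
        double acceptable_risk = 50.0;   /* hypothetical threshold, per vehicle-year */

        /* risk = likelihood x severity, as defined at the top of these notes */
        double risk = h.likelihood_per_year * h.severity_cost;
        printf("%s: risk = %.2f per vehicle-year (threshold %.2f)\n",
               h.name, risk, acceptable_risk);

        /* The book's point: the threshold is usually chosen by the designer
           or manufacturer, not by the people exposed to the risk. */
        if (risk <= acceptable_risk)
            printf("judged 'acceptable' -- but acceptable to whom?\n");
        else
            printf("exceeds threshold: eliminate or control the hazard\n");
        return 0;
    }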
- Role of computers and software
- Technology eliminates some hazards, but creates some new ones
that are more pervasive and harder to find and eliminate than the
ones that were reduced.
- Computers make this worse by enabling interactive,
tightly-coupled, error-prone designs:
"by increasing the complexity of the processes that can be
controlled, [computer programs] have increased the scope for the
introduction of conventional errors" (Trevor Kletz: Wise after
the event, in Control and Instrumentation 20(10):57-59,
Oct. 1988.)
- Operators are often isolated from the physical system (for safety
reasons, or because the scale of the system is
incomprehensible to individual operators, as in the case of an
oil rig) and must rely exclusively on indirect, physically
unverifiable information, or they are "out of the loop" until
it is too late. Ex: in 1985 a China Airlines jet lost power in
its right engine, but the autopilot automatically compensated;
the crew didn't find out until it was too late (the autopilot
could no longer correct).
- We learn more from bridges that collapse than from those
that stand. Increased pace of change leads to decreased
opportunity to
learn from failure. Empirical design rules derived by
trial and error are being replaced by reliance on a
priori hazard identification and
reliable-system-building techniques.
- Software emphasizes only the logical steps to be achieved, without
the physical constraints on how they are achieved (unlike mechanical
systems). Proven system-safety engineering techniques haven't
been applied to software, nor can many of them be, because of the
nature of the medium.
- SW "flexibility" encourages redefinition of tasks to shift
responsibility from HW to SW late in the process.
Interfaces in HW systems tend to be simpler than SW ones
because physical constraints make the costs of complex HW
interfaces immediate and obvious.
- SW is logically brittle (not physically brittle like HW),
so it is harder to see how easily it can be broken, especially
since partial success is trivial to achieve (compared to
HW) and rarely indicative of complete success.
- Software myths exploded:
- The cost of computers is not lower than the cost of
electromechanical/analog devices once maintenance is
considered. The Space Shuttle software is only about 800 Kbytes
but costs about $100M/year to maintain.
- Software is not easy to change if safety is to be
preserved. Moreover, software errors tend to lie dormant for
a long time -- there is no "infant mortality" period for
software -- so the belief that safety programs can be
"phased out" after a few years is fundamentally
misguided. (There is no way to measure the number of
accidents a good safety program has avoided.)
- Increasing SW reliability will not increase safety:
often the SW is working exactly as designed, and the hazard
arises from unexpected interactions among components. "Safety is a
system property, not a software property." SW is considered
safety-critical only if a SW failure alone can lead to an
accident, but that is rarely what happens in real
accidents. (See the sketch below.)
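A tiny illustration of "safety is a system property": each function
below is correct against its own specification, and the hazard exists
only in the interaction between them. The scenario (a unit mismatch
between two components) is hypothetical and not an example from the
book:

    #include <stdio.h>

    /* Component A's spec: return the impulse of a burn in pound-force-seconds. */
    double burn_impulse_lbf_s(double thrust_lbf, double seconds) {
        return thrust_lbf * seconds;      /* correct per A's spec */
    }

    /* Component B's spec: return the velocity change for an impulse
       given in newton-seconds. */
    double velocity_change_m_s(double impulse_N_s, double mass_kg) {
        return impulse_N_s / mass_kg;     /* correct per B's spec */
    }

    int main(void) {
        double impulse = burn_impulse_lbf_s(100.0, 30.0);   /* 3000 lbf*s */

        /* System-level error: B is fed lbf*s where its spec says N*s.
           Neither function has a bug; the unsafe behavior exists only at
           the system level. */
        double dv_wrong = velocity_change_m_s(impulse, 500.0);
        double dv_right = velocity_change_m_s(impulse * 4.448, 500.0); /* 1 lbf ~ 4.448 N */

        printf("computed delta-v: %.2f m/s (true value: %.2f m/s)\n",
               dv_wrong, dv_right);
        return 0;
    }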
- Reusing software does not increase safety: it may
actually decrease it, because of the complacency it engenders.
Also, most SW is specially constructed for its application,
and reusing it elsewhere is probably dangerous (e.g., Therac-6 SW
ported to the Therac-25).
- Hierarchical view of accident causation: mechanisms ("the car hit
the tree"), conditions ("it was raining and the car didn't have
antilock brakes"), and root causes.
- Root causes: inherent weaknesses that can also contribute to
future accidents. Important because accidents rarely
repeat themselves in the same manifestation, so addressing
root causes is more important than "patching" the specific
symptom (e.g., the Mars Pathfinder thread-priority inversion;
see the sketch below). "Often one finds
triple locks on those doors from which horses have been
stolen, while other doors are wide open."
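The Pathfinder case is a standard example of fixing the root cause
rather than patching the symptom: instead of just working around the
resets, priority inheritance was enabled on the shared mutex, so a
low-priority task holding it is temporarily boosted to the priority of
the task waiting for it. A minimal POSIX sketch of that mechanism
(Pathfinder itself ran VxWorks, not pthreads, and the mutex name here
is hypothetical):

    #define _XOPEN_SOURCE 700
    #include <pthread.h>
    #include <stdio.h>

    /* Create a mutex with the priority-inheritance protocol, so a
       high-priority thread blocked on it cannot be starved by
       medium-priority threads preempting the low-priority holder. */
    int make_pi_mutex(pthread_mutex_t *m) {
        pthread_mutexattr_t attr;
        int err;

        if ((err = pthread_mutexattr_init(&attr)) != 0)
            return err;
        if ((err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)) != 0)
            return err;
        err = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return err;
    }

    int main(void) {
        pthread_mutex_t bus_mutex;   /* hypothetical shared-resource mutex */
        if (make_pi_mutex(&bus_mutex) != 0) {
            fprintf(stderr, "priority inheritance not available here\n");
            return 1;
        }
        puts("mutex created with priority inheritance");
        pthread_mutex_destroy(&bus_mutex);
        return 0;
    }

(Compile with cc -pthread.)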
- Most accidents involve two or more low-probability events
happening in the worst possible combination.
- Overreliance on redundancy against low-probability events that
turn out not to be independent. Ex: in the Challenger
accident, the primary O-ring failure created the conditions that
led to failure of the secondary (backup) O-ring. (See the sketch
below.)
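A small numeric illustration of how the independence assumption
behind redundancy can collapse; the probabilities are made up for
illustration and are not Challenger data:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical numbers, for illustration only. */
        double p_primary = 1e-3;             /* P(primary seal fails) */
        double p_backup  = 1e-3;             /* P(backup seal fails) on its own */
        double p_backup_given_primary = 0.5; /* backup exposed to the same
                                                conditions once the primary fails */

        /* The redundancy argument assumes independence: */
        double p_both_independent = p_primary * p_backup;

        /* With a common cause, the failures are coupled: */
        double p_both_coupled = p_primary * p_backup_given_primary;

        printf("assumed (independent): %.1e\n", p_both_independent);  /* 1.0e-06 */
        printf("actual  (coupled):     %.1e\n", p_both_coupled);      /* 5.0e-04 */
        printf("underestimate factor:  %.0fx\n",
               p_both_coupled / p_both_independent);                  /* 500x */
        return 0;
    }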
- Overreliance on redundancy for a poorly-designed device: worse
than no device at all, since it engenders complacency.
- Technologies are developed because they have clear and
quantifiable benefits, but their risks are often ambiguous
and elusive, so up-front hazard management tends to be
replaced by "downstream protection" (adding safety to a
completed design). Note the similar situation in
computer/network security!
- Organizational factors
- Large engineered systems usually reflect the structure,
management, culture, etc. of the organization that created
them.
- Other factors: diffusion of responsibility and authority
in large organizations; lack of independence or lowly
status of safety personnel, or requiring them to report to
the organization whose safety is being checked.
- Need a
reference channel that communicates safety
goals/policies downstream, and an independent measuring
channel that communicates actual state of affairs
upstream.