Accidents Waiting to Happen
Why do complex systems fail when they shouldn’t? Why did the Challenger explode? Why did
Columbia disintegrate? Why were large areas of the northeastern
United States blacked out in 1965, and again in 2003?
Technology historians offer two explanations: 1) designers are unaware of
what caused predecessor systems to fail; or 2) they are aware,
but are willing to accept risks based on a succession of prior
When the Tacoma Narrows
suspension bridge (“Galloping Gertie”) collapsed in 1940 from the
effects of aerodynamic phenomena, it was widely reported that the
bridge had been perfectly designed for static loads; it was simply
that unanticipated effects of wind caused its demise. In fact, an
engineering professor at the University of Washington recalled in
1949 that 10 suspension bridges had been destroyed or severely
damaged by wind between 1818 and 1889.
Brooklyn Bridge designer
John Roebling noted in 1841 that “storms are unquestionably the
greatest enemies of suspension bridges.” Fourteen years later he
wrote, “the catalogue of disastrous [suspension bridge] failures
is now large enough to warn against light fabrics, suspended to be
blown down, as it were, in defiance of the elements.” So,
historians believe, the Tacoma bridge disaster should have come as
no surprise and could have been averted. Roebling, they say,
learned from studying failures; the designers of the Tacoma
Narrows Bridge did not. Were they unaware of the suspension bridge
failures of the 1800s, or did they consider them too remote for
In the case of the
Challenger, a potential failure mechanism in the form of a
propulsion-rocket joint seal that would deteriorate during launch
was well known. But successive launches in which NASA
“got by” culminated in the 1986 explosion in which the Challenger
and its crew were lost. For many NASA engineers, the cause was
easy to guess. In her post-accident analysis, author Diane Vaughan
labeled it a case of the “normalization of deviance.” That is, NASA
engineers and management were lulled into complacency and some
hubris by a string of well-publicized, successful launches.
After each flight, they’d analyze the degree of joint degradation,
but did not eliminate the fault mechanism. They accepted it as
When the Columbia
disintegrated in February 2003, the cause was found to be another
known fault mechanism: chunks of foam insulation from
the external tank striking the spacecraft. This time, 81
seconds into the launch, a chunk of foam destroyed enough of the
vehicle’s thermal protection shield to cause burning and melting
and ultimate break-up of the craft during re-entry. The lead
accident investigator said his task force had determined that
“NASA is not a learning organization. They do not learn from their
mistakes.” There was no way to monitor any damage to the
spacecraft during its mission, nor repair it in flight. The
commission called for both in its final report.
After the great Northeast
blackout of 1965, the public was assured that such an event could
not happen again. Failures would be localized and the offending
power company isolated quickly from the grid. That it happened
again was a surprise to the public and grist for politicians and
the media, but perhaps not that much of a surprise to power
engineers. They may have known where the weaknesses lay, but felt
powerless in the face of any risk-benefit analysis to do much
about it. At this writing, the cause of the 2003 blackout has not been fully determined.
Some suggest the grid was not sufficiently automated, others that
it was too automated.
There is a theory that
catastrophes in large systems or major engineering projects occur
cyclically. One study revealed a 30-year interval between major
bridge disasters in the United States, an interval thought to be
the result of “a communication gap between one generation of
engineers and the next.” If valid, this suggests that only the
information on successful designs is passed along from one
generation to the next, while information about what didn’t work
and why is not. We can guess why this might be so. We all like to
be recognized for our successes, not our failures. Published
papers seldom recount failed approaches to an ultimately
successful design. Design and development today are commonly done
in teams; the consensus vote may be to emphasize successes only.
And corporate patent attorneys may prefer to suggest that the path
to any successful invention be seen as straightforward, not strewn
with miscues or failed approaches. Finally, a company may need to
consider pending or potential lawsuits.
In his book Design
Paradigms, Duke University engineering professor Henry
Petroski encourages designers to study past failures. In
contemplating new systems or even incremental steps forward in
existing systems, the designer should ask how failure can occur
and what design changes can obviate that failure mode without
introducing another. Petroski warns that once a design becomes
accepted, incremental design extrapolation in succeeding
generations tends to be the norm and first principles tend to be
overlooked. And design decisions that may prove critical are given
to less experienced engineers, further fostering the generation
gap mentioned earlier.
It strikes me that the
likelihood of accidents waiting to happen, and actually happening,
will only increase as our systems become more complex.
There may be greater hope for “boundable” systems, like a bridge
or a space vehicle, but I worry most about systems that are
“universal” and expansive, yet consist of numerous, potentially
fallible, semi-independent pieces, like a power grid or,
worse yet, the Internet.
The burden will rest on
future hardware and software designers to prove my projection
wrong. I hope they can.
For more about design
J. A. Roebling, “Remarks on
Suspension Bridges . . .,” American Railroad Journal and
Mechanics Magazine, Vol. 6, 1841.
F. B. Farquharson,
“Aerodynamic Stability of Suspension Bridges, with Special
Reference to the Tacoma Narrows Bridge,” University of Washington
Structural Research Laboratory, 1949.
D. Vaughan, The
Challenger Launch Decision, The University of Chicago Press,
T. E. Bell (ed.), “Special
Report: Managing Risk in Large Complex Systems,” IEEE Spectrum,
T. E. Bell and Karl Esch,
“The Fatal Flaw in Flight 51-L,” IEEE Spectrum, February
H. E. McCurdy, “The Decay
of NASA’s Technical Culture,” Space Policy, November 1989.
Columbia Accident Investigation Board Final Report, August 2003 (www.caib.us).
H. Petroski, Design
Paradigms: Case Histories of Error and Judgment in Engineering,
Cambridge University Press, 1994.
Christiansen is the former editor and publisher of IEEE
Spectrum and an independent publishing consultant. He can be
reached at firstname.lastname@example.org.