The Space Shuttle Challenger explosion.
It’s an unfortunate truth in quality assurance and control (QA/QC) that the only time most people recognize the importance of the job is after something has gone horribly wrong. Like air traffic controllers and IT departments, quality professionals tend to garner the most attention when disaster strikes and everyone is looking for someone to blame.
But some of the biggest engineering failures in history had nothing to do with QA/QC. If you’re looking for examples, NASA has plenty.
1. Space Shuttle Challenger
On Jan. 28, 1986, the Space Shuttle Challenger broke apart 73 seconds into its flight, resulting in the deaths of all seven crew members. The spacecraft’s disintegration was caused by the failure of an O-ring seal in its right solid rocket booster at liftoff, which took place under unusually cold conditions for which the shuttle had not been certified.
In July 1985, Roger Boisjoly, an engineer at booster manufacturer Morton Thiokol, wrote a letter to the company's vice president of engineering that anticipated the disaster.
Boisjoly's letter to Morton Thiokol's vice president of engineering. (Image courtesy of Letters of Note.)
“The mistakenly accepted position on the joint problem was to fly without fear of failure and to run a series of design evaluations which would ultimately lead to a solution or at least a significant reduction of the erosion problem,” wrote Roger Boisjoly. “This position is now drastically changed as a result of the SRM 16A nozzle joint erosion which eroded a secondary O-ring with the primary O-ring never sealing.”
“If the same scenario should occur in a field joint (and it could), then it is a jump ball as to the success or failure of the joint because the secondary O-ring cannot respond to the clevis opening rate and may not be capable of pressurization. The result would be a catastrophe of the highest order - loss of human life.”
Boisjoly’s warning went unheeded.
He later published a report entitled Ethical Decisions – Morton Thiokol and the Challenger Disaster. In it, Boisjoly describes the engineering presentation made during a teleconference between Morton Thiokol, the Kennedy Space Center and the Marshall Space Flight Center the night before the launch.
“[After the presentation,] Joe Kilminster [vice president of the rocket booster program] asked for a five-minute, off-line caucus to re-evaluate the data and as soon as the mute button was pushed, our general manager, Jerry Mason, said in a soft voice, ‘We have to make a management decision.’ I became furious when I heard this, because I sensed that an attempt would be made by executive-level management to reverse the no-launch decision,” wrote Boisjoly.
Ice on the launch tower hours before the Challenger's launch.
Boisjoly's intuition proved correct, but the moral of this story is not about a team of engineers being overridden by a single manager. Boisjoly emphasized that it was the managerial caucus, responding to NASA's pressure to launch, that led to the reversal of the recommendation.
“The caucus constituted the unethical decision-making forum resulting from intense customer intimidation,” Boisjoly wrote. “NASA placed MTI in the position of proving that it was not safe to fly instead of proving that it was safe to fly. Also, note that NASA immediately accepted the new decision to launch because it was consistent with their desires and please note that no probing questions were asked.”
This disaster didn't happen because of a faulty O-ring. The part performed exactly according to its design specifications, and if the shuttle had been launched on a warmer day, it wouldn't have been a problem. In other words, the Challenger was not a QA/QC failure.
Engineers are under enormous pressure to complete jobs under budget and ahead of schedule, which can lead to the temptation to think like a manager first and an engineer second—or not at all. The Challenger represents the severe consequences of giving in to that temptation.
2. Mars Climate Orbiter
Artist's rendering of the Mars Climate Orbiter.
Sometimes the stakes aren’t quite as high as the loss of human life.
In the case of the Mars Climate Orbiter, the cost was $327.6 million. The robotic space probe was launched on Dec. 11, 1998, to study the Martian climate, atmosphere and surface changes and to serve as a communications relay for the Mars Polar Lander.
On Sept. 23, 1999, communication with the spacecraft was lost during its orbital insertion.
An investigation revealed that the spacecraft approached Mars at a far lower altitude than the intended 150-170 km. Post-failure calculations showed that its trajectory would have taken it within 57 km of the surface, where it most likely disintegrated from atmospheric stresses.
The cause of the error turned out to be a mismatch between Lockheed Martin's ground software, which produced results in U.S. customary units, and NASA's navigation software, which expected metric units. Consequently, outputs in pound-force seconds were read as if they were in newton-seconds.
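A short calculation shows how large that error was. By definition, 1 lbf equals 0.45359237 kg × 9.80665 m/s², or about 4.448 N, so any impulse written out in pound-force seconds but read as newton-seconds is understated by a factor of roughly 4.45. The Python sketch below is illustrative only: the function name and the input value are hypothetical and are not taken from either team's software.

```python
# Exact definition of the pound-force: 1 lbf = 0.45359237 kg * 9.80665 m/s^2
LBF_TO_N = 0.45359237 * 9.80665  # about 4.4482216 N per lbf


def lbf_s_to_n_s(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds (hypothetical helper)."""
    return impulse_lbf_s * LBF_TO_N


# Hypothetical impulse value, written out by the producer in lbf*s
reported = 1.0
misread_as_n_s = reported            # what the consumer assumed the number meant
actual_n_s = lbf_s_to_n_s(reported)  # what the number actually represented

print(f"understated by a factor of {actual_n_s / misread_as_n_s:.3f}")
```

Running it prints `understated by a factor of 4.448`. The ratio does not depend on the value chosen, which is why a missing conversion quietly skews every affected calculation rather than producing an obviously absurd number.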
Mars Climate Orbiter photo taken in the Lockheed Martin Astronautics facility in Denver, Colorado in January 1998.
Because of this mismatch, the spacecraft's actual orbit-insertion altitude was far below its target. As a result, a spacecraft that cost $193.1 million to develop, $91.7 million to launch and $42.8 million to operate was most likely destroyed in the Martian atmosphere because of a failure to convert one unit of measurement to another.
The lesson here—aside from the obvious, which is always check your math—lies in the importance of oversight and the value of communication between customers and providers. After countless cycles of design, development and testing, it’s easy to assume that all the obvious errors have been caught. As the Mars Climate Orbiter illustrates, sometimes it’s the simplest errors that slip through.
However, despite the simplicity of the error, it did not occur because of a failure in QA/QC. The quality professionals who reviewed the Lockheed and NASA software would have found that each program was doing what it was designed to do. The real problem was interoperability, an issue that remains a major concern across industries today.
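One common way to guard an interface like this is to make units part of the data type, so a producer cannot hand over a bare number without stating what it means. The Python sketch below uses a hypothetical `Impulse` type and a hypothetical `update_trajectory_model` consumer. It is not how either organization's software was written; it only illustrates the idea.

```python
from dataclasses import dataclass

# Exact definition of the pound-force in newtons
LBF_TO_N = 0.45359237 * 9.80665


@dataclass(frozen=True)
class Impulse:
    """Hypothetical impulse type, always stored internally in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary where the unit is known
        return cls(value * LBF_TO_N)


def update_trajectory_model(total_impulse: Impulse) -> float:
    """Hypothetical consumer that refuses bare numbers with no stated unit."""
    if not isinstance(total_impulse, Impulse):
        raise TypeError("expected an Impulse with explicit units, got a bare number")
    return total_impulse.newton_seconds


# The producer states its unit explicitly; the consumer always sees N*s.
print(update_trajectory_model(Impulse.from_pound_force_seconds(1.0)))

# Passing a raw float is rejected instead of being silently misread.
try:
    update_trajectory_model(1.0)
except TypeError as err:
    print(err)
```

The design choice is to convert at the boundary where the unit is known and carry only one canonical unit afterward, so a mismatch surfaces as an error instead of a wrong trajectory.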
3. Space Shuttle Columbia
Close-up camera view of the Space Shuttle Columbia as it lifts off on Jan. 16, 2003.
The second space shuttle disaster occurred when the Columbia disintegrated over Texas and Louisiana during re-entry on Feb. 1, 2003. During the shuttle's launch, a piece of foam insulation from the external fuel tank broke off and struck its left wing. When the Columbia re-entered the Earth's atmosphere, the damage allowed plasma and superheated gases to penetrate the internal wing structure, destroying the wing and destabilizing the spacecraft.
Post-accident investigations concluded that mistakes during installation were the likely cause of the foam breaking off. As a result, employees at the Michoud Assembly Facility in Louisiana were retrained in how to apply foam to the fuel tanks.
However, the cause of the accident goes much deeper than incorrectly applied insulation. The Columbia Accident Investigation Board (CAIB) identified fundamental organizational and structural issues in NASA which compromised the safety of shuttle missions.
The grid on the floor of the RLV Hangar as workers in the field bring in pieces of Columbia's debris.
For example, the shuttle program manager was responsible for achieving safe, timely launches at acceptable costs. Because these goals are rarely complementary, assigning all three to one role points to another source of failure that goes beyond QA/QC.
Safety, timeliness and cost-effectiveness are universal engineering values, as important to manufacturers as they are to NASA. However, these values are often in tension with one another. Balancing them requires careful negotiation, and putting that balancing act in the hands of a single person is a recipe for disaster more often than not.
From the Launch Pad to the Shop Floor
The stories of the Challenger, Mars Climate Orbiter and Columbia illustrate how high the price of failure can be in engineering. More importantly, they show that engineering catastrophes are not always the result of failures in QA/QC.
Do you have other examples of engineering disasters that had nothing to do with quality? Comment below.