Published in TDAN.com October 2002
The disastrous accident of the 1999 Mars Climate Orbiter quickly became part of the lore of technology failures. It occurred when firing durations for a guidance jet were mis-communicated between
engineering teams, one of which assumed English units and the other assumed metric units. It is interesting that this case has been coincidentally selected by three data management professionals to
promote their particular, but different, techniques.
The first time I encountered such a reference was in Dr. Terry Halpin’s excellent 1999 Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. He
used the Mars disaster to support the use of Object Role Modeling (ORM), the topic of his book, noting: “Embarrassingly, the likely cause of this demise was the failure to make a simple
conversion from the U.S. customary system of measurement to metric units. Data by itself is not enough – what we really need is information, the meaning or semantics behind the data.”
The clear implication was that lack of data semantic modeling, in particular, ORM, caused the disaster. This seemed a bit odd to me, since NASA had previously launched hundreds of successful
missions, including moon missions and the interplanetary Voyager, that went just fine without (as far as I am aware) the assistance of ORM. Perhaps there was something about the Mars Orbiter that
particularly demanded a more rigorous approach to data management than had been needed by prior missions, but that does not seem likely.
That reference was only the first of three similarly dubious claims. The second I saw was in David Marco’s 2000 Building and Managing the Metadata Repository. In this book he tells us
“Obviously the NASA space program could use a meta data repository to provide that semantic layer between its systems and engineers.” This is a curious supposition. The author is
correct in noting that engineers have to agree on the definitions of their data. A repository that makes data definitions visible for all to see would contribute toward that end. But, data failure
or not, the accident was mostly a failure in communication, and the idea that the communication could be improved by adding another layer into the process is questionable. If anything, the
engineers needed fewer layers between them, not more, in order to reduce the chances of mis-communication.
The third and final example (there are probably more that I haven’t run across yet) was in a Software AG publication by Berthold Daum and Chris Horak titled XML Shockwave. They state
“This error occurred because the data was transmitted as plain numbers, not decorated with tags or units.” They are right in noting that the addition of a definition of
units-of-measure would have helped to avoid the problem, but claiming that the exchange needed to be in XML is a bit of a stretch. The descriptions “nnn joules” or “nnn
foot-pounds” could just as easily have been delivered in an email using plain English, instead of XML.
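To make the point concrete, here is a minimal sketch in Python (with invented element and function names, not anything from the mission software) showing that what protects the receiver is the unit annotation itself: the same value-plus-unit pair can be carried in an XML fragment or in a plain English line and read just as reliably either way.

```python
# Hypothetical illustration: the safeguard is the declared unit, not the XML wrapper.
import xml.etree.ElementTree as ET

def parse_xml_measurement(xml_text: str) -> tuple[float, str]:
    """Read a value and its declared unit from a tagged XML fragment."""
    elem = ET.fromstring(xml_text)
    return float(elem.text), elem.attrib["unit"]

def parse_plain_measurement(line: str) -> tuple[float, str]:
    """Read the same information from a plain English line, e.g. in an email."""
    value, unit = line.split(maxsplit=1)
    return float(value), unit.strip()

xml_message = '<measurement unit="foot-pounds">152.4</measurement>'
plain_message = "152.4 foot-pounds"

print(parse_xml_measurement(xml_message))     # (152.4, 'foot-pounds')
print(parse_plain_measurement(plain_message)) # (152.4, 'foot-pounds')
```

Either message carries the semantics the receiving team needs; the markup is a convenience, not the cure.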
These three authors give rise to a quandary. They each claim to know the cause of the crash, but each cites a different reason. Which one of them is right? They could of course all be wrong, but
is there a way that they all could be right? They each have good arguments, and address one part of the issue, but they all slightly miss the mark. If we look deep enough, we find that the accident
was not caused by the lack of application of data management techniques; it was caused by lapses in application of basic engineering processes.
Before solving a problem by employing the wonders of computers and modern database theory, you must first make sure that the underlying business process is sound. In this case, NASA was repeating
an engineering process that it had executed many times before. Sound procedures would have included knowing the units of calculation, balancing them, perhaps even having different parties duplicate
the calculations to cross-check one another, and, most important, anticipating and testing for errors. If live testing is not feasible (as it never is for a one-shot space mission), then you
perform end-to-end system testing as thoroughly as possible to simulate the actual mission.
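As an illustration of the kind of unit discipline described above, here is a minimal sketch in Python (with invented numbers and names; it is not NASA’s actual procedure) of normalizing every labeled measurement to a single system before two teams’ figures are compared, so that a mismatch is caught loudly instead of slipping through as a bare number.

```python
# Hypothetical unit-balancing check: normalize to SI before comparing figures.
FOOT_POUNDS_TO_JOULES = 1.3558179  # 1 foot-pound is roughly 1.3558 joules

def to_joules(value: float, unit: str) -> float:
    """Convert a labeled measurement to joules; refuse unrecognized units."""
    if unit == "joules":
        return value
    if unit == "foot-pounds":
        return value * FOOT_POUNDS_TO_JOULES
    raise ValueError(f"Unknown unit: {unit!r}")  # fail loudly, not silently

# Cross-check: two teams report what should be the same quantity.
team_a = (206.0, "joules")
team_b = (152.0, "foot-pounds")

a_si = to_joules(*team_a)
b_si = to_joules(*team_b)
assert abs(a_si - b_si) < 1.0, f"Teams disagree: {a_si:.1f} J vs {b_si:.1f} J"
print(f"Figures agree: {a_si:.1f} J vs {b_si:.1f} J")
```

The point is not the code but the habit: no number crosses a team boundary without its unit, and every comparison is made in one agreed system.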
At the time NASA first landed on the Moon, I happened to be taking high school chemistry while enrolled in the American High School in Santiago, Chile. Our teacher was a flamboyant Chileno named
Senor Machuca. Sr. Machuca tucked his thumbs behind his suspenders while he paced around the classroom and drilled into us the need to identify and balance units of calculation. These concepts have
been around for a couple of hundred years, and, until the Mars Orbiter crash, NASA employed them well. This is as basic as it gets for science. The NASA team failed to apply basic engineering
disciplines that should have been instilled in them since high school. We put satellites in orbit and went to the Moon years before we had relational databases: clear proof that NASA could have
accomplished its mission without XML, or a meta data repository, or ORM notation. There are at least three lessons to be gained from all this:
1) Determining the cause and correct solution for a problem takes effort and time. As seen here, the number of proposed solutions to any problem is predictable – expect a new and
different solution from each person you ask! Of course, interplanetary navigation is not the type of problem you can solve by majority vote. You would have to ask your question of a
pretty large sample in order to get a valid ‘consensus’. In these cases, the best answer often comes from the one person who understands the entire process and has thoroughly studied
the problem, end to end, across all the parties who may be involved.
2) When we read any technical material, it is always good to carefully think through all claims and to consider the writer’s motivation and interest. For some unfortunate reason, our industry
seems to drive normally rational people to extend claims for technology beyond the point where they can be justified. This occurs with such frequency that, as we read books, press releases and other
technical documentation, we should always keep a skeptical eye on the message. It is a good mental exercise to think through any hypothesis stated and come to your own conclusions.
3) Probably the most important lesson is to use NASA’s experience to help avoid similar problems in our own work. NASA should be commended not only for successfully tracing back and finding
the basis of the error, but also for owning up to the problem and telling us what happened. Like most businesses, NASA performs many complex processes that must work without fail over many
repetitions (missions) over the years. The processes must include all the design, controls, and tests necessary to prevent errors from sneaking in. The three articles cited here point out
that use of the correct technology is important, but NASA’s experience demonstrates that defining and managing a reliable business process is even more important.
I will conclude by saying I have the utmost respect for the three authors referenced above, and for their work. ORM is clearly the best way to model the most intricate business rules of data, there
is obvious value to be gained from maintaining metadata in a corporate repository, and XML is a “giant leap for mankind” toward improved data exchange between automated processes. There
are good reasons for using all three practices, but none of them, alone or together, would have prevented the Orbiter accident without the application of traditional engineering professionalism and
common sense. In this case, the best potential help for NASA might not have come from Halpin, Marco, or Horak, but from the lessons of Sr. Machuca.