SELECT * FROM Celko September 2011

Return with me now to those thrilling days of yesteryear ... the late 1970s. This was when programming moved from a bunch of loose ad hoc tricks and looser tools to an engineering discipline and applied mathematics.

The term “software engineering” was actually invented twice: once to get funding for a conference from the military in Europe and once by Kenneth Kolence. Younger programmers will not remember him, but he was one of the founders of Boole & Babbage, the first software products company in Silicon Valley.

Mr. Kolence saw an airplane and thought that the pilot was trusting his life to physics and engineering. Kolence thought to himself that he would never trust software that way. This led him to invent the idea of “software physics,” which he put into a book (Introduction to Software Physics by Kenneth W. Kolence; McGraw-Hill, 1985, ISBN 978-0070352476).

The concept was that a “software physics” corresponds to the natural physics and that measurements could give basic constitutive definitions for the nature of software. Hardware was easy – electricity has one speed. But two different algorithms on the same hardware would perform differently. The goal was to find conservation laws in software physics such as software energy, work, force and power laws. When Kolence helped start the Institute for Software Engineering, the idea that you could manage performance, equipment planning, capacity planning and all that stuff became an accepted part of every major installation.

About the same time, most programmers learned to use Data Flow Diagrams (DFD) in their systems design course. This technique replaced flowcharts during the Structured Programming Revolution in the late 1970s. The old ANSI X3.5-1970 flowchart standard simply did not work because you did not know what was flowing; sometimes it was control, sometimes it was data and sometimes both. If you do not know the technique, find a tutorial or read Structured Systems Analysis: Tools and Techniques by Chris Gane and Trish Sarson, 1977.

DFDs are a good tool for looking at data from a higher level. The model for a DFD was that data flows between processes (“data in motion,” process execution) and data stores (“data at rest,” think of files and databases), much like water, automobile traffic, electricity or goods in an assembly line. Data comes into the system from an external environment, gets processed and leaves it. Inside the system, sub-systems move the data in the same manner. A DFD does not worry about any particular piece of data or any control flow decisions made by the processes.

A process does not have to physically change the data. For example, a process might input raw data and output validated data. If the raw data was correct, the output will be physically the same as the input. But there is a logical difference; we can do things with validated data that we cannot do with raw data.
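The raw-versus-validated distinction can be sketched in code. This is only an illustration under my own assumptions: the type names `RawAmount` and `ValidatedAmount` and the functions `validate` and `post_to_ledger` are hypothetical, not part of any DFD notation. The point is that the bytes do not change, but the logical type does.

```python
from typing import NewType

# Hypothetical type names: the same string, before and after validation.
RawAmount = NewType("RawAmount", str)
ValidatedAmount = NewType("ValidatedAmount", str)


def validate(raw: RawAmount) -> ValidatedAmount:
    """Process node: pass the text through unchanged if it is a plain decimal."""
    text = str(raw)
    # Accept ASCII digits with at most one decimal point; reject everything else.
    if not (text.isascii() and text.replace(".", "", 1).isdigit()):
        raise ValueError(f"not a valid amount: {text!r}")
    return ValidatedAmount(text)


def post_to_ledger(amount: ValidatedAmount) -> float:
    """Downstream process that is only meant to receive validated data."""
    return float(amount)


raw = RawAmount("19.95")
clean = validate(raw)

# Physically the same value, logically a different kind of data.
assert clean == raw
posted = post_to_ledger(clean)
```

A static type checker would flag `post_to_ledger(raw)`, which is the "logical difference" in practice: the validated flow is allowed into places the raw flow is not.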

Did you notice that with all of that early software engineering, nobody did much with data? The stress was on the performance, design and correctness of algorithms.

I would like to propose we go back and add “Data Physics” to computer science. This is still a rough idea, but I want to use physics concepts and map them into data. The DFD models give us a network with flows. We have conceptual tools for such things, depending on whether we use a continuous flow (water in pipes, electricity in wires) or discrete units (automobile traffic on roads).

Let’s consider an analogy in physics and data for speed and velocity of data. Asking about the “speed of data” sounds a bit like asking for the “weight of π” but think of it as data change over time, just as speed in physics is location change (distance) over time (e.g., kilometers per hour). You do not have any problem with the idea of daily, weekly or monthly reports since that is the way we have done data processing since before computers.

Today, we have ad hoc, interactive and real-time processing. The best example of speed versus velocity is data on an auction website. The auction is open for a fixed length of time (say, 24 hours); it has a known initial value (opening bid) and a known terminal value (winning bid).

If I am the seller, my data changed from initial to terminal value in 24 hours. That is the speed of my data. If I am the auctioneer or a bidder, I have to look at and respond to each bid. My data is changing (almost) continuously. That is the velocity of data and it can be more important than the actual data values. The basic situations are:

  1. A few small bids
  2. Many small bids
  3. A few large bids
  4. Many large bids

Mother Celko’s Uncertainty Principle, with a nod to Dr. Heisenberg: The greater the velocity of the data, the less certain you are of the value and the quality.
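Here is a rough sketch of the auction example in code. It is one possible reading of the idea, not a settled definition: I am assuming "speed" means the overall change from opening to winning bid divided by the length of the auction, and "velocity" means the rate of change between consecutive bids. The function names and the sample bids are made up for illustration.

```python
def data_speed(opening_bid, winning_bid, duration_hours):
    """Seller's view: total change in value over the whole auction window."""
    return (winning_bid - opening_bid) / duration_hours


def data_velocity(bids):
    """Bidder's view: rate of change between each pair of consecutive bids.

    `bids` is a list of (hours_since_open, amount) pairs with strictly
    increasing times, so no interval has zero length.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(bids, bids[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Invented 24-hour auction: a slow day, then a flurry in the last hour.
bids = [(0.0, 10.0), (6.0, 12.0), (23.0, 14.0), (23.5, 30.0), (24.0, 58.0)]

speed = data_speed(bids[0][1], bids[-1][1], 24.0)  # 2.0 per hour
velocity = data_velocity(bids)                     # ends at 32.0, then 56.0 per hour
```

The seller sees a calm 2.0 per hour. The bidder watching the last hour sees 32.0 and then 56.0 per hour, which is exactly when the current value is least trustworthy.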

There are concepts of permanence and value in data. The analog is radioactive or biological decay. A 500-year-old table of logarithms is valid and useful now and forever. That is the stable isotope. Last year’s lottery numbers have no value today, not even as predictors for future choices – an isotope with a (very) short half-life. Documents that could be used against you in court under discovery should have been destroyed under the record retention laws before they hurt you. Here is where my isotope analogy breaks down; I cannot think of an isotope that decays to something more toxic.
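The decay analogy can be made concrete with the usual half-life formula, value = initial × 0.5^(age / half-life). This is only a sketch of the analogy; the function name, the units and the sample half-lives are my assumptions, and nothing here claims that real data loses value along such a clean curve.

```python
import math


def remaining_value(initial_value, age, half_life):
    """Value left after `age` time units, given a half-life in the same units.

    `half_life` must be positive. An infinite half-life models the
    "stable isotope" case, where the value never decays.
    """
    return initial_value * 0.5 ** (age / half_life)


# A table of logarithms: stable isotope, as good after 500 years as on day one.
log_table = remaining_value(100.0, 500.0, math.inf)  # 100.0

# Data with a one-week half-life, looked at two weeks later.
stale = remaining_value(100.0, 2.0, 1.0)  # 25.0
```

Note that this curve only ever falls toward zero. It cannot model the court-discovery case, where old data ends up with negative value.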

Anyone have other thoughts on this?



About Joe Celko

Joe is an Independent SQL and RDBMS Expert. He joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press. The author of six books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups. Joe is now an independent contractor based in the Austin, TX area.