We have been portraying the move to a Data-Centric paradigm as a “Revolution” because of the major mental and cultural shifts that are prerequisites to making this shift. In another sense, the shift is the result of a long, gradual process; one which would have to be characterized as “evolutionary.”
This column is going to review some of the key missing links in the evolutionary history of the movement.
(For more on the Data Centric Revolution, see The Data Centric Revolution. In the likelihood that you’re not already data centric, see The Seven Warning Signs of Appliosclerosis)
Applications as Decks of Cards
In the 50’s and 60’s, many computer applications made very little distinction between data and programs. A program was often punched out on thin cardboard “computer cards.” The data was punched out on the same kind of cards. The two decks of cards were put in the hopper together, and voila, output came out the other end. Payroll was a classic example of applications in this era. There was a card for each employee with their Social Security Number, rate of pay, current regular hours, overtime hours, and a few other essential bits of data. The program referred to data by the “column” numbers on the card where the data was found. Often people didn’t think of the data as separate from the program, as the two were intimately connected.
Magnetic Tape
In the magnetic tape era, we replaced cards with reel-to-reel magnetic tape. This gave more independence to the data. Initially, what we did was take the transactional part of the time card data (the current hours reported) and put it on a reel of tape. Being freed from sharing an 80-column card with a bit of master data allowed people to add things like charge codes, projects, and more categories of pay, which they did.
The other part of the card represented data that did not change on a week-by-week basis, “master data.” This master data went on another reel of tape and was called, remarkably, the master data file. When you wanted to run an application program, you first had to get the correct master file (versioning was a big deal) and the correct transaction file. Then you mounted those files and another blank tape on three tape drives, and started the program. The prevailing paradigm was to sort the master file and the transaction file into the same order and process them sequentially, creating the output (paychecks) as well as writing an updated master file onto the blank tape.
Most programs still relied on an offset of a number of characters from the beginning of each record to understand where the data was and what it meant. Because knowledge of this data was in the computer program itself, most of the knowledge of the data was there as well.
The existence of this master file was the beginning of the idea of shared data. An employee master file might be used by the payroll department as well as the HR department. Might have been. Usually wasn’t. There were mechanisms that gave names to the offsets in the records. Mostly, these were “copy books” which were usually chunks of the front part of the Cobol program. In principle, these could be shared between departments who wished to use shared data. Usually this did not happen in practice.
Indexed Files
With the invention of the disk drive1, we could have master data that was permanently “mounted.” These files had indexes (ISAM and VSAM were popular) that allowed direct access to any record. There was a bit of data sharing, but not as much as you would think. The inventory system kept the inventory master file, the payroll system kept the employee master file, etc.
Databases, Metadata, & Schemas
Databases created a more public and shared view of the data being managed. This was true with the hierarchical and codasyl databases, which were mostly proprietary. This more shared view increased with the advent of relational databases, and especially SQL. It was an open and shared grammar, playing the role of a lingua franca for database access. Not only could many programs share data tables, they did not even have to be written in the same programming language. Database access was now mediated through SQL.
The first generation of “databased” systems were re-creations of the systems that had preceded them. Old habits die hard. If you had a tape-based inventory system, there was a good chance you replaced it with a VSAM inventory system, which you eventually replaced with a relational database inventory system. It was going to take a while before “materials management” came into the lexicon.
The other thing that databases brought with them was the idea of metadata and schemas. Metadata2 for a database is primarily the information about the tables and columns: what are they called? How are they related? What can you put in each field? This existed in a primitive form prior to the database, but came to its fore in the database era.
Integrated Systems
The database systems were creating silos. For the most part, each application had its own database. This was causing a huge amount of “interfacing” (writing code to move data from one database to another). In the 80’s, people started seeing the potential for integrating more functions through shared data. Inventory and Purchasing could share product data. Accounts Payable and Purchasing could share vendor data.
Initially companies added functionality (application programs) to existing systems, and (if they were lucky) achieved integration incrementally. This was possible to do if you had written both the applications and databases, you had documented knowledge of the systems, and you had skilled programmers who could make the changes in a predictable amount of time.
Eventually the software vendors got into this business. They discovered that the more functions they could bundle onto one database, the more RFPs they could win.
ERP System
The logical extension of this thought process was the ERP (Enterprise Resource Planning) system. “If we could just get all our enterprise data into one integrated system, it would be the holy grail. We would eliminate all interfacing.” This sounded so good.
The problem was, with the technology of the time, covering all functionality for all types of industries required a huge amount of complexity. The average SAP (the premier ERP vendor) installation has 90,000 tables. There are over 1 million attributes (columns) on all those tables.
Despite the complexity, some companies did get most of their functionality on a single instance and achieved many of the touted benefits. Most did not; and the bigger the company, the less likely the success. Not only did many companies have many existing systems which they could not and did not port to the ERP platform, in a bizarre twist, a majority of large enterprises ended up with multiple ERP systems, most of which were not integrated with each other. Let that sink in.
The necessity of keeping track of millions of attributes was the mother of the invention of the metadata repository. A metadata repository is a database of (hopefully) all the attribute and field names in your various systems.
The Data Warehouse
Somewhere along the way, companies recognized that neither their database applications, nor even their ERP applications, were delivering an integrated set of data for their entire enterprise. No amount of interfacing would address this, because any one system only had a subset of the total data.
Thus, the data warehouse was born. The essential premise was: rather than integrating the operational systems (those that actually run the day-to-day business), let’s integrate the reporting databases. So people made copies of the data in all their operational databases, and put them in a data warehouse.
Well it wasn’t quite that easy. Each of those databased or integrated or ERP systems had its own schema. They were arbitrarily designed, generally as the most expedient way to solve the immediate problem at hand. Generally, the language used was that of the sponsoring department. If you bought a commercial off-the-shelf package, the schema often had no relation to your particular business, and certainly had no relationship to any system you’d implemented.
The real work of the data warehouse was in what would eventually become the ETL industry (“Extract, Transform, and Load,” referring to the three processing steps needed to “on board” a new application to the data warehouse). In the extract step you wrote some code that queried the application database and retrieved the data you were interested in having in the data warehouse. Later, tools were used to do essentially the same thing. The transform stage was how you converted the local language of the application to the shared language of the data warehouse. This occurred at two levels, metadata and data. At the metadata level you had to recognize that the “pid” column in one table actually corresponds to what the data warehouse has called the “patientID.” At the data level you had to realize that “01” in a row in the employee table means “M” in the data warehouse (which really means the gender is “Male”). And finally, in the load stage you had to efficiently load the transformed data into the data warehouse.
The data warehouse was not a transactional system. It was not designed to be updated in place; it was uploaded in batch via the “load” process. It was tuned for fast analytic queries, typically aggregates (rollup all the sales by region, product, promotion, month).
At last we had all of our data in one place – well sort of. In fact, it was just that small subset that we deemed to be of interest to the analysis. Generally, data warehouses were successful, and for quite a long time. Eventually, though, as we will see, the cost of the ETL process eventually caused them to lose favor.
The Operational Data Store (ODS)
The success of the data warehouse, and the general frustration with interfacing the growing number of application systems (which by now had also become the multi-billion soon to be multi-hundred-billion dollar “systems integration” industry), lead to the creation of something often called the “Operational Data Store”. The idea was to create a data warehouse that would be used for the purpose of making the integration process easier, by making it into a hub-and-spoke model.
Hub-and-spoke said: put any data that might be shared between systems in a shared model, and then let anyone else access or subscribe to changes. Sounded great. We know of many cases where it didn’t work and no cases where it did work. The main cause of it not working was because the model was too complex – almost as complex as the union of all the schemas it was trying to integrate. It became a bottleneck for change. This bottleneck was exacerbated, because, inevitably, if one system changed it forced a change on the operational data store, which in turn forced all its subscribers to change.
End of Part 1, in the next installment we take up some of the more recent developments on the road to Data Centric including: Master Data Management, Canonical Message Modeling, REST, Big Data and Data Lakes.
1 It’s hard to imagine a computer industry without disk drives, but it existed that way for quite some time.
2 The prevailing definition of metadata is “data about other data,” which is hugely unsatisfying. http://whatis.techtarget.com/definition/metadata