Data Transformation is usually thought of in the context of database ETL (Extraction, Transformation, and Loading)— i.e. moving data wholesale between systems. But in a larger sense, any process that converts data from one form into another is a data transformation. This would include processes for data analytics and reporting. MDOX (Meta-Data Object X-form) is a tool designed for this purpose: to automatically process data from many sources into a universal data model, a fixed set of database tables, within which data can be analyzed and further transformed into output products using SQL (Structured Query Language).
MDOX takes the concept of generic data modeling to an extreme. Generic modeling simplifies data structures and improves the stability, reliability, and reusability of code, thus increasing productivity and product quality. (For more on generic modeling, see The Data Administration Newsletter (TDAN.com) article, Agile Principles in Data Projects, Part 3)
MDOX Data Model
Although generic data modeling is acknowledged as a best practice, most relational databases utilize a complicated Entity-Relation (E/R) model, having tables or “entities” with quite specific attributes. MDOX has a simpler, more general E/R model, shown in Figure 1. The MDOX model can integrate complex, detailed data from many disparate sources together with related metadata – data semantics, annotations, lineage, relations, and transformation processes.
Figure 1. MDOX Entity-Relations
MDOX is based on the idea that all data is intrinsically hierarchical. This is the principle behind “text-markup languages” like SGML or XML, which represent data as a hierarchy of tagged elements, each with an optional list of tagged attributes. MDOX implements this model in tables Element and Attribute, shown above and described below.
Table 1: [Element]
Table 2: [Attribute]
Metadata is similarly modeled in MDOX with tables Term, Property, and Process, as shown below.
Table 3: [Term]
Table 4: [Property]
Table 5: [Process] stores information documenting operations on data. Its main fields are:
- [path id] identifies the term to which the Process relates
- [re id] is the sequence in which the operation is executed
- [re text] is an optional regular expression (regex) e.g. “\s(\d\d/\d\d/20\d\d)” for a date
- [sql text] contains code for an operation (data manipulation or definition)
The remaining 10 columns of the table process store regex parameters and audit information. Regular expression pattern matching using MDOX will be discussed later, but for now, suffice it to say that pattern matches are temporarily stored in a staging table, [_regexp].
Table 6. [_regexp] is temporary storage for matches to a regular expression pattern, [re text]
To manage hierarchical data objects in a relational database, MDOX needs to resolve the “object-relational impedance mismatch”. This requires some custom views and functions which will be described in subsequent articles. What follows next are some snapshots of the prototype application interface used for a simple transformation process.
MDOX is currently implemented using Microsoft Access, a desktop database that is widely available and offers an intuitive development environment with a rich library of pre-packaged methods, as well as the ability to add custom methods using VBA (Visual Basic for Applications). MDOX has also been implemented in Oracle using Application Express (APEX), but the main development thread has been focused on Microsoft Office products.
The application opens with the “Main Menu” form, shown below.
Figure 2. Form [Main Menu]
Clicking the button “Developer View” opens form “Thesaurus”, showing metadata terms in the “Term Tree” at the left, and details, child terms, process operations, and derived elements in the tabs at the right.
Figure 3. Form [Thesaurus]
Process operations such as that shown above can be copied from other terms used as “templates.” Subsequent process operations can move or manipulate the data in any way imaginable. For MDOX, structure is data; semantics is data; code is data; everything is data.
MDOX Data Processing
Any Term in the metadata taxonomy may have an associated process: an ordered list of processing steps or operations which may be executed individually or in batch mode.
The process shown in Figure 3 above loads an Excel workbook into table Element. There are 2804 elements in the transformed workbook. Elements are labeled by term id. This SQL for “Query0” counts the elements by term: “SELECT term_id, term, COUNT(elem_id) FROM Element GROUP BY term_id, term;”
Figure 4: Query “Query0”
The Element taxonomy can be browsed using form “Dictionary” shown below.
Figure 10. Form [Dictionary] showing details of the selected “sheet” element
Clicking the button [!] at bottom-right executes the query selected in the “Action List” at left. In this case, it will open a tabular view of the selected “sheet” element, pivoting the “cell” elements by their parent “row” elements. This view, shown below, replicates the layout of the original spreadsheet. Meanwhile, data lineage is preserved by the hierarchy of elements, from “workbook” to “row,” to “cell,” to “Role,” the latter being a “normalized” value derived from cells in column 3, which is named “Original Response” for rows where column 2 has value “Role.” Using this structure, we can report normalized “roles” for each “bubble” of the “Data Management Domain Framework” (DMDF) taxonomy.
Figure 11: Query “curElem_xtab” is a tabular view of the current “sheet” element
This column has introduced the MDOX concept and data model and has shown one small example of how it can integrate and transform data. In the next article, we will examine the MDOX application more closely and see what makes it “agile.” Then we will show how you can build your own version of MDOX.
This quarter’s column was written by Robert Park.
Robert Park is a senior database engineer at The MITRE Corporation, where he pursues his interests in data integration and transformation in support of various government projects. Outside of work, he pursues his interest in music as a cellist in various ensembles throughout the Washington, D.C. area.
The author’s affiliation with The MITRE Corporation is provided for identification purposes only and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. © 2018 The MITRE Corporation. ALL RIGHTS RESERVED.