Tales & Tips from the Trenches: Agile Data Transformation using MDOX

What is agility? I would define it as the ability to move quickly, fluidly, and efficiently in any direction. In data transformation, information moves through a process consisting of a sequence of operations. Ideally, processes should be reversible, thus ensuring that no information is lost.

By modeling transformation operations generically in table [Process], MDOX enables the creation of agile data transformations that can be easily run, reversed, revised and repeated, or removed.

In our previous article, Introducing MDOX, we focused on Meta-Data Objects (MDO) as a universal data model that stores information as objects: hierarchically structured collections of data Elements and Attributes, related to a hierarchy of metadata Terms with optional Properties. In this article, we will focus on the Processes that implement agile data transformations, thus putting the “X” in MDOX.

Table 5 from the previous article, “Introducing MDOX” shows the main fields of the [Process] table:

Table [Process] also contains several fields which control how the optional regular expression [re_text] is applied to elements of the parent term, to enable recursive parsing of child elements from parent text:

Matches for the regular expression in field [re_text] are temporarily stored in table [_regexp]:

Each record in table [Process] defines a single transformation operation:

If field [re_text] contains a valid regular expression, then that expression is applied to the long text [Element].[elem_value] for all elements related to the parent term of [Process].[path_id]. Regex matches are temporarily stored in table [_regex].
If field [sql_text] contains a valid SQL statement, then the statement is executed by the application. Select and Transform queries are either shown on screen, or if the process is automated, exported as spreadsheets to the current application directory. Insert statements add new records to a table or updatable view. Update or Delete statements affect selected record.

Note that the presence of an invalid regular expression or SQL statement will cause a process to fail. The developer must repeatedly test each operation until the entire process executes flawlessly.

Operations for a given Term are executed in order of [re_id]. If a term has child terms and the “descend tree” option is selected, then child terms are processed in sequence order. Thus, the overall order of operations within a transformation is determined by both [Process].[re_id] and [Term].[seq]. You can move a previously developed operation to a new location simply by updating those fields. Using the MDOX application, one can easily create new terms and operations, or update the name, description, and sequence order of existing terms and operations. If a data or meta-data object is deleted, then cascading deletes will remove any objects descended from it, e.g. sub-terms, properties, processes, elements, attributes, and regex matches. This behavior is beneficial. Shown below is a term named “test” having a child term “regex” with a process that reports regex matches found in “test” elements. Later, the “test” term can be deleted, and all subsidiary objects will disappear from the database.

Example 1: Transformation operation creating a crosstab report from regex matches

The regex “(\d+)\.\s(\w+)” contains 2 sub-expressions, enclosed within parentheses. The first sub-expression (\d+) matches one or more digits. The second (w+) matches one or more letters. Between these sub-expressions, the expression “ \.\s” matches a dot followed by a whitespace character. In this example, MDOX searches for regex, “(\d+)\.\s(\w+)” in the elements of the parent term, “test”:

The regex matches are stored in table [_regex] where

[nsub] = 0 => the complete regex match, e.g. “10. foo”
[nsub] = 1 => the 1^st sub-expression match, e.g. “10”
[nsub] = 2 => the 2^nd sub-expression match, e.g. “bar”

After the regex matches have been collected, another operation compiles and executes this SQL text:

TRANSFORM First([_regexp].value) AS FirstOfvalue

SELECT Element.elem_id, Element.elem_value, [_regexp].nmatch

FROM Element INNER JOIN _regexp ON Element.elem_id = [_regexp].rid

GROUP BY Element.elem_id, Element.elem_value, [_regexp].nmatch

PIVOT [_regexp].nsub;

The resulting recordset is displayed or exported to “Test_regex.csv”:

Example 2: Transformation operation creating a catalog of filesystem objects

The updatable view [elemAttributes] is defined as follows:

SELECT Elements.*, Attributes.attr_name, Attributes.attr_value, Attributes.elem_id AS attr_id

FROM Elements LEFT JOIN Attributes ON Elements.elem_id = Attributes.elem_id;

The following statement inserts a new element together with a related attribute into [elemAttributes]:

INSERT INTO attributes ( elem_id, attr_name, attr_value )

SELECT GetFileSystem(GetFolderDialog("Select root folder for files"),".*"

,True,True) AS elem_id, "insertDateTime" AS attr_name, Now() AS attr_value;

This operation uses the statement above to create a catalog of a filesystem:

The catalog created is a hierarchy of elements with filesystem attributes, e.g.

A subsequent operation can report selected files through a view based on Elements and Attributes, e.g.

As shown above, function GetFileSystem() can extract file text and store it in [Element].[elem_value], either with or without replacement of special characters by their closest ASCII equivalents. This option is useful for regex parsing of files with text content, e.g. filetypes txt, sql, doc, ppt, pdf, html, etc.

Similar GetData() functions import other types of data into the MDOX database schema. For example, GetSheets() imports tabular data from Excel spreadsheets and Word documents. GetWordDoc() imports text content from any document format that can be opened by Microsoft Word, including PDF and PowerPoint. GetXml() imports structured data as elements with related attributes from text-based markup formats such as XML, XSD, and WSDL files. Implementation details for these data import methods will be described by the next TDAN article, MDOX Application Development.

After the initial import of source data into MDOX using GetData() methods, new terms and processes can be developed to further transform the data into the desired output format, which can be exported or loaded directly into a linked database table. Some sample transformations with varying degrees of complexity will be described in the last article in this series, MDOX Data Transformation Techniques.

In summary, MDOX enables agile data transformation by modeling data and processes generically. This makes it possible to build an application that facilitates data processing within a static data model. MDOX can execute complicated transformations automatically, through a process, built of operations, which can be run, reversed, revised and repeated, or removed as necessary.

MenuMenu

Tales & Tips from the Trenches: Agile Data Transformation using MDOX

Example 1: Transformation operation creating a crosstab report from regex matches

Example 2: Transformation operation creating a catalog of filesystem objects

Bonnie O'Neil

MenuMenu

Example 1: Transformation operation creating a crosstab report from regex matches

Example 2: Transformation operation creating a catalog of filesystem objects

Share this post

Bonnie O'Neil