Data Types of the Future!

“A type is an indication for a set of permitted values.” – Type Inference for COBOL Systems, Arie van Deursen and Leon Moonen

“Type: A named set of values. Types can be scalar or non-scalar.” – The Relational Database Dictionary, C.J. Date

Our topic for this article is one of the most fascinating and leading-edge issues on the data management scene: data types! Well, maybe not yet so fascinating and leading-edge…but it may well be in the near future. In the same way that concrete blocks are the foundation of a building, types are the foundation of variables, which in turn are the foundation of applications. In this article, I’d like to share some thoughts on type metamodels and propose some novel perspectives on scalar types and potential non-scalar types.1

Type Systems, Scalar and Non-Scalar TypesEveryone familiar with data modeling, databases or any programming language is familiar with the concept of a data type. Types specify constraints on variables; constraining a variable with a type prevents values inconsistent with the type from being assigned to that variable.

DBMS built-in types such as CHAR and INTEGER are scalar. Scalar types by definition have no components or visible inner structure.2 Non-scalar types do have an internal structure; probably the most familiar non-scalar type is the relation, and, since an entity – in logical data modeling terms – is equivalent to a relation, an entity is another example of a non-scalar type.

A type system comprises a set of types and the relationships among them – typically of the supertype-subtype variety, in the form of a hierarchy. The diagram below shows, for example, the built-in type system of DB2 UDB DBMS Version 9.1. Built-in types such as these can also provide specifications for other specifications, commonly known as User Defined Types, or UDTs.

 

A type system is a model of constraints on variables. An Entity-Relationship model is a model of constraints on variables, so it qualifies as a kind of type system.

Character Types and RelationsConventional wisdom tells us that the relational model has limited value for managing narrative data because, rather than being in “relational” form, narrative data is of an “unstructured” type of arbitrary length.

 Seq  Char
 1  C
 2  o
 3  n
 4  v
 5  e
 6  n
 7  t
 8  i
 9  o
 10  n
 11  a
 12  l

English-language narrative data just happens to be written and read horizontally, left to right. If we broaden our view to a data management or global perspective, we can see that this just one of many possible views of narrative data. The Chinese language can be written and read in vertical columns top to bottom; Hebrew is written and read right to left.

It’s just as valid to consider characters in a narrative as being stored as a relation of degree two, where each tuple/row has one column of a single-character type, and another of numeric type, indicating the relative sequence of the characters. When transposed in this manner, many text-specific operations can be replaced with relational operations: Find/Replace becomes SELECT/UPDATE…WHERE; Substring operations become SELECT…WHERE SEQ Equal/Between; Concatenate becomes UNION.

Expressing narrative text in relations such as this would free us from any concerns about specifying minimum or maximum lengths for text variables, and also make text-specific operations unnecessary. Of course this approach would consume more storage space – but when a terabyte runs about $100 at Best Buy…I’m just sayin’…
            
Observations on Enumerated and Boolean Types A type can be defined either by describing its constraints – by intension – or by extension, that is, specifying each and every valid value. A type defined by extension is an enumerated type:

“An extensional definition…specifies…a list naming every object that is a member of a specific set. A domain – the set of valid values for a type – is the extensional definition of the type. An extensional definition of ‘Prime Minister’ would be a list of all past, present and future prime ministers.” (Wikipedia)

Examples of enumerated types include currency codes, country codes, subtype discriminators and classifications.

A Boolean type can be used to define True/False indicators and the output of condition tests. Boolean can be thought of as a simple, specialized subtype of enumerated with only two declared values. It can also be seen as a specialized subtype of number.3 If we use the number 1 to represent the value “True,” and 0 to represent “False” (not too much of a stretch), we can replace AND with multiplication and OR with addition – again reducing the number of functions required in a type-based application.

Data Types of the FutureIn application specifications, precision in names and definitions, although indispensable documentation for human readers, is trivial compared to the precision available through type definitions. Narrative definitions cannot be enforced by an application or a DBMS, but type constraints can. The more that requirements can be expressed as type constraints, the better they can specify not only what we want or expect, but also how they are actually implemented.

The specification of requirements as type constraints can be greatly enhanced by the application of non-scalar types composed of one or more scalar and/or non-scalar components. Non-scalar types are often referred to as “complex”, “structured,” or composite types. Composite types, in the words of Chris Date, can be “of arbitrary complexity,”4 in both breadth and depth. Currently, specification of composite types other than relations has been left to custom implementations of user-defined types (UDTs) based on DBMS built-in types.

An informal survey of experienced DBAs confirmed that composite types are still very seldom used. Their support by data modeling tools is quite limited, and as a result they’re largely invisible to most data modelers and DBAs. Could composite types be more usefully applied? One of the most practical potential uses of composite types would be what I call qualified types.

Qualified TypesA composite type can contain two or more component types. A qualified composite type would include a type that’s qualified by one or more other types such as units of measure – for example, temporal units and/or currencies. In data warehousing applications where data values in the same row could be sourced from more than one application, a composite type could also include a source-system identifier to precisely attest to the provenance of the value assigned to any given variable in that row.

The table below shows an example of a qualified composite type:

 Monetary Amount  Effective Date  End Date  Currency
 100.00  2011/06/07  2011/06/08  USD

Modeling and Implementation of Composite TypesAlthough major DBMS products now support composites through user-defined types, the situation with data modeling tools is less encouraging; few if any tools support composite types at a logical level.

Some tools, such as IBM’s InfoSphere Data Architect (IDA), do support composite UDTs at a DBMS-specific physical level. The screen image below shows part of a physical data model in IDA that includes a composite type named “Currency Amount Qualified.”

  1. Shows the definition of a user-defined type CURR_AMT_QUAL containing attributes AMT, EFF_DT, END_DT and CURR_CD, each using a built-in type.5
  2. Shows the DB2 UDB DDL generated to create the UDT.
  3. Shows the table ACCOUNT including the column ACCT_BAL, of the type CURR_AMT_QUAL. (Any other monetary amounts in the model can of course also reference the CURR_AMT_QUAL type.)

A Distribution of TypesThe table below shows the distribution of some of the types (“domains,” in ERwin terminology) of the attributes contained in an example real-world decision-support application.

 Type (Domain)  Number of Attributes  Percentage
 Classification  477
 25%
 Currency  440  23%
 Date/Time  305  16%

The table shows that classifications (enumerations), currency amounts (implicitly-qualified variables) and temporal types (qualifiers) comprise the majority of the data in the application. Currency amounts are most often implicitly understood to represent USD only, but could well be qualified by multiple currencies. In most instances, temporal-type variables qualify the data values in entire rows, rather than more precisely qualifying individual variables within rows as described above.

The distribution of types in this application is just one example demonstrating the untapped potential of data types of the future.

End Notes:

  1. For the purposes of this discussion I’d like use the terms “data type” and “type” interchangeably.
  2. A case can be made that DATE, TIME and TIMESTAMP are non-scalar types, since they do have an internal structure.
  3. Detailed coverage of multiple type inheritance can be found in The Third Manifesto by Date and Darwen.
  4. Date, C.J., DBMS Interview – October 1994
  5. The attributes of a type should, strictly speaking, be types themselves.

Share

submit to reddit
Top