Data blending is a recently emerged technique used primarily in big data analytics. It is the process of combining data from multiple sources into a single structure. Blending provides a relatively quick and easy way to access multiple disparate data sources and find patterns among them without the time and expense of traditional data integration. Ideally, data blending is performed with data preparation tools designed for non-technical data analysts who need to prepare data interactively and without coding.
Interactivity is an important characteristic of data blending. Working with visual interfaces and in real time, the analyst blends, visualizes, and analyzes data in cycles – blend, visualize, analyze, adjust the blend based on the analysis, and repeat. Interactive in this case means that the analyst works directly with the data rather than with a result prepared in advance.
Join vs. Blend – Different Techniques with Different Effects
Blending is a distinctly different approach from joins for combining data. Joins work with relational tables and only within a single database instance. Blending works with relational, NoSQL, and various other sources where the data is dispersed across databases.
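The contrast can be sketched in a few lines of Python. This is only an illustration under assumed data: the `customers` and `orders` tables, the JSON ticket documents (standing in for a NoSQL source), and the choice to match on customer name are all hypothetical. The join runs inside one database; the blend combines the query result with the documents outside any database.

```python
import json
import sqlite3

# Relational source: an in-memory SQLite database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 50.0);
""")

# Join: both tables live in the same database, so SQL combines them.
joined = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Non-relational source: JSON documents standing in for a NoSQL store.
tickets_json = (
    '[{"customer": "Acme", "open_tickets": 3},'
    ' {"customer": "Initech", "open_tickets": 1}]'
)
tickets = {doc["customer"]: doc["open_tickets"] for doc in json.loads(tickets_json)}

# Blend: combine the query result with the documents in application code,
# matching on customer name. Customers with no document get 0 tickets.
blended = [
    {"customer": name, "revenue": total, "open_tickets": tickets.get(name, 0)}
    for name, total in joined
]
```

A data preparation tool would do the matching step through a visual interface; the list comprehension here only shows where the combining happens.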
When creating multiple charts and graphs with joins, all of the charts must work with the same data structure and the same logic and rules for combining data. Some charts may be constrained by the requirements of others and the join is often a “lowest common denominator” solution. Figure 1 illustrates the effect of joins when combining data for visualization and analysis.
Blending removes these constraints because the blend for each visualization is done independently of the others. Data structure and organization are tailored to each chart, and blending rules can vary from chart to chart, as shown in Figure 2. The result is greater freedom in visualizing and analyzing data than is possible when using joins to combine data. The price of freedom, however, is the need to be aware of workload, performance impacts, and other considerations for source systems. Blending also works with many kinds of databases, including the variety of NoSQL technologies, while joins are limited to RDBMS.
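As a rough sketch of per-chart blending, the Python below combines the same two sources twice, once keyed by region and once keyed by month, so that each chart gets its own structure. The `sales` and `visits` records and the `blend` helper are hypothetical, invented for illustration; a real tool would fetch the data from the source systems, which is where the workload considerations come in.

```python
from collections import defaultdict

# Two hypothetical sources, already fetched as plain records.
sales = [
    {"region": "East", "month": "2024-01", "amount": 100.0},
    {"region": "East", "month": "2024-02", "amount": 150.0},
    {"region": "West", "month": "2024-01", "amount": 70.0},
]
visits = [
    {"region": "East", "month": "2024-01", "count": 1000},
    {"region": "West", "month": "2024-01", "count": 500},
    {"region": "West", "month": "2024-02", "count": 400},
]

def blend(left, right, key, left_field, right_field):
    """Total each source by `key`, then combine the totals per key value."""
    totals = defaultdict(lambda: {left_field: 0, right_field: 0})
    for row in left:
        totals[row[key]][left_field] += row[left_field]
    for row in right:
        totals[row[key]][right_field] += row[right_field]
    return dict(sorted(totals.items()))

# Chart 1 blends by region; chart 2 blends the same sources by month.
# Neither blend constrains the other.
by_region = blend(sales, visits, "region", "amount", "count")
by_month = blend(sales, visits, "month", "amount", "count")
```

Because each call aggregates the sources separately before combining them, a chart keyed by month does not inherit the grain chosen for the chart keyed by region.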
Integrate vs. Blend – Latency and Limited Interactivity
Data warehousing is a process that physically integrates data either as a hub data warehouse with dependent data marts, or alternatively into a collection of conformed data marts using a bus architecture. Tailored data marts mitigate (but don’t entirely eliminate) the “lowest common denominator” effect that occurs with joins. But warehousing introduces data latency when data transformation and loading are done as periodic batch processing. Perhaps more significantly, warehousing inhibits interactivity because the work of combining data is decoupled from the work of visualizing and analyzing data.
Data Pipelines – Mixing the Methods
I am not suggesting that the data warehouse is obsolete or that it can be replaced with data blending. Data joins, data blending, and data warehousing are compatible and complementary techniques. Joins are fundamental to warehouse access and use. Combining blending with integration (all forms of integration including warehousing, federation, operational data stores, MDM, etc.) is effective when constructing data pipelines to move data from sources to consumers. Integration is a good choice when combining data for persistence to serve multiple use cases. Blending is a good choice when combining data to serve a single use case. Figure 4 illustrates the use of multiple methods in data pipelines to serve three use cases.
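A minimal sketch of mixing the methods, under assumed data: two hypothetical sources (`crm` and `billing`) are integrated once into a persisted table, two use cases read that table directly, and a third blends it with a one-off `survey` source at read time. The names, the schema, and the use cases are illustrative assumptions and are not taken from the figure.

```python
import sqlite3

# Hypothetical sources feeding the pipeline.
crm = {"c1": "Acme", "c2": "Globex"}
billing = [("c1", 200.0), ("c2", 50.0), ("c1", 25.0)]

# Integration: combine the sources once and persist the result for reuse.
totals = {}
for customer_id, amount in billing:
    name = crm[customer_id]
    totals[name] = totals.get(name, 0.0) + amount

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_revenue (customer TEXT, revenue REAL)")
warehouse.executemany(
    "INSERT INTO customer_revenue VALUES (?, ?)", sorted(totals.items())
)

# Use cases 1 and 2 read the persisted, integrated table directly.
top_customer = warehouse.execute(
    "SELECT customer FROM customer_revenue ORDER BY revenue DESC LIMIT 1"
).fetchone()[0]
total_revenue = warehouse.execute(
    "SELECT SUM(revenue) FROM customer_revenue"
).fetchone()[0]

# Use case 3 blends the integrated table with a one-off source at read time.
# Nothing is written back, because only this use case needs the combination.
survey = {"Acme": 4.5}
rows = warehouse.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
)
satisfaction_view = [
    {"customer": c, "revenue": r, "satisfaction": survey.get(c)} for c, r in rows
]
```

The design choice follows the rule of thumb above: the revenue table is persisted because more than one consumer uses it, while the survey blend stays transient because it serves a single use case.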