Various databases, plus one or more data warehouses, have been the state-of-the-art data management infrastructure in companies for years. The emergence of various new concepts, technologies, and applications such as Hadoop, Tableau, R, Power BI, or Data Lakes indicates that changes are under way.
Which concepts will be forgotten in five years and which will be used in many companies? Looking at who benefits and how from which technology gives us a glimpse of the future.
The Classical World of Data Management: Databases and Data Warehouses
The backbone of many companies consists of IT applications for the efficient execution of daily company processes: ERP systems, core banking solutions, CRM software, etc. All these systems are built on top of databases that reliably store and process data. All departments in companies rely on such systems.
This is different for data warehouses. Once companies have data from the daily process execution stored in a database, this data can be analyzed to get better insights into customers, internal processes, and financials. While a simple analysis can be done directly on an application database, there are three main reasons why companies use data warehouses instead:
- Analytical queries can be complex and resource intensive. Moving them to a separate system ensures there is no performance impact on the systems needed for the daily processes.
- They enable queries over data from various applications (e.g., customer data from the CRM, the core banking, and the mobile banking system) after harmonizing the data (e.g., a country can be represented by “CH” or “756” in ISO-3166).
- They provide (pre-calculated) data marts, a curated collection of domain-specific data (e.g., for sales) which eases the use by the business.
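The harmonization mentioned in the second point can be sketched in a few lines of Python. This is a minimal illustration, not a real load process: the source-system names, the record layout, and the helpers `harmonize_country` and `load_customers` are assumptions made for this example. Only the idea that one country can arrive as an alpha-2 code ("CH") or as a numeric code ("756") comes from the text above.

```python
# Minimal sketch of harmonizing country codes during a data warehouse load.
# The lookup table, record layout, and function names are illustrative assumptions.

# ISO 3166-1 numeric code -> alpha-2 code (tiny excerpt for the example)
NUMERIC_TO_ALPHA2 = {"756": "CH", "276": "DE", "040": "AT"}
KNOWN_ALPHA2 = set(NUMERIC_TO_ALPHA2.values())


def harmonize_country(raw: str) -> str:
    """Return the alpha-2 code for a country given as alpha-2 or numeric code."""
    value = raw.strip().upper()
    if value in KNOWN_ALPHA2:
        return value
    if value.isdigit():
        value = value.zfill(3)  # tolerate numeric codes that lost leading zeros
    if value in NUMERIC_TO_ALPHA2:
        return NUMERIC_TO_ALPHA2[value]
    raise ValueError(f"unknown country code: {raw!r}")


def load_customers(*sources):
    """Merge customer records from several source systems into one list."""
    merged = []
    for system, records in sources:
        for record in records:
            merged.append({
                "source": system,
                "customer_id": record["customer_id"],
                "country": harmonize_country(record["country"]),
            })
    return merged


# Two hypothetical source systems that encode Switzerland differently.
crm = ("crm", [{"customer_id": 1, "country": "CH"}])
core_banking = ("core_banking", [{"customer_id": 1, "country": "756"}])
rows = load_customers(crm, core_banking)
print(sorted({row["country"] for row in rows}))
```

After the load, both records carry the same country representation, so a query across the two systems can group or filter by country without knowing how each source encoded it.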
A key point for understanding the discussion about new trends is the pattern of “similar technology, but different stakeholders and stakeholder expectations.” Databases and data warehouses are very close technology-wise. However, databases are a “technology only” topic. Many departments and teams in companies need a database (as they need laptops and servers), so having one central team in the company allows for a higher service level and/or lower costs.
In contrast, the need for a data warehouse forms around one or two departments, typically sales, marketing, accounting, or controlling. They work with data that is distributed across various application databases and must be brought together. So, on the one hand, they need technical support to keep a data warehouse up and running. On the other hand, the process that loads the data from the various source databases into the data warehouse must work as well. This requires a deep understanding of domain-specific data models. Thus, data warehouses have stakeholders in only a few departments, and these departments must ensure the funding. An engineering department might not want to pay for a controlling data warehouse.
This stakeholder-centric approach is key to understanding how new data management trends from machine learning to data lakes can thrive in companies.
New Technologies (not only) for Old Data: Machine Learning and Visual Analytics
Visual analytics and machine learning promise the same thing: Buy us to get more out of your data! The way to achieve this, however, differs. Visual analytics improves the user interface by providing an easier, more interactive way to explore data. Additionally, visual analytics tools enable users to create visually stunning diagrams. Some well-known products are Qlik, Tableau, SAP Lumira or Microsoft Power BI.
We elaborate on the benefit using the example of a product manager who wants to understand which customers buy a specific chocolate brand in an online shop. In the classical (pre-visual analytics) data warehouse world, formulating such a business question is the starting point (Fig. 1, step 1). Then, the product manager looks at the existing sales reports and talks with colleagues. She formulates a hypothesis of what the influencing factors might be, e.g., age and gender (step 2). She looks through the data in the data warehouse using filters, sorting, pivot tables, etc. (step 3). In complex cases or if there is a need for additional data, the data warehouse team supports her (step 3a). Then, she visualizes her findings by means of diagrams (step 4). Finally, she decides: Is the question from step 1 answered? Does the diagram provide the needed information? If not, she has to start again from step 2.
Visual analytics tools improve this process. The analysis using filtering, sorting, etc. is possible without SQL knowledge (step 3). Also, while doing the analysis, there is an instant visualization of the data and its results. If the selected data changes, the diagram changes at the same time. This is a comfort gain and increases productivity (steps 3 and 4). Finally, visual analytics provides visually stunning diagram types based on maps, etc. (step 4), which helps to get attention for the results.
So, who would fund a visual analytics solution and why? The motivation is cost savings: the same analysis can be done quicker or with employees less senior and easier to recruit. These savings must be balanced with the total cost of ownership of the visual analytics tool. Obviously, the more business users work with the data warehouse, the higher the savings. Stakeholders in the case of visual analytics are the same as the ones funding the data warehouse. This eases the adoption in companies.
Is Visual Analytics self-service BI?
Some vendors market visual analytics tools as self-service business intelligence. This is an exaggeration. Business users can work without IT support only if two conditions are met: (1) the data warehouse already provides a data mart with all the needed data, and (2) the queries are not too complex. These conditions are rarely met. A GUI does not solve the complexity and abstraction issue of a query with three WHERE conditions, two left outer joins, and three GROUP BY columns. Even if the business user has the knowledge to formulate such a query, how does the organization ensure that this query is maintained and executed over the next five years when this one person is not in the office?
Visual analytics makes the work a little bit more pleasant, but machine learning (including predictive analytics) is game changing. Machine learning overcomes the main limitation of manual data analysis: the human imagination. If you cannot imagine that the gender is a relevant factor for buying decisions, you do not test the hypothesis, and, thus, never find out about it.
Data warehouses have hundreds or thousands of data points per customer. It is not possible to manually look at each factor. It is certainly not possible to look at every combination of two or three factors. More complex impact factors (for example, monthly income matters for customers under 40, whereas gender is the key factor for older customers) are practically impossible to figure out manually. Machine learning, however, uses decades-old and well-tested statistical methodologies to identify the 3 or 4 relevant factors out of hundreds or thousands. Known software products in this area are SAP, R/RStudio, SAS, or KNIME. They differ in the level of automation, visual presentation, and enterprise-level tooling and stability.
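To make the idea of "identifying the relevant factors" concrete, here is a small sketch in plain Python. It ranks candidate factors by information gain, which is one classic statistical measure among many and is not necessarily what the products named above use internally. The factor names, the synthetic customers, and the hidden purchase rule are all invented for this illustration.

```python
# Sketch: rank candidate factors by information gain on a synthetic data set.
# Factor names and the purchase rule are invented for illustration.
from collections import Counter
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, factor, target):
    """Reduction in entropy of `target` after splitting `rows` by `factor`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[factor], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


FACTORS = {
    "age_group": ["under_40", "40_plus"],
    "gender": ["f", "m"],
    "newsletter": ["yes", "no"],
    "region": ["north", "south"],
}

# One synthetic customer per combination of factor values.
customers = [dict(zip(FACTORS, values)) for values in product(*FACTORS.values())]
for c in customers:
    # Hidden rule the analysis should recover: only age group and gender matter.
    c["bought"] = c["age_group"] == "under_40" and c["gender"] == "f"

gains = {f: information_gain(customers, f, "bought") for f in FACTORS}
ranking = sorted(gains, key=gains.get, reverse=True)
relevant = [f for f in ranking if gains[f] > 0.01]  # arbitrary cut-off for the sketch
print(relevant)
```

With four factors this is trivial, but the same loop runs unchanged over hundreds of columns, which is exactly where a human analyst gives up. Note that a single-factor measure like this one can miss effects that only show up in combinations; real tools use more elaborate methods for that.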
So, who would fund such technology and why? The rationale is simple: machine learning provides better results than manual data analysis. Sales success, for example, can increase by 20% or by 500-700%. The funding question is trickier:
- Every department using reports from a data warehouse benefits from machine learning. They get better insights enabling them to lower costs or to increase revenues.
- Not everyone in a department that benefits from analytics benefits personally. The worst enemies of machine learning are employees who are paid only or mostly for playing around with Excel lists from a data warehouse to produce new lists based on a mixture of many years of experience and gut instinct. Replacing (sometimes voodoo-like inspirational) manual list generation with accurate and statistically validated lists from machine learning threatens their roles, as machines threatened weavers in the industrial revolution. However, as soon as employees work with the results and with such lists, they greatly benefit.
So which technology will companies use in five years’ time: visual analytics or machine learning? Currently, data warehouses are like donkey carts: not really comfortable, but you eventually reach your goal. Visual analytics is like a horse-drawn carriage. It makes the ride much more comfortable and you feel better. Also, the adjustment is simple. The way people work and their roles remain unchanged. Machine learning is more like a sports car. It is not comfortable and you have to learn new skills, but it takes you much further.
In my opinion, many companies will benefit from both trends and, thus, implement both. If you want to know whether visual analytics or machine learning is strategically more relevant, just look in our cities: Do you see more carriages or more sports cars?
Figure 1: How visual analytics and machine learning innovate data exploration
Approach | Step 1 | Step 2 | Step 3 | Step 3a | Step 4 | Step 5
Classic | Business user formulates business question. | Business user comes up with potential influencing factors. | Business user analyzes data using SQL or the data warehouse. Engineering might support. | If needed, the data warehouse team provides analysis support or retrieves additional data not in the data marts yet. | Visualize with diagrams for better understanding. | Decide whether business question is answered.
Visual Analytics | unchanged | unchanged | More interactive process; simple questions can be formulated more easily. | unchanged | Quicker and nicer diagrams since part of Step 3. | unchanged
Machine Learning | unchanged | Not necessary | Data analyst following a clear methodology | n/a | optional | Part of Step 3
Bye-Bye, Data Warehouse, Hello Big Data?
The heyday of data warehousing on Google Trends was in the early 2000s. Big data took off in 2012 and quickly overtook data warehousing (Figure 2). So why is the interest so high? Is it in rivalry with machine learning or visual analytics? Who would spend money on it?
Gartner defines Big Data as “high-volume, high-velocity, and/or high-variety information assets” that require new forms of cost-efficient processing and analysis functionality. New types of data have become relevant in the last few years, and they have characteristics for which traditional databases and data warehouses are not optimal. Still, companies want to manage and analyze this new type of data in the way they know from traditional databases.
One typical example of such data is behavioral data. How do customers click through a webpage and an app? When do they contact a call center? When do they visit which shop for what? Another application area is IoT. Hundreds or thousands of sensors measure how engines or trains operate. This data is useful for the actual person controlling such machines as well as for product engineering, for the production process, and for the after-sales services in many industry sectors.
These scenarios illustrate three aspects of Big Data in companies. First, Big Data is about a technology stack for new types of data. Second, Big Data is orthogonal to visual analytics and machine learning. These technologies might also be part of Big Data architectures, but Big Data is no substitute for them. Third, typical stakeholders are user experience designers and sales and/or product management, engineering, etc. They have specific roles that might not overlap with the ones who fund database or data warehouse teams.
This has advantages and disadvantages. On the one hand, having stakeholders that benefit highly (e.g., engineering or after sales) makes it likely that they are willing to invest in Big Data teams. If they see a return on investment, they bring the topic forward. On the other hand, this can result in frustration for database and data warehouse teams. They would like to move into the new territory of Big Data, but new teams might get these tasks.
Figure 2: Popularity of Topics on Google Trends (retrieved March 5th, 2019)
Data Lakes: The Payment Date Innovation
Data lakes are not (primarily) about new technology. They are about a time-shift in spending that enables companies to speculate on not-now-but-soon business ideas.
Data lakes and data warehouses both collect data from various databases and applications. Their business rationales, however, differ. Data warehouses have curated data marts. They are like exhibitions in museums: every piece is valuable and there is a reason why exactly this piece has been chosen. Data lakes are more like lumber-rooms of large families. Everyone puts in something and maybe someone else wants to use it later.
In other words: Data warehouse data can be queried immediately. Data warehouses contain and integrate data coming from various databases and other sources. This data is cleaned and transformed when it is put into the data warehouse. Thus, the data is ready to use. Inconsistencies within the data of one database or between the data of various databases are resolved. The data warehouse is the single source of truth. This has implications on the cost side. Adding databases or large numbers of tables to a data warehouse is a financial investment. These costs have to be funded up-front, whether or not the data is used in the future. Thus, companies evaluate carefully before adding new data to a data warehouse.
In contrast, data lakes store data in its original format (mostly) without any preparation. Obviously, such data is not ready to use. No one can or should expect the data to be free of inconsistencies. However, this allows the addition of data to a data lake without (big) up-front investment. Adding data to a data lake without a clear business case today is affordable. If there is a business case with/for this data in the future, the future business case will pay for the data cleansing and transformation.
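The difference in when the preparation cost is paid can be sketched as follows. This is a toy model under stated assumptions, not an architecture: the classes `Warehouse` and `DataLake`, the record layout, and the `clean` rule are all invented for the illustration. The warehouse cleans at load time; the lake stores the record as delivered and applies a transformation only when somebody reads it.

```python
# Sketch: "clean when loading" (warehouse) versus "clean when reading" (lake).
# Class names, record layout, and the cleaning rule are illustrative assumptions.
import json


def clean(record: dict) -> dict:
    """Toy transformation: parse the numbers and normalize the country."""
    return {
        "customer_id": int(record["customer_id"]),
        "country": str(record["country"]).strip().upper(),
        "amount": float(record["amount"]),
    }


class Warehouse:
    """Pays the preparation cost up front; stored rows are ready to query."""

    def __init__(self):
        self.rows = []

    def load(self, record: dict) -> None:
        self.rows.append(clean(record))  # fails here if the data is bad

    def query(self):
        return list(self.rows)


class DataLake:
    """Stores records as delivered; preparation is deferred until someone reads."""

    def __init__(self):
        self.raw = []

    def load(self, record: dict) -> None:
        self.raw.append(json.dumps(record))  # no validation, no transformation

    def query(self, transform):
        # The future business case supplies (and pays for) the transformation.
        return [transform(json.loads(line)) for line in self.raw]


record = {"customer_id": "17", "country": " ch ", "amount": "12.50"}

warehouse = Warehouse()
warehouse.load(record)

lake = DataLake()
lake.load(record)

print(warehouse.query() == lake.query(clean))
```

Both paths end with the same cleaned rows. What differs is who has to write and fund `clean`, and when: the warehouse needs it before the first record is stored, the lake only once a reader shows up.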
The key take-away for data lakes is that even if new technologies or software products might be needed, this is only a secondary aspect. The main idea is that we collect and store data inexpensively without much data preparation. We hope that one might find a way of monetizing this data later. The funding can come from business development or from a strategy department – or from product managers who think that some data collection might help them at a later point to improve and sell their product. Again, we see that these are not the typical stakeholders for databases and data warehouses.
This might sound frustrating for engineers: there are many new technologies they are interested in and could help with (visual analytics, machine learning, Big Data, and data lakes), but their managers and stakeholders simply might not care about them. That is one way to look at the trends. Another option is to think the following: The technology landscape will change in the next few years much more than in the last decades. If my managers know how to handle and acquire new stakeholders and new funding, the next five years will be the most exciting years in my whole career.
Figure 3: Understanding who benefits from which technology
Technology | Stakeholders / Potential Sponsors | Benefit
Databases | All departments on a shared infrastructure | Efficient process execution based on a reliable data storage
Data Warehousing | Controlling, Accounting, Sales, Marketing, Product Management | Better insights in historic and current data combined with in-depth data model knowledge
Visual Analytics | Same as Data Warehousing | Easier data analysis, e.g., of data stored in a data warehouse
Machine Learning | Many departments (including data warehousing), but specialized/fragmented | New level of understanding data with large attribute sets
Big Data | Customer Experience, Engineering, Production, and Maintenance of the company core product | Ability to store and analyze large and diverse data sets with short query execution times
Data Lakes | Business development | Collecting much data without investing much for data preparation until there is a business case