When companies embark on a journey of becoming data-driven, usually, this goes hand in and with using new technologies and concepts such as AI and data lakes or Hadoop and IoT. This can cause an unexpected social dynamic.
Suddenly, the data warehouse team and their software are not the only ones anymore that turn data into insights. New teams and projects with more fashionable software solutions claim to be able to do the same–or even better. Traditional business intelligence and data warehouse architectures look old-fashioned and outdated.
Is now the time for companies to redesign their complete business intelligence landscape? Which technologies and concepts are just a hype and which are here to stay? In the following, we look at the classic data warehouse architectures before elaborating on four trends that complement classic data warehouses in data driven companies.
The Classic Data Warehouse Architecture
Data warehouses build their functionality on top of an operational databases layer. Operational databases serve business applications by storing their data. This is the data needed for daily operations, such as master data, product descriptions, bank transfers, or point of sales records.
Business users and database administrators can submit analytical queries against operational databases, but only if they do not impact the stability and performance of the system. Even then, combining data from different databases (e.g., one with online shop data and one with retail store data) is challenging or impossible on the operational databases layer. To overcome these hurdles, companies implement data warehouses.
Oracle’s data warehouse reference architecture consists of three layers [McQuigg11], to which we refer to in this article with the terms “staging area layer,” “consolidated data warehouse layer,” and “curated data marts layer”. The staging area is an intermediate storage place for data being copied from operational databases or coming from other data sources during the ETL process. The data warehouse software extracts data from operational warehouses (“E” step), transforms (“T” step) and cleanses the data including aggregating data and solving inconsistencies. In the last step, the data is loaded into the data warehouse layer and its normalized tables (“L” step).
The data cleansing is a quite work-intense step to develop. Data inconsistencies between source systems are common, even in well-designed application landscapes. Slight variations in the meaning of terms due to different foci in business units are an important reason. How do you count the number of orders in your system? Marketing and sales might count the number of signed contracts. A factory might only count orders if all needed supplies are in the factory building such that production can start immediately. Such inconsistencies are solved during the ETL process. Thus, the data in the data warehouse layer is consistent and becomes the one source of truth for the company, especially if various business units talk about them. The third data warehouse layer provides curated data marts. User groups such as accounting, sales, or operations get views and reports with the data relevant for them in a user-friendly way.
Three observations about how data warehouses are typically operated in organizations and companies influence our discussion of the newer technology trends:
- ETL processes run typically overnight. This is to minimize the impact on operational databases during office hours. As a consequence, data warehouses tend not to have (near) real time data, but rather one day old data.
- Many companies have data warehouses solutions in place for years. They have a team that provides services to the business users already for a long time. Thus, business users have an opinion on how user-centric, innovative, on-time, and flexible the team is. Perception and reality might not match, but has a high impact when the business and IT managers decide which tasks to assign to which team.
- Data warehouses are a big investment, though the license fees are usually not the biggest spending category. Building a data model and solving data inconsistencies is a time-consuming, non-trivial endeavor. If any new technology wants to succeed, it makes sense to build on top of a solid data model such as provided by the consolidated data warehouse layer.
In the following, we discuss four newer trends and how they relate to the classic data warehouse architecture.
Trend 1: The Knowledge Layer – The technology is ready, BI teams (often) not
One of the biggest challenges for data warehouse teams is the fact that that business users expect more from the IT department than lists and reports. In a world of data-driven organizations, IT departments have to ensure that their data warehouse delivers input for better decision making. Can you tell me why car sales dropped in 5 of our 17 regions? How do these regions differ from the rest? Which clients should the bank advisors call tomorrow? Data warehouses have all data and information to answer such questions–and the knowledge layer is about analyzing the wealth of data, to draw conclusions, and making the results accessible to the business users.
The knowledge layer has two factettes: machine learning and statistical models and self-service business intelligence. Machine learning and statistical models use advanced algorithms to generate insights humans cannot deliver. Which of the thousands of parameters allow me to understand what customers might want to buy? Data engineers and scientists provide answers, often using tools such as R or Python running in a database or on separate servers or using prepackaged solutions such as SAS or SAP Analytics.
The second facet of the knowledge layer is self-service business intelligence. Self-service BI focuses on enabling business users to make analysis by themselves. Many business users have a good understanding of query language and, if provided with a user-friendly access to data, can analyze their data without help. The old-fashioned approach that a business analyst specifies the perfect report with the business and engineers implement it the next 2-3 months works for regulatory reports. There is a time lag between when the necessity is clear and when the report has to be implemented and available. Since such reports are generated afterwards on a daily, weekly, or monthly basis, the delay and the high effort for the implementation are acceptable. However, if the business has operational and strategic questions that they want to get answered today and not in three months and which only need to be answered once, the existing service model it too slow and expensive. Thus, business users demand self-service business intelligence. Power BI and Qlik address this need with quite some clever marketing–data warehouse team should have an answer for that trend.
Some data warehouse professionals might even not be aware of this key point: The trend is not (only) about features of the data warehouse and deep learning algorithms and fancy GUIs. In other words, stating that Oracle Autonomous Data Warehouse Cloud comes with AI and machine learning capabilities is not the solution. The key is the changing interplay between business users and the data warehouse team – and the latter must adapt:
- There is the established way of reporting and developing new periodic reports and changing existing ones. This is one area of support, and (only) for this, the traditional service model works.
- There is the new wish of self-service business intelligence. Data warehouse teams have to make sure that business users can analyze their data more independently.
- The business needs engineers who build, run, and maintain machine learning models–and explain to the business what the models mean. This is a highly interactive activity.
The knowledge layer requires data warehouse teams to change. They need new skills: technical skills to understand the new features and social skills to be able to work closer with business users and consult them. When all this comes together, business users benefit from all the features of the data warehouse and all the data stored within it.
Trend 2: Commoditization and Democratization of Databases complicates Data Warehousing
Installing an Oracle database (or other databases) got easier over the years. Now in the cloud, you even click a button and get a new instance within seconds–with no need to install any patches or do any other maintenance work. You do not even need database administrators to keep them running. But databases have not just become much easier to handle, but the license costs are often much lower. New database vendors such as Microsoft entered the market with an extremely competitive pricing, plus there are free open source solutions such as Maria DB. Nowadays, development projects can fund license fees from their project budget. Thus, databases became a commodity for many applications. The days are gone that development or integration projects depended for knowledge or license fees reasons on a central database team and their database servers.
A consequence of this commoditization is the democratization of databases. When databases work “out of the box” and are inexpensive, every application can choose the technology they want. Application vendors ship their applications with a database of their choice. Development projects use databases they are familiar with. We are in a period of decentralized data management. As a side effect, the work for data warehouse teams gets more complicated. Integrating and maintaining feeds from operational databases to data warehouses requires more effort. There are many more databases and they are less easy to access. They might be placed in different network zones behind firewalls, use their own user management, or run different technology and product versions.
The quick spread of NoSQL databases was only possible because software developers can choose the database technology they prefer. So, why not use graph databases, document databases, key value stores, and object stores? The good news is that more and different data becomes available such as log files with behavioral data of web users. However, data warehouse teams have to be able to access the data and load the data into their systems–and to process the data. If IT departments build up e.g. a Hadoop cluster just to handle the new types of data, it can be quite challenging to make such a business case working.
It was remarkably interesting to listen to Oracle’s opinion on this second trend–decentralized databases and a variety of SQL plus NoSQL databases in many companies at the Oracle OpenWorld 2020 earlier this year. Oracle’s senior vice president, Juan Loaiza, talked about “converged databases,” i.e., one database for all purposes [Loaiza20]. Obviously, this would reduce the number of technologies in companies. At this moment, however, this is more a future vision (or Oracle’s desperate hope), far away from today’s reality.
Trend 3: New Types of Data, New Opportunities
Logfiles–especially with behavioral data–is one new type of data. Data streams from IoT devices is another one. IoT devices are core building blocks that enable the digitalization of new business sectors such as logistics, manufacturing, or production. Logfiles and behavioral data as well as IoT devices result in an explosion of available data in companies. While many executes believe that such data might provide a competitive edge in the future, there is often no clear business case at this moment. Thus, managers do not provide funding for analyzing data in depth to load it in data warehouses. Instead, companies invest in data lakes. Data lakes store data inexpensively. So, companies can store a lot of data and decide later whether to use the data (and invest in analyzing and preparing such data) or to drop the data.
Another category of data gets also more attention than in the past: external data. Companies and organizations have data (only) about their customers in their databases. Predictive analytics algorithms can identify the characteristics of profitable customers in the car leasing business. This could be males aged 30 to 40, living in an urban area. To make this knowledge actionable, companies need external data–addresses of the target group that are not yet customers. If you know the characteristics of towns and regions that you operate in successfully, you might want to use some open data from the government to understand the other towns have similar characteristics. They are candidates for opening your next branches. But external data helps as well for optimizing company-internal processes. Benchmarking with similar companies is a popular consulting topic.
Interestingly, using external data is more common on an operational level and for daily business processes. For years, companies rely on external data (providers) for credit scoring, address validation, or financial market data. Specific organizational questions are constantly new when using external data for analytics. Do you have to update the data frequently? Who is responsible to organize the data? Who pays for the internal effort? Who pays the data provider? Thus, this third trend is a great opportunity for data warehouse teams to establish new services regarding logfiles, data lakes, IoT data streams, or external data, at least, if they get adequate funding.
Trend 4: Pantheistic Intelligence – Analytics Capabilities outside Data Warehouses
There are two reasons why event-driven architectures might result in more and more “intelligent” components outside data warehouses: speed & integrated intelligence.
First, if ETL processes run overnight, everything that requires (near) real time data processing and evaluation cannot run in the data warehouse. When a customer pays with a credit card, the card issues has to decide immediately whether or not to authorize the payment. This implies that an “approval engine” with intelligence has to be set up in addition to the data warehouse.
The second reason is integrated intelligence, especially in non-business applications such as in network management. Certainly, a central data warehouse can gather the actual network and its changes and issues. However, simple tasks such as trying to determine the root cause of issues is something various network devices or an AI Ops solution can do. They can even try to fix them. With the rise of AI, it is clear that intelligence will be part of many more components and software solutions. AI can be found outside of databases and data warehouses as well. Not every data warehouse specialist likes this trend, but this is a consequence of pantheistic intelligence being in many more components outside of a big central data warehouse.
The world of data warehousing, business intelligence, and analytics changes with company cultures becoming more data driven. New technologies and new scenarios emerge at this very moment. It is a fascinating time to work, learn, and growth in this area!
[McQuigg11] Jeff McQuigg: Oracle’s Data Warehouse Reference Architecture, April 4th, 2011, https://greatobi.wordpress.com/2011/04/11/oracle%E2%80%99s-data-warehouse-reference-architecture/, last retrievd 22.4.2020
[Loaiza20] Juan Loaiza at Oracle OpenWorld Europe: London 2020, https://www.youtube.com/watch?v=A0ruIgV-zA0