Data Lakes vs. Data Warehousing – How Each Works in the Digital Technology Boom

Posted on: Mar 27, 2017

PricewaterhouseCoopers mentioned that data lakes could “put an end to data silos.” In their study on data lakes, they noted that enterprises were “starting to extract and place data for analytics into a single, Hadoop-based repository.”

The term “data warehouse” was coined by William H. Inmon in the 1970s. Inmon, known as the Father of Data Warehousing, described a data warehouse as “a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management’s decision-making process.”

Emmett Torney of DATUM said, “Smart devices, hyper connectivity, supercomputing and cloud are quickly changing the world we live in and the way companies conduct business. All of these technological drivers are being fuelled by one important asset: Data.”

Data

One of the essential points of differentiation is that a data warehouse stores only data that has been modelled and structured, while a data lake ingests all data in its original form – structured, semi-structured and unstructured alike. Many companies still analyse mainly structured data, but “newer” sources such as text data, streaming data and geospatial data are becoming part of an evolving data landscape. A lake can hold data that will be analysed today, in the future, or perhaps never at all.
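As a rough sketch of what “storing data in its original form” means in practice (the file names and paths here are purely illustrative), landing data in a lake can be as simple as copying files as-is, with no upfront modelling:

```python
import shutil
from pathlib import Path

# Hypothetical landing zone for a data lake: files are stored exactly as
# they arrive, whatever their format, with no upfront parsing or modelling.
lake = Path("/data/lake/raw/2017-03-27")
lake.mkdir(parents=True, exist_ok=True)

incoming = [
    "orders.csv",        # structured: rows and columns
    "clickstream.json",  # semi-structured: nested key/value records
    "support_call.txt",  # unstructured: free text
]

for name in incoming:
    # No schema, no transformation; shape is applied later, at read time,
    # and only if someone actually decides to analyse the file.
    shutil.copy(Path("/data/incoming") / name, lake / name)
```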

Data mart issues

“Data performance issues caused by centralizing data in an enterprise data warehouse have led to the creation of data marts, which solve performance problems by spreading the BI processing across multiple data stores,” said Colin White, president of DataBase Associates Inc. and founder of BI Research. “It is often quicker and easier to build a data mart than to incorporate additional data into the enterprise data warehouse and then build the data mart from the data warehouse.”

The problem with data marts is that organizations often build them directly from business transaction databases rather than from the enterprise data warehouse, which bypasses the warehouse’s integrated, cleansed view of the data and recreates the very silos it was meant to eliminate.

Processing and modelling

First and foremost, we have to give data a formal shape and structure before we can load it into a data warehouse; in other words, we have to model it. That’s called schema-on-write. With a data lake, by contrast, you load in the raw data as-is, and only when you’re ready to use the data do you give it shape and structure. That’s called schema-on-read. These are two very different approaches. Schema-on-write also means the models must be very well constructed, for if a model is not applicable, the final results can be worthless or even have negative consequences.
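To make the contrast concrete, here is a minimal Python sketch (the table and field names are hypothetical): the warehouse-style table must be declared before a single row can be loaded, while the lake-style records are stored raw and given shape only when they are read.

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the table's shape must be declared
# before a single row can be loaded; rows that don't fit are rejected.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.execute("INSERT INTO sales VALUES (?, ?)", ("EMEA", 1200.0))
print(db.execute("SELECT region, amount FROM sales").fetchall())

# Schema-on-read (lake style): raw records are stored as-is, and structure
# is imposed only at query time; fields may vary from record to record.
raw_lines = [
    '{"region": "EMEA", "amount": 1200.0, "channel": "web"}',
    '{"region": "APAC", "amount": 800.0}',
]
for line in raw_lines:
    record = json.loads(line)
    # Decide *now* which fields matter; absent fields get a default.
    print(record["region"], record.get("channel", "unknown"))
```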

Storage and Hadoop

Processing technologies like open-source Hadoop make it possible to manage far larger quantities of data. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low compared to a data warehouse. There are two key reasons for this. First, Hadoop is open-source software, so licensing is free and support comes from the community. Second, Hadoop is designed to be installed on low-cost commodity hardware. Hadoop uses a computational paradigm called MapReduce (pioneered by Google) to divide an application into many small fragments, each of which may be executed on any compute node in a cluster. For example, Visa was able to reduce processing time for two years’ worth of data (73 billion transactions) from one month to 13 minutes using Hadoop.
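The divide-and-recombine idea behind MapReduce can be sketched in a few lines of Python; this is an in-process illustration of the pattern, not Hadoop’s actual API:

```python
from collections import defaultdict

def map_fragment(text):
    """Map step: emit (word, 1) pairs for one fragment of the input."""
    return [(word, 1) for word in text.split()]

def reduce_by_key(pairs):
    """Shuffle + reduce step: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

fragments = ["big data big lake", "data warehouse data lake"]
intermediate = []
for fragment in fragments:          # each iteration could run on its own node
    intermediate.extend(map_fragment(fragment))
print(reduce_by_key(intermediate))  # {'big': 2, 'data': 3, 'lake': 2, 'warehouse': 1}
```

In a real cluster, each fragment is mapped on a separate node close to where its data is stored, which is what lets Hadoop scale to workloads like the Visa example above.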

Flexibility and business intelligence

We know that data warehousing is highly structured, with the schema defined before data is stored. Data in a traditional data warehouse has been cleansed, whereas data in a data lake is typically raw. While that structure makes the warehouse a powerful storage option, it also makes changes within the data warehouse difficult. That’s why the increasing demand for self-service and modern business intelligence makes a data lake highly attractive.

Agility and data warehousing

A well-designed archive can enhance data protection, simplify restores, ease search and e-discovery efforts, and save money by intelligently moving data off expensive primary storage systems. A data warehouse is a highly structured repository, but that structure can make it time-consuming to build and change. As a flexible, open-source data storage technology, Hadoop offers improved processing at roughly five percent of the cost of relational database technology, by some estimates. A data lake, on the other hand, lacks the structure of a data warehouse – which gives developers and data scientists the ability to easily configure and reconfigure their models, queries and apps on the fly.

Security and data lakes

Apache Hadoop, the distributed storage and processing framework, is increasingly popular for storing massive amounts of data. By default, Hadoop runs in non-secure mode. When service-level authentication is turned on, Hadoop end users must be authenticated by Kerberos – the popular computer network authentication protocol. A data lake is not a data warehouse: each is optimized for a different purpose, and the goal is to use each one for what it was designed to do.
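Switching Hadoop out of its default non-secure mode is a matter of configuration. The snippet below shows the standard core-site.xml properties for enabling Kerberos authentication and service-level authorization; cluster-specific details such as keytabs and principals are omitted here.

```xml
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <!-- default is "simple" (non-secure); "kerberos" enables authentication -->
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <!-- enables service-level authorization checks -->
    <value>true</value>
  </property>
</configuration>
```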

Data warehouses are made up of data that has already been integrated, but they are limited in that they struggle to host data from unstructured sources, such as product sensors, social media and other non-traditional sources.

Writer Amber Lee Dennis notes, “Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a question of if, but when.”