Data Lake vs Data Warehouse: Is the warehouse going under the lake?
Posted on: Jul 22, 2016

The desire to save every bit and byte of data for future use and to make data-driven decisions is the key to staying ahead in the competitive world of business operations. All this is possible thanks to low-cost storage systems like Hadoop and Amazon S3. For the same cost, organizations can now store 50 times as much data in a Hadoop data lake as in a data warehouse. The data lake is gaining momentum across organizations, and everyone wants to know how to implement one and why. The data lake architecture leverages analytics capabilities for big data processing and helps businesses address operational challenges that were difficult to solve with conventional data warehousing technologies.

Several people are writing that data lakes are replacing data warehouses, but this is just another technology hype that gets in the way of the effective use of data. Whether the data lake will replace the data warehouse or the two will complement each other is currently the hottest discussion in the big data community. This article explores the much-debated question of “Data Lake vs. Data Warehouse”, to which DeZyre industry experts add the option “or Both Coexist”.

Data warehouses do a good job of what they are meant to do, but with disparate data sources and different data types like transaction logs, social media data, tweets, user reviews, and clickstream data, data lakes fulfil a critical need. Data warehouses can store only structured data in a standard format that fits in rows and columns. To manage semi-structured and unstructured data, organizations are adopting the data lake architecture for greater flexibility. With a data lake, organizations can store any type and amount of data and use it in their applications when required, which is not possible with a conventional data warehouse. Data lakes remove the cost and complexity of transforming data at ingestion, as they follow a “Schema on Read” approach.

Suppose somebody asks you, “How will you handle analytics for the 64 TB of data that a company creates every month: with a data lake or a data warehouse?” To find the right answer, you must first understand what a data lake is and how it differs from the data warehouse approach.

Data Lake vs. Data Warehouse

Many people think that data lakes are just a re-creation of the data warehouse, but is this the truth behind the shiny new data lakes?

Data warehouses require a pre-defined ETL (Extract, Transform, Load) process to extract the data and bring it into the warehouse. When loading data into a data warehouse, the schema must be known, and so should the queries the warehouse will have to answer. If a query needs to change, the data has to be re-ingested into the data warehouse.
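
The sketch below illustrates this constraint with Python's built-in sqlite3 module standing in for a warehouse. The records, the table name `orders` and the function `load_warehouse` are hypothetical and chosen only for illustration; a real warehouse load would be far more involved.

```python
import sqlite3

# Hypothetical source records; the field names are illustrative only.
source_records = [
    {"order_id": 1, "customer": "acme", "amount": 120.0, "channel": "web"},
    {"order_id": 2, "customer": "globex", "amount": 75.5, "channel": "store"},
]

def load_warehouse(records):
    """Extract, transform and load into a table whose schema is fixed up front."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    # Transform step: keep only the fields the schema anticipates.
    rows = [(r["order_id"], r["customer"], r["amount"]) for r in records]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn

warehouse = load_warehouse(source_records)

# A query the schema anticipated works fine.
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# A question about "channel" cannot be answered from the warehouse:
# that field was dropped at load time, so the data would need re-ingesting.
columns = [row[1] for row in warehouse.execute("PRAGMA table_info(orders)")]
channel_available = "channel" in columns
```

Here `channel_available` ends up false, which is the situation the paragraph describes: answering the new question means changing the schema and loading the source data again.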

Data lakes follow an ELT process, i.e. Extract, Load and Transform. The schema is defined only when the data is pulled and accessed for analysis. Data is stored at the leaf level in an untransformed state, and a schema is applied only to fulfil data analysis requirements.
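
As a rough illustration of applying the schema only at read time, the following sketch keeps raw JSON lines untouched and lets each analysis bring its own schema. The sample events, the helper `read_with_schema` and both schemas are assumptions made for this example, not part of any particular data lake product.

```python
import json

# The "lake": raw events stored exactly as they arrived (hypothetical data).
raw_lake = [
    '{"user": "u1", "page": "/home", "ms": 120, "ref": "ad"}',
    '{"user": "u2", "page": "/cart", "ms": 340}',
    '{"user": "u1", "page": "/cart", "ms": 95, "ref": "email"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep the named fields and cast them."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }

# Two analyses, two schemas, one untouched copy of the data.
latency_schema = {"page": str, "ms": int}
referral_schema = {"user": str, "ref": str}

latency_rows = list(read_with_schema(raw_lake, latency_schema))
referral_rows = list(read_with_schema(raw_lake, referral_schema))
```

Because nothing is discarded at load time, a third analysis with yet another schema could be added later without touching `raw_lake`.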

Data Warehouses do not retain all data whereas Data Lakes do.

Database professionals analyse various data sources to understand the business processes and then profile the data into a structured data model for reporting. This takes a considerable amount of time, most of which is spent deciding what data to include in the data warehouse and what to leave out. Usually, if some data is not required to answer specific business questions or for reporting, it is not loaded into the data warehouse, as this simplifies the data model and saves expensive disk storage space.

In contrast, in a data lake all data is loaded into the storage repository irrespective of its use. This lets businesses dig into the repository whenever they require specific data for any kind of analysis. Commodity, off-the-shelf servers make data lake hardware infrastructure easily and economically scalable to petabytes.

Data Lake vs. Data Warehouse: Schema on Read vs. Schema on Write

As explained earlier, data lakes place raw data into large storage repositories, such as the HDFS used by Hadoop, where it can be analysed without a predefined structure (Schema on Read). Unlike a data warehouse, which relies on a schema defined up front (Schema on Write), a data lake is open to all kinds of analytics. In a data lake architecture, parsing and a schema are applied to the data when a data scientist reads it in raw format from the lake. Organizations cannot pick either “Schema on Read” or “Schema on Write” as a universal winner, as the choice depends on several factors.

Suppose that there is a data set containing millions of PDF documents that have been scanned from paper business cards. There is a choice to make when storing this data. The organisation can apply a schema to the data before writing it, by organizing the data from each PDF into a table with columns like name, company name, email id, phone number and so forth. This is the schema-on-write approach, as the data on each business card is mapped and written to predefined columns, as in a data warehouse. The other way is to simply dump all the PDF documents into the data lake and identify later the schema required for analysis. Data lakes make exploratory analytics possible with tools like Hadoop, Spark, Hive and Apache Drill, as the data scientist controls the parser and the schema. Neither approach is inherently better; they are two very different approaches, and the choice depends on what a data scientist or a developer is trying to do.
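
The business card example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the card text is assumed to have already been extracted from the PDFs by some OCR step (not shown), the sample cards are invented, and the regular expressions and the helper `parse_card` are deliberately naive.

```python
import re

# Hypothetical OCR output for two scanned cards; real PDFs would need
# an OCR step first, which is outside the scope of this sketch.
card_texts = [
    "Jane Doe\nExample Corp\njane.doe@example.com\n+1 555 0100",
    "Raj Patel\nSample Ltd\nraj@sample.org\n555-0199",
]

# Deliberately simple patterns, good enough for the sample data only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{5,}\d")

def parse_card(text):
    """Map free text onto the columns name, company, email and phone."""
    lines = text.splitlines()
    email = EMAIL.search(text)
    phone = PHONE.search(text)
    return {
        "name": lines[0],      # assumes the name is on the first line
        "company": lines[1],   # assumes the company is on the second line
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
    }

# Schema on write: parse once and store only the structured rows.
table = [parse_card(text) for text in card_texts]

# Schema on read: store the raw text and parse only when a question is asked.
lake = list(card_texts)
emails_on_demand = [parse_card(text)["email"] for text in lake]
```

The parser is the same in both cases; what differs is when it runs. With schema on write, anything `parse_card` misses is lost once the raw text is discarded, whereas with schema on read the parser can be improved later and rerun over `lake`.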

Data Lake vs. Data Warehouse: Economical vs. Expensive Storage

The storage industry has a lot to offer in terms of low-cost, horizontally scalable platforms for storing large datasets. Hadoop has evolved as a batch processing framework built on top of low-cost hardware and storage, and many companies have started using Hadoop as a data lake because of its economical storage cost, unlike data warehouses, which are expensive.