How to plan for and build a central hub for data analytics with the ever-evolving Hadoop ecosystem
There is no denying that data is everywhere and there is a lot of it. Smartphones and other mobile devices are enabling companies to communicate directly with their customers and provide highly customized insights and services. Not only are the amounts of data increasing dramatically, but new types of data are being created at an accelerating rate. Much of this data is different both in structure and in substance, and it cannot easily be arranged into tables or transformed in a reasonable amount of time.
Meanwhile, data-driven business analytics often involves processing so intensive that even a few years ago it would have seemed inconceivable. Systems purpose-built for this kind of processing do exist, but they are typically closed, proprietary architectures priced well out of reach for small and medium-sized organizations.
These challenges have given rise to a new open, distributed computing architecture typified by implementations such as Hadoop and Apache Spark. Increasingly, these techniques and technologies are being combined into a hub-and-spoke style of integration called a data lake. In this article, we will outline what a data lake is, as well as the tools and considerations involved in implementing one.
For a closer look at business use cases where companies have implemented Hadoop ecosystem tools to enable new analytics and improve existing workflows, please see our data lake whitepaper.
In a world that is becoming more data-driven and globally competitive, the success of a company will increasingly depend on investing in technologies that can create a 360-degree view of its business, designing a scalable infrastructure, and using analytics to drive better decision-making throughout the organization.
What is a data lake?
A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats. A data lake might combine sensor data, social media data, click-streams, location data, log files, and much more with traditional data from existing RDBMSes. The goal is to break the information silos in an enterprise by bringing all the data into a single place for analysis without the restrictions of schema, security, or authorization. Data lakes are designed to store vast amounts of data, even petabytes, in local or cloud-based clusters consisting of commodity hardware.
In a more technical sense, a data lake is a set of tools for ingesting, transforming, storing, securing, recovering, accessing, and analyzing all the relevant data of a company. Which tools you choose depends in part on the specific needs and requirements of your business.
Creating a plan
The choice of technologies to be installed will vary depending on the use cases and the streams that need to be processed. Additional considerations could be whether to host a solution internally or to rely on companies providing platforms as a service, like Amazon Web Services.
Regardless of the physical location of the cluster, Hadoop provides a broad range of tools with the flexibility of including some or all at any stage of the data lake’s development. Hadoop also allows the system to be horizontally scaled by adding or removing nodes without considerable time or effort. As a result, organizations can start small and grow the cluster as usage increases.
Use cases can be broadly categorized into different streams of processing like batch, streaming, and queries with fast read/write times. An appropriate tool set can be configured depending on the kind of processing that needs to take place.
Use case discovery
Defining use cases early helps immensely when implementing a data lake. Analyzing data without sufficient context about it leads to unwanted results and wasted engineering cycles. Likewise, not knowing how the data will be used, or who the target audience is, will hurt the performance of the analytics built on top of the data lake. This becomes increasingly evident as data lake usage grows over time.
How do you identify a big data challenge? A good measure is to define or categorize data in terms of the three V's: volume, variety, and velocity. Volume refers to the amount of data, often on the scale of petabytes. Variety refers to the structure and sources of the data. Velocity indicates the speed at which the data is generated and needs to be processed.
Identifying the problem and implementing a big data architecture will not produce results by itself. A proper system must be implemented and optimized to provide repeatable results.
The goal of identifying a use case should not be to create an exhaustive list, but to define a small number of business-critical use cases so that the data lake can be designed and built to meet those needs. This will assist with cluster design and sizing, planning for future growth, and decisions around ingestion, as well as a comparative analysis with existing systems.
Building the data lake
Given the advantages of a central data lake, there are also challenges in the building process that every enterprise should address, the most important being data management and security for sensitive data. However, Hadoop ecosystem tools built on top of the core frameworks make data lakes an enterprise-ready solution. Taking these common challenges into account, we can think of a data lake as an enterprise data warehouse that does not require you to predefine metadata or a schema before ingesting data into the system.
Hadoop is not any one system or tool; it is an ever-evolving collection of tools. At the core of Hadoop is HDFS, a reliable, fault-tolerant, distributed file storage system, with MapReduce as its processing engine. Successive releases of Hadoop are progressively moving away from batch-oriented MapReduce toward more responsive processing engines such as Apache Spark.
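To make the batch model concrete, here is a minimal sketch of MapReduce's two phases in plain Python (no Hadoop required). The canonical word-count example: the map step emits a (word, 1) pair for every word, and the reduce step groups the pairs by key and sums the counts, just as a Hadoop job would do across many nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: group pairs by key and sum the counts per word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts["the"])  # 3
```

In a real cluster, the map output is partitioned and shuffled across machines before the reduce phase; this single-process sketch only illustrates the programming model.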
The following is a list of some Hadoop ecosystem tools currently in use. The list is not exhaustive but includes those most commonly found among production deployments:
HDFS: Distributed file storage
MapReduce: Batch processing engine
HBase: A columnar NoSQL database on HDFS
Hive: Provides SQL-like access to data on HDFS
Tez, Impala, HAWQ: Massively parallel processing engines on top of HDFS that provide fast response times
Sqoop, Flume, Pig: Tools for data ingestion and transformation
Kafka: Distributed messaging service
Storm: Distributed stream processing engine
Spark: Fast and general-purpose engine for data processing
In addition, production deployments of Hadoop usually include tools for data management, data security, and cluster management. Tools for these tasks differentiate Hadoop offerings available in the market. For example, Cloudera offers a proprietary solution that includes Cloudera Manager, Navigator, and Sentry for cluster administration and security. Hortonworks offers open source tools like Ambari, Ranger, Falcon, and Atlas.
Building a data pipeline
Constructing a data lake requires transferring data into Hadoop clusters, then running the necessary processes to parse, cleanse, and prepare the data for analytics. Depending on the data producers, data can be either “pushed” or “pulled” into the cluster. To push data into the cluster, you could use command-line tools from a client machine, REST APIs, or a Hadoop ingestion tool such as Flume. Data can also be pulled with tools like Sqoop. The choice of tools and strategies for data ingestion is determined by the nature of the data being ingested into the cluster.
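As an illustration of pull-style ingestion, the sketch below imitates what a tool like Sqoop does: it reads rows from a relational source and lands them as delimited files in a staging area. This is only a toy model under stated assumptions: SQLite stands in for the source RDBMS, a local directory stands in for HDFS, and the table and column names are made up for the example.

```python
import csv
import sqlite3
from pathlib import Path

# SQLite stands in for the source RDBMS (hypothetical "orders" table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 99.50), (2, "globex", 12.25), (3, "initech", 45.00)],
)

# A local directory stands in for an HDFS landing zone.
staging = Path("staging")
staging.mkdir(exist_ok=True)

# "Pull" the table into a delimited file, as Sqoop would write to HDFS.
rows = conn.execute("SELECT id, customer, total FROM orders").fetchall()
out_file = staging / "orders.csv"
with out_file.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer", "total"])  # header row
    writer.writerows(rows)

print(f"landed {len(rows)} rows in {out_file}")
```

A real Sqoop job would additionally split the query across parallel mappers and write directly to HDFS; the shape of the work, reading from a source system and landing files in the lake, is the same.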
Ensuring data quality
The old adage “garbage in, garbage out” still applies, perhaps even more so given the massive amounts of data typical in a Hadoop cluster. Before any ingested data is analyzed, a minimum degree of data sanitization should be performed to ensure its validity.
Hadoop is not a source of record, and most of the data is transferred to the data lake from disparate sources. As data is transferred into the data lake, there is a chance of the data becoming corrupted or arriving in an unexpected form. Therefore, any data lake implementation should include workflows that check the quality of the ingested data as early as possible.
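As a concrete example, a quality gate at ingestion time might verify that each record parses and carries the expected fields before it is admitted for analysis. The sketch below is a minimal version of such a check; the click-stream field names and rules are hypothetical, not taken from any particular system.

```python
import json

# Hypothetical schema for an ingested click-stream record.
REQUIRED_FIELDS = {"user_id": str, "url": str, "timestamp": int}

def validate(raw_line):
    """Return the parsed record if it passes basic checks, else None."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # corrupted in transit
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            return None  # missing or wrongly typed field
    return record

raw = [
    '{"user_id": "u1", "url": "/home", "timestamp": 1700000000}',
    '{"user_id": "u2", "url": "/cart"}',  # missing timestamp
    'not json at all',                    # corrupted line
]
clean = [r for line in raw if (r := validate(line)) is not None]
print(len(clean))  # 1
```

In production, rejected records would typically be routed to a quarantine area for inspection rather than silently dropped, so that systematic ingestion problems surface early.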