Hadoop and Big Data Storage: The Challenge of Overcoming the Science Project

Posted on: May 27, 2015

About two years ago, I started talking to Fortune 500 companies about their use of tools like Apache Hadoop and Spark to deal with big data in their organizations. I specifically sought out these large enterprises because I expected they would all have huge deployments, sophisticated analytics apps, and teams taking full advantage of data at scale.

As the CTO of an enterprise infrastructure startup, I wanted to understand how these large-scale big data deployments were integrating with existing enterprise IT, especially from a storage perspective, and to get a sense of the pain points.

What I found was quite surprising. With the exception of a small number of large installs — and there were some of these, with thousands of compute nodes in at least two cases — the use of big data tools in most of the large organizations I met with had a number of similar properties:

Many Small Big Data Clusters

The deployment of analytics tools inside these enterprises has been incredibly organic. In many situations, when I asked to talk to the “big data owner,” I wound up with a list of people, each of whom ran an 8-12 node cluster. Organizational IT owners and CIOs have referred to this as “analytics sprawl,” and several IT directors jokingly mentioned that the packaging and delivery of Cloudera CDH in Docker images was making it “too easy” for people to stand up new ad hoc clusters. They had the sense that this sprawl of small clusters was actually accelerating within their companies.
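To give a sense of just how low that barrier is, here is a minimal sketch of standing up a single-node CDH environment, assuming the Docker SDK for Python and Cloudera's published quickstart image; the image name, startup command, and port mapping follow Cloudera's quickstart documentation but should be treated as illustrative, since specifics vary by release.

```python
# A minimal sketch of how easily an ad hoc CDH "cluster" can be stood up,
# using the Docker SDK for Python (pip install docker). The image name and
# startup command follow Cloudera's single-node quickstart image; treat
# both as illustrative, since specifics vary by release.
import docker

client = docker.from_env()

container = client.containers.run(
    "cloudera/quickstart:latest",      # Cloudera's single-node CDH image
    "/usr/bin/docker-quickstart",      # quickstart init script inside the image
    hostname="quickstart.cloudera",    # the image expects this hostname
    privileged=True,                   # required by the quickstart services
    tty=True,
    detach=True,
    ports={"8888/tcp": 8888},          # expose Hue on the host (illustrative)
)

print(f"Ad hoc CDH node running: {container.short_id}")
```

One developer, one laptop, one command: exactly the pattern those IT directors were describing.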

Non-standard Installs, Even on Standard Distributions

 

The well-known big data distributions, especially Cloudera and Hortonworks, are broadly deployed in these small clusters, as they do a great job of combining a wide set of analytics tools into a single documented and manageable environment. Interestingly, these distributions are generally used as a “base image,” into which all sorts of other tools are hand-installed. As an example, customizations for ETL (extract, transform, load) — pulling data out of existing enterprise data sources — are common. So are additions of new analytics engines (H2o, Naiad and several graph analytics tools), that aren’t included in standard distributions. The software ecosystem around big data is moving so fast that developers are actively trying out new things, and extending these standard distributions with additional tools. While agile, this makes it difficult to deploy and maintain a single central cluster for an entire organization.  View more