Reflecting on Ten Years of Hadoop
Posted on: Sep 12, 2016

This year marks the 10th anniversary of Hadoop, a technology that has come to represent a major transformation in the enterprise computing industry. I began working with Hadoop around its inception and have seen it grow into a central platform for big data analytics. As we celebrate the anniversary of a technology that is so fundamental to so many, I want to shed some light on my own experience with the development and growth of Hadoop.

My earliest experience with Hadoop began in 2007, less than a year after the open-source technology’s release, during my time as part of the original data service team at Facebook. Before then, we had been using a mix of home-grown software and a traditional legacy data warehouse. However, neither approach was capable of meeting the company’s data processing demands, and we soon found ourselves at a point where processing a full day’s worth of data on those platforms actually took longer than 24 hours. We had an urgent need for infrastructure that could scale along with our data, and it was at that time that we began exploring Hadoop. The fact that it was an open-source project, already being used at petabyte scale and scaling on commodity hardware, made it a very compelling proposition for us. Moreover, the same jobs that had taken more than a day to complete could now be completed within a few hours on the platform.

Our first implementation of the open-source platform focused almost exclusively on batch-processing tasks. At the same time, using early Hadoop was often difficult for end users on the data service team, especially for those unfamiliar with the MapReduce programming model. It lacked the expressiveness of more familiar query languages like SQL, and many of us spent hours writing programs for even simple analytical tasks.
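To give a sense of that verbosity, here is a minimal sketch of the kind of program a simple aggregation required. The class names, input layout, and field positions are illustrative assumptions, not code from that period: it counts events per user in a hypothetical tab-separated log, the MapReduce equivalent of a single GROUP BY query.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative only: counts events per user from tab-separated log lines,
// i.e. "SELECT user, COUNT(*) ... GROUP BY user" written as a full MapReduce job.
public class EventCountByUser {

  public static class UserMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text user = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assume the first tab-separated column is the user id.
      String[] fields = value.toString().split("\t");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        user.set(fields[0]);
        context.write(user, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "event count by user");
    job.setJarByClass(EventCountByUser.class);
    job.setMapperClass(UserMapper.class);
    job.setCombinerClass(SumReducer.class);  // partial sums on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```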

It was clear that in order to effectively analyze Facebook’s growing trove of data, we needed to improve Hadoop’s query capabilities. That was what inspired us to stack SQL on top of Hadoop to create Hive. While the platform still carried a heavy amount of batch processing, for the first time our analysts were able to run ad-hoc analysis over data in HDFS. Much of the next few years was spent expanding and refining this infrastructure to meet growing usage as the tools made big data accessible to ever-larger groups of Facebook employees, and much of the big data architecture we built with Hadoop is still in place at the company today.
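As a rough illustration of what that shift meant for analysts, the per-user aggregation sketched above collapses into one HiveQL statement. The snippet below runs it through Hive’s JDBC interface (HiveServer2), which itself postdates the earliest Hive deployments; the host, database, table, and column names are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative only: the same per-user aggregation as an ad-hoc HiveQL query
// over data in HDFS, submitted to a HiveServer2 endpoint via JDBC.
public class AdHocHiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver2.example.com:10000/default");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT user_id, COUNT(*) AS events "
                 + "FROM event_log GROUP BY user_id "
                 + "ORDER BY events DESC LIMIT 20")) {
      while (rs.next()) {
        System.out.println(rs.getString("user_id") + "\t" + rs.getLong("events"));
      }
    }
  }
}
```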

I attribute much of Hadoop’s early success to the way it filled gaps in the technical capabilities of the parallel data-processing systems available at the time. Most of those systems had limited scalability and invariably scaled up by making each compute node in a cluster more and more powerful so that data teams needed fewer of them. As the nodes became less of a commodity, they also became more expensive, driving up the cost of computation. The Hadoop architecture, on the other hand, was built to scale out across commodity nodes and brought down the cost of large-scale data processing by an order of magnitude.

Perhaps an equally important aspect of Hadoop’s success was the role of open source. While there were a number of open-source options for the data systems powering applications (such as MySQL and PostgreSQL), the ecosystem for data analysis, data warehousing and data processing was very much dominated by a few proprietary vendors. Hadoop was the first system of this kind to be created by a community of web-scale companies, and as a result it became a platform open to new insights and innovations from a number of leading industry practitioners. These two factors, a scalable architecture that commoditized large-scale data processing and open-source development, were the key ingredients in Hadoop’s success.

Hadoop has played a crucial role in the development of enterprise computing over the last 10 years, maturing from its early days as an open-source batch-processing platform into the complex analytics architecture we see today. And while it is true that no single architecture can handle the full spectrum of analytics use cases companies now require, the ecosystem of projects under the Hadoop umbrella will continue to provide key data-processing capabilities for a number of different engines. That is the real success of Hadoop: it has galvanized the open-source community to create so many new and powerful solutions for analytics infrastructure.

Hadoop also remains the linchpin of many ETL processes and production-ready workloads. At the same time, however, the technology must strike a balance with new projects finding their way into the enterprise-computing space. Hive is being used for complex SQL, Spark has emerged as a great data science and machine learning engine, and Presto has found its niche in rapid ad-hoc SQL analysis. Newer technologies such as Flink and Heron are also emerging on the real-time analysis side, while Quark looks to take advantage of the unique capabilities of various query engines by building SQL federation on top of them. None of these technologies would be possible without Hadoop, and they all build on the core Hadoop platform to enable their capabilities. Needless to say, we have much to look forward to in the next 10 years of Hadoop.