Anatomy of a Hadoop Project Failure
Posted on: Mar 18, 2017

Several years ago, the educational technology company Blackboard selected Apache Hadoop to run a new data analytics application designed to turn data exhaust into actionable insight. Months later, the failed project was cancelled, and Blackboard implemented a hosted relational data warehousing product instead.

The reasons behind Blackboard's initial selection of Hadoop for this project will sound familiar: a desire to maximize data exhaust, a need to bring large amounts of data together for analysis, and curiosity about working with emerging technology.

But the factors leading to the Hadoop failure will also ring a bell for those experienced with Hadoop projects: difficulty integrating open source pieces, complex architectures and data flows, and an inability to read data from Hadoop in a useful and timely fashion.

Maximize Data Exhaust

On paper, Blackboard would seem to be a good candidate for Hadoop. As the leading provider of learning management systems for the educational sector, the Washington D.C.-based company is well-versed in technology. The company counts 16,000 direct clients and touches more than 100 million students, who interact with educators and other students via its Web-based products.

“We have access to the largest data set in higher education in the world,” says Jason White, Blackboard’s director of product development, who led the Hadoop project. “We attempted a couple of years ago to build out a data lake to support not only the data science research but also some internal telemetry analysis of product usage. We attempted to build that out in a Hadoop stack.”

The main reasons for choosing Hadoop were scale, a desire to unify data, and technical curiosity on the part of Blackboard engineers, White says. The company had plenty of familiarity with the Microsoft SQL Server stack and ran a data warehouse based on SQL Server Analysis Services. But it wanted to expand beyond what SQL Server could provide.

“Nobody inside of Blackboard had attempted to consolidate the vast amount of telemetry data and application log file exhaust that we had in the organization,” White told Datanami. “Nobody had attempted to put that in one clearinghouse type of place. At the time, looking at Hadoop, that stack seemed the right direction to go.”

Most of the data Blackboard was working with was in the JSON format. After adopting Hadoop, Blackboard switched to Avro, another semi-structured data format that’s at home in Hadoop. Everything seemed to be on track during the initial data ingestion phase. But problems soon surfaced when it came time to read data from Hadoop.
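
To illustrate that ingest step, here is a minimal sketch of converting newline-delimited JSON telemetry into Avro using the fastavro library. The schema, field names, and file paths are hypothetical, not Blackboard's actual event model.

import json
from fastavro import writer, parse_schema

# Hypothetical schema for a single telemetry event; real records would be richer.
schema = parse_schema({
    "type": "record",
    "name": "TelemetryEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
})

def json_lines_to_avro(json_path, avro_path):
    # Read newline-delimited JSON events and rewrite them as one Avro file.
    with open(json_path) as src:
        records = [json.loads(line) for line in src]
    with open(avro_path, "wb") as dst:
        writer(dst, schema, records)

json_lines_to_avro("telemetry.jsonl", "telemetry.avro")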

Trouble in the Reads

Getting into Hadoop, or writing data to the lake, has traditionally been easy. But getting the data back out, or reading it, has been a recurring complaint from many users.

“Reporting directly out of Hadoop, we were never able to find the right combination of tools,” White said. “We ended up building a series of very complex processes to, on a regular basis, push that data out to a relational database stack of back-end BI tools.”

Blackboard was moving JSON data exhaust from a relational database into Hadoop and then back out into a Postgres database, where it could be analyzed using SQL. But that convoluted workflow didn't sit right with White.

“Architecturally, it always just felt cumbersome to have all this data being sourced from relational database, pushed into a Hadoop stack, and then sharded back out into relational databases. It was all just an entirely complicated architectural diagram with very heavy DevOps lift.”
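
To make the shape of that last hop concrete, here is a minimal sketch of the kind of export job such a workflow implies: reading Avro files out of HDFS with PySpark and pushing a copy into Postgres over JDBC. This is not Blackboard's actual pipeline; the package versions, paths, table name, and credentials are placeholders.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hadoop-to-postgres-export")
    # The spark-avro package supplies the "avro" data source; the Postgres JDBC
    # driver is needed for the write. Versions here are assumptions.
    .config("spark.jars.packages",
            "org.apache.spark:spark-avro_2.12:3.5.0,org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the semi-structured telemetry data stored as Avro in the lake.
events = spark.read.format("avro").load("hdfs:///data/telemetry/")

# Push a query-ready copy out to Postgres for the back-end BI tools.
(events.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://reporting-db:5432/analytics")
    .option("dbtable", "telemetry_events")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save())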

When Blackboard started working on the Hadoop project, it had a team of 10 engineers, with two of them dedicated to wrangling the various open source tools, patching bugs, and building work-arounds. That was more than White expected.

“The DevOps lift was heavier than we anticipated,” White explains. “We were not using a packaged distribution of Hadoop. We were attempting to roll our own and find all the right farm animals, the suites of various Hadoop environment-related products, to accomplish all the things we needed.”

Blackboard eventually decided to shut down the Hadoop cluster when it became clear that the approach it was taking was not cost-effective. By the time the cluster was finally shut down, it held about 50TB of data, White estimates.

“Once we realized the burden that we had gotten ourselves into in terms of the care and feeding required to keep that environment up and running, we quickly started looking for another solution,” he said.

Don’t Believe the Hype

Did Blackboard get suckered into the hype around Hadoop?

“I definitely think that’s a big part of it,” he answered. “I had three or four engineers on that team that are top-notch engineers who were really excited to go play with Hadoop because they had heard about it and wanted to check it out. That’s really what carried us so far down the road before coming to the realization that this wasn’t the right application for what we were trying to accomplish.”

White acknowledged that his decision to go the plain vanilla Apache Hadoop route may have played a role in the difficulty Blackboard experienced, and that choosing one of the packaged Hadoop distributions may have alleviated some of the burden.

Nonetheless, Blackboard looked elsewhere, evaluating options including a Redshift data warehouse hosted on Amazon Web Services and an on-premises Vertica cluster. Eventually the company adopted a hosted SQL data warehouse from Snowflake Computing.