Global Big Data Conference

Industry News Details

Hadoop Has Failed Us, Tech Experts Say Posted on : Mar 15 - 2017

The Hadoop dream of unifying data and compute in a distributed manner has all but failed in a smoking heap of cost and complexity, according to technology experts and executives.

“I can’t find a happy Hadoop customer. It’s sort of as simple as that,” says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. “It’s very clear to me, technologically, that it’s not the technology base the world will be built on going forward.”

Thousands of organizations store huge amounts of data in Hadoop, and so Hadoop won’t disappear overnight. After all, many companies still run mainframe applications that were originally developed half a century ago. But thanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says.

“The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10,” Muglia says. “That’s just nuts given how long that product, that technology has been in the market and how much general industry energy has gone into it.”

One of the companies that supposedly tamed Hadoop is Facebook, which developed relational database technologies like Hive and Presto to enable SQL querying for data stored in HDFS. But according to Bobby Johnson, who helped run Facebook’s Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a “historical glitch.”

“That may be a little strong,” Johnson says. “But there’s a bunch of things that people have been trying to do with it for a long time that it’s just not well suited for.”

Hadoop’s strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it’s ill-suited for running interactive, user-facing applications, he says.

“It’s never really broken out of the developer world,” Johnson says. “People have this idea of, ‘Oh there’s all these legacy data warehouses and the new cool way to do it is Hadoop.’ But when you try to do that, you realize it’s actually a lot worse. It’s better than a data warehouse in that have all the raw data there, but it’s a lot worse in that it’s so slow.”

Getting answers out of Facebook’s Hadoop environment was an exercise in patience and frustration, according to Johnson. “After years of banging our heads against it at Facebook, it was never great at it,” he says. “It’s really hard to dig into and actually get real answer from…You really have to understand how this thing works to get what you want.”

Hadoop is great if you’re a data scientist who knows how to code in MapReduce or Pig, Johnson says, but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data.

“At the Hive layer, it’s kind of OK. But people think they’re going to use Hadoop for data warehouse…are pretty surprised that this hot new technology is 10x slower that what they’re using before,” Johnson says. “[Kudo, Impala, and Presto] are much better than Hive. But they are still pretty far behind where people would like them to be.”

The Hadoop community has so far failed to account for the poor performance and high complexity of Hadoop, Johnson says. “The Hadoop ecosystem is still basically in the hands of a small number of experts,” he says. “If you have that power and you’ve learned know how to use these tools and you’re programmer, then this thing is super powerful. But there aren’t a lot of those people. I’ve read all these things how we need another million data scientists in the world, which I think means our tools aren’t very good.”

A better architecture for enabling people to be productive with big data can be found in Apache Kafka, Johnson says.

“I like the Kafka version of the world, which is there’s a pipe of data and anything that wants to do something useful with it can tap into that thing,” he says. “That feels like a better unifying principal, a stream of data and what’s happening versus a disk format.”

Kafka creator Jay Kreps was responsible for running a large Hadoop cluster at LinkedIn before co-founding Confluent. “It’s just a very complicated stack to build on,” he tells Datanami. “I think that’s more a technology problem than anything else.”

While Kafka is included in many Hadoop distributions, Kreps purposely avoided building any Hadoop dependencies into Kafka, and he strived to make it as simple to use as possible.

“Kafka is definitely its own thing. It runs standalone. It has no connection to Hadoop. I think that’s absolutely a good thing for people who are trying to build production application,” he says. “You can imagine if you’re trying to build something that’s part of your UX and you need to pull in Kafka, it’s really important that you don’t have to pull in 55 very complicated Hadoop-oriented projects and set all that up and operate it. That would be something that would make this very difficult to get going with, and I think that’s one of appeals of this whole streaming processing areas is that it’s relatively easy to work with.”

Plenty of software vendors have hitched their wagons to Hadoop, which was the first open source distributed computing platform to see widespread adoption. Some of these companies, like real-time stream processing vendor Datatorrent, have succeeded in helping to mask some of the underlying complexity in Hadoop.

“From an industry standpoint, people understand the amount of data they have to process necessitates a distributed computing platform,” says Phu Hoang, who built Hadoop-based systems at Yahoo before co-founding DataTorrent several years ago. “Hadoop is the only player out there. They may not be married to it. But they are definitely married to the distributed computing and scale-out strategy.”

Until there’s a better option, Hadoop will be the only game in town for distributed processing, says Hoang, who created the DataTorrent RTS product before releasing it as an open source project named Apache Apex. View More

Get the