Speaker "Mostafa Mokhtar" Details

Topic

Hive on Spark is Blazing Fast... Or Is It?

Abstract

Ovum analyst Tony Baer calls SQL the “gateway drug to Hadoop” because of its familiarity and ubiquity. As Hadoop’s popularity surges, thousands of businesses have found that running SQL on Hadoop is the easiest way to get value out of Big Data. Apache Hive is the most popular and widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: with Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data. This session will examine Hive performance past, present, and future. In particular, we’ll look at Hive’s origins as a petabyte-scale SQL engine; see how Hive became 100x faster by moving beyond MapReduce, vectorizing execution, and introducing a cost-based optimizer; take a detailed look at the challenges of scalable SQL on Hadoop; look into Hive’s sub-second future, powered by LLAP and Hive on Spark; and see just how fast Hive on Spark really is.

Profile

Lead Engineer at Hortonworks, where he works on large-scale performance design, analysis, and tuning covering storage, execution, and query optimization.