SQL engines boost Hadoop query processing for big data users
Posted on: Apr 30, 2016

Organizations with big data environments are turning to SQL-on-Hadoop software to speed up analytical queries and data integration jobs -- and eliminate the need to program in MapReduce.

When Premier Inc. offloaded some of its business intelligence data from a data warehouse appliance to a Hadoop cluster in mid-2014, the analytics and purchasing services company relied on MapReduce -- Hadoop's initial programming and processing environment -- to pull in the supply chain information from healthcare providers. But batch-oriented MapReduce wasn't seen as the right technology to power a Web-based BI dashboard application built for use by hospital purchasing managers and Premier's own data analysts and supply chain account executives.

For that job, Premier turned to Impala, Hadoop vendor Cloudera Inc.'s SQL-on-Hadoop query engine. Based on SQL, the standard programming language for mainstream relational databases, Impala provided faster Hadoop query performance for the end users and a more familiar programming framework for Premier's development team, according to Tom Palmer, a software engineering director who led the deployment of the Hadoop system at the Charlotte, N.C., company.
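To make that concrete, here is a minimal sketch of the kind of dashboard-style SQL an application might run through Impala, issued from Python with the open source impyla client. The host, table and column names are invented for illustration and are not Premier's actual schema.

    # Minimal sketch: running a dashboard-style SQL query on Impala from
    # Python via the open source impyla client. The host, table and
    # column names below are invented for illustration.
    from impala.dbapi import connect

    conn = connect(host='impala.example.com', port=21050)  # Impala's default query port
    cursor = conn.cursor()
    cursor.execute("""
        SELECT supplier_id, SUM(order_total) AS total_spend
        FROM supply_chain_orders
        WHERE order_date >= '2016-01-01'
        GROUP BY supplier_id
        ORDER BY total_spend DESC
        LIMIT 20
    """)
    for supplier_id, total_spend in cursor.fetchall():
        print(supplier_id, total_spend)
    conn.close()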

It was the same thing when Premier moved the processing of data on clinical outcomes and patient costs by physician from another data warehouse appliance to the Cloudera-based Hadoop cluster last September. After first trying to develop the required extract, transform and load (ETL) processes in MapReduce, the company switched to Impala before going into production.

"The developers who created the ETL jobs, really, all they understood was SQL," said Alfred Kosgey, a senior technical architect at Premier. "So, we were able to develop a lot more, a lot faster, because they were using the SQL syntax they were familiar with."

Premier, which works with about 3,600 hospitals and 120,000 smaller healthcare providers in the U.S., is now also leaning on Impala for a new analytics platform that its data scientists will use to track industry norms on quality of care, length of stay and other performance metrics. The data scientists have been running queries created in the Python programming language on Macintosh systems, but Palmer said it can take "hours and hours" for the jobs to complete. That should be sped up, he added, by moving them to the Hadoop cluster and tapping Impala through Ibis, an open source technology that integrates Python and the SQL-on-Hadoop engine.
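A hedged sketch of that Ibis pattern: the analyst writes ordinary Python expressions, Ibis compiles them to SQL, and Impala executes them inside the cluster, so only the small result set travels back to the analyst's machine. The connection details and table schema below are assumptions, and the exact Ibis API surface varies by version.

    # Sketch of the Ibis pattern: Python expressions compiled to SQL and
    # executed by Impala inside the cluster. Names are assumptions, and
    # the exact API surface varies across Ibis versions.
    import ibis

    con = ibis.impala.connect(host='impala.example.com', port=21050)
    t = con.table('patient_metrics')

    # Average length of stay per hospital, computed cluster-side; the
    # result comes back as a small pandas DataFrame.
    expr = t.group_by('hospital_id').aggregate(avg_stay=t.length_of_stay.mean())
    df = expr.execute()
    print(df.head())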

Premier is just one of a growing number of organizations looking to SQL-based Hadoop query tools to help simplify programming, and boost both ETL and analytics performance in their big data environments. The pairing of Hadoop and SQL also lets companies put all of their existing developers and data analysts who are versed in SQL to work on big data applications, reducing or even eliminating the need to invest in retraining or new hiring to build up MapReduce skills internally.

SQL-on-Hadoop options, issues

Prospective users have plenty of options to consider: Analysts from consultancies Gartner and Forrester Research have both counted more than a dozen SQL-on-Hadoop technologies, including a mix of open source software and commercial products. The top Hadoop vendors -- Cloudera, Hortonworks Inc. and MapR Technologies Inc. -- are all in the game, along with major IT providers such as IBM, Microsoft, Oracle, Teradata and Hewlett Packard Enterprise. JethroData, Splice Machine Inc. and other startups are also pushing SQL-on-Hadoop offerings, as is Databricks Inc., the primary driving force behind the Apache Spark processing engine, which includes a Spark SQL module.
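For context, Spark SQL applies the same idea on the Spark engine; a minimal PySpark sketch, with an invented file path and schema, looks like this:

    # Minimal Spark SQL sketch: register a dataset as a temporary view,
    # then query it with plain SQL. The path and columns are invented.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-on-hadoop-demo").getOrCreate()
    orders = spark.read.parquet("hdfs:///data/supply_chain_orders")
    orders.createOrReplaceTempView("orders")
    spark.sql("""
        SELECT supplier_id, SUM(order_total) AS total_spend
        FROM orders
        GROUP BY supplier_id
        ORDER BY total_spend DESC
        LIMIT 10
    """).show()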

Most of the available tools are still relatively new and not yet fully mature. For example, many Hadoop query engines don't support all of the functionality provided in relational SQL implementations. The fast pace of development on Hadoop and related technologies is helping to close that gap -- but it also means organizations need to keep up with frequent new releases in order to take advantage of added SQL-on-Hadoop features. In addition, SQL itself can be too much for many business users, and even some BI and analytics professionals, to handle directly, often prompting companies to put SQL-on-Hadoop tools under the covers of self-service BI software or front-end Web user interfaces.
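One common way of putting the engine "under the covers" is a thin service that runs canned, parameterized queries so business users never touch SQL directly. The sketch below uses Flask and impyla purely as illustrative choices, with hypothetical names throughout; nothing in this story says the companies here built it this way.

    # Illustrative only: hiding a SQL-on-Hadoop engine behind a small web
    # endpoint. Flask, impyla and all names here are hypothetical choices.
    from flask import Flask, jsonify, request
    from impala.dbapi import connect

    app = Flask(__name__)

    @app.route("/top-suppliers")
    def top_suppliers():
        limit = int(request.args.get("limit", 10))  # int() validates the input
        conn = connect(host='impala.example.com', port=21050)
        cursor = conn.cursor()
        cursor.execute(
            "SELECT supplier_id, SUM(order_total) AS total_spend "
            "FROM supply_chain_orders GROUP BY supplier_id "
            f"ORDER BY total_spend DESC LIMIT {limit}"
        )
        rows = [{"supplier": s, "spend": t} for s, t in cursor.fetchall()]
        conn.close()
        return jsonify(rows)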

None of that is stopping early adopters, such as Premier and virtualization technology vendor VMware. The latter is building out a Hortonworks-based Hadoop cluster to power advanced analytics and predictive modeling applications after deploying a pilot system in early 2015. The cluster, which is due to go into production use this month, is being expanded from eight to 48 nodes and will have a storage capacity of about 350 TB, said Joti Sidhu, IT director for enterprise BI applications at VMware.

In addition, the Palo Alto, Calif., company has deployed Pivotal HDB, a commercial version of the HAWQ open source SQL-on-Hadoop engine that's sold by Pivotal Software Inc. -- which VMware partly owns, along with its own parent, EMC. Sidhu said Pivotal HDB will primarily be used by a team of data scientists to run queries against a combination of marketing data, clickstream records from VMware's website, and customer files that contain product log data and technical support information. "Those guys are working with huge data sets and essentially trying to find a needle in a haystack," she said, adding that the SQL interface will let them do their work, without having to "learn or unlearn anything."
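Because HAWQ is derived from PostgreSQL and speaks the standard Postgres wire protocol, those queries can go through ordinary Postgres tooling; a hedged sketch follows, with invented connection details and schema.

    # Sketch: querying Pivotal HDB/HAWQ through standard PostgreSQL
    # tooling, since HAWQ speaks the Postgres wire protocol. All
    # connection details, tables and columns are invented.
    import psycopg2

    conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                            dbname="analytics", user="datasci")
    cur = conn.cursor()
    # Flag customers whose clickstream showed errors in the day before
    # they opened a support case -- a "needle in a haystack" join.
    cur.execute("""
        SELECT c.customer_id, COUNT(*) AS error_hits
        FROM clickstream c
        JOIN support_cases s
          ON s.customer_id = c.customer_id
         AND c.event_time BETWEEN s.opened_at - INTERVAL '1 day'
                               AND s.opened_at
        WHERE c.event_type = 'error'
        GROUP BY c.customer_id
        ORDER BY error_hits DESC
        LIMIT 50
    """)
    for customer_id, error_hits in cur.fetchall():
        print(customer_id, error_hits)
    conn.close()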

Tight ties needed between tools

As the Hadoop ecosystem expands to include more and more technologies, most of which are being developed at a rapid pace, Sidhu said users have to ensure there's tight integration between the underlying Hadoop platform and associated tools. There were some integration issues between Pivotal HDB and Hortonworks' Hadoop distribution when VMware began working with them, she said, but the two vendors have brought the technologies closer together since then. To make sure that happens, "you have to work very closely with the product teams on the vendor side," Sidhu advised.

Premier's Palmer said the healthcare company also tries to deploy new versions of Cloudera's Hadoop distribution and tools such as Impala within two to three months of their release. "Our attitude is that we're going full speed ahead, so we can take advantage of the new functionality," he said. "If you're two releases behind, you're not in good shape."

Deploying a SQL-on-Hadoop query engine might require more processing horsepower in a Hadoop cluster, as well. For example, Premier added five compute nodes to its cluster when it moved the ETL processing for physician performance data to the Hadoop system, increasing the total number of nodes to 19. And Palmer said he expects the cluster to grow further as the company puts more applications on it.

But he added that the SQL-driven expansion of the cluster, which currently holds about 65 TB of data, has saved money overall by enabling Premier to completely replace one data warehouse appliance system and reduce its use of another one -- both being higher-cost processing platforms than Hadoop. The new Hadoop math is simple, according to Palmer: "I like to say that the more we expand the cluster, the more money we save."