
Speaker "Aaron Benz" Details

Topic

Making HBase Accessible to Scientists

Abstract

Our case study covers two main topics:

  1. Using HBase as a storage solution for hierarchical time series data
  2. Using R and Python to make HBase data readily accessible

The data presented in our case study is hierarchical time series data, in which individual time series datasets are organized according to a consistent hierarchy. Such a data model applies to anything from stock price data to sensor data to meteorological data. For example, daily stock price datasets could be organized according to a hierarchy such as exchange/stock/day: a stock exchange has many stocks listed under it, and each of those stocks has a price time series for each day it was traded on the open market. Each individual stock price dataset would consist of two fields, time and price. We might also wish to measure other time series variables, such as trade volume.
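To make the hierarchy concrete, here is a minimal Python sketch of the exchange/stock/day example. The names `SeriesKey`, `datasets`, and `days_for`, the tickers, and all sample values are illustrative assumptions, not data from the case study.

```python
from collections import namedtuple

# Hypothetical key type: one field per level of the exchange/stock/day hierarchy.
SeriesKey = namedtuple("SeriesKey", ["exchange", "stock", "day"])

# Each key maps to one time series dataset: a list of (time, price) observations.
# Every value below is made up for illustration.
datasets = {
    SeriesKey("NYSE", "ABC", "2014-03-03"): [("09:30:00", 10.00), ("09:31:00", 10.05)],
    SeriesKey("NYSE", "ABC", "2014-03-04"): [("09:30:00", 10.10)],
    SeriesKey("NYSE", "XYZ", "2014-03-03"): [("09:30:00", 55.20)],
}

def days_for(exchange, stock):
    """Return the sorted days that have a dataset for one stock on one exchange."""
    return sorted(
        key.day for key in datasets
        if (key.exchange, key.stock) == (exchange, stock)
    )
```

A call such as `days_for("NYSE", "ABC")` walks one branch of the hierarchy and lists the days available under it.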

Hierarchical time series data may not lend itself well to a classic relational database. Forcing such data into an RDBMS might require heavily replicated data, long query times, and a complex schema. By contrast, hierarchical data fits well in a Google Bigtable-inspired NoSQL store such as HBase. The data modeler simply designs a rowkey structure that follows the hierarchy. Instead of storing individual observations in HBase cells, we opted to store each entire dataset as a “blob” of data in a single cell. This “blob” storage approach, together with an intuitive rowkey design, gave us very fast lookups of data.
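The rowkey and “blob” ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say which delimiter or serialization format was used, so the `/` separator, the JSON encoding, and the helper names `make_rowkey`, `encode_blob`, and `decode_blob` are all hypothetical.

```python
import json

SEP = "/"  # assumed delimiter between hierarchy levels

def make_rowkey(exchange, stock, day):
    """Join the hierarchy levels into one rowkey, most general level first."""
    return SEP.join([exchange, stock, day]).encode("utf-8")

def encode_blob(observations):
    """Serialize an entire dataset into bytes destined for a single cell.

    JSON is an assumption made for this sketch, not the format from the talk.
    """
    return json.dumps(observations).encode("utf-8")

def decode_blob(blob):
    """Turn the stored bytes back into a list of (time, price) tuples."""
    return [tuple(obs) for obs in json.loads(blob.decode("utf-8"))]
```

Because HBase keeps rows sorted by rowkey, keys built this way place all days of one stock next to each other, so a single stock or a whole exchange can be addressed by a key prefix. Storing the dataset as one value means a lookup reads one cell rather than one cell per observation.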

Any data storage solution is only effective when coupled with a system that lets those who need the data retrieve it easily. Therefore, we show how to work with HBase using two languages that should be in every data scientist’s toolbox: R and Python. R’s rhbase and Python’s happybase give the user basic HBase functionality without the need to know Java or to work in the HBase shell. Specifically, we will talk about how we opted to use happybase for building the HBase table and rhbase for data consumption and analysis. We will also demo a simple front-end web application developed with the help of RStudio’s new Shiny framework.
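As a rough idea of what the happybase side might look like, the sketch below writes and reads whole-dataset cells. It is not the code from the talk: the table name `timeseries`, the column family `d`, the qualifier `blob`, and all function names are assumptions, and `open_table` needs a running HBase Thrift server. The R side with rhbase is not shown.

```python
COLUMN = b"d:blob"  # assumed column family "d" and qualifier "blob"

def open_table(host, name="timeseries"):
    """Connect through the HBase Thrift gateway and return a table handle."""
    import happybase  # imported lazily; only needed when talking to a real cluster

    connection = happybase.Connection(host)
    if name.encode("utf-8") not in connection.tables():
        # Create the table with a single column family and default options.
        connection.create_table(name, {"d": dict()})
    return connection.table(name)

def store_dataset(table, rowkey, blob):
    """Write one whole dataset into a single cell."""
    table.put(rowkey, {COLUMN: blob})

def load_dataset(table, rowkey):
    """Fetch one dataset by exact rowkey; returns None if the row is absent."""
    return table.row(rowkey).get(COLUMN)

def load_by_prefix(table, prefix):
    """Fetch every dataset whose rowkey starts with the given prefix."""
    return {key: data[COLUMN] for key, data in table.scan(row_prefix=prefix)}
```

With a handle from `open_table("localhost")`, a call like `load_by_prefix(table, b"NYSE/ABC/")` would return every stored day for one stock, assuming rowkeys follow an exchange/stock/day layout. The helper functions only rely on the `put`, `row`, and `scan` methods of the table object, so they can be exercised against a stand-in object without a cluster.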

Profile

Aaron is currently a Data Scientist and Big Data Modeler at Accenture. He has extensive experience in R, Cassandra, and HBase, particularly in time series analysis, discovery, and modeling. Aaron graduated from the Templeton Honors College at Eastern University with a bachelor's degree in mathematics. While attending, he played lacrosse and was awarded All-America and Academic All-America