September 08 to 10 2014, Santa Clara, USA.


Speaker "Mohit Jaggi" Details

Name :
Mohit Jaggi
Company :
Title :
Software Engineer
Topic :

df: A pandas like dataframe on Spark

Abstract :

Many data scientists use the Python library pandas for quick data exploration. The most useful construct in pandas (based on R, I think) is the dataframe: a 2D array (a.k.a. matrix) with the option to “name” the columns (and rows). But pandas is not distributed, so the size of the data that can be explored is limited to what a single machine can handle. Spark is a great MapReduce-like framework that can handle very large data by using a shared-nothing cluster of machines. This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas have a very gradual learning curve.

Profile :
System architect and engineer

Get latest updates of 2nd Annual Global Big Data Conference
sent to your inbox.