Speaker "Julien Le Dem" Details

Topic

How to use Parquet as a basis for ETL and Analytics

Abstract

Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools, such as Spark, Impala, Pig, Hive, and Cascading, to create powerful, efficient data analysis pipelines. Data management is simplified because the format is self-describing and handles schema evolution. Support for nested structures enables more natural data modeling for Hadoop than flat representations, which often require costly joins.
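
To make the "self-describing" point concrete, here is a minimal, hypothetical sketch (not from the talk itself) of a Parquet round trip in Spark with Scala; the ParquetRoundTrip object name, the toy event data, and the /tmp/events.parquet path are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    // A minimal sketch: write a small dataset as Parquet, then read it back
    // without declaring a schema, relying on the schema stored in the file footer.
    object ParquetRoundTrip {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-round-trip")
          .master("local[*]") // local mode, for illustration only
          .getOrCreate()
        import spark.implicits._

        // Write: the column names and types travel with the data files.
        val events = Seq(("click", 3L), ("view", 7L)).toDF("event", "count")
        events.write.mode("overwrite").parquet("/tmp/events.parquet")

        // Read: no schema is supplied; Spark recovers it from the Parquet footer.
        val loaded = spark.read.parquet("/tmp/events.parquet")
        loaded.printSchema()
        loaded.show()

        spark.stop()
      }
    }

Because the schema is stored in the files themselves, engines implemented in either Java or C++ can read the same data, which is what allows the tools listed in the abstract to share one copy of the data.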

Profile

Julien co-created the Parquet project, currently in the Apache Incubator. He is the tech lead for the Analytics Data Pipeline team at Twitter and a member of the Apache Pig PMC. His French accent makes his talks attractive.