Speaker "Adam Breindel" Details

 

Topic

Data Science with Spark: Beyond the Basics

Abstract

This class is aimed at practitioners who are already familiar with the basics of Apache Spark and have tried the machine learning samples in the Spark docs or some of the ML tutorial examples online. We'll start from there and work to advance our knowledge of Spark ML. After briefly reviewing some fundamentals of Spark, DataFrames, and the Spark ML APIs, the class will explore:

- Performing feature preparation/transformation beyond the Spark built-in tools
- "Borrowing" functionality from scikit-learn to help us pre-process features in Spark
- Converting DataFrame data to access legacy (RDD) MLlib features that are not yet exposed in the Spark ML DataFrame API
- Implementing data prep operations as reusable components by implementing new Transformers and Estimators
- Adding a reusable parallel machine learning algorithm to Spark by creating our own Estimator and Model classes
- Sharing our reusable components with our Python data science colleagues by creating Python wrappers like those built into Spark
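
For readers new to the Estimator and Model idea named in the abstract, here is a minimal sketch of the pattern in plain Python. It deliberately does not use the Spark API: the class names `MeanCenter` and `MeanCenterModel`, the `Row` alias, and the list-of-dicts stand-in for a DataFrame are all assumptions made for illustration. The point is only the contract: an Estimator's `fit` learns something from the data and returns a Model, and the Model's `transform` applies it.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in for a DataFrame row; this is not a Spark type.
Row = Dict[str, float]


@dataclass
class MeanCenterModel:
    """Model (a fitted Transformer): holds learned state and maps rows to new rows."""
    input_col: str
    output_col: str
    mean: float

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows with an extra column; the input rows are not modified.
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


@dataclass
class MeanCenter:
    """Estimator: learns a parameter from the data and returns a fitted Model."""
    input_col: str
    output_col: str

    def fit(self, rows: List[Row]) -> MeanCenterModel:
        values = [r[self.input_col] for r in rows]
        mean = sum(values) / len(values)
        return MeanCenterModel(self.input_col, self.output_col, mean)


# Usage: fit on the data, then transform the same data with the fitted model.
rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
model = MeanCenter(input_col="x", output_col="x_centered").fit(rows)
centered = model.transform(rows)
```

Keeping the learned state (`mean`) on a separate Model object, rather than on the Estimator, is what lets a fitted component be reused on new data without refitting. A real Spark ML component would follow the same split but operate on DataFrames and Spark's parameter machinery.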

Profile

Adam Breindel consults and teaches widely on Apache Spark and other technologies. Adam's experience includes work with banks on neural-net fraud detection, streaming analytics, cluster management code, and web apps, as well as development at a variety of startup and established companies in the travel, productivity, and entertainment industries. He is excited by the way that Spark and other modern big-data tech remove so many old obstacles to system design and make it possible to explore new categories of interesting, fun, hard problems.