Overkill Analytics on High Dimensional Feature Spaces
In our quest for data science automation, we have learned many lessons that I will share in this session.
Fewer slides and more demos, featuring real-world use cases such as predicting port destinations for oil tankers and the Outbrain Kaggle competition, all performed from our own notebook (called DSL Workbench), which we built for exploratory data analysis. DSL is the fluent, expressive API we created to expose data and services from our data science platform.
I will compare multiple approaches to feature engineering and feature reduction, as well as full-feature-space training using OKA (OverKill Analytics) techniques: where spark.ml/spark.mllib could not cope with high-dimensional sparse feature spaces, we used Spark to distribute scikit-learn, VW, TensorFlow, and R packages, producing ensemble models and prediction tables that still yield highly accurate predictions.
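The core pattern behind "using Spark to distribute single-machine learners" is to train one cheap model per data partition (e.g. via `mapPartitions`) and ensemble their predictions. The talk does not spell out its implementation, so here is a dependency-free sketch of that pattern, with a closed-form one-dimensional least-squares fit standing in for a scikit-learn/VW learner and plain list slicing standing in for Spark partitions:

```python
import random

def fit_linear(shard):
    """Closed-form least-squares fit of y = a*x + b on one data shard."""
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def train_distributed(data, n_partitions):
    """Stand-in for rdd.mapPartitions(fit).collect(): one model per shard."""
    shards = [data[i::n_partitions] for i in range(n_partitions)]
    return [fit_linear(s) for s in shards]

def ensemble_predict(models, x):
    """Average the per-partition models' predictions (bagging-style)."""
    return sum(a * x + b for a, b in models) / len(models)

random.seed(0)
# Synthetic data: y = 3x + 1 plus a little noise.
data = [(float(x), 3.0 * x + 1.0 + random.gauss(0, 0.1)) for x in range(100)]
models = train_distributed(data, n_partitions=4)
print(ensemble_predict(models, 10.0))  # close to the true value 31.0
```

In the real setting each partition would hold a shard of the sparse feature matrix and `fit_linear` would be replaced by an estimator's `fit`; the averaging step is what turns many small models into one "overkill" ensemble.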
I will cover, with concrete examples: geo-spatial, composite, and progressive modeling; deep learning; high-dimensional and sparse feature engineering; and the primitives we built for handling sparse data beyond what Spark or scipy support.
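The abstract does not describe the API of those sparse primitives, but the kind of building block involved can be illustrated with a minimal dictionary-backed sparse vector: dot products that touch only the smaller operand, and an accumulate operation that drops entries cancelling to zero. All names here are hypothetical, not the platform's actual primitives:

```python
class SparseVec:
    """Minimal index -> value sparse vector (illustrative sketch only)."""

    def __init__(self, items=()):
        # Keep only non-zero entries; indices can be arbitrarily large.
        self.data = {i: v for i, v in dict(items).items() if v != 0.0}

    def dot(self, other):
        # Iterate over the smaller map: O(min(nnz)) rather than O(dimension).
        small, big = sorted((self.data, other.data), key=len)
        return sum(v * big[i] for i, v in small.items() if i in big)

    def axpy(self, alpha, other):
        # self += alpha * other, dropping entries that cancel to exactly zero.
        for i, v in other.data.items():
            new = self.data.get(i, 0.0) + alpha * v
            if new == 0.0:
                self.data.pop(i, None)
            else:
                self.data[i] = new
        return self

u = SparseVec({0: 1.0, 5: 2.0, 1_000_000: 3.0})
v = SparseVec({5: 4.0, 7: -1.0})
print(u.dot(v))  # only index 5 overlaps: 2.0 * 4.0 = 8.0
u.axpy(1.0, v)
print(sorted(u.data))
```

The point of such primitives is that the feature index space (e.g. hashed categorical features) can be far larger than memory, so every operation must stay proportional to the number of non-zeros.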
While I’ll focus on data science at scale, I will also touch on infrastructure, with tips and tricks we learned from the underlying technology stack: Scala, Python, Spark, HDFS, Cassandra, Elasticsearch, ZooKeeper, VW, TensorFlow, etc.