Speaker "Michelle Sun" Details

 

Topic

WINGS - A lightweight big data job management framework (soon to be open sourced)

Abstract

To provide the most relevant search results, a huge amount of data needs to be processed and analyzed on a daily basis. This involves several stages of data processing, with the output of one system used as the input to the next. As part of the process, intermediate relevance data is generated, which is then calibrated and used for ranking the search results. Our team built a lightweight Python-based data framework to help developers efficiently onboard jobs and run data quality checks using the framework's built-in functions. The framework is largely language agnostic, i.e. it can be used to run any kind of code or big data job (Python, Java, Scala, Hive scripts, Pig scripts, etc.). We have packaged it as a Python package that can be installed anywhere with pip. Out of the box, it enforces and provides many kinds of validation for the processed data, such as duplicate checks, column-level checks for input/output tables, checks for data changes over time, file size checks, and table/file count checks, and it provides a standard interface for command execution and error handling. A job is defined in one simple Python file, in which an engineer overrides a function and defines a combination of steps (Hive step, MySQL step, Python code step, Java code step, data load step, data transfer step). There are very few such tools available today, and our aim is to provide one that increases engineers' productivity through a generic framework that executes every step efficiently and catches problems in data pipelines at every possible step.
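The abstract does not show the WINGS API, so the following is only a rough sketch of the pattern it describes: a job file that overrides one function to list its steps, a shared helper for command execution and error handling, and a built-in duplicate check. Every name here (`Job`, `define_steps`, `run_command`, `find_duplicates`, `StepFailed`, `ExampleJob`) is hypothetical and chosen for illustration; none is taken from the framework itself.

```python
# Hypothetical sketch only: these names are NOT the WINGS API.
import subprocess
from collections import Counter


class StepFailed(Exception):
    """Raised when a command exits non-zero or a data check fails."""


def run_command(argv):
    """Run an external command (any language) and raise on a non-zero exit."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise StepFailed(
            f"{argv[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


def find_duplicates(rows, key_columns):
    """Return the key tuples that occur more than once in rows (list of dicts)."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


class Job:
    """Base class: a job overrides define_steps() to list its steps in order."""

    def define_steps(self):
        raise NotImplementedError

    def run(self):
        """Execute each step in order; stop at the first failure."""
        completed = []
        for name, action in self.define_steps():
            action()  # any StepFailed propagates and halts the pipeline
            completed.append(name)
        return completed


class ExampleJob(Job):
    """A made-up job with an in-memory load step and a duplicate check."""

    def __init__(self, source_rows):
        self.source_rows = source_rows
        self.rows = []

    def load(self):
        # Stand-in for a Hive or MySQL step; a real step might call run_command().
        self.rows = list(self.source_rows)

    def check_duplicates(self):
        duplicates = find_duplicates(self.rows, ["id"])
        if duplicates:
            raise StepFailed(f"duplicate keys found: {duplicates}")

    def define_steps(self):
        return [("load", self.load), ("check_duplicates", self.check_duplicates)]


if __name__ == "__main__":
    job = ExampleJob([{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}])
    print(job.run())
```

In this sketch a failed check raises an exception, so later steps never run on bad data. Whether WINGS halts, warns, or retries in that situation is not stated in the abstract.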

Profile

Michelle Sun is a Staff Software Engineer at WalmartLabs. She is part of the search big data team, responsible for data pipelines for Product Search using technologies like Hadoop, Hive, and Cassandra. She has been at WalmartLabs for 4 years. Before that, she gained over 10 years of relational database and storage management experience at Oracle.