Large-scale machine learning pipelines using Luigi, PySpark and scikit-learn

Alexander Bauer

Alexander Bauer holds a Ph.D. in computer science. He has around 10 years of industry experience and currently leads a team of data scientists at Lidl, one of the largest global discount supermarket chains. He is a Kaggle Master and a regular speaker at the Frankfurt Predictive Analytics Meetup. He believes in agile software development practices and promotes Python as a primary language for data science applications in production.

Abstract

Tags: data-science analytics python machine learning

For prescriptive analytics applications, data science teams need to design, build and maintain complex machine learning pipelines. In this talk, we demonstrate how such pipelines can be implemented in a robust, scalable and extensible manner using Python, Luigi, PySpark and scikit-learn.

Description

Data science teams working on real-world prescriptive analytics applications face the daily challenge of designing, building and maintaining complex machine learning pipelines. Such pipelines include parsing data from multiple sources, extracting relevant predictive features, executing training, validation and prediction steps, and finally optimizing actions to meet the desired business outcome so that the results can be shared with and visualized for business users. In this talk, we demonstrate how such pipelines can be implemented end-to-end in a robust, scalable and extensible manner using Python, Luigi, PySpark and scikit-learn. We will share the lessons we learned from using this framework in a real-world demand forecasting use case.
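
The pipeline stages named above (parsing, feature extraction, training) can be illustrated with a minimal sketch of a dependency-driven task graph. This is not the talk's implementation and not Luigi's actual API: the `Task` base class, the `build` function, the task names and the toy sales records are all hypothetical, written with the standard library only. The sketch mirrors the general pattern used by workflow tools, where each task declares its upstream tasks and an output file, and a task is skipped when its output already exists. The "model" is just a per-store mean, standing in for a real estimator.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


class Task:
    """Hypothetical base class: one pipeline step with a single JSON output."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        # Upstream tasks that must be complete before this one runs.
        return []

    def output(self):
        # One output file per task, named after the task class.
        return self.workdir / f"{type(self).__name__}.json"

    def read(self, dep):
        return json.loads(dep.output().read_text())

    def write(self, data):
        self.output().write_text(json.dumps(data))


class ParseSales(Task):
    def run(self):
        # Toy records standing in for data parsed from real sources.
        self.write([
            {"store": "A", "units": 10},
            {"store": "A", "units": 14},
            {"store": "B", "units": 3},
        ])


class ExtractFeatures(Task):
    def requires(self):
        return [ParseSales(self.workdir)]

    def run(self):
        sales = self.read(self.requires()[0])
        by_store = {}
        for row in sales:
            by_store.setdefault(row["store"], []).append(row["units"])
        self.write(by_store)


class TrainModel(Task):
    def requires(self):
        return [ExtractFeatures(self.workdir)]

    def run(self):
        features = self.read(self.requires()[0])
        # Placeholder "model": mean units sold per store.
        self.write({store: mean(units) for store, units in features.items()})


def build(task, log):
    """Run upstream tasks first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


def run_demo():
    with tempfile.TemporaryDirectory() as workdir:
        log = []
        build(TrainModel(workdir), log)
        build(TrainModel(workdir), log)  # outputs exist, so nothing reruns
        model = json.loads(TrainModel(workdir).output().read_text())
    return log, model


log, model = run_demo()
print(log)
print(model)
```

Because each task checks for its own output before running, a failed or interrupted pipeline can be restarted without repeating completed steps, which is one way to read the "robust" requirement in the description.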