Flow is in the Air: Best Practices of Building Analytical Data Pipelines with Apache Airflow

Dominik Benz (@john_maverick)

Dominik Benz holds a PhD from the University of Kassel in the field of Data Mining on the Social Web. Since 2012 he has been working as a Big Data Engineer at Inovex GmbH. During this time, he has been involved in several projects concerned with establishing analytical data platforms in various companies. He is most experienced in tools around the Hadoop ecosystem, such as Apache Hive and Spark, and has hands-on experience with productionizing analytical applications.

Abstract

Tags: workflow data pipeline data-science analytics

Apache Airflow is an open-source Python project which facilitates an intuitive programmatic definition of analytical data pipelines. Based on 2+ years of production experience, we summarize its core concepts, detail lessons learned and set it in context with the Big Data Analytics Ecosystem.

Description

Motivation & Outline

Creating, orchestrating and running multiple data processing or analysis steps may cover a substantial portion of a Data Engineer's or Data Scientist's work. A widely adopted notion for this process is a "data pipeline", which consists mainly of a set of "operators" that each perform a particular action on data, with the possibility to specify dependencies among them. Real-life examples may include:

  • Importing several files with different formats into a Hadoop platform, performing data cleansing, and training a machine learning model on the result
  • Performing feature extraction on a given dataset, applying an existing deep learning model to it, and writing the results to the backend of a microservice
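
To make the notion of operators and dependencies concrete, the first example can be sketched in plain Python. This is only an illustration of the concept and does not use Airflow's API; the operator functions, the shared `ctx` dictionary, and `run_pipeline` are hypothetical names chosen for this sketch, and the "model" is merely an average standing in for real training.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical operators: each performs one action on a shared context dict.
def import_files(ctx):
    # Stand-in for importing raw files of differing quality.
    ctx["raw"] = [" 3", "1 ", None, "2"]


def cleanse(ctx):
    # Drop missing values and normalise the rest to integers.
    ctx["clean"] = [int(v.strip()) for v in ctx["raw"] if v is not None]


def train_model(ctx):
    # Stand-in for model training: the "model" is just the mean value.
    ctx["model"] = sum(ctx["clean"]) / len(ctx["clean"])


# Dependencies: each operator maps to the set of operators that must run first.
dependencies = {
    cleanse: {import_files},
    train_model: {cleanse},
}


def run_pipeline(dependencies):
    """Run every operator once, in an order that respects the dependencies."""
    ctx = {}
    executed = []
    for operator in TopologicalSorter(dependencies).static_order():
        operator(ctx)
        executed.append(operator.__name__)
    return ctx, executed


ctx, executed = run_pipeline(dependencies)
print(executed)
print(ctx["model"])
```

A workflow engine adds scheduling, retries, monitoring and distributed execution on top of this basic idea of a dependency graph of operators.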

Apache Airflow is an open-source Python project developed by Airbnb which facilitates the programmatic definition of such pipelines. Features which differentiate Airflow from similar projects like Apache Oozie, Luigi or Azkaban include (i) its pluggable architecture with several extension points, (ii) the programmatic approach of "workflow is code" and (iii) its tight relationship with Python as well as the Big Data Analytics Ecosystem. Based on several years of production usage, we briefly summarize the core concepts of Airflow and discuss in depth the lessons learned and best practices from our experience. These include hints for becoming productive quickly with Airflow, approaches to structuring workflows, integrating it into an enterprise landscape, writing plugins and extensions, and maintaining it in a production environment. We conclude with a comparison with other analytical workflow engines and summarize why we have chosen Airflow.

Questions answered by this talk

  • What are the core concepts of Apache Airflow?
  • How can Airflow help me with moving data pipelines from analytics to production?
  • Which concepts of Airflow make it slimmer and more efficient than Apache Oozie?
  • How can I specify dynamic dependencies at runtime between my analytical data processing steps?
  • Which facilities does Airflow offer to enable automation and orchestration of analytical tasks?
  • How can I extend the built-in facilities of Airflow by writing Python plugins?

People who benefit most from this talk

  • Data Scientists who are looking for a slim library to automate and control their data processing steps
  • Data Engineers who want to save time debugging static workflow definitions (e.g. in XML)
  • Project leaders interested in tools which lower the burden of moving from analytics to production
  • Hadoop Cluster administrators eager to save cluster resources