Modern ETL-ing with Python and Airflow (and Spark)

Tamara Mendt (@TamaraMendt)

Tamara Mendt is a Data Engineer at HelloFresh, a meal kit delivery service headquartered in Berlin, and one of the top 3 tech startups to come out of Europe over the past 4 years. She devotes her time to building data pipelines and designing and maintaining the company's data infrastructure. Tamara has a computer engineering degree from her native country Venezuela, and an Erasmus Mundus Masters degree in IT for Business Intelligence. She wrote her Master thesis at the TU Berlin with the research group where Apache Flink was born. At HelloFresh she is continuing to work with distributed technologies such has Apache Hadoop, Apache Kafka and Apache Spark to cope with the scalability that the fast growing company requires for dealing with their data.

Abstract

Tags: data data-science pipeline

The challenge of data integration is real. The sheer amount of tools that exist to address this problem is proof that organizations struggle with it. This talk will discuss the inherent challenges of data integration, and show how it can be tackled using Python and Apache Airflow and Apache Spark.

Description

The way organizations analyze their data has evolved very quickly since the beginning of the millennium. The development of Hadoop, and the explosion in the variety of data that companies are dealing with nowadays, has fostered the appearance of the concept of data lake, and the shift of traditional ETL (extract, transform, load), to ELT (extract, load, transform). Yet, the challenge of integrating data to obtain valuable insights still remains, and despite the hype and attention being focused on data, very few organizations have actually managed to become data driven. In this talk I will present insights into how we are currently building data pipelines using Python (as a replacement to high level ETL software), Apache Airflow as a scheduler to our coded transformations, and Apache Spark for achieve scalability. Though building data pipelines is not the only element required to become data driven, it is a crucial one, and I hope to encourage the audience to use these open source technologies in their own ETL-ing (or ELT-ing) efforts.