DBT & Python - How to write reusable and testable pipelines

Track: PyData: Data Engineering
Type: Talk
Level: intermediate
Room: North Hall
Start: 11:20 on 11 July 2024
Duration: 30 minutes

Abstract

The “data build tool” (DBT) was designed to bring software engineering best practices to SQL-based data pipelines: pipelines are version-controlled directed acyclic graphs (DAGs) made up of testable and reusable nodes. As a growing number of cloud data warehouses and data lakehouses allow Python code to run natively, DBT has also added support for Python models. In this talk, I will explain how Flatiron Health uses DBT in its mission to improve and extend lives by learning from the experience of every person with cancer. We will discuss an example project setup that uses both SQL and Python models. I will share our experiences with unit and data testing as well as with writing a reusable variable library. The talk is well suited for anyone with prior data warehouse or data lakehouse experience who is curious how they can use DBT to write test-driven and reusable data pipelines. The example project will use SQL, Python and Snowflake.


The speaker

Florian Stefan

Florian is based in Berlin and works as a Software Engineer for Flatiron Health. Before joining Flatiron Health’s mission to improve and extend lives by learning from the experience of every person with cancer, he worked for eBay and Immobilienscout24. Florian loves traveling with his family, uses his little son as an excuse to buy toys for himself, and is passionate about software engineering, software architecture and punk rock.