DBT & Python - How to write reusable and testable pipelines

Track:: PyData: Data Engineering (2024)
Type:: Talk
Level:: intermediate
Room:: North Hall
Start:: 11:20 on 11 July 2024
Duration:: 30 minutes

Abstract

The “data build tool” (DBT) was designed to unlock software engineering best practices for SQL-based data pipelines: pipelines as version-controlled directed acyclic graphs (DAGs) consisting of testable and reusable nodes. With the increasing number of cloud data warehouses and data lakehouses that allow the native execution of Python code, DBT also added support for Python models. In this talk, I will explain how Flatiron Health uses DBT to improve and extend lives by learning from the experience of every person with cancer. We will discuss an example project setup that uses SQL as well as Python models. I will share our experiences with unit and data testing as well as with writing a reusable variable library. The talk is well-suited for anyone with prior data warehouse or data lakehouse experience who is curious how they can leverage DBT to write test-driven and reusable data pipelines. The example project will use SQL, Python and Snowflake.
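
For readers who have not seen a DBT Python model, the following is a minimal sketch of the general idea and not the project shown in the talk. It keeps the business logic in a plain function that can be unit tested without a warehouse, and calls it from the `model(dbt, session)` entry point that DBT expects. The upstream model name `stg_events`, the columns `CODE` and `CODE_NORMALIZED`, and the helper `normalize_code` are all hypothetical.

```python
def normalize_code(raw):
    """Reusable, unit-testable logic: trim whitespace and upper-case a code.

    Anything that is not a string (for example a NULL value) becomes None.
    """
    if not isinstance(raw, str):
        return None
    return raw.strip().upper()


def model(dbt, session):
    # Entry point that DBT calls for a Python model.
    dbt.config(materialized="table")

    # dbt.ref resolves an upstream node of the DAG. "stg_events" is a
    # hypothetical model name. On Snowflake the result is a Snowpark
    # DataFrame, which to_pandas() converts to a pandas DataFrame.
    df = dbt.ref("stg_events").to_pandas()

    # Apply the plain-Python helper row by row (hypothetical column names).
    df["CODE_NORMALIZED"] = df["CODE"].map(normalize_code)

    # The returned DataFrame is materialized by DBT as the model's table.
    return df
```

Because `normalize_code` has no dependency on DBT or Snowflake, it can be covered by ordinary unit tests, while data tests on the materialized table check the output of `model` itself.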

Recording

Resources

Testability and reusability in data pipelines