Don't fix bad data, do this instead

Track:: PyData: Data Engineering (2024)
Type:: Talk
Level:: intermediate
Room:: North Hall
Start:: 11:55 on 11 July 2024
Duration:: 30 minutes

Abstract

At a time when GenAI is quickly growing in popularity, along with prescriptive analytics and online ML models, the question arises: do we still need to care about data quality? We strongly believe that the answer is yes, and even more so than before!

Our expectations of data are high, and a gap often exists between the outcomes we anticipate and the data we actually have. That gap leads to frustration and mistrust. In the pursuit of data quality, expectations must therefore be grounded in reality.

This talk delves into pragmatic strategies for bridging the gap between expectations of data and its reality. It covers both the technical (hard) and cultural (soft) measures implemented to uphold data quality standards.

Key Takeaways:

Integration tests serve as a proactive barrier, preempting the violation of data contracts, unlike reactive data quality checks.
Prioritisation is crucial; a product-centric mindset is key when evaluating the balance between resource investment and potential gain.
Data quality management requires both hard (technical) and soft (cultural) measures.
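
To make the first takeaway concrete, here is a minimal sketch of an integration test that checks a producer's output against a data contract before the change ships, rather than checking the data after it has landed. It is not taken from the talk: the contract format, `ORDERS_CONTRACT`, `build_orders`, and `contract_violations` are all hypothetical names chosen for illustration, and only the Python standard library is used.

```python
# Hypothetical contract: column name -> (expected type, nullable).
ORDERS_CONTRACT = {
    "order_id": (int, False),
    "amount_eur": (float, False),
    "coupon_code": (str, True),
}


def build_orders(raw_rows):
    """Hypothetical producer step under test: maps raw rows to contract rows."""
    return [
        {
            "order_id": int(r["id"]),
            "amount_eur": round(int(r["amount_cents"]) / 100, 2),
            "coupon_code": r.get("coupon") or None,  # empty string becomes null
        }
        for r in raw_rows
    ]


def contract_violations(rows, contract):
    """Return human-readable violations; an empty list means the rows conform."""
    violations = []
    for i, row in enumerate(rows):
        if set(row) != set(contract):
            violations.append(
                f"row {i}: columns {sorted(row)} != {sorted(contract)}"
            )
            continue
        for col, (typ, nullable) in contract.items():
            value = row[col]
            if value is None:
                if not nullable:
                    violations.append(f"row {i}: {col} is null")
            elif not isinstance(value, typ):
                violations.append(
                    f"row {i}: {col} is {type(value).__name__}, "
                    f"expected {typ.__name__}"
                )
    return violations


def test_build_orders_honours_contract():
    # Fixture input stands in for upstream data; the test runs in CI,
    # so a contract-breaking change fails before it reaches consumers.
    fixture = [
        {"id": "1", "amount_cents": "1999", "coupon": "SUMMER"},
        {"id": "2", "amount_cents": "500", "coupon": ""},
    ]
    assert contract_violations(build_orders(fixture), ORDERS_CONTRACT) == []
```

A test runner such as pytest would pick up `test_build_orders_honours_contract` automatically. The same `contract_violations` helper could also be run against production data as a reactive check, but the point of the sketch is that running it against the producer's code in CI catches the violation earlier.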

Are you a data scientist, software engineer, product manager, or data engineer? Join us in this discussion; data quality concerns us all.

Recording

Resources

Don't fix bad data slides