From built-in concurrency primitives to large scale distributed computing

Track:
PyData: Data Engineering
Type:
Talk (long session)
Level:
intermediate
Room:
Terrace 2A
Start:
12:10 on 10 July 2024
Duration:
45 minutes

Abstract

This talk is specifically designed for Python developers and data practitioners who wish to deepen their skills in asynchronous code execution, from single CPU applications to complex distributed systems with thousands of cores. We’ll provide a detailed exploration and explanation of Python’s asynchronous execution models and concurrency primitives, focusing on Future and Executor interfaces within the concurrent.futures module, and the event-driven architecture of asyncio. Special attention will be given to the processing of large datasets, a common challenge in data science and engineering.
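To make the two stdlib models concrete, here is a minimal, self-contained sketch of the Executor/Future interface from `concurrent.futures` and of coroutine scheduling with `asyncio`; the `square` task and its inputs are illustrative stand-ins, not examples from the talk itself.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x: int) -> int:
    """A stand-in for a CPU- or I/O-bound task."""
    return x * x


# concurrent.futures: submit() returns a Future immediately;
# the call runs in the background on a pool thread.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(square, n) for n in range(5)]
    # as_completed() yields futures as their results become available.
    results = sorted(f.result() for f in as_completed(futures))
print(results)  # [0, 1, 4, 9, 16]


# asyncio: coroutines scheduled cooperatively on a single event loop.
async def asquare(x: int) -> int:
    await asyncio.sleep(0)  # yield control back to the event loop
    return x * x


async def main() -> list[int]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(asquare(n) for n in range(5)))

print(asyncio.run(main()))  # [0, 1, 4, 9, 16]
```

The same Future-style interface reappears, with minor variations, in the distributed frameworks covered later in the talk.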

We will start with the fundamental concepts and then explore how they apply to large-scale, distributed execution frameworks such as Dask or Ray. Using step-by-step examples, we will demonstrate simple function executions and map-reduce operations, and illustrate how different concurrency models can collaborate efficiently. The session will offer practical guidelines for scaling your computations effectively and for addressing common hurdles such as data serialization in distributed environments.
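The map-reduce pattern mentioned above can be sketched with the standard library alone; in the distributed setting the executor would be replaced by, for example, a Dask or Ray client with a similar submit/map-style interface. The chunked data and `partial_sum` function here are illustrative assumptions, not material from the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partial_sum(chunk: list[int]) -> int:
    """Map step: reduce one chunk locally (would run on a worker)."""
    return sum(chunk)


# Illustrative dataset, pre-partitioned into chunks.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with ThreadPoolExecutor() as executor:
    # Map: one task per chunk; map() returns results in input order.
    partials = list(executor.map(partial_sum, chunks))

# Reduce: combine the partial results on the driver side.
total = reduce(lambda a, b: a + b, partials)
print(total)  # 45
```

In a distributed framework the chunks would live on remote workers, which is where serialization concerns such as picklability of functions and data come into play.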

Attendees will leave with a solid understanding of asynchronous code execution underpinnings. This talk will empower you to make informed practical decisions about applying concurrency in your data processing workflows. You will be able to seamlessly integrate new libraries or frameworks into your projects, ensuring optimal development lifecycle, performance and scalability.


The speaker

Jakub Urban

Jakub currently leads the data science platform team that enables the Flyr for Hospitality science organisation to develop, operate and maintain data science products in a user-friendly and sustainable way. He started tinkering with Python for computer simulations and data analysis during his computational physics PhD studies, when NumPy and Matplotlib were brand new projects and Pandas had not met Python yet. Since then, Python and its ecosystem have become Jakub's de facto work and hobby toolset for anything related to programming and data modelling. After leading the theory group at the tokamak department of the Institute of Plasma Physics in Prague, Jakub has held various roles across the data science and engineering landscape. He also co-founded the PyData Prague meetup, occasionally speaks or tutors at meetups and conferences, and teaches scientific computing with Python at the Czech Technical University.