Caching for Jupyter Notebooks

Track:: PyData: Software Packages & Jupyter (2024)
Type:: Talk
Level:: intermediate
Room:: Terrace 2A
Start:: 15:30 on 11 July 2024
Duration:: 30 minutes

Abstract

Caching data and calculation results in Jupyter notebooks is a great way to speed up development, because expensive cells become cheap to re-run.

Data scientists and developers who use notebooks daily can improve their workflow with low-effort changes to the notebook code, cutting the time spent waiting and reducing context switches.

This talk targets developers and data scientists of all experience levels and will cover:

Why caching in notebooks? Setting up the context in which developers and data scientists use notebooks for exploratory work, and where caching fits into it.

What is caching? A quick definition of caching, introducing the different types of persistence (in-memory, on disk, database, object storage …) and cache invalidation strategies (parameters, code changes, TTL, …), with some cautionary comments about data security when caching protected data.
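As a rough illustration of two of the ideas named above (on-disk persistence, plus invalidation by parameters and by TTL), here is a minimal sketch using only the Python standard library. It is not taken from the talk: `CACHE_DIR`, `cache_key`, and `cached_call` are hypothetical names, and the sketch assumes that parameter values have a stable `repr` and that results can be pickled. In line with the security caveat, note that the cached files hold the data unencrypted, and pickle files should only be loaded from locations you trust.

```python
import hashlib
import pickle
import tempfile
import time
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = Path(tempfile.gettempdir()) / "notebook_cache_sketch"
CACHE_DIR.mkdir(exist_ok=True)


def cache_key(func_name, params):
    """Derive a file-safe key from the function name and its parameters."""
    raw = repr((func_name, sorted(params.items())))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_call(func, ttl_seconds, **params):
    """Return a fresh cached result if one exists, else compute and store it."""
    path = CACHE_DIR / (cache_key(func.__name__, params) + ".pkl")
    # TTL invalidation: a file older than ttl_seconds is treated as stale.
    if path.exists() and time.time() - path.stat().st_mtime < ttl_seconds:
        with path.open("rb") as fh:
            return pickle.load(fh)
    # Parameter invalidation: different parameters map to a different file.
    result = func(**params)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

In a notebook, a call such as `cached_call(load_data, 3600, year=2024)` (with `load_data` standing in for any slow function) would reuse the stored result for an hour, even across kernel restarts. This sketch does not detect code changes; one possible extension is to include a hash of the function's source in the key.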

Caching techniques: Going through readily available options from the Python standard library and how to use them in notebooks. Introducing a few off-the-shelf options such as IPython % magics and cachetools. Showcasing how to build a mini caching framework that fits a specific use case, using pandas and Spark as examples. Explaining when to stop caching and how to keep the framework small, including the signs that caching has gone overboard.
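One readily available standard-library option is `functools.lru_cache`, which memoizes results in memory. The sketch below is an illustration rather than an example from the talk; `expensive_summary` is a hypothetical stand-in for a slow notebook computation, and its arguments are assumed to be hashable.

```python
from functools import lru_cache


@lru_cache(maxsize=32)
def expensive_summary(n):
    """Stand-in for a slow computation; arguments must be hashable."""
    return sum(i * i for i in range(n))


first = expensive_summary(1000)   # computed on the first call
second = expensive_summary(1000)  # served from the in-memory cache
info = expensive_summary.cache_info()  # named tuple with hits, misses, maxsize, currsize
```

The cache lives only as long as the kernel, and re-running the cell that defines the function replaces it with a new function whose cache starts empty. `expensive_summary.cache_clear()` empties the cache explicitly.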

Recording

Resources

Slide deck