Deconstructing text embedding models
- Track: PyData: Deep Learning, NLP, CV
- Type: Talk (long session)
- Level: Advanced
- Room: North Hall
- Start: 12:10 on 10 July 2024
- Duration: 45 minutes
Abstract
Selecting the optimal text embedding model is often guided by benchmarks such as the Massive Text Embedding Benchmark (MTEB). Picking the top-ranked model from the leaderboard is common practice, but a model that wins on benchmark data may not fit the unique characteristics of your specific dataset. This approach overlooks a crucial yet frequently underestimated component: the tokenizer.
We will delve into the tokenizer's fundamental role, shed light on how it operates, and introduce straightforward techniques to assess, from the tokenizer alone, whether a particular model suits your data. We will also explore the tokenizer's significance in the fine-tuning process of embedding models and discuss strategic approaches to optimizing its effectiveness.
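As a taste of the kind of tokenizer-only check the talk covers, here is a minimal sketch using the Hugging Face transformers library. The model name and sample texts are placeholders, and the fertility threshold mentioned in the comments is a rough rule of thumb rather than a hard cutoff: the idea is simply to measure how aggressively a candidate model's vocabulary fragments your domain text.

```python
from transformers import AutoTokenizer

# Candidate embedding model picked from the MTEB leaderboard
# (model name is illustrative -- substitute your own candidate).
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2"
)

# A handful of texts representative of your own dataset.
samples = [
    "Patient presented with dyspnoea and bilateral pedal oedema.",
    "RT-PCR confirmed SARS-CoV-2; started remdesivir 200 mg IV.",
]

for text in samples:
    tokens = tokenizer.tokenize(text)
    words = text.split()
    # Fertility: average number of subword tokens per whitespace word.
    # Values far above ~1.3-1.5 suggest the vocabulary shatters your
    # domain terms into many small, less meaningful pieces.
    fertility = len(tokens) / len(words)
    # Share of tokens mapped to the unknown token, if the model has one.
    unk_share = tokens.count(tokenizer.unk_token) / len(tokens)
    print(f"{fertility=:.2f} {unk_share=:.2%}")
    print(tokens)
```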
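On the fine-tuning side, one common strategy (a sketch under assumptions, not necessarily the exact approach presented in the talk) is to add badly fragmented domain terms to the tokenizer's vocabulary and grow the model's embedding matrix to match before fine-tuning. The model name and the new terms below are illustrative:

```python
from transformers import AutoModel, AutoTokenizer

# Base model and domain terms are placeholders for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Domain-specific terms the pretrained vocabulary fragments badly.
new_terms = ["dyspnoea", "remdesivir"]
num_added = tokenizer.add_tokens(new_terms)

# Grow the embedding matrix so the new token ids get (randomly
# initialised) vectors; fine-tuning then learns useful values for them.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```

The trade-off is that freshly added tokens start from random embeddings, so they only pay off when the fine-tuning corpus contains enough occurrences for the model to learn them.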