
Deconstructing the text embedding models

Track: PyData: Deep Learning, NLP, CV
Type: Talk (long session)
Level: Advanced
Room: North Hall
Start: 12:10 on 10 July 2024
Duration: 45 minutes

Abstract

Selecting the optimal text embedding model is often guided by benchmarks such as the Massive Text Embedding Benchmark (MTEB). While picking the best model from the leaderboard is common practice, the top-ranked model may not align with the unique characteristics of your specific dataset. This approach overlooks a crucial yet frequently underestimated component: the tokenizer.

We will delve deep into the tokenizer's fundamental role, explaining how it works and introducing straightforward techniques for assessing, from the tokenizer alone, whether a particular model is suited to your data. We will also explore the tokenizer's significance in the fine-tuning process of embedding models and discuss strategic approaches to optimizing its effectiveness.
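
As a taste of the kind of check described above, here is a minimal sketch that measures subword fragmentation and the unknown-token rate of a candidate model's tokenizer on your own data. It assumes a Hugging Face tokenizer; the model name and sample texts are illustrative placeholders, not specifics from the talk.

# Gauge tokenizer fit by measuring how finely it splits your text
# and how often it falls back to the unknown token.
from transformers import AutoTokenizer

# Illustrative model; substitute the embedding model you are evaluating.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Illustrative samples; in practice, draw these from your own dataset.
samples = [
    "Patient presented with acute myocardial infarction.",
    "The API returns a JSON payload with nested embeddings.",
]

total_words = total_tokens = unknown = 0
for text in samples:
    tokens = tokenizer.tokenize(text)
    total_words += len(text.split())
    total_tokens += len(tokens)
    unknown += tokens.count(tokenizer.unk_token)

# Many tokens per word, or any unknown tokens at all, suggest the
# vocabulary was trained on text unlike yours.
print(f"tokens per word: {total_tokens / total_words:.2f}")
print(f"unknown tokens:  {unknown}")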
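
One common pattern in that direction, shown here as a generic sketch rather than the specific approach presented in the talk, is to extend the tokenizer's vocabulary with domain terms before fine-tuning and resize the embedding matrix so the new tokens receive trainable vectors. The model name and domain terms below are hypothetical.

# Extend the vocabulary with domain terms, then grow the embedding
# matrix so the newly added token ids map to trainable vectors.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific terms that the stock vocabulary splits badly.
new_tokens = ["pharmacokinetics", "bronchodilator"]
num_added = tokenizer.add_tokens(new_tokens)  # skips tokens already in the vocab

model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")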


Resources