Deconstructing text embedding models
- Track: PyData: Deep Learning, NLP, CV
- Type: Talk (long session)
- Level: Advanced
- Room: North Hall
- Start: 12:10 on 10 July 2024
- Duration: 45 minutes
Abstract
Selecting the optimal text embedding model is often guided by benchmarks such as the Massive Text Embedding Benchmark (MTEB). Picking the top-ranked model from the leaderboard is common practice, but a model that wins on benchmark data may not fit the unique characteristics of your specific dataset. This approach overlooks a crucial yet frequently underestimated component: the tokenizer.
We will delve into the tokenizer's fundamental role, shed light on how it operates, and introduce straightforward techniques to assess, from the tokenizer alone, whether a particular model suits your data. We will also explore the tokenizer's significance in the fine-tuning process of embedding models and discuss strategic approaches to optimizing its effectiveness.
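As a taste of the kind of tokenizer-only check the talk covers, here is a minimal sketch using the Hugging Face transformers library. The model name and sample texts are placeholders, and the fertility threshold mentioned in the comments is a rough rule of thumb rather than a hard cutoff: the idea is simply to measure how aggressively a candidate model's vocabulary fragments your domain text.

```python
from transformers import AutoTokenizer

# Candidate embedding model picked from the MTEB leaderboard
# (model name is illustrative -- substitute your own candidate).
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2"
)

# A handful of texts representative of your own dataset.
samples = [
    "Patient presented with dyspnoea and bilateral pedal oedema.",
    "RT-PCR confirmed SARS-CoV-2; started remdesivir 200 mg IV.",
]

for text in samples:
    tokens = tokenizer.tokenize(text)
    words = text.split()
    # Fertility: average number of subword tokens per whitespace word.
    # Values far above ~1.3-1.5 suggest the vocabulary shatters your
    # domain terms into many small, less meaningful pieces.
    fertility = len(tokens) / len(words)
    # Share of tokens mapped to the unknown token, if the model has one.
    unk_share = tokens.count(tokenizer.unk_token) / len(tokens)
    print(f"{fertility=:.2f} {unk_share=:.2%}")
    print(tokens)
```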
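On the fine-tuning side, one common strategy (a sketch under assumptions, not necessarily the exact approach presented in the talk) is to add badly fragmented domain terms to the tokenizer's vocabulary and grow the model's embedding matrix to match before fine-tuning. The model name and the new terms below are illustrative:

```python
from transformers import AutoModel, AutoTokenizer

# Base model and domain terms are placeholders for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Domain-specific terms the pretrained vocabulary fragments badly.
new_terms = ["dyspnoea", "remdesivir"]
num_added = tokenizer.add_tokens(new_terms)

# Grow the embedding matrix so the new token ids get (randomly
# initialised) vectors; fine-tuning then learns useful values for them.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```

The trade-off is that freshly added tokens start from random embeddings, so they only pay off when the fine-tuning corpus contains enough occurrences for the model to learn them.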