Python code to build your BPE - Tokenizer from scratch (w/ HuggingFace)

Published: 26 October 2024
on the channel: Discover AI

Python TF2 code (JupyterLab) to train your Byte-Pair Encoding (BPE) tokenizer:
a. Start with all the characters present in the training corpus as tokens.
b. Identify the most common pair of adjacent tokens and merge them into a single token.
c. Repeat until the vocabulary (i.e., the number of tokens) has reached the size you want.
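The steps a–c above can be sketched in plain Python. This is a toy, from-scratch illustration of the merge loop (the function name `train_bpe` and the greedy left-to-right merging are assumptions for illustration, not the video's exact code):

```python
from collections import Counter

def train_bpe(words, target_vocab_size):
    """Toy BPE trainer over a list of words (e.g. a whitespace-split corpus)."""
    # a. Start with every word as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for w in words for ch in w}
    merges = []
    # c. Repeat until the vocabulary has reached the size you want.
    while len(vocab) < target_vocab_size:
        # b. Count all adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        vocab.add(merged)
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = Counter()
        for seq, freq in corpus.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges
```

On a corpus like `["aaab", "aaab", "aab"]` with a target vocabulary of 4, the most frequent pair `("a", "a")` is merged first, exactly as step b describes.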

Training a tokenizer is not (!) the same as training a DL model. The core calls from HuggingFace's Tokenizers library:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer()
tokenizer.train(files, trainer)
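A minimal end-to-end sketch of those calls using HuggingFace's `tokenizers` package (the tiny corpus file, the vocab size of 50, and the `[UNK]` special token are assumptions for illustration, not values from the video):

```python
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical tiny corpus file, just so the snippet is self-contained.
Path("corpus.txt").write_text("hello world hello tokenizer hello world")

# An untrained BPE model wrapped in a Tokenizer, splitting on whitespace first.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The trainer carries the BPE hyperparameters (vocab size, special tokens).
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train(["corpus.txt"], trainer)

print(tokenizer.get_vocab_size())
```

Note that `train` runs the character-merge loop from steps a–c, not gradient descent, which is why training a tokenizer is not the same as training a DL model.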

Here, the special case of a Byte-Pair Encoding (BPE) tokenizer from HuggingFace's Tokenizers library! See the original downloadable models, tokenizers and datasets at: https://huggingface.co/models

#Tokenizer
#HuggingFace
#BPE

00:00 Code my optimized BPE Tokenizer
03:13 BPE model and trainer
04:36 Train a new Tokenizer
05:38 Use newly constructed Tokenizer
07:55 Encode batch
09:44 Summary
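The "Use newly constructed Tokenizer" and "Encode batch" chapters can be sketched with the library's `encode` and `encode_batch` methods (a minimal sketch; the in-memory toy corpus is an assumption, not the video's data):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a small BPE tokenizer in memory on a toy corpus (assumption).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(["hello world", "hello tokenizer"],
                              BpeTrainer(special_tokens=["[UNK]"]))

# Encode a single text, then a whole batch:
# encode_batch returns one Encoding object per input string.
single = tokenizer.encode("hello world")
batch = tokenizer.encode_batch(["hello world", "hello tokenizer"])
print(single.tokens, [e.tokens for e in batch])
```

Batch encoding is the natural way to prepare many texts at once before feeding them to a model.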