Python to Optimize the Input Data Pipeline | BERT Transformer Models

Published: 30 September 2024
on the channel: Discover AI

Python (TF2) code to optimize your tokenizer and vocabulary for your specific dataset. Pre-trained (BERT) NLP models are trained on a general corpus of documents, so their tokenizers and vocabularies may not deliver good enough performance on your specific Deep Learning task (in NLP).
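As a rough illustration of the idea above, the following is a minimal sketch of retraining a BERT tokenizer on a domain-specific corpus with the Hugging Face `transformers` library. The `corpus` list, the `vocab_size` value, and the output directory name are illustrative placeholders, not the exact code shown in the video.

```python
# Sketch: retrain a BERT WordPiece tokenizer on a domain-specific corpus.
# Assumes the Hugging Face `transformers` library; `corpus` is a placeholder dataset.
from transformers import AutoTokenizer

corpus = [
    "Domain-specific sentence one about sparse attention kernels.",
    "Domain-specific sentence two about tokenizer vocabularies.",
]  # replace with your own documents

# Load the general-purpose pretrained (fast) tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# train_new_from_iterator expects an iterator over batches of texts.
def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Train a new tokenizer with the same algorithm and special tokens,
# but a vocabulary fitted to your own dataset.
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=8000)
new_tokenizer.save_pretrained("my-domain-bert-tokenizer")
```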

Code examples are taken from the original Hugging Face documentation, or modified/adapted for my specific presentation needs. Check Hugging Face for the original models and datasets: https://huggingface.co/models


#code_in_real_time
#Tokenizer
#HuggingFace

00:00 Code your Tokenizers
03:58 Tokenization pipeline
06:20 Full service Tokenizer
09:15 Train a new Tokenizer
15:00 Fast Tokenizer
16:16 Encode your sentences with the new tokenizer
18:30 Use a pretrained tokenizer (with vocabulary)
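For the encoding and pretrained-tokenizer chapters (16:16 and 18:30), here is a minimal sketch, assuming the tokenizer saved in the previous snippet and a TensorFlow install so that `return_tensors="tf"` yields TF2 tensors; sentences and directory names are illustrative.

```python
# Sketch: encode sentences with the newly trained tokenizer and with a
# stock pretrained tokenizer, returning TF2 tensors for a BERT model.
from transformers import AutoTokenizer

new_tokenizer = AutoTokenizer.from_pretrained("my-domain-bert-tokenizer")  # from the training step
pretrained_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "Sparse attention kernels reduce memory.",
    "Tokenizers map text to integer IDs.",
]

# Batch-encode with padding/truncation; return_tensors="tf" produces tf.Tensor inputs.
new_batch = new_tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")
old_batch = pretrained_tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")

# Compare how each vocabulary splits the same sentence.
print(new_tokenizer.tokenize(sentences[0]))
print(pretrained_tokenizer.tokenize(sentences[0]))
print(new_batch["input_ids"].shape, old_batch["input_ids"].shape)
```

A domain-fitted vocabulary typically splits in-domain terms into fewer subword pieces, which shortens input sequences compared with the general-purpose tokenizer.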