Tag Archive for python, huggingface-transformers, bert-language-model, huggingface-tokenizers

BertTokenizer vocab size vs length of tokenizer

I manually added tokens to a BertTokenizer. The original pre-trained BertTokenizer has 51,271 vocabulary entries, and I added 209,902 new tokens (bringing the total vocabulary size to 261,173) using add_tokens(). I saved the modified tokenizer with save_pretrained() to a local folder ./vocab, where the additional 209,902 tokens are stored in a file called added_tokens.json instead of being merged into vocab.txt.
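
For reference, a minimal sketch of the described workflow (the checkpoint name and the short new_tokens list below are placeholders, not the actual 209,902 tokens being added):

```python
from transformers import BertTokenizer

# Placeholder checkpoint; substitute the actual pre-trained model used
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.vocab_size)   # size of the original vocab.txt only

# Placeholder tokens standing in for the full list of new vocabulary
new_tokens = ["mynewtoken1", "mynewtoken2"]
num_added = tokenizer.add_tokens(new_tokens)
print(num_added)              # number of tokens actually added (duplicates are skipped)

print(len(tokenizer))         # base vocab plus added tokens
print(tokenizer.vocab_size)   # unchanged: added tokens are not counted here

# Writes vocab.txt plus a separate added_tokens.json to the folder
tokenizer.save_pretrained("./vocab")
```

Note that len(tokenizer) and tokenizer.vocab_size diverge after add_tokens(): the added tokens are tracked separately and serialized to added_tokens.json rather than appended to vocab.txt.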