BertTokenizer vocab size vs length of tokenizer
I manually added tokens to a BertTokenizer. The original pre-trained BertTokenizer has a vocabulary of 51,271 tokens, and I added 209,902 new tokens with add_tokens(), so the total vocabulary size should be 261,173. I then saved the modified tokenizer with save_pretrained() to a local folder ./vocab, where the additional 209,902 tokens end up in a file called added_tokens.json instead of being merged into vocab.txt.
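For reference, this is roughly what I did (a minimal sketch; the checkpoint name and the token list are placeholders, not my actual ones):

```python
from transformers import BertTokenizer

# Placeholder checkpoint; my actual tokenizer has a 51,271-token vocab.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)   # size of the base vocab.txt
print(len(tokenizer))         # equal to vocab_size before adding anything

# Stand-in for the 209,902 new tokens I actually added.
new_tokens = ["token_a", "token_b"]
num_added = tokenizer.add_tokens(new_tokens)
print(num_added)              # number of tokens that were actually new

# vocab_size still reports only the base vocabulary,
# while len(tokenizer) also counts the added tokens.
print(tokenizer.vocab_size)   # unchanged
print(len(tokenizer))         # vocab_size + num_added

# Saving writes the new tokens to added_tokens.json next to vocab.txt.
tokenizer.save_pretrained("./vocab")
```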