Suppose you would like to train a sequence-to-sequence model like T5-small from scratch on a task whose vocabulary is very limited compared to that of the T5 tokenizer, which was trained on a much larger vocabulary.
For instance, the data have the following format:
Can you please add A and B?
e.g.
Can you please add 45 and 56?
Can you please add 87 and 34?
A and B are just placeholders for integer numbers.
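To make the setup concrete, here is roughly how I generate the training pairs. The target format (just the sum as a string) and the ranges are my own choices for illustration, not something fixed by the task:

```python
import random

def make_example():
    # A and B are random integers; the target is simply their sum as text.
    a, b = random.randint(0, 99), random.randint(0, 99)
    source = f"Can you please add {a} and {b}?"
    target = str(a + b)
    return source, target

# e.g. [("Can you please add 45 and 56?", "101"), ...]
corpus = [make_example() for _ in range(10_000)]
```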
The T5 tokenizer, on the other hand, was trained to represent a vocabulary of roughly 32K tokens.
What considerations and issues should be taken into account, given that only a few tokens in the data change from example to example?
Basically, only the tokens A and B change every time.
Is that still possible?
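For context, what I had in mind is something like the sketch below: train a tiny word-level tokenizer on the task data itself (the `corpus` pairs generated above) and initialize a T5-small-sized model from a fresh config whose vocab size matches it. The tokenizer choice, special tokens, and hyperparameters are assumptions I made to illustrate the question, not a known-good recipe:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast, T5Config, T5ForConditionalGeneration

# Train a tiny word-level tokenizer directly on the task sentences and targets.
texts = [src for src, tgt in corpus] + [tgt for src, tgt in corpus]
raw_tok = Tokenizer(WordLevel(unk_token="<unk>"))
raw_tok.pre_tokenizer = Whitespace()
raw_tok.train_from_iterator(texts, WordLevelTrainer(special_tokens=["<pad>", "</s>", "<unk>"]))

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tok,
    pad_token="<pad>",
    eos_token="</s>",
    unk_token="<unk>",
)

# Fresh (untrained) model with T5-small dimensions but a vocab matching the small tokenizer.
config = T5Config(
    vocab_size=len(tokenizer),        # tiny compared to T5's ~32K
    d_model=512, d_ff=2048, d_kv=64,  # T5-small sizes
    num_layers=6, num_heads=8,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,
)
model = T5ForConditionalGeneration(config)
```

With this word-level setup every distinct integer ends up as its own token, which keeps the vocabulary tiny for this task; whether that is a sensible trade-off is essentially what I am asking.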