How can a Transformer model predict output with a Python loop?
I understand how a Transformer model combines the input and the shifted output during training, but I don’t understand how it uses a loop to generate the output word by word once training is done.
My guess is that it starts the output with only a start token, feeds the input and this partial output into the model, takes the prediction at the last filled index, writes it into the next position of the output, and then feeds everything into the model again.
Something like this:
The goal is to convert ‘I am Student’ to ‘Ich bin Student’
English tokens: {0: &lt;pad&gt;, 1: &lt;start&gt;, 2: &lt;end&gt;, 3: Am, 4: Student, 5: I}
German tokens: {0: &lt;pad&gt;, 1: &lt;start&gt;, 2: &lt;end&gt;, 3: Bin, 4: Ich, 5: Student}
Loop 1.
input: [1, 5, 3, 4, 2, 0]
output: [1, 0, 0, 0, 0]
predict: transformer((input, output)) = [4, 0, 0, 0, 0]
next-token = 4
Loop 2.
input: [1, 5, 3, 4, 2, 0]
output: [1, 4, 0, 0, 0]
predict: transformer((input, output)) = [4, 3, 0, 0, 0]
next-token = 3
Loop 3.
input: [1, 5, 3, 4, 2, 0]
output: [1, 4, 3, 0, 0]
predict: transformer((input, output)) = [4, 3, 5, 0, 0]
next-token = 5
Loop 4.
input: [1, 5, 3, 4, 2, 0]
output: [1, 4, 3, 5, 0]
predict: transformer((input, output)) = [4, 3, 5, 2, 0]
next-token = 2
Loop 5.
next-token is 2 (the &lt;end&gt; token), so the loop ends here.
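A minimal Python sketch of that greedy decoding loop, assuming a transformer((input, output)) callable that returns one predicted token id per output position (the names PAD, START, END, max_len and greedy_decode are placeholders, not a specific library API):

PAD, START, END = 0, 1, 2

def greedy_decode(transformer, input_ids, max_len=5):
    # start with only the <start> token, the rest padding
    output = [START] + [PAD] * (max_len - 1)
    for i in range(max_len - 1):
        pred = transformer((input_ids, output))  # one predicted id per output position
        next_token = pred[i]                     # prediction at the last filled index
        if next_token == END:
            break                                # stop once the end token is predicted
        output[i + 1] = next_token               # write it into the next output slot
    return output

# e.g. greedy_decode(model, [1, 5, 3, 4, 2, 0]) would give [1, 4, 3, 5, 0] here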
TransformerEncoderLayer.forward() got an unexpected keyword argument ‘is_causal’
I have been trying to learn TrainAD, but I cannot get past the error TransformerEncoderLayer.forward() got an unexpected keyword argument ‘is_causal’.
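For context, the is_causal keyword only exists on TransformerEncoderLayer.forward() in newer PyTorch releases; on an older install a common workaround is to drop that argument and pass an explicit causal mask instead. A sketch assuming PyTorch’s torch.nn API, with arbitrary sizes:

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(10, 8, 64)  # (seq_len, batch, d_model)

# Instead of forward(..., is_causal=True), build the causal mask explicitly
# (-inf above the diagonal) and pass it through the mask argument,
# which older versions of forward() do accept.
seq_len = x.size(0)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
out = encoder(x, mask=causal_mask)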
Loss function for training a transformer language model
When training a language model, I would expect a maximum likelihood setting, i.e. searching for the model parameters that maximize the probability that the model generates the training text, or, equivalently, minimizing −ln of that probability.
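A small Python sketch of that objective (assuming PyTorch; the shapes and vocabulary size are arbitrary), showing that the usual cross-entropy loss is exactly the mean of −ln p(token | context) over the training tokens:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 5
logits = torch.randn(seq_len, vocab_size)           # model scores for each next token
targets = torch.randint(0, vocab_size, (seq_len,))  # the actual next tokens in the text

# negative log-likelihood of the training tokens under the model
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(seq_len), targets].mean()

# identical to the standard cross-entropy loss used to train language models
assert torch.allclose(nll, F.cross_entropy(logits, targets))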