Phi-3-vision model tokens

I am looking at using phi-3-vision models to describe an image. However, I noticed that a single image takes up a large number of tokens (~2000). Is this expected, or is it a potential bug? I have included a code snippet so that you can check my assumptions: