Where is the bottleneck for multiple requests with Whisper on an Nvidia A100?
I want to use Whisper-Large-v3 (speech-to-text) in a real-time application and process several requests concurrently. My Whisper instance runs on an Nvidia A100 with 80 GB of VRAM.
In principle, I would assume that I could process many requests at the same time, but I suspect that the KV cache can only be reused once the first request has finished decoding. In other words, the requests would effectively be processed sequentially. Is that really where the bottleneck is, or is it somewhere else?
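For context, the way I imagine combining several requests is simple padding-based batching before the forward pass. Here is a minimal NumPy sketch of that idea (the `pad_batch` helper and the clip lengths are illustrative, not my actual pipeline; a real setup would feed the padded batch plus attention mask to the model so each request keeps its own KV-cache slice):

```python
import numpy as np

def pad_batch(audios, pad_value=0.0):
    """Pad variable-length 16 kHz mono audio clips into one batch array.

    Batching like this would let the GPU handle all requests in a single
    forward pass instead of one after another; the accompanying mask
    marks which positions are real audio vs. padding.
    """
    max_len = max(len(a) for a in audios)
    batch = np.full((len(audios), max_len), pad_value, dtype=np.float32)
    mask = np.zeros((len(audios), max_len), dtype=bool)
    for i, audio in enumerate(audios):
        batch[i, : len(audio)] = audio
        mask[i, : len(audio)] = True
    return batch, mask

# three hypothetical requests of different lengths (1 s, 0.5 s, 1.5 s at 16 kHz)
requests = [np.ones(16000), np.ones(8000), np.ones(24000)]
batch, mask = pad_batch(requests)
print(batch.shape)  # (3, 24000)
```

My question is whether the decoder can actually work on such a batch in parallel, or whether the KV matrices force it through the requests one by one.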