How to do distributed batch inference using tensor parallelism with Ray?
I want to perform offline batch inference with a model that is too large to fit on a single GPU, so I want to use tensor parallelism. Previously, I used vLLM for batch inference, but I now have a custom model whose architecture vLLM does not support.
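Here is a minimal sketch of the kind of thing I am attempting with Ray Data. Everything in it is my own illustration: `TPShardedPredictor` is a stand-in class, the model is a toy linear layer, and I split its weight matrix column-wise across the actor's two GPUs by hand as a crude form of tensor parallelism. It assumes a machine with at least 2 GPUs.

```python
import numpy as np
import ray
import torch
import torch.nn as nn


class TPShardedPredictor:
    """One model replica per actor; the (toy) linear layer is split
    column-wise across the two GPUs Ray assigns to this actor."""

    def __init__(self):
        hidden, out = 4096, 8192  # toy sizes; my real model is much larger
        full = nn.Linear(hidden, out, bias=False)
        w = full.weight.data  # shape (out, hidden)
        # Column-parallel split: each GPU owns half of the output features.
        self.w0 = w[: out // 2].to("cuda:0")
        self.w1 = w[out // 2 :].to("cuda:1")

    def __call__(self, batch: dict) -> dict:
        # Ray Data passes batches as dicts of NumPy arrays;
        # ray.data.from_numpy names its single column "data".
        x = torch.as_tensor(batch["data"], dtype=torch.float32)
        # Run each weight shard on its own GPU, then gather the partials.
        y0 = x.to("cuda:0") @ self.w0.T
        y1 = x.to("cuda:1") @ self.w1.T
        y = torch.cat([y0.cpu(), y1.cpu()], dim=1)
        return {"output": y.numpy()}


if __name__ == "__main__":
    ds = ray.data.from_numpy(np.random.rand(1024, 4096).astype("float32"))
    out = ds.map_batches(
        TPShardedPredictor,
        batch_size=256,
        num_gpus=2,     # reserve both GPUs for each actor
        concurrency=1,  # number of model replicas (actors)
    )
    print(out.take(1))
```

This works for a toy layer, but splitting every layer by hand like this does not scale to my real model, so I am looking for the idiomatic way to combine Ray's batch processing with a tensor-parallel custom model.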