Hardware specifications for self-hosting Llama-2


I am a software engineer, and we are working with a client to implement a RAG-based AI solution for them. As the title suggests, I want suggestions from this community on the hardware specifications to host this model. I tested everything on Google Colab (Pro), sometimes on an A100 (40 GB) and mostly on a V100 (16 GB), running Llama-2-7b-chat-hf. On the A100 it worked smoothly without any quantization and returned responses within 6-10 seconds; on the V100 I had to use 8-bit quantization.
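For context, this is roughly how I loaded the model in 8-bit on the V100. It's a minimal sketch assuming the standard transformers + bitsandbytes stack; the prompt string is just a placeholder, and the checkpoint is gated so it needs Hugging Face access approval:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated checkpoint, needs HF access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # fits in 16 GB
    device_map="auto",
)

# Placeholder prompt using the Llama-2 chat instruction format
prompt = "[INST] Summarize the benefits of RAG in two sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```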

Now the client wants it to support 2,000 guaranteed users in the initial phase, so:

Is 7B enough, or should I look into 13B or larger?

What hardware specifications (and how many of these GPUs) would handle that traffic? My rough capacity math is sketched below.
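This is the back-of-envelope math I've been doing. Every input here is an assumption I'd love feedback on: the 5% concurrency ratio, the per-GPU batch size, and treating my observed 6-10 s latency as ~8 s:

```python
# Assumption: ~5% of the 2,000 guaranteed users are active at any moment.
guaranteed_users = 2000
concurrency_ratio = 0.05
concurrent_requests = int(guaranteed_users * concurrency_ratio)  # 100

# Observed 6-10 s per response on the A100; take ~8 s as the midpoint.
latency_s = 8
required_throughput = concurrent_requests / latency_s  # 12.5 req/s

# Naive serving (one request per GPU at a time) would need one GPU
# per concurrent request, which is clearly not viable:
gpus_naive = concurrent_requests  # 100

# With a continuous-batching server (vLLM / TGI style), assume one
# A100 can keep ~16 requests in flight (another assumption):
batch_per_gpu = 16
gpus_batched = -(-concurrent_requests // batch_per_gpu)  # ceil -> 7

print(required_throughput, gpus_naive, gpus_batched)
```

If those assumptions are anywhere near right, a handful of A100s behind a batching server covers the 7B model, but I don't know how realistic the concurrency and batch numbers are in practice.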

Thank you!
