Using torchrun with the AWS SageMaker Estimator on a multi-GPU node
I would like to run a training job on an ml.p4d.24xlarge instance (8 GPUs) on AWS SageMaker. I ran into an issue similar to the one described here, with significant slowdowns in training time. I now understand that I should launch the training script with torchrun. My constraint is that I don't want to use the HuggingFace or PyTorch estimators from SageMaker (for customizability, and to properly understand the stack); I want to stick with the generic Estimator.
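For reference, this is roughly what my current setup looks like. It is a minimal sketch only: the image URI, role ARN, S3 path, and hyperparameters are placeholders, and the assumption is that the custom training image (or a launcher script baked into it) is what ends up calling torchrun with one process per GPU.

```python
# Minimal sketch of launching a training job with the generic SageMaker Estimator.
# All resource names below (image URI, role ARN, bucket) are hypothetical placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # hypothetical
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",  # hypothetical
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=session,
    # Assumption: the image's ENTRYPOINT invokes something like
    #   torchrun --standalone --nproc_per_node=8 train.py
    # so that one worker process is started per GPU on the node.
    hyperparameters={"epochs": 10},  # forwarded to the training script as CLI arguments
)

# Start the job; the channel name and S3 prefix are placeholders.
estimator.fit({"training": "s3://my-bucket/my-dataset/"})
```

My question is how the torchrun invocation should be wired up in this setup so that all 8 GPUs on the node are actually used, without falling back to the HuggingFace or PyTorch estimators.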