Using torchrun with the AWS SageMaker Estimator on a multi-GPU node
I would like to run a training job on an ml.p4d.24xlarge instance (8 GPUs) on AWS SageMaker. I ran into an issue similar to the one described here, with significant slowdowns in training time. I now understand that I should launch the training script with torchrun. My constraint is that I don't want to use the HuggingFace or PyTorch estimators from SageMaker (for customizability, and to properly understand the stack); I want to stick with the generic Estimator.
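For reference, this is roughly what my current setup looks like. It is a minimal sketch only: the image URI, role ARN, S3 path, and hyperparameters are placeholders, and the assumption is that the custom training image (or a launcher script baked into it) is what ends up calling torchrun with one process per GPU.

```python
# Minimal sketch of launching a training job with the generic SageMaker Estimator.
# All resource names below (image URI, role ARN, bucket) are hypothetical placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # hypothetical
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",  # hypothetical
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=session,
    # Assumption: the image's ENTRYPOINT invokes something like
    #   torchrun --standalone --nproc_per_node=8 train.py
    # so that one worker process is started per GPU on the node.
    hyperparameters={"epochs": 10},  # forwarded to the training script as CLI arguments
)

# Start the job; the channel name and S3 prefix are placeholders.
estimator.fit({"training": "s3://my-bucket/my-dataset/"})
```

My question is how the torchrun invocation should be wired up in this setup so that all 8 GPUs on the node are actually used, without falling back to the HuggingFace or PyTorch estimators.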