How to run PyTorch Lightning with multiple GPUs, with Apptainer and SLURM?
When using 2 GPUs on a single node, or multiple GPUs across multiple nodes, the training does not start even though the job keeps running. I use a container (Apptainer) to deploy the environment and then submit the script to SLURM. The job starts but then stalls. I also tried strategy='deepspeed'.
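For reference, here is a minimal, self-contained version of the kind of script I am submitting (a sketch with placeholder model and data, not my actual code; I assume Lightning 2.x, where the Trainer lives in the lightning package):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L


class ToyModel(L.LightningModule):
    """Placeholder for the real model: a single linear layer."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    # Placeholder random data standing in for the real dataset.
    data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(data, batch_size=32)

    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,      # GPUs per node; should match the SLURM allocation
        num_nodes=1,    # 2 (or more) for the multi-node case
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(ToyModel(), loader)


if __name__ == "__main__":
    main()
```

My understanding is that under SLURM, Lightning does not spawn the extra ranks itself: the batch script is expected to launch one task per GPU via srun (e.g. --ntasks-per-node matching devices and --nodes matching num_nodes), with apptainer exec wrapping the python call inside the srun command.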
Script freezes when PyTorch Lightning’s Trainer is instantiated
I’m trying to train a model using PyTorch Lightning on a cluster running Ubuntu 20.04. However, the code freezes when the lightning.Trainer is instantiated. There are no error messages; it just freezes, and the program never exits.
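In case it helps, this is roughly the point where it hangs, with some standard PyTorch/NCCL debug switches enabled (a sketch; these environment variables are generic debugging knobs, not something from my actual script):

```python
import os

# Standard PyTorch/NCCL debug knobs: set before importing torch/lightning
# so the distributed setup logs what it is doing instead of hanging silently.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")

import lightning as L

# If these SLURM variables are set, Lightning auto-detects a SLURM cluster
# and derives the world size from them.
print("SLURM_NTASKS          =", os.environ.get("SLURM_NTASKS"))
print("SLURM_NTASKS_PER_NODE =", os.environ.get("SLURM_NTASKS_PER_NODE"))
print("SLURM_PROCID          =", os.environ.get("SLURM_PROCID"))

trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp")  # <- freezes here
print("Trainer created on rank", trainer.global_rank)
```

From what I have read, a silent hang like this usually means the processes are waiting in the distributed rendezvous for peers that were never launched, e.g. when the number of tasks srun starts does not match devices × num_nodes, but I have not been able to confirm that this is what is happening here.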