Slurm sbatch on multiple nodes with 1 GPU each to parallelize cross-validation

As the title says, I am trying to refine a bash script that starts a job requesting 5 nodes with 1 GPU each (and thus 1 task per node), in order to run a 5-fold cross-validation in parallel.
My script currently looks like this:

#!/bin/bash

#SBATCH -A <account>
#SBATCH -p <partition>
#SBATCH --time 5:00:00
#SBATCH -N 5
#SBATCH --gres=gpu:1
#SBATCH --mem=50000
#SBATCH --job-name=<jobname>
#SBATCH --error=file.err
#SBATCH --output=file.out
#SBATCH --ntasks-per-node=1

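# Launch one job step per fold in the background, each on 1 node with 1 GPU.
for fold in {0..4}; do
    srun -N1 -n1 --gres=gpu:1 --exclusive bash file.sh --fold "${fold}" --device cuda:0 &
done
# Block until all five background steps have finished.
wait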

I have some doubts about this job.
Does the srun command actually start each command on a different available node?
The available CUDA index on a node is sometimes not 0, so one of the folds fails to start in parallel. How is this possible?
I hope the question is clear; any corrections are welcome.
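
To check both points, I could have each step print where it runs and which GPUs it sees. Below is a minimal sketch; `describe_step` is a hypothetical helper of mine, not part of the original script. It assumes Slurm exports `SLURMD_NODENAME` inside the step and that the cluster's GRES configuration sets `CUDA_VISIBLE_DEVICES`, which may differ between sites. As far as I understand, CUDA numbers devices relative to `CUDA_VISIBLE_DEVICES`, so `cuda:0` should refer to the first visible GPU even when its physical index on the node is not 0.

```shell
#!/bin/bash
# Hypothetical diagnostic helper (an assumption, not from the original job script):
# prints the fold, the node the step runs on, and the GPUs visible to it.
describe_step() {
    local fold="$1"
    # SLURMD_NODENAME is set by Slurm for tasks in a job step;
    # fall back to a placeholder when run outside Slurm.
    local node="${SLURMD_NODENAME:-unknown}"
    # CUDA_VISIBLE_DEVICES is normally set by Slurm when GPUs are requested via --gres;
    # "unset" here suggests the GRES setup does not export it.
    local gpus="${CUDA_VISIBLE_DEVICES:-unset}"
    echo "fold=${fold} node=${node} visible_gpus=${gpus}"
}
```

Inside the loop it could be called in place of `file.sh`, for example with `srun -N1 -n1 --gres=gpu:1 --exclusive bash -c "$(declare -f describe_step); describe_step ${fold}" &`. If the five lines in `file.out` show five different node names, the steps are being spread over the nodes as intended.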
