Related Content

Tag Archive for slurm

Slurm not setting --ntasks correctly

I set --ntasks=8 and --cpus-per-task=4 in my Slurm job script, but $SLURM_NTASKS is not set at all, and $SLURM_TASKS_PER_NODE is set to 1, which is unexpected. Below is my test.sh script (partition info is also printed below):
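
The test.sh itself is not reproduced above, so this is only a rough sketch of what such a script might look like (the job name, partition, and output file are placeholders, not taken from the question):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --partition=compute        # placeholder partition name
#SBATCH --output=test_%j.out

# Print the variables in question; they should reflect the requested layout.
echo "SLURM_NTASKS         = ${SLURM_NTASKS:-<unset>}"
echo "SLURM_TASKS_PER_NODE = ${SLURM_TASKS_PER_NODE:-<unset>}"
echo "SLURM_CPUS_PER_TASK  = ${SLURM_CPUS_PER_TASK:-<unset>}"

# One process per task; srun inherits the allocation's task count.
srun hostname

With --ntasks=8 honoured, SLURM_NTASKS would normally be 8 and SLURM_TASKS_PER_NODE something like 8 or 4(x2), depending on how the tasks are spread over nodes.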

Slurm allocation of jobs across a few nodes

I am running a parallel workload in Slurm using Python and TensorFlow. I've generated a command file to be sourced, with N lines (typically 20-100), each one running a TensorFlow training. I already have code to allocate GPUs, so I don't need Slurm to do that. I'm using sbatch to schedule the job as a job array so that I can request chunks of a few hours at a time; each new job array step then restarts all N trainings, typically 50 steps of 3 hours each, to train for about a week.
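
As a rough illustration of this setup (the file names, array size, and time limit are made up for the sketch; GPU assignment is assumed to happen inside each command, as described):

#!/bin/bash
#SBATCH --job-name=tf-train
#SBATCH --array=1-50%1            # ~50 restart chunks, one running at a time
#SBATCH --time=03:00:00           # each chunk lasts about 3 hours
#SBATCH --output=train_%A_%a.out  # %A = array job ID, %a = array index

# commands.txt stands in for the generated command file: one training per line.
while IFS= read -r cmd; do
    bash -c "$cmd" &
done < commands.txt

# Wait for all N trainings started above to finish or hit the time limit.
wait

Spreading the N lines over several nodes would additionally need one srun (or another node-aware launcher) per line rather than plain background jobs, since everything above runs on the first node of the allocation.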

Launching a service from SLURM task prolog script fails

On the HPC cluster I work with, I have to use Docker in rootless mode by calling a script start-docker.sh in every job.
To automate this, I would like to use the --task-prolog argument of srun and have the script called from a task_prolog.sh script:
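
A minimal sketch of the intended pieces (only the two file names come from the question; their contents and paths here are guesses):

#!/bin/bash
# task_prolog.sh -- executed by slurmstepd just before each task of the step.
# A task prolog's standard output is interpreted by Slurm (lines such as
# "export NAME=value" or "print ..." have special meaning), so the service's
# output is redirected and the call is backgrounded.
start-docker.sh > "$HOME/docker-rootless-${SLURM_JOB_ID}.log" 2>&1 &

invoked along the lines of srun --task-prolog=/path/to/task_prolog.sh <command>, with the prolog script readable and executable on the compute nodes.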

Is there any way to get Slurm to report the task restart count, or only the job restart count?

I'm running a single Slurm array job with many tasks. Each task may fail and be restarted, and each task needs to know how many times it has been restarted. I was hoping the environment variable SLURM_RESTART_COUNT would solve this, but it seems to increment every time any task is restarted (i.e., it tracks restarts for the whole job, not just that one task). Does Slurm save the task restart count somewhere I'm not seeing, or do I need to parse the sacct logs to get that info?
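
If Slurm itself only exposes a job-level counter, one workaround is to keep a per-task counter in the submit directory; the file name and layout below are arbitrary choices, not anything Slurm provides:

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --requeue

# User-level restart tracking: one counter file per array task.
count_file="$SLURM_SUBMIT_DIR/.restarts_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
starts=$(cat "$count_file" 2>/dev/null || echo 0)
starts=$((starts + 1))
echo "$starts" > "$count_file"
echo "Array task $SLURM_ARRAY_TASK_ID start number $starts (restarts: $((starts - 1)))"

# ... run the actual work here, passing the restart count to it if needed ...

Alternatively, since each array task has its own job ID, the accounting records for a single task can be queried with sacct -j ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} and parsed, much as the question already anticipates.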