Why is my TensorFlow script running every line 32 times when executed by a GPU supercomputer?


I have a fairly simple TensorFlow model that I’ve been trying to run on a supercomputer with several GPUs. It runs seamlessly on my laptop; on the supercomputer, however, it appears to run every line 32 times, so the output file contains phrases like “DFs loaded” and “Data normalized” over and over. Once it gets to building the neural network, it prints errors like these repeatedly and then stops running:

2024-04-26 20:07:47.537408: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 20.40GiB (21908291584 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-04-26 20:07:47.537383: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 28.35GiB (30439505920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-04-26 20:07:47.537395: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 24.94GiB (26775781376 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
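
The out-of-memory messages made me wonder whether each copy of the script is trying to reserve the whole GPU for itself. As an experiment I thought about asking TensorFlow to allocate GPU memory on demand near the top of NN-test.py, along the lines of the sketch below (this isn't in the script yet, and whether it's actually relevant here is just my guess):

import tensorflow as tf

# Sketch: let TensorFlow grow GPU memory as needed instead of reserving
# (nearly) all of the device memory up front in every process.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)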

Has anybody encountered this before? I’m launching the script through a Slurm scheduler with fairly basic parameters:

#!/bin/bash

#SBATCH --account=XXXXX
#SBATCH --time=0:30:0
#SBATCH --partition=test
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --mem=512G

# Activate Conda environment
export CONDADIR=/nobackup/projects/XXXXX/$USER
source $CONDADIR/miniconda/etc/profile.d/conda.sh
conda activate tf-env

module load cuda/12.0.1
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_HOME"

module load gcc
module load openmpi/4.0.5
xxxx-mpirun python3 NN-test.py
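
To check whether xxxx-mpirun is actually starting 32 separate copies of the script (rather than something inside TensorFlow looping), I was planning to add a quick print at the very top of NN-test.py, something like this sketch (the environment variable names are my assumption about what Open MPI and Slurm set for each task):

import os
import socket

# Sketch: report which task/rank this process is, so the duplicated output
# lines can be traced back to separate processes.
rank = os.environ.get("OMPI_COMM_WORLD_RANK", os.environ.get("SLURM_PROCID", "?"))
size = os.environ.get("OMPI_COMM_WORLD_SIZE", os.environ.get("SLURM_NTASKS", "?"))
print(f"Starting NN-test.py as rank {rank} of {size} on {socket.gethostname()}")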

Can anyone see anything in the job script that might be causing this issue?

Thank you!
