How do I resolve `initialization error` when working with CUDA?
I am working on a g5.2xlarge EC2 instance with Ubuntu 24.04. I am trying to get CUDA working on it, but I constantly run into `returned 3-> initialization error`.
Inconsistent global memory access between blocks despite use of volatile, threadfence and disabling L1 cache
In the following minimal reproducible example for the construction of a tree, bodies are inserted based on their position (a 1D version of a quadtree/octree). When multiple blocks are used, some blocks overwrite the insertions of other blocks, so the number of bodies in the tree does not equal the number of bodies given to the kernel. This happens despite using thread fences (probably more than necessary), marking the tree array as `volatile`, and disabling the L1 cache with `-Xptxas -dlcm=cg`. This was tested on a Quadro P600 (`nvcc -o example -arch=sm_61 -G -g -Xptxas -dlcm=cg example.cu`) and an A30 (`nvcc -o example -arch=sm_80 -G -g -Xptxas -dlcm=cg example.cu`).
I want to use CUDA version 11.7 (but my driver wants 12.2) [closed]
Closed 2 days ago.
Weird behaviour of CUDA recursion
In the following minimal reproducible example, when the recursion in `device_func` is active, the `__syncthreads()` barrier is ignored, and when debugged, breakpoint 2 is hit before breakpoint 1. If the recursion is removed, it works as expected. How could this be? The code is compiled with `nvcc -arch=sm_61 -G -g example.cu` for an NVIDIA Quadro P600, using CUDA Toolkit 12.5.
On NVIDIA GPUs, is register access actually instantaneous? Are certain access patterns more efficient than others?
I am writing a tiled matrix multiplication in CUDA, and the inner loop of the kernel looks roughly like this
What is the relationship between GPU thread occupancy and synchronization stalls?
I am writing a CUDA kernel with an inner loop that looks roughly like this:
sh cuda_11.3.0_465.19.01_linux.run fails to compile ‘struct task_struct’ has no member named ‘state’; did you mean ‘stats’
I’m on Ubuntu 20.04 and I’m trying to install CUDA 11.3.