Tag Archive for cudanvidia

Inconsistent global memory access between blocks despite use of volatile, threadfence and disabling L1 cache

The following minimal reproducible example builds a tree into which bodies are inserted according to their position (a 1D version of a quadtree or octree). When multiple blocks are used, some blocks overwrite the insertions of other blocks, so the number of bodies in the tree does not equal the number of bodies passed to the kernel. This happens despite calling __threadfence() (probably more often than necessary), marking the tree array as volatile, and disabling the L1 cache with "-Xptxas -dlcm=cg". It was tested on a Quadro P600 (nvcc -o example -arch=sm_61 -G -g -Xptxas -dlcm=cg example.cu) and an A30 (nvcc -o example -arch=sm_80 -G -g -Xptxas -dlcm=cg example.cu).
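The asker's example is not reproduced in this excerpt, and the sketch below is not that code. It only illustrates one common reason for lost insertions of this kind: volatile and __threadfence() affect the visibility and ordering of writes, but they do not make a "check that the slot is empty, then write it" sequence atomic, so two threads can both see an empty slot and both claim it. Assuming that is what happens here, a slot can instead be claimed with atomicCAS. The tree layout (one node per body, two child slots per node, a simplification of a 1D space-partitioning tree) and the names insert_bodies, insert_and_count and EMPTY are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define EMPTY (-1)  // hypothetical marker for an unclaimed slot

// Body i doubles as tree node i: child[2*i] is its left slot, child[2*i+1] its right slot.
// A slot goes from EMPTY to a body index exactly once and is never changed afterwards.
__global__ void insert_bodies(const float* pos, int* child, int* root, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int* slot = root;
    while (true) {
        // Atomically claim the slot if it is still empty; otherwise learn who owns it.
        int old = atomicCAS(slot, EMPTY, i);
        if (old == EMPTY) break;  // this thread's body now occupies the slot
        // Slot is taken by body `old`: descend left or right according to position.
        slot = &child[2 * old + (pos[i] >= pos[old] ? 1 : 0)];
    }
}

// Host helper: inserts n > 0 bodies using several blocks and returns how many
// bodies are reachable from the root. Error checking is omitted for brevity.
int insert_and_count(int n, int threads_per_block)
{
    // Pseudo-random positions in [0, 1) from a simple linear congruential generator.
    std::vector<float> h_pos(n);
    unsigned int state = 12345u;
    for (int i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;
        h_pos[i] = (state >> 8) / 16777216.0f;
    }

    float* d_pos = nullptr;
    int* d_child = nullptr;
    int* d_root = nullptr;
    cudaMalloc(&d_pos, n * sizeof(float));
    cudaMalloc(&d_child, 2 * n * sizeof(int));
    cudaMalloc(&d_root, sizeof(int));
    cudaMemcpy(d_pos, h_pos.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Setting every byte to 0xFF makes every int equal to -1, i.e. EMPTY.
    cudaMemset(d_child, 0xFF, 2 * n * sizeof(int));
    cudaMemset(d_root, 0xFF, sizeof(int));

    int blocks = (n + threads_per_block - 1) / threads_per_block;
    insert_bodies<<<blocks, threads_per_block>>>(d_pos, d_child, d_root, n);
    cudaDeviceSynchronize();

    std::vector<int> h_child(2 * n);
    int h_root = EMPTY;
    cudaMemcpy(h_child.data(), d_child, 2 * n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_root, d_root, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_pos);
    cudaFree(d_child);
    cudaFree(d_root);

    // Count the nodes reachable from the root with an explicit stack.
    int count = 0;
    std::vector<int> stack;
    if (h_root != EMPTY) stack.push_back(h_root);
    while (!stack.empty()) {
        int node = stack.back();
        stack.pop_back();
        ++count;
        for (int s = 0; s < 2; ++s) {
            int c = h_child[2 * node + s];
            if (c != EMPTY) stack.push_back(c);
        }
    }
    return count;
}
```

Calling insert_and_count(4096, 128) from a host main launches 32 blocks, and the returned count should equal the number of bodies passed in if no insertion is lost. Because no thread ever waits on another (a failed atomicCAS simply descends one level), this variant avoids spin locks and needs neither volatile nor explicit fences for the slot updates.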

Weird behaviour of CUDA recursion

In the following minimal reproducible example, when the recursion in device_func is active, the __syncthreads() barrier appears to be ignored: in the debugger, breakpoint 2 is hit before breakpoint 1. If the recursion is removed, it works as expected. How can this be? The code is compiled with nvcc -arch=sm_61 -G -g example.cu for an NVIDIA Quadro P600, using CUDA Toolkit 12.5.