.cuh and .h file difference CUDA
When I declare an extern __device__ variable in a .cuh file, which is defined in a .cu file, and then try to use it in a different .cu file, I get a multiple definition error, but when I change the .cuh file to a .h file, the error disappears. Is there a difference between .cuh and .h files? This post https://forums.developer.nvidia.com/t/whats-the-difference-between-cuh-and-h/266214 suggests there is no difference between .cuh and .h files, so why is this happening?
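The layout I have is roughly the following (a simplified sketch; g_value and the file names are placeholders, not my real code):

```cuda
// globals.cuh (or globals.h) -- header with the declaration only
#pragma once
extern __device__ int g_value;

// defs.cu -- the single definition
#include "globals.cuh"
__device__ int g_value;

// use.cu -- another translation unit that uses the symbol
// (the project is built with relocatable device code, e.g. nvcc -rdc=true)
#include "globals.cuh"
__global__ void readIt(int* out) { *out = g_value; }
```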
Referencing a pitched pointer in a device function CUDA
I have created a 3D matrix with cudaMalloc3D using a cudaPitchedPtr, and I would like to reference the created matrix from a device function as well. Does copying the pitched pointer into a __device__ cudaPitchedPtr and then referencing it work? For example –
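Roughly what I have in mind is the pattern below (a simplified sketch with placeholder names and sizes, not my actual code):

```cuda
#include <cuda_runtime.h>

// Device-side copy of the pitched pointer returned by cudaMalloc3D.
__device__ cudaPitchedPtr devPitched;

// Index element (x, y, z) through the pitched pointer from device code.
__device__ float readElem(int x, int y, int z) {
    char*  slice = (char*)devPitched.ptr + (size_t)z * devPitched.pitch * devPitched.ysize;
    float* row   = (float*)(slice + (size_t)y * devPitched.pitch);
    return row[x];
}

__global__ void kernel(float* out) { out[0] = readElem(1, 2, 3); }

int main() {
    const size_t W = 64, H = 32, D = 16;                           // placeholder dimensions
    cudaPitchedPtr p;
    cudaExtent extent = make_cudaExtent(W * sizeof(float), H, D);
    cudaMalloc3D(&p, extent);

    // Copy the host-side cudaPitchedPtr into the __device__ symbol.
    cudaMemcpyToSymbol(devPitched, &p, sizeof(cudaPitchedPtr));

    float* d_out;
    cudaMalloc(&d_out, sizeof(float));
    kernel<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();
    return 0;
}
```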
Copying a 1D array to a 3D pitched array CUDA
I need to copy a 1D array into a 3D pitched array. Each thread in the kernel copies one row into the 3D array. Is there any way to do it using cudaMemcpy or cudaMemcpy3D?
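If cudaMemcpy3D can do it, I imagine the call would look roughly like this (a sketch; sizes and names are placeholders):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t W = 64, H = 32, D = 16;          // placeholder dimensions (elements)
    float* h_src = new float[W * H * D];          // the existing contiguous 1D array

    // Pitched 3D destination.
    cudaPitchedPtr d_dst;
    cudaExtent extent = make_cudaExtent(W * sizeof(float), H, D);
    cudaMalloc3D(&d_dst, extent);

    // Describe the tightly packed source and let cudaMemcpy3D handle the
    // per-row padding of the pitched destination.
    cudaMemcpy3DParms p = {};
    p.srcPtr = make_cudaPitchedPtr(h_src, W * sizeof(float), W, H);
    p.dstPtr = d_dst;
    p.extent = extent;
    p.kind   = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);

    cudaFree(d_dst.ptr);
    delete[] h_src;
    return 0;
}
```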
Allocating memory dynamically for each thread CUDA
I need to allocate an array for each thread, but the length of the array is known only at runtime. Once the array length is calculated, it is a constant value. cudaMalloc does not seem to work inside the kernel. Is there any way I can do it? Something like this –
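Sketched out, what I am after is roughly this; the version below uses in-kernel malloc as a stand-in (len and the launch configuration are placeholders):

```cuda
#include <cuda_runtime.h>

// Each thread allocates its own array of `len` elements from the device heap.
// The heap size may need to be raised beforehand with
// cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).
__global__ void work(int len) {
    float* buf = (float*)malloc(len * sizeof(float));   // per-thread array
    if (buf == nullptr) return;                         // device heap exhausted
    for (int i = 0; i < len; ++i)
        buf[i] = threadIdx.x + i;
    // ... use buf ...
    free(buf);
}

int main() {
    int len = 128;                                       // placeholder: computed at runtime
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);
    work<<<4, 256>>>(len);
    cudaDeviceSynchronize();
    return 0;
}
```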
How do you allocate a structure that contains a double pointer using cudaMalloc?
I have tried everything to allocate a struct containing a double pointer in device memory using cudaMalloc. I know this question has been asked multiple times, yet it is always about a structure with only a single pointer. I cannot flatten the data because the training.data[] index is linked to many other functions of the program. If I can't index the data using training.data[1][n], training.data[2][n], etc., then I can't make use of the data. I've tried this but keep getting a memory access violation error:
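Boiled down, the pattern I am attempting looks like this (a simplified sketch with placeholder names and sizes, not my exact code):

```cuda
#include <cuda_runtime.h>

struct TrainingSet {
    float** data;    // accessed as data[row][col]
    int     rows;
};

int main() {
    const int numRows = 4, rowLen = 256;                 // placeholder sizes

    // 1. Allocate each row on the device, keeping the device pointers on the host.
    float** h_rows = new float*[numRows];
    for (int r = 0; r < numRows; ++r)
        cudaMalloc(&h_rows[r], rowLen * sizeof(float));

    // 2. Allocate the device-side array of row pointers and copy the pointers into it.
    float** d_rows;
    cudaMalloc(&d_rows, numRows * sizeof(float*));
    cudaMemcpy(d_rows, h_rows, numRows * sizeof(float*), cudaMemcpyHostToDevice);

    // 3. Fill the struct on the host (with the device pointer array) and copy it over.
    TrainingSet h_training{d_rows, numRows};
    TrainingSet* d_training;
    cudaMalloc(&d_training, sizeof(TrainingSet));
    cudaMemcpy(d_training, &h_training, sizeof(TrainingSet), cudaMemcpyHostToDevice);

    // ... kernels should now be able to index d_training->data[r][c] ...
    delete[] h_rows;
    return 0;
}
```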
Why does the same kernel sometimes execute 10x slower?
Here’s the code:
How to get a CUDA event time on the CPU timeline?
Here’s the pseudo code:
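As an illustrative stand-in for the pseudo code (placeholder names throughout): an event is recorded on a stream, and the host clock is sampled once the event has completed.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void kernel() {}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t done;
    cudaEventCreate(&done);

    auto launchCpu = std::chrono::steady_clock::now();    // CPU timestamp at launch
    kernel<<<1, 1, 0, stream>>>();
    cudaEventRecord(done, stream);

    cudaEventSynchronize(done);                           // wait until the event completes
    auto doneCpu = std::chrono::steady_clock::now();      // CPU timestamp after completion

    double ms = std::chrono::duration<double, std::milli>(doneCpu - launchCpu).count();
    std::printf("event observed ~%.3f ms after launch on the CPU clock\n", ms);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
}
```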
How to do cudaMemcpy with priority
Objective: I have two groups of data that need to be copied to the GPU. The first group is large and has a lower priority, while the second one is smaller and has a higher priority, such as metadata for a new job. The cudaMemcpy of the low-priority group is issued first. I want to ensure […]
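A sketch of the setup I am describing, assuming two streams created with cudaStreamCreateWithPriority (buffer sizes and names are placeholders):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bigSize = 256 << 20, smallSize = 64 << 10;   // placeholder sizes
    char *h_big, *h_small, *d_big, *d_small;
    cudaMallocHost(&h_big, bigSize);        // pinned host memory so the copies are async
    cudaMallocHost(&h_small, smallSize);
    cudaMalloc(&d_big, bigSize);
    cudaMalloc(&d_small, smallSize);

    // CUDA stream priorities: a numerically lower value means higher priority.
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t lowPrioStream, highPrioStream;
    cudaStreamCreateWithPriority(&lowPrioStream, cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&highPrioStream, cudaStreamNonBlocking, greatestPrio);

    // The large, low-priority copy is enqueued first...
    cudaMemcpyAsync(d_big, h_big, bigSize, cudaMemcpyHostToDevice, lowPrioStream);
    // ...followed by the small, high-priority copy (e.g. job metadata).
    cudaMemcpyAsync(d_small, h_small, smallSize, cudaMemcpyHostToDevice, highPrioStream);

    cudaDeviceSynchronize();
    return 0;
}
```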