What’s wrong with “cudaMemset( d_arr, 10, 10 * sizeof( int ) ); ” in CUDA code? [duplicate]
This question already has an answer here: cudaMemset() – does it set bytes or integers? (1 answer). I have the toy code below: #include <stdio.h> #include <stdlib.h> __global__ void add1InGPU( int *devArr ) { int i = threadIdx.x; devArr[i] += 1; } int main( void ) { int *h_arr = (int*)malloc( […]
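As the linked duplicate explains, cudaMemset sets individual bytes, not ints: the call above writes the byte 10 into all 40 bytes, so every int ends up as 0x0A0A0A0A (168430090) rather than 10. A minimal sketch of the usual alternatives (d_arr and the size of 10 ints come from the question; everything else is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main( void )
{
    int *d_arr = NULL;
    cudaMalloc( &d_arr, 10 * sizeof( int ) );

    // Fine: zero is the one value that works per-byte for an int array.
    cudaMemset( d_arr, 0, 10 * sizeof( int ) );

    // To set every element to 10, copy from the host (or launch a kernel).
    int h_init[10];
    for ( int i = 0; i < 10; ++i ) h_init[i] = 10;
    cudaMemcpy( d_arr, h_init, 10 * sizeof( int ), cudaMemcpyHostToDevice );

    cudaFree( d_arr );
    return 0;
}
```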
cudaHostRegister() fails with “out of memory” while cudaMallocHost() works with far larger amounts of data [closed]
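For context, cudaMallocHost allocates new pinned memory, while cudaHostRegister page-locks memory that was allocated by other means, so the two calls can behave differently for the same amount of data. A minimal sketch contrasting them (the buffer size is illustrative; error handling is trimmed):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main( void )
{
    const size_t bytes = 256u * 1024 * 1024;  // illustrative size

    // Option 1: the runtime allocates pinned memory directly.
    void *pinned = NULL;
    if ( cudaMallocHost( &pinned, bytes ) != cudaSuccess ) return 1;

    // Option 2: pin an existing allocation after the fact; this can fail
    // (e.g. with cudaErrorMemoryAllocation) even when cudaMallocHost
    // succeeds for larger sizes, as the question describes.
    void *plain = malloc( bytes );
    cudaError_t reg = cudaHostRegister( plain, bytes, cudaHostRegisterDefault );

    if ( reg == cudaSuccess ) cudaHostUnregister( plain );
    free( plain );
    cudaFreeHost( pinned );
    return 0;
}
```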
Why does the object I just constructed in CUDA not match what it looked like just before the constructor returned?
I am writing CUDA code that runs entirely on the device. I have a class to emulate strings, since std::string cannot be used on a GPU. My class holds a wchar_t * string in data_ and a size_ member for the length of the string. I call the constructor at one point when creating a new variable instance, and inside the constructor all goes well; just before returning I can see the object is still fine. But as soon as the constructor returns, the object holds garbage, even in size_, and its memory location seems to have moved slightly (which could explain the garbage). The garbage in size_ goes away if I don't do the cudaMalloc in the constructor (i.e. if I only set size_ to len, it comes back fine from the constructor). GPU disassembly is foreign to me, so it is hard to tell what goes wrong on the return to the caller.
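A minimal sketch of the kind of device-side string class described (data_ and size_ are the question's names; the in-kernel new[] allocation and the explicit copy constructor are assumptions, added because copying or returning an object that owns a raw pointer without a copy constructor can produce exactly this sort of post-return garbage):

```cuda
#include <cstddef>

class DevString {
public:
    __device__ DevString( const wchar_t *src, size_t len ) : size_( len ) {
        data_ = new wchar_t[len + 1];
        for ( size_t i = 0; i < len; ++i ) data_[i] = src[i];
        data_[len] = L'\0';
    }
    // Without a user-defined copy constructor, returning or copying the
    // object duplicates the raw pointer; the destructor of the temporary
    // then frees the buffer both copies point at.
    __device__ DevString( const DevString &other ) : size_( other.size_ ) {
        data_ = new wchar_t[size_ + 1];
        for ( size_t i = 0; i <= size_; ++i ) data_[i] = other.data_[i];
    }
    __device__ ~DevString() { delete[] data_; }
private:
    wchar_t *data_;
    size_t size_;
};
```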
error: ‘cudaDriverEntryPointQueryResult’ was not declared in this scope
My CUDA compiler identifies itself as NVIDIA 11.7.64, and I am including both cuda_runtime_api.h and cuda_runtime.h, but the error still persists. Any workarounds?
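cudaDriverEntryPointQueryResult was added to the runtime API in CUDA 12.0, so the CUDA 11.7 headers do not declare it no matter which of those two headers is included. A minimal sketch of a version guard around the call that uses it (the queried symbol name is illustrative; cudaGetDriverEntryPoint gained the extra out-parameter in 12.0):

```cuda
#include <cuda_runtime.h>

int main( void )
{
    void *fn = NULL;
#if CUDART_VERSION >= 12000
    // CUDA 12.x signature takes a query-result out-parameter.
    cudaDriverEntryPointQueryResult status;
    cudaGetDriverEntryPoint( "cuCtxGetCurrent", &fn,
                             cudaEnableDefault, &status );
#else
    // CUDA 11.x signature has no such parameter (and no such type).
    cudaGetDriverEntryPoint( "cuCtxGetCurrent", &fn, cudaEnableDefault );
#endif
    return fn != NULL ? 0 : 1;
}
```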
How to properly free a CUDA context?
I am implementing OptiX denoising inside my C++ path tracer, so I need to create a CUDA context before calling OptiX kernels. The context has to be created every time I spawn a rendering thread, since each thread has its own CUDA context.
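A minimal sketch of the matching create/destroy pair in the driver API, one context per rendering thread as described (error checking and the OptiX calls are omitted):

```cuda
#include <cuda.h>

// Called once per rendering thread in this sketch.
void renderThread( void )
{
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet( &dev, 0 );
    cuCtxCreate( &ctx, 0, dev );  // context becomes current on this thread

    // ... create the OptiX device context, run denoising, etc. ...

    cuCtxDestroy( ctx );          // release the context when the thread ends
}

int main( void )
{
    cuInit( 0 );                  // must precede any other driver-API call
    renderThread();
    return 0;
}
```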
identifier “atomicAdd” in CUDA
I was running the k-means algorithm using CUDA and encountered a problem in this part of the code, just before the for loop: if (idx < numPoints) { atomicAdd(&counts[points[idx].cluster], 1);
code:
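For context, "identifier "atomicAdd" is undefined" usually means the code is not being compiled as device code by nvcc (or an IDE parser is flagging it); the int overload of atomicAdd itself is available on every supported architecture. A minimal sketch of the counting step (counts, numPoints, and the cluster field come from the question; the Point layout and launch indexing are assumptions):

```cuda
struct Point { float x, y; int cluster; };

__global__ void countClusterSizes( const Point *points, int *counts,
                                   int numPoints )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < numPoints ) {
        // atomicAdd on int resolves on all current architectures, but the
        // file must be a .cu translation unit compiled by nvcc.
        atomicAdd( &counts[points[idx].cluster], 1 );
    }
}
```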
Perform quick flip operations on matrices using CUDA
I want to perform a fast flip operation, similar to MATLAB's flip, on a 3D matrix in CUDA C++, but I have encountered a speed bottleneck and need to ask for help. The following uses the 2×2×2 matrix A to demonstrate the flip function (A = reshape(1:8,2,2,2)):
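A minimal sketch of a flip along the first dimension, using MATLAB-style column-major indexing (the kernel name, element type, and dimensions are illustrative, not the asker's code):

```cuda
// Element (x, y, z) of an nx-by-ny-by-nz column-major array sits at
// linear index x + y*nx + z*nx*ny; flipping dimension 1 maps x to nx-1-x.
__global__ void flipDim1( const float *in, float *out,
                          int nx, int ny, int nz )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if ( x < nx && y < ny && z < nz ) {
        out[(nx - 1 - x) + y * nx + z * nx * ny] =
            in[x + y * nx + z * nx * ny];
    }
}
```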