CUDA device class data modification fails for a large number of threads [duplicate]
I instantiate […]
Estimation on coalesced memory accesses and caching via CUDA device attributes and guidelines
I’ve queried the CUDA device and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.)
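The full attribute list isn't reproduced in the excerpt, but as a minimal sketch of the kind of query the question describes, cudaDeviceGetAttribute can pull a few attributes commonly consulted when reasoning about coalescing and caching (the particular selection here is illustrative, not the asker's):

```cuda
// Hedged sketch: querying a few device attributes often used when reasoning
// about coalesced access and caching. The attribute choice is illustrative.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    int warp = 0, busWidth = 0, l2Bytes = 0;
    cudaDeviceGetAttribute(&warp,     cudaDevAttrWarpSize,             dev);
    cudaDeviceGetAttribute(&busWidth, cudaDevAttrGlobalMemoryBusWidth, dev);
    cudaDeviceGetAttribute(&l2Bytes,  cudaDevAttrL2CacheSize,          dev);
    printf("warp size: %d, memory bus width: %d bits, L2 cache: %d bytes\n",
           warp, busWidth, l2Bytes);
    return 0;
}
```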
CUDA: can't use atomicAdd()
I need help with the use of atomicAdd() in CUDA 12.5.
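The asker's code isn't shown, but as a minimal sketch of atomicAdd() usage, assuming an int counter in global memory (the kernel and variable names here are mine):

```cuda
// Hedged sketch: every thread atomically adds 1 to a single global counter.
// Note: the double overload of atomicAdd requires sm_60 or newer.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void countThreads(int *counter) {
    atomicAdd(counter, 1);   // one atomic increment per thread
}

int main() {
    int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    countThreads<<<128, 256>>>(d_counter);

    int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("threads counted: %d\n", h_counter);  // expect 128 * 256 = 32768
    cudaFree(d_counter);
    return 0;
}
```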
Can you help me find the reason why my CUDA coded MLP will not learn?
I wanted to write an MLP in CUDA without any dependencies. I apologise in advance for my messy code. Could you please examine my CUDA functions to see if there is an obvious mistake that could explain why it will not solve the simple XOR problem as it should? We should see a decrease in error, but instead it just produces a random error. I tried to make my own CUDA RNG, but I am using rand() instead. I have
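The functions themselves aren't included in the excerpt, so they can't be diagnosed here. As a hedged sketch of the device-side RNG the asker mentions wanting, cuRAND's device API can replace host-side rand() for weight initialization (initWeights and its parameters are hypothetical, not the asker's code):

```cuda
// Hedged sketch: initializing weights on the device with cuRAND instead of
// host-side rand(). Each thread gets its own RNG stream via the sequence id.
#include <curand_kernel.h>

__global__ void initWeights(float *w, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState state;
        curand_init(seed, i, 0, &state);        // per-thread RNG stream
        w[i] = curand_uniform(&state) - 0.5f;   // uniform in (-0.5, 0.5]
    }
}
```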
Performance Issue with Custom Kernels in CUDA Compared to cuSOLVER
I’ve implemented a QR factorization algorithm in CUDA tailored to my specific needs. While testing, I’ve noticed that the execution time of my custom kernel increases exponentially as the matrix size grows, whereas NVIDIA’s cuSOLVER scales far more gracefully.
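The kernels aren't shown, so the scaling gap can't be explained from the excerpt alone; a useful first step is to time both paths the same way with CUDA events, as in this sketch (dummyKernel stands in for the asker's QR kernel):

```cuda
// Hedged sketch: timing GPU work with CUDA events so a custom kernel and a
// cuSOLVER call are measured identically. dummyKernel is a placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_a = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_a, n);  // replace with the QR kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                     // wait for completion

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return 0;
}
```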
Redistribution of cudart_110.dll
Can someone explain in plain English the redistribution terms for cudart_110.dll?
CUDA cudaMemcpy an illegal memory access was encountered
I want to change the variables M and N via the argv parameters when the code is executed. My code [tvm_test.cu] is below:
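tvm_test.cu isn't reproduced here, but a frequent cause of this error is a mismatch between the sizes parsed from argv and the sizes used for allocation or indexing. A hedged sketch of the general pattern, with error checks after each runtime call (variable names are illustrative):

```cuda
// Hedged sketch: sizing host and device allocations from argv and checking
// every CUDA runtime call, so an illegal access is reported where it occurs.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s M N\n", argv[0]);
        return 1;
    }
    size_t M = strtoul(argv[1], nullptr, 10);
    size_t N = strtoul(argv[2], nullptr, 10);
    size_t bytes = M * N * sizeof(float);   // size_t avoids int overflow for large M*N

    float *h_a = (float *)calloc(M * N, sizeof(float));
    float *d_a = nullptr;

    cudaError_t err = cudaMalloc(&d_a, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```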
Placement of shared memory in matrix multiplication sample
I’ve learned from the classic sample implementation of matrix multiplication in CUDA below.
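The sample code isn't quoted in the excerpt. As a rough sketch of the classic tiled pattern it refers to, each block stages TILE x TILE sub-blocks of A and B in shared memory, synchronizes, and accumulates the partial products (the names and the assumption that n is divisible by TILE are mine, not the sample's):

```cuda
// Hedged sketch of tiled shared-memory matrix multiplication for square
// matrices of size n, with n assumed divisible by TILE.
#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                        // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // done with these tiles
    }
    C[row * n + col] = acc;
}
```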
Passing an array of structs from device to host
I am trying to copy the variable d_output->list from the device to the host using cudaMemcpy, but I am getting Segmentation fault (core dumped). Could you please let me know why?
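The asker's types aren't shown, but a common cause of this segfault is dereferencing a device pointer such as d_output->list on the host. A hedged sketch of one workaround, using a hypothetical Output struct, copies the struct itself to the host first and then uses its embedded device pointer as the source of a second copy:

```cuda
// Hedged sketch: Output is a hypothetical stand-in for the asker's struct.
// d_output lives in device memory, so the host must not dereference it.
#include <cuda_runtime.h>

struct Output {
    int   count;
    float *list;   // points to device memory when the struct lives on the GPU
};

void copyListToHost(Output *d_output, float *h_list, int maxItems) {
    Output h_output;
    // 1) Copy the struct so the host can read count and the list pointer.
    cudaMemcpy(&h_output, d_output, sizeof(Output), cudaMemcpyDeviceToHost);

    int n = h_output.count < maxItems ? h_output.count : maxItems;
    // 2) h_output.list still points to device memory, so it is a valid
    //    source for a device-to-host copy of the array contents.
    cudaMemcpy(h_list, h_output.list, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```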