performance difference in coarsened kernels

I’m trying to understand the impact of thread coarsening in a convolution kernel. I have been trying to reuse the convolution matrix and issue fewer global memory accesses per pixel while doing more work per thread, but I do not understand the difference in performance at each coarsening level: