performance difference in coarsened kernels
I’m trying to understand the impact of thread coarsening in a convolution kernel. I have been trying to reuse the convolution matrix and issue fewer global memory accesses per pixel while doing more work per thread. However, I do not understand the difference in performance at each coarsening level: