CUDA constant memory provides no improvement over global memory accesses
I am implementing a 2D convolution that applies a 3 x 3 filter to a 2048 x 2048 image. I wrote two versions: one reads the filter from global memory, and the other reads it from constant memory. When I benchmark the code on my RTX 3090, I see no improvement from using constant memory.
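To make the comparison concrete, here is a minimal sketch of what the two variants might look like. It is not my exact code: the names (`convGlobal`, `convConstant`, `c_filter`, `compareKernels`), the box filter, the synthetic image, the zero padding, and the 16 x 16 block size are all assumptions chosen for illustration. Both kernels do the same arithmetic and differ only in where the filter coefficients are read from.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error.
#define CHECK(call) do { \
    cudaError_t e_ = (call); \
    if (e_ != cudaSuccess) { \
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); \
        std::exit(1); \
    } } while (0)

constexpr int K = 3, R = K / 2;      // 3 x 3 filter, radius 1
constexpr int W = 2048, H = 2048;    // image size from the question

__constant__ float c_filter[K * K];  // filter copy for the constant-memory variant

// Variant 1: filter read through an ordinary global-memory pointer.
// No kernel flip is applied (cross-correlation form, common in image filtering).
__global__ void convGlobal(const float* __restrict__ img, const float* __restrict__ filt,
                           float* __restrict__ out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int ix = x + dx, iy = y + dy;
            if (ix >= 0 && ix < w && iy >= 0 && iy < h)   // zero padding at the borders
                acc += img[iy * w + ix] * filt[(dy + R) * K + (dx + R)];
        }
    out[y * w + x] = acc;
}

// Variant 2: identical arithmetic, filter read from __constant__ memory.
__global__ void convConstant(const float* __restrict__ img, float* __restrict__ out,
                             int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int ix = x + dx, iy = y + dy;
            if (ix >= 0 && ix < w && iy >= 0 && iy < h)
                acc += img[iy * w + ix] * c_filter[(dy + R) * K + (dx + R)];
        }
    out[y * w + x] = acc;
}

// Times each variant once after a warm-up launch (milliseconds) and returns
// the largest absolute difference between the two outputs.
float compareKernels(float* msGlobal, float* msConstant)
{
    std::vector<float> hImg(W * H), hFilt(K * K, 1.0f / 9.0f);  // box filter (assumed)
    for (int i = 0; i < W * H; ++i) hImg[i] = float(i % 251);   // synthetic image

    const size_t bytes = sizeof(float) * W * H;
    float *dImg, *dFilt, *dOutG, *dOutC;
    CHECK(cudaMalloc(&dImg, bytes));
    CHECK(cudaMalloc(&dFilt, sizeof(float) * K * K));
    CHECK(cudaMalloc(&dOutG, bytes));
    CHECK(cudaMalloc(&dOutC, bytes));
    CHECK(cudaMemcpy(dImg, hImg.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dFilt, hFilt.data(), sizeof(float) * K * K, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpyToSymbol(c_filter, hFilt.data(), sizeof(float) * K * K));

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    cudaEvent_t t0, t1;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));

    convGlobal<<<grid, block>>>(dImg, dFilt, dOutG, W, H);      // warm-up
    CHECK(cudaEventRecord(t0));
    convGlobal<<<grid, block>>>(dImg, dFilt, dOutG, W, H);      // timed run
    CHECK(cudaEventRecord(t1));
    CHECK(cudaEventSynchronize(t1));
    CHECK(cudaEventElapsedTime(msGlobal, t0, t1));

    convConstant<<<grid, block>>>(dImg, dOutC, W, H);           // warm-up
    CHECK(cudaEventRecord(t0));
    convConstant<<<grid, block>>>(dImg, dOutC, W, H);           // timed run
    CHECK(cudaEventRecord(t1));
    CHECK(cudaEventSynchronize(t1));
    CHECK(cudaEventElapsedTime(msConstant, t0, t1));
    CHECK(cudaGetLastError());

    std::vector<float> outG(W * H), outC(W * H);
    CHECK(cudaMemcpy(outG.data(), dOutG, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(outC.data(), dOutC, bytes, cudaMemcpyDeviceToHost));
    float maxDiff = 0.0f;
    for (int i = 0; i < W * H; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(outG[i] - outC[i]));

    cudaFree(dImg); cudaFree(dFilt); cudaFree(dOutG); cudaFree(dOutC);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return maxDiff;
}

int main()
{
    float msG = 0.0f, msC = 0.0f;
    float diff = compareKernels(&msG, &msC);
    std::printf("global: %.3f ms  constant: %.3f ms  max |diff|: %g\n", msG, msC, diff);
    return 0;
}
```

This should build with something like `nvcc -O3 conv.cu -o conv` (the file name is arbitrary). A single timed launch per variant is noisy, so averaging over many launches would give a fairer comparison. Note that in both variants each thread reads only 9 filter coefficients against 9 image pixels, and every thread in a warp reads the same coefficient at the same time.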