CUDA: why are there more load transactions than store transactions when both are coalesced?
I am profiling NVIDIA's matrix transpose sample, available in their GitHub repo. Both the code and the profiler indicate that there are no bank conflicts. However, I noticed that global load transactions per request are higher than global store transactions per request, even though both the loads and the stores appear to be coalesced. The elements are of type int, so I should be getting perfect coalescing, and I would expect load and store transactions per request to be equal, or at least very close. What am I missing? Note that my input matrix is 12800×12800.
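To make the expectation concrete, here is a rough host-side sketch in plain C++ (not taken from the NVIDIA sample) that counts how many aligned memory segments a single warp-wide request touches. The function name `transactions_per_request`, its parameters, and the segment sizes passed to it are assumptions for illustration only. The granularity a given GPU and profiler actually use for loads versus stores depends on the architecture, which is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: counts the distinct aligned segments touched by one
// warp-wide memory request.
//   base_addr     byte address accessed by lane 0
//   stride_elems  distance between consecutive lanes, in elements
//   elem_size     bytes per element (4 for int)
//   segment_size  assumed transaction granularity in bytes (e.g. 32 or 128)
//   warp_size     number of lanes issuing the request (32 on NVIDIA GPUs)
std::size_t transactions_per_request(std::uint64_t base_addr,
                                     std::uint64_t stride_elems,
                                     std::uint64_t elem_size,
                                     std::uint64_t segment_size,
                                     unsigned warp_size = 32) {
    assert(elem_size > 0 && segment_size > 0);
    std::set<std::uint64_t> segments;
    for (unsigned lane = 0; lane < warp_size; ++lane) {
        // First and last byte touched by this lane.
        std::uint64_t first = base_addr + lane * stride_elems * elem_size;
        std::uint64_t last = first + elem_size - 1;
        // Record every segment the element overlaps.
        for (std::uint64_t s = first / segment_size; s <= last / segment_size; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under these assumptions, a fully coalesced, aligned int access (`base_addr = 0`, `stride_elems = 1`, `elem_size = 4`) touches 1 segment when `segment_size` is 128 and 4 segments when it is 32, so the same access pattern yields a different count depending on the granularity assumed.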