
Tag Archive for memory, cuda, nsight-compute

Vectorized Memory Stores Reduce Load Instructions

I have a kernel that is coarsened 16x (1×16 tiling), so each thread produces 16 output elements. To reduce the number of STG (store global) instructions, I implemented vectorized memory accesses, using uchar4 in my case. When I looked at the memory chart, I saw this:
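For context, a kernel of the kind described above might look like the sketch below. This is not my original code: the names `fill_coarsened` and `run_check`, the byte pattern written, and the buffer size are all placeholders chosen for illustration. It assumes the output length is a multiple of 16 and that the buffer comes from `cudaMalloc`, so every `uchar4` store is properly aligned. Each thread writes its 16 bytes as four `uchar4` stores, which would normally be expected to compile to four 32-bit STG instructions rather than sixteen 8-bit ones.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread owns 16 consecutive bytes (1x16 tiling)
// and writes them with four uchar4 stores instead of sixteen byte stores.
__global__ void fill_coarsened(unsigned char* out, int n)
{
    constexpr int kCoarsen = 16;
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * kCoarsen;
    if (base + kCoarsen > n) return;  // assumption: n is a multiple of 16

    // base is a multiple of 16 and cudaMalloc memory is suitably aligned,
    // so this pointer meets uchar4's 4-byte alignment requirement.
    uchar4* out4 = reinterpret_cast<uchar4*>(out + base);

    #pragma unroll
    for (int i = 0; i < kCoarsen / 4; ++i) {
        int j = base + 4 * i;
        // Placeholder payload: byte k holds the low 8 bits of its own index.
        out4[i] = make_uchar4(static_cast<unsigned char>(j),
                              static_cast<unsigned char>(j + 1),
                              static_cast<unsigned char>(j + 2),
                              static_cast<unsigned char>(j + 3));
    }
}

// Launches the kernel and returns the number of mismatching bytes,
// or -1 if a CUDA call fails.
int run_check(int n)
{
    unsigned char* d_out = nullptr;
    if (cudaMalloc(&d_out, n) != cudaSuccess) return -1;

    int threads = 256;
    int blocks  = (n / 16 + threads - 1) / threads;
    fill_coarsened<<<blocks, threads>>>(d_out, n);

    std::vector<unsigned char> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int k = 0; k < n; ++k)
        if (h_out[k] != static_cast<unsigned char>(k)) ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = run_check(1 << 20);
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Note that adjacent threads in this layout write addresses 16 bytes apart, so one warp-wide `uchar4` store touches a strided range rather than one contiguous 128-byte segment. That access pattern is part of what the memory chart reflects.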