Tiled matrix multiplication with CUDA
I am trying to write a tiled matrix multiplication kernel in CUDA. When I compare the CPU and GPU results, they mismatch starting from the first element. This is my code: