Tag Archive for cudajulia

How many threads and blocks are used when doing a CuArray multiplication in Julia?

I’m trying to optimize some matrix multiplications in Julia and decided to use CUDA.jl.
I am slightly confused about how the multiplication of CuArrays works.
If I define two CuArrays and multiply them, I understand that the computation happens on the device, but I have no idea what grid and block dimensions are used for it.
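
For context, here is a minimal sketch under stated assumptions. For BLAS-supported element types such as Float32, `*` on two CuArrays is forwarded to cuBLAS, which selects its own kernels and launch dimensions internally, so they are not set from Julia code. Grid and block sizes are chosen explicitly only for hand-written kernels launched with `@cuda`. The kernel name `scale_kernel!` and the array sizes below are made up for illustration.

```julia
using CUDA

# Library path: cuBLAS decides the launch dimensions for this product.
A = CUDA.rand(Float32, 512, 512)
B = CUDA.rand(Float32, 512, 512)
C = A * B

# Hand-written path: the caller picks threads and blocks.
# `scale_kernel!` is a hypothetical kernel, not part of CUDA.jl.
function scale_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 10_000)
y = similar(x)

# Compile without launching, then ask the occupancy API for a suggested size.
kernel = @cuda launch=false scale_kernel!(y, x, 2f0)
config = launch_configuration(kernel.fun)
threads = min(length(y), config.threads)
blocks = cld(length(y), threads)

# Launch with the chosen configuration and wait for completion.
CUDA.@sync kernel(y, x, 2f0; threads, blocks)
```

The values in `config` depend on the device and on the compiled kernel, so they will differ between machines. To see what cuBLAS actually launched for `A * B`, an external profiler such as NVIDIA Nsight Systems is one option.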

Julia CUDA synchronisation over multiple Blocks

I am quite new to CUDA in Julia, but I have been able to speed up my code using just threads. However, depending on the complexity of the code, the number of threads is limited to some value below 1024. I do a summation over a set of N points, and if N is greater than 340 it cannot run with threads alone. I have therefore looked at adding blocks to the code, which lets it run for larger N. However, this introduces synchronisation issues, as I only have the sync_threads() command. Is there any way to synchronise across both the blocks and the threads? I will try to sketch the code below:
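
(The poster's own sketch is not included in this excerpt.) Separately, as a hedged illustration: `sync_threads()` is a barrier for the threads of a single block only. For a summation that spans several blocks, one common workaround is to avoid a grid-wide barrier altogether and let each thread accumulate a private partial sum that it then adds atomically to a single device-side accumulator. Kernels launched on the same stream also run in order, so splitting the work into two launches acts as a barrier between them. The kernel name `sum_sq_kernel!`, the squared-value summand, and the sizes are assumptions chosen for the example, not the original code.

```julia
using CUDA

# Hypothetical kernel: sums pts[i]^2 over all points, across any number of blocks.
function sum_sq_kernel!(acc, pts)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    partial = 0f0
    # Grid-stride loop: each thread handles every `stride`-th point.
    while i <= length(pts)
        @inbounds partial += pts[i] * pts[i]
        i += stride
    end
    # One atomic add per thread; no cross-block barrier is required.
    CUDA.@atomic acc[1] += partial
    return nothing
end

N = 5_000                      # deliberately larger than one block can cover
pts = CUDA.rand(Float32, N)
acc = CUDA.zeros(Float32, 1)

threads = 256
blocks = cld(N, threads)
CUDA.@sync @cuda threads=threads blocks=blocks sum_sq_kernel!(acc, pts)

total = Array(acc)[1]          # copy the result back to the host
```

For a plain reduction like this, `sum(abs2, pts)` on the CuArray gives the same quantity without a custom kernel. The atomic version is mainly useful when the summation is one step inside a larger hand-written kernel.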