How many threads and blocks are used when doing a CuArray multiplication in Julia?
I’m trying to optimize some matrix multiplications in Julia and decided to use CUDA.jl.
I am slightly confused about how multiplication of CuArrays works.
If I define two CuArrays and multiply them, I understand that the computation happens on the device, but I have no idea what grid and block dimensions are used for it.
Parallel Computing function value with Julia CUDA
I have a function f(x,y,z) defined as
Julia CUDA synchronisation over multiple Blocks
I am quite new to CUDA in Julia, but I have been able to speed up some code using just threads. However, depending on the complexity of the code, the number of threads is limited to some value below 1024. I do a summation over a set of N points, and if N is greater than 340 it cannot use threads alone. I have therefore looked at adding blocks to the code, which I have been able to get running for larger N. However, this introduces synchronisation issues, as I only have the sync_threads() command. Is there any way to synchronise across both the blocks and the threads? I will try to sketch the code below:
Function not working in Julia even if it is defined in the file
I am trying to run some code on the CPU. However, I am getting the following error:
Beginner to CUDA in Julia, Correct use of Threads
I am looking at speeding up a calculation via CUDA, and I have achieved a speed-up purely by using threads. I will sketch it as follows: