Why is using multiple CUDA streams not improving performance as expected?
I am working on optimizing a CUDA application that processes a matrix by updating each row sequentially. The process involves three main kernels:
CUDA Deadlock Issue with Multiple Streams Despite No Direct Dependencies
I am experiencing intermittent errors in my CUDA program, which uses multiple streams for parallel execution. In a single iteration, I use one stream (rowStream) to update the row next to the pivot row and then the pivot element of that row, one after the other in the same stream, and another stream (otherRowStream) for the remaining row computations. Before starting the next iteration, I ensure that both of these events have completed.
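For reference, here is a minimal sketch of the two-stream pattern described above, not my actual code. The stream names rowStream and otherRowStream are from my program; the kernels (updateNextRow, updatePivotElement, updateOtherRows), their arithmetic, the event names, and the matrix size are hypothetical placeholders chosen only to make the example self-contained. It assumes the "both events" are CUDA events recorded at the end of each stream's work, and that each stream waits on the other stream's event before the next iteration begins.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Placeholder: updates row pivot+1 using the pivot row.
__global__ void updateNextRow(float* m, int n, int pivot) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n) m[(pivot + 1) * n + col] -= 0.5f * m[pivot * n + col];
}

// Placeholder: touches the pivot element of row pivot+1 (single thread).
__global__ void updatePivotElement(float* m, int n, int pivot) {
    m[(pivot + 1) * n + (pivot + 1)] += 1.0f;
}

// Placeholder: updates rows pivot+2 .. n-1 using the pivot row only.
__global__ void updateOtherRows(float* m, int n, int pivot) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = pivot + 2 + blockIdx.y;
    if (col < n && row < n) m[row * n + col] -= 0.25f * m[pivot * n + col];
}

int main() {
    const int n = 256;          // hypothetical matrix dimension
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    float* d_m = nullptr;
    CHECK(cudaMalloc(&d_m, sizeof(float) * n * n));
    CHECK(cudaMemset(d_m, 0, sizeof(float) * n * n));

    cudaStream_t rowStream, otherRowStream;
    CHECK(cudaStreamCreate(&rowStream));
    CHECK(cudaStreamCreate(&otherRowStream));

    // Timing is disabled because the events are used only for ordering.
    cudaEvent_t rowDone, otherDone;
    CHECK(cudaEventCreateWithFlags(&rowDone, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&otherDone, cudaEventDisableTiming));

    for (int pivot = 0; pivot + 1 < n; ++pivot) {
        // rowStream: next row, then its pivot element, in stream order.
        updateNextRow<<<blocks, threads, 0, rowStream>>>(d_m, n, pivot);
        updatePivotElement<<<1, 1, 0, rowStream>>>(d_m, n, pivot);
        CHECK(cudaEventRecord(rowDone, rowStream));

        // otherRowStream: all remaining rows, concurrently with rowStream.
        int remaining = n - (pivot + 2);
        if (remaining > 0) {
            dim3 grid(blocks, remaining);
            updateOtherRows<<<grid, threads, 0, otherRowStream>>>(d_m, n, pivot);
        }
        CHECK(cudaEventRecord(otherDone, otherRowStream));

        // Each stream waits for the other's event before the next iteration.
        // Both events are already recorded here, so the waits cannot deadlock.
        CHECK(cudaStreamWaitEvent(rowStream, otherDone, 0));
        CHECK(cudaStreamWaitEvent(otherRowStream, rowDone, 0));
    }

    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    printf("all iterations completed without a CUDA error\n");

    CHECK(cudaEventDestroy(rowDone));
    CHECK(cudaEventDestroy(otherDone));
    CHECK(cudaStreamDestroy(rowStream));
    CHECK(cudaStreamDestroy(otherRowStream));
    CHECK(cudaFree(d_m));
    return 0;
}
```

In this sketch the cross-stream waits keep the host from blocking inside the loop; calling cudaEventSynchronize on both events at the end of each iteration would be a simpler, host-blocking alternative. Checking cudaGetLastError after the launches may help narrow down where intermittent errors originate.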