How to run a CUDA kernel on only one Streaming Multiprocessor (32 cores/threads) so that there can be perfect synchrony between them?
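On all current NVIDIA GPUs the warp size is 32 threads. This is independent of how many cores an SM has, which varies by architecture. The warp size matters when choosing the number of threads later on. On architectures before Volta, all threads in a warp share a single program counter, so those 32 threads run in lockstep: every thread executes every instruction at the same time, and divergent branches are serialized with the inactive threads masked off. From Volta onward, threads are scheduled independently, so don't rely on implicit lockstep; call __syncwarp() or use the *_sync warp primitives wherever the threads of a warp have to agree. To keep a kernel on one SM, launch a single block, since a block never spans more than one SM.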
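As a rough sketch of the single-warp case described above, the kernel below is launched as one block of 32 threads, so it occupies one warp on one SM. The names warp_sum and run_warp_sum are made up for illustration, and the code assumes CUDA 9 or later for __shfl_down_sync. It sums the lane indices 0..31 with warp shuffles, which synchronize the participating lanes explicitly instead of relying on implicit lockstep.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: each of the 32 threads contributes its
// lane index, and a shuffle-based reduction leaves the total in lane 0.
__global__ void warp_sum(int *out) {
    int v = threadIdx.x;
    // The full mask 0xffffffff names all 32 lanes as participants, so
    // every shuffle also acts as a sync point for the warp.
    for (int offset = 16; offset > 0; offset /= 2) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    if (threadIdx.x == 0) {
        *out = v;
    }
}

// Host helper (hypothetical name): returns the sum, or -1 on a CUDA error.
int run_warp_sum() {
    int *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        return -1;
    }
    // One block of 32 threads: a single warp, resident on a single SM.
    warp_sum<<<1, 32>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err =
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}

int main() {
    printf("warp sum = %d\n", run_warp_sum());
    return 0;
}
```

On a working GPU this should report 496, the sum of 0 through 31. Which SM the block lands on is up to the hardware scheduler; the launch configuration only guarantees that the whole block stays on one SM.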
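Syncing threads is also not a simple matter. Inside a kernel, __syncthreads() only syncs the threads of a single block, and a block always runs on a single SM. Threads in other blocks can't be synced that way. The usual way to get a barrier across the whole grid is to write separate kernels and launch them one after the other; kernels launched into the same stream run in order. (Newer CUDA versions also offer a grid-wide sync through cooperative groups, but that needs a cooperative launch and hardware support.)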
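The separate-kernels approach could look roughly like the sketch below. All names (block_partial, combine, run_two_phase_sum) and the sizes of 4 blocks and 64 threads are assumptions chosen for illustration. The first kernel uses __syncthreads() for the sync inside each block, and the boundary between the two launches serves as the barrier across blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 64;  // threads per block (assumed, power of two)
constexpr int kBlocks = 4;    // number of blocks (assumed)

// Phase 1: each block sums the global indices of its own threads.
__global__ void block_partial(int *partial) {
    __shared__ int buf[kThreads];
    buf[threadIdx.x] = blockIdx.x * blockDim.x + threadIdx.x;
    __syncthreads();  // all writes to buf are visible within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        }
        __syncthreads();  // reached by every thread of the block
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = buf[0];
    }
}

// Phase 2: runs only after phase 1 has finished for every block.
__global__ void combine(const int *partial, int n, int *total) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += partial[i];
    }
    *total = sum;
}

// Host helper (hypothetical name): returns the total, or -1 on a CUDA error.
int run_two_phase_sum() {
    int *d_partial = nullptr;
    int *d_total = nullptr;
    int h_total = -1;
    if (cudaMalloc(&d_partial, kBlocks * sizeof(int)) != cudaSuccess) {
        return -1;
    }
    if (cudaMalloc(&d_total, sizeof(int)) != cudaSuccess) {
        cudaFree(d_partial);
        return -1;
    }
    // Both launches go to the default stream, so they execute in order.
    block_partial<<<kBlocks, kThreads>>>(d_partial);
    combine<<<1, 1>>>(d_partial, kBlocks, d_total);
    cudaError_t err =
        cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    cudaFree(d_total);
    return err == cudaSuccess ? h_total : -1;
}

int main() {
    printf("total = %d\n", run_two_phase_sum());
    return 0;
}
```

With these sizes the result should be 32640, the sum of 0 through 255. The second kernel uses a single thread only to keep the sketch short; a real reduction would parallelize that step as well.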