PyTorch Lightning distributed training: what should I set all_gather's sync_grads to?
I am using PyTorch Lightning for distributed training. I am using all_gather to gather the gradients from all GPUs in order to calculate the loss function. I am unsure what I should set the sync_grads parameter to. In which cases would I want to synchronize gradients, and in which would I not?
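For reference, here is a minimal sketch of the kind of setup I mean (the encoder, the toy loss, and the class name are placeholders, not my actual code), assuming the classic pytorch_lightning import and DDP training:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class GatherLossModel(pl.LightningModule):
    """Placeholder module: the loss is computed over tensors gathered
    from every GPU, which is where sync_grads comes into play."""

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(128, 64)  # stand-in encoder

    def training_step(self, batch, batch_idx):
        x, _ = batch
        z = self.encoder(x)

        # self.all_gather collects z from every process. With DDP it
        # returns a tensor of shape (world_size, batch, dim).
        # The question: should sync_grads be True or False here,
        # given that the loss is computed on the gathered result?
        z_all = self.all_gather(z, sync_grads=True)
        z_all = z_all.flatten(0, 1)  # merge the world_size and batch dims

        # toy loss over the gathered tensors (stand-in for the real loss)
        loss = F.mse_loss(z_all, torch.zeros_like(z_all))
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```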