Adjacent Leader Decentralized Stochastic Gradient Descent
Published in ICML, to be submitted, 2022
Elastic Averaging SGD (EASGD) and Leader Stochastic Gradient Descent (LSGD) accelerate the convergence of centralized distributed SGD, yielding faster training in terms of both wall-clock time and number of epochs. However, neither algorithm can be applied to state-of-the-art decentralized distributed SGD frameworks, which alleviate communication congestion by abandoning the centralized parameter server. In this paper, we propose Adjacent Leader Decentralized Gradient Descent (AL-DSGD), which accelerates the convergence of state-of-the-art decentralized frameworks. The main idea of AL-DSGD is to assign weights to neighboring learners according to their performance when averaging, and to apply a corrective force dictated by the currently best-performing neighbor during training. A convergence analysis demonstrates the faster convergence. Experiments on a suite of datasets and deep neural networks validate the theoretical analysis and show that AL-DSGD speeds up training and accelerates convergence. Finally, we develop a general and concise distributed training framework in PyTorch that makes it easy to implement distributed machine learning systems (synchronous or asynchronous, centralized or decentralized distributed SGD).
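The two ingredients named in the abstract, performance-weighted neighbor averaging and a corrective pull toward the best-performing neighbor, can be illustrated with a small NumPy sketch. The abstract does not give the update rule, so everything specific here is an assumption made for illustration: the softmax-over-negative-loss weighting, the `pull` coefficient, the inclusion of each worker in its own neighborhood, and the function names `neighbor_weights` and `al_dsgd_step`. It should not be read as the authors' implementation.

```python
import numpy as np


def neighbor_weights(losses, temperature=1.0):
    """Assumed weighting rule: softmax over negative losses,
    so a lower loss yields a larger averaging weight."""
    scores = -np.asarray(losses, dtype=float) / temperature
    scores -= scores.max()  # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()


def al_dsgd_step(params, grads, losses, neighbors, lr=0.1, pull=0.5):
    """One synchronous update of all workers (illustrative only).

    params, grads: arrays of shape (n_workers, dim)
    losses:        array of shape (n_workers,)
    neighbors:     dict mapping a worker index to its adjacent workers
                   (the worker itself is added to its own group here)
    """
    params = np.asarray(params, dtype=float)
    grads = np.asarray(grads, dtype=float)
    losses = np.asarray(losses, dtype=float)
    new_params = np.empty_like(params)

    for i in range(len(params)):
        group = [i] + list(neighbors[i])

        # Performance-weighted average over the local neighborhood.
        w = neighbor_weights(losses[group])
        averaged = w @ params[group]

        # Local leader: the lowest-loss member of the neighborhood.
        leader = group[int(np.argmin(losses[group]))]

        # Corrective force pulling worker i toward its local leader.
        correction = pull * (params[i] - params[leader])

        new_params[i] = averaged - lr * (grads[i] + correction)

    return new_params
```

In this sketch, a worker that is already the best in its neighborhood receives no corrective pull, and workers that agree exactly and have zero gradients stay where they are, so consensus is a fixed point of the update.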