This work develops a fully decentralized variance-reduced learning algorithm for on-device intelligence where nodes store and process the data locally and are only allowed to communicate with their immediate neighbors. In the proposed algorithm, there is no need for a central or master unit while the objective is to enable the dispersed nodes to learn the \em exact global model despite their limited localized interactions. The resulting algorithm is shown to have low memory requirement, guaranteed linear convergence, robustness to failure of links or nodes, scalability to the network size, and privacy-preserving properties. Moreover, the decentralized nature of the solution makes large-scale machine learning problems more tractable and also scalable since data is stored and processed locally at the nodes.
Several useful variance-reduced stochastic gradient algorithms, such as SVRG, SAGA, Finito, and SAG, have been proposed to minimize empirical risks with linear convergence properties to the exact minimizers. The existing convergence results assume uniform data sampling with replacement. However, it has been observed that random reshuffling can deliver superior performance. No formal proofs or guarantees of exact convergence exist for variance-reduced algorithms under random reshuffling. This paper resolves this open convergence issue and provides the first theoretical guarantee of linear convergence under random reshuffling for SAGA; the argument is also adaptable to other variance-reduced algorithms. Under random reshuffling, the paper further proposes a new amortized variance-reduced gradient (AVRG) algorithm with constant storage requirements compared to SAGA and with balanced gradient computations compared to SVRG. The balancing in computations are attained by amortizing the full gradient calculation across all iterations. AVRG is also shown analytically to converge linearly.
The analysis in Part I revealed interesting properties for subgradient learning algorithms in the context of stochastic optimization when gradient noise is present. These algorithms are used when the risk functions are non-smooth and involve non-differentiable components. They have been long recognized as being slow converging methods. However, it was revealed in Part I that the rate of convergence becomes linear for stochastic optimization problems, with the error iterate converging at an exponential rate $\alpha^i$ to within an $O(\mu)-$neighborhood of the optimizer, for some $\alpha \in (0,1)$ and small step-size $\mu$. The conclusion was established under weaker assumptions than the prior literature and, moreover, several important problems (such as LASSO, SVM, and Total Variation) were shown to satisfy these weaker assumptions automatically (but not the previously used conditions from the literature). These results revealed that sub-gradient learning methods have more favorable behavior than originally thought when used to enable continuous adaptation and learning. The results of Part I were exclusive to single-agent adaptation. The purpose of the current Part II is to examine the implications of these discoveries when a collection of networked agents employs subgradient learning as their cooperative mechanism. The analysis will show that, despite the coupled dynamics that arises in a networked scenario, the agents are still able to attain linear convergence in the stochastic case; they are also able to reach agreement within $O(\mu)$ of the optimizer.
Feb 20 2017 math.OC
Part I of this work developed the exact diffusion algorithm to remove the bias that is characteristic of distributed solutions for deterministic optimization problems. The algorithm was shown to be applicable to a larger set of combination policies than earlier approaches in the literature. In particular, the combination matrices are not required to be doubly stochastic or right-stochastic, which impose stringent conditions on the graph topology and communications protocol. In this Part II, we examine the convergence and stability properties of exact diffusion in some detail and establish its linear convergence rate. We also show that it has a wider stability range than the EXTRA consensus solution, meaning that it is stable for a wider range of step-sizes and can, therefore, attain faster convergence rates. Analytical examples and numerical simulations illustrate the theoretical findings.
Feb 20 2017 math.OC
This work develops a distributed optimization strategy with guaranteed exact convergence for a broad class of left-stochastic combination policies. The resulting exact diffusion strategy is shown in Part II to have a wider stability range and superior convergence performance than the EXTRA strategy. The exact diffusion solution is applicable to non-symmetric left-stochastic combination matrices, while many earlier developments on exact consensus implementations are limited to doubly-stochastic matrices; these latter matrices impose stringent constraints on the network topology. The derivation of the exact diffusion strategy in this work relies on reformulating the aggregate optimization problem as a penalized problem and resorting to a diagonally-weighted incremental construction. Detailed stability and convergence analyses are pursued in Part II and are facilitated by examining the evolution of the error dynamics in a transformed domain. Numerical simulations illustrate the theoretical conclusions.
The article examines in some detail the convergence rate and mean-square-error performance of momentum stochastic gradient methods in the constant step-size and slow adaptation regime. The results establish that momentum methods are equivalent to the standard stochastic gradient method with a re-scaled (larger) step-size value. The size of the re-scaling is determined by the value of the momentum parameter. The equivalence result is established for all time instants and not only in steady-state. The analysis is carried out for general strongly convex and smooth risk functions, and is not limited to quadratic risks. One notable conclusion is that the well-known bene ts of momentum constructions for deterministic optimization problems do not necessarily carry over to the adaptive online setting when small constant step-sizes are used to enable continuous adaptation and learn- ing in the presence of persistent gradient noise. From simulations, the equivalence between momentum and standard stochastic gradient methods is also observed for non-differentiable and non-convex problems.
The stochastic dual coordinate-ascent (S-DCA) technique is a useful alternative to the traditional stochastic gradient-descent algorithm for solving large-scale optimization problems due to its scalability to large data sets and strong theoretical guarantees. However, the available S-DCA formulation is limited to finite sample sizes and relies on performing multiple passes over the same data. This formulation is not well-suited for online implementations where data keep streaming in. In this work, we develop an \em online dual coordinate-ascent (O-DCA) algorithm that is able to respond to streaming data and does not need to revisit the past data. This feature embeds the resulting construction with continuous adaptation, learning, and tracking abilities, which are particularly attractive for online learning scenarios.
The paper examines the learning mechanism of adaptive agents over weakly-connected graphs and reveals an interesting behavior on how information flows through such topologies. The results clarify how asymmetries in the exchange of data can mask local information at certain agents and make them totally dependent on other agents. A leader-follower relationship develops with the performance of some agents being fully determined by the performance of other agents that are outside their domain of influence. This scenario can arise, for example, due to intruder attacks by malicious agents or as the result of failures by some critical links. The findings in this work help explain why strong-connectivity of the network topology, adaptation of the combination weights, and clustering of agents are important ingredients to equalize the learning abilities of all agents against such disturbances. The results also clarify how weak-connectivity can be helpful in reducing the effect of outlier data on learning performance.