results for au:Lan_G in:cs

- The popular cubic regularization (CR) method converges with first- and second-order optimality guarantee for nonconvex optimization, but encounters a high sample complexity issue for solving large-scale problems. Various sub-sampling variants of CR have been proposed to improve the sample complexity.In this paper, we propose a stochastic variance-reduced cubic-regularized (SVRC) Newton's method under both sampling with and without replacement schemes. We characterize the per-iteration sample complexity bounds which guarantee the same rate of convergence as that of CR for nonconvex optimization. Furthermore, our method achieves a total Hessian sample complexity of $\mathcal{O}(N^{8/11} \epsilon^{-3/2})$ and $\mathcal{O}(N^{3/4}\epsilon^{-3/2})$ respectively under sampling without and with replacement, which improve that of CR as well as other sub-sampling variant methods via the variance reduction scheme. Our result also suggests that sampling without replacement yields lower sample complexity than that of sampling with replacement. We further compare the practical performance of SVRC with other cubic regularization methods via experiments.
- In this paper, we consider a class of finite-sum convex optimization problems defined over a distributed multiagent network with $m$ agents connected to a central server. In particular, the objective function consists of the average of $m$ ($\ge 1$) smooth components associated with each network agent together with a strongly convex term. Our major contribution is to develop a new randomized incremental gradient algorithm, namely random gradient extrapolation method (RGEM), which does not require any exact gradient evaluation even for the initial point, but can achieve the optimal ${\cal O}(\log(1/\epsilon))$ complexity bound in terms of the total number of gradient evaluations of component functions to solve the finite-sum problems. Furthermore, we demonstrate that for stochastic finite-sum optimization problems, RGEM maintains the optimal ${\cal O}(1/\epsilon)$ complexity (up to a certain logarithmic factor) in terms of the number of stochastic gradient computations, but attains an ${\cal O}(\log(1/\epsilon))$ complexity in terms of communication rounds (each round involves only one agent). It is worth noting that the former bound is independent of the number of agents $m$, while the latter one only linearly depends on $m$ or even $\sqrt m$ for ill-conditioned problems. To the best of our knowledge, this is the first time that these complexity bounds have been obtained for distributed and stochastic optimization problems. Moreover, our algorithms were developed based on a novel dual perspective of Nesterov's accelerated gradient method.
- Nov 01 2017 cs.LG arXiv:1710.11241v1In this paper, we study the problem of optimizing a two-layer artificial neural network that best fits a training dataset. We look at this problem in the setting where the number of parameters is greater than the number of sampled points. We show that for a wide class of differentiable activation functions (this class involves "almost" all functions which are not piecewise linear), we have that first-order optimal solutions satisfy global optimality provided the hidden layer is non-singular. Our results are easily extended to hidden layers given by a flat matrix from that of a square matrix. Results are applicable even if network has more than one hidden layer provided all hidden layers satisfy non-singularity, all activations are from the given "good" class of differentiable functions and optimization is only with respect to the last hidden layer. We also study the smoothness properties of the objective function and show that it is actually Lipschitz smooth, i.e., its gradients do not change sharply. We use smoothness properties to guarantee asymptotic convergence of O(1/number of iterations) to a first-order optimal solution. We also show that our algorithm will maintain non-singularity of hidden layer for any finite number of iterations.
- In this paper, we consider multi-stage stochastic optimization problems with convex objectives and conic constraints at each stage. We present a new stochastic first-order method, namely the dynamic stochastic approximation (DSA) algorithm, for solving these types of stochastic optimization problems. We show that DSA can achieve an optimal ${\cal O}(1/\epsilon^4)$ rate of convergence in terms of the total number of required scenarios when applied to a three-stage stochastic optimization problem. We further show that this rate of convergence can be improved to ${\cal O}(1/\epsilon^2)$ when the objective function is strongly convex. We also discuss variants of DSA for solving more general multi-stage stochastic optimization problems with the number of stages $T > 3$. The developed DSA algorithms only need to go through the scenario tree once in order to compute an $\epsilon$-solution of the multi-stage stochastic optimization problem. To the best of our knowledge, this is the first time that stochastic approximation type methods are generalized for multi-stage stochastic optimization with $T \ge 3$.
- In this work we introduce a conditional accelerated lazy stochastic gradient descent algorithm with optimal number of calls to a stochastic first-order oracle and convergence rate $O\left(\frac{1}{\varepsilon^2}\right)$ improving over the projection-free, Online Frank-Wolfe based stochastic gradient descent of Hazan and Kale [2012] with convergence rate $O\left(\frac{1}{\varepsilon^4}\right)$.
- We present a new class of decentralized first-order methods for nonsmooth and stochastic optimization problems defined over multiagent networks. Considering that communication is a major bottleneck in decentralized optimization, our main goal in this paper is to develop algorithmic frameworks which can significantly reduce the number of inter-node communications. We first propose a decentralized primal-dual method which can find an $\epsilon$-solution both in terms of functional optimality gap and feasibility residual in $O(1/\epsilon)$ inter-node communication rounds when the objective functions are convex and the local primal subproblems are solved exactly. Our major contribution is to present a new class of decentralized primal-dual type algorithms, namely the decentralized communication sliding (DCS) methods, which can skip the inter-node communications while agents solve the primal subproblems iteratively through linearizations of their local objective functions. By employing DCS, agents can still find an $\epsilon$-solution in $O(1/\epsilon)$ (resp., $O(1/\sqrt{\epsilon})$) communication rounds for general convex functions (resp., strongly convex functions), while maintaining the $O(1/\epsilon^2)$ (resp., $O(1/\epsilon)$) bound on the total number of intra-node subgradient evaluations. We also present a stochastic counterpart for these algorithms, denoted by SDCS, for solving stochastic optimization problems whose objective function cannot be evaluated exactly. In comparison with existing results for decentralized nonsmooth and stochastic optimization, we can reduce the total number of inter-node communication rounds by orders of magnitude while still maintaining the optimal complexity bounds on intra-node stochastic subgradient evaluations. The bounds on the subgradient evaluations are actually comparable to those required for centralized nonsmooth and stochastic optimization.