Currently, deep neural networks are deployed on low-power embedded devices by first training a full-precision model using powerful computing hardware, and then deriving a corresponding low-precision model for efficient inference on such systems. However, training models directly with coarsely quantized weights is a key step towards learning on embedded platforms that have limited computing resources, memory capacity, and power consumption. Numerous recent publications have studied methods for training quantized networks, but these studies have mostly been empirical. In this work, we investigate training methods for quantized neural networks from a theoretical viewpoint. We first explore accuracy guarantees for training methods under convexity assumptions. We then look at the behavior of algorithms for non-convex problems, and we show that training algorithms that exploit high-precision representations have an important annealing property that purely quantized training methods lack, which explains many of the observed empirical differences between these types of algorithms.
We propose a method to optimize the representation and distinguishability of samples from two probability distributions, by maximizing the estimated power of a statistical test based on the maximum mean discrepancy (MMD). This optimized MMD is applied to the setting of unsupervised learning by generative adversarial networks (GAN), in which a model attempts to generate realistic samples, and a discriminator attempts to tell these apart from data samples. In this context, the MMD may be used in two roles: first, as a discriminator, either directly on the samples, or on features of the samples. Second, the MMD can be used to evaluate the performance of a generative model, by testing the model's samples against a reference data set. In the latter role, the optimized MMD is particularly helpful, as it gives an interpretable indication of how the model and data distributions differ, even in cases where individual model samples are not easily distinguished either by eye or by classifier.
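As a point of reference for the quantity discussed above, the following is a minimal sketch of a squared-MMD estimate between two samples. It assumes a Gaussian kernel with a fixed, hand-picked `bandwidth` and uses the simple biased (V-statistic) estimator; the function names are hypothetical, and the kernel optimization for test power that the abstract describes is not shown.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y), including x == y."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    Assumption: fixed Gaussian kernel; the bandwidth is not tuned here.
    """
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


if __name__ == "__main__":
    model_samples = [(0.0,), (1.0,), (2.0,)]
    data_samples = [(5.0,), (6.0,), (7.0,)]
    print(mmd2_biased(model_samples, model_samples))  # identical samples
    print(mmd2_biased(model_samples, data_samples))   # well-separated samples
```

The estimate is zero when both arguments are the same sample and grows as the two samples separate, which is the behavior a two-sample test built on the MMD relies on.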
Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients make it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative "big batch" SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting methods have similar convergence rates to classical SGD, and do not require convexity of the objective. The high-fidelity gradients enable automated learning rate selection and do not require stepsize decay. Big batch methods are thus easily automated and can run with little or no oversight.
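The idea of growing the batch to hold the signal-to-noise ratio roughly constant can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the toy objective (mean of `0.5 * (w - a)**2` over scalar data), the acceptance test comparing the squared batch gradient with its estimated variance, the threshold `theta`, the doubling rule, and the fixed learning rate.

```python
import random


def big_batch_sgd(data, w0=0.0, lr=0.5, batch0=4, theta=1.0, steps=50, seed=0):
    """Toy SGD on f(w) = mean(0.5 * (w - a)^2) with an adaptively grown batch.

    Assumed rule: accept a batch only if the squared batch gradient is at
    least theta times the estimated variance of that batch gradient;
    otherwise double the batch size (capped at len(data)) and resample.
    """
    rng = random.Random(seed)
    w, batch = w0, batch0
    history = []
    for _ in range(steps):
        while True:
            sample = [rng.choice(data) for _ in range(batch)]
            grads = [w - a for a in sample]  # per-sample gradients
            g = sum(grads) / batch           # batch gradient (signal)
            var = sum((gi - g) ** 2 for gi in grads) / (batch - 1)
            noise = var / batch              # variance of the batch mean
            if g * g >= theta * noise or batch >= len(data):
                break
            batch = min(2 * batch, len(data))
        w -= lr * g
        history.append(batch)
    return w, history


if __name__ == "__main__":
    gen = random.Random(1)
    data = [gen.uniform(-1.0, 1.0) for _ in range(256)]
    w, history = big_batch_sgd(data, w0=10.0)
    print(w, history[0], history[-1])
```

Far from the minimizer the gradient signal dominates and small batches pass the test; near the minimizer the signal shrinks, so the batch grows, which is the mechanism the abstract credits for avoiding stepsize decay.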
Stochastic Gradient Descent (SGD) has become one of the most popular optimization methods for training machine learning models on massive datasets. However, SGD suffers from two main drawbacks: (i) the noisy gradient updates have high variance, which slows down convergence as the iterates approach the optimum, and (ii) SGD scales poorly in distributed settings, typically experiencing rapidly decreasing marginal benefits as the number of workers increases. In this paper, we propose a highly parallel method, CentralVR, that uses error corrections to reduce the variance of SGD gradient updates, and scales linearly with the number of worker nodes. CentralVR enjoys low iteration complexity and provably linear convergence rates, and exhibits linear performance gains up to hundreds of cores for massive datasets. We compare CentralVR to state-of-the-art parallel stochastic optimization methods on a variety of models and datasets, and find that our proposed methods exhibit stronger scaling than other SGD variants.
Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance-reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical distributed settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants that are scalable and remain stable with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms compare favorably with other stochastic methods.
Mar 30 2015 stat.ME
This paper develops a theory and methodology for estimation of the Gini index such that both the cost of sampling and the estimation error are minimized. Methods in which the sample size is fixed in advance cannot minimize estimation error and sampling cost at the same time. In this article, a purely sequential procedure is proposed which provides an estimate of the sample size required to achieve a sufficiently small estimation error and a lower sampling cost. Characteristics of the purely sequential procedure are examined and asymptotic optimality properties are proved without assuming any specific distribution of the data. The performance of our method is examined through an extensive simulation study.
Mar 30 2015 stat.ME
The Gini index is a widely used measure of economic inequality. This article develops a general theory for constructing a confidence interval for the Gini index with a specified confidence coefficient and a specified width. Fixed sample size methods cannot simultaneously achieve both the specified confidence coefficient and the specified width. We develop a purely sequential procedure for interval estimation of the Gini index with a specified confidence coefficient and a fixed margin of error. Optimality properties of the proposed method, namely first-order asymptotic efficiency and asymptotic consistency, are proved. All theoretical results are derived without assuming any specific distribution of the data.
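For readers unfamiliar with the quantity being estimated, here is a minimal sketch of a distribution-free point estimate of the Gini index. It assumes the common form "mean absolute difference over all distinct pairs, divided by twice the sample mean"; the abstract does not state which estimator the authors use, and the function name is hypothetical. The sequential stopping rule and the confidence interval construction are not shown.

```python
def gini_index(values):
    """Sample Gini index: pairwise mean absolute difference / (2 * mean).

    Assumption: non-negative data with a positive mean and at least two
    observations. Runs in O(n^2) time, which is fine for illustration.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    if mean <= 0:
        raise ValueError("sample mean must be positive")
    # Sum of |x_i - x_j| over unordered pairs i < j.
    pair_sum = sum(
        abs(values[i] - values[j]) for i in range(n) for j in range(i + 1, n)
    )
    mean_abs_diff = pair_sum / (n * (n - 1) / 2)
    return mean_abs_diff / (2.0 * mean)


if __name__ == "__main__":
    print(gini_index([3.0, 3.0, 3.0, 3.0]))  # equal incomes
    print(gini_index([0.0, 0.0, 0.0, 1.0]))  # one person holds everything
    print(gini_index([1.0, 2.0, 3.0]))
```

Under this form, equal incomes give 0 and a sample in which one observation holds the entire total gives 1, so the estimate stays on the usual [0, 1] scale.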