results for au:Zheng_S in:stat

- We propose and study the problem of generative multi-agent behavioral cloning, where the goal is to learn a generative multi-agent policy from pre-collected demonstration data. Building upon advances in deep generative models, we present a hierarchical policy framework that can tractably learn complex mappings from input states to distributions over multi-agent action spaces. Our framework is flexible and can incorporate high-level domain knowledge into the structure of the underlying deep graphical model. For instance, we can effectively learn low-dimensional structures, such as long-term goals and team coordination, from data. Thus, an additional benefit of our hierarchical approach is the ability to plan over multiple time scales for effective long-term planning. We showcase our approach in an application of modeling team offensive play from basketball tracking data. We show how to instantiate our framework to effectively model complex interactions between basketball players and generate realistic multi-agent trajectories of basketball gameplay over long time periods. We validate our approach using both quantitative and qualitative evaluations, including a user study comparison conducted with professional sports analysts.
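The two-timescale idea above (long-term goals guiding per-step actions) can be sketched in a toy form. The goal set, the nearest-goal macro policy, and the noise level below are invented placeholders for illustration only; the paper learns both levels as deep generative models from tracking data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical long-term goal locations (e.g. regions of a court).
GOALS = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])

def macro_policy(state):
    # Pick the nearest goal region; a learned model would predict this.
    d = np.linalg.norm(GOALS - state, axis=1)
    return GOALS[d.argmin()]

def micro_policy(state, goal, noise=0.1):
    # Step toward the active goal; noise stands in for a generative model.
    direction = goal - state
    step = direction / (np.linalg.norm(direction) + 1e-9)
    return step + noise * rng.normal(size=2)

state = np.array([4.0, 6.0])
traj = [state]
for _ in range(20):
    goal = macro_policy(traj[-1])  # replanned each step; could run on a slower timescale
    traj.append(traj[-1] + micro_policy(traj[-1], goal))

# The rollout drifts toward its macro goal at [5, 8]
assert np.linalg.norm(traj[-1] - GOALS[2]) < 2.0
```

The point of the hierarchy is that the macro level can be queried on a slower timescale than the micro level, which is what enables planning over long horizons.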
- It has been widely observed that many activation functions and pooling methods in neural network models have a (positive) rescaling-invariant property, including ReLU, PReLU, max-pooling, and average pooling, which makes fully-connected neural networks (FNNs) and convolutional neural networks (CNNs) invariant to (positive) rescaling operations across layers. This may cause non-negligible problems for their optimization: (1) different NN models can be equivalent, yet their gradients can differ greatly from each other; (2) it can be proven that the loss functions may have many spurious critical points in the redundant weight space. To tackle these problems, in this paper, we first characterize the rescaling-invariant properties of NN models using equivalence classes and prove that the dimension of the equivalence class space is significantly smaller than the dimension of the original weight space. Then we represent the loss function in the compact equivalence class space and develop novel algorithms that optimize the NN models directly in that space. We call these algorithms Equivalence Class Optimization (abbreviated as EC-Opt) algorithms. Moreover, we design efficient tricks to compute the gradients in the equivalence class, which incur almost no extra computational cost compared to standard back-propagation (BP). We conducted experimental studies to demonstrate the effectiveness of the proposed optimization algorithms. In particular, we show that by using the idea of EC-Opt, we can significantly improve the accuracy of the learned model (for both FNNs and CNNs) compared to conventional stochastic gradient descent algorithms.
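The rescaling invariance at issue can be verified directly. A minimal numpy sketch with an arbitrary two-layer ReLU network (weights and input are random; only the invariance property itself comes from the abstract):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
# Toy two-layer fully-connected network: x -> relu(W1 x) -> W2 h
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

out = W2 @ relu(W1 @ x)

# Positive rescaling: multiply one layer by c > 0 and divide the next by c.
# Because relu(c * z) = c * relu(z) for c > 0, the network output is unchanged,
# so (W1, W2) and (c*W1, W2/c) lie in the same equivalence class.
c = 7.3
out_rescaled = (W2 / c) @ relu((W1 * c) @ x)

assert np.allclose(out, out_rescaled)
```

Although the two weight settings define the same function, their gradients with respect to the weights differ, which is exactly the optimization pathology the abstract describes.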
- Jan 30 2018 stat.ML arXiv:1801.09150v1. Modern vehicles are equipped with increasingly complex sensors. These sensors generate large volumes of data that provide opportunities for modeling and analysis. Here, we are interested in exploiting this data to learn aspects of behaviors and the road network associated with individual drivers. Our dataset is collected on a standard vehicle used to commute to work and for personal trips. A Hidden Markov Model (HMM) trained on the GPS position and orientation data is utilized to compress the large amount of position information into a small number of road segment states. Each state has a set of observations, i.e. car signals, associated with it that are quantized and modeled as draws from a Hierarchical Dirichlet Process (HDP). The inference for the topic distributions is carried out using an HDP split-merge sampling algorithm. The topic distributions over the jointly quantized car signals characterize the driving situation in the respective road state. In a novel manner, we demonstrate how the sparsity of a driver's personal road network, in conjunction with a hierarchical topic model, allows data-driven predictions about destinations as well as likely road conditions.
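The HMM compression step can be illustrated with a tiny discrete example: Viterbi decoding maps a long observation stream onto a short sequence of hidden "road segment" states. All transition and emission numbers below are made up; the paper trains its HMM on real GPS position and orientation data.

```python
import numpy as np

# Toy discrete HMM: hidden road-segment states, observed quantized GPS symbols.
A = np.array([[0.9, 0.1, 0.0],   # transitions: road segments mostly persist
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
B = np.array([[0.8, 0.1, 0.1],   # emissions: symbol probability given segment
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
pi = np.array([1.0, 0.0, 0.0])

obs = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Viterbi in log space
logA, logB = np.log(A + 1e-12), np.log(B + 1e-12)
delta = np.log(pi + 1e-12) + logB[:, obs[0]]
back = []
for o in obs[1:]:
    trans = delta[:, None] + logA    # trans[i, j]: best score ending i -> j
    back.append(trans.argmax(axis=0))
    delta = trans.max(axis=0) + logB[:, o]

# Backtrack the most likely state sequence
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()

assert path == [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

In the paper's setting the decoded state sequence plays the role of the compressed road-segment representation on which the HDP topic model is then built.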
- Mar 06 2017 stat.ME arXiv:1703.01102v1. The multivariate nonlinear Granger causality test developed by Bai et al. (2010) plays an important role in detecting dynamic interrelationships between two groups of variables. Following the idea of the Hiemstra-Jones (HJ) test proposed by Hiemstra and Jones (1994), they attempt to establish a central limit theorem (CLT) for their test statistic by applying the asymptotic properties of multivariate $U$-statistics. However, Bai et al. (2016) revisit the HJ test and find that the test statistic given by HJ is not a function of $U$-statistics, which implies that neither the CLT proposed by Hiemstra and Jones (1994) nor its extension by Bai et al. (2010) is valid for statistical inference. In this paper, we re-estimate the probabilities and re-establish the CLT of the new test statistic. Numerical simulations show that our new estimates are consistent and that our new test delivers decent size and power.
- The alternating direction method of multipliers (ADMM) is a powerful optimization solver in machine learning. Recently, stochastic ADMM has been integrated with variance reduction methods for stochastic gradient, leading to SAG-ADMM and SDCA-ADMM that have fast convergence rates and low iteration complexities. However, their space requirements can still be high. In this paper, we propose an integration of ADMM with the method of stochastic variance reduced gradient (SVRG). Unlike another recent integration attempt called SCAS-ADMM, the proposed algorithm retains the fast convergence benefits of SAG-ADMM and SDCA-ADMM, but is more advantageous in that its storage requirement is very low, even independent of the sample size $n$. We also extend the proposed method for nonconvex problems, and obtain a convergence rate of $O(1/T)$. Experimental results demonstrate that it is as fast as SAG-ADMM and SDCA-ADMM, much faster than SCAS-ADMM, and can be used on much bigger data sets.
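The SVRG estimator at the heart of this integration can be sketched on a plain least-squares problem. This omits the ADMM splitting entirely and only illustrates the variance-reduced gradient $v = \nabla f_i(w) - \nabla f_i(\tilde w) + \nabla F(\tilde w)$ around a periodically refreshed snapshot $\tilde w$; problem sizes and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th least-squares term 0.5 * (x_i . w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / n

# One SVRG run: fix a snapshot w_tilde each epoch, precompute its full
# gradient once, then take variance-reduced stochastic steps.
w = np.zeros(d)
step = 0.01
for epoch in range(30):
    w_tilde = w.copy()
    mu = full_grad(w_tilde)
    for _ in range(n):
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_tilde, i) + mu  # variance-reduced gradient
        w -= step * v

assert np.linalg.norm(w - w_true) < 0.1
```

The storage advantage claimed in the abstract comes from this structure: only the snapshot and its full gradient are kept, rather than per-sample gradients as in SAG-ADMM or SDCA-ADMM.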
- In modern large-scale machine learning applications, the training data are often partitioned and stored on multiple machines. It is customary to employ the "data parallelism" approach, where the aggregated training loss is minimized without moving data across machines. In this paper, we introduce a novel distributed dual formulation for regularized loss minimization problems that can directly handle data parallelism in the distributed setting. This formulation allows us to systematically derive dual coordinate optimization procedures, which we refer to as Distributed Alternating Dual Maximization (DADM). The framework extends earlier studies (Boyd et al., 2011; Ma et al., 2015a; Jaggi et al., 2014; Yang, 2013) and has rigorous theoretical analyses. Moreover, with the help of the new formulation, we develop an accelerated version of DADM (Acc-DADM) by generalizing the acceleration technique of Shalev-Shwartz and Zhang (2014) to the distributed setting. We also provide theoretical results for the proposed accelerated version; the new result improves on previous ones (Yang, 2013; Ma et al., 2015a), whose runtimes grow linearly with the condition number. Our empirical studies validate our theory and show that our accelerated approach significantly improves upon the previous state-of-the-art distributed dual coordinate optimization algorithms.
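The general pattern of dual coordinate optimization over partitioned data can be sketched with standard SDCA updates for ridge regression. This toy single-process simulation walks the partitions sequentially and is emphatically not the paper's DADM algorithm; it only shows the dual-variable/primal-aggregate bookkeeping that such methods share.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=n)

# Rows partitioned across 4 simulated "machines"; each owns a block of the
# dual vector alpha. The primal iterate satisfies w = X^T alpha / (lam * n).
parts = np.array_split(np.arange(n), 4)
alpha = np.zeros(n)
w = np.zeros(d)

for rnd in range(200):
    for idx in parts:          # a real system would overlap these blocks
        for i in idx:
            # Exact coordinate maximization of the ridge dual in alpha_i
            da = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
            alpha[i] += da
            w += da * X[i] / (lam * n)   # keep primal consistent with alpha

# Compare with the closed-form ridge solution of
# min (1/2n) ||y - Xw||^2 + (lam/2) ||w||^2
w_exact = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
assert np.linalg.norm(w - w_exact) < 1e-3
```

The distributed question DADM addresses is precisely how to run such block updates in parallel and aggregate them with provable rates, which the sequential loop above sidesteps.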
- Random Fisher matrices arise naturally in multivariate statistical analysis, and understanding the properties of their eigenvalues is of primary importance for many hypothesis testing problems, such as testing the equality of two multivariate population covariance matrices or testing the independence between sub-groups of a multivariate random vector. This paper is concerned with the properties of a large-dimensional Fisher matrix when the dimension of the population is proportionally large compared to the sample size. Most existing work on Fisher matrices deals with a particular Fisher matrix whose populations have i.i.d. components, so that the population covariance matrices are both the identity. In this paper, we consider general Fisher matrices with arbitrary population covariance matrices. The first main result of the paper establishes the limiting distribution of the eigenvalues of a Fisher matrix, while in a second main result we provide a central limit theorem for a wide class of functionals of its eigenvalues. Applications of these results are also proposed for testing hypotheses on high-dimensional covariance matrices.
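A quick numerical illustration of the high-dimensional effect: even when the two populations share the same (identity) covariance, the eigenvalues of the Fisher matrix spread well away from 1 once the dimension is proportional to the sample sizes. The dimensions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 50, 200, 300

# Two independent samples from the same population (identity covariance).
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(size=(n2, p))
S1 = X1.T @ X1 / n1
S2 = X2.T @ X2 / n2

# Fisher matrix F = S1 S2^{-1}; its eigenvalues drive two-sample covariance tests.
F = S1 @ np.linalg.inv(S2)
eigs = np.linalg.eigvals(F).real

# Under equal covariances with p fixed, the eigenvalues would concentrate
# near 1; with p proportional to n1, n2 they spread over a bulk instead.
assert eigs.min() < 0.9 and eigs.max() > 1.1
```

Classical fixed-dimension tests that implicitly expect eigenvalues near 1 are therefore miscalibrated here, which is what the paper's limiting spectral distribution and CLT for eigenvalue functionals are designed to correct.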
- Apr 29 2014 stat.ME arXiv:1404.6633v1. Sample covariance matrices are widely used in multivariate statistical analysis. The central limit theorems (CLTs) for linear spectral statistics of high-dimensional non-centered sample covariance matrices have received considerable attention in random matrix theory and have been applied to many high-dimensional statistical problems. However, these results assume known population mean vectors for the non-centered sample covariance matrices, and some even assume Gaussian-like moment conditions. In fact, two other frequently used sample covariance matrices remain: the MLE (obtained by subtracting the sample mean vector from each sample vector) and the unbiased sample covariance matrix (obtained by replacing the denominator $n$ of the MLE with $N=n-1$), neither of which depends on the unknown population mean vector. In this paper, we not only establish new CLTs for non-centered sample covariance matrices without Gaussian-like moment conditions but also characterize the non-negligible differences among the CLTs for the three classes of high-dimensional sample covariance matrices by establishing a substitution principle: substitute the adjusted sample size $N=n-1$ for the actual sample size $n$ in the major centering term of the new CLTs to obtain the CLT for the unbiased sample covariance matrix. Moreover, we find that the difference between the CLTs for the MLE and the unbiased sample covariance matrix is non-negligible in the major centering term, even though the two estimators differ only by the denominators $n$ and $n-1$. The new results are applied to two testing problems for high-dimensional data.
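The three estimators discussed above are easy to write down side by side. The sketch below only constructs the MLE and the unbiased sample covariance matrix and checks their $n$ versus $N=n-1$ relationship; the substitution principle itself concerns the centering terms of their CLTs, which this snippet does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(loc=2.0, size=(n, p))  # non-zero, unknown population mean

xbar = X.mean(axis=0)
centered = X - xbar

S_mle = centered.T @ centered / n             # MLE: denominator n
S_unbiased = centered.T @ centered / (n - 1)  # unbiased: denominator N = n - 1

# The estimators differ only by the scalar factor n / (n - 1) ...
assert np.allclose(S_unbiased, S_mle * n / (n - 1))
```

... yet, as the abstract notes, this seemingly innocuous rescaling produces a non-negligible difference in the major centering term of the corresponding high-dimensional CLTs.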
- Jun 13 2013 stat.ME arXiv:1306.2791v1. Panel studies typically suffer from attrition, which reduces sample size and can result in biased inferences. It is impossible to know whether or not the attrition causes bias from the observed panel data alone. Refreshment samples - new, randomly sampled respondents given the questionnaire at the same time as a subsequent wave of the panel - offer information that can be used to diagnose and adjust for bias due to attrition. We review and bolster the case for the use of refreshment samples in panel studies. We include examples of both a fully Bayesian approach for analyzing the concatenated panel and refreshment data, and a multiple imputation approach for analyzing only the original panel. For the latter, we document a positive bias in the usual multiple imputation variance estimator. We present models appropriate for three waves and two refreshment samples, including nonterminal attrition. We illustrate the three-wave analysis using the 2007-2008 Associated Press-Yahoo! News Election Poll.
- This paper proposes a CLT for linear spectral statistics of the random matrix $S^{-1}T$ for a general non-negative definite and non-random Hermitian matrix $T$.
- The sample covariance matrix and the multivariate $F$-matrix play important roles in multivariate statistical analysis. The central limit theorems (CLTs) of linear spectral statistics associated with these matrices were established in Bai and Silverstein (2004) and Zheng (2012); these results have received considerable attention and have been applied to solve many large-dimensional statistical problems. However, the sample covariance matrices used in these papers are not centralized, and questions remain about the CLTs for centralized sample covariance matrices. In this note, we provide some short complements to the CLTs in Bai and Silverstein (2004) and Zheng (2012), and show that the results in these two papers remain valid for centralized sample covariance matrices, provided that the ratios of the dimension $p$ to the sample sizes $(n,n_1,n_2)$ are redefined as $p/(n-1)$ and $p/(n_i-1)$, $i=1,2$, respectively.
- Large-scale agglomerative clustering is hindered by its computational burden. We propose a novel scheme in which exact inter-instance distance calculation is replaced by the Hamming distance between Kernelized Locality-Sensitive Hashing (KLSH) hashed values. This drastically decreases computation time. Additionally, we take advantage of certain labeled data points via distance metric learning to achieve precision and recall competitive with K-Means in much less computation time.
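The Hamming-distance surrogate can be sketched with plain sign random projections, a simple non-kernelized stand-in for KLSH (the actual method hashes in kernel space). The dimensions and bit count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_bits = 20, 64

# Sign random projections: each random hyperplane contributes one hash bit.
H = rng.normal(size=(d, n_bits))

def hash_code(x):
    return (x @ H > 0).astype(np.uint8)

def hamming(a, b):
    # Hamming distance between codes: the cheap surrogate for the exact
    # inter-instance distance during agglomerative merging.
    return int(np.count_nonzero(a != b))

x = rng.normal(size=d)
x_near = x + 0.01 * rng.normal(size=d)   # near-duplicate of x
x_far = rng.normal(size=d)               # unrelated point

# Near points agree on almost all bits; unrelated points disagree on ~half,
# so Hamming distance preserves the ordering needed for merging decisions.
assert hamming(hash_code(x), hash_code(x_near)) < hamming(hash_code(x), hash_code(x_far))
```

Since Hamming distances over short binary codes are far cheaper than exact distances over raw features, pairwise comparisons during clustering become inexpensive bit operations.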
- For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative requires complex analytic approximations, and more importantly, these distributional approximations are feasible only for moderate dimensions of the dependent variable, say $p\le 20$. On the other hand, assuming that the data dimension $p$ as well as the number $q$ of regression variables are fixed while the sample size $n$ grows, several asymptotic approximations have been proposed in the literature for Wilks' $\Lambda$, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension $p$ and a large sample size $n$. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null, and simulations demonstrate that the corrected LRT has very satisfactory size and power, certainly in the large-$p$, large-$n$ context, but also for moderately large data dimensions such as $p=30$ or $p=50$. As a byproduct, we explain why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple sample significance test in MANOVA which is valid for high-dimensional data.
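For reference, the classical fixed-$p$ procedure that the paper corrects can be sketched for one-way MANOVA: Wilks' $\Lambda$ is the ratio $\det(E)/\det(E+H)$ of within- to total-scatter determinants, paired with Bartlett's chi-square approximation. The group sizes and dimensions below are arbitrary; the paper's high-dimensional correction is not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

def wilks_lambda(groups):
    all_x = np.vstack(groups)
    grand = all_x.mean(axis=0)
    # Within-group (E) and between-group (H) scatter matrices
    E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    H = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    return np.linalg.det(E) / np.linalg.det(E + H)

p, k, m = 4, 3, 60
groups = [rng.normal(size=(m, p)) for _ in range(k)]   # equal means: H0 true
lam = wilks_lambda(groups)

# Bartlett's approximation: -(n - 1 - (p + q + 1)/2) log(Lambda) ~ chi2_{p*q},
# with q = k - 1 hypothesis degrees of freedom; valid for fixed p, growing n.
n, q = k * m, k - 1
stat = -(n - 1 - (p + q + 1) / 2) * np.log(lam)
pval = 1 - stats.chi2.cdf(stat, p * q)
assert 0.0 < lam < 1.0 and stat > 0.0
```

It is this chi-square calibration that degrades as $p$ grows with $n$, motivating the asymptotically Gaussian correction proposed in the paper.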
- In this paper, we explain the failure of two likelihood ratio procedures for testing hypotheses about covariance matrices from Gaussian populations when the dimension is large compared to the sample size. Next, using recent central limit theorems for linear spectral statistics of sample covariance matrices and of random $F$-matrices, we propose necessary corrections to these LR tests to cope with high-dimensional effects. The asymptotic distributions of these corrected tests under the null are given. Simulations demonstrate that the corrected LR tests yield a realized size close to the nominal level for both moderate $p$ (around 20) and high dimensions, while the traditional LR tests with the chi-square approximation fail. Another contribution of the paper is that, for testing the equality of two covariance matrices, the proposed correction applies equally to non-Gaussian populations, yielding a valid pseudo-likelihood ratio test.
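The failure mode is easy to reproduce numerically for the LRT of $H_0: \Sigma = I$ with known zero mean, whose classical statistic is $n(\mathrm{tr}\,S - \log\det S - p) \sim \chi^2_{p(p+1)/2}$ for fixed $p$. The sketch below only exhibits the size distortion; the corrected statistic comes from the paper and is not implemented here. Sample sizes and replication counts are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def lrt_identity(X):
    # Classical LRT statistic for H0: Sigma = I (mean known to be zero):
    # -2 log Lambda = n * (tr S - log det S - p)
    n, p = X.shape
    S = X.T @ X / n
    sign, logdet = np.linalg.slogdet(S)
    return n * (np.trace(S) - logdet - p)

def rejection_rate(n, p, reps=200, level=0.05):
    crit = stats.chi2.ppf(1 - level, p * (p + 1) / 2)
    rejects = sum(lrt_identity(rng.normal(size=(n, p))) > crit
                  for _ in range(reps))
    return rejects / reps

# Data simulated under the null in both cases: the chi-square calibration is
# roughly correct for small p but rejects almost always when p grows with n.
r_small = rejection_rate(n=500, p=5)
r_large = rejection_rate(n=500, p=150)
assert r_small < 0.15 and r_large > 0.9
```

The blow-up occurs because $\log\det S$ acquires a deterministic drift of order $p$ in the proportional regime, inflating the statistic far beyond the chi-square centering, which is what the random-matrix CLTs quantify and remove.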
- Jun 01 2007 stat.ME arXiv:0705.4588v1. We propose a variable selection procedure that incorporates prior constraint information into the lasso. The proposed procedure combines the sample and prior information, and selects significant variables for the responses in a narrower region where the true parameters lie. This increases the efficiency of choosing the true model correctly. The procedure can be executed by many constrained quadratic programming methods, and the initial estimator can be found by least squares or Monte Carlo methods. The procedure also enjoys good theoretical properties. Moreover, it can be used not only for linear models but also for generalized linear models (GLMs), Cox models, quantile regression models, and many others with the help of the least squares approximation (LSA) of Wang and Leng (2007), which recasts these models as approximations of linear models. The idea of combining sample and prior constraint information can also be used for other modified lasso procedures. Several examples illustrate the idea of incorporating prior constraint information in variable selection procedures.
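A minimal sketch of a constrained lasso, assuming the prior constraint is sign information (all coefficients non-negative). Under $\beta \ge 0$ the penalty $|\beta_j|$ reduces to $\beta_j$, so the problem becomes a smooth box-constrained quadratic program; the data, penalty level, and solver choice below are illustrative, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, p = 80, 6
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.8, 0.0, 0.0, 0.6, 0.0])  # prior: coefficients non-negative
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 1.0

# Lasso objective with the sign constraint folded in: under beta >= 0,
# the l1 penalty lam * sum|beta_j| is just lam * sum(beta_j), so the
# objective is smooth and simple bound constraints suffice.
def objective(b):
    r = y - X @ b
    return 0.5 * r @ r + lam * b.sum()

res = minimize(objective, x0=np.zeros(p), method='L-BFGS-B',
               bounds=[(0.0, None)] * p)
beta_hat = res.x

# Truly-zero coefficients are shrunk to (near) zero; signals survive.
assert beta_hat[2] < 0.05 and beta_hat[3] < 0.05 and beta_hat[5] < 0.05
assert beta_hat[0] > 1.0
```

Restricting the search to the region where the prior says the parameters lie is exactly the efficiency gain the abstract describes; general linear constraints would require a constrained QP solver instead of simple bounds.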