results for au:Bai_Z in:stat

- Jul 06 2017 stat.ME arXiv:1707.01225v1 Magnetoencephalography (MEG) is an advanced imaging technique used to measure the magnetic fields outside the human head produced by the electrical activity inside the brain. Various source localization methods in MEG require knowledge of the number of underlying active sources, which must be identified a priori. Common methods used to estimate the number of sources include principal component analysis and information criterion methods, both of which make use of the eigenvalue distribution of the data, thus avoiding solving the time-consuming inverse problem. Unfortunately, all these methods are very sensitive to the signal-to-noise ratio (SNR), as examining the sample extreme eigenvalues does not necessarily reflect the perturbation of the population ones. To uncover the unknown sources from very noisy MEG data, we introduce a framework, referred to as the intrinsic dimensionality (ID) of the optimal transformation for the SNR rescaling functional. It is defined as the number of spiked population eigenvalues of the associated transformed data matrix. It is shown that the ID yields a more reasonable estimate for the number of sources than its sample counterparts, especially when the SNR is small. By means of examples, we illustrate that the new method is able to capture signal sources in MEG that can escape PCA or other information criterion based methods.
- This paper considers the optimal modification of the likelihood ratio test (LRT) for the equality of two high-dimensional covariance matrices. The classical LRT is not well defined when the dimensions are larger than or equal to one of the sample sizes. In this paper, an optimally modified test that works well in cases where the dimensions may be larger than the sample sizes is proposed. In addition, the test is established under the weakest conditions on the moments and the dimensions of the samples. We also present weakly consistent estimators of the fourth moments, which are necessary for the proposed test, when they are not equal to 3. From the simulation results and real data analysis, we find that the performances of the proposed statistics are robust against affine transformations.
- Mar 06 2017 stat.ME arXiv:1703.01102v1 The multivariate nonlinear Granger causality test developed by Bai et al. (2010) plays an important role in detecting the dynamic interrelationships between two groups of variables. Following the idea of the Hiemstra-Jones (HJ) test proposed by Hiemstra and Jones (1994), they attempt to establish a central limit theorem (CLT) for their test statistic by applying the asymptotic properties of multivariate $U$-statistics. However, Bai et al. (2016) revisit the HJ test and find that the test statistic given by HJ is not a function of $U$-statistics, which implies that neither the CLT proposed by Hiemstra and Jones (1994) nor the one extended by Bai et al. (2010) is valid for statistical inference. In this paper, we re-estimate the probabilities and reestablish the CLT of the new test statistic. Numerical simulation shows that our new estimates are consistent and that our new test has decent size and power.
- Jan 17 2017 stat.ME arXiv:1701.03992v1 The famous Hiemstra-Jones (HJ) test developed by Hiemstra and Jones (1994) plays a significant role in studying nonlinear causality. Over the last two decades, there have been numerous applications and theoretical extensions based on this pioneering work. However, several works note that counterintuitive results are obtained from the HJ test, and some researchers find that the HJ test is seriously over-rejecting in simulation studies. In this paper, we reinvestigate HJ's creative 1994 work and find that their proposed estimators of the probabilities over different time intervals were not consistent with the target ones proposed in their criterion. To test HJ's novel hypothesis on Granger causality, we propose new estimators of the probabilities defined in their paper and reestablish the asymptotic properties to induce new tests similar to those of HJ. Some simulations will also be presented to support our findings.
- In this paper, we adopt the eigenvector empirical spectral distribution (VESD) to investigate the limiting behavior of eigenvectors of a large dimensional Wigner matrix $W_n$. In particular, we derive the optimal bound for the rate of convergence of the expected VESD of $W_n$ to the semicircle law, which is of order $O(n^{-1/2})$ under the assumption of a finite 10th moment. We further show that the convergence rates in probability and almost surely of the VESD are $O(n^{-1/4})$ and $O(n^{-1/6})$, respectively, under a finite 8th moment condition. Numerical studies demonstrate that the convergence rate does not depend on the choice of unit vector involved in the VESD function, and the best possible bound for the rate of convergence of the VESD is of order $O(n^{-1/2})$.
- This paper proves the asymptotic normality of a statistic for detecting the existence of heteroscedasticity in linear regression models, without assuming randomness of the covariates, when the sample size $n$ tends to infinity and the number of covariates $p$ is either fixed or tends to infinity. Moreover, our approach indicates that its asymptotic normality holds even without homoscedasticity.
- In this paper, we introduce the so-called naive tests and give a brief review of recent developments. Naive testing methods are easy to understand and perform robustly, especially when the dimension is large. We mainly focus on reviewing some naive testing methods for the mean vectors and covariance matrices of high dimensional populations, and we believe the naive test idea can be widely used in many other testing problems.
- Consider the following dynamic factor model: $\mathbf{R}_t=\sum_{i=0}^q \mathbf{\Lambda}_i \mathbf{f}_{t-i}+\mathbf{e}_t,t=1,...,T$, where $\mathbf{\Lambda}_i$ is an $n\times k$ loading matrix of full rank, $\{\mathbf{f}_t\}$ are i.i.d. $k\times1$-factors, and $\mathbf{e}_t$ are independent $n\times1$ white noises. Now, assuming that $n/T\to c>0$, we want to estimate the orders $k$ and $q$ respectively. Define a random matrix $$\mathbf{\Phi}_n(\tau)=\frac{1}{2T}\sum_{j=1}^T (\mathbf{R}_j \mathbf{R}_{j+\tau}^* + \mathbf{R}_{j+\tau} \mathbf{R}_j^*),$$ where $\tau\ge 0$ is an integer. When there are no factors, the matrix $\mathbf{\Phi}_n(\tau)$ reduces to $$\mathbf{M}_n(\tau) = \frac{1}{2T} \sum_{j=1}^T (\mathbf{e}_j \mathbf{e}_{j+\tau}^* + \mathbf{e}_{j+\tau} \mathbf{e}_j^*).$$ When $\tau=0$, $\mathbf{M}_n(\tau)$ reduces to the usual sample covariance matrix whose ESD tends to the well known MP law and $\mathbf{\Phi}_n(0)$ reduces to the standard spike model. Hence the number $k(q+1)$ can be estimated by the number of spiked eigenvalues of $\mathbf{\Phi}_n(0)$. To obtain separate estimates of $k$ and $q$, we have employed the spectral analysis of $\mathbf{M}_n(\tau)$ and established the spiked model analysis for $\mathbf{\Phi}_n(\tau)$.
- In this article, we focus on the problem of testing the equality of several high dimensional mean vectors with unequal covariance matrices. This is one of the most important problems in multivariate statistical analysis and various tests have been proposed in the literature. Motivated by Bai and Saranadasa (1996) and Chen and Qin (2010), a test statistic is introduced and its asymptotic distributions under the null hypothesis as well as the alternative hypothesis are given. In addition, it is compared with a test statistic recently proposed by Srivastava and Kubokawa (2013). It is shown that our test statistic performs much better, especially in the large dimensional case.
- Random Fisher matrices arise naturally in multivariate statistical analysis, and understanding the properties of their eigenvalues is of primary importance for many hypothesis testing problems, such as testing the equality between two multivariate population covariance matrices or testing the independence between sub-groups of a multivariate random vector. This paper is concerned with the properties of a large-dimensional Fisher matrix when the dimension of the population is proportionally large compared to the sample size. Most existing works on Fisher matrices deal with a particular Fisher matrix where populations have i.i.d. components so that the population covariance matrices are all identity. In this paper, we consider general Fisher matrices with arbitrary population covariance matrices. The first main result of the paper establishes the limiting distribution of the eigenvalues of a Fisher matrix, while the second main result provides a central limit theorem for a wide class of functionals of its eigenvalues. Some applications of these results are also proposed for testing hypotheses on high-dimensional covariance matrices.
- Apr 29 2014 stat.ME arXiv:1404.6633v1 Sample covariance matrices are widely used in multivariate statistical analysis. The central limit theorems (CLT's) for linear spectral statistics of high-dimensional non-centered sample covariance matrices have received considerable attention in random matrix theory and have been applied to many high-dimensional statistical problems. However, known population mean vectors are assumed for non-centered sample covariance matrices, and some of these results even assume Gaussian-like moment conditions. In fact, there are two other frequently used sample covariance matrices that do not depend on unknown population mean vectors: the MLE (obtained by subtracting the sample mean vector from each sample vector) and the unbiased sample covariance matrix (obtained by changing the denominator $n$ to $N=n-1$ in the MLE). In this paper, we not only establish new CLT's for non-centered sample covariance matrices without Gaussian-like moment conditions but also characterize the non-negligible differences among the CLT's for the three classes of high-dimensional sample covariance matrices by establishing a substitution principle: substitute the adjusted sample size $N=n-1$ for the actual sample size $n$ in the major centering term of the new CLT's so as to obtain the CLT of the unbiased sample covariance matrices. Moreover, it is found that the difference between the CLT's for the MLE and the unbiased sample covariance matrix is non-negligible in the major centering term, although the two sample covariance matrices differ only by $n$ versus $n-1$ in the denominator. The new results are applied to two testing problems for high-dimensional data.
- In Jin et al. (2014), the limiting spectral distribution (LSD) of a symmetrized auto-cross covariance matrix is derived using matrix manipulation, with finite $(2+\delta)$-th moment assumption. Here we give an alternative method using a result in Bai and Silverstein (2010), in which a weaker condition of finite 2nd moment is assumed.
- The auto-cross covariance matrix is defined as \[\mathbf{M}_n=\frac{1}{2T}\sum_{j=1}^T\bigl(\mathbf{e}_j\mathbf{e}_{j+\tau}^*+\mathbf{e}_{j+\tau}\mathbf{e}_j^*\bigr),\] where the $\mathbf{e}_j$'s are $n$-dimensional vectors of independent standard complex components with a common mean 0, variance $\sigma^2$, and uniformly bounded $2+\eta$th moments, and $\tau$ is the lag. Jin et al. [Ann. Appl. Probab. 24 (2014) 1199-1225] have proved that the LSD of $\mathbf{M}_n$ exists uniquely and nonrandomly, and is independent of $\tau$ for all $\tau\ge 1$. In addition, they gave an analytic expression for the LSD. As a continuation of Jin et al. [Ann. Appl. Probab. 24 (2014) 1199-1225], this paper proves that under the condition of uniformly bounded fourth moments, in any closed interval outside the support of the LSD, with probability 1 there will be no eigenvalues of $\mathbf{M}_n$ for all large $n$. As a consequence of the main theorem, the limits of the largest and smallest eigenvalues of $\mathbf{M}_n$ are also obtained.
- The eigenvector Empirical Spectral Distribution (VESD) is adopted to investigate the limiting behavior of eigenvectors and eigenvalues of covariance matrices. In this paper, we shall show that the Kolmogorov distance between the expected VESD of sample covariance matrix and the Marčenko-Pastur distribution function is of order $O(N^{-1/2})$. Given that data dimension $n$ to sample size $N$ ratio is bounded between 0 and 1, this convergence rate is established under finite 10th moment condition of the underlying distribution. It is also shown that, for any fixed $\eta>0$, the convergence rates of VESD are $O(N^{-1/4})$ in probability and $O(N^{-1/4+\eta})$ almost surely, requiring finite 8th moment of the underlying distribution.
- This paper proposes a CLT for linear spectral statistics of the random matrix $S^{-1}T$ for a general non-negative definite and non-random Hermitian matrix $T$.
- The sample covariance matrix and the multivariate $F$-matrix play important roles in multivariate statistical analysis. The central limit theorems (CLT) of linear spectral statistics associated with these matrices were established in Bai and Silverstein (2004) and Zheng (2012), which received considerable attention and have been applied to solve many large dimensional statistical problems. However, the sample covariance matrices used in these papers are not centralized and there exist some questions about CLT's defined by the centralized sample covariance matrices. In this note, we shall provide some short complements on the CLT's in Bai and Silverstein (2004) and Zheng (2012), and show that the results in these two papers remain valid for the centralized sample covariance matrices, provided that the ratios of dimension $p$ to sample sizes $(n,n_1,n_2)$ are redefined as $p/(n-1)$ and $p/(n_i-1)$, $i=1,2$, respectively.
- Estimation of the population spectral distribution from a large dimensional sample covariance matrix. Feb 05 2013 stat.ME arXiv:1302.0355v1 This paper introduces a new method to estimate the spectral distribution of a population covariance matrix from high-dimensional data. The method is founded on a meaningful generalization of the seminal Marčenko-Pastur equation, originally defined in the complex plane, to the real line. Beyond its easy implementation and the established asymptotic consistency, the new estimator outperforms two existing estimators from the literature in almost all the situations tested in a simulation experiment. An application to the analysis of the correlation matrix of S&P stocks data is also given.
- For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative requires complex analytic approximations and, more importantly, these distributional approximations are feasible only for a moderate dimension of the dependent variable, say $p\le 20$. On the other hand, assuming that the data dimension $p$ as well as the number $q$ of regression variables are fixed while the sample size $n$ grows, several asymptotic approximations are proposed in the literature for Wilks' $\Lambda$, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension $p$ and a large sample size $n$. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null, and simulations demonstrate that the corrected LRT has very satisfactory size and power, certainly in the large $p$ and large $n$ context, but also for moderately large data dimensions like $p=30$ or $p=50$. As a byproduct, we give a reason explaining why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple sample significance test in MANOVA which is valid for high-dimensional data.
- Multivariate distributions are explored using the joint distributions of marginal sample quantiles. Limit theory for the mean of a function of order statistics is presented. The results include a multivariate central limit theorem and a strong law of large numbers. A result similar to Bahadur's representation of quantiles is established for the mean of a function of the marginal quantiles. In particular, it is shown that \[\sqrt{n}\Biggl(\frac{1}{n}\sum_{i=1}^n\phi\bigl(X_{n:i}^{(1)},\ldots,X_{n:i}^{(d)}\bigr)-\bar{\gamma}\Biggr)=\frac{1}{\sqrt{n}}\sum_{i=1}^nZ_{n,i}+\mathrm{o}_P(1)\] as $n\rightarrow\infty$, where $\bar{\gamma}$ is a constant and $Z_{n,i}$ are i.i.d. random variables for each $n$. This leads to the central limit theorem. Weak convergence to a Gaussian process using equicontinuity of functions is indicated. The results are established under very general conditions. These conditions are shown to be satisfied in many commonly occurring situations.
- Using Bernstein polynomial approximations, we prove the central limit theorem for linear spectral statistics of sample covariance matrices, indexed by a set of functions with continuous fourth order derivatives on an open interval including $[(1-\sqrt{y})^2,(1+\sqrt{y})^2]$, the support of the Marčenko-Pastur law. We also derive the explicit expressions for asymptotic mean and covariance functions.
- In this paper, we explain the failure of two likelihood ratio procedures for testing hypotheses about covariance matrices from Gaussian populations when the dimension is large compared to the sample size. Next, using recent central limit theorems for linear spectral statistics of sample covariance matrices and of random F-matrices, we propose necessary corrections for these LR tests to cope with high-dimensional effects. The asymptotic distributions of these corrected tests under the null are given. Simulations demonstrate that the corrected LR tests yield a realized size close to the nominal level for both moderate p (around 20) and high dimension, while the traditional LR tests with chi-square approximation fail. Another contribution of the paper is that, for testing the equality between two covariance matrices, the proposed correction applies equally to non-Gaussian populations, yielding a valid pseudo-likelihood ratio test.
- In the spiked population model introduced by Johnstone (2001), the population covariance matrix has all its eigenvalues equal to one except for a few fixed eigenvalues (spikes). The question is to quantify the effect of the perturbation caused by the spike eigenvalues. Baik and Silverstein (2006) establish the almost sure limits of the extreme sample eigenvalues associated with the spike eigenvalues when the population and the sample sizes become large. In a recent work (Bai and Yao, 2008), we have provided the limiting distributions for these extreme sample eigenvalues. In this paper, we extend this theory to a generalized spiked population model where the base population covariance matrix is arbitrary, instead of the identity matrix as in Johnstone's case. New mathematical tools are introduced for establishing the almost sure convergence of the sample eigenvalues generated by the spikes.
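
Several of the abstracts above estimate a number of sources or factors by counting spiked eigenvalues. The following is a minimal simulation sketch of that idea, not the method of any listed paper. It assumes Gaussian data and Johnstone's identity-base case only (not the generalized model). The function names, the default sizes, and the 5% margin above the Marčenko-Pastur edge are all hypothetical choices made for illustration. For a spike $\alpha > 1+\sqrt{y}$ with $y = p/n$, the corresponding sample eigenvalue is known to converge to $\alpha + y\alpha/(\alpha-1)$, which the sketch compares against the simulated values.

```python
import numpy as np


def simulate_spiked_eigenvalues(p=500, n=2000, spikes=(6.0, 4.0), seed=0):
    """Sample-covariance eigenvalues (descending) for a spiked population.

    Assumption: Gaussian data whose population covariance is the identity
    except for the leading diagonal entries, which are set to `spikes`.
    """
    rng = np.random.default_rng(seed)
    pop_eigs = np.ones(p)
    pop_eigs[: len(spikes)] = spikes
    # Each row of X is one coordinate, scaled to its population std deviation.
    X = rng.standard_normal((p, n)) * np.sqrt(pop_eigs)[:, None]
    S = X @ X.T / n  # non-centered sample covariance (mean known to be 0)
    return np.sort(np.linalg.eigvalsh(S))[::-1]


def count_spikes(sample_eigs, y, margin=0.05):
    """Count eigenvalues above the right MP edge (1 + sqrt(y))^2.

    The multiplicative `margin` is a heuristic guard against finite-sample
    fluctuation of the largest noise eigenvalue, not a calibrated threshold.
    """
    edge = (1.0 + np.sqrt(y)) ** 2
    return int(np.sum(sample_eigs > edge * (1.0 + margin)))


def spike_limit(alpha, y):
    """Almost sure limit of the sample eigenvalue for a spike alpha > 1 + sqrt(y)."""
    return alpha + y * alpha / (alpha - 1.0)


if __name__ == "__main__":
    p, n = 500, 2000
    y = p / n
    eigs = simulate_spiked_eigenvalues(p=p, n=n)
    print("estimated number of spikes:", count_spikes(eigs, y))
    for alpha, lam in zip((6.0, 4.0), eigs[:2]):
        print(f"spike {alpha}: sample {lam:.3f}, limit {spike_limit(alpha, y):.3f}")
```

A fixed margin is the simplest possible rule. Spikes close to the threshold $1+\sqrt{y}$, or very noisy data, would need a more careful criterion than this sketch provides.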