results for au:Zhang_X in:stat

- Stock market investment is increasingly influenced by information from the Internet. To improve prediction accuracy, we propose a multi-task stock prediction model that both accounts for correlations among stocks and supports multi-source data fusion. The model first uses a tensor to integrate the multi-source data, including financial Web news, investor sentiment extracted from social networks, and quantitative stock data. In this way, the intrinsic relationships among the different information sources can be captured, and the sources complement one another to alleviate the data sparsity problem. Second, we propose an improved sub-mode coordinate algorithm (SMC). Based on stock similarity, SMC aims to reduce the variance of the stocks' subspaces in each dimension produced by the tensor decomposition, which improves the quality of the input features and thus the prediction accuracy. A Long Short-Term Memory (LSTM) neural network is then used to predict stock fluctuation trends. Finally, we conduct experiments on 78 A-share stocks in the CSI 100 and thirteen popular Hong Kong stocks over 2015 and 2016. The results demonstrate the improved prediction accuracy and the effectiveness of the proposed model.
- The goal of online prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied, and $O(\sqrt{T})$ and $O(\log{T})$ regret bounds can be achieved for convex losses (\cite{zinkevich2003online}) and strictly convex losses with bounded first and second derivatives (\cite{hazan2007logarithmic}), respectively. In special cases like the Aggregating Algorithm (\cite{vovk1995game}) with mixable losses and the Weighted Average Algorithm (\cite{kivinen1999averaging}) with exp-concave losses, it is possible to achieve $O(1)$ regret bounds. \cite{van2012exp} argued that mixability and exp-concavity are roughly equivalent under certain conditions. Thus, by understanding the underlying relationship between these two notions, we can gain the best of both algorithms (the strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses (\cite{van2012mixability}), we show that it is possible to transform (re-parameterize) any $\beta$-mixable binary proper loss into a $\beta$-exp-concave composite loss with the same $\beta$. In the multi-class case, we propose an approximation approach for this transformation.
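For context, sketches of the two standard definitions being compared (with constant $\eta > 0$; the notation here is ours, not the paper's):

```latex
% eta-mixability: for every prior pi over the experts' predictions p_i,
% there exists a single prediction p_pi whose loss is bounded by the
% eta-mixture of the experts' losses, uniformly over outcomes y:
\ell(p_\pi, y) \;\le\; -\frac{1}{\eta}\,\log \sum_i \pi_i\, e^{-\eta\,\ell(p_i, y)}

% eta-exp-concavity: for every outcome y, the map
p \;\mapsto\; e^{-\eta\,\ell(p, y)} \quad \text{is concave.}
```

Exp-concavity implies mixability directly: taking $p_\pi = \sum_i \pi_i p_i$ and applying Jensen's inequality to the concave map yields the mixability bound, which is why the two notions are so closely related.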
- Error bound conditions (EBC) are properties that characterize the growth of an objective function as a point moves away from the optimal set. They have recently received increasing attention in optimization for developing algorithms with fast convergence. However, studies of EBC in statistical learning are hitherto still limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous and smooth convex random functions. Second, we establish fast and intermediate rates of an efficient stochastic approximation (SA) algorithm for risk minimization with Lipschitz continuous random functions, which requires only one pass over $n$ samples and adapts to EBC. For both approaches, the convergence rates span a full spectrum between $\widetilde O(1/\sqrt{n})$ and $\widetilde O(1/n)$ depending on the power constant in EBC, and could be even faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without using any knowledge of EBC. Overall, this work not only strengthens the understanding of ERM for statistical learning but also brings new fast stochastic algorithms for solving a broad range of statistical learning problems.
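As a rough sketch (the paper's exact conditions may differ), a common formulation of an error bound condition with power constant $\theta \in (0, 1]$ on the optimal set $W_*$ is

```latex
\mathrm{dist}^2(w, W_*) \;\le\; c\,\bigl(F(w) - F_*\bigr)^{\theta}
\qquad \text{for all feasible } w,
```

under which rates of the form $\widetilde O\bigl(n^{-1/(2-\theta)}\bigr)$ interpolate between the two endpoints quoted above: $\theta \to 0$ recovers $\widetilde O(1/\sqrt{n})$, while $\theta = 1$ (quadratic growth) gives $\widetilde O(1/n)$.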
- May 14 2018 stat.ME arXiv:1805.04421v1 In contemporary scientific research, it is of great interest to predict a categorical response based on a high-dimensional tensor (i.e., multi-dimensional array) and additional covariates. This mixture of different types of data leads to challenges in statistical analysis. Motivated by applications in science and engineering, we propose a comprehensive and interpretable discriminant analysis model, called the CATCH model (short for Covariate-Adjusted Tensor Classification in High-dimensions), which efficiently integrates the covariates and the tensor to predict the categorical outcome. The CATCH model jointly models the relationships among the covariates, the tensor predictor, and the categorical response. More importantly, it preserves and utilizes the structures of the data for maximum interpretability and optimal prediction. To tackle the new computational and statistical challenges arising from the intimidating tensor dimensions, we propose a penalized approach to select a subset of tensor predictor entries that has direct discriminative effect after adjusting for covariates. We further develop an efficient algorithm that takes advantage of the tensor structure. Theoretical results confirm that our method achieves variable selection consistency and optimal classification error, even when the tensor dimension is much larger than the sample size. The superior performance of our method over existing methods is demonstrated in extensive simulated and real data examples.
- This short paper describes our solution to the 2018 IEEE World Congress on Computational Intelligence One-Minute Gradual-Emotional Behavior Challenge, whose goal was to estimate continuous arousal and valence values from short videos. We designed four base regression models using visual and audio features, and then used a spectral approach to fuse them to obtain improved performance.
- This paper studies a new type of 3D bin packing problem (BPP), in which a number of cuboid-shaped items must be put into a bin one by one orthogonally. The objective is to find a way to place these items that minimizes the surface area of the bin. This problem is based on the fact that there is no fixed-sized bin in many real business scenarios and the cost of a bin is proportional to its surface area. Based on previous research on 3D BPP, the surface area is determined by the sequence, spatial locations and orientations of items. It is a new NP-hard combinatorial optimization problem on bin packing with unfixed bin size, for which we propose a multi-task framework based on Selected Learning that generates the sequence and orientations of items packed into the bin simultaneously. During training, Selected Learning chooses between loss functions derived from Deep Reinforcement Learning and Supervised Learning according to the training procedure. Numerical results show that the proposed method outperforms Lego baselines by a substantial gain of 7.52%. Moreover, we produce a large-scale 3D bin packing order dataset for studying bin packing problems and will release it to the research community.
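Since there is no fixed bin, the objective above reduces to the surface area of the tightest axis-aligned box enclosing the placed items. A minimal sketch of that objective (the helper name and the placement encoding are hypothetical, not from the paper):

```python
def bin_surface_area(placements):
    """Surface area of the tightest axis-aligned bin enclosing all items.

    `placements` is a list of (x, y, z, w, h, d) tuples: the corner
    position of each cuboid and its dimensions after orientation.
    """
    # Extent of the bounding bin along each axis.
    L = max(x + w for x, y, z, w, h, d in placements)
    W = max(y + h for x, y, z, w, h, d in placements)
    H = max(z + d for x, y, z, w, h, d in placements)
    # Surface area of a cuboid of size L x W x H.
    return 2 * (L * W + L * H + W * H)
```

A packing policy (sequence + orientations) is then scored by evaluating this quantity on the resulting placements; e.g. a single unit cube gives area 6, and two unit cubes side by side give area 10.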
- It is a fundamental, but still elusive, question whether methods based on quantum mechanics, in particular on quantum entanglement, can be used for classical information processing and machine learning. Even a partial answer to this question would bring important insights to both machine learning and quantum mechanics. In this work, we implement simple numerical experiments, related to pattern/image classification, in which we represent the classifiers by many-qubit quantum states written as matrix product states (MPS). A classical machine learning algorithm is applied to these quantum states to learn the classical data. We explicitly show how quantum features (i.e., single-site and bipartite entanglement) can emerge in images represented this way. In particular, entanglement here characterizes the importance of data, and this information is used to guide the architecture of the MPS and improve efficiency. The number of needed qubits can be reduced to less than $1/10$ of the original number. We expect such numerical experiments could open new paths in classical machine learning and, at the same time, shed light on generic quantum simulations/computations for machine learning tasks.
- Low-rank signal modeling has been widely leveraged to capture non-local correlation in image processing applications. We propose a new method that employs low-rank tensor factor analysis for tensors generated by grouped image patches. The low-rank tensors are fed into the alternating direction method of multipliers (ADMM) to further improve image reconstruction. The motivating application is compressive sensing (CS), and a deep convolutional architecture is adopted to approximate the expensive matrix inversion in CS applications. An iterative algorithm based on this low-rank tensor factorization strategy, called NLR-TFA, is presented in detail. Experimental results on noiseless and noisy CS measurements demonstrate the superiority of the proposed approach, especially at low CS sampling rates.
- Mar 08 2018 stat.ME arXiv:1803.02575v1 Stochastic kriging is a popular technique for simulation metamodeling due to its flexibility and analytical tractability. Its computational bottleneck is the inversion of a covariance matrix, which takes $O(n^3)$ time in general and becomes prohibitive for large $n$, where $n$ is the number of design points. Moreover, the covariance matrix is often ill-conditioned for large $n$, and thus the inversion is prone to numerical instability, resulting in erroneous parameter estimation and prediction. These two numerical issues preclude the use of stochastic kriging at a large scale. This paper presents a novel approach to address them. We construct a class of covariance functions, called Markovian covariance functions (MCFs), which have two properties: (i) the associated covariance matrices can be inverted analytically, and (ii) the inverse matrices are sparse. With the use of MCFs, the inversion-related computational time is reduced to $O(n^2)$ in general, and can be further reduced by orders of magnitude with additional assumptions on the simulation errors and design points. The analytical invertibility also enhances the numerical stability dramatically. The key to our approach is that we identify a general functional form of covariance functions that can induce sparsity in the corresponding inverse matrices. We also establish a connection between MCFs and linear ordinary differential equations. This connection provides a flexible, principled approach to constructing a wide class of MCFs. Extensive numerical experiments demonstrate that stochastic kriging with MCFs can handle large-scale problems in a manner that is both computationally efficient and numerically stable.
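As an illustration of properties (i) and (ii), the exponential (Ornstein-Uhlenbeck) kernel is a textbook example of a covariance function whose matrix on ordered one-dimensional design points has a sparse (tridiagonal) inverse; the paper's MCF class is more general. A small numerical check:

```python
import numpy as np

# Sorted 1-D design points (illustrative values).
t = np.array([0.0, 1.3, 2.1, 3.7, 5.0, 6.2, 8.4, 9.9])
# Exponential kernel: K_ij = exp(-|t_i - t_j|).
K = np.exp(-np.abs(t[:, None] - t[None, :]))
# Precision matrix (inverse covariance).
P = np.linalg.inv(K)

# All entries more than one position off the diagonal vanish
# up to round-off: the inverse is tridiagonal.
off = np.abs(np.subtract.outer(np.arange(len(t)), np.arange(len(t)))) > 1
print(np.max(np.abs(P[off])))  # numerically ~0 (tridiagonal inverse)
```

A tridiagonal precision matrix can be formed and applied in $O(n)$ time, which is the kind of saving that makes large-scale kriging feasible.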
- We revisit the inductive matrix completion problem that aims to recover a rank-$r$ matrix with ambient dimension $d$ given $n$ features as the side prior information. The goal is to make use of the known $n$ features to reduce sample and computational complexities. We present and analyze a new gradient-based non-convex optimization algorithm that converges to the true underlying matrix at a linear rate with sample complexity only linearly depending on $n$ and logarithmically depending on $d$. To the best of our knowledge, all previous algorithms either have a quadratic dependency on the number of features in sample complexity or a sub-linear computational convergence rate. In addition, we provide experiments on both synthetic and real-world data to demonstrate the effectiveness of our proposed algorithm.
- We call a learner super-teachable if a teacher can trim down an iid training set while making the learner learn even better. We provide sharp super-teaching guarantees on two learners: the maximum likelihood estimator for the mean of a Gaussian, and the large margin classifier in 1D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super-teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems.
- Feb 05 2018 stat.ME arXiv:1802.00677v2 Stochastic kriging is a popular metamodeling technique for representing the unknown response surface of a simulation model. However, the simulation model may be inadequate in the sense that there may be a non-negligible discrepancy between it and the real system of interest. Failing to account for the model discrepancy may conceivably result in erroneous prediction of the real system's performance and mislead the decision-making process. This paper proposes a metamodel that extends stochastic kriging to incorporate the model discrepancy. Both the simulation outputs and the real data are used to characterize the model discrepancy. The proposed metamodel can provably enhance the prediction of the real system's performance. We derive general results for experiment design and analysis, and demonstrate the advantage of the proposed metamodel relative to competing methods. Finally, we study the effect of Common Random Numbers (CRN). The use of CRN is well known to be detrimental to the prediction accuracy of stochastic kriging in general. By contrast, we show that the effect of CRN in the new context is substantially more complex. The use of CRN can be either detrimental or beneficial depending on the interplay between the magnitude of the observation errors and other parameters involved.
- Feb 05 2018 stat.ME arXiv:1802.00665v1 We propose a novel approach to estimate the Cox model with temporal covariates. Our new approach treats the temporal covariates as arising from a longitudinal process which is modeled jointly with the event time. Different from the literature, the longitudinal process in our model is specified as a bounded variational process and determined by a family of Initial Value Problems associated with an Ordinary Differential Equation. Our specification has the advantage that only the observation of the temporal covariates at the time to event and the time to event itself are required to fit the model; additional longitudinal observations may be used but are not required. This fact makes our approach very useful for many medical outcome datasets, like the New York State Statewide Planning and Research Cooperative System and the National Inpatient Sample, where it is important to find the hazard rate of being discharged given the accumulated cost, but only the total cost at the discharge time is available due to the protection of patient information. Our estimation procedure is based on maximizing the full information likelihood function. The resulting estimators are shown to be consistent and asymptotically normally distributed. Variable selection techniques, like the Adaptive LASSO, can be easily modified and incorporated into our estimation procedure. The oracle property is verified for the resulting estimator of the regression coefficients. Simulations and a real example illustrate the practical utility of the proposed model. Finally, a couple of potential extensions of our approach are discussed.
- Pretraining with expert demonstrations has been found useful in speeding up the training of deep reinforcement learning algorithms, since less online simulation data is required. Some approaches use supervised learning to speed up feature learning, while others pretrain the policies by imitating expert demonstrations. However, these methods are unstable and not suitable for actor-critic reinforcement learning algorithms. Also, some existing methods rely on the global optimality assumption, which does not hold in most scenarios. In this paper, we employ expert demonstrations in an actor-critic reinforcement learning framework while ensuring that performance is not affected by the fact that expert demonstrations are not globally optimal. We theoretically derive a method for computing policy gradients and value estimators with only expert demonstrations. Our method is theoretically justified for actor-critic reinforcement learning algorithms that pretrain both policy and value functions. We apply our method to two typical actor-critic reinforcement learning algorithms, DDPG and ACER, and demonstrate experimentally that our method not only outperforms the RL algorithms without pretraining, but is also more simulation-efficient.
- Training set bugs are flaws in the data that adversely affect machine learning. The training set is usually too large for manual inspection, but one may have the resources to verify a few trusted items. The set of trusted items may not by itself be adequate for learning, so we propose an algorithm that uses these items to identify bugs in the training set and thus improves learning. Specifically, our approach seeks the smallest set of changes to the training set labels such that the model learned from this corrected training set predicts labels of the trusted items correctly. We flag the items whose labels are changed as potential bugs, whose labels can be checked for veracity by human experts. To find the bugs in this way is a challenging combinatorial bilevel optimization problem, but it can be relaxed into a continuous optimization problem. Experiments on toy and real data demonstrate that our approach can identify training set bugs effectively and suggest appropriate changes to the labels. Our algorithm is a step toward trustworthy machine learning.
- Predicting diagnoses from Electronic Health Records (EHRs) is an important medical application of multi-label learning. We propose a convolutional residual model for multi-label classification from doctor notes in EHR data. A given patient may have multiple diagnoses, and therefore multi-label learning is required. We employ a Convolutional Neural Network (CNN) to encode plain text into a fixed-length sentence embedding vector. Since diagnoses are typically correlated, a deep residual network is employed on top of the CNN encoder, to capture label (diagnosis) dependencies and incorporate information directly from the encoded sentence vector. A real EHR dataset is considered, and we compare the proposed model with several well-known baselines, to predict diagnoses based on doctor notes. Experimental results demonstrate the superiority of the proposed convolutional residual model.
- Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent a big data set as a set of non-overlapping data subsets, called RSP data blocks, where each RSP data block has a probability distribution similar to that of the whole big data set. Under this data model, efficient block-level sampling is used to randomly select RSP data blocks, replacing expensive record-level sampling for selecting sample data from a big distributed data set on a computing cluster. We show how RSP data blocks can be employed to estimate statistics of a big data set and build models which are equivalent to those built from the whole big data set. In this approach, analysis of a big data set becomes analysis of a few RSP data blocks which have been generated in advance on the computing cluster. Therefore, the new method for data analysis based on RSP data blocks is scalable to big data.
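The block-level sampling idea can be sketched in a few lines (a toy single-machine illustration, not the authors' distributed implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)  # the "big" data set

# Randomly shuffle, then split into non-overlapping blocks, so each
# block is approximately a random sample of the whole data set.
rng.shuffle(data)
blocks = np.array_split(data, 1000)  # 1000 RSP blocks of ~1000 records each

# Block-level sampling: estimate the global mean from only a few blocks,
# instead of record-level sampling over the full distributed data set.
chosen = rng.choice(len(blocks), size=5, replace=False)
estimate = np.mean([blocks[i].mean() for i in chosen])
print(estimate)  # close to the full-data mean of ~5.0
```

In the RSP setting the shuffle-and-split step is done once, in advance, on the cluster; subsequent analyses then touch only the selected blocks.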
- Nov 28 2017 stat.ME arXiv:1711.09179v1 Many statistical applications require the quantification of joint dependence among more than two random vectors. In this work, we generalize the notion of distance covariance to quantify joint dependence among $d \geq 2$ random vectors. We introduce the high-order distance covariance to measure the so-called Lancaster interaction dependence. The joint distance covariance is then defined as a linear combination of pairwise distance covariances and their higher-order counterparts, which together completely characterize mutual independence. We further introduce some related concepts including the distance cumulant, distance characteristic function, and rank-based distance covariance. Empirical estimators are constructed based on certain Euclidean distances between sample elements. We study the large sample properties of the estimators and propose a bootstrap procedure to approximate their sampling distributions. The asymptotic validity of the bootstrap procedure is justified under both the null and alternative hypotheses. The new metrics are employed to perform model selection in causal inference, which is based on the joint independence testing of the residuals from the fitted structural equation models. The effectiveness of the method is illustrated via both simulated and real datasets.
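For reference, the pairwise building block is the standard empirical (squared) distance covariance, computed from double-centred pairwise distance matrices; a univariate sketch:

```python
import numpy as np

def distance_covariance(x, y):
    """Empirical squared distance covariance between two 1-D samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])  # pairwise distances in x
    b = np.abs(y[:, None] - y[None, :])  # pairwise distances in y
    # Double-centre each distance matrix (row, column, and grand means).
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(distance_covariance(x, x**2))              # nonlinear dependence: clearly positive
print(distance_covariance(x, rng.normal(size=500)))  # independent: near 0
```

Unlike ordinary covariance, this statistic detects the nonlinear dependence between $x$ and $x^2$; the paper's joint version combines such pairwise terms with higher-order counterparts.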
- Whereas maintenance has been recognized as an important and effective means for risk management in power systems, it turns out to be intractable if cascading blackout risk is considered, due to the extremely high computational complexity. In this paper, based on inference from blackout simulation data, we propose a methodology to efficiently identify the most influential component(s) for mitigating cascading blackout risk in a large power system. To this end, we first establish an analytic relationship between maintenance strategies and blackout risk estimation by inferring from the data of cascading outage simulations. Then we formulate the component maintenance decision-making problem as a nonlinear 0-1 program. Afterwards, we quantify the credibility of the blackout risk estimation, leading to an adaptive method to determine the least required number of simulations, which serves as a crucial parameter of the optimization model. Finally, we devise two heuristic algorithms to find approximate optimal solutions to the model with very high efficiency. Numerical experiments demonstrate the efficacy and high efficiency of our methodology.
- In this paper we document our experiences with developing speech recognition for medical transcription - a system that automatically transcribes doctor-patient conversations. Towards this goal, we built a system along two different methodological lines - a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen Attend and Spell (LAS) grapheme-based model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and the development of a matched language model were essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. The CTC models achieved a word error rate of 20.1%, and the LAS models achieved 18.3%. Our analysis shows that both models perform well on important medical utterances and therefore can be practical for transcribing medical conversations.
- The concepts of $\phi$-complete mixability and $\phi$-joint mixability, direct extensions of complete and joint mixability, were first introduced in Bignozzi and Puccetti (2015). Following Bignozzi and Puccetti (2015), we consider two cases of $\phi$ and investigate $\phi$-joint mixability for elliptical distributions and logarithmic elliptical distributions. We obtain a necessary and sufficient condition for the $\phi$-joint mixability of some distributions and a sufficient condition for uniqueness of the center of $\phi$-joint mixability for some elliptical distributions.
- This paper proposes a practical approach for automatic sleep stage classification based on a multi-level feature learning framework and Recurrent Neural Network (RNN) classifier using heart rate and wrist actigraphy derived from a wearable device. The feature learning framework is designed to extract low- and mid-level features. Low-level features capture temporal and frequency domain properties and mid-level features learn compositions and structural information of signals. Since sleep staging is a sequential problem with long-term dependencies, we take advantage of RNNs with Bidirectional Long Short-Term Memory (BLSTM) architectures for sequence data learning. To simulate the actual situation of daily sleep, experiments are conducted with a resting group in which sleep is recorded in resting state, and a comprehensive group in which both resting sleep and non-resting sleep are included. We evaluate the algorithm based on an eight-fold cross validation to classify five sleep stages (W, N1, N2, N3, and REM). The proposed algorithm achieves weighted precision, recall and F1 score of 58.0%, 60.3%, and 58.2% in the resting group and 58.5%, 61.1%, and 58.5% in the comprehensive group, respectively. Various comparison experiments demonstrate the effectiveness of feature learning and BLSTM. We further explore the influence of depth and width of RNNs on performance. Our method is specially proposed for wearable devices and is expected to be applicable for long-term sleep monitoring at home. Without using too much prior domain knowledge, our method has the potential to generalize sleep disorder detection.
- We introduce a new deep convolutional neural network, CrescendoNet, by stacking simple building blocks without residual connections. Each Crescendo block contains independent convolution paths with increased depths. The numbers of convolution layers and parameters grow only linearly in Crescendo blocks. In experiments, CrescendoNet with only 15 layers outperforms almost all networks without residual connections on the benchmark datasets CIFAR10, CIFAR100, and SVHN. Given a sufficient amount of data, as in the SVHN dataset, CrescendoNet with 15 layers and 4.1M parameters can match the performance of DenseNet-BC with 250 layers and 15.3M parameters. CrescendoNet provides a new way to construct high performance deep convolutional neural networks without residual connections. Moreover, through investigating the behavior and performance of subnetworks in CrescendoNet, we note that the high performance of CrescendoNet may come from its implicit ensemble behavior, which differs from the FractalNet that is also a deep convolutional neural network without residual connections. Furthermore, the independence between paths in CrescendoNet allows us to introduce a new path-wise training procedure, which can reduce the memory needed for training.
- Conjugate gradient methods are a class of important methods for solving linear equations and nonlinear optimization problems. In this work, we propose a new stochastic conjugate gradient algorithm with variance reduction (CGVR) and prove its linear convergence with the Fletcher-Reeves method for strongly convex and smooth functions. We experimentally demonstrate that the CGVR algorithm converges faster than its counterparts on six large-scale optimization problems that may be convex, non-convex or non-smooth, and its AUC (Area Under Curve) performance with $L_2$-regularized $L_2$-loss is comparable to that of LIBLINEAR but with significant improvement in computational efficiency.
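As background, a sketch of the deterministic conjugate gradient core with the Fletcher-Reeves coefficient, here for a strongly convex quadratic; CGVR, as described above, replaces the exact gradient with variance-reduced stochastic gradients while keeping the conjugate-direction update:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Linear CG for f(x) = 0.5 x'Ax - b'x with A symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual = negative gradient
    d = r.copy()             # initial search direction
    for _ in range(len(b)):  # exact convergence in at most n steps
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search
        x = x + alpha * d
        r_new = r - alpha * Ad
        if r_new @ r_new < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves coefficient
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(conjugate_gradient(A, b))  # [0.2, 0.4], matching np.linalg.solve(A, b)
```

The Fletcher-Reeves ratio $\beta = \|r_{new}\|^2 / \|r\|^2$ is what CGVR evaluates with stochastic gradients in place of the exact residuals.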
- Learning social media data embedding with deep models has attracted extensive research interest and has enabled a range of applications, such as link prediction, classification, and cross-modal search. However, for social images, which contain both link information and multimodal contents (e.g., text descriptions and visual content), simply employing the embedding learnt from the network structure or the data content results in sub-optimal social image representations. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-grained relations between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in both tasks.
- PET image reconstruction is challenging due to the ill-posedness of the inverse problem and the limited number of detected photons. Recently, deep neural networks have been widely and successfully used in computer vision tasks and have attracted growing interest in medical imaging. In this work, we trained a deep residual convolutional neural network to improve PET image quality by using existing inter-patient information. An innovative feature of the proposed method is that we embed the neural network in the iterative reconstruction framework for image representation, rather than using it as a post-processing tool. We formulate the objective function as a constrained optimization problem and solve it using the alternating direction method of multipliers (ADMM) algorithm. Both simulation data and hybrid real data are used to evaluate the proposed method. Quantification results show that our proposed iterative neural network method outperforms neural network denoising and conventional penalized maximum likelihood methods.
- We consider a ranking and selection problem in the context of personalized decision making, where the best alternative is not universal but varies as a function of observable covariates. The goal of ranking and selection with covariates (R&S-C) is to use sampling to compute a decision rule that can specify the best alternative with a certain statistical guarantee for each subsequent individual after observing his or her covariates. A linear model is proposed to capture the relationship between the mean performance of an alternative and the covariates. Under the indifference-zone formulation, we develop two-stage procedures for both homoscedastic and heteroscedastic sampling errors, respectively, and prove their statistical validity, which is defined in terms of probability of correct selection. We also generalize the well-known slippage configuration, and prove that the generalized slippage configuration is the least favorable configuration of our procedures. Extensive numerical experiments are conducted to investigate the performance of the proposed procedures. Finally, we demonstrate the usefulness of R&S-C via a case study of selecting the best treatment regimen in the prevention of esophageal cancer. We find that by leveraging disease-related personal information, R&S-C can substantially improve the expected quality-adjusted life years for some groups of patients by providing patient-specific treatment regimens.
- In today's era of big data, robust least-squares regression becomes a more challenging problem when considering adversarial corruption along with the explosive growth of datasets. Traditional robust methods can handle the noise but face several challenges when applied to huge datasets, including 1) the computational infeasibility of handling an entire dataset at once, 2) the existence of heterogeneously distributed corruption, and 3) the difficulty of corruption estimation when the data cannot be entirely loaded. This paper proposes online and distributed robust regression approaches, both of which can concurrently address all the above challenges. Specifically, the distributed algorithm optimizes the regression coefficients of each data block via heuristic hard thresholding and combines all the estimates in a distributed robust consolidation. Furthermore, an online version of the distributed algorithm is proposed to incrementally update the existing estimates with new incoming data. We also prove that our algorithms benefit from strong robustness guarantees in terms of regression coefficient recovery, with a constant upper bound on the error of state-of-the-art batch methods. Extensive experiments on synthetic and real datasets demonstrate that our approaches are superior to existing methods in effectiveness, with competitive efficiency.
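The per-block step can be sketched as follows (a toy single-machine illustration of heuristic hard thresholding, with hypothetical names; the paper's algorithm additionally handles distribution and consolidation):

```python
import numpy as np

def robust_ols(X, y, k, iters=20):
    """Alternate OLS on points deemed clean with re-flagging the k
    largest-residual points as corrupted (hard thresholding)."""
    n = len(y)
    clean = np.arange(n)  # start by treating every point as clean
    for _ in range(iters):
        w, *_ = np.linalg.lstsq(X[clean], y[clean], rcond=None)
        resid = np.abs(y - X @ w)
        clean = np.argsort(resid)[: n - k]  # keep the smallest residuals
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=200)
y[:20] += 10.0                 # adversarially corrupt 10% of the responses
print(robust_ols(X, y, k=20))  # close to [2, -1] despite the corruption
```

Plain OLS on the full corrupted data would be biased; after the corrupted points are flagged and dropped, the refit on the remaining points recovers the coefficients.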
- Sep 13 2017, stat.ME, arXiv:1709.03945v2. An envelope is a targeted dimension reduction subspace for simultaneously achieving dimension reduction and improving parameter estimation efficiency. While many envelope methods have been proposed in recent years, all envelope methods hinge on the knowledge of a key hyperparameter, the structural dimension of the envelope. How to estimate the envelope dimension consistently is of substantial interest from both theoretical and practical aspects. Moreover, very recent advances in the literature have generalized the envelope as a model-free method, which makes selecting the envelope dimension even more challenging. Likelihood-based approaches such as information criteria and likelihood-ratio tests either cannot be directly applied or have no theoretical justification. To address this critical issue of dimension selection, we propose two unified approaches -- called FG and 1D selections -- for determining the envelope dimension that can be applied to any envelope models and methods. The two model-free selection approaches are based on two different envelope optimization procedures: the full Grassmannian (FG) optimization and the 1D algorithm (Cook and Zhang, 2016), and are shown to be capable of correctly identifying the structural dimension with probability tending to 1 under mild moment conditions as the sample size increases. While the FG selection unifies and generalizes the BIC and modified BIC approaches that exist in the literature, and hence provides theoretical justification for them under weak moment conditions and in a model-free context, the 1D selection is computationally more stable and efficient in finite samples. Extensive simulations and a real data analysis demonstrate the superb performance of our proposals.
- Many of today's machine learning (ML) systems are not built from scratch, but are compositions of an array of modular learning components (MLCs). The increasing use of MLCs significantly simplifies the ML system development cycles. However, as most MLCs are contributed and maintained by third parties, their lack of standardization and regulation entails profound security implications. In this paper, for the first time, we demonstrate that potentially harmful MLCs pose immense threats to the security of ML systems. We present a broad class of logic-bomb attacks in which maliciously crafted MLCs trigger host systems to malfunction in a predictable manner. By empirically studying two state-of-the-art ML systems in the healthcare domain, we explore the feasibility of such attacks. For example, we show that, without prior knowledge about the host ML system, by modifying only 3.3 per thousand of the MLC's parameters, each with distortion below $10^{-3}$, the adversary is able to force the misdiagnosis of target victims' skin cancers with 100% success rate. We provide analytical justification for the success of such attacks, which points to the fundamental characteristics of today's ML models: high dimensionality, non-linearity, and non-convexity. The issue thus seems fundamental to many ML systems. We further discuss potential countermeasures to mitigate MLC-based attacks and their potential technical challenges.
- Influenza remains a significant burden on health systems. Effective responses rely on a timely understanding of the magnitude and evolution of an outbreak. For monitoring purposes, data on severe cases of influenza in England are reported weekly to Public Health England. These data are both readily available and have the potential to provide valuable information to estimate and predict the key transmission features of seasonal and pandemic influenza. We propose an epidemic model that links the underlying unobserved influenza transmission process to data on severe influenza cases. Within a Bayesian framework, we infer retrospectively the parameters of the epidemic model for each seasonal outbreak from 2012 to 2015, including: the effective reproduction number; the initial susceptibility; the probability of admission to intensive care given infection; and the effect of school closure on transmission. The model is also implemented in real time to assess whether early forecasting of the number of admissions to intensive care is possible. Our model of admissions data allows reconstruction of the underlying transmission dynamics, revealing increased transmission during the 2013/14 season and a noticeable effect of the Christmas school holiday on disease spread during the 2012/13 and 2014/15 seasons. When information on the initial immunity of the population is available, forecasts of the number of admissions to intensive care can be substantially improved. Readily available severe case data can be effectively used to estimate epidemiological characteristics and to predict the evolution of an epidemic, crucially allowing real-time monitoring of the transmission and severity of the outbreak.
- May 30 2017, stat.AP, arXiv:1705.09976v1. In this paper, we discuss the connection between the RGRST models (Gardiner et al 2002, Polverejan et al 2003) and the Coxian Phase-Type (CPH) models (Marshall et al 2007, Tang 2012) through a construction that converts a special sub-class of RGRST models to CPH models. Both models are widely used to characterize the distribution of hospital charge and length of stay (LOS), but the lack of connections between them means the two models are rarely used together. We claim that our construction can bridge this gap and make it possible to take advantage of the two different models simultaneously. As a consequence, we derive a measure of the "price" of staying in each medical stage (identified with phases of a CPH model), which cannot be obtained without considering the RGRST and CPH models together. A two-stage algorithm is provided to generate consistent estimates of the model parameters. Applying the algorithm to a sample drawn from the New York State Statewide Planning and Research Cooperative System 2013 (SPARCS 2013), we estimate the prices in a four-phase CPH model and discuss the implications.
- We consider the robust phase retrieval problem of recovering the unknown signal from the magnitude-only measurements, where the measurements can be contaminated by both sparse arbitrary corruption and bounded random noise. We propose a new nonconvex algorithm for robust phase retrieval, namely Robust Wirtinger Flow to jointly estimate the unknown signal and the sparse corruption. We show that our proposed algorithm is guaranteed to converge linearly to the unknown true signal up to a minimax optimal statistical precision in such a challenging setting. Compared with existing robust phase retrieval methods, we achieve an optimal sample complexity of $O(n)$ in both noisy and noise-free settings. Thorough experiments on both synthetic and real datasets corroborate our theory.
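For intuition, the following is a minimal sketch of plain (non-robust) Wirtinger Flow for real-valued signals: spectral initialization followed by gradient descent on the squared-magnitude residuals. It omits the joint sparse-corruption estimation that distinguishes Robust Wirtinger Flow; the step size and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 400
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x_true) ** 2                  # magnitude-only (squared) measurements

# spectral initialization: top eigenvector of (1/m) sum_i y_i a_i a_i^T
Y = (A.T * y) @ A / m
w, V = np.linalg.eigh(Y)               # eigenvalues in ascending order
x = V[:, -1] * np.sqrt(np.mean(y))     # rescale to match the signal energy

# gradient descent on (1/4m) sum_i ((a_i^T x)^2 - y_i)^2
step = 0.1 / np.mean(y)
for _ in range(500):
    r = (A @ x) ** 2 - y
    x -= step * (A.T @ (r * (A @ x))) / m

err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
print(err / np.linalg.norm(x_true))    # small relative error (up to global sign)
```

The sign ambiguity is intrinsic to phase retrieval, hence the minimum over ±x when measuring error.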
- We consider multivariate regression with manifold-valued output; that is, for a multivariate observation, the output response lies on a manifold. Moreover, we propose a new regression model to deal with the presence of grossly corrupted manifold-valued responses, a bottleneck issue commonly encountered in practical scenarios. Our model first takes a correction step on the grossly corrupted responses via geodesic curves on the manifold, and then performs multivariate linear regression on the corrected data. This results in a nonconvex and nonsmooth optimization problem on manifolds. To this end, we propose a dedicated approach named PALMR, which utilizes and extends the proximal alternating linearized minimization techniques. Theoretically, we investigate its convergence property and show that it converges to a critical point under mild conditions. Empirically, we test our model on both synthetic and real diffusion tensor imaging data, and show that our model outperforms other multivariate regression models when manifold-valued responses contain gross errors, and is effective in identifying gross errors.
- Feb 22 2017, stat.ML, arXiv:1702.06525v3. We propose a unified framework to solve general low-rank plus sparse matrix recovery problems based on matrix factorization, which covers a broad family of objective functions satisfying the restricted strong convexity and smoothness conditions. Based on projected gradient descent and the double thresholding operator, our proposed generic algorithm is guaranteed to converge to the unknown low-rank and sparse matrices at a locally linear rate, while matching the best-known robustness guarantee (i.e., tolerance for sparsity). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and which we believe is of independent interest for proving fast rates for general superposition-structured models. We illustrate the application of our framework through two concrete examples: robust matrix sensing and robust PCA. Experiments on both synthetic and real datasets corroborate our theory.
- Feb 20 2017, stat.ME, arXiv:1702.05195v1. This paper studies the sparse normal mean models under the empirical Bayes framework. We focus on mixture priors with an atom at zero and a density component centered at a data-driven location determined by maximizing the marginal likelihood or minimizing the Stein Unbiased Risk Estimate. We study the properties of the corresponding posterior median and posterior mean. In particular, the posterior median is a thresholding rule and enjoys the multi-direction shrinkage property that shrinks the observation toward either the origin or the data-driven location. The idea is extended by considering a finite mixture prior, which is flexible enough to model the cluster structure of the unknown means. We further generalize the results to heteroscedastic normal mean models. Specifically, we propose a semiparametric estimator which can be calculated efficiently by combining the familiar EM algorithm with the Pool-Adjacent-Violators algorithm for isotonic regression. The effectiveness of our methods is demonstrated via extensive numerical studies.
- Motivated by applications in biological science, we propose a novel test to assess the conditional mean dependence of a response variable on a large number of covariates. Our procedure is built on the martingale difference divergence recently proposed in Shao and Zhang (2014), and it is able to detect a certain type of departure from the null hypothesis of conditional mean independence without making any specific model assumptions. Theoretically, we establish the asymptotic normality of the proposed test statistic under suitable assumptions on the eigenvalues of a Hermitian operator, which is constructed based on the characteristic function of the covariates. These conditions can be simplified under a banded dependence structure on the covariates or a Gaussian design. To account for heterogeneity within the data, we further develop a testing procedure for conditional quantile independence at a given quantile level and provide an asymptotic justification. Empirically, our test of conditional mean independence delivers results comparable to the competitor, which was constructed under the linear model framework, when the underlying model is linear. It significantly outperforms the competitor when the conditional mean admits a nonlinear form.
- Jan 24 2017, stat.ME, arXiv:1701.06263v1. In functional data analysis (FDA), the covariance function is fundamental not only as a critical quantity for understanding elementary aspects of functional data but also as an indispensable ingredient for many advanced FDA methods. This paper develops a new class of nonparametric covariance function estimators in terms of various spectral regularizations of an operator associated with a reproducing kernel Hilbert space. Despite their nonparametric nature, the covariance estimators are automatically positive semi-definite without any additional modification steps. An unconventional representer theorem is established to provide a finite dimensional representation for this class of covariance estimators, which leads to a closed-form expression of the corresponding $L^2$ eigen-decomposition. Trace-norm regularization is particularly studied to further achieve a low-rank representation, another desirable property which leads to dimension reduction and is often needed in advanced FDA approaches. An efficient algorithm is developed based on the accelerated proximal gradient method. The resulting estimator is shown to enjoy an excellent rate of convergence under both fixed and random designs. The outstanding practical performance of the trace-norm-regularized covariance estimator is demonstrated by a simulation study and the analysis of a traffic dataset.
- Jan 18 2017, stat.AP, arXiv:1701.04423v2. We extend the model used in Gardiner et al. (2002) and Polverejan et al. (2003) by deriving an explicit expression for the joint probability density function of hospital charge and length of stay (LOS) under a general class of conditions. Using this joint density function, we can apply the full maximum likelihood method (FML) to estimate the effect of covariates on charge and LOS. With FML, the endogeneity issues arising from the dependence between charge and LOS can be efficiently resolved. As an illustrative example, we apply our method to real charge and LOS data sampled from the New York State Statewide Planning and Research Cooperative System 2013 (SPARCS 2013). We compare our fit with a fit of the widely used Phase-Type model to the marginal LOS data, and conclude that our method is more efficient.
- Jan 17 2017, stat.ML, arXiv:1701.04207v1. Canonical correlation analysis (CCA) is a multivariate statistical technique for finding the linear relationship between two sets of variables. The kernel generalization of CCA, named kernel CCA, has been proposed to find nonlinear relations between datasets. Despite their wide usage, both share a common limitation: the lack of sparsity in their solutions. In this paper, we consider sparse kernel CCA and propose a novel sparse kernel CCA algorithm (SKCCA). Our algorithm is based on a relationship between kernel CCA and least squares. Sparsity of the dual transformations is introduced by penalizing the $\ell_{1}$-norm of the dual vectors. Experiments demonstrate that our algorithm not only performs well in computing sparse dual transformations but also can alleviate the over-fitting problem of kernel CCA.
- This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or missing, and the magnitudes and locations of such occurrences are unknown a priori. To deal with this problem, we propose a new approach that explicitly models the error source as well as its sparse nature. An interesting property of our approach lies in its ability to allow individual regression output elements or tasks to possess their unique noise levels. Moreover, despite working with a non-smooth optimization problem, our approach is still guaranteed to converge to its optimal solution. Experiments on synthetic data demonstrate the competitiveness of our approach compared with existing multivariate regression models. In addition, our approach has been empirically validated with very promising results on two exemplar real-world applications: the first concerns the prediction of Big-Five personality based on user behaviors at social network sites (SNSs), while the second is 3D human hand pose estimation from depth images. The implementation of our approach and the comparison methods, as well as the involved datasets, are made publicly available in support of the open-source and reproducible research initiatives.
- Jan 10 2017, stat.ML, arXiv:1701.02301v2. We propose a generic framework based on a new stochastic variance-reduced gradient descent algorithm for accelerating nonconvex low-rank matrix recovery. Starting from an appropriate initial estimator, our proposed algorithm performs projected gradient descent based on a novel semi-stochastic gradient specifically designed for low-rank matrix recovery. Under mild restricted strong convexity and smoothness conditions, we derive a projected notion of the restricted Lipschitz continuous gradient property, and prove that our algorithm enjoys a linear convergence rate to the unknown low-rank matrix with improved computational complexity. Moreover, our algorithm applies to both noiseless and noisy observations, where the optimal sample complexity and the minimax optimal statistical rate can be attained, respectively. We further illustrate the superiority of our generic framework through several specific examples, both theoretically and experimentally.
- Jan 03 2017, stat.ML, arXiv:1701.00481v2. We study the problem of estimating low-rank matrices from linear measurements (a.k.a. matrix sensing) through nonconvex optimization. We propose an efficient stochastic variance reduced gradient descent algorithm to solve the nonconvex optimization problem of matrix sensing. Our algorithm is applicable to both noisy and noiseless settings. In the case with noisy observations, we prove that our algorithm converges to the unknown low-rank matrix at a linear rate up to the minimax optimal statistical error. In the noiseless setting, our algorithm is guaranteed to converge linearly to the unknown low-rank matrix and achieves exact recovery with optimal sample complexity. Most notably, the overall computational complexity of our proposed algorithm, defined as the iteration complexity times the per-iteration time complexity, is lower than that of state-of-the-art algorithms based on gradient descent. Experiments on synthetic data corroborate the superiority of the proposed algorithm over the state-of-the-art algorithms.
- Oct 18 2016, stat.ML, arXiv:1610.05275v1. We propose a unified framework for estimating low-rank matrices through nonconvex optimization based on a gradient descent algorithm. Our framework is quite general and can be applied to both noisy and noiseless observations. In the general case with noisy observations, we show that our algorithm is guaranteed to converge linearly to the unknown low-rank matrix up to the minimax optimal statistical error, provided an appropriate initial estimator. In the noiseless setting, our algorithm converges to the unknown low-rank matrix at a linear rate and enables exact recovery with optimal sample complexity. In addition, we develop a new initialization algorithm to provide the desired initial estimator, which outperforms existing initialization algorithms for nonconvex low-rank matrix estimation. We illustrate the superiority of our framework through three examples: matrix regression, matrix completion, and one-bit matrix completion. We also corroborate our theory through extensive experiments on synthetic data.
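As a toy illustration of this style of nonconvex estimation, the sketch below runs factored gradient descent for symmetric rank-r matrix sensing with a simple spectral initialization. It is not the paper's unified framework or its new initialization algorithm; the step size (which peeks at the true spectral norm) and the problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, m = 10, 2, 400
U_true = rng.standard_normal((d, r))
X_true = U_true @ U_true.T                        # unknown rank-r PSD matrix
A = rng.standard_normal((m, d, d))
b = np.einsum('kij,ij->k', A, X_true)             # linear measurements <A_k, X>

# spectral initialization: top-r eigenpairs of (1/m) sum_k b_k (A_k + A_k^T)/2
Asym = A + A.transpose(0, 2, 1)
M = np.einsum('k,kij->ij', b, Asym) / (2 * m)
w, V = np.linalg.eigh(M)
U = V[:, -r:] * np.sqrt(np.maximum(w[-r:], 0))

# factored gradient descent on f(U) = (1/2m) sum_k (<A_k, U U^T> - b_k)^2
step = 0.2 / np.linalg.norm(X_true, 2)            # toy choice: uses the truth
for _ in range(500):
    resid = np.einsum('kij,ij->k', A, U @ U.T) - b
    grad = np.einsum('k,kij->ij', resid, Asym) @ U / m
    U -= step * grad

rel = np.linalg.norm(U @ U.T - X_true) / np.linalg.norm(X_true)
print(rel)                                        # near zero: exact recovery regime
```

With m well above the degrees of freedom of a rank-r matrix, this noiseless toy sits in the exact-recovery regime the abstract describes.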
- Mobile big data contains vast statistical features in various dimensions, including the spatial, temporal, and underlying social domains. Understanding and exploiting the features of mobile data from a social network perspective will be extremely beneficial to wireless networks, from planning, operation, and maintenance to optimization and marketing. In this paper, we categorize and analyze the big data collected from real wireless cellular networks. Then, we study the social characteristics of mobile big data and highlight several research directions for mobile big data in the social computing areas.
- Sep 30 2016, stat.ME, arXiv:1609.09380v2. In this paper, we introduce an ${\mathcal L}_2$-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed based on the pairwise distance covariance, and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice, as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small and the dimension is high, and is shown to successfully identify nonlinear dependence in empirical data analysis. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of the good power properties of our distance covariance based test, we further show that an infeasible version of our test statistic is rate optimal in the class of Gaussian distributions with equal correlation.
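The pairwise distance covariance at the heart of this construction is straightforward to compute via double-centered distance matrices. The sketch below implements the squared sample distance covariance for two univariate samples; it is the raw ingredient only, not the paper's studentized test statistic or its limiting-normal calibration.

```python
import numpy as np

def dcov_sq(x, y):
    """Squared sample distance covariance of two 1-d samples,
    in the double-centering form of Szekely and Rizzo."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
y_dep = x ** 2                        # nonlinear, non-monotone dependence
y_ind = rng.standard_normal(1000)
print(dcov_sq(x, y_dep), dcov_sq(x, y_ind))   # dependent pair is much larger
```

Note that the Pearson correlation between x and x**2 is essentially zero for a symmetric x, so correlation-based tests would miss exactly the dependence that the distance covariance flags here.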
- The latent Dirichlet allocation (LDA) model is a widely used latent variable model in machine learning for text analysis. Inference for this model typically involves a single-site collapsed Gibbs sampling step for latent variables associated with observations. The efficiency of the sampling is critical to the success of the model in practical large scale applications. In this article, we introduce a blocking scheme for the collapsed Gibbs sampler for the LDA model which can, with a theoretical guarantee, improve chain mixing efficiency. We develop two procedures, an O(K)-step backward simulation and an O(log K)-step nested simulation, to directly sample the latent variables within each block. We demonstrate that the blocking scheme achieves substantial improvements in chain mixing compared to the state-of-the-art single-site collapsed Gibbs sampler. We also show that when the number of topics is in the hundreds, the nested-simulation blocking scheme can achieve a significant reduction in computation time compared to the single-site sampler.
- There is a growing need for the ability to analyse interval-valued data. However, existing descriptive frameworks to achieve this ignore the process by which interval-valued data are typically constructed; namely by the aggregation of real-valued data generated from some underlying process. In this article we develop the foundations of likelihood based statistical inference for random intervals that directly incorporates the underlying generative procedure into the analysis. That is, it permits the direct fitting of models for the underlying real-valued data given only the random interval-valued summaries. This generative approach overcomes several problems associated with existing methods, including the rarely satisfied assumption of within-interval uniformity. The new methods are illustrated by simulated and real data analyses.
- The causal discovery of Bayesian networks is an active and important research area, based upon searching the space of causal models for those which can best explain a pattern of probabilistic dependencies shown in the data. However, some of those dependencies are generated by causal structures involving variables which have not been measured, i.e., latent variables. Some such patterns of dependency "reveal" themselves, in that no model based solely upon the observed variables can explain them as well as a model using a latent variable. That is what latent variable discovery is based upon. Here we search for such patterns systematically, so that they may be applied to latent variable discovery in a more rigorous fashion.
- Scaling multinomial logistic regression to datasets with a very large number of data points and classes has not been trivial. This is primarily because one needs to compute the log-partition function on every data point, which makes distributing the computation hard. In this paper, we present a distributed stochastic gradient descent based optimization method (DS-MLR) for scaling up multinomial logistic regression problems to massive scale datasets without hitting any storage constraints on the data and model parameters. Our algorithm exploits double-separability, an attractive property we observe in the objective functions of several models in machine learning, which allows us to achieve both data and model parallelism simultaneously. In addition to being parallelizable, our algorithm can also easily be made non-blocking and asynchronous. We demonstrate the effectiveness of DS-MLR empirically on several real-world datasets, the largest being a reddit dataset created from 1.7 billion user comments, where the data and parameter sizes are 228 GB and 358 GB, respectively.
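The log-partition function mentioned above is the per-example softmax normalizer $\log\sum_k \exp(w_k^\top x)$. A minimal stable implementation via the standard log-sum-exp trick (this is just the quantity whose repeated evaluation DS-MLR distributes, not the DS-MLR algorithm itself):

```python
import numpy as np

def log_partition(W, x):
    """log(sum_k exp(w_k^T x)) for one data point, computed stably by
    subtracting the max score before exponentiating (log-sum-exp trick)."""
    scores = W @ x                     # one score per class
    m = scores.max()
    return m + np.log(np.sum(np.exp(scores - m)))

rng = np.random.default_rng(4)
W = rng.standard_normal((5, 3))
x = rng.standard_normal(3)
print(log_partition(W, x))             # matches the naive computation here

W_big = np.full((5, 3), 400.0)         # all class scores equal 1200
x1 = np.ones(3)
print(log_partition(W_big, x1))        # 1200 + log(5), still finite
```

The naive form np.log(np.sum(np.exp(W_big @ x1))) overflows to inf at these magnitudes, which is why the stable form matters at scale.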
- The r largest order statistics approach is widely used in extreme value analysis because it may use more information from the data than just the block maxima. In practice, the choice of r is critical. If r is too large, bias can occur; if it is too small, the variance of the estimator can be high. The limiting distribution of the r largest order statistics, denoted by GEVr, extends that of the block maxima. Two specification tests are proposed to select r sequentially. The first is a score test for the GEVr distribution. Due to the special characteristics of the GEVr distribution, the classical chi-square asymptotics cannot be used. The simplest approach is to use the parametric bootstrap, which is straightforward to implement but computationally expensive. An alternative fast weighted bootstrap or multiplier procedure is developed for computational efficiency. The second test uses the difference in estimated entropy between the GEVr and GEV(r-1) models, applied to the r largest order statistics and the r-1 largest order statistics, respectively. The asymptotic distribution of the difference statistic is derived. In a large scale simulation study, both tests held their size and had substantial power to detect various misspecification schemes. A new approach to address the issue of multiple, sequential hypothesis testing is adapted to this setting to control the false discovery rate or familywise error rate. The utility of the procedures is demonstrated with extreme sea level and precipitation data.
- Threshold selection is a critical issue for extreme value analysis with threshold-based approaches. Under suitable conditions, exceedances over a high threshold have been shown to asymptotically follow the generalized Pareto distribution (GPD). In practice, however, the threshold must be chosen. If the chosen threshold is too low, the GPD approximation may not hold and bias can occur. If the threshold is chosen too high, the reduced sample size increases the variance of the parameter estimates. In batch analyses, commonly used selection methods such as graphical diagnostics are subjective and cannot be automated, while computational methods may not be feasible. We propose to test a set of thresholds through the goodness-of-fit of the GPD for the exceedances, and to select the lowest one above which the data provide an adequate fit to the GPD. Previous attempts in this setting are not valid because of the special feature that the multiple tests are done in an ordered fashion. We apply two recently available stopping rules that control the false discovery rate or familywise error rate to ordered goodness-of-fit tests to automate threshold selection. Various model specification tests such as the Cramer-von Mises, Anderson-Darling, Moran's, and a score test are investigated. The performance of the method is assessed in a large scale simulation study that mimics practical return level estimation. The procedure was repeated at hundreds of sites in the western US to generate return level maps of extreme precipitation.
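A heavily simplified version of this idea can be sketched with off-the-shelf tools: fit the GPD to the exceedances at each candidate threshold in increasing order and return the lowest threshold whose fit is not rejected. The sketch uses plain Kolmogorov-Smirnov tests with fitted parameters and no ordered-testing stopping rule, so it lacks the error-rate control that is the point of the paper; the function and candidate grid are illustrative only.

```python
import numpy as np
from scipy import stats

def select_threshold(data, thresholds, level=0.05):
    """Toy threshold selection: pick the lowest candidate threshold whose
    exceedances pass a GPD goodness-of-fit (KS) test. A simplified
    stand-in for the paper's procedure, with no multiplicity control."""
    for u in sorted(thresholds):
        exc = data[data > u] - u
        c, loc, scale = stats.genpareto.fit(exc, floc=0)
        p = stats.kstest(exc, 'genpareto', args=(c, 0, scale)).pvalue
        if p > level:
            return u
    return None

rng = np.random.default_rng(5)
data = stats.genpareto.rvs(0.2, scale=1.0, size=2000, random_state=rng)
u = select_threshold(data, thresholds=[0.0, 0.5, 1.0, 2.0])
print(u)   # likely the lowest candidate, since the data are GPD throughout
```

Because the data are drawn from a GPD, every candidate threshold should give an adequate fit, and the rule returns the lowest one.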
- Multiple-input multiple-output (MIMO) radar has become a thriving subject of research during the past decades. In the MIMO radar context, it is sometimes more accurate to model the radar clutter as a non-Gaussian process, more specifically, by using the spherically invariant random process (SIRP) model. In this paper, we focus on the estimation and performance analysis of the angular spacing between two targets for MIMO radar under SIRP clutter. First, we propose an iterative maximum likelihood as well as an iterative maximum a posteriori estimator for the target spacing parameter in the SIRP clutter context. Then we derive and compare various Cramér-Rao-like bounds (CRLBs) for performance assessment. Next, we address the problem of target resolvability by using the concept of the angular resolution limit (ARL), and derive an analytical, closed-form expression of the ARL based on Smith's criterion between two closely spaced targets in a MIMO radar context under SIRP clutter. To this end, we also obtain non-matrix, closed-form expressions for each of the CRLBs. Finally, we provide numerical simulations to assess the performance of the proposed algorithms and the validity of the derived ARL expression, and to reveal the ARL's insightful properties.
- The maximum likelihood (ML) and maximum a posteriori (MAP) estimation techniques are widely used to address direction-of-arrival (DOA) estimation problems, an important topic in sensor array processing. Conventionally, ML estimators in the DOA estimation context assume the sensor noise to follow a Gaussian distribution. In real-life applications, however, this assumption is sometimes not valid, and it is often more accurate to model the noise as a non-Gaussian process. In this paper we derive an iterative ML as well as an iterative MAP estimation algorithm for the DOA estimation problem under the spherically invariant random process noise assumption, one of the most popular non-Gaussian models, especially in the radar context. Numerical simulation results are provided to assess our proposed algorithms and to show their advantage in terms of performance over the conventional ML algorithm.
- This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than the sample size, and it automatically accounts for the dependence within the de-sparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with margin screening (Fan and Lv 2008) to enhance its power in sparse testing with a reduced computational cost, or with the step-down method (Romano and Wolf 2005) to provide strong control of the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the pre-specified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies.
- In this paper, we provide a general method to obtain exact solutions for the degree distributions of RBDN with network size decline. First, by stochastic process rules, the steady-state transformation equations and steady-state degree distribution equations are given for $m>2$ and $0<p<1/2$; then the average degree of a network with n nodes is introduced to calculate the degree distribution. In particular, taking $m=3$ as an example, we explain the detailed solving process, in which computer simulation is used to verify our degree distribution solutions. In addition, the tail characteristics of the degree distribution are discussed. Our findings suggest that the degree distributions will exhibit a Poisson tail property for the declining RBDN.
- Sep 29 2015, stat.ME, arXiv:1509.08444v2. Motivated by the likelihood ratio test under the Gaussian assumption, we develop a maximum sum-of-squares test for conducting hypothesis testing on a high dimensional mean vector. The proposed test, which incorporates the dependence among the variables, is designed to ease the computational burden and to maximize the asymptotic power in the likelihood ratio test. A simulation-based approach is developed to approximate the sampling distribution of the test statistic. The validity of the testing procedure is justified under both the null and alternative hypotheses. We further extend the main results to the two-sample problem without the equal covariance assumption. Numerical results suggest that the proposed test can be more powerful than some existing alternatives.
- Motivated by the need to statistically quantify the difference between two spatio-temporal datasets that arise in climate downscaling studies, we propose new tests to detect differences in the covariance operators and their associated characteristics of two functional time series. Our two-sample tests are constructed on the basis of functional principal component analysis and self-normalization, the latter being a new studentization technique recently developed for the inference of a univariate time series. Compared to existing tests, our SN-based tests allow for weak dependence within each sample and are robust to dependence between the two samples in the case of equal sample sizes. Asymptotic properties of the SN-based test statistics are derived under both the null and local alternatives. Through extensive simulations, our SN-based tests are shown to outperform existing alternatives in size, and their powers are found to be respectable. The tests are then applied to gridded climate model outputs and interpolated observations to detect differences in their spatial dynamics.
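Self-normalization replaces a long-run variance estimate (and its bandwidth choice) with a functional of recursive partial sums. A minimal univariate sketch, assuming the classical SN statistic for the mean:

```python
import numpy as np

def self_normalized_stat(x):
    """Self-normalized test statistic for the mean of a univariate series:
    the squared normalized mean, studentized by a functional of the
    recursive partial sums instead of a long-run variance estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = np.cumsum(x)
    num = S[-1] ** 2 / n                                # (sqrt(n) * xbar)^2
    k = np.arange(1, n + 1)
    den = np.sum((S - k / n * S[-1]) ** 2) / n ** 2     # self-normalizer
    return num / den

rng = np.random.default_rng(7)
stat = self_normalized_stat(rng.standard_normal(500))
```

The self-normalizer converges to a non-degenerate functional of Brownian bridge rather than to the long-run variance, so the limiting null distribution is pivotal but nonstandard and its critical values are obtained by simulation.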
- Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction, learning compact representations on standard data sets such as MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application with a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN, which takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST shows that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. This suggests that compressive MBN combines the effectiveness of MBN for unsupervised learning with the effectiveness and efficiency of DNNs for supervised learning.
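The compression framework treats "bootstrap model + application" as a black box and fits a fast supervised student to its input-output map. A minimal sketch with a toy black box and a closed-form ridge-regression student standing in for MBN and the DNN respectively:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    """Toy stand-in for 'bootstrap model + application' (e.g. MBN + classifier)."""
    return (X[:, 0] + X[:, 1] > 0).astype(float)

# Step 1: label (unlabeled) inputs with the slow black box.
X = rng.standard_normal((1000, 5))
y = black_box(X)

# Step 2: fit a small supervised student to the black box's input-output
# mapping (ridge regression here, where the paper uses a DNN).
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ (y - 0.5))

# Step 3: predict with the compressed model only -- a single matrix product.
pred = (X @ W > 0).astype(float)
agreement = (pred == y).mean()
```

The student never needs the black box at prediction time, which is where the orders-of-magnitude speedup comes from.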
- There has been increasing interest in testing the equality of large Pearson correlation matrices. However, in many applications it is more important to test the equality of large rank-based correlation matrices, since they are more robust to outliers and nonlinearity. Unlike the Pearson case, testing the equality of large rank-based correlation matrices has not been well explored and requires new methods and theory. In this paper, we provide a framework for testing the equality of two large U-statistic based correlation matrices, which include rank-based correlation matrices as special cases. Our approach exploits extreme value statistics and the jackknife estimator for uncertainty assessment and is valid under a fully nonparametric model. Theoretically, we develop a theory for testing the equality of U-statistic based correlation matrices. We then apply this theory to the problem of testing large Kendall's tau correlation matrices and demonstrate its optimality. For proving this optimality, a novel construction of least favourable distributions is developed for the correlation matrix comparison.
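As a concrete instance, Kendall's tau is a pairwise-sign U-statistic. A small sketch that forms two tau matrices and the entrywise max-difference statistic; the jackknife studentization and the extreme-value calibration used in the paper are omitted:

```python
import numpy as np
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau as a U-statistic: average sign concordance over pairs."""
    n = len(x)
    s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            for i, j in combinations(range(n), 2))
    return 2.0 * s / (n * (n - 1))

def tau_matrix(X):
    """Kendall's tau correlation matrix of the columns of X."""
    p = X.shape[1]
    T = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            T[a, b] = T[b, a] = kendall_tau(X[:, a], X[:, b])
    return T

rng = np.random.default_rng(3)
X1 = rng.standard_normal((60, 4))
X2 = rng.standard_normal((60, 4))
stat = np.abs(tau_matrix(X1) - tau_matrix(X2)).max()   # max-type comparison
```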
- Feb 02 2015 stat.ME arXiv:1501.07815v1. Aiming at abundant scientific and engineering data with not only high dimensionality but also complex structure, we study the regression problem with a multidimensional array (tensor) response and a vector predictor. Applications include, among others, comparing tensor images across groups after adjusting for additional covariates, which is of central interest in neuroimaging analysis. We propose parsimonious tensor response regression adopting a generalized sparsity principle. It models all voxels of the tensor response jointly, while accounting for the inherent structural information among the voxels. It effectively reduces the number of free parameters, leading to feasible computation and improved interpretation. We achieve model estimation through a nascent technique called the envelope method, which identifies the immaterial information and focuses the estimation based upon the material information in the tensor response. We demonstrate that the resulting estimator is asymptotically efficient, and it enjoys a competitive finite sample performance. We also illustrate the new method on two real neuroimaging studies.
- Dec 23 2014 stat.ME arXiv:1412.6592v1. In an increasing number of neuroimaging studies, brain images, which are in the form of multidimensional arrays (tensors), have been collected on multiple subjects at multiple time points. Of scientific interest is to analyze such massive and complex longitudinal images to diagnose neurodegenerative disorders and to identify disease relevant brain regions. In this article, we treat those problems in a unifying regression framework with image predictors, and propose tensor generalized estimating equations (GEE) for longitudinal imaging analysis. The GEE approach takes into account intra-subject correlation of responses, whereas a low rank tensor decomposition of the coefficient array enables effective estimation and prediction with limited sample size. We propose an efficient estimation algorithm, study the asymptotics in both fixed $p$ and diverging $p$ regimes, and also investigate tensor GEE with regularization that is particularly useful for region selection. The efficacy of the proposed tensor GEE is demonstrated on both simulated data and a real data set from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
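The parameter saving from a low-rank coefficient array is easy to see in miniature: a rank-1 CP coefficient B = b1 ∘ b2 turns a p1·p2-parameter inner product ⟨B, X⟩ into a bilinear form with only p1+p2 parameters. A simplified rank-1, matrix-valued illustration of the decomposition idea used in tensor GEE:

```python
import numpy as np

rng = np.random.default_rng(4)
b1 = rng.standard_normal(4)        # 4 parameters
b2 = rng.standard_normal(5)        # 5 parameters
B = np.outer(b1, b2)               # implied 4 x 5 = 20-entry coefficient array

X = rng.standard_normal((4, 5))    # one image (matrix) predictor
lin_pred = np.sum(B * X)           # <B, X>, the linear predictor in the GEE
bilinear = b1 @ X @ b2             # same value, never forming B explicitly
```

Higher CP ranks simply sum several such rank-1 terms, and the same factorization extends to tensors of any order.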
- In (\citezhang2014nonlinear,zhang2014nonlinear2), we viewed machine learning as a coding and dimensionality reduction problem, and proposed a simple unsupervised dimensionality reduction method, entitled deep distributed random samplings (DDRS). In this paper, we extend it incrementally to supervised learning. The key idea is to incorporate label information into the coding process by reformulating DDRS so that each center has multiple output units indicating the class to which the center belongs. The supervised method appears somewhat similar to random forests (\citebreiman2001random); we emphasize their differences as follows. (i) Each layer of our method relates a subset of the training data points to all training data points, while random forests build each decision tree on only a subset of the training data points independently. (ii) Our method builds a gradually narrowed network by sampling fewer and fewer data points, while random forests build a gradually narrowed network by merging subclasses. (iii) Our method is trained directly from the bottom layer to the top layer, while random forests build each tree from the top layer to the bottom layer by splitting. (iv) Our method encodes output targets implicitly in sparse codes, while random forests encode output targets by remembering the class attributes of the activated nodes. Therefore, our method is a simpler, more direct, and possibly better alternative, though both methods use two very basic elements---randomization and nearest neighbor optimization---as their core. This preprint is used to protect the incremental idea from (\citezhang2014nonlinear,zhang2014nonlinear2). A full empirical evaluation will be reported later.
- We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
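The periodic-averaging scheme itself is simple to sketch. Below is a toy version using plain SGD on a noiseless least-squares problem; the natural-gradient preconditioning that makes the scheme work well in Kaldi is omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
w_true = np.array([1.0, -2.0])

def make_shard(n):
    """One worker's private data: noiseless linear-regression pairs."""
    X = rng.standard_normal((n, 2))
    return [(x, x @ w_true) for x in X]

def sgd_steps(w, shard, lr=0.1):
    """Plain SGD on squared error over one worker's shard."""
    for x, y in shard:
        w = w - lr * (w @ x - y) * x
    return w

w = np.zeros(2)
for sync in range(20):                        # one sync per "minute" of training
    # every worker restarts from the averaged parameters and sees its own data
    worker_ws = [sgd_steps(w.copy(), make_shard(50)) for _ in range(4)]
    w = np.mean(worker_ws, axis=0)            # periodic parameter averaging
```

Averaging only every few hundred steps keeps network traffic low, at the cost of letting the workers' parameters drift apart between synchronizations.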
- Structured sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis; however, it also creates significant challenges for efficient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving structured sparse optimization problems, demonstrating that, with some enhancements, it can provide a more efficient alternative to current state-of-the-art approaches. After providing a comprehensive overview of the convergence properties of GCG, we develop efficient methods for evaluating polar operators, a subroutine required in each GCG iteration. In particular, we show how the polar operator can be efficiently evaluated in two important scenarios: dictionary learning and structured sparse estimation. A further improvement is achieved by interleaving GCG with fixed-rank local subspace optimization. A series of experiments on matrix completion, multi-class classification, multi-view dictionary learning and overlapping group lasso shows that the proposed method can significantly reduce the training cost relative to current alternatives.
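For intuition, GCG generalizes the conditional-gradient (Frank-Wolfe) template, whose per-iteration subproblem is a linear minimization (the "polar operator" for gauge regularizers). A minimal sketch over the l1 ball, where that subproblem has a closed form; this is the plain template, not the paper's enhanced algorithm:

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius=1.0, iters=2000):
    """Conditional gradient over the l1 ball. The linear subproblem puts all
    mass on the coordinate with the largest |gradient| component -- the
    closed-form 'polar operator' for the l1 gauge."""
    x = x0.copy()
    for t in range(iters):
        g = grad_f(x)
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])   # extreme point minimizing <g, s>
        gamma = 2.0 / (t + 2.0)          # standard open-loop step size
        x = (1 - gamma) * x + gamma * s  # stays inside the l1 ball
    return x

# Minimize ||Ax - b||^2 over the l1 ball of radius 1.
rng = np.random.default_rng(5)
A = rng.standard_normal((30, 10))
x_star = np.zeros(10); x_star[3] = 0.8           # sparse target inside the ball
b = A @ x_star
x_hat = frank_wolfe_l1(lambda x: 2 * A.T @ (A @ x - b), np.zeros(10))
resid = np.sum((A @ x_hat - b) ** 2)
```

Each iterate is a convex combination of at most t extreme points, so sparsity comes for free; the paper's interleaved fixed-rank local optimization accelerates exactly this template.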
- The multilayer bootstrap network builds a gradually narrowed multilayer nonlinear network from the bottom up for unsupervised nonlinear dimensionality reduction. Each layer of the network is a nonparametric density estimator consisting of a group of k-centroids clusterings. Each clustering randomly selects data points with randomly selected features as its centroids, and learns a one-hot encoder by one-nearest-neighbor optimization. Geometrically, the nonparametric density estimator at each layer projects the input data space to a uniformly distributed discrete feature space, where the similarity of two data points is measured by the number of nearest centroids they share in common. The multilayer network gradually reduces the nonlinear variations of the data from the bottom up by implicitly building a vast number of hierarchical trees on the original data space. Theoretically, the estimation error caused by the nonparametric density estimator is proportional to the correlation between the clusterings, both of which are reduced by the randomization steps.
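One such layer can be written compactly. The sketch below samples random centroids and one-hot encodes each point by its nearest centroid; the random feature subsampling and the layer-wise stacking are omitted for brevity:

```python
import numpy as np

def mbn_layer(X, n_clusterings=10, k=5, seed=5):
    """One multilayer-bootstrap-network layer: a group of k-centroids
    'clusterings', each built by randomly sampling k data points as centroids
    and one-hot encoding every input by its nearest centroid."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        idx = rng.choice(n, size=k, replace=False)        # random centroids
        C = X[idx]
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        onehot = np.zeros((n, k))
        onehot[np.arange(n), d.argmin(axis=1)] = 1.0      # one-nearest-neighbor
        codes.append(onehot)
    # similarity of two points = number of nearest centroids shared in common
    return np.hstack(codes)

X = np.random.default_rng(8).standard_normal((100, 8))
H = mbn_layer(X)
```

Stacking layers with progressively smaller k is what yields the gradually narrowed network described above.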
- Aug 04 2014 stat.AP arXiv:1408.0095v1. We develop a novel peak detection algorithm for the analysis of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC$\times$GC-TOF MS) data using normal-exponential-Bernoulli (NEB) and mixture probability models. The algorithm first performs baseline correction and denoising simultaneously using the NEB model, which also defines the peak regions. Peaks are then picked using a mixture of probability distributions to handle co-eluting peaks. Peak merging is further carried out based on the mass spectral similarities among peaks within the same peak group. The algorithm is evaluated on experimental data to study the effect of different cutoffs of the conditional Bayes factors and of different mixture models, including Poisson, truncated Gaussian, Gaussian, Gamma and exponentially modified Gaussian (EMG) distributions, and the optimal version is identified through a trial-and-error approach. We then compare the new algorithm with two existing algorithms in terms of compound identification. Data analysis shows that the developed algorithm can detect peaks with lower false discovery rates than the existing algorithms, and that a less complicated peak picking model is a promising alternative to the more complicated and widely used EMG mixture models.
- Many machine learning algorithms minimize a regularized risk, and stochastic optimization is widely used for this task. When working with massive data, it is desirable to perform stochastic optimization in parallel. Unfortunately, many existing stochastic optimization algorithms cannot be parallelized efficiently. In this paper we show that one can rewrite the regularized risk minimization problem as an equivalent saddle-point problem, and propose an efficient distributed stochastic optimization (DSO) algorithm. We prove the algorithm's rate of convergence; remarkably, our analysis shows that the algorithm scales almost linearly with the number of processors. We also verify with empirical evaluations that the proposed algorithm is competitive with other parallel, general purpose stochastic and batch optimization algorithms for regularized risk minimization.
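The reformulation rests on Fenchel duality: writing each loss through its convex conjugate turns the regularized risk into a convex-concave saddle-point problem whose stochastic gradients decouple across data points and coordinates. A generic sketch for a linear model with loss $\ell$ and conjugate $\ell^*$; the paper's exact formulation may differ:

```latex
\min_{w}\; \lambda\,\Omega(w) + \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(w^\top x_i,\, y_i\bigr)
\;=\;
\min_{w}\,\max_{\alpha \in \mathbb{R}^n}\; \lambda\,\Omega(w)
+ \frac{1}{n}\sum_{i=1}^{n} \Bigl(\alpha_i\, w^\top x_i - \ell^*\bigl(\alpha_i,\, y_i\bigr)\Bigr)
```

Here $\ell^*(\cdot, y_i)$ denotes the convex conjugate of $\ell(\cdot, y_i)$ in its first argument; the identity uses $\ell(u, y) = \max_{\alpha}\,(\alpha u - \ell^*(\alpha, y))$ for each closed convex loss, so the maximization separates over $i$.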
- This article studies bootstrap inference for high dimensional weakly dependent time series in a general framework of approximately linear statistics. The following high dimensional applications are covered: (1) uniform confidence bands for the mean vector; (2) specification testing of second order properties of a time series, such as white noise testing and bandedness testing of the covariance matrix; (3) specification testing of spectral properties of a time series. In theory, we first derive a Gaussian approximation result for the maximum of a sum of weakly dependent vectors, where the dimension of the vectors is allowed to be exponentially larger than the sample size. In particular, we illustrate an interesting interplay between dependence and dimensionality, and also discuss one type of "dimension free" dependence structure. We further propose a blockwise multiplier (wild) bootstrap that works for time series with unknown autocovariance structure. The distributional approximation errors, which are valid in finite samples, decrease polynomially in the sample size. A non-overlapping block bootstrap is also studied as a more flexible alternative. The above results are established under the general physical/functional dependence framework proposed in Wu (2005). Our work can be viewed as a substantive extension of Chernozhukov et al. (2013) to time series, based on a variant of Stein's method developed therein.
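The blockwise multiplier bootstrap can be sketched directly: one Gaussian multiplier is shared by all observations within a block, so short-range dependence inside blocks is preserved without estimating the autocovariance. A toy version, assuming the target is the maximum coordinate of the normalized sum:

```python
import numpy as np

def block_multiplier_bootstrap(X, block_len=10, n_boot=500, seed=0):
    """Blockwise multiplier (wild) bootstrap for the distribution of
    max_j |n^{-1/2} sum_i X_ij|. Multipliers are constant within each
    non-overlapping block of length `block_len`."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = (n // block_len) * block_len          # drop an incomplete last block
    Xc = (X - X.mean(axis=0))[:m]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # one N(0,1) multiplier per block, repeated across the block
        e = np.repeat(rng.standard_normal(m // block_len), block_len)
        stats[b] = np.abs((Xc * e[:, None]).sum(axis=0)).max() / np.sqrt(m)
    return stats

rng = np.random.default_rng(6)
boot = block_multiplier_bootstrap(rng.standard_normal((200, 30)))
crit = np.quantile(boot, 0.95)
```

The block length trades off bias (too short misses dependence) against variance (too long leaves few effective blocks), mirroring the bandwidth choice in long-run variance estimation.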
- Mar 18 2014 stat.ME arXiv:1403.4138v1. Envelopes were recently proposed as methods for reducing estimative variation in multivariate linear regression. Estimation of an envelope usually involves optimization over Grassmann manifolds. We propose a fast and widely applicable one-dimensional (1D) algorithm for estimating an envelope in general. We reveal an important structural property of envelopes that facilitates our algorithm, and we prove both Fisher consistency and root-n-consistency of the algorithm.