results for au:Wang_Y in:stat

- The eigendeomposition of nearest-neighbor (NN) graph Laplacian matrices is the main computational bottleneck in spectral clustering. In this work, we introduce a highly-scalable, spectrum-preserving graph sparsification algorithm that enables to build ultra-sparse NN (u-NN) graphs with guaranteed preservation of the original graph spectrums, such as the first few eigenvectors of the original graph Laplacian. Our approach can immediately lead to scalable spectral clustering of large data networks without sacrificing solution quality. The proposed method starts from constructing low-stretch spanning trees (LSSTs) from the original graphs, which is followed by iteratively recovering small portions of "spectrally critical" off-tree edges to the LSSTs by leveraging a spectral off-tree embedding scheme. To determine the suitable amount of off-tree edges to be recovered to the LSSTs, an eigenvalue stability checking scheme is proposed, which enables to robustly preserve the first few Laplacian eigenvectors within the sparsified graph. Additionally, an incremental graph densification scheme is proposed for identifying extra edges that have been missing in the original NN graphs but can still play important roles in spectral clustering tasks. Our experimental results for a variety of well-known data sets show that the proposed method can dramatically reduce the complexity of NN graphs, leading to significant speedups in spectral clustering.
- Oct 10 2017 stat.CO arXiv:1710.02588v1We consider linear structural equation models that are associated with mixed graphs. The structural equations in these models only involve observed variables, but their idiosyncratic error terms are allowed to be correlated and non-Gaussian. We propose empirical likelihood (EL) procedures for inference, and suggest several modifications, including a profile likelihood, in order to improve tractability and performance of the resulting methods. Through simulations, we show that when the error distributions are non-Gaussian, the use of EL and the proposed modifications may increase statistical efficiency and improve assessment of significance.
- When using stochastic gradient descent to solve large-scale machine learning problems, a common practice of data processing is to shuffle the training data, partition the data across multiple machines if needed, and then perform several epochs of training on the re-shuffled (either locally or globally) data. The above procedure makes the instances used to compute the gradients no longer independently sampled from the training data set. Then does the distributed SGD method have desirable convergence properties in this practical situation? In this paper, we give answers to this question. First, we give a mathematical formulation for the practical data processing procedure in distributed machine learning, which we call data partition with global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. We prove that SGD with global shuffling has convergence guarantee in both convex and non-convex cases. An interesting finding is that, the non-convex tasks like deep learning are more suitable to apply shuffling comparing to the convex tasks. Second, we conduct the convergence analysis for SGD with local shuffling. The convergence rate for local shuffling is slower than that for global shuffling, since it will lose some information if there's no communication between partitioned data. Finally, we consider the situation when the permutation after shuffling is not uniformly distributed (insufficient shuffling), and discuss the condition under which this insufficiency will not influence the convergence rate. Our theoretical results provide important insights to large-scale machine learning, especially in the selection of data processing methods in order to achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks.
- We present a scalable and robust Bayesian inference method for linear state space models. The method is applied to demand forecasting in the context of a large e-commerce platform, paying special attention to intermittent and bursty target statistics. Inference is approximated by the Newton-Raphson algorithm, reduced to linear-time Kalman smoothing, which allows us to operate on several orders of magnitude larger problems than previous related work. In a study on large real-world sales datasets, our method outperforms competing approaches on fast and medium moving items.
- Although deep Convolutional Neural Network (CNN) has shown better performance in various machine learning tasks, its application is accompanied by a significant increase in storage and computation. Among CNN simplification techniques, parameter pruning is a promising approach which aims at reducing the number of weights of various layers without intensively reducing the original accuracy. In this paper, we propose a novel progressive parameter pruning method, named Structured Probabilistic Pruning (SPP), which efficiently prunes weights of convolutional layers in a probabilistic manner. Unlike existing deterministic pruning approaches, in which the pruned weights of a well-trained model are permanently eliminated, SPP utilizes the relative importance of weights during training iterations, which makes the pruning procedure more accurate by leveraging the accumulated weight importance. Specifically, we introduce an effective weight competition mechanism to emphasize the important weights and gradually undermine the unimportant ones. Experiments indicate that our proposed method has obtained superior performance on ConvNet and AlexNet compared with existing pruning methods. Our pruned AlexNet achieves 4.0 $\sim$ 8.9x (averagely 5.8x) layer-wise speedup in convolutional layers with only 1.3\% top-5 error increase on the ImageNet-2012 validation dataset. We also prove the effectiveness of our method on transfer learning scenarios using AlexNet.
- In this note we prove a tight lower bound for the MNL-bandit assortment selection model that matches the upper bound given in (Agrawal et al., 2016) for all parameters, up to logarithmic factors.
- Sep 18 2017 stat.ML arXiv:1709.05216v1We consider the problem of sequentially making decisions that are rewarded by "successes" and "failures" which can be predicted through an unknown relationship that depends on a partially controllable vector of attributes for each instance. The learner takes an active role in selecting samples from the instance pool. The goal is to maximize the probability of success in either offline (training) or online (testing) phases. Our problem is motivated by real-world applications where observations are time-consuming and/or expensive. We develop a knowledge gradient policy using an online Bayesian linear classifier to guide the experiment by maximizing the expected value of information of labeling each alternative. We provide a finite-time analysis of the estimated error and show that the maximum likelihood estimator based produced by the KG policy is consistent and asymptotically normal. We also show that the knowledge gradient policy is asymptotically optimal in an offline setting. This work further extends the knowledge gradient to the setting of contextual bandits. We report the results of a series of experiments that demonstrate its efficiency.
- Sep 04 2017 stat.ML arXiv:1709.00379v1Sparse alpha-norm regularization has many data-rich applications in marketing and economics. In contrast to traditional lasso and ridge regularization, the alpha-norm penalty has the property of jumping to a sparse solution. This is an attractive feature for ultra high-dimensional problems that occur in market demand estimation and forecasting. The underlying nonconvex regularization problem is solved via coordinate descent, and a proximal operator. To illustrate our methodology, we study a classic demand forecasting problem of Bajari, Nekipelov, Ryan, and Yang (2015a). On the empirical side, we find many strong sparse predictors, including price, equivalized volume, promotion, flavor scent, and brand effects. Benchmark methods including linear regression, ridge, lasso and elastic net, are used in an out-of-sample forecasting study. In particular, alpha-norm regularization provides accurate estimates for the promotion effects. Finally, we conclude with directions for future research.
- We study the problem of optimal subset selection from a set of correlated random variables. In particular, we consider the associated combinatorial optimization problem of maximizing the determinant of a symmetric positive definite matrix that characterizes the chosen subset. This problem arises in many domains, such as experimental designs, regression modeling, and environmental statistics. We establish an efficient polynomial-time algorithm using Determinantal Point Process for approximating the optimal solution to the problem. We demonstrate the advantages of our methods by presenting computational results for both synthetic and real data sets.
- An efficient structural identifiability analysis algorithm is developed in this study for a broad range of network structures. The proposed method adopts the Wright's path coefficient method to generate identifiability equations in forms of symbolic polynomials, and then converts these symbolic equations to binary matrices (called identifiability matrix). Several matrix operations are introduced for identifiability matrix reduction with system equivalency maintained. Based on the reduced identifiability matrices, the structural identifiability of each parameter is determined. A number of benchmark models are used to verify the validity of the proposed approach. Finally, the network module for influenza A virus replication is employed as a real example to illustrate the application of the proposed approach in practice. The proposed approach can deal with cyclic networks with latent variables. The key advantage is that it intentionally avoids symbolic computation and is thus highly efficient. Also, this method is capable of determining the identifiability of each single parameter and is thus of higher resolution in comparison with many existing approaches. Overall, this study provides a basis for systematic examination and refinement of graphical models of biological networks from the identifiability point of view, and it has a significant potential to be extended to more complex network structures or high-dimensional systems.
- We consider a non-stationary sequential stochastic optimization problem, in which the underlying cost functions change over time under a variation budget constraint. We propose an $L_{p,q}$-variation functional to quantify the change, which captures local spatial and temporal variations of the sequence of functions. Under the $L_{p,q}$-variation functional constraint, we derive both upper and matching lower regret bounds for smooth and strongly convex function sequences, which generalize previous results in (Besbes et al., 2015). Our results reveal some surprising phenomena under this general variation functional, such as the curse of dimensionality of the function domain. The key technical novelties in our analysis include an affinity lemma that characterizes the distance of the minimizers of two convex functions with bounded $L_p$ difference, and a cubic spline based construction that attains matching lower bounds.
- Aug 10 2017 stat.ML arXiv:1708.02883v1Consider a structured matrix factorization scenario where one factor is modeled to have columns lying in the unit simplex. Such a simplex-structured matrix factorization (SSMF) problem has spurred much interest in key topics such as hyperspectral unmixing in remote sensing and topic discovery in machine learning. In this paper we develop a new theoretical framework for SSMF. The idea is to study a maximum volume ellipsoid inscribed in the convex hull of the data points, which has not been attempted in prior literature. We show a sufficient condition under which this maximum volume inscribed ellipsoid (MVIE) framework can guarantee exact recovery of the factors. The condition derived is much better than that of separable non-negative matrix factorization (or pure-pixel search) and is comparable to that of another powerful framework called minimum volume enclosing simplex. From the MVIE framework we also develop an algorithm that uses facet enumeration and convex optimization to achieve the aforementioned recovery result. Numerical results are presented to demonstrate the potential of this new theoretical SSMF framework.
- Aug 08 2017 stat.AP arXiv:1708.01948v1Recent research in Aerosol Optical Depth (AOD) retrieval algorithms for Multi-angle Imaging SpectroRadiometer (MISR) proposed a hierarchical Bayesian model. However the inference algorithm used in their work was Markov Chain Monte Carlo (MCMC), which was reported prohibitively slow. The poor speed of MCMC dramatically limited the production feasibility of the Bayesian framework if large scale (e.g. global scale) of aerosol retrieval is desired. In this paper, we present an alternative optimization method to mitigate the speed problem. In particular we adopt Maximize a Posteriori (MAP) approach, and apply a gradient-free "hill-climbing" algorithm: the coordinate-wise stochastic-search. Our method has shown to be much (about 100 times) faster than MCMC, easier to converge, and insensitive to hyper parameters. To further scale our approach, we parallelized our method using Apache Spark, which achieves linear speed-up w.r.t number of CPU cores up to 16. Due to these efforts, we are able to retrieve AOD at much finer resolution (1.1km) with a tiny fraction of time consumption compared with existing methods. During our research, we find that in low AOD levels, the Bayesian network tends to produce overestimated retrievals. We also find that high absorbing aerosol types are retrieved at the same time. This is likely caused by the Dirichlet prior for aerosol types, as it is shown to encourage selecting absorbing types in practice. After changing Dirichlet to uniform, the AOD retrievals show excellent agreement with ground measurement in all levels.
- This paper gives the exact solution in terms of the Karhunen-Loève expansion to a fractional stochastic partial differential equation on the unit sphere $\mathbb{S}^{2}\subset \mathbb{R}^{3}$ with fractional Brownian motion as driving noise and with random initial condition given by a fractional stochastic Cauchy problem. A numerical approximation to the solution is given by truncating the Karhunen-Loève expansion. We show the convergence rates of the truncation errors in degree and the mean square approximation errors in time. Numerical examples using an isotropic Gaussian random field as initial condition and simulations of evolution of cosmic microwave background (CMB) are shown to illustrate the theoretical results.
- Genome-wide chromosome conformation capture techniques such as Hi-C enable the generation of 3D genome contact maps and offer new pathways toward understanding the spatial organization of genome. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted; whereas the nodes in this case, i.e. the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this non-exchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types. The code is available upon request.
- Compressing convolutional neural networks (CNNs) is essential for transferring the success of CNNs to a wide variety of applications to mobile devices. In contrast to directly recognizing subtle weights or filters as redundant in a given CNN, this paper presents an evolutionary method to automatically eliminate redundant convolution filters. We represent each compressed network as a binary individual of specific fitness. Then, the population is upgraded at each evolutionary iteration using genetic operations. As a result, an extremely compact CNN is generated using the fittest individual. In this approach, either large or small convolution filters can be redundant, and filters in the compressed network are more distinct. In addition, since the number of filters in each convolutional layer is reduced, the number of filter channels and the size of feature maps are also decreased, naturally improving both the compression and speed-up ratios. Experiments on benchmark deep CNN models suggest the superiority of the proposed algorithm over the state-of-the-art compression methods.
- Differential privacy (DP), ever since its advent, has been a controversial object. On the one hand, it provides strong provable protection of individuals in a data set, on the other hand, it has been heavily criticized for being not practical, partially due to its complete independence to the actual data set it tries to protect. In this paper, we address this issue by a new and more fine-grained notion of differential privacy --- per instance differential privacy (pDP), which captures the privacy of a specific individual with respect to a fixed data set. We show that this is a strict generalization of the standard DP and inherits all its desirable properties, e.g., composition, invariance to side information and closedness to postprocessing, except that they all hold for every instance separately. When the data is drawn from a distribution, we show that per-instance DP implies generalization. Moreover, we provide explicit calculations of the per-instance DP for the output perturbation on a class of smooth learning problems. The result reveals an interesting and intuitive fact that an individual has stronger privacy if he/she has small "leverage score" with respect to the data set and if he/she can be predicted more accurately using the leave-one-out data set. Using the developed techniques, we provide a novel analysis of the One-Posterior-Sample (OPS) estimator and show that when the data set is well-conditioned it provides $(\epsilon,\delta)$-pDP for any target individuals and matches the exact lower bound up to a $1+\tilde{O}(n^{-1}\epsilon^{-2})$ multiplicative factor. We also propose AdaOPS which uses adaptive regularization to achieve the same results with $(\epsilon,\delta)$-DP. Simulation shows several orders-of-magnitude more favorable privacy and utility trade-off when we consider the privacy of only the users in the data set.
- Jul 17 2017 stat.ML arXiv:1707.04368v1In this study, we tested the interaction effect of multimodal datasets using a novel method called the kernel method for detecting higher order interactions among biologically relevant mulit-view data. Using a semiparametric method on a reproducing kernel Hilbert space (RKHS), we used a standard mixed-effects linear model and derived a score-based variance component statistic that tests for higher order interactions between multi-view data. The proposed method offers an intangible framework for the identification of higher order interaction effects (e.g., three way interaction) between genetics, brain imaging, and epigenetic data. Extensive numerical simulation studies were first conducted to evaluate the performance of this method. Finally, this method was evaluated using data from the Mind Clinical Imaging Consortium (MCIC) including single nucleotide polymorphism (SNP) data, functional magnetic resonance imaging (fMRI) scans, and deoxyribonucleic acid (DNA) methylation data, respectfully, in schizophrenia patients and healthy controls. We treated each gene-derived SNPs, region of interest (ROI) and gene-derived DNA methylation as a single testing unit, which are combined into triplets for evaluation. In addition, cardiovascular disease risk factors such as age, gender, and body mass index were assessed as covariates on hippocampal volume and compared between triplets. Our method identified $13$-triplets ($p$-values $\leq 0.001$) that included $6$ gene-derived SNPs, $10$ ROIs, and $6$ gene-derived DNA methylations that correlated with changes in hippocampal volume, suggesting that these triplets may be important in explaining schizophrenia-related neurodegeneration. With strong evidence ($p$-values $\leq 0.000001$), the triplet (\bf MAGI2, CRBLCrus1.L, FBXO28) has the potential to distinguish schizophrenia patients from the healthy control variations.
- In this paper, we consider an estimation problem concerning the matrix of correlation coefficients in context of high dimensional data settings. In particular, we revisit some results in Li and Rolsalsky [Li, D. and Rolsalsky, A. (2006). Some strong limit theorems for the largest entries of sample correlation matrices, The Annals of Applied Probability, 16, 1, 423-447]. Four of the main theorems of Li and Rolsalsky (2006) are established in their full generalities and we simplify substantially some proofs of the quoted paper. Further, we generalize a theorem which is useful in deriving the existence of the pth moment as well as in studying the convergence rates in law of large numbers.
- It is known that Boosting can be interpreted as a gradient descent technique to minimize an underlying loss function. Specifically, the underlying loss being minimized by the traditional AdaBoost is the exponential loss, which is proved to be very sensitive to random noise/outliers. Therefore, several Boosting algorithms, e.g., LogitBoost and SavageBoost, have been proposed to improve the robustness of AdaBoost by replacing the exponential loss with some designed robust loss functions. In this work, we present a new way to robustify AdaBoost, i.e., incorporating the robust learning idea of Self-paced Learning (SPL) into Boosting framework. Specifically, we design a new robust Boosting algorithm based on SPL regime, i.e., SPLBoost, which can be easily implemented by slightly modifying off-the-shelf Boosting packages. Extensive experiments and a theoretical characterization are also carried out to illustrate the merits of the proposed SPLBoost.
- Classical matrix perturbation results, such as Weyl's theorem for eigenvalues and the Davis-Kahan theorem for eigenvectors, are general purpose. These classical bounds are tight in the worst case, but in many settings sub-optimal in the typical case. In this paper, we present perturbation bounds which consider the nature of the perturbation and its interaction with the unperturbed structure in order to obtain significant improvements over the classical theory in many scenarios, such as when the perturbation is random. We demonstrate the utility of these new results by analyzing perturbations in the stochastic blockmodel where we derive much tighter bounds than provided by the classical theory. We use our new perturbation theory to show that a very simple and natural clustering algorithm -- whose analysis was difficult using the classical tools -- nevertheless recovers the communities of the blockmodel exactly even in very sparse graphs.
- Motivated by applications such as autonomous vehicles, test-time attacks via adversarial examples have received a great deal of recent attention. In this setting, an adversary is capable of making queries to a classifier, and perturbs a test example by a small amount in order to force the classifier to report an incorrect label. While a long line of work has explored a number of attacks, not many reliable defenses are known, and there is an overall lack of general understanding about the foundations of designing machine learning algorithms robust to adversarial examples. In this paper, we take a step towards addressing this challenging question by introducing a new theoretical framework, analogous to bias-variance theory, which we can use to tease out the causes of vulnerability. We apply our framework to a simple classification algorithm: nearest neighbors, and analyze its robustness to adversarial examples. Motivated by our analysis, we propose a modified version of the nearest neighbor algorithm, and demonstrate both theoretically and empirically that it has superior robustness to standard nearest neighbors.
- Learning Bayesian networks using both observational and interventional data is now a fundamentally important problem due to recent technological developments in genomics that generate such single-cell gene expression data at a very large scale. In order to utilize this data for learning gene regulatory networks, efficient and reliable causal inference algorithms are needed that can make use of both observational and interventional data. In this paper, we present two algorithms of this type and prove that both are consistent under the faithfulness assumption. These algorithms are interventional adaptations of the Greedy SP algorithm and are the first algorithms using both observational and interventional data with consistency guarantees. Moreover, these algorithms have the advantage that they are non-parametric, which makes them useful also for analyzing non-Gaussian data. In this paper, we present these two algorithms and their consistency guarantees, and we analyze their performance on simulated data, protein signaling data, and single-cell gene expression data.
- May 29 2017 stat.ME arXiv:1705.09591v1In genetic epidemiological studies, family history data are collected on relatives of study participants and used to estimate the age-specific risk of disease for individuals who carry a causal mutation. However, a family member's genotype data may not be collected due to the high cost of in-person interview to obtain blood sample or death of a relative. Previously, efficient nonparametric genotype-specific risk estimation in censored mixture data has been proposed without considering covariates. With multiple predictive risk factors available, risk estimation requires a multivariate model to account for additional covariates that may affect disease risk simultaneously. Therefore, it is important to consider the role of covariates in the genotype-specific distribution estimation using family history data. We propose an estimation method that permits more precise risk prediction by controlling for individual characteristics and incorporating interaction effects with missing genotypes in relatives, and thus gene-gene interactions and gene-environment interactions can be handled within the framework of a single model. We examine performance of the proposed methods by simulations and apply them to estimate the age-specific cumulative risk of Parkinson's disease (PD) in carriers of LRRK2 G2019S mutation using first-degree relatives who are at genetic risk for PD. The utility of estimated carrier risk is demonstrated through designing a future clinical trial under various assumptions. Such sample size estimation is seen in the Huntington's disease literature using the length of abnormal expansion of a CAG repeat in the HTT gene, but is less common in the PD literature.
- This paper presents two unsupervised learning layers (UL layers) for label-free video analysis: one for fully connected layers, and the other for convolutional ones. The proposed UL layers can play two roles: they can be the cost function layer for providing global training signal; meanwhile they can be added to any regular neural network layers for providing local training signals and combined with the training signals backpropagated from upper layers for extracting both slow and fast changing features at layers of different depths. Therefore, the UL layers can be used in either pure unsupervised or semi-supervised settings. Both a closed-form solution and an online learning algorithm for two UL layers are provided. Experiments with unlabeled synthetic and real-world videos demonstrated that the neural networks equipped with UL layers and trained with the proposed online learning algorithm can extract shape and motion information from video sequences of moving objects. The experiments demonstrated the potential applications of UL layers and online learning algorithm to head orientation estimation and moving object localization.
- In voxel-based neuroimage analysis, lesion features have been the main focus in disease prediction due to their interpretability with respect to the related diseases. However, we observe that there exists another type of features introduced during the preprocessing steps and we call them "\textbfProcedural Bias". Besides, such bias can be leveraged to improve classification accuracy. Nevertheless, most existing models suffer from either under-fit without considering procedural bias or poor interpretability without differentiating such bias from lesion ones. In this paper, a novel dual-task algorithm namely \emphGSplit LBI is proposed to resolve this problem. By introducing an augmented variable enforced to be structural sparsity with a variable splitting term, the estimators for prediction and selecting lesion features can be optimized separately and mutually monitored by each other following an iterative scheme. Empirical experiments have been evaluated on the Alzheimer's Disease Neuroimaging Initiative\thinspace(ADNI) database. The advantage of proposed model is verified by improved stability of selected lesion features and better classification results.
- Sparsity helps reduce the computational complexity of deep neural networks by skipping zeros. Taking advantage of sparsity is listed as a high priority in next generation DNN accelerators such as TPU. The structure of sparsity, i.e., the granularity of pruning, affects the efficiency of hardware accelerator design as well as the prediction accuracy. Coarse-grained pruning creates regular sparsity patterns, making it more amenable for hardware acceleration but more challenging to maintain the same accuracy. In this paper we quantitatively measure the trade-off between sparsity regularity and prediction accuracy, providing insights in how to maintain accuracy while having more a more structured sparsity pattern. Our experimental results show that coarse-grained pruning can achieve a sparsity ratio similar to unstructured pruning without loss of accuracy. Moreover, due to the index saving effect, coarse-grained pruning is able to obtain a better compression ratio than fine-grained sparsity at the same accuracy threshold. Based on the recent sparse convolutional neural network accelerator (SCNN), our experiments further demonstrate that coarse-grained sparsity saves about 2x the memory references compared to fine-grained sparsity. Since memory reference is more than two orders of magnitude more expensive than arithmetic operations, the regularity of sparse structure leads to more efficient hardware design.
- May 12 2017 stat.ML arXiv:1705.04194v1Many unsupervised kernel methods rely on the estimation of the kernel covariance operator (kernel CO) or kernel cross-covariance operator (kernel CCO). Both kernel CO and kernel CCO are sensitive to contaminated data, even when bounded positive definite kernels are used. To the best of our knowledge, there are few well-founded robust kernel methods for statistical unsupervised learning. In addition, while the influence function (IF) of an estimator can characterize its robustness, asymptotic properties and standard error, the IF of a standard kernel canonical correlation analysis (standard kernel CCA) has not been derived yet. To fill this gap, we first propose a robust kernel covariance operator (robust kernel CO) and a robust kernel cross-covariance operator (robust kernel CCO) based on a generalized loss function instead of the quadratic loss function. Second, we derive the IF for robust kernel CCO and standard kernel CCA. Using the IF of the standard kernel CCA, we can detect influential observations from two sets of data. Finally, we propose a method based on the robust kernel CO and the robust kernel CCO, called \bf robust kernel CCA, which is less sensitive to noise than the standard kernel CCA. The introduced principles can also be applied to many other kernel methods involving kernel CO or kernel CCO. Our experiments on synthesized data and imaging genetics analysis demonstrate that the proposed IF of standard kernel CCA can identify outliers. It is also seen that the proposed robust kernel CCA method performs better for ideal and contaminated data than the standard kernel CCA.
- Reducing the number of false positive discoveries is presently one of the most pressing issues in the life sciences. It is of especially great importance for many applications in neuroimaging and genomics, where datasets are typically high-dimensional, which means that the number of explanatory variables exceeds the sample size. The false discovery rate (FDR) is a criterion that can be employed to address that issue. Thus it has gained great popularity as a tool for testing multiple hypotheses. Canonical correlation analysis (CCA) is a statistical technique that is used to make sense of the cross-correlation of two sets of measurements collected on the same set of samples (e.g., brain imaging and genomic data for the same mental illness patients), and sparse CCA extends the classical method to high-dimensional settings. Here we propose a way of applying the FDR concept to sparse CCA, and a method to control the FDR. The proposed FDR correction directly influences the sparsity of the solution, adapting it to the unknown true sparsity level. Theoretical derivation as well as simulation studies show that our procedure indeed keeps the FDR of the canonical vectors below a user-specified target level. We apply the proposed method to an imaging genomics dataset from the Philadelphia Neurodevelopmental Cohort. Our results link the brain activity during a working memory task, as measured by functional magnetic resonance imaging (fMRI), to the corresponding subjects' genomic data. Our findings are supported by previous work on cognitive ability, neurodevelopmental, and other mental disorders.
- A key challenge for modern Bayesian statistics is how to perform scalable inference of posterior distributions. To address this challenge, VB methods have emerged as a popular alternative to the classical MCMC methods. VB methods tend to be faster while achieving comparable predictive performance. However, there are few theoretical results around VB. In this paper, we establish frequentist consistency and asymptotic normality of VB methods. Specifically, we connect VB methods to point estimates based on variational approximations, called frequentist variational approximations, and we use the connection to prove a variational Bernstein-von-Mises theorem. The theorem leverages the theoretical characterizations of frequentist variational approximations to understand asymptotic properties of VB. In summary, we prove that (1) the VB posterior converges to the KL minimizer of a normal distribution, centered at the truth and (2) the corresponding variational expectation of the parameter is consistent and asymptotically normal. As applications of the theorem, we derive asymptotic properties of VB posteriors in Bayesian mixture models, Bayesian generalized linear mixed models, and Bayesian stochastic block models. We conduct a simulation study to illustrate these theoretical results.
- In observational studies, weight adjustments are often performed by fitting a model for the propensity score and then inverting the predicted propensities, but recently several approaches have been proposed that instead focus on directly balancing the covariates. In this paper, we study the general class of approximate balancing weights (ABW) by drawing a connection to shrinkage estimation. This allows us to establish the large sample properties of ABW by leveraging asymptotic results from propensity score estimation. In particular, we show that under mild technical conditions ABW are consistent estimates of the true inverse probability weights. To the best of our knowledge, this functional consistency of balancing weights has not been established in the literature even for exact balancing. We also show that the resulting weighting estimator is consistent, asymptotically normal, and semiparametrically efficient. In finite samples, we present an oracle inequality that roughly bounds the loss incurred by balancing too many functions of the covariates. In three empirical studies we show that the root mean squared error of the weighting estimator can be reduced by nearly a half by using approximate balancing instead of exact balancing when the data exhibits poor overlap in covariate distributions.
- Many methods for automatic music transcription involves a multi-pitch estimation method that estimates an activity score for each pitch. A second processing step, called note segmentation, has to be performed for each pitch in order to identify the time intervals when the notes are played. In this study, a pitch-wise two-state on/off firstorder Hidden Markov Model (HMM) is developed for note segmentation. A complete parametrization of the HMM sigmoid function is proposed, based on its original regression formulation, including a parameter alpha of slope smoothing and beta? of thresholding contrast. A comparative evaluation of different note segmentation strategies was performed, differentiated according to whether they use a fixed threshold, called "Hard Thresholding" (HT), or a HMM-based thresholding method, called "Soft Thresholding" (ST). This evaluation was done following MIREX standards and using the MAPS dataset. Also, different transcription scenarios and recording natures were tested using three units of the Degradation toolbox. Results show that note segmentation through a HMM soft thresholding with a data-based optimization of the alpha,beta parameter couple significantly enhances transcription performance.
- Genome-wide association studies (GWAS) have achieved great success in the genetic study of Alzheimer's disease (AD). Collaborative imaging genetics studies across different research institutions show the effectiveness of detecting genetic risk factors. However, the high dimensionality of GWAS data poses significant challenges in detecting risk SNPs for AD. Selecting relevant features is crucial in predicting the response variable. In this study, we propose a novel Distributed Feature Selection Framework (DFSF) to conduct the large-scale imaging genetics studies across multiple institutions. To speed up the learning process, we propose a family of distributed group Lasso screening rules to identify irrelevant features and remove them from the optimization. Then we select the relevant group features by performing the group Lasso feature selection process in a sequence of parameters. Finally, we employ the stability selection to rank the top risk SNPs that might help detect the early stage of AD. To the best of our knowledge, this is the first distributed feature selection model integrated with group Lasso feature selection as well as detecting the risk genetic factors across multiple research institutions system. Empirical studies are conducted on 809 subjects with 5.9 million SNPs which are distributed across several individual institutions, demonstrating the efficiency and effectiveness of the proposed method.
- Recently low displacement rank (LDR) matrices, or so-called structured matrices, have been proposed to compress large-scale neural networks. Empirical results have shown that neural networks with weight matrices of LDR matrices, referred as LDR neural networks, can achieve significant reduction in space and computational complexity while retaining high accuracy. We formally study LDR matrices in deep learning. First, we prove the universal approximation property of LDR neural networks with a mild condition on the displacement operators. We then show that the error bounds of LDR neural networks are as efficient as general neural networks with both single-layer and multiple-layer structure. Finally, we propose back-propagation based training algorithm for general LDR neural networks.
- Segmental structure is a common pattern in many types of sequences such as phrases in human languages. In this paper, we present a probabilistic model for sequences via their segmentations. The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks. Since the segmentation of a sequence is usually unknown in advance, we sum over all valid segmentations to obtain the final probability for the sequence. An efficient dynamic programming algorithm is developed for forward and backward computations without resorting to any approximation. We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts.
- We show that given an estimate $\widehat{A}$ that is close to a general high-rank positive semi-definite (PSD) matrix $A$ in spectral norm (i.e., $\|\widehat{A}-A\|_2 \leq \delta$), the simple truncated SVD of $\widehat{A}$ produces a multiplicative approximation of $A$ in Frobenius norm. This observation leads to many interesting results on general high-rank matrix estimation problems, which we briefly summarize below ($A$ is an $n\times n$ high-rank PSD matrix and $A_k$ is the best rank-$k$ approximation of $A$): (1) High-rank matrix completion: By observing $\Omega(\frac{n\max\{\epsilon^{-4},k^2\}\mu_0^2\|A\|_F^2\log n}{\sigma_{k+1}(A)^2})$ elements of $A$ where $\sigma_{k+1}\left(A\right)$ is the $\left(k+1\right)$-th singular value of $A$ and $\mu_0$ is the incoherence, the truncated SVD on a zero-filled matrix satisfies $\|\widehat{A}_k-A\|_F \leq (1+O(\epsilon))\|A-A_k\|_F$ with high probability. (2)High-rank matrix de-noising: Let $\widehat{A}=A+E$ where $E$ is a Gaussian random noise matrix with zero mean and $\nu^2/n$ variance on each entry. Then the truncated SVD of $\widehat{A}$ satisfies $\|\widehat{A}_k-A\|_F \leq (1+O(\sqrt{\nu/\sigma_{k+1}(A)}))\|A-A_k\|_F + O(\sqrt{k}\nu)$. (3) Low-rank Estimation of high-dimensional covariance: Given $N$ i.i.d.~samples $X_1,\cdots,X_N\sim\mathcal N_n(0,A)$, can we estimate $A$ with a relative-error Frobenius norm bound? We show that if $N = \Omega\left(n\max\{\epsilon^{-4},k^2\}\gamma_k(A)^2\log N\right)$ for $\gamma_k(A)=\sigma_1(A)/\sigma_{k+1}(A)$, then $\|\widehat{A}_k-A\|_F \leq (1+O(\epsilon))\|A-A_k\|_F$ with high probability, where $\widehat{A}=\frac{1}{N}\sum_{i=1}^N{X_iX_i^\top}$ is the sample covariance.
- Bayesian networks, or directed acyclic graph (DAG) models, are widely used to represent complex causal systems. Since the basic task of learning a Bayesian network from data is NP-hard, a standard approach is greedy search over the space of DAGs or Markov equivalent DAGs. Since the space of DAGs on p nodes and the associated space of Markov equivalence classes are both much larger than the space of permutations, it is desirable to consider permutation-based searches. We here provide the first consistency guarantees, both uniform and high-dimensional, of a permutation-based greedy search. Geometrically, this search corresponds to a simplex-type algorithm on a sub-polytope of the permutohedron, the DAG associahedron. Every vertex in this polytope is associated with a DAG, and hence with a collection of permutations that are consistent with the DAG ordering. A walk is performed on the edges of the polytope maximizing the sparsity of the associated DAGs. We show based on simulations that this permutation search is competitive with standard approaches.
- We consider the problem of estimating and constructing component-wise confidence intervals of a sparse high-dimensional linear regression model when some covariates of the design matrix are missing completely at random. A variant of the Dantzig selector (Candes & Tao, 2007) is analyzed for estimating the regression model and a de-biasing argument is employed to construct component-wise confidence intervals under additional assumptions on the covariance of the design matrix. We also derive rates of convergence of the mean-square estimation error and the average confidence interval length, and show that the dependency over several model parameters (e.g., sparsity $s$, portion of observed covariates $\rho_*$, signal level $\|\beta_0\|_2$) are optimal in a minimax sense.
- This paper presents privileged multi-label learning (PrML) to explore and exploit the relationship between labels in multi-label learning problems. We suggest that for each individual label, it cannot only be implicitly connected with other labels via the low-rank constraint over label predictors, but also its performance on examples can receive the explicit comments from other labels together acting as an \emphOracle teacher. We generate privileged label feature for each example and its individual label, and then integrate it into the framework of low-rank based multi-label learning. The proposed algorithm can therefore comprehensively explore and exploit label relationships by inheriting all the merits of privileged information and low-rank constraints. We show that PrML can be efficiently solved by dual coordinate descent algorithm using iterative optimization strategy with cheap updates. Experiments on benchmark datasets show that through privileged label features, the performance can be significantly improved and PrML is superior to several competing methods in most cases.
- In rare disease physician targeting, a major challenge is how to identify physicians who are treating diagnosed or underdiagnosed rare diseases patients. Rare diseases have extremely low incidence rate. For a specified rare disease, only a small number of patients are affected and a fractional of physicians are involved. The existing targeting methodologies, such as segmentation and profiling, are developed under mass market assumption. They are not suitable for rare disease market where the target classes are extremely imbalanced. The authors propose a graphical model approach to predict targets by jointly modeling physician and patient features from different data spaces and utilizing the extra relational information. Through an empirical example with medical claim and prescription data, the proposed approach demonstrates better accuracy in finding target physicians. The graph representation also provides visual interpretability of relationship among physicians and patients. The model can be extended to incorporate more complex dependency structures. This article contributes to the literature of exploring the benefit of utilizing relational dependencies among entities in healthcare industry.
- This paper proposes a convolutional neural network (CNN)-based method that learns traffic as images and predicts large-scale, network-wide traffic speed with a high accuracy. Spatiotemporal traffic dynamics are converted to images describing the time and space relations of traffic flow via a two-dimensional time-space matrix. A CNN is applied to the image following two consecutive steps: abstract traffic feature extraction and network-wide traffic speed prediction. The effectiveness of the proposed method is evaluated by taking two real-world transportation networks, the second ring road and north-east transportation network in Beijing, as examples, and comparing the method with four prevailing algorithms, namely, ordinary least squares, k-nearest neighbors, artificial neural network, and random forest, and three deep learning architectures, namely, stacked autoencoder, recurrent neural network, and long-short-term memory network. The results show that the proposed method outperforms other algorithms by an average accuracy improvement of 42.91% within an acceptable execution time. The CNN can train the model in a reasonable time and, thus, is suitable for large-scale transportation networks.
- We consider the problem of off-policy evaluation---estimating the value of a target policy using data collected by another policy---under the contextual bandit model. We establish a minimax lower bound on the mean squared error (MSE), and show that it is matched up to constant factors by the inverse propensity scoring (IPS) estimator. Since in the multi-armed bandit problem the IPS is suboptimal (Li et. al, 2015), our result highlights the difficulty of the contextual setting with non-degenerate context distributions. We further consider improvements on this minimax MSE bound, given access to a reward model. We show that the existing doubly robust approach, which utilizes such a reward model, may continue to suffer from high variance even when the reward model is perfect. We propose a new estimator called SWITCH which more effectively uses the reward model and achieves a superior bias-variance tradeoff compared with prior work. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often seeing orders of magnitude improvements over a number of baselines.
- Nov 22 2016 stat.ML arXiv:1611.06843v1In this paper we describe an algorithm for predicting the websites at risk in a long range hacking activity, while jointly inferring the provenance and evolution of vulnerabilities on websites over continuous time. Specifically, we use hazard regression with a time-varying additive hazard function parameterized in a generalized linear form. The activation coefficients on each feature are continuous-time functions constrained with total variation penalty inspired by hacking campaigns. We show that the optimal solution is a 0th order spline with a finite number of adaptively chosen knots, and can be solved efficiently. Experiments on real data show that our method significantly outperforms classic methods while providing meaningful interpretability.
- Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. The unique challenge of TaaS is that it must satisfy a wide range of customers who have no experience and resources to tune DL hyper-parameters, and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with values that are applicable to all users. IBM Watson Natural Language Classifier (NLC) service, the most popular IBM cognitive service used by thousands of enterprise-level clients around the globe, is a typical TaaS service. By evaluating the NLC workloads, we show that only the conservative hyper-parameter setup (e.g., small mini-batch size and small learning rate) can guarantee acceptable model accuracy for a wide range of customers. We further justify theoretically why such a setup guarantees better model convergence in general. Unfortunately, the small mini-batch size causes a high volume of communication traffic in a parameter-server based system. We characterize the high communication bandwidth requirement of TaaS using representative industrial deep learning workloads and demonstrate that none of the state-of-the-art scale-up or scale-out solutions can satisfy such a requirement. We then present GaDei, an optimized shared-memory based scale-up parameter server design. We prove that the designed protocol is deadlock-free and it processes each gradient exactly once. Our implementation is evaluated on both commercial benchmarks and public benchmarks to demonstrate that it significantly outperforms the state-of-the-art parameter-server based implementation while maintaining the required accuracy and our implementation reaches near the best possible runtime performance, constrained only by the hardware limitation. Furthermore, to the best of our knowledge, GaDei is the only scale-up DL system that provides fault-tolerance.
- Nov 16 2016 stat.ME arXiv:1611.04619v1This work is motivated by a biological experiment with a split-plot design, for the purpose of comparison of the changing patterns in seed weight from two treatment groups as subgroups in each of the two groups subject to increasing levels of stress. We formalize the question into a nonparametric two sample comparison problem for changes among the sub samples, which was analyzed using U-statistics. Zero inflated value were also considered in the construction of the U-statistics. The U-statistics were then used in a Chi-square type test statistics framework for hypothesis testing. Bootstrapped p-values were obtained through simulated samples. It was proven that the distribution of the simulated sample can be independent provided the observed samples have certain summary statistics. Simulation results suggest that the test is consistent.
- We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al, KDD 2015) with a multinomial-logit regression so as to model the vote outcome Trump, Clinton, Other / Didn't vote as a function of demographic and socioeconomic features. Ecological inference allows us to estimate "exit poll" style results like what was Trump's support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in python and R and are available online for replication.
- In this paper we describe an algorithm for estimating the provenance of hacks on websites. That is, given properties of sites and the temporal occurrence of attacks, we are able to attribute individual attacks to joint causes and vulnerabilities, as well as estimating the evolution of these vulnerabilities over time. Specifically, we use hazard regression with a time-varying additive hazard function parameterized in a generalized linear form. The activation coefficients on each feature are continuous-time functions over time. We formulate the problem of learning these functions as a constrained variational maximum likelihood estimation problem with total variation penalty and show that the optimal solution is a 0th order spline (a piecewise constant function) with a finite number of known knots. This allows the inference problem to be solved efficiently and at scale by solving a finite dimensional optimization problem. Extensive experiments on real data sets show that our method significantly outperforms Cox's proportional hazard model. We also conduct a case study and verify that the fitted functions are indeed recovering vulnerable features and real-life events such as the release of code to exploit these features in hacker blogs.
- Nov 09 2016 stat.ME arXiv:1611.02314v1Dynamic treatment regimens (DTRs) are sequential decision rules tailored at each stage by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity and chronicity of many diseases and disorders call for learning optimal DTRs which best dynamically tailor treatment to each individual's response over time. Proliferation of personalized data (e.g., genetic and imaging data) provides opportunities for deep tailoring as well as new challenges for statistical methodology. In this work, we propose a robust hybrid approach referred as Augmented Multistage Outcome-Weighted Learning (AMOL) to integrate outcome-weighted learning and Q-learning to identify optimal DTRs from the Sequential Multiple Assignment Randomization Trials (SMARTs). We generalize outcome weighted learning (O-learning; Zhao et al.~2012) to allow for negative outcomes; we propose methods to reduce variability of weights in O-learning to achieve numeric stability and higher efficiency; finally, for multiple-stage SMART studies, we introduce doubly robust augmentation to machine learning based O-learning to improve efficiency by drawing information from regression model-based Q-learning at each stage. The proposed AMOL remains valid even if the Q-learning model is misspecified. We establish the theoretical properties of AMOL, including the consistency of the estimated rules and the rates of convergence to the optimal value function. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and applications to two SMART data sets: a two-stage trial for attention deficit and hyperactive disorder (ADHD) and the STAR*D trial for major depressive disorder (MDD).
- Safeguarding privacy in machine learning is highly desirable, especially in collaborative studies across many organizations. Privacy-preserving distributed machine learning (based on cryptography) is popular to solve the problem. However, existing cryptographic protocols still incur excess computational overhead. Here, we make a novel observation that this is partially due to naive adoption of mainstream numerical optimization (e.g., Newton method) and failing to tailor for secure computing. This work presents a contrasting perspective: customizing numerical optimization specifically for secure settings. We propose a seemingly less-favorable optimization method that can in fact significantly accelerate privacy-preserving logistic regression. Leveraging this new method, we propose two new secure protocols for conducting logistic regression in a privacy-preserving and distributed manner. Extensive theoretical and empirical evaluations prove the competitive performance of our two secure proposals while without compromising accuracy or privacy: with speedup up to 2.3x and 8.1x, respectively, over state-of-the-art; and even faster as data scales up. Such drastic speedup is on top of and in addition to performance improvements from existing (and future) state-of-the-art cryptography. Our work provides a new way towards efficient and practical privacy-preserving logistic regression for large-scale studies which are common for modern science.
- There is an increased interest in the scientific community in the problem of measuring gender homophily in co-authorship on scholarly publications (Eisen, 2016). For a given set of publications and co-authorships, we assume that author identities have not been disambiguated in that we do not know when one person is an author on more than one paper. In this case, one way to think about measuring gender homophily is to consider all observed co-authorship pairs and obtain a set-based gender homophily coefficient (e.g., Bergstrom et al., 2016). Another way is to consider papers as observed disjoint networks of co-authors and use a network-based assortativity coefficient (e.g., Newman, 2003). In this note, we review both metrics and show that the gender homophily set-based index is equivalent to the gender assortativity network-based coefficient with properly weighted edges.
- Subspace clustering is the problem of partitioning unlabeled data points into a number of clusters so that data points within one cluster lie approximately on a low-dimensional linear subspace. In many practical scenarios, the dimensionality of data points to be clustered are compressed due to constraints of measurement, computation or privacy. In this paper, we study the theoretical properties of a popular subspace clustering algorithm named sparse subspace clustering (SSC) and establish formal success conditions of SSC on dimensionality-reduced data. Our analysis applies to the most general fully deterministic model where both underlying subspaces and data points within each subspace are deterministically positioned, and also a wide range of dimensionality reduction techniques (e.g., Gaussian random projection, uniform subsampling, sketching) that fall into a subspace embedding framework (Meng & Mahoney, 2013; Avron et al., 2014). Finally, we apply our analysis to a differentially private SSC algorithm and established both privacy and utility guarantees of the proposed method.
- Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data is often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this paper, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of utilizing data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data.
- Oct 26 2016 stat.AP arXiv:1610.07684v1In this paper, we address the the major hurdle of high dimensionality in EEG analysis by extracting the optimal lower dimensional representations. Using our approach, connectivity between regions in a high-dimensional brain network is characterized through the connectivity between region-specific factors. The proposed approach is motivated by our observation that electroencephalograms (EEGs) from channels within each region exhibit a high degree of multicollinearity and synchrony. These observations suggest that it would be sensible to extract summary factors for each region. We consider the general approach for deriving summary factors which are solutions to the criterion of squared error reconstruction. In this work, we focus on two special cases of linear auto encoder and decoder. In the first approach, the factors are characterized as instantaneous linear mixing of the observed high dimensional time series. In the second approach, the factors signals are linear filtered versions of the original signal which is more general than an instantaneous mixing. This exploratory analysis is the starting point to the multi-scale factor analysis model where the concatenated factors from all regions are represented by vector auto-regressive model that captures the connectivity in high dimensional signals. We performed evaluations on the two approaches via simulations under different conditions. The simulation results provide insights on the performance and application scope of the methods. We also performed exploratory analysis of EEG recorded over several epochs during resting state. Finally, we implemented these exploratory methods in a Matlab toolbox XHiDiTS available from https://goo.gl/uXc8ei .
- Oct 12 2016 stat.CO arXiv:1610.03434v1Software for computation of maximum likelihood estimates in linear structural equation models typically employs general techniques from non-linear optimization, such as quasi-Newton methods. In practice, careful tuning of initial values is often required to avoid convergence issues. As an alternative approach, we propose a block-coordinate descent method that cycles through the considered variables, updating only the parameters related to a given variable in each step. We show that the resulting block update problems can be solved in closed form even when the structural equation model comprises feedback cycles. Furthermore, we give a characterization of the models for which the block-coordinate descent algorithm is well-defined, meaning that for generic data and starting values all block optimization problems admit a unique solution. For the characterization, we represent each model by its mixed graph (also known as path diagram), which leads to criteria that can be checked in time that is polynomial in the number of considered variables.
- Sep 28 2016 stat.ML arXiv:1609.08397v1Many machine learning tasks can be formulated as Regularized Empirical Risk Minimization (R-ERM), and solved by optimization algorithms such as gradient descent (GD), stochastic gradient descent (SGD), and stochastic variance reduction (SVRG). Conventional analysis on these optimization algorithms focuses on their convergence rates during the training process, however, people in the machine learning community may care more about the generalization performance of the learned model on unseen test data. In this paper, we investigate on this issue, by using stability as a tool. In particular, we decompose the generalization error for R-ERM, and derive its upper bound for both convex and non-convex cases. In convex cases, we prove that the generalization error can be bounded by the convergence rate of the optimization algorithm and the stability of the R-ERM process, both in expectation (in the order of $\mathcal{O}((1/n)+\mathbb{E}\rho(T))$, where $\rho(T)$ is the convergence error and $T$ is the number of iterations) and in high probability (in the order of $\mathcal{O}\left(\frac{\log{1/\delta}}{\sqrt{n}}+\rho(T)\right)$ with probability $1-\delta$). For non-convex cases, we can also obtain a similar expected generalization error bound. Our theorems indicate that 1) along with the training process, the generalization error will decrease for all the optimization algorithms under our investigation; 2) Comparatively speaking, SVRG has better generalization ability than GD and SGD. We have conducted experiments on both convex and non-convex problems, and the experimental results verify our theoretical findings.
- Kernel and Multiple Kernel Canonical Correlation Analysis (CCA) are employed to classify schizophrenic and healthy patients based on their SNPs, DNA Methylation and fMRI data. Kernel and Multiple Kernel CCA are popular methods for finding nonlinear correlations between high-dimensional datasets. Data was gathered from 183 patients, 79 with schizophrenia and 104 healthy controls. Kernel and Multiple Kernel CCA represent new avenues for studying schizophrenia, because, to our knowledge, these methods have not been used on these data before. Classification is performed via k-means clustering on the kernel matrix outputs of the Kernel and Multiple Kernel CCA algorithm. Accuracies of the Kernel and Multiple Kernel CCA classification are compared to that of the regularized linear CCA algorithm classification, and are found to be significantly more accurate. Both algorithms demonstrate maximal accuracies when the combination of DNA methylation and fMRI data are used, and experience lower accuracies when the SNP data are incorporated.
- Random Projection (RP) technique has been widely applied in many scenarios because it can reduce high-dimensional features into low-dimensional space within short time and meet the need of real-time analysis of massive data. There is an urgent need of dimensionality reduction with fast increase of big genomics data. However, the performance of RP is usually lower. We attempt to improve classification accuracy of RP through combining other reduction dimension methods such as Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Feature Selection (FS). We compared classification accuracy and running time of different combination methods on three microarray datasets and a simulation dataset. Experimental results show a remarkable improvement of 14.77% in classification accuracy of FS followed by RP compared to RP on BC-TCGA dataset. LDA followed by RP also helps RP to yield a more discriminative subspace with an increase of 13.65% on classification accuracy on the same dataset. FS followed by RP outperforms other combination methods in classification accuracy on most of the datasets.
- Genome-wide association studies (GWAS) offer new opportunities to identify genetic risk factors for Alzheimer's disease (AD). Recently, collaborative efforts across different institutions emerged that enhance the power of many existing techniques on individual institution data. However, a major barrier to collaborative studies of GWAS is that many institutions need to preserve individual data privacy. To address this challenge, we propose a novel distributed framework, termed Local Query Model (LQM) to detect risk SNPs for AD across multiple research institutions. To accelerate the learning process, we propose a Distributed Enhanced Dual Polytope Projection (D-EDPP) screening rule to identify irrelevant features and remove them from the optimization. To the best of our knowledge, this is the first successful run of the computationally intensive model selection procedure to learn a consistent model across different institutions without compromising their privacy while ranking the SNPs that may collectively affect AD. Empirical studies are conducted on 809 subjects with 5.9 million SNP features which are distributed across three individual institutions. D-EDPP achieved a 66-fold speed-up by effectively identifying irrelevant features.
- Multi-view data clustering refers to categorizing a data set by making good use of related information from multiple representations of the data. It becomes important nowadays because more and more data can be collected in a variety of ways, in different settings and from different sources, so each data set can be represented by different sets of features to form different views of it. Many approaches have been proposed to improve clustering performance by exploring and integrating heterogeneous information underlying different views. In this paper, we propose a new multi-view fuzzy clustering approach called MinimaxFCM by using minimax optimization based on well-known Fuzzy c means. In MinimaxFCM the consensus clustering results are generated based on minimax optimization in which the maximum disagreements of different weighted views are minimized. Moreover, the weight of each view can be learned automatically in the clustering process. In addition, there is only one parameter to be set besides the fuzzifier. The detailed problem formulation, updating rules derivation, and the in-depth analysis of the proposed MinimaxFCM are provided here. Experimental studies on nine multi-view data sets including real world image and document data sets have been conducted. We observed that MinimaxFCM outperforms related multi-view clustering approaches in terms of clustering accuracy, demonstrating the great potential of MinimaxFCM for multi-view data analysis.
- Incremental clustering approaches have been proposed for handling large data when given data set is too large to be stored. The key idea of these approaches is to find representatives to represent each cluster in each data chunk and final data analysis is carried out based on those identified representatives from all the chunks. However, most of the incremental approaches are used for single view data. As large multi-view data generated from multiple sources becomes prevalent nowadays, there is a need for incremental clustering approaches to handle both large and multi-view data. In this paper we propose a new incremental clustering approach called incremental minimax optimization based fuzzy clustering (IminimaxFCM) to handle large multi-view data. In IminimaxFCM, representatives with multiple views are identified to represent each cluster by integrating multiple complementary views using minimax optimization. The detailed problem formulation, updating rules derivation, and the in-depth analysis of the proposed IminimaxFCM are provided. Experimental studies on several real world multi-view data sets have been conducted. We observed that IminimaxFCM outperforms related incremental fuzzy clustering in terms of clustering accuracy, demonstrating the great potential of IminimaxFCM for large multi-view data analysis.
- Recently dictionary screening has been proposed as an effective way to improve the computational efficiency of solving the lasso problem, which is one of the most commonly used method for learning sparse representations. To address today's ever increasing large dataset, effective screening relies on a tight region bound on the solution to the dual lasso. Typical region bounds are in the form of an intersection of a sphere and multiple half spaces. One way to tighten the region bound is using more half spaces, which however, adds to the overhead of solving the high dimensional optimization problem in lasso screening. This paper reveals the interesting property that the optimization problem only depends on the projection of features onto the subspace spanned by the normals of the half spaces. This property converts an optimization problem in high dimension to much lower dimension, and thus sheds light on reducing the computation overhead of lasso screening based on tighter region bounds.
- One way to solve lasso problems when the dictionary does not fit into available memory is to first screen the dictionary to remove unneeded features. Prior research has shown that sequential screening methods offer the greatest promise in this endeavor. Most existing work on sequential screening targets the context of tuning parameter selection, where one screens and solves a sequence of $N$ lasso problems with a fixed grid of geometrically spaced regularization parameters. In contrast, we focus on the scenario where a target regularization parameter has already been chosen via cross-validated model selection, and we then need to solve many lasso instances using this fixed value. In this context, we propose and explore a feedback controlled sequential screening scheme. Feedback is used at each iteration to select the next problem to be solved. This allows the sequence of problems to be adapted to the instance presented and the number of intermediate problems to be automatically selected. We demonstrate our feedback scheme using several datasets including a dictionary of approximate size 100,000 by 300,000.
- Multi-view spectral clustering, which aims at yielding an agreement or consensus data objects grouping across multi-views with their graph laplacian matrices, is a fundamental clustering problem. Among the existing methods, Low-Rank Representation (LRR) based method is quite superior in terms of its effectiveness, intuitiveness and robustness to noise corruptions. However, it aggressively tries to learn a common low-dimensional subspace for multi-view data, while inattentively ignoring the local manifold structure in each view, which is critically important to the spectral clustering; worse still, the low-rank minimization is enforced to achieve the data correlation consensus among all views, failing to flexibly preserve the local manifold structure for each view. In this paper, 1) we propose a multi-graph laplacian regularized LRR with each graph laplacian corresponding to one view to characterize its local manifold structure. 2) Instead of directly enforcing the low-rank minimization among all views for correlation consensus, we separately impose low-rank constraint on each view, coupled with a mutual structural consensus constraint, where it is able to not only well preserve the local manifold structure but also serve as a constraint for that from other views, which iteratively makes the views more agreeable. Extensive experiments on real-world multi-view data sets demonstrate its superiority.
- The scale of functional magnetic resonance image data is rapidly increasing as large multi-subject datasets are becoming widely available and high-resolution scanners are adopted. The inherent low-dimensionality of the information in this data has led neuroscientists to consider factor analysis methods to extract and analyze the underlying brain activity. In this work, we consider two recent multi-subject factor analysis methods: the Shared Response Model and Hierarchical Topographic Factor Analysis. We perform analytical, algorithmic, and code optimization to enable multi-node parallel implementations to scale. Single-node improvements result in 99x and 1812x speedups on these two methods, and enables the processing of larger datasets. Our distributed implementations show strong scaling of 3.3x and 5.5x respectively with 20 nodes on real datasets. We also demonstrate weak scaling on a synthetic dataset with 1024 subjects, on up to 1024 nodes and 32,768 cores.
- High demand for computation resources severely hinders deployment of large-scale Deep Neural Networks (DNN) in resource constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNNs evaluation. Experimental results show that SSL achieves on average 5.1x and 3.1x speedups of convolutional layer computation of AlexNet against CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice speedups of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth can reduce 20 layers of a Deep Residual Network (ResNet) to 18 layers while improve the accuracy from 91.25% to 92.60%, which is still slightly higher than that of original ResNet with 32 layers. For AlexNet, structure regularization by SSL also reduces the error by around ~1%. Open source code is in https://github.com/wenwei202/caffe/tree/scnn
- Poisson likelihood models have been prevalently used in imaging, social networks, and time series analysis. We propose fast, simple, theoretically-grounded, and versatile, optimization algorithms for Poisson likelihood modeling. The Poisson log-likelihood is concave but not Lipschitz-continuous. Since almost all gradient-based optimization algorithms rely on Lipschitz-continuity, optimizing Poisson likelihood models with a guarantee of convergence can be challenging, especially for large-scale problems. We present a new perspective allowing to efficiently optimize a wide range of penalized Poisson likelihood objectives. We show that an appropriate saddle point reformulation enjoys a favorable geometry and a smooth structure. Therefore, we can design a new gradient-based optimization algorithm with $O(1/t)$ convergence rate, in contrast to the usual $O(1/\sqrt{t})$ rate of non-smooth minimization alternatives. Furthermore, in order to tackle problems with large samples, we also develop a randomized block-decomposition variant that enjoys the same convergence rate yet more efficient iteration cost. Experimental results on several point process applications including social network estimation and temporal recommendation show that the proposed algorithm and its randomized block variant outperform existing methods both on synthetic and real-world datasets.
- A treatment regime is a function that maps individual patient information to a recommended treatment, hence explicitly incorporating the heterogeneity in need for treatment across individuals. Patient responses are dichotomous and can be predicted through an unknown relationship that depends on the patient information and the selected treatment. The goal is to find the treatments that lead to the best patient responses on average. Each experiment is expensive, forcing us to learn the most from each experiment. We adopt a Bayesian approach both to incorporate possible prior information and to update our treatment regime continuously as information accrues, with the potential to allow smaller yet more informative trials and for patients to receive better treatment. By formulating the problem as contextual bandits, we introduce a knowledge gradient policy to guide the treatment assignment by maximizing the expected value of information, for which an approximation method is used to overcome computational challenges. We provide a detailed study on how to make sequential medical decisions under uncertainty to reduce health care costs on a real world knee replacement dataset. We use clustering and LASSO to deal with the intrinsic sparsity in health datasets. We show experimentally that even though the problem is sparse, through careful selection of physicians (versus picking them at random), we can significantly improve the success rates.
- In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.
- In this paper, we resolve many of the key algorithmic questions regarding robustness, memory efficiency, and differential privacy of tensor decomposition. We propose simple variants of the tensor power method which enjoy these strong properties. We present the first guarantees for online tensor power method which has a linear memory requirement. Moreover, we present a noise calibrated tensor power method with efficient privacy guarantees. At the heart of all these guarantees lies a careful perturbation analysis derived in this paper which improves up on the existing results significantly.
- Probabilistic models analyze data by relying on a set of assumptions. Data that exhibit deviations from these assumptions can undermine inference and prediction quality. Robust models offer protection against mismatch between a model's assumptions and reality. We propose a way to systematically detect and mitigate mismatch of a large class of probabilistic models. The idea is to raise the likelihood of each observation to a weight and then to infer both the latent variables and the weights from data. Inferring the weights allows a model to identify observations that match its assumptions and down-weight others. This enables robust inference and improves predictive accuracy. We study four different forms of mismatch with reality, ranging from missing latent groups to structure misspecification. A Poisson factorization analysis of the Movielens 1M dataset shows the benefits of this approach in a practical scenario.