results for au:Ye_J in:stat

- Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high dimensional inputs, the training of RNNs becomes computational expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with much fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388 times fewer parameters than the standard LSTM to achieve an accuracy improvement over 15.6\% in the Action Recognition task on the UCF11 dataset.
- Can artificial intelligence (AI) learn complicated non-linear physics? Here we propose a novel deep learning approach that learns non-linear photon scattering physics and obtains accurate 3D distribution of optical anomalies. In contrast to the traditional black-box deep learning approaches to inverse problems, our deep network learns to invert the Lippmann-Schwinger integral equation which describes the essential physics of photon migration of diffuse near-infrared (NIR) photons in turbid media. As an example for clinical relevance, we applied the method to our prototype diffuse optical tomography (DOT). We show that our deep neural network, trained with only simulation data, can accurately recover the location of anomalies within biomimetic phantoms and live animals without the use of an exogenous contrast agent.
- We propose a framework, named Aggregated Wasserstein, for computing a dissimilarity measure or distance between two Hidden Markov Models with state conditional distributions being Gaussian. For such HMMs, the marginal distribution at any time position follows a Gaussian mixture distribution, a fact exploited to softly match, aka register, the states in two HMMs. We refer to such HMMs as Gaussian mixture model-HMM (GMM-HMM). The registration of states is inspired by the intrinsic relationship of optimal transport and the Wasserstein metric between distributions. Specifically, the components of the marginal GMMs are matched by solving an optimal transport problem where the cost between components is the Wasserstein metric for Gaussian distributions. The solution of the optimization problem is a fast approximation to the Wasserstein metric between two GMMs. The new Aggregated Wasserstein distance is a semi-metric and can be computed without generating Monte Carlo samples. It is invariant to relabeling or permutation of states. The distance is defined meaningfully even for two HMMs that are estimated from data of different dimensionality, a situation that can arise due to missing variables. This distance quantifies the dissimilarity of GMM-HMMs by measuring both the difference between the two marginal GMMs and that between the two transition matrices. Our new distance is tested on tasks of retrieval, classification, and t-SNE visualization of time series. Experiments on both synthetic and real data have demonstrated its advantages in terms of accuracy as well as efficiency in comparison with existing distances based on the Kullback-Leibler divergence.
- In portable, 3-D, or ultra-fast ultrasound (US) imaging systems, there is an increasing demand to reconstruct high quality images from limited number of data. However, the existing solutions require either hardware changes or computationally expansive algorithms. To overcome these limitations, here we propose a novel deep learning approach that interpolates the missing RF data by utilizing the sparsity of the RF data in the Fourier domain. Extensive experimental results from sub-sampled RF data from a real US system confirmed that the proposed method can effectively reduce the data rate without sacrificing the image quality.
- In building intelligent transportation systems such as taxi or rideshare services, accurate prediction of travel time and distance is crucial for customer experience and resource management. Using the NYC taxi dataset, which contains taxi trips data collected from GPS-enabled taxis [23], this paper investigates the use of deep neural networks to jointly predict taxi trip time and distance. We propose a model, called ST-NN (Spatio-Temporal Neural Network), which first predicts the travel distance between an origin and a destination GPS coordinate, then combines this prediction with the time of day to predict the travel time. The beauty of ST-NN is that it uses only the raw trips data without requiring further feature engineering and provides a joint estimate of travel time and distance. We compare the performance of ST-NN to that of state-of-the-art travel time estimation methods, and we observe that the proposed approach generalizes better than state-of-the-art methods. We show that ST-NN approach significantly reduces the mean absolute error for both predicted travel time and distance, about 17% for travel time prediction. We also observe that the proposed approach is more robust to outliers present in the dataset by testing the performance of ST-NN on the datasets with and without outliers.
- Genome-wide association studies (GWA studies or GWAS) investigate the relationships between genetic variants such as single-nucleotide polymorphisms (SNPs) and individual traits. Recently, incorporating biological priors together with machine learning methods in GWA studies has attracted increasing attention. However, in real-world, nucleotide-level bio-priors have not been well-studied to date. Alternatively, studies at gene-level, for example, protein--protein interactions and pathways, are more rigorous and legitimate, and it is potentially beneficial to utilize such gene-level priors in GWAS. In this paper, we proposed a novel two-level structured sparse model, called Sparse Group Lasso with Group-level Graph structure (SGLGG), for GWAS. It can be considered as a sparse group Lasso along with a group-level graph Lasso. Essentially, SGLGG penalizes the nucleotide-level sparsity as well as takes advantages of gene-level priors (both gene groups and networks), to identifying phenotype-associated risk SNPs. We employ the alternating direction method of multipliers algorithm to optimize the proposed model. Our experiments on the Alzheimer's Disease Neuroimaging Initiative whole genome sequence data and neuroimage data demonstrate the effectiveness of SGLGG. As a regression model, it is competitive to the state-of-the-arts sparse models; as a variable selection method, SGLGG is promising for identifying Alzheimer's disease-related risk SNPs.
- X-ray computed tomography (CT) using sparse projection views is often used to reduce the radiation dose. However, due to the insufficient projection views, a reconstruction approach using the filtered back projection (FBP) produces severe streaking artifacts. Recently, deep learning approaches using large receptive field neural networks such as U-net have demonstrated impressive performance for sparse view CT reconstruction. However, theoretical justification is still lacking. The main goal of this paper is, therefore, to develop a mathematical theory and to discuss how to improve these algorithms. In particular, inspired by the recent theory of deep convolutional framelets, we show that the U-net relies on a sub-optimal non-local bases that overly emphasizes low frequency components. The discovery leads to a dual frame and a tight frame U-net architectures for effective recovery of directional image components.
- Model based iterative reconstruction (MBIR) algorithms for low-dose X-ray CT are computationally expensive. To address this problem, we recently proposed the world-first deep convolutional neural network (CNN) for low-dose X-ray CT and won the second place in 2016 AAPM Low-Dose CT Grand Challenge. However, some of the texture were not fully recovered. To cope with this problem, here we propose a deep residual learning approach in directional wavelet domain. The proposed method is motivated by an observation that a deep convolutional neural network can be interpreted as a multilayer convolutional framelets expansion using non-local basis convolved with data-driven local basis. We further extend the idea to derive a deep convolutional framelet expansion by combining global redundant transforms and signal boosting from multiple signal representations. Extensive experimental results confirm that the proposed network has significantly improved performance and preserves the detail texture of the original images
- Tissue characterization has long been an important component of Computer Aided Diagnosis (CAD) systems for automatic lesion detection and further clinical planning. Motivated by the superior performance of deep learning methods on various computer vision problems, there has been increasing work applying deep learning to medical image analysis. However, the development of a robust and reliable deep learning model for computer-aided diagnosis is still highly challenging due to the combination of the high heterogeneity in the medical images and the relative lack of training samples. Specifically, annotation and labeling of the medical images is much more expensive and time-consuming than other applications and often involves manual labor from multiple domain experts. In this work, we propose a multi-stage, self-paced learning framework utilizing a convolutional neural network (CNN) to classify Computed Tomography (CT) image patches. The key contribution of this approach is that we augment the size of training samples by refining the unlabeled instances with a self-paced learning CNN. By implementing the framework on high performance computing servers including the NVIDIA DGX1 machine, we obtained the experimental result, showing that the self-pace boosted network consistently outperformed the original network even with very scarce manual labels. The performance gain indicates that applications with limited training samples such as medical image analysis can benefit from using the proposed framework.
- Recently, deep learning approaches have achieved significant performance improvement in various imaging problems. However, it is still unclear why these deep learning architectures work. Moreover, the link between the deep learning and the classical signal processing approaches such as wavelet, non-local processing, compressed sensing, etc, is still not well understood. To address these issues, here we show that the long-searched-for missing link is the convolutional framelets for representing a signal by convolving local and non-local bases. The convolutional framelets was originally developed to generalize the recent theory of low-rank Hankel matrix approaches, and this paper significantly extends the idea to derive a deep neural network using multi-layer convolutional framelets with perfect reconstruction (PR) under rectified linear unit (ReLU). Our analysis also shows that the popular deep network components such as residual block, redundant filter channels, and concatenated ReLU (CReLU) indeed help to achieve the PR, while the pooling and unpooling layers should be augmented with multi-resolution convolutional framelets to achieve PR condition. This discovery reveals the limitations of many existing deep learning architectures for inverse problems, and leads us to propose a novel deep convolutional framelets neural network. Using numerical experiments with sparse view x-ray computed tomography (CT), we demonstrated that our deep convolution framelets network shows consistent improvement. This discovery suggests that the success of deep learning is not from a magical power of a black-box, but rather comes from the power of a novel signal representation using non-local basis combined with data-driven local basis, which is indeed a natural extension of classical signal processing theory.
- Mutual Information (MI) is often used for feature selection when developing classifier models. Estimating the MI for a subset of features is often intractable. We demonstrate, that under the assumptions of conditional independence, MI between a subset of features can be expressed as the Conditional Mutual Information (CMI) between pairs of features. But selecting features with the highest CMI turns out to be a hard combinatorial problem. In this work, we have applied two unique global methods, Truncated Power Method (TPower) and Low Rank Bilinear Approximation (LowRank), to solve the feature selection problem. These algorithms provide very good approximations to the NP-hard CMI based feature selection problem. We experimentally demonstrate the effectiveness of these procedures across multiple datasets and compare them with existing MI based global and iterative feature selection procedures.
- Generative Adversarial Nets (GANs) represent an important milestone for effective generative models, which has inspired numerous variants seemingly different from each other. One of the main contributions of this paper is to reveal a unified geometric structure in GAN and its variants. Specifically, we show that the adversarial generative model training can be decomposed into three geometric steps: separating hyperplane search, discriminator parameter update away from the separating hyperplane, and the generator update along the normal vector direction of the separating hyperplane. This geometric intuition reveals the limitations of the existing approaches and leads us to propose a new formulation called geometric GAN using SVM separating hyperplane that maximizes the margin. Our theoretical analysis shows that the geometric GAN converges to a Nash equilibrium between the discriminator and generator. In addition, extensive numerical results show that the superior performance of geometric GAN.
- Genome-wide association studies (GWAS) have achieved great success in the genetic study of Alzheimer's disease (AD). Collaborative imaging genetics studies across different research institutions show the effectiveness of detecting genetic risk factors. However, the high dimensionality of GWAS data poses significant challenges in detecting risk SNPs for AD. Selecting relevant features is crucial in predicting the response variable. In this study, we propose a novel Distributed Feature Selection Framework (DFSF) to conduct the large-scale imaging genetics studies across multiple institutions. To speed up the learning process, we propose a family of distributed group Lasso screening rules to identify irrelevant features and remove them from the optimization. Then we select the relevant group features by performing the group Lasso feature selection process in a sequence of parameters. Finally, we employ the stability selection to rank the top risk SNPs that might help detect the early stage of AD. To the best of our knowledge, this is the first distributed feature selection model integrated with group Lasso feature selection as well as detecting the risk genetic factors across multiple research institutions system. Empirical studies are conducted on 809 subjects with 5.9 million SNPs which are distributed across several individual institutions, demonstrating the efficiency and effectiveness of the proposed method.
- We study an extreme scenario in multi-label learning where each training instance is endowed with a single one-bit label out of multiple labels. We formulate this problem as a non-trivial special case of one-bit rank-one matrix sensing and develop an efficient non-convex algorithm based on alternating power iteration. The proposed algorithm is able to recover the underlying low-rank matrix model with linear convergence. For a rank-$k$ model with $d_1$ features and $d_2$ classes, the proposed algorithm achieves $O(\epsilon)$ recovery error after retrieving $O(k^{1.5}d_1 d_2/\epsilon)$ one-bit labels within $O(kd)$ memory. Our bound is nearly optimal in the order of $O(1/\epsilon)$. This significantly improves the state-of-the-art sampling complexity of one-bit multi-label learning. We perform experiments to verify our theory and evaluate the performance of the proposed algorithm.
- Mar 03 2017 stat.ML arXiv:1703.00598v3We study a fundamental class of regression models called the second order linear model (SLM). The SLM extends the linear model to high order functional space and has attracted considerable research interest recently. Yet how to efficiently learn the SLM under full generality using nonconvex solver still remains an open question due to several fundamental limitations of the conventional gradient descent learning framework. In this study, we try to attack this problem from a gradient-free approach which we call the moment-estimation-sequence (MES) method. We show that the conventional gradient descent heuristic is biased by the skewness of the distribution therefore is no longer the best practice of learning the SLM. Based on the MES framework, we design a nonconvex alternating iteration process to train a $d$-dimension rank-$k$ SLM within $O(kd)$ memory and one-pass of the dataset. The proposed method converges globally and linearly, achieves $\epsilon$ recovery error after retrieving $O[k^{2}d\cdot\mathrm{polylog}(kd/\epsilon)]$ samples. Furthermore, our theoretical analysis reveals that not all SLMs can be learned on every sub-gaussian distribution. When the instances are sampled from a so-called $\tau$-MIP distribution, the SLM can be learned by $O(p/\tau^{2})$ samples where $p$ and $\tau$ are positive constants depending on the skewness and kurtosis of the distribution. For non-MIP distribution, an addition diagonal-free oracle is necessary and sufficient to guarantee the learnability of the SLM. Numerical simulations verify the sharpness of our bounds on the sampling complexity and the linear convergence rate of our algorithm.
- We proposed a probabilistic approach to joint modeling of participants' reliability and humans' regularity in crowdsourced affective studies. Reliability measures how likely a subject will respond to a question seriously; and regularity measures how often a human will agree with other seriously-entered responses coming from a targeted population. Crowdsourcing-based studies or experiments, which rely on human self-reported affect, pose additional challenges as compared with typical crowdsourcing studies that attempt to acquire concrete non-affective labels of objects. The reliability of participants has been massively pursued for typical non-affective crowdsourcing studies, whereas the regularity of humans in an affective experiment in its own right has not been thoroughly considered. It has been often observed that different individuals exhibit different feelings on the same test question, which does not have a sole correct response in the first place. High reliability of responses from one individual thus cannot conclusively result in high consensus across individuals. Instead, globally testing consensus of a population is of interest to investigators. Built upon the agreement multigraph among tasks and workers, our probabilistic model differentiates subject regularity from population reliability. We demonstrate the method's effectiveness for in-depth robust analysis of large-scale crowdsourced affective data, including emotion and aesthetic assessments collected by presenting visual stimuli to human subjects.
- Probabilistic Temporal Tensor Factorization (PTTF) is an effective algorithm to model the temporal tensor data. It leverages a time constraint to capture the evolving properties of tensor data. Nowadays the exploding dataset demands a large scale PTTF analysis, and a parallel solution is critical to accommodate the trend. Whereas, the parallelization of PTTF still remains unexplored. In this paper, we propose a simple yet efficient Parallel Probabilistic Temporal Tensor Factorization, referred to as P$^2$T$^2$F, to provide a scalable PTTF solution. P$^2$T$^2$F is fundamentally disparate from existing parallel tensor factorizations by considering the probabilistic decomposition and the temporal effects of tensor data. It adopts a new tensor data split strategy to subdivide a large tensor into independent sub-tensors, the computation of which is inherently parallel. We train P$^2$T$^2$F with an efficient algorithm of stochastic Alternating Direction Method of Multipliers, and show that the convergence is guaranteed. Experiments on several real-word tensor datasets demonstrate that P$^2$T$^2$F is a highly effective and efficiently scalable algorithm dedicated for large scale probabilistic temporal tensor analysis.
- The RNA-sequencing (RNA-seq) is becoming increasingly popular for quantifying gene expression levels. Since the RNA-seq measurements are relative in nature, between-sample normalization of counts is an essential step in differential expression (DE) analysis. The normalization of existing DE detection algorithms is ad hoc and performed once for all prior to DE detection, which may be suboptimal since ideally normalization should be based on non-DE genes only and thus coupled with DE detection. We propose a unified statistical model for joint normalization and DE detection of log-transformed RNA-seq data. Sample-specific normalization factors are modeled as unknown parameters in the gene-wise linear models and jointly estimated with the regression coefficients. By imposing sparsity-inducing L1 penalty (or mixed L1/L2-norm for multiple treatment conditions) on the regression coefficients, we formulate the problem as a penalized least-squares regression problem and apply the augmented lagrangian method to solve it. Simulation studies show that the proposed model and algorithms outperform existing methods in terms of detection power and false-positive rate when more than half of the genes are differentially expressed and/or when the up- and down-regulated genes among DE genes are unbalanced in amount.
- Genome-wide association studies (GWAS) offer new opportunities to identify genetic risk factors for Alzheimer's disease (AD). Recently, collaborative efforts across different institutions emerged that enhance the power of many existing techniques on individual institution data. However, a major barrier to collaborative studies of GWAS is that many institutions need to preserve individual data privacy. To address this challenge, we propose a novel distributed framework, termed Local Query Model (LQM) to detect risk SNPs for AD across multiple research institutions. To accelerate the learning process, we propose a Distributed Enhanced Dual Polytope Projection (D-EDPP) screening rule to identify irrelevant features and remove them from the optimization. To the best of our knowledge, this is the first successful run of the computationally intensive model selection procedure to learn a consistent model across different institutions without compromising their privacy while ranking the SNPs that may collectively affect AD. Empirical studies are conducted on 809 subjects with 5.9 million SNP features which are distributed across three individual institutions. D-EDPP achieved a 66-fold speed-up by effectively identifying irrelevant features.
- We develop an efficient alternating framework for learning a generalized version of Factorization Machine (gFM) on steaming data with provable guarantees. When the instances are sampled from $d$ dimensional random Gaussian vectors and the target second order coefficient matrix in gFM is of rank $k$, our algorithm converges linearly, achieves $O(\epsilon)$ recovery error after retrieving $O(k^{3}d\log(1/\epsilon))$ training instances, consumes $O(kd)$ memory in one-pass of dataset and only requires matrix-vector product operations in each iteration. The key ingredient of our framework is a construction of an estimation sequence endowed with a so-called Conditionally Independent RIP condition (CI-RIP). As special cases of gFM, our framework can be applied to symmetric or asymmetric rank-one matrix sensing problems, such as inductive matrix completion and phase retrieval.
- Aug 15 2016 stat.CO arXiv:1608.03859v4Learning under a Wasserstein loss, a.k.a. Wasserstein loss minimization (WLM), is an emerging research topic for gaining insights from a large set of structured objects. Despite being conceptually simple, WLM problems are computationally challenging because they involve minimizing over functions of quantities (i.e. Wasserstein distances) that themselves require numerical algorithms to compute. In this paper, we introduce a stochastic approach based on simulated annealing for solving WLMs. Particularly, we have developed a Gibbs sampler to approximate effectively and efficiently the partial gradients of a sequence of Wasserstein losses. Our new approach has the advantages of numerical stability and readiness for warm starts. These characteristics are valuable for WLM problems that often require multiple levels of iterations in which the oracle for computing the value and gradient of a loss function is embedded. We applied the method to optimal transport with Coulomb cost and the Wasserstein non-negative matrix factorization problem, and made comparisons with the existing method of entropy regularization.
- We propose a framework, named Aggregated Wasserstein, for computing a dissimilarity measure or distance between two Hidden Markov Models with state conditional distributions being Gaussian. For such HMMs, the marginal distribution at any time spot follows a Gaussian mixture distribution, a fact exploited to softly match, aka register, the states in two HMMs. We refer to such HMMs as Gaussian mixture model-HMM (GMM-HMM). The registration of states is inspired by the intrinsic relationship of optimal transport and the Wasserstein metric between distributions. Specifically, the components of the marginal GMMs are matched by solving an optimal transport problem where the cost between components is the Wasserstein metric for Gaussian distributions. The solution of the optimization problem is a fast approximation to the Wasserstein metric between two GMMs. The new Aggregated Wasserstein distance is a semi-metric and can be computed without generating Monte Carlo samples. It is invariant to relabeling or permutation of the states. This distance quantifies the dissimilarity of GMM-HMMs by measuring both the difference between the two marginal GMMs and the difference between the two transition matrices. Our new distance is tested on the tasks of retrieval and classification of time series. Experiments on both synthetic data and real data have demonstrated its advantages in terms of accuracy as well as efficiency in comparison with existing distances based on the Kullback-Leibler divergence.
- Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great successes in many real-world applications. However, for large-scale problems involving a huge number of samples and extremely high-dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsities in both feature and sample spaces, we propose a novel approach, which is based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the features and samples that are guaranteed to be irrelevant to the outputs. Thus, we can remove the identified inactive samples and features from the training phase, leading to substantial savings in both the memory usage and computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the \emphfirst \emphstatic feature and sample reduction method for sparse SVM. Experiments on both synthetic and real datasets (e.g., the kddb dataset with about 20 million samples and 30 million features) demonstrate that our approach significantly outperforms state-of-the-art methods and the speedup gained by our approach can be orders of magnitude.
- In a variety of research areas, the weighted bag of vectors and the histogram are widely used descriptors for complex objects. Both can be expressed as discrete distributions. D2-clustering pursues the minimum total within-cluster variation for a set of discrete distributions subject to the Kantorovich-Wasserstein metric. D2-clustering has a severe scalability issue, the bottleneck being the computation of a centroid distribution, called Wasserstein barycenter, that minimizes its sum of squared distances to the cluster members. In this paper, we develop a modified Bregman ADMM approach for computing the approximate discrete Wasserstein barycenter of large clusters. In the case when the support points of the barycenters are unknown and have low cardinality, our method achieves high accuracy empirically at a much reduced computational cost. The strengths and weaknesses of our method and its alternatives are examined through experiments, and we recommend scenarios for their respective usage. Moreover, we develop both serial and parallelized versions of the algorithm. By experimenting with large-scale data, we demonstrate the computational efficiency of the new methods and investigate their convergence properties and numerical stability. The clustering results obtained on several datasets in different domains are highly competitive in comparison with some widely used methods in the corresponding areas.
- Aug 25 2015 stat.AP arXiv:1508.05412v2An important endpoint variable in a cocaine rehabilitation study is the time to first relapse of a patient after the treatment. We propose a joint modeling approach based on functional data analysis to study the relationship between the baseline longitudinal cocaine-use pattern and the interval censored time to first relapse. For the baseline cocaine-use pattern, we consider both self-reported cocaine-use amount trajectories and dichotomized use trajectories. Variations within the generalized longitudinal trajectories are modeled through a latent Gaussian process, which is characterized by a few leading functional principal components. The association between the baseline longitudinal trajectories and the time to first relapse is built upon the latent principal component scores. The mean and the eigenfunctions of the latent Gaussian process as well as the hazard function of time to first relapse are modeled nonparametrically using penalized splines, and the parameters in the joint model are estimated by a Monte Carlo EM algorithm based on Metropolis-Hastings steps. An Akaike information criterion (AIC) based on effective degrees of freedom is proposed to choose the tuning parameters, and a modified empirical information is proposed to estimate the variance-covariance matrix of the estimators.
- Sparse systems are usually parameterized by a tuning parameter that determines the sparsity of the system. How to choose the right tuning parameter is a fundamental and difficult problem in learning the sparse system. In this paper, by treating the the tuning parameter as an additional dimension, persistent homological structures over the parameter space is introduced and explored. The structures are then further exploited in speeding up the computation using the proposed soft-thresholding technique. The topological structures are further used as multivariate features in the tensor-based morphometry (TBM) in characterizing white matter alterations in children who have experienced severe early life stress and maltreatment. These analyses reveal that stress-exposed children exhibit more diffuse anatomical organization across the whole white matter region.
- Stochastic gradient algorithms estimate the gradient based on only one or a few samples and enjoy low computational cost per iteration. They have been widely used in large-scale optimization problems. However, stochastic gradient algorithms are usually slow to converge and achieve sub-linear convergence rates, due to the inherent variance in the gradient computation. To accelerate the convergence, some variance-reduced stochastic gradient algorithms, e.g., proximal stochastic variance-reduced gradient (Prox-SVRG) algorithm, have recently been proposed to solve strongly convex problems. Under the strongly convex condition, these variance-reduced stochastic gradient algorithms achieve a linear convergence rate. However, many machine learning problems are convex but not strongly convex. In this paper, we introduce Prox-SVRG and its projected variant called Variance-Reduced Projected Stochastic Gradient (VRPSG) to solve a class of non-strongly convex optimization problems widely used in machine learning. As the main technical contribution of this paper, we show that both VRPSG and Prox-SVRG achieve a linear convergence rate without strong convexity. A key ingredient in our proof is a Semi-Strongly Convex (SSC) inequality which is the first to be rigorously proved for a class of non-strongly convex problems in both constrained and regularized settings. Moreover, the SSC inequality is independent of algorithms and may be applied to analyze other stochastic gradient algorithms besides VRPSG and Prox-SVRG, which may be of independent interest. To the best of our knowledge, this is the first work that establishes the linear convergence rate for the variance-reduced stochastic gradient algorithms on solving both constrained and regularized problems without strong convexity.
- Learning a distance function or metric on a given data manifold is of great importance in machine learning and pattern recognition. Many of the previous works first embed the manifold to Euclidean space and then learn the distance function. However, such a scheme might not faithfully preserve the distance function if the original manifold is not Euclidean. Note that the distance function on a manifold can always be well-defined. In this paper, we propose to learn the distance function directly on the manifold without embedding. We first provide a theoretical characterization of the distance function by its gradient field. Based on our theoretical analysis, we propose to first learn the gradient field of the distance function and then learn the distance function itself. Specifically, we set the gradient field of a local distance function as an initial vector field. Then we transport it to the whole manifold via heat flow on vector fields. Finally, the geodesic distance function can be obtained by requiring its gradient field to be close to the normalized vector field. Experimental results on both synthetic and real data demonstrate the effectiveness of our proposed algorithm.
- In this paper, we propose an efficient and scalable low rank matrix completion algorithm. The key idea is to extend orthogonal matching pursuit method from the vector case to the matrix case. We further propose an economic version of our algorithm by introducing a novel weight updating rule to reduce the time and storage complexity. Both versions are computationally inexpensive for each matrix pursuit iteration, and find satisfactory results in a few iterations. Another advantage of our proposed algorithm is that it has only one tunable parameter, which is the rank. It is easy to understand and to use by the user. This becomes especially important in large-scale learning problems. In addition, we rigorously show that both versions achieve a linear convergence rate, which is significantly better than the previous known results. We also empirically compare the proposed algorithms with several state-of-the-art matrix completion algorithms on many real-world datasets, including the large-scale recommendation dataset Netflix as well as the MovieLens datasets. Numerical results show that our proposed algorithm is more efficient than competing algorithms while achieving similar or better prediction performance.
- In this paper, we propose a novel framework to analyze the theoretical properties of the learning process for a representative type of domain adaptation, which combines data from multiple sources and one target (or briefly called representative domain adaptation). In particular, we use the integral probability metric to measure the difference between the distributions of two domains and meanwhile compare it with the H-divergence and the discrepancy distance. We develop the Hoeffding-type, the Bennett-type and the McDiarmid-type deviation inequalities for multiple domains respectively, and then present the symmetrization inequality for representative domain adaptation. Next, we use the derived inequalities to obtain the Hoeffding-type and the Bennett-type generalization bounds respectively, both of which are based on the uniform entropy number. Moreover, we present the generalization bounds based on the Rademacher complexity. Finally, we analyze the asymptotic convergence and the rate of convergence of the learning process for representative domain adaptation. We discuss the factors that affect the asymptotic behavior of the learning process and the numerical experiments support our theoretical findings as well. Meanwhile, we give a comparison with the existing results of domain adaptation and the classical results under the same-distribution assumption.
- Forward-Backward Greedy Algorithms for General Convex Smooth Functions over A Cardinality ConstraintJan 01 2014 stat.ML arXiv:1401.0086v2We consider forward-backward greedy algorithms for solving sparse feature selection problems with general convex smooth functions. A state-of-the-art greedy method, the Forward-Backward greedy algorithm (FoBa-obj) requires to solve a large number of optimization problems, thus it is not scalable for large-size problems. The FoBa-gdt algorithm, which uses the gradient information for feature selection at each forward iteration, significantly improves the efficiency of FoBa-obj. In this paper, we systematically analyze the theoretical properties of both forward-backward greedy algorithms. Our main contributions are: 1) We derive better theoretical bounds than existing analyses regarding FoBa-obj for general smooth convex functions; 2) We show that FoBa-gdt achieves the same theoretical performance as FoBa-obj under the same condition: restricted strong convexity condition. Our new bounds are consistent with the bounds of a special case (least squares) and fills a previously existing theoretical gap for general convex smooth functions; 3) We show that the restricted strong convexity condition is satisfied if the number of independent samples is more than $\bar{k}\log d$ where $\bar{k}$ is the sparsity number and $d$ is the dimension of the variable; 4) We apply FoBa-gdt (with the conditional random field objective) to the sensor selection problem for human indoor activity recognition and our results show that FoBa-gdt outperforms other methods (including the ones based on forward greedy selection and L1-regularization).
- The support vector machine (SVM) is a widely used method for classification. Although many efforts have been devoted to develop efficient solvers, it remains challenging to apply SVM to large-scale problems. A nice property of SVM is that the non-support vectors have no effect on the resulting classifier. Motivated by this observation, we present fast and efficient screening rules to discard non-support vectors by analyzing the dual problem of SVM via variational inequalities (DVI). As a result, the number of data instances to be entered into the optimization can be substantially reduced. Some appealing features of our screening method are: (1) DVI is safe in the sense that the vectors discarded by DVI are guaranteed to be non-support vectors; (2) the data set needs to be scanned only once to run the screening, whose computational cost is negligible compared to that of solving the SVM problem; (3) DVI is independent of the solvers and can be integrated with any existing efficient solvers. We also show that the DVI technique can be extended to detect non-support vectors in the least absolute deviations regression (LAD). To the best of our knowledge, there are currently no screening methods for LAD. We have evaluated DVI on both synthetic and real data sets. Experiments indicate that DVI significantly outperforms the existing state-of-the-art screening rules for SVM, and is very effective in discarding non-support vectors for LAD. The speedup gained by DVI rules can be up to two orders of magnitude.
- Sparse learning techniques have been routinely used for feature selection as the resulting model usually has a small number of non-zero entries. Safe screening, which eliminates the features that are guaranteed to have zero coefficients for a certain value of the regularization parameter, is a technique for improving the computational efficiency. Safe screening is gaining increasing attention since 1) solving sparse learning formulations usually has a high computational cost especially when the number of features is large and 2) one needs to try several regularization parameters to select a suitable model. In this paper, we propose an approach called "Sasvi" (Safe screening with variational inequalities). Sasvi makes use of the variational inequality that provides the sufficient and necessary optimality condition for the dual problem. Several existing approaches for Lasso screening can be casted as relaxed versions of the proposed Sasvi, thus Sasvi provides a stronger safe screening rule. We further study the monotone properties of Sasvi for Lasso, based on which a sure removal regularization parameter can be identified for each feature. Experimental results on both synthetic and real data sets are reported to demonstrate the effectiveness of the proposed Sasvi for Lasso screening.
- Sparse learning has recently received increasing attention in many areas including machine learning, statistics, and applied mathematics. The mixed-norm regularization based on the l1q norm with q>1 is attractive in many applications of regression and classification in that it facilitates group sparsity in the model. The resulting optimization problem is, however, challenging to solve due to the inherent structure of the mixed-norm regularization. Existing work deals with special cases with q=1, 2, infinity, and they cannot be easily extended to the general case. In this paper, we propose an efficient algorithm based on the accelerated gradient method for solving the general l1q-regularized problem. One key building block of the proposed algorithm is the l1q-regularized Euclidean projection (EP_1q). Our theoretical analysis reveals the key properties of EP_1q and illustrates why EP_1q for the general q is significantly more challenging to solve than the special cases. Based on our theoretical analysis, we develop an efficient algorithm for EP_1q by solving two zero finding problems. To further improve the efficiency of solving large dimensional mixed-norm regularized problems, we propose a screening method which is able to quickly identify the inactive groups, i.e., groups that have 0 components in the solution. This may lead to substantial reduction in the number of groups to be entered to the optimization. An appealing feature of our screening method is that the data set needs to be scanned only once to run the screening. Compared to that of solving the mixed-norm regularized problems, the computational cost of our screening test is negligible. The key of the proposed screening method is an accurate sensitivity analysis of the dual optimal solution when the regularization parameter varies. Experimental results demonstrate the efficiency of the proposed algorithm.
- The l1-regularized logistic regression (or sparse logistic regression) is a widely used method for simultaneous classification and feature selection. Although many recent efforts have been devoted to its efficient implementation, its application to high dimensional data still poses significant challenges. In this paper, we present a fast and effective sparse logistic regression screening rule (Slores) to identify the 0 components in the solution vector, which may lead to a substantial reduction in the number of features to be entered to the optimization. An appealing feature of Slores is that the data set needs to be scanned only once to run the screening and its computational cost is negligible compared to that of solving the sparse logistic regression problem. Moreover, Slores is independent of solvers for sparse logistic regression, thus Slores can be integrated with any existing solver to improve the efficiency. We have evaluated Slores using high-dimensional data sets from different applications. Extensive experimental results demonstrate that Slores outperforms the existing state-of-the-art screening rules and the efficiency of solving sparse logistic regression is improved by one magnitude in general.
- May 02 2013 stat.ML arXiv:1305.0047v2We consider the following signal recovery problem: given a measurement matrix $\Phi\in \mathbb{R}^{n\times p}$ and a noisy observation vector $c\in \mathbb{R}^{n}$ constructed from $c = \Phi\theta^* + \epsilon$ where $\epsilon\in \mathbb{R}^{n}$ is the noise vector whose entries follow i.i.d. centered sub-Gaussian distribution, how to recover the signal $\theta^*$ if $D\theta^*$ is sparse \rca under a linear transformation $D\in\mathbb{R}^{m\times p}$? One natural method using convex optimization is to solve the following problem: $$\min_\theta 1\over 2\|\Phi\theta - c\|^2 + \lambda\|D\theta\|_1.$$ This paper provides an upper bound of the estimate error and shows the consistency property of this method by assuming that the design matrix $\Phi$ is a Gaussian random matrix. Specifically, we show 1) in the noiseless case, if the condition number of $D$ is bounded and the measurement number $n\geq \Omega(s\log(p))$ where $s$ is the sparsity number, then the true solution can be recovered with high probability; and 2) in the noisy case, if the condition number of $D$ is bounded and the measurement increases faster than $s\log(p)$, that is, $s\log(p)=o(n)$, the estimate error converges to zero with probability 1 when $p$ and $s$ go to infinity. Our results are consistent with those for the special case $D=\bold{I}_{p\times p}$ (equivalently LASSO) and improve the existing analysis. The condition number of $D$ plays a critical role in our analysis. We consider the condition numbers in two cases including the fused LASSO and the random graph: the condition number in the fused LASSO case is bounded by a constant, while the condition number in the random graph case is bounded with high probability if $m\over p$ (i.e., $#text{edge}\over #text{vertex}$) is larger than a certain constant. Numerical simulations are consistent with our theoretical results.
- Non-convex sparsity-inducing penalties have recently received considerable attentions in sparse learning. Recent theoretical investigations have demonstrated their superiority over the convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a big challenge. A commonly used approach is the Multi-Stage (MS) convex relaxation (or DC programming), which relaxes the original non-convex problem to a sequence of convex problems. This approach is usually not very practical for large-scale problems because its computational cost is a multiple of solving a single convex problem. In this paper, we propose a General Iterative Shrinkage and Thresholding (GIST) algorithm to solve the nonconvex optimization problem for a large class of non-convex penalties. The GIST algorithm iteratively solves a proximal operator problem, which in turn has a closed-form solution for many commonly used penalties. At each outer iteration of the algorithm, we use a line search initialized by the Barzilai-Borwein (BB) rule that allows finding an appropriate step size quickly. The paper also presents a detailed convergence analysis of the GIST algorithm. The efficiency of the proposed algorithm is demonstrated by extensive experiments on large-scale data sets.
- Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have $0$ components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no "exact" screening rule for group Lasso. We have evaluated our screening rule using synthetic and real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso.
- Oct 23 2012 stat.ML arXiv:1210.5806v1Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an $\ell_0$-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms.
- In this paper, we consider the problem of estimating multiple graphical models simultaneously using the fused lasso penalty, which encourages adjacent graphs to share similar structures. A motivating example is the analysis of brain networks of Alzheimer's disease using neuroimaging data. Specifically, we may wish to estimate a brain network for the normal controls (NC), a brain network for the patients with mild cognitive impairment (MCI), and a brain network for Alzheimer's patients (AD). We expect the two brain networks for NC and MCI to share common structures but not to be identical to each other; similarly for the two brain networks for MCI and AD. The proposed formulation can be solved using a second-order method. Our key technical contribution is to establish the necessary and sufficient condition for the graphs to be decomposable. Based on this key property, a simple screening rule is presented, which decomposes the large graphs into small subgraphs and allows an efficient estimation of multiple independent (small) subgraphs, dramatically reducing the computational cost. We perform experiments on both synthetic and real data; our results demonstrate the effectiveness and efficiency of the proposed approach.
- We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the $\ell_1$-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method and the alternating direction method of multipliers (ADMM) respectively; we also develop efficient algorithms to solve the key components in both AG and ADMM. In addition, we conduct theoretical analysis on the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program; based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.
- Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. While promising, most of the existing works use convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex paradigm to sparse group feature selection, which is motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contributions of this article are twofold: (1) statistically, we introduce a nonconvex sparse group feature selection model which can reconstruct the oracle estimator. Therefore, consistent feature selection and parameter estimation can be achieved; (2) computationally, we propose an efficient algorithm that is applicable to large-scale problems. Numerical results suggest that the proposed nonconvex method compares favorably against its competitors on synthetic data and real-world applications, thus achieving desired goal of delivering high performance.
- The problem of joint feature selection across a group of related tasks has applications in many areas including biomedical informatics and computer vision. We consider the l2,1-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in the probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of the l2,1-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. However, the resulting optimization problem is challenging to solve due to the non-smoothness of the l2,1-norm regularization. In this paper, we propose to accelerate the computation by reformulating it as two equivalent smooth convex optimization problems which are then solved via the Nesterov's method-an optimal first-order black-box method for smooth convex optimization. A key building block in solving the reformulations is the Euclidean projection. We show that the Euclidean projection for the first reformulation can be analytically computed, while the Euclidean projection for the second one can be computed in linear time. Empirical evaluations on several data sets verify the efficiency of the proposed algorithms.