results for au:Yin_W in:cs

- Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternate approach where we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data replication strategies.
- Oct 03 2017 cs.CL arXiv:1710.00519v1 In NLP, convolutional neural networks (CNNs) have benefited less than recurrent neural networks (RNNs) from attention mechanisms. We hypothesize that this is because attention in CNNs has been mainly implemented as attentive pooling (i.e., it is applied to pooling) rather than as attentive convolution (i.e., it is integrated into convolution). Convolution is the differentiator of CNNs in that it can powerfully model the higher-level representation of a word by taking into account its local fixed-size context in input text $t^x$. In this work, we propose an attentive convolution network, AttentiveConvNet. It extends the context scope of the convolution operation, deriving higher-level features for a word not only from local context, but also from information extracted from nonlocal context by the attention mechanism commonly used in RNNs. This nonlocal context can come (i) from parts of the input text $t^x$ that are distant or (ii) from a second input text, the context text $t^y$. In an evaluation on sentence relation classification (textual entailment and answer sentence selection) and text classification, experiments demonstrate that AttentiveConvNet has state-of-the-art performance and outperforms RNN/CNN variants with and without attention.
- Convolutional sparse representations are a form of sparse representation with a structured, translation invariant dictionary. Most convolutional dictionary learning algorithms to date operate in batch mode, requiring simultaneous access to all training images during the learning process, which results in very high memory usage and severely limits the training data that can be used. Very recently, however, a number of authors have considered the design of online convolutional dictionary learning algorithms that offer far better scaling of memory and computational cost with training set size than batch methods. This paper extends our prior work in three ways: it improves a number of aspects of our previous algorithm; it proposes an entirely new algorithm that performs better and supports the inclusion of a spatial mask for learning from incomplete data; and it provides a rigorous theoretical analysis of these methods.
- While a number of different algorithms have recently been proposed for convolutional dictionary learning, this remains an expensive problem. The single biggest impediment to learning from large training sets is the memory requirement, which grows at least linearly with the size of the training set since all existing methods are batch algorithms. The work reported here addresses this limitation by extending online dictionary learning ideas to the convolutional context.
- In this paper, we present a method for identifying infeasible, unbounded, and pathological conic programs based on Douglas-Rachford splitting, or equivalently ADMM. When an optimization program is infeasible, unbounded, or pathological, the iterates of Douglas-Rachford splitting diverge. Somewhat surprisingly, such divergent iterates still provide useful information, which our method uses for identification. In addition, for strongly infeasible problems the method produces a separating hyperplane and informs the user on how to minimally modify the given problem to achieve strong feasibility. As a first-order method, the proposed algorithm relies on simple subroutines, and therefore is simple to implement and has low per-iteration cost.
- Relation detection is a core component of many NLP applications, including Knowledge Base Question Answering (KBQA). In this paper, we propose a hierarchical recurrent neural network enhanced by residual learning that detects KB relations given an input question. Our method uses deep residual bidirectional LSTMs to compare questions and relation names via different hierarchies of abstraction. Additionally, we propose a simple KBQA system that integrates entity linking and our proposed relation detector so that each enhances the other. Experimental results show that our approach not only achieves outstanding relation detection performance but, more importantly, helps our KBQA system achieve state-of-the-art accuracy on both single-relation (SimpleQuestions) and multi-relation (WebQSP) QA benchmarks.
- Apr 10 2017 cs.CV arXiv:1704.02166v1 Generating and manipulating human facial images using high-level attribute controls are important and interesting problems. The models proposed in previous work can solve one of these two problems (generation or manipulation), but not both coherently. This paper proposes a novel model that learns how to both generate and modify the facial image from high-level semantic attributes. Our key idea is to formulate a Semi-Latent Facial Attribute Space (SL-FAS) to systematically learn the relationship between user-defined and latent attributes, as well as between those attributes and RGB imagery. As part of this newly formulated space, we propose a new model, SL-GAN, which is a specific form of Generative Adversarial Network. Finally, we present an iterative training algorithm for SL-GAN. The experiments on the recent CelebA and CASIA-WebFace datasets validate the effectiveness of our proposed framework. We will also make data, pre-trained models and code available.
- Feb 08 2017 cs.CL arXiv:1702.01923v1 Deep neural networks (DNN) have revolutionized the field of natural language processing (NLP). Convolutional neural network (CNN) and recurrent neural network (RNN), the two main types of DNN architectures, are widely explored to handle various NLP tasks. CNN is supposed to be good at extracting position-invariant features and RNN at modeling units in sequence. The state of the art on many NLP tasks often switches due to the battle between CNNs and RNNs. This work is the first systematic comparison of CNN and RNN on a wide range of representative NLP tasks, aiming to give basic guidance for DNN selection.
- Jan 10 2017 cs.CL arXiv:1701.02149v1 This work comparatively studies two typical sentence matching tasks: textual entailment (TE) and answer selection (AS), observing that weaker phrase alignments are more critical in TE, while stronger phrase alignments deserve more attention in AS. The key to reaching this observation lies in phrase detection, phrase representation, phrase alignment, and more importantly how to connect those aligned phrases of different matching degrees with the final classifier. Prior work (i) has limitations in phrase generation and representation, or (ii) conducts alignment at word and phrase levels by handcrafted features or (iii) utilizes a single framework of alignment without considering the characteristics of specific tasks, which limits the framework's effectiveness across tasks. We propose an architecture based on Gated Recurrent Unit that supports (i) representation learning of phrases of arbitrary granularity and (ii) task-specific attentive pooling of phrase alignments between two sentences. Experimental results on TE and AS match our observation and show the effectiveness of our approach.
- Recent years have witnessed the surge of asynchronous parallel (async-parallel) iterative algorithms due to problems involving very large-scale data and a large number of decision variables. Because of asynchrony, the iterates are computed with outdated information, and the age of the outdated information, which we call delay, is the number of times it has been updated since its creation. Almost all recent works prove convergence under the assumption of a finite maximum delay and set their stepsize parameters accordingly. However, the maximum delay is practically unknown. This paper presents convergence analysis of an async-parallel method from a probabilistic viewpoint, and it allows for large unbounded delays. An explicit formula of stepsize that guarantees convergence is given depending on delays' statistics. With $p+1$ identical processors, we empirically measured that delays closely follow the Poisson distribution with parameter $p$, matching our theoretical model, and thus the stepsize can be set accordingly. Simulations on both convex and nonconvex optimization problems demonstrate the validity of our analysis and also show that the existing maximum-delay-induced stepsize is too conservative, often slowing down the convergence of the algorithm.
- In this paper, we discuss how to design the graph topology to reduce the communication complexity of certain algorithms for decentralized optimization. Our goal is to minimize the total communication needed to achieve a prescribed accuracy. We discover that the so-called expander graphs are near-optimal choices. We propose three approaches to construct expander graphs for different numbers of nodes and node degrees. Our numerical results show that the performance of decentralized optimization is significantly better on expander graphs than on other regular graphs.
- We propose an asynchronous, decentralized algorithm for consensus optimization. The algorithm runs over a network in which the agents communicate with their neighbors and perform local computation. In the proposed algorithm, each agent can compute and communicate independently at different times, for different durations, with the information it has even if the latest information from its neighbors is not yet available. Such an asynchronous algorithm reduces the time that agents would otherwise waste idle because of communication delays or because their neighbors are slower. It also eliminates the need for a global clock for synchronization. Mathematically, the algorithm involves both primal and dual variables, uses fixed step-size parameters, and provably converges to the exact solution under a bounded delay assumption and a random agent assumption. When running synchronously, the algorithm performs just as well as existing competitive synchronous algorithms such as PG-EXTRA, which diverges without synchronization. Numerical experiments confirm the theoretical findings and illustrate the performance of the proposed algorithm.
- The need for scalable numerical solutions has motivated the development of asynchronous parallel algorithms, where a set of nodes run in parallel with little or no synchronization, thus computing with delayed information. This paper studies the convergence of the asynchronous parallel algorithm ARock under potentially unbounded delays. ARock is a general asynchronous algorithm that has many applications. It parallelizes fixed-point iterations by letting a set of nodes randomly choose solution coordinates and update them in an asynchronous parallel fashion. ARock takes some recent asynchronous coordinate descent algorithms as special cases and gives rise to new asynchronous operator-splitting algorithms. Existing analysis of ARock assumes the delays to be bounded, and uses this bound to set a step size that is important to both convergence and efficiency. Other work, though allowing unbounded delays, imposes strict conditions on the underlying fixed-point operator, resulting in limited applications. In this paper, convergence is established under unbounded delays, which can be either stochastic or deterministic. The proposed step sizes are more practical and generally larger than those in the existing work. The step size adapts to the delay distribution or the current delay being experienced in the system. New Lyapunov functions, which are the key to analyzing asynchronous algorithms, are generated to obtain our results. A set of applicable optimization algorithms with large-scale applications are given, including machine learning and scientific computing algorithms.
- Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of the behavior of consensus nonconvex optimization is more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most other properties that are known in the convex setting. In particular, when diminishing (or constant) step sizes are used, we can prove convergence to a (or a neighborhood of a) consensus stationary solution and have guaranteed rates of convergence. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions if their proximal operators can be computed. Such functions include SCAD and $\ell_q$ quasi-norms, $q\in[0,1)$. Similarly, Prox-DGD can handle a constraint to a nonconvex set with an easy projection. To establish these properties, we have to introduce a completely different line of analysis, as well as modify existing proofs that were used in the convex setting.
- Jun 13 2016 cs.CL arXiv:1606.03391v2 This work focuses on answering single-relation factoid questions over Freebase. Each question can acquire the answer from a single fact of the form (subject, predicate, object) in Freebase. This task, simple question answering (SimpleQA), can be addressed via a two-step pipeline: entity linking and fact selection. In fact selection, we match the subject entity in a fact candidate with the entity mention in the question by a character-level convolutional neural network (char-CNN), and match the predicate in that fact with the question by a word-level CNN (word-CNN). This work makes two main contributions. (i) A simple and effective entity linker over Freebase is proposed. Our entity linker outperforms the state-of-the-art entity linker on the SimpleQA task. (ii) A novel attentive maxpooling is stacked over word-CNN, so that the predicate representation can be matched with the predicate-focused question representation more effectively. Experiments show that our system sets a new state of the art in this task.
- Apr 26 2016 cs.CL arXiv:1604.06896v2 This work comparatively studies two typical sentence pair classification tasks: textual entailment (TE) and answer selection (AS), observing that phrase alignments of different intensities contribute differently in these tasks. We address the problems of identifying phrase alignments of flexible granularity and pooling alignments of different intensities for these tasks. Examples of flexible granularity are alignments between two single words, between a single word and a phrase, and between a short phrase and a long phrase. By intensity we roughly mean the degree of match; it ranges from identity over surface-form co-occurrence, rephrasing and other semantic relatedness to unrelated words, as in much parenthetical text. Prior work (i) has limitations in phrase generation and representation, or (ii) conducts alignment at word and phrase levels by handcrafted features or (iii) utilizes a single attention mechanism over alignment intensities without considering the characteristics of specific tasks, which limits the system's effectiveness across tasks. We propose an architecture based on Gated Recurrent Unit that supports (i) representation learning of phrases of arbitrary granularity and (ii) task-specific focusing of phrase alignments between two sentences by attention pooling. Experimental results on TE and AS match our observation and are state-of-the-art.
- Apr 05 2016 cs.CL arXiv:1604.00503v1 This work, concerning the paraphrase identification task, on one hand contributes to expanding deep learning embeddings to include continuous and discontinuous linguistic phrases. On the other hand, it comes up with a new scheme, TF-KLD-KNN, to learn the discriminative weights of words and phrases specific to the paraphrase task, so that a weighted sum of embeddings can represent sentences more effectively. Based on these two innovations we get competitive state-of-the-art performance on paraphrase identification.
- Apr 05 2016 cs.CL arXiv:1604.00502v1 We propose online unsupervised domain adaptation (DA), which is performed incrementally as data comes in and is applicable when batch DA is not possible. In a part-of-speech (POS) tagging evaluation, we find that online unsupervised DA performs as well as batch DA.
- Mar 16 2016 cs.CL arXiv:1603.04513v1 We propose MVCNN, a convolutional neural network (CNN) architecture for sentence classification. It (i) combines diverse versions of pretrained word embeddings and (ii) extracts features of multigranular phrases with variable-size convolution filters. We also show that pretraining MVCNN is critical for good performance. MVCNN achieves state-of-the-art performance on four tasks: on small-scale binary, small-scale multi-class and large-scale Twitter sentiment prediction and on subjectivity classification.
- Feb 16 2016 cs.CL arXiv:1602.04341v1 Understanding open-domain text is one of the primary challenges in natural language processing (NLP). Machine comprehension benchmarks evaluate the system's ability to understand text based on the text content only. In this work, we investigate machine comprehension on MCTest, a question answering (QA) benchmark. Prior work is mainly based on feature engineering approaches. We come up with a neural network framework, named hierarchical attention-based convolutional neural network (HABCNN), to address this task without any manually designed features. Specifically, we explore HABCNN for this task via two routes: one through traditional joint modeling of passage, question and answer, the other through textual entailment. HABCNN employs an attention mechanism to detect key phrases, key sentences and key snippets that are relevant to answering the question. Experiments show that HABCNN outperforms prior deep learning approaches by a large margin.
- This paper focuses on coordinate update methods, which are useful for solving problems involving large or high-dimensional datasets. They decompose a problem into simple subproblems, each of which updates one variable, or a small block of variables, while fixing the others. These methods can deal with linear and nonlinear mappings, smooth and nonsmooth functions, as well as convex and nonconvex problems. In addition, they are easy to parallelize. The strong performance of coordinate update methods depends on solving simple subproblems. To derive simple subproblems for several new classes of applications, this paper systematically studies coordinate-friendly operators that perform low-cost coordinate updates. Based on the discovered coordinate-friendly operators, as well as operator splitting techniques, we obtain new coordinate update algorithms for a variety of problems in machine learning, image processing, as well as sub-areas of optimization. Several problems are treated with coordinate update for the first time. The obtained algorithms are scalable to large instances through parallel and even asynchronous computing. We present numerical examples to illustrate how effective these algorithms are.
- Dec 17 2015 cs.CL arXiv:1512.05193v3 How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS), paraphrase identification (PI) and textual entailment (TE). Most prior work (i) deals with one individual task by fine-tuning a specific system; (ii) models each sentence's representation separately, rarely considering the impact of the other sentence; or (iii) relies fully on manually designed, task-specific linguistic features. This work presents a general Attention Based Convolutional Neural Network (ABCNN) for modeling a pair of sentences. We make three contributions. (i) ABCNN can be applied to a wide variety of tasks that require modeling of sentence pairs. (ii) We propose three attention schemes that integrate mutual influence between sentences into CNN; thus, the representation of each sentence takes into consideration its counterpart. These interdependent sentence pair representations are more powerful than isolated sentence representations. (iii) ABCNN achieves state-of-the-art performance on AS, PI and TE tasks.
- In this paper, we analyze the convergence of the alternating direction method of multipliers (ADMM) for minimizing a nonconvex and possibly nonsmooth objective function, $\phi(x_0,\ldots,x_p,y)$, subject to coupled linear equality constraints. Our ADMM updates each of the primal variables $x_0,\ldots,x_p,y$, followed by updating the dual variable. We separate the variable $y$ from the $x_i$'s as it has a special role in our analysis. The developed convergence guarantee covers a variety of nonconvex functions such as piecewise linear functions, $\ell_q$ quasi-norm, Schatten-$q$ quasi-norm ($0<q<1$), minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD) penalty. It also allows nonconvex constraints such as compact manifolds (e.g., spherical, Stiefel, and Grassmann manifolds) and linear complementarity constraints. Also, the $x_0$-block can be almost any lower semi-continuous function. By applying our analysis, we show, for the first time, that several ADMM algorithms applied to solve nonconvex models in statistical learning, optimization on manifolds, and matrix decomposition are guaranteed to converge. Our results provide sufficient conditions for ADMM to converge on (convex or nonconvex) monotropic programs with three or more blocks, as they are special cases of our model. ADMM has been regarded as a variant of the augmented Lagrangian method (ALM). We present a simple example to illustrate how ADMM converges but ALM diverges with bounded penalty parameter $\beta$. As indicated by this example and other analysis in this paper, ADMM might be a better choice than ALM for some nonconvex nonsmooth problems, because ADMM is not only easier to implement but also more likely to converge in the scenarios concerned.
- In this note, we extend the algorithms Extra and subgradient-push to a new algorithm ExtraPush for consensus optimization with convex differentiable objective functions over a directed network. When the stationary distribution of the network can be computed in advance, we propose a simplified algorithm called Normalized ExtraPush. Just like Extra, both ExtraPush and Normalized ExtraPush can iterate with a fixed step size. But unlike Extra, they can take a column-stochastic mixing matrix, which is not necessarily doubly stochastic. Therefore, they remove the undirected-network restriction of Extra. Subgradient-push, while it also works for directed networks, is slower on the same type of problem because it must use a sequence of diminishing step sizes. We present preliminary analysis for ExtraPush under a bounded sequence assumption. For Normalized ExtraPush, we show that it naturally produces a bounded, linearly convergent sequence provided that the objective function is strongly convex. In our numerical experiments, ExtraPush and Normalized ExtraPush performed similarly well. They are significantly faster than subgradient-push, even when we hand-optimize the step sizes for the latter.
- Aug 19 2015 cs.CL arXiv:1508.04257v2 Word embeddings -- distributed representations of words -- in deep learning are beneficial for many tasks in natural language processing (NLP). However, different embedding sets vary greatly in quality and characteristics of the captured semantics. Instead of relying on a more advanced algorithm for embedding learning, this paper proposes an ensemble approach of combining different public embedding sets with the aim of learning meta-embeddings. Experiments on word similarity and analogy tasks and on part-of-speech tagging show better performance of meta-embeddings compared to individual embedding sets. One advantage of meta-embeddings is the increased vocabulary coverage. We will release our meta-embeddings publicly.
- Jun 09 2015 cs.LG arXiv:1506.02585v1 In this paper, a novel framework of sparse kernel learning for Support Vector Data Description (SVDD) based anomaly detection is presented. In this work, optimal sparse feature selection for anomaly detection is first modeled as a Mixed Integer Programming (MIP) problem. Due to the prohibitively high computational complexity of the MIP, it is relaxed into a Quadratically Constrained Linear Programming (QCLP) problem. The QCLP problem can then be practically solved by using an iterative optimization method, in which multiple subsets of features are iteratively found as opposed to a single subset. The QCLP-based iterative optimization problem is solved in a finite space called the Empirical Kernel Feature Space (EKFS) instead of in the input space or Reproducing Kernel Hilbert Space (RKHS). This is possible because the geometrical properties of the EKFS and the corresponding RKHS remain the same. Now, an explicit nonlinear exploitation of the data in a finite EKFS is achievable, which results in optimal feature ranking. Experimental results based on a hyperspectral image show that the proposed method can provide improved performance over the current state-of-the-art techniques.
- Finding a fixed point of a nonexpansive operator, i.e., $x^*=Tx^*$, abstracts many problems in numerical linear algebra, optimization, and other areas of scientific computing. To solve fixed-point problems, we propose ARock, an algorithmic framework in which multiple agents (machines, processors, or cores) update $x$ in an asynchronous parallel fashion. Asynchrony is crucial to parallel computing since it reduces synchronization wait, relaxes communication bottleneck, and thus speeds up computing significantly. At each step of ARock, an agent updates a randomly selected coordinate $x_i$ based on possibly out-of-date information on $x$. The agents share $x$ through either global memory or communication. If writing $x_i$ is atomic, the agents can read and write $x$ without memory locks. Theoretically, we show that if the nonexpansive operator $T$ has a fixed point, then with probability one, ARock generates a sequence that converges to a fixed point of $T$. Our conditions on $T$ and step sizes are weaker than comparable work. Linear convergence is also obtained. We propose special cases of ARock for linear systems, convex optimization, machine learning, as well as distributed and decentralized consensus problems. Numerical experiments of solving sparse logistic regression problems are presented.
- Various algorithms have been proposed for dictionary learning. Among those for image processing, many use image patches to form dictionaries. This paper focuses on whole-image recovery from corrupted linear measurements. We address the open issue of representing an image by overlapping patches: the overlapping leads to an excessive number of dictionary coefficients to determine. With very few exceptions, this issue has limited the applications of image-patch methods to the local kind of tasks such as denoising, inpainting, cartoon-texture decomposition, super-resolution, and image deblurring, for which one can process a few patches at a time. Our focus is global imaging tasks such as compressive sensing and medical image recovery, where the whole image is encoded together, making it either impossible or very ineffective to update a few patches at a time. Our strategy is to divide the sparse recovery into multiple subproblems, each of which handles a subset of non-overlapping patches, and then the results of the subproblems are averaged to yield the final recovery. This simple strategy is surprisingly effective in terms of both quality and speed. In addition, we accelerate computation of the learned dictionary by applying a recent block proximal-gradient method, which not only has a lower per-iteration complexity but also takes fewer iterations to converge, compared to the current state-of-the-art. We also establish that our algorithm globally converges to a stationary point. Numerical results on synthetic data demonstrate that our algorithm can recover a more faithful dictionary than two state-of-the-art methods. Combining our whole-image recovery and dictionary-learning methods, we numerically simulate image inpainting, compressive sensing recovery, and deblurring. Our recovery is more faithful than those of a total variation method and a method based on overlapping patches.
- We present a video compressive sensing framework, termed kt-CSLDS, to accelerate the image acquisition process of dynamic magnetic resonance imaging (MRI). We are inspired by a state-of-the-art model for video compressive sensing that utilizes a linear dynamical system (LDS) to model the motion manifold. Given compressive measurements, the state sequence of an LDS can be first estimated using system identification techniques. We then reconstruct the observation matrix using a joint structured sparsity assumption. In particular, we minimize an objective function with a mixture of wavelet sparsity and joint sparsity within the observation matrix. We derive an efficient convex optimization algorithm through alternating direction method of multipliers (ADMM), and provide a theoretical guarantee for global convergence. We demonstrate the performance of our approach for video compressive sensing, in terms of reconstruction accuracy. We also investigate the impact of various sampling strategies. We apply this framework to accelerate the acquisition process of dynamic MRI and show it achieves the best reconstruction accuracy with the least computational time compared with existing algorithms in the literature.
- Minimization of the $\ell_{\infty}$ (or maximum) norm subject to a constraint that imposes consistency to an underdetermined system of linear equations finds use in a large number of practical applications, including vector quantization, approximate nearest neighbor search, peak-to-average power ratio (or "crest factor") reduction in communication systems, and peak force minimization in robotics and control. This paper analyzes the fundamental properties of signal representations obtained by solving such a convex optimization problem. We develop bounds on the maximum magnitude of such representations using the uncertainty principle (UP) introduced by Lyubarskii and Vershynin, and study the efficacy of $\ell_{\infty}$-norm-based dynamic range reduction. Our analysis shows that matrices satisfying the UP, such as randomly subsampled Fourier or i.i.d. Gaussian matrices, enable the computation of what we call democratic representations, whose entries all have small and similar magnitude, as well as low dynamic range. To compute democratic representations at low computational complexity, we present two new, efficient convex optimization algorithms. We finally demonstrate the efficacy of democratic representations for dynamic range reduction in a DVB-T2-based broadcast system.
- Dec 19 2013 cs.CL arXiv:1312.5129v2 Deep learning embeddings have been successfully used for many natural language processing problems. Embeddings are mostly computed for word forms although a number of recent papers have extended this to other linguistic units like morphemes and phrases. In this paper, we argue that learning embeddings for discontinuous linguistic units should also be considered. In an experimental evaluation on coreference resolution, we show that such embeddings perform better than word form embeddings.
- Higher-order low-rank tensors naturally arise in many applications including hyperspectral data recovery, video inpainting, seismic data reconstruction, and so on. We propose a new model to recover a low-rank tensor by simultaneously performing low-rank matrix factorizations to the all-mode matricizations of the underlying tensor. An alternating minimization algorithm is applied to solve the model, along with two adaptive rank-adjusting strategies when the exact rank is not known. Phase transition plots reveal that our algorithm can recover a variety of synthetic low-rank tensors from significantly fewer samples than the compared methods, which include a matrix completion method applied to tensor recovery and two state-of-the-art tensor completion methods. Further tests on real-world data show similar advantages. Although our model is non-convex, our algorithm performs consistently throughout the tests and gives better results than the compared methods, some of which are based on convex models. In addition, the global convergence of our algorithm can be established in the sense that the gradient of the Lagrangian function converges to zero.
- Convex optimization models find interesting applications, especially in signal/image processing and compressive sensing. We study some augmented convex models, which are perturbed by strongly convex functions, and propose a dual gradient algorithm. The proposed algorithm includes the linearized Bregman algorithm and the singular value thresholding algorithm as special cases. Based on fundamental properties of proximal operators, we present a concise approach to establish the convergence of both primal and dual sequences, improving the results in the existing literature.
- The $\ell_1$-synthesis model and the $\ell_1$-analysis model recover structured signals from their undersampled measurements. The solution of the former is a sparse sum of dictionary atoms, and that of the latter makes sparse correlations with dictionary atoms. This paper addresses the question: when can we trust these models to recover specific signals? We answer the question with a condition that is both necessary and sufficient to guarantee the recovery to be unique and exact and, in the presence of measurement noise, to be robust. The condition is one-for-all in the sense that it applies to both the $\ell_1$-synthesis and $\ell_1$-analysis models, to both their constrained and unconstrained formulations, and to both the exact recovery and robust recovery cases. Furthermore, a convex infinity-norm program is introduced for numerically verifying the condition. A comprehensive comparison with related existing conditions is included.
- The convergence behavior of gradient methods for minimizing convex differentiable functions is one of the core questions in convex optimization. This paper shows that their well-known complexities can be achieved under conditions weaker than the commonly accepted ones. We relax the common gradient Lipschitz-continuity condition and strong convexity condition to ones that hold only over certain line segments. Specifically, we establish complexities $O(\frac{R}{\epsilon})$ and $O(\sqrt{\frac{R}{\epsilon}})$ for the ordinary and accelerated gradient methods, respectively, assuming that $\nabla f$ is Lipschitz continuous with constant $R$ over the line segment joining $x$ and $x-\frac{1}{R}\nabla f$ for each $x\in\operatorname{dom} f$. Then we improve them to $O(\frac{R}{\nu}\log(\frac{1}{\epsilon}))$ and $O(\sqrt{\frac{R}{\nu}}\log(\frac{1}{\epsilon}))$ for a function $f$ that also satisfies the secant inequality $\langle \nabla f(x), x- x^*\rangle \ge \nu\|x-x^*\|^2$ for each $x\in \operatorname{dom} f$ and its projection $x^*$ to the minimizer set of $f$. The secant condition is also shown to be necessary for the geometric decay of solution error. Not only are the relaxed conditions met by more functions, but the restrictions also give a smaller $R$ and a larger $\nu$ than are obtained without the restrictions, and thus lead to better complexity bounds. We apply these results to sparse optimization and demonstrate a faster algorithm.
- This paper shows that the solutions to various convex $\ell_1$ minimization problems are unique if and only if a common set of conditions is satisfied. This result applies broadly to the basis pursuit model, basis pursuit denoising model, Lasso model, as well as other $\ell_1$ models that either minimize $f(Ax-b)$ or impose the constraint $f(Ax-b)\leq\sigma$, where $f$ is a strictly convex function. For these models, this paper proves that, given a solution $x^*$ and defining $I=\operatorname{supp}(x^*)$ and $s=\operatorname{sign}(x^*_I)$, $x^*$ is the unique solution if and only if $A_I$ has full column rank and there exists $y$ such that $A_I^Ty=s$ and $|a_i^Ty|_\infty<1$ for $i\not\in I$. This condition was previously known to be sufficient for the basis pursuit model to have a unique solution supported on $I$. Indeed, it is also necessary, and applies to a variety of other $\ell_1$ models. The paper also discusses ways to recognize unique solutions and verify the uniqueness conditions numerically.
- This paper studies the long-existing idea of adding a nice smooth function to "smooth" a non-differentiable objective function in the context of sparse optimization, in particular, the minimization of $||x||_1+1/(2\alpha)||x||_2^2$, where $x$ is a vector, as well as the minimization of $||X||_*+1/(2\alpha)||X||_F^2$, where $X$ is a matrix and $||X||_*$ and $||X||_F$ are the nuclear and Frobenius norms of $X$, respectively. We show that they can efficiently recover sparse vectors and low-rank matrices. In particular, they enjoy exact and stable recovery guarantees similar to those known for minimizing $||x||_1$ and $||X||_*$ under the conditions on the sensing operator such as its null-space property, restricted isometry property, spherical section property, or RIPless property. To recover a (nearly) sparse vector $x^0$, minimizing $||x||_1+1/(2\alpha)||x||_2^2$ returns (nearly) the same solution as minimizing $||x||_1$ almost whenever $\alpha\ge 10||x^0||_\infty$. The same relation also holds between minimizing $||X||_*+1/(2\alpha)||X||_F^2$ and minimizing $||X||_*$ for recovering a (nearly) low-rank matrix $X^0$, if $\alpha\ge 10||X^0||_2$. Furthermore, we show that the linearized Bregman algorithm for minimizing $||x||_1+1/(2\alpha)||x||_2^2$ subject to $Ax=b$ enjoys global linear convergence as long as a nonzero solution exists, and we give an explicit rate of convergence. The convergence property does not require a sparse solution or any properties on $A$. To our knowledge, this is the best known global convergence result for first-order sparse optimization algorithms.
- We propose and analyze an extremely fast, efficient, and simple method for solving the problem: $\min \{\|u\|_1 : Au = f,\ u \in \mathbb{R}^n\}$. This method was first described in [J. Darbon and S. Osher, preprint, 2007], with more details in [W. Yin, S. Osher, D. Goldfarb and J. Darbon, SIAM J. Imaging Sciences, 1(1), 143-168, 2008] and rigorous theory given in [J. Cai, S. Osher and Z. Shen, Math. Comp., to appear, 2008, see also UCLA CAM Report 08-06] and [J. Cai, S. Osher and Z. Shen, UCLA CAM Report, 08-52, 2008]. The motivation was compressive sensing, which now has a vast and exciting history, which seems to have started with Candes et al. [E. Candes, J. Romberg and T. Tao, 52(2), 489-509, 2006] and Donoho [D. L. Donoho, IEEE Trans. Inform. Theory, 52, 1289-1306, 2006]. See [W. Yin, S. Osher, D. Goldfarb and J. Darbon, SIAM J. Imaging Sciences 1(1), 143-168, 2008] and [J. Cai, S. Osher and Z. Shen, Math. Comp., to appear, 2008, see also UCLA CAM Report, 08-06] and [J. Cai, S. Osher and Z. Shen, UCLA CAM Report, 08-52, 2008] for a large set of references. Our method introduces an improvement called "kicking" of the very efficient method of [J. Darbon and S. Osher, preprint, 2007] and [W. Yin, S. Osher, D. Goldfarb and J. Darbon, SIAM J. Imaging Sciences, 1(1), 143-168, 2008] and also applies it to the problem of denoising of undersampled signals. The use of Bregman iteration for denoising of images began in [S. Osher, M. Burger, D. Goldfarb, J. Xu and W. Yin, Multiscale Model. Simul, 4(2), 460-489, 2005] and led to improved results for total variation based methods. Here we apply it to denoise signals, especially essentially sparse signals, which might even be undersampled.
- This paper introduces an algorithm for the nonnegative matrix factorization-and-completion problem, which aims to find nonnegative low-rank matrices X and Y so that the product XY approximates a nonnegative data matrix M whose elements are partially known (to a certain accuracy). This problem aggregates two existing problems: (i) nonnegative matrix factorization where all entries of M are given, and (ii) low-rank matrix completion where nonnegativity is not required. By taking advantage of both nonnegativity and low-rankness, one can generally obtain better results than by using just one of the two properties. We propose to solve the non-convex constrained least-squares problem using an algorithm based on the classic alternating direction augmented Lagrangian method. Preliminary convergence properties of the algorithm and numerical simulation results are presented. Compared to a recent algorithm for nonnegative matrix factorization, the proposed algorithm produces factorizations of similar quality using only about half of the matrix entries. On tasks of recovering incomplete grayscale and hyperspectral images, the proposed algorithm yields overall better quality than that produced by two recent matrix-completion algorithms that do not exploit nonnegativity.
- Spectrum sensing, which aims at detecting spectrum holes, is the precondition for the implementation of cognitive radio (CR). Collaborative spectrum sensing among the cognitive radio nodes is expected to improve the ability of checking complete spectrum usage. Due to hardware limitations, each cognitive radio node can only sense a relatively narrow band of radio spectrum. Consequently, the available channel sensing information is far from being sufficient for precisely recognizing the wide range of unoccupied channels. Aiming at breaking this bottleneck, we propose to apply matrix completion and joint sparsity recovery to reduce sensing and transmitting requirements and improve sensing results. Specifically, equipped with a frequency selective filter, each cognitive radio node senses linear combinations of multiple channel information and reports them to the fusion center, where occupied channels are then decoded from the reports by using novel matrix completion and joint sparsity recovery algorithms. As a result, the number of reports sent from the CRs to the fusion center is significantly reduced. We propose two decoding approaches, one based on matrix completion and the other based on joint sparsity recovery, both of which allow exact recovery from incomplete reports. The numerical results validate the effectiveness and robustness of our approaches. In particular, in small-scale networks, the matrix completion approach achieves exact channel detection with a number of samples no more than 50% of the number of channels in the network, while joint sparsity recovery achieves similar performance in large-scale networks.
- In cognitive radio, spectrum sensing is a key component to detect spectrum holes (i.e., channels not used by any primary users). Collaborative spectrum sensing among the cognitive radio nodes is expected to improve the ability of checking complete spectrum usage states. Unfortunately, due to power limitations and channel fading, available channel sensing information is far from being sufficient to tell the unoccupied channels directly. Aiming at breaking this bottleneck, we apply recent matrix completion techniques to greatly reduce the sensing information needed. We formulate the collaborative sensing problem as a matrix completion subproblem and a joint-sparsity reconstruction subproblem. Results of numerical simulations that validated the effectiveness and robustness of the proposed approach are presented. In particular, in noiseless cases, when the number of primary users is small, exact detection was obtained with no more than 8% of the complete sensing information, whilst as the number of primary users increases, to achieve a detection rate of 95.55%, the required information percentage was merely 16.8%.
- We present a novel sparse signal reconstruction method "ISD", aiming to achieve fast reconstruction and a reduced requirement on the number of measurements compared to the classical $\ell_1$ minimization approach. ISD addresses failed reconstructions of $\ell_1$ minimization due to insufficient measurements. It estimates a support set $I$ from a current reconstruction and obtains a new reconstruction by solving the minimization problem $\min \{\sum_{i\notin I}|x_i| : Ax=b\}$, and it iterates these two steps a small number of times. ISD differs from the orthogonal matching pursuit (OMP) method, as well as its variants, because (i) the index set $I$ in ISD is not necessarily nested or increasing and (ii) the minimization problem above updates all the components of $x$ at the same time. We generalize the Null Space Property to Truncated Null Space Property and present our analysis of ISD based on the latter. We introduce an efficient implementation of ISD, called threshold-ISD, for recovering signals with fast decaying distributions of nonzeros from compressive sensing measurements. Numerical experiments show that threshold-ISD has significant advantages over the classical $\ell_1$ minimization approach, as well as two state-of-the-art algorithms: the iterative reweighted $\ell_1$ minimization algorithm (IRL1) and the iterative reweighted least-squares algorithm (IRLS). MATLAB code is available for download from http://www.caam.rice.edu/~optimization/L1/ISD/.
- We describe and provide code and examples for a polygonal edge matching method.
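
Several abstracts above refer to the linearized Bregman algorithm for minimizing $||x||_1+1/(2\alpha)||x||_2^2$ subject to $Ax=b$, which one of them describes as a special case of a dual gradient algorithm. The following is only a minimal pure-Python sketch of that idea, written as gradient ascent on the dual variable $y$ with $x=\alpha\,\mathrm{shrink}(A^Ty,1)$. The function names, the dense list-of-lists matrix format, the fixed iteration count, and the conservative step size $h=1/(\alpha\|A\|_F^2)$ are illustrative assumptions, not taken from the papers.

```python
def shrink(v, t):
    """Soft-thresholding: move each entry of v toward zero by t."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def linearized_bregman(A, b, alpha, iters=500):
    """Sketch of a dual gradient iteration for
    min ||x||_1 + 1/(2*alpha) * ||x||_2^2  subject to  A x = b.
    A is a list of rows; b is a list. Names and defaults are hypothetical."""
    m, n = len(A), len(A[0])
    # Assumed step size: the squared Frobenius norm upper-bounds the squared
    # spectral norm of A, so this choice is on the conservative side.
    fro2 = sum(a * a for row in A for a in row)
    h = 1.0 / (alpha * fro2)
    y = [0.0] * m  # dual variable
    x = [0.0] * n  # primal iterate
    for _ in range(iters):
        # Residual b - A x drives the dual update.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        y = [y[i] + h * r[i] for i in range(m)]
        # Primal iterate recovered from the dual by soft-thresholding A^T y.
        Aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
        x = [alpha * s for s in shrink(Aty, 1.0)]
    return x


# Tiny underdetermined system: x1 + x3 = 1 and x2 + x3 = 1.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [1.0, 1.0]
x = linearized_bregman(A, b, alpha=10.0)
print(x)
```

On this toy system the sparsest feasible point is $(0,0,1)$, and with $\alpha=10$ the iterates approach it. The example is meant to show the structure of the update, not to reproduce any experiment from the listed papers.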