# Learning (cs.LG)

• We explore design principles for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation. Convolutional predictors, such as the fully-convolutional network (FCN), have achieved remarkable success by exploiting the spatial redundancy of neighboring pixels through convolutional processing. Though computationally efficient, we point out that such approaches are not statistically efficient during learning precisely because spatial redundancy limits the information learned from neighboring pixels. We demonstrate that stratified sampling of pixels allows one to (1) add diversity during batch updates, speeding up learning; (2) explore complex nonlinear predictors, improving accuracy; and (3) efficiently train state-of-the-art models tabula rasa (i.e., "from scratch") for diverse pixel-labeling tasks. Our single architecture produces state-of-the-art results for semantic segmentation on PASCAL-Context dataset, surface normal estimation on NYUDv2 depth dataset, and edge detection on BSDS.
• Recommendation for e-commerce with a mix of durable and nondurable goods has characteristics that distinguish it from the well-studied media recommendation problem. The demand for items is a combined effect of form utility and time utility, i.e., a product must both be intrinsically appealing to a consumer and the time must be right for purchase. In particular for durable goods, time utility is a function of inter-purchase duration within product category because consumers are unlikely to purchase two items in the same category in close temporal succession. Moreover, purchase data, in contrast to ratings data, is implicit with non-purchases not necessarily indicating dislike. Together, these issues give rise to the positive-unlabeled demand-aware recommendation problem that we pose via joint low-rank tensor completion and product category inter-purchase duration vector estimation. We further relax this problem and propose a highly scalable alternating minimization approach with which we can solve problems with millions of users and items. We also show superior prediction accuracies on multiple real-world data sets.
• Feb 22 2017 cs.LG stat.ML arXiv:1702.06295v1
Initialization of parameters in deep neural networks has been shown to have a big impact on the performance of the networks (Mishkin & Matas, 2015). The initialization scheme devised by He et al, allowed convolution activations to carry a constrained mean which allowed deep networks to be trained effectively (He et al., 2015a). Orthogonal initializations and more generally orthogonal matrices in standard recurrent networks have been proved to eradicate the vanishing and exploding gradient problem (Pascanu et al., 2012). Majority of current initialization schemes do not take fully into account the intrinsic structure of the convolution operator. This paper introduces a new type of initialization built around the duality of the Fourier transform and the convolution operator. With Convolution Aware Initialization we noticed not only higher accuracy and lower loss, but faster convergence in general. We achieve new state of the art on the CIFAR10 dataset, and achieve close to state of the art on various other tasks.
• Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.
• Recommendation system is a common demand in daily life and matrix completion is a widely adopted technique for this task. However, most matrix completion methods lack semantic interpretation and usually result in weak-semantic recommendations. To this end, this paper proposes a \bf Semantic \bf Analysis approach for \bf Recommendation systems \textbf(SAR), which applies a two-level hierarchical generative process that assigns semantic properties and categories for user and item. SAR learns semantic representations of users/items merely from user ratings on items, which offers a new path to recommendation by semantic matching with the learned representations. Extensive experiments demonstrate SAR outperforms other state-of-the-art baselines substantially.
• We obtain the first polynomial-time algorithm for exact tensor completion that improves over the bound implied by reduction to matrix completion. The algorithm recovers an unknown 3-tensor with $r$ incoherent, orthogonal components in $\mathbb R^n$ from $r\cdot \tilde O(n^{1.5})$ randomly observed entries of the tensor. This bound improves over the previous best one of $r\cdot \tilde O(n^{2})$ by reduction to exact matrix completion. Our bound also matches the best known results for the easier problem of approximate tensor completion (Barak & Moitra, 2015). Our algorithm and analysis extends seminal results for exact matrix completion (Candes & Recht, 2009) to the tensor setting via the sum-of-squares method. The main technical challenge is to show that a small number of randomly chosen monomials are enough to construct a degree-3 polynomial with a precisely planted orthogonal global optima over the sphere and that this fact can be certified within the sum-of-squares proof system.
• Boolean matrix factorisation (BooMF) infers interpretable decompositions of a binary data matrix into a pair of low-rank, binary matrices: One containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for BooMF and derive a Metropolised Gibbs sampler that facilitates very efficient parallel posterior inference. Our method outperforms all currently existing approaches for Boolean Matrix factorization and completion, as we show on simulated and real world data. This is the first method to provide full posterior inference for BooMF which is relevant in applications, e.g. for controlling false positive rates in collaborative filtering, and crucially it improves the interpretability of the inferred patterns. The proposed algorithm scales to large datasets as we demonstrate by analysing single cell gene expression data in 1.3 million mouse brain cells across 11,000 genes on commodity hardware.
• Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions---ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to feature extraction services. In a world where nearly every new service has its own API, documentation, and/or client library, data scientists who need to combine diverse features obtained from multiple sources are often forced to write and maintain ever more elaborate feature extraction pipelines. To address this challenge, we introduce a new open-source framework for comprehensive multimodal feature extraction. Pliers is an open-source Python package that supports standardized annotation of diverse data types (video, images, audio, and text), and is expressly with both ease-of-use and extensibility in mind. Users can apply a wide range of pre-existing feature extraction tools to their data in just a few lines of Python code, and can also easily add their own custom extractors by writing modular classes. A graph-based API enables rapid development of complex feature extraction pipelines that output results in a single, standardized format. We describe the package's architecture, detail its major advantages over previous feature extraction toolboxes, and use a sample application to a large functional MRI dataset to illustrate how pliers can significantly reduce the time and effort required to construct sophisticated feature extraction workflows while increasing code clarity and maintainability.
• Reason and inference require process as well as memory skills by humans. Neural networks are able to process tasks like image recognition (better than humans) but in memory aspects are still limited (by attention mechanism, size). Recurrent Neural Network (RNN) and it's modified version LSTM are able to solve small memory contexts, but as context becomes larger than a threshold, it is difficult to use them. The Solution is to use large external memory. Still, it poses many challenges like, how to train neural networks for discrete memory representation, how to describe long term dependencies in sequential data etc. Most prominent neural architectures for such tasks are Memory networks: inference components combined with long term memory and Neural Turing Machines: neural networks using external memory resources. Also, additional techniques like attention mechanism, end to end gradient descent on discrete memory representation are needed to support these solutions. Preliminary results of above neural architectures on simple algorithms (sorting, copying) and Question Answering (based on story, dialogs) application are comparable with the state of the art. In this paper, I explain these architectures (in general), the additional techniques used and the results of their application.
• Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms. In this paper, we take a unified view on both classes of methods, leading to the first explicit convergence rates of matching pursuit methods in an optimization sense, for general sets of atoms. We derive sublinear ($1/t$) convergence for both classes on general smooth objectives, and linear convergence on strongly convex objectives, as well as a clear correspondence of algorithm variants. Our presented algorithms and rates are affine invariant, and do not need any incoherence or sparsity assumptions.
• Given data over the joint distribution of two univariate or multivariate random variables $X$ and $Y$ of mixed or single type data, we consider the problem of inferring the most likely causal direction between $X$ and $Y$. We take an information theoretic approach, from which it follows that first describing the data over cause and then that of effect given cause is shorter than the reverse direction. For practical inference, we propose a score for causal models for mixed type data based on the Minimum Description Length (MDL) principle. In particular, we model dependencies between $X$ and $Y$ using classification and regression trees. Inferring the optimal model is NP-hard, and hence we propose Crack, a fast greedy algorithm to infer the most likely causal direction directly from the data. Empirical evaluation on synthetic, benchmark, and real world data shows that Crack reliably and with high accuracy infers the correct causal direction on both univariate and multivariate cause--effect pairs over both single and mixed type data.
• Identifying significant location categories visited by mobile phone users is the key to a variety of applications. This is an extremely challenging task due to the possible deviation between the estimated location coordinate and the actual location, which could be on the order of kilometers. Using the collected location coordinate as the center and its associated location error as the radius, we can draw a location uncertainty circle that may cover multiple location categories, especially in densely populated areas. To estimate the actual location category more precisely, we propose a novel tensor factorization framework, through several key observations including the intrinsic correlations between users, to infer the most likely location categories within the location uncertainty circle. In addition, the proposed algorithm can also predict where users are even when there is no location update. In order to efficiently solve the proposed framework, we propose a parameter-free and scalable optimization algorithm by effectively exploring the sparse and low-rank structure of the tensor. Our empirical studies show that the proposed algorithm is both efficient and effective: it can solve problems with millions of users and billions of location updates, and also provides superior prediction accuracies on real-world location updates and check-in data sets.
• We propose an inlier-based outlier detection method capable of both identifying the outliers and explaining why they are outliers, by identifying the outlier-specific features. Specifically, we employ an inlier-based outlier detection criterion, which uses the ratio of inlier and test probability densities as a measure of plausibility of being an outlier. For estimating the density ratio function, we propose a localized logistic regression algorithm. Thanks to the locality of the model, variable selection can be outlier-specific, and will help interpret why points are outliers in a high-dimensional space. Through synthetic experiments, we show that the proposed algorithm can successfully detect the important features for outliers. Moreover, we show that the proposed algorithm tends to outperform existing algorithms in benchmark datasets.
• We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result is showing that the smoothness of the control cost enables the simple algorithm of following the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^{3/4}$ for this setting.
• We present and analyze statistically optimal, communication and memory efficient distributed stochastic optimization algorithms with near-linear speedups (up to $\log$-factors). This improves over prior work which includes methods with near-linear speedups but polynomial communication requirements (accelerated minibatch SGD) and communication efficient methods which do not exhibit any runtime speedups over a naive single-machine approach. We first analyze a distributed SVRG variant as a distributed stochastic optimization method and show that it can achieve near-linear speedups with logarithmic rounds of communication, at the cost of high memory requirements. We then present a novel method, stochastic DANE, which trades off memory for communication and still allows for optimization with communication which scales only logarithmically with the desired accuracy while also being memory efficient. Stochastic DANE is based on a minibatch prox procedure, solving a non-linearized subproblem on a minibatch at each iteration. We provide a novel analysis for this procedure which achieves the statistical optimal rate regardless of minibatch size and smoothness, and thus significantly improving on prior work.
• Arising naturally in many fields, optimal stopping problems consider the question of deciding when to stop an observation-generating process. We examine the problem of simultaneously learning and planning in such domains, when data is collected directly from the environment. We propose GFSE, a simple and flexible model-free policy search method that reuses data for sample efficiency by leveraging problem structure. We bound the sample complexity of our approach to guarantee uniform convergence of policy value estimates, tightening existing PAC bounds to achieve logarithmic dependence on horizon length for our setting. We also examine the benefit of our method against prevalent model-based and model-free approaches on 3 domains taken from diverse fields.
• Consider the problem of modeling hysteresis for finite-state random walks using higher-order Markov chains. This Letter introduces a Bayesian framework to determine, from data, the number of prior states of recent history upon which a trajectory is statistically dependent. The general recommendation is to use leave-one-out cross validation, using an easily-computable formula that is provided in closed form. Importantly, Bayes factors using flat model priors are biased in favor of too-complex a model (more hysteresis) when a large amount of data is present and the Akaike information criterion (AIC) is biased in favor of too-sparse a model (less hysteresis) when few data are present.
• Since the events of the Arab Spring, there has been increased interest in using social media to anticipate social unrest. While efforts have been made toward automated unrest prediction, we focus on filtering the vast volume of tweets to identify tweets relevant to unrest, which can be provided to downstream users for further analysis. We train a supervised classifier that is able to label Arabic language tweets as relevant to unrest with high reliability. We examine the relationship between training data size and performance and investigate ways to optimize the model building process while minimizing cost. We also explore how confidence thresholds can be set to achieve desired levels of performance.
• This paper concerns the problem of recovering an unknown but structured signal $x \in R^n$ from $m$ quadratic measurements of the form $y_r=|<a_r,x>|^2$ for $r=1,2,...,m$. We focus on the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal ($m<<n$). We formulate the recovery problem as a nonconvex optimization problem where prior structural information about the signal is enforced through constrains on the optimization variables. We prove that projected gradient descent, when initialized in a neighborhood of the desired signal, converges to the unknown signal at a linear rate. These results hold for any constraint set (convex or nonconvex) providing convergence guarantees to the global optimum even when the objective function and constraint set is nonconvex. Furthermore, these results hold with a number of measurements that is only a constant factor away from the minimal number of measurements required to uniquely identify the unknown signal. Our results provide the first provably tractable algorithm for this data-poor regime, breaking local sample complexity barriers that have emerged in recent literature. In a companion paper we demonstrate favorable properties for the optimization problem that may enable similar results to continue to hold more globally (over the entire ambient space). Collectively these two papers utilize and develop powerful tools for uniform convergence of empirical processes that may have broader implications for rigorous understanding of constrained nonconvex optimization heuristics. The mathematical results in this paper also pave the way for a new generation of data-driven phase-less imaging systems that can utilize prior information to significantly reduce acquisition time and enhance image reconstruction, enabling nano-scale imaging at unprecedented speeds and resolutions.
• Feb 22 2017 cs.LG arXiv:1702.06120v1
We prove in this paper that the expected value of the objective function of the $k$-means++ algorithm for samples converges to population expected value. As $k$-means++, for samples, provides with constant factor approximation for $k$-means objectives, such an approximation can be achieved for the population with increase of the sample size. This result is of potential practical relevance when one is considering using subsampling when clustering large data sets (large data bases).
• FPGA-based hardware accelerators for convolutional neural networks (CNNs) have obtained great attentions due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipelines stages. Experiment results show that the proposed architecture is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests (in small batch size). For processing static data (in large batch size), the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.

Omar Shehab Sep 12 2016 12:50 UTC

I am still trying to understand the following statement from II.A.

> This leads to the condition that the first- and second-order moments
> of the model and data distributions should be equal for the parameters
> to be optimal.

Alessandro Dec 09 2015 01:12 UTC

Hey, I've already seen this title! http://arxiv.org/abs/1307.0401

Noon van der Silk Jul 13 2015 10:44 UTC

There's some code for this here: https://github.com/ryankiros/skip-thoughts

anti-plagiarism Jul 09 2015 15:11 UTC

This paper "**Tree-based convolution for sentence modeling**" is a deliberate plagiarism. The texts, models and ideas overlap significantly with previous work on arXiv.

- TBCNN: A **Tree-based Convolutional** Neural Network for Programming
Language Processing (arXiv:1409.5718)
- **Tree-based

...(continued)