Computer Science (cs)

  • PDF
    A widespread folklore for explaining the success of convolutional neural network (CNN) is that CNN is a more compact representation than the fully connected neural network (FNN) and thus requires fewer samples for learning. We initiate the study of rigorously characterizing the sample complexity of learning convolutional neural networks. We show that for learning an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O} (m/\epsilon^2)$, whereas its FNN counterpart needs at least $\Omega(d/\epsilon^2)$ samples. Since $m \ll d$, this result demonstrates the advantage of using CNN. We further consider the sample complexity of learning a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show the sample complexity is $\widetilde{O}\left((m+r)/\epsilon^2\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are localized empirical process and a new lemma characterizing the convolutional structure. We believe these tools may inspire further developments in understanding CNN.
  • PDF
    We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.
  • PDF
    We develop a framework for obtaining polynomial time approximation schemes (PTAS) for a class of stochastic dynamic programs. Using our framework, we obtain the first PTAS for the following stochastic combinatorial optimization problems: \probemax: We are given a set of $n$ items, each item $i\in [n]$ has a value $X_i$ which is an independent random variable with a known (discrete) distribution $\pi_i$. We can \em probe a subset $P\subseteq [n]$ of items sequentially. Each time after probing an item $i$, we observe its value realization, which follows the distribution $\pi_i$. We can \em adaptively probe at most $m$ items and each item can be probed at most once. The reward is the maximum among the $m$ realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized. To the best of our knowledge, the best known approximation ratio is $1-1/e$, due to Asadpour \etal~\citeasadpour2015maximizing. We also obtain PTAS for some generalizations and variants of the problem and some other problems.
  • PDF
    Let $G$ be an undirected, bounded degree graph with $n$ vertices. Fix a finite graph $H$, and suppose one must remove $\eps n$ edges from $G$ to make it $H$-minor free (for some small constant $\eps > 0$). We give an $n^{1/2+o(1)}$-time randomized procedure that, with high probability, finds an $H$-minor in such a graph. For an example application, suppose one must remove $\eps n$ edges from a bounded degree graph $G$ to make it planar. This result implies an algorithm, with the same running time, that produces a $K_{3,3}$ or $K_5$ minor in $G$. No sublinear time bound was known for this problem, prior to this result. By the graph minor theorem, we get an analogous result for any minor-closed property. Up to $n^{o(1)}$ factors, this resolves a conjecture of Benjamini-Schramm-Shapira (STOC 2008) on the existence of one-sided property testers for minor-closed properties. Furthermore, our algorithm is nearly optimal, by an $\Omega(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014). Prior to this work, the only graphs $H$ for which non-trivial property testers were known for $H$-minor freeness are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$-grid, and the $k$-circus (Fichtenberger et al, Arxiv 2017).
  • PDF
    We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.
  • PDF
    In this work, we aim to create a data marketplace; a robust real-time matching mechanism to efficiently buy and sell training data for Machine Learning tasks. While the monetization of data and pre-trained models is an essential focus of industry today, there does not exist a market mechanism to price training data and match buyers to vendors while still addressing the associated (computational and other) complexity. The challenge in creating such a market stems from the very nature of data as an asset: it is freely replicable; its value is inherently combinatorial due to correlation with signal in other data; prediction tasks and the value of accuracy vary widely; usefulness of training data is difficult to verify a priori without first applying it to a prediction task. As our main contributions we: (i) propose a mathematical model for a two-sided data market and formally define key challenges; (ii) construct algorithms for such a market to function and rigorously prove how they meet the challenges defined. We highlight two technical contributions: (i) a remarkable link between Myerson's payment function arising in mechanism design and the Lovasz extension arising in submodular optimization; (ii) a novel notion of "fairness" required for cooperative games with freely replicable goods. These might be of independent interest.
  • PDF
    We propose a novel technique for faster Neural Network (NN) training by systematically approximating all the constituent matrix multiplications and convolutions. This approach is complementary to other approximation techniques, requires no changes to the dimensions of the network layers, hence compatible with existing training frameworks. We first analyze the applicability of the existing methods for approximating matrix multiplication to NN training, and extend the most suitable column-row sampling algorithm to approximating multi-channel convolutions. We apply approximate tensor operations to training MLP, CNN and LSTM network architectures on MNIST, CIFAR-100 and Penn Tree Bank datasets and demonstrate 30%-80% reduction in the amount of computations while maintaining little or no impact on the test accuracy. Our promising results encourage further study of general methods for approximating tensor operations and their application to NN training.
  • PDF
    Learning with a \it convex loss function has been a dominating paradigm for many years. It remains an interesting question how non-convex loss functions help improve the generalization of learning with broad applicability. In this paper, we study a family of objective functions formed by truncating traditional loss functions, which is applicable to both shallow learning and deep learning. Truncating loss functions has potential to be less vulnerable and more robust to large noise in observations that could be adversarial. More importantly, it is a generic technique without assuming the knowledge of noise distribution. To justify non-convex learning with truncated losses, we establish excess risk bounds of empirical risk minimization based on truncated losses for heavy-tailed output, and statistical error of an approximate stationary point found by stochastic gradient descent (SGD) method. Our experiments for shallow and deep learning for regression with outliers, corrupted data and heavy-tailed noise further justify the proposed method.
  • PDF
    Two important elements have driven recent innovation in the field of regression: sparsity-inducing regularization, to cope with high-dimensional problems; multi-task learning through joint parameter estimation, to augment the number of training samples. Both approaches complement each other in the sense that a joint estimation results in more samples, which are needed to estimate sparse models accurately, whereas sparsity promotes models that act on subsets of related variables. This idea has driven the proposal of block regularizers such as L1/Lq norms, which however effective, require that active regressors strictly overlap. In this paper, we propose a more flexible convex regularizer based on unbalanced optimal transport (OT) theory. That regularizer promotes parameters that are close, according to the OT geometry, which takes into account a prior geometric knowledge on the regressor variables. We derive an efficient algorithm based on a regularized formulation of optimal transport, which iterates through applications of Sinkhorn's algorithm along with coordinate descent iterations. The performance of our model is demonstrated on regular grids and complex triangulated geometries of the cortex with an application in neuroimaging.
  • PDF
    The main computational cost in the training of and prediction with Convolution Neural Networks (CNNs) typically stems from the convolution. In this paper, we present three novel ways to parameterize the convolution more efficiently, significantly decreasing the computational complexity. Commonly used CNNs filter the input data using a series of spatial convolutions with compact stencils that couple features from all channels and point-wise nonlinearities. In this paper, we propose three architectures that are cheaper to couple the channel dimension and thereby reduce both the number of trainable weights and the computational cost of the CNN. The first architecture is inspired by tensor-products and imposes a circulant coupling of the channels. The second and third architectures arise as discretizations of a new type of residual neural network (ResNet) that is inspired by Partial Differential Equations (PDEs) or reaction-diffusion type. The coupling patterns of the first two architectures are applicable to a large class of CNNs. We outline in our numerical experiments that the proposed architectures, although considerably reducing the number of trainable weights, yield comparable accuracy to existing CNNs that are fully coupled in the channel dimension.
  • PDF
    We present a new technique for deep reinforcement learning that automatically detects moving objects and uses the relevant information for action selection. The detection of moving objects is done in an unsupervised way by exploiting structure from motion. Instead of directly learning a policy from raw images, the agent first learns to detect and segment moving objects by exploiting flow information in video sequences. The learned representation is then used to focus the policy of the agent on the moving objects. Over time, the agent identifies which objects are critical for decision making and gradually builds a policy based on relevant moving objects. This approach, which we call Motion-Oriented REinforcement Learning (MOREL), is demonstrated on a suite of Atari games where the ability to detect moving objects reduces the amount of interaction needed with the environment to obtain a good policy. Furthermore, the resulting policy is more interpretable than policies that directly map images to actions or values with a black box neural network. We can gain insight into the policy by inspecting the segmentation and motion of each object detected by the agent. This allows practitioners to confirm whether a policy is making decisions based on sensible information.
  • PDF
    The study of high-throughput genomic profiles from a pharmacogenomics viewpoint has provided unprecedented insights into the oncogenic features modulating drug response. A recent screening of ~1,000 cancer cell lines to a collection of anti-cancer drugs illuminated the link between genotypes and vulnerability. However, due to essential differences between cell lines and tumors, the translation into predicting drug response in tumors remains challenging. Here we proposed a DNN model to predict drug response based on mutation and expression profiles of a cancer cell or a tumor. The model contains a mutation and an expression encoders pre-trained using a large pan-cancer dataset to abstract core representations of high-dimension data, followed by a drug response predictor network. Given a pair of mutation and expression profiles, the model predicts IC50 values of 265 drugs. We trained and tested the model on a dataset of 622 cancer cell lines and achieved an overall prediction performance of mean squared error at 1.96 (log-scale IC50 values). The performance was superior in prediction error or stability than two classical methods and four analog DNNs of our model. We then applied the model to predict drug response of 9,059 tumors of 33 cancer types. The model predicted both known, including EGFR inhibitors in non-small cell lung cancer and tamoxifen in ER+ breast cancer, and novel drug targets. The comprehensive analysis further revealed the molecular mechanisms underlying the resistance to a chemotherapeutic drug docetaxel in a pan-cancer setting and the anti-cancer potential of a novel agent, CX-5461, in treating gliomas and hematopoietic malignancies. Overall, our model and findings improve the prediction of drug response and the identification of novel therapeutic options.
  • PDF
    We propose a new Bayesian Neural Net (BNN) formulation that affords variational inference for which the evidence lower bound (ELBO) is analytically tractable subject to a tight approximation. We achieve this tractability by decomposing ReLU nonlinearities into an identity function and a Kronecker delta function. We demonstrate formally that assigning the outputs of these functions to separate latent variables allows representing the neural network likelihood as the composition of a chain of linear operations. Performing variational inference on this construction enables closed-form computation of the evidence lower bound. It can thus be maximized without requiring Monte Carlo sampling to approximate the problematic expected log-likelihood term. The resultant formulation boils down to stochastic gradient descent, where the gradients are not distorted by any factor besides minibatch selection. This amends a long-standing disadvantage of BNNs relative to deterministic nets. Experiments on four benchmark data sets show that the cleaner gradients provided by our construction yield a steeper learning curve, achieving higher prediction accuracies for a fixed epoch budget.
  • PDF
    Given data, deep generative models, such as variational autoencoders (VAE) and generative adversarial networks (GAN), train a lower dimensional latent representation of the data space. The linear Euclidean geometry of data space pulls back to a nonlinear Riemannian geometry on the latent space. The latent space thus provides a low-dimensional nonlinear representation of data and classical linear statistical techniques are no longer applicable. In this paper we show how statistics of data in their latent space representation can be performed using techniques from the field of nonlinear manifold statistics. Nonlinear manifold statistics provide generalizations of Euclidean statistical notions including means, principal component analysis, and maximum likelihood fits of parametric probability distributions. We develop new techniques for maximum likelihood inference in latent space, and adress the computational complexity of using geometric algorithms with high-dimensional data by training a separate neural network to approximate the Riemannian metric and cometric tensor capturing the shape of the learned data manifold.
  • PDF
    Boosted decision trees enjoy popularity in a variety of applications; however, for large-scale datasets, the cost of training a decision tree in each round can be prohibitively expensive. Inspired by ideas from the multi-arm bandit literature, we develop a highly efficient algorithm for computing exact greedy-optimal decision trees, outperforming the state-of-the-art Quick Boost method. We further develop a framework for deriving lower bounds on the problem that applies to a wide family of conceivable algorithms for the task (including our algorithm and Quick Boost), and we demonstrate empirically on a wide variety of data sets that our algorithm is near-optimal within this family of algorithms. We also derive a lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this best-achievable lower bound.
  • PDF
    Deep learning stands at the forefront in many computer vision tasks. However, deep neural networks are usually data-hungry and require a huge amount of well-annotated training samples. Collecting sufficient annotated data is very expensive in many applications, especially for pixel-level prediction tasks such as semantic segmentation. To solve this fundamental issue, we consider a new challenging vision task, Internetly supervised semantic segmentation, which only uses Internet data with noisy image-level supervision of corresponding query keywords for segmentation model training. We address this task by proposing the following solution. A class-specific attention model unifying multiscale forward and backward convolutional features is proposed to provide initial segmentation "ground truth". The model trained with such noisy annotations is then improved by an online fine-tuning procedure. It achieves state-of-the-art performance under the weakly-supervised setting on PASCAL VOC2012 dataset. The proposed framework also paves a new way towards learning from the Internet without human interaction and could serve as a strong baseline therein. Code and data will be released upon the paper acceptance.
  • PDF
    Augmenting deep neural networks with skip connections, as introduced in the so called ResNet architecture, surprised the community by enabling the training of networks of more than 1000 layers with significant performance gains. It has been shown that identity skip connections eliminate singularities and improve the optimization landscape of the network. This paper deciphers ResNet by analyzing the of effect of skip connections in the backward path and sets forth new theoretical results on the advantages of identity skip connections in deep neural networks. We prove that the skip connections in the residual blocks facilitate preserving the norm of the gradient and lead to well-behaved and stable back-propagation, which is a desirable feature from optimization perspective. We also show that, perhaps surprisingly, as more residual blocks are stacked, the network becomes more norm-preserving. Traditionally, norm-preservation is enforced on the network only at beginning of the training, by using initialization techniques. However, we show that identity skip connection retain norm-preservation during the training procedure. Our theoretical arguments are supported by extensive empirical evidence. Can we push for more norm-preservation? We answer this question by proposing zero-phase whitening of the fully-connected layer and adding norm-preserving transition layers. Our numerical investigations demonstrate that the learning dynamics and the performance of ResNets can be improved by making it even more norm preserving through changing only a few blocks in very deep residual networks. Our results and the introduced modification for ResNet, referred to as Procrustes ResNets, can be used as a guide for studying more complex architectures such as DenseNet, training deeper networks, and inspiring new architectures.
  • PDF
    Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors.
  • PDF
    Neural networks have been learning complex multi-hop reasoning in various domains. One such formal setting for reasoning, logic, provides a challenging case for neural networks. In this article, we propose a Neural Inference Network (NIN) for learning logical inference over classes of logic programs. Trained in an end-to-end fashion NIN learns representations of normal logic programs, by processing them at a character level, and the reasoning algorithm for checking whether a logic program entails a given query. We define 12 classes of logic programs that exemplify increased level of complexity of the inference process (multi-hop and default reasoning) and show that our NIN passes 10 out of the 12 tasks. We also analyse the learnt representations of logic programs that NIN uses to perform the logical inference.
  • PDF
    We explore the possibility of using machine learning to identify interesting mathematical structures by using certain quantities that serve as fingerprints. In particular, we extract features from integer sequences using two empirical laws: Benford's law and Taylor's law and experiment with various classifiers to identify whether a sequence is nice, important, multiplicative, easy to compute or related to primes or palindromes.
  • PDF
    We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation. Specifically Universal Portfolio enjoys optimal regret $\mathcal{O}(N\ln T)$ for $N$ financial instruments over $T$ rounds, but requires log-concave sampling and has a large polynomial running time. Our algorithm, on the other hand, ensures a slightly larger but still logarithmic regret of $\mathcal{O}(N^2(\ln T)^4)$, and is based on the well-studied Online Mirror Descent framework with a novel regularizer that can be implemented via standard optimization methods in time $\mathcal{O}(TN^{2.5})$ per round. The regret of all other existing works is either polynomial in $T$ or has a potentially unbounded factor such as the inverse of the smallest price relative.
  • PDF
    We consider the problem of uncertainty estimation in the context of (non-Bayesian) deep neural classification. All current methods are based on extracting uncertainty signals from a trained network optimized to solve the classification problem at hand. We demonstrate that such techniques tend to misestimate instances whose predictions are supposed to be highly confident. This deficiency is an artifact of the training process with SGD-like optimizers. Based on this observation, we develop an uncertainty estimation algorithm that "peels away" highly confident points sequentially and estimates their confidence using earlier snapshots of the trained model, before their uncertainty estimates are jittered. We present extensive experiments indicating that the proposed algorithm provides uncertainty estimates that are consistently better than the best known methods.
  • PDF
    It is important to learn classifiers under noisy labels due to their ubiquities. As noisy labels are corrupted from ground-truth labels by an unknown noise transition matrix, the accuracy of classifiers can be improved by estimating this matrix, without introducing either sample-selection or regularization biases. However, such estimation is often inexact, which inevitably degenerates the accuracy of classifiers. The inexact estimation is due to either a heuristic trick, or the brutal-force learning by deep networks under a finite dataset. In this paper, we present a human-assisted approach called "\textitmasking". The masking conveys human cognition of invalid class transitions, and naturally speculates the structure of the noise transition matrix. Given the structure information, we only learn the noise transition probability to reduce the estimation burden. To instantiate this approach, we derive a structure-aware probabilistic model, which incorporates a structure prior. During the model realization, we solve the challenges from structure extraction and alignment in principle. Empirical results on benchmark datasets with three noise structures show that, our approach can improve the robustness of classifiers significantly.
  • PDF
    We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder. The high-level decoder constructs a plan by generating a semantic concept (i.e., topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network, which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a flat deep reinforcement learning baseline.
  • PDF
    In 2010, A. Shpilka and I. Volkovich established a prominent result on the equivalence of polynomial factorization and identity testing. It follows from their result that a multilinear polynomial over the finite field of order 2 (a multilinear boolean polynomial) can be factored in time cubic in the size of the polynomial given as a string. We have later rediscovered this result and provided a simple factorization algorithm based on the computation of derivatives of multilinear boolean polynomials. The algorithm has been applied to solve problems of compact representation of various combinatorial structures, including boolean functions and relational data tables. In this paper, we improve the known cubic upper complexity bound of this factorization algorithm and report on a preliminary experimental analysis.
  • PDF
    Subspace clustering refers to the problem of segmenting high dimensional data drawn from a union of subspaces into the respective subspaces. In some applications, partial side-information to indicate "must-link" or "cannot-link" in clustering is available. This leads to the task of subspace clustering with side-information. However, in prior work the supervision value of the side-information for subspace clustering has not been fully exploited. To this end, in this paper, we present an enhanced approach for constrained subspace clustering with side-information, termed Constrained Sparse Subspace Clustering plus (CSSC+), in which the side-information is used not only in the stage of learning an affinity matrix but also in the stage of spectral clustering. Moreover, we propose to estimate clustering accuracy based on the partial side-information and discuss the potential connection to the true clustering accuracy. We conduct experiments on three cancer gene expression datasets to validate the effectiveness of our proposals.
  • PDF
    Predicting how Congressional legislators will vote is important for understanding their past and future behavior. However, previous work on roll-call prediction has been limited to single session settings, thus did not consider generalization across sessions. In this paper, we show that metadata is crucial for modeling voting outcomes in new contexts, as changes between sessions lead to changes in the underlying data generation process. We show how augmenting bill text with the sponsors' ideologies in a neural network model can achieve an average of a 4% boost in accuracy over the previous state-of-the-art.
  • PDF
    Reinforcement Learning (RL) algorithms can suffer from poor sample efficiency when rewards are delayed and sparse. We introduce a solution that enables agents to learn temporally extended actions at multiple levels of abstraction in a sample efficient and automated fashion. Our approach combines universal value functions and hindsight learning, allowing agents to learn policies belonging to different time scales in parallel. We show that our method significantly accelerates learning in a variety of discrete and continuous tasks.
  • PDF
    We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly training an image-to-text generator (a.k.a. captionbot) and a text-to-image generator (a.k.a. drawingbot). The key idea behind the joint training is that image-to-text generation and text-to-image generation as dual problems can form a closed loop to provide informative feedback to each other. Based on such feedback, we introduce a new loss metric by comparing the original input with the output produced by the closed loop. In addition to the old loss metrics used in captionbot and drawingbot, this extra loss metric makes the jointly trained captionbot and drawingbot better than the separately trained captionbot and drawingbot. Furthermore, the turbo-learning approach enables semi-supervised learning since the closed loop can provide peudo-labels for unlabeled samples. Experimental results on the COCO dataset demonstrate that the proposed turbo learning can significantly improve the performance of both captionbot and drawingbot by a large margin.
  • PDF
    Over the years, the Web has shrunk the world, allowing individuals to share viewpoints with many more people than they are able to in real life. At the same time, however, it has also enabled anti-social and toxic behavior to occur at an unprecedented scale. Video sharing platforms like YouTube receive uploads from millions of users, covering a wide variety of topics and allowing others to comment and interact in response. Unfortunately, these communities are periodically plagued with aggression and hate attacks. In particular, recent work has showed how these attacks often take place as a result of "raids," i.e., organized efforts coordinated by ad-hoc mobs from third-party communities. Despite the increasing relevance of this phenomenon, online services often lack effective countermeasures to mitigate it. Unlike well-studied problems like spam and phishing, coordinated aggressive behavior both targets and is perpetrated by humans, making defense mechanisms that look for automated activity unsuitable. Therefore, the de-facto solution is to reactively rely on user reports and human reviews. In this paper, we propose an automated solution to identify videos that are likely to be targeted by coordinated harassers. First, we characterize and model YouTube videos along several axes (metadata, audio transcripts, thumbnails) based on a ground truth dataset of raid victims. Then, we use an ensemble of classifiers to determine the likelihood that a video will be raided with high accuracy (AUC up to 94%). Overall, our work paves the way for providing video platforms like YouTube with proactive systems to detect and mitigate coordinated hate attacks.
  • PDF
    The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown extremely good results for video human action classification, however, action detection is still a challenging problem. The current action detection approaches follow a complex pipeline which involves multiple tasks such as tube proposals, optical flow, and tube classification. In this work, we present a more elegant solution for action detection based on the recently developed capsule network. We propose a 3D capsule network for videos, called VideoCapsuleNet: a unified network for action detection which can jointly perform pixel-wise action segmentation along with action classification. The proposed network is a generalization of capsule network from 2D to 3D, which takes a sequence of video frames as input. The 3D generalization drastically increases the number of capsules in the network, making capsule routing computationally expensive. We introduce capsule-pooling in the convolutional capsule layer to address this issue which makes the voting algorithm tractable. The routing-by-agreement in the network inherently models the action representations and various action characteristics are captured by the predicted capsules. This inspired us to utilize the capsules for action localization and the class-specific capsules predicted by the network are used to determine a pixel-wise localization of actions. The localization is further improved by parameterized skip connections with the convolutional capsule layers and the network is trained end-to-end with a classification as well as localization loss. The proposed network achieves sate-of-the-art performance on multiple action detection datasets including UCF-Sports, J-HMDB, and UCF-101 (24 classes) with an impressive ~20% improvement on UCF-101 and ~15% improvement on J-HMDB in terms of v-mAP scores.
  • PDF
    Despite substantial interest in applications of neural networks to information retrieval, neural ranking models have only been applied to standard ad hoc retrieval tasks over web pages and newswire documents. This paper proposes MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network) a novel neural ranking model specifically designed for ranking short social media posts. We identify document length, informal language, and heterogeneous relevance signals as features that distinguish documents in our domain, and present a model specifically designed with these characteristics in mind. Our model uses hierarchical convolutional layers to learn latent semantic soft-match relevance signals at the character, word, and phrase levels. A pooling-based similarity measurement layer integrates evidence from multiple types of matches between the query, the social media post, as well as URLs contained in the post. Extensive experiments using Twitter data from the TREC Microblog Tracks 2011--2014 show that our model significantly outperforms prior feature-based as well and existing neural ranking models. To our best knowledge, this paper presents the first substantial work tackling search over social media posts using neural ranking models.
  • PDF
    Adapting deep networks to new concepts from few examples is extremely challenging, due to the high computational and data requirements of standard fine-tuning procedures. Most works on meta-learning and few-shot learning have thus focused on simple learning techniques for adaptation, such as nearest neighbors or gradient descent. Nonetheless, the machine learning literature contains a wealth of methods that learn non-deep models very efficiently. In this work we propose to use these fast convergent methods as the main adaptation mechanism for few-shot learning. The main idea is to teach a deep network to use standard machine learning tools, such as logistic regression, as part of its own internal model, enabling it to quickly adapt to novel tasks. This requires back-propagating errors through the solver steps. While normally the matrix operations involved would be costly, the small number of examples works to our advantage, by making use of the Woodbury identity. We propose both iterative and closed-form solvers, based on logistic regression and ridge regression components. Our methods achieve excellent performance on three few-shot learning benchmarks, showing competitive performance on Omniglot and surpassing all state-of-the-art alternatives on miniImageNet and CIFAR-100.
  • PDF
    We consider a new family of operators for reinforcement learning with the goal of alleviating the negative effects and becoming more robust to approximation or estimation errors. Various theoretical results are established, which include showing on a sample path basis that our family of operators preserve optimality and increase the action gap. Our empirical results illustrate the strong benefits of our family of operators, significantly outperforming the classical Bellman operator and recently proposed operators.
  • PDF
    Zero-Shot Learning (ZSL) is achieved via aligning the semantic relationships between the global image feature vector and the corresponding class semantic descriptions. However, using the global features to represent fine-grained images may lead to sub-optimal results since they neglect the discriminative differences of local regions. Besides, different regions contain distinct discriminative information. The important regions should contribute more to the prediction. To this end, we propose a novel stacked semantics-guided attention (S2GA) model to obtain semantic relevant features by using individual class semantic features to progressively guide the visual features to generate an attention map for weighting the importance of different local regions. Feeding both the integrated visual features and the class semantic features into a multi-class classification architecture, the proposed framework can be trained end-to-end. Extensive experimental results on CUB and NABird datasets show that the proposed approach has a consistent improvement on both fine-grained zero-shot classification and retrieval tasks.
  • PDF
    We combine conditional variational autoencoders (VAE) with adversarial censoring in order to learn invariant representations that are disentangled from nuisance/sensitive variations. In this method, an adversarial network attempts to recover the nuisance variable from the representation, which the VAE is trained to prevent. Conditioning the decoder on the nuisance variable enables clean separation of the representation, since they are recombined for model learning and data reconstruction. We show this natural approach is theoretically well-founded with information-theoretic arguments. Experiments demonstrate that this method achieves invariance while preserving model learning performance, and results in visually improved performance for style transfer and generative sampling tasks.
  • PDF
    We propose a fast second-order method that can be used as a drop-in replacement for current deep learning solvers. Compared to stochastic gradient descent (SGD), it only requires two additional forward-mode automatic differentiation operations per iteration, which has a computational cost comparable to two standard forward passes and is easy to implement. Our method addresses long-standing issues with current second-order solvers, which invert an approximate Hessian matrix every iteration exactly or by conjugate-gradient methods, a procedure that is both costly and sensitive to noise. Instead, we propose to keep a single estimate of the gradient projected by the inverse Hessian matrix, and update it once per iteration. This estimate has the same size and is similar to the momentum variable that is commonly used in SGD. No estimate of the Hessian is maintained. We first validate our method, called CurveBall, on small problems with known closed-form solutions (noisy Rosenbrock function and degenerate 2-layer linear networks), where current deep learning solvers seem to struggle. We then train several large models on CIFAR and ImageNet, including ResNet and VGG-f networks, where we demonstrate faster convergence with no hyperparameter tuning. Code is available.
  • PDF
    In this paper, the traditional model based variational method and learning based algorithms are naturally integrated to address mixed noise removal problem. To be different from single type noise (e.g. Gaussian) removal, it is a challenge problem to accurately discriminate noise types and levels for each pixel. We propose a variational method to iteratively estimate the noise parameters, and then the algorithm can automatically classify the noise according to the different statistical parameters. The proposed variational problem can be separated into regularization, synthesis, parameter estimation and noise classification four steps with the operator splitting scheme. Each step is related to an optimization subproblem. To enforce the regularization, the deep learning method is employed to learn the natural images priori. Compared with some model based regularizations, the CNN regularizer can significantly improve the quality of the restored images. Compared with some learning based methods, the synthesis step can produce better reconstructions by analyzing the recognized noise types and levels. In our method, the convolution neutral network (CNN) can be regarded as an operator which associated to a variational functional. From this viewpoint, the proposed method can be extended to many image reconstruction and inverse problems. Numerical experiments in the paper show that our method can achieve some state-of-the-art results for mixed noise removal.
  • PDF
    Traditionally, Referring Expression Generation (REG) models first decide on the form and then on the content of references to discourse entities in text, typically relying on features such as salience and grammatical function. In this paper, we present a new approach (NeuralREG), relying on deep neural networks, which makes decisions about form and content in one go without explicit feature extraction. Using a delexicalized version of the WebNLG corpus, we show that the neural model substantially improves over two strong baselines. Data and models are publicly available.
  • PDF
    Neural models for question answering (QA) over documents have achieved significant performance improvements. Although effective, these models do not scale to large corpora due to their complex modeling of interactions between the document and the question. Moreover, recent work has shown that such models are sensitive to adversarial inputs. In this paper, we study the minimal context required to answer the question, and find that most questions in existing datasets can be answered with a small set of sentences. Inspired by this observation, we propose a simple sentence selector to select the minimal set of sentences to feed into the QA model. Our overall system achieves significant reductions in training (up to 15 times) and inference times (up to 13 times), with accuracy comparable to or better than the state-of-the-art on SQuAD, NewsQA, TriviaQA and SQuAD-Open. Furthermore, our experimental results and analyses show that our approach is more robust to adversarial inputs.
  • PDF
    Graph Convolutional Neural Networks (GCNNs) are the most recent exciting advancement in deep learning field and their applications are quickly spreading in multi-cross-domains including bioinformatics, chemoinformatics, social networks, natural language processing and computer vision. In this paper, we expose and tackle some of the basic weaknesses of a GCNN model with a capsule idea presented in~\citehinton2011transforming and propose our Graph Capsule Network (GCAPS-CNN) model. In addition, we design our GCAPS-CNN model to solve especially graph classification problem which current GCNN models find challenging. Through extensive experiments, we show that our proposed Graph Capsule Network can significantly outperforms both the existing state-of-art deep learning methods and graph kernels on graph classification benchmark datasets.
  • PDF
    Visible light communications (VLC) is an emerging field in technology and research. Estimating the channel taps is a major requirement for designing reliable communication systems. Due to the nonlinear characteristics of the VLC channel those parameters cannot be derived easily. They can be calculated by means of software simulation. In this work, a novel methodology is proposed for the prediction of channel parameters using neural networks. Measurements conducted in a controlled experimental setup are used to train neural networks for channel tap prediction. Our experiment results indicate that neural networks can be effectively trained to predict channel taps under different environmental conditions.
  • PDF
    One Monad to Prove Them All is a modern fairy tale about curiosity and perseverance, two important properties of a successful PhD student. We follow the PhD student Mona on her adventurous journey of proving properties about Haskell programs in the proof assistant Coq. In quest for a solution Mona goes on an educating journey through the land of functional programming. We follow her on her journey when she learns about concepts like free monads and containers as well as basics and restrictions of proof assistants like Coq. These concepts are well-known individually, but their interplay gives rise to a solution for Mona's problem that has not been presented before. More precisely, Mona's final solution does not only work for a specific monad instance but even allows her to prove monad-generic properties. That is, instead of proving properties over and over again for specific monad instances she is able to prove properties that hold for all monads. In the end of her journey Mona comes to the land of functional logic and the land of probabilistic programming. She learns that, because her proofs are monad-generic, they even hold in languages with other effects like non-determinism or probabilism. If you are a citizen of the land of functional programming or are at least familiar with its customs, had a journey that involved reasoning about functional programs of your own, or are just a curious soul looking for the next story about monads and proofs, then this tale is for you.
  • PDF
    We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are either sampled from Gaussian process priors (fully Bayesian setting) or are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandit, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multi-dimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits, and the adaptive control of parametric linear dynamical systems with quadratic costs.
  • PDF
    Our human sense of touch enables us to manipulate our surroundings; therefore, complex robotic manipulation will require artificial tactile sensing. Typically tactile sensor arrays are used in robotics, implying that a straightforward way of interpreting multidimensional data is required. In this paper we present a simple visualisation approach based on applying principal component analysis (PCA) to systematically collected sets of tactile data. We apply the visualisation approach to 4 different types of tactile sensor, encompassing fingertips and vibrissal arrays. The results show that PCA can reveal structure and regularities in the tactile data, which also permits the use of simple classifiers such as $k$-NN to achieve good inference. Additionally, the Euclidean distance in principal component space gives a measure of sensitivity, which can aid visualisation and also be used to find regions in the tactile input space where the sensor is able to perceive with higher accuracy. We expect that these observations will generalise, and thus offer the potential for novel control methods based on touch.
  • PDF
    In this work, we present a new derivative-free optimization method and investigate its use for training neural networks. Our method is motivated by the Ensemble Kalman Filter (EnKF), which has been used successfully for solving optimization problems that involve large-scale, highly nonlinear dynamical systems. A key benefit of the EnKF method is that it requires only the evaluation of the forward propagation but not its derivatives. Hence, in the context of neural networks it alleviates the need for back propagation and reduces the memory consumption dramatically. However, the method is not a pure "black-box" global optimization heuristic as it efficiently utilizes the structure of typical learning problems. We propose an important modification of the EnKF that enables us to prove convergence of our method to the minimizer of a strongly convex function. Our method also bears similarity with implicit filtering and we demonstrate its potential for minimizing highly oscillatory functions using a simple example. Further, we provide numerical examples that demonstrate the potential of our method for training deep neural networks.
  • PDF
    We present Poetwannabe, a chatbot submitted by the University of Wrocław to the NIPS 2017 Conversational Intelligence Challenge, in which it ranked first ex-aequo. It is able to conduct a conversation with a user in a natural language. The primary functionality of our dialogue system is context-aware question answering (QA), while its secondary function is maintaining user engagement. The chatbot is composed of a number of sub-modules, which independently prepare replies to user's prompts and assess their own confidence. To answer questions, our dialogue system relies heavily on factual data, sourced mostly from Wikipedia and DBpedia, data of real user interactions in public forums, as well as data concerning general literature. Where applicable, modules are trained on large datasets using GPUs. However, to comply with the competition's requirements, the final system is compact and runs on commodity hardware.
  • PDF
    Word Sense Disambiguation (WSD) aims to identify the correct meaning of polysemous words in the particular context. Lexical resources like WordNet which are proved to be of great help for WSD in the knowledge-based methods. However, previous neural networks for WSD always rely on massive labeled data (context), ignoring lexical resources like glosses (sense definitions). In this paper, we integrate the context and glosses of the target word into a unified framework in order to make full use of both labeled data and lexical knowledge. Therefore, we propose GAS: a gloss-augmented WSD neural network which jointly encodes the context and glosses of the target word. GAS models the semantic relationship between the context and the gloss in an improved memory network framework, which breaks the barriers of the previous supervised methods and knowledge-based methods. We further extend the original gloss of word sense via its semantic relations in WordNet to enrich the gloss information. The experimental results show that our model outperforms the state-of-theart systems on several English all-words WSD datasets.
  • PDF
    Deep Neural Networks (DNNs) have recently shown state of the art performance on semantic segmentation tasks, however they still suffer from problems of poor boundary localization and spatial fragmented predictions. The difficulties lie in the requirement of making dense predictions from a long path model all at once, since details are hard to keep when data goes through deeper layers. Instead, in this work, we decompose this difficult task into two relative simple sub-tasks: seed detection which is required to predict initial predictions without need of wholeness and preciseness, and similarity estimation which measures the possibility of any two nodes belong to the same class without need of knowing which class they are. We use one branch for one sub-task each, and apply a cascade of random walks base on hierarchical semantics to approximate a complex diffusion process which propagates seed information to the whole image according to the estimated similarities. The proposed DifNet consistently produces improvements over the baseline models with the same depth and with equivalent number of parameters, and also achieves promising performance on Pascal VOC and Pascal Context dataset. Our DifNet is trained end-to-end without complex loss functions.
  • PDF
    Conceptual design relies on extensive manipulation of morphological properties of real or virtual objects.This study investigates the nature of the perceptual information that could be retrieved from different representation modalities to reproduce structural properties of a complex object by drawing . The abstract and complex object (tensegrity simplex) was presented to two study populations (design experts/architects and non-experts) in three different representation modalities (2D image view explored visually only, digital 3D model explored visually using a computer mouse, the real object explored visually and manually. After viewing and exploring, observers had to draw the most critical parts of the structure by hand into a 2D reference frame. The results reveal a considerable performance advantage of digital 3D model exploration compared with real-world 3D object manipulation in the expert population.The results are discussed in terms of the specific nature of morphological cues made available through the different representation modalities.

Recent comments

Max Lu Apr 25 2018 22:08 UTC

"This is a very inspiring paper! The new framework (ZR = All Reality) it provided allows us to understand all kinds of different reality technologies (VR, AR, MR, XR etc) that are currently loosely connected to each other and has been confusing to many people. Instead of treating our perceived sens

...(continued)
NJBouman Apr 22 2018 18:26 UTC

[Fredrik Johansson][1] has pointed out to me (the author) the following about the multiplication benchmark w.r.t. GMP. This will be taken into account in the upcoming revision.

Fredrik Johansson wrote:
> You shouldn't be comparing your code to `mpn_mul`, because this function is not actually th

...(continued)
Luis Cruz Mar 16 2018 15:34 UTC

Related Work:

- [Performance-Based Guidelines for Energy Efficient Mobile Applications](http://ieeexplore.ieee.org/document/7972717/)
- [Leafactor: Improving Energy Efficiency of Android Apps via Automatic Refactoring](http://ieeexplore.ieee.org/document/7972807/)

igorot Feb 28 2018 05:19 UTC

The Igorots built an [online community][1] that helps in the exchange, revitalization, practice, and learning of indigenous culture. It is the first and only Igorot community on the web.

[1]: https://www.igorotage.com/

wenling yang Jan 30 2018 19:08 UTC

off-loading is an interesting topic. Investigating the off-loading computation under the context of deep neural networks is a novel insight.

Luhao Wang Jan 30 2018 00:28 UTC

well written paper! State-of-art works that are good to publish to some decent conferences/journals

mahdi aliakbari Jan 29 2018 20:49 UTC

Very well written paper with formal problem formulation and extensive results on multiple benchmarks

aaditya prakash Jan 29 2018 19:58 UTC

Code is available here: https://github.com/iamaaditya/pixel-deflection

Faraz Rabbani Jan 29 2018 07:53 UTC

Interesting case study for computation offloading

Bin Shi Oct 05 2017 00:07 UTC

Welcome to give the comments for this paper!