- May 22 2018 quant-ph arXiv:1805.07772v1Energy-time uncertainty plays an important role in quantum foundations and technologies, and it was even discussed by the founders of quantum mechanics. However, standard approaches (e.g., Robertson's uncertainty relation) do not apply to energy-time uncertainty because, in general, there is no Hermitian operator associated with time. Following previous approaches, we quantify time uncertainty by how well one can read off the time from a quantum clock. We then use entropy to quantify the information-theoretic distinguishability of the various time states of the clock. Our main result is an entropic energy-time uncertainty relation for general time-independent Hamiltonians, stated for both the discrete-time and continuous-time cases. Our uncertainty relation is strong, in the sense that it allows for a quantum memory to help reduce the uncertainty, and this formulation leads us to reinterpret it as a bound on the relative entropy of asymmetry. Due to the operational relevance of entropy, we anticipate that our uncertainty relation will have information-processing applications.
- May 22 2018 quant-ph arXiv:1805.08138v1The calculation of excited state energies of electronic structure Hamiltonians has many important applications, such as the calculation of optical spectra and reaction rates. While low-depth quantum algorithms, such as the variational quantum eigenvalue solver, have been used to determine ground state energies, methods for calculating excited states currently involve the implementation of high-depth controlled-unitaries or a large number of additional samples. Here we show how overlap estimation can be used in the calculation of excited state energies and their degeneracies. We propose an implementation using the destructive SWAP test that can determine the excited states of a $N$-qubit Hamiltonian using a $N\times 2$ grid of qubits and a low-depth circuit. Our method remains robust to control errors and can be implemented on near-term quantum computers.
- A widespread folklore for explaining the success of convolutional neural network (CNN) is that CNN is a more compact representation than the fully connected neural network (FNN) and thus requires fewer samples for learning. We initiate the study of rigorously characterizing the sample complexity of learning convolutional neural networks. We show that for learning an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O} (m/\epsilon^2)$, whereas its FNN counterpart needs at least $\Omega(d/\epsilon^2)$ samples. Since $m \ll d$, this result demonstrates the advantage of using CNN. We further consider the sample complexity of learning a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show the sample complexity is $\widetilde{O}\left((m+r)/\epsilon^2\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are localized empirical process and a new lemma characterizing the convolutional structure. We believe these tools may inspire further developments in understanding CNN.
- We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.
- May 22 2018 cs.DS arXiv:1805.07742v1We develop a framework for obtaining polynomial time approximation schemes (PTAS) for a class of stochastic dynamic programs. Using our framework, we obtain the first PTAS for the following stochastic combinatorial optimization problems: \probemax: We are given a set of $n$ items, each item $i\in [n]$ has a value $X_i$ which is an independent random variable with a known (discrete) distribution $\pi_i$. We can \em probe a subset $P\subseteq [n]$ of items sequentially. Each time after probing an item $i$, we observe its value realization, which follows the distribution $\pi_i$. We can \em adaptively probe at most $m$ items and each item can be probed at most once. The reward is the maximum among the $m$ realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized. To the best of our knowledge, the best known approximation ratio is $1-1/e$, due to Asadpour \etal~\citeasadpour2015maximizing. We also obtain PTAS for some generalizations and variants of the problem and some other problems.
- Let $G$ be an undirected, bounded degree graph with $n$ vertices. Fix a finite graph $H$, and suppose one must remove $\eps n$ edges from $G$ to make it $H$-minor free (for some small constant $\eps > 0$). We give an $n^{1/2+o(1)}$-time randomized procedure that, with high probability, finds an $H$-minor in such a graph. For an example application, suppose one must remove $\eps n$ edges from a bounded degree graph $G$ to make it planar. This result implies an algorithm, with the same running time, that produces a $K_{3,3}$ or $K_5$ minor in $G$. No sublinear time bound was known for this problem, prior to this result. By the graph minor theorem, we get an analogous result for any minor-closed property. Up to $n^{o(1)}$ factors, this resolves a conjecture of Benjamini-Schramm-Shapira (STOC 2008) on the existence of one-sided property testers for minor-closed properties. Furthermore, our algorithm is nearly optimal, by an $\Omega(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014). Prior to this work, the only graphs $H$ for which non-trivial property testers were known for $H$-minor freeness are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$-grid, and the $k$-circus (Fichtenberger et al, Arxiv 2017).
- May 22 2018 quant-ph arXiv:1805.08172v1Recently, dimensionality testing of a quantum state has received extensive attention (Acín et al. Phys. Rev. Letts. 2006, Scarani et al. Phys. Rev. Letts. 2006). Security proofs of existing quantum information processing protocols rely on the assumption about the dimension of quantum states in which logical bits will be encoded. However, removing such assumption may cause security loophole. In the present report, we show that this is indeed the case. We choose two player's quantum private query protocol by Yang et al. (Quant. Inf. Process. 2014) as an example and show how one player can gain considerable information than desired by changing the dimension of subsystem of a shared quantum system. To resist such attack we propose dimensionality testing in a different way. Our proposal is based on CHSH like game which certifies the degrees of freedom of the subsystems. As we exploit CHSH like game, it can be used to test if the states are separable for which the protocol becomes completely vulnerable.
- Over the years, the Web has shrunk the world, allowing individuals to share viewpoints with many more people than they are able to in real life. At the same time, however, it has also enabled anti-social and toxic behavior to occur at an unprecedented scale. Video sharing platforms like YouTube receive uploads from millions of users, covering a wide variety of topics and allowing others to comment and interact in response. Unfortunately, these communities are periodically plagued with aggression and hate attacks. In particular, recent work has showed how these attacks often take place as a result of "raids," i.e., organized efforts coordinated by ad-hoc mobs from third-party communities. Despite the increasing relevance of this phenomenon, online services often lack effective countermeasures to mitigate it. Unlike well-studied problems like spam and phishing, coordinated aggressive behavior both targets and is perpetrated by humans, making defense mechanisms that look for automated activity unsuitable. Therefore, the de-facto solution is to reactively rely on user reports and human reviews. In this paper, we propose an automated solution to identify videos that are likely to be targeted by coordinated harassers. First, we characterize and model YouTube videos along several axes (metadata, audio transcripts, thumbnails) based on a ground truth dataset of raid victims. Then, we use an ensemble of classifiers to determine the likelihood that a video will be raided with high accuracy (AUC up to 94%). Overall, our work paves the way for providing video platforms like YouTube with proactive systems to detect and mitigate coordinated hate attacks.
- We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.
- May 22 2018 cs.GT arXiv:1805.08125v1In this work, we aim to create a data marketplace; a robust real-time matching mechanism to efficiently buy and sell training data for Machine Learning tasks. While the monetization of data and pre-trained models is an essential focus of industry today, there does not exist a market mechanism to price training data and match buyers to vendors while still addressing the associated (computational and other) complexity. The challenge in creating such a market stems from the very nature of data as an asset: it is freely replicable; its value is inherently combinatorial due to correlation with signal in other data; prediction tasks and the value of accuracy vary widely; usefulness of training data is difficult to verify a priori without first applying it to a prediction task. As our main contributions we: (i) propose a mathematical model for a two-sided data market and formally define key challenges; (ii) construct algorithms for such a market to function and rigorously prove how they meet the challenges defined. We highlight two technical contributions: (i) a remarkable link between Myerson's payment function arising in mechanism design and the Lovasz extension arising in submodular optimization; (ii) a novel notion of "fairness" required for cooperative games with freely replicable goods. These might be of independent interest.
- We propose a novel technique for faster Neural Network (NN) training by systematically approximating all the constituent matrix multiplications and convolutions. This approach is complementary to other approximation techniques, requires no changes to the dimensions of the network layers, hence compatible with existing training frameworks. We first analyze the applicability of the existing methods for approximating matrix multiplication to NN training, and extend the most suitable column-row sampling algorithm to approximating multi-channel convolutions. We apply approximate tensor operations to training MLP, CNN and LSTM network architectures on MNIST, CIFAR-100 and Penn Tree Bank datasets and demonstrate 30%-80% reduction in the amount of computations while maintaining little or no impact on the test accuracy. Our promising results encourage further study of general methods for approximating tensor operations and their application to NN training.
- Quantum multiparameter estimation involves estimating multiple parameters simultaneously and was shown to be more accurate than estimating the parameters individually. Our interest here is to determine multiparameter quantum Cramer-Rao bounds (QCRBs) for noisy metrology. We do so, in particular, using anti-symmetric logarithmic derivatives (ALDs) for the quantum Fisher information matrix (QFIM). A recent work studied the simultaneous estimation of all three components of a magnetic field using a pure probe state and unitary evolution. Here, we consider the more practical problem of simultaneously estimating multiple parameters using a mixed probe state. We found that the QCRB so obtained using ALDs depends only on the first and second order reduced density operators of the probe state. Moreover, recent works have studied upper bounds to single-parameter as well as multiple-parameter QFIs for systems under arbitrary evolution. We here present an upper bound to the noisy multiparameter QFIM in the most general form using the ALD-approach.
- Learning with a \it convex loss function has been a dominating paradigm for many years. It remains an interesting question how non-convex loss functions help improve the generalization of learning with broad applicability. In this paper, we study a family of objective functions formed by truncating traditional loss functions, which is applicable to both shallow learning and deep learning. Truncating loss functions has potential to be less vulnerable and more robust to large noise in observations that could be adversarial. More importantly, it is a generic technique without assuming the knowledge of noise distribution. To justify non-convex learning with truncated losses, we establish excess risk bounds of empirical risk minimization based on truncated losses for heavy-tailed output, and statistical error of an approximate stationary point found by stochastic gradient descent (SGD) method. Our experiments for shallow and deep learning for regression with outliers, corrupted data and heavy-tailed noise further justify the proposed method.
- May 22 2018 gr-qc arXiv:1805.07840v1We study the absorption of monochromatic electromagnetic plane waves impinging upon a Kerr black hole, in the general case that the direction of incidence is not aligned with the black hole spin axis. We present numerical results that are in accord with low- and high-frequency approximations. We find that circularly-polarized waves are distinguished by the black hole spin, with counter-rotating polarizations being more absorbed than co-rotating polarizations. At low frequencies and moderate incidence angles, there exists a narrow parameter window in which superradiant emission in the dipole mode can exceed absorption in the non-superradiant modes, allowing a planar electromagnetic wave to stimulate net emission from a black hole.
- Two important elements have driven recent innovation in the field of regression: sparsity-inducing regularization, to cope with high-dimensional problems; multi-task learning through joint parameter estimation, to augment the number of training samples. Both approaches complement each other in the sense that a joint estimation results in more samples, which are needed to estimate sparse models accurately, whereas sparsity promotes models that act on subsets of related variables. This idea has driven the proposal of block regularizers such as L1/Lq norms, which however effective, require that active regressors strictly overlap. In this paper, we propose a more flexible convex regularizer based on unbalanced optimal transport (OT) theory. That regularizer promotes parameters that are close, according to the OT geometry, which takes into account a prior geometric knowledge on the regressor variables. We derive an efficient algorithm based on a regularized formulation of optimal transport, which iterates through applications of Sinkhorn's algorithm along with coordinate descent iterations. The performance of our model is demonstrated on regular grids and complex triangulated geometries of the cortex with an application in neuroimaging.
- The main computational cost in the training of and prediction with Convolution Neural Networks (CNNs) typically stems from the convolution. In this paper, we present three novel ways to parameterize the convolution more efficiently, significantly decreasing the computational complexity. Commonly used CNNs filter the input data using a series of spatial convolutions with compact stencils that couple features from all channels and point-wise nonlinearities. In this paper, we propose three architectures that are cheaper to couple the channel dimension and thereby reduce both the number of trainable weights and the computational cost of the CNN. The first architecture is inspired by tensor-products and imposes a circulant coupling of the channels. The second and third architectures arise as discretizations of a new type of residual neural network (ResNet) that is inspired by Partial Differential Equations (PDEs) or reaction-diffusion type. The coupling patterns of the first two architectures are applicable to a large class of CNNs. We outline in our numerical experiments that the proposed architectures, although considerably reducing the number of trainable weights, yield comparable accuracy to existing CNNs that are fully coupled in the channel dimension.
- We present a new technique for deep reinforcement learning that automatically detects moving objects and uses the relevant information for action selection. The detection of moving objects is done in an unsupervised way by exploiting structure from motion. Instead of directly learning a policy from raw images, the agent first learns to detect and segment moving objects by exploiting flow information in video sequences. The learned representation is then used to focus the policy of the agent on the moving objects. Over time, the agent identifies which objects are critical for decision making and gradually builds a policy based on relevant moving objects. This approach, which we call Motion-Oriented REinforcement Learning (MOREL), is demonstrated on a suite of Atari games where the ability to detect moving objects reduces the amount of interaction needed with the environment to obtain a good policy. Furthermore, the resulting policy is more interpretable than policies that directly map images to actions or values with a black box neural network. We can gain insight into the policy by inspecting the segmentation and motion of each object detected by the agent. This allows practitioners to confirm whether a policy is making decisions based on sensible information.
- The study of high-throughput genomic profiles from a pharmacogenomics viewpoint has provided unprecedented insights into the oncogenic features modulating drug response. A recent screening of ~1,000 cancer cell lines to a collection of anti-cancer drugs illuminated the link between genotypes and vulnerability. However, due to essential differences between cell lines and tumors, the translation into predicting drug response in tumors remains challenging. Here we proposed a DNN model to predict drug response based on mutation and expression profiles of a cancer cell or a tumor. The model contains a mutation and an expression encoders pre-trained using a large pan-cancer dataset to abstract core representations of high-dimension data, followed by a drug response predictor network. Given a pair of mutation and expression profiles, the model predicts IC50 values of 265 drugs. We trained and tested the model on a dataset of 622 cancer cell lines and achieved an overall prediction performance of mean squared error at 1.96 (log-scale IC50 values). The performance was superior in prediction error or stability than two classical methods and four analog DNNs of our model. We then applied the model to predict drug response of 9,059 tumors of 33 cancer types. The model predicted both known, including EGFR inhibitors in non-small cell lung cancer and tamoxifen in ER+ breast cancer, and novel drug targets. The comprehensive analysis further revealed the molecular mechanisms underlying the resistance to a chemotherapeutic drug docetaxel in a pan-cancer setting and the anti-cancer potential of a novel agent, CX-5461, in treating gliomas and hematopoietic malignancies. Overall, our model and findings improve the prediction of drug response and the identification of novel therapeutic options.
- We propose a new Bayesian Neural Net (BNN) formulation that affords variational inference for which the evidence lower bound (ELBO) is analytically tractable subject to a tight approximation. We achieve this tractability by decomposing ReLU nonlinearities into an identity function and a Kronecker delta function. We demonstrate formally that assigning the outputs of these functions to separate latent variables allows representing the neural network likelihood as the composition of a chain of linear operations. Performing variational inference on this construction enables closed-form computation of the evidence lower bound. It can thus be maximized without requiring Monte Carlo sampling to approximate the problematic expected log-likelihood term. The resultant formulation boils down to stochastic gradient descent, where the gradients are not distorted by any factor besides minibatch selection. This amends a long-standing disadvantage of BNNs relative to deterministic nets. Experiments on four benchmark data sets show that the cleaner gradients provided by our construction yield a steeper learning curve, achieving higher prediction accuracies for a fixed epoch budget.
- Most approaches that model time-series data in human activity recognition based on body-worn sensing (HAR) use a fixed size temporal context to represent different activities. This might, however, not be apt for sets of activities with individ- ually varying durations. We introduce attention models into HAR research as a data driven approach for exploring relevant temporal context. Attention models learn a set of weights over input data, which we leverage to weight the temporal context being considered to model each sensor reading. We construct attention models for HAR by adding attention layers to a state- of-the-art deep learning HAR model (DeepConvLSTM) and evaluate our approach on benchmark datasets achieving sig- nificant increase in performance. Finally, we visualize the learned weights to better understand what constitutes relevant temporal context.
- May 22 2018 cs.CV arXiv:1805.07632v1Given data, deep generative models, such as variational autoencoders (VAE) and generative adversarial networks (GAN), train a lower dimensional latent representation of the data space. The linear Euclidean geometry of data space pulls back to a nonlinear Riemannian geometry on the latent space. The latent space thus provides a low-dimensional nonlinear representation of data and classical linear statistical techniques are no longer applicable. In this paper we show how statistics of data in their latent space representation can be performed using techniques from the field of nonlinear manifold statistics. Nonlinear manifold statistics provide generalizations of Euclidean statistical notions including means, principal component analysis, and maximum likelihood fits of parametric probability distributions. We develop new techniques for maximum likelihood inference in latent space, and adress the computational complexity of using geometric algorithms with high-dimensional data by training a separate neural network to approximate the Riemannian metric and cometric tensor capturing the shape of the learned data manifold.
- May 22 2018 quant-ph arXiv:1805.07620v1Non-Hermitian singularities are ubiquitous in non-conservative open systems. These singularities are often points of measure zero in the eigenspectrum of the system which make them difficult to access without careful engineering. Despite that, they can remotely induce observable effects when some of the system's parameters are varied along closed trajectories in the parameter space. To date, a general formalism for describing this process beyond simple cases is still lacking. Here, we bridge this gap and develop a general approach for treating this problem by utilizing the power of permutation operators and representation theory. This in turn allows us to reveal the following surprising result which contradicts the common belief in the field: loops that enclose the same singularities starting from the same initial point and traveling in the same direction, do not necessarily share the same end outcome. Interestingly, we find that this equivalence can be formally established only by invoking the topological notion of homotopy. Our findings are general with far reaching implications in various fields ranging from photonics and atomic physics to microwaves and acoustics.
- The nodal set of a Laplacian eigenfunction forms a partition of the underlying manifold or graph. Another natural partition is based on the gradient vector field of the eigenfunction (on a manifold) or on the extremal points of the eigenfunction (on a graph). The submanifolds (or subgraphs) of this partition are called Neumann domains. This paper reviews the subject, as appears in a few recent works and points out some open questions and conjectures. The paper concerns both manifolds and metric graphs and the exposition allows for a comparison between the results obtained for each of them.
- Boosted decision trees enjoy popularity in a variety of applications; however, for large-scale datasets, the cost of training a decision tree in each round can be prohibitively expensive. Inspired by ideas from the multi-arm bandit literature, we develop a highly efficient algorithm for computing exact greedy-optimal decision trees, outperforming the state-of-the-art Quick Boost method. We further develop a framework for deriving lower bounds on the problem that applies to a wide family of conceivable algorithms for the task (including our algorithm and Quick Boost), and we demonstrate empirically on a wide variety of data sets that our algorithm is near-optimal within this family of algorithms. We also derive a lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this best-achievable lower bound.
- May 22 2018 math.CO arXiv:1805.07576v1A pair $(A,B)$ of square $(0,1)$-matrices is called a Lehman pair if $AB^T=J+kI$ for some integer $k\in\{-1,1,2,3,\ldots\}$, and the matrices $A$ and $B$ are called Lehman pair. This terminology arises because Lehman showed that the rows of minimum weight in any non-degenerate minimally nonideal (mni) matrix $M$ form a square Lehman submatrix of $M$. In this paper, we view a Lehman matrix as the bipartite adjacency matrix of a regular bipartite graph, focussing in particular on the case where the graph is cubic. From this perspective, we identify two constructions that generate cubic Lehman graphs from smaller Lehman graphs. The most prolific of these constructions involves repeatedly replacing suitable pairs of edges with a particular $6$-vertex subgraph that we call a $3$-rung ladder segment. Two decades ago, Lütolf and Margot initiated a computational study of mni matrices and constructed a catalogue containing (among other things) a listing of all cubic Lehman matrices with $k =1$ of order up to $17 \times 17$. We verify their catalogue (which has just one omission), and extend the computational results to $20 \times 20$ matrices. Of the $908$ cubic Lehman matrices (with $k=1$) of order up to $20 \times 20$, only two do not arise from our $3$-rung ladder construction. However these exceptions can be derived from our second construction, and so our two constructions cover all known cubic Lehman matrices with $k=1$.
- May 22 2018 cs.CV arXiv:1805.07548v1Deep learning stands at the forefront in many computer vision tasks. However, deep neural networks are usually data-hungry and require a huge amount of well-annotated training samples. Collecting sufficient annotated data is very expensive in many applications, especially for pixel-level prediction tasks such as semantic segmentation. To solve this fundamental issue, we consider a new challenging vision task, Internetly supervised semantic segmentation, which only uses Internet data with noisy image-level supervision of corresponding query keywords for segmentation model training. We address this task by proposing the following solution. A class-specific attention model unifying multiscale forward and backward convolutional features is proposed to provide initial segmentation "ground truth". The model trained with such noisy annotations is then improved by an online fine-tuning procedure. It achieves state-of-the-art performance under the weakly-supervised setting on PASCAL VOC2012 dataset. The proposed framework also paves a new way towards learning from the Internet without human interaction and could serve as a strong baseline therein. Code and data will be released upon the paper acceptance.
- May 22 2018 cs.CV arXiv:1805.07477v1Augmenting deep neural networks with skip connections, as introduced in the so called ResNet architecture, surprised the community by enabling the training of networks of more than 1000 layers with significant performance gains. It has been shown that identity skip connections eliminate singularities and improve the optimization landscape of the network. This paper deciphers ResNet by analyzing the of effect of skip connections in the backward path and sets forth new theoretical results on the advantages of identity skip connections in deep neural networks. We prove that the skip connections in the residual blocks facilitate preserving the norm of the gradient and lead to well-behaved and stable back-propagation, which is a desirable feature from optimization perspective. We also show that, perhaps surprisingly, as more residual blocks are stacked, the network becomes more norm-preserving. Traditionally, norm-preservation is enforced on the network only at beginning of the training, by using initialization techniques. However, we show that identity skip connection retain norm-preservation during the training procedure. Our theoretical arguments are supported by extensive empirical evidence. Can we push for more norm-preservation? We answer this question by proposing zero-phase whitening of the fully-connected layer and adding norm-preserving transition layers. Our numerical investigations demonstrate that the learning dynamics and the performance of ResNets can be improved by making it even more norm preserving through changing only a few blocks in very deep residual networks. Our results and the introduced modification for ResNet, referred to as Procrustes ResNets, can be used as a guide for studying more complex architectures such as DenseNet, training deeper networks, and inspiring new architectures.
- May 22 2018 cs.CV arXiv:1805.07457v1The per-pixel cross-entropy loss (CEL) has been widely used in structured output prediction tasks as a spatial extension of generic image classification. However, its i.i.d. assumption neglects the structural regularity present in natural images. Various attempts have been made to incorporate structural reasoning mostly through structure priors in a cooperative way where co-occuring patterns are encouraged. We, on the other hand, approach this problem from an opposing angle and propose a new framework for training such structured prediction networks via an adversarial process, in which we train a structure analyzer that provides the supervisory signals, the adversarial structure matching loss (ASML). The structure analyzer is trained to maximize ASML, or to exaggerate recurring structural mistakes usually among co-occurring patterns. On the contrary, the structured output prediction network is trained to reduce those mistakes and is thus enabled to distinguish fine-grained structures. As a result, training structured output prediction networks using ASML reduces contextual confusion among objects and improves boundary localization. We demonstrate that ASML outperforms its counterpart CEL especially in context and boundary aspects on figure-ground segmentation and semantic segmentation tasks with various base architectures, such as FCN, U-Net, DeepLab, and PSPNet.
- Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors.
- Neural networks have been learning complex multi-hop reasoning in various domains. One such formal setting for reasoning, logic, provides a challenging case for neural networks. In this article, we propose a Neural Inference Network (NIN) for learning logical inference over classes of logic programs. Trained in an end-to-end fashion NIN learns representations of normal logic programs, by processing them at a character level, and the reasoning algorithm for checking whether a logic program entails a given query. We define 12 classes of logic programs that exemplify increased level of complexity of the inference process (multi-hop and default reasoning) and show that our NIN passes 10 out of the 12 tasks. We also analyse the learnt representations of logic programs that NIN uses to perform the logical inference.
- We explore the possibility of using machine learning to identify interesting mathematical structures by using certain quantities that serve as fingerprints. In particular, we extract features from integer sequences using two empirical laws: Benford's law and Taylor's law and experiment with various classifiers to identify whether a sequence is nice, important, multiplicative, easy to compute or related to primes or palindromes.
- We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation. Specifically Universal Portfolio enjoys optimal regret $\mathcal{O}(N\ln T)$ for $N$ financial instruments over $T$ rounds, but requires log-concave sampling and has a large polynomial running time. Our algorithm, on the other hand, ensures a slightly larger but still logarithmic regret of $\mathcal{O}(N^2(\ln T)^4)$, and is based on the well-studied Online Mirror Descent framework with a novel regularizer that can be implemented via standard optimization methods in time $\mathcal{O}(TN^{2.5})$ per round. The regret of all other existing works is either polynomial in $T$ or has a potentially unbounded factor such as the inverse of the smallest price relative.
- We consider the problem of uncertainty estimation in the context of (non-Bayesian) deep neural classification. All current methods are based on extracting uncertainty signals from a trained network optimized to solve the classification problem at hand. We demonstrate that such techniques tend to misestimate instances whose predictions are supposed to be highly confident. This deficiency is an artifact of the training process with SGD-like optimizers. Based on this observation, we develop an uncertainty estimation algorithm that "peels away" highly confident points sequentially and estimates their confidence using earlier snapshots of the trained model, before their uncertainty estimates are jittered. We present extensive experiments indicating that the proposed algorithm provides uncertainty estimates that are consistently better than the best known methods.
- It is important to learn classifiers under noisy labels due to their ubiquities. As noisy labels are corrupted from ground-truth labels by an unknown noise transition matrix, the accuracy of classifiers can be improved by estimating this matrix, without introducing either sample-selection or regularization biases. However, such estimation is often inexact, which inevitably degenerates the accuracy of classifiers. The inexact estimation is due to either a heuristic trick, or the brutal-force learning by deep networks under a finite dataset. In this paper, we present a human-assisted approach called "\textitmasking". The masking conveys human cognition of invalid class transitions, and naturally speculates the structure of the noise transition matrix. Given the structure information, we only learn the noise transition probability to reduce the estimation burden. To instantiate this approach, we derive a structure-aware probabilistic model, which incorporates a structure prior. During the model realization, we solve the challenges from structure extraction and alignment in principle. Empirical results on benchmark datasets with three noise structures show that, our approach can improve the robustness of classifiers significantly.
- May 22 2018 cs.CV arXiv:1805.08191v1We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder. The high-level decoder constructs a plan by generating a semantic concept (i.e., topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network, which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a flat deep reinforcement learning baseline.
- May 22 2018 cs.DM arXiv:1805.08186v1In 2010, A. Shpilka and I. Volkovich established a prominent result on the equivalence of polynomial factorization and identity testing. It follows from their result that a multilinear polynomial over the finite field of order 2 (a multilinear boolean polynomial) can be factored in time cubic in the size of the polynomial given as a string. We have later rediscovered this result and provided a simple factorization algorithm based on the computation of derivatives of multilinear boolean polynomials. The algorithm has been applied to solve problems of compact representation of various combinatorial structures, including boolean functions and relational data tables. In this paper, we improve the known cubic upper complexity bound of this factorization algorithm and report on a preliminary experimental analysis.
- May 22 2018 quant-ph cond-mat.stat-mech arXiv:1805.08184v1We analyze the role of indirect quantum measurements in work extraction from quantum systems in nonequilibrium states. In particular, we focus on the work that can be obtained by exploiting the correlations shared between the system of interest and an additional ancilla, where measurement backaction introduces a nontrivial thermodynamic tradeoff. We present optimal state-dependent protocols for extracting work from both classical and quantum correlations, the latter being measured by discord. We show that, while the work content of classical correlations can be fully extracted by performing local operations on the system of interest, the amount of work related to quantum discord requires a specific driving protocol that includes interaction between system and ancilla.
- May 22 2018 cs.CV arXiv:1805.08183v1Subspace clustering refers to the problem of segmenting high dimensional data drawn from a union of subspaces into the respective subspaces. In some applications, partial side-information to indicate "must-link" or "cannot-link" in clustering is available. This leads to the task of subspace clustering with side-information. However, in prior work the supervision value of the side-information for subspace clustering has not been fully exploited. To this end, in this paper, we present an enhanced approach for constrained subspace clustering with side-information, termed Constrained Sparse Subspace Clustering plus (CSSC+), in which the side-information is used not only in the stage of learning an affinity matrix but also in the stage of spectral clustering. Moreover, we propose to estimate clustering accuracy based on the partial side-information and discuss the potential connection to the true clustering accuracy. We conduct experiments on three cancer gene expression datasets to validate the effectiveness of our proposals.
- Predicting how Congressional legislators will vote is important for understanding their past and future behavior. However, previous work on roll-call prediction has been limited to single session settings, thus did not consider generalization across sessions. In this paper, we show that metadata is crucial for modeling voting outcomes in new contexts, as changes between sessions lead to changes in the underlying data generation process. We show how augmenting bill text with the sponsors' ideologies in a neural network model can achieve an average of a 4% boost in accuracy over the previous state-of-the-art.
- Reinforcement Learning (RL) algorithms can suffer from poor sample efficiency when rewards are delayed and sparse. We introduce a solution that enables agents to learn temporally extended actions at multiple levels of abstraction in a sample efficient and automated fashion. Our approach combines universal value functions and hindsight learning, allowing agents to learn policies belonging to different time scales in parallel. We show that our method significantly accelerates learning in a variety of discrete and continuous tasks.
- May 22 2018 math.GT arXiv:1805.08178v1We generalize the index polynomial invariant to the case of virtual tangles. Three polynomial invariants result from this generalization; we give a brief overview of their definition and some basic properties.
- May 22 2018 cs.CV arXiv:1805.08170v1We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly training an image-to-text generator (a.k.a. captionbot) and a text-to-image generator (a.k.a. drawingbot). The key idea behind the joint training is that image-to-text generation and text-to-image generation as dual problems can form a closed loop to provide informative feedback to each other. Based on such feedback, we introduce a new loss metric by comparing the original input with the output produced by the closed loop. In addition to the old loss metrics used in captionbot and drawingbot, this extra loss metric makes the jointly trained captionbot and drawingbot better than the separately trained captionbot and drawingbot. Furthermore, the turbo-learning approach enables semi-supervised learning since the closed loop can provide peudo-labels for unlabeled samples. Experimental results on the COCO dataset demonstrate that the proposed turbo learning can significantly improve the performance of both captionbot and drawingbot by a large margin.
- May 22 2018 cs.CV arXiv:1805.08162v1The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown extremely good results for video human action classification, however, action detection is still a challenging problem. The current action detection approaches follow a complex pipeline which involves multiple tasks such as tube proposals, optical flow, and tube classification. In this work, we present a more elegant solution for action detection based on the recently developed capsule network. We propose a 3D capsule network for videos, called VideoCapsuleNet: a unified network for action detection which can jointly perform pixel-wise action segmentation along with action classification. The proposed network is a generalization of capsule network from 2D to 3D, which takes a sequence of video frames as input. The 3D generalization drastically increases the number of capsules in the network, making capsule routing computationally expensive. We introduce capsule-pooling in the convolutional capsule layer to address this issue which makes the voting algorithm tractable. The routing-by-agreement in the network inherently models the action representations and various action characteristics are captured by the predicted capsules. This inspired us to utilize the capsules for action localization and the class-specific capsules predicted by the network are used to determine a pixel-wise localization of actions. The localization is further improved by parameterized skip connections with the convolutional capsule layers and the network is trained end-to-end with a classification as well as localization loss. The proposed network achieves sate-of-the-art performance on multiple action detection datasets including UCF-Sports, J-HMDB, and UCF-101 (24 classes) with an impressive ~20% improvement on UCF-101 and ~15% improvement on J-HMDB in terms of v-mAP scores.
- May 22 2018 astro-ph.IM arXiv:1805.08160v1We present new fully-automatic classification model to select extragalactic objects within astronomy photometric catalogs. Construction of the our classification model is based on the three important procedures: 1) data representation to create feature space; 2) building hypersurface in feature space to limit range of features (outliers detection); 3) building hyperplane separating extragalactic objects from the galactic ones. We trained our model with 1.7 million objects (1.4 million galaxies and quasars, 0.3 million stars). The application of the model is presented as a photometric catalog of 38 million extragalactic objects, identified in the WISE and Pan-STARRS catalogs cross-matched with each other.
- Despite substantial interest in applications of neural networks to information retrieval, neural ranking models have only been applied to standard ad hoc retrieval tasks over web pages and newswire documents. This paper proposes MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network) a novel neural ranking model specifically designed for ranking short social media posts. We identify document length, informal language, and heterogeneous relevance signals as features that distinguish documents in our domain, and present a model specifically designed with these characteristics in mind. Our model uses hierarchical convolutional layers to learn latent semantic soft-match relevance signals at the character, word, and phrase levels. A pooling-based similarity measurement layer integrates evidence from multiple types of matches between the query, the social media post, as well as URLs contained in the post. Extensive experiments using Twitter data from the TREC Microblog Tracks 2011--2014 show that our model significantly outperforms prior feature-based as well and existing neural ranking models. To our best knowledge, this paper presents the first substantial work tackling search over social media posts using neural ranking models.
- Adapting deep networks to new concepts from few examples is extremely challenging, due to the high computational and data requirements of standard fine-tuning procedures. Most works on meta-learning and few-shot learning have thus focused on simple learning techniques for adaptation, such as nearest neighbors or gradient descent. Nonetheless, the machine learning literature contains a wealth of methods that learn non-deep models very efficiently. In this work we propose to use these fast convergent methods as the main adaptation mechanism for few-shot learning. The main idea is to teach a deep network to use standard machine learning tools, such as logistic regression, as part of its own internal model, enabling it to quickly adapt to novel tasks. This requires back-propagating errors through the solver steps. While normally the matrix operations involved would be costly, the small number of examples works to our advantage, by making use of the Woodbury identity. We propose both iterative and closed-form solvers, based on logistic regression and ridge regression components. Our methods achieve excellent performance on three few-shot learning benchmarks, showing competitive performance on Omniglot and surpassing all state-of-the-art alternatives on miniImageNet and CIFAR-100.
- We consider a new family of operators for reinforcement learning with the goal of alleviating the negative effects and becoming more robust to approximation or estimation errors. Various theoretical results are established, which include showing on a sample path basis that our family of operators preserve optimality and increase the action gap. Our empirical results illustrate the strong benefits of our family of operators, significantly outperforming the classical Bellman operator and recently proposed operators.
- May 22 2018 cond-mat.mes-hall arXiv:1805.08119v1We study the low-energy physics of a one-dimensional array of superconducting quantum dots realized by proximity coupling a semiconductor nanowire to multiple superconducting islands separated by narrow uncovered regions. The effective electrostatic potential inside the quantum dots and the uncovered regions can be controlled using potential gates. By performing detailed numerical calculations based on effective tightbinding models, we find that multiple low-energy sub-gap states consisting of partially overlapping Majorana bound states emerge generically in the vicinity of the uncovered regions. Explicit differential conductance calculations show that a robust zero-bias conductance peak is not inconsistent with the presence of such states localized throughout the system, hence the observation of such a peak does not demonstrate the realization of well-separated Majorana zero modes. However, we find that creating effective potential wells in the uncovered regions traps pairs of nearby partially overlapping Majorana bound states, which become less separated and acquire a finite gap that protects the pair of Majorana zero modes localized at the ends of the system. This behavior persists over a significant parameter range, suggesting that proximitized quantum dot arrays could provide a platform for highly controllable Majorana devices.
- May 22 2018 cs.CV arXiv:1805.08113v1Zero-Shot Learning (ZSL) is achieved via aligning the semantic relationships between the global image feature vector and the corresponding class semantic descriptions. However, using the global features to represent fine-grained images may lead to sub-optimal results since they neglect the discriminative differences of local regions. Besides, different regions contain distinct discriminative information. The important regions should contribute more to the prediction. To this end, we propose a novel stacked semantics-guided attention (S2GA) model to obtain semantic relevant features by using individual class semantic features to progressively guide the visual features to generate an attention map for weighting the importance of different local regions. Feeding both the integrated visual features and the class semantic features into a multi-class classification architecture, the proposed framework can be trained end-to-end. Extensive experimental results on CUB and NABird datasets show that the proposed approach has a consistent improvement on both fine-grained zero-shot classification and retrieval tasks.
- May 22 2018 quant-ph arXiv:1805.08104v1We consider branched quantum wires, whose connection rules provide PT-symmetry for the Schrodinger equation on graph. For such PT-symmetric quantum graph we derive general boundary conditions which keep the Hamiltonian as PT-symmetric with real eigenvalues and positively defined norm. Explicit boundary conditions which are consistent with the general PT-symmetric boundary conditions are presented. Secular equations for finding the eigenvalues of the quantum graph are derived. Breaking of the Kirchhoff rule at the branching point is shown. Experimental realization of PT-symmetric quantum graphs on branched optical waveguides is discussed.