- We introduce a framework for the formal specification and verification of quantum circuits based on the Feynman path integral. Our formalism, built around exponential sums of polynomial functions, provides a structured and natural way of specifying quantum operations, particularly for quantum implementations of classical functions. Verification of circuits over all levels of the Clifford hierarchy with respect to either a specification or reference circuit is enabled by a novel rewrite system for exponential sums with free variables. Our algorithm is further shown to give a polynomial-time decision procedure for checking the equivalence of Clifford group circuits. We evaluate our methods by performing automated verification of optimized Clifford+T circuits with up to 100 qubits and thousands of T gates, as well as the functional verification of quantum algorithms using hundreds of qubits. Our experiments culminate in the automated verification of the Hidden Shift algorithm for a class of Boolean functions in a fraction of the time it has taken recent algorithms to simulate.
- May 21 2018 quant-ph cond-mat.stat-mech arXiv:1805.07168v1Investigating translationally invariant qudit spin chains with a low local dimension, we ask what is the best possible tradeoff between the scaling of the entanglement entropy of a large block and the inverse-polynomial scaling of the spectral gap. Restricting ourselves to Hamiltonians with a "rewriting" interaction, we find the pair-flip model, a family of spin chains with nearest neighbor, translationally invariant, frustration-free interactions, with a very entangled ground state and an inverse-polynomial spectral gap. For a ground state in a particular invariant subspace, the entanglement entropy across a middle cut scales as $\log n$ for qubits (it is equivalent to the XXX model), while for qutrits and higher, it scales as $\sqrt{n}$. Moreover, we conjecture that this particular ground state can be made unique by adding a small translationally-invariant perturbation that favors neighboring letter pairs, adding a small amount of frustration, while retaining the entropy scaling.
- May 22 2018 quant-ph arXiv:1805.07772v1Energy-time uncertainty plays an important role in quantum foundations and technologies, and it was even discussed by the founders of quantum mechanics. However, standard approaches (e.g., Robertson's uncertainty relation) do not apply to energy-time uncertainty because, in general, there is no Hermitian operator associated with time. Following previous approaches, we quantify time uncertainty by how well one can read off the time from a quantum clock. We then use entropy to quantify the information-theoretic distinguishability of the various time states of the clock. Our main result is an entropic energy-time uncertainty relation for general time-independent Hamiltonians, stated for both the discrete-time and continuous-time cases. Our uncertainty relation is strong, in the sense that it allows for a quantum memory to help reduce the uncertainty, and this formulation leads us to reinterpret it as a bound on the relative entropy of asymmetry. Due to the operational relevance of entropy, we anticipate that our uncertainty relation will have information-processing applications.
- A widespread folklore for explaining the success of convolutional neural network (CNN) is that CNN is a more compact representation than the fully connected neural network (FNN) and thus requires fewer samples for learning. We initiate the study of rigorously characterizing the sample complexity of learning convolutional neural networks. We show that for learning an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O} (m/\epsilon^2)$, whereas its FNN counterpart needs at least $\Omega(d/\epsilon^2)$ samples. Since $m \ll d$, this result demonstrates the advantage of using CNN. We further consider the sample complexity of learning a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show the sample complexity is $\widetilde{O}\left((m+r)/\epsilon^2\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are localized empirical process and a new lemma characterizing the convolutional structure. We believe these tools may inspire further developments in understanding CNN.
- We study novel three-dimensional quantum phases of matter which support particles with restricted mobility, including immobile "fracton" excitations. Many fracton models may be viewed as generalized Abelian lattice gauge theories. Here, by analogy with Dijkgraaf-Witten topological gauge theories, we discover a natural generalization of fracton models, obtained by twisting the gauge symmetries. Introducing generalized gauge transformation operators carrying extra phase factors depending on local configurations, we construct a large and rich family of exactly solvable three-dimensional models, which we dub "twisted fracton models." A key result of our approach is to demonstrate explicitly the existence of non-Abelian fractons (as well as non-Abelian quasi-particles whose motion is restricted to some dimensions) in a three-dimensional system with local interactions. We further characterize these twisted fracton phases by embedding them on a three-torus and computing their ground state degeneracies, which depend sub-extensively on the system size. Moreover, as an important advance in the study of fracton order, we develop a general mathematical framework which systematically captures the fusion and braiding properties of fractons and other quasi-particles with restricted mobility.
- May 22 2018 quant-ph arXiv:1805.08138v1The calculation of excited state energies of electronic structure Hamiltonians has many important applications, such as the calculation of optical spectra and reaction rates. While low-depth quantum algorithms, such as the variational quantum eigenvalue solver, have been used to determine ground state energies, methods for calculating excited states currently involve the implementation of high-depth controlled-unitaries or a large number of additional samples. Here we show how overlap estimation can be used in the calculation of excited state energies and their degeneracies. We propose an implementation using the destructive SWAP test that can determine the excited states of a $N$-qubit Hamiltonian using a $N\times 2$ grid of qubits and a low-depth circuit. Our method remains robust to control errors and can be implemented on near-term quantum computers.
- In this second part of the study initiated in arxiv:1804.07410, we investigate holographic complexity for eternal black hole backgrounds perturbed by shock waves, with both the complexity$=$action (CA) and complexity$=$volume (CV) proposals. In particular, we consider Vaidya geometries describing a thin shell of null fluid with arbitrary energy falling in from one of the boundaries of a two-sided AdS-Schwarzschild spacetime. We demonstrate how known properties of complexity, such as the switchback effect for light shocks, as well as analogous properties for heavy ones, are imprinted in the complexity of formation and in the full time evolution of complexity. Following our discussion in arxiv:1804.07410, we find that in order to obtain the expected properties of the complexity, the inclusion of a particular counterterm on the null boundaries of the Wheeler-DeWitt patch is required for the CA proposal.
- We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.
- May 22 2018 cs.DS arXiv:1805.07742v1We develop a framework for obtaining polynomial time approximation schemes (PTAS) for a class of stochastic dynamic programs. Using our framework, we obtain the first PTAS for the following stochastic combinatorial optimization problems: \probemax: We are given a set of $n$ items, each item $i\in [n]$ has a value $X_i$ which is an independent random variable with a known (discrete) distribution $\pi_i$. We can \em probe a subset $P\subseteq [n]$ of items sequentially. Each time after probing an item $i$, we observe its value realization, which follows the distribution $\pi_i$. We can \em adaptively probe at most $m$ items and each item can be probed at most once. The reward is the maximum among the $m$ realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized. To the best of our knowledge, the best known approximation ratio is $1-1/e$, due to Asadpour \etal~\citeasadpour2015maximizing. We also obtain PTAS for some generalizations and variants of the problem and some other problems.
- May 21 2018 quant-ph arXiv:1805.07351v1Constructing a large scale quantum processor will require gate operations that are robust in the presence of noise and experimental imperfection. We experimentally demonstrate, making use of a pair of trapped ions, how a new type of Mølmer-Sørensen gate protects against infidelity caused by heating of the motional mode used during the gate. Furthermore, we show how the same technique simultaneously provides significant protection against slow fluctuations and mis-sets in the frequency of the mode. Since this sensitivity is further enhanced in cases where the ions are not ground state cooled, our method provides a path towards relaxing ion cooling requirements in practical realisations of quantum computing and simulation.
- We define Radon transform and its inverse on the two-dimensional anti-de Sitter space over local fields using a novel construction through a quadratic equation over the local field. We show that the holographic bulk reconstruction of quantum fields in this space can be formulated as the inverse Radon transform, generalizing the case over the reals, studied earlier.
- Let $G$ be an undirected, bounded degree graph with $n$ vertices. Fix a finite graph $H$, and suppose one must remove $\eps n$ edges from $G$ to make it $H$-minor free (for some small constant $\eps > 0$). We give an $n^{1/2+o(1)}$-time randomized procedure that, with high probability, finds an $H$-minor in such a graph. For an example application, suppose one must remove $\eps n$ edges from a bounded degree graph $G$ to make it planar. This result implies an algorithm, with the same running time, that produces a $K_{3,3}$ or $K_5$ minor in $G$. No sublinear time bound was known for this problem, prior to this result. By the graph minor theorem, we get an analogous result for any minor-closed property. Up to $n^{o(1)}$ factors, this resolves a conjecture of Benjamini-Schramm-Shapira (STOC 2008) on the existence of one-sided property testers for minor-closed properties. Furthermore, our algorithm is nearly optimal, by an $\Omega(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014). Prior to this work, the only graphs $H$ for which non-trivial property testers were known for $H$-minor freeness are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$-grid, and the $k$-circus (Fichtenberger et al, Arxiv 2017).
- May 22 2018 quant-ph arXiv:1805.08172v1Recently, dimensionality testing of a quantum state has received extensive attention (Acín et al. Phys. Rev. Letts. 2006, Scarani et al. Phys. Rev. Letts. 2006). Security proofs of existing quantum information processing protocols rely on the assumption about the dimension of quantum states in which logical bits will be encoded. However, removing such assumption may cause security loophole. In the present report, we show that this is indeed the case. We choose two player's quantum private query protocol by Yang et al. (Quant. Inf. Process. 2014) as an example and show how one player can gain considerable information than desired by changing the dimension of subsystem of a shared quantum system. To resist such attack we propose dimensionality testing in a different way. Our proposal is based on CHSH like game which certifies the degrees of freedom of the subsystems. As we exploit CHSH like game, it can be used to test if the states are separable for which the protocol becomes completely vulnerable.
- We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.
- May 22 2018 cs.GT arXiv:1805.08125v1In this work, we aim to create a data marketplace; a robust real-time matching mechanism to efficiently buy and sell training data for Machine Learning tasks. While the monetization of data and pre-trained models is an essential focus of industry today, there does not exist a market mechanism to price training data and match buyers to vendors while still addressing the associated (computational and other) complexity. The challenge in creating such a market stems from the very nature of data as an asset: it is freely replicable; its value is inherently combinatorial due to correlation with signal in other data; prediction tasks and the value of accuracy vary widely; usefulness of training data is difficult to verify a priori without first applying it to a prediction task. As our main contributions we: (i) propose a mathematical model for a two-sided data market and formally define key challenges; (ii) construct algorithms for such a market to function and rigorously prove how they meet the challenges defined. We highlight two technical contributions: (i) a remarkable link between Myerson's payment function arising in mechanism design and the Lovasz extension arising in submodular optimization; (ii) a novel notion of "fairness" required for cooperative games with freely replicable goods. These might be of independent interest.
- We propose a novel technique for faster Neural Network (NN) training by systematically approximating all the constituent matrix multiplications and convolutions. This approach is complementary to other approximation techniques, requires no changes to the dimensions of the network layers, hence compatible with existing training frameworks. We first analyze the applicability of the existing methods for approximating matrix multiplication to NN training, and extend the most suitable column-row sampling algorithm to approximating multi-channel convolutions. We apply approximate tensor operations to training MLP, CNN and LSTM network architectures on MNIST, CIFAR-100 and Penn Tree Bank datasets and demonstrate 30%-80% reduction in the amount of computations while maintaining little or no impact on the test accuracy. Our promising results encourage further study of general methods for approximating tensor operations and their application to NN training.
- Quantum multiparameter estimation involves estimating multiple parameters simultaneously and was shown to be more accurate than estimating the parameters individually. Our interest here is to determine multiparameter quantum Cramer-Rao bounds (QCRBs) for noisy metrology. We do so, in particular, using anti-symmetric logarithmic derivatives (ALDs) for the quantum Fisher information matrix (QFIM). A recent work studied the simultaneous estimation of all three components of a magnetic field using a pure probe state and unitary evolution. Here, we consider the more practical problem of simultaneously estimating multiple parameters using a mixed probe state. We found that the QCRB so obtained using ALDs depends only on the first and second order reduced density operators of the probe state. Moreover, recent works have studied upper bounds to single-parameter as well as multiple-parameter QFIs for systems under arbitrary evolution. We here present an upper bound to the noisy multiparameter QFIM in the most general form using the ALD-approach.
- Learning with a \it convex loss function has been a dominating paradigm for many years. It remains an interesting question how non-convex loss functions help improve the generalization of learning with broad applicability. In this paper, we study a family of objective functions formed by truncating traditional loss functions, which is applicable to both shallow learning and deep learning. Truncating loss functions has potential to be less vulnerable and more robust to large noise in observations that could be adversarial. More importantly, it is a generic technique without assuming the knowledge of noise distribution. To justify non-convex learning with truncated losses, we establish excess risk bounds of empirical risk minimization based on truncated losses for heavy-tailed output, and statistical error of an approximate stationary point found by stochastic gradient descent (SGD) method. Our experiments for shallow and deep learning for regression with outliers, corrupted data and heavy-tailed noise further justify the proposed method.
- May 22 2018 gr-qc arXiv:1805.07840v1We study the absorption of monochromatic electromagnetic plane waves impinging upon a Kerr black hole, in the general case that the direction of incidence is not aligned with the black hole spin axis. We present numerical results that are in accord with low- and high-frequency approximations. We find that circularly-polarized waves are distinguished by the black hole spin, with counter-rotating polarizations being more absorbed than co-rotating polarizations. At low frequencies and moderate incidence angles, there exists a narrow parameter window in which superradiant emission in the dipole mode can exceed absorption in the non-superradiant modes, allowing a planar electromagnetic wave to stimulate net emission from a black hole.
- Two important elements have driven recent innovation in the field of regression: sparsity-inducing regularization, to cope with high-dimensional problems; multi-task learning through joint parameter estimation, to augment the number of training samples. Both approaches complement each other in the sense that a joint estimation results in more samples, which are needed to estimate sparse models accurately, whereas sparsity promotes models that act on subsets of related variables. This idea has driven the proposal of block regularizers such as L1/Lq norms, which however effective, require that active regressors strictly overlap. In this paper, we propose a more flexible convex regularizer based on unbalanced optimal transport (OT) theory. That regularizer promotes parameters that are close, according to the OT geometry, which takes into account a prior geometric knowledge on the regressor variables. We derive an efficient algorithm based on a regularized formulation of optimal transport, which iterates through applications of Sinkhorn's algorithm along with coordinate descent iterations. The performance of our model is demonstrated on regular grids and complex triangulated geometries of the cortex with an application in neuroimaging.
- The main computational cost in the training of and prediction with Convolution Neural Networks (CNNs) typically stems from the convolution. In this paper, we present three novel ways to parameterize the convolution more efficiently, significantly decreasing the computational complexity. Commonly used CNNs filter the input data using a series of spatial convolutions with compact stencils that couple features from all channels and point-wise nonlinearities. In this paper, we propose three architectures that are cheaper to couple the channel dimension and thereby reduce both the number of trainable weights and the computational cost of the CNN. The first architecture is inspired by tensor-products and imposes a circulant coupling of the channels. The second and third architectures arise as discretizations of a new type of residual neural network (ResNet) that is inspired by Partial Differential Equations (PDEs) or reaction-diffusion type. The coupling patterns of the first two architectures are applicable to a large class of CNNs. We outline in our numerical experiments that the proposed architectures, although considerably reducing the number of trainable weights, yield comparable accuracy to existing CNNs that are fully coupled in the channel dimension.
- We present a new technique for deep reinforcement learning that automatically detects moving objects and uses the relevant information for action selection. The detection of moving objects is done in an unsupervised way by exploiting structure from motion. Instead of directly learning a policy from raw images, the agent first learns to detect and segment moving objects by exploiting flow information in video sequences. The learned representation is then used to focus the policy of the agent on the moving objects. Over time, the agent identifies which objects are critical for decision making and gradually builds a policy based on relevant moving objects. This approach, which we call Motion-Oriented REinforcement Learning (MOREL), is demonstrated on a suite of Atari games where the ability to detect moving objects reduces the amount of interaction needed with the environment to obtain a good policy. Furthermore, the resulting policy is more interpretable than policies that directly map images to actions or values with a black box neural network. We can gain insight into the policy by inspecting the segmentation and motion of each object detected by the agent. This allows practitioners to confirm whether a policy is making decisions based on sensible information.
- The study of high-throughput genomic profiles from a pharmacogenomics viewpoint has provided unprecedented insights into the oncogenic features modulating drug response. A recent screening of ~1,000 cancer cell lines to a collection of anti-cancer drugs illuminated the link between genotypes and vulnerability. However, due to essential differences between cell lines and tumors, the translation into predicting drug response in tumors remains challenging. Here we proposed a DNN model to predict drug response based on mutation and expression profiles of a cancer cell or a tumor. The model contains a mutation and an expression encoders pre-trained using a large pan-cancer dataset to abstract core representations of high-dimension data, followed by a drug response predictor network. Given a pair of mutation and expression profiles, the model predicts IC50 values of 265 drugs. We trained and tested the model on a dataset of 622 cancer cell lines and achieved an overall prediction performance of mean squared error at 1.96 (log-scale IC50 values). The performance was superior in prediction error or stability than two classical methods and four analog DNNs of our model. We then applied the model to predict drug response of 9,059 tumors of 33 cancer types. The model predicted both known, including EGFR inhibitors in non-small cell lung cancer and tamoxifen in ER+ breast cancer, and novel drug targets. The comprehensive analysis further revealed the molecular mechanisms underlying the resistance to a chemotherapeutic drug docetaxel in a pan-cancer setting and the anti-cancer potential of a novel agent, CX-5461, in treating gliomas and hematopoietic malignancies. Overall, our model and findings improve the prediction of drug response and the identification of novel therapeutic options.
- We propose a new Bayesian Neural Net (BNN) formulation that affords variational inference for which the evidence lower bound (ELBO) is analytically tractable subject to a tight approximation. We achieve this tractability by decomposing ReLU nonlinearities into an identity function and a Kronecker delta function. We demonstrate formally that assigning the outputs of these functions to separate latent variables allows representing the neural network likelihood as the composition of a chain of linear operations. Performing variational inference on this construction enables closed-form computation of the evidence lower bound. It can thus be maximized without requiring Monte Carlo sampling to approximate the problematic expected log-likelihood term. The resultant formulation boils down to stochastic gradient descent, where the gradients are not distorted by any factor besides minibatch selection. This amends a long-standing disadvantage of BNNs relative to deterministic nets. Experiments on four benchmark data sets show that the cleaner gradients provided by our construction yield a steeper learning curve, achieving higher prediction accuracies for a fixed epoch budget.
- May 22 2018 cs.CV arXiv:1805.07632v1Given data, deep generative models, such as variational autoencoders (VAE) and generative adversarial networks (GAN), train a lower dimensional latent representation of the data space. The linear Euclidean geometry of data space pulls back to a nonlinear Riemannian geometry on the latent space. The latent space thus provides a low-dimensional nonlinear representation of data and classical linear statistical techniques are no longer applicable. In this paper we show how statistics of data in their latent space representation can be performed using techniques from the field of nonlinear manifold statistics. Nonlinear manifold statistics provide generalizations of Euclidean statistical notions including means, principal component analysis, and maximum likelihood fits of parametric probability distributions. We develop new techniques for maximum likelihood inference in latent space, and adress the computational complexity of using geometric algorithms with high-dimensional data by training a separate neural network to approximate the Riemannian metric and cometric tensor capturing the shape of the learned data manifold.
- May 22 2018 quant-ph arXiv:1805.07620v1Non-Hermitian singularities are ubiquitous in non-conservative open systems. These singularities are often points of measure zero in the eigenspectrum of the system which make them difficult to access without careful engineering. Despite that, they can remotely induce observable effects when some of the system's parameters are varied along closed trajectories in the parameter space. To date, a general formalism for describing this process beyond simple cases is still lacking. Here, we bridge this gap and develop a general approach for treating this problem by utilizing the power of permutation operators and representation theory. This in turn allows us to reveal the following surprising result which contradicts the common belief in the field: loops that enclose the same singularities starting from the same initial point and traveling in the same direction, do not necessarily share the same end outcome. Interestingly, we find that this equivalence can be formally established only by invoking the topological notion of homotopy. Our findings are general with far reaching implications in various fields ranging from photonics and atomic physics to microwaves and acoustics.
- Boosted decision trees enjoy popularity in a variety of applications; however, for large-scale datasets, the cost of training a decision tree in each round can be prohibitively expensive. Inspired by ideas from the multi-arm bandit literature, we develop a highly efficient algorithm for computing exact greedy-optimal decision trees, outperforming the state-of-the-art Quick Boost method. We further develop a framework for deriving lower bounds on the problem that applies to a wide family of conceivable algorithms for the task (including our algorithm and Quick Boost), and we demonstrate empirically on a wide variety of data sets that our algorithm is near-optimal within this family of algorithms. We also derive a lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this best-achievable lower bound.
- May 22 2018 math.CO arXiv:1805.07576v1A pair $(A,B)$ of square $(0,1)$-matrices is called a Lehman pair if $AB^T=J+kI$ for some integer $k\in\{-1,1,2,3,\ldots\}$, and the matrices $A$ and $B$ are called Lehman pair. This terminology arises because Lehman showed that the rows of minimum weight in any non-degenerate minimally nonideal (mni) matrix $M$ form a square Lehman submatrix of $M$. In this paper, we view a Lehman matrix as the bipartite adjacency matrix of a regular bipartite graph, focussing in particular on the case where the graph is cubic. From this perspective, we identify two constructions that generate cubic Lehman graphs from smaller Lehman graphs. The most prolific of these constructions involves repeatedly replacing suitable pairs of edges with a particular $6$-vertex subgraph that we call a $3$-rung ladder segment. Two decades ago, Lütolf and Margot initiated a computational study of mni matrices and constructed a catalogue containing (among other things) a listing of all cubic Lehman matrices with $k =1$ of order up to $17 \times 17$. We verify their catalogue (which has just one omission), and extend the computational results to $20 \times 20$ matrices. Of the $908$ cubic Lehman matrices (with $k=1$) of order up to $20 \times 20$, only two do not arise from our $3$-rung ladder construction. However these exceptions can be derived from our second construction, and so our two constructions cover all known cubic Lehman matrices with $k=1$.
- May 22 2018 cs.CV arXiv:1805.07548v1Deep learning stands at the forefront in many computer vision tasks. However, deep neural networks are usually data-hungry and require a huge amount of well-annotated training samples. Collecting sufficient annotated data is very expensive in many applications, especially for pixel-level prediction tasks such as semantic segmentation. To solve this fundamental issue, we consider a new challenging vision task, Internetly supervised semantic segmentation, which only uses Internet data with noisy image-level supervision of corresponding query keywords for segmentation model training. We address this task by proposing the following solution. A class-specific attention model unifying multiscale forward and backward convolutional features is proposed to provide initial segmentation "ground truth". The model trained with such noisy annotations is then improved by an online fine-tuning procedure. It achieves state-of-the-art performance under the weakly-supervised setting on PASCAL VOC2012 dataset. The proposed framework also paves a new way towards learning from the Internet without human interaction and could serve as a strong baseline therein. Code and data will be released upon the paper acceptance.
- May 22 2018 cs.CV arXiv:1805.07477v1Augmenting deep neural networks with skip connections, as introduced in the so called ResNet architecture, surprised the community by enabling the training of networks of more than 1000 layers with significant performance gains. It has been shown that identity skip connections eliminate singularities and improve the optimization landscape of the network. This paper deciphers ResNet by analyzing the of effect of skip connections in the backward path and sets forth new theoretical results on the advantages of identity skip connections in deep neural networks. We prove that the skip connections in the residual blocks facilitate preserving the norm of the gradient and lead to well-behaved and stable back-propagation, which is a desirable feature from optimization perspective. We also show that, perhaps surprisingly, as more residual blocks are stacked, the network becomes more norm-preserving. Traditionally, norm-preservation is enforced on the network only at beginning of the training, by using initialization techniques. However, we show that identity skip connection retain norm-preservation during the training procedure. Our theoretical arguments are supported by extensive empirical evidence. Can we push for more norm-preservation? We answer this question by proposing zero-phase whitening of the fully-connected layer and adding norm-preserving transition layers. Our numerical investigations demonstrate that the learning dynamics and the performance of ResNets can be improved by making it even more norm preserving through changing only a few blocks in very deep residual networks. Our results and the introduced modification for ResNet, referred to as Procrustes ResNets, can be used as a guide for studying more complex architectures such as DenseNet, training deeper networks, and inspiring new architectures.
- Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors.
- Neural networks have been learning complex multi-hop reasoning in various domains. One such formal setting for reasoning, logic, provides a challenging case for neural networks. In this article, we propose a Neural Inference Network (NIN) for learning logical inference over classes of logic programs. Trained in an end-to-end fashion NIN learns representations of normal logic programs, by processing them at a character level, and the reasoning algorithm for checking whether a logic program entails a given query. We define 12 classes of logic programs that exemplify increased level of complexity of the inference process (multi-hop and default reasoning) and show that our NIN passes 10 out of the 12 tasks. We also analyse the learnt representations of logic programs that NIN uses to perform the logical inference.
- We explore the possibility of using machine learning to identify interesting mathematical structures by using certain quantities that serve as fingerprints. In particular, we extract features from integer sequences using two empirical laws: Benford's law and Taylor's law and experiment with various classifiers to identify whether a sequence is nice, important, multiplicative, easy to compute or related to primes or palindromes.
- We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation. Specifically Universal Portfolio enjoys optimal regret $\mathcal{O}(N\ln T)$ for $N$ financial instruments over $T$ rounds, but requires log-concave sampling and has a large polynomial running time. Our algorithm, on the other hand, ensures a slightly larger but still logarithmic regret of $\mathcal{O}(N^2(\ln T)^4)$, and is based on the well-studied Online Mirror Descent framework with a novel regularizer that can be implemented via standard optimization methods in time $\mathcal{O}(TN^{2.5})$ per round. The regret of all other existing works is either polynomial in $T$ or has a potentially unbounded factor such as the inverse of the smallest price relative.
- May 21 2018 cond-mat.mes-hall cond-mat.mtrl-sci arXiv:1805.07314v1Although the richness of spatial symmetries has led to a rapidly expanding inventory of possible topological crystalline (TC) phases of electrons, physical realizations have been slow to materialize due to the practical difficulty to ascertaining band topology in realistic calculations. Here, we integrate the recently established theory of symmetry indicators of band topology into first-principle band-structure calculations, and test it on a databases of previously synthesized crystals. The combined algorithm is found to efficiently unearth topological materials and predict topological properties like protected surface states. On applying our algorithm to just 8 out of the 230 space groups, we already discover numerous materials candidates displaying a diversity of topological phenomena, which are simultaneously captured in a single sweep. The list includes recently proposed classes of TC insulators that had no previous materials realization as well as other topological phases, including: (i) a screw-protected 3D TC insulator, e̱ta-MoTe2, with gapped surfaces except for 1D helical "hinge" states; (ii) a rotation-protected TC insulator BiBr with coexisting surface Dirac cones and hinge states; (iii) non-centrosymmetric Z2 topological insulators undetectable using the well-established parity criterion, AgXO (X=Na,K,Rb); (iv) a Dirac semimetal MgBi2O6; (v) a Dirac nodal-line semimetal AgF2; and (vi) a metal with three-fold degenerate band crossing near the Fermi energy, AuLiMgSn. Our work showcases how the recent theoretical insights on the fundamentals of band structures can aid in the practical goal of discovering new topological materials.
- May 21 2018 math.RA arXiv:1805.07299v1We study the structure of the Lie algebra $\mathfrak{s}(n,\mathbb R)$ corresponding to the so-called stochastic Lie group $\mathcal{S} (n,\mathbb R)$. We obtain the Levi decomposition of the Lie algebra, classify Levi factor and classify the representation of the factor in $\mathbb{R}^n$. We discuss isomorphism of $\mathcal{S}(n,\mathbb R)$ with the group of invertible affine maps ${\it Aff}(n-1,\mathbb R)$. We prove that $\mathfrak s(n, \mathbb R)$ is generated by two generic elements.
- Learning and inference movement is a very challenging problem due to its high dimensionality and dependency to varied environments or tasks. In this paper, we propose an effective probabilistic method for learning and inference of basic movements. The motion planning problem is formulated as learning on a directed graphic model and deep generative model is used to perform learning and inference from demonstrations. An important characteristic of this method is that it flexibly incorporates the task descriptors and context information for long-term planning and it can be combined with dynamic systems for robot control. The experimental validations on robotic approaching path planning tasks show the advantages over the base methods with limited training data.
- May 21 2018 math.OC arXiv:1805.07248v1A novel class of derivative-free optimization algorithms is developed. The main idea is to utilize certain non-commutative maps in order to approximate the gradient of the objective function. Convergence properties of the novel algorithms are established and simulation examples are presented.
- May 21 2018 math.HO arXiv:1805.06932v1We present some episodes from the history of interactions between geometry and physics over the past century.
- Locality imposes stringent constraints on the spreading of information in nonrelativistic quantum systems, which is reminiscent of a "light-cone," a casual structure arising in their relativistic counterparts. Long-range interactions can potentially soften such constraints, allowing almost instantaneous long jumps of particles, thus defying causality. Since interactions decaying as a power-law with distance, $r^{-\alpha}$, are ubiquitous in nature, it is pertinent to understand what is the fate of causality and information spreading in such systems. Using a numerically exact technique we address these questions by studying the out-of-time-order correlation function of a representative generic system in one-dimension. We show that while the interactions are long-range, their effect on information spreading is asymptotically negligible as long as $\alpha>1$. In this range we find a complex compound behavior, where after a short transient a fully local behavior emerges, yielding asymptotic "light-cones" virtually indistinguishable from "light-cones" in corresponding local models. The long-range nature of the interaction is only expressed in the power-law leaking of information from the "light-cone," with the same exponent as the exponent of the interaction, $\alpha$. Our results directly imply that all previously obtained rigorous bounds on information spreading in long-range interacting systems are not tight, and thus could be improved.
- We consider the problem of uncertainty estimation in the context of (non-Bayesian) deep neural classification. All current methods are based on extracting uncertainty signals from a trained network optimized to solve the classification problem at hand. We demonstrate that such techniques tend to misestimate instances whose predictions are supposed to be highly confident. This deficiency is an artifact of the training process with SGD-like optimizers. Based on this observation, we develop an uncertainty estimation algorithm that "peels away" highly confident points sequentially and estimates their confidence using earlier snapshots of the trained model, before their uncertainty estimates are jittered. We present extensive experiments indicating that the proposed algorithm provides uncertainty estimates that are consistently better than the best known methods.
- It is important to learn classifiers under noisy labels due to their ubiquities. As noisy labels are corrupted from ground-truth labels by an unknown noise transition matrix, the accuracy of classifiers can be improved by estimating this matrix, without introducing either sample-selection or regularization biases. However, such estimation is often inexact, which inevitably degenerates the accuracy of classifiers. The inexact estimation is due to either a heuristic trick, or the brutal-force learning by deep networks under a finite dataset. In this paper, we present a human-assisted approach called "\textitmasking". The masking conveys human cognition of invalid class transitions, and naturally speculates the structure of the noise transition matrix. Given the structure information, we only learn the noise transition probability to reduce the estimation burden. To instantiate this approach, we derive a structure-aware probabilistic model, which incorporates a structure prior. During the model realization, we solve the challenges from structure extraction and alignment in principle. Empirical results on benchmark datasets with three noise structures show that, our approach can improve the robustness of classifiers significantly.
- May 22 2018 cs.CV arXiv:1805.08191v1We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder. The high-level decoder constructs a plan by generating a semantic concept (i.e., topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network, which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a flat deep reinforcement learning baseline.
- May 22 2018 cs.DM arXiv:1805.08186v1In 2010, A. Shpilka and I. Volkovich established a prominent result on the equivalence of polynomial factorization and identity testing. It follows from their result that a multilinear polynomial over the finite field of order 2 (a multilinear boolean polynomial) can be factored in time cubic in the size of the polynomial given as a string. We have later rediscovered this result and provided a simple factorization algorithm based on the computation of derivatives of multilinear boolean polynomials. The algorithm has been applied to solve problems of compact representation of various combinatorial structures, including boolean functions and relational data tables. In this paper, we improve the known cubic upper complexity bound of this factorization algorithm and report on a preliminary experimental analysis.
- May 22 2018 quant-ph cond-mat.stat-mech arXiv:1805.08184v1We analyze the role of indirect quantum measurements in work extraction from quantum systems in nonequilibrium states. In particular, we focus on the work that can be obtained by exploiting the correlations shared between the system of interest and an additional ancilla, where measurement backaction introduces a nontrivial thermodynamic tradeoff. We present optimal state-dependent protocols for extracting work from both classical and quantum correlations, the latter being measured by discord. We show that, while the work content of classical correlations can be fully extracted by performing local operations on the system of interest, the amount of work related to quantum discord requires a specific driving protocol that includes interaction between system and ancilla.
- May 22 2018 cs.CV arXiv:1805.08183v1Subspace clustering refers to the problem of segmenting high dimensional data drawn from a union of subspaces into the respective subspaces. In some applications, partial side-information to indicate "must-link" or "cannot-link" in clustering is available. This leads to the task of subspace clustering with side-information. However, in prior work the supervision value of the side-information for subspace clustering has not been fully exploited. To this end, in this paper, we present an enhanced approach for constrained subspace clustering with side-information, termed Constrained Sparse Subspace Clustering plus (CSSC+), in which the side-information is used not only in the stage of learning an affinity matrix but also in the stage of spectral clustering. Moreover, we propose to estimate clustering accuracy based on the partial side-information and discuss the potential connection to the true clustering accuracy. We conduct experiments on three cancer gene expression datasets to validate the effectiveness of our proposals.
- Predicting how Congressional legislators will vote is important for understanding their past and future behavior. However, previous work on roll-call prediction has been limited to single session settings, thus did not consider generalization across sessions. In this paper, we show that metadata is crucial for modeling voting outcomes in new contexts, as changes between sessions lead to changes in the underlying data generation process. We show how augmenting bill text with the sponsors' ideologies in a neural network model can achieve an average of a 4% boost in accuracy over the previous state-of-the-art.
- Reinforcement Learning (RL) algorithms can suffer from poor sample efficiency when rewards are delayed and sparse. We introduce a solution that enables agents to learn temporally extended actions at multiple levels of abstraction in a sample efficient and automated fashion. Our approach combines universal value functions and hindsight learning, allowing agents to learn policies belonging to different time scales in parallel. We show that our method significantly accelerates learning in a variety of discrete and continuous tasks.
- May 22 2018 math.GT arXiv:1805.08178v1We generalize the index polynomial invariant to the case of virtual tangles. Three polynomial invariants result from this generalization; we give a brief overview of their definition and some basic properties.
- May 22 2018 cs.CV arXiv:1805.08170v1We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly training an image-to-text generator (a.k.a. captionbot) and a text-to-image generator (a.k.a. drawingbot). The key idea behind the joint training is that image-to-text generation and text-to-image generation as dual problems can form a closed loop to provide informative feedback to each other. Based on such feedback, we introduce a new loss metric by comparing the original input with the output produced by the closed loop. In addition to the old loss metrics used in captionbot and drawingbot, this extra loss metric makes the jointly trained captionbot and drawingbot better than the separately trained captionbot and drawingbot. Furthermore, the turbo-learning approach enables semi-supervised learning since the closed loop can provide peudo-labels for unlabeled samples. Experimental results on the COCO dataset demonstrate that the proposed turbo learning can significantly improve the performance of both captionbot and drawingbot by a large margin.