- Mar 30 2017 quant-ph arXiv:1703.09835v1Previous analysis of randomized benchmarking assumed that experimental noise "weakly" depends on the target gate. We show that this condition is more restrictive than it initially appears, so much so that it is practically unverifiable. We then resolve this limitation by proving that the exact impact of gate-dependent noise can be described by a single perturbation term that decays exponentially with the sequence length. That is, the exact behavior of randomized benchmarking under general gate-dependent noise converges exponentially to a true exponential decay of exactly the same form as that predicted by previous analysis for gate-independent noise. Moreover, we show that the operational meaning of the decay rate for gate-dependent noise is essentially unchanged, that is, we show that it quantifies the average fidelity of the noise between ideal gates. We numerically demonstrate that our analysis is valid for realistic gate-dependent noise models.
- Mar 30 2017 quant-ph arXiv:1703.10133v1We investigate the relationship between the energy spectrum of a local Hamiltonian and the geometric properties of its ground state. By generalizing a standard framework from the analysis of Markov chains to arbitrary (non-stoquastic) Hamiltonians we are naturally led to see that the spectral gap can always be upper bounded by an isoperimetric ratio that depends only on the ground state probability distribution and the range of the terms in the Hamiltonian, but not on any other details of the interaction couplings. This means that for a given probability distribution the inequality constrains the spectral gap of any local Hamiltonian with this distribution as its ground state probability distribution in some basis (Eldar and Harrow derived a similar result [1] in order to characterize the output of low-depth quantum circuits). Going further, we relate the Hilbert space localization properties of the ground state to higher energy eigenvalues by showing that the presence of k strongly localized ground state modes (i.e. clusters of probability, or subsets with small expansion) in Hilbert space implies the presence of k energy eigenvalues that are close to the ground state energy. Our results suggest that quantum adiabatic optimization using local Hamiltonians will inevitably encounter small spectral gaps when attempting to prepare ground states corresponding to multi-modal probability distributions with strongly localized modes, and this problem cannot necessarily be alleviated with the inclusion of non-stoquastic couplings.
- Mar 30 2017 quant-ph arXiv:1703.09747v1We compute the expected randomized benchmarking sequence fidelity for a system subject to Gaussian time-correlated noise. For single qubit benchmarking we show that the expected sequence fidelity is given by the partition function of a long-range coupled spin-one Ising model, with each site in the Ising model corresponding to a free evolution interval. For d-state systems, the expected sequence fidelity is given by an Ising-like model partition function whose site variables are given by the weights of the adjoint representation of SU(d). A high effective temperature expansion for the partition function in the single qubit case shows decay of sequence fidelity varying from exponential for uncorrelated noise to a power law for quasistatic noise. Fitting an exponential to the sequence fidelity decay under correlated noise gives unreliable estimates of the average gate error rate.
- In this work we consider a quantum generalization of the task considered by Slepian and Wolf [1973] regarding distributed source compression. In our task Alice, Bob, Charlie and Referee share a joint pure state. Alice and Bob wish to send a part of their respective systems to Charlie without collaborating with each other. We give achievability bounds for this task in the one-shot setting and provide asymptotic analysis in the case when there is no side information with Charlie. Our result implies the result of Abeyesinghe, Devetak, Hayden and Winter [2009] who studied a special case of this problem. As another special case wherein Bob holds trivial registers, we recover the result of Devetak and Yard [2008] regarding quantum state redistribution.
- Mar 30 2017 quant-ph arXiv:1703.10134v1We define a discrete-time, coined quantum walk on weighted graphs that is inspired by Szegedy's quantum walk. Using this, we prove that many lackadaisical quantum walks, where each vertex has $l$ integer self-loops, can be generalized to a quantum walk where each vertex has a single self-loop of real-valued weight $l$. We apply this real-valued lackadaisical quantum walk to two problems. First, we analyze it on the line or one-dimensional lattice, showing that it is exactly equivalent to a continuous deformation of the three-state Grover walk with faster ballistic dispersion. Second, we generalize Grover's algorithm, or search on the complete graph, to have a weighted self-loop at each vertex, yielding an improved success probability when $l < 3 + 2\sqrt{2} \approx 5.828$.
- Mar 30 2017 quant-ph cond-mat.supr-con arXiv:1703.09762v1One of the largest obstacles to building a quantum computer is gate error, where the physical evolution of the state of a qubit or group of qubits during a gate operation does not match the intended unitary transformation. Gate error stems from a combination of control errors and random single qubit errors from interaction with the environment. While great strides have been made in mitigating control errors, intrinsic qubit error remains a serious problem that sets the primary limit for gate fidelity in modern superconducting qubit architectures. Simultaneously, recent developments of small error-corrected logical qubit devices promise significant increases in logical state lifetime, but translating those improvements into increases in gate fidelity is a complex challenge. In this Letter, we propose a new formalism for implementing gates on and between small logical qubit devices which inherit the parent device's tolerance to single qubit errors which occur at any time before or during the gate. Using a standard phenomenological noise model for superconducting qubits, we demonstrate a universal one- and two-qubit gate set with error rates an order of magnitude lower than those for equivalent operations on single qubits or pairs of qubits, running for the same total duration. The effective logical gate error rate in these models displays superlinear error reduction with linear increases in single qubit lifetime, proving that passive error correction is capable of increasing gate fidelity. These developments further suggest that incorporating small logical qubits into a measurement based code could substantially improve code performance.
- Mar 30 2017 cond-mat.str-el cond-mat.quant-gas arXiv:1703.10141v1It is commonly believed that in quantum Monte Carlo approaches to fermionic many- body problems, the infamous sign problem generically implies prohibitively large computational times for obtaining thermodynamic-limit quantities. We point out that for convergent Feynman diagrammatic series evaluated with the Monte Carlo algorithm of [Rossi, arXiv:1612.05184], the computational time increases only polynomially with the inverse error on thermodynamic-limit quantities.
- A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
- Mar 30 2017 quant-ph arXiv:1703.10118v1Amplitude amplification is one of primary tools in building algorithms for quantum computers. This technique develops key ideas of the Grover search algorithm. The original formulation by Grover has been reformulated in order to to make building blocks of the algorithm as generally as possible. Potentially useful modifications are connected with changing phases in the rotation operations and replacing the intermediate Hadamard transform with arbitrary unitary one. In addition, arbitrary initial distribution of the amplitudes may be prepared. There are practical problems, in which we may have \it a priori information about the database. We examine trade-off relations between measures of quantum coherence and the success probability in amplitude amplification processes. We try to understand how prior knowledge and other modifications of algorithm blocks should be exploited properly. As measures of coherence, the geometric coherence and the relative entropy of coherence are mainly considered. In terms of the relative entropy of coherence, complementarity relations with the success probability seem to be the most expository. The general relations presented are illustrated within several model scenarios of amplitude amplification process.
- Mar 30 2017 cs.CV arXiv:1703.10106v1We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. Finally a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on 3 datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to state-of-the-art is achieved on the smaller MSR Daily Activity 3D dataset.
- Mar 30 2017 math.RT arXiv:1703.10093v1This article surveys recent advances and future challenges in the $2$-representation theory of finitary $2$-categories with a particular emphasis on problems related to classification of various classes of $2$-representations.
- The recognition of actions from video sequences has many applications in health monitoring, assisted living, surveillance, and smart homes. Despite advances in sensing, in particular related to 3D video, the methodologies to process the data are still subject to research. We demonstrate superior results by a system which combines recurrent neural networks with convolutional neural networks in a voting approach. The gated-recurrent-unit-based neural networks are particularly well-suited to distinguish actions based on long-term information from optical tracking data; the 3D-CNNs focus more on detailed, recent information from video data. The resulting features are merged in an SVM which then classifies the movement. In this architecture, our method improves recognition rates of state-of-the-art methods by 14% on standard data sets.
- Mar 30 2017 cs.CV arXiv:1703.10155v1We present variational generative adversarial networks, a general learning framework that combines a variational auto-encoder with a generative adversarial network, for synthesizing images of fine-grained categories, such as faces of a specific person or objects in a category. Our approach models an image as a composition of label and latent attributes in a probabilistic model. By varying the fine-grained category label fed to the resulting generative model, we can generate images in a specific category by randomly drawn values on a latent attribute vector. The novelty of our approach comes from two aspects. Firstly, we propose to adopt a cross entropy loss for the discriminative and classifier network, but a mean discrepancy objective for the generative network. This kind of asymmetric loss function makes the training of the GAN more stable. Secondly, we adopt an encoder network to learn the relationship between the latent space and the real image space, and use pairwise feature matching to keep the structure of generated images. We experiment with natural images of faces, flowers, and birds, and demonstrate that the proposed models are capable of generating realistic and diverse samples with fine-grained category labels. We further show that our models can be applied to other tasks, such as image inpainting, super-resolution, and data augmentation for training better face recognition models.
- Mar 30 2017 cs.CL arXiv:1703.10152v1In comparison with document summarization on the articles from social media and newswire, argumentative zoning (AZ) is an important task in scientific paper analysis. Traditional methodology to carry on this task relies on feature engineering from different levels. In this paper, three models of generating sentence vectors for the task of sentence classification were explored and compared. The proposed approach builds sentence representations using learned embeddings based on neural network. The learned word embeddings formed a feature space, to which the examined sentence is mapped to. Those features are input into the classifiers for supervised classification. Using 10-cross-validation scheme, evaluation was conducted on the Argumentative-Zoning (AZ) annotated articles. The results showed that simply averaging the word vectors in a sentence works better than the paragraph to vector algorithm and by integrating specific cuewords into the loss function of the neural network can improve the classification performance. In comparison with the hand-crafted features, the word2vec method won for most of the categories. However, the hand-crafted features showed their strength on classifying some of the categories.
- Mar 30 2017 cs.CV arXiv:1703.10131v1It has been recently shown that neural networks can recover the geometric structure of a face from a single given image. A common denominator of most existing face geometry reconstruction methods is the restriction of the solution space to some low-dimensional subspace. While such a model significantly simplifies the reconstruction problem, it is inherently limited in its expressiveness. As an alternative, we propose an Image-to-Image translation network that maps the input image to a depth image and a facial correspondence map. This explicit pixel-based mapping can then be utilized to provide high quality reconstructions of diverse faces under extreme expressions. In the spirit of recent approaches, the network is trained only with synthetic data, and is then evaluated on "in-the-wild" facial images. Both qualitative and quantitative analyses demonstrate the accuracy and the robustness of our approach. As an additional analysis of the proposed network, we show that it can be used as a geometric constraint for facial image translation tasks.
- Mar 30 2017 math.CO arXiv:1703.10122v1We prove the following stability version of the edge isoperimetric inequality for the cube: any subset of the cube with average boundary degree within $K$ of the minimum possible is $\varepsilon $-close to a union of $L$ disjoint cubes, where $L \leq L(K,\varepsilon )$ is independent of the dimension. This extends a stability result of Ellis, and can viewed as a dimension-free version of Friedgut's junta theorem.
- The influence of the $k$'th coordinate on a Boolean function $f:\{0,1\}^n \rightarrow \{0,1\}$ is the probability that flipping $x_k$ changes the value $f(x)$. The total influence $I(f)$ is the sum of influences of the coordinates. The well-known `Junta Theorem' of Friedgut (1998) asserts that if $I(f) \leq M$, then $f$ can be $\epsilon$-approximated by a function that depends on $O(2^{M/\epsilon})$ coordinates. Friedgut's theorem has a wide variety of applications in mathematics and theoretical computer science. For a biased function with $E[f]=\mu$, the edge isoperimetric inequality on the cube implies that $I(f) \geq 2\mu \log(1/\mu)$. Kahn and Kalai (2006) asked, in the spirit of the Junta theorem, whether any $f$ such that $I(f)$ is within a constant factor of the minimum, can be $\epsilon \mu$-approximated by a DNF of a `small' size (i.e., a union of a small number of sub-cubes). We answer the question by proving the following structure theorem: If $I(f) \leq 2\mu(\log(1/\mu)+M)$, then $f$ can be $\epsilon \mu$-approximated by a DNF of size $2^{2^{O(M/\epsilon)}}$. The dependence on $M$ is sharp up to the constant factor in the double exponent.
- Mar 30 2017 cs.CV arXiv:1703.10114v1We propose a method for lossy image compression based on recurrent, convolutional neural networks that outperforms BPG (4:2:0 ), WebP, JPEG2000, and JPEG as measured by MS-SSIM. We introduce three improvements over previous research that lead to this state-of-the-art result. First, we show that training with a pixel-wise loss weighted by SSIM increases reconstruction quality according to several metrics. Second, we modify the recurrent architecture to improve spatial diffusion, which allows the network to more effectively capture and propagate image information through the network's hidden state. Finally, in addition to lossless entropy coding, we use a spatially adaptive bit allocation algorithm to more efficiently use the limited number of bits to encode visually complex image regions. We evaluate our method on the Kodak and Tecnick image sets and compare against standard codecs as well recently published methods based on deep neural networks.
- Mar 30 2017 cs.LG arXiv:1703.10094v1Generative Adversarial Net has shown its great ability in generating samples. The inverse mapping of generator also contains a great value. Some works have been developed to construct the inverse function of generator. However, the existing ways of training the inverse model of GANs have many shortcomings. In this paper, we propose a new approach of training the inverse model of generator by regarding a pre-trained generator as the decoder part of an autoencoder network. This model does not directly minimize the difference between original input and inverse output, but try to minimize the difference between the generated data by using original input and inverse output. This strategy overcome the difficulty in training a inverse model of a non one-to-one function. And the inverse mapping we learned can be directly used in image searching and processing.
- Clinical NLP has an immense potential in contributing to how clinical practice will be revolutionized by the advent of large scale processing of clinical records. However, this potential has remained largely untapped due to slow progress primarily caused by strict data access policies for researchers. In this paper, we discuss the concern for privacy and the measures it entails. We also suggest sources of less sensitive data. Finally, we draw attention to biases that can compromise the validity of empirical research and lead to socially harmful applications.
- Real-world artificial intelligence (AI) applications often require multiple agents to work in a collaborative effort. Efficient learning for intra-agent communication and coordination is an indispensable step towards general AI. In this paper, we take StarCraft combat game as the test scenario, where the task is to coordinate multiple agents as a team to defeat their enemies. To maintain a scalable yet effective communication protocol, we introduce a multiagent bidirectionally-coordinated network (BiCNet ['bIknet]) with a vectorised extension of actor-critic formulation. We show that BiCNet can handle different types of combats under diverse terrains with arbitrary numbers of AI agents for both sides. Our analysis demonstrates that without any supervisions such as human demonstrations or labelled data, BiCNet could learn various types of coordination strategies that is similar to these of experienced game players. Moreover, BiCNet is easily adaptable to the tasks with heterogeneous agents. In our experiments, we evaluate our approach against multiple baselines under different scenarios; it shows state-of-the-art performance, and possesses potential values for large-scale real-world applications.
- Autonomous drones (also known as unmanned aerial vehicles) have several advantages over ground vehicles, including agility, swiftness, and energy-efficiency, and hence are convenient for light-weight delivery and substitutions for manned missions in remote operations. It is expected that autonomous drones will be deployed in diverse applications in near future. Typical drones are electric vehicles, powered by on-board batteries. This paper presents several contributions for automated battery-operated drone management systems: (1) We conduct an empirical study to model the battery performance of drones, considering various flight scenarios. (2) We study a joint problem of flight tour planning with recharging optimization for drones with an objective to complete a tour mission for a set of sites of interest. This problem captures diverse applications of delivery and remote operations by drones. (3) We implemented our optimization algorithms in an intelligent drone management system.
- Mar 30 2017 cond-mat.soft arXiv:1703.09984v1Magic Sand, a hydrophobic toy granular material, is widely used in popular science instructions because of its non-intuitive mechanical properties. A detailed study of the failure of an underwater column of magic sand shows that these properties can be traced to a single phenomenon: the system self-generates a cohesive skin that encapsulates the material inside. The skin, consists of pinned air-water-grain interfaces, shows multi-scale mechanical properties: they range from contact-line dynamics in the intra-grain roughness scale, plastic flow at the grain scale, all the way to the sample-scale mechanical responses. With decreasing rigidity of the skin, the failure mode transforms from brittle to ductile (both of which are collective in nature) to a complete disintegration at the single grain scale.
- Mar 30 2017 cs.CV arXiv:1703.09983v1The aim of fine-grained recognition is to identify sub-ordinate categories in images like different species of birds. Existing works have confirmed that, in order to capture the subtle differences across the categories, automatic localization of objects and parts is critical. Most approaches for object and part localization relied on the bottom-up pipeline, where thousands of region proposals are generated and then filtered by pre-trained object/part models. This is computationally expensive and not scalable once the number of objects/parts becomes large. In this paper, we propose a nonparametric data-driven method for object and part localization. Given an unlabeled test image, our approach transfers annotations from a few similar images retrieved in the training set. In particular, we propose an iterative transfer strategy that gradually refine the predicted bounding boxes. Based on the located objects and parts, deep convolutional features are extracted for recognition. We evaluate our approach on the widely-used CUB200-2011 dataset and a new and large dataset called Birdsnap. On both datasets, we achieve better results than many state-of-the-art approaches, including a few using oracle (manually annotated) bounding boxes in the test images.
- We propose to leverage denoising autoencoder networks as priors to address image restoration problems. We build on the key observation that the output of an optimal denoising autoencoder is a local mean of the true data density, and the autoencoder error (the difference between the output and input of the trained autoencoder) is a mean shift vector. We use the magnitude of this mean shift vector, that is, the distance to the local mean, as the negative log likelihood of our natural image prior. For image restoration, we maximize the likelihood using gradient descent by backpropagating the autoencoder error. A key advantage of our approach is that we do not need to train separate networks for different image restoration tasks, such as non-blind deconvolution with different kernels, or super-resolution at different magnification factors. We demonstrate state of the art results for non-blind deconvolution and super-resolution using the same autoencoding prior.
- Mar 30 2017 cs.AI arXiv:1703.09962v1In this paper, we address the problem of how automated situation-awareness can be achieved by learning real-world situations from ubiquitously generated mobility data. Without semantic input about the time and space where situations take place, this turns out to be a fundamental challenging problem. Uncertainties also introduce technical challenges when data is generated in irregular time intervals, being mixed with noise, and errors. Purely relying on temporal patterns observable in mobility data, in this paper, we propose Spaceprint, a fully automated algorithm for finding the repetitive pattern of similar situations in spaces. We evaluate this technique by showing how the latent variables describing the category, and the actual identity of a space can be discovered from the extracted situation patterns. Doing so, we use different real-world mobility datasets with data about the presence of mobile entities in a variety of spaces. We also evaluate the performance of this technique by showing its robustness against uncertainties.
- In string-derived supergravity theory, Kähler metric of chiral matter fields often has a pole. Such Kähler metric is interesting from the viewpoint of the framework of the pole inflation, where the scalar potential can be stretched out to be flat around the pole for a canonically normalized field and inflation can be realized. However, when Kähler metric has a pole, the scalar potential can also have a pole at the same point in supergravity theory. We study such supergravity models with the pole, and provide numerical analysis of inflationary dynamics and resultant density perturbation. We also examine attractor behavior of our model.
- Mar 30 2017 math.CO arXiv:1703.09946v1We say a family of sets is intersecting if any two of its sets intersect, and we say it is trivially intersecting if there is an element which appears in every set of the family. In this paper we study the maximum size of a non-trivially intersecting family in a natural "multi-part" setting. Here the ground set is divided into parts, and one considers families of sets whose intersection with each part is of a prescribed size. Our work is motivated by classical results in the single-part setting due to Erdős, Ko and Rado, and Hilton and Milner, and by a theorem of Frankl concerning intersecting families in this multi-part setting. In the case where the part sizes are sufficiently large we determine the maximum size of a non-trivially intersecting multi-part family, disproving a conjecture of Alon and Katona. A special case of our result can also be interpreted as a stability theorem for a problem of Greenwell and Lovász.
- Mar 30 2017 cs.LG arXiv:1703.09938v1Analyzing multivariate time series data is important for many applications such as automated control, fault diagnosis and anomaly detection. One of the key challenges is to learn latent features automatically from dynamically changing multivariate input. In visual recognition tasks, convolutional neural networks (CNNs) have been successful to learn generalized feature extractors with shared parameters over the spatial domain. However, when high-dimensional multivariate time series is given, designing an appropriate CNN model structure becomes challenging because the kernels may need to be extended through the full dimension of the input volume. To address this issue, we present two structure learning algorithms for deep CNN models. Our algorithms exploit the covariance structure over multiple time series to partition input volume into groups. The first algorithm learns the group CNN structures explicitly by clustering individual input sequences. The second algorithm learns the group CNN structures implicitly from the error backpropagation. In experiments with two real-world datasets, we demonstrate that our group CNNs outperform existing CNN based regression methods.
- Variable stiffness actuators undergo lower peak force in contacts compared to their rigid counterparts, and are thus safer for human-robot interaction. Furthermore, they can store energy in their elastic element and can release it later to achieve human-like dynamic movements. However, it is not clear how to integrate them in teleoperator systems so that they can be controlled intuitively by a human. We performed an experiment to study human use of elastic tools to determine how a teleoperator system with an elastic slave would need to be designed. For this, we had 13 untrained participants hammer with an elastic tool under different stiffness conditions, asking them to try to find the best timing for a backward-forward swing motion in order to achieve the strongest impact. We found that the participants generally executed the task efficiently after a few trials and they converged to very similar solutions. The stiffness influenced the performance slightly, a stiffness between 2.3 Nm/rad and 4.1 Nm/rad showing the best results. We conclude that humans intuitively know how to efficiently use elastic tools for hammering type tasks. This could facilitate the control of teleoperator systems with elastic slave manipulators for tasks requiring explosive movements like hammering.
- Mar 30 2017 cs.CV arXiv:1703.09933v1Lifelogging is a process of collecting rich source of information about daily life of people. In this paper, we introduce the problem of sentiment analysis in egocentric events focusing on the moments that compose the images recalling positive, neutral or negative feelings to the observer. We propose a method for the classification of the sentiments in egocentric pictures based on global and semantic image features extracted by Convolutional Neural Networks. We carried out experiments on an egocentric dataset, which we organized in 3 classes on the basis of the sentiment that is recalled to the user (positive, negative or neutral).
- Mar 30 2017 cs.CV arXiv:1703.09928v1Understanding semantic similarity among images is the core of a wide range of computer vision applications. An important step towards this goal is to collect and learn human perceptions. Interestingly, the semantic context of images is often ambiguous as images can be perceived with emphasis on different aspects, which may be contradictory to each other. In this paper, we present a method for learning the semantic similarity among images, inferring their latent aspects and embedding them into multi-spaces corresponding to their semantic aspects. We consider the multi-embedding problem as an optimization function that evaluates the embedded distances with respect to the qualitative clustering queries. The key idea of our approach is to collect and embed qualitative measures that share the same aspects in bundles. To ensure similarity aspect sharing among multiple measures, image classification queries are presented to, and solved by users. The collected image clusters are then converted into bundles of tuples, which are fed into our bundle optimization algorithm that jointly infers the aspect similarity and multi-aspect embedding. Extensive experimental results show that our approach significantly outperforms state-of-the-art multi-embedding approaches on various datasets, and scales well for large multi-aspect similarity measures.
- Mar 30 2017 cs.CV arXiv:1703.09913v1This paper presents a method for assessing skill of performance from video, for a variety of tasks, ranging from drawing to surgery and rolling dough. We formulate the problem as pairwise and overall ranking of video collections, and propose a supervised deep ranking model to learn discriminative features between pairs of videos exhibiting different amounts of skill. We utilise a two-stream Temporal Segment Network to capture both the type and quality of motions and the evolving task state. Results demonstrate our method is applicable to a variety of tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 82% for four datasets. We demonstrate the robustness of our approach via sensitivity analysis of its parameters. We see this work as effort toward the automated and objective organisation of how-to videos and overall, generic skill determination in video.
- Mar 30 2017 cs.CV arXiv:1703.09912v1While deep learning methods have achieved state-of-the-art performance in many challenging inverse problems like image inpainting and super-resolution, they invariably involve problem-specific training of the networks. Under this approach, different problems require different networks. In scenarios where we need to solve a wide variety of problems, e.g., on a mobile camera, it is inefficient and costly to use these specially-trained networks. On the other hand, traditional methods using signal priors can be used in all linear inverse problems but often have worse performance on challenging tasks. In this work, we provide a middle ground between the two kinds of methods --- we propose a general framework to train a single deep neural network that solves arbitrary linear inverse problems. The proposed network acts as a proximal operator for an optimization algorithm and projects non-image signals onto the set of natural images defined by the decision boundary of a classifier. In our experiments, the proposed framework demonstrates superior performance over traditional methods using a wavelet sparsity prior and achieves comparable performance of specially-trained networks on tasks including compressive sensing and pixel-wise inpainting.
- Mar 30 2017 cs.CV arXiv:1703.09911v1In this paper, we propose a novel approach for learning multi-label classifiers with the help of privileged information. Specifically, we use similarity constraints to capture the relationship between available information and privileged information, and use ranking constraints to capture the dependencies among multiple labels. By integrating similarity constraints and ranking constraints into the learning process of classifiers, the privileged information and the dependencies among multiple labels are exploited to construct better classifiers during training. A maximum margin classifier is adopted, and an efficient learning algorithm of the proposed method is also developed. We evaluate the proposed method on two applications: multiple object recognition from images with the help of implicit information about object importance conveyed by the list of manually annotated image tags; and multiple facial action unit detection from low-resolution images augmented by high-resolution images. Experimental results demonstrate that the proposed method can effectively take full advantage of privileged information and dependencies among multiple labels for better object recognition and better facial action unit detection.
- This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them.
- Semantic segmentation requires a detailed labeling of image pixels by object category. Information derived from local image patches is necessary to describe the detailed shape of individual objects. However, this information is ambiguous and can result in noisy labels. Global inference of image content can instead capture the general semantic concepts present. We advocate that holistic inference of image concepts provides valuable information for detailed pixel labeling. We propose a generic framework to leverage holistic information in the form of a LabelBank for pixel-level segmentation. We show the ability of our framework to improve semantic segmentation performance in a variety of settings. We learn models for extracting a holistic LabelBank from visual cues, attributes, and/or textual descriptions. We demonstrate improvements in semantic segmentation accuracy on standard datasets across a range of state-of-the-art segmentation architectures and holistic inference approaches.
- This paper proposes a direct coupling coherent quantum observer for a quantum plant which consists of a two level quantum system. The quantum observer, which is a quantum harmonic oscillator, includes homodyne detection measurements. It is shown that the observer can be designed so that it does not affect the quantum variable of interest in the quantum plant and that measured output converges in a given sense to the plant variable of interest. Also, the plant variable of interest-observer system can be described by a set of linear quantum stochastic differential equations. A minimum variance unbiased estimator form of the Kalman filter is derived for linear quantum systems and applied to the direct coupled coherent quantum observer.
- Mar 30 2017 cs.SY arXiv:1703.09877v1In this paper, we study the robust consensus problem for a set of discrete-time linear agents to coordinate over an uncertain communication network, which is to achieve consensus against the transmission errors and noises resulted from the information exchange between the agents. We model the network by means of communication links subject to multiplicative stochastic uncertainties, which are susceptible to describing packet dropout, random delay, and fading phenomena. Different communication topologies, such as undirected graphs and leader-follower graphs, are considered. We derive sufficient conditions for robust consensus in the mean square sense. This results unveil intrinsic constraints on consensus attainment imposed by the network synchronizability, the unstable agent dynamics, and the channel uncertainty variances. Consensus protocols are designed based on the state information transmitted over the uncertain channels, by solving a modified algebraic Riccati equation.
- Mar 30 2017 astro-ph.GA arXiv:1703.09874v1We present a study of the trade-off between depth and resolution using a large number of U-band imaging observations in the GOODS-North field (Giavalisco et al. 2004) from the Large Binocular Camera (LBC) on the Large Binocular Telescope (LBT). Having acquired over 30 hours of data (315 images with 5-6 mins exposures), we generated multiple image mosaics, starting with the best atmospheric seeing images (FWHM $\lesssim$0.8"), which constitute $\sim$10% of the total data set. For subsequent mosaics, we added in data with larger seeing values until the final, deepest mosaic included all images with FWHM $\lesssim$1.8" ($\sim$94% of the total data set). From the mosaics, we made object catalogs to compare the optimal-resolution, yet shallower image to the lower-resolution but deeper image. We show that the number counts for both images are $\sim$90% complete to $U_{AB}$ $\lesssim26$. Fainter than $U_{AB}$$\sim$ 27, the object counts from the optimal-resolution image start to drop-off dramatically (90% between $U_{AB}$ = 27 and 28 mag), while the deepest image with better surface-brightness sensitivity ($\mu^{AB}_{U}$$\lesssim$ 32 mag arcsec$^{-2}$) show a more gradual drop (10% between $U_{AB}$ $\simeq$ 27 and 28 mag). For the brightest galaxies within the GOODS-N field, structure and clumpy features within the galaxies are more prominent in the optimal-resolution image compared to the deeper mosaics. Finally, we find - for 220 brighter galaxies with $U_{AB}$$\lesssim$ 24 mag - only marginal differences in total flux between the optimal-resolution and lower-resolution light-profiles to $\mu^{AB}_{U}$$\lesssim$ 32 mag arcsec$^{-2}$. In only 10% of the cases are the total-flux differences larger than 0.5 mag. This helps constrain how much flux can be missed from galaxy outskirts, which is important for studies of the Extragalactic Background Light.
- Mar 30 2017 math.CO arXiv:1703.09873v1A packing $k$-coloring of a graph $G$ is a partition of $V(G)$ into sets $V_1,\ldots,V_k$ such that for each $1\leq i\leq k$ the distance between any two distinct $x,y\in V_i$ is at least $i+1$. The packing chromatic number, $\chi_p(G)$, of a graph $G$ is the minimum $k$ such that $G$ has a packing $k$-coloring. Sloper showed that there are $4$-regular graphs with arbitrarily large packing chromatic number. The question whether the packing chromatic number of subcubic graphs is bounded appears in several papers. We answer this question in the negative. Moreover, we show that for every fixed $k$ and $g\geq 2k+2$, almost every $n$-vertex cubic graph of girth at least $g$ has the packing chromatic number greater than $k$.
- Mar 30 2017 cs.NE arXiv:1703.09865v1This paper studies improving solvers based on their past solving experiences, and focuses on improving solvers by offline training. Specifically, the key issues of offline training methods are discussed, and research belonging to this category but from different areas are reviewed in a unified framework. Existing training methods generally adopt a two-stage strategy in which selecting the training instances and training instances are treated in two independent phases. This paper proposes a new training method, dubbed LiangYi, which addresses these two issues simultaneously. LiangYi includes a training module for a population-based solver and an instance sampling module for updating the training instances. The idea behind LiangYi is to promote the population-based solver by training it (with the training module) to improve its performance on those instances (discovered by the sampling module) on which it performs badly, while keeping the good performances obtained by it on previous instances. An instantiation of LiangYi on the Travelling Salesman Problem is also proposed. Empirical results on a huge testing set containing 10000 instances showed LiangYi could train solvers that perform significantly better than the solvers trained by other state-of-the-art training method. Moreover, empirical investigation of the behaviours of LiangYi confirmed it was able to continuously improve the solver through training.
- Mar 30 2017 cs.CV arXiv:1703.09859v1We motivate and address a human-in-the-loop variant of the monocular viewpoint estimation task in which the location and class of one semantic object keypoint is available at test time. In order to leverage the keypoint information, we devise a Convolutional Neural Network called Click-Here CNN (CH-CNN) that integrates the keypoint information with activations from the layers that process the image. It transforms the keypoint information into a 2D map that can be used to weigh features from certain parts of the image more heavily. The weighted sum of these spatial features is combined with global image features to provide relevant information to the prediction layers. To train our network, we collect a novel dataset of 3D keypoint annotations on thousands of CAD models, and synthetically render millions of images with 2D keypoint information. On test instances from PASCAL 3D+, our model achieves a mean class accuracy of 90.7%, whereas the state-of-the-art baseline only obtains 85.7% accuracy, justifying our argument for human-in-the-loop inference.
- Assume we are given a sum of linear measurements of $s$ different rank-$r$ matrices of the form $y = \sum_{k=1}^{s} \mathcal{A}_k ({X}_k)$. When and under which conditions is it possible to extract (demix) the individual matrices ${X}_k$ from the single measurement vector ${y}$? And can we do the demixing numerically efficiently? We present two computationally efficient algorithms based on hard thresholding to solve this low rank demixing problem. We prove that under suitable conditions these algorithms are guaranteed to converge to the correct solution at a linear rate. We discuss applications in connection with quantum tomography and the Internet-of-Things. Numerical simulations demonstrate empirically the performance of the proposed algorithms.
- Privacy directly concerns the user as the data owner (data- subject) and hence privacy in systems should be implemented in a manner which concerns the user (user-centered). There are many concepts and guidelines that support development of privacy and embedding privacy into systems. However, none of them approaches privacy in a user- centered manner. Through this research we propose a framework that would enable developers and designers to grasp privacy in a user-centered manner and implement it along with the software development life cycle.
- Mar 30 2017 cs.LG arXiv:1703.09844v1This paper studies convolutional networks that require limited computational resources at test time. We develop a new network architecture that performs on par with state-of-the-art convolutional networks, whilst facilitating prediction in two settings: (1) an anytime-prediction setting in which the network's prediction for one example is progressively updated, facilitating the output of a prediction at any time; and (2) a batch computational budget setting in which a fixed amount of computation is available to classify a set of examples that can be spent unevenly across 'easier' and 'harder' examples. Our network architecture uses multi-scale convolutions and progressively growing feature representations, which allows for the training of multiple classifiers at intermediate layers of the network. Experiments on three image-classification datasets demonstrate the efficacy of our architecture, in particular, when measured in terms of classification accuracy as a function of the amount of compute available.
- Mar 30 2017 cs.LG arXiv:1703.09842v1We address the problem of inverse reinforcement learning in Markov decision processes where the agent is risk-sensitive. In particular, we model risk-sensitivity in a reinforcement learning framework by making use of models of human decision-making having their origins in behavioral psychology, behavioral economics, and neuroscience. We propose a gradient-based inverse reinforcement learning algorithm that minimizes a loss function defined on the observed behavior. We demonstrate the performance of the proposed technique on two examples, the first of which is the canonical Grid World example and the second of which is a Markov decision process modeling passengers decisions regarding ride-sharing. In the latter, we use pricing and travel time data from a ride-sharing company to construct the transition probabilities and rewards of the Markov decision process.
- Previous theoretical work on deep learning and neural network optimization tend to focus on avoiding saddle points and local minima. However, the practical observation is that, at least for the most successful Deep Convolutional Neural Networks (DCNNs) for visual processing, practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a small degree of "overparametrization". In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The zero-minimizers -- in the case of classification -- have a non-zero margin. The same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the zero-minimizer with larger margin, as discussed in Theory III (to be released). We further experimentally explored and visualized the landscape of empirical risk of a DCNN on CIFAR-10 during the entire training process and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of DCNN's empirical loss surface, which might not be as complicated as people commonly believe.
- We tackle a task where an agent learns to navigate in a 2D maze-like environment called XWORLD. In each session, the agent perceives a sequence of raw-pixel frames, a natural language command issued by a teacher, and a set of rewards. The agent learns the teacher's language from scratch in a grounded and compositional manner, such that after training it is able to correctly execute zero-shot commands: 1) the combination of words in the command never appeared before, and/or 2) the command contains new object concepts that are learned from another task but never learned from navigation. Our deep framework for the agent is trained end to end: it learns simultaneously the visual representations of the environment, the syntax and semantics of the language, and the action module that outputs actions. The zero-shot learning capability of our framework results from its compositionality and modularity with parameter tying. We visualize the intermediate outputs of the framework, demonstrating that the agent truly understands how to solve the problem. We believe that our results provide some preliminary insights on how to train an agent with similar abilities in a 3D environment.
- Mar 30 2017 cs.SI arXiv:1703.09828v1Background: Over the past few decades, numerous forecasting methods have been proposed in the field of epidemic forecasting. Such methods can be classified into different categories such as deterministic vs. probabilistic, comparative methods vs. generative methods, and so on. In some of the more popular comparative methods, researchers compare observed epidemiological data from early stages of an outbreak with the output of proposed models to forecast the future trend and prevalence of the pandemic. A significant problem in this area is the lack of standard well-defined evaluation measures to select the best algorithm among different ones, as well as for selecting the best possible configuration for a particular algorithm. Results: In this paper, we present an evaluation framework which allows for combining different features, error measures, and ranking schema to evaluate forecasts. We describe the various epidemic features (Epi-features) included to characterize the output of forecasting methods and provide suitable error measures that could be used to evaluate the accuracy of the methods with respect to these Epi-features. We focus on long-term predictions rather than short-term forecasting and demonstrate the utility of the framework by evaluating six forecasting methods for predicting influenza in the United States. Our results demonstrate that different error measures lead to different rankings even for a single Epi-feature. Further, our experimental analyses show that no single method dominates the rest in predicting all Epi-features, when evaluated across error measures. As an alternative, we provide various consensus ranking schema that summarizes individual rankings, thus accounting for different error measures. We believe that a comprehensive evaluation framework, as presented in this paper, will add value to the computational epidemiology community.