Jun 27 2017 cs.LG
Do GANS (Generative Adversarial Nets) actually learn the target distribution? The foundational paper of (Goodfellow et al 2014) suggested they do, if they were given sufficiently large deep nets, sample size, and computation time. A recent theoretical analysis in Arora et al (to appear at ICML 2017) raised doubts whether the same holds when discriminator has finite size. It showed that the training objective can approach its optimum value even if the generated distribution has very low support ---in other words, the training objective is unable to prevent mode collapse. The current note reports experiments suggesting that such problems are not merely theoretical. It presents empirical evidence that well-known GANs approaches do learn distributions of fairly low support, and thus presumably are not learning the target distribution. The main technical contribution is a new proposed test, based upon the famous birthday paradox, for estimating the support size of the generated distribution.
We propose a simple and efficient approach to learning sparse models. Our approach consists of (1) projecting the data into a lower dimensional space, (2) learning a dense model in the lower dimensional space, and then (3) recovering the sparse model in the original space via compressive sensing. We apply this approach to Non-negative Matrix Factorization (NMF), tensor decomposition and linear classification---showing that it obtains $10\times$ compression with negligible loss in accuracy on real data, and obtains up to $5\times$ speedups. Our main theoretical contribution is to show the following result for NMF: if the original factors are sparse, then their projections are the sparsest solutions to the projected NMF problem. This explains why our method works for NMF and shows an interesting new property of random projections: they can preserve the solutions of non-convex optimization problems such as NMF.
Jun 27 2017 cs.CC
We describe a communication game, and a conjecture about this game, whose proof would imply the well-known Sensitivity Conjecture asserting a polynomial relation between sensitivity and block sensitivity for Boolean functions. The author defined this game and observed the connection in Dec. 2013 - Jan. 2014. The game and connection were independently discovered by Gilmer, Koucký, and Saks, who also established further results about the game (not proved by us) and published their results in ITCS '15 [GKS15]. This note records our independent work, including some observations that did not appear in [GKS15]. Namely, the main conjecture about this communication game would imply not only the Sensitivity Conjecture, but also a stronger hypothesis raised by Chung, Füredi, Graham, and Seymour [CFGS88]; and, another related conjecture we pose about a "query-bounded" variant of our communication game would suffice to answer a question of Aaronson, Ambainis, Balodis, and Bavarian [AABB14] about the query complexity of the "Weak Parity" problem---a question whose resolution was previously shown by [AABB14] to follow from a proof of the Chung et al. hypothesis.
Jun 27 2017 cs.CR
Attack trees are a popular way to represent and evaluate potential security threats on systems or infrastructures. The goal of this work is to provide a framework allowing to express and check whether an attack tree is consistent with the analyzed system. We model real systems using transition systems and introduce attack trees with formally specified node labels. We formulate the correctness properties of an attack tree with respect to a system and study the complexity of the corresponding decision problems. The proposed framework can be used in practice to assist security experts in manual creation of attack trees and enhance development of tools for automated generation of attack trees.
A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, all learned without any human supervision! In this paper, using a Task and Tell reference game between two agents as a testbed, we present a sequence of 'negative' results culminating in a 'positive' one -- showing that while most agent-invented languages are effective (i.e. achieve near-perfect task rewards), they are decidedly not interpretable or compositional. In essence, we find that natural language does not emerge 'naturally', despite the semblance of ease of natural-language-emergence that one may gather from recent literature. We discuss how it is possible to coax the invented languages to become more and more human-like and compositional by increasing restrictions on how two agents may communicate.
Jun 27 2017 cs.LG
Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent that has an individual learning rate for both the discriminator and the generator. We prove that the TTUR converges under mild assumptions to a stationary Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs, improved Wasserstein GANs, and BEGANs, outperforming conventional GAN training on CelebA, Billion Word Benchmark, and LSUN bedrooms.
This paper presents a margin-based multiclass generalization bound for neural networks which scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network on the mnist and cifar10 datasets, with both original and random labels, where it tightly correlates with the observed excess risks.
Current grammar-based NeuroEvolution approaches have several shortcomings. On the one hand, they do not allow the generation of Artificial Neural Networks (ANNs composed of more than one hidden-layer. On the other, there is no way to evolve networks with more than one output neuron. To properly evolve ANNs with more than one hidden-layer and multiple output nodes there is the need to know the number of neurons available in previous layers. In this paper we introduce Dynamic Structured Grammatical Evolution (DSGE): a new genotypic representation that overcomes the aforementioned limitations. By enabling the creation of dynamic rules that specify the connection possibilities of each neuron, the methodology enables the evolution of multi-layered ANNs with more than one output neuron. Results in different classification problems show that DSGE evolves effective single and multi-layered ANNs, with a varying number of output neurons.
Jun 27 2017 cs.CV
Data association problems are an important component of many computer vision applications, with multi-object tracking being one of the most prominent examples. A typical approach to data association involves finding a graph matching or network flow that minimizes a sum of pairwise association costs, which are often either hand-crafted or learned as linear functions of fixed features. In this work, we demonstrate that it is possible to learn features for network-flow-based data association via backpropagation, by expressing the optimum of a smoothed network flow problem as a differentiable function of the pairwise association costs. We apply this approach to multi-object tracking with a network flow formulation. Our experiments demonstrate that we are able to successfully learn all cost functions for the association problem in an end-to-end fashion, which outperform hand-crafted costs in all settings. The integration and combination of various sources of inputs becomes easy and the cost functions can be learned entirely from data, alleviating tedious hand-designing of costs.
Jun 27 2017 cs.CV
Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Despite saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, no model has yet succeeded in effectively incorporating these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We demonstrate, through extensive quantitative and qualitative experiments on large scale datasets, that our model achieves superior performances with respect to different image captioning baselines with and without saliency.
Jun 27 2017 cs.HC
Modern manufacturing systems typically require high degrees of flexibility, in terms of ability to customize the production lines to the constantly changing market requests. For this purpose, manufacturing systems are required to be able to cope with changes in the types of products, and in the size of the production batches. As a consequence, the human-machine interfaces (HMIs) are typically very complex, and include a wide range of possible operational modes and commands. This generally implies an unsustainable cognitive workload for the human operators, in addition to a non-negligible training effort. To overcome this issue, in this paper we present a methodology for the design of adaptive human-centred HMIs for industrial machines and robots. The proposed approach relies on three pillars: measurement of user's capabilities, adaptation of the information presented in the HMI, and training of the user. The results expected from the application of the proposed methodology are investigated in terms of increased customization and productivity of manufacturing processes, and wider acceptance of automation technologies. The proposed approach has been devised in the framework of the European project INCLUSIVE.
Jun 27 2017 cs.CV
Awareness of the road scene is an essential component for both autonomous vehicles and Advances Driver Assistance Systems and is gaining importance both for the academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M couples of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep-network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that is able to generalize on real-world data despite having been trained solely on synthetic ones.
Colombia has a privileged geographical location which makes it a cornerstone and equidistant point to all regional markets. The country has a great ecological diversity and it is one of the largest suppliers of flowers for US. Colombian flower companies have made innovations in the marketing process, using methods to reach all conditions for final consumers. This article develops a monitoring system for floriculture industries. The system was implemented in a robotic platform. This device has the ability to be programmed in different programming languages. The robot takes the necessary environment information from its camera. The algorithm of the monitoring system was developed with the image processing toolbox on Matlab. The implemented algorithm acquires images through its camera, it performs a preprocessing of the image, noise filter, enhancing of the color and adjusting the dimension in order to increase processing speed. Then, the image is segmented by color and with the binarized version of the image using morphological operations (erosion and dilation), extract relevant features such as centroid, perimeter and area. The data obtained from the image processing helps the robot with the automatic identification of objectives, orientation and move towards them. Also, the results generate a diagnostic quality of each object scanned.
We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to $n$, the number of coordinates. In many practical applications, our scheme can be implemented at no extra cost and computational efficiency very close to the faster uniform selection. Numerical experiments with Lasso and Ridge regression show promising improvements, in line with our theoretical guarantees.
Jun 27 2017 cs.DC
Today, massive amounts of streaming data from smart devices need to be analyzed automatically to realize the Internet of Things. The Complex Event Processing (CEP) paradigm promises low-latency pattern detection on event streams. However, CEP systems need to be extended with Machine Learning (ML) capabilities such as online training and inference in order to be able to detect fuzzy patterns (e.g., outliers) and to improve pattern recognition accuracy during runtime using incremental model training. In this paper, we propose a distributed CEP system denoted as StreamLearner for ML-enabled complex event detection. The proposed programming model and data-parallel system architecture enable a wide range of real-world applications and allow for dynamically scaling up and out system resources for low-latency, high-throughput event processing. We show that the DEBS Grand Challenge 2017 case study (i.e., anomaly detection in smart factories) integrates seamlessly into the StreamLearner API. Our experiments verify scalability and high event throughput of StreamLearner.
Jun 27 2017 cs.DC
Performance of a distributed algorithm depends on environment in which the algorithm is executed. This is often modeled as a game between the algorithm and a conceptual adversary causing specific distractions. In this work we define a class of ordered adversaries, which cause distractions according to some partial order fixed by the adversary before the execution, and study how they affect performance of algorithms. For this purpose, we focus on well-known Do-All problem of performing t tasks on a shared channel consisting of p crash-prone stations. The channel restricts communication by the fact that no message is delivered to the alive stations if more than one station transmits at the same time. The most popular and meaningful performance measure for Do-All type of problems considered in the literature is work, defined as the total number of available processor steps during the whole execution. The question addressed in this work is how the ordered adversaries controlling crashes of stations influence work performance of Do-All algorithms. We provide two randomized distributed algorithms. The first solves Do-All with work O(t+p\sqrtt\log p) against the Linearly-Ordered adversary, restricted by some pre-defined linear order of crashing stations. Another algorithm is developed against the Weakly-Adaptive adversary, restricted by some pre-defined set of crash-prone stations, it can be seen as an ordered adversary with the order being an anti-chain consisting of crashing stations. The work done by this algorithm is O(t+p\sqrtt+p\minp/(p-f),t\log p). Both results are close to the corresponding lower bounds from [CKL]. We generalize this result to the class of adversaries restricted by partial order of f stations with maximum anti-chain of size k and complementary lower bound. We also consider a class of delayed adaptive adversaries, that is, who could see random choices with some delay. We give an algorithm that works efficiently against the 1-RD adversary, which could see random choices of stations with one round delay, achieving close to optimal O(t+p\sqrtt\log^2 p) work complexity. This shows that restricting adversary by even 1 round delay results in (almost) optimal work on a shared channel.
In this paper, we present a novel massively parallel algorithm for accelerating the decision tree building procedure on GPUs (Graphics Processing Units), which is a crucial step in Gradient Boosted Decision Tree (GBDT) and random forests training. Previous GPU based tree building algorithms are based on parallel multi-scan or radix sort to find the exact tree split, and thus suffer from scalability and performance issues. We show that using a histogram based algorithm to approximately find the best split is more efficient and scalable on GPU. By identifying the difference between classical GPU-based image histogram construction and the feature histogram construction in decision tree training, we develop a fast feature histogram building kernel on GPU with carefully designed computational and memory access sequence to reduce atomic update conflict and maximize GPU utilization. Our algorithm can be used as a drop-in replacement for histogram construction in popular tree boosting systems to improve their scalability. As an example, to train GBDT on epsilon dataset, our method using a main-stream GPU is 7-8 times faster than histogram based algorithm on CPU in LightGBM and 25 times faster than the exact-split finding algorithm in XGBoost on a dual-socket 28-core Xeon server, while achieving similar prediction accuracy.
Robots are expected to operate autonomously in dynamic environments. Understanding the underlying dynamic characteristics of objects is a key enabler for achieving this goal. In this paper, we propose a method for pointwise semantic classification of 3D LiDAR data into three classes: non-movable, movable and dynamic. We concentrate on understanding these specific semantics because they characterize important information required for an autonomous system. Non-movable points in the scene belong to unchanging segments of the environment, whereas the remaining classes corresponds to the changing parts of the scene. The difference between the movable and dynamic class is their motion state. The dynamic points can be perceived as moving, whereas movable objects can move, but are perceived as static. To learn the distinction between movable and non-movable points in the environment, we introduce an approach based on deep neural network and for detecting the dynamic points, we estimate pointwise motion. We propose a Bayes filter framework for combining the learned semantic cues with the motion cues to infer the required semantic classification. In extensive experiments, we compare our approach with other methods on a standard benchmark dataset and report competitive results in comparison to the existing state-of-the-art. Furthermore, we show an improvement in the classification of points by combining the semantic cues retrieved from the neural network with the motion cues.
Jun 27 2017 cs.CV
We present a method to jointly refine the geometry and semantic segmentation of 3D surface meshes. Our method alternates between updating the shape and the semantic labels. In the geometry refinement step, the mesh is deformed with variational energy minimization, such that it simultaneously maximizes photo-consistency and the compatibility of the semantic segmentations across a set of calibrated images. Label-specific shape priors account for interactions between the geometry and the semantic labels in 3D. In the semantic segmentation step, the labels on the mesh are updated with MRF inference, such that they are compatible with the semantic segmentations in the input images. Also, this step includes prior assumptions about the surface shape of different semantic classes. The priors induce a tight coupling, where semantic information influences the shape update and vice versa. Specifically, we introduce priors that favor (i) adaptive smoothing, depending on the class label; (ii) straightness of class boundaries; and (iii) semantic labels that are consistent with the surface orientation. The novel mesh-based reconstruction is evaluated in a series of experiments with real and synthetic data. We compare both to state-of-the-art, voxel-based semantic 3D reconstruction, and to purely geometric mesh refinement, and demonstrate that the proposed scheme yields improved 3D geometry as well as an improved semantic segmentation.
Jun 27 2017 cs.LG
We consider the problem of learning when obtaining the training labels is costly, which is usually tackled in the literature using active-learning techniques. These approaches provide strategies to choose the examples to label before or during training. These strategies are usually based on heuristics or even theoretical measures, but are not learned as they are directly used during training. We design a model which aims at \textitlearning active-learning strategies using a meta-learning setting. More specifically, we consider a pool-based setting, where the system observes all the examples of the dataset of a problem and has to choose the subset of examples to label in a single shot. Experiments show encouraging results.
Multi-label learning deals with training instances associated with multiple labels. Many common multi-label algorithms are to treat each label in a crisp manner, being either relevant or irrelevant to an instance, and such label can be called logical label. In contrast, we assume that there is a vector of numerical label behind each multi-label instance, and the numerical label can be treated as the indicator to judge whether the corresponding label is relevant or irrelevant to the instance. The approach we are proposing transforms multi-label problem into regression problem about numerical label. In order to explore the numerical label, one way is to extend the label space to a Euclidean space by mining the hidden label importance from the training examples. Such process of transforming logical labels into numerical labels is called Label Enhancement. Besides, we give three assumptions for numerical label of multi-label instance in this paper. Based on this, we propose an effective multi-label learning framework called MLL-LE, ie. Multi-Label Learning with Label Enhancement, which incorporates the regression loss and the three assumptions into a unified framework. Extensive experiments validate the effectiveness of MLL-LE framework.
Jun 27 2017 cs.HC
In this paper we consider the modern theory of the Bayesian brain from cognitive neurosciences in the light of recommender systems and expose potentials for our community. In particular, we elaborate on noisy user feedback and the thus resulting multicomponent user models, which have indeed a biological origin. In real user experiments we observe the impact of both factors directly in a repeated rating task along with recommendation. As a consequence, this contribution supports the plausibility of contemporary theories of mind in the context of recommender systems and can be understood as a solicitation to integrate ideas of cognitive neurosciences into our systems in order to further improve the prediction of human behaviour.
Jun 27 2017 cs.FL
Nested weighted automata (NWA) present a robust and convenient automata-theoretic formalism for quantitative specifications. Previous works have considered NWA that processed input words only in the forward direction. It is natural to allow the automata to process input words backwards as well, for example, to measure the maximal or average time between a response and the preceding request. We therefore introduce and study bidirectional NWA that can process input words in both directions. First, we show that bidirectional NWA can express interesting quantitative properties that are not expressible by forward-only NWA. Second, for the fundamental decision problems of emptiness and universality, we establish decidability and complexity results for the new framework which match the best-known results for the special case of forward-only NWA. Thus, for NWA, the increased expressiveness of bidirectionality is achieved at no additional computational complexity. This is in stark contrast to the unweighted case, where bidirectional finite automata are no more expressive but exponentially more succinct than their forward-only counterparts.
Jun 27 2017 cs.CV
Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in the skeletal data. The proposed work extends this idea to spatial domain as well as temporal domain to better analyze the hidden sources of action-related information within the human skeleton sequences in both of these domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure based traversal framework is also proposed. In order to deal with the noise in the skeletal data, a new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.
Jun 27 2017 cs.CV
Automatic photo adjustment is to mimic the photo retouching style of professional photographers and automatically adjust photos to the learned style. There have been many attempts to model the tone and the color adjustment globally with low-level color statistics. Also, spatially varying photo adjustment methods have been studied by exploiting high-level features and semantic label maps. Those methods are semantics-aware since the color mapping is dependent on the high-level semantic context. However, their performance is limited to the pre-computed hand-crafted features and it is hard to reflect user's preference to the adjustment. In this paper, we propose a deep neural network that models the semantics-aware photo adjustment. The proposed network exploits bilinear models that are the multiplicative interaction of the color and the contexual features. As the contextual features we propose the semantic adjustment map, which discovers the inherent photo retouching presets that are applied according to the scene context. The proposed method is trained using a robust loss with a scene parsing task. The experimental results show that the proposed method outperforms the existing method both quantitatively and qualitatively. The proposed method also provides users a way to retouch the photo by their own likings by giving customized adjustment maps.
Jun 27 2017 cs.DB
The execution logs that are used for process mining in practice are often obtained by querying an operational database and storing the result in a flat file. Consequently, the data processing power of the database system cannot be used anymore for this information, leading to constrained flexibility in the definition of mining patterns and limited execution performance in mining large logs. Enabling process mining directly on a database - instead of via intermediate storage in a flat file - therefore provides additional flexibility and efficiency. To help facilitate this ideal of in-database process mining, this paper formally defines a database operator that extracts the 'directly follows' relation from an operational database. This operator can both be used to do in-database process mining and to flexibly evaluate process mining related queries, such as: "which employee most frequently changes the 'amount' attribute of a case from one task to the next". We define the operator using the well-known relational algebra that forms the formal underpinning of relational databases. We formally prove equivalence properties of the operator that are useful for query optimization and present time-complexity properties of the operator. By doing so this paper formally defines the necessary relational algebraic elements of a 'directly follows' operator, which are required for implementation of such an operator in a DBMS.
Jun 27 2017 cs.CV
In this paper, we study object detection using a large pool of unlabeled images and only a few labeled images per category, named "few-shot object detection". The key challenge consists in generating trustworthy training samples as many as possible from the pool. Using few training examples as seeds, our method iterates between model training and high-confidence sample selection. In training, easy samples are generated first and, then the poorly initialized model undergoes improvement. As the model becomes more discriminative, challenging but reliable samples are selected. After that, another round of model improvement takes place. To further improve the precision and recall of the generated training samples, we embed multiple detection models in our framework, which has proven to outperform the single model baseline and the model ensemble method. Experiments on PASCAL VOC'07 indicate that by using as few as three or four samples selected for each category, our method produces very competitive results when compared to the state-of-the-art weakly-supervised approaches using a large number of image-level labels.
This paper presents a new approach in understanding how deep neural networks (DNNs) work by applying homomorphic signal processing techniques. Focusing on the task of multi-pitch estimation (MPE), this paper demonstrates the equivalence relation between a generalized cepstrum and a DNN in terms of their structures and functionality. Such an equivalence relation, together with pitch perception theories and the recently established rectified-correlations-on-a-sphere (RECOS) filter analysis, provide an alternative way in explaining the role of the nonlinear activation function and the multi-layer structure, both of which exist in a cepstrum and a DNN. To validate the efficacy of this new approach, a new feature designed in the same fashion is proposed for pitch salience function. The new feature outperforms the one-layer spectrum in the MPE task and, as predicted, it addresses the issue of the missing fundamental effect and also achieves better robustness to noise.
Jun 27 2017 cs.CV
The revolutionary developments in the field of supervised machine learning have paved way to the development of CAD tools for assisting doctors in diagnosis. Recently, the former has been employed in the prediction of neurological disorders such as Alzheimer's disease. We propose a CAD (Computer Aided Diagnosis tool for differentiating neural lesions caused by CVA (Cerebrovascular Accident) from the lesions caused by other neural disorders by using Non-negative Matrix Factorisation (NMF) and Haralick features for feature extraction and SVM (Support Vector Machine) for pattern recognition. We also introduce a multi-level classification system that has better classification efficiency, sensitivity and specificity when compared to systems using NMF or Haralick features alone as features for classification. Cross-validation was performed using LOOCV (Leave-One-Out Cross Validation) method and our proposed system has a classification accuracy of over 86%.
Jun 27 2017 cs.CV
In this paper, we present YoTube-a novel network fusion framework for searching action proposals in untrimmed videos, where each action proposal corresponds to a spatialtemporal video tube that potentially locates one human action. Our method consists of a recurrent YoTube detector and a static YoTube detector, where the recurrent YoTube explores the regression capability of RNN for candidate bounding boxes predictions using learnt temporal dynamics and the static YoTube produces the bounding boxes using rich appearance cues in a single frame. Both networks are trained using rgb and optical flow in order to fully exploit the rich appearance, motion and temporal context, and their outputs are fused to produce accurate and robust proposal boxes. Action proposals are finally constructed by linking these boxes using dynamic programming with a novel trimming method to handle the untrimmed video effectively and efficiently. Extensive experiments on the challenging UCF-101 and UCF-Sports datasets show that our proposed technique obtains superior performance compared with the state-of-the-art.
Large-scale datasets have played a significant role in progress of neural network and deep learning areas. YouTube-8M is such a benchmark dataset for general multi-label video classification. It was created from over 7 million YouTube videos (450,000 hours of video) and includes video labels from a vocabulary of 4716 classes (3.4 labels/video on average). It also comes with pre-extracted audio & visual features from every second of video (3.2 billion feature vectors in total). Google cloud recently released the datasets and organized 'Google Cloud & YouTube-8M Video Understanding Challenge' on Kaggle. Competitors are challenged to develop classification algorithms that assign video-level labels using the new and improved Youtube-8M V2 dataset. Inspired by the competition, we started exploration of audio understanding and classification using deep learning algorithms and ensemble methods. We built several baseline predictions according to the benchmark paper and public github tensorflow code. Furthermore, we improved global prediction accuracy (GAP) from base level 77% to 80.7% through approaches of ensemble.
Jun 27 2017 cs.CV
We propose an image based end-to-end learning framework that helps lane-change decisions for human drivers and autonomous vehicles. The proposed system, Safe Lane-Change Aid Network (SLCAN), trains a deep convolutional neural network to classify the status of adjacent lanes from rear view images acquired by cameras mounted on both sides of the vehicle. Rather than depending on any explicit object detection or tracking scheme, SLCAN reads the whole input image and directly decides whether initiation of the lane-change at the moment is safe or not. We collected and annotated 77,273 rear side view images to train and test SLCAN. Experimental results show that the proposed framework achieves 96.98% classification accuracy although the test images are from unseen roadways. We also visualize the saliency map to understand which part of image SLCAN looks at for correct decisions.
Jun 27 2017 cs.CV
Video-based eye tracking is a valuable technique in various research fields. Numerous open-source eye tracking algorithms have been developed in recent years, primarily designed for general application with many different camera types. These algorithms do not, however, capitalize on the high frame rate of eye tracking cameras often employed in psychophysical studies. We present a pupil detection method that utilizes this high-speed property to obtain reliable predictions through recursive estimation about certain pupil characteristics in successive camera frames. These predictions are subsequently used to carry out novel image segmentation and classification routines to improve pupil detection performance. Based on results from hand-labelled eye images, our approach was found to have a greater detection rate, accuracy and speed compared to other recently published open-source pupil detection algorithms. The program's source code, together with a graphical user interface, can be downloaded at https://github.com/tbrouns/eyestalker.
Jun 27 2017 cs.CL
Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input to find out other names that are synonymous, ignoring the fact that often times a name string can refer to multiple entities (e.g., "apple" could refer to both Apple Inc and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised-learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually-curated synonyms for each entity stored in a knowledge base not only form a set of name strings to disambiguate the meaning for each other, but also can serve as "distant" supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually-complementing signals for synonym discovery, i.e., distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals will be utilized to discover synonyms for the given entities. Experimental results prove the effectiveness of the proposed framework.
Jun 27 2017 cs.SY
Maintaining the security of control systems in the presence of integrity attacks is a significant challenge. In literature, several possible attacks against control systems have been formulated including replay, false data injection, and zero dynamics attacks. The detection and prevention of these attacks may require the defender to possess a particular subset of trusted communication channels. Alternatively, these attacks can be prevented by keeping the system model secret from the adversary. In this paper, we consider an adversary who has the ability to modify and read all sensor and actuator channels. To thwart this adversary, we introduce external states dependent on the state of the control system, with linear time-varying dynamics unknown to the adversary. We also include sensors to measure these states. The presence of unknown time-varying dynamics is leveraged to detect an adversary who simultaneously aims to identify the system and inject stealthy outputs. Potential attack strategies and bounds on the attacker's performance are provided.
Jun 27 2017 cs.CL
The practice of evidence-based medicine (EBM) urges medical practitioners to utilise the latest research evidence when making clinical decisions. Because of the massive and growing volume of published research on various medical topics, practitioners often find themselves overloaded with information. As such, natural language processing research has recently commenced exploring techniques for performing medical domain-specific automated text summarisation (ATS) techniques-- targeted towards the task of condensing large medical texts. However, the development of effective summarisation techniques for this task requires cross-domain knowledge. We present a survey of EBM, the domain-specific needs for EBM, automated summarisation techniques, and how they have been applied hitherto. We envision that this survey will serve as a first resource for the development of future operational text summarisation techniques for EBM.
Word embeddings, which represent a word as a point in a vector space, have become ubiquitous to several NLP tasks. A recent line of work uses bilingual (two languages) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) uses a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields comparable performance to a state of the art mono-lingual model trained on five times more training data.
Jun 27 2017 cs.CV
Photometric Stereo methods seek to reconstruct the 3d shape of an object from motionless images obtained with varying illumination. Most existing methods solve a restricted problem where the physical reflectance model, such as Lambertian reflectance, is known in advance. In contrast, we do not restrict ourselves to a specific reflectance model. Instead, we offer a method that works on a wide variety of reflectances. Our approach uses a simple yet uncommonly used property of the problem - the sought after normals are points on a unit hemisphere. We present a novel embedding method that maps pixels to normals on the unit hemisphere. Our experiments demonstrate that this approach outperforms existing manifold learning methods for the task of hemisphere embedding. We further show successful reconstructions of objects from a wide variety of reflectances including smooth, rough, diffuse and specular surfaces, even in the presence of significant attached shadows. Finally, we empirically prove that under these challenging settings we obtain more accurate shape reconstructions than existing methods.
Jun 27 2017 cs.SE
Maintenance is an important activity in industry. It is performed either to revive a machine/component or to prevent it from breaking down. Different strategies have evolved through time, bringing maintenance to its current state: condition-based and predictive maintenances. This evolution was due to the increasing demand of reliability in industry. The key process of condition-based and predictive maintenances is prognostics and health management, and it is a tool to predict the remaining useful life of engineering assets. Nowadays, plants are required to avoid shutdowns while offering safety and reliability. Nevertheless, planning a maintenance activity requires accurate information about the system/component health state. Such information is usually gathered by means of independent sensor nodes. In this study, we consider the case where the nodes are interconnected and form a wireless sensor network. As far as we know, no research work has considered such a case of study for prognostics. Regarding the importance of data accuracy, a good prognostics requires reliable sources of information. This is why, in this paper, we will first discuss the dependability of wireless sensor networks, and then present a state of the art in prognostic and health management activities.
Jun 27 2017 cs.CV
Real-time tool segmentation from endoscopic videos is an essential part of many computer-assisted robotic surgical systems and of critical importance in robotic surgical data science. We propose two novel deep learning architectures for automatic segmentation of non-rigid surgical instruments. Both methods take advantage of automated deep-learning-based multi-scale feature extraction while trying to maintain an accurate segmentation quality at all resolutions. The two proposed methods encode the multi-scale constraint inside the network architecture. The first proposed architecture enforces it by cascaded aggregation of predictions and the second proposed network does it by means of a holistically-nested architecture where the loss at each scale is taken into account for the optimization process. As the proposed methods are for real-time semantic labeling, both present a reduced number of parameters. We propose the use of parametric rectified linear units for semantic labeling in these small architectures to increase the regularization ability of the design and maintain the segmentation accuracy without overfitting the training sets. We compare the proposed architectures against state-of-the-art fully convolutional networks. We validate our methods using existing benchmark datasets, including ex vivo cases with phantom tissue and different robotic surgical instruments present in the scene. Our results show a statistically significant improved Dice Similarity Coefficient over previous instrument segmentation methods. We analyze our design choices and discuss the key drivers for improving accuracy.
Jun 27 2017 cs.CV
Rectified linear unit (ReLU) is a widely used activation function for deep convolutional neural networks. In this paper, we propose a novel activation function called flexible rectified linear unit (FReLU). FReLU improves the flexibility of ReLU by a learnable rectified point. FReLU achieves a faster convergence and higher performance. Furthermore, FReLU does not rely on strict assumptions by self-adaption. FReLU is also simple and effective without using exponential function. We evaluate FReLU on two standard image classification dataset, including CIFAR-10 and CIFAR-100. Experimental results show the strengths of the proposed method.
Achieving a symbiotic blending between reality and virtuality is a dream that has been lying in the minds of many people for a long time. Advances in various domains constantly bring us closer to making that dream come true. Augmented reality as well as virtual reality are in fact trending terms and are expected to further progress in the years to come. This master's thesis aims to explore these areas and starts by defining necessary terms such as augmented reality (AR) or virtual reality (VR). Usual taxonomies to classify and compare the corresponding experiences are then discussed. In order to enable those applications, many technical challenges need to be tackled, such as accurate motion tracking with 6 degrees of freedom (positional and rotational), that is necessary for compelling experiences and to prevent user sickness. Additionally, augmented reality experiences typically rely on image processing to position the superimposed content. To do so, "paper" markers or features extracted from the environment are often employed. Both sets of techniques are explored and common solutions and algorithms are presented. After investigating those technical aspects, I carry out an objective comparison of the existing state-of-the-art and state-of-the-practice in those domains, and I discuss present and potential applications in these areas. As a practical validation, I present the results of an application that I have developed using Microsoft HoloLens, one of the more advanced affordable technologies for augmented reality that is available today. Based on the experience and lessons learned during this development, I discuss the limitations of current technologies and present some avenues of future research.
Jun 27 2017 cs.IR
With an exponentially growing number of scientific papers published each year, advanced tools for exploring and discovering publications of interest are becoming indispensable. To empower users beyond a simple keyword search provided e.g. by Google Scholar, we present the novel web application PubVis. Powered by a variety of machine learning techniques, it combines essential features to help researchers find the content most relevant to them. An interactive visualization of a large collection of scientific publications provides an overview of the field and encourages the user to explore articles beyond a narrow research focus. This is augmented by personalized content based article recommendations as well as an advanced full text search to discover relevant references. The open sourced implementation of the app can be easily set up and run locally on a desktop computer to provide access to content tailored to the specific needs of individual users. Additionally, a PubVis demo with access to a collection of 10,000 papers can be tested online.
Jun 27 2017 cs.AI
We introduce a new count-based optimistic exploration algorithm for Reinforcement Learning (RL) that is feasible in environments with high-dimensional state-action spaces. The success of RL algorithms in these domains depends crucially on generalisation from limited training experience. Function approximation techniques enable RL agents to generalise in order to estimate the value of unvisited states, but at present few methods enable generalisation regarding uncertainty. This has prevented the combination of scalable RL algorithms with efficient exploration strategies that drive the agent to reduce its uncertainty. We present a new method for computing a generalised state visit-count, which allows the agent to estimate the uncertainty associated with any state. Our \phi-pseudocount achieves generalisation by exploiting same feature representation of the state space that is used for value function approximation. States that have less frequently observed features are deemed more uncertain. The \phi-Exploration-Bonus algorithm rewards the agent for exploring in feature space rather than in the untransformed state space. The method is simpler and less computationally expensive than some previous proposals, and achieves near state-of-the-art results on high-dimensional RL benchmarks.
Before reaching full autonomy, vehicles will gradually be equipped with more and more advanced driver assistance systems (ADAS), effectively rendering them semi-autonomous. However, current ADAS technologies seem unable to handle complex traffic situations, notably when dealing with vehicles arriving from the sides, either at intersections or when merging on highways. The high rate of accidents in these settings prove that they constitute difficult driving situations. Moreover, intersections and merging lanes are often the source of important traffic congestion and, sometimes, deadlocks. In this article, we propose a cooperative framework to safely coordinate semi-autonomous vehicles in such settings, removing the risk of collision or deadlocks while remaining compatible with human driving. More specifically, we present a supervised coordination scheme that overrides control inputs from human drivers when they would result in an unsafe or blocked situation. To avoid unnecessary intervention and remain compatible with human driving, overriding only occurs when collisions or deadlocks are imminent. In this case, safe overriding controls are chosen while ensuring they deviate minimally from those originally requested by the drivers. Simulation results based on a realistic physics simulator show that our approach is scalable to real-world scenarios, and computations can be performed in real-time on a standard computer for up to a dozen simultaneous vehicles.
Jun 27 2017 cs.CV
We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Jun 27 2017 cs.DC
In the era when the market segment of Internet of Things (IoT) tops the chart in various business reports, it is apparently envisioned that the field of medicine expects to gain a large benefit from the explosion of wearables and internet-connected sensors that surround us to acquire and communicate unprecedented data on symptoms, medication, food intake, and daily-life activities impacting one's health and wellness. However, IoT-driven healthcare would have to overcome many barriers, such as: 1) There is an increasing demand for data storage on cloud servers where the analysis of the medical big data becomes increasingly complex, 2) The data, when communicated, are vulnerable to security and privacy issues, 3) The communication of the continuously collected data is not only costly but also energy hungry, 4) Operating and maintaining the sensors directly from the cloud servers are non-trial tasks. This book chapter defined Fog Computing in the context of medical IoT. Conceptually, Fog Computing is a service-oriented intermediate layer in IoT, providing the interfaces between the sensors and cloud servers for facilitating connectivity, data transfer, and queryable local database. The centerpiece of Fog computing is a low-power, intelligent, wireless, embedded computing node that carries out signal conditioning and data analytics on raw data collected from wearables or other medical sensors and offers efficient means to serve telehealth interventions. We implemented and tested an fog computing system using the Intel Edison and Raspberry Pi that allows acquisition, computing, storage and communication of the various medical data such as pathological speech data of individuals with speech disorders, Phonocardiogram (PCG) signal for heart rate estimation, and Electrocardiogram (ECG)-based Q, R, S detection.
Jun 27 2017 cs.PL
We introduce the Fusion algorithm for local refinement type inference, yielding a new SMT-based method for verifying programs with polymorphic data types and higher-order functions. Fusion is concise as the programmer need only write signatures for (externally exported) top-level functions and places with cyclic (recursive) dependencies, after which Fusion can predictably synthesize the most precise refinement types for all intermediate terms (expressible in the decidable refinement logic), thereby checking the program without false alarms. We have implemented Fusion and evaluated it on the benchmarks from the LiquidHaskell suite totalling about 12KLOC. Fusion checks an existing safety benchmark suite using about half as many templates as previously required and nearly 2x faster. In a new set of theorem proving benchmarks Fusion is both 10 - 50x faster and, by synthesizing the most precise types, avoids false alarms to make verification possible.
In this article, we extend the conventional framework of convolutional-Restricted-Boltzmann-Machine to learn highly abstract features among abitrary number of time related input maps by constructing a layer of multiplicative units, which capture the relations among inputs. In many cases, more than two maps are strongly related, so it is wise to make multiplicative unit learn relations among more input maps, in other words, to find the optimal relational-order of each unit. In order to enable our machine to learn relational order, we developed a reinforcement-learning method whose optimality is proven to train the network.
Jun 27 2017 cs.LO
Real world programming languages crucially depend on the availability of computational effects to achieve programming convenience and expressive power as well as program efficiency. Logical frameworks rely on predicates, or dependent types, to express detailed logical properties about entities. According to the Curry-Howard correspondence, programming languages and logical frameworks should be very closely related. However, a language that has both good support for real programming and serious proving is still missing from the programming languages zoo. We believe this is due to a fundamental lack of understanding of how dependent types should interact with computational effects. In this thesis, we make a contribution towards such an understanding, with a focus on semantic methods.