# Computer Vision and Pattern Recognition (cs.CV)

• Learning-based approaches to robotic manipulation are limited by the scalability of data collection and accessibility of labels. In this paper, we present a multi-task domain adaptation framework for instance grasping in cluttered scenes by utilizing simulated robot experiments. Our neural network takes monocular RGB images and the instance segmentation mask of a specified target object as inputs, and predicts the probability of successfully grasping the specified object for each candidate motor command. The proposed transfer learning framework trains a model for instance grasping in simulation and uses a domain-adversarial loss to transfer the trained model to real robots using indiscriminate grasping data, which is available both in simulation and the real world. We evaluate our model in real-world robot experiments, comparing it with alternative model architectures as well as an indiscriminate grasping baseline.
• A robust and informative local shape descriptor plays an important role in mesh registration. In this regard, spectral descriptors that are based on the spectrum of the Laplace-Beltrami operator have attracted considerable attention from researchers over the last decade due to their desirable properties, such as isometry invariance. Nevertheless, spectral descriptors often fail to give a correct similarity measure for non-isometric cases where the metric distortion between the models is large. Hence, they are in general not suitable for registration problems, except for the special cases when the models are nearly isometric. In this paper, we investigate a way to develop shape descriptors for non-isometric registration tasks by embedding the spectral shape descriptors into a different metric space where the Euclidean distance between the elements directly indicates the geometric dissimilarity. We design and train a Siamese deep neural network to find such an embedding, where the embedded descriptors are encouraged to rearrange based on geometric similarity. We found that our approach can significantly enhance the performance of conventional spectral descriptors for non-isometric registration tasks, and that it outperforms a recent state-of-the-art method reported in the literature.
• With tens of thousands of electrocardiogram (ECG) records processed by mobile cardiac event recorders every day, heart rhythm classification algorithms are an important tool for the continuous monitoring of patients at risk. We utilise an annotated dataset of 12,186 single-lead ECG recordings to build a diverse ensemble of recurrent neural networks (RNNs) that is able to distinguish between normal sinus rhythms, atrial fibrillation, other types of arrhythmia and signals that are too noisy to interpret. In order to ease learning over the temporal dimension, we introduce a novel task formulation that harnesses the natural segmentation of ECG signals into heartbeats to drastically reduce the number of time steps per sequence. Additionally, we extend our RNNs with an attention mechanism that enables us to reason about which heartbeats our RNNs focus on to make their decisions. Through the use of attention, our model maintains a high degree of interpretability, while also achieving state-of-the-art classification performance with an average F1 score of 0.79 on an unseen test set (n=3,658).
• The cost-effectiveness and practical harmlessness of ultrasound imaging have made it one of the most widespread tools for medical diagnosis. Unfortunately, the beam-forming based image formation produces granular speckle noise, blurring, shading and other artifacts. To overcome these effects, the ultimate goal would be to reconstruct the tissue acoustic properties by solving a full wave propagation inverse problem. In this work, we make a step towards this goal, using Multi-Resolution Convolutional Neural Networks (CNN). As a result, we are able to reconstruct CT-quality images from the reflected ultrasound radio-frequency (RF) data obtained by simulation from real CT scans of a human body. We also show that a CNN is able to imitate existing computationally heavy despeckling methods, thereby saving orders of magnitude in computations and making them amenable to real-time applications.
• Images in the wild encapsulate rich knowledge about varied abstract concepts and cannot be sufficiently described with models built only using image-caption pairs containing selected objects. We propose to handle such a task with the guidance of a knowledge base that incorporates many abstract concepts. Our method is a two-step process where we first build a multi-entity-label image recognition model to predict abstract concepts as image labels and then leverage them in the second step as an external semantic attention and constrained inference in the caption generation model for describing images that depict unseen/novel objects. Evaluations show that our models outperform most of the prior work for out-of-domain captioning on MSCOCO and are useful for integration of knowledge and vision in general.
• In this paper, we present substantial evidence that a deep neural network will intrinsically learn the appropriate way to discretize the ideal continuous reconstruction filter. Currently, the Ram-Lak filter or heuristic filters which impose different noise assumptions are used for filtered back-projection. All of these, however, inhibit a fully data-driven reconstruction deep learning approach. In addition, the heuristic filters are not chosen in an optimal sense. To tackle this issue, we propose a formulation to directly learn the reconstruction filter. The filter is initialized with the ideal Ramp filter as a strong pre-training and learned in frequency domain. We compare the learned filter with the Ram-Lak and the Ramp filter on a numerical phantom as well as on a real CT dataset. The results show that the network properly discretizes the continuous Ramp filter and converges towards the Ram-Lak solution. In our view these observations are interesting to gain a better understanding of deep learning techniques and traditional analytic techniques such as Wiener filtering and discretization theory. Furthermore, this will allow fully trainable data-driven reconstruction deep learning approaches.
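The distinction this abstract draws between the continuous ramp filter and its Ram-Lak discretization can be made concrete with a small sketch. The code below builds the textbook spatial-domain Ram-Lak kernel (the band-limited ramp sampled at detector spacing `tau`). It is not the learned filter from the paper, and the names `ram_lak_kernel`, `half_width` and `tau` are illustrative assumptions.

```python
import math


def ram_lak_kernel(half_width: int, tau: float = 1.0) -> list[float]:
    """Spatial-domain Ram-Lak kernel for taps n = -half_width .. half_width.

    Textbook closed form (an assumption here, not taken from the abstract):
      h[0] = 1 / (4 * tau^2)
      h[n] = 0                           for even n != 0
      h[n] = -1 / (pi^2 * n^2 * tau^2)   for odd n
    """
    kernel = []
    for n in range(-half_width, half_width + 1):
        if n == 0:
            value = 1.0 / (4.0 * tau ** 2)
        elif n % 2 == 0:
            value = 0.0
        else:
            value = -1.0 / (math.pi ** 2 * n ** 2 * tau ** 2)
        kernel.append(value)
    return kernel


if __name__ == "__main__":
    k = ram_lak_kernel(2)
    print(k)  # five taps, symmetric around the centre tap
    # The taps of a long kernel nearly cancel, matching the ramp's zero at DC.
    print(sum(ram_lak_kernel(2001)))
```

A network that converges towards the Ram-Lak solution, as the abstract reports, would be expected to end up with a frequency response close to the transform of a kernel like this one rather than the unbounded continuous ramp.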
• We present an overview and evaluation of a new, systematic approach for generation of highly realistic, annotated synthetic data for training of deep neural networks in computer vision tasks. The main contribution is a procedural world modeling approach enabling high variability coupled with physically accurate image synthesis, and is a departure from the hand-modeled virtual worlds and approximate image synthesis methods used in real-time applications. The benefits of our approach include flexible, physically accurate and scalable image synthesis, implicit wide coverage of classes and features, and complete data introspection for annotations, which all contribute to quality and cost efficiency. To evaluate our approach and the efficacy of the resulting data, we use semantic segmentation for autonomous vehicles and robotic navigation as the main application, and we train multiple deep learning architectures using synthetic data with and without fine tuning on organic (i.e. real-world) data. The evaluation shows that our approach improves the neural network's performance and that even modest implementation efforts produce state-of-the-art results.
• Oct 18 2017 cs.CV arXiv:1710.06236v1
Temporal action detection is a very important yet challenging problem, since videos in real applications are usually long, untrimmed and contain multiple action instances. This problem requires not only recognizing action categories but also detecting the start time and end time of each action instance. Many state-of-the-art methods adopt the "detection by classification" framework: first generate proposals, then classify them. The main drawback of this framework is that the boundaries of action instance proposals are fixed during the classification step. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers that skips the proposal generation step by directly detecting action instances in untrimmed video. In pursuit of an SSAD network that works effectively for temporal action detection, we empirically search for the best network architecture, since no existing models can be directly adopted. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the Intersection-over-Union threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems by increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
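The Intersection-over-Union threshold used in the evaluation above applies to time intervals rather than boxes. As a minimal sketch, assuming each action instance is a `(start, end)` pair in seconds, the overlap test could look like the following. The function names are hypothetical, and a full mAP computation would also need class labels and confidence ranking, which are not shown.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-Union of two time intervals given as (start, end)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold: float = 0.5) -> bool:
    """A detection counts as correct when its temporal IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))      # 2 s overlap over a 6 s union
    print(is_true_positive((2.0, 8.0), (3.0, 9.0)))  # 5 s overlap over a 7 s union
```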
• This paper proposes a novel system to estimate and track the 3D poses of multiple persons in calibrated RGB-Depth camera networks. The multi-view 3D pose of each person is computed by a central node which receives the single-view outcomes from each camera of the network. Each single-view outcome is computed by using a CNN for 2D pose estimation and extending the resulting skeletons to 3D by means of the sensor depth. The proposed system is marker-less, multi-person, independent of background and does not make any assumption on people appearance and initial pose. The system provides real-time outcomes, thus being perfectly suited for applications requiring user interaction. Experimental results show the effectiveness of this work with respect to a baseline multi-view approach in different scenarios. To foster research and applications based on this work, we released the source code in OpenPTrack, an open source project for RGB-D people tracking.
• Unsupervised object modeling is important in robotics, especially for handling a large set of objects. We present a method for unsupervised 3D object discovery, reconstruction, and localization that exploits multiple instances of an identical object contained in a single RGB-D image. The proposed method does not rely on segmentation, scene knowledge, or user input, and thus is easily scalable. Our method aims to find recurrent patterns in a single RGB-D image by utilizing appearance and geometry of the salient regions. We extract keypoints and match them in pairs based on their descriptors. We then generate triplets of the keypoints matching with each other using several geometric criteria to minimize false matches. The relative poses of the matched triplets are computed and clustered to discover sets of triplet pairs with similar relative poses. Triplets belonging to the same set are likely to belong to the same object and are used to construct an initial object model. Detecting the remaining instances with the initial object model using RANSAC allows the model to be further expanded and refined. The automatically generated object models are both compact and descriptive. We show quantitative and qualitative results on RGB-D images with various objects including some from the Amazon Picking Challenge. We also demonstrate the use of our method in an object picking scenario with a robotic arm.
• Driverless vehicles operate by sensing and perceiving their surrounding environment to make accurate driving decisions. A combination of several different sensors such as LiDAR, radar, ultrasound sensors and cameras is utilized to sense the surrounding environment of driverless vehicles. The heterogeneous sensors simultaneously capture various physical attributes of the environment. Such multimodality and redundancy of sensing need to be positively utilized for reliable and consistent perception of the environment through sensor data fusion. However, these multimodal sensor data streams are different from each other in many ways, such as temporal and spatial resolution, data format, and geometric alignment. For the subsequent perception algorithms to utilize the diversity offered by multimodal sensing, the data streams need to be spatially, geometrically and temporally aligned with each other. In this paper, we address the problem of fusing the outputs of a Light Detection and Ranging (LiDAR) scanner and a wide-angle monocular image sensor. The outputs of the LiDAR scanner and the image sensor are of different spatial resolutions and need to be aligned with each other. A geometrical model is used to spatially align the two sensor outputs, followed by a Gaussian Process (GP) regression based resolution matching algorithm to interpolate the missing data with quantifiable uncertainty. The results indicate that the proposed sensor data fusion framework significantly aids the subsequent perception steps, as illustrated by the performance improvement of a typical free space detection algorithm.
• In recent years, we have witnessed the huge success of Convolutional Neural Networks on the task of image classification. However, these models are notoriously data hungry and require tons of training images to learn their parameters. In contrast, people are far better learners who can learn a new concept very fast with only a few samples. The plausible mysteries making the difference are two fundamental learning mechanisms: learning to learn and learning by analogy. In this paper, we attempt to investigate a new human-like learning method by organically combining these two mechanisms. In particular, we study how to generalize the classification parameters of previously learned concepts to a new concept. We first propose a novel Visual Analogy Network Embedded Regression (VANER) model to jointly learn a low-dimensional embedding space and a linear mapping function from the embedding space to classification parameters for base classes. We then propose an out-of-sample embedding method to learn the embedding of a new class represented by a few samples through its visual analogy with base classes. By inputting the learned embedding into VANER, we can derive the classification parameters for the new class. These classification parameters are purely generalized from base classes (i.e. transferred classification parameters), while the samples in the new class, although only a few, can also be exploited to generate a set of classification parameters (i.e. model classification parameters). Therefore, we further investigate the fusion strategy of the two kinds of parameters so that the prior knowledge and data knowledge can be fully leveraged. We also conduct extensive experiments on ImageNet and the results show that our method can consistently and significantly outperform state-of-the-art baselines.
• Pedestrian detection is an important component for safety of autonomous vehicles, as well as for traffic and street surveillance. There are extensive benchmarks on this topic and it has been shown to be a challenging problem when applied on real use-case scenarios. In purely image-based pedestrian detection approaches, the state-of-the-art results have been achieved with convolutional neural networks (CNN) and surprisingly few detection frameworks have been built upon multi-cue approaches. In this work, we develop a new pedestrian detector for autonomous vehicles that exploits LiDAR data, in addition to visual information. In the proposed approach, LiDAR data is utilized to generate region proposals by processing the three dimensional point cloud that it provides. These candidate regions are then further processed by a state-of-the-art CNN classifier that we have fine-tuned for pedestrian detection. We have extensively evaluated the proposed detection process on the KITTI dataset. The experimental results show that the proposed LiDAR space clustering approach provides a very efficient way of generating region proposals leading to higher recall rates and fewer misses for pedestrian detection. This indicates that LiDAR data can provide auxiliary information for CNN-based approaches.
• This paper reports on a novel template-free monocular non-rigid surface reconstruction approach. Existing techniques using motion and deformation cues rely on multiple prior assumptions, are often computationally expensive and do not perform equally well across the variety of data sets. In contrast, the proposed Scalable Monocular Surface Reconstruction (SMSR) combines strengths of several algorithms, i.e., it is scalable with the number of points, can handle sparse and dense settings as well as different types of motions and deformations. We estimate camera pose by singular value thresholding and proximal gradient. Our formulation adopts alternating direction method of multipliers which converges in linear time for large point track matrices. In the proposed SMSR, trajectory space constraints are integrated by smoothing of the measurement matrix. In the extensive experiments, SMSR is demonstrated to consistently achieve state-of-the-art accuracy on a wide variety of data sets.
• We introduce a large-scale 3D shape understanding benchmark using data and annotation from ShapeNet 3D object database. The benchmark consists of two tasks: part-level segmentation of 3D shapes and 3D reconstruction from single view images. Ten teams have participated in the challenge and the best performing teams have outperformed state-of-the-art approaches on both tasks. A few novel deep learning architectures have been proposed on various 3D representations on both tasks. We report the techniques used by each team and the corresponding performances. In addition, we summarize the major discoveries from the reported results and possible trends for the future work in the field.
• We propose a framework to understand the unprecedented performance and robustness of deep neural networks using field theory. Correlations between the weights within the same layer can be described by symmetries in that layer, and networks generalize better if such symmetries are broken to reduce the redundancies of the weights. Using a two parameter field theory, we find that the network can break such symmetries itself towards the end of training in a process commonly known in physics as spontaneous symmetry breaking. This corresponds to a network generalizing itself without any user input layers to break the symmetry, but by communication with adjacent layers. In the layer decoupling limit applicable to residual networks (He et al., 2015), we show that the remnant symmetries that survive the non-linear layers are spontaneously broken. The Lagrangian for the non-linear and weight layers together has striking similarities with the one in quantum field theory of a scalar. Using results from quantum field theory we show that our framework is able to explain many experimentally observed phenomena, such as training on random labels with zero error (Zhang et al., 2017), the information bottleneck, the phase transition out of it and gradient variance explosion (Shwartz-Ziv & Tishby, 2017), shattered gradients (Balduzzi et al., 2017), and many more.
• Face transfer animates the facial performance of the character in the target video using a source actor. Traditional methods are typically based on face modeling. We propose an end-to-end face transfer method based on Generative Adversarial Networks. Specifically, we leverage CycleGAN to generate the face image of the target character with the corresponding head pose and facial expression of the source. In order to improve the quality of generated videos, we adopt PatchGAN and explore the effect of different receptive field sizes on generated images.
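The receptive field size mentioned above follows directly from the kernel sizes and strides of the discriminator's convolutions. The sketch below computes it for a stack of layers. The layer list used in the example is the commonly cited 70×70 PatchGAN configuration (five 4×4 convolutions with strides 2, 2, 2, 1, 1), which is an assumption here because the abstract does not state the configurations that were explored.

```python
def receptive_field(layers: list[tuple[int, int]]) -> int:
    """Receptive field (in input pixels) of one output unit.

    `layers` lists (kernel_size, stride) from input to output. Walking
    backwards from a single output unit, each layer widens the field as
    rf = (rf - 1) * stride + kernel_size.
    """
    rf = 1
    for kernel_size, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel_size
    return rf


if __name__ == "__main__":
    # Assumed configuration: three stride-2 and two stride-1 4x4 convolutions.
    patchgan_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
    print(receptive_field(patchgan_70))  # 70
    # Dropping stride-2 layers shrinks the patch each output unit judges.
    print(receptive_field([(4, 2), (4, 1), (4, 1)]))  # 16
```

Smaller receptive fields make the discriminator judge only local texture, while larger ones let it take more of the face into account, which is the trade-off the abstract says it explores.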
• Recent advancements in neutron and x-ray sources, instrumentation and data collection modes have significantly increased the experimental data size (which could easily contain $10^{8}$-$10^{10}$ points), so that conventional volumetric visualization approaches become inefficient for both still imaging and interactive OpenGL rendition in a 3-D setting. We introduce a new approach based on the unsupervised machine learning algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to efficiently analyze and visualize large volumetric datasets. Here we present two examples, including a single crystal diffuse scattering dataset and a neutron tomography dataset. We found that by using the intensity as the weight factor during clustering, the algorithm becomes very effective in de-noising and feature/boundary detection, and thus enables better visualization of the hierarchical internal structures of the scattering data.
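To illustrate how intensity can act as a weight during clustering, here is a minimal brute-force sketch of weighted DBSCAN in plain Python. It assumes that a point is a core point when the summed weight inside its `eps`-ball reaches `min_weight`, which is one common way to weight DBSCAN and not necessarily the authors' exact formulation. All names and the toy data are hypothetical, and the O(n²) neighbour search would not scale to the dataset sizes quoted above.

```python
from math import dist

NOISE = -1


def weighted_dbscan(points, weights, eps, min_weight):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # Brute-force eps-neighbourhoods (each point is its own neighbour).
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points: summed neighbourhood weight reaches min_weight.
    is_core = [
        sum(weights[j] for j in neighbours[i]) >= min_weight for i in range(n)
    ]

    labels = [NOISE] * n
    cluster = 0
    for seed in range(n):
        if labels[seed] != NOISE or not is_core[seed]:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            i = stack.pop()
            if not is_core[i]:
                continue  # border points join a cluster but do not grow it
            for j in neighbours[i]:
                if labels[j] == NOISE:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
    intensities = [5, 5, 5, 0.1, 0.1, 0.1]
    # The two faint points near (10, 10) are adjacent but too weak to form a cluster.
    print(weighted_dbscan(points, intensities, eps=1.5, min_weight=10))
```

With uniform weights the same two faint points would form their own cluster, so the weighting is what lets low-intensity regions fall out as noise.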
• We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks. By learning from both demonstration and environmental reward we develop a model that can learn with relatively few early catastrophic failures. We first learn an architecture of appropriate complexity to perceive aspects of world state relevant to the expert demonstration, and then mitigate the effect of domain-shift during deployment by adapting a policy demonstrated in a source domain to rewards obtained in a target environment. We show that our approach allows safer learning than baseline methods, offering a reduced cumulative crash metric over the agent's lifetime as it learns to drive in a realistic simulated environment.
• Neonatal brain segmentation in magnetic resonance (MR) is a challenging problem due to poor image quality and similar levels of intensity between white and gray matter in MR-T1 and T2 images. To tackle this problem, most existing approaches are based on multi-atlas label fusion strategies, which are time-consuming and sensitive to registration errors. As an alternative to these methods, we propose a hyper densely connected 3D convolutional neural network that employs MR-T1 and T2 as input, processed independently in two separate paths. A main difference with respect to previous densely connected networks is the use of direct connections between layers from the same and different paths. Adopting such dense connectivity leads to a benefit from a learning perspective thanks to: i) including deep supervision and ii) improving gradient flow. This approach has been evaluated in the MICCAI Grand Challenge iSEG and obtains very competitive results among 21 teams, ranking first and second in many metrics, which translates into a promising performance.
• The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose a new activation function, named Swish, which is simply $f(x) = x \cdot \text{sigmoid}(x)$. Our experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
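The Swish formula quoted above is short enough to write out directly. The following scalar sketch uses only the standard library; the numerically stable `sigmoid` helper and the comparison with `relu` are illustrative additions, not code from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish activation: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def relu(x: float) -> float:
    """Rectified Linear Unit, for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

Unlike ReLU, Swish is smooth and takes small negative values for negative inputs, while approaching the identity for large positive inputs.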
• The purpose of this study is to provide a detailed performance comparison of feature detector/descriptor methods, particularly when their various combinations are used for image-matching. The localization experiments of a mobile robot in an indoor environment are presented as a case study. In these experiments, 3090 query images and 127 dataset images were used. This study includes five methods for feature detectors (features from accelerated segment test (FAST), oriented FAST and rotated binary robust independent elementary features (BRIEF) (ORB), speeded-up robust features (SURF), scale invariant feature transform (SIFT), and binary robust invariant scalable keypoints (BRISK)) and five other methods for feature descriptors (BRIEF, BRISK, SIFT, SURF, and ORB). These methods were used in 23 different combinations and it was possible to obtain meaningful and consistent comparison results using the performance criteria defined in this study. All of these methods were used independently and separately from each other as either feature detector or descriptor. The performance analysis shows the discriminative power of various combinations of detector and descriptor methods. The analysis is completed using five parameters: (i) accuracy, (ii) time, (iii) angle difference between keypoints, (iv) number of correct matches, and (v) distance between correctly matched keypoints. In a range of 60°, covering five rotational pose points for our system, the FAST-SURF combination had the lowest distance and angle difference values and the highest number of matched keypoints. SIFT-SURF was the most accurate combination with a 98.41% correct classification rate. The fastest algorithm was ORB-BRIEF, with a total running time of 21,303.30 s to match 560 images captured during motion with 127 dataset images.
• We discuss the geometry of rational maps from a projective space of an arbitrary dimension to the product of projective spaces of lower dimensions induced by linear projections. In particular, we give a purely algebro-geometric proof of the projective reconstruction theorem by Hartley and Schaffalitzky [HS09].
• In this paper, we propose a new minimal path model for minimally interactive retinal vessel centerline extraction. The main contribution lies in the construction of a novel coherence-penalized Riemannian metric in a lifted space, which depends on the local geometry of tubularity and an external scalar-valued reference feature map. The globally minimizing curves associated with the proposed metric tend to pass through a set of retinal vessel segments with low variations of the feature map, and thus can avoid the short branches combination problem and the shortcut problem, which commonly affect existing minimal path models in retinal imaging applications. We validate our model on a series of retinal vessel patches obtained from the DRIVE and IOSTAR datasets, showing that our model indeed obtains promising results.
• Image classification is the task of assigning to an input image a label from a fixed set of categories. One of its most important application fields is robotics, in particular the need for a robot to be aware of its surroundings and to exploit that information for its tasks. In this work we consider the problem of a robot that enters a new environment and wants to understand the visual data coming from its camera in order to extract knowledge from them. As the main novelty, we want to remove the need for a physical robot, which can be expensive and unwieldy, and thereby hopefully enhance, speed up and ease research in this field. To that end, we propose to develop an application for a mobile platform that wraps several deep visual recognition tasks. First we deal with simple image classification, testing a model obtained from an AlexNet trained on the ILSVRC 2012 dataset. Several photo settings are considered to better understand which factors most affect the quality of classification. For the same purpose we are interested in integrating the classification task with an extra module that segments the object inside the image. In particular we propose a technique for extracting the object shape and removing all the background, so as to focus the classification only on the region occupied by the object. Another significant task that is included is object discovery. Its purpose is to simulate the situation in which the robot needs a certain object to complete one of its activities. It starts searching for what it needs by looking around and trying to determine the location of the object by scanning the surrounding environment. Finally we provide a tool for creating customized task-specific databases, meant to better suit one's needs in a particular vision task.
• The problem of minimizing the number of measurements needed for digital image acquisition and reconstruction with a given accuracy is addressed. Basics of the sampling theory are outlined to show that the lower bound of the signal sampling rate sufficient for signal reconstruction with a given accuracy is equal to the spectrum sparsity of the signal's sparse approximation that has this accuracy. The capability of Compressed Sensing to reconstruct signals sampled with aliasing is demystified using a simple and intuitive model, and limitations of this capability are discussed. It is revealed that the Compressed Sensing approach advanced as a solution to the sampling rate minimization problem is far from reaching the theoretical minimum sampling rate. A method of image Arbitrary Sampling and Bounded Spectrum Reconstruction (ASBSR-method) is described that allows one to approach the theoretical minimum image sampling rate. Also presented and discussed are results of experimental verification of the ASBSR-method and its possible extensions to solving various under-determined inverse problems such as color image demosaicing, image in-painting, image reconstruction from sparsely sampled or decimated projections, image reconstruction from the modulus of its Fourier spectrum and image reconstruction from its sparse samples in the Fourier domain.

Eddie Smolansky May 26 2017 05:23 UTC

Updated summary [here](https://github.com/eddiesmo/papers).

# How they made the dataset
- automated filtering with yolo and landmark detection projects
- crowd source final filtering (AMT - give 50 face images to turks and ask which don't belong)
- quality control through s

...(continued)
Qian Wang Mar 07 2017 17:21 UTC

"To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics."