# Computer Vision and Pattern Recognition (cs.CV)

• Many people see a human face or animals in the pattern of the maria on the moon. Although the pattern corresponds to the actual variation in composition of the lunar surface, the culture and environment of each society influence the recognition of these objects (i.e., symbols) as specific entities. In contrast, a convolutional neural network (CNN) recognizes objects from characteristic shapes in a training data set. Using a CNN, this study evaluates the probabilities that the pattern of lunar maria is categorized as the shape of a crab, a lion, or a hare. If Mare Frigoris (a dark band on the moon) is included in the lunar image, the lion is recognized. However, in an image without Mare Frigoris, the hare has the highest probability of recognition. Thus, the recognition of objects similar to the lunar pattern depends on which part of the lunar maria is taken into account. In human recognition, before we find similarities between the lunar maria and objects such as animals, we may be persuaded in advance to see a particular image from our culture and environment and then adjust the lunar pattern to the shape of the imagined object.
• We introduce Eigen Evolution Pooling, an efficient method to aggregate a sequence of feature vectors. Eigen evolution pooling is designed to produce compact feature representations for a sequence of feature vectors while preserving as much information about the sequence as possible, especially the temporal evolution of the features. Eigen evolution pooling is a general pooling method that can be applied to any sequence of feature vectors, from low-level RGB values to high-level Convolutional Neural Network (CNN) feature vectors. We show that eigen evolution pooling is more effective than average, max, and rank pooling for encoding the dynamics of human actions in video. We demonstrate the power of eigen evolution pooling on the UCF101 and Hollywood2 datasets, two human action recognition benchmarks, and achieve state-of-the-art performance.
• 3D pose estimation is a key component of many important computer vision tasks such as autonomous navigation and 3D scene understanding. Most state-of-the-art approaches to 3D pose estimation solve this problem as a pose-classification problem in which the pose space is discretized into bins and a CNN classifier is used to predict a pose bin. We argue that the 3D pose space is continuous and propose to solve the pose estimation problem in a CNN regression framework with a suitable representation, data augmentation and loss function that captures the geometry of the pose space. Experiments on PASCAL3D+ show that the proposed 3D pose regression approach achieves competitive performance compared to the state-of-the-art.
• Simultaneous Localization and Mapping (SLAM) systems use commodity visible/near-visible digital sensors coupled with processing units that detect, recognize and track image points in a camera stream. These systems are cheap, fast and make use of readily available camera technologies. However, SLAM systems suffer from drift as well as sensitivity to lighting variation such as shadows and changing brightness. Beaconless SLAM systems will continue to suffer from this inherent drift problem irrespective of improvements in on-board camera resolution, speed and inertial sensor precision. To cancel out destructive forms of drift, relocalization algorithms use known detected landmarks together with loop closure processes to continually readjust the current location and orientation estimates to match "known" positions. However, this is inherently problematic because these landmarks themselves may have been recorded with errors, and they may also change under different illumination conditions. In this note we describe a unique beacon light coding system which is robust to desynchronized clock bit drift. The described beacons and codes are designed to be used in industrial or consumer environments for full standalone 6-DoF tracking or as known error-free landmarks in a SLAM pipeline.
• Salient object detection has seen remarkable progress driven by deep learning techniques. However, most deep-learning-based salient object detection methods are black-box in nature and lack interpretability. This paper proposes the first self-explanatory saliency detection network that explicitly exploits low- and high-level features for salient object detection. We demonstrate that such supportive clues not only significantly enhance the performance of salient object detection but also give better-justified detection results. More specifically, we develop a multi-stage saliency encoder to extract multi-scale features which contain both low- and high-level saliency context. Dense short- and long-range connections are introduced to reuse these features iteratively. Benefiting from direct access to low- and high-level features, the proposed saliency encoder can not only model the object context but also preserve the boundary. Furthermore, a self-explanatory generator is proposed to interpret how the proposed saliency encoder or other deep saliency models make decisions. The generator simulates the absence of interesting features by preventing these features from contributing to the saliency classifier and estimates the corresponding saliency prediction without these features. A comparison function, saliency explanation, is defined to measure the prediction changes between deep saliency models and the corresponding generator. By visualizing the differences, we can interpret the capability of different deep-neural-network-based saliency detection models and demonstrate that our proposed model indeed uses a more reasonable structure for salient object detection. Extensive experiments on five popular benchmark datasets and the visualized saliency explanation demonstrate that the proposed method achieves new state-of-the-art results.
• Convolutional neural networks provide an end-to-end solution for training many computer vision tasks and have achieved great success. However, the design of network architectures usually relies heavily on expert knowledge and is hand-crafted. In this paper, we provide a solution to automatically and efficiently design high-performance network architectures. To reduce the search space of network design, we focus on constructing network blocks, which can be stacked to generate the whole network. Blocks are generated by an agent, which is trained with Q-learning to maximize the expected accuracy of the searched blocks on the learning task. A distributed asynchronous framework and an early-stop strategy are used to accelerate the training process. Our experimental results demonstrate that the network architectures designed by our approach perform competitively compared with hand-crafted state-of-the-art networks. We trained the Q-learning agent on CIFAR-100 and evaluated on CIFAR-10 and ImageNet; the designed block structure achieved 3.60% error on CIFAR-10 and a competitive result on ImageNet. The Q-learning process can be trained efficiently on only 32 GPUs in 3 days.
• Line separators are used to segregate text lines from one another in document image analysis. Finding the separator points at every line terminal in a document image would enable text-line segmentation. Identifying the separators in handwritten text is particularly demanding. Performing this directly on the compressed version of a document image is more challenging still, and that is the objective of this research. Such an approach avoids the computational burden of decompressing a document for text-line segmentation. Since document images are generally compressed using the run-length encoding (RLE) technique as per the CCITT standards, the first column in the RLE will be a white column. The value (depth) in the white column is very low when a particular row belongs to a text line, and the depth can be larger at the point of text-line separation. A longer consecutive sequence of such larger depths should indicate the gap between text lines, which provides the separator region. In cases of over-separation and under-separation, corrective actions such as deletion and insertion, respectively, are suggested. Extensive experiments are conducted on the compressed images of the benchmark datasets of ICDAR13 and Alireza et al. [17] to demonstrate the efficacy.
• Aug 21 2017 cs.CV arXiv:1708.05543v1
In the era of autonomous driving, urban mapping represents a core step in letting vehicles interact with the urban context. Successful mapping algorithms proposed in the last decade build the map by leveraging data from a single sensor. The focus of the system presented in this paper is twofold: the joint estimation of a 3D map from lidar data and images, based on a 3D mesh, and its texturing. Indeed, even though most surveying vehicles for mapping are equipped with cameras and lidar, existing mapping algorithms usually rely on either images or lidar data; moreover, both image-based and lidar-based systems often represent the map as a point cloud, while a continuous textured mesh representation would be useful for visualization and navigation purposes. In the proposed framework, we combine the accuracy of the 3D lidar data with the dense information and appearance carried by the images, estimating a visibility-consistent map from the lidar measurements and refining it photometrically through the acquired images. We evaluate the proposed framework on the KITTI dataset and show the performance improvement with respect to two state-of-the-art urban mapping algorithms and two surface reconstruction algorithms widely used in Computer Graphics.
• Retrieval of text information from natural scene images and video frames is a challenging task due to inherent problems such as complex character shapes, low resolution, and background noise. Available OCR systems often fail to retrieve such information in scene/video frames. Keyword spotting, an alternative way to retrieve information, performs efficient text searching in such scenarios. However, current word spotting techniques for scene/video images are script-specific and are mainly developed for Latin script. This paper presents a novel word spotting framework using dynamic shape coding for text retrieval in natural scene images and video frames. The framework is designed to search for a query keyword across multiple scripts with the help of on-the-fly keyword generation for the corresponding script. We use a two-stage word spotting approach based on a Hidden Markov Model (HMM) to detect the translated keyword in a given text line by identifying the script of the line. A novel unsupervised scheme based on dynamic shape coding is used to group similarly shaped characters to avoid confusion and to improve text alignment. Next, the hypothesized locations are verified to improve retrieval performance. To evaluate the proposed system for keyword search in natural scene images and video frames, we consider two popular Indic scripts, Bangla (Bengali) and Devanagari, along with English. Inspired by the zone-wise recognition approach in Indic scripts[1], zone-wise text information is used to improve on traditional word spotting performance in Indic scripts. For our experiment, a dataset consisting of images of different scenes and video frames in English, Bangla and Devanagari scripts was considered. The results show the effectiveness of the proposed word spotting approach.
• This paper presents a novel method for fully automatic and convenient extrinsic calibration of a 3D LiDAR and a panoramic camera with a normally printed chessboard. The proposed method is based on the 3D corner estimation of the chessboard from the sparse point cloud generated by one frame scan of the LiDAR. To estimate the corners, we formulate a full-scale model of the chessboard and fit it to the segmented 3D points of the chessboard. The model is fitted by optimizing the cost function under constraints of correlation between the reflectance intensity of laser and the color of the chessboard's patterns. Powell's method is introduced for resolving the discontinuity problem in optimization. The corners of the fitted model are considered as the 3D corners of the chessboard. Once the corners of the chessboard in the 3D point cloud are estimated, the extrinsic calibration of the two sensors is converted to a 3D-2D matching problem. The corresponding 3D-2D points are used to calculate the absolute pose of the two sensors with Unified Perspective-n-Point (UPnP). Further, the calculated parameters are regarded as initial values and are refined using the Levenberg-Marquardt method. The performance of the proposed corner detection method from the 3D point cloud is evaluated using simulations. The results of experiments, conducted on a Velodyne HDL-32e LiDAR and a Ladybug3 camera under the proposed re-projection error metric, qualitatively and quantitatively demonstrate the accuracy and stability of the final extrinsic calibration parameters.
• Person re-identification (Re-ID) aims at matching images of the same person across disjoint camera views, which is a challenging problem in the multimedia analysis, multimedia editing and content-based media retrieval communities. The major challenge lies in how to preserve the similarity of the same person across video footage with large appearance variations, while discriminating between different individuals. To address this problem, conventional methods usually consider the pairwise similarity between persons by measuring only the point-to-point (P2P) distance. In this paper, we propose to use a deep learning technique to model a novel set-to-set (S2S) distance, in which the underlying objective focuses on preserving the compactness of intra-class samples for each camera view, while maximizing the margin between the intra-class set and the inter-class set. The S2S distance metric consists of three terms, namely the class-identity term, the relative distance term and the regularization term. The class-identity term keeps the intra-class samples within each camera view gathered together, the relative distance term maximizes the distance between the intra-class set and the inter-class set across different camera views, and the regularization term smooths the parameters of the deep convolutional neural network (CNN). As a result, the final learned deep model can effectively find the matching target for the probe object among various candidates in the video gallery by learning discriminative and stable feature representations. Using the CUHK01, CUHK03, PRID2011 and Market1501 benchmark datasets, we conducted extensive comparative evaluations to demonstrate the advantages of our method over state-of-the-art approaches.
• Automatic generation of facial images has been well studied since the Generative Adversarial Network (GAN) was introduced. There exist some attempts to apply the GAN model to the problem of generating facial images of anime characters, but none of the existing work gives a promising result. In this work, we explore the training of GAN models specialized on an anime facial image dataset. We address the issue from both the data and the model aspects, by collecting a cleaner, better-suited dataset and by leveraging a proper, empirical application of DRAGAN. With quantitative analysis and case studies we demonstrate that our efforts lead to a stable and high-quality model. Moreover, to assist people with anime character design, we build a website (http://make.girls.moe) with our pre-trained model available online, which makes the model easily accessible to the general public.
• Deep neural networks (DNNs) have demonstrated impressive performance on a wide array of tasks, but they are usually considered opaque since internal structure and learned parameters are not interpretable. In this paper, we re-examine the internal representations of DNNs using adversarial images, which are generated by an ensemble-optimization algorithm. We find that: (1) the neurons in DNNs do not truly detect semantic objects/parts, but respond to objects/parts only as recurrent discriminative patches; (2) deep visual representations are not robust distributed codes of visual concepts because the representations of adversarial images are largely not consistent with those of real images, although they have similar visual appearance, both of which are different from previous findings. To further improve the interpretability of DNNs, we propose an adversarial training scheme with a consistent loss such that the neurons are endowed with human-interpretable concepts. The induced interpretable representations enable us to trace eventual outcomes back to influential neurons. Therefore, human users can know how the models make predictions, as well as when and why they make errors.
• Variations of deep neural networks such as the convolutional neural network (CNN) have been successfully applied to image denoising. The goal is to automatically learn a mapping from a noisy image to a clean image given training data consisting of pairs of noisy and clean image patches. Most existing CNN models for image denoising have many layers. In such cases, the models involve a large number of parameters and are computationally expensive to train. In this paper, we develop a dilated residual CNN for Gaussian image denoising. Compared with the recently proposed residual denoiser, our method can achieve comparable performance with less computational cost. Specifically, we enlarge the receptive field by adopting dilated convolution in a residual network, and the dilation factor is set to a certain value. Appropriate zero padding is utilized to make the dimension of the output the same as the input. It has been proven that the expansion of the receptive field can boost CNN performance in image classification, and we further demonstrate that it can also lead to competitive performance for the denoising problem. Moreover, we present a formula to calculate the receptive field size when dilated convolution is incorporated. Thus, the change of receptive field can be interpreted mathematically. To validate the efficacy of our approach, we conduct extensive experiments for both gray and color image denoising with specific or randomized noise levels. Both the quantitative measurements and the visual results of denoising are promising compared with state-of-the-art baselines.
• We propose a new deep learning approach for automatic detection and segmentation of fluid within retinal OCT images. The proposed framework utilizes both ResNet and Encoder-Decoder neural network architectures. When training the network, we apply a novel data augmentation method called myopic warping together with standard rotation-based augmentation to increase the training set size to 45 times the original amount. Finally, the network output is post-processed with an energy minimization algorithm (graph cut) along with a few other knowledge-guided morphological operations to finalize the segmentation process. Based on OCT imaging data and its ground truth from the RETOUCH challenge, the proposed system achieves Dice indices of 0.522, 0.682, and 0.612, and average absolute volume differences of 0.285, 0.115, and 0.156 mm$^3$ for intraretinal fluid, subretinal fluid, and pigment epithelial detachment, respectively.
• This paper presents a novel approach to reconstruct complete 3D deformable models over time with a single depth camera. The approach proceeds as follows. The partial surfaces reconstructed at various capture times are assembled to form a complete 3D surface. A mesh warping algorithm based on linear mesh deformation is used to align the different partial surfaces. A volumetric method is then applied to combine the partial surfaces, fill holes and smooth alignment errors.
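
The compressed-domain text-line segmentation abstract above describes locating separator regions as long consecutive runs of rows whose leading white run is large. The following is a minimal sketch of that idea only, not the authors' implementation: the function name, the `depth_threshold` and `min_rows` parameters, and the sample values are all assumptions made for illustration, and the corrective insertion/deletion steps mentioned in the abstract are not modelled.

```python
def find_separator_regions(first_white_runs, depth_threshold, min_rows):
    """Return inclusive (start_row, end_row) pairs of candidate text-line gaps.

    first_white_runs[i] is the length of the leading white run of row i in a
    run-length-encoded image. Rows whose leading run is at least
    `depth_threshold` are treated as blank; a gap is reported only when at
    least `min_rows` such rows occur consecutively. Both thresholds are
    hypothetical tuning parameters.
    """
    regions = []
    start = None  # first row of the blank stretch currently being tracked
    for row, depth in enumerate(first_white_runs):
        if depth >= depth_threshold:
            if start is None:
                start = row
        else:
            if start is not None and row - start >= min_rows:
                regions.append((start, row - 1))
            start = None
    # Close a blank stretch that runs to the bottom of the image.
    if start is not None and len(first_white_runs) - start >= min_rows:
        regions.append((start, len(first_white_runs) - 1))
    return regions


# Hypothetical 10-row image, 100 pixels wide.
runs = [100, 100, 100, 5, 8, 3, 100, 100, 4, 100]
print(find_separator_regions(runs, depth_threshold=90, min_rows=2))
```

Requiring a minimum number of consecutive rows is one simple way to ignore isolated blank rows inside a text line; skewed or touching handwritten lines would need the additional corrections the abstract alludes to.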
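
The dilated residual denoising abstract above mentions a formula for the receptive field under dilated convolution but does not reproduce it. As a hedged illustration, the sketch below uses the standard relation for a stack of stride-1 convolutions, in which a layer with kernel size k and dilation d widens the receptive field by (k - 1) * d. The function name and the example dilation schedules are assumptions, not values taken from the paper.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs. Uses the
    standard relation RF = 1 + sum((k - 1) * d), which is assumed here and
    is not necessarily the formula derived in the paper.
    """
    rf = 1
    for kernel_size, dilation in layers:
        if kernel_size < 1 or dilation < 1:
            raise ValueError("kernel_size and dilation must be >= 1")
        rf += (kernel_size - 1) * dilation
    return rf


# Hypothetical schedules: three 3x3 layers, without and with dilation.
plain = [(3, 1), (3, 1), (3, 1)]
dilated = [(3, 1), (3, 2), (3, 4)]
print(receptive_field(plain), receptive_field(dilated))
```

For an odd kernel size, zero padding of d * (k - 1) / 2 on each side keeps a stride-1 layer's output the same size as its input, which matches the padding behaviour the abstract describes.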
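
The retinal OCT abstract above reports results as Dice indices and absolute volume differences. For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly computed on binary masks. The function names, the flat-list mask representation, the empty-mask convention, and the `voxel_volume_mm3` parameter are assumptions for illustration and do not come from the RETOUCH evaluation code.

```python
def dice_index(pred, truth):
    """Dice index 2|A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def absolute_volume_difference(pred, truth, voxel_volume_mm3):
    """|V_pred - V_truth| in mm^3, given a (hypothetical) per-voxel volume."""
    pred_voxels = sum(1 for p in pred if p)
    truth_voxels = sum(1 for t in truth if t)
    return abs(pred_voxels - truth_voxels) * voxel_volume_mm3


# Toy four-voxel masks, purely illustrative.
pred_mask = [1, 1, 0, 0]
truth_mask = [1, 0, 1, 0]
print(dice_index(pred_mask, truth_mask))
print(absolute_volume_difference([1, 1, 1, 0], [1, 0, 0, 0], 0.01))
```

Dice rewards overlap and is insensitive to true negatives, while the volume difference ignores where the fluid is located, which is why the two are typically reported together.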

### Recent comments

Eddie Smolansky May 26 2017 05:23 UTC

Updated summary [here](https://github.com/eddiesmo/papers).

# How they made the dataset
- collect youtube videos
- automated filtering with yolo and landmark detection projects
- crowd source final filtering (AMT - give 50 face images to turks and ask which don't belong)
- quality control through s

Qian Wang Mar 07 2017 17:21 UTC

"To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics."
Can anyone explain a bit about this?

Jaiden Mispy Jul 09 2015 10:12 UTC

There's also a [docker image](http://ryankennedy.io/running-the-deep-dream/) if you want to play with it, though if you're on Linux or OS X you might want to install everything natively in order to get GPU acceleration (the gradient ascent can be quite slow on higher layers in the network)

Jaiden Mispy Jul 09 2015 08:13 UTC

The image recognition model described here is the one responsible for [deepdream](http://github.com/google/deepdream).

![deepdream nebula][1]

[1]: https://pbs.twimg.com/media/CI_EASXWcAAGXnK.jpg