Computer Vision and Pattern Recognition (cs.CV)

  • PDF
    A proper semantic representation for encoding side information is key to the success of zero-shot learning. In this paper, we explore two alternative semantic representations especially for zero-shot human action recognition: textual descriptions of human actions and deep features extracted from still images relevant to human actions. Such side information are accessible on Web with little cost, which paves a new way in gaining side information for large-scale zero-shot human action recognition. We investigate different encoding methods to generate semantic representations for human actions from such side information. Based on our zero-shot visual recognition method, we conducted experiments on UCF101 and HMDB51 to evaluate two proposed semantic representations . The results suggest that our proposed text- and image-based semantic representations outperform traditional attributes and word vectors considerably for zero-shot human action recognition. In particular, the image-based semantic representations yield the favourable performance even though the representation is extracted from a small number of images per class.
  • PDF
    We took part in the YouTube-8M Video Understanding Challenge hosted on Kaggle, and achieved the 10th place within less than one month's time. In this paper, we present an extensive analysis and solution to the underlying machine-learning problem based on frame-level data, where major challenges are identified and corresponding preliminary methods are proposed. It's noteworthy that, with merely the proposed strategies and uniformly-averaging multi-crop ensemble was it sufficient for us to reach our ranking. We also report the methods we believe to be promising but didn't have enough time to train to convergence. We hope this paper could serve, to some extent, as a review and guideline of the YouTube-8M multi-label video classification benchmark, inspiring future attempts and research.
  • PDF
    We tackle the task of semi-supervised video object segmentation, i.e. segmenting the pixels belonging to an object in the video using the ground truth pixel mask for the first frame. We build on the recently introduced one-shot video object segmentation (OSVOS) approach which uses a pretrained network and fine-tunes it on the first frame. While achieving impressive performance, at test time OSVOS uses the fine-tuned network in unchanged form and is not able to adapt to large changes in object appearance. To overcome this limitation, we propose Online Adaptive Video Object Segmentation (OnAVOS) which updates the network online using training examples selected based on the confidence of the network and the spatial configuration. Additionally, we add a pretraining step based on objectness, which is learned on PASCAL. Our experiments show that both extensions are highly effective and improve the state of the art on DAVIS to an intersection-over-union score of 85.7%.
  • PDF
    Urban informatics explore data science methods to address different urban issues using intensively based on data. The large variety and quantity of data available should be explored but this brings important challenges. For instance, although there are powerful computer vision methods that may be explored, they may require large annotated datasets. In this work we propose a novel approach to automatically creating an object recognition system with minimal manual annotation. The basic idea behind the method is to use large input datasets using available online cameras on large cities. A off-the-shelf weak classifier is used to detect an initial set of urban elements of interest (e.g. cars, pedestrians, bikes, etc.). Such initial dataset undergoes a quality control procedure and it is subsequently used to fine tune a strong classifier. Quality control and comparative performance assessment are used as part of the pipeline. We evaluate the method for detecting cars based on monitoring cameras. Experimental results using real data show that despite losing generality, the final detector provides better detection rates tailored to the selected cameras. The programmed robot gathered 770 video hours from 24 online city cameras (\~300GB), which has been fed to the proposed system. Our approach has shown that the method nearly doubled the recall (93\%) with respect to state-of-the-art methods using off-the-shelf algorithms.
  • PDF
    High-resolution satellite imagery have been increasingly used on remote sensing classification problems. One of the main factors is the availability of this kind of data. Even though, very little effort has been placed on the zebra crossing classification problem. In this letter, crowdsourcing systems are exploited in order to enable the automatic acquisition and annotation of a large-scale satellite imagery database for crosswalks related tasks. Then, this dataset is used to train deep-learning-based models in order to accurately classify satellite images that contains or not zebra crossings. A novel dataset with more than 240,000 images from 3 continents, 9 countries and more than 20 cities was used in the experiments. Experimental results showed that freely available crowdsourcing data can be used to accurately (97.11%) train robust models to perform crosswalk classification on a global scale.
  • PDF
    Class-agnostic object tracking is particularly difficult in cluttered environments as target specific discriminative models cannot be learned a priori. Inspired by how the human visual cortex employs spatial attention and separate "where" and "what" processing pathways to actively suppress irrelevant visual features, this work develops a hierarchical attentive recurrent model for single object tracking in videos. The first layer of attention discards the majority of background by selecting a region containing the object of interest, while the subsequent layers tune in on visual features particular to the tracked object. This framework is fully differentiable and can be trained in a purely data driven fashion by gradient methods. To improve training convergence, we augment the loss function with terms for a number of auxiliary tasks relevant for tracking. Evaluation of the proposed model is performed on two datasets of increasing difficulty: pedestrian tracking on the KTH activity recognition dataset and the KITTI object tracking dataset.
  • PDF
    We present a parameterized approach to produce personalized variable length summaries of soccer matches. Our approach is based on temporally segmenting the soccer video into 'plays', associating a user-specifiable 'utility' for each type of play and using 'bin-packing' to select a subset of the plays that add up to the desired length while maximizing the overall utility (volume in bin-packing terms). Our approach systematically allows a user to override the default weights assigned to each type of play with individual preferences and thus see a highly personalized variable length summarization of soccer matches. We demonstrate our approach based on the output of an end-to-end pipeline that we are building to produce such summaries. Though aspects of the overall end-to-end pipeline are human assisted at present, the results clearly show that the proposed approach is capable of producing semantically meaningful and compelling summaries. Besides the obvious use of producing summaries of superior league matches for news broadcasts, we anticipate our work to promote greater awareness of the local matches and junior leagues by producing consumable summaries of them.
  • PDF
    This paper introduces a new real-time object detection approach named Yes-Net. It realizes the prediction of bounding boxes and class via single neural network like YOLOv2 and SSD, but owns more efficient and outstanding features. It combines local information with global information by adding the RNN architecture as a packed unit in CNN model to form the basic feature extractor. Independent anchor boxes coming from full-dimension k-means is also applied in Yes-Net, it brings better average IOU than grid anchor box. In addition, instead of NMS, Yes-Net uses RNN as a filter to get the final boxes, which is more efficient. For 416 x 416 input, Yes-Net achieves 74.3% mAP on VOC2007 test at 39 FPS on an Nvidia Titan X Pascal.
  • PDF
    In this paper, we propose a principled Perceptual Adversarial Networks (PAN) for image-to-image transformation tasks. Unlike existing application-specific algorithms, PAN provides a generic framework of learning mapping relationship between paired images (Fig. 1), such as mapping a rainy image to its de-rained counterpart, object edges to its photo, semantic labels to a scenes image, etc. The proposed PAN consists of two feed-forward convolutional neural networks (CNNs), the image transformation network T and the discriminative network D. Through combining the generative adversarial loss and the proposed perceptual adversarial loss, these two networks can be trained alternately to solve image-to-image transformation tasks. Among them, the hidden layers and output of the discriminative network D are upgraded to continually and automatically discover the discrepancy between the transformed image and the corresponding ground-truth. Simultaneously, the image transformation network T is trained to minimize the discrepancy explored by the discriminative network D. Through the adversarial training process, the image transformation network T will continually narrow the gap between transformed images and ground-truth images. Experiments evaluated on several image-to-image transformation tasks (e.g., image de-raining, image inpainting, etc.) show that the proposed PAN outperforms many related state-of-the-art methods.
  • PDF
    The Classification of medical images and illustrations in the literature aims to label a medical image according to the modality it was produced or label an illustration according to its production attributes. It is an essential and challenging research hotspot in the area of automated literature review, retrieval and mining. The significant intra-class variation and inter-class similarity caused by the diverse imaging modalities and various illustration types brings a great deal of difficulties to the problem. In this paper, we propose a synergic deep learning (SDL) model to address this issue. Specifically, a dual deep convolutional neural network with a synergic signal system is designed to mutually learn image representation. The synergic signal is used to verify whether the input image pair belongs to the same category and to give the corrective feedback if a synergic error exists. Our SDL model can be trained 'end to end'. In the test phase, the class label of an input can be predicted by averaging the likelihood probabilities obtained by two convolutional neural network components. Experimental results on the ImageCLEF2016 Subfigure Classification Challenge suggest that our proposed SDL model achieves the state-of-the art performance in this medical image classification problem and its accuracy is higher than that of the first place solution on the Challenge leader board so far.
  • PDF
    The recent phenomenal interest in convolutional neural networks (CNNs) must have made it inevitable for the super-resolution (SR) community to explore its potential. The response has been immense and in the last three years, since the advent of the pioneering work, there appeared too many works not to warrant a comprehensive survey. This paper surveys the SR literature in the context of deep learning. We focus on the three important aspects of multimedia - namely image, video and multi-dimensions, especially depth maps. In each case, first relevant benchmarks are introduced in the form of datasets and state of the art SR methods, excluding deep learning. Next is a detailed analysis of the individual works, each including a short description of the method and a critique of the results with special reference to the benchmarking done. This is followed by minimum overall benchmarking in the form of comparison on some common dataset, while relying on the results reported in various works.
  • PDF
    Retinal vessel segmentation is an indispensable step for automatic detection of retinal diseases with fundoscopic images. Though many approaches have been proposed, existing methods tend to miss fine vessels or allow false positives at terminal branches. Let alone under-segmentation, over-segmentation is also problematic when quantitative studies need to measure the precise width of vessels. In this paper, we present a method that generates the precise map of retinal vessels using generative adversarial training. Our methods achieve dice coefficient of 0.829 on DRIVE dataset and 0.834 on STARE dataset which is the state-of-the-art performance on both datasets.
  • PDF
    Automatic lane tracking involves estimating the underlying signal from a sequence of noisy signal observations. Many models and methods have been proposed for lane tracking, and dynamic targets tracking in general. The Kalman Filter is a widely used method that works well on linear Gaussian models. But this paper shows that Kalman Filter is not suitable for lane tracking, because its Gaussian observation model cannot faithfully represent the procured observations. We propose using a Particle Filter on top of a novel multiple mode observation model. Experiments show that our method produces superior performance to a conventional Kalman Filter.

Recent comments

Eddie Smolansky May 26 2017 05:23 UTC

Updated summary [here](

# How they made the dataset
- collect youtube videos
- automated filtering with yolo and landmark detection projects
- crowd source final filtering (AMT - give 50 face images to turks and ask which don't belong)
- quality control through s

Qian Wang Mar 07 2017 17:21 UTC

"To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics."
Can anyone explain a bit about this?

Jaiden Mispy Jul 09 2015 10:12 UTC

There's also a [docker image]( if you want to play with it, though if you're on Linux or OS X you might want to install everything natively in order to get GPU acceleration (the gradient ascent can be quite slow on higher layers in the network)

Jaiden Mispy Jul 09 2015 08:13 UTC

The image recognition model described here is the one responsible for [deepdream](

![deepdream nebula][1]