Mar 20 2017 cs.CV
In this paper, we propose the first higher frame rate video dataset (called Need for Speed - NfS) and benchmark for visual object tracking. The dataset consists of 100 videos (380K frames) captured with now commonly available higher frame rate (240 FPS) cameras from real world scenarios. All frames are annotated with axis aligned bounding boxes and all sequences are manually labelled with nine visual attributes - such as occlusion, fast motion, background clutter, etc. Our benchmark provides an extensive evaluation of many recent and state-of-the-art trackers on higher frame rate sequences. We ranked each of these trackers according to their tracking accuracy and real-time performance. One of our surprising conclusions is that at higher frame rates, simple trackers such as correlation filters outperform complex methods based on deep networks. This suggests that for practical applications (such as in robotics or embedded vision), one needs to carefully tradeoff bandwidth constraints associated with higher frame rate acquisition, computational costs of real-time analysis, and the required application accuracy. Our dataset and benchmark allows for the first time (to our knowledge) systematic exploration of such issues, and will be made available to allow for further research in this space.
Mar 16 2017 cs.CV
Correlation Filters (CFs) have recently demonstrated excellent performance in terms of rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn - "on the fly" - how the object is changing over time. A fundamental drawback to CFs, however, is that the background of the object is not be modelled over time which can result in suboptimal results. In this paper we propose a Background-Aware CF that can model how both the foreground and background of the object varies over time. Our approach, like conventional CFs, is extremely computationally efficient - and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to the state-of-the-art trackers including those based on a deep learning paradigm.
Dec 19 2016 cs.CV
In this paper, we present our method for enabling dense SDM to run at over 90 FPS on a mobile device. Our contributions are two-fold. Drawing inspiration from the FFT, we propose a Sparse Compositional Regression (SCR) framework, which enables a significant speed up over classical dense regressors. Second, we propose a binary approximation to SIFT features. Binary Approximated SIFT (BASIFT) features, which are a computationally efficient approximation to SIFT, a commonly used feature with SDM. We demonstrate the performance of our algorithm on an iPhone 7, and show that we achieve similar accuracy to SDM.
In this paper, we establish a theoretical connection between the classical Lucas & Kanade (LK) algorithm and the emerging topic of Spatial Transformer Networks (STNs). STNs are of interest to the vision and learning communities due to their natural ability to combine alignment and classification within the same theoretical framework. Inspired by the Inverse Compositional (IC) variant of the LK algorithm, we present Inverse Compositional Spatial Transformer Networks (IC-STNs). We demonstrate that IC-STNs can achieve better performance than conventional STNs with less model capacity; in particular, we show superior performance in pure image alignment tasks as well as joint alignment/classification problems on real-world problems.
We propose a novel algorithm for the joint refinement of structure and motion parameters from image data directly without relying on fixed and known correspondences. In contrast to traditional bundle adjustment (BA) where the optimal parameters are determined by minimizing the reprojection error using tracked features, the proposed algorithm relies on maximizing the photometric consistency and estimates the correspondences implicitly. Since the proposed algorithm does not require correspondences, its application is not limited to corner-like structure; any pixel with nonvanishing gradient could be used in the estimation process. Furthermore, we demonstrate the feasibility of refining the motion and structure parameters simultaneously using the photometric in unconstrained scenes and without requiring restrictive assumptions such as planarity. The proposed algorithm is evaluated on range of challenging outdoor datasets, and it is shown to improve upon the accuracy of the state-of-the-art VSLAM methods obtained using the minimization of the reprojection error using traditional BA as well as loop closure.
Feature descriptors, such as SIFT and ORB, are well-known for their robustness to illumination changes, which has made them popular for feature-based VSLAM\@. However, in degraded imaging conditions such as low light, low texture, blur and specular reflections, feature extraction is often unreliable. In contrast, direct VSLAM methods which estimate the camera pose by minimizing the photometric error using raw pixel intensities are often more robust to low textured environments and blur. Nonetheless, at the core of direct VSLAM is the reliance on a consistent photometric appearance across images, otherwise known as the brightness constancy assumption. Unfortunately, brightness constancy seldom holds in real world applications. In this work, we overcome brightness constancy by incorporating feature descriptors into a direct visual odometry framework. This combination results in an efficient algorithm that combines the strength of both feature-based algorithms and direct methods. Namely, we achieve robustness to arbitrary photometric variations while operating in low-textured and poorly lit environments. Our approach utilizes an efficient binary descriptor, which we call Bit-Planes, and show how it can be used in the gradient-based optimization required by direct methods. Moreover, we show that the squared Euclidean distance between Bit-Planes is equivalent to the Hamming distance. Hence, the descriptor may be used in least squares optimization without sacrificing its photometric invariance. Finally, we present empirical results that demonstrate the robustness of the approach in poorly lit underground environments.
Mar 30 2016 cs.CV
The Lucas & Kanade (LK) algorithm is the method of choice for efficient dense image and object alignment. The approach is efficient as it attempts to model the connection between appearance and geometric displacement through a linear relationship that assumes independence across pixel coordinates. A drawback of the approach, however, is its generative nature. Specifically, its performance is tightly coupled with how well the linear model can synthesize appearance from geometric displacement, even though the alignment task itself is associated with the inverse problem. In this paper, we present a new approach, referred to as the Conditional LK algorithm, which: (i) directly learns linear models that predict geometric displacement as a function of appearance, and (ii) employs a novel strategy for ensuring that the generative pixel independence assumption can still be taken advantage of. We demonstrate that our approach exhibits superior performance to classical generative forms of the LK algorithm. Furthermore, we demonstrate its comparable performance to state-of-the-art methods such as the Supervised Descent Method with substantially less training examples, as well as the unique ability to "swap" geometric warp functions without having to retrain from scratch. Finally, from a theoretical perspective, our approach hints at possible redundancies that exist in current state-of-the-art methods for alignment that could be leveraged in vision systems of the future.
Feb 02 2016 cs.CV
Binary descriptors have been instrumental in the recent evolution of computationally efficient sparse image alignment algorithms. Increasingly, however, the vision community is interested in dense image alignment methods, which are more suitable for estimating correspondences from high frame rate cameras as they do not rely on exhaustive search. However, classic dense alignment approaches are sensitive to illumination change. In this paper, we propose an easy to implement and low complexity dense binary descriptor, which we refer to as bit-planes, that can be seamlessly integrated within a multi-channel Lucas & Kanade framework. This novel approach combines the robustness of binary descriptors with the speed and accuracy of dense alignment methods. The approach is demonstrated on a template tracking problem achieving state-of-the-art robustness and faster than real-time performance on consumer laptops (400+ fps on a single core Intel i7) and hand-held mobile devices (100+ fps on an iPad Air 2).
Sep 07 2015 cs.CV
In this paper we tackle the problem of efficient video event detection. We argue that linear detection functions should be preferred in this regard due to their scalability and efficiency during estimation and evaluation. A popular approach in this regard is to represent a sequence using a bag of words (BOW) representation due to its: (i) fixed dimensionality irrespective of the sequence length, and (ii) its ability to compactly model the statistics in the sequence. A drawback to the BOW representation, however, is the intrinsic destruction of the temporal ordering information. In this paper we propose a new representation that leverages the uncertainty in relative temporal alignments between pairs of sequences while not destroying temporal ordering. Our representation, like BOW, is of a fixed dimensionality making it easily integrated with a linear detection function. Extensive experiments on CK+, 6DMG, and UvA-NEMO databases show significant performance improvements across both isolated and continuous event detection tasks.
In this technical document we present a proof on the uniqueness of group sparse coding through the block ACS theorem. Leveraging the original ACS theorem of Hillar and Sommer for sparse coding, we demonstrate a similar uniqueness property holds for the task of group sparse coding.
May 19 2015 cs.CV
Determining dense semantic correspondences across objects and scenes is a difficult problem that underpins many higher-level computer vision algorithms. Unlike canonical dense correspondence problems which consider images that are spatially or temporally adjacent, semantic correspondence is characterized by images that share similar high-level structures whose exact appearance and geometry may differ. Motivated by object recognition literature and recent work on rapidly estimating linear classifiers, we treat semantic correspondence as a constrained detection problem, where an exemplar LDA classifier is learned for each pixel. LDA classifiers have two distinct benefits: (i) they exhibit higher average precision than similarity metrics typically used in correspondence problems, and (ii) unlike exemplar SVM, can output globally interpretable posterior probabilities without calibration, whilst also being significantly faster to train. We pose the correspondence problem as a graphical model, where the unary potentials are computed via convolution with the set of exemplar classifiers, and the joint potentials enforce smoothly varying correspondence assignment.
Jul 09 2014 cs.CV
Gradient-descent methods have exhibited fast and reliable performance for image alignment in the facial domain, but have largely been ignored by the broader vision community. They require the image function be smooth and (numerically) differentiable -- properties that hold for pixel-based representations obeying natural image statistics, but not for more general classes of non-linear feature transforms. We show that transforms such as Dense SIFT can be incorporated into a Lucas Kanade alignment framework by predicting descent directions via regression. This enables robust matching of instances from general object categories whilst maintaining desirable properties of Lucas Kanade such as the capacity to handle high-dimensional warp parametrizations and a fast rate of convergence. We present alignment results on a number of objects from ImageNet, and an extension of the method to unsupervised joint alignment of objects from a corpus of images.
Jun 11 2014 cs.CV
Sparse and convolutional constraints form a natural prior for many optimization problems that arise from physical processes. Detecting motifs in speech and musical passages, super-resolving images, compressing videos, and reconstructing harmonic motions can all leverage redundancies introduced by convolution. Solving problems involving sparse and convolutional constraints remains a difficult computational problem, however. In this paper we present an overview of convolutional sparse coding in a consistent framework. The objective involves iteratively optimizing a convolutional least-squares term for the basis functions, followed by an L1-regularized least squares term for the sparse coefficients. We discuss a range of optimization methods for solving the convolutional sparse coding objective, and the properties that make each method suitable for different applications. In particular, we concentrate on computational complexity, speed to \epsilon convergence, memory usage, and the effect of implied boundary conditions. We present a broad suite of examples covering different signal and application domains to illustrate the general applicability of convolutional sparse coding, and the efficacy of the available optimization methods.
Linear Support Vector Machines trained on HOG features are now a de facto standard across many visual perception tasks. Their popularisation can largely be attributed to the step-change in performance they brought to pedestrian detection, and their subsequent successes in deformable parts models. This paper explores the interactions that make the HOG-SVM symbiosis perform so well. By connecting the feature extraction and learning processes rather than treating them as disparate plugins, we show that HOG features can be viewed as doing two things: (i) inducing capacity in, and (ii) adding prior to a linear SVM trained on pixels. From this perspective, preserving second-order statistics and locality of interactions are key to good performance. We demonstrate surprising accuracy on expression recognition and pedestrian detection tasks, by assuming only the importance of preserving such local second-order interactions.
Apr 01 2014 cs.CV
Correlation filters take advantage of specific properties in the Fourier domain allowing them to be estimated efficiently: O(NDlogD) in the frequency domain, versus O(D^3 + ND^2) spatially where D is signal length, and N is the number of signals. Recent extensions to correlation filters, such as MOSSE, have reignited interest of their use in the vision community due to their robustness and attractive computational properties. In this paper we demonstrate, however, that this computational efficiency comes at a cost. Specifically, we demonstrate that only 1/D proportion of shifted examples are unaffected by boundary effects which has a dramatic effect on detection/tracking performance. In this paper, we propose a novel approach to correlation filter estimation that: (i) takes advantage of inherent computational redundancies in the frequency domain, and (ii) dramatically reduces boundary effects. Impressive object tracking and detection results are presented in terms of both accuracy and computational efficiency.
Mar 31 2014 cs.CV
Computer vision is increasingly becoming interested in the rapid estimation of object detectors. Canonical hard negative mining strategies are slow as they require multiple passes of the large negative training set. Recent work has demonstrated that if the distribution of negative examples is assumed to be stationary, then Linear Discriminant Analysis (LDA) can learn comparable detectors without ever revisiting the negative set. Even with this insight, however, the time to learn a single object detector can still be on the order of tens of seconds on a modern desktop computer. This paper proposes to leverage the resulting structured covariance matrix to obtain detectors with identical performance in orders of magnitude less time and memory. We elucidate an important connection to the correlation filter literature, demonstrating that these can also be trained without ever revisiting the negative set.