Since their inception, CNNs have utilized some type of striding operator to reduce the overlap of receptive fields and spatial dimensions. Although having clear heuristic motivations (i.e. lowering the number of parameters to learn) the mathematical role of striding within CNN learning remains unclear. This paper offers a novel and mathematical rigorous perspective on the role of the striding operator within modern CNNs. Specifically, we demonstrate theoretically that one can always represent a CNN that incorporates striding with an equivalent non-striding CNN which has more filters and smaller size. Through this equivalence we are then able to characterize striding as an additional mechanism for parameter sharing among channels, thus reducing training complexity. Finally, the framework presented in this paper offers a new mathematical perspective on the role of striding which we hope shall facilitate and simplify the future theoretical analysis of CNNs.
Stochastic Gradient Descent (SGD) is the central workhorse for training modern CNNs. Although giving impressive empirical performance it can be slow to converge. In this paper we explore a novel strategy for training a CNN using an alternation strategy that offers substantial speedups during training. We make the following contributions: (i) replace the ReLU non-linearity within a CNN with positive hard-thresholding, (ii) reinterpret this non-linearity as a binary state vector making the entire CNN linear if the multi-layer support is known, and (iii) demonstrate that under certain conditions a global optima to the CNN can be found through local descent. We then employ a novel alternation strategy (between weights and support) for CNN training that leads to substantially faster convergence rates, nice theoretical properties, and achieving state of the art results across large scale datasets (e.g. ImageNet) as well as other standard benchmarks.
Dec 04 2017 cs.CV
The problem of obtaining dense reconstruction of an object in a natural sequence of images has been long studied in computer vision. Classically this problem has been solved through the application of bundle adjustment (BA). More recently, excellent results have been attained through the application of photometric bundle adjustment (PBA) methods -- which directly minimize the photometric error across frames. A fundamental drawback to BA & PBA, however, is: (i) their reliance on having to view all points on the object, and (ii) for the object surface to be well textured. To circumvent these limitations we propose semantic PBA which incorporates a 3D object prior, obtained through deep learning, within the photometric bundle adjustment problem. We demonstrate state of the art performance in comparison to leading methods for object reconstruction across numerous natural sequences.
Dec 04 2017 cs.CV
The ability to predict depth from a single image - using recent advances in CNNs - is of increasing interest to the vision community. Unsupervised strategies to learning are particularly appealing as they can utilize much larger and varied monocular video datasets during learning without the need for ground truth depth or stereo. In previous works, separate pose and depth CNN predictors had to be determined such that their joint outputs minimized the photometric error. Inspired by recent advances in direct visual odometry (DVO), we argue that the depth CNN predictor can be learned without a pose CNN predictor. Further, we demonstrate empirically that incorporation of a differentiable implementation of DVO, along with a novel depth normalization strategy - substantially improves performance over state of the art that use monocular videos for training.
Nov 30 2017 cs.CV
One challenge that remains open in 3D deep learning is how to efficiently represent 3D data to feed deep networks. Recent works have relied on volumetric or point cloud representations, but such approaches suffer from a number of issues such as computational complexity, unordered data, and lack of finer geometry. This paper demonstrates that a mesh representation (i.e. vertices and faces to form polygonal surfaces) is able to capture fine-grained geometry for 3D reconstruction tasks. A mesh however is also unstructured data similar to point clouds. We address this problem by proposing a learning framework to infer the parameters of a compact mesh representation rather than learning from the mesh itself. This compact representation encodes a mesh using free-form deformation and a sparse linear combination of models allowing us to reconstruct 3D meshes from single images. In contrast to prior work, we do not rely on silhouettes and landmarks to perform 3D reconstruction. We evaluate our method on synthetic and real-world datasets with very promising results. Our framework efficiently reconstructs 3D objects in a low-dimensional way while preserving its important geometrical aspects.
Nov 07 2017 cs.CV
Reconstructing 3D shapes from a sequence of images has long been a problem of interest in computer vision. Classical Structure from Motion (SfM) methods have attempted to solve this problem through projected point displacement \& bundle adjustment. More recently, deep methods have attempted to solve this problem by directly learning a relationship between geometry and appearance. There is, however, a significant gap between these two strategies. SfM tackles the problem from purely a geometric perspective, taking no account of the object shape prior. Modern deep methods more often throw away geometric constraints altogether, rendering the results unreliable. In this paper we make an effort to bring these two seemingly disparate strategies together. We introduce learned shape prior in the form of deep shape generators into Photometric Bundle Adjustment (PBA) and propose to accommodate full 3D shape generated by the shape prior within the optimization-based inference framework, demonstrating impressive results.
Aug 11 2017 cs.CV
Visual object tracking is a fundamental and time-critical vision task. Recent years have seen many shallow tracking methods based on real-time pixel-based correlation filters, as well as deep methods that have top performance but need a high-end GPU. In this paper, we learn to improve the speed of deep trackers without losing accuracy. Our fundamental insight is to take an adaptive approach, where easy frames are processed with cheap features (such as pixel values), while challenging frames are processed with invariant but expensive deep features. We formulate the adaptive tracking problem as a decision-making process, and learn an agent to decide whether to locate objects with high confidence on an early layer, or continue processing subsequent layers of a network. This significantly reduces the feed-forward cost for easy frames with distinct or slow-moving objects. We train the agent offline in a reinforcement learning fashion, and further demonstrate that learning all deep layers (so as to provide good features for adaptive tracking) can lead to near real-time average tracking speed of 23 fps on a single CPU while achieving state-of-the-art performance. Perhaps most tellingly, our approach provides a 100X speedup for almost 50% of the time, indicating the power of an adaptive approach.
Jul 25 2017 cs.CV
3D reconstruction from 2D images is a central problem in computer vision. Recent works have been focusing on reconstruction directly from a single image. It is well known however that only one image cannot provide enough information for such a reconstruction. A prior knowledge that has been entertained are 3D CAD models due to its online ubiquity. A fundamental question is how to compactly represent millions of CAD models while allowing generalization to new unseen objects with fine-scaled geometry. We introduce an approach to compactly represent a 3D mesh. Our method first selects a 3D model from a graph structure by using a novel free-form deformation FFD 3D-2D registration, and then the selected 3D model is refined to best fit the image silhouette. We perform a comprehensive quantitative and qualitative analysis that demonstrates impressive dense and realistic 3D reconstruction from single images.
Jul 18 2017 cs.CV
An emerging problem in computer vision is the reconstruction of 3D shape and pose of an object from a single image. Hitherto, the problem has been addressed through the application of canonical deep learning methods to regress from the image directly to the 3D shape and pose labels. These approaches, however, are problematic from two perspectives. First, they are minimizing the error between 3D shapes and pose labels - with little thought about the nature of this label error when reprojecting the shape back onto the image. Second, they rely on the onerous and ill-posed task of hand labeling natural images with respect to 3D shape and pose. In this paper we define the new task of pose-aware shape reconstruction from a single image, and we advocate that cheaper 2D annotations of objects silhouettes in natural images can be utilized. We design architectures of pose-aware shape reconstruction which re-project the predicted shape back on to the image using the predicted pose. Our evaluation on several object categories demonstrates the superiority of our method for predicting pose-aware 3D shapes from natural images.
Conventional methods of 3D object generative modeling learn volumetric predictions using deep networks with 3D convolutional operations, which are direct analogies to classical 2D ones. However, these methods are computationally wasteful in attempt to predict 3D shapes, where information is rich only on the surfaces. In this paper, we propose a novel 3D generative modeling framework to efficiently generate object shapes in the form of dense point clouds. We use 2D convolutional operations to predict the 3D structure from multiple viewpoints and jointly apply geometric reasoning with 2D projection optimization. We introduce the pseudo-renderer, a differentiable module to approximate the true rendering operation, to synthesize novel depth maps for optimization. Experimental results for single-image 3D object reconstruction tasks show that we outperforms state-of-the-art methods in terms of shape similarity and prediction density.
Jun 14 2017 cs.CV
In this paper the problem of complex event detection in the continuous domain (i.e. events with unknown starting and ending locations) is addressed. Existing event detection methods are limited to features that are extracted from the local spatial or spatio-temporal patches from the videos. However, this makes the model vulnerable to the events with similar concepts e.g. "Open drawer" and "Open cupboard". In this work, in order to address the aforementioned limitations we present a novel model based on the combination of semantic and temporal features extracted from video frames. We train a max-margin classifier on top of the extracted features in an adaptive framework that is able to detect the events with unknown starting and ending locations. Our model is based on the Bidirectional Region Neural Network and large margin Structural Output SVM. The generality of our model allows it to be simply applied to different labeled and unlabeled datasets. We finally test our algorithm on three challenging datasets, "UCF 101-Action Recognition", "MPII Cooking Activities" and "Hollywood", and we report state-of-the-art performance.
May 22 2017 cs.CV
In this paper we present a new approach for efficient regression based object tracking which we refer to as Deep- LK. Our approach is closely related to the Generic Object Tracking Using Regression Networks (GOTURN) framework of Held et al. We make the following contributions. First, we demonstrate that there is a theoretical relationship between siamese regression networks like GOTURN and the classical Inverse-Compositional Lucas & Kanade (IC-LK) algorithm. Further, we demonstrate that unlike GOTURN IC-LK adapts its regressor to the appearance of the currently tracked frame. We argue that this missing property in GOTURN can be attributed to its poor performance on unseen objects and/or viewpoints. Second, we propose a novel framework for object tracking - which we refer to as Deep-LK - that is inspired by the IC-LK framework. Finally, we show impressive results demonstrating that Deep-LK substantially outperforms GOTURN. Additionally, we demonstrate comparable tracking performance to current state of the art deep-trackers whilst being an order of magnitude (i.e. 100 FPS) computationally efficient.
Apr 25 2017 cs.CV
Recent advances in 3D vision have demonstrated the strengths of photometric bundle adjustment. By directly minimizing reprojected pixel errors, instead of geometric reprojection errors, such methods can achieve sub-pixel alignment accuracy in both high and low textured regions. Typically, these problems are solved using a forwards compositional Lucas-Kanade formulation parameterized by 6-DoF rigid camera poses and a depth per point in the structure. For large problems the most CPU-intensive component of the pipeline is the creation and factorization of the Hessian matrix at each iteration. For many warps, the inverse compositional formulation can offer significant speed-ups since the Hessian need only be inverted once. In this paper, we show that an ordinary inverse compositional formulation does not work for warps of this type of parameterization due to ill-conditioning of its partial derivatives. However, we show that it is possible to overcome this limitation by introducing the concept of a proxy template image. We show an order of magnitude improvement in speed, with little effect on quality, going from forwards to inverse compositional in our own photometric bundle adjustment method designed for object-centric structure from motion. This means less processing time for large systems or denser reconstructions under the same real-time constraints. We additionally show that this theory can be readily applied to existing methods by integrating it with the recently released Direct Sparse Odometry SLAM algorithm.
Mar 20 2017 cs.CV
In this paper, we propose the first higher frame rate video dataset (called Need for Speed - NfS) and benchmark for visual object tracking. The dataset consists of 100 videos (380K frames) captured with now commonly available higher frame rate (240 FPS) cameras from real world scenarios. All frames are annotated with axis aligned bounding boxes and all sequences are manually labelled with nine visual attributes - such as occlusion, fast motion, background clutter, etc. Our benchmark provides an extensive evaluation of many recent and state-of-the-art trackers on higher frame rate sequences. We ranked each of these trackers according to their tracking accuracy and real-time performance. One of our surprising conclusions is that at higher frame rates, simple trackers such as correlation filters outperform complex methods based on deep networks. This suggests that for practical applications (such as in robotics or embedded vision), one needs to carefully tradeoff bandwidth constraints associated with higher frame rate acquisition, computational costs of real-time analysis, and the required application accuracy. Our dataset and benchmark allows for the first time (to our knowledge) systematic exploration of such issues, and will be made available to allow for further research in this space.
Mar 16 2017 cs.CV
Correlation Filters (CFs) have recently demonstrated excellent performance in terms of rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn - "on the fly" - how the object is changing over time. A fundamental drawback to CFs, however, is that the background of the object is not be modelled over time which can result in suboptimal results. In this paper we propose a Background-Aware CF that can model how both the foreground and background of the object varies over time. Our approach, like conventional CFs, is extremely computationally efficient - and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to the state-of-the-art trackers including those based on a deep learning paradigm.
Dec 19 2016 cs.CV
In this paper, we present our method for enabling dense SDM to run at over 90 FPS on a mobile device. Our contributions are two-fold. Drawing inspiration from the FFT, we propose a Sparse Compositional Regression (SCR) framework, which enables a significant speed up over classical dense regressors. Second, we propose a binary approximation to SIFT features. Binary Approximated SIFT (BASIFT) features, which are a computationally efficient approximation to SIFT, a commonly used feature with SDM. We demonstrate the performance of our algorithm on an iPhone 7, and show that we achieve similar accuracy to SDM.
In this paper, we establish a theoretical connection between the classical Lucas & Kanade (LK) algorithm and the emerging topic of Spatial Transformer Networks (STNs). STNs are of interest to the vision and learning communities due to their natural ability to combine alignment and classification within the same theoretical framework. Inspired by the Inverse Compositional (IC) variant of the LK algorithm, we present Inverse Compositional Spatial Transformer Networks (IC-STNs). We demonstrate that IC-STNs can achieve better performance than conventional STNs with less model capacity; in particular, we show superior performance in pure image alignment tasks as well as joint alignment/classification problems on real-world problems.
We propose a novel algorithm for the joint refinement of structure and motion parameters from image data directly without relying on fixed and known correspondences. In contrast to traditional bundle adjustment (BA) where the optimal parameters are determined by minimizing the reprojection error using tracked features, the proposed algorithm relies on maximizing the photometric consistency and estimates the correspondences implicitly. Since the proposed algorithm does not require correspondences, its application is not limited to corner-like structure; any pixel with nonvanishing gradient could be used in the estimation process. Furthermore, we demonstrate the feasibility of refining the motion and structure parameters simultaneously using the photometric in unconstrained scenes and without requiring restrictive assumptions such as planarity. The proposed algorithm is evaluated on range of challenging outdoor datasets, and it is shown to improve upon the accuracy of the state-of-the-art VSLAM methods obtained using the minimization of the reprojection error using traditional BA as well as loop closure.
Feature descriptors, such as SIFT and ORB, are well-known for their robustness to illumination changes, which has made them popular for feature-based VSLAM\@. However, in degraded imaging conditions such as low light, low texture, blur and specular reflections, feature extraction is often unreliable. In contrast, direct VSLAM methods which estimate the camera pose by minimizing the photometric error using raw pixel intensities are often more robust to low textured environments and blur. Nonetheless, at the core of direct VSLAM is the reliance on a consistent photometric appearance across images, otherwise known as the brightness constancy assumption. Unfortunately, brightness constancy seldom holds in real world applications. In this work, we overcome brightness constancy by incorporating feature descriptors into a direct visual odometry framework. This combination results in an efficient algorithm that combines the strength of both feature-based algorithms and direct methods. Namely, we achieve robustness to arbitrary photometric variations while operating in low-textured and poorly lit environments. Our approach utilizes an efficient binary descriptor, which we call Bit-Planes, and show how it can be used in the gradient-based optimization required by direct methods. Moreover, we show that the squared Euclidean distance between Bit-Planes is equivalent to the Hamming distance. Hence, the descriptor may be used in least squares optimization without sacrificing its photometric invariance. Finally, we present empirical results that demonstrate the robustness of the approach in poorly lit underground environments.
Mar 30 2016 cs.CV
The Lucas & Kanade (LK) algorithm is the method of choice for efficient dense image and object alignment. The approach is efficient as it attempts to model the connection between appearance and geometric displacement through a linear relationship that assumes independence across pixel coordinates. A drawback of the approach, however, is its generative nature. Specifically, its performance is tightly coupled with how well the linear model can synthesize appearance from geometric displacement, even though the alignment task itself is associated with the inverse problem. In this paper, we present a new approach, referred to as the Conditional LK algorithm, which: (i) directly learns linear models that predict geometric displacement as a function of appearance, and (ii) employs a novel strategy for ensuring that the generative pixel independence assumption can still be taken advantage of. We demonstrate that our approach exhibits superior performance to classical generative forms of the LK algorithm. Furthermore, we demonstrate its comparable performance to state-of-the-art methods such as the Supervised Descent Method with substantially less training examples, as well as the unique ability to "swap" geometric warp functions without having to retrain from scratch. Finally, from a theoretical perspective, our approach hints at possible redundancies that exist in current state-of-the-art methods for alignment that could be leveraged in vision systems of the future.
Feb 02 2016 cs.CV
Binary descriptors have been instrumental in the recent evolution of computationally efficient sparse image alignment algorithms. Increasingly, however, the vision community is interested in dense image alignment methods, which are more suitable for estimating correspondences from high frame rate cameras as they do not rely on exhaustive search. However, classic dense alignment approaches are sensitive to illumination change. In this paper, we propose an easy to implement and low complexity dense binary descriptor, which we refer to as bit-planes, that can be seamlessly integrated within a multi-channel Lucas & Kanade framework. This novel approach combines the robustness of binary descriptors with the speed and accuracy of dense alignment methods. The approach is demonstrated on a template tracking problem achieving state-of-the-art robustness and faster than real-time performance on consumer laptops (400+ fps on a single core Intel i7) and hand-held mobile devices (100+ fps on an iPad Air 2).
Sep 07 2015 cs.CV
In this paper we tackle the problem of efficient video event detection. We argue that linear detection functions should be preferred in this regard due to their scalability and efficiency during estimation and evaluation. A popular approach in this regard is to represent a sequence using a bag of words (BOW) representation due to its: (i) fixed dimensionality irrespective of the sequence length, and (ii) its ability to compactly model the statistics in the sequence. A drawback to the BOW representation, however, is the intrinsic destruction of the temporal ordering information. In this paper we propose a new representation that leverages the uncertainty in relative temporal alignments between pairs of sequences while not destroying temporal ordering. Our representation, like BOW, is of a fixed dimensionality making it easily integrated with a linear detection function. Extensive experiments on CK+, 6DMG, and UvA-NEMO databases show significant performance improvements across both isolated and continuous event detection tasks.
In this technical document we present a proof on the uniqueness of group sparse coding through the block ACS theorem. Leveraging the original ACS theorem of Hillar and Sommer for sparse coding, we demonstrate a similar uniqueness property holds for the task of group sparse coding.
May 19 2015 cs.CV
Determining dense semantic correspondences across objects and scenes is a difficult problem that underpins many higher-level computer vision algorithms. Unlike canonical dense correspondence problems which consider images that are spatially or temporally adjacent, semantic correspondence is characterized by images that share similar high-level structures whose exact appearance and geometry may differ. Motivated by object recognition literature and recent work on rapidly estimating linear classifiers, we treat semantic correspondence as a constrained detection problem, where an exemplar LDA classifier is learned for each pixel. LDA classifiers have two distinct benefits: (i) they exhibit higher average precision than similarity metrics typically used in correspondence problems, and (ii) unlike exemplar SVM, can output globally interpretable posterior probabilities without calibration, whilst also being significantly faster to train. We pose the correspondence problem as a graphical model, where the unary potentials are computed via convolution with the set of exemplar classifiers, and the joint potentials enforce smoothly varying correspondence assignment.
Jul 09 2014 cs.CV
Gradient-descent methods have exhibited fast and reliable performance for image alignment in the facial domain, but have largely been ignored by the broader vision community. They require the image function be smooth and (numerically) differentiable -- properties that hold for pixel-based representations obeying natural image statistics, but not for more general classes of non-linear feature transforms. We show that transforms such as Dense SIFT can be incorporated into a Lucas Kanade alignment framework by predicting descent directions via regression. This enables robust matching of instances from general object categories whilst maintaining desirable properties of Lucas Kanade such as the capacity to handle high-dimensional warp parametrizations and a fast rate of convergence. We present alignment results on a number of objects from ImageNet, and an extension of the method to unsupervised joint alignment of objects from a corpus of images.
Jun 11 2014 cs.CV
Sparse and convolutional constraints form a natural prior for many optimization problems that arise from physical processes. Detecting motifs in speech and musical passages, super-resolving images, compressing videos, and reconstructing harmonic motions can all leverage redundancies introduced by convolution. Solving problems involving sparse and convolutional constraints remains a difficult computational problem, however. In this paper we present an overview of convolutional sparse coding in a consistent framework. The objective involves iteratively optimizing a convolutional least-squares term for the basis functions, followed by an L1-regularized least squares term for the sparse coefficients. We discuss a range of optimization methods for solving the convolutional sparse coding objective, and the properties that make each method suitable for different applications. In particular, we concentrate on computational complexity, speed to \epsilon convergence, memory usage, and the effect of implied boundary conditions. We present a broad suite of examples covering different signal and application domains to illustrate the general applicability of convolutional sparse coding, and the efficacy of the available optimization methods.
Linear Support Vector Machines trained on HOG features are now a de facto standard across many visual perception tasks. Their popularisation can largely be attributed to the step-change in performance they brought to pedestrian detection, and their subsequent successes in deformable parts models. This paper explores the interactions that make the HOG-SVM symbiosis perform so well. By connecting the feature extraction and learning processes rather than treating them as disparate plugins, we show that HOG features can be viewed as doing two things: (i) inducing capacity in, and (ii) adding prior to a linear SVM trained on pixels. From this perspective, preserving second-order statistics and locality of interactions are key to good performance. We demonstrate surprising accuracy on expression recognition and pedestrian detection tasks, by assuming only the importance of preserving such local second-order interactions.
Apr 01 2014 cs.CV
Correlation filters take advantage of specific properties in the Fourier domain allowing them to be estimated efficiently: O(NDlogD) in the frequency domain, versus O(D^3 + ND^2) spatially where D is signal length, and N is the number of signals. Recent extensions to correlation filters, such as MOSSE, have reignited interest of their use in the vision community due to their robustness and attractive computational properties. In this paper we demonstrate, however, that this computational efficiency comes at a cost. Specifically, we demonstrate that only 1/D proportion of shifted examples are unaffected by boundary effects which has a dramatic effect on detection/tracking performance. In this paper, we propose a novel approach to correlation filter estimation that: (i) takes advantage of inherent computational redundancies in the frequency domain, and (ii) dramatically reduces boundary effects. Impressive object tracking and detection results are presented in terms of both accuracy and computational efficiency.
Mar 31 2014 cs.CV
Computer vision is increasingly becoming interested in the rapid estimation of object detectors. Canonical hard negative mining strategies are slow as they require multiple passes of the large negative training set. Recent work has demonstrated that if the distribution of negative examples is assumed to be stationary, then Linear Discriminant Analysis (LDA) can learn comparable detectors without ever revisiting the negative set. Even with this insight, however, the time to learn a single object detector can still be on the order of tens of seconds on a modern desktop computer. This paper proposes to leverage the resulting structured covariance matrix to obtain detectors with identical performance in orders of magnitude less time and memory. We elucidate an important connection to the correlation filter literature, demonstrating that these can also be trained without ever revisiting the negative set.