Dec 19 2017 cs.CV
We describe Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using in-the-wild images that only have ground truth 2D annotations. However, reprojection loss alone is highly underconstrained. In this work we address this problem by introducing an adversary trained to tell whether a human body parameter is real or not using a large database of 3D human meshes. We show that HMR can be trained with and without using any coupled 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detection and infer 3D pose and shape parameters directly from image pixels. Our model runs in real-time given a bounding box containing the person. We demonstrate our approach on various images in-the-wild and outperform previous optimization-based methods that output 3D meshes and show competitive results on tasks such as 3D joint location estimation and part segmentation.
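To make the keypoint reprojection objective concrete, here is a minimal sketch. The weak-perspective camera (a scale plus a 2D translation), the L1 penalty, the visibility mask, and all function names are assumptions chosen for illustration; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def project_weak_perspective(joints3d: Sequence[Point3], scale: float,
                             trans: Point2) -> List[Point2]:
    """Drop depth, then apply a 2D scale and translation (assumed camera model)."""
    return [(scale * x + trans[0], scale * y + trans[1]) for x, y, _ in joints3d]


def reprojection_loss(joints3d: Sequence[Point3], keypoints2d: Sequence[Point2],
                      visible: Sequence[int], scale: float, trans: Point2) -> float:
    """Mean L1 distance between projected joints and annotated 2D keypoints.

    Joints with visible[i] == 0 (unannotated or occluded) are ignored.
    """
    projected = project_weak_perspective(joints3d, scale, trans)
    total, count = 0.0, 0
    for (px, py), (kx, ky), v in zip(projected, keypoints2d, visible):
        if v:
            total += abs(px - kx) + abs(py - ky)
            count += 1
    return total / max(count, 1)


targets = [(10.0, 30.0), (30.0, 10.0)]
joints = [(0.0, 1.0, 2.0), (1.0, 0.0, 5.0)]
# Same x and y, different depths: the 2D loss cannot tell these apart.
other_depths = [(0.0, 1.0, -3.0), (1.0, 0.0, 9.0)]

loss_a = reprojection_loss(joints, targets, [1, 1], scale=20.0, trans=(10.0, 10.0))
loss_b = reprojection_loss(other_depths, targets, [1, 1], scale=20.0, trans=(10.0, 10.0))
```

Both joint sets reproject onto the same 2D keypoints, so `loss_a` and `loss_b` are identical. This is one simple instance of the ambiguity that makes reprojection loss alone underconstrained and motivates an additional prior on the 3D parameters.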
Dec 06 2017 cs.CV
We present SfSNet, an end-to-end learning framework for producing an accurate decomposition of an unconstrained image of a human face into shape, reflectance and illuminance. Our network is designed to reflect a physical Lambertian rendering model. SfSNet learns from a mixture of labeled synthetic and unlabeled real world images. This allows the network to capture low frequency variations from synthetic images and high frequency details from real images through the photometric reconstruction loss. SfSNet consists of a new decomposition architecture with residual blocks that learns a complete separation of albedo and normal. This is used along with the original image to predict lighting. SfSNet produces significantly better quantitative and qualitative results than state-of-the-art methods for inverse rendering and independent normal and illumination estimation.
Sep 08 2017 cs.CV
Lighting estimation from face images is an important task and has applications in many areas such as image editing, intrinsic image decomposition, and image forgery detection. We propose to train a deep Convolutional Neural Network (CNN) to regress lighting parameters from a single face image. Lacking massive ground truth lighting labels for face images in the wild, we use an existing method to estimate lighting parameters, which are treated as ground truth with unknown noise. To alleviate the effect of such noise, we utilize the idea of Generative Adversarial Networks (GAN) and propose a Label Denoising Adversarial Network (LDAN) to make use of synthetic data with accurate ground truth to help train a deep CNN for lighting regression on real face images. Experiments show that our network outperforms existing methods in producing consistent lighting parameters of different faces under similar lighting conditions. Moreover, our method is 100,000 times faster in execution time than prior optimization-based lighting estimation approaches.
Adversarial neural networks solve many important problems in data science, but are notoriously difficult to train. These difficulties come from the fact that optimal weights for adversarial nets correspond to saddle points, and not minimizers, of the loss function. The alternating stochastic gradient methods typically used for such problems do not reliably converge to saddle points, and when convergence does happen it is often highly sensitive to learning rates. We propose a simple modification of stochastic gradient descent that stabilizes adversarial networks. We show, both in theory and practice, that the proposed method reliably converges to saddle points, and is stable with a wider range of training parameters than a non-prediction method. This makes adversarial networks less likely to "collapse," and enables faster training with larger learning rates.
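The abstract does not spell out the modification. The sketch below assumes it is a lookahead ("prediction") step that extrapolates one player's variable before the other player updates, and tries it on the toy saddle problem L(x, y) = x * y. The toy objective, the step size, and the function name `run` are illustrative assumptions.

```python
import math


def run(steps: int, lr: float, predict: bool) -> float:
    """Alternating gradient steps on L(x, y) = x * y, whose saddle point is (0, 0).

    x takes descent steps on L and y takes ascent steps.
    Returns the final distance from the saddle point.
    """
    x, y = 1.0, 1.0
    for _ in range(steps):
        x_new = x - lr * y                       # dL/dx = y
        # Assumed prediction step: extrapolate x along its last update.
        x_bar = x_new + (x_new - x) if predict else x_new
        y = y + lr * x_bar                       # dL/dy = x, evaluated at x_bar
        x = x_new
    return math.hypot(x, y)


dist_plain = run(steps=1000, lr=0.1, predict=False)
dist_pred = run(steps=1000, lr=0.1, predict=True)
```

On this bilinear problem the plain alternating iteration keeps orbiting the saddle at roughly its starting distance, while the variant with the lookahead step spirals inward, so `dist_pred` ends up far smaller than `dist_plain`.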
Feb 28 2017 cs.CV
Most of computer vision focuses on what is in an image. We propose to train a standalone object-centric context representation to perform the opposite task: seeing what is not there. Given an image, our context model can predict where objects should exist, even when no object instances are present. Combined with object detection results, we can perform a novel vision task: finding where objects are missing in an image. Our model is based on a convolutional neural network structure. With a specially designed training strategy, the model learns to ignore objects and focus on context only. It is fully convolutional and thus highly efficient. Experiments show the effectiveness of the proposed approach in one important accessibility task: finding city street regions where curb ramps are missing, which could help millions of people with mobility disabilities.
Feb 13 2017 cs.CV
Accurate estimation of camera matrices is an important step in structure from motion algorithms. In this paper we introduce a novel rank constraint on collections of fundamental matrices in multi-view settings. We show that in general, with the selection of proper scale factors, a matrix formed by stacking fundamental matrices between pairs of images has rank 6. Moreover, this matrix forms the symmetric part of a rank 3 matrix whose factors relate directly to the corresponding camera matrices. We use this new characterization to produce better estimations of fundamental matrices by optimizing an L1-cost function using Iteratively Reweighted Least Squares and the Alternating Direction Method of Multipliers. We further show that this procedure can improve the recovery of camera locations, particularly in multi-view settings in which fewer images are available.
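One way to see where the rank bound comes from is sketched below. The camera parameterization (calibration K_i, rotation R_i, camera center t_i, with x_i ~ K_i R_i^T (X - t_i)) and the symbols X_i, T_i, U, V are assumed notation for this sketch and are not taken from the abstract.

```latex
% Assumed notation: X_i = K_i^{-T} R_i^{T}, and T_i = [t_i]_\times is the
% skew-symmetric cross-product matrix of the camera center t_i.
\begin{align*}
F_{ij} &= X_i \,(T_i - T_j)\, X_j^{T}
        = (X_i T_i)\, X_j^{T} + X_i \,(X_j T_j)^{T}
        && \text{since } T_j^{T} = -T_j, \\
F &= \bigl[ F_{ij} \bigr]_{i,j=1}^{n} = U V^{T} + V U^{T},
  \qquad
  U = \begin{bmatrix} X_1 T_1 \\ \vdots \\ X_n T_n \end{bmatrix},\quad
  V = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}.
\end{align*}
```

Here U and V are 3n x 3, so U V^T has rank at most 3, and the stacked 3n x 3n matrix F is the symmetric part of 2 U V^T, which gives rank at most 6. The diagonal blocks F_ii vanish, as expected. The "proper scale factors" in the abstract correspond to each F_ij carrying exactly the scale fixed by this formula.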
Feb 03 2017 cs.CV
We introduce a new, integrated approach to uncalibrated photometric stereo. We perform 3D reconstruction of Lambertian objects using multiple images produced by unknown, directional light sources. We show how to formulate a single optimization that includes rank and integrability constraints, allowing also for missing data. We then solve this optimization using the Alternating Direction Method of Multipliers (ADMM). We conduct extensive experimental evaluation on real and synthetic data sets. Our integrated approach is particularly valuable when performing photometric stereo using as few as 4-6 images, since the integrability constraint is capable of improving estimation of the linear subspace of possible solutions. We show good improvements over prior work in these cases.
Nov 24 2016 cs.CV
There has been significant work on learning realistic, articulated, 3D models of the human body. In contrast, there are few such models of animals, despite many applications. The main challenge is that animals are much less cooperative than humans. The best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. Consequently, we learn our model from a small set of 3D scans of toy figurines in arbitrary poses. We employ a novel part-based shape model to compute an initial registration to the scans. We then normalize their pose, learn a statistical shape model, and refine the registrations and the model together. In this way, we accurately align animal scans from different quadruped families with very different shapes and poses. With the registration to a common template we learn a shape space representing animals including lions, cats, dogs, horses, cows and hippos. Animal shapes can be sampled from the model, posed, animated, and fit to data. We demonstrate generalization by fitting it to images of real animals including species not seen in training.
Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients make it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative "big batch" SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting methods have similar convergence rates to classical SGD, and do not require convexity of the objective. The high fidelity gradients enable automated learning rate selection and do not require stepsize decay. Big batch methods are thus easily automated and can run with little or no oversight.
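The idea of growing the batch to hold the gradient's signal-to-noise ratio roughly constant can be sketched on a one-dimensional least-squares problem. The specific test (squared batch gradient at least `theta**2` times its estimated variance), the doubling rule, and every name below are assumptions made for this sketch, not the schemes defined in the paper.

```python
import random
import statistics


def big_batch_sgd(data, x0=10.0, lr=0.5, batch=4, theta=1.0, steps=300, seed=0):
    """Minimize f(x) = mean((x - a)^2) / 2 over the samples a in `data`.

    Before each step the batch is doubled until the squared batch gradient
    is at least theta^2 times the estimated variance of that gradient,
    or until the batch covers the whole dataset.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        while True:
            sample = rng.sample(data, batch)
            grad = x - statistics.fmean(sample)           # batch gradient
            noise = statistics.variance(sample) / batch   # variance of the batch mean
            if grad * grad >= theta * theta * noise or batch >= len(data):
                break
            batch = min(2 * batch, len(data))             # signal too weak: grow the batch
        x -= lr * grad                                     # constant step size, no decay
    return x, batch


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
x_final, final_batch = big_batch_sgd(data)
```

Far from the minimizer the gradient is large compared with its noise and small batches pass the test. Near the minimizer the gradient shrinks, the test fails, and the batch grows, which is why the step size here never needs to decay.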
Semidefinite programming is an indispensable tool in computer vision, but general-purpose solvers for semidefinite programs are often too slow and memory intensive for large-scale problems. We propose a general framework to approximately solve large-scale semidefinite problems (SDPs) at low complexity. Our approach, referred to as biconvex relaxation (BCR), transforms a general SDP into a specific biconvex optimization problem, which can then be solved in the original, low-dimensional variable space at low complexity. The resulting biconvex problem is solved using an efficient alternating minimization (AM) procedure. Since AM has the potential to get stuck in local minima, we propose a general initialization scheme that enables BCR to start close to a global optimum - this is key for our algorithm to quickly converge to optimal or near-optimal solutions. We showcase the efficacy of our approach on three applications in computer vision, namely segmentation, co-segmentation, and manifold metric learning. BCR achieves solution quality comparable to state-of-the-art SDP methods with speedups between 4X and 35X. At the same time, BCR handles a more general set of SDPs than previous approaches, which are more specialized.
Apr 20 2016 cs.CV
We present an approach to matching images of objects in fine-grained datasets without using part annotations, with an application to the challenging problem of weakly supervised single-view reconstruction. This is in contrast to prior works that require part annotations, since matching objects across class and pose variations is challenging with appearance features alone. We overcome this challenge through a novel deep learning architecture, WarpNet, that aligns an object in one image with a different object in another. We exploit the structure of the fine-grained dataset to create artificial data for training this network in an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. On the CUB-200-2011 dataset of bird categories, we improve the AP over an appearance-only network by 13.6%. We further demonstrate that our WarpNet matches, together with the structure of fine-grained datasets, allow single-view reconstructions with quality comparable to using annotated point correspondences.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We suggest a novel shape matching algorithm for three-dimensional surface meshes of disk or sphere topology. The method is based on the physical theory of nonlinear elasticity and can hence handle large rotations and deformations. Deformation boundary conditions that supplement the underlying equations are usually unknown. Given an initial guess, these are optimized such that the mechanical boundary forces that are responsible for the deformation are of a simple nature. We show a heuristic way to approximate the nonlinear optimization problem by a sequence of convex problems using finite elements. The deformation cost, i.e., the forces, is measured on a coarse scale while ICP-like matching is done on the fine scale. We demonstrate the plausibility of our algorithm on examples taken from different datasets.
Jul 29 2015 cs.CV
Understanding how an animal can deform and articulate is essential for a realistic modification of its 3D model. In this paper, we show that such information can be learned from user-clicked 2D images and a template 3D model of the target animal. We present a volumetric deformation framework that produces a set of new 3D models by deforming a template 3D model according to a set of user-clicked images. Our framework is based on a novel locally-bounded deformation energy, where every local region has its own stiffness value that bounds how much distortion is allowed at that location. We jointly learn the local stiffness bounds as we deform the template 3D mesh to match each user-clicked image. We show that this seemingly complex task can be solved as a sequence of convex optimization problems. We demonstrate the effectiveness of our approach on cats and horses, which are highly deformable and articulated animals. Our framework produces new 3D models of animals that are significantly more plausible than methods without learned stiffness.
Mar 11 2015 cs.CV
This paper proposes a learning-based approach to scene parsing inspired by the deep Recursive Context Propagation Network (RCPN). RCPN is a deep feed-forward neural network that utilizes the contextual information from the entire image, through bottom-up followed by top-down context propagation via random binary parse trees. This improves the feature representation of every super-pixel in the image for better classification into semantic categories. We analyze RCPN and propose two novel contributions to further improve the model. We first analyze the learning of RCPN parameters and discover the presence of bypass error paths in the computation graph of RCPN that can hinder contextual propagation. We propose to tackle this problem by including the classification loss of the internal nodes of the random parse trees in the original RCPN loss function. Secondly, we use an MRF on the parse tree nodes to model the hierarchical dependency present in the output. Both modifications provide performance boosts over the original RCPN and the new system achieves state-of-the-art performance on Stanford Background, SIFT-Flow and Daimler urban datasets.
The Murchison Widefield Array (MWA) is a Square Kilometre Array (SKA) Precursor. The telescope is located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia (WA). The MWA consists of 4096 dipoles arranged into 128 dual polarisation aperture arrays forming a connected element interferometer that cross-correlates signals from all 256 inputs. A hybrid approach to the correlation task is employed, with some processing stages being performed by bespoke hardware, based on Field Programmable Gate Arrays (FPGAs), and others by Graphics Processing Units (GPUs) housed in general purpose rack mounted servers. The correlation capability required is approximately 8 TFLOPS (Tera FLoating point Operations Per Second). The MWA has commenced operations and the correlator is generating 8.3 TB/day of correlation products, which are subsequently transferred 700 km from the MRO to Perth (WA) in real-time for storage and offline processing. In this paper we outline the correlator design, signal path, and processing elements and present the data format for the internal and external interfaces.
Over the past few years, symmetric positive definite (SPD) matrices have been receiving considerable attention from the computer vision community. Though various distance measures have been proposed in the past for comparing SPD matrices, the two most widely-used measures are affine-invariant distance and log-Euclidean distance. This is because these two measures are true geodesic distances induced by Riemannian geometry. In this work, we focus on the log-Euclidean Riemannian geometry and propose a data-driven approach for learning Riemannian metrics/geodesic distances for SPD matrices. We show that the geodesic distance learned using the proposed approach performs better than various existing distance measures when evaluated on face matching and clustering tasks.
Dec 17 2014 cs.CV
Convolutional Neural Networks (ConvNets) have shown excellent results on many visual classification tasks. With the exception of ImageNet, these datasets are carefully crafted such that objects are well-aligned at similar scales. Naturally, the feature learning problem gets more challenging as the amount of variation in the data increases, as the models have to learn to be invariant to certain changes in appearance. Recent results on the ImageNet dataset show that given enough data, ConvNets can learn such invariances producing very discriminative features. But could we do more: use fewer parameters, less data, learn more discriminative features, if certain invariances were built into the learning process? In this paper we present a simple model that allows ConvNets to learn features in a locally scale-invariant manner without increasing the number of model parameters. We show on a modified MNIST dataset that when faced with scale variation, building in scale-invariance allows ConvNets to learn more discriminative features with reduced chances of over-fitting.
We discuss methodological issues related to the evaluation of unsupervised binary code construction methods for nearest neighbor search. These issues have been widely ignored in the literature. These coding methods attempt to preserve either Euclidean distance or angular (cosine) distance in the binary embedding space. We explain why when comparing a method whose goal is preserving cosine similarity to one designed for preserving Euclidean distance, the original features should be normalized by mapping them to the unit hypersphere before learning the binary mapping functions. To compare a method whose goal is to preserve Euclidean distance to one that preserves cosine similarity, the original feature data must be mapped to a higher dimension by including a bias term in binary mapping functions. These conditions ensure a fair comparison between different binary code methods for the task of nearest neighbor search. Our experiments show that under these conditions very simple methods (e.g. LSH and ITQ) often outperform recent state-of-the-art methods (e.g. MDSH and OK-means).
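The role of the unit-hypersphere normalization can be checked with a short worked example. Once vectors have unit length, squared Euclidean distance is a decreasing function of cosine similarity, so both criteria rank neighbors identically. The helper names and sample vectors below are illustrative only.

```python
import math


def normalize(v):
    """Map a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
cos_ab = cosine_similarity(a, b)
dist_sq = squared_euclidean(normalize(a), normalize(b))
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_gap = abs(dist_sq - (2.0 - 2.0 * cos_ab))
```

`identity_gap` is zero up to floating-point rounding. On normalized features, a method that preserves Euclidean distance is therefore being evaluated on the same neighbor ordering as one that preserves cosine similarity.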
Oct 11 2013 cs.CV
We develop a framework for extracting a concise representation of the shape information available from diffuse shading in a small image patch. This produces a mid-level scene descriptor, comprised of local shape distributions that are inferred separately at every image patch across multiple scales. The framework is based on a quadratic representation of local shape that, in the absence of noise, has guarantees on recovering accurate local shape and lighting. And when noise is present, the inferred local shape distributions provide useful shape information without over-committing to any particular image explanation. These local shape distributions naturally encode the fact that some smooth diffuse regions are more informative than others, and they enable efficient and robust reconstruction of object-scale shape. Experimental results show that this approach to surface reconstruction compares well against the state of the art on both synthetic images and captured photographs.
Optimization-based filtering smooths an image by minimizing a fidelity function and simultaneously preserves edges by exploiting a sparse norm penalty over gradients. It has obtained promising performance in practical problems, such as detail manipulation, HDR compression and deblurring, and thus has received increasing attention in the fields of graphics, computer vision and image processing. This paper derives a new type of image filter called sparse norm filter (SNF) from optimization-based filtering. SNF has a very simple form, introduces a general class of filtering techniques, and explains several classic filters as special implementations of SNF, e.g. the averaging filter and the median filter. It has the advantages of being halo free and easy to implement, with low time and memory costs (comparable to those of the bilateral filter). Thus, it is more generic than a smoothing operator and can better adapt to different tasks. We validate the proposed SNF by a wide variety of applications including edge-preserving smoothing, outlier tolerant filtering, detail manipulation, HDR compression, non-blind deconvolution, image segmentation, and colorization.
May 22 2012 cs.CV
Spectral graph theory is well known and widely used in computer vision. In this paper, we analyze image segmentation algorithms that are based on spectral graph theory, e.g., normalized cut, and show that there is a natural connection between spectral graph theory based image segmentation and edge preserving filtering. Based on this connection we show that the normalized cut algorithm is equivalent to repeated iterations of bilateral filtering. Then, using this equivalence we present and implement a fast normalized cut algorithm for image segmentation. Experiments show that our implementation can solve the original optimization problem in the normalized cut algorithm 10 to 100 times faster. Furthermore, we present a new algorithm called conditioned normalized cut for image segmentation that can easily incorporate color image patches and demonstrate how this segmentation problem can be solved with edge preserving filtering.
Sep 10 2009 cs.DB
Data consistency is very desirable because strong semantic properties make it easier to write correct programs that perform as users expect. However, there are good reasons why consistency may have to be weakened to achieve other business goals. In this CIDR 2009 Perspectives paper, we present real-world reasons inconsistency may be necessary, offer principles for managing inconsistency coherently, and describe implementation approaches we are investigating for sustainably scalable systems that offer comprehensible user experiences despite inconsistency.