# Computer Vision and Pattern Recognition (cs.CV)

• Skin cancer is one of the major types of cancers and its incidence has been increasing over the past decades. Skin lesions can arise from various dermatologic disorders and can be classified to various types according to their texture, structure, color and other morphological features. The accuracy of diagnosis of skin lesions, specifically the discrimination of benign and malignant lesions, is paramount to ensure appropriate patient treatment. Machine learning-based classification approaches are among popular automatic methods for skin lesion classification. While there are many existing methods, convolutional neural networks (CNN) have shown to be superior over other classical machine learning methods for object detection and classification tasks. In this work, a fully automatic computerized method is proposed, which employs well established pre-trained convolutional neural networks and ensembles learning to classify skin lesions. We trained the networks using 2000 skin lesion images available from the ISIC 2017 challenge, which has three main categories and includes 374 melanoma, 254 seborrheic keratosis and 1372 benign nevi images. The trained classifier was then tested on 150 unlabeled images. The results, evaluated by the challenge organizer and based on the area under the receiver operating characteristic curve (AUC), were 84.8% and 93.6% for Melanoma and seborrheic keratosis binary classification problem, respectively. The proposed method achieved competitive results to experienced dermatologist. Further improvement and optimization of the proposed method with a larger training dataset could lead to a more precise, reliable and robust method for skin lesion classification.
• "If I provide you a face image of mine (without telling you the actual age when I took the picture) and a large amount of face images that I crawled (containing labeled faces of different ages but not necessarily paired), can you show me what I would look like when I am 80 or what I was like when I was 5?" The answer is probably a "No." Most existing face aging works attempt to learn the transformation between age groups and thus would require the paired samples as well as the labeled query image. In this paper, we look at the problem from a generative modeling perspective such that no paired samples is required. In addition, given an unlabeled image, the generative model can directly produce the image with desired age attribute. We propose a conditional adversarial autoencoder (CAAE) that learns a face manifold, traversing on which smooth age progression and regression can be realized simultaneously. In CAAE, the face is first mapped to a latent vector through a convolutional encoder, and then the vector is projected to the face manifold conditional on age through a deconvolutional generator. The latent vector preserves personalized face features (i.e., personality) and the age condition controls progression vs. regression. Two adversarial networks are imposed on the encoder and generator, respectively, forcing to generate more photo-realistic faces. Experimental results demonstrate the appealing performance and flexibility of the proposed framework by comparing with the state-of-the-art and ground truth.
• Mammography screening for early detection of breast lesions currently suffers from high amounts of false positive findings, which result in unnecessary invasive biopsies. Diffusion-weighted MR images (DWI) can help to reduce many of these false-positive findings prior to biopsy. Current approaches estimate tissue properties by means of quantitative parameters taken from generative, biophysical models fit to the q-space encoded signal under certain assumptions regarding noise and spatial homogeneity. This process is prone to fitting instability and partial information loss due to model simplicity. We reveal previously unexplored potentials of the signal by integrating all data processing components into a convolutional neural network (CNN) architecture that is designed to propagate clinical target information down to the raw input images. This approach enables simultaneous and target-specific optimization of image normalization, signal exploitation, global representation learning and classification. Using a multicentric data set of 222 patients, we demonstrate that our approach significantly improves clinical decision making with respect to the current state of the art.
• Visual relations, such as "person ride bike" and "bike next to car", offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate $\approx$ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion that supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is still competitive to the Lu's multi-modal model with language priors.
• A cloud server spent a lot of time, energy and money to train a Viola-Jones type object detector with high accuracy. Clients can upload their photos to the cloud server to find objects. However, the client does not want the leakage of the content of his/her photos. In the meanwhile, the cloud server is also reluctant to leak any parameters of the trained object detectors. 10 years ago, Avidan & Butman introduced Blind Vision, which is a method for securely evaluating a Viola-Jones type object detector. Blind Vision uses standard cryptographic tools and is painfully slow to compute, taking a couple of hours to scan a single image. The purpose of this work is to explore an efficient method that can speed up the process. We propose the Random Base Image (RBI) Representation. The original image is divided into random base images. Only the base images are submitted randomly to the cloud server. Thus, the content of the image can not be leaked. In the meanwhile, a random vector and the secure Millionaire protocol are leveraged to protect the parameters of the trained object detector. The RBI makes the integral-image enable again for the great acceleration. The experimental results reveal that our method can retain the detection accuracy of that of the plain vision algorithm and is significantly faster than the traditional blind vision, with only a very low probability of the information leakage theoretically.
• We present a new public dataset with a focus on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely captured in 9 unique scenes. We train a fast object category detector for instance detection on our data. Using the dataset we show that, although increasingly accurate and fast, the state of the art for object detection is still severely impacted by object scale, occlusion, and viewing direction all of which matter for robotics applications. We next validate the dataset for simulating active vision, and use the dataset to develop and evaluate a deep-network-based system for next best move prediction for object classification using reinforcement learning. Our dataset is available for download at cs.unc.edu/~ammirato/active_vision_dataset_website/.
• Ensembling multiple predictions is a widely used technique to improve the accuracy of various machine learning tasks. In image classification tasks, for example, averaging the predictions for multiple patches extracted from the input image significantly improves accuracy. Using multiple networks trained independently to make predictions improves accuracy further. One obvious drawback of the ensembling technique is its higher execution cost during inference. If we average 100 predictions, the execution cost will be 100 times as high as the cost without the ensemble. This higher cost limits the real-world use of ensembling, even though using it is almost the norm to win image classification competitions. In this paper, we describe a new technique called adaptive ensemble prediction, which achieves the benefits of ensembling with much smaller additional execution costs. Our observation behind this technique is that many easy-to-predict inputs do not require ensembling. Hence we calculate the confidence level of the prediction for each input on the basis of the probability of the predicted label, i.e. the outputs from the softmax, during the ensembling computation. If the prediction for an input reaches a high enough probability on the basis of the confidence level, we stop ensembling for this input to avoid wasting computation power. We evaluated the adaptive ensembling by using various datasets and showed that it reduces the computation time significantly while achieving similar accuracy to the naive ensembling.
• Fluent and safe interactions of humans and robots require both partners to anticipate the others' actions. A common approach to human intention inference is to model specific trajectories towards known goals with supervised classifiers. However, these approaches do not take possible future movements into account nor do they make use of kinematic cues, such as legible and predictable motion. The bottleneck of these methods is the lack of an accurate model of general human motion. In this work, we present a conditional variational autoencoder that is trained to predict a window of future human motion given a window of past frames. Using skeletal data obtained from RGB depth images, we show how this unsupervised approach can be used for online motion prediction for up to 1660 ms. Additionally, we demonstrate online target prediction within the first 300-500 ms after motion onset without the use of target specific training data. The advantage of our probabilistic approach is the possibility to draw samples of possible future motions. Finally, we investigate how movements and kinematic cues are represented on the learned low dimensional manifold.
• We introduce DeepNAT, a 3D Deep convolutional neural network for the automatic segmentation of NeuroAnaTomy in T1-weighted magnetic resonance images. DeepNAT is an end-to-end learning-based approach to brain segmentation that jointly learns an abstract feature representation and a multi-class classification. We propose a 3D patch-based approach, where we do not only predict the center voxel of the patch but also neighbors, which is formulated as multi-task learning. To address a class imbalance problem, we arrange two networks hierarchically, where the first one separates foreground from background, and the second one identifies 25 brain structures on the foreground. Since patches lack spatial context, we augment them with coordinates. To this end, we introduce a novel intrinsic parameterization of the brain volume, formed by eigenfunctions of the Laplace-Beltrami operator. As network architecture, we use three convolutional layers with pooling, batch normalization, and non-linearities, followed by fully connected layers with dropout. The final segmentation is inferred from the probabilistic output of the network with a 3D fully connected conditional random field, which ensures label agreement between close voxels. The roughly 2.7 million parameters in the network are learned with stochastic gradient descent. Our results show that DeepNAT compares favorably to state-of-the-art methods. Finally, the purely learning-based method may have a high potential for the adaptation to young, old, or diseased brains by fine-tuning the pre-trained network with a small training sample on the target application, where the availability of larger datasets with manual annotations may boost the overall segmentation accuracy in the future.
• We propose a novel approach to address the Simultaneous Detection and Segmentation problem. Using hierarchical structures we use an efficient and accurate procedure that exploits the hierarchy feature information using Locality Sensitive Hashing. We build on recent work that utilizes convolutional neural networks to detect bounding boxes in an image (Faster R-CNN) and then use the top similar hierarchical region that best fits each bounding box after hashing, we call this approach HashBox. We then refine our final segmentation results by automatic hierarchy pruning. HashBox introduces a train-free alternative to Hypercolumns. We conduct extensive experiments on Pascal VOC 2012 segmentation dataset, showing that HashBox gives competitive state-of-the-art object segmentations.
• Thin leaves, fine stems, self-occlusion, non-rigid and slowly changing structures make plants difficult for three-dimensional (3D) scanning and reconstruction -- two critical steps in automated visual phenotyping. Many current solutions such as laser scanning, structured light, and multiview stereo can struggle to acquire usable 3D models because of limitations in scanning resolution and calibration accuracy. In response, we have developed a fast, low-cost, 3D scanning platform to image plants on a rotating stage with two tilting DSLR cameras centred on the plant. This uses new methods of camera calibration and background removal to achieve high-accuracy 3D reconstruction. We assessed the system's accuracy using a 3D visual hull reconstruction algorithm applied on 2 plastic models of dicotyledonous plants, 2 sorghum plants and 2 wheat plants across different sets of tilt angles. Scan times ranged from 3 minutes (to capture 72 images using 2 tilt angles), to 30 minutes (to capture 360 images using 10 tilt angles). The leaf lengths, widths, areas and perimeters of the plastic models were measured manually and compared to measurements from the scanning system: results were within 3-4% of each other. The 3D reconstructions obtained with the scanning system show excellent geometric agreement with all six plant specimens, even plants with thin leaves and fine stems.
• Semantic segmentation constitutes an integral part of medical image analyses for which breakthroughs in the field of deep learning were of high relevance. The large number of trainable parameters of deep neural networks however renders them inherently data hungry, a characteristic that heavily challenges the medical imaging community. Though interestingly, with the de facto standard training of fully convolutional networks (FCNs) for semantic segmentation being agnostic towards the `structure' of the predicted label maps, valuable complementary information about the global quality of the segmentation lies idle. In order to tap into this potential, we propose utilizing an adversarial network which discriminates between expert and generated annotations in order to train FCNs for semantic segmentation. Because the adversary constitutes a learned parametrization of what makes a good segmentation at a global level, we hypothesize that the method holds particular advantages for segmentation tasks on complex structured, small datasets. This holds true in our experiments: We learn to segment aggressive prostate cancer utilizing MRI images of 152 patients and show that the proposed scheme is superior over the de facto standard in terms of the detection sensitivity and the dice-score for aggressive prostate cancer. The achieved relative gains are shown to be particularly pronounced in the small dataset limit.
• This paper addresses the task of designing a modular neural network architecture that jointly solves different tasks. As an example we use the tasks of depth estimation and semantic segmentation given a single RGB image. The main focus of this work is to analyze the cross-modality influence between depth and semantic prediction maps on their joint refinement. While most previous works solely focus on measuring improvements in accuracy, we propose a way to quantify the cross-modality influence. We show that there is a relationship between final accuracy and cross-modality influence, although not a simple linear one. Hence a larger cross-modality influence does not necessarily translate into an improved accuracy. We find that a beneficial balance between the cross-modality influences can be achieved by network architecture and conjecture that this relationship can be utilized to understand different network design choices. Towards this end we propose a Convolutional Neural Network (CNN) architecture that fuses the state of the state-of-the-art results for depth estimation and semantic labeling. By balancing the cross-modality influences between depth and semantic prediction, we achieve improved results for both tasks using the NYU-Depth v2 benchmark.
• Most of computer vision focuses on what is in an image. We propose to train a standalone object-centric context representation to perform the opposite task: seeing what is not there. Given an image, our context model can predict where objects should exist, even when no object instances are present. Combined with object detection results, we can perform a novel vision task: finding where objects are missing in an image. Our model is based on a convolutional neural network structure. With a specially designed training strategy, the model learns to ignore objects and focus on context only. It is fully convolutional thus highly efficient. Experiments show the effectiveness of the proposed approach in one important accessibility task: finding city street regions where curb ramps are missing, which could help millions of people with mobility disabilities.
• In this paper, we proposed using a hybrid method that utilises deep convolutional and recurrent neural networks for accurate delineation of skin lesion of images supplied with ISBI 2017 lesion segmentation challenge. The proposed method was trained using 1800 images and tested on 150 images from ISBI 2017 challenge.
• Deep learning is an important component of big-data analytic tools and intelligent applications, such as, self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive, and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training convolutional neural networks (CNN) named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). Major features of CHAOS include the support for thread and vector parallelism, non-instant updates of weight parameters during back-propagation without a significant delay, and implicit synchronization in arbitrary order. CHAOS is tailored for parallel computing systems that are accelerated with the Intel Xeon Phi. We evaluate our parallelization approach empirically using measurement techniques and performance modeling for various numbers of threads and CNN architectures. Experimental results for the MNIST dataset of handwritten digits using the total number of threads on the Xeon Phi show speedups of up to 103x compared to the execution on one thread of the Xeon Phi, 14x compared to the sequential execution on Intel Xeon E5, and 58x compared to the sequential execution on Intel Core i5.
• This paper presents an approach for semantic place categorization using data obtained from RGB cameras. Previous studies on visual place recognition and classification have shown that, by considering features derived from pre-trained Convolutional Neural Networks (CNNs) in combination with part-based classification models, high recognition accuracy can be achieved, even in presence of occlusions and severe viewpoint changes. Inspired by these works, we propose to exploit local deep representations, representing images as set of regions applying a Naïve Bayes Nearest Neighbor (NBNN) model for image classification. As opposed to previous methods where CNNs are merely used as feature extractors, our approach seamlessly integrates the NBNN model into a fully-convolutional neural network. Experimental results show that the proposed algorithm outperforms previous methods based on pre-trained CNN models and that, when employed in challenging robot place recognition tasks, it is robust to occlusions, environmental and sensor changes.
• Magnetic Resonance Imaging (MRI) is widely used in routine clinical diagnosis and treatment. However, variations in MRI acquisition protocols result in different appearances of normal and diseased tissue in the images. Convolutional neural networks (CNNs), which have shown to be successful in many medical image analysis tasks, are typically sensitive to the variations in imaging protocols. Therefore, in many cases, networks trained on data acquired with one MRI protocol, do not perform satisfactorily on data acquired with different protocols. This limits the use of models trained with large annotated legacy datasets on a new dataset with a different domain which is often a recurring situation in clinical settings. In this study, we aim to answer the following central questions regarding domain adaptation in medical image analysis: Given a fitted legacy model, 1) How much data from the new domain is required for a decent adaptation of the original network?; and, 2) What portion of the pre-trained model parameters should be retrained given a certain number of the new domain training samples? To address these questions, we conducted extensive experiments in white matter hyperintensity segmentation task. We trained a CNN on legacy MR images of brain and evaluated the performance of the domain-adapted network on the same task with images from a different domain. We then compared the performance of the model to the surrogate scenarios where either the same trained network is used or a new network is trained from scratch on the new dataset.The domain-adapted network tuned only by two training examples achieved a Dice score of 0.63 substantially outperforming a similar network trained on the same set of examples from scratch.
• Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNN) to simultaneously detect and categorize the objects of interest in cluttered scenes. Training of such models typically requires large amounts of annotated training data which is time consuming and costly to obtain. In this work we explore the ability of using synthetically generated composite images for training state of the art object detectors. We superimpose 2D images of textured object models into images of real environments at variety of locations and scales. Our experiments evaluate different superimposition strategies ranging from purely image-based blending all the way to depth and semantics informed positioning of the object models to real scenes. We demonstrate the effectiveness of these object detector training strategies on publicly available datasets of GMU-Kitchens and Washington RGB-D Scenes v2, and show how object detectors can be trained with limited amounts of annotated real scenes with objects present. This charts new opportunities for training detectors for new objects by exploiting existing object model repositories in either a purely automatic fashion or with only a very small number of human-annotated examples.
• We present an approach to adaptively utilize deep neural networks in order to reduce the evaluation time on new examples without loss of classification performance. Rather than attempting to redesign or approximate existing networks, we propose two schemes that adaptively utilize networks. First, we pose an adaptive network evaluation scheme, where we learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. Building upon this approach, we then learn a network selection system that adaptively selects the network to be evaluated for each example. We exploit the fact that many examples can be correctly classified using relatively efficient networks and that complex, computationally costly networks are only necessary for a small fraction of examples. By avoiding evaluation of these complex networks for a large fraction of examples, computational time can be dramatically reduced. Empirically, these approaches yield dramatic reductions in computational cost, with up to a 2.8x speedup on state-of-the-art networks from the ImageNet image recognition challenge with minimal (less than 1%) loss of accuracy.
• Purpose: Basic surgical skills of suturing and knot tying are an essential part of medical training. Having an automated system for surgical skills assessment could help save experts time and improve training efficiency. There have been some recent attempts at automated surgical skills assessment using either video analysis or acceleration data. In this paper, we present a novel approach for automated assessment of OSATS based surgical skills and provide an analysis of different features on multi-modal data (video and accelerometer data). Methods: We conduct the largest study, to the best of our knowledge, for basic surgical skills assessment on a dataset that contained video and accelerometer data for suturing and knot-tying tasks. We introduce "entropy based" features - Approximate Entropy (ApEn) and Cross-Approximate Entropy (XApEn), which quantify the amount of predictability and regularity of fluctuations in time-series data. The proposed features are compared to existing methods of Sequential Motion Texture (SMT), Discrete Cosine Transform (DCT) and Discrete Fourier Transform (DFT), for surgical skills assessment. Results: We report average performance of different features across all applicable OSATS criteria for suturing and knot tying tasks. Our analysis shows that the proposed entropy based features out-perform previous state-of-the-art methods using video data. For accelerometer data, our method performs better for suturing only. We also show that fusion of video and acceleration features can improve overall performance with the proposed entropy features achieving highest accuracy. Conclusions: Automated surgical skills assessment can be achieved with high accuracy using the proposed entropy features. Such a system can significantly improve the efficiency of surgical training in medical schools and teaching hospitals.
• Deep-layered models trained on a large number of labeled samples boost the accuracy of many tasks. It is important to apply such models to different domains because collecting many labeled samples in various domains is expensive. In unsupervised domain adaptation, one needs to train a classifier that works well on a target domain when provided with labeled source samples and unlabeled target samples. Although many methods aim to match the distributions of source and target samples, simply matching the distribution cannot ensure accuracy on the target domain. To learn discriminative representations for the target domain, we assume that artificially labeling target samples can result in a good representation. Tri-training leverages three classifiers equally to give pseudo-labels to unlabeled samples, but the method does not assume labeling samples generated from a different domain.In this paper, we propose an asymmetric tri-training method for unsupervised domain adaptation, where we assign pseudo-labels to unlabeled samples and train neural networks as if they are true labels. In our work, we use three networks asymmetrically. By asymmetric, we mean that two networks are used to label unlabeled target samples and one network is trained by the samples to obtain target-discriminative representations. We evaluate our method on digit recognition and sentiment analysis datasets. Our proposed method achieves state-of-the-art performance on the benchmark digit recognition datasets of domain adaptation.
• We present a variational multi-label segmentation algorithm based on a robust Huber loss for both the data and the regularizer, minimized within a convex optimization framework. We introduce a novel constraint on the common areas, to bias the solution towards mutually exclusive regions. We also propose a regularization scheme that is adapted to the spatial statistics of the residual at each iteration, resulting in a varying degree of regularization being applied as the algorithm proceeds: the effect of the regularizer is strongest at initialization, and wanes as the solution increasingly fits the data. This minimizes the bias induced by the regularizer at convergence. We design an efficient convex optimization algorithm based on the alternating direction method of multipliers using the equivalent relation between the Huber function and the proximal operator of the one-norm. We empirically validate our proposed algorithm on synthetic and real images and offer an information-theoretic derivation of the cost-function that highlights the modeling choices made.
• Artificial neural networks can be trained with relatively low-precision floating-point and fixed-point arithmetic, using between one and 16 bits. Previous works have focused on relatively wide-but-shallow, feed-forward networks. We introduce a quantization scheme that is compatible with training very deep neural networks. Quantizing the network activations in the middle of each batch-normalization module can greatly reduce the amount of memory and computational power needed, with little loss in accuracy.
• Computational anatomy allows the quantitative analysis of organs in medical images. However, most analysis is constrained to the millimeter scale because of the limited resolution of clinical computed tomography (CT). X-ray microtomography ($\mu$CT) on the other hand allows imaging of ex-vivo tissues at a resolution of tens of microns. In this work, we use clinical CT to image lung cancer patients before partial pneumonectomy (resection of pathological lung tissue). The resected specimen is prepared for $\mu$CT imaging at a voxel resolution of 50 $\mu$m (0.05 mm). This high-resolution image of the lung cancer tissue allows further insides into understanding of tumor growth and categorization. For making full use of this additional information, image fusion (registration) needs to be performed in order to re-align the $\mu$CT image with clinical CT. We developed a multi-scale non-rigid registration approach. After manual initialization using a few landmark points and rigid alignment, several levels of non-rigid registration between down-sampled (in the case of $\mu$CT) and up-sampled (in the case of clinical CT) representations of the image are performed. Any non-lung tissue is ignored during the computation of the similarity measure used to guide the registration during optimization. We are able to recover the volume differences introduced by the resection and preparation of the lung specimen. The average ($\pm$ std. dev.) minimum surface distance between $\mu$CT and clinical CT at the resected lung surface is reduced from 3.3 $\pm$ 2.9 (range: [0.1, 15.9]) to 2.3 mm $\pm$ 2.8 (range: [0.0, 15.3]) mm. The alignment of clinical CT with $\mu$CT will allow further registration with even finer resolutions of $\mu$CT (up to 10 $\mu$m resolution) and ultimately with histopathological microscopy images for further macro to micro image fusion that can aid medical image analysis.
• Why does our visual system fail to reconstruct reality, when we look at certain patterns? Where do Geometrical illusions start to emerge in the visual pathway? Should computational models of vision have the same visual ability to detect illusions as we do? This study addresses these questions, by focusing on a specific underlying neural mechanism involved in our visual experiences that affects our final perception. Among many types of visual illusion, 'Geometrical' and, in particular, 'Tilt' illusions are rather important, being characterized by misperception of geometric patterns involving lines and tiles in combination with contrasting orientation, size or position. Over the last decade, many new neurophysiological experiments have led to new insights as to how, when and where retinal processing takes place, and the encoding nature of the retinal representation that is sent to the cortex for further processing. Based on these neurobiological discoveries, we provide computer simulation evidence to suggest that visual Geometrical illusions are explained in part, by the interaction of multiscale visual processing performed in the retina. The output of our retinal stage model is presented for several types of Tilt illusion, in which the final tilt percept arises from multiple scale processing of Differences of Gaussian and the perceptual interaction of foreground and background elements. Our results suggest that this multilevel filtering explanation, which is a simplified simulation for Retinal Ganglion Cell's responses to these patterns is indeed the underlying mechanism connecting low-level filtering to mid- and high-level explanations such as 'anchoring theory' and 'perceptual grouping'.
• Hyperspectral imaging is an important tool in remote sensing, allowing for accurate analysis of vast areas. Due to a low spatial resolution, a pixel of a hyperspectral image rarely represents a single material, but rather a mixture of different spectra. HSU aims at estimating the pure spectra present in the scene of interest, referred to as endmembers, and their fractions in each pixel, referred to as abundances. Today, many HSU algorithms have been proposed, based either on a geometrical or statistical model. While most methods assume that the number of endmembers present in the scene is known, there is only little work about estimating this number from the observed data. In this work, we propose a Bayesian nonparametric framework that jointly estimates the number of endmembers, the endmembers itself, and their abundances, by making use of the Indian Buffet Process as a prior for the endmembers. Simulation results and experiments on real data demonstrate the effectiveness of the proposed algorithm, yielding results comparable with state-of-the-art methods while being able to reliably infer the number of endmembers. In scenarios with strong noise, where other algorithms provide only poor results, the proposed approach tends to overestimate the number of endmembers slightly. The additional endmembers, however, often simply represent noisy replicas of present endmembers and could easily be merged in a post-processing step.
• Learning from demonstrations has gained increasing interest in the recent past, enabling an agent to learn how to make decisions by observing an experienced teacher. While many approaches have been proposed to solve this problem, there is only little work that focuses on reasoning about the observed behavior. We assume that, in many practical problems, an agent makes its decision based on latent features, indicating a certain action. Therefore, we propose a generative model for the states and actions. Inference reveals the number of features, the features, and the policies, allowing us to learn and to analyze the underlying structure of the observed behavior. Further, our approach enables prediction of actions for new states. Simulations are used to assess the performance of the algorithm based upon this model. Moreover, the problem of learning a driver's behavior is investigated, demonstrating the performance of the proposed model in a real-world scenario.
• Mega-city analysis with very high resolution (VHR) satellite images has been drawing increasing interest in the fields of city planning and social investigation. It is known that accurate land-use, urban density, and population distribution information is the key to mega-city monitoring and environmental studies. Therefore, how to generate land-use, urban density, and population distribution maps at a fine scale using VHR satellite images has become a hot topic. Previous studies have focused solely on individual tasks with elaborate hand-crafted features and have ignored the relationship between different tasks. In this study, we aim to propose a universal framework which can: 1) automatically learn the internal feature representation from the raw image data; and 2) simultaneously produce fine-scale land-use, urban density, and population distribution maps. For the first target, a deep convolutional neural network (CNN) is applied to learn the hierarchical feature representation from the raw image data. For the second target, a novel CNN-based universal framework is proposed to process the VHR satellite images and generate the land-use, urban density, and population distribution maps. To the best of our knowledge, this is the first CNN-based mega-city analysis method which can process a VHR remote sensing image with such a large data volume. A VHR satellite image (1.2 m spatial resolution) of the center of Wuhan covering an area of 2606 km2 was used to evaluate the proposed method. The experimental results confirm that the proposed method can achieve a promising accuracy for land-use, urban density, and population distribution maps.
• Like other problems in computer vision, offline handwritten Chinese character recognition (HCCR) has achieved impressive results using convolutional neural network (CNN)-based methods. However, larger and deeper networks are needed to deliver state-of-the-art results in this domain. Such networks intuitively appear to incur high computational cost, and require the storage of a large number of parameters, which renders them unfeasible for deployment in portable devices. To solve this problem, we propose a Global Supervised Low-rank Expansion (GSLRE) method and an Adaptive Drop-weight (ADW) technique to solve the problems of speed and storage capacity. We design a nine-layer CNN for HCCR consisting of 3,755 classes, and devise an algorithm that can reduce the networks computational cost by nine times and compress the network to 1/18 of the original size of the baseline model, with only a 0.21% drop in accuracy. In tests, the proposed algorithm surpassed the best single-network performance reported thus far in the literature while requiring only 2.3 MB for storage. Furthermore, when integrated with our effective forward implementation, the recognition of an offline character image took only 9.7 ms on a CPU. Compared with the state-of-the-art CNN model for HCCR, our approach is approximately 30 times faster, yet 10 times more cost efficient.
• We introduce a new algorithm, called CDER, for supervised machine learning that merges the multi-scale geometric properties of Cover Trees with the information-theoretic properties of entropy. CDER applies to a training set of labeled pointclouds embedded in a common Euclidean space. If typical pointclouds corresponding to distinct labels tend to differ at any scale in any sub-region, CDER can identify these differences in (typically) linear time, creating a set of distributional coordinates which act as a feature extraction mechanism for supervised learning. We describe theoretical properties and implementation details of CDER, and illustrate its benefits on several synthetic examples.
• Comprehensive Two dimensional gas chromatography (GCxGC) plays a central role into the elucidation of complex samples. The automation of the identification of peak areas is of prime interest to obtain a fast and repeatable analysis of chromatograms. To determine the concentration of compounds or pseudo-compounds, templates of blobs are defined and superimposed on a reference chromatogram. The templates then need to be modified when different chromatograms are recorded. In this study, we present a chromatogram and template alignment method based on peak registration called BARCHAN. Peaks are identified using a robust mathematical morphology tool. The alignment is performed by a probabilistic estimation of a rigid transformation along the first dimension, and a non-rigid transformation in the second dimension, taking into account noise, outliers and missing peaks in a fully automated way. Resulting aligned chromatograms and masks are presented on two datasets. The proposed algorithm proves to be fast and reliable. It significantly reduces the time to results for GCxGC analysis.
• Low-textured image stitching remains a challenging problem. It is difficult to achieve good alignment and is easy to break image structures, due to the insufficient and unreliable point correspondences. Besides, for the viewpoint variations between multiple images, the stitched images suffer from projective distortions. To this end, this paper presents a line-guided local warping with global similarity constraint for image stitching. A two-stage alignment scheme is adopted for good alignment. More precisely, the line correspondence is employed as alignment constraint to guide the accurate estimation of projective warp, then line feature constraints are integrated into mesh-based warping framework to refine the alignment while preserving image structures. To mitigate projectve distortions in non-overlapping regions, we combine global similarity constraint with the projective warps via a weight strategy, so that the final warp slowly changes from projective to similarity across the image. This is also integrated into local multiple homographies model for better parallax handling. Our method is evaluated on a series of images and compared with several other methods. Experiments demonstrate that the proposed method provides convincing stitching performance and outperforms other state-of-the-art methods.
• Recently, two-dimensional canonical correlation analysis (2DCCA) has been successfully applied for image feature extraction. The method instead of concatenating the columns of the images to the one-dimensional vectors, directly works with two-dimensional image matrices. Although 2DCCA works well in different recognition tasks, it lacks a probabilistic interpretation. In this paper, we present a probabilistic framework for 2DCCA called probabilistic 2DCCA (P2DCCA) and an iterative EM based algorithm for optimizing the parameters. Experimental results on synthetic and real data demonstrate superior performance in loading factor estimation for P2DCCA compared to 2DCCA. For real data, three subsets of AR face database and also the UMIST face database confirm the robustness of the proposed algorithm in face recognition tasks with different illumination conditions, facial expressions, poses and occlusions.
• This paper deals with the unification of local and non-local signal processing on graphs within a single convolutional neural network (CNN) framework. Building upon recent works on graph CNNs, we propose to use convolutional layers that take as inputs two variables, a signal and a graph, allowing the network to adapt to changes in the graph structure. This also allows us to learn through training the optimal mixing of locality and non-locality, in cases where the graph is built on the input signal itself. We demonstrate the versatility and the effectiveness of our framework on several types of signals (greyscale and color images, color palettes and speech signals) and on several applications (style transfer, color transfer, and denoising).