This paper contributes to cross-lingual image annotation and retrieval in terms of data and methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For more effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags, COCO-CN is currently the largest Chinese-English dataset applicable for cross-lingual image tagging, captioning and retrieval. We develop methods per task for effectively learning from cross-lingual resources. Extensive experiments on the multiple tasks justify the viability of our dataset and methods.
May 23 2018 cs.CV
Many computer vision and image processing applications rely on local features. It is well-known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against the motion blur. Unlike most deblurring algorithms, the method can handle spatially-variant blur and rolling shutter distortion. Furthermore, it is capable of running in real-time contrary to state-of-the-art algorithms. The limitations of inertial-based blur estimation are taken into account by validating the blur estimates using image data. The evaluation shows that when the method is used with traditional feature detector and descriptor, it increases the number of detected keypoints, provides higher repeatability and improves the localization accuracy. We also demonstrate that such features will lead to more accurate and complete reconstructions when used in the application of 3D visual reconstruction.
May 23 2018 cs.CV
We propose a person detector on omnidirectional images, an accurate method to generate minimal enclosing rectangles of persons. The basic idea is to adapt the qualitative detection performance of a convolutional neural network based method, namely YOLOv2 to fish-eye images. The design of our approach picks up the idea of a state-of-the-art object detector and highly overlapping areas of images with their regions of interests. This overlap reduces the number of false negatives. Based on the raw bounding boxes of the detector we fine-tuned overlapping bounding boxes by three approaches. The non-maximum suppression, the soft non-maximum suppression and the soft non-maximum suppression with Gaussian smoothing. The evaluation was done on the PIROPO database, supplemented with bounding boxes on omnidirectional images. We achieve an average precision of 64.4 % with YOLOv2 for the class person. For this purpose we fine-tuned the soft non-maximum suppression with Gaussian smoothing.
May 23 2018 cs.CV
In this thesis we investigate the effect of using web images to build a large scale database to be used along a deep learning method for a classification task. We replicate the ImageNet large scale database (ILSVRC-2012) from images collected from the web using 4 different download strategies varying: the search engine, the query and the image resolution. As a deep learning method, we will choose the Convolutional Neural Network that was very successful with recognition tasks; the AlexNet.
May 23 2018 cs.CV
This research mainly emphasizes on traffic detection thus essentially involving object detection and classification. The particular work discussed here is motivated from unsatisfactory attempts of re-using well known pre-trained object detection networks for domain specific data. In this course, some trivial issues leading to prominent performance drop are identified and ways to resolve them are discussed. For example, some simple yet relevant tricks regarding data collection and sampling prove to be very beneficial. Also, introducing a blur net to deal with blurred real time data is another important factor promoting performance elevation. We further study the neural network design issues for beneficial object classification and involve shared, region-independent convolutional features. Adaptive learning rates to deal with saddle points are also investigated and an average covariance matrix based pre-conditioned approach is proposed. We also introduce the use of optical flow features to accommodate orientation information. Experimental results demonstrate that this results in a steady rise in the performance rate.
This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, to perform high-throughput inference by exploiting the computation time-accuracy trade-off. Without the need for retraining, a two-stage architecture tailored for any given FPGA device is generated, consisting of a low- and a high-precision unit. A confidence evaluation unit is employed between them to identify misclassified cases at run time and forward them to the high-precision unit or terminate computation. Experiments demonstrate that CascadeCNN achieves a performance boost of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy.
May 23 2018 cs.CV
Reliable markerless motion tracking of multiple people participating in complex group activity from multiple handheld cameras is challenging due to frequent occlusions, strong viewpoint and appearance variations, and asynchronous video streams. The key to solving this problem is to reliably associate the same person across distant viewpoint and temporal instances. In this work, we combine motion tracking, mutual exclusion constraints, and multiview geometry in a multitask learning framework to automatically adapt a generic person appearance descriptor to the domain videos. Tracking is formulated as a spatiotemporally constrained clustering using the adapted person descriptor. Physical human constraints are exploited to reconstruct accurate and consistent 3D skeletons for every person across the entire sequence. We show significant improvement in association accuracy (up to 18%) in events with up to 60 people and 3D human skeleton reconstruction (5 to 10 times) over the baseline for events captured "in the wild".
May 23 2018 cs.CV
Recent research has widely explored the problem of aesthetics assessment of images with generic content. However, few approaches have been specifically designed to predict the aesthetic quality of images containing human faces, which make up a massive portion of photos in the web. This paper introduces a method for aesthetic quality assessment of images with faces. We exploit three different Convolutional Neural Networks to encode information regarding perceptual quality, global image aesthetics, and facial attributes; then, a model is trained to combine these features to explicitly predict the aesthetics of images containing faces. Experimental results show that our approach outperforms existing methods for both binary, i.e. low/high, and continuous aesthetic score prediction on four different databases in the state-of-the-art.
May 23 2018 cs.CV
We propose a geometric convexity shape prior preservation method for variational level set based image segmentation methods. Our method is built upon the fact that the level set of a convex signed distanced function must be convex. This property enables us to transfer a complicated geometrical convexity prior into a simple inequality constraint on the function. An active set based Gauss-Seidel iteration is used to handle this constrained minimization problem to get an efficient algorithm. We apply our method to region and edge based level set segmentation models including Chan-Vese (CV) model with guarantee that the segmented region will be convex. Experimental results show the effectiveness and quality of the proposed model and algorithm.
Conditional generative adversarial networks (cGAN) have led to large improvements in the task of conditional image generation, which lies at the heart of computer vision. The major focus so far has been on performance improvement, while there has been little effort in making cGAN more robust to noise or leveraging structure in the output space of the model. The end-to-end regression (of the generator) might lead to arbitrarily large errors in the output, which is unsuitable for the application of such networks to real-world systems. In this work, we introduce a novel conditional GAN, called RoCGAN, which adds implicit constraints to address the issue. Our proposed model augments the generator with an unsupervised pathway, which encourages the outputs of the generator to span the target manifold even in the presence of large amounts of noise. We prove that RoCGAN shares similar theoretical properties as GAN and experimentally verify that the proposed model outperforms existing state-of-the-art cGAN architectures by a large margin in a variety of domains including images from natural scenes and faces.
In this paper we propose solving localized multiple kernel learning (LMKL) using LMKL-Net, a feedforward deep neural network. In contrast to previous works, as a learning principle we propose \em parameterizing both the gating function for learning kernel combination weights and the multiclass classifier in LMKL using an attentional network (AN) and a multilayer perceptron (MLP), respectively. In this way we can learn the (nonlinear) decision function in LMKL (approximately) by sequential applications of AN and MLP. Empirically on benchmark datasets we demonstrate that overall LMKL-Net can not only outperform the state-of-the-art MKL solvers in terms of accuracy, but also be trained about \em two orders of magnitude faster with much smaller memory footprint for large-scale learning.
May 23 2018 cs.CV
Performing the inference step of deep learning in resource constrained environments, such as embedded devices, is challenging. Success requires optimization at both software and hardware levels. Low precision arithmetic and specifically low precision fixed-point number systems have become the standard for performing deep learning inference. However, representing non-uniform data and distributed parameters (e.g. weights) by using uniformly distributed fixed-point values is still a major drawback when using this number system. Recently, the posit number system was proposed, which represents numbers in a non-uniform manner. Therefore, in this paper we are motivated to explore using the posit number system to represent the weights of Deep Convolutional Neural Networks. However, we do not apply any quantization techniques and hence the network weights do not require re-training. The results of this exploration show that using the posit number system outperformed the fixed point number system in terms of accuracy and memory utilization.
May 23 2018 cs.CV
Real-time algorithms for automatically recognizing surgical phases are needed to develop systems that can provide assistance to surgeons, enable better management of operating room (OR) resources and consequently improve safety within the OR. State-of-the-art surgical phase recognition algorithms using laparoscopic videos are based on fully supervised training. This limits their potential for widespread application, since creation of manual annotations is an expensive process considering the numerous types of existing surgeries and the vast amount of laparoscopic videos available. In this work, we propose a new self-supervised pre-training approach based on the prediction of remaining surgery duration (RSD) from laparoscopic videos. The RSD prediction task is used to pre-train a convolutional neural network (CNN) and long short-term memory (LSTM) network in an end-to-end manner. Our proposed approach utilizes all available data and reduces the reliance on annotated data, thereby facilitating the scaling up of surgical phase recognition algorithms to different kinds of surgeries. Additionally, we present EndoN2N, an end-to-end trained CNN-LSTM model for surgical phase recognition and evaluate the performance of our approach on a dataset of 120 Cholecystectomy laparoscopic videos (Cholec120). This work also presents the first systematic study of self-supervised pre-training approaches to understand the amount of annotations required for surgical phase recognition. Interestingly, the proposed RSD pre-training approach leads to performance improvement even when all the training data is manually annotated and outperforms the single pre-training approach for surgical phase recognition presently published in the literature. It is also observed that end-to-end training of CNN-LSTM networks boosts surgical phase recognition performance.
Providing force feedback as relevant information in current Robot-Assisted Minimally Invasive Surgery systems constitutes a technological challenge due to the constraints imposed by the surgical environment. In this context, Sensorless Force Estimation techniques represent a potential solution, enabling to sense the interaction forces between the surgical instruments and soft-tissues. Specifically, if visual feedback is available for observing soft-tissues' deformation, this feedback can be used to estimate the forces applied to these tissues. To this end, a force estimation model, based on Convolutional Neural Networks and Long-Short Term Memory networks, is proposed in this work. This model is designed to process both, the spatiotemporal information present in video sequences and the temporal structure of tool data (the surgical tool-tip trajectory and its grasping status). A series of analyses are carried out to reveal the advantages of the proposal and the challenges that remain for real applications. This research work focuses on two surgical task scenarios, referred to as pushing and pulling tissue. For these two scenarios, different input data modalities and their effect on the force estimation quality are investigated. These input data modalities are tool data, video sequences and a combination of both. The results suggest that the force estimation quality is better when both, the tool data and video sequences, are processed by the neural network model. Moreover, this study reveals the need for a loss function, designed to promote the modeling of smooth and sharp details found in force signals. Finally, the results show that the modeling of forces due to pulling tasks is more challenging than for the simplest pushing actions.
May 23 2018 cs.CV
A key problem in blind image quality assessment (BIQA) is how to effectively model the properties of human visual system in a data-driven manner. In this paper, we propose a simple and efficient BIQA model based on a novel framework which consists of a fully convolutional neural network (FCNN) and a pooling network to solve this problem. In principle, FCNN is capable of predicting a pixel-by-pixel similar quality map only from a distorted image by using the intermediate similarity maps derived from conventional full-reference image quality assessment methods. The predicted pixel-by-pixel quality maps have good consistency with the distortion correlations between the reference and distorted images. Finally, a deep pooling network regresses the quality map into a score. Experiments have demonstrated that our predictions outperform many state-of-the-art BIQA methods.
May 23 2018 cs.CV
Recently, pose-based action recognition has gained more and more attention due to the better performance compared with traditional appearance-based methods. However, there still exist two problems to be further solved. First, existing pose-based methods generally recognize human actions with captured 3D human poses which are very difficult to obtain in real scenarios. Second, few pose-based methods model the action-related objects in recognizing human-object interaction actions in which objects play an important role. To solve the problems above, we propose a pose-based two-stream relational network (PSRN) for action recognition. In PSRN, one stream models the temporal dynamics of the targeted 2D human pose sequences which are directly extracted from raw videos, and the other stream models the action-related objects from a randomly sampled video frame. Most importantly, instead of fusing two-streams in the class score layer as before, we propose a pose-object relational network to model the relationship between human poses and action-related objects. We evaluate the proposed PSRN on two challenging benchmarks, i.e., Sub-JHMDB and PennAction. Experimental results show that our PSRN obtains the state-the-of-art performance on Sub-JHMDB (80.2%) and PennAction (98.1%). Our work opens a new door to action recognition by combining 2D human pose extracted from raw video and image appearance.
May 23 2018 cs.CV
Scene coordinate regression has become an essential part of current camera re-localization methods. Different versions in the form of regression forests and deep learning methods have been successfully applied to estimate the corresponding camera pose given a single input image. In this work, we propose to regress scene coordinates pixel-wise for a given RGB image using deep learning. Compared to the recent methods, which usually employ RANSAC to obtain a robust pose estimate from the established point correspondences, we propose to regress confidences of these correspondences, which allows us to immediately discard erroneous predictions resulting in boosting initial pose estimates. Finally, the resulting confidences can be used to score initial pose hypothesis and aid in pose refinement, offering a generalized solution to solve this task.
We study the quantification of uncertainty of Convolutional Neural Networks (CNNs) based on gradient metrics. Unlike the classical softmax entropy, such metrics gather information from all layers of the CNN. We show for the (E)MNIST data set that for several such metrics we achieve the same meta classification accuracy -- i.e. the task of classifying correctly predicted labels as correct and incorrectly predicted ones as incorrect without knowing the actual label -- as for entropy thresholding. Meta classification rates for out of sample images can be increased when using entropy together with several gradient based metrics as input quantities for a meta-classifier. This proves that our gradient based metrics do not contain the same information as the entropy. We also apply meta classification to concepts not used during training: EMNIST/Omniglot letters, CIFAR10 and noise. Meta classifiers only trained on the uncertainty metrics of classes available during training usually do not perform equally well for all the unknown concepts letters, CIFAR10 and uniform noise. If we however allow the meta classifier to be trained on uncertainty metrics including some samples of some or all of the categories, meta classification for concepts remote from MNIST digits can be improved considerably.
May 23 2018 cs.CV
Facial micro-expression (ME) recognition has posed a huge challenge to researchers for its subtlety in motion and limited databases. Recently, handcrafted techniques have achieved superior performance in micro-expression recognition but at the cost of domain specificity and cumbersome parametric tunings. In this paper, we propose an Enriched Long-term Recurrent Convolutional Network (ELRCN) that first encodes each micro-expression frame into a feature vector through CNN module(s), then predicts the micro-expression by passing the feature vector through a Long Short-term Memory (LSTM) module. The framework contains two different network variants: (1) Channel-wise stacking of input data for spatial enrichment, (2) Feature-wise stacking of features for temporal enrichment. We demonstrate that the proposed approach is able to achieve reasonably good performance, without data augmentation. In addition, we also present ablation studies conducted on the framework and visualizations of what CNN "sees" when predicting the micro-expression classes.
May 23 2018 cs.CV
We propose the autofocus convolutional layer for semantic segmentation with the objective of enhancing the capabilities of neural networks for multi-scale processing. Autofocus layers adaptively change the size of the effective receptive field based on the processed context to generate more powerful features. This is achieved by parallelising multiple convolutional layers with different dilation rates, combined by an attention mechanism that learns to focus on the optimal scales driven by context. By sharing the weights of the parallel convolutions we introduce only a small number of trainable parameters. The proposed autofocus layer can be easily integrated into existing networks to cope with biological variability among tasks. We evaluate our models on the challenging tasks of multi-organ segmentation in pelvic CT and brain tumor segmentation in MRI and achieve very promising performance.
May 23 2018 cs.CV
Deep learning has emerged as a powerful artificial intelligence tool to interpret medical images for a growing variety of applications. However, the paucity of medical imaging data with high-quality annotations that is necessary for training such methods ultimately limits their performance. Medical data is challenging to acquire due to privacy issues, shortage of experts available for annotation, limited representation of rare conditions and cost. This problem has previously been addressed by using synthetically generated data. However, networks trained on synthetic data often fail to generalize to real data. Cinematic rendering simulates the propagation and interaction of light passing through tissue models reconstructed from CT data, enabling the generation of photorealistic images. In this paper, we present one of the first applications of cinematic rendering in deep learning, in which we propose to fine-tune synthetic data-driven networks using cinematically rendered CT data for the task of monocular depth estimation in endoscopy. Our experiments demonstrate that: (a) Convolutional Neural Networks (CNNs) trained on synthetic data and fine-tuned on photorealistic cinematically rendered data adapt better to real medical images and demonstrate more robust performance when compared to networks with no fine-tuning, (b) these fine-tuned networks require less training data to converge to an optimal solution, and (c) fine-tuning with data from a variety of photorealistic rendering conditions of the same scene prevents the network from learning patient-specific information and aids in generalizability of the model. Our empirical evaluation demonstrates that networks fine-tuned with cinematically rendered data predict depth with 56.87% less error for rendered endoscopy images and 27.49% less error for real porcine colon endoscopy images.
Answering visual questions need acquire daily common knowledge and model the semantic connection among different parts in images, which is too difficult for VQA systems to learn from images with the only supervision from answers. Meanwhile, image captioning systems with beam search strategy tend to generate similar captions and fail to diversely describe images. To address the aforementioned issues, we present a system to have these two tasks compensate with each other, which is capable of jointly producing image captions and answering visual questions. In particular, we utilize question and image features to generate question-related captions and use the generated captions as additional features to provide new knowledge to the VQA system. For image captioning, our system attains more informative results in term of the relative improvements on VQA tasks as well as competitive results using automated metrics. Applying our system to the VQA tasks, our results on VQA v2 dataset achieve 65.8% using generated captions and 69.1% using annotated captions in validation set and 68.4% in the test-standard set. Further, an ensemble of 10 models results in 69.7% in the test-standard split.
May 23 2018 cs.CV
A novel framework named Markov Clustering Network (MCN) is proposed for fast and robust scene text detection. MCN predicts instance-level bounding boxes by firstly converting an image into a Stochastic Flow Graph (SFG) and then performing Markov Clustering on this graph. Our method can detect text objects with arbitrary size and orientation without prior knowledge of object size. The stochastic flow graph encode objects' local correlation and semantic information. An object is modeled as strongly connected nodes, which allows flexible bottom-up detection for scale-varying and rotated objects. MCN generates bounding boxes without using Non-Maximum Suppression, and it can be fully parallelized on GPUs. The evaluation on public benchmarks shows that our method outperforms the existing methods by a large margin in detecting multioriented text objects. MCN achieves new state-of-art performance on challenging MSRA-TD500 dataset with precision of 0.88, recall of 0.79 and F-score of 0.83. Also, MCN achieves realtime inference with frame rate of 34 FPS, which is $1.5\times$ speedup when compared with the fastest scene text detection algorithm.
For neural networks (NNs) with rectified linear unit (ReLU) or binary activation functions, we show that their training can be accomplished in a reduced parameter space. Specifically, the weights in each neuron can be trained on the unit sphere, as opposed to the entire space, and the threshold can be trained in a bounded interval, as opposed to the real line. We show that the NNs in the reduced parameter space are mathematically equivalent to the standard NNs with parameters in the whole space. The reduced parameter space shall facilitate the optimization procedure for the network training, as the search space becomes (much) smaller. We demonstrate the improved training performance using numerical examples.
Handling object interaction is a fundamental challenge in practical multi-object tracking, even for simple interactive effects such as one object temporarily occluding another. We formalize the problem of occlusion in tracking with two different abstractions. In object-wise occlusion, objects that are occluded by other objects do not generate measurements. In measurement-wise occlusion, a previously unstudied approach, all objects may generate measurements but some measurements may be occluded by others. While the relative validity of each abstraction depends on the situation and sensor, measurement-wise occlusion fits into probabilistic multi-object tracking algorithms with much looser assumptions on object interaction. Its value is demonstrated by showing that it naturally derives a popular approximation for lidar tracking, and by an example of visual tracking in image space.
The success of deep learning models is heavily tied to the use of massive amount of labeled data and excessively long training time. With the emergence of intelligent edge applications that use these models, the critical challenge is to obtain the same inference capability on a resource-constrained device while providing adaptability to cope with the dynamic changes in the data. We propose AgileNet, a novel lightweight dictionary-based few-shot learning methodology which provides reduced complexity deep neural network for efficient execution at the edge while enabling low-cost updates to capture the dynamics of the new data. Evaluations of state-of-the-art few-shot learning benchmarks demonstrate the superior accuracy of AgileNet compared to prior arts. Additionally, AgileNet is the first few-shot learning approach that prevents model updates by eliminating the knowledge obtained from the primary training. This property is ensured through the dictionaries learned by our novel end-to-end structured decomposition, which also reduces the memory footprint and computation complexity to match the edge device constraints.
We consider the optimization of deep convolutional neural networks (CNNs) such that they provide good performance while having reduced complexity if deployed on either conventional systems utilizing spatial-domain convolution or lower complexity systems designed for Winograd convolution. Furthermore, we explore the universal quantization and compression of these networks. In particular, the proposed framework produces one compressed model whose convolutional filters are sparse not only in the spatial domain but also in the Winograd domain. Hence, one compressed model can be deployed universally on any platform, without need for re-training on the deployed platform, and the sparsity of its convolutional filters can be exploited for further complexity reduction in either domain. To get a better compression ratio, the sparse model is compressed in the spatial domain which has a less number of parameters. From our experiments, we obtain $24.2\times$, $47.7\times$ and $35.4\times$ compressed models for ResNet-18, AlexNet and CT-SRCNN, while their computational complexity is also reduced by $4.5\times$, $5.1\times$ and $23.5\times$, respectively.
May 23 2018 cs.CV
Generating long and coherent reports to describe medical images poses challenges to bridging visual patterns with informative human linguistic descriptions. We propose a novel Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) which reconciles traditional retrieval-based approaches populated with human prior knowledge, with modern learning-based approaches to achieve structured, robust, and diverse report generation. HRGR-Agent employs a hierarchical decision-making procedure. For each sentence, a high-level retrieval policy module chooses to either retrieve a template sentence from an off-the-shelf template database, or invoke a low-level generation module to generate a new sentence. HRGR-Agent is updated via reinforcement learning, guided by sentence-level and word-level rewards. Experiments show that our approach achieves the state-of-the-art results on two medical report datasets, generating well-balanced structured sentences with robust coverage of heterogeneous medical report contents. In addition, our model achieves the highest detection accuracy of medical terminologies, and improved human evaluation performance.
May 23 2018 cs.CV
The problem of different training and test set class priors is addressed in the context of CNN classifiers. An EM-based algorithm for test-time class priors estimation is evaluated on fine-grained computer vision problems for both the batch and on-line situations. Experimental results show a significant improvement on the fine-grained classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% on the FGVC iNaturalist 2018 validation set and by 3.9% on the FGVCx Fungi 2018 validation set. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increased the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.
Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or even fine-tune it with new data. Here, we show that, surprisingly, this is indeed possible. The key observation we make is that the layers of a deep network close to the output layer contain independent, easily extractable class-relevant information that is not contained in the output layer itself. We propose to extract this extra class-relevant information using a simple key-value cache memory to improve the classification performance of the model at test time. Our cache memory is directly inspired by a similar cache model previously proposed for language modeling (Grave et al., 2017). This cache component does not require any training or fine-tuning; it can be applied to any pre-trained model and, by properly setting only two hyper-parameters, leads to significant improvements in its classification performance. Improvements are observed across several architectures and datasets. In the cache component, using features extracted from layers close to the output (but not from the output layer itself) as keys leads to the largest improvements. Concatenating features from multiple layers to form keys can further improve performance over using single-layer features as keys. The cache component also has a regularizing effect, a simple consequence of which is that it substantially increases the robustness of models against adversarial attacks.
Spatial and spectral approaches are two major approaches for image processing tasks such as image classification and object recognition. Among many such algorithms, convolutional neural networks (CNNs) have recently achieved significant performance improvement in many challenging tasks. Since CNNs process images directly in the spatial domain, they are essentially spatial approaches. Given that spatial and spectral approaches are known to have different characteristics, it will be interesting to incorporate a spectral approach into CNNs. We propose a novel CNN architecture, wavelet CNNs, which combines a multiresolution analysis and CNNs into one model. Our insight is that a CNN can be viewed as a limited form of a multiresolution analysis. Based on this insight, we supplement missing parts of the multiresolution analysis via wavelet transform and integrate them as additional components in the entire architecture. Wavelet CNNs allow us to utilize spectral information which is mostly lost in conventional CNNs but useful in most image processing tasks. We evaluate the practical performance of wavelet CNNs on texture classification and image annotation. The experiments show that wavelet CNNs can achieve better accuracy in both tasks than existing models while having significantly fewer parameters than conventional CNNs.
May 23 2018 cs.CV
We developed a Convolutional Neural Network model to identify and classify Airbrushed (alternatively known as Blow-spun), Electrospun and Steel Wire scaffolds. Our model ScaffoldNet is a 6-layer Convolutional Neural Network trained and tested on 3,043 images of Airbrushed, Electrospun and Steel Wire scaffolds. The model takes in as input an imaged scaffold and then outputs the scaffold type (Airbrushed, Electrospun or Steel Wire) as predicted probabilities for the 3 classes. Our model scored a 99.44% Accuracy, demonstrating potential for adaptation to investigating and solving complex machine learning problems aimed at abstract spatial contexts, or in screening complex, biological, fibrous structures seen in cortical bone and fibrous shells.
May 23 2018 cs.CV
Hashing method maps similar data to binary hashcodes with smaller hamming distance, which has received a broad attention due to its low storage cost and fast retrieval speed. With the rapid development of deep learning, deep hashing methods have achieved promising results in efficient information retrieval. Most of the existing deep hashing methods adopt pairwise or triplet losses to deal with similarities underlying the data, but the training is difficult and less efficient because $O(n^2)$ data pairs and $O(n^3)$ triplets are involved. To address these issues, we propose a novel deep hashing algorithm with unary loss which can be trained very efficiently. We first of all introduce a Unary Upper Bound of the traditional triplet loss, thus reducing the complexity to $O(n)$ and bridging the classification-based unary loss and the triplet loss. Second, we propose a novel Semantic Cluster Deep Hashing (SCDH) algorithm by introducing a modified Unary Upper Bound loss, named Semantic Cluster Unary Loss (SCUL). The resultant hashcodes form several compact clusters, which means hashcodes in the same cluster have similar semantic information. We also demonstrate that the proposed SCDH is easy to be extended to semi-supervised settings by incorporating the state-of-the-art semi-supervised learning algorithms. Experiments on large-scale datasets show that the proposed method is superior to state-of-the-art hashing algorithms.
May 23 2018 cs.CV
We present an efficient neural network approach for locating anatomical landmarks, using a two-pass, two-resolution cascaded approach which leverages a mechanism we term atlas location autocontext. Location predictions are made by regression to Gaussian heatmaps, one heatmap per landmark. This system allows patchwise application of a shallow network, thus enabling the prediction of multiple volumetric heatmaps in a unified system, without prohibitive GPU memory requirements. Evaluation is performed for 22 landmarks defined on a range of structures in head CT scans and the neural network model is benchmarked against a previously reported decision forest model trained with the same cascaded system. Models are trained and validated on 201 scans. Over the final test set of 20 scans which was independently annotated by 2 observers, we show that the neural network reaches an accuracy which matches the annotator variability.
A recent Cell paper [Chang and Tsao, 2017] reports an interesting discovery. For the face stimuli generated by a pre-trained active appearance model (AAM), the responses of neurons in the areas of the primate brain that are responsible for face recognition exhibit strong linear relationship with the shape variables and appearance variables of the AAM that generates the face stimuli. In this paper, we show that this behavior can be replicated by a deep generative model called the generator network, which assumes that the observed signals are generated by latent random variables via a top-down convolutional neural network. Specifically, we learn the generator network from the face images generated by a pre-trained AAM model using variational auto-encoder, and we show that the inferred latent variables of the learned generator network have strong linear relationship with the shape and appearance variables of the AAM model that generates the face images. Unlike the AAM model that has an explicit shape model where the shape variables generate the control points or landmarks, the generator network has no such shape model and shape variables. Yet the generator network can learn the shape knowledge in the sense that some of the latent variables of the learned generator network capture the shape variations in the face images generated by AAM.
3D registration has always been performed invoking singular value decomposition (SVD) or eigenvalue decomposition (EIG) in real engineering practices. However, numerical algorithms suffer from uncertainty of convergence in many cases. A novel fast symbolic solution is proposed in this paper by following our recent publication in this journal. The equivalence analysis shows that our previous solver can be converted to deal with the 3D registration problem. Rather, the computation procedure is studied for further simplification of computing without complex numbers support. Experimental results show that the proposed solver does not loose accuracy and robustness but improves the execution speed to a large extent by almost \%50 to \%80, on both personal computer and embedded processor.
When a person attempts to conceal an emotion, the genuine emotion is manifest as a micro-expression. Exploration of automatic facial micro-expression recognition systems is relatively new in the computer vision domain. This is due to the difficulty in implementing optimal feature extraction methods to cope with the subtlety and brief motion characteristics of the expression. Most of the existing approaches extract the subtle facial movements based on hand-crafted features. In this paper, we address the micro-expression recognition task with a convolutional neural network (CNN) architecture, which well integrates the features extracted from each video. A new feature descriptor, Optical Flow Features from Apex frame Network (OFF-ApexNet) is introduced. This feature descriptor combines the optical ow guided context with the CNN. Firstly, we obtain the location of the apex frame from each video sequence as it portrays the highest intensity of facial motion among all frames. Then, the optical ow information are attained from the apex frame and a reference frame (i.e., onset frame). Finally, the optical flow features are fed into a pre-designed CNN model for further feature enhancement as well as to carry out the expression classification. To evaluate the effectiveness of OFF-ApexNet, comprehensive evaluations are conducted on three public spontaneous micro-expression datasets (i.e., SMIC, CASME II and SAMM). The promising recognition result suggests that the proposed method can optimally describe the significant micro-expression details. In particular, we report that, in a multi-database with leave-one-subject-out cross-validation experimental protocol, the recognition performance reaches 74.60% of recognition accuracy and F-measure of 71.04%. We also note that this is the first work that performs cross-dataset validation on three databases in this domain.
May 23 2018 cs.CV
In recent years, deep learning methods have been successfully applied to image classification tasks. Many such deep neural networks exist today that can easily differentiate cats from dogs. One such model is the ResNeXt model that uses a homogeneous, multi-branch architecture for image classification. This paper aims at implementing and evaluating the ResNeXt model architecture on subsets of the CIFAR-10 dataset. It also tweaks the original ResNeXt hyper-parameters such as cardinality, depth and base-width and compares the performance of the modified model with the original. Analysis of the experiments performed in this paper show that a slight decrease in depth or base-width does not affect the performance of the model much leading to comparable results.
May 23 2018 cs.CV
Urban facade segmentation from automatically acquired imagery, in contrast to traditional image segmentation, poses several unique challenges. 360-degree photospheres captured from vehicles are an effective way to capture a large number of images, but this data presents difficult-to-model warping and stitching artifacts. In addition, each pixel can belong to multiple facade elements, and different facade elements (e.g., window, balcony, sill, etc.) are correlated and vary wildly in their characteristics. In this paper, we propose three network architectures of varying complexity to achieve multilabel semantic segmentation of facade images while exploiting their unique characteristics. Specifically, we propose a MULTIFACSEGNET architecture to assign multiple labels to each pixel, a SEPARABLE architecture as a low-rank formulation that encourages extraction of rectangular elements, and a COMPATIBILITY network that simultaneously seeks segmentation across facade element types allowing the network to 'see' intermediate output probabilities of the various facade element classes. Our results on benchmark datasets show significant improvements over existing facade segmentation approaches for the typical facade elements. For example, on one commonly used dataset, the accuracy scores for window(the most important architectural element) increases from 0.91 to 0.97 percent compared to the best competing method, and comparable improvements on other element types.
Many real-world tasks involve identifying patterns from data satisfying background and prior knowledge, for which the ground truth is not available, but ideal data can be obtained, for example, using theoretical simulations. We propose a novel approach, imitation refinement, which refines imperfect patterns by imitating ideal patterns. The imperfect patterns are obtained for example using an unsupervised learner. Imitation refinement imitates ideal data by incorporating prior knowledge captured by a classifier trained on the ideal data: an imitation refiner applies small modifications to imperfect patterns so that the classifier can identify them. In a sense, imitation refinement fits the data to the classifier, which complements the classical supervised learning task. We show that our imitation refinement approach outperforms existing methods in identifying crystal patterns from X-ray diffraction data in materials discovery. We also show the generality of our approach by illustrating its applicability to a computer vision task.
Deep convolutional neural networks have dominated the pattern recognition scene by providing much more accurate solutions in computer vision problems such as object recognition and object detection. Most of these solutions come at a huge computational cost, requiring billions of multiply-accumulate operations and, thus, making their use quite challenging in real-time applications that run on embedded mobile (resource-power constrained) hardware. This work presents the architecture, the high-level synthesis design, and the implementation of SqueezeJet, an FPGA accelerator for the inference phase of the SqueezeNet DCNN architecture, which is designed specifically for use in embedded systems. Results show that SqueezeJet can achieve 15.16 times speed-up compared to the software implementation of SqueezeNet running on an embedded mobile processor with less than 1% drop in top-5 accuracy.
May 23 2018 cs.CV
We develop a two-stage deep learning framework that recommends fashion images based on other input images of similar style. For that purpose, a neural network classifier is used as data driven, visual-aware feature extractor. The latter then serves as input for similarity based recommendations. Our approach is tested on the publicly available Fashion dataset. Initialization strategies using transfer learning from larger product databases are presented.
Convolutional neural network models (CNNs) have made major advances in computer vision tasks in the last five years. Given the challenge in collecting real world datasets, most studies report performance metrics based on available research datasets. In scenarios where CNNs are to be deployed on images or videos from mobile devices, models are presented with new challenges due to lighting, angle, and camera specifications, which are not accounted for in research datasets. It is essential for assessment to also be conducted on real world datasets if such models are to be reliably integrated with products and services in society. Plant disease datasets can be used to test CNNs in real time and gain insight into real world performance. We train a CNN object detection model to identify foliar symptoms of diseases (or lack thereof) in cassava (Manihot esculenta Crantz). We then deploy the model on a mobile app and test its performance on mobile images and video of 720 diseased leaflets in an agricultural field in Tanzania. Within each disease category we test two levels of severity of symptoms - mild and pronounced, to assess the model performance for early detection of symptoms. In both severities we see a decrease in the F-1 score for real world images and video. The F-1 score dropped by 32% for pronounced symptoms in real world images (the closest data to the training data) due to a drop in model recall. If the potential of smartphone CNNs are to be realized our data suggest it is crucial to consider tuning precision and recall performance in order to achieve the desired performance in real world settings. In addition, the varied performance related to different input data (image or video) is an important consideration for the design of CNNs in real world applications.
May 23 2018 cs.CV
High throughput and low latency inference of deep neural networks are critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel optimized deep learning framework that supports efficient 8-bit low precision inference and model optimization techniques of convolutional neural networks on Intel Xeon Scalable Processors. The 8-bit optimized model is automatically generated with a calibration process from FP32 model without the need of fine-tuning or retraining. We show that the inference throughput and latency with ResNet-50, Inception-v3 and SSD are improved by 1.38X-2.9X and 1.35X-3X respectively with neglectable accuracy loss from IntelCaffe FP32 baseline and by 56X-75X and 26X-37X from BVLC Caffe. All these techniques have been open-sourced on IntelCaffe GitHub1, and the artifact is provided to reproduce the result on Amazon AWS Cloud.
May 23 2018 cs.CV
In this paper, we describe our approach for the OMG- Emotion Challenge 2018. The goal is to produce utterance-level valence and arousal estimations for videos of approximately 1 minute length. We tackle this problem by first extracting facial expressions features of videos as time series data, and then using Recurrent Neural Networks of the Echo State Network type to model the correspondence between the time series data and valence-arousal values. Experimentally we show that the proposed approach surpasses the baseline methods provided by the organizers.
A detailed environment perception is a crucial component of automated vehicles. However, to deal with the amount of perceived information, we also require segmentation strategies. Based on a grid map environment representation, well-suited for sensor fusion, free-space estimation and machine learning, we detect and classify objects using deep convolutional neural networks. As input for our networks we use a multi-layer grid map efficiently encoding 3D range sensor information. The inference output consists of a list of rotated bounding boxes with associated semantic classes. We conduct extensive ablation studies, highlight important design considerations when using grid maps and evaluate our models on the KITTI Bird's Eye View benchmark. Qualitative and quantitative benchmark results show that we achieve robust detection and state of the art accuracy solely using top-view grid maps from range sensor data.
In this paper, we present an efficient pedestrian detection system, designed by fusion of multiple deep neural network (DNN) systems. Pedestrian candidates are first generated by a single shot convolutional multi-box detector at different locations with various scales and aspect ratios. The candidate generator is designed to provide the majority of ground truth pedestrian annotations at the cost of a large number of false positives. Then, a classification system using the idea of ensemble learning is deployed to improve the detection accuracy. The classification system further classifies the generated candidates based on opinions of multiple deep verification networks and a fusion network which utilizes a novel soft-rejection fusion method to adjust the confidence in the detection results. To improve the training of the deep verification networks, a novel soft-label method is devised to assign floating point labels to the generated pedestrian candidates. A deep context aggregation semantic segmentation network also provides pixel-level classification of the scene and its results are softly fused with the detection results by the single shot detector. Our pedestrian detector compared favorably to state-of-art methods on all popular pedestrian detection datasets. For example, our fused DNN has better detection accuracy on the Caltech Pedestrian dataset than all previous state of art methods, while also being the fastest. We significantly improved the log-average miss rate on the Caltech pedestrian dataset to 7.67% and achieved the new state-of-the-art.
May 23 2018 cs.CV
Recent neuroimaging studies have shown that functional connectomes are unique to individuals, i.e., two distinct fMRIs taken over different sessions of the same subject are more similar in terms of their connectomes than those from two different subjects. In this study, we present significant new results that identify specific parts of the connectome that code the unique signatures. We show that a very small part of the connectome (under 100 features from among over 64K total features) is responsible for the signatures. A network of these features is shown to achieve excellent training and test accuracy in matching imaging datasets. We show that these features are statistically significant, robust to perturbations, invariant across populations, and are localized to a small number of structural regions of the brain (12 regions from among 180). We develop an innovative matrix sampling technique to derive computationally efficient and accurate methods for identifying the discriminating sub-connectome and support all of our claims using state of the art statistical tests and computational techniques.
In the past few years, the technology of automated guided vehicles (AGVs) has notably advanced. In particular, in the field of factory and warehouse automation, different approaches have been presented for detecting and localizing pallets inside warehouses and shop-floor environments based on the data acquired from 2D laser rangefinders. In , we present a robust approach allowing AGVs to detect, localize, and track multiple pallets using machine learning techniques based on an on-board 2D laser rangefinder. In this paper, the data used in [1, 2] for solving the problem of detection, localization and tracking of pallets is described. Furthermore, we present an open repository of dataset and code to the community for further research activities. The dataset comprises a collection of 565 2D scans from real-world environments, which are divided into 340 samples where pallets are present, whereas 225 samples represent the case in which no pallets are present.
May 23 2018 cs.CV
We propose a novel part-based method for tracking an arbitrary object in challenging video sequences, focusing on robustly tracking under the effects of camera motion and object motion change. Each of a group of tracked image patches on the target is represented by pairs of RGB pixel samples and counts of how many pixels in the patch are similar to them. This empirically characterises the underlying colour distribution of the patches and allows for matching using the Bhattacharyya distance. Candidate patch locations are generated by applying non-shearing affine transformations to the patches' previous locations, followed by local optimisation. Experiments using the VOT2016 dataset show that our tracker out-performs all other part-based trackers in terms of robustness to camera motion and object motion change.