Due to the lack of information such as the space environment condition and resident space objects' (RSOs') body characteristics, current orbit predictions that are solely grounded on physics-based models may fail to achieve required accuracy for collision avoidance and have led to satellite collisions already. This paper presents a methodology to predict RSOs' trajectories with higher accuracy than that of the current methods. Inspired by the machine learning (ML) theory through which the models are learned based on large amounts of observed data and the prediction is conducted without explicitly modeling space objects and space environment, the proposed ML approach integrates physics-based orbit prediction algorithms with a learning-based process that focuses on reducing the prediction errors. Using a simulation-based space catalog environment as the test bed, the paper demonstrates three types of generalization capability for the proposed ML approach: 1) the ML model can be used to improve the same RSO's orbit information that is not available during the learning process but shares the same time interval as the training data; 2) the ML model can be used to improve predictions of the same RSO at future epochs; and 3) the ML model based on a RSO can be applied to other RSOs that share some common features.
Jan 10 2018 cs.CV
Scene text detection is an important step of scene text recognition system and also a challenging problem. Different from general object detection, the main challenges of scene text detection lie on arbitrary orientations, small sizes, and significantly variant aspect ratios of text in natural images. In this paper, we present an end-to-end trainable fast scene text detector, named TextBoxes++, which detects arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass. No post-processing other than an efficient non-maximum suppression is involved. We have evaluated the proposed TextBoxes++ on four public datasets. In all experiments, TextBoxes++ outperforms competing methods in terms of text localization accuracy and runtime. More specifically, TextBoxes++ achieves an f-measure of 0.817 at 11.6fps for $1024 \times 1024$ ICDAR 2015 Incidental text images, and an f-measure of 0.5591 at 19.8fps for $768 \times 768$ COCO-Text images. Furthermore, combined with a text recognizer, TextBoxes++ significantly outperforms the state-of-the-art approaches for word spotting and end-to-end text recognition tasks on popular benchmarks.
Nov 30 2017 cs.CV
Recently, many methods of person re-identification (Re-ID) rely on part-based feature representation to learn a discriminative pedestrian descriptor. However, the spatial context between these parts is ignored for the independent extractor to each separate part. In this paper, we propose to apply Long Short-Term Memory (LSTM) in an end-to-end way to model the pedestrian, seen as a sequence of body parts from head to foot. Integrating the contextual information strengthens the discriminative ability of local representation. We also leverage the complementary information between local and global feature. Furthermore, we integrate both identification task and ranking task in one network, where a discriminative embedding and a similarity measurement are learned concurrently. This results in a novel three-branch framework named Deep-Person, which learns highly discriminative features for person Re-ID. Experimental results demonstrate that Deep-Person outperforms the state-of-the-art methods by a large margin on three challenging datasets including Market-1501, CUHK03, and DukeMTMC-reID. Specifically, combining with a re-ranking approach, we achieve a 90.84% mAP on Market-1501 under single query setting.
Nov 29 2017 cs.CV
Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to aerial imagery, not only because of the huge variation in the scale, orientation and shape of the object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect $2806$ aerial images from different sensors and platforms. Each image is of the size about 4000-by-4000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using $15$ common object categories. The fully annotated DOTA images contains $188,282$ instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and are quite challenging.
In June 2016, Apple announced that it will deploy differential privacy for some user data collection in order to ensure privacy of user data, even from Apple. The details of Apple's approach remained sparse. Although several patents have since appeared hinting at the algorithms that may be used to achieve differential privacy, they did not include a precise explanation of the approach taken to privacy parameter choice. Such choice and the overall approach to privacy budget use and management are key questions for understanding the privacy protections provided by any deployment of differential privacy. In this work, through a combination of experiments, static and dynamic code analysis of macOS Sierra (Version 10.12) implementation, we shed light on the choices Apple made for privacy budget management. We discover and describe Apple's set-up for differentially private data processing, including the overall data pipeline, the parameters used for differentially private perturbation of each piece of data, and the frequency with which such data is sent to Apple's servers. We find that although Apple's deployment ensures that the (differential) privacy loss per each datum submitted to its servers is $1$ or $2$, the overall privacy loss permitted by the system is significantly higher, as high as $16$ per day for the four initially announced applications of Emojis, New words, Deeplinks and Lookup Hints. Furthermore, Apple renews the privacy budget available every day, which leads to a possible privacy loss of 16 times the number of days since user opt-in to differentially private data collection for those four applications. We advocate that in order to claim the full benefits of differentially private data collection, Apple must give full transparency of its implementation, enable user choice in areas related to privacy loss, and set meaningful defaults on the privacy loss permitted.
Sep 01 2017 cs.CV
Chinese is the most widely used language in the world. Algorithms that read Chinese text in natural images facilitate applications of various kinds. Despite the large potential value, datasets and competitions in the past primarily focus on English, which bares very different characteristics than Chinese. This report introduces RCTW, a new competition that focuses on Chinese text reading. The competition features a large-scale dataset with over 12,000 annotated images. Two tasks, namely text localization and end-to-end recognition, are set up. The competition took place from January 20 to May 31, 2017. 23 valid submissions were received from 19 teams. This report includes dataset description, task definitions, evaluation protocols, and results summaries and analysis. Through this competition, we call for more future research on the Chinese text reading problem.
A continuous-time random walk in the quarter plane with homogeneous transition rates is considered. Given a non-negative reward function on the state space, we are interested in the expected stationary performance. Since a direct derivation of the stationary probability distribution is not available in general, the performance is approximated by a perturbed random walk, whose transition rates on the boundaries are changed such that its stationary probability distribution is known in closed form. A perturbed random walk for which the stationary distribution is a sum of geometric terms is considered and the perturbed transition rates are allowed to be inhomogeneous. It is demonstrated that such rates can be constructed for any sum of geometric terms that satisfies the balance equations in the interior of the state space. The inhomogeneous transitions relax the pairwise-coupled structure on these geometric terms that would be imposed if only homogeneous transitions are used. An explicit expression for the approximation error bound is obtained using the Markov reward approach, which does not depend on the values of the inhomogeneous rates but only on the parameters of the geometric terms. Numerical experiments indicate that inhomogeneous perturbation can give smaller error bounds than homogeneous perturbation.
Aug 15 2017 cs.CV
As re-ranking is a necessary procedure to boost person re-identification (re-ID) performance on large-scale datasets, the diversity of feature becomes crucial to person reID for its importance both on designing pedestrian descriptions and re-ranking based on feature fusion. However, in many circumstances, only one type of pedestrian feature is available. In this paper, we propose a "Divide and use" re-ranking framework for person re-ID. It exploits the diversity from different parts of a high-dimensional feature vector for fusion-based re-ranking, while no other features are accessible. Specifically, given an image, the extracted feature is divided into sub-features. Then the contextual information of each sub-feature is iteratively encoded into a new feature. Finally, the new features from the same image are fused into one vector for re-ranking. Experimental results on two person re-ID benchmarks demonstrate the effectiveness of the proposed framework. Especially, our method outperforms the state-of-the-art on the Market-1501 dataset.
Predicting the click-through rate of an advertisement is a critical component of online advertising platforms. In sponsored search, the click-through rate estimates the probability that a displayed advertisement is clicked by a user after she submits a query to the search engine. Commercial search engines typically rely on machine learning models trained with a large number of features to make such predictions. This is inevitably requires a lot of engineering efforts to define, compute, and select the appropriate features. In this paper, we propose two novel approaches (one working at character level and the other working at word level) that use deep convolutional neural networks to predict the click-through rate of a query-advertisement pair. Specially, the proposed architectures only consider the textual content appearing in a query-advertisement pair as input, and produce as output a click-through rate prediction. By comparing the character-level model with the word-level model, we show that language representation can be learnt from scratch at character level when trained on enough data. Through extensive experiments using billions of query-advertisement pairs of a popular commercial search engine, we demonstrate that both approaches significantly outperform a baseline model built on well-selected text features and a state-of-the-art word2vec-based approach. Finally, by combining the predictions of the deep models introduced in this study with the prediction of the model in production of the same commercial search engine, we significantly improve the accuracy and the calibration of the click-through rate prediction of the production system.
Jun 28 2017 cs.CV
In this paper, we investigate the Chinese calligraphy synthesis problem: synthesizing Chinese calligraphy images with specified style from standard font(eg. Hei font) images (Fig. 1(a)). Recent works mostly follow the stroke extraction and assemble pipeline which is complex in the process and limited by the effect of stroke extraction. We treat the calligraphy synthesis problem as an image-to-image translation problem and propose a deep neural network based model which can generate calligraphy images from standard font images directly. Besides, we also construct a large scale benchmark that contains various styles for Chinese calligraphy synthesis. We evaluate our method as well as some baseline methods on the proposed dataset, and the experimental results demonstrate the effectiveness of our proposed model.
May 09 2017 cs.CV
Patch-level image representation is very important for object classification and detection, since it is robust to spatial transformation, scale variation, and cluttered background. Many existing methods usually require fine-grained supervisions (e.g., bounding-box annotations) to learn patch features, which requires a great effort to label images may limit their potential applications. In this paper, we propose to learn patch features via weak supervisions, i.e., only image-level supervisions. To achieve this goal, we treat images as bags and patches as instances to integrate the weakly supervised multiple instance learning constraints into deep neural networks. Also, our method integrates the traditional multiple stages of weakly supervised object classification and discovery into a unified deep convolutional neural network and optimizes the network in an end-to-end way. The network processes the two tasks object classification and discovery jointly, and shares hierarchical deep features. Through this jointly learning strategy, weakly supervised object classification and discovery are beneficial to each other. We test the proposed method on the challenging PASCAL VOC datasets. The results show that our method can obtain state-of-the-art performance on object classification, and very competitive results on object discovery, with faster testing speed than competitors.
Apr 18 2017 cs.CV
Text in natural images contains rich semantics that are often highly relevant to objects or scene. In this paper, we focus on the problem of fully exploiting scene text for visual understanding. The main idea is combining word representations and deep visual features into a globally trainable deep convolutional neural network. First, the recognized words are obtained by a scene text reading system. Then, we combine the word embedding of the recognized words and the deep visual features into a single representation, which is optimized by a convolutional neural network for fine-grained image classification. In our framework, the attention mechanism is adopted to reveal the relevance between each recognized word and the given image, which further enhances the recognition performance. We have performed experiments on two datasets: Con-Text dataset and Drink Bottle dataset, that are proposed for fine-grained classification of business places and drink bottles, respectively. The experimental results consistently demonstrate that the proposed method combining textual and visual cues significantly outperforms classification with only visual representations. Moreover, we have shown that the learned representation improves the retrieval performance on the drink bottle images by a large margin, making it potentially useful in product search.
Apr 04 2017 cs.CV
Of late, weakly supervised object detection is with great importance in object recognition. Based on deep learning, weakly supervised detectors have achieved many promising results. However, compared with fully supervised detection, it is more challenging to train deep network based detectors in a weakly supervised manner. Here we formulate weakly supervised detection as a Multiple Instance Learning (MIL) problem, where instance classifiers (object detectors) are put into the network as hidden nodes. We propose a novel online instance classifier refinement algorithm to integrate MIL and the instance classifier refinement procedure into a single deep network, and train the network end-to-end with only image-level supervision, i.e., without object location information. More precisely, instance labels inferred from weak supervision are propagated to their spatially overlapped instances to refine instance classifier online. The iterative instance classifier refinement procedure is implemented using multiple streams in deep network, where each stream supervises its latter stream. Weakly supervised object detection experiments are carried out on the challenging PASCAL VOC 2007 and 2012 benchmarks. We obtain 47% mAP on VOC 2007 that significantly outperforms the previous state-of-the-art.
Mar 27 2017 cs.CV
Most existing person re-identification algorithms either extract robust visual features or learn discriminative metrics for person images. However, the underlying manifold which those images reside on is rarely investigated. That raises a problem that the learned metric is not smooth with respect to the local geometry structure of the data manifold. In this paper, we study person re-identification with manifold-based affinity learning, which did not receive enough attention from this area. An unconventional manifold-preserving algorithm is proposed, which can 1) make the best use of supervision from training data, whose label information is given as pairwise constraints; 2) scale up to large repositories with low on-line time complexity; and 3) be plunged into most existing algorithms, serving as a generic postprocessing procedure to further boost the identification accuracies. Extensive experimental results on five popular person re-identification benchmarks consistently demonstrate the effectiveness of our method. Especially, on the largest CUHK03 and Market-1501, our method outperforms the state-of-the-art alternatives by a large margin with high efficiency, which is more appropriate for practical applications.
Mar 21 2017 cs.CV
Most state-of-the-art text detection methods are specific to horizontal Latin text and are not fast enough for real-time applications. We introduce Segment Linking (SegLink), an oriented text detection method. The main idea is to decompose text into two locally detectable elements, namely segments and links. A segment is an oriented box covering a part of a word or text line; A link connects two adjacent segments, indicating that they belong to the same word or text line. Both elements are detected densely at multiple scales by an end-to-end trained, fully-convolutional neural network. Final detections are produced by combining segments connected by links. Compared with previous methods, SegLink improves along the dimensions of accuracy, speed, and ease of training. It achieves an f-measure of 75.0% on the standard ICDAR 2015 Incidental (Challenge 4) benchmark, outperforming the previous best by a large margin. It runs at over 20 FPS on 512x512 images. Moreover, without modification, SegLink is able to detect long lines of non-Latin text, such as Chinese.
Mar 17 2017 cs.CV
Junctions play an important role in the characterization of local geometric structures in images, the detection of which is a longstanding and challenging task. Existing junction detectors usually focus on identifying the junction locations and the orientations of the junction branches while ignoring their scales; however, these scales also contain rich geometric information. This paper presents a novel approach to junction detection and characterization that exploits the locally anisotropic geometries of a junction and estimates the scales of these geometries using an \empha contrario model. The output junctions have anisotropic scales --- i.e., each branch of a junction is associated with an independent scale parameter --- and are thus termed anisotropic-scale junctions (ASJs). We then apply the newly detected ASJs for the matching of indoor images, in which there may be dramatic changes in viewpoint and the detected local visual features, e.g., key-points, are usually insufficiently distinctive. We propose to use the anisotropic geometries of our junctions to improve the matching precision for indoor images. Matching results obtained on sets of indoor images demonstrate that our approach achieves state-of-the-art performance in indoor image matching.
Feb 28 2017 cs.CV
Low-textured image stitching remains a challenging problem. It is difficult to achieve good alignment and it is easy to break image structures due to insufficient and unreliable point correspondences. Moreover, because of the viewpoint variations between multiple images, the stitched images suffer from projective distortions. To solve these problems, this paper presents a line-guided local warping method with a global similarity constraint for image stitching. Line features which serve well for geometric descriptions and scene constraints, are employed to guide image stitching accurately. On one hand, the line features are integrated into a local warping model through a designed weight function. On the other hand, line features are adopted to impose strong geometric constraints, including line correspondence and line colinearity, to improve the stitching performance through mesh optimization. To mitigate projective distortions, we adopt a global similarity constraint, which is integrated with the projective warps via a designed weight strategy. This constraint causes the final warp to slowly change from a projective to a similarity transformation across the image. Finally, the images undergo a two-stage alignment scheme that provides accurate alignment and reduces projective distortion. We evaluate our method on a series of images and compare it with several other methods. The experimental results demonstrate that the proposed method provides a convincing stitching performance and that it outperforms other state-of-the-art methods.
This paper investigates the task assignment problem for multiple dispersed robots constrained by limited communication range. The robots are initially randomly distributed and need to visit several target locations while minimizing the total travel time. A centralized rendezvous-based algorithm is proposed, under which all the robots first move towards a rendezvous position until communication paths are established between every pair of robots either directly or through intermediate peers, and then one robot is chosen as the leader to make a centralized task assignment for the other robots. Furthermore, we propose a decentralized algorithm based on a single-traveling-salesman tour, which does not require all the robots to be connected through communication. We investigate the variation of the quality of the assignment solutions as the level of information sharing increases and as the communication range grows, respectively. The proposed algorithms are compared with a centralized algorithm with shared global information and a decentralized greedy algorithm respectively. Monte Carlo simulation results show the satisfying performance of the proposed algorithms.
Feb 13 2017 cs.CV
Texture characterization is a key problem in image understanding and pattern recognition. In this paper, we present a flexible shape-based texture representation using shape co-occurrence patterns. More precisely, texture images are first represented by tree of shapes, each of which is associated with several geometrical and radiometric attributes. Then four typical kinds of shape co-occurrence patterns based on the hierarchical relationship of the shapes in the tree are learned as codewords. Three different coding methods are investigated to learn the codewords, with which, any given texture image can be encoded into a descriptive vector. In contrast with existing works, the proposed method not only inherits the strong ability to depict geometrical aspects of textures and the high robustness to variations of imaging conditions from the shape-based method, but also provides a flexible way to consider shape relationships and to compute high-order statistics on the tree. To our knowledge, this is the first time to use co-occurrence patterns of explicit shapes as a tool for texture analysis. Experiments on various texture datasets and scene datasets demonstrate the efficiency of the proposed method.
Jan 17 2017 cs.SI
MOOCs have brought unprecedented opportunities of making high-quality courses accessible to everybody. However, from the business point of view, MOOCs are often challenged for lacking of sustainable business models, and academic research for marketing strategies of MOOCs is also a blind spot currently. In this work, we try to formulate the business models and pricing strategies in a structured and scientific way. Based on both theoretical research and real marketing data analysis from a MOOC platform, we present the insights of the pricing strategies for existing MOOC markets. We focus on the pricing strategies for verified certificates in the B2C markets, and also give ideas of modeling the course sub-licensing services in B2B markets.
Dec 08 2016 cs.CV
In this paper, we propose an accurate edge detector using richer convolutional features (RCF). Since objects in nature images have various scales and aspect ratios, the automatically learned rich hierarchical representations by CNNs are very critical and effective to detect edges and object boundaries. And the convolutional features gradually become coarser with receptive fields increasing. Based on these observations, our proposed network architecture makes full use of multiscale and multi-level information to perform the image-to-image edge prediction by combining all of the useful convolutional features into a holistic framework. It is the first attempt to adopt such rich convolutional features in computer vision tasks. Using VGG16 network, we achieve \sArt results on several available datasets. When evaluating on the well-known BSDS500 benchmark, we achieve ODS F-measure of \textbf.811 while retaining a fast speed (\textbf8 FPS). Besides, our fast version of RCF achieves ODS F-measure of \textbf.806 with \textbf30 FPS.
Nov 22 2016 cs.CV
This paper presents an end-to-end trainable fast scene text detector, named TextBoxes, which detects scene text with both high accuracy and efficiency in a single network forward pass, involving no post-process except for a standard non-maximum suppression. TextBoxes outperforms competing methods in terms of text localization accuracy and is much faster, taking only 0.09s per image in a fast implementation. Furthermore, combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition tasks.
Recently neural networks and multiple instance learning are both attractive topics in Artificial Intelligence related research fields. Deep neural networks have achieved great success in supervised learning problems, and multiple instance learning as a typical weakly-supervised learning method is effective for many applications in computer vision, biometrics, nature language processing, etc. In this paper, we revisit the problem of solving multiple instance learning problems using neural networks. Neural networks are appealing for solving multiple instance learning problem. The multiple instance neural networks perform multiple instance learning in an end-to-end way, which take a bag with various number of instances as input and directly output bag label. All of the parameters in a multiple instance network are able to be optimized via back-propagation. We propose a new multiple instance neural network to learn bag representations, which is different from the existing multiple instance neural networks that focus on estimating instance label. In addition, recent tricks developed in deep learning have been studied in multiple instance networks, we find deep supervision is effective for boosting bag classification accuracy. In the experiments, the proposed multiple instance networks achieve state-of-the-art or competitive performance on several MIL benchmarks. Moreover, it is extremely fast for both testing and training, e.g., it takes only 0.0003 second to predict a bag and a few seconds to train on a MIL datasets on a moderate CPU.
Sep 14 2016 cs.CV
Object skeletons are useful for object representation and object detection. They are complementary to the object contour, and provide extra information, such as how object scale (thickness) varies among object parts. But object skeleton extraction from natural images is very challenging, because it requires the extractor to be able to capture both local and non-local image context in order to determine the scale of each skeleton pixel. In this paper, we present a novel fully convolutional network with multiple scale-associated side outputs to address this problem. By observing the relationship between the receptive field sizes of the different layers in the network and the skeleton scales they can capture, we introduce two scale-associated side outputs to each stage of the network. The network is trained by multi-task learning, where one task is skeleton localization to classify whether a pixel is a skeleton pixel or not, and the other is skeleton scale prediction to regress the scale of each skeleton pixel. Supervision is imposed at different stages by guiding the scale-associated side outputs toward the groundtruth skeletons at the appropriate scales. The responses of the multiple scale-associated side outputs are then fused in a scale-specific way to detect skeleton pixels using multiple scales effectively. Our method achieves promising results on two skeleton extraction datasets, and significantly outperforms other competitors. Additionally, the usefulness of the obtained skeletons and scales (thickness) are verified on two object detection applications: Foreground object segmentation and object proposal detection.
Aug 19 2016 cs.CV
Aerial scene classification, which aims to automatically label an aerial image with a specific semantic category, is a fundamental problem for understanding high-resolution remote sensing imagery. In recent years, it has become an active task in remote sensing area and numerous algorithms have been proposed for this task, including many machine learning and data-driven approaches. However, the existing datasets for aerial scene classification like UC-Merced dataset and WHU-RS19 are with relatively small sizes, and the results on them are already saturated. This largely limits the development of scene classification algorithms. This paper describes the Aerial Image Dataset (AID): a large-scale dataset for aerial scene classification. The goal of AID is to advance the state-of-the-arts in scene classification of remote sensing images. For creating AID, we collect and annotate more than ten thousands aerial scene images. In addition, a comprehensive review of the existing aerial scene classification techniques as well as recent widely-used deep learning methods is given. Finally, we provide a performance analysis of typical aerial scene classification and deep learning approaches on AID, which can be served as the baseline results on this benchmark.
Despite the great success of convolutional neural networks (CNN) for the image classification task on datasets like Cifar and ImageNet, CNN's representation power is still somewhat limited in dealing with object images that have large variation in size and clutter, where Fisher Vector (FV) has shown to be an effective encoding strategy. FV encodes an image by aggregating local descriptors with a universal generative Gaussian Mixture Model (GMM). FV however has limited learning capability and its parameters are mostly fixed after constructing the codebook. To combine together the best of the two worlds, we propose in this paper a neural network structure with FV layer being part of an end-to-end trainable system that is differentiable; we name our network FisherNet that is learnable using backpropagation. Our proposed FisherNet combines convolutional neural network training and Fisher Vector encoding in a single end-to-end structure. We observe a clear advantage of FisherNet over plain CNN and standard FV in terms of both classification accuracy and computational efficiency on the challenging PASCAL VOC object classification task.
Jun 30 2016 cs.CV
Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, vast majority of the existing methods detect text within local regions, typically through extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially exclude the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm directly runs on full images and produces global, pixel-wise prediction maps, in which detections are subsequently formed. To better make use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated, with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently-released, large-scale dataset COCO-Text.
May 20 2016 cs.CV
Spectral-spatial processing has been increasingly explored in remote sensing hyperspectral image classification. While extensive studies have focused on developing methods to improve the classification accuracy, experimental setting and design for method evaluation have drawn little attention. In the scope of supervised classification, we find that traditional experimental designs for spectral processing are often improperly used in the spectral-spatial processing context, leading to unfair or biased performance evaluation. This is especially the case when training and testing samples are randomly drawn from the same image - a practice that has been commonly adopted in the experiments. Under such setting, the dependence caused by overlap between the training and testing samples may be artificially enhanced by some spatial information processing methods such as spatial filtering and morphological operation. Such interaction between training and testing sets has violated data independence assumption that is abided by supervised learning theory and performance evaluation mechanism. Therefore, the widely adopted pixel-based random sampling strategy is not always suitable to evaluate spectral-spatial classification algorithms because it is difficult to determine whether the improvement of classification accuracy is caused by incorporating spatial information into classifier or by increasing the overlap between training and testing samples. To partially solve this problem, we propose a novel controlled random sampling strategy for spectral-spatial methods. It can greatly reduce the overlap between training and testing samples and provides more objective and accurate evaluation.
May 03 2016 cs.CV
Multidimensional Scaling (MDS) is a classic technique that seeks vectorial representations for data points, given the pairwise distances between them. However, in recent years, data are usually collected from diverse sources or have multiple heterogeneous representations. How to do multidimensional scaling on multiple input distance matrices is still unsolved to our best knowledge. In this paper, we first define this new task formally. Then, we propose a new algorithm called Multi-View Multidimensional Scaling (MVMDS) by considering each input distance matrix as one view. Our algorithm is able to learn the weights of views (i.e., distance matrices) automatically by exploring the consensus information and complementary nature of views. Experimental results on synthetic as well as real datasets demonstrate the effectiveness of MVMDS. We hope that our work encourages a wider consideration in many domains where MDS is needed.
Apr 15 2016 cs.CV
In this paper, we propose a novel approach for text detec- tion in natural images. Both local and global cues are taken into account for localizing text lines in a coarse-to-fine pro- cedure. First, a Fully Convolutional Network (FCN) model is trained to predict the salient map of text regions in a holistic manner. Then, text line hypotheses are estimated by combining the salient map and character components. Fi- nally, another FCN classifier is used to predict the centroid of each character, in order to remove the false hypotheses. The framework is general for handling text in multiple ori- entations, languages and fonts. The proposed method con- sistently achieves the state-of-the-art performance on three text detection benchmarks: MSRA-TD500, ICDAR2015 and ICDAR2013.
Apr 08 2016 cs.CV
Projective analysis is an important solution for 3D shape retrieval, since human visual perceptions of 3D shapes rely on various 2D observations from different view points. Although multiple informative and discriminative views are utilized, most projection-based retrieval systems suffer from heavy computational cost, thus cannot satisfy the basic requirement of scalability for search engines. In this paper, we present a real-time 3D shape search engine based on the projective images of 3D shapes. The real-time property of our search engine results from the following aspects: (1) efficient projection and view feature extraction using GPU acceleration; (2) the first inverted file, referred as F-IF, is utilized to speed up the procedure of multi-view matching; (3) the second inverted file (S-IF), which captures a local distribution of 3D shapes in the feature manifold, is adopted for efficient context-based re-ranking. As a result, for each query the retrieval task can be finished within one second despite the necessary cost of IO overhead. We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File Twice, as GIFT. Besides its high efficiency, GIFT also outperforms the state-of-the-art methods significantly in retrieval accuracy on various shape benchmarks and competitions.
Apr 01 2016 cs.CV
Object skeleton is a useful cue for object detection, complementary to the object contour, as it provides a structural representation to describe the relationship among object parts. While object skeleton extraction in natural images is a very challenging problem, as it requires the extractor to be able to capture both local and global image context to determine the intrinsic scale of each skeleton pixel. Existing methods rely on per-pixel based multi-scale feature computation, which results in difficult modeling and high time consumption. In this paper, we present a fully convolutional network with multiple scale-associated side outputs to address this problem. By observing the relationship between the receptive field sizes of the sequential stages in the network and the skeleton scales they can capture, we introduce a scale-associated side output to each stage. We impose supervision to different stages by guiding the scale-associated side outputs toward groundtruth skeletons of different scales. The responses of the multiple scale-associated side outputs are then fused in a scale-specific way to localize skeleton pixels with multiple scales effectively. Our method achieves promising results on two skeleton extraction datasets, and significantly outperforms other competitors.
Mar 15 2016 cs.CV
Recognizing text in natural images is a challenging task with many unsolved problems. Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text. RARE is a specially-designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN). In testing, an image is firstly rectified via a predicted Thin-Plate-Spline (TPS) transformation, into a more "readable" image for the following SRN, which recognizes text through a sequence recognition approach. We show that the model is able to recognize several types of irregular text, including perspective text and curved text. RARE is end-to-end trainable, requiring only images and associated text labels, making it convenient to train and deploy the model in practical systems. State-of-the-art or highly-competitive performance achieved on several benchmarks well demonstrates the effectiveness of the proposed model.
Feb 12 2016 cs.NI
It has been recently advocated that in large communication systems it is beneficial both for the users and for the network as a whole to store content closer to users. One particular implementation of such an approach is to co-locate caches with wireless base stations. In this paper we study geographically distributed caching of a fixed collection of files. We model cache placement with the help of stochastic geometry and optimize the allocation of storage capacity among files in order to minimize the cache miss probability. We consider both per cache capacity constraints as well as an average capacity constraint over all caches. The case of per cache capacity constraints can be efficiently solved using dynamic programming, whereas the case of the average constraint leads to a convex optimization problem. We demonstrate that the average constraint leads to significantly smaller cache miss probability. Finally, we suggest a simple LRU-based policy for geographically distributed caching and show that its performance is close to the optimal.
With the rapid growth of online social network sites (SNS), it has become imperative for platform owners and online marketers to investigate what drives content production on these platforms. However, previous research has found it difficult to statistically model these factors using observational data due to the inability to separate the effects of network formation from those of network influence. The inability to successfully separate these two mechanisms makes it difficult to interpret whether the observed behavior is a result of peer influence or merely indicative of a selection bias due to homophily. In this paper, we propose an actor-oriented continuous-time model to jointly estimate the co-evolution of the users' social network structure and their content production behavior using a Markov Chain Monte Carlo (MCMC) based simulation approach. Specifically, we offer a method to analyze non-stationary and continuous behavior with network effects, similar to what is observed in social media ecosystems. Leveraging a unique dataset contributed by Facebook, we apply our model to data on university students across six months to find that users tend to connect with others that have similar posting behavior. However, after doing so, users tend to diverge in posting behavior. Further, we also discover that homophilous friend selection as well as susceptibility to peer influence are sensitive to the strength of the posting behaviour. Our results provide insights and recommendations for SNS platforms to sustain an active and viable community.
Dec 22 2015 cs.SI
Link recommendation, which suggests links to connect currently unlinked users, is a key functionality offered by major online social networks. Salient examples of link recommendation include "People You May Know" on Facebook and LinkedIn as well as "You May Know" on Google+. The main stakeholders of an online social network include users (e.g., Facebook users) who use the network to socialize with other users and an operator (e.g., Facebook Inc.) that establishes and operates the network for its own benefit (e.g., revenue). Existing link recommendation methods recommend links that are likely to be established by users but overlook the benefit a recommended link could bring to an operator. To address this gap, we define the utility of recommending a link and formulate a new research problem - the utility-based link recommendation problem. We then propose a novel utility-based link recommendation method that recommends links based on the value, cost, and linkage likelihood of a link, in contrast to existing link recommendation methods which focus solely on linkage likelihood. Specifically, our method models the dependency relationship between value, cost, linkage likelihood and utility-based link recommendation decision using a Bayesian network, predicts the probability of recommending a link with the Bayesian network, and recommends links with the highest probabilities. Using data obtained from a major U.S. online social network, we demonstrate significant performance improvement achieved by our method compared to prevalent link recommendation methods from representative prior research.
Nov 24 2015 cs.DC
Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. The challenge of building a multi-cloud distributed architecture is substantial. Notwithstanding, the ability to deal with the new types of faults introduced by such setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably. In this paper we propose Medusa, a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults. Our solution fulfills four objectives. First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault-tolerance offered by MapReduce to tolerate arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost. We performed an extensive experimental evaluation in the ExoGENI testbed, demonstrating that our solution significantly reduces execution time when compared to traditional methods that achieve the same level of resilience.
Multiple-instance learning (MIL) has served as an important tool for a wide range of vision applications, for instance, image classification, object detection, and visual tracking. In this paper, we propose a novel method to solve the classical MIL problem, named relaxed multiple-instance SVM (RMI-SVM). We treat the positiveness of instance as a continuous variable, use Noisy-OR model to enforce the MIL constraints, and jointly optimize the bag label and instance label in a unified framework. The optimization problem can be efficiently solved using stochastic gradient decent. The extensive experiments demonstrate that RMI-SVM consistently achieves superior performance on various benchmarks for MIL. Moreover, we simply applied RMI-SVM to a challenging vision task, common object discovery. The state-of-the-art results of object discovery on Pascal VOC datasets further confirm the advantages of the proposed method.
Sep 29 2015 cs.CV
Stereo matching is the key step in estimating depth from two or more images. Recently, some tree-based non-local stereo matching methods have been proposed, which achieved state-of-the-art performance. The algorithms employed some tree structures to aggregate cost and thus improved the performance and reduced the coputation load of the stereo matching. However, the computational complexity of these tree-based algorithms is still high because they search over the entire disparity range. In addition, the extreme greediness of the minimum spanning tree (MST) causes the poor performance in large areas with similar colors but varying disparities. In this paper, we propose an efficient stereo matching method using a hierarchical disparity prediction (HDP) framework to dramatically reduce the disparity search range so as to speed up the tree-based non-local stereo methods. Our disparity prediction scheme works on a graph pyramid derived from an image whose disparity to be estimated. We utilize the disparity of a upper graph to predict a small disparity range for the lower graph. Some independent disparity trees (DT) are generated to form a disparity prediction forest (HDPF) over which the cost aggregation is made. When combined with the state-of-the-art tree-based methods, our scheme not only dramatically speeds up the original methods but also improves their performance by alleviating the second drawback of the tree-based methods. This is partially because our DTs overcome the extreme greediness of the MST. Extensive experimental results on some benchmark datasets demonstrate the effectiveness and efficiency of our framework. For example, the segment-tree based stereo matching becomes about 25.57 times faster and 2.2% more accurate over the Middlebury 2006 full-size dataset.
Jul 22 2015 cs.CV
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.
May 27 2015 cs.CR
On modern operating systems, applications under the same user are separated from each other, for the purpose of protecting them against malware and compromised programs. Given the complexity of today's OSes, less clear is whether such isolation is effective against different kind of cross-app resource access attacks (called XARA in our research). To better understand the problem, on the less-studied Apple platforms, we conducted a systematic security analysis on MAC OS~X and iOS. Our research leads to the discovery of a series of high-impact security weaknesses, which enable a sandboxed malicious app, approved by the Apple Stores, to gain unauthorized access to other apps' sensitive data. More specifically, we found that the inter-app interaction services, including the keychain, WebSocket and NSConnection on OS~X and URL Scheme on the MAC OS and iOS, can all be exploited by the malware to steal such confidential information as the passwords for iCloud, email and bank, and the secret token of Evernote. Further, the design of the app sandbox on OS~X was found to be vulnerable, exposing an app's private directory to the sandboxed malware that hijacks its Apple Bundle ID. As a result, sensitive user data, like the notes and user contacts under Evernote and photos under WeChat, have all been disclosed. Fundamentally, these problems are caused by the lack of app-to-app and app-to-OS authentications. To better understand their impacts, we developed a scanner that automatically analyzes the binaries of MAC OS and iOS apps to determine whether proper protection is missing in their code. Running it on hundreds of binaries, we confirmed the pervasiveness of the weaknesses among high-impact Apple apps. Since the issues may not be easily fixed, we built a simple program that detects exploit attempts on OS~X, helping protect vulnerable apps before the problems can be fully addressed.
May 13 2015 cs.CV
With the rapid increase of transnational communication and cooperation, people frequently encounter multilingual scenarios in various situations. In this paper, we are concerned with a relatively new problem: script identification at word or line levels in natural scenes. A large-scale dataset with a great quantity of natural images and 10 types of widely used languages is constructed and released. In allusion to the challenges in script identification in real-world scenarios, a deep learning based algorithm is proposed. The experiments on the proposed dataset demonstrate that our algorithm achieves superior performance, compared with conventional image classification methods, such as the original CNN architecture and LLC.
Sep 26 2014 cs.CV
We study the problem of how to build a deep learning representation for 3D shape. Deep learning has shown to be very effective in variety of visual applications, such as image classification and object detection. However, it has not been successfully applied to 3D shape recognition. This is because 3D shape has complex structure in 3D space and there are limited number of 3D shapes for feature learning. To address these problems, we project 3D shapes into 2D space and use autoencoder for feature learning on the 2D images. High accuracy 3D shape retrieval performance is obtained by aggregating the features learned on 2D images. In addition, we show the proposed deep learning feature is complementary to conventional local image descriptors. By combing the global deep learning representation and the local descriptor representation, our method can obtain the state-of-the-art performance on 3D shape retrieval benchmarks.
Sep 19 2014 cs.CV
In this paper, we present a deep regression approach for face alignment. The deep architecture consists of a global layer and multi-stage local layers. We apply the back-propagation algorithm with the dropout strategy to jointly optimize the regression parameters. We show that the resulting deep regressor gradually and evenly approaches the true facial landmarks stage by stage, avoiding the tendency to yield over-strong early stage regressors while over-weak later stage regressors. Experimental results show that our approach achieves the state-of-the-art
Nov 26 2013 cs.DC
IceCube is a one-gigaton instrument located at the geographic South Pole, designed to detect cosmic neutrinos, iden- tify the particle nature of dark matter, and study high-energy neutrinos themselves. Simulation of the IceCube detector and processing of data require a significant amount of computational resources. IceProd is a distributed management system based on Python, XML-RPC and GridFTP. It is driven by a central database in order to coordinate and admin- ister production of simulations and processing of data produced by the IceCube detector. IceProd runs as a separate layer on top of other middleware and can take advantage of a variety of computing resources, including grids and batch systems such as CREAM, Condor, and PBS. This is accomplished by a set of dedicated daemons that process job submission in a coordinated fashion through the use of middleware plugins that serve to abstract the details of job submission and job management from the framework.
Aug 20 2013 cs.NI
In order to closely simulate the real network scenario thereby verify the effectiveness of protocol designs, it is necessary to model the traffic flows carried over realistic networks. Extensive studies  showed that the actual traffic in access and local area networks (e.g., those generated by ftp and video streams) exhibits the property of self-similarity and long-range dependency (LRD) . In this appendix we briefly introduce the property of self-similarity and suggest a practical approach for modeling self-similar traces with specified traffic intensity.
Mar 02 2011 cs.MM
Interdisciplinary collaboration is essential for the advance of research. As domain subjects become more and more specialized, researchers need to cross disciplines for insights from peers in other areas to have a broader and deeper understand of a topic at micro- and macro-levels. We developed a 3D virtual learning environment that served as a platform for faculty to plan curriculum, share educational beliefs, and conduct cross-discipline research for effective learning. Based upon the scripts designed by faculty from five disciplines, virtual doctors, nurses, or patients interact in a 3D virtual hospital. The teaching vignettes were then converted to video clips, allowing users to view, pause, replay, or comment on the videos individually or in groups. Unlike many existing platforms, we anticipated a value-added by adding a social networking capacity to this virtual environment. The focus of this paper is on the cost-efficiency and system design of the virtual learning environment.