Neural net classifiers trained on data with annotated class labels can also capture apparent visual similarity among categories without being directed to do so. We study whether this observation can be extended beyond the conventional domain of supervised learning: Can we learn a good feature representation that captures apparent similarity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances? We formulate this intuition as a non-parametric classification problem at the instance-level, and use noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes. Our experimental results demonstrate that, under unsupervised learning settings, our method surpasses the state-of-the-art on ImageNet classification by a large margin. Our method is also remarkable for consistently improving test performance with more training data and better network architectures. By fine-tuning the learned feature, we further obtain competitive results for semi-supervised learning and object detection tasks. Our non-parametric model is highly compact: With 128 features per image, our method requires only 600MB storage for a million images, enabling fast nearest neighbour retrieval at the run time.
Apr 27 2018 cs.RO
This paper presents a novel cable-driven gripper with perception capabilities for autonomous harvesting of strawberries. Experiments show that the gripper allows for more accurate and faster picking of strawberries compared to existing systems. The gripper consists of four functional parts for sensing, picking, transmission, and storing. It has six fingers that open to form a closed space to swallow a target strawberry and push other surrounding berries away from the target. Equipped with three IR sensors, the gripper controls a manipulator arm to correct for positional error, and can thus pick strawberries that are not exactly localized by the vision algorithm, improving the robustness. Experiments show that the gripper is gentle on the berries as it merely cuts the stem and there is no physical interaction with the berries during the cutting process. We show that the gripper has close-to-perfect successful picking rate when addressing isolated strawberries. By including internal perception, we get high positional error tolerance, and avoid using slow, high-level closed-loop control. Moreover, the gripper can store several berries, which reduces the overall travel distance for the manipulator, and decreases the time needed to pick a single strawberry substantially. The experiments show that the gripper design decreased picking execution time noticeably compared to results found in literature.
Apr 17 2018 cs.CV
High-performance object detection relies on expensive convolutional networks to compute features, often leading to significant challenges in applications, e.g. those that require detecting objects from video streams in real time. The key to this problem is to trade accuracy for efficiency in an effective way, i.e. reducing the computing cost while maintaining competitive performance. To seek a good balance, previous efforts usually focus on optimizing the model architectures. This paper explores an alternative approach, that is, to reallocate the computation over a scale-time space. The basic idea is to perform expensive detection sparsely and propagate the results across both scales and time with substantially cheaper networks, by exploiting the strong correlations among them. Specifically, we present a unified framework that integrates detection, temporal propagation, and across-scale refinement on a Scale-Time Lattice. On this framework, one can explore various strategies to balance performance and cost. Taking advantage of this flexibility, we further develop an adaptive scheme with the detector invoked on demand and thus obtain improved tradeoff. On ImageNet VID dataset, the proposed method can achieve a competitive mAP 79.6% at 20 fps, or 79.0% at 62 fps as a performance/speed tradeoff.
Mar 28 2018 cs.SE
The performance of fault localization techniques is critical to their adoption in practice. This paper reports on an empirical study of a wide range of fault localization techniques on real-world faults. Different from previous studies, this paper (1) considers a wide range of techniques from different families, and (2) combines different techniques. Our results reveal that combined technique significantly outperforms any individual technique (175% increase in defects localized in Top 1), suggesting that combination may be a desirable way to apply fault localization techniques and future techniques should also be evaluated in the combined setting. Our implementation is publicly available for evaluating and combining fault localization techniques.
A useful computation when acting in a complex environment is to infer the marginal probabilities or most probable states of task-relevant variables. Probabilistic graphical models can efficiently represent the structure of such complex data, but performing these inferences is generally difficult. Message-passing algorithms, such as belief propagation, are a natural way to disseminate evidence amongst correlated variables while exploiting the graph structure, but these algorithms can struggle when the conditional dependency graphs contain loops. Here we use Graph Neural Networks (GNNs) to learn a message-passing algorithm that solves these inference tasks. We first show that the architecture of GNNs is well-matched to inference tasks. We then demonstrate the efficacy of this inference approach by training GNNs on an ensemble of graphical models and showing that they substantially outperform belief propagation on loopy graphs. Our message-passing algorithms generalize out of the training set to larger graphs and graphs with different structure.
In this paper, we revisit the recurrent back-propagation (RBP) algorithm, discuss the conditions under which it applies as well as how to satisfy them in deep neural networks. We show that RBP can be unstable and propose two variants based on conjugate gradient on the normal equations (CG-RBP) and Neumann series (Neumann-RBP). We further investigate the relationship between Neumann-RBP and back propagation through time (BPTT) and its truncated version (TBPTT). Our Neumann-RBP has the same time complexity as TBPTT but only requires constant memory, whereas TBPTT's memory cost scales linearly with the number of truncation steps. We examine all RBP variants along with BPTT and TBPTT in three different application domains: associative memory with continuous Hopfield networks, document classification in citation networks using graph neural networks and hyperparameter optimization for fully connected networks. All experiments demonstrate that RBPs, especially the Neumann-RBP variant, are efficient and effective for optimizing convergent recurrent neural networks.
Feb 22 2018 cs.SE
In many scenarios we need to find the most likely program under a local context, where the local context can be an incomplete program, a partial specification, natural language description, etc. We call such problem program estimation. In this paper we propose an abstract framework, learning to synthesis, or L2S in short, to address this problem. L2S combines four tools to achieve this: syntax is used to define the search space and search steps, constraints are used to prune off invalid candidates at each search step, machine-learned models are used to estimate conditional probabilities for the candidates at each search step, and search algorithms are used to find the best possible solution. The main goal of L2S is to lay out the design space to motivate the research on program estimation. We have performed a preliminary evaluation by instantiating this framework for synthesizing conditions of an automated program repair (APR) system. The training data are from the project itself and related JDK packages. Compared to ACS, a state-of-the-art condition synthesis system for program repair, our approach could deal with a larger search space such that we fixed 4 additional bugs outside the search space of ACS, and relies only on the source code of the current projects.
Feb 13 2018 cs.LG
Electric Vehicle (EV) is playing a significant role in the distribution energy management systems since the power consumption level of the EVs is much higher than the other regular home appliances. The randomness of the EV driver behaviors make the optimal charging or discharging scheduling even more difficult due to the uncertain charging session parameters. To minimize the impact of behavioral uncertainties, it is critical to develop effective methods to predict EV load for smart EV energy management. Using the EV smart charging infrastructures on UCLA campus and city of Santa Monica as testbeds, we have collected real-world datasets of EV charging behaviors, based on which we proposed an EV user modeling technique which combines statistical analysis and machine learning approaches. Specifically, unsupervised clustering algorithm, and multilayer perceptron are applied to historical charging record to make the day-ahead EV parking and load prediction. Experimental results with cross-validation show that our model can achieve good performance for charging control scheduling and online EV load forecasting.
This paper deals with the reality gap from a novel perspective, targeting transferring Deep Reinforcement Learning (DRL) policies learned in simulated environments to the real-world domain for visual control tasks. Instead of adopting the common solutions to the problem by increasing the visual fidelity of synthetic images output from simulators during the training phase, this paper seeks to tackle the problem by translating the real-world image streams back to the synthetic domain during the deployment phase, to make the robot feel at home. We propose this as a lightweight, flexible, and efficient solution for visual control, as 1) no extra transfer steps are required during the expensive training of DRL agents in simulation; 2) the trained DRL agents will not be constrained to being deployable in only one specific real-world environment; 3) the policy training and the transfer operations are decoupled, and can be conducted in parallel. Besides this, we propose a conceptually simple yet very effective shift loss to constrain the consistency between subsequent frames, eliminating the need for optical flow. We validate the shift loss for artistic style transfer for videos and domain adaptation, and validate our visual control approach in real-world robot experiments. A video of our results is available at: https://goo.gl/b1xz1s.
Jan 24 2018 cs.CV
Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.
Dec 12 2017 cs.MM
Steganography aims to conceal the very fact that the communication takes place, by embedding a message into a digit object such as image without introducing noticeable artifacts. A number of steganographic systems have been developed in past years, most of which, however, are confined to the laboratory conditions where the real-world use of steganography are rarely concerned. In this paper, we introduce an alternative perspective to steganography. A graph-theoretic model to steganography on social networks is presented to analyze real-world steganographic scenarios. In the graph, steganographic participants are corresponding to the vertices with meaningless unique identifiers. Each edge allows the two vertices to communicate with each other by any steganographic algorithm. Meanwhile, the edges are associated with weights to quantize the corresponding communication risk (or say cost). The optimization task is to minimize the overall risk, which is modeled as additive over the social network. We analyze different scenarios on a social network, and provide the suited solutions to the corresponding optimization tasks. We prove that a multiplicative probabilistic graph is equivalent to an additive weighted graph. From the viewpoint of an attacker, he may hope to detect suspicious communication channels, the data encoder(s) and the data decoder(s). We present limited detection analysis to steganographic communication on a network.
Accurate clock synchronization is required for collaborative operations among nodes across wireless networks. Compared with traditional layer-by-layer methods, cooperative network synchronization techniques lead to significant improvement in performance, efficiency, and robustness. This paper develops a framework for the performance analysis of cooperative network synchronization. We introduce the concepts of cooperative dilution intensity (CDI) and relative CDI to characterize the interaction between agents, which can be interpreted as properties of a random walk over the network. Our approach enables us to derive closed-form asymptotic expressions of performance limits, relating them to the quality of observations as well as network topology.
Jun 29 2017 cs.SE
Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches. To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions. The passing tests on original and patched programs are likely to behave similarly while the failing tests on original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, the two tests are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness. Our approach is evaluated on a dataset consisting of 139 patches generated from existing program repair systems including jGenProg, Nopol, jKali, ACS and HDRepair. Our approach successfully prevented 56.3\% of the incorrect patches to be generated, without blocking any correct patches.
Jun 12 2017 cs.CV
In this paper, we share our experience in designing a convolutional network-based face detector that could handle faces of an extremely wide range of scales. We show that faces with different scales can be modeled through a specialized set of deep convolutional networks with different structures. These detectors can be seamlessly integrated into a single unified network that can be trained end-to-end. In contrast to existing deep models that are designed for wide scale range, our network does not require an image pyramid input and the model is of modest complexity. Our network, dubbed ScaleFace, achieves promising performance on WIDER FACE and FDDB datasets with practical runtime speed. Specifically, our method achieves 76.4 average precision on the challenging WIDER FACE dataset and 96% recall rate on the FDDB dataset with 7 frames per second (fps) for 900 * 1300 input image.
May 12 2017 cs.SE
Automated program repair techniques, which target to generating correct patches for real world defects automatically, have gained a lot of attention in the last decade. Many different techniques and tools have been proposed and developed. However, even the most sophisticated program repair techniques can only repair a small portion of defects while producing a lot of incorrect patches. A possible reason for this low performance is that the test suites of real world programs are usually too weak to guarantee the behavior of the program. To understand to what extent defects can be fixed with weak test suites, we analyzed 50 real world defects from Defects4J, in which we found that up to 84% of them could be correctly fixed. This result suggests that there is plenty of space for current automated program repair techniques to improve. Furthermore, we summarized seven fault localization strategies and seven patch generation strategies that were useful in localizing and fixing these defects, and compared those strategies with current repair techniques. The results indicate potential directions to improve automatic program repair in the future research.
May 09 2017 cs.CV
Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of TSN framework given limited training samples. Our approach obtains the state-the-of-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Apr 21 2017 cs.CV
Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG) is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
Mar 21 2017 cs.CV
Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in its building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code would be released.
With the increasing of electric vehicle (EV) adoption in recent years, the impact of EV charging activities to the power grid becomes more and more significant. In this article, an optimal scheduling algorithm which combines smart EV charging and V2G gird service is developed to integrate EVs into power grid as distributed energy resources, with improved system cost performance. Specifically, an optimization problem is formulated and solved at each EV charging station according to control signal from aggregated control center and user charging behavior prediction by mean estimation and linear regression. The control center collects distributed optimization results and updates the control signal, periodically. The iteration continues until it converges to optimal scheduling. Experimental result shows this algorithm helps fill the valley and shave the peak in electric load profiles within a microgrid, while the energy demand of individual driver can be satisfied.
Mar 10 2017 cs.CV
Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of those strongly supervised approaches on these two datasets.
Mar 09 2017 cs.CV
Detecting activities in untrimmed videos is an important but challenging task. The performance of existing methods remains unsatisfactory, e.g., they often meet difficulties in locating the beginning and end of a long complex action. In this paper, we propose a generic framework that can accurately detect a wide variety of activities from untrimmed videos. Our first contribution is a novel proposal scheme that can efficiently generate candidates with accurate temporal boundaries. The other contribution is a cascaded classification pipeline that explicitly distinguishes between relevance and completeness of a candidate instance. On two challenging temporal activity detection datasets, THUMOS14 and ActivityNet, the proposed framework significantly outperforms the existing state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
Mar 01 2017 cs.CV
Hashing has played a pivotal role in large-scale image retrieval. With the development of Convolutional Neural Network (CNN), hashing learning has shown great promise. But existing methods are mostly tuned for classification, which are not optimized for retrieval tasks, especially for instance-level retrieval. In this study, we propose a novel hashing method for large-scale image retrieval. Considering the difficulty in obtaining labeled datasets for image retrieval task in large scale, we propose a novel CNN-based unsupervised hashing method, namely Unsupervised Triplet Hashing (UTH). The unsupervised hashing network is designed under the following three principles: 1) more discriminative representations for image retrieval; 2) minimum quantization loss between the original real-valued feature descriptors and the learned hash codes; 3) maximum information entropy for the learned hash codes. Extensive experiments on CIFAR-10, MNIST and In-shop datasets have shown that UTH outperforms several state-of-the-art unsupervised hashing methods in terms of retrieval accuracy.
Feb 23 2017 cs.SE
Mutation analysis has many applications, such as asserting the quality of test suites and localizing faults. One important bottleneck of mutation analysis is scalability. The latest work explores the possibility of reducing the redundant execution via split-stream execution. However, split-stream execution is only able to remove redundant execution before the first mutated statement. In this paper we try to also reduce some of the redundant execution after the execution of the first mutated statement. We observe that, although many mutated statements are not equivalent, the execution result of those mutated statements may still be equivalent to the result of the original statement. In other words, the statements are equivalent modulo the current state. In this paper we propose a fast mutation analysis approach, AccMut. AccMut automatically detects the equivalence modulo states among a statement and its mutations, then groups the statements into equivalence classes modulo states, and uses only one process to represent each class. In this way, we can significantly reduce the number of split processes. Our experiments show that our approach can further accelerate mutation analysis on top of split-stream execution with a speedup of 2.56x on average.
Nov 24 2016 cs.CV
Deep convolutional neutral networks have achieved great success on image recognition tasks. Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable. We present deep feature flow, a fast and accurate framework for video recognition. It runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field. It achieves significant speedup as flow computation is relatively fast. The end-to-end training of the whole architecture significantly boosts the recognition accuracy. Deep feature flow is flexible and general. It is validated on two recent large scale video datasets. It makes a large step towards practical video recognition.
The effort to extend cellular technologies to unlicensed spectrum has been gaining high momentum. Listen-before-talk (LBT) is enforced in the regions such as European Union and Japan to harmonize coexistence of cellular and incumbent systems in unlicensed spectrum. In this paper, we study throughput optimal LBT transmission strategy for load based equipment (LBE). We find that the optimal rule is a pure threshold policy: The LBE should stop listening and transmit once the channel quality exceeds an optimized threshold. We also reveal the optimal set of LBT parameters that are compliant with regulatory requirements. Our results shed light on how the regulatory LBT requirements can affect the transmission strategies of radio equipment in unlicensed spectrum.
Oct 05 2016 cs.CV
Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to these recent large-scale scene datasets, such as the Places and Places2. Scene categories are often defined by multi-level information, including local objects, global layout, and background environment, thus leading to large intra-class variations. In addition, with the increasing number of scene categories, label ambiguity has become another crucial issue in large-scale classification. This paper focuses on large-scale scene recognition and makes two major contributions to tackle these issues. First, we propose a multi-resolution CNN architecture that captures visual content and structure at multiple levels. The multi-resolution CNNs are composed of coarse resolution CNNs and fine resolution CNNs, which are complementary to each other. Second, we design two knowledge guided disambiguation techniques to deal with the problem of label ambiguity. (i) We exploit the knowledge from the confusion matrix computed on validation data to merge ambiguous classes into a super category. (ii) We utilize the knowledge of extra networks to produce a soft label for each image. Then the super categories or soft labels are employed to guide CNN training on the Places2. We conduct extensive experiments on three large-scale image datasets (ImageNet, Places, and Places2), demonstrating the effectiveness of our approach. Furthermore, our method takes part in two major scene recognition challenges, and achieves the second place at the Places2 challenge in ILSVRC 2015, and the first place at the LSUN challenge in CVPR 2016. Finally, we directly test the learned representations on other scene benchmarks, and obtain the new state-of-the-art results on the MIT Indoor67 (86.7\%) and SUN397 (72.0\%). We release the code and models at~\urlhttps://github.com/wanglimin/MRCNN-Scene-Recognition.
Aug 30 2016 cs.SE
Due to the difficulty of repairing defect, many research efforts have been devoted into automatic defect repair. Given a buggy program that fails some test cases, a typical automatic repair technique tries to modify the program to make all tests pass. However, since the test suites in real world projects are usually insufficient, aiming at passing the test suites often leads to incorrect patches. In this paper we aim to produce precise patches, that is, any patch we produce has a relatively high probability to be correct. More concretely, we focus on condition synthesis, which was shown to be able to repair more than half of the defects in existing approaches. Our key insight is threefold. First, it is important to know what variables in a local context should be used in an "if" condition, and we propose a sorting method based on the dependency relations between variables. Second, we observe that the API document can be used to guide the repair process, and propose document analysis technique to further filter the variables. Third, it is important to know what predicates should be performed on the set of variables, and we propose to mine a set of frequently used predicates in similar contexts from existing projects. We develop a novel program repair system, ACS, that could generate precise conditions at faulty locations. Furthermore, given the generated conditions are very precise, we can perform a repair operation that is previously deemed to be too overfitting: directly returning the test oracle to repair the defect. Using our approach, we successfully repaired 18 defects on four projects of Defects4J, which is the largest number of fully automatically repaired defects reported on the dataset so far. More importantly, the precision of our approach in the evaluation is 78.3%, which is significantly higher than previous approaches, which are usually less than 40%.
Aug 03 2016 cs.CV
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition. which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-the-of-art performance on the datasets of HMDB51 ( $ 69.4\% $) and UCF101 ($ 94.2\% $). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
Aug 03 2016 cs.CV
This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.
Aug 25 2015 cs.CY
We propose and implement a novel relative positioning system, WalkieLokie, to enable more kinds of Augmented Reality applications, e.g., virtual shopping guide, virtual business card sharing. WalkieLokie calculates the distance and direction between an inquiring user and the corresponding target. It only requires a dummy speaker binding to the target and broadcasting inaudible acoustic signals. Then the user walking around can obtain the position using a smart device. The key insight is that when a user walks, the distance between the smart device and the speaker changes; and the pattern of displacement (variance of distance) corresponds to the relative position. We use a second-order phase locked loop to track the displacement and further estimate the position. To enhance the accuracy and robustness of our strategy, we propose a synchronization mechanism to synthesize all estimation results from different timeslots. We show that the mean error of ranging and direction estimation is 0.63m and 2.46 degrees respectively, which is accurate even in case of virtual business card sharing. Furthermore, in the shopping mall where the environment is quite severe, we still achieve high accuracy of positioning one dummy speaker, and the mean position error is 1.28m.
Jul 09 2015 cs.CV
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with those very deep models in image domain (e.g. VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, probably more importantly, the training dataset of action recognition is extremely small compared with the ImageNet dataset, and thus it will be easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures into video domain. However, this extension is not easy as the size of action recognition is quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, (iv) high drop out ratio. Meanwhile, we extend the Caffe toolbox into Multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the dataset of UCF101 and it achieves the recognition accuracy of $91.4\%$.
In this paper, we propose a novel indoor localization scheme that exploits ubiquitous visible lights, which are necessarily and densely deployed in almost all indoor environments. We unveil two phenomena of lights available for positioning: 1) the light strength varies according to different light sources, which can be easily detected by light sensors embedded in COTS devices (e.g., smart-phone, smart-glass and smart-watch); 2) the light strength is stable in different times of the day thus exploiting it can avoid frequent site-survey and database maintenance. Hence, a user could locate oneself by differentiating the light source of received light strength (RLS). However, different from existing positioning systems that exploit special LEDs, ubiquitous visible lights lack fingerprints that can uniquely identify the light source, which results in an ambiguity problem that an RLS may correspond to multiple positions. Moreover, RLS is not only determined by device's position, but also seriously affected by its orientation, which causes great complexity in site-survey. To address these challenges, we first propose and validate a realistic light strength model that can attributes RLS to arbitrary positions with heterogenous orientations. This model is further perfected by taking account of the device diversity, influence of multiple light sources and shading of obstacles. Then we design a localizing scheme that harness user's mobility to generate spatial-related RLS to tackle the position-ambiguity problem of a single RLS, which is robust against sunlight interference, shading effect of human-body and unpredictable behaviours (e.g., put the device in pocket) of user. Experiment results show that our scheme achieves mean accuracy $1.93$m and $1.98$m in office ($720m^2$) and library scenario ($960m^2$) respectively.
Data science is gaining more and more and widespread attention, but no consensus viewpoint on what data science is has emerged. As a new science, its objects of study and scientific issues should not be covered by established sciences. Data in cyberspace have formed what we call datanature. In the present paper, data science is defined as the science of exploring datanature.
Nov 19 2014 cs.CV
We introduce a multi-scale framework for low-level vision, where the goal is estimating physical scene values from image data---such as depth from stereo image pairs. The framework uses a dense, overlapping set of image regions at multiple scales and a "local model," such as a slanted-plane model for stereo disparity, that is expected to be valid piecewise across the visual field. Estimation is cast as optimization over a dichotomous mixture of variables, simultaneously determining which regions are inliers with respect to the local model (binary variables) and the correct co-ordinates in the local model space for each inlying region (continuous variables). When the regions are organized into a multi-scale hierarchy, optimization can occur in an efficient and parallel architecture, where distributed computational units iteratively perform calculations and share information through sparse connections between parents and children. The framework performs well on a standard benchmark for binocular stereo, and it produces a distributional scene representation that is appropriate for combining with higher-level reasoning and other low-level cues.
Sep 12 2014 cs.CV
In this paper, we propose multi-stage and deformable deep convolutional neural networks for object detection. This new deep learning object detection diagram has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. With the proposed multi-stage training strategy, multiple classifiers are jointly optimized to process samples at different difficulty levels. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures, training strategies, adding and removing some key components in the detection pipeline, a set of models with large diversity are obtained, which significantly improves the effectiveness of modeling averaging. The proposed approach ranked \#2 in ILSVRC 2014. It improves the mean averaged precision obtained by RCNN, which is the state-of-the-art of object detection, from $31\%$ to $45\%$. Detailed component-wise analysis is also provided through extensive experimental evaluation.
In this paper, we present an OpenCL-based heterogeneous implementation of a computer vision algorithm -- image inpainting-based object removal algorithm -- on mobile devices. To take advantage of the computation power of the mobile processor, the algorithm workflow is partitioned between the CPU and the GPU based on the profiling results on mobile devices, so that the computationally-intensive kernels are accelerated by the mobile GPGPU (general-purpose computing using graphics processing units). By exploring the implementation trade-offs and utilizing the proposed optimization strategies at different levels including algorithm optimization, parallelism optimization, and memory access optimization, we significantly speed up the algorithm with the CPU-GPU heterogeneous implementation, while preserving the quality of the output images. Experimental results show that heterogeneous computing based on GPGPU co-processing can significantly speed up the computer vision algorithms and makes them practical on real-world mobile devices.
In this paper, we first provide a spanning tree (ST)-based centralized group key agreement protocol for unbalanced mobile Ad Hoc networks (MANETs). Based on the centralized solution, a local spanning tree (LST)-based distributed protocol for general MANETs is subsequently presented. Both protocols follow the basic features of the HSK scheme: 1) H means that a hybrid approach, which is the combination of key agreement and key distribution via symmetric encryption, is exploited; 2) S indicates that a ST or LSTs are adopted to form a connected network topology; and 3) K implies that the extended Kruskal algorithm is employed to handle dynamic events. It is shown that the HSK scheme is a uniform approach to handle the initial key establishment process as well as all kinds of dynamic events in group key agreement protocol for MANETs. Additionally, the extended Kruskal algorithm enables to realize the reusability of the precomputed secure links to reduce the overhead. Moreover, some other aspects, such as the network topology connectivity and security, are well analyzed.
Nov 28 2013 cs.CV
To produce images that are suitable for display, tone-mapping is widely used in digital cameras to map linear color measurements into narrow gamuts with limited dynamic range. This introduces non-linear distortion that must be undone, through a radiometric calibration process, before computer vision systems can analyze such photographs radiometrically. This paper considers the inherent uncertainty of undoing the effects of tone-mapping. We observe that this uncertainty varies substantially across color space, making some pixels more reliable than others. We introduce a model for this uncertainty and a method for fitting it to a given camera or imaging pipeline. Once fit, the model provides for each pixel in a tone-mapped digital photograph a probability distribution over linear scene colors that could have induced it. We demonstrate how these distributions can be useful for visual inference by incorporating them into estimation algorithms for a representative set of vision tasks.
Oct 11 2013 cs.CV
We develop a framework for extracting a concise representation of the shape information available from diffuse shading in a small image patch. This produces a mid-level scene descriptor, comprised of local shape distributions that are inferred separately at every image patch across multiple scales. The framework is based on a quadratic representation of local shape that, in the absence of noise, has guarantees on recovering accurate local shape and lighting. And when noise is present, the inferred local shape distributions provide useful shape information without over-committing to any particular image explanation. These local shape distributions naturally encode the fact that some smooth diffuse regions are more informative than others, and they enable efficient and robust reconstruction of object-scale shape. Experimental results show that this approach to surface reconstruction compares well against the state-of-art on both synthetic images and captured photographs.
Jul 12 2013 cs.CR
With the rapid development of MANET, secure and practical authentication is becoming increasingly important. The existing works perform the research from two aspects, i.e., (a)secure key division and distributed storage, (b)secure distributed authentication. But there still exist several unsolved problems. Specifically, it may suffer from cheating problems and fault authentication attack, which can result in authentication failure and DoS attack towards authentication service. Besides, most existing schemes are not with satisfactory efficiency due to exponential arithmetic based on Shamir's scheme. In this paper, we explore the property of verifiable secret sharing(VSS) schemes with Chinese Remainder Theorem (CRT), then propose a secret key distributed storage scheme based on CRT-VSS and trusted computing for MANET. Specifically, we utilize trusted computing technology to solve two existing cheating problems in secret sharing area before. After that, we do the analysis of homomorphism property with CRT-VSS and design the corresponding shares-product sharing scheme with better concision. On such basis, a secure distributed Elliptic Curve-Digital Signature Standard signature (ECC-DSS) authentication scheme based on CRT-VSS scheme and trusted computing is proposed. Furthermore, as an important property of authentication scheme, we discuss the refreshing property of CRT-VSS and do thorough comparisons with Shamir's scheme. Finally, we provide formal guarantees towards our schemes proposed in this paper.
Jul 12 2013 cs.DB
In recent years, due to the wide applications of uncertain data (e.g., noisy data), uncertain frequent itemsets (UFI) mining over uncertain databases has attracted much attention, which differs from the corresponding deterministic problem from the generalized definition and resolutions. As the most costly task in association rule mining process, it has been shown that outsourcing this task to a service provider (e.g.,the third cloud party) brings several benefits to the data owner such as cost relief and a less commitment to storage and computational resources. However, the correctness integrity of mining results can be corrupted if the service provider is with random fault or not honest (e.g., lazy, malicious, etc). Therefore, in this paper, we focus on the integrity and verification issue in UFI mining problem during outsourcing process, i.e., how the data owner verifies the mining results. Specifically, we explore and extend the existing work on deterministic FI outsourcing verification to uncertain scenario. For this purpose, We extend the existing outsourcing FI mining work to uncertain area w.r.t. the two popular UFI definition criteria and the approximate UFI mining methods. Specifically, We construct and improve the basic/enhanced verification scheme with such different UFI definition respectively. After that, we further discuss the scenario of existing approximation UFP mining, where we can see that our technique can provide good probabilistic guarantees about the correctness of the verification. Finally, we present the comparisons and analysis on the schemes proposed in this paper.
Jun 10 2013 cs.NI
We propose and implement a novel indoor localization scheme, Swadloon, built upon an accurate acoustic direction finding. Swadloon leverages sensors of the smartphone without the requirement of any specialized devices. The scheme Swadloon does not rely on any fingerprints and is very easy to use: a user only needs to shake the phone for a short duration before walking and localization. Our Swadloon design exploits a key observation: the relative shift and velocity of the phone-shaking movement corresponds to the subtle phase and frequency shift of the Doppler effects experienced in the received acoustic signal by the phone. A novel method is designed to derive the direction from the phone to the acoustic source by combining the velocity calculated from the subtle Doppler shift with the one from the inertial sensors of the phone. Then a real-time precise localization and tracking is enabled by using a few anchor speakers with known locations. Major challenges in implementing Swadloon are to measure the frequency shift precisely and to estimate the shaking velocity accurately when the speed of phone-shaking is low and changes arbitrarily. We propose rigorous methods to address these challenges, then design and deploy Swadloon in several floors of an indoor building each with area about 2000m^2. Our extensive experiments show that the mean error of direction finding is around 2.1 degree when the acoustic source is within the range of 32m. For indoor localization, the 90-percentile errors are under 0.92m, while the maximum error is 1.73m and the mean is about 0.5m. For real-time tracking, the errors are within 0.4m for walks of 51m.