Apr 21 2017 cs.CV
Detecting activities in untrimmed videos is an important yet challenging task. In this paper, we tackle the difficulties of effectively locating the start and the end of a long complex action, which are often met by existing methods. Our key contribution is the structured segment network, a novel framework for temporal action detection, which models the temporal structure of each activity instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model, which comprises two classifiers, respectively for classifying activities and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. We also propose a simple yet effective temporal action proposal scheme that can generate proposals of considerably higher qualities. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms existing state-of-the-art methods by over $ 10\% $ absolute average mAP, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
Mar 21 2017 cs.CV
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code will be released.
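To make the offset-augmented sampling idea concrete, here is a minimal pure-Python sketch of one output value of a 3x3 deformable convolution. It assumes zero padding outside the feature map and bilinear interpolation at fractional positions; the names `bilinear_sample` and `deformable_conv_at` are illustrative, and the offsets are passed in by hand rather than learned.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x); zero outside the map."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution centred at (cy, cx).

    offsets[k] = (dy, dx) shifts the k-th kernel tap (row-major order).
    With all offsets equal to zero this reduces to a plain 3x3 convolution.
    """
    total = 0.0
    k = 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            total += kernel[i + 1][j + 1] * bilinear_sample(
                feature, cy + i + dy, cx + j + dx
            )
            k += 1
    return total
```

In a trained network the offsets would be predicted by an extra convolutional branch; here they are simply supplied by the caller.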
With the increasing adoption of electric vehicles (EVs) in recent years, the impact of EV charging activities on the power grid has become more and more significant. In this article, an optimal scheduling algorithm that combines smart EV charging and V2G grid service is developed to integrate EVs into the power grid as distributed energy resources, with improved system cost performance. Specifically, an optimization problem is formulated and solved at each EV charging station according to the control signal from an aggregated control center and user charging behavior prediction by mean estimation and linear regression. The control center periodically collects the distributed optimization results and updates the control signal. The iteration continues until it converges to the optimal schedule. Experimental results show that this algorithm helps fill the valleys and shave the peaks in electric load profiles within a microgrid, while the energy demand of individual drivers can be satisfied.
Mar 10 2017 cs.CV
Current action recognition methods heavily rely on trimmed videos for model training. However, it is very expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn from untrimmed videos without the need of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two modules are implemented with feed-forward networks. UntrimmedNet is essentially an end-to-end trainable architecture, which allows for the joint optimization of model parameters of both components. We exploit the learned models for the problems of action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
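One plausible way to couple a classification module with a selection module is sketched below. It assumes the selection module emits one logit per clip and that clips are combined by softmax attention; the abstract does not specify this coupling, so `video_scores` and its inputs are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_scores(clip_class_scores, clip_selection_logits):
    """Combine per-clip class scores into video-level scores.

    clip_class_scores[i][c]: score of class c on clip i (classification module).
    clip_selection_logits[i]: importance logit of clip i (selection module).
    """
    weights = softmax(clip_selection_logits)
    n_classes = len(clip_class_scores[0])
    return [
        sum(w * clip[c] for w, clip in zip(weights, clip_class_scores))
        for c in range(n_classes)
    ]
```

Because only the video-level label supervises the weighted sum, no per-clip temporal annotation is needed in this sketch.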
Mar 09 2017 cs.CV
Detecting activities in untrimmed videos is an important but challenging task. The performance of existing methods remains unsatisfactory, e.g., they often meet difficulties in locating the beginning and end of a long complex action. In this paper, we propose a generic framework that can accurately detect a wide variety of activities from untrimmed videos. Our first contribution is a novel proposal scheme that can efficiently generate candidates with accurate temporal boundaries. The other contribution is a cascaded classification pipeline that explicitly distinguishes between relevance and completeness of a candidate instance. On two challenging temporal activity detection datasets, THUMOS14 and ActivityNet, the proposed framework significantly outperforms the existing state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
Mar 01 2017 cs.CV
Hashing has played a pivotal role in large-scale image retrieval. With the development of Convolutional Neural Network (CNN), hashing learning has shown great promise. But existing methods are mostly tuned for classification, which are not optimized for retrieval tasks, especially for instance-level retrieval. In this study, we propose a novel hashing method for large-scale image retrieval. Considering the difficulty in obtaining labeled datasets for image retrieval task in large scale, we propose a novel CNN-based unsupervised hashing method, namely Unsupervised Triplet Hashing (UTH). The unsupervised hashing network is designed under the following three principles: 1) more discriminative representations for image retrieval; 2) minimum quantization loss between the original real-valued feature descriptors and the learned hash codes; 3) maximum information entropy for the learned hash codes. Extensive experiments on CIFAR-10, MNIST and In-shop datasets have shown that UTH outperforms several state-of-the-art unsupervised hashing methods in terms of retrieval accuracy.
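Design principles 2) and 3) above can be illustrated with a small pure-Python sketch. It assumes sign binarization to codes in {-1, +1}, a mean squared quantization error, and a per-bit Shannon entropy averaged over bits; the exact loss terms used by UTH are not given in the abstract, so these definitions are assumptions.

```python
import math


def binarize(features):
    """Sign-binarize real-valued descriptors into {-1, +1} hash codes."""
    return [[1 if v >= 0 else -1 for v in row] for row in features]


def quantization_loss(features):
    """Mean squared gap between real-valued features and their binary codes."""
    codes = binarize(features)
    total, count = 0.0, 0
    for row, code in zip(features, codes):
        for v, b in zip(row, code):
            total += (v - b) ** 2
            count += 1
    return total / count


def mean_bit_entropy(codes):
    """Average Shannon entropy (in bits) of each hash bit over the dataset."""
    n, n_bits = len(codes), len(codes[0])
    total = 0.0
    for k in range(n_bits):
        p = sum(1 for code in codes if code[k] == 1) / n
        for q in (p, 1 - p):
            if q > 0:
                total -= q * math.log2(q)
    return total / n_bits
```

A bit that is +1 for exactly half of the images reaches the maximum entropy of 1, which is the balanced case principle 3) aims for.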
Feb 23 2017 cs.SE
Mutation analysis has many applications, such as asserting the quality of test suites and localizing faults. One important bottleneck of mutation analysis is scalability. The latest work explores the possibility of reducing the redundant execution via split-stream execution. However, split-stream execution is only able to remove redundant execution before the first mutated statement. In this paper we try to also reduce some of the redundant execution after the execution of the first mutated statement. We observe that, although many mutated statements are not equivalent, the execution result of those mutated statements may still be equivalent to the result of the original statement. In other words, the statements are equivalent modulo the current state. In this paper we propose a fast mutation analysis approach, AccMut. AccMut automatically detects the equivalence modulo states among a statement and its mutations, then groups the statements into equivalence classes modulo states, and uses only one process to represent each class. In this way, we can significantly reduce the number of split processes. Our experiments show that our approach can further accelerate mutation analysis on top of split-stream execution with a speedup of 2.56x on average.
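The notion of equivalence modulo the current state can be sketched as follows. This toy example models a statement and its mutants as Python callables over a state dictionary and groups them by the value they produce; the function name and this representation are illustrative assumptions, not AccMut's implementation.

```python
from collections import defaultdict


def group_by_state_equivalence(variants, state):
    """Group a statement and its mutants by the result they yield in `state`.

    variants: dict mapping a label to a callable taking the state.
    Returns a list of equivalence classes (lists of labels); only one
    process per class would need to be forked.
    """
    classes = defaultdict(list)
    for label, evaluate in variants.items():
        classes[evaluate(state)].append(label)
    return list(classes.values())


variants = {
    "a + b": lambda s: s["a"] + s["b"],  # original statement
    "a - b": lambda s: s["a"] - s["b"],  # mutant
    "a * b": lambda s: s["a"] * s["b"],  # mutant
}
classes = group_by_state_equivalence(variants, {"a": 2, "b": 2})
```

With `a = 2` and `b = 2`, the original `a + b` and the mutant `a * b` both yield 4, so they fall into one class even though the statements are not equivalent in general.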
Nov 24 2016 cs.CV
Deep convolutional neural networks have achieved great success on image recognition tasks. Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable. We present deep feature flow, a fast and accurate framework for video recognition. It runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field. It achieves significant speedup as flow computation is relatively fast. The end-to-end training of the whole architecture significantly boosts the recognition accuracy. Deep feature flow is flexible and general. It is validated on two recent large scale video datasets. It makes a large step towards practical video recognition.
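The key-frame scheduling and feature propagation described above can be sketched in a few lines. For brevity the sketch uses 1D feature vectors with linear interpolation instead of 2D maps with bilinear warping, a fixed key-frame interval, and caller-supplied `feature_net` and `flow_net` callables; all of these are simplifying assumptions.

```python
def warp_features(key_features, flow):
    """Propagate 1D key-frame features: out[p] samples key_features at p + flow[p]."""
    n = len(key_features)
    out = []
    for p in range(n):
        q = min(max(p + flow[p], 0.0), n - 1.0)  # clamp to valid positions
        lo = int(q)
        hi = min(lo + 1, n - 1)
        frac = q - lo
        out.append((1 - frac) * key_features[lo] + frac * key_features[hi])
    return out


def recognize_video(frames, feature_net, flow_net, key_interval):
    """Run the expensive feature_net only on every key_interval-th frame."""
    features = []
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            key_frame, key_feat = frame, feature_net(frame)
            features.append(key_feat)
        else:
            flow = flow_net(key_frame, frame)  # cheap flow from key to current frame
            features.append(warp_features(key_feat, flow))
    return features
```

The speedup comes from `feature_net` being called once per interval while the cheaper `flow_net` handles the remaining frames.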
The effort to extend cellular technologies to unlicensed spectrum has been gaining high momentum. Listen-before-talk (LBT) is enforced in the regions such as European Union and Japan to harmonize coexistence of cellular and incumbent systems in unlicensed spectrum. In this paper, we study throughput optimal LBT transmission strategy for load based equipment (LBE). We find that the optimal rule is a pure threshold policy: The LBE should stop listening and transmit once the channel quality exceeds an optimized threshold. We also reveal the optimal set of LBT parameters that are compliant with regulatory requirements. Our results shed light on how the regulatory LBT requirements can affect the transmission strategies of radio equipment in unlicensed spectrum.
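The shape of a pure threshold policy is easy to show in code. The sketch below assumes the equipment observes one channel-quality value per listening slot and that the threshold has already been optimized elsewhere; the function name and the per-slot model are illustrative, and no regulatory LBT parameters are modeled.

```python
def first_transmit_slot(channel_qualities, threshold):
    """Return the first slot whose channel quality exceeds the threshold.

    Returns None if the equipment keeps listening through every observed slot.
    """
    for slot, quality in enumerate(channel_qualities):
        if quality > threshold:
            return slot  # stop listening and transmit here
    return None
```

For example, with observed qualities `[0.2, 0.4, 0.9, 0.5]` and a threshold of `0.6`, the policy transmits in slot 2.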
Oct 05 2016 cs.CV
Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to recent large-scale scene datasets such as Places and Places2. Scene categories are often defined by multi-level information, including local objects, global layout, and background environment, thus leading to large intra-class variations. In addition, with the increasing number of scene categories, label ambiguity has become another crucial issue in large-scale classification. This paper focuses on large-scale scene recognition and makes two major contributions to tackle these issues. First, we propose a multi-resolution CNN architecture that captures visual content and structure at multiple levels. The multi-resolution CNNs are composed of coarse resolution CNNs and fine resolution CNNs, which are complementary to each other. Second, we design two knowledge guided disambiguation techniques to deal with the problem of label ambiguity. (i) We exploit the knowledge from the confusion matrix computed on validation data to merge ambiguous classes into a super category. (ii) We utilize the knowledge of extra networks to produce a soft label for each image. Then the super categories or soft labels are employed to guide CNN training on Places2. We conduct extensive experiments on three large-scale image datasets (ImageNet, Places, and Places2), demonstrating the effectiveness of our approach. Furthermore, our method took part in two major scene recognition challenges, achieving second place at the Places2 challenge in ILSVRC 2015 and first place at the LSUN challenge in CVPR 2016. Finally, we directly test the learned representations on other scene benchmarks, and obtain new state-of-the-art results on MIT Indoor67 (86.7\%) and SUN397 (72.0\%). We release the code and models at https://github.com/wanglimin/MRCNN-Scene-Recognition.
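Technique (i) above can be sketched as follows. The sketch assumes that two classes are merged when their symmetric, row-normalized confusion exceeds a hand-picked threshold, and it uses a union-find structure to form super categories; the similarity measure and the threshold are assumptions, not the paper's exact rule.

```python
def merge_ambiguous_classes(confusion, threshold):
    """Map each class index to a super-category representative.

    confusion[i][j]: number of validation images of class i predicted as j.
    """
    n = len(confusion)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    row_sums = [sum(row) or 1 for row in confusion]  # avoid division by zero
    for i in range(n):
        for j in range(i + 1, n):
            similarity = 0.5 * (
                confusion[i][j] / row_sums[i] + confusion[j][i] / row_sums[j]
            )
            if similarity > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(n)]
```

Classes that share a representative would then be trained under one super-category label.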
Aug 30 2016 cs.SE
Due to the difficulty of repairing defects, many research efforts have been devoted to automatic defect repair. Given a buggy program that fails some test cases, a typical automatic repair technique tries to modify the program to make all tests pass. However, since the test suites in real-world projects are usually insufficient, aiming at passing the test suites often leads to incorrect patches. In this paper we aim to produce precise patches, that is, any patch we produce has a relatively high probability of being correct. More concretely, we focus on condition synthesis, which was shown to be able to repair more than half of the defects in existing approaches. Our key insight is threefold. First, it is important to know what variables in a local context should be used in an "if" condition, and we propose a sorting method based on the dependency relations between variables. Second, we observe that the API document can be used to guide the repair process, and propose a document analysis technique to further filter the variables. Third, it is important to know what predicates should be performed on the set of variables, and we propose to mine a set of frequently used predicates in similar contexts from existing projects. We develop a novel program repair system, ACS, that can generate precise conditions at faulty locations. Furthermore, given that the generated conditions are very precise, we can perform a repair operation that was previously deemed to be too overfitting: directly returning the test oracle to repair the defect. Using our approach, we successfully repaired 18 defects on four projects of Defects4J, which is the largest number of fully automatically repaired defects reported on the dataset so far. More importantly, the precision of our approach in the evaluation is 78.3%, which is significantly higher than that of previous approaches, which is usually less than 40%.
Aug 03 2016 cs.CV
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 ($ 69.4\% $) and UCF101 ($ 94.2\% $). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
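The sparse temporal sampling strategy with video-level supervision can be sketched as follows. The sketch assumes the video is split into equal-length segments, one snippet index is drawn per segment (the segment centre when no random generator is given), and the snippet scores are averaged into a video-level prediction; averaging is only one possible consensus function, and the helper names are illustrative.

```python
import random


def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal-length segments."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        if rng is None:
            indices.append((start + end - 1) // 2)  # deterministic centre
        else:
            indices.append(rng.randrange(start, end))  # random within segment
    return indices


def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    n = len(snippet_scores)
    n_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / n for c in range(n_classes)]
```

Because the number of sampled snippets is fixed by `num_segments`, the cost per video does not grow with its length.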
Aug 03 2016 cs.CV
This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.
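Of the aggregation schemes mentioned above, top-k pooling is the simplest to illustrate. The sketch assumes that, for each class, the k highest snippet scores are averaged; the function name and this exact definition are assumptions, since the abstract does not spell the scheme out.

```python
def top_k_pooling(snippet_scores, k):
    """For each class, average the k highest scores across all snippets."""
    n_classes = len(snippet_scores[0])
    pooled = []
    for c in range(n_classes):
        column = sorted((s[c] for s in snippet_scores), reverse=True)
        top = column[:k]
        pooled.append(sum(top) / len(top))
    return pooled
```

Setting `k = 1` gives max pooling and setting `k` to the number of snippets gives average pooling, so the scheme interpolates between the two.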
Aug 25 2015 cs.CY
We propose and implement a novel relative positioning system, WalkieLokie, to enable more kinds of Augmented Reality applications, e.g., virtual shopping guide, virtual business card sharing. WalkieLokie calculates the distance and direction between an inquiring user and the corresponding target. It only requires a dummy speaker binding to the target and broadcasting inaudible acoustic signals. Then the user walking around can obtain the position using a smart device. The key insight is that when a user walks, the distance between the smart device and the speaker changes; and the pattern of displacement (variance of distance) corresponds to the relative position. We use a second-order phase locked loop to track the displacement and further estimate the position. To enhance the accuracy and robustness of our strategy, we propose a synchronization mechanism to synthesize all estimation results from different timeslots. We show that the mean error of ranging and direction estimation is 0.63m and 2.46 degrees respectively, which is accurate even in case of virtual business card sharing. Furthermore, in the shopping mall where the environment is quite severe, we still achieve high accuracy of positioning one dummy speaker, and the mean position error is 1.28m.
Jul 09 2015 cs.CV
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First, the current network architectures (e.g., Two-stream ConvNets) are relatively shallow compared with the very deep models in the image domain (e.g., VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, and thus it is easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures to the video domain. However, this extension is not easy as the size of action recognition datasets is quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, (iv) high dropout ratio. Meanwhile, we extend the Caffe toolbox into a Multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the dataset of UCF101, where it achieves a recognition accuracy of $91.4\%$.
In this paper, we propose a novel indoor localization scheme that exploits ubiquitous visible lights, which are necessarily and densely deployed in almost all indoor environments. We unveil two phenomena of lights available for positioning: 1) the light strength varies according to different light sources, which can be easily detected by light sensors embedded in COTS devices (e.g., smart-phone, smart-glass and smart-watch); 2) the light strength is stable at different times of the day, thus exploiting it can avoid frequent site-survey and database maintenance. Hence, a user could locate oneself by differentiating the light source of received light strength (RLS). However, different from existing positioning systems that exploit special LEDs, ubiquitous visible lights lack fingerprints that can uniquely identify the light source, which results in an ambiguity problem in which an RLS may correspond to multiple positions. Moreover, RLS is not only determined by the device's position, but also seriously affected by its orientation, which causes great complexity in site-survey. To address these challenges, we first propose and validate a realistic light strength model that can attribute RLS to arbitrary positions with heterogeneous orientations. This model is further refined by taking account of device diversity, the influence of multiple light sources and the shading of obstacles. Then we design a localization scheme that harnesses the user's mobility to generate spatially related RLS to tackle the position-ambiguity problem of a single RLS, and which is robust against sunlight interference, the shading effect of the human body and unpredictable user behaviours (e.g., putting the device in a pocket). Experimental results show that our scheme achieves a mean accuracy of $1.93$m and $1.98$m in the office ($720m^2$) and library ($960m^2$) scenarios respectively.
Data science is gaining more and more widespread attention, but no consensus viewpoint on what data science is has emerged. As a new science, its objects of study and scientific issues should not be covered by established sciences. Data in cyberspace have formed what we call datanature. In the present paper, data science is defined as the science of exploring datanature.
Nov 19 2014 cs.CV
We introduce a multi-scale framework for low-level vision, where the goal is estimating physical scene values from image data---such as depth from stereo image pairs. The framework uses a dense, overlapping set of image regions at multiple scales and a "local model," such as a slanted-plane model for stereo disparity, that is expected to be valid piecewise across the visual field. Estimation is cast as optimization over a dichotomous mixture of variables, simultaneously determining which regions are inliers with respect to the local model (binary variables) and the correct co-ordinates in the local model space for each inlying region (continuous variables). When the regions are organized into a multi-scale hierarchy, optimization can occur in an efficient and parallel architecture, where distributed computational units iteratively perform calculations and share information through sparse connections between parents and children. The framework performs well on a standard benchmark for binocular stereo, and it produces a distributional scene representation that is appropriate for combining with higher-level reasoning and other low-level cues.
Sep 12 2014 cs.CV
In this paper, we propose multi-stage and deformable deep convolutional neural networks for object detection. This new deep learning object detection diagram has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. With the proposed multi-stage training strategy, multiple classifiers are jointly optimized to process samples at different difficulty levels. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures and training strategies, and by adding and removing some key components in the detection pipeline, a set of models with large diversity is obtained, which significantly improves the effectiveness of model averaging. The proposed approach ranked \#2 in ILSVRC 2014. It improves the mean average precision obtained by RCNN, which is the state of the art in object detection, from $31\%$ to $45\%$. Detailed component-wise analysis is also provided through extensive experimental evaluation.
In this paper, we present an OpenCL-based heterogeneous implementation of a computer vision algorithm -- image inpainting-based object removal algorithm -- on mobile devices. To take advantage of the computation power of the mobile processor, the algorithm workflow is partitioned between the CPU and the GPU based on the profiling results on mobile devices, so that the computationally-intensive kernels are accelerated by the mobile GPGPU (general-purpose computing using graphics processing units). By exploring the implementation trade-offs and utilizing the proposed optimization strategies at different levels including algorithm optimization, parallelism optimization, and memory access optimization, we significantly speed up the algorithm with the CPU-GPU heterogeneous implementation, while preserving the quality of the output images. Experimental results show that heterogeneous computing based on GPGPU co-processing can significantly speed up the computer vision algorithms and makes them practical on real-world mobile devices.
In this paper, we first provide a spanning tree (ST)-based centralized group key agreement protocol for unbalanced mobile ad hoc networks (MANETs). Based on the centralized solution, a local spanning tree (LST)-based distributed protocol for general MANETs is subsequently presented. Both protocols follow the basic features of the HSK scheme: 1) H means that a hybrid approach, which is the combination of key agreement and key distribution via symmetric encryption, is exploited; 2) S indicates that an ST or LSTs are adopted to form a connected network topology; and 3) K implies that the extended Kruskal algorithm is employed to handle dynamic events. It is shown that the HSK scheme is a uniform approach to handle the initial key establishment process as well as all kinds of dynamic events in group key agreement protocols for MANETs. Additionally, the extended Kruskal algorithm makes it possible to reuse the precomputed secure links to reduce the overhead. Moreover, some other aspects, such as the network topology connectivity and security, are well analyzed.
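For readers unfamiliar with the building block named in feature K, the sketch below shows the classic Kruskal minimum spanning tree algorithm with a union-find structure. It is the textbook algorithm, not the extended variant of the paper, and treating edge weights as link costs is an assumption made for illustration.

```python
def kruskal(num_nodes, edges):
    """Classic Kruskal: edges are (weight, u, v) tuples; returns the tree edges."""
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree
```

Edges already accepted into the tree stay valid when further edges are processed, which hints at why previously established links can be reused when the topology changes.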
Nov 28 2013 cs.CV
To produce images that are suitable for display, tone-mapping is widely used in digital cameras to map linear color measurements into narrow gamuts with limited dynamic range. This introduces non-linear distortion that must be undone, through a radiometric calibration process, before computer vision systems can analyze such photographs radiometrically. This paper considers the inherent uncertainty of undoing the effects of tone-mapping. We observe that this uncertainty varies substantially across color space, making some pixels more reliable than others. We introduce a model for this uncertainty and a method for fitting it to a given camera or imaging pipeline. Once fit, the model provides for each pixel in a tone-mapped digital photograph a probability distribution over linear scene colors that could have induced it. We demonstrate how these distributions can be useful for visual inference by incorporating them into estimation algorithms for a representative set of vision tasks.
Oct 11 2013 cs.CV
We develop a framework for extracting a concise representation of the shape information available from diffuse shading in a small image patch. This produces a mid-level scene descriptor, comprised of local shape distributions that are inferred separately at every image patch across multiple scales. The framework is based on a quadratic representation of local shape that, in the absence of noise, has guarantees on recovering accurate local shape and lighting. And when noise is present, the inferred local shape distributions provide useful shape information without over-committing to any particular image explanation. These local shape distributions naturally encode the fact that some smooth diffuse regions are more informative than others, and they enable efficient and robust reconstruction of object-scale shape. Experimental results show that this approach to surface reconstruction compares well against the state-of-art on both synthetic images and captured photographs.
Jul 12 2013 cs.CR
With the rapid development of MANETs, secure and practical authentication is becoming increasingly important. Existing works approach the problem from two aspects, i.e., (a) secure key division and distributed storage, and (b) secure distributed authentication. But there still exist several unsolved problems. Specifically, existing schemes may suffer from cheating problems and fault authentication attacks, which can result in authentication failure and DoS attacks against the authentication service. Besides, most existing schemes do not achieve satisfactory efficiency due to the exponential arithmetic required by Shamir's scheme. In this paper, we explore the properties of verifiable secret sharing (VSS) schemes with the Chinese Remainder Theorem (CRT), then propose a secret key distributed storage scheme based on CRT-VSS and trusted computing for MANETs. Specifically, we utilize trusted computing technology to solve two existing cheating problems in the secret sharing area. After that, we analyze the homomorphism property of CRT-VSS and design the corresponding shares-product sharing scheme with better concision. On this basis, a secure distributed Elliptic Curve-Digital Signature Standard signature (ECC-DSS) authentication scheme based on the CRT-VSS scheme and trusted computing is proposed. Furthermore, as an important property of authentication schemes, we discuss the refreshing property of CRT-VSS and make thorough comparisons with Shamir's scheme. Finally, we provide formal guarantees for the schemes proposed in this paper.
Jul 12 2013 cs.DB
In recent years, due to the wide applications of uncertain data (e.g., noisy data), uncertain frequent itemsets (UFI) mining over uncertain databases has attracted much attention; it differs from the corresponding deterministic problem in its generalized definition and resolutions. As the most costly task in the association rule mining process, it has been shown that outsourcing this task to a service provider (e.g., a third-party cloud) brings several benefits to the data owner, such as cost relief and a smaller commitment of storage and computational resources. However, the correctness and integrity of mining results can be corrupted if the service provider suffers random faults or is not honest (e.g., lazy, malicious, etc.). Therefore, in this paper, we focus on the integrity and verification issue in the UFI mining problem during the outsourcing process, i.e., how the data owner verifies the mining results. Specifically, we explore and extend the existing work on deterministic FI outsourcing verification to the uncertain scenario. For this purpose, we extend the existing outsourcing FI mining work to the uncertain area w.r.t. the two popular UFI definition criteria and the approximate UFI mining methods. Specifically, we construct and improve the basic/enhanced verification scheme for each of these UFI definitions. After that, we further discuss the scenario of existing approximation UFP mining, where we can see that our technique can provide good probabilistic guarantees about the correctness of the verification. Finally, we present the comparisons and analysis of the schemes proposed in this paper.
Jun 10 2013 cs.NI
We propose and implement a novel indoor localization scheme, Swadloon, built upon accurate acoustic direction finding. Swadloon leverages sensors of the smartphone without the requirement of any specialized devices. Swadloon does not rely on any fingerprints and is very easy to use: a user only needs to shake the phone for a short duration before walking and localization. Our Swadloon design exploits a key observation: the relative shift and velocity of the phone-shaking movement correspond to the subtle phase and frequency shift of the Doppler effects experienced in the acoustic signal received by the phone. A novel method is designed to derive the direction from the phone to the acoustic source by combining the velocity calculated from the subtle Doppler shift with the one from the inertial sensors of the phone. Then real-time precise localization and tracking are enabled by using a few anchor speakers with known locations. Major challenges in implementing Swadloon are to measure the frequency shift precisely and to estimate the shaking velocity accurately when the speed of phone-shaking is low and changes arbitrarily. We propose rigorous methods to address these challenges, then design and deploy Swadloon on several floors of an indoor building, each with an area of about 2000m^2. Our extensive experiments show that the mean error of direction finding is around 2.1 degrees when the acoustic source is within the range of 32m. For indoor localization, the 90-percentile errors are under 0.92m, while the maximum error is 1.73m and the mean is about 0.5m. For real-time tracking, the errors are within 0.4m for walks of 51m.
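The relation between Doppler shift and phone velocity that Swadloon exploits can be illustrated with the standard formula for a moving receiver and a stationary source. The sketch assumes a speed of sound of 343 m/s and recovers the angle to the source by comparing the Doppler-derived radial velocity with the phone speed from inertial sensors; the function names and this simplified geometry are assumptions, not the paper's full method.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees Celsius (assumed)


def radial_velocity(f_received, f_emitted, c=SPEED_OF_SOUND):
    """Velocity of the receiver toward a stationary source, from the Doppler shift.

    Uses f_received = f_emitted * (1 + v / c), so v = c * (f_received - f_emitted) / f_emitted.
    """
    return c * (f_received - f_emitted) / f_emitted


def angle_to_source(v_radial, speed):
    """Angle in degrees between the motion direction and the direction to the source.

    Assumes v_radial = speed * cos(angle), with speed taken from inertial sensors.
    """
    ratio = max(-1.0, min(1.0, v_radial / speed))  # guard against noise
    return math.degrees(math.acos(ratio))
```

At an emitted frequency of 19 kHz, a phone moving toward the speaker at 1 m/s sees a shift of only about 55 Hz, which is why precise frequency measurement is listed as a major challenge.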