May 17 2018 cs.CV
Because of the powerful learning capability of deep neural networks, counting performance via density map estimation has improved significantly during the past several years. However, it is still very challenging due to severe occlusion, large scale variations, and perspective distortion. Scale variations (from image to image) coupled with perspective distortion (within one image) result in huge changes in object size. Earlier methods based on convolutional neural networks (CNNs) typically did not handle this scale variation explicitly, until Hydra-CNN and MCNN. MCNN uses three columns, each with different filter sizes, to extract features at different scales. In this paper, in contrast to using filters of different sizes, we utilize an image pyramid to deal with scale variations. First, resizing the input fed into the network is more effective and efficient than using larger filter sizes. Second, we adaptively fuse the predictions from different scales (using adaptively changing per-pixel weights), which makes our method adapt to scale changes within an image. The adaptive fusing is achieved by generating an across-scale attention map, which softly selects a suitable scale for each pixel, followed by a 1x1 convolution. Extensive experiments on three popular datasets show very compelling results.
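The per-pixel fusion described above can be illustrated with a minimal sketch. It assumes the per-scale predictions have already been resized to a common resolution, uses a softmax across scales as the soft selection, and omits the final 1x1 convolution; the function names `softmax` and `fuse_scales` are hypothetical and not taken from the paper.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scales(predictions, attention_logits):
    """Fuse per-scale density predictions with per-pixel soft weights.

    predictions, attention_logits: lists (one entry per scale) of H x W
    nested lists, all with the same shape (an assumption of this sketch).
    """
    num_scales = len(predictions)
    height = len(predictions[0])
    width = len(predictions[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Attention weights across scales for this pixel sum to 1.
            weights = softmax([attention_logits[s][y][x] for s in range(num_scales)])
            fused[y][x] = sum(weights[s] * predictions[s][y][x] for s in range(num_scales))
    return fused
```

With equal logits the result is the plain average of the scales; a strongly dominant logit makes the fused value follow a single scale at that pixel.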
Mar 21 2018 cs.CV
Template-matching methods for visual tracking have gained popularity recently due to their comparable performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target's appearance variations during tracking. An LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing process of the memory block. As the location of the target is at first unknown in the search feature map, an attention mechanism is applied to concentrate the LSTM input on the potential target. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. Unlike tracking-by-detection methods where the object's information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as with other trackers -- the capacity can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on OTB and VOT demonstrate that our tracker MemTrack performs favorably against state-of-the-art tracking methods while retaining real-time speed of 50 fps.
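A minimal sketch of the gated residual idea, under stated assumptions: the final template is taken to be the initial template plus an element-wise gated portion of the retrieved memory, with gate values in (0, 1) produced by a sigmoid. The flat-list representation and the names `sigmoid` and `gated_residual_template` are illustrative only; the abstract does not spell out the exact formulation.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_template(initial, retrieved, gate_logits):
    """Combine the initial template with gated retrieved memory.

    Each output element is initial + gate * retrieved, where the gate is
    sigmoid(gate_logit). All three inputs are flat lists of equal length
    (an assumption of this sketch; real templates are feature maps).
    """
    return [t0 + sigmoid(g) * tr for t0, tr, g in zip(initial, retrieved, gate_logits)]
```

A gate near zero leaves the initial template almost unchanged, which is how such a gate can limit overly aggressive adaptation.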
Mar 06 2018 cs.CV
We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence. To do this, we first introduce a future convolutional neural network (CNN) that predicts an initial sequence of 12D camera configurations, aiming to capture how real players move during a one-on-one basketball game. We also introduce a goal verifier network, which is trained to verify that a given camera configuration is consistent with the final goals of real one-on-one basketball players. Next, we propose an inverse synthesis procedure to synthesize a refined sequence of 12D camera configurations that (1) sufficiently matches the initial configurations predicted by the future CNN, while (2) maximizing the output of the goal verifier network. Finally, by following the trajectory resulting from the refined camera configuration sequence, we obtain the complete 12D motion sequence. Our model generates realistic basketball motion sequences that capture the goals of real players, outperforming standard deep learning approaches such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and generative adversarial networks (GANs).
Aug 15 2017 cs.CV
Convolutional neural networks (CNNs) have recently gained popularity in visual tracking due to their robust feature representation of images. Recent methods perform online tracking by fine-tuning a pre-trained CNN model to the specific target object using stochastic gradient descent (SGD) back-propagation, which is usually time-consuming. In this paper, we propose a recurrent filter generation method for visual tracking. We directly feed the target's image patch to a recurrent neural network (RNN) to estimate an object-specific filter for tracking. As a video sequence is spatiotemporal data, we extend the matrix multiplications of the fully-connected layers of the RNN to a convolution operation on feature maps, which preserves the target's spatial structure and is also memory-efficient. The tracked object in the subsequent frames is fed into the RNN to adapt the generated filters to appearance variations of the target. Note that once the off-line training process of our network is finished, there is no need to fine-tune the network for specific objects, which makes our approach more efficient than methods that use iterative fine-tuning to learn the target online. Extensive experiments conducted on widely used benchmarks, OTB and VOT, demonstrate encouraging results compared to other recent methods.
May 30 2017 cs.CV
For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects have severe occlusions. Taking counting methods for example, almost all the recent state-of-the-art counting methods bypass explicit detection and adopt regression-based methods to directly count the objects of interest. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking). With the power of deep convolutional neural networks (CNNs), the counting performance has improved steadily. The goal of this paper is to evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking. Most existing CNN methods produce density maps with resolution that is smaller than the original images, due to the downsampling strides in the convolution/pooling operations. To produce an original-resolution density map, we also evaluate a classical CNN that uses a sliding window regressor to predict the density for every pixel in the image. We also consider a fully convolutional (FCNN) adaptation, with skip connections from lower convolutional layers to compensate for loss in spatial information during upsampling. In our experiments, we found that the lower-resolution density maps sometimes have better counting performance. In contrast, the original-resolution density maps improved localization tasks, such as detection and tracking, compared to bilinearly upsampling the lower-resolution density maps. Finally, we also propose several metrics for measuring the quality of a density map, and relate them to experiment results on counting and localization.
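The counting property of a density map (the count in a subregion is the integral of the density over it) reduces to a sum on a discrete pixel grid. The sketch below is illustrative only: the function name `region_count` and the toy density values are assumptions, not data from the paper.

```python
def region_count(density, top, left, bottom, right):
    """Sum a density map over the rectangle [top, bottom) x [left, right).

    density is an H x W nested list; the sum is the discrete analogue of
    integrating the density over the subregion.
    """
    return sum(sum(row[left:right]) for row in density[top:bottom])

# Toy 3 x 4 density map: two objects, each spread over a few pixels so that
# its mass sums to 1 (values are made up for illustration).
toy_density = [
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.5],
    [0.0,  0.0,  0.0, 0.5],
]
```

Summing the whole toy map gives 2 objects, while summing only the left two columns gives 1, so the same map supports both global counting and region-level queries.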
Since fall 2012, several National Centers of Academic Excellence in Cyber Defense Research (CAE-Rs) fielded a collaborative course to engage students in solving applied cybersecurity research problems. We describe our experiences with this Information Security Research and Education (INSuRE) research collaborative. We explain how we conducted our project-based research course, give examples of student projects, and discuss the outcomes and lessons learned.
A color theme or color palette can deeply influence the quality and the feeling of a photograph or a graphical design. Although color palettes may come from different sources such as online crowd-sourcing, photographs and graphical designs, in this paper, we consider color palettes extracted from fine art collections, which we believe to be an abundant source of stylistic and unique color themes. We aim to capture color styles embedded in these collections by means of statistical models and to build practical applications upon these models. As artists often use their personal color themes in their paintings, making these palettes appear frequently in the dataset, we employed density estimation to capture the characteristics of palette data. Via density estimation, we carried out various predictions and interpolations on palettes, which led to promising applications such as photo-style exploration, real-time color suggestion, and enriched photo recolorization. It was, however, challenging to apply density estimation to palette data as palettes often come as unordered sets of colors, which makes it difficult to use conventional metrics on them. To this end, we developed a divide-and-conquer sorting algorithm to rearrange the colors in the palettes in a coherent order, which allows meaningful interpolation between color palettes. To confirm the performance of our model, we also conducted quantitative experiments on datasets of digitized paintings collected from the Internet and received favorable results.
This paper presents a novel approach to estimating the continuous six degree of freedom (6-DoF) pose (3D translation and rotation) of an object from a single RGB image. The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model. Unlike prior work, we are agnostic to whether the object is textured or textureless, as the convnet learns the optimal representation from the available training image data. Furthermore, the approach can be applied to instance- and class-based pose recovery. Empirically, we show that the proposed approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios with a cluttered background. For class-based object pose estimation, state-of-the-art accuracy is shown on the large-scale PASCAL3D+ dataset.
Nov 22 2016 cs.CV
Computer vision tasks often have side information available that is helpful to solve the task. For example, for crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in counting systems based on deep learning. In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolutional filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold, parametrized by the side information, within the high-dimensional space of filter weights. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information, and extract discriminative features related to the current context. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as the side information. On experiments in crowd counting, the ACNN improves counting accuracy compared to a plain CNN with a similar number of parameters. We also apply ACNN to image deconvolution to show its potential effectiveness on other computer vision applications.
Aug 28 2015 cs.CV
This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.
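A minimal sketch of the scoring and training signal described above, assuming the image and pose embeddings are already computed as equal-length vectors. The hinge form below is a generic maximum-margin loss with a fixed margin; the paper's exact cost function may differ, and the names `score` and `margin_loss` are hypothetical.

```python
def score(image_embedding, pose_embedding):
    """Dot product between an image embedding and a pose embedding."""
    return sum(a * b for a, b in zip(image_embedding, pose_embedding))

def margin_loss(image_embedding, true_pose_embedding, wrong_pose_embedding, margin=1.0):
    """Generic hinge loss: zero once the matching pose outscores the
    non-matching pose by at least `margin` (margin value is an assumption)."""
    gap = score(image_embedding, true_pose_embedding) - score(image_embedding, wrong_pose_embedding)
    return max(0.0, margin - gap)
```

The loss is zero when the matching pair is scored sufficiently higher than the mismatched pair, and grows linearly as the mismatched pose approaches or overtakes it.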
Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are raw data for constructing impact measures, there exist biases and potential issues if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article level metric useful for evaluating individual articles' visibility. This measure derives from joint probabilistic modeling of the content in the articles and the citations amongst them using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations which take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset with approximately 30,000 high energy physics papers.
We propose a heterogeneous multi-task learning framework for human pose estimation from monocular images with a deep convolutional neural network. In particular, we simultaneously learn a pose-joint regressor and a sliding-window body-part detector in a deep network architecture. We show that including the body-part detection task helps to regularize the network, directing it to converge to a good solution. We report competitive and state-of-the-art results on several data sets. We also empirically show that the learned neurons in the middle layer of our network are tuned to localized body parts.
Feb 11 2014 cs.CV
We present a multiple-person tracking algorithm, based on combining particle filters and RVO, an agent-based crowd model that infers collision-free velocities so as to predict pedestrians' motion. In addition to position and velocity, our tracking algorithm can estimate the internal goals (desired destination or desired velocity) of the tracked pedestrian in an online manner, thus removing the need to specify this information beforehand. Furthermore, we leverage the longer-term predictions of RVO by deriving a higher-order particle filter, which aggregates multiple predictions from different prior time steps. This yields a tracker that can recover from short-term occlusions and spurious noise in the appearance model. Experimental results show that our tracking algorithm is suitable for predicting pedestrians' behaviors online without needing scene priors or hand-annotated goal information, and improves tracking in real-world crowded scenes under low frame rates.
A generalized Gaussian process model (GGPM) is a unifying framework that encompasses many existing Gaussian process (GP) models, such as GP regression, classification, and counting. In the GGPM framework, the observation likelihood of the GP model is itself parameterized using the exponential family distribution (EFD). In this paper, we consider efficient algorithms for approximate inference on GGPMs using the general form of the EFD. A particular GP model and its associated inference algorithms can then be formed by changing the parameters of the EFD, thus greatly simplifying its creation for task-specific output domains. We demonstrate the efficacy of this framework by creating several new GP models for regressing to non-negative reals and to real intervals. We also consider a closed-form Taylor approximation for efficient inference on GGPMs, and elaborate on its connections with other model-specific heuristic closed-form approximations. Finally, we present a comprehensive set of experiments to compare approximate inference algorithms on a wide variety of GGPMs.
The hidden Markov model (HMM) is a widely-used generative model that copes with sequential data, assuming that each observation is conditioned on the state of a hidden Markov chain. In this paper, we derive a novel algorithm to cluster HMMs based on the hierarchical EM (HEM) algorithm. The proposed algorithm i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a "cluster center", i.e., a novel HMM that is representative for the group, in a manner that is consistent with the underlying generative model of the HMM. To cope with intractable inference in the E-step, the HEM algorithm is formulated as a variational optimization problem, and efficiently solved for the HMM case by leveraging an appropriate variational approximation. The benefits of the proposed algorithm, which we call variational HEM (VHEM), are demonstrated on several tasks involving time-series data, such as hierarchical clustering of motion capture sequences, and automatic annotation and retrieval of music and of online hand-writing data, showing improvements over current methods. In particular, our variational HEM algorithm effectively leverages large amounts of data when learning annotation models by using an efficient hierarchical estimation procedure, which reduces learning times and memory requirements, while improving model robustness through better regularization.
The hidden Markov model (HMM) is a generative model that treats sequential data under the assumption that each observation is conditioned on the state of a discrete hidden variable that evolves in time as a Markov chain. In this paper, we derive a novel algorithm to cluster HMMs through their probability distributions. We propose a hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a "cluster center", i.e., a novel HMM that is representative for the group. We present several empirical studies that illustrate the benefits of the proposed algorithm.
Finding an optimal key assignment (subject to given constraints) for a key predistribution scheme in wireless sensor networks is a difficult task. Hence, most of the practical schemes are based on probabilistic key assignment, which leads to sub-optimal schemes requiring key storage linear in the total number of nodes. A graph theoretic framework is introduced to study the fundamental tradeoffs between key storage, average key path length (directly related to the battery consumption) and resilience (to compromised nodes) of key predistribution schemes for wireless sensor networks. Based on the proposed framework, a lower bound on key storage is derived for a given average key path length. An upper bound on the compromising probability is also given. This framework also leads to the design of key assignment schemes with a storage complexity of the same order as the lower bound.
Apr 28 2009 cs.DS
Many data dissemination and publish-subscribe systems that guarantee the privacy and authenticity of the participants rely on symmetric key cryptography. An important problem in such a system is to maintain the shared group key as the group membership changes. We consider the problem of determining a key hierarchy that minimizes the average communication cost of an update, given update frequencies of the group members and an edge-weighted undirected graph that captures routing costs. We first present a polynomial-time approximation scheme for minimizing the average number of multicast messages needed for an update. We next show that when routing costs are considered, the problem is NP-hard even when the underlying routing network is a tree network or even when every group member has the same update frequency. Our main result is a polynomial time constant-factor approximation algorithm for the general case where the routing network is an arbitrary weighted graph and group members have nonuniform update frequencies.
The congestion control algorithm of TCP relies on correct feedback from the receiver to determine the rate at which packets should be sent into the network. Hence, correct receiver feedback (in the form of TCP acknowledgements) is essential to the goal of sharing the scarce bandwidth resources fairly and avoiding congestion collapse in the Internet. However, the assumption that a TCP receiver can always be trusted (to generate feedback correctly) no longer holds as there are plenty of incentives for a receiver to deviate from the protocol. In fact, it has been shown that a misbehaving receiver (whose aim is to bring about congestion collapse) can easily generate acknowledgements to conceal packet loss, so as to drive a number of honest, innocent senders arbitrarily fast to create a significant number of non-responsive packet flows, leading to denial of service to other Internet users. We give the first formal treatment to this problem. We also give an efficient, provably secure mechanism to force a receiver to generate feedback correctly; any incorrect acknowledgement will be detected at the sender and cheating TCP receivers would be identified. The idea is as follows: for each packet sent, the sender generates a tag using a secret key (known to himself only); the receiver could generate a proof using the packet and the tag alone, and send it to the sender; the sender can then verify the proof using the secret key; an incorrect proof would indicate a cheating receiver. The scheme is very efficient in the sense that the TCP sender does not need to store the packet or the tag, and the proofs for multiple packets can be aggregated at the receiver. The scheme is based on an aggregate authenticator. In addition, the proposed solution can be applied to network-layer rate-limiting architectures requiring correct feedback.
Any secured system can be modeled as a capability-based access control system in which each user is given a set of secret keys of the resources he is granted access to. In some large systems with resource-constrained devices, such as sensor networks and RFID systems, the design is sensitive to memory or key storage cost. With the goal of minimizing the maximum key storage per user, key compression based on key linking, that is, deriving one key from another without compromising security, is studied. A lower bound on key storage needed for a general access structure with key derivation is derived. This bound demonstrates the theoretic limit of any system that does not trade off security and can be treated as a negative result to provide ground for designs with security tradeoff. A concrete, provably secure key linking scheme based on pseudorandom functions is given. Using the key linking framework, a number of key pre-distribution schemes in the literature are analyzed.
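To illustrate the idea of key linking (deriving one key from another so that a user stores fewer keys), here is a minimal sketch. It uses HMAC-SHA256 as a stand-in pseudorandom function and a simple linear chain of keys; both choices, and the names `derive_key` and `derive_chain`, are assumptions for illustration and not the paper's concrete scheme.

```python
import hashlib
import hmac

def derive_key(parent_key: bytes, label: bytes) -> bytes:
    """Derive a child key from a parent key, using HMAC-SHA256 as the PRF."""
    return hmac.new(parent_key, label, hashlib.sha256).digest()

def derive_chain(root_key: bytes, labels):
    """Return [root, k1, k2, ...] where each key is derived from the previous.

    A user holding keys[i] can recompute every later key in the chain,
    so only one key needs to be stored for that whole suffix.
    """
    keys = [root_key]
    for label in labels:
        keys.append(derive_key(keys[-1], label))
    return keys
```

In this sketch, a user granted access to all resources along the chain stores only the root key, while a user holding a later key can derive only the keys after it.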
Dec 14 2007 cs.NI
Prior work indicates that 802.11 is extremely inefficient for VoIP transport. Only 12 and 60 VoIP sessions can be supported in an 802.11b and an 802.11g WLAN, respectively. This paper shows that the bad news does not stop there. When there are multiple WLANs in the vicinity of each other, the already-low VoIP capacity can be further eroded in a significant manner. For example, in a 5-by-5, 25-cell multi-WLAN network, the VoIP capacities for 802.11b and 802.11g are only 1.63 and 10.34 sessions per AP, respectively. This paper investigates several solutions to improve the VoIP capacity. Based on a conflict graph model, we propose a clique-analytical call-admission scheme, which increases the VoIP capacity by 52% and 37% in 802.11b and 802.11g respectively. If all the three orthogonal frequency channels available in 11b and 11g are used, the capacity can be nearly tripled by the call-admission scheme. This paper also proposes for the first time the use of coarse-grained time-division multiple access (CoTDMA) in conjunction with the basic 802.11 CSMA to eliminate the performance-degrading exposed-node and hidden-node problems. We find that CoTDMA can further increase the VoIP capacity in the multi-WLAN scenario by an additional 35%.
This paper investigates the many-to-one throughput capacity (and by symmetry, one-to-many throughput capacity) of IEEE 802.11 multi-hop networks. It has generally been assumed in prior studies that the many-to-one throughput capacity is upper-bounded by the link capacity L. Throughput capacity L is not achievable under 802.11. This paper introduces the notion of "canonical networks", which is a class of regularly-structured networks whose capacities can be analyzed more easily than unstructured networks. We show that the throughput capacity of canonical networks under 802.11 has an analytical upper bound of 3L/4 when the source nodes are two or more hops away from the sink; and simulated throughputs of 0.690L (0.740L) when the source nodes are many hops away. We conjecture that 3L/4 is also the upper bound for general networks. When all links have equal length, 2L/3 can be shown to be the upper bound for general networks. Our simulations show that 802.11 networks with random topologies operated with AODV routing can only achieve throughputs far below the upper bounds. Fortunately, by properly selecting routes near the gateway (or by properly positioning the relay nodes leading to the gateway) to mimic the structure of canonical networks, the throughput can be improved significantly, by more than 150%. Indeed, in a dense network, it is worthwhile to deactivate some of the relay nodes near the sink judiciously.
Grid computing consists of the coordinated use of large sets of diverse, geographically distributed resources for high performance computation. Effective monitoring of these computing resources is extremely important to allow efficient use on the Grid. The large number of heterogeneous computing entities available in Grids makes the task challenging. In this work, we describe a Grid monitoring system, called GridMonitor, that captures and makes available the most important information from a large computing facility. The Grid monitoring system consists of four tiers: local monitoring, archiving, publishing and harnessing. This architecture was applied on a large-scale Linux farm and network infrastructure. It can be used by many higher-level Grid services including scheduling services and resource brokering.