Dec 06 2017 cs.CV
Deep learning has achieved substantial success in a series of tasks in computer vision. Intelligent video analysis, which can be broadly applied to video surveillance in various smart city applications, can also be driven by such powerful deep learning engines. To practically facilitate deep neural network models in the large-scale video analysis, there are still unprecedented challenges for the large-scale video data management. Deep feature coding, instead of video coding, provides a practical solution for handling the large-scale video surveillance data. To enable interoperability in the context of deep feature coding, standardization is urgent and important. However, due to the explosion of deep learning algorithms and the particularity of feature coding, there are numerous remaining problems in the standardization process. This paper envisions the future deep feature coding standard for the AI oriented large-scale video management, and discusses existing techniques, standards and possible solutions for these open problems.
Apr 21 2017 cs.CV
Despite the promising progress made in recent years, person re-identification (re-ID) remains a challenging task due to the complex variations in human appearances from different camera views. For this challenging problem, a large variety of algorithms have been developed in the fully-supervised setting, requiring access to a large amount of labeled training data. However, the main bottleneck for fully-supervised re-ID is the limited availability of labeled training samples. To address this problem, in this paper, we propose a self-trained subspace learning paradigm for person re-ID which effectively utilizes both labeled and unlabeled data to learn a discriminative subspace where person images across disjoint camera views can be easily matched. The proposed approach first constructs pseudo pairwise relationships among unlabeled persons using the k-nearest neighbors algorithm. Then, with the pseudo pairwise relationships, the unlabeled samples can be easily combined with the labeled samples to learn a discriminative projection by solving an eigenvalue problem. In addition, we refine the pseudo pairwise relationships iteratively, which further improves the learning performance. A multi-kernel embedding strategy is also incorporated into the proposed approach to cope with the non-linearity in person's appearance and explore the complementation of multiple kernels. In this way, the performance of person re-ID can be greatly enhanced when training data are insufficient. Experimental results on six widely-used datasets demonstrate the effectiveness of our approach and its performance can be comparable to the reported results of most state-of-the-art fully-supervised methods while using much fewer labeled data.
Social status refers to the relative position within the society. It is an important notion in sociology and related research. The problem of measuring social status has been studied for many years. Various indicators are proposed to assess social status of individuals, including educational attainment, occupation, and income/wealth. However, these indicators are sometimes difficult to collect or measure. We investigate social networks for alternative measures of social status. Online activities expose certain traits of users in the real world. We are interested in how these activities are related to social status, and how social status can be predicted with social network data. To the best of our knowledge, this is the first study on connecting online activities with social status in reality. In particular, we focus on the network structure of microblogs in this study. A user following another implies some kind of status. We cast the predicted social status of users to the "status" of real-world entities, e.g., universities, occupations, and regions, so that we can compare and validate predicted results with facts in the real world. We propose an efficient algorithm for this task and evaluate it on a dataset consisting of 3.4 million users from Sina Weibo. The result shows that it is possible to predict social status with reasonable accuracy using social network data. We also point out challenges and limitations of this approach, e.g., inconsistence between online popularity and real-world status for certain users. Our findings provide insights on analyzing online social status and future designs of ranking schemes for social networks.
We present in this paper a systematic study on how to morph a well-trained neural network to a new one so that its network function can be completely preserved. We define this as \emphnetwork morphism in this research. After morphing a parent network, the child network is expected to inherit the knowledge from its parent network and also has the potential to continue growing into a more powerful one with much shortened training time. The first requirement for this network morphism is its ability to handle diverse morphing types of networks, including changes of depth, width, kernel size, and even subnet. To meet this requirement, we first introduce the network morphism equations, and then develop novel morphing algorithms for all these morphing types for both classic and convolutional neural networks. The second requirement for this network morphism is its ability to deal with non-linearity in a network. We propose a family of parametric-activation functions to facilitate the morphing of any continuous non-linear activation neurons. Experimental results on benchmark datasets and typical neural networks demonstrate the effectiveness of the proposed network morphism scheme.
Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNN), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while the latter is to create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and visual content. Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural networks for learning powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. The experiments on YouTube2Text dataset show that our proposed LSTM-E achieves to-date the best reported performance in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior in predicting Subject-Verb-Object (SVO) triplets to several state-of-the-art techniques.
Dec 23 2014 cs.CV
Convolutional Neural Networks (CNNs) have achieved comparable error rates to well-trained human on ILSVRC2014 image classification task. To achieve better performance, the complexity of CNNs is continually increasing with deeper and bigger architectures. Though CNNs achieved promising external classification behavior, understanding of their internal work mechanism is still limited. In this work, we attempt to understand the internal work mechanism of CNNs by probing the internal representations in two comprehensive aspects, i.e., visualizing patches in the representation spaces constructed by different layers, and visualizing visual information kept in each layer. We further compare CNNs with different depths and show the advantages brought by deeper architecture.
The identification of urban mobility patterns is very important for predicting and controlling spatial events. In this study, we analyzed millions of geographical check-ins crawled from a leading Chinese location-based social networking service (Jiepang.com), which contains demographic information that facilitates group-specific studies. We determined the distinct mobility patterns of natives and non-natives in all five large cities that we considered. We used a mixed method to assign different algorithms to natives and non-natives, which greatly improved the accuracy of location prediction compared with the basic algorithms. We also propose so-called indigenization coefficients to quantify the extent to which an individual behaves like a native, which depends only on their check-in behavior, rather than requiring demographic information. Surprisingly, the hybrid algorithm weighted using the indigenization coefficients outperformed a mixed algorithm that used additional demographic information, suggesting the advantage of behavioral data in characterizing individual mobility compared with the demographic information. The present location prediction algorithms can find applications in urban planning, traffic forecasting, mobile recommendation, and so on.
Driven by green communications, energy efficiency (EE) has become a new important criterion for designing wireless communication systems. However, high EE often leads to low spectral efficiency (SE), which spurs the research on EE-SE tradeoff. In this paper, we focus on how to maximize the utility in physical layer for an uplink multi-user multiple-input multipleoutput (MU-MIMO) system, where we will not only consider EE-SE tradeoff in a unified way, but also ensure user fairness. We first formulate the utility maximization problem, but it turns out to be non-convex. By exploiting the structure of this problem, we find a convexization procedure to convert the original nonconvex problem into an equivalent convex problem, which has the same global optimum with the original problem. Following the convexization procedure, we present a centralized algorithm to solve the utility maximization problem, but it requires the global information of all users. Thus we propose a primal-dual distributed algorithm which does not need global information and just consumes a small amount of overhead. Furthermore, we have proved that the distributed algorithm can converge to the global optimum. Finally, the numerical results show that our approach can both capture user diversity for EE-SE tradeoff and ensure user fairness, and they also validate the effectiveness of our primal-dual distributed algorithm.
In a complex network, different groups of nodes may have existed for different amounts of time. To detect the evolutionary history of a network is of great importance. We present a general method based on spectral analysis to address this fundamental question in network science. In particular, we argue and demonstrate, using model and real-world networks, the existence of positive correlation between the magnitudes of eigenvalues and node ages. In situations where the network topology is unknown but short time series measured from nodes are available, we suggest to uncover the network topology at the present (or any given time of interest) by using compressive sensing and then perform the spectral analysis. Knowledge of ages of various groups of nodes can provide significant insights into the evolutionary process underpinning the network.
Visual reranking is effective to improve the performance of the text-based video search. However, existing reranking algorithms can only achieve limited improvement because of the well-known semantic gap between low level visual features and high level semantic concepts. In this paper, we adopt interactive video search reranking to bridge the semantic gap by introducing user's labeling effort. We propose a novel dimension reduction tool, termed sparse transfer learning (STL), to effectively and efficiently encode user's labeling information. STL is particularly designed for interactive video search reranking. Technically, it a) considers the pair-wise discriminative information to maximally separate labeled query relevant samples from labeled query irrelevant ones, b) achieves a sparse representation for the subspace to encodes user's intention by applying the elastic net penalty, and c) propagates user's labeling information from labeled samples to unlabeled samples by using the data distribution knowledge. We conducted extensive experiments on the TRECVID 2005, 2006 and 2007 benchmark datasets and compared STL with popular dimension reduction algorithms. We report superior performance by using the proposed STL based interactive video search reranking.