results for au:Tang_X in:cs

- Jun 12 2017 cs.CV arXiv:1706.02863v1 In this paper, we share our experience in designing a convolutional network-based face detector that could handle faces of an extremely wide range of scales. We show that faces with different scales can be modeled through a specialized set of deep convolutional networks with different structures. These detectors can be seamlessly integrated into a single unified network that can be trained end-to-end. In contrast to existing deep models that are designed for wide scale range, our network does not require an image pyramid input and the model is of modest complexity. Our network, dubbed ScaleFace, achieves promising performance on WIDER FACE and FDDB datasets with practical runtime speed. Specifically, our method achieves 76.4 average precision on the challenging WIDER FACE dataset and 96% recall rate on the FDDB dataset with 7 frames per second (fps) for 900 * 1300 input image.
- May 31 2017 cs.SI arXiv:1705.10442v1 Influence Maximization is an extensively studied problem that aims to select a set of initial seed nodes in Online Social Networks (OSNs) to spread the influence as widely as possible. However, it remains an open challenge to design fast and accurate algorithms to find solutions in large-scale OSNs. Prior Monte-Carlo-simulation-based methods are slow and not scalable, while other heuristic algorithms do not have any theoretical guarantee and have been shown to produce poor solutions in quite a few cases. In this paper, we propose hop-based algorithms that can easily scale to millions of nodes and billions of edges. Unlike previous heuristics, our proposed hop-based approaches can provide certain theoretical guarantees. Experimental evaluations with real OSN datasets demonstrate the efficiency and effectiveness of our algorithms.
- Sequences with low auto-correlation property have been applied in code-division multiple access communication systems, radar and cryptography. Using the inverse Gray mapping, a quaternary sequence of even length $N$ can be obtained from two binary sequences of the same length, which are called component sequences. In this paper, using the interleaving method, we present several classes of component sequences from twin-prime sequence pairs or GMW sequence pairs given by Tang and Ding in 2010, or from two, three or four binary sequences defined by cyclotomic classes of order $4$. Hence we can obtain new classes of quaternary sequences, which are different from known ones, since known component sequences are constructed from a pair of binary sequences with optimal auto-correlation or Sidel'nikov sequences.
- May 09 2017 cs.CV arXiv:1705.02953v1 Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of the TSN framework given limited training samples. Our approach obtains state-of-the-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
- Apr 25 2017 cs.CV arXiv:1704.06904v1 In this work, we propose "Residual Attention Network", a convolutional neural network using an attention mechanism which can be incorporated into state-of-the-art feed-forward network architectures in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers go deeper. Inside each Attention Module, a bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs compared to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels.
- Apr 21 2017 cs.CV arXiv:1704.06228v1 Detecting activities in untrimmed videos is an important yet challenging task. In this paper, we tackle the difficulties of effectively locating the start and the end of a long complex action, which are often met by existing methods. Our key contribution is the structured segment network, a novel framework for temporal action detection, which models the temporal structure of each activity instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model, which comprises two classifiers, respectively for classifying activities and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. We also propose a simple yet effective temporal action proposal scheme that can generate proposals of considerably higher quality. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms existing state-of-the-art methods by over 10% absolute average mAP, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
- Recently, the finite state Markov channel (FSMC) with an additional eavesdropper and delayed feedback from the legitimate receiver to the transmitter has been shown to be a useful model for the physical layer security of the practical mobile wireless communication systems. In this paper, we extend this model to a multiple-access situation (up-link of the wireless communication systems), which we call the finite state multiple-access wiretap channel (FS-MAC-WT) with delayed feedback. To be specific, the FS-MAC-WT is a channel with two inputs (transmitters) and two outputs (a legitimate receiver and an eavesdropper). The channel depends on a state which undergoes a Markov process, and the state is entirely known by the legitimate receiver and the eavesdropper. The legitimate receiver intends to send his channel output and the perfectly known state back to the transmitters through noiseless feedback channels after some time delay. The main contribution of this paper is to provide inner and outer bounds on the secrecy capacity regions of the FS-MAC-WT with delayed state feedback, and with or without delayed legitimate receiver's channel output feedback. The capacity results are further explained via a degraded Gaussian fading example, and from this example we see that sending the legitimate receiver's channel output back to the transmitters helps to enhance the achievable secrecy rate region of the FS-MAC-WT with only delayed state feedback.
- We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes the deeper stages focus on a few hard regions. Such adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of the deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and Cityscapes datasets, achieving state-of-the-art performance and fast speed.
- In this paper, an improved receiver is proposed to improve the bit error rate (BER) performance in layered asymmetrically clipped optical fast orthogonal frequency division multiplexing (ACO-FOFDM) for intensity-modulated and direct-detected (IM/DD) optical transmission systems. Layered ACO-FOFDM can compensate for the low spectral efficiency of traditional ACO-FOFDM, and using the discrete cosine transform (DCT) in the FOFDM system instead of the fast Fourier transform (FFT) in the OFDM system can reduce the computational complexity without any influence on BER performance. The BER performances of the layered ACO-FOFDM system with the improved receiver and the DC-offset FOFDM (DCO-FOFDM) system with optimal DC-bias are compared at the same spectral efficiency. Simulation results show that under different optical bit energy to noise power ratios, the layered ACO-FOFDM system with the improved receiver has 2.86 dB, 5.26 dB and 5.72 dB BER performance advantages at the forward error correction (FEC) limit over the DCO-FOFDM system when the spectral efficiencies are 1 bit/s/Hz, 2 bits/s/Hz and 3 bits/s/Hz, respectively. Therefore, the proposed scheme is low-cost through the use of DCT, and is suitable for application in adaptive IM/DD systems with zero DC-bias.
- Mar 09 2017 cs.CV arXiv:1703.02716v1 Detecting activities in untrimmed videos is an important but challenging task. The performance of existing methods remains unsatisfactory, e.g., they often meet difficulties in locating the beginning and end of a long complex action. In this paper, we propose a generic framework that can accurately detect a wide variety of activities from untrimmed videos. Our first contribution is a novel proposal scheme that can efficiently generate candidates with accurate temporal boundaries. The other contribution is a cascaded classification pipeline that explicitly distinguishes between relevance and completeness of a candidate instance. On two challenging temporal activity detection datasets, THUMOS14 and ActivityNet, the proposed framework significantly outperforms the existing state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
- The coded caching scheme, which is an effective technique to reduce the load during peak traffic times, has recently become quite popular among the coding community. A placement delivery array (PDA for short) can be used to design a coded caching scheme. The number of rows of a PDA corresponds to the subpacketization level in the coded caching scheme. Thus, it is meaningful to construct optimal PDAs with the minimum number of rows. However, no one has yet proved that one of the previously known PDAs (or optimal PDAs) has the minimum number of rows. We mainly focus on such optimal PDAs in this paper. We first prove that one class of the optimal PDAs by Maddah-Ali and Niesen has the minimum number of rows. Next, two other classes of optimal PDAs with the minimum number of rows are obtained and proved from a new characterization of a PDA by means of a set of $3$ dimensional vectors and a newly derived lower bound.
- Feb 24 2017 cs.CV arXiv:1702.07191v2 As the intermediate level task connecting image captioning and object detection, visual relationship detection started to catch researchers' attention because of its descriptive power and clear structure. It detects the objects and captures their pair-wise interactions with a subject-predicate-object triplet, e.g. person-ride-horse. In this paper, each visual relationship is considered as a phrase with three components. We formulate visual relationship detection as three inter-connected recognition problems and propose a Visual Phrase guided Convolutional Neural Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a Phrase-guided Message Passing Structure (PMPS) to establish the connection among relationship components and help the model consider the three problems jointly. A corresponding non-maximum suppression method and model training strategy are also proposed. Experimental results show that our ViP-CNN outperforms the state-of-the-art method both in speed and accuracy. We further pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which is found to perform better than pretraining on ImageNet for this task.
- We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.
- Jan 31 2017 cs.CV arXiv:1701.08393v1 We propose a deep convolutional neural network (CNN) for face detection leveraging facial-attribute-based supervision. We observe a phenomenon that part detectors emerge within a CNN trained to classify attributes from uncropped face images, without any explicit part supervision. The observation motivates a new method for finding faces through scoring facial part responses by their spatial structure and arrangement. The scoring mechanism is data-driven, and carefully formulated considering challenging cases where faces are only partially visible. This consideration allows our network to detect faces under severe occlusion and unconstrained pose variations. Our method achieves promising performance on popular benchmarks including FDDB, PASCAL Faces, AFW, and WIDER FACE.
- A faster-than-Nyquist (FTN) signal can achieve higher spectral efficiency and capacity than a Nyquist signal. For the Nyquist signal, the capacity limit was shown in the pioneering work of Shannon. However, unlike the Nyquist signal, the FTN signal has a smaller pulse interval or narrower subcarrier spacing. What is the capacity limit of the FTN signal? In this paper, to the best of our knowledge, we are the first to give the mathematical expression for the capacity limit of the FTN non-orthogonal frequency-division multiplexing (NOFDM) signal, which can also be applied to the FTN non-orthogonal time-division multiplexing signal. The mathematical expression shows that the capacity limit for the FTN signal is higher than the Shannon limit for the Nyquist signal. Meanwhile, we demonstrate the principle of FTN NOFDM by taking fractional cosine transform-based NOFDM (FrCT-NOFDM) as an example. As far as we know, FrCT-NOFDM is first proposed in this paper. Simulations and experiments verify the feasibility of FrCT-NOFDM. When the bandwidth compression factor alpha is set to 0.8, the subcarrier spacing is equal to 40% of the symbol rate per subcarrier. The transmission rate is about 25% faster than the Nyquist rate and the capacity limit is 25% higher than the Shannon limit.
- The coded caching scheme is a promising technique to mitigate the network burden in peak hours, which attains more prominent gains than uncoded caching. Coded caching schemes can be classified into two types, namely, the centralized and the decentralized scheme, according to whether the placement procedures are carefully designed or operated at random. However, most of the previous analysis assumes that the connected links between server and users are error-free. In this paper, we explore the coded caching based delivery design in wireless networks, where all the connected wireless links are different. For both centralized and decentralized cases, we propose two delivery schemes, namely, the orthogonal delivery scheme and the concurrent delivery scheme. We focus on the transmission time slots spent on satisfying the system requests, and prove that for both the centralized and the decentralized cases, the concurrent delivery scheme always outperforms the orthogonal delivery scheme. Furthermore, for the orthogonal delivery scheme, we derive the gap in terms of transmission time between the decentralized and centralized case, which is essentially no more than 1.5.
- In wireless networks, coded caching is an effective technique to reduce network congestion during peak traffic times. Recently, a new concept called placement delivery array (PDA) was proposed to characterize the coded caching scheme. So far, only one class of PDAs by Maddah-Ali and Niesen is known to be optimal. In this paper, we mainly focus on constructing optimal PDAs. Firstly, we derive some lower bounds. Next, we present several infinite classes of PDAs, which are shown to be optimal with respect to the new bounds.
- Existing deep embedding methods in vision tasks are capable of learning a compact Euclidean space from images, where Euclidean distances correspond to a similarity metric. To make learning more effective and efficient, hard sample mining is usually employed, with samples identified through computing the Euclidean feature distance. However, the global Euclidean distance cannot faithfully characterize the true feature similarity in a complex visual feature space, where the intraclass distance in a high-density region may be larger than the interclass distance in low-density regions. In this paper, we introduce a Position-Dependent Deep Metric (PDDM) unit, which is capable of learning a similarity metric adaptive to local feature structure. The metric can be used to select genuinely hard samples in a local neighborhood to guide the deep embedding learning in an online and robust manner. The new layer is appealing in that it is pluggable to any convolutional networks and is trained end-to-end. Our local similarity-aware feature embedding not only demonstrates faster convergence and boosted performance on two complex image retrieval datasets, its large margin nature also leads to superior generalization results under the large and open set scenarios of transfer learning and zero-shot learning on ImageNet 2010 and ImageNet-10K datasets.
- The integer-forcing (IF) linear multiple-input and multiple-output (MIMO) receiver is a recently proposed suboptimal receiver which nearly reaches the performance of the optimal maximum likelihood receiver for the entire signal-to-noise ratio (SNR) range. The optimal integer coefficient matrix $\mathbf{A}^\star\in \mathbb{Z}^{N_t\times N_t}$ for IF maximizes the total achievable rate, where $N_t$ is the column dimension of the channel matrix. To obtain $\mathbf{A}^\star$, a successive minima problem (SMP) on an $N_t$-dimensional lattice that is suspected to be NP-hard needs to be solved. In this paper, an efficient exact algorithm for the SMP is proposed. For efficiency, our algorithm first uses the LLL reduction to reduce the SMP. Then, different from existing SMP algorithms which form the transformed $\mathbf{A}^\star$ column by column in $N_t$ iterations, it first initializes with a suboptimal matrix. The suboptimal matrix is then updated, by utilizing the integer vectors obtained by employing an improved Schnorr-Euchner search algorithm to search the candidate integer vectors within a certain hyper-ellipsoid, via a novel and efficient algorithm until the transformed $\mathbf{A}^{\star}$ is obtained in only one iteration. Finally, the algorithm returns the matrix obtained by left multiplying the solution of the reduced SMP with the unimodular matrix that is generated by the LLL reduction. We then rigorously prove the optimality of the proposed algorithm by showing that it exactly solves the SMP. Furthermore, we develop a theoretical complexity analysis to show that the complexity of the new algorithm in big-O notation is an order of magnitude smaller, with respect to $N_t$, than that of the existing most efficient algorithm. Finally, simulation results are presented to illustrate the optimality and efficiency of our novel algorithm.
- Oct 05 2016 cs.CV arXiv:1610.00838v2 This survey aims at reviewing recent computer vision techniques used in the assessment of image aesthetic quality. Image aesthetic assessment aims at computationally distinguishing high-quality photos from low-quality ones based on photographic rules, typically in the form of binary classification or quality scoring. A variety of approaches has been proposed in the literature trying to solve this challenging problem. In this survey, we present a systematic listing of the reviewed approaches based on visual feature types (hand-crafted features and deep features) and evaluation criteria (dataset characteristics and evaluation metrics). Main contributions and novelties of the reviewed approaches are highlighted and discussed. In addition, following the emergence of deep learning techniques, we systematically evaluate recent deep learning settings that are useful for developing a robust deep model for aesthetic scoring. Experiments are conducted using simple yet solid baselines that are competitive with the current state of the art. Moreover, we discuss the possibility of manipulating the aesthetics of images through computational approaches. We hope that our survey could serve as a comprehensive reference source for future research on the study of image aesthetic assessment.
- Sep 22 2016 cs.CV arXiv:1609.06426v2 Interpersonal relation defines the association, e.g., warmth, friendliness, and dominance, between two or more people. Motivated by psychological studies, we investigate if such fine-grained and high-level relation traits can be characterized and quantified from face images in the wild. We address this challenging problem by first studying a deep network architecture for robust recognition of facial expressions. Unlike existing models that typically learn from facial expression labels alone, we devise an effective multitask network that is capable of learning from rich auxiliary attributes such as gender, age, and head pose, beyond just facial expression data. While conventional supervised training requires datasets with complete labels (e.g., all samples must be labeled with gender, age, and expression), we show that this requirement can be relaxed via a novel attribute propagation method. The approach further allows us to leverage the inherent correspondences between heterogeneous attribute sources despite the disparate distributions of different datasets. With the network we demonstrate state-of-the-art results on existing facial expression recognition benchmarks. To predict interpersonal relation, we use the expression recognition network as branches for a Siamese model. Extensive experiments show that our model is capable of mining mutual context of faces for accurate fine-grained interpersonal prediction.
- The technique of coded caching proposed by Maddah-Ali and Niesen is a promising approach to alleviate the load of networks during busy times. Recently, the placement delivery array (PDA) was presented to characterize both the placement and delivery phase in a single array for the centralized coded caching algorithm. In this paper, we interpret PDA from a new perspective, i.e., the strong edge coloring of a bipartite graph. We prove that a PDA is equivalent to a strong edge colored bipartite graph. Thus, we can construct a class of PDAs from existing structures in bipartite graphs. The class includes the scheme proposed by Maddah-Ali et al. and a more general class of PDAs proposed by Shangguan et al. as special cases. Moreover, it is capable of generating a lot of PDAs with flexible tradeoff between the sub-packet level and load.
- Markov Random Fields (MRFs), a formulation widely used in generative image modeling, have long been plagued by the lack of expressive power. This issue is primarily due to the fact that conventional MRF formulations tend to use simplistic factors to capture local patterns. In this paper, we move beyond such limitations, and propose a novel MRF model that uses fully-connected neurons to express the complex interactions among pixels. Through theoretical analysis, we reveal an inherent connection between this model and recurrent neural networks, and thereon derive an approximated feed-forward network that couples multiple RNNs along opposite directions. This formulation combines the expressive power of deep neural networks and the cyclic dependency structure of MRF in a unified model, bringing the modeling capability to a new level. The feed-forward approximation also allows it to be efficiently learned from data. Experimental results on a variety of low-level vision tasks show notable improvement over the state of the art.
- Aug 11 2016 cs.CV arXiv:1608.03049v1 Visual fashion analysis has attracted much attention in recent years. Previous work represented clothing regions by either bounding boxes or human joints. This work presents fashion landmark detection or fashion alignment, which is to predict the positions of functional key points defined on the fashion items, such as the corners of neckline, hemline, and cuff. To encourage future studies, we introduce a fashion landmark dataset with over 120K images, where each image is labeled with eight landmarks. With this dataset, we study fashion alignment by cascading multiple convolutional neural networks in three stages. These stages gradually improve the accuracies of landmark predictions. Extensive experiments demonstrate the effectiveness of the proposed method, as well as its generalization ability to pose estimation. Fashion landmark is also compared to clothing bounding boxes and human joints in two applications, fashion attribute prediction and clothes retrieval, showing that fashion landmark is a more discriminative representation to understand fashion images.
- Aug 10 2016 cs.CV arXiv:1608.02778v1 Lossy compression introduces complex compression artifacts, particularly blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restore sharpened images that are accompanied by ringing effects. Inspired by the success of deep convolutional networks (DCN) on super-resolution, we formulate a compact and efficient network for seamless attenuation of different compression artifacts. To meet the speed requirement of real-world applications, we further accelerate the proposed baseline model by layer decomposition and joint use of large-stride convolutional and deconvolutional layers. This also leads to a more general CNN framework that has a close relationship with the conventional Multi-Layer Perceptron (MLP). Finally, the modified network achieves a speed up of 7.5 times with almost no performance loss compared to the baseline model. We also demonstrate that a deeper model can be effectively trained with features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate three practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance to the state-of-the-art methods both on benchmark datasets and a real-world use case.
- Aug 03 2016 cs.CV arXiv:1608.00859v1 Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
- Aug 03 2016 cs.CV arXiv:1608.00797v1 This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architectures, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP 93.23%) on the testing set and secured the first place in the challenge.
- Aug 02 2016 cs.CV arXiv:1608.00367v1 As a successful deep model applied in image super-resolution (SR), the Super-Resolution Convolutional Neural Network (SRCNN) has demonstrated superior performance to the previous hand-crafted models in both speed and restoration quality. However, the high computational cost still hinders it from practical usage that demands real-time performance (24 fps). In this paper, we aim at accelerating the current SRCNN, and propose a compact hourglass-shape CNN structure for faster and better SR. We re-design the SRCNN structure mainly in three aspects. First, we introduce a deconvolution layer at the end of the network, so that the mapping is learned directly from the original low-resolution image (without interpolation) to the high-resolution one. Second, we reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. Third, we adopt smaller filter sizes but more mapping layers. The proposed model achieves a speed up of more than 40 times with even superior restoration quality. Further, we present the parameter settings that can achieve real-time performance on a generic CPU while still maintaining good performance. A corresponding transfer strategy is also proposed for fast training and testing across different upscaling factors.
- Jul 19 2016 cs.CV arXiv:1607.05046v1 We present a novel framework for hallucinating faces of unconstrained poses and with very low resolution (face size as small as 5px IOD). In contrast to existing studies that mostly ignore or assume pre-aligned face spatial configuration (e.g. facial landmarks localization or dense correspondence field), we alternately optimize two complementary tasks, namely face hallucination and dense correspondence field estimation, in a unified framework. In addition, we propose a new gated deep bi-network that contains two functionality-specialized branches to recover different levels of texture details. Extensive experiments demonstrate that such a formulation allows exceptional hallucination quality on in-the-wild low-res faces with significant pose and illumination variations.
- Locally repairable codes are desirable for distributed storage systems to improve the repair efficiency. In this paper, we first build a bridge between locally repairable codes and packings. As an application of this bridge, some optimal locally repairable codes with flexible parameters can be obtained from packings.
- Semantic segmentation tasks can be well modeled by Markov Random Field (MRF). This paper addresses semantic segmentation by incorporating high-order relations and mixture of label contexts into MRF. Unlike previous works that optimized MRFs using iterative algorithms, we solve MRF by proposing a Convolutional Neural Network (CNN), namely Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN to model unary terms, and additional layers are devised to approximate the mean field (MF) algorithm for pairwise terms. It has several appealing properties. First, different from the recent works that required many iterations of MF during back-propagation, DPN is able to achieve high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing models its special cases. Furthermore, pairwise terms in DPN provide a unified framework to encode rich contextual information in high-dimensional data, such as images and videos. Third, DPN makes MF easier to parallelize and speed up, thus enabling efficient inference. DPN is thoroughly evaluated on standard semantic image/video segmentation benchmarks, where a single DPN model yields state-of-the-art segmentation accuracies on the PASCAL VOC 2012, Cityscapes and CamVid datasets.
- Let $\mathbb{F}_{p^{m}}$ be a finite field with cardinality $p^{m}$ and $R=\mathbb{F}_{p^{m}}+u\mathbb{F}_{p^{m}}$ with $u^{2}=0$. We aim to determine all $\alpha+u\beta$-constacyclic codes of length $np^{s}$ over $R$, where $\alpha,\beta\in\mathbb{F}_{p^{m}}^{*}$, $n, s\in\mathbb{N}_{+}$ and $\gcd(n,p)=1$. Let $\alpha_{0}\in\mathbb{F}_{p^{m}}^{*}$ and $\alpha_{0}^{p^{s}}=\alpha$. The residue ring $R[x]/\langle x^{np^{s}}-\alpha-u\beta\rangle$ is a chain ring with the maximal ideal $\langle x^{n}-\alpha_{0}\rangle$ in the case that $x^{n}-\alpha_{0}$ is irreducible in $\mathbb{F}_{p^{m}}[x]$. If $x^{n}-\alpha_{0}$ is reducible in $\mathbb{F}_{p^{m}}[x]$, we give the explicit expressions of the ideals of $R[x]/\langle x^{np^{s}}-\alpha-u\beta\rangle$. Besides, the number of codewords and the dual code of every $\alpha+u\beta$-constacyclic code are provided.
- Caching is a promising solution to satisfy the ongoing explosive demands for multimedia traffic. Recently, Maddah-Ali and Niesen proposed both centralized and decentralized coded caching schemes, which are able to attain significant performance gains over the uncoded caching schemes. Their work indicates that there exists a performance gap between the decentralized coded caching scheme and the centralized coded caching scheme. In this paper, we focus on the gap between the performances of the decentralized and centralized coded caching schemes. Most notably, we strictly prove that the multiplicative gap (i.e., the ratio of their performances) is between 1 and 1.5. The upper bound tightens the original one of 12 by Maddah-Ali and Niesen, while the lower bound verifies the intuition that the centralized coded caching scheme always outperforms its decentralized counterpart. In particular, both bounds are achievable in some cases. Furthermore, we prove that the gap can be arbitrarily close to 1 if the number of users is large enough, which suggests the great potential in practical applications of the less optimal but more practical decentralized coded caching scheme.
- In this paper, we investigate some sufficient conditions based on the block restricted isometry property (block-RIP) for exact (when $\mathbf{v}=\mathbf{0}$) and stable (when $\mathbf{v}\neq\mathbf{0}$) recovery of block sparse signals $\mathbf{x}$ from measurements $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{v}$, where $\mathbf{v}$ is an $\ell_2$ bounded noise vector (i.e., $\|\mathbf{v}\|_2\leq \epsilon$ for some constant $\epsilon$). First, on the one hand, we show that if $\mathbf{A}$ satisfies the block-RIP with constant $\delta_{K+1}<1/\sqrt{K+1}$, then every $K$-block sparse signal $\mathbf{x}$ can be exactly or stably recovered by BOMP in $K$ iterations; on the other hand, for any $K\geq 1$ and $1/\sqrt{K+1}\leq t<1$, there exists a matrix $\mathbf{A}$ satisfying the block-RIP with $\delta_{K+1}=t$ and a $K$-block sparse signal $\mathbf{x}$ such that the BOMP algorithm may fail to recover $\mathbf{x}$ in $K$ iterations. Second, we study some sufficient conditions for recovering $\alpha$-strongly-decaying $K$-block sparse signals. Surprisingly, it is shown that if $\mathbf{A}$ satisfies the block-RIP with $\delta_{K+1}<\sqrt{2}/2$, every $\alpha$-strongly-decaying $K$-block sparse signal can be exactly or stably recovered by the BOMP algorithm in $K$ iterations, under some conditions on $\alpha$. Our newly found sufficient condition on the block-RIP of $\mathbf{A}$ is weaker than that for $\ell_1$ minimization for this special class of sparse signals, which further supports the effectiveness of BOMP. Furthermore, for any $K\geq 1$, $\alpha>1$ and $\sqrt{2}/2\leq t<1$, the recovery of $\mathbf{x}$ may fail in $K$ iterations for a sensing matrix $\mathbf{A}$ which satisfies the block-RIP with $\delta_{K+1}=t$. Finally, we study some sufficient conditions for partial recovery of block sparse signals. Specifically, if $\mathbf{A}$ satisfies the block-RIP with $\delta_{K+1}<\sqrt{2}/2$, then BOMP is guaranteed to recover some blocks of $\mathbf{x}$ if these blocks satisfy a sufficient condition. We further show that this condition is sharp.
- We consider the problem of constructing exact-repair minimum storage regenerating (MSR) codes, for which both the systematic nodes and parity nodes can be repaired optimally. Although there exist several recent explicit high-rate MSR code constructions (usually with certain restrictions on the coding parameters), quite a few constructions in the literature only allow the optimal repair of systematic nodes. This phenomenon suggests that there might be a barrier between explicitly constructing codes that can only optimally repair systematic nodes and those that can optimally repair both systematic nodes and parity nodes. In this work, we show that this barrier can be completely circumvented by providing a generic transformation that is able to convert any non-binary linear maximum distance separable (MDS) storage codes that can optimally repair only systematic nodes into new MSR codes that can optimally repair all nodes. This transformation does not increase the alphabet size of the original codes, and only increases the sub-packetization by a factor that is equal to the number of parity nodes. Furthermore, the resultant MSR codes also have the optimal access property for all nodes if the original MDS storage codes have the optimal access property for systematic nodes.
- Apr 26 2016 cs.CV arXiv:1604.07279v1 Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB verify the effectiveness of H-FCN on actionness estimation and demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps to action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.
- A novel compressive-sensing based signal multiplexing scheme is proposed in this paper to further improve the multiplexing gain for multiple input multiple output (MIMO) systems. At the transmitter side, a Gaussian random measurement matrix in compressive sensing is employed before the traditional spatial multiplexing in order to carry more data streams on the available spatial multiplexing streams of the underlying MIMO system. At the receiver side, it is proposed to reformulate the detection of the multiplexing signal into two steps. In the first step, the traditional MIMO equalization can be used to restore the transmitted spatial multiplexing signal of the MIMO system. In the second step, the standard optimization based detection algorithm assumed in the compressive sensing framework is utilized to restore the CS multiplexing data streams, wherein the exhaustive over-complete dictionary is used to guarantee the sparse representation of the CS multiplexing signal. In order to avoid excessive complexity, a sub-block based dictionary and a sub-block based CS restoration are proposed. Finally, simulation results are presented to show the feasibility of the proposed CS based enhanced MIMO multiplexing scheme. Our efforts in this paper shed some light on the great potential in further improving the spatial multiplexing gain for MIMO systems.
- Mar 15 2016 cs.CV arXiv:1603.04015v3 In this paper, a discriminative two-phase dictionary learning framework is proposed for classifying human actions by sparse shape representations, in which the first-phase dictionary is learned on the selected discriminative frames and the second-phase dictionary is built for recognition using reconstruction errors of the first-phase dictionary as input features. We propose a "zeroth class" trick for detecting undiscriminating frames of the test video and eliminating them before voting on the action categories. Experimental results on benchmarks demonstrate the effectiveness of our method.
- Generalized orthogonal matching pursuit (gOMP), also called orthogonal multi-matching pursuit, is an extension of OMP in the sense that $N\geq1$ indices are identified per iteration. In this paper, we show that if the restricted isometry constant (RIC) $\delta_{NK+1}$ of a sensing matrix $\mathbf{A}$ satisfies $\delta_{NK+1} < 1/\sqrt {K/N+1}$, then under a condition on the signal-to-noise ratio, gOMP identifies at least one index in the support of any $K$-sparse signal $\mathbf{x}$ from $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{v}$ at each iteration, where $\mathbf{v}$ is a noise vector. Surprisingly, this condition does not require $N\leq K$, which is needed in Wang \textit{et al.} 2012 and Liu \textit{et al.} 2012. Thus, $N$ can have more choices. When $N=1$, it reduces to a sufficient condition for OMP, which is less restrictive than that proposed in Wang 2015. Moreover, in the noise-free case, it is a sufficient condition for accurately recovering $\mathbf{x}$ in $K$ iterations which is less restrictive than the best known one. In particular, it reduces to the sharp condition proposed in Mo 2015 when $N=1$.
- Feb 04 2016 cs.CV arXiv:1602.01197v1 Data imbalance is common in many vision tasks where one or more classes are rare. Without addressing this issue, conventional methods tend to be biased toward the majority class with poor predictive accuracy for the minority class. These methods further deteriorate on small, imbalanced data that has a large degree of class overlap. In this study, we propose a novel discriminative sparse neighbor approximation (DSNA) method to ameliorate the effect of class-imbalance during prediction. Specifically, given a test sample, we first traverse it through a cost-sensitive decision forest to collect a good subset of training examples in its local neighborhood. Then we generate from this subset several class-discriminating but overlapping clusters and model each as an affine subspace. From these subspaces, the proposed DSNA iteratively seeks an optimal approximation of the test sample and outputs an unbiased prediction. We show that our method not only effectively mitigates the imbalance issue, but also allows the prediction to extrapolate to unseen data. The latter capability is crucial for achieving accurate prediction on small datasets with limited samples. The proposed imbalanced learning method can be applied to both classification and regression tasks at a wide range of imbalance levels. It significantly outperforms the state-of-the-art methods that do not possess an imbalance handling mechanism, and is found to perform comparably or even better than recent deep learning methods by using hand-crafted features only.
- Support recovery of sparse signals from noisy measurements with orthogonal matching pursuit (OMP) has been extensively studied. In this paper, we show that for any $K$-sparse signal $\mathbf{x}$, if a sensing matrix $\mathbf{A}$ satisfies the restricted isometry property (RIP) with restricted isometry constant (RIC) $\delta_{K+1} < 1/\sqrt {K+1}$, then under some constraints on the minimum magnitude of nonzero elements of $\mathbf{x}$, OMP exactly recovers the support of $\mathbf{x}$ from its measurements $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{v}$ in $K$ iterations, where $\mathbf{v}$ is a noise vector that is $\ell_2$ or $\ell_{\infty}$ bounded. This sufficient condition is sharp in terms of $\delta_{K+1}$ since for any given positive integer $K$ and any $1/\sqrt{K+1}\leq \delta<1$, there always exists a matrix $\mathbf{A}$ satisfying the RIP with $\delta_{K+1}=\delta$ for which OMP fails to recover a $K$-sparse signal $\mathbf{x}$ in $K$ iterations. Also, our constraints on the minimum magnitude of nonzero elements of $\mathbf{x}$ are weaker than existing ones. Moreover, we propose worst-case necessary conditions for the exact support recovery of $\mathbf{x}$, characterized by the minimum magnitude of the nonzero elements of $\mathbf{x}$.
- Dec 15 2015 cs.SC arXiv:1512.03901v2 An algebraic approach to the maximum likelihood estimation problem is to solve a very structured parameterized polynomial system called likelihood equations that have finitely many complex (real or non-real) solutions. The only solutions that are statistically meaningful are the real solutions with positive coordinates. In order to classify the parameters (data) according to the number of real/positive solutions, we study how to efficiently compute the discriminants, say data-discriminants (DD), of the likelihood equations. We develop a probabilistic algorithm with three different strategies for computing DDs. Our implemented probabilistic algorithm based on Maple and FGb is more efficient than our previous version presented in ISSAC2015, and is also more efficient than the standard elimination for larger benchmarks. By applying RAGlib to a DD we compute, we give the real root classification of the 3 by 3 symmetric matrix model.
- Dec 08 2015 cs.CV arXiv:1512.01891v1 This paper proposes to learn high-performance deep ConvNets with sparse neural connections, referred to as sparse ConvNets, for face recognition. The sparse ConvNets are learned in an iterative way: each time, one additional layer is sparsified and the entire model is re-trained given the initial weights learned in previous iterations. One important finding is that directly training the sparse ConvNet from scratch failed to find good solutions for face recognition, while using a previously learned denser model to properly initialize a sparser model is critical to continue learning effective features for face recognition. This paper also proposes a new neural correlation-based weight selection criterion and empirically verifies its effectiveness in selecting informative connections from previously learned models in each iteration. When taking a moderately sparse structure (26%-76% of weights in the dense model), the proposed sparse ConvNet model significantly improves the face recognition performance of the previous state-of-the-art DeepID2+ models given the same training data, while it keeps the performance of the baseline model with only 12% of the original parameters.
- Nov 23 2015 cs.CV arXiv:1511.06627v1 Learning to simultaneously handle face alignment of arbitrary views, e.g. frontal and profile views, appears to be more challenging than we thought. The difficulties lie in i) accommodating the complex appearance-shape relations exhibited in different views, and ii) encompassing the varying landmark point sets due to self-occlusion and different landmark protocols. Most existing studies approach this problem by training multiple viewpoint-specific models and conducting head pose estimation for model selection. This solution is intuitive but the performance is highly susceptible to inaccurate head pose estimation. In this study, we address this shortcoming through learning an Ensemble of Model Recommendation Trees (EMRT), which is capable of selecting the optimal model configuration without prior head pose estimation. The unified framework seamlessly handles different viewpoints and landmark protocols, and it is trained by optimising directly on landmark locations, thus yielding superior results on arbitrary-view face alignment. This is the first study that performs face alignment on the full AFLW dataset with faces of different views including profile view. State-of-the-art performances are also reported on MultiPIE and AFW datasets containing both frontal- and profile-view faces.
- Nov 23 2015 cs.CV arXiv:1511.06523v1 Face detection is one of the most studied topics in the computer vision community. Much of the progress has been made possible by the availability of face detection benchmark datasets. We show that there is a gap between current face detection performance and the real world requirements. To facilitate future face detection research, we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion, as shown in Fig. 1. Furthermore, we show that the WIDER FACE dataset is an effective training source for face detection. We benchmark several representative detection systems, providing an overview of state-of-the-art performance, and propose a solution to deal with large scale variation. Finally, we discuss common failure cases that are worth further investigation. The dataset can be downloaded at: mmlab.ie.cuhk.edu.hk/projects/WIDERFace
- Binary representation is desirable for its memory efficiency, computation speed and robustness. In this paper, we propose adjustable bounded rectifiers to learn binary representations for deep neural networks. While hard constraining representations across layers to be binary makes training unreasonably difficult, we softly encourage activations to diverge from real values to binary by approximating step functions. Our final representation is completely binary. We test our approach on the MNIST, CIFAR10, and ILSVRC2012 datasets, and systematically study the training dynamics of the binarization process. Our approach can binarize the last layer representation without loss of performance and binarize all the layers with reasonably small degradations. The memory space that it saves may allow more sophisticated models to be deployed, thus compensating for the loss. To the best of our knowledge, this is the first work to report results on current deep network architectures using complete binary middle representations. Given the learned representations, we find that the firing or inhibition of a binary neuron is usually associated with a meaningful interpretation across different classes. This suggests that the semantic structure of a neural network may be manifested through a guided binarization process.
- We study the geometry of metrics and convexity structures on the space of phylogenetic trees, which is here realized as the tropical linear space of all ultrametrics. The ${\rm CAT}(0)$-metric of Billera-Holmes-Vogtman arises from the theory of orthant spaces. While its geodesics can be computed by the Owen-Provan algorithm, geodesic triangles are complicated. We show that the dimension of such a triangle can be arbitrarily high. Tropical convexity and the tropical metric behave better. They exhibit properties desirable for geometric statistics, such as geodesics of small depth.
- Caching is a promising solution to satisfy the ever increasing demands for multimedia traffic. In caching networks, coded caching is a recently proposed technique that achieves significant performance gains over the uncoded caching schemes. However, to implement the coded caching schemes, each file has to be split into $F$ packets, where $F$ usually increases exponentially with the number of users $K$. Thus, designing caching schemes that decrease the order of $F$ is meaningful for practical implementations. In this paper, by reviewing the Ali-Niesen caching scheme, the placement delivery array (PDA) design problem is first formulated to characterize the placement issue and the delivery issue with a single array. Moreover, we show that, through designing an appropriate PDA, new centralized coded caching schemes can be discovered. Second, it is shown that the Ali-Niesen scheme corresponds to a special class of PDA, which realizes the best coding gain with the least $F$. Third, we present a new construction of PDA for the centralized caching system, wherein the cache size of each user $M$ (identical cache size is assumed at all users) and the number of files $N$ satisfy $M/N=1/q$ or ${(q-1)}/{q}$ ($q$ is an integer such that $q\geq 2$). The new construction can decrease the required $F$ from the order $O\left(e^{K\cdot\left(\frac{M}{N}\ln \frac{N}{M} +(1-\frac{M}{N})\ln \frac{N}{N-M}\right)}\right)$ of the Ali-Niesen scheme to $O\left(e^{K\cdot\frac{M}{N}\ln \frac{N}{M}}\right)$ or $O\left(e^{K\cdot(1-\frac{M}{N})\ln\frac{N}{N-M}}\right)$ respectively, while the coding gain loss is only $1$.
- Sep 23 2015 cs.CV arXiv:1509.06451v1 In this paper, we propose a novel deep convolutional network (DCN) that achieves outstanding performance on FDDB, PASCAL Face, and AFW. Specifically, our method achieves a high recall rate of 90.99% on the challenging FDDB benchmark, outperforming the state-of-the-art method by a large margin of 2.91%. Importantly, we consider finding faces from a new perspective through scoring facial part responses by their spatial structure and arrangement. The scoring mechanism is carefully formulated considering challenging cases where faces are only partially visible. This consideration allows our network to detect faces under severe occlusion and unconstrained pose variation, which are the main difficulty and bottleneck of most existing face detection approaches. We show that despite the use of DCN, our network can achieve practical runtime speed.
- Social relation defines the association, e.g., warmth, friendliness, and dominance, between two or more people. Motivated by psychological studies, we investigate if such fine-grained and high-level relation traits can be characterised and quantified from face images in the wild. To address this challenging problem we propose a deep model that learns a rich face representation to capture gender, expression, head pose, and age-related attributes, and then performs pairwise-face reasoning for relation prediction. To learn from heterogeneous attribute sources, we formulate a new network architecture with a bridging layer to leverage the inherent correspondences among these datasets. It can also cope with missing target attribute labels. Extensive experiments show that our approach is effective for fine-grained social relation learning in images and videos.
- Sep 10 2015 cs.CV arXiv:1509.02634v2 This paper addresses semantic image segmentation by incorporating rich information into Markov Random Field (MRF), including high-order relations and mixture of label contexts. Unlike previous works that optimized MRFs using iterative algorithms, we solve MRF by proposing a Convolutional Neural Network (CNN), namely Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN architecture to model unary terms, and additional layers are carefully devised to approximate the mean field algorithm (MF) for pairwise terms. It has several appealing properties. First, different from the recent works that combined CNN and MRF, where many iterations of MF were required for each training image during back-propagation, DPN is able to achieve high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing works its special cases. Third, DPN makes MF easier to parallelize and speed up on a Graphics Processing Unit (GPU). DPN is thoroughly evaluated on the PASCAL VOC 2012 dataset, where a single DPN model yields a new state-of-the-art segmentation accuracy.
- Separable codes were introduced to provide protection against illegal redistribution of copyrighted multimedia material. Let $\mathcal{C}$ be a code of length $n$ over an alphabet of $q$ letters. The descendant code ${\sf desc}(\mathcal{C}_0)$ of $\mathcal{C}_0 = \{{\bf c}_1, {\bf c}_2, \ldots, {\bf c}_t\} \subseteq {\mathcal{C}}$ is defined to be the set of words ${\bf x} = (x_1, x_2, \ldots,x_n)^T$ such that $x_i \in \{c_{1,i}, c_{2,i}, \ldots, c_{t,i}\}$ for all $i=1, \ldots, n$, where ${\bf c}_j=(c_{j,1},c_{j,2},\ldots,c_{j,n})^T$. $\mathcal{C}$ is a $\overline{t}$-separable code if for any two distinct $\mathcal{C}_1, \mathcal{C}_2 \subseteq \mathcal{C}$ with $|\mathcal{C}_1| \le t$, $|\mathcal{C}_2| \le t$, we always have ${\sf desc}(\mathcal{C}_1) \neq {\sf desc}(\mathcal{C}_2)$. Let $M(\overline{t},n,q)$ denote the maximal possible size of such a separable code. In this paper, an upper bound on $M(\overline{3},3,q)$ is derived by considering an optimization problem related to a partial Latin square, and then two constructions for $\overline{3}$-SC$(3,M,q)$s are provided by means of perfect hash families and Steiner triple systems.
- Updated on 24/09/2015: This update provides preliminary experiment results for fine-grained classification on the surveillance data of CompCars. The train/test splits are provided in the updated dataset. See details in Section 6.
- Jun 16 2015 cs.CV arXiv:1506.04395v2 We develop a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labelling problem. We leverage recent advances of deep convolutional neural networks to generate an ordered high-level sequence from a whole word image, avoiding the difficult character segmentation problem. Then a deep recurrent model, building on long short-term memory (LSTM), is developed to robustly recognize the generated CNN sequences, departing from most existing approaches, which recognise each character independently. Our model has a number of appealing properties in comparison to existing scene text recognition methods: (i) it can recognise highly ambiguous words by leveraging meaningful context information, allowing it to work reliably without either pre- or post-processing; (ii) the deep CNN feature is robust to various image distortions; (iii) it retains the explicit order information in the word image, which is essential to discriminate word strings; (iv) the model does not depend on a pre-defined dictionary, and it can process unknown words and arbitrary strings. Code for the DTRN will be made available.
- May 21 2015 cs.CV arXiv:1505.04868v1 Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features are that (i) TDDs are automatically learned and have high discriminative capacity compared with hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of the temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features and deep-learned features. Our method also achieves superior performance to the state of the art on these datasets (HMDB51 65.9%, UCF101 91.5%).
- In the literature, few constructions of $n$-variable rotation symmetric bent functions have been presented, and these either have restrictions on $n$ or have algebraic degree no more than $4$. In this paper, for any even integer $n=2m\ge2$, a first systematic construction of $n$-variable rotation symmetric bent functions, with any possible algebraic degree ranging from $2$ to $m$, is proposed.
- Apr 28 2015 cs.CV arXiv:1504.06993v1 Lossy compression introduces complex compression artifacts, particularly blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restore sharpened images that are accompanied by ringing effects. Inspired by the deep convolutional networks (DCN) on super-resolution, we formulate a compact and efficient network for seamless attenuation of different compression artifacts. We also demonstrate that a deeper model can be effectively trained with the features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate several practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance to the state of the art both on the benchmark datasets and in a real-world use case (i.e. Twitter). In addition, we show that our method can be applied as pre-processing to facilitate other low-level vision routines when they take compressed images as input.
- In this paper, we reinterpret the $(k+2,k)$ Zigzag code in terms of its coding matrix and then propose an optimal exact repair strategy for its parity nodes, whose repair disk I/O approaches a lower bound derived in this paper.
- Feb 04 2015 cs.CV arXiv:1502.00873v1 The state of the art of face recognition has been significantly advanced by the emergence of deep learning. Very deep neural networks recently achieved great success on general object recognition because of their superb learning capacity. This motivates us to investigate their effectiveness on face recognition. This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition. These two architectures are rebuilt from stacked convolution and inception layers proposed in VGG net and GoogLeNet to make them suitable for face recognition. Joint face identification-verification supervisory signals are added to both intermediate and final feature extraction layers during training. An ensemble of the proposed two architectures achieves 99.53% LFW face verification accuracy and 96.0% LFW rank-1 face identification accuracy. A further discussion of the LFW face verification result is given at the end.
- Jan 07 2015 cs.DL arXiv:1501.01076v3 F1000 recommendations have been validated as a potential data source for research evaluation, but reasons for differences between F1000 Article Factor (FFa scores) and citations remain to be explored. By linking 28254 publications in F1000 to citations in Scopus, we investigated the effect of research level and article type on the internal consistency of assessments based on citations and FFa scores. It turns out that research level has little impact, while article type has a big effect on the differences. The two measures differ significantly for two groups: non-primary research or evidence-based research publications are more highly cited than highly recommended, whereas translational research or transformative research publications are more highly recommended by faculty members but gather relatively fewer citations. This can be expected because citation activities are usually practiced by academic authors, while the potential for scientific revolutions and the suitability for clinical practice of an article should be investigated from the practitioners' points of view. We conclude with a policy-relevant recommendation that the application of bibliometric approaches in research evaluation procedures should include the proportion of three types of publications: evidence-based research, transformative research, and translational research. The latter two types are more suitable to be assessed through peer review.
- Jan 06 2015 cs.CV arXiv:1501.00901v2 Learning to recognize pedestrian attributes at a far distance is a challenging problem in visual surveillance, since face and body close-shots are rarely available; instead, only far-view image frames of pedestrians are given. In this study, we present an alternative approach that exploits the context of neighboring pedestrian images for improved attribute inference compared to the conventional SVM-based method. In addition, we conduct extensive experiments to evaluate the informativeness of background and foreground features for attribute recognition. Experiments are based on our newly released pedestrian attribute dataset, which is by far the largest and most diverse of its kind.
- Jan 05 2015 cs.SC arXiv:1501.00334v5 Maximum likelihood estimation (MLE) is a fundamental computational problem in statistics. The problem is to maximize the likelihood function with respect to given data on a statistical model. An algebraic approach to this problem is to solve a very structured parameterized polynomial system called likelihood equations. For general choices of data, the number of complex solutions to the likelihood equations is finite and called the ML-degree of the model. The only solutions to the likelihood equations that are statistically meaningful are the real/positive solutions. However, the number of real/positive solutions is not characterized by the ML-degree. We use discriminants to classify data according to the number of real/positive solutions of the likelihood equations. We call these discriminants data-discriminants (DD). We develop a probabilistic algorithm for computing DDs. Experimental results show that, for the benchmarks we have tried, the probabilistic algorithm is more efficient than the standard elimination algorithm. Based on the computational results, we discuss the real root classification problem for the 3 by 3 symmetric matrix model.
- We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.
- In this paper, we propose deformable deep convolutional neural networks for generic object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures and training strategies, and by adding and removing some key components in the detection pipeline, a set of models with large diversity is obtained, which significantly improves the effectiveness of model averaging. The proposed approach improves the mean average precision obtained by RCNN \cite{girshick2014rich}, which was the state of the art, from 31% to 50.3% on the ILSVRC2014 detection test set. It also outperforms the winner of ILSVRC2014, GoogLeNet, by 6.1%. Detailed component-wise analysis is also provided through extensive experimental evaluation, which provides a global view for people to understand the deep learning object detection pipeline.
- Dec 04 2014 cs.CV arXiv:1412.1265v1 This paper designs a high-performance deep convolutional network (DeepID2+) for face recognition. It is learned with the identification-verification supervisory signal. By increasing the dimension of hidden representations and adding supervision to early convolutional layers, DeepID2+ achieves new state-of-the-art on LFW and YouTube Faces benchmarks. Through empirical studies, we have discovered three properties of its deep neural activations critical for the high performance: sparsity, selectiveness and robustness. (1) It is observed that neural activations are moderately sparse. Moderate sparsity maximizes the discriminative power of the deep net as well as the distance between images. It is surprising that DeepID2+ still can achieve high recognition accuracy even after the neural responses are binarized. (2) Its neurons in higher layers are highly selective to identities and identity-related attributes. We can identify different subsets of neurons which are either constantly excited or inhibited when different identities or attributes are present. Although DeepID2+ is not taught to distinguish attributes during training, it has implicitly learned such high-level concepts. (3) It is much more robust to occlusions, although occlusion patterns are not included in the training set.
- Dec 02 2014 cs.CV arXiv:1412.0069v1 Deep learning methods have achieved great success in pedestrian detection, owing to their ability to learn features from raw pixels. However, they mainly capture middle-level representations, such as the pose of a pedestrian, but confuse positive samples with hard negative samples, which have large ambiguity, e.g. the shape and appearance of a 'tree trunk' or 'wire pole' are similar to those of a pedestrian from certain viewpoints. This ambiguity can be resolved by high-level representations. To this end, this work jointly optimizes pedestrian detection with semantic tasks, including pedestrian attributes (e.g. 'carrying backpack') and scene attributes (e.g. 'road', 'tree', and 'horizontal'). Rather than expensively annotating scene attributes, we transfer attribute information from existing scene segmentation datasets to the pedestrian dataset, by proposing a novel deep model to learn high-level features from multiple tasks and multiple data sources. Since distinct tasks have distinct convergence rates and data from different datasets have different distributions, a multi-task objective function is carefully designed to coordinate tasks and reduce discrepancies among datasets. The importance coefficients of tasks and network parameters in this objective function can be iteratively estimated. Extensive evaluations show that the proposed approach outperforms the state of the art on the challenging Caltech and ETH datasets, where it reduces the miss rates of previous deep models by 17 and 5.5 percent, respectively.
- Dec 01 2014 cs.CV arXiv:1411.7766v3 Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently. LNet is pre-trained with massive general object categories for face localization, while ANet is pre-trained with massive face identities for attribute prediction. This framework not only outperforms the state of the art by a large margin, but also reveals valuable facts on learning face representation. (1) It shows how the performance of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images strongly indicate face locations. This fact enables training LNet for face localization with only image-level annotations, but without face bounding boxes or landmarks, which are required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and such concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained with a sparse linear combination of these concepts.
- We consider classification tasks in the regime of scarce labeled training data in high dimensional feature space, where specific expert knowledge is also available. We propose a new hybrid optimization algorithm that solves the elastic-net support vector machine (SVM) through an alternating direction method of multipliers in the first phase, followed by an interior-point method for the classical SVM in the second phase. Both SVM formulations are adapted to knowledge incorporation. Our proposed algorithm addresses the challenges of automatic feature selection, high optimization accuracy, and algorithmic flexibility for taking advantage of prior knowledge. We demonstrate the effectiveness and efficiency of our algorithm and compare it with existing methods on a collection of synthetic and real-world data.
- Modern DRAM architectures allow a number of low-power states on individual memory ranks for advanced power management. Many previous studies have taken advantage of demotions to low-power states for energy saving. However, most of the demotion schemes are statically performed on a limited number of pre-selected low-power states, and are suboptimal for different workloads and memory architectures. Even worse, the idle periods are often too short for effective power state transitions, especially for memory-intensive applications. Wrong decisions on power state transitions incur significant energy and delay penalties. In this paper, we propose a novel memory system design named RAMZzz with rank-aware energy saving optimizations including dynamic page migrations and adaptive demotions. Specifically, we group the pages with similar access locality into the same rank with dynamic page migrations. Ranks have their hotness: hot ranks are kept busy for high utilization and cold ranks can have longer idle periods for power state transitions. We further develop adaptive state demotions by considering all low-power states for each rank and a prediction model to estimate the power-down timeout among states. We experimentally compare our algorithm with other energy saving policies using cycle-accurate simulation. Experiments with benchmark workloads show that RAMZzz achieves significant improvement in energy-delay$^2$ and energy consumption over other energy saving techniques.
- Sep 12 2014 cs.CV arXiv:1409.3505v1 In this paper, we propose multi-stage and deformable deep convolutional neural networks for object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. With the proposed multi-stage training strategy, multiple classifiers are jointly optimized to process samples at different difficulty levels. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures and training strategies, and by adding and removing some key components in the detection pipeline, a set of models with large diversity is obtained, which significantly improves the effectiveness of model averaging. The proposed approach ranked #2 in ILSVRC 2014. It improves the mean average precision obtained by RCNN, which is the state of the art in object detection, from 31% to 45%. Detailed component-wise analysis is also provided through extensive experimental evaluation.