results for au:Tang_X in:cs

- Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g. ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation is recently proposed to pre-train a network without any human-provided labels. The key of this new form of learning is to design a proxy task (e.g. image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representation for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a "mix-and-match" (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable to many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of target image segmentation task to surpass fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the "mix" stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A "match" stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or label is required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet pre-trained counterpart on both PASCAL VOC2012 dataset and CityScapes dataset.
- Network based on distributed caching of content is a new architecture to alleviate the ongoing explosive demands for rate of multi-media traffic. In caching networks, coded caching is a recently proposed technique that achieves significant performance gains compared to uncoded caching schemes. In this paper, we derive a lower bound on the average rate with a memory constraint for a family of caching allocation placement and a family of XOR cooperative delivery. The lower bound inspires us how placement and delivery affect the rate memory tradeoff. Based on the clues, we design a new placement and two new delivery algorithms. On one hand, the new placement scheme can allocate the cache more flexibly compared to grouping scheme. On the other hand, the new delivery can exploit more cooperative opportunities compared to the known schemes. The simulations validate our idea.
- Wood Identification has never been more important to serve the purpose of global forest species protection and timber regulation. Macroscopic level wood identification practiced by wood anatomists can identify wood up to genus level. This is sufficient to serve as a frontline identification to fight against illegal wood logging and timber trade for law enforcement authority. However, frontline enforcement official may lack of the accuracy and confidence of a well trained wood anatomist. Hence, computer assisted method such as machine vision methods are developed to do rapid field identification for law enforcement official. In this paper, we proposed a rapid and robust macroscopic wood identification system using machine vision method with off-the-shelf smartphone and retrofitted macro-lens. Our system is cost effective, easily accessible, fast and scalable at the same time provides human-level accuracy on identification. Camera-enabled smartphone with Internet connectivity coupled with a macro-lens provides a simple and effective digital acquisition of macroscopic wood images which are essential for macroscopic wood identification. The images are immediately streamed to a cloud server via Internet connection for identification which are done within seconds.
- Coded caching scheme has recently become quite popular in the wireless network due to the efficiency of reducing the load during peak traffic times. Recently the most concern is the problem of subpacketization level in a coded caching scheme. Although there are several classes of constructions, these schemes only apply to some individual cases for the memory size. And in the practice it is very crucial to consider any memory size. In this paper, four classes of new schemes with the wide range of memory size are constructed. And through the performance analyses, our new scheme can significantly reduce the level of subpacketization by decreasing a little efficiency of transmission in the peak traffic times. Moreover some schemes satisfy that the packet number is polynomial or linear with the number of users.
- Aug 10 2017 cs.CV arXiv:1708.02760v1The ability to ask questions is a powerful tool to gather information in order to learn about the world and resolve ambiguities. In this paper, we explore a novel problem of generating discriminative questions to help disambiguate visual instances. Our work can be seen as a complement and new extension to the rich research studies on image captioning and question answering. We introduce the first large-scale dataset with over 10,000 carefully annotated images-question tuples to facilitate benchmarking. In particular, each tuple consists of a pair of images and 4.6 discriminative questions (as positive samples) and 5.9 non-discriminative questions (as negative samples) on average. In addition, we present an effective method for visual discriminative question generation. The method can be trained in a weakly supervised manner without discriminative images-question tuples but just existing visual question answering datasets. Promising results are shown against representative baselines through quantitative evaluations and user studies.
- Fashion landmarks are functional key points defined on clothes, such as corners of neckline, hemline, and cuff. They have been recently introduced as an effective visual representation for fashion image understanding. However, detecting fashion landmarks are challenging due to background clutters, human poses, and scales. To remove the above variations, previous works usually assumed bounding boxes of clothes are provided in training and test as additional annotations, which are expensive to obtain and inapplicable in practice. This work addresses unconstrained fashion landmark detection, where clothing bounding boxes are not provided in both training and test. To this end, we present a novel Deep LAndmark Network (DLAN), where bounding boxes and landmarks are jointly estimated and trained iteratively in an end-to-end manner. DLAN contains two dedicated modules, including a Selective Dilated Convolution for handling scale discrepancies, and a Hierarchical Recurrent Spatial Transformer for handling background clutters. To evaluate DLAN, we present a large-scale fashion landmark dataset, namely Unconstrained Landmark Database (ULD), consisting of 30K images. Statistics show that ULD is more challenging than existing datasets in terms of image scales, background clutters, and human poses. Extensive experiments demonstrate the effectiveness of DLAN over the state-of-the-art methods. DLAN also exhibits excellent generalization across different clothing categories and modalities, making it extremely suitable for real-world fashion analysis.
- Conventional video segmentation methods often rely on temporal continuity to propagate masks. Such an assumption suffers from issues like drifting and inability to handle large displacement. To overcome these issues, we formulate an effective mechanism to prevent the target from being lost via adaptive object re-identification. Specifically, our Video Object Segmentation with Re-identification (VS-ReID) model includes a mask propagation module and a ReID module. The former module produces an initial probability map by flow warping while the latter module retrieves missing instances by adaptive matching. With these two modules iteratively applied, our VS-ReID records a global mean (Region Jaccard and Boundary F measure) of 0.699, the best performance in 2017 DAVIS Challenge.
- Aug 01 2017 cs.CV arXiv:1707.09531v1Since convolutional neural network (CNN) lacks an inherent mechanism to handle large scale variations, we always need to compute feature maps multiple times for multi-scale object detection, which has the bottleneck of computational cost in practice. To address this, we devise a recurrent scale approximation (RSA) to compute feature map once only, and only through this map can we approximate the rest maps on other levels. At the core of RSA is the recursive rolling out mechanism: given an initial map on a particular scale, it generates the prediction on a smaller scale that is half the size of input. To further increase efficiency and accuracy, we (a): design a scale-forecast network to globally predict potential scales in the image since there is no need to compute maps on all levels of the pyramid. (b): propose a landmark retracing network (LRN) to retrace back locations of the regressed landmarks and generate a confidence score for each landmark; LRN can effectively alleviate false positives due to the accumulated error in RSA. The whole system could be trained end-to-end in a unified CNN framework. Experiments demonstrate that our proposed algorithm is superior against state-of-the-arts on face detection benchmarks and achieves comparable results for generic proposal generation. The source code of RSA is available at github.com/sciencefans/RSA-for-object-detection.
- In this paper, the one-sided secrecy of two-way wiretap channel with feedback is investigated, where the confidential messages of one user through multiple transmissions is guaranteed secure against an external eavesdropper. For one thing, one-sided secrecy satisfies the secure demand of many practical scenarios. For another, the secrecy is measured over many blocks since the correlation between eavesdropper's observation and the confidential messages in successive blocks, instead of secrecy measurement of one block in previous works. Thus, firstly, an achievable secrecy rate region is derived for the general two-way wiretap channel with feedback through multiple transmissions under one-sided secrecy. Secondly, outer bounds on the secrecy capacity region are also obtained. The gap between inner and outer bounds on the secrecy capacity region is explored via the binary input two-way wiretap channels. Most notably, the secrecy capacity regions are established for the XOR channel. Furthermore, the result shows that the achievable rate region with feedback is larger than that without feedback. Therefore, the benefit role of feedback is precisely characterized for two-way wiretap channel with feedback under one-sided secrecy.
- In this paper, the individual secrecy of two-way wiretap channel is investigated, where two legitimate users' messages are separately guaranteed secure against an external eavesdropper. For one thing, in some communication scenarios, the joint secrecy is impossible to achieve both positive secrecy rates of two users. For another, the individual secrecy satisfies the secrecy demand of many practical communication systems. Thus, firstly, an achievable secrecy rate region is derived for the general two-way wiretap channel with individual secrecy. In a deterministic channel, the region with individual secrecy is shown to be larger than that with joint secrecy.Secondly, outer bounds on the secrecy capacity region are obtained for the general two-way wiretap channel and for two classes of special two-way wiretap channels.The gap between inner and outer bounds on the secrecy capacity region is explored via the binary input two-way wiretap channels and the degraded Gaussian two-way wiretap. Most notably, the secrecy capacity regions are established for the XOR channel and the degraded Gaussian two-way wiretap channel.Furthermore, the secure sum-rate of the degraded Gaussian two-way wiretap channel under the individual secrecy constraint is demonstrated to be strictly larger than that under the joint secrecy constraint.
- Jul 18 2017 cs.CV arXiv:1707.05251v1We introduce EnhanceGAN, an adversarial learning based model that performs automatic image enhancement. Traditional image enhancement frameworks involve training separate models for automatic cropping or color enhancement in a fully-supervised manner, which requires expensive annotations in the form of image pairs. In contrast to these approaches, our proposed EnhanceGAN only requires weak supervision (binary labels on image aesthetic quality) and is able to learn enhancement parameters for tasks including image cropping and color enhancement. The full differentiability of our image enhancement modules enables training the proposed EnhanceGAN in an end-to-end manner. A novel stage-wise learning scheme is further proposed to stabilize the training of each enhancement task and facilitate the extensibility for other image enhancement techniques. Our weakly-supervised EnhanceGAN reports competitive quantitative results against supervised models in automatic image cropping using standard benchmarking datasets, and a user study confirms that the images enhancement results are on par with or even preferred over professional enhancement.
- Jul 03 2017 cs.MM arXiv:1706.10143v1Many different parametric models for video quality assessment have been proposed in the past few years. This paper presents a review of nine recent models which cover a wide range of methodologies and have been validated for estimating video quality due to different degradation factors. Each model is briefly described with key algorithms and relevant parametric formulas. The generalization capability of each model to estimate video quality in real-application scenarios is evaluated and compared with other models, using a dataset created with video sequences from practical applications. These video sequences cover a wide range of possible realistic encoding parameters, labeled with mean opinion scores (MOS) via subjective test. The weakness and strength of each model are remarked. Finally, future work towards a more general parametric model that could apply for a wider range of applications is discussed.
- Jun 12 2017 cs.CV arXiv:1706.02863v1In this paper, we share our experience in designing a convolutional network-based face detector that could handle faces of an extremely wide range of scales. We show that faces with different scales can be modeled through a specialized set of deep convolutional networks with different structures. These detectors can be seamlessly integrated into a single unified network that can be trained end-to-end. In contrast to existing deep models that are designed for wide scale range, our network does not require an image pyramid input and the model is of modest complexity. Our network, dubbed ScaleFace, achieves promising performance on WIDER FACE and FDDB datasets with practical runtime speed. Specifically, our method achieves 76.4 average precision on the challenging WIDER FACE dataset and 96% recall rate on the FDDB dataset with 7 frames per second (fps) for 900 * 1300 input image.
- May 31 2017 cs.SI arXiv:1705.10442v2Influence Maximization is an extensively-studied problem that targets at selecting a set of initial seed nodes in the Online Social Networks (OSNs) to spread the influence as widely as possible. However, it remains an open challenge to design fast and accurate algorithms to find solutions in large-scale OSNs. Prior Monte-Carlo-simulation-based methods are slow and not scalable, while other heuristic algorithms do not have any theoretical guarantee and they have been shown to produce poor solutions for quite some cases. In this paper, we propose hop-based algorithms that can easily scale to millions of nodes and billions of edges. Unlike previous heuristics, our proposed hop-based approaches can provide certain theoretical guarantees. Experimental evaluations with real OSN datasets demonstrate the efficiency and effectiveness of our algorithms.
- Sequences with low auto-correlation property have been applied in code-division multiple access communication systems, radar and cryptography. Using the inverse Gray mapping, a quaternary sequence of even length $N$ can be obtained from two binary sequences of the same length, which are called component sequences. In this paper, using interleaving method, we present several classes of component sequences from twin-prime sequences pairs or GMW sequences pairs given by Tang and Ding in 2010; two, three or four binary sequences defined by cyclotomic classes of order $4$. Hence we can obtain new classes of quaternary sequences, which are different from known ones, since known component sequences are constructed from a pair of binary sequences with optimal auto-correlation or Sidel'nikov sequences.
- May 09 2017 cs.CV arXiv:1705.02953v1Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of TSN framework given limited training samples. Our approach obtains the state-the-of-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
- Apr 25 2017 cs.CV arXiv:1704.06904v1In this work, we propose "Residual Attention Network", a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that, our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs comparing to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels.
- Apr 21 2017 cs.CV arXiv:1704.06228v2Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG) is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
- We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and Cityscapes datasets, achieving state-of-the-art performance and fast speed.
- In this paper, an improved receiver based on diversity combining is proposed to improve the bit error rate (BER) performance of layered asymmetrically clipped optical fast orthogonal frequency division multiplexing (ACO-FOFDM) for intensity-modulated and direct-detected (IM/DD) optical transmission systems. Layered ACO-FOFDM can compensate the weakness of traditional ACO-FOFDM in low spectral efficiency, the utilization of discrete cosine transform in FOFDM system instead of fast Fourier transform in OFDM system can reduce the computational complexity without any influence on BER performance. The BER performances of layered ACO-FOFDM system with improved receiver based on diversity combining and DC-offset FOFDM (DCO-FOFDM) system with optimal DC-bias are compared at the same spectral efficiency. Simulation results show that under different optical bit energy to noise power ratios, layered ACO-FOFDM system with improved receiver has 2.86dB, 5.26dB and 5.72dB BER performance advantages at forward error correction limit over DCO-FOFDM system when the spectral efficiencies are 1 bit/s/Hz, 2 bits/s/Hz and 3 bits/s/Hz, respectively. Layered ACO-FOFDM system with improved receiver based on diversity combining is suitable for application in the adaptive IM/DD systems with zero DC-bias.
- Mar 09 2017 cs.CV arXiv:1703.02716v1Detecting activities in untrimmed videos is an important but challenging task. The performance of existing methods remains unsatisfactory, e.g., they often meet difficulties in locating the beginning and end of a long complex action. In this paper, we propose a generic framework that can accurately detect a wide variety of activities from untrimmed videos. Our first contribution is a novel proposal scheme that can efficiently generate candidates with accurate temporal boundaries. The other contribution is a cascaded classification pipeline that explicitly distinguishes between relevance and completeness of a candidate instance. On two challenging temporal activity detection datasets, THUMOS14 and ActivityNet, the proposed framework significantly outperforms the existing state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
- Coded caching scheme, which is an effective technique to increase the transmission efficiency during peak traffic times, has recently become quite popular among the coding community. Generally rate can be measured to the transmission in the peak traffic times, i.e., this efficiency increases with the decreasing of rate. In order to implement a coded caching scheme, each file in the library must be split in a certain number of packets. And this number directly reflects the complexity of a coded caching scheme, i.e., the complexity increases with the increasing of the packet number. However there exists a tradeoff between the rate and packet number. So it is meaningful to characterize this tradeoff and design the related Pareto-optimal coded caching schemes with respect to both parameters. Recently, a new concept called placement delivery array (PDA) was proposed to characterize the coded caching scheme. However as far as we know no one has yet proved that one of the previously known PDAs is Pareto-optimal. In this paper, we first derive two lower bounds on the rate under the framework of PDA. Consequently, the PDA proposed by Maddah-Ali and Niesen is Pareto-optimal, and a tradeoff between rate and packet number is obtained for some parameters. Then, from the above observations and the view point of combinatorial design, two new classes of Pareto-optimal PDAs are obtained. Based on these PDAs, the schemes with low rate and packet number are obtained. Finally the performance of some previously known PDAs are estimated by comparing with these two classes of schemes.
- Feb 24 2017 cs.CV arXiv:1702.07191v2As the intermediate level task connecting image captioning and object detection, visual relationship detection started to catch researchers' attention because of its descriptive power and clear structure. It detects the objects and captures their pair-wise interactions with a subject-predicate-object triplet, e.g. person-ride-horse. In this paper, each visual relationship is considered as a phrase with three components. We formulate the visual relationship detection as three inter-connected recognition problems and propose a Visual Phrase guided Convolutional Neural Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a Phrase-guided Message Passing Structure (PMPS) to establish the connection among relationship components and help the model consider the three problems jointly. Corresponding non-maximum suppression method and model training strategy are also proposed. Experimental results show that our ViP-CNN outperforms the state-of-art method both in speed and accuracy. We further pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which is found to perform better than the pretraining on the ImageNet for this task.
- We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.
- Jan 31 2017 cs.CV arXiv:1701.08393v3We propose a deep convolutional neural network (CNN) for face detection leveraging on facial attributes based supervision. We observe a phenomenon that part detectors emerge within CNN trained to classify attributes from uncropped face images, without any explicit part supervision. The observation motivates a new method for finding faces through scoring facial parts responses by their spatial structure and arrangement. The scoring mechanism is data-driven, and carefully formulated considering challenging cases where faces are only partially visible. This consideration allows our network to detect faces under severe occlusion and unconstrained pose variations. Our method achieves promising performance on popular benchmarks including FDDB, PASCAL Faces, AFW, and WIDER FACE.
- Faster-than-Nyquist (FTN) signal achieves higher spectral efficiency and capacity compared to Nyquist signal due to its smaller pulse interval or narrower subcarrier spacing. Shannon limit typically defines the upper-limit capacity of Nyquist signal. To the best of our knowledge, the mathematical expression for the capacity limit of FTN non-orthogonal frequency-division multiplexing (NOFDM) signal is first demonstrated in this paper. The mathematical expression shows that FTN NOFDM signal has the potential to achieve a higher capacity limit compared to Nyquist signal. In this paper, we demonstrate the principle of FTN NOFDM by taking fractional cosine transform-based NOFDM (FrCT-NOFDM) for instance. FrCT-NOFDM is first proposed and implemented by both simulation and experiment. When the bandwidth compression factor $\alpha$ is set to $0.8$ in FrCT-NOFDM, the subcarrier spacing is equal to $40\%$ of the symbol rate per subcarrier, thus the transmission rate is about $25\%$ faster than Nyquist rate. FTN NOFDM with higher capacity would be promising in the future communication systems, especially in the bandwidth-limited applications.
- Coded caching scheme is a promising technique to migrate the network burden in peak hours, which attains more prominent gains than the uncoded caching. The coded caching scheme can be classified into two types, namely, the centralized and the decentralized scheme, according to whether the placement procedures are carefully designed or operated at random. However, most of the previous analysis assumes that the connected links between server and users are error-free. In this paper, we explore the coded caching based delivery design in wireless networks, where all the connected wireless links are different. For both centralized and decentralized cases, we proposed two delivery schemes, namely, the orthogonal delivery scheme and the concurrent delivery scheme. We focus on the transmission time slots spent on satisfying the system requests, and prove that for both the centralized and the decentralized cases, the concurrent delivery always outperforms orthogonal delivery scheme. Furthermore, for the orthogonal delivery scheme, we derive the gap in terms of transmission time between the decentralized and centralized case, which is essentially no more than 1.5.
- In wireless networks, coded caching is an effective technique to reduce network congestion during peak traffic times. Recently, a new concept called placement delivery array (PDA) was proposed to characterize the coded caching scheme. So far, only one class of PDAs by Maddah-Ali and Niesen is known to be optimal. In this paper, we mainly focus on constructing optimal PDAs. Firstly, we derive some lower bounds. Next, we present several infinite classes of PDAs, which are shown to be optimal with respect to the new bounds.
- Existing deep embedding methods in vision tasks are capable of learning a compact Euclidean space from images, where Euclidean distances correspond to a similarity metric. To make learning more effective and efficient, hard sample mining is usually employed, with samples identified through computing the Euclidean feature distance. However, the global Euclidean distance cannot faithfully characterize the true feature similarity in a complex visual feature space, where the intraclass distance in a high-density region may be larger than the interclass distance in low-density regions. In this paper, we introduce a Position-Dependent Deep Metric (PDDM) unit, which is capable of learning a similarity metric adaptive to local feature structure. The metric can be used to select genuinely hard samples in a local neighborhood to guide the deep embedding learning in an online and robust manner. The new layer is appealing in that it is pluggable to any convolutional networks and is trained end-to-end. Our local similarity-aware feature embedding not only demonstrates faster convergence and boosted performance on two complex image retrieval datasets, its large margin nature also leads to superior generalization results under the large and open set scenarios of transfer learning and zero-shot learning on ImageNet 2010 and ImageNet-10K datasets.
- The integer-forcing (IF) linear multiple-input and multiple-output (MIMO) receiver is a recently proposed suboptimal receiver which nearly reaches the performance of the optimal maximum likelihood receiver for the entire signal-to-noise ratio (SNR) range. The optimal integer coefficient matrix $\A^\star\in \mathbb{Z}^{N_t\times N_t}$ for IF maximizes the total achievable rate, where $N_t$ is the column dimension of the channel matrix. To obtain $\A^\star$, a successive minima problem (SMP) on an $N_t$-dimensional lattice that is suspected to be NP-hard needs to be solved. In this paper, an efficient exact algorithm for the SMP is proposed. For efficiency, our algorithm first uses the LLL reduction to reduce the SMP. Then, different from existing SMP algorithms which form the transformed $\A^\star$ column by column in $N_t$ iterations, it first initializes with a suboptimal matrix. The suboptimal matrix is then updated, by utilizing the integer vectors obtained by employing an improved Schnorr-Euchner search algorithm to search the candidate integer vectors within a certain hyper-ellipsoid, via a novel and efficient algorithm until the transformed $\A^{\star}$ is obtained in only one iteration. Finally, the algorithm returns the matrix obtained by left multiplying the solution of the reduced SMP with the unimodular matrix that is generated by the LLL reduction. We then rigorously prove the optimality of the proposed algorithm by showing that it exactly solves the SMP. Furthermore, we develop a theoretical complexity analysis to show that the complexity of the new algorithm in big-O notation is an order of magnitude smaller, with respect to $N_t$, than that of the existing most efficient algorithm. Finally, simulation results are presented to illustrate the optimality and efficiency of our novel algorithm.
- Oct 05 2016 cs.CV arXiv:1610.00838v2This survey aims at reviewing recent computer vision techniques used in the assessment of image aesthetic quality. Image aesthetic assessment aims at computationally distinguishing high-quality photos from low-quality ones based on photographic rules, typically in the form of binary classification or quality scoring. A variety of approaches has been proposed in the literature trying to solve this challenging problem. In this survey, we present a systematic listing of the reviewed approaches based on visual feature types (hand-crafted features and deep features) and evaluation criteria (dataset characteristics and evaluation metrics). Main contributions and novelties of the reviewed approaches are highlighted and discussed. In addition, following the emergence of deep learning techniques, we systematically evaluate recent deep learning settings that are useful for developing a robust deep model for aesthetic scoring. Experiments are conducted using simple yet solid baselines that are competitive with the current state-of-the-arts. Moreover, we discuss the possibility of manipulating the aesthetics of images through computational approaches. We hope that our survey could serve as a comprehensive reference source for future research on the study of image aesthetic assessment.
- Sep 22 2016 cs.CV arXiv:1609.06426v3Interpersonal relation defines the association, e.g., warm, friendliness, and dominance, between two or more people. Motivated by psychological studies, we investigate if such fine-grained and high-level relation traits can be characterized and quantified from face images in the wild. We address this challenging problem by first studying a deep network architecture for robust recognition of facial expressions. Unlike existing models that typically learn from facial expression labels alone, we devise an effective multitask network that is capable of learning from rich auxiliary attributes such as gender, age, and head pose, beyond just facial expression data. While conventional supervised training requires datasets with complete labels (e.g., all samples must be labeled with gender, age, and expression), we show that this requirement can be relaxed via a novel attribute propagation method. The approach further allows us to leverage the inherent correspondences between heterogeneous attribute sources despite the disparate distributions of different datasets. With the network we demonstrate state-of-the-art results on existing facial expression recognition benchmarks. To predict inter-personal relation, we use the expression recognition network as branches for a Siamese model. Extensive experiments show that our model is capable of mining mutual context of faces for accurate fine-grained interpersonal prediction.
- The technique of coded caching proposed by Madddah-Ali and Niesen is a promising approach to alleviate the load of networks during busy times. Recently, placement delivery array (PDA) was presented to characterize both the placement and delivery phase in a single array for the centralized coded caching algorithm. In this paper, we interpret PDA from a new perspective, i.e., the strong edge coloring of bipartite graph. We prove that, a PDA is equivalent to a strong edge colored bipartite graph. Thus, we can construct a class of PDAs from existing structures in bipartite graphs. The class includes the scheme proposed by Maddah-Ali \textitet al. and a more general class of PDAs proposed by Shangguan \textitet al. as special cases. Moreover, it is capable of generating a lot of PDAs with flexible tradeoff between the sub-packet level and load.
- Markov Random Fields (MRFs), a formulation widely used in generative image modeling, have long been plagued by the lack of expressive power. This issue is primarily due to the fact that conventional MRFs formulations tend to use simplistic factors to capture local patterns. In this paper, we move beyond such limitations, and propose a novel MRF model that uses fully-connected neurons to express the complex interactions among pixels. Through theoretical analysis, we reveal an inherent connection between this model and recurrent neural networks, and thereon derive an approximated feed-forward network that couples multiple RNNs along opposite directions. This formulation combines the expressive power of deep neural networks and the cyclic dependency structure of MRF in a unified model, bringing the modeling capability to a new level. The feed-forward approximation also allows it to be efficiently learned from data. Experimental results on a variety of low-level vision tasks show notable improvement over state-of-the-arts.
- Aug 11 2016 cs.CV arXiv:1608.03049v1Visual fashion analysis has attracted many attentions in the recent years. Previous work represented clothing regions by either bounding boxes or human joints. This work presents fashion landmark detection or fashion alignment, which is to predict the positions of functional key points defined on the fashion items, such as the corners of neckline, hemline, and cuff. To encourage future studies, we introduce a fashion landmark dataset with over 120K images, where each image is labeled with eight landmarks. With this dataset, we study fashion alignment by cascading multiple convolutional neural networks in three stages. These stages gradually improve the accuracies of landmark predictions. Extensive experiments demonstrate the effectiveness of the proposed method, as well as its generalization ability to pose estimation. Fashion landmark is also compared to clothing bounding boxes and human joints in two applications, fashion attribute prediction and clothes retrieval, showing that fashion landmark is a more discriminative representation to understand fashion images.
- Aug 10 2016 cs.CV arXiv:1608.02778v1Lossy compression introduces complex compression artifacts, particularly blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restore sharpened images that are accompanied with ringing effects. Inspired by the success of deep convolutional networks (DCN) on superresolution, we formulate a compact and efficient network for seamless attenuation of different compression artifacts. To meet the speed requirement of real-world applications, we further accelerate the proposed baseline model by layer decomposition and joint use of large-stride convolutional and deconvolutional layers. This also leads to a more general CNN framework that has a close relationship with the conventional Multi-Layer Perceptron (MLP). Finally, the modified network achieves a speed up of 7.5 times with almost no performance loss compared to the baseline model. We also demonstrate that a deeper model can be effectively trained with features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate three practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance than the state-of-the-art methods both on benchmark datasets and a real-world use case.
- Aug 03 2016 cs.CV arXiv:1608.00859v1Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition. which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-the-of-art performance on the datasets of HMDB51 ( $ 69.4\% $) and UCF101 ($ 94.2\% $). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
- Aug 03 2016 cs.CV arXiv:1608.00797v1This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.
- Aug 02 2016 cs.CV arXiv:1608.00367v1As a successful deep model applied in image super-resolution (SR), the Super-Resolution Convolutional Neural Network (SRCNN) has demonstrated superior performance to the previous hand-crafted models either in speed and restoration quality. However, the high computational cost still hinders it from practical usage that demands real-time performance (24 fps). In this paper, we aim at accelerating the current SRCNN, and propose a compact hourglass-shape CNN structure for faster and better SR. We re-design the SRCNN structure mainly in three aspects. First, we introduce a deconvolution layer at the end of the network, then the mapping is learned directly from the original low-resolution image (without interpolation) to the high-resolution one. Second, we reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. Third, we adopt smaller filter sizes but more mapping layers. The proposed model achieves a speed up of more than 40 times with even superior restoration quality. Further, we present the parameter settings that can achieve real-time performance on a generic CPU while still maintaining good performance. A corresponding transfer strategy is also proposed for fast training and testing across different upscaling factors.
- Jul 19 2016 cs.CV arXiv:1607.05046v1We present a novel framework for hallucinating faces of unconstrained poses and with very low resolution (face size as small as 5pxIOD). In contrast to existing studies that mostly ignore or assume pre-aligned face spatial configuration (e.g. facial landmarks localization or dense correspondence field), we alternatingly optimize two complementary tasks, namely face hallucination and dense correspondence field estimation, in a unified framework. In addition, we propose a new gated deep bi-network that contains two functionality-specialized branches to recover different levels of texture details. Extensive experiments demonstrate that such formulation allows exceptional hallucination quality on in-the-wild low-res faces with significant pose and illumination variations.
- Locally repairable codes are desirable for distributed storage systems to improve the repair efficiency. In this paper, we first build a bridge between locally repairable code and packing. As an application of this bridge, some optimal locally repairable codes can be obtained by packings, which gives optimal locally repairable codes with flexible parameters.
- Semantic segmentation tasks can be well modeled by Markov Random Field (MRF). This paper addresses semantic segmentation by incorporating high-order relations and mixture of label contexts into MRF. Unlike previous works that optimized MRFs using iterative algorithm, we solve MRF by proposing a Convolutional Neural Network (CNN), namely Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN to model unary terms and additional layers are devised to approximate the mean field (MF) algorithm for pairwise terms. It has several appealing properties. First, different from the recent works that required many iterations of MF during back-propagation, DPN is able to achieve high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing models as its special cases. Furthermore, pairwise terms in DPN provide a unified framework to encode rich contextual information in high-dimensional data, such as images and videos. Third, DPN makes MF easier to be parallelized and speeded up, thus enabling efficient inference. DPN is thoroughly evaluated on standard semantic image/video segmentation benchmarks, where a single DPN model yields state-of-the-art segmentation accuracies on PASCAL VOC 2012, Cityscapes dataset and CamVid dataset.
- Let $\mathbb{F}_{p^{m}}$ be a finite field with cardinality $p^{m}$ and $R=\mathbb{F}_{p^{m}}+u\mathbb{F}_{p^{m}}$ with $u^{2}=0$. We aim to determine all $\alpha+u\beta$-constacyclic codes of length $np^{s}$ over $R$, where $\alpha,\beta\in\mathbb{F}_{p^{m}}^{*}$, $n, s\in\mathbb{N}_{+}$ and $\gcd(n,p)=1$. Let $\alpha_{0}\in\mathbb{F}_{p^{m}}^{*}$ and $\alpha_{0}^{p^{s}}=\alpha$. The residue ring $R[x]/\langle x^{np^{s}}-\alpha-u\beta\rangle$ is a chain ring with the maximal ideal $\langle x^{n}-\alpha_{0}\rangle$ in the case that $x^{n}-\alpha_{0}$ is irreducible in $\mathbb{F}_{p^{m}}[x]$. If $x^{n}-\alpha_{0}$ is reducible in $\mathbb{F}_{p^{m}}[x]$, we give the explicit expressions of the ideals of $R[x]/\langle x^{np^{s}}-\alpha-u\beta\rangle$. Besides, the number of codewords and the dual code of every $\alpha+u\beta$-constacyclic code are provided.
- Caching is a promising solution to satisfy the ongoing explosive demands for multi-media traffics. Recently, Maddah-Ali and Niesen proposed both centralized and de-centralized coded caching schemes, which are able to attain significant performance gains over uncoded caching schemes. Particular, their work indicates that there exists a performance gap between the decentralized coded caching scheme and the centralized coded caching scheme. In this paper, we investigate this gap. As a result, we prove that the multiplicative gap (i.e., the ratio of their performances) is between 1 and 1:5. The upper bound tightens the original one of 12 by Maddah-Ali and Niesen, while the lower bound verifies the intuition that the centralized coded caching scheme always outperforms its decentralized counterpart. Notably, both bounds are achievable in some cases. Furthermore, we prove that the gap can be arbitrarily close to 1 if the number of users is large enough, which suggests the great potential in practical applications to use the less optimal but more practical decentralized coded caching scheme
- In this paper, we investigate some sufficient conditions based on the block restricted isometry property (block-RIP) for exact (when $\v=\0$) and stable (when $\v\neq\0$) recovery of block sparse signals $\x$ from measurements $\y=\A\x+\v$, where $\v$ is a $\ell_2$ bounded noise vector (i.e., $\|\v\|_2\leq \epsilon$ for some constant $\epsilon$).. First, on the one hand, we show that if $\A$ satisfies the block-RIP with constant $\delta_{K+1}<1/\sqrt{K+1}$, then every $K$-block sparse signal $\x$ can be exactly or stably recovered by BOMP in $K$ iterations; On the other hand, for any $K\geq 1$ and $1/\sqrt{K+1}\leq t<1$, there exists a matrix $\A$ satisfying the block-RIP with $\delta_{K+1}=t$ and a $K$-block sparse signal $\x$ such that the BOMP algorithm may fail to recover $\x$ in $K$ iterations. Second, we study some sufficient conditions for recovering $\alpha$-strongly-decaying $K$-block sparse signals. Surprisingly, it is shown that if $\A$ satisfies the block-RIP with $\delta_{K+1}<\sqrt{2}/2$, every $\alpha$-strongly-decaying $K$-block sparse signal can be exactly or stably recovered by the BOMP algorithm in $K$ iterations, under some conditions on $\alpha$. Our newly found sufficient condition on the block-RIP of $\A$ is weaker than that for $\ell_1$ minimization for this special class of sparse signals, which further convinces the effectiveness of BOMP. Furthermore, for any $K\geq 1$, $\alpha>1$ and $\sqrt{2}/2\leq t<1$, the recovery of $\x$ may fail in $K$ iterations for a sensing matrix $\A$ which satisfies the block-RIP with $\delta_{K+1}=t$. Finally, we study some sufficient conditions for partial recovery of block sparse signals. Specifically, if $\A$ satisfies the block-RIP with $\delta_{K+1}<\sqrt{2}/2$, then BOMP is guaranteed to recover some blocks of $\x$ if these blocks satisfy a sufficient condition. We further show that this condition is sharp.
- We consider the problem of constructing exact-repair minimum storage regenerating (MSR) codes, for which both the systematic nodes and parity nodes can be repaired optimally. Although there exist several recent explicit high-rate MSR code constructions (usually with certain restrictions on the coding parameters), quite a few constructions in the literature only allow the optimal repair of systematic nodes. This phenomenon suggests that there might be a barrier between explicitly constructing codes that can only optimally repair systematic nodes and those that can optimally repair both systematic nodes and parity nodes. In the work, we show that this barrier can be completely circumvented by providing a generic transformation that is able to convert any non-binary linear maximum distance separable (MDS) storage codes that can optimally repair only systematic nodes into new MSR codes that can optimally repair all nodes. This transformation does not increase the alphabet size of the original codes, and only increases the sub-packetization by a factor that is equal to the number of parity nodes. Furthermore, the resultant MSR codes also have the optimal access property for all nodes if the original MDS storage codes have the optimal access property for systematic nodes.
- Apr 26 2016 cs.CV arXiv:1604.07279v1Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.
- A novel compressive-sensing based signal multiplexing scheme is proposed in this paper to further improve the multiplexing gain for multiple input multiple output (MIMO) system. At the transmitter side, a Gaussian random measurement matrix in compressive sensing is employed before the traditional spatial multiplexing in order to carry more data streams on the available spatial multiplexing streams of the underlying MIMO system. At the receiver side, it is proposed to reformulate the detection of the multiplexing signal into two steps. In the first step, the traditional MIMO equalization can be used to restore the transmitted spatial multiplexing signal of the MIMO system. While in the second step, the standard optimization based detection algorithm assumed in the compressive sensing framework is utilized to restore the CS multiplexing data streams, wherein the exhaustive over-complete dictionary is used to guarantee the sparse representation of the CS multiplexing signal. In order to avoid the excessive complexity, the sub-block based dictionary and the sub-block based CS restoration is proposed. Finally, simulation results are presented to show the feasibility of the proposed CS based enhanced MIMO multiplexing scheme. And our efforts in this paper shed some lights on the great potential in further improving the spatial multiplexing gain for the MIMO system.
- Mar 15 2016 cs.CV arXiv:1603.04015v3In this paper, a discriminative two-phase dictionary learning framework is proposed for classifying human action by sparse shape representations, in which the first-phase dictionary is learned on the selected discriminative frames and the second-phase dictionary is built for recognition using reconstruction errors of the first-phase dictionary as input features. We propose a "zeroth class" trick for detecting undiscriminating frames of the test video and eliminating them before voting on the action categories. Experimental results on benchmarks demonstrate the effectiveness of our method.
- Generalized orthogonal matching pursuit (gOMP), also called orthogonal multi-matching pursuit, is an extension of OMP in the sense that $N\geq1$ indices are identified per iteration. In this paper, we show that if the restricted isometry constant (RIC) $\delta_{NK+1}$ of a sensing matrix $\A$ satisfies $\delta_{NK+1} < 1/\sqrt {K/N+1}$, then under a condition on the signal-to-noise ratio, gOMP identifies at least one index in the support of any $K$-sparse signal $\x$ from $\y=\A\x+\v$ at each iteration, where $\v$ is a noise vector. Surprisingly, this condition does not require $N\leq K$ which is needed in Wang, \textitet al 2012 and Liu, \textitet al 2012. Thus, $N$ can have more choices. When $N=1$, it reduces to be a sufficient condition for OMP, which is less restrictive than that proposed in Wang 2015. Moreover, in the noise-free case, it is a sufficient condition for accurately recovering $\x$ in $K$ iterations which is less restrictive than the best known one. In particular, it reduces to the sharp condition proposed in Mo 2015 when $N=1$.
- Feb 04 2016 cs.CV arXiv:1602.01197v1Data imbalance is common in many vision tasks where one or more classes are rare. Without addressing this issue conventional methods tend to be biased toward the majority class with poor predictive accuracy for the minority class. These methods further deteriorate on small, imbalanced data that has a large degree of class overlap. In this study, we propose a novel discriminative sparse neighbor approximation (DSNA) method to ameliorate the effect of class-imbalance during prediction. Specifically, given a test sample, we first traverse it through a cost-sensitive decision forest to collect a good subset of training examples in its local neighborhood. Then we generate from this subset several class-discriminating but overlapping clusters and model each as an affine subspace. From these subspaces, the proposed DSNA iteratively seeks an optimal approximation of the test sample and outputs an unbiased prediction. We show that our method not only effectively mitigates the imbalance issue, but also allows the prediction to extrapolate to unseen data. The latter capability is crucial for achieving accurate prediction on small dataset with limited samples. The proposed imbalanced learning method can be applied to both classification and regression tasks at a wide range of imbalance levels. It significantly outperforms the state-of-the-art methods that do not possess an imbalance handling mechanism, and is found to perform comparably or even better than recent deep learning methods by using hand-crafted features only.
- Support recovery of sparse signals from noisy measurements with orthogonal matching pursuit (OMP) has been extensively studied. In this paper, we show that for any $K$-sparse signal $\x$, if a sensing matrix $\A$ satisfies the restricted isometry property (RIP) with restricted isometry constant (RIC) $\delta_{K+1} < 1/\sqrt {K+1}$, then under some constraints on the minimum magnitude of nonzero elements of $\x$, OMP exactly recovers the support of $\x$ from its measurements $\y=\A\x+\v$ in $K$ iterations, where $\v$ is a noise vector that is $\ell_2$ or $\ell_{\infty}$ bounded. This sufficient condition is sharp in terms of $\delta_{K+1}$ since for any given positive integer $K$ and any $1/\sqrt{K+1}\leq \delta<1$, there always exists a matrix $\A$ satisfying the RIP with $\delta_{K+1}=\delta$ for which OMP fails to recover a $K$-sparse signal $\x$ in $K$ iterations. Also, our constraints on the minimum magnitude of nonzero elements of $\x$ are weaker than existing ones. Moreover, we propose worst-case necessary conditions for the exact support recovery of $\x$, characterized by the minimum magnitude of the nonzero elements of $\x$.
- Dec 15 2015 cs.SC arXiv:1512.03901v2An algebraic approach to the maximum likelihood estimation problem is to solve a very structured parameterized polynomial system called likelihood equations that have finitely many complex (real or non-real) solutions. The only solutions that are statistically meaningful are the real solutions with positive coordinates. In order to classify the parameters (data) according to the number of real/positive solutions, we study how to efficiently compute the discriminants, say data-discriminants (DD), of the likelihood equations. We develop a probabilistic algorithm with three different strategies for computing DDs. Our implemented probabilistic algorithm based on Maple and FGb is more efficient than our previous version presented in ISSAC2015, and is also more efficient than the standard elimination for larger benchmarks. By applying RAGlib to a DD we compute, we give the real root classification of 3 by 3 symmetric matrix model.
- Dec 08 2015 cs.CV arXiv:1512.01891v1This paper proposes to learn high-performance deep ConvNets with sparse neural connections, referred to as sparse ConvNets, for face recognition. The sparse ConvNets are learned in an iterative way, each time one additional layer is sparsified and the entire model is re-trained given the initial weights learned in previous iterations. One important finding is that directly training the sparse ConvNet from scratch failed to find good solutions for face recognition, while using a previously learned denser model to properly initialize a sparser model is critical to continue learning effective features for face recognition. This paper also proposes a new neural correlation-based weight selection criterion and empirically verifies its effectiveness in selecting informative connections from previously learned models in each iteration. When taking a moderately sparse structure (26%-76% of weights in the dense model), the proposed sparse ConvNet model significantly improves the face recognition performance of the previous state-of-the-art DeepID2+ models given the same training data, while it keeps the performance of the baseline model with only 12% of the original parameters.
- Nov 23 2015 cs.CV arXiv:1511.06627v1Learning to simultaneously handle face alignment of arbitrary views, e.g. frontal and profile views, appears to be more challenging than we thought. The difficulties lay in i) accommodating the complex appearance-shape relations exhibited in different views, and ii) encompassing the varying landmark point sets due to self-occlusion and different landmark protocols. Most existing studies approach this problem via training multiple viewpoint-specific models, and conduct head pose estimation for model selection. This solution is intuitive but the performance is highly susceptible to inaccurate head pose estimation. In this study, we address this shortcoming through learning an Ensemble of Model Recommendation Trees (EMRT), which is capable of selecting optimal model configuration without prior head pose estimation. The unified framework seamlessly handles different viewpoints and landmark protocols, and it is trained by optimising directly on landmark locations, thus yielding superior results on arbitrary-view face alignment. This is the first study that performs face alignment on the full AFLWdataset with faces of different views including profile view. State-of-the-art performances are also reported on MultiPIE and AFW datasets containing both frontaland profile-view faces.
- Nov 23 2015 cs.CV arXiv:1511.06523v1Face detection is one of the most studied topics in the computer vision community. Much of the progresses have been made by the availability of face detection benchmark datasets. We show that there is a gap between current face detection performance and the real world requirements. To facilitate future face detection research, we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion, as shown in Fig. 1. Furthermore, we show that WIDER FACE dataset is an effective training source for face detection. We benchmark several representative detection systems, providing an overview of state-of-the-art performance and propose a solution to deal with large scale variation. Finally, we discuss common failure cases that worth to be further investigated. Dataset can be downloaded at: mmlab.ie.cuhk.edu.hk/projects/WIDERFace
- Binary representation is desirable for its memory efficiency, computation speed and robustness. In this paper, we propose adjustable bounded rectifiers to learn binary representations for deep neural networks. While hard constraining representations across layers to be binary makes training unreasonably difficult, we softly encourage activations to diverge from real values to binary by approximating step functions. Our final representation is completely binary. We test our approach on MNIST, CIFAR10, and ILSVRC2012 dataset, and systematically study the training dynamics of the binarization process. Our approach can binarize the last layer representation without loss of performance and binarize all the layers with reasonably small degradations. The memory space that it saves may allow more sophisticated models to be deployed, thus compensating the loss. To the best of our knowledge, this is the first work to report results on current deep network architectures using complete binary middle representations. Given the learned representations, we find that the firing or inhibition of a binary neuron is usually associated with a meaningful interpretation across different classes. This suggests that the semantic structure of a neural network may be manifested through a guided binarization process.
- We study the geometry of metrics and convexity structures on the space of phylogenetic trees, which is here realized as the tropical linear space of all \ ultrametrics. The ${\rm CAT}(0)$-metric of Billera-Holmes-Vogtman arises from the theory of orthant spaces. While its geodesics can be computed by the Owen-Provan algorithm, geodesic triangles are complicated. We show that the dimension of such a triangle can be arbitrarily high. Tropical convexity and the tropical metric behave better. They exhibit properties desirable for geometric statistics, such as geodesics of small depth.
- Caching is a promising solution to satisfy the ever increasing demands for the multi-media traffics. In caching networks, coded caching is a recently proposed technique that achieves significant performance gains over the uncoded caching schemes. However, to implement the coded caching schemes, each file has to be split into $F$ packets, which usually increases exponentially with the number of users $K$. Thus, designing caching schemes that decrease the order of $F$ is meaningful for practical implementations. In this paper, by reviewing the Ali-Niesen caching scheme, the placement delivery array (PDA) design problem is firstly formulated to characterize the placement issue and the delivery issue with a single array. Moreover, we show that, through designing appropriate PDA, new centralized coded caching schemes can be discovered. Secondly, it is shown that the Ali-Niesen scheme corresponds to a special class of PDA, which realizes the best coding gain with the least $F$. Thirdly, we present a new construction of PDA for the centralized caching system, wherein the cache size of each user $M$ (identical cache size is assumed at all users) and the number of files $N$ satisfies $M/N=1/q$ or ${(q-1)}/{q}$ ($q$ is an integer such that $q\geq 2$). The new construction can decrease the required $F$ from the order $O\left(e^{K\cdot\left(\frac{M}{N}\ln \frac{N}{M} +(1-\frac{M}{N})\ln \frac{N}{N-M}\right)}\right)$ of Ali-Niesen scheme to $O\left(e^{K\cdot\frac{M}{N}\ln \frac{N}{M}}\right)$ or $O\left(e^{K\cdot(1-\frac{M}{N})\ln\frac{N}{N-M}}\right)$ respectively, while the coding gain loss is only $1$.
- Sep 23 2015 cs.CV arXiv:1509.06451v1In this paper, we propose a novel deep convolutional network (DCN) that achieves outstanding performance on FDDB, PASCAL Face, and AFW. Specifically, our method achieves a high recall rate of 90.99% on the challenging FDDB benchmark, outperforming the state-of-the-art method by a large margin of 2.91%. Importantly, we consider finding faces from a new perspective through scoring facial parts responses by their spatial structure and arrangement. The scoring mechanism is carefully formulated considering challenging cases where faces are only partially visible. This consideration allows our network to detect faces under severe occlusion and unconstrained pose variation, which are the main difficulty and bottleneck of most existing face detection approaches. We show that despite the use of DCN, our network can achieve practical runtime speed.
- Social relation defines the association, e.g, warm, friendliness, and dominance, between two or more people. Motivated by psychological studies, we investigate if such fine-grained and high-level relation traits can be characterised and quantified from face images in the wild. To address this challenging problem we propose a deep model that learns a rich face representation to capture gender, expression, head pose, and age-related attributes, and then performs pairwise-face reasoning for relation prediction. To learn from heterogeneous attribute sources, we formulate a new network architecture with a bridging layer to leverage the inherent correspondences among these datasets. It can also cope with missing target attribute labels. Extensive experiments show that our approach is effective for fine-grained social relation learning in images and videos.
- Sep 10 2015 cs.CV arXiv:1509.02634v2This paper addresses semantic image segmentation by incorporating rich information into Markov Random Field (MRF), including high-order relations and mixture of label contexts. Unlike previous works that optimized MRFs using iterative algorithm, we solve MRF by proposing a Convolutional Neural Network (CNN), namely Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN architecture to model unary terms and additional layers are carefully devised to approximate the mean field algorithm (MF) for pairwise terms. It has several appealing properties. First, different from the recent works that combined CNN and MRF, where many iterations of MF were required for each training image during back-propagation, DPN is able to achieve high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing works as its special cases. Third, DPN makes MF easier to be parallelized and speeded up in Graphical Processing Unit (GPU). DPN is thoroughly evaluated on the PASCAL VOC 2012 dataset, where a single DPN model yields a new state-of-the-art segmentation accuracy.
- Separable codes were introduced to provide protection against illegal redistribution of copyrighted multimedia material. Let $\mathcal{C}$ be a code of length $n$ over an alphabet of $q$ letters. The descendant code ${\sf desc}(\mathcal{C}_0)$ of $\mathcal{C}_0 = \{{\bf c}_1, {\bf c}_2, \ldots, {\bf c}_t\} \subseteq {\mathcal{C}}$ is defined to be the set of words ${\bf x} = (x_1, x_2, \ldots,x_n)^T$ such that $x_i \in \{c_{1,i}, c_{2,i}, \ldots, c_{t,i}\}$ for all $i=1, \ldots, n$, where ${\bf c}_j=(c_{j,1},c_{j,2},\ldots,c_{j,n})^T$. $\mathcal{C}$ is a $\overline{t}$-separable code if for any two distinct $\mathcal{C}_1, \mathcal{C}_2 \subseteq \mathcal{C}$ with $|\mathcal{C}_1| \le t$, $|\mathcal{C}_2| \le t$, we always have ${\sf desc}(\mathcal{C}_1) \neq {\sf desc}(\mathcal{C}_2)$. Let $M(\overline{t},n,q)$ denote the maximal possible size of such a separable code. In this paper, an upper bound on $M(\overline{3},3,q)$ is derived by considering an optimization problem related to a partial Latin square, and then two constructions for $\overline{3}$-SC$(3,M,q)$s are provided by means of perfect hash families and Steiner triple systems.
- Updated on 24/09/2015: This update provides preliminary experiment results for fine-grained classification on the surveillance data of CompCars. The train/test splits are provided in the updated dataset. See details in Section 6.
- Jun 16 2015 cs.CV arXiv:1506.04395v2We develop a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labelling problem. We leverage recent advances of deep convolutional neural networks to generate an ordered high-level sequence from a whole word image, avoiding the difficult character segmentation problem. Then a deep recurrent model, building on long short-term memory (LSTM), is developed to robustly recognize the generated CNN sequences, departing from most existing approaches recognising each character independently. Our model has a number of appealing properties in comparison to existing scene text recognition methods: (i) It can recognise highly ambiguous words by leveraging meaningful context information, allowing it to work reliably without either pre- or post-processing; (ii) the deep CNN feature is robust to various image distortions; (iii) it retains the explicit order information in word image, which is essential to discriminate word strings; (iv) the model does not depend on pre-defined dictionary, and it can process unknown words and arbitrary strings. Codes for the DTRN will be available.
- May 21 2015 cs.CV arXiv:1505.04868v1Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features and deep-learned features. Our method also achieves superior performance to the state of the art on these datasets (HMDB51 65.9%, UCF101 91.5%).
- In the literature, few constructions of $n$-variable rotation symmetric bent functions have been presented, which either have restriction on $n$ or have algebraic degree no more than $4$. In this paper, for any even integer $n=2m\ge2$, a first systemic construction of $n$-variable rotation symmetric bent functions, with any possible algebraic degrees ranging from $2$ to $m$, is proposed.
- Apr 28 2015 cs.CV arXiv:1504.06993v1Lossy compression introduces complex compression artifacts, particularly the blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restores sharpened images that are accompanied with ringing effects. Inspired by the deep convolutional networks (DCN) on super-resolution, we formulate a compact and efficient network for seamless attenuation of different compression artifacts. We also demonstrate that a deeper model can be effectively trained with the features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate several practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance than the state-of-the-arts both on the benchmark datasets and the real-world use case (i.e. Twitter). In addition, we show that our method can be applied as pre-processing to facilitate other low-level vision routines when they take compressed images as input.
- In this paper, we reinterprets the $(k+2,k)$ Zigzag code in coding matrix and then propose an optimal exact repair strategy for its parity nodes, whose repair disk I/O approaches a lower bound derived in this paper.
- Feb 04 2015 cs.CV arXiv:1502.00873v1The state-of-the-art of face recognition has been significantly advanced by the emergence of deep learning. Very deep neural networks recently achieved great success on general object recognition because of their superb learning capacity. This motivates us to investigate their effectiveness on face recognition. This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition. These two architectures are rebuilt from stacked convolution and inception layers proposed in VGG net and GoogLeNet to make them suitable to face recognition. Joint face identification-verification supervisory signals are added to both intermediate and final feature extraction layers during training. An ensemble of the proposed two architectures achieves 99.53% LFW face verification accuracy and 96.0% LFW rank-1 face identification accuracy, respectively. A further discussion of LFW face verification result is given in the end.