results for au:Tang_X in:cs

- Mar 22 2018 cs.CV arXiv:1803.07737v1 Face detection has been well studied for many years, and one of the remaining challenges is to detect small, blurred and partially occluded faces in uncontrolled environments. This paper proposes a novel context-assisted single shot face detector, named PyramidBox, to handle the hard face detection problem. Observing the importance of the context, we improve the utilization of contextual information in the following three aspects. First, we design a novel contextual anchor, which we call PyramidAnchors, to supervise high-level contextual feature learning by a semi-supervised method. Second, we propose the Low-level Feature Pyramid Network to combine adequate high-level contextual semantic features with low-level facial features, which also allows PyramidBox to predict faces of all scales in a single shot. Third, we introduce a context-sensitive structure to increase the capacity of the prediction network and improve the final accuracy of the output. In addition, we use Data-anchor-sampling to augment the training samples across different scales, which increases the diversity of training data for smaller faces. By exploiting the value of context, PyramidBox achieves superior performance among the state-of-the-art on the two common face detection benchmarks, FDDB and WIDER FACE.
- Mar 20 2018 cs.CV arXiv:1803.06626v1 The available butterfly data sets comprise a few limited species, and their images are always standard patterns, without images of butterflies in their living environment. To overcome these limitations, we build a butterfly data set composed of all species of butterflies in China, with 4270 standard pattern images of 1176 butterfly species and 1425 living-environment images of 111 species. We propose to use the deep learning technique Faster R-CNN to train an automatic butterfly identification system that covers both butterfly position detection and species recognition. We delete species with only one living-environment image from the data set, then partition the remaining living-environment images into two subsets: one is used as the test subset, and the other is used as the training subset, combined either with all standard pattern images or with only the standard pattern images of the species that appear in the living-environment images. To construct the training subset for Faster R-CNN, nine methods were adopted to augment the training images, including vertical and horizontal flipping, rotation by different angles, adding noise, blurring, and contrast adjustment. Three prediction models were trained. The mAP (mean Average Precision) criterion was used to evaluate the performance of the prediction models. The experimental results demonstrate that our Faster R-CNN based automatic butterfly identification system performs well: its worst mAP is up to 60%, and it can simultaneously detect the positions of more than one butterfly in a single living-environment image and recognize the species of those butterflies.
- Labeling each instance in a large dataset is extremely labor- and time-consuming. One way to alleviate this problem is active learning, which aims to discover the most valuable instances for labeling in order to construct a powerful classifier. Considering both informativeness and representativeness provides a promising way to design a practical active learning method. However, most existing active learning methods select instances favoring either informativeness or representativeness. Meanwhile, many are designed for the binary case, so they may yield suboptimal solutions on datasets with multiple classes. In this paper, a multi-class active learning approach based on a hybrid informative and representative criterion is proposed. We combine informativeness and representativeness into one formula, which can be solved under a unified framework. Informativeness is measured by the minimum margin, while representativeness is measured by the maximum mean discrepancy. By minimizing the upper bound for the true risk, we generalize the empirical risk minimization principle to the active learning setting. At the same time, our proposed method makes full use of the label information and is designed for multiple classes, so it is suitable not only for binary problems but also for multi-class problems. We conduct experiments on twelve benchmark UCI data sets, and the experimental results demonstrate that the proposed method performs better than some state-of-the-art methods.
- Mar 06 2018 cs.LG arXiv:1803.01254v1 Spatial-temporal prediction has many applications such as climate forecasting and urban planning. In particular, traffic prediction has drawn increasing attention in the data mining research field, owing to the growing number of traffic-related datasets and its impact on real-world applications. For example, accurate taxi demand prediction can help taxi companies pre-allocate taxis to meet commuting demand. The key challenge of traffic prediction lies in how to model the complex spatial and temporal dependencies. In this paper, we make two important observations which have not been considered by previous studies: (1) the spatial dependencies between locations are dynamic; and (2) the temporal dependency follows strong periodicity but is not strictly periodic because of dynamic temporal shifting. Based on these two observations, we propose a novel Spatial-Temporal Dynamic Network (STDN) framework. In this framework, we propose a flow gating mechanism to learn the dynamic similarity between locations via traffic flow. A periodically shifted attention mechanism is designed to handle long-term periodic dependency and periodic temporal shifting. Furthermore, we extend our framework from region-based traffic prediction to traffic prediction for road intersections by using a graph convolutional structure. We conduct extensive experiments on several large-scale real traffic datasets and demonstrate the effectiveness of our approach over state-of-the-art methods.
- Mar 05 2018 cs.CV arXiv:1803.00839v1 Face recognition has achieved exceptional success thanks to the emergence of deep learning. However, many contemporary face recognition models still perform relatively poorly on profile faces compared to frontal faces. A key reason is that the numbers of frontal and profile training faces are highly imbalanced: there are far more frontal training samples than profile ones. In addition, it is intrinsically hard to learn a deep representation that is geometrically invariant to large pose variations. In this study, we hypothesize that there is an inherent mapping between frontal and profile faces, and consequently, their discrepancy in the deep representation space can be bridged by an equivariant mapping. To exploit this mapping, we formulate a novel Deep Residual EquivAriant Mapping (DREAM) block, which is capable of adaptively adding residuals to the input deep representation to transform a profile face representation to a canonical pose that simplifies recognition. The DREAM block consistently enhances the performance of profile face recognition for many strong deep networks, including ResNet models, without deliberately augmenting training data of profile faces. The block is easy to use, light-weight, and can be implemented with negligible computational overhead.
- Taxi demand prediction is an important building block to enabling intelligent transportation systems in a smart city. An accurate prediction model can help the city pre-allocate resources to meet travel demand and to reduce empty taxis on streets which waste energy and worsen the traffic congestion. With the increasing popularity of taxi requesting services such as Uber and Didi Chuxing (in China), we are able to collect large-scale taxi demand data continuously. How to utilize such big data to improve the demand prediction is an interesting and critical real-world problem. Traditional demand prediction methods mostly rely on time series forecasting techniques, which fail to model the complex non-linear spatial and temporal relations. Recent advances in deep learning have shown superior performance on traditionally challenging tasks such as image classification by learning the complex features and correlations from large-scale data. This breakthrough has inspired researchers to explore deep learning techniques on traffic prediction problems. However, existing methods on traffic prediction have only considered spatial relation (e.g., using CNN) or temporal relation (e.g., using LSTM) independently. We propose a Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework to model both spatial and temporal relations. Specifically, our proposed model consists of three views: temporal view (modeling correlations between future demand values with near time points via LSTM), spatial view (modeling local spatial correlation via local CNN), and semantic view (modeling correlations among regions sharing similar temporal patterns). Experiments on large-scale real taxi demand data demonstrate effectiveness of our approach over state-of-the-art methods.
- Jan 31 2018 cs.CL arXiv:1801.09872v1 This paper reviews the state of the art of semantic change computation, an emerging research field in computational linguistics, and proposes a framework that summarizes the literature by identifying and expounding five essential components in the field: diachronic corpus, diachronic word sense characterization, change modelling, evaluation data and data visualization. Despite the potential of the field, the review shows that current studies are mainly focused on testing hypotheses proposed in theoretical linguistics and that several core issues remain to be solved: the need for diachronic corpora of languages other than English, the need for comprehensive evaluation data, the comparison and construction of approaches to diachronic word sense characterization and change modelling, and further exploration of data visualization techniques for hypothesis justification.
- Online Social Networks (OSNs) attract billions of users to share information and communicate where viral marketing has emerged as a new way to promote the sales of products. An OSN provider is often hired by an advertiser to conduct viral marketing campaigns. The OSN provider generates revenue from the commission paid by the advertiser which is determined by the spread of its product information. Meanwhile, to propagate influence, the activities performed by users such as viewing video ads normally induce diffusion cost to the OSN provider. In this paper, we aim to find a seed set to optimize a new profit metric that combines the benefit of influence spread with the cost of influence propagation for the OSN provider. Under many diffusion models, our profit metric is the difference between two submodular functions which is challenging to optimize as it is neither submodular nor monotone. We design a general two-phase framework to select seeds for profit maximization and develop several bounds to measure the quality of the seed set constructed. Experimental results with real OSN datasets show that our approach can achieve high approximation guarantees and significantly outperform the baseline algorithms, including state-of-the-art influence maximization algorithms.
- Dec 19 2017 cs.CV arXiv:1712.06080v1 Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNNs have shown strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image has not been fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but few appearance clues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and to the Cityscapes dataset. The results show that SCNN can learn the spatial relationship for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
- Coded caching is an efficient technique for reducing the wireless network burden during peak times in Device-to-Device (D2D for short) communications. In a coded caching scheme, each file block should be divided into $F$ packets. It is meaningful to design a coded caching scheme with the rate and $F$ as small as possible, especially in practice for D2D networks. In this paper we first characterize a coded caching scheme for D2D networks by a simple array called a D2D placement delivery array (DPDA for short). Consequently, coded caching schemes for D2D networks can be discussed by means of an appropriate DPDA. Secondly, we derive lower bounds on the rate and $F$ of a DPDA. According to these two lower bounds, we show that the previously known deterministic scheme proposed by Ji et al. (IEEE Trans. Inform. Theory, 62(2): 849-869, 2016) reaches our lower bound on the rate, but does not meet the lower bound on $F$ for some parameters. Finally, for these parameters, we construct three classes of DPDAs which meet both of our lower bounds. Based on these DPDAs, three classes of coded caching schemes with low rate and lower $F$ are obtained for D2D networks.
- Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g. ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation is recently proposed to pre-train a network without any human-provided labels. The key of this new form of learning is to design a proxy task (e.g. image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representation for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a "mix-and-match" (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable to many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of target image segmentation task to surpass fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the "mix" stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A "match" stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or label is required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet pre-trained counterpart on both PASCAL VOC2012 dataset and CityScapes dataset.
- Networks based on distributed caching of content are a new architecture to alleviate the ongoing explosive demand for multi-media traffic rate. In caching networks, coded caching is a recently proposed technique that achieves significant performance gains compared to uncoded caching schemes. In this paper, we derive a lower bound on the average rate under a memory constraint for a family of caching allocation placements and a family of XOR cooperative deliveries. The lower bound gives insight into how placement and delivery affect the rate-memory tradeoff. Based on these clues, we design a new placement algorithm and two new delivery algorithms. On the one hand, the new placement scheme can allocate the cache more flexibly than the grouping scheme. On the other hand, the new delivery algorithms can exploit more cooperative opportunities than the known schemes. Simulations validate our idea.
- Wood identification has never been more important for global forest species protection and timber regulation. Macroscopic wood identification, as practiced by wood anatomists, can identify wood down to the genus level. This is sufficient to serve as frontline identification for law enforcement authorities fighting illegal logging and timber trade. However, frontline enforcement officials may lack the accuracy and confidence of a well-trained wood anatomist. Hence, computer-assisted methods such as machine vision are developed to enable rapid field identification by law enforcement officials. In this paper, we propose a rapid and robust macroscopic wood identification system using machine vision with an off-the-shelf smartphone and a retrofitted macro lens. Our system is cost-effective, easily accessible, fast and scalable, and at the same time provides human-level identification accuracy. A camera-enabled smartphone with Internet connectivity, coupled with a macro lens, provides simple and effective digital acquisition of the macroscopic wood images that are essential for macroscopic wood identification. The images are immediately streamed to a cloud server via the Internet connection for identification, which is done within seconds.
- Coded caching has recently become quite popular in wireless networks because it efficiently reduces the load during peak traffic times. Recently, the main concern has been the subpacketization level of a coded caching scheme. Although there are several classes of constructions, these schemes apply only to certain individual memory sizes. In practice, however, it is crucial to handle any memory size. In this paper, four classes of new schemes covering a wide range of memory sizes are constructed. Performance analyses show that our new schemes can significantly reduce the subpacketization level at the cost of a slight decrease in transmission efficiency during peak traffic times. Moreover, in some of the schemes the number of packets is polynomial or linear in the number of users.
- Aug 10 2017 cs.CV arXiv:1708.02760v1 The ability to ask questions is a powerful tool to gather information in order to learn about the world and resolve ambiguities. In this paper, we explore a novel problem of generating discriminative questions to help disambiguate visual instances. Our work can be seen as a complement and new extension to the rich research studies on image captioning and question answering. We introduce the first large-scale dataset with over 10,000 carefully annotated image-question tuples to facilitate benchmarking. In particular, each tuple consists of a pair of images and, on average, 4.6 discriminative questions (as positive samples) and 5.9 non-discriminative questions (as negative samples). In addition, we present an effective method for visual discriminative question generation. The method can be trained in a weakly supervised manner, without discriminative image-question tuples, using just existing visual question answering datasets. Promising results are shown against representative baselines through quantitative evaluations and user studies.
- Fashion landmarks are functional key points defined on clothes, such as corners of the neckline, hemline, and cuff. They have been recently introduced as an effective visual representation for fashion image understanding. However, detecting fashion landmarks is challenging due to background clutter, human poses, and scales. To remove these variations, previous works usually assumed that bounding boxes of clothes are provided in training and test as additional annotations, which are expensive to obtain and inapplicable in practice. This work addresses unconstrained fashion landmark detection, where clothing bounding boxes are provided in neither training nor test. To this end, we present a novel Deep LAndmark Network (DLAN), where bounding boxes and landmarks are jointly estimated and trained iteratively in an end-to-end manner. DLAN contains two dedicated modules: a Selective Dilated Convolution for handling scale discrepancies, and a Hierarchical Recurrent Spatial Transformer for handling background clutter. To evaluate DLAN, we present a large-scale fashion landmark dataset, namely the Unconstrained Landmark Database (ULD), consisting of 30K images. Statistics show that ULD is more challenging than existing datasets in terms of image scales, background clutter, and human poses. Extensive experiments demonstrate the effectiveness of DLAN over the state-of-the-art methods. DLAN also exhibits excellent generalization across different clothing categories and modalities, making it extremely suitable for real-world fashion analysis.
- Conventional video segmentation methods often rely on temporal continuity to propagate masks. Such an assumption suffers from issues like drifting and inability to handle large displacement. To overcome these issues, we formulate an effective mechanism to prevent the target from being lost via adaptive object re-identification. Specifically, our Video Object Segmentation with Re-identification (VS-ReID) model includes a mask propagation module and a ReID module. The former module produces an initial probability map by flow warping while the latter module retrieves missing instances by adaptive matching. With these two modules iteratively applied, our VS-ReID records a global mean (Region Jaccard and Boundary F measure) of 0.699, the best performance in 2017 DAVIS Challenge.
- Aug 01 2017 cs.CV arXiv:1707.09531v2 Since convolutional neural networks (CNNs) lack an inherent mechanism to handle large scale variations, we always need to compute feature maps multiple times for multi-scale object detection, which makes computational cost a bottleneck in practice. To address this, we devise a recurrent scale approximation (RSA) that computes the feature map only once and approximates the remaining maps on other levels from this map alone. At the core of RSA is the recursive rolling out mechanism: given an initial map at a particular scale, it generates the prediction at a smaller scale that is half the size of the input. To further increase efficiency and accuracy, we (a) design a scale-forecast network to globally predict potential scales in the image, since there is no need to compute maps on all levels of the pyramid; and (b) propose a landmark retracing network (LRN) to trace back locations of the regressed landmarks and generate a confidence score for each landmark; LRN can effectively alleviate false positives caused by the accumulated error in RSA. The whole system can be trained end-to-end in a unified CNN framework. Experiments demonstrate that our proposed algorithm is superior to state-of-the-art methods on face detection benchmarks and achieves comparable results for generic proposal generation. The source code of RSA is available at github.com/sciencefans/RSA-for-object-detection.
- In this paper, the one-sided secrecy of the two-way wiretap channel with feedback is investigated, where the confidential messages of one user over multiple transmissions are guaranteed secure against an external eavesdropper. On the one hand, one-sided secrecy satisfies the security demands of many practical scenarios. On the other hand, because the eavesdropper's observation is correlated with the confidential messages in successive blocks, the secrecy is measured over many blocks, instead of over a single block as in previous works. Thus, firstly, an achievable secrecy rate region is derived for the general two-way wiretap channel with feedback through multiple transmissions under one-sided secrecy. Secondly, outer bounds on the secrecy capacity region are also obtained. The gap between inner and outer bounds on the secrecy capacity region is explored via the binary input two-way wiretap channels. Most notably, the secrecy capacity regions are established for the XOR channel. Furthermore, the result shows that the achievable rate region with feedback is larger than that without feedback. Therefore, the beneficial role of feedback is precisely characterized for the two-way wiretap channel with feedback under one-sided secrecy.
- In this paper, the individual secrecy of the two-way wiretap channel is investigated, where two legitimate users' messages are separately guaranteed secure against an external eavesdropper. On the one hand, in some communication scenarios, it is impossible to achieve positive secrecy rates for both users under joint secrecy. On the other hand, individual secrecy satisfies the secrecy demands of many practical communication systems. Thus, firstly, an achievable secrecy rate region is derived for the general two-way wiretap channel with individual secrecy. In a deterministic channel, the region with individual secrecy is shown to be larger than that with joint secrecy. Secondly, outer bounds on the secrecy capacity region are obtained for the general two-way wiretap channel and for two classes of special two-way wiretap channels. The gap between inner and outer bounds on the secrecy capacity region is explored via the binary input two-way wiretap channels and the degraded Gaussian two-way wiretap channel. Most notably, the secrecy capacity regions are established for the XOR channel and the degraded Gaussian two-way wiretap channel. Furthermore, the secure sum-rate of the degraded Gaussian two-way wiretap channel under the individual secrecy constraint is demonstrated to be strictly larger than that under the joint secrecy constraint.
- Jul 18 2017 cs.CV arXiv:1707.05251v1 We introduce EnhanceGAN, an adversarial learning based model that performs automatic image enhancement. Traditional image enhancement frameworks involve training separate models for automatic cropping or color enhancement in a fully-supervised manner, which requires expensive annotations in the form of image pairs. In contrast to these approaches, our proposed EnhanceGAN only requires weak supervision (binary labels on image aesthetic quality) and is able to learn enhancement parameters for tasks including image cropping and color enhancement. The full differentiability of our image enhancement modules enables training the proposed EnhanceGAN in an end-to-end manner. A novel stage-wise learning scheme is further proposed to stabilize the training of each enhancement task and facilitate extension to other image enhancement techniques. Our weakly-supervised EnhanceGAN reports competitive quantitative results against supervised models in automatic image cropping using standard benchmarking datasets, and a user study confirms that the image enhancement results are on par with or even preferred over professional enhancement.
- Jul 03 2017 cs.MM arXiv:1706.10143v1 Many different parametric models for video quality assessment have been proposed in the past few years. This paper presents a review of nine recent models which cover a wide range of methodologies and have been validated for estimating video quality under different degradation factors. Each model is briefly described with its key algorithms and relevant parametric formulas. The generalization capability of each model to estimate video quality in real-application scenarios is evaluated and compared with other models, using a dataset created with video sequences from practical applications. These video sequences cover a wide range of realistic encoding parameters and are labeled with mean opinion scores (MOS) obtained via subjective tests. The strengths and weaknesses of each model are noted. Finally, future work towards a more general parametric model that could apply to a wider range of applications is discussed.
- Jun 12 2017 cs.CV arXiv:1706.02863v1 In this paper, we share our experience in designing a convolutional network-based face detector that can handle faces of an extremely wide range of scales. We show that faces with different scales can be modeled through a specialized set of deep convolutional networks with different structures. These detectors can be seamlessly integrated into a single unified network that can be trained end-to-end. In contrast to existing deep models that are designed for a wide scale range, our network does not require an image pyramid input and the model is of modest complexity. Our network, dubbed ScaleFace, achieves promising performance on the WIDER FACE and FDDB datasets with practical runtime speed. Specifically, our method achieves 76.4 average precision on the challenging WIDER FACE dataset and a 96% recall rate on the FDDB dataset at 7 frames per second (fps) for a 900 × 1300 input image.
- May 31 2017 cs.SI arXiv:1705.10442v2 Influence Maximization is an extensively studied problem that aims at selecting a set of initial seed nodes in Online Social Networks (OSNs) to spread the influence as widely as possible. However, it remains an open challenge to design fast and accurate algorithms to find solutions in large-scale OSNs. Prior Monte-Carlo-simulation-based methods are slow and not scalable, while other heuristic algorithms do not have any theoretical guarantee and have been shown to produce poor solutions in quite a few cases. In this paper, we propose hop-based algorithms that can easily scale to millions of nodes and billions of edges. Unlike previous heuristics, our proposed hop-based approaches can provide certain theoretical guarantees. Experimental evaluations with real OSN datasets demonstrate the efficiency and effectiveness of our algorithms.
- Sequences with low auto-correlation have been applied in code-division multiple access communication systems, radar and cryptography. Using the inverse Gray mapping, a quaternary sequence of even length $N$ can be obtained from two binary sequences of the same length, which are called component sequences. In this paper, using the interleaving method, we present several classes of component sequences constructed from twin-prime sequence pairs or GMW sequence pairs given by Tang and Ding in 2010, and from two, three or four binary sequences defined by cyclotomic classes of order $4$. Hence we can obtain new classes of quaternary sequences, which are different from known ones, since known component sequences are constructed from a pair of binary sequences with optimal auto-correlation or from Sidel'nikov sequences.
- May 09 2017 cs.CV arXiv:1705.02953v1 Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models can be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of the TSN framework given limited training samples. Our approach obtains state-of-the-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
- Apr 25 2017 cs.CV arXiv:1704.06904v1In this work, we propose "Residual Attention Network", a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that, our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs comparing to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels.
- Apr 21 2017 cs.CV arXiv:1704.06228v2Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG) is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
- Recently, it has been shown that the time-varying multiple-access channel (MAC) with perfect channel state information (CSI) at the receiver and delayed feedback CSI at the transmitters can be modeled as the finite state MAC (FS-MAC) with delayed state feedback, where the time variation of the channel is characterized by the statistics of the underlying state process. To study the fundamental limit of the secure transmission over multi-user wireless communication systems, we re-visit the FS-MAC with delayed state feedback by considering an external eavesdropper, which we call the finite state multiple-access wiretap channel (FS-MAC-WT) with delayed feedback. The main contribution of this paper is to show that taking full advantage of the delayed channel output feedback helps to increase the secrecy rate region of the FS-MAC-WT with delayed state feedback, and the results of this paper are further illustrated by a degraded Gaussian fading example.
- We propose a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. Unlike the conventional model cascade (MC) that is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models. Earlier sub-models are trained to handle easy and confident regions, and they progressively feed-forward harder regions to the next sub-model for processing. Convolutions are only calculated on these regions to reduce computations. The proposed method possesses several advantages. First, LC classifies most of the easy regions in the shallow stage and makes deeper stage focuses on a few hard regions. Such an adaptive and 'difficulty-aware' learning improves segmentation performance. Second, LC accelerates both training and testing of deep network thanks to early decisions in the shallow stage. Third, in comparison to MC, LC is an end-to-end trainable framework, allowing joint learning of all sub-models. We evaluate our method on PASCAL VOC and Cityscapes datasets, achieving state-of-the-art performance and fast speed.
- In this paper, an improved receiver based on diversity combining is proposed to improve the bit error rate (BER) performance of layered asymmetrically clipped optical fast orthogonal frequency division multiplexing (ACO-FOFDM) for intensity-modulated and direct-detected (IM/DD) optical transmission systems. Layered ACO-FOFDM can compensate the weakness of traditional ACO-FOFDM in low spectral efficiency, the utilization of discrete cosine transform in FOFDM system instead of fast Fourier transform in OFDM system can reduce the computational complexity without any influence on BER performance. The BER performances of layered ACO-FOFDM system with improved receiver based on diversity combining and DC-offset FOFDM (DCO-FOFDM) system with optimal DC-bias are compared at the same spectral efficiency. Simulation results show that under different optical bit energy to noise power ratios, layered ACO-FOFDM system with improved receiver has 2.86dB, 5.26dB and 5.72dB BER performance advantages at forward error correction limit over DCO-FOFDM system when the spectral efficiencies are 1 bit/s/Hz, 2 bits/s/Hz and 3 bits/s/Hz, respectively. Layered ACO-FOFDM system with improved receiver based on diversity combining is suitable for application in the adaptive IM/DD systems with zero DC-bias.
- Mar 09 2017 cs.CV arXiv:1703.02716v1Detecting activities in untrimmed videos is an important but challenging task. The performance of existing methods remains unsatisfactory, e.g., they often meet difficulties in locating the beginning and end of a long complex action. In this paper, we propose a generic framework that can accurately detect a wide variety of activities from untrimmed videos. Our first contribution is a novel proposal scheme that can efficiently generate candidates with accurate temporal boundaries. The other contribution is a cascaded classification pipeline that explicitly distinguishes between relevance and completeness of a candidate instance. On two challenging temporal activity detection datasets, THUMOS14 and ActivityNet, the proposed framework significantly outperforms the existing state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
- Coded caching, which is an effective technique to increase the transmission efficiency during peak traffic times, has recently become quite popular in the coding community. Generally, the transmission efficiency during peak traffic times is measured by the rate, i.e., the efficiency increases as the rate decreases. In order to implement a coded caching scheme, each file in the library must be split into a certain number of packets. This number directly reflects the complexity of a coded caching scheme, i.e., the complexity increases with the packet number. However, there exists a tradeoff between the rate and the packet number, so it is meaningful to characterize this tradeoff and design the related Pareto-optimal coded caching schemes with respect to both parameters. Recently, a new concept called placement delivery array (PDA) was proposed to characterize the coded caching scheme. However, as far as we know, no one has yet proved that one of the previously known PDAs is Pareto-optimal. In this paper, we first derive two lower bounds on the rate under the framework of PDA. Consequently, the PDA proposed by Maddah-Ali and Niesen is Pareto-optimal, and a tradeoff between rate and packet number is obtained for some parameters. Then, from the above observations and the viewpoint of combinatorial design, two new classes of Pareto-optimal PDAs are obtained. Based on these PDAs, schemes with low rate and packet number are obtained. Finally, the performance of some previously known PDAs is estimated by comparing with these two classes of schemes.
- Feb 24 2017 cs.CV arXiv:1702.07191v2 As the intermediate-level task connecting image captioning and object detection, visual relationship detection has started to catch researchers' attention because of its descriptive power and clear structure. It detects the objects and captures their pair-wise interactions with a subject-predicate-object triplet, e.g. person-ride-horse. In this paper, each visual relationship is considered as a phrase with three components. We formulate visual relationship detection as three inter-connected recognition problems and propose a Visual Phrase guided Convolutional Neural Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a Phrase-guided Message Passing Structure (PMPS) to establish the connection among relationship components and help the model consider the three problems jointly. A corresponding non-maximum suppression method and model training strategy are also proposed. Experimental results show that our ViP-CNN outperforms the state-of-the-art method in both speed and accuracy. We further pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which is found to perform better than pretraining on ImageNet for this task.
- We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.
- Jan 31 2017 cs.CV arXiv:1701.08393v3We propose a deep convolutional neural network (CNN) for face detection leveraging on facial attributes based supervision. We observe a phenomenon that part detectors emerge within CNN trained to classify attributes from uncropped face images, without any explicit part supervision. The observation motivates a new method for finding faces through scoring facial parts responses by their spatial structure and arrangement. The scoring mechanism is data-driven, and carefully formulated considering challenging cases where faces are only partially visible. This consideration allows our network to detect faces under severe occlusion and unconstrained pose variations. Our method achieves promising performance on popular benchmarks including FDDB, PASCAL Faces, AFW, and WIDER FACE.
- Faster-than-Nyquist (FTN) signal achieves higher spectral efficiency and capacity compared to Nyquist signal due to its smaller pulse interval or narrower subcarrier spacing. Shannon limit typically defines the upper-limit capacity of Nyquist signal. To the best of our knowledge, the mathematical expression for the capacity limit of FTN non-orthogonal frequency-division multiplexing (NOFDM) signal is first demonstrated in this paper. The mathematical expression shows that FTN NOFDM signal has the potential to achieve a higher capacity limit compared to Nyquist signal. In this paper, we demonstrate the principle of FTN NOFDM by taking fractional cosine transform-based NOFDM (FrCT-NOFDM) for instance. FrCT-NOFDM is first proposed and implemented by both simulation and experiment. When the bandwidth compression factor $\alpha$ is set to $0.8$ in FrCT-NOFDM, the subcarrier spacing is equal to $40\%$ of the symbol rate per subcarrier, thus the transmission rate is about $25\%$ faster than Nyquist rate. FTN NOFDM with higher capacity would be promising in the future communication systems, especially in the bandwidth-limited applications.
- Coded caching is a promising technique to mitigate the network burden in peak hours, and it attains more prominent gains than uncoded caching. Coded caching schemes can be classified into two types, namely, centralized and decentralized schemes, according to whether the placement procedures are carefully designed or operated at random. However, most of the previous analysis assumes that the links between the server and the users are error-free. In this paper, we explore coded caching based delivery design in wireless networks, where all the connected wireless links are different. For both the centralized and the decentralized cases, we propose two delivery schemes, namely, the orthogonal delivery scheme and the concurrent delivery scheme. We focus on the transmission time slots spent on satisfying the system requests, and prove that for both the centralized and the decentralized cases, the concurrent delivery scheme always outperforms the orthogonal delivery scheme. Furthermore, for the orthogonal delivery scheme, we derive the gap in terms of transmission time between the decentralized and centralized cases, which is essentially no more than 1.5.
- In wireless networks, coded caching is an effective technique to reduce network congestion during peak traffic times. Recently, a new concept called placement delivery array (PDA) was proposed to characterize the coded caching scheme. So far, only one class of PDAs by Maddah-Ali and Niesen is known to be optimal. In this paper, we mainly focus on constructing optimal PDAs. Firstly, we derive some lower bounds. Next, we present several infinite classes of PDAs, which are shown to be optimal with respect to the new bounds.
- Existing deep embedding methods in vision tasks are capable of learning a compact Euclidean space from images, where Euclidean distances correspond to a similarity metric. To make learning more effective and efficient, hard sample mining is usually employed, with samples identified through computing the Euclidean feature distance. However, the global Euclidean distance cannot faithfully characterize the true feature similarity in a complex visual feature space, where the intraclass distance in a high-density region may be larger than the interclass distance in low-density regions. In this paper, we introduce a Position-Dependent Deep Metric (PDDM) unit, which is capable of learning a similarity metric adaptive to local feature structure. The metric can be used to select genuinely hard samples in a local neighborhood to guide the deep embedding learning in an online and robust manner. The new layer is appealing in that it is pluggable to any convolutional networks and is trained end-to-end. Our local similarity-aware feature embedding not only demonstrates faster convergence and boosted performance on two complex image retrieval datasets, its large margin nature also leads to superior generalization results under the large and open set scenarios of transfer learning and zero-shot learning on ImageNet 2010 and ImageNet-10K datasets.
- The integer-forcing (IF) linear multiple-input and multiple-output (MIMO) receiver is a recently proposed suboptimal receiver which nearly reaches the performance of the optimal maximum likelihood receiver for the entire signal-to-noise ratio (SNR) range. The optimal integer coefficient matrix $\mathbf{A}^\star\in \mathbb{Z}^{N_t\times N_t}$ for IF maximizes the total achievable rate, where $N_t$ is the column dimension of the channel matrix. To obtain $\mathbf{A}^\star$, a successive minima problem (SMP) on an $N_t$-dimensional lattice that is suspected to be NP-hard needs to be solved. In this paper, an efficient exact algorithm for the SMP is proposed. For efficiency, our algorithm first uses the LLL reduction to reduce the SMP. Then, different from existing SMP algorithms which form the transformed $\mathbf{A}^\star$ column by column in $N_t$ iterations, it first initializes with a suboptimal matrix. The suboptimal matrix is then updated, by utilizing the integer vectors obtained by employing an improved Schnorr-Euchner search algorithm to search the candidate integer vectors within a certain hyper-ellipsoid, via a novel and efficient algorithm until the transformed $\mathbf{A}^{\star}$ is obtained in only one iteration. Finally, the algorithm returns the matrix obtained by left multiplying the solution of the reduced SMP with the unimodular matrix that is generated by the LLL reduction. We then rigorously prove the optimality of the proposed algorithm by showing that it exactly solves the SMP. Furthermore, we develop a theoretical complexity analysis to show that the complexity of the new algorithm in big-O notation is an order of magnitude smaller, with respect to $N_t$, than that of the existing most efficient algorithm. Finally, simulation results are presented to illustrate the optimality and efficiency of our novel algorithm.
- Oct 05 2016 cs.CV arXiv:1610.00838v2This survey aims at reviewing recent computer vision techniques used in the assessment of image aesthetic quality. Image aesthetic assessment aims at computationally distinguishing high-quality photos from low-quality ones based on photographic rules, typically in the form of binary classification or quality scoring. A variety of approaches has been proposed in the literature trying to solve this challenging problem. In this survey, we present a systematic listing of the reviewed approaches based on visual feature types (hand-crafted features and deep features) and evaluation criteria (dataset characteristics and evaluation metrics). Main contributions and novelties of the reviewed approaches are highlighted and discussed. In addition, following the emergence of deep learning techniques, we systematically evaluate recent deep learning settings that are useful for developing a robust deep model for aesthetic scoring. Experiments are conducted using simple yet solid baselines that are competitive with the current state-of-the-arts. Moreover, we discuss the possibility of manipulating the aesthetics of images through computational approaches. We hope that our survey could serve as a comprehensive reference source for future research on the study of image aesthetic assessment.
- Sep 22 2016 cs.CV arXiv:1609.06426v3Interpersonal relation defines the association, e.g., warm, friendliness, and dominance, between two or more people. Motivated by psychological studies, we investigate if such fine-grained and high-level relation traits can be characterized and quantified from face images in the wild. We address this challenging problem by first studying a deep network architecture for robust recognition of facial expressions. Unlike existing models that typically learn from facial expression labels alone, we devise an effective multitask network that is capable of learning from rich auxiliary attributes such as gender, age, and head pose, beyond just facial expression data. While conventional supervised training requires datasets with complete labels (e.g., all samples must be labeled with gender, age, and expression), we show that this requirement can be relaxed via a novel attribute propagation method. The approach further allows us to leverage the inherent correspondences between heterogeneous attribute sources despite the disparate distributions of different datasets. With the network we demonstrate state-of-the-art results on existing facial expression recognition benchmarks. To predict inter-personal relation, we use the expression recognition network as branches for a Siamese model. Extensive experiments show that our model is capable of mining mutual context of faces for accurate fine-grained interpersonal prediction.
- The technique of coded caching proposed by Maddah-Ali and Niesen is a promising approach to alleviate the load of networks during busy times. Recently, the placement delivery array (PDA) was presented to characterize both the placement and delivery phases in a single array for the centralized coded caching algorithm. In this paper, we interpret PDA from a new perspective, i.e., the strong edge coloring of a bipartite graph. We prove that a PDA is equivalent to a strong edge colored bipartite graph. Thus, we can construct a class of PDAs from existing structures in bipartite graphs. The class includes the scheme proposed by Maddah-Ali et al. and a more general class of PDAs proposed by Shangguan et al. as special cases. Moreover, it is capable of generating many PDAs with a flexible tradeoff between the sub-packet level and load.
- Markov Random Fields (MRFs), a formulation widely used in generative image modeling, have long been plagued by a lack of expressive power. This issue is primarily due to the fact that conventional MRF formulations tend to use simplistic factors to capture local patterns. In this paper, we move beyond such limitations, and propose a novel MRF model that uses fully-connected neurons to express the complex interactions among pixels. Through theoretical analysis, we reveal an inherent connection between this model and recurrent neural networks, and thereon derive an approximated feed-forward network that couples multiple RNNs along opposite directions. This formulation combines the expressive power of deep neural networks and the cyclic dependency structure of MRF in a unified model, bringing the modeling capability to a new level. The feed-forward approximation also allows it to be efficiently learned from data. Experimental results on a variety of low-level vision tasks show notable improvement over the state of the art.
- Aug 11 2016 cs.CV arXiv:1608.03049v1 Visual fashion analysis has attracted much attention in recent years. Previous work represented clothing regions by either bounding boxes or human joints. This work presents fashion landmark detection or fashion alignment, which is to predict the positions of functional key points defined on the fashion items, such as the corners of neckline, hemline, and cuff. To encourage future studies, we introduce a fashion landmark dataset with over 120K images, where each image is labeled with eight landmarks. With this dataset, we study fashion alignment by cascading multiple convolutional neural networks in three stages. These stages gradually improve the accuracies of landmark predictions. Extensive experiments demonstrate the effectiveness of the proposed method, as well as its generalization ability to pose estimation. Fashion landmark is also compared to clothing bounding boxes and human joints in two applications, fashion attribute prediction and clothes retrieval, showing that fashion landmark is a more discriminative representation to understand fashion images.
- Aug 10 2016 cs.CV arXiv:1608.02778v1Lossy compression introduces complex compression artifacts, particularly blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restore sharpened images that are accompanied with ringing effects. Inspired by the success of deep convolutional networks (DCN) on superresolution, we formulate a compact and efficient network for seamless attenuation of different compression artifacts. To meet the speed requirement of real-world applications, we further accelerate the proposed baseline model by layer decomposition and joint use of large-stride convolutional and deconvolutional layers. This also leads to a more general CNN framework that has a close relationship with the conventional Multi-Layer Perceptron (MLP). Finally, the modified network achieves a speed up of 7.5 times with almost no performance loss compared to the baseline model. We also demonstrate that a deeper model can be effectively trained with features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate three practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance than the state-of-the-art methods both on benchmark datasets and a real-world use case.
- Aug 03 2016 cs.CV arXiv:1608.00859v1 Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 ($69.4\%$) and UCF101 ($94.2\%$). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
- Aug 03 2016 cs.CV arXiv:1608.00797v1This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.
- Aug 02 2016 cs.CV arXiv:1608.00367v1 As a successful deep model applied in image super-resolution (SR), the Super-Resolution Convolutional Neural Network (SRCNN) has demonstrated superior performance to the previous hand-crafted models in both speed and restoration quality. However, the high computational cost still hinders it from practical usage that demands real-time performance (24 fps). In this paper, we aim at accelerating the current SRCNN, and propose a compact hourglass-shaped CNN structure for faster and better SR. We re-design the SRCNN structure mainly in three aspects. First, we introduce a deconvolution layer at the end of the network, so that the mapping is learned directly from the original low-resolution image (without interpolation) to the high-resolution one. Second, we reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. Third, we adopt smaller filter sizes but more mapping layers. The proposed model achieves a speed up of more than 40 times with even superior restoration quality. Further, we present the parameter settings that can achieve real-time performance on a generic CPU while still maintaining good performance. A corresponding transfer strategy is also proposed for fast training and testing across different upscaling factors.
- Jul 19 2016 cs.CV arXiv:1607.05046v1We present a novel framework for hallucinating faces of unconstrained poses and with very low resolution (face size as small as 5pxIOD). In contrast to existing studies that mostly ignore or assume pre-aligned face spatial configuration (e.g. facial landmarks localization or dense correspondence field), we alternatingly optimize two complementary tasks, namely face hallucination and dense correspondence field estimation, in a unified framework. In addition, we propose a new gated deep bi-network that contains two functionality-specialized branches to recover different levels of texture details. Extensive experiments demonstrate that such formulation allows exceptional hallucination quality on in-the-wild low-res faces with significant pose and illumination variations.
- Locally repairable codes are desirable for distributed storage systems to improve the repair efficiency. In this paper, we first build a bridge between locally repairable code and packing. As an application of this bridge, some optimal locally repairable codes can be obtained by packings, which gives optimal locally repairable codes with flexible parameters.
- Semantic segmentation tasks can be well modeled by Markov Random Field (MRF). This paper addresses semantic segmentation by incorporating high-order relations and mixture of label contexts into MRF. Unlike previous works that optimized MRFs using iterative algorithm, we solve MRF by proposing a Convolutional Neural Network (CNN), namely Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN to model unary terms and additional layers are devised to approximate the mean field (MF) algorithm for pairwise terms. It has several appealing properties. First, different from the recent works that required many iterations of MF during back-propagation, DPN is able to achieve high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing models as its special cases. Furthermore, pairwise terms in DPN provide a unified framework to encode rich contextual information in high-dimensional data, such as images and videos. Third, DPN makes MF easier to be parallelized and speeded up, thus enabling efficient inference. DPN is thoroughly evaluated on standard semantic image/video segmentation benchmarks, where a single DPN model yields state-of-the-art segmentation accuracies on PASCAL VOC 2012, Cityscapes dataset and CamVid dataset.
- Let $\mathbb{F}_{p^{m}}$ be a finite field with cardinality $p^{m}$ and $R=\mathbb{F}_{p^{m}}+u\mathbb{F}_{p^{m}}$ with $u^{2}=0$. We aim to determine all $\alpha+u\beta$-constacyclic codes of length $np^{s}$ over $R$, where $\alpha,\beta\in\mathbb{F}_{p^{m}}^{*}$, $n, s\in\mathbb{N}_{+}$ and $\gcd(n,p)=1$. Let $\alpha_{0}\in\mathbb{F}_{p^{m}}^{*}$ and $\alpha_{0}^{p^{s}}=\alpha$. The residue ring $R[x]/\langle x^{np^{s}}-\alpha-u\beta\rangle$ is a chain ring with the maximal ideal $\langle x^{n}-\alpha_{0}\rangle$ in the case that $x^{n}-\alpha_{0}$ is irreducible in $\mathbb{F}_{p^{m}}[x]$. If $x^{n}-\alpha_{0}$ is reducible in $\mathbb{F}_{p^{m}}[x]$, we give the explicit expressions of the ideals of $R[x]/\langle x^{np^{s}}-\alpha-u\beta\rangle$. Besides, the number of codewords and the dual code of every $\alpha+u\beta$-constacyclic code are provided.
- Caching is a promising solution to satisfy the ongoing explosive demands for multi-media traffics. Recently, Maddah-Ali and Niesen proposed both centralized and de-centralized coded caching schemes, which are able to attain significant performance gains over uncoded caching schemes. Particular, their work indicates that there exists a performance gap between the decentralized coded caching scheme and the centralized coded caching scheme. In this paper, we investigate this gap. As a result, we prove that the multiplicative gap (i.e., the ratio of their performances) is between 1 and 1:5. The upper bound tightens the original one of 12 by Maddah-Ali and Niesen, while the lower bound verifies the intuition that the centralized coded caching scheme always outperforms its decentralized counterpart. Notably, both bounds are achievable in some cases. Furthermore, we prove that the gap can be arbitrarily close to 1 if the number of users is large enough, which suggests the great potential in practical applications to use the less optimal but more practical decentralized coded caching scheme
- In this paper, we investigate some sufficient conditions based on the block restricted isometry property (block-RIP) for exact (when $\mathbf{v}=\mathbf{0}$) and stable (when $\mathbf{v}\neq\mathbf{0}$) recovery of block sparse signals $\mathbf{x}$ from measurements $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{v}$, where $\mathbf{v}$ is an $\ell_2$-bounded noise vector (i.e., $\|\mathbf{v}\|_2\leq \epsilon$ for some constant $\epsilon$). First, on the one hand, we show that if $\mathbf{A}$ satisfies the block-RIP with constant $\delta_{K+1}<1/\sqrt{K+1}$, then every $K$-block sparse signal $\mathbf{x}$ can be exactly or stably recovered by BOMP in $K$ iterations; on the other hand, for any $K\geq 1$ and $1/\sqrt{K+1}\leq t<1$, there exists a matrix $\mathbf{A}$ satisfying the block-RIP with $\delta_{K+1}=t$ and a $K$-block sparse signal $\mathbf{x}$ such that the BOMP algorithm may fail to recover $\mathbf{x}$ in $K$ iterations. Second, we study some sufficient conditions for recovering $\alpha$-strongly-decaying $K$-block sparse signals. Surprisingly, it is shown that if $\mathbf{A}$ satisfies the block-RIP with $\delta_{K+1}<\sqrt{2}/2$, every $\alpha$-strongly-decaying $K$-block sparse signal can be exactly or stably recovered by the BOMP algorithm in $K$ iterations, under some conditions on $\alpha$. Our newly found sufficient condition on the block-RIP of $\mathbf{A}$ is weaker than that for $\ell_1$ minimization for this special class of sparse signals, which further supports the effectiveness of BOMP. Furthermore, for any $K\geq 1$, $\alpha>1$ and $\sqrt{2}/2\leq t<1$, the recovery of $\mathbf{x}$ may fail in $K$ iterations for a sensing matrix $\mathbf{A}$ which satisfies the block-RIP with $\delta_{K+1}=t$. Finally, we study some sufficient conditions for partial recovery of block sparse signals. Specifically, if $\mathbf{A}$ satisfies the block-RIP with $\delta_{K+1}<\sqrt{2}/2$, then BOMP is guaranteed to recover some blocks of $\mathbf{x}$ if these blocks satisfy a sufficient condition. We further show that this condition is sharp.
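For readers unfamiliar with BOMP (block orthogonal matching pursuit), a minimal sketch of the usual greedy loop follows. It assumes all blocks have the same length `d` and are stored contiguously. The function name, the stopping tolerance and the test data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def bomp(A, y, K, d, tol=1e-10):
    """Greedy block-sparse recovery: pick one block of d columns per iteration.

    A: (m, n) sensing matrix with n divisible by d; y: measurements;
    K: number of iterations (assumed block sparsity, K >= 1).
    Returns the estimate x_hat and the sorted list of selected block indices.
    """
    m, n = A.shape
    assert n % d == 0, "columns must split into equal-length blocks"
    y = np.asarray(y, dtype=float)
    chosen, cols, coef = [], [], np.zeros(0)
    residual = y.copy()
    for _ in range(K):
        # Score each block by the l2 norm of its correlation with the residual.
        scores = np.linalg.norm((A.T @ residual).reshape(n // d, d), axis=1)
        if chosen:
            scores[chosen] = -np.inf  # never reselect a block
        b = int(np.argmax(scores))
        chosen.append(b)
        cols.extend(range(b * d, (b + 1) * d))
        # Least-squares fit on all selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
        residual = y - A[:, cols] @ coef
        if np.linalg.norm(residual) <= tol:
            break
    x_hat = np.zeros(n)
    x_hat[cols] = coef
    return x_hat, sorted(chosen)

# Noise-free demo with a random Gaussian matrix and 2 active blocks of length 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 60)) / np.sqrt(40)
x = np.zeros(60)
x[6:9] = [1.0, -2.0, 0.5]
x[30:33] = [3.0, 0.0, 1.0]
x_hat, blocks = bomp(A, A @ x, K=2, d=3)
print(blocks, np.linalg.norm(x_hat - x))
```

Whether this demo recovers the signal depends on the random matrix; the block-RIP conditions in the abstract are what guarantee success.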
- We consider the problem of constructing exact-repair minimum storage regenerating (MSR) codes, for which both the systematic nodes and parity nodes can be repaired optimally. Although there exist several recent explicit high-rate MSR code constructions (usually with certain restrictions on the coding parameters), quite a few constructions in the literature only allow the optimal repair of systematic nodes. This phenomenon suggests that there might be a barrier between explicitly constructing codes that can only optimally repair systematic nodes and those that can optimally repair both systematic nodes and parity nodes. In the work, we show that this barrier can be completely circumvented by providing a generic transformation that is able to convert any non-binary linear maximum distance separable (MDS) storage codes that can optimally repair only systematic nodes into new MSR codes that can optimally repair all nodes. This transformation does not increase the alphabet size of the original codes, and only increases the sub-packetization by a factor that is equal to the number of parity nodes. Furthermore, the resultant MSR codes also have the optimal access property for all nodes if the original MDS storage codes have the optimal access property for systematic nodes.
- Apr 26 2016 cs.CV arXiv:1604.07279v1Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.
- A novel compressive-sensing based signal multiplexing scheme is proposed in this paper to further improve the multiplexing gain for multiple input multiple output (MIMO) systems. At the transmitter side, a Gaussian random measurement matrix in compressive sensing is employed before the traditional spatial multiplexing in order to carry more data streams on the available spatial multiplexing streams of the underlying MIMO system. At the receiver side, it is proposed to reformulate the detection of the multiplexing signal into two steps. In the first step, the traditional MIMO equalization can be used to restore the transmitted spatial multiplexing signal of the MIMO system. In the second step, the standard optimization based detection algorithm assumed in the compressive sensing framework is utilized to restore the CS multiplexing data streams, wherein the exhaustive over-complete dictionary is used to guarantee the sparse representation of the CS multiplexing signal. In order to avoid excessive complexity, a sub-block based dictionary and a sub-block based CS restoration are proposed. Finally, simulation results are presented to show the feasibility of the proposed CS based enhanced MIMO multiplexing scheme. Our efforts in this paper shed some light on the great potential in further improving the spatial multiplexing gain for the MIMO system.
- Mar 15 2016 cs.CV arXiv:1603.04015v3In this paper, a discriminative two-phase dictionary learning framework is proposed for classifying human action by sparse shape representations, in which the first-phase dictionary is learned on the selected discriminative frames and the second-phase dictionary is built for recognition using reconstruction errors of the first-phase dictionary as input features. We propose a "zeroth class" trick for detecting undiscriminating frames of the test video and eliminating them before voting on the action categories. Experimental results on benchmarks demonstrate the effectiveness of our method.
- Generalized orthogonal matching pursuit (gOMP), also called orthogonal multi-matching pursuit, is an extension of OMP in the sense that $N\geq1$ indices are identified per iteration. In this paper, we show that if the restricted isometry constant (RIC) $\delta_{NK+1}$ of a sensing matrix $\mathbf{A}$ satisfies $\delta_{NK+1} < 1/\sqrt {K/N+1}$, then under a condition on the signal-to-noise ratio, gOMP identifies at least one index in the support of any $K$-sparse signal $\mathbf{x}$ from $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{v}$ at each iteration, where $\mathbf{v}$ is a noise vector. Surprisingly, this condition does not require $N\leq K$, which is needed in Wang et al. 2012 and Liu et al. 2012. Thus, $N$ can have more choices. When $N=1$, it reduces to a sufficient condition for OMP, which is less restrictive than that proposed in Wang 2015. Moreover, in the noise-free case, it is a sufficient condition for accurately recovering $\mathbf{x}$ in $K$ iterations which is less restrictive than the best known one. In particular, it reduces to the sharp condition proposed in Mo 2015 when $N=1$.
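The selection rule that distinguishes gOMP from OMP can be sketched as follows. This is a generic textbook-style loop, not the authors' implementation. It assumes $NK$ does not exceed the number of measurements, and the function name and stopping tolerance are hypothetical.

```python
import numpy as np

def gomp(A, y, K, N=1, tol=1e-10):
    """Generalized OMP: select the N best-correlated columns per iteration.

    A: (m, n) sensing matrix; y: measurements; K: number of iterations (K >= 1).
    With N = 1 the loop is ordinary OMP.
    Returns the estimate x_hat and the sorted list of selected indices.
    """
    n = A.shape[1]
    y = np.asarray(y, dtype=float)
    support, coef = [], np.zeros(0)
    residual = y.copy()
    for _ in range(K):
        corr = np.abs(A.T @ residual)
        if support:
            corr[support] = -np.inf  # exclude indices already selected
        picks = np.argsort(corr)[-N:]  # the N largest correlations
        support.extend(int(i) for i in picks)
        # Project y onto the span of the selected columns.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) <= tol:
            break
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat, sorted(support)

# Noise-free demo: a 3-sparse signal, two indices selected per iteration.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x = np.zeros(100)
x[[5, 37, 80]] = [2.0, -1.5, 1.0]
x_hat, support = gomp(A, A @ x, K=3, N=2)
print(support, np.linalg.norm(x_hat - x))
```

Because up to $NK$ indices are selected, the returned support may contain extra indices whose fitted coefficients are close to zero.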
- Feb 04 2016 cs.CV arXiv:1602.01197v1Data imbalance is common in many vision tasks where one or more classes are rare. Without addressing this issue conventional methods tend to be biased toward the majority class with poor predictive accuracy for the minority class. These methods further deteriorate on small, imbalanced data that has a large degree of class overlap. In this study, we propose a novel discriminative sparse neighbor approximation (DSNA) method to ameliorate the effect of class-imbalance during prediction. Specifically, given a test sample, we first traverse it through a cost-sensitive decision forest to collect a good subset of training examples in its local neighborhood. Then we generate from this subset several class-discriminating but overlapping clusters and model each as an affine subspace. From these subspaces, the proposed DSNA iteratively seeks an optimal approximation of the test sample and outputs an unbiased prediction. We show that our method not only effectively mitigates the imbalance issue, but also allows the prediction to extrapolate to unseen data. The latter capability is crucial for achieving accurate prediction on small dataset with limited samples. The proposed imbalanced learning method can be applied to both classification and regression tasks at a wide range of imbalance levels. It significantly outperforms the state-of-the-art methods that do not possess an imbalance handling mechanism, and is found to perform comparably or even better than recent deep learning methods by using hand-crafted features only.
- Support recovery of sparse signals from noisy measurements with orthogonal matching pursuit (OMP) has been extensively studied. In this paper, we show that for any $K$-sparse signal $\mathbf{x}$, if a sensing matrix $\mathbf{A}$ satisfies the restricted isometry property (RIP) with restricted isometry constant (RIC) $\delta_{K+1} < 1/\sqrt {K+1}$, then under some constraints on the minimum magnitude of nonzero elements of $\mathbf{x}$, OMP exactly recovers the support of $\mathbf{x}$ from its measurements $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{v}$ in $K$ iterations, where $\mathbf{v}$ is a noise vector that is $\ell_2$ or $\ell_{\infty}$ bounded. This sufficient condition is sharp in terms of $\delta_{K+1}$ since for any given positive integer $K$ and any $1/\sqrt{K+1}\leq \delta<1$, there always exists a matrix $\mathbf{A}$ satisfying the RIP with $\delta_{K+1}=\delta$ for which OMP fails to recover a $K$-sparse signal $\mathbf{x}$ in $K$ iterations. Also, our constraints on the minimum magnitude of nonzero elements of $\mathbf{x}$ are weaker than existing ones. Moreover, we propose worst-case necessary conditions for the exact support recovery of $\mathbf{x}$, characterized by the minimum magnitude of the nonzero elements of $\mathbf{x}$.
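The condition $\delta_{K+1} < 1/\sqrt{K+1}$ can be checked directly on small matrices. The sketch below computes the RIC by brute force over all column subsets, which is only feasible for tiny problems because the number of subsets grows combinatorially. The function names are hypothetical.

```python
import itertools
import numpy as np

def restricted_isometry_constant(A, s):
    """Brute-force RIC of order s: the smallest delta such that
    (1 - delta)||x||^2 <= ||Ax||^2 <= (1 + delta)||x||^2 for all s-sparse x."""
    n = A.shape[1]
    delta = 0.0
    for S in itertools.combinations(range(n), s):
        cols = list(S)
        # Eigenvalues of the Gram matrix of the selected columns, ascending.
        eig = np.linalg.eigvalsh(A[:, cols].T @ A[:, cols])
        delta = max(delta, eig[-1] - 1.0, 1.0 - eig[0])
    return delta

def omp_condition_holds(A, K):
    """Check the sufficient condition delta_{K+1} < 1/sqrt(K+1)."""
    return restricted_isometry_constant(A, K + 1) < 1.0 / np.sqrt(K + 1)

# Demo on a small random matrix with unit-norm columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
A /= np.linalg.norm(A, axis=0)
K = 2
print(restricted_isometry_constant(A, K + 1), omp_condition_holds(A, K))
```

Only subsets of size exactly $s$ need to be examined, since by eigenvalue interlacing smaller subsets cannot give a larger deviation from 1.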
- Dec 15 2015 cs.SC arXiv:1512.03901v2An algebraic approach to the maximum likelihood estimation problem is to solve a very structured parameterized polynomial system called likelihood equations that have finitely many complex (real or non-real) solutions. The only solutions that are statistically meaningful are the real solutions with positive coordinates. In order to classify the parameters (data) according to the number of real/positive solutions, we study how to efficiently compute the discriminants, say data-discriminants (DD), of the likelihood equations. We develop a probabilistic algorithm with three different strategies for computing DDs. Our implemented probabilistic algorithm based on Maple and FGb is more efficient than our previous version presented in ISSAC2015, and is also more efficient than the standard elimination for larger benchmarks. By applying RAGlib to a DD we compute, we give the real root classification of 3 by 3 symmetric matrix model.
- Dec 08 2015 cs.CV arXiv:1512.01891v1This paper proposes to learn high-performance deep ConvNets with sparse neural connections, referred to as sparse ConvNets, for face recognition. The sparse ConvNets are learned in an iterative way, each time one additional layer is sparsified and the entire model is re-trained given the initial weights learned in previous iterations. One important finding is that directly training the sparse ConvNet from scratch failed to find good solutions for face recognition, while using a previously learned denser model to properly initialize a sparser model is critical to continue learning effective features for face recognition. This paper also proposes a new neural correlation-based weight selection criterion and empirically verifies its effectiveness in selecting informative connections from previously learned models in each iteration. When taking a moderately sparse structure (26%-76% of weights in the dense model), the proposed sparse ConvNet model significantly improves the face recognition performance of the previous state-of-the-art DeepID2+ models given the same training data, while it keeps the performance of the baseline model with only 12% of the original parameters.
- Nov 23 2015 cs.CV arXiv:1511.06627v1Learning to simultaneously handle face alignment of arbitrary views, e.g. frontal and profile views, appears to be more challenging than we thought. The difficulties lie in i) accommodating the complex appearance-shape relations exhibited in different views, and ii) encompassing the varying landmark point sets due to self-occlusion and different landmark protocols. Most existing studies approach this problem via training multiple viewpoint-specific models, and conduct head pose estimation for model selection. This solution is intuitive but the performance is highly susceptible to inaccurate head pose estimation. In this study, we address this shortcoming through learning an Ensemble of Model Recommendation Trees (EMRT), which is capable of selecting optimal model configuration without prior head pose estimation. The unified framework seamlessly handles different viewpoints and landmark protocols, and it is trained by optimising directly on landmark locations, thus yielding superior results on arbitrary-view face alignment. This is the first study that performs face alignment on the full AFLW dataset with faces of different views including profile view. State-of-the-art performances are also reported on MultiPIE and AFW datasets containing both frontal- and profile-view faces.
- Nov 23 2015 cs.CV arXiv:1511.06523v1Face detection is one of the most studied topics in the computer vision community. Much of the progress has been enabled by the availability of face detection benchmark datasets. We show that there is a gap between current face detection performance and real-world requirements. To facilitate future face detection research, we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion, as shown in Fig. 1. Furthermore, we show that the WIDER FACE dataset is an effective training source for face detection. We benchmark several representative detection systems, providing an overview of state-of-the-art performance, and propose a solution to deal with large scale variation. Finally, we discuss common failure cases that are worth further investigation. The dataset can be downloaded at: mmlab.ie.cuhk.edu.hk/projects/WIDERFace
- Binary representation is desirable for its memory efficiency, computation speed and robustness. In this paper, we propose adjustable bounded rectifiers to learn binary representations for deep neural networks. While hard constraining representations across layers to be binary makes training unreasonably difficult, we softly encourage activations to diverge from real values to binary by approximating step functions. Our final representation is completely binary. We test our approach on MNIST, CIFAR10, and ILSVRC2012 dataset, and systematically study the training dynamics of the binarization process. Our approach can binarize the last layer representation without loss of performance and binarize all the layers with reasonably small degradations. The memory space that it saves may allow more sophisticated models to be deployed, thus compensating the loss. To the best of our knowledge, this is the first work to report results on current deep network architectures using complete binary middle representations. Given the learned representations, we find that the firing or inhibition of a binary neuron is usually associated with a meaningful interpretation across different classes. This suggests that the semantic structure of a neural network may be manifested through a guided binarization process.
- We study the geometry of metrics and convexity structures on the space of phylogenetic trees, which is here realized as the tropical linear space of all ultrametrics. The ${\rm CAT}(0)$-metric of Billera-Holmes-Vogtman arises from the theory of orthant spaces. While its geodesics can be computed by the Owen-Provan algorithm, geodesic triangles are complicated. We show that the dimension of such a triangle can be arbitrarily high. Tropical convexity and the tropical metric behave better. They exhibit properties desirable for geometric statistics, such as geodesics of small depth.
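Two notions from this abstract are easy to compute on small examples. The sketch below assumes the standard three-point condition for ultrametrics and assumes that the tropical metric is the usual one, $d_{\rm tr}(u,v)=\max_i(u_i-v_i)-\min_i(u_i-v_i)$. Both the function names and the example trees are illustrative.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-12):
    """Three-point condition: for every triple of taxa, the maximum of the
    three pairwise distances is attained at least twice."""
    for i, j, k in combinations(range(len(D)), 3):
        a, b, c = sorted((D[i][j], D[i][k], D[j][k]))
        if c - b > tol:
            return False
    return True

def tropical_distance(u, v):
    """Tropical (generalized Hilbert projective) distance between two vectors."""
    diffs = [x - y for x, y in zip(u, v)]
    return max(diffs) - min(diffs)

# Distance matrix of the tree ((A,B),C): d(A,B)=2, d(A,C)=d(B,C)=6.
tree1 = [[0, 2, 6],
         [2, 0, 6],
         [6, 6, 0]]
not_a_tree = [[0, 2, 5],
              [2, 0, 6],
              [5, 6, 0]]
print(is_ultrametric(tree1), is_ultrametric(not_a_tree))

# Trees as vectors of pairwise distances (AB, AC, BC).
u = (2, 6, 6)  # ((A,B),C)
v = (6, 6, 2)  # (A,(B,C))
print(tropical_distance(u, v))
```

The distance is unchanged when the same constant is added to every coordinate of one vector, which matches working in the tropical projective setting.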
- Caching is a promising solution to satisfy the ever increasing demands for the multi-media traffics. In caching networks, coded caching is a recently proposed technique that achieves significant performance gains over the uncoded caching schemes. However, to implement the coded caching schemes, each file has to be split into $F$ packets, which usually increases exponentially with the number of users $K$. Thus, designing caching schemes that decrease the order of $F$ is meaningful for practical implementations. In this paper, by reviewing the Ali-Niesen caching scheme, the placement delivery array (PDA) design problem is firstly formulated to characterize the placement issue and the delivery issue with a single array. Moreover, we show that, through designing appropriate PDA, new centralized coded caching schemes can be discovered. Secondly, it is shown that the Ali-Niesen scheme corresponds to a special class of PDA, which realizes the best coding gain with the least $F$. Thirdly, we present a new construction of PDA for the centralized caching system, wherein the cache size of each user $M$ (identical cache size is assumed at all users) and the number of files $N$ satisfies $M/N=1/q$ or ${(q-1)}/{q}$ ($q$ is an integer such that $q\geq 2$). The new construction can decrease the required $F$ from the order $O\left(e^{K\cdot\left(\frac{M}{N}\ln \frac{N}{M} +(1-\frac{M}{N})\ln \frac{N}{N-M}\right)}\right)$ of Ali-Niesen scheme to $O\left(e^{K\cdot\frac{M}{N}\ln \frac{N}{M}}\right)$ or $O\left(e^{K\cdot(1-\frac{M}{N})\ln\frac{N}{N-M}}\right)$ respectively, while the coding gain loss is only $1$.
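To get a feel for the orders of $F$ quoted above, the sketch below evaluates them for $M/N=1/q$. It assumes the Ali-Niesen scheme splits each file into $\binom{K}{KM/N}$ packets and compares that count with the two exponents stated in the abstract. It does not implement the PDA construction itself, and the function names are hypothetical.

```python
import math

def ali_niesen_packets(K, q):
    """Packets per file in the Ali-Niesen scheme for M/N = 1/q: C(K, K/q)."""
    assert K % q == 0, "K/q must be an integer"
    return math.comb(K, K // q)

def exponent_ali_niesen(K, q):
    """K * (p ln(1/p) + (1 - p) ln(1/(1 - p))) with p = M/N = 1/q."""
    p = 1.0 / q
    return K * (p * math.log(1 / p) + (1 - p) * math.log(1 / (1 - p)))

def exponent_reduced(K, q):
    """K * p ln(1/p) with p = M/N = 1/q, the reduced order for M/N = 1/q."""
    p = 1.0 / q
    return K * p * math.log(1 / p)

for K in (12, 24, 48):
    q = 3
    print(K,
          ali_niesen_packets(K, q),
          round(math.exp(exponent_ali_niesen(K, q))),
          round(math.exp(exponent_reduced(K, q))))
```

The printed columns are the exact binomial count, the exponential order of the Ali-Niesen scheme, and the reduced exponential order. Only the growth rates are meaningful here, since the $O(\cdot)$ notation hides constant factors.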