results for au:Zheng_L in:cs

- Mar 13 2018 cs.LG arXiv:1803.04008v1Design of incentives or recommendations to users is becoming more common as platform providers continually emerge. We propose a multi-armed bandit approach to the problem in which users types are unknown a priori and evolve dynamically in time. Unlike the traditional bandit setting, observed rewards are generated by a single Markov process. We demonstrate via an illustrative example that blindly applying the traditional bandit algorithms results in very poor performance as measured by regret. We introduce two variants of classical bandit algorithms, upper confidence bound (UCB) and epsilon-greedy, for which we provide theoretical bounds on the regret. We conduct a number of simulation-based experiments to show how the algorithms perform in comparison to traditional UCB and epsilon-greedy algorithms as well as reinforcement learning (Q-learning).
- Mar 12 2018 cs.SI arXiv:1803.03359v1We examined the structure of intra- and postoperative case-collaboration networks among the surgical service providers in a quaternary-care academic medical center, using retrospective electronic medical record (EMR) data. We also analyzed the evolution of the network properties over time, as changes in nodes and edges can affect the network structure. We used de-identified intra- and postoperative data for adult patients, ages >= 21, who received nonambulatory/nonobstetric surgery at Shands at the University of Florida between June 1, 2011 and November 1, 2014. The intraoperative segment contained 30,245 surgical cases, and the postoperative segment considered 30,202 hospitalizations. Our results confirmed the existence of strict small world structure in both intra- and postoperative surgical team networks. A sudden declining trend is expected in the future in both intra- and postoperative networks, since the small world property is currently at its peak. In addition, high network density was observed in the intraoperative segment and partially in postoperative one, representing the existence of cohesive clusters of providers. We also observed that the small world property is exhibited more in the intraoperative compared to the postoperative network. Analyzing the temporal aspects of the networks revealed the postoperative segment tends to lose its cohesiveness as the time passes. Our results highlight the importance of stability of personnel in key positions. This highlights the important role of the central players in the network that offers change-leaders the opportunity to quantify and target those nodes as mediators of process change.
- The increasing rate of urbanization has added pressure on the already constrained transportation networks in our communities. Ride-sharing platforms such as Uber and Lyft are becoming a more commonplace, particularly in urban environments. While such services may be deemed more convenient than riding public transit due to their on-demand nature, reports show that they do not necessarily decrease the congestion in major cities. One of the key problems is that typically mobility decision support systems focus on individual utility and react only after congestion appears. In this paper, we propose socially considerate multi-modal routing algorithms that are proactive and consider, via predictions, the shared effect of riders on the overall efficacy of mobility services. We have adapted the MATSim simulator framework to incorporate the proposed algorithms present a simulation analysis of a case study in Nashville, Tennessee that assesses the effects of our routing models on the traffic congestion for different levels of penetration and adoption of socially considerate routes. Our results indicate that even at a low penetration (social ratio), we are able to achieve an improvement in system-level performance.
- We introduce Texygen, a benchmarking platform to support research on open-domain text generation models. Texygen has not only implemented a majority of text generation models, but also covered a set of metrics that evaluate the diversity, the quality and the consistency of the generated texts. The Texygen platform could help standardize the research on text generation and facilitate the sharing of fine-tuned open-source implementations among researchers for their work. As a consequence, this would help in improving the reproductivity and reliability of future research work in text generation.
- Feb 01 2018 cs.NE arXiv:1801.10546v2Differential Evolution (DE) is one of the most powerful optimizers in the evolutionary algorithm (EA) family. In recent years, many DE variants have been proposed to enhance performance. However, when compared with each other, significant differences in performances are seldomly observed. To meet this challenge of a more significant improvement, this paper proposes a multi-layer competitive-cooperative (MLCC) framework to combine the advantages of multiple DEs. Existing multi-method strategies commonly use a multi-population based structure, which classifies the entire population into several subpopulations and evolve individuals only in their corresponding subgroups. MLCC proposes to implement a parallel structure with the entire population simultaneously monitored by multiple DEs assigned in multiple layers. Each individual can store, utilize and update its evolution information in different layers by using a novel individual preference based layer selecting (IPLS) mechanism and a computational resource allocation bias (RAB) mechanism. In IPLS, individuals only connect to one favorite layer. While in RAB, high quality solutions are evolved by considering all the layers. In this way, the multiple layers work in a competitive and cooperative manner. The proposed MLCC framework has been implemented on several highly competitive DEs. Experimental studies show that MLCC variants significantly outperform the baseline DEs as well as several state-of-the-art and up-to-date DEs on the CEC benchmark functions.
- Jan 31 2018 cs.CV arXiv:1801.10068v2In this paper we make two contributions to unsupervised domain adaptation in the convolutional neural network. First, our approach transfers knowledge in the deep side of neural networks for all convolutional layers. Previous methods usually do so by directly aligning higher-level representations, e.g., aligning the activations of fully-connected layers. In this case, although the convolutional layers can be modified through gradient back-propagation, but not significantly. Our approach takes advantage of the natural image correspondence built by CycleGAN. Departing from previous methods, we use every convolutional layer of the target network to uncover the knowledge shared by the source domain through an attention alignment mechanism. The discriminative part of an image is relatively insensitive to the change of image style, ensuring our attention alignment particularly suitable for robust knowledge adaptation. Second, we estimate the posterior label distribution of the unlabeled data to train the target network. Previous methods, which iteratively update the pseudo labels by the target network and refine the target network by the updated pseudo labels, are straightforward but vulnerable to noisy labels. Instead, our approach uses category distribution to calculate the cross-entropy loss for training, thereby ameliorating deviation accumulation. The two contributions make our approach outperform the state-of-theart methods by +2.6% in all the six transfer tasks on Office- 31 on average. Notably, our approach yields +5.1% improvement for the challenging $\textbf{D}$ ${\rightarrow}$ $\textbf{A}$ task.
- Jan 31 2018 cs.CV arXiv:1801.09927v1This paper considers the task of thorax disease classification on chest X-ray images. Existing methods generally use the global image as input for network learning. Such a strategy is limited in two aspects. 1) A thorax disease usually happens in (small) localized areas which are disease specific. Training CNNs using global image may be affected by the (excessive) irrelevant noisy areas. 2) Due to the poor alignment of some CXR images, the existence of irregular borders hinders the network performance. In this paper, we address the above problems by proposing a three-branch attention guided convolution neural network (AG-CNN). AG-CNN 1) learns from disease-specific regions to avoid noise and improve alignment, 2) also integrates a global branch to compensate the lost discriminative cues by local branch. Specifically, we first learn a global CNN branch using global images. Then, guided by the attention heat map generated from the global branch, we inference a mask to crop a discriminative region from the global image. The local region is used for training a local CNN branch. Lastly, we concatenate the last pooling layers of both the global and local branches for fine-tuning the fusion branch. The Comprehensive experiment is conducted on the ChestX-ray14 dataset. We first report a strong global baseline producing an average AUC of 0.841 with ResNet-50 as backbone. After combining the local cues with the global information, AG-CNN improves the average AUC to 0.868. While DenseNet-121 is used, the average AUC achieves 0.871, which is a new state of the art in the community.
- We propose super-resolution MIMO channel estimators for millimeter-wave (mmWave) systems that employ hybrid analog and digital beamforming and generalized spatial modulation, respectively. Exploiting the inherent sparsity of mmWave channels, the channel estimation problem is formulated as an atomic norm minimization that enhances sparsity in the continuous angles of departure and arrival. Both pilot-assisted and data-aided channel estimators are developed, with the former one formulated as a convex problem and the latter as a non-convex problem. To solve these formulated channel estimation problems, we develop a computationally efficient conjugate gradient descent method based on non-convex factorization which restricts the search space to low-rank matrices. Simulation results are presented to illustrate the superior channel estimation performance of the proposed algorithms for both types of mmWave systems compared to the existing compressed-sensing-based estimators with finely quantized angle grids.
- Jan 19 2018 cs.CV arXiv:1801.05918v1Single Shot MultiBox Detector (SSD) is one of the fastest algorithms in the current object detection field, which uses fully convolutional neural network to detect all scaled objects in an image. Deconvolutional Single Shot Detector (DSSD) is an approach which introduces more context information by adding the deconvolution module to SSD. And the mean Average Precision (mAP) of DSSD on PASCAL VOC2007 is improved from SSD's 77.5% to 78.6%. Although DSSD obtains higher mAP than SSD by 1.1%, the frames per second (FPS) decreases from 46 to 11.8. In this paper, we propose a single stage end-to-end image detection model called ESSD to overcome this dilemma. Our solution to this problem is to cleverly extend better context information for the shallow layers of the best single stage (e.g. SSD) detectors. Experimental results show that our model can reach 79.4% mAP, which is higher than DSSD and SSD by 0.8 and 1.9 points respectively. Meanwhile, our testing speed is 25 FPS in Titan X GPU which is more than double the original DSSD.
- Jan 16 2018 cs.CV arXiv:1801.04461v1In this paper we consider the problem of single monocular image depth estimation. It is a challenging problem due to its ill-posedness nature and has found wide application in industry. Previous efforts belongs roughly to two families: learning-based method and interactive method. Learning-based method, in which deep convolutional neural network (CNN) is widely used, can achieve good result. But they suffer low generalization ability and typically perform poorly for unfamiliar scenes. Besides, data-hungry nature for such method makes data aquisition expensive and time-consuming. Interactive method requires human annotation of depth which, however, is errorneous and of large variance. To overcome these problems, we propose a new perspective for single monocular image depth estimation problem: size to depth. Our method require sparse label for real-world size of object rather than raw depth. A Coarse depth map is then inferred following geometric relationships according to size labels. Then we refine the depth map by doing energy function optimization on conditional random field(CRF). We experimentally demonstrate that our method outperforms traditional depth-labeling methods and can produce satisfactory depth maps.
- Dec 05 2017 cs.CV arXiv:1712.00720v1We designed a gangue sorting system,and built a convolutional neural network model based on AlexNet. Data enhancement and transfer learning are used to solve the problem which the convolution neural network has insufficient training data in the training stage. An object detection and region clipping algorithm is proposed to adjust the training image data to the optimum size. Compared with traditional neural network and SVM algorithm, this algorithm has higher recognition rate for coal and coal gangue, and provides important reference for identification and separation of coal and gangue.
- We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents' optimal polices, but more importantly, the observation and understanding of individual agent's behaviors and social phenomena emerging from the AI society, including communication languages, leaderships, altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show emerged collective intelligence by learning from scratch.
- Nov 29 2017 cs.CV arXiv:1711.10295v1Being a cross-camera retrieval task, person re-identification suffers from image style variations caused by different cameras. The art implicitly addresses this problem by learning a camera-invariant descriptor subspace. In this paper, we explicitly consider this challenge by introducing camera style (CamStyle) adaptation. CamStyle can serve as a data augmentation approach that smooths the camera style disparities. Specifically, with CycleGAN, labeled training images can be style-transferred to each camera, and, along with the original training samples, form the augmented training set. This method, while increasing data diversity against over-fitting, also incurs a considerable level of noise. In the effort to alleviate the impact of noise, the label smooth regularization (LSR) is adopted. The vanilla version of our method (without LSR) performs reasonably well on few-camera systems in which over-fitting often occurs. With LSR, we demonstrate consistent improvement in all systems regardless of the extent of over-fitting. We also report competitive accuracy compared with the state of the art.
- Nov 28 2017 cs.CV arXiv:1711.09243v1In recent years, two types of trackers, namely correlation filter based tracker (CF tracker) and structured output tracker (Struck), have exhibited the state-of-the-art performance. However, there seems to be lack of analytic work on their relations in the computer vision community. In this paper, we investigate two state-of-the-art CF trackers, i.e., spatial regularization discriminative correlation filter (SRDCF) and correlation filter with limited boundaries (CFLB), and Struck, and reveal their relations. Specifically, after extending the CFLB to its multiple channel version we prove the relation between SRDCF and CFLB on the condition that the spatial regularization factor of SRDCF is replaced by the masking matrix of CFLB. We also prove the asymptotical approximate relation between SRDCF and Struck on the conditions that the spatial regularization factor of SRDCF is replaced by an indicator function of object bounding box, the weights of SRDCF in its loss item are replaced by those of Struck, the linear kernel is employed by Struck, and the search region tends to infinity. Extensive experiments on public benchmarks OTB50 and OTB100 are conducted to verify our theoretical results. Moreover, we explain how detailed differences among SRDCF, CFLB, and Struck would give rise to slightly different performances on visual sequences
- Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline)Nov 28 2017 cs.CV arXiv:1711.09349v3Employing part-level features for pedestrian image description offers fine-grained information and has been verified as beneficial for person retrieval in very recent literature. A prerequisite of part discovery is that each part should be well located. Instead of using external cues, e.g., pose estimation, to directly locate parts, this paper lays emphasis on the content consistency within each part. Specifically, we target at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiment confirms that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin.
- Nov 21 2017 cs.CV arXiv:1711.07027v2Person re-identification (re-ID) models trained on one domain often fail to generalize well to another. In our attempt, we present a "learning via translation" framework. In the baseline, we translate the labeled images from source to target domain in an unsupervised manner. We then train re-ID models with the translated images by supervised methods. Yet, being an essential part of this framework, unsupervised image-image translation suffers from the information loss of source-domain labels during translation. Our motivation is two-fold. First, for each image, the discriminative cues contained in its ID label should be maintained after translation. Second, given the fact that two domains have entirely different persons, a translated image should be dissimilar to any of the target IDs. To this end, we propose to preserve two types of unsupervised similarities, 1) self-similarity of an image before and after translation, and 2) domain-dissimilarity of a translated source image and a target image. Both constraints are implemented in the similarity preserving generative adversarial network (SPGAN) which consists of a Siamese network and a CycleGAN. Through domain adaptation experiment, we show that images generated by SPGAN are more suitable for domain adaptation and yield consistent and competitive re-ID accuracy on two large-scale datasets.
- Nov 16 2017 cs.CV arXiv:1711.05535v2This paper considers the task of matching images and sentences. The challenge consists in discriminatively embedding the two modalities onto a shared visual-textual space. Existing work in this field largely uses Recurrent Neural Networks (RNN) for text feature learning and employs off-the-shelf Convolutional Neural Networks (CNN) for image feature extraction. Our system, in comparison, differs in two key aspects. Firstly, we build a convolutional network amenable for fine-tuning the visual and textual representations, where the entire network only contains four components, i.e., convolution layer, pooling layer, rectified linear unit function (ReLU), and batch normalisation. End-to-end learning allows the system to directly learn from the data and fully utilise the supervisions. Secondly, we propose instance loss according to viewing each multimodal data pair as a class. This works with a large margin objective to learn the inter-modal correspondence between images and their textual descriptions. Experiments on two generic retrieval datasets (Flickr30k and MSCOCO) demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language person retrieval, we improve the state of the art by a large margin. Code is available at https://github. com/layumi/Image-Text-Embedding
- Oct 18 2017 cs.PF arXiv:1710.06189v1As in various fields like scientific research and industrial application, the computation time optimization is becoming a task that is of increasing importance because of its highly parallel architecture. The graphics processing unit is regarded as a powerful engine for application programs that demand fairly high computation capabilities. Based on this, an algorithm was introduced in this paper to optimize the method used to compute the gray-level co-occurrence matrix (GLCM) of an image, and strategies (e.g., "copying", "image partitioning", etc.) were proposed to optimize the parallel algorithm. Results indicate that without losing the computational accuracy, the speed-up ratio of the GLCM computation of images with different resolutions by GPU by the use of CUDA was 50 times faster than that of the GLCM computation by CPU, which manifested significantly improved performance.
- Sep 28 2017 cs.CV arXiv:1709.09297v1Label estimation is an important component in an unsupervised person re-identification (re-ID) system. This paper focuses on cross-camera label estimation, which can be subsequently used in feature learning to learn robust re-ID models. Specifically, we propose to construct a graph for samples in each camera, and then graph matching scheme is introduced for cross-camera labeling association. While labels directly output from existing graph matching methods may be noisy and inaccurate due to significant cross-camera variations, this paper proposes a dynamic graph matching (DGM) method. DGM iteratively updates the image graph and the label estimation process by learning a better feature space with intermediate estimated labels. DGM is advantageous in two aspects: 1) the accuracy of estimated labels is improved significantly with the iterations; 2) DGM is robust to noisy initial training data. Extensive experiments conducted on three benchmarks including the large-scale MARS dataset show that DGM yields competitive performance to fully supervised baselines, and outperforms competing unsupervised learning methods.
- Sep 26 2017 cs.GT arXiv:1709.08441v4We study the equilibrium behavior in a multi-commodity selfish routing game with many types of uncertain users where each user over- or under-estimates their congestion costs by a multiplicative factor. Surprisingly, we find that uncertainties in different directions have qualitatively distinct impacts on equilibria. Namely, contrary to the usual notion that uncertainty increases inefficiencies, network congestion actually decreases when users over-estimate their costs. On the other hand, under-estimation of costs leads to increased congestion. We apply these results to urban transportation networks, where drivers have different estimates about the cost of congestion. In light of the dynamic pricing policies aimed at tackling congestion, our results indicate that users' perception of these prices can significantly impact the policy's efficacy, and "caution in the face of uncertainty" leads to favorable network conditions.
- Sep 13 2017 cs.SY arXiv:1709.03843v1The millimeter-wave (mmWave) full-dimensional (FD) MIMO system employs planar arrays at both the base station and user equipment and can simultaneously support both azimuth and elevation beamforming. In this paper, we propose atomic-norm-based methods for mmWave FD-MIMO channel estimation under both uniform planar arrays (UPA) and non-uniform planar arrays (NUPA). Unlike existing algorithms such as compressive sensing (CS) or subspace methods, the atomic-norm-based algorithms do not require to discretize the angle spaces of the angle of arrival (AoA) and angle of departure (AoD) into grids, thus provide much better accuracy in estimation. In the UPA case, to reduce the computational complexity, the original large-scale 4D atomic norm minimization problem is approximately reformulated as a semi-definite program (SDP) containing two decoupled two-level Toeplitz matrices. The SDP is then solved via the alternating direction method of multipliers (ADMM) where each iteration involves only closed-form computations. In the NUPA case, the atomic-norm-based formulation for channel estimation becomes nonconvex and a gradient-decent-based algorithm is proposed to solve the problem. Simulation results show that the proposed algorithms achieve better performance than the CS-based and subspace-based algorithms.
- Aug 17 2017 cs.CV arXiv:1708.04896v2In this paper, we introduce Random Erasing, a new data augmentation method for training the convolutional neural network (CNN). In training, Random Erasing randomly selects a rectangle region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter learning free, easy to implement, and can be integrated with most of the CNN-based recognition models. Albeit simple, Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and yields consistent improvement over strong baselines in image classification, object detection and person re-identification. Code is available at: https://github.com/zhunzhong07/Random-Erasing.
- Jul 25 2017 cs.CV arXiv:1707.07103v1This paper focuses on regularizing the training of the convolutional neural network (CNN). We propose a new regularization approach named ``PatchShuffle`` that can be adopted in any classification-oriented CNN models. It is easy to implement: in each mini-batch, images or feature maps are randomly chosen to undergo a transformation such that pixels within each local patch are shuffled. Through generating images and feature maps with interior orderless patches, PatchShuffle creates rich local variations, reduces the risk of network overfitting, and can be viewed as a beneficial supplement to various kinds of training regularization techniques, such as weight decay, model ensemble and dropout. Experiments on four representative classification datasets show that PatchShuffle improves the generalization ability of CNN especially when the data is scarce. Moreover, we empirically illustrate that CNN models trained with PatchShuffle are more robust to noise and local changes in an image.
- Multicast beamforming is a key technology for next-generation wireless cellular networks to support high-rate content distribution services. In this paper, the coordinated downlink multicast beamforming design in multicell networks is considered. The goal is to maximize the minimum signal-to-interference-plus-noise ratio of all users under individual base station power constraints. We exploit the fractional form of the objective function and geometric properties of the con-straints to reformulate the problem as a parametric manifold optimization program. Afterwards we propose a low-complexity Dinkelbach-type algorithm combined with adaptive exponential smoothing and Riemannian conjugate gradient iteration, which is guaranteed to converge. Numerical experiments show that the proposed algorithm outperforms the existing SDP-based method and DC-programming-based method and achieves near-optimal performance.
- Jul 04 2017 cs.CV arXiv:1707.00408v1Person re-identification (person re-ID) is mostly viewed as an image retrieval problem. This task aims to search a query person in a large image pool. In practice, person re-ID usually adopts automatic detectors to obtain cropped pedestrian images. However, this process suffers from two types of detector errors: excessive background and part missing. Both errors deteriorate the quality of pedestrian alignment and may compromise pedestrian matching due to the position and scale variances. To address the misalignment problem, we propose that alignment can be learned from an identification procedure. We introduce the pedestrian alignment network (PAN) which allows discriminative embedding learning and pedestrian alignment without extra annotations. Our key observation is that when the convolutional neural network (CNN) learns to discriminate between different identities, the learned feature maps usually exhibit strong activations on the human body rather than the background. The proposed network thus takes advantage of this attention mechanism to adaptively locate and align pedestrians within a bounding box. Visual examples show that pedestrians are better aligned with PAN. Experiments on three large-scale re-ID datasets confirm that PAN improves the discriminative ability of the feature embeddings and yields competitive accuracy with the state-of-the-art methods.
- Jun 27 2017 cs.CV arXiv:1706.08249v6In this paper, we study object detection using a large pool of unlabeled images and only a few labeled images per category, named "few-example object detection". The key challenge consists in generating trustworthy training samples as many as possible from the pool. Using few training examples as seeds, our method iterates between model training and high-confidence sample selection. In training, easy samples are generated first and, then the poorly initialized model undergoes improvement. As the model becomes more discriminative, challenging but reliable samples are selected. After that, another round of model improvement takes place. To further improve the precision and recall of the generated training samples, we embed multiple detection models in our framework, which has proven to outperform the single model baseline and the model ensemble method. Experiments on PASCAL VOC'07, MS COCO'14, and ILSVRC'13 indicate that by using as few as three or four samples selected for each category, our method produces very competitive results when compared to the state-of-the-art weakly-supervised approaches using a large number of image-level labels.
- Verification determines whether two samples belong to the same class or not, and has important applications such as face and fingerprint verification, where thousands or millions of categories are present but each category has scarce labeled examples, presenting two major challenges for existing deep learning models. We propose a deep semi-supervised model named SEmi-supervised VErification Network (SEVEN) to address these challenges. The model consists of two complementary components. The generative component addresses the lack of supervision within each category by learning general salient structures from a large amount of data across categories. The discriminative component exploits the learned general features to mitigate the lack of supervision within categories, and also directs the generative component to find more informative structures of the whole data manifold. The two components are tied together in SEVEN to allow an end-to-end training of the two components. Extensive experiments on four verification tasks demonstrate that SEVEN significantly outperforms other state-of-the-art deep semi-supervised techniques when labeled data are in short supply. Furthermore, SEVEN is competitive with fully supervised baselines trained with a larger amount of labeled data. It indicates the importance of the generative component in SEVEN.
- Most existing approaches to co-existing communication/radar systems assume that the radar and communication systems are coordinated, i.e., they share information, such as relative position, transmitted waveforms and channel state. In this paper, we consider an un-coordinated scenario where a communication receiver is to operate in the presence of a number of radars, of which only a sub-set may be active, which poses the problem of estimating the active waveforms and the relevant parameters thereof, so as to cancel them prior to demodulation. Two algorithms are proposed for such a joint waveform estimation/data demodulation problem, both exploiting sparsity of a proper representation of the interference and of the vector containing the errors of the data block, so as to implement an iterative joint interference removal/data demodulation process. The former algorithm is based on classical on-grid compressed sensing (CS), while the latter forces an atomic norm (AN) constraint: in both cases the radar parameters and the communication demodulation errors can be estimated by solving a convex problem. We also propose a way to improve the efficiency of the AN-based algorithm. The performance of these algorithms are demonstrated through extensive simulations, taking into account a variety of conditions concerning both the interferers and the respective channel states.
- May 31 2017 cs.CV arXiv:1705.10444v2The superiority of deeply learned pedestrian representations has been reported in very recent literature of person re-identification (re-ID). In this paper, we consider the more pragmatic issue of learning a deep feature with no or only a few labels. We propose a progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains. Our method is easy to implement and can be viewed as an effective baseline for unsupervised re-ID feature learning. Specifically, PUL iterates between 1) pedestrian clustering and 2) fine-tuning of the convolutional neural network (CNN) to improve the original model trained on the irrelevant labeled dataset. Since the clustering results can be very noisy, we add a selection operation between the clustering and fine-tuning. At the beginning when the model is weak, CNN is fine-tuned on a small amount of reliable examples which locate near to cluster centroids in the feature space. As the model becomes stronger in subsequent iterations, more images are being adaptively selected as CNN training samples. Progressively, pedestrian clustering and the CNN model are improved simultaneously until algorithm convergence. This process is naturally formulated as self-paced learning. We then point out promising directions that may lead to further improvement. Extensive experiments on three large-scale re-ID datasets demonstrate that PUL outputs discriminative features that improve the re-ID accuracy.
- May 24 2017 cs.AR arXiv:1705.08009v1It remains a challenge to run Deep Learning in devices with stringent power budget in the Internet-of-Things. This paper presents a low-power accelerator for processing Deep Neural Networks in the embedded devices. The power reduction is realized by avoiding multiplications of near-zero valued data. The near-zero approximation and a dedicated Near-Zero Approximation Unit (NZAU) are proposed to predict and skip the near-zero multiplications under certain thresholds. Compared with skipping zero-valued computations, our design achieves 1.92X and 1.51X further reduction of the total multiplications in LeNet-5 and Alexnet respectively, with negligible lose of accuracy. In the proposed accelerator, 256 multipliers are grouped into 16 independent Processing Lanes (PL) to support up to 16 neuron activations simultaneously. With the help of data pre-processing and buffering in each PL, multipliers can be clock-gated in most of the time even the data is excessively streaming in. Designed and simulated in UMC 65 nm process, the accelerator operating at 500 MHz is $>$ 4X faster than the mobile GPU Tegra K1 in processing the fully-connected layer FC8 of Alexnet, while consuming 717X less energy.
- May 23 2017 cs.CV arXiv:1705.07383v3To improve segmentation performance, a novel neural network architecture (termed DFCN-DCRF) is proposed, which combines an RGB-D fully convolutional neural network (DFCN) with a depth-sensitive fully-connected conditional random field (DCRF). First, a DFCN architecture which fuses depth information into the early layers and applies dilated convolution for later contextual reasoning is designed. Then, a depth-sensitive fully-connected conditional random field (DCRF) is proposed and combined with the previous DFCN to refine the preliminary result. Comparative experiments show that the proposed DFCN-DCRF has the best performance compared with most state-of-the-art methods.
- May 08 2017 cs.CV arXiv:1705.02145v1Large-scale is a trend in person re-identification (re-id). It is important that real-time search be performed in a large gallery. While previous methods mostly focus on discriminative learning, this paper makes the attempt in integrating deep learning and hashing into one framework to evaluate the efficiency and accuracy for large-scale person re-id. We integrate spatial information for discriminative visual representation by partitioning the pedestrian image into horizontal parts. Specifically, Part-based Deep Hashing (PDH) is proposed, in which batches of triplet samples are employed as the input of the deep hashing architecture. Each triplet sample contains two pedestrian images (or parts) with the same identity and one pedestrian image (or part) of the different identity. A triplet loss function is employed with a constraint that the Hamming distance of pedestrian images (or parts) with the same identity is smaller than ones with the different identity. In the experiment, we show that the proposed Part-based Deep Hashing method yields very competitive re-id accuracy on the large-scale Market-1501 and Market-1501+500K datasets.
- The focus of this paper is on co-existence between a communication system and a pulsed radar sharing the same bandwidth. Based on the fact that the interference generated by the radar onto the communication receiver is intermittent and depends on the density of scattering objects (such as, e.g., targets), we first show that the communication system is equivalent to a set of independent parallel channels, whereby pre-coding on each channel can be introduced as a new degree of freedom. We introduce a new figure of merit, named the \em compound rate, which is a convex combination of rates with and without interference, to be optimized under constraints concerning the signal-to-interference-plus-noise ratio (including \em signal-dependent interference due to clutter) experienced by the radar and obviously the powers emitted by the two systems: the degrees of freedom are the radar waveform and the afore-mentioned encoding matrix for the communication symbols. We provide closed-form solutions for the optimum transmit policies for both systems under two basic models for the scattering produced by the radar onto the communication receiver, and account for possible correlation of the signal-independent fraction of the interference impinging on the radar. We also discuss the region of the achievable communication rates with and without interference. A thorough performance assessment shows the potentials and the limitations of the proposed co-existing architecture.
- Feature aided tracking can often yield improved tracking performance over the standard multiple target tracking (MTT) algorithms with only kinematic measurements. However, in many applications, the feature signal of the targets consists of sparse Fourier-domain signals. It changes quickly and nonlinearly in the time domain, and the feature measurements are corrupted by missed detections and mis-associations. These two factors make it hard to extract the feature information to be used in MTT. In this paper, we develop a feature-aided nearest neighbour joint probabilistic data association filter (NN-JPDAF) for joint MTT and feature extraction in dense target environments. To estimate the rapidly varying feature signal from incomplete and corrupted measurements, we use the atomic norm constraint to formulate the sparsity of feature signal and use the $\ell_1$-norm to formulate the sparsity of the corruption induced by mis-associations. Based on the sparse representation, the feature signal are estimated by solving a semidefinite program (SDP) which is convex. We also provide an iterative method for solving this SDP via the alternating direction method of multipliers (ADMM) where each iteration involves closed-form computation. With the estimated feature signal, re-filtering is performed to estimate the kinematic states of the targets, where the association makes use of both kinematic and feature information. Simulation results are presented to illustrate the performance of the proposed algorithm in a radar application.
- Mar 22 2017 cs.CV arXiv:1703.07220v2Person re-identification (re-ID) and attribute recognition share a common target at the pedestrian description. Their difference consists in the granularity. Attribute recognition focuses on local aspects of a person while person re-ID usually extracts global representations. Considering their similarity and difference, this paper proposes a very simple convolutional neural network (CNN) that learns a re-ID embedding and predicts the pedestrian attributes simultaneously. This multi-task method integrates an ID classification loss and a number of attribute classification losses, and back-propagates the weighted sum of the individual losses. Albeit simple, we demonstrate on two pedestrian benchmarks that by learning a more discriminative representation, our method significantly improves the re-ID baseline and is scalable on large galleries. We report competitive re-ID performance compared with the state-of-the-art methods on the two datasets.
- Mar 21 2017 cs.CV arXiv:1703.06618v1This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia, NUS Wide and Flickr30k, have two major limitations. First, these datasets are lacking in content diversity, i.e., only some pre-defined classes are covered. Second, texts in these datasets are written in well-organized language, leading to inconsistency with realistic applications. To overcome these drawbacks, the proposed Twitter100k dataset is characterized by two aspects: 1) it has 100,000 image-text pairs randomly crawled from Twitter and thus has no constraint in the image categories; 2) text in Twitter100k is written in informal language by the users. Since strongly supervised methods leverage the class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the Correspondence AutoEncoder, along with various text features on Wikipedia, Flickr30k and Twitter100k. Novel insights are provided. As a minor contribution, inspired by the characteristic of Twitter100k, we propose an OCR-based cross-media retrieval method. In experiment, we show that the proposed OCR-based method improves the baseline performance.
- Mar 17 2017 cs.CV arXiv:1703.05693v4This paper proposes the SVDNet for retrieval problems, with focus on the application of person re-identification (re-ID). We view each weight vector within a fully connected (FC) layer in a convolutional neuron network (CNN) as a projection basis. It is observed that the weight vectors are usually highly correlated. This problem leads to correlations among entries of the FC descriptor, and compromises the retrieval performance based on the Euclidean distance. To address the problem, this paper proposes to optimize the deep representation learning process with Singular Vector Decomposition (SVD). Specifically, with the restraint and relaxation iteration (RRI) training scheme, we are able to iteratively integrate the orthogonality constraint in CNN training, yielding the so-called SVDNet. We conduct experiments on the Market-1501, CUHK03, and Duke datasets, and show that RRI effectively reduces the correlation among the projection vectors, produces more discriminative FC descriptors, and significantly improves the re-ID accuracy. On the Market-1501 dataset, for instance, rank-1 accuracy is improved from 55.3% to 80.5% for CaffeNet, and from 73.8% to 82.3% for ResNet-50.
- Mar 13 2017 cs.CV arXiv:1703.03567v1This paper proposes a new evaluation protocol for cross-media retrieval which better fits the real-word applications. Both image-text and text-image retrieval modes are considered. Traditionally, class labels in the training and testing sets are identical. That is, it is usually assumed that the query falls into some pre-defined classes. However, in practice, the content of a query image/text may vary extensively, and the retrieval system does not necessarily know in advance the class label of a query. Considering the inconsistency between the real-world applications and laboratory assumptions, we think that the existing protocol that works under identical train/test classes can be modified and improved. This work is dedicated to addressing this problem by considering the protocol under an extendable scenario, \ie, the training and testing classes do not overlap. We provide extensive benchmarking results obtained by the existing protocol and the proposed new protocol on several commonly used datasets. We demonstrate a noticeable performance drop when the testing classes are unseen during training. Additionally, a trivial solution, \ie, directly using the predicted class label for cross-media retrieval, is tested. We show that the trivial solution is very competitive in traditional non-extendable retrieval, but becomes less so under the new settings. The train/test split, evaluation code, and benchmarking results are publicly available on our website.
- Feb 17 2017 cs.CV arXiv:1702.04858v2Person Re-IDentification (Re-ID) aims to match person images captured from two non-overlapping cameras. In this paper, a deep hybrid similarity learning (DHSL) method for person Re-ID based on a convolution neural network (CNN) is proposed. In our approach, a CNN learning feature pair for the input image pair is simultaneously extracted. Then, both the element-wise absolute difference and multiplication of the CNN learning feature pair are calculated. Finally, a hybrid similarity function is designed to measure the similarity between the feature pair, which is realized by learning a group of weight coefficients to project the element-wise absolute difference and multiplication into a similarity score. Consequently, the proposed DHSL method is able to reasonably assign parameters of feature learning and metric learning in a CNN so that the performance of person Re-ID is improved. Experiments on three challenging person Re-ID databases, QMUL GRID, VIPeR and CUHK03, illustrate that the proposed DHSL method is superior to multiple state-of-the-art person Re-ID methods.
- Jan 31 2017 cs.CV arXiv:1701.08398v4When considering person re-identification (re-ID) as a retrieval process, re-ranking is a critical step to improve its accuracy. Yet in the re-ID community, limited effort has been devoted to re-ranking, especially those fully automatic, unsupervised solutions. In this paper, we propose a k-reciprocal encoding method to re-rank the re-ID results. Our hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match. Specifically, given an image, a k-reciprocal feature is calculated by encoding its k-reciprocal nearest neighbors into a single vector, which is used for re-ranking under the Jaccard distance. The final distance is computed as the combination of the original distance and the Jaccard distance. Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets. Experiments on the large-scale Market-1501, CUHK03, MARS, and PRW datasets confirm the effectiveness of our method.
- Jan 27 2017 cs.CV arXiv:1701.07732v1Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With bad alignment, the background noise will significantly compromise the feature learning and matching process. To address this problem, this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03, and VIPeR datasets. We show that PoseBox alone yields decent re-ID accuracy and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with the state-of-the-art approaches.
- Jan 27 2017 cs.CV arXiv:1701.07717v5The main contribution of this paper is a simple semi-supervised pipeline that only uses the original training set without collecting extra data. It is challenging in 1) how to obtain more training data only from the training set and 2) how to use the newly generated data. In this work, the generative adversarial network (GAN) is used to generate unlabeled samples. We propose the label smoothing regularization for outliers (LSRO). This method assigns a uniform label distribution to the unlabeled images, which regularizes the supervised model and improves the baseline. We verify the proposed method on a practical problem: person re-identification (re-ID). This task aims to retrieve a query person from other cameras. We adopt the deep convolutional generative adversarial network (DCGAN) for sample generation, and a baseline convolutional neural network (CNN) for representation learning. Experiments show that adding the GAN-generated data effectively improves the discriminative ability of learned CNN embeddings. On three large-scale datasets, Market-1501, CUHK03 and DukeMTMC-reID, we obtain +4.37%, +1.6% and +2.46% improvement in rank-1 precision over the baseline CNN, respectively. We additionally apply the proposed method to fine-grained bird recognition and achieve a +0.6% improvement over a strong baseline. The code is available at https://github.com/layumi/Person-reID_GAN.
- A large amount of information exists in reviews written by users. This source of information has been ignored by most of the current recommender systems while it can potentially alleviate the sparsity problem and improve the quality of recommendations. In this paper, we present a deep model to learn item properties and user behaviors jointly from review text. The proposed model, named Deep Cooperative Neural Networks (DeepCoNN), consists of two parallel neural networks coupled in the last layers. One of the networks focuses on learning user behaviors exploiting reviews written by the user, and the other one learns item properties from the reviews written for the item. A shared layer is introduced on the top to couple these two networks together. The shared layer enables latent factors learned for users and items to interact with each other in a manner similar to factorization machine techniques. Experimental results demonstrate that DeepCoNN significantly outperforms all baseline recommender systems on a variety of datasets.
- Dec 28 2016 cs.CV arXiv:1612.08499v1This paper presents a deep nonlinear metric learning framework for data visualization on an image dataset. We propose the Triangular Similarity and prove its equivalence to the Cosine Similarity in measuring a data pair. Based on this novel similarity, a geometrically motivated loss function - the triangular loss - is then developed for optimizing a metric learning system comprising two identical CNNs. It is shown that this deep nonlinear system can be efficiently trained by a hybrid algorithm based on the conventional backpropagation algorithm. More interestingly, benefiting from classical manifold learning theories, the proposed system offers two different views to visualize the outputs, the second of which provides better classification results than the state-of-the-art methods in the visualizable spaces.
- Nov 18 2016 cs.CV arXiv:1611.05666v2We revisit two popular convolutional neural networks (CNN) in person re-identification (re-ID), i.e, verification and classification models. The two models have their respective advantages and limitations due to different loss functions. In this paper, we shed light on how to combine the two models to learn more discriminative pedestrian descriptors. Specifically, we propose a new siamese network that simultaneously computes identification loss and verification loss. Given a pair of training images, the network predicts the identities of the two images and whether they belong to the same identity. Our network learns a discriminative embedding and a similarity measurement at the same time, thus making full usage of the annotations. Albeit simple, the learned embedding improves the state-of-the-art performance on two public person re-ID benchmarks. Further, we show our architecture can also be applied in image retrieval.
- Coded caching scheme is a promising technique to migrate the network burden in peak hours, which attains more prominent gains than the uncoded caching. The coded caching scheme can be classified into two types, namely, the centralized and the decentralized scheme, according to whether the placement procedures are carefully designed or operated at random. However, most of the previous analysis assumes that the connected links between server and users are error-free. In this paper, we explore the coded caching based delivery design in wireless networks, where all the connected wireless links are different. For both centralized and decentralized cases, we proposed two delivery schemes, namely, the orthogonal delivery scheme and the concurrent delivery scheme. We focus on the transmission time slots spent on satisfying the system requests, and prove that for both the centralized and the decentralized cases, the concurrent delivery always outperforms orthogonal delivery scheme. Furthermore, for the orthogonal delivery scheme, we derive the gap in terms of transmission time between the decentralized and centralized case, which is essentially no more than 1.5.
- We study the problem of designing optical receivers to discriminate between multiple coherent states using coherent processing receivers---i.e., one that uses arbitrary coherent feedback control and quantum-noise-limited direct detection---which was shown by Dolinar to achieve the minimum error probability in discriminating any two coherent states. We first derive and re-interpret Dolinar's binary-hypothesis minimum-probability-of-error receiver as the one that optimizes the information efficiency at each time instant, based on recursive Bayesian updates within the receiver. Using this viewpoint, we propose a natural generalization of Dolinar's receiver design to discriminate $M$ coherent states each of which could now be a codeword, i.e., a sequence of $N$ coherent states each drawn from a modulation alphabet. We analyze the channel capacity of the pure-loss optical channel with a general coherent-processing receiver in the low-photon number regime and compare it with the capacity achievable with direct detection and the Holevo limit (achieving the latter would require a quantum joint-detection receiver). We show compelling evidence that despite the optimal performance of Dolinar's receiver for the binary coherent-state hypothesis test (either in error probability or mutual information), the asymptotic communication rate achievable by such a coherent-processing receiver is only as good as direct detection. This suggests that in the infinitely-long codeword limit, all potential benefits of coherent processing at the receiver can be obtained by designing a good code and direct detection, with no feedback within the receiver.
- Oct 17 2016 cs.CV arXiv:1610.04308v1Active vision is inherently attention-driven: The agent selects views of observation to best approach the vision task while improving its internal representation of the scene being observed. Inspired by the recent success of attention-based models in 2D vision tasks based on single RGB images, we propose to address the multi-view depth-based active object recognition using attention mechanism, through developing an end-to-end recurrent 3D attentional network. The architecture comprises of a recurrent neural network (RNN), storing and updating an internal representation, and two levels of spatial transformer units, guiding two-level attentions. Our model, trained with a 3D shape database, is able to iteratively attend to the best views targeting an object of interest for recognizing it, and focus on the object in each view for removing the background clutter. To realize 3D view selection, we derive a 3D spatial transformer network which is differentiable for training with back-propagation, achieving must faster convergence than the reinforcement learning employed by most existing attention-based models. Experiments show that our method outperforms state-of-the-art methods in cluttered scenes.
- In this paper, we consider the problem of joint delay-Doppler estimation of moving targets in a passive radar that makes use of orthogonal frequency-division multiplexing (OFDM) communication signals. A compressed sensing algorithm is proposed to achieve supper-resolution and better accuracy, using both the atomic norm and the $\ell_1$-norm. The atomic norm is used to manifest the signal sparsity in the continuous domain. Unlike previous works which assume the demodulation to be error free, we explicitly introduce the demodulation error signal whose sparsity is imposed by the $\ell_1$-norm. On this basis, the delays and Doppler frequencies are estimated by solving a semidefinite program (SDP) which is convex. We also develop an iterative method for solving this SDP via the alternating direction method of multipliers (ADMM) where each iteration involves closed-form computation. Simulation results are presented to illustrate the high performance of the proposed algorithm.
- Oct 11 2016 cs.CV arXiv:1610.02984v1Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems will be reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of the hand-crafted systems and the large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefs some important yet under-developed issues.
- Aug 08 2016 cs.CV arXiv:1608.01807v2In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single-pass of an image to the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.
- This paper addresses the problem of large-scale image retrieval. We propose a two-layer fusion method which takes advantage of global and local cues and ranks database images from coarse to fine (C2F). Departing from the previous methods fusing multiple image descriptors simultaneously, C2F is featured by a layered procedure composed by filtering and refining. In particular, C2F consists of three components. 1) Distractor filtering. With holistic representations, noise images are filtered out from the database, so the number of candidate images to be used for comparison with the query can be greatly reduced. 2) Adaptive weighting. For a certain query, the similarity of candidate images can be estimated by holistic similarity scores in complementary to the local ones. 3) Candidate refining. Accurate retrieval is conducted via local features, combining the pre-computed adaptive weights. Experiments are presented on two benchmarks, \emphi.e., Holidays and Ukbench datasets. We show that our method outperforms recent fusion methods in terms of storage consumption and computation complexity, and that the accuracy is competitive to the state-of-the-arts.
- In this paper, we propose an open-loop unequal-error-protection querying policy based on superposition coding for the noisy 20 questions problem. In this problem, a player wishes to successively refine an estimate of the value of a continuous random variable by posing binary queries and receiving noisy responses. When the queries are designed non-adaptively as a single block and the noisy responses are modeled as the output of a binary symmetric channel the 20 questions problem can be mapped to an equivalent problem of channel coding with unequal error protection (UEP). A new non-adaptive querying strategy based on UEP superposition coding is introduced whose estimation error decreases with an exponential rate of convergence that is significantly better than that of the UEP repetition coding introduced by Variani et al. (2015). With the proposed querying strategy, the rate of exponential decrease in the number of queries matches the rate of a closed-loop adaptive scheme where queries are sequentially designed with the benefit of feedback. Furthermore, the achievable error exponent is significantly better than that of random block codes employing equal error protection.
- May 03 2016 cs.CV arXiv:1605.00052v1An increasing number of computer vision tasks can be tackled with deep features, which are the intermediate outputs of a pre-trained Convolutional Neural Network. Despite the astonishing performance, deep features extracted from low-level neurons are still below satisfaction, arguably because they cannot access the spatial context contained in the higher layers. In this paper, we present InterActive, a novel algorithm which computes the activeness of neurons and network connections. Activeness is propagated through a neural network in a top-down manner, carrying high-level context and improving the descriptive power of low-level and mid-level neurons. Visualization indicates that neuron activeness can be interpreted as spatial-weighted neuron responses. We achieve state-of-the-art classification performance on a wide range of image datasets.
- Apr 12 2016 cs.CV arXiv:1604.02531v2We present a novel large-scale dataset and comprehensive baselines for end-to-end pedestrian detection and person recognition in raw video frames. Our baselines address three issues: the performance of various combinations of detectors and recognizers, mechanisms for pedestrian detection to help improve overall re-identification accuracy and assessing the effectiveness of different detectors for re-identification. We make three distinct contributions. First, a new dataset, PRW, is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding box positions and identities. Extensive benchmarking results are presented on this dataset. Second, we show that pedestrian detection aids re-identification through two simple yet effective improvements: a discriminatively trained ID-discriminative Embedding (IDE) in the person subspace using convolutional neural network (CNN) features and a Confidence Weighted Similarity (CWS) metric that incorporates detection scores into similarity measurement. Third, we derive insights in evaluating detector performance for the particular scenario of accurate person re-identification.
- Apr 04 2016 cs.CV arXiv:1604.00133v1The objective of this paper is the effective transfer of the Convolutional Neural Network (CNN) feature in image search and classification. Systematically, we study three facts in CNN transfer. 1) We demonstrate the advantage of using images with a properly large size as input to CNN instead of the conventionally resized one. 2) We benchmark the performance of different CNN layers improved by average/max pooling on the feature maps. Our observation suggests that the Conv5 feature yields very competitive accuracy under such pooling step. 3) We find that the simple combination of pooled features extracted across various CNN layers is effective in collecting evidences from both low and high level descriptors. Following these good practices, we are capable of improving the state of the art on a number of benchmarks to a large margin.
- We study the problem of estimating $\beta \in \mathbb{R}^p$ from its noisy linear observations $y= X\beta+ w$, where $w \sim N(0, \sigma_w^2 I_{n\times n})$, under the following high-dimensional asymptotic regime: given a fixed number $\delta$, $p \rightarrow \infty$, while $n/p \rightarrow \delta$. We consider the popular class of $\ell_q$-regularized least squares (LQLS) estimators, a.k.a. bridge, given by the optimization problem: \beginequation* \hat\beta (\lambda, q ) ∈\arg\min_\beta \frac12 \|y-X\beta\|_2^2+ \lambda \|\beta\|_q^q, \endequation* and characterize the almost sure limit of $\frac{1}{p} \|\hat{\beta} (\lambda, q )- \beta\|_2^2$. The expression we derive for this limit does not have explicit forms and hence are not useful in comparing different algorithms, or providing information in evaluating the effect of $\delta$ or sparsity level of $\beta$. To simplify the expressions, researchers have considered the ideal "no-noise" regime and have characterized the values of $\delta$ for which the almost sure limit is zero. This is known as the phase transition analysis. In this paper, we first perform the phase transition analysis of LQLS. Our results reveal some of the limitations and misleading features of the phase transition analysis. To overcome these limitations, we propose the study of these algorithms under the low noise regime. Our new analysis framework not only sheds light on the results of the phase transition analysis, but also makes an accurate comparison of different regularizers possible.
- Data processing inequalities for $f$-divergences can be sharpened using contraction coefficients to produce strong data processing inequalities. These contraction coefficients turn out to provide useful optimization problems for learning likelihood models. Moreover, the contraction coefficient for $\chi^2$-divergence admits a particularly simple linear algebraic solution due to its relation to maximal correlation. Propelled by this context, we analyze the relationship between various contraction coefficients for $f$-divergences and the contraction coefficient for $\chi^2$-divergence. In particular, we prove that the latter coefficient can be obtained from the former coefficients by driving the input $f$-divergences to zero. Then, we establish linear bounds between these contraction coefficients. These bounds are refined for the KL divergence case using a well-known distribution dependent variant of Pinsker's inequality.
- In this letter, we proposed a shared-antenna full-duplex massive MIMO model for the multiuser MIMO system. This model exploits a single antenna array at the base station (BS) to transmit and receive the signals simultaneously. It has the merits of both the full-duplex system and the time-division duplex (TDD) massive MIMO system, i.e., the high spectral efficiency and the channel reciprocity. We focus on the zero-forcing (ZF) and the maximal-ratio transmission/maximal-ratio combining (MRT/MRC) linear processing methods, which are commonly used in the massive MIMO system. As the main finding, we prove that the self-interference (SI) in a shared-antenna full-duplex massive MU-MIMO system can be suppressed in the completely dependent uplink and downlink channels when the number of antennas becomes large enough.
- This paper considers the problem of communication over a discrete memoryless channel (DMC) or an additive white Gaussian noise (AWGN) channel subject to the constraint that the probability that an adversary who observes the channel outputs can detect the communication is low. Specifically, the relative entropy between the output distributions when a codeword is transmitted and when no input is provided to the channel must be sufficiently small. For a DMC whose output distribution induced by the "off" input symbol is not a mixture of the output distributions induced by other input symbols, it is shown that the maximum amount of information that can be transmitted under this criterion scales like the square root of the blocklength. The same is true for the AWGN channel. Exact expressions for the scaling constant are also derived.
- In this paper, we extend the information theoretic framework that was developed in earlier work to multi-hop network settings. For a given network, we construct a novel deterministic model that quantifies the ability of the network in transmitting private and common messages across users. Based on this model, we formulate a linear optimization problem that explores the throughput of a multi-layer network, thereby offering the optimal strategy as to what kind of common messages should be generated in the network to maximize the throughput. With this deterministic model, we also investigate the role of feedback for multi-layer networks, from which we identify a variety of scenarios in which feedback can improve transmission efficiency. Our results provide fundamental guidelines as to how to coordinate cooperation between users to enable efficient information exchanges across them.
- Feb 10 2015 cs.CV arXiv:1502.02171v1For long time, person re-identification and image search are two separately studied tasks. However, for person re-identification, the effectiveness of local features and the "query-search" mode make it well posed for image search techniques. In the light of recent advances in image search, this paper proposes to treat person re-identification as an image search problem. Specifically, this paper claims two major contributions. 1) By designing an unsupervised Bag-of-Words representation, we are devoted to bridging the gap between the two tasks by integrating techniques from image search in person re-identification. We show that our system sets up an effective yet efficient baseline that is amenable to further supervised/unsupervised improvements. 2) We contribute a new high quality dataset which uses DPM detector and includes a number of distractor images. Our dataset reaches closer to realistic settings, and new perspectives are provided. Compared with approaches that rely on feature-feature match, our method is faster by over two orders of magnitude. Moreover, on three datasets, we report competitive results compared with the state-of-the-art methods.
- Widespread use of the Internet and social networks invokes the generation of big data, which is proving to be useful in a number of applications. To deal with explosively growing amounts of data, data analytics has emerged as a critical technology related to computing, signal processing, and information networking. In this paper, a formalism is considered in which data is modeled as a generalized social network and communication theory and information theory are thereby extended to data analytics. First, the creation of an equalizer to optimize information transfer between two data variables is considered, and financial data is used to demonstrate the advantages. Then, an information coupling approach based on information geometry is applied for dimensionality reduction, with a pattern recognition example to illustrate the effectiveness. These initial trials suggest the potential of communication theoretic data analytics for a wide range of applications.
- In many application areas we are faced with the following question: Can we recover a sparse vector $x_o \in \mathbb{R}^N$ from its undersampled set of noisy observations $y \in \mathbb{R}^n$, $y=A x_o+w$. The last decade has witnessed a surge of algorithms and theoretical results addressing this question. One of the most popular algorithms is the $\ell_p$-regularized least squares (LPLS) given by the following formulation: \[ \hatx(\gamma,p )∈\arg\min_x \frac12\|y - Ax\|_2^2+\gamma\|x\|_p^p, \]where $p \in [0,1]$. Despite the non-convexity of these problems for $p<1$, they are still appealing because of the following folklores in compressed sensing: (i) $\hat{x}(\gamma,p )$ is closer to $x_o$ than $\hat{x}(\gamma,1)$. (ii) If we employ iterative methods that aim to converge to a local minima of LPLS, then under good initialization these algorithms converge to a solution that is closer to $x_o$ than $\hat{x}(\gamma,1)$. In spite of the existence of plenty of empirical results that support these folklore theorems, the theoretical progress to establish them has been very limited. This paper aims to study the above folklore theorems and establish their scope of validity. Starting with approximate message passing algorithm as a heuristic method for solving LPLS, we study the impact of initialization on the performance of AMP. Then, we employ the replica analysis to show the connection between the solution of AMP and $\hat{x}(\gamma, p)$ in the asymptotic settings. This enables us to compare the accuracy of $\hat{x}(\gamma,p)$ for $p \in [0,1]$. In particular, we will characterize the phase transition and noise sensitivity of LPLS for every $0\leq p\leq 1$ accurately. Our results in the noiseless setting confirm that LPLS exhibits the same phase transition for every $0\leq p <1$ and this phase transition is much higher than that of LASSO.
- Dec 16 2014 cs.PL arXiv:1412.4480v2Locks have been widely used as an effective synchronization mechanism among processes and threads. However, we observe that a large number of false inter-thread dependencies (i.e., unnecessary lock contentions) exist during the program execution on multicore processors, thereby incurring significant performance overhead. This paper presents a performance debugging framework, PERFPLAY, to facilitate a comprehensive and in-depth understanding of the performance impact of unnecessary lock contentions. The core technique of our debugging framework is trace replay. Specifically, PERFPLAY records the program execution trace, on the basis of which the unnecessary lock contentions can be identified through trace analysis. We then propose a novel technique of trace transformation to transform these identified unnecessary lock contentions in the original trace into the correct pattern as a new trace free of unnecessary lock contentions. Through replaying both traces, PERFPLAY can quantify the performance impact of unnecessary lock contentions. To demonstrate the effectiveness of our debugging framework, we study five real-world programs and PARSEC benchmarks. Our experimental results demonstrate the significant performance overhead of unnecessary lock contentions, and the effectiveness of PERFPLAY in identifying the performance critical unnecessary lock contentions in real applications.
- Many network information theory problems face the similar difficulty of single-letterization. We argue that this is due to the lack of a geometric structure on the space of probability distribution. In this paper, we develop such a structure by assuming that the distributions of interest are close to each other. Under this assumption, the K-L divergence is reduced to the squared Euclidean metric in an Euclidean space. In addition, we construct the notion of coordinate and inner product, which will facilitate solving communication problems. We will present the application of this approach to the point-to-point channel, general broadcast channel, and the multiple access channel (MAC) with the common source. It can be shown that with this approach, information theory problems, such as the single-letterization, can be reduced to some linear algebra problems. Moreover, we show that for the general broadcast channel, transmitting the common message to receivers can be formulated as the trade-off between linear systems. We also provide an example to visualize this trade-off in a geometric way. Finally, for the MAC with the common source, we observe a coherent combining gain due to the cooperation between transmitters, and this gain can be quantified by applying our technique.
- Jun 04 2014 cs.CV arXiv:1406.0680v1This paper introduces an improved reranking method for the Bag-of-Words (BoW) based image search. Built on [1], a directed image graph robust to outlier distraction is proposed. In our approach, the relevance among images is encoded in the image graph, based on which the initial rank list is refined. Moreover, we show that the rank-level feature fusion can be adopted in this reranking method as well. Taking advantage of the complementary nature of various features, the reranking performance is further enhanced. Particularly, we exploit the reranking method combining the BoW and color information. Experiments on two benchmark datasets demonstrate that ourmethod yields significant improvements and the reranking results are competitive to the state-of-the-art methods.
- The majority of recommender systems are designed to recommend items (such as movies and products) to users. We focus on the problem of recommending buyers to sellers which comes with new challenges: (1) constraints on the number of recommendations buyers are part of before they become overwhelmed, (2) constraints on the number of recommendations sellers receive within their budget, and (3) constraints on the set of buyers that sellers want to receive (e.g., no more than two people from the same household). We propose the following critical problems of recommending buyers to sellers: Constrained Recommendation (C-REC) capturing the first two challenges, and Conflict-Aware Constrained Recommendation (CAC-REC) capturing all three challenges at the same time. We show that C-REC can be modeled using linear programming and can be efficiently solved using modern solvers. On the other hand, we show that CAC-REC is NP-hard. We propose two approximate algorithms to solve CAC-REC and show that they achieve close to optimal solutions via comprehensive experiments using real-world datasets.
- Jun 03 2014 cs.CV arXiv:1406.0132v1In the Bag-of-Words (BoW) model based image retrieval task, the precision of visual matching plays a critical role in improving retrieval performance. Conventionally, local cues of a keypoint are employed. However, such strategy does not consider the contextual evidences of a keypoint, a problem which would lead to the prevalence of false matches. To address this problem, this paper defines "true match" as a pair of keypoints which are similar on three levels, i.e., local, regional, and global. Then, a principled probabilistic framework is established, which is capable of implicitly integrating discriminative cues from all these feature levels. Specifically, the Convolutional Neural Network (CNN) is employed to extract features from regional and global patches, leading to the so-called "Deep Embedding" framework. CNN has been shown to produce excellent performance on a dozen computer vision tasks such as image classification and detection, but few works have been done on BoW based image retrieval. In this paper, firstly we show that proper pre-processing techniques are necessary for effective usage of CNN feature. Then, in the attempt to fit it into our model, a novel indexing structure called "Deep Indexing" is introduced, which dramatically reduces memory usage. Extensive experiments on three benchmark datasets demonstrate that, the proposed Deep Embedding method greatly promotes the retrieval accuracy when CNN feature is integrated. We show that our method is efficient in terms of both memory and time cost, and compares favorably with the state-of-the-art methods.
- Apr 10 2014 cs.NI arXiv:1404.2366v1Device-to-device (D2D) communications in cellular networks are promising technologies for improving network throughput, spectrum efficiency, and transmission delay. In this paper, we first introduce the concept of guard distance to explore a proper system model for enabling multiple concurrent D2D pairs in the same cell. Considering the Signal to Interference Ratio (SIR) requirements for both macro-cell and D2D communications, a geometrical method is proposed to obtain the guard distances from a D2D user equipment (DUE) to the base station (BS), to the transmitting cellular user equipment (CUE), and to other communicating D2D pairs, respectively, when the uplink resource is reused. By utilizing the guard distances, we then derive the bounds of the maximum throughput improvement provided by D2D communications in a cell. Extensive simulations are conducted to demonstrate the impact of different parameters on the optimal maximum throughput. We believe that the obtained results can provide useful guidelines for the deployment of future cellular networks with underlaying D2D communications.