results for au:Zhang_S in:cs

- We describe a prototype dialogue response generation model for the customer service domain at Amazon. The model, which is trained in a weakly supervised fashion, measures the similarity between customer questions and agent answers using a dual encoder network, a Siamese-like neural network architecture. Answer templates are extracted from embeddings derived from past agent answers, without turn-by-turn annotations. Responses to customer inquiries are generated by selecting the best template from the final set of templates. We show that, in a closed domain like customer service, the selected templates cover $>$70\% of past customer inquiries. Furthermore, the relevance of the model-selected templates is significantly higher than templates selected by a standard tf-idf baseline.
- Mar 20 2017 cs.CV arXiv:1703.05870v2Chinese font recognition (CFR) has gained significant attention in recent years. However, due to the sparsity of labeled font samples and the structural complexity of Chinese characters, CFR is still a challenging task. In this paper, a DropRegion method is proposed to generate a large number of stochastic variant font samples whose local regions are selectively disrupted and an inception font network (IFN) with two additional convolutional neural network (CNN) structure elements, i.e., a cascaded cross-channel parametric pooling (CCCP) and global average pooling, is designed. Because the distribution of strokes in a font image is non-stationary, an elastic meshing technique that adaptively constructs a set of local regions with equalized information is developed. Thus, DropRegion is seamlessly embedded in the IFN, which enables end-to-end training; the proposed DropRegion-IFN can be used for high performance CFR. Experimental results have confirmed the effectiveness of our new approach for CFR.
- Mar 20 2017 cs.CV arXiv:1703.05868v2Understanding traffic density from large-scale web camera (webcam) videos is a challenging problem because such videos have low spatial and temporal resolution, high occlusion and large perspective. To deeply understand traffic density, we explore both deep learning based and optimization based methods. To avoid individual vehicle detection and tracking, both methods map the image into vehicle density map, one based on rank constrained regression and the other one based on fully convolution networks (FCN). The regression based method learns different weights for different blocks in the image to increase freedom degrees of weights and embed perspective information. The FCN based method jointly estimates vehicle density map and vehicle count with a residual learning framework to perform end-to-end dense prediction, allowing arbitrary image resolution, and adapting to different vehicle scales and perspectives. We analyze and compare both methods, and get insights from optimization based method to improve deep model. Since existing datasets do not cover all the challenges in our work, we collected and labelled a large-scale traffic video dataset, containing 60 million frames from 212 webcams. Both methods are extensively evaluated and compared on different counting tasks and datasets. FCN based method significantly reduces the mean absolute error from 10.99 to 5.31 on the public dataset TRANCOS compared with the state-of-the-art baseline.
- Mar 10 2017 cs.IR arXiv:1703.03112v2Recommender systems have been actively and extensively studied over past decades. In the meanwhile, the boom of Big Data is driving fundamental changes in the development of recommender systems. In this paper, we propose a dynamic intention-aware recommender system to better facilitate users to find desirable products and services. Compare to prior work, our proposal possesses the following advantages: (1) it takes user intentions and demands into account through intention mining techniques. By unearthing user intentions from the historical user-item interactions, and various user digital traces harvested from social media and Internet of Things, it is capable of delivering more satisfactory recommendations by leveraging rich online and offline user data; (2) it embraces the benefits of embedding heterogeneous source information and shared representations of multiple domains to provide accurate and effective recommendations comprehensively; (3) it recommends products or services proactively and timely by capturing the dynamic influences, which can significantly reduce user involvements and efforts.
- Mar 09 2017 cs.NI arXiv:1703.02664v1This article proposes a software defined space-air-ground integrated network architecture for supporting diverse vehicular services in a seamless, efficient, and cost-effective manner. Firstly, the motivations and challenges for integration of space-air-ground networks are reviewed. Secondly, a software defined network architecture with a layered structure is presented. To protect the legacy services in satellite, aerial, and territorial segments, resources in each segment are sliced through network slicing to achieve service isolation. Then, available resources are put into a common and dynamic space-air-ground resource pool, which is managed by hierarchical controllers to accommodate vehicular services. Finally, a case study is carried out, followed by discussion on some open research topics.
- Mar 07 2017 cs.AI arXiv:1703.01893v1As a well-known NP-hard problem, the Three-Index Assignment Problem (AP3) has attracted lots of research efforts for developing heuristics. However, existing heuristics either obtain less competitive solutions or consume too much time. In this paper, a new heuristic named Approximate Muscle guided Beam Search (AMBS) is developed to achieve a good trade-off between solution quality and running time. By combining the approximate muscle with beam search, the solution space size can be significantly decreased, thus the time for searching the solution can be sharply reduced. Extensive experimental results on the benchmark indicate that the new algorithm is able to obtain solutions with competitive quality and it can be employed on instances with largescale. Work of this paper not only proposes a new efficient heuristic, but also provides a promising method to improve the efficiency of beam search.
- Feb 21 2017 cs.CV arXiv:1702.05693v1Convnets have enabled significant progress in pedestrian detection recently, but there are still open questions regarding suitable architectures and training data. We revisit CNN design and point out key adaptations, enabling plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset. To achieve further improvement from more and better data, we introduce CityPersons, a new set of person annotations on top of the Cityscapes dataset. The diversity of CityPersons allows us for the first time to train one single CNN model that generalizes well over multiple benchmarks. Moreover, with additional training with CityPersons, we obtain top results using FasterRCNN on Caltech, improving especially for more difficult cases (heavy occlusion and small scale) and providing higher localization quality.
- Perception-driven approach and end-to-end system are two major vision-based frameworks for self-driving cars. However, it is difficult to introduce attention and historical information of autonomous driving process, which are the essential factors for achieving human-like driving into these two methods. In this paper, we propose a novel model for self-driving cars named brain-inspired cognitive model with attention (CMA). This model consists of three parts: a convolutional neural network for simulating human visual cortex, a cognitive map built to describe relationships between objects in complex traffic scene and a recurrent neural network that combines with the real-time updated cognitive map to implement attention mechanism and long-short term memory. The benefit of our model is that can accurately solve three tasks simultaneously:1) detection of the free space and boundaries of the current and adjacent lanes. 2)estimation of obstacle distance and vehicle attitude, and 3) learning of driving behavior and decision making from human driver. More significantly, the proposed model could accept external navigating instructions during an end-to-end driving process. For evaluation, we build a large-scale road-vehicle dataset which contains more than forty thousand labeled road images captured by three cameras on our self-driving car. Moreover, human driving activities and vehicle states are recorded in the meanwhile.
- Feb 21 2017 cs.CV arXiv:1702.05743v2Most traditional algorithms for compressive sensing image reconstruction suffer from the intensive computation. Recently, deep learning-based reconstruction algorithms have been reported, which dramatically reduce the time complexity than iterative reconstruction algorithms. In this paper, we propose a novel Deep Residual Reconstruction Network (DR$^{2}$-Net) to reconstruct the image from its Compressively Sensed (CS) measurement. The DR$^{2}$-Net is proposed based on two observations: 1) linear mapping could reconstruct a high-quality preliminary image, and 2) residual learning could further improve the reconstruction quality. Accordingly, DR$^{2}$-Net consists of two components, i.e., linear mapping network and residual network, respectively. Specifically, the fully-connected layer in neural network implements the linear mapping network. We then expand the linear mapping network to DR$^{2}$-Net by adding several residual learning blocks to enhance the preliminary image. Extensive experiments demonstrate that the DR$^{2}$-Net outperforms traditional iterative methods and recent deep learning-based methods by large margins at measurement rates 0.01, 0.04, 0.1, and 0.25, respectively.
- Block Coordinate Update (BCU) methods enjoy low per-update computational complexity because every time only one or a few block variables would need to be updated among possibly a large number of blocks. They are also easily parallelized and thus have been particularly popular for solving problems involving large-scale dataset and/or variables. In this paper, we propose a primal-dual BCU method for solving linearly constrained convex program in multi-block variables. The method is an accelerated version of a primal-dual algorithm proposed by the authors, which applies randomization in selecting block variables to update and establishes an $O(1/t)$ convergence rate under weak convexity assumption. We show that the rate can be accelerated to $O(1/t^2)$ if the objective is strongly convex. In addition, if one block variable is independent of the others in the objective, we then show that the algorithm can be modified to achieve a linear rate of convergence. The numerical experiments show that the accelerated method performs stably with a single set of parameters while the original method needs to tune the parameters for different datasets in order to achieve a comparable level of performance.
- We present Deep Generalized Canonical Correlation Analysis (DGCCA) -- a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA repre- sentations on two distinct datasets for three downstream tasks: phonetic transcrip- tion from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques.
- This paper investigates a new analog beamforming architecture for massive multiple-input multiple-output (MIMO) systems, where each of the multiple transmit antennas is switched to be on or off to form a beam according to the channel state information at transmitters. This on-off analog beamforming (OABF) scheme has the advantages in terms of both hardware complexities and algorithmic complexities. OABF can completely remove the high-cost, power-consuming and bulky analog phase-shifters that are extensively employed by traditional analog beamforming schemes; it only requires the deployment of low-cost analog switches that are easy to implement. Moreover, we show that the beams formed by such simple antenna on-off switch operations can achieve rather good performances with low-complexity beamforming algorithms. Specifically, we first propose two optimal signal-to-noise ratio maximization algorithms to determine the on-off state of each switch under the per-antenna power constraint and the total power constraint, respectively. After that, we theoretically prove that OABF can achieve the full diversity gain and the full array gain with complexities up to a polynomial order. Numerical results are consistent with our theoretical analysis. We believe that the simple structure of OABF makes massive MIMO much easier to implement in real systems.
- A method for efficiently successive cancellation (SC) decoding of polar codes with high-dimensional linear binary kernels (HDLBK) is presented and analyzed. We devise a $l$-expressions method which can obtain simplified recursive formulas of SC decoder in likelihood ratio form for arbitrary linear binary kernels to reduce the complexity of corresponding SC decoder. By considering the bit-channel transition probabilities $W_{G}^{(\cdot)}(\cdot|0)$ and $W_{G}^{(\cdot)}(\cdot|1)$ separately, a $W$-expressions method is proposed to further reduce the complexity of HDLBK based SC decoder. For a $m\times m$ binary kernel, the complexity of straightforward SC decoder is $O(2^{m}N\log N)$. With $W$-expressions, we reduce the complexity of straightforward SC decoder to $O(m^{2}N\log N)$ when $m\leq 16$. Simulation results show that $16\times16$ kernel polar codes offer significant advantages in terms of error performances compared with $2\times2$ kernel polar codes under SC and list SC decoders.
- Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
- Jan 06 2017 cs.CV arXiv:1701.01272v1In this paper, we study learning generalized driving style representations from automobile GPS trip data. We propose a novel Autoencoder Regularized deep neural Network (ARNet) and a trip encoding framework trip2vec to learn drivers' driving styles directly from GPS records, by combining supervised and unsupervised feature learning in a unified architecture. Experiments on a challenging driver number estimation problem and the driver identification problem show that ARNet can learn a good generalized driving style representation: It significantly outperforms existing methods and alternative architectures by reaching the least estimation error on average (0.68, less than one driver) and the highest identification accuracy (by at least 3% improvement) compared with traditional supervised learning methods.
- A new type of End-to-End system for text-dependent speaker verification is presented in this paper. Previously, using the phonetically discriminative/speaker discriminative DNNs as feature extractors for speaker verification has shown promising results. The extracted frame-level (DNN bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use speaker discriminative CNNs to extract the noise-robust frame-level features. These features are smartly combined to form an utterance-level speaker vector through an attention mechanism. The proposed attention model takes the speaker discriminative information and the phonetic information to learn the weights. The whole system, including the CNN and attention model, is joint optimized using an end-to-end criterion. The training algorithm imitates exactly the evaluation process --- directly mapping a test utterance and a few target speaker utterances into a single verification score. The algorithm can automatically select the most similar impostor for each target speaker to train the network. We demonstrated the effectiveness of the proposed end-to-end system on Windows $10$ "Hey Cortana" speaker verification task.
- Dec 22 2016 cs.CV physics.optics arXiv:1612.07120v1We have designed a single-pixel camera with imaging around corners based on computational ghost imaging. It can obtain the image of an object when the camera cannot look at the object directly. Our imaging system explores the fact that a bucket detector in a ghost imaging setup has no spatial resolution capability. A series of experiments have been designed to confirm our predictions. This camera has potential applications for imaging around corner or other similar environments where the object cannot be observed directly.
- In this paper, we propose a new task and solution for vision and language: generation of grounded visual questions. Visual question answering (VQA) is an emerging topic which links textual questions with visual input. To the best of our knowledge, it lacks automatic method to generate reasonable and versatile questions. So far, almost all the textual questions are generated manually, as well as the corresponding answers. To this end, we propose a system that automatically generates visually grounded questions . First, visual input is analyzed with deep caption model. Second, the captions along with VGG-16 features are used as input for our proposed question generator to generate visually grounded questions. Finally, to enable generating of versatile questions, a question type selection module is provided which selects reasonable question types and provide them as parameters for question generation. This is done using a hybrid LSTM with both visual and answer input. Our system is trained using VQA and Visual7W dataset and shows reasonable results on automatically generating of new visual questions. We also propose a quantitative metric for automatic evaluation of the question quality.
- Dec 20 2016 cs.LG arXiv:1612.06052v1In this paper, we propose a stochastic proximal gradient method to train ternary weight neural networks (TNN). The proposed method features weight ternarization via an exact formula of proximal operator. Our experiments show that our trained TNN are able to preserve the state-of-the-art performance on MNIST and CIFAR10 benchmark datesets.
- Synthesizing photo-realistic images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose stacked Generative Adversarial Networks (StackGAN) to generate photo-realistic images conditioned on text descriptions. The Stage-I GAN sketches the primitive shape and basic colors of the object based on the given text description, yielding Stage-I low resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high resolution images with photo-realistic details. The Stage-II GAN is able to rectify defects and add compelling details with the refinement process. Samples generated by StackGAN are more plausible than those generated by existing approaches. Importantly, our StackGAN for the first time generates realistic 256 x 256 images conditioned on only text descriptions, while state-of-the-art methods can generate at most 128 x 128 images. To demonstrate the effectiveness of the proposed StackGAN, extensive experiments are conducted on CUB and Oxford-102 datasets, which contain enough object appearance variations and are widely-used for text-to-image generation analysis.
- Dec 12 2016 cs.CR physics.data-an arXiv:1612.02886v1Quantum computing has undergone rapid development in recent years. Owing to limitations on scalability, personal quantum computers still seem slightly unrealistic in the near future. The first practical quantum computer for ordinary users is likely to be on the cloud. However, the adoption of cloud computing is possible only if security is ensured. Homomorphic encryption is a cryptographic protocol that allows computation to be performed on encrypted data without decrypting them, so it is well suited to cloud computing. Here, we first applied homomorphic encryption on IBM's cloud quantum computer platform. In our experiments, we successfully implemented a quantum algorithm for linear equations while protecting our privacy. This demonstration opens a feasible path to the next stage of development of cloud quantum information technology.
- In this paper, we consider the problem of event classification with multi-variate time series data consisting of heterogeneous (continuous and categorical) variables. The complex temporal dependencies between the variables combined with sparsity of the data makes the event classification problem particularly challenging. Most state-of-art approaches address this either by designing hand-engineered features or breaking up the problem over homogeneous variates. In this work, we propose and compare three representation learning algorithms over symbolized sequences which enables classification of heterogeneous time-series data using a deep architecture. The proposed representations are trained jointly along with the rest of the network architecture in an end-to-end fashion that makes the learned features discriminative for the given task. Experiments on three real-world datasets demonstrate the effectiveness of the proposed approaches.
- Nov 30 2016 cs.DB arXiv:1611.09691v1There are two main approximations of mining big data in memory. One is to partition a big dataset to several subsets, so as to mine each subset in memory. By this way, global patterns can be obtained by synthesizing all local patterns discovered from these subsets. Another is the statistical sampling method. This indicates that data partitioning should be an important strategy for mining big data. This paper recalls our work on mining big data with a data partitioning and shows some interesting findings among the local patterns discovered from subsets of a dataset.
- Nov 17 2016 cs.CV arXiv:1611.05271v1MeshFace photos have been widely used in many Chinese business organizations to protect ID face photos from being misused. The occlusions incurred by random meshes severely degenerate the performance of face verification systems, which raises the MeshFace verification problem between MeshFace and daily photos. Previous methods cast this problem as a typical low-level vision problem, i.e. blind inpainting. They recover perceptually pleasing clear ID photos from MeshFaces by enforcing pixel level similarity between the recovered ID images and the ground-truth clear ID images and then perform face verification on them. Essentially, face verification is conducted on a compact feature space rather than the image pixel space. Therefore, this paper argues that pixel level similarity and feature level similarity jointly offer the key to improve the verification performance. Based on this insight, we offer a novel feature oriented blind face inpainting framework. Specifically, we implement this by establishing a novel DeMeshNet, which consists of three parts. The first part addresses blind inpainting of the MeshFaces by implicitly exploiting extra supervision from the occlusion position to enforce pixel level similarity. The second part explicitly enforces a feature level similarity in the compact feature space, which can explore informative supervision from the feature space to produce better inpainting results for verification. The last part copes with face alignment within the net via a customized spatial transformer module when extracting deep facial features. All the three parts are implemented within an end-to-end network that facilitates efficient optimization. Extensive experiments on two MeshFace datasets demonstrate the effectiveness of the proposed DeMeshNet as well as the insight of this paper.
- In this paper, we study the resource allocation problem for a single-cell non-orthogonal multiple access (NOMA) relay network where an OFDM amplify-and-forward (AF) relay allocates the spectrum and power resources to the source-destination (SD) pairs. We aim to optimize the resource allocation to maximize the average sum-rate. The optimal approach requires an exhaustive search, leading to an NP-hard problem. To solve this problem, we propose two efficient many-to-many two-sided SD pair-subchannel matching algorithms in which the SD pairs and sub-channels are considered as two sets of players chasing their own interests. The proposed algorithms can provide a sub-optimal solution to this resource allocation problem in affordable time. Both the static matching algorithm and dynamic matching algorithm converge to a pair-wise stable matching after a limited number of iterations. Simulation results show that the capacity of both proposed algorithms in the NOMA scheme significantly outperforms the conventional orthogonal multiple access scheme. The proposed matching algorithms in NOMA scheme also achieve a better user-fairness performance than the conventional orthogonal multiple access.
- Nov 15 2016 cs.DB arXiv:1611.04288v1A challenge for data imputation is the lack of knowledge. In this paper, we attempt to address this challenge by involving extra knowledge from web. To achieve high-performance web-based imputation, we use the dependency, i.e.FDs and CFDs, to impute as many as possible values automatically and fill in the other missing values with the minimal access of web, whose cost is relatively large. To make sufficient use of dependencies, We model the dependency set on the data as a graph and perform automatical imputation and keywords generation for web-based imputation based on such graph model. With the generated keywords, we design two algorithms to extract values for imputation from the search results. Extensive experimental results based on real-world data collections show that the proposed approach could impute missing values efficiently and effectively compared to existing approach.
- This paper describes the USTC_NELSLIP systems submitted to the Trilingual Entity Detection and Linking (EDL) track in 2016 TAC Knowledge Base Population (KBP) contests. We have built two systems for entity discovery and mention detection (MD): one uses the conditional RNNLM and the other one uses the attention-based encoder-decoder framework. The entity linking (EL) system consists of two modules: a rule based candidate generation and a neural networks probability ranking model. Moreover, some simple string matching rules are used for NIL clustering. At the end, our best system has achieved an F1 score of 0.624 in the end-to-end typed mention ceaf plus metric.
- Nov 09 2016 cs.CV arXiv:1611.02644v1Multispectral pedestrian detection is essential for around-the-clock applications, e.g., surveillance and autonomous driving. We deeply analyze Faster R-CNN for multispectral pedestrian detection task and then model it into a convolutional network (ConvNet) fusion problem. Further, we discover that ConvNet-based pedestrian detectors trained by color or thermal images separately provide complementary information in discriminating human instances. Thus there is a large potential to improve pedestrian detection by using color and thermal images in DNNs simultaneously. We carefully design four ConvNet fusion architectures that integrate two-branch ConvNets on different DNNs stages, all of which yield better performance compared with the baseline detector. Our experimental results on KAIST pedestrian benchmark show that the Halfway Fusion model that performs fusion on the middle-level convolutional features outperforms the baseline method by 11% and yields a missing rate 3.5% lower than the other proposed architectures.
- Nov 03 2016 cs.CL arXiv:1611.00601v2Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses of subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge for corpora, which is then used to construct a dataset for this ordinal entailment task, which we then use to train and evaluate a sequence to sequence neural network model. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.
- This paper proposes a new channel estimation scheme for the multiuser massive multiple-input multiple-output (MIMO) systems in time-varying environment. We introduce a discrete Fourier transform (DFT) aided spatial-temporal basis expansion model (ST-BEM) to reduce the effective dimensions of uplink/downlink channels, such that training overhead and feedback cost could be greatly decreased. The newly proposed ST-BEM is suitable for both time division duplex (TDD) systems and frequency division duplex (FDD) systems thanks to the angle reciprocity, and can be efficiently deployed by fast Fourier transform (FFT). Various numerical results have corroborated the proposed studies.
- The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.
- The first objective towards the effective use of microblogging services such as Twitter for situational awareness during the emerging disasters is discovery of the disaster-related postings. Given the wide range of possible disasters, using a pre-selected set of disaster-related keywords for the discovery is suboptimal. An alternative that we focus on in this work is to train a classifier using a small set of labeled postings that are becoming available as a disaster is emerging. Our hypothesis is that utilizing large quantities of historical microblogs could improve the quality of classification, as compared to training a classifier only on the labeled data. We propose to use unlabeled microblogs to cluster words into a limited number of clusters and use the word clusters as features for classification. To evaluate the proposed semi-supervised approach, we used Twitter data from 6 different disasters. Our results indicate that when the number of labeled tweets is 100 or less, the proposed approach is superior to the standard classification based on the bag or words feature representation. Our results also reveal that the choice of the unlabeled corpus, the choice of word clustering algorithm, and the choice of hyperparameters can have a significant impact on the classification accuracy.
- Oct 10 2016 cs.AR arXiv:1610.02094v1Cycle-accurate software simulation of multicores with complex microarchitectures is often excruciatingly slow. People use simplified core models to gain simulation speed. However, a persistent question is to what extent the results derived from a simplified core model can be used to characterize the behavior of a real machine. We propose a new methodology of validating simplified simulation models, which focuses on the trends of metric values across benchmarks and architectures, instead of errors of absolute metric values. To illustrate this methodology, we conduct a case study using an FPGA-accelerated cycle-accurate full system simulator. We evaluated three cache replacement polices on a 10-stage in-order core model, and then re-conducted all the experiments by substituting a 1-IPC core model for the 10-stage core model. We found that the 1-IPC core model generally produces qualitatively the same results as the accurate core model except for a few mismatches. We argue that most observed mismatches were either indistinguishable from experimental noise or corresponded to the cases where the policy differences even in the accurate model showed inconclusive results. We think it is fair to use simplified core models to study a feature once the influence of the simplification is understood. Additional studies on branch predictors and scaling properties of multithread benchmarks reinforce our argument. However, the validation of a simplified model requires a detailed cycle-accurate model!
- Recurrent neural networks (RNNs) have shown clear superiority in sequence modeling, particularly the ones with gated units, such as long short-term memory (LSTM) and gated recurrent unit (GRU). However, the dynamic properties behind the remarkable performance remain unclear in many applications, e.g., automatic speech recognition (ASR). This paper employs visualization techniques to study the behavior of LSTM and GRU when performing speech recognition tasks. Our experiments show some interesting patterns in the gated memory, and some of them have inspired simple yet effective modifications on the network structure. We report two of such modifications: (1) lazy cell update in LSTM, and (2) shortcut connections for residual learning. Both modifications lead to more comprehensible and powerful networks.
- Sep 29 2016 cs.CV arXiv:1609.08740v1Hashing method maps similar data to binary hashcodes with smaller hamming distance, and it has received a broad attention due to its low storage cost and fast retrieval speed. However, the existing limitations make the present algorithms difficult to deal with large-scale datasets: (1) discrete constraints are involved in the learning of the hash function; (2) pairwise or triplet similarity is adopted to generate efficient hashcodes, resulting both time and space complexity are greater than O(n^2). To address these issues, we propose a novel discrete supervised hash learning framework which can be scalable to large-scale datasets. First, the discrete learning procedure is decomposed into a binary classifier learning scheme and binary codes learning scheme, which makes the learning procedure more efficient. Second, we adopt the Asymmetric Low-rank Matrix Factorization and propose the Fast Clustering-based Batch Coordinate Descent method, such that the time and space complexity is reduced to O(n). The proposed framework also provides a flexible paradigm to incorporate with arbitrary hash function, including deep neural networks and kernel methods. Experiments on large-scale datasets demonstrate that the proposed method is superior or comparable with state-of-the-art hashing algorithms.
- This paper presents a unified model to perform language and speaker recognition simultaneously and altogether. The model is based on a multi-task recurrent neural network where the output of one task is fed as the input of the other, leading to a collaborative learning framework that can improve both language and speaker recognition by borrowing information from each other. Our experiments demonstrated that the multi-task model outperforms the task-specific models on both tasks.
- Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with regularized term $\lambda \|\bm{w}\|_1 + \eta\bm{w}^T\bm{M}\bm{w}$, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different $\bm{M}$. This framework can be easily solved in a unified manner by a cyclic coordinate descent algorithm which can avoid inverse matrix operation and accelerate the computing speed. However, if those estimated $\bm{w}_i$ and $\bm{w}_j$ have opposite signs, then the traditional network-regularized penalty may not perform well. To address it, we introduce a novel network-regularized sparse LR model with a new penalty $\lambda \|\bm{w}\|_1 + \eta|\bm{w}|^T\bm{M}|\bm{w}|$ to consider the difference between the absolute values of the coefficients. And we develop two efficient algorithms to solve it. Finally, we test our methods and compare them with the related ones using simulated and real data to show their efficiency.
- Aug 02 2016 cs.CV arXiv:1608.00168v1Recently, sparse representation based visual tracking methods have attracted increasing attention in the computer vision community. Although achieve superior performance to traditional tracking methods, however, a basic problem has not been answered yet --- that whether the sparsity constrain is really needed for visual tracking? To answer this question, in this paper, we first propose a robust non-sparse representation based tracker and then conduct extensive experiments to compare it against several state-of-the-art sparse representation based trackers. Our experiment results and analysis indicate that the proposed non-sparse tracker achieved competitive tracking accuracy with sparse trackers while having faster running speed, which support our non-sparse tracker to be used in practical applications.
- Jul 27 2016 cs.LG arXiv:1607.07804v1In this paper, we present the design of error-resilient machine learning architectures by employing a distributed machine learning framework referred to as classifier ensemble (CE). CE combines several simple classifiers to obtain a strong one. In contrast, centralized machine learning employs a single complex block. We compare the random forest (RF) and the support vector machine (SVM), which are representative techniques from the CE and centralized frameworks, respectively. Employing the dataset from UCI machine learning repository and architectural-level error models in a commercial 45 nm CMOS process, it is demonstrated that RF-based architectures are significantly more robust than SVM architectures in presence of timing errors due to process variations in near-threshold voltage (NTV) regions (0.3 V - 0.7 V). In particular, the RF architecture exhibits a detection accuracy (P_det) that varies by 3.2% while maintaining a median P_det > 0.9 at a gate level delay variation of 28.9% . In comparison, SVM exhibits a P_det that varies by 16.8%. Additionally, we propose an error weighted voting technique that incorporates the timing error statistics of the NTV circuit fabric to further enhance robustness. Simulation results confirm that the error weighted voting achieves a P_det that varies by only 1.4%, which is 12X lower compared to SVM.
- There is much interest in incorporating inference capabilities into sensor-rich embedded platforms such as autonomous vehicles, wearables, and others. A central problem in the design of such systems is the need to extract information locally from sensed data on a severely limited energy budget. This necessitates the design of energy-efficient sensory embedded system. A typical sensory embedded system enforces a physical separation between sensing and computational subsystems - a separation mandated by the differing requirements of the sensing and computational functions. As a consequence, the energy consumption in such systems tends to be dominated by the energy consumed in transferring data over the sensor-processor interface (communication energy) and the energy consumed in processing the data in digital processor (computational energy). In this article, we propose an in-sensor computing architecture which (mostly) eliminates the sensor-processor interface by embedding inference computations in the noisy sensor fabric in analog and retraining the hyperparameters in order to compensate for non-ideal computations. The resulting architecture referred to as the Compute Sensor - a sensor that computes in addition to sensing - represents a radical departure from the conventional. We show that a Compute Sensor for image data can be designed by embedding both feature extraction and classification functions in the analog domain in close proximity to the CMOS active pixel sensor (APS) array. Significant gains in energy-efficiency are demonstrated using behavioral and energy models in a commercial semiconductor process technology. In the process, the Compute Sensor creates a unique opportunity to develop machine learning algorithms for information extraction from data on a noisy underlying computational fabric.
- It is well-known that the precision of data, hyperparameters, and internal representations employed in learning systems directly impacts its energy, throughput, and latency. The precision requirements for the training algorithm are also important for systems that learn on-the-fly. Prior work has shown that the data and hyperparameters can be quantized heavily without incurring much penalty in classification accuracy when compared to floating point implementations. These works suffer from two key limitations. First, they assume uniform precision for the classifier and for the training algorithm and thus miss out on the opportunity to further reduce precision. Second, prior works are empirical studies. In this article, we overcome both these limitations by deriving analytical lower bounds on the precision requirements of the commonly employed stochastic gradient descent (SGD) on-line learning algorithm in the specific context of a support vector machine (SVM). Lower bounds on the data precision are derived in terms of the the desired classification accuracy and precision of the hyperparameters used in the classifier. Additionally, lower bounds on the hyperparameter precision in the SGD training algorithm are obtained. These bounds are validated using both synthetic and the UCI breast cancer dataset. Additionally, the impact of these precisions on the energy consumption of a fixed-point SVM with on-line training is studied.
- Jun 22 2016 cs.LG arXiv:1606.06630v2We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from difference sources flows and is integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.
- Speculative techniques in microarchitectures relax various dependencies in programs, which contributes to the complexity of (weak) memory models. We show using WMM, a new weak memory model, that the model becomes simpler if it includes load-value speculation and thus, does not enforce any dependency! However, in the absence of good value-prediction techniques, a programmer may end up paying a price for the extra fences. Thus, we also present WMM-D, which enforces the dependencies captured by the current microarchitectures. WMM-D is still much simpler than other existing models. We also show that non-atomic multi-copy stores arise as a result of sharing write-through caches. We think restricting microarchitectures to write-back caches (and thus simpler weak memory models) will not incur any performance penalty. Nevertheless, we present WMM-S, another extension to WMM, which could model the effects of non-atomic multi-copy stores. WMM, WMM-D, and WMM-S are all defined using Instantaneous Instruction Execution (I^2E), a new way of describing memory models without explicit reordering or speculative execution.
- What is the minimum amount of information and time needed to solve 2SAT? When the instance is known, it can be solved in polynomial time, but is this also possible without knowing the instance? Bei, Chen and Zhang (STOC '13) considered a model where the input is accessed by proposing possible assignments to a special oracle. This oracle, on encountering some constraint unsatisfied by the proposal, returns only the constraint index. It turns out that, in this model, even 1SAT cannot be solved in polynomial time unless P=NP. Hence, we consider a model in which the input is accessed by proposing probability distributions over assignments to the variables. The oracle then returns the index of the constraint that is most likely to be violated by this distribution. We show that the information obtained this way is sufficient to solve 1SAT in polynomial time, even when the clauses can be repeated. For 2SAT, as long as there are no repeated clauses, in polynomial time we can even learn an equivalent formula for the hidden instance and hence also solve it. Furthermore, we extend these results to the quantum regime. We show that in this setting 1QSAT can be solved in polynomial time up to constant precision, and 2QSAT can be learnt in polynomial time up to inverse polynomial precision.
- Jun 02 2016 cs.CV arXiv:1606.00103v2Unlike image blending algorithms, video blending algorithms have been little studied. In this paper, we investigate 6 popular blending algorithms---feather blending, multi-band blending, modified Poisson blending, mean value coordinate blending, multi-spline blending and convolution pyramid blending. We consider in particular realtime panoramic video blending, a key problem in various virtual reality tasks. To evaluate the performance of the 6 algorithms on this problem, we have created a video benchmark of several videos captured under various conditions. We analyze the time and memory needed by the above 6 algorithms, for both CPU and GPU implementations (where readily parallelizable). The visual quality provided by these algorithms is also evaluated both objectively and subjectively. The video benchmark and algorithm implementations are publicly available.
- May 25 2016 cs.CV arXiv:1605.07314v1In this paper, we develop a novel unified framework called DeepText for text region proposal generation and text detection in natural images via a fully convolutional neural network (CNN). First, we propose the inception region proposal network (Inception-RPN) and design a set of text characteristic prior bounding boxes to achieve high word recall with only hundred level candidate proposals. Next, we present a powerful textdetection network that embeds ambiguous text category (ATC) information and multilevel region-of-interest pooling (MLRP) for text and non-text classification and accurate localization. Finally, we apply an iterative bounding box voting scheme to pursue high recall in a complementary manner and introduce a filtering algorithm to retain the most suitable bounding box, while removing redundant inner and outer boxes for each text instance. Our approach achieves an F-measure of 0.83 and 0.85 on the ICDAR 2011 and 2013 robust text detection benchmarks, outperforming previous state-of-the-art results.
- May 24 2016 cs.CV arXiv:1605.06597v1Computer vision algorithms are known to be extremely sensitive to the environmental conditions in which the data is captured, e.g., lighting conditions and target density. Tuning of parameters or choosing a completely new algorithm is often needed to achieve a certain performance level, especially when there is a limitation of the computation source. In this paper, we focus on this problem and propose a framework to adaptively select the "best" algorithm-parameter combination and the computation platform under performance and cost constraints at design time, and adapt the algorithms at runtime based on real-time inputs. This necessitates developing a mechanism to switch between different algorithms as the nature of the input video changes. Our proposed algorithm calculates a similarity function between a test video scenario and each training scenario, where the similarity calculation is based on learning a manifold of image features that is shared by both the training and test datasets. Similarity between training and test dataset indicates the same algorithm can be applied to both of them and achieve similar performance. We design a cost function with this similarity measure to find the most similar training scenario to the test data. The "best" algorithm under a given platform is obtained by selecting the algorithm with a specific parameter combination that performs the best on the corresponding training data. The proposed framework can be used first offline to choose the platform based on performance and cost constraints, and then online whereby the "best" algorithm is selected for each new incoming video segment for a given platform. In the experiments, we apply our algorithm to the problems of pedestrian detection and tracking. We show how to adaptively select platforms and algorithm-parameter combinations. Our results provide optimal performance on 3 publicly available datasets.
- Constant envelope (CE) precoding is an appealing transmission technique, which enables highly efficient power amplification, and is realizable with a single radio frequency (RF) chain at the multi-antenna transmitter. In this paper, we study the transceiver design for a point-to-point multiple-input multiple-output (MIMO) system with CE precoding. Both single-stream transmission (i.e., beamforming) and multi-stream transmission (i.e., spatial multiplexing) are considered. For single-stream transmission, we optimize the receive beamforming vector to minimize the symbol error rate (SER) for any given channel realization and desired constellation at the combiner output. By reformulating the problem as an equivalent quadratically constrained quadratic program (QCQP), we propose an efficient semi-definite relaxation (SDR) based algorithm to find an approximate solution. Next, for multi-stream transmission, we propose a new scheme based on antenna grouping at the transmitter and minimum mean squared error (MMSE) or zero-forcing (ZF) based beamforming at the receiver. The transmit antenna grouping and receive beamforming vectors are then jointly designed to minimize the maximum SER over all data streams. Finally, the error-rate performance of single- versus multi-stream transmission is compared via simulations under different setups.
- May 12 2016 cs.CV arXiv:1605.03259v2The visual appearance of a person is easily affected by many factors like pose variations, viewpoint changes and camera parameter differences. This makes person Re-Identification (ReID) among multiple cameras a very challenging task. This work is motivated to learn mid-level human attributes which are robust to such visual appearance variations. And we propose a semi-supervised attribute learning framework which progressively boosts the accuracy of attributes only using a limited number of labeled data. Specifically, this framework involves a three-stage training. A deep Convolutional Neural Network (dCNN) is first trained on an independent dataset labeled with attributes. Then it is fine-tuned on another dataset only labeled with person IDs using our defined triplet loss. Finally, the updated dCNN predicts attribute labels for the target dataset, which is combined with the independent dataset for the final round of fine-tuning. The predicted attributes, namely \emphdeep attributes exhibit superior generalization ability across different datasets. By directly using the deep attributes with simple Cosine distance, we have obtained surprisingly good accuracy on four person ReID datasets. Experiments also show that a simple metric learning modular further boosts our method, making it significantly outperform many recent works.
- Nonconvex and nonsmooth optimization problems are frequently encountered in much of statistics, business, science and engineering, but they are not yet widely recognized as a technology in the sense of scalability. A reason for this relatively low degree of popularity is the lack of a well developed system of theory and algorithms to support the applications, as is the case for its convex counterpart. This paper aims to take one step in the direction of disciplined nonconvex and nonsmooth optimization. In particular, we consider in this paper some constrained nonconvex optimization models in block decision variables, with or without coupled affine constraints. In the case of without coupled constraints, we show a sublinear rate of convergence to an $\epsilon$-stationary solution in the form of variational inequality for a generalized conditional gradient method, where the convergence rate is shown to be dependent on the Hölderian continuity of the gradient of the smooth part of the objective. For the model with coupled affine constraints, we introduce corresponding $\epsilon$-stationarity conditions, and apply two proximal-type variants of the ADMM to solve such a model, assuming the proximal ADMM updates can be implemented for all the block variables except for the last block, for which either a gradient step or a majorization-minimization step is implemented. We show an iteration complexity bound of $O(1/\epsilon^2)$ to reach an $\epsilon$-stationary solution for both algorithms. Moreover, we show that the same iteration complexity of a proximal BCD method follows immediately. Numerical results are provided to illustrate the efficacy of the proposed algorithms for tensor robust PCA.
- Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
- May 10 2016 cs.LG arXiv:1605.02216v1We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze the convergence rate of the EASGD method in the synchronous scenario and compare its stability condition with the existing ADMM method in the round-robin scheme. An asynchronous and momentum variant of the EASGD method is applied to train deep convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Our approach accelerates the training and furthermore achieves better test accuracy. It also requires a much smaller amount of communication than other common baseline approaches such as the DOWNPOUR method. We then investigate the limit in speedup of the initial and the asymptotic phase of the mini-batch SGD, the momentum SGD, and the EASGD methods. We find that the spread of the input data distribution has a big impact on their initial convergence rate and stability region. We also find a surprising connection between the momentum SGD and the EASGD method with a negative moving average rate. A non-convex case is also studied to understand when EASGD can get trapped by a saddle point. Finally, we scale up the EASGD method by using a tree structured network topology. We show empirically its advantage and challenge. We also establish a connection between the EASGD and the DOWNPOUR method with the classical Jacobi and the Gauss-Seidel method, thus unifying a class of distributed stochastic optimization methods.
- With years of tremendous traffic and energy consumption growth, green radio has been valued not only for theoretical research interests but also for the operational expenditure reduction and the sustainable development of wireless communications. Fundamental green tradeoffs, served as an important framework for analysis, include four basic relationships: spectrum efficiency (SE) versus energy efficiency (EE), deployment efficiency (DE) versus energy efficiency (EE), delay (DL) versus power (PW), and bandwidth (BW) versus power (PW). In this paper, we first provide a comprehensive overview on the extensive on-going research efforts and categorize them based on the fundamental green tradeoffs. We will then focus on research progresses of 4G and 5G communications, such as orthogonal frequency division multiplexing (OFDM) and non-orthogonal aggregation (NOA), multiple input multiple output (MIMO), and heterogeneous networks (HetNets). We will also discuss potential challenges and impacts of fundamental green tradeoffs, to shed some light on the energy efficient research and design for future wireless networks.
- Apr 21 2016 cs.CV arXiv:1604.05817v1We consider the case of inpainting single depth images. Without corresponding color images, previous or next frames, depth image inpainting is quite challenging. One natural solution is to regard the image as a matrix and adopt the low rank regularization just as inpainting color images. However, the low rank assumption does not make full use of the properties of depth images. A shallow observation may inspire us to penalize the non-zero gradients by sparse gradient regularization. However, statistics show that though most pixels have zero gradients, there is still a non-ignorable part of pixels whose gradients are equal to 1. Based on this specific property of depth images , we propose a low gradient regularization method in which we reduce the penalty for gradient 1 while penalizing the non-zero gradients to allow for gradual depth changes. The proposed low gradient regularization is integrated with the low rank regularization into the low rank low gradient approach for depth image inpainting. We compare our proposed low gradient regularization with sparse gradient regularization. The experimental results show the effectiveness of our proposed approach.
- Apr 19 2016 cs.CV arXiv:1604.04953v1This paper proposes an end-to-end framework, namely fully convolutional recurrent network (FCRN) for handwritten Chinese text recognition (HCTR). Unlike traditional methods that rely heavily on segmentation, our FCRN is trained with online text data directly and learns to associate the pen-tip trajectory with a sequence of characters. FCRN consists of four parts: a path-signature layer to extract signature features from the input pen-tip trajectory, a fully convolutional network to learn informative representation, a sequence modeling layer to make per-frame predictions on the input sequence and a transcription layer to translate the predictions into a label sequence. The FCRN is end-to-end trainable in contrast to conventional methods whose components are separately trained and tuned. We also present a refined beam search method that efficiently integrates the language model to decode the FCRN and significantly improve the recognition results. We evaluate the performance of the proposed method on the test sets from the databases CASIA-OLHWDB and ICDAR 2013 Chinese handwriting recognition competition, and both achieve state-of-the-art performance with correct rates of 96.40% and 95.00%, respectively.
- Identifying topics of discussions in online health communities (OHC) is critical to various applications, but can be difficult because topics of OHC content are usually heterogeneous and domain-dependent. In this paper, we provide a multi-class schema, an annotated dataset, and supervised classifiers based on convolutional neural network (CNN) and other models for the task of classifying discussion topics. We apply the CNN classifier to the most popular breast cancer online community, and carry out a longitudinal analysis to show topic distributions and topic changes throughout members' participation. Our experimental results suggest that CNN outperforms other classifiers in the task of topic classification, and that certain trajectories can be detected with respect to topic changes.
- Learning the "blocking" structure is a central challenge for high dimensional data (e.g., gene expression data). Recently, a sparse singular value decomposition (SVD) has been used as a biclustering tool to achieve this goal. However, this model ignores the structural information between variables (e.g., gene interaction graph). Although typical graph-regularized norm can incorporate such prior graph information to get accurate discovery and better interpretability, it fails to consider the opposite effect of variables with different signs. Motivated by the development of sparse coding and graph-regularized norm, we propose a novel sparse graph-regularized SVD as a powerful biclustering tool for analyzing high-dimensional data. The key of this method is to impose two penalties including a novel graph-regularized norm ($|\pmb{u}|\pmb{L}|\pmb{u}|$) and $L_0$-norm ($\|\pmb{u}\|_0$) on singular vectors to induce structural sparsity and enhance interpretability. We design an efficient Alternating Iterative Sparse Projection (AISP) algorithm to solve it. Finally, we apply our method and related ones to simulated and real data to show its efficiency in capturing natural blocking structures.
- In this paper, we systematically analyze the connecting architectures of recurrent neural networks (RNNs). Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs: (a) the recurrent depth, which captures the RNN's over-time nonlinear complexity, (b) the feedforward depth, which captures the local input-output nonlinearity (similar to the "depth" in feedforward neural networks (FNNs)), and (c) the recurrent skip coefficient which captures how rapidly the information propagates over time. We rigorously prove each measure's existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing recurrent skip coefficient offers performance boosts on long term dependency problems.
- Feb 23 2016 cs.CC arXiv:1602.06627v2The Sensitivity Conjecture and the Log-rank Conjecture are among the most important and challenging problems in concrete complexity. Incidentally, the Sensitivity Conjecture is known to hold for monotone functions, and so is the Log-rank Conjecture for $f(x \wedge y)$ and $f(x\oplus y)$ with monotone functions $f$, where $\wedge$ and $\oplus$ are bit-wise AND and XOR, respectively. In this paper, we extend these results to functions $f$ which alternate values for a relatively small number of times on any monotone path from $0^n$ to $1^n$. These deepen our understandings of the two conjectures, and contribute to the recent line of research on functions with small alternating numbers.
- Feb 16 2016 cs.CV arXiv:1602.04348v1Maximally stable extremal regions (MSER), which is a popular method to generate character proposals/candidates, has shown superior performance in scene text detection. However, the pixel-level operation limits its capability for handling some challenging cases (e.g., multiple connected characters, separated parts of one character and non-uniform illumination). To better tackle these cases, we design a character proposal network (CPN) by taking advantage of the high capacity and fast computing of fully convolutional network (FCN). Specifically, the network simultaneously predicts characterness scores and refines the corresponding locations. The characterness scores can be used for proposal ranking to reject non-character proposals and the refining process aims to obtain the more accurate locations. Furthermore, considering the situation that different characters have different aspect ratios, we propose a multi-template strategy, designing a refiner for each aspect ratio. The extensive experiments indicate our method achieves recall rates of 93.88%, 93.60% and 96.46% on ICDAR 2013, SVT and Chinese2k datasets respectively using less than 1000 proposals, demonstrating promising performance of our character proposal network.
- Feb 08 2016 cs.CV arXiv:1602.01887v2In this paper, we propose a novel visual tracking framework that intelligently discovers reliable patterns from a wide range of video to resist drift error for long-term tracking tasks. First, we design a Discrete Fourier Transform (DFT) based tracker which is able to exploit a large number of tracked samples while still ensures real-time performance. Second, we propose a clustering method with temporal constraints to explore and memorize consistent patterns from previous frames, named as reliable memories. By virtue of this method, our tracker can utilize uncontaminated information to alleviate drifting issues. Experimental results show that our tracker performs favorably against other state of-the-art methods on benchmark datasets. Furthermore, it is significantly competent in handling drifts and able to robustly track challenging long videos over 4000 frames, while most of others lose track at early frames.
- Feb 04 2016 cs.SY arXiv:1602.01128v1A distributed consensus algorithm for estimating the maximum value of the initial measurements in a sensor network with communication noise is proposed. In the absence of communication noise, max estimation can be done by updating the state value with the largest received measurements in every iteration at each sensor. In the presence of communication noise, however, the maximum estimate will incorrectly drift and the estimate at each sensor will diverge. As a result, a soft-max approximation together with a non-linear consensus algorithm is introduced herein. A design parameter controls the trade-off between the soft-max error and convergence speed. An analysis of this trade-off gives a guideline towards how to choose the design parameter for the max estimate. We also show that if some prior knowledge of the initial measurements is available, the consensus process can converge faster by using an optimal step size in the iterative algorithm. A shifted non-linear bounded transmit function is also introduced for faster convergence when sensor nodes have some prior knowledge of the initial measurements. Simulation results corroborating the theory are also provided.
- Feb 04 2016 cs.CV arXiv:1602.01237v2Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the "perfect single frame detector". We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech dataset), and by manually clustering the recurrent errors of a top detector. Our results characterize both localization and background-versus-foreground errors. To address localization errors we study the impact of training annotation noise on the detector performance, and show that we can improve even with a small portion of sanitized training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Other than our in-depth analysis, we report top performance on the Caltech dataset, and provide a new sanitized set of training and test annotations.
- Jan 15 2016 cs.NI arXiv:1601.03505v1With small cell base stations (SBSs) densely deployed in addition to conventional macro base stations (MBSs), the heterogeneous cellular network (HCN) architecture can effectively boost network capacity. To support the huge power demand of HCNs, renewable energy harvesting technologies can be leveraged. In this paper, we aim to make efficient use of the harvested energy for on-grid power saving while satisfying the quality of service (QoS) requirement. To this end, energy-aware traffic offloading schemes are proposed, whereby user associations, ON-OFF states of SBSs, and power control are jointly optimized according to the statistical information of energy arrival and traffic load. Specifically, for the single SBS case, the power saving gain achieved by activating the SBS is derived in closed form, based on which the SBS activation condition and optimal traffic offloading amount are obtained. Furthermore, a two-stage energy-aware traffic offloading (TEATO) scheme is proposed for the multiple-SBS case, considering various operating characteristics of SBSs with different power sources. Simulation results demonstrate that the proposed scheme can achieve more than 50% power saving gain for typical daily traffic and solar energy profiles, compared with the conventional traffic offloading schemes.
- In [13], Hillar and Lim famously demonstrated that "multilinear (tensor) analogues of many efficiently computable problems in numerical linear algebra are NP-hard". Despite many recent advancements, the state-of-the-art methods for computing such `tensor analogues' still suffer severely from the curse of dimensionality. In this paper we show that the Tucker core of a tensor however, retains many properties of the original tensor, including the CP rank, the border rank, the tensor Schatten quasi norms, and the Z-eigenvalues. When the core tensor is smaller than the original tensor, this property leads to considerable computational advantages as confirmed by our numerical experiments. In our analysis, we in fact work with a generalized Tucker-like decomposition that can accommodate any full column-rank factor matrices.
- Dec 29 2015 cs.NE arXiv:1512.08301v2In this paper, we propose a novel neural network structure, namely \emphfeedforward sequential memory networks (FSMN), to model long-term dependency in time series without using recurrent feedback. The proposed FSMN is a standard fully-connected feedforward neural network equipped with some learnable memory blocks in its hidden layers. The memory blocks use a tapped-delay line structure to encode the long context information into a fixed-size representation as short-term memory mechanism. We have evaluated the proposed FSMNs in several standard benchmark tasks, including speech recognition and language modelling. Experimental results have shown FSMNs significantly outperform the conventional recurrent neural networks (RNN), including LSTMs, in modeling sequential signals like speech or language. Moreover, FSMNs can be learned much more reliably and faster than RNNs or LSTMs due to the inherent non-recurrent model structure.
- Dec 10 2015 cs.CV arXiv:1512.02895v2Recent algorithms in convolutional neural networks (CNN) considerably advance the fine-grained image classification, which aims to differentiate subtle differences among subordinate classes. However, previous studies have rarely focused on learning a fined-grained and structured feature representation that is able to locate similar images at different levels of relevance, e.g., discovering cars from the same make or the same model, both of which require high precision. In this paper, we propose two main contributions to tackle this problem. 1) A multi-task learning framework is designed to effectively learn fine-grained feature representations by jointly optimizing both classification and similarity constraints. 2) To model the multi-level relevance, label structures such as hierarchy or shared attributes are seamlessly embedded into the framework by generalizing the triplet loss. Extensive and thorough experiments have been conducted on three fine-grained datasets, i.e., the Stanford car, the car-333, and the food datasets, which contain either hierarchical labels or shared attributes. Our proposed method has achieved very competitive performance, i.e., among state-of-the-art classification accuracy. More importantly, it significantly outperforms previous fine-grained feature representations for image retrieval at different levels of relevance.
- Dec 01 2015 cs.SI physics.soc-ph arXiv:1511.09368v1To understand the structure and organization of a large-scale social, biological or technological network, it can be helpful to describe and extract local communities or modules of the network. In this article, we develop a neurodynamic framework to describe the local communities which correspond to the stable states of a neuro-system built based on the network. The quantitative criteria to describe the neurodynamic system can cover a large range of objective functions. The resolution limit of these functions enable us to propose a generic criterion to explore multi-resolution local communities. We explain the advantages of this framework and illustrate them by testing on a number of model and real-world networks.
- This paper proposes a new transmission strategy for the multiuser massive multiple-input multiple-output (MIMO) systems, including uplink/downlink channel estimation and user scheduling for data transmission. A discrete Fourier transform (DFT) aided spatial basis expansion model (SBEM) is first introduced to represent the uplink/downlink channels with much few parameter dimensions by exploiting angle reciprocity and the physical characteristics of the uniform linear array (ULA). With SBEM, both uplink and downlink channel estimation of multiuser can be carried out with very few amount of training resources, which significantly reduces the training overhead and feedback cost. Meanwhile, the pilot contamination problem in the uplink raining is immediately relieved by exploiting the spatial information of users. To enhance the spectral efficiency and to fully utilize the spatial resources, we also design a greedy user scheduling scheme during the data transmission period. Compared to existing low-rank models, the newly proposed SBEM offers an alternative for channel acquisition without need of channel statistics for both TDD and FDD systems based on the angle reciprocity. Moreover, the proposed method can be efficiently deployed by the fast Fourier transform (FFT). Various numerical results are provided to corroborate the proposed studies.
- This paper considers the evolution of a network in a discrete time, stochastic setting in which agents learn about each other through repeated interactions and maintain/break links on the basis of what they learn from these interactions. Agents have homophilous preferences and limited capacity, so they maintain links with others who are learned to be similar to themselves and cut links to others who are learned to be dissimilar to themselves. Thus learning influences the evolution of the network, but learning is imperfect so the evolution is stochastic. Homophily matters. Higher levels of homophily decrease the (average) number of links that agents form. However, the effect of homophily is anomalous: mutually beneficial links may be dropped before learning is completed, thereby resulting in sparser networks and less clustering than under complete information. There may be big differences between the networks that emerge under complete and incomplete information. Homophily matters here as well: initially, greater levels of homophily increase the difference between the complete and incomplete information networks, but sufficiently high levels of homophily eventually decrease the difference. Complete and incomplete information networks differ the most when the degree of homophily is intermediate. With multiple stages of life, the effects of incomplete information are large initially but fade somewhat over time.