results for au:Wu_Y in:cs

- Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform or compete with its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
- Holography encodes the three dimensional (3D) information of a sample in the form of an intensity-only recording. However, to decode the original sample image from its hologram(s), auto-focusing and phase-recovery are needed, which are in general cumbersome and time-consuming to digitally perform. Here we demonstrate a convolutional neural network (CNN) based approach that simultaneously performs auto-focusing and phase-recovery to significantly extend the depth-of-field (DOF) in holographic image reconstruction. For this, a CNN is trained by using pairs of randomly de-focused back-propagated holograms and their corresponding in-focus phase-recovered images. After this training phase, the CNN takes a single back-propagated hologram of a 3D sample as input to rapidly achieve phase-recovery and reconstruct an in focus image of the sample over a significantly extended DOF. This deep learning based DOF extension method is non-iterative, and significantly improves the algorithm time-complexity of holographic image reconstruction from O(nm) to O(1), where n refers to the number of individual object points or particles within the sample volume, and m represents the focusing search space within which each object point or particle needs to be individually focused. These results highlight some of the unique opportunities created by data-enabled statistical image reconstruction methods powered by machine learning, and we believe that the presented approach can be broadly applicable to computationally extend the DOF of other imaging modalities.
- We present a robust method for estimating the facial pose and shape information from a densely annotated facial image. The method relies on Convolutional Point-set Representation (CPR), a carefully designed matrix representation to summarize different layers of information encoded in the set of detected points in the annotated image. The CPR disentangles the dependencies of shape and different pose parameters and enables updating different parameters in a sequential manner via convolutional neural networks and recurrent layers. When updating the pose parameters, we sample reprojection errors along with a predicted direction and update the parameters based on the pattern of reprojection errors. This technique boosts the model's capability in searching a local minimum under challenging scenarios. We also demonstrate that annotation from different sources can be merged under the framework of CPR and contributes to outperforming the current state-of-the-art solutions for 3D face alignment. Experiments indicate the proposed CPRFA (CPR-based Face Alignment) significantly improves 3D alignment accuracy when the densely annotated image contains noise and missing values, which is common under "in-the-wild" acquisition scenarios.
- Secure wireless information and power transfer based on directional modulation is conceived for amplify-and-forward (AF) relaying networks. Explicitly, we first formulate a secrecy rate maximization (SRM) problem, which can be decomposed into a twin-level optimization problem and solved by a one-dimensional (1D) search and semidefinite relaxation (SDR) technique. Then in order to reduce the search complexity, we formulate an optimization problem based on maximizing the signal-to-leakage-AN-noise-ratio (Max-SLANR) criterion, and transform it into a SDR problem. Additionally, the relaxation is proved to be tight according to the classic Karush-Kuhn-Tucker (KKT) conditions. Finally, to reduce the computational complexity, a successive convex approximation (SCA) scheme is proposed to find a near-optimal solution. The complexity of the SCA scheme is much lower than that of the SRM and the Max-SLANR schemes. Simulation results demonstrate that the performance of the SCA scheme is very close to that of the SRM scheme in terms of its secrecy rate and bit error rate (BER), but much better than that of the zero forcing (ZF) scheme.
- Let $G$ be a graph and $T$ be a vertex subset of $G$ with even cardinality. A $T$-join of $G$ is a subset $J$ of edges such that a vertex of $G$ is incident with an odd number of edges in $J$ if and only if the vertex belongs to $T$. Minimum $T$-joins have many applications in combinatorial optimizations. In this paper, we show that a minimum $T$-join of a connected graph $G$ has at most $|E(G)|-\frac 1 2 |E(\widehat{\, G\,})|$ edges where $\widehat{\,G\,}$ is the maximum bidegeless subgraph of $G$. Further, we are able to use this result to show that every flow-admissible signed graph $(G,\sigma)$ has a signed-circuit cover with length at most $\frac{19} 6 |E(G)|$. Particularly, a 2-edge-connected signed graph $(G,\sigma)$ with even negativeness has a signed-circuit cover with length at most $\frac 8 3 |E(G)|$.
- Link prediction in networks is typically accomplished by estimating or ranking the probabilities of edges for all pairs of nodes. In practice, especially for social networks, the data are often collected by egocentric sampling, which means selecting a subset of nodes and recording all of their edges. This sampling mechanism requires different prediction tools than the typical assumption of links missing at random. We propose a new computationally efficient link prediction algorithm for egocentrically sampled networks, which estimates the underlying probability matrix by estimating its row space. For networks created by sampling rows, our method outperforms many popular link prediction and graphon estimation techniques.
- Mar 12 2018 cs.CL arXiv:1803.03476v1Question retrieval is a crucial subtask for community question answering. Previous research focus on supervised models which depend heavily on training data and manual feature engineering. In this paper, we propose a novel unsupervised framework, namely reduced attentive matching network (RAMN), to compute semantic matching between two questions. Our RAMN integrates together the deep semantic representations, the shallow lexical mismatching information and the initial rank produced by an external search engine. For the first time, we propose attention autoencoders to generate semantic representations of questions. In addition, we employ lexical mismatching to capture surface matching between two questions, which is derived from the importance of each word in a question. We conduct experiments on the open CQA datasets of SemEval-2016 and SemEval-2017. The experimental results show that our unsupervised model obtains comparable performance with the state-of-the-art supervised methods in SemEval-2016 Task 3, and outperforms the best system in SemEval-2017 Task 3 by a wide margin.
- Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.
- Predictive models learned from historical data are widely used to help companies and organizations make decisions. However, they may digitally unfairly treat unwanted groups, raising concerns about fairness and discrimination. In this paper, we study the fairness-aware ranking problem which aims to discover discrimination in ranked datasets and reconstruct the fair ranking. Existing methods in fairness-aware ranking are mainly based on statistical parity that cannot measure the true discriminatory effect since discrimination is causal. On the other hand, existing methods in causal-based anti-discrimination learning focus on classification problems and cannot be directly applied to handle the ranked data. To address these limitations, we propose to map the rank position to a continuous score variable that represents the qualification of the candidates. Then, we build a causal graph that consists of both the discrete profile attributes and the continuous score. The path-specific effect technique is extended to the mixed-variable causal graph to identify both direct and indirect discrimination. The relationship between the path-specific effects for the ranked data and those for the binary decision is theoretically analyzed. Finally, algorithms for discovering and removing discrimination from a ranked dataset are developed. Experiments using the real dataset show the effectiveness of our approaches.
- Mar 06 2018 cs.AI arXiv:1803.01118v1We consider the problem of exploration in meta reinforcement learning. Two new meta reinforcement learning algorithms are suggested: E-MAML and E-$\text{RL}^2$. Results are presented on a novel environment we call `Krazy World' and a set of maze environments. We show E-MAML and E-$\text{RL}^2$ deliver better performance on tasks where exploration is important.
- Mar 06 2018 cs.DB arXiv:1803.01390v1Motivated by the continuing interest in the tree data model, we study the expressive power of downward navigational query languages on trees and chains. Basic navigational queries are built from the identity relation and edge relations using composition and union. We study the effects on relative expressiveness when we add transitive closure, projections, coprojections, intersection, and difference; this for boolean queries and path queries on labeled and unlabeled structures. In all cases, we present the complete Hasse diagram. In particular, we establish, for each query language fragment that we study on trees, whether it is closed under difference and intersection.
- Local polynomial regression (Fan & Gijbels, 1996) is an important class of methods for nonparametric density estimation and regression problems. However, straightforward implementation of local polynomial regression has quadratic time complexity which hinders its applicability in large-scale data analysis. In this paper, we significantly accelerate the computation of local polynomial estimates by novel applications of multi-dimensional binary indexed trees (Fenwick, 1994). Both time and space complexities of our proposed algorithm are nearly linear in the number of inputs. Simulation results confirm the efficiency and effectiveness of our approach.
- Despite significant advances in artificial intelligence (AI) for computer vision, its application in medical imaging has been limited by the burden and limits of expert-generated labels. We used images from optical coherence tomography angiography (OCTA), a relatively new imaging modality that measures perfusion of the retinal vasculature, to train an AI algorithm to generate vasculature maps from standard structural optical coherence tomography (OCT) images of the same retinae, both exceeding the ability and bypassing the need for expert labeling. Deep learning was able to infer perfusion of microvasculature from structural OCT images with similar fidelity to OCTA and significantly better than expert clinicians (P < 0.00001). OCTA suffers from need of specialized hardware, laborious acquisition protocols, and motion artifacts; whereas our model works directly from standard OCT which are ubiquitous and quick to obtain, and allows unlocking of large volumes of previously collected standard OCT data both in existing clinical trials and clinical practice. This finding demonstrates a novel application of AI to medical imaging, whereby subtle regularities between different modalities are used to image the same body part and AI is used to generate detailed and accurate inferences of tissue function from structure imaging.
- Applied researchers often construct a network from a random sample of nodes in order to infer properties of the parent network. Two of the most widely used sampling schemes are subgraph sampling, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and neighborhood sampling, where we additionally observe the edges between the sampled vertices and their neighbors. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for any connected $h$ on $k$ vertices, to estimate $s=\mathsf{s}(h,G)$, the number of copies of $h$ in the parent graph $G$ of maximum degree $d$, with a multiplicative error of $\epsilon$, (a) For subgraph sampling, the optimal sampling ratio $p$ is $\Theta_{k}(\max\{ (s\epsilon^2)^{-\frac{1}{k}}, \; \frac{d^{k-1}}{s\epsilon^{2}} \})$, achieved by Horvitz-Thompson type of estimators. (b) For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieving the sampling ratio $O_{k}(\min\{ (\frac{d}{s\epsilon^2})^{\frac{1}{k-1}}, \; \sqrt{\frac{d^{k-2}}{s\epsilon^2}}\})$. This is shown to be optimal for all motifs with at most $4$ vertices and cliques of all sizes. The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These results quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on both synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.
- Estimating the entropy based on data is one of the prototypical problems in distribution property testing and estimation. For estimating the Shannon entropy of a distribution on $S$ elements with independent samples, [Paninski2004] showed that the sample complexity is sublinear in $S$, and [Valiant--Valiant2011] showed that consistent estimation of Shannon entropy is possible if and only if the sample size $n$ far exceeds $\frac{S}{\log S}$. In this paper we consider the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. We show that: (1) As long as the Markov chain mixes not too slowly, i.e., the relaxation time is at most $O(\frac{S}{\ln^3 S})$, consistent estimation is achievable when $n \gg \frac{S^2}{\log S}$. (2) As long as the Markov chain has some slight dependency, i.e., the relaxation time is at least $1+\Omega(\frac{\ln^2 S}{\sqrt{S}})$, consistent estimation is impossible when $n \lesssim \frac{S^2}{\log S}$. Under both assumptions, the optimal estimation accuracy is shown to be $\Theta(\frac{S^2}{n \log S})$. In comparison, the empirical entropy rate requires at least $\Omega(S^2)$ samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.
- We present Generative Adversarial Capsule Network (CapsuleGAN), a framework that uses capsule networks (CapsNets) instead of the standard convolutional neural networks (CNNs) as discriminators within the generative adversarial network (GAN) setting, while modeling image data. We provide guidelines for designing CapsNet discriminators and the updated GAN objective function, which incorporates the CapsNet margin loss, for training CapsuleGAN models. We show that CapsuleGAN outperforms convolutional-GAN at modeling image data distribution on MNIST and CIFAR-10 datasets, evaluated on the generative adversarial metric and at semi-supervised image classification.
- Feb 15 2018 cs.CV arXiv:1802.04914v2In this paper, we introduce a web-scale general visual search system deployed in Microsoft Bing. The system accommodates tens of billions of images in the index, with thousands of features for each image, and can respond in less than 200 ms. In order to overcome the challenges in relevance, latency, and scalability in such large scale of data, we employ a cascaded learning-to-rank framework based on various latest deep learning visual features, and deploy in a distributed heterogeneous computing platform. Quantitative and qualitative experiments show that our system is able to support various applications on Bing website and apps.
- A major challenge in understanding the generalization of deep learning is to explain why (stochastic) gradient descent can exploit the network architecture to find solutions that have good generalization performance when using high capacity models. We find simple but realistic examples showing that this phenomenon exists even when learning linear classifiers --- between two linear networks with the same capacity, the one with a convolutional layer can generalize better than the other when the data distribution has some underlying spatial structure. We argue that this difference results from a combination of the convolution architecture, data distribution and gradient descent, all of which are necessary to be included in a meaningful analysis. We provide a general analysis of the generalization performance as a function of data distribution and convolutional filter size, given gradient descent as the optimization algorithm, then interpret the results using concrete examples. Experimental results show that our analysis is able to explain what happens in our introduced examples.
- Non-orthogonal multiple access (NoMA) as an efficient way of radio resource sharing can root back to the network information theory. For generations of wireless communication systems design, orthogonal multiple access (OMA) schemes in time, frequency, or code domain have been the main choices due to the limited processing capability in the transceiver hardware, as well as the modest traffic demands in both latency and connectivity. However, for the next generation radio systems, given its vision to connect everything and the much evolved hardware capability, NoMA has been identified as a promising technology to help achieve all the targets in system capacity, user connectivity, and service latency. This article will provide a systematic overview of the state-of-the-art design of the NoMA transmission based on a unified transceiver design framework, the related standardization progress, and some promising use cases in future cellular networks, based on which the interested researchers can get a quick start in this area.
- With the emergence of diverse mobile applications (such as augmented reality), the quality of experience of mobile users is greatly limited by their computation capacity and finite battery lifetime. Mobile edge computing (MEC) and wireless power transfer are promising to address this issue. However, these two techniques are susceptible to propagation delay and loss. Motivated by the chance of short-distance line-of-sight achieved by leveraging unmanned aerial vehicle (UAV) communications, an UAV-enabled wireless powered MEC system is studied. A power minimization problem is formulated subject to the constraints on the number of the computation bits and energy harvesting causality. The problem is non-convex and challenging to tackle. An alternative optimization algorithm is proposed based on sequential convex optimization. Simulation results show that our proposed design is superior to other benchmark schemes and the proposed algorithm is efficient in terms of the convergence.
- Feb 13 2018 cs.LG arXiv:1802.03464v1In this paper, we propose the Lipschitz margin ratio and a new metric learning framework for classification through maximizing the ratio. This framework enables the integration of both the inter-class margin and the intra-class dispersion, as well as the enhancement of the generalization ability of a classifier. To introduce the Lipschitz margin ratio and its associated learning bound, we elaborate the relationship between metric learning and Lipschitz functions, as well as the representability and learnability of the Lipschitz functions. After proposing the new metric learning framework based on the introduced Lipschitz margin ratio, we also prove that some well known metric learning algorithms can be shown as special cases of the proposed framework. In addition, we illustrate the framework by implementing it for learning the squared Mahalanobis metric, and by demonstrating its encouraging results on eight popular datasets of machine learning.
- Glioma is one of the most common and aggressive types of primary brain tumors. The accurate segmentation of subcortical brain structures is crucial to the study of gliomas in that it helps the monitoring of the progression of gliomas and aids the evaluation of treatment outcomes. However, the large amount of required human labor makes it difficult to obtain the manually segmented Magnetic Resonance Imaging (MRI) data, limiting the use of precise quantitative measurements in the clinical practice. In this work, we try to address this problem by developing a 3D Convolutional Neural Network~(3D CNN) based model to automatically segment gliomas. The major difficulty of our segmentation model comes with the fact that the location, structure, and shape of gliomas vary significantly among different patients. In order to accurately classify each voxel, our model captures multi-scale contextual information by extracting features from two scales of receptive fields. To fully exploit the tumor structure, we propose a novel architecture that hierarchically segments different lesion regions of the necrotic and non-enhancing tumor~(NCR/NET), peritumoral edema~(ED) and GD-enhancing tumor~(ET). Additionally, we utilize densely connected convolutional blocks to further boost the performance. We train our model with a patch-wise training schema to mitigate the class imbalance problem. The proposed method is validated on the BraTS 2017 dataset and it achieves Dice scores of 0.72, 0.83 and 0.81 for the complete tumor, tumor core and enhancing tumor, respectively. These results are comparable to the reported state-of-the-art results, and our method is better than existing 3D-based methods in terms of compactness, time and space efficiency.
- One of the risks of large-scale geologic carbon sequestration is the potential migration of fluids out of the storage formations. Accurate and fast detection of this fluids migration is not only important but also challenging, due to the large subsurface uncertainty and complex governing physics. Traditional leakage detection and monitoring techniques rely on geophysical observations including seismic. However, the resulting accuracy of these methods is limited because of indirect information they provide requiring expert interpretation, therefore yielding in-accurate estimates of leakage rates and locations. In this work, we develop a novel machine-learning detection package, named "Seismic-Net", which is based on the deep densely connected neural network. To validate the performance of our proposed leakage detection method, we employ our method to a natural analog site at ChimayÃ³, New Mexico. The seismic events in the data sets are generated because of the eruptions of geysers, which is due to the leakage of $\mathrm{CO}_\mathrm{2}$. In particular, we demonstrate the efficacy of our Seismic-Net by formulating our detection problem as an event detection problem with time series data. A fixed-length window is slid throughout the time series data and we build a deep densely connected network to classify each window to determine if a geyser event is included. Through our numerical tests, we show that our model achieves precision/recall as high as 0.889/0.923. Therefore, our Seismic-Net has a great potential for detection of $\mathrm{CO}_\mathrm{2}$ leakage.
- Feb 08 2018 cs.CV arXiv:1802.02208v1Automated pavement crack detection is a challenging task that has been researched for decades due to the complicated pavement conditions in real world. In this paper, a supervised method based on deep learning is proposed, which has the capability of dealing with different pavement conditions. Specifically, a convolutional neural network (CNN) is used to learn the structure of the cracks from raw images, without any preprocessing. Small patches are extracted from crack images as inputs to generate a large training database, a CNN is trained and crack detection is modeled as a multi-label classification problem. Typically, crack pixels are much fewer than non-crack pixels. To deal with the problem with severely imbalanced data, a strategy with modifying the ratio of positive to negative samples is proposed. The method is tested on two public databases and compared with five existing methods. Experimental results show that it outperforms the other methods.
- Feb 06 2018 cs.CV arXiv:1802.00853v1In this paper, we address the incremental classifier learning problem, which suffers from catastrophic forgetting. The main reason for catastrophic forgetting is that the past data are not available during learning. Typical approaches keep some exemplars for the past classes and use distillation regularization to retain the classification capability on the past classes and balance the past and new classes. However, there are four main problems with these approaches. First, the loss function is not efficient for classification. Second, there is unbalance problem between the past and new classes. Third, the size of pre-decided exemplars is usually limited and they might not be distinguishable from unseen new classes. Forth, the exemplars may not be allowed to be kept for a long time due to privacy regulations. To address these problems, we propose (a) a new loss function to combine the cross-entropy loss and distillation loss, (b) a simple way to estimate and remove the unbalance between the old and new classes , and (c) using Generative Adversarial Networks (GANs) to generate historical data and select representative exemplars during generation. We believe that the data generated by GANs have much less privacy issues than real images because GANs do not directly copy any real image patches. We evaluate the proposed method on CIFAR-100, Flower-102, and MS-Celeb-1M-Base datasets and extensive experiments demonstrate the effectiveness of our method.
- Feb 02 2018 cs.CV arXiv:1802.00121v1This paper presents a method to learn a decision tree to quantitatively explain the logic of each prediction of a pre-trained convolutional neural networks (CNNs). Our method boosts the following two aspects of network interpretability. 1) In the CNN, each filter in a high conv-layer must represent a specific object part, instead of describing mixed patterns without clear meanings. 2) People can explain each specific prediction made by the CNN at the semantic level using a decision tree, i.e., which filters (or object parts) are used for prediction and how much they contribute in the prediction. To conduct such a quantitative explanation of a CNN, our method learns explicit representations of object parts in high conv-layers of the CNN and mines potential decision modes memorized in fully-connected layers. The decision tree organizes these potential decision modes in a coarse-to-fine manner. Experiments have demonstrated the effectiveness of the proposed method.
- Jan 30 2018 cs.CL arXiv:1801.09031v1Using low dimensional vector space to represent words has been very effective in many NLP tasks. However, it doesn't work well when faced with the problem of rare and unseen words. In this paper, we propose to leverage the knowledge in semantic dictionary in combination with some morphological information to build an enhanced vector space. We get an improvement of 2.3% over the state-of-the-art Heidel Time system in temporal expression recognition, and obtain a large gain in other name entity recognition (NER) tasks. The semantic dictionary Hownet alone also shows promising results in computing lexical similarity.
- Querying graph structured data is a fundamental operation that enables important applications including knowledge graph search, social network analysis, and cyber-network security. However, the growing size of real-world data graphs poses severe challenges for graph databases to meet the response-time requirements of the applications. Planning the computational steps of query processing - Query Planning - is central to address these challenges. In this paper, we study the problem of learning to speedup query planning in graph databases towards the goal of improving the computational-efficiency of query processing via training queries.We present a Learning to Plan (L2P) framework that is applicable to a large class of query reasoners that follow the Threshold Algorithm (TA) approach. First, we define a generic search space over candidate query plans, and identify target search trajectories (query plans) corresponding to the training queries by performing an expensive search. Subsequently, we learn greedy search control knowledge to imitate the search behavior of the target query plans. We provide a concrete instantiation of our L2P framework for STAR, a state-of-the-art graph query reasoner. Our experiments on benchmark knowledge graphs including DBpedia, YAGO, and Freebase show that using the query plans generated by the learned search control knowledge, we can significantly improve the speed of STAR with negligible loss in accuracy.
- In this paper, we study the design of secure communication for time division duplexing multi-cell multi-user massive multiple-input multiple-output (MIMO) systems with active eavesdropping. We assume that the eavesdropper actively attacks the uplink pilot transmission and the uplink data transmission before eavesdropping the downlink data transmission phase of the desired users. We exploit both the received pilots and data signals for uplink channel estimation. We show analytically that when the number of transmit antennas and the length of the data vector both tend to infinity, the signals of the desired user and the eavesdropper lie in different eigenspaces of the received signal matrix at the base station if their signal powers are different. This finding reveals that decreasing (instead of increasing) the desire user's signal power might be an effective approach to combat a strong active attack from an eavesdropper. Inspired by this result, we propose a data-aided secure downlink transmission scheme and derive an asymptotic achievable secrecy sum-rate expression for the proposed design. Numerical results indicate that under strong active attacks, the proposed design achieves significant secrecy rate gains compared to the conventional design employing matched filter precoding and artificial noise generation.
- Gaussian belief propagation (GBP) is a recursive computation method that is widely used in inference for computing marginal distributions efficiently. Depending on how the factorization of the underlying joint Gaussian distribution is performed, GBP may exhibit different convergence properties as different factorizations may lead to fundamentally different recursive update structures. In this paper, we study the convergence of GBP derived from the factorization based on the distributed linear Gaussian model. The motivation is twofold. From the factorization viewpoint, i.e., by specifically employing a factorization based on the linear Gaussian model, in some cases, we are able to bypass difficulties that exist in other convergence analysis methods that use a different (Gaussian Markov random field) factorization. From the distributed inference viewpoint, the linear Gaussian model readily conforms to the physical network topology arising in large-scale networks, and, is practically useful. For the distributed linear Gaussian model, under mild assumptions, we show analytically three main results: the GBP message inverse variance converges exponentially fast to a unique positive limit for arbitrary nonnegative initialization; we provide a necessary and sufficient convergence condition for the belief mean to converge to the optimal value; and, when the underlying factor graph is given by the union of a forest and a single loop, we show that GBP always converges.
- In this technical report, we consider an approach that combines the PPO objective and K-FAC natural gradient optimization, for which we call PPOKFAC. We perform a range of empirical analysis on various aspects of the algorithm, such as sample complexity, training speed, and sensitivity to batch size and training epochs. We observe that PPOKFAC is able to outperform PPO in terms of sample complexity and speed in a range of MuJoCo environments, while being scalable in terms of batch size. In spite of this, it seems that adding more epochs is not necessarily helpful for sample efficiency, and PPOKFAC seems to be worse than its A2C counterpart, ACKTR.
- Physical layer security which safeguards data confidentiality based on the information-theoretic approaches has received significant research interest recently. The key idea behind physical layer security is to utilize the intrinsic randomness of the transmission channel to guarantee the security in physical layer. The evolution towards 5G wireless communications poses new challenges for physical layer security research. This paper provides a latest survey of the physical layer security research on various promising 5G technologies, including physical layer security coding, massive multiple-input multiple-output, millimeter wave communications, heterogeneous networks, non-orthogonal multiple access, full duplex technology, etc. Technical challenges which remain unresolved at the time of writing are summarized and the future trends of physical layer security in 5G and beyond are discussed.
- Learning properties of large graphs from samples has been an important problem in statistical network analysis since the early work of Goodman \citeGoodman1949 and Frank \citeFrank1978. We revisit a problem formulated by Frank \citeFrank1978 of estimating the number of connected components in a large graph based on the subgraph sampling model, in which we randomly sample a subset of the vertices and observe the induced subgraph. The key question is whether accurate estimation is achievable in the \emphsublinear regime where only a vanishing fraction of the vertices are sampled. We show that it is impossible if the parent graph is allowed to contain high-degree vertices or long induced cycles. For the class of chordal graphs, where induced cycles of length four or above are forbidden, we characterize the optimal sample complexity within constant factors and construct linear-time estimators that provably achieve these bounds. This significantly expands the scope of previous results which have focused on unbiased estimators and special classes of graphs such as forests or cliques. Both the construction and the analysis of the proposed methodology rely on combinatorial properties of chordal graphs and identities of induced subgraph counts. They, in turn, also play a key role in proving minimax lower bounds based on construction of random instances of graphs with matching structures of small subgraphs.
- Jan 16 2018 cs.CV arXiv:1801.04461v1In this paper we consider the problem of single monocular image depth estimation. It is a challenging problem due to its ill-posedness nature and has found wide application in industry. Previous efforts belongs roughly to two families: learning-based method and interactive method. Learning-based method, in which deep convolutional neural network (CNN) is widely used, can achieve good result. But they suffer low generalization ability and typically perform poorly for unfamiliar scenes. Besides, data-hungry nature for such method makes data aquisition expensive and time-consuming. Interactive method requires human annotation of depth which, however, is errorneous and of large variance. To overcome these problems, we propose a new perspective for single monocular image depth estimation problem: size to depth. Our method require sparse label for real-world size of object rather than raw depth. A Coarse depth map is then inferred following geometric relationships according to size labels. Then we refine the depth map by doing energy function optimization on conditional random field(CRF). We experimentally demonstrate that our method outperforms traditional depth-labeling methods and can produce satisfactory depth maps.
- In this paper, a secure spatial modulation (SM) system with artificial noise (AN)-aided is investigated. To achieve higher secrecy rate (SR) in such a system, two high-performance schemes of transmit antenna selection (TAS), leakage-based and maximum secrecy rate (Max-SR), are proposed and a generalized Euclidean distance-optimized antenna selection (EDAS) method is designed. From simulation results and analysis, the four TAS schemes have an decreasing order: Max-SR, leakage-based, generalized EDAS, and random (conventional), in terms of SR performance. However, the proposed Max-SR method requires the exhaustive search to achieve the optimal SR performance, thus its complexity is extremely high as the number of antennas tends to medium and large scale. The proposed leakage-based method approaches the Max-SR method with much lower complexity. Thus, it achieves a good balance between complexity and SR performance. In terms of bit error rate (BER), their performances are in an increasing order: random, leakage-based, Max-SR, and generalized EDAS.
- Jan 15 2018 cs.CV arXiv:1801.04065v1Deep neural networks have shown excellent performance for stereo matching. Many efforts focus on the feature extraction and similarity measurement of the matching cost computation step while less attention is paid on cost aggregation which is crucial for stereo matching. In this paper, we present a learning-based cost aggregation method for stereo matching by a novel sub-architecture in the end-to-end trainable pipeline. We reformulate the cost aggregation as a learning process of the generation and selection of cost aggregation proposals which indicate the possible cost aggregation results. The cost aggregation sub-architecture is realized by a two-stream network: one for the generation of cost aggregation proposals, the other for the selection of the proposals. The criterion for the selection is determined by the low-level structure information obtained from a light convolutional network. The two-stream network offers a global view guidance for the cost aggregation to rectify the mismatching value stemming from the limited view of the matching cost computation. The comprehensive experiments on challenge datasets such as KITTI and Scene Flow show that our method outperforms the state-of-the-art methods.
- We investigate the optimality and power allocation algorithm of beam domain transmission for single-cell massive multiple-input multiple-output (MIMO) systems with a multi-antenna passive eavesdropper. Focusing on the secure massive MIMO downlink transmission with only statistical channel state information of legitimate users and the eavesdropper at base station, we introduce a lower bound on the achievable ergodic secrecy sum-rate, from which we derive the condition for eigenvectors of the optimal input covariance matrices. The result shows that beam domain transmission can achieve optimal performance in terms of secrecy sum-rate lower bound maximization. For the case of single-antenna legitimate users, we prove that it is optimal to allocate no power to the beams where the beam gains of the eavesdropper are stronger than those of legitimate users in order to maximize the secrecy sum-rate lower bound. Then, motivated by the concave-convex procedure and the large dimension random matrix theory, we develop an efficient iterative and convergent algorithm to optimize power allocation in the beam domain. Numerical simulations demonstrate the tightness of the secrecy sum-rate lower bound and the near-optimal performance of the proposed iterative algorithm.
- Cluster analysis and outlier detection are strongly coupled tasks in data mining area. Cluster structure can be easily destroyed by few outliers; on the contrary, the outliers are defined by the concept of cluster, which are recognized as the points belonging to none of the clusters. However, most existing studies handle them separately. In light of this, we consider the joint cluster analysis and outlier detection problem, and propose the Clustering with Outlier Removal (COR) algorithm. Generally speaking, the original space is transformed into the binary space via generating basic partitions in order to define clusters. Then an objective function based Holoentropy is designed to enhance the compactness of each cluster with a few outliers removed. With further analyses on the objective function, only partial of the problem can be handled by K-means optimization. To provide an integrated solution, an auxiliary binary matrix is nontrivally introduced so that COR completely and efficiently solves the challenging problem via a unified K-means- - with theoretical supports. Extensive experimental results on numerous data sets in various domains demonstrate the effectiveness and efficiency of COR significantly over the rivals including K-means- - and other state-of-the-art outlier detection methods in terms of cluster validity and outlier detection. Some key factors in COR are further analyzed for practical use. Finally, an application on flight trajectory is provided to demonstrate the effectiveness of COR in the real-world scenario.
- The explosive growth of mobile devices and the rapid increase of wideband wireless services call for advanced communication techniques that can achieve high spectral efficiency and meet the massive connectivity requirement. Cognitive radio (CR) and non-orthogonal multiple access (NOMA) are envisioned to be important solutions for the fifth generation wireless networks. Integrating NOMA techniques into CR networks (CRNs) has the tremendous potential to improve spectral efficiency and increase the system capacity. However, there are many technical challenges due to the severe interference caused by using NOMA. Many efforts have been made to facilitate the application of NOMA into CRNs and to investigate the performance of CRNs with NOMA. This article aims to survey the latest research results along this direction. A taxonomy is devised to categorize the literature based on operation paradigms, enabling techniques, design objectives and optimization characteristics. Moreover, the key challenges are outlined to provide guidelines for the domain researchers and designers to realize CRNs with NOMA. Finally, the open issues are discussed.
- Heterogeneous cloud radio access networks (H-CRANs) are envisioned to be promising in the fifth generation (5G) wireless networks. H-CRANs enable users to enjoy diverse services with high energy efficiency, high spectral efficiency, and low-cost operation, which are achieved by using cloud computing and virtualization techniques. However, H-CRANs face many technical challenges due to massive user connectivity, increasingly severe spectrum scarcity and energy-constrained devices. These challenges may significantly decrease the quality of service of users if not properly tackled. Non-orthogonal multiple access (NOMA) schemes exploit non-orthogonal resources to provide services for multiple users and are receiving increasing attention for their potential of improving spectral and energy efficiency in 5G networks. In this article a framework for energy-efficient NOMA H-CRANs is presented. The enabling technologies for NOMA H-CRANs are surveyed. Challenges to implement these technologies and open issues are discussed. This article also presents the performance evaluation on energy efficiency of H-CRANs with NOMA.
- Towards bridging the gap between machine and human intelligence, it is of utmost importance to introduce environments that are visually realistic and rich in content. In such environments, one can evaluate and improve a crucial property of practical intelligent systems, namely \emphgeneralization. In this work, we build \emphHouse3D, a rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of houses, ranging from single-room studios to multi-storeyed houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et al., 2017). With an emphasis on semantic-level generalization, we study the task of concept-driven navigation, \emphRoomNav, using a subset of houses in House3D. In RoomNav, an agent navigates towards a target specified by a semantic concept. To succeed, the agent learns to comprehend the scene it lives in by developing perception, understand the concept by mapping it to the correct semantics, and navigate to the target by obeying the underlying physical rules. We train RL agents with both continuous and discrete action spaces and show their ability to generalize in new unseen environments. In particular, we observe that (1) training is substantially harder on large house sets but results in better generalization, (2) using semantic signals (e.g., segmentation mask) boosts the generalization performance, and (3) gated networks on semantic input signal lead to improved training performance and generalization. We hope House3D, including the analysis of the RoomNav task, serves as a building block towards designing practical intelligent systems and we wish it to be broadly adopted by the community.
- Jan 08 2018 cs.CV arXiv:1801.01317v1Semantic image segmentation is one of the most challenged tasks in computer vision. In this paper, we propose a highly fused convolutional network, which consists of three parts: feature downsampling, combined feature upsampling and multiple predictions. We adopt a strategy of multiple steps of upsampling and combined feature maps in pooling layers with its corresponding unpooling layers. Then we bring out multiple pre-outputs, each pre-output is generated from an unpooling layer by one-step upsampling. Finally, we concatenate these pre-outputs to get the final output. As a result, our proposed network makes highly use of the feature information by fusing and reusing feature maps. In addition, when training our model, we add multiple soft cost functions on pre-outputs and final outputs. In this way, we can reduce the loss reduction when the loss is back propagated. We evaluate our model on three major segmentation datasets: CamVid, PASCAL VOC and ADE20K. We achieve a state-of-the-art performance on CamVid dataset, as well as considerable improvements on PASCAL VOC dataset and ADE20K dataset
- The relay broadcast channel (RBC) is considered, in which a transmitter communicates with two receivers with the assistance of a relay. Based on different degradation orders among the relay and the receivers' outputs, three types of physically degraded RBCs (PDRBCs) are introduced. Inner bounds and outer bounds are derived on the capacity region of the presented three types. The bounds are tight for two types of PDRBCs: 1) one receiver's output is a degraded form of the other receiver's output, and the relay's output is a degraded form of the weaker receiver's output; 2) one receiver's output is a degraded form of the relay's output, and the other receiver's output is a degraded form of the relay's output. For the Gaussian PDRBC, the bounds match, i.e., establish its capacity region.
- This paper investigates the fundamental limits for detecting a high-dimensional sparse matrix contaminated by white Gaussian noise from both the statistical and computational perspectives. We consider $p\times p$ matrices whose rows and columns are individually $k$-sparse. We provide a tight characterization of the statistical and computational limits for sparse matrix detection, which precisely describe when achieving optimal detection is easy, hard, or impossible, respectively. Although the sparse matrices considered in this paper have no apparent submatrix structure and the corresponding estimation problem has no computational issue at all, the detection problem has a surprising computational barrier when the sparsity level $k$ exceeds the cubic root of the matrix size $p$: attaining the optimal detection boundary is computationally at least as hard as solving the planted clique problem. The same statistical and computational limits also hold in the sparse covariance matrix model, where each variable is correlated with at most $k$ others. A key step in the construction of the statistically optimal test is a structural property for sparse matrices, which can be of independent interest.
- Dec 27 2017 cs.IR arXiv:1712.09010v1Proliferation of ubiquitous mobile devices makes location based services prevalent. Mobile users are able to volunteer as providers of specific services and in the meanwhile to search these services. For example, drivers may be interested in tracking available nearby users who are willing to help with motor repair or are willing to provide travel directions or first aid. With the diffusion of mobile users, it is necessary to provide scalable means of enabling such users to connect with other nearby users so that they can help each other with specific services. Motivated by these observations, we design and implement a general location based system HelPal for mobile users to provide and enjoy instant service, which is called mobile crowd service. In this demo, we introduce a mobile crowd service system featured with several novel techniques. We sketch the system architecture and illustrate scenarios via several cases. Demonstration shows the user-friendly search interface for users to conveniently find skilled and qualified nearby service providers.
- Dec 19 2017 cs.LG arXiv:1712.05929v1Conventionally, the resource allocation is formulated as an optimization problem and solved online with instantaneous scenario information. Since most resource allocation problems are not convex, the optimal solutions are very difficult to be obtained in real time. Lagrangian relaxation or greedy methods are then often employed, which results in performance loss. Therefore, the conventional methods of resource allocation are facing great challenges to meet the ever-increasing QoS requirements of users with scarce radio resource. Assisted by cloud computing, a huge amount of historical data on scenarios can be collected for extracting similarities among scenarios using machine learning. Moreover, optimal or near-optimal solutions of historical scenarios can be searched offline and stored in advance. When the measured data of current scenario arrives, the current scenario is compared with historical scenarios to find the most similar one. Then, the optimal or near-optimal solution in the most similar historical scenario is adopted to allocate the radio resources for the current scenario. To facilitate the application of new design philosophy, a machine learning framework is proposed for resource allocation assisted by cloud computing. An example of beam allocation in multi-user massive multiple-input-multiple-output (MIMO) systems shows that the proposed machine-learning based resource allocation outperforms conventional methods.
- Dec 19 2017 cs.CL arXiv:1712.05884v2This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$ comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
- Dec 13 2017 cs.AI arXiv:1712.04172v1This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent the capability of behaving ethically. Our model allows the designers of RL agents to solely focus on the task to achieve, without having to worry about the implementation of multiple trivial ethical patterns to follow. Based on the assumption that the majority of human behavior, regardless which goals they are achieving, is ethical, our design integrates human policy with the RL policy to achieve the target objective with less chance of violating the ethical code that human beings normally obey.
- Dec 12 2017 cs.CV arXiv:1712.03917v1In this paper, we strive to answer two questions: What is the current state of 3D hand pose estimation? And, what are the next challenges that need to be tackled? Following the successful Hands In the Million Challenge (HIM2017), we investigate 11 state-of-the-art methods on three tasks: single frame 3D pose estimation, 3D hand tracking, and hand pose estimation during object interaction. We analyze the performance of different CNN structures with regard to hand shape, joint visibility, view point and articulation distributions. Our findings include: (1) isolated 3D hand pose estimation achieves low mean errors (10 mm) in the view point range of [40, 150] degrees, but it is far from being solved for extreme view points; (2)3D volumetric representations outperform 2D CNNs, better capturing the spatial structure of the depth data; (3)~Discriminative methods still generalize poorly to unseen hand shapes; (4)~While joint occlusions pose a challenge for most methods, explicit modeling of structure constraints can significantly narrow the gap between errors on visible and occluded joints.
- Person Re-identification (re-id) faces two major challenges: the lack of cross-view paired training data and learning discriminative identity-sensitive and view-invariant features in the presence of large pose variations. In this work, we address both problems by proposing a novel deep person image generation model for synthesizing realistic person images conditional on pose. The model is based on a generative adversarial network (GAN) and used specifically for pose normalization in re-id, thus termed pose-normalization GAN (PN-GAN). With the synthesized images, we can learn a new type of deep re-id feature free of the influence of pose variations. We show that this feature is strong on its own and highly complementary to features learned with the original images. Importantly, we now have a model that generalizes to any new re-id dataset without the need for collecting any training data for model fine-tuning, thus making a deep re-id model truly scalable. Extensive experiments on five benchmarks show that our model outperforms the state-of-the-art models, often significantly. In particular, the features learned on Market-1501 can achieve a Rank-1 accuracy of 68.67% on VIPeR without any model fine-tuning, beating almost all existing models fine-tuned on the dataset.
- In this work, an adaptive and robust null-space projection (AR-NSP) scheme is proposed for secure transmission with artificial noise (AN)-aided directional modulation (DM) in wireless networks. The proposed scheme is carried out in three steps. Firstly, the directions of arrival (DOAs) of the signals from the desired user and eavesdropper are estimated by the Root Multiple Signal Classificaiton (Root-MUSIC) algorithm and the related signal-to-noise ratios (SNRs) are estimated based on the ratio of the corresponding eigenvalue to the minimum eigenvalue of the covariance matrix of the received signals. In the second step, the value intervals of DOA estimation errors are predicted based on the DOA and SNR estimations. Finally, a robust NSP beamforming DM system is designed according to the afore-obtained estimations and predictions. Our examination shows that the proposed scheme can significantly outperform the conventional non-adaptive robust scheme and non-robust NSP scheme in terms of achieving a much lower bit error rate (BER) at the desired user and a much higher secrecy rate (SR). In addition, the BER and SR performance gains achieved by the proposed scheme relative to other schemes increase with the value range of DOA estimation error.
- Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism. Thus, the language model component is only trained on transcribed audio-text pairs. This leads to the use of shallow fusion with an external language model at inference time. Shallow fusion refers to log-linear interpolation with a separately trained language model at each step of the beam search. In this work, we investigate the behavior of shallow fusion across a range of conditions: different types of language models, different decoding units, and different tasks. On Google Voice Search, we demonstrate that the use of shallow fusion with a neural LM with wordpieces yields a 9.1% relative word error rate reduction (WERR) over our competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring.
- For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units. In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice-search task, where we find that graphemes do indeed outperform phonemes. We also compare grapheme and phoneme-based approaches on a multi-dialect English task, which once again confirm the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.
- Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-theart ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12, 500 hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.
- Having a sequence-to-sequence model which can operate in an online fashion is important for streaming applications such as Voice Search. Neural transducer is a streaming sequence-to-sequence model, but has shown a significant degradation in performance compared to non-streaming models such as Listen, Attend and Spell (LAS). In this paper, we present various improvements to NT. Specifically, we look at increasing the window over which NT computes attention, mainly by looking backwards in time so the model still remains online. In addition, we explore initializing a NT model from a LAS-trained model so that it is guided with a better alignment. Finally, we explore including stronger language models such as using wordpiece models, and applying an external LM during the beam search. On a Voice Search task, we find with these improvements we can get NT to match the performance of LAS.
- Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.
- Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS), and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AM, PM and LMs for each dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets by inserting the dialect symbol at the end of the original grapheme sequence and also feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1 ~ 16.5% relative.
- Dec 01 2017 cs.CL arXiv:1711.11191v1We study response generation for open domain conversation in chatbots. Existing methods assume that words in responses are generated from an identical vocabulary regardless of their inputs, which not only makes them vulnerable to generic patterns and irrelevant noise, but also causes a high cost in decoding. We propose a dynamic vocabulary sequence-to-sequence (DVS2S) model which allows each input to possess their own vocabulary in decoding. In training, vocabulary construction and response generation are jointly learned by maximizing a lower bound of the true objective with a Monte Carlo sampling method. In inference, the model dynamically allocates a small vocabulary for an input with the word prediction model, and conducts decoding only with the small vocabulary. Because of the dynamic vocabulary mechanism, DVS2S eludes many generic patterns and irrelevant words in generation, and enjoys efficient decoding at the same time. Experimental results on both automatic metrics and human annotations show that DVS2S can significantly outperform state-of-the-art methods in terms of response quality, but only requires 60% decoding time compared to the most efficient baseline.
- Understanding the global optimality in deep learning (DL) has been attracting more and more attention recently. Conventional DL solvers, however, have not been developed intentionally to seek for such global optimality. In this paper we propose a novel approximation algorithm, BPGrad, towards optimizing deep models globally via branch and pruning. Our BPGrad algorithm is based on the assumption of Lipschitz continuity in DL, and as a result it can adaptively determine the step size for current gradient given the history of previous updates, wherein theoretically no smaller steps can achieve the global optimality. We prove that, by repeating such branch-and-pruning procedure, we can locate the global optimality within finite iterations. Empirically an efficient solver based on BPGrad for DL is proposed as well, and it outperforms conventional DL solvers such as Adagrad, Adadelta, RMSProp, and Adam in the tasks of object recognition, detection, and segmentation.
- In this paper we document our experiences with developing speech recognition for medical transcription - a system that automatically transcribes doctor-patient conversations. Towards this goal, we built a system along two different methodological lines - a Connectionist Temporal Classification (CTC) phoneme based model and a Listen Attend and Spell (LAS) grapheme based model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning issues. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and development of a matched language model was essential to the success of the CTC based models. The LAS based models, however were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models were able to achieve a word error rate of 20.1%, and the LAS models were able to achieve 18.3%. Our analysis shows that both models perform well on important medical utterances and therefore can be practical for transcribing medical conversations.
- Conditional Generative Adversarial Networks (cGANs) are generative models that can produce data samples ($x$) conditioned on both latent variables ($z$) and known auxiliary information ($c$). We propose the Bidirectional cGAN (BiCoGAN), which effectively disentangles $z$ and $c$ in the generation process and provides an encoder that learns inverse mappings from $x$ to both $z$ and $c$, trained jointly with the generator and the discriminator. We present crucial techniques for training BiCoGANs, which involve an extrinsic factor loss along with an associated dynamically-tuned importance weight. As compared to other encoder-based cGANs, BiCoGANs encode $c$ more accurately, and utilize $z$ and $c$ more effectively and in a more disentangled way to generate samples.
- Nov 20 2017 cs.CV arXiv:1711.06540v2Recent studies have shown that aggregating convolutional features of a pre-trained Convolutional Neural Network (CNN) can obtain impressive performance for a variety of visual tasks. The symmetric Positive Definite (SPD) matrix becomes a powerful tool due to its remarkable ability to learn an appropriate statistic representation to characterize the underlying structure of visual features. In this paper, we propose to aggregate deep convolutional features into an SPD matrix representation through the SPD generation and the SPD transformation under an end-to-end deep network. To this end, several new layers are introduced in our network, including a nonlinear kernel aggregation layer, an SPD matrix transformation layer, and a vectorization layer. The nonlinear kernel aggregation layer is employed to aggregate the convolutional features into a real SPD matrix directly. The SPD matrix transformation layer is designed to construct a more compact and discriminative SPD representation. The vectorization and normalization operations are performed in the vectorization layer for reducing the redundancy and accelerating the convergence. The SPD matrix in our network can be considered as a mid-level representation bridging convolutional features and high-level semantic features. To demonstrate the effectiveness of our method, we conduct extensive experiments on visual classification. Experiment results show that our method notably outperforms state-of-the-art methods.
- In a directional modulation network, a general power iterative (GPI) based beamforming scheme is proposed to maximize the secrecy rate (SR), where there are two optimization variables required to be optimized. The first one is the useful precoding vector of transmitting confidential messages to the desired user while the second one is the artificial noise (AN) projection matrix of forcing more AN to eavesdroppers. In such a secure network, the paramount problem is how to design or optimize the two optimization variables by different criteria. To maximize the SR (Max-SR), an alternatively iterative structure (AIS) is established between the AN projection matrix and the precoding vector for confidential messages. To choose a good initial value of iteration process of GPI, the proposed Max-SR method can readily double its convergence speed compared to the random choice of initial value. With only four iterations, it may rapidly converge to its rate ceil. From simulation results, it follows that the SR performance of the proposed AIS of GPI-based Max-SR is much better than those of conventional leakage-based and null-space projection methods in the medium and large signal-to-noise ratio (SNR) regions, and its achievable SR performance gain gradually increases as SNR increases.
- This paper studies an unmanned aerial vehicle (UAV)-enabled multicast channel, in which a UAV serves as a mobile transmitter to deliver common information to a set of $K$ ground users. We aim to characterize the capacity of this channel over a finite UAV communication period, subject to its maximum speed constraint and an average transmit power constraint. To achieve the capacity, the UAV should use a sufficiently long code that spans over its whole communication period. Accordingly, the multicast channel capacity is achieved via maximizing the minimum achievable time-averaged rates of the $K$ users, by jointly optimizing the UAV's trajectory and transmit power allocation over time. However, this problem is non-convex and difficult to be solved optimally. To tackle this problem, we first consider a relaxed problem by ignoring the maximum UAV speed constraint, and obtain its globally optimal solution via the Lagrange dual method. The optimal solution reveals that the UAV should hover above a finite number of ground locations, with the optimal hovering duration and transmit power at each location. Next, based on such a multi-location-hovering solution, we present a successive hover-and-fly trajectory design and obtain the corresponding optimal transmit power allocation for the case with the maximum UAV speed constraint. Numerical results show that our proposed joint UAV trajectory and transmit power optimization significantly improves the achievable rate of the UAV-enabled multicast channel, and also greatly outperforms the conventional multicast channel with a fixed-location transmitter.
- Nov 13 2017 cs.CL arXiv:1711.03697v1Task-oriented dialogue systems can efficiently serve a large number of customers and relieve people from tedious works. However, existing task-oriented dialogue systems depend on handcrafted actions and states or extra semantic labels, which sometimes degrades user experience despite the intensive human intervention. Moreover, current user simulators have limited expressive ability so that deep reinforcement Seq2Seq models have to rely on selfplay and only work in some special cases. To address those problems, we propose a uSer and Agent Model IntegrAtion (SAMIA) framework inspired by an observation that the roles of the user and agent models are asymmetric. Firstly, this SAMIA framework model the user model as a Seq2Seq learning problem instead of ranking or designing rules. Then the built user model is used as a leverage to train the agent model by deep reinforcement learning. In the test phase, the output of the agent model is filtered by the user model to enhance the stability and robustness. Experiments on a real-world coffee ordering dataset verify the effectiveness of the proposed SAMIA framework.
- We consider the problem of minimax estimation of the entropy of a density over Lipschitz balls. Dropping the usual assumption that the density is bounded away from zero, we obtain the minimax rates $(n\ln n)^{-\frac{s}{s+d}} + n^{-1/2}$ for $0<s\leq 2$ in arbitrary dimension $d$, where $s$ is the smoothness parameter and $n$ is the number of independent samples. Using a two-stage approximation technique, which first approximate the density by its kernel-smoothed version, and then approximate the non-smooth functional by polynomials, we construct entropy estimators that attain the minimax rate of convergence, shown optimal by matching lower bounds. One of the key steps in analyzing the bias relies on a novel application of the Hardy-Littlewood maximal inequality, which also leads to a new inequality on Fisher information that might be of independent interest.
- In this paper, we consider the block-sparse signals recovery problem in the context of multiple measurement vectors (MMV) with common row sparsity patterns. We develop a new method for recovery of common row sparsity MMV signals, where a pattern-coupled hierarchical Gaussian prior model is introduced to characterize both the block-sparsity of the coefficients and the statistical dependency between neighboring coefficients of the common row sparsity MMV signals. Unlike many other methods, the proposed method is able to automatically capture the block sparse structure of the unknown signal. Our method is developed using an expectation-maximization (EM) framework. Simulation results show that our proposed method offers competitive performance in recovering block-sparse common row sparsity pattern MMV signals.
- Nov 06 2017 cs.CV arXiv:1711.01062v1With the development of depth cameras such as Kinect and Intel Realsense, RGB-D based human detection receives continuous research attention due to its usage in a variety of applications. In this paper, we propose a new Multi-Glimpse LSTM (MG-LSTM) network, in which multi-scale contextual information is sequentially integrated to promote the human detection performance. Furthermore, we propose a feature fusion strategy based on our MG-LSTM network to better incorporate the RGB and depth information. To the best of our knowledge, this is the first attempt to utilize LSTM structure for RGB-D based human detection. Our method achieves superior performance on two publicly available datasets.
- Audio classification is the task of identifying the sound categories that are associated with a given audio signal. This paper presents an investigation on large-scale audio classification based on the recently released AudioSet database. AudioSet comprises 2 millions of audio samples from YouTube, which are human-annotated with 527 sound category labels. Audio classification experiments with the balanced training set and the evaluation set of AudioSet are carried out by applying different types of neural network models. The classification performance and the model complexity of these models are compared and analyzed. While the CNN models show better performance than MLP and RNN, its model complexity is relatively high and undesirable for practical use. We propose two different strategies that aim at constructing low-dimensional embedding feature extractors and hence reducing the number of model parameters. It is shown that the simplified CNN model has only 1/22 model parameters of the original model, with only a slight degradation of performance.
- Nov 02 2017 cs.LG arXiv:1711.00123v3Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.