results for au:Yu_H in:cs

- Recently there has been a rising interest in training agents, embodied in virtual environments, to perform language-directed tasks by deep reinforcement learning. In this paper, we propose a simple but effective neural language grounding module for embodied agents that can be trained end to end from scratch taking raw pixels, unstructured linguistic commands, and sparse rewards as the inputs. We model the language grounding process as a language-guided transformation of visual features, where latent sentence embeddings are used as the transformation matrices. In several language-directed navigation tasks that feature challenging partial observation and require simple reasoning, our module significantly outperforms the state of the arts. We also release XWORLD 3D, an easy-to-customize 3D environment that can potentially be modified to evaluate a variety of embodied agents.
- We apply neural nets with ReLU gates in online reinforcement learning. Our goal is to train these networks in an incremental manner, without the computationally expensive experience replay. By studying how individual neural nodes behave in online training, we recognize that the global nature of ReLU gates can cause undesirable learning interference in each node's learning behavior. We propose reducing such interferences with two efficient input transformation methods that are geometric in nature and match well the geometric property of ReLU gates. The first one is tile coding, a classic binary encoding scheme originally designed for local generalization based on the topological structure of the input space. The second one (EmECS) is a new method we introduce; it is based on geometric properties of convex sets and topological embedding of the input space into the boundary of a convex set. We discuss the behavior of the network when it operates on the transformed inputs. We also compare it experimentally with some neural nets that do not use the same input transformations, and with the classic algorithm of tile coding plus a linear function approximator, and on several online reinforcement learning tasks, we show that the neural net with tile coding or EmECS can achieve not only faster learning but also more accurate approximations. Our results strongly suggest that geometric input transformation of this type can be effective for interference reduction and takes us a step closer to fully incremental reinforcement learning with neural nets.
- As it requires a huge number of parameters when exposed to high dimensional inputs in video detection and classification, there is a grand challenge to develop a compact yet accurate video comprehension at terminal devices. Current works focus on optimizations of video detection and classification in a separated fashion. In this paper, we introduce a video comprehension (object detection and action recognition) system for terminal devices, namely DEEPEYE. Based on You Only Look Once (YOLO), we have developed an 8-bit quantization method when training YOLO; and also developed a tensorized-compression method of Recurrent Neural Network (RNN) composed of features extracted from YOLO. The developed quantization and tensorization can significantly compress the original network model yet with maintained accuracy. Using the challenging video datasets: MOMENTS and UCF11 as benchmarks, the results show that the proposed DEEPEYE achieves 3.994x model compression rate with only 0.47% mAP decreased; and 15,047x parameter reduction and 2.87x speed-up with 16.58% accuracy improvement.
- May 10 2018 cs.CL arXiv:1805.03366v1Word embedding is a key component in many downstream applications in processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embedding. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse as the co-occurrences of many word pairs are unobserved. In contrast to existing approaches often only sample a few unobserved word pairs as negative samples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approaches in four different languages.
- Building intelligent agents that can communicate with and learn from humans in natural language is of great value. Supervised language learning is limited by the ability of capturing mainly the statistics of training data, and is hardly adaptive to new scenarios or flexible for acquiring new knowledge without inefficient retraining or catastrophic forgetting. We highlight the perspective that conversational interaction serves as a natural interface both for language learning and for novel knowledge acquisition and propose a joint imitation and reinforcement approach for grounded language learning through an interactive conversational game. The agent trained with this approach is able to actively acquire information by asking questions about novel objects and use the just-learned knowledge in subsequent conversations in a one-shot fashion. Results compared with other methods verified the effectiveness of the proposed approach.
- Apr 23 2018 cs.CL arXiv:1804.07445v1Sentence simplification aims to simplify the content and structure of complex sentences, and thus make them easier to interpret for human readers, and easier to process for downstream NLP applications. Recent advances in neural machine translation have paved the way for novel approaches to the task. In this paper, we adapt an architecture with augmented memory capacities called Neural Semantic Encoders (Munkhdalai and Yu, 2017) for sentence simplification. Our experiments demonstrate the effectiveness of our approach on different simplification datasets, both in terms of automatic evaluation measures and human judgments.
- Apr 13 2018 cs.CV arXiv:1804.04606v1Modern object detectors usually suffer from low accuracy issues, as foregrounds always drown in tons of backgrounds and become hard examples during training. Compared with those proposal-based ones, real-time detectors are in far more serious trouble since they renounce the use of region-proposing stage which is used to filter a majority of backgrounds for achieving real-time rates. Though foregrounds as hard examples are in urgent need of being mined from tons of backgrounds, a considerable number of state-of-the-art real-time detectors, like YOLO series, have yet to profit from existing hard example mining methods, as using these methods need detectors fit series of prerequisites. In this paper, we propose a general hard example mining method named Loss Rank Mining (LRM) to fill the gap. LRM is a general method for real-time detectors, as it utilizes the final feature map which exists in all real-time detectors to mine hard examples. By using LRM, some elements representing easy examples in final feature map are filtered and detectors are forced to concentrate on hard examples during training. Extensive experiments validate the effectiveness of our method. With our method, the improvements of YOLOv2 detector on auto-driving related dataset KITTI and more general dataset PASCAL VOC are over 5% and 2% mAP, respectively. In addition, LRM is the first hard example mining strategy which could fit YOLOv2 perfectly and make it better applied in series of real scenarios where both real-time rates and accurate detection are strongly demanded.
- Apr 10 2018 cs.CR arXiv:1804.02498v1This paper focuses on the routing algorithm for the communications between vehicles and places in urban VANET. As one of the basic transportation facilities in an urban setting, buses periodically run along their fixed routes and widely cover city streets. The trajectory of bus lines can be seen as a sub map of a city. Based on the characters of bus networks, we propose a bus trajectory-based street-centric routing algorithm (BTSC), which uses bus as main relay to deliver message. In BTSC, we build a routing graph based on the trajectories of bus lines by analyzing the probability of bus appearing on every street. We propose two novel concepts, i.e. the probability of street consistency (PSC) and the probability of path consistency (PPC) which is used as metrics to determine routing paths for message delivery. This aims to choose the best path with higher density of busses and lower probability of transmission direction deviating from the routing path. In order to improve the bus forwarding opportunity, we design a bus-based forwarding strategy with ant colony optimization (FACO) to find a reliable and steady multi-hop link between two relay buses in order to decrease end-to-end delay. BTSC makes the improvements in the selection of routing path and the strategy of message forwarding. Simulation results show that our proposed routing algorithm has a better performance in transmission ratio, transmission delay and adaptability to different networks.
- With the ever growing diversity of devices and applications that will be connected to 5G networks, flexible and agile service orchestration with acknowledged QoE that satisfies end-user's functional and QoS requirements is necessary. SDN (Software-Defined Networking) and NFV (Network Function Virtualization) are considered key enabling technologies for 5G core networks. In this regard, this paper proposes a reinforcement learning based QoS/QoE-aware Service Function Chaining (SFC) in SDN/NFV-enabled 5G slices. First, it implements a lightweight QoS information collector based on LLDP, which works in a piggyback fashion on the southbound interface of the SDN controller, to enable QoS-awareness. Then, a DQN (Deep Q Network) based agent framework is designed to support SFC in the context of NFV. The agent takes into account the QoE and QoS as key aspects to formulate the reward so that it is expected to maximize QoE while respecting QoS constraints. The experiment results show that this framework exhibits good performance in QoE provisioning and QoS requirements maintenance for SFC in dynamic network environments.
- Apr 03 2018 cs.CV arXiv:1804.00518v1With the advantage of high mobility, Unmanned Aerial Vehicles (UAVs) are used to fuel numerous important applications in computer vision, delivering more efficiency and convenience than surveillance cameras with fixed camera angle, scale and view. However, very limited UAV datasets are proposed, and they focus only on a specific task such as visual tracking or object detection in relatively constrained scenarios. Consequently, it is of great importance to develop an unconstrained UAV benchmark to boost related researches. In this paper, we construct a new UAV benchmark focusing on complex scenarios with new level challenges. Selected from 10 hours raw videos, about 80,000 representative frames are fully annotated with bounding boxes as well as up to 14 kinds of attributes (e.g., weather condition, flying altitude, camera view, vehicle category, and occlusion) for three fundamental computer vision tasks: object detection, single object tracking, and multiple object tracking. Then, a detailed quantitative study is performed using most recent state-of-the-art algorithms for each task. Experimental results show that the current state-of-the-art methods perform relative worse on our dataset, due to the new challenges appeared in UAV based real scenes, e.g., high density, small object, and camera motion. To our knowledge, our work is the first time to explore such issues in unconstrained scenes comprehensively.
- Mar 28 2018 cs.CV arXiv:1803.09127v1Compact neural networks are inclined to exploit "sparsely-connected" convolutions such as depthwise convolution and group convolution for employment in mobile applications. Compared with standard "fully-connected" convolutions, these convolutions are more computationally economical. However, "sparsely-connected" convolutions block the inter-group information exchange, which induces severe performance degradation. To address this issue, we present two novel operations named merging and evolution to leverage the inter-group information. Our key idea is encoding the inter-group information with a narrow feature map, then combining the generated features with the original network for better representation. Taking advantage of the proposed operations, we then introduce the Merging-and-Evolution (ME) module, an architectural unit specifically designed for compact networks. Finally, we propose a family of compact neural networks called MENet based on ME modules. Extensive experiments on ILSVRC 2012 dataset and PASCAL VOC 2007 dataset demonstrate that MENet consistently outperforms other state-of-the-art compact networks under different computational budgets. For instance, under the computational budget of 140 MFLOPs, MENet surpasses ShuffleNet by 1% and MobileNet by 1.95% on ILSVRC 2012 top-1 accuracy, while by 2.3% and 4.1% on PASCAL VOC 2007 mAP, respectively.
- Mar 28 2018 cs.CV arXiv:1803.09474v1We present a novel semantic light field (LF) refocusing technique that can achieve unprecedented see-through quality. Different from prior art, our semantic see-through (SST) differentiates rays in their semantic meaning and depth. Specifically, we combine deep learning and stereo matching to provide each ray a semantic label. We then design tailored weighting schemes for blending the rays. Although simple, our solution can effectively remove foreground residues when focusing on the background. At the same time, SST maintains smooth transitions in varying focal depths. Comprehensive experiments on synthetic and new real indoor and outdoor datasets demonstrate the effectiveness and usefulness of our technique.
- Mar 19 2018 cs.PL arXiv:1803.06300v1Message Passing Interface (MPI) is the standard paradigm of programming in high performance computing. MPI programming takes significant effort, and is error-prone. Thus, effective tools for analyzing MPI programs are much needed. On the other hand, analyzing MPI programs itself is challenging because of non-determinism caused by program inputs and non-deterministic operations. Existing approaches for analyzing MPI programs either do not handle inputs or fail to support programs with mixed blocking and non-blocking operations. This paper presents MPI symbolic verifier (MPI-SV), the first symbolic execution based tool for verifying MPI programs having both blocking and non-blocking operations. To ensure soundness, we propose a blockingdriven matching algorithm to safely handle non-deterministic operations, and a method to soundly and completely model the equivalent behavior of a program execution path. The models of MPI program paths are generated on-the-fly during symbolic execution, and verified w.r.t. the expected properties by model checking. To improve scalability, MPI-SV uses the results of model checking to prune redundant paths. We have implemented MPI-SV and evaluated it on the verification of deadlock freedom for 108 real-world MPI tasks. The pure symbolic execution based technique can successfully verify 61 out of the 108 tasks (56%) within one hour, while in comparison, MPI-SV can verify 94 tasks (87%), a 31% improvement. On average, MPI-SV also achieves 7.25X speedup on verifying deadlock freedom and 2.64X speedup on finding deadlocks. These experimental results are promising, and demonstrate MPI-SV's effectiveness and efficiency.
- We build a virtual agent for learning language in a 2D maze-like world. The agent sees images of the surrounding environment, listens to a virtual teacher, and takes actions to receive rewards. It interactively learns the teacher's language from scratch based on two language use cases: sentence-directed navigation and question answering. It learns simultaneously the visual representations of the world, the language, and the action control. By disentangling language grounding from other computational routines and sharing a concept detection function between language grounding and prediction, the agent reliably interpolates and extrapolates to interpret sentences that contain new word combinations or new words missing from training sentences. The new words are transferred from the answers of language prediction. Such a language ability is trained and evaluated on a population of over 1.6 million distinct sentences consisting of 119 object words, 8 color words, 9 spatial-relation words, and 50 grammatical words. The proposed model significantly outperforms five comparison methods for interpreting zero-shot sentences. In addition, we demonstrate human-interpretable intermediate outputs of the model in the appendix.
- Hybrid energy storage system (HESS) composed of lithium-ion battery and supercapacitors has been recognized as one of the most promising solutions to face against the high cost, low power density and short cycle life of the battery-only energy storage system, which is the major headache hindering the further penetration of electric vehicles. In this work, the HESS sizing and energy management problem of an electric race car is investigated as a case study to improve the driving mileage and battery cycle life performance. Compared with the existing research, the distinctive features of this work are: (1) A dynamic model and a degradation model of the battery are employed to describe the dynamic behavior and to predict the cycle life of the battery more precisely; (2) Considering the fact that the design and control problems are coupled in most cases, in order to achieve a global optimal design solution and an implementable real-time energy management system, a Bi-level multi-objective sizing and control framework based on non-dominated sorting genetic algorithm-II and fuzzy logic control (FLC) is proposed to size the HESS and to optimize the membership functions of a FLC based EMS at the same time; (3) In order to improve the optimization efficiency, a vectorized fuzzy inference system which allows large scale of fuzzy logic controllers operating in parallel is devised. At last, the Pareto optimal solutions of different HESSs are obtained and compared to show the achieved enhancements of the proposed Bi-level optimal sizing and energy management framework.
- Non-orthogonal multiple access (NOMA) is a promising radio access technique for next-generation wireless networks. In this article, we investigate the NOMA-based cooperative relay network. We begin with an introduction of the existing relay-assisted NOMA systems by classifying them into three categories: uplink, downlink, and composite architectures. Then, we discuss their principles and key features, and provide a comprehensive comparison from the perspective of spectral efficiency, energy efficiency, and total transmit power. A novel strategy termed hybrid power allocation is further discussed for the composite architecture, which can reduce the computational complexity and signaling overhead at the expense of marginal sum rate degradation. Finally, major challenges, opportunities, and future research trends for the design of NOMA-based cooperative relay systems with other techniques are also highlighted to provide insights for researchers in this field.
- This paper considers utility optimal power control for energy harvesting wireless devices with a finite capacity battery. The distribution information of the underlying wireless environment and harvestable energy is unknown and only outdated system state information is known at the device controller. This scenario shares similarity with Lyapunov opportunistic optimization and online learning but is different from both. By a novel combination of Zinkevich's online gradient learning technique and the drift-plus-penalty technique from Lyapunov opportunistic optimization, this paper proposes a learning-aided algorithm that achieves utility within $O(\epsilon)$ of the optimal, for any desired $\epsilon>0$, by using a battery with an $O(1/\epsilon)$ capacity. The proposed algorithm has low complexity and makes power investment decisions based on system history, without requiring knowledge of the system state or its probability distribution.
- Jan 08 2018 cs.CV arXiv:1801.01452v1Spectral computed tomography (CT) has a great superiority in lesion detection, tissue characterization and material decomposition. To further extend its potential clinical applications, in this work, we propose an improved tensor dictionary learning method for low-dose spectral CT reconstruction with a constraint of image gradient L0-norm, which is named as L0TDL. The L0TDL method inherits the advantages of tensor dictionary learning (TDL) by employing the similarity of spectral CT images. On the other hand, by introducing the L0-norm constraint in gradient image domain, the proposed method emphasizes the spatial sparsity to overcome the weakness of TDL on preserving edge information. The alternative direction minimization method (ADMM) is employed to solve the proposed method. Both numerical simulations and real mouse studies are perform to evaluate the proposed method. The results show that the proposed L0TDL method outperforms other competing methods, such as total variation (TV) minimization, TV with low rank (TV+LR), and TDL methods.
- We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms, which we call GTD and which minimize the mean squared projected Bellman error using stochastic gradient-descent; (ii) their "robustified" biased variants; (iii) their mirror-descent versions which combine the mirror-descent idea with TD learning; and (iv) a single-time-scale version of GTD that solves minimax problems formulated for approximate policy evaluation. We derive convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard type of diminishing stepsize with a square-summable condition. For the first two types of stepsizes, we apply the weak convergence method from stochastic approximation theory to characterize the asymptotic behavior of the algorithms, and for the standard type of stepsize, we analyze the algorithmic behavior with respect to a stronger mode of convergence, almost sure convergence. Our convergence results are for the aforementioned TD algorithms with three general ways of setting their $\lambda$-parameters: (i) state-dependent $\lambda$; (ii) a recently proposed scheme of using history-dependent $\lambda$ to keep the eligibility traces of the algorithms bounded while allowing for relatively large values of $\lambda$; and (iii) a composite scheme of setting the $\lambda$-parameters that combines the preceding two schemes and allows a broader class of generalized Bellman operators to be used for approximate policy evaluation with TD methods.
- A \it Distance Labeling scheme is a data structure that can answer shortest path queries on a graph. Experiment results from several recent studies (Akiba et al.'13, Delling et al.'14) found very efficient and very accurate labeling schemes, which scale to social and information networks with tens of millions of vertices and edges. Such a finding is not expected in the worst case, since even for graphs with maximum degree $3$, it is known that any distance labeling requires $\Omega{(n^{3/2})}$ space (Gavoille et al.'03). On the other hand, social and information networks have a heavy-tailed degree distribution and small average distance, which are not captured in the worst case. In this paper, we fill in the gap between empirical and worst case results. We consider distance labeling schemes on random graph models with a power law degree distribution. For such graphs, we show that simple breadth-first-search based algorithm can find near optimal labeling schemes. The intuition behind our proof reveals that the distances between different pairs of vertices are almost independent, even for polynomially many pairs.
- Dec 06 2017 cs.CV arXiv:1712.01493v2While attributes have been widely used for person re-identification (Re-ID) that matches the same person images across disjoint camera views, they are used either as extra features or for performing multi-task learning to assist the image-image person matching task. However, how to find a set of person images according to a given attribute description, which is very practical in many surveillance applications, remains a rarely investigated cross-modal matching problem in Person Re-ID. In this work, we present this challenge and employ adversarial learning to formulate the attribute-image cross-modal person Re-ID model. By imposing the regularization on the semantic consistency constraint across modalities, the adversarial learning enables generating image-analogous concepts for query attributes and getting it matched with image in both global level and semantic ID level. We conducted extensive experiments on three attribute datasets and demonstrated that the adversarial modelling is so far the most effective for the attributeimage cross-modal person Re-ID problem.
- In a directional modulation network, a general power iterative (GPI) based beamforming scheme is proposed to maximize the secrecy rate (SR), where there are two optimization variables required to be optimized. The first one is the useful precoding vector of transmitting confidential messages to the desired user while the second one is the artificial noise (AN) projection matrix of forcing more AN to eavesdroppers. In such a secure network, the paramount problem is how to design or optimize the two optimization variables by different criteria. To maximize the SR (Max-SR), an alternatively iterative structure (AIS) is established between the AN projection matrix and the precoding vector for confidential messages. To choose a good initial value of iteration process of GPI, the proposed Max-SR method can readily double its convergence speed compared to the random choice of initial value. With only four iterations, it may rapidly converge to its rate ceil. From simulation results, it follows that the SR performance of the proposed AIS of GPI-based Max-SR is much better than those of conventional leakage-based and null-space projection methods in the medium and large signal-to-noise ratio (SNR) regions, and its achievable SR performance gain gradually increases as SNR increases.
- This paper presents a novel variational inference framework for deriving a family of Bayesian sparse Gaussian process regression (SGPR) models whose approximations are variationally optimal with respect to the full-rank GPR model enriched with various corresponding correlation structures of the observation noises. Our variational Bayesian SGPR (VBSGPR) models jointly treat both the distributions of the inducing variables and hyperparameters as variational parameters, which enables the decomposability of the variational lower bound that in turn can be exploited for stochastic optimization. Such a stochastic optimization involves iteratively following the stochastic gradient of the variational lower bound to improve its estimates of the optimal variational distributions of the inducing variables and hyperparameters (and hence the predictive distribution) of our VBSGPR models and is guaranteed to achieve asymptotic convergence to them. We show that the stochastic gradient is an unbiased estimator of the exact gradient and can be computed in constant time per iteration, hence achieving scalability to big data. We empirically evaluate the performance of our proposed framework on two real-world, massive datasets.
- Sep 28 2017 cs.DS arXiv:1709.09574v1In the fillable array problem one must maintain an array A[1..n] of $w$-bit entries subject to random access reads and writes, and also a $\texttt{fill}(\Delta)$ operation which sets every entry of to some $\Delta\in\{0,\ldots,2^w-1\}$. We show that with just one bit of redundancy, i.e. a data structure using $nw+1$ bits of memory, $\texttt{read}/\texttt{fill}$ can be implemented in worst case constant time, and $\texttt{write}$ can be implemented in either amortized constant time (deterministically) or worst case expected constant (randomized). In the latter case, we need to store an additional $O(\log n)$ random bits to specify a permutation drawn from an $1/n^2$-almost pairwise independent family.
- In OFDM relay networks with IQ imbalances and full-duplex relay station (RS), how to optimize pilot pattern and power allocation using the criterion of minimizing the sum of mean square errors (Sum-MSE) for the frequency-domain least-squares channel estimator has a heavy impact on self-interference cancellation. Firstly, the design problem of pilot pattern is casted as a convex optimization. From the KKT conditions, the optimal analytical expression is derived given the fixed source power and RS power. Subsequently, an optimal power allocation (OPA) strategy is proposed and presented to further alleviate the effect of Sum-MSE under the total transmit power sum constraint of source node and RS. Simulation results show that the proposed OPA performs better than equal power allocation (EPA) in terms of Sum-MSE, and the Sum-MSE performance gain grows with deviating $\rho$ from the value of $\rho^o$ minimizing the Sum-MSE, where $\rho$ is defined as the average ratio of the residual SI channel at RS to the intended channel from source to RS. For example, the OPA achieves about 5dB SNR gain over EPA by shrinking or stretching $\rho$ with a factor $4$. More importantly, as $\rho$ decreases or increases more, the performance gain becomes more significant.
- Abstraction and realization are bilateral processes that are key in deriving intelligence and creativity. In many domains, the two processes are approached through rules: high-level principles that reveal invariances within similar yet diverse examples. Under a probabilistic setting for discrete input spaces, we focus on the rule realization problem which generates input sample distributions that follow the given rules. More ambitiously, we go beyond a mechanical realization that takes whatever is given, but instead ask for proactively selecting reasonable rules to realize. This goal is demanding in practice, since the initial rule set may not always be consistent and thus intelligent compromises are needed. We formulate both rule realization and selection as two strongly connected components within a single and symmetric bi-convex problem, and derive an efficient algorithm that works at large scale. Taking music compositional rules as the main example throughout the paper, we demonstrate our model's efficiency in not only music realization (composition) but also music interpretation and understanding (analysis).
- Sep 06 2017 cs.GR arXiv:1709.01221v2Edge bundling methods can effectively alleviate visual clutter and reveal high-level graph structures in large graph visualization. Researchers have devoted significant efforts to improve edge bundling according to different metrics. As the edge bundling family evolve rapidly, the quality of edge bundles receives increasing attention in the literature accordingly. In this paper, we present MLSEB, a novel method to generate edge bundles based on moving least squares (MLS) approximation. In comparison with previous edge bundling methods, we argue that our MLSEB approach can generate better results based on a quantitative metric of quality, and also ensure scalability and the efficiency for visualizing large graphs.
- Aug 29 2017 cs.CV arXiv:1708.08062v2While metric learning is important for Person re-identification (RE-ID), a significant problem in visual surveillance for cross-view pedestrian matching, existing metric models for RE-ID are mostly based on supervised learning that requires quantities of labeled samples in all pairs of camera views for training. However, this limits their scalabilities to realistic applications, in which a large amount of data over multiple disjoint camera views is available but not labelled. To overcome the problem, we propose unsupervised asymmetric metric learning for unsupervised RE-ID. Our model aims to learn an asymmetric metric, i.e., specific projection for each view, based on asymmetric clustering on cross-view person images. Our model finds a shared space where view-specific bias is alleviated and thus better matching performance can be achieved. Extensive experiments have been conducted on a baseline and five large-scale RE-ID datasets to demonstrate the effectiveness of the proposed model. Through the comparison, we show that our model works much more suitable for unsupervised RE-ID compared to classical unsupervised metric learning models. We also compare with existing unsupervised RE-ID methods, and our model outperforms them with notable margins. Specifically, we report the results on large-scale unlabelled RE-ID dataset, which is important but unfortunately less concerned in literatures.
- Aug 04 2017 cs.CV arXiv:1708.00961v2In this paper, we introduce a new CT image denoising method based on the generative adversarial network (GAN) with Wasserstein distance and perceptual similarity. The Wasserstein distance is a key concept of the optimal transform theory, and promises to improve the performance of the GAN. The perceptual loss compares the perceptual features of a denoised output against those of the ground truth in an established feature space, while the GAN helps migrate the data noise distribution from strong to weak. Therefore, our proposed method transfers our knowledge of visual perception to the image denoising task, is capable of not only reducing the image noise level but also keeping the critical information at the same time. Promising results have been obtained in our experiments with clinical CT images.
- Jul 28 2017 cs.CV arXiv:1707.08682v2SSD is one of the state-of-the-art object detection algorithms, and it combines high detection accuracy with real-time speed. However, it is widely recognized that SSD is less accurate in detecting small objects compared to large objects, because it ignores the context from outside the proposal boxes. In this paper, we present CSSD--a shorthand for context-aware single-shot multibox object detector. CSSD is built on top of SSD, with additional layers modeling multi-scale contexts. We describe two variants of CSSD, which differ in their context layers, using dilated convolution layers (DiCSSD) and deconvolution layers (DeCSSD) respectively. The experimental results show that the multi-scale context modeling significantly improves the detection accuracy. In addition, we study the relationship between effective receptive fields (ERFs) and the theoretical receptive fields (TRFs), particularly on a VGGNet. The empirical results further strengthen our conclusion that SSD coupled with context layers achieves better detection results especially for small objects ($+3.2\% {\rm AP}_{@0.5}$ on MS-COCO compared to the newest SSD), while maintaining comparable runtime performance.
- Jun 22 2017 cs.IR arXiv:1706.06716v2Understanding users' behavior and predicting their future purchase are critical for e-commerce companies to boost their revenue. While explicit user feedback such as ratings plays the most significant role in eliciting users' preferences, such feedback is scarce, which prompts the need for leveraging more abundant implicit user feedback such as purchase record. Consequently, recent studies focused on leveraging users' past purchase record to predict their purchase. However, their performance is not satisfactory due to 1) the lack of purchase history of users, and 2) more importantly the ill-posed assumption of non-purchased items equally being considered as negative feedback. In this paper, we define new pairwise relationships among items aiming at overcoming the limitations of existing works, and propose a novel method called P3S that stands for modeling pairwise relationships among three disjoint item sets, which leverages users' click record in conjunction with their purchase record. Precisely, we partition the items into three disjoint sets based on users' purchase and click record, define new pairwise relationships among them with respect to users, and reflect these relationships into our pairwise learning-to-rank method. Experiments on two real-world datasets demonstrate that our proposed method outperforms the state-of-the-art baselines in the task of predicting users' future purchase in e-commerce.
- Jun 13 2017 cs.SD arXiv:1706.03397v1In this paper, we propose a noise robust bottleneck feature representation which is generated by an adversarial network (AN). The AN includes two cascade connected networks, an encoding network (EN) and a discriminative network (DN). Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used as input to the EN and the output of the EN is used as the noise robust feature. The EN and DN are trained in turn, namely, when training the DN, noise types are selected as the training labels and when training the EN, all labels are set as the same, i.e., the clean speech label, which aims to make the AN features invariant to noise and thus achieve noise robustness. We evaluate the performance of the proposed feature on a Gaussian Mixture Model-Universal Background Model based speaker verification system, and make comparison to MFCC features of speech enhanced by short-time spectral amplitude minimum mean square error (STSA-MMSE) and deep neural network-based speech enhancement (DNN-SE) methods. Experimental results on the RSR2015 database show that the proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE and DNN-SE based MFCCs for different noise types and signal-to-noise ratios. Furthermore, the AN-BN feature is able to improve the speaker verification performance under the clean condition.
- May 30 2017 cs.CL arXiv:1705.09906v1One of the long-term goals of artificial intelligence is to build an agent that can communicate intelligently with human in natural language. Most existing work on natural language learning relies heavily on training over a pre-collected dataset with annotated labels, leading to an agent that essentially captures the statistics of the fixed external training data. As the training data is essentially a static snapshot representation of the knowledge from the annotator, the agent trained this way is limited in adaptiveness and generalization of its behavior. Moreover, this is very different from the language learning process of humans, where language is acquired during communication by taking speaking action and learning from the consequences of speaking action in an interactive manner. This paper presents an interactive setting for grounded natural language learning, where an agent learns natural language by interacting with a teacher and learning from feedback, thus learning and improving language skills while taking part in the conversation. To achieve this goal, we propose a model which incorporates both imitation and reinforcement by leveraging jointly sentence and reward feedbacks from the teacher. Experiments are conducted to validate the effectiveness of the proposed approach.
- May 09 2017 cs.LG arXiv:1705.02699v1Predicting large-scale transportation network traffic has become an important and challenging topic in recent decades. Inspired by the domain knowledge of motion prediction, in which the future motion of an object can be predicted based on previous scenes, we propose a network grid representation method that can retain the fine-scale structure of a transportation network. Network-wide traffic speeds are converted into a series of static images and input into a novel deep architecture, namely, spatiotemporal recurrent convolutional networks (SRCNs), for traffic forecasting. The proposed SRCNs inherit the advantages of deep convolutional neural networks (DCNNs) and long short-term memory (LSTM) neural networks. The spatial dependencies of network-wide traffic can be captured by DCNNs, and the temporal dynamics can be learned by LSTMs. An experiment on a Beijing transportation network with 278 links demonstrates that SRCNs outperform other deep learning-based algorithms in both short-term and long-term traffic prediction.
- May 08 2017 cs.CV arXiv:1705.02233v4One of the major challenges in object detection is to propose detectors with highly accurate localization of objects. The online sampling of high-loss region proposals (hard examples) uses the multitask loss with equal weight settings across all loss types (e.g, classification and localization, rigid and non-rigid categories) and ignores the influence of different loss distributions throughout the training process, which we find essential to the training efficacy. In this paper, we present the Stratified Online Hard Example Mining (S-OHEM) algorithm for training higher efficiency and accuracy detectors. S-OHEM exploits OHEM with stratified sampling, a widely-adopted sampling technique, to choose the training examples according to this influence during hard example mining, and thus enhance the performance of object detectors. We show through systematic experiments that S-OHEM yields an average precision (AP) improvement of 0.5% on rigid categories of PASCAL VOC 2007 for both the IoU threshold of 0.6 and 0.7. For KITTI 2012, both results of the same metric are 1.6%. Regarding the mean average precision (mAP), a relative increase of 0.3% and 0.5% (1% and 0.5%) is observed for VOC07 (KITTI12) using the same set of IoU threshold. Also, S-OHEM is easy to integrate with existing region-based detectors and is capable of acting with post-recognition level regressors.
- In this paper, we make an investigation on the sum-mean-square-error (sum-MSE) performance gain achieved by DFT-based least-square (LS) channel estimator over frequency-domain LS one in full-duplex OFDM system in the presence of colored interference and noise. The closed-form expression of the sum-MSE performance gain is given. Its simple upper and lower bounds are derived by using inequalities of matrix eigen-values. By simulation and analysis, the upper lower bound is shown to be close to the exact value of MSE gain as the ratio of the number N of total subcarriers to the cyclic prefix length L grows and the correlation factor of colored interference increases. More importantly, we also find that the MSE gain varies from one to N/L as the correlation among colored interferences decreases gradually. According to theoretical analysis, we also find the MSE gain has very simple forms in two extreme scenarios. In the first extreme case that the colored interferences over all subchannels are fully correlated, i.e., their covariance matrix is a matrix of all-ones, the sum-MSE gain reduces to 1. In other words, there is no performance gain. In the second extreme case that the colored-interference covariance matrix is an identity matrix, i.e, they are mutually independent, the achievable sum-MSE performance gain is N/L. A large ratio N/L will achieve a significant sum-MSE gain.
- Apr 25 2017 cs.CV arXiv:1704.07142v1With the increasing demands of applications in virtual reality such as 3D films, virtual Human-Machine Interactions and virtual agents, the analysis of 3D human face analysis is considered to be more and more important as a fundamental step for those virtual reality tasks. Due to information provided by an additional dimension, 3D facial reconstruction enables aforementioned tasks to be achieved with higher accuracy than those based on 2D facial analysis. The denser the 3D facial model is, the more information it could provide. However, most existing dense 3D facial reconstruction methods require complicated processing and high system cost. To this end, this paper presents a novel method that simplifies the process of dense 3D facial reconstruction by employing only one frame of depth data obtained with an off-the-shelf RGB-D sensor. The experiments showed competitive results with real world data.
- In this work, we introduce an online model for communication complexity. Analogous to how online algorithms receive their input piece-by-piece, our model presents one of the players, Bob, his input piece-by-piece, and has the players Alice and Bob cooperate to compute a result each time before the next piece is revealed to Bob. This model has a closer and more natural correspondence to dynamic data structures than classic communication models do, and hence presents a new perspective on data structures. We first present a tight lower bound for the online set intersection problem in the online communication model, demonstrating a general approach for proving online communication lower bounds. The online communication model prevents a batching trick that classic communication complexity allows, and yields a stronger lower bound. We then apply the online communication model to prove data structure lower bounds for two dynamic data structure problems: the Group Range problem and the Dynamic Connectivity problem for forests. Both of the problems admit a worst case $O(\log n)$-time data structure. Using online communication complexity, we prove a tight cell-probe lower bound for each: spending $o(\log n)$ (even amortized) time per operation results in at best an $\exp(-\delta^2 n)$ probability of correctly answering a $(1/2+\delta)$-fraction of the $n$ queries.
- We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the $\lambda$-parameters of TD, based on generalized Bellman equations. Our scheme is to set $\lambda$ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared to prior works, this scheme is more direct and flexible, and allows much larger $\lambda$ values for off-policy TD learning with bounded traces. Using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations.
- Tensor factorization models offer an effective approach to convert massive electronic health records into meaningful clinical concepts (phenotypes) for data analysis. These models need a large amount of diverse samples to avoid population bias. An open challenge is how to derive phenotypes jointly across multiple hospitals, in which direct patient-level data sharing is not possible (e.g., due to institutional policies). In this paper, we developed a novel solution to enable federated tensor factorization for computational phenotyping without sharing patient-level data. We developed secure data harmonization and federated computation procedures based on alternating direction method of multipliers (ADMM). Using this method, the multiple hospitals iteratively update tensors and transfer secure summarized information to a central server, and the server aggregates the information to generate phenotypes. We demonstrated with real medical datasets that our method resembles the centralized training model (based on combined datasets) in terms of accuracy and phenotypes discovery while respecting privacy.
- We tackle a task where an agent learns to navigate in a 2D maze-like environment called XWORLD. In each session, the agent perceives a sequence of raw-pixel frames, a natural language command issued by a teacher, and a set of rewards. The agent learns the teacher's language from scratch in a grounded and compositional manner, such that after training it is able to correctly execute zero-shot commands: 1) the combination of words in the command never appeared before, and/or 2) the command contains new object concepts that are learned from another task but never learned from navigation. Our deep framework for the agent is trained end to end: it learns simultaneously the visual representations of the environment, the syntax and semantics of the language, and the action module that outputs actions. The zero-shot learning capability of our framework results from its compositionality and modularity with parameter tying. We visualize the intermediate outputs of the framework, demonstrating that the agent truly understands how to solve the problem. We believe that our results provide some preliminary insights on how to train an agent with similar abilities in a 3D environment.
- This paper proves the first super-logarithmic lower bounds on the cell probe complexity of dynamic boolean (a.k.a. decision) data structure problems, a long-standing milestone in data structure lower bounds. We introduce a new method for proving dynamic cell probe lower bounds and use it to prove a $\tilde{\Omega}(\log^{1.5} n)$ lower bound on the operational time of a wide range of boolean data structure problems, most notably, on the query time of dynamic range counting over $\mathbb{F}_2$ ([Pat07]). Proving an $\omega(\lg n)$ lower bound for this problem was explicitly posed as one of five important open problems in the late Mihai PÇŽtraÅŸcu's obituary [Tho13]. This result also implies the first $\omega(\lg n)$ lower bound for the classical 2D range counting problem, one of the most fundamental data structure problems in computational geometry and spatial databases. We derive similar lower bounds for boolean versions of dynamic polynomial evaluation and 2D rectangle stabbing, and for the (non-boolean) problems of range selection and range median. Our technical centerpiece is a new way of "weakly" simulating dynamic data structures using efficient one-way communication protocols with small advantage over random guessing. This simulation involves a surprising excursion to low-degree (Chebychev) polynomials which may be of independent interest, and offers an entirely new algorithmic angle on the "cell sampling" method of Panigrahy et al. [PTW10].
- This paper studies a green paradigm for the underlay coexistence of primary users (PUs) and secondary users (SUs) in energy harvesting cognitive radio networks (EH-CRNs), wherein battery-free SUs capture both the spectrum and the energy of PUs to enhance spectrum efficiency and green energy utilization. To lower the transmit powers of SUs, we employ multi-hop transmission with time division multiple access, by which SUs first harvest energy from the RF signals of PUs and then transmit data in the allocated time concurrently with PUs, all in the licensed spectrum. In this way, the available transmit energy of each SU mainly depends on the harvested energy before the turn to transmit, namely energy causality. Meanwhile, the transmit powers of SUs must be strictly controlled to protect PUs from harmful interference. Thus, subject to the energy causality constraint and the interference power constraint, we study the end-to-end throughput maximization problem for optimal time and power allocation. To solve this nonconvex problem, we first equivalently transform it into a convex optimization problem and then propose the joint optimal time and power allocation (JOTPA) algorithm that iteratively solves a series of feasibility problems until convergence. Extensive simulations evaluate the performance of EH-CRNs with JOTPA in three typical deployment scenarios and validate the superiority of JOTPA by making comparisons with two other resource allocation algorithms.
- Both feedback of ratings and trust relationships can be used to reveal users' tastes for improving recommendation performance, especially for cold users. However, both of them are facing data sparsity problem, which may severely degrade recommendation performance. In this paper, we propose to utilize the idea of Denoising Auto-Encoders (DAE) to tackle this problem. Specially, we propose a novel deep learning model, the \textitTrust-aware Collaborative Denoising Auto-Encoder (TDAE), to learn compact and effective representations from both rating and trust data for top-N recommendation. In particular, we present a novel neutral network with a weighted hidden layer to balance the importance of these representations. Moreover, we propose a novel correlative regularization to bridge relations between user preferences in different perspectives. We also conduct comprehensive experiments on two public datasets to compare with several state-of-the-art approaches. The results demonstrate that the proposed method significantly outperforms other comparisons for top-N recommendation task.
- Mar 03 2017 cs.CL arXiv:1703.00538v2Background: Electronic health record (EHR) notes contain abundant medical jargon that can be difficult for patients to comprehend. One way to help patients is to reduce information overload and help them focus on medical terms that matter most to them. Objective: The aim of this work was to develop FIT (Finding Important Terms for patients), an unsupervised natural language processing (NLP) system that ranks medical terms in EHR notes based on their importance to patients. Methods: We built FIT on a new unsupervised ensemble ranking model derived from the biased random walk algorithm to combine heterogeneous information resources for ranking candidate terms from each EHR note. Specifically, FIT integrates four single views for term importance: patient use of medical concepts, document-level term salience, word-occurrence based term relatedness, and topic coherence. It also incorporates partial information of term importance as conveyed by terms' unfamiliarity levels and semantic types. We evaluated FIT on 90 expert-annotated EHR notes and compared it with three benchmark unsupervised ensemble ranking methods. Results: FIT achieved 0.885 AUC-ROC for ranking candidate terms from EHR notes to identify important terms. When including term identification, the performance of FIT for identifying important terms from EHR notes was 0.813 AUC-ROC. It outperformed the three ensemble rankers for most metrics. Its performance is relatively insensitive to its parameter. Conclusions: FIT can automatically identify EHR terms important to patients and may help develop personalized interventions to improve quality of care. By using unsupervised learning as well as a robust and flexible framework for information fusion, FIT can be readily applied to other domains and applications.
- Neural networks have been successfully applied in applications with a large amount of labeled data. However, the task of rapid generalization on new concepts with small training data while preserving performances on previously learned ones still presents a significant challenge to neural network models. In this work, we introduce a novel meta learning method, Meta Networks (MetaNet), that learns a meta-level knowledge across tasks and shifts its inductive biases via fast parameterization for rapid generalization. When evaluated on Omniglot and Mini-ImageNet benchmarks, our MetaNet models achieve a near human-level performance and outperform the baseline approaches by up to 6% accuracy. We demonstrate several appealing properties of MetaNet relating to generalization and continual learning.
- Mar 01 2017 cs.CL arXiv:1702.08563v2Item Response Theory (IRT) allows for measuring ability of Machine Learning models as compared to a human population. However, it is difficult to create a large dataset to train the ability of deep neural network models (DNNs). We propose Crowd-Informed Fine-Tuning (CIFT) as a new training process, where a pre-trained model is fine-tuned with a specialized supplemental training set obtained via IRT model-fitting on a large set of crowdsourced response patterns. With CIFT we can leverage the specialized set of data obtained through IRT to inform parameter tuning in DNNs. We experiment with two loss functions in CIFT to represent (i) memorization of fine-tuning items and (ii) learning a probability distribution over potential labels that is similar to the crowdsourced distribution over labels to simulate crowd knowledge. Our results show that CIFT improves ability for a state-of-the art DNN model for Recognizing Textual Entailment (RTE) tasks and is generalizable to a large-scale RTE test set.
- FPGA-based hardware accelerators for convolutional neural networks (CNNs) have obtained great attentions due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipelines stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies largely depending on the batch size of the data. Experiment results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
- Feb 17 2017 cs.CL arXiv:1702.04811v2Deep neural networks (DNNs) have made significant progress in a number of Machine Learning applications. However without a consistent set of evaluation tasks, interpreting performance across test datasets is impossible. In most previous work, characteristics of individual data points are not considered during evaluation, and each data point is treated equally. Using Item Response Theory (IRT) from psychometrics it is possible to model characteristics of specific data points that then inform an estimate of model ability as compared to a population of humans. We report the results of several experiments to determine how different Deep Neural Network (DNN) models perform under different training circumstances with respect to ability. As DNNs train on larger datasets, performance begins to look like human performance under the assumptions of IRT models. That is, easy questions start to have a higher probability of being answered correctly than harder questions. We also report the results of additional analyses regarding model robustness to noise and performance as a function of training set size that further inform our main conclusion
- With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attack. In order to improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech. The deep neural network filter bank is automatically generated by training a filter bank neural network (FBNN) using natural and synthetic speech. By adding restrictions on the training rules, the learned weight matrix of FBNN is band-limited and sorted by frequency, similar to the normal filter bank. Unlike the manually designed filter bank, the learned filter bank has different filter shapes in different channels, which can capture the differences between natural and synthetic speech more effectively. The experimental results on the ASVspoof 2015 database show that the Gaussian mixture model maximum-likelihood (GMM-ML) classifier trained by the new feature performs better than the state-of-the-art linear frequency cepstral coefficients (LFCC) based classifier, especially on detecting unknown attacks.
- Feb 13 2017 cs.LG arXiv:1702.03006v1To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios. However, such an algorithm does not exist for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior algorithm based on lookup table representation called Tree Backup can also be retrieved using action-dependent bootstrapping, becoming a special case of our algorithm. In two challenging off-policy tasks, we demonstrate that our algorithm is stable, effectively avoids the large variance issue, and can perform substantially better than its state-of-the-art counterpart.
- The backpressure algorithm has been widely used as a distributed solution to the problem of joint rate control and routing in multi-hop data networks. By controlling a parameter $V$ in the algorithm, the backpressure algorithm can achieve an arbitrarily small utility optimality gap. However, this in turn brings in a large queue length at each node and hence causes large network delay. This phenomenon is known as the fundamental utility-delay tradeoff. The best known utility-delay tradeoff for general networks is $[O(1/V), O(V)]$ and is attained by a backpressure algorithm based on a drift-plus-penalty technique. This may suggest that to achieve an arbitrarily small utility optimality gap, the existing backpressure algorithms necessarily yield an arbitrarily large queue length. However, this paper proposes a new backpressure algorithm that has a vanishing utility optimality gap, so utility converges to exact optimality as the algorithm keeps running, while queue lengths are bounded throughout by a finite constant. The technique uses backpressure and drift concepts with a new method for convex programming.
- Jan 13 2017 cs.SY arXiv:1701.03343v1In this work, we investigate a state estimation problem for a full-car semi-active suspension system. To account for the complex calculation and optimization problems, a vehicle-to- cloud-to-vehicle (V2C2V) scheme is utilized. Moving horizon estimation is introduced for the state estimation system design. All the optimization problems are solved in a remotely-embedded agent with high computational ability. Measurements and state estimates are transmitted between the vehicle and the remote agent via networked communication channels. The effectiveness of the proposed method is illustrated via a set of simulations.
- Dec 13 2016 cs.CV arXiv:1612.03630v1In this paper, we develop a binary convolutional encoder-decoder network (B-CEDNet) for natural scene text processing (NSTP). It converts a text image to a class-distinguished salience map that reveals the categorical, spatial and morphological information of characters. The existing solutions are either memory consuming or run-time consuming that cannot be applied to real-time applications on resource-constrained devices such as advanced driver assistance systems. The developed network can process multiple regions containing characters by one-off forward operation, and is trained to have binary weights and binary feature maps, which lead to both remarkable inference run-time speedup and memory usage reduction. By training with over 200, 000 synthesis scene text images (size of $32\times128$), it can achieve $90\%$ and $91\%$ pixel-wise accuracy on ICDAR-03 and ICDAR-13 datasets. It only consumes $4.59\ ms$ inference run-time realized on GPU with a small network size of 2.14 MB, which is up to $8\times$ faster and $96\%$ smaller than it full-precision version.
- Nov 30 2016 cs.CV arXiv:1611.09587v1Surveillance video parsing, which segments the video frames into several labels, e.g., face, pants, left-leg, has wide applications. However,pixel-wisely annotating all frames is tedious and inefficient. In this paper, we develop a Single frame Video Parsing (SVP) method which requires only one labeled frame per video in training stage. To parse one particular frame, the video segment preceding the frame is jointly considered. SVP (1) roughly parses the frames within the video segment, (2) estimates the optical flow between frames and (3) fuses the rough parsing results warped by optical flow to produce the refined parsing result. The three components of SVP, namely frame parsing, optical flow estimation and temporal fusion are integrated in an end-to-end manner. Experimental results on two surveillance video datasets show the superiority of SVP over state-of-the-arts.
- Nov 29 2016 cs.NI arXiv:1611.09011v2The OpenFlow-based SDN is widely studied to better network performance through planning fine-grained paths. However, being designed to configure path hop-by-hop, it faces the scalability issue that both the flow table overhead and path setup delay are unacceptable for large-scale networks. In this paper, we propose PACO, a framework based on Source Routing to address that problem through quickly pushing paths into the packet header at network edges and pre-installing few rules at the network core. The straightforward implementation of SR is inefficient as it would incur too many path labels; other efficient approaches would sacrifice path flexibility (e.g., DEFO). To implement SR efficiently and flexibly, PACO presents each path as a concatenation of pathlets and introduces algorithms to compute pathlets and concatenate paths with minimum path labels. Our extensive simulations confirm the scalability of PACO as it saves the flow table overhead up to 94% compared with OpenFlow-SDN solutions and show that PACO outperforms SR-SDN solutions by supporting more than 40% paths with few label overhead.
- Deep learning, as a promising new area of machine learning, has attracted a rapidly increasing attention in the field of medical imaging. Compared to the conventional machine learning methods, deep learning requires no hand-tuned feature extractor, and has shown a superior performance in many visual object recognition applications. In this study, we develop a deep convolutional neural network (CNN) and apply it to thoracic CT images for the classification of lung nodules. We present the CNN architecture and classification accuracy for the original images of lung nodules. In order to understand the features of lung nodules, we further construct new datasets, based on the combination of artificial geometric nodules and some transformations of the original images, as well as a stochastic nodule shape model. It is found that simplistic geometric nodules cannot capture the important features of lung nodules.
- Nov 15 2016 cs.CL arXiv:1611.04491v1Objective: Allowing patients to access their own electronic health record (EHR) notes through online patient portals has the potential to improve patient-centered care. However, medical jargon, which abounds in EHR notes, has been shown to be a barrier for patient EHR comprehension. Existing knowledge bases that link medical jargon to lay terms or definitions play an important role in alleviating this problem but have low coverage of medical jargon in EHRs. We developed a data-driven approach that mines EHRs to identify and rank medical jargon based on its importance to patients, to support the building of EHR-centric lay language resources. Methods: We developed an innovative adapted distant supervision (ADS) model based on support vector machines to rank medical jargon from EHRs. For distant supervision, we utilized the open-access, collaborative consumer health vocabulary, a large, publicly available resource that links lay terms to medical jargon. We explored both knowledge-based features from the Unified Medical Language System and distributed word representations learned from unlabeled large corpora. We evaluated the ADS model using physician-identified important medical terms. Results: Our ADS model significantly surpassed two state-of-the-art automatic term recognition methods, TF*IDF and C-Value, yielding 0.810 ROC-AUC versus 0.710 and 0.667, respectively. Our model identified 10K important medical jargon terms after ranking over 100K candidate terms mined from over 7,500 EHR narratives. Conclusion: Our work is an important step towards enriching lexical resources that link medical jargon to lay terms/definitions to support patient EHR comprehension. The identified medical jargon terms and their rankings are available upon request.
- Finding related published articles is an important task in any science, but with the explosion of new work in the biomedical domain it has become especially challenging. Most existing methodologies use text similarity metrics to identify whether two articles are related or not. However biomedical knowledge discovery is hypothesis-driven. The most related articles may not be ones with the highest text similarities. In this study, we first develop an innovative crowd-sourcing approach to build an expert-annotated document-ranking corpus. Using this corpus as the gold standard, we then evaluate the approaches of using text similarity to rank the relatedness of articles. Finally, we develop and evaluate a new supervised model to automatically rank related scientific articles. Our results show that authors' ranking differ significantly from rankings by text-similarity-based models. By training a learning-to-rank model on a subset of the annotated corpus, we found the best supervised learning-to-rank model (SVM-Rank) significantly surpassed state-of-the-art baseline systems.
- One of the biggest open problems in external memory data structures is the priority queue problem with DecreaseKey operations. If only Insert and ExtractMin operations need to be supported, one can design a comparison-based priority queue performing $O((N/B)\lg_{M/B} N)$ I/Os over a sequence of $N$ operations, where $B$ is the disk block size in number of words and $M$ is the main memory size in number of words. This matches the lower bound for comparison-based sorting and is hence optimal for comparison-based priority queues. However, if we also need to support DecreaseKeys, the performance of the best known priority queue is only $O((N/B) \lg_2 N)$ I/Os. The big open question is whether a degradation in performance really is necessary. We answer this question affirmatively by proving a lower bound of $\Omega((N/B) \lg_{\lg N} B)$ I/Os for processing a sequence of $N$ intermixed Insert, ExtraxtMin and DecreaseKey operations. Our lower bound is proved in the cell probe model and thus holds also for non-comparison-based priority queues.
- Hypothesis testing is an important cognitive process that supports human reasoning. In this paper, we introduce a computational hypothesis testing approach based on memory augmented neural networks. Our approach involves a hypothesis testing loop that reconsiders and progressively refines a previously formed hypothesis in order to generate new hypotheses to test. We apply the proposed approach to language comprehension task by using Neural Semantic Encoders (NSE). Our NSE models achieve the state-of-the-art results showing an absolute improvement of 1.2% to 2.6% accuracy over previous results obtained by single and ensemble systems on standard machine comprehension benchmarks such as the Children's Book Test (CBT) and Who-Did-What (WDW) news article datasets.
- Maximum Inner Product Search (MIPS) is an important task in many machine learning applications such as the prediction phase of a low-rank matrix factorization model for a recommender system. There have been some works on how to perform MIPS in sub-linear time recently. However, most of them do not have the flexibility to control the trade-off between search efficient and search quality. In this paper, we study the MIPS problem with a computational budget. By carefully studying the problem structure of MIPS, we develop a novel Greedy-MIPS algorithm, which can handle budgeted MIPS by design. While simple and intuitive, Greedy-MIPS yields surprisingly superior performance compared to state-of-the-art approaches. As a specific example, on a candidate set containing half a million vectors of dimension 200, Greedy-MIPS runs 200x faster than the naive approach while yielding search results with the top-5 precision greater than 75\%.
- Sep 08 2016 cs.GT arXiv:1609.01961v2Motivated by the recent efforts in extending LTE to the unlicensed spectrum, we propose a novel spectrum sharing framework for the coopetition (i.e., cooperation and competition) between LTE and Wi-Fi in the unlicensed band. Basically, the LTE network can choose to work in one of the two modes: in the competition mode, it randomly accesses an unlicensed channel, and interferes with the Wi-Fi access point using the same channel; in the cooperation mode, it delivers traffic for the Wi-Fi users in exchange for the exclusive access of the corresponding channel. Because the LTE network works in an interference-free manner in the cooperation mode, it can achieve a much larger data rate than that in the competition mode, which allows it to effectively serve both its own users and the Wi-Fi users. We design a second-price reverse auction mechanism, which enables the LTE provider and the Wi-Fi access point owners (APOs) to effectively negotiate the operation mode. Specifically, the LTE provider is the auctioneer (buyer), and the APOs are the bidders (sellers) who compete to sell their channel access opportunities to the LTE provider. In Stage I of the auction, the LTE provider announces a reserve rate. In Stage II of the auction, the APOs submit their bids. We show that the auction involves allocative externalities, i.e., the cooperation between the LTE provider and one APO benefits other APOs who are not directly involved in this cooperation. As a result, a particular APO's willingness to cooperate is affected by its belief about other APOs' willingness to cooperate. This makes our analysis much more challenging than that of the conventional second-price auction, where bidding truthfully is a weakly dominant strategy. We show that the APOs have a unique form of the equilibrium bidding strategies in Stage II, based on which we analyze the LTE provider's optimal reserve rate in Stage I.
- Sep 08 2016 cs.GT arXiv:1609.01951v3The proliferation of public Wi-Fi hotspots has brought new business potentials for Wi-Fi networks, which carry a significant amount of global mobile data traffic today. In this paper, we propose a novel Wi-Fi monetization model for venue owners (VOs) deploying public Wi-Fi hotspots, where the VOs can generate revenue by providing two different Wi-Fi access schemes for mobile users (MUs): (i) the premium access, in which MUs directly pay VOs for their Wi-Fi usage, and (ii) the advertising sponsored access, in which MUs watch advertisements in exchange of the free usage of Wi-Fi. VOs sell their ad spaces to advertisers (ADs) via an ad platform, and share the ADs' payments with the ad platform. We formulate the economic interactions among the ad platform, VOs, MUs, and ADs as a three-stage Stackelberg game. In Stage I, the ad platform announces its advertising revenue sharing policy. In Stage II, VOs determine the Wi-Fi prices (for MUs) and advertising prices (for ADs). In Stage III, MUs make access choices and ADs purchase advertising spaces. We analyze the sub-game perfect equilibrium (SPE) of the proposed game systematically, and our analysis shows the following useful observations. First, the ad platform's advertising revenue sharing policy in Stage I will affect only the VOs' Wi-Fi prices but not the VOs' advertising prices in Stage II. Second, both the VOs' Wi-Fi prices and advertising prices are non-decreasing in the advertising concentration level and non-increasing in the MU visiting frequency. Numerical results further show that the VOs are capable of generating large revenues through mainly providing one type of Wi-Fi access (the premium access or advertising sponsored access), depending on their advertising concentration levels and MU visiting frequencies.
- In 1982, Drezner proposed the (1|1)-centroid problem on the plane, in which two players, called the leader and the follower, open facilities to provide service to customers in a competitive manner. The leader opens the first facility, and then the follower opens the second. Each customer will patronize the facility closest to him (ties broken in favor of the leader's one), thereby decides the market share of the two players. The goal is to find the best position for the leader's facility so that his market share is maximized. The best algorithm for this problem is an $O(n^2 \log n)$-time parametric search approach, which searches over the space of possible market share values. In the same paper, Drezner also proposed a general version of (1|1)-centroid problem by introducing a minimal distance constraint $R$, such that the follower's facility is not allowed to be located within a distance $R$ from the leader's. He proposed an $O(n^5 \log n)$-time algorithm for this general version by identifying $O(n^4)$ points as the candidates of the optimal solution and checking the market share for each of them. In this paper, we develop a new parametric search approach searching over the $O(n^4)$ candidate points, and present an $O(n^2 \log n)$-time algorithm for the general version, thereby close the $O(n^3)$ gap between the two bounds.
- Aug 09 2016 cs.CV arXiv:1608.02023v2Study of cultural-heritage objects with embellished realistic and abstract designs made up of connected and intertwined curves crosscuts a number of related disciplines, including archaeology, art history, and heritage management. However, many objects, such as pottery sherds found in the archaeological record, are fragmentary, making the underlying complete designs unknowable at the scale of the sherd fragment. The challenge to reconstruct and study complete designs is stymied because 1) most fragmentary cultural-heritage objects contain only a small portion of the underlying full design, 2) in the case of a stamping application, the same design may be applied multiple times with spatial overlap on one object, and 3) curve patterns detected on an object are usually incomplete and noisy. As a result, classical curve-pattern matching algorithms, such as Chamfer matching, may perform poorly in identifying the underlying design. In this paper, we develop a new partial-to-global curve matching algorithm to address these challenges and better identify the full design from a fragmented cultural heritage object. Specifically, we develop the algorithm to identify the designs of the carved wooden paddles of the Southeastern Woodlands from unearthed pottery sherds. A set of pottery sherds from the Snow Collection, curated at Georgia Southern University, are used to test the proposed algorithm, with promising results.
- We study the cooperation of the mobile network operator (MNO) and the venue owners (VOs) on the public Wi-Fi deployment. We consider a one-to-many bargaining framework, where the MNO bargains with VOs sequentially to determine where to deploy Wi-Fi and how much to pay. Taking into account the negative externalities among different steps of bargaining, we analyze the following two cases: for the exogenous bargaining sequence case, we compute the optimal bargaining solution on the cooperation decisions and payments under a predetermined bargaining sequence; for the endogenous bargaining sequence case, the MNO decides the bargaining sequence to maximize its payoff. Through exploring the structural property of the optimal bargaining sequence, we design a low-complexity Optimal VO Bargaining Sequencing (OVBS) algorithm to search the optimal sequence. More specifically, we categorize the VOs into three types based on the impact of the Wi-Fi deployment at their venues, and show that it is optimal for the MNO to bargain with these three types of VOs sequentially. Numerical results show that compared with the random and worst bargaining sequences, the optimal bargaining sequence improves the MNO's payoff by up to 14.8% and 45.3%, respectively.
- Aug 03 2016 cs.CL arXiv:1608.00612v1Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In clinical domain one major application of sequence labeling involves extraction of medical entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain, presents its own set of challenges and objectives. In this work we experimented with various CRF based structured learning models with Recurrent Neural Networks. We extend the previously studied LSTM-CRF models with explicit modeling of pairwise potentials. We also propose an approximate version of skip-chain CRF inference with RNN potentials. We use these methodologies for structured prediction in order to improve the exact phrase detection of various medical entities.
- We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders. NSE is equipped with a novel memory update rule and has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read, compose and write operations. NSE can also access multiple and shared memories. In this paper, we demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks: natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publically available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.
- Recurrent neural networks (RNNs) process input text sequentially and model the conditional transition between word tokens. In contrast, the advantages of recursive networks include that they explicitly model the compositionality and the recursive structure of natural language. However, the current recursive architecture is limited by its dependence on syntactic tree. In this paper, we introduce a robust syntactic parsing-independent tree structured model, Neural Tree Indexers (NTI) that provides a middle ground between the sequential RNNs and the syntactic treebased recursive models. NTI constructs a full n-ary tree by processing the input text with its node function in a bottom-up fashion. Attention mechanism can then be applied to both structure and node function. We implemented and evaluated a binarytree model of NTI, showing the model achieved the state-of-the-art performance on three different NLP tasks: natural language inference, answer sentence selection, and sentence classification, outperforming state-of-the-art recurrent and recursive neural networks.