In this paper, we propose a novel technique for direct recognition of multiple speech streams given the single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the one with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the same speaker to be aligned with the same output layer. This strategy elegantly solves the label permutation problem and speaker tracing problem in one shot. Our experiments on artificially mixed AMI data showed that the proposed approach is very promising.
Despite the significant progress made in the recent years in dictating single-talker speech, the progress made in speaker independent multi-talker mixed speech separation and tracing, often referred to as the cocktail-party problem, has been less impressive. In this paper we propose a novel technique for attacking this problem. The core of our technique is permutation invariant training (PIT), which aims at minimizing the source stream reconstruction error no matter how labels are ordered. This is achieved by aligning labels to the output streams automatically during the training time. This strategy effectively solves the label permutation problem observed in deep learning based techniques for speech separation. More interestingly, our approach can integrate speaker tracing in the PIT framework so that separation and tracing can be carried out in one step and trained end-to-end. This is achieved using recurrent neural networks (RNNs) by forcing separated frames belonging to the same speaker to be aligned to the same output layer during training. Furthermore, the computational cost introduced by PIT is very small compared to the RNN computation during training and is zero during separation. We evaluated PIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), deep clustering (DPCL) and deep attractor network (DANet), and generalizes well over unseen speakers and languages.
Deep learning models (DLMs) are state-of-the-art techniques in speech recognition. However, training good DLMs can be time consuming especially for production-size models and corpora. Although several parallel training algorithms have been proposed to improve training efficiency, there is no clear guidance on which one to choose for the task in hand due to lack of systematic and fair comparison among them. In this paper we aim at filling this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering (BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on 1000-hour LibriSpeech corpora using feed-forward deep neural networks (DNNs) and convolutional, long short-term memory, DNNs (CLDNNs). Based on our experiments, we recommend using BMUF as the top choice to train acoustic models since it is most stable, scales well with number of GPUs, can achieve reproducible results, and in many cases even outperforms single-GPU SGD. ASGD can be used as a substitute in some cases.
Mar 16 2017 cs.LG
Deep Neural Networks (DNN) have demonstrated superior ability to extract high level embedding vectors from low level features. Despite the success, the serving time is still the bottleneck due to expensive run-time computation of multiple layers of dense matrices. GPGPU, FPGA, or ASIC-based serving systems require additional hardware that are not in the mainstream design of most commercial applications. In contrast, tree or forest-based models are widely adopted because of low serving cost, but heavily depend on carefully engineered features. This work proposes a Deep Embedding Forest model that benefits from the best of both worlds. The model consists of a number of embedding layers and a forest/tree layer. The former maps high dimensional (hundreds of thousands to millions) and heterogeneous low-level features to the lower dimensional (thousands) vectors, and the latter ensures fast serving. Built on top of a representative DNN model called Deep Crossing, and two forest/tree-based models including XGBoost and LightGBM, a two-step Deep Embedding Forest algorithm is demonstrated to achieve on-par or slightly better performance as compared with the DNN counterpart, with only a fraction of serving time on conventional hardware. After comparing with a joint optimization algorithm called partial fuzzification, also proposed in this paper, it is concluded that the two-step Deep Embedding Forest has achieved near optimal performance. Experiments based on large scale data sets (up to 1 billion samples) from a major sponsored search engine proves the efficacy of the proposed model.
Mar 14 2017 cs.DS
The core number of a vertex is a basic index depicting cohesiveness of a graph, and has been widely used in large-scale graph analytics. In this paper, we study the update of core numbers of vertices in dynamic graphs with edge insertions/deletions, which is known as the core maintenance problem. Different from previous approaches that just focus on the case of single-edge insertion/deletion and sequentially handle the edges when multiple edges are inserted/deleted, we investigate the parallelism in the core maintenance procedure. Specifically, we show that if the inserted/deleted edges constitute a matching, the core number update with respect to each inserted/deleted edge can be handled in parallel. Based on this key observation, we propose parallel algorithms for core maintenance in both cases of edge insertions and deletions. We conduct extensive experiments to evaluate the efficiency, stability, parallelism and scalability of our algorithms on different types of real-world and synthetic graphs. Comparing with sequential approaches, our algorithms can improve the core maintenance efficiency significantly.
Jan 02 2017 cs.DS
This paper initiates the studies of parallel algorithms for core maintenance in dynamic graphs. The core number is a fundamental index reflecting the cohesiveness of a graph, which are widely used in large-scale graph analytics. The core maintenance problem requires to update the core numbers of vertices after a set of edges and vertices are inserted into or deleted from the graph. We investigate the parallelism in the core update process when multiple edges and vertices are inserted or deleted. Specifically, we discover a structure called superior edge set, the insertion or deletion of edges in which can be processed in parallel. Based on the structure of superior edge set, efficient parallel algorithms are then devised for incremental and decremental core maintenance respectively. To the best of our knowledge, the proposed algorithms are the first parallel ones for the fundamental core maintenance problem. The algorithms show a significant speedup in the processing time compared with previous results that sequentially handle edge and vertex insertions/deletions. Finally, extensive experiments are conducted on different types of real-world and synthetic datasets, and the results illustrate the efficiency, stability and scalability of the proposed algorithms.
Nov 11 2016 cs.CV
Visual inspection of x-ray scattering images is a powerful technique for probing the physical structure of materials at the molecular scale. In this paper, we explore the use of deep learning to develop methods for automatically analyzing x-ray scattering images. In particular, we apply Convolutional Neural Networks and Convolutional Autoencoders for x-ray scattering image classification. To acquire enough training data for deep learning, we use simulation software to generate synthetic x-ray scattering images. Experiments show that deep learning methods outperform previously published methods by 10\% on synthetic and real datasets.
Oct 18 2016 cs.CL
Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination.
Authenticating websites is an ongoing problem for users. Recent proposals have suggested strengthening current server authentication methods by incorporating website location as an additional authentication factor. In this work, we explore how location information affects users' decision-making for security and privacy. We conducted a series of qualitative interviews to learn how location can be integrated into users' decision-making for security, and we designed a security indicator to alert the user to changes in website locations. We evaluated our tool in a 44-participant user study and found that users were less likely to perform security-sensitive tasks when alerted to location changes. Our results suggest that website location can be used as an effective indicator for users' security assessment.
Sep 13 2016 cs.CL
We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.
This paper shows the remarkable distances that can be achieved using millimeter wave communications, and presents a new rural macrocell (RMa) path loss model for millimeter wave frequencies, based on measurements at 73 GHz in rural Virginia. Path loss models are needed to estimate signal coverage and interference for wireless network design, yet little is known about rural propagation at millimeter waves. This work identifies problems with the RMa model used by the 3rd Generation Partnership Project (3GPP) TR 38.900 Release 14, and offers a close-in (CI) reference distance model that has improved accuracy, fewer parameters, and better stability as compared with the existing 3GPP RMa path loss model. The measurements and models presented here are the first to validate rural millimeter wave path loss models.
Aug 17 2016 cs.CR
The Location Service (LCS) proposed by the telecommunication industry is an architecture that allows the location of mobile devices to be accessed in various applications. We explore the use of LCS in location-enhanced server authentication, which traditionally relies on certificates. Given recent incidents involving certificate authorities, various techniques to strengthen server authentication were proposed. They focus on improving the certificate validation process, such as pinning, revocation, or multi-path probing. In this paper, we propose using the server's geographic location as a second factor of its authenticity. Our solution, SALVE, achieves location-based server authentication by using secure DNS resolution and by leveraging LCS for location measurements. We develop a TLS extension that enables the client to verify the server's location in addition to its certificate. Successful server authentication therefore requires a valid certificate and the server's presence at a legitimate geographic location, e.g., on the premises of a data center. SALVE prevents server impersonation by remote adversaries with mis-issued certificates or stolen private keys of the legitimate server. We develop a prototype implementation and our evaluation in real-world settings shows that it incurs minimal impact to the average server throughput. Our solution is backward compatible and can be integrated with existing approaches for improving server authentication in TLS.
We propose a novel deep learning model, which supports permutation invariant training (PIT), for speaker independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from most of the prior arts that treat speech separation as a multi-class regression problem and the deep clustering technique that considers it a segmentation (or clustering) problem, our model optimizes for the separation regression error, ignoring the order of mixing sources. This strategy cleverly solves the long-lasting label permutation problem that has prevented progress on deep learning based techniques for speech separation. Experiments on the equal-energy mixing setup of a Danish corpus confirms the effectiveness of PIT. We believe improvements built upon PIT can eventually solve the cocktail-party problem and enable real-world adoption of, e.g., automatic meeting transcription and multi-party human-computer interaction, where overlapping speech is common.
May 10 2016 cs.DC
We give efficient algorithms for the fundamental problems of Broadcast and Local Broadcast in dynamic wireless networks. We propose a general model of communication which captures and includes both fading models (like SINR) and graph-based models (such as quasi unit disc graphs, bounded-independence graphs, and protocol model). The only requirement is that the nodes can be embedded in a bounded growth quasi-metric, which is the weakest condition known to ensure distributed operability. Both the nodes and the links of the network are dynamic: nodes can come and go, while the signal strength on links can go up or down. The results improve some of the known bounds even in the static setting, including an optimal algorithm for local broadcasting in the SINR model, which is additionally uniform (independent of network size). An essential component is a procedure for balancing contention, which has potentially wide applicability. The results illustrate the importance of carrier sensing, a stock feature of wireless nodes today, which we encapsulate in primitives to better explore its uses and usefulness.
Apr 26 2016 cs.DC
We examine the utility of multiple channels of communication in wireless networks under the SINR model of interference. The central question is whether the use of multiple channels can result in linear speedup, up to some fundamental limit. We answer this question affirmatively for the data aggregation problem, perhaps the most fundamental problem in sensor networks. To achieve this, we form a hierarchical structure of independent interest, and illustrate its versatility by obtaining a new algorithm with linear speedup for the node coloring problem.
In this paper, we extend the deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers. These direct links, called highway connections, enable unimpeded information flow across different layers and thus alleviate the gradient vanishing problem when building deeper LSTMs. We further introduce the latency-controlled bidirectional LSTMs (BLSTMs) which can exploit the whole history while keeping the latency under control. Efficient algorithms are proposed to train these novel networks using both frame and sequence discriminative criteria. Experiments on the AMI distant speech recognition (DSR) task indicate that we can train deeper LSTMs and achieve better improvement from sequence training with highway LSTMs (HLSTMs). Our novel model obtains $43.9/47.7\%$ WER on AMI (SDM) dev and eval sets, outperforming all previous works. It beats the strong DNN and DLSTM baselines with $15.7\%$ and $5.3\%$ relative improvement respectively.
In this paper, we investigate the use of prediction-adaptation-correction recurrent neural networks (PAC-RNNs) for low-resource speech recognition. A PAC-RNN is comprised of a pair of neural networks in which a \it correction network uses auxiliary information given by a \it prediction network to help estimate the state probability. The information from the correction network is also used by the prediction network in a recurrent loop. Our model outperforms other state-of-the-art neural networks (DNNs, LSTMs) on IARPA-Babel tasks. Moreover, transfer learning from a language that is similar to the target language can help improve performance further.
Mar 31 2015 cs.DC
In the information exchange problem, k packets that are initially maintained by k nodes need to be disseminated to the whole network as quickly as possible. We consider this problem in single-hop multi- channel networks of n nodes, and propose a uniform protocol that with high probability accomplishes the dissemination in O(k/F + F ⋅log n) rounds, assuming F available channels and collision detection. This result is asymptotically optimal when k is large (k ≥F^2 ⋅log n). To our knowledge, this is the ?1st uniform protocol for information exchange in multi-channel networks.
This paper studies the state estimation problem of linear discrete-time systems with stochastic unknown inputs. The unknown input is a wide-sense stationary process while no other prior informaton needs to be known. We propose an autoregressive (AR) model based unknown input realization technique which allows us to recover the input statistics from the output data by solving an appropriate least squares problem, then fit an AR model to the recovered input statistics and construct an innovations model of the unknown inputs using the eigensystem realization algorithm (ERA). An augmented state system is constructed and the standard Kalman filter is applied for state estimation. A reduced order model (ROM) filter is also introduced to reduce the computational cost of the Kalman filter. Two numerical examples are given to illustrate the procedure.
In this paper, we study incentive mechanisms for retrieving information from networked agents. Following the model in [Kleinberg and Raghavan 2005], the agents are represented as nodes in an infinite tree, which is generated by a random branching process. A query is issued by the root, and each node possesses an answer with an independent probability $p=1/n$. Further, each node in the tree acts strategically to maximize its own payoff. In order to encourage the agents to participate in the information acquisition process, an incentive mechanism is needed to reward agents who provide the information as well as agents who help to facilitate such acquisition. We focus on designing efficient sybil-proof incentive mechanisms, i.e., which are robust to fake identity attacks. %We consider incentive mechanisms which are sybil-proof, i.e., robust to fake identity attacks. We propose a family of mechanisms, called the direct referral (DR) mechanisms, which allocate most reward to the information holder as well as its direct parent (or direct referral). We show that, when designed properly, the direct referral mechanism is sybil-proof and efficient. In particular, we show that we may achieve an expected cost of $O(h^2)$ for propagating the query down $h$ levels for any branching factor $b>1$. This result exponentially improves on previous work when requiring to find an answer with high probability. When the underlying network is a deterministic chain, our mechanism is optimal under some mild assumptions. In addition, due to its simple reward structure, the DR mechanism might have good chance to be adopted in practice.
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well or better than state-of-the-art systems based on GMMs or shallow networks without the need for explicit model adaptation or feature normalization.
Jan 08 2013 cs.DS
Cognitive Radio Networks (CRNs) are considered as a promising solution to the spectrum shortage problem in wireless communication. In this paper, we initiate the first systematic study on the algorithmic complexity of the connectivity problem in CRNs through spectrum assignments. We model the network of secondary users (SUs) as a potential graph, where two nodes having an edge between them are connected as long as they choose a common available channel. In the general case, where the potential graph is arbitrary and the SUs may have different number of antennae, we prove that it is NP-complete to determine whether the network is connectable even if there are only two channels. For the special case where the number of channels is constant and all the SUs have the same number of antennae, which is more than one but less than the number of channels, the problem is also NP-complete. For the special cases in which the potential graph is complete, a tree, or a graph with bounded treewidth, we prove the problem is NP-complete and fixed-parameter tractable (FPT) when parameterized by the number of channels. Exact algorithms are also derived to determine the connectability of a given cognitive radio network.
The formal version of our work has been published in BMC Bioinformatics and can be found here: http://www.biomedcentral.com/1471-2105/13/S6/S1 Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can save as much as 90% memory consumption in comparison with the state-of-art assemblers, but it requires rounds of denoising to accurately assemble genomes. In this paper, we introduce a new general model for genome assembly that uses only sparse k-mers. The new model replaces the idea of the de Bruijn graph from the beginning, and achieves similar memory efficiency and much better robustness compared with our previous SparseAssembler1. Results: We demonstrate that the decomposition of reads of all overlapping k-mers, which is used in existing de Bruijn graph genome assemblers, is overly cautious. We introduce a sparse k-mer graph structure for saving sparse k-mers, which greatly reduces memory space requirements necessary for de novo genome assembly. In contrast with the de Bruijn graph approach, we devise a simple but powerful strategy, i.e., finding links between the k-mers in the genome and traversing following the links, which can be done by saving only a few k-mers. To implement the strategy, we need to only select some k-mers that may not even be overlapping ones, and build the links between these k-mers indicated by the reads. We can traverse through this sparse k-mer graph to build the contigs, and ultimately complete the genome assembly. Since the new sparse k-mers graph shares almost all advantages of de Bruijn graph, we are able to adapt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms.
de Bruijn graph-based algorithms are one of the two most widely used approaches for de novo genome assembly. A major limitation of this approach is the large computational memory space requirement to construct the de Bruijn graph, which scales with k-mer length and total diversity (N) of unique k-mers in the genome expressed in base pairs or roughly (2k+8)N bits. This limitation is particularly important with large-scale genome analysis and for sequencing centers that simultaneously process multiple genomes. We present a sparse de Bruijn graph structure, based on which we developed SparseAssembler that greatly reduces memory space requirements. The structure also allows us to introduce a novel method for the removal of substitution errors introduced during sequencing. The sparse de Bruijn graph structure skips g intermediate k-mers, therefore reducing the theoretical memory space requirement to ~(2k/g+8)N. We have found that a practical value of g=16 consumes approximately 10% of the memory required by standard de Bruijn graph-based algorithms but yields comparable results. A high error rate could potentially derail the SparseAssembler. Therefore, we developed a sparse de Bruijn graph-based denoising algorithm that can remove more than 99% of substitution errors from datasets with a ≤2% error rate. Given that substitution error rates for the current generation of sequencers is lower than 1%, our denoising procedure is sufficiently effective to safeguard the performance of our algorithm. Finally, we also introduce a novel Dijkstra-like breadth-first search algorithm for the sparse de Bruijn graph structure to circumvent residual errors and resolve polymorphisms.
An optical flow gradient algorithm was applied to spontaneously forming net- works of neurons and glia in culture imaged by fluorescence optical microscopy in order to map functional calcium signaling with single pixel resolution. Optical flow estimates the direction and speed of motion of objects in an image between subsequent frames in a recorded digital sequence of images (i.e. a movie). Computed vector field outputs by the algorithm were able to track the spatiotemporal dynamics of calcium signaling pat- terns. We begin by briefly reviewing the mathematics of the optical flow algorithm, and then describe how to solve for the displacement vectors and how to measure their reliability. We then compare computed flow vectors with manually estimated vectors for the progression of a calcium signal recorded from representative astrocyte cultures. Finally, we applied the algorithm to preparations of primary astrocytes and hippocampal neurons and to the rMC-1 Muller glial cell line in order to illustrate the capability of the algorithm for capturing different types of spatiotemporal calcium activity. We discuss the imaging requirements, parameter selection and threshold selection for reliable measurements, and offer perspectives on uses of the vector data.
Mining frequent sequential patterns from sequence databases has been a central research topic in data mining and various efficient mining sequential patterns algorithms have been proposed and studied. Recently, in many problem domains (e.g, program execution traces), a novel sequential pattern mining research, called mining repetitive gapped sequential patterns, has attracted the attention of many researchers, considering not only the repetition of sequential pattern in different sequences but also the repetition within a sequence is more meaningful than the general sequential pattern mining which only captures occurrences in different sequences. However, the number of repetitive gapped sequential patterns generated by even these closed mining algorithms may be too large to understand for users, especially when support threshold is low. In this paper, we propose and study the problem of compressing repetitive gapped sequential patterns. Inspired by the ideas of summarizing frequent itemsets, RPglobal, we develop an algorithm, CRGSgrow (Compressing Repetitive Gapped Sequential pattern grow), including an efficient pruning strategy, SyncScan, and an efficient representative pattern checking scheme, -dominate sequential pattern checking. The CRGSgrow is a two-step approach: in the first step, we obtain all closed repetitive sequential patterns as the candidate set of representative repetitive sequential patterns, and at the same time get the most of representative repetitive sequential patterns; in the second step, we only spend a little time in finding the remaining the representative patterns from the candidate set. An empirical study with both real and synthetic data sets clearly shows that the CRGSgrow has good performance.
Jun 16 2003 cs.DC
Registration and management of users in a large scale Grid computing environment presents new challenges that are not well addressed by existing protocols. Within a single Virtual Organization (VO), thousands of users will potentially need access to hundreds of computing sites, and the traditional model where users register for local accounts at each site will present significant scaling problems. However, computing sites must maintain control over access to the site and site policies generally require individual local accounts for every user. We present here a model that allows users to register once with a VO and yet still provides all of the computing sites the information they require with the required level of trust. We have developed tools to allow sites to automate the management of local accounts and the mappings between Grid identities and local accounts.
Grid computing consists of the coordinated use of large sets of diverse, geographically distributed resources for high performance computation. Effective monitoring of these computing resources is extremely important to allow efficient use on the Grid. The large number of heterogeneous computing entities available in Grids makes the task challenging. In this work, we describe a Grid monitoring system, called GridMonitor, that captures and makes available the most important information from a large computing facility. The Grid monitoring system consists of four tiers: local monitoring, archiving, publishing and harnessing. This architecture was applied on a large scale linux farm and network infrastructure. It can be used by many higher-level Grid services including scheduling services and resource brokering.