Nov 21 2017 cs.CL
One of the most crucial components of natural human-robot interaction is artificial intuition and its influence on dialog systems. The intuitive capability that humans have is undeniably extraordinary, and so remains one of the greatest challenges for natural communicative dialogue between humans and robots. In this paper, we introduce a novel probabilistic modeling framework of identifying, classifying and learning features of sarcastic text via training a neural network with human-informed sarcastic benchmarks. This is necessary for establishing a comprehensive sentiment analysis schema that is sensitive to the nuances of sarcasm-ridden text by being trained on linguistic cues. We show that our model provides a good fit for this type of real-world informed data, with potential to achieve as accurate, if not more, than alternatives. Though the implementation and benchmarking is an extensive task, it can be extended via the same method that we present to capture different forms of nuances in communication and making for much more natural and engaging dialogue systems.
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator -- a large-scale reinforcement learning environment based on real imagery. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -- the Room-to-Room (R2R) dataset.
In this paper we document our experiences with developing speech recognition for medical transcription - a system that automatically transcribes doctor-patient conversations. Towards this goal, we built a system along two different methodological lines - a Connectionist Temporal Classification (CTC) phoneme based model and a Listen Attend and Spell (LAS) grapheme based model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning issues. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and development of a matched language model was essential to the success of the CTC based models. The LAS based models, however were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models were able to achieve a word error rate of 20.1%, and the LAS models were able to achieve 18.3%. Our analysis shows that both models perform well on important medical utterances and therefore can be practical for transcribing medical conversations.
Nov 21 2017 cs.CL
In this paper, we propose a novel BTG-forest-based alignment method. Based on a fast unsupervised initialization of parameters using variational IBM models, we synchronously parse parallel sentences top-down and align hierarchically under the constraint of BTG. Our two-step method can achieve the same run-time and comparable translation performance as fast_align while it yields smaller phrase tables. Final SMT results show that our method even outperforms in the experiment of distantly related languages, e.g., English-Japanese.
Keyword spotting (KWS) is a critical component for enabling speech based user interactions on smart devices. It requires real-time response and high accuracy for good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms. Due to its always-on nature, KWS application has highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability. The design of neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers. We train various neural network architectures for keyword spotting published in literature to compare their accuracy and memory/compute requirements. We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with similar number of parameters.
Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and produces substantial gains in efficiency and provable topic inference, but at a cost: the model can no longer provide information about the topic composition of individual documents. Recently Thresholded Linear Inverse (TLI) is proposed to map the observed words of each document back to its topic composition. However, its linear characteristics limit the inference quality without considering the important prior information over topics. In this paper, we evaluate Simple Probabilistic Inverse (SPI) method and novel Prior-aware Dual Decomposition (PADD) that is capable of learning document-specific topic compositions in parallel. Experiments show that PADD successfully leverages topic correlations as a prior, notably outperforming TLI and learning quality topic compositions comparable to Gibbs sampling on various data.
Nov 21 2017 cs.CL
Incorporating syntactic information in Neural Machine Translation models is a practical way to compensate their requirement for a large amount of parallel training text, specially for low-resource language pairs. Previous works on using syntactic information provided by (inevitably error-prone) parsers has been promising. In this paper, we propose a forest-to-sequence Attentional neural machine translation model to make use of exponentially many parse trees of the source sentence to compensate for the parser errors. Our method represents the collection of parse trees as a packed forest, and learns a neural attentional transduction model from the forest to the target sentence. Experiments on English to German, Chinese and Persian datasets show the superiority of our method over the tree-to-sequence and vanilla sequence-to-sequence attentional neural machine translation models.
Nov 21 2017 cs.CL
Named Entity Recognition and Relation Extraction for Chinese literature text is regarded as the highly difficult problem, partially because of the lack of tagging sets. In this paper, we build a discourse-level dataset from hundreds of Chinese literature articles for improving this task. To build a high quality dataset, we propose two tagging methods to solve the problem of data inconsistency, including a heuristic tagging method and a machine auxiliary tagging method. Based on this corpus, we also introduce several widely used models to conduct experiments. Experimental results not only show the usefulness of the proposed dataset, but also provide baselines for further research. The dataset is available at https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset.
Radiology reports are a rich resource for advancing deep learning applications in medicine by leveraging the large volume of data continuously being updated, integrated, and shared. However, there are significant challenges as well, largely due to the ambiguity and subtlety of natural language. We propose a hybrid strategy that combines semantic-dictionary mapping and word2vec modeling for creating dense vector embeddings of free-text radiology reports. Our method leverages the benefits of both semantic-dictionary mapping as well as unsupervised learning. Using the vector representation, we automatically classify the radiology reports into three classes denoting confidence in the diagnosis of intracranial hemorrhage by the interpreting radiologist. We performed experiments with varying hyperparameter settings of the word embeddings and a range of different classifiers. Best performance achieved was a weighted precision of 88% and weighted recall of 90%. Our work offers the potential to leverage unstructured electronic health record data by allowing direct analysis of narrative clinical notes.
We explore how ideas from infectious disease and genetics can be used to uncover patterns of cultural inheritance and innovation in a corpus of 591 national constitutions spanning 1789 - 2008. Legal "Ideas" are encoded as "topics" - words statistically linked in documents - derived from topic modeling the corpus of constitutions. Using these topics we derive a diffusion network for borrowing from ancestral constitutions back to the US Constitution of 1789 and reveal that constitutions are complex cultural recombinants. We find systematic variation in patterns of borrowing from ancestral texts and "biological"-like behavior in patterns of inheritance with the distribution of "offspring" arising through a bounded preferential-attachment process. This process leads to a small number of highly innovative (influential) constitutions some of which have yet to have been identified as so in the current literature. Our findings thus shed new light on the critical nodes of the constitution-making network. The constitutional network structure reflects periods of intense constitution creation, and systematic patterns of variation in constitutional life-span and temporal influence.
Nov 21 2017 cs.CL
This is one of the first studies that quantitatively examine the usage of English acronyms (e.g. WTO) in Chinese texts. Using newspaper corpora, I try to answer 1) for all instances of a concept that has an English acronym (e.g. World Trade Organization), what percentage is expressed in the English acronym (WTO), and what percentage in its Chinese translation (shijie maoyi zuzhi), and 2) what factors are at play in language users' choice between the English and Chinese forms? Results show that different concepts have different percentage for English acronyms (PercentOfEn), ranging from 2% to 98%. Linear models show that PercentOfEn for individual concepts can be predicted by language economy (how long the Chinese translation is), concept frequency, and whether the first appearance of the concept in Chinese newspapers is the English acronym or its Chinese translation (all p < .05).
Nov 21 2017 cs.CL
Computational synthesis planning approaches have achieved recent success in organic chemistry, where tabulated synthesis procedures are readily available for supervised learning. The syntheses of inorganic materials, however, exist primarily as natural language narratives contained within scientific journal articles. This synthesis information must first be extracted from the text in order to enable analogous synthesis planning methods for inorganic materials. In this work, we present a system for automatically extracting structured representations of synthesis procedures from the texts of materials science journal articles that describe explicit, experimental syntheses of inorganic compounds. We define the structured representation as a set of linked events made up of extracted scientific entities and evaluate two unsupervised approaches for extracting these structures on expert-annotated articles: a strong heuristic baseline and a generative model of procedural text. We also evaluate a variety of supervised models for extracting scientific entities. Our results provide insight into the nature of the data and directions for further work in this exciting new area of research.
Nov 21 2017 cs.CL
Style transfer is an important problem in natural language processing (NLP). However, the progress in language style transfer is lagged behind other domains, such as computer vision, mainly because of the lack of parallel data and principle evaluation metrics. In this paper, we propose to learn style transfer with non-parallel data. We explore two models to achieve this goal, and the key idea behind the proposed models is to learn separate content representations and style representations using adversarial networks. We also propose novel evaluation metrics which measure two aspects of style transfer: transfer strength and content preservation. We access our models and the evaluation metrics on two tasks: paper-news title transfer, and positive-negative review transfer. Results show that the proposed content preservation metric is highly correlate to human judgments, and the proposed models are able to generate sentences with higher style transfer strength and similar content preservation score comparing to auto-encoder.
Nov 21 2017 cs.CL
The anchor words algorithm performs provably efficient topic model inference by finding an approximate convex hull in a high-dimensional word co-occurrence space. However, the existing greedy algorithm often selects poor anchor words, reducing topic quality and interpretability. Rather than finding an approximate convex hull in a high-dimensional space, we propose to find an exact convex hull in a visualizable 2- or 3-dimensional space. Such low-dimensional embeddings both improve topics and clearly show users why the algorithm selects certain words.
Spatial understanding is a fundamental problem with wide-reaching real-world applications. The representation of spatial knowledge is often modeled with spatial templates, i.e., regions of acceptability of two objects under an explicit spatial relationship (e.g., "on", "below", etc.). In contrast with prior work that restricts spatial templates to explicit spatial prepositions (e.g., "glass on table"), here we extend this concept to implicit spatial language, i.e., those relationships (generally actions) for which the spatial arrangement of the objects is only implicitly implied (e.g., "man riding horse"). In contrast with explicit relationships, predicting spatial arrangements from implicit spatial language requires significant common sense spatial understanding. Here, we introduce the task of predicting spatial templates for two objects under a relationship, which can be seen as a spatial question-answering task with a (2D) continuous output ("where is the man w.r.t. a horse when the man is walking the horse?"). We present two simple neural-based models that leverage annotated images and structured text to learn this task. The good performance of these models reveals that spatial locations are to a large extent predictable from implicit spatial language. Crucially, the models attain similar performance in a challenging generalized setting, where the object-relation-object combinations (e.g.,"man walking dog") have never been seen before. Next, we go one step further by presenting the models with unseen objects (e.g., "dog"). In this scenario, we show that leveraging word embeddings enables the models to output accurate spatial predictions, proving that the models acquire solid common sense spatial knowledge allowing for such generalization.
Deep neural networks (DNNs) had great success on NLP tasks such as language modeling, machine translation and certain question answering (QA) tasks. However, the success is limited at more knowledge intensive tasks such as QA from a big corpus. Existing end-to-end deep QA models (Miller et al., 2016; Weston et al., 2014) need to read the entire text after observing the question, and therefore their complexity in responding a question is linear in the text size. This is prohibitive for practical tasks such as QA from Wikipedia, a novel, or the Web. We propose to solve this scalability issue by using symbolic meaning representations, which can be indexed and retrieved efficiently with complexity that is independent of the text size. More specifically, we use sequence-to-sequence models to encode knowledge symbolically and generate programs to answer questions from the encoded knowledge. We apply our approach, called the N-Gram Machine (NGM), to the bAbI tasks (Weston et al., 2015) and a special version of them ("life-long bAbI") which has stories of up to 10 million sentences. Our experiments show that NGM can successfully solve both of these tasks accurately and efficiently. Unlike fully differentiable memory models, NGM's time complexity and answering quality are not affected by the story length. The whole system of NGM is trained end-to-end with REINFORCE (Williams, 1992). To avoid high variance in gradient estimation, which is typical in discrete latent variable models, we use beam search instead of sampling. To tackle the exponentially large search space, we use a stabilized auto-encoding objective and a structure tweak procedure to iteratively reduce and refine the search space.
Nov 21 2017 cs.CL
Spoken word recognition involves at least two basic computations. First is matching acoustic input to phonological categories (e.g. /b/, /p/, /d/). Second is activating words consistent with those phonological categories. Here we test the hypothesis that the listener's probability distribution over lexical items is weighted by the outcome of both computations: uncertainty about phonological discretisation and the frequency of the selected word(s). To test this, we record neural responses in auditory cortex using magnetoencephalography, and model this activity as a function of the size and relative activation of lexical candidates. Our findings indicate that towards the beginning of a word, the processing system indeed weights lexical candidates by both phonological certainty and lexical frequency; however, later into the word, activation is weighted by frequency alone.
This paper introduces a new neural structure called FusionNet, which extends existing attention approaches from three perspectives. First, it puts forward a novel concept of "history of word" to characterize attention information from the lowest word-level embedding up to the highest semantic-level representation. Second, it introduces an improved attention scoring function that better utilizes the "history of word" concept. Third, it proposes a fully-aware multi-level attention mechanism to capture the complete information in one text (such as a question) and exploit it in its counterpart (such as context or passage) layer by layer. We apply FusionNet to the Stanford Question Answering Dataset (SQuAD) and it achieves the first position for both single and ensemble model on the official SQuAD leaderboard at the time of writing (Oct. 4th, 2017). Meanwhile, we verify the generalization of FusionNet with two adversarial SQuAD datasets and it sets up the new state-of-the-art on both datasets: on AddSent, FusionNet increases the best F1 metric from 46.6% to 51.4%; on AddOneSent, FusionNet boosts the best F1 metric from 56.0% to 60.7%.