Machine Comprehension (MC) is a challenging task in Natural Language Processing that aims to have a machine comprehend a passage and answer a given question. Many existing approaches to MC suffer from bottlenecks such as insufficient lexical understanding, complex question-passage interaction, and incorrect answer extraction. In this paper, we address these problems from the viewpoint of how humans deal with reading tests in a scientific way. Specifically, we first propose a novel lexical gating mechanism to dynamically combine word and character representations. We then guide the machine to read interactively using an attention mechanism and a memory network. Finally, we add a checking layer to refine the answer as a safeguard. Extensive experiments on two popular datasets, SQuAD and TriviaQA, show that our method outperforms most state-of-the-art solutions at the time of submission by a considerable margin.
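To make the lexical gating idea concrete, here is a minimal sketch of one common way to gate between a word-level and a character-level vector. The abstract does not give the exact formulation, so the per-dimension sigmoid gate, the parameter names `weights` and `bias`, and the function `lexical_gate` are all assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lexical_gate(word_vec, char_vec, weights, bias):
    """Blend a word vector and a character vector of equal length d.

    Assumed form: g = sigmoid(W [word; char] + b), output = g * word + (1 - g) * char.
    `weights` has d rows of length 2 * d; `bias` has length d.
    """
    concat = word_vec + char_vec  # list concatenation: [word; char]
    blended = []
    for k, (w, c) in enumerate(zip(word_vec, char_vec)):
        pre_activation = sum(wi * xi for wi, xi in zip(weights[k], concat)) + bias[k]
        g = sigmoid(pre_activation)
        blended.append(g * w + (1.0 - g) * c)
    return blended

# With all-zero parameters the gate is 0.5, so the output is the plain average.
word = [1.0, 0.0]
char = [0.0, 1.0]
zero_weights = [[0.0] * 4, [0.0] * 4]
blended = lexical_gate(word, char, zero_weights, [0.0, 0.0])
```

In a trained model the gate parameters would be learned, so that rare or out-of-vocabulary words can lean more on the character representation.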
Oct 04 2017 cs.CV
Fine-grained image classification aims to recognize hundreds of subcategories within each basic-level category. Existing methods employ discriminative localization to find the key distinctions among subcategories. However, they generally have two limitations: (1) Discriminative localization relies on region proposal methods to hypothesize the locations of discriminative regions, which are time-consuming. (2) The training of discriminative localization depends on object or part annotations, which are highly labor-intensive to produce. It is highly challenging to address these two key limitations simultaneously, and existing methods focus on only one of them. Therefore, we propose a weakly supervised discriminative localization approach (WSDL) for fast fine-grained image classification that addresses both limitations at the same time. Its main advantages are: (1) An n-pathway end-to-end discriminative localization network is designed to improve classification speed. It simultaneously localizes multiple different discriminative regions in one image to boost classification accuracy, and shares the full-image convolutional features generated by the region proposal network to accelerate region proposal generation and reduce the computation of convolutional operations. (2) Multi-level attention guided localization learning is proposed to localize discriminative regions with different focuses automatically, without using object or part annotations, which avoids the labeling effort. Attention at different levels focuses on different characteristics of the image, which are complementary and boost classification accuracy. Both are jointly employed to simultaneously improve classification speed and eliminate dependence on object and part annotations. Compared with state-of-the-art methods on 2 widely-used fine-grained image classification datasets, our WSDL approach achieves the best performance.
We present a new tensor product generation network (TPGN) that generates natural language descriptions for images. The model has a novel architecture that instantiates a general framework for encoding and processing symbolic structure through neural network computation. This framework is built on Tensor Product Representations (TPRs). We evaluated the proposed TPGN on the MS COCO image captioning task. The experimental results show that the TPGN outperforms the LSTM-based state-of-the-art baseline by a significant margin. Further, we show that our caption generation model can be interpreted as generating sequences of grammatical categories and retrieving words by their categories from a plan encoded as a distributed representation.
Sep 26 2017 cs.CV
Discriminative localization is essential for the fine-grained image classification task, which aims to recognize hundreds of subcategories in the same basic-level category. The key differences among subcategories are subtle and local, and they are reflected in the discriminative regions of objects. Existing methods generally adopt a two-stage learning framework: the first stage localizes the discriminative regions of objects, and the second encodes the discriminative features for training classifiers. However, these methods generally have two limitations: (1) Separating the two learning stages is time-consuming. (2) Dependence on object and part annotations for discriminative localization learning requires highly labor-intensive labeling. It is highly challenging to address these two important limitations simultaneously, and existing methods focus on only one of them. Therefore, this paper proposes a discriminative localization approach via saliency-guided Faster R-CNN to address both limitations at the same time. Our main novelties and advantages are: (1) An end-to-end network based on Faster R-CNN is designed to simultaneously localize discriminative regions and encode discriminative features, which accelerates classification. (2) Saliency-guided localization learning is proposed to localize the discriminative region automatically, avoiding labor-intensive labeling. Both are jointly employed to simultaneously accelerate classification and eliminate dependence on object and part annotations. Compared with state-of-the-art methods on the widely-used CUB-200-2011 dataset, our approach achieves both the best classification accuracy and the best efficiency.
Sep 21 2017 cs.DB
Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee on the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to several non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions significantly outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billions of edges, using the idea of pooling.
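For readers unfamiliar with SimRank, the sketch below computes it with the textbook fixed-point iteration on a tiny graph and then answers a top-$k$ query from one source row. This is the naive all-pairs definition, not ProbeSim, whose index-free estimation procedure is not described in the abstract. The decay factor `c=0.6`, the function names, and the example graph are assumptions.

```python
def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank over a dict mapping node -> list of in-neighbors."""
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0  # no in-neighbors means no similarity
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = c * total / (len(ia) * len(ib))
        sim = new
    return sim

def top_k(sim, source, k):
    """Return the k nodes most similar to `source`, excluding the source itself."""
    row = [(v, s) for v, s in sim[source].items() if v != source]
    row.sort(key=lambda pair: pair[1], reverse=True)
    return row[:k]

# Hypothetical graph: u points to a and b, and a points to c.
graph = {"u": [], "a": ["u"], "b": ["u"], "c": ["a"]}
scores = simrank(graph)
best = top_k(scores, "a", 1)
```

The all-pairs table costs quadratic space in the number of nodes, which is the kind of overhead that motivates index-free, query-time estimation on large dynamic graphs.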
We propose a collaborative compressive sensing (CCS) framework consisting of a bank of $K$ compressive sensing (CS) systems that share the same sensing matrix but have different sparsifying dictionaries. This CCS system is guaranteed to yield better performance than each individual CS system in a statistical sense, while with the parallel computing strategy, it requires the same time as that needed for each individual CS system to conduct compression and signal recovery. We then provide an approach to designing optimal CCS systems by utilizing a measure that involves both the sensing matrix and dictionaries and hence allows us to simultaneously optimize the sensing matrix and all the $K$ dictionaries under the same scheme. An alternating minimization-based algorithm is derived for solving the corresponding optimal design problem. We provide a rigorous convergence analysis to show that the proposed algorithm is convergent. Experiments with real images are carried out and show that the proposed CCS system significantly improves on existing CS systems in terms of the signal recovery accuracy.
Sep 20 2017 cs.CV
In recent years, much research has been conducted on image super-resolution (SR). To the best of our knowledge, however, few SR methods have addressed compressed images. The SR of compressed images is a challenging task due to complicated compression artifacts, yet many images suffer from them in practice. The intuitive solution for this difficult task is to decouple it into two sequential but independent subproblems, i.e., compression artifacts reduction (CAR) and SR. Nevertheless, some useful details may be removed in the CAR stage, which is contrary to the goal of SR and makes the SR stage more challenging. In this paper, an end-to-end trainable deep convolutional neural network is designed to perform SR on compressed images (CISRDCNN), which reduces compression artifacts and improves image resolution jointly. Experiments on compressed images produced by JPEG (we take JPEG as an example in this paper) demonstrate that the proposed CISRDCNN yields state-of-the-art SR performance on commonly used test images and image sets. The results of CISRDCNN on real low-quality web images are also very impressive, with obvious quality enhancement. Further, we explore the application of the proposed SR method in low bit-rate image coding, leading to better rate-distortion performance than JPEG.
Predicting the macroscopic influences of drugs on the human body, such as efficacy and toxicity, is a central problem of small-molecule based drug discovery. Molecules can be represented as undirected graphs, and we can utilize graph convolutional networks to predict molecular properties. However, graph convolutional networks and other graph neural networks all focus on learning node-level representations rather than graph-level representations. Previous works simply sum the feature vectors of all nodes in the graph to obtain the graph feature vector for drug prediction. In this paper, we introduce a dummy super node that is connected with all nodes in the graph by a directed edge as the representation of the graph, and we modify the graph operation to help the dummy super node learn graph-level features. Thus, we can handle graph-level classification and regression in the same way as node-level classification and regression. In addition, we apply focal loss to address class imbalance in drug datasets. The experiments on MoleculeNet show that our method can effectively improve the performance of molecular property prediction.
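As a small illustration of the focal loss mentioned above, the sketch below implements the standard binary form, $-\alpha_t (1 - p_t)^\gamma \log p_t$. The abstract does not state which variant or hyperparameters the authors use, so the defaults `gamma=2.0` and `alpha=0.25` and the function name are assumptions.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class, strictly between 0 and 1.
    y: true label, 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

Setting `gamma=0.0` recovers class-weighted cross-entropy, which makes the role of the modulating factor easy to check.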
Sep 04 2017 cs.CV
Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, which is a highly challenging task due to the subtle visual distinctions among similar subcategories. Most existing methods learn part detectors to discover discriminative regions for better performance. However, not all localized parts are beneficial or indispensable for classification, and the choice of the number of part detectors relies heavily on prior knowledge as well as experimental results. When we describe the object in an image in natural language, we focus only on its pivotal characteristics and rarely pay attention to common characteristics or background areas. This is an involuntary transfer from human visual attention to textual attention, which means that textual attention tells us how many and which parts are discriminative and significant. The textual attention of natural language descriptions can therefore help us discover visual attention in an image. Inspired by this, we propose a visual-textual attention driven fine-grained representation learning (VTA) approach, and its main contributions are: (1) Fine-grained visual-textual pattern mining discovers discriminative visual-textual pairwise information for boosting classification by jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, which preserves the intra-modality and inter-modality information to generate a complementary fine-grained representation and further improves classification performance. Experiments on two widely-used datasets demonstrate the effectiveness of our VTA approach, which achieves the best classification accuracy.
Aug 24 2017 cs.CV
Tracking of multiple objects is an important application in AI City geared towards solving salient problems related to safety and congestion in an urban environment. Frequent occlusion in traffic surveillance has been a major problem in this research field. In this challenge, we propose a model-based vehicle localization method, which builds a kernel at each patch of the 3D deformable vehicle model and associates them with constraints in 3D space. The proposed method utilizes shape fitness evaluation in addition to color information to track vehicle objects robustly and efficiently. To build 3D car models in a fully unsupervised manner, we also implement evolutionary camera self-calibration from tracking of walking humans to automatically compute camera parameters. Additionally, the segmented foreground masks, which are crucial to 3D modeling and camera self-calibration, are adaptively refined by multiple-kernel feedback from tracking. For object detection/classification, the state-of-the-art single shot multibox detector (SSD) is trained and tested on the NVIDIA AI City Dataset. To improve the accuracy on categories with only a few objects, such as bus, bicycle and motorcycle, we also employ the pretrained model from YOLO9000 with multi-scale testing. We combine the results from SSD and YOLO9000 based on ensemble learning. Experiments show that our proposed tracking system outperforms the state of the art in both tracking by segmentation and tracking by detection.
Aug 18 2017 cs.LG
Many predictive tasks of web applications need to model categorical variables, such as user IDs and demographics like genders and occupations. To apply standard machine learning techniques, these categorical predictors are always converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse. To learn from such sparse data effectively, it is crucial to account for the interactions between features. Factorization Machines (FMs) are a popular solution for efficiently using second-order feature interactions. However, FM models feature interactions in a linear way, which can be insufficient for capturing the non-linear and complex inherent structure of real-world data. While deep neural networks have recently been applied to learn non-linear feature interactions in industry, such as Wide&Deep by Google and DeepCross by Microsoft, their deep structure makes them difficult to train. In this paper, we propose a novel model, Neural Factorization Machine (NFM), for prediction under sparse settings. NFM seamlessly combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural networks in modelling higher-order feature interactions. Conceptually, NFM is more expressive than FM since FM can be seen as a special case of NFM without hidden layers. Empirical results on two regression tasks show that with only one hidden layer, NFM significantly outperforms FM with a 7.3% relative improvement. Compared to the recent deep learning methods Wide&Deep and DeepCross, our NFM uses a shallower structure but offers better performance, being much easier to train and tune in practice.
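The second-order term that FM contributes can be illustrated directly. The sketch below computes the FM pairwise interaction sum two ways: the naive double loop, and the well-known linear-time identity $\frac{1}{2}\sum_f \big[(\sum_i v_{if} x_i)^2 - \sum_i (v_{if} x_i)^2\big]$. It covers only the FM component, not the NFM hidden layers; the function names and toy inputs are assumptions.

```python
def fm_pairwise_naive(x, v):
    """Sum over i < j of <v_i, v_j> * x_i * x_j, in O(k * n^2) time."""
    n, k = len(x), len(v[0])
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(v[i][f] * v[j][f] for f in range(k))
            total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, v):
    """Same quantity via the square-of-sum minus sum-of-squares identity, in O(k * n)."""
    n, k = len(x), len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        total += 0.5 * (s * s - sq)
    return total

# Sparse one-hot style input: features 0, 2 and 3 are active.
x = [1.0, 0.0, 1.0, 1.0]
v = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0], [2.0, 0.0]]  # one k=2 embedding per feature
naive = fm_pairwise_naive(x, v)
fast = fm_pairwise_fast(x, v)
```

Both functions return the same scalar; the second form is what makes FM practical on very sparse, high-dimensional feature vectors.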
Aug 18 2017 cs.IR
This paper contributes improvements to both the effectiveness and efficiency of Matrix Factorization (MF) methods for implicit feedback. We highlight two critical issues of existing works. First, due to the large space of unobserved feedback, most existing works resort to assigning a uniform weight to the missing data to reduce computational complexity. However, such a uniform assumption is invalid in real-world settings. Second, most methods are designed for an offline setting and fail to keep up with the dynamic nature of online data. We address these two issues in learning MF models from implicit feedback. We first propose to weight the missing data based on item popularity, which is more effective and flexible than the uniform-weight assumption. However, such non-uniform weighting poses an efficiency challenge in learning the model. To address this, we design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique for efficiently optimizing an MF model with variably-weighted missing data. We then exploit this efficiency to devise an incremental update strategy that instantly refreshes an MF model given new feedback. Through comprehensive experiments on two public datasets in both offline and online protocols, we show that our eALS method consistently outperforms state-of-the-art implicit MF methods. Our implementation is available at https://github.com/hexiangnan/sigir16-eals.
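To illustrate what popularity-based weighting of missing data could look like, the sketch below assigns each item a missing-data weight proportional to a power of its interaction count. The abstract does not give the formula, so the smoothed power-law form, the parameters `c0` and `alpha`, and the function name are assumptions rather than the paper's stated method.

```python
def popularity_weights(item_counts, c0=1.0, alpha=0.5):
    """Assumed form: weight_i = c0 * count_i**alpha / sum_j(count_j**alpha).

    item_counts: dict mapping item id -> number of observed interactions (> 0).
    c0 sets the overall weight given to missing data; alpha controls how strongly
    popular items are emphasised (alpha = 0 gives back uniform weighting).
    """
    powered = {item: count ** alpha for item, count in item_counts.items()}
    normaliser = sum(powered.values())
    return {item: c0 * p / normaliser for item, p in powered.items()}

counts = {"hit": 100, "mid": 25, "niche": 1}
weights = popularity_weights(counts)
uniform = popularity_weights(counts, alpha=0.0)
```

The intuition is that a popular item a user has not interacted with is more likely to be a true negative than an obscure item the user may never have seen.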
Aug 18 2017 cs.IR
In recent years, deep neural networks have yielded immense success on speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks on recommender systems has received relatively less scrutiny. In this work, we strive to develop techniques based on neural networks to tackle the key problem in recommendation -- collaborative filtering -- on the basis of implicit feedback. Although some recent work has employed deep learning for recommendation, it has primarily used deep learning to model auxiliary information, such as textual descriptions of items and acoustic features of music. When it comes to modelling the key factor in collaborative filtering -- the interaction between user and item features -- these works still resort to matrix factorization and apply an inner product on the latent features of users and items. By replacing the inner product with a neural architecture that can learn an arbitrary function from data, we present a general framework named NCF, short for Neural network-based Collaborative Filtering. NCF is generic and can express and generalize matrix factorization under its framework. To supercharge NCF modelling with non-linearities, we propose to leverage a multi-layer perceptron to learn the user-item interaction function. Extensive experiments on two real-world datasets show significant improvements of our proposed NCF framework over the state-of-the-art methods. Empirical evidence shows that using deeper layers of neural networks offers better recommendation performance.
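The contrast between an inner-product interaction and a learned one can be sketched in a few lines. The code below scores a user-item pair with a plain dot product (matrix factorization) and with a small multi-layer perceptron over the concatenated embeddings. Layer sizes, the ReLU and sigmoid choices, and all names are assumptions; the abstract does not specify the architecture in this detail.

```python
import math

def mf_score(user_vec, item_vec):
    """Matrix factorization interaction: a fixed inner product."""
    return sum(p * q for p, q in zip(user_vec, item_vec))

def mlp_score(user_vec, item_vec, layers, out_weights):
    """Learned interaction: an MLP over the concatenated user and item embeddings.

    layers: list of (weight_matrix, bias_vector) pairs, one per hidden layer.
    out_weights: weights of the final linear unit, squashed by a sigmoid.
    """
    hidden = user_vec + item_vec  # concatenation
    for weight_matrix, bias in layers:
        hidden = [
            max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)  # ReLU
            for row, b in zip(weight_matrix, bias)
        ]
    logit = sum(w * h for w, h in zip(out_weights, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

user = [1.0, 2.0]
item = [3.0, 4.0]
# Hypothetical, hand-set parameters; in practice these would be learned.
hidden_layer = ([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]], [0.0, 0.0])
dot = mf_score(user, item)
prob = mlp_score(user, item, [hidden_layer], [0.0, 0.0])
```

The inner product is a fixed function of the embeddings, while the MLP's parameters let the interaction function itself be fitted to the data.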
Aug 17 2017 cs.LG
Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating second-order feature interactions. Despite its effectiveness, FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, interactions with useless features may even introduce noise and degrade performance. In this work, we improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network. Extensive experiments on two real-world datasets demonstrate the effectiveness of AFM. Empirically, on the regression task AFM improves on FM with a $8.6\%$ relative improvement, and it consistently outperforms the state-of-the-art deep learning methods Wide&Deep and DeepCross with a much simpler structure and fewer model parameters. Our implementation of AFM is publicly available at: https://github.com/hexiangnan/attentional_factorization_machine
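A minimal sketch of attention over feature interactions follows: each pair of active features yields an element-wise product of its embeddings, a scoring function rates each pair, and a softmax turns the scores into weights for pooling. The scoring function is passed in as a plain callable because the abstract does not describe the attention network; `score_fn`, the function names, and the toy data are assumptions.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_pairwise(x, v, score_fn):
    """Attention-weighted pooling of pairwise interactions (v_i * v_j) x_i x_j."""
    k = len(v[0])
    pairs = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] != 0 and x[j] != 0:  # only active features interact
                pairs.append([v[i][f] * v[j][f] * x[i] * x[j] for f in range(k)])
    if not pairs:
        return [], [0.0] * k
    weights = softmax([score_fn(pair) for pair in pairs])
    pooled = [sum(w * pair[f] for w, pair in zip(weights, pairs)) for f in range(k)]
    return weights, pooled

x = [1.0, 0.0, 1.0, 1.0]
v = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0], [2.0, 0.0]]
# A constant scorer gives every interaction the same weight.
weights, pooled = attentional_pairwise(x, v, lambda pair: 0.0)
```

With a constant scorer the pooled vector is just the mean of the pairwise interaction vectors, so any deviation from equal weighting comes entirely from the learned attention scores.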
Aug 16 2017 cs.IR
The bipartite graph is a ubiquitous data structure that can model the relationship between two entity types: for instance, users and items, queries and webpages. In this paper, we study the problem of ranking vertices of a bipartite graph, based on the graph's link structure as well as prior information about vertices (which we term a query vector). We present a new solution, BiRank, which iteratively assigns scores to vertices and finally converges to a unique stationary ranking. In contrast to the traditional random walk-based methods, BiRank iterates towards optimizing a regularization function, which smooths the graph under the guidance of the query vector. Importantly, we establish how BiRank relates to the Bayesian methodology, enabling the future extension in a probabilistic way. To show the rationale and extendability of the ranking methodology, we further extend it to rank for the more generic n-partite graphs. BiRank's generic modeling of both the graph structure and vertex features enables it to model various ranking hypotheses flexibly. To illustrate its functionality, we apply the BiRank and TriRank (ranking for tripartite graphs) algorithms to two real-world applications: a general ranking scenario that predicts the future popularity of items, and a personalized ranking scenario that recommends items of interest to users. Extensive experiments on both synthetic and real-world datasets demonstrate BiRank's soundness (fast convergence), efficiency (linear in the number of graph edges) and effectiveness (achieving state-of-the-art in the two real-world tasks).
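The iterative scoring described above can be sketched as a pair of alternating updates on a symmetrically normalized edge-weight matrix, with each side pulled back towards its query vector. The abstract does not spell out the update rule, so the specific form below, the damping parameters `alpha` and `beta`, the vertex-set names `p` and `u`, and the toy graph are assumptions.

```python
import math

def birank(w, p0, u0, alpha=0.85, beta=0.85, iterations=100):
    """Assumed updates: p = alpha * S u + (1 - alpha) * p0 and
    u = beta * S^T p + (1 - beta) * u0, with S[i][j] = w[i][j] / sqrt(dp[i] * du[j]).

    w: edge weights, rows are one vertex type and columns the other.
    p0, u0: query vectors (prior scores) for the two vertex types.
    """
    n_p, n_u = len(w), len(w[0])
    dp = [sum(row) for row in w]                                # row degrees
    du = [sum(w[i][j] for i in range(n_p)) for j in range(n_u)]  # column degrees
    s = [
        [w[i][j] / math.sqrt(dp[i] * du[j]) if w[i][j] else 0.0 for j in range(n_u)]
        for i in range(n_p)
    ]
    p, u = list(p0), list(u0)
    for _ in range(iterations):
        p = [alpha * sum(s[i][j] * u[j] for j in range(n_u)) + (1 - alpha) * p0[i]
             for i in range(n_p)]
        u = [beta * sum(s[i][j] * p[i] for i in range(n_p)) + (1 - beta) * u0[j]
             for j in range(n_u)]
    return p, u

edges = [[1.0, 1.0], [0.0, 1.0]]  # hypothetical 2 x 2 bipartite graph
p_scores, u_scores = birank(edges, p0=[1.0, 0.0], u0=[0.5, 0.5])
```

Because the normalized matrix has spectral norm at most one and the damping factors are below one, repeated updates of this form contract towards a single fixed point regardless of initialization.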
Aug 15 2017 cs.IR
The growth of Internet commerce has stimulated the use of collaborative filtering (CF) algorithms as recommender systems. A collaborative filtering (CF) algorithm recommends items of interest to the target user by leveraging the votes given by other similar users. In a standard CF framework, it is assumed that the credibility of every voting user is exactly the same with respect to the target user. This assumption is not satisfied and thus may lead to misleading recommendations in many practical applications. A natural countermeasure is to design a trust-aware CF (TaCF) algorithm, which can take account of the difference in the credibilities of the voting users when performing CF. To this end, this paper presents a trust inference approach, which can predict the implicit trust of the target user on every voting user from a sparse explicit trust matrix. Then an improved CF algorithm termed iTrace is proposed, which takes advantage of both the explicit and the predicted implicit trust to provide recommendations with the CF framework. An empirical evaluation on a public dataset demonstrates that the proposed algorithm provides a significant improvement in recommendation quality in terms of mean absolute error (MAE).
Aug 15 2017 cs.CV
In this work, we address the face parsing task with a Fully-Convolutional continuous CRF Neural Network (FC-CNN) architecture. In contrast to previous face parsing methods that apply a region-based subnetwork hundreds of times, our FC-CNN is fully convolutional with high segmentation accuracy. To achieve this goal, FC-CNN integrates three subnetworks, a unary network, a pairwise network and a continuous Conditional Random Field (C-CRF) network, into a unified framework. The high-level semantic information and low-level details across different convolutional layers are captured by the convolutional and deconvolutional structures in the unary network. The semantic edge context is learnt by the pairwise network branch to construct pixel-wise affinity. Based on a differentiable superpixel pooling layer and a differentiable C-CRF layer, the unary network and pairwise network are combined via a novel continuous CRF network to achieve spatial consistency in both the training and test procedures of a deep neural network. Comprehensive evaluations on the LFW-PL and HELEN datasets demonstrate that FC-CNN achieves better performance than other state-of-the-art methods for accurate face labeling on challenging images.
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge. VQA is a task of significant importance for research in artificial intelligence, given its multimodal nature, clear evaluation protocol, and potential real-world applications. The performance of deep neural networks for VQA is very dependent on choices of architectures and hyperparameters. To help further research in the area, we describe in detail our high-performing, though relatively simple model. Through a massive exploration of architectures and hyperparameters representing more than 3,000 GPU-hours, we identified tips and tricks that lead to its success, namely: sigmoid outputs, soft training targets, image features from bottom-up attention, gated tanh activations, output embeddings initialized using GloVe and Google Images, large mini-batches, and smart shuffling of training data. We provide a detailed analysis of their impact on performance to assist others in making an appropriate selection.
Machine comprehension (MC) style question answering is a representative problem in natural language processing. Previous methods rarely spend time on improving the encoding layer, especially the embedding of syntactic information and named entities of the words, which are crucial to the quality of encoding. Moreover, existing attention methods either represent each query word as a vector or use a single vector to represent the whole query sentence, and neither can properly weight the key words in the query sentence. In this paper, we introduce a novel neural network architecture called Multi-layer Embedding with Memory Network (MEMEN) for the machine reading task. In the encoding layer, we apply the classic skip-gram model to the syntactic and semantic information of the words to train a new kind of embedding layer. We also propose a memory network of full-orientation matching of the query and passage to capture more pivotal information. Experiments show that our model achieves competitive results in both precision and efficiency on the Stanford Question Answering Dataset (SQuAD) among all published results and achieves state-of-the-art results on the TriviaQA dataset.
Jul 28 2017 cs.SE
When developing smart home systems, developers integrate and compose smart devices and software applications. Because of the diversity and heterogeneity of these devices and applications, developers usually encounter many problems. In this paper, we present a runtime model based approach to smart home system development. First, we analyze mobile applications associated with smart devices and extract device control APIs. Second, we use the SM@RT framework to build the device runtime model. Third, we define the scenario model, which is an abstraction of the devices and objects that the system consists of. Fourth, we specify mapping rules from the scenario model to the runtime model and employ a synchronizer, which can interpret the mapping rules, to keep the scenario model and the device runtime model synchronized. The mapping handler reads the mapping rules defined by developers and performs the mapping accordingly. Finally, developers can program smart home systems upon the MOF-compliant scenario model using state-of-the-art model driven technologies. https://youtu.be/SP12OtmHj50
Jul 26 2017 cs.CV
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, improving the best published result in terms of CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
Catastrophic interference has been a major roadblock in the research of continual learning. Here we propose a variant of the back-propagation algorithm, "conceptor-aided back-prop" (CAB), in which gradients are shielded by conceptors against degradation of previously learned tasks. Conceptors have their origin in reservoir computing, where they have been previously shown to overcome catastrophic forgetting. CAB extends these results to deep feedforward networks. On the disjoint MNIST task CAB outperforms two other methods for coping with catastrophic interference that have recently been proposed in the deep learning field.
Jul 14 2017 cs.CV
Mobile landmark search (MLS) has recently received increasing attention for its great practical value. However, it remains unsolved due to two important challenges. One is the high bandwidth consumption of query transmission, and the other is the huge visual variation of query images sent from mobile devices. In this paper, we propose a novel hashing scheme, named canonical view based discrete multi-modal hashing (CV-DMH), to handle these problems via a novel three-stage learning procedure. First, a submodular function is designed to measure the visual representativeness and redundancy of a view set. With it, canonical views, which capture the key visual appearances of a landmark with limited redundancy, are efficiently discovered with an iterative mining strategy. Second, multi-modal sparse coding is applied to transform visual features from multiple modalities into an intermediate representation. It can robustly and adaptively characterize the visual contents of varied landmark images with certain canonical views. Finally, compact binary codes are learned on the intermediate representation within a tailored discrete binary embedding model, which preserves the visual relations of images measured with canonical views and removes the involved noise. In this part, we develop a new augmented Lagrangian multiplier (ALM) based optimization method to directly solve for the discrete binary codes. It not only deals explicitly with the discrete constraint, but also considers the bit-uncorrelated constraint and the balance constraint together. Experiments on real-world landmark datasets demonstrate the superior performance of CV-DMH over several state-of-the-art methods.
Jun 30 2017 cs.CL
We develop a technique for transfer learning in machine comprehension (MC) using a novel two-stage synthesis network (SynNet). Given a high-performing MC model in one domain, our technique aims to answer questions about documents in another domain, where we use no labeled data of question-answer pairs. Using the proposed SynNet with a pretrained model from the SQuAD dataset on the challenging NewsQA dataset, we achieve an F1 measure of 44.3% with a single model and 46.6% with an ensemble, approaching performance of in-domain models (F1 measure of 50.0%) and outperforming the out-of-domain baseline of 7.6%, without use of provided annotations.
Online platforms can be divided into information-oriented and social-oriented domains. The former refers to forums or E-commerce sites that emphasize user-item interactions, like Trip.com and Amazon; whereas the latter refers to social networking services (SNSs) that have rich user-user connections, such as Facebook and Twitter. Despite their heterogeneity, these two domains can be bridged by a few overlapping users, dubbed bridge users. In this work, we address the problem of cross-domain social recommendation, i.e., recommending relevant items of information domains to potential users of social networks. To our knowledge, this is a new problem that has rarely been studied before. Existing cross-domain recommender systems are unsuitable for this task since they have either focused on homogeneous information domains or assumed that users are fully overlapped. To this end, we present a novel Neural Social Collaborative Ranking (NSCR) approach, which seamlessly ties together the user-item interactions in information domains and the user-user connections in SNSs. In the information domain part, the attributes of users and items are leveraged to strengthen the embedding learning of users and items. In the SNS part, the embeddings of bridge users are propagated to learn the embeddings of other non-bridge users. Extensive experiments on two real-world datasets demonstrate the effectiveness and rationality of our NSCR method.
Jun 01 2017 cs.CL
Generative adversarial networks (GANs) have achieved great success in synthesizing data. However, existing GANs restrict the discriminator to be a binary classifier, and thus limit their learning capacity for tasks that need to synthesize output with rich structures such as natural language descriptions. In this paper, we propose a novel generative adversarial network, RankGAN, for generating high-quality language descriptions. Rather than training the discriminator to learn and assign an absolute binary predicate to each individual data sample, the proposed RankGAN is able to analyze and rank a collection of human-written and machine-written sentences given a reference group. By viewing a set of data samples collectively and evaluating their quality through relative ranking scores, the discriminator is able to make a better assessment, which in turn helps to learn a better generator. The proposed RankGAN is optimized through the policy gradient technique. Experimental results on multiple public datasets clearly demonstrate the effectiveness of the proposed approach.
May 24 2017 cs.CL
We introduce an architecture, the Tensor Product Recurrent Network (TPRN). In our application of TPRN, internal representations learned by end-to-end optimization in a deep neural network performing a textual question-answering (QA) task can be interpreted using basic concepts from linguistic theory. No performance penalty need be paid for this increased interpretability: the proposed model performs comparably to a state-of-the-art system on the SQuAD QA task. The internal representation which is interpreted is a Tensor Product Representation: for each input word, the model selects a symbol to encode the word, and a role in which to place the symbol, and binds the two together. The selection is via soft attention. The overall interpretation is built from interpretations of the symbols, as recruited by the trained model, and interpretations of the roles as used by the model. We find support for our initial hypothesis that symbols can be interpreted as lexical-semantic word meanings, while roles can be interpreted as approximations of grammatical roles (or categories) such as subject, wh-word, determiner, etc. Fine-grained analysis reveals specific correspondences between the learned roles and parts of speech as assigned by a standard tagger (Toutanova et al. 2003), and finds several discrepancies in the model's favor. In this sense, the model learns significant aspects of grammar, after having been exposed solely to linguistically unannotated text, questions, and answers: no prior linguistic knowledge is given to the model. What is given is the means to build representations using symbols and roles, with an inductive bias favoring use of these in an approximately discrete manner.
May 16 2017 cs.SI
Embedding network data into a low-dimensional vector space has shown promising performance for many real-world applications, such as node classification and entity retrieval. However, most existing methods focus only on leveraging network structure. For social networks, besides the network structure, there also exists rich information about social actors, such as user profiles of friendship networks and textual content of citation networks. This rich attribute information of social actors reveals the homophily effect, which exerts a huge impact on the formation of social networks. In this paper, we explore the rich evidence source of attributes in social networks to improve network embedding. We propose a generic Social Network Embedding framework (SNE), which learns representations for social actors (i.e., nodes) by preserving both the structural proximity and attribute proximity. While the structural proximity captures the global network structure, the attribute proximity accounts for the homophily effect. To justify our proposal, we conduct extensive experiments on four real-world social networks. Compared to the state-of-the-art network embedding approaches, SNE can learn more informative representations, achieving substantial gains on the tasks of link prediction and node classification. Specifically, SNE significantly outperforms node2vec with an 8.2% relative improvement on the link prediction task, and a 12.7% gain on the node classification task.
May 11 2017 cs.OS
Imagine a disk that provides baseline performance at a relatively low price during low-load periods, but whose performance is automatically promoted in situ and in real time when workloads demand more resources. In a hardware era, this is hardly achievable. However, this imagined disk is becoming reality due to the technical advances of software-defined storage, which enable volume performance to be adjusted on the fly. We propose IOTune, a resource management middleware which employs software-defined storage primitives to implement G-states of virtual block devices. G-states enable virtual block devices to serve at multiple performance gears, getting rid of conflicts between immutable resource reservation and dynamic resource demands, and always achieving resource right-provisioning for workloads. Accompanying G-states, we also propose a new block storage pricing policy for cloud providers. Our case study for applying G-states to cloud block storage verifies the effectiveness of the IOTune framework. Trace-replay based evaluations demonstrate that storage volumes with G-states adapt to workload fluctuations. For tenants, G-states enable volumes to provide much better QoS at the same cost of ownership, compared with static IOPS provisioning and the I/O credit mechanism. G-states also reduce I/O tail latencies by one to two orders of magnitude. From the standpoint of cloud providers, G-states promote storage utilization, creating value and improving competitiveness. G-states supported by IOTune provide a new paradigm for storage resource management and pricing in multi-tenant clouds.
Apr 21 2017 cs.CL
This paper addresses the problem of predicting popularity of comments in an online discussion forum using reinforcement learning, particularly addressing two challenges that arise from having natural language state and action spaces. First, the state representation, which characterizes the history of comments tracked in a discussion at a particular point, is augmented to incorporate the global context represented by discussions on world events available in an external knowledge source. Second, a two-stage Q-learning framework is introduced, making it feasible to search the combinatorial action space while also accounting for redundancy among sub-actions. We experiment with five Reddit communities, showing that the two methods improve over previously reported results on this task.
Apr 11 2017 cs.CV
Fine-grained image classification is a challenging task due to the large intra-class variance and small inter-class variance, aiming at recognizing hundreds of sub-categories belonging to the same basic-level category. Most existing fine-grained image classification methods learn part detection models to obtain the semantic parts for better classification accuracy. Despite achieving promising results, these methods mainly have two limitations: (1) not all the parts obtained through the part detection models are beneficial and indispensable for classification, and (2) fine-grained image classification requires more detailed visual descriptions, which cannot be provided by the part locations or attribute annotations. To address these two limitations, this paper proposes a two-stream model combining vision and language (CVL) for learning latent semantic representations. The vision stream learns deep representations from the original visual information via a deep convolutional neural network. The language stream utilizes natural language descriptions, which can point out the discriminative parts or characteristics of each image, and provides a flexible and compact way of encoding the salient visual aspects for distinguishing sub-categories. Since the two streams are complementary, combining them can further improve classification accuracy. Compared with 12 state-of-the-art methods on the widely used CUB-200-2011 dataset for fine-grained image classification, the experimental results demonstrate that our CVL approach achieves the best performance.
Apr 07 2017 cs.CV
Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, such as 200 subcategories of birds, which is highly challenging due to large variance within the same subcategory and small variance among different subcategories. Existing methods generally first locate the objects or parts and then discriminate which subcategory the image belongs to. However, they mainly have two limitations: (1) They rely on object or part annotations, which are heavily labor-consuming. (2) They ignore the spatial relationships between the object and its parts as well as among these parts, both of which are significantly helpful for finding discriminative parts. Therefore, this paper proposes the object-part attention model (OPAM) for weakly supervised fine-grained image classification, and the main novelties are: (1) The object-part attention model integrates two levels of attention: object-level attention localizes objects in images, and part-level attention selects discriminative parts of the object. Both are jointly employed to learn multi-view and multi-scale features to enhance their mutual promotion. (2) The object-part spatial constraint model combines two spatial constraints: the object spatial constraint ensures that selected parts are highly representative, and the part spatial constraint eliminates redundancy and enhances discrimination of selected parts. Both are jointly employed to exploit the subtle and local differences for distinguishing the subcategories. Importantly, neither object nor part annotations are used in our proposed approach, which avoids the heavy labor consumption of labeling. Compared with more than 10 state-of-the-art methods on 4 widely-used datasets, our OPAM approach achieves the best performance.
Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expert feature engineering. In this paper, we show that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions. The proposed model, DeepFM, combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture. Compared to the latest Wide & Deep model from Google, DeepFM has a shared input to its "wide" and "deep" parts, with no need for feature engineering besides raw features. Comprehensive experiments are conducted to demonstrate the effectiveness and efficiency of DeepFM over the existing models for CTR prediction, on both benchmark data and commercial data.
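To make the "low-order interactions" part of the abstract above concrete, the following is a minimal sketch of the second-order term of a generic factorization machine, not DeepFM's actual implementation. The function names, the toy input `x`, and the latent factors `v` are hypothetical; the sketch only shows that the direct pairwise sum and the well-known linear-time identity give the same value.

```python
from itertools import combinations


def fm_pairwise_naive(x, v):
    """Sum over all pairs i < j of <v_i, v_j> * x_i * x_j, computed directly."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, v):
    """Same quantity via 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    total = 0.0
    for f in range(len(v[0])):
        terms = [v[i][f] * x[i] for i in range(len(x))]
        s = sum(terms)
        s_sq = sum(t * t for t in terms)
        total += 0.5 * (s * s - s_sq)
    return total


# Hypothetical toy input: three features, two latent dimensions per feature.
x = [1.0, 0.0, 2.0]
v = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(fm_pairwise_naive(x, v), fm_pairwise_fast(x, v))
```

The second form avoids the explicit loop over feature pairs, which matters when the input has many sparse features.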
Connecting different text attributes associated with the same entity (conflation) is important in business data analytics since it could help merge two different tables in a database to provide a more comprehensive profile of an entity. However, the conflation task is challenging because two text strings that describe the same entity could be quite different from each other for reasons such as misspelling. It is therefore critical to develop a conflation model that is able to truly understand the semantic meaning of the strings and match them at the semantic level. To this end, we develop a character-level deep conflation model that encodes the input text strings from character level into finite-dimensional feature vectors, which are then used to compute the cosine similarity between the text strings. The model is trained in an end-to-end manner using back propagation and stochastic gradient descent to maximize the likelihood of the correct association. Specifically, we propose two variants of the deep conflation model, based on long short-term memory (LSTM) recurrent neural network (RNN) and convolutional neural network (CNN), respectively. Both models perform well on a real-world business analytics dataset and significantly outperform the baseline bag-of-character (BoC) model.
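The abstract above compares against a bag-of-character (BoC) baseline scored by cosine similarity. The sketch below illustrates that general idea only; the exact featurization used in the paper is not stated in the abstract, so the lowercasing step and the function names are assumptions.

```python
from collections import Counter
from math import sqrt


def bag_of_chars(text):
    """Character-count vector for a string (lowercasing is an assumption)."""
    return Counter(text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def boc_similarity(s, t):
    return cosine_similarity(bag_of_chars(s), bag_of_chars(t))


print(boc_similarity("Acme Corp", "Acme Corp."))
print(boc_similarity("abc", "xyz"))
```

Because a bag of characters ignores order, transposed letters score as identical strings, which is one reason a learned character-level encoder may be preferable.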
The problem of quantizing the activations of a deep neural network is considered. An examination of the popular binary quantization approach shows that this consists of approximating a classical non-linearity, the hyperbolic tangent, by two functions: a piecewise constant sign function, which is used in feedforward network computations, and a piecewise linear hard tanh function, used in the backpropagation step during network learning. The problem of approximating the ReLU non-linearity, widely used in the recent deep learning literature, is then considered. A half-wave Gaussian quantizer (HWGQ) is proposed for forward approximation and shown to have an efficient implementation, by exploiting the statistics of network activations and batch normalization operations commonly used in the literature. To overcome the problem of gradient mismatch, due to the use of different forward and backward approximations, several piece-wise backward approximators are then investigated. The implementation of the resulting quantized network, denoted as HWGQ-Net, is shown to achieve much closer performance to full precision networks, such as AlexNet, ResNet, GoogLeNet and VGG-Net, than previously available low-precision networks, with 1-bit binary weights and 2-bit quantized activations.
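The binary quantization scheme described in the abstract above (sign in the forward pass, hard tanh in the backward pass) can be sketched for a single scalar activation as follows. This is an illustration of that pairing only, not the proposed HWGQ; the function names and the convention that sign(0) is +1 are assumptions.

```python
def sign_forward(x):
    """Forward pass: piecewise-constant sign (sign(0) = +1 is an assumed convention)."""
    return 1.0 if x >= 0 else -1.0


def hard_tanh_grad(x):
    """Derivative of hard tanh, clip(x, -1, 1): 1 inside [-1, 1], 0 outside."""
    return 1.0 if -1.0 <= x <= 1.0 else 0.0


def binary_backward(x, upstream_grad):
    """Backward pass: use the hard tanh derivative in place of the sign derivative."""
    return upstream_grad * hard_tanh_grad(x)


for x in (-1.5, -0.3, 0.3, 1.5):
    print(x, sign_forward(x), binary_backward(x, 2.0))
```

The forward and backward functions differ, which is the gradient mismatch the abstract refers to.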
Private record linkage (PRL) is the problem of identifying pairs of records that are similar as per an input matching rule from databases held by two parties that do not trust one another. We identify three key desiderata that a PRL solution must ensure: 1) perfect precision and high recall of matching pairs, 2) a proof of end-to-end privacy, and 3) communication and computational costs that scale subquadratically in the number of input records. We show that all of the existing solutions for PRL, including secure 2-party computation (S2PC) and their variants that use non-private or differentially private (DP) blocking to ensure subquadratic cost, violate at least one of the three desiderata. In particular, S2PC techniques guarantee end-to-end privacy but have either low recall or quadratic cost. In contrast, no end-to-end privacy guarantee has been formalized for solutions that achieve subquadratic cost. This is true even for solutions that compose DP and S2PC: DP does not permit the release of any exact information about the databases, while S2PC algorithms for PRL allow the release of matching records. In light of this deficiency, we propose a novel privacy model, called output constrained differential privacy, that shares the strong privacy protection of DP, but allows for the truthful release of the output of a certain function applied to the data. We apply this to PRL, and show that protocols satisfying this privacy model permit the disclosure of the true matching records, but their execution is insensitive to the presence or absence of a single non-matching record. We find that prior works that combine DP and S2PC techniques even fail to satisfy this end-to-end privacy model. Hence, we develop novel protocols that provably achieve this end-to-end privacy guarantee, together with the other two desiderata of PRL.
Dec 12 2016 cs.CV
We address the problem of instance-level semantic segmentation, which aims at jointly detecting, segmenting and classifying every individual object in an image. In this context, existing methods typically propose candidate objects, usually as bounding boxes, and directly predict a binary mask within each such proposal. As a consequence, they cannot recover from errors in the object candidate generation process, such as too small or shifted boxes. In this paper, we introduce a novel object segment representation based on the distance transform of the object masks. We then design an object mask network (OMN) with a new residual-deconvolution architecture that infers such a representation and decodes it into the final binary object mask. This allows us to predict masks that go beyond the scope of the bounding boxes and are thus robust to inaccurate object candidates. We integrate our OMN into a Multitask Network Cascade framework, and learn the resulting boundary-aware instance segmentation (BAIS) network in an end-to-end manner. Our experiments on the PASCAL VOC 2012 and the Cityscapes datasets demonstrate the benefits of our approach, which outperforms the state-of-the-art in both object proposal generation and instance segmentation.
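The distance transform that the abstract above builds its segment representation on can be illustrated with a brute-force sketch: each foreground pixel stores its Euclidean distance to the nearest background pixel. This is not the paper's representation or network; the function name is hypothetical, pixels outside the grid are ignored, and any truncation or quantization applied in the paper is omitted because the abstract does not describe it.

```python
from math import hypot


def distance_transform(mask):
    """Brute-force Euclidean distance from each foreground pixel (1) to the
    nearest background pixel (0) inside the grid. Background pixels get 0.0;
    if the grid has no background at all, foreground pixels get infinity."""
    background = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value == 0
    ]
    result = []
    for r, row in enumerate(mask):
        out_row = []
        for c, value in enumerate(row):
            if value == 0:
                out_row.append(0.0)
            elif not background:
                out_row.append(float("inf"))
            else:
                out_row.append(min(hypot(r - br, c - bc) for br, bc in background))
        result.append(out_row)
    return result


# Hypothetical 5x5 mask containing a 3x3 object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
dt = distance_transform(mask)
# Thresholding the transform at zero recovers the binary mask.
recovered = [[1 if d > 0 else 0 for d in row] for row in dt]
```

The brute-force loop is quadratic in the number of pixels and is meant only for small illustrative grids.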
Nov 30 2016 cs.IR
It is well known that learning customers' preferences and making recommendations to them in today's information-exploded environment is critical and non-trivial in an online system. There are two different modes of recommendation systems, namely pull-mode and push-mode. The majority of recommendation systems are pull-mode, recommending items to users only when and after users enter the Application Market, while push-mode works more actively to enhance or rebuild the connection between the Application Market and users. Huawei is one of the most successful phone manufacturers, and both the number of users and the number of apps have increased dramatically in the Huawei Application Store (also named Hispace Store), which had approximately 0.3 billion registered users and 1.2 million apps as of 2016 and whose user base is growing rapidly. To meet the needs of this real scenario, we establish a Push Service Platform (PSP for short) to discover the target user group automatically from web-scale user operation log data with an additional small set of labelled apps (usually around 10 apps) in Hispace Store. As presented in this work, PSP includes a distributed storage layer, an application layer and an evaluation layer. In the application layer, we design a practical graph-based algorithm (named A-PARW) for user group discovery, which is an approximate version of partially absorbing random walk. Based on I mode of A-PARW, the effectiveness of our system is significantly improved compared to the predecessor of the presented system, which uses Personalized PageRank in its application layer.
A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition to captioning images, we also extend the SCN to generate captions for video clips. We qualitatively analyze semantic composition in SCNs, and quantitatively evaluate the algorithm on three benchmark datasets: COCO, Flickr30k, and Youtube2Text. Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.
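One simple way to read the "ensemble of tag-dependent weight matrices" in the abstract above is as a probability-weighted sum of per-tag matrices. The sketch below shows that reading only; the abstract does not give the exact composition rule, so the weighted sum, the function name, and the toy values are all assumptions rather than the SCN's actual parameterization.

```python
def compose_weights(tag_probs, tag_matrices):
    """Combine per-tag weight matrices into one matrix, weighting each
    by the (image-dependent) probability of its tag."""
    rows = len(tag_matrices[0])
    cols = len(tag_matrices[0][0])
    composed = [[0.0] * cols for _ in range(rows)]
    for prob, matrix in zip(tag_probs, tag_matrices):
        for r in range(rows):
            for c in range(cols):
                composed[r][c] += prob * matrix[r][c]
    return composed


# Hypothetical example: two tags, each with its own 2x2 weight matrix.
w_tag_a = [[2.0, 0.0], [0.0, 2.0]]
w_tag_b = [[0.0, 4.0], [4.0, 0.0]]
print(compose_weights([0.5, 0.5], [w_tag_a, w_tag_b]))
```

With this reading, an image whose detected tags differ yields a different effective weight matrix at every LSTM step.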
We propose a new encoder-decoder approach to learn distributed sentence representations that are applicable to multiple purposes. The model is learned by using a convolutional neural network as an encoder to map an input sentence into a continuous vector, and using a long short-term memory recurrent neural network as a decoder. Several tasks are considered, including sentence reconstruction and future sentence prediction. Further, a hierarchical encoder-decoder model is proposed to encode a sentence to predict multiple future sentences. By training our models on a large collection of novels, we obtain a highly generic convolutional sentence encoder that performs well in practice. Experimental results on several benchmark datasets, and across a broad range of applications, demonstrate the superiority of the proposed model over competing methods.
In recent years, interest in recommender research has shifted from explicit feedback towards implicit feedback data. A diversity of complex models has been proposed for a wide variety of applications. Despite this, learning from implicit feedback is still computationally challenging. So far, most work relies on stochastic gradient descent (SGD) solvers which are easy to derive, but in practice challenging to apply, especially for tasks with many items. For the simple matrix factorization model, an efficient coordinate descent (CD) solver has been previously proposed. However, efficient CD approaches have not been derived for more complex models. In this paper, we provide a new framework for deriving efficient CD algorithms for complex recommender models. We identify and introduce the property of k-separable models. We show that k-separability is a sufficient property to allow efficient optimization of implicit recommender problems with CD. We illustrate this framework on a variety of state-of-the-art models including factorization machines and Tucker decomposition. To summarize, our work provides the theory and building blocks to derive efficient implicit CD algorithms for complex recommender models.
In an Internet of Things network, multiple sensors send information to a fusion center for it to infer a public hypothesis of interest. However, the same sensor information may be used by the fusion center to make inferences of a private nature that the sensors wish to protect. To model this, we adopt a decentralized hypothesis testing framework with binary public and private hypotheses. Each sensor makes a private observation and utilizes a local sensor decision rule or privacy mapping to summarize that observation independently of the other sensors. The local decision made by a sensor is then sent to the fusion center. Without assuming knowledge of the joint distribution of the sensor observations and hypotheses, we adopt a nonparametric learning approach to design local privacy mappings. We introduce the concept of an empirical normalized risk, which provides a theoretical guarantee for the network to achieve information privacy for the private hypothesis with high probability when the number of training samples is large. We develop iterative optimization algorithms to determine an appropriate privacy threshold and the best sensor privacy mappings, and show that they converge. Finally, we extend our approach to the case of a private multiple hypothesis. Numerical results on both synthetic and real data sets suggest that our proposed approach yields low error rates for inferring the public hypothesis, but high error rates for detecting the private hypothesis.
We study the problem of learning influence functions under incomplete observations of node activations. Incomplete observations are a major concern as most (online and real-world) social networks are not fully observable. We establish both proper and improper PAC learnability of influence functions under randomly missing observations. Proper PAC learnability under the Discrete-Time Linear Threshold (DLT) and Discrete-Time Independent Cascade (DIC) models is established by reducing incomplete observations to complete observations in a modified graph. Our improper PAC learnability result applies for the DLT and DIC models as well as the Continuous-Time Independent Cascade (CIC) model. It is based on a parametrization in terms of reachability features, and also gives rise to an efficient and practical heuristic. Experiments on synthetic and real-world datasets demonstrate the ability of our method to compensate even for a fairly large fraction of missing observations.
Aug 12 2016 cs.CV
With increasing demand for efficient image and video analysis, test-time cost of scene parsing becomes critical for many large-scale or time-sensitive vision applications. We propose a dynamic hierarchical model for anytime scene labeling that allows us to achieve flexible trade-offs between efficiency and accuracy in pixel-level prediction. In particular, our approach incorporates the cost of feature computation and model inference, and optimizes the model performance for any given test-time budget by learning a sequence of image-adaptive hierarchical models. We formulate this anytime representation learning as a Markov Decision Process with a discrete-continuous state-action space. A high-quality policy of feature and model selection is learned based on an approximate policy iteration method with action proposal mechanism. We demonstrate the advantages of our dynamic non-myopic anytime scene parsing on three semantic segmentation datasets, which achieves 90% of the state-of-the-art performances by using 15% of their overall costs.
We develop a novel bi-directional attention model for dependency parsing, which learns to agree on headword predictions from the forward and backward parsing directions. The parsing procedure for each direction is formulated as sequentially querying the memory component that stores continuous headword embeddings. The proposed parser makes use of soft headword embeddings, allowing the model to implicitly capture high-order parsing history without dramatically increasing the computational complexity. We conduct experiments on English, Chinese, and 12 other languages from the CoNLL 2006 shared task, showing that the proposed model achieves state-of-the-art unlabeled attachment scores on 6 languages.
Jul 28 2016 cs.CV
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, using all the face images of each individual that can be collected on the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation and improve the recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide a concrete measurement set, an evaluation protocol, and training data. We also present our experiment setup in detail and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.
Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great successes in many real-world applications. However, for large-scale problems involving a huge number of samples and extremely high-dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsities in both feature and sample spaces, we propose a novel approach, which is based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the features and samples that are guaranteed to be irrelevant to the outputs. Thus, we can remove the identified inactive samples and features from the training phase, leading to substantial savings in both the memory usage and computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVM. Experiments on both synthetic and real datasets (e.g., the kddb dataset with about 20 million samples and 30 million features) demonstrate that our approach significantly outperforms state-of-the-art methods and the speedup gained by our approach can be orders of magnitude.
Jun 16 2016 cs.LG
Unsupervised learning is the most challenging problem in machine learning and especially in deep learning. Among many scenarios, we study an unsupervised learning problem of high economic value --- learning to predict without costly pairing of input data and corresponding labels. Part of the difficulty in this problem is a lack of solid evaluation measures. In this paper, we take a practical approach to grounding unsupervised learning by using the same success criterion as for supervised learning in prediction tasks but we do not require the presence of paired input-output training data. In particular, we propose an objective function that aims to make the predicted outputs fit well the structure of the output while preserving the correlation between the input and the predicted output. We experiment with a synthetic structural prediction problem and show that even with simple linear classifiers, the objective function is already highly non-convex. We further demonstrate the nature of this non-convex optimization problem as well as potential solutions. In particular, we show that with regularization via a generative model, learning with the proposed unsupervised objective function converges to an optimal solution.
Dealing with the complex word forms in morphologically rich languages is an open problem in language processing, and is particularly important in translation. In contrast to most modern neural systems of translation, which discard the identity for rare words, in this paper we propose several architectures for learning word representations from character and morpheme level word decompositions. We incorporate these representations in a novel machine translation model which jointly learns word alignments and translations via a hard attention mechanism. Evaluating on translating from several morphologically rich languages into English, we show consistent improvements over strong baseline methods, of between 1 and 1.5 BLEU points.
We introduce an online popularity prediction and tracking task as a benchmark task for reinforcement learning with a combinatorial, natural language action space. A specified number of discussion threads predicted to be popular are recommended, chosen from a fixed window of recent comments to track. Novel deep reinforcement learning architectures are studied for effective modeling of the value function associated with actions comprised of interdependent sub-actions. The proposed model, which represents dependence between sub-actions through a bi-directional LSTM, gives the best performance across different experimental configurations and domains, and it also generalizes well with varying numbers of recommendation requests.
Jun 08 2016 cs.CV
Conventional change detection methods require a large number of images to learn background models or depend on tedious pixel-level labeling by humans. In this paper, we present a weakly supervised approach that needs only image-level labels to simultaneously detect and localize changes in a pair of images. To this end, we employ a deep neural network with DAG topology to learn patterns of change from image-level labeled training data. On top of the initial CNN activations, we define a CRF model to incorporate the local differences and context with the dense connections between individual pixels. We apply a constrained mean-field algorithm to estimate the pixel-level labels, and use the estimated labels to update the parameters of the CNN in an iterative EM framework. This enables imposing global constraints on the observed foreground probability mass function. Our evaluations on four benchmark datasets demonstrate superior detection and localization performance.
Training deep neural networks is a high-dimensional and highly non-convex optimization problem. The stochastic gradient descent (SGD) algorithm and its variations are the current state-of-the-art solvers for this task. However, due to the non-convex nature of the problem, SGD has been observed to slow down near saddle points. Recent empirical work claims that detecting and escaping saddle points efficiently is likely to improve training performance. With this objective, we revisit the Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential compared to SGD, which allows larger computing resources to be utilized more efficiently, thus enabling large models and a faster time to obtain the desired solution. Furthermore, unlike the truncated Newton method (Martens' HF), which ignores negative curvature information by using the naïve conjugate gradient method and Gauss-Newton Hessian approximation information, we propose a novel algorithm to explore negative curvature directions by solving the sub-problem with a stabilized bi-conjugate method involving possibly indefinite stochastic Hessian information. We show that these techniques accelerate the training process for both the standard MNIST dataset and the TIMIT speech recognition problem, demonstrating robust performance with up to an order of magnitude larger batch sizes. This increased scaling potential is illustrated with near-linear speed-up on up to 16 CPU nodes for a simple 4-layer network.
Jun 01 2016 cs.CV
While depth sensors are becoming increasingly popular, their spatial resolution often remains limited. Depth super-resolution therefore emerged as a solution to this problem. Despite much progress, state-of-the-art techniques suffer from two drawbacks: (i) they rely on the assumption that intensity edges coincide with depth discontinuities, which, unfortunately, is only true in controlled environments; and (ii) they typically exploit the availability of high-resolution training depth maps, which can often not be acquired in practice due to the sensors' limitations. By contrast, here, we introduce an approach to performing depth super-resolution in more challenging conditions, such as in outdoor scenes. To this end, we first propose to exploit semantic information to better constrain the super-resolution process. In particular, we design a co-sparse analysis model that learns filters from joint intensity, depth and semantic information. Furthermore, we show how low-resolution training depth maps can be employed in our learning strategy. We demonstrate the benefits of our approach over state-of-the-art depth super-resolution methods on two outdoor scene datasets.
Obtaining a better cycle structure is still a challenge in low-density parity-check (LDPC) code design. This paper first formulates two metrics so that the progressive edge-growth (PEG) algorithm and the approximate cycle extrinsic message degree (ACE) constrained PEG algorithm are unified into one integrated algorithm, called the metric-constrained PEG algorithm (M-PEGA). Then, as an improvement on the M-PEGA, the multi-edge metric-constrained PEG algorithm (MM-PEGA) is proposed based on two new concepts, the multi-edge local girth and the edge-trials. The MM-PEGA with edge-trials equal to a positive integer $r$ is called the $r$-edge M-PEGA; it constructs each edge of the non-quasi-cyclic (non-QC) LDPC code graph by selecting a check node whose $r$-edge local girth is optimal. In addition, to design QC-LDPC codes with any predefined valid design parameters, as well as to detect and even avoid generating the undetectable cycles in QC-LDPC codes designed by the QC-PEG algorithm, the multi-edge metric-constrained QC-PEG algorithm (MM-QC-PEGA) is finally proposed. Simulation results verify that increasing the edge-trials of the MM-PEGA/MM-QC-PEGA is expected to have a positive effect on the cycle structures and the error performance of the LDPC codes designed by the MM-PEGA/MM-QC-PEGA.
Apr 29 2016 cs.DB
Considerable effort has been made to increase the scale of Linked Data. However, an inevitable problem when integrating data from multiple sources is that different sources often provide conflicting objects for a certain predicate of the same real-world entity, the so-called object conflicts problem. Currently, the object conflicts problem has not received sufficient attention in the Linked Data community. In this paper, we first formalize the object conflicts resolution problem as computing the joint distribution of variables on a heterogeneous information network called the Source-Object Network, which successfully captures all the correlations between objects and Linked Data sources. Then, we introduce a novel approach based on network effects, called ObResolution (Object Resolution), to identify a true object from multiple conflicting objects. ObResolution adopts a pairwise Markov Random Field (pMRF) to model all evidence under a unified framework. Extensive experimental results on six real-world datasets show that our method exhibits higher accuracy than existing approaches and that it is robust and consistent in various domains. Keywords: Linked Data, Object Conflicts, Linked Data Quality, Truth Discovery
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Apr 14 2016 cs.DS
A monotone drawing of a graph G is a straight-line drawing of G such that, for every pair of vertices u,w in G, there exists a path P_uw in G that is monotone in some direction l_uw. (Namely, the order of the orthogonal projections of the vertices of P_uw on l_uw is the same as the order in which they appear in P_uw.) The problem of finding monotone drawings for trees has been studied in several recent papers. The main focus is to reduce the size of the drawing. Currently, the smallest drawing size is O(n^1.205) x O(n^1.205). In this paper, we present an algorithm for constructing monotone drawings of trees on a grid of size at most 12n x 12n. The smaller drawing size is achieved by a new simple Path Draw algorithm, and a procedure that carefully assigns primitive vectors to the paths of the input tree T. We also show that there exists a tree T_0 such that any monotone drawing of T_0 must use a grid of size Omega(n) x Omega(n). So the size of our monotone drawing of trees is asymptotically optimal.
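The monotonicity condition in the definition above can be checked directly from vertex coordinates. The following is a minimal sketch, assuming a path is given as a list of (x, y) points in drawing order and the direction as a 2D vector; the function name is hypothetical and the check is not part of the Path Draw algorithm itself.

```python
def is_monotone(path, direction):
    """Return True if the orthogonal projections of the path's vertices
    onto `direction` strictly increase in the order the vertices appear."""
    dx, dy = direction
    # The dot product gives the (scaled) position of each projected vertex.
    projections = [x * dx + y * dy for x, y in path]
    return all(a < b for a, b in zip(projections, projections[1:]))
```

For example, the path (0,0), (1,2), (3,1) is monotone in the direction (1,0), since its x-coordinates increase, but not in the direction (0,1), since its y-coordinates rise and then fall.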
Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. This issue is particularly challenging for understanding causal and correlational relationships between events. While this topic has received a lot of interest in the NLP community, research has been hindered by the lack of a proper evaluation framework. This paper attempts to address this problem with a new framework for evaluating story understanding and script learning: the 'Story Cloze Test'. This test requires a system to choose the correct ending to a four-sentence story. We created a new corpus of ~50k five-sentence commonsense stories, ROCStories, to enable this evaluation. This corpus is unique in two ways: (1) it captures a rich set of causal and temporal commonsense relations between daily events, and (2) it is a high quality collection of everyday life stories that can also be used for story generation. Experimental evaluation shows that a host of baselines and state-of-the-art models based on shallow language understanding struggle to achieve a high score on the Story Cloze Test. We discuss these implications for script and story learning, and offer suggestions for deeper language understanding.
We show that a character-level encoder-decoder framework can be successfully applied to question answering with a structured knowledge base. We use our model for single-relation question answering and demonstrate the effectiveness of our approach on the SimpleQuestions dataset (Bordes et al., 2015), where we improve state-of-the-art accuracy from 63.9% to 70.9%, without use of ensembles. Importantly, our character-level model has 16x fewer parameters than an equivalent word-level model, can be learned with significantly less data compared to previous work, which relies on data augmentation, and is robust to new entities in testing.
Mar 31 2016 cs.CV
We present an image caption system that addresses new challenges of automatically describing images in the wild. The challenges include high caption quality with respect to human judgments, out-of-domain data handling, and the low latency required in many applications. Built on top of a state-of-the-art framework, we developed a deep vision model that detects a broad range of visual concepts, an entity recognition model that identifies celebrities and landmarks, and a confidence model for the caption output. Experimental results show that our caption engine significantly outperforms previous state-of-the-art systems on both an in-domain dataset (i.e. MS COCO) and out-of-domain datasets.
There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.
Uncertainty about models and data is ubiquitous in the computational social sciences, and it creates a need for robust social network algorithms, which can simultaneously provide guarantees across a spectrum of models and parameter settings. We begin an investigation into this broad domain by studying robust algorithms for the Influence Maximization problem, in which the goal is to identify a set of k nodes in a social network whose joint influence on the network is maximized. We define a Robust Influence Maximization framework wherein an algorithm is presented with a set of influence functions, typically derived from different influence models or different parameter settings for the same model. The different parameter settings could be derived from observed cascades on different topics, under different conditions, or at different times. The algorithm's goal is to identify a set of k nodes who are simultaneously influential for all influence functions, compared to the (function-specific) optimum solutions. We show strong approximation hardness results for this problem unless the algorithm gets to select at least a logarithmic factor more seeds than the optimum solution. However, when enough extra seeds may be selected, we show that techniques of Krause et al. can be used to approximate the optimum robust influence to within a factor of 1 - 1/e. We evaluate this bicriteria approximation algorithm against natural heuristics on several real-world data sets. Our experiments indicate that the worst-case hardness does not necessarily translate into bad performance on real-world data sets; all algorithms perform fairly well.
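The robust objective described above, being simultaneously good for every influence function relative to that function's own optimum, can be made concrete on a toy instance. The sketch below assumes a deliberately simple deterministic "coverage" influence function (each seed reaches a fixed set of nodes) and computes each optimum by brute force; both choices and all names are assumptions for illustration, not the models or the bicriteria algorithm evaluated in the paper.

```python
from itertools import combinations


def coverage(reach, seeds):
    """Toy influence function: number of nodes reached by at least one seed."""
    covered = set()
    for s in seeds:
        covered |= reach[s]
    return len(covered)


def robust_ratio(reaches, seeds, k):
    """Minimum over influence functions of f_i(seeds) / max_{|T|=k} f_i(T)."""
    nodes = sorted(reaches[0])
    ratios = []
    for reach in reaches:
        # Brute-force optimum for this influence function (tiny inputs only).
        best = max(coverage(reach, T) for T in combinations(nodes, k))
        ratios.append(coverage(reach, seeds) / best)
    return min(ratios)


# Two hypothetical parameter settings over the same three nodes.
reach_1 = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
reach_2 = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
```

With k = 1, seeding "a" is optimal under `reach_1` but reaches only half of what the best seed reaches under `reach_2`, so its robust ratio is 0.5. A robust algorithm tries to maximize this worst-case ratio.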
Jan 13 2016 cs.AI
In this paper we present the initial development of a general theory for mapping inference in predicate logic to computation over Tensor Product Representations (TPRs; Smolensky (1990), Smolensky & Legendre (2006)). After an initial brief synopsis of TPRs (Section 0), we begin with particular examples of inference with TPRs in the 'bAbI' question-answering task of Weston et al. (2015) (Section 1). We then present a simplification of the general analysis that suffices for the bAbI task (Section 2). Finally, we lay out the general treatment of inference over TPRs (Section 3). We also show the simplification in Section 2 derives the inference methods described in Lee et al. (2016); this shows how the simple methods of Lee et al. (2016) can be formally extended to more general reasoning tasks.
Nov 23 2015 cs.CL
Question answering tasks have shown remarkable progress with distributed vector representations. In this paper, we investigate the recently proposed Facebook bAbI tasks, which consist of twenty different categories of questions that require complex reasoning. Because previous work on bAbI consists entirely of end-to-end models, errors could come either from an imperfect understanding of semantics or from certain steps of the reasoning. For clearer analysis, we propose two vector space models inspired by Tensor Product Representation (TPR) to perform knowledge encoding and logical reasoning based on common-sense inference. Together they achieve near-perfect accuracy on all categories, including positional reasoning and path finding, which have proved difficult for most previous approaches. We hypothesize that the difficulties in these categories are due to their multi-relational nature, in contrast to the uni-relational characteristic of other categories. Our exploration sheds light on designing more sophisticated datasets and moves one step toward integrating the transparent and interpretable formalism of TPR into existing learning paradigms.
Nov 20 2015 cs.CV
In this paper, we tackle the problem of estimating the depth of a scene from a monocular video sequence. In particular, we handle challenging scenarios, such as non-translational camera motion and dynamic scenes, where traditional structure from motion and motion stereo methods do not apply. To this end, we first study the problem of depth estimation from a single image. In this context, we exploit the availability of a pool of images for which the depth is known, and formulate monocular depth estimation as a discrete-continuous optimization problem, where the continuous variables encode the depth of the superpixels in the input image, and the discrete ones represent relationships between neighboring superpixels. The solution to this discrete-continuous optimization problem is obtained by performing inference in a graphical model using particle belief propagation. To handle video sequences, we then extend our single image model to a two-frame one that naturally encodes short-range temporal consistency and inherently handles dynamic objects. Based on the prediction of this model, we then introduce a fully-connected pairwise CRF that accounts for longer range spatio-temporal interactions throughout a video. We demonstrate the effectiveness of our model in both the indoor and outdoor scenarios.
Nov 20 2015 cs.NA
Significant inaccuracy often occurs during mathematical calculation due to the limited number of digits in floating-point representations, which may lead to catastrophic loss. It is commonly believed that raising floating-point precision is an effective way to solve this problem, since high-precision floating-point numbers have more digits to store information. Thus, performing all operations in higher precision is a prevalent method for reducing inaccuracy in much floating-point-related research. However, we discover that some operations may lead to larger errors in higher precision. In this paper, we define an operation that generates a large error due to precision adjustment as a precision-specific operation. Furthermore, we propose a lightweight search algorithm for detecting precision-specific operations and an automatic processing method for fixing them. In addition, we conducted an experiment on the scientific mathematical library of GLIBC. The results show that there are many precision-specific operations, and that our fixing approach can significantly reduce the inaccuracy.
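To see how an operation can depend on a specific precision, consider a well-known rounding idiom: adding and then subtracting 1.5 * 2^52 forces a double to round to the nearest integer, because doubles of that magnitude have a spacing of exactly 1. The sketch below is an illustrative example chosen here rather than one quoted from the abstract; it assumes IEEE 754 double arithmetic for Python floats and uses exact rational arithmetic as a stand-in for "higher precision".

```python
from fractions import Fraction

MAGIC = 6755399441055744.0  # 1.5 * 2**52; doubles near this value are 1 apart


def round_via_magic_double(x):
    """Round x to the nearest integer using two double-precision operations."""
    return (x + MAGIC) - MAGIC


def round_via_magic_exact(x):
    """Evaluate the same expression exactly; the rounding effect disappears."""
    return (Fraction(x) + Fraction(MAGIC)) - Fraction(MAGIC)
```

In double precision `round_via_magic_double(2.7)` yields 3.0, which is what the idiom intends. Evaluated exactly, the same expression simply returns the input value, so code that relies on the rounding side effect produces a different, and for its purpose wrong, result once precision is raised.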
This paper introduces a novel architecture for reinforcement learning with deep neural networks designed to handle state and action spaces characterized by natural language, as found in text-based games. Termed a deep reinforcement relevance network (DRRN), the architecture represents action and state spaces with separate embedding vectors, which are combined with an interaction function to approximate the Q-function in reinforcement learning. We evaluate the DRRN on two popular text games, showing superior performance over other deep Q-learning architectures. Experiments with paraphrased action descriptions show that the model is extracting meaning rather than simply memorizing strings of text.
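The idea of scoring a state-action pair through separate embeddings and an interaction function can be sketched in a few lines. The example below assumes the embeddings have already been computed and uses an inner product as the interaction function; the vectors, action strings, and function names are hypothetical, and the learned text encoders of the DRRN are omitted.

```python
def q_value(state_vec, action_vec):
    """Inner-product interaction between a state and an action embedding."""
    return sum(s * a for s, a in zip(state_vec, action_vec))


def best_action(state_vec, action_vecs):
    """Pick the action text whose embedding yields the highest Q-value."""
    return max(action_vecs, key=lambda name: q_value(state_vec, action_vecs[name]))


# Hypothetical 2-dimensional embeddings for one game state and two actions.
state = [1.0, 0.0]
actions = {"open the door": [0.9, 0.1], "go back": [-0.5, 0.8]}
```

Because each action is embedded separately from the state, the set of candidate actions can differ from state to state without changing the model's output layer.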
This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use the semantic representation of a question as a query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the proposed SANs significantly outperform previous state-of-the-art approaches. The visualization of the attention layers illustrates how the SAN locates, layer by layer, the relevant visual clues that lead to the answer to the question.
In this paper we develop dual-free mini-batch SDCA with adaptive probabilities for regularized empirical risk minimization. This work is motivated by recent work of Shai Shalev-Shwartz on the dual-free SDCA method; however, we allow a non-uniform selection of "dual" coordinates in SDCA. Moreover, the probabilities can change over time, making the method more efficient than a fixed uniform or non-uniform selection. We also propose an efficient procedure to generate a random non-uniform mini-batch through an iterative process. The work concludes with multiple numerical experiments that show the efficiency of the proposed algorithms.
The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such style is descriptions with emotions, which is commonplace in everyday communication, and influences decision-making and interpersonal relationships. We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments. We propose a novel switching recurrent neural network with word-level regularization, which is able to produce emotional image captions using only 2000+ training sentences containing sentiments. We evaluate the captions with different automatic and crowd-sourcing metrics. Our model compares favourably in common quality metrics for image captioning. In 84.6% of cases the generated positive captions were judged as being at least as descriptive as the factual captions. Of these positive captions 88% were confirmed by the crowd-sourced workers as having the appropriate sentiment.