After a clustering solution is generated automatically, labelling the clusters becomes important for understanding the results. In this paper, we propose a mutual-information-based method to label clusters of journal articles. The topical terms that have the highest Normalised Mutual Information (NMI) with a given cluster are selected as the labels of that cluster. We discussed the labelling technique with a domain expert to check that the labels are discriminating not only lexically but also semantically. Based on a common set of topical terms, we also propose to generate lexical fingerprints as a representation of individual clusters. Finally, we visualise and compare the fingerprints of different clusters, taken either from one clustering solution or from different ones.
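The term-selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how NMI is normalised or how terms are filtered, so everything here is an assumption: mutual information between the binary events "document contains the term" and "document belongs to the cluster" is divided by the geometric mean of the two entropies, and only terms that occur more often inside the cluster than overall are kept as candidates (NMI alone is symmetric and would also reward terms that are conspicuously absent). The function names `nmi`, `positively_associated`, and `label_clusters` are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def nmi(docs, labels, term, cluster):
    """NMI between 'doc contains term' and 'doc is in cluster'.

    Assumption: normalise by sqrt(H(term) * H(cluster)); the paper may differ.
    """
    n = len(docs)
    joint = Counter((term in doc, lab == cluster) for doc, lab in zip(docs, labels))
    p_term = {v: sum(c for (t, _), c in joint.items() if t == v) / n for v in (True, False)}
    p_clus = {v: sum(c for (_, k), c in joint.items() if k == v) / n for v in (True, False)}
    mi = sum(
        (c / n) * math.log2((c / n) / (p_term[t] * p_clus[k]))
        for (t, k), c in joint.items()
    )
    h_term, h_clus = entropy(p_term.values()), entropy(p_clus.values())
    if h_term == 0 or h_clus == 0:
        return 0.0  # a constant variable carries no information
    return mi / math.sqrt(h_term * h_clus)


def positively_associated(docs, labels, term, cluster):
    """Assumed filter: the term is more frequent inside the cluster than overall."""
    inside = [doc for doc, lab in zip(docs, labels) if lab == cluster]
    rate_inside = sum(term in doc for doc in inside) / len(inside)
    rate_overall = sum(term in doc for doc in docs) / len(docs)
    return rate_inside > rate_overall


def label_clusters(docs, labels, top_k=3):
    """Return the top_k highest-NMI candidate terms for every cluster."""
    vocab = sorted(set().union(*docs))
    result = {}
    for cluster in sorted(set(labels)):
        candidates = [t for t in vocab if positively_associated(docs, labels, t, cluster)]
        candidates.sort(key=lambda t: -nmi(docs, labels, t, cluster))
        result[cluster] = candidates[:top_k]
    return result


# Toy data: each document is a set of topical terms, with one cluster id per document.
docs = [{"gene", "protein"}, {"gene", "cell"}, {"query", "ranking"}, {"query", "index"}]
labels = [0, 0, 1, 1]
cluster_labels = label_clusters(docs, labels, top_k=1)
```

Under the same assumptions, a lexical fingerprint for a cluster could be taken as the vector of these NMI scores over the common term set, which makes fingerprints from different clustering solutions directly comparable.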
Keeping track of the ever-increasing body of scientific literature is an escalating challenge. We present PubTree, a hierarchical search tool that efficiently searches the PubMed/MEDLINE dataset using a decision tree constructed from more than 26 million abstracts. The tool is implemented as a web page, where users are asked a series of eighteen questions to locate pertinent articles. The implementation of this hierarchical search tool highlights issues endemic to document retrieval. However, the construction of this tree indicates that, with future developments, hierarchical search could become an effective tool (or adjunct) in the mining of biological literature.
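As a rough back-of-the-envelope check on the scale involved, the sketch below assumes that each of the eighteen questions is a yes/no split and that the tree is balanced. Neither assumption is stated in the abstract, and the abstract count is the lower bound it quotes.

```python
# Assumption: every question is binary and the tree is perfectly balanced.
questions = 18
abstracts = 26_000_000  # lower bound quoted in the abstract (">26 million")

leaves = 2 ** questions          # maximum number of distinct answer paths
per_leaf = abstracts / leaves    # average abstracts left after all questions

print(f"{leaves} leaves, about {per_leaf:.0f} abstracts per leaf")
```

Under these assumptions, answering all eighteen questions narrows the collection to roughly a hundred abstracts on average.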
Feb 28 2017 cs.IR
Related Pins is the Web-scale recommender system that powers over 40% of user engagement on Pinterest. This paper is a longitudinal study of three years of its development, exploring the evolution of the system and its components from prototypes to their present state. Each component was originally built under tight constraints on engineering effort and computational resources, so we prioritized the simplest and highest-leverage solutions. We show how organic growth led to a complex system and how we managed this complexity. Many challenges arose while building this system, such as avoiding feedback loops, evaluating performance, activating content, and eliminating legacy heuristics. Finally, we offer suggestions for tackling these challenges when engineering Web-scale recommender systems.
Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using only a small, fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods.