# Information Retrieval (cs.IR)

• With an exponentially growing number of scientific papers published each year, advanced tools for exploring and discovering publications of interest are becoming indispensable. To empower users beyond a simple keyword search provided e.g. by Google Scholar, we present the novel web application PubVis. Powered by a variety of machine learning techniques, it combines essential features to help researchers find the content most relevant to them. An interactive visualization of a large collection of scientific publications provides an overview of the field and encourages the user to explore articles beyond a narrow research focus. This is augmented by personalized content based article recommendations as well as an advanced full text search to discover relevant references. The open sourced implementation of the app can be easily set up and run locally on a desktop computer to provide access to content tailored to the specific needs of individual users. Additionally, a PubVis demo with access to a collection of 10,000 papers can be tested online.
• In the last decade, driven also by the availability of an unprecedented computational power and storage capabilities in cloud environments we assisted to the proliferation of new algorithms, methods, and approaches in two areas of artificial intelligence: knowledge representation and machine learning. On the one side, the generation of a high rate of structured data on the Web led to the creation and publication of the so-called knowledge graphs. On the other side, deep learning emerged as one of the most promising approaches in the generation and training of models that can be applied to a wide variety of application fields. More recently, autoencoders have proven their strength in various scenarios, playing a fundamental role in unsupervised learning. In this paper, we instigate how to exploit the semantic information encoded in a knowledge graph to build connections between units in a Neural Network, thus leading to a new method, SEM-AUTO, to extract and weigh semantic features that can eventually be used to build a recommender system. As adding content-based side information may mitigate the cold user problems, we tested how our approach behave in the presence of a few rating from a user on the Movielens 1M dataset and compare results with BPRSLIM.
• In this paper, we present a semi-supervised learning algorithm for classification of text documents. A method of labeling unlabeled text documents is presented. The presented method is based on the principle of divide and conquer strategy. It uses recursive K-means algorithm for partitioning both labeled and unlabeled data collection. The K-means algorithm is applied recursively on each partition till a desired level partition is achieved such that each partition contains labeled documents of a single class. Once the desired clusters are obtained, the respective cluster centroids are considered as representatives of the clusters and the nearest neighbor rule is used for classifying an unknown text document. Series of experiments have been conducted to bring out the superiority of the proposed model over other recent state of the art models on 20Newsgroups dataset.
• In this work, a problem associated with imbalanced text corpora is addressed. A method of converting an imbalanced text corpus into a balanced one is presented. The presented method employs a clustering algorithm for conversion. Initially to avoid curse of dimensionality, an effective representation scheme based on term class relevancy measure is adapted, which drastically reduces the dimension to the number of classes in the corpus. Subsequently, the samples of larger sized classes are grouped into a number of subclasses of smaller sizes to make the entire corpus balanced. Each subclass is then given a single symbolic vector representation by the use of interval valued features. This symbolic representation in addition to being compact helps in reducing the space requirement and also the classification time. The proposed model has been empirically demonstrated for its superiority on bench marking datasets viz., Reuters 21578 and TDT2. Further, it has been compared against several other existing contemporary models including model based on support vector machine. The comparative analysis indicates that the proposed model outperforms the other existing models.
• Recent advances in neural networks have inspired people to design hybrid recommendation algorithms that can incorporate both (1) user-item interaction information and (2) content information including image, audio, and text. Despite their promising results, neural network-based recommendation algorithms pose extensive computational costs, making it challenging to scale and improve upon. In this paper, we propose a general neural network-based recommendation framework, which subsumes several existing state-of-the-art recommendation algorithms, and address the efficiency issue by investigating sampling strategies in the stochastic gradient descent training for the framework. We tackle this issue by first establishing a connection between the loss functions and the user-item interaction bipartite graph, where the loss function terms are defined on links while major computation burdens are located at nodes. We call this type of loss functions "graph-based" loss functions, for which varied mini-batch sampling strategies can have different computational costs. Based on the insight, three novel sampling strategies are proposed, which can significantly improve the training efficiency of the proposed framework (up to $\times 30$ times speedup in our experiments), as well as improving the recommendation performance. Theoretical analysis is also provided for both the computational cost and the convergence. We believe the study of sampling strategies have further implications on general graph-based loss functions, and would also enable more research under the neural network-based recommendation framework.
• We present an interface that can be leveraged to quickly and effortlessly elicit people's preferences for visual stimuli, such as photographs, visual art and screensavers, along with rich side-information about its users. We plan to employ the new interface to collect dense recommender datasets that will complement existing sparse industry-scale datasets. The new interface and the collected datasets are intended to foster integration of research in recommender systems with research in social and behavioral sciences. For instance, we will use the datasets to assess the diversity of human preferences in different domains of visual experience. Further, using the datasets we will be able to measure crucial psychological effects, such as preference consistency, scale acuity and anchoring biases. Last, we the datasets will facilitate evaluation in counterfactual learning experiments.