Besides the text content, documents and their associated words usually come with rich sets of meta information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially when the word-occurrence information in the training data is insufficient. In this paper, we present a topic model, called MetaLDA, which is able to leverage either document or word meta information, or both of them jointly. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta information. Extensive experiments on several real-world datasets demonstrate that our model achieves comparable or improved performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, compared with other models using meta information, our model runs significantly faster.
Motivated by the reconstruction and the prediction of electricity consumption, we extend Nonnegative Matrix Factorization (NMF) to take into account side information (column or row features). We consider general linear measurement settings, and propose a framework which models non-linear relationships between features and the response variables. We extend previous theoretical results to obtain a sufficient condition on the identifiability of the NMF in this setting. Based on the classical Hierarchical Alternating Least Squares (HALS) algorithm, we propose a new algorithm (HALSX, or Hierarchical Alternating Least Squares with eXogeneous variables) which estimates the factorization model. The algorithm is validated on both simulated and real electricity consumption datasets as well as a recommendation dataset, to show its performance in matrix recovery and prediction for new rows and columns.
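For orientation, the classical HALS update that the abstract builds on can be sketched as follows. This is a minimal illustration of plain HALS for NMF only, not the HALSX algorithm: it uses no side information and no linear measurement operator. The function name `hals_nmf`, its parameters, and the random initialisation are assumptions made for this sketch.

```python
import numpy as np

def hals_nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Classical HALS for V ~ W @ H with W, H >= 0 (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        # Update each column of W in turn, holding H and the other columns fixed.
        VHt = V @ H.T            # shape (m, rank)
        HHt = H @ H.T            # shape (rank, rank)
        for k in range(rank):
            step = (VHt[:, k] - W @ HHt[:, k]) / max(HHt[k, k], eps)
            W[:, k] = np.maximum(eps, W[:, k] + step)
        # Update each row of H in turn, holding W and the other rows fixed.
        WtV = W.T @ V            # shape (rank, n)
        WtW = W.T @ W            # shape (rank, rank)
        for k in range(rank):
            step = (WtV[k, :] - WtW[k, :] @ H) / max(WtW[k, k], eps)
            H[k, :] = np.maximum(eps, H[k, :] + step)
    return W, H
```

Each inner update is the exact minimiser of the squared Frobenius error over one column of W (or one row of H) under the nonnegativity constraint, which is why the reconstruction error does not increase from one sweep to the next.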
Sep 20 2017 stat.AP
The health effects of environmental exposures have been studied for decades, typically using standard regression models to assess exposure-outcome associations found in observational non-experimental data. We propose and illustrate a different approach to examine causal effects of environmental exposures on health outcomes from observational data. Our strategy attempts to structure the observational data to approximate data from a hypothetical, but realistic, randomized experiment. This approach, based on insights from classical experimental design, involves four stages, and relies on modern computing to implement the effort in two of the four stages. More specifically, our strategy involves: 1) a conceptual stage that involves the precise formulation of the causal question in terms of a hypothetical randomized experiment where the exposure is assigned to units; 2) a design stage that attempts to reconstruct (or approximate) a randomized experiment before any outcome data are observed; 3) a statistical analysis comparing the outcomes of interest in the exposed and non-exposed units of the hypothetical randomized experiment; and 4) a summary stage providing conclusions about statistical evidence for the sizes of possible causal effects of the exposure on outcomes. We illustrate our approach using an example examining the effect of parental smoking on children's lung function collected in families living in East Boston in the 1970s. To complement the traditional purely model-based approaches, our strategy, which includes outcome-free matched sampling, provides workable tools to quantify possible detrimental exposure effects on human health outcomes, especially because it also includes transparent diagnostics to assess the assumptions of the four-stage statistical approach being applied.
This paper explores discrete Dynamic Causal Modeling (DDCM) and its relationship with Directed Information (DI). We prove the conditional equivalence between DDCM and DI in characterizing the causal relationship between two brain regions. The theoretical results are demonstrated using fMRI data obtained under both resting-state and stimulus-based conditions. Our numerical analysis is consistent with results reported in previous work.
Sep 20 2017 stat.AP
We consider the situation where a temporal process is composed of contiguous segments with differing slopes and replicated noise-corrupted time series measurements are observed. The unknown mean of the data generating process is modelled as a piecewise linear function of time with an unknown number of change-points. We develop a Bayesian approach to infer the joint posterior distribution of the number and position of change-points as well as the unknown mean parameters. A priori, the proposed model uses an overfitting number of mean parameters but, conditionally on a set of change-points, only a subset of them influences the likelihood. An exponentially decreasing prior distribution on the number of change-points gives rise to a posterior distribution concentrating on sparse representations of the underlying sequence. A Metropolis-Hastings Markov chain Monte Carlo (MCMC) sampler is constructed for approximating the posterior distribution. Our method is benchmarked using simulated data and is applied to uncover differences in the dynamics of fungal growth from imaging time course data collected from different strains. The source code is available online.
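To make the mean structure concrete, a piecewise linear function of time with change-points can be sketched as below. This only illustrates the kind of mean function the abstract describes; it is not the authors' code or their sampler. The function name `piecewise_linear_mean`, the assumption that time starts at 0, and the assumption that the mean is continuous at each change-point are choices made for this sketch and are not stated in the abstract.

```python
from bisect import bisect_right

def piecewise_linear_mean(t, change_points, slopes, intercept=0.0):
    """Continuous piecewise linear mean at time t >= 0 (illustrative sketch).

    change_points: sorted change-point times (assumed > 0).
    slopes: one slope per segment, so len(slopes) == len(change_points) + 1.
    intercept: value of the mean at time 0.
    """
    if len(slopes) != len(change_points) + 1:
        raise ValueError("need exactly one slope per segment")
    # Index of the segment that contains t.
    seg = bisect_right(change_points, t)
    value = intercept
    start = 0.0
    # Add the rise over every segment that lies entirely before t.
    for i in range(seg):
        value += slopes[i] * (change_points[i] - start)
        start = change_points[i]
    # Add the partial rise within the segment containing t.
    return value + slopes[seg] * (t - start)
```

For example, with change-points at times 2 and 5 and slopes 1, 0 and -2, the mean rises, stays flat, and then declines; conditionally on the change-points, only the slopes of the segments actually present affect the fitted mean.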