Automatic detection of anomalies in space- and time-varying measurements is an important tool in several fields, e.g., fraud detection, climate analysis, or healthcare monitoring. We present an algorithm for detecting anomalous regions in multivariate spatio-temporal time-series, which makes it possible to spot the interesting parts in large amounts of data, including video and text data. In contrast to existing techniques for detecting isolated anomalous data points, we propose the "Maximally Divergent Intervals" (MDI) framework for unsupervised detection of coherent spatial regions and time intervals characterized by a high Kullback-Leibler divergence relative to the rest of the data. To this end, we define an unbiased Kullback-Leibler divergence that allows for ranking regions of different sizes, and we show how an interval proposal technique enables the algorithm to run on large-scale data sets in reasonable time. Experiments on both synthetic and real data from various domains, such as climate analysis, video surveillance, and text forensics, demonstrate that our method is widely applicable and a valuable tool for finding interesting events in different types of data.
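To make the core idea concrete, the following is a minimal sketch of scoring intervals by their divergence from the remaining data. It is an illustration under strong assumptions, not the authors' implementation: it handles only a univariate series, fits a single Gaussian inside and outside each candidate interval, uses the plain (not the unbiased) Kullback-Leibler divergence, and scans all intervals by brute force instead of using interval proposals. The function names and the parameters `min_len`, `max_len`, and `eps` are hypothetical.

```python
import math
import random


def mean_and_variance(values):
    """Return the mean and population variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def most_divergent_interval(series, min_len=5, max_len=30, eps=1e-6):
    """Brute-force scan: score every interval [start, end) against the rest."""
    best_score, best_interval = float("-inf"), None
    n = len(series)
    for start in range(n - min_len + 1):
        for length in range(min_len, min(max_len, n - start) + 1):
            end = start + length
            inside = series[start:end]
            outside = series[:start] + series[end:]
            if len(outside) < 2:
                continue  # not enough remaining data to estimate a variance
            mu_in, var_in = mean_and_variance(inside)
            mu_out, var_out = mean_and_variance(outside)
            # eps guards against a zero variance (assumption of this sketch)
            score = gaussian_kl(mu_in, var_in + eps, mu_out, var_out + eps)
            if score > best_score:
                best_score, best_interval = score, (start, end)
    return best_score, best_interval


# Synthetic example: Gaussian noise with a mean shift injected in [120, 140).
rng = random.Random(0)
series = [rng.gauss(0.0, 1.0) for _ in range(200)]
for t in range(120, 140):
    series[t] += 6.0

score, (start, end) = most_divergent_interval(series)
print(f"most divergent interval: [{start}, {end}) with score {score:.2f}")
```

The brute-force scan costs roughly O(n · max_len · n) operations here, which is why a proposal step or cumulative sums would be needed for large data sets.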
In the cost per click (CPC) pricing model, an advertiser pays an ad network only when a user clicks on an ad; in turn, the ad network gives a share of that revenue to the publisher where the ad was impressed. Still, advertisers may be unsatisfied with ad networks charging them for "valueless" clicks, or so-called accidental clicks. [...] Charging advertisers for such clicks is detrimental in the long term, as the advertiser may decide to run their campaigns on other ad networks. In addition, machine-learned click models trained to predict which ad will bring the highest revenue may overestimate an ad's click-through rate and, as a consequence, negatively impact revenue for both the ad network and the publisher. In this work, we propose a data-driven method to detect accidental clicks from the perspective of the ad network. We collect observations of time spent by users on a large set of ad landing pages, i.e., dwell time. We notice that the majority of per-ad dwell time distributions fit a mixture of distributions, where each component may correspond to a particular type of click, the first one being accidental. We then estimate dwell time thresholds for accidental clicks from that component. Using our method to identify accidental clicks, we then propose a technique that smoothly discounts the advertiser's cost of accidental clicks at billing time. Experiments conducted on a large dataset of ads served on Yahoo mobile apps confirm that our thresholds are stable over time and that revenue loss in the short term is marginal. We also compare the performance of an existing machine-learned click model trained on all ad clicks with that of the same model trained only on non-accidental clicks. There, we observe an increase in both ad click-through rate (+3.9%) and revenue (+0.2%) on ads served by the Yahoo Gemini network when using the latter. [...]
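As a rough illustration of the mixture idea, the sketch below fits a two-component Gaussian mixture to log dwell times with EM and uses the posterior probability of the short-dwell component to discount a click's cost. Everything specific here is an assumption made for the example: the abstract does not state the component family, the number of components, or the discount formula, and the data are synthetic. The names `fit_two_component_em`, `accidental_probability`, and `discounted_cost` are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_em(xs, iters=100):
    """EM for a 1-D mixture of two Gaussians; component 0 is the short-dwell one."""
    n = len(xs)
    ordered = sorted(xs)
    mu = [ordered[n // 4], ordered[3 * n // 4]]  # initialise means at the quartiles
    overall_mean = sum(xs) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / n)
    sigma = [overall_sd, overall_sd]
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for each observation
        resp0 = []
        for x in xs:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            total = p0 + p1
            resp0.append(p0 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean, and standard deviation per component
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - v for v in resp0]
            nk = sum(r)
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
            weight[k] = nk / n
    return weight, mu, sigma


def accidental_probability(log_dwell, weight, mu, sigma):
    """Posterior probability that a click belongs to the short-dwell component."""
    p0 = weight[0] * normal_pdf(log_dwell, mu[0], sigma[0])
    p1 = weight[1] * normal_pdf(log_dwell, mu[1], sigma[1])
    return p0 / (p0 + p1)


def discounted_cost(cost, log_dwell, weight, mu, sigma):
    """Smoothly scale the cost by the probability the click was not accidental."""
    return cost * (1.0 - accidental_probability(log_dwell, weight, mu, sigma))


# Synthetic log dwell times (seconds): short "accidental" and longer clicks.
rng = random.Random(1)
log_dwell_times = [rng.gauss(0.0, 0.4) for _ in range(300)]
log_dwell_times += [rng.gauss(math.log(30.0), 0.8) for _ in range(700)]

weight, mu, sigma = fit_two_component_em(log_dwell_times)
print("component means (seconds):", [round(math.exp(m), 2) for m in mu])
print("cost of a 0.8 s click:", round(discounted_cost(1.0, math.log(0.8), weight, mu, sigma), 4))
print("cost of a 60 s click:", round(discounted_cost(1.0, math.log(60.0), weight, mu, sigma), 4))
```

A soft, probability-based discount avoids a hard cutoff at a single dwell time threshold; whether this matches the discounting used in the paper cannot be determined from the abstract.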
The allocation of a (treatment) condition-effect on the wrong principal component (misallocation of variance) in principal component analysis (PCA) has been addressed in research on event-related potentials of the electroencephalogram. However, the correct allocation of condition-effects on PCA components might be relevant in several domains of research. The present paper investigates whether different loading patterns at each condition-level are a basis for an optimal allocation of between-condition variance on principal components. It turns out that a similar loading shape at each condition-level is a necessary condition for an optimal allocation of between-condition variance, whereas a similar loading magnitude is not necessary.
Apr 20 2018 stat.AP
A common phenomenon in cancer syndromes is for an individual to have multiple primary cancers at different sites during his/her lifetime. Patients with Li-Fraumeni syndrome (LFS), a rare pediatric cancer syndrome mainly caused by germline TP53 mutations, are known to have a higher probability of developing a second primary cancer than those with other cancer syndromes. In this context, it is desirable to model the development of multiple primary cancers to enable better clinical management of LFS. Here, we propose a Bayesian recurrent event model based on a non-homogeneous Poisson process in order to obtain penetrance estimates for multiple primary cancers related to LFS. We employ a family-wise likelihood that facilitates the use of genetic information inherited through the family pedigree, and we adjust for the ascertainment bias that is inevitable in studies of rare diseases by using an inverse probability weighting scheme. We applied the proposed method to data on LFS, using a family cohort collected through pediatric sarcoma patients at MD Anderson Cancer Center from 1944 to 1982. Both internal and external validation studies showed that the proposed model provides reliable penetrance estimates for multiple primary cancers in LFS, which, to the best of our knowledge, have not been reported in the LFS literature.
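For readers unfamiliar with non-homogeneous Poisson processes, the sketch below simulates recurrent event ages by thinning and checks the standard relation between the cumulative intensity Λ(t) and the probability of at least one event by age t, which is 1 − exp(−Λ(t)). It is only an illustration of the underlying process: the linear intensity, its slope `RATE_SLOPE`, and the age limit are invented for the example, and the Bayesian estimation, family-wise likelihood, and ascertainment correction described in the abstract are not reproduced.

```python
import math
import random

RATE_SLOPE = 0.0005  # hypothetical: intensity grows linearly with age
MAX_AGE = 80.0       # hypothetical follow-up horizon in years


def intensity(age):
    """Event rate per year at a given age (assumed linear for illustration)."""
    return RATE_SLOPE * age


def cumulative_intensity(age):
    """Integral of the intensity from 0 to age."""
    return 0.5 * RATE_SLOPE * age ** 2


def simulate_event_ages(rng, max_age=MAX_AGE):
    """Simulate one individual's event ages by thinning a homogeneous process."""
    rate_bound = intensity(max_age)  # the intensity is increasing, so this bounds it
    ages, t = [], 0.0
    while True:
        t += rng.expovariate(rate_bound)  # candidate event from the bounding process
        if t > max_age:
            return ages
        if rng.random() < intensity(t) / rate_bound:  # accept with prob lambda(t) / bound
            ages.append(t)


rng = random.Random(42)
n_individuals = 5000
histories = [simulate_event_ages(rng) for _ in range(n_individuals)]

age = 60.0
empirical = sum(1 for h in histories if h and h[0] <= age) / n_individuals
theoretical = 1.0 - math.exp(-cumulative_intensity(age))
print(f"P(first event by age {age:.0f}): simulated {empirical:.3f}, exact {theoretical:.3f}")
```

Because each simulated history can contain several events, the same setup also gives empirical probabilities for a second event by a given age, which is the kind of quantity a recurrent event model targets.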