# Statistics (stat)

• The estimation of optimal treatment regimes is of considerable interest to precision medicine. In this work, we propose a causal $k$-nearest neighbor method to estimate the optimal treatment regime. The method roots in the framework of causal inference, and estimates the causal treatment effects within the nearest neighborhood. Although the method is simple, it possesses nice theoretical properties. We show that the causal $k$-nearest neighbor regime is universally consistent. That is, the causal $k$-nearest neighbor regime will eventually learn the optimal treatment regime as the sample size increases. We also establish its convergence rate. However, the causal $k$-nearest neighbor regime may suffer from the curse of dimensionality, i.e. performance deteriorates as dimensionality increases. To alleviate this problem, we develop an adaptive causal $k$-nearest neighbor method to perform metric selection and variable selection simultaneously. The performance of the proposed methods is illustrated in simulation studies and in an analysis of a chronic depression clinical trial.
• We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach Markov Chain Las Vegas (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has maximum number of Markov chain steps K (referred as MCLV-K). We present a MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K), with LVS-K significantly outperforming CD-K training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
• Given a matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ and a vector $b \in\mathbb{R}^{d}$, we show how to compute an $\epsilon$-approximate solution to the regression problem $\min_{x\in\mathbb{R}^{d}}\frac{1}{2} \|\mathbf{A} x - b\|_{2}^{2}$ in time $\tilde{O} ((n+\sqrt{d\cdot\kappa_{\text{sum}}})\cdot s\cdot\log\epsilon^{-1})$ where $\kappa_{\text{sum}}=\mathrm{tr}\left(\mathbf{A}^{\top}\mathbf{A}\right)/\lambda_{\min}(\mathbf{A}^{T}\mathbf{A})$ and $s$ is the maximum number of non-zero entries in a row of $\mathbf{A}$. Our algorithm improves upon the previous best running time of $\tilde{O} ((n+\sqrt{n \cdot\kappa_{\text{sum}}})\cdot s\cdot\log\epsilon^{-1})$. We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever leverage scores of rows are small (up to polylogarithmic factors). We also provide a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems.
• Feature selection plays a critical role in data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that strike an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
• Effective utilization of photovoltaic (PV) plants requires weather variability robust global solar radiation (GSR) forecasting models. Random weather turbulence phenomena coupled with assumptions of clear sky model as suggested by Hottel pose significant challenges to parametric & non-parametric models in GSR conversion rate estimation. Also, a decent GSR estimate requires costly high-tech radiometer and expert dependent instrument handling and measurements, which are subjective. As such, a computer aided monitoring (CAM) system to evaluate PV plant operation feasibility by employing smart grid past data analytics and deep learning is developed. Our algorithm, SolarisNet is a 6-layer deep neural network trained on data collected at two weather stations located near Kalyani metrological site, West Bengal, India. The daily GSR prediction performance using SolarisNet outperforms the existing state of art and its efficacy in inferring past GSR data insights to comprehend daily and seasonal GSR variability along with its competence for short term forecasting is discussed.
• An orthogonally equivariant estimator for the covariance matrix is proposed that is valid when the dimension $p$ is larger than the sample size $n$. Equivariance under orthogonal transformations is a less restrictive assumption than structural assumptions on the true covariance matrix. It reduces the problem of estimation of the covariance matrix to that of estimation of its eigenvalues. In this paper, the eigenvalue estimates are obtained from an adjusted likelihood function derived by approximating the integral over the eigenvectors of the sample covariance matrix, which is a challenging problem in its own right. Comparisons with two well-known orthogonally equivariant estimators are given, which are based on Monte-Carlo risk estimates for simulated data and misclassification errors in a real data analysis.
• We present an efficient alternating direction method of multipliers (ADMM) algorithm for segmenting a multivariate non-stationary time series with structural breaks into stationary regions. We draw from recent work where the series is assumed to follow a vector autoregressive model within segments and a convex estimation procedure may be formulated using group fused lasso penalties. Our ADMM approach first splits the convex problem into a global quadratic program and a simple group lasso proximal update. We show that the global problem may be parallelized over rows of the time dependent transition matrices and furthermore that each subproblem may be rewritten in a form identical to the log-likelihood of a Gaussian state space model. Consequently, we develop a Kalman smoothing algorithm to solve the global update in time linear in the length of the series.
• In this paper, a scale mixture of Normal distributions model is developed for classification and clustering of data having outliers and missing values. The classification method, based on a mixture model, focuses on the introduction of latent variables that gives us the possibility to handle sensitivity of model to outliers and to allow a less restrictive modelling of missing data. Inference is processed through a Variational Bayesian Approximation and a Bayesian treatment is adopted for model learning, supervised classification and clustering.
• Hash codes are efficient data representations for coping with the ever growing amounts of data. In this paper, we introduce a random forest semantic hashing scheme that embeds tiny convolutional neural networks (CNN) into shallow random forests, with near-optimal information-theoretic code aggregation among trees. We start with a simple hashing scheme, where random trees in a forest act as hashing functions by setting 1' for the visited tree leaf, and 0' for the rest. We show that traditional random forests fail to generate hashes that preserve the underlying similarity between the trees, rendering the random forests approach to hashing challenging. To address this, we propose to first randomly group arriving classes at each tree split node into two groups, obtaining a significantly simplified two-class classification problem, which can be handled using a light-weight CNN weak learner. Such random class grouping scheme enables code uniqueness by enforcing each class to share its code with different classes in different trees. A non-conventional low-rank loss is further adopted for the CNN weak learners to encourage code consistency by minimizing intra-class variations and maximizing inter-class distance for the two random class groups. Finally, we introduce an information-theoretic approach for aggregating codes of individual trees into a single hash code, producing a near-optimal unique hash for each class. The proposed approach significantly outperforms state-of-the-art hashing methods for image retrieval tasks on large-scale public datasets, while performing at the level of other state-of-the-art image classification techniques while utilizing a more compact and efficient scalable representation. This work proposes a principled and robust procedure to train and deploy in parallel an ensemble of light-weight CNNs, instead of simply going deeper.
• The diagnosis of Alzheimer's disease (AD) in routine clinical practice is most commonly based on subjective clinical interpretations. Quantitative electroencephalography (QEEG) measures have been shown to reflect neurodegenerative processes in AD and might qualify as affordable and thereby widely available markers to facilitate the objectivization of AD assessment. Here, we present a novel framework combining Riemannian tangent space mapping and elastic net regression for the development of brain atrophy markers. While most AD QEEG studies are based on small sample sizes and psychological test scores as outcome measures, here we train and test our models using data of one of the largest prospective EEG AD trials ever conducted, including MRI biomarkers of brain atrophy.
• Nov 23 2017 cs.DB stat.ML arXiv:1711.08330v1
In this paper we address cardinality estimation problem which is an important subproblem in query optimization. Query optimization is a part of every relational DBMS responsible for finding the best way of the execution for the given query. These ways are called plans. The execution time of different plans may differ by several orders, so query optimizer has a great influence on the whole DBMS performance. We consider cost-based query optimization approach as the most popular one. It was observed that cost-based optimization quality depends much on cardinality estimation quality. Cardinality of the plan node is the number of tuples returned by it. In the paper we propose a novel cardinality estimation approach with the use of machine learning methods. The main point of the approach is using query execution statistics of the previously executed queries to improve cardinality estimations. We called this approach adaptive cardinality estimation to reflect this point. The approach is general, flexible, and easy to implement. The experimental evaluation shows that this approach significantly increases the quality of cardinality estimation, and therefore increases the DBMS performance for some queries by several times or even by several dozens of times.
• We consider the problem of estimating the joint distribution $P$ of $n$ independent random variables within the Bayes paradigm from a non-asymptotic point of view. Assuming that $P$ admits some density $s$ with respect to a given reference measure, we consider a density model $\overline S$ for $s$ that we endow with a prior distribution $\pi$ (with support $\overline S$) and we build a robust alternative to the classical Bayes posterior distribution which possesses similar concentration properties around $s$ whenever it belongs to the model $\overline S$. Furthermore, in density estimation, the Hellinger distance between the classical and the robust posterior distributions tends to 0, as the number of observations tends to infinity, under suitable assumptions on the model and the prior, provided that the model $\overline S$ contains the true density $s$. However, unlike what happens with the classical Bayes posterior distribution, we show that the concentration properties of this new posterior distribution are still preserved in the case of a misspecification of the model, that is when $s$ does not belong to $\overline S$ but is close enough to it with respect to the Hellinger distance.
• Convolutional neural networks (CNNs) have been generally acknowledged as one of the driving forces for the advancement of computer vision. Despite their promising performances on many tasks, CNNs still face major obstacles on the road to achieving ideal machine intelligence. One is the difficulty of interpreting them and understanding their inner workings, which is important for diagnosing their failures and correcting them. Another is that standard CNNs require large amounts of annotated data, which is sometimes very hard to obtain. Hence, it is desirable to enable them to learn from few examples. In this work, we address these two limitations of CNNs by developing novel and interpretable models for few-shot learning. Our models are based on the idea of encoding objects in terms of visual concepts, which are interpretable visual cues represented within CNNs. We first use qualitative visualizations and quantitative statistics, to uncover several key properties of feature encoding using visual concepts. Motivated by these properties, we present two intuitive models for the problem of few-shot learning. Experiments show that our models achieve competitive performances, while being much more flexible and interpretable than previous state-of-the-art few-shot learning methods. We conclude that visual concepts expose the natural capability of CNNs for few-shot learning.
• The goal of graph representation learning is to embed each vertex in a graph into a low-dimensional vector space. Existing graph representation learning methods can be classified into two categories: generative models that learn the underlying connectivity distribution in the graph, and discriminative models that predict the probability of edge existence between a pair of vertices. In this paper, we propose GraphGAN, an innovative graph representation learning framework unifying above two classes of methods, in which the generative model and discriminative model play a game-theoretical minimax game. Specifically, for a given vertex, the generative model tries to fit its underlying true connectivity distribution over all other vertices and produces "fake" samples to fool the discriminative model, while the discriminative model tries to detect whether the sampled vertex is from ground truth or generated by the generative model. With the competition between these two models, both of them can alternately and iteratively boost their performance. Moreover, when considering the implementation of generative model, we propose a novel graph softmax to overcome the limitations of traditional softmax function, which can be proven satisfying desirable properties of normalization, graph structure awareness, and computational efficiency. Through extensive experiments on real-world datasets, we demonstrate that GraphGAN achieves substantial gains in a variety of applications, including link prediction, node classification, and recommendation, over state-of-the-art baselines.
• We consider the problem of sparse variable selection on high dimension heterogeneous data sets, which has been taken on renewed interest recently due to the growth of biological and medical data sets with complex, non-i.i.d. structures and prolific response variables. The heterogeneity is likely to confound the association between explanatory variables and responses, resulting in a wealth of false discoveries when Lasso or its variants are naïvely applied. Therefore, the research interest of developing effective confounder correction methods is growing. However, ordinarily employing recent confounder correction methods will result in undesirable performance due to the ignorance of the convoluted interdependency among the prolific response variables. To fully improve current variable selection methods, we introduce a model that can utilize the dependency information from multiple responses to select the active variables from heterogeneous data. Through extensive experiments on synthetic and real data sets, we show that our proposed model outperforms the existing methods.
• We tackle the problem of constructive preference elicitation, that is the problem of learning user preferences over very large decision problems, involving a combinatorial space of possible outcomes. In this setting, the suggested configuration is synthesized on-the-fly by solving a constrained optimization problem, while the preferences are learned itera tively by interacting with the user. Previous work has shown that Coactive Learning is a suitable method for learning user preferences in constructive scenarios. In Coactive Learning the user provides feedback to the algorithm in the form of an improvement to a suggested configuration. When the problem involves many decision variables and constraints, this type of interaction poses a significant cognitive burden on the user. We propose a decomposition technique for large preference-based decision problems relying exclusively on inference and feedback over partial configurations. This has the clear advantage of drastically reducing the user cognitive load. Additionally, part-wise inference can be (up to exponentially) less computationally demanding than inference over full configurations. We discuss the theoretical implications of working with parts and present promising empirical results on one synthetic and two realistic constructive problems.
• Deep Learning models are vulnerable to adversarial examples, i.e.\ images obtained via deliberate imperceptible perturbations, such that the model misclassifies them with high confidence. However, class confidence by itself is an incomplete picture of uncertainty. We therefore use principled Bayesian methods to capture model uncertainty in prediction for observing adversarial misclassification. We provide an extensive study with different Bayesian neural networks attacked in both white-box and black-box setups. The behaviour of the networks for noise, attacks and clean test data is compared. We observe that Bayesian neural networks are uncertain in their predictions for adversarial perturbations, a behaviour similar to the one observed for random Gaussian perturbations. Thus, we conclude that Bayesian neural networks can be considered for detecting adversarial examples.
• Estimating large covariance matrices has been a longstanding important problem in many applications and has attracted increased attention over several decades. This paper deals with two methods based on pre-existing works to impose sparsity on the covariance matrix via its unit lower triangular matrix (aka Cholesky factor) $\mathbf{T}$. The first method serves to estimate the entries of $\mathbf{T}$ using the Ordinary Least Squares (OLS), then imposes sparsity by exploiting some generalized thresholding techniques such as Soft and Smoothly Clipped Absolute Deviation (SCAD). The second method directly estimates a sparse version of $\mathbf{T}$ by penalizing the negative normal log-likelihood with $L_1$ and SCAD penalty functions. The resulting covariance estimators are always guaranteed to be positive definite. Some Monte-Carlo simulations as well as experimental data demonstrate the effectiveness of our estimators for hyperspectral anomaly detection using the Kelly anomaly detector.
• Many cognitive, sensory and motor processes have correlates in oscillatory neural sources, which are embedded as a subspace into the recorded brain signals. Decoding such processes from noisy magnetoencephalogram/electroencephalogram (M/EEG) signals usually requires the use of data-driven analysis methods. The objective evaluation of such decoding algorithms on experimental raw signals, however, is a challenge: the amount of available M/EEG data typically is limited, labels can be unreliable, and raw signals often are contaminated with artifacts. The latter is specifically problematic, if the artifacts stem from behavioral confounds of the oscillatory neural processes of interest. To overcome some of these problems, simulation frameworks have been introduced for benchmarking decoding methods. Generating artificial brain signals, however, most simulation frameworks make strong and partially unrealistic assumptions about brain activity, which limits the generalization of obtained results to real-world conditions. In the present contribution, we thrive to remove many shortcomings of current simulation frameworks and propose a versatile alternative, that allows for objective evaluation and benchmarking of novel data-driven decoding methods for neural signals. Its central idea is to utilize post-hoc labelings of arbitrary M/EEG recordings. This strategy makes it paradigm-agnostic and allows to generate comparatively large datasets with noiseless labels. Source code and data of the novel simulation approach are made available for facilitating its adoption.
• In this paper we are interested in multifractional stable processes where the self-similarity index $H$ is a function of time, in other words $H$ becomes time changing, and the stability index $\alpha$ is a constant. Using $\beta$- negative power variations ($-1/2<\beta<0$), we propose estimators for the value of the multifractional function $H$ at a fixed time $t_0$ and for $\alpha$ for two cases: multifractional Brownian motion ($\alpha=2$) and linear multifractional stable motion ($0<\alpha<2$). We get the consistency of our estimates for the underlying processes with the rate of convergence.
• The graph Laplacian plays key roles in information processing of relational data, and has analogies with the Laplacian in differential geometry. In this paper, we generalize the analogy between graph Laplacian and differential geometry to the hypergraph setting, and propose a novel hypergraph $p$-Laplacian. Unlike the existing two-node graph Laplacians, this generalization makes it possible to analyze hypergraphs, where the edges are allowed to connect any number of nodes. Moreover, we propose a semi-supervised learning method based on the proposed hypergraph $p$-Laplacian, and formalize them as the analogue to the Dirichlet problem, which often appears in physics. We further explore theoretical connections to normalized hypergraph cut on a hypergraph, and propose normalized cut corresponding to hypergraph $p$-Laplacian. The proposed $p$-Laplacian is shown to outperform standard hypergraph Laplacians in the experiment on a hypergraph semi-supervised learning and normalized cut setting.
• While most classical approaches to Granger causality detection repose upon linear time series assumptions, many interactions in neuroscience and economics applications are nonlinear. We develop an approach to nonlinear Granger causality detection using multilayer perceptrons where the input to the network is the past time lags of all series and the output is the future value of a single series. A sufficient condition for Granger non-causality in this setting is that all of the outgoing weights of the input data, the past lags of a series, to the first hidden layer are zero. For estimation, we utilize a group lasso penalty to shrink groups of input weights to zero. We also propose a hierarchical penalty for simultaneous Granger causality and lag estimation. We validate our approach on simulated data from both a sparse linear autoregressive model and the sparse and nonlinear Lorenz-96 model.
• In applications such as clinical safety analysis, the data of the experiments usually consists of frequency counts. In the analysis of such data, researchers often face the problem of multiple testing based on discrete test statistics, aimed at controlling family-wise error rate (FWER). Most existing FWER controlling procedures are developed for continuous data, which are often conservative when analyzing discrete data. By using minimal attainable p-values, several FWER controlling procedures have been developed for discrete data in the literature. In this paper, by utilizing known marginal distributions of true null p-values, three more powerful stepwise procedures are developed, which are modified versions of the conventional Bonferroni, Holm and Hochberg procedures, respectively. It is proved that the first two procedures strongly control the FWER under arbitrary dependence and are more powerful than the existing Tarone-type procedures, while the last one only ensures control of the FWER in special scenarios. Through extensive simulation studies, we provide numerical evidence of superior performance of the proposed procedures in terms of the FWER control and minimal power. A real clinical safety data is used to demonstrate applications of our proposed procedures. An R package "MHTdiscrete" and a web application are developed for implementing the proposed procedures.
• Nov 23 2017 cs.LG stat.ML arXiv:1711.08132v1
Convolutional Neural Networks (CNN) and the locally connected layer are limited in capturing the importance and relations of different local receptive fields, which are often crucial for tasks such as face verification, visual question answering, and word sequence prediction. To tackle the issue, we propose a novel locally smoothed neural network (LSNN) in this paper. The main idea is to represent the weight matrix of the locally connected layer as the product of the kernel and the smoother, where the kernel is shared over different local receptive fields, and the smoother is for determining the importance and relations of different local receptive fields. Specifically, a multi-variate Gaussian function is utilized to generate the smoother, for modeling the location relations among different local receptive fields. Furthermore, the content information can also be leveraged by setting the mean and precision of the Gaussian function according to the content. Experiments on some variant of MNIST clearly show our advantages over CNN and locally connected layer.
• In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses and where the number of covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the PUlasso algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm involves using the majorization-minimization (MM) framework which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of slogp/n, where s is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm out-performs state-of-the-art algorithms in the moderate p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.
• Motivation: How do we integratively analyze large-scale multi-platform genomic data that are high dimensional and sparse? Furthermore, how can we incorporate prior knowledge, such as the association between genes, in the analysis systematically? Method: To solve this problem, we propose a Scalable Network Constrained Tucker decomposition method we call SNeCT. SNeCT adopts parallel stochastic gradient descent approach on the proposed parallelizable network constrained optimization function. SNeCT decomposition is applied to tensor constructed from large scale multi-platform multi-cohort cancer data, PanCan12, constrained on a network built from PathwayCommons database. Results: The decomposed factor matrices are applied to stratify cancers, to search for top-k similar patients, and to illustrate how the matrices can be used for personalized interpretation. In the stratification test, combined twelve-cohort data is clustered to form thirteen subclasses. The thirteen subclasses have a high correlation to tissue of origin in addition to other interesting observations, such as clear separation of OV cancers to two groups, and high clinical correlation within subclusters formed in cohorts BRCA and UCEC. In the top-k search, a new patient's genomic profile is generated and searched against existing patients based on the factor matrices. The similarity of the top-k patient to the query is high for 23 clinical features, including estrogen/progesterone receptor statuses of BRCA patients with average precision value ranges from 0.72 to 0.86 and from 0.68 to 0.86, respectively. We also provide an illustration of how the factor matrices can be used for interpretable personalized analysis of each patient.
• In this note, we provide critical commentary on two articles that cast doubt on the validity and implications of Birnbaum's theorem: Evans (2013) and Mayo (2014). In our view, the proof is correct and the consequences of the theorem are alive and well.
• We consider the problem of estimating means of two Gaussians in a 2-Gaussian mixture, which is not balanced and is corrupted by noise of an arbitrary distribution. We present a robust algorithm to estimate the parameters, together with upper bounds on the numbers of samples required for the estimate to be correct, where the bounds are parametrised by the dimension, ratio of the mixing coefficients, a measure of the separation of the two Gaussians, related to Mahalanobis distance, and a condition number of the covariance matrix. In theory, this is the first sample-complexity result for imbalanced mixtures corrupted by adversarial noise. In practice, our algorithm outperforms the vanilla Expectation-Maximisation (EM) algorithm in terms of estimation error.
• Geophysical and other natural processes often exhibit non-stationary covariances and this feature is important to take into account for statistical models that attempt to emulate the physical process. A convolution-based model is used to represent non-stationary Gaussian processes that allows for variation in the correlation range and vari- ance of the process across space. Application of this model has two steps: windowed estimates of the covariance function under the as- sumption of local stationary and encoding the local estimates into a single spatial process model that allows for efficient simulation. Specifically we give evidence to show that non-stationary covariance functions based on the Matern family can be reproduced by the Lat- ticeKrig model, a flexible, multi-resolution representation of Gaussian processes. We propose to fit locally stationary models based on the Matern covariance and then assemble these estimates into a single, global LatticeKrig model. One advantage of the LatticeKrig model is that it is efficient for simulating non-stationary fields even at 105 locations. This work is motivated by the interest in emulating spatial fields derived from numerical model simulations such as Earth system models. We successfully apply these ideas to emulate fields that de- scribe the uncertainty in the pattern scaling of mean summer (JJA) surface temperature from a series of climate model experiments. This example is significant because it emulates tens of thousands of loca- tions, typical in geophysical model fields, and leverages embarrassing parallel computation to speed up the local covariance fitting
• In the context of model uncertainty and selection, empirical Bayes procedures can have undesirable properties such as extreme estimates of inclusion probabilities (Scott and Berger, 2010) or inconsistency under the null model (Liang et al., 2008). To avoid these issues, we define empirical Bayes priors with constraints that ensure that the estimates of the hyperparameters are at least as "vague" as those of proper default priors. In our examples, we observe that constrained EB procedures are better behaved than their unconstrained counterparts and that the Bayesian Information Criterion (BIC) is similar to an intuitively appealing constrained EB procedure.
• Hippocampal dentate granule cells are among the few neuronal cell types generated throughout adult life in mammals. In the normal brain, new granule cells are generated from progenitors in the subgranular zone and integrate in a typical fashion. During the development of epilepsy, granule cell integration is profoundly altered. The new cells migrate to ectopic locations and develop misoriented basal dendrites. Although it has been established that these abnormal cells are newly generated, it is not known whether they arise ubiquitously throughout the progenitor cell pool or are derived from a smaller number of bad actor progenitors. To explore this question, we conducted a clonal analysis study in mice expressing the Brainbow fluorescent protein reporter construct in dentate granule cell progenitors. Mice were examined 2 months after pilocarpine-induced status epilepticus, a treatment that leads to the development of epilepsy. Brain sections were rendered translucent so that entire hippocampi could be reconstructed and all fluorescently labeled cells identified. Our findings reveal that a small number of progenitors produce the majority of ectopic cells following status epilepticus, indicating that either the affected progenitors or their local microenvironments have become pathological. By contrast, granule cells with basal dendrites were equally distributed among clonal groups. This indicates that these progenitors can produce normal cells and suggests that global factors sporadically disrupt the dendritic development of some new cells. Together, these findings strongly predict that distinct mechanisms regulate different aspects
• In this work, we consider the task of classifying the binary positive-unlabeled (PU) data. The existing discriminative learning based PU models attempt to seek an optimal re-weighting strategy for U data, so that a decent decision boundary can be found. In contrast, we provide a totally new paradigm to attack the binary PU task, from perspective of generative learning by leveraging the powerful generative adversarial networks (GANs). Our generative positive-unlabeled (GPU) learning model is devised to express P and N data distributions. It comprises of three discriminators and two generators with different roles, producing both positive and negative samples that resemble those come from the real training dataset. Even with rather limited labeled P data, our GPU framework is capable of capturing the underlying P and N data distribution with infinite realistic sample streams. In this way, an optimal classifier can be trained on those generated samples using a very deep neural networks (DNNs). Moreover, an useful variant of GPU is also introduced for semi-supervised classification.
• The global sensitivity analysis of time-dependent processes requires history-aware approaches. We develop for that purpose a variance-based method that leverages the correlation structure of the problems under study and employs surrogate models to accelerate the computations. The errors resulting from fixing unimportant uncertain parameters to their nominal values are analyzed through a priori estimates. We illustrate our approach on a harmonic oscillator example and on a nonlinear dynamic cholera model.
• We design new algorithms for the combinatorial pure exploration problem in the multi-arm bandit framework. In this problem, we are given K distributions and a collection of subsets $\mathcal{V} \subset 2^K$ of these distributions, and we would like to find the subset $v \in \mathcal{V}$ that has largest cumulative mean, while collecting, in a sequential fashion, as few samples from the distributions as possible. We study both the fixed budget and fixed confidence settings, and our algorithms essentially achieve state-of-the-art performance in all settings, improving on previous guarantees for structures like matchings and submatrices that have large augmenting sets. Moreover, our algorithms can be implemented efficiently whenever the decision set V admits linear optimization. Our analysis involves precise concentration-of-measure arguments and a new algorithm for linear programming with exponentially many constraints.
• Deep generative models learn a mapping from a low dimensional latent space to a high-dimensional data space. Under certain regularity conditions, these models parameterize nonlinear manifolds in the data space. In this paper, we investigate the Riemannian geometry of these generated manifolds. First, we develop efficient algorithms for computing geodesic curves, which provide an intrinsic notion of distance between points on the manifold. Second, we develop an algorithm for parallel translation of a tangent vector along a path on the manifold. We show how parallel translation can be used to generate analogies, i.e., to transport a change in one data point into a semantically similar change of another data point. Our experiments on real image data show that the manifolds learned by deep generative models, while nonlinear, are surprisingly close to zero curvature. The practical implication is that linear paths in the latent space closely approximate geodesics on the generated manifold. However, further investigation into this phenomenon is warranted, to identify if there are other architectures or datasets where curvature plays a more prominent role. We believe that exploring the Riemannian geometry of deep generative models, using the tools developed in this paper, will be an important step in understanding the high-dimensional, nonlinear spaces these models learn.
• A new class of functions, called the `Information sensitivity functions' (ISFs), which quantify the information gain about the parameters through the measurements/observables of a dynamical system are presented. These functions can be easily computed through classical sensitivity functions alone and are based on Bayesian and information-theoretic approaches. While marginal information gain is quantified by decrease in differential entropy, correlations between arbitrary sets of parameters are assessed through mutual information. For individual parameters these information gains are also presented as marginal posterior variances, and, to assess the effect of correlations, as conditional variances when other parameters are given. The easy to interpret ISFs can be used to a) identify time-intervals or regions in dynamical system behaviour where information about the parameters is concentrated; b) assess the effect of measurement noise on the information gain for the parameters; c) assess whether sufficient information in an experimental protocol (input, measurements, and their frequency) is available to identify the parameters; d) assess correlation in the posterior distribution of the parameters to identify the sets of parameters that are likely to be indistinguishable; and e) assess identifiability problems for particular sets of parameters.
• This paper demonstrates the use of genetic algorithms for evolving: 1) a grandmaster-level evaluation function, and 2) a search mechanism for a chess program, the parameter values of which are initialized randomly. The evaluation function of the program is evolved by learning from databases of (human) grandmaster games. At first, the organisms are evolved to mimic the behavior of human grandmasters, and then these organisms are further improved upon by means of coevolution. The search mechanism is evolved by learning from tactical test suites. Our results show that the evolved program outperforms a two-time world computer chess champion and is at par with the other leading computer chess programs.
• This paper presents a novel deep learning based method for automatic malware signature generation and classification. The method uses a deep belief network (DBN), implemented with a deep stack of denoising autoencoders, generating an invariant compact representation of the malware behavior. While conventional signature and token based methods for malware detection do not detect a majority of new variants for existing malware, the results presented in this paper show that signatures generated by the DBN allow for an accurate classification of new malware variants. Using a dataset containing hundreds of variants for several major malware families, our method achieves 98.6% classification accuracy using the signatures generated by the DBN. The presented method is completely agnostic to the type of malware behavior that is logged (e.g., API calls and their parameters, registry entries, websites and ports accessed, etc.), and can use any raw input from a sandbox to successfully train the deep neural network which is used to generate malware signatures.
• Variational inference for latent variable models is prevalent in various machine learning problems, typically solved by maximizing the Evidence Lower Bound (ELBO) of the true data likelihood with respect to a variational distribution. However, freely enriching the family of variational distribution is challenging since the ELBO requires variational likelihood evaluations of the latent variables. In this paper, we propose a novel framework to enrich the variational family based on an alternative lower bound, by introducing auxiliary random variables to the variational distribution only. While offering a much richer family of complex variational distributions, the resulting inference network is likelihood almost free in the sense that only the latent variables require evaluations from simple likelihoods and samples from all the auxiliary variables are sufficient for maximum likelihood inference. We show that the proposed approach is essentially optimizing a probabilistic mixture of ELBOs, thus enriching modeling capacity and enhancing robustness. It outperforms state-of-the-art methods in our experiments on several density estimation tasks.
• One key requirement for effective supply chain management is the quality of its inventory management. Various inventory management methods are typically employed for different types of products based on their demand patterns, product attributes, and supply network. In this paper, our goal is to develop robust demand prediction methods for weather sensitive products at retail stores. We employ historical datasets from Walmart, whose customers and markets are often exposed to extreme weather events which can have a huge impact on sales regarding the affected stores and products. We want to accurately predict the sales of 111 potentially weather-sensitive products around the time of major weather events at 45 of Walmart retails locations in the U.S. Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it is difficult for replenishment managers to predict the level of inventory needed to avoid being out-of-stock or overstock during and after that storm. While they rely on a variety of vendor tools to predict sales around extreme weather events, they mostly employ a time-consuming process that lacks a systematic measure of effectiveness. We employ all the methods critical to any analytics project and start with data exploration. Critical features are extracted from the raw historical dataset for demand forecasting accuracy and robustness. In particular, we employ Artificial Neural Network for forecasting demand for each product sold around the time of major weather events. Finally, we evaluate our model to evaluate their accuracy and robustness.
• Most research on the interpretability of machine learning systems focuses on the development of a more rigorous notion of interpretability. I suggest that a better understanding of the deficiencies of the intuitive notion of interpretability is needed as well. I show that visualization enables but also impedes intuitive interpretability, as it presupposes two levels of technical pre-interpretation: dimensionality reduction and regularization. Furthermore, I argue that the use of positive concepts to emulate the distributed semantic structure of machine learning models introduces a significant human bias into the model. As a consequence, I suggest that, if intuitive interpretability is needed, singular representations of internal model states should be avoided.
• Nov 23 2017 stat.ML arXiv:1711.08037v1
Calls to arms to build interpretable models express a well-founded discomfort with machine learning. Should a software agent that does not even know what a loan is decide who qualifies for one? Indeed, we ought to be cautious about injecting machine learning (or anything else, for that matter) into applications where there may be a significant risk of causing social harm. However, claims that stakeholders "just won't accept that!" do not provide a sufficient foundation for a proposed field of study. For the field of interpretable machine learning to advance, we must ask the following questions: What precisely won't various stakeholders accept? What do they want? Are these desiderata reasonable? Are they feasible? In order to answer these questions, we'll have to give real-world problems and their respective stakeholders greater consideration.
• We study platforms in the sharing economy and discuss the need for incentivizing users to explore options that otherwise would not be chosen. For instance, rental platforms such as Airbnb typically rely on customer reviews to provide users with relevant information about different options. Yet, often a large fraction of options does not have any reviews available. Such options are frequently neglected as viable choices, and in turn are unlikely to be evaluated, creating a vicious cycle. Platforms can engage users to deviate from their preferred choice by offering monetary incentives for choosing a different option instead. To efficiently learn the optimal incentives to offer, we consider structural information in user preferences and introduce a novel algorithm - Coordinated Online Learning (CoOL) - for learning with structural information modeled as convex constraints. We provide formal guarantees on the performance of our algorithm and test the viability of our approach in a user study with data of apartments on Airbnb. Our findings suggest that our approach is well-suited to learn appropriate incentives and increase exploration on the investigated platform.
• Objectives: To obtain a better estimate of the mortality of individuals suffering from blunt force trauma, including co-morbidity. Methodology: The Injury severity Score (ISS) is the default world standard for assessing the severity of multiple injuries. ISS is a mathematical fit to empirical field data. It is demonstrated that ISS is proportional to the Gibbs/Shannon Entropy. A new Entropy measure of morbidity from blunt force trauma including co-morbidity is derived based on the von Neumann Entropy, called the Abbreviated Morbidity Scale (AMS). Results: The ISS trauma measure has been applied to a previously published database, and good correlation has been achieved. Here the existing trauma measure is extended to include the co-morbidity of disease by calculating an Abbreviated Morbidity Score (AMS), which encapsulates the disease co-morbidity in a manner analogous to AIS, and on a consistent Entropy base. Applying Entropy measures to multiple injuries, highlights the role of co-morbidity and that the elderly die at much lower levels of injury than the general population, as a consequence of co-morbidity. These considerations lead to questions regarding current new car assessment protocols, and how well they protect the most vulnerable road users. Keywords: Blunt Force Trauma, Injury Severity Score, Co-morbidity, Entropy.

Noon van der Silk Nov 01 2017 21:51 UTC

This is an awesome paper; great work! :)

Noon van der Silk Mar 08 2017 04:45 UTC

I feel that while the proliferation of GUNs is unquestionable a good idea, there are many unsupervised networks out there that might use this technology in dangerous ways. Do you think Indifferential-Privacy networks are the answer? Also I fear that the extremist binary networks should be banned ent

...(continued)
Noon van der Silk Jan 27 2016 03:39 UTC

Great institute name ...

Alessandro Dec 09 2015 01:12 UTC

Hey, I've already seen this title! http://arxiv.org/abs/1307.0401

Chris Granade Sep 22 2015 19:15 UTC

Thank you for the kind comments, I'm glad that our paper, source code, and tutorial are useful!

Travis Scholten Sep 21 2015 17:05 UTC

This was a really well-written paper! Am very glad to see this kind of work being done.

In addition, the openness about source code is refreshing. By explicitly relating the work to [QInfer](https://github.com/csferrie/python-qinfer), this paper makes it more easy to check the authors' work. Furthe

...(continued)
Chris Granade Sep 15 2015 02:40 UTC

I fell for that clickbait title and read the paper. I still don’t get why von Neumann didn't want us to know about this weird trick? And which weird trick? The use of superfidelity or the use of non-physical density matrices like $\sigma^\sharp$?