results for au:Jung_A in:stat

- We formulate the problem of sampling and recovering clustered graph signal as a multi-armed bandit (MAB) problem. This formulation lends naturally to learning sampling strategies using the well-known gradient MAB algorithm. In particular, the sampling strategy is represented as a probability distribution over the individual arms of the MAB and optimized using gradient ascent. Some illustrative numerical experiments indicate that the sampling strategies based on the gradient MAB algorithm outperform existing sampling methods.
- This tutorial is based on the lecture notes for the courses "Machine Learning: Basic Principles" and "Artificial Intelligence", which I have taught during fall 2017 and spring 2018 at Aalto university. The aim is to provide an accessible introduction to some of the main concepts and methods within supervised machine learning. Most of the current systems which are con- sidered as (artificially) intelligent are based on some form of supervised machine learning. After discussing the main building blocks of a formal machine learning problem, some of the most popular algorithmic design patterns for machine learning methods are presented.
- We apply network Lasso to solve binary classification (clustering) problems on network structured data. To this end, we generalize ordinary logistic regression to non-Euclidean data defined over a complex network structure. The resulting logistic network Lasso classifier amounts to solving a non-smooth convex optimization problem. A scalable classification algorithm is obtained by applying the alternating direction methods of multipliers to solve this optimization problem.
- This paper investigates the computational complexity of sparse label propagation which has been proposed recently for processing network structured data. Sparse label propagation amounts to a convex optimization problem and might be considered as an extension of basis pursuit from sparse vectors to network structured datasets. Using a standard first-order oracle model, we characterize the number of iterations for sparse label propagation to achieve a prescribed accuracy. In particular, we derive an upper bound on the number of iterations required to achieve a certain accuracy and show that this upper bound is sharp for datasets having a chain structure (e.g., time series).
- Many current approaches to the design of intrusion detec- tion systems apply feature selection in a static, non-adaptive fashion. These methods often neglect the dynamic nature of network data which requires to use adaptive feature selection techniques. In this paper, we present a simple technique based on incremental learning of support vector machines in order to rank the features in real time within a streaming model for network data. Some illustrative numerical experiments with two popular benchmark datasets show that our approach al- lows to adapt to the changes in normal network behaviour and novel attack patterns which have not been experienced before.
- We consider the problem of classifying business process instances based on structural features derived from event logs. The main motivation is to provide machine learning based techniques with quick response times for interactive computer assisted root cause analysis. In particular, we create structural features from process mining such as activity and transition occurrence counts, and ordering of activities to be evaluated as potential features for classification. We show that adding such structural features increases the amount of information thus potentially increasing classification accuracy. However, there is an inherent trade-off as using too many features leads to too long run-times for machine learning classification models. One way to improve the machine learning algorithms' run-time is to only select a small number of features by a feature selection algorithm. However, the run-time required by the feature selection algorithm must also be taken into account. Also, the classification accuracy should not suffer too much from the feature selection. The main contributions of this paper are as follows: First, we propose and compare six different feature selection algorithms by means of an experimental setup comparing their classification accuracy and achievable response times. Second, we discuss the potential use of feature selection results for computer assisted root cause analysis as well as the properties of different types of structural features in the context of feature selection.
- The network Lasso is a recently proposed convex optimization method for machine learning from massive network structured datasets, i.e., big data over networks. It is a variant of the well-known least absolute shrinkage and selection operator (Lasso), which is underlying many methods in learning and signal processing involving sparse models. Highly scalable implementations of the network Lasso can be obtained by state-of-the art proximal methods, e.g., the alternating direction method of multipliers (ADMM). By generalizing the concept of the compatibility condition put forward by van de Geer and Buehlmann as a powerful tool for the analysis of plain Lasso, we derive a sufficient condition, i.e., the network compatibility condition, on the underlying network topology such that network Lasso accurately learns a clustered underlying graph signal. This network compatibility condition relates the location of the sampled nodes with the clustering structure of the network. In particular, the NCC informs the choice of which nodes to sample, or in machine learning terms, which data points provide most information if labeled.
- Despite recent advances, large scale visual artifacts are still a common occurrence in images generated by GANs. Previous work has focused on improving the generator's capability to accurately imitate the data distribution $p_{data}$. In this paper, we instead explore methods that enable GANs to actively avoid errors by manipulating the input space. The core idea is to apply small changes to each noise vector in order to shift them away from areas in the input space that tend to result in errors. We derive three different architectures from that idea. The main one of these consists of a simple residual module that leads to significantly less visual artifacts, while only slightly decreasing diversity. The module is trivial to add to existing GANs and costs almost zero computation and memory.
- Jun 30 2017 stat.ML arXiv:1706.09880v4Interpreting gradient methods as fixed-point iterations, we provide a detailed analysis of those methods for minimizing convex objective functions. Due to their conceptual and algorithmic simplicity, gradient methods are widely used in machine learning for massive data sets (big data). In particular, stochastic gradient methods are considered the de- facto standard for training deep neural networks. Studying gradient methods within the realm of fixed-point theory provides us with powerful tools to analyze their convergence properties. In particular, gradient methods using inexact or noisy gradients, such as stochastic gradient descent, can be studied conveniently using well-known results on inexact fixed-point iterations. Moreover, as we demonstrate in this paper, the fixed-point approach allows an elegant derivation of accelerations for basic gradient methods. In particular, we will show how gradient descent can be accelerated by a fixed-point preserving transformation of an operator associated with the objective function.
- We present a novel condition, which we term the net- work nullspace property, which ensures accurate recovery of graph signals representing massive network-structured datasets from few signal values. The network nullspace property couples the cluster structure of the underlying network-structure with the geometry of the sampling set. Our results can be used to design efficient sampling strategies based on the network topology.
- It has been shown recently that graph signals with small total variation can be accurately recovered from only few samples if the sampling set satisfies a certain condition, referred to as the network nullspace property. Based on this recovery condition, we propose a sampling strategy for smooth graph signals based on random walks. Numerical experiments demonstrate the effectiveness of this approach for graph signals obtained from a synthetic random graph model as well as a real-world dataset.
- Apr 10 2017 stat.ML arXiv:1704.02107v3The "least absolute shrinkage and selection operator" (Lasso) method has been adapted recently for networkstructured datasets. In particular, this network Lasso method allows to learn graph signals from a small number of noisy signal samples by using the total variation of a graph signal for regularization. While efficient and scalable implementations of the network Lasso are available, only little is known about the conditions on the underlying network structure which ensure network Lasso to be accurate. By leveraging concepts of compressed sensing, we address this gap and derive precise conditions on the underlying network topology and sampling set which guarantee the network Lasso for a particular loss function to deliver an accurate estimate of the entire underlying graph signal. We also quantify the error incurred by network Lasso in terms of two constants which reflect the connectivity of the sampled nodes.
- We characterize the sample size required for accurate graphical model selection from non- stationary samples. The observed data is modelled as a vector-valued zero-mean Gaussian random process whose samples are uncorrelated but have different covariance matrices. This model contains as special cases the standard setting of i.i.d. samples as well as the case of samples forming a stationary or underspread (non-stationary) processes. More generally, our model applies to any process model for which an efficient decorrelation can be obtained. By analyzing a particular model selection method, we derive a sufficient condition on the required sample size for accurate graphical model selection based on non-stationary data.
- This work proposes a novel method for semi-supervised learning from partially labeled massive network-structured datasets, i.e., big data over networks. We model the underlying hypothesis, which relates data points to labels, as a graph signal, defined over some graph (network) structure intrinsic to the dataset. Following the key principle of supervised learning, i.e., similar inputs yield similar outputs, we require the graph signals induced by labels to have small total variation. Accordingly, we formulate the problem of learning the labels of data points as a non-smooth convex optimization problem which amounts to balancing between the empirical loss, i.e., the discrepancy with some partially available label information, and the smoothness quantified by the total variation of the learned graph signal. We solve this optimization problem by appealing to a recently proposed preconditioned variant of the popular primal-dual method by Pock and Chambolle, which results in a sparse label propagation algorithm. This learning algorithm allows for a highly scalable implementation as message passing over the underlying data graph. By applying concepts of compressed sensing to the learning problem, we are also able to provide a transparent sufficient condition on the underlying network structure such that accurate learning of the labels is possible. We also present an implementation of the message passing formulation allows for a highly scalable implementation in big data frameworks.
- We formulate and analyze a graphical model selection method for inferring the conditional independence graph of a high-dimensional nonstationary Gaussian random process (time series) from a finite-length observation. The observed process samples are assumed uncorrelated over time and having a time-varying marginal distribution. The selection method is based on testing conditional variances obtained for small subsets of process components. This allows to cope with the high-dimensional regime, where the sample size can be (drastically) smaller than the process dimension. We characterize the required sample size such that the proposed selection method is successful with high probability.
- We consider the problem of learning a dictionary matrix from a number of observed signals, which are assumed to be generated via a linear model with a common underlying dictionary. In particular, we derive lower bounds on the minimum achievable worst case mean squared error (MSE), regardless of computational complexity of the dictionary learning (DL) schemes. By casting DL as a classical (or frequentist) estimation problem, the lower bounds on the worst case MSE are derived by following an established information-theoretic approach to minimax estimation. The main conceptual contribution of this paper is the adaption of the information-theoretic approach to minimax estimation for the DL problem in order to derive lower bounds on the worst case MSE of any DL scheme. We derive three different lower bounds applying to different generative models for the observed signals. The first bound applies to a wide range of models, it only requires the existence of a covariance matrix of the (unknown) underlying coefficient vector. By specializing this bound to the case of sparse coefficient distributions, and assuming the true dictionary satisfies the restricted isometry property, we obtain a lower bound on the worst case MSE of DL schemes in terms of a signal to noise ratio (SNR). The third bound applies to a more restrictive subclass of coefficient distributions by requiring the non-zero coefficients to be Gaussian. While, compared with the previous two bounds, the applicability of this final bound is the most limited it is the tightest of the three bounds in the low SNR regime.
- Oct 07 2014 stat.ML arXiv:1410.1184v3We propose a novel graphical model selection (GMS) scheme for high-dimensional stationary time series or discrete time process. The method is based on a natural generalization of the graphical LASSO (gLASSO), introduced originally for GMS based on i.i.d. samples, and estimates the conditional independence graph (CIG) of a time series from a finite length observation. The gLASSO for time series is defined as the solution of an l1-regularized maximum (approximate) likelihood problem. We solve this optimization problem using the alternating direction method of multipliers (ADMM). Our approach is nonparametric as we do not assume a finite dimensional (e.g., an autoregressive) parametric model for the observed process. Instead, we require the process to be sufficiently smooth in the spectral domain. For Gaussian processes, we characterize the performance of our method theoretically by deriving an upper bound on the probability that our algorithm fails to correctly identify the CIG. Numerical experiments demonstrate the ability of our method to recover the correct CIG from a limited amount of samples.
- Apr 07 2014 stat.ML arXiv:1404.1361v3We propose a method for inferring the conditional independence graph (CIG) of a high-dimensional Gaussian vector time series (discrete-time process) from a finite-length observation. By contrast to existing approaches, we do not rely on a parametric process model (such as, e.g., an autoregressive model) for the observed random process. Instead, we only require certain smoothness properties (in the Fourier domain) of the process. The proposed inference scheme works even for sample sizes much smaller than the number of scalar process components if the true underlying CIG is sufficiently sparse. A theoretical performance analysis provides conditions which guarantee that the probability of the proposed inference method to deliver a wrong CIG is below a prescribed value. These conditions imply lower bounds on the sample size such that the new method is consistent asymptotically. Some numerical experiments validate our theoretical performance analysis and demonstrate superior performance of our scheme compared to an existing (parametric) approach in case of model mismatch.
- We consider the problem of inferring the conditional independence graph (CIG) of a multivariate stationary dicrete-time Gaussian random process based on a finite length observation. Using information-theoretic methods, we derive a lower bound on the error probability of any learning scheme for the underlying process CIG. This bound, in turn, yields a minimum required sample-size which is necessary for any algorithm regardless of its computational complexity, to reliably select the true underlying CIG. Furthermore, by analysis of a simple selection scheme, we show that the information-theoretic limits can be achieved for a subclass of processes having sparse CIG. We do not assume a parametric model for the observed process, but require it to have a sufficiently smooth spectral density matrix (SDM).
- Feb 18 2014 stat.ML arXiv:1402.4078v2We consider the problem of dictionary learning under the assumption that the observed signals can be represented as sparse linear combinations of the columns of a single large dictionary matrix. In particular, we analyze the minimax risk of the dictionary learning problem which governs the mean squared error (MSE) performance of any learning scheme, regardless of its computational complexity. By following an established information-theoretic method based on Fanos inequality, we derive a lower bound on the minimax risk for a given dictionary learning problem. This lower bound yields a characterization of the sample-complexity, i.e., a lower bound on the required number of observations such that consistent dictionary learning schemes exist. Our bounds may be compared with the performance of a given learning scheme, allowing to characterize how far the method is from optimal performance.
- The investigation of the effects of sparsity or sparsity constraints in signal processing problems has received considerable attention recently. Sparsity constraints refer to the a priori information that the object or signal of interest can be represented by using only few elements of a predefined dictionary. Within this thesis, sparsity refers to the fact that a vector to be estimated has only few nonzero entries. One specific field concerned with sparsity constraints has become popular under the name Compressed Sensing (CS). Within CS, the sparsity is exploited in order to perform (nearly) lossless compression. Moreover, this compression is carried out jointly or simultaneously with the process of sensing a physical quantity. In contrast to CS, one can alternatively use sparsity to enhance signal processing methods. Obviously, sparsity constraints can only improve the obtainable estimation performance since the constraints can be interpreted as an additional prior information about the unknown parameter vector which is to be estimated. Our main focus will be on this aspect of sparsity, i.e., we analyze how much we can gain in estimation performance due to the sparsity constraints.
- Nov 14 2013 stat.ML arXiv:1311.3257v2We propose a method for inferring the conditional indepen- dence graph (CIG) of a high-dimensional discrete-time Gaus- sian vector random process from finite-length observations. Our approach does not rely on a parametric model (such as, e.g., an autoregressive model) for the vector random process; rather, it only assumes certain spectral smoothness proper- ties. The proposed inference scheme is compressive in that it works for sample sizes that are (much) smaller than the number of scalar process components. We provide analytical conditions for our method to correctly identify the CIG with high probability.
- The mathematical theory of reproducing kernel Hilbert spaces (RKHS) provides powerful tools for minimum variance estimation (MVE) problems. Here, we extend the classical RKHS based analysis of MVE in several directions. We develop a geometric formulation of five known lower bounds on the estimator variance (Barankin bound, Cramer-Rao bound, constrained Cramer-Rao bound, Bhattacharyya bound, and Hammersley-Chapman-Robbins bound) in terms of orthogonal projections onto a subspace of the RKHS associated with a given MVE problem. We show that, under mild conditions, the Barankin bound (the tightest possible lower bound on the estimator variance) is a lower semicontinuous function of the parameter vector. We also show that the RKHS associated with an MVE problem remains unchanged if the observation is replaced by a sufficient statistic. Finally, for MVE problems conforming to an exponential family of distributions, we derive novel closed-form lower bound on the estimator variance and show that a reduction of the parameter set leaves the minimum achievable variance unchanged.
- Mar 27 2012 stat.CO arXiv:1203.5475v3Estimating the spectral characteristics of a nonstationary random process is an important but challenging task, which can be facilitated by exploiting structural properties of the process. In certain applications, the observed processes are underspread, i.e., their time and frequency correlations exhibit a reasonably fast decay, and approximately time-frequency sparse, i.e., a reasonably large percentage of the spectral values is small. For this class of processes, we propose a compressive estimator of the discrete Rihaczek spectrum (RS). This estimator combines a minimum variance unbiased estimator of the RS (which is a smoothed Rihaczek distribution using an appropriately designed smoothing kernel) with a compressed sensing technique that exploits the approximate time-frequency sparsity. As a result of the compression stage, the number of measurements required for good estimation performance can be significantly reduced. The measurements are values of the discrete ambiguity function of the observed signal at randomly chosen time and frequency lag positions. We provide bounds on the mean-square estimation error of both the minimum variance unbiased RS estimator and the compressive RS estimator, and we demonstrate the performance of the compressive estimator by means of simulation results. The proposed compressive RS estimator can also be used for estimating other time-dependent spectra (e.g., the Wigner-Ville spectrum) since for an underspread process most spectra are almost equal.
- We consider estimation of a sparse parameter vector that determines the covariance matrix of a Gaussian random vector via a sparse expansion into known "basis matrices". Using the theory of reproducing kernel Hilbert spaces, we derive lower bounds on the variance of estimators with a given mean function. This includes unbiased estimation as a special case. We also present a numerical comparison of our lower bounds with the variance of two standard estimators (hard-thresholding estimator and maximum likelihood estimator).
- We study the performance of estimators of a sparse nonrandom vector based on an observation which is linearly transformed and corrupted by additive white Gaussian noise. Using the reproducing kernel Hilbert space framework, we derive a new lower bound on the estimator variance for a given differentiable bias function (including the unbiased case) and an almost arbitrary transformation matrix (including the underdetermined case considered in compressed sensing theory). For the special case of a sparse vector corrupted by white Gaussian noise-i.e., without a linear transformation-and unbiased estimation, our lower bound improves on previously proposed bounds.
- We consider unbiased estimation of a sparse nonrandom vector corrupted by additive white Gaussian noise. We show that while there are infinitely many unbiased estimators for this problem, none of them has uniformly minimum variance. Therefore, we focus on locally minimum variance unbiased (LMVU) estimators. We derive simple closed-form lower and upper bounds on the variance of LMVU estimators or, equivalently, on the Barankin bound (BB). Our bounds allow an estimation of the threshold region separating the low-SNR and high-SNR regimes, and they indicate the asymptotic behavior of the BB at high SNR. We also develop numerical lower and upper bounds which are tighter than the closed-form bounds and thus characterize the BB more accurately. Numerical studies compare our characterization of the BB with established biased estimation schemes, and demonstrate that while unbiased estimators perform poorly at low SNR, they may perform better than biased estimators at high SNR. An interesting conclusion of our analysis is that the high-SNR behavior of the BB depends solely on the value of the smallest nonzero component of the sparse vector, and that this type of dependence is also exhibited by the performance of certain practical estimators.