The Bayesian Information Criterion (BIC) has been widely used for decades to estimate the number of clusters in an observed data set. The original derivation, referred to as classic BIC, does not include information about the specific model selection problem at hand, which renders it generic. However, very little effort has been made to check its appropriateness for cluster analysis. In this paper we derive BIC from first principles by formulating the problem of estimating the number of clusters in a data set as the maximization of the posterior probability of candidate models given observations. We provide a general BIC expression which is independent of the data distribution, provided that some mild assumptions are satisfied. This serves as an important milestone when deriving BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian-distributed observations. We show that incorporating the clustering problem during the derivation of BIC results in an expression whose penalty term is different from the penalty term of the classic BIC. We propose a two-step cluster enumeration algorithm that utilizes a model-based unsupervised learning algorithm to partition the observed data according to each candidate model and the proposed BIC to select the model with the optimal number of clusters. The performance of the proposed criterion is tested using synthetic and real-data examples. Simulation results show that our proposed criterion outperforms the existing BIC-based cluster enumeration methods. Our proposed criterion is particularly powerful in estimating the number of data clusters when the observations have unbalanced and overlapping clusters.
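For orientation, the model selection step can be sketched with the classic BIC, which the abstract uses as its baseline. This is not the proposed criterion, whose penalty term is not given here. The function names, the "larger is better" sign convention, and the log-likelihood values are assumptions made purely for illustration.

```python
import math

def classic_bic(log_likelihood, num_params, num_samples):
    """Classic BIC in 'larger is better' form: log L - (q / 2) * log N."""
    return log_likelihood - 0.5 * num_params * math.log(num_samples)

def select_num_clusters(candidates, num_samples):
    """Pick the candidate cluster count with the largest classic BIC.

    candidates maps a cluster count K to (log_likelihood, num_params),
    as produced by whatever clustering step was run for that K.
    """
    scores = {k: classic_bic(ll, q, num_samples)
              for k, (ll, q) in candidates.items()}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Made-up log-likelihoods for K = 1, 2, 3 on N = 200 points (illustration only).
# num_params assumes a 1-D Gaussian mixture: K means, K variances, K - 1 weights.
candidates = {1: (-520.0, 2), 2: (-410.0, 5), 3: (-406.0, 8)}
best_k, scores = select_num_clusters(candidates, num_samples=200)
print(best_k, scores)
```

In this toy setting the small likelihood gain from K = 2 to K = 3 does not offset the larger penalty, so K = 2 is selected.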
Many problems in signal processing require finding sparse solutions to under-determined or ill-conditioned linear systems of equations. When dealing with real-world data, the presence of outliers and impulsive noise must also be accounted for. In past decades, the vast majority of robust linear regression estimators have focused on robustness against rowwise contamination. Even so-called `high breakdown' estimators rely on the assumption that the majority of rows of the regression matrix are not affected by outliers. Only very recently have the first cellwise robust regression estimation methods been developed. In this paper, we define robust oracle properties, which an estimator must have in order to perform robust model selection for under-determined or ill-conditioned linear regression models that are contaminated by cellwise outliers in the regression matrix. We propose and analyze a robustly weighted and adaptive Lasso-type regularization term which takes into account cellwise outliers for model selection. The proposed regularization term is integrated into the objective function of the MM-estimator, which yields the proposed MM-Robust Weighted Adaptive Lasso (MM-RWAL), for which we prove that at least the weak robust oracle properties hold. A performance comparison to existing robust Lasso estimators is provided using Monte Carlo experiments. Further, the MM-RWAL is applied to determine the temporal releases of the European Tracer Experiment (ETEX) at the source location. This ill-conditioned linear inverse problem contains cellwise and rowwise outliers and is sparse both in the regression matrix and the parameter vector. The proposed RWAL penalty is not limited to the MM-estimator but can easily be integrated into the objective function of other robust estimators.
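As background for the regularization term, the following minimal sketch evaluates the standard (non-robust) adaptive Lasso penalty, in which each coefficient is weighted by the inverse magnitude of an initial estimate. It is not the RWAL penalty, whose cellwise robust weights are not specified in this abstract. The function name, the parameters `lam`, `gamma`, and `eps`, and the example vectors are assumptions.

```python
def adaptive_lasso_penalty(beta, beta_init, lam=1.0, gamma=1.0, eps=1e-8):
    """Standard adaptive Lasso penalty: lam * sum_j w_j * |beta_j|,
    with weights w_j = 1 / (|beta_init_j| + eps) ** gamma.

    eps is an assumed safeguard against division by zero.
    """
    weights = [1.0 / (abs(b0) + eps) ** gamma for b0 in beta_init]
    return lam * sum(w * abs(b) for w, b in zip(weights, beta))

# Coefficients with a small initial estimate get a large weight,
# so they are pushed toward zero more strongly.
beta_init = [2.0, 0.5, 1.0]   # hypothetical initial estimate
beta = [2.0, 0.0, -1.0]       # hypothetical candidate solution
penalty = adaptive_lasso_penalty(beta, beta_init)
print(penalty)
```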
Mar 20 2017 stat.ME
A distributed multi-speaker voice activity detection (DM-VAD) method for wireless acoustic sensor networks (WASNs) is proposed. DM-VAD is required in many signal processing applications, e.g., distributed speech enhancement based on multi-channel Wiener filtering, but no such method exists to date. The proposed method neither requires a fusion center nor prior knowledge about the node positions, microphone array orientations or the number of observed sources. It consists of two steps: (i) distributed source-specific energy signal unmixing and (ii) energy-signal-based voice activity detection. Existing computationally efficient methods to extract source-specific energy signals from the mixed observations, e.g., multiplicative non-negative independent component analysis (MNICA), quickly lose performance with an increasing number of sources and require a fusion center. To overcome these limitations, we introduce a distributed energy signal unmixing method based on a source-specific node clustering method to locate the nodes around each source. To determine the number of sources that are observed in the WASN, a source enumeration method that uses a Lasso penalized Poisson generalized linear model is developed. Each identified cluster estimates the energy signal of a single (dominant) source by applying a two-component MNICA. The VAD problem is transformed into a clustering task by extracting features from the energy signals and applying K-means-type clustering algorithms. All steps of the proposed method are evaluated using numerical experiments. A VAD accuracy of $> 85 \%$ is achieved for a challenging scenario where 20 nodes observe 7 sources in a simulated reverberant rectangular room.
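The idea of treating VAD as a clustering task can be sketched with a two-cluster, one-dimensional K-means on frame energies. This is only an illustration: the abstract does not state which features are extracted, so using the raw frame energy as the single feature, the min/max initialization, and the function name `two_means_vad` are all assumptions.

```python
def two_means_vad(energies, num_iters=20):
    """Label each frame 1 (speech) or 0 (non-speech) by 1-D 2-means clustering.

    Centroids start at the smallest and largest energy (assumed initialization).
    """
    lo, hi = min(energies), max(energies)
    for _ in range(num_iters):
        # Assignment step: each frame joins the nearer centroid.
        labels = [1 if abs(e - hi) < abs(e - lo) else 0 for e in energies]
        low = [e for e, l in zip(energies, labels) if l == 0]
        high = [e for e, l in zip(energies, labels) if l == 1]
        if not low or not high:
            break  # degenerate case: only one cluster is populated
        # Update step: recompute centroids as cluster means.
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return labels

# Hypothetical frame energies: low values are silence, high values are speech.
energies = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9, 0.12]
labels = two_means_vad(energies)
print(labels)
```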
Jul 06 2016 stat.ME
A new robust and statistically efficient estimator for ARMA models called the bounded influence propagation (BIP) $\tau$-estimator is proposed. The estimator incorporates an auxiliary model, which prevents the propagation of outliers. Strong consistency and asymptotic normality of the estimator for ARMA models that are driven by independently and identically distributed (iid) innovations with symmetric distributions are established. To analyze the infinitesimal effect of outliers on the estimator, the influence function is derived and computed explicitly for an AR(1) model with additive outliers. To obtain estimates for the AR($p$) model, a robust Durbin-Levinson type and a forward-backward algorithm are proposed. An iterative algorithm to robustly obtain ARMA($p$,$q$) parameter estimates is also presented. The problem of finding a robust initialization is addressed, which for orders $p+q>2$ is a non-trivial matter. Numerical experiments are conducted to compare the finite sample performance of the proposed estimator to existing robust methodologies for different types of outliers both in terms of average and of worst-case performance, as measured by the maximum bias curve. To illustrate the practical applicability of the proposed estimator, a real-data example of outlier cleaning for R-R interval plots derived from electrocardiographic (ECG) data is considered. The proposed estimator is not limited to biomedical applications, but is also useful in any real-world problem whose observations can be modeled as an ARMA process disturbed by outliers or impulsive noise.
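Why outlier propagation matters can be seen in a small AR(1) example: a single additive outlier in the observations corrupts more than one one-step prediction residual. The sketch below only illustrates that effect and is not the BIP $\tau$-estimator; the all-zero clean series, the outlier size, and the value of `phi` are assumptions chosen so the effect is easy to read.

```python
def ar1_residuals(y, phi):
    """One-step prediction residuals e_t = y_t - phi * y_{t-1} for t = 1..len(y)-1."""
    return [y[t] - phi * y[t - 1] for t in range(1, len(y))]

# Clean series with no innovations (assumed, to isolate the outlier effect).
clean = [0.0] * 10
contaminated = list(clean)
contaminated[5] = 10.0  # a single additive outlier at time t = 5

res = ar1_residuals(contaminated, phi=0.5)
# The list index is t - 1, so the outlier shows up at index 4 (as y_t)
# and again at index 5 (as the predictor y_{t-1}).
affected = [i for i, r in enumerate(res) if r != 0.0]
print(res, affected)
```

One contaminated observation thus yields two contaminated residuals in the AR(1) case, which is the kind of spreading that an estimator must limit to remain robust.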
Jun 03 2016 stat.ME
Linear inverse problems are ubiquitous. Often the measurements do not follow a Gaussian distribution. Additionally, a model matrix with a large condition number can complicate the problem further by making it ill-posed. In this case, the performance of popular estimators may deteriorate significantly. We have developed a new estimator that is both nearly optimal in the presence of Gaussian errors and robust against outliers. Furthermore, it obtains meaningful estimates when the problem is ill-posed through the inclusion of $\ell_1$ and $\ell_2$ regularizations. The computation of our estimate involves minimizing a non-convex objective function. Hence, we are not guaranteed to find the global minimum in a reasonable amount of time. Thus, we propose two algorithms that converge to a good local minimum in a reasonable (and adjustable) amount of time, as an approximation of the global minimum. We also analyze how the introduction of the regularization term affects the statistical properties of our estimator. We confirm high robustness against outliers and asymptotic efficiency for Gaussian distributions by deriving measures of robustness such as the influence function, sensitivity curve, bias, asymptotic variance, and mean square error. We verify the theoretical results using numerical experiments and show that the proposed estimator outperforms recently proposed methods, especially for increasing amounts of outlier contamination. Python code for all of the algorithms is available online in the spirit of reproducible research.
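Of the robustness measures listed, the sensitivity curve is the easiest to compute directly: it is the change in an estimate caused by adding one observation $x_0$ to a sample, scaled by the new sample size $n$. The sketch below applies it to the sample mean and median rather than to the proposed estimator. The helper name and the sample values are assumptions.

```python
import statistics

def sensitivity_curve(estimator, sample, x0):
    """SC(x0) = n * (T(sample + [x0]) - T(sample)), with n = len(sample) + 1."""
    n = len(sample) + 1
    return n * (estimator(sample + [x0]) - estimator(sample))

sample = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical clean sample
x0 = 100.0                            # a single gross outlier

sc_mean = sensitivity_curve(statistics.mean, sample, x0)
sc_median = sensitivity_curve(statistics.median, sample, x0)
print(sc_mean, sc_median)
```

The mean's sensitivity grows in proportion to the outlier, whereas the median's stays bounded, which is the qualitative behavior expected of a robust estimator.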