We derive a new Bayesian Information Criterion (BIC) from first principles by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Under mild assumptions, we provide a general BIC expression for a broad class of data distributions. This serves as an important milestone when deriving the BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed observations. We show that incorporating the data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term differs from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to a given set of candidate models. Subsequently, the number of clusters is estimated as the one associated with the model for which the proposed BIC is maximal. The performance of the proposed criterion is tested using synthetic and real data sets. Although the original BIC is a generic criterion that does not include information about the specific model selection problem at hand, it has been widely used in the literature to estimate the number of clusters in an observed data set. We therefore use it as a benchmark. Simulation results show that our proposed criterion outperforms the existing cluster enumeration methods that are based on the original BIC.
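The two-step procedure described above (partition the data for each candidate model, then keep the candidate with the largest criterion value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not state the proposed criterion, so the sketch scores each partition with the original BIC in its maximization form, log-likelihood minus (q/2) log N. The one-dimensional data, the deterministic K-means partitioner, the function names, the variance floor, and the parameter count q = 3K - 1 are all assumptions made for this example.

```python
import math

def kmeans_1d(data, k, iters=100):
    """Deterministic 1-D K-means, initialised from K contiguous chunks of the sorted data."""
    xs = sorted(data)
    n = len(xs)
    centers = []
    for j in range(k):
        chunk = xs[j * n // k:(j + 1) * n // k]
        centers.append(sum(chunk) / len(chunk))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = sum(members) / len(members)
    return [[x for x, lab in zip(xs, labels) if lab == j] for j in range(k)]

def original_bic(clusters, n_total, var_floor=1e-6):
    """Original BIC (larger is better) for a hard-assigned 1-D Gaussian partition."""
    loglik = 0.0
    for members in clusters:
        if not members:
            continue
        n_k = len(members)
        mean = sum(members) / n_k
        sq = sum((x - mean) ** 2 for x in members)
        var = max(sq / n_k, var_floor)  # assumed floor to avoid log(0)
        loglik += n_k * math.log(n_k / n_total)
        loglik += -0.5 * n_k * math.log(2 * math.pi * var) - 0.5 * sq / var
    k = len(clusters)
    q = 3 * k - 1  # K means, K variances, K - 1 free mixing proportions
    return loglik - 0.5 * q * math.log(n_total)

def enumerate_clusters(data, candidates):
    """Step 1: partition for each candidate K. Step 2: pick the K with maximal criterion."""
    scores = {k: original_bic(kmeans_1d(data, k), len(data)) for k in candidates}
    return max(scores, key=scores.get), scores

# Three well-separated groups of 20 evenly spaced points each (synthetic toy data).
offsets = [(i - 9.5) / 5 for i in range(20)]
data = [c + d for c in (0.0, 10.0, 20.0) for d in offsets]
best_k, scores = enumerate_clusters(data, range(1, 6))
print(best_k)
```

Replacing `original_bic` with another criterion leaves the enumeration loop unchanged, which is the point of separating the partitioning step from the scoring step.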
Many problems in signal processing require finding sparse solutions to under-determined, or ill-conditioned, linear systems of equations. When dealing with real-world data, the presence of outliers and impulsive noise must also be accounted for. In past decades, the vast majority of robust linear regression estimators have focused on robustness against rowwise contamination. Even so-called `high breakdown' estimators rely on the assumption that a majority of the rows of the regression matrix are not affected by outliers. Only very recently have the first cellwise robust regression estimation methods been developed. In this paper, we define robust oracle properties, which an estimator must have in order to perform robust model selection for under-determined, or ill-conditioned, linear regression models that are contaminated by cellwise outliers in the regression matrix. We propose and analyze a robustly weighted and adaptive Lasso-type regularization term that takes cellwise outliers into account for model selection. The proposed regularization term is integrated into the objective function of the MM-estimator, which yields the proposed MM-Robust Weighted Adaptive Lasso (MM-RWAL), for which we prove that at least the weak robust oracle properties hold. A performance comparison to existing robust Lasso estimators is provided using Monte Carlo experiments. Further, the MM-RWAL is applied to determine the temporal releases of the European Tracer Experiment (ETEX) at the source location. This ill-conditioned linear inverse problem contains cellwise and rowwise outliers and is sparse both in the regression matrix and the parameter vector. The proposed RWAL penalty is not limited to the MM-estimator but can easily be integrated into the objective function of other robust estimators.
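For orientation, the classical adaptive Lasso idea that the regularization term above builds on can be sketched in a few lines. This is not the RWAL penalty itself: the abstract does not specify the robust weights that handle cellwise outliers, so the sketch shows only the standard adaptive weighting, in which each coefficient is penalized in inverse proportion to the magnitude of an initial estimate. The function names and the parameters `gamma`, `eps`, and `lam` are assumptions made for this example.

```python
def adaptive_lasso_weights(beta_init, gamma=1.0, eps=1e-8):
    """Classical adaptive Lasso weights: small initial estimates get large penalties."""
    return [1.0 / (abs(b) + eps) ** gamma for b in beta_init]

def weighted_l1_penalty(beta, weights, lam):
    """Weighted l1 penalty: lam * sum_j w_j * |beta_j|."""
    return lam * sum(w * abs(b) for w, b in zip(weights, beta))

# Hypothetical initial estimate with one strong, one weak, and one zero coefficient.
beta_init = [2.0, 0.5, 0.0]
weights = adaptive_lasso_weights(beta_init)
penalty = weighted_l1_penalty(beta_init, weights, lam=0.1)
print(weights[0] < weights[1] < weights[2], round(penalty, 6))
```

The zero coefficient receives a very large weight, so it is strongly discouraged from re-entering the model, while large coefficients are shrunk only lightly.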
Mar 20 2017 stat.ME
A distributed multi-speaker voice activity detection (DM-VAD) method for wireless acoustic sensor networks (WASNs) is proposed. DM-VAD is required in many signal processing applications, e.g., distributed speech enhancement based on multi-channel Wiener filtering, but no such method exists to date. The proposed method requires neither a fusion center nor prior knowledge about the node positions, microphone array orientations, or the number of observed sources. It consists of two steps: (i) distributed source-specific energy signal unmixing and (ii) energy-signal-based voice activity detection. Existing computationally efficient methods to extract source-specific energy signals from the mixed observations, e.g., multiplicative non-negative independent component analysis (MNICA), quickly lose performance with an increasing number of sources and require a fusion center. To overcome these limitations, we introduce a distributed energy signal unmixing method based on a source-specific node clustering method to locate the nodes around each source. To determine the number of sources that are observed in the WASN, a source enumeration method that uses a Lasso penalized Poisson generalized linear model is developed. Each identified cluster estimates the energy signal of a single (dominant) source by applying a two-component MNICA. The VAD problem is transformed into a clustering task by extracting features from the energy signals and applying K-means-type clustering algorithms. All steps of the proposed method are evaluated using numerical experiments. A VAD accuracy of $> 85 \%$ is achieved for a challenging scenario where 20 nodes observe 7 sources in a simulated reverberant rectangular room.
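The last step above, treating VAD as a clustering task on energy features, can be illustrated with a single-channel toy example. This is a sketch under stated assumptions, not the distributed method of the paper: it uses one feature (the log frame energy), a plain two-cluster K-means, and a synthetic signal. The function names, the frame length, and the energy floor are assumptions made for this example.

```python
import math

def frame_log_energies(signal, frame_len, floor=1e-12):
    """Log energy of consecutive non-overlapping frames (trailing samples are dropped)."""
    n_frames = len(signal) // frame_len
    feats = []
    for f in range(n_frames):
        frame = signal[f * frame_len:(f + 1) * frame_len]
        feats.append(math.log(sum(s * s for s in frame) + floor))
    return feats

def two_means_vad(features, iters=50):
    """Two-cluster 1-D K-means; frames closer to the high-energy centre are marked active."""
    if not features:
        return []
    lo, hi = min(features), max(features)  # initial centres
    active = None
    for _ in range(iters):
        new_active = [abs(x - hi) < abs(x - lo) for x in features]
        if new_active == active:
            break  # assignments stopped changing
        active = new_active
        on = [x for x, a in zip(features, active) if a]
        off = [x for x, a in zip(features, active) if not a]
        if on:
            hi = sum(on) / len(on)
        if off:
            lo = sum(off) / len(off)
    return active

# Synthetic signal: quiet, loud, quiet (amplitudes chosen for illustration only).
signal = [0.01] * 40 + [1.0] * 40 + [0.01] * 40
decisions = two_means_vad(frame_log_energies(signal, frame_len=20))
print(decisions)
```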
Jul 06 2016 stat.ME
A new robust and statistically efficient estimator for ARMA models called the bounded influence propagation (BIP) $\tau$-estimator is proposed. The estimator incorporates an auxiliary model, which prevents the propagation of outliers. Strong consistency and asymptotic normality of the estimator for ARMA models that are driven by independently and identically distributed (iid) innovations with symmetric distributions are established. To analyze the infinitesimal effect of outliers on the estimator, the influence function is derived and computed explicitly for an AR(1) model with additive outliers. To obtain estimates for the AR(p) model, a robust Durbin-Levinson-type algorithm and a forward-backward algorithm are proposed. An iterative algorithm to robustly obtain ARMA(p,q) parameter estimates is also presented. The problem of finding a robust initialization is addressed, which for orders p+q>2 is a non-trivial matter. Numerical experiments are conducted to compare the finite sample performance of the proposed estimator to existing robust methodologies for different types of outliers, both in terms of average and of worst-case performance, as measured by the maximum bias curve. To illustrate the practical applicability of the proposed estimator, a real-data example of outlier cleaning for R-R interval plots derived from electrocardiographic (ECG) data is considered. The proposed estimator is not limited to biomedical applications, but is also useful in any real-world problem whose observations can be modeled as an ARMA process disturbed by outliers or impulsive noise.
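As background for the Durbin-Levinson-type algorithm mentioned above, the classical (non-robust) Durbin-Levinson recursion is sketched below. It maps autocovariances to AR(p) coefficients one order at a time. This is the textbook recursion, not the robust variant proposed in the paper, whose details the abstract does not give. The function name and the convention x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p} + e_t are assumptions made for this example.

```python
def durbin_levinson(acov, order):
    """Classical Durbin-Levinson recursion.

    acov[k] is the autocovariance at lag k. Returns the AR coefficients,
    the partial autocorrelations, and the final innovation variance.
    """
    phi = []          # AR coefficients of the current order
    pacf = []         # partial autocorrelations (reflection coefficients)
    err = acov[0]     # innovation variance of the order-0 model
    for m in range(1, order + 1):
        acc = acov[m] - sum(phi[j] * acov[m - 1 - j] for j in range(m - 1))
        kappa = acc / err
        # Update the lower-order coefficients, then append the new one.
        phi = [phi[j] - kappa * phi[m - 2 - j] for j in range(m - 1)] + [kappa]
        err *= 1.0 - kappa ** 2
        pacf.append(kappa)
    return phi, pacf, err

# Autocovariances of an AR(1) process with coefficient 0.5, normalised so acov[0] = 1.
phi, pacf, err = durbin_levinson([1.0, 0.5, 0.25], order=2)
print(phi, err)
```

For these AR(1) autocovariances, fitting an order-2 model returns a second coefficient of zero, which is how the partial autocorrelation identifies the true order.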
Jun 03 2016 stat.ME
Linear inverse problems are ubiquitous. Often the measurements do not follow a Gaussian distribution. Additionally, a model matrix with a large condition number can complicate the problem further by making it ill-posed. In this case, the performance of popular estimators may deteriorate significantly. We have developed a new estimator that is both nearly optimal in the presence of Gaussian errors and robust against outliers. Furthermore, it obtains meaningful estimates when the problem is ill-posed through the inclusion of $\ell_1$ and $\ell_2$ regularizations. The computation of our estimate involves minimizing a non-convex objective function. Hence, we are not guaranteed to find the global minimum in a reasonable amount of time. Thus, we propose two algorithms that converge to a good local minimum in a reasonable (and adjustable) amount of time, as an approximation of the global minimum. We also analyze how the introduction of the regularization term affects the statistical properties of our estimator. We confirm high robustness against outliers and asymptotic efficiency for Gaussian distributions by deriving measures of robustness such as the influence function, sensitivity curve, bias, asymptotic variance, and mean square error. We verify the theoretical results using numerical experiments and show that the proposed estimator outperforms recently proposed methods, especially for increasing amounts of outlier contamination. Python code for all of the algorithms is available online in the spirit of reproducible research.
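To make the combined $\ell_1$ and $\ell_2$ regularization concrete, the sketch below implements the proximal operator of the penalty $\lambda_1 \|x\|_1 + (\lambda_2/2) \|x\|_2^2$, a building block commonly used when minimizing objectives of this kind. It is not taken from the paper: the abstract does not describe the two proposed algorithms, and the penalty scaling, the function names, and the parameters `lam1`, `lam2`, and `step` are assumptions made for this example.

```python
def soft_threshold(v, thr):
    """Scalar soft-thresholding: shrink v toward zero by thr, clipping at zero."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0

def prox_l1_l2(vec, lam1, lam2, step=1.0):
    """Proximal operator of step * (lam1 * ||x||_1 + (lam2 / 2) * ||x||_2^2).

    The l1 part sets small entries exactly to zero; the l2 part rescales
    the surviving entries by 1 / (1 + step * lam2).
    """
    scale = 1.0 + step * lam2
    return [soft_threshold(v, step * lam1) / scale for v in vec]

shrunk = prox_l1_l2([3.0, -0.5, -4.0], lam1=1.0, lam2=1.0)
print(shrunk)  # [1.0, 0.0, -1.5]
```

The $\ell_1$ term produces sparsity, while the $\ell_2$ term stabilizes the estimate when the model matrix is ill-conditioned.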