Statistics (stat)

  • PDF
    Recently, increased computational power and data availability, as well as algorithmic advances, have led machine learning techniques to impressive results in regression, classification, data-generation and reinforcement learning tasks. Despite these impressive results, the proximity to the physical limits of chip fabrication alongside the increasing size of datasets are motivating a growing number of researchers to explore the possibility of harnessing the power of quantum computation to speed-up classical machine learning algorithms. Here we review the literature in quantum machine learning and discuss perspectives for a mixed readership of classical machine learning and quantum computation experts. Particular emphasis will be placed on clarifying the limitations of quantum algorithms, how they compare with their best classical counterparts and why quantum resources are expected to provide advantages for learning problems. Learning in the presence of noise and certain computationally hard problems in machine learning are identified as promising directions for the field. Practical questions, like how to upload classical data into quantum form, will also be addressed.
  • PDF
    We study the active learning problem of top-$k$ ranking from multi-wise comparisons under the popular multinomial logit model. Our goal is to identify the top-$k$ items with high probability by adaptively querying sets for comparisons and observing the noisy output of the most preferred item from each comparison. To achieve this goal, we design a new active ranking algorithm without using any information about the underlying items' preference scores. We also establish a matching lower bound on the sample complexity even when the set of preference scores is given to the algorithm. These two results together show that the proposed algorithm is instance optimal (up to logarithmic factors). Our work extends the existing literature on rank aggregation in three directions. First, instead of studying a static problem with fixed data, we investigate the top-$k$ ranking problem in an active learning setting. Second, we provide the instance optimality, which is a much stronger theoretical guarantee. Finally, we extend the pairwise comparison to the multi-wise comparison, which has not been fully explored in ranking literature.
  • PDF
    This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.
  • PDF
    Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).
  • PDF
    Many of the recent approaches to polyphonic piano note onset transcription require training a machine learning model on a large piano database. However, such approaches are limited by dataset availability; additional training data is difficult to produce, and proposed systems often perform poorly on novel recording conditions. We propose a method to quickly synthesize arbitrary quantities of training data, avoiding the need for curating large datasets. Various aspects of piano note dynamics - including nonlinearity of note signatures with velocity, different articulations, temporal clustering of onsets, and nonlinear note partial interference - are modeled to match the characteristics of real pianos. Our method also avoids the disentanglement problem, a recently noted issue affecting machine-learning based approaches. We train a feed-forward neural network with two hidden layers on our generated training data and achieve both good transcription performance on the large MAPS piano dataset and excellent generalization qualities.
  • PDF
    A deep neural network based architecture was constructed to predict amino acid side chain conformation with unprecedented accuracy. Amino acid side chain conformation prediction is essential for protein homology modeling and protein design. Current widely-adopted methods use physics-based energy functions to evaluate side chain conformation. Here, using a deep neural network architecture without physics-based assumptions, we have demonstrated that side chain conformation prediction accuracy can be improved by more than 25%, especially for aromatic residues compared with current standard methods. More strikingly, the prediction method presented here is robust enough to identify individual conformational outliers from high resolution structures in a protein data bank without providing its structural factors. We envisage that our amino acid side chain predictor could be used as a quality check step for future protein structure model validation and many other potential applications such as side chain assignment in Cryo-electron microscopy, crystallography model auto-building, protein folding and small molecule ligand docking.
  • PDF
    This paper introduces a general Bayesian non- parametric latent feature model suitable to per- form automatic exploratory analysis of heterogeneous datasets, where the attributes describing each object can be either discrete, continuous or mixed variables. The proposed model presents several important properties. First, it accounts for heterogeneous data while can be inferred in linear time with respect to the number of objects and attributes. Second, its Bayesian nonparametric nature allows us to automatically infer the model complexity from the data, i.e., the number of features necessary to capture the latent structure in the data. Third, the latent features in the model are binary-valued variables, easing the interpretability of the obtained latent features in data exploration tasks.
  • PDF
    Hashing has been widely used for large-scale approximate nearest neighbor search because of its storage and search efficiency. Recent work has found that deep supervised hashing can significantly outperform non-deep supervised hashing in many applications. However, most existing deep supervised hashing methods adopt a symmetric strategy to learn one deep hash function for both query points and database (retrieval) points. The training of these symmetric deep supervised hashing methods is typically time-consuming, which makes them hard to effectively utilize the supervised information for cases with large-scale database. In this paper, we propose a novel deep supervised hashing method, called asymmetric deep supervised hashing (ADSH), for large-scale nearest neighbor search. ADSH treats the query points and database points in an asymmetric way. More specifically, ADSH learns a deep hash function only for query points, while the hash codes for database points are directly learned. The training of ADSH is much more efficient than that of traditional symmetric deep supervised hashing methods. Experiments show that ADSH can achieve state-of-the-art performance in real applications.
  • PDF
    A variety of representation learning approaches have been investigated for reinforcement learning; much less attention, however, has been given to investigating the utility of sparse coding. Outside of reinforcement learning, sparse coding representations have been widely used, with non-convex objectives that result in discriminative representations. In this work, we develop a supervised sparse coding objective for policy evaluation. Despite the non-convexity of this objective, we prove that all local minima are global minima, making the approach amenable to simple optimization strategies. We empirically show that it is key to use a supervised objective, rather than the more straightforward unsupervised sparse coding approach. We compare the learned representations to a canonical fixed sparse representation, called tile-coding, demonstrating that the sparse coding representation outperforms a wide variety of tilecoding representations.
  • PDF
    With the development of neural networks based machine learning and their usage in mission critical applications, voices are rising against the \textitblack box aspect of neural networks as it becomes crucial to understand their limits and capabilities. With the rise of neuromorphic hardware, it is even more critical to understand how a neural network, as a distributed system, tolerates the failures of its computing nodes, neurons, and its communication channels, synapses. Experimentally assessing the robustness of neural networks involves the quixotic venture of testing all the possible failures, on all the possible inputs, which ultimately hits a combinatorial explosion for the first, and the impossibility to gather all the possible inputs for the second. In this paper, we prove an upper bound on the expected error of the output when a subset of neurons crashes. This bound involves dependencies on the network parameters that can be seen as being too pessimistic in the average case. It involves a polynomial dependency on the Lipschitz coefficient of the neurons activation function, and an exponential dependency on the depth of the layer where a failure occurs. We back up our theoretical results with experiments illustrating the extent to which our prediction matches the dependencies between the network parameters and robustness. Our results show that the robustness of neural networks to the average crash can be estimated without the need to neither test the network on all failure configurations, nor access the training set used to train the network, both of which are practically impossible requirements.
  • PDF
    This article is concerned with the fitting of multinomial regression models using the so-called "Poisson Trick". The work is motivated by Chen & Kuo (2001) and Malchow-Møller & Svarer (2003) which have been criticized for being computationally inefficient and sometimes producing nonsense results. We first discuss the case of independent data and offer a parsimonious fitting strategy when all covariates are categorical. We then propose a new approach for modelling correlated responses based on an extension of the Gamma-Poisson model, where the likelihood can be expressed in closed-form. The parameters are estimated via an Expectation/Conditional Maximization (ECM) algorithm, which can be implemented using functions for fitting generalized linear models readily available in standard statistical software packages. Compared to existing methods, our approach avoids the need to approximate the intractable integrals and thus the inference is exact with respect to the approximating Gamma-Poisson model. The proposed method is illustrated via a reanalysis of the yogurt data discussed by Chen & Kuo (2001).
  • PDF
    We consider conditional tests for non-negative discrete exponential families. We develop two Markov Chain Monte Carlo (MCMC) algorithms which allow us to sample from the conditional space and to perform approximated tests. The first algorithm is based on the MCMC sampling described by Sturmfels. The second MCMC sampling consists in a more efficient algorithm which exploits the optimal partition of the conditional space into orbits of permutations. We thus establish a link between standard permutation and algebraic-statistics-based sampling. Through a simulation study we compare the exact cumulative distribution function (cdf) with the approximated cdfs which are obtained with the two MCMC samplings and the standard permutation sampling. We conclude that the MCMC sampling which exploits the partition of the conditional space into orbits of permutations gives an estimated cdf, under $H_0$, which is more reliable and converges to the exact cdf with the least steps. This sampling technique can also be used to build an approximation of the exact cdf when its exact computation is computationally infeasible.
  • PDF
    This article presents a weak law of large numbers and a central limit theorem for the scaled realised covariation of a bivariate Brownian semistationary process. The novelty of our results lies in the fact that we derive the suitable asymptotic theory both in a multivariate setting and outside the classical semimartingale framework. The proofs rely heavily on recent developments in Malliavin calculus.
  • PDF
    This article presents various weak laws of large numbers for the so-called realised covariation of a bivariate stationary stochastic process which is not a semimartingale. More precisely, we consider two cases: Bivariate moving average processes with stochastic correlation and bivariate Brownian semistationary processes with stochastic correlation. In both cases, we can show that the (possibly scaled) realised covariation converges to the integrated (possibly volatility modulated) stochastic correlation process.
  • PDF
    Bayesian nonparametrics are a class of probabilistic models in which the model size is inferred from data. A recently developed methodology in this field is small-variance asymptotic analysis, a mathematical technique for deriving learning algorithms that capture much of the flexibility of Bayesian nonparametric inference algorithms, but are simpler to implement and less computationally expensive. Past work on small-variance analysis of Bayesian nonparametric inference algorithms has exclusively considered batch models trained on a single, static dataset, which are incapable of capturing time evolution in the latent structure of the data. This work presents a small-variance analysis of the maximum a posteriori filtering problem for a temporally varying mixture model with a Markov dependence structure, which captures temporally evolving clusters within a dataset. Two clustering algorithms result from the analysis: D-Means, an iterative clustering algorithm for linearly separable, spherical clusters; and SD-Means, a spectral clustering algorithm derived from a kernelized, relaxed version of the clustering problem. Empirical results from experiments demonstrate the advantages of using D-Means and SD-Means over contemporary clustering algorithms, in terms of both computational cost and clustering accuracy.
  • PDF
    Repeated measures analyses require proper choice of the correlation model to ensure accurate inference and optimal efficiency. The linear exponent autoregressive (LEAR) correlation model provides a flexible two-parameter correlation structure that accommodates a variety of data types in which the correlation within-sampling unit decreases exponentially in time or space. The LEAR model subsumes three classic temporal correlation structures, namely compound symmetry, continuous-time AR(1), and MA(1), while maintaining parsimony and providing appealing statistical and computational properties. It also supplies a plausible correlation structure for power analyses across many experimental designs. However, no commonly used statistical packages provide a straightforward way to implement the model, limiting its use to those with the appropriate programming skills. Here we present a reparameterization of the LEAR model that allows easily implementing it in standard software for the special case of data with equally spaced temporal or spatial intervals.
  • PDF
    Individualized treatment rule (ITR) recommends treatment on the basis of individual patient characteristics and the previous history of applied treatments and their outcomes. Despite the fact there are many ways to estimate ITR with binary treatment, algorithms for continuous treatment have only just started to emerge. We propose a novel approach to continuous ITR estimation based on explicit modelling of uncertainty in the subject's outcome as well as direct estimation of the mean outcome using gaussian process regression. Our method incorporates two intuitively appealing properties - it is more inclined to give a treatment with the outcome of higher expected value and lower variance. Experiments show that this direct incorporation of the uncertainty into ITR estimation process allows to select better treatment than standard indirect approach that just models the average. Compared to the competitors (including OWL), the proposed method shows improved performance in terms of value function maximization, has better interpretability, and could be easier generalized to multiple interdependent continuous treatments setting.
  • PDF
    In this article, we consider a stochastic numerical simulator to assess the impact of some factors on a phenomenon. The simulator is seen as a black box with inputs and outputs. The quality of a simulation, hereafter referred to as fidelity, is assumed to be tunable by means of an additional input of the simulator (e.g., a mesh size parameter): high-fidelity simulations provide more accurate results, but are time-consuming. Using a limited computation-time budget, we want to estimate, for any value of the physical inputs, the probability that a certain scalar output of the simulator will exceed a given critical threshold at the highest fidelity level. The problem is addressed in a Bayesian framework, using a Gaussian process model of the multi-fidelity simulator. We consider a Bayesian estimator of the probability, together with an associated measure of uncertainty, and propose a new multi-fidelity sequential design strategy, called Maximum Speed of Uncertainty Reduction (MSUR), to select the value of physical inputs and the fidelity level of new simulations. The MSUR strategy is tested on an example.
  • PDF
    Identifying undocumented or potential future interactions among species is a challenge facing modern ecologists. Our aim is to guide the sampling of host-parasite networks by identifying the most likely undocumented interactions. Recent link prediction methods rely on trait data, however these data are limited to only a fraction of species found in large interaction databases. On the other hand, evolutionary relationships among species, encoded as phylogenetic trees, can act as proxies for underlying traits and historical patterns of parasite sharing among hosts. We show that using a network-based conditional model, phylogenetic information provides significant predictive power in a recently published global database of host-parasite interactions. Drawing from evolutionary biology, we find that applying alternative evolutionary models to the phylogeny greatly improves it. To further improve on the phylogeny-only model, we use a hierarchical Bayesian latent score framework for bipartite graphs that accounts for the number of interactions per species, as well as the host dependence informed by phylogeny. Combining the two information sources yields significant improvement in predictive accuracy over each of the submodels alone. As many interaction networks are constructed from presence-only data, we extend the model by integrating a correction mechanism for missing interactions, which proves valuable in reducing uncertainty in unobserved interactions.
  • PDF
    We propose prior distributions for all parts of the specification of a Markov mesh model. In the formulation we define priors for the sequential neighborhood, for the parametric form of the conditional distributions and for the parameter values. By simulating from the resulting posterior distribution when conditioning on an observed scene, we thereby obtain an automatic model selection procedure for Markov mesh models. To sample from such a posterior distribution, we construct a reversible jump Markov chain Monte Carlo algorithm (RJMCMC). We demonstrate the usefulness of our prior formulation and the limitations of our RJMCMC algorithm in two examples.
  • PDF
    One of the major hurdles preventing the full exploitation of information from online communities is the widespread concern regarding the quality and credibility of user-contributed content. Prior works in this domain operate on a static snapshot of the community, making strong assumptions about the structure of the data (e.g., relational tables), or consider only shallow features for text classification. To address the above limitations, we propose probabilistic graphical models that can leverage the joint interplay between multiple factors in online communities --- like user interactions, community dynamics, and textual content --- to automatically assess the credibility of user-contributed online content, and the expertise of users and their evolution with user-interpretable explanation. To this end, we devise new models based on Conditional Random Fields for different settings like incorporating partial expert knowledge for semi-supervised learning, and handling discrete labels as well as numeric ratings for fine-grained analysis. This enables applications such as extracting reliable side-effects of drugs from user-contributed posts in healthforums, and identifying credible content in news communities. Online communities are dynamic, as users join and leave, adapt to evolving trends, and mature over time. To capture this dynamics, we propose generative models based on Hidden Markov Model, Latent Dirichlet Allocation, and Brownian Motion to trace the continuous evolution of user expertise and their language model over time. This allows us to identify expert users and credible content jointly over time, improving state-of-the-art recommender systems by explicitly considering the maturity of users. This also enables applications such as identifying helpful product reviews, and detecting fake and anomalous reviews with limited information.
  • PDF
    High-dimensional linear and nonlinear models have been extensively used to identify associations between response and explanatory variables. The variable selection problem is commonly of interest in the presence of massive and complex data. An empirical Bayes model for high-dimensional generalized linear models (GLMs) is considered in this paper. The extension of the Iterated Conditional Modes/Medians (ICM/M) algorithm is proposed to build up a GLM. With the construction of pseudodata and pseudovariances based on iteratively reweighted least squares (IRLS), conditional modes are employed to obtain data-drive optimal values for hyperparameters and conditional medians are used to estimate regression coefficients. With a spike-and-slab prior for each coefficient, a conditional median can enforce variable estimation and selection at the same time. The ICM/M algorithm can also incorporate more complicated prior by taking the network structural information into account through the Ising model prior. Here we focus on two extensively used models for genomic data: binary logistic and Cox's proportional hazards models. The performance of the proposed method is demonstrated through both simulation studies and real data examples.
  • PDF
    We combine Bayesian prediction and weighted inference as a unified approach to survey inference. The general principles of Bayesian analysis imply that models for survey outcomes should be conditional on all variables that affect the probability of inclusion. We incorporate the weighting variables under the framework of multilevel regression and poststratification, as a byproduct generating model-based weights after smoothing. We investigate deep interactions and introduce structured prior distributions for smoothing and stability of estimates. The computation is done via Stan and implemented in the open source R package "rstanarm" ready for public use. Simulation studies illustrate that model-based prediction and weighting inference outperform classical weighting. We apply the proposal to the New York Longitudinal Study of Wellbeing. The new approach generates robust weights and increases efficiency for finite population inference, especially for subsets of the population.
  • PDF
    We consider the problem of calibrating inexact computer models using experimental data. To compensate for the misspecification of the computer model, a discrepancy function is usually included and modeled via a Gaussian stochastic process (GaSP), leading to better results of prediction. The calibration parameters in the computer model, however, sometimes become unidentifiable in the GaSP model, and the calibrated computer model fits the experimental data poorly as a consequence. In this work, we propose the scaled Gaussian stochastic process (S-GaSP), a novel stochastic process for calibration and prediction. This new approach bridges the gap between two predominant methods, namely the $L_2$ calibration and GaSP calibration. A computationally feasible approach is introduced for this new model under the Bayesian paradigm. The S-GaSP model not only provides a general framework for calibration, but also enables the computer model to predict well regardless of the discrepancy function. Numerical examples are also provided to illustrate the connections and differences between this new model and other previous approaches.
  • PDF
    We present a new algorithm for the 2D Radix-2 Sliding Window Fourier Transform (SWFT). Our algorithm avoids repeating calculations in overlapping windows by using a tree representation of the Cooley-Tukey Fast Fourier Transform (FFT). For an $N_0 \times N_1$ array and $n_0 = 2^{m_0} \times n_1 = 2^{m_1}$ windows, our algorithm takes $O(N_0 N_1 n_0 n_1)$ operations, which is faster than taking a 2D FFT in each window. We provide a C implementation of the algorithm, compare ours with existing algorithms, and show how the algorithm extends to higher dimensions.
  • PDF
    In this paper, we present a new task that investigates how people interact with and make judgments about towers of blocks. In Experiment~1, participants in the lab solved a series of problems in which they had to re-configure three blocks from an initial to a final configuration. We recorded whether they used one hand or two hands to do so. In Experiment~2, we asked participants online to judge whether they think the person in the lab used one or two hands. The results revealed a close correspondence between participants' actions in the lab, and the mental simulations of participants online. To explain participants' actions and mental simulations, we develop a model that plans over a symbolic representation of the situation, executes the plan using a geometric solver, and checks the plan's feasibility by taking into account the physical constraints of the scene. Our model explains participants' actions and judgments to a high degree of quantitative accuracy.
  • PDF
    Baseline correction plays an important role in past and current methodological debates in ERP research (e.g. the Tanner v. Maess debate in Journal of Neuroscience Methods), serving as a potential alternative to strong highpass filtering. However, the very assumptions that underlie traditional baseline also undermine it, making it statistically unnecessary and even undesirable and reducing signal-to-noise ratio. Including the baseline interval as a predictor in a GLM-based statistical approach allows the data to determine how much baseline correction is needed, including both full traditional and no baseline correction as subcases, while reducing the amount of variance in the residual error term and thus potentially increasing statistical power.
  • PDF
    Independence screening methods such as the two sample $t$-test and the marginal correlation based ranking are among the most widely used techniques for variable selection in ultrahigh dimensional data sets. In this sort note, simple examples are used to demonstrate potential problems with the independence screening methods in the presence of correlated predictors.
  • PDF
    The reliability of a complex industrial system can rarely be assessed analytically. As system failure is often a rare event, crude Monte-Carlo methods are prohibitively expensive from a computational point of view. In order to reduce computation times, variance reduction methods such as importance sampling can be used. We propose an adaptation of this method for a class of multi-component dynamical systems. We address a system whose failure corresponds to a physical variable of the system (temperature, pressure, water level) entering a critical region. Such systems are common in hydraulic and nuclear industry. In these systems, the statuses of the components (on, off, or out-of-order) determine the dynamics of the physical variables, and is altered both by deterministic feedback mechanisms and random failures or repairs. In order to deal with this interplay between components status and physical variables we model trajectory using piecewise deterministic Markovian processes (PDMP). We show how to adapt the importance sampling method to PDMP, by introducing a reference measure on the trajectory space, and we present a biasing strategy for importance sampling. A simulation study compares our importance sampling method to the crude Monte-Carlo method for a three-component-system.
  • PDF
    In this paper, we exploit the theory of compressive sensing to perform detection of a random source in a dense sensor network. When the sensors are densely deployed, observations at adjacent sensors are highly correlated while those corresponding to distant sensors are less correlated. Thus, the covariance matrix of the concatenated observation vector of all the sensors at any given time can be sparse where the sparse structure depends on the network topology and the correlation model. Exploiting the sparsity structure of the covariance matrix, we develop a robust nonparametric detector to detect the presence of the random event using a compressed version of the data collected at the distributed nodes. We employ the multiple access channel (MAC) model with distributed random projections for sensors to transmit observations so that a compressed version of the observations is available at the fusion center. Detection is performed by constructing a decision statistic based on the covariance information of uncompressed data which is estimated using compressed data. The proposed approach does not require any knowledge of the noise parameter to set the threshold, and is also robust when the distributed random projection matrices become sparse.

Recent comments

Noon van der Silk Mar 08 2017 04:45 UTC

I feel that while the proliferation of GUNs is unquestionable a good idea, there are many unsupervised networks out there that might use this technology in dangerous ways. Do you think Indifferential-Privacy networks are the answer? Also I fear that the extremist binary networks should be banned ent

...(continued)
Noon van der Silk Jan 27 2016 03:39 UTC

Great institute name ...

Alessandro Dec 09 2015 01:12 UTC

Hey, I've already seen this title! http://arxiv.org/abs/1307.0401

Chris Granade Sep 22 2015 19:15 UTC

Thank you for the kind comments, I'm glad that our paper, source code, and tutorial are useful!

Travis Scholten Sep 21 2015 17:05 UTC

This was a really well-written paper! Am very glad to see this kind of work being done.

In addition, the openness about source code is refreshing. By explicitly relating the work to [QInfer](https://github.com/csferrie/python-qinfer), this paper makes it more easy to check the authors' work. Furthe

...(continued)
Chris Granade Sep 15 2015 02:40 UTC

As a quick addendum, please note that the [supplementary video](https://www.youtube.com/watch?v=22ejRV0Kx2g) for this work is available [on YouTube](https://www.youtube.com/watch?v=22ejRV0Kx2g). Thank you!

Richard Kueng Mar 08 2015 22:02 UTC

Neither, Frédéric! Replacing fidelity by superfidelity still requires optimizing over all density matrices. However, the Birkhoff-von Neumann Theorem (see Lemma 1) allows for further restricting this optimization to n scalar variables w.l.o.g.---Theorem 2. Arguably, this greatly simplifies the geome

...(continued)
Frédéric Grosshans Mar 05 2015 11:31 UTC

I fell for that clickbait title and read the paper. I still don’t get why von Neumann didn't want us to know about this weird trick? And which weird trick? The use of superfidelity or the use of non-physical density matrices like $\sigma^\sharp$?

Noon van der Silk Mar 03 2015 03:20 UTC

I took the liberty of uploading the IPython notebook as a github [gist](https://gist.github.com), so it's viewable [here](http://nbviewer.ipython.org/urls/gist.githubusercontent.com/silky/b14fa42c6d5475a3a724/raw/887c19fb04581f1a33f9d03370e4b7b3a33c2ea8/ferrie_kueng_bayes_est_fid.ipynb).