Automatic music transcription (AMT) aims to infer a latent symbolic representation of a piece of music (a piano roll) from a corresponding observed audio recording. Transcribing polyphonic music (where multiple notes are played simultaneously) is a challenging problem, due to the highly structured overlap between harmonics. We study whether introducing physically inspired Gaussian process (GP) priors into audio content analysis models improves the extraction of patterns required for AMT. Audio signals are described as a linear combination of sources, each decomposed into the product of an amplitude envelope and a quasi-periodic component process. We introduce the Matérn spectral mixture (MSM) kernel for describing the frequency content of single notes. We consider two different regression approaches: in the sigmoid model, every pitch activation is independently non-linearly transformed; in the softmax model, several activation GPs are jointly non-linearly transformed, which introduces cross-correlation between activations. We use variational Bayes for approximate inference. We empirically evaluate how these models perform in practice when transcribing polyphonic music, and demonstrate that, rather than encouraging dependency between activations, what matters for improving pitch detection is learning priors that fit the frequency content of the sound events to be detected.
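To make the model components concrete, here is a minimal numerical sketch (our illustration, not the authors' implementation) of a Matérn-1/2 spectral mixture covariance, built as a weighted sum of exponentially damped cosines with one component per partial, alongside the two activation links compared above; all weights, lengthscales and frequencies are illustrative values.

```python
import numpy as np

def msm_kernel(tau, weights, lengthscales, freqs):
    """Matern-1/2 spectral mixture: a weighted sum of exponentially
    damped cosines, one component per harmonic partial."""
    tau = np.abs(np.asarray(tau, dtype=float))
    k = np.zeros_like(tau)
    for w, ell, f in zip(weights, lengthscales, freqs):
        k += w * np.exp(-tau / ell) * np.cos(2.0 * np.pi * f * tau)
    return k

def sigmoid(x):
    """Independent link: each pitch activation squashed separately."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Joint link: activations compete, introducing cross-correlation."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Covariance between pairs of time points for one note's component process,
# with partials at multiples of a 440 Hz fundamental (illustrative values).
t = np.linspace(0.0, 0.05, 200)
K = msm_kernel(t[:, None] - t[None, :],
               weights=[1.0, 0.5, 0.25],
               lengthscales=[0.02, 0.015, 0.01],
               freqs=[440.0, 880.0, 1320.0])
```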
Dec 19 2016 cs.SD
We introduce a novel approach to studying animal behaviour and the context in which it occurs, through the use of microphone backpacks carried on the backs of individual free-flying birds. These sensors are increasingly used by animal behaviour researchers to study individual vocalisations of freely behaving animals, even in the field. However, such devices may record more than an animal's vocal behaviour, and have the potential to be used for investigating specific activities (movement) and the context (background) within which vocalisations occur. To facilitate this approach, we investigate the automatic annotation of such recordings through two different sound scene analysis paradigms: a scene-classification method using feature learning, and an event-detection method using probabilistic latent component analysis (PLCA). We analyse recordings made with Eurasian jackdaws (Corvus monedula) in both captive and field settings. Results are comparable with the state of the art in sound scene analysis; we find that the current recognition quality level enables scalable automatic annotation of audio logger data, given partial annotation, but also that individual differences between animals and/or their backpacks limit generalisation from one individual to another. Finally, we consider the interrelation of 'scenes' and 'events' in this particular task, and issues of temporal resolution.
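For the event-detection side, the following is a generic sketch of PLCA on a magnitude spectrogram, fitted by EM; it illustrates the factorisation the abstract names but is not the paper's annotation pipeline, and the component count and iteration budget are arbitrary choices.

```python
import numpy as np

def plca(V, n_components=8, n_iter=100, rng=np.random.default_rng(0)):
    """Minimal PLCA: factorise a magnitude spectrogram V (freq x time)
    as P(f,t) = sum_z P(z) P(f|z) P(t|z) via EM."""
    P_ft = V / V.sum()  # treat the spectrogram as a joint distribution
    F, T = P_ft.shape
    Pz = np.full(n_components, 1.0 / n_components)
    Pf_z = rng.random((F, n_components)); Pf_z /= Pf_z.sum(axis=0)
    Pt_z = rng.random((T, n_components)); Pt_z /= Pt_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior over components at each (f, t) cell
        joint = Pz[None, None, :] * Pf_z[:, None, :] * Pt_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: reweight the marginals by expected counts
        counts = P_ft[:, :, None] * post
        Pz = counts.sum(axis=(0, 1))
        Pf_z = counts.sum(axis=1) / (Pz[None, :] + 1e-12)
        Pt_z = counts.sum(axis=0) / (Pz[None, :] + 1e-12)
    return Pz, Pf_z, Pt_z
```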
Aug 12 2016 cs.SD
Many biological monitoring projects rely on acoustic detection of birds. Despite increasingly large datasets, this detection is often manual or only semi-automatic, requiring manual tuning and postprocessing. We review the state of the art in automatic bird sound detection and identify a widespread need for tuning-free and species-agnostic approaches. We introduce new datasets and an IEEE research challenge to address this need, with the aim of making possible the development of fully automatic algorithms for bird sound detection.
Real music signals are highly variable, yet they have strong statistical structure. Prior information about the underlying physical mechanisms by which sounds are generated, and the rules by which complex sound structure is constructed (notes, chords, a complete musical score), can be naturally unified using Bayesian modelling techniques. Typically, algorithms for automatic music transcription carry out individual tasks such as multiple-F0 detection and beat tracking independently; the challenge remains to perform joint estimation of all parameters. We present a Bayesian approach to modelling music audio and performing content analysis. The proposed methodology, based on Gaussian processes, seeks joint estimation of multiple music concepts by incorporating into the kernel prior information about the non-stationary behaviour, dynamics, and rich spectral content present in the modelled music signal. We illustrate the benefits of this approach via two tasks: pitch estimation, and inferring missing segments in a polyphonic audio recording.
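As a sketch of the missing-segment task, the following conditions a GP with a hand-picked quasi-periodic kernel (a periodic term damped by a squared-exponential envelope) on observed samples to predict a gap; the kernel form and every hyperparameter here are illustrative stand-ins rather than the paper's learned priors.

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, f0=220.0, ell_per=1.0, ell_amp=0.05, var=1.0):
    """Periodic x squared-exponential kernel: locally periodic structure
    whose waveform shape drifts slowly, as in a decaying musical note."""
    tau = t1[:, None] - t2[None, :]
    periodic = np.exp(-2.0 * np.sin(np.pi * f0 * tau) ** 2 / ell_per ** 2)
    drift = np.exp(-0.5 * (tau / ell_amp) ** 2)
    return var * periodic * drift

# Fill in a missing segment by GP posterior conditioning (illustrative setup).
fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
missing = (t > 0.04) & (t < 0.06)
t_obs, t_gap = t[~missing], t[missing]
y_obs = np.sin(2 * np.pi * 220 * t_obs) * np.exp(-5 * t_obs)  # stand-in signal

K = quasi_periodic_kernel(t_obs, t_obs) + 1e-4 * np.eye(len(t_obs))
K_star = quasi_periodic_kernel(t_gap, t_obs)
y_gap_mean = K_star @ np.linalg.solve(K, y_obs)  # posterior mean over the gap
```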
Mar 24 2016 cs.SD
Bird calls range from simple tones to rich dynamic multi-harmonic structures. The more complex calls, such as those of the scientifically important corvid family (jackdaws, crows, ravens, etc.), are very poorly understood at present. Individual birds can recognise familiar individuals from calls, but where in the signal is this identity encoded? We study this question by applying a combination of feature representations to a dataset of jackdaw calls, including linear predictive coding (LPC) and adaptive discrete Fourier transform (aDFT). We demonstrate through a classification paradigm that we can strongly outperform a standard spectrogram representation for identifying individuals, and we apply metric learning to determine which time-frequency regions contribute most strongly to robust individual identification. Computational methods can help to direct our search for an understanding of these complex biological signals.
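As one example of the feature representations mentioned, here is a generic autocorrelation-method LPC extractor solving the Yule-Walker equations; the frame length and model order are illustrative, and this is not the paper's exact configuration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=16):
    """Autocorrelation-method linear predictive coding: solve the
    Yule-Walker equations R a = r for the predictor coefficients."""
    frame = frame * np.hanning(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return a  # one fixed-length feature vector per frame

def lpc_features(signal, frame_len=512, hop=256, order=16):
    """Frame a call and stack the per-frame LPC vectors as features."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    return np.vstack([lpc_coefficients(f, order) for f in frames])
```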
Mar 24 2016 cs.SD
Many approaches have been used to classify bird species from their sound, in order to provide labels for a whole recording. However, a more precise classification of each individual bird vocalisation would be of great value for the use and management of sound archives and for bird monitoring. In this work, we introduce a two-step technique that first automatically detects all bird vocalisations and then, with the use of 'weakly' labelled recordings, classifies them. Evaluations of our proposed method show that it achieves a correct classification rate of 61% on a synthetic dataset, rising to 89% when the synthetic dataset consists only of vocalisations larger than 1000 pixels.
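A schematic of the two-step idea might look as follows; the threshold detector and the weak-label pairing are simplified stand-ins for the paper's actual components, and all names and parameters are hypothetical.

```python
import numpy as np

def detect_segments(energy, thresh, min_len=5):
    """Step 1: mark frames above an energy threshold and merge runs
    of active frames into candidate vocalisation events."""
    active = energy > thresh
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

def weak_training_pairs(features, segments, recording_labels):
    """Step 2 (weak labels): every segment detected in a recording
    initially inherits that recording's label set; a classifier trained
    on these noisy pairs can then re-score individual vocalisations."""
    return [(features[s:e].mean(axis=0), recording_labels)
            for (s, e) in segments]
```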
Animals in groups often exchange calls, in patterns whose temporal structure may be influenced by contextual factors such as physical location and the social network structure of the group. We introduce a model-based analysis for temporal patterns of animal call timing, originally developed for networks of firing neurons. This has advantages over cross-correlation analysis in that it can correctly handle common-cause confounds and provides a generative model of call patterns with explicit parameters for the influences between individuals. It also has advantages over standard Markovian analysis in that it incorporates detailed temporal interactions which affect timing as well as sequencing of calls. Further, a fitted model can be used to generate novel synthetic call sequences. We apply the method to calls recorded from groups of domesticated zebra finch (Taeniopygia guttata) individuals. We find that the communication network in these groups has stable structure that persists from one day to the next, and that "kernels" reflecting the temporal range of influence have a characteristic structure for a calling individual's effect on itself, its partner, and on others in the group. We further find characteristic patterns of influences by call type as well as by individual.
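The neuron-style model referred to is, in discrete time, a Poisson GLM in which each individual's call intensity is its baseline plus convolutions of everyone's past calls with pairwise influence kernels. A minimal sketch (our illustration, with illustrative shapes and no fitting code):

```python
import numpy as np

def discretised_intensity(call_trains, baselines, kernels):
    """Discrete-time GLM intensity for each individual:
    lambda_i[t] = exp(b_i + sum_j (calls_j * kernel_ij)[t]).
    call_trains: (n_birds, n_bins) binned call counts.
    kernels: (n_birds, n_birds, L) influence kernels; use kernel[...,0]=0
    for strict causality, since lag zero is included by the convolution."""
    n_birds, n_bins = call_trains.shape
    log_lam = np.tile(baselines[:, None], (1, n_bins))
    for i in range(n_birds):
        for j in range(n_birds):
            influence = np.convolve(call_trains[j], kernels[i][j])[:n_bins]
            log_lam[i] += influence
    return np.exp(log_lam)

def poisson_log_likelihood(call_trains, lam, dt):
    """Log-likelihood of the observed binned call counts under the model."""
    return np.sum(call_trains * np.log(lam * dt + 1e-12) - lam * dt)
```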
Training a denoising autoencoder neural network requires access to truly clean data, a requirement which is often impractical. To remedy this, we introduce a method to train an autoencoder using only noisy data, given examples with and without the signal class of interest. The autoencoder learns a partitioned representation of signal and noise, learning to reconstruct each separately. We illustrate the method by denoising birdsong audio (available abundantly in uncontrolled noisy datasets) using a convolutional autoencoder.
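A heavily simplified sketch of the partitioned idea, using dense layers and hard masking for brevity (the paper uses a convolutional autoencoder, and its constraint on the signal latents may be implemented as a soft regulariser rather than the hard zeroing shown here):

```python
import torch
import torch.nn as nn

class PartitionedAutoencoder(nn.Module):
    """Latent units are split into a 'signal' part and a 'noise' part."""
    def __init__(self, n_in=64, n_signal=32, n_noise=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_signal + n_noise),
                                     nn.ReLU())
        self.decoder = nn.Linear(n_signal + n_noise, n_in)
        self.n_signal = n_signal

    def forward(self, x, noise_only):
        z = self.encoder(x)
        # For noise-only examples, suppress the signal latents, so the
        # noise latents must carry all structure present in pure noise.
        if noise_only:
            z = torch.cat([torch.zeros_like(z[:, :self.n_signal]),
                           z[:, self.n_signal:]], dim=1)
        return self.decoder(z)

# Training sketch: reconstruct every (noisy) example, but route noise-only
# minibatches through the masked path; the signal latents then specialise.
model = PartitionedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x, noise_only):
    opt.zero_grad()
    loss = loss_fn(model(x, noise_only), x)
    loss.backward()
    opt.step()
    return loss.item()
```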
Mar 25 2015 cs.SD
Many current paradigms for acoustic event detection (AED) are not adapted to the organic variability of natural sounds, and/or they assume a limit on the number of simultaneous sources: often only one source, or one source of each type, may be active. These assumptions are highly undesirable for applications such as bird population monitoring. We introduce a simple method that models the onsets, durations and offsets of acoustic events, avoiding intrinsic limits on polyphony or on inter-event temporal patterns. We evaluate the method in a case study with over 3000 zebra finch calls. In comparison with an HMM-based method, we find it more accurate at recovering acoustic events, and more robust for estimating calling rates.
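The modelling premise can be illustrated by a toy generative process in which onsets follow a Poisson process and durations are drawn independently, so events may overlap without limit; this is our illustration of the property described, not the paper's model.

```python
import numpy as np

def sample_events(rate, mean_duration, t_max, rng=np.random.default_rng(0)):
    """Generative sketch: event onsets from a Poisson process, durations
    drawn independently, so any number of events may overlap."""
    onsets, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t >= t_max:
            break
        onsets.append(t)
    durations = rng.exponential(mean_duration, size=len(onsets))
    return [(on, min(on + d, t_max)) for on, d in zip(onsets, durations)]

# Illustrative usage: two calls per second on average, 0.3 s mean duration.
events = sample_events(rate=2.0, mean_duration=0.3, t_max=60.0)
```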
In this article we present an account of the state of the art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce. Starting from a historical review of previous research in this area, we define a general framework for ASC and present different implementations of its components. We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques. The dataset recorded for this purpose is presented, along with the performance metrics that are used to evaluate the algorithms and the statistical significance tests used to compare the submitted methods. We use a baseline method that employs MFCCs, GMMs and a maximum likelihood criterion as a benchmark, and find sufficient evidence to conclude that only three algorithms significantly outperform it. We also evaluate human classification accuracy on a similar classification task. The best performing algorithm achieves a mean accuracy that matches the median accuracy obtained by humans, and the pairs of classes most commonly confused are the same for computers and humans. However, all acoustic scenes are correctly classified by at least some individual humans, while there are scenes that are misclassified by all algorithms.
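The baseline described can be sketched directly: per-class GMMs over MFCC frames, with a maximum-likelihood decision over summed frame log-likelihoods. Concrete values such as the number of coefficients and mixture components are illustrative.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def train_baseline(recordings_by_class, n_mfcc=20, n_components=8):
    """Fit one GMM over MFCC frames per scene class.
    recordings_by_class: {label: [(waveform, sample_rate), ...]}."""
    models = {}
    for label, clips in recordings_by_class.items():
        frames = np.vstack([librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
                            for (y, sr) in clips])
        models[label] = GaussianMixture(n_components).fit(frames)
    return models

def classify(models, y, sr, n_mfcc=20):
    """Maximum-likelihood decision over the summed frame log-likelihoods."""
    frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    return max(models, key=lambda c: models[c].score_samples(frames).sum())
```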
Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, with a random forest classifier. We demonstrate that MFCCs are of limited power in this context, leading to worse performance than the raw Mel spectral data. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain.
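One common recipe for the unsupervised feature-learning stage, similar in spirit to what the abstract describes, is whitening followed by k-means, with the learned centroids used as bases and max-pooling over time; the sketch below uses illustrative settings and should not be read as the paper's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def learn_features(mel_frames, n_bases=256):
    """Unsupervised stage: whiten Mel frames, then cluster; the cluster
    centroids act as a learned set of spectro-temporal bases."""
    whitener = PCA(whiten=True).fit(mel_frames)
    km = KMeans(n_clusters=n_bases, n_init=4).fit(whitener.transform(mel_frames))
    return whitener, km

def encode_recording(mel_frames, whitener, km):
    """Project each frame onto the learned bases, then max-pool over time
    to get one fixed-length feature vector per recording."""
    sims = whitener.transform(mel_frames) @ km.cluster_centers_.T
    return sims.max(axis=0)

# Supervised stage (labels needed only here): a random forest over the
# pooled per-recording feature vectors, e.g.
# clf = RandomForestClassifier(n_estimators=200).fit(X_pooled, labels)
```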
Nov 20 2013 cs.SD
Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment, and also that there are specific social uses of FM, such as trills in aggressive territorial encounters. Yet the temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction, so it is important to consider high-resolution signal processing techniques for the analysis of FM in bird vocalisations. If such methods can be applied at big data scales, this offers a further advantage as large datasets become available. We introduce methods from the signal processing literature which go beyond spectrogram representations to analyse the fine modulations present in a signal at very short timescales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. In order to find tools useful for the practical analysis of large databases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression. We find that three of the methods can robustly represent species-correlated FM attributes, and that the simplest method tested also appears to perform best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. Large-scale FM analysis can efficiently extract information useful for bioacoustic studies, in addition to the measures more commonly used to characterise vocalisations.
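As a crude illustration of the quantity under study, the following tracks the per-frame peak frequency of a spectrogram and differentiates it to obtain an FM rate; capturing fine FM detail requires the higher-resolution methods the abstract discusses, which this sketch does not implement.

```python
import numpy as np
from scipy.signal import stft

def fm_trajectory(y, fs, nperseg=256, hop=64):
    """Crude FM estimate: track the peak frequency per spectrogram frame,
    then take its time derivative (Hz per second)."""
    f, t, Z = stft(y, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    peak_freq = f[np.abs(Z).argmax(axis=0)]
    fm_rate = np.gradient(peak_freq, t)
    return t, peak_freq, fm_rate
```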
We introduce a free and open dataset of 7690 audio clips sampled from the field-recording tag in the Freesound audio archive. The dataset is designed for use in research related to data mining in audio archives of field recordings / soundscapes. Audio is standardised, and audio and metadata are Creative Commons licensed. We describe the data preparation process, characterise the dataset descriptively, and illustrate its use through an auto-tagging experiment.
Feb 15 2013 cs.SD
Segregating an audio mixture containing multiple simultaneous bird sounds is a challenging task. However, birdsong often contains rapid pitch modulations, and these modulations carry information which may be of use in automatic recognition. In this paper we demonstrate that an improved spectrogram representation, based on the distribution derivative method, leads to improved performance of a segregation algorithm which uses a Markov renewal process model to track vocalisation patterns consisting of singing and silences.
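A reassignment-style refinement related in spirit to (but simpler than) the distribution derivative method can be sketched as follows: take the STFT of a frame with a window and with that window's time derivative, and correct each bin frequency using the imaginary part of their ratio. This is our illustrative sketch, not the paper's representation.

```python
import numpy as np

def refined_frequencies(frame, fs):
    """Refine bin frequencies with a derivative-window spectrum:
    for a sinusoid, Im(X_dh / X_h) equals the offset (in rad/sample)
    of the bin frequency from the true frequency."""
    n = len(frame)
    h = np.hanning(n)
    dh = np.gradient(h)                 # window derivative, per sample
    X_h = np.fft.rfft(frame * h)
    X_dh = np.fft.rfft(frame * dh)
    f_bin = np.fft.rfftfreq(n, d=1.0 / fs)
    correction = -(fs / (2 * np.pi)) * np.imag(X_dh / (X_h + 1e-12))
    return f_bin + correction
```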
Feb 04 2013 cs.SD
In musical performances with expressive tempo modulation, the tempo variation can be modelled as a sequence of tempo arcs. Previous authors have used this idea to estimate series of piecewise arc segments from data. In this paper we describe a probabilistic model for a time-series process of this nature, and use it to perform inference of single- and multi-level arc processes from data. We describe an efficient Viterbi-like process for MAP inference of arcs. Our approach is score-agnostic; together with efficient inference, this allows for online analysis of performances, including improvisations, and for prediction of immediate future tempo trajectories.
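The MAP inference step can be caricatured as a dynamic programme over segment boundaries, with a quadratic fit standing in for an 'arc' and a fixed per-segment penalty standing in for the model's transition probabilities; the sketch below is illustrative, not the paper's algorithm.

```python
import numpy as np

def fit_arc_cost(t, y):
    """Squared residual of a quadratic ('arc') fit to one candidate segment."""
    coeffs = np.polyfit(t, y, deg=2)
    return np.sum((np.polyval(coeffs, t) - y) ** 2)

def map_arc_segmentation(t, y, penalty, min_len=4):
    """Viterbi-style dynamic programme over segment boundaries: best[i] is
    the minimal cost of explaining points 0..i-1 as a sequence of arcs,
    with a fixed penalty charged per new arc."""
    n = len(y)
    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for i in range(min_len, n + 1):
        for j in range(0, i - min_len + 1):
            c = best[j] + fit_arc_cost(t[j:i], y[j:i]) + penalty
            if c < best[i]:
                best[i], back[i] = c, j
    # Backtrack from the end to recover the arc boundaries.
    bounds, i = [], n
    while i > 0:
        bounds.append((back[i], i))
        i = back[i]
    return bounds[::-1]
```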
Nov 14 2012 cs.AI
We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via a synthetic experiment as well as an experiment to track a mixture of singing birds.
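To make the task concrete, here is a greedy caricature of the inference problem (far simpler than the paper's approach): each event joins the existing stream whose renewal inter-arrival likelihood is highest, or starts a new stream; the gamma renewal density and the new-stream threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gamma

def cluster_by_renewal(times, shape=2.0, scale=0.5, new_stream_cost=-8.0):
    """Greedy sketch: assign each timestamped event to the existing stream
    whose renewal (gamma) inter-arrival likelihood is highest, or start a
    new stream when no continuation beats a fixed log-likelihood floor."""
    streams = []   # each stream stores the time of its latest event
    labels = []
    for t in sorted(times):
        scores = [gamma.logpdf(t - last, a=shape, scale=scale)
                  for last in streams]
        if scores and max(scores) > new_stream_cost:
            k = int(np.argmax(scores))
            streams[k] = t
        else:
            k = len(streams)
            streams.append(t)
        labels.append(k)
    return labels
```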