This report presents our audio event detection system submitted for Task 2, "Detection of rare sound events", of DCASE 2017 challenge. The proposed system is based on convolutional neural networks (CNNs) and deep neural networks (DNNs) coupled with novel weighted and multi-task loss functions and state-of-the-art phase-aware signal enhancement. The loss functions are tailored for audio event detection in audio streams. The weighted loss is designed to tackle the common issue of imbalanced data in background/foreground classification while the multi-task loss enables the networks to simultaneously model the class distribution and the temporal structures of the target events for recognition. Our proposed systems significantly outperform the challenge baseline, improving F-score from 72.7% to 89.8% and reducing detection error rate from 0.53 to 0.19 on average.
Color theme or color palette can deeply influence the quality and the feeling of a photograph or a graphical design. Although color palettes may come from different sources such as online crowd-sourcing, photographs and graphical designs, in this paper, we consider color palettes extracted from fine art collections, which we believe to be an abundant source of stylistic and unique color themes. We aim to capture color styles embedded in these collections by means of statistical models and to build practical applications upon these models. As artists often use their personal color themes in their paintings, making these palettes appear frequently in the dataset, we employed density estimation to capture the characteristics of palette data. Via density estimation, we carried out various predictions and interpolations on palettes, which led to promising applications such as photo-style exploration, real-time color suggestion, and enriched photo recolorization. It was, however, challenging to apply density estimation to palette data as palettes often come as unordered sets of colors, which make it difficult to use conventional metrics on them. To this end, we developed a divide-and-conquer sorting algorithm to rearrange the colors in the palettes in a coherent order, which allows meaningful interpolation between color palettes. To confirm the performance of our model, we also conducted quantitative experiments on datasets of digitized paintings collected from the Internet and received favorable results.
We introduce in this work an efficient approach for audio scene classification using deep recurrent neural networks. An audio scene is firstly transformed into a sequence of high-level label tree embedding feature vectors. The vector sequence is then divided into multiple subsequences on which a deep GRU-based recurrent neural network is trained for sequence-to-label classification. The global predicted label for the entire sequence is finally obtained via aggregation of subsequence classification outputs. We will show that our approach obtains an F1-score of 97.7% on the LITIS Rouen dataset, which is the largest dataset publicly available for the task. Compared to the best previously reported result on the dataset, our approach is able to reduce the relative classification error by 35.3%.
Dec 30 2016 cs.SD
There is a common observation that audio event classification is easier to deal with than detection. So far, this observation has been accepted as a fact and we lack of a careful analysis. In this paper, we reason the rationale behind this fact and, more importantly, leverage them to benefit the audio event detection task. We present an improved detection pipeline in which a verification step is appended to augment a detection system. This step employs a high-quality event classifier to postprocess the benign event hypotheses outputted by the detection system and reject false alarms. To demonstrate the effectiveness of the proposed pipeline, we implement and pair up different event detectors based on the most common detection schemes and various event classifiers, ranging from the standard bag-of-words model to the state-of-the-art bank-of-regressors one. Experimental results on the ITC-Irst dataset show significant improvements to detection performance. More importantly, these improvements are consistent for all detector-classifier combinations.
We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5%, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves the baselines by a relative amount of 17% and 19%, respectively. The network only consists of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. It particularly possesses neither fully-connected layers, besides the fully-connected output layer, nor dropout layers.
We describe in this report our audio scene recognition system submitted to the DCASE 2016 challenge. Firstly, given the label set of the scenes, a label tree is automatically constructed. This category taxonomy is then used in the feature extraction step in which an audio scene instance is represented by a label tree embedding image. Different convolutional neural networks, which are tailored for the task at hand, are finally learned on top of the image features for scene recognition. Our system reaches an overall recognition accuracy of 81.2% and 83.3% and outperforms the DCASE 2016 baseline with absolute improvements of 8.7% and 6.1% on the development and test data, respectively.
This report describes our submissions to Task2 and Task3 of the DCASE 2016 challenge. The systems aim at dealing with the detection of overlapping audio events in continuous streams, where the detectors are based on random decision forests. The proposed forests are jointly trained for classification and regression simultaneously. Initially, the training is classification-oriented to encourage the trees to select discriminative features from overlapping mixtures to separate positive audio segments from the negative ones. The regression phase is then carried out to let the positive audio segments vote for the event onsets and offsets, and therefore model the temporal structure of audio events. One random decision forest is specifically trained for each event category of interest. Experimental results on the development data show that our systems significantly outperform the baseline on the Task2 evaluation while they are inferior to the baseline in the Task3 evaluation.
We present in this paper an efficient approach for acoustic scene classification by exploring the structure of class labels. Given a set of class labels, a category taxonomy is automatically learned by collectively optimizing a clustering of the labels into multiple meta-classes in a tree structure. An acoustic scene instance is then embedded into a low-dimensional feature representation which consists of the likelihoods that it belongs to the meta-classes. We demonstrate state-of-the-art results on two different datasets for the acoustic scene classification task, including the DCASE 2013 and LITIS Rouen datasets.
We introduce a new learned descriptor for audio signals which is efficient for event representation. The entries of the descriptor are produced by evaluating a set of regressors on the input signal. The regressors are class-specific and trained using the random regression forests framework. Given an input signal, each regressor estimates the onset and offset positions of the target event. The estimation confidence scores output by a regressor are then used to quantify how the target event aligns with the temporal structure of the corresponding category. Our proposed descriptor has two advantages. First, it is compact, i.e. the dimensionality of the descriptor is equal to the number of event classes. Second, we show that even simple linear classification models, trained on our descriptor, yield better accuracies on audio event classification task than not only the nonlinear baselines but also the state-of-the-art results.
We present in this paper a simple, yet efficient convolutional neural network (CNN) architecture for robust audio event recognition. Opposing to deep CNN architectures with multiple convolutional and pooling layers topped up with multiple fully connected layers, the proposed network consists of only three layers: convolutional, pooling, and softmax layer. Two further features distinguish it from the deep architectures that have been proposed for the task: varying-size convolutional filters at the convolutional layer and 1-max pooling scheme at the pooling layer. In intuition, the network tends to select the most discriminative features from the whole audio signals for recognition. Our proposed CNN not only shows state-of-the-art performance on the standard task of robust audio event recognition but also outperforms other deep architectures up to 4.5% in terms of recognition accuracy, which is equivalent to 76.3% relative error reduction.
Apr 06 2016 cs.CE
This paper investigates nonlinear bending and buckling behaviours of composite plates characterized by a thickness variation. Layer interfaces are described as functions of inplane coordinates. Top and bottom surfaces of the plate are symmetric about the midplane and the plate could be considered as a flat surface in analysis along with thickness parameters which vary over the plate. The variable thickness at a certain position in the midplane is modeled by a set of control points (or thickness-parameters) through NURBS (Non-Uniform Rational B-Spline) basic functions. The knot parameter space which is referred in modelling geometry and approximating displacement variables is employed for approximating thickness, simultaneously. The use of quadratic NURBS functions results in C^1 continuity of modeling variable thickness and analyzing solutions. Thin to moderately thick laminates in bound of first-order shear deformation theory (FSDT) are taken into account. Strain-displacement relations in sense of von-Karman theory are employed for large deformation. Riks method is used for geometrically nonlinear analysis. The weak form is approximated numerically by the isogeometric analysis (IGA), which has been found to be a robust, stable and realistic numerical tool. Numerical results confirm the reliability and capacity of the propose method.
Recognizing acoustic events is an intricate problem for a machine and an emerging field of research. Deep neural networks achieve convincing results and are currently the state-of-the-art approach for many tasks. One advantage is their implicit feature learning, opposite to an explicit feature extraction of the input signal. In this work, we analyzed whether more discriminative features can be learned from either the time-domain or the frequency-domain representation of the audio signal. For this purpose, we trained multiple deep networks with different architectures on the Freiburg-106 and ESC-10 datasets. Our results show that feature learning from the frequency domain is superior to the time domain. Moreover, additionally using convolution and pooling layers, to explore local structures of the audio signal, significantly improves the recognition performance and achieves state-of-the-art results.
Dec 12 2014 cs.HC
As modern industry shifts toward significant globalization, robust and adaptable network capability is increasingly vital to the success of business enterprises. Large quantities of information must be distilled and presented in a single integrated picture in order to maintain the health, security and performance of global networks. We present a design for a network situational awareness display that visually aggregates large quantities of data, identifies problems in a network, assesses their impact on critical company mission areas and clarifies the utilization of resources. This display facilitates the prioritization of network problems as they arise by explicitly depicting how problems interrelate. It also serves to coordinate mitigation strategies with members of a team.
Sep 06 2011 cs.DM
This paper presents a generalization of the sandpile model, called the parallel symmetric sandpile model, which inherits the rules of the symmetric sandpile model and implements them in parallel. In this new model, at each step the collapsing of the collapsible columns happens at the same time and one collapsible column is able to collapse on the left or on the right but not both. We prove that the set of forms of fixed points of the symmetric sandpile model is the same as the one of that model using parallel update scheme by constructing explicitly the way (in the parallel update scheme) to reach the form of an arbitrary fixed point of the sequential model.
Nov 01 2000 cs.DM
This paper is devoted to the study of the dynamics of a discrete system related to some self stabilizing protocol on a ring of processors.
SPM (Sand Pile Model) is a simple discrete dynamical system used in physics to represent granular objects. It is deeply related to integer partitions, and many other combinatorics problems, such as tilings or rewriting systems. The evolution of the system started with n stacked grains generates a lattice, denoted by SPM(n). We study here the structure of this lattice. We first explain how it can be constructed, by showing its strong self-similarity property. Then, we define SPM(infini), a natural extension of SPM when one starts with an infinite number of grains. Again, we give an efficient construction algorithm and a coding of this lattice using a self-similar tree. The two approaches give different recursive formulae for the cardinal of SPM(n), where no closed formula have ever been found.