In this paper, we propose a multi-level attention model to solve the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of audio events in an audio clip. Recently, Google published a large scale weakly labelled dataset called Audio Set, where each audio clip contains only the presence or absence of the audio events, without the onset and offset time of the audio events. Our multi-level attention model is an extension to the previously proposed single-level attention model. It consists of several attention modules applied on intermediate neural network layers. The output of these attention modules are concatenated to a vector followed by a multi-label classifier to make the final prediction of each class. Experiments shown that our model achieves a mean average precision (mAP) of 0.360, outperforms the state-of-the-art single-level attention model of 0.327 and Google baseline of 0.314.
Jan 15 2018 cs.SI
What makes content go viral? Which videos become popular and why others don't? Such questions have elicited significant attention from both researchers and industry, particularly in the context of online media. A range of models have been recently proposed to explain and predict popularity; however, there is a short supply of practical tools, accessible for regular users, that leverage these theoretical results. HIPie -- an interactive visualization system -- is created to fill this gap, by enabling users to reason about the virality and the popularity of online videos. It retrieves the metadata and the past popularity series of Youtube videos, it employs Hawkes Intensity Process, a state-of-the-art online popularity model for explaining and predicting video popularity, and it presents videos comparatively in a series of interactive plots. This system will help both content consumers and content producers in a range of data-driven inquiries, such as to comparatively analyze videos and channels, to explain and predict future popularity, to identify viral videos, and to estimate response to online promotion.
Dec 15 2017 cs.CV
This paper performs a comprehensive and comparative evaluation of the state of the art local features for the task of image based 3D reconstruction. The evaluated local features cover the recently developed ones by using powerful machine learning techniques and the elaborately designed handcrafted features. To obtain a comprehensive evaluation, we choose to include both float type features and binary ones. Meanwhile, two kinds of datasets have been used in this evaluation. One is a dataset of many different scene types with groundtruth 3D points, containing images of different scenes captured at fixed positions, for quantitative performance evaluation of different local features in the controlled image capturing situations. The other dataset contains Internet scale image sets of several landmarks with a lot of unrelated images, which is used for qualitative performance evaluation of different local features in the free image collection situations. Our experimental results show that binary features are competent to reconstruct scenes from controlled image sequences with only a fraction of processing time compared to use float type features. However, for the case of large scale image set with many distracting images, float type features show a clear advantage over binary ones.
Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events from an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known but the time of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, sources of audio events and time of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming deep neural network baseline of 0.29. Source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as probability mapping, outperforming the global max pooling (GMP) based probability mapping giving SDR at 0.03 dB. Source code of our work is published.
Among the statistical tools for online information diffusion modeling, both epidemic models and Hawkes point processes are popular choices. The former originate from epidemiology, and consider information as a viral contagion which spreads into a population of online users. The latter have roots in geophysics and finance, view individual actions as discrete events in continuous time, and modulate the rate of events according to the self-exciting nature of event sequences. Here, we establish a novel connection between these two frameworks. Namely, the rate of events in an extended Hawkes model is identical to the rate of new infections in the Susceptible-Infected-Recovered (SIR) model after marginalizing out recovery events -- which are unobserved in a Hawkes process. This result paves the way to apply tools developed for SIR to Hawkes, and vice versa. It also leads to HawkesN, a generalization of the Hawkes model which accounts for a finite population size. Finally, we derive the distribution of cascade sizes for HawkesN, inspired by methods in stochastic SIR. Such distributions provide nuanced explanations to the general unpredictability of popularity: the distribution for diffusion cascade sizes tends to have two modes, one corresponding to large cascade sizes and another one around zero.
This paper investigates the classification of the Audio Set dataset. Audio Set is a large scale weakly labelled dataset of sound clips. Previous work used multiple instance learning (MIL) to classify weakly labelled data. In MIL, a bag consists of several instances, and a bag is labelled positive if at least one instances in the audio clip is positive. A bag is labelled negative if all the instances in the bag are negative. We propose an attention model to tackle the MIL problem and explain this attention model from a novel probabilistic perspective. We define a probability space on each bag, where each instance in the bag has a trainable probability measure for each class. Then the classification of a bag is the expectation of the classification output of the instances in the bag with respect to the learned probability measure. Experimental results show that our proposed attention model modeled by fully connected deep neural network obtains mAP of 0.327 on Audio Set dataset, outperforming the Google's baseline of 0.314 and recurrent neural network of 0.325.
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly supervised sound event detection task of Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labeled with one or a few audio tags but without timestamps of the audio events, which is called as weakly labeled data. Two sub-tasks are defined in this challenge including audio tagging and sound event detection using this weakly labeled data. A convolutional recurrent neural network (CRNN) with learnable gated linear units (GLUs) non-linearity applied on the log Mel spectrogram is proposed. In addition, a temporal attention method is proposed along the frames to predicate the locations of each audio event in a chunk from the weakly labeled data. We ranked the 1st and the 2nd as a team in these two sub-tasks of DCASE 2017 challenge with F value 55.6\% and Equal error 0.73, respectively.
Sep 05 2017 cs.SD
In this technique report, we present a bunch of methods for the task 4 of Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry applications. There are two tasks, audio tagging and sound event detection from weakly labeled data. Convolutional neural network (CNN) and gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We proposed a learnable gating activation function for selecting informative local features. Attention-based scheme is used for localizing the specific events in a weakly-supervised mode. A new batch-level balancing strategy is also proposed to tackle the data unbalancing problem. Fusion of posteriors from different systems are found effective to improve the performance. In a summary, we get 61% F-value for the audio tagging subtask and 0.73 error rate (ER) for the sound event detection subtask on the development set. While the official multilayer perceptron (MLP) based baseline just obtained 13.1% F-value for the audio tagging and 1.02 for the sound event detection.
Mar 20 2017 cs.SD
Audio tagging aims to perform multi-label classification on audio chunks and it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurred acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on the Task 4 of DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set with equal error rate (EER) reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline system.
Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
Oct 07 2016 cs.SD
Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labeling of an audio clip is often based on the audio events in the clip and no event level label is provided to the user. Previous works have used the bag of frames model assume the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative and ignore uninformative sounds. Then only informative regions are used for classification. Experimental results on the "CHiME Home" dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event level label.
Feb 16 2016 cs.CV
Binary features have been incrementally popular in the past few years due to their low memory footprints and the efficient computation of Hamming distance between binary descriptors. They have been shown with promising results on some real time applications, e.g., SLAM, where the matching operations are relative few. However, in computer vision, there are many applications such as 3D reconstruction requiring lots of matching operations between local features. Therefore, a natural question is that is the binary feature still a promising solution to this kind of applications? To get the answer, this paper conducts a comparative study of binary features and their matching methods on the context of 3D reconstruction in a recently proposed large scale mutliview stereo dataset. Our evaluations reveal that not all binary features are capable of this task. Most of them are inferior to the classical SIFT based method in terms of reconstruction accuracy and completeness with a not significant better computational performance.
This paper gives the minimum generating sets of three types of one generator $(1+u)$-quasi twisted (QT) codes over $F_2+uF_2$, $u^2=0$. Moreover, it discusses the generating sets and the lower bounds on the minimum Lee distance of a special class of $A_2$ type one generator $(1+u)$-QT codes. Some good (optimal or suboptimal) linear codes over $F_2$ are obtained by these types of one generator $(1+u)$-QT codes.