We present a novel, reflection-aware method for 3D sound localization in indoor environments. Unlike prior approaches, which mainly rely on continuous sound signals from a stationary source, our formulation is designed to localize the source position instantaneously from signals within a single frame. We consider both direct sound and indirect sound signals that reach the microphones after reflecting off surfaces such as ceilings or walls. We then generate and trace direct and reflected acoustic paths using inverse acoustic ray tracing and use these paths with Monte Carlo localization to estimate a 3D sound source position. We have implemented our method on a robot with a cube-shaped microphone array and tested it in different settings with continuous and intermittent sound signals from a stationary or mobile source. Across these settings, our approach localizes the sound with an average distance error of 0.8 m in a room with a 7 m by 7 m floor area and a 3 m height, including cases with a mobile and non-line-of-sight sound source. We also show that modeling indirect rays increases the localization accuracy by 40% compared to using only direct acoustic rays.
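To illustrate how acoustic paths can feed Monte Carlo localization, here is a minimal Python sketch, not the authors' implementation. It assumes the inverse ray tracing step has already produced a set of rays (an origin plus a unit direction) that should pass near the source. The function names, the Gaussian weighting by squared point-to-ray distance, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def point_ray_distance(p, origin, direction):
    """Distance from point p to the ray origin + t * direction, t >= 0.

    The direction is assumed to have unit length.
    """
    v = [p[i] - origin[i] for i in range(3)]
    t = max(0.0, sum(v[i] * direction[i] for i in range(3)))
    closest = [origin[i] + t * direction[i] for i in range(3)]
    return math.dist(p, closest)


def localize(rays, room=(7.0, 7.0, 3.0), n_particles=2000, n_iters=30,
             sigma=0.3, jitter=0.1, seed=0):
    """Estimate a 3D source position as the point where the rays converge.

    rays is a list of (origin, unit_direction) pairs. The weighting model
    (Gaussian in point-to-ray distance) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # Start with particles spread uniformly over the room volume.
    particles = [[rng.uniform(0.0, room[i]) for i in range(3)]
                 for _ in range(n_particles)]
    for _ in range(n_iters):
        # Log-weight: particles close to all rays score highest.
        log_w = [
            -sum(point_ray_distance(p, o, d) ** 2 for o, d in rays)
            / (2.0 * sigma ** 2)
            for p in particles
        ]
        # Subtract the maximum before exponentiating to avoid underflow.
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        # Resample in proportion to weight, then perturb and clamp to the room.
        chosen = rng.choices(particles, weights=weights, k=n_particles)
        particles = [
            [min(max(c + rng.gauss(0.0, jitter), 0.0), room[i])
             for i, c in enumerate(p)]
            for p in chosen
        ]
    # Report the mean of the final particle set as the position estimate.
    return [sum(p[i] for p in particles) / n_particles for i in range(3)]
```

In this sketch a direct path would be a ray starting at the microphone array, while a reflected path would contribute a ray starting at the point where the traced path hits a wall or ceiling. Calling `localize(rays)` with such a list returns an `[x, y, z]` estimate in metres.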
We propose a novel approach for the generation of polyphonic music based on LSTMs. We generate music in two steps. First, a chord LSTM predicts a chord progression based on a chord embedding. A second LSTM then generates polyphonic music from the predicted chord progression. The generated music sounds pleasing and harmonic, with only a few dissonant notes. It has a clear long-term structure that is similar to what a musician would play during a jam session. We show that our approach is sensible from a music theory perspective by evaluating the learned chord embeddings. Surprisingly, our simple model managed to extract the circle of fifths, an important tool in music theory, from the dataset.
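For readers unfamiliar with the circle of fifths mentioned above, the short Python sketch below generates it by stepping up a perfect fifth (7 semitones) at a time. It only illustrates the music theory concept; it is not the paper's model or its embedding analysis, and the names `NOTE_NAMES` and `circle_of_fifths` are hypothetical.

```python
# Pitch classes in semitone order, using sharps only for simplicity.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]


def circle_of_fifths(start=0):
    """Return the 12 pitch classes ordered by ascending perfect fifths.

    A perfect fifth spans 7 semitones, so each step adds 7 modulo 12.
    """
    return [NOTE_NAMES[(start + 7 * k) % 12] for k in range(12)]


print(circle_of_fifths())
# → ['C', 'G', 'D', 'A', 'E', 'B', 'F#', 'C#', 'G#', 'D#', 'A#', 'F']
```

Neighbouring entries in this ordering are harmonically close, which is the kind of structure one would look for when inspecting a learned chord embedding.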
Music Information Retrieval (MIR) technologies have proven useful in assisting western classical singing training. Jingju (also known as Beijing or Peking opera) singing differs from western singing in most perceptual dimensions, and the trainees are taught using the mouth/heart method. In this paper, we first present the training method used in the professional jingju training classroom and show the potential benefits of introducing MIR technologies into the training process. The main part of this paper is dedicated to identifying potential MIR technologies for jingju singing training. To this end, we answer the question: how do jingju singing tutors and trainees value the importance of each jingju musical dimension (intonation, rhythm, loudness, tone quality and pronunciation)? This is done by (i) classifying the classroom singing practices and the tutor's verbal feedback into these five dimensions, and (ii) surveying the trainees. Then, with the help of music signal analysis, a finer inspection of the classroom practice recording examples reveals the detailed elements of the training process. Finally, based on the above analysis, we identify several MIR technologies that would be useful for jingju singing training.
This paper addresses the problem of audio source recovery from a multichannel noisy convolutive mixture for source separation and speech enhancement, assuming known mixing filters. We propose to conduct the source recovery in the short-time Fourier transform domain, based on the convolutive transfer function (CTF) approximation. Compared to the time-domain filters, the CTF has far fewer taps, and thus fewer near-common zeros among channels and lower computational complexity. This work proposes three source recovery methods: i) the multichannel inverse filtering method, i.e. the multiple input/output inverse theorem (MINT), exploited in the CTF domain; ii) for the multisource case, a beamforming-like multichannel inverse filtering method that applies the single-source MINT and power minimization, which is suitable when the CTFs of not all the sources are known; and iii) a constrained Lasso method, in which the sources are recovered by minimizing their $\ell_1$-norm to impose spectral sparsity, under the constraint that the $\ell_2$-norm fitting cost between the microphone signals and the mixture model involving the unknown source signals is less than a tolerance. The noise can be reduced by setting the tolerance to the noise power. Experiments under various acoustic conditions are conducted to evaluate the three proposed methods. Comparisons among them and with the baseline methods are presented.
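The constrained Lasso criterion described in this abstract can be written out as a sketch. The symbols are assumptions for illustration and do not appear in the abstract: $\mathbf{x}$ stacks the microphone STFT coefficients at one frequency, $\mathbf{s}$ stacks the unknown source coefficients, $\mathbf{A}$ is the linear operator that applies the known CTFs (a convolution along the frame axis), and $\epsilon$ is the tolerance. The squared form of the fitting cost is also an assumption, chosen so that $\epsilon$ is comparable to a noise power.

```latex
\hat{\mathbf{s}} \;=\; \arg\min_{\mathbf{s}} \; \|\mathbf{s}\|_1
\quad \text{subject to} \quad
\|\mathbf{x} - \mathbf{A}\mathbf{s}\|_2^2 \;\le\; \epsilon
```

Under this reading, setting $\epsilon$ to the noise power allows the residual $\mathbf{x} - \mathbf{A}\mathbf{s}$ to absorb the noise instead of forcing the recovered sources to explain it.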