Sound (cs.SD)

  • PDF
    Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition and to the best of our knowledge, achieve the first promising result. We reduce the source sequence length by skipping frames and regularize the weights for better generalization and convergence. Moreover, we investigate the impact of varying attention mechanism (convolutional attention and attention smoothing) and the correlation between the performance of the model and the width of beam search. On the MiTV dataset, we achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% without using any lexicon or language model. While together with a trigram language model, we reach 2.81% CER and 5.77% SER.