Abstract
In traditional models of speech recognition, acoustic and language models are treated in independence and usually estimated separately, which may yield a suboptimal recognition performance. In this paper, we propose a joint optimization framework for learning the parameters of acoustic and language models using minimum classification error criterion. The joint optimization is performed in terms of a decoding graph constructed using weighted finite-state transducers based on context-dependent hidden Markov models and tri-gram language models. To emphasize the effectiveness of the proposed framework, two speech corpora, TIMIT and Resource Management (RM1), are incorporated in the conducted experiments. The preliminary experiments show that the proposed approach can achieve significant reduction in phone, word and sentence error rates on both TIMIT and RM1 when compared with conventional parameter estimation approaches.