Computer Science Department
School of Computer Science, Carnegie Mellon University


Improving Acoustic Models by Watching Television

Michael J. Witbrock*, Alexander G. Hauptmann

March 1998

This work was first presented at the 1997 AAAI Spring Symposium, Palo Alto, CA, March 1997.

Keywords: Digital libraries, speech recognition, alignment of text and speech, speech recogniser training, Viterbi search, recognition errors, Informedia

Obtaining sufficient labelled training data is a persistent difficulty for speech recognition research. Although well-transcribed data is expensive to produce, closed-captioned television broadcasts provide a constant stream of challenging speech data paired with poor transcriptions. We describe a reliable unsupervised method for identifying accurately transcribed sections of these broadcasts, and show how these segments can be used to train a recognition system. Starting from acoustic models trained on the Wall Street Journal database, a single iteration of our training method reduced the word error rate on an independent broadcast television news test set from 62.2% to 59.5%.
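
To put the reported figures in context, the sketch below shows one common way to compute word error rate (word-level edit distance divided by the number of reference words) and the relative reduction implied by a change from 62.2% to 59.5%. The function names and the example sentences are illustrative assumptions; the abstract does not say which scoring tool or text normalisation the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenisation and exact string match; real
    scoring tools usually normalise case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Figures quoted in the abstract (percent word error rate).
baseline_wer = 62.2
retrained_wer = 59.5
absolute_drop = baseline_wer - retrained_wer
relative_drop = relative_reduction(baseline_wer, retrained_wer)

print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative drop: {relative_drop:.1%}")
```

On these numbers the improvement is 2.7 percentage points absolute, or roughly 4.3% relative to the starting error rate.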

5 pages

*Justsystem Pittsburgh Research Center, 4616 Henry Street, Pittsburgh, PA 15213. The work described in this paper was done while this author was an employee of Carnegie Mellon University.
