Computer Science Department
School of Computer Science, Carnegie Mellon University
Improving Acoustic Models by Watching Television
Michael J. Witbrock*, Alexander G. Hauptmann
This work was first presented at the 1997
AAAI Spring Symposium, Palo Alto, CA, March 1997.
Keywords: Digital libraries, speech recognition, alignment of
text and speech, speech recogniser training, Viterbi search, recognition
Obtaining sufficient labelled training data is a persistent difficulty
for speech recognition research. Although well-transcribed data is
expensive to produce, closed-captioned television broadcasts supply a
constant stream of challenging speech data with poor transcriptions. We
describe a reliable unsupervised method for identifying accurately
transcribed sections of these broadcasts, and show how these segments
can be used to train a recognition system. Starting from acoustic models
trained on the Wall Street Journal database, a single iteration of our
training method reduced the word error rate on an independent broadcast
television news test set from 62.2% to 59.5%.
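
To put the reported figures in context, the following minimal sketch computes word error rate in the conventional way (word-level edit distance divided by the number of reference words) and the relative reduction implied by the two percentages above. The function names are hypothetical and chosen only for illustration; the paper's own scoring tool is not specified here, so this is an assumption about the standard metric rather than a description of the authors' scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Hypothetical helper: assumes whitespace tokenisation and equal
    cost for substitutions, insertions, and deletions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction in error rate going from `before` to `after`."""
    return (before - after) / before


# Figures quoted in the abstract, in percent.
absolute_gain = 62.2 - 59.5                    # about 2.7 points absolute
relative_gain = relative_reduction(62.2, 59.5)  # about 0.043, i.e. roughly 4.3%
```

Under these assumptions, the drop from 62.2% to 59.5% is about 2.7 points absolute, or roughly a 4.3% relative reduction in word error rate.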
*Justsystem Pittsburgh Research Center, 4616 Henry Street, Pittsburgh, PA
15213. The work described in this paper was done while this author was an
employee of Carnegie Mellon University.