Computer Science Department
School of Computer Science, Carnegie Mellon University
Improving Trigram Language Modeling with the World Wide Web
Xiaojin Zhu, Ronald Rosenfeld
Keywords: Language models, speech recognition and synthesis,
We propose a novel method for using the World Wide Web to acquire
trigram estimates for statistical language modeling. We submit
an N-gram as a phrase query to web search engines. The search
engines return the number of web pages containing the phrase,
from which the N-gram count is estimated. The N-gram counts are
then used to form web-based trigram probability estimates. We
discuss the properties of such estimates, and methods to
interpolate them with traditional corpus-based trigram estimates.
We show that the interpolated models significantly reduce speech
recognition word error rate on a small test set.
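
As a rough illustration of the pipeline summarized above, the sketch below
turns two hit counts into a relative-frequency trigram estimate and mixes it
with a corpus-based estimate. It is a minimal sketch under stated
assumptions, not the method as specified in the paper: the function names,
the clamping of noisy counts, the fallback when the history is unseen on the
web, the single fixed weight `lam`, and all numbers are hypothetical.

```python
def web_trigram_prob(count_trigram, count_bigram):
    """Relative-frequency estimate c(w1 w2 w3) / c(w1 w2) from web hit counts.

    Returns None when the history (w1 w2) has no web evidence.
    Assumption: reported hit counts are noisy, so the ratio is clamped to 1.
    """
    if count_bigram <= 0:
        return None
    return min(count_trigram / count_bigram, 1.0)


def interpolate(p_corpus, p_web, lam):
    """Linear interpolation of a corpus-based and a web-based estimate.

    Assumption: one fixed weight lam in [0, 1]; falls back to the corpus
    estimate when no web estimate is available.
    """
    if p_web is None:
        return p_corpus
    return lam * p_web + (1.0 - lam) * p_corpus


# Made-up hit counts for a phrase query "w1 w2 w3" and its history "w1 w2".
p_web = web_trigram_prob(count_trigram=120, count_bigram=4000)

# Made-up corpus-based estimate for the same trigram.
p_mixed = interpolate(p_corpus=0.01, p_web=p_web, lam=0.5)
print(p_web, p_mixed)
```

In practice the interpolation weight would be tuned on held-out data, and the
combined estimates would need to be normalized over the vocabulary for each
history; both steps are omitted here.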