CMU-CS-00-171
Computer Science Department
School of Computer Science, Carnegie Mellon University




Improving Trigram Language Modeling with the World Wide Web

Xiaojin Zhu, Ronald Rosenfeld

November 2000

CMU-CS-00-171.ps
CMU-CS-00-171.pdf


Keywords: Language models, speech recognition and synthesis, Web-based services


We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which we estimate the N-gram count. These counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates and methods for interpolating them with traditional corpus-based trigram estimates. We show that the interpolated models significantly reduce speech recognition word error rate on a small test set.
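
As a rough illustration of the idea in the abstract, the following Python sketch forms a web-based trigram estimate as a ratio of phrase-query hit counts and mixes it with a corpus-based estimate. It is a minimal sketch under stated assumptions, not the report's actual procedure: the `page_count` callable is a hypothetical stand-in for a search engine's phrase-query hit count, the raw page count is used directly as the N-gram count (the report estimates the count from the page count), and fixed-weight linear interpolation with weight `lam` is only one simple way the two estimates might be combined.

```python
from typing import Callable, Optional


def web_trigram_prob(
    w1: str,
    w2: str,
    w3: str,
    page_count: Callable[[str], int],
) -> Optional[float]:
    """Relative-frequency estimate of P(w3 | w1, w2) from hit counts.

    `page_count` is a hypothetical function mapping a quoted phrase
    query to the number of pages reported as containing it.
    Returns None when the history "w1 w2" has no hits at all.
    """
    history_hits = page_count(f'"{w1} {w2}"')
    if history_hits == 0:
        return None  # no web evidence for this history
    trigram_hits = page_count(f'"{w1} {w2} {w3}"')
    # Reported hit counts can be noisy, so cap the ratio at 1.
    return min(trigram_hits / history_hits, 1.0)


def interpolate(
    p_corpus: float,
    p_web: Optional[float],
    lam: float = 0.5,
) -> float:
    """Linear interpolation (an assumed scheme) of the two estimates.

    Falls back to the corpus-based estimate when no web estimate exists.
    """
    if p_web is None:
        return p_corpus
    return lam * p_web + (1.0 - lam) * p_corpus
```

With a lookup table standing in for the search engine, a history seen on 1000 pages and a trigram seen on 250 of them gives a web estimate of 0.25. In practice the weight `lam` would have to be tuned on held-out data rather than fixed by hand.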

17 pages

