@device(postscript)
@libraryfile(Mathematics10)
@libraryfile(Accents)
@style(fontfamily=timesroman,fontscale=11)
@pagefooting(immediate, left "@c<Technical Report 1996>",
center "@c<Computer Science>",
right "@c<Carnegie Mellon>")
@heading(Efficient Algorithms for Speech Recognition)
@heading(CMU-CS-96-143)
@center(@b(Mosur K. Ravishankar))
@center(May 1996 - Ph.D. Thesis)
@center(FTP: CMU-CS-96-143.ps)
@blankspace(1)
@begin(text,spacing=.90)

Advances in speech technology and computing power have created a surge of
interest in the practical application of speech recognition.  However, the
most accurate speech recognition systems in the research world are still far
too slow and expensive to be used in practical, large vocabulary continuous
speech applications.  Their main goal has been recognition accuracy, with
emphasis on acoustic and language modelling.  But practical speech
recognition also requires the computation to be carried out in real time
within the limited resources@y(M)CPU power and memory size@y(M)of commonly
available computers.  There has been relatively little work in this
direction while preserving the accuracy of research systems.

In this thesis, we focus on @i(efficient and accurate) speech recognition.
It is easy to improve recognition speed and reduce memory requirements by
trading away accuracy, for example by greater pruning, and using simpler
acoustic and language models.  It is much harder to improve both the
recognition speed and reduce main memory size while preserving the accuracy.

This thesis presents several techniques for improving the overall
performance of the CMU Sphinx-II system.  Sphinx-II employs semi-continuous
hidden Markov models for acoustics and trigram language models, and is one
of the premier research systems of its kind.  The techniques in this thesis
are validated on several widely used benchmark test sets using two
vocabulary sizes of about 20K and 58K words.

The main contributions of this thesis are @i(an 8-fold speedup and 4-fold
memory size reduction) over the baseline Sphinx-II system.  The improvement
in speed is obtained from the following techniques: lexical tree search,
phonetic fast match heuristic, and global best path search of the word
lattice.  The gain in speed from the tree search is about a factor of 5.
The phonetic fast match heuristic speeds up the tree search by another
factor of 2 by finding the most likely candidate phones active at any time.
Though the tree search incurs some loss of accuracy, it also produces
compact word lattices with low error rate which can be rescored for
accuracy.  Such a rescoring is combined with the best path algorithm to
find a globally optimum path through a word lattice.  This recovers the
original accuracy of the baseline system.  The total recognition time is
about 3 times real time for the 20K task on a 175MHz DEC Alpha workstation.

The memory requirements of Sphinx-II are minimized by reducing the sizes of
the acoustic and language models.  The language model is maintained on disk
and bigrams and trigrams are read in on demand.  Explicit software caching
mechanisms effectively overcome the disk access latencies.  The acoustic
model size is reduced by simply truncating precision of probability values
to 8 bits.  Several other engineering solutions, not explored in this
thesis, can be applied to reduce memory requirements further.  The memory
size for the 20K task is reduced to about 30-40MB.


@blankspace(2line)
@begin(transparent,size=10)
@b(Keywords:@ )@c<Speech recognition, search algorithms, real time recognition,
lexical tree search, lattice search, fast match algorithms, memory size
reduction>
@end(transparent)
@blankspace(1line)
@end(text)
@flushright(@b[(148 pages)])