CMU-CS-98-140
Computer Science Department
School of Computer Science, Carnegie Mellon University
Compiler and Hardware Support for Automatic Instruction
Prefetching: A Cooperative Approach
Todd C. Mowry, Chi-Keung Luk*
June 1998
Keywords: Cache memories, performance of systems (measurement
techniques, performance attributes), compilers
Instruction cache miss latency is becoming an increasingly important
performance bottleneck, especially for commercial applications. Although
instruction prefetching is an attractive technique for tolerating this latency,
we find that existing prefetching schemes are insufficient for modern
superscalar processors since they fail to issue prefetches early enough
(particularly for non-sequential accesses). To overcome these limitations, we
propose a new instruction prefetching technique whereby the hardware and
software cooperate to hide the latency as follows. The hardware performs
aggressive sequential prefetching combined with a novel prefetch
filtering mechanism to allow it to get far ahead without polluting the
cache. To hide the latency of non-sequential accesses, we propose and implement
a novel compiler algorithm which automatically inserts
instruction-prefetch instructions into the executable to prefetch the targets
of control transfers far enough in advance. Our experimental results
demonstrate that this new approach yields speedups ranging from 9.4% to
18.5% (13.3% on average) over the original execution time on an out-of-order
superscalar processor, which is more than double the average speedup of the
best existing schemes (6.5%). This is accomplished by hiding an average of
71% of the original instruction stall time, compared with only 36% for the
best existing schemes. We find that both the prefetch filtering and
compiler-inserted prefetching components of our design are essential and
complementary, that the compiler can limit the code expansion to less than 10%
on average, and that our scheme is robust with respect to variations in miss
latency and bandwidth.
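
To make the hardware side of the idea concrete, the following is a minimal
sketch of sequential (next-N-line) prefetching with a prefetch filter. It is
an illustration only and not the mechanism evaluated in the report: the toy
fully associative LRU cache, the names ICache, degree, useless, and run, and
the filtering rule (suppress a prefetch for a line whose previous prefetch was
evicted unused) are all assumptions made for this example. The
compiler-inserted prefetches for non-sequential targets are not modeled.

```python
from collections import OrderedDict


class ICache:
    """Toy fully associative LRU instruction cache (hypothetical model)."""

    def __init__(self, capacity, degree, use_filter):
        self.capacity = capacity      # number of cache lines
        self.degree = degree          # lines prefetched ahead of each fetch
        self.use_filter = use_filter  # enable the assumed filtering rule
        self.lines = OrderedDict()    # line -> True if prefetched, not yet used
        self.useless = set()          # lines whose prefetch was evicted unused
        self.misses = 0

    def _insert(self, line, prefetched):
        if len(self.lines) >= self.capacity:
            victim, unused = self.lines.popitem(last=False)  # evict LRU line
            if unused:
                self.useless.add(victim)  # remember the wasted prefetch
        self.lines[line] = prefetched

    def fetch(self, line):
        if line in self.lines:
            self.lines[line] = False      # a prefetched line is now used
            self.lines.move_to_end(line)  # refresh LRU position
        else:
            self.misses += 1
            self.useless.discard(line)    # a demand miss shows it is needed
            self._insert(line, prefetched=False)
        # Sequential prefetch of the next `degree` lines.
        for nxt in range(line + 1, line + 1 + self.degree):
            if nxt in self.lines:
                continue
            if self.use_filter and nxt in self.useless:
                continue                  # filtered: assumed to pollute
            self._insert(nxt, prefetched=True)


def run(trace, capacity, degree, use_filter):
    """Replay a trace of line addresses and return the demand-miss count."""
    cache = ICache(capacity, degree, use_filter)
    for line in trace:
        cache.fetch(line)
    return cache.misses
```

In this toy model, a straight-line trace benefits from prefetching alone,
while a trace that alternates between two distant lines in a tiny cache only
stops missing once the filter suppresses the prefetches that keep evicting
useful lines. For example, run([0, 10] * 4, capacity=3, degree=1,
use_filter=False) misses on every fetch, whereas the same call with
use_filter=True misses only three times.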
20 pages
*Department of Computer Science, University of Toronto, Toronto,
Ontario, Canada, M5S 3G4.