CMU-CS-97-175
Computer Science Department
School of Computer Science, Carnegie Mellon University
Predicting Data Cache Misses in Non-Numeric
Applications Through Correlation Profiling
Todd C. Mowry, Chi-Keung Luk*
September 1997
An abbreviated version of this paper will appear in the
Proceedings of the Fourth International Symposium on
High-Performance Computer Architecture, February 1-4, 1998.
Keywords: Cache memories, performance of systems (measurement
techniques, performance attributes), data structures (graphs, lists, trees),
compilers
Software-based latency tolerance techniques offer the potential for bridging
the ever-increasing speed gap between the memory subsystem and today's
high-performance processors. However, to fully exploit the benefit of these
techniques, one must be careful to apply them only to the dynamic references
that are likely to suffer cache misses --- otherwise the runtime overheads
can potentially offset any gains. In this paper, we focus on isolating
dynamic miss instances in non-numeric applications, which is a difficult
but important problem. Although compilers cannot statically analyze data
locality in non-numeric applications, one viable approach is to use profiling
information to measure the actual miss behavior. Unfortunately, the
state of the art in cache miss profiling (which we call summary profiling)
is inadequate for references with intermediate miss ratios --- it either
misses opportunities to hide latency or inserts unnecessary overhead.
To overcome this problem, we propose and evaluate a new profiling
technique that helps predict which dynamic instances of a static memory
reference will hit or miss in the cache: correlation profiling.
Our experimental results demonstrate that roughly half of the 22 non-numeric
applications we study can potentially enjoy significant reductions in memory
stall time by exploiting at least one of the three forms of correlation
profiling we consider: control-flow correlation, self correlation, and
global correlation. In addition, our detailed case studies illustrate that
self correlation succeeds because a given reference's cache outcomes often
contain repeated patterns, and control-flow correlation succeeds because cache
outcomes are often call-chain dependent. We also demonstrate that software
prefetching can achieve better performance on a modern superscalar processor
when directed by correlation profiling rather than summary profiling
information.
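
The difference between summary profiling and self correlation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the history length, and the synthetic trace below are all assumptions chosen for illustration. The trace models a single static reference that misses once every four accesses, which gives it an intermediate miss ratio.

```python
from collections import defaultdict


def summary_profile(outcomes):
    """Overall miss ratio of one static reference (True means miss)."""
    return sum(outcomes) / len(outcomes)


def self_correlation_profile(outcomes, history_len=3):
    """Miss ratio per context, where the context is the reference's
    own previous `history_len` cache outcomes (hypothetical parameter)."""
    counts = defaultdict(lambda: [0, 0])  # context -> [misses, total]
    for i in range(history_len, len(outcomes)):
        context = tuple(outcomes[i - history_len:i])
        counts[context][0] += outcomes[i]  # True counts as 1
        counts[context][1] += 1
    return {ctx: misses / total for ctx, (misses, total) in counts.items()}


# Synthetic trace (an assumption): miss, hit, hit, hit, repeated.
trace = [True, False, False, False] * 8

print("summary miss ratio:", summary_profile(trace))
for context, ratio in sorted(self_correlation_profile(trace).items()):
    print(context, "-> miss ratio", ratio)
```

In this sketch the summary profile reports a single miss ratio of 0.25, which gives no basis for deciding when to prefetch. The self-correlation profile separates the same accesses by their recent history: after three consecutive hits the reference always misses, and in every other context it always hits, so a latency-tolerance technique could be applied only in the one context where it pays off.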
26 pages
*Department of Computer Science, University of Toronto,
Toronto, Ontario, Canada, M5S 3G4