# Software-Controlled Multithreading Using Informing Memory Operations

Todd C. Mowry Sherwyn R. Ramkissoon<sup>†</sup> October 1998 CMU-CS-98-169

> School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213

<sup>†</sup>Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada, M5S 3G4.

#### Abstract

Memory latency is becoming an increasingly important performance bottleneck, especially in multiprocessors. One technique for tolerating memory latency is *multithreading*, whereby we switch between threads upon expensive cache misses. In contrast with previous work on multithreading, we explore a new approach that is *software-controlled* rather than hardware-controlled. To implement software-controlled multithreading, we use *informing memory operations* to quickly trap upon cache misses to a miss handler which performs the actual thread switching in software. Our experimental results demonstrate that software-controlled multithreading, we find that by selectively applying a *register partitioning* optimization to reduce the thread-switching overhead, we can increase the overall speedups to as much as 25%. Given the much simpler hardware support required by our scheme, and the fact that its software-controlled multithreading is attractive alternative to traditional hardware-based schemes.

Todd C. Mowry is partially supported by a Faculty Development Award from IBM.

Keywords: B.3.2 Cache Memories, C.4 Performance of Systems (Measurement Techniques, Performance Attributes)

# 1 Introduction

Memory latency is a key performance bottleneck in modern microprocessor-based systems. As we look to the future, the relative importance of memory latency is expected to increase as the gap between processor and memory speeds continues to grow, and as wider-issue processors increase the effective performance penalty of each cycle of latency. While memory latency presents a challenge for all systems, the problem is especially acute in large-scale shared-memory multiprocessors, where accesses to remote memory locations can suffer latencies on the order of hundreds of cycles [9]. Although cache hierarchies are an essential first step toward coping with this problem, they are not a complete solution. To further tolerate latency, one attractive technique is to use a form of *multithreading* [1, 15, 18] whereby a long-latency access from one thread is overlapped with the computation from other parallel threads. (Note that throughout the remainder of this paper, we will use the term "multithreading" to refer to multithreading for the sake of latency tolerance, as opposed to more general forms of multithreading.)

### 1.1 Previous Work on Multithreading

Several researchers have proposed and evaluated hardware-based multithreading schemes in the past [1, 2, 8, 15, 18]. These schemes can be broken down into roughly three categories: *fine-grained*, *coarse-grained*, and *simultaneous* multithreading.

The idea behind *fine-grained multithreading*—as exemplified by the HEP [15] architecture—is to unconditionally switch between threads at a very fine granularity: i.e. once every cycle. The advantage of the fine-grained approach is that since the hardware knows ahead of time that a thread switch will occur on every cycle, the pipeline can be designed such that there is minimal switching overhead. The disadvantage of this approach, however, is that it relies on having a large number of parallel threads to keep the pipeline full. For applications with only limited amounts of thread-level parallelism (e.g., when there is only a single thread), the performance tends to suffer relative to a conventional, non-multithreaded processor, since each thread can utilize only a small fraction of the processing resources.<sup>1</sup>

Rather than switching between threads on every cycle, the idea behind *coarse-grained multithreading* is to allow a given thread to continue running (with the full processor to itself) until it encounters a long-latency operation; only at that point does the processor switch to executing another thread. An example of this coarse-grained approach is the MIT APRIL architecture [1]. In contrast with the fine-grained approach, coarse-grained multithreading offers better single-thread performance and requires a smaller number of parallel threads to hide latency. The disadvantage of this approach, however, is that since cache misses are detected relatively late in the pipeline, the minimum thread switching time is non-trivially large. Hence this scheme is not appropriate for hiding short latencies (e.g., primary cache misses which are satisfied by the secondary cache), and it is primarily used to hide the large latencies found in shared-memory multiprocessors.

Finally, a more recent proposal known as *simultaneous multithreading* [18] leverages the register renaming mechanism within dynamically-scheduled superscalar processors to allow instructions from multiple threads to be active simultaneously within the pipeline. The advantage of the simultaneous multithreading approach—and of an earlier technique called *interleaving* [8]—is that it enjoys good single-thread performance without paying a significant thread switching penalty.

A common feature of all of these multithreading techniques is that the decision of when to switch between threads and the actual switching itself is controlled entirely by hardware. As a result, a non-trivial amount of hardware support is required to manage the multiple threads. For example, to minimize the thread switching latency, coarse-grained multithreaded processors typically replicate key per-thread state such as the register file [1]. Under simultaneous multithreading, the concept of "thread switching" is effectively eliminated at the point where instructions reach the functional units—i.e. when they are buffered in dynamic instruction scheduling queues—since register renaming has already isolated the effects of independent threads. However, simultaneous multithreading does require some non-trivial hardware support to fetch, issue, and retire instructions from multiple threads properly. More importantly, simultaneous multithreading requires a larger register file to accommodate the multiple threads, and this is likely to increase register access latencies and possibly add additional stages to the pipeline [17]. Concern over the potential impact of multithreading hardware support on single-thread performance may be a contributing factor to why we have yet to see hardware-based multithreading in commodity microprocessors.

<sup>&</sup>lt;sup>1</sup>This problem is exacerbated by the fact these types of machines often do not have data caches or pipeline interlocks.

Rather than relying on specialized hardware support, an alternative approach is to use *software* to implement multithreading. The advantage of this approach is that there is obviously no degradation in single-thread performance (since the processor is not modified); the disadvantage, however, is that the thread switching time is significantly larger than when it is accelerated by special hardware support, and this may limit the types of latency that can be successfully hidden. Previous studies have considered purely software-based multithreading in the context of hiding remote latencies in software distributed shared memory (DSM) machines [12, 16]. Purely software-based multithreading makes sense for software DSMs for two reasons: (i) software is already invoked upon the start of a remote access, and therefore it knows when to initiate a thread switch; and (ii) remote access latencies are so large in software DSMs [3] (typically several orders of magnitude larger than in hardware DSMs [9]) that the overhead of switching threads in software is small by comparison. As a result, both the Mowry *et al.* [12] and Thitikamol and Keleher [16] studies found positive results when using software-based multithreading to hide the large remote latencies in software DSMs.

An open research question is whether software-based multithreading can successfully tolerate more modest forms of latency, such as the remote latencies in *hardware* DSMs (e.g., the SGI Origin [9]). To implement software-based multithreading, we need two software mechanisms: (i) the ability to switch between threads; and (ii) a mechanism for knowing *when* to trigger thread switches. The former mechanism is clearly feasible, since software can save and restore all thread-specific state (e.g., registers, the program counter, any condition codes, etc.). The latter mechanism, however, had been lacking in the past, since there was no way for software to directly observe and react to cache misses in a sufficiently lightweight fashion. (Note that the signal handler mechanism used to trigger thread switches in software DSMs is not applicable to cache misses, since it is too costly and can only react to page-level access violations.) Fortunately, a mechanism which provides this functionality was recently proposed by Horowitz *et al.* [5, 6]: *informing memory operations*.

### **1.2** Informing Memory Operations

The idea behind informing memory operations [5, 6] is to make cache misses directly observable to software, and to enable software to quickly react to these misses. In essence, an informing memory operation consists of a memory operation that is combined—either implicitly or explicitly—with a conditional branch-and-link operation where the branch is taken only if the reference suffers a cache miss. Horowitz *et al.* [5, 6] describe two possible implementations of informing memory operations: one based on branching on a cache-outcome condition code, and another based on a low-overhead trap.

The low-overhead trap approach works as follows. Two new user-visible registers are added to the architecture: (i) a *Miss Handler Address Register* (MHAR), which contains the address of the miss handler to be invoked upon a cache miss (setting this register to zero disables the trapping mechanism); and (ii) a *Miss Handler Return Register* (MHRR), which contains the return address for resuming execution at the end of the trap (i.e. it contains the address of the instruction following the memory reference that missed). Upon a cache miss, if the MHAR contains a non-zero value, then a branch-and-link occurs to this address, and the MHRR is set appropriately. Unlike traditional trapping mechanisms, this one is extremely lightweight since it occurs entirely at the user level (no operating system code is executed), and the only state that is saved is the MHRR. In other words, the run-time overhead is comparable to a traditional branch-and-link instruction, rather than a traditional trap. The authors demonstrate how this mechanism can be implemented within modern in-order and out-of-order superscalar pipelines without much additional complexity, since the bulk of the necessary hardware support already exists for handling branches and exceptions. The advantage of the low-overhead trap approach is that it potentially incurs no overhead on cache hits (unlike the cache-outcome condition code approach, which requires an explicit branch to test the condition code even on cache hits). Hence we will focus on the low-overhead trap approach throughout the remainder of this paper.

There are a number of applications of informing memory operations. For example, since they can be used to collect memory performance information accurately and with little overhead, informing memory operations enable a wide range of new performance monitoring tools which can guide either the programmer or the compiler in identifying and eliminating memory performance problems. In addition, Horowitz *et al.* [5, 6] also demonstrated that informing memory operations can automatically enhance the performance gains from software-controlled prefetching [10, 11, 13], and that they can accelerate software-based cache coherence with fine-grained access control [14]. The authors also suggest that informing memory operations could be used to implement software-controlled multithreading, but there has been no detailed study of this approach until now.

### 1.3 Objectives of This Study

In this paper, we perform a detailed evaluation of whether software-controlled multithreading based on informing memory operations can successfully improve the performance of parallel applications running on shared-memory multiprocessors with hardware cache coherence. In addition to evaluating our baseline scheme, we also investigate a number of extensions which are designed to further enhance the performance of software-controlled multithreading.

We focus on hardware DSMs rather than uniprocessors for two reasons. First, since applications written for hardware DSMs already contain parallel threads, it is straightforward to extract the additional parallel threads necessary for multithreading. (In contrast, the bulk of applications run on uniprocessors contain only a single thread, and parallelizing them is a non-trivial effort.) Second, hardware DSMs tend to suffer more from memory latency than uniprocessors—due to the large latency of remote accesses and the additional cache misses due to communication patterns—and therefore they are an important target for latency tolerance. If software-controlled multithreading on hardware DSMs is successful, then we get the best of both worlds: the benefits of multithreading when it pays off, and maximum single-thread performance when it does not.

The remainder of the paper is organized as follows. We begin in Section 2 by examining the issues involved in implementing software-controlled multithreading. Section 3 discusses our experimental methodology, and Section 4 presents our experimental results. Finally, we conclude in Section 5.

# 2 Software-Controlled Multithreading

In this section, we discuss the major challenges and tradeoffs involved with implementing software-controlled multithreading. We begin by discussing the hardware support necessary for this scheme, and then present a design of the miss handler software which performs the actual thread switching. Finally, we discuss how our scheme avoids deadlock and handles synchronization events properly.

### 2.1 Hardware Support

The target architecture for our study is a hardware cache-coherent shared-memory multiprocessor comprised of out-of-order superscalar processors. For the sake of concreteness, we will use the MIPS R10000 processor [21] as the basis for our discussion, although similar issues apply to other out-of-order superscalar processors.

Our goal is to support software-controlled multithreading with minimal hardware support beyond informing memory operations. There are three issues, however, which may require some additional hardware: the first two involve potential problems that would prevent us from overlapping enough computation with the cache miss, and the third involves our ability to selectively switch threads only upon long-latency misses.

The first obstacle to consider is that when a load suffers a cache miss, it typically cannot retire from the reorder buffer until its cache miss has completed. Since all instructions must retire in-order (even in an out-of-order issue machine), this means that all instructions executed after the miss (including thread switching code and the thread that we switch to) must remain in the reorder buffer until the miss completes. The problem is that reorder buffers are typically small (e.g., 32 entries in the R10000) relative to the number of instructions that one would need to execute to fully hide a remote cache miss (e.g., several hundred instructions in the SGI Origin). Hence the reorder buffer will fill up quickly upon a thread switch, causing the processor to stall before it can hide the miss latency. For example, the R10000 does not have sufficient buffering to even execute our thread switching code (described later in Section 2.2), let alone the thread that we are attempting to activate. To address this problem, we need a mechanism for specifying that the load should be allowed to retire, despite the fact that its miss is still in progress. In essence, we would like to convert the load into a *prefetch*, since prefetches can retire before their misses complete. Converting the load to a prefetch is acceptable because we do not care about the result of the load—only that it brings the line into the cache—since we will resume execution by re-executing the load that missed (as discussed later in Section 2.2). While there are a number of ways to accomplish this, one possibility is to set a flag which indicates to the trapping mechanism that upon a cache miss, the offending load should be allowed to retire (similar to a prefetch). Such an option may be useful in other cases where the miss handler would like to execute a non-trivial amount of code underneath the cache miss, and where the miss handler will resume execution by re-executing the load which invoked the trap, rather than the instruction which follows it.

The second potential problem is that during a thread switch, any use of the load destination register (e.g., if we attempt to save it to memory as part of saving the thread state) will result in a data dependence that will stall the processor until the load completes. Since we do not care about the result of the load (it will be re-executed later), there is no need to save this register value. One software-based solution would be to save all registers except the load destination; therefore when the register state of the thread we are switching to is restored, the act of overwriting this register will break the original data dependence on the load (due to register renaming), thus avoiding a stall. While this approach will work, the problem is how to quickly determine which register is the load target (since this information is not readily available inside the miss handler) and avoid saving it. One possibility would be to look up this value in a hash table based on the return address in the MHRR; however, this will result in non-trivial software overhead. Another possibility would be for the hardware to make the destination register number directly visible to the miss handler software, perhaps through another special architected register. While this would eliminate the need for a hash table lookup, we would still need to branch to a specialized version of the thread switching code to avoid saving the given register. The most desirable solution would be for the hardware to automatically break the data dependence on the load result when it is marking the load as being able to retire despite its outstanding miss (as discussed earlier). In other words, we would like to fully convert the load to having the same functionality as a prefetch: i.e. it can graduate immediately, and it produces no register result. Breaking this register dependence is realistic for the hardware because the Miss Status Handling Register (MSHR) [7]—the structure which tracks an outstanding miss in a lockup-free cache [4]—already maintains this register number. In our experiments, we assume that this latter hardware support is available.

The third area where additional hardware support may be helpful is in identifying (or predicting) whether a given cache miss is likely to suffer a large latency. Since multithreading can only improve performance if the miss latency is larger than the latency of switching between threads—and since our software-based approach requires roughly 55 cycles to switch threads—we cannot hide the latency of primary cache misses which hit in the secondary cache. Hence we only want to switch threads upon secondary cache misses (which are still large relative to our thread switching time). Ideally, we would like an informing mechanism where traps only occur upon secondary misses—however, implementing this may be difficult (or even impossible) given how late the secondary cache tags are checked. Instead, we assume that traps can only occur upon primary cache misses, but that inside the miss handler we can test a flag which indicates whether the primary miss is also a secondary cache miss.<sup>2</sup> This is similar to the condition-code approach that was discussed by Horowitz *et al.* [5, 6].

Note that in all three of these cases, the additional hardware support only affects actions taken upon miss handler invocation, and there is flexibility in how quickly the actions are performed. Hence we would not expect any of these features to slow down the critical path of normal execution. Having described our hardware support, we now discuss how it can be used to implement the miss handler.

#### 2.2 Design of the Miss Handler

We use a single miss handler to implement multithreading, as shown in Figure 1. The MHAR is set to contain this handler address at the start of execution, and is restored after each trap so that we continue using this same handler. As we see in Figure 1, the miss handler begins by subtracting four bytes (i.e. one instruction word) from the MHRR so that it will eventually restart the thread at the memory reference that missed, rather than at the instruction after it. The reason for doing this is that the original reference has been converted into a prefetch by the hardware (as discussed in the previous section), and therefore the reference must be re-executed to complete properly. The handler then tests whether the primary miss was also a secondary cache miss. If so, then the handler switches to a new thread; otherwise, it returns immediately.<sup>3</sup>

To switch between threads, the miss handler first saves the state of the current thread to memory, it then selects a thread to restart using a simple round-robin scheme, and finally it restores the state of this new thread. To prevent the memory references inside the miss handler from triggering additional informing memory traps, the trapping mechanism is disabled during the thread switch by writing a zero into the MHAR. Since user code in MIPS-based systems does not use the **k0** register, we use it as a pointer to where

<sup>&</sup>lt;sup>2</sup>Note that the processor will interlock on this flag until it is available.

 $<sup>^{3}</sup>$ Note that the processor will stall until the secondary cache miss flag is valid. If this is likely to take a non-trivial amount of time, then some of the thread switching code can be scheduled before this test to avoid wasting time.

| HandlerAddress:                    |       |                                       |
|------------------------------------|-------|---------------------------------------|
| add MHRR, MHRR, -4                 | $\Pi$ | Point the MHRR to previous inst       |
| bne #0, CMF, L2Miss                | $\Pi$ | Continue if cache-miss flag is set    |
| j MHRR                             | $\Pi$ | else L2 hit, so just return           |
| L2Miss:                            |       |                                       |
| li MHAR, #O                        | $\Pi$ | Disable miss-handler                  |
| li kO, #Membase                    | $\Pi$ | Get ptr to current state              |
| lw k0, 0(k0)                       |       |                                       |
| sw r1, 0(k0)                       | $\Pi$ | Save integer registers                |
| sw r2, 4(k0)                       | $\Pi$ | excluding k0,k1,r0                    |
| • • •                              |       |                                       |
| sw r31, 112(k0)                    |       |                                       |
| sw fcr31, 116(k0)                  | $\Pi$ | Save fp condition code register       |
| sd f0, 120(k0)                     | 11    | Save fp registers                     |
| sd f2, 128(k0)                     |       |                                       |
|                                    |       |                                       |
| sd f30, 240(k0)                    |       |                                       |
| sw MHRR, 248(kO)                   | 11    | Save MHRR                             |
| addu k0, k0, 256                   | 11    | Find & save ptr to new context state  |
| and kO, kO, #FFFFOFFF              | 11    | Assume 16 contexts, 256 bytes/context |
|                                    | 11    | and round robin selection method      |
| sw kO, Membase                     |       | <b>_</b>                              |
| 1 w r1, 0(k0)                      | 11    | Restore integer registers             |
| $1 \text{ w r}^2$ , $4(\text{RO})$ |       |                                       |
|                                    |       |                                       |
| 1  w r 31, 112(RU)                 |       | D                                     |
| Iw fcr31, 116(RU)                  | 11    | Restore Tp condition code register    |
| 1d TU, 120(RU)                     |       | Restore ID registers                  |
| Id T2, 128(RU)                     |       |                                       |
| 11 520 040(1-0)                    |       |                                       |
| 10 130, 240(k0)                    |       | Destant WIIDD                         |
| 10 MHRR, 240(KU)                   |       | Restore MARK                          |
| II MHAR, #HandlerAddres            | s /   | // Re-enable miss-nandler             |
| j nna                              | 11    | Jump to new context                   |
|                                    |       |                                       |

Figure 1: MIPS pseudo-code representation of the miss handler for software-controlled multithreading.

the thread state is stored. Assuming that the number of active threads per processor is a power of two, our simple round-robin scheme requires only three instructions to determine the next thread to be executed. Finally, the handler resumes thread execution by jumping to the address in the MHRR.

As we observe from this code, there are two major dimensions to consider when performing multithreading in software: (i) how to manage the saving and restoring of thread state; and (ii) how to decide when it is desirable to switch threads. We now consider both of these issues in greater detail.

#### 2.2.1 Saving and Restoring Thread State

Our multithreading scheme is similar to coarse-grained hardware-based schemes (e.g., APRIL [1]) in that thread switches are triggered by cache misses. An important difference, however, is that these hardware-based schemes devote special hardware to quickly saving and restoring the register state of threads. In contrast, we must save and restore registers through explicit loads and stores to memory. This overhead accounts for the bulk of our thread switching latency (which is roughly 55 cycles). The good news is that the thread state tends to stay in the primary data cache, which prevents the latencies from being even larger. However, since these non-trivial thread switching times are a potential performance bottleneck, we would like to reduce them even further.

The major trick for reducing the thread switching overhead is to avoid saving and restoring registers that do not need to be preserved. As a simple example, some applications do not use floating-point registers at all; by recognizing this fact, we could eliminate roughly half of the thread switching overhead in such applications. In general, the compiler can determine which registers are live at any given point in the program, and it could use this information to select a miss handler that has been customized to only save these live registers. While this approach may sound good in theory, it suffers the following limitations in practice. First, customizing the miss handler on a reference-by-reference basis involves either setting the MHAR before each reference, or else using the MHRR inside the miss handler to hash into a jump table. The Horowitz *et al.* study [5] quantified these types of overheads, which appear to be large enough to offset a non-trivial portion of the expected gains. A related limitation is that creating a large number of customized miss handlers will degrade the instruction cache performance. Finally, while it is easy to specify which registers are to be saved by choosing the right customized miss handler, it is more difficult to recognize which registers are to be *restored*, since this requires that we recognize the context of the suspended thread.<sup>4</sup>

A simpler approach to reducing the overhead of saving and restoring registers is to *statically partition* the registers between threads. For example, if we wanted to run two threads per processor, the compiler could compile each thread to use only half of the user registers. (Note that special-purpose registers—e.g., the stack pointer—cannot be partitioned.) The advantage of this approach is that many of the registers would be preserved in the register file itself, thus avoiding the need to save them to memory. The main disadvantage, however, is that each thread may suffer reduced performance due to having fewer available registers. (Another disadvantage is that code replication may impact the instruction cache performance.) Rather than taking an all-or-nothing approach, there is in fact a continuum of possibilities between saving all registers and partitioning all user registers. For example, it may be beneficial to give each thread one additional register at the expense of slightly increased switching overhead. We will evaluate the benefits of this static partitioning approach later in Section 4.

#### 2.2.2 Deciding When to Switch Threads

The second major challenge for software-controlled multithreading is switching threads only when the miss latency is expected to be large relative to the thread switching overhead. For our purposes, this means switching only upon secondary cache misses. Unfortunately—as we mentioned earlier—it is not likely that the result of the secondary cache tag check will be available early enough to trigger a trap. Instead, the strategy which we outlined in Figure 1 is to test whether the primary miss (which triggered the trap) is also a secondary cache miss once we are inside the miss handler. The main disadvantage of this approach is that if the reference does hit in the secondary cache, then we have wasted overhead with no benefit.

To avoid this useless overhead, we would like to predict *a priori* whether a given reference is likely to result in an expensive cache miss. If we believe that it will not, then we can disable the trapping mechanism for that reference. One possibility would be for the compiler to statically analyze the data locality [11, 19]; this technique has mainly been successful at predicting cache misses in matrix-based codes. Another possibility would be to collect a profile of how frequently each memory reference suffers a long-latency miss, and to feed this information back into the compiler. Finally, another possibility would be to use hardware to predict the conditional probability of a reference suffering a long-latency miss, given that it has suffered a primary cache miss. Such a prediction mechanism could use techniques similar to those used for branch prediction. With this information, the user could specify that they would like informing traps to occur only upon primary cache misses which are *also* predicted to be expensive misses. Implementing this behavior would be feasible since both the primary cache miss signal and the "expensive miss" prediction value would be available early enough to control the trap mechanism.

Of course, the drawback of using a prediction mechanism is that if it incorrectly predicts that a miss will be inexpensive when it turns to be expensive, then it is too late to invoke the thread switching code to hide the miss latency. We will evaluate the potential benefit of such techniques later in Section 4.

### 2.3 Avoiding Deadlock and Handling Synchronization Properly

By interleaving multiple threads on the same physical processor, multithreading introduces the possibility of deadlock in two ways. First, a repeated pattern could occur where thread A steals resource X from thread B (which is currently suspended, also waiting for resource X), only to suffer a thread switch back to B before A can use X; when thread B restarts, it steals resource X back from thread A, but also switches back to A before B can use X, etc. Such a pattern could be repeated infinitely as the two threads rapidly switch back and forth but neither thread makes progress. This scenario can arise when multiple threads suffer cache misses for unique addresses which map into the same cache entry. To prevent this problem, we swap out a given thread only once when it encounters a cache miss. If the miss has not completed by the time the round-robin scheduler reactivates the thread, then the thread stalls at that point until the miss completes (rather than switching to another thread).<sup>5</sup> Hence forward progress is guaranteed.

<sup>&</sup>lt;sup>4</sup>One way to implement this would be to save the instruction address of the customized code that should be used to restore a thread along with its other register state, and to jump to this address in the process of switching threads.

<sup>&</sup>lt;sup>5</sup>Although it is not clear how this works from our pseudo-code in Figure 1, the idea is to either postpone turning the miss handler back on until after the original reference completes when the thread is restarted (this can be accomplished if the

#### Table 1: Simulation parameters

| Pipeline Parameters |                 |  |  |  |  |
|---------------------|-----------------|--|--|--|--|
| ripenne rurumeters  |                 |  |  |  |  |
| Issue Width         | 4               |  |  |  |  |
| Functional Units    | 2 Int, 2 FP,    |  |  |  |  |
|                     | 2 Mem, 1 Branch |  |  |  |  |
| Reorder Buffer Size | 32              |  |  |  |  |
| Integer Multiply    | 12 cycles       |  |  |  |  |
| Integer Divide      | 76 cycles       |  |  |  |  |
| All Other Integer   | 1 cycle         |  |  |  |  |
| FP Divide           | 15 cycles       |  |  |  |  |
| FP Square Root      | 20 cycles       |  |  |  |  |
| All Other FP        | 2 cycles        |  |  |  |  |
| Branch Prediction   | 2-bit Counters  |  |  |  |  |

| Memory Parameters           |                          |  |  |  |  |
|-----------------------------|--------------------------|--|--|--|--|
| Line Size                   | 32B                      |  |  |  |  |
| Instruction Cache           | 32KB, 2-way set-assoc    |  |  |  |  |
| Data Cache                  | 32KB, 2-way set-assoc    |  |  |  |  |
| Unified Secondary Cache     | 2MB, 2-way set-assoc     |  |  |  |  |
| Data Cache Banks            | 2                        |  |  |  |  |
| Data Cache Fill Time        | 4 cycles                 |  |  |  |  |
| (Requires Exclusive Access) |                          |  |  |  |  |
| Miss Handlers (MSHRs)       | 16 for data, 2 for insts |  |  |  |  |
| Main Memory Bandwidth       | 1 access per 20 cycles   |  |  |  |  |
| Total Miss Latency to       | 14 cycles                |  |  |  |  |
| Secondary Cache             |                          |  |  |  |  |
| Total Miss Latency to       | 78 cycles                |  |  |  |  |
| Local Memory                |                          |  |  |  |  |
| Total Miss Latency to       | 200 cycles (2 hops),     |  |  |  |  |
| Remote Memory               | 300 cycles (3 hops)      |  |  |  |  |

The second scenario which can result in deadlock is if thread A spin-waits for a resource that is held by thread B, where B is currently suspended on the same processor as A, and A never yields the processor to B in the course of spin-waiting. This scenario can arise with any form of synchronization that involves spin-waiting (e.g., locks and barriers). Our solution is to force a thread switch (in software) as part of all spin-waiting loops. Not only does this approach avoid deadlock, it also has the added benefit that it helps the processor tolerate synchronization latency.

## 3 Experimental Framework

To evaluate our software-controlled multithreading scheme, we performed detailed cycle-by-cycle simulations of a collection of seven applications from the SPLASH-2 benchmark suite [20] on a shared-memory multiprocessor with out-of-order superscalar processors similar to the MIPS R10000 [21]. Our simulation model varies slightly from the actual MIPS R10000—e.g., we model two memory units, and we assume that all functional units are fully-pipelined. However, we do model the rich details of the processor, including the pipeline, register renaming, the reorder buffer, branch prediction, instruction fetching, branching penalties, the memory hierarchy (including contention), etc. The parameters of our model are shown in Table 1.

Our multiprocessor system model is roughly based on the SGI Origin [9]. We use a full-map directory to implement invalidation-based cache coherence. Remote accesses require either two or three network hops, depending on whether the data can be supplied by the *home* node or whether it must be forwarded from a *dirty-remote* node. We do not model network contention, but we do model memory contention in detail. As shown in Table 1, the two and three hop remote accesses result in nominal latencies of 200 and 300 cycles, respectively, not including additional delays due to memory contention.

We would like to emphasize that we simulate the *actual* thread-switching instructions shown in Figure 1, rather than simply modeling thread-switching as some fixed latency. In addition, we precisely model the timing of the trap mechanism for informing memory operations in the R10000, as described by Horowitz *et al.* [5, 6]. Our thread-switching code consists of a total of 104 instructions—of these, 94 are memory references. Given that our processor has two memory units, the memory references alone would dictate a minimum thread switching time of at least 47 cycles. Since we also model the instruction and data cache misses caused by the miss handler code, data dependences, resource constraints, etc., we observe a thread switching latency that is closer to 55 cycles. (The actual thread switching time varies across applications, and in one case is over 100 cycles, as we will see later in Section 4.)

We performed our experiments on the following applications from SPLASH-2: CHOLESKY, FFT, LU-CONT, LU-NCONT, OCEAN-CONT, OCEAN-NCONT, and RADIX. Table 2 briefly summarizes each application, along with the input data sets and other statistics. Further details on these applications can be found in the study

hardware supports sampling counters with the informing memory traps, or by scheduling explicit instructions in the code to turn the handler back on after keeping it disabled before restarting), or to combine an explicit test for a partial-latency miss with the test for an secondary cache miss inside the miss handler before invoke the thread switch code. In our experiments, we model the sampling counter approach.

Table 2: Benchmark characteristics table describes the benchmarks, input data set, and cache miss counts. The total number of misses, misses that hit in local memory, and remote miss counts are given for the 2-processor case.

| Name        | Description                                                    | Input Data Set                                 | Instructions     | Cache Miss Count |                  |                   |  |
|-------------|----------------------------------------------------------------|------------------------------------------------|------------------|------------------|------------------|-------------------|--|
|             |                                                                |                                                | Graduated        | Total            | Local Mem.       | Remote            |  |
| CHOLESKY    | Sparse Cholesky factorization                                  | tk14.0 input file                              | $44.6\mathrm{M}$ | 289K             | 28.8K            | $22.5{ m K}$      |  |
| FFT         | 1D fast Fourier transform                                      | 65536 complex points                           | $30.1\mathrm{M}$ | 256K             | 123.8K           | $59.6\mathrm{K}$  |  |
| LU-CONT     | LU factorization with                                          | 512x512 matrix,                                | 184M             | 755K             | 45.8K            | 50.7 K            |  |
|             | contiguous partitions                                          | 32x32 elem. blocks                             |                  |                  |                  |                   |  |
| Lu-ncont    | LU factorization with<br>non-contiguous partitions             | 512x512 matrix<br>32x32 elem. blocks           | $205\mathrm{M}$  | 7508K            | 66.7K            | $62.7 \mathrm{K}$ |  |
| Ocean-cont  | Large-scale ocean simulation<br>with contiguous partitions     | 130x130 grid                                   | 48.9M            | 2009K            | $52.2\mathrm{K}$ | 1.5K              |  |
| Ocean-ncont | Large-scale ocean simulation<br>with non-contiguous partitions | 130x130 grid                                   | $65.6\mathrm{M}$ | 2374K            | 284.4K           | 15.9K             |  |
| RADIX       | Integer radix sort                                             | 262144 keys, radix=1024,<br>max key value=1024 | $25.9\mathrm{M}$ | 197K             | $23.2\mathrm{K}$ | 24.4K             |  |



Figure 2: Performance of the baseline software-controlled multithreading scheme.

by Woo *et al.* [20]. All applications were compiled using version 2.8.0 of the gcc compiler, with -03 optimization. We used the *MINT3* MIPS instruction interpreter tool (provided by MIPS) to drive our detailed performance model, thus allowing us to simulate all instructions (including the thread-switching code) in a detailed, execution-driven fashion.

# 4 Experimental Results

We now present results from our simulation studies. We begin by evaluating the performance of our baseline software-controlled multithreading scheme. To further improve upon this scheme, we evaluate the performance potential of two techniques for reducing overheads: (i) *register partitioning* to reduce the thread switching overheads, and (ii) *miss prediction* to avoid invoking the miss handler upon secondary cache hits.

## 4.1 Performance of the Baseline Software-Controlled Multithreading Scheme

The results of our first set of experiments can be found in Figure 2 along with Tables 3 and 4. Figure 2 shows the performance impact of multithreading with two and four threads per processor on a two-processor machine, and with two threads per processor on a four-processor machine.<sup>6</sup> Each bar is labeled with the number of threads per processor, with the number of processors below that.

<sup>&</sup>lt;sup>6</sup>Since MINT3 can only simulate up to eight parallel threads at this point, we were not able to explore larger machine configurations. NOTE TO REVIEWERS: We hope to correct this by the final draft of the paper. By focusing on smaller machine configurations, we tend to underestimate the fraction of secondary cache misses that would be remote in a larger machine—hence our results are conservative since the potential performance gains are likely to be larger in larger-scale machines.

| T         |       | # of    |       | Breakdown of Normalized Graduation Slots |       |       |      |          |           |       |       |      |
|-----------|-------|---------|-------|------------------------------------------|-------|-------|------|----------|-----------|-------|-------|------|
|           | 1     | Threads | Total |                                          |       |       | St   | alls Due | to L1 M   | isses | Thr   | ead  |
|           | # of  | per     | Exec. |                                          | Inst. | Sync  | For  | und in L | ocation 1 | Below | Swite | hing |
| Benchmark | Procs | Proc    | Time  | Busy                                     | Stall | Stall | L2   | Mem      | Rem.      | Comb. | Setup | Regs |
| 1         | 2     | 1       | 100.0 | 12.0                                     | 21.9  | 32.9  | 2.2  | 3.3      | 5.0       | 22.7  | 0.0   | 0.0  |
|           |       | 2       | 98.2  | 11.4                                     | 23.4  | 33.9  | 0.7  | 1.1      | 0.0       | 19.1  | 2.1   | 6.5  |
| CHOLESKY  |       | 4       | 109.2 | 11.0                                     | 23.3  | 45.4  | 0.7  | 1.0      | 0.0       | 19.1  | 2.1   | 6.6  |
|           | 4     | 1       | 84.1  | 6.4                                      | 11.2  | 46.0  | 0.8  | 2.5      | 4.8       | 12.5  | 0.0   | 0.0  |
|           |       | 2       | 87.8  | 5.8                                      | 12.5  | 46.5  | 0.1  | 0.5      | 0.0       | 14.8  | 1.9   | 5.7  |
|           | 2     | 1       | 100.0 | 12.1                                     | 20.8  | 19.5  | 1.1  | 15.5     | 15.8      | 15.1  | 0.0   | 0.0  |
|           |       | 2       | 89.9  | 11.4                                     | 19.0  | 18.2  | 0.0  | 0.0      | 0.0       | 17.5  | 5.4   | 18.4 |
| FFT       |       | 4       | 85.6  | 10.1                                     | 20.1  | 16.7  | 0.0  | 0.0      | 0.0       | 15.4  | 5.2   | 18.1 |
|           | 4     | 1       | 65.1  | 5.6                                      | 9.7   | 20.7  | 0.5  | 7.6      | 15.0      | 6.1   | 0.0   | 0.0  |
|           |       | 2       | 59.6  | 3.8                                      | 10.0  | 19.8  | 0.0  | 0.0      | 0.0       | 8.7   | 4.2   | 13.1 |
|           | 2     | 1       | 100.0 | 17.8                                     | 32.0  | 14.5  | 2.0  | 1.2      | 3.2       | 29.2  | 0.0   | 0.0  |
|           |       | 2       | 90.9  | 17.5                                     | 29.4  | 13.2  | 0.1  | 0.2      | 0.0       | 27.2  | 1.5   | 1.8  |
| LU-CONT   |       | 4       | 96.8  | 16.8                                     | 30.8  | 16.9  | 0.1  | 0.3      | 0.0       | 27.4  | 2.5   | 2.0  |
|           | 4     | 1       | 61.3  | 8.9                                      | 14.9  | 16.1  | 0.9  | 0.8      | 5.7       | 14.1  | 0.0   | 0.0  |
|           |       | 2       | 57.4  | 7.9                                      | 15.0  | -17.7 | 0.1  | 0.0      | 0.0       | 13.8  | 1.1   | 1.8  |
|           | 2     | 1       | 100.0 | 19.6                                     | 15.4  | 7.6   | 23.3 | 1.4      | 3.4       | 29.4  | 0.0   | 0.0  |
| Lu        |       | 2       | 87.5  | 21.0                                     | 14.0  | 7.1   | 0.3  | 0.2      | 0.0       | 32.3  | 10.4  | 2.2  |
| NCONT     |       | 4       | 86.8  | 19.8                                     | 13.3  | 9.1   | 0.3  | 0.3      | 0.0       | 31.9  | 9.9   | 2.2  |
| 1100111   | 4     | 1       | 64.1  | 11.4                                     | 7.4   | 12.6  | 8.0  | 0.1      | 5.8       | 18.7  | 0.0   | 0.0  |
|           |       | 2       | 57.7  | 10.1                                     | 7.9   | 11.0  | 0.2  | 0.0      | 0.0       | 21.3  | 4.8   | 2.4  |
|           | 2     | 1       | 100.0 | 19.6                                     | 43.7  | 4.4   | 12.7 | 3.8      | 0.4       | 15.5  | 0.0   | 0.0  |
| OGEAN     |       | 2       | 89.3  | 20.4                                     | 39.2  | 4.6   | 0.0  | 0.0      | 0.0       | 17.5  | 4.9   | 2.7  |
| CONT.     |       | 4       | 112.9 | 20.6                                     | 43.5  | 11.0  | 0.0  | 0.0      | 0.0       | 28.9  | 5.5   | 3.4  |
| 00111     | 4     | 1       | 50.6  | 10.4                                     | 21.1  | 3.1   | 6.1  | 0.7      | 0.9       | 8.3   | 0.0   | 0.0  |
|           |       | 2       | 48.1  | 9.5                                      | 19.8  | 3.3   | 0.0  | 0.0      | 0.0       | 10.5  | 2.7   | 2.3  |
|           | 2     | 1       | 100.0 | 15.6                                     | 32.1  | 6.5   | 11.2 | 13.9     | 1.2       | 19.5  | 0.0   | 0.0  |
| OGEAN     |       | 2       | 99.1  | 15.6                                     | 31.9  | 6.1   | 0.0  | 0.0      | 0.0       | 28.0  | 6.1   | 11.4 |
| NCONT     |       | 4       | 121.2 | 16.0                                     | 35.4  | 16.4  | 0.0  | 0.0      | 0.0       | 33.8  | 6.5   | 13.1 |
|           | 4     | 1       | 50.9  | 8.2                                      | 16.3  | 3.0   | 4.9  | 7.0      | 1.0       | 9.6   | 0.0   | 0.0  |
|           |       | 2       | 48.8  | 7.1                                      | 16.2  | 3.1   | 0.0  | 0.0      | 0.0       | 13.9  | 2.7   | 5.8  |
|           | 2     | 1       | 100.0 | 18.8                                     | 53.5  | 2.5   | 2.1  | 3.8      | 8.1       | 11.2  | 0.0   | 0.0  |
|           |       | 2       | 99.1  | 18.1                                     | 52.9  | 1.3   | 0.1  | 0.2      | 0.0       | 13.8  | 0.9   | 11.3 |
| RADIX     |       | 4       | 106.1 | 19.0                                     | 54.3  | 4.8   | 0.1  | 0.2      | 0.0       | 14.4  | 1.5   | 11.5 |
|           | 4     | 1       | 53.6  | 9.7                                      | 26.5  | 1.6   | 1.0  | 1.4      | 7.6       | 5.7   | 0.0   | 0.0  |
|           |       | 2       | 53.8  | 8.4                                      | 26.0  | 0.9   | 0.1  | 0.1      | 0.0       | 7.9   | 0.6   | 9.8  |

Table 3: Breakdown of the performance of the baseline software-controlled multithreading scheme. Performance is normalized to the 2-processor case with no multithreading. Memory stall time is broken down into misses found in the L2 cache, local memory, or remote memory, and misses combined with other misses.

The execution times are normalized to the case without multithreading on two processors, and they are broken down into nine categories explaining what happened during all potential graduation slots.<sup>7</sup> The bottom section (*Busy*) is the number of slots when instructions actually graduate. The *Mem Stall* and *Sync Stall* sections are any non-graduating slots that can be directly attributed to data cache misses or synchronization, respectively. Table 3 breaks down the *Mem Stall* slots further into four categories: the first three are when a primary cache miss is ultimately found in the secondary cache, local memory, or requires a remote access, respectively; the fourth case (labeled *Comb.*) is when a primary cache miss is combined with another outstanding miss in progress. Returning to Figure 2, the top two sections in the multithreading cases represent slots due to the thread switching code; these are broken down into time spent saving and restoring registers (*TS Regs*) and the remaining miss handler time (*TS Setup*). Finally, the *Inst Stall* section is all other slots where instructions do not graduate. Note that these categories are only a first-order approximation of what is limiting performance, due to the inherent parallelism within an out-of-order superscalar processor and the fact that delaying one dependence tends to exacerbate subsequent dependences.

As we see in Figure 2, software-controlled multithreading results in significant speedups ranging from 10% to 16% in four of the seven applications (FFT, LU-CONT, LU-NCONT, and OCEAN-CONT), and more modest speedups of 1-2% in the other three cases. We also see that adding more threads does not necessarily improve performance. For example, OCEAN-CONT (on two processors) goes from a 12% speedup with two threads per processor to a comparable slowdown with four threads per processor. For all applications, however, there

<sup>&</sup>lt;sup>7</sup> The number of graduation slots is the issue width (4 in this case) multiplied by the number of cycles. We focus on graduation rather than issue slots to avoid counting speculative operations that are squashed.

|             | Avg. L2<br>Cache Miss | Average  | Average<br>Thread |
|-------------|-----------------------|----------|-------------------|
|             | Latency               | Length   | Switch Time       |
| Benchmark   | (cycles)              | (cycles) | (cycles)          |
| CHOLESKY    | 127                   | 826      | 71                |
| FFT         | 114                   | 161      | 56                |
| LU-CONT     | 143                   | 1769     | 57                |
| LU-NCONT    | 139                   | 1366     | 53                |
| OCEAN-CONT  | 84                    | 1125     | 55                |
| OCEAN-NCONT | 83                    | 260      | 54                |
| RADIX       | 137                   | 627      | 108               |

Table 4: Additional statistics on the baseline multithreading scheme.

is at least one configuration where software-controlled multithreading improves performance.

Let us begin by focusing on the impact of multithreading on memory stall times. We observe that without multithreading, six of the seven applications (all except RADIX) are spending over a third of their time stalled waiting for data when running on two processors; in three of these cases (FFT, LU-NCONT, and OCEAN-NCONT), about one-half of execution time is lost to memory stalls. By exploiting 2-way multithreading on two processors, we are able to hide 23% to 63% of the memory stall time; in six of the seven cases, multithreading hides over 35% of these stalls. As we see in Table 3, the bulk of the remaining miss latency with multithreading is due to misses that combine with other outstanding misses. For these combined misses, we are able to partially (but not fully) hide the memory latency. This effect is accentuated in part because our simple round-robin scheduling scheme blindly restarts the next thread without taking into consideration whether its miss has completed, or whether there are other threads that are ready to run. We chose our simple thread scheduling scheme, however, to minimize thread switching overhead and to avoid deadlock.

The benefit of reduced memory stall times is at least partially offset by the thread switching overheads. In four of the seven applications (CHOLESKY, LU-CONT, LU-NCONT, and OCEAN-CONT), the switching overhead with two threads each on two processors is less than 30% of the original memory stall time; in the other three cases, however, this overhead is almost one-half of the original memory stall time. It is not surprising that the thread switching times are non-trivially large, given that all of the thread switching is performed by software. The good news, however, is that the thread switching times are actually small enough that we do see some non-trivial performance gains. For example, even though FFT experiences a large thread-switching overhead, it still enjoys a 16% speedup with software-controlled multithreading. As we see Figure 2 and Table 3, the bulk of the thread switching overhead is usually due to saving and restoring registers, as opposed to other time spent in the miss handler. (The major exception to this is LU-NCONT, where most of the time is spent entering the miss handler and then deciding not to switch threads due to the reference hitting in the secondary cache.) Later in this section, we will evaluate techniques for reducing this thread-switching overhead.

We observe that multithreading generally had no positive impact on synchronization stalls. Part of the reason for this is that the bulk of the synchronization stalls in these applications are due to barriers. Since barrier stall times are dominated by load imbalance, which is not directly improved by latency tolerance, there is little opportunity for multithreading to improve their performance. In fact, the synchronization stall times become noticeably worse with four threads in several applications due to load imbalance problems.

To provide further insight into the multithreading behavior, Table 4 shows the following statistics: (i) the average secondary cache miss latency, which is the latency that a thread switch attempts to hide; (ii) the average *run length*, which is how long a thread executes between thread switches; and (iii) the average thread switching latency. (These numbers were collected from the case with two threads per processor on two processors, but the same trends hold in the other multithreading configurations.) First, we observe that the average secondary cache miss latency is significantly larger than the average thread switching latency in all cases. If this were not true, then the overhead of multithreading would offset any potential gains. Aside from the two versions of OCEAN (which are dominated by capacity misses, and where there is sufficient locality in the data distribution such that most secondary cache misses hit in local memory), the average miss latencies in the other applications are over 110 cycles due to the fact that a reasonably large fraction of secondary cache misses require remote communication. While five of the seven applications have thread switching latencies ranging from 53 to 57 cycles, CHOLESKY and RADIX experience much larger switching latencies are primarily caused by the

|             | Avg. Thread Switching Latency (cycles) |                       |  |  |  |
|-------------|----------------------------------------|-----------------------|--|--|--|
| Benchmark   | Baseline Case                          | Register Partitioning |  |  |  |
| Cholesky    | 71                                     | 26                    |  |  |  |
| Fft         | 56                                     | 18                    |  |  |  |
| LU-CONT     | 57                                     | 20                    |  |  |  |
| LU-NCONT    | 53                                     | 21                    |  |  |  |
| OCEAN-CONT  | 55                                     | 20                    |  |  |  |
| OCEAN-NCONT | 54                                     | 19                    |  |  |  |
| RADIX       | 108                                    | 37                    |  |  |  |

Table 5: Impact of register partitioning on thread switching latencies.

application displacing the thread switching instructions and data from the caches between thread switches.

Roughly speaking, we would expect the performance to saturate when the number of additional threads beyond the main thread is equal to  $\frac{L}{R+C}$ , where L, R, and C are the average miss latency, run length, and thread switching latency, respectively. Given the data in Table 4, we would expect to reach this saturation point with only one additional thread per processor, which is generally true. The one noticeable exception—FFT, which benefits from having four threads each on two processors—is also the case with the smallest average run length.

Finally, we observe that when multiple threads share the same physical cache, they can potentially interfere with each other either *constructively* (by effectively prefetching another thread's working set) or *destructively* (by displacing another thread's working set). While we did not observe any cases where destructive interference was problematic, we did observe a case of positive interference. In LU-NCONT, consecutive threads often access the same cache lines. When these threads are on separate processors, this sharing pattern results in communication and remote accesses. When consecutive threads are assigned to the *same* processor, however (as occurs under multithreading), one thread effectively prefetches the data set of another thread.

In summary, we have seen that our baseline software-controlled multithreading scheme can yield nontrivial performance gains. However, a key bottleneck which is limiting further performance improvement is the time spent switching between threads in software. To address this problem, we now consider techniques for reducing this overhead.

### 4.2 Register Partitioning

As we discussed earlier in Section 2.2.1, one approach to reducing the thread switching overhead is to partition the register set between threads, thereby reducing the number of registers that must be saved and restored. To perform these experiments, we recompiled each application using the -ffixed flag in gcc to control how many user registers could be allocated to a given thread. The following special-purpose MIPS registers could not be partitioned, and must still be saved and restored upon a thread switch: at, v0-v1, a0-a3, gp, sp, fp, ra and fcr31. By partitioning the remaining registers between threads, we were able to reduce the thread switching code to only 34 instructions, 24 of which were memory references. This reduced the average thread switching latency to as little as 18 cycles, as shown in Table 5. As we see in Table 5, register partitioning reduces the thread switching latency by at least a factor of 2.5 in all cases.

Figure 3 shows the impact of register partitioning on performance. For each multithreading case, we show two bars: the bar labeled **B** is the base case (shown earlier in Figure 2), and the bar labeled **R** is the case with register partitioning. As we see in Figure 3, the results are mixed.

In the cases with four threads per processor, register partitioning improves the performance of only one application: FFT, which enjoys a 7% speedup. For the other six applications, the negative impact of increased register spilling more than offsets the positive impact of faster thread switching. The problem in this case is that partitioning the registers between four threads eliminates three fourths of the user registers available to a given thread. As threads run for longer periods of time between thread switches, it becomes more important to have good register allocation rather than fast thread switching. Hence it is not surprising that the one application which actually benefits from four-way partitioning (FFT) also had the shortest average run length (as shown earlier in Table 4).

Register partitioning is more successful when there are only two threads per processor, in part because each thread loses only half of its user registers. As we see in Figure 3, two applications (FFT and RADIX) enjoy



Figure 3: Impact of register partitioning on performance ( $\mathbf{B}$  = baseline multithreading,  $\mathbf{R}$  = multithreading with register partitioning). Execution times are normalized to the case without multithreading on two processors.

significant performance gains from register partitioning with two threads per processor, and one application (LU-CONT) enjoys a modest speedup. As we saw earlier in Figure 2, FFT, OCEAN-NCONT, and RADIX each spend over 10% of their time saving and restoring registers to perform thread switches in the baseline case. Hence it is not surprising that we see large performance gains due to register partitioning in FFT and RADIX. In contrast, OCEAN-NCONT has higher register pressure than either FFT or RADIX, and consequently it loses too much performance due to register spilling to make up for the faster thread switching time.

Overall, we see that register partitioning can potentially improve performance by reducing the number of registers that must be saved and restored upon a thread switch. For example, in the case of RADIX, software-controlled multithreading offers almost no speedup on two processors in the baseline case, but it enjoys a 7% speedup with register partitioning. However, register partitioning is a technique that must be used with caution, since it can hurt performance if it causes too much register spilling. For example, with four threads per processor, the penalty of increased spilling due to having only 25% of the original user registers almost always outweighs the benefits of reduced switching overhead. Since the decision of whether to perform partitioning is controlled by software, the programmer has the flexibility to choose the option that works best for a given application. An even better solution would be for the compiler to make this decision *automatically*, which may be feasible since the compiler is aware of register spilling when it performs register allocation, and could adjust the degree of partitioning accordingly.

### 4.3 Miss Prediction

The final optimization that we consider is using prediction techniques to avoid invoking the miss handler upon primary cache misses which hit in the secondary cache (as discussed earlier in Section 2.2.2). The basic idea is to predict the conditional probability of a secondary cache miss given a primary cache miss for a specific reference, and to use this information at the time when a primary miss is detected to decide whether or not to actually invoke the miss handler. In theory, this could allow us to reduce some of the *TS Setup* time shown earlier in Figure 2. However, based on the results of our experiments, this optimization does not appear to be useful in practice. Even with a *perfect* prediction mechanism, the potential performance gain is generally quite small (just a few percent).<sup>8</sup> When we experimented with dynamic hardware predictors (e.g., two-bit saturating counters and other mechanisms commonly used for branch prediction), we were unable to achieve any speedup over the baseline case. Stride predictors are not helpful, since both the primary and secondary cache share the same line size. While it is easy to predict that a large fraction of references will hit in the secondary cache (specially those that enjoy spatial locality), most of these references *also* hit in the primary cache, in which case the miss handler would not be invoked anyway.

The fundamental problem is that accurately predicting the conditional probability of a secondary cache miss given a primary cache miss is difficult, and the penalty of a false negative (i.e. failing to predict a

<sup>&</sup>lt;sup>8</sup>Note that only a fraction of the TS Setup time can be eliminated, since much of it is due to real thread switches.

secondary cache miss) is extremely large, since we will fail to hide any of the miss latency in that case. (In contrast, the penalty of a false positive is much smaller, since we will quickly discover the mistake after entering the miss handler.) Hence all of the realistic predictors that we considered actually *hurt* performance by generating too many false negatives. The lesson that we have learned from these experiments is that it is far more important to reduce the overhead associated with actually switching threads (the largest component of which is saving and restoring registers) than trying to avoid invoking the miss handler in cases where a thread switch is unnecessary.

# 5 Conclusions

In contrast with previous studies on using multithreading to tolerate memory latencies in tightly-coupled machines, we have considered a completely new approach: one that is *software-controlled*, rather than hardware-controlled. The advantage of our approach is that due to its much simpler hardware support, it does not run the risk of degrading single-thread performance in applications which cannot benefit from multithreading (e.g., those that do not contain parallel threads). For example, our scheme does not require any modifications to the register file, unlike hardware-controlled schemes which typically require a much larger register file (thereby increasing register access latencies). The primary hardware support required by our scheme is *informing memory operations*, which have already been shown to be useful for a wide variety of purposes other than multithreading, and which are not expected to degrade single-thread performance.

Our experimental results demonstrate that software-controlled multithreading can result in significant performance gains. In our baseline scheme, four of seven applications speed up by 10% or more, with one application speeding up by 16% (FFT). By judiciously applying *register partitioning* to reduce the thread switching overhead in cases where it does not result in excessive register spilling, we can enjoy even larger speedups: e.g., an overall speedup of 25% in the case of FFT. Since both remote latencies and the amount of remote communication are expected to increase with larger numbers of processors, we expect even greater performance gains on larger scale multiprocessors.

As we look to the future, software-controlled multithreading should become even more attractive as instruction overhead becomes less and less expensive relative to memory latency. Software-controlled multithreading is a gentle path to providing the performance benefits of multithreading when it matters the most, without biting off the full cost and overheads associated with hardware-controlled multithreading. The attractiveness of software-controlled multithreading provides another compelling reason for future microprocessors to support informing memory operations.

# References

- A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.
- [2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. In Proceedings of the International Conference on Supercomputing, pages 1-6, June 1990.
- [3] S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
- [4] K. Farkas and N. Jouppi. Complexity/performance tradeoffs with non-blocking loads. In Proceedings of the 21st International Symposium on Computer Architecture, pages 211-222, April 1994.
- [5] M. Horowitz, M. Martonosi, T. C. Mowry, and M. D. Smith. Informing Memory Operations: Providing Performance Feedback in Modern Processors. In *Proceedings of the 23rd International Symposium on Computer Architecture*, pages 260-270, May 1996.
- [6] M. Horowitz, M. Martonosi, T. C. Mowry, and M. D. Smith. Informing Memory Operations: Memory Performance Feedback Mechanisms and Their Applications. ACM Transactions on Computer Systems, 16(2):170-205, May 1998.

- [7] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proceedings of the 8th International Symposium on Computer Architecture, pages 81-85, 1981.
- [8] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 308-318, October 1994.
- [9] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th International Symposium on Computer Architecture, pages 241-251, June 1997.
- [10] C. K. Luk and T. C. Mowry. Compiler-Based Prefetching for Recursive Data Structures. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222-233, October 1996.
- T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, March 1994. Technical Report CSL-TR-94-626.
- [12] T. C. Mowry, C. Q. C. Chan, and A. K. W. Lo. Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory. In *Proceedings of the 4th IEEE Symposium on High-Performance Computer Architecture*, pages 300-311, February 1998.
- [13] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, May 1989.
- [14] I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. Fine-Grain Access Control for Distributed Shared Memory. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 297-306, October 1994.
- [15] B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. SPIE, 298:241-248, 1981.
- [16] K. Thitikamol and P. Keleher. Multi-threading and Remote Latency in Software DSMs. In Proceedings of the 17th International Conference on Distributed Computing Systems, 1997.
- [17] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 191-202, May 1996.
- [18] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In *Proceedings of ISCA 22*, pages 392-403, June 1995.
- [19] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.
- [20] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The splash-2 programs: Characterization and methodological considerations. In *Proceedings of the 22nd International Symposium on Computer Architecture*, pages 24-38, June 1995.
- [21] K. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, April 1996.