DeCoR: A Delayed Commit and Rollback Mechanism for Handling Inductive Noise in Processors

Meeta S. Gupta, Krishna K. Rangan, Michael D. Smith, Gu-Yeon Wei and David Brooks

School of Engineering and Applied Sciences, Harvard University, 33 Oxford St., Cambridge, MA 02138
{meeta, kkrangan, smith, wei, dbrooks}@eecs.harvard.edu

Abstract

Increases in peak current draw and reductions in the operating voltages of processors continue to amplify the importance of dealing with voltage fluctuations in processors. Noise-margin violations lead to undesired effects, like timing violations, which may result in incorrect execution of applications. To support future trends, several recent architectural solutions for inductive noise have been proposed that, unfortunately, have a strong correlation to the underlying power-delivery package model and require a feedback loop that is largely constrained by the voltage/current sensor speed and accuracy. The resulting solutions are not robust across a range of microprocessor designs and packaging technologies. This paper proposes a robust scheme that does not make any assumptions about the package model or the responsiveness of the voltage sensors. In particular, our approach recovers from, rather than attempting to avoid, voltage emergencies. It uses a Delayed-Commit and Rollback scheme (DeCoR), which guarantees correct recovery even when noise margins are exceeded. The approach incurs a small performance impact when compared to an ideal machine that does not have voltage emergencies. We show that explicit checkpoint-recovery schemes, intended to handle infrequent events (e.g., radiation-induced soft errors), fail to handle frequently occurring voltage emergencies due to prohibitively high performance costs. DeCoR requires very few modifications to modern processor designs, as it simply leverages the existing buffering of the processor store queue and reorder buffer. Unlike conventional designs that conservatively protect all components of the processor from inductive noise with large timing margins, our approach conservatively protects only the architected register state and cache write paths.

1. Introduction

Inductive noise has been a long-standing problem in processor design. With greater interest in reduced-power microarchitectures, this problem has gained even more significance, especially as operating voltages decrease while peak currents increase in the presence of technology scaling [25]. Large current swings over small time scales cause large voltage swings in the power-delivery subsystem due to parasitic inductance. One can view such voltage emergencies as conditions when the supply voltage significantly deviates from the nominal operating voltage, resulting in transient timing faults and long-term reliability problems. The maximum allowable magnitude of this deviation is referred to as the hard threshold. Traditional designs require timing margins throughout the processor that are large enough to prevent timing faults. However, due to technology scaling, these timing margins will become overly large (10-15%) [11] unless efficient solutions are found for the inductive noise problem. These solutions must be cost effective, guarantee correctness, and minimally impact system-level performance.

Several researchers and microprocessor vendors have investigated architecture-level solutions to deal with voltage emergencies [12, 15, 21, 22]. These prior works strive to detect and avoid impending voltage emergencies arising from inductive noise to prevent failures. These impending voltage emergencies are either detected using voltage sensors [15] or current sensors [21]. Joseph et al. [15] propose a voltage sensor based approach, where a throttling mechanism is invoked when the sensed supply voltage crosses a specified level, called the soft threshold. Such a throttling mechanism must react before the voltage deviation proceeds beyond the soft threshold to the hard threshold. The choice of the soft threshold level is largely governed by the voltage sensor response time and accuracy. Our experiments (detailed in Section 6.2) show that throttling mechanisms cannot guarantee robustness for some benchmarks even with an ideal sensor delay of zero cycles. Moreover, the high dependence of throttling mechanisms on underlying assumptions associated with the power-delivery subsystem and sensors means that these schemes cannot easily be re-targeted for different processor architectures or power-delivery subsystems. In addition, the conservative threshold settings used to offset the sensor delay and allow the response mechanism to engage in these schemes cause unnecessary throttling which degrades performance. These drawbacks motivate the need for a solution that can be applied robustly to a wide range of designs with minimal hardware cost and low performance impact.

In this paper, we propose a delayed-commit and rollback mechanism (DeCoR) to handle voltage emergencies. Rather than trying to prevent voltage emergencies, this mechanism allows noise-margin violations to occur, but when they do, the architecture has the ability to roll back to a guaranteed correct processor state. This approach relaxes constraints on the power-delivery subsystem and sensor implementation. We divide our processors into two zones: rollback-protected (RB-protected) and timing-margin-protected (TM-protected). An RB-protected zone includes all structures where DeCoR permits recovery from voltage-induced timing violations. In particular, we RB-protect the vast majority of the processor core, along with the read path of the L1 data cache. A TM-protected zone encompasses all other structures that require timing margins large enough to prevent voltage fluctuations from corrupting execution. The write path of the L1 cache, the entire L2 cache, and the retirement register file (RRF) are TM-protected.

The delayed-commit mechanism speculatively buffers processor updates to the machine state (register file and memory) until it has verified that no noise-margin violations have occurred during a time period sufficient for the sensors to detect potential noise-margin violations. At the end of this sliding window of time, the state is said to be noise-verified and can be committed to its respective structure. In the event of noise margin failure, noise-speculative updates are discarded and execution is restarted from a prior noise-verified state. Thus, the delayed-commit mechanism distinguishes the processor’s noise-speculative state from its noise-verified state. While the proposed delayed-commit and rollback mechanism may appear to resemble traditional checkpoint-recovery schemes, there are distinct differences in implementation requirements/challenges and resulting performance penalties (discussed in detail in Section 5.3). DeCoR has a conservative overhead of 7% compared to the 39% overhead of explicit checkpointing schemes.

Figure 1 illustrates three schemes for handling inductive noise discussed so far. In standard design flows (Figure 1(a)), the circuit designer is responsible for meeting timing margins under worst-case inductive noise conditions. This leads to a robust, but over-designed system for typical usage. In the emergency-avoidance throttling scheme (Figure 1(c)), architectural techniques are used to protect all state information while the timing margins are kept thin. Such throttling schemes yield a lower cost design, but fail to guarantee correct operation across different package solutions or designs with large sensor delays. We also show that aggressive throttling can suffer from its own timing problems or incur unacceptably large numbers of false alarms leading to performance degradation when compared to an ideal noiseless machine. In the proposed scheme (Figure 1(b)), we retain the robustness of conservative designs while gaining the benefits of throttling schemes. In other words, DeCoR strikes a balance between traditional designs that are completely TM-protected and previously proposed throttling solutions that attempt to entirely eliminate conservative timing margins.

The main contributions of this paper are:

1. We provide a detailed analysis of a wide range of power-delivery subsystems for modern processors and show how different characteristics affect the occurrence of voltage emergencies. This analysis highlights the need for a general robust solution to deal with voltage emergencies.

2. We propose a delayed-commit and rollback mechanism (DeCoR) that handles voltage emergencies and accommodates a variety of voltage/current sensor and power-delivery subsystem options. We demonstrate that our approach is robust—guaranteeing correct operation across a wide range of package designs and sensor delays.

3. We explore various design parameters and present an experimental evaluation of benefits and costs associated with our framework. In particular, we argue that DeCoR can be low cost, by leveraging existing buffering and reusing the flush mechanisms in typical microarchitectures. Moreover, the performance impact is substantially less than that of employing explicit checkpoint mechanisms for handling noise-margin violations.

The rest of the paper is organized as follows. Section 2 highlights two different types of transient errors and emphasizes the need for an explicit solution to deal with noise-margin violations. Section 3 provides a brief description of power-delivery subsystems for processors and shows the effects of differing characteristics on voltage emergencies. Then, the proposed delayed-commit and rollback mechanism is presented in detail in Section 4. The approach is evaluated in Section 5, where a comparison to explicit checkpoint schemes highlights the differences between the schemes. Section 6 presents an overview of existing throttling mechanisms for emergency avoidance and gives examples where throttling fails to guarantee correct application behavior. It also showcases the robustness of our proposed scheme for a wide range of package solutions. Section 7 reviews prior research related to inductive noise and checkpoint mechanisms. Finally, Section 8 concludes and presents possible future research.

2. Background on Transient Errors

With continued advances in technology, the issue of reliability for modern high-performance processors is gaining importance. Reliability can be affected by transient errors, which can either be radiation-induced soft errors or voltage-induced noise-margin violations. Soft errors and noise-margin violations are similar in that they can cause transient failures, but they differ greatly in their characteristics. The main differences are in 1) the physical phenomenon that causes them; 2) the frequency of error occurrence; 3) the structures sensitive to the errors; and 4) the relationship between application characteristics and error occurrence. In this section, we explore these main differences between soft errors and noise-margin violations and highlight the differing requirements for detection and correction.

Soft errors are generally characterized as single-event upsets or bit-flips, caused by energetic particles from cosmic rays or alpha particles. The occurrence of soft errors is quite rare and these errors primarily affect data storage nodes. The probability of single-event upsets affecting the correctness of computation depends on the architectural vulnerability factor of the logic [19], which determines whether a fault in that logic would actually affect the outcome of the application. A common strategy is to employ a reactive mechanism where appropriate recovery actions are taken once an error has been detected. One of the main challenges to dealing with soft errors is the implementation of a robust error-detection mechanism, often seen in the form of parity bits and/or error correcting codes (ECC). For example, a parity-bit propagation technique to detect soft errors was implemented by Fujitsu in their SPARC processors, affecting almost 80% of latch banks and array structures [2]. Other detection approaches utilize redundant (or checker) processors and threads, which re-execute some or all of the instructions to verify correctness [4, 23, 29]. The infrequent occurrence of soft errors allows these reactive mechanisms to have large penalties associated with recovery. For example, the Fujitsu processor employs a checkpoint hardware mechanism with a quiescent and preparation period for restart of around 1μs. Hence, microsecond scale penalties are acceptable for soft errors that occur at the timescale of days.

Noise-margin violations have very different characteristics from those observed for soft errors. Noise-margin violations are sensitive to the characteristics of the underlying power-delivery subsystem and the application. Inductive noise results from parasitics present in the system, which can cause the voltage to swing significantly in response to current fluctuations. If the voltage swings are significant, they can induce timing-margin violations. Noise-margin violations primarily affect logic delay paths and are tightly coupled to application characteristics. For example, the presence of repeated execution patterns in applications can increase the susceptibility to timing violations [13]. Unlike soft errors, noise-margin violations are relatively easy to detect (e.g., using hardware sensors), but detection requires a careful balance between sensor latency and resolution. Several researchers have proposed to address noise-margin violations through feedback-driven avoidance mechanisms [12,15,21,22], where these sensors predict impending violations and take preemptive actions, such as throttling. Traditional designs require overly conservative timing margins for worst-case design, preventing the occurrence of such violations. In contrast, designs can more aggressively set timing margins if the aforementioned avoidance mechanisms are in place. Consequently, the frequent and repetitive occurrence of voltage noise requires fast detection and response to ensure correctness and limit performance loss.

In contrast to the above avoidance techniques, this paper presents a simple yet robust reactive approach to deal with noise-margin violations. The approach is based on a delayed-commit mechanism which requires very little modification to conventional processor designs. The frequent occurrence (due to repeated execution patterns) of voltage-induced errors means that the implementation and performance issues related to delayed-commit and rollback are paramount. Reactive mechanisms that have been proposed for handling soft errors cannot be used to address noise-margin violations due to the large disparity in the temporal characteristics of the two types of transient errors. Radiation-induced soft errors are infrequent and, hence, traditional recovery schemes [2, 30] implement checkpointing with coarse granularity on the order of 100-1000 cycles. While sufficient for soft errors, this granularity is too coarse for much more frequent voltage emergencies, requiring overly high overheads and leading to unacceptably large performance penalties. DeCoR provides a fine-grain reactive approach for this very different problem domain, voltage noise-margin violations, with minimal impact on performance.

3. The Power-Delivery Subsystem and Voltage Emergencies

Noise-margin violations, or voltage emergencies, are closely linked to the detailed characteristics of the power-delivery subsystem (PDS). In this section, we investigate the interaction between current consumption profiles and the PDS that leads to voltage emergencies. We consider voltage swings greater than ±5% of the nominal voltage to be voltage emergencies. The ±5% margins, assumed in previous studies, allow designers to choose relatively aggressive margins as opposed to 10-15%. We also show how different package parameters can affect voltage variations for a given processor. It is important to understand this interaction to design robust architectural solutions for handling noise-margin violations.
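
As a minimal sketch of how this definition can be applied to a simulated trace, the Python function below reports the fraction of cycles whose supply voltage deviates from nominal by more than the margin. The function name, the NumPy dependency, and the sample trace are illustrative assumptions, not part of the evaluation framework described in this paper.

```python
import numpy as np

def emergency_fraction(voltage, nominal=1.0, margin=0.05):
    """Fraction of cycles whose voltage lies outside nominal * (1 +/- margin)."""
    v = np.asarray(voltage, dtype=float)
    outside = np.abs(v - nominal) > margin * nominal
    return float(np.mean(outside))

# Hypothetical five-cycle trace around a 1 V supply.
trace = [1.00, 0.96, 0.94, 1.06, 1.02]
fraction = emergency_fraction(trace)
```

In this hypothetical trace, the 0.94 V and 1.06 V samples fall outside the ±5% band, while the 0.96 V sample does not.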

3.1 Characteristics of current pulses

While the PDS of a given microprocessor is a complex system consisting of several different components (e.g., voltage regulator module, package, on-die capacitors, etc.) [5, 8], a simplified second-order lumped model [14, 27] can adequately capture its resonance characteristics with impedance peaking in the mid-frequency range of 50MHz-200MHz. Ideally, the supply voltage across a processor should be constant. However, due to dynamic current fluctuations and the non-zero impedance of the PDS, large voltage fluctuations can occur. One way to characterize voltage variations is by convolving the instantaneous current profile of the microprocessor with the impulse response of the PDS (Equation 1).

\[ v(t) = i(t) * h(t) \tag{1} \]

Sudden short spikes in current can cause voltage variations, but the magnitude of the variation is largely determined by the amount of charge built up over a specific time interval. Figure 2(a) shows the voltage transients for current pulses of varying amplitudes and durations. The first and second pulses have the same width, but the second pulse has a higher amplitude. A sufficiently high amplitude can induce violations (±5%). The last two pulses shown, though they have large integrated charge, do not cause significant variations in voltage. This shows that isolated pulses with a certain amplitude/width combination can lead to voltage emergencies.

Even if the voltage fluctuation caused by a current pulse in isolation does not exceed noise margins, a series of such pulses at the resonance frequency of the PDS may lead to a voltage emergency. Figure 2(b) shows the voltage response for a series of current pulses. The first sequence of current pulses has a period of 30 cycles, which corresponds to a frequency of 100 MHz for a 3GHz processor. If the resonance of the PDS also occurs at 100 MHz, voltage swings gradually build up and exceed emergency thresholds. Thus, it is important to consider both isolated pulses and resonating pulses when designing an architectural solution to handle voltage emergencies.
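
The build-up at resonance can be reproduced with a discrete-time version of Equation 1. The sketch below is only illustrative: it assumes a decaying sinusoid as the impulse response of an underdamped second-order system, uses unit-amplitude current pulses, and ignores sign and scaling, so only the relative peak magnitudes are meaningful. All function names and parameter values (30-cycle resonant period, Q of 3, 15-cycle pulses) are assumptions chosen for the example.

```python
import numpy as np

def impulse_response(res_cycles=30, q=3.0, n_taps=600):
    """Decaying sinusoid standing in for an underdamped second-order PDS (assumed shape)."""
    n = np.arange(n_taps)
    omega = 2.0 * np.pi / res_cycles   # resonant frequency in radians per cycle
    alpha = omega / (2.0 * q)          # decay rate, using the approximation Q = omega / (2 * alpha)
    return np.exp(-alpha * n) * np.cos(omega * n)

def voltage_deviation(current, h):
    """Discrete form of Equation 1: v[n] = sum over k of i[k] * h[n - k]."""
    return np.convolve(current, h)[: len(current)]

def pulse_train(n_cycles, period, width, count):
    """Unit-amplitude current pulses of the given width, one every `period` cycles."""
    current = np.zeros(n_cycles)
    for k in range(count):
        current[k * period : k * period + width] = 1.0
    return current

def peak_deviation(current, h):
    """Largest absolute voltage deviation over the whole trace."""
    return float(np.max(np.abs(voltage_deviation(current, h))))

h = impulse_response()
single = peak_deviation(pulse_train(600, 30, 15, 1), h)     # one isolated pulse
resonant = peak_deviation(pulse_train(600, 30, 15, 8), h)   # pulses at the resonant period
off_res = peak_deviation(pulse_train(600, 75, 15, 8), h)    # pulses away from resonance
print(single, resonant, off_res)
```

Under these assumptions, the train at the resonant period produces a larger peak than either the isolated pulse or the off-resonance train, because each new pulse arrives in phase with the ringing left by the previous ones.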

3.2 Analysis of sensitivity to power-delivery subsystem

Given that a power-delivery subsystem can be modeled as a second-order linear system, the response of the package model to current variations is largely governed by three factors: Q (quality factor), C (resonance cycles) and Z (peak impedance). These factors affect the robustness and correctness of any solution for handling noise-margin violations. In this section, we analyze the effects of these three factors on voltage emergencies.

**Quality factor (Q):** The quality factor of a system is the ratio of the resonant frequency to the rate at which it dissipates its energy. This factor determines the width of the resonance, or the resonance band of the system. A higher Q leads to a greater build-up of voltage for currents oscillating within the resonance band. Q depends on the effective inductance (L) and resistive losses (R) at the resonant frequency \( (f = \frac{1}{2\pi\sqrt{LC}}) \) as shown in the following equation:

\[
Q = \frac{2\pi f L}{R}
\]
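
Substituting the resonant frequency into this expression gives an equivalent form. The capacitance of the lumped model is written as \( C_{\mathrm{pds}} \) here only to keep it distinct from the resonance-cycles factor C:

\[
Q = \frac{2\pi L}{R} \cdot \frac{1}{2\pi\sqrt{L C_{\mathrm{pds}}}} = \frac{1}{R}\sqrt{\frac{L}{C_{\mathrm{pds}}}}
\]

For a fixed resistance and capacitance, Q therefore grows with the square root of the effective inductance.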

A good package will have lower parasitic inductance and hence lower Q than a poor package. Figure 3(a) shows different packages with different Q, where it can be seen that higher-Q packages have a narrower resonance band and higher impedance at the resonant frequency. Higher impedance means that applications with current variations within the resonance band experience larger voltage fluctuations. Figure 3(b) illustrates how different packages with different Q factors affect voltage emergencies on a subset of the SPEC CPU2000 benchmarks. As Q increases, the fraction of cycles where the voltage extends beyond ±5% thresholds increases for all benchmarks. However, the slope for each benchmark differs, with *crafty* experiencing the steepest increase in noise-margin violations. This can be attributed to the differing current profiles of the benchmarks.

The PDS Q factor defines an important constraint on any technique designed to handle voltage emergencies. Specifically, the rate of change of voltage will depend on Q. For example, a snapshot of *crafty’s* voltage trace is depicted in Figure 4(a). This example assumes two thresholds, one at ±3% (soft) and the other at ±5% (hard). The short time interval over which the voltage crosses both soft and hard thresholds determines the maximum delay any soft-threshold-based avoidance mechanism can tolerate. Figure 4(b) plots the percentage of voltage emergencies occurring for different delays between threshold-crossings over a range of Qs across the entire voltage trace of *crafty*. This plot shows that even with moderately low Qs voltage fluctuations can be very fast.
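
One way to extract the delay between threshold crossings from a voltage trace is sketched below. This is a hedged illustration rather than the analysis code used for Figure 4(b): the function name, the per-cycle sampling, and the sample trace are assumptions. For every excursion that reaches the hard threshold, it records how many cycles elapsed since the same excursion first reached the soft threshold.

```python
def crossing_delays(voltage, nominal=1.0, soft=0.03, hard=0.05):
    """Cycles between reaching the soft and the hard threshold, per emergency."""
    delays = []
    soft_start = None   # cycle at which the current excursion reached the soft threshold
    in_hard = False     # True once the current excursion has already been counted
    for cycle, v in enumerate(voltage):
        deviation = abs(v - nominal) / nominal
        if deviation < soft:
            # Back inside the soft band: the excursion is over.
            soft_start = None
            in_hard = False
            continue
        if soft_start is None:
            soft_start = cycle
        if deviation >= hard and not in_hard:
            delays.append(cycle - soft_start)
            in_hard = True
    return delays

# Hypothetical trace with two excursions past the hard threshold.
trace = [1.0, 0.98, 0.965, 0.96, 0.945, 0.94, 0.98, 1.0, 0.94]
delays = crossing_delays(trace)
```

A delay of zero means the voltage passed both thresholds within a single sampled cycle, which leaves a soft-threshold-based mechanism no time to react regardless of sensor speed.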

**Resonance Cycles (C):** This factor represents the number of processor cycles corresponding to one period of the PDS resonant frequency. As processor frequency increases while the PDS resonant frequency remains fixed, C also increases. For example, a resonant frequency of 100MHz for a 1GHz processor would result in a C of 10 [21], whereas C would be 30 for a 3GHz processor.
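
The relationship is a simple ratio of the two frequencies. The helper below is a small sketch with an assumed name; the 2GHz case is a hypothetical data point added only for illustration.

```python
def resonance_cycles(clock_hz, resonant_hz):
    """Processor cycles per period of the PDS resonance: C = f_clk / f_res."""
    return clock_hz / resonant_hz

c_3ghz = resonance_cycles(3e9, 100e6)   # the 3GHz case discussed in the text
c_2ghz = resonance_cycles(2e9, 100e6)   # hypothetical 2GHz processor, same package
```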

Voltage emergencies strongly depend on this resonance cycles metric. Figure 5 plots the resulting voltage fluctuations for three settings of C and shows that the minimum width of an emergency-inducing isolated current pulse differs for different resonance cycles. In fact, this width depends on the resonant frequency of the PDS such that a larger C tends to require wider current pulse widths, in terms of the number of processor cycles. Figure 6 shows how the fraction of cycles with noise-margin violations varies with processor frequency for a given package. The package considered here has a resonant frequency of 100MHz. We can see that the peaks for the different benchmarks shown here are at lower values of the resonance cycles metric (20-30 cycles). This is because, in these benchmarks, current pulses tend to have smaller widths—both for resonating and isolated pulses.

**Peak Impedance (Z):** This factor represents the peak impedance for the power-delivery subsystem at its resonant frequency. Ideally, this peak (or target) impedance should be as low as possible to avoid voltage emergencies. However, efforts to reduce this peak impedance can increase system cost. Therefore, circuit and architecture designs must cope with higher than desired impedance to avoid voltage emergencies. Figure 6 shows that as the peak impedance of the package increases, the number of noise-margin violations also increases across all benchmarks and resonance cycles.

The extent of noise-margin violations varies across different power-delivery subsystem designs, and is also closely related to the current consumption profiles. It is important to guarantee that the mechanism used for handling voltage emergencies is robust across a wide range of package and processor characteristics.

4. Delayed-Commit and Rollback

The previous section has shown that a solution for handling noise-margin violations needs to be robust across different package and sensor characteristics. Along these lines, we propose a technique to deal with inductive noise that does not require fast sensors and actuators.

4.1 RB-protected and TM-protected Zones

The proposed delayed commit and rollback (DeCoR) mechanism does not attempt to avoid voltage emergencies altogether. Instead, it lets the processor core run freely with more aggressively set timing margins, and provides safeguards to detect and recover from potential noise-margin violations if and when voltage emergencies do occur. We divide the overall machine architecture into two zones: a zone that is timing-margin protected (TM-protected), as is traditionally done, and a zone that is rollback protected (RB-protected).

The TM-protected zone relies on standard circuit-based techniques to guarantee all timing margins are met. Although circuitry in this zone requires more conservative design, we limit blocks that reside here to the retirement register file, the PC chain, the L1 write port, and the L2 cache. Fortunately, these structures tend to be less timing sensitive for several reasons. First, processor performance is relatively insensitive to L2 cache latency, so circuit-level access time is not critical to system performance. For example, many designs will construct L2 caches using low-leakage, high-Vt transistors that trade access latency for reduced power consumption [10]. For the L1 cache, the L1 read path is known to set the access time, while the write path is less critical [32]. This is because the read ports are driven by small SRAM cells that cannot easily be sized up, while the write ports are driven by external peripheral circuits that can be appropriately sized to increase speed. We assume idle memory cells are resilient to common-mode voltage fluctuations, which affect both sides of a differential SRAM cell equally. Intel’s cache [24] shows that memory cells retain state in lower-voltage drowsy modes, where the voltages are much lower than the low-end voltage-emergency level we assume. Furthermore, idle memory cells have additional protection from standard ECC measures common in today’s microprocessors. Finally, the retirement register file and PC chain are relatively small structures, which are unlikely to be timing critical and can be sized up with small power penalties, if needed.

The rest of the processor pipeline resides in the RB-protected zone, which includes the instruction fetch unit, instruction cache (there are no writes to the I-cache from the processor and the static memory cells are robust as explained above), the issue logic, the execution units, and the commit logic that consists of the reorder buffer and store queues. These structures can assume more aggressive timing margins to avoid unnecessary performance loss, since they rely on an architectural mechanism for protection. We note that updates to the branch predictor in the noise-speculative state may corrupt the predictor state if a rollback is initiated. However, such entries would be few, would have only a slight performance impact, and would not affect correctness. Partitioning the processor into TM-protected and RB-protected zones is also straightforward, and can simply be applied at the architectural block level (in RTL). The paths from the TM-protected blocks can be flagged for extra timing margin. This approach is better than adding margins to all paths and no more complex than identifying critical paths in schemes like Razor [9].

To deal with timing violations in the RB-protected zone, we propose a delayed-commit and rollback mechanism that guarantees correctness in the presence of voltage emergencies. Our approach enables the processor to recover from voltage emergencies by rolling back the system state if timing violations are detected by voltage sensors. The delayed-commit mechanism ensures that the architected state and the values in the L1 data cache are not corrupted by timing violations. Unlike emergency avoidance, this mechanism does not seek to predict and mitigate emergencies, but reacts to them after they occur. Hence, the response mechanism is engaged only for true voltage emergencies (i.e., violations of the ±5% hard threshold) and does not need to detect emergencies using soft thresholds. More importantly, our approach guarantees correctness of the system across a wide range of power-delivery network characteristics, processor characteristics, and sensor delays.

4.2 Delayed commit

Figure 7 presents a functional diagram of the delayed-commit architecture. To guarantee system correctness, this architecture distinguishes between the noise-verified state and noise speculative state. A noise-verified state is the state of the machine that is known to be correct. The program can be rolled back to the noise-verified state following a noise-margin failure, which will be signaled by a voltage-sensor reading. In the noise-speculative state, the processor continues executing down the regular execution path and results are held in existing buffering mechanisms (ROB and STQ) until the outcome of the sensor reading is known. This buffering breaks the traditional feedback loop necessary in sense and throttle schemes (Section 6), accommodating any length of sensor delays.

During program execution, the processor buffers values corresponding to the noise-speculative state in the reorder buffer (ROB) and store queue (STQ). To know when this state can become noise-verified state, each entry in the ROB/STQ has a counter associated with it. Completed results need to be buffered in the ROB/STQ until they are verified to be correct with respect to noise violations. The time the instructions need to be buffered is directly proportional to the sensor delay in the system. If we assume a sensor delay of 5 cycles, all completing instructions will set their counters for this delay. When the instruction reaches the head of the queue and is ready to retire, the commit logic verifies that the counter has expired and then declares the state to be noise verified. At this point, it is safe to commit the state to the appropriate TM-protected structure, i.e., RRF or L1 data cache.

![Figure 7: Delayed commit architecture. A part of the RB-protected zone has been shown along with the additional modification including the counters added to the queues. The L1 cache and register file lie in the TM-protected zone.](image)

The correctness of our scheme centers on proper transitions from the noise speculative state to the noise-verified state. For our current implementation, this transition takes place when committing state from the STQ to the L1 data cache and from the ROB to the RRF. We must guarantee the robustness of writes to the RRF and data caches at all times, because in a worst-case scenario, a voltage emergency could occur when moving state into these noise-verified locations. For this reason, the write paths of these structures must have sufficient timing margins to tolerate voltage noise and hence lie in the TM-protected zone. In contrast, reads from the data cache and the retirement register file can experience voltage emergencies, because these emergencies would be detected and handled by the delayed-commit mechanism. Thus, we do not need to take any special measures when data transitions into the RB-protected zone. As discussed earlier, we assume that idle memory cells in the ROB/STQ will not be corrupted in the few cycles between state becoming noise-verified and committed.
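
The counter-based commit rule can be sketched as a small behavioral model. This is an illustration under simplifying assumptions, not the hardware design: the class and method names are hypothetical, instructions are assumed to complete in program order, commit width is unlimited, and a single queue stands in for both the ROB and the STQ. A sensor-reported emergency discards every entry still in the queue, while previously committed (noise-verified) state is left untouched.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Entry:
    tag: str
    counter: int   # cycles left until the result is noise-verified

class DelayedCommitQueue:
    """Behavioral model of a ROB/STQ whose entries wait out the sensor delay."""

    def __init__(self, sensor_delay):
        self.sensor_delay = sensor_delay
        self.queue = deque()     # noise-speculative results, oldest first
        self.committed = []      # noise-verified state (RRF / L1 data cache)

    def complete(self, tag):
        """A finished instruction enters the queue with its counter set to the sensor delay."""
        self.queue.append(Entry(tag, self.sensor_delay))

    def tick(self, emergency=False):
        """Advance one cycle; return the tags flushed by an emergency, if any."""
        if emergency:
            flushed = [entry.tag for entry in self.queue]
            self.queue.clear()
            return flushed
        for entry in self.queue:
            if entry.counter > 0:
                entry.counter -= 1
        # Commit from the head only, and only once the head's counter has expired.
        while self.queue and self.queue[0].counter == 0:
            self.committed.append(self.queue.popleft().tag)
        return []
```

With a sensor delay of 5, an entry added with `complete("i0")` is committed on the fifth emergency-free call to `tick()`. Entries that are still waiting when `tick(emergency=True)` is called are returned as flushed and never reach `committed`.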

4.3 Rollback

When a voltage emergency occurs, the sensors notify the machine that all noise-speculative state should be flushed and that it should initiate a rollback to the last noise-verified state. Flushing is straightforward, as the noise-speculative state is already located in structures (the ROB and STQ) that are capable of flushing speculative state and rolling back program execution. Thus, our rollback flush is similar to the flush after a branch misprediction, and the machine can be restarted in the next cycle. A key attribute of our scheme is that rollback occurs only when noise margins are actually violated; false alarms never occur.

To ensure that rolling back the processor does not cause new emergencies, we start the processor at a reduced frequency for some number of cycles, called the throttling period. This guarantees forward progress at the cost of a small loss in performance. Our experiments show that a throttling period of 10 cycles at 50% frequency following the rollback action ensures forward progress. This half-rate throttling can be achieved without PLL involvement by gating the clock every other cycle.
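
The half-rate throttling can be pictured as a per-cycle clock-enable pattern. The sketch below is an assumption-laden illustration: the function name is hypothetical, the throttling period is counted in full-rate clock cycles, and the period is assumed to begin in the cycle named in `rollback_cycles`.

```python
def clock_enable(n_cycles, rollback_cycles, throttle_period=10):
    """Per-cycle clock enable: True when the core clock pulses, False when gated."""
    enable = []
    remaining = 0   # full-rate cycles left in the current throttling period
    for cycle in range(n_cycles):
        if cycle in rollback_cycles:
            remaining = throttle_period
        if remaining > 0:
            # Gate every other cycle to run the core at half rate.
            enable.append(remaining % 2 == 0)
            remaining -= 1
        else:
            enable.append(True)
    return enable

# Hypothetical 20-cycle window with one rollback restarting at cycle 3.
pattern = clock_enable(20, {3})
active_cycles = sum(pattern)
```

Under these assumptions, each rollback gates half of the cycles in its 10-cycle throttling period, in addition to the noise-speculative work discarded by the flush.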

5. Performance Analysis

This section investigates the performance impact of different parameters used in the proposed DeCoR mechanism and seeks to understand how changes in sensor delay, buffer size, throttling period, and throttling factor affect performance. Section 5.3 provides a performance comparison of our proposed schemes with previous explicit checkpoint schemes, and shows the high restore penalties associated with the previous schemes.

5.1 Power-Delivery and Simulation Frameworks

Though we do provide detailed results under four different package configurations in Section 6.2, the following analysis will focus mainly on a single package model based on the characteristics of the Pentium IV package [5]. In this model, the resonant frequency of the PDS occurs at 100MHz with a peak impedance of 5mΩ. This corresponds to a resonance cycle (C) of 30 cycles for a 3GHz machine. This model’s impedance response is illustrated in Figure 3(a) (Q=3). Finally, we assume peak current swings of 16-50A, and noise-margin violations occur at ±5% of a 1V supply. Earlier works [12,15,21] used parameters based on the Alpha 21264/21364 package, which we evaluate in Section 6.2.
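
The package numbers above can be checked with a few lines of arithmetic. The Python sketch below recomputes the resonance cycle and the width of the noise margin, and adds a first-order estimate (our assumption, not a result from the simulations) of the current amplitude whose product with the peak impedance reaches the margin. The constant names are illustrative only.

```python
CLOCK_HZ = 3e9               # 3 GHz machine
RESONANT_HZ = 100e6          # PDS resonant frequency
PEAK_IMPEDANCE_OHM = 5e-3    # peak impedance at resonance
SUPPLY_V = 1.0
MARGIN_FRACTION = 0.05       # +/-5% noise margin

# Number of processor cycles in one period of the PDS resonance.
resonance_cycles = CLOCK_HZ / RESONANT_HZ

# Allowed deviation from the nominal supply, in volts.
margin_v = MARGIN_FRACTION * SUPPLY_V

# First-order estimate: amplitude of a current ripple at the resonant
# frequency whose I * Z product equals the noise margin.
critical_amplitude_a = margin_v / PEAK_IMPEDANCE_OHM
```

The estimate ignores transients and off-resonance components, so it only indicates the order of magnitude of current ripple that matters for this package.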

Our architectural simulations are based on a version of SimpleScalar for the x86 architecture. Table 1 tabulates the parameters used to configure an 8-way superscalar, out-of-order processor. To get a detailed cycle-accurate current profile, we modified Wattch [6] and incorporated the Wattch model into our SimpleScalar simulator. The voltage variation was calculated by convolving the simulated current profiles with an impulse response of the PDS, as detailed previously in Equation 1. This second-order model was used in most other studies [15, 21]. Using an on-die grid model [18] would not affect correctness but may enable selective rollback. We ran a subset of SPEC2000 benchmarks, as certain benchmarks failed to complete successfully on our simulator. Each benchmark was run for 100 million instructions of a representative phase given by Simpoint.

Large current variations can lead to significant voltage drops, and these variations are generally governed by the activity of the benchmark application just before the voltage drop. Our experiments showed some correlation between the distribution of voltage and the IPC of the application. Applications with high IPC (e.g., crafty and mesa) exhibit high levels of processor activity and hence have more current variations around the resonant frequency of the system. Benchmarks with lower IPC (e.g., equake and apsi) have longer stall/idle periods (e.g., due to high cache miss rates) than the previously mentioned benchmarks, and thus have a smaller fraction of cycles beyond the ±5% threshold.
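
The convolution step can be sketched as follows. Equation 1 is not reproduced in this section, so the impulse response used here is an assumption: a parallel RLC tank, chosen only because its impedance peaks at `z_peak` at the resonant frequency with quality factor `q`. The function names are hypothetical, and the model ignores DC resistance, so it captures resonant ripple but not static IR drop.

```python
import math


def pds_impulse_response(z_peak, q, f_res, dt, taps):
    """Sampled impulse response of a parallel RLC tank (assumed stand-in PDS model)."""
    w0 = 2.0 * math.pi * f_res
    cap = q / (z_peak * w0)                    # from Q = R * w0 * C with R = z_peak
    alpha = w0 / (2.0 * q)                     # damping rate
    wd = math.sqrt(w0 * w0 - alpha * alpha)    # damped ringing frequency
    return [
        math.exp(-alpha * k * dt) / cap
        * (math.cos(wd * k * dt) - (alpha / wd) * math.sin(wd * k * dt))
        for k in range(taps)
    ]


def voltage_droop(current, h, dt):
    """Discrete convolution of a per-cycle current trace with h (Riemann sum)."""
    droop = []
    for n in range(len(current)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * current[n - k]
        droop.append(acc * dt)
    return droop


dt = 1.0 / 3e9   # one cycle of a 3 GHz clock
h = pds_impulse_response(5e-3, 3.0, 100e6, dt, taps=600)
# 1 A sinusoidal current ripple at the resonant frequency (30 cycles per period).
current = [math.sin(2.0 * math.pi * n / 30.0) for n in range(1200)]
droop = voltage_droop(current, h, dt)
steady_state_peak = max(abs(v) for v in droop[-60:])
```

With a 1 A ripple at resonance, the steady-state droop in this toy model should settle close to the 5 mΩ peak impedance times 1 A, with a small discretization error from the per-cycle sampling. The supply voltage would then be the nominal value minus this droop, under the sign convention assumed here.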

5.2 Performance Analysis of DeCoR

There are two primary sources of performance loss associated with the delayed-commit scheme: 1) Delaying the commit of instructions in the reorder buffer and store queue may lead to buffer pressure and potential stalls and 2) Each rollback and subsequent recovery incurs a runtime penalty. The performance impact of delayed commits is, therefore, a function of the sensor delay, the characteristics of the power-delivery system, and the size of the store queue and reorder buffer. Rollback overhead can be further divided into cycles spent in replay/rollback and cycles spent during throttle (the slow startup mode engaged during recovery). Equation 3 provides a breakdown of the penalty associated with rollback:

$$\text{ExtraWork}_{\text{DCR}} = RA \ast (RC + (TF - 1) \ast TP)$$

where $RA$ represents the number of times the rollback happens (or rollback alarms), $RC$ represents the number of cycles rolled back, $TP$ represents the throttle period, and $TF$ represents the throttling factor (fraction of the clock frequency applied during throttle).
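
To make the rollback penalty concrete, the Python sketch below evaluates Equation 3 for one set of inputs. It assumes that the throttling factor is expressed as a slowdown multiplier (2 for a half-rate clock), which is one reading of the equation that keeps the throttle term positive. The function name and the input values are hypothetical and are not taken from our measurements.

```python
def extra_work_dcr(rollback_alarms, cycles_rolled_back, throttle_period, slowdown_factor):
    """Extra cycles per Equation 3, with TF read as a slowdown multiplier (assumption)."""
    throttle_penalty = (slowdown_factor - 1) * throttle_period
    return rollback_alarms * (cycles_rolled_back + throttle_penalty)


# Hypothetical inputs: 1,000 rollbacks, 10 cycles rolled back each time,
# and a 10-cycle throttle period at half rate (slowdown multiplier of 2).
penalty_cycles = extra_work_dcr(1000, 10, 10, 2)
```

With these inputs the rollback and throttle terms contribute equally, which illustrates why the two penalties carry comparable weight when the throttle period matches the number of cycles rolled back.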

Figure 8 shows the total performance penalty due to the DeCoR mechanism for different sensor delays. For sensor delays as low as 5 cycles, the performance impact ranges from 0.02% to 7% across the benchmark suite. This includes the total performance loss due to both STQ and ROB pressure during delayed commit and rollback and the slow startup of the processor during rollbacks.

A breakdown of the contribution of each factor (buffer pressure, rollback, and throttle) is also shown in Figure 8 for different sensor delays. We observe that the performance loss in benchmarks with high IPC (e.g., bzip and crafty) is largely dominated by the rollback and throttle penalties. There are two reasons for this. First, as discussed in Section 3, benchmarks with higher IPCs generally have more voltage emergencies. Second, high IPC benchmarks tend to have fewer memory stalls, resulting in lower occupancy rates for the ROB and STQ. Consequently, the penalty from buffer pressure is incurred infrequently. The performance loss of benchmarks with low IPC is dominated by buffer pressure, but the performance loss in these benchmarks is generally small. In this analysis, because the throttle period and sensor delay are both 10 cycles, we find that the relative weight of these two penalties is comparable across most benchmarks for low sensor delays. However, for larger sensor delays, the buffer pressure penalty and rollback penalty tend to dominate, and the throttle penalty decreases.

Another way to understand how the delayed-commit scheme applies pressure to the store queue and ROB is to look at occupancy. Our simulations showed that benchmarks with low IPC, like equake and apsi, tend to have nearly full ROB/STQ structures and hence experience a higher performance penalty due to buffer pressure. The occupancy of these structures increases by only a few entries for almost all of the benchmarks, indicating that a slightly larger STQ/ROB would alleviate some performance loss.

5.3 Comparison with existing explicit checkpoint-recovery schemes

The proposed delayed-commit and rollback mechanism differs greatly from traditional checkpoint-recovery mechanisms in terms of the mechanisms and overheads involved. Previous checkpoint-recovery schemes have an explicit checkpoint mechanism, whereas our approach relies on a lightweight implicit checkpointing scheme. In this section, we highlight the key differences between the two approaches and examine the performance penalty of handling inductive noise without a delayed-commit mechanism.

Explicit checkpoint-recovery mechanisms have been employed to tackle problems in a variety of domains. There can be several mechanisms for explicit checkpoints that vary in their degree of implementation complexity and overheads involved. Earlier checkpointing schemes were predominantly aimed at providing fault tolerance in large systems. Recently, however, checkpointing schemes have been used in various other domains: soft error detection [30],

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock Rate</td>
<td>3.1 GHz</td>
<td>ROC</td>
<td>84 Entries</td>
</tr>
<tr>
<td>Instruction Window</td>
<td>16KB 2-way</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Functional Units</td>
<td>3 Integer ALU, 4 FP ALU, 64 Instructions</td>
<td></td>
<td></td>
</tr>
<tr>
<td>L1 I-Cache</td>
<td>64 KB 2-way</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Main Memory</td>
<td>300 cycle Latency</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Branch Penalty</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 1: Processor parameters for SimpleScalar.

Checkpointing schemes that maintain correctness under noise-margin violations need to be invariant to sensor delays. Figure 9 shows an example of a scenario where the violation is detected after checkpoint $C_{i+1}$ has been taken. To have correct semantics, the rollback should occur to checkpoint $C_i$, leading to a higher rollback cost. In general, even if the checkpoint interval is adjusted to match the sensor delay in the previously proposed mechanisms, the recovery costs of discrete, explicit mechanisms are higher than those of the implicit checkpointing mechanism in DeCoR.
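
The extra rollback distance of an explicit scheme can be illustrated with a small Python sketch. It assumes checkpoints are taken at fixed intervals starting at cycle 0 and that recovery always skips the most recent checkpoint, as in the Figure 9 scenario. The cost assigned to the implicit scheme, equal to the sensor delay, is a simplifying assumption, and both function names are hypothetical.

```python
def explicit_rollback_cycles(detect_cycle, checkpoint_interval):
    """Cycles discarded when recovery must skip the most recent checkpoint."""
    latest = (detect_cycle // checkpoint_interval) * checkpoint_interval
    # The latest checkpoint may already hold corrupted state, so use the one before it.
    target = max(latest - checkpoint_interval, 0)
    return detect_cycle - target


def implicit_rollback_cycles(sensor_delay):
    """Assumed DeCoR cost: only the still-uncommitted window is discarded."""
    return sensor_delay


# Checkpoint interval matched to a 20-cycle sensor delay; violation detected at cycle 47.
explicit = explicit_rollback_cycles(detect_cycle=47, checkpoint_interval=20)
implicit = implicit_rollback_cycles(sensor_delay=20)
```

Under these assumptions, once at least one full interval has elapsed, the explicit scheme discards between one and two full intervals of work on every rollback, whereas the implicit window stays at the sensor delay.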

A common trait of all checkpoint-recovery schemes is the explicit saving and restoring of required architectural state. The state that must be saved for correct execution at recovery is mainly the architected state, which consists of the registers and the up-to-date memory state. For example, [17] and [16] assume a buffered memory update, using volatile bits for updated lines between checkpoints. When a rollback occurs, the lines marked volatile are flushed from the cache. However, this results in additional cache misses after rollback. The implementations proposed in [17] and [1] have a register restore latency of 8 cycles (for 32 registers with 4 write ports); in other implementations, the infrequency of rollbacks masks the rollback cost. However, if such schemes are applied to noise-margin violations, which potentially require frequent rollbacks, these previously proposed checkpoint-recovery mechanisms incur significant overheads.
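
The 8-cycle restore latency follows directly from the register count and the number of write ports. The Python sketch below reproduces that arithmetic, assuming each write port restores one register per cycle, and shows how the cost scales with the number of rollbacks. The function name and the rollback count are hypothetical.

```python
import math


def register_restore_cycles(num_registers, write_ports):
    """Cycles to rewrite the register file, one register per port per cycle."""
    return math.ceil(num_registers / write_ports)


restore_latency = register_restore_cycles(32, 4)

# Hypothetical workload: total restore cost grows linearly with the rollback count.
total_restore_cycles = 1000 * restore_latency
```

A fixed per-rollback cost of this kind is negligible for rare events but accumulates quickly when rollbacks are frequent.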

Figure 10 shows the breakdown of the performance impact of rollback, throttle, and restoring state for explicit checkpointing compared to the performance impact of rollback, throttle, and delayed-commit for DeCoR with a sensor delay of 20 cycles. Because every rollback returns to the previous checkpoint, $C_i$, instead of the current checkpoint, $C_{i+1}$, the rollback penalties (including the throttle penalties) are higher for the explicit checkpoint schemes. We can also see that the performance impact of restoring state in explicit checkpoint mechanisms is significantly higher, around 39% on average. This includes both the register restore penalty and the impact of flushing the volatile lines. A significant increase in cache miss rate was observed for bzip due to the flushing of the volatile lines at every rollback, leading to a huge performance penalty of 170%. The majority of the benchmarks clearly favor DeCoR by a wide performance margin, with swim, a benchmark with very few emergencies, being a notable exception. In this case, the benchmark incurs a high delayed-commit penalty (also seen in Figure 8), but the number of rollbacks is small, resulting in better performance for the explicit checkpointing scheme.

In summary, explicit checkpoint mechanisms incur unacceptable performance overheads when applied to the highly frequent transient errors characteristic of inductive noise. In contrast, DeCoR provides reasonable performance overheads across a range of sensor delays. We now turn our attention to a robustness comparison between DeCoR and existing throttling-based noise-control mechanisms.

6. Robustness comparison with hardware throttling

Solutions to deal with inductive noise must correctly avoid or recover from voltage emergencies, preferably with small performance loss. This section evaluates the robustness of previously proposed schemes [12, 15, 21, 22] across a wide range of PDS and sensor delay assumptions. A common attribute of these schemes is to use techniques that either spread out emergency-causing current variations in time or reduce their amplitudes. We refer to all of these techniques as throttling. The problem arises when one tries to use throttling alone in an emergency-avoidance manner to stop the processor from ever crossing a hard threshold (for the following discussion, we again assume a hard threshold of ±5% of the nominal voltage). Section 6.1 illustrates the feedback associated with these emergency-avoidance approaches and discusses the impact of voltage sensor delay characteristics on the correctness of these schemes. Section 6.2 evaluates the effect of PDS characteristics.

These throttling mechanisms include frequency throttling, pipeline freezing, pipeline firing, issue ramping, and changing the number of available memory ports [12, 15, 21, 22]. All of these mechanisms rely on voltage or current sensors to detect threshold crossings indicating that noise-margin violations have occurred or are about to occur. As shown in Figure 11, the sensor turns the throttling mechanism (i.e., the actuator) on. Assuming that the control logic and actuation mechanism can react quickly, the main bottleneck in throttling for emergency avoidance is the speed of the individual sensors and of aggregation across multiple sensors in different parts of the processor. This delay sets the speed of the overall feedback loop. There are many ways to build sensors with tradeoffs between delay and precision. Hence, it is important to understand the impact of the inherent delay and inaccuracy associated with the sensors on the different emergency-avoidance schemes.

Recall Figure 4(a), which shows the number of cycles from the crossing of the ±3% soft threshold to the crossing of the ±5% hard threshold for the benchmark crafty. It took only 3 cycles. In general, the maximum allowable sensor delay is largely determined by the minimum number of cycles for voltage to transition between the two thresholds. This suggests that for the emergency-avoidance throttling mechanism to work correctly for crafty under our package model, the sensors need to detect the soft threshold crossing within two cycles, leaving only one cycle for the actuator mechanism. Since noise violations are rare events, one might argue that the fraction of those occurring with such a steep slope would be extremely rare. Unfortunately, we need only a single noise-margin violation to disrupt the reliability of our processor circuits and cause incorrect program execution. Consequently, all such situations must be avoided.
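
The sensor-delay budget described above can be computed from a voltage trace. The Python sketch below finds the fewest cycles between a soft-threshold crossing and the following hard-threshold crossing, then subtracts the cycles reserved for the actuator. The function name and the short synthetic trace are hypothetical; the trace is constructed to reproduce the 3-cycle case and is not measured data.

```python
def min_soft_to_hard_cycles(voltage, nominal=1.0, soft=0.03, hard=0.05):
    """Fewest cycles between a soft-threshold crossing and the next hard crossing."""
    best = None
    soft_cycle = None        # cycle at which the current excursion crossed soft
    in_emergency = False
    for cycle, v in enumerate(voltage):
        deviation = abs(v - nominal) / nominal
        if deviation < soft:
            # Back inside the soft band: the excursion is over.
            soft_cycle = None
            in_emergency = False
            continue
        if soft_cycle is None:
            soft_cycle = cycle
        if deviation >= hard and not in_emergency:
            in_emergency = True
            gap = cycle - soft_cycle
            if best is None or gap < best:
                best = gap
    return best


# Synthetic trace: soft threshold crossed at cycle 1, hard threshold at cycle 4.
trace = [1.0, 0.965, 0.96, 0.955, 0.94]
gap = min_soft_to_hard_cycles(trace)
actuator_cycles = 1
max_sensor_delay = gap - actuator_cycles
```

Because correctness requires avoiding every violation, the budget is set by the minimum gap over the whole trace rather than by the typical gap.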

To guarantee correct operation, emergency-avoidance throttling mechanisms must apply the throttling mechanisms before a hard emergency actually develops. To provide the sensor and actuator with more time to operate when delays are long, one can consider increasing the distance between the hard and soft thresholds. However, conservatively setting the soft threshold increases the number of false alarms, where voltage variations are unnecessarily flagged as requiring throttling. Since the processor incurs a performance penalty every time throttling is engaged (i.e., the program slows down), we would like to engage the throttling mechanism only when we are sure that the hard threshold would otherwise be crossed.

Figure 12 illustrates the distribution of false alarms across our benchmarks for several different soft thresholds. A threshold of ±3% leads to an 80% false alarm rate, averaged across all the benchmarks. Thresholds closer to the noise margins have fewer false alarms, though the percentage is still relatively high. In comparison, DeCoR does not incur any false alarms since only the occurrence of voltage emergencies (hard threshold crossings) is detected.
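
One way to count false alarms on an unthrottled voltage trace is sketched below in Python. The exact definition behind Figure 12 is not spelled out in this section, so the sketch assumes that each excursion beyond the soft threshold raises one alarm and that the alarm is false if the excursion ends without reaching the hard threshold. The function name and the synthetic trace are hypothetical.

```python
def false_alarm_rate(voltage, nominal=1.0, soft=0.03, hard=0.05):
    """Fraction of soft-threshold excursions that never reach the hard threshold."""
    alarms = 0
    real_emergencies = 0
    in_excursion = False
    reached_hard = False
    for v in voltage:
        deviation = abs(v - nominal) / nominal
        if deviation >= soft:
            if not in_excursion:
                # A new excursion beyond the soft threshold raises one alarm.
                in_excursion = True
                reached_hard = False
                alarms += 1
            if deviation >= hard and not reached_hard:
                reached_hard = True
                real_emergencies += 1
        else:
            in_excursion = False
    if alarms == 0:
        return 0.0
    return (alarms - real_emergencies) / alarms


# Synthetic trace with five soft-threshold excursions, one of which reaches the hard threshold.
trace = [1.0, 0.965, 1.0, 0.96, 1.0, 0.965, 1.0, 0.96, 1.0, 0.965, 0.94, 1.0]
rate = false_alarm_rate(trace)
```

Setting `soft` equal to `hard` in this sketch makes every alarm a real emergency, which mirrors the way DeCoR detects only hard-threshold crossings.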

6.2 Robustness Across Package Choice

As discussed in Section 3, a processor’s susceptibility to voltage emergencies is tightly coupled to the underlying power-delivery subsystem. Hence, correctness of any proposed solution depends on assumptions made about the package and/or processor models. In this section we show that the correctness of previously proposed throttling schemes varies with respect to current swing (dependent on processor architecture), resonant frequencies (i.e., packaging assumptions), and sensor delays. In contrast, only DeCoR’s performance depends on these parameters, and its correctness is not affected. We chose to sweep two of the three parameters governing power-delivery subsystem characteristics (Q factor and resonance cycles), as changing either one affects the peak impedance of the system. We also evaluate and compare the performance impact of different package characteristics on the proposed DeCoR scheme.

The techniques discussed in Section 6 aim to avoid emergencies by throttling the system to reduce current fluctuations in response to the detection of voltage swings [15] or current repetitions [21]. Figure 13 presents contour plots of the number of noise-margin violations for bzip across different package characteristics, while employing an aggressive, 0.5× frequency-throttling mechanism that responds to voltage swings. With a voltage sensor soft threshold of ±3%, we consider two sensor delay scenarios: a sensor delay of 0 (Figure 13(a)) and a sensor delay of 5 (Figure 13(b)). Even with an optimistic sensor delay of 0, this throttling scheme fails to prevent noise-margin violations for packages with Q greater than 2, leading to correctness violations. For more realistic sensor delays, the number of violations increases by two orders of magnitude, and even packages with relatively low Q are sometimes unable to avoid noise-margin violations. The package characteristics assumed in [15] lie in the small region (upper left corner) where the throttling mechanism is effective. These results show that a throttling-based emergency-avoidance scheme alone is not a generally applicable solution to guarantee correctness.

Figure 12: Distribution of false alarms generated for emergency-avoidance mechanisms given different soft thresholds.

Figure 13: Evaluation of the robustness of the emergency-avoidance scheme in [15]. This figure shows the number of noise-margin violations during SPEC benchmark bzip with different package solutions and different sensor delays.

Figure 14: Evaluation of resonance tuning scheme [21] for equake. Figure 14(b) depicts the percentage of single-transition events causing noise-margin violations for different package solutions.

Resonance tuning is another throttling mechanism that assumes voltage emergencies are caused by repeating high-low/low-high current transitions occurring within the resonance band [21]. We implemented the mechanism proposed in [21] and computed the corresponding parameters for each package with an initial response threshold of 2. In addition to resonant pulses, several single-transition events were detected for different packages, and a significant fraction of these events caused noise-margin violations. For example, Figure 14(a) illustrates a sudden isolated current pulse found in SPEC benchmark *equake* that causes voltage to swing below the hard threshold. Figure 14(b) shows a contour plot delineating the fraction of single-transition events (or isolated pulse emergencies), seen in *equake*, that caused noise-margin violations across different package characteristics. These results show that the scheme proposed in [21] would not be able to detect such pulse emergencies for packages with Q greater than 2. On the other hand, resonance tuning techniques could be used to enhance DeCoR. In this case, resonance tuning would filter out many potential rollbacks and DeCoR would provide correctness guarantees for single-transition events. DeCoR’s correctness guarantee would also ease the implementation of the resonance tuning approach.

Table 2: Four different packages with different characteristics.

6.3 DeCoR’s performance across package assumptions

In contrast to the two schemes evaluated above, the DeCoR scheme proposed in this paper provides correctness guarantees across all of the different packages assumed for the underlying power-delivery network. However, there is a rollback penalty whenever a noise-margin violation occurs, and the performance degrades with lower-cost package solutions. Table 2 provides a description of different packages, sorted by impedance. Pkg3 represents the power-delivery subsystem that we use in our model, based on the Pentium IV. Pkg1 and Pkg2 represent packages used in related work [15, 21] based on the Alpha 21264 and 21364. Pkg1 represents a reasonably good package, as it has a low target impedance and low Q. Similarly, Pkg2 is also a reasonably good package, with low impedance but slightly higher Q and much higher resonance cycles. Table 2 also notes the current range (max current to min current) of the processors assumed in the related work. For comparison, we also include a fourth package, Pkg4, that represents a low-cost package with very high impedance and high Q. Such poor characteristics can be a result of high package-to-chip inductance, due to fewer or low-cost bump connections, or of less decoupling capacitance being available on the chip. Figure 15 presents a surface plot of how performance of the delayed-commit and rollback mechanism varies for *crafty* across a wide range of package characteristics (see footnote 1). Package models with a Q of 3 result in less than 10% performance loss, but performance degrades rapidly as Q increases. Packages with higher resonance cycles tend to have less impact on performance, which suggests that the current profiles in this benchmark have more activity around smaller resonance cycles.

In summary, our investigation of various avoidance and recovery schemes applied across systems with a wide range of packaging characteristics shows that only the proposed DeCoR scheme can be generally applied to handle voltage emergencies for various power-delivery subsystems. As Q increases, DeCoR remains robust, albeit with increasing performance penalties, while the emergency-avoidance throttling schemes fail to guarantee correctness.

7. Related Work

Prior works that handle noise-margin violations use throttling techniques to either spread out current variations in time or reduce their amplitude. Sections 6 and 6.2 provide detailed discussions and comparisons between DeCoR and the previously proposed approaches [12, 15, 21, 22].

As mentioned in Sections 2 and 5.3, reactive schemes have been proposed to deal with soft errors, but cannot be applied to noise-margin violations. For example, Wang and Patel [30] present a coarse-grained checkpoint-restart mechanism to recover from soft errors, relying on a separate checkpoint hardware structure. Due to the rare occurrence of radiation strikes, they assume checkpoint recovery mechanisms have zero performance cost, and primary attention is given to soft error detection and coverage.

In another related work, Razor [9], the authors propose a circuit-level mechanism to dynamically detect and correct timing failures by augmenting critical flip-flops in the microprocessor pipeline with shadow latches. These shadow latches rely on a delayed clock to provide additional timing margins and enable detection of speed-path failures. However, Razor may be costly to implement in a high-performance out-of-order core with several large array structures and tight timing paths. A recent study by Annavaram et al. [3] of the distribution of timing margins for different paths across functional blocks in the Intel Core Duo microprocessor shows hundreds of paths within 10% timing margins. This suggests that voltage-induced violations are likely to affect many paths.

DIVA also provides a method to dynamically detect and recover from transient errors [4, 7, 31]. This scheme relies on a checker processor that runs in parallel with the main out-of-order core, checking results prior to committing the instructions. None of the existing DIVA papers specifically addresses inductive noise or provides a detailed analysis of how well DIVA can cope with it. DIVA assumes that all reads and writes to all registers and memories will complete without error, requiring extensive TM-protected zones within the processor core given a noisy voltage environment. In addition, DIVA requires duplicate TM-protected functional units (e.g., INT/FPU/SSE units) that consume additional power and area resources. In contrast, DeCoR requires little, if any, additional hardware, and only the D-cache and register file write ports must be TM-protected.

8. Conclusions and Future Work

In this paper, we presented a delayed-commit and rollback mechanism to handle inductive noise in microprocessors. We divide the processor into two zones: a timing-margin protected zone encompassing selected structures implemented with conservative timing margins to handle the worst-case *di/dt* drops, and a rollback protected zone encompassing all of the processor structures protected by our approach from voltage-induced timing violations. Our scheme delays commits to the processor state by speculatively holding processor updates until it can be verified that no noise-margin violations have occurred. In the event of a noise-margin violation, the rollback allows the processor to return to a guaranteed correct state.

Our presented mechanism is independent of the performance of voltage or current sensors and the characteristics of the power delivery subsystem. We demonstrated that our approach is robust across a wide range of package assumptions and sensor delays. More importantly, our approach simply relies on existing buffering and flush mechanisms present in modern microarchitectures, so it can be implemented at low cost compared to other explicit checkpoint and rollback mechanisms. Experimental evaluation of overheads associated with our framework showed acceptable performance losses relative to an ideal machine with no voltage emergencies: for sensor delays between 5 and 10 cycles, the observed performance loss was between 3% and 5%. Moreover, the performance impact of our scheme is much lower than that of using explicit checkpoint-rollback schemes to tackle voltage emergencies (7% versus 39% for a sensor delay of 20 cycles).

---

1 We chose crafty for this analysis because it was the worst SPEC benchmark, representing the worst-case results for our scheme.

Going forward, we are exploring mechanisms that limit the number of times the system rolls back and replays. This is based on the observation that a hard emergency inside a loop is likely to recur during re-executions, but can be avoided by altering the emergency-causing code portion. This presents us with an interesting trade-off between the cost to reschedule the code using hardware/dynamic software approaches and the savings from the reduced number of rollbacks.

References