## Computer Science 146 Computer Architecture

Fall 2019 Harvard University

Instructor: Prof. David Brooks dbrooks@eecs.harvard.edu

Lecture 4: Basic Implementation and Pipelining























































## Structural Hazards Solutions

- Stall
  - Low Cost, Simple (+)
  - Increases CPI (-)
  - Try to use for rare events in high-performance CPUs
- Duplicate Resources
  - Decreases CPI (+)
  - Increases cost (area), possibly cycle time (-)
  - Use for cheap resources, frequent cases
    - Separate I-, D-caches, Separate ALU/PC adders, Reg File Ports

Computer Science 146 David Brooks

## Structural Hazards Solutions

- Pipeline Resources
  - High performance (+)
  - Control is simpler than duplication (+)
  - Tough to pipeline some things (RAMs) (-)
  - Use when frequency makes it worthwhile
  - Ex. Fully pipelined FP add/multiplies critical for scientific
- Good news
  - Structural hazards don't occur as long as each instruction uses a resource
    - At most once
    - Always in the same pipeline stage
    - For one cycle
  - RISC ISAs are designed with this in mind, reduces structural hazards





# RAW Example

| Cycle          | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8  |
|----------------|----|----|----|-----|-----|-----|-----|----|
| Add R3, R2, R1 | IF | ID | EX | MEM | WB  |     |     |    |
| Add R4, R3, R5 |    | IF | ID | EX  | MEM | WB  |     |    |
| Add R6, R3, R5 |    |    | IF | ID  | EX  | MEM | WB  |    |
| Add R7, R3, R5 |    |    |    | IF  | ID  | EX  | MEM | WB |

- First Add writes to R3 in cycle 5
- Second Add reads R3 in cycle 3
- Third Add reads R3 in cycle 4
  - We would compute the wrong answer because R3 holds the "old" value







## Forwarding, Bypassing

|                | -  | -  | -  | -   | -   | -   |     |    |
|----------------|----|----|----|-----|-----|-----|-----|----|
| Cycle          | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8  |
| Add R3, R2, R1 | IF | ID | EX | MEM | WB  |     |     |    |
| Add R4, R3, R5 |    | IF | ID | EX  | MEM | WB  |     |    |
| Add R6, R3, R5 |    |    | IF | ID  | EX  | MEM | WB  |    |
| Add R7, R3, R5 |    |    |    | IF  | ID  | EX  | MEM | WB |

- Code is now "stall-free"
- Are there any cases where we must stall?

Computer Science 146 David Brooks

## Load Use Hazards

| Cycle          | 1  | 2  | 3    | 4   | 5   | 6   | 7  | 8 |
|----------------|----|----|------|-----|-----|-----|----|---|
| lw R3, 10(R1)  | IF | ID | EX   | MEM | WB  |     |    |   |
| Add R4, R3, R5 |    | IF | ID ' | EX  | MEM | WB  |    |   |
| Add R6, R3, R5 |    |    | IF   | ID  | EX  | MEM | WB |   |

• Unfortunately, we can't forward "backward in time"

| Cycle          | 1  | 2  | 3  | 4     | 5  | 6   | 7   | 8  |
|----------------|----|----|----|-------|----|-----|-----|----|
| lw R3, 10(R1)  | IF | ID | EX | MEM   | WB |     |     |    |
| Add R4, R3, R5 |    | IF | ID | stall | EX | MEM | WB  |    |
| Add R6, R3, R5 |    |    | IF | stall | ID | EX  | MEM | WB |

### Load Use Hazards

- Can the compiler help out?
  - Scheduling to avoid load followed by immediate use
- "Delayed Loads"
  - Define the pipeline slot after a load to be a "delay slot"
  - NO interlock hardware. Machine assumes the correct compiler
- Compiler attempts to schedule code to fill delay slots
- Limits to this approach:
  - Only can reorder between branches (5-6 instructions)
  - Order of loads/stores difficult to swap (alias problems)
  - Makes part of implementation *architecturally visible*







# Control Hazards

| Cycle         | 1  | 2  | 3     | 4     | 5     | 6  | 7  | 8   |
|---------------|----|----|-------|-------|-------|----|----|-----|
| Branch Instr. | IF | ID | EX    | MEM   | WB    |    |    |     |
| Instr +1      |    | IF | stall | stall | IF    | ID | EX | MEM |
| Instr +2      |    |    | stall | stall | stall | IF | ID | EX  |

- In base pipeline, branch outcome not known until MEM
- Simple solution stall until outcome is known
- Length of control hazard is branch delay
  - In this simple case, it is 3 cycles (assume 10% cond. branches)
  - CPI<sub>Real</sub> = CPI<sub>Ideal</sub> + CPI<sub>Stall</sub> = 1.0 + 3 cycles \* .1 = 1.3

Computer Science 146 David Brooks

#### **Control Hazards: Solutions** Fast Branch Resolution • Performance penalty could be more than 30% - Deeper pipelines, some code is very branch heavy Fast Branch Resolution - Adder in ID for PC + immediate targets - Only works for simple conditions (compare to 0) - Comparing two register values could be too slow Cycle 2 3 4 6 7 8 1 5 IF ID ΕX Branch Instr. MEM WB Instr+1 IF ID WB stall ΕX MEM IF ID WB Instr +2 ΕX MEM stall Computer Science 146 David Brooks





|                                                    |    |         |         |          | azaro<br>ot Ta |            |         |   |  |  |
|----------------------------------------------------|----|---------|---------|----------|----------------|------------|---------|---|--|--|
| Cycle                                              | 1  | 2       | 3       | 4        | 5              | 6          | 7       | 8 |  |  |
| Untaken Branch                                     | IF | ID      | EX      | MEM      | WB             |            |         |   |  |  |
| Instr +1                                           |    | IF      | ID      | EX       | MEM            | WB         |         |   |  |  |
| Instr +2                                           |    |         | IF      | ID       | EX             | MEM        | WB      |   |  |  |
| Looks good if we're right!                         |    |         |         |          |                |            |         |   |  |  |
| Cycle                                              | ]  | Looks   | s good  | if we'r  | e right!       | 6          | 7       | 8 |  |  |
|                                                    |    |         |         | 1        |                | 6          | 7       | 8 |  |  |
| Faken Branch                                       | 1  | 2       | 3       | 4        | 5              | 6<br>flush | 7       | 8 |  |  |
| Cycle<br>Taken Branch<br>Instr +1<br>Branch Target | 1  | 2<br>ID | 3<br>EX | 4<br>MEM | 5<br>WB        |            | 7<br>WB | 8 |  |  |







# Interrupt Taxonomy

| Exception Type             | Sync vs.<br>Async | User Request<br>Vs. Coerced | User mask vs.<br>nommask | Within vs.<br>Between Insn | Resume vs.<br>terminate |
|----------------------------|-------------------|-----------------------------|--------------------------|----------------------------|-------------------------|
| I/O Device Req.            | Async             | Coerced                     | Nonmask                  | Between                    | Resume                  |
| Invoke O/S                 | Sync              | User                        | Nonmask                  | Between                    | Resume                  |
| Tracing Instructions       | Sync              | User                        | Maskable                 | Between                    | Resume                  |
| Breakpoint                 | Sync              | User                        | Maskable                 | Between                    | Resume                  |
| Arithmetic Overflow        | Sync              | Coerced                     | Maskable                 | Within                     | Resume                  |
| Page Fault (not in main m) | Sync              | Coerced                     | Nonmask                  | Within                     | Resume                  |
| Misaligned Memory          | Sync              | Coerced                     | Maskable                 | Within                     | Resume                  |
| Mem. Protection Violation  | Sync              | Coerced                     | Nonmask                  | Within                     | Resume                  |
| Using Undefined Insns      | Sync              | Coerced                     | Nonmask                  | Within                     | Terminate               |
| Hardware/Power Failure     | Async             | Coerced                     | Nonmask                  | Within                     | Terminate               |

Computer Science 146 David Brooks

#### Interrupts on Instruction Phases

| Exception Type                  | IF | ID | EXE | MEM | WB |
|---------------------------------|----|----|-----|-----|----|
| Arithmetic Overflow             |    |    | X   |     |    |
| Page Fault (not in main memory) | Х  |    |     | X   |    |
| Misaligned Memory               | Х  |    |     | X   |    |
| Mem. Protection Violation       | Х  |    |     | X   |    |

- Exceptions can occur on many different phases
- However, exceptions are only handled in WB
- Why?

| load<br>add | ÎF | ID<br>IF | EX<br>ID | MEM<br>EX                          | WB<br>MEM | WB |  |
|-------------|----|----------|----------|------------------------------------|-----------|----|--|
|             |    |          | Cor      | nputer Science 146<br>David Brooks | i         |    |  |

### How to take an exception?

- 1. Force a trap instruction on the next IF
- 2. Squash younger instructions (Turn off all writes (register/memory) for faulting instruction and all instructions that follow it)
- 3. Save all processor state after trap begins
  - PC-chain, PSW, Condition Codes, trap condition
  - PC-chain is length of the branch delay plus 1
- 4. Perform the trap/exception code then restart where we left off



# For next time

- Multi-cycle operations
  - More WAR, WAW nastiness
  - More precise interrupt nastiness
- SuperScalar/Dynamic Scheduling