# Computer Science 146 Computer Architecture

Fall 2019 Harvard University

Instructor: Prof. David Brooks dbrooks@eecs.harvard.edu

Lecture 9: Limits of ILP, Case Studies

Computer Science 146 David Brooks











# MIPS R10K: Register Map Table

| Instruction     | R1 | R2 | R3 | R4 | Renamed instruction |
|-----------------|----|----|----|----|---------------------|
| Initial mapping | P1 | P2 | P3 | P4 |                     |
| ADD R1, R2, R4  | P5 | P2 | P3 | P4 | ADD P5, P2, P4      |
| SUB R4, R1, R2  | P5 | P2 | P3 | P6 | SUB P6, P5, P2      |
| ADD R3, R1, R3  | P5 | P2 | P7 | P6 | ADD P7, P5, P3      |
| ADD R1, R3, R2  | P8 | P2 | P7 | P6 | ADD P8, P7, P2      |

Columns R1 to R4 show the map table contents after each instruction is renamed.
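The renaming steps in the table can be traced with a short sketch. This is a minimal illustration, not the R10K implementation: the function name `rename`, the tuple encoding of instructions, and the free-list order (P5, P6, ...) are assumptions made for the example, and the third instruction is read as `ADD R3, R1, R3`, which is what its renamed form `ADD P7, P5, P3` implies.

```python
from collections import deque

def rename(instructions, num_arch=4, num_phys=8):
    """Rename (op, dest, src1, src2) tuples with a register map table (illustrative)."""
    # Initial mapping R1->P1 ... R4->P4; the remaining physical registers are free.
    map_table = {f"R{i}": f"P{i}" for i in range(1, num_arch + 1)}
    free_list = deque(f"P{i}" for i in range(num_arch + 1, num_phys + 1))
    renamed = []
    for op, dest, src1, src2 in instructions:
        # Sources are looked up before the destination is remapped,
        # so an instruction like ADD R3, R1, R3 reads the old R3.
        p_src1, p_src2 = map_table[src1], map_table[src2]
        # The destination gets a fresh physical register from the free list.
        p_dest = free_list.popleft()
        map_table[dest] = p_dest
        renamed.append((op, p_dest, p_src1, p_src2))
    return renamed, map_table

program = [
    ("ADD", "R1", "R2", "R4"),
    ("SUB", "R4", "R1", "R2"),
    ("ADD", "R3", "R1", "R3"),
    ("ADD", "R1", "R3", "R2"),
]
renamed, final_map = rename(program)
for instr in renamed:
    print(instr)
```

With the initial mapping from the table, the renamed destinations come out as P5, P6, P7, P8 in order, and the final map is R1→P8, R2→P2, R3→P7, R4→P6.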





| Feature       | Pentium III                                              | MIPS R10K                       |
|---------------|----------------------------------------------------------|---------------------------------|
| Value Storage | Architectural Register File,<br>ROB, Reservation Station | Physical Register File          |
| Reg. Read     | On Issue, Write into RS                                  | On Execute, to FU               |
| Reg. Write    | On Commit, from ROB                                      | On Writeback from FU            |
| Reg. Free     | Instruction Commits                                      | Overwriting instruction retires |
| Precise State | Simple $\rightarrow$ Reset Structures                    | Complex $\rightarrow$ Checkpoints |
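The "Reg. Free" row for the R10K can be made concrete with a small sketch: a physical register is returned to the free list only when the instruction that overwrote its architectural register retires. The class `RenameUnit`, its method names, and the 4-architectural/8-physical register sizes are hypothetical choices for illustration, not details taken from the R10K.

```python
from collections import deque

class RenameUnit:
    """Illustrative map table + free list with free-on-retire of the old mapping."""

    def __init__(self, num_arch=4, num_phys=8):
        self.map_table = {f"R{i}": f"P{i}" for i in range(1, num_arch + 1)}
        self.free_list = deque(f"P{i}" for i in range(num_arch + 1, num_phys + 1))
        # Program-order queue of (new physical reg, previous physical reg).
        self.in_flight = deque()

    def rename_dest(self, arch_reg):
        # Remember the previous mapping; it may still be read by older
        # instructions, so it cannot be freed yet.
        old = self.map_table[arch_reg]
        new = self.free_list.popleft()
        self.map_table[arch_reg] = new
        self.in_flight.append((new, old))
        return new

    def retire_oldest(self):
        # When the overwriting instruction retires, no older instruction can
        # still need the previous mapping, so it goes back on the free list.
        _new, old = self.in_flight.popleft()
        self.free_list.append(old)
        return old

unit = RenameUnit()
first = unit.rename_dest("R1")    # R1: P1 -> P5
second = unit.rename_dest("R1")   # R1: P5 -> P6
freed_a = unit.retire_oldest()    # first writer retires, P1 is freed
freed_b = unit.retire_oldest()    # second writer retires, P5 is freed
```

Note that P5 is freed by the retirement of the second write to R1, not by the instruction that produced P5.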
















### Limits on ILP: Load/Store Disambiguation

- "Alias analysis" problem
  - How do we analyze dependencies through memory?
- Compiler solutions
  - Examine registers + base offsets to check for conflicts
- Hardware solutions
  - In-order loads/stores (slow!)
  - Loads in order with respect to stores, but not with respect to other loads
  - Loads issue out of order, clean up mis-speculations (complex)
  - Predictors to choose among the above policies
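The dependence check underlying these policies can be sketched as a byte-range overlap test. The sketch below shows one conservative rule, which is an assumption for illustration rather than a policy stated on the slide: a load may issue ahead of older stores only if every older store address is already known and none overlaps the load. The function name `load_may_issue` and the `(address, size)` encoding are hypothetical.

```python
def load_may_issue(load_addr, load_size, older_stores):
    """Return True if the load cannot alias any older store (conservative check).

    older_stores is a list of (addr, size) pairs for stores earlier in
    program order; addr is None while the store address is still unresolved.
    """
    for store_addr, store_size in older_stores:
        if store_addr is None:
            # Unknown address: assume it might alias and make the load wait.
            return False
        # Byte ranges [a, a + size) overlap iff each starts before the other ends.
        if load_addr < store_addr + store_size and store_addr < load_addr + load_size:
            return False
    return True

print(load_may_issue(0x100, 4, [(0x200, 4)]))   # disjoint ranges
print(load_may_issue(0x100, 4, [(0x102, 4)]))   # partial overlap
print(load_may_issue(0x100, 4, [(None, 4)]))    # unresolved store address
```

A speculative design would instead let the load issue in the unresolved case and recover if the store later turns out to overlap, which is the "cleanup mis-speculations" option in the list.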





# Dynamic Scheduling in P6

| Parameter                          | 80x86 | microops         |
|------------------------------------|-------|------------------|
| Max. instructions issued/clock     | 3     | 6                |
| Max. instr. complete exec./clock   |       | 5                |
| Max. instr. committed/clock        |       | 3                |
| Window (instrs in reorder buffer)  |       | 40               |
| Number of reservation stations     |       | 20               |
| Number of rename registers         |       | 40               |
| No. integer functional units (FUs) |       | 2                |
| No. floating point FUs             |       | 1                |
| No. SIMD Fl. Pt. FUs               |       | 1                |
| No. memory FUs                     |       | 1 load + 1 store |























### Pentium 4

- Still translates from 80x86 to micro-ops
- P4 has a better branch predictor and more FUs
- Instruction cache holds micro-operations rather than 80x86 instructions
  - No 80x86 decode stages on a cache hit ("Trace Cache")
- Faster memory bus: 400 MHz vs. 133 MHz
- Caches
  - Pentium III: L1I 16 KB, L1D 16 KB, L2 256 KB
  - Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
  - Block size: PIII 32B vs. P4 128B; 128 vs. 256 bits/clock
- Clock rates:
  - Pentium III 1 GHz vs. Pentium 4 1.5 GHz
  - 14-stage pipeline vs. 24-stage pipeline






# Pentium, Pentium Pro, P4 Pipeline

[Figure: pipeline stage diagrams for three microarchitectures]

- P5 microarchitecture (Pentium): Prefetch, Decode, Decode, Execute, Write-back
- P6 microarchitecture (Pentium Pro, II, III): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Execute
- Pentium 4: TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Schedule, Schedule, Dispatch, Dispatch, Reg File, Reg File, Execute, Flags, Branch Ck, Drive











