#### Computer Science 146 Computer Architecture

Fall 2019 Harvard University

Instructor: Prof. David Brooks dbrooks@eecs.harvard.edu

Lecture 17: Main Memory

# Course Outline Revisited

| Week | Dates     | Lecture 1               | Lecture 2               | Reading      |
|-----|-----------|-------------------------|-------------------------|--------------|
| W1  | Feb 4     |                         | Introduction            | Ch. 1        |
| W2  | Feb 9/11  | Measuring Perf.         | ISA Design              | Ch. 2, A.1-3 |
| W3  | Feb 16/18 | Holiday/No Class        | Basic Pipelining        | A.4-11, Ch3  |
| W4  | Feb 23/25 | Multicycle/Scoreboard   | Tomasulo's Algorithm    | Ch 3         |
| W5  | Mar 1/3   | Branch Pred./Fetch      | Mult. Issue/Speculation | Ch 3         |
| W6  | Mar 8/10  | Processor Case Studies  | Static Issue            | Ch 3/4       |
| W7  | Mar 15/17 | Static ILP vs. HW ILP   | IA64 Study/Review       | Ch 4         |
| W8  | Mar 22/24 | IA64 Study/Review       | Midterm                 |              |
| W9  | Mar 29/31 | Spring                  | Break                   |              |
| W10 | Apr 5/7   | Caches                  | Caches                  | Ch 5         |
| W11 | Apr 12/14 | Caches                  | Main Memory             | Ch 5         |
| W12 | Apr 19/21 | Virtual Memory          | Shared Memory MPs       | Papers       |
| W13 | Apr 26/28 | Multithreading (SMT/MP) | Storage, I/O, Clusters  | Ch 6/7       |
| W14 | May 3/5   | Security Processors     | Network/GPU Processors  | Papers       |

















# Extended Data Out (EDO) DRAM












# Simple Main Memory

| Cycle | Addr | Mem | steady |
|-------|------|-----|--------|
| 1     | 12   | A   | *      |
| 2     |      | A   | *      |
| 3     |      | T/B | *      |
| 4     |      | B   | *      |
| 5     | 13   | A   | *      |
| 6     |      | A   | *      |
| 7     |      | T/B | *      |
| 8     |      | B   | *      |
| 9     | 14   | A   | *      |
| 10    |      | A   | *      |
| 11    |      | T/B | *      |
| 12    |      | B   | *      |
| 13    | 15   | A   | *      |
| 14    |      | A   | *      |
| 15    |      | T/B | *      |
| 16    |      | B   | *      |

- 4-word access = 15 cycles
- 4-word cycle = 16 cycles
- How to improve?
  - Lower latency?
    - A, B, T are fixed
  - Higher bandwidth?

# Bandwidth: Wider DRAMs

| Cycle | Addr | Mem | steady |
|-------|------|-----|--------|
| 1     | 12   | A   | *      |
| 2     |      | A   | *      |
| 3     |      | T/B | *      |
| 4     |      | B   | *      |
| 5     | 14   | A   | *      |
| 6     |      | A   | *      |
| 7     |      | T/B | *      |
| 8     |      | B   | *      |

- Use a 64-bit-wide DRAM instead (two 32-bit words per access)
- 4-word access = 7 cycles
- 4-word cycle = 8 cycles
- 64-bit buses are more expensive (Pentium vs. 486)
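
The cycle counts for the simple and the wider memory follow from the same arithmetic, which can be sketched as below. This is only an illustration under assumptions read off the tables: every DRAM access occupies four cycles (A, A, T/B, B) and its data is transferred in the third of them. The function names and the `bus_words` parameter are hypothetical, not part of the lecture.

```c
/* Cycles one DRAM access occupies in the tables: A, A, T/B, B. */
enum { ACCESS_SLOTS = 4, TRANSFER_SLOT = 3 };

/* Number of DRAM accesses needed to move `words` words when each
 * access delivers `bus_words` words (rounded up). */
int num_accesses(int words, int bus_words) {
    return (words + bus_words - 1) / bus_words;
}

/* Cycle in which the last word arrives (the "4-word access" figure). */
int block_access_cycles(int words, int bus_words) {
    int n = num_accesses(words, bus_words);
    return (n - 1) * ACCESS_SLOTS + TRANSFER_SLOT;
}

/* Cycles before the memory can begin another block
 * (the "4-word cycle" figure). */
int block_cycle_cycles(int words, int bus_words) {
    return num_accesses(words, bus_words) * ACCESS_SLOTS;
}
```

With `bus_words = 1` this gives 15 and 16 cycles for a 4-word block, and with `bus_words = 2` it gives 7 and 8, matching the two tables.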



# Simple Interleaving

| Cycle | Addr | Bank0 | Bank1 | Bank2 | Bank3 | steady |
|-------|------|-------|-------|-------|-------|--------|
| 1     | 12   | A     | A     | A     | A     |        |
| 2     |      | A     | A     | A     | A     |        |
| 3     |      | T/B   | B     | B     | B     | *      |
| 4     |      | B     | T/B   | B     | B     | *      |
| 5     |      |       |       | T     |       | *      |
| 6     |      |       |       |       | T     | *      |

- 4-word access = 6 cycles
- 4-word cycle = 4 cycles
  - Can start a new access in cycle 5
  - Overlap access with transfer (and still use a 32-bit bus!)



# Complex Interleaving

| Cycle | Addr | Bank0 | Bank1 | Bank2 | Bank3 | steady |
|-------|------|-------|-------|-------|-------|--------|
| 1     | 12   | A     |       |       |       |        |
| 2     | 13   | A     | A     |       |       |        |
| 3     | 14   | T/B   | A     | A     |       | *      |
| 4     | 15   | B     | T/B   | A     | A     | *      |
| 5     |      |       | B     | T/B   | A     | *      |
| 6     |      |       |       | B     | T/B   | *      |
| 7     |      |       |       |       | B     |        |

- 4-word access = 6 cycles
- 4-word cycle = 4 cycles
  - Same as simple interleaving


# Simple Interleaving (Non-Sequential)

| Cycle | Addr   | Bank0 | Bank1 | Bank2 | Bank3 | steady |
|-------|--------|-------|-------|-------|-------|--------|
| 1     | 12(15) | A     | A     | A     | A     | *      |
| 2     |        | A     | A     | A     | A     | *      |
| 3     |        | T/B   | B     | B     | B     | *      |
| 4     |        | B     | B     | B     | T/B   | *      |
| 5     | 18     | A     | A     | A     | A     | *      |
| 6     |        | A     | A     | A     | A     | *      |
| 7     |        | B     | B     | T/B   | B     | *      |
| 8     |        | B     | B     | B     | B     | *      |
| 9     | 21     | A     | A     | A     | A     | *      |
| 10    |        | A     | A     | A     | A     | *      |
| 11    |        | B     | T/B   | B     | B     | *      |
| 12    |        | B     | B     | B     | B     | *      |

- Non-sequential access, e.g. stride = 3
- 4-word access = 4-word cycle = 12 cycles

# Complex Interleaving (Non-Sequential)

| Cycle | Addr | Bank0 | Bank1 | Bank2 | Bank3 | steady |
|-------|------|-------|-------|-------|-------|--------|
| 1     | 12   | A     |       |       |       | *      |
| 2     | 15   | A     |       |       | A     | *      |
| 3     | 18   | T/B   |       | A     | A     | *      |
| 4     | 21   | B     | A     | A     | T/B   | *      |
| 5     |      |       | A     | T/B   | B     |        |
| 6     |      |       | T/B   | B     |       |        |

- 4-word access = 6 cycles
- 4-word cycle = 4 cycles
- DMA (I/O) and multiprocessor accesses are non-sequential
- Want more banks than words in a cache line
  - Multiple cache misses in parallel (non-blocking caches)


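#### Interleaving Problem

| Cycle | Addr | Bank0 | Bank1 | Bank2 | Bank3 | steady |
|-------|------|-------|-------|-------|-------|--------|
| 1     | 12   | A     |       |       |       | *      |
| 2     |      | A     |       |       |       | *      |
| 3     |      | T/B   |       |       |       | *      |
| 4     |      | B     |       |       |       | *      |
| 5     | 20   | A     |       |       |       | *      |
| 6     |      | A     |       |       |       | *      |
| 7     |      | T/B   |       |       |       | *      |
| 8     |      | B     |       |       |       | *      |

- Powers of 2 strides are a problem: all addresses map to the same bank
- 4-word access = 15 cycles, 4-word cycle = 16 cycles
- Solution: use a prime number of banks (e.g. 17)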
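
The effect of stride on an independently addressed (complex-interleaved) memory can be checked with a small simulation. The sketch below is an illustration, not the lecture's own model. It assumes one address is issued per cycle, a bank is occupied for four cycles per access (A, A, T/B, B), the word is transferred two cycles after its access starts, and the bank is chosen as address mod number of banks. The function name and the constants are hypothetical.

```c
#define MAX_BANKS  64  /* assumed upper bound on nbanks for this sketch */
#define BANK_BUSY  4   /* A, A, T/B, B: cycles a bank is occupied per access */
#define XFER_DELAY 2   /* word is transferred 2 cycles after its access starts */

/* Returns the cycle in which the last of `count` words is transferred.
 * Addresses start, start+stride, ... are issued in order, at most one per
 * cycle; a request to a busy bank waits until that bank is free.
 * Requires 0 < nbanks <= MAX_BANKS and non-negative addresses. */
int strided_access_cycles(int start, int stride, int count, int nbanks) {
    int free_at[MAX_BANKS];  /* first cycle each bank can accept an access */
    int prev_issue = 0;      /* cycle in which the previous address issued */
    int last_xfer = 0;

    for (int b = 0; b < nbanks; b++)
        free_at[b] = 1;

    for (int k = 0; k < count; k++) {
        int bank = (start + k * stride) % nbanks;
        int t = prev_issue + 1;          /* earliest issue slot */
        if (free_at[bank] > t)
            t = free_at[bank];           /* stall on a bank conflict */
        free_at[bank] = t + BANK_BUSY;
        last_xfer = t + XFER_DELAY;
        prev_issue = t;
    }
    return last_xfer;
}
```

Under these assumptions, four words starting at address 12 take 6 cycles with stride 1 or stride 3 on 4 banks, 15 cycles with stride 8 on 4 banks (every address lands in bank 0), and 6 cycles again with stride 8 on 17 banks.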

#### **Avoiding Bank Conflicts**

- Lots of banks

```c
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
```

- Even with 128 banks, word accesses conflict: the inner loop walks down a column with a stride of 512 words, and 512 is a multiple of 128, so every access maps to the same bank
- SW: loop interchange, or declare the array with a dimension that is not a power of 2 ("array padding")
- HW: add more banks, or use a prime number of banks
  - bank number = address mod number of banks
  - address within bank = address / number of banks
  - modulo & divide per memory access with a prime number of banks?
  - instead: address within bank = address mod number of words in bank
  - bank number? easy if $2^N$ words per bank
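
The two software fixes can be sketched as follows. This is a minimal illustration that assumes word-interleaved banks selected by word address mod number of banks. The function names and the helper `bank_of` are hypothetical, and the padded row length of 513 is just one possible choice.

```c
#define ROWS 256
#define COLS 512

int x[ROWS][COLS];

/* Loop order from the slide: the inner loop steps down a column, so
 * consecutive accesses are COLS (512) words apart. */
void scale_column_order(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* Loop interchange: the inner loop steps along a row, so consecutive
 * accesses are 1 word apart and fall in successive banks. */
void scale_row_order(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Bank touched by element [i][j] of a row-major array whose rows hold
 * `row_words` words (assumed mapping: word address mod nbanks). */
int bank_of(int i, int j, int row_words, int nbanks) {
    return (i * row_words + j) % nbanks;
}
```

With 128 banks, `bank_of(0, 0, 512, 128)` and `bank_of(1, 0, 512, 128)` are both 0, so the column walk keeps hitting one bank. Padding each row to 513 words moves the next row's first element to bank 1, which is the effect array padding aims for.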

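
The prime-bank mapping replaces the divide with a second remainder. That this still gives every address its own location follows from the Chinese Remainder Theorem, provided the number of banks and the number of words per bank share no common factor (for example, a prime number of banks and $2^N$ words per bank). The sketch below checks that property by brute force; the function names and the `MAX_WORDS` limit are hypothetical and chosen only for illustration.

```c
/* Prime-bank mapping: both fields are plain remainders, no divide. */
int bank_number(int addr, int nbanks) {
    return addr % nbanks;
}

int addr_within_bank(int addr, int words_per_bank) {
    return addr % words_per_bank;
}

/* Returns 1 if every address in [0, nbanks * words_per_bank) maps to a
 * distinct (bank, offset) pair, 0 if two addresses collide or the memory
 * is larger than this sketch's table. */
int mapping_is_unique(int nbanks, int words_per_bank) {
    enum { MAX_WORDS = 4096 };           /* assumed limit for the check */
    static unsigned char used[MAX_WORDS];
    int total = nbanks * words_per_bank;

    if (total > MAX_WORDS)
        return 0;
    for (int k = 0; k < total; k++)
        used[k] = 0;

    for (int a = 0; a < total; a++) {
        int slot = bank_number(a, nbanks) * words_per_bank
                 + addr_within_bank(a, words_per_bank);
        if (used[slot])
            return 0;                    /* two addresses share a location */
        used[slot] = 1;
    }
    return 1;
}
```

For 3 banks of 8 words, or 17 banks of 16 words, the check succeeds. For 4 banks of 8 words it fails, because addresses 0 and 8 both map to bank 0, offset 0.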


#### Independent Memory Banks

#### **DIMM Modules**



- DIMM (Dual In-line Memory Module) Configuration
- Banking occurs at the chip, module, and system levels
- 1 Rank of devices responds to each access
  - All devices respond similarly
- Single-Sided DIMM
  - 4 banks per device => DIMM has 4 banks
  - 512MB DIMM = 8x64Mx8, 4 Banks
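
The 512MB figure can be checked with a line of arithmetic, reading "8x64Mx8" as 8 devices, each organised as 64M locations of 8 bits. That reading is an assumption, as are the function names below.

```c
/* Capacity in bytes of a module built from `devices` chips, each organised
 * as `depth` locations of `width_bits` bits. */
unsigned long long dimm_capacity_bytes(unsigned devices,
                                       unsigned long long depth,
                                       unsigned width_bits) {
    return (unsigned long long)devices * depth * width_bits / 8;
}

/* Data width in bits when every device in the rank responds together. */
unsigned dimm_data_width_bits(unsigned devices, unsigned width_bits) {
    return devices * width_bits;
}
```

With 8 devices of 64M x 8 bits, the capacity comes to 512MB and the eight devices together supply a 64-bit data path.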









#### **Interleaving Summary**

- Banks
  - Method to get high bandwidth with cheap (narrow) bus
- Bandwidth determines memory capacity
  - Hard to make many banks from narrow DIMMs
  - 32, 64-bit banks from 1x64MB DRAMS => 2048 DIMMS => 4GB
  - Can't force customers to buy so much memory to get good bandwidth
  - Must use wider DRAMs
  - RAMBUS does better with small memory systems (PS2)
  - Big servers have lots of memory so traditional banking works

