# Computer Science 146 Computer Architecture

Fall 2019 Harvard University

Instructor: Prof. David Brooks dbrooks@eecs.harvard.edu

Lecture 19: Multiprocessors















# Database Application Example

- Bank Database
- Parameters:
  - D = number of accounts
  - P = number of processors in server
  - N = number of ATMs (parallel transactions)
- Growth Functions
  - Computation: f(N)
  - Computation per processor: F(N/P)
  - Communication? Lock records while changing them
  - Communication: f(N)
  - Computation/communication: f(1)
  - No serial computation



# SIMD vs. MIMD

- Can you think of a commercial SIMD system?
- MP vs. SIMD
  - Programming model flexibility
    - Could simulate vectors with MIMD but not the other way
    - · Dominant market segment cannot use vectors
  - Cost effectiveness
    - Commodity Parts: high volume (cheap) components
    - MPs can be made up of many uniprocessors
    - Allows easy scalability from small to large
  - Vectors making some noise lately (Berkeley, NEC)

Computer Science 146 David Brooks

#### Taxonomy of Parallel (MIMD) **Processors** Center on organization of main memory - Shared vs. Distributed Appearance of memory to hardware - Q1: Memory access latency uniform? - Shared (UMA): yes, doesn't matter where data goes - Distributed (NUMA): no, makes a big difference Appearance of memory to *software* - Q2: Can processors communicate directly via memory? - Shared (shared memory): yes, communicate via load/store Distributed (message passing): no, communicate via messages • Dimensions are orthogonal - e.g. DSM: (physically) distributed, (logically) shared memory Computer Science 146 David Brooks





- Perfect (single-cycle) memory latency
- Perfect (infinite) memory bandwidth
- Real systems:
  - Latencies are long and grow with system size
  - Bandwidth is limited
    - Add memory banks, interconnect to hook up (latency goes up)















# Interconnect Routing

- Store-and-Forward Routing
  - Switch buffers entire message before passing it on
  - Latency = [(message length / bandwidth) + fixed switch overhead] \* #hops
- Wormhole Routing
  - Pipeline message through interconnect
  - Switch passes message on before completely arrives
  - Latency = (message length / bandwidth) + (fixed switch overhead \* #hops)
  - No buffering needed at switch
  - Latency (relative) independent of number of intermediate hops



## Shared Memory vs. Message Passing

- *Shared Memory* (multiprocessors)
  - One shared address space
  - Processors use conventional load/stores to access shared data
  - Communication can be complex/dynamic
  - Simpler programming model (compatible with uniprocessors)
  - Hardware controlled caching is useful to reduce latency + contention
  - Has drawbacks
    - Synchronization (discussed later)
    - More complex hardware needed

Computer Science 146 David Brooks

# Parallel Systems (80s and 90s) Machine Communication Interconnect #cpus Remote latency (us) SPARCcenter Shared memory Bus <=20</td> 1

| Machine       | Communication | Interconnect | #cpus   | Remote latency (us) |
|---------------|---------------|--------------|---------|---------------------|
| SPARCcenter   | Shared memory | Bus          | <=20    | 1                   |
| SGI Challenge | Shared memory | Bus          | <=32    | 1                   |
| CRAY T3D      | Shared memory | 3D Torus     | 64-1024 | 1                   |
| Convex SPP    | Shared memory | X-bar/ring   | 8-64    | 2                   |
| KSR-1         | Shared memory | Bus/ring     | 32      | 2-6                 |
| TMC CM-5      | Messages      | Fat tree     | 64-1024 | 10                  |
| Intel Paragon | Messages      | 2D mesh      | 32-2048 | 10-30               |
| IBM SP-2      | Messages      | Multistage   | 32-256  | 30-100              |

• Trend towards shared memory systems

# Multiprocessor Trends

- Shared Memory
  - Easier, more dynamic programming model
  - Can do more to optimize the hardware
- Small-to-medium size UMA systems (2-8 processors)
  - Processor + memory + switch on single board (4x pentium)
  - Single-chip multiprocessors (POWER4)
  - Commodity parts soon glueless MP systems
- Larger NUMAs built from smaller UMAs
  - Use commodity small UMAs with commodity interconnects (ethernet, myrinet)
  - NUMA clusters

Computer Science 146 David Brooks

#### Major issues for Shared Memory Cache coherence "Common Sense" • P1 Read[X] => P1 Write[X] => P1 Read[X] will return X • P1 Write[X] => P2 Read[X] => will return value written by P1 • P1 Write[X] => P2 Write[X] => Serialized (all processor see the writes in the same order) Synchronization - Atomic read/write operations Memory consistency Model - When will a written value be seen? - P1 Write[X] (10ps later) P2 Read[X] what happens? These are not issues for message passing systems - Why? Computer Science 146 David Brooks

# Cache Coherence

- Benefits of coherent caches in parallel systems?
  - Migration and Replication of shared data
  - Migration: data moved locally lowers latency + main memory bandwidth
  - Replication: data being simultaneously read can be replicated to reduce latency + contention
- Problem: sharing of writeable data

| Processor 0 | Processor 1 | Correct value of A in:              |  |  |
|-------------|-------------|-------------------------------------|--|--|
|             |             | Memory                              |  |  |
| Read A      |             | Memory, p0 cache                    |  |  |
|             | Read a      | Memory, p0 cache, p1 cache          |  |  |
| Write A     |             | P0 cache, memory (if write-through) |  |  |
|             | Read A      | P1 gets stale value on hit          |  |  |



# HW Coherence Protocols

• Absolute coherence

٠

- All copies of each block have same data at all times
- A little bit overkill...
- Need *appearance* of absolute coherence
  - Temporary incoherence is ok
    - Similar to write-back cache
    - As long as all loads get their correct value
- Coherence protocol: FSM that runs at every cache
  - Invalidate protocols: invalidate copies in other caches
  - Update protocols: update copies in other caches
  - Snooping vs. Directory based protocols (HW implementation)
  - Memory is always updated

Computer Science 146 David Brooks

#### Write Invalidate • Much more common for most systems Mechanics - Broadcast address of cache line to invalidate - All processor snoop until they see it, then invalidate if in local cache - Same policy can be used to service cache misses in write-back caches Processor Activity Bus Activity Contents of Memory Contents of Contents of CPU A's cache CPU B's cache Location X 0 CPU A reads X Cache miss for X 0 0 CPU B reads X Cache miss for X 0 0 0 CPU A writes 1 to X Invalidation miss for X 1 0 CPU B reads X 1 Cache miss for X 1 1 Computer Science 146 David Brooks

# Write Update (Broadcast)

| Processor Activity  | Bus Activity          | Contents of<br>CPU A's cache | Contents of<br>CPU B's cache | Contents of Memory<br>Location X |  |  |  |
|---------------------|-----------------------|------------------------------|------------------------------|----------------------------------|--|--|--|
|                     |                       |                              |                              | 0                                |  |  |  |
| CPU A reads X       | Cache miss for X      | 0                            |                              | 0                                |  |  |  |
| CPU B reads X       | Cache miss for X      | 0                            | 0                            | 0                                |  |  |  |
| CPU A writes 1 to X | Write Broadcast for X | 1                            | 1                            | 1                                |  |  |  |
| CPU B reads X       |                       | 1                            | 1                            | 1                                |  |  |  |

- · Bandwidth requirements are excessive
  - Can reduce by checking if word is shared (also can reduce write-invalidate traffic)
- Comparison with Write invalidate
  - Multiple writes, no interveaning reads require multiple broadcasts
  - Multiple broadcasts needed for multiple word cache line writes (only 1 invalidate)
  - Advantage: other processors can get the data faster

Computer Science 146 David Brooks

# <section-header><section-header><list-item><list-item><list-item><list-item><list-item><list-item><list-item><list-item><list-item>





## Next Lecture

- More multiprocessors
  - Example of FSM
  - Directory based systems
  - Synchronization
  - Consistency
- Multithreading
  - In Extra Readings Intel's paper "Hyper-threading Technology Architecture and Microarchitecture"