MULTITHREADING
Jul 15, 2015
What is ILP?
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.
There are two approaches to exploiting instruction-level parallelism: hardware and software.
Example
Consider the following program:

1. e = a + b
2. f = c + d
3. m = e * f

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
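To make the arithmetic concrete, here is a minimal sketch in Python (the dependency graph is written out by hand) that computes the ILP of this example as the number of operations divided by the critical-path length:

    from functools import lru_cache

    # Dependency graph for the example: op -> ops whose results it needs.
    deps = {
        "e = a + b": (),
        "f = c + d": (),
        "m = e * f": ("e = a + b", "f = c + d"),
    }

    @lru_cache(maxsize=None)
    def level(op):
        """Earliest time step at which op can execute (unit-latency ops)."""
        return 1 + max((level(d) for d in deps[op]), default=0)

    critical_path = max(level(op) for op in deps)
    print(f"ILP = {len(deps)}/{critical_path} = {len(deps) / critical_path}")
    # -> ILP = 3/2 = 1.5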
ILP is application specific
A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible. Ordinary programs are typically written under a sequential execution model where instructions execute one after the other and in the order specified by the programmer.
ILP allows the compiler and the processor to overlap the execution of multiple instructions or even to change the order in which instructions are executed.
How much ILP exists in programs is very application specific.
In certain fields, such as graphics and scientific computing, the amount can be very large. However, workloads such as cryptography may exhibit much less parallelism.
ILP Techniques
Micro-architectural techniques that are used to exploit ILP include:
Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
Superscalar execution, VLIW, and the closely related explicitly parallel instruction computing concepts, in which multiple execution units are used to execute multiple instructions in parallel.
Out-of-order execution, where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution. Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs. An alternative is to extract this parallelism at compile time and somehow convey this information to the hardware.
Register renaming, a technique used to avoid the unnecessary serialization of program operations imposed by the reuse of registers by those operations; it is used to enable out-of-order execution.
Speculative execution, which allows the execution of complete instructions or parts of instructions before it is certain whether this execution should take place. A commonly used form of speculative execution is control-flow speculation, where instructions past a control-flow instruction (e.g., a branch) are executed before the target of the control-flow instruction is determined. Several other forms of speculative execution have been proposed and are in use, including speculative execution driven by value prediction, memory dependence prediction, and cache latency prediction.
Branch prediction, which is used to avoid stalling while control dependencies are resolved. Branch prediction is used with speculative execution; a minimal predictor sketch follows below.
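The list above does not name a particular prediction scheme, so as one illustrative example, here is a minimal sketch of a classic 2-bit saturating-counter predictor:

    # Minimal sketch of a 2-bit saturating-counter branch predictor.
    # States 0-1 predict not-taken, states 2-3 predict taken; each outcome
    # nudges the counter one step toward the observed behavior.
    class TwoBitPredictor:
        def __init__(self):
            self.counters = {}  # branch PC -> counter state (0..3)

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2  # True = predict taken

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

    pred = TwoBitPredictor()
    outcomes = [True, True, False, True, True]  # a loop branch with one exit
    hits = 0
    for taken in outcomes:
        hits += pred.predict(0x400) == taken
        pred.update(0x400, taken)
    print(f"{hits}/{len(outcomes)} correct")

The 2-bit counter tolerates a single anomalous outcome (such as a loop exit) without flipping its prediction, which is why it predicts better than a 1-bit scheme on loop-heavy code.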
Register Renaming
# Instruction
1 R1 = M[1024]
2 R1 = R1 + 2
3 M[1032] = R1
4 R1 = M[2048]
5 R1 = R1 + 4
6 M[2056] = R1

Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor cannot finish instruction 4 until instruction 3 is done, because instruction 3 would then write the wrong value.
We can eliminate this restriction by changing the names of some of the registers:
# Instruction      # Instruction
1 R1 = M[1024]     4 R2 = M[2048]
2 R1 = R1 + 2      5 R2 = R2 + 4
3 M[1032] = R1     6 M[2056] = R2
Now instructions 4, 5, and 6 can be executed in parallel with instructions 1, 2, and 3, so that the program can be executed faster.
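The renaming step itself is mechanical, which is why hardware can do it on the fly. Below is a minimal sketch of the idea in Python; it is a simplified compile-time view (real hardware renames to a finite physical register file and recycles names):

    # Minimal register-renaming sketch: every write to an architectural
    # register gets a fresh name; reads use the latest name for that register.
    def rename(instructions):
        latest = {}   # architectural register -> its current fresh name
        version = {}  # architectural register -> version counter
        out = []
        for dst, srcs in instructions:
            srcs = [latest.get(s, s) for s in srcs]   # read the newest names
            if dst.startswith("R"):                   # rename register writes
                version[dst] = version.get(dst, 0) + 1
                latest[dst] = f"{dst}v{version[dst]}"
                dst = latest[dst]
            out.append((dst, srcs))
        return out

    # The slide's example as (destination, [sources]); M[...] stands for memory.
    prog = [
        ("R1", ["M[1024]"]), ("R1", ["R1", "2"]), ("M[1032]", ["R1"]),
        ("R1", ["M[2048]"]), ("R1", ["R1", "4"]), ("M[2056]", ["R1"]),
    ]
    for dst, srcs in rename(prog):
        print(dst, "<-", srcs)

Running this shows instructions 4-6 using R1v3/R1v4 while 1-3 use R1v1/R1v2, so the false dependence through R1 is gone.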
Multithreading
It is difficult to continue to extract ILP from a single thread.
Many workloads can make use of thread-level parallelism (TLP):
- TLP from multiprogramming (run independent sequential jobs)
- TLP from multithreaded applications (run one job faster using parallel threads)
Multithreading uses TLP to improve utilization of a single processor.
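As a high-level illustration of TLP (not of hardware multithreading itself), the sketch below runs independent jobs on parallel threads with Python's standard library; the job function is invented for illustration:

    # Minimal TLP sketch: run independent jobs on a pool of threads.
    # (In CPython the GIL limits CPU-bound speedup, but I/O-bound or
    # native-code workloads do overlap; the structure is what matters here.)
    from concurrent.futures import ThreadPoolExecutor
    import time

    def job(n):                      # hypothetical unit of work
        time.sleep(0.1)              # stand-in for a long-latency operation
        return n * n

    with ThreadPoolExecutor(max_workers=4) as pool:
        start = time.perf_counter()
        results = list(pool.map(job, range(8)))
        elapsed = time.perf_counter() - start

    print(results)                   # [0, 1, 4, 9, 16, 25, 36, 49]
    print(f"{elapsed:.2f}s for 8 jobs of 0.1s each on 4 threads")  # ~0.2s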
Pipeline Hazards
Each instruction may depend on the next
LW   r1, 0(r2)
LW   r5, 12(r1)
ADDI r5, r5, #12
SW   12(r1), r5
[Pipeline diagram: F D X M W stages over cycles t0-t14; each dependent instruction stalls in decode until the previous instruction's write-back completes, inserting bubbles.]
What can be done to cope with this?
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave the execution of instructions from different program threads on the same pipeline
T1: LW   r1, 0(r2)
T2: ADD  r7, r1, r4
T3: XORI r5, r4, #12
T4: SW   0(r7), r5
T1: LW   r5, 12(r1)

[Pipeline diagram: F D X M W stages over cycles t0-t9; instructions from threads T1-T4 proceed back-to-back with no stalls.]
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe.
A prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.
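That timing guarantee is easy to check: with T threads in fixed round-robin order, an instruction fetched at cycle c writes back at c+4, and the same thread's next instruction decodes at c+T+1, which must come later. A minimal sketch:

    # Minimal sketch: fixed round-robin interleave of T threads on a
    # non-bypassed 5-stage pipeline (F D X M W, one stage per cycle).
    # A same-thread instruction reads registers in D; the prior one writes
    # them in W. Check that D of instruction i+1 comes after W of i.
    STAGES = ["F", "D", "X", "M", "W"]

    def hazard_free(num_threads):
        fetch0, fetch1 = 0, num_threads        # consecutive same-thread fetches
        writeback = fetch0 + STAGES.index("W") # cycle of first instr's W
        decode = fetch1 + STAGES.index("D")    # cycle of next instr's D
        return decode > writeback

    for t in range(1, 6):
        print(f"{t} thread(s): {'no stalls' if hazard_free(t) else 'must stall'}")
    # 1-3 threads still need interlocks; 4 or more never see the hazard.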
CDC 6600 Peripheral Processors (Cray, 1964)
- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on a simple pipeline
- Pipeline has a 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage.
Appears to software (including the OS) as multiple, albeit slower, CPUs.
[Block diagram: a 2-bit thread-select signal chooses among four per-thread PCs and four per-thread GPR files; the instruction cache (I$), execute stages (X, Y), and data cache (D$) are shared, with the thread ID carried down the pipeline.]
Multithreading Costs
Each thread requires its own user state:
- PC (program counter)
- GPRs (general-purpose registers)
It also needs its own system state:
- virtual-memory page-table base register
- exception-handling registers
Other costs?
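A rough inventory of the replicated state as a data structure (field names and sizes are illustrative, not tied to any particular ISA):

    # Sketch of the state a hardware thread context must replicate.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ThreadContext:
        # User state, one copy per thread:
        pc: int = 0                                   # program counter
        gprs: List[int] = field(default_factory=lambda: [0] * 32)
        # System state, also per thread:
        page_table_base: int = 0                      # VM page-table base register
        exception_regs: List[int] = field(default_factory=lambda: [0] * 4)

    contexts = [ThreadContext(pc=0x1000 * t) for t in range(4)]  # 4 hw threads
    print(len(contexts), "contexts,", len(contexts[0].gprs), "GPRs each")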
Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1964)
- Each of N threads executes one instruction every N cycles
- If a thread is not ready to go in its slot, insert a pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971)
- OS allocates S pipeline slots amongst N threads
- Hardware performs a fixed interleave over the S slots, executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982)
- Hardware keeps track of which threads are ready to go
- Picks the next thread to execute based on a hardware priority scheme (compare the first and last policies in the sketch below)
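The fixed and hardware-controlled policies differ directly in how many bubbles they insert. A minimal sketch with an invented ready/stall model:

    import random
    random.seed(1)

    # ready(t): whether thread t can issue this cycle (invented model).
    def ready(thread):
        return random.random() > 0.3   # each thread stalled ~30% of the time

    def fixed_interleave(num_threads, cycles):
        bubbles = 0
        for c in range(cycles):
            slot_owner = c % num_threads          # thread's reserved slot
            if not ready(slot_owner):
                bubbles += 1                      # slot wasted: pipeline bubble
        return bubbles

    def hardware_scheduled(num_threads, cycles):
        bubbles = 0
        for c in range(cycles):
            runnable = [t for t in range(num_threads) if ready(t)]
            if not runnable:
                bubbles += 1                      # stalls only if NO thread ready
        return bubbles

    N, C = 4, 10_000
    print("fixed interleave bubbles:  ", fixed_interleave(N, C))
    print("hardware-scheduled bubbles:", hardware_scheduled(N, C))
    # Expect ~30% bubbles for fixed interleave, but only ~0.3^4 = 0.8% when
    # the hardware may pick any ready thread.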
Denelcor HEP (Burton Smith, 1982)
- First commercial machine to use hardware threading in the main CPU
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)
Tera MTA (1990-97)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
- No data cache
- Sustains one main memory access per cycle per processor
MTA Architecture
Each processor supports 128 active hardware threads:
- 1 x 128 = 128 stream status word (SSW) registers
- 8 x 128 = 1024 branch-target registers
- 32 x 128 = 4096 general-purpose registers
Three operations are packed into each 64-bit instruction (short VLIW):
- one memory operation,
- one arithmetic operation, plus
- one arithmetic or branch operation
Thread creation and termination instructions
An explicit 3-bit "lookahead" field in each instruction gives the number of subsequent instructions (0-7) that are independent of this one (cf. instruction grouping in VLIW):
- allows fewer threads to fill the machine pipeline
- used for variable-sized branch delay slots
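To make the lookahead idea concrete, here is a sketch of packing and reading such a field; the bit layout is invented for illustration and is not the MTA's actual encoding:

    # Sketch: a 3-bit "lookahead" field packed into a 64-bit instruction word.
    # Bit positions here are invented, not the MTA's real layout.
    LOOKAHEAD_SHIFT = 61            # put the field in the top 3 bits

    def pack(ops_bits, lookahead):
        assert 0 <= lookahead <= 7  # 3 bits: 0-7 independent successors
        return (lookahead << LOOKAHEAD_SHIFT) | (ops_bits & ((1 << 61) - 1))

    def lookahead_of(word):
        return word >> LOOKAHEAD_SHIFT

    word = pack(ops_bits=0xABCDEF, lookahead=5)
    # Issue logic may overlap this instruction with its next 5 successors
    # without checking dependences, so fewer threads are needed to fill the pipe.
    print(lookahead_of(word))  # -> 5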
Coarse-Grain Multithreading
The Tera MTA was designed for supercomputing applications with large data sets and low locality:
- No data cache
- Many parallel threads needed to hide the large memory latency
Other applications are more cache friendly:
- Few pipeline bubbles when the cache is getting hits
- Just add a few threads to hide occasional cache-miss latencies
- Swap threads on cache misses
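A minimal sketch of the swap-on-miss policy; the latencies and miss rate are invented numbers, and it optimistically assumes the other thread always has enough work to hide a miss:

    import random

    MISS_LATENCY = 50   # cycles to service a cache miss (invented)
    SWITCH_COST = 4     # pipeline-flush penalty on a thread switch (invented)
    MISS_RATE = 0.02    # per-instruction miss probability (invented)

    def run(num_threads, cycles, swap_on_miss):
        """Count retired instructions; assumes a swapped-out thread's miss
        is fully serviced while the other thread runs."""
        random.seed(0)
        current, retired, stall_until = 0, 0, 0
        for c in range(cycles):
            if c < stall_until:
                continue                       # stalled: memory wait or flush
            if random.random() < MISS_RATE:    # this instruction misses
                if swap_on_miss and num_threads > 1:
                    current = (current + 1) % num_threads
                    stall_until = c + SWITCH_COST
                else:
                    stall_until = c + MISS_LATENCY
            else:
                retired += 1
        return retired

    print("1 thread, no swapping:  ", run(1, 10_000, swap_on_miss=False))
    print("2 threads, swap on miss:", run(2, 10_000, swap_on_miss=True))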
MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss
IBM PowerPC RS64-IV (2000)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On an L2 cache miss, the pipeline is flushed and execution switches to the second thread:
- the short pipeline minimizes the flush penalty (4 cycles), small compared to the memory-access latency
- the pipeline is flushed to simplify exception handling
For most apps, most execution units lie idle in an OoO superscalar
[Figure: issue-slot utilization for an 8-way superscalar. From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]
Simultaneous Multithreading (SMT) for OoO Superscalars
The techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time.
SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution in the same clock cycle. This gives better utilization of machine resources.
What is Superscalar?
A superscalar CPU architecture implements instruction-level parallelism within a single processor.
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.
SMT vs Superscalar
Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading.
Superscalar, by contrast, is a CPU architecture that can execute more than one instruction in a given clock cycle.
SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
From Superscalar to SMT
SMT is an out-of-order superscalar extended with hardware to support multiple executing threads
Small items:
- per-thread program counters
- per-thread return stacks
- per-thread bookkeeping for instruction retirement, trap handling, and instruction-dispatch-queue flush
- thread identifiers, e.g., with BTB and TLB entries
Superscalar Machine Efficiency
[Figure: issue slots (issue width 4) plotted against time; a completely idle cycle is "vertical waste," and a partially filled cycle (IPC < 4) is "horizontal waste."]
Vertical Multithreading
What is the effect of cycle-by-cycle interleaving? It removes vertical waste, but leaves some horizontal waste.
[Figure: issue slots against time with a second thread interleaved cycle-by-cycle; idle cycles are filled, but partially filled cycles (IPC < 4) remain.]
Chip Multiprocessing (CMP)
What is the effect of splitting into multiple processors? It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread.
[Figure: two narrower processors side by side, each with its own issue slots over time.]
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
Interleave multiple threads into multiple issue slots with no restrictions.
[Figure: instructions from several threads fill every issue slot in every cycle.]
O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
Utilize the wide out-of-order superscalar processor's issue queue to find instructions to issue from multiple threads
The OoO instruction window already has most of the circuitry required to schedule from multiple threads
Any single thread can utilize the whole machine
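The payoff can be sketched with a toy issue model: each cycle, every thread offers some number of issuable instructions (the distribution below is invented) and the machine has four slots. Vertical multithreading gives the whole cycle to one thread; SMT fills slots from all threads:

    import random
    random.seed(2)

    WIDTH, CYCLES, THREADS = 4, 10_000, 4

    def offered():
        # Instructions a thread could issue this cycle (invented distribution):
        # often 0-2 due to dependences and misses, occasionally more.
        return random.choice([0, 0, 1, 1, 2, 3, 4])

    vert = smt = 0
    for c in range(CYCLES):
        ready = [offered() for _ in range(THREADS)]
        vert += min(WIDTH, ready[c % THREADS])  # one thread owns the whole cycle
        smt += min(WIDTH, sum(ready))           # slots filled from all threads

    print(f"vertical MT IPC: {vert / CYCLES:.2f}")
    print(f"SMT IPC:         {smt / CYCLES:.2f}")
    # SMT converts horizontal waste (unused slots within a cycle) into work.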
Power 4
Single-threaded predecessor to the Power 5. Eight execution units in an out-of-order engine, each of which may issue an instruction each cycle.
Changes in Power 5 to support SMT
- Increased the associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased the size of the L2 (1.92 MB vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
Pentium-4 Hyperthreading (2002)
- First commercial SMT design (2-way SMT); Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors
- When one logical processor is stalled, the other can make progress
- No logical processor can use all entries in the queues when two threads are active
Pentium-4 Hyperthreading Front End
[Figure: front-end pipeline marking each resource as either divided between the logical CPUs or shared between them.]
SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.
For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).
[Figure: issue slots over time in the two regimes; under high TLP each cycle's slots are split among threads, while under low TLP a single thread fills the full width.]
Power 5 thread performance ...
Relative priority of each thread is controllable in hardware.
For balanced operation, both threads run slower than if they each "owned" the machine; the sketch below shows why.
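A sketch of what hardware-controlled priority might mean in terms of issue slots; the proportional-share rule here is invented and only meant to show why equal priorities halve each thread's speed:

    # Sketch: priority-weighted slot sharing between two SMT threads.
    # The weighting rule is invented; Power 5's actual mechanism differs
    # in detail (it adjusts decode cycles per thread).
    def share(priorities, slots_per_cycle=4):
        total = sum(priorities)
        return [slots_per_cycle * p / total for p in priorities]

    print(share([1, 1]))  # balanced: [2.0, 2.0] -> each thread at half speed
    print(share([3, 1]))  # favored thread gets 3 of 4 slots on average
    print(share([1, 0]))  # one thread effectively owns the machine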