MULTITHREADING
Jul 15, 2015
What is ILP?
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.
There are two approaches to exploiting instruction-level parallelism: hardware and software.
Example
Consider the following program:

1. e = a + b
2. f = c + d
3. m = e * f

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
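To make the arithmetic concrete, here is a minimal sketch in Python (the dependency graph is written out by hand) that computes the ILP of this example as the number of operations divided by the critical-path length:

    from functools import lru_cache

    # Dependency graph for the example: op -> ops whose results it needs.
    deps = {
        "e = a + b": (),
        "f = c + d": (),
        "m = e * f": ("e = a + b", "f = c + d"),
    }

    @lru_cache(maxsize=None)
    def level(op):
        """Earliest time step at which op can execute (unit-latency ops)."""
        return 1 + max((level(d) for d in deps[op]), default=0)

    critical_path = max(level(op) for op in deps)
    print(f"ILP = {len(deps)}/{critical_path} = {len(deps) / critical_path}")
    # -> ILP = 3/2 = 1.5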
ILP is application specific
A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible. Ordinary programs are typically written under a sequential execution model where instructions execute one after the other and in the order specified by the programmer.
ILP allows the compiler and the processor to overlap the execution of multiple instructions or even to change the order in which instructions are executed.
How much ILP exists in programs is very application specific.
In certain fields, such as graphics and scientific computing, the amount can be very large. However, workloads such as cryptography may exhibit much less parallelism.
ILP Techniques
Micro-architectural techniques that are used to exploit ILP include:
Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
Superscalar execution, VLIW, and the closely related explicitly parallel instruction computing concepts, in which multiple execution units are used to execute multiple instructions in parallel.
Out-of-order execution, where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution. Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs. An alternative is to extract this parallelism at compile time and somehow convey this information to the hardware.
Register renaming, a technique used to avoid the unnecessary serialization of program operations imposed by the reuse of registers by those operations; it is used to enable out-of-order execution.
Speculative execution, which allows the execution of complete instructions or parts of instructions before it is certain whether this execution should take place. A commonly used form of speculative execution is control-flow speculation, where instructions past a control-flow instruction (e.g., a branch) are executed before the target of the control-flow instruction is determined. Several other forms of speculative execution have been proposed and are in use, including speculative execution driven by value prediction, memory dependence prediction, and cache latency prediction.
Branch prediction, which is used to avoid stalling while control dependencies are resolved. Branch prediction is used with speculative execution; a minimal predictor sketch follows below.
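The list above does not name a particular prediction scheme, so as one illustrative example, here is a minimal sketch of a classic 2-bit saturating-counter predictor:

    # Minimal sketch of a 2-bit saturating-counter branch predictor.
    # States 0-1 predict not-taken, states 2-3 predict taken; each outcome
    # nudges the counter one step toward the observed behavior.
    class TwoBitPredictor:
        def __init__(self):
            self.counters = {}  # branch PC -> counter state (0..3)

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2  # True = predict taken

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

    pred = TwoBitPredictor()
    outcomes = [True, True, False, True, True]  # a loop branch with one exit
    hits = 0
    for taken in outcomes:
        hits += pred.predict(0x400) == taken
        pred.update(0x400, taken)
    print(f"{hits}/{len(outcomes)} correct")

The 2-bit counter tolerates a single anomalous outcome (such as a loop exit) without flipping its prediction, which is why it predicts better than a 1-bit scheme on loop-heavy code.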
Register Renaming
# Instruction
1 R1 = M[1024]
2 R1 = R1 + 2
3 M[1032] = R1
4 R1 = M[2048]
5 R1 = R1 + 4
6 M[2056] = R1

Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor cannot finish instruction 4 until instruction 3 is done, because instruction 3 would then write the wrong value.
We can eliminate this restriction by changing the names of some of the registers:
# Instruction      # Instruction
1 R1 = M[1024]     4 R2 = M[2048]
2 R1 = R1 + 2      5 R2 = R2 + 4
3 M[1032] = R1     6 M[2056] = R2
Now instructions 4, 5, and 6 can be executed in parallel with instructions 1, 2, and 3, so that the program can be executed faster.
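The renaming step itself is mechanical, which is why hardware can do it on the fly. Below is a minimal sketch of the idea in Python; it is a simplified compile-time view (real hardware renames to a finite physical register file and recycles names):

    # Minimal register-renaming sketch: every write to an architectural
    # register gets a fresh name; reads use the latest name for that register.
    def rename(instructions):
        latest = {}   # architectural register -> its current fresh name
        version = {}  # architectural register -> version counter
        out = []
        for dst, srcs in instructions:
            srcs = [latest.get(s, s) for s in srcs]   # read the newest names
            if dst.startswith("R"):                   # rename register writes
                version[dst] = version.get(dst, 0) + 1
                latest[dst] = f"{dst}v{version[dst]}"
                dst = latest[dst]
            out.append((dst, srcs))
        return out

    # The slide's example as (destination, [sources]); M[...] stands for memory.
    prog = [
        ("R1", ["M[1024]"]), ("R1", ["R1", "2"]), ("M[1032]", ["R1"]),
        ("R1", ["M[2048]"]), ("R1", ["R1", "4"]), ("M[2056]", ["R1"]),
    ]
    for dst, srcs in rename(prog):
        print(dst, "<-", srcs)

Running this shows instructions 4-6 using R1v3/R1v4 while 1-3 use R1v1/R1v2, so the false dependence through R1 is gone.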
Multithreading
It is difficult to continue to extract ILP from a single thread.
Many workloads can make use of thread-level parallelism (TLP):
- TLP from multiprogramming (run independent sequential jobs)
- TLP from multithreaded applications (run one job faster using parallel threads)
Multithreading uses TLP to improve utilization of a single processor.
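As a high-level illustration of TLP (not of hardware multithreading itself), the sketch below runs independent jobs on parallel threads with Python's standard library; the job function is invented for illustration:

    # Minimal TLP sketch: run independent jobs on a pool of threads.
    # (In CPython the GIL limits CPU-bound speedup, but I/O-bound or
    # native-code workloads do overlap; the structure is what matters here.)
    from concurrent.futures import ThreadPoolExecutor
    import time

    def job(n):                      # hypothetical unit of work
        time.sleep(0.1)              # stand-in for a long-latency operation
        return n * n

    with ThreadPoolExecutor(max_workers=4) as pool:
        start = time.perf_counter()
        results = list(pool.map(job, range(8)))
        elapsed = time.perf_counter() - start

    print(results)                   # [0, 1, 4, 9, 16, 25, 36, 49]
    print(f"{elapsed:.2f}s for 8 jobs of 0.1s each on 4 threads")  # ~0.2s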
Pipeline Hazards
Each instruction may depend on the next
LW   r1, 0(r2)
LW   r5, 12(r1)
ADDI r5, r5, #12
SW   12(r1), r5
[Pipeline diagram: F D X M W stages over cycles t0-t14; each dependent instruction stalls in decode until the previous instruction's write-back completes, inserting bubbles.]
What can be done to cope with this?
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave the execution of instructions from different program threads on the same pipeline
T1: LW   r1, 0(r2)
T2: ADD  r7, r1, r4
T3: XORI r5, r4, #12
T4: SW   0(r7), r5
T1: LW   r5, 12(r1)

[Pipeline diagram: F D X M W stages over cycles t0-t9; instructions from threads T1-T4 proceed back-to-back with no stalls.]
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe.
A prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.
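That timing guarantee is easy to check: with T threads in fixed round-robin order, an instruction fetched at cycle c writes back at c+4, and the same thread's next instruction decodes at c+T+1, which must come later. A minimal sketch:

    # Minimal sketch: fixed round-robin interleave of T threads on a
    # non-bypassed 5-stage pipeline (F D X M W, one stage per cycle).
    # A same-thread instruction reads registers in D; the prior one writes
    # them in W. Check that D of instruction i+1 comes after W of i.
    STAGES = ["F", "D", "X", "M", "W"]

    def hazard_free(num_threads):
        fetch0, fetch1 = 0, num_threads        # consecutive same-thread fetches
        writeback = fetch0 + STAGES.index("W") # cycle of first instr's W
        decode = fetch1 + STAGES.index("D")    # cycle of next instr's D
        return decode > writeback

    for t in range(1, 6):
        print(f"{t} thread(s): {'no stalls' if hazard_free(t) else 'must stall'}")
    # 1-3 threads still need interlocks; 4 or more never see the hazard.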
CDC 6600 Peripheral Processors (Cray, 1964)
- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on a simple pipeline
- Pipeline has a 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage.
Appears to software (including the OS) as multiple, albeit slower, CPUs.
[Block diagram: a 2-bit thread-select signal chooses among four per-thread PCs and four per-thread GPR files; the instruction cache (I$), execute stages (X, Y), and data cache (D$) are shared, with the thread ID carried down the pipeline.]
Multithreading Costs
Each thread requires its own user state:
- PC (program counter)
- GPRs (general-purpose registers)
It also needs its own system state:
- virtual-memory page-table base register
- exception-handling registers
Other costs?
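A rough inventory of the replicated state as a data structure (field names and sizes are illustrative, not tied to any particular ISA):

    # Sketch of the state a hardware thread context must replicate.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ThreadContext:
        # User state, one copy per thread:
        pc: int = 0                                   # program counter
        gprs: List[int] = field(default_factory=lambda: [0] * 32)
        # System state, also per thread:
        page_table_base: int = 0                      # VM page-table base register
        exception_regs: List[int] = field(default_factory=lambda: [0] * 4)

    contexts = [ThreadContext(pc=0x1000 * t) for t in range(4)]  # 4 hw threads
    print(len(contexts), "contexts,", len(contexts[0].gprs), "GPRs each")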
Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1964)
- Each of N threads executes one instruction every N cycles
- If a thread is not ready to go in its slot, insert a pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971)
- OS allocates S pipeline slots amongst N threads
- Hardware performs a fixed interleave over the S slots, executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982)
- Hardware keeps track of which threads are ready to go
- Picks the next thread to execute based on a hardware priority scheme (compare the first and last policies in the sketch below)
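The fixed and hardware-controlled policies differ directly in how many bubbles they insert. A minimal sketch with an invented ready/stall model:

    import random
    random.seed(1)

    # ready(t): whether thread t can issue this cycle (invented model).
    def ready(thread):
        return random.random() > 0.3   # each thread stalled ~30% of the time

    def fixed_interleave(num_threads, cycles):
        bubbles = 0
        for c in range(cycles):
            slot_owner = c % num_threads          # thread's reserved slot
            if not ready(slot_owner):
                bubbles += 1                      # slot wasted: pipeline bubble
        return bubbles

    def hardware_scheduled(num_threads, cycles):
        bubbles = 0
        for c in range(cycles):
            runnable = [t for t in range(num_threads) if ready(t)]
            if not runnable:
                bubbles += 1                      # stalls only if NO thread ready
        return bubbles

    N, C = 4, 10_000
    print("fixed interleave bubbles:  ", fixed_interleave(N, C))
    print("hardware-scheduled bubbles:", hardware_scheduled(N, C))
    # Expect ~30% bubbles for fixed interleave, but only ~0.3^4 = 0.8% when
    # the hardware may pick any ready thread.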
Denelcor HEP (Burton Smith, 1982)
- First commercial machine to use hardware threading in the main CPU
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)
Tera MTA (1990-97)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
- No data cache
- Sustains one main memory access per cycle per processor
MTA Architecture
Each processor supports 128 active hardware threads:
- 1 x 128 = 128 stream status word (SSW) registers
- 8 x 128 = 1024 branch-target registers
- 32 x 128 = 4096 general-purpose registers
Three operations are packed into each 64-bit instruction (short VLIW):
- one memory operation,
- one arithmetic operation, plus
- one arithmetic or branch operation
Thread creation and termination instructions
An explicit 3-bit "lookahead" field in each instruction gives the number of subsequent instructions (0-7) that are independent of this one (cf. instruction grouping in VLIW):
- allows fewer threads to fill the machine pipeline
- used for variable-sized branch delay slots
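To make the lookahead idea concrete, here is a sketch of packing and reading such a field; the bit layout is invented for illustration and is not the MTA's actual encoding:

    # Sketch: a 3-bit "lookahead" field packed into a 64-bit instruction word.
    # Bit positions here are invented, not the MTA's real layout.
    LOOKAHEAD_SHIFT = 61            # put the field in the top 3 bits

    def pack(ops_bits, lookahead):
        assert 0 <= lookahead <= 7  # 3 bits: 0-7 independent successors
        return (lookahead << LOOKAHEAD_SHIFT) | (ops_bits & ((1 << 61) - 1))

    def lookahead_of(word):
        return word >> LOOKAHEAD_SHIFT

    word = pack(ops_bits=0xABCDEF, lookahead=5)
    # Issue logic may overlap this instruction with its next 5 successors
    # without checking dependences, so fewer threads are needed to fill the pipe.
    print(lookahead_of(word))  # -> 5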
Coarse-Grain Multithreading
The Tera MTA was designed for supercomputing applications with large data sets and low locality:
- No data cache
- Many parallel threads needed to hide the large memory latency
Other applications are more cache friendly:
- Few pipeline bubbles when the cache is getting hits
- Just add a few threads to hide occasional cache-miss latencies
- Swap threads on cache misses
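A minimal sketch of the swap-on-miss policy; the latencies and miss rate are invented numbers, and it optimistically assumes the other thread always has enough work to hide a miss:

    import random

    MISS_LATENCY = 50   # cycles to service a cache miss (invented)
    SWITCH_COST = 4     # pipeline-flush penalty on a thread switch (invented)
    MISS_RATE = 0.02    # per-instruction miss probability (invented)

    def run(num_threads, cycles, swap_on_miss):
        """Count retired instructions; assumes a swapped-out thread's miss
        is fully serviced while the other thread runs."""
        random.seed(0)
        current, retired, stall_until = 0, 0, 0
        for c in range(cycles):
            if c < stall_until:
                continue                       # stalled: memory wait or flush
            if random.random() < MISS_RATE:    # this instruction misses
                if swap_on_miss and num_threads > 1:
                    current = (current + 1) % num_threads
                    stall_until = c + SWITCH_COST
                else:
                    stall_until = c + MISS_LATENCY
            else:
                retired += 1
        return retired

    print("1 thread, no swapping:  ", run(1, 10_000, swap_on_miss=False))
    print("2 threads, swap on miss:", run(2, 10_000, swap_on_miss=True))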
MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss
IBM PowerPC RS64-IV (2000)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On an L2 cache miss, the pipeline is flushed and execution switches to the second thread:
- the short pipeline minimizes the flush penalty (4 cycles), small compared to the memory-access latency
- the pipeline is flushed to simplify exception handling
For most apps, most execution units lie idle in an OoO superscalar
[Figure: issue-slot utilization for an 8-way superscalar. From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]
Simultaneous Multithreading (SMT) for OoO Superscalars
The techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time.
SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution in the same clock cycle. This gives better utilization of machine resources.
What is Superscalar?
A superscalar CPU architecture implements instruction-level parallelism within a single processor.
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.
SMT vs Superscalar
Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading.
Superscalar, by contrast, is a CPU architecture that can execute more than one instruction in a given clock cycle.
SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
From Superscalar to SMT
SMT is an out-of-order superscalar extended with hardware to support multiple executing threads
Small items:
- per-thread program counters
- per-thread return stacks
- per-thread bookkeeping for instruction retirement, trap handling, and instruction-dispatch-queue flush
- thread identifiers, e.g., with BTB and TLB entries
Superscalar Machine Efficiency
[Figure: issue slots (issue width 4) plotted against time; a completely idle cycle is "vertical waste," and a partially filled cycle (IPC < 4) is "horizontal waste."]
Vertical Multithreading
What is the effect of cycle-by-cycle interleaving? It removes vertical waste, but leaves some horizontal waste.
[Figure: issue slots against time with a second thread interleaved cycle-by-cycle; idle cycles are filled, but partially filled cycles (IPC < 4) remain.]
Chip Multiprocessing (CMP)
What is the effect of splitting into multiple processors? It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread.
[Figure: two narrower processors side by side, each with its own issue slots over time.]
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
Interleave multiple threads into multiple issue slots with no restrictions.
[Figure: instructions from several threads fill every issue slot in every cycle.]
O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
Utilize the wide out-of-order superscalar processor's issue queue to find instructions to issue from multiple threads
The OoO instruction window already has most of the circuitry required to schedule from multiple threads
Any single thread can utilize the whole machine
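The payoff can be sketched with a toy issue model: each cycle, every thread offers some number of issuable instructions (the distribution below is invented) and the machine has four slots. Vertical multithreading gives the whole cycle to one thread; SMT fills slots from all threads:

    import random
    random.seed(2)

    WIDTH, CYCLES, THREADS = 4, 10_000, 4

    def offered():
        # Instructions a thread could issue this cycle (invented distribution):
        # often 0-2 due to dependences and misses, occasionally more.
        return random.choice([0, 0, 1, 1, 2, 3, 4])

    vert = smt = 0
    for c in range(CYCLES):
        ready = [offered() for _ in range(THREADS)]
        vert += min(WIDTH, ready[c % THREADS])  # one thread owns the whole cycle
        smt += min(WIDTH, sum(ready))           # slots filled from all threads

    print(f"vertical MT IPC: {vert / CYCLES:.2f}")
    print(f"SMT IPC:         {smt / CYCLES:.2f}")
    # SMT converts horizontal waste (unused slots within a cycle) into work.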
Power 4
Single-threaded predecessor to the Power 5. Eight execution units in an out-of-order engine, each of which may issue an instruction each cycle.
Changes in Power 5 to support SMT
- Increased the associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased the size of the L2 (1.92 MB vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
Pentium-4 Hyperthreading (2002)
- First commercial SMT design (2-way SMT); Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors
- When one logical processor is stalled, the other can make progress
- No logical processor can use all entries in the queues when two threads are active
Pentium-4 Hyperthreading Front End
[Figure: front-end pipeline marking each resource as either divided between the logical CPUs or shared between them.]
SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.
For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).
[Figure: issue slots over time in the two regimes; under high TLP each cycle's slots are split among threads, while under low TLP a single thread fills the full width.]
Power 5 thread performance ...
Relative priority of each thread is controllable in hardware.
For balanced operation, both threads run slower than if they each "owned" the machine; the sketch below shows why.
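A sketch of what hardware-controlled priority might mean in terms of issue slots; the proportional-share rule here is invented and only meant to show why equal priorities halve each thread's speed:

    # Sketch: priority-weighted slot sharing between two SMT threads.
    # The weighting rule is invented; Power 5's actual mechanism differs
    # in detail (it adjusts decode cycles per thread).
    def share(priorities, slots_per_cycle=4):
        total = sum(priorities)
        return [slots_per_cycle * p / total for p in priorities]

    print(share([1, 1]))  # balanced: [2.0, 2.0] -> each thread at half speed
    print(share([3, 1]))  # favored thread gets 3 of 4 slots on average
    print(share([1, 0]))  # one thread effectively owns the machine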