Pipelining for Multi-Core Architectures
Multi-Core Technology
[Figure: evolution of multi-core technology — 2004: single core + cache; 2005: dual core (2 or more cores, each core with its own cache); 2007: multi-core (4 or more cores, each core with its own cache), with 2X more cores each generation]
Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits:
heat problems
clock problems
efficiency (stall) problems
Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult; the processor would have to
issue 3 or 4 data memory accesses per cycle,
rename and access more than 20 registers per cycle, and
fetch 12 to 24 instructions per cycle.
Many new applications are multithreaded
General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP)
This is parallelism on a coarser scale
A server can serve each client in a separate thread (Web server, database server)
A computer game can do AI, graphics, and sound in three separate threads (sketched in code below)
Single-core superscalar processors cannot fully exploit TLP
Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
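A minimal C++ sketch of the game example above, assuming hypothetical subsystem functions doAI, doGraphics, and doSound (they only print here). Each coarse-grained task runs in its own thread, which a multi-core chip can schedule on separate cores:

```cpp
#include <iostream>
#include <thread>

// Hypothetical stand-ins for the game's subsystems (placeholders, not a real API).
void doAI()       { std::cout << "AI update\n"; }
void doGraphics() { std::cout << "render frame\n"; }
void doSound()    { std::cout << "mix audio\n"; }

int main() {
    // Each coarse-grained task runs in its own thread; on a multi-core chip
    // the OS is free to schedule each thread on a separate core.
    std::thread ai(doAI), graphics(doGraphics), sound(doSound);
    ai.join();
    graphics.join();
    sound.join();
}
```

Compile with any C++11-or-later compiler (e.g. add -pthread on Linux).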
What applications benefit from multi-core?
Database servers
Web servers (Web commerce)
Multimedia applications
Scientific applications, CAD/CAM
Each of these can run on its own core
In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
More examples
Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program
Anything that can be threaded today will map efficiently to multi-core
BUT: some applications are difficult to parallelize
Without SMT, only a single thread can run at any given time
[Figure: superscalar pipeline (BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, bus) occupied by Thread 1: floating point]
Without SMT, only a single thread can run at any given time
[Figure: the same pipeline occupied by Thread 2: integer operation]
SMT processor: both threads can run concurrently
[Figure: the same pipeline with Thread 1 (floating point) and Thread 2 (integer operation) active at the same time]
But: can't simultaneously use the same functional unit
[Figure: Thread 1 and Thread 2 both trying to use the same functional unit, marked IMPOSSIBLE]
This scenario is impossible with SMT on a single core (assuming a single integer unit)
Multi-core: threads can run on separate cores
[Figure: two complete pipelines, each with its own BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, bus — Thread 3 runs on one core, Thread 4 on the other]
Combining Multi-core and SMT
Cores can be SMT-enabled (or not)
The different combinations:
Single-core, non-SMT: standard uniprocessor
Single-core, with SMT
Multi-core, non-SMT
Multi-core, with SMT: our fish machines
The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
Intel calls them hyper-threads
(A quick way to see the combined thread count on your own machine is sketched below.)
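A minimal sketch: standard C++ can report how many hardware threads the OS exposes, which on an SMT multi-core is roughly cores times SMT threads per core (the exact value depends on the machine):

```cpp
#include <iostream>
#include <thread>

int main() {
    // Number of logical processors the OS exposes: physical cores multiplied
    // by the SMT (hyper-thread) count per core. Returns 0 if it cannot tell.
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << n << "\n";
}
```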
SMT dual-core: all four threads can run concurrently
[Figure: two SMT-enabled pipelines — Threads 1-4 run concurrently, two threads per core]
Multi-Core and cache coherence
[Figure: two cache hierarchy designs. Left: Core 0 and Core 1 each with a private L1 and a private L2 cache, connected to memory — both L1 and L2 are private; examples: AMD Opteron, AMD Athlon, Intel Pentium D. Right: the same hierarchy with L3 caches added between the L2 caches and memory — a design with L3 caches; example: Intel Itanium 2]
The cache coherence problem
Since we have private caches: how do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213
[Figure: multi-core chip with Core 1-Core 4, each with one or more levels of cache; main memory holds x=15213]
The cache coherence problem
Core 1 reads x
[Figure: Core 1's cache now holds x=15213; main memory holds x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Figure: Core 1's cache holds x=21660, Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches)]
The cache coherence problem
Core 2 attempts to read x and gets a stale copy
[Figure: Core 1's cache holds x=21660, Core 2's cache still holds the stale x=15213; main memory holds x=21660]
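To connect this to software, here is a minimal C++ sketch of the same scenario: one thread plays the role of Core 1 writing x, another the role of Core 2 reading it. std::atomic is used so that, together with the hardware's cache-coherence machinery, the reader is guaranteed to eventually observe 21660 rather than spinning forever on the stale 15213; the variable name and values simply mirror the slides.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

// The slides' scenario from a programmer's point of view: one thread ("Core 1")
// updates x, another ("Core 2") reads it. std::atomic plus the hardware's
// cache-coherence protocol make the new value visible; without coherence,
// the reader could keep seeing the stale cached 15213.
std::atomic<int> x{15213};

int main() {
    std::thread writer([] { x.store(21660); });   // "Core 1 writes x"
    std::thread reader([] {                       // "Core 2 reads x"
        while (x.load() == 15213) { /* spin until the update is visible */ }
        std::cout << "reader sees x = " << x.load() << "\n";
    });
    writer.join();
    reader.join();
}
```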
The Memory Wall Problem
Memory Wall
[Figure: processor vs. DRAM performance, 1980-2000, log scale (1 to 1000). Processor performance ("Moore's Law") grows ~60%/yr (2X every 1.5 years); DRAM performance grows ~9%/yr (2X every 10 years). The processor-memory performance gap grows about 50% per year]
Latency in a Single PC
[Figure: latency in a single PC, 1997-2009. Left axis: time (ns, log scale 0.1-1000) for CPU clock period and memory system access time; right axis: memory-to-CPU ratio (0-500). The CPU clock period keeps shrinking while memory access time barely improves, so the ratio climbs steeply — "THE WALL"]
Technology Trends
            Capacity         Speed (latency)
Logic:      2x in 3 years    2x in 3 years
DRAM:       4x in 3 years    2x in 10 years
Disk:       4x in 3 years    2x in 10 years

DRAM Generations
Year   Size       Cycle Time
1980   64 Kb      250 ns
1983   256 Kb     220 ns
1986   1 Mb       190 ns
1989   4 Mb       165 ns
1992   16 Mb      120 ns
1996   64 Mb      110 ns
1998   128 Mb     100 ns
2000   256 Mb     90 ns
2002   512 Mb     80 ns
2006   1024 Mb    60 ns
       16000:1    4:1
       (Capacity) (Latency)
Processor-DRAM Performance Gap Impact: Example
To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
The minimum cost of a full memory access, in terms of the number of wasted CPU cycles:
Year   CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU cycles or instructions wasted
1986   8                 125              190                  190/125 - 1 = 0.5
1989   33                30               165                  165/30 - 1 = 4.5
1992   60                16.6             120                  120/16.6 - 1 = 6.2
1996   200               5                110                  110/5 - 1 = 21
1998   300               3.33             100                  100/3.33 - 1 = 29
2000   1000              1                90                   90/1 - 1 = 89
2003   2000              0.5              80                   80/0.5 - 1 = 159
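The last column is just the ratio of the two latencies minus one; a small sketch that recomputes the table:

```cpp
#include <cstdio>

// Recomputes the table's last column: wasted CPU cycles per memory access
// = memory access time / CPU cycle time - 1.
int main() {
    struct Row { int year; double cycle_ns, mem_ns; };
    const Row rows[] = {
        {1986, 125, 190},  {1989, 30, 165}, {1992, 16.6, 120}, {1996, 5, 110},
        {1998, 3.33, 100}, {2000, 1, 90},   {2003, 0.5, 80},
    };
    for (const Row& r : rows)
        std::printf("%d: %.1f cycles wasted\n", r.year, r.mem_ns / r.cycle_ns - 1);
}
```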
Main Memory
Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit, but requires a periodic data refresh (~every 8 msec).
Cache uses SRAM (Static Random Access Memory): no refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
Size: DRAM/SRAM = 4-8
Cost & cycle time: SRAM/DRAM = 8-16
Main memory performance:
Memory latency:
Access time: the time between a memory access request and the moment the requested information is available to the cache/CPU
Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow the address lines to be stable)
Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU
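As an illustration of the bandwidth definition (with assumed numbers: an 8-byte-wide memory interface and the 60 ns cycle time of the 2006 DRAM generation above), peak bandwidth is just the bytes delivered per cycle divided by the cycle time:

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative parameters (not a specific machine):
    const double bus_width_bytes = 8.0;   // bytes delivered per memory cycle
    const double cycle_time_ns   = 60.0;  // DRAM cycle time (2006 generation above)

    // Peak bandwidth = bytes per cycle / cycle time.
    double bytes_per_sec = bus_width_bytes / (cycle_time_ns * 1e-9);
    std::printf("peak bandwidth ~= %.0f MB/s\n", bytes_per_sec / 1e6);   // ~133 MB/s
}
```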
Architects Use Transistors to Tolerate Slow Memory
Cache: small, fast memory
Holds information (expected) to be used soon
Mostly successful
Apply recursively: level-one cache(s), level-two cache
Most of microprocessor die area is cache!
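A worked sketch (all numbers assumed, for illustration only) of why applying caches recursively pays off: with two cache levels and the standard average-memory-access-time formula, the average cost per access drops from hundreds of cycles to a few.

```cpp
#include <cstdio>

// Two-level average memory access time (AMAT) with assumed, illustrative numbers.
int main() {
    const double l1_hit = 1,  l1_miss_rate = 0.05;   // cycles; fraction of all accesses
    const double l2_hit = 10, l2_miss_rate = 0.20;   // fraction of accesses reaching L2
    const double mem_penalty = 200;                  // main-memory access, in cycles

    // AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * memory penalty)
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty);
    std::printf("AMAT ~= %.1f cycles (vs. %g cycles without caches)\n", amat, mem_penalty);
}
```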