Pipelining for Multi-Core Architectures
Multi-Core Technology
[Figure: evolution of multi-core technology — 2004: single core + cache; 2005: dual core (2 or more cores, each core with its own cache); 2007: multi-core (4 or more cores, each core with its own cache), with 2X more cores each generation]
Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits:
heat problems
clock problems
efficiency (stall) problems
Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult; the processor would have to
issue 3 or 4 data memory accesses per cycle,
rename and access more than 20 registers per cycle, and
fetch 12 to 24 instructions per cycle.
Many new applications are multithreaded
General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP)
This is parallelism on a coarser scale
A server can serve each client in a separate thread (Web server, database server)
A computer game can do AI, graphics, and sound in three separate threads (sketched in code below)
Single-core superscalar processors cannot fully exploit TLP
Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
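A minimal C++ sketch of the game example above, assuming hypothetical subsystem functions doAI, doGraphics, and doSound (they only print here). Each coarse-grained task runs in its own thread, which a multi-core chip can schedule on separate cores:

```cpp
#include <iostream>
#include <thread>

// Hypothetical stand-ins for the game's subsystems (placeholders, not a real API).
void doAI()       { std::cout << "AI update\n"; }
void doGraphics() { std::cout << "render frame\n"; }
void doSound()    { std::cout << "mix audio\n"; }

int main() {
    // Each coarse-grained task runs in its own thread; on a multi-core chip
    // the OS is free to schedule each thread on a separate core.
    std::thread ai(doAI), graphics(doGraphics), sound(doSound);
    ai.join();
    graphics.join();
    sound.join();
}
```

Compile with any C++11-or-later compiler (e.g. add -pthread on Linux).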
What applications benefit from multi-core?
Database servers
Web servers (Web commerce)
Multimedia applications
Scientific applications, CAD/CAM
Each of these can run on its own core
In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
More examples
Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program
Anything that can be threaded today will map efficiently to multi-core
BUT: some applications are difficult to parallelize
Without SMT, only a single thread can run at any given time
[Figure: superscalar pipeline (BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, bus) occupied by Thread 1: floating point]
Without SMT, only a single thread can run at any given time
[Figure: the same pipeline occupied by Thread 2: integer operation]
SMT processor: both threads can run concurrently
[Figure: the same pipeline with Thread 1 (floating point) and Thread 2 (integer operation) active at the same time]
But: can't simultaneously use the same functional unit
[Figure: Thread 1 and Thread 2 both trying to use the same functional unit, marked IMPOSSIBLE]
This scenario is impossible with SMT on a single core (assuming a single integer unit)
Multi-core: threads can run on separate cores
[Figure: two complete pipelines, each with its own BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, bus — Thread 3 runs on one core, Thread 4 on the other]
Combining Multi-core and SMT
Cores can be SMT-enabled (or not)
The different combinations:
Single-core, non-SMT: standard uniprocessor
Single-core, with SMT
Multi-core, non-SMT
Multi-core, with SMT: our fish machines
The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
Intel calls them hyper-threads
(A quick way to see the combined thread count on your own machine is sketched below.)
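A minimal sketch: standard C++ can report how many hardware threads the OS exposes, which on an SMT multi-core is roughly cores times SMT threads per core (the exact value depends on the machine):

```cpp
#include <iostream>
#include <thread>

int main() {
    // Number of logical processors the OS exposes: physical cores multiplied
    // by the SMT (hyper-thread) count per core. Returns 0 if it cannot tell.
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << n << "\n";
}
```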
SMT dual-core: all four threads can run concurrently
[Figure: two SMT-enabled pipelines — Threads 1-4 run concurrently, two threads per core]
Multi-Core and cache coherence
[Figure: two cache hierarchy designs. Left: Core 0 and Core 1 each with a private L1 and a private L2 cache, connected to memory — both L1 and L2 are private; examples: AMD Opteron, AMD Athlon, Intel Pentium D. Right: the same hierarchy with L3 caches added between the L2 caches and memory — a design with L3 caches; example: Intel Itanium 2]
The cache coherence problem
Since we have private caches: how do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213
[Figure: multi-core chip with Core 1-Core 4, each with one or more levels of cache; main memory holds x=15213]
The cache coherence problem
Core 1 reads x
[Figure: Core 1's cache now holds x=15213; main memory holds x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Figure: Core 1's cache holds x=21660, Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches)]
The cache coherence problem
Core 2 attempts to read x and gets a stale copy
[Figure: Core 1's cache holds x=21660, Core 2's cache still holds the stale x=15213; main memory holds x=21660]
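To connect this to software, here is a minimal C++ sketch of the same scenario: one thread plays the role of Core 1 writing x, another the role of Core 2 reading it. std::atomic is used so that, together with the hardware's cache-coherence machinery, the reader is guaranteed to eventually observe 21660 rather than spinning forever on the stale 15213; the variable name and values simply mirror the slides.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

// The slides' scenario from a programmer's point of view: one thread ("Core 1")
// updates x, another ("Core 2") reads it. std::atomic plus the hardware's
// cache-coherence protocol make the new value visible; without coherence,
// the reader could keep seeing the stale cached 15213.
std::atomic<int> x{15213};

int main() {
    std::thread writer([] { x.store(21660); });   // "Core 1 writes x"
    std::thread reader([] {                       // "Core 2 reads x"
        while (x.load() == 15213) { /* spin until the update is visible */ }
        std::cout << "reader sees x = " << x.load() << "\n";
    });
    writer.join();
    reader.join();
}
```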
The Memory Wall Problem
Memory Wall
[Figure: processor vs. DRAM performance, 1980-2000, log scale (1 to 1000). Processor performance ("Moore's Law") grows ~60%/yr (2X every 1.5 years); DRAM performance grows ~9%/yr (2X every 10 years). The processor-memory performance gap grows about 50% per year]
Latency in a Single PC
[Figure: latency in a single PC, 1997-2009. Left axis: time (ns, log scale 0.1-1000) for CPU clock period and memory system access time; right axis: memory-to-CPU ratio (0-500). The CPU clock period keeps shrinking while memory access time barely improves, so the ratio climbs steeply — "THE WALL"]
Technology Trends
            Capacity         Speed (latency)
Logic:      2x in 3 years    2x in 3 years
DRAM:       4x in 3 years    2x in 10 years
Disk:       4x in 3 years    2x in 10 years

DRAM Generations
Year   Size       Cycle Time
1980   64 Kb      250 ns
1983   256 Kb     220 ns
1986   1 Mb       190 ns
1989   4 Mb       165 ns
1992   16 Mb      120 ns
1996   64 Mb      110 ns
1998   128 Mb     100 ns
2000   256 Mb     90 ns
2002   512 Mb     80 ns
2006   1024 Mb    60 ns
       16000:1    4:1
       (Capacity) (Latency)
Processor-DRAM Performance Gap Impact: Example
To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
The minimum cost of a full memory access, in terms of the number of wasted CPU cycles:
Year   CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU cycles or instructions wasted
1986   8                 125              190                  190/125 - 1 = 0.5
1989   33                30               165                  165/30 - 1 = 4.5
1992   60                16.6             120                  120/16.6 - 1 = 6.2
1996   200               5                110                  110/5 - 1 = 21
1998   300               3.33             100                  100/3.33 - 1 = 29
2000   1000              1                90                   90/1 - 1 = 89
2003   2000              0.5              80                   80/0.5 - 1 = 159
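The last column is just the ratio of the two latencies minus one; a small sketch that recomputes the table:

```cpp
#include <cstdio>

// Recomputes the table's last column: wasted CPU cycles per memory access
// = memory access time / CPU cycle time - 1.
int main() {
    struct Row { int year; double cycle_ns, mem_ns; };
    const Row rows[] = {
        {1986, 125, 190},  {1989, 30, 165}, {1992, 16.6, 120}, {1996, 5, 110},
        {1998, 3.33, 100}, {2000, 1, 90},   {2003, 0.5, 80},
    };
    for (const Row& r : rows)
        std::printf("%d: %.1f cycles wasted\n", r.year, r.mem_ns / r.cycle_ns - 1);
}
```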
Main Memory
Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit, but requires a periodic data refresh (~every 8 msec).
Cache uses SRAM (Static Random Access Memory): no refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
Size: DRAM/SRAM = 4-8
Cost & cycle time: SRAM/DRAM = 8-16
Main memory performance:
Memory latency:
Access time: the time between a memory access request and the moment the requested information is available to the cache/CPU
Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow the address lines to be stable)
Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU
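As an illustration of the bandwidth definition (with assumed numbers: an 8-byte-wide memory interface and the 60 ns cycle time of the 2006 DRAM generation above), peak bandwidth is just the bytes delivered per cycle divided by the cycle time:

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative parameters (not a specific machine):
    const double bus_width_bytes = 8.0;   // bytes delivered per memory cycle
    const double cycle_time_ns   = 60.0;  // DRAM cycle time (2006 generation above)

    // Peak bandwidth = bytes per cycle / cycle time.
    double bytes_per_sec = bus_width_bytes / (cycle_time_ns * 1e-9);
    std::printf("peak bandwidth ~= %.0f MB/s\n", bytes_per_sec / 1e6);   // ~133 MB/s
}
```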
Architects Use Transistors to Tolerate Slow Memory
Cache: small, fast memory
Holds information (expected) to be used soon
Mostly successful
Apply recursively: level-one cache(s), level-two cache
Most of microprocessor die area is cache!
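A worked sketch (all numbers assumed, for illustration only) of why applying caches recursively pays off: with two cache levels and the standard average-memory-access-time formula, the average cost per access drops from hundreds of cycles to a few.

```cpp
#include <cstdio>

// Two-level average memory access time (AMAT) with assumed, illustrative numbers.
int main() {
    const double l1_hit = 1,  l1_miss_rate = 0.05;   // cycles; fraction of all accesses
    const double l2_hit = 10, l2_miss_rate = 0.20;   // fraction of accesses reaching L2
    const double mem_penalty = 200;                  // main-memory access, in cycles

    // AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * memory penalty)
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty);
    std::printf("AMAT ~= %.1f cycles (vs. %g cycles without caches)\n", amat, mem_penalty);
}
```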