EEC 581
Computer Architecture
Multicore Architecture
Department of Electrical Engineering and Computer Science
Cleveland State University
Multiprocessor Architectures
Late 1950s - one general-purpose processor and one or more special-purpose processors for input and output operations
Early 1960s - multiple complete processors, used for program-level concurrency
Mid-1960s - multiple partial processors, used for instruction-level concurrency
Single-Instruction Multiple-Data (SIMD) machines
Multiple-Instruction Multiple-Data (MIMD) machines
A primary focus of this chapter is shared-memory MIMD machines (multiprocessors)
Recall The Executable Format
An object file, ready to be linked and loaded, with sections:
header, text, static data, reloc, symbol table, debug
The linker/loader combines it with static libraries to produce an executable instance, or process
What does a loader do?
Process
• A process is a running program with state
  - Stack, memory, open files
  - PC, registers
• The operating system keeps track of the state of all processes
  - E.g., for scheduling processes
• There may be many processes for the same application
  - E.g., a web browser
• See an operating systems class for details
[Figure: process address space: code, static data, heap, stack, DLLs]
Process Level Parallelism
[Figure: several independent processes running side by side]
• Parallel processes and throughput computing
• Each process itself does not run any faster
From Processes to Threads
• Switching processes on a core is expensive
  - A lot of state information must be managed
• Launching a whole process just to get concurrency is expensive
• How about splitting up a single process into parallel computations?
  - Lightweight processes, or threads!
Categories of Concurrency
Physical concurrency - multiple independent processors (multiple threads of control)
Logical concurrency - the appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)
Coroutines (quasi-concurrency) have a single thread of control
A thread of control in a program is the sequence of program points reached as control flows through the program
Motivations for the Use of Concurrency
Multiprocessor computers capable of physical concurrency are now widely used
Even if a machine has just one processor, a program written to use concurrent execution can be faster than the same program written for nonconcurrent execution
Concurrency involves a different way of designing software that can be very useful; many real-world situations involve concurrency
Many program applications are now spread over multiple machines, either locally or over a network
Introduction to Subprogram-Level Concurrency
A task or process or thread is a program unit that can be in concurrent execution with other program units
Tasks differ from ordinary subprograms in that:
A task may be implicitly started
When a program unit starts the execution of a task, it is not necessarily suspended
When a task’s execution is completed, control may not return to the caller
Tasks usually work together
Two General Categories of Tasks
Heavyweight tasks execute in their own address space
Lightweight tasks all run in the same address space, which is more efficient
A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way
Task Synchronization
A mechanism that controls the order in which tasks execute
Two kinds of synchronization:
Cooperation synchronization
Competition synchronization
Task communication is necessary for synchronization, provided by:
Shared nonlocal variables
Parameters
Message passing
Kinds of synchronization
Cooperation: task A must wait for task B to complete some specific activity before task A can continue its execution, e.g., the producer-consumer problem
Competition: two or more tasks must use some resource that cannot be simultaneously used, e.g., a shared counter
Competition is usually handled by mutually exclusive access (approaches are discussed later)
Thread Parallel Execution
[Figure: a process containing several concurrently executing threads]
A Thread
• A separate, concurrently executable instruction stream within a process
• The minimum amount of state needed to execute on a core
  - Program counter, registers, stack
  - Remaining state is shared with the parent process (memory and files)
• Support for creating threads
• Support for merging/terminating threads
• Support for synchronization between threads
  - In accesses to shared data
[Figure: our single-core datapath so far]
TLP
Extracting large ILP from a single program is hard; the parallelism is far-flung
We are human, after all, and program with a sequential mind
Reality: we run multiple threads or programs
Thread-Level Parallelism
Time multiplexing
Throughput computing
Multiple-program workloads
Multiple concurrent threads
Helper threads to improve single-program performance
Thread Level Parallelism (TLP)
• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads
Instruction and Data Streams
• Taxonomy due to M. Flynn

                        Data Streams
                        Single                  Multiple
Instruction  Single     SISD: Intel Pentium 4   SIMD: SSE instructions of x86
Streams      Multiple   MISD: no examples today MIMD: Intel Xeon e5345
Example: Multithreading (MT) in a single address space
Single and Multithreaded Processes
[Figure: a single-threaded process vs. a multithreaded process sharing code, data, and files]
A Simple Example
Data Parallel Computation
Thread Execution: Basics
[Figure: static data and the heap are shared; funcA() and funcB() each run on their own stack, with their own PC, registers, and stack pointer, as Thread #1 and Thread #2]

Main thread:            Thread #1:     Thread #2:
create_thread(funcA)    funcA()        funcB()
create_thread(funcB)    ...            ...
WaitAllThreads()        end_thread()   end_thread()
Examples of Threads
A web browser
One thread displays images
One thread retrieves data from network
A word processor
One thread displays graphics
One thread reads keystrokes
One thread performs spell checking in the background
A web server
One thread accepts requests
When a request comes in, a separate thread is created to service it
Many threads to support thousands of client requests
RPC or RMI (Java)
One thread receives the message
The message service uses another thread
Threads Execution on a Single Core
• Hardware threads
  - Each thread has its own hardware state
• Switch between threads on each cycle to share the core pipeline - why?
[Figure: instructions from Thread #1 and Thread #2 (loads, ALU ops, a loop branch) interleaved cycle by cycle through the IF/ID/EX/MEM/WB pipeline stages]
No pipeline stall on a load-to-use hazard!
Interleaved execution improves utilization
An Example Datapath
From Poonacha Kongetira, Microarchitecture of the UltraSPARC T1 CPU
Execution Model: Multithreading
• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls (e.g., on I/O), others are executed
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., an L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
Simultaneous Multithreading
• In multiple-issue, dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
• Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyper-Threading in Intel terminology
Threads vs. Processes

Thread:
A thread has no data segment or heap
A thread cannot live on its own; it must live within a process
There can be more than one thread in a process; the first thread calls main and has the process's stack
Inexpensive creation
Inexpensive context switching
If a thread dies, its stack is reclaimed by the process

Process:
A process has code/data/heap and other segments
There must be at least one thread in a process
Threads within a process share code/data/heap and I/O, but each has its own stack and registers
Expensive creation
Expensive context switching
If a process dies, its resources are reclaimed and all threads die
Thread Implementation
A process defines the address space; threads share that address space
The Process Control Block (PCB) contains process-specific info
  - PID, owner, heap pointer, active threads, and pointers to thread info
The Thread Control Block (TCB) contains thread-specific info
  - Stack pointer, PC, thread state, registers, ...
[Figure: the process's address space (code, initialized data, heap, one stack per thread, DLLs, reserved region), with a TCB per thread holding $pc, $sp, state, and registers]
Benefits
Responsiveness: when one thread is blocked, your browser still responds
  - E.g., download images while allowing your interaction
Resource sharing: threads share the same address space
  - Reduces overhead (e.g., memory)
Economy: creating a new process costs memory and resources
  - E.g., in Solaris, creating a process is about 30 times slower than creating a thread
Utilization of MP architectures: threads can be executed in parallel on multiple processors
  - Increases concurrency and throughput
User-level Threads
Thread management is done by a user-level threads library
  - Switching is similar to calling a procedure
User code can control thread scheduling (without disturbing the underlying OS scheduler)
No OS kernel support is needed, so more portable
Low overhead on a thread switch
Three primary thread libraries:
POSIX Pthreads
Java threads
Win32 threads
Kernel Threads
A.k.a. lightweight processes in the literature
Supported by the kernel
Thread scheduling is fairer
Examples
Windows XP/2000
Solaris
Linux
Tru64 UNIX
Mac OS X
Multithreading Models
Many-to-One
One-to-One
Many-to-Many
Many-to-One
Many user-level threads are mapped to a single kernel thread
The entire process blocks if any thread makes a blocking system call
Threads cannot run in parallel on multiprocessors
Examples
Solaris Green Threads
GNU Portable Threads
Many-to-One Model
One-to-One
Each user-level thread maps to a kernel thread
One thread's blocking system call does not block the other threads
Enables parallel execution in an MP system
Downsides:
Performance/memory overhead of creating kernel threads
Restricts the number of threads that can be supported
Examples
Windows NT/XP/2000
Linux
Solaris 9 and later
One-to-one Model
Many-to-Many Model
Allows many user-level threads to be mapped to many kernel threads
Allows the operating system to create a sufficient number of kernel threads
Threads are multiplexed onto a smaller (or equal) number of kernel threads, specific to a particular application or machine
Examples
Solaris prior to version 9
Windows NT/2000 with the ThreadFiber package
Many-to-Many Model
Pipeline Hazards
Multithreading
Multi-Tasking Paradigm
Virtual memory makes it easy
A context switch could be expensive or require extra hardware
  - VIVT caches
  - VIPT caches
  - TLBs
[Figure: a conventional single-threaded superscalar with function units FU1-FU4; Threads 1-5 each run for an execution time quantum in turn, leaving many issue slots unused]
Multi-threading Paradigm
[Figure: issue slots (FU1-FU4) over execution time for five threads under each model: conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and chip multiprocessor (CMP, or multicore)]
Conventional Multithreading
Zero-overhead context switch
Duplicated contexts for threads
[Figure: a register file holding four thread contexts (0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7) selected by CtxtPtr, with memory shared by all threads]
Cycle Interleaving MT
Per-cycle, Per-thread instruction fetching
Examples: HEP, Horizon, Tera MTA, MIT M-machine
Interesting questions to consider
Does it need a sophisticated branch predictor?
Or does it need any speculative execution at all?
Get rid of “branch prediction”?
Get rid of “predication”?
Does it need any out-of-order execution capability?
Block Interleaving MT
Context switch on a specific event (dynamic pipelining)
Explicit switching: implementing a switch instruction
Implicit switching: triggered when a specific instruction class is fetched
Static switching (switch upon fetching)
  Switch-on-memory-instructions: Rhamma processor
  Switch-on-branch or switch-on-hard-to-predict-branch
  The trigger can be an implicit or explicit instruction
Dynamic switching
  Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
  Switch-on-use (lazy strategy of switch-on-cache-miss)
    Wait until the last minute
    A valid bit is needed for each register: cleared when the load issues, set when the data returns
  Switch-on-signal (e.g., interrupt)
  Predicated switch instruction based on conditions
No need to support a large number of threads
[Figure: an SMT pipeline with a register renamer per thread context]
Simultaneous Multithreading (SMT)
The SMT name was first used by UW; earlier versions came from UCSB [Nemirovsky, HICSS '91] and [Hirata et al., ISCA '92]
Intel's Hyper-Threading (2-way SMT)
IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 in-order cores