EEC 581
Computer Architecture
Multicore Architecture
Department of Electrical Engineering and Computer Science
Cleveland State University
Multiprocessor Architectures
Late 1950s - one general-purpose processor and one or more special-purpose processors for input and output operations
Early 1960s - multiple complete processors, used for program-level concurrency
Mid-1960s - multiple partial processors, used for instruction-level concurrency
Single-Instruction Multiple-Data (SIMD) machines
Multiple-Instruction Multiple-Data (MIMD) machines
A primary focus of this chapter is shared-memory MIMD machines (multiprocessors)
Recall The Executable Format
An object file, ready to be linked and loaded, with sections:
header, text, static data, reloc, symbol table, debug
The linker/loader combines it with static libraries to produce an executable instance, or process
What does a loader do?
Process
• A process is a running program with state
  - Stack, memory, open files
  - PC, registers
• The operating system keeps track of the state of all processes
  - E.g., for scheduling processes
• There may be many processes for the same application
  - E.g., a web browser
• See an operating systems class for details
[Figure: process address space: code, static data, heap, stack, DLLs]
Process Level Parallelism
[Figure: several independent processes running side by side]
• Parallel processes and throughput computing
• Each process itself does not run any faster
From Processes to Threads
• Switching processes on a core is expensive
  - A lot of state information must be managed
• Launching a whole process just to get concurrency is expensive
• How about splitting up a single process into parallel computations?
  - Lightweight processes, or threads!
Categories of Concurrency
Physical concurrency - multiple independent processors (multiple threads of control)
Logical concurrency - the appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)
Coroutines (quasi-concurrency) have a single thread of control
A thread of control in a program is the sequence of program points reached as control flows through the program
Motivations for the Use of Concurrency
Multiprocessor computers capable of physical concurrency are now widely used
Even if a machine has just one processor, a program written to use concurrent execution can be faster than the same program written for nonconcurrent execution
Concurrency involves a different way of designing software that can be very useful; many real-world situations involve concurrency
Many program applications are now spread over multiple machines, either locally or over a network
Introduction to Subprogram-Level Concurrency
A task or process or thread is a program unit that can be in concurrent execution with other program units
Tasks differ from ordinary subprograms in that:
A task may be implicitly started
When a program unit starts the execution of a task, it is not necessarily suspended
When a task’s execution is completed, control may not return to the caller
Tasks usually work together
Two General Categories of Tasks
Heavyweight tasks execute in their own address space
Lightweight tasks all run in the same address space, which is more efficient
A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way
Task Synchronization
A mechanism that controls the order in which tasks execute
Two kinds of synchronization:
Cooperation synchronization
Competition synchronization
Task communication is necessary for synchronization, provided by:
Shared nonlocal variables
Parameters
Message passing
Kinds of synchronization
Cooperation: task A must wait for task B to complete some specific activity before task A can continue its execution, e.g., the producer-consumer problem
Competition: two or more tasks must use some resource that cannot be simultaneously used, e.g., a shared counter
Competition is usually handled by mutually exclusive access (approaches are discussed later)
Thread Parallel Execution
[Figure: a process containing several concurrently executing threads]
A Thread
• A separate, concurrently executable instruction stream within a process
• The minimum amount of state needed to execute on a core
  - Program counter, registers, stack
  - Remaining state is shared with the parent process (memory and files)
• Support for creating threads
• Support for merging/terminating threads
• Support for synchronization between threads
  - In accesses to shared data
[Figure: our single-core datapath so far]
TLP
Extracting large ILP from a single program is hard; the parallelism is far-flung
We are human, after all, and program with a sequential mind
Reality: we run multiple threads or programs
Thread-Level Parallelism
Time multiplexing
Throughput computing
Multiple-program workloads
Multiple concurrent threads
Helper threads to improve single-program performance
Thread Level Parallelism (TLP)
• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads
Instruction and Data Streams
• Taxonomy due to M. Flynn

                        Data Streams
                        Single                  Multiple
Instruction  Single     SISD: Intel Pentium 4   SIMD: SSE instructions of x86
Streams      Multiple   MISD: no examples today MIMD: Intel Xeon e5345
Example: Multithreading (MT) in a single address space
Single and Multithreaded Processes
[Figure: a single-threaded process vs. a multithreaded process sharing code, data, and files]
A Simple Example
Data Parallel Computation
Thread Execution: Basics
[Figure: static data and the heap are shared; funcA() and funcB() each run on their own stack, with their own PC, registers, and stack pointer, as Thread #1 and Thread #2]

Main thread:            Thread #1:     Thread #2:
create_thread(funcA)    funcA()        funcB()
create_thread(funcB)    ...            ...
WaitAllThreads()        end_thread()   end_thread()
Examples of Threads
A web browser
One thread displays images
One thread retrieves data from network
A word processor
One thread displays graphics
One thread reads keystrokes
One thread performs spell checking in the background
A web server
One thread accepts requests
When a request comes in, a separate thread is created to service it
Many threads to support thousands of client requests
RPC or RMI (Java)
One thread receives the message
The message service uses another thread
Threads Execution on a Single Core
• Hardware threads
  - Each thread has its own hardware state
• Switch between threads on each cycle to share the core pipeline - why?
[Figure: instructions from Thread #1 and Thread #2 (loads, ALU ops, a loop branch) interleaved cycle by cycle through the IF/ID/EX/MEM/WB pipeline stages]
No pipeline stall on a load-to-use hazard!
Interleaved execution improves utilization
An Example Datapath
From Poonacha Kongetira, Microarchitecture of the UltraSPARC T1 CPU
Execution Model: Multithreading
• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls (e.g., on I/O), others are executed
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., an L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
Simultaneous Multithreading
• In multiple-issue, dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
• Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyper-Threading in Intel terminology
Threads vs. Processes

Thread:
A thread has no data segment or heap
A thread cannot live on its own; it must live within a process
There can be more than one thread in a process; the first thread calls main and has the process's stack
Inexpensive creation
Inexpensive context switching
If a thread dies, its stack is reclaimed by the process

Process:
A process has code/data/heap and other segments
There must be at least one thread in a process
Threads within a process share code/data/heap and I/O, but each has its own stack and registers
Expensive creation
Expensive context switching
If a process dies, its resources are reclaimed and all threads die
Thread Implementation
A process defines the address space; threads share that address space
The Process Control Block (PCB) contains process-specific info
  - PID, owner, heap pointer, active threads, and pointers to thread info
The Thread Control Block (TCB) contains thread-specific info
  - Stack pointer, PC, thread state, registers, ...
[Figure: the process's address space (code, initialized data, heap, one stack per thread, DLLs, reserved region), with a TCB per thread holding $pc, $sp, state, and registers]
Benefits
Responsiveness: when one thread is blocked, your browser still responds
  - E.g., download images while allowing your interaction
Resource sharing: threads share the same address space
  - Reduces overhead (e.g., memory)
Economy: creating a new process costs memory and resources
  - E.g., in Solaris, creating a process is about 30 times slower than creating a thread
Utilization of MP architectures: threads can be executed in parallel on multiple processors
  - Increases concurrency and throughput
User-level Threads
Thread management is done by a user-level threads library
  - Switching is similar to calling a procedure
User code can control thread scheduling (without disturbing the underlying OS scheduler)
No OS kernel support is needed, so more portable
Low overhead on a thread switch
Three primary thread libraries:
POSIX Pthreads
Java threads
Win32 threads
Kernel Threads
A.k.a. lightweight processes in the literature
Supported by the kernel
Thread scheduling is fairer
Examples
Windows XP/2000
Solaris
Linux
Tru64 UNIX
Mac OS X
Multithreading Models
Many-to-One
One-to-One
Many-to-Many
Many-to-One
Many user-level threads are mapped to a single kernel thread
The entire process blocks if any thread makes a blocking system call
Threads cannot run in parallel on multiprocessors
Examples
Solaris Green Threads
GNU Portable Threads
Many-to-One Model
One-to-One
Each user-level thread maps to a kernel thread
One thread's blocking system call does not block the other threads
Enables parallel execution in an MP system
Downsides:
Performance/memory overhead of creating kernel threads
Restricts the number of threads that can be supported
Examples
Windows NT/XP/2000
Linux
Solaris 9 and later
One-to-one Model
Many-to-Many Model
Allows many user-level threads to be mapped to many kernel threads
Allows the operating system to create a sufficient number of kernel threads
Threads are multiplexed onto a smaller (or equal) number of kernel threads, specific to a particular application or machine
Examples
Solaris prior to version 9
Windows NT/2000 with the ThreadFiber package
Many-to-Many Model
Pipeline Hazards
Multithreading
Multi-Tasking Paradigm
Virtual memory makes it easy
A context switch could be expensive or require extra hardware
  - VIVT caches
  - VIPT caches
  - TLBs
[Figure: a conventional single-threaded superscalar with function units FU1-FU4; Threads 1-5 each run for an execution time quantum in turn, leaving many issue slots unused]
Multi-threading Paradigm
[Figure: issue slots (FU1-FU4) over execution time for five threads under each model: conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and chip multiprocessor (CMP, or multicore)]
Conventional Multithreading
Zero-overhead context switch
Duplicated contexts for threads
[Figure: a register file holding four thread contexts (0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7) selected by CtxtPtr, with memory shared by all threads]
Cycle Interleaving MT
Per-cycle, Per-thread instruction fetching
Examples: HEP, Horizon, Tera MTA, MIT M-machine
Interesting questions to consider
Does it need a sophisticated branch predictor?
Or does it need any speculative execution at all?
Get rid of “branch prediction”?
Get rid of “predication”?
Does it need any out-of-order execution capability?
Block Interleaving MT
Context switch on a specific event (dynamic pipelining)
Explicit switching: implementing a switch instruction
Implicit switching: triggered when a specific instruction class is fetched
Static switching (switch upon fetching)
  Switch-on-memory-instructions: Rhamma processor
  Switch-on-branch or switch-on-hard-to-predict-branch
  The trigger can be an implicit or explicit instruction
Dynamic switching
  Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
  Switch-on-use (lazy strategy of switch-on-cache-miss)
    Wait until the last minute
    A valid bit is needed for each register: cleared when the load issues, set when the data returns
  Switch-on-signal (e.g., interrupt)
  Predicated switch instruction based on conditions
No need to support a large number of threads
[Figure: an SMT pipeline with a register renamer per thread context]
Simultaneous Multithreading (SMT)
The SMT name was first used by UW; earlier versions came from UCSB [Nemirovsky, HICSS '91] and [Hirata et al., ISCA '92]
Intel's Hyper-Threading (2-way SMT)
IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 in-order cores