Source: pwp.gatech.edu/ece-ece3056-sy/wp-content/uploads/...
1
Parallelism
Lecture notes from MKP and S. Yalamanchili
(2)
Overview
• Goal: Understand how to scale performance via parallelism
  - Execute multiple instructions in parallel – instruction level parallelism (ILP)
  - Break up a program into multiple parallel instruction streams – thread level parallelism (TLP)
  - Process multiple data items in parallel – data level parallelism (DLP)
• Consequences
  - Coordinating parallelism for correctness
  - What about caching?
• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls (e.g., I/O), others are executed
• Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
(35)
Simultaneous Multithreading
• In multiple-issue dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
• Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyperthreading in Intel terminology
(36)
Hyper-threading
• Implementation of Hyper-threading adds less than 5% to the chip area
• Principle: share major logic components (functional units) and improve utilization
• Architecture State: All core pipeline resources needed for executing a thread
[Figure: 2 CPU without hyper-threading vs. 2 CPU with hyper-threading. Without hyper-threading, each CPU pairs one architecture state with its own execution resources; with hyper-threading, each CPU's execution resources are shared by two architecture states.]
(37)
Multithreading with ILP: Examples
(38)
Thread Synchronization (6.5)

[Figure: a process containing multiple threads; do the threads share data?]
(39)
Thread Interactions
• What about shared data?
  - Need synchronization support
• Several different types of synchronization: we will look at one in detail
  - We are specifically interested in the exposure in the ISA
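The detail slides are not reproduced here, but one classic synchronization primitive exposed by an ISA is an atomic test-and-set (exchange). A minimal sketch, assuming C11 atomics and POSIX threads as stand-ins for the hardware primitive; the names `spin_lock`, `worker`, and `run_two_threads` are illustrative, not from the slides:

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long shared_counter = 0;

/* Acquire: spin on the atomic exchange until the flag was clear. */
static void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock_flag))
        ;  /* busy-wait */
}

static void spin_unlock(void) {
    atomic_flag_clear(&lock_flag);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        shared_counter++;   /* critical section: safe under the lock */
        spin_unlock();
    }
    return NULL;
}

/* Run two contending threads; returns the final counter value,
 * which is 200000 exactly because the lock serializes the updates. */
long run_two_threads(void) {
    pthread_t a, b;
    shared_counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

Without the lock, the two threads would intermittently lose updates to `shared_counter`, which is exactly the hazard the coming producer/consumer example exposes.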
(40)
Example: Communicating Threads
The Producer (Thread 1) calls:

    while (1) {
        while (count == BUFFER_SIZE)
            ;  // do nothing
        // add an item to the buffer
        ++count;
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
    }
(41)
Example: Communicating Threads
The Consumer (Thread 2) calls:

    while (1) {
        while (count == 0)
            ;  // do nothing
        // remove an item from the buffer
        --count;
        item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
    }
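The two loops above busy-wait and update `count` without any synchronization, so the threads can race. A hedged sketch of the same bounded buffer protected with a pthread mutex and condition variables; the names `buf_put` and `buf_get` are illustrative, not from the slides:

```c
#include <pthread.h>

#define BUFFER_SIZE 8

static int buffer[BUFFER_SIZE];
static int in = 0, out = 0, count = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* Producer side: block (instead of spinning) while the buffer is full. */
void buf_put(int item) {
    pthread_mutex_lock(&m);
    while (count == BUFFER_SIZE)
        pthread_cond_wait(&not_full, &m);
    buffer[in] = item;
    in = (in + 1) % BUFFER_SIZE;
    ++count;                          /* protected by the mutex */
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&m);
}

/* Consumer side: block while the buffer is empty. */
int buf_get(void) {
    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&not_empty, &m);
    int item = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    --count;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&m);
    return item;
}
```

The `while` (rather than `if`) around each `pthread_cond_wait` re-checks the condition after wakeup, the standard guard against spurious wakeups.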
(42)
Uniprocessor Implementation
• count++ could be implemented as a load, an increment, and a store
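The expansion the slide alludes to is a separate load, add, and store; a sketch in C, with the assembly mnemonics in the comments given only as illustrations:

```c
#include <stdatomic.h>

/* count++ is really three separate steps; another thread can run
 * between any two of them and an update can be lost. */
int count_plus_plus(int count) {
    int reg = count;    /* load the current value (e.g., lw)  */
    reg = reg + 1;      /* increment in a register (e.g., addi) */
    count = reg;        /* store the result back (e.g., sw)   */
    return count;
}

/* The atomic read-modify-write a shared counter needs, as exposed
 * by C11; the hardware guarantees no other access intervenes. */
int atomic_count_plus_plus(atomic_int *count) {
    return atomic_fetch_add(count, 1) + 1;
}
```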
• Strong scaling: problem size fixed
  - As in example
• Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    o Time = 20 × tadd
  - 100 processors, 32 × 32 matrix
    o Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  - Constant performance in this example
  - For a fixed size system grow the number of processors to improve performance
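The weak-scaling arithmetic above can be checked with a tiny model. This assumes, matching the slide's example, 10 additions of serial scalar work plus the matrix elements split evenly across processors, with time measured in units of tadd; the function name is illustrative:

```c
/* Execution-time model from the slide, in units of tadd:
 * 10 additions of serial (scalar) work, plus the matrix-sum work
 * divided evenly across `processors`. */
int time_in_tadd(int processors, int matrix_elements) {
    return 10 + matrix_elements / processors;
}
```

With 10 processors and a 10 × 10 matrix (100 elements), or 100 processors and roughly 1000 elements, the model gives the same 20 × tadd, which is why the example shows constant performance under weak scaling.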
(55)
Cache Coherence (5.10)
• A shared variable may exist in multiple caches
• Multiple copies to improve latency
• This is really a synchronization problem
(56)
Cache Coherence Problem
• Suppose two CPU cores share a physical address space
  - Write-through caches

Time step | Event                | CPU A's cache | CPU B's cache | Memory
0         |                      |               |               | 0
1         | CPU A reads X        | 0             |               | 0
2         | CPU B reads X        | 0             | 0             | 0
3         | CPU A writes 1 to X  | 1             | 0             | 1
(57)
Example (Writeback Cache)
[Figure: three processors, each with a private writeback cache, above a shared memory where X = -100. One processor writes X = 505 into its own cache; subsequent reads (Rd?) by the other two processors still observe the stale value X = -100.]

Courtesy H. H. Lee
(58)
Coherence Defined
• Informally: Reads return most recently written value
• Formally:
  - P writes X; P reads X (no intervening writes)
    ⇒ read returns written value
  - P1 writes X; P2 reads X (sufficiently later)
    ⇒ read returns written value
    o c.f. CPU B reading X after step 3 in example
  - P1 writes X, P2 writes X
    ⇒ all processors see writes in the same order
    o End up with the same final value for X
(59)
Cache Coherence Protocols
• Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    o Reduces bandwidth for shared memory
  - Replication of read-shared data
    o Reduces contention for access
• Snooping protocols
  - Each cache monitors bus reads/writes
• Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory
(60)
Invalidating Snooping Protocols
• Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
    o Owning cache supplies updated value

CPU activity         | Bus activity      | CPU A's cache | CPU B's cache | Memory
                     |                   |               |               | 0
CPU A reads X        | Cache miss for X  | 0             |               | 0
CPU B reads X        | Cache miss for X  | 0             | 0             | 0
CPU A writes 1 to X  | Invalidate for X  | 1             |               | 0
CPU B reads X        | Cache miss for X  | 1             | 1             | 1
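The table's behavior can be sketched as a toy simulation: a hedged model of two single-block writeback caches with bus invalidation, where the owning cache supplies (and writes back) the value on a remote miss. All type and function names are illustrative:

```c
enum { NUM_CPUS = 2 };

typedef struct { int value; int valid; } CacheLine;

typedef struct {
    CacheLine cache[NUM_CPUS];  /* one block per CPU's cache */
    int memory;                 /* writeback: may lag the caches */
} Sys;

/* Read X on `cpu`: a hit if the line is valid; on a miss the other
 * cache, if it holds the block, supplies the value (written back
 * to memory as it does so), otherwise memory supplies it. */
int cpu_read(Sys *s, int cpu) {
    if (!s->cache[cpu].valid) {
        int other = 1 - cpu;
        if (s->cache[other].valid)
            s->memory = s->cache[other].value;  /* owner supplies + writeback */
        s->cache[cpu].value = s->memory;
        s->cache[cpu].valid = 1;
    }
    return s->cache[cpu].value;
}

/* Write X on `cpu`: broadcast an invalidate so the other cache drops
 * any stale copy, then write locally (writeback: memory not updated). */
void cpu_write(Sys *s, int cpu, int value) {
    s->cache[1 - cpu].valid = 0;   /* invalidate message on the bus */
    s->cache[cpu].value = value;
    s->cache[cpu].valid = 1;
}
```

Replaying the table's event sequence through this model reproduces its final row: both caches and memory end up holding 1.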
(61)
Programming Model: Message Passing (6.7)
• Each processor has private physical address space
• Hardware sends/receives messages between processors
(62)
Parallelism
• Write message passing programs
• Explicit send and receive of data
  - Rather than accessing data in shared memory

[Figure: two processes exchanging data via matching send() and receive() calls]
(63)
High Performance Computing
• The dominant programming model is message passing
• Scales well but requires programmer effort
• Science problems have fit this model well to date
(64)
A Simple MPI Program

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int myid, numprocs;
        int tag, source, destination, count;
        int buffer;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        tag = 1234;
        source = 0;
        destination = 1;
        count = 1;
        if (myid == source) {
            buffer = 5678;
            MPI_Send(&buffer, count, MPI_INT, destination, tag, MPI_COMM_WORLD);
            printf("processor %d sent %d\n", myid, buffer);
        }
        if (myid == destination) {
            MPI_Recv(&buffer, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
            printf("processor %d got %d\n", myid, buffer);
        }
        MPI_Finalize();
        return 0;
    }

The Message Passing Interface (MPI) library
From http://geco.mines.edu/workshop/class2/examples/mpi/c_ex01.c
(65)
A Simple MPI Program

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int n, myid, numprocs, i;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        while (1) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0)
                break;
            else {
                h = 1.0 / (double) n;
                sum = 0.0;
                for (i = myid + 1; i <= n; i += numprocs) {
                    x = h * ((double)i - 0.5);
                    sum += (4.0 / (1.0 + x*x));
                }
                mypi = h * sum;
                MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
                if (myid == 0)
                    printf("pi is approximately %.16f, Error is %.16f\n",
                           pi, fabs(pi - PI25DT));
            }
        }
        MPI_Finalize();
        return 0;
    }