Inter-Processor Parallel Architecture

Outline
• Parallel Architectures
• Symmetric multiprocessor architecture (SMP)
• Distributed-memory multiprocessor architecture
• Clusters
• The Grid
• The Cloud
• Multicore architecture
• Simultaneous Multithreaded architecture
This module was created with support from NSF under grant # DUE 1141022
Module developed Spring 2013 by Apan Qasem
Some slides adapted from Patterson and Hennessy, 4th Edition, with permission
pthread_mutex_lock(&mutex);
/* code that modifies foo */
foo = foo + 1;
pthread_mutex_unlock(&mutex);
• Any thread executing the critical section will perform the load, add, and store without any intervening operations on foo
• To support this locking mechanism, we need atomic operations from the hardware
At any point, only one thread executes this code: the critical section.
Process Synchronization in SMP Architecture
• Need to be able to coordinate processes working on the same data
• At the program level, semaphores or mutexes can be used to synchronize processes and implement critical sections
• Need architectural support to lock shared variables
  • an atomic swap operation, e.g. ll and sc on MIPS, swap on SPARC (a spinlock sketch follows below)
• Need architectural support to determine which processor gets access to the lock variable
  • a single bus provides the arbitration mechanism, since the bus is the only path to memory
  • the processor that gets the bus wins
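To make the atomic-operation point concrete, here is a minimal sketch (not from the slides) of a spinlock built on C11 atomics; on MIPS the atomic test-and-set is typically lowered by the compiler to an ll/sc pair. The names my_spin_lock and my_spin_unlock are illustrative only.

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* shared lock variable */

static void my_spin_lock(void) {
    /* atomic read-modify-write: spin until the flag changes from clear to set */
    while (atomic_flag_test_and_set(&lock))
        ;  /* busy-wait: another thread holds the lock */
}

static void my_spin_unlock(void) {
    atomic_flag_clear(&lock);                 /* release the lock */
}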
Shared-Memory Multiprocessor (SMP)
• Single address space shared by all processors
  • akin to a multi-threaded program
• Processors communicate through shared variables in memory
• Architecture must provide features to coordinate access to shared data (a small runnable sketch follows below)
What’s a big disadvantage?
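As a small, self-contained illustration of the shared-address-space model (a sketch, not from the slides), two pthreads communicate through a shared counter, using a mutex as in the earlier critical-section snippet; compile with -pthread:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
long foo = 0;                          /* shared variable in the single address space */

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mutex);    /* enter critical section */
        foo = foo + 1;
        pthread_mutex_unlock(&mutex);  /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("foo = %ld\n", foo);        /* 200000: both threads saw the same memory */
    return 0;
}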
Types of SMP
• SMPs come in two styles
  • Uniform memory access (UMA) multiprocessors
    • any memory access takes the same amount of time
  • Non-uniform memory access (NUMA) multiprocessors
    • memory is divided into banks
    • memory latency depends on where the data is located
• NUMAs are harder to program but easier to design
• NUMAs can scale to larger sizes and have lower latency to local memory, leading to better overall performance
• Most SMPs in use today are NUMA (a sketch of NUMA-aware allocation follows below)
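To illustrate why data placement matters on a NUMA machine, here is a hedged sketch using the Linux libnuma library (an assumption of this example, not something the slides use), which lets a program allocate memory on a chosen node so threads running there see the lower local latency; link with -lnuma:

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "this system does not support the NUMA API\n");
        return 1;
    }
    size_t size = 64UL * 1024 * 1024;              /* 64 MB */
    /* place the allocation in the memory bank attached to node 0 */
    double *data = numa_alloc_onnode(size, 0);
    if (data == NULL) return 1;
    data[0] = 1.0;                                  /* touch the memory */
    printf("highest NUMA node on this machine: %d\n", numa_max_node());
    numa_free(data, size);
    return 0;
}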
Distributed Memory Systems
• Multiple processors, each with its own address space connected via I/O bus
• Processors share data by explicitly sending and receiving information (message passing)
• Coordination is built into the message-passing primitives (message send and message receive); a minimal MPI sketch follows below
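A minimal MPI sketch (illustrative, not from the slides) of explicit message passing: rank 0 sends a value and rank 1 receives it, with no shared address space involved; run with something like mpirun -np 2 ./a.out:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value;
    if (rank == 0) {
        value = 42;
        /* explicit send: one int to rank 1, tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* explicit receive: one int from rank 0, tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}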
Specialized Interconnection Networks
• For distributed memory systems, the speed of communication between processors is critical
• An I/O bus or Ethernet, although viable, does not provide the necessary performance
  • latency is important
  • high throughput is important
• Most distributed systems today are built with specialized interconnection networks
  • InfiniBand
  • Myrinet
  • Quadrics
Clusters
• Clusters are a type of distributed memory system
  • they are off-the-shelf, whole computers with multiple private address spaces, connected using the I/O bus and network switches
  • lower bandwidth than multiprocessors that use the processor-memory (front-side) bus
  • lower-speed network links
  • more conflicts with I/O traffic
• Each node has its own OS, limiting the memory available for applications
• Improved system availability and expandability
  • easier to replace a machine without bringing down the whole system
  • allows rapid, incremental expansion
• Economies-of-scale advantages with respect to cost
Interconnection Networks
• On distributed systems, processors can be arranged in a variety of topologies
• Typically, the more connections the network has, the better the performance and the higher the cost (some link counts are sketched below)
[Figure: network topologies: bus, ring, 2D mesh, N-cube (N = 3), fully connected]
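As a rough worked comparison (not on the slide), for P processors: a bus is a single shared link, a ring has P links, a square 2D mesh has about 2(P - sqrt(P)) links, an N-cube has (P/2) * log2(P) links, and a fully connected network has P(P - 1)/2 links, which is why the fully connected topology gives the best performance at the highest cost.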
SMPs vs. Distributed Systems
SMP
• communication happens through shared memory
• easier to program with shared variables, but the hardware is harder to design and scale
• not scalable
• needs a special OS that manages all processors
• programming API: OpenMP
• administration cost: low

Distributed
• needs explicit communication (message passing)
• harder to program, but easier to build from off-the-shelf parts and to scale
• scalable
• can use a regular OS on each node
• programming API: MPI
• administration cost: high
Power Density
[Chart courtesy: Pat Gelsinger, Intel Developer Forum, 2004]
Heat is becoming an unmanageable problem
The Power Wall
• Moore’s law still holds, but exploiting it no longer seems economically feasible
  • power dissipation (and the associated costs) is too high
• Solution
  • put multiple simplified cores in the same chip area
  • less power dissipation => less heat => lower cost
Multicore Chips
[Figure: IBM Blue Gene/L, Tilera64, Intel Core 2 Duo]
• shared caches
• high-speed communication
Intel - Nehalem
AMD - Shanghai
CMP Architectural Considerations
• In a way, each multicore chip is an SMP
  • memory => cache
  • processor => core
• Architectural considerations are the same as for SMPs
  • Scalability: how many cores can we hook up to an L2 cache?
  • Sharing: how do concurrent threads share data? Through the LLC or memory, kept consistent by cache coherence protocols
  • Communication: how do threads communicate? Semaphores and locks, using the cache if possible
Simultaneous Multithreading (SMT)
• Many architectures today support multiple hardware (HW) threads
• SMT uses the resources of a superscalar processor to exploit both ILP and thread-level parallelism
  • having more instructions to play with gives the scheduler more scheduling opportunities
  • there are no dependences between threads from different programs
  • registers need to be renamed
• Intel calls its SMT technology hyperthreading
  • on most machines today, you have SMT on every core
  • theoretically, a quad-core machine gives you 8 logical processors with hyperthreading (a small sketch of how to query this follows below)
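As a quick illustration of the 4-core / 8-logical-processor point, on Linux and most Unix systems a program can ask how many logical processors are online (a sketch, not from the slides):

#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* counts hardware threads, not physical cores, when SMT is enabled */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", n);   /* e.g., 8 on a quad-core with SMT */
    return 0;
}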
!$omp parallel do private(i,j)
do j = 1, N
  do i = 1, M
    a(i,j) = 17
  enddo
  b(j) = 17
  c(j) = 17
  d(j) = 17
enddo
!$omp end parallel do
Example : Data Parallel Code in OpenMP
do jj = 1, N, 16
!$omp parallel do private(i,j)
  do j = jj, min(jj+16-1, N)
    do i = 1, M
      a(i,j) = 17
    enddo
    b(j) = 17
    c(j) = 17
    d(j) = 17
  enddo
!$omp end parallel do
enddo

Here the outer jj loop blocks the parallel work into chunks of 16 columns, so each parallel region touches only a bounded slice of a, b, c, and d (the D/k chunks discussed on the next slide) rather than streaming over the whole arrays at once.
Shared-cache and Data Parallelization
• Shared caches make the task of finding k much more difficult; minimizing communication is no longer the sole objective
[Figure: a data set D partitioned across k threads into chunks of size D/k, with the goal that each chunk fits in cache: D/k ≤ cache capacity]
• Parallel algorithms for CMPs need to be cache-aware
Data Parallelism
• This type of parallelism is sometimes referred to as loop-level parallelism
• Quite common in scientific computation
• Also some sorting algorithms. Which ones?
Task Parallelism
[Figure: threads t0 and t1 work on data set D0 while threads t2 and t3 work on data set D1]
Example : Task Parallel Code
if (thread == 0)
    do fileProcessing
else if (thread == 1)
    listen for requests
MPI and OpenMP are often not a good fit for this; you want pthreads (a sketch follows below)
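A minimal pthreads sketch of this task-parallel pattern (illustrative only; file_processing and request_listener are hypothetical stand-ins for the tasks on the slide); compile with -pthread:

#include <pthread.h>
#include <stdio.h>

void *file_processing(void *arg) {
    printf("thread 0: processing files\n");
    return NULL;
}

void *request_listener(void *arg) {
    printf("thread 1: listening for requests\n");
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    /* each thread runs a different task, unlike the data-parallel loops above */
    pthread_create(&t0, NULL, file_processing, NULL);
    pthread_create(&t1, NULL, request_listener, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}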
Pipelined Parallelism
[Figure: a producer thread (P) and a consumer thread (C) working on a shared data set; the distance between them over time is the synchronization window. A minimal producer/consumer sketch follows below.]
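A minimal producer/consumer sketch in pthreads (not from the slides; the buffer size N stands in for the synchronization window): the producer stays at most N items ahead of the consumer, and the condition variables enforce that window; compile with -pthread:

#include <pthread.h>
#include <stdio.h>

#define N 16                     /* buffer capacity: the "synchronization window" */

int buffer[N];
int count = 0, in = 0, out = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void *producer(void *arg) {
    for (int i = 0; i < 100; i++) {
        pthread_mutex_lock(&m);
        while (count == N)                  /* window full: producer waits */
            pthread_cond_wait(&not_full, &m);
        buffer[in] = i;
        in = (in + 1) % N;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 100; i++) {
        pthread_mutex_lock(&m);
        while (count == 0)                  /* window empty: consumer waits */
            pthread_cond_wait(&not_empty, &m);
        int v = buffer[out];
        out = (out + 1) % N;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        printf("consumed %d\n", v);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}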
Synchronization Window in Pipelined Parallelism
[Figure: three choices of synchronization window, labeled Bad, Not as bad, and Better?]
Amdahl’s Law and Parallelism

Speedup due to enhancement E is

  Speedup w/ E = Exec time w/o E / Exec time w/ E

Suppose fraction F of the computation can be parallelized across P processing cores. Then

  ExTime w/ E = ExTime w/o E × ((1-F) + F/P)

  Speedup w/ E = 1 / ((1-F) + F/P)

[Photo: Gene Amdahl]
Amdahl’s Law and Parallelism
• Assume we can parallelize 25% of the program and we have 20 processing cores
Speedup w/ E = 1/(.75 + .25/20) = 1.31
• If only 15% is parallelized, Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with n processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99
Speedup with parallelism = 1 / ((1-F) + F/P)
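The arithmetic above can be checked with a few lines of C (a helper of my own, not part of the module's code):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1-F) + F/P) */
static double amdahl_speedup(double f, double p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    printf("F=0.25,  P=20:  %.2f\n", amdahl_speedup(0.25, 20));    /* 1.31 */
    printf("F=0.15,  P=20:  %.2f\n", amdahl_speedup(0.15, 20));    /* 1.17 */
    printf("F=0.999, P=100: %.2f\n", amdahl_speedup(0.999, 100));  /* 90.99 */
    return 0;
}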
[Chart: maximum theoretical speedup in relation to the number of processors]
Reality is often different!
Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem
• Strong scaling
  • speedup achieved on a multiprocessor without increasing the size of the problem
• Weak scaling
  • speedup achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of processors
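As a small worked contrast (my illustration, not from the slides): suppose a job takes 100 seconds on 10 processors. Strong scaling asks how fast that same fixed problem runs on 100 processors (ideally 10 seconds); weak scaling instead grows the problem tenfold, so each of the 100 processors keeps its original share of work and the run time ideally stays near 100 seconds.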
Load Balancing
• Load balancing is another important factor in parallel performance
• Just a single processor with twice the load of the others cuts the speedup almost in half (see the worked example below)
• If the complexity of the parallel threads varies, the complexity of the overall algorithm is dominated by the thread with the worst complexity
• Granularity of the parallel tasks is important
  • adapt granularity based on architectural characteristics
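To see the "almost in half" claim concretely (a worked illustration, not from the slides): with 10 processors and a perfectly balanced workload W, each processor does W/10 and the speedup is 10. If one processor instead gets twice the share of the other nine, the nine do W/11 each while the loaded one does 2W/11, so the parallel run time is 2W/11 and the speedup drops to 5.5.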
Load Balancing
[Figure: per-thread workloads; one thread's load dominates the rest]
What are the architectural implications?