Transcript
EECC756 - Shaaban, lec # 1, Spring 2011, 3-8-2011
• A Generic Parallel Computer Architecture
• The Need And Feasibility of Parallel Computing
  – Scientific Supercomputing Trends
  – CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
  – Computer System Peak FLOP Rating History/Near Future
• The Goal of Parallel Processing
• Elements of Parallel Computing
• Factors Affecting Parallel System Performance
• Parallel Architectures History
  – Parallel Programming Models
  – Flynn's 1972 Classification of Computer Architecture
• Current Trends In Parallel Architectures
  – Modern Parallel Architecture Layered Framework
• Shared Address Space Parallel Architectures
• Message-Passing Multicomputers: Message-Passing Programming Tools
• Data Parallel Systems
• Dataflow Architectures
• Systolic Architectures: Matrix Multiplication Systolic Array Example

PCA Chapter 1.1, 1.2
Parallel Computer Architecture

A parallel computer (or multiple processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing.

• Broad issues involved:
  – The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs).
  – Computing resources and computation allocation:
    • The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
    • What portions of the computation and data are allocated or mapped to each PE.
  – Data access, communication and synchronization:
    • How the processing elements cooperate and communicate.
    • How data is shared/transmitted between processors.
    • Abstractions and primitives for cooperation/communication and synchronization.
    • The characteristics and performance of the parallel system network (system interconnects).
  – Parallel processing performance and scalability goals:
    • Maximize the performance enhancement of parallelism: maximize speedup, by minimizing parallelization overheads and balancing the workload on processors.
    • Scalability of performance to larger systems/problems.

Processor = programmable computing element that runs stored programs written using a pre-defined instruction set. Processing Elements = PEs = Processors.
Task = computation done on one processor.
A Generic Parallel Computer Architecture

[Figure: processing nodes, each with one or more processors (P), cache ($), memory (Mem), and a network interface, AKA communication assist (CA), connected by a parallel machine network (custom or industry standard). One or more processing elements or processors per node: custom or commercial microprocessors; single or multiple processors per chip (2-8 cores per chip); homogeneous or heterogeneous. The operating system and parallel programming environments run on top.]

• Processing nodes: each processing node contains one or more processing elements (PEs) or processors, a memory system, plus a communication assist (network interface and communication controller).
• Parallel machine network (system interconnects): the function of a parallel machine network is to efficiently (i.e. at reduced communication cost) transfer information (data, results, ...) from a source node to a destination node, as needed to allow cooperation among parallel processing nodes to solve large computational problems divided into a number of parallel computational tasks.

Parallel Computer = Multiple Processor System
The Need And Feasibility of Parallel Computing

• Application demands: more computing cycles/memory needed (the driving force).
  – Scientific/engineering applications, gaming, ...
  – Mainstream multithreaded programs are similar to parallel programs (+ multi-tasking: multiple independent programs).
• Technology trends:
  – The number of transistors on chip is growing rapidly (Moore's Law still alive). Clock rates are expected to continue to go up, but only slowly. Actual performance returns are diminishing due to deeper pipelines.
  – Increased transistor density allows integrating multiple processor cores per chip, creating chip-multiprocessors (CMPs, multi-core processors), even for mainstream computing applications (desktop/laptop, ...).
• Architecture trends:
  – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
  – Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
  – Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems, is the most viable approach to further improve performance: the main motivation for the development of chip-multiprocessors (CMPs).
• Economics:
  – The increased utilization of commodity off-the-shelf (COTS) components in high performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
    • Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs.
    • Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.
Why is Parallel Processing Needed?
Challenging Applications in Applied Science/Engineering

• Astrophysics
• Atmospheric and Ocean Modeling
• Bioinformatics
• Biomolecular simulation: Protein folding
• Computational Chemistry
• Computational Fluid Dynamics (CFD)
• Computational Physics
• Computer vision and image understanding
• Data Mining and Data-intensive Computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
• Material Sciences
• Military applications
• Quantum chemistry
• VLSI design
• ...

Such applications have very high (1) computational and (2) memory requirements that cannot be met with single-processor architectures. Many applications contain a large degree of computational parallelism: the traditional driving force for High Performance Computing (HPC) and multiple processor system development.
Why is Parallel Processing Needed?
Scientific Computing Demands

[Figure: computational and memory requirements of demanding scientific applications, versus ~3-5 GFLOPS for a uniprocessor.]

Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems: the driving force for HPC and multiple processor system development.
Scientific Supercomputing Trends

• Proving ground and driver for innovative architecture and advanced high performance computing (HPC) techniques:
  – The market is much smaller relative to the commercial (desktop/server) segment.
  – Dominated by costly vector machines starting in the 1970s through the 1980s.
  – Microprocessors have made huge gains in the performance needed for such applications (enabled by high transistor density/chip):
    • High clock rates. (Bad: higher CPI?)
    • Multiple pipelined floating point units.
    • Instruction-level parallelism.
    • Effective use of caches.
    • Multiple processor cores/chip (2 cores 2002-2005, 4 end of 2006, 6-12 cores 2011).
  – However, even the fastest current single-microprocessor systems still cannot meet the needed computational demands (as shown in the last slide).
• Currently: large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (and have largely replaced) the vector supercomputers that utilize custom processors.
Uniprocessor Performance Evaluation

• CPU performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
  – Total CPU time (in seconds):
      T = TC / f = TC x C = I x CPI x C = I x (CPI_execution + M x k) x C
      TC = total program execution clock cycles
      f = clock rate
      C = CPU clock cycle time = 1/f
      I = instructions executed count
      CPI = cycles per instruction
      CPI_execution = CPI with ideal memory
      M = memory stall cycles per memory access
      k = memory accesses per instruction
  – MIPS rating = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6)  (in million instructions per second)
  – Throughput rate: Wp = 1/T = f / (I x CPI) = (MIPS) x 10^6 / I  (in programs per second)
• Performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
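The CPU time, MIPS, and throughput formulas above can be checked numerically. A minimal sketch, with all numeric values purely hypothetical (a 2 GHz clock, 10^9 instructions, etc.):

```python
# Evaluate T = I x CPI x C with CPI = CPI_execution + M x k.
# All inputs below are assumed example values, not measurements.

f = 2.0e9            # clock rate in Hz (assumed)
C = 1.0 / f          # CPU clock cycle time in seconds
I = 1.0e9            # instructions executed count (assumed)
CPI_execution = 1.2  # CPI with ideal memory (assumed)
M = 40               # memory stall cycles per memory access (assumed)
k = 0.3              # memory accesses per instruction (assumed)

CPI = CPI_execution + M * k      # effective cycles per instruction = 13.2
T = I * CPI * C                  # total CPU time in seconds = 6.6
MIPS = f / (CPI * 1.0e6)         # MIPS rating ~ 151.5
Wp = 1.0 / T                     # throughput in programs per second ~ 0.15

print(CPI, T, MIPS, Wp)
```

Note how memory stalls dominate here: M x k = 12 stall cycles per instruction dwarfs the ideal CPI of 1.2, which is why cache behavior appears in the list of performance factors.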
Single CPU Performance Trends

[Figure: log-scale performance, 1965-1995, of supercomputers and mainframes (custom processors), minicomputers, and microprocessors (commodity processors); microprocessor performance grows the fastest.]

• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.
Microprocessor Frequency Trend

[Figure: clock frequency (MHz, log scale, 10 to 10,000) and gate delays per clock (1 to 100) versus year, 1987-2005, for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601, 603, 604, 604+, MPC750), and DEC Alpha (21066, 21064, 21064A, 21164, 21164A, 21264, 21264S) processors.]

• Processor frequency used to scale by 2x per generation; this is no longer the case.
  – Frequency doubled each generation while the number of gates per clock was reduced by ~25%.
  – This leads to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages).
• Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle). Recall T = I x CPI x C.
• Reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?) Why? (1) Power leakage, (2) clock distribution delays.
• Solution: exploit TLP at the chip level: chip-multiprocessors (CMPs).
• One billion transistors/chip was reached in 2005, two billion in 2008-9, now ~three billion.
• Transistor count grows faster than clock rate: currently ~40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count.

Current top LINPACK performance (since ~Nov. 2010): about 2,566,000 GFlop/s = 2,566 TeraFlops = 2.566 PetaFlops, achieved by Tianhe-1A (@ National Supercomputing Center in Tianjin, China), with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 (8-core?) GPUs.

The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org
Why is Parallel Processing Needed?
LINPACK Performance Trends

[Figure, left panel, uniprocessor performance: LINPACK MFLOPS (log scale, 1 to 10,000), 1975-2000, for CRAY (n = 100 and n = 1,000) versus microprocessors (n = 100 and n = 1,000). CRAY systems: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94. Microprocessor systems: DEC 8200, IBM Power2/990, MIPS R4400, HP 9000/735, DEC Alpha, DEC Alpha AXP, HP 9000/750, IBM RS6000/540, MIPS M/2000, MIPS M/120, Sun 4/260. 1 GFLOP = 10^9 FLOPS.]

[Figure, right panel, parallel system performance: LINPACK GFLOPS (log scale, 0.1 to 10,000), 1985-2000, CRAY peak versus MPP peak. CRAY systems: Xmp/416(4), Ymp/832(8), C90(16), T932(32). MPPs: iPSC/860, nCUBE/2(1024), CM-2, CM-200, CM-5, Delta, Paragon XP/S, Paragon XP/S MP (1024), Paragon XP/S MP (6768), T3D, ASCI Red. 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS.]
Computer System Peak FLOP Rating History

[Figure: peak FLOP rating history/near future, reaching Tianhe-1A. Teraflop = 10^12 FLOPS = 1000 GFLOPS; Peta FLOP = 10^15 FLOPS = 1000 TeraFLOPS.]

Current top peak FP performance (since ~Nov. 2010): about 4,701,000 GFlop/s = 4,701 TeraFlops = 4.701 PetaFlops, achieved by Tianhe-1A (@ National Supercomputing Center in Tianjin, China), with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 (8-core?) GPUs.

The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org
TOP500 Supercomputers, November 2005 List

[Table of the top systems omitted.] Source (and for current list): www.top500.org
TOP500 Supercomputers
32nd List (November 2008): The Top 10

[Table of the top 10 systems, with power in KW, omitted.] Source (and for current list): www.top500.org
TOP500 Supercomputers
34th List (November 2009): The Top 10

[Table of the top 10 systems, with power in KW, omitted.] Source (and for current list): www.top500.org
TOP500 Supercomputers
36th List (November 2010): The Top 10 (current list)

[Table of the top 10 systems, with power in KW, omitted.] Source (and for current list): www.top500.org
The Goal of Parallel Processing

• Goal of applications in using parallel machines: maximize speedup over single processor performance.

    Parallel Speedup, Speedup_p (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time:

    Fixed problem size parallel speedup (p processors) = Time (1 processor) / Time (p processors)

• Ideal speedup = number of processors = p. Very hard to achieve, due to parallelization overheads: communication cost, dependencies, load imbalance, ...
The Goal of Parallel Processing

• Parallel processing goal is to maximize parallel speedup (fixed problem size):

    Speedup = Time(1) / Time(p)
            = Sequential work on one processor / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

  where the Max is taken over all processors (i.e. the processor with the maximum execution time), and Synch Wait Time + Comm Cost + Extra Work are the parallelization overheads.

• Ideal speedup = p = number of processors.
  – Very hard to achieve: implies no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
  – Balancing computations on processors (every processor does the same amount of work) and balancing the overheads.
  – Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.
• Performance scalability: achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
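The speedup bound above can be sketched directly: divide the sequential work by the finishing time of the slowest processor, including its overheads. A minimal sketch with hypothetical per-processor costs:

```python
# Speedup = sequential work / max over processors of
#           (work + synch wait + communication cost + extra work)

def parallel_speedup(seq_work, per_proc):
    """per_proc: list of (work, synch_wait, comm_cost, extra_work) tuples,
    one per processor; the slowest processor determines parallel time."""
    slowest = max(sum(costs) for costs in per_proc)
    return seq_work / slowest

# Perfectly balanced, zero overheads: ideal speedup = p = 4
print(parallel_speedup(100.0, [(25.0, 0, 0, 0)] * 4))            # -> 4.0

# Imbalance plus synch/comm overheads pull speedup below p
print(parallel_speedup(100.0,
                       [(30.0, 2.0, 3.0, 0.0)] +                 # slowest: 35
                       [(23.3, 2.0, 3.0, 0.0)] * 3))             # ~2.86
```

The second call shows both effects at once: one processor got more work (load imbalance), and every processor pays synchronization and communication costs, so 4 processors deliver well under 4x.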
Elements of Parallel Computing

Approaches to parallel programming (illustrated next):
(a) ... a library of the target parallel computer, or
(b) a concurrent (parallel) HLL.
• Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
Approaches to Parallel Programming

• Source code written in concurrent dialects of C, C++: the programmer explicitly specifies the parallelism.
• Alternatively, the compiler automatically detects parallelism in sequential source code and transforms it into parallel constructs/code.
Factors Affecting Parallel System Performance

• Parallel algorithm related:
  – Available concurrency (i.e. inherent parallelism) and its profile, grain size, uniformity, patterns.
    • Dependencies between computations, represented by a dependency graph.
  – Type of parallelism present: functional and/or data parallelism.
  – Required communication/synchronization, uniformity and patterns.
  – Data size requirements.
  – Communication to computation ratio (C-to-C ratio; lower is better).
• Parallel program related:
  – Programming model used.
  – Resulting data/code memory requirements, locality and working set characteristics.
  – Parallel task grain size.
  – Assignment (mapping) of tasks to processors: dynamic or static.
  – Cost of communication/synchronization primitives.
• Hardware/architecture related:
  – Total CPU computational power available (+ number of processors: hardware parallelism).
  – Types of computation modes supported.
  – Shared address space vs. message passing.
  – Communication network characteristics (topology, bandwidth, latency).
  – Memory hierarchy properties.

(Concurrency = parallelism.)
A Simple Parallel Execution Example

[Figure, left: task dependency graph of tasks A-G: A feeds B and C; B and C feed D, E, F; D, E, F feed G. Center: sequential execution of A-G on one processor, time 0-21. Right: a possible parallel execution schedule on two processors P0, P1, time 0-16, with communication (Comm) and idle periods marked.]

• Task = computation run on one processor.
• Assume the computation time for each task A-G = 3.
• Assume the communication time between parallel tasks = 1.
• Assume communication can overlap with computation.
• Speedup on two processors = T1/T2 = 21/16 = 1.3

What would the speedup be with 3 processors? 4 processors? 5 ... ?
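The example above can be estimated level by level. Assuming the dependency levels read from the graph are [A], [B, C], [D, E, F], [G] and ignoring communication (the slides assume it overlaps with computation; the extra time unit in the slide's T2 = 16 versus the 15 computed here presumably comes from one communication step that could not be hidden):

```python
import math

task_time = 3           # computation time per task A-G (from the slide)
levels = [1, 2, 3, 1]   # independent tasks per dependency level (assumed from graph)

def schedule_length(p):
    # Each level needs ceil(tasks / p) rounds of 3 time units; a level cannot
    # start until the previous one finishes (dependency constraint).
    return sum(math.ceil(n / p) * task_time for n in levels)

T1 = schedule_length(1)                  # sequential: 7 tasks x 3 = 21
print(T1, schedule_length(2))            # two processors: 15 (16 on the slide, with comm)
print(schedule_length(3))                # three processors: 12, speedup ~ 21/12
print(round(T1 / 16, 2))                 # slide's two-processor speedup: 1.31
```

With 4 or more processors the length stays at 12, since the widest level only has 3 independent tasks: the critical path (4 levels x 3) limits speedup regardless of processor count.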
Evolution of Computer Architecture

[Figure: evolution tree.]
• Scalar (non-pipelined) → sequential lookahead → I/E overlap (I/E: instruction fetch and execute) → functional parallelism → multiple functional units and pipelining (limited pipelining → pipelined, single or multiple issue).
• Pipelining → implicit vector (memory-to-memory) and explicit vector (register-to-register): vector/data parallel machines.
• Explicit vector → SIMD (processor arrays, associative processors: data parallel) and MIMD (multiprocessors: shared memory; multicomputers: message passing) parallel machines.
• MIMD → Massively Parallel Processors (MPPs) and computer clusters.

SIMD: Single Instruction stream over Multiple Data streams.
MIMD: Multiple Instruction streams over Multiple Data streams.
Parallel Architectures History

Historically, parallel architectures were tied to parallel programming models:
• Divergent architectures, with no predictable pattern of growth.

[Figure: application software and system software layered atop divergent architecture styles: SIMD (data parallel architectures), message passing, shared memory, dataflow, systolic arrays.]

(More on this next lecture.)
Parallel Programming Models

• Programming methodology used in coding parallel applications.
• Specifies: (1) communication and (2) synchronization.
• Examples:
  – Multiprogramming, or multi-tasking (not true parallel processing!): no communication or synchronization at the program level; a number of independent programs run on different processors in the system. (However, a good way to utilize multi-core processors for the masses!)
  – Shared memory address space (SAS): parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).
  – Message passing: explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks, using messages.
  – Data parallel: more regimented, global actions on data (i.e. the same operation over all elements of an array or vector). Can be implemented with a shared address space or message passing.
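The data parallel model above can be illustrated in miniature: one global action applied to every element of an array. This sequential sketch (with made-up 4-element arrays) only expresses the semantics; a data parallel machine or vector unit would apply the operation to all elements in lock step:

```python
# One global data parallel action: elementwise add over whole arrays.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

c = [x + y for x, y in zip(a, b)]   # same operation applied to every element
print(c)                            # -> [11, 22, 33, 44]
```

Contrast this with message passing, where the programmer would explicitly partition a and b across processes and exchange boundary data with sends and receives.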
Flynn's 1972 Classification of Computer Architecture

• Single Instruction stream over a Single Data stream (SISD): conventional sequential machines or uniprocessors.
• Single Instruction stream over Multiple Data streams (SIMD): vector computers, arrays of synchronized processing elements.
• Multiple Instruction streams over a Single Data stream (MISD): systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers.
Shared Address Space (SAS) Parallel Architectures
(Sometimes called tightly-coupled parallel computers.)

• Any processor can directly reference any memory location.
  – Communication occurs implicitly, as a result of loads and stores in the shared address space.
• Convenient:
  – Location transparency.
  – Similar programming model to time-sharing (i.e. multi-tasking) in uniprocessors:
    • Except processes run on different processors.
    • Good throughput on multiprogrammed workloads.
• Naturally provided on a wide range of platforms.
  – Wide range of scale: few to hundreds of processors.
• Popularly known as shared memory machines or model.
  – Ambiguous: memory may be physically distributed among processing nodes (i.e. distributed shared memory multiprocessors).
Shared Address Space (SAS) Parallel Programming Model

• Process: virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared:
  – Writes to a shared address are visible to other threads (in other processes too).
• Natural extension of the uniprocessor model:
  – Conventional memory operations (loads/stores) are used for communication; thus communication is implicit.
  – Special atomic operations are needed for synchronization (i.e. for event ordering and mutual exclusion): locks, semaphores, etc.; thus synchronization is explicit.
• The OS uses shared memory to coordinate processes.

[Figure: virtual address spaces for a collection of processes P0..Pn communicating via shared addresses. Each address space has a shared portion, mapped to common physical addresses in the machine physical address space, and a private portion (P0 private .. Pn private); a store by one process into the shared portion is seen by a load in another.]

In SAS: communication is implicit via loads/stores; ordering/synchronization is explicit, using synchronization primitives.
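The two halves of the SAS model can be seen in a few lines of code. A minimal sketch using Python threads as a stand-in for parallel hardware (thread count and iteration count are arbitrary): the shared variable is the implicit communication, and the lock is the explicit synchronization for mutual exclusion.

```python
import threading

counter = 0                  # shared location: visible to all threads
lock = threading.Lock()      # explicit synchronization primitive

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # mutual exclusion around the read-modify-write
            counter += 1     # ordinary load/store = implicit communication

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)               # -> 40000
```

Without the lock, the load-increment-store sequence from different threads could interleave and lose updates, which is exactly why the model needs special atomic operations on top of plain memory accesses.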
Models of Shared-Memory Multiprocessors

1. The Uniform Memory Access (UMA) model:
   – All physical memory is shared by all processors.
   – All processors have equal access (i.e. equal memory bandwidth and access latency) to all memory addresses.
   – Also referred to as Symmetric Memory Processors (SMPs).
2. Distributed memory, or the Non-Uniform Memory Access (NUMA) model:
   – Shared memory is physically distributed locally among processors. Access latency to remote memory is higher.
3. The Cache-Only Memory Architecture (COMA) model:
   – A special case of a NUMA machine where all distributed main memory is converted to caches.
   – No memory hierarchy at each processor.
Models of Shared-Memory Multiprocessors

[Figure (1), UMA: processors connected through an interconnect to shared memory modules (Mem) and I/O controllers/devices. Figure (2), distributed memory or NUMA: nodes of processor (P) + cache ($) with local memory (M), connected by a network; remote memory is reached over the network. Figure (3), COMA: nodes of processor (P), cache (C), and directory (D), connected by a network.]
Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad
(Circa 1997; 4-way SMP.)

[Figure: four P-Pro modules (CPU, 256-KB L2 $, bus interface, interrupt controller) on a shared P-Pro front side bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, plus PCI bridges to PCI buses and PCI I/O cards.]

• All coherence and multiprocessing glue is in the processor module.
• Highly integrated, targeted at high volume.
• Computing node used in Intel's ASCI-Red MPP.
• Bus-based Symmetric Memory Processors (SMPs): a single front side bus (FSB) is shared among processors. This severely limits scalability to only ~2-4 processors.
Non-Uniform Memory Access (NUMA) Example: AMD 8-way Opteron Server Node
(Circa 2003.)

• Dedicated point-to-point interconnects (HyperTransport links) are used to connect processors, alleviating the traditional limitations of FSB-based SMP systems.
• Each processor has two integrated DDR memory channel controllers: memory bandwidth scales up with the number of processors.
• NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system.
• Total of 16 processor cores when dual-core Opteron processors are used (32 cores with quad-core processors).
Distributed Shared-Memory Multiprocessor System Example: Cray T3E
(NUMA MPP example, circa 1995-1999. MPP = Massively Parallel Processor system.)

[Figure: each node has a processor (P) + cache ($), memory (Mem), a memory controller and network interface (the communication assist, CA), and external I/O, connected by switches into a 3D torus (X, Y, Z) point-to-point network.]

• Scales up to 2048 processors; DEC Alpha EV6 microprocessor (COTS).
• Custom 3D torus point-to-point network, 480 MB/s links.
• The memory controller generates communication requests for non-local references.
• No hardware mechanism for coherence (SGI Origin etc. provide this).
• Example of Non-Uniform Memory Access (NUMA).
• More recent Cray MPP example: the Cray X1E supercomputer.
Message-Passing Programming Model: Send/Receive Pairs

• Send specifies the buffer to be transmitted and the receiving process.
• Receive specifies the sending process and the application storage to receive into.
• Memory-to-memory copy is possible, but processes must be named.
• Optional tag on send and matching rule on receive.
• The user process names local data and entities in process/tag space too.
• In the simplest form, the send/receive match achieves an implicit pairwise synchronization event:
  – Ordering of computations according to dependencies (i.e. event ordering, in this case).
  – Blocking receive: the recipient blocks (waits) until the message is received.
• Many possible overheads: copying, buffer management, protection, ...

[Figure: sender process P executes Send(X, Q, t) (data at address X in P's local process address space, recipient Q, tag t); recipient process Q executes Receive(Y, P, t) (local address Y, sender P, tag t). The send and receive match on sender/recipient and tag, achieving pairwise synchronization and enforcing the data dependency/ordering.]

Communication is explicit via sends/receives; synchronization is implicit.
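The Send(X, Q, t) / Receive(Y, P, t) pairing above can be sketched with Python threads and a queue standing in for the network (real systems would use e.g. MPI; the process names P, Q and tag t follow the slide, and the matching rule here is a deliberate simplification):

```python
import queue
import threading

channel = queue.Queue()   # models the network link between P and Q
result = {}               # Q's local storage for the received value

def process_p():
    X = 42
    channel.put(("P", "t", X))            # Send(X, Q, t): sender name, tag, data

def process_q():
    sender, tag, data = channel.get()     # blocking receive: Q waits here until
    if (sender, tag) == ("P", "t"):       # a message arrives; match on sender/tag
        result["Y"] = data                # copy into local address Y

tq = threading.Thread(target=process_q)
tp = threading.Thread(target=process_p)
tq.start()                # Q blocks on the receive first ...
tp.start()                # ... then P's send unblocks it: implicit synchronization
tp.join()
tq.join()
print(result["Y"])        # -> 42
```

Note the blocking `get()` is what makes the synchronization implicit: Q cannot proceed past the receive until P's send has happened, so the data dependency is enforced without any separate lock or barrier.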
Data Parallel Systems
• All PEs are synchronized (same instruction or operation in a given cycle).
• Other data parallel architectures: vector machines.
Dataflow Architectures

• Represent computation as a graph of essential data dependencies: the dependency graph for the entire computation (program).
  – Non-von Neumann architecture (not PC-based architecture).
  – A logical processor at each node, activated by the availability of operands.
  – Messages (tokens, i.e. copies of computation results: data or results) carrying the tag of the next instruction are sent to the next processor.
  – The tag is compared with others in the matching store; a match fires execution.

[Figure, left: dataflow graph for the program a = (b + 1) x (b - c); d = c x e; f = a x d, with input tokens 1, b, c, e flowing into +, -, and x nodes. Right: one node of a dataflow machine: network → token store → waiting/matching → instruction fetch (from the program store) → execute → form token → network, with a token queue and a token distribution network.]

• Research dataflow machine prototypes include:
  – The MIT Tagged-Token Architecture.
  – The Manchester Dataflow Machine.
• The Tomasulo approach of dynamic instruction execution utilizes a dataflow-driven execution engine:
  – The data dependency graph for a small window of instructions is constructed dynamically as instructions are issued in program order.
  – The execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.
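Dataflow firing can be sketched for the slide's program a = (b + 1) x (b - c); d = c x e; f = a x d. In this toy evaluator (input values and node names are made up; real machines match tagged tokens in hardware), each node fires as soon as both of its operand tokens exist, regardless of program order:

```python
import operator

# Input tokens (assumed example values).
tokens = {"b": 2, "c": 3, "e": 4, "one": 1}

# Dataflow graph: node name -> (operation, left operand, right operand).
nodes = {
    "t1": (operator.add, "b", "one"),   # b + 1
    "t2": (operator.sub, "b", "c"),     # b - c
    "a":  (operator.mul, "t1", "t2"),   # a = (b + 1) * (b - c)
    "d":  (operator.mul, "c", "e"),     # d = c * e
    "f":  (operator.mul, "a", "d"),     # f = a * d
}

fired = True
while fired:                            # repeat until no node can fire
    fired = False
    for name, (op, x, y) in nodes.items():
        # A node fires when both operand tokens are available (a "match").
        if name not in tokens and x in tokens and y in tokens:
            tokens[name] = op(tokens[x], tokens[y])
            fired = True

print(tokens["f"])   # a = 3 * -1 = -3, d = 12, so f = -36
```

Notice that "d" can fire in the very first pass, before "a" is ready: execution order is driven purely by operand availability, which is the defining property of the dataflow model.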
Systolic Architectures

A possible example of MISD (Multiple Instruction streams over a Single Data stream) in Flynn's classification of computer architecture.

[Figure: a conventional organization, memory (M) feeding a single processing element (PE), versus a systolic array: memory feeding a chain of PEs. PE = processing element, M = memory.]

• Replace the single processor with an array of regular processing elements.
• Orchestrate data flow for high throughput with less memory access.
• Different from linear pipelining:
  – Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs).
• Represent algorithms directly by chips connected in a regular pattern.
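The outline's matrix multiplication systolic array example can be sketched by simulating the schedule rather than the hardware. In the standard 2D systolic matmul, A values flow right, B values flow down, and with the usual input skew PE(i, j) sees the operand pair (A[i][k], B[k][j]) at cycle t = i + j + k, accumulating C[i][j]. This sketch (a simplification that omits the shift registers) replays exactly that schedule:

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing C = A x B, where
    PE(i, j) multiply-accumulates the a-value flowing right and the
    b-value flowing down at each cycle."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):            # cycles until the array drains
        for i in range(n):
            for j in range(n):
                k = t - i - j             # operand pair reaching PE(i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]   # one MAC step per cycle
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))              # -> [[19, 22], [43, 50]]
```

Each matrix element is read from memory once and then reused as it marches through the array, which is the "high throughput with less memory access" point above; the total of 3n - 2 cycles (versus n^3 sequential MAC steps) comes from n^2 PEs working concurrently.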