POLITECNICO DI MILANO

Parallelism in wonderland:
are you ready to see how deep the rabbit hole goes?
Multiprocessors

Marco D. Santambrogio: [email protected]
Simone Campanoni: [email protected]

Mar 19, 2016
Outline
Multiprocessors
  Flynn taxonomy
  SIMD architectures
  Vector architectures
  MIMD architectures
A real life example
What’s next
Supercomputers

Definition of a supercomputer:
  Fastest machine in the world at a given task
  A device to turn a compute-bound problem into an I/O-bound problem
  Any machine costing $30M+
  Any machine designed by Seymour Cray

The CDC 6600 (Cray, 1964) is regarded as the first supercomputer
The Cray XD1 example
The XD1 uses AMD Opteron 64-bit CPUs and incorporates Xilinx Virtex-II FPGAs
Performance gains from FPGAs:
  RC5 cipher breaking: 1000x faster than a 2.4 GHz P4
  Elliptic curve cryptography: 895-1300x faster than a 1 GHz P3
  Vehicular traffic simulation: 300x faster on an XC2V6000 than a 1.7 GHz Xeon; 650x faster on an XC2VP100 than a 1.7 GHz Xeon
  Smith-Waterman DNA matching: 28x faster than a 2.4 GHz Opteron
Supercomputer Applications
Typical application areas:
  Military research (nuclear weapons, cryptography)
  Scientific research
  Weather forecasting
  Oil exploration
  Industrial design (car crash simulation)

All involve huge computations on large data sets.

In the '70s and '80s, "supercomputer" meant "vector machine".
Parallel Architectures
Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast"
(Almasi and Gottlieb, Highly Parallel Computing, 1989)
The aim is to replicate processors to add performance, rather than to design a faster processor.
A parallel architecture extends traditional computer architecture with a communication architecture:
  abstractions (HW/SW interface)
  different structures to realize the abstraction efficiently
Beyond ILP

ILP architectures (superscalar, VLIW...):
  support fine-grained, instruction-level parallelism;
  fail to support large-scale parallel systems.

Multiple-issue CPUs are very complex, and the returns (in terms of extracted parallelism) are diminishing, so extracting parallelism at higher levels becomes more and more attractive.

A further step: process- and thread-level parallel architectures.

To achieve ever greater performance: connect multiple microprocessors in a complex system.
Beyond ILP

Most recent microprocessor chips are multiprocessors on-chip: Intel Core Duo, IBM Power 5, Sun Niagara.

The major difficulty in exploiting parallelism in multiprocessors, suitable software, is being (at least partially) overcome, in particular for servers and for embedded applications that exhibit natural parallelism without the need to rewrite large software chunks.
Flynn Taxonomy (1966)
SISD - Single Instruction Single Data
  Uniprocessor systems

MISD - Multiple Instruction Single Data
  No practical configuration and no commercial systems

SIMD - Single Instruction Multiple Data
  Simple programming model, low overhead, flexibility, custom integrated circuits

MIMD - Multiple Instruction Multiple Data
  Scalable, fault tolerant, off-the-shelf micros
Flynn

(Diagram: the four Flynn classes arranged by instruction and data streams.)
SISD
A serial (non-parallel) computer.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.
Deterministic execution.
This is the oldest and, even today, the most common type of computer.
SIMD

A type of parallel computer.
Single instruction: all processing units execute the same instruction at any given clock cycle.
Multiple data: each processing unit can operate on a different data element.
Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
MISD
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently, via independent instruction streams.
MIMD
Nowadays, the most common type of parallel computer.
Multiple instruction: every processor may be executing a different instruction stream.
Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Which kind of multiprocessors?
Many of the early multiprocessors were SIMD. The SIMD model received great attention in the '80s; today it is applied only in very specific instances (vector processors, multimedia instructions).

MIMD has emerged as the architecture of choice for general-purpose multiprocessors.

Let's see these architectures in more detail...
SIMD - Single Instruction Multiple Data
Same instruction executed by multiple processors using different data streams.
Each processor has its own data memory.
Single instruction memory and control processor to fetch and dispatch instructions
Processors are typically special-purpose.
Simple programming model.
SIMD Architecture

A central controller broadcasts instructions to multiple processing elements (PEs).
(Diagram: an Array Controller sends control and data to an array of PEs, each with its own local memory, linked by an Inter-PE Connection Network.)
Only requires one controller for the whole array.
Only requires storage for one copy of the program.
All computations are fully synchronized.
SIMD model
Synchronized units: a single Program Counter.
Each unit has its own addressing registers, so it can use different data addresses.

Motivations for SIMD:
  the cost of the control unit is shared by all execution units;
  only one copy of the code in execution is necessary.

Real life: SIMD machines run a mix of SISD and SIMD instructions. A host computer executes the sequential operations; SIMD instructions are sent to all the execution units, each of which has its own memory and registers and exploits an interconnection network to exchange data.
SIMD Machines Today

Distributed-memory SIMD failed as a large-scale general-purpose computing platform:
  required huge quantities of data parallelism (>10,000 elements);
  required programmer-controlled distributed data layout.

Vector supercomputers (shared-memory SIMD) are still successful in high-end supercomputing:
  reasonable efficiency on short vector lengths (10-100 elements);
  single memory space.

Distributed-memory SIMD is popular for special-purpose accelerators:
  image and graphics processing.

Renewed interest in Processor-in-Memory (PIM):
  memory bottlenecks => put some simple logic close to memory;
  viewed as enhanced memory for a conventional system;
  technology push from new merged DRAM + logic processes;
  commercial examples, e.g., graphics in the Sony Playstation 2/3.
Reality: Sony Playstation 2000
Playstation 2000

Emotion Engine:
  superscalar MIPS core
  vector coprocessor pipelines
  RAMBUS DRAM interface

Sample Vector Unit:
  2-wide VLIW
  includes microcode memory
  high-level instructions like matrix-multiply
Alternative Model: Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
SCALAR (1 operation):
  add r3, r1, r2        # r3 <- r1 + r2

VECTOR (N operations):
  add.vv v3, v1, v2     # v3 <- v1 + v2, element by element, over the vector length N
Vector Supercomputers

Epitomized by the Cray-1, 1976:
  Scalar unit + vector extensions
  Load/store architecture
  Vector registers
  Vector instructions
  Hardwired control
  Highly pipelined functional units
  Interleaved memory system
  No data caches
  No virtual memory
Properties of Vector Instructions
A single vector instruction specifies a great deal of work:
  equivalent to executing an entire loop;
  each instruction represents tens or hundreds of operations.

The fetch and decode bandwidth needed to keep multiple deeply pipelined FUs busy is dramatically reduced.

Vector instructions indicate that the computation of each result in the vector is independent of the computation of the other elements of the vector:
  no need to check for data hazards within the vector;
  hardware needs to check for data hazards only between two vector instructions, once per vector operand.
Properties of Vector Instructions
Each result is independent of previous results
  => long pipeline, compiler ensures no dependencies
  => high clock rate

Vector instructions access memory with a known pattern
  => highly interleaved memory to fetch the vector from a set of memory banks
  => memory latency amortized over 64 elements
  => no (data) caches required! (Do use an instruction cache)

Reduces branches and branch problems in pipelines:
  an entire loop is replaced by a vector instruction, so the control hazards that would arise from the loop branch are avoided.
Styles of Vector Architectures
A vector processor consists of a pipelined scalar unit (may be out-of-order or VLIW) plus a vector unit.

Memory-memory vector processors: all vector operations are memory to memory (the first machines, such as the CDC, were of this kind).

Vector-register processors: all vector operations are between vector registers (except load and store):
  the vector equivalent of load-store architectures;
  includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC.
Components of a Vector Processor
Vector registers: fixed-length banks, each holding a single vector
  at least 2 read and 1 write ports
  typically 8-32 vector registers, each holding 64-128 64-bit elements

Vector functional units (FUs): fully pipelined, start a new operation every clock cycle
  typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; may have multiple copies of the same unit
  a control unit detects hazards (control hazards for the FUs, data hazards on register accesses)
  scalar operations may use either the vector functional units or a dedicated set
Components of a Vector Processor
Vector load-store units (LSUs): fully pipelined units that load or store a vector to and from memory
  pipelining allows moving words between vector registers and memory with a bandwidth of 1 word per clock cycle
  also handle scalar loads and stores
  there may be multiple LSUs

Scalar registers: single elements for FP scalars or addresses

Crossbar to connect FUs, LSUs, and registers
Vector programming model
(Diagram: the register state of a vector machine.)
  Scalar registers: r0 ... r15.
  Vector registers: v0 ... v15, each holding elements [0], [1], ..., [VLRMAX-1].
  Vector Length Register (VLR): controls how many elements an instruction processes.
  Vector arithmetic instructions, e.g. ADDV v3, v1, v2: elementwise over [0] ... [VLR-1].
  Vector load and store instructions, e.g. LV v1, r1, r2: base address in r1, stride in r2.
Vector Code Example
# C code
for (i = 0; i < 64; i++)
    C[i] = A[i] + B[i];

# Scalar code
      LI     R4, #64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector code
      LI     VLR, #64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3
Vector Instruction Set Advantages

Compact:
  one short instruction encodes N operations.

Expressive, tells hardware that these N operations:
  are independent;
  use the same functional unit;
  access disjoint registers;
  access registers in the same pattern as previous instructions;
  access a contiguous block of memory (unit-stride load/store);
  access memory in a known pattern (strided load/store).

Scalable:
  can run the same code on more parallel pipelines (lanes).
Vector Arithmetic Execution
Use a deep pipeline (=> fast clock) to execute element operations.
Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!).

(Diagram: V3 <- V1 * V2 through a six-stage multiply pipeline.)
Vector Instruction Execution
ADDV C, A, B

(Diagram: with one pipelined functional unit, results C[0], C[1], C[2], ... complete one per cycle; with four pipelined functional units, four results complete per cycle: C[0], C[4], C[8], ... on the first unit, C[1], C[5], C[9], ... on the second, and so on.)
(Diagram: an address generator computes Base + i*Stride and routes each access to one of 16 memory banks, 0 ... F, feeding the vector registers.)

Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency.
Bank busy time: the time before a bank is ready to accept the next request.
To avoid conflicts, the stride and the number of banks should be relatively prime.
Vector Unit Structure
(Diagram: each lane pairs a slice of the vector registers with its own functional unit pipelines and a port into the memory subsystem. Lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, ...)
T0 Vector Microprocessor (UCB/ICSI, 1995)
Vector register elements are striped over the lanes:
  lane 0: [0] [8] [16] [24]
  lane 1: [1] [9] [17] [25]
  lane 2: [2] [10] [18] [26]
  lane 3: [3] [11] [19] [27]
  lane 4: [4] [12] [20] [28]
  lane 5: [5] [13] [21] [29]
  lane 6: [6] [14] [22] [30]
  lane 7: [7] [15] [23] [31]
Vector Applications
Limited to scientific computing?
  Multimedia processing (compression, graphics, audio synthesis, image processing)
  Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
  Lossy compression (JPEG, MPEG video and audio)
  Lossless compression (zero removal, RLE, differencing, LZW)
  Cryptography (RSA, DES/IDEA, SHA/MD5)
  Speech and handwriting recognition
  Operating systems/networking (memcpy, memset, parity, checksum)
  Databases (hash/join, data mining, image/video serving)
  Language run-time support (stdlib, garbage collection)
  ... even SPECint95
MIMD - Multiple Instruction Multiple Data
Each processor fetches its own instructions and operates on its own data.
Processors are often off-the-shelf microprocessors.
Scalable to a variable number of processor nodes.
Flexible:
  single-user machines focusing on high performance for one specific application;
  multi-programmed machines running many tasks simultaneously;
  some combination of these functions.
Cost/performance advantages due to the use of off-the-shelf microprocessors.
Fault tolerance issues.
Why MIMD?

MIMDs are flexible: they can function as single-user machines for high performance on one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of such functions.

They can be built starting from standard CPUs (as is now the case for nearly all multiprocessors!).
MIMD

To exploit a MIMD with n processors:
  at least n threads or processes to execute;
  independent threads, typically identified by the programmer or created by the compiler;
  the parallelism is contained in the threads: thread-level parallelism.

A thread may be anything from a large, independent process to the parallel iterations of a loop.

Important: parallelism is identified by the software (not by the hardware, as in superscalar CPUs!)... keep this in mind, we'll use it!
MIMD Machines

Existing MIMD machines fall into two classes, depending on the number of processors involved, which in turn dictates the memory organization and interconnection strategy.

Centralized shared-memory architectures:
  at most a few dozen processor chips (< 100 cores);
  large caches, single memory with multiple banks;
  often called symmetric multiprocessors (SMP), and the style of architecture called Uniform Memory Access (UMA).

Distributed memory architectures:
  support large processor counts;
  require a high-bandwidth interconnect;
  disadvantage: data communication among processors.
Key issues to design multiprocessors

How many processors?
How powerful are the processors?
How do parallel processors share data?
Where to place the physical memory?
How do parallel processors cooperate and coordinate?
What type of interconnection topology?
How to program the processors?
How to maintain cache coherency?
How to maintain memory consistency?
How to evaluate system performance?
Create the most amazing game console
One core

A 64-bit Power Architecture core
Two-issue superscalar execution
Two-way multithreaded core
In-order execution
Cache:
  32 KB instruction and 32 KB data Level 1 caches
  512 KB Level 2 cache
  the size of a cache line is 128 bytes

One core to rule them all
Cell: PS3

Cell is a heterogeneous chip multiprocessor:
  one 64-bit Power core;
  8 specialized co-processors, based on a novel single-instruction multiple-data (SIMD) architecture called SPU (Synergistic Processor Unit).
Ducks Demo
Duck Demo SPE Usage
Xenon: XBOX360

Three symmetrical cores, each two-way SMT-capable and clocked at 3.2 GHz.
SIMD: VMX128 extension for each core.
1 MB L2 cache (lockable by the GPU) running at half speed (1.6 GHz) with a 256-bit bus.
Microsoft vision

Microsoft envisions a procedurally rendered game as having at least two primary components:
  Host thread: a game's host thread contains the main thread of execution for the game.
  Data generation thread: where the actual procedural synthesis of object geometry takes place.

These two threads could run on the same PPE, or they could run on two separate PPEs.
In addition to these two threads, the game could make use of separate threads for handling physics, artificial intelligence, player input, etc.
The Xenon architecture
From ILP to TLP: from the processor to the programmer
Keep it simple:
  strip out the hardware that is intended to optimize instruction scheduling at runtime;
  neither the Xenon nor the Cell has an instruction window:
    instructions pass through the processor in the order in which they are fetched;
    two adjacent, non-dependent instructions are executed in parallel where possible.

Static execution:
  is simple to implement;
  takes up much less die space than dynamic execution, since the processor doesn't need to spend a lot of transistors on the instruction window and related hardware. The transistors that the lack of an instruction window frees up can be used to put more actual execution units on the die.

Rethink how you organize the processor:
  you can't just eliminate the instruction window and replace it with more execution units.
Regrouping the execution units
No hardware is spent on an instruction window that looks for ILP at run time.
The programmer has to structure the code stream at compile time so that it contains a high level of thread-level parallelism (TLP).

Three separate cores:
  each individually contains a relatively small number of execution units;
  the many parallel threads out of which the programmer has woven the code stream are then scheduled to run on those separate cores.

This TLP strategy will work extremely well for tasks, like procedural synthesis, that can be parallelized at the thread level.
However, it won't work as well as an old-fashioned wide execution core plus a large instruction window for inherently single-threaded tasks. In particular, three types of game-oriented tasks are likely to suffer from the lack of out-of-order processing and core width:
  game control
  artificial intelligence (AI)
  physics
Procedural Synthesis in a nutshell
Procedural synthesis is about making optimal use of system bandwidth and main memory by dynamically generating lower-level geometry data from statically stored higher-level scene data.

For 3D games:
  artists use a 3D rendering program to produce content for the game;
  each model is translated into a collection of polygons;
  each polygon is represented in the computer's memory as a collection of vertices.

When the computer is rendering a scene in a game in real time:
  models that are being displayed on the screen start out in main memory as stored vertex data;
  that vertex data is fed from main memory into the GPU, where it is rendered into a 3D image and output to the monitor as a sequence of frames.
Limitations

There are two problems:
  the costs of creating art assets for a 3D game are going through the roof, along with the size and complexity of the games themselves;
  console hardware's limited main memory sizes and limited bus bandwidth.
The Xbox 360's solution

Store high-level descriptions of objects in main memory.
Have the CPU procedurally generate the geometry (i.e., the vertex data) of the objects on the fly.

Main memory stores the high-level information.
This information is passed into the Xbox 360's Xenon CPU, where the vertex data are generated by one or more running threads.
These threads then feed that vertex data directly into the GPU, by way of a special set of write buffers in the L2 cache.
The GPU then takes that vertex information and renders the objects normally, just as if it had gotten that information from main memory.
The Xbox 360's solution (diagram)
Questions
RISK more than others think is safe,
CARE more than others think is wise,
DREAM more than others think is practical,
EXPECT more than others think is possible.
(Cadet Maxim)