Parallel Computer Architectures
Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
ICE3003: Computer Architecture | Fall 2009 | Jin-Soo Kim ([email protected])
• SIMD instructions
Hardware Multithreading (1)
Run multiple threads of execution in parallel
• Replicate registers, PC, etc.
• Fast switching between threads
Fine-grain multithreading
• Switch threads after each cycle
• Interleave instruction execution
• If one thread stalls, others are executed
Coarse-grain multithreading
• Only switch on long stall (e.g., L2-cache miss)
• Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
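The difference between the two switching policies can be illustrated with a toy cycle-count model. This is only a sketch under simplifying assumptions that are not from the slides: one instruction issues per cycle, each instruction is described solely by the stall that follows it, thread switches are free, and the function names and the `long_stall` threshold are hypothetical.

```python
def _next_unfinished(cur, pc, threads):
    """Return the next thread (round-robin after cur) that still has work."""
    n = len(threads)
    for k in range(1, n + 1):
        t = (cur + k) % n
        if pc[t] < len(threads[t]):
            return t
    return cur


def fine_grained(threads):
    """Each cycle, issue from the next ready thread in round-robin order.

    threads[t][i] is the stall (in cycles) that follows instruction i of thread t.
    Returns the number of cycles until the last instruction has issued.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle, last = 0, -1
    while any(pc[t] < len(threads[t]) for t in range(n)):
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                last = t
                break
        cycle += 1        # an idle cycle if no thread was ready
    return cycle


def coarse_grained(threads, long_stall=10):
    """Stay on one thread; switch only when it hits a stall >= long_stall."""
    n = len(threads)
    pc = [0] * n
    ready_at = [0] * n
    cycle, cur = 0, 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        if pc[cur] >= len(threads[cur]):
            cur = _next_unfinished(cur, pc, threads)  # current thread is done
            continue
        if ready_at[cur] <= cycle:
            stall = threads[cur][pc[cur]]
            ready_at[cur] = cycle + 1 + stall
            pc[cur] += 1
            if stall >= long_stall:
                cur = _next_unfinished(cur, pc, threads)
        cycle += 1        # short stalls are simply waited out
    return cycle


# Two threads, four instructions each, every instruction followed by a 1-cycle stall.
work = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(fine_grained(work))    # 8: the other thread fills every stall cycle
print(coarse_grained(work))  # 14: short stalls are not hidden
```

With only short stalls, the fine-grain policy keeps the pipeline busy every cycle, while the coarse-grain policy idles through each stall, which matches the trade-off stated in the bullets above.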
– Schedule instructions from multiple threads
– Instructions from independent threads execute when function units are available
– Within threads, dependencies handled by scheduling and register renaming
• Makes a single physical processor appear as multiple logical processors
Hardware Multithreading (5)
Future of multithreading
• Will it survive? In what form?
• Power considerations ⇒ simplified microarchitectures
– Simpler forms of multithreading
– e.g., Sun UltraSPARC T2: fine-grained multithreading
• Tolerating cache-miss latency
– Thread switch may be most effective
• Multiple simple cores might share resources more effectively
Multicore (1)
Multicore is mainstream
Parallel Architectures (1)
SMP: shared memory multiprocessors
• Hardware provides single physical address space for all processors
• Synchronize shared variables using locks
• Memory access time: UMA vs. NUMA
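As a small illustration of synchronizing a shared variable with a lock, the sketch below uses Python threads as stand-ins for processors that share one address space. The function name `parallel_count` and its parameters are hypothetical; the point is only that every update of the shared counter happens while the lock is held.

```python
import threading


def parallel_count(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write of the shared variable is a critical section.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(parallel_count(4, 1000))  # 4000
```

Without the lock, two threads could read the same old value and one update would be lost; the lock serializes the updates at the cost of some contention.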
Parallel Architectures (2)
Distributed memory architecture
• Each processor has private physical address space
• Hardware sends/receives messages between processors
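The message-passing style can be sketched as follows. This is a model only: Python threads and a `queue.Queue` stand in for processors and the interconnect, each node copies its slice of the input up front to imitate private memory, and the names `distributed_sum` and `inbox0` are hypothetical. After the copy, nodes communicate only by sending and receiving messages.

```python
import queue
import threading


def distributed_sum(data, num_nodes=4):
    """Sum a list by giving each node a private slice and combining via messages."""
    inbox0 = queue.Queue()   # models the receive side of node 0
    result = []

    def node(rank):
        local = list(data[rank::num_nodes])   # private copy: this node's memory
        partial = sum(local)
        if rank != 0:
            inbox0.put(partial)               # send the partial sum to node 0
        else:
            total = partial
            for _ in range(num_nodes - 1):
                total += inbox0.get()         # blocking receive from any node
            result.append(total)

    nodes = [threading.Thread(target=node, args=(r,)) for r in range(num_nodes)]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()
    return result[0]


print(distributed_sum(list(range(100))))  # 4950
```

No lock appears here because no variable is updated by more than one node; coordination happens through the blocking receive instead.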
Parallel Architectures (3)
Clusters
• A network of independent computers
• Each node has private memory and OS
• Connected using I/O system
– E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
– Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
» cf. processor/memory bandwidth on an SMP
Data-Level Parallelism
SIMD (Single-Instruction Multiple-Data) architecture
• Operate element-wise on vectors of data
– E.g., MMX and SSE instructions in x86: Multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time
– Each with different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications
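To make "multiple data elements in a wide register" concrete, the sketch below models a 128-bit register as a Python integer holding four 32-bit lanes and applies one add to all lanes. The lane width, the wraparound behavior, and the helper names `pack`, `unpack`, and `packed_add` are illustrative assumptions rather than a description of any particular x86 instruction.

```python
LANES = 4
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def pack(elems):
    """Place four 32-bit values side by side in one 128-bit integer (lane 0 lowest)."""
    reg = 0
    for i, e in enumerate(elems):
        reg |= (e & LANE_MASK) << (i * LANE_BITS)
    return reg


def unpack(reg):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """One operation, four independent adds; carries never cross a lane boundary."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


r = packed_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))
print(unpack(r))  # [11, 22, 33, 44]
```

Because the lanes are independent, an overflow in one lane wraps within that lane and leaves its neighbor untouched, which is what distinguishes a packed add from a single 128-bit add.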
Vector Processors (1)
Characteristics
• Highly pipelined function units
• Stream data from/to vector registers to units
– Data collected from memory into registers
– Results stored from registers to memory
Example: Vector extension to MIPS
• 32 x 64-element registers (64-bit elements)
• Vector instructions
– lv, sv: load/store vector
– addv.d: add vectors of double
– addvs.d: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth
l.d     $f0,a($sp)      ;load scalar a
lv      $v1,0($s0)      ;load vector x
mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
lv      $v3,0($s1)      ;load vector y
addv.d  $v4,$v2,$v3     ;add y to product
sv      $v4,0($s1)      ;store the result
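The vector sequence above computes y = a × x + y over 64 elements. A rough Python model of the same six steps is sketched below; the helper functions mirror the instruction names, but memory is modeled as a flat list indexed by element rather than by byte address, which is a simplifying assumption, and `daxpy` is a hypothetical wrapper name.

```python
VLEN = 64  # elements per vector register, as in the MIPS vector extension


def lv(memory, base, n=VLEN):
    """Load vector: copy n consecutive elements from memory into a register."""
    return memory[base:base + n]


def sv(memory, base, vec):
    """Store vector: write a register back to consecutive memory elements."""
    memory[base:base + len(vec)] = vec


def mulvs_d(vec, scalar):
    """Vector-scalar multiply."""
    return [x * scalar for x in vec]


def addv_d(v1, v2):
    """Element-wise vector add."""
    return [x + y for x, y in zip(v1, v2)]


def daxpy(memory, a, x_base, y_base, n=VLEN):
    v1 = lv(memory, x_base, n)   # load vector x
    v2 = mulvs_d(v1, a)          # vector-scalar multiply
    v3 = lv(memory, y_base, n)   # load vector y
    v4 = addv_d(v2, v3)          # add y to product
    sv(memory, y_base, v4)       # store the result over y


memory = [float(i) for i in range(VLEN)] + [1.0] * VLEN   # x followed by y
daxpy(memory, 2.0, 0, VLEN)
print(memory[VLEN:VLEN + 4])  # [1.0, 3.0, 5.0, 7.0]
```

Each helper call stands for one vector instruction that processes all 64 elements, so the whole computation needs six instructions instead of a loop body executed 64 times.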
Vector Processors (4)
Vector architectures and compilers
• Simplify data-parallel programming
• Explicit statement of absence of loop-carried dependencies
– Reduced checking in hardware
• Regular access patterns benefit from interleaved and burst memory
• Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
– Better match with compiler technology
MMX (1)
Intel MMX Technology
• Introduced in Pentium MMX and Pentium II
• SIMD (Single-Instruction Multiple-Data) execution model
• For media and communications applications
• Eight 64-bit MMX registers (MM0-MM7) are shared with x87 FPU registers (R0-R7)
• Three new packed data types
SSE (1)
Streaming SIMD Extensions
• Introduced in Pentium III
• For advanced 2-D/3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing
• Eight new 128-bit XMM registers (XMM0-XMM7)
• New 32-bit MXCSR register
– Control and status bits for operations on XMM registers
• New 128-bit packed single-precision FP data type
SSE (2)
SSE packed/scalar single-precision FP operations
• Add, Multiply, Divide, Reciprocal, Square root, Reciprocal of square root, Max, Min
[Figure: scalar single-precision FP operation]
SSE logical operations
• And, Or, Xor
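The packed and scalar forms differ only in how many lanes they touch. The sketch below models an XMM register as a list of four values with lane 0 first, using add as the example: the packed form (ADDPS) operates on all four lanes, while the scalar form (ADDSS) operates on lane 0 and passes the upper three lanes of the destination through unchanged. The Python function names and the `f32` rounding helper are illustrative.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def addps(dest, src):
    """Packed add: all four single-precision lanes are added."""
    return [f32(d + s) for d, s in zip(dest, src)]


def addss(dest, src):
    """Scalar add: only lane 0 is added; lanes 1-3 are copied from dest."""
    return [f32(dest[0] + src[0])] + list(dest[1:])


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(addps(a, b))  # [11.0, 22.0, 33.0, 44.0]
print(addss(a, b))  # [11.0, 2.0, 3.0, 4.0]
```

The same packed/scalar split applies to the other listed operations (multiply, divide, square root, and so on); only the per-lane function changes.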
SSE (3)
SSE shuffle operation
• Any two from src
• Any two from dest
SSE unpack operation
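The lane movement of the shuffle and unpack operations can be sketched as below, modeling each XMM register as a four-element list with lane 0 first. The selection rules follow the x86 SHUFPS, UNPCKLPS, and UNPCKHPS instructions; the lowercase Python function names are just labels for this model.

```python
def shufps(dest, src, imm8):
    """Shuffle: the two low result lanes come from dest, the two high from src.

    Each 2-bit field of imm8 selects which source lane is copied.
    """
    return [
        dest[imm8 & 3],
        dest[(imm8 >> 2) & 3],
        src[(imm8 >> 4) & 3],
        src[(imm8 >> 6) & 3],
    ]


def unpcklps(dest, src):
    """Unpack low: interleave the two low lanes of dest and src."""
    return [dest[0], src[0], dest[1], src[1]]


def unpckhps(dest, src):
    """Unpack high: interleave the two high lanes of dest and src."""
    return [dest[2], src[2], dest[3], src[3]]


d = ['d0', 'd1', 'd2', 'd3']
s = ['s0', 's1', 's2', 's3']
print(shufps(d, s, 0b11100100))  # ['d0', 'd1', 's2', 's3']
print(unpcklps(d, s))            # ['d0', 's0', 'd1', 's1']
print(unpckhps(d, s))            # ['d2', 's2', 'd3', 's3']
```

Passing the same register as both `dest` and `src` to `shufps` allows any permutation or broadcast of its four lanes, for example `shufps(d, d, 0)` replicates lane 0 everywhere.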
SSE2
Streaming SIMD Extensions 2 (SSE2)
• Introduced in Pentium 4 and Intel Xeon
• For advanced 3-D graphics, speech recognition, video encoding/decoding, E-commerce, Internet, scientific and engineering applications
• Six new data types