EECC756 - Shaaban
#1 Exam Review, Spring 2001, 5-10-2001
Parallel Computer Architecture
• A parallel computer is a collection of processing elements that cooperate to solve large problems.
• Broad issues involved:
– Resource Allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
– Data access, Communication and Synchronization:
• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
– Performance and Scalability:
• Performance enhancement of parallelism: Speedup.
• Scalability of performance to larger systems/problems.
Fundamental Design Issues
• At any layer there are interface (contract) aspects and performance aspects:
– Naming: How are logically shared data and/or processes referenced?
– Operations: What operations are provided on these data?
– Ordering: How are accesses to data ordered and coordinated?
– Replication: How are data replicated to reduce communication?
– Communication Cost: Latency, bandwidth, overhead, occupancy.
• Understand these at the programming model level first, since that sets the requirements.
• Other issues:
– Node Granularity: How to split between processors and memory?
Conditions of Parallelism: Data Dependence
1. True Data (Flow) Dependence: A statement S2 is data dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output variable of S1 feeds in as an input operand used by S2.
Denoted by S1 → S2.
2. Antidependence: Statement S2 is antidependent on S1 if S2 follows S1 in program order and if the output of S2 overlaps the input of S1.
Denoted by S1 ⇸ S2 (a crossed arrow).
3. Output Dependence: Two statements are output dependent if they produce (write) the same output variable.
Conditions of Parallelism: Data Dependence (continued)
4 I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5 Unknown dependence: the dependence relation cannot be determined in situations such as:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index.
• A variable appears more than once with subscripts having different coefficients of the loop variable.
• The subscript is nonlinear in the loop index variable.
Data and I/O Dependence: Examples

Example 1 (data dependence):
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1

Dependence graph: S1 → S2 (flow on R1), S2 ⇸ S3 (anti on R1), S1 → S3 (output on R1), S3 → S4 (flow on R1).

Example 2 (I/O dependence):
S1: Read (4), A(I) /Read array A from tape unit 4/
S2: Rewind (4) /Rewind tape unit 4/
S3: Write (4), B(I) /Write array B into tape unit 4/
S4: Rewind (4) /Rewind tape unit 4/

S1 and S3 are I/O dependent: the I/O dependence is caused not by a shared variable but by the read and write statements accessing the same file (tape unit 4).
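The dependence classes above can be checked mechanically from per-statement read/write sets. A minimal Python sketch, not from the slides; the read/write sets below are my encoding of the four-statement example, with registers and memory locations treated alike:

```python
# Sketch: recovering the dependence graph of the four-statement example
# above from per-statement read/write sets (my encoding of the slide's
# code; an assumption made for illustration).
reads  = {"S1": {"A"},  "S2": {"R1"}, "S3": {"R3"}, "S4": {"R1"}}
writes = {"S1": {"R1"}, "S2": {"R2"}, "S3": {"R1"}, "S4": {"B"}}
order  = ["S1", "S2", "S3", "S4"]

flow, anti, out = [], [], []
last_writer = {}                       # most recent writer of each variable
for s in order:
    for v in reads[s]:                 # flow dependence: consume the value
        if v in last_writer:           # produced by the most recent writer
            flow.append((last_writer[v], s))
    for v in writes[s]:
        last_writer[v] = s
for i, s1 in enumerate(order):
    for s2 in order[i + 1:]:
        if reads[s1] & writes[s2]:     # antidependence: S2 overwrites S1's input
            anti.append((s1, s2))
        if writes[s1] & writes[s2]:    # output dependence: same output variable
            out.append((s1, s2))

print(flow)   # [('S1', 'S2'), ('S3', 'S4')]
print(anti)   # [('S2', 'S3')]
print(out)    # [('S1', 'S3')]
```

Tracking the most recent writer means the flow list excludes S1 → S4: S3 rewrites R1 before S4 reads it.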
• Input:
– n × n matrix A; vector x of order n.
– The processor number i; the number of processors p.
– The ith submatrix B = A(1:n, (i−1)r+1 : ir) of size n × r, where r = n/p.
– The ith subvector w = x((i−1)r+1 : ir) of size r.
• Output:
– Processor Pi computes the vector y = A1x1 + … + Aixi and passes the result to the right.
– Upon completion, P1 will hold the product Ax.
Begin
1. Compute the matrix-vector product z = Bw.
2. If i = 1 then set y := 0; else receive(y, left).
3. Set y := y + z.
4. send(y, right).
5. If i = 1 then receive(y, left).
End
Tcomp = k(n²/p)
Tcomm = p(l + mn)
T = Tcomp + Tcomm = k(n²/p) + p(l + mn)
Example: Asynchronous Matrix Vector Product on a Ring
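The ring algorithm above can be simulated sequentially. A sketch in Python, assuming p evenly divides n and replacing real message passing with a loop over the processors in ring order (the function name and structure are mine, not from the slides):

```python
# Sequential simulation of the ring algorithm above: each "processor" i
# holds column block B (columns (i-1)r+1 .. ir) and subvector w, computes
# z = B w, adds it to the partial sum received from the left, and passes
# the result to the right. Assumes p evenly divides n.
def ring_matvec(A, x, p):
    n = len(A)
    r = n // p
    y = [0.0] * n                      # partial sum circulating on the ring
    for i in range(p):                 # visit processors in ring order
        cols = range(i * r, (i + 1) * r)
        z = [sum(A[row][c] * x[c] for c in cols) for row in range(n)]
        y = [yk + zk for yk, zk in zip(y, z)]    # step 3: y := y + z
    return y                           # after the round trip, P1 holds Ax

print(ring_matvec([[1, 2], [3, 4]], [1, 1], p=2))   # [3.0, 7.0]
```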
Limited Concurrency: Amdahl’s Law
– The most fundamental limitation on parallel speedup.
– If fraction s of sequential execution is inherently serial, then Speedup ≤ 1/s.
– Example: 2-phase calculation:
• Sweep over an n-by-n grid and do some independent computation.
• Sweep again and add each value to a global sum.
– Time for the first phase = n²/p.
– The second phase is serialized at the global variable, so its time = n².
– Speedup ≤ 2n² / (n²/p + n²), or at most 2.
– Possible trick: divide the second phase into two:
• Accumulate into a private sum during the sweep.
• Add the per-process private sums into the global sum.
– Parallel time is n²/p + n²/p + p, and speedup is at best 2n² / (2n²/p + p), which approaches p when n is large.
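A quick numeric check of the two-phase example; the particular n and p values are arbitrary illustrative choices:

```python
# Numeric check of the two-phase Amdahl example above.
def naive_speedup(n, p):
    """Phase 1 parallel (n^2/p), phase 2 serialized at the global sum (n^2)."""
    return 2 * n**2 / (n**2 / p + n**2)

def improved_speedup(n, p):
    """Phase 2 split: private sums in parallel, then p serial additions."""
    return 2 * n**2 / (2 * n**2 / p + p)

n, p = 1000, 100
print(naive_speedup(n, p))     # just under 2, no matter how large p gets
print(improved_speedup(n, p))  # close to p while n >> p
```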
Summary of Parallel Algorithms Analysis
• Requires characterization of the multiprocessor system and the algorithm.
• Historical focus on algorithmic aspects: partitioning, mapping.
• PRAM model: data access and communication are free:
– Only load balance (including serialization) and extra work matter.
– Useful for early development, but unrealistic for real performance.
– Ignores communication and also the imbalances it causes.
– Can lead to a poor choice of partitions as well as orchestration.
– More recent models incorporate communication costs: BSP, LogP, ...
Speedup < Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)
Synchronous Iteration
• Iteration-based computation is a powerful method for solving numerical (and some non-numerical) problems.
• For numerical problems, a calculation is repeated, and each time a result is obtained that is used in the next execution. The process is repeated until the desired result is obtained.
• Though iterative methods are sequential in nature, parallel implementation can be successfully employed when there are multiple independent instances of the iteration. In some cases this is part of the problem specification, and sometimes one must rearrange the problem to obtain multiple independent instances.
• The term "synchronous iteration" describes solving a problem by iteration where different tasks may be performing separate iterations, but the iterations must be synchronized using point-to-point synchronization, barriers, or other synchronization mechanisms.
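A minimal sketch of the synchronous-iteration pattern using Python threads and a barrier. The Jacobi-style averaging update is my choice of computation; the slides describe only the pattern. Two barrier waits per iteration separate the compute and publish phases:

```python
# Sketch: synchronous (Jacobi-style) iteration with p threads and a barrier.
# The averaging update is an illustrative choice, not from the slides.
import threading

def synchronous_iteration(x, p, iters):
    """Each thread owns a contiguous block of x; two barrier waits per
    iteration separate computing new values from publishing them."""
    n = len(x)
    nxt = x[:]                                   # next-iteration values
    barrier = threading.Barrier(p)

    def worker(i):
        lo, hi = i * n // p, (i + 1) * n // p
        for _ in range(iters):
            for k in range(lo, hi):              # read current values only
                left = x[k] if k == 0 else x[k - 1]
                right = x[k] if k == n - 1 else x[k + 1]
                nxt[k] = (left + right) / 2
            barrier.wait()                       # everyone finished computing
            x[lo:hi] = nxt[lo:hi]                # publish own block
            barrier.wait()                       # everyone finished publishing

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# the values drift toward a common fixed point as iterations proceed
print(synchronous_iteration([0.0, 0.0, 4.0, 0.0], p=2, iters=200))
```

Without the second barrier, one thread could start reading `x` for its next iteration while another is still publishing; the double barrier is the standard fix.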
– Physical interconnection structure of the network graph:
• Node degree.
• Network diameter: the longest minimum routing distance between any two nodes, in hops.
• Average distance between nodes.
• Bisection width: the minimum number of links whose removal disconnects the graph and cuts it into two halves.
• Symmetry: the property that the network looks the same from every node.
• Homogeneity: whether all the nodes and links are identical or not.
– Type of interconnection:
• Static or direct interconnects: nodes connected directly using static point-to-point links.
• Dynamic or indirect interconnects: switches are usually used to realize dynamic links between nodes:
– Each node is connected to a specific subset of switches (e.g., multistage interconnection networks, MINs).
– Blocking or non-blocking; permutations realized.
• Shared-, broadcast-, or bus-based connections (e.g., Ethernet-based).
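The static-network metrics above (node degree, diameter, average distance) can be computed for a small example network by breadth-first search. A sketch, not from the slides, using a 4-node ring as the example:

```python
# Sketch: computing degree, diameter, and average distance (in hops) for
# a small static network by BFS; the 4-node ring is my example choice.
from collections import deque

def hop_distances(adj, src):
    """BFS hop counts from src in an adjacency-list graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def metrics(adj):
    n = len(adj)
    degree = max(len(nbrs) for nbrs in adj.values())
    all_d = [d for s in adj for d in hop_distances(adj, s).values()]
    # self-distances are 0, so dividing by n(n-1) averages over node pairs
    return degree, max(all_d), sum(all_d) / (n * (n - 1))

ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(metrics(ring4))   # degree 2, diameter 2, average distance 4/3
```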
Dynamic Networks Definitions
• Permutation networks: can provide any one-to-one mapping between sources and destinations.
• Strictly non-blocking: Any attempt to create a valid connection succeeds. These include Clos networks and the crossbar.
• Wide Sense non-blocking: In these networks any connection succeeds if a careful routing algorithm is followed. The Benes network is the prime example of this class.
• Rearrangeably non-blocking: Any attempt to create a valid connection eventually succeeds, but some existing links may need to be rerouted to accommodate the new connection. Batcher's bitonic sorting network is one example.
• Blocking: Once certain connections are established it may be impossible to create other specific connections. The Banyan and Omega networks are examples of this class.
• Single-stage networks: Crossbar switches are single-stage, strictly non-blocking, and can implement not only the N! permutations, but also the N^N combinations of non-overlapping broadcast.
Permutations
• For n objects there are n! permutations by which the n objects can be reordered.
• The set of all permutations forms a permutation group with respect to a composition operation.
• One can use cycle notation to specify a permutation function. For example, the permutation π = (a, b, c)(d, e) stands for the bijection mapping a → b, b → c, c → a, d → e, e → d in a circular fashion. The cycle (a, b, c) has a period of 3 and the cycle (d, e) has a period of 2. Combining the two cycles, the permutation π has a period of 2 × 3 = 6. If one applies the permutation six times, the identity mapping I = (a)(b)(c)(d)(e) is obtained.
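The cycle example can be checked directly. A short sketch confirming that π = (a, b, c)(d, e) has period 6:

```python
# Check of the cycle example above: pi = (a b c)(d e) should have period
# lcm(3, 2) = 6, i.e. only the 6th power is the identity mapping.
pi = {"a": "b", "b": "c", "c": "a", "d": "e", "e": "d"}

def apply_times(perm, k):
    """Compose perm with itself k times."""
    result = {x: x for x in perm}              # start from the identity
    for _ in range(k):
        result = {x: perm[result[x]] for x in result}
    return result

identity = {x: x for x in pi}
periods = [k for k in range(1, 7) if apply_times(pi, k) == identity]
print(periods)   # [6]
```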
Perfect Shuffle
• The perfect shuffle is a special permutation function suggested by Harold Stone (1971) for parallel processing applications.
• Obtained by rotating the binary address of an object one position left.
• The perfect shuffle and its inverse for 8 objects are shown here: [figure omitted]
• In the Omega network, perfect shuffle is used as an inter-stage connection pattern for all log2N stages.
• Routing is simply a matter of using the destination's address bits to set switches at each stage.
• The Omega network is a single-path network: There is just one path between an input and an output.
• It is equivalent to the Banyan, Staran Flip Network, Shuffle Exchange Network, and many others that have been proposed.
• The Omega can only implement N^(N/2) of the N! permutations between inputs and outputs, so it is possible to have permutations that cannot be provided (i.e., paths that can be blocked).
– For N = 8, 8^4/8! = 4096/40320 = 0.1016, i.e., 10.16% of the permutations can be implemented.
• It can take log2N passes of reconfiguration to provide all links. Because there are log2N stages, the worst-case time to provide all desired connections can be (log2N)².
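The perfect shuffle and the destination-tag routing described above can be sketched as follows; the helper names are mine, and switch setting 0 means straight, 1 means exchange:

```python
# Sketch of the perfect shuffle (left rotation of the binary address) and
# destination-tag routing through the log2(N) stages of an Omega network.
def shuffle(addr, bits):
    """Rotate a bits-bit address one position left."""
    msb = (addr >> (bits - 1)) & 1
    return ((addr << 1) & ((1 << bits) - 1)) | msb

def omega_route(src, dst, bits):
    """Return the switch setting used at each stage on the unique path
    from src to dst; the routing tag is simply dst's bits, MSB first."""
    settings = []
    addr = src
    for stage in range(bits):
        addr = shuffle(addr, bits)              # inter-stage shuffle pattern
        want = (dst >> (bits - 1 - stage)) & 1  # next destination address bit
        settings.append((addr & 1) ^ want)      # exchange iff low bit differs
        addr = (addr & ~1) | want               # the switch emits that bit
    assert addr == dst                          # the single path always arrives
    return settings

print(shuffle(0b001, 3))      # 2 (binary 010)
print(omega_route(0, 5, 3))   # [1, 0, 1]
```

Blocking arises only between simultaneous connections that need the same switch; any single source-destination pair always has its one path.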
Multi-Stage Networks: The Omega Network
– Symmetric access to all of main memory from any processor.
• Currently dominate the high-end server market:
– Building blocks for larger systems; arriving at the desktop.
• Attractive as high-throughput servers and for parallel programs:
– Fine-grain resource sharing.
– Uniform access via loads/stores.
– Automatic data movement and coherent replication in caches.
• Normal uniprocessor mechanisms are used to access data (reads and writes):
– The key is the extension of the memory hierarchy to support multiple processors.
Caches And Cache Coherence In Shared Memory Multiprocessors
• Caches play a key role in all shared memory multiprocessor system variations:
– Reduce average data access time.
– Reduce bandwidth demands placed on the shared interconnect.
• Private processor caches create a problem:
– Copies of a variable can be present in multiple caches.
– A write by one processor may not become visible to others:
• Processors may keep accessing a stale value in their private caches.
– Process migration.
– I/O activity.
– This is the cache coherence problem.
– Hardware and/or software actions are needed to ensure write visibility to all processors, thus maintaining cache coherence.
– Adherence of processors and the memory system to the expected memory access behavior.
• Consistency Models: specify the order by which the shared memory access events of one process should be observed by other processes in the system.
– Sequential Consistency Model.
– Weak Consistency Models.
• Program Order: The order in which memory accesses appear in the execution of a single process without program reordering.
• Event Ordering: Used to declare whether a memory event is legal when several processes access a common set of memory locations.
[Hardware is sequentially consistent if] the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
• Sufficient conditions to achieve SC in shared-memory access:
– Every process issues memory operations in program order.
– After a write operation is issued, the issuing process waits for the write to complete before issuing its next operation.
– After a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (this provides write atomicity).
• Regarding these sufficient, but not necessary, conditions:
– Clearly, compilers should not reorder for SC, but they do!
• Loop transformations, register allocation (eliminates accesses!).
– Even if operations are issued in order, hardware may violate SC for better performance:
• Write buffers, out-of-order execution.
– Reason: uniprocessors care only about dependences to the same location.
– This makes the sufficient conditions very restrictive for performance.
• Memory access order between processors is determined by a hardware memory access “switch”.
• Stores and swaps issued by a processor are placed in a dedicated FIFO store buffer for that processor. The order of memory operations is the same as the processor issue order.
• A load by a processor first checks its store buffer to see if it contains a store to the same location:
– If it does, then the load returns the value of the most recent such store.
– Otherwise, the load goes directly to memory.
– A processor is logically blocked from issuing further operations until the load returns a value.
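A toy model of the store-buffer behavior described above; the class and method names are mine, and this is a sketch of the mechanism, not of any real machine:

```python
# Toy model of the store buffer above: stores enter a per-processor FIFO,
# and a load searches its own buffer (newest first) before going to memory.
from collections import deque

class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()             # FIFO of (addr, value)

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        for a, v in reversed(self.store_buffer):   # most recent store wins
            if a == addr:
                return v
        return self.memory.get(addr, 0)            # otherwise go to memory

    def drain_one(self):
        """The memory 'switch' retires this processor's oldest store."""
        addr, value = self.store_buffer.popleft()
        self.memory[addr] = value

mem = {}
p0, p1 = Processor(mem), Processor(mem)
p0.store("x", 1)
print(p0.load("x"))   # 1: forwarded from p0's own store buffer
print(p1.load("x"))   # 0: the store has not reached memory yet
p0.drain_one()
print(p1.load("x"))   # 1: now visible to all processors
```

The middle load shows why buffered stores can violate sequential consistency: p0 already observes its own write while p1 does not.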
Write-invalidate Snoopy Bus Protocol For Write-Back Caches
State Transition Diagram
[Diagram: three block states RW, RO, INV with transitions driven by the events below.]
RW: Read-Write
RO: Read-Only
INV: Invalidated or not in cache
W(i) = Write to block by processor i
W(j) = Write to block copy in cache j by processor j ≠ i
R(i) = Read block by processor i
R(j) = Read block copy in cache j by processor j ≠ i
Z(i) = Replace block in cache i
Z(j) = Replace block copy in cache j ≠ i
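The transitions in the legend can be sketched as a small function over the per-cache states of one block; this is a simplified model in which a write invalidates all other copies and a read miss downgrades a dirty copy to read-only after write-back:

```python
# Sketch of the RW/RO/INV write-invalidate transitions (simplified model).
RW, RO, INV = "RW", "RO", "INV"

def step(states, op, i):
    """Apply R(i), W(i), or Z(i) by processor i to a list of cache states."""
    states = list(states)
    if op == "W":                      # W(i): cache i gets the only RW copy;
        states = [INV] * len(states)   # W(j) invalidates every other copy
        states[i] = RW
    elif op == "R" and states[i] == INV:                 # read miss
        states = [RO if s == RW else s for s in states]  # dirty copy downgrades
        states[i] = RO
    elif op == "Z":                    # Z(i): replacement
        states[i] = INV
    return states

s = step([INV, INV, INV], "W", 0); print(s)   # ['RW', 'INV', 'INV']
s = step(s, "R", 1); print(s)                 # ['RO', 'RO', 'INV']
s = step(s, "W", 2); print(s)                 # ['INV', 'INV', 'RW']
```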
Efficiency, Utilization, Redundancy, Quality of Parallelism
• System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps:
– Speedup factor: S(n) = T(1) /T(n)
– System efficiency for an n-processor system: E(n) = S(n)/n = T(1)/[nT(n)]
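A quick numeric example of these definitions; the timing values are hypothetical:

```python
# Numeric example of the definitions above; T(1) and T(n) are made up.
def speedup(T1, Tn):
    return T1 / Tn                     # S(n) = T(1) / T(n)

def efficiency(T1, Tn, n):
    return speedup(T1, Tn) / n         # E(n) = S(n) / n = T(1) / (n T(n))

T1, T8 = 100.0, 20.0
print(speedup(T1, T8))        # S(8) = 5.0
print(efficiency(T1, T8, 8))  # E(8) = 0.625
```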
Parallel Performance Metrics Revisited: Amdahl’s Law
• Harmonic Mean Speedup (with fi the probability of using i processors): S = 1 / (Σi fi/i).
• In the case w = {fi for i = 1, 2, …, n} = (α, 0, 0, …, 1−α), the system is running sequential code with probability α and utilizing all n processors with probability (1−α), with the other processor modes not utilized.
Amdahl’s Law:
S = 1 / (α + (1−α)/n) → 1/α as n → ∞
Under these conditions the best speedup is upper-bounded by 1/α.
Scalability Metrics
• The study of scalability is concerned with determining the degree of matching between a computer architecture and an application algorithm, and whether this degree of matching continues to hold as problem and machine sizes are scaled up.
• Basic scalability metrics affecting the scalability of the system for a given problem:
Parallel System Scalability
• Scalability (informal, restrictive definition):
A system architecture is scalable if the system efficiency E(s, n) = 1 for all algorithms with any number of processors n and any problem size s.
• Scalability definition (more formal):
The scalability Φ(s, n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup S(s, n) on the real machine to the asymptotic speedup SI(s, n) on an ideal realization of an EREW PRAM: Φ(s, n) = S(s, n) / SI(s, n).
Goal: Parallel machines that can be scaled to hundreds or thousands of processors.
• Design choices:
– Custom-designed or commodity nodes?
– Network scalability.
– Capability of the node-to-network interface (critical).
– Supported programming models?
• What does hardware scalability mean?
– Avoids inherent design limits on resources.
– Bandwidth increases with machine size P.
– Latency should not increase with machine size P.
– Cost should increase slowly with P.
– Processor-cache-memory nodes connected by a scalable network.
– Distributed shared physical address space.
– The communication assist must interpret network transactions, forming the shared address space.
• For a system with a shared physical address space:
– A cache miss must be satisfied transparently from local or remote memory, depending on the address.
– By its normal operation, the cache replicates data locally, resulting in a potential cache coherence problem between local and remote copies of data.
– A coherence solution must be in place for correct operation.
• Standard snoopy protocols studied earlier may not apply, for lack of a bus or a broadcast medium to snoop on.
• For this type of system to be scalable, in addition to latency and bandwidth scalability, the cache coherence protocol or solution used must also scale.
Scalable Cache Coherence
• A scalable cache coherence approach may have cache line states and state transition diagrams similar to those of bus-based coherence protocols.
• However, additional mechanisms other than broadcasting must be devised to manage the coherence protocol.
• Possible approaches:
– Approach #1: Hierarchical snooping.
– Approach #2: Directory-based cache coherence.
– Approach #3: A combination of the above two.
Approach #1: Hierarchical Snooping
• Extend the snooping approach with a hierarchy of broadcast media:
– Tree of buses or rings (e.g., KSR-1).
– Processors are in the bus- or ring-based multiprocessors at the leaves.
– Parents and children connected by two-way snoopy interfaces:
• Snoop both buses and propagate relevant transactions.
– Main memory may be centralized at the root or distributed among the leaves.
• Issues (a)-(c) handled similarly to the bus case, but without full broadcast:
– The faulting processor sends out a “search” bus transaction on its bus.
– The search propagates up and down the hierarchy based on snoop results.
• Problems:
– High latency: multiple levels, and a snoop/lookup at every level.
– Bandwidth bottleneck at the root.
• This approach has, for the most part, been abandoned.