Darshan Institute of Engineering and Technology 180702 - Parallel Processing
Computer Engineering Chapter - 2 Parallel Programming Platforms
1) What is implicit parallelism? Explain pipelining and superscalar execution in parallel processing with suitable example. [Oct. 13(7 marks), June 12(7 marks)]
Current processors use resources in multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
In computer science, implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism. Implicit parallelism is a parallel programming model which aims to take advantage of the parallelism already inherent in the structure of a programming language. This is opposed to explicit parallelism, which involves user-specified parallel instructions.
A programmer who writes implicitly parallel code does not need to worry about task division or process communication, focusing instead on the problem that his or her program is intended to solve. Languages with implicit parallelism reduce the control that the programmer has over the parallel execution of the program.
Though implicit parallelism may at first seem a great deal like automatic parallelism, it actually differs significantly, because language structures can be built in such a way as to restrict coding practices.
Mechanisms used by various processors for supporting multiple instruction execution:
Pipelining: Processors use the concept of a pipeline to improve execution rate. An instruction pipeline is a technique used in the design of computers to increase their instruction throughput (the number of instructions that can be executed in a unit of time). By overlapping the various stages in instruction execution (instruction fetch, decode, execute, memory access, write back to registers), pipelining enables faster execution. Execution speed increases as the number of pipeline stages is increased.
Superscalar Execution: A processor with more than one pipeline and the ability to simultaneously issue multiple instructions is sometimes referred to as a super-pipelined processor. The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution.
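To see why pipelining helps, the following toy Python model may be useful (a minimal sketch; the one-cycle-per-stage, no-stall assumptions are ours, not part of the notes). With k stages and n instructions, a pipelined processor needs roughly k + (n - 1) cycles instead of n * k:
-----------------------------------------------------------------------------------------------------
# Toy model of pipelined vs. non-pipelined execution (illustrative only).
# Assumptions: k pipeline stages, one cycle per stage, no stalls or hazards.

def cycles_without_pipeline(n_instructions: int, k_stages: int) -> int:
    # Each instruction occupies the processor for all k stages.
    return n_instructions * k_stages

def cycles_with_pipeline(n_instructions: int, k_stages: int) -> int:
    # The first instruction takes k cycles; each later one completes one cycle apart.
    return k_stages + (n_instructions - 1)

n, k = 1000, 5
print(cycles_without_pipeline(n, k))  # 5000 cycles
print(cycles_with_pipeline(n, k))     # 1004 cycles, roughly a 5x speedup
-----------------------------------------------------------------------------------------------------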
function accesses a pair of vector elements and waits for them. In the meantime, the second instance
of this function can access two other vector elements in the next cycle, and so on.
In this way, in every clock cycle, we can perform a computation.
Prefetching :
In a typical program, a data item is loaded and used by a processor in a small time window. If the load
results in a cache miss, then the use stalls.
A simple solution to this problem is to advance the load operation so that even if there is a cache
miss, the data is likely to have arrived by the time it is used.
In advancing the loads, we are trying to identify independent threads of execution that have no resource dependency (i.e., that do not use the same registers) with respect to other threads.
Consider the problem of adding two vectors a and b using a single for loop. In the first iteration of the loop, the processor requests a[0] and b[0]. Since these are not in the cache, the processor must
pay the memory latency. While these requests are being serviced, the processor also requests a[1]
and b[1].
Assuming that each request is generated in one cycle (1 ns) and memory requests are satisfied in 100
ns, after 100 such requests the first set of data items is returned by the memory system.
Subsequently, one pair of vector components will be returned every cycle.
In this way, in each subsequent cycle, one addition can be performed and processor cycles are not
wasted.
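The arithmetic above can be checked with a small Python sketch (a minimal model; the 1 ns issue rate and 100 ns latency are the figures assumed in the example):
-----------------------------------------------------------------------------------------------------
# Rough cycle counts for the vector-add example (assumed figures: one request
# issued per cycle at 1 ns, memory latency 100 ns, n element pairs).

def time_without_overlap(n: int, latency: int = 100) -> int:
    # Wait out the full latency for every pair of loads before computing.
    return n * latency

def time_with_overlap(n: int, latency: int = 100) -> int:
    # Requests are issued every cycle; after a one-time warm-up of 'latency'
    # cycles, one pair arrives per cycle and one addition completes per cycle.
    return latency + n

n = 10_000
print(time_without_overlap(n))  # 1000000 ns
print(time_with_overlap(n))     # 10100 ns
-----------------------------------------------------------------------------------------------------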
4) Explain message passing and shared-address-space computers with neat sketches. Also state the
differences between these two computers. [June 13(7 marks)]
The "shared-address-space" view of a parallel platform supports a common data space that is
accessible to all processors.
Shared-address-space platforms supporting SPMD programming are also referred to as
multiprocessors.
Memory in shared-address-space platforms can be local or global.
If the time taken by a processor to access any memory word in the system (global or local) is
identical, the platform is classified as a uniform memory access (UMA) multiprocessor. On the other
hand, if the time taken to access certain memory words is longer than others, the platform is called a
non-uniform memory access (NUMA) multiprocessor.
If accessing local memory is cheaper than accessing global memory, algorithms must be designed to exploit locality, structuring data and computation accordingly.
Figures (a) and (b) illustrate UMA platforms, whereas Figure (c) illustrates a NUMA platform.
Here, in figure (b) it is faster to access a memory word in cache than a location in memory. However,
Initially, the variable resides in global memory. When the load operation is executed by both processors, the state of the variable is said to be 'Shared'. When P0 executes a store on this variable, it marks all other copies as 'Invalid' and its own copy as 'Dirty'. It means all subsequent accesses to this variable will be serviced by P0.
At this point, if P1 attempts to fetch this variable, which was marked dirty by P0, then P0
services the request.
Now, the variable at P1 and in global memory are updated, and it re-enters the 'Shared' state.
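The state transitions described above can be sketched as a toy invalidate protocol in Python (a single-threaded illustration of the 'Invalid'/'Shared'/'Dirty' bookkeeping only; the class and method names are ours, and this is not a real coherence implementation):
-----------------------------------------------------------------------------------------------------
class CoherentVar:
    def __init__(self, value, n_procs):
        self.memory = value                      # copy in global memory
        self.state = ['Invalid'] * n_procs       # per-processor cache state
        self.cache = [None] * n_procs

    def load(self, p):
        if self.state[p] == 'Invalid':
            # If some processor holds a Dirty copy, it services the request.
            owner = next((q for q, s in enumerate(self.state) if s == 'Dirty'), None)
            if owner is not None:
                self.memory = self.cache[owner]  # write back to global memory
                self.state[owner] = 'Shared'     # owner re-enters Shared
            self.cache[p] = self.memory
            self.state[p] = 'Shared'
        return self.cache[p]

    def store(self, p, value):
        # Mark all other copies Invalid; this copy becomes Dirty.
        for q in range(len(self.state)):
            if q != p:
                self.state[q] = 'Invalid'
        self.cache[p] = value
        self.state[p] = 'Dirty'

v = CoherentVar(5, n_procs=2)
v.load(0); v.load(1)      # both copies Shared
v.store(0, 9)             # P0 Dirty, P1 Invalid
print(v.load(1))          # 9: P0 services the request
print(v.state)            # ['Shared', 'Shared']
-----------------------------------------------------------------------------------------------------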
Computer Engineering Chapter - 3 Principles of Parallel Algorithm Designs
1) Enlist various decomposition techniques. Explain data decomposition with suitable example. OR
Explain recursive decomposition technique in detail and draw the task dependency graph for the following
sequence using quick-sort with recursive decomposition, choosing 5 as the pivot element initially: 5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2 [(Oct. 13, Oct. 12, June 13, June 12) – 7 marks ]
1. Recursive Decomposition:
This decomposition technique uses divide and conquer strategy.
In this technique, a problem is solved by first dividing it into a set of independent subproblems.
Each subproblem is then solved by recursively applying a similar division into smaller subproblems, and so on.
Ex : Quick-sort
In the above example an array A of n elements is sorted using quick-sort.
It selects a pivot element x and partitions A into two sub-arrays A0 and A1 such that all the elements in
A0 are smaller than x and all the elements in A1 are greater than or equal to x.
This partitioning step forms the divide step of the Algorithm.
Each one of the subsequences A0 and A1 is sorted by recursively calling quick sort. Each one of these
recursive calls further partitions the sub-arrays.
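The recursive decomposition can be traced with a short Python sketch run on the exam sequence with 5 as the initial pivot (each printed line is one task; the two partitions of a task are independent child tasks that a parallel runtime could execute concurrently):
-----------------------------------------------------------------------------------------------------
# Quicksort as recursive decomposition: each call is one task, and its two
# recursive calls on A0 and A1 can run as concurrent child tasks.

def quicksort_tasks(a, depth=0):
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    a0 = [x for x in rest if x < pivot]    # elements smaller than the pivot
    a1 = [x for x in rest if x >= pivot]   # elements greater than or equal
    print('  ' * depth + f'task: pivot={pivot}, A0={a0}, A1={a1}')
    return quicksort_tasks(a0, depth + 1) + [pivot] + quicksort_tasks(a1, depth + 1)

print(quicksort_tasks([5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2]))
-----------------------------------------------------------------------------------------------------
The first task prints pivot=5, A0=[1, 3, 4, 2], A1=[12, 11, 10, 6, 8, 7, 9]; the indentation of the subsequent lines traces the task-dependency tree.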
2. Data Decomposition:
Decomposition of computation is done in 2 steps.
Data on which computations are performed is partitioned.
Partition of data is used to partition the computation into several tasks.
Partitioning Output Data : In some computations, each element of the output can be computed
independently, so partitioning the output data automatically creates a decomposition of the problem into
tasks.
Ex : Matrix – Multiplication : Consider the problem of multiplying two n x n matrices A and B to yield
a matrix C. Figure shows a decomposition of this problem into four tasks. The decomposition
shown in Figure is based on partitioning the output matrix C into four sub matrices and each of
the four tasks computes one of these sub matrices.
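A minimal sketch of this output-data partitioning in Python (plain lists, no external libraries; the 2 x 2 block split and the helper names are ours):
-----------------------------------------------------------------------------------------------------
# Output-data partitioning for C = A x B with a 2 x 2 split of C:
# four independent tasks, one per submatrix of C.

def matmul(a, b):
    n, m, k = len(a), len(b[0]), len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def rows(m, lo, hi): return [r[:] for r in m[lo:hi]]     # block row of a matrix
def cols(m, lo, hi): return [r[lo:hi] for r in m]        # block column of a matrix

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
h = n // 2

# Task (i, j) computes C_ij from block row i of A and block column j of B;
# all four tasks are independent and could run concurrently.
tasks = {(i, j): matmul(rows(A, i * h, (i + 1) * h), cols(B, j * h, (j + 1) * h))
         for i in range(2) for j in range(2)}

# Stitch the four submatrices back into C and check against a direct product.
C = [tasks[(i // h, 0)][i % h] + tasks[(i // h, 1)][i % h] for i in range(n)]
assert C == matmul(A, B)
-----------------------------------------------------------------------------------------------------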
Partitioning Input Data : In computations such as finding the minimum, maximum or sum of an array, the
output is a single value that is not known in advance, so the output cannot be partitioned. In such cases
it is possible to partition the input data instead.
A task is created for each partition of the input data, and each task performs as much computation as
possible using its local data. This introduces concurrency, but the problem is not solved directly; a
follow-up computation is needed to combine the partial results.
The problem of determining the minimum of the set {4, 9, 1, 7, 8, 11, 12, 2} can also be decomposed
based on a partitioning of the input data.
Figure shows a decomposition based on a partitioning of the input set.
Partitioning Both (i/p & o/p) Data :
In some cases in which it is possible to partition the output data, partitioning of the input data can
offer additional concurrency.
Ex : Relational database of vehicles, for processing the following query :
MODEL="Civic" AND YEAR="2001" AND (COLOR="Green" OR COLOR="White")
Partitioning Intermediate Data :
Algorithms are often structured as multi-stage computations such that the output of one stage is
the input to the subsequent stage.
A decomposition of such an Algorithm can be derived by partitioning the input or the output data
of an intermediate stage of the Algorithm.
Partitioning intermediate data can sometimes lead to higher concurrency than partitioning input
or output data. Let us revisit matrix multiplication to illustrate a decomposition based on
partitioning intermediate data. The decompositions induced by a 2 x 2 partitioning of the output
matrix C have a maximum degree of concurrency of four.
The degree of concurrency can be increased by introducing an intermediate stage in which eight tasks
compute their respective product submatrices and store the results in a temporary three-dimensional
matrix D, as shown in figure. The submatrix D_k,i,j is the product of A_i,k and B_k,j.
(Figure: task-dependency graph for finding the minimum of the input set; leaf tasks compute the minima of the partitions {4,9}, {1,7}, {12,2}, {8,11}, intermediate tasks combine these partial results, and the root task produces the final answer 1.)
A partitioning of the intermediate matrix D induces a decomposition into eight tasks. After the
multiplication phase, a relatively inexpensive matrix addition step can compute the result matrix C.
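The intermediate-data decomposition can be sketched the same way (helper names again ours): stage 1 runs eight independent block multiplications D_k,i,j = A_i,k x B_k,j, and stage 2 runs four cheap additions C_i,j = D_0,i,j + D_1,i,j:
-----------------------------------------------------------------------------------------------------
def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def block(m, i, j, h):  # h x h block (i, j) of matrix m
    return [row[j * h:(j + 1) * h] for row in m[i * h:(i + 1) * h]]

def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

n, h = 4, 2
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]

# Stage 1: eight independent multiplication tasks (degree of concurrency 8).
D = {(k, i, j): matmul(block(A, i, k, h), block(B, k, j, h))
     for k in range(2) for i in range(2) for j in range(2)}

# Stage 2: four independent, relatively inexpensive addition tasks.
C_blocks = {(i, j): add(D[(0, i, j)], D[(1, i, j)]) for i in range(2) for j in range(2)}
assert C_blocks[(0, 0)] == block(matmul(A, B), 0, 0, h)
-----------------------------------------------------------------------------------------------------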
3. Exploratory Decomposition:
Exploratory decomposition is used to decompose problems whose underlying computations
correspond to a search of a space for solutions.
In exploratory decomposition, we partition the search space into smaller parts, and search each one
of these parts concurrently, until the desired solutions are found.
Ex: Consider the 15-puzzle problem. The 15-puzzle consists of 15 tiles numbered 1 through 15 and
one blank tile placed in a 4 x 4 grid.
A tile can be moved into the blank position from a position adjacent to it, thus creating a blank in the
tile's original position. Depending on the configuration of the grid, up to four moves are possible: up,
down, left, and right.
The initial and final configurations of the tiles are specified. The objective is to determine any
sequence or a shortest sequence of moves that transforms the initial configuration to the final
configuration.
Figure illustrates sample initial and final configurations and a sequence of moves leading from the
initial configuration to the final configuration.
4. Speculative Decomposition:
Speculative decomposition is used when a program may take one of many possible computationally
significant branches depending on the output of other computations that precede it.
While one task is performing the computation whose output is used in deciding the next
computation, other tasks can concurrently start the computations of the next stage.
This scenario is similar to evaluating one or more of the branches of a switch statement in C in
parallel before the input for the switch is available.
When the input for the switch has finally been computed, the computation corresponding to the
correct branch would be used while that corresponding to the other branches would be discarded.
However, this parallel formulation of a switch guarantees at least some wasteful computation. In
order to minimize the wasted computation, only the most promising branch is taken up as a task in
parallel with the preceding computation.
In case the outcome of the switch is different from what was anticipated, the computation is rolled
back and the correct branch of the switch is taken.
5. Hybrid Decomposition:
So far we have discussed a number of decomposition methods that can be used to derive concurrent
formulations of many Algorithms.
These decomposition techniques are not exclusive, and can often be combined together.
Often, a computation is structured into multiple stages and it is sometimes necessary to apply
different types of decomposition in different stages. Then it is known as Hybrid Decomposition.
2) Explain Static and Dynamic Mapping Techniques. [June 13(7 marks)]
Once a computation has been decomposed into tasks, the tasks are mapped to processes so that the
computation completes in the shortest amount of time.
In order to achieve small execution time two key sources of overheads must be minimized.
Time spent in inter-process communication.
Time that process may spend being idle.
Uneven load distribution may cause some processes to finish earlier than others. These two
objectives often conflict with each other.
For example, minimizing the interactions can be easily achieved by assigning the set of tasks that need to
interact with each other onto the same process. However, this can result in a highly unbalanced workload:
the few heavily loaded processes determine the overall finishing time while the others sit idle.
Due to the conflict between these objectives, finding a good mapping becomes a non-trivial problem.
Figure shows two mappings of 12-task decomposition in which the last four tasks can be started only after the first eight are
finished due to dependencies among tasks.
1. Static Mapping:
Static mapping techniques distribute the tasks among processes prior to the execution of the
algorithm. The choice of a good mapping in this case depends on several factors, including the
knowledge of task sizes, the size of data associated with tasks, the characteristics of inter-task
interactions, and even the parallel programming paradigm.
It is easier to design and program.
Mapping Based on Data Partitioning:
1) Block Distribution:
It is a process to distribute an array and assign uniform contiguous partitions of the array to
different processes.
A d-dimensional array is distributed among the processes such that each process receives a
contiguous block of the array.
Ex: an n x n two-dimensional matrix can be partitioned as shown in figure.
For example, in the case of matrix-matrix multiplication, a one-dimensional distribution will allow
us to use up to n processes by assigning a single row of C to each process.
On the other hand, a two-dimensional distribution will allow us to use up to n^2 processes by
assigning a single element of C to each process.
Figure illustrates this in the case of dense matrix-multiplication. With a one dimensional
partitioning along the rows, each process needs to access the corresponding n/p rows of matrix A
and the entire matrix B, as shown in figure (a) for process P5. However, with a two-dimensional
distribution, each process needs to access n/√p rows of matrix A and n/√p columns of matrix B, as
shown in figure (b) for process P5.
2) Cyclic and Block-Cyclic Distribution :
If the amount of work differs for different elements of a matrix, a block distribution can
potentially lead to load imbalances.
The block-cyclic distribution is a variation of the block distribution scheme that can be used to
alleviate the load-imbalance and idling problems.
The central idea behind a block-cyclic distribution is to partition an array into many more blocks
than the number of available processes. Then we assign the partitions (and the associated tasks)
to processes in a round-robin manner so that each process gets several non-adjacent blocks.
3) Randomized Block Distribution :
Randomized block distribution, a more general form of the block distribution, can be used in
situations illustrated in figure. Just like a block-cyclic distribution, load balance is sought by
partitioning the array into many more blocks than the number of available processes.
However, the blocks are uniformly and randomly distributed among the processes. A one-dimensional
randomized block distribution can be achieved by partitioning the array into many blocks and assigning
the blocks to processes according to a random permutation. The random block distribution is more
effective in load balancing the computations.
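The three array distributions can be compared through their ownership maps with a short Python sketch (the alpha factor and the use of a random permutation of blocks follow the idea described above; the exact details are assumptions):
-----------------------------------------------------------------------------------------------------
# Ownership maps for block, block-cyclic, and randomized block distributions
# of a 1-D array of n elements over p processes (illustrative sketch).
import random

def block_owner(n, p):
    b = n // p                                   # one contiguous block per process
    return [min(i // b, p - 1) for i in range(n)]

def block_cyclic_owner(n, p, alpha=2):
    b = n // (alpha * p)                         # alpha*p blocks, dealt round-robin
    return [(i // b) % p for i in range(n)]

def randomized_block_owner(n, p, alpha=2, seed=0):
    blocks = list(range(alpha * p))
    random.Random(seed).shuffle(blocks)          # random permutation of the blocks
    b = n // (alpha * p)
    return [blocks[i // b] % p for i in range(n)]

n, p = 16, 4
print(block_owner(n, p))         # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
print(block_cyclic_owner(n, p))  # [0,0, 1,1, 2,2, 3,3, 0,0, 1,1, 2,2, 3,3]
print(randomized_block_owner(n, p))
-----------------------------------------------------------------------------------------------------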
Mapping Based on Graph Partitioning:
The array-based distribution schemes are quite effective in balancing the computations and
minimizing the interactions for a wide range of algorithms that use dense matrices and have
structured and regular interaction patterns.
However, there are many algorithms that operate on sparse data structures and for which the
pattern of interaction among data elements is data dependent and highly irregular.
In these computations, the physical domain is discretized and represented by a mesh of elements.
Mapping Based on Task Partitioning:
A mapping based on partitioning a task-dependency graph and mapping its nodes onto processes
can be used when the computation is naturally expressible in the form of a static task-dependency
graph with tasks of known sizes.
As a simple example of a mapping based on task partitioning, consider a task-dependency graph
that is a perfect binary tree. Such a task-dependency graph can occur in practical problems with
recursive decomposition, such as the decomposition for finding the minimum of a list of numbers.
It is easy to see that this mapping minimizes the interaction overhead by mapping many
interdependent tasks onto the same process (i.e., the tasks along a straight branch of the tree)
and others on processes only one communication link away from each other.
2. Dynamic Mapping
Dynamic mapping techniques distribute the work among processes during the execution of the
algorithm. If tasks are generated dynamically, then they must be mapped dynamically too.
If task sizes are unknown, then a static mapping can potentially lead to serious load imbalances,
and dynamic mappings are usually more effective. However, if the amount of data associated with
tasks is large relative to the computation, a dynamic mapping may entail moving this data among
processes, and the cost of this movement can outweigh its advantages.
Algorithms that require dynamic mapping are usually more complicated.
Dynamic mapping is necessary in situations where a static mapping may result in a highly
imbalanced distribution of work among processes or where the task-dependency graph itself is
dynamic.
Dynamic mapping techniques are usually classified as either centralized or distributed.
Centralized Schemes:
In a centralized dynamic load balancing scheme, all executable tasks are maintained in a common
central data structure or they are maintained by a special process or a subset of processes.
If a special process is designated to manage the pool of available tasks, then it is often referred to
as the master and the other processes that depend on the master to obtain work are referred to
as slaves.
Whenever a process has no work, it takes a portion of available work from the central data
structure or the master process. Whenever a new task is generated, it is added to this centralized
data structure or reported to the master process.
As more and more processes are used, the large number of accesses to the common data
structure or the master process tends to become a bottleneck.
Distributed Schemes:
In a distributed dynamic load balancing scheme, the set of executable tasks are distributed among
processes which exchange tasks at run time to balance work.
Each process can send work to or receive work from any other process. These methods do not
suffer from the bottleneck associated with the centralized schemes.
Some of the critical parameters of a distributed load balancing scheme are as follows:
o How are the sending and receiving processes paired together?
o Is the work transfer initiated by the sender or the receiver?
o How much work is transferred in each exchange?
o When is the work transfer performed?
3) Explain various parallel algorithm models. [Oct. 12(7 marks)]
An algorithm model is typically a way of structuring a parallel algorithm by selecting a decomposition
and mapping technique and applying the appropriate strategy to minimize interactions.
1) Data Parallel Model:
The data-parallel model is one of the simplest algorithm models. In this model, the tasks are statically
or semi-statically mapped onto processes and each task performs similar operations on different
data.
This type of parallelism that is a result of identical operations being applied concurrently on different
data items is called data parallelism.
Data-parallel algorithms can be implemented in both shared-address-space and message passing
paradigms.
However, the partitioned address-space in a message-passing paradigm may allow better control of
placement, and thus may offer a better handle on locality.
Interaction overheads in the data-parallel model can be minimized by choosing a locality preserving
decomposition and, if applicable, by overlapping computation and interaction and by using optimized
collective interaction routines.
A key characteristic of data-parallel problems is that for most problems, the degree of data
parallelism increases with the size of the problem, making it possible to use more processes to
effectively solve larger problems.
2) Task Graph Model:
The computations in any parallel algorithm can be viewed as a task-dependency graph. However,
in certain parallel algorithms, the task-dependency graph is explicitly used in mapping.
In the task graph model, the interrelationships among the tasks are utilized to promote locality or to
reduce interaction costs.
This model is typically employed to solve problems in which the amount of data associated with the
tasks is large relative to the amount of computation associated with them.
Typical interaction-reducing techniques applicable to this model include reducing the volume and
frequency of interaction by promoting locality while mapping the tasks based on the interaction
pattern of tasks, and using asynchronous interaction methods to overlap the interaction with
computation.
Examples of algorithms based on the task graph model include parallel quicksort, sparse matrix
factorization, and many parallel algorithms derived via divide-and-conquer decomposition.
This type of parallelism that is naturally expressed by independent tasks in a task-dependency graph
is called task parallelism.
3) Work Pool Model:
The work pool or the task pool model is characterized by a dynamic mapping of tasks onto processes
for load balancing in which any task may potentially be performed by any process.
There is no desired pre-mapping of tasks onto processes. The mapping may be centralized or
decentralized.
The work may be statically available in the beginning, or could be dynamically generated; i.e., the
processes may generate work and add it to the global (possibly distributed) work pool.
If the work is generated dynamically and a decentralized mapping is used, then a termination
detection algorithm would be required so that all processes can actually detect the completion of the
entire program and stop looking for more work.
In the message-passing paradigm, the work pool model is typically used when the amount of data
associated with tasks is relatively small compared to the computation associated with the tasks.
4) Master-Slave Model:
In the master-slave or the manager-worker model, one or more master processes generate work and
allocate it to worker processes. The tasks may be allocated a priori if the manager can estimate the
size of the tasks or if a random mapping can do an adequate job of load balancing.
In another scenario, workers are assigned smaller pieces of work at different times. The latter scheme
is preferred if it is time consuming for the master to generate work and hence it is not desirable to
make all workers wait until the master has generated all work pieces.
The manager-worker model can be generalized to the hierarchical or multi-level manager-worker
model, in which the top-level manager feeds large chunks of tasks to second-level managers, who
further subdivide the tasks among their own workers and may perform part of the work themselves.
This model is generally equally suitable to shared-address-space or message-passing paradigms.
The manager needs to give out work and workers need to get work from the manager. While using
the master-slave model, care should be taken to ensure that the master does not become a
bottleneck, which may happen if the tasks are too small or the workers are relatively fast.
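A minimal master-slave sketch in Python (threads and a shared queue stand in for processes; the task itself, squaring numbers, is only a placeholder):
-----------------------------------------------------------------------------------------------------
import queue, threading

tasks, results = queue.Queue(), queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:               # sentinel from the master: stop
            break
        results.put(item * item)       # the 'work' here is just squaring a number

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers: w.start()

for i in range(20):                    # master generates the work pieces
    tasks.put(i)
for _ in workers:                      # one stop sentinel per worker
    tasks.put(None)
for w in workers: w.join()

print(sorted(results.queue))           # squares of 0..19
-----------------------------------------------------------------------------------------------------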
5) Pipeline Model:
In the pipeline model, a stream of data is passed on through a succession of processes, each of which
performs some task on it.
This simultaneous execution of different programs on a data stream is called stream parallelism.
The arrival of new data triggers the execution of a new task by a process in the pipeline. A pipeline is a chain of
producers and consumers. Each process in the pipeline can be viewed as a consumer of a sequence of
data items for the process preceding it in the pipeline and as a producer of data for the process
following it in the pipeline.
The pipeline does not need to be a linear chain; it can be a directed graph. The pipeline model usually
involves a static mapping of tasks onto processes.
The most common interaction reduction technique applicable to this model is overlapping interaction
with computation.
6) Hybrid Model:
In some cases, more than one model may be applicable to the problem at hand, resulting in a hybrid
algorithm model.
A hybrid model may be composed either of multiple models applied hierarchically or multiple models
applied sequentially to different phases of a parallel algorithm.
Computer Engineering Chapter - 4 Basic communication operations and algorithms
1. Explain One-To-All broadcasting in different structures. [June 12(3.5 marks)]
Parallel algorithms often require a single process to send identical data to all other processes or to a
subset of them.
This operation is known as one-to-all broadcast.
Initially, only the source process has the data of size m. At the termination of the procedure, there are p
copies of the data, one belonging to each process.
Implementation of this operation on a variety of interconnection topologies is shown
below:
1) Ring :
A way to perform one-to-all broadcast is to sequentially send (p - 1) messages from the source to the other
(p - 1) processes.
This is inefficient because the source process becomes a bottleneck.
It also underutilizes the communication network, because only the connection between a single pair of
nodes is used at a time.
A better broadcast algorithm can be developed using a technique known as recursive doubling.
In recursive doubling, the source process first sends the message to another process.
Now both processes can simultaneously send the message to two other processes.
This procedure continues until all the processes have received the data.
So, the message can be broadcast in log p steps.
Each message transmission step is shown by a numbered, dotted arrow from the source to the destination. The
number on an arrow indicates the time step during which the message is transferred.
The message is first sent from node 0 to node 4. In the second step, the distance between source and destination is
halved, and so on.
Message recipients are selected in this manner to avoid congestion on the network.
For example, if node 0 sent the message to node 1 first, and then both 0 and 1 attempted to send the message to
nodes 2 and 3, the link between nodes 1 and 2 would be congested.
2) Mesh :
Each row and each column of a mesh with p nodes can be regarded as a linear array of √p nodes, so a linear-array communication operation can be performed on the mesh in two phases.
In the first phase, the operation is performed along the rows by treating the rows as linear arrays.
In the second phase, the columns are treated as linear arrays.
Consider the problem of one-to-all broadcast on a two-dimensional square mesh with √p rows and √p
columns.
First, a one-to-all broadcast is performed from the source to the remaining (√p-1) nodes of the same
row. Once all the nodes in a row of the mesh have acquired the data, they initiate a one-to-all
broadcast in their respective columns.
At the end of the second phase, every node in the mesh has a copy of the initial message.
The communication steps for one-to-all broadcast on a mesh are illustrated in figure.
3) Hypercube :
A hypercube with 2^d nodes can be regarded as a d-dimensional mesh with two nodes in each
dimension.
Figure shows a one-to-all broadcast on an eight-node (three-dimensional) hypercube with node 0 as
the source.
Unlike a linear array, the hypercube broadcast would not suffer from congestion if node 0 started out
by sending the message to node 1 in the first step, followed by nodes 0 and 1 sending messages to
nodes 2 and 3, respectively, and finally nodes 0, 1, 2, and 3 sending messages to nodes 4, 5, 6, and 7,
respectively.
4) Complete Binary Tree :
The hypercube algorithm for one-to-all broadcast maps naturally onto a balanced binary tree in which
each leaf is a processing node and intermediate nodes serve only as switching units.
This is illustrated in figure for eight nodes. In this figure, the communicating nodes have the same
labels as in the hypercube algorithm. Figure shows that there is no congestion on any of the
communication links at any time.
The difference between the communication on a hypercube and on the tree shown in figure is that a
different number of switching nodes lies along different paths of the tree.
Algorithm for the One-to-All Broadcast operation (this algorithm is common to all the topologies discussed above):
procedure ONE_TO_ALL_BC(d, my_id, X)
begin
  mask := 2^d - 1;                     /* Set all d bits of mask to 1 */
  for i := d - 1 downto 0 do           /* Outer loop: one dimension per step */
    mask := mask XOR 2^i;              /* Set bit i of mask to 0 */
    if (my_id AND mask) = 0 then       /* If lower i bits of my_id are 0 */
      if (my_id AND 2^i) = 0 then
        msg_destination := my_id XOR 2^i;
        send X to msg_destination;
      else
        msg_source := my_id XOR 2^i;
        receive X from msg_source;
      endelse;
    endif;
  endfor;
end ONE_TO_ALL_BC
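A direct Python rendering of ONE_TO_ALL_BC can be used to check the schedule (the send is simulated with a shared dictionary, so this is a simulation sketch rather than message-passing code):
-----------------------------------------------------------------------------------------------------
def one_to_all_bc(d, data):
    p = 2 ** d
    x = {0: data}                          # only the source (node 0) holds X initially
    mask = 2 ** d - 1                      # set all d bits of mask to 1
    for i in range(d - 1, -1, -1):         # communicate along dimension i
        mask ^= 2 ** i                     # set bit i of mask to 0
        for my_id in range(p):
            if my_id & mask == 0:          # lower i bits of my_id are 0
                if my_id & 2 ** i == 0:
                    x[my_id ^ 2 ** i] = x[my_id]   # send X to the partner node
    return x

print(one_to_all_bc(3, 'X'))  # all 8 nodes hold 'X' after log p = 3 steps
-----------------------------------------------------------------------------------------------------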
2. What is dual operation? Explain dual operation of one-to-all broadcast in different structures.
[June 12(3.5 marks)]
The dual of one-to-all broadcast is all-to-one reduction, in which each process starts with a buffer M
containing m words. The data from all processes are combined through an associative operator and
accumulated at a single destination process into one buffer of size m.
Reduction can be used to find the sum, maximum, product, or minimum of a set of numbers.
The i-th word of the accumulated M is the sum, product, minimum, or maximum of the i-th words of each of the
original buffers.
Both of these operations are used in several important parallel algorithms, such as matrix multiplication and
matrix-vector multiplication.
Implementation of this operation on a variety of interconnection topologies is shown below:
1) Ring :
Reduction on linear array can be performed by simply reversing the direction and sequence of
communication.
In the first step, each odd-numbered node sends its buffer to the even-numbered node just before it, where
the contents of the two buffers are combined into one.
In the second step, there are now four buffers left to reduce, on nodes 0, 2, 4, and 6. The contents of buffers
0 and 2 are combined on node 0, and those of nodes 4 and 6 are combined on node 4.
Finally, node 4 sends its buffer to node 0, which computes the final result of the reduction.
2) Mesh:
As with the ring topology, the reduction operation on a mesh, hypercube or binary tree can be performed by
simply reversing the direction and sequence of communication.
In the first phase, communication is done along the columns of the mesh, in which the message is reduced by
some associative operator like sum or product.
In the second phase, the first row of the mesh is treated as a linear array and row-wise communication is done.
Finally, the reduced data is stored in the buffer of the destination node 0.
Steps 1 and 2 show the communication of phase 1, and steps 3 and 4 show the communication of phase 2 in the
above figure.
3) Hypercube :
As shown in the figure below, reduction operation in 3 dimensional hypercube is performed in 3 steps.
Reduction in hypercube can be done by reversing the direction of broadcast operation.
In the first step, each odd-numbered node sends its message to the even-numbered neighbour node to its
left.
In the second step, nodes 2 and 6 send their messages to nodes 0 and 4, respectively. So, after the second step,
nodes 0 and 4 contain the reduced data.
In the third step, node 4 sends its buffer to node 0. Finally, node 0 contains the reduced data.
Following figure shows the All-to-One reduction operation on 3 dimensional hypercube.
4) Complete Binary Tree :
All-to-One reduction in binary tree can also be performed by reversing the direction of communication
from one-to-all broadcast.
Algorithm for All-to-One Reduction operation:
The general algorithm for all topologies to perform all-to-one reduction operation is shown below,
procedure ALL_TO_ONE_REDUCE(d, my_id, m, X, sum)
begin
  sum := X;
  mask := 0;
  for i := 0 to d - 1 do
    /* Select nodes whose lower i bits are 0 */
    if (my_id AND mask) = 0 then
      if (my_id AND 2^i) ≠ 0 then
        msg_destination := my_id XOR 2^i;
        send sum to msg_destination;
      else
        msg_source := my_id XOR 2^i;
        receive X from msg_source;
        sum := sum + X;
      endelse;
    endif;
    mask := mask XOR 2^i;              /* Set bit i of mask to 1 */
  endfor;
end ALL_TO_ONE_REDUCE
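The dual procedure can be checked the same way; the following Python sketch simulates ALL_TO_ONE_REDUCE with addition as the associative operator:
-----------------------------------------------------------------------------------------------------
def all_to_one_reduce(d, values):
    p = 2 ** d
    s = list(values)                       # the 'sum' buffer on each node
    mask = 0
    for i in range(d):                     # reverse order of the broadcast
        for my_id in range(p):
            if my_id & mask == 0:          # nodes whose lower i bits are 0
                if my_id & 2 ** i != 0:
                    s[my_id ^ 2 ** i] += s[my_id]   # send sum to the neighbor
        mask ^= 2 ** i                     # set bit i of mask to 1
    return s[0]                            # node 0 holds the accumulated result

print(all_to_one_reduce(3, range(8)))      # 0 + 1 + ... + 7 = 28
-----------------------------------------------------------------------------------------------------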
Both operations can also be shown as the following figure.
Ex: Matrix-Vector Multiplication:
3. Explain All-to-All Broadcast in various structures.
OR
Explain All-to-All Broadcast and Reduction in Ring structure.
OR
Explain All-to-All Broadcast on 2-D Mesh structure.
OR
Explain All-to-All Broadcast and Reduction on Hypercube structure.
[June 12(7 marks)]
It is the generalization of one-to-all broadcast in which all nodes simultaneously initiate a broadcast.
Each process sends the same m-word message to every other process, but different processes may
broadcast different messages.
One way to perform an all-to-all broadcast is to perform p one-to-all broadcasts, one starting from each
node.
This takes approximately p times as long as a single one-to-all broadcast.
All-to-all broadcast is used in matrix operations, including matrix multiplication and matrix-vector
multiplication.
It is possible to use the communication links in the interconnection network more efficiently by
performing all p one-to-all broadcasts simultaneously so that all messages traversing the same path at
the same time are concatenated into a single message whose size is the sum of the sizes of individual
messages.
The dual of all-to-all broadcast is all-to-all reduction, in which every node is the destination of an all-to-one
reduction.
The following figure illustrates the all-to-all broadcast and all-to-all reduction.
All-to-all broadcast and all-to-all reduction on various topologies are discussed below :
All-to-All Broadcast on Ring:
While performing all-to-all broadcast on a linear array or a ring, all communication links can be kept
busy simultaneously until the operation is complete because each node always has some information
that it can pass along to its neighbor.
Each node first sends to one of its neighbors the data it needs to broadcast.
In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor.
Figure illustrates all-to-all broadcast for an eight-node ring. In all-to-all broadcast, p different messages
circulate in the p-node ensemble.
In figure, each message is identified by its initial source, whose label appears in parentheses along
with the time step. For instance, the arc labeled 2 (7) between nodes 0 and 1 represents the data
communicated in time step 2 that node 0 received from node 7 in the preceding step.
As figure shows, if communication is performed circularly in a single direction, then each node
receives all (p - 1) pieces of information from all other nodes in (p - 1) steps.
Algorithm gives a procedure for all-to-all broadcast on a p-node ring.
The initial message to be broadcast is known locally as my_msg at each node.
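Since the procedure itself is not reproduced here, the following Python sketch (our reconstruction of the forwarding scheme described above) simulates all-to-all broadcast on a p-node ring:
-----------------------------------------------------------------------------------------------------
# In each of p-1 steps every node forwards to its right neighbour the message
# it received in the previous step, so all links stay busy throughout.

def all_to_all_bc_ring(p):
    result = [{i} for i in range(p)]       # messages accumulated at each node
    outgoing = list(range(p))              # message each node sends next (my_msg)
    for _ in range(p - 1):
        incoming = [outgoing[(i - 1) % p] for i in range(p)]  # receive from the left
        for i in range(p):
            result[i].add(incoming[i])
        outgoing = incoming                # forward what was just received
    return result

print(all_to_all_bc_ring(8))               # every node ends with all 8 messages
-----------------------------------------------------------------------------------------------------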
Computer Engineering Chapter - 9 Sorting
[Oct. 13(3.5 marks), June 13(7 marks), Oct. 12(7 marks), June 12(7 marks)]
A bitonic sorting network sorts n elements in O(log^2 n) time. The key operation of the bitonic sorting
network is the rearrangement of a bitonic sequence into a sorted sequence.
A bitonic sequence is a sequence of elements <a0, a1, ..., an-1> with the property that either (1) there exists an index i, 0 ≤ i ≤ n - 1, such that <a0, ..., ai> is monotonically increasing and <ai+1, ..., an-1> is monotonically decreasing, or (2) there exists a cyclic shift of indices so that (1) is satisfied.
The method that rearranges a bitonic sequence to obtain a monotonically increasing order is called bitonic
sort.
Let s = <a0, a1, ..., an-1> be a bitonic sequence such that a0 ≤ a1 ≤ ... ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ... ≥ an-1.
Consider the following two subsequences of s:
s1 = <min{a0, an/2}, min{a1, an/2+1}, ..., min{an/2-1, an-1}>
s2 = <max{a0, an/2}, max{a1, an/2+1}, ..., max{an/2-1, an-1}>
In sequence s1, there is an element bi = min{ai, an/2+i } such that all the elements before bi are from the
increasing part of the original sequence and all the elements after bi are from the decreasing part.
Also, in sequence s2, the element bi' = max{ai, an/2+i} is such that all the elements before bi' are from the decreasing part of the original sequence and all the elements after bi' are from the increasing part.
Thus, the sequences s1 and s2 are bitonic sequences.
Furthermore, every element of the first sequence is smaller than every element of the second sequence.
The reason is that bi is greater than or equal to all elements of s1, bi' is less than or equal to all elements of
s2, and bi is less than or equal to bi'.
Thus, we have reduced the initial problem of rearranging a bitonic sequence of size n to that of
rearranging two smaller bitonic sequences and concatenating the results. We refer to the operation of
splitting a bitonic sequence of size n into the two bitonic sequences as a bitonic split. We can recursively
obtain shorter bitonic subsequences using the equations for s1 and s2, until we obtain subsequences of size one.
Number of splits required to rearrange the bitonic sequence into sorted sequence is log n.
This procedure of sorting a bitonic sequence using bitonic splits is called bitonic merge, shown below in figure.
Merging of a bitonic sequence is easy to implement on a network of comparators, called a bitonic merging
network. This network contains log n columns; each column contains n/2 comparators and performs one step of the
bitonic merge. The network takes a bitonic sequence as input and gives a sorted sequence as output.
If we build the same network with decreasing comparators, the input will be sorted in monotonically
decreasing order.
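The bitonic split and merge can be sketched in Python as follows (a minimal sketch; the function assumes its input is a bitonic sequence whose length is a power of two):
-----------------------------------------------------------------------------------------------------
def bitonic_merge(seq, ascending=True):
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    # Bitonic split: s1 = element-wise minima, s2 = element-wise maxima.
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    if not ascending:
        s1, s2 = s2, s1
    # s1 and s2 are themselves bitonic; recurse until size-one subsequences.
    return bitonic_merge(s1, ascending) + bitonic_merge(s2, ascending)

bitonic = [1, 3, 5, 8, 9, 7, 4, 2]          # increasing, then decreasing
print(bitonic_merge(bitonic))               # [1, 2, 3, 4, 5, 7, 8, 9]
-----------------------------------------------------------------------------------------------------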
Mapping Bitonic Sort to a Hypercube :
In this mapping, each of the n processes contains one element of the input sequence. Graphically, each
wire of the bitonic sorting network represents a distinct process.
During each step of the algorithm, the compare-exchange operations performed by a column of
comparators are performed by n/2 pairs of processes.
If the mapping is poor, the elements travel a long distance before they can be compared, which will
degrade performance.
Ideally, wires that perform a compare-exchange should be mapped onto neighboring processes.
In any step, the compare-exchange operation is performed between two wires only if their labels differ
in exactly one bit.
During each stage, wires whose labels differ in the least-significant bit perform a compare-exchange in
the last step of that stage.
Wires are mapped onto the processes of a hypercube-connected parallel computer in such a way
that compare-exchange operations take place between wires whose labels differ in only one bit.
Processes are paired for their compare-exchange steps in a d-dimensional hypercube (that is, p = 2^d).
In the final stage of bitonic sort, the input has been converted into a bitonic sequence.
During the first step of this stage, processes that differ only in the d-th bit of the binary representation of
their labels (that is, the most significant bit) compare-exchange their elements.
Thus, the compare-exchange operation takes place between processes along the d-th dimension.
Similarly, during the second step of the algorithm, the compare-exchange operation takes place among
the processes along the (d - 1)-th dimension.
Figure illustrates the communication during the last stage of the bitonic sort algorithm.
The bitonic sort algorithm for a hypercube is shown below.
The algorithm relies on the functions comp_exchange_max(i) and comp_exchange_min(i). These
functions compare the local element with the element on the nearest process along the i-th dimension
and retain either the maximum or the minimum of the two elements, respectively.
-------------------------------------------------------------------------------------------------------------
procedure BITONIC_SORT(label, d)
begin
  for i := 0 to d - 1 do
    for j := i downto 0 do
      if (i + 1)-st bit of label ≠ j-th bit of label then
        comp_exchange_max(j);
      else
        comp_exchange_min(j);
end BITONIC_SORT
-------------------------------------------------------------------------------------------------------------
Mapping Bitonic Sort to a Mesh :
The connectivity of a mesh is lower than that of a hypercube, so it is impossible to map wires to
processes such that each compare-exchange operation occurs only between neighboring processes.
There are several ways to map the input wires onto the mesh processes. Some of these are illustrated in
Figure. Each process in this figure is labeled by the wire that is mapped onto it.
The odd-even transposition algorithm sorts n elements in n phases (n is even), each of which requires
n/2 compare-exchange operations.
This algorithm alternates between two phases, called the odd and even phases.
Let <a1, a2, ..., an > be the sequence to be sorted. During the odd phase, elements with odd indices are
compared with their right neighbors, and if they are out of sequence they are exchanged; thus, the pairs
(a1, a2), (a3, a4), ..., (an-1, an) are compare-exchanged (assuming n is even).
Similarly, during the even phase, elements with even indices are compared with their right neighbors.
After n phases of odd-even exchanges, the sequence is sorted.
During each phase of the algorithm, compare-exchange operations on pairs of elements are performed
simultaneously.
Consider the one-element-per-process case. Let n be the number of processes (also the number of
elements to be sorted).
Assume that the processes are arranged in a one-dimensional array. Element ai initially resides on
process Pi for i = 1, 2, ..., n. During the odd phase, each process that has an odd label compare-
exchanges its element with the element residing on its right neighbor.
Similarly, during the even phase, each process with an even label compare-exchanges its element with
the element of its right neighbor. This parallel formulation is presented in following Algorithm.
-------------------------------------------------------------------------------------------------------
procedure ODD-EVEN_PAR(n)
begin
  id := process's label;
  for i := 1 to n do
    if i is odd then
      if id is odd then
        compare-exchange_min(id + 1);
      else
        compare-exchange_max(id - 1);
    if i is even then
      if id is even then
        compare-exchange_min(id + 1);
      else
        compare-exchange_max(id - 1);
  endfor;
end ODD-EVEN_PAR
-------------------------------------------------------------------------------------------------------
The odd-even transposition sort is shown in the following Figure.
During each phase of the algorithm, the odd- or even-numbered processes perform a compare-exchange step with
their right neighbors.
A total of n such phases are performed; thus, the parallel run time of this formulation is Θ(n).
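A sequential Python rendering of odd-even transposition sort (1-based pair numbering as in the text; in the parallel formulation, all compare-exchanges inside a phase run concurrently):
-----------------------------------------------------------------------------------------------------
def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(1, n + 1):
        # Odd phase: pairs (a1,a2),(a3,a4),...; even phase: (a2,a3),(a4,a5),...
        start = 0 if phase % 2 == 1 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:                 # compare-exchange with right neighbor
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1]))  # [1, 2, 3, 3, 4, 5, 6, 8]
-----------------------------------------------------------------------------------------------------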
1) Explain parallel algorithm for Prim's algorithm and compare its complexity with the sequential algorithm for the same. [Oct. 13(7 marks), June 13(7 marks), Oct. 12(7 marks), June 12(7 marks)]
A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum
weight.
If G is not connected, it cannot have a spanning tree. Prim's algorithm for finding an MST is a greedy
algorithm.
The algorithm begins by selecting an arbitrary starting vertex.
It then grows the minimum spanning tree by choosing a new vertex and edge that are guaranteed to
be in a spanning tree of minimum cost.
The algorithm continues until all the vertices have been selected. Let G = (V, E, w) be the weighted
undirected graph for which the minimum spanning tree is to be found, and let A = (ai, j) be its
weighted adjacency matrix.
The algorithm uses the set VT to hold the vertices of the minimum spanning tree during its
construction.
It also uses an array d[1..n] in which, for each vertex v ∈ (V - VT), d[v] holds the weight of the edge
with the least weight from any vertex in VT to vertex v.
Each process Pi computes di[u] = min{ di[v] | v ∈ (V - VT) } during each iteration of the while loop.
The global minimum is then obtained over all di[u] by an all-to-one reduction operation and stored in
P0.
P0 then inserts u into VT and broadcasts u to all processes by a one-to-all broadcast operation.
• The adjacency matrix is partitioned in a 1-D block fashion, with distance vector d partitioned
accordingly.
• In each step, a processor selects the locally closest node, followed by a global reduction to select
globally closest node.
• This node is inserted into MST, and the choice broadcast to all processors.
• Each processor updates its part of the d vector locally.
Time complexities for various operations:
• cost to select the minimum entry = O(n/p + log p).
• cost of a broadcast = O(log p).
• cost of local updating of the d vector = O(n/p).
• parallel time per iteration = O(n/p + log p).
• total parallel time = O(n^2/p + n log p).
• corresponding isoefficiency = O(p^2 log^2 p).
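The per-iteration structure (local minimum, all-to-one reduction, broadcast, local update) can be sketched in Python; the example graph, the partitioning into p = 2 blocks, and the sequential loops standing in for processes are all illustrative assumptions:
-----------------------------------------------------------------------------------------------------
INF = float('inf')
A = [[0, 1, 3, INF],       # weighted adjacency matrix (INF = no edge)
     [1, 0, 5, 1],
     [3, 5, 0, 2],
     [INF, 1, 2, 0]]

n, p = 4, 2
chunk = n // p
in_tree = [True] + [False] * (n - 1)    # grow the MST from vertex 0
d = A[0][:]                             # d[v] = lightest edge from the tree to v

for _ in range(n - 1):
    # Each process scans its own block of d for the locally closest vertex ...
    local = [min(((d[v], v) for v in range(q * chunk, (q + 1) * chunk)
                  if not in_tree[v]), default=(INF, -1)) for q in range(p)]
    # ... then an all-to-one reduction yields the globally closest vertex u,
    # whose choice is broadcast back to all processes.
    _, u = min(local)
    in_tree[u] = True
    for q in range(p):                  # each process updates its slice of d
        for v in range(q * chunk, (q + 1) * chunk):
            if not in_tree[v]:
                d[v] = min(d[v], A[u][v])

print(d)  # each entry holds the weight of the MST edge chosen for that vertex
-----------------------------------------------------------------------------------------------------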
2) Explain parallel formulation of Dijkstra's algorithm for single source shortest path with an example. [June 13(7 marks), Oct. 12(7 marks), June 12(7 marks)]
For a weighted graph G = (V, E, w), the single-source shortest paths problem is to find the shortest
paths from a vertex v ∈ V to all other vertices in V.
A shortest path from u to v is a minimum-weight path.
Depending on the application, edge weights may represent time, cost, penalty, loss, or any other
quantity that accumulates additively along a path and is to be minimized.
In the following section, we present Dijkstra's algorithm, which solves the single-source shortest-paths
problem on both directed and undirected graphs with non-negative weights.
Dijkstra's algorithm, which finds the shortest paths from a single vertex s, is similar to Prim's
minimum spanning tree algorithm.
Dijkstra's single-source shortest path algorithm:
Like Prim's algorithm, it finds the shortest paths from s to the other vertices of G.
It is also greedy; that is, it always chooses an edge to a vertex that appears closest.
Comparing this algorithm with Prim's minimum spanning tree algorithm, we see that the two are
almost identical.
The main difference is that, for each vertex u ∈ (V - VT), Dijkstra's algorithm stores l[u], the minimum