CS4402 – Parallel Computing
Lecture 7
Parallel Graphics – More Fractals
Scheduling
04/19/23 2
FRACTALS
Fractals
A fractal is a set of points such that:
- its fractal dimension exceeds its topological dimension [infinite detail at every point].
- it satisfies self-similarity: any part of the fractal is similar to the whole fractal.
Generating a fractal is an iterative process:
- start from P0
- iteratively generate P1=F(P0), P2=F(P1), …, Pn=F(Pn-1), …
P0 is a set of initial points.
F is a transformation:
- geometric transformations: translations, rotations, scaling, …
- non-linear coordinate transformations.
Points vs Pixels
We work with 2 rectangular areas.
The user space:
- real coordinates (x,y)
- bounded between [xMin,xMax]*[yMin,yMax]
The screen space:
- integer coordinates (i,j)
- bounded between [0,w-1]*[0,h-1]
- is upside down, with the Oy axis pointing downward
How do we squeeze the user space into the screen space?
How do we translate (x,y) into (i,j)?
Julia Sets – Self-Squaring Fractals
Consider the generating function F(z) = z^2 + c, with z, c ∈ C.
Sequence of complex numbers: z0 ∈ C and zn+1 = zn^2 + c.
Chaotic behaviour, but two attractors for |zn|: 0 and +∞.
For a given c ∈ C, the Julia set Jc contains all the points z0 whose orbit is bounded.
Julia Sets – Algorithm
Inputs:
- c ∈ C the complex number; [xmin,xmax] * [ymin,ymax] a region in the plane.
- Niter a number of iterations for orbits; R a threshold for the attractor.
Output: Jc the Julia set of c.
Algorithm
For each pixel (i,j) on the screen
  translate (i,j) into (x,y)
  construct the complex number z0 = x + y·i
  find the orbit of z0 [first Niter elements]
  if (all the orbit points are under the threshold) draw (x,y)
for (i = 0; i < width; i++) for (j = 0; j < width; j++) {
    int k = 0;
    // construct the orbit of z
    z.re = XMIN + i*STEP; z.im = YMIN + j*STEP;
    for (k = 0; k < NUMITER; k++) {
        z = func(z, c);
        if (CompAbs(z) > R) break;
    }
    // test if the orbit stayed bounded
    if (k > NUMITER-1) {
        MPE_Draw_point(graph, i, j, MPE_YELLOW); MPE_Update(graph);
    } else {
        MPE_Draw_point(graph, i, j, MPE_RED); MPE_Update(graph);
    }
}
Julia Sets – || Algorithm
Remark 1.
The double for loop on (i,j) can be split among processors, e.g.
- uniform block or cyclic on i.
- uniform block or cyclic on j.
No communication at all between processors, therefore this is an
embarrassingly || computation.
Remark 2.
Each processor draws a block of the fractal or several rows on the XGraph.
Processor rank knows the area to draw.
// uniform block on i; the commented variants give cyclic on i, block on j, cyclic on j:
for (i = rank*width/size; i < (rank+1)*width/size; i++) for (j = 0; j < width; j++) {
// for (i = rank; i < width; i += size) for (j = 0; j < width; j++)
// for (i = 0; i < width; i++) for (j = rank*width/size; j < (rank+1)*width/size; j++)
// for (i = 0; i < width; i++) for (j = rank; j < width; j += size)
    int k = 0;
    // construct the orbit of z
    z.re = XMIN + i*STEP; z.im = YMIN + j*STEP;
    for (k = 0; k < NUMITER; k++) {
        z = func(z, c);
        if (CompAbs(z) > R) break;
    }
    // test if the orbit stayed bounded
    if (k > NUMITER-1) {
        MPE_Draw_point(graph, i, j, MPE_YELLOW); MPE_Update(graph);
    } else {
        MPE_Draw_point(graph, i, j, MPE_RED); MPE_Update(graph);
    }
}
The Mandelbrot Set
THE MANDELBROT FRACTAL IS AN INDEX FOR JULIA FRACTALS
The Mandelbrot set contains all the points c ∈ C such that
z0=0 and zn+1 = zn^2 + c has a bounded orbit.
Inputs: [xmin,xmax] * [ymin,ymax] a region in the plane.
Niter a number of iterations for orbits; R a threshold for the attractor.
Output: M the Mandelbrot set.
Algorithm
For each (x,y) in [xmin,xmax] * [ymin,ymax]
  c=x+i*y;
  find the orbit of z0=0 while under the threshold.
  if (all the orbit points are under the threshold) draw c=(x,y)
for (i = 0; i < width; i++) for (j = 0; j < width; j++) {
    int k = 0;
    // construct the point c
    c.re = XMIN + i*STEP; c.im = YMIN + j*STEP;
    // construct the orbit of 0
    z.re = z.im = 0;
    for (k = 0; k < NUMITER; k++) {
        z = func(z, c);
        if (CompAbs(z) > R) break;
    }
    // test if the orbit stayed bounded
    if (k > NUMITER-1) {
        MPE_Draw_point(graph, i, j, MPE_YELLOW); MPE_Update(graph);
    } else {
        MPE_Draw_point(graph, i, j, MPE_RED); MPE_Update(graph);
    }
}
The Mandelbrot Set – || Algorithm
Remark 1.
The double for loop on (i,j) can be split among processors, e.g.
- uniform block or cyclic on i.
- uniform block or cyclic on j.
No communication at all between processors, therefore this is an
embarrassingly || computation.
Remark 2.
When the orbit goes to infinity in k steps, we can draw the pixel (i,j)
with the k-th color from a palette.
Bands coloured similarly contain points with the same behaviour.
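The banded colouring can be sketched by returning the escape step k and using it as a palette index (a sketch; orbit_length and its parameters are illustrative, not the lecture's code):

```c
/* Return the number of iterations that z_{n+1} = z_n^2 + c stays under
   the threshold R, up to maxiter. Returning maxiter means the orbit
   looks bounded within the iteration budget. */
int orbit_length(double zre, double zim, double cre, double cim,
                 double R, int maxiter) {
    int k;
    for (k = 0; k < maxiter; k++) {
        double re = zre*zre - zim*zim + cre;   /* z = z^2 + c */
        double im = 2.0*zre*zim + cim;
        zre = re; zim = im;
        if (zre*zre + zim*zim > R*R) break;    /* |z| > R: escaped */
    }
    return k;   /* pixel colour: palette[k % palette_size] */
}
```

Points with the same escape step k form the similarly coloured bands around the set.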
Fractal and Prime Numbers
Prime numbers can generate fractals.
Remarks:
- If p > 5 is prime then p%5 is 1, 2, 3 or 4.
- 1, 2, 3, 4 represent directions to move, e.g. left, right, up, down.
- The fractal has the sizes w and h.
Step 1. Initialise a matrix of colors with 0.
Step 2. For each number p > 5
  If p is prime then
    if (p%5==1) x=(x-1)%w;
    if (p%5==2) x=(x+1)%w;
    if (p%5==3) y=(y-1)%h;
    if (p%5==4) y=(y+1)%h;
  Increase the color of (x,y)
Step 3. Draw the pixels with the color matrix.
Simple Remarks
The prime number set is infinite; furthermore it has no pattern.
prime: 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, …
move: 3, 0, 2, 1, 3, 2, 4, 3, 4, 1, 2, …
The set of moves satisfies:
- it does not have any pattern; the moves are quite random.
- the numbers of 1-s, 2-s, 3-s and 4-s moves are quite similar,
hence the central pixels are reached more often.
The computation of the for loop is the most expensive operation.
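The isPrime test used in the code below can be sketched with trial division (a sketch; any primality test would do):

```c
/* Trial-division primality test: O(sqrt(p)) work per call, which is
   why the prime-testing loop dominates the computation. */
int isPrime(long p) {
    if (p < 2) return 0;
    if (p % 2 == 0) return p == 2;          /* 2 is the only even prime */
    for (long d = 3; d * d <= p; d += 2)    /* odd divisors up to sqrt(p) */
        if (p % d == 0) return 0;
    return 1;
}
```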
// initialise the matrix with 0
for (i = 0; i < width; i++) for (j = 0; j < width; j++) map[i][j] = 0;

// start from the image centre
posX = posY = width/2;

// traverse the set of odd numbers, keeping the primes
for (i = 0; i < n; i++) {
    if (isPrime(2*i+1)) {
        // move to a new position on the map and increment it
        // (+width keeps the index non-negative)
        move = (2*i+1)%5;
        if (move==1) posX = (posX-1+width)%width;
        if (move==2) posX = (posX+1)%width;
        if (move==3) posY = (posY-1+width)%width;
        if (move==4) posY = (posY+1)%width;
        map[posY][posX]++;
    }
}
Parallel Computation: Simple Remarks
Processor rank gets some primes to test using some partitioning.
Processor rank therefore traverses the pixels according to its own moves.
Processor rank has to work with its own matrix map.
The maps must be reduced on processor 0 to find the total number of hits.
Parallel Computation: Simple Remarks
The parallel computation of processor rank follows the steps:
1. Initialise the matrix map.
2. For each prime number assigned to rank do
1. Find the move and go to a new location
2. Increment the map
3. Reduce the matrix map.
4. If processor 0 then draw the map.
Splitting Loops
How to split the sequential loop if we have size processors?
Maths: n iterations & size processors → n/size iterations per processor.
for (i = 0; i < n; i++) {
    // body of loop
    loop_body(data, i);
}
Splitting Loops in Similar Blocks
P rank gets the iterations rank*n/size, rank*n/size+1,…, (rank+1)*n/size-1
for (i = rank*n/size; i < (rank+1)*n/size; i++) {
    // acquire the data for this iteration
    loop_body(data, i);
}
[diagram: P rank covers the consecutive block rank*n/size … (rank+1)*n/size-1]
Splitting Loops in Cycles
P rank gets the iterations rank, rank+size, rank+2*size,….
for (i = rank; i < n; i += size) {
    // acquire the data for this iteration
    loop_body(data, i);
}
[diagram: P rank covers iterations rank, rank+size, rank+2*size, …]
Splitting Loops in Variable Blocks
P rank gets the iterations l[rank], l[rank]+1,…, u[rank]
for (i = l[rank]; i <= u[rank]; i++) {
    // acquire the data for this iteration
    loop_body(data, i);
}
[diagram: P rank covers the block l[rank] … u[rank]]
// initialise the matrix with 0
for (i = 0; i < width; i++) for (j = 0; j < width; j++) map[i][j] = 0;

// start from the image centre
posX = posY = width/2;

// traverse the primes assigned to this processor (uniform block)
for (i = rank*n/size; i < (rank+1)*n/size; i++) {
    if (isPrime(p = 2*i+1)) {
        // move to a new position on the map and increment it
        move = p%5;
        if (move==1) posX = (posX-1+width)%width;
        if (move==2) posX = (posX+1)%width;
        if (move==3) posY = (posY-1+width)%width;
        if (move==4) posY = (posY+1)%width;
        map[posY][posX]++;
    }
}
MPI_Reduce(&map[0][0], &globalMap[0][0], width*width, MPI_LONG, MPI_SUM, 0,
           MPI_COMM_WORLD);
if (rank == 0) {
    for (i = 0; i < width; i++) for (j = 0; j < width; j++)
        MPE_Draw_point(graph, i, j, colors[globalMap[i][j]]);
}
Scheduling
Parallel Loops
Parallel loops represent the main source of parallelism.
Consider a system with p processors P1, P2, …, Pp and the loop
for i=1, n do
  call loop_body(i)
end for
Scheduling Problem:
Map the iterations {1, 2, …, n} onto the processors so that:
- the execution time is minimal.
- the execution times per processor are balanced.
- the processors' idle time is minimal.
Parallel Loops
Suppose that the workload of loop_body is known and given by w1, w2, …, wn.
For processor Pj the set of iterations is Sj = {i1, i2, …, ik}, so
- the execution time of processor Pj is T(Pj) = ∑ {wi : i ∈ Sj}
- the execution time of the parallel loop is T = max {T(Pj) : j = 1, 2, …, p}.
Static Scheduling: the partition is found at compile time.
Dynamic Scheduling: the partition is found at run time.
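One simple dynamic scheme assigns each iteration, in arrival order, to the currently least-loaded processor. A sequential simulation of this idea, computing the resulting parallel time T (an illustrative sketch, not from the lecture):

```c
/* Simulate greedy dynamic scheduling: each iteration goes to the
   processor with the smallest accumulated workload so far.
   Returns T = max over processors of T(Pj). Assumes p <= 64. */
double greedy_schedule(const double *w, int n, int p) {
    double load[64] = {0};
    for (int i = 0; i < n; i++) {
        int best = 0;                       /* find the least-loaded processor */
        for (int j = 1; j < p; j++)
            if (load[j] < load[best]) best = j;
        load[best] += w[i];                 /* iteration i runs on it */
    }
    double T = load[0];
    for (int j = 1; j < p; j++)
        if (load[j] > T) T = load[j];
    return T;
}
```

For uniform workloads this matches static block scheduling; its advantage shows when the wi are irregular and unknown in advance.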
Data Dependency
A dependency exists between program statements when the order of statement
execution affects the results of the program.
A data dependency results from multiple uses of the same location(s) in storage
by different tasks: one task's output is the "input" of another.
Dependencies are important to parallel programming because they are one of the
primary inhibitors of parallelism.
Loops with data dependencies cannot be scheduled in parallel.
Example: The following for loop contains data dependencies.
for i=1, n do
a[i]=a[i-1]+1
end for
Load Balancing
Load balancing refers to the practice of distributing work among
processors so that all processors are kept busy all of the time.
If all the processor execution times are the same then a perfect load balance
is achieved.
Load Imbalance is the most important overhead of parallel computation;
it reflects the case when there is a difference between the processors'
execution times.
Useful Rules:
- If the workloads are similar then use static uniform block scheduling.
- If the workloads increase/decrease then use static cyclic scheduling.
- If we know the workloads and they are simple then use them to guide the load balance.
- If the workloads are not known then use dynamic methods.
Balanced Workload Block Scheduling
w1, w2, …, wn are the workloads of the iterations:
- the total workload is W = w1 + w2 + … + wn
- the average per processor is W/size
Each processor gets consecutive iterations:
- l_rank, u_rank are the lower and upper indices of the block
- the workload of the block is w_{l_rank} + … + w_{u_rank} ≈ W/size
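Finding the block boundaries from the prefix sums of the workloads can be sketched as follows (an illustrative helper, not part of the lecture code):

```c
/* Given workloads w[0..n-1], compute upper indices u[0..size-1] so that
   each consecutive block carries roughly W/size work: block `rank` ends
   at the first index whose prefix sum reaches (rank+1)*W/size. */
void balanced_blocks(const double *w, int n, int size, int *u) {
    double W = 0;
    for (int i = 0; i < n; i++) W += w[i];

    double prefix = 0;
    int rank = 0;
    for (int i = 0; i < n && rank < size; i++) {
        prefix += w[i];
        if (prefix >= (rank + 1) * W / size) u[rank++] = i;
    }
    u[size - 1] = n - 1;   /* the last block always ends at n-1 */
}
```

Processor rank then iterates from l[rank] = u[rank-1]+1 (or 0 for rank 0) to u[rank], as in the variable-block loop above.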
Balanced Workload Block Scheduling
It is simpler to work with integrals:
- the average workload per processor is W̄ = (1/size) · ∫_0^n w(i) di
- processor id gets the block [x_id, x_{id+1}], whose workload is ∫_{x_id}^{x_{id+1}} w(i) di = W̄
- with W(x) = ∫_0^x w(i) di, the boundaries satisfy W(x_{id+1}) = W(x_id) + W̄,
  hence W(x_id) = id · W̄ and x_id = W⁻¹(id · W̄).
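As a worked example (not on the slides), take a linearly increasing workload w(i) = i, as in a triangular double loop; the balanced boundaries then have a closed form:

```latex
W(x) = \int_0^x i \, di = \frac{x^2}{2},
\qquad
\bar{W} = \frac{1}{size} \int_0^n i \, di = \frac{n^2}{2 \cdot size},
\qquad
W(x_{id}) = id \cdot \bar{W}
\;\Longrightarrow\;
x_{id} = n \sqrt{\frac{id}{size}}
```

The blocks get shorter as id grows, compensating for the heavier later iterations.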
Granularity
Granularity is the ratio of computation to communication.
Periods of computation are typically separated from periods of communication by synchronization events.
Fine-grain Parallelism: relatively small amounts of computational work are done between communication events.
- Facilitates load balancing, but implies high communication overhead and less opportunity for performance enhancement.
Coarse-grain Parallelism: relatively large amounts of computational work are done between communication/synchronization events.
- Harder to load balance efficiently.