Parallelizing the Jacobi Iteration Algorithm

Alberto Rodriguez
Higher Institute of Technologies and Applied Sciences (InSTEC)
October 14, 2016
Source: indico.ictp.it/event/7659/session/19/contribution/86/material/slides/0.pdf
Outline

1 Serial Approach
2 Using OpenMP
    1D Decomposition
    2D Decomposition
3 Using MPI
    1D Decomposition
Serial Approach
Analyzing the Code
With a serial code we have:

Time complexity O(I · N²) (nested loops recompute the matrix in each of the I iterations).
Memory complexity O(N²) (the size of the matrix).

We expect:

Long run times for larger matrices and/or a large number of iterations.
The matrix size is limited by the system (the amount of memory on the node).
Results

[Figure: serial run time t(s) vs. matrix size (128 to 8192), compiled with -O3; both axes log-scaled]
Results

[Figure: serial run time t(s) vs. matrix size (128 to 8192), compiled with -O3 -mavx; both axes log-scaled]
Results

Data obtained with the perf tool.

[Figure: % of cache misses vs. matrix size (128 to 8192)]
Using OpenMP
What part of the code needs parallelization?

for (iCount = 1; iCount <= Iterations; iCount++)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) { ... }

The outer loop cannot be parallelized, because the new matrix depends on the old matrix from the previous iteration.
We can parallelize the inner loops.
Instructions using OpenMP
for (iCount = 1; iCount <= Iterations; iCount++) {
    #pragma omp parallel for private(i, j)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) { ... }
}
Results

[Figure: time (s) vs. number of threads for the 1D decomposition; four panels for matrix sizes 256, 1024, 4096 and 16384]
Instructions using OpenMP
for (iCount = 1; iCount <= Iterations; iCount++) {
    #pragma omp parallel for collapse(2) private(i, j)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) { ... }
}
Results

[Figure: time (s) vs. number of threads for the 2D (collapse) decomposition; four panels for matrix sizes 256, 1024, 4096 and 16384]
Results

Comparison between 1D and 2D decomposition.

[Figure: time (s) vs. number of threads, 1D and 2D curves; four panels for matrix sizes 256, 1024, 4096 and 16384]
Using MPI
How to divide the matrix

We want to:

Minimize cache misses:
    Divide the matrix into groups of rows (row-major storage in C).

Give every process roughly the same amount of work:
    Every process gets the same number of rows. If that is not possible, the last ones get one more row each.

Let every process access data stored in the previous and the next process:
    We need to allocate 2 extra rows per process (ghost or boundary cells).
How to divide the matrix

[Figure: diagram of the row-block decomposition; not recoverable from the transcript]
Communications

How many communications are we going to have between processes?

Every process (except the first and the last) needs to send data to the previous process and to the next one.
Every process (except the first and the last) needs to receive data from the previous process and from the next one in order to update its ghost cells.
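This send/receive pattern can be written compactly with MPI_Sendrecv. The following is a sketch only (it needs an MPI installation, mpicc and mpirun, to build and run); the function name and the storage layout, row-major with ghost rows at local row 0 and local row rows+1, are assumptions. Using MPI_PROC_NULL turns the sends and receives at the ends of the process chain into no-ops, so the first and last process need no special casing:

```c
#include <mpi.h>

/* Exchange ghost rows with the neighbouring ranks.  local points to this
   process's block including the two ghost rows (row 0 and row rows+1);
   dim is the number of values per row. */
void exchange_ghost_rows(double *local, int rows, int dim,
                         int rank, int nprocs)
{
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first real row up; receive my lower ghost row from below. */
    MPI_Sendrecv(&local[1 * dim],          dim, MPI_DOUBLE, up,   0,
                 &local[(rows + 1) * dim], dim, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send my last real row down; receive my upper ghost row from above. */
    MPI_Sendrecv(&local[rows * dim],       dim, MPI_DOUBLE, down, 1,
                 &local[0],                dim, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

MPI_Sendrecv pairs each send with a receive in one call, which avoids the deadlock that naive blocking MPI_Send/MPI_Recv chains can produce.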
Communications

How many values are communicated per iteration?

The program has P processes.
Every process needs to communicate four times (two sends and two receives).
In every communication an entire row (Dimension values) is communicated.