Parallelizing the Jacobi Iteration Algorithm

Alberto Rodriguez
Higher Institute of Technologies and Applied Sciences (InSTEC)
October 14, 2016
Source: indico.ictp.it/event/7659/session/19/contribution/86/material/slides/0.pdf
Outline

1 Serial Approach
2 Using OpenMP
    1D Decomposition
    2D Decomposition
3 Using MPI
    1D Decomposition
Serial Approach
Analyzing the Code
With a serial code we have:

Time complexity O(I · N²) (nested loops recompute the matrix in each of the I iterations).
Memory complexity O(N²) (the size of the matrix).

We expect:

Long run times for larger matrices and/or a large number of iterations.
The matrix size is limited by the system (the amount of memory on the node).
Results

[Figure: serial run time t(s) vs. matrix size (128 to 8192), compiled with -O3; both axes log-scaled]
Results

[Figure: serial run time t(s) vs. matrix size (128 to 8192), compiled with -O3 -mavx; both axes log-scaled]
Results

Data obtained with the perf tool.

[Figure: % of cache misses vs. matrix size (128 to 8192)]
Using OpenMP
What part of the code needs parallelization?

for (iCount = 1; iCount <= Iterations; iCount++)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) { ... }

The outer loop cannot be parallelized, because the new matrix depends on the old matrix from the previous iteration.
We can parallelize the inner loops.
Instructions using OpenMP
for (iCount = 1; iCount <= Iterations; iCount++) {
    #pragma omp parallel for private(i, j)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) { ... }
}
Results

[Figure: time (s) vs. number of threads for the 1D decomposition; four panels for matrix sizes 256, 1024, 4096 and 16384]
Instructions using OpenMP
for (iCount = 1; iCount <= Iterations; iCount++) {
    #pragma omp parallel for collapse(2) private(i, j)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) { ... }
}
Results

[Figure: time (s) vs. number of threads for the 2D (collapse) decomposition; four panels for matrix sizes 256, 1024, 4096 and 16384]
Results

Comparison between 1D and 2D decomposition.

[Figure: time (s) vs. number of threads, 1D and 2D curves; four panels for matrix sizes 256, 1024, 4096 and 16384]
Using MPI
How to divide the matrix

We want to:

Minimize cache misses:
    Divide the matrix into groups of rows (row-major storage in C).

Give every process roughly the same amount of work:
    Every process gets the same number of rows. If that is not possible, the last ones get one more row each.

Let every process access data stored in the previous and the next process:
    We need to allocate 2 extra rows per process (ghost or boundary cells).
How to divide the matrix

[Figure: diagram of the row-block decomposition; not recoverable from the transcript]
Communications

How many communications are we going to have between processes?

Every process (except the first and the last) needs to send data to the previous process and to the next one.
Every process (except the first and the last) needs to receive data from the previous process and from the next one in order to update its ghost cells.
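This send/receive pattern can be written compactly with MPI_Sendrecv. The following is a sketch only (it needs an MPI installation, mpicc and mpirun, to build and run); the function name and the storage layout, row-major with ghost rows at local row 0 and local row rows+1, are assumptions. Using MPI_PROC_NULL turns the sends and receives at the ends of the process chain into no-ops, so the first and last process need no special casing:

```c
#include <mpi.h>

/* Exchange ghost rows with the neighbouring ranks.  local points to this
   process's block including the two ghost rows (row 0 and row rows+1);
   dim is the number of values per row. */
void exchange_ghost_rows(double *local, int rows, int dim,
                         int rank, int nprocs)
{
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first real row up; receive my lower ghost row from below. */
    MPI_Sendrecv(&local[1 * dim],          dim, MPI_DOUBLE, up,   0,
                 &local[(rows + 1) * dim], dim, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send my last real row down; receive my upper ghost row from above. */
    MPI_Sendrecv(&local[rows * dim],       dim, MPI_DOUBLE, down, 1,
                 &local[0],                dim, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

MPI_Sendrecv pairs each send with a receive in one call, which avoids the deadlock that naive blocking MPI_Send/MPI_Recv chains can produce.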
Communications

How many values are communicated per iteration?

The program has P processes.
Every process needs to communicate four times (two sends and two receives).
In every communication an entire row (Dimension values) is communicated.