Optimization and Parallelization of FIND Algorithm

Song Li, Eric Darve

Institute for Computational and Mathematical Engineering, Stanford University

SIAM CSE09, March 4, 2009

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Introduction

Modeling the current through nano-devices by the Non-Equilibrium Green's Function approach

System of Schrödinger-Poisson equations

Best known algorithm (RGF) has running time $O(n_x^3 n_y)$

Our method (FIND): $O(n_x^2 n_y)$

Other devices: nanotubes and nanowires

The Math Problem

Example: a 4 × 5 mesh ($n_x = 4$, $n_y = 5$) gives a 20 × 20 matrix $A$

What we want: the diagonal of $G^r = A^{-1}$

What we have: a sparse matrix $A$ from a discretized 2D mesh
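To make the setup concrete, here is a minimal sketch that builds a 20 × 20 matrix from a 4 × 5 mesh and computes the diagonal of its inverse by brute force. The slides do not specify the discretization; the 5-point nearest-neighbor stencil, the values 4 and -1, and the function name `mesh_matrix` are illustrative assumptions.

```python
import numpy as np

def mesh_matrix(nx, ny, diag=4.0, off=-1.0):
    """Matrix for an nx-by-ny mesh with nearest-neighbor coupling.

    The stencil values (diag, off) are illustrative assumptions.
    """
    n = nx * ny
    A = np.zeros((n, n))
    for ix in range(nx):
        for iy in range(ny):
            k = ix * ny + iy              # mesh node (ix, iy) -> matrix index
            A[k, k] = diag
            if ix + 1 < nx:               # neighbor in the x direction
                A[k, k + ny] = A[k + ny, k] = off
            if iy + 1 < ny:               # neighbor in the y direction
                A[k, k + 1] = A[k + 1, k] = off
    return A

A = mesh_matrix(4, 5)
G_diag = np.diag(np.linalg.inv(A))        # dense reference computation
```

The dense inverse serves only as a reference answer; the point of FIND is to obtain `G_diag` without forming the full inverse.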


Key Observations

The last entry of $A^{-1}$ can be obtained through LU factorization: $(A^{-1})_{nn} = (U^{-1})_{nn} = (U_{nn})^{-1}$

Obtain all the diagonal entries through multiple factorizations

Local connectivity ⇒ problem decomposition: partial factorizations are feasible

Proper ordering makes most of them identical: subproblems overlap ⇒ dynamic programming

The computational cost for all the diagonal entries of the inverse is of the same order as a single LU factorization!
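The first two observations can be checked numerically. The sketch below is not the FIND algorithm: it is the naive "one factorization per diagonal entry" approach that FIND improves on. The helper names `lu_nopivot` and `diag_inverse_naive` and the test matrix are assumptions made for illustration. Pivoting is deliberately omitted, because the identity relies on $A = LU$ with a unit lower triangular $L$.

```python
import numpy as np

def lu_nopivot(A):
    """Doolittle LU factorization without pivoting: A = L @ U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def diag_inverse_naive(A):
    """Diagonal of inv(A): move each index to the last position, factor, read 1/U_nn."""
    n = A.shape[0]
    d = np.empty(n)
    for i in range(n):
        order = [j for j in range(n) if j != i] + [i]
        _, U = lu_nopivot(A[np.ix_(order, order)])
        d[i] = 1.0 / U[-1, -1]
    return d

# Small diagonally dominant test matrix (illustrative values).
rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, (6, 6)) + 6.0 * np.eye(6)
d = diag_inverse_naive(A)
```

This costs one full factorization per entry. The remaining observations explain why most of that work is shared between entries.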

Overall Structure: Partition Tree

Order the mesh nodes in a way similar to nested dissection

Partition the whole mesh and form a tree structure to exploit the subproblem overlap
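A tree of clusters of this kind can be sketched with recursive bisection. This shows only the bookkeeping of the partition tree, not the node ordering or any elimination. The splitting rule (halve the longer side), the leaf size, and the names `build_partition_tree` and `leaves` are assumptions, since the slides do not give these details.

```python
def build_partition_tree(nodes_x, nodes_y, min_size=2):
    """Recursively bisect a rectangular index range into a binary tree of clusters.

    nodes_x and nodes_y are range objects. Splitting the longer side at its
    midpoint is an illustrative assumption.
    """
    cluster = {"x": nodes_x, "y": nodes_y, "children": []}
    if len(nodes_x) * len(nodes_y) <= min_size:
        return cluster                                  # leaf cluster
    if len(nodes_x) >= len(nodes_y):
        mid = nodes_x.start + len(nodes_x) // 2
        halves = [(range(nodes_x.start, mid), nodes_y),
                  (range(mid, nodes_x.stop), nodes_y)]
    else:
        mid = nodes_y.start + len(nodes_y) // 2
        halves = [(nodes_x, range(nodes_y.start, mid)),
                  (nodes_x, range(mid, nodes_y.stop))]
    cluster["children"] = [build_partition_tree(hx, hy, min_size) for hx, hy in halves]
    return cluster

def leaves(cluster):
    """Collect the leaf clusters of the tree, left to right."""
    if not cluster["children"]:
        return [cluster]
    return [leaf for child in cluster["children"] for leaf in leaves(child)]

tree = build_partition_tree(range(4), range(5))         # the 4 x 5 example mesh
leaf_sizes = [len(c["x"]) * len(c["y"]) for c in leaves(tree)]
```

Every internal node of the tree corresponds to a cluster whose partial elimination result can be stored once and shared by all target nodes outside it.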

One Step of Elimination

Gaussian elimination, with $I$ the inner nodes, $B$ the boundary nodes, and $O$ the outer nodes:

$$A^*(B,B) \overset{\text{def}}{=} A(B,B) - A(B,I)\,A(I,I)^{-1}A(I,B)$$

$$\begin{pmatrix} A(I,I) & A(I,B) & 0 \\ A(B,I) & A(B,B) & A(B,O) \\ 0 & A(O,B) & A(O,O) \end{pmatrix} \xrightarrow{\text{elimination}} \begin{pmatrix} A(I,I) & A(I,B) & 0 \\ 0 & A^*(B,B) & A(B,O) \\ 0 & A(O,B) & A(O,O) \end{pmatrix}$$

[Figure legend: eliminated node, inner node, boundary node, outer node]
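The elimination step is a Schur complement update of the boundary block. A minimal numerical check on a 1D chain follows. The chain, its values, and the variable names `inner`, `boundary`, and `outer` are illustrative assumptions. After the inner nodes are eliminated, the reduced system on the remaining nodes has the same inverse entries as the full matrix on those nodes.

```python
import numpy as np

# 1D chain of 7 nodes with nearest-neighbor coupling (illustrative values).
n = 7
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

inner = [0, 1, 2]       # coupled only to each other and to the boundary
boundary = [3]          # separates inner from outer
outer = [4, 5, 6]       # no direct coupling to inner

A_II = A[np.ix_(inner, inner)]
A_IB = A[np.ix_(inner, boundary)]
A_BI = A[np.ix_(boundary, inner)]

# Updated boundary block after eliminating the inner nodes.
A_star_BB = A[np.ix_(boundary, boundary)] - A_BI @ np.linalg.solve(A_II, A_IB)

# Reduced system on boundary and outer nodes; the outer blocks are unchanged.
reduced = np.block([
    [A_star_BB,                  A[np.ix_(boundary, outer)]],
    [A[np.ix_(outer, boundary)], A[np.ix_(outer, outer)]],
])

keep = boundary + outer
G = np.linalg.inv(A)
```

Only the boundary block changes, which is why the cost of one step depends on the sizes of the inner and boundary sets and not on the size of the whole mesh.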

Two Full Elimination Processes

Keep partitioning the mesh to get small clusters

Store the results of each partial elimination

The partial results can be reused

[Figure legend: eliminated node, inner node, boundary node, outer node, target node]
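The reuse of partial eliminations is easiest to see in a 1D analogue, sketched below under the assumption of a tridiagonal matrix. It is not the 2D tree-based FIND procedure. A left-to-right sweep stores the result of eliminating everything to the left of each node, a right-to-left sweep does the same for the right side, and every target node then combines two stored results. The function name and array layout are assumptions.

```python
import numpy as np

def diag_inverse_tridiagonal(a, b, c):
    """Diagonal of inv(A) for tridiagonal A.

    a: main diagonal, b: super-diagonal (A[i, i+1]), c: sub-diagonal (A[i+1, i]).
    """
    n = len(a)
    left = np.empty(n)     # left[i]: pivot at i after eliminating nodes 0..i-1
    right = np.empty(n)    # right[i]: pivot at i after eliminating nodes i+1..n-1
    left[0] = a[0]
    for i in range(1, n):
        left[i] = a[i] - c[i - 1] * b[i - 1] / left[i - 1]
    right[n - 1] = a[n - 1]
    for i in range(n - 2, -1, -1):
        right[i] = a[i] - b[i] * c[i] / right[i + 1]
    # Node i receives one update from each side; a[i] is counted twice.
    return 1.0 / (left + right - a)

a = np.full(8, 4.0)
b = np.full(7, -1.0)
c = np.full(7, -1.0)
g = diag_inverse_tridiagonal(a, b, c)
```

All eight diagonal entries come out of two sweeps instead of eight separate eliminations, which is the dynamic-programming effect described on the slide.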

Extensions and Optimizations

$G^< = A^{-1} \Sigma A^{-\dagger}$ has a similar sparsity pattern, so our method is applicable as well

Also for computing off-diagonal entries

Extra sparsity: rewrite the one-step elimination (with $I$ the inner nodes and $B$ the boundary nodes) as

$$A^*(B,B) \overset{\text{def}}{=} A(B,B) - A(B,I)\,A(I,I)^{-1}A(I,B)$$

These blocks are themselves sparse. Exploit to optimize!

Block sparsity pattern:

× r × 0
r × 0 ×
× 0 × ×
0 × × ×

The elimination preserves symmetry, and this further reduces cost
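The symmetry claim can be checked directly: for a symmetric matrix, the Schur complement update is itself symmetric, so only one off-diagonal block has to be read and roughly half of the update has to be computed. The test matrix and the index sets `inner` and `rest` below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = M + M.T + 12.0 * np.eye(6)       # symmetric test matrix (illustrative values)

inner, rest = [0, 1], [2, 3, 4, 5]
A_II = A[np.ix_(inner, inner)]
A_IR = A[np.ix_(inner, rest)]

# For symmetric A, A(rest, inner) equals A(inner, rest).T,
# so a single off-diagonal block is enough to form the update.
update = A_IR.T @ np.linalg.solve(A_II, A_IR)
A_star = A[np.ix_(rest, rest)] - update
```

Because `A_star` stays symmetric, the same saving applies again at every later elimination step.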


Simulation Device

[Figure: simulation device]

Running Time Comparison

[Figure: running time comparison between FIND and RGF, log-log scale with reference lines. Horizontal axis: n (= Nx = Ny), from 64 to 1024. Vertical axis: running time (seconds), from 1 to 32768. Curves: FIND with an $O(n^3)$ reference line, RGF with an $O(n^4)$ reference line.]

Memory Cost Comparison

FIND: $O(N \log N)$

RGF: $O(N^{3/2})$
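To relate these bounds to the mesh size, here is a short worked comparison. It assumes that $N = n_x n_y$ is the total number of mesh nodes and that the mesh is square, $n_x = n_y = n$; neither is stated explicitly on this slide.

```latex
\begin{align*}
N &= n^2 \\
\text{RGF:}\quad N^{3/2} &= n^3 \\
\text{FIND:}\quad N \log N &= 2 n^2 \log n \\
\text{ratio:}\quad \frac{N^{3/2}}{N \log N} &= \frac{n}{2 \log n}
\end{align*}
```

Under these assumptions the memory advantage of FIND grows almost linearly with the side length $n$.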


How to Parallelize?

Straightforward for leaf clusters

Top-level clusters dominate the running time and offer less parallelism

Use the idle processors for redundant computations

More floating-point operations but shorter wall-clock time

Works for 1D, 2D, and 3D domains

Problem and Processor Settings

[Figure: 16 × 16 grid; every cell in row i is labeled Pi, for processors P0 to P15]

16 processors, 16 clusters in 1D

One target cluster per processor

Keep merging all the other clusters until they are all merged as the complement of the target cluster

Eliminate the merged complement clusters and compute the inverse
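The per-processor task can be sketched serially as follows. Each "processor" eliminates the complement of its own target cluster and inverts what is left. This sketch eliminates the complement in one shot instead of merging clusters progressively, and the chain matrix, `cluster_size`, and the function name `target_block_of_inverse` are assumptions made for illustration.

```python
import numpy as np

n_clusters, cluster_size = 16, 3        # cluster_size is an illustrative assumption
n = n_clusters * cluster_size
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D chain (illustrative values)

def target_block_of_inverse(A, p, cluster_size):
    """Work of 'processor' p: eliminate the complement of cluster p, then invert."""
    target = np.arange(p * cluster_size, (p + 1) * cluster_size)
    complement = np.setdiff1d(np.arange(A.shape[0]), target)
    A_tt = A[np.ix_(target, target)]
    A_tc = A[np.ix_(target, complement)]
    A_ct = A[np.ix_(complement, target)]
    A_cc = A[np.ix_(complement, complement)]
    reduced = A_tt - A_tc @ np.linalg.solve(A_cc, A_ct)   # eliminate the complement
    return np.linalg.inv(reduced)                          # inverse on the target cluster

# Serial loop standing in for 16 independent processors.
blocks = [target_block_of_inverse(A, p, cluster_size) for p in range(n_clusters)]
diag = np.concatenate([np.diag(blk) for blk in blocks])
```

The 16 tasks are independent, which gives the parallelism, but they repeat much of the same elimination work; the merging process is about sharing that work.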

Detailed Merging Process

[Figure: 16 × 16 grid; every cell in row i is labeled Pi, for processors P0 to P15]

Each processor keeps the complement of its target cluster with respect to the current subdomain

Start with subdomains of size 2

Expand to subdomains of size 4

Some processors are idle

Use them to prepare for the next subdomain expansion

Until the subdomain is expanded to the whole domain

Additional speedup of factor 2

Page 54: Optimization and Parallelization of FIND Algorithm

BackgroundSerial FIND (Fast Inverse using Nested Dissection)

Simulation ResultsParallel Methods

Detailed Merging Process

P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0

P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1

P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2

P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3

P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4

P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5

P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6

P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7

P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8

P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9

P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10

P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11

P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12

P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13

P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14

P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15

Each processor keeps thecomplement of its target cluster withrespect to the current subdomainStart with subdomain of size 2Expand to subdomains of size 4Some processors are idleUse them to prepare for the nextsubdomain expansionUntil the subdomain is expanded tothe whole domainAdditional speedup of factor 2

Song Li, Eric Darve Optimization and Parallelization of FIND Algorithm

Page 55: Optimization and Parallelization of FIND Algorithm

BackgroundSerial FIND (Fast Inverse using Nested Dissection)

Simulation ResultsParallel Methods

Detailed Merging Process

P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0

P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1

P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2

P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3

P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4

P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5

P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6

P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7

P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8

P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9

P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10

P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11

P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12

P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13

P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14

P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15

Each processor keeps thecomplement of its target cluster withrespect to the current subdomainStart with subdomain of size 2Expand to subdomains of size 4Some processors are idleUse them to prepare for the nextsubdomain expansionUntil the subdomain is expanded tothe whole domainAdditional speedup of factor 2

Song Li, Eric Darve Optimization and Parallelization of FIND Algorithm

Page 56: Optimization and Parallelization of FIND Algorithm

BackgroundSerial FIND (Fast Inverse using Nested Dissection)

Simulation ResultsParallel Methods

Detailed Merging Process

P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0

P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1

P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2

P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3

P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4

P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5

P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6

P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7

P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8

P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9

P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10

P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11

P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12

P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13

P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14

P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15

Each processor keeps thecomplement of its target cluster withrespect to the current subdomainStart with subdomain of size 2Expand to subdomains of size 4Some processors are idleUse them to prepare for the nextsubdomain expansionUntil the subdomain is expanded tothe whole domainAdditional speedup of factor 2

Song Li, Eric Darve Optimization and Parallelization of FIND Algorithm

Page 57: Optimization and Parallelization of FIND Algorithm

BackgroundSerial FIND (Fast Inverse using Nested Dissection)

Simulation ResultsParallel Methods

Detailed Merging Process

P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0

P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1

P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2

P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3

P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4

P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5

P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6

P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7

P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8

P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9

P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10

P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11

P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12

P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13

P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14

P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15

Each processor keeps thecomplement of its target cluster withrespect to the current subdomainStart with subdomain of size 2Expand to subdomains of size 4Some processors are idleUse them to prepare for the nextsubdomain expansionUntil the subdomain is expanded tothe whole domainAdditional speedup of factor 2

Song Li, Eric Darve Optimization and Parallelization of FIND Algorithm

Page 58: Optimization and Parallelization of FIND Algorithm

BackgroundSerial FIND (Fast Inverse using Nested Dissection)

Simulation ResultsParallel Methods

Detailed Merging Process

P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0P0

P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1P1

P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2P2

P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3P3

P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4P4

P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5P5

P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6P6

P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7P7

P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8P8

P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9P9

P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10P10

P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11P11

P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12P12

P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13P13

P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14P14

P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15P15

Each processor keeps thecomplement of its target cluster withrespect to the current subdomainStart with subdomain of size 2Expand to subdomains of size 4Some processors are idleUse them to prepare for the nextsubdomain expansionUntil the subdomain is expanded tothe whole domainAdditional speedup of factor 2

Song Li, Eric Darve Optimization and Parallelization of FIND Algorithm

Page 59: Optimization and Parallelization of FIND Algorithm

Communication Pattern

Page 60: Optimization and Parallelization of FIND Algorithm

Summary

Direct method for fast inverse
Two extensions, two optimizations
An optimal parallel scheme
Collaboration with other groups for more applications
