Optimization and Parallelization of FIND Algorithm


Song Li, Eric Darve

Institute for Computational and Mathematical Engineering, Stanford University

SIAM CSE09, March 4, 2009




Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods




1 Background




Introduction


Modeling the current through nano-devices by the Non-Equilibrium Green’s Function approach

System of Schrödinger-Poisson equations

Best known algorithm (RGF) has running time $O(n_x^3 n_y)$

Our method (FIND): $O(n_x^2 n_y)$

Other devices: nanotubes and nanowires



The Math Problem


What we want: the diagonal of $G^r = A^{-1}$

What we have: a sparse matrix $A$ from a discretized 2D mesh



[Figure: a 4 × 5 mesh ($n_x = 4$, $n_y = 5$) and the corresponding 20 × 20 matrix $A$]
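As a minimal sketch of how such a matrix can arise, the following builds a 20 × 20 matrix for the 4 × 5 mesh. It assumes a five-point stencil with row-major node numbering; the stencil values (4 on the diagonal, -1 between neighbors) and the helper name `mesh_matrix` are illustrative assumptions, not the device matrix used in the talk.

```python
def mesh_matrix(nx, ny):
    """Return a dense list-of-lists matrix for a five-point stencil on an nx-by-ny mesh.

    Assumption: node (i, j) gets index i * ny + j; diagonal entries are 4,
    entries coupling nearest neighbors are -1, everything else is 0.
    """
    n = nx * ny
    A = [[0] * n for _ in range(n)]
    for i in range(nx):
        for j in range(ny):
            k = i * ny + j
            A[k][k] = 4
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < nx and 0 <= jj < ny:
                    A[k][ii * ny + jj] = -1
    return A


A = mesh_matrix(4, 5)
nonzeros = sum(1 for row in A for v in row if v != 0)
print(len(A), nonzeros)
```

Each node couples only to its mesh neighbors, so most of the 400 entries are zero. This local connectivity is what the rest of the talk exploits.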





2 Serial FIND (Fast Inverse using Nested Dissection)




Key Observations

Last entry in $A^{-1}$ can be obtained through LU factorization: $(A^{-1})_{nn} = (U^{-1})_{nn} = (U_{nn})^{-1}$

Obtain all the diagonals through multiple factorizations

Local connectivity ⇒ problem decomposition: partial factorizations feasible

Proper ordering makes most of them identical: subproblems overlap ⇒ dynamic programming

Computational cost for all the diagonal entries of the inverse is of the same order as a single LU factorization!
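The first two observations can be illustrated with a small sketch. It assumes an LU factorization without pivoting and a unit lower triangular $L$, so that $(A^{-1})_{nn} = 1/U_{nn}$, and it obtains every diagonal entry naively by moving each node to the last position in turn. The function names and the test matrix are hypothetical. This is the brute-force baseline, not the FIND algorithm itself.

```python
from fractions import Fraction


def lu_no_pivot(A):
    """Doolittle LU without pivoting: A = L U, L unit lower triangular.

    Assumption: every leading principal minor of A is nonzero, so no
    pivoting is needed (true for the diagonally dominant example below).
    """
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def inverse_entry(A, i):
    """Return (A^{-1})_{ii} by moving node i to the last position and factoring."""
    n = len(A)
    order = [k for k in range(n) if k != i] + [i]
    P = [[A[r][c] for c in order] for r in order]  # symmetric permutation of A
    _, U = lu_no_pivot(P)
    return 1 / U[n - 1][n - 1]


A = [[4, -1, 0, 0], [-1, 4, -1, 0], [0, -1, 4, -1], [0, 0, -1, 4]]
diag = [inverse_entry(A, i) for i in range(len(A))]
print(diag)
```

This baseline performs one full factorization per diagonal entry. The remaining observations explain why most of that work is shared between entries and need not be repeated.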





Overall Structure: Partition Tree


Order the mesh nodes in a way similar to nested dissection

Partition the whole mesh and form a tree structure to exploit the subproblem overlap
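A partition tree of this kind might be built as in the sketch below. It assumes recursive bisection of the longer side and a stopping size of two nodes; the dictionary layout and the names `build_partition_tree` and `leaves` are hypothetical, and the bookkeeping of boundary nodes that a real implementation needs is omitted.

```python
def build_partition_tree(x0, x1, y0, y1, min_size=2):
    """Recursively bisect the mesh block [x0, x1) x [y0, y1) into a binary tree.

    Assumptions: the longer side is halved at each level and splitting stops
    when a block has at most min_size nodes. Returns a nested dict.
    """
    node = {"block": (x0, x1, y0, y1), "children": []}
    w, h = x1 - x0, y1 - y0
    if w * h <= min_size:
        return node
    if w >= h:
        mid = x0 + w // 2
        halves = [(x0, mid, y0, y1), (mid, x1, y0, y1)]
    else:
        mid = y0 + h // 2
        halves = [(x0, x1, y0, mid), (x0, x1, mid, y1)]
    node["children"] = [build_partition_tree(*half, min_size) for half in halves]
    return node


def leaves(node):
    """Yield the blocks stored at the leaf clusters of the tree."""
    if not node["children"]:
        yield node["block"]
    for child in node["children"]:
        yield from leaves(child)


tree = build_partition_tree(0, 4, 0, 5)
leaf_blocks = list(leaves(tree))
print(leaf_blocks)
```

The leaf clusters tile the 4 × 5 mesh without overlap, and every internal tree node is the union of its two children.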



One Step of Elimination


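Gaussian elimination of the inner nodes, with $I$, $B$, and $O$ denoting the inner, boundary, and outer node sets:

$$A^*(B,B) \stackrel{\text{def}}{=} A(B,B) - A(B,I)\,A(I,I)^{-1}\,A(I,B)$$

$$\begin{pmatrix} A(I,I) & A(I,B) & 0 \\ A(B,I) & A(B,B) & A(B,O) \\ 0 & A(O,B) & A(O,O) \end{pmatrix} \;\overset{\text{elimination}}{\Longrightarrow}\; \begin{pmatrix} A(I,I) & A(I,B) & 0 \\ 0 & A^*(B,B) & A(B,O) \\ 0 & A(O,B) & A(O,O) \end{pmatrix}$$

[Figure: mesh before and after one elimination step. Legend: eliminated node, inner node, boundary node, outer node]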
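The one-step elimination is a Schur complement of the inner block onto the boundary block. A minimal sketch with exact fractions follows. The helper names `solve`, `block`, and `eliminate_inner` are hypothetical, and the solver assumes nonzero pivots in natural order, which holds for the small diagonally dominant chain used here.

```python
from fractions import Fraction


def solve(M, R):
    """Solve M X = R by Gauss-Jordan elimination with exact fractions.

    Assumption: M is nonsingular with nonzero pivots in natural order.
    """
    n = len(M)
    aug = [[Fraction(v) for v in M[i]] + [Fraction(v) for v in R[i]] for i in range(n)]
    for k in range(n):
        piv = aug[k][k]
        aug[k] = [v / piv for v in aug[k]]
        for i in range(n):
            if i != k and aug[i][k] != 0:
                f = aug[i][k]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[k])]
    return [row[n:] for row in aug]


def block(A, rows, cols):
    """Extract the sub-block A(rows, cols)."""
    return [[A[r][c] for c in cols] for r in rows]


def eliminate_inner(A, inner, boundary):
    """Return A*(B,B) = A(B,B) - A(B,I) A(I,I)^{-1} A(I,B)."""
    X = solve(block(A, inner, inner), block(A, inner, boundary))  # A(I,I)^{-1} A(I,B)
    ABI = block(A, boundary, inner)
    ABB = block(A, boundary, boundary)
    return [
        [ABB[i][j] - sum(ABI[i][k] * X[k][j] for k in range(len(inner)))
         for j in range(len(boundary))]
        for i in range(len(boundary))
    ]


# 1D chain of four nodes: nodes 0 and 1 are inner, node 2 is boundary, node 3 is outer.
A = [[4, -1, 0, 0], [-1, 4, -1, 0], [0, -1, 4, -1], [0, 0, -1, 4]]
schur = eliminate_inner(A, inner=[0, 1], boundary=[2])
print(schur)
```

Only the boundary block changes. The outer block and its coupling to the boundary are untouched, which is why eliminations of disjoint clusters can be carried out independently.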



Two Full Elimination Processes


Keep partitioning the mesh to get small clusters

Store results of each partial elimination

The partial results can be reused

[Figure: animation of two full elimination processes on the partitioned mesh. Legend: eliminated node, inner node, boundary node, outer node, target node]
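The reuse of stored partial eliminations is easiest to see in one dimension. The sketch below is a deliberate simplification, not the tree-based 2D method of the talk: it assumes a tridiagonal matrix, stores the pivots of one left-to-right and one right-to-left elimination sweep, and reuses them for every target node. The function name and argument layout are hypothetical.

```python
from fractions import Fraction


def diag_inverse_tridiagonal(a, b, c):
    """Diagonal of A^{-1} for tridiagonal A with diagonal a, super-diagonal b
    (b[i] = A[i][i+1]) and sub-diagonal c (c[i] = A[i+1][i]).

    Assumption: all intermediate pivots are nonzero.
    """
    n = len(a)
    left = [Fraction(a[0])] * n    # left[i]: pivot at i after eliminating nodes 0..i-1
    right = [Fraction(a[-1])] * n  # right[i]: pivot at i after eliminating nodes i+1..n-1
    for i in range(1, n):
        left[i] = a[i] - Fraction(c[i - 1] * b[i - 1]) / left[i - 1]
    for i in range(n - 2, -1, -1):
        right[i] = a[i] - Fraction(b[i] * c[i]) / right[i + 1]
    # Eliminating both sides leaves the 1x1 Schur complement left + right - a.
    return [1 / (left[i] + right[i] - a[i]) for i in range(n)]


diag = diag_inverse_tridiagonal([4, 4, 4, 4], [-1, -1, -1], [-1, -1, -1])
print(diag)
```

Two sweeps serve all target nodes, instead of one full elimination per target. The partition tree plays the same role in 2D, where the stored results belong to clusters rather than to prefixes and suffixes of a chain.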



Extensions and Optimizations


$G^< = A^{-1} \Sigma A^{-\dagger}$ has a similar sparsity pattern, so our method is applicable as well

Also for computing off-diagonal entries

Extra sparsity: rewrite the one-step elimination $A^*(B,B) \stackrel{\text{def}}{=} A(B,B) - A(B,I)\,A(I,I)^{-1}\,A(I,B)$ ($I$: inner nodes, $B$: boundary nodes); these blocks are themselves sparse. Exploit to optimize!

The elimination preserves symmetry, and this further reduces cost
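The symmetry claim can be checked in one line. The derivation below assumes $A = A^T$ (plain transpose, no conjugation) and writes $I$ and $B$ for the inner and boundary node sets; these labels are a notational assumption.

```latex
\begin{aligned}
\bigl(A^*(B,B)\bigr)^T
  &= A(B,B)^T - A(I,B)^T \bigl(A(I,I)^{-1}\bigr)^T A(B,I)^T \\
  &= A(B,B) - A(B,I)\, A(I,I)^{-1}\, A(I,B) \\
  &= A^*(B,B)
\end{aligned}
```

The second line uses $A(I,B)^T = A(B,I)$ and $A(I,I)^T = A(I,I)$. Since the updated block is again symmetric, only its upper triangle needs to be computed and stored.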

















[Figure: block sparsity pattern; rows and columns are indexed by four node sets]

× r × 0
r × 0 ×
× 0 × ×
0 × × ×



3 Simulation Results




Simulation Device




Running Time Comparison Between FIND and RGF

[Figure: log-log plot with reference lines. Horizontal axis: n (= Nx = Ny), from 64 to 1024. Vertical axis: running time (seconds), from 1 to 32768. Curves: FIND, RGF. Reference lines: $O(n^3)$, $O(n^4)$]




Memory Cost Comparison


FIND: $O(N \log N)$

RGF: $O(N^{3/2})$
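To get a feel for how these orders of growth separate, the sketch below evaluates the leading-order models on a square mesh with $n_x = n_y = n$ and $N = n^2$. All constant factors are set to 1 and the logarithm is taken base 2, both of which are assumptions; the numbers are model ratios, not measurements from the talk.

```python
import math


def cost_models(n):
    """Leading-order cost models for a square n-by-n mesh (N = n * n nodes).

    Assumption: all constant factors are set to 1, so only ratios and
    growth rates are meaningful, not absolute values.
    """
    N = n * n
    return {
        "find_time": n ** 3,            # O(n_x^2 n_y) with n_x = n_y = n
        "rgf_time": n ** 4,             # O(n_x^3 n_y) with n_x = n_y = n
        "find_memory": N * math.log2(N),
        "rgf_memory": N ** 1.5,
    }


for n in (64, 128, 256, 512, 1024):
    m = cost_models(n)
    print(n, m["rgf_time"] // m["find_time"], round(m["rgf_memory"] / m["find_memory"], 1))
```

Under these models the running-time ratio grows linearly in n, while the memory ratio grows more slowly, as n divided by a logarithm.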



4 Parallel Methods




How to Parallelize?


Straightforward for leaf clusters

Top-level clusters dominate the running time and offer a lower degree of parallelism

Use the idle processors for redundant computations

More floating point operations but shorter wall clock time

Works for 1D, 2D, and 3D domains



Problem and Processor Settings

[Figure: 16 clusters in a 1D domain with processor labels]

16 processors, 16 clusters in 1D

One target cluster per processor

Keep merging all the other clusters until they are all merged as the complement of the target cluster

Eliminate the merged complement clusters and compute the inverse

Detailed Merging Process

















Each processor keeps the complement of its target cluster with respect to the current subdomain

Start with subdomain of size 2

Expand to subdomains of size 4

Some processors are idle

Use them to prepare for the next subdomain expansion

Until the subdomain is expanded to the whole domain

Additional speedup of factor 2
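The subdomain doubling can be sketched as index bookkeeping. The code below assumes a power-of-two number of clusters numbered along the 1D domain and a binary merging tree; the function name `complement_schedule` is hypothetical. It covers only which block is merged at each level, not the eliminations or the use of idle processors.

```python
def complement_schedule(p, num_clusters):
    """List, level by level, the sibling block merged into processor p's complement.

    Assumptions: num_clusters is a power of two, clusters are numbered
    0..num_clusters-1 along the 1D domain, and subdomains double in size
    at each level.
    """
    schedule = []
    size = 1
    while size < num_clusters:
        sibling_start = ((p // size) ^ 1) * size  # neighboring block of the same size
        schedule.append(list(range(sibling_start, sibling_start + size)))
        size *= 2
    return schedule


schedule = complement_schedule(5, 16)
# After the last level the merged blocks together form the full complement of cluster 5.
merged = sorted(c for blk in schedule for c in blk)
print(schedule)
```

With 16 clusters each processor needs four merging levels, and the block merged at each level is an entire sibling subdomain. That sibling block is the same for every processor in the neighboring subdomain, which is where results can be shared or prepared ahead of time.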
Communication Pattern




Summary

Direct method for fast inverse

Two extensions, two optimizations

An optimal parallel scheme

Collaboration with other groups for more applications

