
FAST ALGORITHMS FOR SPARSE MATRIX INVERSE COMPUTATIONS

A DISSERTATION

SUBMITTED TO THE INSTITUTE FOR

COMPUTATIONAL AND MATHEMATICAL ENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Song Li

September 2009


© Copyright by Song Li 2009

All Rights Reserved



I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Eric Darve) Principal Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(George Papanicolaou)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Michael Saunders)

Approved for the University Committee on Graduate Studies.



Abstract

An accurate and efficient algorithm, called Fast Inverse using Nested Dissection (FIND), has been developed for certain sparse matrix computations. The algorithm reduces the computation cost by an order of magnitude for 2D problems. After discretization on an $N_x \times N_y$ mesh, the previously best-known algorithm, Recursive Green's Function (RGF), requires $O(N_x^3 N_y)$ operations, whereas ours requires only $O(N_x^2 N_y)$. The current major application of the algorithm is to simulate the transport of electrons in nano-devices, but it may be applied to other problems as well, e.g., data clustering, imagery pattern clustering, and image segmentation.

The algorithm computes the diagonal entries of the inverse of a sparse matrix of finite-difference, finite-element, or finite-volume type. In addition, it can be extended to computing certain off-diagonal entries and other inverse-related matrix computations. As an example, we focus on the retarded Green's function, the less-than Green's function, and the current density in the non-equilibrium Green's function approach for transport problems. Special optimizations of the algorithm are also discussed. These techniques are applicable to 3D regular shaped domains as well.

Computing inverse elements for a large matrix requires a lot of memory and is very time-consuming, even using our efficient algorithm with optimization. We propose several parallel algorithms for such applications based on ideas from cyclic reduction, dynamic programming, and nested dissection. A similar parallel algorithm is discussed for solving sparse linear systems with repeated right-hand sides with significant speedup.



Acknowledgements

The first person I would like to thank is Professor Walter Murray. My pleasant study here at Stanford and at the newly established Institute for Computational and Mathematical Engineering would not have been possible without him. I would also very much like to thank Indira Choudhury, who helped me with numerous administrative issues with lots of patience.

I would also very much like to thank Professor TW Wiedmann, who has constantly provided moral support and given me advice on all aspects of my life here.

I am grateful to Professors Tze Leung Lai, Parviz Moin, George Papanicolaou, and Michael Saunders for being willing to serve as my committee members. In particular, Michael gave lots of comments on my dissertation draft. I believe that his comments not only helped me improve my dissertation, but will also benefit my future academic writing.

I would very much like to thank my advisor Eric Darve for choosing a great topic for me at the beginning and for all his help in the past five years. His constant support and tolerance provided me a stable environment with more than enough freedom to learn what interests me and to think at my own pace and in my own style, while at the same time his advice led my vague and diverse ideas to rigorous and meaningful results. This is close to the ideal environment for academic study and research, one that I had never actually had before. In particular, the discussions with him over the past years have been very enjoyable. I am very lucky to have met him and to have him as my advisor here at Stanford. Thank you, Eric!

Lastly, I thank my mother and my sister for always encouraging me and believing in me.



Contents

Abstract

Acknowledgements

1 Introduction
  1.1 The transport problem of nano-devices
    1.1.1 The physical problem
    1.1.2 NEGF approach
  1.2 Review of existing methods
    1.2.1 The need for computing sparse matrix inverses
    1.2.2 Direct methods for matrix inverse related computation
    1.2.3 Recursive Green's Function method

2 Overview of FIND Algorithms
  2.1 1-way methods based on LU only
  2.2 2-way methods based on both LU and back substitution

3 Computing the Inverse
  3.1 Brief description of the algorithm
    3.1.1 Upward pass
    3.1.2 Downward pass
    3.1.3 Nested dissection algorithm of A. George et al.
  3.2 Detailed description of the algorithm
    3.2.1 The definition and properties of mesh node sets and trees
    3.2.2 Correctness of the algorithm
    3.2.3 The algorithm
  3.3 Complexity analysis
    3.3.1 Running time analysis
    3.3.2 Memory cost analysis
    3.3.3 Effect of the null boundary set of the whole mesh
  3.4 Simulation of device and comparison with RGF
  3.5 Concluding remarks

4 Extension of FIND
  4.1 Computing diagonal entries of another Green's function: G<
    4.1.1 The algorithm
    4.1.2 The pseudocode of the algorithm
    4.1.3 Computation and storage cost
  4.2 Computing off-diagonal entries of Gr and G<

5 Optimization of FIND
  5.1 Introduction
  5.2 Optimization for extra sparsity in A
    5.2.1 Schematic description of the extra sparsity
    5.2.2 Preliminary analysis
    5.2.3 Exploiting the extra sparsity in a block structure
  5.3 Optimization for symmetry and positive definiteness
    5.3.1 The symmetry and positive definiteness of dense matrix A
    5.3.2 Optimization combined with the extra sparsity
    5.3.3 Storage cost reduction
  5.4 Optimization for computing G< and current density
    5.4.1 G< sparsity
    5.4.2 G< symmetry
    5.4.3 Current density
  5.5 Results and comparison

6 FIND-2way and Its Parallelization
  6.1 Recurrence formulas for the inverse matrix
  6.2 Sequential algorithm for 1D problems
  6.3 Computational cost analysis
  6.4 Parallel algorithm for 1D problems
    6.4.1 Schur phases
    6.4.2 Block Cyclic Reduction phases
  6.5 Numerical results
    6.5.1 Stability
    6.5.2 Load balancing
    6.5.3 Benchmarks
  6.6 Conclusion

7 More Advanced Parallel Schemes
  7.1 Common features of the parallel schemes
  7.2 PCR scheme for 1D problems
  7.3 PFIND-Complement scheme for 1D problems
  7.4 Parallel schemes in 2D

8 Conclusions and Directions for Future Work

Appendices

A Proofs for the Recursive Method Computing Gr and G<
  A.1 Forward recurrence
  A.2 Backward recurrence

B Theoretical Supplement for FIND-1way Algorithms
  B.1 Proofs for both computing Gr and computing G<
  B.2 Proofs for computing G<

C Algorithms
  C.1 BCR Reduction Functions
  C.2 BCR Production Functions
  C.3 Hybrid Auxiliary Functions

Bibliography


List of Tables

I    Symbols in the physical problems
II   Symbols for matrix operations
III  Symbols for mesh description

3.1  Computation cost estimate for different types of clusters

5.1  Matrix blocks and their corresponding operations
5.2  The cost of operations and their dependence in the first method. The costs are in flops. The size of $A$, $B$, $C$, and $D$ is $m \times m$; the size of $W$, $X$, $Y$, and $Z$ is $m \times n$.
5.3  The cost of operations and their dependence in the second method. The costs are in flops.
5.4  The cost of operations in flops and their dependence in the third method
5.5  The cost of operations in flops and their dependence in the fourth method. The size of $A$, $B$, $C$, and $D$ is $m \times m$; the size of $W$, $X$, $Y$, and $Z$ is $m \times n$.
5.6  Summary of the optimization methods
5.7  The cost of operations and their dependence for (5.10)
5.8  Operation costs in flops.
5.9  Operation costs in flops for G< with optimization for symmetry.

6.1  Estimate of the computational cost for a 2D square mesh for different cluster sizes. The size is in units of $N^{1/2}$. The cost is in units of $N^{3/2}$ flops.


List of Figures

1.1  The model of a widely-studied double-gate SOI MOSFET with ultra-thin intrinsic channel. Typical values of key device parameters are also shown.
1.2  Left: example of 2D mesh to which RGF can be applied. Right: 5-point stencil.

2.1  Partitions of the entire mesh into clusters
2.2  The partition tree and the complement tree of the mesh
2.3  Two augmented trees with root clusters 14 and 15.
2.4  The nodes within the rectangle frame to the left of the dashed line and those to the right form a larger cluster.

3.1  Cluster and definition of the sets. The cluster is composed of all the mesh nodes inside the dashed line. The triangles form the inner set, the circles the boundary set, and the squares the adjacent set. The crosses are mesh nodes outside the cluster that are not connected to the mesh nodes in the cluster.
3.2  The top figure is a $4 \times 8$ cluster with two $4 \times 4$ child clusters separated by the dashed line. The middle figure shows the result of eliminating the inner mesh nodes in the child clusters. The bottom figure shows the result of eliminating the inner mesh nodes in the $4 \times 8$ cluster. The elimination from the middle row can be re-used to obtain the elimination at the bottom.
3.3  The first step in the downward pass of the algorithm. Cluster C1 is on the left and cluster C2 on the right. The circles are mesh nodes in cluster C11. The mesh nodes in the adjacent set of C11 are denoted by squares; they are not eliminated at this step. The crosses are mesh nodes that are in the boundary set of either C12 or C2. These nodes need to be eliminated at this step. The dash-dotted line around the figure goes around the entire computational mesh (including mesh nodes in C2 that have already been eliminated). There are no crosses in the top left part of the figure because these nodes are inner nodes of C12.
3.4  A further step in the downward pass of the algorithm. Cluster C has two children C1 and C2. As previously, the circles are mesh nodes in cluster C1. The mesh nodes in the adjacent set of C1 are denoted by squares; they are not eliminated at this step. The crosses are nodes that need to be eliminated. They belong to either the adjacent set of C or the boundary set of C2.
3.5  The mesh and its partitions. C1 = M.
3.6  Examples of augmented trees.
3.7  Cluster C has two children C1 and C2. The inner nodes of clusters C1 and C2 are shown using triangles. The private inner nodes of C are shown with crosses. The boundary set of C is shown using circles.
3.8  Merging clusters below level L.
3.9  Merging rectangular clusters. Two $N_x \times W$ clusters merge into an $N_x \times 2W$ cluster.
3.10 Partitioning of clusters above level L.
3.11 Partitioning of clusters below level L.
3.12 Density-of-states (DOS) and electron density plots from RGF and FIND.
3.13 Comparison of the running time of FIND and RGF when $N_x$ is fixed.
3.14 Comparison of the running time of FIND and RGF when $N_y$ is fixed.

4.1  Last step of elimination

5.1  Two clusters merge into one larger cluster in the upward pass. The × nodes have already been eliminated. The ◦ and • nodes remain to be eliminated.
5.2  Data flow from early eliminations for the left child and the right child to the elimination for the parent cluster
5.3  Two clusters merge into one larger cluster in the downward pass.
5.4  Two clusters merge into one larger cluster in the downward pass.
5.5  The extra sparsity of the matrices used for computing the current density.
5.6  Coverage of all the needed current density through half of the leaf clusters. Arrows are the current. Each needed cluster consists of four solid nodes. The circle nodes are skipped.
5.7  Optimization for extra sparsity
5.8  Optimization for symmetry
5.9  Comparison with RGF to show the change of cross-point

6.1  Schematics of how the recurrence formulas are used to calculate G.
6.2  The cluster tree (left) and the mesh nodes (right) for a mesh of size $15 \times 15$. The number in each tree node indicates the cluster number. The number in each mesh node indicates the cluster it belongs to.
6.3  Two different block tridiagonal matrices distributed among 4 different processes, labeled p0, p1, p2 and p3.
6.4  The distinct phases of operations performed by the hybrid method in determining $\mathrm{Trid}_A\{G\}$, the block tridiagonal portion of G with respect to the structure of A. The block matrices in this example are partitioned across 4 processes, as indicated by the horizontal dashed lines.
6.5  Three different schemes that represent a corner production step undertaken in the BCR production phase, where we produce inverse blocks on row/column i using inverse blocks and LU factors from row/column j.
6.6  The three different schemes that represent a center production step undertaken in the BCR production phase, where we produce inverse block elements on row/column j using inverse blocks and LU factors from rows i and k.
6.7  Permutation for a $31 \times 31$ block matrix A (left), which shows that our Schur reduction phase is identical to an unpivoted Gaussian elimination on a suitably permuted matrix PAP (right). Colors are associated with processes. The diagonal block structure of the top left part of the matrix (right panel) shows that this calculation is embarrassingly parallel.
6.8  BCR corresponds to an unpivoted Gaussian elimination of A with permutation of rows and columns.
6.9  Following a Schur reduction phase on the permuted block matrix PAP, we obtain the reduced block tridiagonal system in the lower right corner (left panel). This reduced system is further permuted to a form shown on the right panel, as was done for BCR in Fig. 6.8.
6.10 Walltime for our hybrid algorithm and pure BCR for different αs as a basic load balancing parameter. A block tridiagonal matrix A with n = 512 diagonal blocks, each of dimension m = 256, was used for these tests.
6.11 Speedup curves for our hybrid algorithm and pure BCR for different αs. A block tridiagonal matrix A with n = 512 diagonal blocks, each of dimension m = 256, was used.
6.12 The total execution time for our hybrid algorithm is plotted against the number of processes P. The choice α = 1 gives poor results except when the number of processes is large. Other choices for α give similar results. The same matrix A as before was used.

7.1  The common goal of the 1D parallel schemes in this chapter
7.2  The merging process of PCR
7.3  The communication patterns of PCR
7.4  The complement clusters in the merging process of PFIND-complement. The subdomain size at each step is indicated in each sub-figure.
7.5  The subdomain clusters in the merging process of PFIND-complement.
7.6  The communication pattern of PFIND-complement
7.7  Augmented trees for target block 14 in FIND-1way and PFIND-Complement schemes


List of Notations

All the symbols used in this dissertation are listed below unless they are defined elsewhere.

Table I: Symbols in the physical problems

A       spectral function
b       conduction band valley index
E       energy level
f(·)    Fermi distribution function
G       Green's function
Gr      retarded Green's function
Ga      advanced Green's function
G<      less-than Green's function
ħ       Planck constant
H       Hamiltonian
k, k'   wavenumber
ρ       density matrix
Σ       self-energy (reflecting the coupling effects)
U       potential
v       velocity


Table II: Symbols for matrix operations

A, A^T, A^†        the sparse matrix from the discretization of the device, its transpose, and its transpose conjugate
A(i1:i2, j)        block column vector containing rows i1 through i2 in column j of A
A(i, j1:j2)        block row vector containing columns j1 through j2 in row i of A
A(i1:i2, j1:j2)    sub-matrix containing rows i1 through i2 and columns j1 through j2 of A
a_ij               block entry (i, j) of matrix A; in general, a matrix and its block entries are denoted by the same letter. A block is denoted with a bold lowercase letter while the matrix is denoted by a bold uppercase letter.
G                  Green's function matrix
Gr                 retarded Green's function matrix
Ga                 advanced Green's function matrix
G<                 lesser Green's function matrix
Σ and Γ            self-energy matrix; Γ is used only in Chapter 6 to avoid confusion
I                  identity matrix
L, U               the LU factors of the sparse matrix A
L, D, U            block LDU factorization of A (used only in Chapter 6)
d                  block size (used only in Chapter 6)
n                  number of blocks in matrix (used only in Chapter 6)
g, i, j, k, m, n, p, q, r   matrix indices
0                  zero matrix

Table III: Symbols for mesh description

B        set of boundary nodes of a cluster
C        a cluster of mesh nodes after partitioning
C̄        complement set of C
I        set of inner nodes
M        the set of all the nodes in the mesh
Nx, Ny   size of the mesh from the discretization of the 2D device
N        the total number of nodes in the mesh
S        set of private inner nodes
T        tree of clusters


Chapter 1

Introduction

The study of sparse matrix inverse computations comes from the modeling and computation of the transport problem of nano-devices, but it could be applicable to other problems. I will focus on the nano-device transport problem in this chapter for simplicity.

1.1 The transport problem of nano-devices

1.1.1 The physical problem

For quite some time, semiconductor devices have been scaled aggressively in order to meet the demands of reduced cost per function on a chip used in modern integrated circuits. There are some problems associated with device scaling, however [Wong, 2002]. Critical dimensions, such as transistor gate length and oxide thickness, are reaching physical limitations. Considering the manufacturing issues, photolithography becomes difficult as the feature sizes approach the wavelength of ultraviolet light. In addition, it is difficult to control the oxide thickness when the oxide is made up of just a few monolayers. In addition to the processing issues, there are some fundamental device issues. As the oxide thickness becomes very thin, the gate leakage current due to tunneling increases drastically. This significantly affects the power requirements of the chip and the oxide reliability. Short-channel effects, such as drain-induced barrier lowering, degrade the device performance. Hot carriers also degrade device reliability.

To fabricate devices beyond current scaling limits, integrated circuit companies are simultaneously pushing (1) the planar, bulk silicon complementary metal oxide semiconductor (CMOS) design, while exploring alternative gate stack materials (high-k dielectric and metal gates) and band engineering methods (using strained Si or SiGe [Wong, 2002; Welser et al., 1992; Vasileska and Ahmed, 2005]), and (2) alternative transistor structures that include primarily partially-depleted and fully-depleted silicon-on-insulator (SOI) devices. SOI devices are found to be advantageous over their bulk silicon counterparts in terms of reduced parasitic capacitances, reduced leakage currents, increased radiation hardness, as well as an inexpensive fabrication process. IBM launched the first fully functional SOI mainstream microprocessor in 1999, predicting a 25-35 percent performance gain over bulk CMOS [Shahidi, 2002]. Today there is extensive research in double-gate structures and FinFET transistors [Wong, 2002], which have better electrostatic integrity and theoretically have better transport properties than single-gated FETs. A number of non-classical and revolutionary technologies such as carbon nanotubes, nanoribbons, or molecular transistors have been pursued in recent years, but it is not quite obvious, in view of the predicted future capabilities of CMOS, that they will be competitive.

The Metal-Oxide-Semiconductor Field-Effect Transistor (MOSFET) is an example of such devices. There is a virtual consensus that the most scalable MOSFET devices are double-gate SOI MOSFETs with a sub-10 nm gate length, ultra-thin, intrinsic channels and highly doped (degenerate) bulk electrodes; see, e.g., recent reviews [Walls et al., 2004; Hasan et al., 2004] and Fig. 1.1. In such transistors, short channel effects typical of their bulk counterparts are minimized, while the absence of dopants in the channel maximizes the mobility. Such advanced MOSFETs may be practically implemented in several ways including planar, vertical, and FinFET geometries. However, several design challenges have been identified, such as a process tolerance requirement of within 10 percent of the body thickness and an extremely sharp doping profile with a doping gradient of 1 nm/decade. The Semiconductor Industry Association forecasts that this new device architecture may extend MOSFETs to the 22 nm node (9 nm physical gate length) by 2016 [SIA, 2001]. Intrinsic device speed may exceed 1 THz and integration densities will be more than 1 billion transistors/cm².

[Figure 1.1 here. Device parameters shown in the figure: $T_{ox} = 1$ nm, $T_{si} = 5$ nm, $L_G = 9$ nm, $L_T = 17$ nm, $L_{sd} = 10$ nm, $N_{sd} = 10^{20}$ cm$^{-3}$, $N_b = 0$, $\Phi_G = 4.4227$ eV (gate workfunction); axes $x$ (depth) and $y$ (length).]

Figure 1.1: The model of a widely-studied double-gate SOI MOSFET with ultra-thin intrinsic channel. Typical values of key device parameters are also shown.

For devices of such size, we have to take quantum effects into consideration. Although some approximate theories of quantum effects within semiclassical MOSFET modeling tools, or other approaches like the Landauer-Büttiker formalism, are numerically less expensive, they are either not accurate enough or not applicable to realistic devices with interactions (e.g., scattering) that break quantum mechanical phase and cause energy relaxation. (For such devices, the transport problem is between diffusive and phase-coherent.) Such devices include nano-transistors, nanowires, and molecular electronic devices. Although the transport issues for these devices are very different from one another, they can be treated with the common formalism provided by the Non-Equilibrium Green's Function (NEGF) approach [Datta, 2000, 1997]. For example, a commercially successful utilization of the Green's function approach is the NEMO (Nano-Electronics Modeling) simulator [Lake et al., 1997], which is effectively 1D and is primarily applicable to resonant tunneling diodes. This approach, however, is computationally very intensive, especially for 2D devices [Venugopal et al.,


source/drain contacts and to the scattering process. Since the NEGF approach for problems with or without scattering effects needs the same type of computation, for simplicity we only consider ballistic transport here. After identifying the Hamiltonian matrix and the self-energies, the third step is to compute the retarded Green's function. Once that is known, one can calculate other Green's functions and determine the physical quantities of interest.

Recently, the NEGF approach has been applied in the simulation of 2D MOSFET structures [Svizhenko et al., 2002] as in Fig. 1.1. Below is a summary of the NEGF approach applied to such a device.

I will first introduce the Green's functions used in this approach, based on [Datta, 1997; Ferry and Goodnick, 1997].

We start from the simplest case: a 1D wire with a constant potential energy $U_0$ and zero vector potential. Define $G(x, x')$ as

$$\left( E - U_0 + \frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2} \right) G(x, x') = \delta(x - x'). \tag{1.2}$$

Since $\delta(x - x')$ becomes an identity matrix after discretization, we often write $G(x, x')$ as

$$\left( E - U_0 + \frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2} \right)^{-1}.$$

Since the energy level is degenerate, each $E$ gives two Green's functions

$$G^r(x, x') = -\frac{i}{\hbar v}\, e^{+ik|x - x'|} \tag{1.3}$$

and

$$G^a(x, x') = +\frac{i}{\hbar v}\, e^{-ik|x - x'|}, \tag{1.4}$$

where $k \equiv \sqrt{2m(E - U_0)}/\hbar$ is the wavenumber and $v \equiv \hbar k/m$ is the velocity. We will discuss more about the physical interpretation of these two quantities later.

Here $G^r(x, x')$ is called the retarded Green's function, corresponding to the outgoing wave (with respect to the excitation $\delta(x, x')$ at $x = x'$), and $G^a(x, x')$ is called the advanced Green's function, corresponding to the incoming wave.

To separate these two states, and more importantly, to help take more complicated boundary conditions into consideration, we introduce an infinitesimal imaginary part $i\eta$ to the energy, where $\eta \downarrow 0$. We also sometimes write $\eta$ as $0^+$ to make its meaning more obvious. Now we have

$$G^r = \left( E + i0^+ - U_0 + \frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2} \right)^{-1}$$

and

$$G^a = \left( E - i0^+ - U_0 + \frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2} \right)^{-1}.$$
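For a numerical feel of these operators, here is a minimal sketch (assuming units with $\hbar = m = 1$, a uniform grid, and a simple 3-point finite-difference stencil; these choices are illustrative and not taken from the text) that discretizes the wire Hamiltonian and forms $G^r$ with a small $\eta > 0$ standing in for $0^+$:

```python
import numpy as np

# Minimal sketch (hbar = m = 1; uniform grid; 3-point stencil): discretize
# H = U0 - (1/2) d^2/dx^2 and form G^r = (E + i*eta - H)^{-1} directly.
U0, E, eta = 0.0, 1.0, 0.05
n, a = 2000, 0.1                       # grid points and spacing

t = 1.0 / (2.0 * a**2)                 # hopping energy of the stencil
H = (U0 + 2.0 * t) * np.eye(n) - t * (np.eye(n, k=1) + np.eye(n, k=-1))
Gr = np.linalg.inv((E + 1j * eta) * np.eye(n) - H)

# Reference value from Eq. (1.3) for the infinite wire: G^r(x, x) = -i/v.
k = np.sqrt(2.0 * (E - U0))
v = k
print(np.diag(Gr)[n // 2], -1j / v)    # roughly agree once eta exceeds the
                                       # level spacing of this finite box
```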

Now we consider the two leads (source and drain) in Fig. 1.1. Our strategy is to still focus on the device while summarizing the effect of the leads as self-energies. First we consider the effect of the left (source) lead. The Hamiltonian of the system with both the left lead and the device included is given by

$$\begin{pmatrix} H_L & H_{LD} \\ H_{DL} & H_D \end{pmatrix}. \tag{1.5}$$

The Green's function of the system is then

$$\begin{pmatrix} G_L & G_{LD} \\ G_{DL} & G_D \end{pmatrix} = \begin{pmatrix} (E + i0^+)I - H_L & -H_{LD} \\ -H_{DL} & (E + i0^+)I - H_D \end{pmatrix}^{-1}. \tag{1.6}$$

Eliminating all the columns corresponding to the left lead, we obtain the Schur complement $(E + i0^+)I - H_D - \Sigma_L$ and we have

$$G_D = \left( (E + i0^+)I - H_D - \Sigma_L \right)^{-1}, \tag{1.7}$$

where $\Sigma_L = H_{DL}\left( (E + i0^+)I - H_L \right)^{-1} H_{LD}$ is called the self-energy introduced by the left lead.

Similarly, for the right (drain) lead, we have $\Sigma_R = H_{DR}\left( (E + i0^+)I - H_R \right)^{-1} H_{RD}$, the self-energy introduced by the right lead. Now we have the Green's function of the system with both the left lead and the right lead included:

$$G_D = \left( (E + i0^+)I - H_D - \Sigma_L - \Sigma_R \right)^{-1}. \tag{1.8}$$
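Equation (1.7) is easy to verify numerically on a small closed system; in the sketch below the block sizes and the random Hermitian blocks are arbitrary stand-ins, not device data:

```python
import numpy as np

# Check of Eq. (1.7): the device block of the full inverse equals the
# Green's function built from the Schur-complement self-energy.
rng = np.random.default_rng(0)
nL, nD = 6, 4
def herm(n):                                       # random Hermitian block
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

H_L, H_D = herm(nL), herm(nD)
H_LD = rng.standard_normal((nL, nD)) + 1j * rng.standard_normal((nL, nD))
H_DL = H_LD.conj().T
z = 0.3 + 1e-3j                                    # E + i*eta

A = np.block([[z * np.eye(nL) - H_L, -H_LD],
              [-H_DL,                z * np.eye(nD) - H_D]])
G_full = np.linalg.inv(A)                          # Eq. (1.6)

Sigma_L = H_DL @ np.linalg.inv(z * np.eye(nL) - H_L) @ H_LD
G_D = np.linalg.inv(z * np.eye(nD) - H_D - Sigma_L)   # Eq. (1.7)
print(np.allclose(G_full[nL:, nL:], G_D))          # True
```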

Now we are ready to compute the Green's function as long as $\Sigma_L$ and $\Sigma_R$ are available. Although it seems difficult to compute them since $(E + i0^+)I - H_L$, $(E + i0^+)I - H_R$, $H_{DL}$, $H_{DR}$, $H_{RD}$, and $H_{LD}$ are of infinite dimension, it turns out that they are often analytically computable. For example, for a semi-infinite discrete wire with grid point distance $a$ in the $y$-direction and normalized eigenfunctions $\{\chi_m(x_i)\}$ as the left lead, the self-energy is given by [Datta, 1997]

$$\Sigma_L(i, j) = -t \sum_m \chi_m(x_i)\, e^{+ik_m a}\, \chi_m(x_j). \tag{1.9}$$

Now we consider the distribution of the electrons. As fermions, the electrons at each energy state $E$ follow the Fermi-Dirac distribution and the density matrix is given by

$$\rho = \int f_0(E - \mu)\, \delta(EI - H)\, dE. \tag{1.10}$$

Since $\delta(x)$ can be written as

$$\delta(x) = \frac{1}{2\pi}\left( \frac{i}{x + i0^+} - \frac{i}{x - i0^+} \right), \tag{1.11}$$

we can write the density matrix in terms of the spectral function $A$:

$$\rho = \frac{1}{2\pi} \int f_0(E - \mu)\, A(E)\, dE, \tag{1.12}$$

where

$$A(E) \equiv i(G^r - G^a) = iG^r(\Sigma^r - \Sigma^a)G^a.$$

We can also define the coupling effect $\Gamma \equiv i(\Sigma^r - \Sigma^a)$. Now, the equations of motion for the retarded Green's function ($G^r$) and less-than Green's function ($G^<$) are found to be (see [Svizhenko et al., 2002] for details and notations)

$$\left[ E - \frac{\hbar^2 k_z^2}{2m_z} - H_b(\mathbf{r}_1) \right] G^r_b(\mathbf{r}_1, \mathbf{r}_2, k_z, E) - \int \Sigma^r_b(\mathbf{r}_1, \mathbf{r}', k_z, E)\, G^r_b(\mathbf{r}', \mathbf{r}_2, k_z, E)\, d\mathbf{r}' = \delta(\mathbf{r}_1 - \mathbf{r}_2)$$

and

$$G^<_b(\mathbf{r}_1, \mathbf{r}_2, k_z, E) = \iint G^r_b(\mathbf{r}_1, \mathbf{r}', k_z, E)\, \Sigma^<_b(\mathbf{r}', \mathbf{r}'', k_z, E)\, G^r_b(\mathbf{r}_2, \mathbf{r}'', k_z, E)^\dagger\, d\mathbf{r}'\, d\mathbf{r}'',$$

where $\dagger$ denotes the complex conjugate. Given $G^r_b$ and $G^<_b$, the density of states (DOS) and the charge density $\rho$ can be written as a sum of contributions from the individual valleys:

$$\mathrm{DOS}(\mathbf{r}, k_z, E) = \sum_b N_b(\mathbf{r}, k_z, E) = -\frac{1}{\pi} \sum_b \mathrm{Im}\left[ G^r_b(\mathbf{r}, \mathbf{r}, k_z, E) \right]$$

$$\rho(\mathbf{r}, k_z, E) = \sum_b \rho_b(\mathbf{r}, k_z, E) = -i \sum_b G^<_b(\mathbf{r}, \mathbf{r}, k_z, E).$$

The self-consistent solution of the Green's function is often the most time-intensive step in the simulation of electron density.

To compute the solution numerically, we discretize the intrinsic device using a 2D non-uniform spatial grid (with $N_x$ and $N_y$ nodes along the depth and length directions, respectively) with semi-infinite boundaries. Non-uniform spatial grids are essential to limit the total number of grid points while at the same time resolving physical features.

Recursive Green's Function (RGF) [Svizhenko et al., 2002] is an efficient approach to computing the diagonal blocks of the discretized Green's function. The operation count required to solve for all elements of $G^r_b$ scales as $N_x^3 N_y$, making it very expensive in this particular case. Note that RGF provides all the diagonal blocks of the matrix even though only the diagonal is really needed. Faster algorithms to solve for the diagonal elements with operation count smaller than $N_x^3 N_y$ have been, for the past few years, very desirable. Our newly developed FIND algorithm addresses this particular issue of computing the diagonal of $G^r$ and $G^<$, thereby reducing the simulation time of NEGF significantly compared to the conventional RGF scheme.

1.2 Review of existing methods

1.2.1 The need for computing sparse matrix inverses

The most expensive calculation in the NEGF approach is computing some of (but not all) the entries of the matrix $G^r$ [Datta, 2000]:

$$G^r(E) = [EI - H - \Sigma]^{-1} = A^{-1} \quad \text{(retarded Green's function)} \tag{1.13}$$


and $G^<(E) = G^r \Sigma^< (G^r)^\dagger$ (less-than Green's function). In these equations, $I$ is the identity matrix and $E$ is the energy level. $\dagger$ denotes the transpose conjugate of a matrix. The Hamiltonian matrix $H$ describes the system at hand (e.g., a nano-transistor). It is usually a sparse matrix with connectivity only between neighboring mesh nodes, except for nodes at the boundary of the device that may have a non-local coupling (e.g., a non-reflecting boundary condition). The matrices $\Sigma$ and $\Sigma^<$ correspond to the self-energy and can be assumed to be diagonal. See Svizhenko [Svizhenko et al., 2002] for this terminology and notation. In our work, all these matrices are considered to be given and we will focus on the problem of efficiently computing some entries in $G^r$ and $G^<$. As an example of entries that must be computed, the diagonal entries of $G^r$ are required to compute the density of states, while the diagonal entries of $G^<$ allow computing the electron density. The current can be computed from the super-diagonal entries of $G^<$.

Even though the matrix $A$ in (1.13) is, by the usual standards, a mid-size sparse matrix of size typically $10{,}000 \times 10{,}000$, computing the entries of $G^<$ is a major challenge because this operation is repeated at all energy levels for every iteration of the Poisson-Schrödinger solver. Overall, the diagonal of $G^<(E)$ for the different values of the energy level $E$ can be computed as many as thousands of times.

The problem of computing certain entries of the inverse of a sparse matrix is relatively common in computational engineering. Examples include:

Least-squares fitting  In the linear least-squares fitting procedure, coefficients $a_k$ are computed so that the error measure

$$\sum_i \Big( Y_i - \sum_k a_k \phi_k(x_i) \Big)^2$$

is minimal, where $(x_i, Y_i)$ are the data points. It can be shown, under certain assumptions, that in the presence of measurement errors in the observations $Y_i$, the error in the coefficients $a_k$ is proportional to $C_{kk}$, where $C$ is the inverse of the matrix $A$, with

$$A_{jk} = \sum_i \phi_j(x_i)\, \phi_k(x_i).$$

Or we can write this with notation from the linear algebra literature. Let $b = [Y_1, Y_2, \ldots]^T$, $M_{ik} = \phi_k(x_i)$, $x = [a_1, a_2, \ldots]^T$, and $A = M^T M$. Then we have the least-squares problem $\min_x \|b - Mx\|^2$ and the sensitivity of $x$ depends on the diagonal entries of $A^{-1}$.
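As a small illustration of this point (the polynomial basis, noise level, and sizes below are arbitrary assumptions, not from the text): under i.i.d. observation noise of variance $\sigma^2$, the variance of the fitted coefficients is $\sigma^2$ times the diagonal of $(M^T M)^{-1}$, which a quick Monte Carlo check reproduces:

```python
import numpy as np

# Diagonal of (M^T M)^{-1} controls the sensitivity of least-squares coefficients.
rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.0, 200)
M = np.vander(xs, 4)                        # polynomial basis columns phi_k(x_i)
true_coeff = np.array([1.0, -2.0, 0.5, 3.0])
sigma = 0.1

predicted_var = sigma**2 * np.diag(np.linalg.inv(M.T @ M))

fits = np.array([np.linalg.lstsq(M, M @ true_coeff
                                 + sigma * rng.standard_normal(len(xs)),
                                 rcond=None)[0] for _ in range(2000)])
print(predicted_var)
print(fits.var(axis=0))                     # close to predicted_var
```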

Eigenvalues of tri-diagonal matrices  The inverse iteration method attempts to compute the eigenvector $v$ associated with eigenvalue $\lambda$ by solving iteratively the equation

$$(A - \tilde{\lambda} I)\, x_k = s_k\, x_{k-1},$$

where $\tilde{\lambda}$ is an approximation of $\lambda$ and $s_k$ is used for normalization. Varah [Varah, 1968] and Wilkinson [Peters and Wilkinson, 1971; Wilkinson, 1972; Peters and Wilkinson, 1979] have extensively discussed optimal choices of starting vectors for this method. An important result is that, in general, choosing the vector $e_l$ (the $l$th vector in the standard basis), where $l$ is the index of the column with the largest norm among all columns of $(A - \tilde{\lambda} I)^{-1}$, is a nearly optimal choice. A good approximation can be obtained by choosing $l$ such that the $l$th entry on the diagonal of $(A - \tilde{\lambda} I)^{-1}$ is the largest among all diagonal entries.
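A sketch of this starting-vector choice (the symmetric tridiagonal test matrix and the shift below are arbitrary; this is an illustration, not code from the dissertation):

```python
import numpy as np

# Inverse iteration started from e_l, with l the index of the largest diagonal
# entry of (A - lambda_approx I)^{-1}, as suggested above.
rng = np.random.default_rng(2)
n = 50
A = np.diag(rng.standard_normal(n)) + np.diag(rng.standard_normal(n - 1), 1)
A = A + A.T                                        # symmetric tridiagonal test matrix

lam_approx = np.sort(np.linalg.eigvalsh(A))[n // 2] + 1e-4
B = A - lam_approx * np.eye(n)

l = np.argmax(np.abs(np.diag(np.linalg.inv(B))))   # starting vector e_l
x = np.zeros(n); x[l] = 1.0
for _ in range(5):
    x = np.linalg.solve(B, x)
    x /= np.linalg.norm(x)                         # the s_k normalization

print(x @ A @ x, lam_approx)                       # Rayleigh quotient ~ nearby eigenvalue
```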

Accuracy estimation  When solving a linear equation $Ax = b$, one is often faced with errors in $A$ and $b$, because of uncertainties in physical parameters or inaccuracies in their numerical calculation. In general the accuracy of the computed solution $x$ will depend on the condition number of $A$: $\|A\|\,\|A^{-1}\|$, which can be estimated from the diagonal entries of $A$ and its inverse in some cases.

Sensitivity computation  When solving $Ax = b$, the sensitivity of $x_i$ to $A_{jk}$ is given by $\partial x_i / \partial A_{jk} = -x_k (A^{-1})_{ij}$.

Many other examples can be found in the literature.

1.2.2 Direct methods for matrix inverse related computation

There are several existing direct methods for sparse matrix inverse computations. These methods are mostly based on the observation of [Takahashi et al., 1973] that if the matrix $A$ is decomposed using an LDU factorization $A = LDU$, then we have

$$G^r = A^{-1} = D^{-1}L^{-1} + (I - U)G^r, \quad \text{and} \tag{1.14}$$

$$G^r = U^{-1}D^{-1} + G^r(I - L). \tag{1.15}$$

The methods for computing certain entries of the sparse matrix inverse based on Takahashi's observation lead to methods based on a 2-pass computation: one pass for the LU factorization and the other for a special back substitution. We call these methods 2-way methods. The algorithm of [Erisman and Tinney, 1975] is such a method. Let us define a matrix $C$ such that

$$C_{ij} = \begin{cases} 1, & \text{if } L_{ij} \neq 0 \text{ or } U_{ij} \neq 0 \\ 0, & \text{otherwise.} \end{cases}$$

Erisman and Tinney showed the following theorem:

If $C_{ji} = 1$, any entry $G^r_{ij}$ can be computed as a function of $L$, $U$, $D$ and entries $G^r_{pq}$ such that $p \geq i$, $q \geq j$, and $C_{qp} = 1$.

This implies that efficient recursive equations can be constructed. Specifically, from (2.2), for $i < j$,

$$G^r_{ij} = -\sum_{k=i+1}^{n} U_{ik}\, G^r_{kj}.$$

The key observation is that if we want to compute $G^r_{ij}$ with $L_{ji} \neq 0$ and $U_{ik} \neq 0$ ($k > i$), then $C_{jk}$ must be 1. This proves that the theorem holds in that case. A similar result holds for $j < i$ using (2.3). For $i = j$, we get (using (2.2))

$$G^r_{ii} = (D_{ii})^{-1} - \sum_{k=i+1}^{n} U_{ik}\, G^r_{ki}.$$
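These recursions can be exercised on a small dense example. The sketch below assumes an unpivoted LDU factorization exists (hence the diagonally dominant random test matrix, which is purely illustrative) and fills the inverse from the bottom-right corner upward using the relations derived from (1.14)-(1.15); in the sparse case only the entries with $C_{qp} = 1$ would be visited:

```python
import numpy as np

# Dense sketch of the Takahashi/Erisman-Tinney recursions.
rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)    # diagonally dominant: no pivoting

# Unpivoted LDU factorization: L, U unit triangular, D diagonal
L, D, U = np.eye(n), np.zeros((n, n)), np.eye(n)
S = A.copy()
for i in range(n):
    D[i, i] = S[i, i]
    L[i+1:, i] = S[i+1:, i] / S[i, i]
    U[i, i+1:] = S[i, i+1:] / S[i, i]
    S[i+1:, i+1:] -= np.outer(L[i+1:, i], S[i, i+1:])

G = np.zeros((n, n))
for i in range(n - 1, -1, -1):
    for j in range(n - 1, i, -1):
        G[i, j] = -U[i, i+1:] @ G[i+1:, j]         # i < j, from (1.14)
        G[j, i] = -G[j, i+1:] @ L[i+1:, i]         # j > i, from (1.15)
    G[i, i] = 1.0 / D[i, i] - U[i, i+1:] @ G[i+1:, i]

print(np.allclose(G, np.linalg.inv(A)))            # True
```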

Despite the appealing simplicity of this algorithm, it has the drawback that the method does not extend to the calculation of $G^<$, which is a key requirement in our problem.


An application of Takahashi's observation to computing both $G^r$ and $G^<$ on a 1D system is the Recursive Green's Function (RGF) method. RGF is a state-of-the-art method and will be used as a benchmark to compare with our methods. We will discuss it in detail in Section 1.2.3.

In RGF, the computation in each pass is conducted column by column in sequence, from one end to the other. This sequential nature makes the method difficult to parallelize. Chapter 6 shows an alternative method based on cyclic reduction for the LU factorization and back substitution.

An alternative approach to the 2-way methods is our FIND-1way algorithm, which we will discuss in detail in Chapter 3. In FIND-1way, we compute the inverse solely through LU factorization, without the need for back substitution. This method can be more efficient under certain circumstances and may lead to further parallelization.

In all these methods, we need to conduct the LU factorization efficiently. For 1D systems, a straightforward approach works well. For 2D systems, however, we need more sophisticated methods. Elimination trees [Schreiber, 1982], minimum-degree orderings [Davis, 2006], etc., are common approaches for the LU factorization of sparse matrices. For a sparse matrix with local connectivity in its connectivity graph, nested dissection has proved to be a simple and efficient method. We will use nested dissection methods in our FIND-1way, FIND-2way, and various parallel FIND algorithms as well.

Independent of the above methods, K. Bowden developed an interesting set of matrix sequences that allows us in principle to calculate the inverse of any block tri-diagonal matrix very efficiently [Bowden, 1989]. Briefly speaking, four sequences of matrices, $K_p$, $L_p$, $M_p$ and $N_p$, are defined using recurrence relations. Then an expression is found for any block $(i, j)$ of matrix $G^r$:

$$j \geq i:\quad G^r_{ij} = K_i N_0^{-1} N_j, \qquad j \leq i:\quad G^r_{ij} = L_i L_0^{-1} M_j.$$

However, the recurrence relations used to define the four sequences of matrices turn out to be unstable in the presence of roundoff errors. Consequently this approach is not applicable to matrices of large size.

1.2.3 Recursive Green’s Function method

Currently the state of the art is a method developed by Klimeck and Svizhenko et al. [Svizhenko et al., 2002], called the recursive Green's function method (RGF). This approach can be shown to be the most efficient for "nearly 1D" devices, i.e., devices that are very elongated in one direction and very thin in the two other directions. This algorithm has a running time of order $O(N_x^3 N_y)$, where $N_x$ and $N_y$ are the number of points on the grid in the $x$ and $y$ directions, respectively.

The key point of this algorithm is that even though it appears that the full knowledge of $G$ is required in order to compute the diagonal blocks of $G^<$, in fact we will show that the knowledge of the three main block diagonals of $G$ is sufficient to calculate the three main block diagonals of $G^<$. A consequence of this fact is that it is possible to compute the diagonal of $G^<$ with linear complexity in $N_y$ block operations. If all the entries of $G$ were required, the computational cost would be $O(N_x^3 N_y^3)$.

Assume that the matrix $A$ is the result of discretizing a partial differential equation in 2D using a local stencil, e.g., with a 5-point stencil. Assume the mesh is the one given in Fig. 1.2.

Figure 1.2: Left: example of 2D mesh to which RGF can be applied. Right: 5-point stencil.

For a 5-point stencil, the matrix $A$ can be written as a block tri-diagonal matrix where the blocks on the diagonal are denoted by $A_q$ ($1 \leq q \leq n$), on the upper diagonal by $B_q$ ($1 \leq q \leq n-1$), and on the lower diagonal by $C_q$ ($2 \leq q \leq n$).
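As a small sketch of this block structure (the grid size, column-by-column ordering, and simple Laplacian values are arbitrary assumptions, not the device matrix), the following assembles a 5-point-stencil matrix and slices out the blocks $A_q$, $B_q$, $C_q$:

```python
import numpy as np
import scipy.sparse as sp

# Assemble a 5-point-stencil matrix on an Nx x Ny grid, ordered column by
# column, and extract its diagonal and off-diagonal blocks.
Nx, Ny = 4, 6
Tx = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(Nx, Nx))    # coupling within one column
coupling = sp.diags([1, 1], [-1, 1], shape=(Ny, Ny))      # which columns are neighbors
A = (sp.kron(sp.identity(Ny), Tx) - sp.kron(coupling, sp.identity(Nx))).tocsr()

def block(i, j):                                          # (i, j) block, 0-based
    return A[i*Nx:(i+1)*Nx, j*Nx:(j+1)*Nx].toarray()

A_q = [block(q, q) for q in range(Ny)]                    # diagonal blocks
B_q = [block(q, q+1) for q in range(Ny-1)]                # upper off-diagonal blocks
C_q = [block(q+1, q) for q in range(Ny-1)]                # lower off-diagonal blocks
print(A.shape, len(A_q), len(B_q), len(C_q))              # (24, 24) 6 5 5
```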

RGF computes the diagonal of $A^{-1}$ by computing recursively two sequences. The first sequence, in increasing order, is defined recursively as [Svizhenko et al., 2002]

$$F_1 = A_1^{-1}$$

$$F_q = \left( A_q - C_q F_{q-1} B_{q-1} \right)^{-1}.$$

The second sequence, in decreasing order, is defined as

$$G_n = F_n$$

$$G_q = F_q + F_q B_q G_{q+1} C_{q+1} F_q.$$

The matrix $G_q$ is in fact the $q$th diagonal block of the inverse matrix $G^r$ of $A$. If we denote by $N_x$ the number of points in the cross-section of the device and by $N_y$ the number along its length ($N = N_x N_y$), the cost of this method can be seen to be $O(N_x^3 N_y)$. Therefore, when $N_x$ is small this is a computationally very attractive approach. The memory requirement is $O(N_x^2 N_y)$.
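A compact sketch of the two recurrences (small random, well-conditioned blocks are used purely for illustration; they are not the device matrix):

```python
import numpy as np

# Sketch of the RGF recurrences for a generic block tridiagonal matrix.
rng = np.random.default_rng(5)
n, m = 8, 3                                    # number of blocks, block size
A_q = [rng.standard_normal((m, m)) + 4 * m * np.eye(m) for _ in range(n)]
B_q = [rng.standard_normal((m, m)) for _ in range(n - 1)]   # upper diagonal blocks
C_q = [rng.standard_normal((m, m)) for _ in range(n - 1)]   # lower diagonal blocks

# Forward sequence: F_1 = A_1^{-1}, F_q = (A_q - C_q F_{q-1} B_{q-1})^{-1}
F = [np.linalg.inv(A_q[0])]
for q in range(1, n):
    F.append(np.linalg.inv(A_q[q] - C_q[q - 1] @ F[q - 1] @ B_q[q - 1]))

# Backward sequence: G_n = F_n, G_q = F_q + F_q B_q G_{q+1} C_{q+1} F_q
G = [None] * n
G[n - 1] = F[n - 1]
for q in range(n - 2, -1, -1):
    G[q] = F[q] + F[q] @ B_q[q] @ G[q + 1] @ C_q[q] @ F[q]

# Check against the dense inverse of the assembled block tridiagonal matrix
A = np.zeros((n * m, n * m))
for q in range(n):
    A[q*m:(q+1)*m, q*m:(q+1)*m] = A_q[q]
    if q < n - 1:
        A[q*m:(q+1)*m, (q+1)*m:(q+2)*m] = B_q[q]
        A[(q+1)*m:(q+2)*m, q*m:(q+1)*m] = C_q[q]
G_dense = np.linalg.inv(A)
print(all(np.allclose(G[q], G_dense[q*m:(q+1)*m, q*m:(q+1)*m]) for q in range(n)))  # True
```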

We first present the algorithm to calculate the three main block diagonals of $G$, where $AG = I$. Then we will discuss how it works for computing $G^< = A^{-1}\Sigma A^{-\dagger}$. Both algorithms consist of a forward recurrence stage and a backward recurrence stage.

In the forward recurrence stage, we denote

$$A^q \stackrel{\text{def}}{=} A_{1:q,\,1:q} \tag{1.16}$$

$$G^q \stackrel{\text{def}}{=} (A^q)^{-1}. \tag{1.17}$$

Proposition 1.1  The following equations can be used to calculate $Z_q := G^q_{qq}$, $1 \leq q \leq n$:

$$Z_1 = A_{11}^{-1}$$

$$Z_q = \left( A_{qq} - A_{q,q-1} Z_{q-1} A_{q-1,q} \right)^{-1}.$$

Sketch of the Proof: Consider the LU factorizations $A^{q-1} = L^{q-1}U^{q-1}$ and $A^q = L^q U^q$. Because of the block-tridiagonal structure of $A$, we can eliminate the $q$th block column of $A$ based on the $(q, q)$ block entry of $A$ after eliminating the $(q-1)$th block column. In equations, we have $U^q_{qq} = A_{qq} - A_{q,q-1}\left( U^{q-1}_{q-1,q-1} \right)^{-1} A_{q-1,q} = A_{qq} - A_{q,q-1} Z_{q-1} A_{q-1,q}$ and then $Z_q = (U^q_{qq})^{-1} = \left( A_{qq} - A_{q,q-1} Z_{q-1} A_{q-1,q} \right)^{-1}$. For more details, please see the full proof in the appendix.

The last element is such that $Z_n = G^n_{nn} = G_{nn}$. From this one we can calculate all the $G_{qq}$ and $G_{q,q\pm1}$. This is described in the next subsection. We now turn to the backward recurrence stage.

Proposition 1.2  The following equations can be used to calculate $G_{qq}$, $1 \leq q \leq n$, based on $Z_q = G^q_{qq}$:

$$G_{nn} = Z_n$$

$$G_{qq} = Z_q + Z_q A_{q,q+1} G_{q+1,q+1} A_{q+1,q} Z_q.$$

Sketch of the Proof: Consider the LU factorization $A = LU$, so that $G_{qq} = (U^{-1}L^{-1})_{qq}$. Since $U^{-1}$ and $L^{-1}$ are, respectively, band-2 block bi-diagonal matrices, we have $G_{qq} = (U^{-1})_{q,q} - (U^{-1})_{q,q+1}(L^{-1})_{q+1,q}$. Noting that $(U^{-1})_{qq} = Z_q$ from the previous section, that $(U^{-1})_{q,q+1} = -Z_q A_{q,q+1} Z_{q+1}$ by backward substitution, and that $(L^{-1})_{q+1,q} = -A_{q+1,q} Z_q$ is the elimination factor when we eliminate the $q$th block column in the previous section, we have $G_{qq} = Z_q + Z_q A_{q,q+1} G_{q+1,q+1} A_{q+1,q} Z_q$.

In the recurrence for the $A G^< A^* = \Sigma$ stage, the idea is the same as the recurrence for computing $G^r$. However, since we have to apply similar operations to $\Sigma$ from both sides, the operations are a lot more complicated. Since the elaboration of the operations and the proofs is tedious, we put it in the appendix.

Chapter 2

Overview of FIND Algorithms

2.1 1-way methods based on LU only

A, we can decompose LU factorizations into partial LU factorizations. As a result, we can perform them independently and reuse the partial results multiple times for different orderings. If we reorder A properly, each of the partial factorizations is quite efficient (similar to the nested dissection procedure to minimize fill-ins [George, 1973]), and many of them turn out to be identical.

More specifically, we partition the mesh into subsets of mesh nodes first and then merge these subsets to form clusters. Fig. 2.1 shows the partitions of the entire mesh into 2, 4, and 8 clusters.

Figure 2.1: Partitions of the entire mesh into clusters

Fig. 2.2 shows the partition tree and the complement tree of the mesh formed by the clusters in Fig. 2.1 and their complements.

Figure 2.2: The partition tree and the complement tree of the mesh

The two cluster trees in Fig. 2.3 (augmented trees) show how two leaf complement clusters $\bar{14}$ and $\bar{15}$ are computed by the FIND algorithm. The path $\bar{3}-\bar{7}-\bar{14}$ in the left tree and the path $\bar{3}-\bar{7}-\bar{15}$ in the right tree are part of the complement-cluster tree; the other subtrees are copies from the basic cluster tree.

At each level of the two trees, two clusters merge into a larger one. Each merge eliminates all the inner nodes of the resulting cluster. For example, the merge in Fig. 2.4 eliminates all the ◦ nodes.


Figure 2.3: Two augmented trees with root clusters 14 and 15.

[Figure legend: remaining nodes; nodes already eliminated; inner nodes (to be eliminated); boundary nodes isolating inner nodes from remaining nodes.]

Figure 2.4: The nodes within the rectangle frame to the left of the dashed line and those to the right form a larger cluster.

Such a merge corresponds to an independent partial LU factorization because there is no connection between the eliminated nodes (◦) and the remaining nodes. This is better seen below in the Gaussian elimination in (2.1) for the merge in Fig. 2.4:

$$\begin{pmatrix} A(\circ,\circ) & A(\circ,\bullet) & 0 \\ A(\bullet,\circ) & A(\bullet,\bullet) & A(\bullet,\square) \\ 0 & A(\square,\bullet) & A(\square,\square) \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} A(\circ,\circ) & A(\circ,\bullet) & 0 \\ 0 & A^*(\bullet,\bullet) & A(\bullet,\square) \\ 0 & A(\square,\bullet) & A(\square,\square) \end{pmatrix},$$

$$\text{where } A^*(\bullet,\bullet) = A(\bullet,\bullet) - A(\bullet,\circ)\,A(\circ,\circ)^{-1}A(\circ,\bullet). \tag{2.1}$$

In (2.1), ◦ and • each correspond to all the ◦ nodes and • nodes in Fig. 2.4, respectively (and $\square$ stands for the remaining nodes). $A^*(\bullet,\bullet)$ is the result of the partial LU factorization. It will be stored for reuse when we pass through the cluster trees.
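The reuse claim can be checked numerically. In the sketch below, the inner ($\circ$), boundary ($\bullet$), and remaining index sets and the diagonally dominant random test matrix are arbitrary stand-ins; the point is that the Schur complement $A^*(\bullet,\bullet)$ of Eq. (2.1) summarizes the eliminated inner nodes exactly:

```python
import numpy as np

# Partial elimination of Eq. (2.1): inner nodes (o) are eliminated, leaving a
# Schur complement on the boundary nodes (b); remaining nodes (r) are untouched.
rng = np.random.default_rng(4)
n_o, n_b, n_r = 5, 3, 4
n = n_o + n_b + n_r
A = rng.standard_normal((n, n)) + n * np.eye(n)
o, b, r = slice(0, n_o), slice(n_o, n_o + n_b), slice(n_o + n_b, n)
A[o, r] = 0.0
A[r, o] = 0.0                      # inner nodes do not touch remaining nodes

A_star_bb = A[b, b] - A[b, o] @ np.linalg.inv(A[o, o]) @ A[o, b]

# The (boundary + remaining) block of A^{-1} is reproduced exactly by the
# reduced system in which A(b, b) is replaced by A*(b, b).
G_full = np.linalg.inv(A)[n_o:, n_o:]
G_red = np.linalg.inv(np.block([[A_star_bb, A[b, r]],
                                [A[r, b],   A[r, r]]]))
print(np.allclose(G_full, G_red))  # True
```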

To make the most of the overlap, we do not perform the eliminations for each augmented tree individually. Instead, we perform the elimination based on the basic cluster tree and the complement-cluster tree in the FIND algorithm. We start by eliminating all the inner nodes of the leaf clusters of the partition tree in Fig. 2.2, followed by eliminating the inner nodes of their parents recursively until we reach the root cluster. This is called the upward pass. Once we have the partial elimination results from the upward pass, we can use them when we pass through the complement-cluster tree level by level, from the root to the leaf clusters. This is called the downward pass.

2.2 2-way methods based on both LU and back substitution

In contrast to the need for only LU factorization in FIND-1way, FIND-2way needstwo phases in computing the inverse: LU factorization phase (called reduction phase— in Chapter 6) and back substitution phase (called production phase in Chapter 6).These two phases are the same as what we need for solving a linear system throughGaussian elimination. In our problem, however, since the given matrix A is sparseand only the diagonal entries of the matrices G r and G < are required in the iterative

process, we may signicantly reduce the computational cost using a number of differ-ent methods. Most of these methods are based on Takahashi’s observation [ Takahashiet al., 1973; Erisman and Tinney , 1975; Niessner and Reichert , 1983; Svizhenko et al.,2002]. In these methods, a block LDU (or LU in some methods) factorization of Ais computed. Simple algebra shows that

\[ G^r = D^{-1}L^{-1} + (I - U)G^r, \tag{2.2} \]
\[ G^r = U^{-1}D^{-1} + G^r(I - L), \tag{2.3} \]

where L, U, and D correspond, respectively, to the lower block unit triangular, upper block unit triangular, and diagonal block factors of the LDU factorization of A. The dense inverse G^r is treated conceptually as having a block structure based on that of A.

For example, consider the first equation and j > i. The block entry g_ij resides above the block diagonal of the matrix and therefore [D^{-1}L^{-1}]_ij = 0. The first equation then gives g_ij = -Σ_{k>i} U_ik g_kj, so the entries of G^r above the block diagonal can be computed from entries in later block rows of the same block column.


Chapter 3

Computing the Inverse

FIND improves on RGF by reducing the computational cost to O(N_x^2 N_y) and the memory to O(N_x N_y log N_x). FIND follows some of the ideas of the nested dissection algorithm [George, 1973]. The mesh is decomposed into 2 subsets, which are further subdivided recursively into smaller subsets. A series of Gaussian eliminations are then performed, first going up the tree and then down, to finally yield entries in the inverse of A. Details are described in Section 3.1.

In this chapter, we focus on the calculation of the diagonal of G^r and will reserve the extension to the diagonal entries of G^< and certain off-diagonal entries of G^r for Chapter 4.

As will be shown below, FIND can be applied to any 2D or 3D device, even though it is most efficient in 2D. The geometry of the device can be arbitrary, as can the boundary conditions. The only requirement is that the matrix A comes from a stencil discretization, i.e., points in the mesh should be connected only with their neighbors. The efficiency degrades with the extent of the stencil, i.e., nearest-neighbor stencil vs. second-nearest-neighbor.

3.1 Brief description of the algorithm

The non-zero entries of a matrix A can be represented using a graph in which each node corresponds to a row or column of the matrix. If an entry A_ij is non-zero,


we create an edge (possibly directed) between nodes i and j. In our case, each row or column in the matrix can be assumed to be associated with a node of a computational mesh. FIND is based on a tree decomposition of this graph. Even though different trees can be used, we will assume that a binary tree is used in which the mesh is first subdivided into 2 sub-meshes (also called clusters of nodes), each sub-mesh is subdivided into 2 sub-meshes, and so on (see Fig. 3.5). For each cluster, we can define three important sets:

Boundary set B: the set of all mesh nodes in the cluster which have a connection with a node outside the cluster.

Inner set I: the set of all mesh nodes in the cluster which do not have a connection with a node outside the cluster.

Adjacent set: the set of all mesh nodes outside the cluster which have a connection with a node inside the cluster. This set is equivalent to the boundary set of the complement of the cluster.

These sets are illustrated in Fig. 3.1, where each node is connected to its nearest neighbor as in a 5-point stencil. This can be generalized to more complex connectivities.

Figure 3.1: Cluster and definition of the sets. The cluster is composed of all the mesh nodes inside the dash line. The triangles form the inner set, the circles the boundary set, and the squares the adjacent set. The crosses are mesh nodes outside the cluster that are not connected to the mesh nodes in the cluster.
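As a concrete illustration of these definitions, the short Python sketch below (our own illustration, not part of the original algorithm description; the function name and data layout are hypothetical) computes the three sets for a given cluster from the nonzero pattern of A:

# Minimal sketch (assumed helper): inner, boundary, and adjacent sets of a
# cluster, given the nonzero pattern of A as an adjacency list
# (dict: node -> set of neighboring nodes).
def cluster_sets(adjacency, cluster):
    cluster = set(cluster)
    boundary = {i for i in cluster
                if any(j not in cluster for j in adjacency[i])}
    inner = cluster - boundary
    adjacent = {j for i in cluster for j in adjacency[i] if j not in cluster}
    return inner, boundary, adjacent

# Example on a 1D chain 0-1-2-3-4 with cluster {1, 2}:
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(cluster_sets(adj, {1, 2}))   # inner = set(), boundary = {1, 2}, adjacent = {0, 3}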


3.1.1 Upward pass

The first stage of the algorithm, or upward pass, consists of eliminating all the inner mesh nodes contained in the tree clusters. We first eliminate all the inner mesh nodes contained in the leaf clusters, then proceed to the next level in the tree. We then keep eliminating the remaining inner mesh nodes level by level in this way. This recursive process is shown in Fig. 3.2.

Notation: C̄ will denote the complement of C (all the mesh nodes not in C). The adjacent set of C is then always the boundary set of C̄. The following notation for matrices will be used:

M = [a_11 a_12 . . . a_1n ; a_21 a_22 . . . a_2n ; . . .],

which denotes a matrix with the vector [a_11 a_12 . . . a_1n] on the first row and the vector [a_21 a_22 . . . a_2n] on the second (and so on for other rows). The same notation is used when a_ij is a matrix. A(U, V) denotes the submatrix of A obtained by extracting the rows (resp. columns) corresponding to mesh nodes in clusters U (resp. V).

We now define the notation U_C. Assume we eliminate all the inner mesh nodes of cluster C from matrix A. Denote the resulting matrix A_{C+} (the notation C+ has a special meaning described in more detail in the proof of correctness of the algorithm) and by B_C the boundary set of C. Then

U_C = A_{C+}(B_C, B_C).

To be completely clear about the algorithm, we describe in more detail how an elimination is performed. Assume we have a matrix formed by 4 blocks A, B, C, and D, where A has p columns:

\[ \begin{bmatrix} A & B \\ C & D \end{bmatrix}. \]

The process of elimination of the first p columns in this matrix consists of computing


Figure 3.2: The top figure is a 4 × 8 cluster with two 4 × 4 child clusters separated by the dash line. The middle figure shows the result of eliminating the inner mesh nodes in the child clusters. The bottom figure shows the result of eliminating the inner mesh nodes in the 4 × 8 cluster. The elimination from the middle row can be re-used to obtain the elimination at the bottom.

an "updated block D" (denoted D̃) given by the formula

\[ \tilde{D} = D - C A^{-1} B. \]

The matrix D̃ can also be obtained by performing Gaussian elimination on [A B; C D] and stopping after p steps.
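As a small numerical illustration (our own sketch with hypothetical names, not from the dissertation), the Schur complement update can be checked against p steps of unpivoted Gaussian elimination:

import numpy as np

def eliminate_leading_block(M, p):
    # "Updated block D" after eliminating the first p columns of M = [A B; C D],
    # i.e. the Schur complement D - C A^{-1} B (a sketch).
    A, B = M[:p, :p], M[:p, p:]
    C, D = M[p:, :p], M[p:, p:]
    return D - C @ np.linalg.solve(A, B)

# Check against p steps of unpivoted Gaussian elimination on a small example.
rng = np.random.default_rng(0)
p, n = 3, 7
M = rng.standard_normal((n, n)) + n * np.eye(n)    # diagonally dominant, so no pivoting needed
G = M.copy()
for k in range(p):                                  # p elimination steps
    G[k+1:, :] -= np.outer(G[k+1:, k] / G[k, k], G[k, :])
assert np.allclose(G[p:, p:], eliminate_leading_block(M, p))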

The pseudo-code for procedure eliminateInnerNodes implements this elimination procedure (see page 27).

3.1.2 Downward pass

The second stage of the algorithm, or downward pass, consists of removing all the mesh nodes that are outside a leaf cluster. This stage re-uses the elimination computed during the first stage. Denote by C1 and C2 the two children of the root cluster (which is the entire mesh). Denote by C11 and C12 the two children of C1. If we re-use the elimination of the inner mesh nodes of C2 and C12, we can efficiently eliminate all the mesh nodes that are outside C11 and do not belong to its adjacent set, i.e., the inner mesh nodes of C̄11. This is illustrated in Fig. 3.3.

The process then continues in a similar fashion down to the leaf clusters. A typical situation is depicted in Fig. 3.4. Once we have eliminated all the inner mesh nodes of C̄, we proceed to its children C1 and C2. Take C1 for example. To remove all the


Procedure eliminateInnerNodes(cluster C). This procedure should be called with the root of the tree: eliminateInnerNodes(root).
Data: tree decomposition of the mesh; the matrix A.
Input: cluster C with n boundary mesh nodes.
Output: all the inner mesh nodes of cluster C are eliminated by the procedure. The n × n matrix U_C with the result of the elimination is saved.

if C is not a leaf then
    C1 = left child of C;
    C2 = right child of C;
    eliminateInnerNodes(C1)  /* The boundary set is denoted B_C1 */;
    eliminateInnerNodes(C2)  /* The boundary set is denoted B_C2 */;
    A_C = [U_C1  A(B_C1, B_C2); A(B_C2, B_C1)  U_C2];
        /* A(B_C1, B_C2) and A(B_C2, B_C1) are values from the original matrix A. */
else
    A_C = A(C, C);
if C is not the root then
    Eliminate from A_C the mesh nodes that are inner nodes of C;
    Save the resulting matrix U_C;
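The following Python sketch mirrors the procedure above in dense arithmetic. It is only an illustration under assumed data structures (a dict tree mapping a cluster id to its pair of children or None, dicts inner and boundary giving each cluster's node index lists, and root for the whole mesh); it is not the dissertation's implementation.

import numpy as np

def eliminate_inner_nodes(A, tree, inner, boundary, cluster, root, U):
    # Upward pass (sketch): store, for every non-root cluster, the pair
    # (ordered boundary node list, U_C) in the dict U.
    if tree[cluster] is None:                         # leaf: entries come from A itself
        nodes = list(inner[cluster]) + list(boundary[cluster])
        A_C = A[np.ix_(nodes, nodes)]
    else:                                             # merge the two children
        c1, c2 = tree[cluster]
        eliminate_inner_nodes(A, tree, inner, boundary, c1, root, U)
        eliminate_inner_nodes(A, tree, inner, boundary, c2, root, U)
        (b1, U1), (b2, U2) = U[c1], U[c2]
        nodes = b1 + b2                               # only the children's boundary nodes survive
        A_C = np.block([[U1,                A[np.ix_(b1, b2)]],
                        [A[np.ix_(b2, b1)], U2               ]])
    if cluster == root:                               # the root needs no elimination
        return
    bset = set(boundary[cluster])
    keep = [k for k, n in enumerate(nodes) if n in bset]      # boundary nodes of C
    drop = [k for k, n in enumerate(nodes) if n not in bset]  # inner nodes to eliminate
    if drop:
        X = np.linalg.solve(A_C[np.ix_(drop, drop)], A_C[np.ix_(drop, keep)])
        UC = A_C[np.ix_(keep, keep)] - A_C[np.ix_(keep, drop)] @ X
    else:
        UC = A_C[np.ix_(keep, keep)]
    U[cluster] = ([nodes[k] for k in keep], UC)

A downward pass written in the same style then supplies, for each leaf cluster, the matrix needed in Eq. (3.1) below.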


Figure 3.3: The first step in the downward pass of the algorithm. Cluster C1 is on the left and cluster C2 on the right. The circles are mesh nodes in cluster C11. The mesh nodes in the adjacent set of C11 are denoted by squares; they are not eliminated at this step. The crosses are mesh nodes that are in the boundary set of either C12 or C2. These nodes need to be eliminated at this step. The dash-dotted line around the figure goes around the entire computational mesh (including mesh nodes in C2 that have already been eliminated). There are no crosses in the top left part of the figure because these nodes are inner nodes of C12.


inner mesh nodes of C̄1, similar to the elimination in the upward pass, we simply need to remove some nodes in the boundary sets of C̄ and C2, because C̄1 = C̄ ∪ C2. The complete algorithm is given in procedure eliminateOuterNodes.

Procedure eliminateOuterNodes(cluster C). This procedure should be called with the root of the tree: eliminateOuterNodes(root).
Data: tree decomposition of the mesh; the matrix A; the upward pass [eliminateInnerNodes()] should have been completed.
Input: cluster C with n adjacent mesh nodes.
Output: all the inner mesh nodes of cluster C̄ are eliminated by the procedure. The n × n matrix U_C̄ with the result of the elimination is saved.

if C is not the root then
    D = parent of C  /* The boundary set of D̄ is denoted B_D̄ */;
    D1 = sibling of C  /* The boundary set of D1 is denoted B_D1 */;
    A_C̄ = [U_D̄  A(B_D̄, B_D1); A(B_D1, B_D̄)  U_D1];
        /* A(B_D̄, B_D1) and A(B_D1, B_D̄) are values from the original matrix A. */
        /* If D is the root, then D̄ = ∅ and A_C̄ = U_D1. */
    Eliminate from A_C̄ the mesh nodes which are inner nodes of C̄; U_C̄ is the resulting matrix;
    Save U_C̄;
if C is not a leaf then
    C1 = left child of C;
    C2 = right child of C;
    eliminateOuterNodes(C1);
    eliminateOuterNodes(C2);
else
    Calculate [A^{-1}](C, C) using (6.8);

For completeness we give the list of subroutines to call to perform the entire calculation:

treeBuild(M)  /* Not described in this dissertation */;
eliminateInnerNodes(root);
eliminateOuterNodes(root);



Figure 3.4: A further step in the downward pass of the algorithm. Cluster C has two children C1 and C2. As previously, the circles are mesh nodes in cluster C1. The mesh nodes in the adjacent set of C1 are denoted by squares; they are not eliminated at this step. The crosses are nodes that need to be eliminated. They belong to either the adjacent set of C or the boundary set of C2.

Once we have reached the leaf clusters, the calculation is almost complete. Take a leaf cluster C. At this stage in the algorithm, we have computed U_C̄. Denote by A_C̄+ the matrix obtained by eliminating all the inner mesh nodes of C̄ (all the nodes except the squares and circles in Fig. 3.4); A_C̄+ contains mesh nodes in the adjacent set of C (i.e., the boundary set of C̄) and the mesh nodes of C:

\[
A_{\bar{C}+} = \begin{bmatrix} U_{\bar{C}} & A(B_{\bar{C}}, C) \\ A(C, B_{\bar{C}}) & A(C, C) \end{bmatrix}.
\]

The diagonal block of A^{-1} corresponding to cluster C is then given by

\[
[A^{-1}](C, C) = \big( A(C, C) - A(C, B_{\bar{C}})\,(U_{\bar{C}})^{-1}\,A(B_{\bar{C}}, C) \big)^{-1}. \tag{3.1}
\]
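In code, this leaf-level step is one solve and one inversion; the sketch below (our own, with hypothetical function and argument names) takes the required blocks explicitly:

import numpy as np

def leaf_diagonal_block(A_CC, A_CB, A_BC, U_bar):
    # Eq. (3.1), a sketch: A_CC = A(C, C), A_CB = A(C, B_Cbar),
    # A_BC = A(B_Cbar, C), U_bar = U_{Cbar} from the downward pass.
    return np.linalg.inv(A_CC - A_CB @ np.linalg.solve(U_bar, A_BC))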

3.1.3 Nested dissection algorithm of A. George et al.

The algorithm in this chapter uses a nested dissection-type approach similar to the nested dissection algorithm of George et al. [George, 1973]. We highlight the similarities and differences to help the reader relate our new approach, FIND, to these earlier methods.


contains the original tree and in addition a tree associated with the complement of each cluster (C̄ in our notation). The reason is that it allows us to present the algorithm in a unified form where the upward and downward passes can be viewed as traversing a single tree: the augmented tree. Even though from an algorithmic standpoint the augmented tree is not needed (perhaps making things more complicated), from a theoretical and formal standpoint, this is actually a natural graph to consider.

3.2.1 The definition and properties of mesh node sets and trees

We start this section by defining M as the set of all nodes in the mesh. If we partition M into subsets C_i, each C_i being a cluster of mesh nodes, we can build a binary tree with its leaf nodes corresponding to these clusters. We denote such a tree by T_0 = {C_i}. The subsets C_i are defined recursively in the following way: let C_1 = M, then partition C_1 into C_2 and C_3, then partition C_2 into C_4 and C_5, C_3 into C_6 and C_7, and partition each C_i into C_{2i} and C_{2i+1}, until C_i reaches the predefined minimum size of the clusters in the tree. In T_0, C_i and C_j are the two children of C_k iff C_i ∪ C_j = C_k and C_i ∩ C_j = ∅, i.e., {C_i, C_j} is a partition of C_k. Fig. 3.5 shows the partitioning of the mesh and the binary tree T_0, where for notational simplicity we use the subscripts of the clusters to stand for the clusters.


Figure 3.5: The mesh and its partitions. C1 = M .
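A minimal sketch of how the tree T_0 could be built for an N_x × N_y mesh is given below (our own helper with hypothetical names, not the dissertation's code); each cluster is bisected across its longer edge until a prescribed minimum size is reached, and the children of cluster i are numbered 2i and 2i+1 as in the text.

import numpy as np

def build_partition_tree(nx, ny, min_size=2):
    # Return {cluster id: flat array of mesh node indices}, with C_1 = M and
    # children 2i, 2i+1 obtained by bisecting across the longer edge (a sketch).
    grid = np.arange(nx * ny).reshape(nx, ny)      # node index = x * ny + y
    clusters = {}

    def split(block, i):
        clusters[i] = block.ravel()
        if block.size <= min_size:
            return
        if block.shape[0] >= block.shape[1]:       # cut across the longer edge
            half = block.shape[0] // 2
            split(block[:half, :], 2 * i)
            split(block[half:, :], 2 * i + 1)
        else:
            half = block.shape[1] // 2
            split(block[:, :half], 2 * i)
            split(block[:, half:], 2 * i + 1)

    split(grid, 1)
    return clusters

clusters = build_partition_tree(4, 8)
print(sorted(clusters)[:7])   # cluster ids 1..7 of the first three levels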


Let C_{-i} = C̄_i = M\C_i. Now we can define an augmented tree T_r^+ with respect to a leaf node C_r ∈ T_0 as T_r^+ = {C_j | C_r ⊆ C_{-j}, j < -3} ∪ (T_0 \ {C_j | C_r ⊆ C_j, j > 0}). Such an augmented tree is constructed to partition C_{-r} in a way similar to T_0, i.e., in T_r^+, C_i and C_j are the two children of C_k iff C_i ∪ C_j = C_k and C_i ∩ C_j = ∅. In addition, since C_2 = C_{-3}, C_3 = C_{-2}, and C_{-1} = ∅, the tree nodes C_{±1}, C_{-2}, and C_{-3} are removed from T_r^+ to avoid redundancy. Two examples of such augmented trees are shown in Fig. 3.6.


Figure 3.6: Examples of augmented trees.

We denote by I_i the inner nodes of cluster C_i and by B_i the boundary nodes as defined in section 5. Then we recursively define the set of private inner nodes of C_i as S_i = I_i \ ∪_{C_j ⊂ C_i} S_j, with S_i = I_i if C_i is a leaf node in T_r^+, where C_i and C_j ∈ T_r^+. Fig. 3.7 shows these subsets for a 4 × 8 cluster.


Figure 3.7: Cluster C has two children C1 and C2. The inner nodes of clusters C1 and C2 are shown using triangles. The private inner nodes of C are shown with crosses. The boundary set of C is shown using circles.
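The recursive definition of the private inner nodes translates directly into a short routine (a sketch under the same assumed data structures as before: tree maps a cluster id to its children or None, inner maps it to its set of inner nodes):

def private_inner_nodes(tree, inner, c, S):
    # Fill S[c] = I_c minus the union of S_j over all strict descendants C_j of
    # C_c, following the recursive definition above; returns the union of S over
    # the whole subtree rooted at c (a sketch).
    children = tree[c]
    if children is None:
        S[c] = set(inner[c])              # leaf: S_C = I_C
        return set(S[c])
    covered = set()
    for child in children:
        covered |= private_inner_nodes(tree, inner, child, S)
    S[c] = set(inner[c]) - covered
    return covered | S[c]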

Now we study the properties of these subsets. To make the main text short and easier to follow, we only list below two important properties without proof. For other properties and their proofs, please see Appendix B.1.

Property 3.1 shows two different ways of looking at the same subset. This change of view happens repeatedly in our algorithm.

Property 3.1 If C_i and C_j are the two children of C_k, then S_k ∪ B_k = B_i ∪ B_j and S_k = (B_i ∪ B_j) \ B_k.

The next property is important in that it shows that the whole mesh can be partitioned into the subsets S_i, B_{-r}, and C_r. Such a property makes it possible to define a consistent ordering.

Property 3.2 For any given augmented tree T_r^+ and all C_i ∈ T_r^+, the sets S_i, B_{-r}, and C_r are all disjoint and M = (∪_{C_i ∈ T_r^+} S_i) ∪ B_{-r} ∪ C_r.

Now consider the ordering of A. For a given submatrix A(U, V), if all the indices corresponding to U appear before the indices corresponding to V, we say U < V. We define a consistent ordering of A with respect to T_r^+ as any ordering such that

1. the indices of nodes in S_i are contiguous;
2. C_i ⊂ C_j implies S_i < S_j; and
3. the indices corresponding to B_{-r} and C_r appear at the end.


Since it is possible that C_i ⊄ C_j and C_j ⊄ C_i, we can see that the consistent ordering of A with respect to T_r^+ is not unique. For example, if C_j and C_k are the two children of C_i in T_r^+, then both orderings S_j < S_k < S_i and S_k < S_j < S_i are consistent. When we discuss the properties of the Gaussian elimination of A, all the properties apply to any consistent ordering, so we do not make a distinction below among different consistent orderings. From now on, we assume all the matrices we deal with have a consistent ordering.

3.2.2 Correctness of the algorithm

The major task of showing the correctness of the algorithm is to prove that the partial eliminations introduced in section 5 are independent of one another, and that they produce the same result for matrices with consistent ordering; therefore, from an algorithmic point of view, these eliminations can be reused.

We now study the properties of the Gaussian elimination for a matrix A with consistent ordering.

Notation: For a given A, the order of the S_i is determined, so we can write the indices of the S_i as i_1, i_2, . . ., etc. For notational convenience, we write ∪_{S_j < S_i} S_j as S_{<i} and (∪_{S_i < S_j} S_j) ∪ B_{-r} ∪ C_r as S_{>i}. If g = i_j, then we denote S_{i_{j+1}} by S_{g+}. If i is the index of the last S_i in the sequence, which is always -r for T_r^+, then S_{i+} stands for B_{-r}. When we perform a Gaussian elimination, we eliminate the columns of A corresponding to the mesh nodes in each S_i from left to right as usual. We do not eliminate the last group of columns that correspond to C_r, which remain unchanged until we compute the diagonal of A^{-1}. Starting from A_{i_1} = A, we define the following intermediate matrices for each g = i_1, i_2, . . . , -r as the results of each step of Gaussian elimination:

A_{g+} = the result of eliminating the S_g columns in A_g.

Since the intermediate matrices depend on the ordering of the matrix A, which depends on T_r^+, we also sometimes denote them explicitly by A_{r,i} to indicate the dependency.


Example. Let us consider Figure 3.6 for cluster 10. In that case, a consistent ordering is S_12, S_13, S_14, S_15, S_6, S_7, S_8, S_9, S_3, S_4, S_{-5}, S_11, S_{-10}, B_{-10}, C_10. The sequence i_j is i_1 = 12, i_2 = 13, i_3 = 14, . . . . Pick i = 15; then S_{<i} = S_12 ∪ S_13 ∪ S_14. Pick i = -5; then S_{>i} = S_11 ∪ S_{-10} ∪ B_{-10} ∪ C_10. For g = i_j = 15, S_{g+} = S_6.

The first theorem in this section shows that the matrix preserves a certain sparsity pattern during the elimination process, such that eliminating the S_i columns only affects the (B_i, B_i) entries. The precise statement of the theorem is listed in the appendix as Theorem B.1 with its proof.

The following two matrices show one step of the elimination, with the right pattern of 0's:

\[
A_i =
\begin{bmatrix}
A_i(S_{<i}, S_{<i}) & A_i(S_{<i}, S_i) & A_i(S_{<i}, B_i) & A_i(S_{<i}, S_{>i}\backslash B_i) \\
0 & A_i(S_i, S_i) & A_i(S_i, B_i) & 0 \\
0 & A_i(B_i, S_i) & A_i(B_i, B_i) & A_i(B_i, S_{>i}\backslash B_i) \\
0 & 0 & A_i(S_{>i}\backslash B_i, B_i) & A_i(S_{>i}\backslash B_i, S_{>i}\backslash B_i)
\end{bmatrix}
\]

\[
A_{i+} =
\begin{bmatrix}
A_i(S_{<i}, S_{<i}) & A_i(S_{<i}, S_i) & A_i(S_{<i}, B_i) & A_i(S_{<i}, S_{>i}\backslash B_i) \\
0 & A_i(S_i, S_i) & A_i(S_i, B_i) & 0 \\
0 & 0 & A_{i+}(B_i, B_i) & A_i(B_i, S_{>i}\backslash B_i) \\
0 & 0 & A_i(S_{>i}\backslash B_i, B_i) & A_i(S_{>i}\backslash B_i, S_{>i}\backslash B_i)
\end{bmatrix}
\]

where A_{i+}(B_i, B_i) = A_i(B_i, B_i) - A_i(B_i, S_i) A_i(S_i, S_i)^{-1} A_i(S_i, B_i).

Note: in the above matrices, A_i(•, B_i) is written as a block. In reality, however, it is usually not a block for any A with consistent ordering.

Because the matrix sparsity pattern is preserved during the elimination process, we know that certain entries remain unchanged. More specifically, Corollaries 3.1 and 3.2 can be used to determine when the entries (B_i ∪ B_j, B_i ∪ B_j) remain unchanged. Corollary 3.3 shows that the entries corresponding to leaf nodes remain unchanged until their elimination. Such properties are important when we compare the elimination process of matrices with different orderings. For proofs of these corollaries, please see Appendix B.1.


Corollary 3.1 If C_i and C_j are the two children of C_k, then A_k(B_i, B_j) = A(B_i, B_j) and A_k(B_j, B_i) = A(B_j, B_i).

Corollary 3.2 If C_i is a child of C_k, then A_k(B_i, B_i) = A_{i+}(B_i, B_i).

Corollaries 3.1 and 3.2 tell us that when we are about to eliminate the mesh nodes in S_k based on B_i and B_j, we can use the entries (B_i, B_j) and (B_j, B_i) from the original matrix A, and the entries (B_i, B_i) and (B_j, B_j) obtained after elimination of S_i and S_j.

Corollary 3.3 If C_i is a leaf node in T_r^+, then A_i(C_i, C_i) = A(C_i, C_i).

Corollary 3.3 tells us that we can use the entries from the original matrix A for leaf clusters (even though other mesh nodes may have already been eliminated at that point).

Based on Theorem B.1 and Corollaries 3.1–3.3, we can compare the partial elimination results of matrices with different orderings.

Theorem 3.1 For any r and s such that C_i ∈ T_r^+ and C_i ∈ T_s^+, we have

A_{r,i}(S_i ∪ B_i, S_i ∪ B_i) = A_{s,i}(S_i ∪ B_i, S_i ∪ B_i).

Proof: If C_i is a leaf node, then by Corollary 3.3, we have A_{r,i}(S_i ∪ B_i, S_i ∪ B_i) = A_{r,i}(C_i, C_i) = A_r(C_i, C_i) = A_s(C_i, C_i) = A_{s,i}(C_i, C_i) = A_{s,i}(S_i ∪ B_i, S_i ∪ B_i).

If the equality holds for i and j such that C_i and C_j are the two children of C_k, then

• by Theorem B.1, we have A_{r,i+}(B_i, B_i) = A_{s,i+}(B_i, B_i) and A_{r,j+}(B_j, B_j) = A_{s,j+}(B_j, B_j);

• by Corollary 3.2, we have A_{r,k}(B_i, B_i) = A_{r,i+}(B_i, B_i) = A_{s,i+}(B_i, B_i) = A_{s,k}(B_i, B_i) and A_{r,k}(B_j, B_j) = A_{r,j+}(B_j, B_j) = A_{s,j+}(B_j, B_j) = A_{s,k}(B_j, B_j);

• by Corollary 3.1, we have A_{r,k}(B_i, B_j) = A_r(B_i, B_j) = A_s(B_i, B_j) = A_{s,k}(B_i, B_j) and A_{r,k}(B_j, B_i) = A_r(B_j, B_i) = A_s(B_j, B_i) = A_{s,k}(B_j, B_i).


Now we have A_{r,k}(B_i ∪ B_j, B_i ∪ B_j) = A_{s,k}(B_i ∪ B_j, B_i ∪ B_j). By Property B.4, we have A_{r,k}(S_k ∪ B_k, S_k ∪ B_k) = A_{s,k}(S_k ∪ B_k, S_k ∪ B_k). By mathematical induction, the theorem is proved.

If we go one step further, based on Theorems B.1 and 3.1, we have the following corollary.

Corollary 3.4 For any r and s such that C_i ∈ T_r^+ and C_i ∈ T_s^+, we have

A_{r,i+}(B_i, B_i) = A_{s,i+}(B_i, B_i).

Theorem 3.1 and Corollary 3.4 show that the partial elimination results are common for matrices with different orderings during the elimination process, which is the key foundation of our algorithm.

3.2.3 The algorithm

Corollary 3.4 shows that A_{•,i+}(B_i, B_i) is the same for all augmented trees, so we can have the following definition for any r:

U_i = A_{r,i+}(B_i, B_i).

By Theorem 3.1, Corollary 3.1, and Corollary 3.2, for all i, j, and k such that C_i and C_j are the two children of C_k, we have

U_k = A_{r,k}(B_k, B_k) - A_{r,k}(B_k, S_k) A_{r,k}(S_k, S_k)^{-1} A_{r,k}(S_k, B_k), (3.2)


trees, we can first compute U_i for all i > 0. The relations among the positive tree nodes are the same in the original tree T_0 and in the augmented trees T_r^+, so we go through T_0 to compute U_i level by level from the bottom up and call it the upward pass. This is done in procedure eliminateInnerNodes of the algorithm in Section 3.1.1.

Once all the positive tree nodes have been computed, we compute U_i for all the negative tree nodes. For these nodes, if C_i and C_j are the two children of C_k in T_0, then C_{-k} and C_j are the two children of C_{-i} in T_r^+. Since C_{-k} is a descendant of C_{-i} in T_r^+ if and only if C_i is a descendant of C_k in T_0, we compute all the U_i for i < 0 by going through T_0 level by level from the top down and call it the downward pass. This is done in the procedure eliminateOuterNodes in Section 3.1.2.

3.3 Complexity analysis

We sketch a derivation of the computational complexity. A more detailed derivation is given in Section 3.3.1. Consider a mesh of size N_x × N_y and let N = N_x N_y be the total number of mesh nodes. For simplicity, we assume N_x = N_y and moreover that N is of the form N = (2^l)^2 and that the leaf clusters in the tree contain only a single node. The cost of operating on a cluster of size 2^p × 2^p, both in the upward and downward passes, is O((2^p)^3), because the size of both the adjacent set and the boundary set is of order 2^p. There are N/(2^p)^2 such clusters at each level, and consequently the cost per level is O(N 2^p). The total cost is therefore simply O(N 2^l) = O(N^{3/2}). This is the same computational cost (in the O sense) as the nested dissection algorithm of George et al. [George, 1973]. It is now apparent that FIND has the same order of computational complexity as a single LU factorization.
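The level-by-level sum behind this estimate can be written out explicitly (this is only a restatement of the argument above in formula form):

\[
\sum_{p=1}^{l} \underbrace{\frac{N}{(2^p)^2}}_{\text{clusters per level}} \cdot \underbrace{O\!\big((2^p)^3\big)}_{\text{cost per cluster}}
= \sum_{p=1}^{l} O\!\big(N\,2^p\big)
= O\!\big(N\,2^l\big)
= O\!\big(N\sqrt{N}\big)
= O\!\big(N^{3/2}\big),
\]

since the geometric series is dominated by its largest (top-level) term and 2^l = √N.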

3.3.1 Running time analysis

In this section, we will analyze the most computationally intensive operations in the algorithm and give the asymptotic behavior of the computational cost. By (3.2),


we have the following cost for computing U_i:

\[ T \approx b_i^2 s_i + s_i^3/3 + b_i s_i^2 \ \text{flops}, \tag{3.4} \]

where both b_i = |B_i| and s_i = |S_i| are of order a for clusters of size a × a. For a square mesh, since the number of clusters in each level is inversely proportional to a^2, the computational cost for each level is proportional to a and forms a geometric series. As a result, the top level dominates the computational cost and the total computational cost is of the same order as the computational cost of the topmost level. We now study the complexity in more detail in the two passes below.

Before we start, we want to emphasize the distinction between a square mesh and an elongated mesh. In both cases, we want to keep all the clusters in the cluster tree as close as possible to square. For a square mesh N_x × N_x, we can keep all the clusters in the cluster tree of size either a × a or a × 2a. For an elongated mesh of size N_x × N_y, where N_y ≫ N_x, we cannot do the same thing. Let the level of clusters each of size N_x × (2N_x) be level L; then all the clusters above level L cannot be of size a × 2a. We will see that the mergings and partitionings for clusters below and above level L have different behaviors.

In the upward pass, as a typical case, a × a clusters merge into a × 2a clusters and then into 2a × 2a clusters. In the first merge, the size of B_k is at most 6a and the size of S_k is at most 2a. The time to compute U_k is T ≤ (6^2 × 2 + 2^3/3 + 2^2 × 6) a^3 ≤ (296/3) a^3 flops and the per-node cost is at most (148/3) a flops, depending on the size of each node. In the second merge, the size of B_k is at most 8a and the size of S_k is at most 4a. The time to compute U_k is T ≤ (8^2 × 4 + 4^3/3 + 4^2 × 8) a^3 ≤ (1216/3) a^3 and the per-node cost is at most (304/3) a. So we have the following level-by-level running time for a mesh of size N_x × N_y with leaf nodes of size a × a for merges of clusters below level L:

\[
\frac{148}{3} N_x N_y a \;\rightarrow\; \frac{304}{3} N_x N_y a \;\rightarrow\; \frac{296}{3} N_x N_y a \;\rightarrow\; \frac{608}{3} N_x N_y a \;\rightarrow\; \cdots,
\]

where the per-level cost roughly doubles at the first arrow, stays roughly the same at the second, doubles again at the third, and so on.

We can see that the cost doubles from merging a × a clusters to merging a × 2a clusters, while it remains the same from merging a × 2a clusters to merging 2a × 2a clusters. This is mostly because the size of S doubles in the first change but remains the same in the second change, as seen in Fig. 3.8. The upper figure there shows the merging of two a × a clusters into an a × 2a cluster and the lower figure corresponds to two a × 2a clusters merging into a 2a × 2a cluster. The arrows point to all the mesh nodes belonging to the set, e.g., B_k is the set of all the boundary nodes of C_k in the top figure. Also note that the actual size of the sets could be a little smaller than the number shown in the figure. We will talk about this at the end of this section.


Figure 3.8: Merging clusters below level L.

For clusters above level L, the computational cost for each merging remains the same because both the size of B and the size of S are 2N_x. The computational cost for each merging is T ≈ (56/3) N_x^3 ≈ 19 N_x^3 flops. This is shown in Fig. 3.9. Since for clusters above level L we have only half as many merges in the parent level compared to the child level, the cost decreases geometrically for levels above level L.


Figure 3.9: Merging rectangular clusters. Two N_x × W clusters merge into an N_x × 2W cluster.

Adding all the computational costs together, the total cost in the upward pass is

\[
T \le 151\,N_x N_y\,(a + 2a + \cdots + N_x/2) + 19\,N_x^3\,\big(N_y/(2N_x) + N_y/(4N_x) + \cdots + 1\big)
\approx 151\,N_x^2 N_y + 19\,N_x^2 N_y = 170\,N_x^2 N_y.
\]

In the downward pass, similar to the upward pass, for clusters above level L, the computational cost for each partitioning remains the same because both the size of B and the size of S are 2N_x. Since the sizes are the same, the computational cost for each partitioning is also the same: T ≈ (56/3) N_x^3 ≈ 19 N_x^3 flops. This is shown in Fig. 3.10.


Figure 3.10: Partitioning of clusters above level L.

For clusters below level L, the cost for each level begins decreasing as we go downward in the cluster tree. When 2a × 2a clusters are partitioned into a × 2a clusters, the size of B_k is at most 6a and the size of S_k is at most 8a. Similar to the analysis for the upward pass, the time to compute U_k is T ≤ 422a × 4a^2. When a × 2a clusters are partitioned into a × a clusters, the size of B_k is at most 4a and the size of S_k is at most 6a. The time to compute U_k is T ≤ 312a × 2a^2.

So we have the following level-by-level running time per node, starting from N_x × N_x down to the leaf clusters:

· · · → 422a → 312a → 211a → 156a,

where each arrow decreases the per-node cost by a factor of roughly 1.35 to 1.48.

We can see that the computational cost changes more smoothly compared to that in the upward pass. This is because the sizes of B and S increase relatively smoothly, as shown in Fig. 3.11.


Figure 3.11: Partitioning of clusters below level L.


The computational cost in the downward pass is

\[
T \le 734\,N_x N_y\,a + 734 \cdot 2 N_x N_y\,a + \cdots + 734\,N_x N_y\,(N_x/2)
+ 19\,N_x^3\,(N_y/2N_x) + 19\,N_x^3\,(N_y/4N_x) + \cdots + 19\,N_x^3
\approx 734\,N_x^2 N_y + 19\,N_x^2 N_y = 753\,N_x^2 N_y \ \text{flops}.
\]

So the total computational cost is

\[
T \approx 170\,N_x^2 N_y + 753\,N_x^2 N_y = 923\,N_x^2 N_y \ \text{flops}.
\]

The cost for the downward pass is significantly larger than that for the upward pass because the sizes of the sets B and S are significantly larger.

In the above analysis, some of our estimates were not very accurate because we did not consider minor costs of the computation. For example, during the upward pass (and similarly during the downward pass):

• when the leaf clusters are not 2 × 2, we need to consider the cost of eliminating the inner mesh nodes of leaf clusters;

• the sizes of B and S are also different for clusters on the boundary of the mesh, where the connectivity of the mesh nodes is different from that of the inner mesh nodes.

Such inaccuracy, however, becomes less and less significant as the size of the mesh becomes bigger and bigger. It does not affect the asymptotic behavior of the running time or the memory cost either.

3.3.2 Memory cost analysis

Since the negative tree nodes are the ascendants of the positive tree nodes in the augmented trees, the downward pass needs U_i for i > 0, which are computed during the upward pass. Since these matrices are not used immediately, we need to store them for each positive node. This is where the major memory cost comes from, and we will only analyze this part of the memory cost.


Let the memory storage for one matrix entry be one unit. Starting from the root cluster until level L, the memory for each cluster is about the same while the number of clusters doubles in each level. Each U_i is of size 2N_x × 2N_x, so the memory cost for each node is 4N_x^2 units and the total cost is

\[ \sum_{i=0}^{\log_2(N_y/N_x)} 2^i \times 4N_x^2 \approx 8 N_x N_y \ \text{units}. \]

Below level L, we maintain the clusters to be of size either a × a or a × 2a by cutting across the longer edge. For a cluster of size a × a, the size of B is 4a and the memory cost for each cluster is 16a^2 units. We have N_y × N_x/(a × a) clusters in each level, so we need 16 N_x N_y units of memory for each level, which is independent of the size of the cluster in each level. For a cluster of size a × 2a, the size of B is 6a and the memory cost for each cluster is 36a^2 units. We have N_y × N_x/(a × 2a) clusters in each level, so we need 18 N_x N_y units of memory for each level, which is independent of the size of the cluster in each level. For simplicity of the estimation of the memory cost, we let 16 ≈ 18.

There are 2 log_2(N_x) levels in this part, so the memory cost needed in this part is about 32 N_x N_y log_2(N_x) units. The total memory cost in the upward pass is thus

8 N_x N_y + 32 N_x N_y log_2(N_x) = 8 N_x N_y (1 + 4 log_2(N_x)) units.

If all the numbers are double precision complex numbers, the cost will be 128 N_x N_y (1 + 4 log_2(N_x)) bytes.

3.3.3 Effect of the null boundary set of the whole mesh

As discussed in Section 3.3.1, the computation cost for a cluster is proportional to the cube of its size, while the number of clusters is inversely proportional to the square of its size. So the computation cost for the top-level clusters dominates. Because of this, the null boundary set of the whole mesh significantly reduces the cost.


Table 3.1 shows the cost of a few top-level clusters. The last two rows are the sum of the cost of the rest of the small clusters. Summing the costs for the two passes together, the total computation cost for the algorithm is roughly 147 N^{3/2} flops with the optimizations discussed in the earlier sections.

Table 3.1: Computation cost estimates for different types of clusters

                upward pass                          downward pass
    s      b      cost    level   number      s      b      cost    level   number
    1      1      1.167     1       2         0      0      0         1       2
    1      1      1.167     2       4         1      1      1.167     2       4
    1/2    3/4    0.255     3       4         3/2    3/4    1.828     3       4
    1/2    5/4    0.568     3       4         1/2    5/4    0.568     3       4
    1/2    1      0.396     4       4         1      1/2    0.542     4       4
    1/2    3/4    0.255     4       8         1/2    3/4    0.255     4       4
    1/2    1/2    0.146     4       4         3/2    3/4    1.828     4       4
    1/4    3/4    0.096     5       8         1      1      1.167     4       4
    1/4    5/8    0.071     5       20        1      3/4    0.823     5       32
    1/4    3/8    0.032     5       4         3/4    1/2    0.305     6       64
    1/4    1/2    0.049     6       64        1      3/4    0.823     7...    <32
    1/8    3/8    0.012     7       128       3/4    1/2    0.305     8...    <64

(For each pass, s and b give the set sizes per cluster, "cost" the cost per cluster, and "level" and "number" the tree level and the number of such clusters.)

3.4 Simulation of device and comparison with RGF

To assess the performance and applicability of FIND and to benchmark it against RGF, we applied these methods to the non-self-consistent calculation of density-of-states and electron density in a realistic non-classical double-gate SOI MOSFET as depicted in Fig. 1.1, with a sub-10 nm gate length, ultra-thin, intrinsic channels and highly doped (degenerate) bulk electrodes. In such transistors, short-channel effects typical for their bulk counterparts are minimized, while the absence of dopants in the channel maximizes the mobility and hence the drive current density. The "active" device consists


of two gate stacks (gate contact and SiO2 gate dielectric) above and below a thin silicon film. The thickness of the silicon film is 5 nm. Using a thicker body reduces the series resistance and the effect of process variation, but it also degrades the short-channel effects. The top and bottom gate insulator thickness is 1 nm, which is expected to be near the scaling limit for SiO2. For the gate contact, a metal gate with tunable work function, φ_G, is assumed, where φ_G is adjusted to 4.4227 to provide a specified off-current value of 4 µA/µm. The background doping of the silicon film is taken to be intrinsic; however, to take into account the diffusion of the dopant ions, the doping profile from the heavily doped S/D extensions to the intrinsic channel is graded with a coefficient g equal to 1 dec/nm. The doping of the S/D regions equals 1 × 10^20 cm^-3. According to the ITRS road map [SIA, 2001], the high-performance logic device would have a physical gate length of L_G = 9 nm in the year 2016. The length L_T is an important design parameter in determining the on-current, while the gate metal work function φ_G directly controls the off-current. The doping gradient g affects both the on-current and the off-current.

Fig. 3.12 shows that the RGF and FIND algorithms produce identical density of states and electron density. The code used in this simulation is nanoFET and is available on the nanoHUB (www.nanohub.org). The nanoHUB is a web-based resource for research, education, and collaboration in nanotechnology; it is an initiative of the NSF-funded Network for Computational Nanotechnology (NCN).

[Figure: electron density (cm^-3) vs. distance (nm) across the source, channel, and drain for RGF and FIND; channel width = 5 nm, Vg = Vd = 0.4 V; the two curves overlap.]

Figure 3.12: Density-of-states (DOS) and electron density plots from RGF and FIND.


Fig. 3.13 shows the comparisons of running time between FIND and RGF. In the left figure, N_x = 100 and N_y ranges from 105 to 5005. In the right figure, N_x = 200 and N_y ranges from 105 to 1005. We can see that FIND shows a considerable speed-up in the right figure when N_x = 200. The running time is linear with respect to N_y in both cases, as predicted in the computational cost analysis. The scaling with respect to N_x is different. It is equal to N_x^3 for RGF and N_x^2 for FIND.

[Figure: running time per energy level (seconds) vs. N_y for FIND and RGF, with N_x = 100 (left) and N_x = 200 (right).]

Figure 3.13: Comparison of the running time of FIND and RGF when N_x is fixed.

Fig. 3.14 shows the comparison of running times between FIND and RGF when N_y is fixed. We can clearly see the speedup of FIND in the figure when N_x increases.

[Figure: running time per energy level (seconds) vs. N_x for FIND and RGF, with N_y = 105 (left) and N_y = 205 (right).]

Figure 3.14: Comparison of the running time of FIND and RGF when N_y is fixed.


3.5 Concluding remarks

We have developed an efficient method of computing the diagonal entries of the retarded Green's function (density of states) and the diagonal of the less-than Green's function (density of charges). The algorithm is exact and uses Gaussian eliminations. A simple extension allows computing off-diagonal entries for current density calculations. This algorithm can be applied to the calculation of any set of entries of A^{-1}, where A is a sparse matrix.

In this chapter, we described the algorithm and proved its correctness. We analyzed its computational and memory costs in detail. Numerical results and comparisons with RGF confirmed the accuracy, stability, and efficiency of FIND.

We considered an application to quantum transport in a nano-transistor. A 2D rectangular nano-transistor discretized with a mesh of size N_x × N_y, N_x < N_y, was chosen. In that case, the cost is O(N_x^2 N_y), which improves over the result of RGF [Svizhenko et al., 2002], which scales as O(N_x^3 N_y). This demonstrates that FIND allows simulating larger and more complex devices using a finer mesh and with a small computational cost. Our algorithm can be generalized to other structures such as nanowires, nanotubes, molecules, and 3D transistors, with arbitrary shapes.

FIND has been incorporated in an open source code, nanoFET [Anantram et al., 2007], which can be found on the nanoHUB web portal (http://www.nanohub.org).


Chapter 4

Extension of FIND

In addition to computing the diagonal entries of the matrix inverse, the algorithm can be extended to computing the diagonal entries and certain off-diagonal entries of G^< = A^{-1} Σ A^{-†} for the electron and current density in our transport problem.

4.1 Computing diagonal entries of another Green's function: G^<

Intuitively, A^{-1} Σ A^{-†} can be computed in the same way as A^{-1} = U^{-1}L^{-1} because we have

G^< = A^{-1} Σ A^{-†}
⇒ A G^< A^† = Σ
⇒ U G^< U^† = L^{-1} Σ L^{-†}
⇒ [G^<]_{nn} = (U_{nn})^{-1} (L^{-1} Σ L^{-†})_{nn} (U_{nn})^{-†}.
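This chain of identities is easy to check numerically on a small dense example (our own sketch, not from the dissertation; A is built directly from random unpivoted triangular factors so that A = LU holds exactly, and the last diagonal entry is compared):

import numpy as np

rng = np.random.default_rng(1)
n = 6
L = np.tril(rng.standard_normal((n, n)), -1) + np.eye(n)      # unit lower triangular
Umat = np.triu(rng.standard_normal((n, n))) + 3 * np.eye(n)   # upper triangular, nonsingular
A = L @ Umat
Sigma = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

Gless = np.linalg.inv(A) @ Sigma @ np.linalg.inv(A).conj().T  # G< = A^{-1} Sigma A^{-dagger}
rhs = np.linalg.inv(L) @ Sigma @ np.linalg.inv(L).conj().T    # L^{-1} Sigma L^{-dagger}
# last diagonal entry: [G<]_{nn} = U_{nn}^{-1} (L^{-1} Sigma L^{-dagger})_{nn} U_{nn}^{-dagger}
assert np.isclose(Gless[-1, -1], rhs[-1, -1] / (Umat[-1, -1] * np.conj(Umat[-1, -1])))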

It remains to show how to compute the last block of L^{-1} Σ L^{-†}. We can see from the following updates that when we operate on the matrix Σ in the same way as we do on matrix A, the pattern of Σ will be similar to that of A during the updates. Similar to the notation and definitions in Chapter 3, starting from Σ_{i_1} = Σ and letting L_g^{-1} be the matrix corresponding to the g-th step of elimination such that A_{g+} = L_g^{-1} A_g, we define Σ_{g+} = L_g^{-1} Σ_g L_g^{-†}. We also write the only non-zero non-diagonal block in L_g^{-1}, [L_g^{-1}](B_g, S_g), as L_g^{-1} for notational simplicity. Now,

\[
\Sigma_g =
\begin{bmatrix}
\Sigma_g(S_{<g}, S_{<g}) & \Sigma_g(S_{<g}, S_g) & \Sigma_g(S_{<g}, B_g) & \Sigma_g(S_{<g}, S_{>g}\backslash B_g) \\
\Sigma_g(S_g, S_{<g}) & \Sigma_g(S_g, S_g) & \Sigma_g(S_g, B_g) & 0 \\
\Sigma_g(B_g, S_{<g}) & \Sigma_g(B_g, S_g) & \Sigma_g(B_g, B_g) & \Sigma_g(B_g, S_{>g}\backslash B_g) \\
\Sigma_g(S_{>g}\backslash B_g, S_{<g}) & 0 & \Sigma_g(S_{>g}\backslash B_g, B_g) & \Sigma_g(S_{>g}\backslash B_g, S_{>g}\backslash B_g)
\end{bmatrix}
\]

\[
\Sigma_{g+} =
\begin{bmatrix}
\Sigma_g(S_{<g}, S_{<g}) & \Sigma_g(S_{<g}, S_g) & \Sigma_{g+}(S_{<g}, B_g) & \Sigma_g(S_{<g}, S_{>g}\backslash B_g) \\
\Sigma_g(S_g, S_{<g}) & \Sigma_g(S_g, S_g) & \Sigma_{g+}(S_g, B_g) & 0 \\
\Sigma_{g+}(B_g, S_{<g}) & \Sigma_{g+}(B_g, S_g) & \Sigma_{g+}(B_g, B_g) & \Sigma_g(B_g, S_{>g}\backslash B_g) \\
\Sigma_g(S_{>g}\backslash B_g, S_{<g}) & 0 & \Sigma_g(S_{>g}\backslash B_g, B_g) & \Sigma_g(S_{>g}\backslash B_g, S_{>g}\backslash B_g)
\end{bmatrix},
\]

where

\[
\Sigma_{g+}(B_g, B_g) = \Sigma_g(B_g, B_g) + L_g^{-1}\,\Sigma_g(S_g, B_g) + \Sigma_g(B_g, S_g)\,L_g^{-\dagger} + L_g^{-1}\,\Sigma_g(S_g, S_g)\,L_g^{-\dagger}.
\]

The difference between the updates on A and the updates on Σ is that some entries in Σ_g(I_g, B_g) and Σ_g(B_g, I_g) (indicated by Σ_{g+}(S_{<g}, B_g), Σ_{g+}(S_g, B_g), Σ_{g+}(B_g, S_{<g}), and Σ_{g+}(B_g, S_g) above) will be changed when we update the columns corresponding to S_g, due to the right multiplication by L_g^{-†}. However, such changes do not affect the result for the final Σ_{NN} because the next update step is independent of the values of Σ_g(I_g, B_g) and Σ_g(B_g, I_g). To show this rigorously, we need to prove theorems similar to Theorem B.1 and Theorem 3.1; we leave the proofs to Appendix B.2.

4.1.1 The algorithm

Now we consider clusters i, j, and k in the augmented tree rooted at cluster r, with C_i and C_j the two children of C_k. By Theorem B.2, Corollary B.4, and Corollary


B.5, we have

\[
\Sigma_{k+}(B_k, B_k) = \Sigma_k(B_k, B_k) + L_k^{-1}\Sigma_k(S_k, B_k) + \Sigma_k(B_k, S_k)L_k^{-\dagger} + L_k^{-1}\Sigma_k(S_k, S_k)L_k^{-\dagger}, \tag{4.1}
\]

where L_k^{-1} = -A_{r,k}(B_k, S_k) A_{r,k}^{-1}(S_k, S_k), A_{r,k}(B_k ∪ S_k, B_k ∪ S_k) follows the same update rule as in the basic FIND algorithm, and Σ_k(S_k ∪ B_k, S_k ∪ B_k) has the following form:

\[
\Sigma_k(B_k, B_k) = \begin{bmatrix} \Sigma_i(B_k \cap B_i, B_k \cap B_i) & \Sigma(B_k \cap B_i, B_k \cap B_j) \\ \Sigma(B_k \cap B_j, B_k \cap B_i) & \Sigma_j(B_k \cap B_j, B_k \cap B_j) \end{bmatrix},
\]
\[
\Sigma_k(S_k, S_k) = \begin{bmatrix} \Sigma_i(S_k \cap B_i, S_k \cap B_i) & \Sigma(S_k \cap B_i, S_k \cap B_j) \\ \Sigma(S_k \cap B_j, S_k \cap B_i) & \Sigma_j(S_k \cap B_j, S_k \cap B_j) \end{bmatrix},
\]
\[
\Sigma_k(B_k, S_k) = \begin{bmatrix} \Sigma_i(B_k \cap B_i, S_k \cap B_i) & 0 \\ 0 & \Sigma_j(B_k \cap B_j, S_k \cap B_j) \end{bmatrix},
\]
\[
\Sigma_k(S_k, B_k) = \begin{bmatrix} \Sigma_i(S_k \cap B_i, B_k \cap B_i) & 0 \\ 0 & \Sigma_j(S_k \cap B_j, B_k \cap B_j) \end{bmatrix}.
\]

Similar to the basic FIND algorithm, if C_k is a leaf node, then by Corollary B.6 we have Σ_{r,k}(B_k ∪ S_k, B_k ∪ S_k) = Σ_r(B_k ∪ S_k, B_k ∪ S_k) (the entries in the original Σ). The recursive process of updating Σ is the same as updating U in the basic FIND algorithm, but the update rule is different here.
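The update rule (4.1) itself is only a few dense operations; the sketch below (our own transcription, with hypothetical argument names for the A and Σ blocks on S_k and B_k) shows one application of it:

import numpy as np

def update_R(A_SS, A_BS, Sig_BB, Sig_BS, Sig_SB, Sig_SS):
    # One application of Eq. (4.1), a sketch: returns Sigma_{k+}(B_k, B_k).
    Linv = -A_BS @ np.linalg.inv(A_SS)        # L_k^{-1} = -A(B_k, S_k) A(S_k, S_k)^{-1}
    return (Sig_BB
            + Linv @ Sig_SB
            + Sig_BS @ Linv.conj().T
            + Linv @ Sig_SS @ Linv.conj().T)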

4.1.2 The pseudocode of the algorithm

The pseudocode of the algorithm is an extension of the basic FIND algorithm. In addition to rearranging the matrix and updating the matrix U, we need to do similar things for the matrix Σ as well. Note that even if we do not need A^{-1}, we still need to keep track of the update of A, i.e., the U matrices. This is because we need A_k(B_k ∪ S_k, B_k ∪ S_k) when we update Σ_k(B_k, B_k).

In the algorithm, we first compute R_C for all the positive nodes in the upward pass, as shown in procedure updateBoundaryNodes. We then compute R_C for all the negative nodes in the downward pass, as shown in procedure updateAdjacentNodes. The whole algorithm is shown in procedure computeG<.


Procedure updateAdjacentNodes(cluster C). This procedure should be called with the root of the tree: updateAdjacentNodes(root).
Data: tree decomposition of the mesh; the matrix A; the upward pass [updateBoundaryNodes()] should have been completed.
Input: cluster C with n adjacent mesh nodes (as the boundary nodes of C̄).
Output: all the outer mesh nodes of cluster C (as the inner nodes of C̄) are eliminated by the procedure. The matrix R_C̄ with the result of the elimination is saved.

if C is not the root then
    D = parent of C  /* The boundary set of D̄ is denoted B_D̄ */;
    D1 = sibling of C  /* The boundary set of D1 is denoted B_D1 */;
    A_C̄ = [U_D̄  A(B_D̄, B_D1); A(B_D1, B_D̄)  U_D1];
        /* A(B_D̄, B_D1) and A(B_D1, B_D̄) are values from the original matrix A. */
        /* If D is the root, then D̄ = ∅ and A_C̄ = U_D1. */
    Rearrange A_C̄ such that the inner nodes of B_D̄ ∪ B_D1 appear first;
    Set S_k in (4.1) equal to the above inner nodes and B_k equal to the rest;
    Eliminate the outer nodes; set U_C̄ equal to the Schur complement of A_C̄;
    Σ_C̄ = [R_D̄  Σ(B_D̄, B_D1); Σ(B_D1, B_D̄)  R_D1];
    Rearrange Σ_C̄ in the same way as for A_C̄;
    Compute R_C̄ based on Eq. (4.1); R_C̄ = R_{k+}(B_k, B_k);
    Save U_C̄ and R_C̄;
if C is not a leaf then
    C1 = left child of C;
    C2 = right child of C;
    updateAdjacentNodes(C1);
    updateAdjacentNodes(C2);

Page 74: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 74/168

56 CHAPTER 4. EXTENSION OF FIND

Procedure computeG<(mesh M). This procedure should be called by any function that needs G^< of the whole mesh M.
Data: the mesh M; the matrix A; enough memory should be available for temporary storage.
Input: the mesh M; the matrix A.
Output: the diagonal entries of G^<.

Prepare the tree decomposition of the whole mesh;
for each leaf node C do
    Compute [A^{-1}](C, C) using Eq. (6.8);
    Compute R_C based on Eq. (7);
    Save R_C together with its indices;
Collect the R_C with their indices and output the diagonal entries of R;

4.1.3 Computation and storage cost

Similar to the computation cost analysis in Chapter 3, the major computation cost comes from computing (4.1).

The cost depends on the sizes of S_k and B_k. Let s = |S_k| and b = |B_k| be the sizes; then the computation cost for L_k^{-1} R_k(S_k, B_k) is (s^2 b + s^3/3 + s b^2) flops. Since L_k^{-1} is already given by (3.2) and computed in procedures eliminateInnerNodes (page 27) and eliminateOuterNodes (page 28), the cost is reduced to s b^2. Similarly, the cost for R_k(B_k, S_k) L_k^{-†} is also s b^2, and the cost for L_k^{-1} R_k(S_k, S_k) L_k^{-†} is s b^2 + s^2 b. So the total cost for (4.1) is (3 s b^2 + s^2 b) flops. Note that these cost estimates need explicit forms of A(B, S) A(S, S)^{-1}, and we need to arrange the order of computation to have it as an intermediate result in computing (3.2).

Now we consider the whole process. We analyze the cost in the two passes.

For the upward pass, when two a × a clusters merge into one a × 2a cluster, b ≤ 6a and s ≤ 2a, so the cost is at most 360a^3 flops; when two a × 2a clusters merge into one 2a × 2a cluster, b ≤ 8a and s ≤ 4a, so the cost is at most 896a^3 flops. We have N_xN_y/(2a^2) a × 2a clusters and N_xN_y/(4a^2) 2a × 2a clusters, so the cost for these two levels is (360a^3 · N_xN_y/(2a^2) + 896a^3 · N_xN_y/(4a^2) =) 404 N_xN_y a. Summing all the levels together up to a = N_x/2, the computation cost for the upward pass is 404 N_x^2 N_y.


For the downward pass, when one 2a × 2a cluster is partitioned into two a × 2a clusters, b ≤ 6a and s ≤ 8a, so the cost is at most 1248a^3 flops; when one a × 2a cluster is partitioned into two a × a clusters, b ≤ 4a and s ≤ 6a, so the cost is at most 432a^3 flops. We have N_xN_y/(2a^2) a × 2a clusters and N_xN_y/a^2 a × a clusters, so the cost for these two levels is (1248a^3 · N_xN_y/(2a^2) + 432a^3 · N_xN_y/a^2 =) 1056 N_xN_y a. Summing all the levels together up to a = N_x/2, the computation cost for the downward pass is at most 1056 N_x^2 N_y.

4.2 Computing off-diagonal entries of G^r and G^<

In addition to computing the diagonal entries of G^r and G^<, the algorithm can be easily extended to computing the entries in G^r and G^< corresponding to neighboring nodes. These entries are often of interest in simulation. For example, the current density can be computed via the entries corresponding to the horizontally neighboring nodes.

With a little more computation, these entries can be obtained at the same time as the algorithm computes the diagonal entries of the inverse. Fig. 4.1 shows the last step of elimination and the inverse computation for the solid circle nodes.


Figure 4.1: Last step of elimination

The entries of A^{-1} corresponding to nodes in C can be computed by the following


equation:

\[
[A^{-1}](C, C) = \big[ A(C, C) - A(C, B_{\bar{C}})\,(U_{\bar{C}})^{-1}\,A(B_{\bar{C}}, C) \big]^{-1}. \tag{4.2}
\]

[A−1](B C , C ) = −(U C )−1A (B C , C )[A−1](C , C ) (4.3)

and[A−1](C , B C ) = −[A−1](C , C )A (C , B C )(U C )−1. (4.4)

This could also be generalized to pairs of nodes at a distance from each other.
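The three formulas translate directly into a few dense operations; the sketch below (hypothetical argument names; U_bar stands for U_C̄) returns the three blocks of A^{-1} touching a leaf cluster C:

import numpy as np

def leaf_blocks_of_inverse(A_CC, A_CB, A_BC, U_bar):
    # Eqs. (4.2)-(4.4), a sketch.
    G_CC = np.linalg.inv(A_CC - A_CB @ np.linalg.solve(U_bar, A_BC))  # (4.2)
    G_BC = -np.linalg.solve(U_bar, A_BC @ G_CC)                       # (4.3)
    G_CB = -G_CC @ A_CB @ np.linalg.inv(U_bar)                        # (4.4)
    return G_CC, G_BC, G_CB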


Chapter 5

Optimization of FIND

for G^<, where

\[
L^{-1} = -A(B, S)\,A(S, S)^{-1}. \tag{5.3}
\]

In Chapter 3, we gave the computation cost

13

s3 + s2b + sb 2 (5.4)

for (5.1) and s2b + 3 sb 2 (5.5)

for (5.2), where s and b are the size of sets S and B , respectively. These costs as-sume that all matrices in ( 5.1) and (5.2) are general dense matrices. These matrices,however, are themselves sparse for a typical 2D mesh. In addition, due to the char-acteristics of the physical problem in many real applications, the matrices A and Σoften have special properties [Svizhenko et al., 2002]. Such sparsities and propertieswill not change the order of cost, but may reduce the constant factor of the cost, thus

achieving signicant speed-up. With proper optimization, our FIND algorithm canexceed other algorithms on a much smaller mesh.

In this chapter, we rst exploit the extra sparsity of the matrices in ( 3.2) and(4.1) to reduce the constant factor in the computation and the storage cost. Wethen consider the symmetry and positive deniteness of the given matrix A for fur-ther performance improvement. We also apply these optimization techniques to G<

and current density when Σ has similar properties. Finally, we briey discuss thesingularity of A and how to deal with it efficiently.

1

Note that some entries of A (•, •) and Σ (•, •) have been updated from the blocks in the originalA and Σ . See Section 3.2.3 for details.

Page 79: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 79/168

5.2. OPTIMIZATION FOR EXTRA SPARSITY IN A 61

5.2 Optimization for extra sparsity in A

Because of the local connectivity of 2D mesh in our approach ,2 the matricesA (B , S ), A(S , S ), and A(S , B ) in (5.1) are not dense. Such sparsity will not reducethe storage cost because U in (5.1) is still dense, which is the major component of the storage cost. Due to the sparsity of these matrices, however, the computationcost can be signicantly reduced. To clearly observe the sparsity and exploit it, it isconvenient to consider these matrices in blocks.

5.2.1 Schematic description of the extra sparsityAs shown earlier in Fig. 3.2, we eliminate the private inner nodes when we merge

two child clusters in the tree into their parent cluster. Here in Figs. 5.1 and 5.2,we distinguish between the left cluster and the right cluster in the merge and itscorresponding eliminations in matrix form to illustrate the extra sparsity.

Figure 5.1: Two clusters merge into one larger cluster in the upward pass. The ×nodes have already been eliminated. The ◦and nodes remain to be eliminated.

In Fig. 5.1, the three hollow circle nodes and the three hollow square nodes areprivate inner nodes of the parent cluster that originate from the left child clusterand the right child cluster, respectively. When we merge the two child clusters,these nodes will be eliminated. Fig. 5.2 illustrates how we construct a matrix based

2 The approach works for 3D mesh as well.

Page 80: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 80/168

62 CHAPTER 5. OPTIMIZATION OF FIND

on the original matrix A and the results of earlier eliminations for the elimination 3

corresponding to the merge in Fig. 5.1.

+ ⇒

Figure 5.2: Data ow from early eliminations for the left child and the right child tothe elimination for the parent cluster

In Fig. 5.2, each row (column) of small blocks corresponds to a node in Fig. 5.1.The blocks with blue and red shading correspond to the boundary nodes of the twochild clusters in Fig. 5.1 (the circle and square nodes): those in blue correspond tothe left cluster and those in red correspond to the right cluster in Fig. 5.1 (the hollowand solid square nodes). They are rearranged to eliminate the private inner nodes

(the six hollow nodes). We can also write the resulting matrix in Fig. 5.2 as

U (◦,◦) A(◦, ) U (◦,•) 0A ( ,◦) U ( , ) 0 U ( , )U (•,◦) 0 U (•,•) A(•, )

0 U (◦,•) A( ,•) U ( , )

. (5.6)

In (5.6), ◦, , •, and correspond to all the ◦nodes, nodes, •nodes, and nodes in Fig. 5.2, respectively. The U blocks originate from the results of earlier

eliminations. They are typically dense, since after the elimination of the inner nodes,the remaining nodes become fully connected. The A blocks, however, originate fromthe original A , and are often sparse. For example, because each ◦node is connectedwith only one node in Fig. 5.1, A (◦, ) and A ( ,◦) are both diagonal. A(◦, ),

3 Here elimination refers to the partial elimination in Chapter 3.

Page 81: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 81/168

5.2. OPTIMIZATION FOR EXTRA SPARSITY IN A 63

A ( ,

◦), A( ,

•), and A (

•, ) are not shown in ( 5.6) because they are all 0. Note

that A (•, ) and A ( ,•) are almost all 0 except for a few entries, but such sparsitysaves little because they are only involved in oating-point addition operations. Wewill not exploit such sparsity.

The elimination shown here is one of the steps we discussed in Section 3.2.3, withthe left cluster labeled as i, the right one as j , and the parent cluster labeled as k.Below is the list of node types used in this section and the corresponding sets usedin Section 3.2.3:

◦ : S

L = S

k ∩B

i (private inner nodes from the left cluster) : SR = Sk ∩B j (private inner nodes from the right cluster)

• : B L = B k ∩B i (boundary nodes from the left cluster)

: B R = B k ∩B j (boundary nodes from the rightcluster)

The sparsity of the matrices in ( 3.2) can therefore be demonstrated as follows:

• Ar,k (B k

∩B i , S k

∩B j ) = A r,k (S k

∩B j , B k

∩B i) = 0,

• Ar,k (B k ∩B j , S k ∩B i) = A r,k (S k ∩B i , B k ∩B j ) = 0,

• Ar,k (S k ∩B i , S k ∩B j ) and A r,k (S k ∩B j , S k ∩B i) are diagonal.

Figure 5.3: Two clusters merge into one larger cluster in the downward pass.

Page 82: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 82/168

64 CHAPTER 5. OPTIMIZATION OF FIND

These properties always hold for the upward pass as shown in Figs. 5.1 and Fig. 5.2.They hold for the downward pass as well but with a small number of exceptions,illustrated by the shaded nodes in Fig. 5.3 and by the green nodes and yellow blocksin Fig. 5.4. Since they only appear at the corners, we can ignore them when weanalyze the computation and storage cost. Note that such exceptions need specialtreatment in implementation.

Figure 5.4: Two clusters merge into one larger cluster in the downward pass.

Because typical matrix computation libraries only consider banded matrices andblock-diagonal matrices, we utilize special techniques for Ar,k (S k , S k) (illustrated inFig. 5.2 as the upper left 6 ×6 little blocks) to take full advantage of the sparsity.Sections 5.2.2 and 5.2.3 will discuss this procedure in detail.

5.2.2 Preliminary analysis

Before proceeding to further analysis, we list some facts about the cost of the basicmatrix computation used in Section 5.2.3. By simple counting, we know that for n×nfull matrices A and B, there are the following costs (in oating-point multiplications):

Page 83: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 83/168

5.2. OPTIMIZATION FOR EXTRA SPARSITY IN A 65

• A ⇒ LU : n3/ 3

• L ⇒ L−1: n3/ 6

• L, B ⇒ L−1B: n3/ 2

• U, L−1B ⇒ A−1B: n3/ 2

Adding them together, computing A−1 requires n3 and computing A−1B requires43 n3. Note that the order of computation is often important, e.g., if we computeA−1B = ( U −1L−1)B , then the cost is ( n3 + n3 =) 2 n3, whereas computing U −1(L−1B)

only requires 4

3 n3

.For simple matrix multiplication, if both matrices are full, then the cost is n3. If one of them is a diagonal matrix, then the cost is reduced to O(n2). In addition, fora general 2 ×2 block matrix, we have the following forms of the inverse: 4

A BC D

−1

= A−1 −A−1B D−1

−D−1C A−1 D−1 (5.7a)

=A−1 + A−1B D−1CA−1 −A−1B D−1

−D−1CA−1 D−1

(5.7b)

=A B0 D

−1 I 0

CA−1 I

−1

, (5.7c)

where A = A −BD −1C and D = D −CA−1B are the Schur complements of D andA, respectively. In ( 5.7a) and (5.7b), the two off-diagonal blocks of the inverse arethe results of back substitution. In ( 5.7a), they are independently based on A−1 andD−1, which is referred to as one-way back substitution ; whereas in (5.7b), both arebased on D−1, which is called a two-way back substitution . Eq. (5.7c) is based on theblock LU factorization of A (S , S ), which is called block LU inverse .

For multiplication of 2 ×2 block matrices with block size n ×n, if A and B areboth full, the cost is 8 n3. If one of them is a 2×2 block-diagonal matrix, then thecost is reduced to 4n3.

4 We assume A and D to be nonsingular.

Page 84: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 84/168

66 CHAPTER 5. OPTIMIZATION OF FIND

5.2.3 Exploiting the extra sparsity in a block structure

In this section, we discuss several ways of handling A (S , S ) to exploit the sparsityshown in Section 5.2.1. They are not proved to be optimal. Depending on thecomputer architecture and the implementation of our methods, there are quite likelyother techniques to achieve better speed-up. The discussion below demonstratesthe speed-up achieved through simple techniques. For notational simplicity, in this

section we write A(S , S ) asA BC D , with A = A(S L , S L ), B = A(S L , S R ), C =

A (S R , S L ), and D = A(S R , S R ); A(S , B ) as W X Y Z

, with W = A(S L , B L ), X =

A (S L , B R ), Y = A (S R , B L ), and Z = A (S R , B R ); and A(B , S ) as P QR S

, with P =

A (B L , S L ), Q = A (B L , S R ), R = A (B R , S L ), and S = A (B R , S R ). Since the oating-point multiplication operation is much more expensive than addition operations, weignore the addition with A (B , B ).

The rst method is based on ( 5.7). It computes A(S , S )−1 through the Schur

complement of the two diagonal blocks of A(S , S ). Multiplying ( 5.7) by W X Y Z

,

from Table 5.1 we look at the result at the block level with the proper arrangement

of the computation order to minimize the computation cost. This table also lists therequired operations that are elaborated in Table 5.2.

Table 5.1: Matrix blocks and their corresponding operationsBlock Expression Operations

(1, 1) A−1W −A−1B D−1Y (5) (12)(1, 2) A−1X −A−1B D−1Z (9) (8)(2, 1) −D−1C A−1W + D−1Y (7) (10)(2, 2) D−1C A−1X + D−1Z (11) (6)

In Table 5.2, we list the operations required in Table 5.1 in sequence. An operationwith lower sequence numbers must be performed rst to reuse the results because theoperations depend on one another.

Since the clusters are typically equally split, we can let m = |S L | = |S R | = |S |/ 2

Page 85: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 85/168

Page 86: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 86/168

68 CHAPTER 5. OPTIMIZATION OF FIND

reduction 312 a3

→ 192a3 (62%) and 2528

3

→ 1520

3 a3 (60%), respectively. Note that inthe downward pass, |B L | = |B R |, but in terms of the computation cost estimate, it isequivalent to assuming that they are both s/ 2.

The second method is based on ( 5.7b). Like the rst method, it does not computeA (S , S )−1 explicitly. Instead, it computes A(S , S )−1A (S , B ) as

A BC D

−1W 00 Z

=A−1W + A−1B D−1CA−1W −A−1B D−1Z

−D−1CA−1W D−1Z , (5.8)

where ˜D = D −CA−

1

B, B and C are diagonal, and X = Y = 0. Table 5.3 showsthe required operations with the total cost 4

3 m3 + 5 m2n + 4mn 2. Compared to theprevious method, the cost here is smaller if m > 3

4 n. Therefore, this method is betterthan the rst method for the downward pass where typically ( m, n ) is (3a, 2a) and(4a, 2a). In such cases, the costs are 174 a3 (56%) and 1408

3 a3 (56%), respectively.

Table 5.3: The cost of operations and their dependence in the second method. Thecosts are in ops.

Operation Cost Seq No. Dependence

A−1 m3 (1) n/aD 0 (2) (1)

D−1Z m 3

3 + m2n (3) (2)

−(A−1B)( D−1Z ) m2n (4) (1)(3)A−1W m2n (5) (1)

−D−1(CA−1W ) m2n (6) (3)(5)(A−1W ) + ( A−1B)( D−1CA−1W ) m2n (7) (1)(5)(6)

A (S , B )A (S , S )−1A (S , B ) 4mn 2 (8) (7)Total 4

3 m3 + 5 m2n + 4 mn 2 n/a n/a

Note that if we compute the second block column of A BC D

−1 W 00 Z

and

D C B A

−1 Z 00 W , then we obtain the same form as in ( 5.8).

The third method is based on ( 5.7c). The two factors there will multiply A (B , S )

Page 87: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 87/168

5.2. OPTIMIZATION FOR EXTRA SPARSITY IN A 69

and A (S , B ) separately:

A (B , S )A (S , S )−1A (S , B )

=P 00 S

A B0 D

−1 I 0

CA−1 I

−1W 00 Z

=P 00 S

A−1 −A−1B D−1

0 D−1

I 0

−CA−1 I W 00 Z

=PA−1 −P A−1B D−1

0 S D−1

W 0

−CA−1

W Z . (5.9)

Table 5.4 shows the operations with the total cost 43 m3 + 4 m2n + 5mn 2. We

compute the explicit form of A−1 there because we need CA−1B to compute D andboth B and C are diagonal. S D−1 is computed by rst LU factorizing D and thensolving for S D−1. The LU form of D will be used when computing −(P A−1B) D−1

as well. The cost for (5.9) results from the multiplication of a block upper triangularmatrix and a block lower triangular matrix.

Table 5.4: The cost of operations in ops and their dependence in the third methodOperation Cost Seq. No. Dependence

A−1 m3 (1) n/aD 0 (2) (1)

D−1Z m 3

3 + m2n (3) (2)

−(A−1B)( D−1Z ) m2n (4) (1)(3)A−1W m2n (5) (1)

−D−1(CA−1W ) m2n (6) (3)(5)(A−1W ) + ( A−1B)( D−1CA−1W ) m2n (7) (1)(5)(6)

A (S

,B

)A (S

,S

)−1

A (S

,B

) 4mn2

(8) (7)Total 43 m3 + 5 m2n + 4 mn 2 n/a n/a

Finally, our fourth method is more straightforward. It considers A(B , S ) andA (S , B ) as block-diagonal matrices but considers A(S , S ) as a full matrix withoutexploiting its sparsity. Table 5.5 shows the operations with the total cost. Note that

Page 88: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 88/168

70 CHAPTER 5. OPTIMIZATION OF FIND

the cost for computing the second column of L−1A (S , B ) is reduced because A(S , B )is block diagonal.

Table 5.5: The cost of operations in ops and their dependence in the fourth method.The size of A, B , C , and D is m ×m; the size of W , X , Y , and Z is m ×n

Operation Cost

LU for A (S , S ) 83 m3

L−1A (S , B ): 1st column 2m2nL−1A (S , B ): 2nd column 1

2 m2n

U −1

[L−1

A (S

,B

)] 4m2

nA (S , B )A (S , S )−1A (S , B ) 4mn 2

Total 83 m3 + 13

2 m2n + 4 mn 2

Table 5.6 summarizes the optimization techniques for extra sparsity and the cor-responding costs:

Table 5.6: Summary of the optimization methodsMethod Cost Reduction in four cases 5 Parallelism

2-way back substitution 83 m3 + 4 m2n + 4 mn 2 51.4, 52.6, 61.5, 60.1 good1-way back substitution 4

3 m3 + 5 m2n + 4 mn 2 53.0, 54.0, 55.8, 55.7 littleblock LU inverse 4

3 m3 + 4 m2n + 5 mn 2 59.1, 57.9, 53.9, 54.3 somenaıve LU 8

3 m3 + 132 m2n + 4 mn 2 59.0, 62.5, 76.0, 74.4 little

Note again that these methods are not exhaustive. There are many variations of these methods with different forms of A(S , S )−1, different computation orders, anddifferent parts of the computation to share among the operations. For example, ( 5.10)

5 The percentage of the cost given by ( 5.4).

Page 89: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 89/168

Page 90: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 90/168

72 CHAPTER 5. OPTIMIZATION OF FIND

5.3 Optimization for symmetry and positive de-

nitenessIn real problems, A is often symmetric (or Hermitian if the problems are complex,

which can be treated in a similar way) [ Svizhenko et al. , 2002]. So it is worthwhileto consider special treatment for such matrices to reduce both computation costand storage cost. Note that this reduction is independent of the extra sparsity inSection 5.2.

5.3.1 The symmetry and positive deniteness of dense ma-trix A

If all the given matrices are symmetric, it is reasonable to expect the eliminationresults to be symmetric as well, because our update rule ( 5.1) does not break matrixsymmetry. This is shown in the following property of symmetry preservation .

Property 5.1 (symmetry preservation) In (5.1), U, A (B , B ), and A(S , S ) are all symmetric; A(B , S ) and A(S , B ) are transpose of each other.

Proof : This property holds for all the leaf clusters by the symmetry of the originalmatrix A. For any node in the cluster tree, if the property holds for its two childclusters, then by ( 3.3), the property holds for the parent node as well.

In addition, A is often positive denite. This property is also preserved during theelimination process in our algorithm, as shown in the following property of positive-deniteness preservation .

Property 5.2 (positive-deniteness preservation) If the original matrix A is

positive-denite, then the matrix U in (5.1) is always positive-denite.Proof : Write n × n symmetric positive denite matrix A as a 11 zT

z A 11 and let

A 11def = A 11 −zz T /a 11 be the Schur complement of a11 . Let A (0) def = A, A (1) def = A (0)

11 ,. . . , A(n−1) def = A (n−2)

11 . Note that since A is positive denite, A(i)11 = 0 and A i+1 is

well dened for all i = 0, . . . , n −2. By denition, A i+1 are also all symmetric.

Page 91: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 91/168

5.3. OPTIMIZATION FOR SYMMETRY AND POSITIVE DEFINITENESS 73

Now we show that A(1) , . . . , A(n−1) , are all positive denite. Given any ( n −1)-

vector y = 0, let x = −zT ya 11 y

T

= 0. Since A is symmetric positive denite, wehave

0 < x T Ax

= −z T y/a 11 yT a11 z T

z A 11

−z T y/a 11

y

= 0 −yT zz T /a 11 + yT A 11 −z T y/a 11

y

= −yT

zz T

y/a 11 + yT ¯

A11 y= yT (A 11 −zz T /a 11 )y

= yT A 11y.

As a result, A(1) is positive denite as well. Repeating this process, we have A(2) ,. . . , A (n−1) all positive denite.

Since any principal submatrix of a positive denite matrix is also positive de-nite [G. H. Golub, 1996a], every A(S k , S k) in (3.2) is also positive denite.

For the symmetry preservation property and the positive deniteness preservationproperty, we can write the last term in ( 5.1) as

A (B , S )A (S , S )−1A (S , B ) = ( G−1S A(S , B ))T (G−1

S A(S , B )),

where A (S , S ) = GS GT S . The Cholesky factorization has cost 1

6s3. The solver has cost

12

s2b , and the nal multiplication has cost 12

sb 2. The cost for (5.1) is then reduced byhalf from 1

3s3 + s2b + sb 2 to 1

6s3 + 1

2s2b + 1

2sb 2.

Note that even if A is not positive denite, by the symmetry preservation property

alone, we can still write the last term of ( 5.1) as

A (B , S )A (S , S )−1A (S , B ) = ( L−1S A(S , B ))T D−1

S (L−1S A(S , B )),

where A (S , S ) = LS D S LT S .

Page 92: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 92/168

74 CHAPTER 5. OPTIMIZATION OF FIND

Similarly, the computation cost is reduced by half. However, this method is subjectto large errors due to small pivots, since the straightforward symmetric pivoting doesnot always work [G. H. Golub, 1996b]. The diagonal pivoting method [Bunch andParlett , 1971; Bunch and Kaufman, 1977] can be used to solve such a problem, butit is more expensive in both implementation and computation. We will discuss thisagain in more detail in a future article.

5.3.2 Optimization combined with the extra sparsity

For symmetric and positive denite A , we can take advantage of the extra sparsitydiscussed in Section 5.2 as well. Consider the following block Cholesky factorizationof A (S , S ):

A (S , S ) =A 11 A12

A T 12 A22

= G1 0AT

12G−T 1

˜G2

GT 1 G−1

1 A12

0 ˜GT 2

= GS GT S , (5.11)

where A 11 = G1GT 1 , A 22 = A 22 −A T

12A−111 A 12 , and A 22 = ˜G2 ˜GT

2 . Now, we have

G−1S A(S , B ) = G1 0

AT 12G−T

1 ˜G2

A i(S k ∩B i , B k ∩B i) 00 A j (S k ∩B j , B k ∩B j )

= G−11 A(S L , B L ) 0

−˜G−12 AT

12G−T 1 G−1

1 A(S L , B L ) ˜G−12 A(S R , B R )

. (5.12)

Finally, we compute

A (B , S )A (S , S )−1A (S , B ) = ( G−1S A(S , B ))T (G−1

S A(S , B )). (5.13)

Table 5.8 gives the operations with the total cost 23 m3 + 2 m2n + 5

2 mn 2. Note that thecosts for G−1

1 A12 and A 22 are both 16 m3 because A 12 is diagonal and A 22 is symmetric.

Compared to the original cost with a block structure, the cost is reduced by morethan half when m = n (reduced by exactly half when m = √ 3

2 n). Accidentally, thecost with both optimization (for extra sparsity and symmetry) is 25% of the original

Page 93: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 93/168

Page 94: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 94/168

76 CHAPTER 5. OPTIMIZATION OF FIND

5.4 Optimization for computing G < and current

densityWe can also reduce the cost for computing G< by optimizing the procedure for

(5.14), which is copied from (5.2):

R = Σ (B , B ) + L−1Σ (S , B ) + Σ (B , S )L−† + L−1Σ (S , S )L−†. (5.14)

The computation is by default based on the computation for G .

5.4.1 G < sparsity

Since Σ (S , B ) and Σ (B , S ) are block diagonal, the cost for the second and thethird term in ( 5.14) is reduced by half. Similar to the structure of A(S , S ) in (5.1),the blocks Σ (S k ∩B i , S k ∩B j ) and Σ (S k ∩B j , S k ∩B i) in Σ (S k , S k) are diagonal. If we use block matrix multiplication for L−1Σ (S , S ), then these blocks can be ignoredbecause multiplication with them are level-2 operations. As a result, the cost for thefourth term L−1Σ (S , S )L−† in (5.14) is reduced to 1

2s2b + sb 2 (or s 2b + 1

2sb 2, depending

on the order of computation). So the total cost for ( 5.14) becomes 12 s

2b + 2 sb

2

(ors b + 3

2sb 2). Compared with the cost without optimization for sparsity, it is reduced

by 37.5%.In addition, Σ is often diagonal or 3-diagonal in real problems. As a result,

Σ (S k ∩B i , S k ∩B j ) and Σ (S k ∩B j , S k ∩B i) become zero and Σ k(S k , S k) becomes blockdiagonal. This may lead to further reduction. Please see the appendix for detailedanalysis.

5.4.2 G<

symmetryFor symmetric Σ , we have a similar property of symmetry (and positive denite-

ness) preservation: Σ k+ (B k , B k), Σ k(B k , B k), and Σ k(S k , S k) in (4.1) are all symmet-ric. With such symmetry, we can perform the Cholesky factorization Σ (S , S ) = ΓS ΓT

S .Following (5.12) and ( 5.13), Table 5.9 lists the needed operations and their costs. The

Page 95: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 95/168

5.4. OPTIMIZATION FOR COMPUTING G < AND CURRENT DENSITY 77

total cost for ( 4.1) is reduced to 16

s3 + s2b + 32

sb 2. Note that we can reuse A (B , S )G−T S

from computing symmetric G but since L−1 = A(B , S )A (S , S )−1 is not computedexplicitly there, we have to do it here. Also note that L−1 is a full dense matrixso computing L−1Σ (S , B ) requires sb 2 ops. Compared to the cost of computationwithout exploiting the symmetry ( s2b + 3 sb 2), the cost is reduced by one third whens = b .

Table 5.9: Operation costs in ops for G < with optimization for symmetry.

Operation Result Cost Cholesky on Σ (S , S ) ΓS

16

s3

(A (B , S )G−T S )G−1

S L−1 12

s2b

L−1Σ (S , B ) L−1Σ (S , B ) sb 2

L−1ΓS L−1Γ S12

s2b

[L−1ΓS ][L−1ΓS ]T L−1Σ (S , S )L−T 12

sb 2

Total R 16

s3 + s 2b + 32

sb 2

5.4.3 Current density

We can also exploit the extra sparsity when computing the current density, but itcannot improve much because there is not as much sparsity in ( 4.2)–(4.4) as in (5.1).This can been seen in Fig. 5.5.

However, we can signicantly reduce the cost with another approach. In thisapproach, in addition to computing the current density from nodes 9-12, we alsocompute the current density from nodes 1, 2, 5, and 6. This approach is still based

on (4.2)–(4.4) with no additional work, but will reduce the number of needed leaf clusters by half if we choose properly which clusters to compute. Fig. 5.6 shows howwe choose the leaf clusters to cover the desired off-diagonal entries in the inverse forcurrent density.

Fig. 5.6 shows only the coverage of current in x-direction, but the approach worksfor the current in y-direction, and more nicely, the same tiling works for the current

Page 96: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 96/168

78 CHAPTER 5. OPTIMIZATION OF FIND

10

9

12

11

2

1

6

5

8

7

4

3

12

1

11

2

10

3

9

4

8

5

7

6

6

7

5

8

4

9

3

10

2

11

1

12

Figure 5.5: The extra sparsity of the matrices used for computing the current density.

in both directions. Readers can check the coverage on their own by drawing arrowsbetween neighboring nodes in the y-direction to see if they are covered by the blueareas in Fig. 5.6 shaded in blue and red.

We can also optimize ( 4.2)–(4.4) by similar techniques used in Section 5.3 toreduce the cost for symmetric A and Σ . In (4.2), A (C , C ) −A (C , B C )(U C )−1A (B C , C )is of the same form as (5.1) so the cost is reduced by half. The cost of computingits inverse is also reduced by half when it is symmetric. Eqs. ( 4.3) and (4.4) are thetranspose of each other so we only need to compute one of them. As a result, thecost for computing the current density is reduce by half in the presence of symmetry.

5.5 Results and comparison

Fig. 5.7 reveals the computation cost reduction for clusters of different size in thebasic cluster tree after optimization for extra sparsity during the upward pass. Asindicated in section 5.2, the improvements are different for two types of merge: i) twoa×a clusters⇒one a×2a cluster; ii) two 2a×a clusters⇒one 2a×2a cluster, so theyare shown separately in the gure. Note that the reduction for small clusters is low.This is because the extra second order computation cost (e.g., O(m2)) introducedin the optimization for extra sparsity is ignored in our analysis but is signicant for

Page 97: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 97/168

5.5. RESULTS AND COMPARISON 79

Figure 5.6: Coverage of all the needed current density through half of the leaf clusters.Arrows are the current. Each needed cluster consists of four solid nodes. The circlenodes are skipped.

small clusters.

Fig. 5.8 reveals the computation cost reduction after optimization for symmetry.As in Fig. 5.7, we show the reduction for the merge for clusters at each level in thebasic cluster tree. We see that the cost is reduced almost by half for clusters largerthan 32 × 32. The low reduction for small clusters here is mainly due to a largerportion of diagonal entries for small matrices where Cholesky factorization cannotreduce the cost.

Although the cost reduction is low for small clusters in both optimizations, theydo not play a signicant role for a mesh larger than 32 ×32 because the top levels of

the tree in Fig. 3.5 dominate the computation cost.Fig. 5.9 compares the computation cost of RGF and optimized FIND on Hermitian

and positive-denite A and reveals the overall effect of the optimization for extrasparsity and symmetry. The simulations were performed on a square mesh of sizeN ×N . The cross-point is around 40 ×40 according to the two tted lines, which is

Page 98: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 98/168

80 CHAPTER 5. OPTIMIZATION OF FIND

50%

55%

60%

65%

70%

2 0 2 2 2 4 2 6 2 8 2 10

C o m p u

t a t i o n c o s

t r e

d u c t i o n

a

two 2a × a ⇒ one 2a × 2a

two a × a ⇒ one 2a × a

Figure 5.7: Optimization for extra sparsity

also conrmed by a simulation at that point.

Page 99: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 99/168

Page 100: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 100/168

82 CHAPTER 5. OPTIMIZATION OF FIND

Page 101: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 101/168

Chapter 6

FIND-2way and Its Parallelization

This chapter is organized as follows. We rst describe general relations to computeGreen’s functions (Section 6.1 and 6.2). These are applicable to meshes with arbitraryconnectivity. In Section 6.3, we calculate the computational cost of these approachesfor 1D and 2D Cartesian meshes and for discretizations involving nearest neighbornodes only, e.g., a 3 point stencil in 1D or a 5 point stencil in 2D. We also comparethe computational cost of this approach with the algorithm introduced in Chapter 3.

In Section 6.4, a parallel implementation of the recurrences for 1 dimensional prob-lems is proposed. The original algorithms by [Takahashi et al. , 1973] and [Svizhenkoet al., 2002] are not parallel since they are based on intrinsically sequential recur-rences. However, we show that an appropriate reordering of the nodes and denitionof the blocks in A lead to a large amount of parallelism. In practical cases for nano-transistor simulations, we found that the communication time was small and thatthe scalability was very good, even for small benchmark cases. This scheme is basedon a combination of domain decomposition and cyclic reduction techniques 1 (see, forexample, [Hageman and Varga , 1964]). Section 6.5 has some numerical results. Fornotation simplicity, we let G = G r in this chapter.

1 The method of cyclic reduction was rst proposed by Schr¨ oder in 1954 and published in Ger-man [Schroder, 1954].

83

Page 102: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 102/168

84 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

6.1 Recurrence formulas for the inverse matrix

Consider a general matrix A written in block form:

A def =A 11 A12

A 21 A22. (6.1)

From the LDU factorization of A, we can form the factors LS def = A21A−111 , US def =

A−111 A 12 , and the Schur complement S def = A22 −A 21A−1

11 A 12 . The following equationholds for the inverse matrix G = A −1:

G =A−1

11 + U S S−1 LS −U S S−1

−S−1 LS S−1. (6.2)

This equation can be veried by direct multiplication with A . It allows computing theinverse matrix using backward recurrence formulas. These formulas can be obtainedby considering step i of the LDU factorization of A, which has the following form:

A =L(1 : i −1, 1 : i −1) 0

L(i : n, 1 : i

−1) I

D (1 : i −1, 1 : i −1) 00 S i

×U (1 : i −1, 1 : i −1) U(1 : i −1, i : n)

0 I.

(6.3)

The Schur complement matrices S i are dened by this equation for all i. From ( 6.3),the rst step of the LDU factorization of Si is the same as the i + 1 th step in thefactorization of A :

S i = I 0

L(i + 1 : n, i ) I

d ii 0

0 Si+1

I U (i, i + 1 : n)

0 I. (6.4)

Page 103: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 103/168

6.1. RECURRENCE FORMULAS FOR THE INVERSE MATRIX 85

Combining ( 6.2) and (6.4) with the substitutions:

A 11 → d ii US → U (i, i + 1 : n)

S → Si+1 LS → L(i + 1 : n, i ),

we arrive at

[S i]−1 =d−1

ii + U (i, i + 1 : n) [S i+1 ]−1 L(i + 1 : n, i ) −U (i, i + 1 : n) [S i+1 ]−1

−[S i+1 ]−1 L(i + 1 : n, i ) [S i+1 ]−1.

From ( 6.2), we have

[S i]−1 = G (i : n, i : n), and [S i+1 ]−1 = G (i + 1 : n, i + 1 : n).

We have therefore proved the following backward recurrence relations starting fromgnn = d−1

nn :

G (i + 1 : n, i ) = −G (i + 1 : n, i + 1 : n) L(i + 1 : n, i )

G (i, i + 1 : n) = −U (i, i + 1 : n) G (i + 1 : n, i + 1 : n)g ii = d−1

ii + U (i,i + 1 : n) G (i + 1 : n, i + 1 : n) L(i + 1 : n, i )(6.5)

These equations are identical to Equations ( 2.4), (2.5), and ( 2.6), respectively. Arecurrence for i = n − 1 down to i = 1 can therefore be used to calculate G . SeeFigure 6.1.

Figure 6.1: Schematics of how the recurrence formulas are used to calculate G .

Page 104: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 104/168

86 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

We have assumed that the matrices produced in the algorithms are all invertible

(when required), and this may not be true in general. However, this has been thecase in all practical applications encountered by the authors so far, for problemsoriginating in electronic structure theory.

By themselves, these recurrence formulas do not lead to any reduction in thecomputational cost. However, we now show that even though the matrix G is full,we do not need to calculate all the entries. We denote Lsym and Usym the lower andupper triangular factors in the symbolic factorization of A. The non-zero structureof (Lsym )T and (U sym )T is denoted by Q, that is Q is the set of all pairs (i, j ) suchthat sym

ji = 0 or usym

ji = 0. Then

gij =

−l>j, (i,l )∈Q

gil lj if i > j

−k>i, (k,j )∈Q

u ik gkj if i < j

d−1ii +

k>i, l>i, (k,l )∈Q

u ik gkl li if i = j

, (6.6)

where (i, j ) ∈ Q.Proof : Assume that ( i, j ) ∈ Q, i > j , then usym

ji = 0. Assume also that symlj = 0,

l > j , then by properties of the Gaussian elimination ( i, l ) ∈ Q. This proves therst case. The case i < j can be proved similarly. For the case i = j , assume thatu sym

ik = 0 and symli = 0, then by properties of the Gaussian elimination ( k, l) ∈ Q.

This implies that we do not need to calculate all the entries in G but rather onlythe entries in G corresponding to indices in Q. This represents a signicant reductionin computational cost. The cost of computing the entries in G in the set Q is the

same (up to a constant factor) as the LDU factorization.Similar results can be derived for the G< matrix

G < = G Γ G †, (6.7)

if we assume that Γ has the same sparsity pattern as A, that is, γ ij = 0 ⇒ a ij = 0.

Page 105: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 105/168

6.1. RECURRENCE FORMULAS FOR THE INVERSE MATRIX 87

The block γ ij denotes the block ( i, j ) of matrix Γ.

To calculate the recurrences, we use the LDU factorization of A introduced pre-viously. The matrix G < satises

A G < A† = Γ.

Using the same block notations as on page 84, we multiply this equation to the leftby (LD )−1 and to the right by ( LD )−†:

I U S

0 SG < I 0

(U S)† S†=

A11 0

LSA 11 I

−1

Γ A11 0

LSA 11 I

−†.

The equation above can be expanded as

G <11 + U SG <

21 + G <12(U S)† + U SG <

22(U S)† (G <12 + U SG <

22)S†S(G <

21 + G <22(U S)†) SG <

22S†=

A−111 Γ11A−†11 A−1

11 (Γ12 −Γ11 (LS)†)(Γ21 −LSΓ11 )A−†11 Γ22 −LSΓ12 −Γ21(LS)† + LSΓ11 (LS)†

.

(6.8)

Using steps similar to the proof of ( 6.5) and (6.8) leads to a forward recurrence thatis performed in combination with the LDU factorization.

Let us dene Γ1 def = Γ, then, for 1 ≤ i ≤ n −1,

Γ i+1L

def = ( Γ i21 −L(i + 1 : n, i ) γ i11 ) d−†ii , (6.9)

Γ i+1U

def = d−1ii (Γ i

12 −γ i11 L(i + 1 : n, i )†), (6.10)

Γ i+1 def = Γi22 −L(i + 1 : n, i ) Γ i

12 −Γ i21 L(i + 1 : n, i )† (6.11)

+ L(i + 1 : n, i ) γ i11 L(i + 1 : n, i )†,

with

γ i11 Γi12

Γ i21 Γi

22

def = Γi ,

and the following block sizes (in terms of number of blocks of the sub-matrix): Note

Page 106: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 106/168

88 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

that the U factors are not needed in this process. Once this recurrence is complete,

we can perform a backward recurrence to nd G<

, as in (6.5). This recurrence canbe proved using (6.8):

G < (i + 1 : n, i ) = G(i + 1 : n, i + 1 : n) Γ i+1L

−G < (i + 1 : n, i + 1 : n) U (i, i + 1 : n)†

G < (i, i + 1 : n) = Γi+1U G(i + 1 : n, i + 1 : n)†

−U (i, i + 1 : n) G < (i + 1 : n, i + 1 : n)

g<ii = d−1

ii γ i11 d−†ii

−U (i, i + 1 : n) G < (i + 1 : n, i )

−G < (i, i + 1 : n) U (i, i + 1 : n)†

−U (i, i + 1 : n) G < (i + 1 : n, i + 1 : n) U (i, i + 1 : n)†

(6.12)

starting from g<nn = gnn Γn g†nn . The proof is omitted but is similar to the one for

(6.5).As before, the computational cost can be reduced signicantly by taking advantage

of the fact that the calculation can be restricted to indices ( i, j ) in Q. For this, weneed to further assume that ( i, j ) ∈ Q ⇔ ( j, i ) ∈ Q.

First, we observe that the cost of computing Γi , ΓiL, Γi

U for all i is of the sameorder as the LDU factorization (i.e., equal up to a constant multiplicative factor).This can be proved by observing that calculating Γ i+1 using (6.11) leads to the samell-in as the Schur complement calculations for S i+1 .

Page 107: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 107/168

6.2. SEQUENTIAL ALGORITHM FOR 1D PROBLEMS 89

Second, the set of (6.12) for G < can be simplied to

g<ij =

l>j, (i,l )∈Q

g il [γ j +1L ]l− j, 1 −

l>j, (i,l )∈Q

g<il u† jl if i > j

k>i, ( j,k )∈Q

[γ i+1U ]1,k−i g† jk −

k>i, (k,j )∈Q

u ik g<kj if i < j

d−1ii γ i11d−†ii −

k>i, (k,i )∈Q

u ik g<ki −

k>i, (i,k )∈Q

g<ik u†ik

−k>i,l>i, (k,l )

Q

u ik g<kl u†il

if i = j

, (6.13)

where (i, j ) ∈ Q. (Notation clarication: γ i jl , [γ iL] jl , and [γ iU] jl are respectively blocksof Γi , Γi

L , and ΓiU .) The proof of this result is similar to the one for ( 6.6) for G .

The reduction in computational cost is realized because all the sums are restricted toindices in Q.

Observe that the calculation of G and Γ i depends only on the LDU factorization,while the calculation of G < depends also on Γi and G .

6.2 Sequential algorithm for 1D problems

In this section, we present a sequential implementation of the relations presentedabove, for 1D problems. Section 6.4 will discuss the parallel implementation.

In 1 dimensional problems, matrix A assumes a block tri-diagonal form. The LDUfactorization is obtained using the following recurrences:

i+1 ,i = a i+1 ,i d−1ii (6.14)

u i,i +1 = d−1

ii ai,i +1 (6.15)d i+1 ,i +1 = a i+1 ,i +1 −a i+1 ,i d−1

ii ai,i +1 (6.16)

= a i+1 ,i +1 − i+1 ,i a i,i +1

= a i+1 ,i +1 −a i+1 ,i u i,i +1 .

Page 108: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 108/168

90 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

From ( 6.6), the backward recurrences for G are given by

g i+1 ,i = −gi+1 ,i +1 i+1 ,i (6.17)

g i,i +1 = −u i,i +1 gi+1 ,i +1 (6.18)

gii = d−1ii + u i,i +1 g i+1 ,i+1 i+1 ,i (6.19)

= d−1ii −g i,i +1 i+1 ,i

= d−1ii −u i,i +1 g i+1 ,i ,

starting with gnn = d−1nn . To calculate G < , we rst need to compute Γ i . Considering

only the non-zero entries in the forward recurrence ( 6.9), (6.10), and ( 6.11), we needto calculate

γ i+1L

def = ( γ i+1 ,i − i+1 ,i γ i) d−†ii

γ i+1U

def = d−1ii (γ i,i +1 −γ i †i+1 ,i )

γ i+1 def = γ i+1 ,i+1 − i+1 ,i γ i,i +1 −γ i+1 ,i †i+1 ,i + i+1 ,i γ i †i+1 ,i .

starting from γ 1 def = γ 11 (the (1,1) block in matrix Γ); γ i, γ iL, and γ iU are 1×1 blocks.

For simplicity, we have shortened the notations. The blocks γ i, γ

iL , and γ

iU are in fact

the (1,1) block of Γi , Γ iL , and Γ i

U .Once the factors L, U, G , γ i , γ iL , and γ iU have been computed, the backward

recurrence ( 6.12) for G < can be computed:

g<i+1 ,i = gi+1 ,i+1 γ i+1

L −g<i+1 ,i+1 u†i,i +1

g<i,i +1 = γ i+1

U g†i+1 ,i+1 −u i,i +1 g<i+1 ,i+1

g<ii = d−1

ii γ i d−†ii −u i,i +1 g<i+1 ,i −g<

i,i +1 u†i,i +1 −u i,i +1 g<i+1 ,i+1 u†i,i +1 ,

starting from g<nn = gnn γ n g†nn .

These recurrences are the most efficient way to calculate G and G < on a sequentialcomputer. On a parallel computer, this approach is of limited use since there is littleroom for parallelization. In section 6.4 we describe a scalable algorithm to perform

Page 109: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 109/168

6.3. COMPUTATIONAL COST ANALYSIS 91

the same calculations in a truly parallel fashion.

6.3 Computational cost analysis

In this section, we determine the computational cost of the sequential algorithmsfor 1D and 2D Cartesian meshes with nearest neighbor stencils. From the recurrencerelations ( 6.6) and (6.13), one can prove that the computational cost of calculating Gand G < has the same scaling with problem size as the LDU factorization. If one usesa nested dissection ordering associated with the mesh [George, 1973], we obtain the

following costs for Cartesian grids assuming a local discretization stencil: We now doa more detailed analysis of the computational cost and a comparison with the FINDalgorithm of Li et al. [Li et al., 2008]. This algorithm comprises an LDU factorizationand a backward recurrence.

For the 1D case, the LDU factorization needs 4 d3 ops for each column: d3 forcomputing d−1

ii , d3 for i+1 ,i , d3 for u i,i +1 , and d3 for d i+1 ,i +1 . So the total com-putational cost of LDU factorization is 4 nd3 ops. Following (6.6), the cost of thebackward recurrence is 3 nd3. Note that the explicit form of d−1

ii is not needed inthe LDU factorization. However, the explicit form of d−1

ii is needed for the backwardrecurrence and computing it in the LDU factorization reduces the total cost. In theabove analysis, we have therefore included the cost d3 of obtaining d−1

ii during theLDU factorization. The total cost in 1D is therefore

7nd3.

For the 2D case, similar to the nested dissection method [George, 1973] and FINDalgorithm [Li et al., 2008], we decompose the mesh in a hierarchical way. See Fig. 6.2.

First we split the mesh in two parts (see the upper and lower parts of the mesh on theright panel of Fig. 6.2). This is done by identifying a set of nodes called the separator.They are numbered 31 in the right panel. The separator is such that it splits the meshinto 3 sets: set s1, s2 and the separator s itself. It satises the following properties:i) aij = 0 if i belongs to s1 and j to s2, and vice versa (this is a separator set); ii)

Page 110: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 110/168

92 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

for every i in s, there is at least one index j1 in s1 and one index j2 in s2 such thata ij 1 = 0 and aij 2 = 0 (the set is minimal). Once the mesh is split into s1 and s2 theprocess is repeated recursively, thereby building a tree decomposition of the mesh.In Fig. 6.2, clusters 17–31 are all separators [ George, 1973]. When we perform theLDU factorization, we eliminate the nodes corresponding to the lower level clustersrst and then those in the higher level clusters.

1 2

17

3 4

18

25

5 6

19

7 8

20

26

29

9 10

21

11 1 2

22

27

1 3 1 4

23

1 5 1 6

24

28

30

31

1 1 1 17 2 2 2 29 5 5 5 19 6 6 6

1 1 1 17 2 2 2 29 5 5 5 19 6 6 6

1 1 1 17 2 2 2 29 5 5 5 19 6 6 6

25 25 25 25 25 25 25 29 26 26 26 26 26 26 26

3 3 3 18 4 4 4 29 7 7 7 20 8 8 8

3 3 3 18 4 4 4 29 7 7 7 20 8 8 8

3 3 3 18 4 4 4 29 7 7 7 20 8 8 8

31 31 31 31 31 31 31 31 31 31 31 31 31 31 31

9 9 9 21 10 10 10 30 13 13 13 23 14 14 14

9 9 9 21 10 10 10 30 13 13 13 23 14 14 14

9 9 9 21 10 10 10 30 13 13 13 23 14 14 14

27 27 27 27 27 27 27 30 28 28 28 28 28 28 28

11 11 11 22 12 12 12 30 15 15 15 24 16 16 16

11 11 11 22 12 12 12 30 15 15 15 24 16 16 16

11 11 11 22 12 12 12 30 15 15 15 24 16 16 16

Figure 6.2: The cluster tree (left) and the mesh nodes (right) for mesh of size 15 ×15.The number in each tree node indicates the cluster number. The number in eachmesh node indicates the cluster it belongs to.

Compared to nested dissection [ George, 1973], we use vertical and horizontal sepa-rators instead of cross shaped separators to make the estimates of computational costeasier to derive. Compared to FIND-1way, we use single separators | here insteadof double separators since we do not need additional “independence” among clus-ters here. Such additional “independence” can be skipped in FIND-1way as well byseparating each step of elimination into two part. We will discuss this in our furtherwork.

To estimate the computational cost, we need to consider the boundary nodesof a given cluster. Boundary nodes are nodes of a cluster that are connected tonodes outside the cluster. Since the non-zero pattern of the matrix changes as the

Page 111: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 111/168

6.3. COMPUTATIONAL COST ANALYSIS 93

elimination is proceeding, we need to consider the evolving connectivity of the mesh.For example, the nodes labeled 22 become connected to nodes 27 and 30 after nodes11 and 12 have been eliminated; similarly, the nodes 20 become connected to nodes26, 31 and 29. This is shown in Fig. 6.2. For more details, see [George, 1973] and [Liet al. , 2008].

Once the separators and boundaries are determined, we see that the computationalcost of eliminating a separator is equal to sb 2 + 2 s2b + 1

3s3 = s(b + s)2 − 2

3s3 ops,

where s is the number of nodes in the separator set and b is the number of nodes inthe boundary set. As in the 1D case, we include the cost for d−1

ii .

To make it simple, we focus on a 2D square mesh with N nodes and the typicalnearest neighbor connectivity. If we ignore the effect of the boundary of the mesh,the size of the separators within a given level is xed. If the ratio b / s is constant,then the cost for each separator is proportional to s3 and the number of separatorsis proportional to N/ s2, so the cost for each level doubles every two levels. Thecomputational costs thus form a geometric series and the top level cost dominatesthe total cost.

When we take the effect of the mesh boundary in consideration, the value of b

for a cluster near the boundary of the mesh needs to be adjusted. For lower levels,such clusters form only a small fraction of all the clusters and thus the effect is notsignicant. For top levels, however, such effect cannot be ignored. Since the cost forthe top levels dominates the total cost, we need to calculate the computational costfor top levels more precisely.

Table 6.1 shows the cost for the top level clusters. The last two rows are the upperbound of the total cost for the rest of the small clusters.

If we compute the cost for each level and sum them together, we obtain a cost of

24.9 N 3/ 2

ops for the LDU factorization.For the backward recurrence, we have the same sets of separators. Each node in

the separator is connected to all the other nodes in the separator and all the nodes inthe boundary set. Since we have an upper triangular matrix now, when we deal witha separator of size s with b nodes on the boundary, the number of non-zero entries

Page 112: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 112/168

94 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

Table 6.1: Estimate of the computational cost for a 2D square mesh for different

cluster sizes. The size is in units of N 1/ 2

. The cost is in units of N 3/ 2

ops.Size of cluster Cost per cluster

Separator Boundary LDU Back. Recurr. Level Number of clusters

1/2 1 1.042 2.375 1 21/2 1 1.042 2.375 2 41/4 5/4 0.552 1.422 3 41/4 3/4 0.240 0.578 3 41/4 1 0.380 0.953 4 41/4 3/4 0.240 0.578 4 8

1/4 1/2 0.130 0.297 4 41/8 3/4 0.094 0.248 5 81/8 5/8 0.069 0.178 5 241/8 1/2 0.048 0.119 6 641/16 3/8 0.012 0.031 7 1281/8 1/8 0.048 0.119 8 . . . ≤ 641/16 3/8 0.012 0.031 9 . . . ≤ 128

in each row increases from b to s + b . As a result, the cost for computing ( 6.6) is3(b 2s + bs 2 + 1

3s3) ops for each step. The total computational cost for the backward

recurrence is then 61 .3 N 3/ 2 ops. The costs for each type of clusters are also listedin Table 6.1.

Adding together the cost of the LDU factorization and the backward recurrence,the total computational cost for the algorithm is 86 .2 N 3/ 2 ops. For FIND, the costis 147N 3/ 2 ops [Li and Darve, 2009]. Note that these costs are for the serial versionof the two algorithms. Although the cost is reduced roughly by half compared toFIND, the parallelization of FIND is different and therefore the running time of bothalgorithms on parallel platforms may scale differently.

We now focus on the implementation of these recurrence relations for the calcula-tion of G in the 1D case on a parallel computer. Similar ideas can be applied to thecalculation of G< . The parallel implementation of the 2D case is signicantly morecomplicated and will be discussed in another chapter.

Page 113: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 113/168

6.4. PARALLEL ALGORITHM FOR 1D PROBLEMS 95

6.4 Parallel algorithm for 1D problems

We present a parallel algorithm for the calculation of the Green’s function ma-trix G typically encountered in electronic structure calculations where the matrix A(cf. Eq. (1.13)) is assumed to be an n ×n block matrix, and in block tridiagonal formas shown by

A =

a11 a12

a21 a22 a23

a32 a33 a34. .

. . .

. ..

.

. (6.20)

where each block element a ij is a dense complex matrix. In order to develop a parallelalgorithm, we assume that we have at our disposal a total of P processing elements(e.g., single core on a modern processor). We also consider that we have processeswith the convention that each process is associated with a unique processing element,and we assume that they can communicate among themselves. The processes arelabeled p0, p1, . . . , pP−1. The block tridiagonal structure A is then distributed amongthese processes in a row-striped manner, as illustrated in Fig. 6.3.

p 0

p 1

p 2

p 3

p 0

p 1

p 2

p 3

Figure 6.3: Two different block tridiagonal matrices distributed among 4 differentprocesses, labeled p0, p1, p2 and p3.

Thus each process is assigned ownership of certain contiguous rows of A. Thisownership arrangement, however, also extends to the calculated blocks of the inverse

Page 114: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 114/168

96 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

G , as well as any LU factors determined during calculation in L and U . The mannerof distribution is an issue of load balancing, and will be addressed later in the chapter.

Furthermore, for illustrative purposes, we present the block matrix structures ashaving identical block sizes throughout. The algorithm presented has no condition forthe uniformity of block sizes in A, and can be applied to a block tridiagonal matrixas presented on the right in Fig. 6.3. The size of the blocks throughout A , however,does have consequences for adequate load balancing.

For the purposes of electronic structure applications, we only require the portionof the inverse G with the same block tridiagonal structure structure of A . We express

this portion as Trid A {G}.The parallel algorithm is a hybrid technique in the sense that it combines the

techniques of cyclic reduction and Schur block decomposition, but where we nowconsider individual elements to be dense matrix blocks. The steps taken by thehybrid algorithm to produce Trid A {G} is outlined in Fig. 6.4.

The algorithm begins with our block tridiagonal matrix A partitioned across anumber of processes, as indicated by the dashed horizontal lines. Each process thenperforms what is equivalent to calculating a Schur complement on the rows/blocks

that it owns, leaving us with a reduced system that is equivalent to a smaller blocktridiagonal matrix ASchur . This phase, which we name the Schur reduction phase, isentirely devoid of interprocess communication.

It is on this smaller block tridiagonal structure that we perform block cyclic re-duction (BCR), leaving us with a single block of the inverse, gkk . This block cyclicreduction phase involves interprocess communication.

From gkk we then produce the portion of the inverse corresponding to GBCR =Trid A Schur {G} in what we call the block cyclic production phase. This is done using

(6.5). Finally, using GBCR

, we can then determine the full tridiagonal structure of G that we desire without any further need for interprocess communication througha so-called Schur production phase. The block cyclic production phase and Schurproduction phase are a parallel implementation of the backward recurrences in ( 6.5).

Page 115: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 115/168

6.4. PARALLEL ALGORITHM FOR 1D PROBLEMS 97

A SchurA

Trid A {G } G BCRg kk

Figure 6.4: The distinct phases of operations performed by the hybrid method indetermining Trid A {G}, the block tridiagonal portion of G with respect to the struc-ture of A. The block matrices in this example are partitioned across 4 processes, asindicated by the horizontal dashed lines.

6.4.1 Schur phases

In order to illustrate where the equations for the Schur reduction and productionphases come from, we perform them for small examples in this section. Furthermore,this will also serve to illustrate how our hybrid method is equivalent to unpivotedblock Gaussian elimination.

Corner Block Operations

Looking at the case of Schur corner reduction, we take the following block formas our starting point:

A =a ii aij 0a ji a jj

0

,

Page 116: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 116/168

98 CHAPTER 6. FIND-2WAY AND ITS PARALLELIZATION

where denotes some arbitrary entries (zero or not). In eliminating the block a ii , wecalculate the following LU factors:

ji = a ji a−1ii

u ij = a−1ii aij ,

and we determine the Schur block

s def = a jj − ji a ij .

Let us now assume that the inverse block g jj has been calculated. We then startthe Schur corner production phase in which, using the LU factors saved from thereduction phase, we can obtain

g ji = −g jj ji

g ij = −u ij g jj ,

(see (6.5)) and nally

gii = a−1ii + u ij g jj ji (6.21)

= a−1ii −u ij g ji (6.22)

= a−1ii −gij ji . (6.23)

This production step is visualized in Fig. 6.5. On the top right of the gure we havethe stored LU factors preserved from the BCR elimination phase. On the top left,bottom right and bottom left we see three different schemes for producing the new

inverse blocks g ii , g ij and g ji from g jj and the LU factors u ij and ji , correspondingto Eq. ( 6.21), Eq. (6.22) and Eq. ( 6.23). A solid arrow indicates the need for thecorresponding block of the inverse matrix and a dashed arrow indicates the need forthe corresponding LU block.

Using this gure, we can determine what matrix blocks are necessary to calculate

Page 117: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 117/168

6.4. PARALLEL ALGORITHM FOR 1D PROBLEMS 99

u ij

l ji

g ii g ij

g ji g jj

g ii g ij

g ji g jj

g ii g ij

g ji g jj

Figure 6.5: Three different schemes that represent a corner production step under-taken in the BCR production phase, where we produce inverse blocks on row/columni using inverse blocks and LU factors from row/column j .

Looking at a given inverse block, the required information is indicated by the inbound arrows. The only exception is g_ii, which also requires the block a_ii.

Three different schemes to calculate g_ii are possible, since the block can be determined via any of the three equations (6.21), (6.22), or (6.23), corresponding respectively to the upper left, lower right and lower left schemes in Fig. 6.5.

Assuming we have process p_i owning row i and process p_j owning row j, we disregard choosing the lower left scheme (cf. Eq. (6.23)), since the computation of g_ii on process p_i would have to wait for process p_j to calculate and send g_ji. This leaves us with the choice of either the upper left scheme (cf. Eq. (6.21)) or the bottom right scheme (cf. Eq. (6.22)), where g_jj and ℓ_ji can be sent immediately, and both processes p_i and p_j can then proceed to calculate in parallel.

However, the bottom right scheme (cf. Eq. (6.22)) is preferable to the top left scheme (cf. Eq. (6.21)) since it saves an extra matrix-matrix multiplication. This motivates our choice of using Eq. (6.22) for our hybrid method.
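To make the corner operations concrete, the following Python/NumPy sketch implements the corner reduction and the bottom-right production scheme of Eq. (6.22) for dense blocks. The function and variable names are ours and only illustrate the block algebra above; this is not the implementation benchmarked in this chapter.

```python
import numpy as np

def corner_reduce(a_ii, a_ij, a_ji, a_jj):
    # Eliminate block row/column i towards its neighbor j and return the
    # LU factors together with the Schur block s = a_jj - l_ji a_ij.
    a_ii_inv = np.linalg.inv(a_ii)
    l_ji = a_ji @ a_ii_inv
    u_ij = a_ii_inv @ a_ij
    s_jj = a_jj - l_ji @ a_ij
    return l_ji, u_ij, s_jj

def corner_produce(a_ii, l_ji, u_ij, g_jj):
    # Given the inverse block g_jj, produce g_ji, g_ij and g_ii
    # using g_ii = a_ii^{-1} - u_ij g_ji, i.e. Eq. (6.22).
    g_ji = -g_jj @ l_ji
    g_ij = -u_ij @ g_jj
    g_ii = np.linalg.inv(a_ii) - u_ij @ g_ji
    return g_ii, g_ij, g_ji
```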


Center Block Operations

The reduction/production operations undertaken by a center process begin with the following block tridiagonal form:

A = [ ×   ×    0    0    0 ;
      ×  a_ii a_ij  0    0 ;
      0  a_ji a_jj a_jk  0 ;
      0   0   a_kj a_kk  × ;
      0   0    0    ×    × ].

Through a block permutation matrix P, we can transform it to the form

PAP = [ a_jj  a_ji a_jk  0   0 ;
        a_ij  a_ii  0    ×   0 ;
        a_kj   0   a_kk  0   × ;
         0     ×    0    ×   0 ;
         0     0    ×    0   × ],

which we interpret as a 2×2 block matrix, with the first block row and column (row/column j) separated from the rest, as indicated by the partitioning lines in the expression. We then perform a Schur complement calculation as done earlier for a corner process, obtaining parts of the LU factors:

Part of the L factor (column j):   [ a_ij ; a_kj ] a_jj^{-1} = [ ℓ_ij ; ℓ_kj ],     (6.24)

Part of the U factor (row j):   a_jj^{-1} [ a_ji  a_jk ] = [ u_ji  u_jk ].          (6.25)

This then leads us to the 2×2 block Schur matrix

[ a_ii  0 ; 0  a_kk ] − [ ℓ_ij ; ℓ_kj ] [ a_ji  a_jk ]
    = [ a_ii − ℓ_ij a_ji ,  −ℓ_ij a_jk ;  −ℓ_kj a_ji ,  a_kk − ℓ_kj a_jk ].          (6.26)


Let us assume that we have now computed the following blocks of the inverse G:

[ g_ii  g_ik ; g_ki  g_kk ].

With this information, we can use the stored LU factors to determine the other blocks of the inverse [see (6.5)], getting

[ g_ji  g_jk ] = − [ u_ji  u_jk ] [ g_ii  g_ik ; g_ki  g_kk ]
             = [ −u_ji g_ii − u_jk g_ki ,  −u_ji g_ik − u_jk g_kk ]

and

[ g_ij ; g_kj ] = − [ g_ii  g_ik ; g_ki  g_kk ] [ ℓ_ij ; ℓ_kj ]
              = [ −g_ii ℓ_ij − g_ik ℓ_kj ;  −g_ki ℓ_ij − g_kk ℓ_kj ].

The final block of the inverse is obtained as

g_jj = a_jj^{-1} + [ u_ji  u_jk ] [ g_ii  g_ik ; g_ki  g_kk ] [ ℓ_ij ; ℓ_kj ]        (6.27)
     = a_jj^{-1} − [ g_ji  g_jk ] [ ℓ_ij ; ℓ_kj ]
     = a_jj^{-1} − g_ji ℓ_ij − g_jk ℓ_kj.

The Schur production step for a center block is visualized in Fig. 6.6, where the arrows are given the same significance as for a corner production step shown in Fig. 6.5.

Again, three different schemes arise, since g_jj can be determined via one of the


following three equations, depending on how g_jj from Eq. (6.27) is calculated:

g_jj = a_jj^{-1} + u_ji g_ii ℓ_ij + u_jk g_ki ℓ_ij + u_ji g_ik ℓ_kj + u_jk g_kk ℓ_kj     (6.28)
g_jj = a_jj^{-1} − u_ji g_ij − u_jk g_kj                                                 (6.29)
g_jj = a_jj^{-1} − g_ji ℓ_ij − g_jk ℓ_kj,                                                (6.30)

where Eq. (6.28), Eq. (6.29) and Eq. (6.30) correspond to the upper left, lower right and lower left schemes in Fig. 6.6. Similarly motivated as for the corner Schur production step, we choose the scheme related to Eq. (6.30), corresponding to the lower left corner of Fig. 6.6.
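A corresponding Python/NumPy sketch of the center production step reads as follows; it assumes the LU factors of Eqs. (6.24)-(6.25) and the inverse blocks on rows/columns i and k are already available, and the naming is again ours, intended only to illustrate the block algebra.

```python
import numpy as np

def center_produce(a_jj, l_ij, l_kj, u_ji, u_jk, g_ii, g_ik, g_ki, g_kk):
    # Produce the row/column-j blocks of the inverse from the stored LU
    # factors (Eqs. (6.24)-(6.25)) and the known blocks on rows/columns i, k.
    g_ji = -(u_ji @ g_ii + u_jk @ g_ki)
    g_jk = -(u_ji @ g_ik + u_jk @ g_kk)
    g_ij = -(g_ii @ l_ij + g_ik @ l_kj)
    g_kj = -(g_ki @ l_ij + g_kk @ l_kj)
    # Lower-left scheme, Eq. (6.30): g_jj = a_jj^{-1} - g_ji l_ij - g_jk l_kj
    g_jj = np.linalg.inv(a_jj) - g_ji @ l_ij - g_jk @ l_kj
    return g_ij, g_ji, g_jk, g_kj, g_jj
```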

6.4.2 Block Cyclic Reduction phases

Cyclic reduction operates on a regular tridiagonal linear system by eliminating the odd-numbered unknowns recursively, until only a single unknown remains, uncoupled from the rest of the system. One can then solve for this unknown, and from there descend down the recursion tree and obtain the full solution. In the case of block cyclic reduction, the individual scalar unknowns correspond to block matrices, but the procedure operates in the same manner. For our application, the equations are somewhat different. We have to follow Eq. (6.5), but otherwise the pattern of computation is similar. The basic operation in the reduction phase of BCR is the row-reduce operation, where two odd-indexed rows are eliminated by row operations towards their neighbor. Starting with the original block tridiagonal form,

[ ×   ×    ×    0    0    0   0 ;
  0  a_ih a_ii a_ij  0    0   0 ;
  0   0   a_ji a_jj a_jk  0   0 ;
  0   0    0   a_kj a_kk a_kl 0 ;
  0   0    0    0    ×    ×   × ],

we reduce from the odd rows i and k towards the even row j, eliminating the coupling element a_ji by a row operation involving row i and the factor ℓ_ji = a_ji a_ii^{-1}. Likewise,


we eliminate a_jk by a row operation involving row k and the factor ℓ_jk = a_jk a_kk^{-1}, and obtain

[ ×     ×       ×      0       0      ×   0 ;
  0    a_ih    a_ii   a_ij     0      0   0 ;
  0  a^BCR_jh   0   a^BCR_jj   0  a^BCR_jl 0 ;
  0     0       0     a_kj   a_kk   a_kl  0 ;
  0     0       0      0       ×      ×   × ],          (6.31)

where the new and updated elements are given as

a^BCR_jj := a_jj − ℓ_ji a_ij − ℓ_jk a_kj,     (6.32)
a^BCR_jh := −ℓ_ji a_ih,                       (6.33)
a^BCR_jl := −ℓ_jk a_kl.                       (6.34)

This process of reduction continues until we are left with only one row, namely a^BCR_kk, where k denotes the row/column we finally reduce to. From a^BCR_kk, we can determine one block of the inverse via g_kk = (a^BCR_kk)^{-1}. From this single block, the backward recurrence (6.5) produces all the other blocks of the inverse. The steps are similar to those in the corner and center production discussions of Section 6.4.1. The key difference between the Schur phase and BCR is in the pattern of communication in the parallel implementation. The Schur phase is embarrassingly parallel, while BCR requires communication at the end of each step.
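As an illustration of Eqs. (6.32)-(6.34), one BCR row-reduce step can be sketched in Python/NumPy as follows; the dictionary-based block storage and the function name are ours, not those of the actual implementation (cf. Alg. C.4).

```python
import numpy as np

def bcr_row_reduce(A, h, i, j, k, l):
    # Reduce from the odd rows i and k towards the even row j.
    # A maps block-index pairs to dense blocks; absent pairs are zero blocks.
    l_ji = A[j, i] @ np.linalg.inv(A[i, i])
    l_jk = A[j, k] @ np.linalg.inv(A[k, k])
    A[j, j] = A[j, j] - l_ji @ A[i, j] - l_jk @ A[k, j]   # Eq. (6.32)
    if (i, h) in A:                                       # fill-in towards row h
        A[j, h] = -l_ji @ A[i, h]                         # Eq. (6.33)
    if (k, l) in A:                                       # fill-in towards row l
        A[j, l] = -l_jk @ A[k, l]                         # Eq. (6.34)
    return l_ji, l_jk                                     # kept for the production phase
```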

6.5 Numerical results

6.5.1 Stability

Our elimination algorithm is equivalent to an unpivoted Gaussian elimination on a suitably permuted matrix PAP for some permutation matrix P [Gander and Golub, 1997; Heller, 1976].

Fig. 6.7 and Fig. 6.8 show the permutation corresponding to the Schur phase and the cyclic reduction phase for a 31×31 block matrix A.


Figure 6.7: Permutation for a 31×31 block matrix A (left), which shows that our Schur reduction phase is identical to an unpivoted Gaussian elimination on a suitably permuted matrix PAP (right). Colors are associated with processes. The diagonal block structure of the top left part of the matrix (right panel) shows that this calculation is embarrassingly parallel.

In the case of our algorithm, both approaches are combined. The process is shown in Fig. 6.9. Once the Schur reduction phase has completed on A, we are left with a reduced block tridiagonal system. The rows and columns of this tridiagonal system are then permuted following the BCR pattern.

Our hybrid method is therefore equivalent to an unpivoted Gaussian elimination. Consequently, the stability of the method depends on the stability of using the diagonal blocks of A as pivots in the elimination process, as is the case for block cyclic reduction.

6.5.2 Load balancing

For the purposes of load balancing, we will assume that blocks in A are of equal size.


Figure 6.8: BCR corresponds to an unpivoted Gaussian elimination of A with permutation of rows and columns.

Although this is not the case in general, the assumption is still of use for the investigation of nanodevices, which tend to be relatively homogeneous and elongated and thus give rise to block tridiagonal matrices A with many diagonal blocks of relatively identical size. Furthermore, this assumption serves as an introductory investigation of how load balancing should be performed, eventually preparing us for a future investigation of the general case.

There are essentially two sorts of execution profiles for a process: one for the corner processes (p_0 and p_{P−1}) and one for the central processes. The corner processes perform the operations described under Corner Block Operations in Section 6.4.1, while the center processes perform those outlined under Center Block Operations.

By performing an operation count under the assumption of equal block sizes throughout A, and assuming an LU factorization cost equal to (2/3)d^3 operations and a matrix-matrix multiplication cost of 2d^3 operations (for a matrix of dimension d, cf. [Golub and Van Loan, 1989]), we estimate the ratio α of the number of rows assigned to a corner process to the number of rows assigned to a central process. An analysis of our algorithm predicts α = 2.636 to be optimal [Petersen, 2008].


Figure 6.9: Following a Schur reduction phase on the permuted block matrix PAP, we obtain the reduced block tridiagonal system in the lower right corner (left panel). This reduced system is further permuted to the form shown on the right panel, as was done for BCR in Fig. 6.8.

For the sake of completeness, we investigate the case of α = 1, where each process is assigned the same number of rows, while the values α = 2 and α = 3 are chosen to bracket the optimal choice.

6.5.3 Benchmarks

The benchmarking of the algorithms presented in this chapter was carried out on a Sun Microsystems Sun Fire E25K server. This machine comprises 72 UltraSPARC IV+ dual-core CPUs, yielding a total of 144 CPU cores, each running at 1.8 GHz. Each dual-core CPU had access to 2 MB of shared L2-cache and 32 MB of shared L3-cache, with a final layer of 416 GB of RAM.

We estimated that in all probability each participating process in our calculations had exclusive right to a single CPU core. There was no guarantee, however, that the communication network was limited to handling requests from our benchmarked algorithms. It is in fact highly likely that the network was loaded with other users' applications, which played a role in reducing the performance of our applications. We were also limited to a maximum of 64 CPU cores.

The execution time was measured for a pure block cyclic reduction algorithm


(cf. Alg. C.1) in comparison with our hybrid algorithm (cf. Alg. C.2). The walltime measurements for running these algorithms on a block tridiagonal matrix A with n = 512 block rows and blocks of dimension m = 256 are given in Fig. 6.10 for four different load balancing values α = {1, 2, 2.636, 3}. The total number of processes used for execution was P = {1, 2, 4, 8, 16, 32, 64} in all cases.

The speedup results corresponding to these walltime measurements are given in Fig. 6.11 for the four different load balancing values of α. For a uniform distribution where α = 1, a serious performance hit is experienced for P = 4. This is due to a poor load balance, as the two corner processes p_0 and p_3 terminate their Schur production/reduction phases much sooner than the central processes p_1 and p_2. It is then observed that choosing α equal to 2, 2.636 or 3 eliminates this dip in speedup. The results appear relatively insensitive to the precise choice of α.

It can also be seen that, as the number of processes P increases for a fixed n, a greater portion of the execution time is attributed to the BCR phase of our algorithm. Ultimately the higher communication and computational costs of BCR over the embarrassingly parallel Schur phase dominate, and the speedup curves level off and drop, regardless of α.

In an effort to determine which load balancing parameter α is optimal, we compare the walltime measurements for our hybrid algorithm in Fig. 6.12. From this figure, we can conclude that a uniform distribution with α = 1 leads to poorer execution times; other values of α produce improved but similar results, particularly in the range P = 4 . . . 8, which is a common core count in modern desktop computers.

6.6 Conclusion

We have proposed new recurrence formulas [Eq. (6.6) and Eq. (6.13)] to calculate certain entries of the inverse of a sparse matrix. Such a calculation scales like N^3 for a matrix of size N×N using a naïve algorithm, while our recurrences reduce the cost by orders of magnitude. The computation cost using these recurrences is of the


Figure 6.10: Walltime for our hybrid algorithm and pure BCR for different values of α as a basic load balancing parameter (maximum wallclock time in seconds versus the number of processors, for (a) α = 1, (b) α = 2, (c) α = 2.636, and (d) α = 3). A block tridiagonal matrix A with n = 512 diagonal blocks, each of dimension m = 256, was used for these tests.


Figure 6.11: Speedup curves for our hybrid algorithm and pure BCR for different values of α (speedup versus the number of processors, with the ideal speedup shown for reference, for (a) α = 1, (b) α = 2, (c) α = 2.636, and (d) α = 3). A block tridiagonal matrix A with n = 512 diagonal blocks, each of dimension m = 256, was used.


Chapter 7

More Advanced Parallel Schemes

In Chapter 6, we discussed a hybrid scheme for parallelizing FIND-2way in the 1D case. In the late stages of the reduction process in that scheme, since there are fewer rows left, more processors are idle. We can use these idle processors to do redundant computation and save total running time. To achieve this, we need to return to FIND-1way schemes, where the two passes can be combined. In this chapter, we will introduce a few different parallel schemes based on this idea. Some of the schemes work efficiently only for the 1D case, while others work well for 2D and 3D cases as well.

7.1 Common features of the parallel schemes

All the schemes in this chapter are based solely on FIND-1way. For each target block, we eliminate its complement with respect to the whole domain by first eliminating the inner nodes of every block (except the target block itself) and then merging these blocks. Since we always eliminate the inner nodes of every block, we will not emphasize this step of elimination and assume that these inner nodes are already eliminated, denoted by the shaded areas (in the north-east and north-west directions).

In all the following sections for 1D parallel schemes, we use 16 processors and 16 blocks in 1D as the whole domain to illustrate how these schemes work. We assign one target block to each of the processors P_0, P_1, . . ., P_15, with P_i assigned to target block


B_i; in the end, all the complement nodes are eliminated, as shown in Fig. 7.1. We will discuss several schemes to achieve this common goal with different merging processes.


Figure 7.1: The common goal of the 1D parallel schemes in this chapter

7.2 PCR scheme for 1D problems

In our Parallel Cyclic Reduction (PCR) scheme, each processor starts with its own target block as the initial cluster, denoted as C^(0)_i. This is shown in Fig. 7.2(a). On each processor, the initial cluster C^(0)_i is merged with its immediate neighboring cluster C^(0)_{(i+1) mod 16}. The data for merging C^(0)_i and C^(0)_{i+1} on P_i comes from P_i and P_{(i+1) mod 16}. This is shown in Fig. 7.2(b). Since we always apply the mod 16 operation after addition, we will skip it in subscripts for notational simplicity.

The newly formed cluster, denoted as C^(1)_i, is then merged with C^(1)_{i+2}. This is shown in Fig. 7.2(c). Note that C^(1)_15 has two parts. Generally, when a cluster is cut


by the right end of the whole domain, the remaining parts start from the left, as shown by C^(1)_15 in Fig. 7.2(b), C^(2)_13, C^(2)_14, and C^(2)_15 in Fig. 7.2(c), and similar clusters in Figs. 7.2(d) and 7.2(e). We keep merging in this way until the cluster on each processor covers all the blocks in two groups (Fig. 7.2(e)). In the end, each processor communicates with its neighbor to get the complement blocks and then computes the inverse of the matrix corresponding to the target block (Fig. 7.2(f)).
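The merging schedule itself is easy to write down explicitly. The short Python sketch below lists, for each PCR step, which cluster every processor merges its current cluster with (offsets 1, 2, 4, 8 modulo 16); it only reproduces the index pattern of Fig. 7.2 and is not part of any implementation discussed here.

```python
def pcr_merge_schedule(p=16):
    # At step s, processor P_i merges its cluster C^(s)_i with C^(s)_{(i + 2^s) mod p}.
    schedule = []
    offset = 1
    while offset < p:
        schedule.append([(i, (i + offset) % p) for i in range(p)])
        offset *= 2
    return schedule

# pcr_merge_schedule(16)[0] pairs each P_i with cluster i+1 (mod 16);
# later steps use offsets 2, 4 and 8, after which each cluster covers all 16 blocks.
```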


Figure 7.2: The merging process of PCR

Fig. 7.3 shows the communication patterns of this scheme. These patterns correspond to the steps in Fig. 7.2. For example, Fig. 7.3(a) is the communication pattern needed for the merging from Fig. 7.2(a) to Fig. 7.2(b), and Fig. 7.3(e) is the communication pattern needed for the merging from Fig. 7.2(e) to Fig. 7.2(f).


Figure 7.3: The communication patterns of PCR


7.3 PFIND-Complement scheme for 1D problems

In our Parallel-FIND Complement (PFIND-Complement) scheme, the subdomain is the key concept. A subdomain is a subset of the whole domain consisting of adjacent blocks, which is used when considering the complement of the target block. The subdomain expands as the computation proceeds (usually doubling at each step) until it becomes the whole domain.

Each processor has its own subdomain at each step. We start from subdomains of small size and keep expanding them until they are as large as the whole domain. There are two types of computation involved in the subdomain expansion: one for computing the complement and the other for computing the subdomain. These two types of computation have counterparts in FIND-1way: the first is similar to the downward pass there and the second is identical to the upward pass. We will treat them as two separate processes in our discussion, but they are carried out together in reality.

In the first process, each processor computes the complement of its target block with respect to its subdomain. As the subdomain on each processor expands, the complement expands as well through merging with its neighboring subdomain. Figs. 7.4(a)-7.4(d) show the process of expansion with subdomains of size 2, 4, 8, and 16. In the end, we obtain the complement of every target block and are ready to compute the corresponding inverse entries.

The second process computes the subdomain clusters needed by the above process. At each step, the subdomains are computed through merging smaller subdomain clusters from the previous step of this process. Fig. 7.5 illustrates this merging process. These clusters are needed by the process in Fig. 7.4. For example, the clusters in Fig. 7.5(b) are obtained by merging the clusters in Fig. 7.5(a) and used for the clusters in Fig. 7.4(c).
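For illustration, the subdomain owned by each processor at every doubling step, together with the complement of its target block within that subdomain, can be listed with the following Python sketch. It assumes aligned power-of-two subdomains as in Figs. 7.4 and 7.5, and the names are ours.

```python
def pfind_complement_schedule(p=16):
    # For each subdomain size 2, 4, ..., p, list (subdomain, complement of the
    # target block within the subdomain) for every processor P_i.
    levels = []
    size = 2
    while size <= p:
        level = []
        for i in range(p):
            start = (i // size) * size
            subdomain = list(range(start, start + size))
            complement = [b for b in subdomain if b != i]
            level.append((subdomain, complement))
        levels.append(level)
        size *= 2
    return levels
```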

The above two processes are conceptually separate. In reality, however, the two processes can be combined, because in the process from Fig. 7.4(a)-Fig. 7.4(d) some processors are idle. For example, when obtaining the clusters in Fig. 7.4(b), the



Figure 7.4: The complement clusters in the merging process of PFIND-Complement. The subdomain size at each step (2, 4, 8, and 16) is indicated in each sub-figure.


Figure 7.5: The subdomain clusters in the merging process of PFIND-Complement (2-block, 4-block, and 8-block clusters).


real number and 16 bytes for one double precision floating point complex number). This cost is minimal, since the cost remains the same throughout the computation process and the space for the current results at each step is required in any scheme.

At each step, every processor receives results of size d^3 units from another processor in a different subdomain, but only one processor in each subdomain will send out its results (also of size d^3 units). Since the communication is between a single processor and multiple processors, an underlying multicast network topology can make it efficient, especially in late steps, since fewer processors send out messages there.

7.4 Parallel schemes in 2D

The PFIND-Complement scheme can be easily extended to 2D problems. Instead of computing the complement of the target block with respect to the whole domain at the end of the whole computation process, we compute the complement of the target block with respect to subdomains at every step. Fig. 7.7 shows how the complement of target block 5 is computed in the original serial FIND-1way scheme (left) and in PFIND-Complement.

Figure 7.7: Augmented trees for target block 14 in FIND-1way and PFIND-Complement schemes


We can see that the right tree is shorter than the left one. Generally, the augmented tree for FIND-1way is twice as tall as that in PFIND-Complement. This means that PFIND-Complement reduces the computation time approximately by half.

If we look at the trees in Fig. 7.7 carefully and compare them with the augmented trees for other target blocks (not shown), we can see that there is less overlap among different trees in PFIND-Complement. Such a lack of overlap leads to more overall computation cost at the top of the tree, but since we have more idle processors there, it does not lead to extra computation time. More detailed analysis will be discussed in our future articles.


Chapter 8

Conclusions and Directions for Future Work

Copenhagen. Chapter 7 proposed a few more advanced parallel schemes for future work.

All our implementations have proved to be excellent in running time, storage cost, and stability for real problems. Theoretically, however, similar to other direct methods with various numbering schemes, we cannot at the same time employ typical pivoting schemes for the stability issues arising in LU factorization. We may take advantage of the limited freedom of numbering within the framework of nested dissection and employ partial pivoting when it is permitted, but whether this can be effective will depend on the concrete problem. We leave it as future work.

There are also other interesting serial and parallel schemes for our problems that we have not discussed yet. FIND-1way-SS (FIND-1way Single-Separator, cf. 92) is a serial algorithm similar to FIND-1way but more efficient and suitable for advanced parallel schemes. In FIND-1way, we use the boundary nodes of two neighboring clusters as a double separator between them, while in FIND-1way-SS, we use a shared separator between neighboring clusters, as in the nested dissection method. As a result, FIND-1way-SS has fewer columns to eliminate in each step and will significantly reduce the computation cost. This is achieved through the separation of the matrix multiplication and addition in (3.2). We will discuss more on this in our future articles.

We also proposed another advanced parallel scheme (PFIND-Overlap) but have not discussed it. All these schemes can be generalized to sparse matrices arising from 3D problems and from 2D problems with more complicated connectivity as well. Some of the parallel schemes can be efficiently applied to repeated linear systems. We will discuss them in our future work.


Appendices


Appendix A

Proofs for the Recursive Method Computing G^r and G^<

Proposition A.1 The following equations can be used to calculate Z_q := G^q_{qq}, 1 ≤ q ≤ n:

Z_1 = A_11^{-1}
Z_q = (A_qq − A_{q,q−1} Z_{q−1} A_{q−1,q})^{-1}

Proof: In general, for matrices W and X such that

[ W_1 W_2 ; W_3 W_4 ] [ X_1 X_2 ; X_3 X_4 ] = [ I 0 ; 0 I ],

by Gaussian elimination we have

[ W_1 W_2 ; 0 W_4 − W_3 W_1^{-1} W_2 ] [ X_1 X_2 ; X_3 X_4 ] = [ I 0 ; × I ],

and then we have

X_4 = (W_4 − W_3 W_1^{-1} W_2)^{-1}.
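As a sketch, the forward recurrence of Proposition A.1 can be written in a few lines of Python/NumPy for dense blocks; the list-based storage and the names below are ours and serve only to illustrate the recurrence.

```python
import numpy as np

def forward_recurrence(A_diag, A_lower, A_upper):
    # Z_1 = A_11^{-1};  Z_q = (A_qq - A_{q,q-1} Z_{q-1} A_{q-1,q})^{-1}.
    # 0-based lists: A_diag[m] = A_{m+1,m+1}, A_lower[m] = A_{m+2,m+1},
    # A_upper[m] = A_{m+1,m+2}.
    Z = [np.linalg.inv(A_diag[0])]
    for q in range(1, len(A_diag)):
        Z.append(np.linalg.inv(A_diag[q] - A_lower[q - 1] @ Z[q - 1] @ A_upper[q - 1]))
    return Z
```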


= [ I 0 ; −W_3 W_1^{-1} I ] [ Y_1 Y_2 ; Y_3 Y_4 ] [ I −W_1^{-*} W_3^* ; 0 I ] = [ Y_1 Ỹ_2 ; Ỹ_3 Ỹ_4 ],

where

Ỹ_4 = Y_4 − W_3 W_1^{-1} Y_2 − Y_3 W_1^{-*} W_3^* + W_3 W_1^{-1} Y_1 W_1^{-*} W_3^*.

So we have

X_4 = (W_4 − W_3 W_1^{-1} W_2)^{-1} Ỹ_4 (W_4^* − W_2^* W_1^{-*} W_3^*)^{-1}.

Now, let W = A^q, X = G^{<q}, and Y = Σ^q, and blockwise, let

W_1 = A^{q−1},  W_2 = A_{1:q−1,q},  W_3 = A_{q,1:q−1},  W_4 = A_{qq},
X_1 = G^{<q}_{1:q−1,1:q−1},  X_2 = G^{<q}_{1:q−1,q},  X_3 = G^{<q}_{q,1:q−1},  X_4 = G^{<q}_{q,q},
Y_1 = Σ^{q−1},  Y_2 = Σ_{1:q−1,q},  Y_3 = Σ_{q,1:q−1},  Y_4 = Σ_{q,q}.

Since (W_4 − W_3 W_1^{-1} W_2)^{-1} = G^q_{qq} = Z_q, W_3 = ( 0  A_{q,q−1} ) and W_1^{-1} Σ^{q−1} W_1^{-*} = G^{<,q−1}, we have:

Z^<_q = Z_q [ Σ_{q,q} − Σ_{q,q−1} Z^*_{q−1} A^*_{q,q−1} − A_{q,q−1} Z_{q−1} Σ_{q−1,q} + A_{q,q−1} Z^<_{q−1} A^*_{q,q−1} ] Z^*_q.

Starting from Z^<_1 = Z_1 Σ_{11} Z^*_1, we have a recursive algorithm to calculate all the Z^<_q = G^{<q}_{qq}, 1 ≤ q ≤ n.

As before, the last block G^{<n}_{nn} is equal to G^<_{nn}. In the next section, we use a backward recurrence to calculate G^<_{qq}, 1 ≤ q ≤ n.
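The recurrence for Z^<_q can be sketched in the same style, given the blocks Z_q from Proposition A.1; again this is only an illustration with our own naming conventions.

```python
import numpy as np

def forward_recurrence_less_than(A_lower, S_diag, S_lower, S_upper, Z):
    # Z^<_1 = Z_1 Sigma_11 Z_1^*;
    # Z^<_q = Z_q [Sigma_qq - Sigma_{q,q-1} Z_{q-1}^* A_{q,q-1}^*
    #              - A_{q,q-1} Z_{q-1} Sigma_{q-1,q}
    #              + A_{q,q-1} Z^<_{q-1} A_{q,q-1}^*] Z_q^*.
    # 0-based lists: A_lower[m] = A_{m+2,m+1}, S_diag[m] = Sigma_{m+1,m+1},
    # S_lower[m] = Sigma_{m+2,m+1}, S_upper[m] = Sigma_{m+1,m+2}.
    H = lambda M: M.conj().T
    Zl = [Z[0] @ S_diag[0] @ H(Z[0])]
    for q in range(1, len(Z)):
        a = A_lower[q - 1]          # the sub-diagonal block A_{q,q-1} of the formula
        inner = (S_diag[q]
                 - S_lower[q - 1] @ H(Z[q - 1]) @ H(a)
                 - a @ Z[q - 1] @ S_upper[q - 1]
                 + a @ Zl[q - 1] @ H(a))
        Zl.append(Z[q] @ inner @ H(Z[q]))
    return Zl
```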

A.2 Backward recurrence

The recurrence is this time given by:


Proposition A.4 The following equations can be used to calculate G^<_{qq}, 1 ≤ q ≤ n:

G^<_{nn} = Z^<_n
G^<_{qq} = Z^<_q + Z_q A_{q,q+1} G_{q+1,q+1} A_{q+1,q} Z^{<*}_q + Z^<_q A^*_{q+1,q} G^*_{q+1,q+1} A^*_{q,q+1} Z^*_q
         + Z_q [ A_{q,q+1} G^<_{q+1,q+1} A^*_{q,q+1} − A_{q,q+1} G_{q+1,q+1} Σ_{q+1,q} − Σ_{q,q+1} G^*_{q+1,q+1} A^*_{q,q+1} ] Z^*_q

Proof: Consider matrices W, X, and Y such that:

[ W_1 W_2 ; W_3 W_4 ] [ X_1 X_2 ; X_3 X_4 ] [ W_1 W_2 ; W_3 W_4 ]^* = [ Y_1 Y_2 ; Y_3 Y_4 ].

Performing Gaussian elimination, we have

[ W_1 W_2 ; 0 U_4 ] [ X_1 X_2 ; X_3 X_4 ] [ W_1^* 0 ; W_2^* U_4^* ] = [ Y_1 Ỹ_2 ; Ỹ_3 Ỹ_4 ]

⇒ W_1 X_1 W_1^* = Y_1 − W_2 X_3 W_1^* − W_1 X_2 W_2^* − W_2 X_4 W_2^*.

Similar to the back substitution in the previous section, we have

U_4 X_3 W_1^* = Ỹ_3 − U_4 X_4 W_2^* = Y_3 − W_3 W_1^{-1} Y_1 − U_4 X_4 W_2^*
⇒ X_3 W_1^* = U_4^{-1} Y_3 − U_4^{-1} W_3 W_1^{-1} Y_1 − X_4 W_2^*

and

W_1 X_2 U_4^* = Ỹ_2 − W_2 X_4 U_4^* = Y_2 − Y_1 W_1^{-*} W_3^* − W_2 X_4 U_4^*
⇒ W_1 X_2 = Y_2 U_4^{-*} − Y_1 W_1^{-*} W_3^* U_4^{-*} − W_2 X_4.

Now we have

X_1 = (W_1^{-1} Y_1 W_1^{-*}) + W_1^{-1} (W_2 U_4^{-1} W_3)(W_1^{-1} Y_1 W_1^{-*}) + (W_1^{-1} Y_1 W_1^{-*})(W_3^* U_4^{-*} W_2^*) W_1^{-*}
      + W_1^{-1} (W_2 X_4 W_2^* − W_2 U_4^{-1} Y_3 − Y_2 U_4^{-*} W_2^*) W_1^{-*}.     (*)


By definition, we have

X_1 = [ × × ; × G^<_qq ],   W_1^{-1} Y_1 W_1^{-*} = [ × × ; × Z^<_q ],   W_1^{-1} = [ × × ; × Z_q ].

By the result of the previous section, we have

W_2 U_4^{-1} W_3 = [ 0 0 ; 0 A_{q,q+1} G_{q+1,q+1} A_{q+1,q} ],
W_3^* U_4^{-*} W_2^* = [ 0 0 ; 0 A^*_{q+1,q} G^*_{q+1,q+1} A^*_{q,q+1} ],

and

X_4 = [ G^<_{q+1,q+1} × ; × × ],   U_4^{-1} = [ G_{q+1,q+1} × ; × × ],   W_2 = [ 0 0 ; A_{q,q+1} 0 ],
Y_3 = [ 0 Σ_{q+1,q} ; 0 0 ],   Y_2 = [ 0 0 ; Σ_{q,q+1} 0 ],

and then

W_2 X_4 W_2^* = [ × × ; × A_{q,q+1} G^<_{q+1,q+1} A^*_{q,q+1} ],   W_2 U_4^{-1} Y_3 = [ × × ; × A_{q,q+1} G_{q+1,q+1} Σ_{q+1,q} ],

and

Y_2 U_4^{-*} W_2^* = [ × × ; × Σ_{q,q+1} G^*_{q+1,q+1} A^*_{q,q+1} ].

Putting them all together and considering the (q, q) block of equation (*), we have

G^<_{qq} = Z^<_q + Z_q A_{q,q+1} G_{q+1,q+1} A_{q+1,q} Z^{<*}_q + Z^<_q A^*_{q+1,q} G^*_{q+1,q+1} A^*_{q,q+1} Z^*_q
         + Z_q [ A_{q,q+1} G^<_{q+1,q+1} A^*_{q,q+1} − A_{q,q+1} G_{q+1,q+1} Σ_{q+1,q} − Σ_{q,q+1} G^*_{q+1,q+1} A^*_{q,q+1} ] Z^*_q.


Appendix B

Theoretical Supplement for FIND-1way Algorithms

Property B.6 For any k, u such that C_k, C_u ∈ T^+_r, if S_k < S_u, then S_u ∩ I_k = ∅.

Proof: By Property B.5, we have C_u ⊄ C_k. So for all j such that C_j ⊆ C_k, we have j ≠ u and thus S_j ∩ S_u = ∅ by Property B.3. By Property B.2, I_k = ∪_{C_j ⊆ C_k} S_j, so we have I_k ∩ S_u = ∅.

Property B.7 If C_j is a child of C_k, then for any C_u such that S_j < S_u < S_k, we have C_j ∩ C_u = ∅ and thus B_j ∩ B_u = ∅.

This is because C_u can be neither a descendant of C_j nor an ancestor of C_k.

Proof: By Property B.5, either C_j ⊂ C_u or C_j ∩ C_u = ∅. Since C_j is a child of C_k and u ≠ k, we have C_j ⊂ C_u ⇒ C_k ⊂ C_u ⇒ S_k < S_u, which contradicts the given condition S_u < S_k. So C_j ⊄ C_u and then C_j ∩ C_u = ∅.

Now we re-state Theorem B.1 more precisely with its proof (see page 35).

Theorem B.1 If we perform Gaussian elimination as described in Section 3.2 on the original matrix A with an ordering consistent with any given T^+_r, then

1. A_g(S_{≥g}, S_{<g}) = 0;
2. ∀h ≥ g, A_g(S_h, S_{>h}\B_h) = A_g(S_{>h}\B_h, S_h) = 0;
3. (a) A_{g+}(B_g, S_{>g}\B_g) = A_g(B_g, S_{>g}\B_g);
   (b) A_{g+}(S_{>g}\B_g, B_g) = A_g(S_{>g}\B_g, B_g);
   (c) A_{g+}(S_{>g}\B_g, S_{>g}\B_g) = A_g(S_{>g}\B_g, S_{>g}\B_g);
4. A_{g+}(B_g, B_g) = A_g(B_g, B_g) − A_g(B_g, S_g) A_g(S_g, S_g)^{-1} A_g(S_g, B_g).

The matrices A_i and A_{i+} on page 35 show one step of elimination and may help in understanding this theorem.

Proof: Since (1) and (2) imply (3) and (4) for each i, and performing Gaussian elimination implies (1), it is sufficient to prove (2). We will prove (2) by strong mathematical induction.


1. For g = i_1, (2) holds because of the property of the original matrix, i.e., an entry in the original matrix is nonzero iff the corresponding two mesh nodes are connected to each other, and no mesh nodes in S_h and S_{>h}\B_h are connected to each other. The property is shown in the matrix below:

                 S_h    B_h    S_{>h}\B_h
   S_h            ×      ×      0
   B_h            ×      ×      ×
   S_{>h}\B_h     0      ×      ×

2. If (2) holds for all g = j such that S_j < S_k, then by Property B.1 we have either C_j ⊂ C_k or C_j ∩ C_k = ∅.

   • if C_j ⊂ C_k, consider u such that S_k < S_u. By Property B.6, we have I_k ∩ S_u = ∅. Since C_k = I_k ∪ B_k, we have (S_u\B_k) ∩ C_k = ∅. So we have B_j ∩ (S_u\B_k) ⊆ (S_u\B_k) ∩ C_j ⊆ (S_u\B_k) ∩ C_k = ∅ ⇒ B_j ∩ (S_{>k}\B_k) = ∅.
   • if C_j ∩ C_k = ∅, then B_j ⊂ C_j and S_k ⊂ C_k ⇒ B_j ∩ S_k = ∅.

So in both cases, we have (B_j, B_j) ∩ (S_k, S_{>k}\B_k) = ∅. Since for every S_j < S_k, (1), (2), (3), and (4) hold, i.e., eliminating the S_j columns only affects the (B_j, B_j) entries, we have A_k(S_k, S_{>k}\B_k) = A_k(S_{>k}\B_k, S_k) = 0. Since the argument is valid for all h ≥ k, we have ∀h ≥ k, A_k(S_h, S_{>h}\B_h) = A_k(S_{>h}\B_h, S_h) = 0. So (2) holds for g = k as well.

By strong mathematical induction, we have that (2) holds for all g such that C_g ∈ T^+_r.

Below we restate Corollaries 3.1–3.3 and give their proofs.

Corollary B.1 If C_i and C_j are the two children of C_k, then A_k(B_i, B_j) = A(B_i, B_j) and A_k(B_j, B_i) = A(B_j, B_i).

Proof: Without loss of generality, let S_i < S_j. For any S_u < S_k, consider the following three cases: S_u < S_j, S_u = S_j, and S_j < S_u < S_k.

If S_u < S_j, then by Property B.1, either C_u ⊂ C_j or C_u ∩ C_j = ∅.


• if C_u ⊂ C_j, then since C_i ∩ C_j = ∅, we have C_u ∩ C_i = ∅ ⇒ B_u ∩ B_i = ∅;
• if C_u ∩ C_j = ∅, then since B_u ⊂ C_u and B_j ⊂ C_j, we have B_u ∩ B_j = ∅.

So we have (B_u, B_u) ∩ (B_i, B_j) = (B_u, B_u) ∩ (B_j, B_i) = ∅.

If S_u = S_j, then B_u = B_j and B_i ∩ B_j = ∅, so we also have (B_u, B_u) ∩ (B_i, B_j) = (B_u, B_u) ∩ (B_j, B_i) = ∅.

If S_j < S_u < S_k, then by Property B.7, we have B_u ∩ B_j = ∅. So we have (B_u, B_u) ∩ (B_i, B_j) = (B_u, B_u) ∩ (B_j, B_i) = ∅ as well.

So for every S_u < S_k, we have (B_u, B_u) ∩ (B_i, B_j) = (B_u, B_u) ∩ (B_j, B_i) = ∅. By Theorem B.1, eliminating S_u only changes A(B_u, B_u), so we have A_k(B_i, B_j) = A(B_i, B_j) and A_k(B_j, B_i) = A(B_j, B_i).

Corollary B.2 If C_i is a child of C_k, then A_k(B_i, B_i) = A_{i+}(B_i, B_i).

Proof: Consider u such that S_i < S_u < S_k. By Property B.7, we have B_u ∩ B_i = ∅. Since, by Theorem B.1, eliminating the S_u columns will only affect the (B_u, B_u) entries, we have A_k(B_i, B_i) = A_{i+}(B_i, B_i).

Corollary B.3 If C_i is a leaf node in T^+_r, then A_i(C_i, C_i) = A(C_i, C_i).

Proof: Consider S_u < S_i. By Property B.1, we have either C_u ⊂ C_i or C_u ∩ C_i = ∅. Since C_i is a leaf node, there is no u such that C_u ⊂ C_i. So we have C_u ∩ C_i = ∅ ⇒ B_u ∩ C_i = ∅. By Theorem B.1, we have A_i(C_i, C_i) = A(C_i, C_i).

B.2 Proofs for computing G^<

This section is organized in a similar way as in the previous section.

Theorem B.2 If we perform Gaussian elimination as described in Section 3.2 and update R_g accordingly, then we have:

1. ∀h ≥ g, R_g(S_h, S_{>h}\B_h) = R_g(S_{>h}\B_h, S_h) = 0;
2. (a) R_{g+}(B_g, S_{>g}\B_g) = R_g(B_g, S_{>g}\B_g);


and

(B_u, B_u) ∩ (B_j, B_i) = (S_u, B_u) ∩ (B_j, B_i) = (B_u, S_u) ∩ (B_j, B_i) = ∅.     (B.2)

If S_u ≤ S_j, then by Property B.1, either C_u ⊆ C_j or C_u ∩ C_j = ∅.

• if C_u ⊆ C_j, then since C_i ∩ C_j = ∅, we have C_u ∩ C_i = ∅ ⇒ B_u ∩ B_i = ∅ and S_u ∩ B_i = ∅;
• if C_u ∩ C_j = ∅, then B_u ∩ B_j = ∅ and S_u ∩ B_j = ∅.

So (1) and (2) hold.

If S_j < S_u < S_k, then by Property B.7, we have C_u ∩ C_j = ∅ ⇒ S_u ∩ B_j = B_u ∩ B_j = ∅. So (1) and (2) hold as well.

So for every S_u < S_k, both (1) and (2) hold. By Theorem B.2, step u of the updates only changes R_u(B_u, B_u), R_u(B_u, S_u), and R_u(S_u, B_u). So we have R_k(B_i, B_j) = R(B_i, B_j) and R_k(B_j, B_i) = R(B_j, B_i).

Corollary B.5 If C_i is a child of C_k, then R_k(B_i, B_i) = R_{i+}(B_i, B_i).

Proof: Consider u such that S_i < S_u < S_k. By Property B.7, we have C_u ∩ C_i = ∅ and thus (1) and (2) hold. By Theorem B.2, step u of the updates only changes R_u(B_u, B_u), R_u(B_u, S_u), and R_u(S_u, B_u). So we have R_k(B_i, B_i) = R_{i+}(B_i, B_i).

Corollary B.6 If C_i is a leaf node in T^+_r, then R_i(C_i, C_i) = R(C_i, C_i).

Proof: Consider S_u < S_i. By Property B.1, either C_u ⊂ C_i or C_u ∩ C_i = ∅. Since C_i is a leaf node, there is no u such that C_u ⊂ C_i. So we have C_u ∩ C_i = ∅ and thus B_u ∩ C_i = S_u ∩ C_i = ∅. By Theorem B.2, we have R_i(C_i, C_i) = R(C_i, C_i).

Theorem B.3 For any r and s such that C_i ∈ T^+_r and C_i ∈ T^+_s, we have:

R_{r,i}(S_i ∪ B_i, S_i ∪ B_i) = R_{s,i}(S_i ∪ B_i, S_i ∪ B_i).

Proof: If C_i is a leaf node, then by Corollary B.6, we have R_{r,i}(S_i ∪ B_i, S_i ∪ B_i) = R_{r,i}(C_i, C_i) = R_r(C_i, C_i) = R_s(C_i, C_i) = R_{s,i}(C_i, C_i) = R_{s,i}(S_i ∪ B_i, S_i ∪ B_i).

If the equality holds for i and j such that C_i and C_j are the two children of C_k, then


• by Theorem B.2, we have R_{r,i+}(B_i, B_i) = R_{s,i+}(B_i, B_i) and R_{r,j+}(B_j, B_j) = R_{s,j+}(B_j, B_j);

• by Corollary B.5, we have R_{r,k}(B_i, B_i) = R_{r,i+}(B_i, B_i) = R_{s,i+}(B_i, B_i) = R_{s,k}(B_i, B_i) and R_{r,k}(B_j, B_j) = R_{r,j+}(B_j, B_j) = R_{s,j+}(B_j, B_j) = R_{s,k}(B_j, B_j);

• by Corollary B.4, we have R_{r,k}(B_i, B_j) = R_r(B_i, B_j) = R_s(B_i, B_j) = R_{s,k}(B_i, B_j) and R_{r,k}(B_j, B_i) = R_r(B_j, B_i) = R_s(B_j, B_i) = R_{s,k}(B_j, B_i).

Now we have R_{r,k}(B_i ∪ B_j, B_i ∪ B_j) = R_{s,k}(B_i ∪ B_j, B_i ∪ B_j). By Property B.3, we have R_{r,k}(S_k ∪ B_k, S_k ∪ B_k) = R_{s,k}(S_k ∪ B_k, S_k ∪ B_k). By induction, the theorem is proved.

Corollary B.7 For any r and s such that C_i ∈ T^+_r and C_i ∈ T^+_s, we have:

R_{r,i+}(B_i, B_i) = R_{s,i+}(B_i, B_i).

Proof: By Theorem B.2 and Theorem B.3.

Appendix C

Algorithms

This variable iBCR is explicitly used as an argument in the BCR phases, and implicitly in the Schur phases. The implicit form is manifested through the variables top and bot, which store the values of the top and bottom row indices "owned" by one of the participating parallel processes.

Algorithm C.2: InverseHybrid(A, iBCR)
1: A, L, U ← ReduceSchur(A)
2: A, L, U ← ReduceBCR(A, L, U, iBCR)
3: g_kk ← a_kk^{-1}
4: G ← ProduceBCR(A, L, U, G, iBCR)
5: G ← ProduceSchur(A, L, U, G)
6: return G

C.1 BCR Reduction Functions

The reduction phase of BCR is achieved through the use of the following two methods. The method ReduceBCR, given in Alg. C.3, takes a block tridiagonal A and performs a reduction phase over the supplied indices given in iBCR. The associated LU factors are then stored appropriately in the matrices L and U.

ReduceBCR loops over each of the levels on line 3, and eliminates the odd-numbered rows given by line 4. This is accomplished by calls to the basic reduction method Reduce on line 6.

Algorithm C.3: ReduceBCR(A, L, U, iBCR)
1: k ← length of iBCR   {size of the "reduced" system}
2: h ← log2(k)   {height of the binary elimination tree}
3: for level = 1 up to h do
4:   ielim ← determine the active rows
5:   for row = 1 up to length of ielim do   {eliminate odd active rows}
6:     A, L, U ← Reduce(A, L, U, row, level, ielim)
7: return A, L, U
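Line 4 of Alg. C.3 is left abstract. One possible realisation of this index selection, shown here only as an assumed Python sketch, keeps at a given level the rows of iBCR whose (1-based) position is a multiple of 2^{level−1}; the odd-positioned rows among them are then the ones eliminated by the inner loop.

```python
def active_rows(i_bcr, level):
    # Rows of i_bcr still active at this level of the elimination tree:
    # positions 2^(level-1), 2*2^(level-1), ... (1-based positions).
    stride = 2 ** (level - 1)
    return i_bcr[stride - 1::stride]
```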


The core operations of the reduction phase of BCR are the block row updates performed by the method Reduce given in Alg. C.4, where the full system A is supplied along with the LU block matrices L and U. The value row is passed along with ielim, telling us which row of the block tridiagonal matrix given by the indices in ielim we want to reduce towards. The method then looks at neighbouring rows based on the level of the elimination tree we are currently at, and then performs block row operations with the correct stride.

Algorithm C.4: Reduce(A, L, U, row, level, ielim)
1: h, i, j, k, l ← get the working indices
2: if i ≥ ielim[1] then   {if there is a row above}
3:   u_ij ← a_ii^{-1} a^BCR_ij
4:   ℓ_ji ← a_ji a_ii^{-1}
5:   a_jj ← a_jj − ℓ_ji a_ij
6:   if a_ih exists then   {if the row above is not the "top" row}
7:     a_jh ← −ℓ_ji a_ih
8: if k ≤ ielim[end] then   {if there is a row below}
9:   u_kj ← a_kk^{-1} a_kj
10:  ℓ_jk ← a_jk a_kk^{-1}
11:  a_jj ← a_jj − ℓ_jk a_kj
12:  if a_kl exists then   {if the row below is not the "bottom" row}
13:    a_jl ← −ℓ_jk a_kl
14: return A, L, U

C.2 BCR Production Functions

The production phase of BCR is performed via a call to ProduceBCR given in Alg. C.5. The algorithm takes as input an updated matrix A, associated LU factors, an inverse matrix G initialized with the first block inverse g_kk, and a vector of indices iBCR defining the rows/columns of A on which BCR is to be performed.

The algorithm works by traversing each level of the elimination tree (line 1), where an appropriate striding length is determined and an array of indices iprod is generated. This array is a subset of the overall array of indices iBCR, and is determined


by considering which blocks of the inverse have been computed so far. The rest of the algorithm then works through each of these production indices iprod, and calls the auxiliary methods CornerProduce and CenterProduce.

Algorithm C.5: ProduceBCR(A, L, U, G, iBCR)
1: for level = h down to 1 do
2:   stride ← 2^{level−1}
3:   iprod ← determine the rows to be produced
4:   for i = 1 up to length of iprod do
5:     kto ← iBCR[iprod[i]]
6:     if i = 1 then
7:       kfrom ← iBCR[iprod[i] + stride]
8:       G ← CornerProduce(A, L, U, G, kfrom, kto)
9:     if i ≠ 1 and i = length of iprod then
10:      if iprod[end] ≤ length of iBCR − stride then
11:        kabove ← iBCR[iprod[i] − stride]
12:        kbelow ← iBCR[iprod[i] + stride]
13:        G ← CenterProduce(A, L, U, G, kabove, kto, kbelow)
14:      else
15:        kfrom ← iBCR[iprod[i] − stride]
16:        G ← CornerProduce(A, L, U, G, kfrom, kto)
17:    if i ≠ 1 and i ≠ length of iprod then
18:      kabove ← iBCR[iprod[i] − stride]
19:      kbelow ← iBCR[iprod[i] + stride]
20:      G ← CenterProduce(A, L, U, G, kabove, kto, kbelow)
21: return G

The auxiliary methods CornerProduce and CenterProduce are given in Alg. C.6 and Alg. C.7.

Algorithm C.6: CornerProduce(A, L, U, G, kfrom, kto)
1: g_{kfrom,kto} ← −g_{kfrom,kfrom} ℓ_{kfrom,kto}
2: g_{kto,kfrom} ← −u_{kto,kfrom} g_{kfrom,kfrom}
3: g_{kto,kto} ← a_{kto,kto}^{-1} − g_{kto,kfrom} ℓ_{kfrom,kto}
4: return G


Algorithm C.7: CenterProduce(A, L, U, G, kabove, kto, kbelow)
1: g_{kabove,kto} ← −g_{kabove,kabove} ℓ_{kabove,kto} − g_{kabove,kbelow} ℓ_{kbelow,kto}
2: g_{kbelow,kto} ← −g_{kbelow,kabove} ℓ_{kabove,kto} − g_{kbelow,kbelow} ℓ_{kbelow,kto}
3: g_{kto,kabove} ← −u_{kto,kabove} g_{kabove,kabove} − u_{kto,kbelow} g_{kbelow,kabove}
4: g_{kto,kbelow} ← −u_{kto,kabove} g_{kabove,kbelow} − u_{kto,kbelow} g_{kbelow,kbelow}
5: g_{kto,kto} ← a_{kto,kto}^{-1} − g_{kto,kabove} ℓ_{kabove,kto} − g_{kto,kbelow} ℓ_{kbelow,kto}
6: return G

C.3 Hybrid Auxiliary Functions

Finally, this subsection deals with the auxiliary algorithms introduced by our hybrid method. Prior to any BCR operation, the hybrid method applies a Schur reduction to A in order to reduce it to a smaller block tridiagonal system. This reduction is handled by ReduceSchur, given in Alg. C.8, while the final production phase to generate the final block tridiagonal Trid_A{G} is done by ProduceSchur in Alg. C.9.

The Schur reduction algorithm takes the full initial block tridiagonal matrix A and, through the implicit knowledge of how the row elements of A have been assigned to processes, proceeds to reduce A into a smaller block tridiagonal system. This implicit knowledge is provided by the top and bot variables, which specify the topmost and bottommost row indices for this process.

The reduction algorithm Alg. C.8 is then split into three cases. If the process owns the topmost rows of A, it performs a corner elimination downwards. If it owns the bottommost rows, it performs a similar operation, but in an upwards manner. Finally, if it owns rows that reside in the middle of A, it performs a center reduction operation.

Finally, once the hybrid method has performed a full reduction and the subsequent BCR portion of the production, the algorithm ProduceSchur handles the production of the remaining elements of the inverse G.

Finally, once the hybrid method has performed a full reduction and a subsequentBCR portion of production, the algorithm ProduceSchur handles the productionof the remaining elements of the inverse G .

Page 163: Fast Sparse Matrix Inverse Computation

8/20/2019 Fast Sparse Matrix Inverse Computation

http://slidepdf.com/reader/full/fast-sparse-matrix-inverse-computation 163/168

C.3. HYBRID AUXILIARY FUNCTIONS 145

Algorithm C.8: ReduceSchur(A)
1: if myPID = 0 and P > 1 then   {corner eliminate downwards}
2:   for i = top + 1 up to bot do
3:     ℓ_{i,i−1} ← a_{i,i−1} a_{i−1,i−1}^{-1}
4:     u_{i−1,i} ← a_{i−1,i−1}^{-1} a_{i−1,i}
5:     a_ii ← a_ii − ℓ_{i,i−1} a_{i−1,i}
6: if myPID = P − 1 then   {corner eliminate upwards}
7:   for i = bot − 1 down to top do
8:     ℓ_{i,i+1} ← a_{i,i+1} a_{i+1,i+1}^{-1}
9:     u_{i+1,i} ← a_{i+1,i+1}^{-1} a_{i+1,i}
10:    a_ii ← a_ii − ℓ_{i,i+1} a_{i+1,i}
11: if myPID ≠ 0 and myPID ≠ P − 1 and P > 1 then   {center elim. down}
12:   for i = top + 2 up to bot do
13:     ℓ_{i,i−1} ← a_{i,i−1} a_{i−1,i−1}^{-1}
14:     ℓ_{top,i−1} ← a_{top,i−1} a_{i−1,i−1}^{-1}
15:     u_{i−1,i} ← a_{i−1,i−1}^{-1} a_{i−1,i}
16:     u_{i−1,top} ← a_{i−1,i−1}^{-1} a_{i−1,top}
17:     a_ii ← a_ii − ℓ_{i,i−1} a_{i−1,i}
18:     a_{top,top} ← a_{top,top} − ℓ_{top,i−1} a_{i−1,top}
19:     a_{i,top} ← −ℓ_{i,i−1} a_{i−1,top}
20:     a_{top,i} ← −ℓ_{top,i−1} a_{i−1,i}
21: return A, L, U


Bibliography

M. P. Anantram, Shaikh S. Ahmed, Alexei Svizhenko, Derrick Kearney, and Gerhard Klimeck. NanoFET. doi: 10254/nanohub-r1090.5, 2007. 49

K. Bowden. A direct solution to the block tridiagonal matrix inversion problem. International Journal of General Systems, 15(3):185–98, 1989. 12

J. R. Bunch and L. Kaufman. Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31:162–79, 1977. 74

J. R. Bunch and B. N. Parlett. Direct methods for solving symmetric indefinite systems of linear equations. SIAM Journal on Numerical Analysis, 8:639–55, 1971. 74

S. Datta. Electronic Transport in Mesoscopic Systems. Cambridge University Press, 1997. 3, 5, 7

S. Datta. Nanoscale device modeling: the Green's function method. Superlattices and Microstructures, 28(4):253–278, 2000. 3, 4, 8

T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2006. 12

A. M. Erisman and W. F. Tinney. On computing certain elements of the inverse of a sparse matrix. Numerical Mathematics, 18(3):177–79, 1975. 11, 20, 21

D. K. Ferry and S. M. Goodnick. Transport in Nanostructures. Cambridge University Press, 1997. 5


H. Niessner and K. Reichert. On computing the inverse of a sparse matrix. Int. J. Num. Meth. Eng., 19:1513–1526, 1983. 20

G. Peters and J. Wilkinson. Linear Algebra, Handbook for Automatic Computation, Vol. II, chapter The calculation of specified eigenvectors by inverse iteration, pages 418–439. Springer-Verlag, 1971. 10

G. Peters and J. Wilkinson. Inverse iteration, ill-conditioned equations and Newton's method. SIAM Rev., 21:339–360, 1979. 10

D. E. Petersen. Block Tridiagonal Matrices in Electronic Structure Calculations. PhD thesis, Dept. of Computer Science, Copenhagen University, 2008. 106

R. Schreiber. A new implementation of sparse Gaussian elimination. ACM Trans. Math. Software, 8, 1982. 12

J. Schröder. Zur Lösung von Potentialaufgaben mit Hilfe des Differenzenverfahrens. Angew. Math. Mech., 34:241–253, 1954. 83

G. Shahidi. SOI technology for the GHz era. IBM Journal of Research and Development, 46:121–131, 2002. 2

SIA. International Technology Roadmap for Semiconductors, 2001 Edition. Semiconductor Industry Association (SIA), 2706 Montopolis Drive, Austin, Texas 78741, 2001. URL http://www.itrs.net/Links/2001ITRS/Home.htm. 3, 47

A. Svizhenko, M. P. Anantram, T. R. Govindan, and B. Biegel. Two-dimensional quantum mechanical modeling of nanotransistors. Journal of Applied Physics, 91(4):2343–54, 2002. 4, 5, 7, 8, 9, 13, 14, 20, 21, 49, 60, 72, 83

K. Takahashi, J. Fagan, and M.-S. Chin. Formation of a sparse bus impedance matrix and its application to short circuit study. In 8th PICA Conf. Proc., pages 63–69, Minneapolis, Minn., June 4–6 1973. 10, 20, 83, 111

J. Varah. The calculation of the eigenvectors of a general complex matrix by inverse iteration. Math. Comp., 22:785–791, 1968. 10


D. Vasileska and S. S. Ahmed. Narrow-width SOI devices: The role of quantum mechanical size quantization effect and the unintentional doping on the device operation. IEEE Transactions on Electron Devices, 52:227, 2005. 2

R. Venugopal, Z. Ren, S. Datta, M. S. Lundstrom, and D. Jovanovic. Simulating quantum transport in nanoscale transistors: Real versus mode-space approaches. Journal of Applied Physics, 92(7):3730–9, Oct. 2002. 3, 4

T. J. Walls, V. A. Sverdlov, and K. K. Likharev. Nanoscale SOI MOSFETs: a comparison of two options. Solid-State Electronics, 48:857–865, 2004. 2

J. Welser, J. L. Hoyt, and J. F. Gibbons. NMOS and PMOS transistors fabricated in strained silicon/relaxed silicon-germanium structures. IEDM Technical Digest, page 1000, 1992. 2

J. Wilkinson. Inverse iteration in theory and practice. In Symposia Matematica,