
King Abdullah University of Science and Technology

H-Cholesky Factorization on Many-Core Accelerators

Gang Liao*

Tools Technical Consulting Group

Intel Asia-Pacific Research & Development Ltd.

Extreme Computing Research Center

Computer, Electrical and Mathematical Sciences & Engineering Division

King Abdullah University of Science and Technology (KAUST)

[email protected] [email protected]

Saturday 8th August, 2015

*This project was conducted while I was a summer intern at Intel and a master's student at KAUST.


Contents

1 Acknowledgments

2 Motivation
  2.1 The Problem
  2.2 Our Work
  2.3 Future Work

3 Heterogeneous Computing
  3.1 NVIDIA GPU Accelerators
    3.1.1 GPU Architecture
    3.1.2 Performance Optimization
  3.2 Intel Xeon Phi Co-Processors
    3.2.1 Why Intel Xeon Phi?
    3.2.2 Co-Processor Core
    3.2.3 Programming Model
  3.3 Summary

4 Matrix Decomposition
  4.1 LU Factorization
  4.2 QR Factorization
  4.3 Cholesky Factorization
  4.4 Summary

5 Numerical Linear Algebra Libraries
  5.1 BLAS and LAPACK
    5.1.1 BLAS
    5.1.2 LAPACK
  5.2 Intel MKL and NVIDIA cuBLAS
    5.2.1 Intel MKL
    5.2.2 NVIDIA cuBLAS
  5.3 PLASMA
    5.3.1 Dynamic Scheduling
  5.4 MAGMA
    5.4.1 Hybrid Algorithm
  5.5 Summary

6 Hierarchical Matrix
  6.1 Background
  6.2 Definitions
    6.2.1 H-Matrices
    6.2.2 Cluster Tree TI
  6.3 Implementation
    6.3.1 Inadmissible Leaves
    6.3.2 Admissible Leaves
  6.4 Hierarchical Matrix Representation
  6.5 Summary

7 H-Cholesky Factorization
  7.1 Introduction
  7.2 Profiling
    7.2.1 H-Matrix Vector Multiplication
    7.2.2 H-LU Factorization
    7.2.3 H-Cholesky Factorization
  7.3 Compiler Optimization
    7.3.1 H-Matrix Vector Multiplication
    7.3.2 H-LU Factorization
    7.3.3 H-Cholesky Factorization
  7.4 Parallel Numerical Libraries
    7.4.1 MKL
    7.4.2 PLASMA
    7.4.3 MAGMA
  7.5 Summary

8 Parallel Optimization
  8.1 Sequential H-Cholesky Factorization
  8.2 Task-based H-Cholesky Factorization
  8.3 Heterogeneous Architecture Optimization

9 Conclusion

1 Acknowledgments

I am profoundly grateful to Prof. David Keyes and Hatem Ltaief for their expert guidance and continuous encouragement, which kept this project on target from its commencement to its completion.

In 2015, I was fortunate to be a summer intern in the Technical Consulting Group at Intel Asia-Pacific Research & Development Ltd. I would like to express my deepest appreciation to the TCE team members, especially my mentor Colt Gan, whose invaluable guidance supported me in completing this project.

Finally, I must express my sincere, heartfelt gratitude to Gustavo Chavez, who helped me directly and indirectly during this work, and to Senior Research Scientist Alexander Litvinenko for sharing his unique insights into this project and for encouraging me to reach toward the impossible and make it happen.

Numerous colleagues offered information, advice, and vision. I am very grateful to the many people who helped, directly or indirectly; I thank all of them and apologize to anyone I have failed to mention. Thank you all.

Gang Liao, Tuesday, June 9, 2015


2 Motivation

2.1 The Problem

The current implementation of low-rank matrix factorization and inversion in the state-of-the-art numerical library Hlib has shown severe limitations on multicore architectures. Low-rank matrices can be represented as hierarchical matrices (H-matrices), where the diagonal blocks are dense and full rank and the off-diagonal blocks are sparse and low rank. The low numerical rank of these blocks can be exploited to reduce computational complexity and memory footprint.

For some numerical problems, the computation involving the dense diagonal blocks can be the most time-consuming, and it is important to operate on them as efficiently as possible.

2.2 Our Work

1. Integrate PLASMA and/or MAGMA into Hlib and provide the first H-matrix computations on Intel Xeon Phi and NVIDIA GPUs.

2. Compare their performance and memory footprint against other existing implementations (MKL/PLASMA/MAGMA for dense matrix computations and Hlib/HlibPro for H-matrix computations).

3. Study the paper of Kriemann [13] and implement the kernels of the first task-based hier-

archical matrix computation (e.g., Cholesky factorization) on Intel Xeon Phi and NVIDIA

GPUs.

4. Performance benchmarking and comparisons against existing CPU implementations are to

be performed.

5. If the project results are convincing, the outcome can clearly be a research paper describing

the contributions and comparing the different implementations.

2.3 Future Work

Although challenging, a natural extension of this work would be to write a parallel C-version of the

code based on tile algorithms using Pthreads, OpenMP and CUDA and to integrate it within the

HiCMA project.


3 Heterogeneous Computing

Nowadays, all computers are parallel computers. All modern computers support parallelism in

hardware through at least one parallel feature, including vector instructions, multithreaded cores,

multicore processors, multiple processors, graphics engines, and parallel co-processors. In recent

years, modern multi-core and many-core architectures have brought about a revolution in high

performance computing. Because it is now possible to incorporate more and more processor cores

into a single chip, the era of the many-core processor is not far away.

The emergence of many-core architectures, such as NVIDIA GPU [1] and Intel Xeon Phi [2],

makes it possible to reduce significantly the running time of many modern applications.

3.1 NVIDIA GPU Accelerators

It's becoming increasingly common to use a general purpose graphics processing unit as a modified form of stream processor. This turns the massive computational power of a modern graphics accelerator's shader pipeline into general purpose computing power.

Figure 1: NVIDIA CPU and GPU Programming Model.

The reason we need GPU accelerators is that CPU performance has hit several walls, for instance the power wall, the ILP wall, and the memory wall. The trend in recent years is to use an extra piece of hardware (an accelerator or coprocessor) within a server or supercomputer node to speed up part of the computation.

3.1.1 GPU Architecture

Compared with a CPU, a GPU removes several components (out-of-order control logic, branch predictors, etc.) that help a single instruction stream run fast, and it integrates many small cores into a single chip to exploit parallelism in data-intensive applications. GPU-accelerated computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU (see Fig. 1).

3.1.2 Performance Optimization

For CUDA, to optimize the performance, we have the following common strategies.

1. Minimize Thread Divergence: all threads do the same work.

2. Avoid Warp Serialization: all threads should access different shared memory banks.

3. Optimize Global Memory Access: access is sequential and aligned: coalesced.

4. Maximize Occupancy: there is a massive number of threads.

5. Hide Host to Device Communication: overlap communication with kernel execution.

3.2 Intel Xeon Phi Co-Processors

3.2.1 Why Intel Xeon Phi?

Intel Xeon Phi coprocessors are designed to extend the reach of applications that have demon-

strated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and

fully exploit available processor vector capabilities or memory bandwidth. For such applications,

the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and lo-

cal memory bandwidth, while maintaining the programmability and support associated with Intel

Xeon processors.

Most applications in the world have not been structured to exploit parallelism. This leaves a

wealth of capabilities untapped on nearly every computer system. Such applications can be ex-

tended in performance by a highly parallel device only when the application expresses a need for

parallelism through parallel programming.


3.2.2 Co-Processor Core

The first Intel Xeon Phi coprocessor was known by the code name Knights Corner early in de-

velopment. While programming does not require deep knowledge of the implementation of the

device, the cores are in-order dual-issue x86 processor cores, as shown in Fig. 2.

In practice, use of at least two threads per core is nearly always beneficial; it is much more important that applications use these multiple hardware threads on Intel Xeon Phi coprocessors than that they use hyper-threads on Intel Xeon processors. Key characteristics of the coprocessor include:

- Cores interconnected by a high-speed bidirectional ring, clocked at 1 GHz or more, and cache coherent across the entire coprocessor.
- Each core has a local 512-KB L2 cache with high-speed access to all other L2 caches, giving a collective L2 cache size of over 25 MB.
- Special instructions in addition to 64-bit x86: a uniquely wide SIMD capability via 512-bit wide vectors (instead of the narrower MMX, SSE, or AVX capabilities), high-performance support for reciprocal, square root, power, and exponent operations, and scatter/gather and streaming-store capabilities to achieve higher effective memory bandwidth.
- Special features: an on-package memory controller supporting up to 8 GB of GDDR5 (varies by part) and on-chip PCIe connect logic.

Figure 2: Architecture of a Single Intel Xeon Phi Coprocessor Core.

This design means that, while the original source code stays almost the same, it can still run on both MIC and CPU architectures.

3.2.3 Programming Model

Fig. 3 shows the several programming models that Intel supports on its heterogeneous architectures.

1. Native Model: compile the application and log in to the coprocessor Linux OS to run it directly on the MIC coprocessor.

2. Offload Model: similar to GPUs, the program is launched on the host and some chunks are executed on the coprocessor after the data has been transferred; that is, a main host program runs and offloads work to the coprocessor (see the sketch after this list).

3. Symmetric Model: use both the MIC coprocessor and the CPU as nodes to run multiple MPI processes.
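The offload model can be illustrated with a minimal C sketch (this is only an illustration, assuming the Intel C compiler with offload support; the array names, sizes, and doubling kernel are ours, not from the report). The marked region runs on the coprocessor, and the arrays listed in the in/out clauses are transferred over PCIe:

#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main(void)
{
    double *a = (double *) malloc(N * sizeof(double));
    double *b = (double *) malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) a[i] = (double) i;

    /* The block below executes on the coprocessor; a is copied in and
     * b is copied back over PCIe when the block finishes. */
    #pragma offload target(mic) in(a : length(N)) out(b : length(N))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];
    }

    printf("b[%d] = %f\n", N - 1, b[N - 1]);
    free(a);
    free(b);
    return 0;
}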


Figure 3: Several Programming Models on Intel Heterogeneous Computing Platform.

3.3 Summary

While GPUs cannot offer the programmability of an Intel Xeon Phi coprocessor, they do share a

subset of what can be accelerated by scaling combined with vectorization or bandwidth. In other

words, applications that show positive results with GPUs should always benefit from Intel Xeon Phi

coprocessors because the same fundamentals of vectorization or bandwidth must be present. The

opposite is not true. The flexibility of an Intel Xeon Phi coprocessor includes support for appli-

cations that cannot run on GPUs. This is one reason that a system built including Intel Xeon Phi

coprocessors will have broader applicability than a system using GPUs. Additionally, tuning for a GPU is generally too different from tuning for a processor to have the dual-transforming-tuning benefit we see in programming for Intel Xeon Phi coprocessors. This can lead to a substantial rise in the investment needed to be portable across many machines now and into the future.


4 Matrix Decomposition

In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is

a factorization of a matrix into a product of matrices. There are many different matrix decompo-

sitions; each finds use among a particular class of problems. In numerical linear algebra, different

decomposition methods are used to implement efficient matrix algorithms.

For instance, when solving a system of linear equations Ax = b, the matrix A can be decomposed via the LU decomposition, which factorizes a matrix into a lower triangular matrix L and an upper triangular matrix U. Similarly, the QR decomposition expresses A as QR, with Q an orthogonal matrix and R an upper triangular matrix.

4.1 LU Factorization

LU decomposition [3] (where 'LU' stands for 'lower upper', and also called LU factorization) factors

a matrix as the product of a lower triangular matrix and an upper triangular matrix. The product

sometimes includes a permutation matrix as well. The LU decomposition can be viewed as the

matrix form of Gaussian elimination. Computers usually solve square systems of linear equations

using the LU decomposition. The steps in the LU factorization of A are shown below (see Fig. 4).

Figure 4: The overview of LU factorization.

We can also derive a blocked formulation of LU. The formulas derived below are called right-looking.


$$
\begin{pmatrix} A_{11} & A_{12} & A_{13}\\ A_{21} & A_{22} & A_{23}\\ A_{31} & A_{32} & A_{33} \end{pmatrix}
=
\begin{pmatrix} L_{11} & & \\ L_{21} & I & \\ L_{31} & & I \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} & U_{13}\\ & B_{22} & B_{23}\\ & B_{32} & B_{33} \end{pmatrix}
$$

Then, B22 and B32 are

$$ B_{22} = L_{22}U_{22} = A_{22} - L_{21}U_{12}, \qquad B_{32} = L_{32}U_{22} = A_{32} - L_{31}U_{12} $$

Repeating the LU factorization on B22, we obtain L22 and U22, and then L32 = B32 U22^{-1}.
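The following C sketch illustrates this right-looking blocked formulation: the diagonal block is factorized with unblocked loops, and the triangular solves and trailing update are delegated to CBLAS (dtrsm, dgemm). This is a minimal sketch without pivoting, which production codes (e.g., dgetrf) do require; a CBLAS header and library (e.g. from MKL or OpenBLAS) are assumed.

#include <cblas.h>

/* Unblocked LU without pivoting on an n x n block, column-major, lda >= n.
 * L has unit diagonal; L and U overwrite A. */
static void lu_unblocked(int n, double *A, int lda)
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            A[i + k*lda] /= A[k + k*lda];                 /* L(i,k) */
            for (int j = k + 1; j < n; j++)
                A[i + j*lda] -= A[i + k*lda] * A[k + j*lda];
        }
}

/* Right-looking blocked LU without pivoting, block size nb. */
void lu_blocked(int n, int nb, double *A, int lda)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;   /* current block size      */
        int m  = n - k - kb;                  /* trailing submatrix size */
        double *Akk = &A[k + k*lda];

        lu_unblocked(kb, Akk, lda);           /* A11 = L11 * U11 */
        if (m > 0) {
            /* U12 = L11^{-1} * A12 (unit lower triangular solve) */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                        CblasUnit, kb, m, 1.0, Akk, lda,
                        &A[k + (k+kb)*lda], lda);
            /* L21 = A21 * U11^{-1} (upper triangular solve) */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                        CblasNonUnit, m, kb, 1.0, Akk, lda,
                        &A[(k+kb) + k*lda], lda);
            /* A22 = A22 - L21 * U12 (the B blocks above) */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, m, kb,
                        -1.0, &A[(k+kb) + k*lda], lda,
                        &A[k + (k+kb)*lda], lda,
                        1.0, &A[(k+kb) + (k+kb)*lda], lda);
        }
    }
}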

As an application, the following Fig. 5 shows the LDU decomposition of a Walsh matrix.

Figure 5: LDU decomposition of a Walsh matrix.

4.2 QR Factorization

In linear algebra, a QR decomposition [4] (also called a QR factorization) of a matrix is a decom-

position of a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular

matrix R. QR decomposition is often used to solve the linear least squares problem, and is the basis

for a particular eigenvalue algorithm, the QR algorithm. The QR algorithm for the computation

of eigenvalues, which is based on the QR-decomposition, is considered to be one of the 10 most

important algorithms of the 20th century.

There are several methods for actually computing the QR decomposition, such as by means of

the Gram–Schmidt process, Householder transformations, or Givens rotations. Each has a number

of advantages and disadvantages.


QR Factorization: A = QR, where Q^T Q = I, Q^{-1} = Q^T, Q is unitary and R is an upper triangular matrix. There are several QR algorithms: Gram-Schmidt, Givens rotations and Householder transformations. Householder reflections are the most favorable and are widely used to achieve high performance on multicore and manycore architectures.

Compute a Householder reflection Q from a vector x:

$$ v = x \pm \|x\|\, e_1, \qquad Q = I - \frac{2}{v^T v}\, v v^T \tag{1} $$

So R = Q_t \cdots Q_2 Q_1 A and Q = Q_1^T Q_2^T \cdots Q_t^T, i.e.

$$ A = (Q_1^T Q_2^T \cdots Q_t^T)(Q_t \cdots Q_2 Q_1 A) = QR \tag{2} $$
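A direct translation of Eq. (1) into C might look as follows (a minimal sketch; the function names are ours and not from any library): it computes the Householder vector v and the scalar beta = 2/(v^T v) for a column x, and applies Q = I − beta·v·v^T to a column-major matrix with plain loops.

#include <math.h>

/* Compute v = x + sign(x[0]) * ||x|| * e1 and return beta = 2 / (v^T v). */
double householder(int n, const double *x, double *v)
{
    double norm = 0.0;
    for (int i = 0; i < n; i++) norm += x[i] * x[i];
    norm = sqrt(norm);

    double sign = (x[0] >= 0.0) ? 1.0 : -1.0;
    for (int i = 0; i < n; i++) v[i] = x[i];
    v[0] += sign * norm;

    double vtv = 0.0;
    for (int i = 0; i < n; i++) vtv += v[i] * v[i];
    return (vtv > 0.0) ? 2.0 / vtv : 0.0;
}

/* Apply Q = I - beta * v * v^T to an n x m column-major matrix A (lda >= n):
 * A := A - beta * v * (v^T A). */
void apply_householder(int n, int m, double *A, int lda,
                       const double *v, double beta)
{
    for (int j = 0; j < m; j++) {
        double w = 0.0;
        for (int i = 0; i < n; i++) w += v[i] * A[i + j*lda];
        w *= beta;
        for (int i = 0; i < n; i++) A[i + j*lda] -= w * v[i];
    }
}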

If we use this matrix-vector algorithm, the amount of computation per memory element fetched from global memory is quite low: O(n^2) floating-point operations are performed on O(n^2) floating-point data. This is BLAS Level 2, and it is inefficient because it is memory bound. Instead, to reduce the share of matrix-vector operations, the blocked Householder QR algorithm combines multiple Householder reflections into a single transformation matrix and lets matrix-matrix operations (BLAS Level 3) improve the flop rate.

We can write Q = I + Y W^T, where W and Y are m × r. The whole process of the block Householder QR algorithm works as below (see Fig. 6):

Figure 6: The overview of block Householder QR algorithm.


4.3 Cholesky Factorization

In linear algebra, the Cholesky decomposition or Cholesky factorization [5] is a decomposition of a

Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate

transpose, useful for efficient numerical solutions and Monte Carlo simulations. When it is appli-

cable, the Cholesky decomposition is roughly twice as efficient as the LU decomposition for solving

systems of linear equations.

If A is a positive definite matrix, we can compute a Cholesky factorization A = LL^T, where L denotes a lower triangular matrix. We can use the following formulas to compute L:

$$ L_{j,j} = \sqrt{A_{j,j} - \sum_{k<j} L_{j,k}^2}, \qquad
   L_{i,j} = \frac{1}{L_{j,j}} \Big( A_{i,j} - \sum_{k<j} L_{i,k} L_{j,k} \Big) \quad \text{for } i > j $$
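These formulas translate directly into a simple unblocked C routine; the sketch below (function name is ours) works in place on the lower triangle of a column-major array.

#include <math.h>

/* Cholesky A = L * L^T on the lower triangle of an n x n column-major
 * matrix (lda >= n). Returns 0 on success, or j+1 if the leading minor of
 * order j+1 is not positive definite. */
int cholesky_lower(int n, double *A, int lda)
{
    for (int j = 0; j < n; j++) {
        double d = A[j + j*lda];
        for (int k = 0; k < j; k++) d -= A[j + k*lda] * A[j + k*lda];
        if (d <= 0.0) return j + 1;
        d = sqrt(d);
        A[j + j*lda] = d;                       /* L(j,j) */

        for (int i = j + 1; i < n; i++) {
            double s = A[i + j*lda];
            for (int k = 0; k < j; k++) s -= A[i + k*lda] * A[j + k*lda];
            A[i + j*lda] = s / d;               /* L(i,j) */
        }
    }
    return 0;
}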

The following Fig. 7 shows how the Cholesky factorization works in LAPACK.

Instead of the fork-join approach (panel factorization + trailing sub-matrix update), we can use the tiled Cholesky factorization to achieve fine granularity and flexibility.

$$
\begin{pmatrix} A_{11} & A_{21}^T\\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0\\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T\\ 0 & L_{22}^T \end{pmatrix}
=
\begin{pmatrix} L_{11}L_{11}^T & L_{11}L_{21}^T\\ L_{21}L_{11}^T & L_{21}L_{21}^T + L_{22}L_{22}^T \end{pmatrix}
\tag{3}
$$

Then, we can see that

$$ L_{21} = A_{21}(L_{11}^T)^{-1}, \qquad A'_{22} = A_{22} - L_{21}L_{21}^T = L_{22}L_{22}^T $$
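A minimal C sketch of how this block step maps onto standard kernels: dpotrf for the diagonal block, dtrsm for L21, and dsyrk for the trailing update, repeated down the diagonal. It assumes a CBLAS header and the Fortran LAPACK symbol dpotrf_ are available; this is an illustration, not the tiled algorithm of Fig. 7.

#include <cblas.h>

extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);

/* Blocked lower Cholesky, column-major, block size nb.
 * Returns 0 on success or the dpotrf info value on failure. */
int blocked_cholesky(int n, int nb, double *A, int lda)
{
    int info = 0;
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;
        int m  = n - k - kb;
        double *Akk = &A[k + k*lda];

        dpotrf_("L", &kb, Akk, &lda, &info);          /* A11 = L11 * L11^T  */
        if (info != 0) return info;
        if (m > 0) {
            /* L21 = A21 * L11^{-T} */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, m, kb, 1.0, Akk, lda,
                        &A[(k+kb) + k*lda], lda);
            /* A22' = A22 - L21 * L21^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, m, kb,
                        -1.0, &A[(k+kb) + k*lda], lda,
                        1.0, &A[(k+kb) + (k+kb)*lda], lda);
        }
    }
    return 0;
}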


Figure 7: Block (tiled) Cholesky factorization: (i) initial factorization, (ii) triangular solve, (iii) symmetric rank update, (iv) recursive descent.

4.4 Summary

In the real world, it is not feasible to carry out most matrix computations in an explicit, optimal way, such as matrix inversion, computing determinants, solving linear systems, and least-squares fitting. Converting a difficult matrix computation into several easier tasks, such as solving triangular or diagonal systems, therefore greatly facilitates the calculations. Data matrices representing numerical observations, such as proximity or correlation matrices, are often huge and hard to analyze; decomposing them into lower-order or lower-rank canonical forms reveals the inherent characteristics and structure of the matrices and helps to interpret their meaning readily.


5 Numerical Linear Algebra Libraries

In order to utilize the capability of modern computers, scientists and engineers from both academia and industry have developed many standard numerical libraries. Each new generation of computer hardware brings new libraries into the high performance computing community.

5.1 BLAS and LAPACK

5.1.1 BLAS

The BLAS (Basic Linear Algebra Subprograms) [6] are routines that provide standard building

blocks for performing basic vector andmatrix operations. The Level 1 BLAS perform scalar, vector

and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level

3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely

available, they are commonly used in the development of high quality linear algebra software,

LAPACK for example.

Level-1 BLAS: vector-vector operations, O(n). Level-2 BLAS: matrix-vector operations, O(mn). Level-3 BLAS: matrix-matrix operations, O(kmn).

Figure 8: The performance of Level 1, 2 and 3 BLAS.
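As a concrete illustration of the three BLAS levels just described, the minimal C sketch below calls one routine from each level (it assumes a CBLAS header and library are available, e.g. from MKL or OpenBLAS; the data values are arbitrary):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};   /* 3x3 identity, column-major */
    double B[9] = {1, 2, 3,  4, 5, 6,  7, 8, 9};
    double C[9] = {0};

    cblas_daxpy(3, 2.0, x, 1, y, 1);                   /* Level 1: y = 2*x + y, O(n)  */
    cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3,
                1.0, A, 3, x, 1, 0.0, y, 1);           /* Level 2: y = A*x,     O(mn) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 3, 3, 3,
                1.0, A, 3, B, 3, 0.0, C, 3);           /* Level 3: C = A*B,     O(kmn) */

    printf("y[0] = %f, C[0] = %f\n", y[0], C[0]);
    return 0;
}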


5.1.2 LAPACK

LAPACK (Linear Algebra Package) [7], as shown in Fig. 9, is a standard software library for nu-

merical linear algebra. It provides routines for solving systems of linear equations and linear least

squares, eigenvalue problems, and singular value decomposition. It also includes routines to im-

plement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition.

The routines handle both real and complex matrices in both single and double precision. LAPACK

routines are written so that as much as possible of the computation is performed by calls to the Basic

Linear Algebra Subprograms (BLAS).

Figure 9: Software Library: LAPACK.

5.2 Intel MKL and NVIDIA cuBLAS

5.2.1 Intel MKL

Intel Math Kernel Library (Intel MKL) [8] accelerates math processing routines that increase ap-

plication performance and reduce development time. Intel MKL includes highly vectorized and

threaded Linear Algebra, Fast Fourier Transforms (FFT), Vector Math and Statistics functions.

The easiest way to take advantage of all of that processing power is to use a carefully optimized

computing math library. Even the best compiler can’t compete with the level of performance pos-

sible from a hand-optimized library. If your application already relies on the BLAS or LAPACK

functionality, simply re-link with Intel MKL to get better performance on Intel and compatible

architectures.

Because Intel has done the engineering on these ready-to-use, royalty-free functions, there will

be more time to add features which customers request. Using Intel MKL can save development,


debug and maintenance time in the long run because today's code will run optimally on future

generations of Intel processors with minimal effort.

Figure 10: Intel Math Kernel Library.

5.2.2 NVIDIA cuBLAS

The cuBLAS library [9] is an implementation of BLAS (Basic Linear Algebra Subprograms) on

top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of

NVIDIA Graphics Processing Unit (GPU).

The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library that delivers 6x to 17x faster performance than the latest MKL BLAS (see Fig. 11), since the GPU has more stream cores than the Xeon Phi.

Figure 11: NVIDIA cuBLAS VS Intel MKL.


5.3 PLASMA

As shown in Fig. 12, the Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) [10] project aims to address the critical and highly disruptive situation facing the Linear Algebra and High Performance Computing community due to the introduction of multi-core architectures.

PLASMA's ultimate goal is to create software frameworks that enable programmers to sim-

plify the process of developing applications that can achieve both high performance and portability

across a range of new architectures.

The development of programming models that enforce asynchronous, out of order scheduling

of operations is the concept used as the basis for the definition of a scalable yet highly efficient

software framework for Computational Linear Algebra applications.

Figure 12: PLASMA: MODERN SOFTWARE STACK.

5.3.1 Dynamic Scheduling

PLASMA relies on runtime scheduling of parallel tasks. Runtime scheduling is based on the idea

of assigning work to cores based on the availability of data for processing at any given point in time,

and thus is also referred to as data-driven scheduling. The concept is related closely to the idea

of expressing computation through a task graph, often referred to as the DAG (Directed Acyclic

Graph), and the flexibility of exploring the DAG at runtime. This is in direct opposition to the

fork-and-join scheduling, where artificial synchronization points expose serial sections of the code

and multiple cores are idle while sequential processing takes place.


5.4 MAGMA

The MAGMA (Matrix Algebra on GPU andMulticore Architectures) [11] project aims to develop

a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, start-

ing with current "Multicore + GPU" systems.

The MAGMA research is based on the idea that, to address the complex challenges of the

emerging hybrid environments, optimal software solutions will themselves have to hybridize, com-

bining the strengths of different algorithms within a single framework. Building on this idea, we

aim to design linear algebra algorithms and frameworks for hybrid manycore and GPUs systems

that can enable applications to fully exploit the power that each of the hybrid components offers.

MAGMA allows applications to fully exploit the power of current heterogeneous systems of

multi/many-core CPUs and multi-GPUs to deliver the fastest possible time to accurate solution

within given energy constraints. From Fig. 13, MAGMA supports CUDA, OpenCL and Intel

Xeon Phi to accelerate dense linear algebra library.

Figure 13: MAGMA: FEATURES AND PERFORMANCE.

5.4.1 Hybrid Algorithm

MAGMA uses a hybridization methodology where algorithms of interest are split into tasks of vary-

ing granularity and their execution scheduled over the available hardware components. Scheduling

can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path,


are scheduled on the CPU, and larger more parallelizable ones, often Level 3 BLAS, are scheduled

on the GPU.

5.5 Summary

Nowadays, we have various high performance linear algebra libraries for different computer hardware, which can be used to solve specific problems efficiently. We can also take advantage of modern multicore processors and accelerators to greatly improve the performance of scientific computing in areas such as the life and material sciences, bioinformatics, weather forecasting models, and even machine learning and data analytics.


6 Hierarchical Matrix

6.1 Background

Hierarchical matrices (H-matrices) [12] are a powerful tool to represent dense matrices coming

from integral equations or partial differential equations in a hierarchical, block-oriented, data-

sparse way with log-linear memory costs. Furthermore, a matrix arithmetic, e.g., matrix addition,

multiplication, inversion and LU factorization, is possible with log-linear computation costs.

6.2 Definitions

6.2.1 H-Matrices

Let I be an index set and n = #I its cardinality. The basic building block of an H-matrix is the

hierarchical partitioning of I in the form of a cluster tree (or H-tree):

Definition 1. The tree TI = (V , E) is called a cluster tree over I if

1. I is the root of TI

2. for all vertices v ∈ V, either v = ⋃_{v′ ∈ S(v)} v′ or v is a leaf of TI.

Here, S(v) is the set of sons of v. The set of leaves of TI is denoted by L(TI). A node in TI is also called a cluster, and we write t ∈ TI if t ∈ V. Nodes in the cluster tree with the same distance from the root are said to be on the same level of the tree. Finally, let T_I^l := {t ∈ TI : level(t) = l}.

6.2.2 Cluster Tree TI

The candidates t, s ⊂ I for the construction of the partition of I × I will be stored in a so-called cluster tree TI. The root of the tree TI is the index set I_1^(0) := {0, ..., n − 1}. For ease of presentation we assume the number n of basis functions to be a power of two: n = 2^p.

1. The two successors of I_1^(0) are I_1^(1) := {0, ..., n/2 − 1} and I_2^(1) := {n/2, ..., n − 1}.

2. The two successors of I_1^(1) are I_1^(2) := {0, ..., n/4 − 1} and I_2^(2) := {n/4, ..., n/2 − 1}.

3. The two successors of I_2^(1) are I_3^(2) := {n/2, ..., 3n/4 − 1} and I_4^(2) := {3n/4, ..., n − 1}.


Figure 14: The cluster tree TI for p = 3, on the left abstract and on the right concrete.

Each subsequent node t with more than nmin indices has exactly two successors: the first

contains the first half of its indices, the second one the second half. Nodes with not more than

nmin indices are leaves. The parameter nmin controls the depth of the tree. For nmin = 1 we

get the maximal depth. However, for practical purposes (e.g., if the rank k is larger) we might want

to set nmin = 2k or nmin = 16.
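This cardinality-balanced splitting rule can be sketched in a few lines of C (a hypothetical illustration, not HLib code; the struct and function names are ours):

#include <stdlib.h>

typedef struct _cluster cluster;
struct _cluster {
    int first, last;          /* index range [first, last] of this cluster */
    cluster *son[2];          /* two successors, or NULL for a leaf        */
};

/* Split {first, ..., last} in half until at most nmin indices remain. */
cluster *build_cluster_tree(int first, int last, int nmin)
{
    cluster *t = (cluster *) malloc(sizeof(cluster));
    t->first = first;
    t->last  = last;
    t->son[0] = t->son[1] = NULL;

    if (last - first + 1 > nmin) {               /* more than nmin indices */
        int mid = first + (last - first) / 2;
        t->son[0] = build_cluster_tree(first, mid, nmin);     /* first half  */
        t->son[1] = build_cluster_tree(mid + 1, last, nmin);  /* second half */
    }
    return t;
}

/* Example: build_cluster_tree(0, n - 1, 16) builds TI over I = {0, ..., n-1}. */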

The cluster tree TI defines the set of subsets of I which may be used to partition the product index

set I×I and hence, the matrix index set. For anH-matrix to be efficient, the partitioning has to be

restricted to a minimal set of block clusters t× s with t, s ∈ TI , such that t× s is either small, i.e.,

t, s ∈ L(TI), or admissible. The admissibility of a node is in general defined by the corresponding

application. For typical problems involving partial differential equations or integral equations the

classical admissibility condition is given by:

min(diam(t), diam(s)) ≤ η dist(t, s)    (4)

Here, diam(.) and dist(., .) denote the diameter of and distance between geometrical entities

associated with the clusters, e.g., the support of basis functions. Using the admissibility condition,

the actual partitioning of I × I can be constructed. For this, a cluster tree over I × I is built, with

its leaves forming the partition.

One possible method is to test blocks level by level, starting with the root I of the tree TI and descending in the tree. The tested blocks are stored in a so-called block cluster tree TI×I whose leaves form a partition of the index set I × I. The construction is called with the parameters BuildBlockClusterTree(I, I) and proceeds as follows:

procedure BuildBlockClusterTree(cluster t, cluster s)
    if (t, s) is admissible then
        S(t × s) := ∅
    else
        S(t × s) := { t′ × s′ | t′ ∈ S(t) and s′ ∈ S(s) }
        for t′ ∈ S(t) and s′ ∈ S(s) do
            BuildBlockClusterTree(t′, s′)
        end for
    end if

The resulting tree TI×I is a quadtree, but its leaves may lie on different levels, which is not the case for the binary tree TI. When p = 4 and p = 5, the structure of the partition is similar, as shown in Fig. 15.

Figure 15: When p = 4 and p = 5, the structures of H-matrix.

6.3 Implementation

The product index set I × I resolves into admissible and inadmissible leaves of the tree TI×I .

The assembly, storage and matrix-vector multiplication differ for the corresponding two classes of

submatrices.

6.3.1 Inadmissible Leaves

Definition 2. (fullmatrix Representation) An n×m matrix M is said to be stored in fullmatrix represen-

tation if the entries Mij are stored as real numbers (double) in an array of length nm in the order M11, ...,Mn1,

M12, ..., Mn2, ..., M1m, ..., Mnm (column-wise).


The fullmatrix representation in the C programming language might look as follows:

typedef struct _fullmatrix fullmatrix;
typedef fullmatrix *pfullmatrix;

struct _fullmatrix {
    int rows;
    int cols;
    double* e;
};

The array e has to be allocated and deallocated dynamically in the constructor and destructor:

pfullmatrix new_fullmatrix(int rows, int cols){
    int i;
    pfullmatrix f = (pfullmatrix) malloc(sizeof(fullmatrix));
    f->rows = rows;
    f->cols = cols;
    f->e = (double*) malloc(rows*cols*sizeof(double));

    for(i=0; i<rows*cols; i++)
        f->e[i] = 0.0;
    return f;
}

void del_fullmatrix(pfullmatrix f){
    if(f->e)
        free(f->e);
    free(f);
    f = 0x0;
}

The entry Mij is stored at the position f->e[i + j * f->rows]. The ordering of the matrix

entries in the fullmatrix representation is the same ordering that is used in standard linear algebra

packages (BLAS, LAPACK, MATLAB, etc.). Therefore, procedures from these libraries can be

called without complications, e.g., the matrix-vector multiplication can be performed by calling the

standard BLAS subroutine dgemv.
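For example, a matrix-vector product y = M x for a fullmatrix can be delegated to dgemv through the standard Fortran calling convention; a minimal sketch (the prototype is declared by hand here, the wrapper name is ours, and error checking is omitted):

extern void dgemv_(const char *trans, const int *m, const int *n,
                   const double *alpha, const double *a, const int *lda,
                   const double *x, const int *incx, const double *beta,
                   double *y, const int *incy);

/* y := f * x, where f is stored column-wise, so lda = f->rows. */
void fullmatrix_mv(pfullmatrix f, const double *x, double *y)
{
    double alpha = 1.0, beta = 0.0;
    int incx = 1, incy = 1;
    dgemv_("N", &f->rows, &f->cols, &alpha, f->e, &f->rows,
           x, &incx, &beta, y, &incy);
}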

6.3.2 Admissible Leaves

Definition 3. (rkmatrix Representation) An n×m matrix M of rank at most k is said to be stored in rkmatrix representation if it is stored in factorised form M = AB^T, where the two matrices A ∈ R^{n×k} and B ∈ R^{m×k} are both stored as an array (column-wise).


The rkmatrix representation is implemented in the C programming language as follows:

typedef struct _rkmatrix rkmatrix;
typedef rkmatrix *prkmatrix;

struct _rkmatrix {
    int k;
    int kt;
    int rows;
    int cols;
    double* a;
    double* b;
};

The arrays a and b have to be allocated and deallocated dynamically in the constructor and

destructor:

prkmatrix new_rkmatrix(int k, int rows, int cols){
    int i;
    prkmatrix r = (prkmatrix) malloc(sizeof(rkmatrix));
    r->k = k;
    r->kt = 0;
    r->rows = rows;
    r->cols = cols;
    r->a = 0x0;
    r->b = 0x0;
    if(k>0){
        r->a = (double*) malloc(k*rows*sizeof(double));
        for(i=0; i<rows*k; i++)
            r->a[i] = 0.0;

        r->b = (double*) malloc(k*cols*sizeof(double));
        for(i=0; i<cols*k; i++)
            r->b[i] = 0.0;
    }
    return r;
}

void del_rkmatrix(prkmatrix r){
    free(r->a);
    free(r->b);
    free(r);
}
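The factorised form M = AB^T means that a matrix-vector product needs only two dgemv calls and a k-dimensional temporary: first t = B^T x, then y = y + A t. A minimal sketch (the wrapper name is ours), reusing the dgemv_ prototype declared earlier and assuming, as in HLib, that the field kt holds the rank currently in use:

#include <stdlib.h>

/* y := y + (A * B^T) * x for an rkmatrix r. */
void rkmatrix_mv(prkmatrix r, const double *x, double *y)
{
    double one = 1.0, zero = 0.0;
    int incx = 1, incy = 1;
    int k = r->kt;                          /* rank currently in use */
    if (k <= 0) return;

    double *t = (double *) malloc(k * sizeof(double));
    /* t = B^T * x : B is cols x k, stored column-wise (lda = cols) */
    dgemv_("T", &r->cols, &k, &one, r->b, &r->cols, x, &incx, &zero, t, &incy);
    /* y = y + A * t : A is rows x k, stored column-wise (lda = rows) */
    dgemv_("N", &r->rows, &k, &one, r->a, &r->rows, t, &incx, &one, y, &incy);
    free(t);
}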


6.4 Hierarchical Matrix Representation

Definition 4. (H-matrix Representation) Let TI×I be a block cluster tree for the index set I. A matrix M ∈ H(TI×I, k) is said to be stored in H-matrix representation if the submatrices corresponding to inadmissible leaves are

stored in fullmatrix representation and those corresponding to admissible leaves are stored in rkmatrix representation.

Each block t × s in the tree TI×I can be

1. a leaf - then the corresponding matrix block is represented by a fullmatrix or rkmatrix;

2. not a leaf - then the block t × s is decomposed into its sons t′ × s′ with t′ ∈ S(t) and s′ ∈ S(s). This means that the matrix corresponding to the block t × s is a supermatrix that consists of submatrices corresponding to t′ × s′.

The supermatrix structure in the C programming language is implemented as follows:

typedef struct _supermatrix supermatrix;
typedef supermatrix *psupermatrix;

struct _supermatrix {
    int rows;
    int cols;
    int block_rows;
    int block_cols;
    prkmatrix r;
    pfullmatrix f;
    psupermatrix* s;
};
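A matrix-vector product with a supermatrix then becomes a recursion over the tree, calling the dense or low-rank kernel at the leaves. The sketch below is hypothetical (it is not HLib's eval routine) and assumes the submatrix array s is stored block-column-wise, i.e. s[i + j*block_rows]; it reuses the dgemv_ prototype and the rkmatrix_mv sketch from above.

/* y := y + M * x over the supermatrix tree (illustrative sketch only). */
void supermatrix_mv(psupermatrix M, const double *x, double *y)
{
    if (M->f) {                               /* inadmissible leaf: dense block */
        double one = 1.0;
        int inc = 1;
        dgemv_("N", &M->f->rows, &M->f->cols, &one, M->f->e, &M->f->rows,
               x, &inc, &one, y, &inc);
        return;
    }
    if (M->r) {                               /* admissible leaf: low-rank block */
        rkmatrix_mv(M->r, x, y);
        return;
    }
    int yoff = 0;                             /* row offset of block row i */
    for (int i = 0; i < M->block_rows; i++) {
        int xoff = 0;                         /* column offset of block column j */
        for (int j = 0; j < M->block_cols; j++) {
            psupermatrix sub = M->s[i + j * M->block_rows];
            supermatrix_mv(sub, x + xoff, y + yoff);
            xoff += sub->cols;
        }
        yoff += M->s[i]->rows;                /* height of block row i */
    }
}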

6.5 Summary

The implementation of an H-matrix is a tree whose nodes are implemented as supermatrix structures. The tree structure coincides with the structure given by the block cluster tree TI×I (successors correspond to submatrices), and the submatrices corresponding to admissible or inadmissible leaves are stored in the rkmatrix or fullmatrix format, respectively.


7 H-Cholesky Factorization

7.1 Introduction

The H-Cholesky factorization is formulated as a recursive block algorithm, similar to standard blocked algorithms for dense matrices. Besides the actual recursion, the algorithm is based on H-matrix multiplication. Furthermore, an intrinsic sequential order, e.g., for the diagonal blocks, is present in the algorithm; therefore, only the matrix multiplications can be used for parallelization.

To fix this problem, the panel is further split into square submatrices, or tiles. As soon as a tile in a panel is finished, the update operations for blocks in the trailing submatrix can be started, thereby eliminating the synchronization problem.

In such a dense Cholesky factorization, each operation, e.g., the Cholesky factorization of a diagonal tile or the update of a block via matrix multiplication, forms a single task, i.e., an atomic unit of computation with data dependencies. The tasks and their dependencies form a directed acyclic graph (DAG), which can be used to efficiently schedule the tasks onto the different cores. Due to the high amount of work per task, these algorithms are also especially suited for many-core architectures such as MIC and GPU. However, we first need profiling results to support this claim.

7.2 Profiling

For profiling, both the problem size and the leaf size of the H-matrix should be considered. The problem size describes the scale of the problem to be solved, and the leaf size impacts the depth of the recursion, that is, the size of the full (dense) matrix blocks. Since the full matrix blocks are dense, we can use common libraries (MKL, PLASMA and MAGMA) to solve the matrix decomposition problem efficiently. As Fig. 16 suggests, the performance will be significantly improved when the full matrix part is accelerated.

In order to make this convincing, matrix-vector multiplication, LU and Cholesky decomposition are evaluated on H-matrices. In the source code, nmin is the leaf size and vertices is the problem size.

[Figure content: the H-matrix is generated recursively by build_1d_supermatrix / buildpart_1d_supermatrix; check_1d_admissible classifies each block, so that admissible blocks are built with build_1d_rkmatrix (via get_1dtaylorentry_a), small inadmissible blocks with build_1d_fullmatrix (via get_1d_entry), and large inadmissible blocks recurse through new_supermatrix.]

Figure 16: The generation process of H-matrix.

7.2.1 H-Matrix Vector Multiplication

nmin   vertices   time on full matrix (sec)   time on rk matrix (sec)   storage (MB)
32     402        0.009                       0.0018                    2.0
32     10002      0.17                        0.67                      75.5
32     40002      0.71                        3.49                      270.1
32     160002     2.68                        17.1203                   971.8


nmin   vertices   time on full matrix (sec)   time on rk matrix (sec)   storage (MB)
32     10002      0.07                        0.43                      75.5
64     10002      0.09                        0.43                      77.8
128    10002      0.17                        0.33                      85.6
256    10002      0.24                        0.36                      104.4
512    10002      0.57                        0.25                      141.1
1024   10002      0.82                        0.30                      209.5

Figure 17: H-Matrix-Vector Multiplication for H when nmin = 32.


Figure 18: H-Matrix-Vector Multiplication for H when vertices = 10002.

7.2.2 H-LU Factorization

nmin   vertices   time on full matrix (sec)   time on rk matrix (sec)   storage (MB)
32     402        0.02                        0.08                      1.074
32     10002      0.38                        7.36                      52.057
32     40002      1.60                        43.8282                   251.296
32     160002     5.73                        245.6178                  1183.861


nmin   vertices   time on full matrix (sec)   time on rk matrix (sec)   storage (MB)
32     10002      1.85                        5.75                      52.057
64     10002      1.87                        5.54                      52.452
128    10002      2.02                        4.44                      56.904
256    10002      2.35                        3.93                      67.680
512    10002      3.79                        3.88                      97.484
1024   10002      7.27                        4.26                      142.001

Figure 19: H-LU factorization for HMatrix when nmin = 32.


Figure 20: H-LU factorization for HMatrix when vertices = 10002.

7.2.3 H-Cholesky Factorization

nmin   vertices   time on full matrix (sec)   time on rk matrix (sec)   storage (MB)
32     402        0.01                        0.05                      0.563
32     10002      0.29                        4.49                      27.022
32     40002      0.96                        25.9104                   132.142
32     160002     3.95                        143.0776                  625.050


nmin   vertices   time on full matrix (sec)   time on rk matrix (sec)   storage (MB)
32     10002      0.29                        4.49                      27.022
64     10002      0.39                        3.83                      29.589
128    10002      0.86                        3.21                      35.693
256    10002      1.69                        2.90                      47.210
512    10002      4.01                        2.68                      76.393
1024   10002      8.15                        2.38                      120.190

Figure 21: Cholesky factorization for HMatrix when nmin = 32.


Figure 22: Cholesky factorization for HMatrix when vertices = 10002.

From the above evaluation, when the leaf size is large enough, the full matrix blocks occupy the majority of the execution time. As discussed in the previous sections, full (dense) matrices can be accelerated via multi-core or many-core architectures; thus, the idea is convincing.

7.3 Compiler Optimization

Either gcc or icc can be used to compile the Hlib software with optimizations. With optimizations enabled, the compiler attempts to improve performance or code size, but compilation becomes more costly and debugging is not always possible.

1. For gcc, the following results use only the original compilation (no extra optimization flags).

2. For icc opt1, icc is used with optimizations such as -O2.

3. For icc opt2, icc is used with optimizations such as -msse4.2 -O3.

4. For MKL, -pthread -openmp should be added.


7.3.1 H-Matrix Vector Multiplication

[Plot: time (sec) versus leaf size (nmin) for gcc, icc opt1, icc opt2, icc + MKL (4 threads), and icc + MKL (8 threads).]

Figure 23: H-Matrix Vector Multiplication where vertices = 10002.

7.3.2 H-LU Factorization

[Plot: time (sec) versus leaf size (nmin) for gcc, icc opt1, icc opt2, icc + MKL (4 threads), and icc + MKL (8 threads).]

Figure 24: H-LU Decomposition where vertices = 10002.


7.3.3 H-Cholesky Factorization

[Plot: time (sec) versus leaf size (nmin) for gcc, icc opt1, icc opt2, icc + MKL (4 threads), and icc + MKL (8 threads).]

Figure 25: H-Cholesky Decomposition where vertices = 10002.

7.4 Parallel Numerical Libraries

Besides compiler optimization, high performance numerical libraries (MKL, PLASMA, MAGMA) can also be used to accelerate the H-matrix decomposition.

7.4.1 MKL

Hlib can be linked with MKL for efficient compilation and execution.

./configure --prefix=$HOME/HLib --enable-examples --enable-optimize \
    --with-blas-libs="-mkl=parallel" --with-blas-ldflags="-mkl=parallel"

7.4.2 PLASMA

To accelerate the H-matrix decomposition on multicore processors, PLASMA is used to replace the general BLAS functions in the H-matrix factorization.


./configure --prefix=$HOME/Hlib --enable-examples --enable-optimize \
    --with-blas-libs="-mkl=parallel" --with-lapack-libs="-mkl=parallel" \
    --with-blas-ldflags="-mkl=parallel" \
    CFLAGS="-L/home/liao/plasma-installer_2.6.0/install/lib -lm -lplasma \
    -lcoreblasqw -lcoreblas -lquark -llapacke -lhwloc -DDBGQUARK \
    -I/home/liao/plasma-installer_2.6.0/install/include"

For H-Cholesky Factorization, dpotrf_ can be replaced by PLASMA_dpotrf.

// dpotrf_("Lower part", &f->rows, f->e, &f->cols, &info);
info = PLASMA_dpotrf(PlasmaLower, f->rows, f->e, f->cols);

The number of threads can be adjusted using the initialization statement.

PLASMA_Init(THREADS);
printf("-- PLASMA is initialized to run on %d cores. \n", THREADS);
/* ... code ... */
PLASMA_Finalize();

7.4.3 MAGMA

To accelerate the H-matrix decomposition on a heterogeneous system (CPUs and GPUs), MAGMA is used to replace the general BLAS functions in the H-matrix factorization.

./configure --prefix=$HOME/Hlib_MAGMA --enable-examples --enable-optimize \
    --with-blas-libs="-L/home/liao/magma-1.6.1/lib -L/usr/local/cuda/lib64 -lmagma \
    -lcublas -lcudart" --with-lapack-libs="-mkl=parallel" --with-blas-ldflags="-mkl=parallel" \
    CFLAGS="-DADD_ -DHAVE_CUBLAS -I/home/liao/magma-1.6.1/include \
    -I/usr/local/cuda/include -I/home/liao/magma-1.6.1/testing"

For the H-Cholesky factorization, dpotrf_ can be replaced by magma_dpotrf.

magma_init();
// dpotrf_("Lower part", &f->rows, f->e, &f->cols, &info);
magma_dpotrf(MagmaLower, f->rows, f->e, f->cols, &info);
magma_finalize();

The results show a significant improvement when PLASMA and MAGMA, respectively, are integrated into Hlib; see Fig. 26.

[Plot: time (sec) versus leaf size (nmin) for MKL, PLASMA, and MAGMA.]

Figure 26: H-Cholesky Decomposition where vertices = 10002.

7.5 Summary

This section shows that the H-matrix decomposition has an almost optimal parallel scaling behavior on heterogeneous parallel architectures.


8 Parallel Optimization

Each operation, e.g., the Cholesky factorization of a diagonal tile or the update of a block via matrix multiplication, forms a single task, i.e., an atomic unit of computation with data dependencies. The tasks and their dependencies form a directed acyclic graph (DAG), which can be used to efficiently schedule the tasks onto the different cores. The concept of task-based DAG computation is used to split the H-Cholesky factorization into single tasks and to define the corresponding dependencies to form a DAG. This task/DAG-based algorithm is able to utilize parallel CPUs much more efficiently compared to the classical recursive algorithm and, in particular, demonstrates an optimal parallel scaling behavior on many-core systems [13].

8.1 Sequential H-Cholesky Factorization

In this section, we discuss the computation of the Cholesky factorization A = LLT where A is an

n× n symmetric positive definite matrix and L is an n× n lower triangular matrix.

Let t × t ∈ TI×I with S(t) = {t0, ..., t_{l−1}}. Then, due to the block structure of the matrices A|t×t and L|t×t, we have

$$
A|_{t\times t} =
\begin{pmatrix}
A_{t_0\times t_0} & \cdots & A_{t_0\times t_{l-1}}\\
\vdots & \ddots & \vdots\\
A_{t_{l-1}\times t_0} & \cdots & A_{t_{l-1}\times t_{l-1}}
\end{pmatrix}
=
\begin{pmatrix}
L_{t_0\times t_0} & & \\
\vdots & \ddots & \\
L_{t_{l-1}\times t_0} & \cdots & L_{t_{l-1}\times t_{l-1}}
\end{pmatrix}
\begin{pmatrix}
L_{t_0\times t_0}^T & \cdots & L_{t_{l-1}\times t_0}^T\\
 & \ddots & \vdots\\
 & & L_{t_{l-1}\times t_{l-1}}^T
\end{pmatrix}
$$

which is equivalent to A|t×t = L|t×t L|^T_{t×t}. Algorithm 1 shows how the recursive H-Cholesky factorization Cholesky(A|t×t) works.

Unfortunately, the algorithm only provides very limited opportunities for parallelization. In the

case of binary cluster trees, the standard case in practical applications, there are just two matrix

solves (TriangularSolve and SymmetricRankUpdate) to be performed in parallel in Algorithm 1. Only the

matrix updates themselves can be computed with optimal parallel scaling. Especially the strict, local

execution order together with the recursive nature of these algorithms inhibits parallel execution.

Therefore, the achievable parallel speedup is small.


Input: A|t×t
Output: L|t×t

1   for i = 0; i ≤ l − 1; i++ do
2       Cholesky(A|ti×ti);
3       for j = i + 1; j ≤ l − 1; j++ do
4           L|tj×ti = A|tj×ti · L|ti×ti^{−T};             // TriangularSolve(A|tj×ti)
5       end
6       for j = i + 1; j ≤ l − 1; j++ do
7           for k = i + 1; k ≤ j; k++ do
8               A|tj×tk = A|tj×tk − L|tj×ti · L|tk×ti^T;   // SymmetricRankUpdate(A|tj×tk)
9           end
10      end
11  end

Algorithm 1: Recursive H-Cholesky factorization: Cholesky(A|t×t).

However, it can also be extended to a recursive blocked (tiled) H-Cholesky factorization to achieve higher performance.

8.2 Task-based H-Cholesky Factorization

Except for the recursion in the case of H-matrices, the block-wise dense Cholesky factorization

and the H-Cholesky factorization in Algorithm 1 are identical if the block index is replaced by the

corresponding index set.

Algorithm 2 constructs a DAG for the H-Cholesky factorization, where wrapping each function into a

task(·) construct yields an algorithm to define the set of all tasks for the dense Cholesky factorization.

Still missing are the dependencies between these tasks, in the following denoted by “→”, i.e., task1

→ task2 implies task2 may only start after task1 has finished.

Algorithm 2 execution using a DAG has the advantage that any task may be scheduled for ex-

ecution as soon as all its dependencies are met, thereby overlapping diagonal factorization tasks

with off-diagonal solve or update tasks. Therefore, for the task-based algorithm the execution may

follow the anti-diagonal of the matrix (Fig. 27, right), while the recursive algorithm has to finish

diagonal matrix blocks before computing off-diagonal matrix blocks (Fig. 27, left).


Input: A|t×t
Output: L|t×t

1   for i = 0; i ≤ l − 1; i++ do
2       Task(Cholesky(A|ti×ti));
3       parallel for j = i + 1; j ≤ l − 1; j++ do
4           L|tj×ti = A|tj×ti · L|ti×ti^{−T};             // Task(TriangularSolve(A|tj×ti))
5           Task(Cholesky(A|ti×ti)) → Task(TriangularSolve(A|tj×ti))
6       end
7       parallel for j = i + 1; j ≤ l − 1; j++ do
8           parallel for k = i + 1; k ≤ j; k++ do
9               A|tj×tk = A|tj×tk − L|tj×ti · L|tk×ti^T;   // Task(SymmetricRankUpdate(A|tj×tk))
10              Task(TriangularSolve(A|tj×ti)) → Task(SymmetricRankUpdate(A|tj×tk))
11          end
12      end
13  end

Algorithm 2: Construction of a DAG for the H-Cholesky factorization: Cholesky(A|t×t).
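A minimal sketch of how the dependencies of Algorithm 2 can be expressed with OpenMP task dependencies for a tiled dense Cholesky factorization. The tile kernels chol_tile, trsm_tile, syrk_tile, and gemm_tile are hypothetical wrappers around the dpotrf/dtrsm/dsyrk/dgemm calls discussed earlier; a[i][j] points to the (i, j) tile, i ≥ j, each of size nb × nb.

#include <omp.h>

/* hypothetical tile kernels (wrappers around LAPACK/BLAS calls) */
void chol_tile(int nb, double *akk);
void trsm_tile(int nb, const double *akk, double *aik);
void syrk_tile(int nb, const double *aik, double *aii);
void gemm_tile(int nb, const double *aik, const double *ajk, double *aij);

/* p x p tiles; the depend clauses reproduce the dependencies of Algorithm 2. */
void tiled_cholesky(int p, int nb, double *a[p][p])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < p; k++) {
        #pragma omp task depend(inout: a[k][k])
        chol_tile(nb, a[k][k]);                      /* Cholesky(A[k][k])        */

        for (int i = k + 1; i < p; i++) {
            #pragma omp task depend(in: a[k][k]) depend(inout: a[i][k])
            trsm_tile(nb, a[k][k], a[i][k]);         /* TriangularSolve(A[i][k]) */
        }
        for (int i = k + 1; i < p; i++) {
            #pragma omp task depend(in: a[i][k]) depend(inout: a[i][i])
            syrk_tile(nb, a[i][k], a[i][i]);         /* diagonal rank update     */
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: a[i][k], a[j][k]) depend(inout: a[i][j])
                gemm_tile(nb, a[i][k], a[j][k], a[i][j]);  /* off-diagonal update */
            }
        }
    }
    /* implicit barrier at the end of the single/parallel region */
}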


Figure 27: Stages of recursive (left) and task-based (right) H-Cholesky Decomposition.

8.3 Heterogeneous Architecture Optimization

To accelerate the H-Cholesky decomposition on a heterogeneous architecture, it is a good idea to execute Level 1 BLAS routines on the CPUs and Level 2 and 3 BLAS routines on the GPUs.
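A hypothetical sketch of such a dispatch at the leaf level: small dense diagonal blocks are factorized on the CPU with dpotrf_, while large ones are offloaded to the GPU via MAGMA, reusing the call forms shown in Section 7.4.3 (the MAGMA header must be included, magma_init() is assumed to have been called, and the threshold value is purely illustrative).

#define GPU_THRESHOLD 512   /* illustrative crossover point */

void factor_leaf(pfullmatrix f)
{
    int info = 0;
    if (f->rows < GPU_THRESHOLD) {
        /* CPU path: (multithreaded) MKL LAPACK */
        dpotrf_("Lower part", &f->rows, f->e, &f->cols, &info);
    } else {
        /* GPU path: MAGMA hybrid CPU+GPU Cholesky */
        magma_dpotrf(MagmaLower, f->rows, f->e, f->cols, &info);
    }
}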

After tracing the procedure of the H-Cholesky decomposition, most hotspots have been optimized via parallelism. Fig. 28 shows the accelerated H-Cholesky decomposition using a hybrid parallel programming model with both OpenMP and CUDA.


Figure 28: The procedure of H-Cholesky decomposition under heterogeneous architecture.


After evaluating the hybrid parallel programming method, Fig. 29 shows an impressive improvement.

[Plot: time (sec) versus leaf size (nmin) for MKL and the hybrid OpenMP + CUDA implementation.]

Figure 29: H-Cholesky decomposition via heterogeneous architecture.

From Fig. 29, when the leaf size is 4096 and the number of vertices is fixed at 10002, the hybrid method is 6x faster than the general approach. With growing leaf size, it is reasonable to believe that the H-Cholesky decomposition on a heterogeneous architecture scales well.

9 Conclusion

The results show that the H-Cholesky factorization on many-core accelerators is highly efficient and scales well for large H-matrices.


References

[1] David Blair Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors - A Hands-on Ap-

proach. Morgan Kaufmann 2010, ISBN 978-0-12-381472-2, pp. I-XVIII, 1-258

[2] James Jeffers, James Reinders, Intel Xeon Phi Coprocessor High-Performance Programming. 2013

[3] LU Decomposition, http://en.wikipedia.org/wiki/LU_decomposition

[4] QR Decomposition, https://en.wikipedia.org/wiki/QR_decomposition

[5] Cholesky Decomposition, https://en.wikipedia.org/wiki/Cholesky_decomposition

[6] BLAS, http://www.netlib.org/blas/

[7] LAPACK, http://www.netlib.org/lapack/

[8] Intel MKL, https://software.intel.com/en-us/intel-mkl/

[9] NVIDIA cuBLAS, http://docs.nvidia.com/cuda/cublas/

[10] PLASMA, http://www.netlib.org/plasma/

[11] MAGMA, http://www.netlib.org/magma/

[12] Steffen Börm, Lars Grasedyck and Wolfgang Hackbusch, Hierarchical Matrices, Max Planck Institute, 2006.

[13] Ronald Kriemann, H-LU Factorization on Many-Core Systems, Max Planck Institute, 2014.
