Page 1: H-cholesky on manycore

H–Cholesky Factorization on Many-Core Accelerators Gang Liao

August 2, 2015

Page 2: H-cholesky on manycore


Background

If A is a positive definite matrix, its Cholesky factorization is A = LL^T, where L is lower triangular.

Data matrices representing numerical observations, such as proximity or correlation matrices, are often huge and hard to analyze. Decomposing them into lower-order or lower-rank canonical forms reveals the inherent characteristics and structure of the matrices and helps interpret their meaning more readily.
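To make the factorization concrete, here is a minimal sketch of an unblocked dense Cholesky in C. This is an illustration only, not the implementation discussed in these slides (which relies on optimized library kernels):

/* Unblocked dense Cholesky A = L*L^T for a symmetric positive definite
   n x n matrix stored row-major; the lower triangle is overwritten with L. */
#include <math.h>

int cholesky(double *a, int n)
{
    for (int j = 0; j < n; j++) {
        double d = a[j*n + j];
        for (int k = 0; k < j; k++)
            d -= a[j*n + k] * a[j*n + k];
        if (d <= 0.0)
            return -1;                      /* A is not positive definite */
        a[j*n + j] = sqrt(d);
        for (int i = j + 1; i < n; i++) {
            double s = a[i*n + j];
            for (int k = 0; k < j; k++)
                s -= a[i*n + k] * a[j*n + k];
            a[i*n + j] = s / a[j*n + j];
        }
    }
    return 0;
}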

Page 3: H-cholesky on manycore


Hierarchical Matrix

Hierarchical matrices (H-matrices) are a powerful tool to represent dense matrices coming from integral equations or partial differential equations in a hierarchical, block-oriented, data-sparse way with log-linear memory costs.
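As an illustration of this block-oriented representation, the following C struct sketches one possible H-matrix node layout. It is an assumption made here for exposition; the actual data structures used in this work are not shown on the slides:

/* Hypothetical H-matrix node: each block is either a dense (full) leaf,
   a low-rank leaf stored as U*V^T, or an inner node subdivided 2x2. */
typedef struct hmat hmat_t;

struct hmat {
    int     rows, cols;      /* dimensions of this block                  */
    double *full;            /* dense leaf: rows x cols entries, or NULL  */
    int     rank;            /* low-rank leaf: block ~ U * V^T            */
    double *u, *v;           /* rows x rank and cols x rank factors       */
    hmat_t *child[2][2];     /* inner node: 2x2 children, NULL for leaves */
};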

Page 4: H-cholesky on manycore


Hierarchical Matrix


Page 5: H-cholesky on manycore


Implementation: Inadmissible Leaves

The product index set resolves into admissible and inadmissible leaves of the tree. The assembly, storage, and matrix-vector multiplication differ for the two corresponding classes of submatrices.

Inadmissible Leaves:
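Inadmissible leaves are stored as full (dense) submatrices. A hedged sketch of the corresponding matrix-vector product, using the hypothetical hmat_t layout sketched earlier (not the code from these slides):

/* Sketch: y += A*x for an inadmissible (dense) leaf, assuming the
   hypothetical hmat_t layout with a row-major 'full' buffer. */
static void full_matvec(const hmat_t *m, const double *x, double *y)
{
    for (int i = 0; i < m->rows; i++) {
        double s = 0.0;
        for (int j = 0; j < m->cols; j++)
            s += m->full[i * m->cols + j] * x[j];
        y[i] += s;
    }
}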

Page 6: H-cholesky on manycore


Implementation: Admissible Leaves

The product index set resolves into admissible and inadmissible leaves of the tree. The assembly, storage, and matrix-vector multiplication differ for the two corresponding classes of submatrices.

Admissible Leaves:
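Admissible leaves are compressed as low-rank factors, i.e. the block is approximated by U*V^T. A hedged sketch of the matrix-vector product for such a leaf, again using the hypothetical hmat_t layout rather than the actual code:

/* Sketch: y += (U*V^T)*x for an admissible (low-rank) leaf. Forming
   t = V^T*x first costs O((rows+cols)*rank) instead of O(rows*cols). */
static void rk_matvec(const hmat_t *m, const double *x, double *y)
{
    for (int k = 0; k < m->rank; k++) {
        double t = 0.0;
        for (int j = 0; j < m->cols; j++)
            t += m->v[j * m->rank + k] * x[j];
        for (int i = 0; i < m->rows; i++)
            y[i] += m->u[i * m->rank + k] * t;
    }
}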

Page 7: H-cholesky on manycore


Hierarchical Matrix Representation

Page 8: H-cholesky on manycore


Profiling

Page 9: H-cholesky on manycore


Compiler Optimization – Full matrix


icc opt1: icc with optimizations such as -O2.

icc opt2: icc with optimizations such as -msse4.2 -O3.

icc mkl: icc opt2 plus MKL routines.

Page 10: H-cholesky on manycore


Numerical Libraries Optimization – Full matrix


dpotrf_ vs plasma_dpotrf vs magma_dpotrf

MKL: Intel Math Kernel Library (Intel MKL) accelerates math processing routines.

PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures

MAGMA: Matrix Algebra on GPU and Multicore Architectures
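For reference, a minimal sketch of dispatching the full-matrix Cholesky to the LAPACK routine dpotrf_ (as provided by MKL). This shows the standard LAPACK interface only; how the code in this work actually calls it is not shown on the slides:

/* Sketch: in-place Cholesky of an n x n SPD matrix via LAPACK's dpotrf_.
   LAPACK expects column-major storage; only the requested triangle ("L"
   here) is referenced and overwritten with the factor. */
extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);

int full_cholesky(double *a, int n)
{
    int info = 0;
    dpotrf_("L", &n, a, &n, &info);
    return info;    /* 0 on success, >0 if A is not positive definite */
}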

Page 11: H-cholesky on manycore


Parallel Optimization

The concept of task-based DAG computation is used to split the H-Cholesky factorization into individual tasks and to define the corresponding dependencies between them, forming a DAG.
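As a concrete illustration of the task-DAG idea (an assumption for exposition, not the code used in this work), a blocked Cholesky can be expressed with OpenMP 4.0 task dependences. Here potrf/trsm/syrk/gemm are hypothetical per-tile kernels, e.g. thin wrappers around MKL calls:

/* Hypothetical per-tile kernels, assumed to wrap dpotrf/dtrsm/dsyrk/dgemm. */
void potrf(double *akk, int b);
void trsm(const double *akk, double *aik, int b);
void syrk(const double *aik, double *aii, int b);
void gemm(const double *aik, const double *ajk, double *aij, int b);

/* tiles is an nt x nt array of pointers to b x b tiles; A(i,j) names the
   tile slot, which also serves as the dependence object for the task DAG. */
#define A(i, j) tiles[(i) * nt + (j)]

void tiled_cholesky(double **tiles, int nt, int b)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        #pragma omp task depend(inout: A(k, k))
        potrf(A(k, k), b);                              /* factor diagonal tile */

        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: A(k, k)) depend(inout: A(i, k))
            trsm(A(k, k), A(i, k), b);                  /* triangular solve on panel */
        }

        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: A(i, k)) depend(inout: A(i, i))
            syrk(A(i, k), A(i, i), b);                  /* symmetric rank-b update */

            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A(i, k), A(j, k)) depend(inout: A(i, j))
                gemm(A(i, k), A(j, k), A(i, j), b);     /* trailing update */
            }
        }
    }
}

The runtime derives the DAG from the depend clauses, so independent tiles can be factored, solved, and updated concurrently.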

Page 12: H-cholesky on manycore


Code Analysis

Page 13: H-cholesky on manycore


Multicore Optimization – H-Cholesky Factorization


Example 1:

Example 2:

Page 14: H-cholesky on manycore


Manycore Optimization – H-Cholesky Factorization

1. Allocate device buffers and copy r->a[row_offset] and r->b[col_offset] into the accelerator.

2. Copy the result ft->e from the accelerator back into CPU host memory (a sketch of this pattern follows below).
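A hedged sketch of this offload pattern using OpenMP 4.0 target directives. This is an assumption: the slides may instead use Intel's compiler-specific offload pragmas, and the buffer meanings here (low-rank factors in, dense result out) are illustrative rather than taken from the actual code:

/* Sketch: copy the factors to the accelerator, compute on the device,
   and copy the dense result back to host memory. */
void offload_update(const double *ra, const double *rb, double *fte,
                    int rows, int cols, int rank,
                    int row_offset, int col_offset)
{
    const double *a = &ra[row_offset];     /* r->a[row_offset] */
    const double *b = &rb[col_offset];     /* r->b[col_offset] */

    #pragma omp target map(to: a[0:rows*rank], b[0:cols*rank]) \
                       map(from: fte[0:rows*cols])
    {
        #pragma omp parallel for
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                double s = 0.0;
                for (int k = 0; k < rank; k++)
                    s += a[i * rank + k] * b[j * rank + k];
                fte[i * cols + j] = s;     /* ft->e receives the A*B^T contribution */
            }
    }
}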

Page 15: H-cholesky on manycore


Result & Conclusion

[Figure: H-Cholesky decomposition where the problem size (vertices) is 10002; x-axis: nmin (leaf size), y-axis: time (sec); series: MKL vs. Hybrid.]

H-Cholesky factorization on many-core accelerators is extremely efficient, and it also scales well to large H-matrices.
which also can be well scaled on large-scaled H-matrix.

Page 16: H-cholesky on manycore