Lecture 2: From Algorithms to Code
Robert van de Geijn
The University of Texas at Austin
Overview
• Motivating Example: Cholesky Factorization

Part I: The Traditional Way
• Basic Linear Algebra Subprograms
  – Level-1 BLAS
  – Level-2 BLAS
  – Level-3 BLAS

Part II: The FLAME Way
• The FLAME@lab API
• The FLAMEC API (Application Programming Interface)
• Elemental (targeting distributed memory architectures)
A Motivating Example: Cholesky Factorization
Performance
Part 1: The Traditional Way
Basic Linear Algebra Subprograms
• Interface to commonly used fundamental linear algebra functionality
– Level‐1 BLAS: vector‐vector operations
– Level‐2 BLAS: matrix‐vector operations
– Level‐3 BLAS: matrix‐matrix operations
Level‐1 BLAS
• C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):308–323, Sept. 1979.
• Meant to allow portable high performance to be achieved on the vector supercomputers of the 1970s.
• Used to code LINPACK (predecessor of LAPACK, which is in turn the predecessor of libflame)
BLAS1: Examples
Cholesky Factorization
The Problem with Vector‐Vector Operations
• Perform O(n) computation with O(n) data
• Memory is much slower than floating point arithmetic
• If vectors are in main memory, “feeding the beast” becomes a problem
• When used to compute an operation like Cholesky factorization, the vectors are in main memory
Memory Hierarchy

[Figure: pyramid of the memory hierarchy: registers, L1 cache, L2 cache, main memory. Levels near the registers are ``fast'' but ``expensive''; levels near main memory are ``cheap'' but ``slow''.]
Performance
Level‐2 BLAS
• Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1–17, March 1988.
• Improved data reuse as memory became slower relative to floating point computation
BLAS2: Examples
Cholesky Factorization
The Problem with Matrix‐Vector Operations
• Perform O(n²) computation with O(n²) data
• Memory is much slower than floating point arithmetic
• If the matrix is in main memory, “feeding the beast” becomes a problem
• When used to compute an operation like Cholesky factorization, the matrix is in main memory
Performance
Level‐3 BLAS
• Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990.
• Further improve data reuse as cache memories became popular
BLAS3: Examples
The Benefits of Matrix‐Matrix Operations
• Perform O(n³) computation with O(n²) data
• Overcomes the memory bottleneck
Optimizing Matrix‐Matrix Multiplication

[Figure: a block of A is packed (``Pack A'') and moved up the memory hierarchy of the CPU: main memory, L2 cache, L1 cache, registers.]
Performance of GotoBLAS
Casting Algorithms in Terms of BLAS3
• Algorithms that cast most computation in terms of level-2 BLAS are often called unblocked algorithms
• Algorithms that cast most computation in terms of level-3 BLAS are often called blocked algorithms
• We need to derive an algorithm that casts most computation in terms of level-3 BLAS
do j=1, n, nb
   jb = min( nb, n-j+1 )
   call chol( jb, A( j, j ), lda )
   call dtrsm( 'Right', 'Lower triangular', 'Transpose', 'Nonunit diag', &
               n-j-jb+1, jb, 1.0d00, A( j, j ), lda, A( j+jb, j ), lda )
   call dsyrk( 'Lower triangular', 'No transpose', n-j-jb+1, jb, &
               -1.0d00, A( j+jb, j ), lda, 1.0d00, A( j+jb, j+jb ), lda )
enddo
Performance
Disadvantages of using BLAS3
• Indexing gets confusing
Part 2: The FLAME Way
FLAME APIs
• Paolo Bientinesi, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. Representing linear algebra algorithms in code: the FLAME application program interfaces. ACM Trans. Math. Soft., 2005.
• APIs have been defined for M-script (Matlab, Octave, MathScript), C, LabVIEW, C++ + MPI, …
• Used to implement
  – libflame (modern alternative for LAPACK)
  – Elemental (modern alternative for ScaLAPACK)
Cholesky factorization in FLAME@lab

function [ A_out ] = Chol_blk_var3( A, nb_alg )
  [ ATL, ATR, ...
    ABL, ABR ] = FLA_Part_2x2( A, 0, 0, 'FLA_TL' );
  while ( size( ATL, 1 ) < size( A, 1 ) )
    b = min( size( ABR, 1 ), nb_alg );
    [ A00, A01, A02, ...
      A10, A11, A12, ...
      A20, A21, A22 ] = ...
        FLA_Repart_2x2_to_3x3( ATL, ATR, ...
                               ABL, ABR, b, b, 'FLA_BR' );
    %----------------------------------------------------------%
    A11 = Chol_unb_var1( A11 );
    A21 = A21 * inv( tril( A11 ) )';
    A22 = A22 - tril( A21 * A21' );
    %----------------------------------------------------------%
    [ ATL, ATR, ...
      ABL, ABR ] = ...
        FLA_Cont_with_3x3_to_2x2( A00, A01, A02, ...
                                  A10, A11, A12, ...
                                  A20, A21, A22, 'FLA_TL' );
  end
  A_out = [ ATL, ATR
            ABL, ABR ];
return
FLAME@lab Demo
Cholesky factorization in FLAMEC

int Chol_blk_var3( FLA_Obj A, int nb_alg )
{
  < declarations >

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*--------------------------------------------------*/
    Chol_unb_var3( A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                           /* ************** */  /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }
}
FLAMEC Demo
From FLAME algorithm to Elemental implementation

PartitionDownDiagonal( A, ATL, ATR,
                          ABL, ABR, 0 );
while( ABR.Height() > 0 )
{
    RepartitionDownDiagonal( ATL, /**/ ATR,  A00, /**/ A01, A02,
                            /*************/ /****************/
                                    /**/     A10, /**/ A11, A12,
                             ABL,   /**/ ABR, A20, /**/ A21, A22 );

    A21_VC_Star.AlignWith( A22 );
    A21_MC_Star.AlignWith( A22 );
    A21_MR_Star.AlignWith( A22 );
    //----------------------------------------------------//
    A11_Star_Star = A11;
    advanced::internal::LocalChol( Lower, A11_Star_Star );
    A11 = A11_Star_Star;

    A21_VC_Star = A21;
    basic::internal::LocalTrsm( Right, Lower, ConjugateTranspose, NonUnit,
                                (F)1, A11_Star_Star, A21_VC_Star );

    A21_MC_Star = A21_VC_Star;
    A21_MR_Star = A21_VC_Star;

    // (A21^T[* ,MC])^T A21^H[* ,MR]
    //   = A21[MC,* ] A21^H[* ,MR] = (A21 A21^H)[MC,MR]
    basic::internal::LocalTriangularRankK( Lower, ConjugateTranspose,
                                           (F)-1, A21_MC_Star, A21_MR_Star,
                                           (F)1, A22 );
    A21 = A21_MC_Star;
    //----------------------------------------------------//
    A21_VC_Star.FreeAlignments();
    A21_MC_Star.FreeAlignments();
    A21_MR_Star.FreeAlignments();

    SlidePartitionDownDiagonal( ATL, /**/ ATR,  A00, A01, /**/ A02,
                                       /**/     A10, A11, /**/ A12,
                               /*************/ /*****************/
                                ABL, /**/ ABR,  A20, A21, /**/ A22 );
}
Jack Poulson et al. Elemental: A New Framework for Distributed Memory Dense Matrix Computations. ACM Trans. Math. Soft. Accepted (pending final approval).
Summary
• BLAS are widely used in scientific computing
• Casting computation in terms of matrix-matrix multiplication facilitates high performance
• Abstraction is a wonderful thing
Welcome to the Wonderful World of FLAME
Willkommen in der wunderbaren Welt der FLAME