Lecture 2: From Algorithms to Code
Robert van de Geijn
The University of Texas at Austin
Overview
• Motivating Example: Cholesky Factorization

Part I: The Traditional Way
• Basic Linear Algebra Subprograms
  – Level-1 BLAS
  – Level-2 BLAS
  – Level-3 BLAS

Part II: The FLAME Way
• The FLAME@lab API
• The FLAMEC API (Application Programming Interface)
• Elemental (targeting distributed memory architectures)
A Motivating Example: Cholesky Factorization
Performance
Part 1: The Traditional Way
Basic Linear Algebra Subprograms
• Interface to commonly used fundamental linear algebra functionality
– Level‐1 BLAS: vector‐vector operations
– Level‐2 BLAS: matrix‐vector operations
– Level‐3 BLAS: matrix‐matrix operations
Level‐1 BLAS
• C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):308–323, Sept. 1979.
• Meant to allow portable high performance to be achieved on the vector supercomputers of the 1970s.
• Used to code LINPACK (predecessor of LAPACK, which is in turn the predecessor of libflame)
BLAS1: Examples
Cholesky Factorization
The Problem with Vector‐Vector Operations
• Perform O(n) computation with O(n) data
• Memory is much slower than floating point arithmetic
• If vectors are in main memory, “feeding the beast” becomes a problem
• When used to compute an operation like Cholesky factorization, the vectors are in main memory
Memory Hierarchy

[Figure: pyramid of the memory hierarchy: registers, L1 cache, L2 cache, main memory. Levels near the registers are ``fast'' but ``expensive''; levels near main memory are ``cheap'' but ``slow''.]
Performance
Level‐2 BLAS
• Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1–17, March 1988.
• Improved data reuse as memory became slower relative to floating point computation
BLAS2: Examples
Cholesky Factorization
The Problem with Matrix‐Vector Operations
• Perform O(n²) computation with O(n²) data
• Memory is much slower than floating point arithmetic
• If the matrix is in main memory, “feeding the beast” becomes a problem
• When used to compute an operation like Cholesky factorization, the matrix is in main memory
Performance
Level‐3 BLAS
• Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990.
• Further improve data reuse as cache memories became popular
BLAS3: Examples
The Benefits of Matrix‐Matrix Operations
• Perform O(n³) computation with O(n²) data
• Overcomes the memory bottleneck
Optimizing Matrix‐Matrix Multiplication

[Figure: a block of A is packed (``Pack A'') and moved up the memory hierarchy of the CPU: main memory, L2 cache, L1 cache, registers.]
Performance of GotoBLAS
Casting Algorithms in Terms of BLAS3
• Algorithms that cast most computation in terms of level-2 BLAS are often called unblocked algorithms
• Algorithms that cast most computation in terms of level-3 BLAS are often called blocked algorithms
• We need to derive an algorithm that casts most computation in terms of level-3 BLAS
do j=1, n, nb
   jb = min( nb, n-j+1 )
   call chol( jb, A( j, j ), lda )
   call dtrsm( 'Right', 'Lower triangular', 'Transpose', 'Nonunit diag', &
               n-j-jb+1, jb, 1.0d00, A( j, j ), lda, A( j+jb, j ), lda )
   call dsyrk( 'Lower triangular', 'No transpose', n-j-jb+1, jb, &
               -1.0d00, A( j+jb, j ), lda, 1.0d00, A( j+jb, j+jb ), lda )
enddo
Performance
Disadvantages of using BLAS3
• Indexing gets confusing
Part 2: The FLAME Way
FLAME APIs
• Paolo Bientinesi, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. Representing linear algebra algorithms in code: the FLAME application program interfaces. ACM Trans. Math. Soft., 2005.
• APIs have been defined for M-script (Matlab, Octave, MathScript), C, LabVIEW, C++ + MPI, …
• Used to implement
  – libflame (modern alternative for LAPACK)
  – Elemental (modern alternative for ScaLAPACK)
Cholesky factorization in FLAME@lab

function [ A_out ] = Chol_blk_var3( A, nb_alg )
  [ ATL, ATR, ...
    ABL, ABR ] = FLA_Part_2x2( A, 0, 0, 'FLA_TL' );
  while ( size( ATL, 1 ) < size( A, 1 ) )
    b = min( size( ABR, 1 ), nb_alg );
    [ A00, A01, A02, ...
      A10, A11, A12, ...
      A20, A21, A22 ] = ...
        FLA_Repart_2x2_to_3x3( ATL, ATR, ...
                               ABL, ABR, b, b, 'FLA_BR' );
    %----------------------------------------------------------%
    A11 = Chol_unb_var1( A11 );
    A21 = A21 * inv( tril( A11 ) )';
    A22 = A22 - tril( A21 * A21' );
    %----------------------------------------------------------%
    [ ATL, ATR, ...
      ABL, ABR ] = ...
        FLA_Cont_with_3x3_to_2x2( A00, A01, A02, ...
                                  A10, A11, A12, ...
                                  A20, A21, A22, 'FLA_TL' );
  end
  A_out = [ ATL, ATR
            ABL, ABR ];
return
FLAME@lab Demo
Cholesky factorization in FLAMEC

int Chol_blk_var3( FLA_Obj A, int nb_alg )
{
  < declarations >

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*--------------------------------------------------*/
    Chol_unb_var3( A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                           /* ************** */  /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }
}
FLAMEC Demo
From FLAME algorithm to Elemental implementation

PartitionDownDiagonal( A, ATL, ATR,
                          ABL, ABR, 0 );
while( ABR.Height() > 0 )
{
    RepartitionDownDiagonal( ATL, /**/ ATR,  A00, /**/ A01, A02,
                            /*************/ /****************/
                                    /**/     A10, /**/ A11, A12,
                             ABL,   /**/ ABR, A20, /**/ A21, A22 );

    A21_VC_Star.AlignWith( A22 );
    A21_MC_Star.AlignWith( A22 );
    A21_MR_Star.AlignWith( A22 );
    //----------------------------------------------------//
    A11_Star_Star = A11;
    advanced::internal::LocalChol( Lower, A11_Star_Star );
    A11 = A11_Star_Star;

    A21_VC_Star = A21;
    basic::internal::LocalTrsm( Right, Lower, ConjugateTranspose, NonUnit,
                                (F)1, A11_Star_Star, A21_VC_Star );

    A21_MC_Star = A21_VC_Star;
    A21_MR_Star = A21_VC_Star;

    // (A21^T[* ,MC])^T A21^H[* ,MR]
    //   = A21[MC,* ] A21^H[* ,MR] = (A21 A21^H)[MC,MR]
    basic::internal::LocalTriangularRankK( Lower, ConjugateTranspose,
                                           (F)-1, A21_MC_Star, A21_MR_Star,
                                           (F)1, A22 );
    A21 = A21_MC_Star;
    //----------------------------------------------------//
    A21_VC_Star.FreeAlignments();
    A21_MC_Star.FreeAlignments();
    A21_MR_Star.FreeAlignments();

    SlidePartitionDownDiagonal( ATL, /**/ ATR,  A00, A01, /**/ A02,
                                       /**/     A10, A11, /**/ A12,
                               /*************/ /*****************/
                                ABL, /**/ ABR,  A20, A21, /**/ A22 );
}
Jack Poulson et al. Elemental: A New Framework for Distributed Memory Dense Matrix Computations. ACM Trans. Math. Soft. Accepted (pending final approval).
Summary
• BLAS are widely used in scientific computing
• Casting computation in terms of matrix-matrix multiplication facilitates high performance
• Abstraction is a wonderful thing
Welcome to the Wonderful World of FLAME
Willkommen in der wunderbaren Welt der FLAME