Automatic Performance Tuning
Jeremy Johnson, Dept. of Computer Science, Drexel University

Nov 04, 2021

Page 1: Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning

Jeremy Johnson
Dept. of Computer Science

Drexel University

Page 2

Outline

• Scientific Computation Kernels
  – Matrix Multiplication
  – Fast Fourier Transform (FFT)
• Automated Performance Tuning (Proc. IEEE, Vol. 93, No. 2, Feb. 2005)
  – ATLAS
  – FFTW
  – SPIRAL

Page 3

Matrix Multiplication and the FFT

\[ C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \]

\[ y_k = \sum_{l=0}^{N-1} \omega_N^{kl}\, x_l \]

\[ N = RS, \quad k = S k_2 + k_1, \quad l = R l_1 + l_2 \]

\[ y_{k_1,k_2} = \sum_{l_2=0}^{R-1} \omega_R^{k_2 l_2}\, \omega_N^{k_1 l_2} \sum_{l_1=0}^{S-1} \omega_S^{k_1 l_1}\, x_{l_1,l_2} \]
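
The relation between the direct sum and its split form for N = RS can be checked numerically. The Python sketch below is illustrative only: the index mapping k = S·k2 + k1, l = R·l1 + l2, the sign convention ω_N = exp(−2πi/N), and the helper names `root`, `dft_direct`, and `dft_split` are all assumptions made for this example.

```python
import cmath

def root(n):
    # Primitive n-th root of unity; the minus sign is an assumed convention.
    return cmath.exp(-2j * cmath.pi / n)

def dft_direct(x):
    # y_k = sum_l w_N^(k*l) * x_l, using about N^2 multiplications.
    n = len(x)
    w = root(n)
    return [sum(w ** (k * l) * x[l] for l in range(n)) for k in range(n)]

def dft_split(x, R, S):
    # Same transform for N = R*S, with k = S*k2 + k1 and l = R*l1 + l2.
    N = R * S
    assert len(x) == N
    wN, wR, wS = root(N), root(R), root(S)
    # Step 1: R inner DFTs of size S over the strided subsequences x[l2::R].
    inner = [[sum(wS ** (k1 * l1) * x[R * l1 + l2] for l1 in range(S))
              for l2 in range(R)] for k1 in range(S)]
    # Step 2: scale by the twiddle w_N^(k1*l2), then S outer DFTs of size R.
    y = [0j] * N
    for k1 in range(S):
        for k2 in range(R):
            y[S * k2 + k1] = sum(
                wR ** (k2 * l2) * wN ** (k1 * l2) * inner[k1][l2]
                for l2 in range(R))
    return y
```

Because the inner sums are computed once and reused for every k2, the split form needs on the order of N(R + S) multiplications instead of N^2; applying the same split recursively to the smaller DFTs gives the FFT.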

Page 4

Basic Linear Algebra Subprograms (BLAS)

• Level 1 – vector-vector, O(n) data, O(n) operations
• Level 2 – matrix-vector, O(n^2) data, O(n^2) operations
• Level 3 – matrix-matrix, O(n^2) data, O(n^3) operations = data reuse = locality!
• LAPACK built on top of BLAS (level 3)
  – Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms
• GEMM – General Matrix Multiplication
  – SUBROUTINE DGEMM (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
  – C := alpha*op( A )*op( B ) + beta*C,
  – where op(X) = X or X'
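
To make the GEMM update rule concrete, here is a minimal Python sketch of C := alpha*op(A)*op(B) + beta*C on lists of lists. It is not the BLAS interface: the function name `gemm`, the `'N'`/`'T'` flags as Python strings, and returning a new matrix instead of overwriting C are assumptions made for illustration, and the leading-dimension arguments (LDA, LDB, LDC) are omitted.

```python
def transpose(X):
    # Rows become columns.
    return [list(col) for col in zip(*X)]

def gemm(transa, transb, alpha, A, B, beta, C):
    # Returns alpha*op(A)*op(B) + beta*C, where op(X) is X for 'N'
    # and the transpose X' for 'T'. (Hypothetical helper, not DGEMM itself.)
    opA = transpose(A) if transa == 'T' else A
    opB = transpose(B) if transb == 'T' else B
    m, k, n = len(opA), len(opB), len(opB[0])
    assert len(opA[0]) == k and len(C) == m and len(C[0]) == n
    return [[alpha * sum(opA[i][l] * opB[l][j] for l in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]
```

With alpha = 1 and beta = 0 this reduces to a plain matrix product; with beta = 1 it accumulates into C, which is the form blocked algorithms rely on.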

Page 5

DGEMM

…
*
*           Form  C := alpha*A*B + beta*C.
*
            DO 90, J = 1, N
               IF( BETA.EQ.ZERO )THEN
                  DO 50, I = 1, M
                     C( I, J ) = ZERO
   50             CONTINUE
               ELSE IF( BETA.NE.ONE )THEN
                  DO 60, I = 1, M
                     C( I, J ) = BETA*C( I, J )
   60             CONTINUE
               END IF
               DO 80, L = 1, K
                  IF( B( L, J ).NE.ZERO )THEN
                     TEMP = ALPHA*B( L, J )
                     DO 70, I = 1, M
                        C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70                CONTINUE
                  END IF
   80          CONTINUE
   90       CONTINUE
…

Page 6

Matrix Multiplication Performance

Page 7

Matrix Multiplication Performance

Page 8

Numerical Recipes

• Numerical Recipes in C – The Art of Scientific Computing, 2nd Ed.
  – William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Cambridge University Press, 1992.
• “This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.”
• 1. Preliminaries
• 2. Solution of Linear Algebraic Equations
• …
• 12. Fast Fourier Transform
• 19. Partial Differential Equations
• 20. Less-Numerical Algorithms

Page 9

four1

Page 10

four1 (cont)

Page 11

FFT Performance

Page 12

ATLAS Architecture and Search Parameters

• NB – L1 data cache tile size
• NCNB – L1 data cache tile size for non-copying version
• MU, NU – Register tile size
• KU – Unroll factor for k' loop
• LS – Latency for computation scheduling
• FMA – 1 if fused multiply-add available, 0 otherwise
• FF, IF, NF – Scheduling of loads

Yotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

Page 13

ATLAS Code Generation

• Optimization for locality
  – Cache tiling, Register tiling
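
Cache tiling restructures the three matrix-multiplication loops so that each NB×NB block is reused while it is still in the L1 cache. The Python sketch below shows only the loop structure; Python itself will not exhibit the cache benefit, and the function names are illustrative rather than ATLAS code.

```python
def matmul_naive(A, B):
    # Reference triple loop: C[i][j] = sum_p A[i][p] * B[p][j].
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(A, B, NB):
    # Same product, computed as a sequence of NB x NB "mini" multiplications.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, NB):              # tile row of C
        for j0 in range(0, m, NB):          # tile column of C
            for p0 in range(0, k, NB):      # tile along the shared dimension
                for i in range(i0, min(i0 + NB, n)):
                    for j in range(j0, min(j0 + NB, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + NB, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

The `min(...)` bounds handle matrix sizes that are not multiples of NB; a tuned library would typically generate separate cleanup code for those edges instead.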

Page 14

ATLAS Code Generation

• Register Tiling
  – MU + NU + MU×NU ≤ NR
• Loop unrolling
• Scalar replacement
• Add/mul interleaving
• Loop skewing
• C_{i''j''} = C_{i''j''} + A_{i''k''} * B_{k''j''}

[Figure: register tiles of A, B, and C inside NB×NB blocks; labels MU, NU, K, NB]

[Figure: skewed add/mul schedule: mul_1, mul_2, …, mul_{Ls}, add_1, mul_{Ls+1}, add_2, …, mul_{MU×NU}, add_{MU×NU−Ls+2}, …, add_{MU×NU}]
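
The register-tiling constraint and the scalar-replacement idea can be sketched as follows. This is a simplified Python illustration, not generated ATLAS code: the function names are hypothetical, the local lists `a`, `b`, and `c` stand in for machine registers, and dimensions are assumed to be multiples of MU and NU (no cleanup code).

```python
def fits_in_registers(MU, NU, NR):
    # MU values of A, NU values of B, and an MU x NU tile of C must fit in NR registers.
    return MU + NU + MU * NU <= NR

def mini_mmm_register_tiled(A, B, C, MU, NU):
    # Updates C += A*B in place, one MU x NU tile of C at a time.
    M, K, N = len(A), len(B), len(B[0])
    assert M % MU == 0 and N % NU == 0, "sketch omits cleanup code"
    for i in range(0, M, MU):
        for j in range(0, N, NU):
            # Load the C tile once ("scalar replacement").
            c = [[C[i + di][j + dj] for dj in range(NU)] for di in range(MU)]
            for k in range(K):
                a = [A[i + di][k] for di in range(MU)]   # MU loads from A
                b = [B[k][j + dj] for dj in range(NU)]   # NU loads from B
                for di in range(MU):
                    for dj in range(NU):
                        c[di][dj] += a[di] * b[dj]       # MU*NU multiply-adds
            # Store the C tile back once, after the whole k loop.
            for di in range(MU):
                for dj in range(NU):
                    C[i + di][j + dj] = c[di][dj]
    return C
```

Each iteration of the k loop performs MU×NU multiply-adds using only MU + NU loads, which is why larger register tiles improve the ratio of arithmetic to memory traffic until the register file is exhausted.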

Page 15

ATLAS Search

• Estimate Machine Parameters (C1, NR, FMA, LS)
  – Used to bound search
• Orthogonal Line Search (fix all parameters except one and search for the optimal value of this parameter)
  – Search order
    • NB
    • MU, NU
    • KU
    • LS
    • FF, IF, NF
    • NCNB
    • Cleanup codes
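
Orthogonal line search can be sketched in a few lines: visit the parameters in a fixed order, try every candidate value for one parameter while the others stay fixed, and keep the best value before moving on. In the Python sketch below, `toy_cost` is a made-up stand-in for timing a generated kernel, and the candidate values are arbitrary; neither comes from ATLAS.

```python
def orthogonal_line_search(measure, search_order, candidates, start):
    # measure(params) returns a cost (lower is better), e.g. a measured runtime.
    best = dict(start)
    best_cost = measure(best)
    for name in search_order:                 # one parameter at a time
        for value in candidates[name]:
            trial = dict(best)                # all other parameters stay fixed
            trial[name] = value
            cost = measure(trial)
            if cost < best_cost:
                best, best_cost = trial, cost
    return best, best_cost

def toy_cost(p):
    # Hypothetical cost surface used only to exercise the search.
    mu, nu = p["MU_NU"]
    return abs(p["NB"] - 40) + abs(mu * nu - 8) + abs(p["KU"] - 4)

candidates = {
    "NB": [16, 24, 32, 40, 48, 56, 64],
    "MU_NU": [(1, 1), (2, 2), (2, 4), (4, 4)],
    "KU": [1, 2, 4, 8],
}
start = {"NB": 16, "MU_NU": (1, 1), "KU": 1}
best, cost = orthogonal_line_search(toy_cost, ["NB", "MU_NU", "KU"], candidates, start)
```

The search evaluates the sum of the candidate-list lengths rather than their product, at the price of possibly missing optima that depend on interactions between parameters.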

Page 16

Using FFTW

Page 17

FFTW Infrastructure

• Use dynamic programming to find an efficient way to combine code sequences.
• Combine code sequences using divide and conquer structure in FFT
• Codelets (optimized code sequences for small FFTs)
• Plan encodes divide and conquer strategy and stores “twiddle factors”
• Executor computes FFT of given data using algorithm described by plan.

[Figure: right-recursive plan tree: 15 → (3, 12), 12 → (4, 8), 8 → (3, 5)]
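
The dynamic-programming idea behind planning can be sketched as below for sizes 2^n and right-recursive trees (a codelet on the left, a recursively planned FFT on the right). This is not FFTW's planner: FFTW times real code, whereas the codelet costs in `CODELET_COST` and the combination formula here are invented purely for illustration, and all names are hypothetical.

```python
from functools import lru_cache

# Made-up costs for codelets of size 2**a, keyed by the exponent a.
CODELET_COST = {1: 2.0, 2: 8.0, 3: 24.0, 4: 70.0, 5: 200.0}

def make_planner(codelet_cost):
    @lru_cache(maxsize=None)
    def best(n):
        # Returns (cost, tree) for an FFT of size 2**n. A tree is either an
        # int (a codelet exponent) or a pair (codelet exponent, subtree).
        options = []
        if n in codelet_cost:
            options.append((codelet_cost[n], n))
        for a in codelet_cost:
            b = n - a
            if b >= 1:
                sub_cost, sub_tree = best(b)   # memoized subproblem
                # Assumed model: 2**b codelets of size 2**a, 2**a sub-FFTs
                # of size 2**b, plus 2**n twiddle multiplications.
                cost = (2 ** b) * codelet_cost[a] + (2 ** a) * sub_cost + 2.0 ** n
                options.append((cost, (a, sub_tree)))
        return min(options, key=lambda o: o[0])
    return best

def leaves(tree):
    # Codelet exponents of a right-recursive tree, from the root down.
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return [left] + leaves(right)

plan = make_planner(CODELET_COST)
cost_15, tree_15 = plan(15)
```

Memoizing `best` means each exponent is planned once, so the planner explores every right-recursive split while doing only a number of evaluations proportional to n times the number of codelets.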

Page 18

SPIRAL system

[Figure: SPIRAL system block diagram]

• User specifies a DSP transform, goes for a coffee (or an espresso for small transforms), and comes back to a platform-adapted implementation
• Formula Generator (the "Mathematician"): produces a fast algorithm as an SPL formula
• SPL Compiler (the "Expert Programmer"): produces C/Fortran/SIMD code
• Search Engine: measures runtime on the given platform, controls algorithm generation, and controls implementation options

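Page 19

DSP Algorithms: Example 4-point DFT

Cooley/Tukey FFT (size 4):

\[
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\]

\[ \mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2) \cdot T^4_2 \cdot (I_2 \otimes \mathrm{DFT}_2) \cdot L^4_2 \]

Legend: DFT_2 = Fourier transform, I_2 = identity, L^4_2 = permutation, T^4_2 = diagonal matrix (twiddles), ⊗ = Kronecker product

• algorithms reduce arithmetic cost O(n^2) → O(n log(n))
• product of structured sparse matrices
• mathematical notation exhibits structure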
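
The size-4 Cooley/Tukey factorization can be verified by multiplying out the four sparse factors. The Python sketch below uses plain nested lists; the helper names `kron` and `matmul` are illustrative, and the matrices follow the convention in which the second row of DFT_4 is (1, i, −1, −i).

```python
def kron(A, B):
    # Kronecker product: every entry a of A is replaced by the block a*B.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

I2 = [[1, 0], [0, 1]]
DFT2 = [[1, 1], [1, -1]]
# Diagonal twiddle matrix T^4_2 = diag(1, 1, 1, i).
T42 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1j]]
# Stride permutation L^4_2: (x0, x1, x2, x3) -> (x0, x2, x1, x3).
L42 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

DFT4 = [[1, 1, 1, 1],
        [1, 1j, -1, -1j],
        [1, -1, 1, -1],
        [1, -1j, -1, 1j]]

# (DFT_2 (x) I_2) * T^4_2 * (I_2 (x) DFT_2) * L^4_2
product = matmul(matmul(matmul(kron(DFT2, I2), T42), kron(I2, DFT2)), L42)
```

Each factor has at most two nonzero entries per row, so applying them one after another costs far fewer operations than multiplying by the dense DFT_4 matrix; this sparsity is what the formula notation makes visible.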

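Page 20

Algorithms = Ruletrees = Formulas

[Figure: ruletrees for DCT^(II)_8, which split into DCT^(II)_4 and DCT^(IV)_4 and then into size-2 transforms (DCT^(II)_2, DST^(II)_2, DCT^(IV)_2, F_2); nodes are labeled with the applied rules R1, R3, R4, R6]

Rules shown:

\[ \mathrm{DCT}^{(II)}_n \to P \cdot \left(\mathrm{DCT}^{(II)}_{n/2} \oplus \mathrm{DCT}^{(IV)}_{n/2}\right) \cdot \left(F_2 \otimes I_{n/2}\right) \qquad \text{(R1)} \]

\[ \mathrm{DCT}^{(IV)}_n \to P \cdot \mathrm{DCT}^{(II)}_n \cdot S \]

\[ \mathrm{DCT}^{(II)}_2 \to \tfrac{1}{\sqrt{2}} \cdot F_2 \]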

Page 21

Generated DFT Vector Code: Pentium 4, SSE

[Figure: (pseudo) Gflop/s (y-axis, 0 to 7) versus n (x-axis, 4 to 13) for DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0. Series: Spiral SSE, Intel MKL interl., FFTW 2.1.3, Spiral C, Spiral C vect, SIMD-FFT. Annotation: hand-tuned vendor assembly code.]

• speedups (to C code) up to factor of 3.1

Page 22

Best DFT Trees, size 2^10 = 1024

[Figure: best-found breakdown trees for the DFT of size 2^10, one per code type (scalar, C vect, SIMD) and per platform/datatype (Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float); the individual node labels are not recoverable from the transcript]

trees platform/datatype dependent

Page 23

Crosstiming of best trees on Pentium 4

[Figure: slowdown factor w.r.t. best (y-axis, 1.00 to 5.00) versus n (x-axis, 4 to 13) for DFT 2^n, single precision, runtime of best found of other platforms. Series: Pentium 4 SSE, Pentium 4 SSE2, AthlonXP SSE, PentiumIII SSE, Pentium 4 float.]

software adaptation is necessary