Page 1: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Relational Query Processing Approach to Compiling Sparse Matrix Codes

Vladimir Kotlyar

Computer Science Department, Cornell University
http://www.cs.cornell.edu/Info/Project/Bernoulli

Page 2: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

– Sparse matrix computations

– Importance of sparse matrix formats

– Difficulties in the development of sparse matrix codes

• State-of-the-art restructuring compiler technology

• Technical approach and experimental results

• Ongoing work and conclusions

Page 3: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse Matrices and Their Applications

• Number of non-zeroes per row/column << n

• Often, less than 0.1% non-zero

• Applications: numerical simulations, (non)linear optimization, graph theory, information retrieval, ...

[Figure: example sparse matrix; all but a few entries per row are zero]

Page 4: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Application: numerical simulations

• Fracture mechanics Grand Challenge project:

– Cornell CS + Civil Eng. + other schools;

– supported by NSF, NASA, Boeing

• A system of differential equations is solved over a continuous domain

• Discretized into an algebraic system in variables x(i)

• System of linear equations Ax=b is at the core

• Intuition: A is sparse because the physical interactions are local

[Figure: fracture simulation plot (crack.eps, MATLAB)]

Page 5: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Application: Authoritative sources on the Web

• Hubs and authorities on the Web

• Graph G=(V,E) of the documents

• A(u,v) = 1 if (u,v) is an edge

• A is sparse!

• Eigenvectors of AAᵀ and AᵀA identify hubs, authorities and their clusters (“communities”) [Kleinberg, Raghavan ‘97]

[Figure: hub pages pointing to authority pages]

Page 6: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse matrix algorithms

• Solution of linear systems
– Direct methods (Gaussian elimination): A = LU
  • Impractical for many large-scale problems
  • For certain problems: O(n) space, O(n) time
– Iterative methods
  • Matrix-vector products: y = Ax
  • Triangular system solution: Lx = b
  • Incomplete factorizations: A ≈ LU

• Eigenvalue problems:
– Mostly matrix-vector products + dense computations

Page 7: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse matrix computations

• “DOANY” -- operations in any order
– Vector ops (dot product, addition, scaling)
– Matrix-vector products
– Rarely used: C = A+B
– Important: C ← A+B, A ← A + uvᵀ

• “DOACROSS” -- dependencies between operations
– Triangular system solution: Lx = b (see the sketch after this list)

• More complex applications are built out of the above + dense kernels

• Preprocessing (e.g. storage allocation): “graph theory”
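To make the DOACROSS pattern concrete, here is a minimal sketch of a sparse lower-triangular solve in C; it assumes (hypothetically) CRS arrays rowp/colind/vals with the diagonal entry stored last in each row, and none of the names come from the toolkit.

/* Forward substitution for a sparse lower-triangular system L*x = b.
   Assumed layout: CRS arrays rowp/colind/vals, with the diagonal entry
   stored as the last non-zero of each row. The outer loop is a
   DOACROSS: row i reads x(j) values computed by earlier iterations. */
void sparse_lower_solve(int n, const int *rowp, const int *colind,
                        const double *vals, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        int last = rowp[i + 1] - 1;           /* position of L(i,i)     */
        for (int k = rowp[i]; k < last; k++)  /* strictly lower entries */
            s -= vals[k] * x[colind[k]];
        x[i] = s / vals[last];
    }
}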

Page 8: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

– Sparse matrix computations

– Sparse Matrix Storage Formats

– Difficulties in the development of sparse matrix codes

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 9: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Storing Sparse Matrices

• Compressed formats are essential
– O(nnz) time/space, not O(n²)
– Example: matrix-vector product
  • 10M rows/columns, 50 non-zeroes per row
  • 5 seconds vs 139 hours on a 200 Mflops computer (assuming huge memory)

• A variety of formats are used in practice
– Application/architecture dependent
– Different memory usage
– Different performance on RISC processors

Page 10: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Point formats

[Figure: a 4-by-4 sparse matrix with non-zero entries a through j, stored two ways.
Coordinate format: parallel arrays of row indices, column indices, and values, in no particular order.
Compressed Column Storage (CCS): values sorted by column, with parallel row indices and an array of column pointers marking where each column starts.]
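As a concrete rendering of the two point formats in the figure, here is a sketch in C; the field names are illustrative, not taken from any library.

/* Coordinate format: one (row, column, value) triple per non-zero,
   held in three parallel arrays, in no particular order. */
struct coo {
    int     n, nnz;
    int    *row;     /* row index of each non-zero    */
    int    *col;     /* column index of each non-zero */
    double *val;     /* value of each non-zero        */
};

/* Compressed Column Storage: non-zeros sorted by column; column j
   occupies positions colp[j] .. colp[j+1]-1 of rowind/vals. */
struct ccs {
    int     n, nnz;
    int    *colp;    /* n+1 column pointers        */
    int    *rowind;  /* row index of each non-zero */
    double *vals;    /* value of each non-zero     */
};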

Page 11: Relational Query Processing Approach to Compiling Sparse Matrix Codes

Block formats

• Block Sparse Column

• “Natural” for physical problems with several unknowns at each point in space

• Saves storage: 25% for 2-by-2 blocks

• Improves performance on modern RISC processors

[Figure: a sparse matrix partitioned into 2-by-2 dense blocks (entries a through z), stored as block column pointers, block row indices, and the blocks’ values]
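To see why blocks help, here is a sketch of y = y + A*x for a block format with fixed 2-by-2 blocks; the layout (block column pointers bcolp, block row indices browind, blocks stored contiguously in column-major order) is an assumption for illustration, not the exact layout from the slide.

/* y += A*x, A in a Block Sparse Column format with 2x2 dense blocks.
   Block k occupies vals[4*k .. 4*k+3], column-major. The fully
   unrolled 2x2 multiply gives RISC processors straight-line code
   and reuses x0/x1 from registers. */
void bsc_matvec_2x2(int nblkcols, const int *bcolp, const int *browind,
                    const double *vals, const double *x, double *y)
{
    for (int jb = 0; jb < nblkcols; jb++) {
        double x0 = x[2*jb], x1 = x[2*jb + 1];
        for (int k = bcolp[jb]; k < bcolp[jb + 1]; k++) {
            const double *blk = vals + 4*k;   /* the 2x2 block          */
            int i = 2 * browind[k];           /* first row of the block */
            y[i]     += blk[0]*x0 + blk[2]*x1;
            y[i + 1] += blk[1]*x0 + blk[3]*x1;
        }
    }
}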

Page 12: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Why multiple formats: performance

• Sparse matrix-vector product

• Formats: CRS, Jagged diagonal, BlockSolve

• On IBM RS6000 (66.5 MHz Power2)

• Best format depends on the application (20-70% advantage)

[Bar chart: matrix-vector product performance, 0-35 Mflops, for the CRS, JDIAG, and Bsolve formats on several matrices]

Page 13: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Bottom line

• Sparse matrices are used in a variety of application areas

• Have to be stored in compressed data structures

• Many formats are used in practice
– Different storage/performance characteristics

• Code development is tedious and error-prone
– No random access
– Different code for each format
– Even worse in parallel (many ways to distribute the data)

Page 14: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Libraries

• Dense computations: Basic Linear Algebra Subroutines (BLAS)
– Implemented by most computer vendors
– Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc.

• Other computations are built on top of BLAS

• Can we do the same for sparse matrices?

Page 15: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse Matrix Libraries

• Sparse Basic Linear Algebra Subroutine (SPBLAS) library [Pozo, Remington @ NIST]
– 13 formats ==> too many combinations of “A op B”
– Some important ops are not supported
– Not extensible

• Coarse-grain solver packages [BlockSolve, Aztec, …]
– Particular class of problems/algorithms (e.g. iterative solution)
– OO approaches: hooks for basic ops (e.g. matrix-vector product)

Page 16: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Our goal: generate sparse codes automatically

• Permit user-defined sparse data structures

• Specialize high-level algorithm for sparsity, given the formats

FOR I=1,N
  sum = sum + X(I)*Y(I)

⇓

FOR I=1,N such that X(I)≠0 and Y(I)≠0
  sum = sum + X(I)*Y(I)

(executable code)

Page 17: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Input to the compiler

• FOR-loops are sequential

• DO-loops can be executed in any order (“DOANY”)

• Convert dense DO-loops into sparse code:

DO I=1,N; J=1,N
  Y(I) = Y(I) + A(I,J)*X(J)

becomes, for A compressed by column:

for (j = 0; j < N; j++)                      /* for each column j     */
  for (ii = colp(j); ii < colp(j+1); ii++)   /* non-zeros of column j */
    Y(rowind(ii)) = Y(rowind(ii)) + vals(ii)*X(j);
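For reference, a self-contained (hypothetical) C version of the generated loop, with a tiny 2-by-2 matrix as a usage example:

#include <stdio.h>

/* y = y + A*x with A in CCS; the same loop as the generated code. */
static void ccs_matvec(int n, const int *colp, const int *rowind,
                       const double *vals, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        for (int ii = colp[j]; ii < colp[j + 1]; ii++)
            y[rowind[ii]] += vals[ii] * x[j];
}

int main(void)
{
    /* A = [2 0; 1 3] by columns: column 0 holds 2 (row 0) and 1 (row 1);
       column 1 holds 3 (row 1). */
    int    colp[]   = {0, 2, 3};
    int    rowind[] = {0, 1, 1};
    double vals[]   = {2.0, 1.0, 3.0};
    double x[]      = {1.0, 1.0};
    double y[]      = {0.0, 0.0};

    ccs_matvec(2, colp, rowind, vals, x, y);
    printf("y = (%g, %g)\n", y[0], y[1]);   /* prints y = (2, 4) */
    return 0;
}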

Page 18: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 19: Relational Query Processing Approach to Compiling Sparse Matrix Codes


An example: locality enhancement

• Matrix-vector product, array A stored in column-major order

FOR I=1,N
  FOR J=1,N
    Y(I) = Y(I) + A(I,J)*X(J)        (stride-N access to A)

• Would like to execute the code as:

FOR J=1,N
  FOR I=1,N
    Y(I) = Y(I) + A(I,J)*X(J)        (stride-1 access to A)

• In general?

Page 20: Relational Query Processing Approach to Compiling Sparse Matrix Codes


An abstraction: polyhedra

• Loop nests == polyhedra in integer spaces

FOR I=1,N
  FOR J=1,I
    …

corresponds to the polyhedron {(i,j) : 1 ≤ i ≤ N, 1 ≤ j ≤ i}

• Transformations are linear maps on the polyhedron, e.g. loop interchange (i,j) → (j,i)

• Used in production and research compilers (SGI, HP, IBM)

Page 21: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Caveat

• The polyhedral model is not applicable to sparse computations

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN Y(I) = Y(I) + A(I,J)*X(J)

• The iteration set {(i,j) : 1 ≤ i ≤ N, 1 ≤ j ≤ N, A(i,j) ≠ 0} is not a polyhedron

What is the right formalism?

Page 22: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Extensions for sparse matrix code generation

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN Y(I)=Y(I)+A(I,J)*X(J)

• A is sparse, compressed by column

• Interchange the loops, encapsulate the guard:

FOR J=1,N
  FOR I=1,N such that A(I,J) ≠ 0
    ...

• “Control-centric” approach: transform the loops to match the best access to data [Bik, Wijshoff]

Page 23: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Limitations of the control-centric approach

• Requires well-defined direction of access

CCS: (J,I) loop order
CRS: (I,J) loop order
Coordinate: ????

Page 24: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 25: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Data-centric transformations

• Main idea: concentrate on the data

DO I=…; J=…
  … A(F(I,J)) …

• Array access function: <row,column> = F(I,J)

• Example: coordinate storage format

[Figure: the coordinate storage format (coord.fig)]

Page 26: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Data-centric sparse code generation

• If only a single sparse array:

FOR <row,column,value> in A
  I = row; J = column
  Y(I) = Y(I) + value*X(J)

• For each data structure provide an enumeration method

• What if more than one sparse array?
– Need to produce efficient simultaneous enumeration
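A sketch of this single-array case in C, assuming coordinate arrays row/col/val like those sketched earlier (names are hypothetical):

/* y += A*x by enumerating the coordinate storage of A. The product is
   a DOANY: any enumeration order of the non-zeros gives the result. */
void coo_matvec(int nnz, const int *row, const int *col,
                const double *val, const double *x, double *y)
{
    for (int k = 0; k < nnz; k++)
        y[row[k]] += val[k] * x[col[k]];
}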

Page 27: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Efficient simultaneous enumeration

DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN sum = sum + X(I)*Y(I)

• Options:
– Enumerate X, search Y: “data-centric on” X
– Enumerate Y, search X: “data-centric on” Y
– Can speed up searching by scattering into a dense vector
– If both sorted: “2-finger” merge

• Best choice depends on how X and Y are stored

• What is the general picture?
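A sketch of the “2-finger” merge for the sparse dot product, assuming both vectors are stored as (index, value) arrays sorted by index:

/* Sparse dot product by "2-finger" merge: advance whichever finger
   points at the smaller index; multiply when the indices match.
   One pass, O(nx + ny) time. */
double sparse_dot_merge(int nx, const int *xi, const double *xv,
                        int ny, const int *yi, const double *yv)
{
    double sum = 0.0;
    int p = 0, q = 0;
    while (p < nx && q < ny) {
        if (xi[p] < yi[q])       p++;
        else if (xi[p] > yi[q])  q++;
        else { sum += xv[p] * yv[q]; p++; q++; }
    }
    return sum;
}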

Page 28: Relational Query Processing Approach to Compiling Sparse Matrix Codes


An observation

DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN sum = sum + X(I)*Y(I)

• Can view arrays as relations (as in “relational databases”)

X(i,x) Y(i,y)

• Have to enumerate solutions to the relational query

Join(X(i,x), Y(i,y))


Page 29: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Connection to relational queries

• Dot product ⇒ Join(X,Y)

• General case?

Dot product            Equi-join
Enumerate/Search       Enumerate/Search
Scatter                Hash join
“2-finger”             (Sort-)Merge join
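For comparison, a sketch of the scatter option, the analogue of a hash join with the index as the key; it assumes a caller-provided dense workspace w that is all zeros on entry and is re-zeroed before returning.

/* Sparse dot product by scattering X into a dense workspace and then
   enumerating Y, probing the workspace -- a "hash join" where the
   index itself is the hash. */
double sparse_dot_scatter(double *w,
                          int nx, const int *xi, const double *xv,
                          int ny, const int *yi, const double *yv)
{
    double sum = 0.0;
    for (int p = 0; p < nx; p++) w[xi[p]] = xv[p];         /* scatter X    */
    for (int q = 0; q < ny; q++) sum += w[yi[q]] * yv[q];  /* probe with Y */
    for (int p = 0; p < nx; p++) w[xi[p]] = 0.0;           /* restore w    */
    return sum;
}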

Page 30: Relational Query Processing Approach to Compiling Sparse Matrix Codes


From loop nests to relational queries

DO I, J, K, ...
  … A(F(I,J,K,...)) … B(G(I,J,K,...)) …

• Arrays are relations (e.g. A(r,c,a))

– Implicitly store zeros and non-zeros

• Integer space of loop variables is a relation, too: Iter(i,j,k,…)

• Access predicate S: relates loop variables and array elements

• Sparsity predicate P: “interesting” combination of zeros/non-zeros

Select(P, Select(S ∧ Bounds, Product(Iter, A, B, …)))

Page 31: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Why relational queries?

“[Relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.”

— E. F. Codd (CACM, 1970)

• Want to separate what is to be computed from how

Page 32: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Bernoulli Sparse Compilation Toolkit

• BSCT is about 40K lines of ML + 9K lines of C

• Query optimizer at the core

• Extensible: new formats can be added

[Diagram: the front-end turns the input program into a query; the optimizer produces a plan; the instantiator expands the plan into low-level C code. Storage formats (CRS, CCS, BRS, Coordinate, …) plug in through abstract properties and macros.]

Page 33: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Query optimization: ordering joins

select(a≠0 ∧ x≠0, join(A(i,j,a), X(j,x), Y(i,y)))

A in CCS: Join(Join(A,X), Y)
  FOR J in Join(A,X)
    FOR I in Join(A(*,J), Y)

A in CRS: Join(Join(A,Y), X)
  FOR I in Join(A,Y)
    FOR J in Join(A(I,*), X)
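The CCS plan is exactly the loop nest shown earlier (slide 17); for contrast, a sketch of the CRS plan, where the row join is outermost (rowp/colind/vals are assumed CRS arrays):

/* Plan for A in CRS: Join(Join(A,Y), X). The outer loop enumerates
   Join(A, Y) by rows; the inner loop enumerates Join(A(I,*), X). */
void crs_matvec(int n, const int *rowp, const int *colind,
                const double *vals, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int jj = rowp[i]; jj < rowp[i + 1]; jj++)
            y[i] += vals[jj] * x[colind[jj]];
}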

Page 34: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Query optimization: implementing joins

FOR I in Join(A,Y)
  FOR J in Join(A(I,*), X)
    …

⇓

H = scatter(X)
FOR I in Merge(A,Y)
  FOR J: enumerate A(I,*), search H
    …

• Output is called a plan

Page 35: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Instantiator: executable code generation

H = scatter(X)
FOR I in Merge(A,Y)
  FOR J: enumerate A(I,*), search H
    …

⇓

…
for (I = 0; I < N; I++)
  for (JJ = ROWP(I); JJ < ROWP(I+1); JJ++)
    …

• Macro expansion

• Open system

Page 36: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Summary of the compilation techniques

• Data-centric methodology: walk the data, compute accordingly

• Implementation for sparse arrays
– arrays = relations, loop nests = queries

• Compilation path
– Main steps are independent of data structure implementations

• Parallel code generation
– Ownership, communication sets, ... = relations

• Difference from traditional relational database query optimization
– Selectivity of predicates not an issue; affine joins

Page 37: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Experiments

• Sequential
– Kernels from SPBLAS library
– Iterative solution of linear systems

• Parallel
– Iterative solution of linear systems
– Comparison with the BlockSolve library from Argonne NL
– Comparison with the proposed “High-Performance Fortran” standard

Page 38: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Setup

• IBM SP-2 at Cornell

• 120 MHz P2SC processor at each node
– Can issue 2 multiply-add instructions per cycle
– Peak performance 480 Mflops
– Much lower on sparse problems: < 100 Mflops

• Benchmark matrices
– From the Harwell-Boeing collection
– Synthetic problems

Page 39: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Matrix-vector products

• BSR = “Block Sparse Row”

• VBR = “Variable Block Sparse Row”

• BSCT_OPT = some “Dragon Book” optimizations applied by hand
– Loop invariant removal

[Charts: BSR/MV and VBR/MV matrix-vector product performance, 0-100 Mflops vs block size (5 to 25), comparing LIB (SPBLAS), BSCT, and BSCT_OPT]

Page 40: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Solution of Triangular Systems

• Bottom line:
– Can compete with the SPBLAS library (need to implement loop invariant removal :-)

[Charts: BSR/TS and VBR/TS triangular solve performance, 0-100 Mflops vs block size (5 to 25), comparing LIB (SPBLAS), BSCT, and BSCT_OPT]

Page 41: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Iterative solution of sparse linear systems

• Essential for large-scale simulations

• Preconditioned Conjugate Gradients (PCG) algorithm
– Basic kernels: y = Ax, Lx = b, + dense vector ops

• Preprocessing step
– Find M such that M⁻¹A ≈ I
– Incomplete Cholesky factorization (ICC): A ≈ CCᵀ
– Basic kernels: A ← A − uvᵀ, sparse vector scaling
– Cannot be implemented using the SPBLAS library

• Used CCS format (“natural” for ICC)

Page 42: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Iterative Solution

• ICC: a lot of “sparse overhead”

• Ongoing investigation (at MathWorks):
– Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!

ICC/PCG performance (Mflops) vs matrix size:

Matrix size    2000    4000    8000    16000
ICC            5.86    4.22    2.9     1.9
PCG            48      47      46      40

Page 43: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Iterative solution (cont.)

• Preliminary comparison with IBM ESSL DSRIS
– DSRIS implements PCG (among other things)

• On BCSSTK30; values were set to vary the convergence

• BSCT ICC takes 1.28 secs

• ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs

• PCG iterations are ~15% faster in ESSL

Page 44: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Parallel experiments

• Conjugate Gradient algorithm

– vs BlockSolve library (Argonne NL)

• “Inspector” phase

– Pre-computes what communication needs to occur

– Done once, might be expensive

• “Executor” phase

– “Receive-compute-send-...”

• On Harwell-Boeing matrices

• On synthetic grid problems to understand scalability

Page 45: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Executor performance

• Grid problems: problem size per processor is constant
– 135K rows, ~4.6M non-zeroes

• Within 2-4% of the library

[Charts: executor time (seconds) vs number of processors for Bsolve and BSCT, on BCSSTK32 (2 to 16 processors) and on grid problems (2 to 64 processors)]

Page 46: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Inspector overhead

• Ratio of the inspector to a single iteration of the executor
– A problem-independent measure

• “HPF-2” -- new data-parallel Fortran standard
– Lowest common denominator; inspectors are not scalable

[Charts: inspector overhead (ratio to one executor iteration) vs number of processors for Bsolve, BSCT, and HPF-2, on BCSSTK32 (2 to 16 processors) and on grid problems (2 to 64 processors)]

Page 47: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Experiments: summary

• Sequential
– Competitive with the SPBLAS library

• Parallel
– Inspector phase should exploit formats (cf. HPF-2)

Page 48: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 50: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Ongoing work

• Packaging
– “Library-on-demand”; as a Matlab toolbox
– Completely automatic tool; data structure selection

• Out-of-core computations

• Parallel code generation
– Extend to handle more kernels

• Core of the compiler
– Disjunctive queries, fill

Page 51: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - compilers

• Polyhedral model [Lamport ’78, …]

• Sparse compilation [Bik,Wijshoff ‘92]

• Support for sparse computations in HPF-2 [Saltz, Ujaldon et al]
– Fixed data structures
– Separate compilation path for dense/sparse

• Data-centric blocking [Kodukula,Ahmed,Pingali ‘97]

Page 53: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - languages and compilers

• Compilation of dense matrix (regular) codes
– Polyhedral model [Lamport ‘78, ...]
– Data-parallel languages (e.g. HPF, ZPL)

• Compilation of sparse matrix codes
– [Bik, Wijshoff] -- sequential sparse compiler
– [Saltz, Zima, Ujaldon, …] -- irregular computations in HPF-2
– Fixed data structures, not extensible

• Programming with ADTs
– SETL -- automatic data structure selection for set operations
– Transformational systems (e.g. Polya [Gries])
– Software reuse through views [Novak]

Page 54: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - databases

• Optimizing loops in DB programming languages [Lieuwen, DeWitt ‘90]

• Extensible database systems [Predator, …]

[Diagram: the Predator extensible DBMS (utilities plus per-type optimizers for relations, sequences, images, …) side by side with BSCT (utilities plus formats CRS, COORD, BSR, … feeding one compiler)]

Page 55: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Conclusions/Contributions

• Sparse matrix computations are widely used in practice

• Code development is tedious and error-prone

• Bernoulli Sparse Compilation Toolkit:
– Arrays as relations, loops as queries
– Compilation as query optimization

• Algebras in optimizing compilers:

Data-flow analysis             Lattice algebra
Dense matrix computations      Polyhedral algebra
Sparse matrix computations     Relational algebra

Page 56: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future work

• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations

• Compilation of signal/image processing applications
– dense matrix computations + fast transforms (e.g. DCT)
– multiple data representations

• Programming languages and databases
– e.g. Java and JDBC

Page 57: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future work

• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations

• Extensible compilation
– Want to support multiple formats, optimizations across basic ops
– Example: signal/image processing

Page 58: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future work

• Application area: signal/image processing
– compositions of transforms, such as FFTs
– multiple data representations
– important to optimize the flow of data through memory hierarchies
– IBM ESSL: FFT at close to peak performance
– algebra: Kronecker products

Page 59: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future interests

[Diagram: the sparse compiler at the intersection of compilers, databases, and computational science, connecting to database programming systems and data mining]