The Study of Cache Oblivious Algorithms Prepared by Jia Guo
Feb 25, 2016

The Study of Cache Oblivious Algorithms. Prepared by Jia Guo. Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
Page 1: The Study of Cache Oblivious Algorithms

The Study of Cache Oblivious Algorithms

Prepared by Jia Guo

2CS598dhp

Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Outline

Cache complexity
Cache-aware algorithms
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  FFT

Conclusion

Assumption

Only two levels in the memory hierarchy:
  An ideal cache
    Fully associative
    Optimal replacement strategy
    "Tall cache": Z = Ω(L²)
  A very large memory

An Ideal Cache Model

An ideal cache model (Z,L)

Z: Total words in the cache
L: Words in one cache line

Cache Complexity

An algorithm with input size n is measured by:
  Work complexity W(n)
  Cache complexity Q(n; Z, L): the number of cache misses it incurs
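
To make the miss count concrete, here is a minimal Python sketch that counts misses for a sequence of word addresses. It is only a model under stated assumptions: LRU replacement stands in for the ideal cache's optimal replacement, and the function name `count_misses` and its parameters are hypothetical.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Assumption: LRU replacement is used as a stand-in for the optimal
    replacement strategy of the ideal cache model.
    """
    num_lines = Z // L           # the cache holds Z/L lines of L words each
    cache = OrderedDict()        # keys are line numbers, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // L         # cache line that contains this word
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the line
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses
```

Under this model, scanning n consecutive words touches each cache line once, while striding through memory L words at a time misses on every access.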

Cache Aware Algorithms

Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L).

Need to adjust parameters when running on different platforms.

Example: A blocked matrix multiplication algorithm

s is a tuning parameter to make the algorithm run fast

[Figure: the n x n matrix A partitioned into s x s blocks such as A11]
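
A minimal Python sketch of a blocked multiplication of this kind, assuming square n x n matrices stored as lists of lists and a block size s that divides n. The name `block_mult` is hypothetical, and this is an illustration rather than the exact algorithm from the slides.

```python
def block_mult(A, B, n, s):
    """Return C = A * B for n x n matrices, working on s x s blocks.

    Assumption: s divides n. In a cache-aware setting, s is tuned so that
    three s x s blocks fit in the cache together.
    """
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):              # block row of C
        for j0 in range(0, n, s):          # block column of C
            for k0 in range(0, n, s):      # block index along the inner dimension
                # Multiply one s x s block of A by one s x s block of B
                # and accumulate into the matching block of C.
                for i in range(i0, i0 + s):
                    for k in range(k0, k0 + s):
                        a = A[i][k]
                        for j in range(j0, j0 + s):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on s; only the memory access pattern does, which is why s has to be re-tuned for each platform.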

Example (2)

Cache complexity

The three s x s submatrices should fit into the cache, so they occupy $\Theta(\max(s, s^2/L))$ cache lines.

Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.

$Z/L$ cache misses are needed to bring the 3 submatrices into the cache.

$n^2/L$ cache misses are needed to read the $n^2$ elements.

The cache complexity is $\Theta(1 + n^2/L + (n/s)^3 (Z/L)) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.
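
The cubic term in this bound can be worked out step by step. The following is a sketch of the arithmetic, assuming the blocked algorithm performs one s x s block multiplication for each of the $(n/s)^3$ block triples and that $s = \Theta(\sqrt{Z})$:

```latex
\begin{align*}
\text{number of block multiplications} &= (n/s)^3 \\
\text{misses per block multiplication} &= \Theta(Z/L) = \Theta(s^2/L) \\
(n/s)^3 \cdot \Theta(s^2/L) &= \Theta\!\left(\frac{n^3}{sL}\right)
  = \Theta\!\left(\frac{n^3}{L\sqrt{Z}}\right)
\end{align*}
```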

Cache Oblivious Algorithms

Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).

No tuning needed; platform independent.

The algorithms introduced next are proven to have optimal cache complexity.

Matrix Multiplication

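Partition matrices A and B in half along the largest dimension. A: n x m, B: m x p

n ≥ max(m, p): split A by rows, $\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}$

m ≥ max(n, p): split A by columns and B by rows, $\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$

p ≥ max(n, m): split B by columns, $A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}$

Proceed recursively until reaching the base case of one element.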
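
The three cases can be sketched in Python as a divide-and-conquer routine over index ranges. This is a minimal illustration, assuming matrices stored as lists of lists with all dimensions at least 1; the names `rec_mult` and `multiply` are hypothetical.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Add A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C[i0:i0+n, j0:j0+p]."""
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]      # base case: one element
    elif n >= max(m, p):
        h = n // 2                               # split the rows of A
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        h = m // 2                               # split columns of A, rows of B
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        h = p // 2                               # split the columns of B
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)

def multiply(A, B):
    """Return the n x p product of an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C
```

Note that no cache parameter appears anywhere in the code: the recursion alone determines the access pattern.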

Matrix Multiplication (2)

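Assume the sizes of A and B are n x 4n and 4n x n.

[Figure: recursion tree for A*B]

$A B = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$

$A_1 B_1 = \begin{pmatrix} A_{11} & A_{12} \end{pmatrix} \begin{pmatrix} B_{11} \\ B_{12} \end{pmatrix} = A_{11} B_{11} + A_{12} B_{12}$

$A_2 B_2 = \begin{pmatrix} A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{21} \\ B_{22} \end{pmatrix} = A_{21} B_{21} + A_{22} B_{22}$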

Matrix Multiplication (3)

Intuitively, once a sub problem fits into the cache, its smaller sub problems can be solved in cache with no further misses.

Matrix Multiplication (4)

Cache complexity
  Achieves the same cache complexity as the Block-MULT algorithm (cache aware)
  For square matrices, the optimal cache complexity is achieved

Matrix Transposition

A (m x n) → B = $A^T$ (n x m)

for i ← 1 to m
    for j ← 1 to n
        B(j, i) = A(i, j)

If n is very large, accessing B column by column causes a cache miss on every access (no spatial locality in B).

Matrix Transposition (2)

Partition array A along the longer dimension and recursively execute the transpose function.

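[Figure: $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ is transposed block by block into $A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$]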
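
A minimal Python sketch of the recursive transposition, assuming A is an m x n list of lists and B is a preallocated n x m list of lists. The names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of the m x n block of A at (i0, j0) into B."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]                    # base case: one element
    elif m >= n:
        h = m // 2                               # rows are the longer dimension
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        h = n // 2                               # columns are the longer dimension
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)

def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```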

Matrix Transposition (3)

Cache complexity
  It has the optimal cache complexity
  Q(m, n) = Θ(1 + mn/L)
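
The difference between the doubly nested loop and the recursive transposition can be illustrated with a small simulation. The sketch below is only a model under stated assumptions: LRU replacement stands in for the ideal cache's optimal replacement, A and B are N x N, row-major, and stored back to back starting at address 0, and all function names and the example parameters (N = 64, Z = 256, L = 8) are hypothetical.

```python
from collections import OrderedDict

def lru_misses(trace, Z, L):
    """Count misses of an address trace in an LRU cache of Z words, L per line."""
    cache, misses = OrderedDict(), 0
    for addr in trace:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > Z // L:
                cache.popitem(last=False)   # evict the least recently used line
    return misses

def naive_trace(N):
    """Addresses touched by the doubly nested transposition loop."""
    for i in range(N):
        for j in range(N):
            yield i * N + j              # read A(i, j)
            yield N * N + j * N + i      # write B(j, i)

def rec_trace(N, i0, m, j0, n):
    """Addresses touched when the longer dimension is halved recursively."""
    if m == 1 and n == 1:
        yield i0 * N + j0
        yield N * N + j0 * N + i0
    elif m >= n:
        h = m // 2
        yield from rec_trace(N, i0, h, j0, n)
        yield from rec_trace(N, i0 + h, m - h, j0, n)
    else:
        h = n // 2
        yield from rec_trace(N, i0, m, j0, h)
        yield from rec_trace(N, i0, m, j0 + h, n - h)

N, Z, L = 64, 256, 8
naive = lru_misses(naive_trace(N), Z, L)
recursive = lru_misses(rec_trace(N, 0, N, 0, N), Z, L)
print(naive, recursive)
```

In this model the recursive order needs far fewer misses than the loop, because each small sub-block of A and B fits in the cache while it is being processed.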

Fast Fourier Transform

Use the Cooley-Tukey algorithm.

Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as:
  Perform n2 DFTs of size n1.
  Multiply by complex roots of unity called twiddle factors.
  Perform n1 DFTs of size n2.

$Y[i] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{-ij}$, where $\omega_n = e^{2\pi\sqrt{-1}/n}$

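$Y[i] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{-ij}$

$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$

[Figure: X laid out as a matrix with dimensions n1 and n2]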

Assume X is a row-major n1 × n2 matrix.

Steps:
  1. Transpose X in place.
  2. Compute n2 DFTs of size n1.
  3. Multiply by twiddle factors.
  4. Transpose X in place.
  5. Compute n1 DFTs of size n2.
  6. Transpose X in place.
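
The six steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: n is a power of 2, n1 and n2 are chosen as powers of 2 near the square root of n, the twiddle factors use a negative exponent, and the transposes build new lists instead of working in place. The names `transpose` and `fft` are hypothetical.

```python
import cmath

def transpose(x, rows, cols):
    """Return the transpose of a row-major rows x cols array as a flat list."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """DFT of x, where len(x) is a power of 2, following the six steps above."""
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    k = n.bit_length() - 1                 # n == 2**k
    n1 = 1 << ((k + 1) // 2)               # n1 * n2 == n
    n2 = n // n1
    # 1. Transpose the n1 x n2 matrix, so each row holds one size-n1 input.
    t = transpose(x, n1, n2)
    # 2. Compute n2 DFTs of size n1, one per row.
    rows = [fft(t[j2 * n1:(j2 + 1) * n1]) for j2 in range(n2)]
    # 3. Multiply entry (j2, i1) by the twiddle factor w_n^(-i1*j2).
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # 4. Transpose back to n1 x n2, so each row holds one size-n2 input.
    t = transpose(t, n2, n1)
    # 5. Compute n1 DFTs of size n2, one per row.
    rows = [fft(t[i1 * n2:(i1 + 1) * n2]) for i1 in range(n1)]
    t = [v for row in rows for v in row]
    # 6. The final transpose puts Y[i1 + i2*n1] at index i1 + i2*n1.
    return transpose(t, n1, n2)
```

For n = 8 this choice gives n1 = 4 and n2 = 2.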

Fast Fourier Transform

[Figure: example with n1 = 4, n2 = 2; * marks multiplication by a twiddle factor]

Transpose to select n2 DFTs of size n1

Call FFT recursively with n1 = 2, n2 = 2; reach the base case and return

Transpose to select n1 DFTs of size n2

Transpose and return

Fast Fourier Transform

Cache complexity
  Optimal for a Cooley-Tukey algorithm when n is an exact power of 2
  Q(n) = O(1 + (n/L)(1 + log_Z n))

Other Cache Oblivious Algorithms

Funnelsort
Distribution sort
LU decomposition without pivots

Questions

How large is the range of practicality of cache-oblivious algorithms?

What are the relative strengths of cache-oblivious and cache-aware algorithms?

Practicality of Cache-oblivious Algorithms

[Figure: average time to transpose an N x N matrix, divided by $N^2$]

Practicality of Cache-oblivious Algorithms (2)

[Figure: average time taken to multiply two N x N matrices, divided by $N^3$]

Question 2

Do cache-oblivious algorithms perform as well as cache-aware algorithms?
  FFTW library
  No answer yet.

References

Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.

Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesus Garzarán. LCPC 2005.