Carnegie Mellon
12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRAL: Tuning DSP Transforms to Computing Platforms

José M. F. Moura
http://www.ece.cmu.edu/~spiral

Faculty:
• Jeremy Johnson (Drexel)
• Robert Johnson (MathStar Inc.)
• David Padua (UIUC)
• Viktor Prasanna (USC)
• Markus Püschel (CMU)
• Manuela Veloso (CMU)

Students:
• Franz Franchetti (TU Vienna)
• Gavin Haentjens (CMU)
• Pinit Kumhom (Drexel)
• Neungsoo Park (USC)
• David Sepiashvili (CMU)
• Bryan Singer (CMU)
• Yevgen Voronenko (Drexel)
• Jianxin Xiong (UIUC)
Sponsor

Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
Moore’s Law and High(est) Performance Scientific Computing
(single processor, off-the-shelf)

Moore’s Law:
• processor-memory bottleneck
• short life cycles of computers
• very complex architectures
  • vendor specific
  • special instructions (MMX, SSE, FMA, …)
  • undocumented features

Effects on software/algorithms:
• arithmetic cost model not accurate for predicting runtime (one cache miss = 10 floating point ops)
• better performance models hard to get
• best code is machine dependent (registers/caches size, structure)
• hand-tuned code becomes obsolete as fast as it is written
• compiler limitations
• full performance requires (in part) assembly coding

Portable performance requires automation
Automatic Performance Tuning: Research

Linear Algebra:
• ATLAS (J. Dongarra et al.)
• LAPACK
• PHiPAC (J. Demmel et al.)

Signal Processing:
• FFTW (M. Frigo and S. Johnson)
• SPIRAL
SPIRAL

A library generator for highly optimized signal processing algorithms

Automates, for DSP algorithms that are performance critical:
• Implementation: cuts development costs; code less error-prone
• Platform-Adaptation: takes advantage of architecture specific features; porting without loss of performance
• Optimization: systematic exploration of alternatives both at algorithmic and code level
SPIRAL Approach

Given:
• DSP Transform (DFT, DCT, Wavelets etc.)
• Computing Platform (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, …)

SPIRAL Search Space:
• Possible Algorithms
• Possible Implementations

Intelligent Search, guided by Performance Evaluation

Result: adapted implementation
Organization

• Mathematical Framework: Transforms, Rules, and Formulas
• Formula Generator: Transform → Algorithm
• SPL and SPL Compiler: Algorithm → Implementation
• Search Engine: How to find the best implementation
• SPIRAL system: Everything taken together
• Conclusions
DSP Algorithms: Example 4-point DFT

Cooley/Tukey FFT (size 4):
• product of structured sparse matrices
• mathematical notation

Generated code:
• single static assignment code
• no reuse of temporary vars
• only scalar temporary vars
• constants precomputed

Extensible through templates
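The matrices for the size-4 example did not survive in this transcript. As a minimal sketch, the standard Cooley/Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) · T · (I_2 ⊗ DFT_2) · L can be checked numerically. The sign convention w = exp(-2πi/n) and all helper names below are assumptions made for illustration, not taken from the slides.

```python
import cmath

def dft(n):
    """Dense n x n DFT matrix with entries w^(i*j), assuming w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (i * j) for j in range(n)] for i in range(n)]

def kron(a, b):
    """Kronecker (tensor) product of two dense matrices."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

I2 = [[1, 0], [0, 1]]
F2 = dft(2)

# Twiddle factors: diag(1, 1, 1, w4) with w4 = exp(-2*pi*i/4) = -i.
T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1j]]

# Stride permutation: (x0, x1, x2, x3) -> (x0, x2, x1, x3).
L = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# Product of four structured sparse factors.
product = matmul(matmul(matmul(kron(F2, I2), T), kron(I2, F2)), L)

reference = dft(4)
max_err = max(abs(product[i][j] - reference[i][j])
              for i in range(4) for j in range(4))
print("max deviation from dense DFT_4:", max_err)
```

Each factor has at most two nonzero entries per row, which is where the arithmetic savings over the dense matrix come from.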
Templates

    (template
      (F n)
      [ n >= 1 ]
      ( do i=0,n-1
          y(i)=0
          do j=0,n-1
            y(i)=y(i)+W(n,i*j)*x(j)
          end
        end ))

• Pattern: (F n)
• Condition: [ n >= 1 ]
• I-code: the nested do loops
Code Generation and Template Matching

(F 2) matches pattern (F n) and assigns 2 to n. Because n=2 satisfies the condition n>=1, the following i-code is generated from the template:

    do i = 0,1
      y(i) = 0
      do j = 0,1
        y(i) = y(i)+W(2,i*j)*x(j)
      end
    end

Unrolling & Optimization:

    y(0)=x(0)+x(1)
    y(1)=x(0)-x(1)
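The instantiate-and-unroll step can be imitated in a few lines of Python. This is only a sketch of the idea, not the SPL compiler: `w` and `unroll_f` are hypothetical names, W(n,k) is assumed to be exp(-2πik/n), and the only optimization performed is folding coefficients equal to +1 or -1.

```python
import cmath

def w(n, k):
    """W(n, k) from the template, assumed here to be exp(-2*pi*i*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)

def unroll_f(n):
    """Unroll the (F n) loops into straight-line statements.

    Coefficients equal to +1 or -1 are folded into plain adds and subtracts;
    any other coefficient is printed as a precomputed constant.
    """
    if n < 1:
        raise ValueError("template condition n >= 1 not satisfied")
    lines = []
    for i in range(n):
        terms = []
        for j in range(n):
            c = w(n, i * j)
            if abs(c - 1) < 1e-12:
                terms.append(f"+x({j})")
            elif abs(c + 1) < 1e-12:
                terms.append(f"-x({j})")
            else:
                terms.append(f"+({c.real:.4f}{c.imag:+.4f}i)*x({j})")
        lines.append(f"y({i})=" + "".join(terms).lstrip("+"))
    return lines

for line in unroll_f(2):
    print(line)
# prints:
# y(0)=x(0)+x(1)
# y(1)=x(0)-x(1)
```

For n=2 every coefficient is ±1, so the multiplications disappear entirely, matching the unrolled result on the slide.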
SIMD Short Vector Extensions

vector length = 4 (4-way)

• Extension to instruction set architecture
• Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)
• Originally for multimedia (like MMX for integers)
• Requires fine grain parallelism
• Large potential speed-up

Problems:
• SIMD instructions are architecture specific
• No common API (usually assembly hand coding)
• Performance very sensitive to memory access
• Automatic vectorization very limited
Vector Code Generation from SPL Formulas

Naturally vectorizable construct: y = (A ⊗ I_4) x, where 4 is the vector length

Symbolic vectorization (automatic formula manipulation) brings a formula into the form

    \prod_{i=1}^{k} P_i D_i (A_i \otimes I_\nu) E_i Q_i

• P_i, Q_i: permutations
• D_i, E_i: diagonals
• A_i: arbitrary formulas
• ν: SIMD vector length

Mapping to C code + vector API (SSE, SSE2, AltiVec, …):

    LOAD_VECT(xl0, x + 0);
    LOAD_VECT(xl4, x + 16);
    f0 = SIMD_SUB(xl0, xl4);
    LOAD_VECT(xl1, x + 4);
    LOAD_VECT(xl5, x + 20);
    f1 = SIMD_SUB(xl1, xl5);
    ...
    yl7 = SIMD_SUB(f1, f4);
    STORE_L_8_4(yl6, yl7, y + 24);
    yl2 = SIMD_SUB(f0, f5);
    yl3 = SIMD_ADD(f1, f4);
    STORE_L_8_4(yl2, yl3, y + 8);
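Why A ⊗ I_ν vectorizes naturally can be seen in a small model: every scalar operation of A becomes one operation on a whole ν-element chunk. The sketch below uses plain Python lists in place of SIMD registers; `apply_vectorized` and the other names are hypothetical and do not correspond to the vector API in the generated C code.

```python
def apply_dense(m, x):
    """Reference: dense matrix-vector product."""
    return [sum(m[i][j] * x[j] for j in range(len(x))) for i in range(len(m))]

def kron_identity(a, nu):
    """Dense matrix of A (x) I_nu, built only to check the result."""
    n = len(a)
    return [[a[i][j] if r == c else 0
             for j in range(n) for c in range(nu)]
            for i in range(n) for r in range(nu)]

def apply_vectorized(a, x, nu):
    """Compute (A (x) I_nu) x using only whole-chunk operations.

    Each chunk of nu consecutive elements plays the role of one SIMD register.
    """
    n = len(a)
    chunks = [x[j * nu:(j + 1) * nu] for j in range(n)]  # "vector loads"
    out = []
    for i in range(n):
        acc = [0] * nu
        for j in range(n):
            # One "vector multiply-add": the same scalar a[i][j] hits all lanes.
            acc = [s + a[i][j] * v for s, v in zip(acc, chunks[j])]
        out.extend(acc)                                   # "vector store"
    return out

F2 = [[1, 1], [1, -1]]
x = list(range(8))
y = apply_vectorized(F2, x, 4)
print(y)  # [4, 6, 8, 10, -4, -4, -4, -4]
```

No element ever moves between lanes, so no shuffles are needed; the permutations and diagonals in the general form are where the architecture-specific work goes.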
Organization

• Mathematical Framework: Transforms, Rules, and Formulas
• Formula Generator: Transform → Algorithm
• SPL and SPL Compiler: Algorithm → Implementation
• Search Engine: How to find the best implementation
• SPIRAL system: Everything taken together
• Conclusions
Why Search?

Toy problem: DCT-IV, size 2^4, ~31000 formulas

Search in algorithm space in SPIRAL:
• Exhaustive & Random Search, DP, Hill climbing, Genetic Algorithms
• Beyond search: design of optimal tree
Why Search?

Toy problem: DCT, type IV, size 16, ~31000 formulas

• very many different formulas
• large spread in runtimes, even for modest size
• precisely equal arithmetic cost
• best formula is platform-dependent
Number of Formulas/Algorithms

k    # DFT, size 2^k      # DCT-IV, size 2^k
1    1                    1
2    6                    10
3    40                   126
4    296                  31242
5    27744                1924443362
6    162570361280         7343815121631354242
7    ~1.01 • 10^27        ~1.07 • 10^38
8    ~2.31 • 10^61        ~2.30 • 10^76
9    ~2.86 • 10^133       ~1.06 • 10^153

• differ in data flow, not in arithmetic cost
• exponential search space
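The rule set behind these counts is not given in this transcript, so the numbers cannot be reproduced here. As a deliberately simplified sketch, the code below counts fully expanded breakdown trees when the only rule is a binary Cooley/Tukey-style split of size 2^k into 2^i and 2^(k-i), with size 2 as the single base case. These assumptions and the name `count_trees` are illustrative only; the counts are much smaller than those in the table, but they already grow exponentially in k.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(k):
    """Number of fully expanded split trees for size 2^k.

    Simplifying assumption: the only rule is a binary split k -> (i, k - i),
    and size 2 (k = 1) is the only base case.
    """
    if k == 1:
        return 1
    return sum(count_trees(i) * count_trees(k - i) for i in range(1, k))

counts = [count_trees(k) for k in range(1, 10)]
print(counts)  # [1, 1, 2, 5, 14, 42, 132, 429, 1430]
```

Adding more rules per transform, and rules that introduce other transforms, multiplies the choices at every node.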
Search Methods Available in SPIRAL

• Exhaustive Search
• Dynamic Programming (DP)
• Random Search
• Hill Climbing
• STEER (similar to a genetic algorithm)

Method          Possible Sizes   Formulas Timed   Results
Exhaustive      Very small       All              Best
DP              All              10s-100s         (very) good
Random          All              User decided     fair/good
STEER           All              100s-1000s       (very) good
Hill Climbing   All              100s-1000s       Good

Search over
• algorithm space and
• implementation options (degree of unrolling)
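A minimal sketch of the dynamic programming idea: the best tree for size 2^k is assembled from the already-known best trees for smaller sizes, so only a handful of candidates per size need to be evaluated. In SPIRAL the cost would be a measured runtime; here `leaf_cost`, `MAX_LEAF`, and the split cost formula are a hypothetical stand-in model chosen only to make the example runnable.

```python
from functools import lru_cache

MAX_LEAF = 3  # hypothetical: sizes up to 2^3 may be emitted as unrolled code

def leaf_cost(k):
    """Hypothetical cost of an unrolled size-2^k block (stand-in for a timing)."""
    return (2 ** k) * k

def tree_size(tree):
    """A tree is an int leaf k, or a pair (left, right) of subtrees."""
    return tree if isinstance(tree, int) else tree_size(tree[0]) + tree_size(tree[1])

@lru_cache(maxsize=None)
def best(k):
    """Return (cost, tree) for size 2^k, reusing the best sub-solutions."""
    candidates = [(leaf_cost(k), k)] if k <= MAX_LEAF else []
    for i in range(1, k):
        cost_l, tree_l = best(i)
        cost_r, tree_r = best(k - i)
        # 2^(k-i) calls of the left block, 2^i calls of the right block,
        # plus 2^k for twiddle multiplications (all part of the toy model).
        cost = 2 ** (k - i) * cost_l + 2 ** i * cost_r + 2 ** k
        candidates.append((cost, (tree_l, tree_r)))
    return min(candidates, key=lambda c: c[0])

cost10, tree10 = best(10)
print(cost10, tree10)
```

DP assumes the best subtree stays best in every context, which need not hold on real hardware; that is one reason the table rates its results "(very) good" rather than "best".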
STEER

Population n → Population n+1 by:
• Mutation: expand differently
• Cross-Breeding: swap expansions
• Survival of Fittest
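The three operators can be sketched as an evolutionary loop over split trees. This is not the STEER implementation: the tree encoding, the `cost` model (a stand-in for measured runtime), the leaf limit of 2^3, and all parameter values are assumptions made so that the example runs on its own.

```python
import random

def random_tree(k, rng):
    """Randomly expand size 2^k; leaves are allowed only for k <= 3 (assumed)."""
    if k == 1 or (k <= 3 and rng.random() < 0.5):
        return k
    i = rng.randint(1, k - 1)
    return (random_tree(i, rng), random_tree(k - i, rng))

def size(t):
    return t if isinstance(t, int) else size(t[0]) + size(t[1])

def cost(t):
    """Hypothetical cost model standing in for a measured runtime."""
    if isinstance(t, int):
        return (2 ** t) * t
    l, r = t
    return 2 ** size(r) * cost(l) + 2 ** size(l) * cost(r) + 2 ** size(t)

def subtrees(t, path=()):
    """Yield (path, subtree) for every node of the tree."""
    yield path, t
    if not isinstance(t, int):
        yield from subtrees(t[0], path + (0,))
        yield from subtrees(t[1], path + (1,))

def replace(t, path, new):
    """Return a copy of t with the node at path replaced by new."""
    if not path:
        return new
    l, r = t
    if path[0] == 0:
        return (replace(l, path[1:], new), r)
    return (l, replace(r, path[1:], new))

def mutate(t, rng):
    """Mutation: pick a node and expand it differently."""
    path, sub = rng.choice(list(subtrees(t)))
    return replace(t, path, random_tree(size(sub), rng))

def crossbreed(a, b, rng):
    """Cross-breeding: swap in an expansion of equal size taken from b."""
    path, sub = rng.choice(list(subtrees(a)))
    donors = [s for _, s in subtrees(b) if size(s) == size(sub)]
    return replace(a, path, rng.choice(donors)) if donors else a

def steer_like(k, generations=20, pop_size=12, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(t, rng) for t in pop]
        children += [crossbreed(a, rng.choice(pop), rng) for a in pop]
        # Survival of the fittest: keep the cheapest distinct trees.
        pop = sorted(set(pop + children), key=cost)[:pop_size]
    return pop[0]

best_tree = steer_like(8)
print(cost(best_tree), best_tree)
```

Because parents compete with their children, the best cost never gets worse from one generation to the next.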
Learning to Generate Fast Algorithms

• Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy)
• Learns from a transform of one size, generates the best algorithm for many sizes
• Tested for DFT and WHT
Some Experimental Results
DCT Type IV Size 16

[Figure: Fastest Found Formulas; Number of Formulas Timed]
Experimental Results

• high performance code (compared with FFTW)
• different transforms
• search methods (applicable to all transforms)
Some Experimental Results