Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Samuel Larsen and Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp
Overview
• Problem statement
• New paradigm for parallelism: SLP
• SLP extraction algorithm
• Results
• SLP vs. ILP and vector parallelism
• Conclusions
• Future work
Multimedia Extensions
• Additions to all major ISAs
• SIMD operations

Instruction Set   Architecture   SIMD Width (bits)   Floating Point
AltiVec           PowerPC        128                 yes
MMX/SSE           Intel          64/128              yes
3DNow!            AMD            64                  yes
VIS               Sun            64                  no
MAX2              HP             64                  no
MVI               Alpha          64                  no
MDMX              MIPS V         64                  yes
Using Multimedia Extensions

• Library calls and inline assembly
  – Difficult to program
  – Not portable
• Different extensions to the same ISA
  – MMX and SSE
  – SSE vs. 3DNow!
• Need automatic compilation
Vector Compilation

• Pros:
  – Successful for vector computers
  – Large body of research
• Cons:
  – Involved transformations
  – Targets loop nests
Superword Level Parallelism (SLP)

• Small amount of parallelism
  – Typically 2 to 8-way
• Exists within basic blocks
• Uncovered with a simple analysis
• Independent, isomorphic operations
  – A new paradigm
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R]   [R]   [XR]   [1.08327]
[G] = [G] + [XG] * [1.89234]
[B]   [B]   [XB]   [1.29835]
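A minimal sketch of what this packed form can look like in source, using x86 SSE intrinsics as a stand-in for the paper's AltiVec target; the fourth vector lane is unused padding, and the name rgb_update is hypothetical.

#include <xmmintrin.h>  /* x86 SSE intrinsics (stand-in for AltiVec) */

/* Three independent multiply-adds become one SIMD multiply and
   one SIMD add; the fourth lane is padding. */
void rgb_update(float rgb[3], const float x[3])
{
    __m128 acc = _mm_set_ps(0.0f, rgb[2], rgb[1], rgb[0]);
    __m128 xv  = _mm_set_ps(0.0f, x[2], x[1], x[0]);
    __m128 cv  = _mm_set_ps(0.0f, 1.29835f, 1.89234f, 1.08327f);
    float out[4];

    acc = _mm_add_ps(acc, _mm_mul_ps(xv, cv));
    _mm_storeu_ps(out, acc);
    rgb[0] = out[0]; rgb[1] = out[1]; rgb[2] = out[2];
}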
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R]   [R]
[G] = [G] + X[i:i+2]
[B]   [B]
3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

for (i=0; i<100; i+=4) {
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]
}

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
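A compilable rendering of the vector form above, sketched with x86 SSE intrinsics (the paper's compiler targets AltiVec; the name vadd is hypothetical). Since 100 is a multiple of 4, no cleanup loop is needed.

#include <xmmintrin.h>

void vadd(float *A, const float *B)
{
    int i;
    for (i = 0; i < 100; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);
        __m128 b = _mm_loadu_ps(&B[i]);
        _mm_storeu_ps(&A[i], _mm_add_ps(a, b));  /* A[i:i+3] = A[i:i+3] + B[i:i+3] */
    }
}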
4. Partially Vectorizable Loops

for (i=0; i<16; i+=1) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
}

for (i=0; i<16; i+=2) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
}

for (i=0; i<16; i+=2) {
  [L0]
  [L1] = A[i:i+1] - B[i:i+1]

  D = D + abs(L0)
  D = D + abs(L1)
}
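In the packed form, the subtraction vectorizes while the dependent accumulation through D stays scalar. A sketch in the same SSE stand-in style, 4-wide instead of the slide's 2-wide; the name sum_abs_diff is hypothetical.

#include <math.h>
#include <xmmintrin.h>

float sum_abs_diff(const float *A, const float *B)
{
    float D = 0.0f;
    int i, j;
    for (i = 0; i < 16; i += 4) {
        float L[4];
        /* vectorizable part: L[0:3] = A[i:i+3] - B[i:i+3] */
        _mm_storeu_ps(L, _mm_sub_ps(_mm_loadu_ps(&A[i]),
                                    _mm_loadu_ps(&B[i])));
        /* serial part: the accumulation chain through D */
        for (j = 0; j < 4; j++)
            D = D + fabsf(L[j]);
    }
    return D;
}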
Exploiting SLP with SIMD Execution

• Benefit:
  – Multiple ALU ops → one SIMD op
  – Multiple ld/st ops → one wide memory op
• Cost:
  – Packing and unpacking
  – Reshuffling within a register
Packing/Unpacking Costs

• Packing source operands
• Unpacking destination operands

A = f()              pack A and B:    [A]
B = g()                               [B]

C = A + 2            [C]   [A]   [2]
D = B + 3            [D] = [B] + [3]

E = C / 5            unpack C and D:  [C]
F = D * 7                             [D]
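A sketch of where the costs land, again in the SSE stand-in style; f, g, and pack_cost_demo are hypothetical stand-ins for the slide's operations. The single SIMD add only pays off if the surrounding moves are amortized.

#include <xmmintrin.h>

static float f(void) { return 1.0f; }  /* stubs standing in for the */
static float g(void) { return 2.0f; }  /* slide's f() and g()       */

void pack_cost_demo(float *E, float *F)
{
    float A = f(), B = g();
    float tmp[4];

    /* packing cost: two scalar results moved into one register */
    __m128 ab = _mm_set_ps(0.0f, 0.0f, B, A);

    /* the SIMD op itself: C = A + 2 and D = B + 3 at once */
    __m128 cd = _mm_add_ps(ab, _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f));

    /* unpacking cost: C and D extracted for later scalar uses */
    _mm_storeu_ps(tmp, cd);
    *E = tmp[0] / 5.0f;   /* E = C / 5 */
    *F = tmp[1] * 7.0f;   /* F = D * 7 */
}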
Optimizing Program Performance

• To achieve the best speedup:
  – Maximize parallelization
  – Minimize packing/unpacking
• Many packing possibilities
  – Worst case: n ops → n! configurations
  – Different cost/benefit for each choice
Observation 1: Packing Costs Can Be Amortized

• Use packed result operands:

  A = B + C        G = A - H
  D = E + F        I = D - J

• Share packed source operands:

  A = B + C        G = B + H
  D = E + F        I = E + J
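Both patterns in one compilable sketch (SSE stand-in; the layout of operands in in[] and the name amortize_demo are hypothetical): the packs are built once and reused by every later SIMD op.

#include <xmmintrin.h>

/* out must hold 12 floats; upper two lanes of each vector are padding. */
void amortize_demo(const float *in, float *out)
{
    __m128 be = _mm_set_ps(0.0f, 0.0f, in[1], in[0]);  /* pack B, E once */
    __m128 cf = _mm_set_ps(0.0f, 0.0f, in[3], in[2]);  /* pack C, F once */
    __m128 hj = _mm_set_ps(0.0f, 0.0f, in[5], in[4]);  /* pack H, J once */

    __m128 ad  = _mm_add_ps(be, cf);  /* A = B + C,  D = E + F                     */
    __m128 gi  = _mm_sub_ps(ad, hj);  /* G = A - H,  I = D - J  (result reused)    */
    __m128 gi2 = _mm_add_ps(be, hj);  /* G = B + H,  I = E + J  (source shared)    */

    _mm_storeu_ps(&out[0], ad);
    _mm_storeu_ps(&out[4], gi);
    _mm_storeu_ps(&out[8], gi2);
}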
Observation 2: Adjacent Memory is Key

• Large potential performance gains
  – Eliminate ld/st instructions
  – Reduce memory bandwidth
• Few packing possibilities
  – Only one ordering exploits pre-packing
SLP Extraction Algorithm

Running example:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

• Identify adjacent memory references:

  [A]
  [B] = X[i:i+1]

• Follow def-use chains:

  [H]   [C]   [A]
  [J] = [D] - [B]

• Follow use-def chains:

  [C]   [E]   [3]
  [D] = [F] * [5]
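A much-simplified sketch of this seed-and-extend structure on a toy IR, in C. Every type and helper here is hypothetical (not SUIF's), and the real algorithm also checks independence, alignment, and profitability before committing a pack.

#include <stdbool.h>
#include <stddef.h>

/* Toy IR: just enough to show the extraction order. */
typedef struct Stmt Stmt;
struct Stmt {
    int   op;            /* operation kind */
    bool  is_mem;        /* load or store */
    int   base, offset;  /* address = base + offset, in element units */
    Stmt *defs[2];       /* statements defining our operands */
    Stmt *uses[4];       /* statements using our result */
    int   n_defs, n_uses;
};

typedef struct { Stmt *left, *right; } Pack;

/* Seed rule: b directly follows a in memory. */
static bool adjacent(const Stmt *a, const Stmt *b)
{
    return a->is_mem && b->is_mem &&
           a->base == b->base && b->offset == a->offset + 1;
}

/* Same operation kind; the real test also requires independence. */
static bool isomorphic(const Stmt *a, const Stmt *b)
{
    return a != b && a->op == b->op;
}

/* Step 1: pair adjacent memory references.
   Caller provides out[] with room for n*n packs (toy bound). */
static size_t seed_packs(Stmt **stmts, size_t n, Pack *out)
{
    size_t i, j, k = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (adjacent(stmts[i], stmts[j]))
                out[k++] = (Pack){ stmts[i], stmts[j] };
    return k;
}

/* Steps 2-3: extend each pack along def-use chains (to the uses of
   its statements) and use-def chains (to their defining statements).
   New packs are themselves extended; duplicates are not filtered in
   this sketch, so cap bounds the growth. */
static size_t grow_packs(Pack *packs, size_t k, size_t cap)
{
    size_t p;
    for (p = 0; p < k; p++) {
        Stmt *a = packs[p].left, *b = packs[p].right;
        int i, j;
        for (i = 0; i < a->n_uses; i++)
            for (j = 0; j < b->n_uses; j++)
                if (k < cap && isomorphic(a->uses[i], b->uses[j]))
                    packs[k++] = (Pack){ a->uses[i], b->uses[j] };
        for (i = 0; i < a->n_defs; i++)
            for (j = 0; j < b->n_defs; j++)
                if (k < cap && isomorphic(a->defs[i], b->defs[j]))
                    packs[k++] = (Pack){ a->defs[i], b->defs[j] };
    }
    return k;
}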
SLP Compiler Results
• SLP compiler implemented in SUIF
• Tested on two benchmark suites:
  – SPEC95fp
  – Multimedia kernels
• Performance measured three ways:
  – SLP availability
  – Compared to vector parallelism
  – Speedup on AltiVec
SLP Availability
[Chart: % of dynamic SUIF instructions eliminated (0-100%) with 128-bit and 1024-bit superwords, for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MM, YUV]
SLP vs. Vector Parallelism
[Chart: parallelism exploited as SLP vs. as vector parallelism (0%-100%), for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp]
Speedup on AltiVec
[Chart: speedup on AltiVec, scale 1.0-2.0, for swim, tomcatv, FIR, IIR, VMM, MM, YUV; one kernel reaches 6.7, off the chart's scale]
SLP vs. Vector Parallelism

• Extracted with a simple analysis
  – SLP is fine-grained: basic blocks
• Superset of vector parallelism
  – Unrolling transforms VP to SLP
  – Handles partially vectorizable loops
[Diagram: vector parallelism packs the same operation across loop iterations; SLP packs operations within a basic block]
SLP vs. ILP

• Subset of instruction-level parallelism
• SIMD hardware is simpler
  – Lack of heavily ported register files
• SIMD instructions are more compact
  – Reduces instruction fetch bandwidth
SLP and ILP

• SLP & ILP can be exploited together
  – Many architectures can already do this
• SLP & ILP may compete
  – Occurs when parallelism is scarce
• Unroll the loop more times
  – When ILP is due to loop-level parallelism (see the sketch below)
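For example, a plain-C sketch with a hypothetical kernel: unrolling by the superword width turns loop-carried ILP into packable SLP.

/* Loop-level parallelism becomes SLP after unrolling: the four
   copies are independent, isomorphic, and adjacent in memory,
   so the SLP pass can pack them. (n % 4 == 0 assumed.) */
void scale_by_two(float *A, const float *B, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        A[i+0] = B[i+0] * 2.0f;
        A[i+1] = B[i+1] * 2.0f;
        A[i+2] = B[i+2] * 2.0f;
        A[i+3] = B[i+3] * 2.0f;
    }
}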
Conclusions

• Multimedia architectures abundant
  – Need automatic compilation
• SLP is the right paradigm
  – 20% non-vectorizable in SPEC95fp
• SLP extraction successful
  – Simple, local analysis
  – Provides speedups from 1.24 to 6.70
• Found SLP in general-purpose codes
Future Work

• SLP analysis beyond basic blocks
  – Packing maintained across blocks
  – Loop-invariant packing
  – Fill unused slots with speculative ops
• SLP architectures
  – Emphasis on SIMD
  – Better packing/unpacking