Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Samuel Larsen and Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp
Overview
• Problem statement
• New paradigm for parallelism: SLP
• SLP extraction algorithm
• Results
• SLP vs. ILP and vector parallelism
• Conclusions
• Future work
Multimedia Extensions
• Additions to all major ISAs
• SIMD operations

Instruction Set   Architecture   SIMD Width (bits)   Floating Point
AltiVec           PowerPC        128                 yes
MMX/SSE           Intel          64/128              yes
3DNow!            AMD            64                  yes
VIS               Sun            64                  no
MAX2              HP             64                  no
MVI               Alpha          64                  no
MDMX              MIPS V         64                  yes
Using Multimedia Extensions

• Library calls and inline assembly
  – Difficult to program
  – Not portable
• Different extensions to the same ISA
  – MMX and SSE
  – SSE vs. 3DNow!
• Need automatic compilation
Vector Compilation

• Pros:
  – Successful for vector computers
  – Large body of research
• Cons:
  – Involved transformations
  – Targets loop nests
Superword Level Parallelism (SLP)

• Small amount of parallelism
  – Typically 2 to 8-way
• Exists within basic blocks
• Uncovered with a simple analysis
• Independent, isomorphic operations
  – A new paradigm
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R]   [R]   [XR]   [1.08327]
[G] = [G] + [XG] * [1.89234]
[B]   [B]   [XB]   [1.29835]
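A minimal sketch of what this packed form can look like in source, using x86 SSE intrinsics as a stand-in for the paper's AltiVec target; the fourth vector lane is unused padding, and the name rgb_update is hypothetical.

#include <xmmintrin.h>  /* x86 SSE intrinsics (stand-in for AltiVec) */

/* Three independent multiply-adds become one SIMD multiply and
   one SIMD add; the fourth lane is padding. */
void rgb_update(float rgb[3], const float x[3])
{
    __m128 acc = _mm_set_ps(0.0f, rgb[2], rgb[1], rgb[0]);
    __m128 xv  = _mm_set_ps(0.0f, x[2], x[1], x[0]);
    __m128 cv  = _mm_set_ps(0.0f, 1.29835f, 1.89234f, 1.08327f);
    float out[4];

    acc = _mm_add_ps(acc, _mm_mul_ps(xv, cv));
    _mm_storeu_ps(out, acc);
    rgb[0] = out[0]; rgb[1] = out[1]; rgb[2] = out[2];
}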
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R]   [R]
[G] = [G] + X[i:i+2]
[B]   [B]
3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

for (i=0; i<100; i+=4) {
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]
}

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
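A compilable rendering of the vector form above, sketched with x86 SSE intrinsics (the paper's compiler targets AltiVec; the name vadd is hypothetical). Since 100 is a multiple of 4, no cleanup loop is needed.

#include <xmmintrin.h>

void vadd(float *A, const float *B)
{
    int i;
    for (i = 0; i < 100; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);
        __m128 b = _mm_loadu_ps(&B[i]);
        _mm_storeu_ps(&A[i], _mm_add_ps(a, b));  /* A[i:i+3] = A[i:i+3] + B[i:i+3] */
    }
}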
4. Partially Vectorizable Loops

for (i=0; i<16; i+=1) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
}

for (i=0; i<16; i+=2) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
}

for (i=0; i<16; i+=2) {
  [L0]
  [L1] = A[i:i+1] - B[i:i+1]

  D = D + abs(L0)
  D = D + abs(L1)
}
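In the packed form, the subtraction vectorizes while the dependent accumulation through D stays scalar. A sketch in the same SSE stand-in style, 4-wide instead of the slide's 2-wide; the name sum_abs_diff is hypothetical.

#include <math.h>
#include <xmmintrin.h>

float sum_abs_diff(const float *A, const float *B)
{
    float D = 0.0f;
    int i, j;
    for (i = 0; i < 16; i += 4) {
        float L[4];
        /* vectorizable part: L[0:3] = A[i:i+3] - B[i:i+3] */
        _mm_storeu_ps(L, _mm_sub_ps(_mm_loadu_ps(&A[i]),
                                    _mm_loadu_ps(&B[i])));
        /* serial part: the accumulation chain through D */
        for (j = 0; j < 4; j++)
            D = D + fabsf(L[j]);
    }
    return D;
}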
Exploiting SLP with SIMD Execution

• Benefit:
  – Multiple ALU ops → one SIMD op
  – Multiple ld/st ops → one wide memory op
• Cost:
  – Packing and unpacking
  – Reshuffling within a register
Packing/Unpacking Costs

• Packing source operands
• Unpacking destination operands

A = f()              pack A and B:    [A]
B = g()                               [B]

C = A + 2            [C]   [A]   [2]
D = B + 3            [D] = [B] + [3]

E = C / 5            unpack C and D:  [C]
F = D * 7                             [D]
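A sketch of where the costs land, again in the SSE stand-in style; f, g, and pack_cost_demo are hypothetical stand-ins for the slide's operations. The single SIMD add only pays off if the surrounding moves are amortized.

#include <xmmintrin.h>

static float f(void) { return 1.0f; }  /* stubs standing in for the */
static float g(void) { return 2.0f; }  /* slide's f() and g()       */

void pack_cost_demo(float *E, float *F)
{
    float A = f(), B = g();
    float tmp[4];

    /* packing cost: two scalar results moved into one register */
    __m128 ab = _mm_set_ps(0.0f, 0.0f, B, A);

    /* the SIMD op itself: C = A + 2 and D = B + 3 at once */
    __m128 cd = _mm_add_ps(ab, _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f));

    /* unpacking cost: C and D extracted for later scalar uses */
    _mm_storeu_ps(tmp, cd);
    *E = tmp[0] / 5.0f;   /* E = C / 5 */
    *F = tmp[1] * 7.0f;   /* F = D * 7 */
}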
Optimizing Program Performance

• To achieve the best speedup:
  – Maximize parallelization
  – Minimize packing/unpacking
• Many packing possibilities
  – Worst case: n ops → n! configurations
  – Different cost/benefit for each choice
Observation 1: Packing Costs Can Be Amortized

• Use packed result operands:

  A = B + C        G = A - H
  D = E + F        I = D - J

• Share packed source operands:

  A = B + C        G = B + H
  D = E + F        I = E + J
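Both patterns in one compilable sketch (SSE stand-in; the layout of operands in in[] and the name amortize_demo are hypothetical): the packs are built once and reused by every later SIMD op.

#include <xmmintrin.h>

/* out must hold 12 floats; upper two lanes of each vector are padding. */
void amortize_demo(const float *in, float *out)
{
    __m128 be = _mm_set_ps(0.0f, 0.0f, in[1], in[0]);  /* pack B, E once */
    __m128 cf = _mm_set_ps(0.0f, 0.0f, in[3], in[2]);  /* pack C, F once */
    __m128 hj = _mm_set_ps(0.0f, 0.0f, in[5], in[4]);  /* pack H, J once */

    __m128 ad  = _mm_add_ps(be, cf);  /* A = B + C,  D = E + F                     */
    __m128 gi  = _mm_sub_ps(ad, hj);  /* G = A - H,  I = D - J  (result reused)    */
    __m128 gi2 = _mm_add_ps(be, hj);  /* G = B + H,  I = E + J  (source shared)    */

    _mm_storeu_ps(&out[0], ad);
    _mm_storeu_ps(&out[4], gi);
    _mm_storeu_ps(&out[8], gi2);
}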
Observation 2: Adjacent Memory is Key

• Large potential performance gains
  – Eliminate ld/st instructions
  – Reduce memory bandwidth
• Few packing possibilities
  – Only one ordering exploits pre-packing
SLP Extraction Algorithm

Running example:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

• Identify adjacent memory references:

  [A]
  [B] = X[i:i+1]

• Follow def-use chains:

  [H]   [C]   [A]
  [J] = [D] - [B]

• Follow use-def chains:

  [C]   [E]   [3]
  [D] = [F] * [5]
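A much-simplified sketch of this seed-and-extend structure on a toy IR, in C. Every type and helper here is hypothetical (not SUIF's), and the real algorithm also checks independence, alignment, and profitability before committing a pack.

#include <stdbool.h>
#include <stddef.h>

/* Toy IR: just enough to show the extraction order. */
typedef struct Stmt Stmt;
struct Stmt {
    int   op;            /* operation kind */
    bool  is_mem;        /* load or store */
    int   base, offset;  /* address = base + offset, in element units */
    Stmt *defs[2];       /* statements defining our operands */
    Stmt *uses[4];       /* statements using our result */
    int   n_defs, n_uses;
};

typedef struct { Stmt *left, *right; } Pack;

/* Seed rule: b directly follows a in memory. */
static bool adjacent(const Stmt *a, const Stmt *b)
{
    return a->is_mem && b->is_mem &&
           a->base == b->base && b->offset == a->offset + 1;
}

/* Same operation kind; the real test also requires independence. */
static bool isomorphic(const Stmt *a, const Stmt *b)
{
    return a != b && a->op == b->op;
}

/* Step 1: pair adjacent memory references.
   Caller provides out[] with room for n*n packs (toy bound). */
static size_t seed_packs(Stmt **stmts, size_t n, Pack *out)
{
    size_t i, j, k = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (adjacent(stmts[i], stmts[j]))
                out[k++] = (Pack){ stmts[i], stmts[j] };
    return k;
}

/* Steps 2-3: extend each pack along def-use chains (to the uses of
   its statements) and use-def chains (to their defining statements).
   New packs are themselves extended; duplicates are not filtered in
   this sketch, so cap bounds the growth. */
static size_t grow_packs(Pack *packs, size_t k, size_t cap)
{
    size_t p;
    for (p = 0; p < k; p++) {
        Stmt *a = packs[p].left, *b = packs[p].right;
        int i, j;
        for (i = 0; i < a->n_uses; i++)
            for (j = 0; j < b->n_uses; j++)
                if (k < cap && isomorphic(a->uses[i], b->uses[j]))
                    packs[k++] = (Pack){ a->uses[i], b->uses[j] };
        for (i = 0; i < a->n_defs; i++)
            for (j = 0; j < b->n_defs; j++)
                if (k < cap && isomorphic(a->defs[i], b->defs[j]))
                    packs[k++] = (Pack){ a->defs[i], b->defs[j] };
    }
    return k;
}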
SLP Compiler Results
• SLP compiler implemented in SUIF
• Tested on two benchmark suites:
  – SPEC95fp
  – Multimedia kernels
• Performance measured three ways:
  – SLP availability
  – Compared to vector parallelism
  – Speedup on AltiVec
SLP Availability
[Chart: % of dynamic SUIF instructions eliminated (0-100%) with 128-bit and 1024-bit superwords, for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MM, YUV]
SLP vs. Vector Parallelism
[Chart: parallelism exploited as SLP vs. as vector parallelism (0%-100%), for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp]
Speedup on AltiVec
[Chart: speedup on AltiVec, scale 1.0-2.0, for swim, tomcatv, FIR, IIR, VMM, MM, YUV; one kernel reaches 6.7, off the chart's scale]
SLP vs. Vector Parallelism

• Extracted with a simple analysis
  – SLP is fine-grained: basic blocks
• Superset of vector parallelism
  – Unrolling transforms VP to SLP
  – Handles partially vectorizable loops
[Diagram: vector parallelism packs the same operation across loop iterations; SLP packs operations within a basic block]
SLP vs. ILP

• Subset of instruction-level parallelism
• SIMD hardware is simpler
  – Lack of heavily ported register files
• SIMD instructions are more compact
  – Reduces instruction fetch bandwidth
SLP and ILP

• SLP & ILP can be exploited together
  – Many architectures can already do this
• SLP & ILP may compete
  – Occurs when parallelism is scarce
• Unroll the loop more times
  – When ILP is due to loop-level parallelism (see the sketch below)
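For example, a plain-C sketch with a hypothetical kernel: unrolling by the superword width turns loop-carried ILP into packable SLP.

/* Loop-level parallelism becomes SLP after unrolling: the four
   copies are independent, isomorphic, and adjacent in memory,
   so the SLP pass can pack them. (n % 4 == 0 assumed.) */
void scale_by_two(float *A, const float *B, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        A[i+0] = B[i+0] * 2.0f;
        A[i+1] = B[i+1] * 2.0f;
        A[i+2] = B[i+2] * 2.0f;
        A[i+3] = B[i+3] * 2.0f;
    }
}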
Conclusions

• Multimedia architectures abundant
  – Need automatic compilation
• SLP is the right paradigm
  – 20% non-vectorizable in SPEC95fp
• SLP extraction successful
  – Simple, local analysis
  – Provides speedups from 1.24 to 6.70
• Found SLP in general-purpose codes
Future Work

• SLP analysis beyond basic blocks
  – Packing maintained across blocks
  – Loop-invariant packing
  – Fill unused slots with speculative ops
• SLP architectures
  – Emphasis on SIMD
  – Better packing/unpacking