Page 1: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Samuel Larsen, Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp

Page 2: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Overview

• Problem statement
• New paradigm for parallelism: SLP
• SLP extraction algorithm
• Results
• SLP vs. ILP and vector parallelism
• Conclusions
• Future work

Page 3: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Multimedia Extensions

• Additions to all major ISAs
• SIMD operations

Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes

Page 6: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Using Multimedia Extensions

• Library calls and inline assembly
– Difficult to program
– Not portable
• Different extensions to the same ISA
– MMX and SSE
– SSE vs. 3DNow!
• Need automatic compilation

Page 8: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Vector Compilation

• Pros:
– Successful for vector computers
– Large body of research
• Cons:
– Involved transformations
– Targets loop nests

Page 10: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Superword Level Parallelism (SLP)

• Small amount of parallelism
– Typically 2- to 8-way
• Exists within basic blocks
• Uncovered with a simple analysis
• Independent isomorphic operations
– New paradigm

Page 11: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R]   [R]   [XR]   [1.08327]
[G] = [G] + [XG] * [1.89234]
[B]   [B]   [XB]   [1.29835]

Page 12: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R]   [R]
[G] = [G] + X[i:i+2]
[B]   [B]

Page 13: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

Page 14: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

3. Vectorizable Loops

for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]

Page 15: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

4. Partially Vectorizable Loops

for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)

Page 16: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

4. Partially Vectorizable Loops

for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)

for (i=0; i<16; i+=2)
  [L0]
  [L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)

Page 18: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Exploiting SLP with SIMD Execution

• Benefit:
– Multiple ALU ops → One SIMD op
– Multiple ld/st ops → One wide mem op
• Cost:
– Packing and unpacking
– Reshuffling within a register

Page 21: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Packing/Unpacking Costs

• Packing source operands
• Unpacking destination operands

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

[C]   [A]   [2]
[D] = [B] + [3]

Page 23: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Optimizing Program Performance

• To achieve the best speedup:
– Maximize parallelization
– Minimize packing/unpacking
• Many packing possibilities
– Worst case: n ops → n! configurations
– Different cost/benefit for each choice

Page 25: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Observation 1: Packing Costs can be Amortized

• Use packed result operands
• Share packed source operands

A = B + C
D = E + F
G = A - H      (G,I reuse the packed results A,D)
I = D - J

A = B + C
D = E + F
G = B + H      (G,I share the packed sources B,E)
I = E + J

Page 27: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Observation 2: Adjacent Memory is Key

• Large potential performance gains
– Eliminate ld/st instructions
– Reduce memory bandwidth
• Few packing possibilities
– Only one ordering exploits pre-packing

Page 29: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP Extraction Algorithm

• Identify adjacent memory references

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]

Page 31: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP Extraction Algorithm

• Follow def-use chains

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]

[H]   [C]   [A]
[J] = [D] - [B]

Page 34: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP Extraction Algorithm

• Follow use-def chains

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]

[C]   [E]   [3]
[D] = [F] * [5]

[H]   [C]   [A]
[J] = [D] - [B]

Page 35: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP Compiler Results

• SLP compiler implemented in SUIF
• Tested on two benchmark suites
– SPEC95fp
– Multimedia kernels
• Performance measured three ways:
– SLP availability
– Compared to vector parallelism
– Speedup on AltiVec

Page 36: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP Availability

[Bar chart: % dynamic SUIF instructions eliminated (0–100) for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MMM, and YUV, at 128-bit and 1024-bit superword widths]

Page 37: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP vs. Vector Parallelism

[Bar chart: 0–100% comparison of SLP vs. vector parallelism for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, and fpppp]

Page 38: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Speedup on AltiVec

[Bar chart: speedups on a 1–2 scale for swim, tomcatv, FIR, IIR, VMM, MMM, and YUV; one bar annotated 6.7]

Page 40: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP vs. Vector Parallelism

• Extracted with a simple analysis
– SLP is fine grain → basic blocks
• Superset of vector parallelism
– Unrolling transforms VP to SLP
– Handles partially vectorizable loops

Page 41: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP vs. Vector Parallelism

[Diagram: SLP found within a single basic block]

Page 42: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP vs. Vector Parallelism

[Diagram: vector parallelism spanning loop iterations]

Page 45: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP vs. ILP

• Subset of instruction level parallelism
• SIMD hardware is simpler
– Lack of heavily ported register files
• SIMD instructions are more compact
– Reduces instruction fetch bandwidth

Page 48: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

SLP and ILP

• SLP & ILP can be exploited together
– Many architectures can already do this
• SLP & ILP may compete
– Occurs when parallelism is scarce
• Unroll the loop more times
– When ILP is due to loop-level parallelism

Page 52: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Conclusions

• Multimedia architectures abundant
– Need automatic compilation
• SLP is the right paradigm
– 20% non-vectorizable in SPEC95fp
• SLP extraction successful
– Simple, local analysis
– Provides speedups from 1.24 to 6.70
• Found SLP in general-purpose codes

Page 54: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Future Work

• SLP analysis beyond basic blocks
– Packing maintained across blocks
– Loop-invariant packing
– Fill unused slots with speculative ops
• SLP architectures
– Emphasis on SIMD
– Better packing/unpacking

Page 55: Exploiting Superword Level Parallelism with Multimedia Instruction Sets

© 2000 MIT

Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Samuel Larsen, Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp