Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs

Jiantong Jiang (Northeastern University, China), Zeke Wang (Zhejiang University, China), Xue Liu (Northeastern University, China), Juan Gómez-Luna (ETH Zürich, Switzerland), Nan Guan (Hong Kong Polytechnic University), Qingxu Deng (Northeastern University, China), Wei Zhang (Hong Kong University of Science and Technology), Onur Mutlu (ETH Zürich, Switzerland)

Jul 18, 2020




Outline

• Background and Motivations

• Our Solution

• Experiment

• Conclusion

What is OpenCL?

• OpenCL stands for Open Computing Language.

• OpenCL has been developed for heterogeneous computing environments with a host-accelerator execution model.
  ➢ The CPU runs the control task.
  ➢ The GPU/FPGA runs the computing kernel.

OpenCL on FPGA

• Hardware-centric → fine-grained parallelism

• Users need to program with HDL.

• Software-centric → FPGA as a parallel architecture

• Users can program with OpenCL.

[Figure: the FPGA fabric (logic blocks, DSP blocks, memory blocks) and, through the OpenCL SDK, its OpenCL view: compute units CU-1 ... CU-N, each with local memory, a pipeline, and private memory, connected by a global memory interconnect to global memory in external DDR.]

Conventional OpenCL: NDRange Kernel

• How does conventional OpenCL perform on an FPGA?

Explicit multi-threading → executing the same operation on multiple data concurrently

[Figure: Core0, Core1, and Core2 each run Load → Compute → Store on one data item (Data 0, Data 1, Data 2).]

Issue of Conventional OpenCL

• Conventional OpenCL cannot always represent FPGA architecture in an efficient manner.

[Figure: bar chart with the labeled values 1.6x, 12.1x, 4.1x, 3.0x, 4.7x, 11.3x, and 21.0x; the axis and category labels are not recoverable from the transcript.]


The optimal performance is enabled by two OpenCL features!

OpenCL Feature 1: SWI Kernel

• The SWI (single work-item) model executes the kernel in only one CU that contains only one work-item.

Pipelined parallelism → no conflict among work-items.

[Figure: 8 data points (D0 to D7) flow through a single Load → Compute → Store pipeline.]
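To make the pipelined-parallelism point concrete, here is a minimal Python sketch of an idealized three-stage pipeline (Load, Compute, Store) fed with 8 data points. It assumes that every stage takes exactly one cycle and that a new data point can enter on every cycle; both are simplifications, and `pipeline_schedule` and `total_cycles` are hypothetical helper names, not part of any OpenCL SDK.

```python
STAGES = ["Load", "Compute", "Store"]


def pipeline_schedule(num_items, stages=STAGES):
    """Return {cycle: [(stage, item), ...]} for an idealized pipeline.

    Assumption: each stage takes one cycle and one new item enters per cycle.
    """
    schedule = {}
    for item in range(num_items):
        for depth, stage in enumerate(stages):
            cycle = item + depth  # item i occupies stage d during cycle i + d
            schedule.setdefault(cycle, []).append((stage, f"D{item}"))
    return schedule


def total_cycles(num_items, num_stages=len(STAGES)):
    # Fill the pipeline once, then one item completes per cycle.
    return num_stages + num_items - 1


schedule = pipeline_schedule(8)
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])

print("pipelined cycles:  ", total_cycles(8))
print("unpipelined cycles:", 8 * len(STAGES))  # one item at a time, no overlap
```

In every cycle each stage holds at most one data point, which is the sense in which a single work-item pipeline avoids conflicts among work-items.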

OpenCL Feature 2: OpenCL Channel

• An OpenCL channel can be used to pass data between two OpenCL kernels (typically SWI).
  ➢ Synchronizing the kernels
  ➢ Reducing the number of global memory accesses

[Figure: on the FPGA, a producer kernel passes data to a consumer kernel through an OpenCL channel instead of through global memory.]
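The effect of a channel on memory traffic can be sketched in plain Python. This is only an analogy: a list stands in for a global-memory buffer, a `deque` stands in for an OpenCL channel, and the doubling and summing are arbitrary placeholder computations. The function names are hypothetical, and the counters track only the intermediate traffic between the two kernels.

```python
from collections import deque


def run_via_global_memory(data):
    """Producer writes every result to a buffer; consumer reads it back later."""
    accesses = 0
    buffer = []                      # stand-in for a global-memory buffer
    for x in data:                   # producer kernel
        buffer.append(x * 2)
        accesses += 1                # one intermediate write per element
    total = 0
    for y in buffer:                 # consumer kernel, starts after the producer
        total += y
        accesses += 1                # one intermediate read per element
    return total, accesses


def run_via_channel(data):
    """Producer and consumer are interleaved through a small FIFO."""
    accesses = 0                     # no intermediate global-memory traffic
    channel = deque()                # stand-in for an OpenCL channel
    total = 0
    for x in data:
        channel.append(x * 2)        # producer pushes one value
        total += channel.popleft()   # consumer pops it immediately
    return total, accesses


print(run_via_global_memory(range(8)))
print(run_via_channel(range(8)))
```

Both variants compute the same result; the channel version removes the intermediate write and read per element and lets the consumer start before the producer has finished.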

Challenges

• Two OpenCL features exponentially increase the design space.

• Enabled by the two features, we have four OpenCL execution models:
    NDRange, SWI, NDRange+Channel, SWI+Channel

• For each execution model, we have at least six optimization methods:
    SM, MC, PM, UL, SIMD, CU

• For each optimization method, we have different pragmas.

• The compilation time is extremely long!
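A rough feel for how quickly this design space grows can be had from a small enumeration. The sketch below assumes that each of the six optimization methods can be switched on or off independently and applied to every execution model, which may not hold in practice, and it ignores the pragma values entirely; the real space is therefore different in detail, but the growth pattern is the same.

```python
from itertools import combinations

MODELS = ["NDRange", "SWI", "NDRange+Channel", "SWI+Channel"]
METHODS = ["SM", "MC", "PM", "UL", "SIMD", "CU"]


def method_subsets(methods):
    """Yield every on/off combination of the optimization methods."""
    for k in range(len(methods) + 1):
        yield from combinations(methods, k)


# Assumption: any subset of methods can be paired with any execution model.
design_points = [(model, subset)
                 for model in MODELS
                 for subset in method_subsets(METHODS)]

print(len(design_points))  # 4 models * 2**6 method subsets
```

Since each design point requires a full FPGA compilation, even this simplified count is impractical to explore exhaustively.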

Effect of Four Execution Models

• Different execution models can significantly affect the performance.

[Figure: speedup over the GPU baseline (log scale, 0.01 to 1000) of the most suitable and the most unsuitable execution model for each application, chosen among NDRange, SWI, NDRange+Channel, and SWI+Channel.]

Application   Most suitable   Most unsuitable
RSCD          73.8            4.6
TQH           21.0            1.0
HSTO          32.4            2.7
SC            17.1            1.5
CEDD          2.7             0.3
KM            147.7           31.4
MM            837.1           6.6
MS            34.7            0.025

The execution model should be decided first.

Can we explicitly determine the most suitable execution model (i.e., whether or not to use the two OpenCL features) to optimize OpenCL programs on FPGAs?

We provide a systematic framework, Boyi, to automatically determine the most suitable execution model.


Outline

• Background and Motivations

• Our Solution

  ➢ OpenCL Pattern Recognition
  ➢ Execution Model Prediction

• Experiment

• Conclusion

Architecture of Boyi

• Boyi explicitly determines the most suitable execution model to optimize OpenCL programs on FPGAs.
• OpenCL Pattern Recognition
• Execution Model Prediction

[Figure: the OpenCL kernel source code and the host C/C++ source code pass through the Clang frontend to LLVM IR. OpenCL Pattern Recognition (AO recognition, MPS recognition, KKC recognition) feeds Execution Model Prediction (direct prediction, then potential evolution), which decides whether to use the two features (SWI, OpenCL channel) and outputs the most suitable of the four execution models: NDR, SWI, NDR + C, SWI + C.]

AO: Atomic Operation
MPS: Multi-Pass Scheme
KKC: Kernel-to-Kernel Communication


Outline

• Background and Motivations

• Our Solution

  ➢ OpenCL Pattern Recognition
  ➢ Execution Model Prediction

• Experiment

• Conclusion

OpenCL Pattern: Atomic Operation

[Figure: 8 work-items pass their input data through a hash function and update a 4-bin histogram (bins 0 to 3) concurrently; work-items that hit the same bin conflict.]

Work-item    0  1  2  3  4  5  6  7
Input data   1  6  4  7  2  5  9  0
Hash index   1  2  0  3  2  1  1  0

• Issues on FPGAs:
  ➢ Noticeable resource overhead
  ➢ Long latency and low bandwidth
  ➢ Low frequency
  → AO is not a good fit on FPGAs.

OpenCL Pattern: Atomic Operation

• Potential on FPGAs:

[Figure: the histogram example again (input data 1 6 4 7 2 5 9 0, hash indices 1 2 0 3 2 1 1 0, bins 0 to 3); the annotation that accompanies "Potential on FPGAs" did not survive extraction.]
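The histogram example can be replayed in a few lines of Python. The input order (1 6 4 7 2 5 9 0) is read off the slide residue, and the hash function `x % 4` is an assumption: the slides do not state it, but it reproduces the hash indices 1 2 0 3 2 1 1 0. The function names are hypothetical. The first function mimics a single work-item loop, where bin updates happen one after another and no atomic operation is needed; the second lists which work-items would collide on a bin if all eight ran concurrently.

```python
INPUT = [1, 6, 4, 7, 2, 5, 9, 0]
NUM_BINS = 4


def hash_index(x, num_bins=NUM_BINS):
    # Assumed hash function; it matches the slide's indices 1 2 0 3 2 1 1 0.
    return x % num_bins


def histogram_single_work_item(data, num_bins=NUM_BINS):
    """One loop, one writer: updates are ordered, so no atomics are required."""
    bins = [0] * num_bins
    for x in data:
        bins[hash_index(x, num_bins)] += 1
    return bins


def conflicting_work_items(data, num_bins=NUM_BINS):
    """Group work-item ids by target bin and keep the bins hit more than once.

    With concurrent work-items (the NDRange case) these updates need atomics.
    """
    by_bin = {}
    for work_item, x in enumerate(data):
        by_bin.setdefault(hash_index(x, num_bins), []).append(work_item)
    return {b: ids for b, ids in by_bin.items() if len(ids) > 1}


print([hash_index(x) for x in INPUT])
print(histogram_single_work_item(INPUT))
print(conflicting_work_items(INPUT))
```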

OpenCL Pattern: Multi-Pass Scheme

[Figure: an exclusive prefix sum over 16 inputs computed in three passes, shown as four blocks of four elements. The blocks are listed in the order in which they appear on the slide.]

Step 1: 4 work-groups (each scans its own block and records the block total in local_sum)

in          6 0 2 3       1 4 3 5       2 8 9 0       1 6 4 7
out         0 6 6 8       0 1 5 8       0 2 10 19     0 1 7 11
local_sum   11            13            19            18

Step 2: 1 work-group (local_sum is scanned into pre_sum)

pre_sum     50            37            18            0

Step 3: 4 work-groups (each adds its pre_sum value to its block)

out         50 56 56 58   37 38 42 45   18 20 28 37   0 1 7 11

OpenCL Pattern: Multi-Pass Scheme

• Issues on FPGAs:
  ➢ More memory traffic
  → MPS is not a good fit on FPGAs.

• Potential on FPGAs:

[Figure: in (6 0 2 3, 1 4 3 5, 2 8 9 0, 1 6 4 7) mapped to out (50 56 56 58, 37 38 42 45, 18 20 28 37, 0 1 7 11), with no intermediate arrays shown.]
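The prefix-sum example can be checked with a short Python sketch that contrasts the three-pass scheme with a single sequential pass. The input is the slide's 16 values, taken in the block order that makes the pre_sum values 0, 18, 37, 50 (the transcript lists the blocks in the opposite order); that ordering is an interpretation. The function names are hypothetical.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def multi_pass_scan(data, block_size):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Pass 1: each work-group scans its block and records the block total.
    local_out = [exclusive_scan(block) for block in blocks]
    local_sum = [sum(block) for block in blocks]
    # Pass 2: a single work-group scans the block totals.
    pre_sum = exclusive_scan(local_sum)
    # Pass 3: each work-group adds its offset to every element of its block.
    return [x + offset
            for block, offset in zip(local_out, pre_sum)
            for x in block]


def single_pass_scan(data):
    """One sequential loop, as a single work-item kernel would run it."""
    return exclusive_scan(data)


DATA = [1, 6, 4, 7, 2, 8, 9, 0, 1, 4, 3, 5, 6, 0, 2, 3]

print(multi_pass_scan(DATA, 4))
print(single_pass_scan(DATA))
```

The multi-pass version reads and writes the intermediate results (the per-block output and the block totals) between passes, which is the extra memory traffic the slide refers to; the single pass touches each input and output element once.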

OpenCL Pattern: Kernel-to-Kernel Communication

• Issues on FPGAs:
  ➢ The communication via global memory is expensive.

[Figure: on the FPGA, the producer kernel and the consumer kernel exchange data through global memory.]

OpenCL Pattern: Kernel-to-Kernel Communication

• Potential on FPGAs:
  ➢ Reducing the number of memory accesses
  ➢ Inter-kernel parallelism (i.e., concurrent kernel execution)

[Figure: on the FPGA, the producer kernel passes data to the consumer kernel through an OpenCL channel rather than through global memory.]

OpenCL Pattern Recognition

• We develop nine LLVM passes to recognize three OpenCL patterns.
• AO recognition
• KKC recognition
• MPS recognition

[Figure: flowcharts for (a) AO recognition, (b) KKC recognition, and (c) MPS recognition, built from passes for host C/C++ analysis and passes for OpenCL kernel analysis. Pass labels that survive extraction: R1: NumKernels / NumOfKernels, R2: IsSameBuff, R2: IsRdWr, R3: IsSameMAP, R4: IsSequential, R5: VarBuffInHost, R5: BuffInKernel, R5: VarInKernel. Decision and data labels that survive extraction: #Kernels > 1?, #R2Triplets > 0?, #R4Triplets > 0?, HasAO, #AOTrue, #KKCTrue, #MPSTrue, R2BuffTriplets, R4SeqTriplets, Buffs, Vars, VarVals, Args, ArgVals. The connections between the boxes are not recoverable from the transcript.]

The implementation details can be found in our paper!


Outline

• Background and Motivations

• Our Solution

  ➢ OpenCL Pattern Recognition
  ➢ Execution Model Prediction

• Experiment

• Conclusion

Execution Model Prediction

• Direct prediction
  ➢ Kernels with AO or MPS benefit from the SWI kernel.
  ➢ Kernels with KKC benefit from the OpenCL channel.

AO   MPS   KKC   Direct prediction
N    N     N     NDR
Y    N     N     SWI
N    Y     N     SWI
Y    Y     N     SWI
N    N     Y     NDR+C
Y    N     Y     SWI+C
N    Y     Y     SWI+C
Y    Y     Y     SWI+C

Execution Model Prediction

• Potential evolution of SWI
  ➢ Conditions:
      Sufficient FPGA resource.
      The SWI kernel is compute-bound.

AO   MPS   KKC   Direct prediction   Potential evolution
N    N     N     NDR                 NDR
Y    N     N     SWI                 SWI+C
N    Y     N     SWI                 SWI+C
Y    Y     N     SWI                 SWI+C
N    N     Y     NDR+C               NDR+C
Y    N     Y     SWI+C               SWI+C
N    Y     Y     SWI+C               SWI+C
Y    Y     Y     SWI+C               SWI+C
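The two prediction steps amount to a small decision rule, sketched below in Python. The function names and boolean parameters are hypothetical; the slides do not say how Boyi checks the two conditions (sufficient FPGA resource, compute-bound SWI kernel), so the sketch simply takes them as inputs.

```python
def direct_prediction(ao, mps, kkc):
    """AO or MPS selects the SWI kernel; KKC adds the OpenCL channel."""
    model = "SWI" if (ao or mps) else "NDR"
    return model + "+C" if kkc else model


def potential_evolution(model, enough_resources, compute_bound):
    """A plain SWI prediction may evolve to SWI+C when both conditions hold."""
    if model == "SWI" and enough_resources and compute_bound:
        return "SWI+C"
    return model


# Reproduce both columns of the table for all eight pattern combinations,
# assuming the two evolution conditions are satisfied.
for kkc in (False, True):
    for mps in (False, True):
        for ao in (False, True):
            direct = direct_prediction(ao, mps, kkc)
            evolved = potential_evolution(direct, True, True)
            print(ao, mps, kkc, direct, evolved)
```

When either condition fails, `potential_evolution` leaves the direct prediction unchanged, so a kernel predicted as SWI stays SWI.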


Outline

• Background and Motivations

• Our Solution

• Experiment

  ➢ Experimental Setup
  ➢ Effect of Execution Model
  ➢ Prediction of Execution Model

• Conclusion

Experimental Setup

• Platform: Terasic DE5a-Net board with an Altera Arria 10 GX FPGA and 8 GB of 2-bank DDR3, using Altera OpenCL SDK version 16.1.

• Workloads:

Benchmark  Source      Description            AO  MPS  KKC  Datasets
BFS        Chai        Breadth-First Search   Y   N    N    NY, NE, UT
RSCD                   RANSAC                 Y   N    Y    2000 iterations
TQH                    Task Queue System      Y   N    N    Basket
HSTO                   Histogram              Y   N    N    256 bins
SC                     Stream Compaction      Y   N    N    50%
PAD                    Padding                Y   N    N    1000*999
CEDD                   Canny Edge Detection   N   N    Y    Peppa, Maradona, Paw
KM         Rodinia     K-Means                N   N    N    25600 points, 8 features
MM         Intel demo  Matrix Multiplication  N   N    N    A: 2k*1k, B: 1k*1k
MS                     Mandelbrot Set         N   N    N    640*800, 2000 iterations
PS         CUDA demo   Prefix Sum             N   Y    N    262144 points

A blank Source cell means the transcript gives no source on that row; on the slide the cell is probably merged with the row above.

Comparison Methodology

• Hypothesis 1: Different execution models lead to significant performance differences.
  ➢ Quantitative comparison among execution models
  ➢ Exploring optimization combinations

• Hypothesis 2: Boyi can predict the most suitable execution model for each OpenCL application.

Hypothesis 1: Different Execution Model → Different Performance

• Quantitative comparison among execution models

         Number of combinations       Maximum speedup
App      NDR   SWI   NDR+C   SWI+C    NDR     SWI    NDR+C   SWI+C
BFS      17    7     7       9        1.9     3.1    1.2     3.1
RSCD     25    10    24      46       15.8    4.6    73.8    39.7
HSTO     13    37    11      29       2.7     5.1    16.9    32.4
CEDD     57    15    22      7        2.7     0.3    2.7     0.4
KM       33    11    10      18       147.7   32.8   136.4   31.4

The remaining rows list only three execution models, and the transcript does not show which column is blank. Their values are given in slide order:

App      Number of combinations   Maximum speedup
TQH      9, 15, 23                1.1, 1.3, 21.0
SC       15, 34, 10               1.5, 4.5, 17.1
PAD      10, 10, 14               1.2, 1.6, 4.8
MM       25, 9, 6                 837.1, 13.3, 6.6
MS       7, 6, 7                  34.7, 0.02, 3.2
PS       26, 20, 12               15.8, 44.4, 46.2

Different execution models result in significant performance differences.

Different applications require different execution models to achieve the best performance.

It is critical to decide the most suitable execution model when optimizing OpenCL applications on FPGAs.

Hypothesis 1: Exploring Optimization Combinations for Each Execution Model

• We manually implement a sufficient number of optimization combinations (a subset) for KM, such that we reach a near-optimal optimization combination for each execution model.

[Figure: speedup over the GPU baseline (axis 0 to 160) for the optimization combinations explored for KM under NDR, SWI, NDR+C, and SWI+C, with the most suitable and the most unsuitable execution models marked. The highest labels, 147.7, 32.8, 136.4, and 31.4, match KM's maximum speedups for NDR, SWI, NDR+C, and SWI+C; the remaining bar labels cannot be reliably assigned to a series from the transcript.]

Hypothesis 2: Boyi Predicts the Right Execution Model

Application  AO  MPS  KKC  Actual  Predicted
BFS          Y   N    N    SWI     SWI
RSCD         Y   N    Y    NDR+C   SWI+C
TQH          Y   N    N    SWI+C   SWI+C ※
HSTI         Y   N    N    SWI+C   SWI+C ※
SC           Y   N    N    SWI+C   SWI+C ※
PAD          Y   N    N    SWI+C   SWI+C ※
CEDD         N   N    Y    NDR+C   NDR+C
KM           N   N    N    NDR     NDR
MM           N   N    N    NDR     NDR
MS           N   N    N    NDR     NDR
PS           N   Y    N    SWI     SWI

SWI+C ※ indicates the potential evolution of SWI.

[Slide annotation whose position is not recoverable from the transcript: "N NDR+C".]

The actual and predicted execution models roughly match.

End-to-end Performance Comparison

• Performance comparison to existing works

Application  Ours (ms)  Existing work (ms)  Speedup (existing / ours)
RSCD [1]     0.8        28.9                38.3
TQH [1]      66.9       150.6               2.3
HSTO [1]     38.8       487.9               12.6
CEDD [1]     161.9      237.8               1.5
MM [2]       9.1        34.3                3.8
MS [2]       27.2       944.1               34.7

[1] S. Huang et al., "Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures," ICPE, 2019.
[2] Intel. Intel SDK for OpenCL Design Examples. 2018.
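The speedup column can be recomputed from the two timing columns as existing time divided by our time. The short Python sketch below does that with the numbers from the table; the dictionary layout and the `speedup` helper are illustrative only. For five of the six applications the recomputed value matches the reported one after rounding. For RSCD it does not, which would be consistent with the 0.8 ms figure having been rounded for display, although the slides do not say so.

```python
# application: (ours_ms, existing_ms, reported_speedup), copied from the table
RESULTS = {
    "RSCD": (0.8, 28.9, 38.3),
    "TQH": (66.9, 150.6, 2.3),
    "HSTO": (38.8, 487.9, 12.6),
    "CEDD": (161.9, 237.8, 1.5),
    "MM": (9.1, 34.3, 3.8),
    "MS": (27.2, 944.1, 34.7),
}


def speedup(ours_ms, existing_ms):
    """Speedup of our implementation over the existing work."""
    return existing_ms / ours_ms


for app, (ours, existing, reported) in RESULTS.items():
    computed = round(speedup(ours, existing), 1)
    status = "matches" if computed == reported else "differs"
    print(f"{app:5s} computed {computed:5.1f}  reported {reported:5.1f}  {status}")
```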


Outline

• Background and Motivations

• Our Solution

• Experiment

• Conclusion