FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems Ryohei Kobayashi , Kenji Kise Tokyo Institute of Technology, Japan MCSoC-15@Turin, Italy Embedded Multicore/Many-core Architectures 13:40-14:10, September 23, 2015

Feb 19, 2017

Transcript
Page 1: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Ryohei Kobayashi, Kenji Kise Tokyo Institute of Technology, Japan

MCSoC-15@Turin, Italy Embedded Multicore/Many-core Architectures 13:40-14:10, September 23, 2015

Page 2: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems


Introduction

Page 3: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Multi-core and Many-core Processor

• Multi-core and many-core processors have become mainstream for accelerating applications through parallel processing

• The number of cores has increased in step with Moore’s Law

2006: Core 2 Duo, Conroe (65nm), 2 cores

2008: Core 2 Quad, Yorkfield (45nm), 4 cores

2010: Xeon 7500, Nehalem EX (32nm), 8 cores

2012: Xeon Phi, Knights Corner (22nm), 50+ cores

Page 4: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Will Moore’s Law Continue?

• If Moore’s Law ends, approaches that rely on the following are no longer viable:
  ➢ Increasing the number of cores (especially)
  ➢ Implementing ever richer features
  ➢ Etc.

Page 5: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Will Moore’s Law Continue?

Yes. But we have motivation for accelerators [1]. That’s why we bought Altera.

[1] Gupta et al., Xeon+FPGA Platform for the Data Center, CARL 2015 (co-located with ISCA 2015)

Page 6: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Will Moore’s Law Continue?

Yes. But we have motivation for accelerators [1]. That’s why we bought Altera.

Microsoft is a trademark of the Microsoft group of companies

[1] Gupta et al., Xeon+FPGA Platform for the Data Center, CARL 2015 (co-located with ISCA 2015)

Page 7: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Will Moore’s Law Continue?

Yes. But we have motivation for accelerators [1]. That’s why we bought Altera.

No, Moore’s Law is ending. That’s why hardware specialization will be critical [2].

Microsoft is a trademark of the Microsoft group of companies

[1] Gupta et al., Xeon+FPGA Platform for the Data Center, CARL 2015 (co-located with ISCA 2015)
[2] Putnam et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA 2014

Page 8: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Will Moore’s Law Continue?

Yes. But we have motivation for accelerators [1]. That’s why we bought Altera.

No, Moore’s Law is ending. That’s why hardware specialization will be critical [2].

The era of dedicated hardware is coming!

Microsoft is a trademark of the Microsoft group of companies

[1] Gupta et al., Xeon+FPGA Platform for the Data Center, CARL 2015 (co-located with ISCA 2015)
[2] Putnam et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA 2014

Page 9: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

This work: Sorting Accelerator

• Sorting is a fundamental computation kernel, used in:
  ➢ Databases [1]
  ➢ Image processing [2]
  ➢ Data compression [3]

[1] Rene Mueller et al., Sorting Networks on FPGAs, The VLDB Journal 2012
[2] Ratnayake, K. et al., An FPGA Architecture of Stable-Sorting on a Large Data Volume: Application to Video Signals, CISS 2007
[3] Martinez, J. et al., An FPGA-based parallel sorting architecture for the Burrows Wheeler transform, ReConFig 2005

Page 10: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

This work: Sorting Accelerator

• Sorting is a fundamental computation kernel, used in:
  ➢ Databases [1]
  ➢ Image processing [2]
  ➢ Data compression [3]

Problem

• No existing sorting accelerator satisfies all of the following requirements:
  ✓ High performance
  ✓ Customizable
  ✓ Open source

[1] Rene Mueller et al., Sorting Networks on FPGAs, The VLDB Journal 2012
[2] Ratnayake, K. et al., An FPGA Architecture of Stable-Sorting on a Large Data Volume: Application to Video Signals, CISS 2007
[3] Martinez, J. et al., An FPGA-based parallel sorting architecture for the Burrows Wheeler transform, ReConFig 2005

Page 11: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems


Our Proposed Sorting Accelerator

Page 12: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Our Proposed Sorting Accelerator

• The accelerator uses the following sorting architectures:
  ➢ The sorting network
  ➢ The merge sorter tree

[Figure: the proposed sorting accelerator, built from the sorting network and the merge sorter tree]

Page 13: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

The Sorting Network*

• A sorting architecture composed of wires and comparators

• Example: sorting 4 values in the network
  ➢ Smaller values are carried to the top and larger values to the bottom

[Figure: bubble sort network with 4 inputs and 4 outputs, sorting the values 1, 4, 3, 2]

* Donald E. Knuth. The Art of Computer Programming. 1998.
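A bubble sort network like the one on this slide can be modelled in software as a fixed list of compare-exchange operations. The following is a minimal sketch, not the FACE Verilog: the function names `bubble_sort_network` and `run_network` are hypothetical, and the convention that the smaller value moves to the lower index is an assumption chosen for illustration.

```python
def bubble_sort_network(n):
    """Return the comparator list (i, i + 1) of an n-input bubble sort network."""
    comparators = []
    for end in range(n - 1, 0, -1):
        for i in range(end):
            comparators.append((i, i + 1))
    return comparators


def run_network(values, comparators):
    """Apply every compare-exchange in order; the smaller value moves to the lower index."""
    wires = list(values)
    for i, j in comparators:
        if wires[i] > wires[j]:
            wires[i], wires[j] = wires[j], wires[i]
    return wires


network = bubble_sort_network(4)
print(len(network))                        # 6 comparators for 4 inputs
print(run_network([1, 4, 3, 2], network))  # [1, 2, 3, 4]
```

The comparator sequence is fixed in advance and does not depend on the data, which is what allows such a network to be laid out as wires and comparators in hardware.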

Page 14: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

The Merge Sorter Tree*

• A data path that executes the merge process

[Figure: 4-way merge sorter tree built from FIFOs and sorter cells]

* Dirk Koch et al., FPGASort, FPGA’11

Page 15: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

The Merge Sorter Tree

• Sorting process in the merge sorter tree
  ➢ The data sequences in the leftmost FIFOs must already be sorted

[Figure: three snapshots of a 4-way merge sorter tree at Cycle N, Cycle N+1, and Cycle N+2; x marks an invalid value]
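The merge process can be illustrated with a small software model of a 4-way tree. This is a sketch under stated assumptions: `sorter_cell` and `merge_tree` are hypothetical names, the input values are illustrative, and the model is not cycle-accurate (the hardware advances one value per cycle, whereas this code drains each FIFO completely).

```python
from collections import deque


def sorter_cell(left, right):
    """Merge two sorted FIFOs into one sorted FIFO by repeatedly emitting the smaller head."""
    out = deque()
    while left and right:
        out.append(left.popleft() if left[0] <= right[0] else right.popleft())
    out.extend(left)
    out.extend(right)
    return out


def merge_tree(fifos):
    """Reduce a power-of-two number of sorted FIFOs with a binary tree of sorter cells."""
    level = [deque(f) for f in fifos]
    while len(level) > 1:
        level = [sorter_cell(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return list(level[0])


# Four sorted input sequences, one per leaf FIFO of a 4-way tree.
print(merge_tree([[3, 8], [5, 9], [1, 2], [2, 3]]))  # [1, 2, 2, 3, 3, 5, 8, 9]
```

If any leaf sequence were unsorted, a sorter cell could emit a value ahead of a smaller one still waiting behind it, which is why the inputs to the leftmost FIFOs must be sorted.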

Page 16: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Data Path of the Proposed Sorting Accelerator

[Figure: data path inside the FPGA. Components: Initial Data Generator, DRAM Controller, Sorting Network, Input Buffer (512-bit shift registers), Merge Sorter Tree, Output Buffer (512-bit shift register), and UART Controller. External blocks: DRAM and Host PC (debug hardware). Buses are 512 bits wide between the DRAM Controller and the buffers, and 32 bits wide inside the Merge Sorter Tree.]

Page 17: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems


Example: Sorting 256 elements

Page 18: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The generated initial data sequence is stored in the external memory

[Figure: accelerator data path; the Initial Data Generator writes the sequence 256, 255, 254, …, 3, 2, 1 to DRAM]

Page 19: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• Initialization is done

[Figure: accelerator data path; DRAM holds the initial sequence 256, 255, 254, 253, 252, …]

Page 20: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data is sent to the Sorting Network

[Figure: accelerator data path; the sequence 256, 255, 254, …, 3, 2, 1 is read from DRAM and enters the Sorting Network]

Page 21: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The Sorting Network sorts 16 elements at a time
  ➢ Passing through this network turns the initial data sequence into 16 sorted data sequences of 16 elements each

[Figure: accelerator data path; the output of the Sorting Network consists of sorted 16-element units: 1…16, 17…32, …, 241…256]
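The effect of this first stage can be sketched in software. In the snippet below, Python's built-in `sorted` stands in for the hardware sorting network, and the function name `sort_in_units` is hypothetical; only the unit size of 16 and the 256-element reverse-ordered input come from the slides.

```python
def sort_in_units(data, unit=16):
    """Split the data into consecutive units and sort each unit independently."""
    return [sorted(data[i:i + unit]) for i in range(0, len(data), unit)]


initial = list(range(256, 0, -1))   # 256, 255, ..., 2, 1
units = sort_in_units(initial)

print(len(units))       # 16 sorted units
print(units[0][:3])     # [241, 242, 243]
print(units[-1][-3:])   # [14, 15, 16]
```

Each unit is sorted internally, but the units are still in reverse order relative to one another, so the merge sorter tree has to combine them.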

Page 22: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data that has passed through the network is stored in the Input Buffer and sent to the Merge Sorter Tree

[Figure: accelerator data path; the Input Buffers feed the sorted units 241…256, 225…240, 209…224, and 193…208 into the Merge Sorter Tree]

Page 23: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The root of the tree emits sorted data sequences

[Figure: accelerator data path; the root of the Merge Sorter Tree emits values in ascending order, starting from 193]

Page 24: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data sequence composed of 16 sorted units turns into 4 sorted data sequences of 64 elements each

[Figure: accelerator data path; after the first pass the data consists of four sorted 64-element sequences: 1…64, 65…128, 129…192, and 193…256]

Page 25: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data is stored in the external memory

[Figure: accelerator data path; the Output Buffer writes 193, 194, 195, … to the Store Area in DRAM]

Page 26: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• This data is not fully sorted yet
  ➢ It has to be sent to the Merge Sorter Tree again

[Figure: accelerator data path; DRAM holds the partially sorted data]

Page 27: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data is read from DRAM and sent to the Sorting Network

[Figure: accelerator data path; the partially sorted sequences are read from the Load Area in DRAM]

Page 28: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• This time, the network acts as a mere data path, because each portion of the data sequence is already sorted

[Figure: accelerator data path; the data passes through the Sorting Network unchanged]

Page 29: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data that has passed through the network is stored in the Input Buffer and sent to the Merge Sorter Tree

[Figure: accelerator data path; the Input Buffers feed the sorted sequences 193…256, 129…192, 65…128, and 1…64 into the Merge Sorter Tree]

Page 30: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The root of the tree emits sorted data sequences

[Figure: accelerator data path; the root of the Merge Sorter Tree emits values in ascending order, starting from 1]

Page 31: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The data is stored in the external memory

[Figure: accelerator data path; the Output Buffer writes 1, 2, 3, … to the Store Area in DRAM]

Page 32: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• This data is now fully sorted
  ➢ The data becomes fully sorted after passing through the network and the tree the required number of times
    - Number of passes: $\log_{k}(N/16)$, where $k$ is the # of ways and $N$ is the # of elements
    - In this example, $\log_{4}(256/16) = 2$

[Figure: accelerator data path; DRAM holds the fully sorted sequence 1, 2, 3, 4, 5, …, 256]
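The pass count can be checked with a short software model of the whole walkthrough. This is a sketch only: `heapq.merge` stands in for one k-way merge sorter tree, `sorted` stands in for the sorting network, the names `merge_pass` and `sort_with_passes` are hypothetical, and a non-empty input is assumed.

```python
import heapq


def merge_pass(runs, k):
    """One trip through a k-way tree: merge every group of k sorted runs into one run."""
    return [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]


def sort_with_passes(data, k=4, unit=16):
    """Sort 16-element units first, then repeat k-way merge passes until one run remains."""
    runs = [sorted(data[i:i + unit]) for i in range(0, len(data), unit)]
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, k)
        passes += 1
    return runs[0], passes


result, passes = sort_with_passes(list(range(256, 0, -1)))
print(passes)        # 2, matching log_4(256 / 16)
print(result[:5])    # [1, 2, 3, 4, 5]
```

In the model, each pass corresponds to one round trip through DRAM, so a larger number of ways reduces the number of times the data set has to be read and written.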

Page 33: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Sorting 256 Elements from 256 to 1

• The fully sorted data is sent to the Host PC
  ➢ To verify that the accelerator works correctly

[Figure: accelerator data path; the sorted data 1, 2, 3, … is read from the Load Area in DRAM and sent to the Host PC]

Page 34: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Data Path of the Accelerator with the Duplicated Merge Sorter Tree

• Duplication of the merge sorter tree

[Figure: accelerator data path in which the Input Buffer, Merge Sorter Tree, and Output Buffer logic is duplicated inside the FPGA; the Initial Data Generator, Sorting Network, DRAM Controller, and UART Controller are shared]

Page 35: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Duplication of the Merge Sorter Tree

• Effectiveness
  ➢ Data sequences are sorted in parallel
    - The throughput of the sorting logic is improved

Example: the accelerator with four 4-way trees sorts the initial data sequence

Initial data sequence: 256 … 194 193 | 192 … 130 129 | 128 … 66 65 | 64 … 2 1

Sorting the data in parallel (Tree 0, Tree 1, Tree 2, Tree 3): 193 194 … 256 | 129 130 … 192 | 65 66 … 128 | 1 2 … 64

Executing the merge process in a tree: 1 2 3 4 5 6 7 8 … 253 254 255 256

Sorting is done!
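The two phases on this slide can be mimicked in software as follows. This is an illustrative sketch, not the hardware design: the function name `sort_with_trees` is hypothetical, `sorted` stands in for the work each tree does on its own part, the parts are processed sequentially here although the hardware trees run concurrently, and the input length is assumed to be divisible by the number of trees.

```python
import heapq


def sort_with_trees(data, trees=4):
    """Phase 1: each tree sorts its own part. Phase 2: one tree merges the sorted parts."""
    size = len(data) // trees
    parts = [data[i * size:(i + 1) * size] for i in range(trees)]
    sorted_parts = [sorted(part) for part in parts]   # independent work, one part per tree
    merged = list(heapq.merge(*sorted_parts))         # final merge in a single tree
    return sorted_parts, merged


initial = list(range(256, 0, -1))
sorted_parts, merged = sort_with_trees(initial)

print(sorted_parts[0][:3])   # [193, 194, 195]  (Tree 0)
print(sorted_parts[3][:3])   # [1, 2, 3]        (Tree 3)
print(merged[:4], merged[-4:])
```

Because the parts do not depend on one another in the first phase, adding trees raises throughput there; the final merge still runs through a single tree.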

Page 36: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Characteristics of the Accelerator

• The accelerator is customizable by tuning the # of ways and the # of duplicated trees

• The accelerator’s performance can be formulated
  ➢ A designer can estimate the performance in advance and implement the best configuration that fits the hardware resource constraints

[Equation: # of cycles required to finish sorting, in terms of k (# of ways), P (# of trees), and N (# of elements); the expression itself did not survive extraction]

Page 37: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems


Evaluation

Page 38: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Hardware Setup (1/2)

• Implementation platform
  ➢ Xilinx FPGA VC707 Evaluation Kit

[Figure: photograph of the VC707 board with labels: Power In, UART Port, JTAG Port, Virtex-7 VX485T FPGA, DDR3 SO-DIMM DRAM (4GB*, 800MHz/1600Mbps)]

* This kit originally has a 1GB DDR3 SO-DIMM (800MHz/1600Mbps) memory

Page 39: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Hardware Setup (2/2)

• All logic is implemented in Verilog HDL

• Design tool: Vivado 2014.4

• Operating frequency
  ➢ Logic: 200MHz, memory bus: 800MHz

• Initial Data Generator
  ➢ Supports the following data-generation types:
    - A random data sequence using Xorshift*
    - A sorted data sequence
    - A reverse-order sorted data sequence

* George Marsaglia, Xorshift RNGs, Journal of Statistical Software 2003.
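For readers unfamiliar with Xorshift, the sketch below shows one 32-bit variant from Marsaglia's paper, using the shift triple (13, 17, 5). The slides do not say which variant, shift constants, or seed the Initial Data Generator uses, so the triple, the seed, and the function name `xorshift32` are all assumptions made for illustration.

```python
from itertools import islice


def xorshift32(seed):
    """Yield 32-bit pseudo-random values using the (13, 17, 5) xorshift recurrence."""
    x = seed & 0xFFFFFFFF
    if x == 0:
        raise ValueError("seed must be non-zero")   # state 0 would stay 0 forever
    while True:
        x ^= (x << 13) & 0xFFFFFFFF   # mask keeps the state within 32 bits
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        yield x


# Hypothetical seed chosen for the example only.
print(list(islice(xorshift32(1), 3)))
```

The generator needs only shifts and XORs, so a comparable circuit is inexpensive to build in FPGA logic.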

Page 40: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Evaluation

• Evaluation points: sorting process time and hardware resource usage
  ➢ Dataset: 256M 32-bit integer values

• Baseline: Intel Core i7-4770 @ 3.4GHz
  ➢ A single thread
  ➢ gcc 4.8.2 (-O3 optimization)
  ➢ Sorting algorithms:
    - Merge sort
    - Quick sort

• How the execution time is measured
  ➢ FPGA: count the execution cycles
  ➢ CPU: use gettimeofday
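To compare the two measurements, a cycle count from the FPGA has to be converted to seconds. The sketch below assumes the count is taken in the 200MHz logic clock domain mentioned in the hardware setup; the function names and the example figures are hypothetical and are not measured results.

```python
def cycles_to_seconds(cycles, clock_hz=200_000_000):
    """Convert a cycle count to wall-clock seconds at the given clock frequency."""
    return cycles / clock_hz


def speed_up(cpu_seconds, fpga_cycles, clock_hz=200_000_000):
    """Ratio of CPU time to FPGA time; values above 1 mean the FPGA is faster."""
    return cpu_seconds / cycles_to_seconds(fpga_cycles, clock_hz)


# Illustrative numbers only.
print(cycles_to_seconds(200_000_000))   # 1.0
print(speed_up(10.0, 400_000_000))      # 5.0
```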

Page 41: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Evaluation: Sorting Performance

• Sorting performance
  ➢ It improves as the # of ways and the # of trees increase
  ➢ It is independent of the data-sequence type
  ➢ It is almost the same as the estimated performance

[Figure: bar chart of Sorting Process Time [sec] (axis 0 to 60) for merge sort, quick sort, 4-way, 4-way/2-parallel, 4-way/4-parallel, 8-way, 8-way/2-parallel, 8-way/4-parallel, 8-way/8-parallel, 16-way, 16-way/2-parallel, and 16-way/4-parallel; series: xorshift, sorted, reverse, Estimated]

Page 42: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Evaluation: Sorting Performance of 8-way/8-parallel

• For a random data sequence, 8-way/8-parallel is:
  ➢ 10.06x faster than merge sort
  ➢ 8.01x faster than quick sort

[Figure: bar chart of Sorting Process Time [sec] (axis 0 to 60) for merge sort, quick sort, and 8-way/8-parallel; series: xorshift, sorted, reverse; annotations: 10.06x, 8.01x]

Page 43: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Customizable: the Performance and the Hardware Resource Usage

• Most of the designs can be implemented on low-end devices
  ➢ The accelerator can be used even in embedded systems

[Figure: scatter plot of Speed-up Ratio (axis 0 to 12) against # of Slices (axis 0 to 70000) for 4-way, 4-way/2-parallel, 4-way/4-parallel, 8-way, 8-way/2-parallel, 8-way/4-parallel, 8-way/8-parallel, 16-way, 16-way/2-parallel, and 16-way/4-parallel; annotations: Cost Performance; Available 15,850 Slices (Artix-7 XC7A100T); Available 50,950 Slices (Kintex-7 XC7K325T)]

Page 44: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Open Sourced

• FACE is available on GitHub
  ➢ https://github.com/monotone-RK/FACE

• Currently, FACE works on the Xilinx FPGA VC707 Evaluation Kit
  ➢ We will try to port it to other environments on request, where possible

[Figure: Xilinx FPGA VC707 Evaluation Kit]

Page 45: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems


Page 46: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems


Conclusion

Page 47: FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems

Conclusion

• FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems
  ➢ The accelerator is customizable by tuning the # of ways and the # of duplicated trees
  ➢ The accelerator’s performance can be formulated
  ➢ Open sourced
    - Available on GitHub (https://github.com/monotone-RK/FACE)

• Future work
  ➢ Performance evaluation including data transfer
    - Bus systems such as AXI4 or Avalon, NoC, PCIe, etc.