MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi
Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo
College of Computer, National University of Defense Technology
10/7/2017



Outline

Section 1 Backgrounds & Motivation
– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture

Section 2 The MicRun Framework
– Bucket Grouping Module
– Auto-tuning Module

Section 3 Experiments & Conclusions

Section 1 Backgrounds & Motivation

• Scale-free Graphs are Widely Used
− Social Network Applications
− Chemical Molecular Structures
− Reference Citations

• Features of Scale-free Graphs
− The Sparsity Characteristic of Graphs
− The Connectivity of Vertices Follows a Power-law Distribution

[Figure: log-log plots of Number of Vertices versus Degree for (a) Higgs-twitter and (b) Soc-pokec, annotated with the power law $y = x^{-\gamma}$.]
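The power-law relation can be checked numerically on a degree list. The following is a minimal sketch, assuming γ is estimated by a least-squares line through the log-log degree histogram; the slides do not say how γ is obtained, and `estimate_gamma` is a hypothetical helper, not part of MicRun.

```python
import math
from collections import Counter

def estimate_gamma(degrees):
    """Estimate gamma in count ~ degree^(-gamma) from a list of vertex degrees."""
    # Histogram: degree -> number of vertices with that degree.
    hist = Counter(d for d in degrees if d > 0)
    if len(hist) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(d) for d in hist]           # log(degree)
    ys = [math.log(c) for c in hist.values()]  # log(number of vertices)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The slope of the fitted log-log line is -gamma.
    return -sxy / sxx

# Toy degree list: 16 vertices of degree 1, 4 of degree 2, 1 of degree 4,
# which lies exactly on count = 16 * degree^(-2).
toy_degrees = [1] * 16 + [2] * 4 + [4]
gamma = estimate_gamma(toy_degrees)
```

On real graphs the histogram tail is noisy, so a plain least-squares fit is only a rough estimate.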


• Graph Algorithms: Sequential Computation Steps
− Load values of source vertices
− Load values of edges
− Compute (e.g., addition, minimum, etc.)
− Update destination vertices
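The four steps map directly onto one edge relaxation in Bellman-Ford, one of the two algorithms evaluated later. The sketch below is a plain sequential version written for illustration; the edge-list layout and function names are assumptions, not the MicRun implementation.

```python
import math

def relax_edges(dist, edges):
    """One pass over all edges; returns True if any distance changed."""
    changed = False
    for src, dst, weight in edges:
        src_val = dist[src]              # 1. load the value of the source vertex
        edge_val = weight                # 2. load the value of the edge
        candidate = src_val + edge_val   # 3. compute: addition ...
        if candidate < dist[dst]:        #    ... followed by minimum
            dist[dst] = candidate        # 4. update the destination vertex
            changed = True
    return changed

def bellman_ford(num_vertices, edges, source):
    """Shortest distances from source; assumes no negative cycles."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):
        if not relax_edges(dist, edges):
            break  # converged early
    return dist

# Toy graph as (source, destination, weight) triples.
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
distances = bellman_ford(4, edges, source=0)
```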


• The Xeon Phi Architecture
− Architecture: Many Integrated Core (MIC)
− 512-bit VPU and four hyper-threads per core
− Frequency is more than 1.50 GHz
− Memory (GDDR5) is more than 8 GB
− 57-72 cores with optimized KNC instruction set
− Connects to the CPU over PCIe
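The 512-bit VPU width fixes how many elements one SIMD instruction can process. A small worked calculation (the helper name is made up for illustration):

```python
VPU_BITS = 512  # vector register width stated on the slide

def simd_lanes(element_bits):
    """Number of elements that fit in one 512-bit vector register."""
    return VPU_BITS // element_bits

lanes_32 = simd_lanes(32)  # 32-bit integers or single-precision floats
lanes_64 = simd_lanes(64)  # 64-bit integers or double-precision floats
```

So a group of edge updates on 32-bit values can fill up to 16 lanes at once.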


• Challenges of Executing Graph Algorithms on Phi
− SIMD access locality is influenced by the access range
− Write conflicts can occur in SIMD parallelism
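The write-conflict problem can be reproduced without vector hardware. The sketch below simulates a gather, lane-wise add, and scatter in pure Python; it is an illustration of the hazard, not Xeon Phi intrinsics, and the function names are invented here.

```python
def simd_scatter_add(values, dst_idx, contrib):
    """Simulate one vector step: all lanes read first, then all lanes write."""
    gathered = [values[i] for i in dst_idx]               # gather old values
    summed = [g + c for g, c in zip(gathered, contrib)]   # lane-wise add
    for i, s in zip(dst_idx, summed):                     # scatter results
        values[i] = s  # a later lane overwrites an earlier lane's result
    return values

def scalar_add(values, dst_idx, contrib):
    """Reference: apply the same updates one at a time."""
    for i, c in zip(dst_idx, contrib):
        values[i] += c
    return values

# Lanes 1 and 2 both target destination vertex 1.
dst_idx = [0, 1, 1, 2]
contrib = [1.0, 1.0, 1.0, 1.0]
simd_result = simd_scatter_add([0.0, 0.0, 0.0], dst_idx, contrib)
scalar_result = scalar_add([0.0, 0.0, 0.0], dst_idx, contrib)
```

Vertex 1 ends at 1.0 in the simulated SIMD step but at 2.0 in the scalar reference: one update is lost, which is why conflicting edges must not share a SIMD group.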


• Tiling-and-Grouping Strategy is Commonly Used
− Tiling enhances data locality
− Grouping removes parallel conflicts
− Related Citations
Efficient Parallel Graph Processing over CPU and MIC (Chen et al., CGO 2016)
Reusing Data Reorganization of Graph Applications (Jiang et al., IPDPS 2016)
Optimizing scale-free SpMV on the Intel Xeon Phi (Tang et al., CGO 2015)
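As a rough picture of the tiling half of the strategy, the sketch below assigns each nnz (edge) of the adjacency matrix to a square tile, so that edges in one tile touch a bounded range of source and destination vertices. Square 2D tiles and the name `tile_edges` are assumptions made for illustration; the cited works may partition differently.

```python
from collections import defaultdict

def tile_edges(edges, tile_size):
    """Group (src, dst) edges by the square tile of the adjacency matrix they fall in."""
    tiles = defaultdict(list)
    for src, dst in edges:
        tile_id = (src // tile_size, dst // tile_size)  # (tile row, tile column)
        tiles[tile_id].append((src, dst))
    return dict(tiles)

# Toy 8x8 adjacency matrix split into 4x4 tiles.
edges = [(0, 1), (1, 5), (4, 4), (5, 0), (7, 6)]
tiles = tile_edges(edges, tile_size=4)
```

A smaller tile narrows the access range per tile but produces more, emptier tiles, which is the trade-off behind the tile-size selection problem.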


• New Challenges Appear
− High penalty when using greedy grouping
− Difficult to select the optimal tile size

[Figure: (a) Time Overhead: blocking time and grouping time in seconds for soc-pokec and higgs-twitter. (b) Memory Overhead: file size (MB) versus tile size (orig, 128, 256, 512, 1024, 2048, 4096, 8192, 16384) for soc-pokec and higgs-twitter.]

Section 2 The MicRun Framework

• Overview of the Framework and the Modules
− Tiling Module
− Bucket Grouping Module
− Auto-tuning Module
− Graph Algorithms

Workflow of the MicRun Framework.


• Grouping Module
− Bucket structure is introduced to construct groups
− Max-heap optimization is used to improve efficiency

[Figure: (a) nnz in a tile, numbered 1 to 16 and laid out by source vertices versus destination vertices; (b) nnz transformed into groups using buckets, showing bucket numbers 8 to 1 and the nnz in each bucket.]

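Grouping of the 16 nnz into SIMD groups of width 4 (D marks an empty, padded lane; the SIMD column is useful lanes over total lanes):

Strategy                        Group1   Group2     Group3      Group4     Group5    Group6    SIMD
Bucket                          7-1-2-4  11-3-9-12  14-6-10-13  15-5-8-D   16-D-D-D  NULL      16/20
Sequential (Chen et al., 2016)  1-2-3-4  5-6-7-8    9-10-11-12  13-14-D-D  15-D-D-D  16-D-D-D  16/24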
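The SIMD column can be recomputed from the groups listed for each strategy. A short check, using the group contents exactly as they appear on the slide (the function name is illustrative):

```python
def simd_utilization(groups, pad="D"):
    """Return (useful lanes, total lanes) for a list of SIMD groups."""
    used = sum(1 for group in groups for lane in group if lane != pad)
    total = sum(len(group) for group in groups)
    return used, total

# Groups from the slide; "D" is an empty lane, the NULL group is omitted.
bucket_groups = [
    [7, 1, 2, 4], [11, 3, 9, 12], [14, 6, 10, 13],
    [15, 5, 8, "D"], [16, "D", "D", "D"],
]
sequential_groups = [
    [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12],
    [13, 14, "D", "D"], [15, "D", "D", "D"], [16, "D", "D", "D"],
]

bucket_util = simd_utilization(bucket_groups)
sequential_util = simd_utilization(sequential_groups)
```

Bucket grouping needs five groups (80% of lanes useful) where sequential grouping needs six (about 67%).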

Grouping complexity: $O(n^2)$, lowered to $O(n \log b)$ by the max-heap optimization.
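One plausible reading of bucket grouping with a max-heap is sketched below. It assumes that one bucket holds the nnz of one destination vertex (so two nnz from the same bucket would conflict), that each SIMD group takes one nnz from each of the fullest buckets, and that b is the number of buckets. None of this is confirmed by the slides, and `bucket_grouping` is a hypothetical name.

```python
import heapq
from collections import defaultdict

def bucket_grouping(edges, width=4):
    """Pack (src, dst) edges into conflict-free groups of at most `width` lanes."""
    # Assumption: one bucket per destination vertex.
    buckets = defaultdict(list)
    for src, dst in edges:
        buckets[dst].append((src, dst))

    # heapq is a min-heap, so sizes are negated to pop the fullest bucket first.
    heap = [(-len(items), dst) for dst, items in buckets.items()]
    heapq.heapify(heap)

    groups = []
    while heap:
        take = min(width, len(heap))
        picked = [heapq.heappop(heap) for _ in range(take)]
        group = []
        for neg_count, dst in picked:
            group.append(buckets[dst].pop())  # one nnz per distinct destination
            if neg_count + 1 < 0:             # bucket still has nnz left
                heapq.heappush(heap, (neg_count + 1, dst))
        groups.append(group)
    return groups

# Toy tile: destination 0 has three nnz, destination 1 has two, the rest one each.
edges = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 3), (7, 4)]
groups = bucket_grouping(edges, width=4)
```

Each nnz triggers at most one pop and one push on a heap of at most b entries, which is where an $O(n \log b)$ bound would come from under these assumptions. Draining the fullest buckets first keeps the long buckets from being left to fill groups on their own at the end.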


• Auto-tuning Module
− Extract Features Based on the Ideal Graph Application
The size of the adjacency matrix of a graph is related to its sparsity
The number of nnzs in the graph influences the overall memory use
The number of nnzs in each column is related to the nnz distribution
The average stride between nnzs influences cache misses
The feature tuple is constructed as: (s, n, γ, NC, ST)
− Decision Tree Model is Employed
The training target OT is obtained by manual probing
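The feature tuple can be illustrated on a small edge list. The slides name the tuple (s, n, γ, NC, ST) but do not give exact definitions, so the sketch below makes assumptions: s is the matrix dimension, n the nnz count, NC the mean nnz per non-empty column, ST the mean gap between consecutive column indices within a row, and γ is passed in rather than fitted. The decision tree itself is not shown.

```python
from collections import defaultdict

def extract_features(num_vertices, edges, gamma):
    """Build an (s, n, gamma, NC, ST) tuple under the assumptions stated above."""
    rows = defaultdict(list)  # source vertex -> destination (column) indices
    cols = defaultdict(int)   # destination vertex -> nnz count
    for src, dst in edges:
        rows[src].append(dst)
        cols[dst] += 1

    s = num_vertices                         # assumed: adjacency matrix dimension
    n = len(edges)                           # total nnz
    nc = n / len(cols) if cols else 0.0      # assumed: mean nnz per non-empty column

    strides = []
    for dsts in rows.values():
        dsts.sort()
        # Gaps between neighbouring nnz in the same row.
        strides.extend(b - a for a, b in zip(dsts, dsts[1:]))
    st = sum(strides) / len(strides) if strides else 0.0

    return (s, n, gamma, nc, st)

edges = [(0, 0), (0, 4), (1, 1), (1, 2), (1, 5)]
features = extract_features(6, edges, gamma=2.0)
```

A tuple like this, paired with the manually probed tile size as the label, is the kind of row a decision tree classifier would be trained on.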

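[Equation: a time model $T_{sum}$ built from integer and floating-point time terms summed over $i = 1..p$, $j = 1..q$ and $k = 1..t$; the terms carry the subscripts $c$, $r$, $nc$, $g$, $comp$, $s$ and the total number of nnz.]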

Section 3 Experiments & Conclusions

• Platform
− MIC node on the Tianhe-Ⅱ supercomputer
− The version of the Xeon Phi is 31S1P
− 57 x86 cores, 1.10 GHz, 4 hyper-threads per core
− The capacity of L2 cache is 28.5 MB
− Intel ICC 13.0.0, -O3 enabled

• Graph Applications
− Bellman-Ford Algorithm
− PageRank Algorithm

• Datasets
− SNAP Dataset
− University of Florida Sparse Matrix Collection
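The platform figures are mutually consistent, as a quick calculation shows. The 512 KB of L2 per core is an assumption brought in from the coprocessor's general specification, not a number stated on the slide.

```python
CORES = 57              # from the slide
THREADS_PER_CORE = 4    # from the slide
L2_PER_CORE_MB = 0.5    # assumption: 512 KB of L2 cache per core

hardware_threads = CORES * THREADS_PER_CORE   # threads available on the card
total_l2_mb = CORES * L2_PER_CORE_MB          # aggregate L2 capacity in MB
```

The aggregate L2 matches the 28.5 MB quoted for the platform.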


• College of Computer of NUDT

• Hometown of Supercomputers: Tianhe-Ⅱ
– No. 1 in TOP500 (2013.6 – 2015.11)
– 33.86 PFLOPS, 32,000 CPUs + 48,000 MICs


• Bucket Grouping vs. Sequential Grouping (Chen et al., 2016)
− Grouping time overhead: decreases stably
− SIMD utilization ratio: converges to 1 faster

[Figure: (a) Time overhead during the grouping stage; (b) SIMD utilization of the two grouping strategies.]


• The Execution of the Two Graph Algorithms
− 1.2x on average

[Figure: (a) Comparison of execution time; (b) execution time of Bellman-Ford; (c) execution time of PageRank.]


Table: Speedup achieved by optimal (OPT.) and auto-tuned (AUTO.) tiling over sequential (SEQ.) tiling performance

               Bellman-Ford                      PageRank
               OPT. vs. SEQ.   AUTO. vs. SEQ.    OPT. vs. SEQ.   AUTO. vs. SEQ.
Dataset        Val    Size     Val    Size       Val    Size     Val    Size
lp_osa_60      1.08   1024     1.03   256        1.07   256      1.07   256
msdoor         1.11   1152     1.05   4096       1.14   512      1.14   512
rajat24        1.18   2048     1.09   256        1.09   768      1.09   768
Si87H76        1.05   128      1.05   128        1.14   128      1.03   512
higgs-twitter  1.26   896      1.13   3072       1.33   1024     1.21   640
kron-logn18    1.29   4096     1.29   4096       1.36   2048     1.25   1024

• The Performance of the Auto-tuning Module
− Optimal over Sequential: 1.05x ~ 1.36x
− Auto-tuning over Sequential: 1.03x ~ 1.29x
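The two ranges follow directly from the speedup values reported for the six datasets. A short check, with the values transcribed from the results table (the dictionary keys are labels chosen here):

```python
# Speedup over sequential tiling: (Bellman-Ford OPT, Bellman-Ford AUTO,
#                                  PageRank OPT,     PageRank AUTO)
speedups = {
    "lp_osa_60":     (1.08, 1.03, 1.07, 1.07),
    "msdoor":        (1.11, 1.05, 1.14, 1.14),
    "rajat24":       (1.18, 1.09, 1.09, 1.09),
    "Si87H76":       (1.05, 1.05, 1.14, 1.03),
    "higgs-twitter": (1.26, 1.13, 1.33, 1.21),
    "kron-logn18":   (1.29, 1.29, 1.36, 1.25),
}

# Columns 0 and 2 are the optimal tile size, columns 1 and 3 the auto-tuned one.
opt_values = [row[i] for row in speedups.values() for i in (0, 2)]
auto_values = [row[i] for row in speedups.values() for i in (1, 3)]

opt_range = (min(opt_values), max(opt_values))
auto_range = (min(auto_values), max(auto_values))
```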


Conclusions

• The MicRun Framework
− Grouping Module
Bucket structure is employed
Max-heap mechanism is embedded
− Auto-tuning Module
Decision Tree Classifier is introduced

• Future work
− Enrich the built-in graph algorithms
− Expand the framework to the MIMD parallel level


The Tianhe-2 supercomputer is available online. Scientists are welcome to collaborate with us on developing new software and can access Tianhe-2 through the Internet.

You are welcome to contact us! Email: [email protected]


Thank you! Questions?