
Working and Researching on Open64

Feb 22, 2016

Transcript
Page 1: Working and Researching on Open64

Institute of Computing Technology, Chinese Academy of Sciences

Working and Researching on Open64

Hongtao Yu, Feng Li, Wei Huo, Wei Mi, Li Chen, Chunhui Ma, Wenwen Xu, Ruiqi Lian, Xiaobing Feng

Page 2: Working and Researching on Open64

Outline

– Reform Open64 as an aggressive program analysis tool
  – Source code analysis and error checking
– Source-to-source transformation
  – WHIRL to C
– Extending UPC for GPU clusters
– New targeting
  – Target the LOONGSON CPU

Page 3: Working and Researching on Open64

Part Ⅰ Aggressive program analysis

Page 4: Working and Researching on Open64

Whole-program analysis (WPA)
– Aims at error checking
– A framework

Pointer analysis
– The foundation of other program analyses
– Flow- and context-sensitive

Program slicing
– Interprocedural
– Reduces program size for specific problems

Page 5: Working and Researching on Open64

WPA framework (diagram): inside IPA_LINK, the IPL summary phase feeds a pipeline that builds the call graph, constructs SSA form for each procedure, and runs FSCS pointer analysis (LevPA); on top of this sit the whole-program analyzer, the static slicer, and the static error checker.

Page 6: Working and Researching on Open64

LevPA – level-by-level pointer analysis

– A flow- and context-sensitive pointer analysis
– Analyzes millions of lines of code quickly
– The work has been published as:

Hongtao Yu, Jingling Xue, Wei Huo, Zhaoqing Zhang, Xiaobing Feng. Level by Level: Making Flow- and Context-Sensitive Pointer Analysis Scalable for Millions of Lines of Code. In Proceedings of the 2010 International Symposium on Code Generation and Optimization, April 24–28, 2010, Toronto, Canada.

Page 7: Working and Researching on Open64

LevPA: level-by-level analysis

– Analyze the pointers in decreasing order of their points-to levels. Suppose int **q, *p, x; — then q has level 2, p has level 1 and x has level 0. A variable can be referenced directly, or indirectly through dereferences of another pointer.
– Fast flow-sensitive analysis on full-sparse SSA
– Fast and accurate context-sensitive analysis using a full transfer function
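The level assignment above can be mechanized; a minimal sketch (illustrative only, not Open64's implementation) treats the points-to level of a flat C declaration as its pointer-indirection depth, i.e. the number of '*' tokens:

```c
#include <assert.h>

/* Hypothetical helper: for a flat declaration string such as "int **q",
   the points-to level is the number of '*' characters.
   int **q -> 2, int *p -> 1, int x -> 0, matching the slide's ptl values. */
int points_to_level(const char *decl) {
    int level = 0;
    for (const char *s = decl; *s; ++s)
        if (*s == '*')
            ++level;
    return level;
}
```

Real declarators (arrays, function pointers, typedefs) need the type system rather than string counting; the sketch only mirrors the slide's example.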

Page 8: Working and Researching on Open64

Framework (Figure 1: level-by-level pointer analysis, LevPA): first compute the points-to level of every variable; then, for each points-to level from the highest to the lowest, evaluate transfer functions bottom-up and propagate points-to sets top-down, incrementally building the call graph as indirect calls are resolved.
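The iteration order of Figure 1 can be sketched as follows (hypothetical stub phases, not the actual IPA_LINK code):

```c
#include <assert.h>
#include <stdio.h>

/* Stubs standing in for the two interprocedural passes per level. */
static void bottom_up(int level) { printf("bottom-up pass, level %d\n", level); }
static void top_down(int level)  { printf("top-down pass, level %d\n", level); }

/* Driver sketch: analyze levels from the highest down to 0, running a
   bottom-up pass (evaluate transfer functions, callees first) and then a
   top-down pass (propagate points-to sets to call sites) at each level.
   Returns the number of passes executed: two per level. */
int levpa_drive(int max_level) {
    int passes = 0;
    for (int lvl = max_level; lvl >= 0; --lvl) {
        bottom_up(lvl);
        ++passes;
        top_down(lvl);
        ++passes;
    }
    return passes;
}
```

On the running example the highest level is 2 (x, y, p, q), so the driver performs three bottom-up/top-down rounds.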

Page 9: Working and Researching on Open64

Example

int o, t;
main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}

void foo(int **p, int **q) {
L11: *p = *q;
L12: *q = &obj;
}

ptl(x, y, p, q) = 2; ptl(a, b, c, d, e) = 1; ptl(t, o) = 0

Analyze first { x, y, p, q }, then { a, b, c, d, e }, last { t, o }.

Page 10: Working and Researching on Open64

Bottom-up analysis of level 2

void foo(int **p, int **q) {
L11: *p = *q;
L12: *q = &obj;
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}

Page 11: Working and Researching on Open64

Bottom-up analysis of level 2, with foo in SSA form

void foo(int **p, int **q) {
L11: *p1 = *q1;
L12: *q1 = &obj;
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}

• p1's points-to depends on formal-in p
• q1's points-to depends on formal-in q

Page 12: Working and Researching on Open64

Bottom-up analysis of level 2, with main in SSA form

void foo(int **p, int **q) {
L11: *p1 = *q1;
L12: *q1 = &obj;
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x1 = &a; y1 = &b;
L4:  foo(x1, y1);
L5:  *b = 5;
L6:  if ( … ) { x2 = &c; y2 = &e; }
L7:  else { x3 = &d; y3 = &d; }
     x4 = ϕ(x2, x3); y4 = ϕ(y2, y3)
L8:  c = &t;
L9:  foo(x4, y4);
L10: *e = 10;
}

p1's points-to depends on formal-in p; q1's points-to depends on formal-in q.

• x1 → { a }
• y1 → { b }
• x2 → { c }
• y2 → { e }
• x3 → { d }
• y3 → { d }
• x4 → { c, d }
• y4 → { e, d }

Page 13: Working and Researching on Open64

Full-sparse analysis: achieve flow-sensitivity flow-insensitively

– Regard each SSA name as a unique variable
– Set-constraint-based pointer analysis

Fully sparse
– Saves time
– Saves space
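Why full sparseness gives flow-sensitivity cheaply can be seen on the running example; a toy encoding (bitmask points-to sets, hypothetical and much simpler than the analysis's real data structures) reproduces the x1…x4 results of the earlier slide by solving each SSA name independently:

```c
#include <assert.h>

/* Variables a..e of the running example encoded as bits of a set. */
enum { A = 1, B = 2, C = 4, D = 8, E = 16 };

/* Each SSA name gets its own points-to set, so a flow-insensitive,
   per-name solution already distinguishes program points:
   x1 = &a, x2 = &c, x3 = &d, x4 = phi(x2, x3).  Returns x4's set. */
unsigned ssa_example(void) {
    unsigned x1 = A;        /* L3: x = &a  -> x1 points to { a }    */
    unsigned x2 = C;        /* L6: x = &c  -> x2 points to { c }    */
    unsigned x3 = D;        /* L7: x = &d  -> x3 points to { d }    */
    unsigned x4 = x2 | x3;  /* phi merges  -> x4 points to { c, d } */
    (void)x1;
    return x4;
}
```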

Page 14: Working and Researching on Open64

Top-down analysis of level 2 — main: propagate to call sites

L4: foo.p → { a },    foo.q → { b }
L9: foo.p → { c, d }, foo.q → { d, e }

• foo.p → { a, c, d }
• foo.q → { b, d, e }

void foo(int **p, int **q) {
L11: *p = *q;
L12: *q = &obj;
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}

Page 15: Working and Researching on Open64

Top-down analysis of level 2 — foo: expand pointer dereferences (calling contexts are merged here)

void foo(int **p, int **q) {
     μ(b, d, e)
L11: *p1 = *q1;
     χ(a, c, d)
L12: *q1 = &obj;
     χ(b, d, e)
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}

Page 16: Working and Researching on Open64

Context condition

To be context-sensitive:
– Points-to relation ci: p ⟹ v (p → v) means p must (may) point to v, where p is a formal parameter.
– Context condition ℂ(c1, …, ck): a Boolean function over higher-level points-to relations.
– Context-sensitive μ and χ:
  – μ(vi, ℂ(c1, …, ck))
  – vi+1 = χ(vi, M, ℂ(c1, …, ck)), where M ∈ {may, must} indicates a weak/strong update.

Page 17: Working and Researching on Open64

Context-sensitive μ and χ

void foo(int **p, int **q) {
     μ(b, q ⟹ b)  μ(d, q → d)  μ(e, q → e)
L11: *p1 = *q1;
     a = χ(a, must, p ⟹ a)
     c = χ(c, may,  p → c)
     d = χ(d, may,  p → d)
L12: *q1 = &obj;
     b = χ(b, must, q ⟹ b)
     d = χ(d, may,  q → d)
     e = χ(e, may,  q → e)
}
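The must/may distinction in the χ annotations can be sketched with bitmask points-to sets (a toy encoding, not the analysis's real representation): a store *p = … strongly updates v, killing its old set, only when p definitely points to v alone; otherwise v gets a weak update that merges old and new sets.

```c
#include <assert.h>

/* Objects a..e encoded as bits of a points-to set. */
enum { OBJ_A = 1, OBJ_B = 2, OBJ_C = 4, OBJ_D = 8, OBJ_E = 16 };

static int is_singleton(unsigned set) {
    return set != 0 && (set & (set - 1)) == 0;
}

/* New points-to set of v after "*p = ...", where p_pts is p's points-to
   set, v_bit identifies v, v_old/v_new are v's old set and the stored set. */
unsigned chi_update(unsigned p_pts, unsigned v_bit,
                    unsigned v_old, unsigned v_new) {
    if ((p_pts & v_bit) == 0)
        return v_old;         /* p cannot reach v: no update         */
    if (is_singleton(p_pts))
        return v_new;         /* must alias: strong update (kill)    */
    return v_old | v_new;     /* may alias: weak update (merge)      */
}
```

At L4 in the example, p points only to a, so a is strongly updated; at L9, p may point to c or d, so each receives a weak update.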

Page 18: Working and Researching on Open64

Bottom-up analysis of level 1

void foo(int **p, int **q) {
     μ(b1, q ⟹ b)  μ(d1, q → d)  μ(e1, q → e)
L11: *p1 = *q1;
     a2 = χ(a1, must, p ⟹ a)
     c2 = χ(c1, may,  p → c)
     d2 = χ(d1, may,  p → d)
L12: *q1 = &obj;
     b2 = χ(b1, must, q ⟹ b)
     d3 = χ(d2, may,  q → d)
     e2 = χ(e1, may,  q → e)
}

• Trans(foo, a) = < { }, { <b, q ⟹ b>, <d, q → d>, <e, q → e> }, p ⟹ a, must >
• Trans(foo, c) = < { }, { <b, q ⟹ b>, <d, q → d>, <e, q → e> }, p → c, may >
• Trans(foo, b) = < { <obj, q ⟹ b> }, { }, q ⟹ b, must >
• Trans(foo, e) = < { <obj, q → e> }, { }, q → e, may >
• Trans(foo, d) = < { <obj, q → d> }, { <b, p → d ∧ q ⟹ b>, <d, p → d>, <e, p → d ∧ q → e> }, p → d ∨ q → d, may >

Page 19: Working and Researching on Open64

Bottom-up analysis of level 1, applied in main

int obj, t;
main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x1 = &a; y1 = &b;
     μ(b1, true)
L4:  foo(x1, y1);
     a2 = χ(a1, must, true)
     b2 = χ(b1, must, true)
     c2 = χ(c1, may,  true)
     d2 = χ(d1, may,  true)
     e2 = χ(e1, may,  true)
L5:  *b1 = 5;
L6:  if ( … ) { x2 = &c; y2 = &e; }
L7:  else { x3 = &d; y3 = &d; }
     x4 = ϕ(x2, x3); y4 = ϕ(y2, y3)
L8:  c1 = &t;
     μ(d1, true)  μ(e1, true)
L9:  foo(x4, y4);
     a2 = χ(a1, must, true)
     b2 = χ(b1, must, true)
     c2 = χ(c1, may,  true)
     d2 = χ(d1, may,  true)
     e2 = χ(e1, may,  true)
L10: *e1 = 10;
}

Page 20: Working and Researching on Open64

Full context-sensitive analysis

– Compute a complete transfer function for each procedure.
– The transfer function keeps the cost of representing and applying it low:
  – Calling contexts are represented by context conditions; merging similar calling contexts reduces costs better than calling strings.
  – Context conditions are implemented with BDDs, which represent them compactly and let Boolean operations be evaluated efficiently.

Page 21: Working and Researching on Open64

Experiment

Analyzes millions of lines of code in minutes — faster than the state-of-the-art FSCS pointer analysis algorithms.

Table 2. Performance (secs).

Benchmark         KLOC   LevPA 64-bit   LevPA 32-bit   Bootstrapping (PLDI'08) 32-bit
Icecast-2.3.1       22        2.18           5.73             29
sendmail           115       72.63         143.68            939
httpd              128       16.32          35.42            161
445.gobmk          197       21.37          40.78              /
wine-0.9.24       1905      502.29         891.16              /
wireshark-1.2.2   2383      366.63         845.23              /

Page 22: Working and Researching on Open64

Future work

The points-to results can only be used for error checking now. We are working on:
– Serving optimization:
  – Let the WPA framework generate code (connect to CG)
  – Make points-to sets accommodate the optimization passes
  – New optimizations under the WPA framework
– Serving parallelization:
  – Provide precise information to programmers for guiding parallelization

Page 23: Working and Researching on Open64

An interprocedural slicer

– Based on the PDG (program dependence graph)
– Compresses the PDG by merging nodes that are aliased
– Accommodates multiple pointer analyses
– Allows many problems to be solved on a slice, reducing time and space costs

Page 24: Working and Researching on Open64

Application of slicing: aiding program error checking

– Reduces the number of states to be checked
– We use Saturn as our error checker and feed it slices instead of the whole program
– After slicing, Saturn detects errors in file and memory operations 11 and 2 times faster, respectively

Page 25: Working and Researching on Open64

[Chart: per-benchmark ratios (比值, "ratio") of error-checking time before vs. after slicing, shown separately for the FILE and POINTER checkers]

Page 26: Working and Researching on Open64

(Same content as page 24.)

Page 27: Working and Researching on Open64

Application of slicing: aiding program error checking

– Reduces the number of states to be checked
  – We use Saturn as our error checker and feed it slices instead of the whole program
  – After slicing, Saturn detects errors in file and memory operations 11.59 and 2.06 times faster, respectively
– Improves the accuracy of error-checking tools
  – We use Fastcheck as our error checker
  – More true errors are detected by Fastcheck

Page 28: Working and Researching on Open64

[Chart: ratios (比值, "ratio") of errors detected by Fastcheck on the original program (原程序) vs. on FSM slices (FSM切片)]

Page 29: Working and Researching on Open64

Part Ⅱ Improvement on whirl2c


Page 30: Working and Researching on Open64

Improvement on whirl2c

Previous status
– whirl2c was designed for IPA and LNO compiler engineers to use for debugging
– The Berkeley UPC group and the Houston OpenUH group extended whirl2c somewhat, but it still cannot support big applications and various optimizations

Problem
– Type information becomes incorrect because of transformations

Page 31: Working and Researching on Open64

Improvement on whirl2c

Our work
– Improve whirl2c so that its output can be recompiled and executed
– Passes the SPEC CPU2000 C/C++ programs under O0/O2/O3 + IPA, based on PathScale 2.2

Motivation
– Some customers require us not to touch their platforms
– Support the retargetability of some platform-independent optimizations
– Support gdb on the whirl2c output

Page 32: Working and Researching on Open64

Improvement on whirl2c

Incorrect information due to transformation (diagram): the frontend produces correct types before structure folding, but after structure folding whirl2c emits wrong output.

Page 33: Working and Researching on Open64

Improvement on whirl2c

– Incorrect type information mainly involves pointer, array and structure types and their compositions.
– We re-infer the type information from basic types:
  – Basic type information is used to generate assembly code, so it is reliable.
  – Array element sizes are also reliable.
  – A series of rules derives the correct type information from basic type information, array element sizes and operators.
– Information that whirl2c needs but that various optimizations have made incorrect is fixed just before whirl2c runs, which requires little change to the existing IR.

Page 34: Working and Researching on Open64

Part ⅢExtending UPC for GPU cluster


Page 35: Working and Researching on Open64

Extending UPC with hierarchical parallelism

UPC (Unified Parallel C), a parallel extension to ISO C99
– A dialect of the PGAS (Partitioned Global Address Space) languages
– Suitable for distributed-memory machines, shared-memory systems and hybrid memory systems
– Good performance, portability and programmability

Important UPC features
– SPMD parallelism
– Shared data is partitioned into segments, each with affinity to one UPC thread; shared data is referenced through shared pointers
– Global workload partitioning: upc_forall with an affinity expression

ICT extends UPC with hierarchical parallelism
– Extended data distribution for shared arrays
– Hybrid SPMD with an implicit thread hierarchy
– Important optimizations targeting GPU clusters
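The upc_forall affinity feature mentioned above can be illustrated in plain C (a hypothetical simulation of the integer-affinity case, not actual UPC): with an integer affinity expression i, iteration i executes on the thread whose index equals i modulo THREADS, giving a cyclic distribution of the iteration space.

```c
#include <assert.h>

#define THREADS 4  /* assumed static UPC thread count */

/* Number of iterations of upc_forall(i = 0; i < n; i++; i) that would run
   on thread `mythread` under cyclic affinity. */
int my_iterations(int n, int mythread) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (i % THREADS == mythread)  /* the affinity test */
            ++count;
    return count;
}
```

With a pointer-to-shared affinity expression, UPC instead runs the iteration on the thread that has affinity to the referenced element; the cyclic integer case above is the simplest form.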

Page 36: Working and Researching on Open64

Source-to-source compiler, built upon Berkeley UPC (Open64)

– Frontend support
– Analysis and transformation of upc_forall loops:
  – Shared-memory management based on reuse analysis
  – Data-regrouping analysis for global-memory coalescing: structure splitting and array transpose
  – Instrumentation for memory consistency (in collaboration with the DSM system)
  – Affinity-aware loop tiling, for multidimensional data blocking on shared arrays
  – Creation of data environments for kernel loops, leveraging array-section analysis: copy-in, copy-out, private (allocation), formal arguments
  – CUDA kernel code generation and runtime instrumentation: kernel functions and kernel invocations
– whirl2c translator: UPC ⇒ C + UPCR + CUDA

Page 37: Working and Researching on Open64

Memory optimizations for CUDA

What data should be put into shared memory?
– First, pseudo tiling
– Extend REGION with reuse degree and region volume (inter-thread and intra-thread; average reuse degree for merged regions)
– A 0–1 bin-packing problem (SM capacity); the profit is quantified as reuse degree integrated with the coalescing attribute, preferring inter-thread reuse

What is the optimal data layout in global memory?
– Coalescing attributes of array references (only contiguity constraints are considered)
– Legality analysis
– Cost model and amortization analysis

Code transformations (in a runtime library)
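The structure splitting behind data regrouping can be sketched as an array-of-structures to structure-of-arrays copy (illustrative names only; the compiler and runtime perform this on shared arrays): in AoS form, consecutive GPU threads reading field x touch memory sizeof(struct) bytes apart, so loads are not coalesced, while per-field arrays make the accesses contiguous.

```c
#include <assert.h>

struct body { float x, y, z; };  /* hypothetical AoS element */

/* Runtime-library style regrouping: copy each field of the AoS input into
   its own contiguous array, so thread t's load of xs[t] is coalesced. */
void split_soa(const struct body *aos,
               float *xs, float *ys, float *zs, int n) {
    for (int i = 0; i < n; ++i) {
        xs[i] = aos[i].x;
        ys[i] = aos[i].y;
        zs[i] = aos[i].z;
    }
}
```

The slides note that this transformation is actually performed on the GPU side as a CUDA kernel, using shared memory; the sketch only shows the layout change itself.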

Page 38: Working and Researching on Open64

Extend UPC’s Runtime SystemA DSM system on each UPC thread

– Demand-driven data transfer between GPU and CPU

– Manage all global variables– Grain size, upc tile for shared arrays and private

array as a wholeshuffle remote and local array region into one

contiguous physical block before transferringData transformation for memory coalescing

– implemented in the GPU side using CUDA kernel– Leverage shared memory

38

Page 39: Working and Researching on Open64

Benchmarks

Application  Description                                               Original language  Application field       Source
Nbody        n-body simulation                                         CUDA+MPI           Scientific computing    CUDA campus programming contest 2009
LBM          Lattice Boltzmann method in computational fluid dynamics  C                  Scientific computing    SPEC CPU 2006
CP           Coulombic potential                                       CUDA               Scientific computing    UIUC Parboil benchmark
MRI-FHD      magnetic resonance imaging, FHD                           CUDA               Medical image analysis  UIUC Parboil benchmark
MRI-Q        magnetic resonance imaging, Q                             CUDA               Medical image analysis  UIUC Parboil benchmark
TPACF        two-point angular correlation function                    CUDA               Scientific computing    UIUC Parboil benchmark

Page 40: Working and Researching on Open64

UPC performance on a CUDA cluster

[Charts: overall performance on a single CUDA node (nbody, lbm) and on the GPU cluster (nbody, mri-fhd, mri-q, tpacf, cp); speedup bars for base DSM, memory coalescing, SM reuse and OPT, compared against hand-written CUDA / CUDA+MPI]

CPUs on each node: two dual-core AMD Opteron 880. GPU: NVIDIA GeForce 9800 GX2. Compilers: nvcc 2.2 -O3 and GCC 3.4.6 -O3. A 4-node CUDA cluster connected by Ethernet.

Page 41: Working and Researching on Open64

For more details, please contact Li Chen

[email protected]

Page 42: Working and Researching on Open64

Part Ⅳ Open Source Loongcc


Page 43: Working and Researching on Open64

Open-source Loongcc

Targets the LOONGSON CPU
– A MIPS-like processor with new instructions

Based on Open64
– Main trunk, r2716

New features
– LOONGSON machine model
– LOONGSON feature support in FE, LNO, WOPT and CG
– Edge profiling

Page 44: Working and Researching on Open64

Thanks
