
Any-Code Completion - urialon.cswp.cs.technion.ac.il · Structural Language Models of Code ICML’2020 Uri Alon Technion Eran Yahav Technion Omer Levy Tel-Aviv University Facebook

Jul 10, 2020


Any-Code Completion

public static Path[] stat2Paths(FileStatus[] stats) {
    if (stats == null) return null;
    Path[] ret = new Path[stats.length];
    for (int i = 0; i < stats.length; ++i) {
        ret[i] = stats[i].getPath();
    }
    return ret;
}

Generated: (Java)
    stats[i].getPath() (25.2%)
    new Path(stats[i]) (3.3%)
    new Path(stats[i], charset) (2.5%)


Overview: a Structural Language Model

[Figure: the expression stats[i].getPath() shown alongside its AST, rooted at a MethodCall node.]

http://AnyCodeGen.org


Structural Language Models of Code ICML’2020

Uri Alon Technion

Eran Yahav Technion

Omer Levy Tel-Aviv University

Facebook AI Research

Roy Sadaka Technion


public static Path[] stat2Paths(FileStatus[] stats) {
    if (stats == null) return null;
    Path[] ret = new Path[stats.size()];
    for (int i = 0; i < stats.length; ++i) {
        ret[i] = stats[i].getPath();
    }
    return ret;
}

Language modeling of code

• Code completion

• Validate existing code and detect unlikely code, such as the stats.size() call on the array in the snippet above.


Instead of representing the task as:

“predict a missing sentence in a text”

Represent the task as:

“predict a missing subtree in a tree”.

Learn syntactic patterns, instead of sequential patterns

Key Idea #1: predict a missing subtree


Any valid code snippet can be parsed into an Abstract Syntax Tree (AST).

The AST is composed of nodes and user-defined values in its leaves.

Abstract Syntax Tree

[Figure: the AST of stats[i].getPath(). The root is a MethodCall node with two children: an ArrayAccess (over the Names stats and i) and a Name (getPath, shown as the subtokens get and path).]
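As a rough illustration of the tree described above, the following Java sketch builds the AST of stats[i].getPath() by hand and lists its nodes in depth-first order. The Node class and the method names are hypothetical, chosen for this example only; they are not the API of any parser or of the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class AstSketch {
    /** Hypothetical AST node: a label (node type or leaf value) plus ordered children. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    /** Builds the tree for stats[i].getPath() as drawn on the slide. */
    static Node example() {
        Node array = new Node("ArrayAccess",
                new Node("Name", new Node("stats")),
                new Node("Name", new Node("i")));
        Node method = new Node("Name", new Node("get"), new Node("path"));
        return new Node("MethodCall", array, method);
    }

    /** Pre-order depth-first traversal: parent first, then children left to right. */
    static List<String> dfs(Node root) {
        List<String> out = new ArrayList<>();
        visit(root, out);
        return out;
    }

    private static void visit(Node n, List<String> out) {
        out.add(n.label);
        for (Node c : n.children) {
            visit(c, out);
        }
    }

    public static void main(String[] args) {
        System.out.println(dfs(example()));
    }
}
```

The inner node labels (MethodCall, ArrayAccess, Name) come from the language's grammar, while the leaves (stats, i, get, path) are the user-defined values.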

Key Idea #2: a structural language model (SLM)

In a natural-language model:

$$\Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \prod_{t=1}^{n} \Pr(y_t \mid y_{<t})$$

But how can we compute the probability of a tree?


Key Idea #2: a structural language model (SLM)

Given a tree $\mathcal{A}$ (can be an arbitrary graph)

Induce an ordering over its nodes: $a_0, a_1, \ldots, a_n \in \mathcal{A}$ (in practice: DFS)

A structural language model (SLM) computes the probability of the tree $\mathcal{A}$:

$$\Pr(\mathcal{A}) = \prod_{t=0}^{n} \Pr(a_t \mid a_{<t})$$

But how can we represent the partial tree $a_{<t}$ when computing $\Pr(a_t \mid a_{<t})$?
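The factorization over an ordered list of nodes can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions: the conditional argument is a hypothetical stand-in for the learned model, and the constant probability used in main is a made-up toy value. The sum is taken in log space, which is a common way to avoid underflow when multiplying many probabilities.

```java
import java.util.List;
import java.util.function.BiFunction;

final class SlmProbability {
    /**
     * Sums log Pr(a_t | a_<t) over a node ordering a_0..a_n.
     * `conditional` is a hypothetical stand-in for the learned model: it takes
     * the prefix a_<t and the node a_t and returns a probability.
     */
    static double logProb(List<String> ordered,
                          BiFunction<List<String>, String, Double> conditional) {
        double total = 0.0;
        for (int t = 0; t < ordered.size(); t++) {
            List<String> prefix = ordered.subList(0, t); // a_<t
            total += Math.log(conditional.apply(prefix, ordered.get(t)));
        }
        return total;
    }

    public static void main(String[] args) {
        // DFS ordering of the subtree for stats[i]; 0.5 is a toy probability.
        List<String> nodes = List.of("ArrayAccess", "Name", "stats", "Name", "i");
        double lp = logProb(nodes, (prefix, node) -> 0.5);
        System.out.println(Math.exp(lp)); // product of the five conditionals
    }
}
```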

The fundamental tradeoff in code representation

[Figure: code representations arranged along two axes, learning effort (model size, data, time…) and analysis effort.]

• Surface text (token stream): implicitly re-learn syntactic & semantic regularities
• AST paths: sweet spot [“code2vec”, POPL’2019] [“A General Path-based Representation …”, PLDI’2018]
• Data flow analysis
• Control flow analysis
• Handcrafted features: requires expertise, language-specific, task-specific model
• ...

Key Idea #3: a partial tree as AST paths

We compute the probability of a node, $\Pr(a_t \mid a_{<t})$, by considering the paths in the Abstract Syntax Tree (AST) from all leaves into $a_t$.

[Figure: a partial AST containing MethodRoot and IfExpr, with the node to predict, $a_t$, marked "?".]
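One way to picture "the paths from all leaves into the node" is the Java sketch below. It is an illustration under assumptions, not the authors' code: the Node class, the Param/Name/x branch of the toy tree, and the "?" placeholder for the node being predicted are all hypothetical. Each path climbs from a leaf to the lowest common ancestor and then descends to the target.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AstPathsSketch {
    /** Hypothetical tree node with a parent pointer. */
    static final class Node {
        final String label;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node parent) {
            this.label = label;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    /** Nodes from n up to the root, starting with n itself. */
    static List<Node> ancestors(Node n) {
        List<Node> up = new ArrayList<>();
        for (Node cur = n; cur != null; cur = cur.parent) up.add(cur);
        return up;
    }

    /** Path from `from` up to the lowest common ancestor, then down to `to`.
     *  Assumes both nodes belong to the same tree. */
    static List<String> path(Node from, Node to) {
        List<Node> a = ancestors(from);
        List<Node> b = ancestors(to);
        Node lca = null;
        // Strip shared ancestors from the root end; the last one stripped is the LCA.
        while (!a.isEmpty() && !b.isEmpty()
                && a.get(a.size() - 1) == b.get(b.size() - 1)) {
            lca = a.remove(a.size() - 1);
            b.remove(b.size() - 1);
        }
        List<String> out = new ArrayList<>();
        for (Node n : a) out.add(n.label);
        out.add(lca.label);
        Collections.reverse(b);
        for (Node n : b) out.add(n.label);
        return out;
    }

    /** Paths from every leaf other than the target into the target. */
    static List<List<String>> leafPaths(Node root, Node target) {
        List<List<String>> out = new ArrayList<>();
        collect(root, target, out);
        return out;
    }

    private static void collect(Node n, Node target, List<List<String>> out) {
        if (n != target && n.children.isEmpty()) out.add(path(n, target));
        for (Node c : n.children) collect(c, target, out);
    }

    /** Toy partial tree: MethodRoot -> (Param -> Name -> x), (IfExpr -> ?). */
    static List<List<String>> examplePaths() {
        Node root = new Node("MethodRoot", null);
        Node param = new Node("Param", root);
        Node name = new Node("Name", param);
        new Node("x", name);
        Node ifExpr = new Node("IfExpr", root);
        Node target = new Node("?", ifExpr);
        return leafPaths(root, target);
    }

    public static void main(String[] args) {
        System.out.println(examplePaths());
    }
}
```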

AST Paths

AST paths are simple paths over nodes in the AST.

In previous work, we used AST paths to read code [“code2seq”, ICLR’2019].

In this work (SLM), we generate code by predicting the next node in a set of AST paths.

AST Paths capture long-range interactions

public static Path[] stat2Paths(FileStatus[] stats) {
    if (stats == null) return null;
    Path[] ret = new Path[stats.length];
    for (int i = 0; i < stats.length; ++i) {
        ret[i] = stats[i].getPath();
    }
    return ret;
}

Model

• Any sequential encoder to encode each arbitrary-length path into a fixed-length vector separately (e.g., LSTM, transformer encoder)
• Any contextualizer to let all paths interact (e.g., transformer encoder)
• Attend to the contextualized paths using the root path as the query

Model

Encode paths → Contextualize → Attend → Predict node

[Figure: the paths of the partial AST (MethodRoot, IfExpr, ?) are encoded and contextualized; the root path serves as the query and the other paths as the context; the predicted node is Greater.]
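The "attend" step can be sketched with plain dot-product attention. This is an assumption made for illustration: the slides only state that the root path is used as the query over the contextualized paths, so the scoring function, the vector sizes, and all names below are hypothetical rather than the model's actual layers.

```java
final class PathAttentionSketch {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Numerically stable softmax: subtract the max score before exponentiating. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    /** Attention weights of the root-path query over the path vectors. */
    static double[] weights(double[] rootQuery, double[][] paths) {
        double[] scores = new double[paths.length];
        for (int i = 0; i < paths.length; i++) scores[i] = dot(rootQuery, paths[i]);
        return softmax(scores);
    }

    /** Weighted sum of the path vectors; a later layer would predict the node from it. */
    static double[] attend(double[] rootQuery, double[][] paths) {
        double[] w = weights(rootQuery, paths);
        double[] out = new double[paths[0].length];
        for (int i = 0; i < paths.length; i++) {
            for (int d = 0; d < out.length; d++) out[d] += w[i] * paths[i][d];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};                  // toy root-path vector
        double[][] paths = {{1.0, 2.0}, {3.0, 0.5}};  // toy contextualized paths
        System.out.println(java.util.Arrays.toString(weights(query, paths)));
    }
}
```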

Generate the Tree of: x > 1

[Figure sequence: the subtree is generated one node per step beneath the existing MethodRoot and IfExpr nodes.]

1. Start: the next node under IfExpr is unknown ("?").
2. Predict Greater.
3. Predict Name, the first child of Greater.
4. Predict x, the value of Name.
5. Predict IntExp, the second child of Greater.
6. Predict 1, the value of IntExp, which completes x > 1.
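Written out for this example, the SLM factorization becomes a product of five conditionals, one per predicted node. The symbol $C$ is an assumption introduced here to denote the nodes that already exist (MethodRoot, IfExpr, and the rest of the method); only the five prediction steps shown on the slides are included.

```latex
\begin{align*}
\Pr(\texttt{x > 1} \mid C)
  = {} & \Pr(\text{Greater} \mid C) \\
  \cdot {} & \Pr(\text{Name} \mid C, \text{Greater}) \\
  \cdot {} & \Pr(\texttt{x} \mid C, \text{Greater}, \text{Name}) \\
  \cdot {} & \Pr(\text{IntExp} \mid C, \text{Greater}, \text{Name}, \texttt{x}) \\
  \cdot {} & \Pr(\texttt{1} \mid C, \text{Greater}, \text{Name}, \texttt{x}, \text{IntExp})
\end{align*}
```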

Copy Mechanism

myNewFoo = myObj.getFoo();
myNewFoo.setFooId(id);

[Figure legend: Vocabulary · full token copy · subtoken copy]
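To make the three sources in the legend concrete, the Java sketch below labels where each part of a target identifier could come from, given the identifiers already in the context. It is a deterministic illustration under assumptions: the actual copy mechanism is learned, and the camelCase splitting rule, the class name, and the labels are hypothetical choices for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CopySketch {
    /** Splits a camelCase identifier into lower-cased subtokens. */
    static List<String> subtokens(String identifier) {
        List<String> out = new ArrayList<>();
        for (String s : identifier.split("(?<=[a-z0-9])(?=[A-Z])")) {
            out.add(s.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /**
     * Labels the possible source of a target token: a full token copy if the
     * whole token appears in the context, otherwise one label per subtoken.
     */
    static List<String> sources(String target, List<String> context) {
        if (context.contains(target)) {
            return List.of("full token copy");
        }
        Set<String> seen = new HashSet<>();
        for (String c : context) seen.addAll(subtokens(c));
        List<String> out = new ArrayList<>();
        for (String s : subtokens(target)) {
            out.add(seen.contains(s) ? "subtoken copy" : "vocabulary");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> context = List.of("myNewFoo", "myObj", "getFoo", "id");
        System.out.println(sources("myNewFoo", context));
        System.out.println(sources("setFooId", context));
    }
}
```

In this toy run, myNewFoo can be copied whole, while setFooId mixes a vocabulary subtoken (set) with subtokens available in the context (foo, id).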

Example - Java

public static Path[] stat2Paths(FileStatus[] stats) {
    if (stats == null) return null;
    Path[] ret = new Path[stats.length];
    for (int i = 0; i < stats.length; ++i) {
        ret[i] = stats[i].getPath();
    }
    return ret;
}

Generated: (Java)
    stats[i].getPath() (25.2%)
    new Path(stats[i]) (3.3%)
    new Path(stats[i], charset) (2.5%)

Example - C#

public static string Camelize(this string input) {
    var word = input.Pascalize();
    return word.Length > 0
        ? word.Substring(0, 1).ToLower() + word.Substring(1)
        : word;
}

Generated: (C#)
    word.Substring(0, 1) (14.1%)
    word.trim() (8.2%)
    word.Substring(1) (5.8%)

Java Results (trained on 1.3M examples)

Model                     acc@1   acc@5   tree@1   tree@5
seq2prod                    8.1    11.8     30.8     41.7
seq2tree                   16.8    23.0     38.1     52.4
LSTMs+attn+copy            16.9    23.2     34.3     49.7
Transformer-small+copy     14.2    21.4     31.8     47.4
Transformer-base+copy      16.6    24.1     34.7     50.5
SLM (this work)            18.0    24.8     39.1     55.3

tree@k counts a prediction as correct when it matches the reference tree up to leaf values: a.b > 1 tree-matches c.d > 2, since both have the form NAME.NAME > INT.

Annotations on the chart: 15M, 45M, 12M, 45M.
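The difference between exact match and tree match can be approximated at the text level, as in the Java sketch below. This is a simplification for illustration: the real metric compares ASTs, whereas this sketch only rewrites identifiers to NAME and integer literals to INT with regular expressions (so keywords would also become NAME). The class and method names are hypothetical.

```java
final class TreeMatchSketch {
    /** Replaces identifiers with NAME, then integer literals with INT. */
    static String normalize(String expr) {
        return expr
                .replaceAll("\\b[A-Za-z_][A-Za-z0-9_]*\\b", "NAME")
                .replaceAll("\\b\\d+\\b", "INT");
    }

    /** Exact match: the prediction equals the reference, leaf values included. */
    static boolean exactMatch(String predicted, String reference) {
        return predicted.equals(reference);
    }

    /** Approximate tree match: equal after leaf values are abstracted away. */
    static boolean treeMatch(String predicted, String reference) {
        return normalize(predicted).equals(normalize(reference));
    }

    public static void main(String[] args) {
        System.out.println(normalize("a.b > 1"));
        System.out.println(exactMatch("a.b > 1", "c.d > 2"));
        System.out.println(treeMatch("a.b > 1", "c.d > 2"));
    }
}
```

Every exact match is also a tree match, which is why tree@k can never be lower than acc@k.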

C# Results

[Bar chart: acc@1 and acc@5 for SLM (this work) against the baselines seq2seq +copy, seq2tree +copy, GNN→NAG, and PHOG.]

SLM (this work): acc@1 37.6, acc@5 45.5

Baseline bars, as acc@1 / acc@5 pairs in slide order: 22.3 / 35.9, 26.4 / 37.9, 15.2 / 27.1, 13.0 / 18.5, 7.4 / 12.0

Error Analysis

SLM (this work): acc@1 18.0, acc@5 24.8, tree@1 39.1, tree@5 55.3

What kind of mistakes are responsible for the gap between acc@k and tree@k?

• 74%: Single-token mismatch
• 30%: Single-subtoken mismatch

[Chart: the 74% single-token share splits into 44% single token and 30% single subtoken.]

public float getProgress() {
    this.readLock.lock();
    try {
        if (this.currentAttempt != null) {
            return this.currentAttempt.getProgress();
        }
        return 0;
    } finally {
        this.readLock.unlock();
    }
}

Generated                                   Exact-match   Tree-match   Compiles
this.currentAttempt.getCount() (31.3%)           ✘            ✔           ✘
-1f (30.6%)                                      ✘            ✘           ✔
this.currentAttempt.get() (1.5%)                 ✘            ✔           ✘
this.currentAttempt.getTime() (1.2%)             ✘            ✔           ✘
this.currentAttempt.getProgress() (0.9%)         ✔            ✔           ✔


http://AnyCodeGen.org

Structural Language Models of Code

Key points:

1. Predicting a missing subtree in a tree

2. A structural language model over trees:

$$\Pr(\mathcal{A}) = \prod_{t=0}^{n} \Pr(a_t \mid a_{<t})$$

3. A partial AST as a set of paths

[Figure: the completed tree for x > 1: MethodRoot, IfExpr, Greater, with children Name (x) and IntExp (1).]

http://AnyCodeGen.org