Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Programming with “Big Code”: Lessons, Techniques, Applications

Pavol Bielik, Veselin Raychev, Martin Vechev
Department of Computer Science
ETH Zurich

Work @ ETH Zurich

Work on “Big Code” started a few years ago

Code Completion with Statistical Language Models, PLDI 2014
Machine Translation for Programming Languages, Onward 2014
Predicting Program Properties from “Big Code”, POPL 2015
Fast and Precise Statistical Code Completion, ETH TR
Statistical Feedback Generation for Programs, ETH TR
Programming with Big Code: Lessons, Techniques and Applications, SNAPL 2015

Prof. Martin Vechev
Prof. Andreas Krause
Veselin Raychev
Pavol Bielik
Svetoslav Karaivanov
Christine Zeller
Pascal Roos

Applications

[PLDI 14] SLANG: Code Completion

Intent i = new Intent();
?
ctx.sendBroadcast(i);
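One way to read the `?` hole above is as a ranking problem: score each candidate API call by how well it fits between the calls before and after it. The following is a minimal sketch under stated assumptions, not SLANG's actual model: it uses a bigram model with crude add-one smoothing, and the training sequences, the candidate list, and the function names `train_bigrams` and `complete` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is a sequence of API calls
# observed on one object. This is made-up data, not a real training set.
corpus = [
    ["Intent.<init>", "Intent.setAction", "Context.sendBroadcast"],
    ["Intent.<init>", "Intent.setAction", "Context.startActivity"],
    ["Intent.<init>", "Intent.setAction", "Context.sendBroadcast"],
    ["Intent.<init>", "Intent.putExtra", "Context.startActivity"],
]

def train_bigrams(sentences):
    """Count how often each call directly follows another."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return counts

def complete(counts, before, after, candidates):
    """Rank candidates for a hole between `before` and `after` by
    P(candidate | before) * P(after | candidate), add-one smoothed."""
    vocab = len(candidates) + 1  # crude smoothing constant (assumption)

    def prob(prev, nxt):
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + 1) / (total + vocab)

    scored = [(prob(before, c) * prob(c, after), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

counts = train_bigrams(corpus)
ranking = complete(counts, "Intent.<init>", "Context.sendBroadcast",
                   ["Intent.setAction", "Intent.putExtra"])
print(ranking[0])  # Intent.setAction
```

With this toy data, `Intent.setAction` ranks first because it both follows the constructor often and is frequently followed by `sendBroadcast`.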

All of these benefit from “Big Code” and lead to applications not possible with previous techniques

[Onward 14] Programming Language Translation

$P(\text{Java} \mid \text{C\#}) \propto P(\text{C\#} \mid \text{Java}) \, P(\text{Java})$
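The three probabilities on this slide fit the usual noisy-channel reading of statistical translation: pick the Java candidate that maximizes P(C# | Java) · P(Java). A minimal sketch of that selection step follows, assuming a fixed candidate list; the probability values and the name `best_translation` are made up for illustration and are not taken from the authors' system.

```python
import math

# Hypothetical scores for translating the C# fragment "Console.WriteLine(s);".
# translation_model[j] stands in for P(C# | Java = j),
# language_model[j] stands in for P(Java = j). All values are made up.
translation_model = {
    "System.out.println(s);": 0.6,
    "System.out.print(s);": 0.3,
    "Console.WriteLine(s);": 0.9,  # copied verbatim: faithful, but not Java
}
language_model = {
    "System.out.println(s);": 0.05,
    "System.out.print(s);": 0.02,
    "Console.WriteLine(s);": 0.0001,
}

def best_translation(candidates):
    """Return argmax over j of P(C# | j) * P(j), computed in log space."""
    def score(j):
        return math.log(translation_model[j]) + math.log(language_model[j])
    return max(candidates, key=score)

best = best_translation(list(translation_model))
print(best)  # System.out.println(s);
```

The language model term is what rules out the verbatim copy: it matches the source well but is very unlikely as Java.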

[submitted] Statistical Feedback Generation

...
for x in range(a):
    print a[x]

likely error

[POPL 15] JSNice: Deobfuscation, Type Prediction


Probabilistic Programming Systems: Dimensions

Applications

Intermediate Representation

Analyze Program (PL)

Train Model (ML)

Query Model (ML)

What is a generic metric for code?

✔ Cross Entropy → ✗ Code Completion
✔ BLEU Score → ✗ Program Translation

Traditional metrics might not be indicative of client performance
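To see how a generic metric can disagree with client performance, compare cross entropy with top-1 completion accuracy on the same predictions. The sketch below uses two hypothetical models with made-up probability distributions; the function names are illustrative only.

```python
import math

def cross_entropy(predictions, truth):
    """Average negative log2-probability assigned to the true tokens."""
    return -sum(math.log2(p[t]) for p, t in zip(predictions, truth)) / len(truth)

def top1_accuracy(predictions, truth):
    """Fraction of positions where the most likely token is the true one."""
    hits = sum(max(p, key=p.get) == t for p, t in zip(predictions, truth))
    return hits / len(truth)

truth = ["open", "send", "close"]

# Hypothetical model A: never far off, but its top choice is often wrong.
model_a = [
    {"open": 0.45, "send": 0.55},
    {"send": 0.45, "open": 0.55},
    {"close": 0.90, "open": 0.10},
]
# Hypothetical model B: usually right, but badly wrong once.
model_b = [
    {"open": 0.60, "send": 0.40},
    {"send": 0.60, "close": 0.40},
    {"close": 0.05, "open": 0.95},
]

for name, model in [("A", model_a), ("B", model_b)]:
    print(name, round(cross_entropy(model, truth), 3),
          round(top1_accuracy(model, truth), 3))
```

Model A has the lower (better) cross entropy, yet model B completes more positions correctly, so ranking the two by cross entropy alone would pick the worse completion engine.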

What is the best program representation?

Sequences

req → {<open, 0>, <send, 0>}
source → {..., <open, 2>}

Graphical Models

Feature Vectors

req → (0,0,1,1,0)
source → (1,0,0,0,0)
...


Choosing the right representation is crucial

Feedback Generation: Sequence representations

Allamanis et al. [2013]             46.4%
Hsiao et al. [2014]                 50.8%
Incorporate semantic information    75.3%
Incorporate dataflow analysis       86.3%


How to extract program representation?

SLANG (APIs): alias and typestate analysis
JSNice (Variable Names): scope and alias analysis
Feedback Generation: alias, control-flow and typestate analysis

req.open("GET", source, false);

req → {<open, 0>, <send, 0>}
source → {..., <open, 2>}
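The mapping above can be read as a per-variable history of `<method, position>` events, where position 0 is the receiver and position k is the k-th argument. A minimal sketch of that extraction follows, assuming the calls have already been parsed into tuples; the tuple format, the `variables` set, the function name `extract_events`, and the trailing `req.send()` call are assumptions for illustration.

```python
from collections import defaultdict

def extract_events(calls, variables):
    """Map each variable to the <method, position> events it takes part in.

    Position 0 is the receiver; position k is the k-th argument.
    Arguments that are not known variables (literals) are skipped.
    """
    histories = defaultdict(list)
    for receiver, method, args in calls:
        histories[receiver].append((method, 0))
        for position, arg in enumerate(args, start=1):
            if arg in variables:
                histories[arg].append((method, position))
    return dict(histories)

# Pre-parsed form of:  req.open("GET", source, false);  req.send();
# The second call is assumed here to match the <send, 0> event shown above.
calls = [
    ("req", "open", ['"GET"', "source", "false"]),
    ("req", "send", []),
]
histories = extract_events(calls, variables={"req", "source"})
print(histories)
```

Here `source` is the second argument of `open`, which yields the `<open, 2>` event. A real extractor would work on an AST and would need the program analyses listed above to decide which expressions refer to the same object.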


Design scalable yet precise enough algorithms

[Figure: precision vs. % of data used (1%, 10%, 100%), comparing no alias analysis with alias analysis]
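One reason alias analysis can matter is that, without it, the events of a single object are split across several variable names. The sketch below merges per-variable histories for variables known to alias, using a small union-find. It assumes the alias pairs are already given by some analysis; the function name and the data are hypothetical, and this is not the authors' algorithm.

```python
def merge_aliases(histories, alias_pairs):
    """Merge event histories of variables that are known to alias.

    Simplification: merged events follow dictionary order, not program order.
    """
    parent = {v: v for v in histories}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in alias_pairs:
        parent[find(b)] = find(a)  # make a's representative the root

    merged = {}
    for var, events in histories.items():
        merged.setdefault(find(var), []).extend(events)
    return merged

# Hypothetical program:  req.open(...);  r2 = req;  r2.send();
without_aliases = {"req": [("open", 0)], "r2": [("send", 0)]}
with_aliases = merge_aliases(without_aliases, alias_pairs=[("req", "r2")])
print(with_aliases)
```

Without the merge, the model sees two short, unrelated histories; with it, the model sees one history containing both `open` and `send`.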


What is the suitable probabilistic model?

N-gram language model
Probabilistic context-free grammars
Neural networks
(Structured) Support vector machine
Conditional Random Fields


Baseline 25.3%

Independent 54.1%

Structured 63.4%

Structured prediction is critical
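The gap between independent and structured prediction can be illustrated on a tiny name-prediction problem. In this sketch, independent prediction picks the best label per variable, while structured prediction maximizes a joint score that also includes a pairwise term. All scores, labels, and function names are made up for illustration.

```python
from itertools import product

# Hypothetical per-variable scores for candidate names of variables "a" and "b".
unary = {
    "a": {"i": 1.0, "len": 0.9},
    "b": {"i": 1.0, "j": 0.8},
}

def pairwise(name_a, name_b):
    """Hypothetical joint score: penalize duplicate names, reward (i, j)."""
    if name_a == name_b:
        return -5.0
    if (name_a, name_b) == ("i", "j"):
        return 1.0
    return 0.0

def predict_independent():
    """Pick the best name for each variable on its own."""
    return {v: max(scores, key=scores.get) for v, scores in unary.items()}

def predict_structured():
    """Pick the jointly best assignment by exhaustive enumeration."""
    variables = list(unary)

    def joint_score(names):
        return sum(unary[v][n] for v, n in zip(variables, names)) + pairwise(*names)

    best = max(product(*(unary[v] for v in variables)), key=joint_score)
    return dict(zip(variables, best))

print(predict_independent())  # {'a': 'i', 'b': 'i'}
print(predict_structured())   # {'a': 'i', 'b': 'j'}
```

The independent predictor gives both variables the same name because it cannot see the interaction. Exhaustive enumeration is only feasible for toy problems like this one.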

Programming with “Big Code”

Applications: Code completion, Deobfuscation, Program synthesis, Feedback generation, Translation
Intermediate Representation: Sequences (sentences), Translation Table, Feature Vectors, Graphical Models
Analyze Program (PL): alias analysis, typestate analysis, control-flow analysis, scope analysis
Train Model (ML): N-gram language model, SVM, Structured SVM, Neural Networks
Query Model: $\arg\max_{y \in \Omega} P(y \mid x)$


Greedy MAP Inference
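Exact maximization of P(y | x) over all joint assignments is usually too expensive, so a greedy search is a natural approximation. The sketch below is one simple greedy scheme (coordinate-wise improvement, in the style of iterated conditional modes) over unary and pairwise scores. It is an illustration under assumptions, not the inference procedure of the authors' engine; all scores and names are hypothetical.

```python
def total_score(assignment, unary, pairwise, edges):
    """Sum of unary scores plus pairwise scores over the given edges."""
    score = sum(unary[v][label] for v, label in assignment.items())
    score += sum(pairwise.get((assignment[u], assignment[v]), 0.0)
                 for u, v in edges)
    return score

def greedy_map(unary, pairwise, edges, max_passes=10):
    """Start from the best unary label per node, then repeatedly change one
    node's label whenever that strictly improves the total score."""
    assignment = {v: max(scores, key=scores.get) for v, scores in unary.items()}
    for _ in range(max_passes):
        improved = False
        for v in unary:
            current = total_score(assignment, unary, pairwise, edges)
            for label in unary[v]:
                candidate = {**assignment, v: label}
                candidate_score = total_score(candidate, unary, pairwise, edges)
                if candidate_score > current:
                    assignment, current, improved = candidate, candidate_score, True
        if not improved:
            break  # local optimum: no single change helps
    return assignment

# Hypothetical scores for naming two related variables.
unary = {"a": {"i": 1.0, "len": 0.2}, "b": {"i": 1.0, "j": 0.8}}
pairwise = {("i", "i"): -0.5, ("i", "j"): 1.0}
edges = [("a", "b")]

result = greedy_map(unary, pairwise, edges)
print(result)  # {'a': 'i', 'b': 'j'}
```

On this toy input the search reaches the best joint assignment, but in general a greedy procedure of this kind only guarantees a local optimum.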

More information and tutorials at:
http://www.nice2predict.org/
http://www.srl.inf.ethz.ch/spas.php

General framework

http://www.nice2predict.org/

We have open-sourced our prediction engine and we are extending it with new capabilities

Upcoming PLDI’15 tutorial

