Page 1
Efficiently Mining Source Code with Boa
Robert Dyer
The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.
Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen
Page 2
What do I mean by software repository?
Page 4
What features do they have?
Page 5
What do I mean by mining software repositories (MSR)?
Page 7
What are some examples of software repository mining?
Page 8
What is the most used programming language?
Page 9
How many words are in commit messages?
Words[] = update, 30715
Words[] = cleanup, 19073
Words[] = updated, 18737
Words[] = refactoring, 11981
Words[] = fix, 11705
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
Page 10
How has unit testing evolved over time?
JUnit 4 release
Page 11
What makes this ultra-large-scale mining?
Page 12
Previous examples queried...
Projects 699,331
Code Repositories 494,158
Revisions 15,063,073
Unique Files 69,863,970
File Snapshots 147,074,540
AST Nodes 18,651,043,23
Over 250GB of pre-processed data
Page 13
What does bringing BIGDATA to the masses mean?
Page 14
How has unit testing evolved over time?
How can we solve this task?
Page 15
[Flowchart: for each project, mine project metadata. Has repository? If yes, access repository and mine revisions. Find all source files and mine sources. Find all methods. Method has @Test? If yes, add to Results.]
Page 16
Challenge: Volume
Page 17
Challenge: Volume
Projects 699,331
Code Repositories 494,158
Revisions 15,063,073
Unique Files 69,863,970
File Snapshots 147,074,540
AST Nodes 18,651,043,23
How do you:
Find such a large dataset?
Access this data?
Store the data?
Transform the data for analysis?
Efficiently analyze the data?
Page 18
Challenge: Velocity
Page 19
Challenge: Velocity
Page 20
Challenge: Velocity
Page 21
Challenge: Variety
Page 22
Challenge: Variety
Page 23
Ultra-large-scale Software Repository Mining
The Boa Experience
[ICSE'14] [ICSE'13] [GPCE'13] [SPLASH'13 SRC] [TOSEM (under review)]
Page 24
Boa's Architecture
Boa's Data Infrastructure: Replicate and Transform (cache); Stored on cluster
Boa's Computing Infrastructure: User submits query; Compiled into Hadoop program; Deployed and executed on cluster; Query result returned via web
Page 25
Challenge: Volume
Challenge: Velocity
Challenge: Variety
Page 26
A better solution...

Tests: output sum[timestamp] of int;

cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Automatically parallelized
Analyzes 18 billion AST nodes in minutes
Only 10 lines of code
No external libraries
Page 27
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
Page 28
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
});
Page 29
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
before n: Modifier ->
});
Page 30
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
before n: Modifier -> if (n.kind == ModifierKind.ANNOTATION &&
});
Page 31
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
});
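
The regular expression in this step decides which annotations count as JUnit tests. As a rough illustration only, the same pattern can be tried out in Python; the helper name `is_junit_test_annotation` is hypothetical, and Python's `re` module is assumed to treat this particular pattern the same way Boa's `match` does.

```python
import re

# Same pattern as in the Boa query: an optional "org.junit." prefix
# followed by exactly "Test", anchored at both ends.
JUNIT_TEST = re.compile(r"^(org\.junit\.)?Test$")


def is_junit_test_annotation(annotation_name: str) -> bool:
    """Return True when the name is either Test or org.junit.Test."""
    return JUNIT_TEST.search(annotation_name) is not None


if __name__ == "__main__":
    candidates = [
        "Test",
        "org.junit.Test",
        "org.testng.annotations.Test",
        "TestCase",
        "Before",
    ]
    for name in candidates:
        print(f"{name}: {is_junit_test_annotation(name)}")
```

Only the first two names match. Note that a bare `Test` also matches when the annotation comes from another framework and is written by its simple name, so the count is an approximation.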
Page 32
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Page 33
How has unit testing evolved over time?
Tests: output sum[timestamp] of int;
cur_time: timestamp;
visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Page 34
[Diagram: the same Boa program runs once per project in the Dataset (input = project1, input = project2, input = project3, ..., input = projectn). Every run sends values to the shared output variable Tests.]

Values sent by the running programs:
    Tests[631152000] << 1;
    Tests[631154020] << 1;

Emitted (timestamp, value) pairs:
    631152000, 1
    631154020, 1
    631152000, 1
    631154020, 1
    631154020, 1
    631161103, 1

Output:
    Tests[631152000] = 5
    Tests[631154020] = 12
    Tests[631161103] = 14
    Tests[631172392] = 18
    . . .
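
To make the aggregation step concrete, here is a minimal Python sketch of what a sum output variable does with emitted pairs. It is not Boa's implementation; the function name `sum_aggregate` is hypothetical, and the input is just the six example pairs from this slide.

```python
from collections import defaultdict


def sum_aggregate(emitted):
    """Sum emitted values per index, mimicking `output sum[timestamp] of int`."""
    totals = defaultdict(int)
    for timestamp, value in emitted:
        totals[timestamp] += value
    return dict(totals)


# The six (timestamp, value) pairs from the slide.
EMITTED = [
    (631152000, 1),
    (631154020, 1),
    (631152000, 1),
    (631154020, 1),
    (631154020, 1),
    (631161103, 1),
]

if __name__ == "__main__":
    for timestamp, total in sorted(sum_aggregate(EMITTED).items()):
        print(f"Tests[{timestamp}] = {total}")
```

These six pairs alone sum to 2, 3, and 1. The larger totals in the slide's Output come from aggregating over the whole dataset.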
Page 35
Automatic Parallelization
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
Compiler generates Hadoop MapReduce code
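
As a loose illustration of how different aggregators could reduce the same stream of emitted values, the following Python sketch groups values by index and applies a chosen reducer. It models only sum, mean, and set. The names `AGGREGATORS`, `group_by_index`, and `aggregate` are hypothetical, and nothing here reflects the Hadoop code that the Boa compiler actually generates.

```python
from collections import defaultdict


def group_by_index(emitted):
    """Collect all values emitted under the same index (the shuffle step)."""
    groups = defaultdict(list)
    for index, value in emitted:
        groups[index].append(value)
    return groups


# Simplified stand-ins for three of the built-in aggregators.
AGGREGATORS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "set": lambda values: sorted(set(values)),
}


def aggregate(kind, emitted):
    """Reduce each group of values with the aggregator named by `kind`."""
    reducer = AGGREGATORS[kind]
    groups = group_by_index(emitted)
    return {index: reducer(values) for index, values in groups.items()}


if __name__ == "__main__":
    emitted = [("java", 3), ("java", 5), ("c", 4), ("c", 4)]
    for kind in AGGREGATORS:
        print(kind, aggregate(kind, emitted))
```

The point of the design is that the query only declares which aggregator it wants, and the grouping and reduction happen outside the query code.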
Page 36
Abstracting MSR with Types
Custom domain-specific types for mining software repositories
5 base types and 9 types for source code
No need to understand multiple data formats or APIs
Page 37
Abstracting MSR with Types
[Class diagram]
Project (1) has CodeRepository (1..*)
CodeRepository (1) has Revision (*)
Revision (1) has ChangedFile (*)
ChangedFile (1) has ASTRoot (0..1)
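
The containment chain in this diagram can be sketched as plain data classes. This is only a model of the relationships, not Boa's type definitions: the field names (`code_repositories`, `revisions`, `files`, `ast`, and so on) and the helper `count_files_with_ast` are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ASTRoot:
    # Placeholder for the parsed contents of one source file.
    namespaces: List[str] = field(default_factory=list)


@dataclass
class ChangedFile:
    name: str
    ast: Optional[ASTRoot] = None  # 0..1 in the diagram: a file may have no AST


@dataclass
class Revision:
    commit_date: int
    files: List[ChangedFile] = field(default_factory=list)  # * per revision


@dataclass
class CodeRepository:
    url: str
    revisions: List[Revision] = field(default_factory=list)  # * per repository


@dataclass
class Project:
    name: str
    # 1..* in the diagram; the lower bound is not enforced in this sketch.
    code_repositories: List[CodeRepository] = field(default_factory=list)


def count_files_with_ast(project: Project) -> int:
    """Walk Project -> CodeRepository -> Revision -> ChangedFile."""
    return sum(
        1
        for repo in project.code_repositories
        for revision in repo.revisions
        for changed in revision.files
        if changed.ast is not None
    )
```

Walking the chain this way is what a query would otherwise have to do by hand for every data source, which is the motivation for a single set of types.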
Page 38
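Abstracting MSR with Types
[Class diagram]
ASTRoot (1) has Namespace (*)
Namespace (1) has Declaration (1..*)
Declaration (1) has Method (*), Variable (*), Type (*)
Statement and Expression appear below these types (multiplicities 1 and *)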
Page 39
Challenge: How can we make mining source code easier?
Answer: Declarative Visitors
Page 40
Background: Visitor Pattern
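Without visitors, each shape implements every operation:
    Rectangle: draw(Graphics g), scale(int x, int y)
    Circle: draw(Graphics g), scale(int x, int y)
    Triangle: draw(Graphics g), scale(int x, int y)

With the visitor pattern, each shape only accepts a visitor:
    Rectangle: accept(Visitor v)
    Circle: accept(Visitor v)
    Triangle: accept(Visitor v)

Each operation becomes its own visitor:
    DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
    ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)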
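
For readers who have not seen the pattern in code, here is a small Python sketch of the same shapes and visitors. Several simplifications are assumptions of the sketch: Python has no method overloading, so the visitor methods are named `visit_rectangle`, `visit_circle`, and `visit_triangle`; every shape is reduced to a bounding box; and `DrawVisitor` returns a text description instead of drawing on a `Graphics` object.

```python
from abc import ABC, abstractmethod


class Visitor(ABC):
    @abstractmethod
    def visit_rectangle(self, r): ...

    @abstractmethod
    def visit_circle(self, c): ...

    @abstractmethod
    def visit_triangle(self, t): ...


class Shape(ABC):
    """A shape described only by its bounding box (a simplification)."""

    def __init__(self, width: int, height: int) -> None:
        self.width = width
        self.height = height

    @abstractmethod
    def accept(self, visitor: Visitor):
        """Dispatch to the visitor method that matches this shape."""


class Rectangle(Shape):
    def accept(self, visitor):
        return visitor.visit_rectangle(self)


class Circle(Shape):
    def accept(self, visitor):
        return visitor.visit_circle(self)


class Triangle(Shape):
    def accept(self, visitor):
        return visitor.visit_triangle(self)


class DrawVisitor(Visitor):
    """Describes each shape as text instead of drawing it."""

    def visit_rectangle(self, r):
        return f"Rectangle {r.width}x{r.height}"

    def visit_circle(self, c):
        return f"Circle {c.width}x{c.height}"

    def visit_triangle(self, t):
        return f"Triangle {t.width}x{t.height}"


class ScaleVisitor(Visitor):
    """Scales each shape in place by the factors x and y."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def _scale(self, shape):
        shape.width *= self.x
        shape.height *= self.y

    def visit_rectangle(self, r):
        self._scale(r)

    def visit_circle(self, c):
        self._scale(c)

    def visit_triangle(self, t):
        self._scale(t)


if __name__ == "__main__":
    shapes = [Rectangle(2, 3), Circle(4, 4), Triangle(3, 6)]
    for shape in shapes:
        shape.accept(ScaleVisitor(2, 3))
    print([shape.accept(DrawVisitor()) for shape in shapes])
```

Adding a new operation now means adding one visitor class, with no change to the shape classes.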
Page 41
Easing Source Code Mining with Visitors
id := visitor {
    before T -> statement;
    after T -> statement;
};
visit(node, id);
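
The before and after clauses can be modeled in Python as two tables of callbacks keyed by node type. This is a sketch under the assumption that the traversal is depth-first, with a node's before clause run ahead of its children and its after clause run once they are done. The `Node` class and this `visit` function are hypothetical stand-ins, not Boa's runtime.

```python
class Node:
    """A minimal tree node tagged with a type name."""

    def __init__(self, kind, name, children=()):
        self.kind = kind
        self.name = name
        self.children = list(children)


def visit(node, before=None, after=None):
    """Depth-first traversal: before clause, then children, then after clause."""
    before = before or {}
    after = after or {}
    if node.kind in before:
        before[node.kind](node)
    for child in node.children:
        visit(child, before, after)
    if node.kind in after:
        after[node.kind](node)


if __name__ == "__main__":
    tree = Node("Declaration", "A", [Node("Method", "m1"), Node("Method", "m2")])
    log = []
    visit(
        tree,
        before={
            "Declaration": lambda n: log.append(f"enter {n.name}"),
            "Method": lambda n: log.append(f"method {n.name}"),
        },
        after={"Declaration": lambda n: log.append(f"leave {n.name}")},
    )
    print(log)
```

Node types with no clause are still traversed, so a query only mentions the types it cares about.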
Page 42
Easing Source Code Mining with Visitors
id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};
Page 43
Easing Source Code Mining with Visitors
[Diagram, shown twice: the AST type hierarchy with ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
Page 44
Easing Source Code Mining with Visitors
[Diagram: the AST type hierarchy, with ASTRoot, Namespace, Declaration, and Variable grouped separately from Method, Type, Statement, and Expression]

before n: Declaration -> {
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}
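
The role of `stop` is easier to see in a small simulation. In the Python sketch below, a before handler that returns the sentinel `STOP` suppresses the default traversal of that node's children, which is how this sketch assumes `stop` behaves. `Node`, `visit`, `STOP`, and `collect_variables` are hypothetical names for the illustration.

```python
STOP = object()  # sentinel a handler returns to skip the default traversal


class Node:
    """A tree node whose children are split into fields and methods."""

    def __init__(self, kind, name, fields=(), methods=()):
        self.kind = kind
        self.name = name
        self.fields = list(fields)
        self.methods = list(methods)

    def children(self):
        return self.fields + self.methods


def visit(node, before):
    """Run the before handler, then visit children unless it returned STOP."""
    handler = before.get(node.kind)
    if handler is not None and handler(node) is STOP:
        return
    for child in node.children():
        visit(child, before)


def collect_variables(root, use_stop):
    """Record each Variable reached; the Declaration handler visits fields itself."""
    seen = []

    def on_declaration(node):
        for variable in node.fields:
            visit(variable, before)
        return STOP if use_stop else None

    def on_variable(node):
        seen.append(node.name)

    before = {"Declaration": on_declaration, "Variable": on_variable}
    visit(root, before)
    return seen


if __name__ == "__main__":
    method = Node("Method", "m", fields=[Node("Variable", "local")])
    decl = Node(
        "Declaration",
        "C",
        fields=[Node("Variable", "x"), Node("Variable", "y")],
        methods=[method],
    )
    print(collect_variables(decl, use_stop=True))
    print(collect_variables(decl, use_stop=False))
```

With `stop`, only the fields `x` and `y` are recorded. Without it, the default traversal runs after the manual one, so the fields are recorded twice and the method's local variable is picked up as well.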
Page 45
Let's see it in action!
http://boa.cs.iastate.edu/boa/
Page 46
Summary
Ultra-large-scale software repository mining poses several challenges
Boa provides abstractions to address these challenges
    Ultra-large-scale dataset with almost 700k projects
    Domain-specific language, types, and functions to make mining software repositories easier
    Automatically parallelizes queries
Page 47
Boa's Global Impact
90+ users from over 20 countries!
Page 48
Thank you!
http://boa.cs.iastate.edu/