
Jul 27, 2020

Page 1

Boa: An End-to-end Repository Mining Platform

Dr. Robert Dyer
Bowling Green State University

The research and educational activities described in this talk were supported in part by the US National Science Foundation (NSF) under grants CNS-18-23294, CCF-15-18776, CNS-15-12947, CNS-15-13263, and CCF-13-49153.

Page 2

Collaborators

Hoan Nguyen Hridesh Rajan Tien Nguyen

Elena Sherman Gary Leavens Mehdi Bagherzadeh

Page 3

Students

Che Shian Hung
Rebecca Brunner
Jingyi Su
Kaushik Nimmala
Mohd Arafat
Shubhendra Shrimal

Page 4

https://boa.cs.iastate.edu/

Page 5

What do I mean by software repository?

Page 6

Page 7

Projects 7,830,023

Code Repositories 380,125

Revisions 23,229,406

Unique Files 146,398,339

File Snapshots 484,947,086

AST Nodes 71,810,106,868

What do I mean by ultra large scale?

Page 8

Why do we want to mine software repositories?

Page 9

Detecting/correcting API mis-uses

Analyzing language feature usage

Training models

Clone detection

Mining patterns

Page 10

Consider a task to answer

"How was unit testing adopted over time?"

Page 11

[Flowchart: access the repository and mine project metadata; for each project, ask "Is fork?"; if No, mine revisions and find all Java source files; mine sources to find all methods; for each method ask "Method has @Test?"; if Yes, add it to the result. Result: Unit Test Usages]

Page 12

A solution in Java...

class UnitTestFinder {
    static void main(String[] args) {
        ... /* create and submit a Hadoop job */
    }

    static class UnitTestFinderMapper extends Mapper<Text, BytesWritable, Text, LongWritable> {
        static class DefaultVisitor {
            ... /* define default tree traversal */
        }

        void map(Text key, BytesWritable value, Context context) {
            final Project p = ... /* read from input */
            new DefaultVisitor() {
                boolean preVisit(Modifier n) {
                    /* emit a 1 for every annotation named Test or org.junit.Test */
                    if (n.kind == ModifierKind.ANNOTATION && match("^(org\\.junit\\.)?Test$", n.annotation_name))
                        context.write(new Text("Tests[" + current(n).getCommitDate() + "]"), new LongWritable(1));
                    return true; /* keep traversing */
                }
            }.visit(p);
        }
    }

    static class UnitTestFinderReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        void reduce(Text key, Iterable<LongWritable> vals, Context context) {
            /* add up the 1s emitted for this commit date */
            long sum = 0;
            for (LongWritable value : vals)
                sum += value.get();
            context.write(key, new LongWritable(sum));
        }
    }
}

Full program: over 140 lines of code

Uses JSON, SVN, and Eclipse JDT libraries

Uses Hadoop framework

Explicit/manual parallelization

Too much code!

Do not read!

Page 13

A better solution

Tests: output sum[timestamp] of int;

visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[current(Revision).commit_date] << 1;
});

Full program: 7 lines of code!

Automatically parallelized!

No external libraries needed!

Analyzes 18 billion AST nodes in minutes!

Page 14

The Boa language and data-intensive infrastructure

http://boa.cs.iastate.edu/

Page 15

How was unit testing adopted over time?

Tests: output sum[timestamp] of int;

Page 16

How was unit testing adopted over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

});

Page 17

How was unit testing adopted over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

before n: Modifier ->

});

Page 18

How was unit testing adopted over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

before n: Modifier ->

if (n.kind == ModifierKind.ANNOTATION &&

});

Page 19

How was unit testing adopted over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

before n: Modifier ->

if (n.kind == ModifierKind.ANNOTATION &&

match(`^(org\.junit\.)?Test$`, n.annotation_name))

});

Page 20

How was unit testing adopted over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[current(Revision).commit_date] << 1;
});
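The regular expression in the match call decides which annotations are counted. The following is a minimal sketch in plain Java, not Boa: the class name TestAnnotationMatcher and the method isJUnitTest are hypothetical, and java.util.regex is used only as a stand-in for Boa's match function. It shows which annotation names the slide's pattern accepts.

```java
import java.util.regex.Pattern;

public class TestAnnotationMatcher {
    // Same pattern as on the slide: an optional "org.junit." prefix followed by "Test".
    private static final Pattern JUNIT_TEST = Pattern.compile("^(org\\.junit\\.)?Test$");

    /** Returns true when the annotation name is exactly "Test" or "org.junit.Test". */
    public static boolean isJUnitTest(String annotationName) {
        return JUNIT_TEST.matcher(annotationName).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample names, chosen to show accepted and rejected cases.
        String[] names = {"Test", "org.junit.Test", "Override", "org.junit.jupiter.api.Test", "TestCase"};
        for (String name : names) {
            System.out.println(name + " -> " + isJUnitTest(name));
        }
    }
}
```

Because the prefix is optional, any annotation written simply as Test is counted whichever framework it comes from, while a fully qualified name other than org.junit.Test is not.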

Page 21

[Diagram: the same Boa program is run on every project in the Dataset (input = project1, project2, project3, ..., projectn); each run sends values to the output variable Tests, which produces the Output]

Boa Program

Tests: output sum[timestamp] of int;

visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[current(Revision).commit_date] << 1;
});

Values sent to Tests

Tests[631152000] << 1;
Tests[631154020] << 1;

631152000, 1
631154020, 1
631152000, 1
631154020, 1
631154020, 1
631161103, 1

Output

Tests[631152000] = 5
Tests[631154020] = 12
Tests[631161103] = 14
Tests[631172392] = 18
...
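The sum aggregation behind the Tests output variable can be sketched in plain Java. This is only an illustration of the idea, not Boa's runtime: the class name SumAggregatorSketch and the method aggregate are hypothetical, and the input is just the six (timestamp, value) pairs listed on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

public class SumAggregatorSketch {
    /** Sums the emitted values per timestamp key, in the spirit of "output sum[timestamp] of int". */
    public static Map<Long, Long> aggregate(long[][] emitted) {
        Map<Long, Long> totals = new TreeMap<>();
        for (long[] pair : emitted) {
            // pair[0] is the commit timestamp, pair[1] is the emitted value (always 1 here).
            totals.merge(pair[0], pair[1], Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // The six pairs shown on the slide.
        long[][] emitted = {
            {631152000L, 1}, {631154020L, 1}, {631152000L, 1},
            {631154020L, 1}, {631154020L, 1}, {631161103L, 1}
        };
        aggregate(emitted).forEach((ts, n) -> System.out.println("Tests[" + ts + "] = " + n));
    }
}
```

For these six pairs alone the totals are 2, 3, and 1. The Output values on the slide are larger, so they are not derived from these six pairs by themselves.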

Page 22

JUnit 4 release

Page 23

Consider a more complex task

“Find the buggy files by examining past bug fixing behavior.”

Page 24

“Find the buggy files by examining past bug fixing behavior.”

[Workflow diagram with two columns, Boa and Other Tools. Steps: 1a. Analysis Task, 1b. Download, 2a. Analysis Task, 2b. Download, 3a. Analysis Task, 3b. Output]

1a. Compute the fixing file counts, fixing revision counts (FRC), and average FRC across projects

1b. Download Boa output through Boa API

Page 25

“Find the buggy files by examining past bug fixing behavior.”

2a. Filter projects based on MSR criteria

2b. Download Boa output through Boa API

Page 26

“Find the buggy files by examining past bug fixing behavior.”

3a. Generate a list of potentially buggy files

3b. Output result

Page 27

Boa does not enable sharing and reusability

and consequently

fails to become an end-to-end MSR platform

Page 28

Solution:

Materialized Views in Boa

Page 29

“Find the buggy files by examining past bug fixing behavior.”

[DAG of Boa queries over the Dataset]

Boa Query 1: Fix File Count, Fix Revision Count (FRC), Average FRC across Projects

Boa Query 2: Retained Projects

Boa Query 3: Ranked Files (Output)
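Once queries can read each other's outputs, they have to run in dependency order. The following is a minimal sketch in plain Java of one way to derive such an order. It is not how Boa schedules jobs: the class name ViewOrderSketch and the method executionOrder are hypothetical, and the edges (Boa Query 3 reads the outputs of Boa Query 1 and Boa Query 2) are an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ViewOrderSketch {
    /** Returns the queries in an order where each one comes after the queries it reads from. */
    public static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        while (order.size() < dependsOn.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
                // A query is ready once every view it reads has already been scheduled.
                if (!order.contains(e.getKey()) && order.containsAll(e.getValue())) {
                    order.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        // Assumed edges, for illustration only.
        dependsOn.put("Boa Query 3", List.of("Boa Query 1", "Boa Query 2"));
        dependsOn.put("Boa Query 1", List.of());
        dependsOn.put("Boa Query 2", List.of());
        System.out.println(executionOrder(dependsOn));
    }
}
```

The repeated scan is quadratic, which is acceptable for a handful of queries; a larger graph would call for a queue-based topological sort.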

Page 30

Language features

Page 31

Language features

Page 32

Runtime modifications

[Diagram participants: User, Compiler, Poller]

1. Submit Boa job
2. Request views with tag -views
3. Returning a list of external views
4. Retrieve job id for query roots with tag name
5. Compile query with resolved view paths and job ids
6. Compile finished

Page 33

New HDFS Layout

{TrackerPath} / boa /
    23456 /
        Filter /
            FixingRevision /
                output /
                    FixFileCount.seq
                    FixRevisionCount.seq
                    AverageFRC.seq
                Query.jar
                workflow.xml
            output /
                Retained.seq
            Query.jar
            workflow.xml
        output /
            o.seq
        Query.jar
        workflow.xml


Page 37

Case Study 1: Splitting Identifiers

word10 -> {word, 10}
hardWord -> {hard, word}
hard_word -> {hard, word}

• Binkley et al. created a gold set containing 2,663 identifiers and the split form based on 8,522 human splitting judgements
• Hill et al. studied several splitting algorithms using the gold set
• We applied a greedy splitting algorithm to split the gold set
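To make the splitting task concrete, here is a simplified sketch in plain Java. It is not the greedy algorithm used in the study or the Boa queries behind it: the class name GreedySplitSketch, the method split, and the two-word dictionary are hypothetical. The sketch first cuts at underscores, camelCase boundaries, and letter/digit boundaries, then repeatedly takes the longest dictionary word that prefixes what is left.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class GreedySplitSketch {
    /** Splits an identifier at hard boundaries, then greedily by dictionary lookup. */
    public static List<String> split(String identifier, Set<String> dictionary) {
        List<String> words = new ArrayList<>();
        // Hard boundaries: underscores, lower-to-upper case changes, letter/digit changes.
        String[] pieces = identifier.split(
            "_+|(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])");
        for (String piece : pieces) {
            if (!piece.isEmpty()) {
                greedy(piece.toLowerCase(Locale.ROOT), dictionary, words);
            }
        }
        return words;
    }

    /** Repeatedly takes the longest dictionary word that prefixes the remaining text. */
    private static void greedy(String text, Set<String> dictionary, List<String> out) {
        int start = 0;
        while (start < text.length()) {
            int end = text.length();
            while (end > start && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            if (end == start) {
                // No dictionary word starts here: keep the rest unsplit rather than guess.
                out.add(text.substring(start));
                return;
            }
            out.add(text.substring(start, end));
            start = end;
        }
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("hard", "word"); // toy dictionary for the slide's examples
        for (String id : new String[] {"word10", "hardWord", "hard_word"}) {
            System.out.println(id + " -> " + split(id, dictionary));
        }
    }
}
```

With this toy dictionary the three identifiers from the slide split into the same words the slide lists. Results on real identifiers depend heavily on the dictionary.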

Page 38

Splitting Identifiers – DAG

[DAG figure. Nodes: Dataset, Dictionary, GreedySplit, Gold, Accuracy, Result, Output. Other labels: WordLists, GoldSet, WordLists/, Abbreviation, WordLists/StopWords, Words, ConfidenceLevel, ProgramLanguage]

Page 39

Case Study 1: Result

• Original study used three different dictionary sizes:
  • Small (50,276 entries) -> 56%
  • Medium (98,569 entries) -> 51%
  • Large (479,625 entries) -> 60%
• Our dictionary contained 61,215 entries -> 53%

Page 40

Case Study 2: Developer Turnover Impact

• Foucault et al. analyzed turnover impact on 5 open source projects
• We computed turnover metrics and analyzed their impact on 1,676 open source projects

Page 41

Developer Turnover – DAG

[DAG of Boa queries with their numbered outputs]

ProjectCommitTime: 1. LatestCommitTime 2. EarliestCommitTime
ProjectDeveloperFilter: 1. AtLeast
TurnoverProjectFilter: 1. TurnoverRetainedProject
TurnoverMetrics: 1. Commit_M 2. BFCommit_M 3. PeriodCount 4. D_P 5. ST 6. ENA 7. ELA 8. INA 9. ILA 10. STA 11. ENCount 12. ELCount 13. INCount 14. ILCount 15. STCount 16. …

TurnoverRQ1, TurnoverRQ2, TurnoverRQ3 (each leads to Output). Output lists shown for these queries:
1. Ratio 2. ENmax 3. INmax 4. ELmax 5. ILmax 6. STmax 7. …
1. Correlation 2. CorrelationCI 3. ENp 4. ENn 5. …
1. ENLratio 2. ENLratioStats 3. ConversionRate 4. ConversionRateStats

* Each query is also fed the same Boa dataset.

Page 42

Case Study 2: Results

[Six bar charts]
• Number of Projects for Each Level of External Ratio (x: Ratio of Externals to Total Number of Developers, 0 to 1; y: Number of Projects)
• Number of Projects for Each Conversion Rate Level (x: Conversion Rate, 0 to 1; y: Number of Projects)
• The Number of Projects for each Highest Activity Metric (x: ENActivity, INActivity, ELActivity, ILActivity, STActivity; y: Number of Projects)
• Newcomer Activity vs. Leaver Activity (x: NewcomerActivity, LeaverActivity; y: Number of Projects)
• Positive Correlation between Turnover and Bug-fixed Density (x: ENActivity, INActivity, ELActivity, ILActivity, STActivity; y: Number of Projects)
• Negative Correlation between Turnover and Bug-fixed Density (x: ENActivity, INActivity, ELActivity, ILActivity, STActivity; y: Number of Projects)

Page 43

1,193 users

Page 44

1,193 users, 35 countries

Page 45

1,193 users, 35 countries, 6 continents

Page 46

Future Research

Page 47

Enabling Machine Learning in Boa

Key Challenges
1) What language abstractions can we provide to integrate ML techniques in Boa?
2) Can we scale the training phase?

Page 48

Leveraging National Infrastructures

[2a] User submits query (on boa.cs.iastate.edu)

[1c/2b] Frontend transmits query to new XSEDE backend

[1b] Frontend transmits query to XSEDE backend

[1a] User submits query (on OSF.io)

Existing Frontend

New XSEDE Backend (see D.3)

Boa compiler and runtime

(see D.1)

Distributed Deep Learning runtime

Machine Learning language features

New OSF.io Frontend (see D.2)

Page 49

https://boa.cs.iastate.edu/