Modeling the Evolution of Topics in Source Code Histories
Stephen W. Thomas
Bram Adams
Ahmed E. Hassan
Dorothea Blostein
[2]
[3]
[Figure: popularity over time of the topics "Linux Development" and "Audio Codecs"]
“What have the Skype developers been interested in?” (Microsoft manager)
[4]
What are developers working on?
Option 1: Speak with every developer
Option 2: Use automated tool
[Figure: popularity over time of the topics "Linux Development" and "Audio Codecs"]
[5]
Tool: Topic Evolution Models
Applied to Source Code Histories
[Figure: topic popularity over time (V1.0, V1.1, V1.2, V2.0, V4.0) for the topics “Linux”, “codec”, “GUI”, …]
[6]
Success in Other Domains
Email Archives
Conference Proceedings
Newspaper Articles
[7]
Topic Evolution on Source Code
Topic Model
Mapping Topics Over Time
Background: The Hall Model
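The two stages named above (a topic model, then mapping topics over time) can be sketched as follows. This is only an illustration under stated assumptions: a topic model is assumed to have already assigned every file in every version a membership vector, and popularity is assumed to be the mean membership of the files in a version. The function name and data layout are hypothetical, not taken from the slides.

```python
from collections import defaultdict


def topic_popularity(memberships):
    """Average topic memberships per version.

    memberships: dict mapping (version, file_id) -> list of topic memberships.
    Returns a dict mapping version -> list of mean memberships, one per topic.
    """
    sums = {}
    counts = defaultdict(int)
    for (version, _file_id), theta in memberships.items():
        if version not in sums:
            sums[version] = [0.0] * len(theta)
        for k, value in enumerate(theta):
            sums[version][k] += value
        counts[version] += 1
    return {v: [s / counts[v] for s in sums[v]] for v in sums}


# Hypothetical memberships for two files, two topics, two versions.
example = {
    ("V1", "a"): [1.0, 0.0],
    ("V1", "b"): [0.0, 1.0],
    ("V2", "a"): [1.0, 0.0],
    ("V2", "b"): [0.5, 0.5],
}
popularity = topic_popularity(example)
```

Plotting each topic's entry of `popularity` against the version order would give one evolution curve per topic.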
[8]
[Figure: versions V1.0, V1.1, V1.2, V1.3 of a project whose files contain XML + File I/O, XML + GUI, and GUI + File I/O]
[9]
[Figure: files (by file ID) across versions V1 to V5, containing XML + File I/O, XML + GUI, and GUI + File I/O]
Expect: Topic 1: XML; Topic 2: GUI; Topic 3: File I/O
Get: Topic 1: XML + File I/O; Topic 2: XML + GUI; Topic 3: GUI + File I/O
Problem: Topics are muddled, not distinct
[10]
[Figure: files (by file ID) across versions V1 to V5, containing XML + File I/O, XML + GUI, and GUI + File I/O]
Expect: [Plot: popularity over versions of the topics XML, GUI, and File I/O]
Get: [Plot: popularity over versions of Topic 1, Topic 2, and Topic 3]
Problem: Evolutions not sensitive or accurate
[11]
Problems due to duplication
Topics are muddled, not distinct
Evolutions are not accurate
Found in Source Code Histories
63% files don’t change
84% files don’t change
99.8% words don’t change
99.8% words don’t change
[12]
JHotDraw
Real-World Duplication
The Diff Model
[13]
[Figure: pipeline stages: Diff, Topic Model, Reconstruction Step, Mapping Topics Over Time]
The Diff Model
[14]
Diff Step
[Figure: successive versions V1.0, V1.1, V1.2, V1.3]
Version 5.3.7:
    ...
    if (vacstmt->options & VACOPT_VACUUM)
    {
        PreventTransactionChain(isTopLevel, stmttype);
        in_outer_xact = false;
    }
    ...
Version 5.3.8:
    ...
    // Don’t run VACUUM in user transition block!
    if (vacstmt->options & VACOPT_VACUUM)
    {
        PreventTransactionChain(isTopLevel, stmttype);
        in_inner_xact = false;
    }
    ...
Diff
Deleted lines:
    in_outer_xact = false;
Added lines:
    // Don’t run VACUUM in user transition block!
    in_inner_xact = false;
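The Diff step above can be approximated with Python's standard `difflib` module. This is a minimal sketch, not the tooling used in the study; the function name `diff_step` is hypothetical, and line-level comparison is an assumption.

```python
import difflib


def diff_step(old_lines, new_lines):
    """Return (added, deleted) lines between two versions of a file."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return added, deleted


old_version = [
    "if (vacstmt->options & VACOPT_VACUUM)",
    "{",
    "    PreventTransactionChain(isTopLevel, stmttype);",
    "    in_outer_xact = false;",
    "}",
]
new_version = [
    "// Don't run VACUUM in user transition block!",
    "if (vacstmt->options & VACOPT_VACUUM)",
    "{",
    "    PreventTransactionChain(isTopLevel, stmttype);",
    "    in_inner_xact = false;",
    "}",
]
added, deleted = diff_step(old_version, new_version)
```

Only the `added` and `deleted` lines would then be handed to the topic model, rather than the full text of every version.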
[15]
[16]
Reconstructing Topic Memberships
Second Version = First Version - Deleted Lines + Added Lines
First Version (1000 lines): GUI (90%), XML (10%)
Deleted Lines (200 lines), inferred by the topic model: GUI (100%), XML (0%)
Added Lines (150 lines), inferred by the topic model: GUI (20%), XML (80%)
GUI: (1000 * 90%) - (200 * 100%) + (150 * 20%) = 730 lines
XML: (1000 * 10%) - (200 * 0%) + (150 * 80%) = 220 lines
Second Version (950 lines): GUI 730 / 950 = 77%, XML 220 / 950 = 23%
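The arithmetic above can be written as a small function. This is a sketch under assumptions: memberships are given as fractions per topic, line counts are used as weights as on the slide, and negative weights are clipped to zero (the clipping is an added safeguard, not something the slides state). The function name is hypothetical.

```python
def reconstruct_membership(prev_lines, prev_theta,
                           deleted_lines, deleted_theta,
                           added_lines, added_theta):
    """Rebuild a version's topic memberships from the previous version and its diff."""
    weights = [
        prev_lines * p - deleted_lines * d + added_lines * a
        for p, d, a in zip(prev_theta, deleted_theta, added_theta)
    ]
    # Assumption: a topic cannot hold a negative number of lines.
    weights = [max(w, 0.0) for w in weights]
    total = sum(weights)
    return weights, [w / total for w in weights]


# Values from the slide; topic order is [GUI, XML].
weights, theta = reconstruct_membership(
    1000, [0.90, 0.10],   # first version
    200, [1.00, 0.00],    # deleted lines
    150, [0.20, 0.80],    # added lines
)
```

With the slide's numbers, `weights` comes out at roughly 730 and 220 lines, and `theta` rounds to 77% GUI and 23% XML.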
Case Studies
[17]
JHotDraw
Drawing Application Framework (Java)
13 releases (5.2.0 – 7.5.1), 613 files, 84K SLOC
PostgreSQL (PSQL)
Database Management System (C)
46 releases (7.0.0 – 8.3.5), 844 files, 501K SLOC
I bet the Diff model discovers topics that are more distinct!
[18]
[19]
Measuring Distinctness
With KL-Divergence
[Figure: word probabilities over the words xml, fopen, button, element, menu, fclose, attribute for each topic]
XML topic vs. GUI topic: High KL divergence, High distinctness
XML + File IO topic vs. XML + GUI topic: Low KL divergence, Low distinctness
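To make the KL-divergence comparison concrete, here is a minimal sketch over the seven words shown on the slide. The word probabilities are invented for illustration, and the small `eps` smoothing term is an assumption added to avoid division by zero; the slides do not say which KL variant (for example, a symmetric one) was used.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two word distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Vocabulary order: xml, fopen, button, element, menu, fclose, attribute.
# All probabilities below are hypothetical.
xml_topic = [0.40, 0.01, 0.01, 0.28, 0.01, 0.01, 0.28]
gui_topic = [0.01, 0.01, 0.47, 0.01, 0.48, 0.01, 0.01]
xml_io_topic = [0.25, 0.20, 0.01, 0.17, 0.01, 0.19, 0.17]
xml_gui_topic = [0.25, 0.01, 0.20, 0.17, 0.19, 0.01, 0.17]

distinct_pair = kl_divergence(xml_topic, gui_topic)
muddled_pair = kl_divergence(xml_io_topic, xml_gui_topic)
```

Because the muddled topics share their XML words, `muddled_pair` is smaller than `distinct_pair`, matching the low versus high distinctness on the slide.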
[20]
Average Topic Distinctness
[Figure: Topic 1 through Topic K compared against Topic 1 through Topic K, once for Hall topics and once for Diff topics]
+32% +38%
Diff makes more distinct topics
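One way to turn pairwise divergences into a single average distinctness score is sketched below. Averaging KL divergence over all ordered topic pairs is an assumption made for illustration; the slides do not spell out the exact aggregation, and the topic distributions here are invented.

```python
import math
from itertools import permutations


def kl(p, q, eps=1e-12):
    """KL(p || q) with a small smoothing term (an assumption)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def average_distinctness(topics):
    """Mean KL divergence over all ordered pairs of distinct topics."""
    pairs = list(permutations(range(len(topics)), 2))
    return sum(kl(topics[i], topics[j]) for i, j in pairs) / len(pairs)


# Hypothetical topic-word distributions over a three-word vocabulary.
distinct_topics = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
muddled_topics = [[0.49, 0.49, 0.02], [0.49, 0.02, 0.49], [0.02, 0.49, 0.49]]

distinct_score = average_distinctness(distinct_topics)
muddled_score = average_distinctness(muddled_topics)
```

Comparing such a score for the topics of two models is the kind of comparison that a relative improvement figure would summarize.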
[21]
JHotDraw
[22]
Topics are muddled, not distinct
Evolutions are not accurate
Diff makes more distinct topics
Problems due to duplication
Found in Source Code Histories
I bet the Diff model discovers more accurate topic evolutions
[23]
[24]
Measuring Accuracy
No oracle dataset
1. Create simulated scenario by hand (truth known)
2. Manually investigate evolutions in JHotDraw and PSQL (truth learned)
Simulated Project
[Figure: PSQL backend.access copied from each version to the next, v1 through v10]
[25]
Simulated Scenario 1
[Figure: versions v1 through v10, with 3 files from PSQL timezone; timezone topic]
[26]
Manual Investigation
[27]
[Figure: evolution of Topic 1]
1. Select change events
2. Validate against project documentation (commit logs, release notes, etc.)
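Selecting change events could be automated along the following lines. This is a sketch under assumptions: a change event is taken to be a version-to-version jump in a topic's popularity larger than a fixed threshold. The threshold value, the function name, and the example series are all hypothetical; the slides do not define how events were selected.

```python
def change_events(popularity, threshold=0.1):
    """Return (version_index, kind) for each jump larger than the threshold."""
    events = []
    for i in range(1, len(popularity)):
        delta = popularity[i] - popularity[i - 1]
        if abs(delta) > threshold:
            events.append((i, "spike" if delta > 0 else "drop"))
    return events


# Hypothetical popularity of one topic across ten versions.
series = [0.0, 0.0, 0.0, 0.3, 0.3, 0.3, 0.0, 0.0, 0.0, 0.0]
events = change_events(series)
```

Each returned event would then be checked by hand against commit logs or release notes, as in step 2.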
Diff makes more accurate topics
[28]
+25% precision
Simulated Project
+33% precision
JHotDraw
+47% precision
+100% recall
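For reference, precision and recall of detected change events against a known truth can be computed as below. This is a generic sketch; identifying events by version index and the example sets are assumptions for illustration, not data from the study.

```python
def precision_recall(detected, truth):
    """Precision and recall of detected change events against true events."""
    detected, truth = set(detected), set(truth)
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: events are identified by version index.
precision, recall = precision_recall(detected={3, 6, 8}, truth={3, 6})
```

Here two of the three detected events are real and both real events are found, so precision is 2/3 and recall is 1.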
[29]
Topics are muddled, not distinct
Evolutions are not accurate
Diff makes more distinct topics
Diff makes more accurate evolutions
Problems due to duplication
Found in Source Code Histories
[30]
Summary