Towards Content-Driven Reputation for Collaborative Code Repositories
Andrew G. West and Insup Lee
August 28, 2012
Big Concept
Apply reputation algorithms developed for wikis to collaborative code repositories:
1. Do the computed reputations accurately reflect user behavior? If so, how could such a system be useful in practice?
2. What do inaccuracies teach us about differences in the evolution of code vs. natural-language content? How might the approach be adapted?
Motivations
• Platform equivalence: purely collaborative
• Increasingly distributed; collaboration between unknown/untrusted parties
• VehicleForge.mil [1]: crowdsourcing a next-generation military vehicle. Trust implications!
CONTENT-DRIVEN REPUTATION
Content-Driven Rep.
[Figure: article version history. V0 (initialization) → V1 by author A1; V1 reads "Mr. Franklin flew a kite".]
IDEA: Content that survives is good content. Good content is written/maintained by good authors.
V1: No reputation changes; no survival yet.
Content-Driven Rep.
[Figure: version history V0 → V1 → V2 (→ V3 → V4); authors A1, A2 (A3, A4). V1 reads "Mr. Franklin flew a kite"; V2 reads "Your mom flew a plane" (damage).]
IDEA: When a subsequent editor allows content to survive, it has his/her implicit approval (and vice versa).
V2: Author A2 deletes most of A1's content. The reputation of A1 is negatively impacted.
Content-Driven Rep.
[Figure: as above, with V3 restoring "Mr. Franklin flew a kite" (content restoration).]
IDEA: Survival is examined at depth.
V3: Author A3 reverts A2's content. A1 gains reputation as his content is restored; A2 loses rep.
Content-Driven Rep.
[Figure: as above, with V4 by A4 extending the text to "Mr. Franklin flew a kite and …" (content persistence).]
IDEA: … and the process continues (depth = 10).
V4: Authors A1 and A3 accrue reputation, while A2 continues to receive reputation decrements.
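To make the walkthrough concrete, here is a minimal sketch of the survival logic in Python. It illustrates the idea only; it is not WikiTrust's actual implementation. Each new revision implicitly judges up to 10 prior revisions, and each judged author's reputation moves in proportion to how much of their text survived, weighted by the judge's own reputation. The function names, record shapes, and scoring constants are hypothetical.

```python
# Minimal sketch of content-persistence reputation (NOT the WikiTrust code).
# A new revision "judges" up to DEPTH prior revisions: authors whose tokens
# survive gain reputation; authors whose tokens were removed lose it.
DEPTH = 10  # survival is examined at depth 10 (per the slides)

def survival_ratio(old_tokens, new_tokens):
    """Fraction of the old revision's tokens still present in the new one."""
    if not old_tokens:
        return 0.0
    new_set = set(new_tokens)
    return sum(1 for t in old_tokens if t in new_set) / len(old_tokens)

def judge_revision(history, reputation, judge, new_tokens):
    """Update `reputation` when `judge` commits `new_tokens` atop `history`.
    `history` is a list of (author, tokens) pairs, oldest first."""
    for author, old_tokens in history[-DEPTH:]:
        if author == judge:
            continue  # no self-approval
        ratio = survival_ratio(old_tokens, new_tokens)
        # Signed delta: full survival rewards, full deletion penalizes;
        # weighted by the judge's own reputation (hypothetical scaling).
        delta = (ratio - 0.5) * reputation.get(judge, 1.0)
        reputation[author] = reputation.get(author, 0.0) + delta
    history.append((judge, new_tokens))
```

Replaying the Franklin example through this loop reproduces the pattern above: A2's damaging V2 is penalized once A3's V3 restores A1's text, and A1 regains reputation.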
In Practice
Implemented as WikiTrust [2, 3]:
• Token survival + edit distance captures novel content as well as maintenance actions
• Size of ∆ is: (1) proportional to degree of change, (2) weighted by the rep. of the editor
• Nice security properties: implicit feedback; symmetric evaluation; no self-approval
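Read as a formula, the ∆ sizing above amounts to something like the following sketch. This is a paraphrase of the two stated properties, not the exact equation from [2]: the update to author a's reputation, when judge j commits version v, multiplies a signed survival measure by a weight that grows with the judge's reputation.

$$
\Delta\,\mathrm{rep}(a) \;\propto\; q(a, v) \cdot w\big(\mathrm{rep}(j)\big), \qquad q(a, v) \in [-1, 1]
$$

Here q(a, v) captures the degree of a's change that survived (or was undone) at version v, and w is an increasing function of the judge's reputation.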
WikiTrust Success
• Live processing of several language editions of Wikipedia; portable!
• Implementation [4] works on any MediaWiki installation
[Screenshot: WikiTrust highlighting vandalism.]
REPRESENTING A REPOSITORY ON A WIKI PLATFORM
Repo. ↔ Wiki Model
[Figure: SVN revision graph with numbered check-ins across tags/, trunk/, and branches/, including a merge.]
Just replay history in a sequential fashion:
• Repository ↔ wiki
• Check-in ↔ edit
• File ↔ article
Repo. ↔ Wiki Model
[Figure: same SVN revision graph as above.]
Minor accommodations:
• Ignore tags
• Ignore branches (merge as a recommendation)
• Multi-file check-in
Replay in Practice
1. [svnsync] produces a local copy (not a checkout)
2. [svn log] yields a metadata script (see table)
3. Pipe file versions into the wiki via the API:
   1. Log in the user (create an account if needed)
   2. Use [svn cat path@id] syntax to yield content
   3. Make an edit to article "path". Log out.

ID  USR  COMMENT            MOD  PATH
1   U1   Initial check-in.  A    /trunk/core/header.c
                            A    /trunk/core/misc.c
2   U2   Compilation error  M    /trunk/core/header.c
3   U1   Don't need this    D    /trunk/core/misc.c
CASE STUDY INTRODUCTION
Mediawiki SVN
• Case study repository: Mediawiki SVN [5]
• http://hincapie.cis.upenn.edu/wiki_mediawiki/

PROPERTY       ORIG     MOD
Authors        326      271
Check-ins      91,808   53,715
File versions  585,629  117,432
… in trunk/    420,613  117,432
Unique paths   138,741  7,521
… to PHP file  56,063   7,521
(figures per late 2011)

Further filtering (sketched below):
• Only PHP files
• Core language
• No binary files
• Tokenization
• Toss out i18n files
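As a concrete reading of the filtering list, here is one possible path predicate. The exact patterns (a trunk sub-path as a proxy for "core language", the i18n naming test) are assumptions for illustration; binary detection would need file content, not just the path.

```python
# Sketch of the corpus filter implied by the slide (patterns assumed).
def keep_path(path: str) -> bool:
    """True iff a repository path survives the case-study filtering."""
    if not path.endswith(".php"):          # only PHP files (also drops binaries)
        return False
    if not path.startswith("/trunk/phase3/"):  # "core" proxy (assumed)
        return False
    if "i18n" in path.lower():             # toss out i18n files (pattern assumed)
        return False
    return True
```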
The wiki database is then given to the WikiTrust implementation, which produces update records like:

Revision #A by J had ∆ +0.75 on reputation of X = 12.05
Revision #B by K had ∆ -42.00 on reputation of Y = 0.5
Revision #B by K had ∆ +16.75 on reputation of Z = 1000.1
…

Recall: An edit can change up to 10 reputations!
General Results (1)
[Figure: distribution of final user reputations.]
• Reputations lie on [0, 20k]
• 0.0 is the initial rep.
• ≈15 users w/ max. rep.; not always those w/ the most revs.
General Results (2)
[Figure: distribution of update ∆s, by magnitude.]
• The majority of updates are positive; evidence of a healthy community
• The most freq. update is a 1-10 pt. increment
Example Reputations
EVALUATING REPUTATION ACCURACY
Evaluation Process
[Figure: edit Ex followed by edit Ex+1, where Ex+1 performs non-trivial content removal. Was this removal the result of ineptitude by the prior editor?]
Find edits (Ex) where:
• The subsequent edit (Ex+1) resulted in non-trivial rep. loss for the author
• Manually inspect the comment, Bugzilla, diffs, and ask: "Would editor Ax+1 consider the previous change CONSTRUCTIVE, or UNCONSTRUCTIVE?"
• Could be a subjective mess, but…
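The selection step is mechanical once reputation-update records exist. A sketch, assuming records shaped like the log lines shown earlier; the field names and the "non-trivial" threshold are hypothetical:

```python
# Sketch: select (Ex, Ex+1) pairs where the immediately following edit
# caused a non-trivial reputation loss for Ex's author.
LOSS_THRESHOLD = -5.0  # "non-trivial" cutoff; hypothetical value

def candidate_pairs(updates):
    """`updates`: iterable of dicts such as
    {"judged_rev": 101, "judging_rev": 102, "author": "A1", "delta": -42.0}.
    Yields (judged_rev, judging_rev) pairs for manual inspection."""
    for u in updates:
        is_adjacent = u["judging_rev"] == u["judged_rev"] + 1
        if is_adjacent and u["delta"] <= LOSS_THRESHOLD:
            yield (u["judged_rev"], u["judging_rev"])
```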
Classifying Rep. Loss (1)
A surprising number of obviously "bad" actions resulted in reverts. The reverting editor often calls out the previous edit and/or editor explicitly:
"Password in plaintext! … DOESN'T WORK … don't put it in trunk!"
"massive breakage with incomplete experimental changes"
"revert … spewing giant red HTML all over everything"
"failed, possibly other problems. NEEDS PARSER TESTS"
"ten billion extra callouts … clutter things up and trigger errors"
"… no apparent purpose … more complex and prone to breakage"
Classifying Rep. Loss (2)
Some cases are more ambiguous. The editor erred, but it's not immediately clear there should be a significant penalty (NON-FATAL):
• Code showing no immediate errors, but reverted (or branched) for testing
• Issues unrelated to functional code: whitespace, comment/string changes
Evaluation Results
Per a conservative approach, anything not in the other two sets is CONSTRUCTIVE:

UNCONSTRUCTIVE  NON-FATAL  CONSTRUCTIVE
51%             19%        30%

63% accuracy if we discount the "non-fatal" cases; 70% accuracy if we interpret them as "unconstructive". Interpret how you wish; this was purposely a naïve application.

Concentrate on the false positives: can the algorithm be improved?
IDENTIFYING & FIXING FALSE POSITIVES + EVALUATION
False Positives (1)
SVN does not handle RENAME elegantly:
[Figure: file.c (DEL) → file_renamed.c (ADD); a rename is recorded as a delete plus an add.]
Consequences: Authors of [file.c] are punished; provenance is lost; the renamer gets all the credit.
Solutions: Detect renames via hash; apply a simple wiki "move".
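The hash detection can be sketched in a few lines. This is a hypothetical helper, assuming we have the file contents of each check-in; note that real renames sometimes include edits in the same commit, which this strict exact-match version would miss:

```python
# Sketch: detect a rename as a DEL + ADD pair whose content hashes match.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def detect_renames(deleted, added):
    """`deleted`/`added`: dicts of path -> file content within one check-in.
    Returns [(old_path, new_path), ...] for exact-content matches; each can
    then be replayed as a wiki "move" so provenance is preserved."""
    by_hash = {content_hash(text): path for path, text in deleted.items()}
    renames = []
    for new_path, text in added.items():
        old_path = by_hash.get(content_hash(text))
        if old_path is not None:
            renames.append((old_path, new_path))
    return renames
```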
False Positives (2.1)
INTER-DOCUMENT REORGANIZATION is problematic for WikiTrust:
[Figure: func_c(){…} moves from file_1.c to file_2.c, so each file's local diff shows only a deletion (--- ∆) or an addition (+++ ∆). Treating the entire code-base as one giant document (file1.c >> file2.c >> file3.c >> …) enables a global diff.]
Solution: Examine all diff ∆s; sub-string matching; replay history. Intra-document reorganization is a non-issue!
False Positives (2.2)
[Figure: continuing the example, the moved block is sought in the destination document's history (A1 – V1, A2 – V2, A3 – V3); "this is the content block being moved" matches "the same block 3 edits ago", so authorship can be traced rather than credited to the mover.]
False Positives (2.3)
[Figure: TRANSCLUSION! Rather than copying, the new document includes the section via a {{sect}} reference while the old document retains the section text, so the A1/A2/A3 authorship history carries over.]
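The sub-string matching in the proposed solution can be sketched with token n-grams. The window size and the exact-match strategy are hypothetical simplifications of a real global diff:

```python
# Sketch: before crediting "new" tokens to the committing author, check
# whether the block already exists elsewhere in the repository (or earlier
# in some document's history), i.e., it was moved rather than authored.
WINDOW = 20  # minimum block length to count as a move; hypothetical

def ngrams(tokens, n=WINDOW):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def moved_blocks(added_tokens, other_docs):
    """`other_docs`: {path: token list} for the other documents.
    Returns the set of n-grams that appear to be moves, whose reputation
    credit should be redirected to the original authors."""
    added = ngrams(added_tokens)
    moved = set()
    for path, tokens in other_docs.items():
        moved |= added & ngrams(tokens)
    return moved
```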
False Positives (3)
REVERT CHAINS cause big penalties:
[Figure: V0 → V1 (+++ BIG CODE CHANGES) → V2 ("Revert: Needs testing first"; identical to V0) → V3 (testing done; +++ BIG CODE CHANGES, nearly identical to V1).]
Consequences: At V2, A1 loses reputation (a NON-FATAL). At V3, A2 is wrongly punished.
Solution: Revert chains are rare; manual inspection?
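Although the slide leaves mitigation open, detecting the pattern is straightforward to sketch: compare each new version's content hash against recent history. The depth bound and exact-match test are hypothetical; "nearly identical" cases would need an edit-distance tolerance.

```python
# Sketch: flag a revision as part of a revert chain when its content is
# identical to some earlier version in recent history (V2 == V0 above).
import hashlib

def digest(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def in_revert_chain(history, new_text, depth=10):
    """`history`: list of prior version texts, oldest first. True if
    `new_text` exactly matches one of the last `depth` versions, so the
    reputation update can be suppressed or flagged for manual review."""
    return any(digest(old) == digest(new_text) for old in history[-depth:])
```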
False Positives (4)
• Initially 30 false-positive cases
  – If the "solutions" above were implemented, this number would drop to just 10
  – Suggesting accuracies of 80-90%
• And those 10 remaining cases?
  – Benign code evolution
  – Feature requests; method deprecation; no fault
• Results similar for [ruby] and [httpd]
Better Evaluation
• The POC evaluation is lacking in many ways
  – Not enough examples. Subjective.
  – Says nothing about true negatives
• Bug attribution is extremely difficult
  – Corpus: "X erred at rev. Y with severity {L,M,H}"
  – If it could be automated, the problem would be solved!
  – Work backwards from Bugzilla? Developers?
  – Reputation as a predictor of future loss events
• Qualitative instead of quantitative measures
Other Optimization
• Lots of free variables, weights, ceilings
[Figure: a raw snippet ("// this is a loop" / for(int i=0;i<10;i++) print("Some text");) beside its canonical form (for ( int i = 0 ; i < 10 ; i++ ){ print( "" ); }), illustrating "Canonical code" and "Tokenization" as pre-processing choices.]
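A canonicalization pass like the one in the figure might look as follows. The specific normalizations (dropping comments, blanking string literals, splitting into uniform tokens) mirror the figure; the regexes are illustrative, not the deck's actual tokenizer:

```python
# Sketch: canonicalize C-like code before token-survival analysis, so that
# comment/whitespace/string churn does not move reputations.
import re

def canonicalize(src: str) -> list:
    src = re.sub(r"//[^\n]*", "", src)                 # drop line comments
    src = re.sub(r"/\*.*?\*/", "", src, flags=re.S)    # drop block comments
    src = re.sub(r'"(?:\\.|[^"\\])*"', '""', src)      # blank string literals
    # split into tokens: blanked strings, identifiers, numbers, punctuation
    return re.findall(r'""|[A-Za-z_]\w*|\d+|\S', src)

# canonicalize('// this is a loop\nfor(int i=0;i<10;i++) print("Some text");')
# -> ['for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '10', ';', 'i', '+',
#     '+', ')', 'print', '(', '""', ')', ';']
```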
USE-CASES & CONCLUSIONS
Use-case: Small Projects
• Small/non-production projects: conflict, not just tokens!
• Undergraduate research: who did all the work?
• Academic paper repositories: automatic author order!
• Collaboration or conflict? Graph of reputation events
[Figure: graph of reputation events among authors A, B, C, D, grouped into Faction #1 and Faction #2, with positive (+) and negative (-) edges.]
Use-cases (2)
MEDIAWIKI
• Alert service/warnings (anti-vandal style)
• Expediting code review
• Permission granting/revocation

VEHICLEFORGE.MIL
• Access control for users/commits
• Wrap content-persistence reputation with metadata features for a stronger classifier [6]
• Robustness considerations (i.e., reachability)
Conclusions
• Despite high(er) barriers to entry, bad things still happen in production repositories!
• Content-persistence is a reasonably accurate way to identify these instances ex post facto
• False positives indicate code's uniqueness:
  1. Non-functional aspects are non-trivial (whitespace, comments)
  2. Inter-document reorganization is common
  3. Quality assurance is more than surface level
• Evaluation needs to be more rigorous
• A variety of use-cases if it becomes production-ready
References
[1] Lohr, Steve. “Pentagon Pushes Crowdsourced Manufacturing”. New York Times “Bits Blog”. April 5, 2012.
[2] Adler, B.T. and L. de Alfaro. “A Content-Driven Reputation System for Wikipedia”. In WWW 2007: Proc. of the 16th Intl. World Wide Web Conference.
[3] Adler, B.T., et al. “Measuring Author Contributions to Wikipedia”. In WikiSym 2008: Proc. of the 3rd Intl. Symposium on Wikis and Open Collaboration.
[4] WikiTrust online. http://www.wikitrust.net/
[5] Mediawiki SVN. http://svn.wikimedia.org/viewvc/mediawiki/ (note: this is an archive of that resource; Git is the currently used repository software)
[6] Adler, B.T. et al. “Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features”. In CICLing 2011: Proc. of the 12th Intl. Conference on Intelligent Text Processing and Computational Linguistics.
[7] Mediawiki Developer Hub. http://www.mediawiki.org/wiki/Developer_hub