What are developers talking about? An Analysis of Topics and Trends in Stack Overflow
Dennis Portengen
Dec 31, 2015
Authors
• Anton Barua (pursuing MSc. Computing Science)
• Stephen W. Thomas (PhD Computing Science)
• Dr. Ahmed E. Hassan (Business)
Goal of the paper
• “Uncovering the main discussion topics, their underlying dependencies, and trends over time.” (Barua et al., 2012)
• 4 research questions (RQs)
  • What are the main discussion topics?
  • Does a question in one topic trigger answers in another?
  • How does developer interest change over time?
  • How does the interest in specific technologies change over time?
Main topics in article
• Topic modelling
  • Uses word frequencies and co-occurrence frequencies to build a model of related words
• LDA (Latent Dirichlet Allocation)
  • Statistical technique that discovers topics, each a set of related words, from the words in a collection of documents
• Simple idea:
  • ‘Planet’, ‘Space’, ‘Star’, ‘Orbit’ indicate that the topic is related to astronomy
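The "simple idea" above can be illustrated with a minimal sketch that counts word frequencies and co-occurrence frequencies. This is not LDA itself and not the authors' code: the toy documents and the helper name `cooccurrence` are assumptions made up for illustration. Words that often appear in the same document (such as "star" and "orbit") are the kind of signal a topic model builds on.

```python
from collections import Counter
from itertools import combinations

# Toy documents (hypothetical, not taken from the Stack Overflow data set).
docs = [
    "planet star orbit",
    "star orbit space",
    "socket compile error",
    "compile error socket api",
]

word_freq = Counter()  # number of documents each word appears in
cooc_freq = Counter()  # number of documents each word pair appears in

for doc in docs:
    words = sorted(set(doc.split()))
    word_freq.update(words)
    # Count each unordered word pair once per document.
    cooc_freq.update(combinations(words, 2))


def cooccurrence(a, b):
    """Number of documents in which both words appear."""
    return cooc_freq[tuple(sorted((a, b)))]


print(cooccurrence("star", "orbit"))   # astronomy words occur together
print(cooccurrence("star", "socket"))  # words from unrelated topics do not
```

Pairs with a high count relative to the individual word frequencies hint at a shared topic. LDA goes further by fitting a probabilistic model instead of relying on raw counts.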
Research Methodology
[Figure: methodology pipeline in three phases (Phase 1, Phase 2, Phase 3)]
Stack Overflow Data Set → Post Extraction → Extracted Posts → Pre-processing → Pre-processed Posts → LDA → Topics and Topic Memberships → Post-processing → Results
Example Result of pre-processing
Before pre-processing:
<p> I’ve been having issues getting C sockets API to work properly in C++. Specifically, although I am including sys/socket.h, I still get compile time errors telling me that AF_INET is not defined. Am I missing something obvious, or could this be related to the fact that I’m doing this coding on z/OS and my problems are much more complicated? </p>
After pre-processing:
Issu c socket api work properly c++ specif include sy socket.h compil time error af_inet defin miss obvious relat fact code z os problem complic
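Judging from the example above, pre-processing appears to strip HTML tags, drop common stop words, and reduce words to their stems. The following is a minimal sketch of such a step, not the authors' implementation: the stop-word list is a tiny illustrative subset, and `crude_stem` is a hypothetical stand-in for a real stemmer, so its output will not match the slide exactly.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "am", "the", "to", "in", "that", "is", "not",
              "a", "an", "and", "or", "on", "my", "me", "this"}

# Crude suffix stripping as a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Return the stemmed, stop-word-free tokens of an HTML post body."""
    text = re.sub(r"<[^>]+>", " ", html)                 # drop HTML tags
    tokens = re.findall(r"[a-z0-9_+#.]+", text.lower())  # keep c++, socket.h, af_inet
    tokens = [t.strip(".") for t in tokens]              # trim sentence-ending dots
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


print(preprocess("<p>I am including sys/socket.h in the code.</p>"))
```

The token pattern deliberately keeps characters such as `+`, `_` and `.` so that identifiers like `c++`, `af_inet` and `socket.h` survive, as they do in the slide's example.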
Related Literature
• Categorized in 4 fields
  • The general study of Q&A websites
  • The study of Stack Overflow specifically
  • The study of other social platforms for developers
  • The use of LDA to study trends in software engineering data
• Difference with these studies
  • Aimed at the textual content generated by users instead of user activity
Opinion
STRONG POINTS
• Qualitative and quantitative techniques
• Large dataset
• Methodology applicable to other developer resources
WEAK POINTS
• Methodology does not incorporate a predictive model
• Experimentation with K value and value of threshold δ
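The slides do not define the threshold δ. Assuming it is a cutoff below which a post's membership in a topic is treated as noise, the sketch below shows how the choice of δ could change the results. The function names, topic labels, and membership values are all hypothetical.

```python
def filter_memberships(memberships, delta):
    """Keep only the topics whose membership value is at least delta."""
    return {topic: value for topic, value in memberships.items() if value >= delta}


def topic_share(all_memberships, topic, delta):
    """Average membership of `topic` over all posts, ignoring values below delta."""
    total = sum(
        m.get(topic, 0.0)
        for m in all_memberships
        if m.get(topic, 0.0) >= delta
    )
    return total / len(all_memberships)


# Hypothetical topic memberships for two posts (each post's values sum to 1).
posts = [
    {"web": 0.75, "database": 0.25},
    {"web": 0.05, "database": 0.95},
]

print(filter_memberships(posts[1], delta=0.10))
print(topic_share(posts, "web", delta=0.10))
```

With a higher δ more small memberships are discarded, so the measured share of each topic depends on this parameter as well as on the number of topics K.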