Secondary data analysis with digital trace data: Examples from FLOSS research
Andrea Wiggins
July 13, 2011
May 15, 2015
Secondary Data Analysis
• Uses existing data produced or collected by someone else, usually for a different purpose
• Databases
• Repositories
• Surveys
• Emails
• Social networks
Digital Trace Data
• Records of activity (trace data) undertaken through an online information system (thus digital)
• Increasingly common in studies of online phenomena
• Large volumes of available data
• Can be complete: a census, not a sample
• May be more reliably recorded than other data
Characteristics
1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data
Requirements
• Understand the original data source
• How it was collected, potential problems
• Limitations of the sample
• What the data describe
• Match with appropriate analysis methods and measures
• New types of data may require new measures
• Theoretical coherence is very important
Advantages
• Data may be “complete”
• Usually no response bias (exception: cookies)
• May cover long periods of time and large groups
• Multiple different data types, but mostly textual
• Data are often easy to acquire
• APIs or scraping web pages (with caution)
• Databases, archives, or repositories of research data
• But remember: you usually get what you pay for!
Disadvantages
• Often difficult to know limitations of data
• Data may be poorly documented
• Original creator may not be available for comment
• Volume of data can be overwhelming
• Sampling strategies needed, e.g., temporal, random
• Substantial time required for data preparation: 90% of effort
• Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
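The two sampling strategies named above (temporal and random) can be sketched as follows. This is a minimal illustration, not the procedure used in the studies; the record layout (a dict with a "date" key) and the function names are assumptions made for the example.

```python
import random
from datetime import date

def temporal_sample(records, start, end):
    """Keep only records whose date falls in the half-open range [start, end)."""
    return [r for r in records if start <= r["date"] < end]

def random_sample(records, k, seed=0):
    """Draw up to k records uniformly at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

# Hypothetical event records, one per day in January 2006.
records = [{"id": i, "date": date(2006, 1, i)} for i in range(1, 32)]

first_week = temporal_sample(records, date(2006, 1, 1), date(2006, 1, 8))
five_random = random_sample(records, 5, seed=42)
```

Fixing the seed keeps a random sample reproducible, which matters when the analysis has to be rerun after each newly discovered exception in the data.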
Example: Email Networks
• Data source: email listservs for FLOSS projects
• Analysis approach: create social networks
• Within discussion threads, individuals are nodes, and links are reply-to messages
• Some conceptual issues for interpretation, choice of measures
• Technical challenges
• Temporal aggregation
• Identity resolution
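A minimal sketch of the network construction described above: individuals become nodes, and each reply adds a directed link from the replier to the author of the message replied to. The message fields ("id", "sender", "in_reply_to") and the alias table used for identity resolution are assumptions for illustration; real listserv archives need considerably more cleaning.

```python
from collections import Counter

def normalize_sender(address, aliases):
    """Resolve an address to one canonical identity (lowercased, then alias lookup)."""
    key = address.strip().lower()
    return aliases.get(key, key)

def build_reply_edges(messages, aliases=None):
    """Count directed (replier, original author) links within discussion threads."""
    aliases = aliases or {}
    sender_of = {m["id"]: normalize_sender(m["sender"], aliases) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender_of:          # skip thread starters and missing parents
            src, dst = sender_of[m["id"]], sender_of[parent]
            if src != dst:               # ignore replies to oneself
                edges[(src, dst)] += 1
    return edges

# Hypothetical thread: the same person posts from two addresses.
messages = [
    {"id": "m1", "sender": "Alice@example.org", "in_reply_to": None},
    {"id": "m2", "sender": "bob@example.org", "in_reply_to": "m1"},
    {"id": "m3", "sender": "alice@home.example", "in_reply_to": "m2"},
    {"id": "m4", "sender": "alice@example.org", "in_reply_to": "m3"},
]
aliases = {"alice@home.example": "alice@example.org"}
edges = build_reply_edges(messages, aliases)
```

Without the alias table, the two addresses would appear as two separate nodes, and the last message would count as a link between them instead of a self-reply.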
Temporal Aggregation
Figures from Howison et al., 2006
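One way to aggregate event data over time is to slice it into overlapping windows and build one network per window. The sketch below assumes events are (timestamp, payload) tuples; the 90-day window and 30-day step are illustrative defaults, not the values used in Howison et al., 2006.

```python
from datetime import datetime, timedelta

def sliding_windows(events, window_days=90, step_days=30):
    """Yield (window_start, events_in_window) for successive, possibly overlapping windows."""
    if not events:
        return
    events = sorted(events, key=lambda e: e[0])
    start, last = events[0][0], events[-1][0]
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    while start <= last:
        end = start + window
        # Half-open interval: an event at exactly `end` belongs to a later window.
        yield start, [e for e in events if start <= e[0] < end]
        start += step

# Hypothetical reply events.
events = [
    (datetime(2006, 1, 1), "m1"),
    (datetime(2006, 1, 20), "m2"),
    (datetime(2006, 3, 15), "m3"),
]
counts = [len(batch) for _, batch in sliding_windows(events, window_days=30, step_days=30)]
```

Shorter windows produce sparser networks and longer windows merge activity that never overlapped in time, so the window length is itself an analytic decision.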
Network Workflow
Network Results
Cleaning up before shutting down
• Observed anomalous patterns in trackers for both projects: periodic centralization spikes
• A single user makes batch bug closings (up to 279!)
– Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
– Gaim’s tracker housekeeping was more regular and repeated
• Different levels of correlation between venues, suggesting different types of interactions
• User venues more decentralized than developer venues, reflecting greater number of participants
• Overall trend toward decentralization could be result of different influences
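To make "centralization" concrete, the sketch below computes Freeman's degree centralization on an undirected network: 1.0 for a star, where one node is tied to everyone else, and 0.0 when all nodes have the same degree. The slides do not say which centralization measure was used, so treat this as one common choice rather than the study's own; the function name is an assumption.

```python
def degree_centralization(edges):
    """Freeman degree centralization for an undirected simple graph given as (a, b) pairs."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue                      # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0                        # measure is undefined for fewer than 3 nodes
    degrees = [len(v) for v in neighbors.values()]
    max_degree = max(degrees)
    # Sum of gaps to the most central node, divided by the maximum possible sum (a star).
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))

# One user touching many otherwise unconnected reports yields a star-like pattern.
star = [("closer", f"reporter{i}") for i in range(4)]
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
```

A batch of closings by a single user pushes a time window toward the star pattern, which is how tracker housekeeping can show up as a centralization spike.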
Example: Classification
• Replication of success-tragedy classification
• Classification criteria originally drawn from interviews with community members
• Data extracted from repositories
• Technical challenges
• Merging data from two repositories
• Processing large volume of data in multiple steps
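Merging records from two repositories usually comes down to joining on a shared key and keeping track of what fails to match. The sketch below assumes each record is a dict with a "project" name and that names differ only in case and surrounding whitespace; the field names are hypothetical, and real merges tend to need more careful matching.

```python
def merge_by_project(primary, secondary):
    """Join two record lists on normalized project name; return (merged, unmatched names)."""
    index = {r["project"].strip().lower(): r for r in secondary}
    merged, unmatched = [], []
    for record in primary:
        key = record["project"].strip().lower()
        other = index.get(key)
        if other is None:
            unmatched.append(record["project"])
        else:
            # Fields from the primary source win when both sources define them.
            merged.append({**other, **record, "project": key})
    return merged, unmatched

# Hypothetical records from two sources.
repo_a = [{"project": "Gaim", "downloads": 1000}, {"project": "Fire", "downloads": 200}]
repo_b = [{"project": "gaim ", "url": "http://example.org/gaim"}]
merged, unmatched = merge_by_project(repo_a, repo_b)
```

Reporting the unmatched names instead of dropping them silently makes it possible to see how much of the data the merge actually covers.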
Variables
• Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
• Project statistics retrieved from repositories
• Founding date
• Data collection date
• Dates for all releases
• Number of downloads
• URL
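The variables above feed threshold tests of the following general shape: derive a few measures from the project statistics, then compare each to a threshold. This sketch does not reproduce the actual success-tragedy criteria or their five thresholds; the three tests, their names, and the threshold values are hypothetical and only show the mechanics.

```python
from datetime import date

def derive_measures(founded, collected, releases, downloads):
    """Turn raw project statistics into measures that threshold tests can use."""
    releases = sorted(releases)
    return {
        "age_days": (collected - founded).days,
        "release_count": len(releases),
        "days_since_last_release": (collected - releases[-1]).days if releases else None,
        "downloads": downloads,
    }

def threshold_tests(measures, min_releases, min_downloads, max_idle_days):
    """Apply illustrative threshold tests; each result is a boolean."""
    idle = measures["days_since_last_release"]
    return {
        "enough_releases": measures["release_count"] >= min_releases,
        "enough_downloads": measures["downloads"] >= min_downloads,
        "recently_active": idle is not None and idle <= max_idle_days,
    }

# Hypothetical project statistics and thresholds.
measures = derive_measures(
    founded=date(2005, 1, 1),
    collected=date(2006, 1, 1),
    releases=[date(2005, 12, 2), date(2005, 6, 1)],
    downloads=50,
)
results = threshold_tests(measures, min_releases=3, min_downloads=10, max_idle_days=365)
```

Passing the thresholds as inputs, as the slide describes, lets the same workflow be rerun with different values to see how sensitive the class counts are to each one.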
Classification Workflow
Classification Results
Class            Original         Our results      Difference
unclassifiable   3 186            3 296            +110
II               13 342 (12%)     16 252 (14%)     +2 910 (+2%)
IG               10 711 (10%)     12 991 (11%)     +2 280 (+1%)
TI               37 320 (35%)     36 507 (31%)     -813 (-4%)
TG               30 592 (28%)     32 642 (28%)     +2 050 (0%)
SG               15 782 (15%)     16 045 (14%)     +263 (-1%)
other            8 422            0
Total            119 355          117 733
Thanks!
• Questions?