Secondary data analysis with digital trace data

Examples from FLOSS research

Andrea Wiggins

July 13, 2011

Secondary Data Analysis

• Uses existing data produced or collected by someone else, usually for a different purpose

• Databases

• Repositories

• Surveys

• Emails

• Social networks


Digital Trace Data

• Records of activity (trace data) undertaken through an online information system (thus digital)

• Increasingly common in studies of online phenomena

• Large volumes of available data

• Can be complete: a census, not a sample

• May be more reliably recorded than other data


Characteristics

1. Found data (not produced for research)

2. Event-based data (not summary data)

3. Events occur over time, so it is longitudinal data


Requirements

• Understand the original data source

• How it was collected, potential problems

• Limitations of the sample

• What the data describe

• Match with appropriate analysis methods and measures

• New types of data may require new measures

• Theoretical coherence is very important


Advantages

• Data may be “complete”

• Usually no response bias (exception: cookies)

• May cover long periods of time and large groups

• Multiple different data types, but mostly textual

• Data are often easy to acquire

• APIs or scraping web pages (with caution)

• Databases, archives, or repositories of research data

• But remember: you usually get what you pay for!


Disadvantages

• Often difficult to know limitations of data

• Data may be poorly documented

• Original creator may not be available for comment

• Volume of data can be overwhelming

• Sampling strategies needed, e.g., temporal, random

• Substantial time required for data preparation: 90% of effort

• Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
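The sampling strategies mentioned above (temporal, random) can be sketched in a few lines. This is a minimal illustration, not a recommended design: the `events` list, the actor names, and the helper names `temporal_sample` and `random_sample` are all hypothetical.

```python
import random
from datetime import datetime

# Hypothetical event records: (timestamp, actor) pairs standing in for trace data.
events = [
    (datetime(2005, 1, 3), "alice"),
    (datetime(2005, 1, 20), "bob"),
    (datetime(2005, 2, 11), "alice"),
    (datetime(2005, 3, 7), "carol"),
    (datetime(2005, 3, 30), "bob"),
    (datetime(2005, 4, 15), "dave"),
]

def temporal_sample(events, start, end):
    """Keep only events whose timestamp falls in the half-open range [start, end)."""
    return [e for e in events if start <= e[0] < end]

def random_sample(events, k, seed=0):
    """Draw up to k events uniformly at random, without replacement.

    A fixed seed keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))

first_quarter = temporal_sample(events, datetime(2005, 1, 1), datetime(2005, 4, 1))
subset = random_sample(events, 3)
```

Fixing the seed matters when the analysis must be rerun after an exception breaks a later step, since the same subset can then be reproduced.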


Example: Email Networks

• Data source: email listservs for FLOSS projects

• Analysis approach: create social networks

• Within discussion threads, individuals are nodes, and links are reply-to messages

• Some conceptual issues for interpretation, choice of measures

• Technical challenges

• Temporal aggregation

• Identity resolution
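A minimal sketch of the network construction described above, with a very simple form of identity resolution. The message tuples, the `aliases` table, and the function names are hypothetical. Real listserv archives need header parsing and far messier alias matching than the lowercase lookup assumed here.

```python
from collections import Counter

# Hypothetical messages: (message_id, sender_address, in_reply_to_id or None).
messages = [
    ("m1", "alice@example.org", None),
    ("m2", "bob@example.org", "m1"),
    ("m3", "Alice@Example.org", "m2"),
    ("m4", "carol@example.org", "m1"),
    ("m5", "bob@example.org", "m3"),
]

# Hypothetical alias table mapping known addresses to a single identity.
aliases = {"alice@example.org": "alice", "bob@example.org": "bob"}

def resolve(address):
    """Map an address to one identity; unknown addresses stay as themselves."""
    key = address.strip().lower()
    return aliases.get(key, key)

def reply_network(messages):
    """Count directed reply links (replier, original sender) between identities."""
    sender_of = {mid: resolve(sender) for mid, sender, _ in messages}
    edges = Counter()
    for _, sender, parent in messages:
        if parent is None or parent not in sender_of:
            continue  # thread starter, or the parent is missing from the archive
        src, dst = resolve(sender), sender_of[parent]
        if src != dst:  # ignore self-replies
            edges[(src, dst)] += 1
    return edges

edges = reply_network(messages)
```

Without the resolution step, `Alice@Example.org` and `alice@example.org` would appear as two separate nodes, which distorts every network measure computed afterwards.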


Temporal Aggregation

Figures from Howison et al., 2006


Network Workflow

Network Results

Cleaning up before shutting down

• Observed anomalous patterns in trackers for both projects: periodic centralization spikes

• A single user makes batch bug closings (up to 279!)

– Fire’s (feature request) tracker housekeeping appears to be preparation for project closure

– Gaim’s tracker housekeeping was more regular and repeated

• Different levels of correlation between venues, suggesting different types of interactions

• User venues more decentralized than developer venues, reflecting greater number of participants

• Overall trend toward decentralization could be result of different influences
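To illustrate how a centralization spike can show up in a single time window, here is a sketch using Freeman's degree centralization on undirected edges. The slides do not state which centralization measure was used, so this choice is an assumption, and the per-window edge lists are hypothetical.

```python
from collections import defaultdict

def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph, in [0, 1].

    Nodes are taken from the edge list, so isolated nodes are not counted.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(v) for v in neighbors.values()]
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# Hypothetical edge lists for two time windows.
windows = {
    "window-1": [("a", "b"), ("b", "c"), ("c", "a")],  # evenly spread activity
    "window-2": [("a", "b"), ("a", "c"), ("a", "d")],  # one actor touches everyone
}
scores = {name: degree_centralization(e) for name, e in windows.items()}
```

A window in which one user closes many items, as in the batch closings above, resembles the second case and pushes the score toward 1.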


Example: Classification

• Replication of success-tragedy classification

• Classification criteria originally drawn from interviews with community members

• Data extracted from repositories

• Technical challenges

• Merging data from two repositories

• Processing large volume of data in multiple steps
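Merging data from two repositories might look roughly like the sketch below. It assumes projects can be matched on a normalized name, which is rarely this clean in practice. The record fields, the project name, and the values are hypothetical.

```python
def normalize(name):
    """Normalize a project name so both repositories use the same key."""
    return name.strip().lower()

def merge_projects(repo_a, repo_b):
    """Merge per-project records from two sources, keyed by normalized name.

    For each field, the first non-missing value wins; the sources that
    contributed a record are tracked so conflicts can be audited later.
    """
    merged = {}
    for source, records in (("a", repo_a), ("b", repo_b)):
        for rec in records:
            key = normalize(rec["name"])
            entry = merged.setdefault(key, {"name": key, "sources": []})
            entry["sources"].append(source)
            for field, value in rec.items():
                if field != "name" and value is not None:
                    entry.setdefault(field, value)
    return merged

# Hypothetical records from two repositories.
repo_a = [{"name": "Proj-X", "downloads": 1200, "founded": None}]
repo_b = [
    {"name": " proj-x", "downloads": 900, "founded": "2003-05-01"},
    {"name": "proj-y", "downloads": 10, "founded": "2004-02-10"},
]
merged = merge_projects(repo_a, repo_b)
```

Keeping the `sources` list makes it possible to see which projects appear in only one repository before moving on to the next processing step.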


Variables

• Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads

• Project statistics retrieved from repositories

• Founding date

• Data collection date

• Dates for all releases

• Number of downloads

• URL
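The sketch below shows how threshold inputs and project statistics could be combined into classification tests. It does not reproduce the actual success-tragedy criteria: the three thresholds, the test names, and the example values are hypothetical, whereas the slides mention five threshold values.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProjectStats:
    name: str
    founded: date      # founding date
    collected: date    # data collection date
    releases: list     # dates of all releases
    downloads: int     # number of downloads

@dataclass
class Thresholds:
    # Hypothetical thresholds, not the five used in the original study.
    min_releases: int
    min_downloads: int
    max_days_since_release: int

def run_tests(p, t):
    """Evaluate each threshold test and return the outcomes by name."""
    last = max(p.releases) if p.releases else None
    return {
        "has_enough_releases": len(p.releases) >= t.min_releases,
        "has_enough_downloads": p.downloads >= t.min_downloads,
        "recently_released": last is not None
        and (p.collected - last).days <= t.max_days_since_release,
        "age_days": (p.collected - p.founded).days,
    }

project = ProjectStats(
    name="proj-x",
    founded=date(2003, 1, 1),
    collected=date(2006, 1, 1),
    releases=[date(2004, 6, 1), date(2005, 11, 1)],
    downloads=500,
)
outcome = run_tests(project, Thresholds(3, 100, 365))
```

Returning the individual test outcomes, rather than a single label, makes it easier to trace why a project ended up in a given class when results differ from the original study.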


Classification Workflow

Classification Results

Class            Original        Our results     Difference
unclassifiable   3,186           3,296           +110
II               13,342 (12%)    16,252 (14%)    +2,910 (+2%)
IG               10,711 (10%)    12,991 (11%)    +2,280 (+1%)
TI               37,320 (35%)    36,507 (31%)    -813 (-4%)
TG               30,592 (28%)    32,642 (28%)    +2,050 (0%)
SG               15,782 (15%)    16,045 (14%)    +263 (-1%)
other            8,422           0
Total            119,355         117,733


Thanks!

• Questions?

