The Network structure of R packages on CRAN & BioConductor

Post on 19-Aug-2015


Andrie de Vries
adevries@microsoft.com, @RevoAndrie

Joseph Rickert
jrickert@microsoft.com, @RevoJoe

JSM 2015, Seattle

Background

• R is an incredibly successful open source software project

• R has a large, thriving community with tens of thousands of contributors and over 7K contributed packages on CRAN, BioConductor and GitHub

• How do you begin to find what you are looking for?

• Before designing search algorithms, it is reasonable to study the structure of CRAN and BioConductor

Modeling CRAN and BioConductor

Hypothesis:

CRAN and BioConductor have different management structures:

• CRAN: almost anything goes

• BioConductor: focused and centrally managed

As a result, CRAN and BioConductor have discernibly different package network structures.

Objectives of this study:

• Explore the network graph of CRAN and BioConductor

• Characterize their respective network structures

• Develop preliminary models to look for structural differences


A network of package dependencies

[Figures: package dependency network graphs for CRAN and for BioConductor]

Note: Colour indicates communities found by the walktrap algorithm; the colours have no common meaning across the two networks.
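To make the idea of a dependency network concrete, here is a minimal Python sketch. It assumes a hypothetical toy dependency table (the package names are invented, not real CRAN or BioConductor entries) and is not the authors' code; it only illustrates how packages become nodes and dependencies become edges, and how unconnected packages show up as nodes of degree 0.

```python
# Hypothetical toy dependency table: package -> packages it depends on.
# Illustrative only; these are not real CRAN or BioConductor packages.
TOY_DEPENDENCIES = {
    "pkgA": ["pkgB", "pkgC"],
    "pkgB": ["pkgC"],
    "pkgC": [],
    "pkgD": [],  # depends on nothing, and nothing depends on it
    "pkgE": ["pkgC"],
}


def build_network(dependencies):
    """Return (sorted node list, set of directed (package, dependency) edges)."""
    nodes = set(dependencies)
    edges = set()
    for package, deps in dependencies.items():
        for dep in deps:
            nodes.add(dep)  # a dependency is a node even if it has no own entry
            edges.add((package, dep))
    return sorted(nodes), edges


def degree_table(nodes, edges):
    """Total degree (in + out) of every node, including isolated ones."""
    degree = {node: 0 for node in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


nodes, edges = build_network(TOY_DEPENDENCIES)
degree = degree_table(nodes, edges)
isolated = [node for node in nodes if degree[node] == 0]
print(len(nodes), "nodes,", len(edges), "edges, isolated:", isolated)
```

In this toy table, pkgD is the only unconnected package.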


Graph statistics

      nodes  edges  average.path.length  assortativity.degree  no.clusters  cluster.coef
cran   6867  14749                 2.72                -0.082         1573         0.015
bioc   1552   5756                 1.95                -0.078           70         0.060

Observe:

• CRAN is ~4.5 times larger than BioConductor (6867 vs. 1552 nodes)

• But CRAN has ~20 times more clusters (1573 vs. 70), i.e. many more, but smaller, clusters

• This indicates that BioConductor is in fact more strongly clustered, as confirmed by its higher transitivity (clustering) coefficient (0.060 vs. 0.015)
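As a reminder of what the transitivity (clustering) coefficient measures, the following Python sketch computes it for a hypothetical four-node toy graph. It assumes the usual global definition (closed connected triples divided by all connected triples) on an undirected graph; whether the slides' cluster.coef column was computed exactly this way is an assumption.

```python
from itertools import combinations


def to_adjacency(edges):
    """Undirected adjacency sets from an edge list, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def transitivity(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = 0
    triples = 0
    for node, neighbours in adj.items():
        k = len(neighbours)
        triples += k * (k - 1) // 2  # triples centred on this node
        for a, b in combinations(neighbours, 2):
            if b in adj[a]:  # the two neighbours are themselves linked
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical toy graph: a triangle (1, 2, 3) with one pendant node (4).
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
toy_transitivity = transitivity(to_adjacency(toy_edges))
print(toy_transitivity)
```

The toy graph has five connected triples, three of which are closed by the triangle, so the coefficient is 3/5.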

Bootstrapped cluster coefficient

Bootstrap sample: n = 1000, size of each subgraph = 500 nodes, no replacement

Two-sample Kolmogorov-Smirnov test

data: CRAN and BioConductor
D = 0.643, p-value < 2.2e-16
alternative hypothesis: two-sided
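The resampling scheme described above (repeatedly drawing a subset of nodes without replacement and computing the clustering coefficient of the induced subgraph) can be sketched as follows. This is a minimal Python illustration on a hypothetical random toy graph with smaller sizes than the n = 1000 samples of 500 nodes used in the slides; the function names and parameters are assumptions, not the authors' scripts.

```python
import random


def transitivity(adj):
    """Global clustering coefficient for {node: set(neighbours)}."""
    closed = 0
    triples = 0
    for node, neighbours in adj.items():
        k = len(neighbours)
        triples += k * (k - 1) // 2
        # Every linked pair of neighbours is counted twice in this sum.
        closed += sum(len(adj[a] & neighbours) for a in neighbours) // 2
    return closed / triples if triples else 0.0


def induced_subgraph(adj, keep):
    """Restrict the graph to the nodes in `keep`, dropping all other edges."""
    keep = set(keep)
    return {node: adj[node] & keep for node in keep}


def bootstrap_transitivity(adj, n_samples, subgraph_size, seed=0):
    """Clustering coefficient of `n_samples` random induced subgraphs."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    results = []
    for _ in range(n_samples):
        chosen = rng.sample(nodes, subgraph_size)  # without replacement
        results.append(transitivity(induced_subgraph(adj, chosen)))
    return results


def random_graph(n_nodes, edge_probability, rng):
    """Hypothetical toy graph: each pair is linked with a fixed probability."""
    adj = {i: set() for i in range(n_nodes)}
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < edge_probability:
                adj[i].add(j)
                adj[j].add(i)
    return adj


toy_graph = random_graph(200, 0.05, random.Random(42))
coefs = bootstrap_transitivity(toy_graph, n_samples=100, subgraph_size=50, seed=1)
print(sum(coefs) / len(coefs))
```

Running the same procedure on two networks gives two samples of coefficients, which can then be compared with a two-sample test such as the Kolmogorov-Smirnov test reported above.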

Analysis of degree distribution

Notice the difference at degree = 0

      power.law.fit  power.law.xmin  power.law.KS.p
cran           2.55               5           0.061
bioc           2.59               9           0.632
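For readers unfamiliar with power-law fitting, the sketch below shows the standard maximum-likelihood estimate of the exponent for values at or above a chosen xmin, alpha = 1 + n / sum(ln(x_i / xmin)). It is a hedged illustration only: the slides do not say which fitting routine produced the table, the `discrete` option uses the common xmin - 0.5 approximation for integer data such as degrees, and the synthetic data is generated here purely for the demonstration.

```python
import math
import random


def power_law_alpha(values, xmin, discrete=False):
    """Maximum-likelihood power-law exponent for the tail values >= xmin.

    With discrete=True, xmin is shifted by 0.5, a common approximation
    for integer-valued data.
    """
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    shift = 0.5 if discrete else 0.0
    log_sum = sum(math.log(v / (xmin - shift)) for v in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum


# Synthetic continuous power-law sample (assumed alpha = 2.5, xmin = 5),
# drawn by inverse-transform sampling; not CRAN or BioConductor data.
TRUE_ALPHA = 2.5
XMIN = 5.0
rng = random.Random(7)
synthetic = [
    XMIN * (1.0 - rng.random()) ** (-1.0 / (TRUE_ALPHA - 1.0))
    for _ in range(20000)
]
estimate = power_law_alpha(synthetic, XMIN)
print(round(estimate, 2))
```

The estimate should land close to the exponent used to generate the sample, which is a quick sanity check before applying the estimator to real degree data.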

Comparing degree distribution

Degree distribution of CRAN and BioConductor

Two-sample Kolmogorov-Smirnov test

D^+ = 0.19943, p-value < 2.2e-16

alternative hypothesis: the CDF of x lies above that of y
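The statistics reported by the two-sample Kolmogorov-Smirnov test are differences between empirical CDFs: D is the largest absolute gap, and D^+ is the largest gap in the direction "CDF of x above CDF of y". A minimal Python sketch, using two small hypothetical degree samples rather than the real distributions:

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistics(x, y):
    """Return (D, D_plus, D_minus) for two samples.

    D_plus  = max over t of F_x(t) - F_y(t)
    D_minus = max over t of F_y(t) - F_x(t)
    D       = max(D_plus, D_minus)
    """
    fx, fy = ecdf(x), ecdf(y)
    points = sorted(set(x) | set(y))  # the ECDFs only change at these values
    diffs = [fx(p) - fy(p) for p in points]
    d_plus = max(max(diffs), 0.0)
    d_minus = max(-min(diffs), 0.0)
    return max(d_plus, d_minus), d_plus, d_minus


# Hypothetical toy degree samples: x has many zero-degree nodes, y has none.
x_degrees = [0, 0, 0, 1, 2]
y_degrees = [1, 2, 2, 3, 5]
d, d_plus, d_minus = ks_statistics(x_degrees, y_degrees)
print(d, d_plus, d_minus)
```

In the toy data the sample with many zeros has its CDF above the other everywhere, so D equals D^+ and D^- is zero, the same pattern a large mass at degree 0 would produce.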

Resampling from degree distribution

• The original network samples are comparatively large, so a test is almost certain to find a statistically significant difference

• Sub-sampling allows us to look at a finer level of detail

[Figures: a typical small sample (n = 100); distribution of p-values across resamples]
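The effect of sample size on significance can be illustrated with the asymptotic p-value of the one-sided two-sample KS test, p = exp(-2 D^2 nm / (n + m)). The sketch below plugs in the D^+ = 0.19943 reported above and assumes, for illustration only, that the full samples have the node counts from the graph statistics (6867 and 1552) and that the same D^+ would be observed in two subsamples of 100. The formula is an approximation, particularly for discrete data with ties such as degrees.

```python
import math


def one_sided_ks_pvalue(d_plus, n, m):
    """Asymptotic p-value for the one-sided two-sample KS statistic D^+."""
    effective_n = n * m / (n + m)
    return math.exp(-2.0 * effective_n * d_plus ** 2)


D_PLUS = 0.19943  # value reported in the slides

# Assumed sample sizes: node counts of the full networks vs. small subsamples.
p_full = one_sided_ks_pvalue(D_PLUS, 6867, 1552)
p_small = one_sided_ks_pvalue(D_PLUS, 100, 100)

print(f"full samples:  p = {p_full:.3g}")
print(f"n = m = 100:   p = {p_small:.3g}")
```

With the full sample sizes the p-value is far below 2.2e-16, while with 100 observations per group the same gap gives a p-value on the order of 0.02, which is why the distribution of p-values over many small resamples is more informative than a single test on the full data.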


Exponential Random Graph Modeling (ERGM)

Formula: bioc_net ~ edges + degree(c(1, 2))
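An ERGM models the probability of a graph as proportional to exp(theta · g(y)), where g(y) is a vector of network statistics. For the formula above, those statistics are the edge count and the number of nodes with degree exactly 1 and exactly 2. The Python sketch below only computes these statistics for a hypothetical toy graph, treated as undirected (an assumption); it does not fit the model, which requires dedicated software.

```python
from collections import Counter


def ergm_statistics(nodes, edges, degrees_of_interest=(1, 2)):
    """Sufficient statistics for an `edges + degree(...)` style model.

    Returns the number of undirected edges and, for each requested degree d,
    the number of nodes whose degree is exactly d.
    """
    # Deduplicate edges and ignore direction and self-loops.
    edge_set = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in edge_set:
        for node in edge:
            degree[node] += 1
    stats = {"edges": len(edge_set)}
    for d in degrees_of_interest:
        stats[f"degree{d}"] = sum(1 for node in nodes if degree[node] == d)
    return stats


# Hypothetical toy network: a triangle (a, b, c), a pendant (d), an isolate (e).
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
toy_stats = ergm_statistics(toy_nodes, toy_edges)
print(toy_stats)
```

Fitting the model amounts to finding coefficients for these statistics that make the observed counts typical of graphs drawn from the model.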

Conclusions:

• The network structures of CRAN and BioConductor are detectably different

• The large number of unconnected packages is a dominant feature of CRAN

• Large communities form around infrastructure and tools packages on CRAN

• Preliminary modeling indicates that feature-driven random graph models will be productive

Next steps: join the project!!

Scripts available at: https://github.com/andrie/cran-network-structure
