The network structure of R packages on CRAN and BioConductor

Andrie de Vries

adevries@microsoft.com, @RevoAndrie

JSM 2015, Seattle

Joseph Rickert

jrickert@microsoft.com, @RevoJoe

Background

• R is an incredibly successful open source software project

• R has a large, thriving community with tens of thousands of contributors and over 7K contributed packages on CRAN, BioConductor and GitHub

• How do you begin to find what you are looking for?

• Before designing search algorithms, it is reasonable to study the structure of CRAN and BioConductor

Modeling CRAN and BioConductor

Hypothesis:

Having different management structures:

• CRAN: almost anything goes

• BioConductor: focused and centrally managed

CRAN and BioConductor have discernibly different package network structures.

Objectives of this study:

• Explore the network graph of CRAN and BioConductor

• Characterize their respective network structures

• Develop preliminary models to look for structural differences


A network of package dependencies

[Figure: package dependency networks, one panel for CRAN and one for BioConductor]

*Note: Colour indicates communities found by the walktrap algorithm, but has no common meaning in the two networks

Graph statistics

      nodes  edges  average.path.length  assortativity.degree  no.clusters  cluster.coef
cran   6867  14749                 2.72                -0.082         1573         0.015
bioc   1552   5756                 1.95                -0.078           70         0.060

Observe:

• CRAN is ~4.5 times larger than BioConductor

• But CRAN has ~20 times more clusters, i.e. many more, but smaller, clusters

• This indicates that BioConductor is in fact more strongly clustered, as confirmed by its higher transitivity (clustering) coefficient
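To make the transitivity (clustering) coefficient concrete, here is a minimal sketch in plain Python. It assumes an undirected simple graph given as an edge list; the function name `transitivity` and the toy edge list are illustrative only and are not taken from the authors' scripts.

```python
from itertools import combinations


def transitivity(edges):
    """Global clustering coefficient: closed triples / all connected triples."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    triples = 0  # paths of length two, counted once per centre node
    closed = 0   # those paths whose two end points are also linked
    for centre, nbrs in neighbours.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in neighbours[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical toy graph: one triangle (A, B, C) plus a pendant node D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_transitivity = transitivity(toy_edges)
```

A value near 0, as in the table for both repositories, means that two packages linked to the same package are rarely linked to each other.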

Bootstrapped cluster coefficient

Bootstrap sample: n = 1000, size of each subgraph = 500 nodes, no replacement

Two-sample Kolmogorov-Smirnov test

data: CRAN and BioConductor
D = 0.643, p-value < 2.2e-16
alternative hypothesis: two-sided
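The bootstrap procedure can be sketched as follows: draw node subsets without replacement, compute the transitivity of each induced subgraph, then compare the two resulting samples with the two-sample Kolmogorov-Smirnov statistic D. This is a plain-Python sketch under stated assumptions, not the authors' code. The two toy graphs (disjoint cliques and a ring), the subgraph size of 20, the 200 replicates and all function names are made up for illustration; the deck uses 1000 replicates of 500 nodes.

```python
import random
from bisect import bisect_right
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def induced_transitivity(adj, nodes):
    """Transitivity of the subgraph induced by the given nodes."""
    keep = set(nodes)
    triples = closed = 0
    for centre in keep:
        for a, b in combinations(adj[centre] & keep, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def bootstrap_transitivity(adj, size, reps, rng):
    """Transitivity of `reps` random induced subgraphs of `size` nodes."""
    population = sorted(adj)
    return [induced_transitivity(adj, rng.sample(population, size))
            for _ in range(reps)]


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
               for v in xs + ys)


# Hypothetical toy networks: ten disjoint 4-cliques versus a 40-node ring.
clique_edges = [(4 * c + i, 4 * c + j)
                for c in range(10) for i, j in combinations(range(4), 2)]
ring_edges = [(i, (i + 1) % 40) for i in range(40)]

rng = random.Random(2015)
clustered = bootstrap_transitivity(adjacency(clique_edges), size=20, reps=200, rng=rng)
sparse = bootstrap_transitivity(adjacency(ring_edges), size=20, reps=200, rng=rng)
d_stat = ks_statistic(clustered, sparse)
```

Sampling without replacement keeps each subgraph a genuine induced piece of the original network, which matches the "no replacement" note above.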

Analysis of degree distribution

Notice the difference at degree = 0

      power.law.fit  power.law.xmin  power.law.KS.p
cran           2.55               5           0.061
bioc           2.59               9           0.632
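For readers unfamiliar with power-law fitting, the sketch below estimates the exponent of the degree tail above a chosen cut-off `xmin`, using the widely used approximate maximum-likelihood estimator for discrete data, alpha ≈ 1 + n / Σ ln(x_i / (xmin − 0.5)). This is an assumption about the general technique, not a reproduction of how the table above was computed, and the degree sample is hypothetical.

```python
import math


def power_law_alpha(degrees, xmin):
    """Approximate MLE of the power-law exponent for degrees >= xmin."""
    if xmin < 1:
        raise ValueError("xmin must be at least 1")
    tail = [d for d in degrees if d >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    # Every term is positive because d >= xmin > xmin - 0.5.
    log_sum = sum(math.log(d / (xmin - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sample, heavily skewed towards small degrees.
toy_degrees = [1] * 60 + [2] * 20 + [3] * 9 + [5] * 5 + [8] * 3 + [13] * 2 + [40]
toy_alpha = power_law_alpha(toy_degrees, xmin=1)
```

The estimate only uses degrees at or above `xmin`, so nodes with degree 0 (the unconnected packages) do not enter the fit at all.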

Comparing degree distribution

Degree distribution of CRAN and BioConductor

Two-sample Kolmogorov-Smirnov test

D^+ = 0.19943, p-value < 2.2e-16

alternative hypothesis: the CDF of x lies above that of y

Resampling from degree distribution

• The original networks are comparatively large, so a test on the full samples is almost certain to find differences

• Sub-sampling allows us to look at a finer level of detail

Typical small sample, n = 100

P-value distribution
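The resampling idea can be sketched like this: repeatedly draw small samples (n = 100) from each degree distribution, run a two-sample Kolmogorov-Smirnov test on each pair, and collect the p-values. The sketch is plain Python under stated assumptions: both degree populations are invented, the p-value uses the asymptotic Kolmogorov series (only approximate for tied, discrete data such as degrees), and a two-sided test is assumed because the deck does not say which alternative was used for the sub-samples.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
               for v in xs + ys)


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges poorly here and the value is ~1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def resampled_pvalues(pop_a, pop_b, n, reps, rng):
    """P-values of KS tests on `reps` pairs of size-n sub-samples."""
    pvalues = []
    for _ in range(reps):
        a = rng.sample(pop_a, n)
        b = rng.sample(pop_b, n)
        pvalues.append(ks_pvalue(ks_statistic(a, b), n, n))
    return pvalues


# Hypothetical degree populations; the first has many more degree-0 nodes.
degrees_a = [0] * 300 + [1] * 400 + [2] * 200 + [5] * 100
degrees_b = [0] * 50 + [1] * 400 + [2] * 300 + [5] * 250

pvalues = resampled_pvalues(degrees_a, degrees_b, n=100, reps=200,
                            rng=random.Random(2015))
share_significant = sum(p < 0.05 for p in pvalues) / len(pvalues)
```

Looking at the whole distribution of p-values, rather than a single test on the full networks, shows how often a difference is still detectable at a sample size where detection is not guaranteed.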

Exponential Random Graph Modeling (ERGM)

Formula: bioc_net ~ edges + degree(c(1, 2))
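Written out, a formula of this shape corresponds to the standard exponential random graph model below. This is the generic ERGM form, not a result from the deck: E(y) denotes the number of edges in network y, D_k(y) the number of nodes with degree exactly k, and the θ coefficients are the parameters to be estimated (the deck does not report fitted values).

```latex
P_\theta(Y = y) =
  \frac{\exp\bigl(\theta_{\text{edges}}\, E(y) + \theta_{1}\, D_1(y) + \theta_{2}\, D_2(y)\bigr)}
       {\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}}
  \exp\bigl(\theta_{\text{edges}}\, E(y') + \theta_{1}\, D_1(y') + \theta_{2}\, D_2(y')\bigr)
```

The normalising constant κ(θ) sums over all networks on the same node set, which is why such models are usually fitted by simulation rather than by direct evaluation.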

Conclusions:

• The network structures of CRAN and BioConductor are detectably different

• The large number of unconnected packages is a dominant feature of CRAN

• Large communities form around infrastructure and tools packages on CRAN

• Preliminary modeling indicates that feature-driven random graph models will be productive

Next steps: join the project!!

Scripts available at: https://github.com/andrie/cran-network-structure

adevries@microsoft.com, @RevoAndrie

jrickert@microsoft.com, @RevoJoe
