
Extended Assortitivity and the Structure in the Open Source Development Community

Joseph A. Cottam∗ Andrew Lumsdaine†

Open Systems Laboratory; Indiana University; Bloomington, IN 47405

ABSTRACT

Open source software development represents the work of hundreds of thousands (perhaps millions) of developers around the world and is a critical component of many widely-used software products. Much of this development is self-organizing, taking place outside the structures of specific organizations. In this regard, open-source development communities are similar to academic research communities, which are also self-organizing and trans-institutional in nature. The structure of academic research communities has been characterized by the use of Bibliometrics techniques, which provide a quantitative basis for the self-organizing networks of academic artifacts (publications). In this paper, we introduce ‘Developmetrics,’ the application of Bibliometrics techniques to similarly characterize the artifacts produced by open source development (software). Similar cross-over analysis has been done before in the study of other self-organizing networks, such as the Internet. We use a standard Bibliometrics visualization technique to guide our discussion and we introduce extensions to concepts of network assortativity, node afference and node efference to accommodate network meta-data. With the help of these tools, we investigate issues related to community formation and the product development life-cycle. Through visualizations and measurement, we show that the strongest organizing factor in the open source community is the choice of programming language. This paper describes how we reached this and other insights, and discusses the implications these results may have on open source software development. We also discuss future questions and directions for Developmetrics analysis.

Keywords: Open source development, Network analysis, Visualization.

Index Terms: D.2.9 [Software Engineering]: Management—Programming teams, Life cycle; K.4.3 [Computers and Society]: Organizational Impacts—Computer-supported collaborative work

1 INTRODUCTION

Bibliometrics has helped the field of Library and Information Science by providing a well-grounded quantitative understanding of the relevant artifacts. The techniques of Bibliometrics have been applied in other fields to give a similar understanding of complex systems. We have adapted some of the techniques and tools of Bibliometrics to try to understand the development of open-source software. Open source software development is driven by a self-organizing network of individuals working on software projects, much like academic publications are based on self-organizing individuals working on publications. We believe that a Bibliometrics-inspired analysis of open-source software will illuminate: (1) the overarching community structure of software development; (2) the nature and formation environment of the sub-communities; and (3) the maturity of a project relative to its peers. We call the application of Bibliometrics to software development ‘Developmetrics’. This paper presents our initial application of Developmetrics to the open-source community hosted at SourceForge.net.

∗e-mail: [email protected] †e-mail: [email protected]

Open source has long been a mechanism for sharing software in research and academic circles. Over the last decade (fueled in part by the increasingly pervasive connectivity of the Internet) open source projects have become an important part of the mainstream software ecosystem. Open source projects span numerous program domains, languages, maturity levels and importance levels. Some of the most visible projects, such as the Apache HTTP Server [16] and the Linux Operating System [1], contribute to the successful daily operations of many corporations and to the Internet itself. Open source projects compete with traditional corporate projects in every application domain. They also cover comparable numbers of developers. Structurally, however, they are much different.

The principal difference between open source projects and typical structures in a corporate environment is the self-organizing nature of the open source community. In a corporate environment, individuals are assigned to specific projects by managers. In open-source, individuals freely engage and disengage in projects as they choose. How the individuals organize themselves around projects and how the projects in the community relate to each other is unknown. Understanding these structures could improve the effectiveness of open-source development by (1) enabling community building in a directed fashion and (2) illuminating potential biases in the open-source community formation process. Since communities are the basis of open source development, strengthening communities could benefit the entire software development process. However, since these communities are self-organizing, there is no a priori structure we can use to effectively strengthen them. Finding out the nature of the communities being built allows strengthening efforts to be targeted. Furthermore, identifying biases in community formation may reveal areas where the organizational tools do not succeed in forming relevant communities. Similar analysis has helped in the development and better leverage of the quintessential self-organizing network: the Internet [4].

Our first two questions are directly concerned with the nature of communities in open source software. What defines a community? How are communities formed? Understanding the characteristics of the community is important because it allows resources to be used more efficiently. Constructing tools to strengthen already strong community aspects will probably not have as large an impact as focusing efforts on places where the community is weak. This is similar to work on finding disciplines or genres in Bibliometrics, or to the more general sociological problem of community formation, both of which have been pursued from a network analysis standpoint [4]. Such characterizations have led to objective measurements for the impact of a particular paper or individual in their field, and the ability to understand critical components in a network. Such results have helped allocate scarce resources to the most important components of their respective networks.

Our third question relates to the maturity of individual projects within a community. Since software usage involves an investment of time to learn, identifying mature projects is of great importance. Identifying mature projects is also the first step towards being able to predict which projects will become mature in the future. One of the major differences between open-source software and corporate developed software is that open-source software is always available, regardless of its developmental state. This availability is a strength in that interested parties can influence the software's development, a hallmark of the open-source method. But it is also a weakness, since those interested only in using the software are confronted with projects in various stages of development and forced to choose which product to use. They may inadvertently invest time exploring immature products. Time lost on immature projects is a common problem for individuals who are not interested in becoming developers but are clients of the open-source software they investigate. Self reporting of program maturity is the common method of maturity estimation. However, the varying standards of those doing the estimation make it difficult to compare products using self reporting alone. We try to discern whether there are network-related characteristics of a project that may help predict the maturity of the software. Community structure can influence the maturity of a project, but the mechanics of that interaction are unknown.

Software development forms a number of different types of networks. The two most obvious are (1) the network formed by people as they work on projects and (2) the network of dependencies between different software projects. We investigated the former, following up on work demonstrated in [9] and [5]. These papers showed that the network formed by people working on projects has a structure similar to that found in other Bibliometrics-inspired analysis domains. We collected data from SourceForge.net [12], a popular open source development and distribution site.

2 META-ASSORTATIVITY

In order to understand some of the community structure, we extended a common network statistic, assortativity. In its classical meaning, assortativity is a measure of the consistency of a network’s structure in small subgraphs. A network is deemed ‘assortative’ if there is a positive correlation between a node’s degree and the degree of its neighbors, such that nodes tend to link to other nodes of a similar degree [17]. In this classical sense, assortativity is strictly a structural measurement and a measure of the whole network.
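To make the classical definition concrete, the following minimal Python sketch computes degree assortativity as the Pearson correlation between the degrees found at the two ends of every link. It is an illustration only and not code from this study; the function name `degree_assortativity` and the edge-list input format are assumptions, and the graph is assumed to be simple, undirected and non-empty.

```python
import math

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each link.

    `edges` is a list of (u, v) pairs describing an undirected simple graph
    (hypothetical input format chosen for this sketch).
    """
    # Count the degree of every node.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Each undirected link contributes both orientations, keeping the
    # measure symmetric in its two ends.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined when every node has the same degree
    return cov / math.sqrt(var_x * var_y)

# A star graph is disassortative: the hub links only to low-degree leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(degree_assortativity(star))
```

A positive result indicates an assortative network in the sense described above; a negative result indicates that high-degree nodes tend to link to low-degree nodes.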

A locally-defined concept that takes meta-data into account is afference (and its inverse, efference). When a node n with attribute X links to other nodes with attribute X with higher than background probability, n is said to be an ‘afferent node’. This definition can be extended to sets of nodes as well; when linking inside of a group of nodes is higher than expected, that group is said to be afferent. If the group is defined by a particular attribute X, it is called an attribute X afferent group. Afference is an inherently local attribute and it is difficult to compare the relative afference of two groups of differing sizes (as such, it is usually treated as a binary variable).

Combining the definitions of assortativity and afference, we create meta-assortativity. We say a network exhibits attribute X meta-assortativity if there is a positive correlation between a node having a particular meta-data attribute (or attribute level) and a node’s neighbors having the same attribute (or attribute level). The level of the meta-assortativity is determined by both the degree to which linking deviates from the expected value and the number of nodes that share attribute X in the network. Meta-assortativity thus becomes the weighting of afference by occurrence. We feel this captures important structural and meta-data aspects of the network in a single measure. Large groups with large afference are of more importance than small groups with large afference. A sufficiently large group may have lower afference, but its size may cause it to have a larger impact on the network as a whole. The same reasoning holds for efferent groups. Meta-assortativity captures these relationships.

3 DATA COLLECTION

The popular development support site SourceForge.net was selected to sample the community development offerings. SourceForge boasts 120 thousand projects and one million users in its site portfolio. It provides communication, distribution, web presence and source code change control services to the projects it hosts. Further, SourceForge has taken a laissez-faire approach to regulation, not placing restrictions on the target audience of the resulting products, design process or organization of the projects hosted there. It represents a wide sample of projects being developed for many different communities and provides easy access to much of its content [11]. The analysis of this paper is derived from a snapshot of most of the projects in the SourceForge web site in February 2006 that was gathered by a web crawler. It resulted in 113,525 projects and 159,661 users.

Each project had meta-data describing the target audience, tools used, etc., selected by the project administrators. These meta-data attributes are collectively referred to as ‘troves’ in the community. They can be viewed either as a set of trees of related troves or as a large collection of binary attributes. Since any number of attributes can be selected for each project, and no co-usage restrictions are derived from the tree structure, we chose to treat them as binary attributes. There are 517 troves represented in our data set.

4 DATA ANALYSIS

4.1 Network Characterization

We formed a weighted projects network from the data collected, where any two projects that shared a developer were linked (so nodes are projects and edges are developers). To account for projects that had more than one developer in common, we weighted the edges with the number of shared developers. This resulted in a network with one large, strongly connected cluster of 27,520 nodes (25% of the network), many smaller connected clusters (12,451 clusters totaling 34,471 nodes) and 51,534 nodes not connected to any other nodes (representing projects worked on by just one person who does no work on any other projects). The Boost Graph Library [14] was used to perform this high-level analysis. The largest connected cluster became the focus of our investigation, but when appropriate, other clusters were also included.
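The construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in this study (which relied on the Boost Graph Library): the function names, the (developer, project) input format and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_project_network(memberships):
    """Return {(project_a, project_b): number of shared developers}.

    `memberships` is an iterable of (developer, project) pairs
    (hypothetical input format chosen for this sketch).
    """
    projects_by_dev = defaultdict(set)
    for dev, proj in memberships:
        projects_by_dev[dev].add(proj)

    weights = defaultdict(int)
    for projs in projects_by_dev.values():
        # Every pair of projects sharing this developer gains one unit of weight.
        for p, q in combinations(sorted(projs), 2):
            weights[(p, q)] += 1
    return dict(weights)

def connected_components(nodes, edges):
    """Group nodes into connected clusters with an iterative depth-first search."""
    adjacency = {n: set() for n in nodes}
    for p, q in edges:
        adjacency[p].add(q)
        adjacency[q].add(p)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            comp.add(n)
            for m in adjacency[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        components.append(comp)
    return components

# Toy data: two developers share projects A and B; project C is isolated.
memberships = [("alice", "A"), ("alice", "B"), ("bob", "A"),
               ("bob", "B"), ("carol", "C")]
weights = build_project_network(memberships)
clusters = connected_components(["A", "B", "C"], weights)
print(weights, clusters)
```

Isolated projects, such as C in the toy data, appear as single-node clusters, matching the unconnected nodes reported above.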

We further investigated the linking structure of this network with the Boost Graph Library. We discovered a scale-free distribution in the graph, with a slope of -2.5 on a log-log scale (both axes natural log) and an R-squared fit of 0.91. The number of projects linked ranged from 76,789 projects with just one link to one project with 390 links. This degree distribution is consistent with prior work [5, 10] and with the linking structure of networks found in Bibliometrics [3]. This confirms that the requirements of Bibliometrics analysis concerning the network structure are upheld.
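One way to obtain a slope and R-squared of the kind reported above is an ordinary least-squares line through the points (ln degree, ln number of nodes with that degree). The Python sketch below assumes that procedure; the text does not state how the fit was actually made (for example, whether the data were binned), and the function name `loglog_fit` is hypothetical.

```python
import math
from collections import Counter

def loglog_fit(degrees):
    """Least-squares slope and R-squared of ln(count) against ln(degree).

    `degrees` is a list with one positive degree per node; at least two
    distinct degree values are required.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in counts]           # ln(degree)
    ys = [math.log(c) for c in counts.values()]  # ln(number of nodes)

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared

# Synthetic degrees whose counts follow 10000 * k**-2 exactly.
degrees = [1] * 10000 + [2] * 2500 + [4] * 625 + [5] * 400 + [10] * 100
print(loglog_fit(degrees))
```

A straight line with a negative slope on this scale is the signature of a scale-free degree distribution.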

4.2 Visualization

Following the lead of [8], we used the VxOrd layout algorithm to arrange the nodes of the graph and seek a high-level understanding of the relationships between projects. The VxOrd layout algorithm, one part of the VxInsight system, is designed to visually isolate sub-communities in a network by selectively removing links from the graph to form clusters while concurrently doing force-directed layout [7]. The VxOrd algorithm takes a single parameter indicating how aggressive it should be at removing edges. We tried the algorithm at three levels: Minimum, Default and Maximum. This resulted in edge cutting levels of 8.6%, 17.4% and 42.1% respectively. Our layouts were restricted to the nodes of the largest connected cluster, and further filtered to only those nodes with more than one neighbor. This left us with 44% of the total nodes of the largest cluster. We feel that filtering out the single-link nodes is safe because they represent the ‘leaves’ of the network, while multi-link nodes are the internal structure. The decision to filter to just multi-link nodes was driven by the capacity of the computer running VxOrd. Though all of the levels displayed some clustering behavior, the layout with minimum cuts displayed the most obvious grouping structure and was carried into the next step of the analysis. The resulting layout was then rendered with a prototype of our declarative visualization framework [5].

Figure 1: Projects with the troves Java (blue), Python (green) and C (red).

To look for the sub-communities, the SourceForge trove data was used to color the individual nodes of the graph. We hypothesized that the troves would likely reveal some of the community structure. Our investigation took each of the 517 troves and generated a derivative image of the previously rendered projects map for each. These derivative maps used a binary variable (trove present/absent) and colored each node for which that variable was true. The majority of troves had no apparent region in the map. However, those that showed high locality fell into two major categories: very small troves (less than 3% of the graph, e.g. HAM radio software) and programming languages. The three maps that exhibited high clustering in the colored nodes were for Java, Python and C (see figures 1-4).

4.3 Analysis

The images of figures 1-4 suggest that programming language is by far the most cohesive trove; it is around languages that the most distinct communities form. This observation was the root of the definitions for meta-assortativity presented earlier. This is an intuitive impression of the data, but the display is constrained by having only two dimensions. Does the pattern hold up if we do a broader statistical test? To investigate this, we moved back to examining the entire weighted projects graph and employed the concepts of network meta-assortativity and node afference. To challenge our visual impressions more rigorously, the following values were measured in the collected data for each trove value:

1. The number of nodes which had the trove value

2. The number of nodes linked to by any node with the trove value

3. The number of nodes linked to by any node with the trove value that shared that trove value

Figure 2: Breakout of projects with the trove Java, primarily occupying the lower parts of the graph.

Figure 3: Breakout of projects with the trove Python, primarily occupying a band in the center of the graph.

Figure 4: Breakout of projects with the trove C, loosely collected in the upper parts of the graph.

From these three measurements, the following values were calculated:

A. The percentage of nodes that had the trove value (dividing value #1 by the total number of nodes). This is the background probability that any randomly selected node would have the trove value, and therefore the expected percentage of links between nodes that share the given trove (if link formation is random).

B. The ‘affinity’ of a trove is the observed percentage of links between nodes that share the trove, calculated by dividing #3 by #2.

C. The trove afference was found by subtracting value A from value B. Positive values represent a group that is afferent (interlinking at a rate higher than background probability). Negative values show groups that are efferent.

D. Meta-assortativity is found by weighting the afference by the number of nodes with the associated trove. As stated earlier, it can be thought of as afference weighted by occurrence.
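The three measurements and the derived values A through D can be sketched for a single trove as follows. This Python sketch is illustrative only and makes several assumptions that the text does not settle: measurements 2 and 3 are counted once per (node, neighbor) pair, edge weights are ignored, affinity is taken as measurement 3 divided by measurement 2, and the weighting in D is a plain multiplication by measurement 1. The function name and the data structures are hypothetical.

```python
def trove_measures(troves_by_node, adjacency, trove):
    """Compute background probability, affinity, afference and
    meta-assortativity for one trove.

    `troves_by_node` maps node -> set of troves; `adjacency` maps
    node -> set of neighboring nodes (hypothetical formats for this sketch).
    """
    total_nodes = len(troves_by_node)
    members = {n for n, ts in troves_by_node.items() if trove in ts}

    n_members = len(members)  # measurement 1
    linked = 0                # measurement 2 (assumed: one count per link end)
    linked_same = 0           # measurement 3
    for n in members:
        for m in adjacency.get(n, ()):
            linked += 1
            if m in members:
                linked_same += 1

    background = n_members / total_nodes                # value A
    affinity = linked_same / linked if linked else 0.0  # value B
    afference = affinity - background                   # value C
    meta_assortativity = afference * n_members          # value D (assumed weighting)
    return {"background": background, "affinity": affinity,
            "afference": afference, "meta_assortativity": meta_assortativity}

# Toy chain a - b - c - d with two Java projects and two C projects.
troves = {"a": {"Java"}, "b": {"Java"}, "c": {"C"}, "d": {"C"}}
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(trove_measures(troves, adjacency, "Java"))
```

In the toy chain, Java projects link to each other more often than the 50% background rate would predict, so the trove comes out afferent with a positive meta-assortativity.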

Ordering all troves by their meta-assortativity (value D) indicated that the top 5 slots belong to languages. Further, Java has the highest meta-assortativity, despite having only the 87th largest unweighted value. The next most afferent contributing languages (in order) were C, PHP, C++ and Python. It is interesting to note that, although PHP had a definite pattern (it tended towards the right of the graph), it did not have a high density region to visually separate it as the other languages did in the VxOrd plots (see figure 5). This contradiction implies that there is some difference in community structure between the different languages, even though at a meta-assortativity level they are similar.

On the opposite end of the scale, highly efferent troves were ‘All Windows’, GPL, developers and troves associated with projects on the immature end of the product development life-cycle. The appearance of the ‘All Windows’ trove among the highly efferent troves is a little surprising, given that many of the specific Windows versions were afferent, falling near rank 30 in the midst of windowing toolkits (KDE, GNOME, X11) and other specific operating systems (individual OS X versions and Solaris). Other very broad troves on the efferent end of the scale included ‘All Posix OSes’ and ‘OS Independent, written in an interpreted language’. These troves, despite being very popular, may be receiving a different pattern of usage because they are not leaf nodes in the trove tree. Many projects elect to select only the particular Posix OS they use, while others select the particular OS and the ‘POSIX OSes’ trove. Taking into account the tree structure when calculating the meta-assortativity may resolve these issues.

Figure 5: Projects with the trove PHP (Pink).

Figure 6: Projects with the trove C (Blue) or C++ (Orange).

Figure 7: Distribution of the meta-assortativity/weighted afference of afferent troves. Panel title: ‘Meta-Assortativity of Afferent Troves’; x-axis: natural log of meta-assortativity; y-axis: frequency. Generated in R [15].

It is interesting to note the contrast in ranking when looking at the afference (value C) vs. the weighted afference (i.e. the meta-assortativity, value D). Change in rank was measured as the absolute value of the difference between the rank when ordered by afference and the rank when ordered by meta-assortativity. The smallest movers (those that did not change their rank) were all on the efferent side of the scale (having negative weighted afference). Only 5 members of the top 100 on the afferent side of the scale moved fewer than 5 spaces, the median being 127 spaces. This seems to indicate a difference beyond just relative assortativity in the afferent and efferent sub-categories. Why there is so much difference between the afference ranking and the weighted afference ranking on only one end of the scale was beyond the scope of our investigation.

To test the significance of the afference of the languages listed, we checked to see if the languages were statistical outliers. The distribution of the full meta-assortativity statistic was not normal; it had two distinct groups. However, this split closely followed the transition from afferent to efferent clusters. Splitting the population into these respective groups allowed us to perform a log transformation on the afferent values that yielded a normal distribution (see figure 7). In the subgroup of afferent troves (constituting 82% of the troves), the top 5 troves (all of which are programming languages) were the only troves more than 2.5 standard deviations above the mean, while there were no troves more than 2 standard deviations below the mean. This indicates that the language troves highlighted earlier are probably outliers, and therefore of interest.
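The outlier check described above can be sketched in Python as a log transformation of the positive (afferent) values followed by a standard-score cut-off. This is not the R code used in this study; the function names, the use of the sample standard deviation and the synthetic values are assumptions made for illustration.

```python
import math
import statistics

def log_zscores(meta_by_trove):
    """Standard scores of ln(meta-assortativity) for afferent troves only.

    `meta_by_trove` maps trove name -> meta-assortativity; troves with
    non-positive values (efferent troves) are dropped before the transform.
    """
    logs = {t: math.log(v) for t, v in meta_by_trove.items() if v > 0}
    mean = statistics.mean(logs.values())
    sd = statistics.stdev(logs.values())  # sample standard deviation (assumed)
    return {t: (x - mean) / sd for t, x in logs.items()}

def high_outliers(zscores, threshold=2.5):
    """Troves lying more than `threshold` standard deviations above the mean."""
    return sorted(t for t, z in zscores.items() if z > threshold)

# Synthetic data: 19 ordinary afferent troves, one extreme one, one efferent one.
values = {f"trove{i}": 1.0 for i in range(19)}
values["big"] = math.exp(10)
values["efferent"] = -3.0
z = log_zscores(values)
print(high_outliers(z))
```

The efferent trove is excluded before the transformation because the logarithm is only defined for the positive values of the afferent subgroup.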

The structure of the efferent troves was much different from that of the afferent troves (see figure 8). The distribution shows two distinct groups, divided by the number of projects associated with the trove. The group on the left is composed of troves with few associated projects (the largest had only 328 projects). The group on the right is made of 18 troves, the smallest of which had 17,274 projects. This right-hand group included all troves associated with the immature end of the project development cycle. There were few projects in an intermediate range. Since no unified distribution could be found, no statement about significant differences can be made.

Figure 8: Distribution of the meta-assortativity/weighted afference of disassortative troves. Since all such troves have negative meta-assortativity, the absolute value was used. Panel title: ‘Meta-Assortativity of Efferent Troves’; x-axis: natural log of the absolute value of meta-assortativity; y-axis: frequency. Generated in R [15].

4.4 Maturity Findings

The visualizations did not show any pattern in the collocation of projects of varying maturity levels. However, the assortativity-derived measurements did. The troves indicating high maturity (‘Mature’ and ‘Production Stable’) were found with high meta-assortativity (in the top 10%). In contrast, the troves indicating low maturity (‘Planning’, ‘Alpha’, ‘Beta’) had highly negative afference (i.e. they were efferent). All of the low maturity troves ended up in the 1% most meta-disassortative values. This indicates that mature projects tend to link to other mature projects at a rate higher than background probability, but low maturity projects do not follow suit, linking to other low maturity projects less often than their background probability predicts. This at first appears to be a contradiction, since any low maturity project would have to link with a mature project to increase that group’s efference. This would consequently raise the efference of the mature group as well (since all links are bidirectional). However, approximately 1/3 of the projects do not carry a maturity rating. This allows the low maturity projects to link outside of their group without a corresponding decrease in the afference of the mature projects.

Does this observation allow prediction of the maturity of a project? A table of projects with their corresponding maturity, count of total links, and count of links to mature projects was created. This was analyzed using the R statistical analysis package [15]. The maturity of the project was analyzed with respect to the total number of links and the number of links to mature projects using a generalized linear model with a binomial error distribution. This type of test requires that the response variable be binary and at least one of the explanatory variables be distinct for each replicate [6]. This was not strictly true, but adding a very small random value to each of the percents does make it true (noise values were between -.00000001 and .00000001). Both the logit link function and the complementary log-log link function were tried; the complementary log-log function was kept because it minimized the deviance of the resulting model. Step-wise model simplification led to a very erratic model where maturity was predicted significantly with respect to the total number of links a project had, the percentage of those links that were to mature projects, and an interaction between the two. The erratic nature of the resulting model gave the impression that there are more variables at work that were not included in the original model. Therefore, though the afference and efference of maturity levels seem related to project maturity, they are insufficient to give a reasonable estimate of the current state of a given project.
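The ingredients of this model comparison can be sketched in a few lines. The analysis itself was done in R; the Python below is only an illustrative sketch, not the authors' code, and it does not fit a model. It shows the two inverse link functions, the deviance of a binary-response model (which reduces to minus twice the log-likelihood), and the small uniform noise used to make an explanatory variable distinct per replicate. The function names, the seed, and the example numbers are assumptions.

```python
import math
import random


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))


def binomial_deviance(y, p):
    """Deviance for binary responses y given fitted probabilities p in (0, 1)."""
    total = 0.0
    for yi, pi in zip(y, p):
        total += -2.0 * (math.log(pi) if yi == 1 else math.log(1.0 - pi))
    return total


def jitter(values, scale=1e-8, seed=0):
    """Add uniform noise in [-scale, scale] so repeated values become distinct."""
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]


# Hypothetical data: 1 = mature, 0 = not mature, with made-up linear predictors.
y = [0, 0, 1, 1]
eta = [-2.0, -1.0, 0.5, 1.5]
dev_logit = binomial_deviance(y, [inv_logit(e) for e in eta])
dev_cloglog = binomial_deviance(y, [inv_cloglog(e) for e in eta])
percents = jitter([0.25, 0.25, 0.5, 0.5])
```

In a real comparison each link function gets its own fitted coefficients before the deviances are compared; reusing one set of linear predictors here only keeps the sketch short.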

4.5 Comparison of VxOrd and Meta-Assortativity

We are not claiming that the straightforward afferent and efferent measurements are a replacement for VxOrd, only that they capture the first-level structure well. For example, the trove maps showed that the C and C++ regions interleave extensively (see figure 6). We wished to see if they formed a cohesive combined community stronger than their respective isolated communities. Combining the two into a ‘synthetic’ trove and repeating the meta-assortitivity calculation yielded a surprising result: the combined C/C++ group was highly efferent. The contrast between the clustered VxOrd layout and the efferent measurement for this group indicates that neither analysis tells the whole story of the clustering of the Sourceforge community.
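Forming such a ‘synthetic’ trove amounts to taking the union of the member projects of the constituent troves, so that a project tagged with both is counted once. The sketch below assumes a hypothetical `project_troves` mapping from project name to its set of trove labels; the paper does not describe its data layout, and the meta-assortitivity calculation itself is not reproduced here.

```python
def synthetic_trove(project_troves, trove_names):
    """Return the projects tagged with at least one of the given troves."""
    wanted = set(trove_names)
    return {
        project
        for project, troves in project_troves.items()
        if wanted & set(troves)
    }


# Hypothetical trove assignments, for illustration only.
project_troves = {
    "proj_a": {"C"},
    "proj_b": {"C++"},
    "proj_c": {"C", "C++"},  # tagged with both, counted once in the union
    "proj_d": {"Python"},
}

c_family = synthetic_trove(project_troves, ["C", "C++"])
```

The resulting project set can then be passed to the same group-level measurement used for ordinary troves.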

4.6 Future Work

We call such Bibliometrics-inspired analysis of software development projects Developmetrics. Developmetrics brings quantitative methods to the analysis of the community aspects of software development. It draws on the traditional software engineering measurement, network analysis (such as Bibliometrics), and visualization communities.

The next major Developmetrics task is to see how these sub-communities interface with each other. For example, the images generated show that the Python community sits between the Java and the C/C++ communities. Is this an artifact of the VxOrd layout algorithm, or does Python act as a bridge between these two communities?

A broader Developmetrics task is to understand how the communities grow and develop. This analysis has focused on a single time frame. We hope to create a series of time-sliced bipartite networks where nodes represent both projects and developers. We will develop these time slices by mining the project change logs to see how the connections change over time. We hope this work can inform how developers structure their projects to improve the project’s chance of reaching maturity.
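One way such time slices might be built is sketched below. It assumes change-log entries have already been reduced to `(date, developer, project)` records, which is a hypothetical format, and it groups them into fixed-width windows. Each slice is the edge set of a bipartite network linking developers to the projects they touched in that window.

```python
from collections import defaultdict
from datetime import date


def time_sliced_edges(change_log, start, slice_days):
    """Group (date, developer, project) records into per-slice bipartite edge sets.

    Slice 0 covers the first `slice_days` days from `start`, slice 1 the next
    window, and so on. Records dated before `start` get negative indices.
    """
    slices = defaultdict(set)
    for day, developer, project in change_log:
        index = (day - start).days // slice_days
        slices[index].add((developer, project))  # repeated commits collapse to one edge
    return dict(slices)


# Hypothetical change-log records, for illustration only.
log = [
    (date(2007, 1, 3), "dev_a", "proj_x"),
    (date(2007, 1, 20), "dev_b", "proj_x"),
    (date(2007, 2, 14), "dev_a", "proj_y"),
    (date(2007, 2, 15), "dev_a", "proj_y"),
]

slices = time_sliced_edges(log, start=date(2007, 1, 1), slice_days=30)
```

Comparing the edge sets of consecutive slices then shows which developer-project connections appear or disappear over time.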

The structured measurement techniques of Developmetrics will be of increasing importance as software development projects span more organizations, take advantage of more ambient resources, and embed themselves into contexts that have not historically taken on software development responsibilities.

5 CONCLUSION

The application of Bibliometrics analysis and visualization techniques was valuable in guiding the analysis of the open source software community. The directions they indicated helped focus our study to more rapidly discover the salient feature of language choice in community formation. Bibliometrics tools focused the human-resource intensive statistical analysis by sifting out many options in an automated fashion.

Our finding that the primary organizational lines in the community fall along implementation lines (i.e., with the programming language selected) and not along problem semantic lines has important implications. Among them, the maxim that “given enough eyeballs, all bugs are shallow” [13] may have limited scope. Errors that derive from misunderstandings of domain-specific knowledge are not as “shallow” as those that derive from language features. Therefore, a more directed effort must be made in both detecting and resolving such application-domain based issues.

ACKNOWLEDGEMENTS

The authors wish to thank Katy Börner and Chris Mueller for their help with visualization and Nicholas Edmonds for his help with analysis ideas. This work was supported in part by a grant from the Lilly Endowment. Access to SourceForge data was provided in part by the SDRA [2]. All data collection and dissemination have been conducted with the approval of and in accordance with the policies of the Indiana University Institutional Review Board, study #07-1743.

REFERENCES

[1] The Linux Kernel. http://kernel.org/, March 2007.
[2] SourceForge Research Data Archive. https://zerlot.cse.nd.edu/mywiki/index.php, March 2007.
[3] K. Börner, J. T. Maru, and R. L. Goldstone. The simultaneous evolution of author and paper networks. November 2003.
[4] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Science & Technology Books, 2002.
[5] J. A. Cottam and A. Lumsdaine. Thisstar: Declarative visualization prototype. In IEEE Symposium on Information Visualization, 2007.
[6] M. J. Crawley. Statistics: An Introduction Using R. John Wiley & Sons, 2005.
[7] G. S. Davidson, B. Hendrickson, D. K. Johnson, C. E. Meyers, and B. N. Wylie. Knowledge mining with VxInsight: Discovery through interaction. J. Intell. Inf. Syst., 11(3):259–285, 1998.
[8] T. Holloway, M. Bozicevic, and K. Börner. Analyzing and visualizing the semantic coverage of Wikipedia and its authors: Research articles. Complex., 12(3):30–40, 2007.
[9] G. Madey, V. Freeh, and R. Tynan. The open source software development phenomenon: An analysis based on social network theory. In Proceedings of the Americas Conference on Information Systems (AMCIS 2002), pages 1806–1813, Dallas, Texas, 2002.
[10] G. Madey, V. Freeh, and R. Tynan. The open source software development phenomenon: An analysis based on social network theory. In Proceedings of the Eighth Americas Conference on Information Systems, 2002.
[11] Open Source Development Network, Inc. Document H06 - Crawler Policy and Open Source Research Policy. http://sourceforge.net/docman/display_doc.php?docid=31763&group_id=1.
[12] Open Source Development Network, Inc. Sourceforge. http://sourceforge.net, March 2007.
[13] E. S. Raymond. The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O’Reilly and Associates, Sebastopol, California, 1999.
[14] J. Siek, L.-Q. Lee, and A. Lumsdaine. The Boost Graph Library User Guide and Reference Manual (With CD-ROM). Addison-Wesley, Boston, MA, 2002.
[15] R. D. C. Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3-900051-07-0.
[16] The Apache Software Foundation. Apache HTTP Server. http://httpd.apache.org/, March 2007.
[17] R. Xulvi-Brunet and I. M. Sokolov. Changing correlations in networks: Assortativity and dissortativity. Acta Physica Polonica B, 36:1431–+, May 2005.