
Network Motifs in Computational Graphs: A Case Study in Software Architecture

Sergi Valverde
Ricard V. Solé

SFI WORKING PAPER: 2005-04-008

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

www.santafe.edu

SANTA FE INSTITUTE


Network Motifs in Computational Graphs: A Case Study in Software Architecture

Sergi Valverde¹ and Ricard V. Solé¹,²

¹ ICREA-Complex Systems Lab, Universitat Pompeu Fabra, Dr. Aiguader 80, 08003 Barcelona, Spain
² Santa Fe Institute, 1399 Hyde Park Road, New Mexico 87501, USA

Complex networks in both nature and technology have been shown to display characteristic, small subgraphs (so-called motifs) which appear to be related to their underlying functionality. All these networks share a common trait: they manipulate information at different scales in order to perform some kind of computation. Here we analyse a large set of software class diagrams and show that several highly frequent network motifs appear to be a consequence of network heterogeneity and size, thus suggesting a somewhat less relevant role of functionality. However, by using a simple model of network growth by duplication and rewiring, it is shown that the rules of graph evolution seem to be largely responsible for the observed motif distribution.

PACS numbers: 05, 89.75

I. INTRODUCTION

Many natural and artificial systems are describable as networks of interacting components [1–4]. The network is a medium that allows resource sharing, often involving an efficient transport of energy (metabolism, power grid), matter (highways, airport webs) or information (cellular communication, Internet). The architecture of complex networks can be explored at different scales, from the overall properties defined by average measures such as path length or clustering, correlations or degree distributions to the more fundamental features displayed by small subsystems. In this context, it has been shown that some special, small subgraphs (so-called motifs) seem to be particularly relevant in describing the architecture of complex networks [5]. Motifs have been suggested to be the functional building blocks of network complexity. Are some subgraphs more common than others because of their functional relevance?

An alternative view is that the rules of network growth can by themselves favour some subgraphs with no special relation to the underlying functionality. Actually, this seems to be the case for the structure of the protein-protein interaction map. Although proteins perform functions, the overall architecture of the protein network is easily reproduced by means of a simple model of node duplication plus rewiring [6]. The reproduced properties include scale-freeness, small-world features and even hierarchical organization and protomodularity [7]. Mounting evidence suggests that many key features of complex networks (including motifs) might be strongly tied to the global network structure [8].

FIG. 1: Examples of common network motifs with n = 4 elements found in software graphs. Here each node is a class and arrows indicate static dependencies among classes (see text).

If functional constraints on network architecture have to be considered, one particularly relevant aspect of network complexity is associated with the presence of some underlying computational process. Computation is a key ingredient of any complex adaptive system (CAS). By storing and processing information, CAS are able to predict (and adapt to) external fluctuations. Computation occurs in both natural and artificial systems [9], although the building process that creates the computational structure is different. This is actually one of the most important points here: are the rules of designed and evolved systems completely different? Biological networks largely originate through tinkering [10–12]: new components are obtained by re-using old ones, mainly by duplicating them. In spite of the apparent limitations of such a mechanism, it allows good designs to be discovered [13]. More complex computations can be developed as the network size is increased and new functions can emerge.

How is computation linked to network structure? Tentative answers, to be developed here, can be obtained by looking at a very important class of computationally-driven networks: software systems. They offer a unique opportunity to explore different levels of complexity with well-defined functional traits. As opposed to most examples of evolving computational networks, extensive databases storing software evolution records exist and involve a high degree of detail. Here we analyse the largest data set of software maps explored to date (83 different systems). The main goal of our study is to see if functionality, as opposed to network evolution, is a main constraint on the distribution of network motifs in real graphs. The paper is organized as follows. In section II an overview of the software systems analysed here is given. In section III, the statistical patterns of network motifs in a large set of software systems are presented and the presence of scaling relations and the size-dependent frequency of motifs are analysed. In section IV a model of duplication and rewiring is used in order to reproduce the structure of motifs of a large software map. In section V a general discussion is presented.

II. SOFTWARE NETWORKS

Programming languages describe software systems [15]. Every computer program has a textual representation following syntactic rules dictated by a programming language [16]. The program is decomposed into a number of simpler software entities, or logical elements that are given a unique name. Software entities include such things as data objects, instructions, subprograms, or modules. A hierarchy or natural ordering between software entities is prompted by modern programming languages. At the lowest level, a program is viewed as a sequence of simple machine instructions. Sequences of related instructions are enclosed in subprograms. At the highest level, there are modules or logical containers grouping simpler software entities. Often, modules are defined as functional blocks but there are no strict principles driving module composition.

It is useful to depict the complex structures defined in computer programs by means of a graph [14], where nodes represent software entities and links represent syntactical relationships between modules, subprograms and instructions. In this paper, we will focus on a particularly interesting subset of software entities: the collection of modules and their static interactions (also called software architecture). While it is widely acknowledged that software evolution depends on its architecture, very little is known about the cause and effect relationships between design practices and evolution outcomes [24]. In order to understand the relation between structure and artificial evolution, we have envisaged a network model of software architecture, hereafter called software map or software network. Here, we show how the evolution of computer programs can be understood by recovering and analyzing their software networks at different stages of development.

Following [14, 20], figure 2 shows the text of an incomplete C++ [19] computer program (see fig. 2a) and its corresponding software map (see fig. 2b). The program text reads from left to right and top to bottom. The software network $\Omega = (V, L)$ of this C++ program is recovered by means of a very simple lexical analysis. First, we identify the vector of all module names (a module is called a class in C++) given by $W = (w_i)$ = (Point, Chessmen, Point, Move, Point, Point, Pawn, Chessmen, Move). Name ordering is important when recovering module dependencies (see below). Names $w_i$ that appear in the head of a module declaration provide the set of network nodes $V = \{v_i\}$. These names (hereafter called module definitions) are easily identified because they are preceded by the C++ keyword 'class'. The remaining names are called module references. In this example, we have four ($N = |V| = 4$) unique module definitions: Point ($w_1$), Chessmen ($w_2$), Move ($w_4$) and Pawn ($w_7$). This defines a mapping from names $w_k \in W$ to network nodes $v_i \in V$ in the software network.

FIG. 2: (a) A piece of C++ code from a chess-playing program and (b) the corresponding software map or network model displaying the collection of modules and their logical dependencies. The only information required to recover the software map is the set of module names (highlighted in bold in (a)) and their relative locations in the C++ program (see text). Notice how nodes are labelled with their names and links are decorated with relationship type.

The design of any non-trivial function involves the interaction of at least two modules [18]. Static module interactions can be inferred from the relative positions of names in $W$. Let us assume that $w_k \in \{w_1, w_2, w_4, w_7\}$ is a module definition associated with node $v_i$ and $w_l \notin \{w_1, w_2, w_4, w_7\}$ is a module reference associated with node $v_j$. A directed link $v_i \to v_j \in L$ signals a dependency from module definition $w_k$ to module reference $w_l$. Link directionality reflects name ordering in the C++ program, that is, $k < l$. There are two types of module dependencies: association (also called the "has-a" relationship) or inheritance (the "is-a" relationship). The purpose of these dependencies is to establish a logical organization of the system. However, our analysis is centered on the study of topological patterns and does not take into account detailed relationship semantics. In an association, the referenced node $v_j$ is nested in the C++ block of module $v_i$. This block is always bracketed by the symbols { and }. In an inheritance, the referenced node $v_j$ always follows the C++ sequence ": public" after the referencing node $v_i$ (see fig. 2a). Repeated links are not considered in the following analysis.
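The recovery procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool used in the study: it assumes that the lexical analysis has already produced the ordered name vector W together with a flag marking the names that follow the 'class' keyword, and it attributes every module reference to the most recent preceding module definition (a simplification of the block-nesting and ": public" rules). The function name `recover_software_map` is hypothetical.

```python
def recover_software_map(tokens):
    """Recover (nodes, links) from an ordered list of (name, is_definition).

    Assumption: a reference belongs to the most recent module definition
    that precedes it in the program text (so k < l always holds).
    """
    # Module definitions provide the node set V, in order of appearance.
    nodes = [name for name, is_definition in tokens if is_definition]
    links = set()  # a set discards repeated links
    current = None
    for name, is_definition in tokens:
        if is_definition:
            current = name
        elif current is not None and name != current and name in nodes:
            links.add((current, name))  # directed link: definition -> reference
    return nodes, links


# The name vector W of the chess example; True marks module definitions
# (w1, w2, w4 and w7 in the text).
W = [
    ("Point", True), ("Chessmen", True), ("Point", False),
    ("Move", True), ("Point", False), ("Point", False),
    ("Pawn", True), ("Chessmen", False), ("Move", False),
]

nodes, links = recover_software_map(W)
print(nodes)          # ['Point', 'Chessmen', 'Move', 'Pawn']
print(sorted(links))  # [('Chessmen', 'Point'), ('Move', 'Point'), ('Pawn', 'Chessmen'), ('Pawn', 'Move')]
```

Under these assumptions the repeated reference from Move to Point collapses into a single link, in line with the statement that repeated links are not considered.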

Software maps capture the topology of complex software systems. In particular, these maps provide a quantitative approach to the evolution of technology. They are actually evolving entities and in some sense inhabit an intermediate zone between computing machines and neural structures. We have shown software networks to be scale-free and small world [14, 20, 22]. Software networks can be described from a statistical physics perspective.


III. SOFTWARE MOTIFS

In this section, we extend our previous topological studies by analyzing software networks at the level of network subgraphs, or subsets of connected nodes in a network. The statistics of subgraphs provide important information about network structure. It has been claimed that overrepresented subgraphs (i.e., motifs) signal key building blocks of networks [5]. This might be the case for regulatory networks, where specific subgraphs (e.g., feed-forward loops) perform information processing functions [29]. A particular class of subgraphs, cycles, has received considerable interest. Cyclical dependencies in software maps imply that a module is related to itself, which may be acceptable, unacceptable or required [21]. Ambiguity in the functional meaning of cycles suggests that subgraphs in software graphs are not strictly related to well-defined functions. The ubiquity of subgraphs in software networks seems to be a consequence of top-down mechanisms of software organization and not a consequence of selective pressures.

Following the method outlined in the previous section, we have recovered and analysed a large dataset of software maps $\Omega$ from 83 reverse-engineered C++ software systems. A given graph can be characterized by a degree sequence. For the whole graph $\Omega$, the degree sequence is given by the indegree list $\{K_i\}$ and the outdegree list $\{R_i\}$ (with $i = 1, ..., N$). The lists are completed by the so-called mutual edges $\{M_i\}$, i.e. cases where there is a pair of edges in both directions between two nodes. For each subgraph $\Omega_i \subset \Omega$ of size n (here n = 3, 4), another degree sequence is provided by two new lists, now $\{k_j\}$ and $\{r_j\}$ for the in- and outdegrees, respectively. For example, for the left subgraph in figure 1, we would have $\{k_j\} = \{0, 1, 1, 2\}$ and $\{r_j\} = \{2, 1, 1, 0\}$.
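As a small illustration of these degree sequences, the sketch below computes the three lists for a directed graph given as a list of links. It assumes, as one possible convention, that a pair of links in both directions is counted only in the mutual list and not in the in- and outdegree lists; the function name `degree_sequences` and the integer node labels are hypothetical.

```python
def degree_sequences(n_nodes, links):
    """Return the lists (K, R, M): indegree, outdegree and mutual degree.

    Assumption: a mutual pair a <-> b adds one to M[a] and M[b] and is
    not counted again in K or R.
    """
    link_set = {(a, b) for a, b in links if a != b}  # drop repeats and self-loops
    K = [0] * n_nodes  # single incoming links
    R = [0] * n_nodes  # single outgoing links
    M = [0] * n_nodes  # mutual links
    for a, b in link_set:
        if (b, a) in link_set:
            M[a] += 1  # the reverse link updates M[b] on its own turn
        else:
            R[a] += 1
            K[b] += 1
    return K, R, M


# A four-node subgraph in which node 0 points to nodes 1 and 2,
# and both of them point to node 3.
K, R, M = degree_sequences(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(K, R, M)  # [0, 1, 1, 2] [2, 1, 1, 0] [0, 0, 0, 0]
```

The example reproduces sequences of the form $\{k_j\} = \{0, 1, 1, 2\}$ and $\{r_j\} = \{2, 1, 1, 0\}$ quoted in the text.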

Network motifs are defined in terms of subgraphs which appear much more often than expected from pure chance. Specifically, they occur with a high frequency compared with that expected from an ensemble of randomized graphs with identical degree structure [5]. The random networks are generated by means of the switching rule. For every pair of links $i \to j$ and $u \to v$ in the original software network, we add the pair $i \to v$ and $u \to j$ in the randomized network. This rule keeps intact the in/out-degree sequences but destroys degree-degree correlations. The statistical significance of a given subgraph $\Omega_i$ is described by its Z score [5], defined as:

$$Z(\Omega_i) = \frac{N_{real}(\Omega_i) - \langle N_{rand}(\Omega_i) \rangle}{\sigma(N_{rand}(\Omega_i))} \qquad (1)$$

Here $N_{real}(\Omega_i)$ is the number of times the subgraph appears in the network, whereas $\langle N_{rand}(\Omega_i) \rangle$ and $\sigma(N_{rand}(\Omega_i))$ refer to the mean and standard deviation of its appearances in the randomized ensemble, respectively. In order to be significant, it is required that $|Z(\Omega_i)| > 2$. When $Z(\Omega_i) > 2$ ($Z(\Omega_i) < -2$) the motif (antimotif) is considered to be more (less) common than expected at random. In table I the results from our analysis are shown for some typical software networks. A handful of these subgraphs appear to be present in all software systems analysed and also in both electronic circuits and biological networks involving computation. This is the case of the Bi-parallel (S904), the Bi-fan (S204), the feed-forward loop (S38) and its close variants (such as S2186 and S408). Such a common point might be easily interpreted in functional terms: similar subgraphs are abundant because they are selected or chosen to perform a given function or task. As shown below, no evidence from statistical patterns supports such a view.
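The randomization and significance test can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the procedure actually used: links are switched between randomly chosen pairs so that each node keeps its in- and outdegree, switches that would create self-loops or repeated links are skipped, the number of mutual links is not constrained, and the population standard deviation is used for σ. The names `switch_randomize`, `z_score` and `classify` are hypothetical.

```python
import random
from statistics import mean, pstdev


def switch_randomize(links, n_swaps, seed=0):
    """Degree-preserving randomization by repeated link switching."""
    rng = random.Random(seed)
    link_list = sorted(set(links))
    link_set = set(link_list)
    if len(link_list) < 2:
        return link_set
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(link_list)), 2)
        (i, j), (u, v) = link_list[a], link_list[b]
        new_1, new_2 = (i, v), (u, j)  # i and u keep their outdegree, j and v their indegree
        # Skip switches that would create a self-loop or a repeated link.
        if i == v or u == j or new_1 in link_set or new_2 in link_set:
            continue
        link_set.difference_update([(i, j), (u, v)])
        link_set.update([new_1, new_2])
        link_list[a], link_list[b] = new_1, new_2
    return link_set


def z_score(n_real, random_counts):
    """Z score of a subgraph count against counts from randomized graphs."""
    sigma = pstdev(random_counts)  # assumption: population standard deviation
    if sigma == 0:
        return float("nan")  # undefined when the ensemble shows no variation
    return (n_real - mean(random_counts)) / sigma


def classify(z, threshold=2.0):
    """Apply the |Z| > 2 criterion used in the text."""
    if z > threshold:
        return "motif"
    if z < -threshold:
        return "antimotif"
    return "not significant"
```

In use, one would count a given subgraph in the real network, repeat the count on an ensemble of graphs returned by `switch_randomize` with different seeds, and pass both to `z_score`.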

Assuming sparse graphs ($\langle K \rangle \ll N$), the probability of a given subgraph $\Omega_i$ can be estimated. Following Itzkovitz et al., we can see how this is calculated using the first subgraph in figure 1. Here we have $\{K_j\} = \{2, 1, 1, 0\}$ and $\{R_j\} = \{0, 1, 1, 2\}$. The idea is to compute the different probabilities associated with each directed edge linking all pairs of nodes. For example, the probability of having a directed link from node 1 to node 2 (for $K_1 R_2 \ll N \langle K \rangle$) is approximately:

$$P(1 \to 2) = \frac{K_1 R_2}{N \langle K \rangle} \qquad (2)$$

which can be interpreted as follows [23]: we perform $K_1$ attempts for the first node to connect to the target node with a probability $R_2 / N \langle K \rangle$. Similarly, we would have:

$$P(1 \to 3) = \frac{(K_1 - 1) R_3}{N \langle K \rangle} \qquad (3)$$

with the same approach being used for all edges. The average number of appearances of $\Omega_i$ is finally computed by averaging. Itzkovitz et al. [23] have shown that the average number of appearances $\langle G \rangle$ of a given subgraph is given by a product of moments of different orders of the indegree, outdegree and mutual degree distributions:

$$\langle G \rangle \sim N^{n - g_a - g_m} \langle K \rangle^{-g_a} \langle M \rangle^{-g_m} \prod_{j=1}^{n} \left\langle \binom{K_i}{k_j} \binom{R_i}{r_j} \binom{M_i}{m_j} \right\rangle_i \qquad (4)$$

where $g_a$ and $g_m$ are the number of single and mutual edges. The approximation assumes uncorrelated, sparse networks ($\langle K \rangle \ll N$). Both conditions are met by software maps [25]. These mean-field quantities can be used as a null model estimate of the number of motifs, and eventually to detect stray, significant deviations from randomness. Since different motifs are found in different systems [5] they can actually allow us to identify the basic functional blocks for a given class of networks.
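A direct numerical reading of Eq. (4) is sketched below. It is only an illustration: the proportionality constant is ignored, each edge is taken to contribute a factor $1 / N \langle K \rangle$ as in Eq. (2), and the function name `mean_field_count` and its argument names are hypothetical.

```python
from math import comb
from statistics import mean


def mean_field_count(K, R, M, k, r, m, g_a, g_m):
    """Mean-field estimate of <G> (Eq. 4) up to a constant factor.

    K, R, M: indegree, outdegree and mutual degree lists of the network.
    k, r, m: the same three lists for the n nodes of the subgraph.
    g_a, g_m: number of single and mutual edges in the subgraph.
    Assumption: the network has at least one single edge, so mean(K) > 0.
    """
    N, n = len(K), len(k)
    estimate = N ** (n - g_a - g_m) * mean(K) ** (-g_a)
    if g_m > 0:
        estimate *= mean(M) ** (-g_m)  # only needed when mutual edges are present
    for k_j, r_j, m_j in zip(k, r, m):
        # Moment <C(K_i, k_j) C(R_i, r_j) C(M_i, m_j)> averaged over network nodes i.
        estimate *= mean(
            [comb(K_i, k_j) * comb(R_i, r_j) * comb(M_i, m_j)
             for K_i, R_i, M_i in zip(K, R, M)]
        )
    return estimate


# Sanity check on a toy network with four nodes and four single edges:
# the "subgraph" made of one directed edge should be counted once per edge.
K = [0, 1, 1, 2]
R = [2, 1, 1, 0]
M = [0, 0, 0, 0]
single_edge = mean_field_count(K, R, M, k=[0, 1], r=[1, 0], m=[0, 0], g_a=1, g_m=0)
print(single_edge)  # 4.0
```

For the single-edge pattern the expression reduces to $N \langle R \rangle$, the number of edges, which gives a quick consistency check of the exponents.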

By exploring our collection of software graphs, we determined $\langle G \rangle$ for real nets (indicated as $N_{real}$ in fig. 3) and compared them to $N_{rand}$. Here software maps with a size N > 10 have been analysed. Two groups have been chosen, involving medium-sized graphs (N < 300) and large graphs (N > 300). This set is compared with results from other networks involving computational tasks. Here previous results for both gene and neural networks are also shown for comparison (data from [5]). The reason for using biological networks as a reference system is twofold. First, the chosen systems are known to perform computational tasks (or can be described by means of an equivalent computational circuit). Second, it has been conjectured that both natural and artificial networks might share some commonalities relating to the mechanisms that shape their evolution [12]. Common features might reflect common functional traits, but also (as shown below) common rules of graph evolution with no special functional meaning.

FIG. 3: Network motifs with n = 3, 4 elements found in software graphs. The numbers of nodes and edges for each network are shown. The most frequent motifs were classified in distinct rows for each type of system: medium software systems, large software systems, gene regulatory nets, neural networks and digital electronic circuits. For each motif, we display the number of occurrences in the network ($N_{real}$), the number of occurrences ($N_{rand} \pm SD$) in a set of 100 randomized network versions and a qualitative measure of its statistical significance as given by the Z score (see text). Medium and large software networks share a large number of motifs but we found larger variability in the medium data set. A remarkable difference is motif 2190 (the last motif in the second row), which appeared only in the context of large software systems.

In order to explore the question of how relevant the overall network structure is in conditioning the frequency of given subgraphs, we should consider the global structure of the network. The first approximation is to consider the degree of heterogeneity as provided by the distribution of links. Software systems have a well-defined scale-free indegree distribution

$$P_i(k) = \frac{\gamma_i - 1}{k_0^{1 - \gamma_i}} (k + k_0)^{-\gamma_i} \qquad (5)$$

with $\gamma_i \sim 2$. A mean value $\langle \gamma_i \rangle = 2.09 \pm 0.05$ has been obtained by averaging over all the systems studied here. The distribution of scaling exponents is strongly peaked around $\gamma_i = 2$. The outdegree distribution $P_o(k)$ is much steeper, and seems better described by a broad scale distribution, not far from the exponential limit. This is actually the opposite of the situation considered in [23], but it is not difficult to show that the theoretical treatment is essentially symmetric.

For the regime considered here, it was shown that $\langle G \rangle$ follows a scaling law

$$\langle G \rangle \sim N^{\alpha} \qquad (6)$$

Specifically, for a given pair n, g and a given scaling exponent, we have:

$$\langle G \rangle \sim N^{n - g + s - \gamma_i + 1} \qquad (7)$$

where s is the maximum in-degree for our case. This scaling is actually valid for $2 < \gamma_i < s + 1$.

Two examples of the observed scaling laws are shown in figure 4 for two very frequent software motifs. The two motifs shown are very different, with n = g = 4 and (a) s = 2 (subgraph S904) and (b) s = 3 (subgraph S2190). Using (7) the expected number of times a given motif appears would scale as $\langle G \rangle \sim N^{s + 1 - \gamma_i}$ and using the scaling exponent $\gamma_i \approx 2.09 \pm 0.06$ a scaling law $\langle G \rangle \sim N^{0.9}$ should be predicted for uncorrelated, sparse graphs. The predicted scaling is recovered from real data (see figure 4), with $\alpha(S904) = 0.91 \pm 0.13$ and $\alpha(S2190) = 0.85 \pm 0.15$, thus indicating that the average trends are consistent with the expectation from random scale-free networks. This confirms the validity of the prediction of Itzkovitz et al. [23] and its agreement with a set of real networks. This agreement is an interesting result, particularly if we remember that this is a system designed to perform given functions. The fact that we obtain the scaling law expected for the random, scale-free graph reveals that the observed scaling in motif abundances is a consequence of top-down constraints derived from graph evolution.
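The comparison between predicted and fitted exponents can be illustrated with a short sketch. It evaluates the exponent of Eq. (7) for S904 (n = g = 4, s = 2) and estimates α by least squares on a log-log scale. The data in the example are synthetic and generated only to exercise the fit; they are not the measurements of figure 4. The function names are hypothetical.

```python
import math


def predicted_exponent(n, g, s, gamma_in):
    """Exponent of Eq. (7): alpha = n - g + s - gamma_i + 1."""
    return n - g + s - gamma_in + 1


def fit_loglog_slope(sizes, counts):
    """Least-squares slope of log10(counts) against log10(sizes)."""
    xs = [math.log10(x) for x in sizes]
    ys = [math.log10(y) for y in counts]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / s_xx


alpha_s904 = predicted_exponent(n=4, g=4, s=2, gamma_in=2.09)

# Synthetic counts following an exact power law, used only to test the fit.
sizes = [20, 50, 100, 300, 1000]
counts = [3.0 * N ** 0.9 for N in sizes]
alpha_fit = fit_loglog_slope(sizes, counts)
print(round(alpha_s904, 2), round(alpha_fit, 2))  # 0.91 0.9
```

With real data the fitted slope carries an uncertainty, which is why the text quotes the exponents with error bars.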

IV. DUPLICATION-BASED EVOLUTION

The topology of software architecture emerges from designed evolution. On top of the process, there must be a basic building plan towards a final function or set of functions. The engineer foresees the outcome of their work. But there are a number of strong constraints, no less important, operating through the software building process. On the one hand, modular structures are shaped through parallel paths of evolution. Different blocks will be involved in more specific subfunctions. On the other hand, increased complexity leads to conflicts between different subparts. This is reflected for example at the topological level: small software maps tend to display a tree structure, whereas larger systems typically display much more complex patterns [14]. The common overall structure detected in software graphs in terms of the degree distribution (and other average properties) suggests that the final topological patterns might be strongly constrained.

We conjecture that the abundance of subgraphs in software networks relates to universal mechanisms of network growth underlying their evolution. Real software maps tend to display motif generalizations, or subgraphs having a structure comprising many replicas of the 4-motifs observed here. These structures are highly redundant. This suggests a very simple duplication-based mechanism of subgraph generation. New modules depend upon other modules in order to provide useful functionalities, and it seems reasonable to assume that similar modules will share a large number of module dependencies. A related software engineering study [27] analyzed structural similarities in C++ software at the module (class) level and found quantitative evidence of structural duplications. However, it did not provide any model explaining the origin of duplications.

FIG. 4: Scaling in the number of appearances of a given motif against network size. Here two common motifs (each indicated) have been considered over the sample set of 83 systems. Here we have (a) S904 with n = 4, g = 4 and s = 2 and (b) S2190, with n = 4, g = 4 and s = 3. In both cases, the predicted exponent is α = 0.9. Using the average scaling exponent for the in-degree distribution ($\gamma_i \approx 2.1$) we obtain $\alpha(S904) = 0.91 \pm 0.13$ and $\alpha(S2190) = 0.85 \pm 0.15$. The fit was made using least squares on log-log scale. Axes: network size N and $\langle G \rangle$, both logarithmic.

Figure 5 shows a detailed example illustrating how top-down duplication works in software development. Imagine we want to add a new software module representing the queen in the previous chess-playing program (see fig. 2a). First, we add a new module declaration, which is conveniently named Queen (see fig. 5a). Because a queen is a type of chessman, it seems reasonable to make this module depend upon the same modules referenced by similar modules, which in this case is the Pawn. By using the Pawn module and its neighborhood as a template, we add an inheritance relationship from Queen to Chessmen (see fig. 5b). Duplication is completed with the addition of a collaboration relationship from Queen to Move (see fig. 5c). Comparison of the final network (see fig. 5c) with the initial network (see fig. 2) reveals a new bi-parallel subgraph and twice the number of bi-fan subgraphs.

An example of this process taking place in real software development is shown in figure 6. Here a given subsystem inside a growing software graph is displayed at different development stages. Duplication of nodes seems to be at work, as well as further removal of many links associated with a given hub. From c to d a duplication of the hub involving many incoming links has taken place (together with some further node addition). From d through i, it is evident how a large number of new classes were added by copying the pattern of single nodes connected to two central hubs. There is extensive rewiring in some stages, such as in f, where the lower hub loses a large fraction of in-links. Moreover, there is also addition of new connections between existing nodes (see h → i). The whole sequence spans one year of development. The main observation from this example (which is a typical one) is that node duplication plus rewiring, particularly link removal, are widespread. This is also the case in the evolution of cellular networks [6].

FIG. 5: From (a) to (c), an illustration of the duplication mechanism in software map evolution. Time flows from top to bottom. Here, a new module Queen is introduced by cloning the links of the similar module Pawn. Every stage displays the evolving C++ program (right) and its corresponding software network (left) reconstructed by the method described in section II. New text is enclosed in a box. Note how duplication of links in the software map is parallel to duplication of code in the C++ program.

FIG. 6: A real instance of software network growth from a well-defined subsystem of Prorally [14] showing duplication. Evolution goes from top to bottom and left to right. Only the largest connected component is displayed here. The figure shows how the target hub in (c) has been duplicated in (d) (both nodes highlighted with a dotted box). Many duplicated nodes involve less connected targets (see (g) to (h)).

Examples like the previous one suggest that duplication-divergence growth is the cause of the observed subgraph abundances in software maps. This hypothesis can be tested by comparing the distribution of subgraphs in real networks with those obtained with a stochastic model of network growth based on asymmetric duplication-divergence rules, previously described in [6]. First, an initial random (or backbone) network of $m_0 < N$ nodes is created. This random graph is generated by the addition of nodes with degree $k_0 = 2$, every link pointing to a random target node [28]. This backbone possesses a tree-like structure (as occurs with software maps at the beginning of their evolution). Starting from this backbone, we apply the following rules at each iteration of the model:

1. Duplication. A randomly chosen target node v is cloned, and the new node w attaches to all the neighbors of the target node.

2. Divergence. For each pair of original/redundant links, remove one of them with probability δ.

3. Cross-linking. In addition, the target and new node are linked (w → v) with probability β. This rule is important in order to generate triads or 3-subgraphs.
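The three rules above can be sketched as a small simulation. This is an illustrative reading of the model, not the implementation used for the results: it assumes that the clone copies both the incoming and the outgoing links of the target, that divergence always removes the copied link rather than the original one (one way to read "asymmetric"), and that the backbone starts from a single link with every later backbone node adding up to $k_0 = 2$ links to distinct random targets. All function names are hypothetical.

```python
import random


def grow_backbone(m0, rng, k0=2):
    """Initial graph: each new node sends up to k0 links to random older nodes."""
    links = {(1, 0)}  # assumption: start from two nodes joined by one link
    for new in range(2, m0):
        for target in rng.sample(range(new), min(k0, new)):
            links.add((new, target))
    return links


def duplication_step(links, n_nodes, delta, beta, rng):
    """Apply duplication, divergence and cross-linking once."""
    v = rng.randrange(n_nodes)  # target node to be cloned
    w = n_nodes                 # label of the new node
    new_links = set(links)
    for a, b in links:
        if a == v:
            copy = (w, b)       # copy of an outgoing link of v
        elif b == v:
            copy = (a, w)       # copy of an incoming link of v
        else:
            continue
        # Divergence (assumption): the copy is lost with probability delta.
        if rng.random() >= delta:
            new_links.add(copy)
    # Cross-linking: link the new node to its template with probability beta.
    if rng.random() < beta:
        new_links.add((w, v))
    return new_links, n_nodes + 1


def duplication_model(n_final, m0, delta, beta, seed=0):
    """Grow a directed graph with n_final nodes; requires 2 <= m0 <= n_final."""
    rng = random.Random(seed)
    links = grow_backbone(m0, rng)
    n_nodes = m0
    while n_nodes < n_final:
        links, n_nodes = duplication_step(links, n_nodes, delta, beta, rng)
    return links, n_nodes
```

A call such as `duplication_model(n_final=200, m0=5, delta=0.5, beta=0.1)` returns the link set and the number of nodes; the parameter values here are arbitrary and would need to be sampled or tuned to match the number of links of a given software map.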

In spite of the simple set of rules implicit in the duplication model, the frequencies of subgraphs obtained from our in silico system are remarkably close to those seen in their real counterparts. In figure 7, we have compared the concentration of 4-subgraphs expressed in various software networks and the concentration of 4-subgraphs predicted with the duplication-based evolution model. These plots were obtained with the following method. We generate 400 graphs, 100 for each of four different software networks: Blender, Filezilla, GTK and Exult [26]. Each synthetic graph has the same number of links L and number of nodes N as measured in the corresponding software map and no further constraints are imposed. The parameter space is sampled uniformly. Once the synthetic networks are obtained, we perform a 4-subgraph census by counting the number of appearances of each 4-subgraph $\Omega_i$ in the real and in the synthetic network. Notice that we do not test for statistical significance (as in the motif analysis). Instead, our comparison test is based solely on raw subgraph counts. In order to compare the two systems, the raw numbers of subgraphs of size four are computed and their concentration C evaluated. Here, the concentration is simply the number of appearances of the 4-subgraph over the total number of 4-subgraphs found.

FIG. 7: Comparison of observed and predicted (from a duplication model) 4-motif concentrations for (a) Blender, (b) Filezilla, (c) GTK and (d) Exult (here concentrations are rescaled by $\times 10^{-3}$). The exponents for the least squares fit are: (a) ξ = 0.94 ± 0.12, (b) ξ = 0.92 ± 0.13, (c) ξ = 0.96 ± 0.11 and (d) ξ = 1.14 ± 0.12, respectively. Each panel (A to D) plots $C_{predicted}$ against $C_{observed}$ on logarithmic axes.
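A brute-force version of the census and concentration measure is sketched below. It is meant for small graphs only and is not the counting tool used in the study: every set of four nodes is inspected, weakly connected induced subgraphs are grouped by a canonical label obtained by trying all node relabellings, and each count is divided by the total. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def canonical_form(nodes, links):
    """Label shared by all isomorphic subgraphs: smallest relabelled link list."""
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        form = tuple(sorted((relabel[a], relabel[b]) for a, b in links))
        if best is None or form < best:
            best = form
    return best


def is_weakly_connected(nodes, links):
    """Check connectivity while ignoring link direction."""
    neighbours = {v: set() for v in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for u in neighbours[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return len(seen) == len(nodes)


def subgraph_concentrations(link_list, size=4):
    """Concentration of each connected induced subgraph of the given size."""
    link_set = set(link_list)
    nodes = sorted({v for link in link_set for v in link})
    counts = Counter()
    for subset in combinations(nodes, size):
        members = set(subset)
        induced = [(a, b) for a, b in link_set if a in members and b in members]
        if is_weakly_connected(subset, induced):
            counts[canonical_form(subset, induced)] += 1
    total = sum(counts.values())
    return {form: count / total for form, count in counts.items()}


# Chess example after adding Queen (links inferred from the name ordering
# of the example program, an assumption about the content of figure 5c).
chess = [
    ("Chessmen", "Point"), ("Move", "Point"),
    ("Pawn", "Chessmen"), ("Pawn", "Move"),
    ("Queen", "Chessmen"), ("Queen", "Move"),
]
concentrations = subgraph_concentrations(chess)
```

Because all node subsets and relabellings are enumerated, the cost grows quickly with network size, so software maps of realistic size would call for a dedicated subgraph counting algorithm.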

In figure 7, each point represents the pair ($C_{observed}$, $C_{predicted}$) of observed and predicted concentrations for a given 4-subgraph $\Omega_i$. Specifically, we display the set of pairs and the power law fit $C_{pred} \sim C_{obs}^{\xi}$.

Despite fluctuations, the simple duplication model presented here predicts reasonably well the concentration of common software network motifs: the value of the exponent ξ is reasonably close to one in all cases. This is remarkable, given the oversimplification considered here and given the limited constraints imposed on the selected model graphs to be compared with the real ones. The error bars grow as less common subgraphs are used. If we restrict to $C > 10^{-3}$, the exponent ξ becomes much closer to one. Specifically, we now obtain (a) ξ = 0.96 ± 0.11, (b) ξ = 0.97 ± 0.10, (c) ξ = 0.98 ± 0.12 and (d) ξ = 1.06 ± 0.18, respectively. Consistently with previous work [8], less common subgraphs are typically more dense (have more links). In figure 8 an example of this correlation is shown for Exult. Using a frequency-rank plot of 4-subgraphs, we can see that subgraphs with high frequencies have few links whereas higher ranks (small frequencies) are associated with dense subgraphs.

FIG. 8: Frequency-rank distribution of network subgraphs in a software network. Here the most frequent subgraph has rank one, the second has rank two, etc. The frequency P(r) of a subgraph with rank r decays rapidly with subgraph rank. An interesting feature is that the most common subgraphs are sparser than less common ones, which are more dense. Axes: motif rank number and frequency, both logarithmic; selected subgraphs are labelled a to h.

V. DISCUSSION

In this paper we have analysed the statistical patterns of network motifs in a large set of software diagrams. Software maps have been previously shown to be scale-free and display small-world behavior [14, 20, 22] but no previous analyses focused on the small-scale architecture. The main goal of our study is to explore the relevance of graph evolution in relation to true functionality. Our study actually suggests that dynamical rules, with little relation to underlying functional constraints, largely determine the frequency of motifs in software graphs.

By using recent theoretical and numerical methods to measure and characterize network motifs, we have found that:

1. A number of network motifs are obtained, the most common being shared with other (natural) systems involving computational traits, such as genetic and neural networks.

2. The number of appearances of a given network motif scales as $\langle G \rangle \sim N^{n - g + s - \gamma_i + 1}$, in agreement with previous calculations for random graphs with scale-free degree distributions. This result is supported by previous observations of the uncorrelated character of software maps.

3. Evidence from software evolution suggests that duplication and rewiring, as occurs with some cellular networks, might play a key role in shaping software maps. Using a previous model of network growth by duplication and diversification, it has been shown that it fits rather well the frequencies of appearance of network motifs.

Previous studies have proposed the idea that network motifs seem to define the minimal, meaningful building blocks of network complexity. Perhaps not surprisingly, we often find them as the basic structures associated with specific functional traits, from computation to pattern formation. The former is exemplified by feed-forward loops, a three-element motif found in genetic regulatory systems [29, 30]. The latter is actually a particularly relevant example. However, since the statistical distribution of network motifs involves dealing with large numbers of different subgraphs, the question of how motifs in general might reflect functional traits requires the formulation of appropriate null models of graph evolution. Such models must ignore any functional trait in order to test the possibility that global properties of network structure (such as graph heterogeneity) might strongly influence what we should expect.

The model chosen here is a duplication-rewiring one [6]. Such models have been shown to generate heterogeneous graphs with many properties close to those of relevant biological systems, such as protein-protein interaction maps. Network heterogeneity is largely due to effective preferential attachment. Additionally, the rules of duplication strongly bias the types of motifs formed towards some special subsets. The final consequence is that the patterns of network motifs generated by the duplication model might be able to explain (in statistical terms) the observed abundances of motifs, with no further requirement of functional constraints. The fact that biological systems, also involved in performing computations, have common motifs might support this view. Although sharing common motifs seems to call for common functionalities, it is important to remember that biological structures are largely generated through tinkering [10, 11]. Protein interaction networks grow by gene duplication, and neural networks also experience increases in cell numbers together with wide synaptic changes. Perhaps the common traits are a byproduct of a common tinkered evolution based on extensive reuse and copying of available structures.
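The growth rules discussed above can be sketched in a few lines. The code below is an illustrative, undirected toy version of a duplication-rewiring process, not the exact model of [6]: the function name `duplication_rewiring_graph`, the parameter names `delta` (probability of deleting an inherited link) and `alpha` (probability of adding a new link to each existing node), the two-node seed graph and the update order are all assumptions made here.

```python
import random


def duplication_rewiring_graph(n_nodes, delta, alpha, seed=None):
    """Grow an undirected graph by node duplication followed by rewiring.

    delta: probability of deleting each link inherited by the copy.
    alpha: probability of linking the copy to each other existing node.
    Parameter names, seed graph and update order are illustrative assumptions.
    Returns a dict mapping each node to the set of its neighbours.
    """
    rng = random.Random(seed)
    adjacency = {0: {1}, 1: {0}}  # seed graph: two linked nodes
    while len(adjacency) < n_nodes:
        new = len(adjacency)
        target = rng.randrange(new)  # node to be duplicated
        # Duplication: the copy inherits each link of the target,
        # and each inherited link is removed with probability delta.
        links = {v for v in sorted(adjacency[target]) if rng.random() >= delta}
        # Rewiring: new links to other nodes appear with probability alpha.
        links |= {v for v in range(new) if v not in links and rng.random() < alpha}
        adjacency[new] = links
        for v in links:
            adjacency[v].add(new)
    return adjacency


# Example run with arbitrary parameter values.
graph = duplication_rewiring_graph(n_nodes=200, delta=0.5, alpha=0.01, seed=1)
degrees = sorted((len(nbrs) for nbrs in graph.values()), reverse=True)
```

Because a copy inherits the neighbours of its target, nodes that already have many links are more likely to gain further ones, which is one way to read the effective preferential attachment mentioned in the text. Motif counts from graphs grown this way can then be compared with those measured in the software maps.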

One final comment concerns the common subgraphs also shared by digital circuits. These are not obtained, strictly speaking, through a process of duplication and rewiring. Although the way complex circuits are built does include some amount of reuse [31], considerations involving low cost in links are of fundamental importance. In spite of such constraints, it has been shown that electronic circuits have small-world structure and are also highly heterogeneous [32]. Previous work seems to indicate that optimal design towards efficient communication at low cost can generate scale-free, heterogeneous architectures [33, 34]. Such a result suggests again that network heterogeneity might pervade motif abundances.

Acknowledgments

The authors thank Shalev Itzkovitz for a careful reading and comments on an earlier version of the manuscript. The analysis of network motifs has been done using available free software from Uri Alon's Lab (see http://www.weizmann.ac.il/mcb/UriAlon/index.html). This work has been supported by grants BFM2001-2154 and by the EU within the 6th Framework Program under contract 001907 (DELIS) and by the Santa Fe Institute.

[1] Dorogovtsev, S. N. & Mendes, J. F. F., Evolution of Networks: From Biological Nets to the Internet and WWW, Oxford University Press, New York (2003).

[2] Albert, R. & Barabasi, A. L., Rev. Mod. Phys., 74, 47 (2002).

[3] Newman, M. E. J., SIAM Review, 45, 167-256 (2003).

[4] Bornholdt, S. & Schuster, G., eds. Handbook of Graphs and Networks, Wiley-VCH, Berlin (2002).

[5] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. & Alon, U., Science 298 (2002) 824-827.

[6] Sole, R. V., Pastor-Satorras, R., Smith, E. D. and Kepler, T., Adv. Complex Syst. 5, 43 (2002); Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. Complexus 1, 38 (2003); Pastor-Satorras, R., Smith, E. D. and Sole, R. V., J. Theor. Biol., 222, 199-210 (2003); J. Kim, P. L. Krapivsky, B. Kahng, and S. Redner, Phys. Rev. E 66, 055101 (2002); K.-I. Goh, B. Kahng, and D. Kim, e-print (q-bio.MN/0312009 v2); W. Banzhaf and P. Dwight Kuo, J. Biol. Phys. Chem. 4, 85 (2004).

[7] Sole, R. V. and Fernandez, P. SFI working paper 03-12-071. See also Guimera, R., Sales-Pardo, M., and Amaral, L. A. N. Phys. Rev. E 70, 25101 (2004).

[8] Vazquez, A. et al., Proc. Natl. Acad. Sci. USA 101 (2004) 17940-17945.

[9] Hayes, B., Am. Sci. 89 (2001) 204.

[10] Jacob, F. Science 196 (1976) 1161-1166.

[11] Duboule, D. & Wilkins, A. S. Trends Genet. 14 (1998) 54-59.

[12] Sole, R. V., Ferrer, R., Montoya, J. M. & Valverde, S., Complexity, 8, 20 (2002).

[13] Alon, U., Science, 301, 1866-1867 (2003).

[14] Valverde, S., Ferrer-Cancho, R. & Sole, R. V. Europhys. Lett. 60 (2002) 512-517.

[15] Aho, A. V., Science 303, 27, (2004), 1331-1333.

[16] Aho, A. V., Sethi, R. & Ullman, J. D., Compilers: Principles, Techniques and Tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA (1986).

[17] Fraser, H. B., Hirsch, A. E., Steinmetz, L. M., Scharfe, C. & Feldman, M. W., Science 296 (2002) 750-752.

[18] Gamma, E., Helm, R., Johnson, R., Vlissides, J. (1994) Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley, New York).

[19] Stroustrup, B., (1986) The C++ Programming Language (Addison-Wesley, Reading, MA).

[20] Valverde, S., & Sole, R. V., Santa Fe Inst. Working Paper, SFI/03-07-044 (2003).

[21] Lakos, J., (1996) Large Scale C++ Software Design (Addison-Wesley, New York).

[22] Myers, C. R., Phys. Rev. E, 68, 046116 (2003).

[23] Itzkovitz, S., Milo, R., Kashtan, N., Ziv, G., & Alon, U. Phys. Rev. E 68, 026127 (2003).

[24] Kemerer, C. F., & Slaughter, S., IEEE Trans. Soft. Eng., 25, 4, (1999) 493-509.

[25] All software maps analysed here (and others studied by other authors) have been shown to be sparse. Correlations have also been analysed in Sole, R. V. and Valverde, S., in: Complex Networks, E. Ben-Naim, H. Frauenfelder, and Z. Toroczkai (eds.), Lecture Notes in Physics, pp. 169-190. Springer, Berlin (2004). Using statistical measures derived from information theory, it was shown that software maps are considerably uncorrelated.

[26] The source code is available at the following web sites: http://www.blender.org (Blender), http://filezilla.sourceforge.net (Filezilla), http://www.gtk.org (GTK) and http://exult.sourceforge.net (Exult).

[27] Fioravanti, F., Migliarese, G., & Nesi, P., In 23rd International Conference on Software Engineering (ICSE'01), IEEE, May 12 - 19, Toronto, Canada, (2001).

[28] Callaway, D. S., Hopcroft, J. E., Kleinberg, J. M., Newman, M. E. J., & Strogatz, S. H., Phys. Rev. E 64, 041902 (2001).

[29] Shen-Orr, S., Milo, R., Mangan, S. & Alon, U., Nature Genetics, 31, 64-68 (2002).

[30] Mangan, S. & Alon, U., Proc. Nat. Acad. Sci., 100, 11980-11985 (2003).

[31] As circuit complexity increases (both in terms of number of components and computational tasks) it becomes more difficult to design from scratch, choosing sets of small gates and building optimal, low-cost circuits. Predefined gates involving well-known (and sometimes complex) input-output functions are widely used and assembled together. In that sense, some amount of reuse is at work.

[32] Ferrer, R., Janssen, C. and Sole, R. V., Phys. Rev. E 64, 046119 (2001).

[33] Ferrer, R. and Sole, R. V., in: Statistical Physics of Complex Networks, R. Pastor-Satorras, M. Rubi and A. Diaz-Guilera, editors, Lecture Notes in Physics, Springer (Berlin), 114-125 (2003).

[34] Sole, R. V. and Valverde, S., in: Complex Networks, E. Ben-Naim, H. Frauenfelder, and Z. Toroczkai (eds.), Lecture Notes in Physics, pp. 169-190. Springer, Berlin (2004).
