Page 1: Technology R&D Theme 3: Multi-scale Network Representations

TRD 3: MULTI-SCALE NETWORKS – PROJECT SUMMARY

Although networks have been extremely useful for representing molecular interactions and mechanisms, network diagrams do not visually resemble the contents of cells. Rather, the cell involves a multi-scale hierarchy of components: proteins are subunits of protein complexes which, in turn, are parts of pathways, biological processes, organelles, cells, tissues, and so on. In this Technology Research and Development Project (TRD), we will pursue methods that move Network Biology towards such hierarchical, multi-scale views of the structure and function of biological systems. Biological ontologies are one very successful framework for capturing hierarchical multi-scale organization, but they have so far been only indirectly connected to biological networks and other types of ‘omics data. Recently, we introduced methods for inferring the terms and term relations of a gene ontology directly from the hierarchical structure contained in molecular networks, and we prototyped a web resource to distribute network-based ontologies (NeXO, nexontology.org). This recent progress motivates and lays groundwork for our present focus on hierarchical multi-scale representations. Specific aims are to develop tools that: (1) Iteratively and flexibly incorporate new network experimental results into a ‘working’ NeXO ontology, (2) Use a gene ontology structure, either inferred or literature curated, to guide an engine for generalized functional predictions, and (3) Explore multi-scale analysis above the cellular level, by bridging ligand-receptor networks to networks of cell-cell communication. These aims are stimulated by a range of Driving Biomedical Projects involving the Gene Ontology project, the Saccharomyces Genome Database, a Cancer Gene Ontology, and multi-scale analysis of viral-host, cell-cell communication and social networks. Ultimately, all research aims synergize to use network data to propel hierarchical models of biological structure and function.

TRD 3: MULTI-SCALE NETWORKS – PROJECT NARRATIVE

Although networks have been extremely useful for representing interactions and group formation, network diagrams fail to capture important aspects of biological structure and function. We will pursue methods that move Network Biology towards more accurate hierarchical, multi-scale views of biological systems. The hierarchical models developed here will enable integration of both basic and clinical data to predict disease outcomes in response to specific therapies.

TRD 3: MULTI-SCALE NETWORKS – SPECIFIC AIMS

Although networks have been very useful for representing molecular interactions and mechanisms, network diagrams do not visually resemble the contents of cells. Rather, the cell involves a multi-scale hierarchy of components: proteins are subunits of protein complexes which, in turn, are parts of pathways, biological processes, organelles, cells, tissues, and so on. In this technology research project, we will pursue methods that move Network Biology towards such hierarchical, multi-scale views of biological structure and function.

Aim 1. Assembly and refinement of gene ontology structure from biological network data. Ontologies have been very successful at capturing hierarchical, multi-scale cellular organization. In the prior period of support we introduced methods for assembling a gene ontology directly from the hierarchical structure evidenced by molecular networks and other ‘omics data. We prototyped a web resource to distribute network-based ontologies (NeXO, nexontology.org), but it is still at an early stage. In the next support period, we will research methods to iteratively and flexibly incorporate new experimental results and data into a ‘working’ NeXO ontology, highlighting new terms and term relations that are created, alongside existing terms/relations that are further supported or weakened. We will transform nexontology.org into an interactive community resource that enables investigators not only to browse an existing ontology but to create, share, and iteratively update, revise, correct, and expand these ontologies. The potential of this aim is to effectively systematize and crowd-source an important type of biological model: the ontology.

Aim 2. Functionalized gene ontologies as a hierarchy of phenotypic prediction. Hierarchy and scale are important not only for capturing the physical architecture of a system (Aim 1) but also its function. Recent progress in artificial intelligence (AI), embodied by agents such as Siri and Watson, inspires an approach for moving from networks and gene ontologies, which are currently descriptive in nature, towards models that are able to predict a range of cellular phenotypes and answer biological questions. Using these AIs as a rough inspirational guideline, we will develop gene ontologies as a major platform for the functional translation of genotype to phenotype, with a particular focus on personalized cancer therapeutics. This aim intersects with the separate TRD project on Predictive Networks and serves as a bridge between the two TRDs.

Aim 3. Bridging ligand-receptor networks to cell-cell communication networks. We will also explore multi-scale network analysis above the cellular level, in the context of an emerging class of biological networks called cell-cell interaction networks. In these networks nodes are cells, and edges represent physical or chemical (e.g. hormonal) interactions. Inter-cellular signaling and regulation networks could in the future be controlled to grow artificial organs, heal tissues and develop novel therapies. We will infer, analyze and visualize multi-scale models of inter-cellular communication networks and their corresponding intracellular signaling networks and pathways, which link to traditional molecular interaction network analysis methods. For instance, we will use network analysis methods to identify potential control points in the cell-cell and intracellular interaction network with applications to regenerative medicine (growing blood from stem cells). Growth of data and analysis methods in this area will enable network science to contribute to the wider understanding of physiological systems.

These aims are stimulated by a range of Driving Biomedical Projects involving the Gene Ontology project, the Saccharomyces Genome Database, a Cancer Gene Ontology, and multi-scale analysis of viral-host, cell-cell communication and social networks. Ultimately, all research aims synergize to use network data to propel hierarchical models of biological structure and function.

TRD 3: MULTI-SCALE NETWORKS – RESEARCH STRATEGY

SIGNIFICANCE

Why it is time to move beyond flat models of biological networks. Like any model of the world, our view of the cell is inescapably bound by the time and place in which we live. Over the years different schools have fashioned the cell in a variety of forms, from bags of enzymes1, to metabolic channels2, to feedback circuits3, to complex systems4, to gels5, to self-modifying programs in software6. A model that has pervaded cell biology for the past fifteen years is the so-called “network” view (Figure 1A), which has bloomed in parallel with the emergence of human-made networks such as the Internet and Facebook. This view treats cells as containers for vast networks of “nodes” (genes, gene products, metabolites, or other biomolecules) connected by “links” (physical interactions or functional associations)7. Network representations of the cell flow directly from the ability to characterize not only genes and proteins in isolation, but also their functional similarities and physical binding partners, a major outcome of transcriptomics and proteomics approaches. Analysis of network information, whether biological or human-made, is an active field leading to algorithms that detect nodes with strategic positions within a network7 or that analyze networks to identify modular structures8 (a topic of earlier progress during the past period of support for the NRNB). While incredibly influential, the network is likely not the ultimate representation of a cell, for two reasons. First, network diagrams do not visually resemble the contents of cells. Nowhere in the cell do we observe actual wires running between genes and proteins, unlike for the Internet, which is truly a network of wires among processing units. Rather, the cell involves a multi-scale hierarchy of components that is not readily captured by basic network representations.
For example, the proteasome has been mapped extensively to identify its key genes and interactions, but the network visualization of these data (Figure 1A) is very different from the proteasome’s spatial appearance (Figure 1B). The interactions making up the proteasome factor into a regulatory particle and a core, which, in turn, factor into a base and a lid, and an alpha and beta subunit, respectively. This hierarchical structure is obscured by the network visualization of pairwise relationships between gene products. Aim 1 will address this shortcoming, by using molecular networks and other ‘omics data to build hierarchical models of the cell parallel to the Gene Ontology9.

Figure 1. From networks to ontologies. (A) Network representation of three types of interactions that form the proteasome structure, displayed using a force directed layout. (B) Cartoon representation of the structure of the proteasome (PDB entry 4b4t), created by integrating partial crystallographic structures obtained by analysis of 2.4 million images from electron microscopy. (C) Hierarchical factorization of the proteasome sub-components as described by our data-driven gene ontology NeXO. Across all panels, colors indicate membership to the core complex beta subunit (red), core complex alpha subunit (orange), regulatory particle lid complex (blue) and regulatory particle base complex (purple) according to the GO (A), the Protein Data Bank (B) and NeXO (C).

From description to prediction. Second, many of the molecular networks published to date, including many from the NRNB or earlier research by our labs10-21, are descriptive maps of physical or functional connectivity rather than predictive models. For example, technologies such as yeast two hybrid, protein affinity purification, and chromatin immunoprecipitation are often used to define and draw large networks of protein-protein and protein-DNA interactions22, but these static maps do not, by themselves, predict cell behavior. Although we and many others in the field of network biology have inferred networks capable of predicting gene function or phenotypic responses [reviewed here23,24; network inference was the focus of previous Aim 4 of the past funding period], these efforts have tended to focus on a single class of predictions, e.g. gene expression level or cell growth rate. Assembling a model that would predict a range of phenotypes, rather than only one type of outcome, requires understanding how phenotypes are interrelated. Here again a hierarchy is important, since cellular organization involves a multi-scale hierarchy not only in structure but also in function. For example, the proteasome is a central component of ubiquitin-mediated protein degradation, which, depending on an intricate set of inputs and rules, can result in cellular homeostasis, differentiation, death, and other fates. This multi-scale hierarchy of processes is, again, simply not exposed by a standard pairwise network representation. Aim 2 will address this shortcoming by developing methods to ‘functionalize’ the Gene Ontology, so that it is not merely a static description of the contents of cells, but an active framework for predicting phenotype from genotype.

From networks to ontologies: Building better models of cell structure from omics data.
To capture hierarchical organization, a particularly promising direction in computer science has been the development of the ontology, a model that divides its subject domain into a set of fundamental concepts or entities and relationships among those entities25. Ontologies arise from the metaphysics branch of philosophy, which is concerned with the nature of what exists and the categories into which the world’s objects naturally fall. Ontologies build upon and extend network models in two key ways: ‘entities’ refer not only to elemental objects but also to any meaningful grouping of objects, and ‘relationships’ refer not only to direct connections but also to nested structures, such as one entity being a part or type of another. Thus, ontologies explicitly allow for a higher order organization of knowledge, missing from raw networks. They have been key for building powerful knowledge representation and reasoning systems in many domains26 including biomedicine27. Ontologies became very influential in cell biology through the development of the Gene Ontology (GO)9. GO is a major resource of knowledge about genes, gene products, and the hierarchy of cellular components, molecular functions and biological processes in which they participate. Entities in GO (GO terms) are hierarchical groupings of other entities. The GO resource is presently very large, with nearly 35,000 GO terms connected by ~65,000 hierarchical term-term relations, describing more than 80 different species. The impact of GO is hard to overstate – just try to think of a single modern ‘omics analysis that does not use GO to validate a novel data set or approach, or to generate new mechanistic hypotheses. In a sense GO is the most universal, and universally accepted, model of a cell that we currently have. One limitation of GO lies in the fact that the ontology structure is constructed by a diverse team of scientists according to their best abilities to curate the published scientific literature. 
Thus, GO inevitably misses the large proportion of cell biology that is not yet known or has not yet been curated, and it contains biases that are hard to control. To address these challenges, in the prior period of support we investigated whether gene ontologies could be inferred computationally directly from systematic molecular interaction networks28. In this study, a large fraction of the GO hierarchy was recapitulated de novo, directly from network data gathered in budding yeast. For example, the pairwise interaction network for genes and gene products encoding the proteasome (Figure 1A) was transformed to infer the hierarchical structure of proteasomal components to a high degree of accuracy (Figure 1C). In addition, several hundred cellular entities were identified from the data that had not yet been catalogued in GO, pointing to potentially novel or uncurated molecular machinery which we are pursuing in collaboration with the Gene Ontology Consortium (formerly a CSP, now a DBP).

Over the next few years, we will expand on this preliminary work to introduce a system for organizing molecular interactions and cancer ‘omics data as a genomics-driven, crowd-sourced Gene Ontology. This will address several parallel challenges in the ‘omics sciences:

(1) The need to move beyond clustering to recognize the multi-scale structure embedded in data

(2) The need to improve ontologies of gene function in their scalability, consistency and coverage

(3) The continued need to provide biomedicine with an accurate map of hallmark pathways and processes that drive disease progression.

Taking clues from Siri: ‘Active’ networks and ontologies. Whether based on expert knowledge or inferred from data, current gene ontologies are static descriptions of cellular organization. They enable representing and reasoning on the structural relationships among biological entities27,29 but lack any native capacity to capture dynamic biological states or make phenotypic predictions. However, since gene ontologies inherently represent multi-scale hierarchy in cellular organization, they provide in theory an ideal substrate for building models that would also be predictive of a range of cellular responses and phenotypes. In this respect, intelligent agents developed in the field of knowledge representation and reasoning26, such as Apple’s Siri and IBM’s Watson, provide an excellent example of what a predictive, or ‘executable’, ontology looks like. At Siri’s core is a series of ontologies containing knowledge that concerns Siri – answers to questions one would normally ask an iPhone30. For instance, Siri uses an ontology for event planning which treats both meals and movies as types of events, where meals involve a restaurant and a restaurant consists of components such as a name, address, and style of food. In many ways, such ontologies are similar in structure to bio-ontologies such as GO (Figure 2).

Figure 2. From ontologies to active ontologies. A subset of the Gene Ontology9, left, alongside a subset of an active ontology for event planning30, right. Red relationships and entities indicate dynamic computation.

Unlike gene ontologies, however, which are essentially descriptive, Siri’s ontologies are coupled with dynamic reasoning systems that render them active: “Whereas a conventional ontology is a formal representation of domain knowledge with distinct concepts and relations among concepts, an Active Ontology is a processing formalism where distinct processing elements are arranged according to ontology notions; it is an execution environment”30. These active ontologies not only encode entities and relations, but entities are associated with states and relations are associated with rule sets that perform actions within and among entities. Through a bottom up execution, input states are incrementally propagated up the hierarchy to impact higher-level entities, whose states are output as the answer to the initial question, that is, the best prediction based on the inputs. For example, try asking Siri to “Find a good sushi restaurant for two tonight”. This query is translated by setting the states of several entities: style is set to ‘sushi’, address to the user’s current location,
party size to the value ‘2’, and event date to today’s date (Figure 2). These values are propagated through the ontology to generate a list of restaurants, which becomes the state of the event entity. This event result can then be provided to the user or included in further computations. In Aim 2, we will explore whether such systems can teach us how to develop question-and-answer, or genotype-to-phenotype prediction, systems for cell biology31.

Cell-cell interaction networks. We will also develop technology for understanding network structure above the cellular level. In so-called cell-cell interaction networks, nodes are cells and edges represent physical or chemical (e.g. hormonal) interactions. Chemical interactions are of greatest interest as they describe inter-cellular signaling and regulation pathways, which could in the future be controlled to grow artificial organs, heal tissues and develop novel therapies. Increasing information in this area will enable network science to contribute to the wider understanding of physiological systems. We have gained experience and interest in this area via analysis of two novel experimentally mapped cell-cell interaction networks of the developing human hematopoietic system32,33 in collaboration with Peter Zandstra at the University of Toronto (Zandstra is now a DBP). The Zandstra lab is interested in mapping inter-cellular networks and feedback in regulating stem and progenitor cell fate for the purposes of growing blood from stem cells, which would be safer than blood donations. Cell-cell interaction networks demand new analysis tools that consider their autocrine and paracrine structure and how they are controlled by intra-cellular molecular networks. Despite the recognized importance of inter-cellular networks and feedback in regulating multicellular organism development, the specific cell populations involved and underlying molecular mechanisms are largely undefined.
For example, blood cells are known to secrete and respond to a large number of regulatory proteins in lineage- and differentiation stage-specific patterns34,35. Dynamic mathematical models of cells patterning into tissues during development have been built36-38, but they function at the cell population/tissue level, treat cells as a compartment or spatial gradient, and do not consider actual cell-cell interactions. Perhaps the best-studied cell-cell interaction network is that of the worm, Caenorhabditis elegans, which has been completely mapped over organism development by microscopy. Network analysis by clustering found that interneurons are more densely connected in the nervous system compared to sensory or motor neurons, leading to the interpretation that these cells act as central processing units39. More recent work predicted cell-cell networks involved in cancer therapy resistance40, and found that specific network motifs are enriched in inter-cellular cytokine mediated communication networks41 and that specific components are more important than others42. However, this work has thus far studied small cell-cell network models that were never experimentally validated. As technology for single cell and stem cell measurement improves, we expect a growth in the amount of cell-cell network information. We are already observing this growth in projects such as a new CSP from Laurie Ailles at the University Health Network in Toronto, who is studying how cancer-associated fibroblasts provide a supportive microenvironment for cancer stem cells within high-grade serous ovarian cancer and other cancers. New technology she has developed quantifies the protein levels of 363 cell surface antigens in single cell populations43.

INNOVATION

Central innovation and hypothesis.
The central innovation of this TRD project is a set of ideas and approaches for transitioning Network Biology from the current status-quo of flat, pairwise, and descriptive representations of biological interactions, to a future in which the same interaction data lead to the construction of hierarchical models of biological structure and function. We will explore the hypothesis that current network representations, which view a dataset of pairwise interactions as a mathematical graph of nodes and edges, may be “too close” to the raw data to allow for complete or even accurate biological insight. Models derived from the same interactions, such as gene ontologies and biological process diagrams, may form a more intuitive result, provided these multi-scale formulations can avoid the tendency towards over-fitting or over-interpretation. The most direct representations of data are not always the most desirable for meaningful interpretation of those data. In x-ray crystallography, the most direct representations of x-ray diffraction patterns are two-dimensional images44. However, when many such images are integrated and analyzed, exquisite 3D structural models of proteins emerge which, in turn, enable accurate predictions of protein dynamics and function. Similarly, from many molecular measurements and interaction data sets the higher order structure and function of the cell might emerge, if only we could figure out how to assemble these images properly.

Turning networks into ontologies: towards a Network-eXtracted Ontology. Recently we and others have shown very promising results in the hierarchical analysis of physical and genetic networks, i.e., that networks harbor rich structure which is not only modular but also hierarchical and multi-scale45-50 (Aim 1 Progress Report). In particular, we have been able to recover ~60% of the GO Cellular Component hierarchy de novo, directly from physical and genetic network data gathered in S. cerevisiae and in a manner that is completely independent from the known structure of GO or from the literature. The resulting Network-eXtracted Ontology, which we call NeXO, provides a structured hierarchical interpretation of network data which will in most cases be vastly preferable to flat lists of interactions (a.k.a. interaction ‘hairballs’) or flat lists of network clusters/complexes. The focus of Aim 1, and an innovative aspect of this proposal, is to explore how these ontologies can be iteratively updated by a community of biomedical investigators.

3.1 ASSEMBLY AND REFINEMENT OF ONTOLOGY STRUCTURE FROM BIOLOGICAL NETWORK DATA

Project Leader: Trey Ideker (UCSD)

Overview. Ontologies have been very successful at capturing hierarchical, multi-scale cellular organization. In the prior period of support we introduced methods for assembling a gene ontology directly from the hierarchical structure contained in molecular networks and other ‘omics data. We prototyped a web resource to distribute network-based ontologies (NeXO, nexontology.org), but it is still at an early stage. In the next support period, we will research methods to iteratively and flexibly incorporate new experimental results and data into a ‘working’ NeXO ontology, highlighting new terms and term relations that are created, alongside existing terms/relations that are further supported or weakened. We will transform nexontology.org into an interactive community resource that enables investigators not only to browse an existing ontology but to create, share, and iteratively update, revise, correct, and expand these ontologies. These tools will be built and explored alongside Driving Biomedical Projects including a Yeast Gene Ontology, a Cancer Gene Ontology, a Viral-Host Gene Ontology and a hierarchical exploration of social networks. The goal is a means of systematically incorporating ‘omics data into whole-cell ontological models, with the potential to systematize and crowd-source an important type of model construction.

Preliminary Results and Progress Report: Proof-of-concept and maturation of a Network-eXtracted Ontology (NeXO). The previous award supported research by NRNB investigators that led to creation and prototyping of the first gene ontology inferred from ‘omics data, the NeXO Resource28,51 (http://nexontology.org). This work fell naturally under previous TRD-C: Visualization and Representation of Biological Networks.
NeXO provides a methodology whereby physical and genetic network data can be transformed to assemble a structured ontology of protein complexes. Using this system, we assembled an ontology based on four large yeast networks capturing current knowledge of physical protein-protein interactions, genetic interactions (synthetic-lethality and epistasis), co-expressed genes, as well as an integrated functional network known as YeastNet52. The resulting Network-eXtracted Ontology (NeXO) contains a total of 4,123 terms and 7,804 term-term relationships (Figure 3). Based on alignment of the systematic NeXO to the literature-curated Gene Ontology (GO), it appears that NeXO captures ~60% of terms in the Cellular Component branch of GO. To further validate NeXO vs. GO, we have used both ontologies to perform functional enrichment of gene sets, the task to which GO is most often applied. In this regard, NeXO performs at least as well as GO for functional enrichment in several different genome-scale data sets. Thus, the computed ontology provides functionally-relevant terms which cover a wide spectrum of yeast biology to an extent comparable to manually-curated efforts. Since the original proof-of-concept work was published in early 201328, we have released a visually integrated website for browsing NeXO and GO ontologies in the style of Google Maps51. This summer we published a major improvement to the ontology inference algorithm53 which was presented and well-received at the Intelligent Systems in Molecular Biology (ISMB 2014) conference. Progress Report Publications 1-11.

Methods

Basic inference of ontologies and alignment to a reference. To construct a data-driven ontology, a set of input features is first gathered for each gene, representing information collected from ‘omics studies such as its interaction partners in molecular networks, its expression levels over time or conditions, or other data depending on the DBP. These features are analyzed to generate a pairwise gene-gene similarity matrix, in which the similarity between two genes reflects their closeness in input features. Many methods have been proposed for this purpose54-56; presently we have been successful with the technique of random forest regression57. The pairwise similarity matrix is then clustered (Figure 4) using any of several algorithms we have published in prior work28,53. For example, our original method is to use a hierarchical probabilistic model for community detection50,58 which constructs a binary tree, or dendrogram, seeking to maximize the overall probability of the network data by iteratively joining sets of genes with similar patterns of interaction. Gene sets, represented by nodes in the tree, are suggestive of biological entities or ‘terms’ in an ontology. Joining of two sets, represented by connecting two nodes beneath a third, suggests specialized terms that are part of a more general one. The tree is then expanded to allow for creation of terms with multiple (>2) children and/or parents, which is important for identifying complexes with many subunits or which participate in multiple parent processes [transforming the hierarchical tree into a directed acyclic graph; we do not detail this method here, but it involves evaluating the probability of the network under the new vs. old structure]. This method yields a novel structure that we call the Network-eXtracted Ontology, or NeXO, in which genes are organized under a hierarchy of terms and parent-child term relations strongly supported by the input datasets.
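As a rough illustration of the tree-building step only, the sketch below agglomerates genes by average pairwise similarity and records each merged gene set as a candidate term, together with its child-to-parent relations. It is a simplified stand-in, not the hierarchical probabilistic model cited above, and it does not perform the expansion to a directed acyclic graph. The function names, gene labels, and similarity values are all hypothetical.

```python
from itertools import combinations


def average_similarity(a, b, sim):
    """Mean pairwise similarity between two disjoint gene sets."""
    return sum(sim[frozenset((g, h))] for g in a for h in b) / (len(a) * len(b))


def build_term_tree(genes, sim):
    """Agglomerate genes into a binary tree of candidate terms.

    Returns (terms, relations): terms is a list of frozensets of genes,
    one per tree node; relations is a list of (child, parent) pairs.
    """
    active = [frozenset([g]) for g in genes]
    terms = list(active)
    relations = []
    while len(active) > 1:
        # Join the two current gene sets that are most similar on average.
        a, b = max(
            combinations(active, 2),
            key=lambda pair: average_similarity(pair[0], pair[1], sim),
        )
        parent = a | b
        active = [t for t in active if t not in (a, b)] + [parent]
        terms.append(parent)
        relations += [(a, parent), (b, parent)]
    return terms, relations


# Toy similarity values (hypothetical): two tight pairs, weak cross-links.
genes = ["A1", "A2", "B1", "B2"]
pairs = {
    ("A1", "A2"): 0.9, ("B1", "B2"): 0.8,
    ("A1", "B1"): 0.1, ("A1", "B2"): 0.1,
    ("A2", "B1"): 0.1, ("A2", "B2"): 0.1,
}
sim = {frozenset(p): s for p, s in pairs.items()}

terms, relations = build_term_tree(genes, sim)
for child, parent in relations:
    print(sorted(child), "is part of", sorted(parent))
```

In this toy case the two tight pairs are joined first and then placed beneath a common root, mirroring the idea that joining two sets beneath a third suggests specialized terms within a more general one.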
At this stage terms simply represent structures detected in data and are given systematic IDs, much like ORFs detected in a newly-sequenced genome. To annotate these terms with information from known biology, the NeXO structure is aligned against a reference ontology,

Figure 3. Building the NeXO ontology. The ontology is reduced to a tree, with nodes indicating terms and edges indicating hierarchical relations between terms, i.e. that one term contains another. Node sizes indicate the number of genes assigned to a term. Node colors represent the degree of correspondence to a term in GO as determined by ontology alignment, with high-level alignments labeled. Insets show the hierarchy identified for the ribosome and actin cytoskeleton.

Page 10: Technology R&D Theme 3: Multi-scale Network Representations

much like ORFs are annotated by alignment against a reference genome whose genes are well-annotated. As in past work, our default reference ontology for this step will be the literature-curated Gene Ontology. The desired result of aligning NeXO and GO is to identify NeXO terms that correspond to well-known versus novel structures, as well as GO terms that are well-supported by the available data. For high confidence matches, the GO annotations are transferred to the NeXO term, including the term name and description. Terms that are novel (similar to ‘ORFaned’ genes) may become extremely interesting for further biological exploration and experimental follow-up.
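As an illustration of how matched and novel terms might be separated, the sketch below scores each pair of terms by the overlap of their gene sets and accepts matches greedily, one-to-one. The Jaccard score, the threshold of 0.5, the greedy strategy, and all identifiers are assumptions made for this example; it is not the alignment algorithm used for NeXO, and it enforces only unique mappings, not constraints on ancestor-descendant ordering.

```python
def jaccard(a, b):
    """Overlap between two gene sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_terms(inferred, reference, threshold=0.5):
    """Greedy one-to-one alignment of inferred terms to reference terms.

    Candidate pairs are visited from highest to lowest Jaccard score, and a
    pair is accepted only if neither term has already been matched.
    """
    candidates = sorted(
        ((jaccard(genes_i, genes_r), term_i, term_r)
         for term_i, genes_i in inferred.items()
         for term_r, genes_r in reference.items()),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    mapping, used = {}, set()
    for score, term_i, term_r in candidates:
        if score < threshold:
            break  # remaining candidates score even lower
        if term_i not in mapping and term_r not in used:
            mapping[term_i] = (term_r, score)
            used.add(term_r)
    return mapping


# Hypothetical systematic IDs and gene sets, for illustration only.
inferred = {
    "NEXO:1": {"RPL3", "RPS5", "RPL5"},
    "NEXO:2": {"ACT1", "ARP2"},
    "NEXO:3": {"YFG1", "YFG2"},
}
reference = {
    "ribosome": {"RPL3", "RPS5", "RPL5", "RPS3"},
    "actin cytoskeleton": {"ACT1", "ARP2", "ARP3"},
}
mapping = align_terms(inferred, reference)
novel = sorted(set(inferred) - set(mapping))  # terms with no confident match
```

Here the first two inferred terms inherit reference names, while the third remains unmatched and would be flagged as a candidate novel term.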

Although methods for ontology alignment have not received much attention in molecular biology or bioinformatics, they are under active research in the computer science and semantic web communities. We will implement an ontology alignment algorithm based on a previously-proposed method called ASMOV59, which was the winning ontology alignment algorithm in the 2010 Ontology Alignment Evaluation Initiative (om2010.ontologymatching.org/). The method was designed to align semantic ontologies, and it is based on a score function that measures the lexical similarity of text labels and comments associated with terms. Hence, we will modify and expand this approach to align ontologies in which the terms refer to sets of genes (technically, the set of genes assigned to a term defines the ‘label’ of that term). Application of current and new procedures for data-driven ontologies to Driving Projects. We will begin work immediately to construct and/or revise data-driven ontologies with each of our Driving Biomedical Projects, an activity that is expected to continue for most of the next five-year performance period. The projects are:

1. Creating new terms and term relations in the Gene Ontology. Our previous efforts to infer gene ontologies from network data were initially carried out as a Collaboration and Service Project (CSP) with Mike Cherry, Professor of Genetics at Stanford and head of the Gene Ontology Consortium for the Saccharomyces model organism. Together with Cherry, we will continually apply tools developed in this TRD to revise and expand the yeast NeXO based on new data, and to communicate the most promising new terms and term relations it identifies to the Saccharomyces GO.

2. Elucidating the hierarchy of modules in the virus-human protein interaction network. Dr. Nevan Krogan at UCSF is a world leader in generating large-scale maps of protein complexes based on affinity purification mass spectrometry as well as in systems for synthetic lethal genetic interaction screening. NRNB and Krogan have a long-standing relationship in developing physical and genetic interaction maps of biological systems of interest11,12,18,60-62, including the original NeXO paper28. We expect this productive relationship to continue as we develop tools for data-driven assembly and refinement of gene ontologies within this TRD, initially as applied to physical and genetic interactions of viral protein subunits with proteins encoded by the human host.

Figure 4. Automated assembly and alignment of gene ontologies. (A) Probabilistic community detection within the input networks yields a binary tree in which nodes correspond to ontology terms and links correspond to parent-child term relations. Unsupported terms are replaced by multi-way joins, and additional parent-child relations are added based on network data. The resulting ontology is aligned against the Gene Ontology, in a way that (B) prohibits non-unique mappings and ancestor-descendant criss-crossing.

3. Gene ontology inference based on binding-site-resolved ‘edgetic’ protein networks. Drs. Marc Vidal and David Hill are pioneers in protein interaction mapping via the yeast-two-hybrid system. Recently they developed the capability to map interactions at binding site resolution, by using modular protein domains as baits combined with phage display knowledge of the preferred binding motif of each domain. We will together explore whether this binding interface information can be used to inform the inferred gene ontology structures we are building in this TRD.

4. Hierarchical analysis of cancer subtypes with TCGA / ICGC and Sage Bionetworks. Cancer genomics projects are generating large cancer-specific ‘omics data sets. Natural DBPs for this project are therefore provided by The Cancer Genome Atlas, the International Cancer Genome Consortium, and Sage Bionetworks, all of which are associated with major cancer genomics projects nationally and internationally. Our focus will be to construct a Cancer Gene Ontology based on a pan-cancer analysis of data from all ~20 major TCGA tissue types. Such a Cancer GO would provide insight into the hierarchy of biological processes and cellular components that are somatically mutated or differentially activated during cancer progression.

5. Understanding the multi-scale hierarchy of social interactions. We will work with UCSD Professor James Fowler, a renowned social networks researcher, to apply the hierarchical methods developed in this aim to analyze the structure of a large social network generated from the Framingham Heart Study. This study has surveyed health behaviors, disease outcomes, and social relationships among >12,000 people for over 37 years25-27.

During these collaborations, we will experiment with ontologies constructed from different sources and types of data, e.g. using genetic interactions only versus also including physical interactions and other types. Such exploration is needed to evaluate which interaction types are most revealing of cellular componentry such as protein complexes and larger macro-molecular structures, and how to weight genetic versus physical interactions for this purpose. We will seek to determine how much interaction data one needs to construct a robust ontology for each of the DBP datasets, e.g., one which is able to faithfully recover a substantial fraction of knowledge in the manually-curated GO. At present, we know that this is possible using an integrated network including all genetic and physical interactions that have been mapped to date for budding yeast.

Development of iterative procedures for incorporating new data into a data-driven ontology. We will conduct a major program of exploratory research and development on approaches by which data-driven gene ontologies such as NeXO can evolve over time, by incorporating new datasets as they are generated and published. We will begin by evaluating a relatively straightforward approach, which is to integrate the new dataset(s) into the pairwise gene similarity matrix which forms the input to the ontology inference method (see above). 
Once the similarities have been adjusted, an ‘updated’ ontology is constructed based on the old+new data and aligned against the ‘previous’ ontology based on old data only. Similar to alignment against GO (see above), the desired result is to identify terms and term relations in the updated ontology that are newly created, as well as previous terms / relations that are reinforced by the new data. Ultimately one might also imagine downgrading or retiring terms that have remained unsupported over many diverse dataset updates, but this is admittedly a more delicate proposition than adding new terms. A limitation of this simple update approach is that the complete ontology must be reconstructed each time a new data set is evaluated. An alternative and potentially more efficient approach may be to directly modify the previous ontology using information from the new data set. We will explore both the simple and the more advanced approaches in the course of research. Given an update procedure, the experimentalist may wish to design further studies aimed at the new terms. These specially directed new data could then spawn another ontology update, enabling the exciting possibility of continued iteration between improving the ontology (i.e., the biological model) and the experimental data generation phases of a study.
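The simple update approach can be sketched in two small steps: blend the new dataset into the pairwise similarities, then compare the terms of the rebuilt ontology with those of the previous one. The weighted average, the `weight_new` parameter, and the exact-match comparison of gene sets are assumptions chosen for brevity; a real comparison would use ontology alignment rather than exact set equality.

```python
def blend_similarity(old_sim, new_sim, weight_new=0.5):
    """Weighted average of two pairwise similarity dictionaries.

    A gene pair that is missing from one dataset keeps the score from the other.
    """
    blended = {}
    for pair in set(old_sim) | set(new_sim):
        if pair in old_sim and pair in new_sim:
            blended[pair] = ((1 - weight_new) * old_sim[pair]
                             + weight_new * new_sim[pair])
        else:
            blended[pair] = old_sim.get(pair, new_sim.get(pair))
    return blended


def compare_ontologies(previous_terms, updated_terms):
    """Classify terms (gene sets) as reinforced, newly created, or unsupported."""
    previous, updated = set(previous_terms), set(updated_terms)
    return {
        "reinforced": previous & updated,
        "new": updated - previous,
        "unsupported": previous - updated,
    }


# Hypothetical similarities and terms, for illustration only.
old_sim = {frozenset({"A", "B"}): 0.8}
new_sim = {frozenset({"A", "B"}): 0.4, frozenset({"A", "C"}): 0.6}
blended = blend_similarity(old_sim, new_sim)

previous_terms = [frozenset({"A", "B"}), frozenset({"C", "D"})]
updated_terms = [frozenset({"A", "B"}), frozenset({"C", "D", "E"})]
report = compare_ontologies(previous_terms, updated_terms)
```

Terms that land in the "unsupported" group after a single update would not necessarily be retired; as noted above, that decision is more delicate and might require many diverse updates.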

An online system for distribution and community construction of data-driven ontologies. Ontology models developed with our DBPs will be made available to the scientific community via query from the stand-alone NeXO website, nexontology.org, as well as through a specialized App for Cytoscape. We will also prototype a web-based system whereby a unified and common ‘Crowd-Sourced NeXO Ontology’ can be iteratively updated from biological data sets uploaded by investigators from the biomedical research community at large. Achieving this vision will require the addition of major features to nexontology.org, including user accounts, data upload, and a cloud-based implementation of ontology inference. If successful, we will seek to transition the new website to independent funding to support what could ultimately become a large community of users. The allure of such a system is that the wealth of ‘omics data being generated every year could be analyzed to assemble different types of gene ontology systematically, with less and less reliance on back curation of the literature. Ultimately, the desired outcome is to enable a shift from using ontologies to evaluate data to using data to construct and evaluate ontologies, that is, from a regime in which the ontology is viewed as gold standard to one in which it is the major result.

3.2 FUNCTIONALIZED GENE ONTOLOGIES AS A HIERARCHY OF PHENOTYPIC PREDICTION

Project Leader: Trey Ideker (UCSD)

Overview. Whether based on expert knowledge (GO) or inferred from data (NeXO in Aim 1), current gene ontologies are static descriptions of cellular structure and organization. They enable representing and reasoning on the structural relationships among biological entities27,29 but lack any native capacity to capture dynamic functional states or make phenotypic predictions. 
However, since gene ontologies inherently represent multi-scale hierarchy in cellular organization, they provide in theory an ideal substrate for building models that would also be predictive of a range of cellular functions and phenotypes. In this respect, question and answer systems developed in the field of knowledge representation and reasoning26, such as Apple’s Siri and IBM’s Watson, provide an excellent example of what a predictive, or ‘executable’, ontology looks like. In this aim, we will explore whether such systems can teach us how to develop predictive systems for cell biology31. This aim intersects with the separate TRD project on Predictive Networks and serves as a bridge between the two TRDs. Preliminary Results and Progress Report: Activating static networks as predictive models. The Ideker laboratory has over the years introduced a progression of approaches that seek to use molecular network information to guide the prediction of phenotypic outcomes such as disease state or drug response. Relevant works include ActiveModules63, Network-Based Classification64, Network-Guided Forests65, Network-Based Stratification66 and several influential reviews on using networks predictively67,68. The more recent works (2011 to present) were supported by the past period of NRNB funding. Generally, our methodology has been to identify subnetworks of genes whose expression levels (molecular profile) or mutation states (genotype) can be functionally combined to predict disease outcome (phenotype or class). For example, Network-Guided Forests is a classification method that associates subnetworks of genes with decision trees that evaluate the expression levels of those genes to predict sample class. Such approaches have shown success in classification of metastatic vs. non-metastatic breast cancer64, aggressive vs. indolent leukemia69, as well as classification of cell fate decisions during development16,65. 
We have found repeatedly that, unlike the gene sets identified by regular classifiers, the subnetworks identified by network-based methods are highly enriched for causal factors of disease, and they show very consistent performance across different sample datasets. Progress Report Publications 12-17.

Methods

Taking clues from Siri: propagation of state on predictive ontologies. We will explore use of the structure of ontologies, rather than the structure of networks, in making phenotypic predictions. The key distinction is that networks are concerned mainly with pairwise associations between genes,

whereas ontologies represent hierarchical relations across a range of biological modules at various scales including genes and proteins, protein complexes, pathways and processes, and organelles. Question-answering systems such as Apple’s Siri provide a useful model of how hierarchical relations in an ontology can propagate state information. Unlike current gene ontologies, which are descriptive, Siri’s ontologies are coupled with dynamic reasoning systems that render them active: “Whereas a conventional ontology is a formal representation of domain knowledge with distinct concepts and relations among concepts, an Active Ontology is a processing formalism where distinct processing elements are arranged according to ontology notions; it is an execution environment”30. These active ontologies not only encode entities and relations; their entities are also associated with states, and their relations with rule sets that perform actions within and among entities. During execution, input states are incrementally propagated up and down the hierarchy to impact other entities, whose states provide the answer to the initial question – the best prediction based on the inputs. How the ontologies within Siri are used to answer questions, however, is very different from how GO is used today in bioinformatics. Typically, GO terms are associated with a set of genes (annotations), but not with dynamic states; the relationships between GO terms are not associated with rule sets that perform actions, at least beyond propagation of gene set annotations. Given this contrast, we will explore construction of such an ‘active’ gene ontology as a general engine for genotype-phenotype translation.

Genotype-to-phenotype prediction challenges from Driving Biomedical Projects. We will base our methods development on data and prediction challenges motivated by DBPs in yeast (Cherry DBP) and cancer (TCGA DBP). 
Yeast has by far the largest number of genotype-phenotype measurements of any organism: most single and double gene knockout strains have been constructed and assayed for growth, yielding over 10 million ‘simple’ genotypes systematically tested for the same phenotype70-72. In addition, hundreds of natural yeast genetic isolates have been fully sequenced and extensively phenotyped, providing examples of complex genotype backgrounds73. In cancer, TCGA currently has tumor exomes available for over 8000 cancer patients (genotypes), along with clinical information such as survival time, tumor grade, and in some cases drug response (phenotypes). In both yeast and cancer, the goal is to predict the phenotype of growth, survival, etc. given the genotype of a strain or patient.

Transformation of genotype to ‘ontotype’. The genotype indicates the set of mutation states of all genes, which for each gene might be represented simply as {mutated, wildtype} or {loss-of-function, wild-type, gain-of-function} before considering more precise values. We will prototype propagation approaches by which these states on genes can be integrated with a gene ontology to infer corresponding states on terms. For example, since the gene SWI4 encodes a subunit of the SBF complex, the yeast swi4Δ genotype {Swi4 <= loss-of-function} might propagate upwards in the ontology to set the state of the parent term {SBF transcription complex <= loss-of-function}, and continue to propagate upwards to affect ancestor terms at higher scales such as ‘RNA pol II transcription factor complex’ and ultimately ‘nucleus’ and ‘cell’. We call the set of mutation states of all terms the ‘ontotype.’ For prediction problems, the ontotype and genotype can then be used together or separately as a set of features for classification of a phenotypic class, e.g. {alive, dead}, or regression against a quantitative phenotype, e.g. numerical growth rate or progression-free time interval. Alternatively, the state of any particular term, representing a cellular component or process, can itself be considered as the phenotype of interest. Predictions will be benchmarked using metrics such as ROC and PR curves along with standard statistical techniques such as cross-validation or bootstrapping. 
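The swi4Δ example can be made concrete with a small sketch of upward propagation. It assumes binary {mutated, wild-type} states and a pluggable combination rule; the function name `ontotype`, its arguments, and the choice of `any` (logical OR) or `all` (logical AND) as the rule are illustrative assumptions, since the form of the propagation function is an open research question. SWI6 is included as the second SBF subunit to show how the two rules differ.

```python
def ontotype(genotype, parents, gene_terms, rule=any):
    """Propagate mutated genes upward to obtain a state for every term.

    genotype   -- set of genes carrying a loss-of-function mutation
    parents    -- dict: term -> list of parent terms (a directed acyclic graph)
    gene_terms -- dict: gene -> list of terms the gene is directly assigned to
    rule       -- how a term combines the states of its members; `any` acts as
                  a logical OR and `all` as a logical AND
    """
    # Collect the direct members (genes or child terms) of every term.
    members = {}
    for gene, terms in gene_terms.items():
        for term in terms:
            members.setdefault(term, []).append(gene)
    for child, term_parents in parents.items():
        for parent in term_parents:
            members.setdefault(parent, []).append(child)

    # Gene states come straight from the genotype; term states are computed.
    states = {gene: gene in genotype for gene in gene_terms}

    def state_of(node):
        if node not in states:
            children = members.get(node, [])
            # A term with no members is treated as unaffected.
            states[node] = bool(children) and rule(state_of(m) for m in children)
        return states[node]

    all_terms = set(members) | set(parents)
    return {term: state_of(term) for term in all_terms}


gene_terms = {
    "SWI4": ["SBF transcription complex"],
    "SWI6": ["SBF transcription complex"],
}
parents = {
    "SBF transcription complex": ["RNA pol II transcription factor complex"],
    "RNA pol II transcription factor complex": ["nucleus"],
    "nucleus": ["cell"],
}
swi4_delta = {"SWI4"}
or_ontotype = ontotype(swi4_delta, parents, gene_terms, rule=any)
and_ontotype = ontotype(swi4_delta, parents, gene_terms, rule=all)
```

Under the OR rule the single deletion marks every ancestor term as affected, whereas under the AND rule no term is affected because SWI6 is still wild-type. The resulting dictionary of term states could then serve as a feature vector for a classifier or regression model.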
Open questions and milestones. A major research question will be to determine how to dynamically compute the states of ontology terms based on the states of their children, parents, descendants, and ancestors. The underlying mathematical function could take many forms, including logic gates such as AND / OR, linear or additive functions, probabilistic functions, or polynomial or logistic equations. How to determine the specific forms and parameters of these functions, regardless of what form they take, is also unclear. This step could happen by statistical association from many input-output examples using machine learning methods, by including externally generated biological knowledge specific to each entity, or by manual curation from literature. As this aim is quite exploratory, we do not include specific algorithmic plans or mathematical details here. Some important milestones for success, however, will be (1) a proof-of-principle bioinformatic method for propagating molecular profiles on a gene ontology to predict a phenotypic outcome, and (2) implementation of this method in a robust software tool as a Cytoscape App.

3.3 BRIDGING LIGAND-RECEPTOR NETWORKS TO CELL-CELL COMMUNICATION NETWORKS

Project Leader: Gary Bader (University of Toronto)

Overview. Cell-cell interaction networks are an emerging area of network science. In collaboration with the Zandstra DBP, which is mapping cell-cell interaction networks in the hematopoietic system to help engineer blood tissue, we will develop novel technology for cell network analysis. We will develop methods to infer cell-cell interaction networks from molecular profiling data of purified cell populations, cell-cell interaction network topology analysis software, methods to identify intracellular pathways that control cell-cell interactions and methods to visualize multi-scale models of inter-cellular communication networks and their intracellular signaling systems.

Preliminary Results and Progress Report. In the past funding period, we worked with the Zandstra lab to prototype cell-cell interaction network inference methods and their analysis. Two papers were published in Molecular Systems Biology that experimentally mapped novel cell-cell interaction networks for the purpose of identifying growth and inhibitory factors that modulate self-renewal, which is useful for blood stem cell control. 
The second paper included network topology analysis and discovered that ligand production is cell type dependent, whereas ligand binding is promiscuous. Consequently, additional control strategies such as cell frequency modulation and compartmentalization were needed to achieve specificity in HSC fate regulation. These proof-of-concept methods now need to be further developed to extend and streamline their use, as described below. Progress Report Publications 20,118. Methods Cell-cell interaction network inference from single cell population molecular profiles. Cell-cell interaction networks are currently mapped by inferring regulatory relationships based on the expression of transmitters and receptors at the cell surface. For instance, if cell type A expresses the epidermal growth factor peptide hormone and cell type B expresses the epidermal growth factor receptor protein, and there is a means to transmit the hormone to the target receptor (e.g. by diffusion within a tissue or in the blood stream), then a directional edge is inferred from cell type A to cell type B. This process depends on the availability of relatively pure cell populations and ability to measure the expression of their secreted and surface proteins, both of which are practical with current technology43,74,75. We will develop technology to automatically process mRNA and protein expression profiles from cell populations into cell-cell interaction networks using the following steps:

1. Identify all known ligands and receptors based on known gene function annotation. For instance, using gene ontology terms “cytokine activity,” “growth factor activity,” “hormone activity,” and “receptor activity,” genes with ligand or receptor activity will be compiled from the Ensembl BioMart web service76.

2. Collect all known protein interactions between ligands and receptors (e.g. from iRefIndex77, GeneMANIA78, Pathway Commons79 and related comprehensive interaction resources). We have previously literature curated ~270 ligand-receptor pairs not currently in standard databases and these will also be included32,33.

3. Compile a list of expressed ligands and receptors from each available cell type population, based on available gene or protein expression data43,74,75. We will prefer protein expression information, but will use mRNA expression levels as a proxy when protein levels are not available (with appropriate caveats).

4. Infer directed regulatory edges between expressed ligand and receptor pairs.

5. Visualize the resulting cell-cell interaction network.
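Steps 3 and 4 of the procedure above can be sketched as follows, assuming that the ligand-receptor pairs (steps 1 and 2) and the per-cell-type lists of expressed genes are already available as plain Python collections. The function name, the data layout, and the cell type labels are hypothetical; the EGF/EGFR pair follows the example in the text and KITLG/KIT is added as a second known ligand-receptor pair.

```python
def infer_cell_cell_edges(expressed, ligand_receptor_pairs):
    """Infer directed edges (sender, receiver, ligand, receptor).

    expressed             -- dict: cell type -> set of expressed genes
    ligand_receptor_pairs -- iterable of (ligand, receptor) gene pairs
    """
    edges = []
    for ligand, receptor in ligand_receptor_pairs:
        senders = [c for c, genes in expressed.items() if ligand in genes]
        receivers = [c for c, genes in expressed.items() if receptor in genes]
        for sender in senders:
            for receiver in receivers:
                # sender == receiver is kept and represents autocrine signaling
                edges.append((sender, receiver, ligand, receptor))
    return sorted(edges)


# Hypothetical expression calls for three purified cell populations.
expressed = {
    "cell type A": {"EGF"},
    "cell type B": {"EGFR", "KITLG"},
    "cell type C": {"EGFR", "KIT"},
}
pairs = [("EGF", "EGFR"), ("KITLG", "KIT")]
edges = infer_cell_cell_edges(expressed, pairs)

# Out-degree per cell type: a first, crude indicator of signal-sending hubs.
out_degree = {}
for sender, _, _, _ in edges:
    out_degree[sender] = out_degree.get(sender, 0) + 1
```

The sketch treats expression as a yes/no call and does not model whether the ligand can physically reach the receptor (e.g. by diffusion or via the blood stream), which the text identifies as a requirement for inferring an edge.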

Preliminary work successfully used this approach, but we will develop it into a generally applicable technology that can be updated automatically and conveniently. Our initial focus will be on available human data, but the technology will be applicable to any organism with enough information available.

Discovery of key players and rules of cell-cell interaction networks. We will develop technology to make it easy for biologists to computationally analyze the topological properties of cell-cell interaction networks to help identify key control points and general organizational principles. We will use multiple established measures of node importance in networks (centrality measures), including hub detection (find highly connected nodes that when removed cause the network to split into parts80) and betweenness centrality (find important connection points between different network regions81). This analysis will be accomplished using the CytoHubba, CentiScaPe and/or NetMatch network analyzer Cytoscape apps, which we will tailor to function on directed cell-cell interaction networks. In particular, selected network analysis functions in these apps will be published as Cytoscape commands so they can be made available in a cell-cell interaction network analysis app that we will develop.

Identify intracellular pathways that control and are controlled by cell-cell interactions. We will develop novel computational methods to explain how signals observed to occur between cells are controlled by and control internal molecular networks and pathways. First, we will gather an intracellular network of physical molecular and control interactions between all identified receptors and secreted chemical signal genes from available molecular interaction and pathway databases (e.g. iRefIndex, GeneMANIA, Pathway Commons). We will then use established path finding algorithms (e.g. as implemented in Cytoscape apps such as PathExplorer and in the Pathway Commons web service system) to identify potential signaling pathways that control chemical signal secretion, and links from activated receptors to activation of pathways in target cells. Paths will be limited to genes expressed in the given cell population. To identify pathways that are controlled by a given cell-cell communication path, we will apply pathway enrichment analysis to downstream molecules in target cells. Thus, we will predict how inter-cellular signaling impinges on intracellular systems, which in turn could impinge on additional cell-cell signaling paths. We will also use the Pathway Extraction and Reduction Algorithm (PERA) method described in TRD1 to identify signaling systems involving cell-cell communication factors.

Multi-scale visualization of cell-cell interaction networks in the context of internal molecular networks. We will develop novel multi-scale network visualization methods to help interpret networks generated in this aim. In particular, we will group ligand and receptor families (using Cytoscape’s grouping function) to reduce complexity of the resulting network, based on family information in the Gene Ontology. We will also develop methods to display intracellular molecular paths, where nodes represent genes, within nodes representing cells. These paths will also connect to intracellular nodes representing pathways to visualize which pathways are activated by specific cell-cell communication signals.

Links with other TRDs. As the active collection of molecular profiles for secreted and receptor protein expression grows, we expect data sets to become available that cover multiple time points and samples (e.g. disease patients and healthy controls). Thus, we will develop multi-scale cell-cell interaction networks across conditions and use technology developed in the Differential Networks TRD to compare them. 
We will also explore how patient specific versions of these networks can be used as predictive features in work described in the Predictive Networks TRD.
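The intracellular path-finding step described earlier in this aim, in which paths are limited to genes expressed in the given cell population, can be illustrated with a plain breadth-first search. This is a stand-in for the established path-finding tools named in the text, not a use of their interfaces; the function name, the edge-list input format, and the toy interaction network are assumptions.

```python
from collections import deque


def shortest_path(interactions, source, target, expressed):
    """Breadth-first search over directed interactions among expressed genes.

    Returns the list of genes on one shortest path, or None if no path exists.
    """
    if source not in expressed or target not in expressed:
        return None
    # Keep only interactions whose two endpoints are both expressed.
    neighbors = {}
    for upstream, downstream in interactions:
        if upstream in expressed and downstream in expressed:
            neighbors.setdefault(upstream, []).append(downstream)
    previous = {source: None}
    queue = deque([source])
    while queue:
        gene = queue.popleft()
        if gene == target:
            path = []
            while gene is not None:
                path.append(gene)
                gene = previous[gene]
            return path[::-1]
        for nxt in neighbors.get(gene, []):
            if nxt not in previous:
                previous[nxt] = gene
                queue.append(nxt)
    return None


# Toy directed network downstream of a receptor, for illustration only.
interactions = [
    ("EGFR", "GRB2"), ("GRB2", "SOS1"), ("SOS1", "KRAS"),
    ("EGFR", "SHC1"), ("SHC1", "GRB2"),
]
all_expressed = {"EGFR", "GRB2", "SOS1", "KRAS", "SHC1"}
path = shortest_path(interactions, "EGFR", "KRAS", all_expressed)
blocked = shortest_path(interactions, "EGFR", "KRAS", all_expressed - {"GRB2"})
```

When every gene is expressed the search returns the direct route from the receptor; when an essential intermediate is not expressed in the cell population, no path is reported.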

TRD 3: MULTI-SCALE NETWORKS – BIBLIOGRAPHY AND REFERENCES CITED

1. Mathews, C.K. The Cell-Bag of Enzymes or Network of Channels? J Bacteriol 175, 6377-81 (1993).

2. Reddy, G.P., Singh, A., Stafford, M.E. & Mathews, C.K. Enzyme Associations in T4 Phage DNA Precursor Synthesis. Proc Natl Acad Sci U S A 74, 3152-6 (1977).

3. Monod, J., Changeux, J.P. & Jacob, F. Allosteric Proteins and Cellular Control Systems. Journal of Molecular Biology 6, 306-& (1963).

4. Kauffman, S.A. The Origins of Order : Self-Organization and Selection in Evolution, xviii, 709 p. (Oxford University Press, New York, 1993).

5. Pollack, G.H. Cells, Gels and the Engines of Life : A New, Unifying Approach to Cell Function, xiv, 305 p. (Ebner & Sons, Seattle, WA, 2001).

6. Bray, D. Wetware : A Computer in Every Living Cell, xii, 267 p. (Yale University Press, New Haven ; London, 2009).

7. Barabasi, A.L. & Oltvai, Z.N. Network Biology: Understanding the Cell's Functional Organization. Nat Rev Genet 5, 101-13 (2004).

8. Mitra, K., Carvunis, A.R., Ramesh, S.K. & Ideker, T. Integrative Approaches for Finding Modular Structure in Biological Networks. Nat Rev Genet 14, 719-32 (2013).

9. Ashburner, M. et al. Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).

10. Novarino, G. et al. Exome Sequencing Links Corticospinal Motor Neuron Disease to Common Neurodegenerative Disorders. Science 343, 506-11 (2014).

11. Bandyopadhyay, S. et al. Rewiring of Genetic Networks in Response to DNA Damage. Science 330, 1385-9 (2010).

12. Roguev, A. et al. Conservation and Rewiring of Functional Modules Revealed by an Epistasis Map in Fission Yeast. Science 322, 405-10 (2008).

13. Workman, C.T. et al. A Systems Approach to Mapping DNA Damage Response Pathways. Science 312, 1054-9 (2006).

14. Konig, R. et al. Human Host Factors Required for Influenza Virus Replication. Nature 463, 813-7 (2010).

15. Suthram, S., Sittler, T. & Ideker, T. The Plasmodium Protein Network Diverges from Those of Other Eukaryotes. Nature 438, 108-12 (2005).

16. Ravasi, T. et al. An Atlas of Combinatorial Transcriptional Regulation in Mouse and Man. Cell 140, 744-52 (2010).

17. Bandyopadhyay, S. et al. A Human MAP Kinase Interactome. Nat Methods 7, 801-5 (2010).

18. Guenole, A. et al. Dissection of DNA Damage Responses Using Multiconditional Genetic Interaction Maps. Mol Cell 49, 346-58 (2013).

19. Begley, T.J., Rosenbach, A.S., Ideker, T. & Samson, L.D. Hot Spots for Modulating Toxicity Identified by Genomic Phenotyping and Localization Mapping. Mol Cell 16, 117-25 (2004).

20. Srivas, R. et al. A UV-Induced Genetic Network Links the RSC Complex to Nucleotide Excision Repair and Shows Dose-Dependent Rewiring. Cell Rep 5, 1714-24 (2013).

21. Jaehnig, E.J., Kuo, D., Hombauer, H., Ideker, T.G. & Kolodner, R.D. Checkpoint Kinases Regulate a Global Network of Transcription Factors in Response to DNA Damage. Cell Rep 4, 174-88 (2013).

22. Chuang, H.Y., Hofree, M. & Ideker, T. A Decade of Systems Biology. Annu Rev Cell Dev Biol 26, 721-44 (2010).

23. Walhout, A.J.M., Vidal, M. & Dekker, J. Handbook of Systems Biology : Concepts and Insights, xiii, 538 p. (Waltham Academic Press, London ;, 2013).

24. Koller, D. & Friedman, N. Probabilistic Graphical Models : Principles and Techniques, xxi, 1231 p. (MIT Press, Cambridge, MA, 2009).

25. Gruber, T.R. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies 43, 907-928 (1995).

26. Brachman, R.J. & Levesque, H.J. Knowledge Representation and Reasoning, xxix, 381 p. (Morgan Kaufmann, Amsterdam ; Boston, 2004).

27. Robinson, P.N. & Bauer, S. Introduction to Bio-Ontologies, xxvii, 488 p. (Taylor & Francis, Boca Raton, 2011).

28. Dutkowski, J. et al. A Gene Ontology Inferred from Molecular Networks. Nat Biotechnol 31, 38-45 (2013).

29. Myhre, S., Tveit, H., Mollestad, T. & Laegreid, A. Additional Gene Ontology Structure for Improved Biological Reasoning. Bioinformatics 22, 2020-7 (2006).

30. Guzzoni, D., Baur, C. & Cheyer, A. Active: A Unified Platform for Building Intelligent Web Interaction Assistants. 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Workshops Proceedings, 417-420 (2006).

31. Wren, J.D. Question Answering Systems in Biology and Medicine--the Time Is Now. Bioinformatics 27, 2025-6 (2011).

32. Qiao, W. et al. Intercellular Network Structure and Regulatory Motifs in the Human Hematopoietic System. Molecular systems biology 10, 741 (2014).

33. Kirouac, D.C. et al. Dynamic Interaction Networks in a Hierarchically Organized Tissue. Mol Syst Biol 6, 417 (2010).

34. Billia, F., Barbara, M., McEwen, J., Trevisan, M. & Iscove, N.N. Resolution of Pluripotential Intermediates in Murine Hematopoietic Differentiation by Global Complementary DNA Amplification from Single Cells: Confirmation of Assignments by Expression Profiling of Cytokine Receptor Transcripts. Blood 97, 2257-68 (2001).

35. Majka, M. et al. Numerous Growth Factors, Cytokines, and Chemokines Are Secreted by Human CD34(+) Cells, Myeloblasts, Erythroblasts, and Megakaryoblasts and Regulate Normal Hematopoiesis in an Autocrine/Paracrine Manner. Blood 97, 3075-85 (2001).

36. von Dassow, G., Meir, E., Munro, E.M. & Odell, G.M. The Segment Polarity Network Is a Robust Developmental Module. Nature 406, 188-92 (2000).

37. Kondo, S. Cell-Cell Interaction Network That Generates the Skin Pattern of Animal. Genome Inform 16, 287-91 (2005).

38. De Matteis, G., Graudenzi, A. & Antoniotti, M. A Review of Spatial Computational Models for Multi-Cellular Systems, with Regard to Intestinal Crypts and Colorectal Cancer Development. J Math Biol 66, 1409-62 (2013).

39. Eckmann, J.P. & Moses, E. Curvature of Co-Links Uncovers Hidden Thematic Layers in the World Wide Web. Proc Natl Acad Sci U S A 99, 5825-9 (2002).

40. Komurov, K. Modeling Community-Wide Molecular Networks of Multicellular Systems. Bioinformatics 28, 694-700 (2012).

41. Frankenstein, Z., Alon, U. & Cohen, I.R. The Immune-Body Cytokine Network Defines a Social Architecture of Cell Interactions. Biol Direct 1, 32 (2006).

42. Tieri, P. et al. Quantifying the Relevance of Different Mediators in the Human Immune Cell Network. Bioinformatics 21, 1639-43 (2005).

43. Gedye, C.A. et al. Cell Surface Profiling Using High-Throughput Flow Cytometry: A Platform for Biomarker Discovery and Analysis of Cellular Heterogeneity. PLoS ONE 9, e105602 (2014).

44. McPherson, A. Introduction to Macromolecular Crystallography, x, 267 p. (Wiley-Blackwell, Hoboken, N.J., 2009).

45. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. & Barabasi, A.L. Hierarchical Organization of Modularity in Metabolic Networks. Science 297, 1551-5 (2002).

46. Dotan-Cohen, D., Letovsky, S., Melkman, A.A. & Kasif, S. Biological Process Linkage Networks. PLoS One 4, e5313 (2009).

47. Tanay, A., Sharan, R., Kupiec, M. & Shamir, R. Revealing Modularity and Organization in the Yeast Molecular Network by Integrated Analysis of Highly Heterogeneous Genomewide Data. Proc Natl Acad Sci U S A 101, 2981-6 (2004).

48. Kelley, R. & Ideker, T. Systematic Interpretation of Genetic Interactions Using Protein Networks. Nat Biotechnol 23, 561-6 (2005).

49. Jaimovich, A., Rinott, R., Schuldiner, M., Margalit, H. & Friedman, N. Modularity and Directionality in Genetic Interaction Maps. Bioinformatics 26, i228-36 (2010).

50. Park, Y. & Bader, J.S. Resolving the Structure of Interactomes with Hierarchical Agglomerative Clustering. BMC Bioinformatics 12 Suppl 1, S44 (2011).

51. Dutkowski, J. et al. NeXO Web: The NeXO Ontology Database and Visualization Platform. Nucleic Acids Res 42, D1269-74 (2014).

52. Lee, I., Li, Z. & Marcotte, E.M. An Improved, Bias-Reduced Probabilistic Functional Gene Network of Baker's Yeast, Saccharomyces cerevisiae. PLoS One 2, e988 (2007).

53. Kramer, M., Dutkowski, J., Yu, M., Bafna, V. & Ideker, T. Inferring Gene Ontologies from Pairwise Similarity Data. Bioinformatics 30, i34-42 (2014).

54. Jensen, L.J. et al. STRING 8--a Global View on Proteins and Their Functional Interactions in 630 Organisms. Nucleic Acids Res 37, D412-6 (2009).

55. Lee, I., Date, S.V., Adai, A.T. & Marcotte, E.M. A Probabilistic Functional Network of Yeast Genes. Science 306, 1555-8 (2004).

56. Jansen, R. et al. A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data. Science 302, 449-53 (2003).

57. Breiman, L. Random Forests. Machine Learning 45, 5-32 (2001).

58. Clauset, A., Moore, C. & Newman, M.E. Hierarchical Structure and the Prediction of Missing Links in Networks. Nature 453, 98-101 (2008).

59. Jean-Mary, Y.R., Shironoshita, E.P. & Kabuka, M.R. Ontology Matching with Semantic Verification. Web Semant 7, 235-251 (2009).

60. Ryan, C.J. et al. Hierarchical Modularity and the Evolution of Genetic Interactomes across Species. Mol Cell 46, 691-704 (2012).

61. Wilmes, G.M. et al. A Genetic Interaction Map of RNA-Processing Factors Reveals Links between Sem1/Dss1-Containing Complexes and mRNA Export and Splicing. Mol Cell 32, 735-46 (2008).

62. Hannum, G. et al. Genome-Wide Association Data Reveal a Global Map of Genetic Interactions among Protein Complexes. PLoS Genet 5, e1000782 (2009).

63. Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A.F. Discovering Regulatory and Signalling Circuits in Molecular Interaction Networks. Bioinformatics 18 Suppl 1, S233-40 (2002).

64. Chuang, H.Y., Lee, E., Liu, Y.T., Lee, D. & Ideker, T. Network-Based Classification of Breast Cancer Metastasis. Mol Syst Biol 3, 140 (2007).

65. Dutkowski, J. & Ideker, T. Protein Networks as Logic Functions in Development and Cancer. PLoS Comput Biol 7, e1002180 (2011).

66. Hofree, M., Shen, J.P., Carter, H., Gross, A. & Ideker, T. Network-Based Stratification of Tumor Mutations. Nat Methods 10, 1108-15 (2013).

67. Ideker, T., Dutkowski, J. & Hood, L. Boosting Signal-to-Noise in Complex Biology: Prior Knowledge Is Power. Cell 144, 860-3 (2011).

68. Carvunis, A.R. & Ideker, T. Siri of the Cell: What Biology Could Learn from the iPhone. Cell 157, 534-8 (2014).

69. Chuang, H.Y. et al. Subnetwork-Based Analysis of Chronic Lymphocytic Leukemia Identifies Pathways That Associate with Disease Progression. Blood 120, 2639-49 (2012).

70. Costanzo, M. et al. The Genetic Landscape of a Cell. Science 327, 425-31 (2010).

71. Winzeler, E.A. et al. Functional Characterization of the S. cerevisiae Genome by Gene Deletion and Parallel Analysis. Science 285, 901-6 (1999).

72. Hillenmeyer, M.E. et al. The Chemical Genomic Portrait of Yeast: Uncovering a Phenotype for All Genes. Science 320, 362-5 (2008).

73. Bloom, J.S., Ehrenreich, I.M., Loo, W.T., Lite, T.L. & Kruglyak, L. Finding the Sources of Missing Heritability in a Yeast Cross. Nature 494, 234-7 (2013).

74. Novershtern, N. et al. Densely Interconnected Transcriptional Circuits Control Cell States in Human Hematopoiesis. Cell 144, 296-309 (2011).

75. Laurenti, E. et al. The Transcriptional Architecture of Early Human Hematopoiesis Identifies Multilevel Control of Lymphoid Commitment. Nat Immunol 14, 756-63 (2013).

76. Kinsella, R.J. et al. Ensembl BioMarts: A Hub for Data Retrieval across Taxonomic Space. Database (Oxford) 2011, bar030 (2011).

77. Turner, B. et al. iRefWeb: Interactive Analysis of Consolidated Protein Interaction Data and Their Supporting Evidence. Database (Oxford) 2010, baq023 (2010).

78. Zuberi, K. et al. GeneMANIA Prediction Server 2013 Update. Nucleic Acids Res 41, W115-22 (2013).

79. Cerami, E.G. et al. Pathway Commons, a Web Resource for Biological Pathway Data. Nucleic Acids Res (2010).

80. Jeong, H., Mason, S.P., Barabasi, A.L. & Oltvai, Z.N. Lethality and Centrality in Protein Networks. Nature 411, 41-2 (2001).

81. Yu, H., Kim, P.M., Sprecher, E., Trifonov, V. & Gerstein, M. The Importance of Bottlenecks in Protein Networks: Correlation with Gene Essentiality and Expression Dynamics. PLoS Comput Biol 3, e59 (2007).