Network Biology: from lists to underpinnings of molecular behaviour


Michel Dumontier, Ph.D., Associate Professor of Bioinformatics

Carleton University

BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]


Provenance

• This talk was prepared in part with input from the “Interpreting Gene Lists” workshop put forward by the Canadian Bioinformatics Workshops (bioinformatics.ca)

• http://bioinformatics.ca/workshops/2009/course-content


So you did some mass spectrometry?

Protein Identification

Database search vs de novo

[Figure: MS/MS spectrum (S#: 1708, RT: 54.47, AV: 1, NL: 5.27E6; T: + c d Full ms2 638.00 [165.00 - 1925.00]). x-axis: m/z (200 to 2000); y-axis: relative abundance (0 to 100). Labelled peaks: 226.9, 326.0, 397.1, 425.0, 489.1, 524.9, 588.1, 589.2, 629.0, 687.3, 850.3, 851.4, 949.4, 1048.6, 1049.6.]

De novo: AVGELTK

Database search: database of known peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..



My experiment worked and I have dozens, hundreds, or thousands of hits… now what?

[Figure: MS/MS spectrum (m/z vs relative abundance) → protein identification → ?]


Use the list to explore Biology

• Determine significant shared attributes
• Explore putative mechanisms of action
• Test hypotheses

[Figure: MS/MS spectrum (m/z vs relative abundance) → protein identification → network biology → "Eureka!" Hypothesis on the molecular basis of disease/process]


[Figure: attribute enrichment example. Labels: "# in list having attribute" and "# in list sharing these attributes". Attributes shown: Oxidative Metabolism, Detoxification. Enriched in smokers = up-regulated in smokers.]


Outline

1. Explore identified proteins

2. Attribute enrichment

3. Networks

4. Pathways

5. Lab


A hypothesis underlies the list of identified proteins

• An initial question was posed, an experiment performed and a list of candidates obtained.

• The question is: what are the roles of these entities in the biological process being investigated?
– Normal vs pathological
– Response to stimulus
– Interactions and complexes


Biological Answers

• Computational systems biology
– Information retrieval and summary
– Interaction network analysis
– Pathway analysis
– Function prediction


Molecular Attributes

• An attribute provides information about the entity in question (e.g. shape, function, process)

• Sequence and structure provide information about
– Motifs, domains, interaction/binding sites, post-translational modifications, conformational changes, molecular complexes, mutations, conservation/evolution
– Functions, localization, biological / pathological processes


Gene Ontology

• Captures terminology related to three aspects
– biological processes (e.g. cell division)
– molecular functions (e.g. isomerase activity)
– cellular components

• Relationships between terms are largely defined with “is a” and “part of” relations


GO Structure

[Figure: fragment of the GO hierarchy. Terms shown: cell; membrane, chloroplast; mitochondrial membrane, chloroplast membrane. Edge types: is-a, part-of.]

Species independent. Some lower-level terms are specific to a group, but higher-level terms are not.


Gene Ontology

• 30,393 terms, 99.2% with definitions
– 18,939 biological processes
– 2,735 cellular components
– 8,719 molecular functions

• GO Slim is an official reduced set of GO terms
– Generic, plant, yeast
– Good for making pie charts


Annotation

• Manual annotation
– Created by scientific curators
  • High quality
  • Small number (time-consuming to create)

• Electronic annotation
– Annotation derived without human validation
  • Computational predictions (accuracy varies)
  • Lower ‘quality’ than manual codes

• Key point: be aware of annotation origin


Evidence Type (provenance of facts)

• ISS: Inferred from Sequence/Structural Similarity
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
• IC: Inferred by Curator
• ND: No Data available
• IEA: Inferred from Electronic Annotation
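Being aware of annotation origin often comes down to filtering on the evidence code. The following is a minimal sketch, assuming annotations have already been loaded into simple records; the `Annotation` record, the gene names and the choice to exclude only IEA are illustrative assumptions, not a real file format or an official recommendation.

```python
from collections import namedtuple

# Hypothetical, simplified annotation record (not a real GO file format).
Annotation = namedtuple("Annotation", ["gene", "go_term", "evidence"])

def drop_evidence(annotations, excluded=("IEA",)):
    """Return the annotations whose evidence code is not in `excluded`."""
    excluded = set(excluded)
    return [a for a in annotations if a.evidence not in excluded]

# Illustrative records; GENE1 and GENE2 are made-up names.
example = [
    Annotation("GENE1", "GO:0000724", "IDA"),
    Annotation("GENE1", "GO:0005634", "IEA"),
    Annotation("GENE2", "GO:0000724", "TAS"),
]

kept = drop_evidence(example)
for a in kept:
    print(a.gene, a.go_term, a.evidence)
```

Passing a different `excluded` tuple (for example `("IEA", "ND")`) changes the policy without touching the function.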


Variable Coverage

Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.


GO Software Tools

• GO resources are freely available to anyone without restriction
– Includes the ontologies, gene associations and tools developed by GO

• Other groups have used GO to create tools for many purposes

http://www.geneontology.org/GO.tools

Accessing GO: QuickGO

http://www.ebi.ac.uk/ego/

Explore Ontologies

http://www.ebi.ac.uk/ontology-lookup


Databases of Molecular Annotation

• NCBI
– GenBank / RefSeq
– Entrez Gene

• EBI
– UniProt
– Ensembl BioMart (eukaryotes)

Model Organism Databases
• Berkeley Drosophila Genome Project (BDGP)
• dictyBase (Dictyostelium discoideum)
• FlyBase (Drosophila melanogaster)
• GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei)
• UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases
• Gramene (grains, including rice, Oryza)
• Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus)
• Rat Genome Database (RGD) (Rattus norvegicus)
• Reactome
• Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae)
• The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana)
• The Institute for Genomic Research (TIGR): databases on several bacterial species
• WormBase (Caenorhabditis elegans)
• Zebrafish Information Network (ZFIN) (Danio rerio)


Identifiers

• Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
– E.g. Social Insurance Number, Entrez Gene ID 41232

• Gene and protein information is stored in many databases
– Genes have many IDs

• Records for: Gene, DNA, RNA, Protein
– Important to recognize the correct record type
– E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins.


NCBI Database Links

http://www.ncbi.nlm.nih.gov/Database/datamodel/data_nodes.swf

NCBI: U.S. National Center for Biotechnology Information

Part of National Library of Medicine (NLM)


Common Identifiers

Gene
• Ensembl: ENSG00000139618
• Entrez Gene: 675
• Unigene: Hs.34012

RNA transcript
• GenBank: BC026160.1
• RefSeq: NM_000059
• Ensembl: ENST00000380152

Protein
• Ensembl: ENSP00000369497
• RefSeq: NP_000050.2
• UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
• IPI: IPI00412408.1
• EMBL: AF309413
• PDB: 1MIU

Species-specific
• HUGO HGNC: BRCA2
• MGI: MGI:109337
• RGD: 2219
• ZFIN: ZDB-GENE-060510-3
• FlyBase: CG9097
• WormBase: WBGene00002299 or ZK1067.1
• SGD: S000002187 or YDL029W

Annotations
• InterPro: IPR015252
• OMIM: 600185
• Pfam: PF09104
• Gene Ontology: GO:0000724
• SNPs: rs28897757

Experimental Platform
• Affymetrix: 208368_3p_s_at
• Agilent: A_23_P99452
• CodeLink: GE60169
• Illumina: GI_4502450-S

Red = Recommended

Identifier Mapping

• So many IDs!
– Mapping (conversion) is a headache

• Four main uses
– Disambiguate similarly named entities
– Used to reference related information
– Biological and informational provenance
  • E.g. Genes to proteins, Entrez Gene to Affy
– Unification during dataset merging
  • Equivalent entities
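Identifier mapping is, at its core, a lookup in a table with one column per ID type. A minimal sketch, assuming the mapping table has already been downloaded from one of the mapping services; the BRCA2 identifiers come from the Common Identifiers slide, while the table layout and the function names `build_index` and `map_ids` are illustrative assumptions.

```python
# One row per gene, one key per identifier type (illustrative layout).
ID_TABLE = [
    {
        "hgnc": "BRCA2",
        "entrez_gene": "675",
        "ensembl_gene": "ENSG00000139618",
        "uniprot": "BRCA2_HUMAN",
    },
]

def build_index(table, source):
    """Index the table rows by the chosen source identifier type."""
    return {row[source]: row for row in table}

def map_ids(ids, table, source, target):
    """Map each ID from `source` to `target`; unmapped IDs map to None."""
    index = build_index(table, source)
    return {i: (index[i][target] if i in index else None) for i in ids}

# "99999999" is a made-up ID used to show what an unmapped entry looks like.
mapped = map_ids(["675", "99999999"], ID_TABLE, "entrez_gene", "ensembl_gene")
print(mapped)
```

Returning `None` for unmapped IDs, rather than silently dropping them, makes it easy to report how much of a list was lost in conversion.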


ID Mapping Services

• Synergizer
– http://llama.med.harvard.edu/synergizer/translate/

• Ensembl BioMart
– http://www.ensembl.org

• UniProt
– http://www.uniprot.org/


Outline



3. Networks

4. Pathways


Attribute Enrichment (AE)

Given:
1. list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42
2. attributes: e.g. function, process, localization, interactions

AE Question: Are any of the attributes surprisingly enriched in the list?

• Details:
– How to assess “surprisingly” (statistics)
– How to correct for repeating the tests


What is a P-value?

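• The P-value is the probability, computed under the “null hypothesis”, of observing a test statistic at least as extreme as the one obtained from the data. It is not the probability that the null hypothesis is true.

• It is calculated by computing a statistic from the data and finding the probability of observing that statistic, or one more extreme, given a sample of the same size distributed according to the null hypothesis.

• Intuitively: if the null hypothesis is rejected whenever the P-value falls below a threshold α, then a false positive result (aka “Type I error”) occurs with probability at most α when the null hypothesis is true.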


How likely are the observed differences between the two distributions due to chance?

[Figure: values for two groups and the corresponding value distributions.]


AE using the T-test

Answer: Two-tailed T-test

Black: N1 = 500, mean m1 = 1.1, std s1 = 0.9

Red: N2 = 4500, mean m2 = 4.9, std s2 = 1.0

$$T = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$$

Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?
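The T-statistic on this slide can be checked with a few lines of arithmetic. This is a minimal sketch of the unequal-variance form of the statistic, (m1 - m2) / sqrt(s1^2/N1 + s2^2/N2), assuming the second group's values are m2 = 4.9 and s2 = 1.0; the function name `t_statistic` is an illustrative choice.

```python
import math

def t_statistic(m1, s1, n1, m2, s2, n2):
    """Two-sample T-statistic using each group's own variance."""
    standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / standard_error

# Values from the slide: group 1 (N1 = 500) and group 2 (N2 = 4500).
t = t_statistic(m1=1.1, s1=0.9, n1=500, m2=4.9, s2=1.0, n2=4500)
print(round(t, 1))  # -88.5
```

Turning the statistic into a P-value requires the T-distribution; a statistics library would normally be used for that step.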


AE using the T-test

$$T = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$$

[Figure: T-distribution. x-axis: T-statistic (0 and -88.5 marked); y-axis: probability density. P-value = shaded area * 2.]

Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?


T-test limitations

1. Assumes distributions are both approximately Gaussian (i.e. normal)
– Score distribution assumption is often true for:
  • Log ratios from microarrays
– Score distribution assumption is rarely true for:
  • Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits

2. Tests for significance of the difference in means of two distributions but does not test for other differences between distributions.

[Figures: probability density vs score for three problem cases]
• Values are positive and have increasing density near zero, e.g. sequence counts
• Distributions with outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions


Kolmogorov-Smirnov (K-S) test

Question: Are the red and black distributions significantly different?

[Figure: probability density vs score for the red and black distributions.]

Calculate cumulative distributions of red and black

[Figure: cumulative probability (0 to 1.0) vs score for both distributions; the largest gap between the two cumulative distributions has length = 0.4.]

Formal question: Is the length of the largest difference between the “empirical distribution functions” statistically significant?
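The "length" in the K-S test is the largest vertical gap between the two empirical distribution functions. A minimal pure-Python sketch of that statistic follows; the two samples are made-up numbers, and the function name `ks_statistic` is an illustrative choice. It computes only the statistic, not its P-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest

# Made-up scores for two groups.
d = ks_statistic([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(d, 2))  # 0.4
```

For real analyses, SciPy's `scipy.stats.ks_2samp` computes both the statistic and a P-value.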


What is the probability of finding 4 or more proteins with feature X in a random sample of 5 proteins?

List: RRP6, MRD1, RRP7, RRP43, RRP42

Background population: 500 X proteins, 5000 proteins


Fisher’s exact test

Background population: 500 X proteins, 5000 proteins

[Figure: null distribution with the P-value marked.]

Answer = 4.6 x 10^-4

The P-value for Fisher’s exact test is “the probability that a random draw of the same size as the list from the background population would produce the observed number (or more) of attributes in the list.” It depends on the size of the list, the # with features (in list, background), and the background population.
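The answer on this slide can be reproduced by summing hypergeometric probabilities. This is a minimal sketch, assuming the background is read as 5000 proteins in total, of which 500 have feature X; the function name `hypergeom_tail` and its parameter names are illustrative choices.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(at least k of n drawn items have the feature), drawing without
    replacement from N items of which K have the feature."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# 4 or more X proteins in a list of 5, background of 5000 with 500 X proteins.
p = hypergeom_tail(k=4, n=5, K=500, N=5000)
print(f"{p:.1e}")  # about 4.6e-04, in line with the slide
```

This is the one-tailed (over-enrichment) calculation; SciPy's `scipy.stats.fisher_exact` offers the same test on a 2x2 table.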

Important details

• To test for under-enrichment of “black”, test for over-enrichment of “red”.

• Need to choose the “background population” appropriately, e.g., if only a portion of the total complement is queried (or has annotation), only use that population as background.

• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type.

• The hypergeometric test is equivalent to a one-tailed Fisher’s exact test.


How to win the P-value lottery, part 1

Background population: 500 X, 5000 Y

Random draws

… 7,834 draws later …

Expect a random draw with observed enrichment once every 1 / P-value draws


How to win the P-value lottery, part 2

Keep the list the same, evaluate different annotations

Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42

Different annotations



Correcting for multiple tests

• The Bonferroni correction controls the probability that at least one of the tests is a false positive, aka the Family-Wise Error Rate (FWER). If M = # of annotations tested: Corrected P-value = M x original P-value

• The Benjamini-Hochberg (B-H) procedure controls the expected proportion of positive tests (i.e. rejections of the null hypothesis) that are false positives, aka the False Discovery Rate (FDR)
– FDR is the expected proportion of the observed enrichments that are due to random chance.
– Less stringent than the Bonferroni correction
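Both corrections are short enough to write out by hand. The sketch below is a minimal illustration, assuming a plain list of P-values, one per annotation tested; the example P-values are made up, and the function names are illustrative choices. The B-H function returns adjusted P-values (often called q-values) using the usual step-up rule.

```python
def bonferroni(pvalues):
    """Corrected P-value = M x original P-value, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.01, 0.03, 0.04, 0.5]  # made-up P-values for five annotations
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print([round(x, 3) for x in bonf])
print([round(x, 3) for x in bh])
```

At a 0.05 cut-off the Bonferroni-corrected values keep fewer annotations than the B-H values in this example, which illustrates why B-H is described as less stringent.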


Reducing multiple test correction stringency

• The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be

• Can control the stringency by reducing the number of tests:
– e.g. use GO slim or restrict testing to the appropriate GO annotations.


AE tools

• Web-based tools
– Funspec:
  • Easy tool for yeast, not maintained, uses GO annotations and some other annotations (e.g. protein complexes)
– YeastFeatures:
  • Similar to Funspec, different datasets and presentation
– GoMiner:
  • Uses GO annotations, covers many organisms, needs a background set of genes

• Cytoscape-based tools
– BINGO:
  • Uses GO annotations, displays enrichment results graphically and visually organizes related categories


Funspec: Simple ORA for yeast

http://funspec.med.utoronto.ca/

Paste list here

Bonferroni correct? YES!

Choose sources of annotation

Caveats:
• yeast only
• last updated 2002


http://software.dumontierlab.com/yeastfeatures


GoMiner, part 1

http://discover.nci.nih.gov/gominer

1. Click “web interface”

2. Upload background

3. Upload list

4. Choose organism

5. Choose evidence code (All or Level 1)


GoMiner, part 2

6. Restrict # of tests via category size

7. Restrict # of tests via GO hierarchy

8. Results emailed to this address, in a few minutes


DAVID, part 1 http://david.abcc.ncifcrf.gov/

Paste list here

Choose ID type

List type: list or background?

DAVID automatically detects organism


DAVID, part 2

http://david.abcc.ncifcrf.gov/


BINGO, an ORA Cytoscape plugin

http://www.psb.ugent.be/cbd/papers/BiNGO/index.htm

Links represent parent-child relationships in GO ontology

Colours represent significance of enrichment

Nodes represent GO categories



Outline



3. Networks
• Physical networks
• Genetic networks
• Functional networks

4. Pathways


Why Network and Pathway Analysis?

• Intuitive to Biologists
– Provide a biological context for results
– More efficient than searching databases gene-by-gene
– Intuitive display for sharing data

• Computation on Pathway Content
– Visualize multiple data types on a pathway or network
– Find active pathways
– Identify potential regulators


network

In biology, a network is a graph comprised of nodes that correspond to entities (genes, proteins, small molecules) and edges that correspond to physical/agentive or associative relations between entities.

[Figure: example graph illustrating a vertex (node), an edge, a cycle, a directed edge (arc) and weighted edges (weights -5, 7, 10).]


Integration in a Network Context


Expression data mapped to node colours


Mapping Biology to a Network

• A simple mapping: Protein-protein interactions
– one protein/node, one interaction/edge

• Edges can represent other relationships
– Physical e.g. protein-protein interaction
– Regulatory e.g. kinase activates target
– Genetic e.g. epistasis
– Similarity e.g. protein sequence similarity

• Critical: understand the mapping for network analysis
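One way to keep the mapping explicit is to store the relationship type on every edge. The sketch below is a minimal adjacency-dictionary network, assuming one node per protein; the class name `Network`, the node names and the edge attributes are all illustrative, not taken from any particular tool.

```python
from collections import defaultdict

class Network:
    """Tiny graph: adj[a][b] holds the attributes of the edge from a to b."""

    def __init__(self):
        self.adj = defaultdict(dict)

    def add_edge(self, a, b, kind, directed=False, weight=1.0):
        self.adj[a][b] = {"kind": kind, "weight": weight}
        if directed:
            self.adj.setdefault(b, {})  # make sure the target node exists
        else:
            self.adj[b][a] = {"kind": kind, "weight": weight}

    def neighbours(self, node, kind=None):
        """Nodes reachable from `node`, optionally restricted to one edge kind."""
        return sorted(
            other for other, attrs in self.adj[node].items()
            if kind is None or attrs["kind"] == kind
        )

net = Network()
net.add_edge("A", "B", kind="physical")                       # protein-protein interaction
net.add_edge("KinaseK", "A", kind="regulatory", directed=True)  # kinase activates target
net.add_edge("A", "C", kind="similarity", weight=0.8)         # sequence similarity score

print(net.neighbours("A"))
print(net.neighbours("A", kind="physical"))
```

Because the regulatory edge is directed, "KinaseK" points to "A" but "A" does not list "KinaseK" as a neighbour, which is exactly the kind of mapping detail that changes downstream analysis.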


Protein Sequence Similarity Network

http://apropos.icmb.utexas.edu/lgl/

Literature Network

• Computationally extract gene relationships from text, usually PubMed abstracts

• Useful if network is not in a database
– Literature search tool

• BUT not perfect
– Problems recognizing gene names
– Natural language processing is difficult

• Agilent Literature Search Cytoscape plugin
• iHOP (www.ihop-net.org/UniPub/iHOP/)


Agilent Literature Search


Cytoscape Network produced by Literature Search.

Abstract from the scientific literature

Sentences for an edge


Enrichment Map

[Figure: two overlapping gene sets, A and B.]

$$\text{Overlap} = \frac{|A \cap B|}{\min(|A|, |B|)}$$


Nodes represent gene-sets
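The overlap score between two gene sets, the size of their intersection divided by the size of the smaller set, is simple to compute. A minimal sketch follows; the gene-set memberships are made up for illustration (the gene names are borrowed from the earlier example list), and the function name is an illustrative choice.

```python
def overlap_coefficient(set_a, set_b):
    """|A intersect B| / min(|A|, |B|); defined as 0.0 if either set is empty."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Hypothetical gene sets.
set_1 = {"RRP6", "RRP7", "RRP43"}
set_2 = {"RRP7", "RRP43", "RRP42", "MRD1"}

score = overlap_coefficient(set_1, set_2)
print(round(score, 2))  # 0.67
```

A score of 1.0 means the smaller set is entirely contained in the larger one, which is why this measure is useful for linking a specific gene set to its broader parent.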


[Figure: enrichment map. Cluster labels: Olfactory Receptor; Muscle Contraction; Ectodermal Dev. & Keratinocyte Diff.; Ubiquitin Processes (Ubiquitin-dependent Proteolysis, Ubiquitin Ligase); DNA Processes (Mitotic Cell Cycle, DNA Repair, DNA Replication, Chromatin Remodeling, Chromosome); Ras GTPase; Serine Endopeptidase; Cytoskeleton (Microtubule Cytoskeleton, Intermediate Filament); Ion Channel (Calcium, Potassium, Sodium); Mitochondrial Oxidative Metabolism; Fatty Acid Metabolism; RNA Processes (mRNA Transport, RNA Splicing, Transcription, rRNA Processing); Ribonucleotide Metabolism; Translation.]

Physical Networks

• Between two molecular objects
– DNA, RNA, gene, protein, complex, small molecule, photon
– Requires a site of interaction / binding

• Biologically relevant:
– Present/expressed at the same time
– Share a cellular location
– Leads to some biologically relevant outcome

[Figure: A interacting with B.]

Molecular Interactions

RAS interacting with RALGDS

(PDB: 1LFD)

Synthetic protein interacting with ATP and Zinc

(PDB: 2P0X)


Experimental Interaction Discovery

[Figure: experimental methods (Microarray, Two-Hybrid, Mass Spectrometry, Genetics, X-Ray, NMR) arranged by the type of evidence they give: Direct, Physical; Indirect, Physical; Indirect, Genetic.]


Experimental Considerations

• How do you know if the interaction really exists?

• Each method has its advantages and disadvantages.
– Be aware of systematic errors
– Be aware of contaminants.

• Each method observes interactions from a slightly different experimental condition.

• Support from many different sources is certainly better (necessary) than just one.


Some affinity purification caveats

[Figure: proteins A and B.]

First and most importantly, this is only a representation of the observation.

You can only tell what proteins are in the eluate; you can’t tell how they are connected to one another.

If there is only one other protein present (B), then it’s likely that A and B are directly interacting.

But what if I told you that two other proteins (B and C) were present along with A…

[Figure: proteins A, B and C.]


Complexes with unknown topology

[Figure: alternative arrangements of proteins A, B and C.]

Which of these models is correct? The complex described by this experimental result is said to have an Unknown Topology.


Complexes with unknown stoichiometry

[Figure: a complex of A with more than one copy of B.]

Here’s another possibility. The complex described by this experimental result is also said to have Unknown Stoichiometry.

Interaction Models

[Figure: Spoke model, Matrix model and Actual Topology for the same purification. Annotations: "Simple model, useful for data navigation"; "More accurate"; "Theoretical max. number of interactions".]
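The two interaction models can be generated from a single purification result. In the spoke model the tagged protein (the bait) is linked to every co-purified protein; in the matrix model every pair of proteins in the purification is linked. This is a minimal sketch under those standard definitions; the protein names A, B, C and the function names are illustrative.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Spoke model: one edge from the bait to each co-purified protein."""
    return [(bait, prey) for prey in preys]

def matrix_edges(bait, preys):
    """Matrix model: an edge between every pair of proteins in the purification."""
    return list(combinations([bait, *preys], 2))

# A was purified; B and C came down with it.
spoke = spoke_edges("A", ["B", "C"])
matrix = matrix_edges("A", ["B", "C"])
print(spoke)   # [('A', 'B'), ('A', 'C')]
print(matrix)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

For a purification with n proteins the spoke model yields n - 1 edges and the matrix model n(n - 1)/2, so the gap between the two grows quickly with complex size.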



High-throughput Mass Spectrometric Protein Complex Identification (HMS-PCI)

Ste12

Ho et al. Nature. 2002 Jan 10;415(6868):180-3

Mike Tyers, SLRI




k-core analysis

• A k-core is a part of a graph in which every node is connected to at least k other nodes of that part (k=0,1,2,3...)

• The highest k-core is the central, most densely connected region of a graph

• Regions of dense connectivity may represent molecular complexes

• Therefore, high k-cores may be molecular complexes
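A k-core can be found by repeatedly removing nodes that have fewer than k remaining neighbours. The following is a minimal sketch of that pruning procedure, assuming an undirected network stored as a dictionary of neighbour sets in which every node appears as a key; the example network and the function name `k_core` are illustrative.

```python
def k_core(adjacency, k):
    """Return the k-core as a dict of node -> set of neighbours within the core."""
    core = {node: set(nbrs) for node, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        too_sparse = [node for node, nbrs in core.items() if len(nbrs) < k]
        for node in too_sparse:
            # Remove the node and erase it from its remaining neighbours.
            for nbr in core.pop(node):
                if nbr in core:
                    core[nbr].discard(node)
            changed = True
    return core

# Triangle A-B-C with one extra protein D attached only to C.
example = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

print(sorted(k_core(example, 2)))  # ['A', 'B', 'C']
print(sorted(k_core(example, 3)))  # []
```

Removing a node can push its neighbours below k, which is why the pruning loop repeats until no more nodes are removed.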


[Figure: highest k-cores in four interaction data sets. Panels: Pre MS, Ho, Gavin, Union; k-core labels: 6-core, 6-core, 6-core, 9-core.]

Interaction can define function

MCODE plugin for Cytoscape

http://pathguide.org


Interaction Databases

• Experiment (E)
• Structure detail (S)
• Predicted
– Physical (P)
– Functional (F)
• Curated (C)
• Homology modeling (H)
• *IMEx consortium


Network Classification of Disease

• Traditional: Gene association
• Limitations: Too many genes reduces statistical power
• New: Active cell map based approaches combining network and molecular profiles

Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16.

Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet. 2007 Jun;3(6):e96.

Efroni S, Schaefer CF, Buetow KH. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS ONE. 2007 May 9;2(5):e425.


Network-Based Breast Cancer Classification

• 57k interactions from Y2H, orthology, co-citation, HPRD, BIND, Reactome

• 2 breast cancer cohorts, different expression platforms

Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16.


• Similar network markers across 2 data sets (better than original overlap)

• Increased classification accuracy

• Better coverage of known cancer risk genes (*)


PIPE

• Predicts yeast PPI from sequence
– Uses interaction databases to find similar interacting proteins
– Estimates the site of interaction
– 75% accuracy (61% sensitivity, 89% specificity)
– Finds new interactions among complexes




PIPE2

• First all-to-all sequence-based computational screen of PPIs in yeast
– 29,589 high confidence interactions of ~ 2 x 10^7 possible pairs
– 16,000x faster than PIPE
– 99.95% specificity


Synthetic Genetic Interactions

• Synthetic genetic interactions (lethal, slow growth)
• Mate two mutants without phenotypes to get a daughter cell with a phenotype
• Synthetic lethal (SL), slow growth
• Robotic mating using the yeast deletion library
• Genetic interactions provide functional data on protein interactions or redundant genes
• About 23% of known SLs (1295 - YPD+MIPS) were known protein interactions in yeast

Tong et al. Science. 2001 Dec 14;294(5550):2364-8


Synthetic Genetic Interactions in Yeast

[Figure: genetic interaction network. Legend: Cell Polarity, Cell Wall Maintenance, Cell Structure, Mitosis, Chromosome Structure, DNA Synthesis, DNA Repair, Unknown, Others.]

Tong, Boone

Validation: Protein Localization

A – A3: Y2H
B: physical methods
C: genetic
E: immunological

True positives:
- Localized in the same cellular compartment
- Have common cellular role

Sprinzak, Sattath, Margalit, J Mol Biol, 2003

Comparisons

• All methods except for Y2H and the synthetic lethality technique are biased toward abundant proteins.

• PPI bias toward certain cellular localizations.

• Evolutionarily conserved proteins have much better coverage in Y2H than the proteins restricted to a certain organism.

C. von Mering et al, Nature, 2002


Functional Associations

• Molecular Interactions
• Regulatory Interactions
• Genetic Interactions
• Similarity relationships
– Co-expression
– Protein sequence
– Domain architecture
– Phylogenetic profiles
– Gene neighborhood
– Gene fusion
– …


http://string.embl.de/

von Mering et al., Nucleic Acids Res., 2005


Gene Function Prediction using a Multiple Association Network Integration Algorithm

Query-specific weights for multifaceted function queries

[Figure: a combined network built as a weighted sum (weights w1, w2, w3) of source networks, including Co-expression, Genetic (Tong et al. 2001) and Co-complexed. Example genes: CDC27, APC11, CDC23, XRS2, RAD54, MRE11, UNK1, UNK2. Functions shown: Cell cycle, DNA repair.]

Pavlidis et al, 2002; Lanckriet et al, 2004; Mostafavi et al, 2008

Durrett 2006


GeneMANIA Cytoscape Plugin


Outline



3. Networks

4. Pathways

5. Lab


pathway

In biology, a pathway is a network which consists of inputs (physical entities), outputs (physical entities, biological outcomes), and the molecular machinery and chemical transformations required/expected to realize the end-directed activity.


Using Pathway Information

[Figure: Databases, Literature and Expert knowledge feed into Pathway Information; Pathway Information and Experimental Data feed into Pathway Analysis, which is used to find active processes underlying a phenotype.]


http://pathguide.org

Vuk Pavlovic, Sylva Donaldson

>290 Pathway Databases!

• Varied formats, representation, coverage

• Pathway data extremely difficult to combine and use


Aim: Convenient Access to Pathway Information

• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis

http://www.pathwaycommons.org


Access From Cytoscape


Fatty Acid Degradation? Other pathways / processes?

GenMAPP.org

cardiomyopathy: downregulated genes


Fatty Acid Degradation Pathway


Cardiomyopathy Data on Fatty Acid Degradation Pathway


Visualizing Time Course Data on Pathways: Multiple Comparison View


Outline



3. Networks

4. Pathways

5. Lab


Network Analysis

• Cytoscape
– Visualize molecular interaction networks and integrate interactions with gene expression profiles and other state data. Data filters & custom plug-in architecture.
– http://www.cytoscape.org

• Biolayout Express 3D
– Large networks
– Gene expression
– http://www.sanger.ac.uk/Teams/Team101/biolayout/b3d.html





Network Analysis using Cytoscape

[Figure: Databases, Literature and Expert knowledge feed into Network Information; Network Information and Experimental Data feed into Network Analysis, which is used to find biological processes underlying a phenotype.]


Network visualization and analysis

UCSD, ISB, Agilent, MSKCC, Pasteur, UCSF, Unilever, UToronto, U Texas

http://cytoscape.org

• Pathway comparison
• Literature mining
• Gene Ontology analysis
• Active modules
• Complex detection
• Network motif search


Manipulate Networks

• Filter/Query
• Automatic Layout
• Interaction Database Search


Focus

Overview

Zoom

PKC Cell Wall Integrity


Active Community

• Help
– 8 tutorials, >10 case studies
– Mailing lists for discussion
– Documentation, data sets

• Annual Conference: Houston Nov 6-9, 2009

• 10,000s users, 2500 downloads/month

• >40 Plugins Extend Functionality
– Build your own, requires programming

http://www.cytoscape.org

Cline MS et al. Integration of biological networks and gene expression data using Cytoscape Nat Protoc. 2007;2(10):2366-82


LAB

Objective
• Create a map of the functional enrichments from the 14 input proteins

Methods
• Use HGNC to obtain the gene symbols from the names
• Submit the gene symbols to a tool that already has datasets loaded.
• Get Attributes and do analysis on network


14 Proteins

• ISOFORM of APOPTOSIS-INDUCING FACTOR 1, MITOCHONDRIAL
• QUINONE OXIDOREDUCTASE.; 26 KDA PROTEIN.; 22 KDA PROTEIN.; 32 KDA PROTEIN.
• 14-3-3 PROTEIN EPSILON.
• ELONGATION FACTOR 1-GAMMA.; 50 KDA PROTEIN.
• AFG3-LIKE PROTEIN 2.
• 3-KETOACYL-COA THIOLASE, MITOCHONDRIAL
• IMPORTIN BETA-1 SUBUNIT.
• FH1/FH2 DOMAIN-CONTAINING PROTEIN
• ANNEXIN VI ISOFORM 2.; ANNEXIN A6.
• 2,4-DIENOYL-COA REDUCTASE, MITOCHONDRIAL
• HYDROXYACYL GLUTATHIONE HYDROLASE ISOFORM 1.; HYDROXYACYLGLUTATHIONE HYDROLASE.
• ISOFORM 1 OF ELECTRON TRANSFER FLAVOPROTEIN SUBUNIT BETA.; ISOFORM 2 OF ELECTRON TRANSFER FLAVOPROTEIN SUBUNIT BETA
• ISOFORM 1 OF LONG-CHAIN-FATTY-ACID--COA LIGASE 1
• PHOSPHOLIPASE C DELTA 4.


Get their gene symbols/identifiers

HGNC - http://www.genenames.org

• Provide a table of mappings
• What challenges did you face when trying to identify the symbols from textual descriptions?

Identify functional enrichments

Discuss and provide a plot for the enrichment of Gene Ontology categories


Build an attribute enrichment network

• Which new proteins are functionally linked?
• What datasets were used in the network construction?


Attribute Enrichment with a custom data set

• Use BioMart to
– Convert HGNC identifiers to Ensembl identifiers
– Obtain the Gene Ontology categories for the target proteins and the background proteins.

• Use FUNC to do the enrichment analysis






Collect the Gene Ontology attributes for the list, then for all the human genes


Next steps are harder…

To use FUNC, you need to convert the BioMart output to the input file format that FUNC expects. This is pretty easy to do in Excel for the protein list, but Excel can’t handle the results for all the human proteins. You need to write a small script… take BIOC3008 and become competent in simple data manipulation.
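The small script mentioned here could look roughly like the sketch below. It streams a tab-separated export (gene ID in the first column, GO accession in the second) and writes one line per gene/GO pair with a 0/1 flag marking membership in the protein list. The column order, the header row, the three-column output layout and the function names are all assumptions for illustration; check the FUNC documentation for the exact input format before using it.

```python
import csv
import io

def read_gene_go_pairs(handle):
    """Yield (gene_id, go_id) from a tab-separated export with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) >= 2:
            yield row[0], row[1]

def write_flagged_pairs(pairs, list_genes, out):
    """Write gene, GO term and a 0/1 flag (1 = gene is in the protein list)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for gene_id, go_id in pairs:
        if not go_id:
            continue  # genes without GO annotation carry no information here
        writer.writerow([gene_id, go_id, 1 if gene_id in list_genes else 0])

# Tiny in-memory stand-in for a BioMart export; ENSG_EXAMPLE is a made-up ID.
export = io.StringIO(
    "Gene ID\tGO accession\n"
    "ENSG00000139618\tGO:0000724\n"
    "ENSG_EXAMPLE\tGO:0000724\n"
    "ENSG_EXAMPLE\t\n"
)
result = io.StringIO()
write_flagged_pairs(read_gene_go_pairs(export), {"ENSG00000139618"}, result)
print(result.getvalue(), end="")
```

Because rows are processed one at a time, the same functions work on a file handle for the full set of human genes without loading everything into memory.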

http://func.eva.mpg.de/

