Network Biology: from lists to underpinnings of molecular behaviour Michel Dumontier, Ph.D. Associate Professor of Bioinformatics Carleton University 1 BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]
May 10, 2015
Provenance
• This talk was prepared in part with input from the “Interpreting Gene Lists” workshop put forward by the Canadian Bioinformatics Workshops (bioinformatics.ca)
• http://bioinformatics.ca/workshops/2009/course-content
So you did some mass spectrometry?
Protein Identification: database search vs de novo
[Figure: MS/MS spectrum (S#: 1708, RT: 54.47, Full ms2 638.00 [165.00 - 1925.00]); x-axis: m/z (200 to 2000); y-axis: Relative Abundance (0 to 100); peaks annotated with amino acid letters]
• de novo
– AVGELTK
• Database Search
– Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
My experiment worked and I have dozens, hundreds, or thousands of hits… now what?
[Figure: Protein Identification (MS/MS spectrum, m/z vs Relative Abundance) followed by a question mark]
Use the list to explore Biology
• Determine significant shared attributes
• Explore putative mechanisms of action
• Test hypotheses
[Figure: Protein Identification (MS/MS spectrum, m/z vs Relative Abundance), Network Biology, and "Eureka! Hypothesis on the molecular basis of disease/process"]
# in list having attribute
# in list sharing these attributes
Oxidative Metabolism
Detoxification
Enriched in smokers = UP-regulated in smokers
Outline
1. Explore identified proteins
2. Attribute enrichment
3. Networks
4. Pathways
5. Lab
A hypothesis underlies the list of identified proteins
• An initial question was posed, an experiment performed and a list of candidates obtained.
• The question is: what are the roles of these entities in the biological process being investigated?
– Normal vs pathological
– Response to stimulus
– Interactions and complexes
Biological Answers
• Computational systems biology
– Information retrieval and summary
– Interaction network analysis
– Pathway analysis
– Function prediction
Molecular Attributes
• An attribute provides information about the entity in question (e.g. shape, function, process)
• Sequence and structure provide information about
– Motifs, domains, interaction/binding sites, post-translational modifications, conformational changes, molecular complexes, mutations, conservation/evolution
– Functions, localization, biological / pathological processes
Gene Ontology
• Captures terminology related to three aspects
– biological processes
– molecular functions
– cellular components
• Relationships between terms are largely defined with “is a” and “part of” relations
Cell division
Isomerase activity
GO Structure
[Figure: fragment of the GO graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relations]
Species independent. Some lower-level terms are specific to a group, but higher-level terms are not.
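The is-a and part-of links can be followed upward to collect every ancestor of a term. Below is a minimal sketch in Python. The exact links in the figure are not recoverable from the slide text, so the `PARENTS` table is an illustrative assumption, and `ancestors` is a hypothetical helper rather than part of any GO tool.

```python
# Hypothetical fragment of the GO graph (assumed edges, for illustration only).
# Each term maps to a list of (relation, parent term) pairs.
PARENTS = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term):
    """Return every term reachable from `term` by following is-a or part-of links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in PARENTS.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("chloroplast membrane")))
```

A protein annotated to a specific term is usually also counted for that term's ancestors, which is why this upward walk matters for the enrichment tests later in the talk.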
Gene Ontology
• 30,393 terms, 99.2% with definitions
– 18,939 biological processes
– 2,735 cellular components
– 8,719 molecular functions
• GO Slim is an official reduced set of GO terms
– Generic, plant, yeast
– Good for making pie charts
Annotation
• Manual annotation
– Created by scientific curators
  • High quality
  • Small number (time-consuming to create)
• Electronic annotation
– Annotation derived without human validation
  • Computational predictions (accuracy varies)
  • Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin
Evidence Type (provenance of facts)
• ISS: Inferred from Sequence/Structural Similarity
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
• IC: Inferred by Curator
• ND: No Data available
• IEA: Inferred from Electronic Annotation
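Being aware of annotation origin often means filtering annotations by evidence code before analysis. Below is a minimal sketch in Python. The annotation tuples are made-up examples, and treating IDA, IPI, IMP, IGI and IEP as the "experimental" set is an assumption chosen for illustration.

```python
# Evidence codes treated as experimental in this sketch (an assumption).
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def filter_by_evidence(annotations, allowed):
    """Keep only (protein, go_term, evidence_code) tuples whose code is allowed."""
    return [a for a in annotations if a[2] in allowed]


# Hypothetical annotations: (protein, GO term, evidence code).
annotations = [
    ("RRP6", "GO:0000001", "IDA"),
    ("RRP6", "GO:0000002", "IEA"),
    ("MRD1", "GO:0000001", "IMP"),
]

print(filter_by_evidence(annotations, EXPERIMENTAL))
```

Dropping IEA annotations trades coverage for reliability, so it may be worth running an enrichment analysis both ways.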
Variable Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
GO Software Tools
• GO resources are freely available to anyone without restriction
– Includes the ontologies, gene associations and tools developed by GO
• Other groups have used GO to create tools for many purposes
http://www.geneontology.org/GO.tools
Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
Explore Ontologies
http://www.ebi.ac.uk/ontology-lookup
Databases of Molecular Annotation
• NCBI
– GenBank / RefSeq
– Entrez Gene
• EBI
– UniProt
– Ensembl BioMart (eukaryotes)
Model Organism Databases
• Berkeley Drosophila Genome Project (BDGP)
• dictyBase (Dictyostelium discoideum)
• FlyBase (Drosophila melanogaster)
• GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei)
• UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases
• Gramene (grains, including rice, Oryza)
• Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus)
• Rat Genome Database (RGD) (Rattus norvegicus)
• Reactome
• Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae)
• The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana)
• The Institute for Genomic Research (TIGR): databases on several bacterial species
• WormBase (Caenorhabditis elegans)
• Zebrafish Information Network (ZFIN) (Danio rerio)
Identifiers
• Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
– E.g. Social Insurance Number, Entrez Gene ID 41232
• Gene and protein information is stored in many databases
– Genes have many IDs
• Records for: Gene, DNA, RNA, Protein
– Important to recognize the correct record type
– E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins.
NCBI Database Links
http://www.ncbi.nlm.nih.gov/Database/datamodel/data_nodes.swf
NCBI: U.S. National Center for Biotechnology Information
Part of National Library of Medicine (NLM)
Common Identifiers
• Gene
– Ensembl ENSG00000139618
– Entrez Gene 675
– Unigene Hs.34012
• RNA transcript
– GenBank BC026160.1
– RefSeq NM_000059
– Ensembl ENST00000380152
• Protein
– Ensembl ENSP00000369497
– RefSeq NP_000050.2
– UniProt BRCA2_HUMAN or A1YBP1_HUMAN
– IPI IPI00412408.1
– EMBL AF309413
– PDB 1MIU
• Species-specific
– HUGO HGNC BRCA2
– MGI MGI:109337
– RGD 2219
– ZFIN ZDB-GENE-060510-3
– FlyBase CG9097
– WormBase WBGene00002299 or ZK1067.1
– SGD S000002187 or YDL029W
• Annotations
– InterPro IPR015252
– OMIM 600185
– Pfam PF09104
– Gene Ontology GO:0000724
– SNPs rs28897757
• Experimental Platform
– Affymetrix 208368_3p_s_at
– Agilent A_23_P99452
– CodeLink GE60169
– Illumina GI_4502450-S
Red = Recommended
Identifier Mapping
• So many IDs!
– Mapping (conversion) is a headache
• Four main uses
– Disambiguate similarly named entities
– Used to reference related information
– Biological and informational provenance
  • E.g. Genes to proteins, Entrez Gene to Affy
– Unification during dataset merging
  • Equivalent entities
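At its simplest, identifier mapping is a table lookup. Below is a minimal sketch in Python that uses a few of the identifiers from the "Common Identifiers" slide as a one-row table. The field names and the `translate` helper are hypothetical, and a real mapping table would come from a service such as BioMart or UniProt.

```python
# One-row mapping table built from identifiers shown on the "Common Identifiers" slide.
# The dictionary keys are hypothetical field names chosen for this sketch.
RECORDS = [
    {
        "hgnc_symbol": "BRCA2",
        "entrez_gene": "675",
        "ensembl_gene": "ENSG00000139618",
        "refseq_protein": "NP_000050.2",
    },
]


def translate(identifier, source, target):
    """Return all `target` IDs for records whose `source` ID equals `identifier`."""
    return [r[target] for r in RECORDS if r.get(source) == identifier and target in r]


print(translate("675", "entrez_gene", "ensembl_gene"))
```

The function returns a list because mappings are often one-to-many (one gene, several transcripts or proteins), and an empty list makes unmapped IDs explicit.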
ID Mapping Services
• Synergizer
– http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
– http://www.ensembl.org
• UniProt
– http://www.uniprot.org/
Outline
1. Explore identified proteins
2. Attribute enrichment
3. Networks
4. Pathways
Attribute Enrichment (AE)
Given:
1. list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42
2. attributes: e.g. function, process, localization, interactions
AE Question: Are any of the attributes surprisingly enriched in the list?
• Details:
– How to assess “surprisingly” (statistics)
– How to correct for repeating the tests
What is a P-value?
• The P-value is the probability, computed assuming the “null hypothesis” is true, of observing data at least as extreme as the data actually observed. It is not the probability that the null hypothesis is true.
• It is calculated by computing a statistic from the data and evaluating the probability of observing that statistic, or one more extreme, in a sample of the same size distributed according to the null hypothesis.
• Intuitively: if the null hypothesis is rejected whenever the P-value falls below a threshold, that threshold is the probability of a false positive result (aka “Type I error”) when the null hypothesis holds.
How likely is it that the observed differences between the two distributions are due to chance?
[Figure: example values from two groups and the corresponding value distributions]
AE using the T-test
Answer: Two-tailed T-test
Black: N1 = 500, Mean: m1 = 1.1, Std: s1 = 0.9
Red: N2 = 4500, Mean: m2 = 4.9, Std: s2 = 1.0
T-statistic = $\frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$
Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?
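The T-statistic on this slide can be reproduced from the summary values alone. Below is a minimal sketch in Python using the unequal-variance form of the statistic with the numbers on the slide. The function name `t_statistic` is a hypothetical helper.

```python
import math


def t_statistic(m1, s1, n1, m2, s2, n2):
    """Two-sample T-statistic from means (m), standard deviations (s) and sizes (n)."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / standard_error


# Black: N1 = 500, m1 = 1.1, s1 = 0.9; Red: N2 = 4500, m2 = 4.9, s2 = 1.0
t = t_statistic(1.1, 0.9, 500, 4.9, 1.0, 4500)
print(round(t, 1))  # -88.5
```

Turning the statistic into a P-value requires the T-distribution, which this sketch does not implement. With raw data, a statistics package would normally compute both the statistic and the P-value.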
AE using the T-test
T-statistic = $\frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$
[Figure: T-distribution (probability density vs T-statistic), centred at 0, with the tail beyond -88.5 shaded]
P-value = shaded area * 2
Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?
T-test limitations
1. Assumes distributions are both approximately Gaussian (i.e. normal)
– Score distribution assumption is often true for:
  • Log ratios from microarrays
– Score distribution assumption is rarely true for:
  • Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits
2. Tests for significance of the difference in means of two distributions but does not test for other differences between distributions.
[Figure: three plots of probability density vs score showing distributions that violate the Gaussian assumption]
• Values are positive and have increasing density near zero, e.g. sequence counts
• Distributions with outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions
Kolmogorov-Smirnov (K-S) test
Question: Are the red and black distributions significantly different?
[Figure: probability density vs score for the red and black distributions]
Calculate cumulative distributions of red and black
[Figure: cumulative probability (0 to 1.0) vs score for both distributions; the largest vertical gap between the curves has Length = 0.4]
Formal question: Is the length of the largest difference between the “empirical distribution functions” statistically significant?
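The K-S statistic itself (the "length" above) is easy to compute from two samples. Below is a minimal sketch in Python. It computes only the largest gap between the two empirical distribution functions, not its significance, and `ks_statistic` is a hypothetical helper name.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Judging whether that gap is significant needs the K-S null distribution. In practice a library routine such as SciPy's `scipy.stats.ks_2samp` returns both the statistic and a P-value.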
What is the probability of finding 4 or more proteins with feature X in a random sample of 5 proteins?
List: RRP6, MRD1, RRP7, RRP43, RRP42
Background population: 500 proteins with feature X among 5000 proteins
Fisher’s exact test
Background population: 500 X proteins, 5000 proteins
List: RRP6, MRD1, RRP7, RRP43, RRP42
[Figure: null distribution with the P-value region marked]
Answer = $4.6 \times 10^{-4}$
The P-value for Fisher’s exact test is “the probability that a random draw of the same size as the list from the background population would produce the observed number (or more) of attributes in the list.” It depends on the size of the list, the # with features (in list, background), and the background population.
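The answer above can be checked with a direct hypergeometric tail sum. Below is a minimal sketch in Python, assuming a background of 5000 proteins of which 500 have feature X, a list of 5 proteins, and 4 or more hits. The function name `enrichment_p_value` is a hypothetical helper.

```python
from math import comb


def enrichment_p_value(population, with_feature, draws, observed):
    """P(at least `observed` featured items in `draws` draws without replacement)."""
    total = comb(population, draws)
    upper = min(draws, with_feature)
    tail = sum(
        comb(with_feature, k) * comb(population - with_feature, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


p = enrichment_p_value(population=5000, with_feature=500, draws=5, observed=4)
print(f"{p:.1e}")
```

This is the one-tailed (over-enrichment) calculation, so it is also what a hypergeometric test would report for the same counts.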
Important details
• To test for under-enrichment of “black”, test for over-enrichment of “red”.
• Need to choose the “background population” appropriately, e.g., if only a portion of the total complement is queried (or has annotation), use only that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type.
• The hypergeometric test is equivalent to a one-tailed Fisher’s exact test.
How to win the P-value lottery, part 1
Background population: 500 X, 5000 Y
Random draws
… 7,834 draws later …
Expect a random draw with observed enrichment once every 1 / P-value draws
How to win the P-value lottery, part 2
Keep the list the same, evaluate different annotations
Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42
Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42
Correcting for multiple tests
• The Bonferroni correction controls the probability that at least one of the tests is a false positive, aka the Family-Wise Error Rate (FWER). If M = # of annotations tested: Corrected P-value = M x original P-value
• The Benjamini-Hochberg (B-H) correction controls the expected proportion of positive tests (i.e. rejections of the null hypothesis) that are false positives, aka the False Discovery Rate (FDR)
– FDR is the expected proportion of the observed enrichments that are due to random chance.
– Less stringent than the Bonferroni correction
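Both corrections take only a few lines of code. Below is a minimal sketch in Python. The Bonferroni function follows the formula on the slide, capped at 1. The B-H function uses the standard step-up adjustment, which the slide does not spell out, so that formulation is an assumption here. The example P-values are made up.

```python
def bonferroni(p_values):
    """Corrected P-value = M x original P-value, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """B-H adjusted P-values: p * M / rank, made monotone from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]  # hypothetical P-values for four annotations
print(bonferroni(example))
print(benjamini_hochberg(example))
```

At a 0.05 cut-off, Bonferroni keeps two of these four annotations, whereas B-H keeps all four, which illustrates why B-H is described as less stringent.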
Reducing multiple test correction stringency
• The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
• Can control the stringency by reducing the number of tests:
– e.g. use GO slim or restrict testing to the appropriate GO annotations.
AE tools
• Web-based tools
– Funspec:
  • easy tool for yeast, not maintained, uses GO annotations and some other annotations (e.g. protein complexes)
– YeastFeatures
  • Similar to Funspec, different datasets and presentation
– GoMiner:
  • Uses GO annotations, covers many organisms, needs a background set of genes
• Cytoscape-based tools
– BINGO:
  • Uses GO annotations, displays enrichment results graphically and visually organizes related categories
Funspec: Simple ORA for yeast
http://funspec.med.utoronto.ca/
Paste list here
Bonferroni correct? YES!
Choose sources of annotation
Caveats:
• yeast only
• last updated 2002
http://software.dumontierlab.com/yeastfeatures
GoMiner, part 1
http://discover.nci.nih.gov/gominer
1. Click “web interface”
2. Upload background
3. Upload list
4. Choose organism
5. Choose evidence code (All or Level 1)
GoMiner, part 2
6. Restrict # of tests via category size
7. Restrict # of tests via GO hierarchy
8. Results emailed to this address, in a few minutes
DAVID, part 1 http://david.abcc.ncifcrf.gov/
Paste list here
Choose ID type
List type: list or background?
DAVID automatically detects organism
DAVID, part 2
http://david.abcc.ncifcrf.gov/
BINGO, an ORA Cytoscape plugin
http://www.psb.ugent.be/cbd/papers/BiNGO/index.htm
Links represent parent-child relationships in GO ontology
Colours represent significance of enrichment
Nodes represent GO categories
Outline
1. Explore identified proteins
2. Attribute enrichment
3. Networks
• Physical networks
• Genetic networks
• Functional networks
4. Pathways
Why Network and Pathway Analysis?
• Intuitive to Biologists
– Provide a biological context for results
– More efficient than searching databases gene-by-gene
– Intuitive display for sharing data
• Computation on Pathway Content
– Visualize multiple data types on a pathway or network
– Find active pathways
– Identify potential regulators
network
In biology, a network is a graph composed of nodes that correspond to entities (genes, proteins, small molecules) and edges that correspond to physical/agentive or associative relations between entities.
[Figure: example graph illustrating a vertex (node), an edge, a cycle, a directed edge (arc) and weighted edges (weights -5, 7, 10)]
Integration in a Network Context
Expression data mapped to node colours
Mapping Biology to a Network
• A simple mapping: Protein-protein interactions
– one protein/node, one interaction/edge
• Edges can represent other relationships
– Physical e.g. protein-protein interaction
– Regulatory e.g. kinase activates target
– Genetic e.g. epistasis
– Similarity e.g. protein sequence similarity
• Critical: understand the mapping for network analysis
Protein Sequence Similarity Network
http://apropos.icmb.utexas.edu/lgl/
Literature Network
• Computationally extract gene relationships from text, usually PubMed abstracts
• Useful if network is not in a database
– Literature search tool
• BUT not perfect
– Problems recognizing gene names
– Natural language processing is difficult
• Agilent Literature Search Cytoscape plugin
• iHOP (www.ihop-net.org/UniPub/iHOP/)
Agilent Literature Search
Cytoscape Network produced by Literature Search.
Abstract from the scientific literature
Sentences for an edge
Enrichment Map
[Figure: two overlapping gene-sets, A and B]
Overlap = $\frac{|A \cap B|}{\min(|A|, |B|)}$
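The overlap score is a one-line set operation. Below is a minimal sketch in Python. The two gene-sets are made-up examples built from gene names used earlier in the talk, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def overlap_coefficient(set_a, set_b):
    """|A intersect B| / min(|A|, |B|); defined here as 0.0 if either set is empty."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical gene-sets for illustration.
gene_set_a = {"RRP6", "MRD1", "RRP7"}
gene_set_b = {"RRP7", "RRP43", "RRP42", "RRP6"}
print(overlap_coefficient(gene_set_a, gene_set_b))
```

Because the denominator is the smaller set, a small gene-set fully contained in a larger one scores 1.0, which suits the nested categories of GO.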
Nodes represent gene-sets
[Figure: enrichment map with gene-set clusters labelled Olfactory Receptor; Muscle Contraction; Ectodermal Dev. & Keratinocyte Diff.; Ubiquitin Processes; DNA Processes; Mitotic Cell Cycle; DNA Repair; DNA Replication; Ras GTPase; Serine Endopeptidase; Chromatin Remodeling; Chromosome; Ubiquitin-dependent Proteolysis; Ubiquitin Ligase; Microtubule Cytoskeleton; Intermediate Filament Cytoskeleton; Ion Channel (Calcium, Potassium, Sodium); Mitochondrial Oxidative Metabolism; Fatty Acid Metabolism; Cytoskeleton; mRNA Transport; RNA Splicing; RNA Processes; Transcription; rRNA Processing; Ribonucleotide Metabolism; Translation]
Physical Networks
• Between two molecular objects
– DNA, RNA, gene, protein, complex, small molecule, photon
– Requires a site of interaction / binding
• Biologically relevant:
– Present/expressed at the same time
– Share a cellular location
– Leads to some biologically relevant outcome
[Figure: A interacting with B]
Molecular Interactions
RAS interacting with RALGDS (PDB: 1LFD)
Synthetic protein interacting with ATP and Zinc (PDB: 2P0X)
Experimental Interaction Discovery
[Figure: experimental methods (Microarray, Two-Hybrid, Mass Spectrometry, Genetics, X-Ray, NMR) arranged under the categories Direct, Physical; Indirect, Physical; Indirect, Genetic]
Experimental Considerations
• How do you know if the interaction really exists?
• Each method has its advantages and disadvantages.
– Be aware of systematic errors
– Be aware of contaminants.
• Each method observes interactions from a slightly different experimental condition.
• Support from many different sources is certainly better (necessary) than just one.
Some affinity purification caveats
First and most importantly, this is only a representation of the observation.
You can only tell what proteins are in the eluate; you can’t tell how they are connected to one another.
If there is only one other protein present (B), then it’s likely that A and B are directly interacting.
But what if I told you that two other proteins (B and C) were present along with A…
[Figure: A with B; A with B and C]
Complexes with unknown topology
Which of these models is correct?
The complex described by this experimental result is said to have an Unknown Topology.
[Figure: alternative arrangements of A, B and C]
Complexes with unknown stoichiometry
Here’s another possibility.
The complex described by this experimental result is also said to have Unknown Stoichiometry.
[Figure: A with two copies of B]
Interaction Models
Spoke Matrix
Simple model, useful for data navigation
More accurate
Theoretical max. number of interactions
Actual Topology
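The two models differ only in which pairs become edges. Below is a minimal sketch in Python, assuming one bait protein and the list of prey proteins it co-purified with. The function names are hypothetical.

```python
from itertools import combinations


def spoke_edges(bait, preys):
    """Spoke model: the bait is linked to every prey, and preys are not linked to each other."""
    return {(bait, prey) for prey in preys}


def matrix_edges(bait, preys):
    """Matrix model: every pair of proteins in the purification is linked."""
    members = sorted({bait, *preys})
    return set(combinations(members, 2))


print(sorted(spoke_edges("A", ["B", "C"])))   # [('A', 'B'), ('A', 'C')]
print(sorted(matrix_edges("A", ["B", "C"])))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

For a purification with n proteins, the spoke model yields n - 1 edges and the matrix model n(n - 1)/2, so the matrix model grows much faster on large complexes.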
High-throughput Mass Spectrometric Protein Complex Identification (HMS-PCI)
Ste12
Ho et al. Nature. 2002 Jan 10;415(6868):180-3
Mike Tyers, SLRI
k-core analysis
• A part of a graph where every node is connected to other nodes in that part by at least k edges (k=0,1,2,3...)
• The highest k-core is the central, most densely connected region of a graph
• Regions of dense connectivity may represent molecular complexes
• Therefore, high k-cores may be molecular complexes
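A k-core can be found by repeatedly pruning nodes that have fewer than k remaining neighbours. Below is a minimal sketch in Python, assuming an undirected graph stored as a dictionary of neighbour sets. The toy graph and the `k_core` helper are illustrative and are not the MCODE implementation.

```python
def k_core(adjacency, k):
    """Return the nodes of the k-core of an undirected graph."""
    adj = {node: set(neighbours) for node, neighbours in adjacency.items()}
    while True:
        low = [node for node, neighbours in adj.items() if len(neighbours) < k]
        if not low:
            return set(adj)
        for node in low:
            # Remove the node and detach it from its surviving neighbours.
            for neighbour in adj.pop(node):
                if neighbour in adj:
                    adj[neighbour].discard(node)


# Toy graph: a triangle A-B-C with one extra node D hanging off C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(sorted(k_core(graph, 2)))  # ['A', 'B', 'C']
```

Raising k until the result is empty gives the highest k-core, the candidate "complex" in the sense of the bullets above.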
[Figure: highest k-cores for four interaction datasets (Pre MS, Ho, Gavin, Union); panel labels: 6-core, 6-core, 6-core, 9-core]
Interaction can define function
MCODE plugin for Cytoscape
http://pathguide.org
Interaction Databases
• Experiment (E)
• Structure detail (S)
• Predicted
– Physical (P)
– Functional (F)
• Curated (C)
• Homology modeling (H)
• *IMEx consortium
Network Classification of Disease
• Traditional: Gene association
• Limitations: Too many genes reduces statistical power
• New: Active cell map based approaches combining network and molecular profiles
Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16.
Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet. 2007 Jun;3(6):e96.
Efroni S, Schaefer CF, Buetow KH. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS ONE. 2007 May 9;2(5):e425.
Network-Based Breast Cancer Classification
• 57k interactions from Y2H, orthology, co-citation, HPRD, BIND, Reactome
• 2 breast cancer cohorts, different expression platforms
Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16.
• Similar network markers across 2 data sets (better than original overlap)
• Increased classification accuracy
• Better coverage of known cancer risk genes (*)
PIPE
• Predicts yeast PPI from sequence
– Uses interaction databases to find similar interacting proteins
– Estimates the site of interaction
– 75% accuracy (61% sensitivity, 89% specificity)
– Finds new interactions among complexes
PIPE2
• First all-to-all sequence-based computational screen of PPIs in yeast
– 29,589 high confidence interactions of ~ $2 \times 10^{7}$ possible pairs
– 16,000x faster than PIPE
– 99.95% specificity
Synthetic Genetic Interactions
• Synthetic genetic interactions (lethal, slow growth)
• Mate two mutants without phenotypes to get a daughter cell with a phenotype
• Synthetic lethal (SL), slow growth
• Robotic mating using the yeast deletion library
• Genetic interactions provide functional data on protein interactions or redundant genes
• About 23% of known SLs (1295 - YPD+MIPS) were known protein interactions in yeast
Tong et al. Science. 2001 Dec 14;294(5550):2364-8
Synthetic Genetic Interactions in Yeast
[Figure: genetic interaction network; legend: Cell Polarity, Cell Wall Maintenance, Cell Structure, Mitosis, Chromosome Structure, DNA Synthesis, DNA Repair, Unknown, Others]
Tong, Boone
Validation: Protein Localization
A – A3: Y2H
B: physical methods
C: genetic
E: immunological
True positives:
- Localized in the same cellular compartment
- Have a common cellular role
Sprinzak, Sattath, Margalit, J Mol Biol, 2003
Comparisons
• All methods except for Y2H and the synthetic lethality technique are biased toward abundant proteins.
• PPI methods are biased toward certain cellular localizations.
• Evolutionarily conserved proteins have much better coverage in Y2H than proteins restricted to a certain organism.
C. von Mering et al., Nature, 2002
Functional Associations
• Molecular Interactions
• Regulatory Interactions
• Genetic Interactions
• Similarity relationships
– Co-expression
– Protein sequence
– Domain architecture
– Phylogenetic profiles
– Gene neighborhood
– Gene fusion
– …
http://string.embl.de/
von Mering et al., Nucleic Acids Res., 2005
Gene Function Prediction using a Multiple Association Network Integration Algorithm
Query-specific weights for multifaceted function queries
[Figure: a combined network is formed as a weighted sum (weights w1, w2, w3) of Co-expression, Genetic (Tong et al. 2001) and Co-complexed networks. Example genes: CDC27, APC11, CDC23 (Cell cycle); XRS2, RAD54, MRE11 (DNA repair); UNK1, UNK2]
Pavlidis et al, 2002; Lanckriet et al, 2004; Mostafavi et al, 2008; Durrett 2006
GeneMANIA Cytoscape Plugin
Outline
1. Explore identified proteins
2. Attribute enrichment
3. Networks
4. Pathways
5. Lab
pathway
In biology, a pathway is a network which consists of inputs (physical entities), outputs (physical entities, biological outcomes), and the molecular machinery and chemical transformations required/expected to realize the end-directed activity.
Using Pathway Information
Databases
Literature
Expert knowledge
Experimental Data
Find active processes underlying a phenotype
Pathway Information
Pathway Analysis
http://pathguide.org
Vuk Pavlovic, Sylva Donaldson
>290 Pathway Databases!
• Varied formats, representation, coverage
• Pathway data extremely difficult to combine and use
Aim: Convenient Access to Pathway Information
Facilitate creation and communication of pathway data
Aggregate pathway data in the public domain
Provide easy access for pathway analysis
http://www.pathwaycommons.org
Access From Cytoscape
Fatty Acid Degradation?
Other pathways / processes?
GenMAPP.org
cardiomyopathy: downregulated genes
Fatty Acid Degradation Pathway
Cardiomyopathy Data on Fatty Acid Degradation Pathway
Visualizing Time Course Data on Pathways: Multiple Comparison View
Outline
1. Explore identified proteins
2. Attribute enrichment
3. Networks
4. Pathways
5. Lab
Network Analysis
• Cytoscape
– Visualize molecular interaction networks and integrate interactions with gene expression profiles and other state data. Data filters & custom plug-in architecture.
– http://www.cytoscape.org
• Biolayout Express 3D
– Large networks
– Gene expression
– www.sanger.ac.uk/Teams/Team101/biolayout/b3d.html
Network Analysis using Cytoscape
Databases
Literature
Expert knowledge
Experimental Data
Find biological processes underlying a phenotype
Network Information
Network Analysis
Network visualization and analysis
UCSD, ISB, Agilent, MSKCC, Pasteur, UCSF, Unilever, UToronto, U Texas
http://cytoscape.org
• Pathway comparison
• Literature mining
• Gene Ontology analysis
• Active modules
• Complex detection
• Network motif search
Manipulate Networks
Filter/Query
Automatic Layout
Interaction Database Search
Focus
Overview
Zoom
PKC Cell Wall Integrity
Active Community
• Help
– 8 tutorials, >10 case studies
– Mailing lists for discussion
– Documentation, data sets
• Annual Conference: Houston Nov 6-9, 2009
• 10,000s of users, 2500 downloads/month
• >40 Plugins Extend Functionality
– Build your own, requires programming
http://www.cytoscape.org
Cline MS et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2(10):2366-82
LAB
Objective
• Create a map of the functional enrichments from the 14 input proteins
Methods
• Use HGNC to obtain the gene symbols from the names
• Submit the gene symbols to a tool that already has datasets loaded.
• Get Attributes and do analysis on network
14 Proteins
• ISOFORM of APOPTOSIS-INDUCING FACTOR 1, MITOCHONDRIAL
• QUINONE OXIDOREDUCTASE.; 26 KDA PROTEIN.; 22 KDA PROTEIN.; 32 KDA PROTEIN.
• 14-3-3 PROTEIN EPSILON.
• ELONGATION FACTOR 1-GAMMA.; 50 KDA PROTEIN.
• AFG3-LIKE PROTEIN 2.
• 3-KETOACYL-COA THIOLASE, MITOCHONDRIAL
• IMPORTIN BETA-1 SUBUNIT.
• FH1/FH2 DOMAIN-CONTAINING PROTEIN
• ANNEXIN VI ISOFORM 2.; ANNEXIN A6.
• 2,4-DIENOYL-COA REDUCTASE, MITOCHONDRIAL
• HYDROXYACYL GLUTATHIONE HYDROLASE ISOFORM 1.; HYDROXYACYLGLUTATHIONE HYDROLASE.
• ISOFORM 1 OF ELECTRON TRANSFER FLAVOPROTEIN SUBUNIT BETA.; ISOFORM 2 OF ELECTRON TRANSFER FLAVOPROTEIN SUBUNIT BETA
• ISOFORM 1 OF LONG-CHAIN-FATTY-ACID--COA LIGASE 1
• PHOSPHOLIPASE C DELTA 4.
Get their gene symbol/identifiers
HGNC - http://www.genenames.org
• Provide a table of mappings
• What challenges did you face when trying to identify the symbols from textual descriptions?
Identify functional enrichments
Discuss and provide a plot for the enrichment of Gene Ontology categories
Build an attribute enrichment network
• Which new proteins are functionally linked?
• What datasets were used in the network construction?
Attribute Enrichment with a custom data set
• Use BioMart to
– convert HGNC identifiers to Ensembl Identifiers
– obtain the Gene Ontology categories for the target proteins and the background proteins.
• Use FUNC to do the enrichment analysis
Collect the Gene Ontology attributes for the list, then for all the human genes
Next steps are harder…
To use FUNC, you need to convert the BioMart output to the file format above. This is pretty easy to do in Excel for the protein list, but Excel can’t handle the results for all the human proteins. You need to write a small script… take BIOC3008 and become competent in simple data manipulation.
http://func.eva.mpg.de/
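A small script of this kind might look like the sketch below, written in Python. The FUNC file format referred to above is not reproduced in this text, so the output layout (gene ID, GO term, and a 1/0 flag for membership in the target list, tab-separated) is an assumption to be checked against the FUNC documentation. The column names `gene_id` and `go_id` are hypothetical and should be replaced with the headers of the actual BioMart export.

```python
import csv


def biomart_to_rows(biomart_file, target_ids, gene_col="gene_id", go_col="go_id"):
    """Stream a tab-separated BioMart export and yield 'gene<TAB>GO term<TAB>flag' lines.

    The flag is 1 when the gene is in the target list and 0 otherwise
    (assumed layout, not confirmed against FUNC).
    Rows with a missing gene ID or GO term are skipped.
    """
    reader = csv.DictReader(biomart_file, delimiter="\t")
    for record in reader:
        gene, term = record[gene_col], record[go_col]
        if gene and term:
            flag = 1 if gene in target_ids else 0
            yield f"{gene}\t{term}\t{flag}"


# Example usage (file names are placeholders):
# with open("biomart_export.tsv", newline="") as src, open("func_input.tsv", "w") as out:
#     for line in biomart_to_rows(src, target_ids={"ENSG00000139618"}):
#         out.write(line + "\n")
```

Because the rows are processed one at a time, the script can handle the full human background set that is too large for a spreadsheet.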