29.09.2009 – Stefan Götz, Valencia
Functional Analysis - Outline
● Test for enriched functions
– Fisher's Exact Test (FatiGO)
– Gene Set Enrichment (GSEA, FatiScan)
● Kegg Pathway Analysis with B2G
● B2G-Far
Fisher's Exact Test
                 One gene list (A)   The other list (B)
Biosynthesis     54%                 18%
Sporulation      18%                 18%
Are these two groups of genes carrying out different biological roles?
Is this statistically significant? That is, is the difference unlikely to have occurred by chance?
Fisher's Exact Test
Contingency table:
                  A   B
Biosynthesis      6   2
No biosynthesis   5   9
p-value for biosynthesis = 0.0913
Genes in group A are not significantly associated with biosynthesis or sporulation.
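The p-value on this slide can be reproduced with a one-tailed Fisher's exact test. The sketch below is only an illustration: it assumes the contingency table reads A = 6 / 5 and B = 2 / 9 (biosynthesis / no biosynthesis), which matches the 54% and 18% shown, and the function name is made up for this example. It uses only the Python standard library.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins.

    a, b: genes with the term in list A and in list B
    c, d: genes without the term in list A and in list B
    """
    row1 = a + b        # genes annotated with the term
    col1 = a + c        # size of list A
    n = a + b + c + d   # all genes
    total = comb(n, col1)
    upper = min(row1, col1)
    # Sum the hypergeometric probabilities of all tables at least as extreme
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1)) / total

p_biosynthesis = fisher_one_tailed(6, 2, 5, 9)
print(f"p-value for biosynthesis = {p_biosynthesis:.4f}")
```

With these counts the result is 0.0913, above the usual 0.05 cut-off, so the difference in biosynthesis is not significant.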
Multiple testing correction
We do this for every GO term in our dataset!
Many tests => many false positives => we need correction!
FDR control: the false discovery rate is the expected proportion of incorrectly rejected null hypotheses among all rejected hypotheses. Controlling it corrects for multiple comparisons in multiple hypothesis testing.
FWER control (more conservative): the family-wise error rate is the probability of making one or more false discoveries among all the hypotheses tested.
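To make the difference between the two kinds of control concrete, here is a minimal sketch using Bonferroni (FWER) and Benjamini-Hochberg (FDR). These two procedures are chosen as common examples; the slides do not say which procedures the tools apply, and the p-values are made up.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.02, 0.03, 0.6]  # made-up p-values, one per GO term
fwer = bonferroni(raw)
fdr = benjamini_hochberg(raw)
print("significant with FWER control:", sum(p < 0.05 for p in fwer))
print("significant with FDR control: ", sum(p < 0.05 for p in fdr))
```

For these values the FWER correction keeps 2 terms at the 0.05 level and the FDR correction keeps 4, which illustrates why FWER control is the more conservative choice.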
Different types of comparisons
● Compare one condition against another
● Remove common IDs
● Test-Set and Ref-Set are interchangeable
[Figure: Set 1 and Set 2 with their common IDs]
● Compare a subset against the total
● Gossip default setting
● Test-Set and Ref-Set are NOT interchangeable
[Figure: Test-Set and Ref-Set with their common IDs]
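The two comparison types differ mainly in how the shared IDs are handled. The sketch below shows one way to prepare the sets before testing. It rests on an assumption: for the subset-against-total case, the reference is taken to be the total minus the test subset, which the slides do not state explicitly. Function names and sequence IDs are made up.

```python
def prepare_two_conditions(set1, set2):
    """Condition vs. condition: drop the IDs that occur in both sets."""
    common = set1 & set2
    return set1 - common, set2 - common

def prepare_subset_vs_total(subset, total):
    """Subset vs. total: reference = total minus the subset (assumption)."""
    subset = set(subset)
    return subset, set(total) - subset

# Made-up sequence IDs
set1 = {"seq1", "seq2", "seq3", "seq4"}
set2 = {"seq3", "seq4", "seq5"}
only1, only2 = prepare_two_conditions(set1, set2)
print(sorted(only1), sorted(only2))

all_ids = {"seq1", "seq2", "seq3", "seq4", "seq5", "seq6"}
test_set, ref_set = prepare_subset_vs_total({"seq1", "seq2"}, all_ids)
print(sorted(test_set), sorted(ref_set))
```

In the first case swapping the two sets only swaps the direction of the result; in the second case the roles are fixed, because the reference is derived from the total.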
FatiGO in Blast2GO
● The two-tailed test identifies not only over-represented but also under-represented functions.
● If no Ref-Set is chosen, all annotations are used as reference.
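A two-tailed Fisher's exact test sums the probabilities of all tables that are no more likely than the observed one, so a deviation in either direction contributes to the p-value. The following is a minimal standard-library sketch, not the Blast2GO implementation; the counts are illustrative and the function name is an assumption.

```python
from math import comb

def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k annotated genes in the test set
        return comb(row1, k) * comb(n - row1, col1 - k) / total

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    p_obs = prob(a)
    # Add up every table that is no more likely than the observed one
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Term more frequent in the test set (6 of 11 vs. 2 of 11) ...
p_over = fisher_two_tailed(6, 2, 5, 9)
# ... and the mirrored case, where it is less frequent in the test set
p_under = fisher_two_tailed(2, 6, 9, 5)
print(f"{p_over:.4f} {p_under:.4f}")
```

Both calls give the same p-value: the test reports how unusual the table is, and the direction (over or under) is read from the percentages in the two sets.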
FatiGO in Blast2GO
● Result table with link outs to sequence lists
FatiGO in Blast2GO
Retains only the lowest, most specific enriched term per GO branch
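One way to picture this reduction: an enriched term is dropped whenever it is an ancestor of another enriched term, so only the deepest terms of each branch remain. The sketch below works on a toy DAG with placeholder names (not real GO identifiers); it illustrates the idea and is not the Blast2GO code.

```python
def ancestors(term, parents):
    """All terms above `term` in the DAG given as a child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def most_specific(enriched, parents):
    """Keep enriched terms that are not an ancestor of another enriched term."""
    covered = set()
    for term in enriched:
        covered |= ancestors(term, parents)
    return [t for t in enriched if t not in covered]

# Toy DAG with placeholder names
parents = {
    "metabolism": ["biological_process"],
    "biosynthesis": ["metabolism"],
    "amino_acid_biosynthesis": ["biosynthesis"],
    "sporulation": ["biological_process"],
}
enriched = ["metabolism", "biosynthesis", "amino_acid_biosynthesis", "sporulation"]
reduced = most_specific(enriched, parents)
print(reduced)
```

Here "metabolism" and "biosynthesis" are removed because the more specific "amino_acid_biosynthesis" is also enriched, while "sporulation" stays as the only enriched term of its branch.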
FatiGO in Blast2GO
● Export enriched terms data as DAG graphs!
reduce
=> To draw all nodes, set filter to 1
FatiGO in Blast2GO
=> Filter results
% of sequences in Ref group
% of sequences in Test group
If Test > Ref: over-represented
If Ref > Test: under-represented
● Export enriched terms as chart!
FatiScan Features
● Interpret a ranked list of genes
● There is no need to choose a cut-off (all information is included)
● One statistical test for each functional block of annotation
– Multiple testing context (hundreds of annotations)
– Filtering of annotations is convenient (the fewer tests, the better the correction)
FatiScan...testing along an ordered list
• Index ranking genes according to some biological aspect under study.
• Database that stores gene class membership information.
• FatiScan searches over the whole ordered list, trying to find runs of functionally related genes.
[Figure: list of genes ranked from + to -, with annotation labels A, B and C. One block of genes is enriched in annotation A, another block is enriched in annotation B, and annotation C is homogeneously distributed along the list.]
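The idea of searching along the ordered list can be sketched as a series of cuts: at each cut, the genes above it are compared with the genes below it using a Fisher's exact test, and the strongest signal is kept. This is a simplified illustration, not the Babelomics implementation; the number of partitions, the function names and the toy data are assumptions, and the resulting p-values would still need multiple testing correction.

```python
from math import comb

def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    prob = lambda k: comb(row1, k) * comb(n - row1, col1 - k) / total
    p_obs = prob(a)
    ks = range(max(0, col1 - (n - row1)), min(row1, col1) + 1)
    return sum(prob(k) for k in ks if prob(k) <= p_obs * (1 + 1e-9))

def scan_term(ranked_genes, annotated, n_partitions=3):
    """Return (smallest p-value, cut position) over evenly spaced cuts."""
    n = len(ranked_genes)
    best = (1.0, None)
    for step in range(1, n_partitions + 1):
        cut = step * n // (n_partitions + 1)
        upper, lower = ranked_genes[:cut], ranked_genes[cut:]
        a = sum(g in annotated for g in upper)   # annotated, above the cut
        b = sum(g in annotated for g in lower)   # annotated, below the cut
        p = fisher_two_tailed(a, b, len(upper) - a, len(lower) - b)
        if p < best[0]:
            best = (p, cut)
    return best

# Toy ranked list g1 (top) ... g12 (bottom) with two made-up annotations
ranked = [f"g{i}" for i in range(1, 13)]
label_a = {"g1", "g2", "g3", "g4"}     # concentrated at the top
label_c = {"g1", "g4", "g7", "g10"}    # spread evenly along the list
p_a, cut_a = scan_term(ranked, label_a)
p_c, cut_c = scan_term(ranked, label_c)
print(f"A: p = {p_a:.4f} at cut {cut_a}")
print(f"C: p = {p_c:.4f}")
```

The annotation concentrated at the top of the list gives a small p-value at the first cut, while the evenly spread annotation gives a p-value close to 1 at every cut.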
FatiScan gene set enrichment analysis (fold-change case-control)
[Figure: ranked gene list (e.g. fold-changes) from + to -, with functional labels GO1, GO2, GO10, ..., GO11 and the regions A to D described below]
A: Functional labels (GO, KEGG, etc.) over-represented among over-expressed genes
B: Functional labels under-represented among over-expressed genes
C: Functional labels over-represented among under-expressed genes
D: Functional labels under-represented among under-expressed genes
Functional Analysis - Outline
● Test for enriched functions
– Fisher's Exact Test (FatiGO)
– Gene Set Enrichment (GSEA, FatiScan)
● Kegg Pathway Analysis
● B2G-Far
Export annotations to other tools
C04018C10 GO:0004707
C04018C10 GO:0006468
C04018C10 GO:0005524
C04018C10 EC:2.7.11.24
C04018E10 GO:0005739
C04018E10 GO:0009536
C04018A12 GO:0009056
C04018C12 GO:0004869
C04018C12 ...
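An export like the one shown here is easy to read back in other tools. The sketch below assumes one "sequence ID, annotation term" pair per line, separated by whitespace; the function name is made up, and the exact column separator of the real export file is an assumption.

```python
from collections import defaultdict

def parse_annotations(lines):
    """Group annotation terms (GO IDs, EC numbers) by sequence ID."""
    by_sequence = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        seq_id, term = fields[0], fields[1]
        by_sequence[seq_id].append(term)
    return dict(by_sequence)

# First lines of the export shown on the slide
example = """\
C04018C10 GO:0004707
C04018C10 GO:0006468
C04018C10 GO:0005524
C04018C10 EC:2.7.11.24
C04018E10 GO:0005739
C04018E10 GO:0009536
"""
annotations = parse_annotations(example.splitlines())
print(annotations["C04018C10"])
```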
Babelomics web tools suite
● A complete suite of web tools for the functional analysis of groups of genes in high-throughput experiments
Babelomics Databases
● Integrated biological DB of functional annotation for more than 10 species
● Sources: InterPro, Gene Ontology, KEGG, Ensembl, SwissProt, Transcription Factors, MicroRNA, CisRed, Bioentities (Literature)
● Import your own annotations
Babelomics Tools
Two lists:
FatiGO: finds differential distributions of functional terms between two groups of genes. These terms can be Gene Ontology terms, InterPro motifs, SwissProt keywords, transcription factors (TF), gene expression in tissues, bioentities from scientific literature, or CisRed cis-regulatory elements.
Tissues Mining Tool: compares reference values of gene expression in tissues to your results.
MARMITE: finds differential distributions of bioentities extracted from PubMed between two groups of genes.
Ranked list:
FatiScan: detects significant functions with Gene Ontology, InterPro motifs, SwissProt keywords and KEGG pathways in lists of genes ordered according to different characteristics.
MarmiteScan: uses chemical and disease-related information to detect related blocks of genes in a gene list with associated values.
FatiScan Web Tool
List of genes
C04018C12 2.31
C04018C13 2.23
C04018C14 1.87
C04018C16 1.62
C04018E18 0.87
C04018E19 0.12
C04018A21 -0.01
C04018C33 -0.18
C04018C65 ....
List of annotations
C04018C13 GO:0004707
C04018C14 GO:0006468
C04018C15 GO:0005524
C04018C16 EC:2.7.11.24
C04018E17 GO:0005739
C04018E18 GO:0009536
C04018A19 GO:0009056
C04018C22 GO:0004869
C04018C32 ...
FatiScan Results
Significantly enriched GO-Terms
[Figure: result table with columns for percentages (Test/Ref), adjusted p-values and p-values]
Exercise
● List of genes differentially expressed between two tumour classes
● To identify functionally enriched terms for blocks of genes, we are going to perform a threshold-free FatiScan
● Genes are ranked by their fold-change
Functional Analysis - Outline
● Test for enriched functions
– Fisher's Exact Test (FatiGO)
– Gene Set Enrichment (GSEA, FatiScan)
● Kegg Pathway Analysis with B2G
● B2G-Far
KEGG: Kyoto Encyclopedia of Genes and Genomes
GO Term → Enzyme Code → KEGG Pathway
Obtain Enzyme Codes
KEGG Pathways
● First, choose a folder to save the KEGG maps
● Maps are retrieved online
● Export as text file (tab-sep)
KEGG Pathways
--> Each enzyme in a different color
--> Pathways are ordered by abundance (ordered list)
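Ordering pathways by abundance can be pictured as counting, for each pathway, how many sequences map to it through their enzyme codes. The sketch below is an illustration under assumptions: "abundance" is taken to mean the number of distinct sequences per pathway, and all sequence IDs, enzyme codes and pathway names are placeholders, not real KEGG entries.

```python
from collections import defaultdict

def rank_pathways(seq_to_enzymes, enzyme_to_pathways):
    """Return (pathway, number of sequences) pairs, most abundant first."""
    members = defaultdict(set)
    for seq_id, enzymes in seq_to_enzymes.items():
        for ec in enzymes:
            for pathway in enzyme_to_pathways.get(ec, ()):
                members[pathway].add(seq_id)
    # Sort by descending count, then by name to keep ties stable
    return sorted(((p, len(s)) for p, s in members.items()),
                  key=lambda item: (-item[1], item[0]))

# Placeholder data
seq_to_enzymes = {
    "seq1": ["EC:a"],
    "seq2": ["EC:a", "EC:b"],
    "seq3": ["EC:c"],
}
enzyme_to_pathways = {
    "EC:a": ["pathway_1"],
    "EC:b": ["pathway_1", "pathway_2"],
    "EC:c": ["pathway_2", "pathway_3"],
}
ranking = rank_pathways(seq_to_enzymes, enzyme_to_pathways)
for pathway, count in ranking:
    print(pathway, count)
```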
Functional Analysis - Outline
● Test for enriched functions
– Fisher's Exact Test (FatiGO)
– Gene Set Enrichment (GSEA, FatiScan)
● Kegg Pathway Analysis with B2G
● Functional Annotation Repository: B2G-FAR
Annotation Repository: B2G-Far
● Many publicly available sequences are uncharacterised → Reduce the number of un-annotated sequences
● Generate high-quality functional annotations, especially for the non-model species community
Comprehensive and high-throughput: apply the Blast2GO methodology to what is probably the largest protein sequence resource, SIMAP → pre-calculated sequence alignments of 29 million non-redundant proteins that cover the content of all major public sequence databases; it also contains InterPro domain annotation
Functional and highly used: annotate non-model Affymetrix microarrays
Analysis Pipeline
Simap contains 29 million proteins
Annotation source: manually curated GO-lite database
17 non-model species GeneChips
Hundreds of species now with Blast2GO protein annotations
Results
Datasource                                                 Unique sequences
Whole Simap                                                29.375.919
Simap without metagenomes                                  24.394.532
Simap protein sequences annotated by Blast2GO              13.263.568
Sequences which do not surpass the annotation threshold     2.269.564
Sequences without sequence alignments to GO                 8.861.400
Processing time: 3 days!
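The last three rows add up to the "Simap without metagenomes" total, so they can be read as shares of that set. A small check, assuming the dots in the table are thousands separators:

```python
counts = {
    "annotated by Blast2GO": 13_263_568,
    "below the annotation threshold": 2_269_564,
    "no sequence alignments to GO": 8_861_400,
}
total = 24_394_532  # Simap without metagenomes

# The three categories partition the non-metagenome sequences
assert sum(counts.values()) == total

shares = {name: 100 * n / total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

Under that reading, roughly 54% of the non-metagenome sequences received an annotation, about 9% had alignments but stayed below the annotation threshold, and about 36% had no alignment to a GO-annotated sequence.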
Online Repository
● High quality functional annotations: 2000 species, 17 non-model species microarrays
● B2G-Far Content
– Taxonomy
– General Information
– Data download
– Statistics: annotation distribution, GO-level distribution, top-50 functions
Search your sequences in B2G-FAR
● Download annotations of your species
● Load annotations into B2G
● Deselect all sequences
Exercise
● Go to web site –> course material
● Perform an enrichment analysis with Blast2GO and FatiScan in Babelomics
● Once completed, check out the comments/solution of the exercise