Lecture 4 – Introduction to Functional Annotation
José Luis Mosquera
Computational Lab on Microarrays Data Analysis
Special Topics in Computer Science
Institute of Bioinformatics – Johannes Kepler University
June 2010
Outline
1. Introduction
   1. Biological significance
   2. The Gene Ontology
2. Methods
   1. Some approaches to find biological meaning
   2. Hypergeometric and related approaches
3. Tools
   1. Evolution of the GO Tools
   2. SerbGO: Searching for the best GO Tool
   3. FatiGO
Biological Significance (1/5)
• With the advent of genomic technologies, it has become possible to perform high-throughput biological experiments routinely.
• This has highlighted several challenges:
1. The experiment itself
2. Statistical analysis of the results
3. Biological interpretation
• These experiments often yield lists of identifiers (genes, peptides,...) which are selected using some specific criteria to assign them statistical significance.
High throughput experiments
Biological Significance (2/5)
• Sometimes either
1. the number of items selected as statistically significant is very high, or
2. the items do not show any statistical significance.
• Whatever the reason, the selected items are expected to “mean something” biologically.
High throughput experiments
Synthesis
What the list means from the biological point of view.
Biological Significance (3/5)
• The usual (reasonable?) way to proceed is to shift the focus from “statistical” to “biological” significance.
• Whereas there is clear agreement about what statistical significance means...
• There is no consensus definition of biological significance,
• Although everyone talks about it…
If biological significance is the answer, what was the question?
Biological Significance (4/5)
• Interestingly, biological significance is often recast in terms of statistical significance
Biological significance means Statistical significance
R. Díaz-Uriarte, CAMDA 2002
...to understand the biological relevance of statistical differences in gene expression data...by examining significant differences in the distribution of (GO) terms related to biological processes or molecular function.
Biological Significance (5/5)
• Although it is not necessarily so
Biological significance does not mean Statistical significance
GeneSifter website
... to characterize the biology involved in a particular experiment, and to identify particular genes of interest...combining the identification of broad biological themes with the ability to focus on a particular gene..
The Gene Ontology (1/12)
Let it be clear what is…
The Challenge
How can we attribute a biological interpretation to large lists of genes (identifiers)?
The Gene Ontology (2/12)
Rationale
• Look for existing annotations in databases that help to relate the selected genes to biological knowledge.
• Bioinformatic resources often store data in scientific natural language.
Drawback
Annotation in this way is human readable and understandable, but it is difficult to interpret computationally.
The Gene Ontology (3/12)
What’s a cell?
• The same name can be used to describe different concepts.
• A concept can be described using different names.
• Comparison is difficult, especially across species or databases
The Gene Ontology (4/12)
• The most important thing you want to know is what the gene products are concerned with, i.e. their function.
• The best functional annotation systems use human beings who read the literature before assigning a function to a gene.
Functional Annotation
Some difficulties
• Different people use different words for the same function.
• Different people mean different things by the same word.
• The context in which a gene was found may not be associated with its function.
• Inference of a function from sequence alone is error-prone and sometimes unreliable.
The Gene Ontology (5/12)
Functional Annotation
What can we do?
To overcome some of these drawbacks, ontologies offer a useful annotation system.
The Gene Ontology (6/12)
What’s an ontology?
Definition
An ontology is an entity that provides a set of vocabulary terms covering a conceptual domain.
• These terms must
1. have an exhaustive and rigorous definition
2. be placed within a (hierarchical data) structure of relationships.
• The terms may be linked by two kinds of relationships
1. “is-a” between parent and child.
2. “part-of” between part and the whole.
• They may have one or more
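The structure described above can be sketched as a small graph in which each term points to its parents through “is-a” or “part-of” links. This is only an illustrative sketch: the term names, the edge list and the helper functions `build_parents` and `ancestors` are hypothetical and are not taken from any real ontology release. It also assumes, as in the Gene Ontology, that a term may have more than one parent.

```python
from collections import defaultdict

# Hypothetical toy ontology: each edge is (child, relationship, parent).
# The terms are illustrative only, not copied from a real ontology file.
EDGES = [
    ("nucleus", "is_a", "organelle"),
    ("nucleus", "part_of", "cell"),
    ("nucleolus", "part_of", "nucleus"),
    ("organelle", "part_of", "cell"),
]


def build_parents(edges):
    """Map each term to its list of (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, rel, parent in edges:
        parents[child].append((rel, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` by following links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _rel, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = build_parents(EDGES)
print(sorted(ancestors("nucleolus", parents)))  # ['cell', 'nucleus', 'organelle']
```

Collecting ancestors in this way is what lets an annotation made to a specific term also count for all the more general terms above it.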
The Gene Ontology (7/12)
What’s an ontology?
But… What about the biological field?
A powerful ontology to perform biological interpretation of “our” experiments is the Gene Ontology (usually named GO)
The Gene Ontology (8/12)
• The GO project is a cooperative effort to address the need for consistent descriptions of gene products in different databases.
• It is developed and maintained by the Gene Ontology Consortium.
• The GO is organized around three basic ontologies
GO ontologies

Ontology                   Number of terms [1]
Molecular Function (MF)    7220
Biological Process (BP)    9529
Cellular Component (CC)    1536
Total GO terms             18235

[1] May 2005

[Diagram: Gene Ontology, with three branches: Molecular Function, Biological Process, Cellular Component]
The Gene Ontology (9/12)
GO graph
The Gene Ontology (10/12)
• The annotation contained in the GO database consists of two essential parts
1. The ontologies that provide a structured vocabulary.
2. The annotations that link the gene products to the associated terms that define their function.
• GO database attributes annotation in a species-independent way.
• Most important databases have cross-references with the GO database.
GO database
The Gene Ontology (11/12)
A given gene product may
• represent one or more molecular functions,
• be used in one or more biological processes and
• appear in one or more cellular components.
GO database
The Gene Ontology (12/12)
GO Annotations: Evidence Codes
IEA Inferred from Electronic Annotation
ISS Inferred from Sequence Similarity
IEP Inferred from Expression Pattern
IMP Inferred from Mutant Phenotype
IGI Inferred from Genetic Interaction
IPI Inferred from Physical Interaction
IDA Inferred from Direct Assay
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
NAS Non-traceable Author Statement
IC Inferred by Curator
ND No biological Data available
Some Approaches To Find Biological Meaning
• Annotating results using appropriate biological databases
• Relying on some form of grouping method
1. Gene Set Enrichment: hypergeometric test, Fisher's exact test, GSEA,...
2. Holistic approaches: Category, globaltest, GlobalAncova,...
3. Minimal Acceptance Strength
4. ...
• Taking a more global approach, relying on some type of
1. Graph-theoretic analysis, or
2. Pathway analysis.
• Or the most (up-to-date) global approach: Systems biology
Quick overview
Gene Set Enrichment (1)
Consider the following frame
• N genes on a microarray
• M genes belong to a given GO term category (A),
• N − M do not belong to it (category Ac)
• K of these N genes are selected and assigned to a given class (e.g. regulated genes)
• x of these K genes will be in A
Hypergeometric test
Statistical Hypothesis
H0: GO category A is as well represented on the microarray as in the class of differentially regulated genes
H1: GO category A is more (or less) represented on the microarray than in the class of differentially regulated genes
Example
Gene Set Enrichment (2)
• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modeled by a hypergeometric distribution with parameters (N, M, K).
Hypergeometric distribution
Question
Assuming sampling without replacement, what is the probability of having exactly x genes of category A?
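With the notation used so far (N genes on the microarray, M of them in category A, K selected genes), the answer to this question is the hypergeometric probability mass function. The random variable name X, standing for the number of selected genes that fall in A, is introduced here only for convenience.

```latex
P(X = x) \;=\; \frac{\dbinom{M}{x}\,\dbinom{N-M}{K-x}}{\dbinom{N}{K}},
\qquad \max\bigl(0,\; K-(N-M)\bigr) \;\le\; x \;\le\; \min(K,\, M)
```

The numerator counts the ways of choosing x genes from A and the remaining K − x genes from Ac; the denominator counts all ways of choosing K genes from the N on the array.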
Gene Set Enrichment (3)
• The preceding model allows a user to compute a p-value for the test in which rejecting the null hypothesis corresponds to deciding that the category which is being tested is over-represented
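• To test for under-representation we would use the lower tail P(X ≤ x). If p = P(X ≥ x) is the over-representation p-value, this is approximately 1 − p (exactly 1 − p + P(X = x), because the observed count x belongs to both tails).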
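As a minimal sketch of how these p-values could be computed, the code below sums hypergeometric probabilities using only the Python standard library. The function names `hypergeom_pmf`, `upper_tail` and `lower_tail` are hypothetical, and the numbers are those of the appendix example (N = 10000, M = 500, K = 200, x = 25). Tools in practice may rely on a statistics library rather than on explicit sums like these.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes fall in category A."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


def upper_tail(x, N, M, K):
    """P(X >= x): p-value for over-representation of category A."""
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, min(K, M) + 1))


def lower_tail(x, N, M, K):
    """P(X <= x): p-value for under-representation of category A."""
    start = max(0, K - (N - M))
    return sum(hypergeom_pmf(i, N, M, K) for i in range(start, x + 1))


# Appendix example: 10000 genes on the array, 500 in category A,
# 200 differentially expressed genes, 25 of which are in A.
N, M, K, x = 10000, 500, 200, 25
p_over = upper_tail(x, N, M, K)
p_under = lower_tail(x, N, M, K)
print(f"over-representation p-value:  {p_over:.3g}")
print(f"under-representation p-value: {p_under:.3g}")
```

Both tails include the observed count x, which is why the two p-values add up to slightly more than 1.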
Hypergeometric distribution
Gene Set Enrichment (5)
Hypergeometric distribution
Some considerations…
• Different programs use slightly different approaches, most of which are equivalent in some sense
1. Fisher's exact test
2. Chi-squared test
3. Binomial test
4. …
•It is necessary, and most programs do, to make some form of multiple testing adjustment due to the fact that one may be doing dozens or more tests simultaneously.
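The slide does not say which adjustment a given program applies. As an illustration only, the sketch below implements two common choices, the Bonferroni correction and the Benjamini-Hochberg false discovery rate adjustment. The function names and the list of raw p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per GO category tested.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg controls the expected proportion of false positives among the categories reported.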
Gene Set Enrichment (6)
• Proposed by Subramanian et al. (2005)
STEP 1
● Compute a gene-wise measure (e.g. absolute t-statistics)
● Rank genes according to this measure
STEP 2
● Assign labels A to genes belonging to a gene group of interest and B to all the other genes
● If group A is enriched with interesting genes, many of its genes will have high ranks and we will observe a separation in the ordered list
A B A A B A A A B A B B B A B B B B A B B B
Gene Set Enrichment Analysis (GSEA)
• Assign score n_B to all genes A and −n_A to all genes B
• Draw the cumulative sum of these scores
• Is the maximum M of the cumulative sum unusually high? (Kolmogorov-Smirnov test)
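The running-sum idea can be sketched with the ordered list of labels shown on the slide. The sketch assumes that n_A and n_B are the numbers of genes labelled A and B, so that the cumulative sum returns to zero at the end of the list. To judge whether the maximum is unusually high it shuffles the labels at random, which is a simple stand-in chosen for this illustration and not the Kolmogorov-Smirnov test named above. All function names are hypothetical.

```python
import random
from itertools import accumulate

# Ordered list of labels from the slide (highest-ranked gene first).
ranked_labels = "A B A A B A A A B A B B B A B B B B A B B B".split()


def running_sum(labels):
    """Cumulative sum of scores: +n_B for each A gene, -n_A for each B gene."""
    n_a = labels.count("A")
    n_b = labels.count("B")
    scores = [n_b if label == "A" else -n_a for label in labels]
    return list(accumulate(scores))


def max_deviation(labels):
    """Maximum of the cumulative sum along the ranked list."""
    return max(running_sum(labels))


def permutation_pvalue(labels, n_perm=2000, seed=0):
    """Share of random label orderings whose maximum is at least the observed one."""
    rng = random.Random(seed)
    observed = max_deviation(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_deviation(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


print(running_sum(ranked_labels))
print(max_deviation(ranked_labels))  # 64, reached at the 10th gene
print(permutation_pvalue(ranked_labels))
```

With 9 genes in A and 13 in B, each A gene adds 13 and each B gene subtracts 9, so a cluster of A genes near the top of the list pushes the sum up early.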
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment (7)
Evolution of the GO Tools
• In recent years many similar tools to analyze biological significance using the GO have been published and made available.
• Draghici (2005) reviews 15 of them.
• Huang (2008) classifies 68 tools.
From the initial gap to the crowd
SerbGO: Searching for the best GO Tool (1/4)
• There are many tools to do not-so-many things.
• It is a bidirectional application. The user can...
1. ask for some features to get the appropriate tools for their interests
2. compare tools to check which capabilities are implemented in each one.
The project…
SerbGO Tool
It is intended to help users determine which microarray tools for gene expression analysis that make use of the GO ontologies are best suited to their projects.
SerbGO: Searching for the best GO Tool (2/4)
http://estbioinfo.stat.ub.es/apli/serbgo
SerbGO: Searching for the best GO Tool (3/4)
Which tools perform what tasks?
Many functionalities are available
Check your options in the form and move forward
SerbGO: Searching for the best GO Tool (4/4)
Comparing GO tools by their capabilities
Tick the tools that you want to compare
Appendix
Example
A Ac
K = 200 genes differentially expressed
N = 10000 genes on microarrays
M = 500 in the GO category A
N - M = 9500 belong to Ac
x = 25 genes are in A
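For these numbers, the count expected under the null hypothesis is the mean of the hypergeometric distribution. The symbol X for the number of selected genes in A is introduced here only for convenience.

```latex
E[X] \;=\; K \cdot \frac{M}{N} \;=\; 200 \cdot \frac{500}{10000} \;=\; 10,
\qquad \frac{x}{E[X]} \;=\; \frac{25}{10} \;=\; 2.5
```

The observed 25 genes in A are therefore 2.5 times the number expected by chance, which is the kind of excess the over-representation test is designed to assess.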