Using Semantic Similarity Measures in the Biomedical Domain for Computing Similarity between Genes based on Gene Ontology
By: Elham Khabiri
Adviser: Dr. Hisham Al-Mubaid
University of Houston - Clear Lake
Motivation
• Goal:
  – Measure functional similarity between genes and proteins
• Reason:
  – It is useful to measure the functional difference between genes in different organisms
  – Find the genes with unknown functions
[Figure: Human, Yeast, Drug Target]
Motivation
• To compute the similarity between two genes g1 and g2, we can use one of the following information sources:
  – gene sequence information
  – gene functional annotations (GO terms)
  – biomedical literature and texts
  – gene expression profiles
• In this work, we use gene functional annotations and the Gene Ontology (GO) to measure the similarity between genes.
Motivation
• Given two genes $G_p$ and $G_q$ such that gene $G_p$ is annotated with a set of n different GO terms, we call it the set $GO_p$:
$GO_p = \{t^p_1, t^p_2, \ldots, t^p_n\}$
• Similarly, the annotation set for gene $G_q$ is:
$GO_q = \{t^q_1, t^q_2, \ldots, t^q_m\}$
that is, gene $G_q$ is annotated with m different GO terms
• The terms $t^p_i$ and $t^q_j$ are nodes in the GO
• If both genes $G_p$ and $G_q$ are annotated with only one term (n = m = 1) and it is the same GO term ($t^p_1 = t^q_1$), then the similarity between them is maximum.
Motivation
• In general, if both genes $G_p$ and $G_q$ are annotated with the same set of GO terms (n = m ≥ 1, so that every $t^p_i$ equals some $t^q_j$), then the similarity between them is maximum.
Motivation
• Many data resources in bioinformatics hold data not only in the form of sequences, but also as annotation
  – Scientific natural language
  – Suitable for humans but not easy for machine processing
Related Work: Semantic Measures in NLP
• Based on Information Content (IC) of the Least Common Ancestor (LCA)
  – Resnik, 1995
  – Lin, 1998
  – Jiang and Conrath, 1997
• Based on Ontology Structure
  – Wu & Palmer, 1994
  – Leacock and Chodorow, 1998
Related Work
• WordNet [Miller 1995]
• Information Content Based Measures
  – Resnik, 1995
$$sim_{Resnik}(t_1, t_2) = -\log\left(\frac{freq(t)}{N}\right), \quad t = LCA(t_1, t_2)$$
freq(t): frequency of concept t in the database.
N: the number of all the concepts in the database.
[Figure: example taxonomy graph with numbered nodes and labeled terms t1, t2, and t]
Related Work
  – Jiang and Conrath, 1997
$$dist_{JC}(t_1, t_2) = 2\log p(t) - \left(\log p(t_1) + \log p(t_2)\right), \quad t = LCA(t_1, t_2)$$
  – Lin, 1998
$$sim_{Lin}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[ \frac{2 \cdot \log P(c)}{\log P(c_1) + \log P(c_2)} \right]$$
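The three information-content measures can be tried on a toy example. The sketch below is only an illustration: the taxonomy, the counts, and all function names are hypothetical, p(t) is assumed to be freq(t)/N with N taken as the total count at the root, and the taxonomy is a tree, so the single LCA is also the common ancestor that maximizes Lin's expression.

```python
import math

# Hypothetical toy taxonomy (a tree): child -> parent.
PARENT = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}
# Hypothetical occurrence counts per concept (not taken from the slides).
COUNT = {"root": 0, "a": 2, "b": 4, "a1": 1, "a2": 1}

def ancestors(t):
    """Return t followed by all of its ancestors, nearest first."""
    chain = []
    while t is not None:
        chain.append(t)
        t = PARENT[t]
    return chain

def freq(t):
    """Count of t plus the counts of all concepts below it."""
    return sum(c for node, c in COUNT.items() if t in ancestors(node))

N = freq("root")  # assumption: N is the total count, so p(root) = 1

def p(t):
    return freq(t) / N

def lca(t1, t2):
    """Nearest ancestor of t1 that is also an ancestor of t2."""
    anc2 = set(ancestors(t2))
    return next(t for t in ancestors(t1) if t in anc2)

def sim_resnik(t1, t2):
    return -math.log(p(lca(t1, t2)))

def dist_jc(t1, t2):
    return 2 * math.log(p(lca(t1, t2))) - (math.log(p(t1)) + math.log(p(t2)))

def sim_lin(t1, t2):
    return 2 * math.log(p(lca(t1, t2))) / (math.log(p(t1)) + math.log(p(t2)))
```

For the siblings `a1` and `a2`, the LCA is `a` with p = 0.5, so the Resnik similarity is log 2 and the Lin similarity is 1/3; two terms whose only common ancestor is the root get a Resnik similarity of 0.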
Related Work
• Ontology Structure Based Measures:
  – Wu & Palmer, 1994
    • Based on the depths of the two concepts in the taxonomies, and the depth of the LCS
$$sim_{wup}(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{depth(c_1) + depth(c_2)}$$
  – Leacock and Chodorow, 1998
    • Based on the path length PL(t1, t2) of the shortest path between two concepts
    • Scale the measure by the overall depth D of the taxonomy
$$sim_{lch}(c_1, c_2) = -\log\left(\frac{len(c_1, c_2)}{2D}\right)$$
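A minimal sketch of the two structure-based measures on a toy tree follows. The taxonomy and function names are hypothetical; depth and path length are counted in nodes (root depth 1, len(c, c) = 1), which is one common convention and an assumption here, chosen so the logarithm stays finite.

```python
import math

# Hypothetical toy taxonomy (a tree): child -> parent.
PARENT = {"root": None, "a": "root", "b": "root",
          "a1": "a", "a2": "a", "b1": "b"}

def path_to_root(c):
    """Return c followed by all of its ancestors, nearest first."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def depth(c):
    """Depth counted in nodes, so the root has depth 1 (assumption)."""
    return len(path_to_root(c))

def lcs(c1, c2):
    """Lowest common subsumer: nearest ancestor shared by c1 and c2."""
    anc2 = set(path_to_root(c2))
    return next(c for c in path_to_root(c1) if c in anc2)

def length(c1, c2):
    """Shortest path between c1 and c2, counted in nodes (assumption)."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1

D = max(depth(c) for c in PARENT)  # overall depth of the taxonomy

def sim_wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

def sim_lch(c1, c2):
    return -math.log(length(c1, c2) / (2 * D))
```

With D = 3, the siblings `a1` and `a2` get sim_wup = 2/3 and sim_lch = log 2, while `a1` and `b1`, which meet only at the root, get sim_wup = 1/3.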
Related Work: Measures in Biomedical Domain
• First semantic similarity measure in the biomedical domain:
  – Rada et al., 1989: path length between biomedical terms in the MeSH ontology
• Measure of semantic similarity in Gene Ontology (GO)
  – Lord et al., 2003: applied Resnik's measure to GO
  – Validated the correlation between sequence and semantic similarity
Related Work: Recent Works in Biomedical Domain
• Al-Mubaid and Nguyen, 2007
  – Investigated the effectiveness of using the Medline corpus as the information source for measuring the semantic similarity in the biomedical domain
• Al-Mubaid and Nguyen, 2007
  – A technique for computing the semantic similarity between biomedical terms across multiple ontologies within a unified framework like UMLS
• Wang et al., 2007
  – Functional similarity measure of GO terms based on contributions of the term's ancestors in GO
  – Evaluation: compared it with Resnik's measure
  – Found it was closer to human perception
Sequence Similarity
• Sequence Similarity
  – BLAST [Altschul 1990]: finds regions of local similarity between sequences of genes
  – WU-BLAST2
• Output: E-value, bit score
Drawbacks of Sequence Similarity
• Sequence similarity holds for most genes with the same functionality
• Devos 2000: 30% of the functional similarity found by sequence similarity might be erroneous
  – Reason: genes that did not evolve from a common ancestor do not have considerable sequence similarity
• One drawback of the sequence notation is that it is not readable or understandable by humans.
New approach
• Ontology structure based
  – Path Length (PL) between the two terms
  – Number of minimum paths between terms
  – Depth of LCA of two terms
• Ontology used: Gene Ontology
  – A comprehensive resource for gene functional information
• Validation
  – Correlation with sequence similarity
  – Correlation with two other semantic measures
Gene Ontology
• One of the greatest projects in bioinformatics
• Created in 2000 by the GO Consortium [Ashburner et al.]
• Consists of a set of controlled vocabularies for
  – Biological Process
  – Molecular Function
  – Cellular Component
• Shows the functional and biological terms related to genes in a hierarchical and structured way
Gene Ontology
• Directed Acyclic Graph
• Each term may have more than one parent
• There may be more than one path between two nodes (terms)
• Every pair of nodes has at least one LCA (Least Common Ancestor)
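The DAG properties listed above can be illustrated with a small sketch. The graph and the function names are hypothetical (this is not a real GO fragment); an LCA is taken here to be any common ancestor that is not itself an ancestor of another common ancestor, which is one reasonable reading of "least common ancestor" in a DAG.

```python
# Hypothetical toy DAG: term -> list of parents (a term may have several).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],   # two parents
    "G": ["A", "B"],   # two parents
    "D": ["A"],
    "E": ["C", "D"],
    "F": ["C"],
}

def ancestors(term):
    """All ancestors of term, including term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def least_common_ancestors(t1, t2):
    """Common ancestors that are not above any other common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return {c for c in common
            if not any(c != d and c in ancestors(d) for d in common)}
```

In this toy graph, `E` and `F` share the single LCA `C`, while `C` and `G` have two LCAs, `A` and `B`, because both terms have the same two parents.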
3 Proposed Measures
1. Plain Path Length (PL)
   – Number of edges between the two terms
2. Path Length with Variation (PLm)
   – Number of common terms
   – Number of minimum paths
3. Path Length with Depth (SimPLD)
   – Path Length between two terms
   – Depth of LCA of the two terms
Plain Path Length
[Figure: example GO graph with nodes 1–15 and a list of visited nodes, annotated "Parents of 11", "Parents of 12", "Parent of 4", "Parents of 8"]
• Considers the first-level ancestors of each node in the list
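One way to read the search described above is a breadth-first walk that repeatedly expands the first-level ancestors (parents) of the nodes already in the list, starting from each of the two terms, and then joins the two walks at a common ancestor. The sketch below follows that reading; the toy graph and the function names are hypothetical, and the deck does not confirm that this is the exact procedure used.

```python
from collections import deque

# Hypothetical toy GO fragment: term -> list of parents.
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "C": ["A", "B"], "D": ["A"], "E": ["C", "D"], "F": ["C"],
}

def up_distances(term):
    """Fewest edges from term up to each of its ancestors (term itself is 0)."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in PARENTS[current]:  # first-level ancestors of current
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist

def path_length(t1, t2):
    """Minimum number of edges between t1 and t2 through a common ancestor."""
    d1, d2 = up_distances(t1), up_distances(t2)
    return min(d1[c] + d2[c] for c in d1.keys() & d2.keys())
```

Here `path_length("E", "F")` is 2 (through `C`), and `path_length("D", "B")` is 3 (through the root). The sketch assumes a single root, so every pair of terms has at least one common ancestor.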
• Divide datasets based on E-value:
  – High Sequence Similarity (HSS): E-value ≤ $10^{-5}$
  – Low Sequence Similarity (LSS): $10^{-5}$ < E-value < 1
  – No Sequence Similarity (NSS): E-value = 1
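The split by E-value, and the kind of per-class tally used in the evaluation, can be sketched as follows. The thresholds and the path-length bins (PL<=2, 2<PL<=7, PL>7) are the ones used in this deck; the function names and the sample pairs are hypothetical, and treating E-values above 1 as NSS is an assumption.

```python
from collections import Counter

def sequence_similarity_class(e_value):
    """Map a BLAST E-value to HSS / LSS / NSS."""
    if e_value <= 1e-5:
        return "HSS"
    if e_value < 1:
        return "LSS"
    return "NSS"  # E-value = 1 per the deck; larger values treated alike (assumption)

def pl_bin(pl):
    """Path-length bin used in the evaluation charts."""
    if pl <= 2:
        return "PL<=2"
    if pl <= 7:
        return "2<PL<=7"
    return "PL>7"

def distribution(pairs):
    """pairs: iterable of (e_value, path_length). Returns {class: {bin: percent}}."""
    counts = {}
    for e_value, pl in pairs:
        cls = sequence_similarity_class(e_value)
        counts.setdefault(cls, Counter())[pl_bin(pl)] += 1
    return {cls: {b: 100 * n / sum(c.values()) for b, n in c.items()}
            for cls, c in counts.items()}

# Hypothetical gene pairs: (BLAST E-value, path length between annotations).
sample = [(1e-20, 1), (1e-8, 2), (1e-6, 9), (1e-7, 0), (1, 8), (1, 1)]
result = distribution(sample)
```

For the sample above, three of the four HSS pairs fall in the PL<=2 bin, so `result["HSS"]["PL<=2"]` is 75.0.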
Evaluation: Compare PL with Sequence Similarity
[Figure: bar charts "Dataset1 from SGD" and "DataSet2 from SGD". X-axis: Path Length (PL<=2, 2<PL<=7, PL>7); y-axis: Percentage; series: HSS, LSS, NSS]
Evaluation: Compare PL with Sequence Similarity
[Figure: bar charts "Dataset3 from SGD" (series HSS, LSS, NSS; bins PL<=2, 2<PL<=7, PL>7) and "Distribution of Path Length in FlyBase dataset" (series HSS, NSS; bins PL<=2, 3<PL<=7, PL>7). X-axis: Path Length; y-axis: Percentage]
70% of HSS have PL<=2
7% of HSS have PL>7
7% of NSS have PL<=2
80% of HSS have PL<=2
4% of HSS have PL>7
17% of NSS have PL<=2
3 Proposed Measures
1. Plain Path Length (PL)
   – Number of edges between the two terms
2. Path Length with Variation (PLm)
   – Number of common terms
   – Number of minimum paths
3. Path Length with Depth (SimPLD)
   – Path Length between two terms
   – Depth of LCA of the two terms
Path Length with Variation
• More than one LCA
• Two minimum paths
  – "6-10-7-5-1"
  – "6-10-11-5-1"
• More functional similarity than pairs that have only one minimum path between them
[Figure: example GO graph with nodes 1–12]
PL with Variation
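$$PL_m(go_x, go_y) = \begin{cases} PL(go_x, go_y) & \text{if } nmp = 1 \\ \dfrac{PL(go_x, go_y)}{w_1 \cdot nmp} & \text{otherwise} \end{cases}$$
$PL(go_x, go_y)$ = the minimum path length in the GO graph between the two GO terms $go_x$ and $go_y$
nmp = number of minimum paths between $go_x$ and $go_y$
$$PL_m(G_p, G_q) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} PL_m(go^p_i, go^q_j)}{m \times n}$$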
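A minimal sketch of the term-level PLm follows. It assumes that minimum paths run up from each term to a common ancestor and counts them with a breadth-first search; the toy graph, the function names, and the default w1 = 1.0 are hypothetical, since the deck does not give a value for w1.

```python
from collections import deque

# Hypothetical toy DAG: term -> list of parents.
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "C": ["A", "B"], "G": ["A", "B"],
    "D": ["A"], "E": ["C", "D"], "F": ["C"],
}

def up_paths(term):
    """For each ancestor: fewest edges from term, and how many such shortest paths."""
    dist, count = {term: 0}, {term: 1}
    queue = deque([term])
    while queue:
        cur = queue.popleft()
        for parent in PARENTS[cur]:
            if parent not in dist:
                dist[parent] = dist[cur] + 1
                count[parent] = 0
                queue.append(parent)
            if dist[parent] == dist[cur] + 1:
                count[parent] += count[cur]
    return dist, count

def pl_and_nmp(t1, t2):
    """Minimum path length and number of minimum paths (nmp) between two terms."""
    d1, n1 = up_paths(t1)
    d2, n2 = up_paths(t2)
    common = d1.keys() & d2.keys()
    pl = min(d1[c] + d2[c] for c in common)
    nmp = sum(n1[c] * n2[c] for c in common if d1[c] + d2[c] == pl)
    return pl, nmp

def pl_m(t1, t2, w1=1.0):
    """PL if there is a single minimum path, otherwise PL / (w1 * nmp)."""
    pl, nmp = pl_and_nmp(t1, t2)
    return pl if nmp == 1 else pl / (w1 * nmp)
```

In the toy graph, `C` and `G` are joined by two minimum paths of length 2 (through `A` and through `B`), so with w1 = 1.0 their PLm drops from 2 to 1.0, while `E` and `F` have a single minimum path and keep PLm = 2.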
Path Length with Variation
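• Gene $G_p$ is annotated with terms $\{t_1, \ldots, t_n\}$
• Gene $G_q$ is annotated with terms $\{t_1, \ldots, t_m\}$
• Max go_pl = 15
[Equation: $PL_m(G_p, G_q)$, built from the double sum $\sum_{i=1}^{n} \sum_{j=1}^{m} PL_m(go^p_i, go^q_j)$ with a normalization involving $m$, $n$, $nct$, and the constants 1 and 2; the exact arrangement is not recoverable from the transcript]
nct = number of common GO terms between $G_p$ and $G_q$.
$$Sim(G_p, G_q) = max_{go\_pl} - PL_m(G_p, G_q)$$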
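The gene-level step can be sketched as below. This is an illustration under stated assumptions: it averages the term-level PLm over all n × m annotation pairs and subtracts the result from Max go_pl = 15; the adjustment for nct (the number of common GO terms) is left out because its exact form is not clear from the transcript. The distance table and all names are hypothetical.

```python
from itertools import product

MAX_GO_PL = 15  # "Max go_pl = 15" as stated in the deck

# Hypothetical term-level PLm values for three GO terms.
TOY_PL_M = {
    ("t1", "t1"): 0, ("t2", "t2"): 0, ("t3", "t3"): 0,
    ("t1", "t2"): 2, ("t1", "t3"): 4, ("t2", "t3"): 6,
}

def toy_pl_m(a, b):
    """Symmetric lookup into the hypothetical distance table."""
    return TOY_PL_M[tuple(sorted((a, b)))]

def gene_pl_m(terms_p, terms_q, term_pl_m):
    """Average term-level PLm over all n * m annotation pairs (nct term omitted)."""
    pairs = list(product(terms_p, terms_q))
    return sum(term_pl_m(a, b) for a, b in pairs) / len(pairs)

def gene_sim(terms_p, terms_q, term_pl_m):
    """Similarity as the maximum path length minus the gene-level PLm."""
    return MAX_GO_PL - gene_pl_m(terms_p, terms_q, term_pl_m)
```

For genes annotated with `["t1", "t2"]` and `["t2", "t3"]`, the four pairwise values 2, 4, 0, and 6 average to 3.0, giving a similarity of 12.0; two genes annotated with the same single term reach the maximum of 15.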
Validate PLm
• We measured the similarity of gene pairs in SGD pathways
• A pathway is a series of chemical reactions occurring within a cell
  – Pathway #5 (allantoin degradation): 4 genes
  – Pathway #6 (arginine biosynthesis): 7 genes
  – Pathway #141 (tryptophan degradation): 12 genes
• Compare with
  – Resnik measure
  – Wang et al. measure
• Correlation between SimPLD and sequence similarity
• Dataset:
  – SGD
  – FlyBase
  – Human-Yeast
• Ontology used:
  – Molecular function (MF)
Compare SimPLD with Sequence Similarity
[Figure: bar charts "Distribution of sim for Human-Yeast dataset" (series HSS, LSS, NSS), "Distribution of sim for FlyBase" (series HSS, NSS), and "Distribution of sim for SGD" (series HSS, LSS, NSS). X-axis: similarity bins -2<sim<0 and 0<=sim<2.8; y-axis: Percentage]
Based on BLAST E-value: HSS = High Sequence Similarity, LSS = Low Sequence Similarity, NSS = No Sequence Similarity
Conclusion
• Gene Ontology is a reliable source for measuring functional similarity
• Our semantic measures
  – Can be used as an automated tool to determine the genes with similar functionalities
  – Agree fairly well with BLAST sequence similarity and with the results of other well-known semantic measures
Resulting Publications
• Khabiri E., Al-Mubaid H. (2007) "A path length method for gene functional similarity using GO annotations." 16th International Conference on Software Engineering and Data Engineering (SEDE 2007), Las Vegas, Nevada, USA, 2007
• Khabiri E. (2007) "A Preliminary study of Correlation between depth and Path Length of GO nodes with Gene Sequence Similarity." IEEE 7th International Conference on Bioinformatics and Bioengineering (BIBE07), Boston, Massachusetts, USA, 2007
• Al-Mubaid H., Khabiri E., "A New Path Length Based Measure for Functional Similarity of Genes with Evaluation Using SGD Pathways." Computational Structural Bioinformatics Workshop (CSBW), San Jose, CA (Accepted, not finalized)
Future Work
• Apply path length-based measures to more datasets from different model organisms
• More accurate evaluation
  – Biomedical literature
  – Microarray data analysis