Using Semantic Similarity Measures in the Biomedical Domain for Computing Similarity between Genes based on Gene Ontology

By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Jan 14, 2016

Page 1: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Using Semantic Similarity Measures in the Biomedical Domain for Computing Similarity between Genes based on Gene Ontology

By: Elham Khabiri

Adviser: Dr. Hisham Al-Mubaid

Page 2: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

University of Houston - Clear Lake

Motivation

• Goal:
– Measure functional similarity between genes and proteins

• Reason:
– It is useful to measure the functional difference between genes in different organisms
– Find the genes with unknown functions

[Figure labels: Human, Yeast, Drug Target]

Page 3: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Motivation

• To compute the similarity between two genes g1 and g2, we can use one of the following information sources:
– gene sequence information
– gene functional annotations (GO terms)
– biomedical literature and texts
– gene expression profiles

• In this work, we use gene functional annotations and the Gene Ontology (GO) to measure the similarity between genes.

Page 4: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Motivation

• Given two genes Gp and Gq, where gene Gp is annotated with a set of n different GO terms, we call this set GOp:

$GO_p = \{t^p_1, t^p_2, \ldots, t^p_n\}$

• Similarly, the annotation set for gene Gq is:

$GO_q = \{t^q_1, t^q_2, \ldots, t^q_m\}$

that is, gene Gq is annotated with m different GO terms.

• The terms $t^p_i$ and $t^q_j$ are nodes in the GO.

• If both genes Gp and Gq are annotated with only one term (n = m = 1) and it is the same GO term ($t^p_1 = t^q_1$), then the similarity between them is maximum.

Page 5: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Motivation

• In general, if both genes Gp and Gq are annotated with the same set of GO terms (n = m ≥ 1, that is, $GO_p = GO_q$), then the similarity between them is maximum.

Page 6: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Motivation

• Many data resources in bioinformatics hold data not only in the form of sequences but also as annotations
– Written in scientific natural language
– Suitable for humans but not easy for machines to process

Page 7: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Related Work: Semantic Measures in NLP

• Based on Information Content (IC) of the Least Common Ancestor (LCA):
– Resnik, 1995
– Lin, 1998
– Jiang and Conrath, 1997

• Based on Ontology Structure:
– Wu & Palmer, 1994
– Leacock and Chodorow, 1998

Page 8: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Related Work

• WordNet [Miller 1995]

• Information Content Based Measures
– Resnik, 1995

$sim_{Resnik}(t_1, t_2) = -\log\left(\frac{freq(t)}{N}\right), \quad t = LCA(t_1, t_2)$

freq(t): frequency of concept t in the database.

N: the number of all the concepts in the database.

[Figure: example ontology graph with numbered nodes, highlighting the terms t1, t2 and t]
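The Resnik formula above can be tried on a toy taxonomy. This is only a sketch under stated assumptions: the taxonomy, the counts, and the helper names (`PARENT`, `OWN_COUNT`, `freq`, `lca`) are hypothetical, a concept's frequency is assumed to include the counts of its descendants, and N is taken to be the total count at the root so that freq(t)/N behaves like a probability.

```python
import math

# Hypothetical toy taxonomy: child -> parent (a tree, for simplicity).
PARENT = {"t1": "t", "t2": "t", "t": "root", "u": "root"}
# Hypothetical number of occurrences recorded for each concept itself.
OWN_COUNT = {"root": 0, "t": 2, "t1": 5, "t2": 3, "u": 10}


def ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def freq(term):
    """Assumed convention: own count plus the counts of all descendants."""
    return sum(c for t, c in OWN_COUNT.items() if term in ancestors(t))


N = freq("root")  # assumption: N is the total count seen at the root


def lca(a, b):
    """First ancestor of `a` that is also an ancestor of `b`."""
    b_anc = set(ancestors(b))
    return next(t for t in ancestors(a) if t in b_anc)


def sim_resnik(a, b):
    """Information content of the least common ancestor of a and b."""
    return -math.log(freq(lca(a, b)) / N)
```

In this toy setup, two terms whose only shared ancestor is the root get a similarity of zero, because the root covers every occurrence and so carries no information.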

Page 9: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Related Work

– Jiang and Conrath, 1997

$dist_{JC}(t_1, t_2) = 2\log p(t) - (\log p(t_1) + \log p(t_2)), \quad t = LCA(t_1, t_2)$

– Lin, 1998

$sim_{Lin}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \frac{2 \log P(c)}{\log P(c_1) + \log P(c_2)}$
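As a rough illustration of the two measures on this slide, the sketch below evaluates the Jiang-Conrath distance and the Lin similarity for a single common ancestor. The probabilities are hypothetical, and the maximisation over the set of common ancestors in Lin's measure is left out for brevity.

```python
import math

# Hypothetical probabilities: p(t1), p(t2) and p(t) for their common ancestor t.
# An ancestor is at least as probable as either of its descendants.
P_T1, P_T2, P_LCA = 0.05, 0.10, 0.25


def dist_jc(p_t1, p_t2, p_lca):
    """Jiang-Conrath distance: 2*log p(t) - (log p(t1) + log p(t2))."""
    return 2 * math.log(p_lca) - (math.log(p_t1) + math.log(p_t2))


def sim_lin(p_t1, p_t2, p_lca):
    """Lin similarity for one common ancestor: 2*log P(c) / (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lca) / (math.log(p_t1) + math.log(p_t2))


jc = dist_jc(P_T1, P_T2, P_LCA)
lin = sim_lin(P_T1, P_T2, P_LCA)
```

When both terms coincide with their ancestor, the distance is 0 and the Lin similarity is 1, which is a quick sanity check on the signs.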

Page 10: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Related Work

• Ontology Structure Based Measures:

– Wu & Palmer, 1994
• Based on the depths of the two concepts in the taxonomies, and the depth of the LCS

$sim_{wup}(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{depth(c_1) + depth(c_2)}$

– Leacock and Chodorow, 1998
• Based on the path length PL(t1, t2) of the shortest path between two concepts
• Scale the measure by the overall depth D of the taxonomy

$sim_{lch}(c_1, c_2) = -\log\frac{len(c_1, c_2)}{2D}$
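A minimal sketch of the two structure-based measures on a toy taxonomy follows. The taxonomy and helper names are hypothetical, and two conventions are assumed rather than taken from the slides: depth is counted in nodes (root has depth 1), and the path length is counted in nodes so that a concept is at length 1 from itself, which keeps the logarithm defined. Leacock and Chodorow's measure is taken in its usual form, -log(len / 2D).

```python
import math

# Hypothetical toy taxonomy: child -> parent, rooted at "root".
PARENT = {"a": "root", "b": "root", "c1": "a", "c2": "a", "c3": "b"}


def path_to_root(c):
    """The concept followed by its ancestors up to the root."""
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path


def depth(c):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(path_to_root(c))


def lcs(c1, c2):
    """Lowest concept that subsumes both c1 and c2."""
    up2 = set(path_to_root(c2))
    return next(c for c in path_to_root(c1) if c in up2)


def length(c1, c2):
    """Shortest path counted in nodes, so length(c, c) = 1 (assumed convention)."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


D = max(depth(c) for c in list(PARENT) + ["root"])  # overall depth of the taxonomy


def sim_wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def sim_lch(c1, c2):
    return -math.log(length(c1, c2) / (2 * D))
```

Siblings such as `c1` and `c2` score higher under both measures than `c1` and `c3`, which only meet at the root.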

Page 11: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Related Work: Measures in Biomedical Domain

• First semantic similarity measure in the biomedical domain:
– Rada et al., 1989: path length between biomedical terms in the MeSH ontology

• Measure of semantic similarity in Gene Ontology (GO):
– Lord et al., 2003: applied Resnik's measure to GO
– Validated the correlation between sequence and semantic similarity

Page 12: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Related Work: Recent Works in Biomedical Domain

• Al-Mubaid and Nguyen, 2007
– Investigated the effectiveness of using the Medline corpus as the information source for measuring semantic similarity in the biomedical domain

• Al-Mubaid and Nguyen, 2007
– A technique for computing the semantic similarity between biomedical terms across multiple ontologies within a unified framework like UMLS

• Wang et al., 2007
– Functional similarity measure of GO terms based on contributions of the term's ancestors in GO
– Evaluation: compared it with Resnik's measure
– Found it was closer to human perception

Page 13: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Sequence Similarity

• Sequence similarity
– BLAST [Altschul 1990]: finds regions of local similarity between sequences of genes
– WU-BLAST2

• Output: E-value, bit-score

Page 14: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Drawbacks of Sequence Similarity

• Sequence similarity holds for most genes with the same functionality

• Devos 2000: 30% of the functional similarity found by sequence similarity might be erroneous
– Reason: genes that did not evolve from a common ancestor do not have considerable sequence similarity

• One drawback of the sequence notation is that it is not readable and understandable by humans.

Page 15: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

New approach

• Ontology structure based
– Path Length (PL) between the two terms
– Number of minimum paths between terms
– Depth of LCA of two terms

• Ontology used: Gene Ontology
– A comprehensive resource for gene functional information

• Validation
– Correlation with sequence similarity
– Correlation with two other semantic measures

Page 16: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Gene Ontology

• One of the greatest projects in bioinformatics

• Created in 2000 by the GO Consortium [Ashburner et al.]

• Consists of a set of controlled vocabularies for
– Biological Process
– Molecular Function
– Cellular Component

• Shows the functional and biological terms related to genes in a hierarchical and structured way

Page 17: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid


Gene Ontology

Page 18: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Gene Ontology

• Directed Acyclic Graph

• Each term may have more than one parent

• There may be more than one path between two nodes (terms)

• Every two nodes have at least one LCA (Least Common Ancestor)
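The properties listed above can be illustrated with a small sketch. The DAG below is hypothetical (it is not taken from GO), and depth is assumed to mean the longest distance in edges from the root. The helper `lcas` returns a set because, with multiple parents, two terms can share more than one deepest common ancestor.

```python
# Hypothetical toy DAG: term -> list of parents ("d", "e" and "f" have two parents).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
    "e": ["c", "d"],
    "f": ["a", "b"],
}


def ancestors(term):
    """All ancestors of `term`, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def depth(term):
    """Assumed convention: longest path, in edges, from the root down to `term`."""
    if not PARENTS[term]:
        return 0
    return 1 + max(depth(p) for p in PARENTS[term])


def lcas(t1, t2):
    """Common ancestors of maximal depth; a DAG can yield more than one."""
    common = ancestors(t1) & ancestors(t2)
    best = max(depth(t) for t in common)
    return {t for t in common if depth(t) == best}
```

Because every term reaches the root, the set of common ancestors is never empty, which matches the last bullet above.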

Page 19: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

3 Proposed Measures

1. Plain Path Length (PL)
– Number of edges between the two terms

2. Path Length with Variation (PLm)
– Number of common terms
– Number of minimum paths

3. Path Length with Depth (SimPLD)
– Path length between two terms
– Depth of LCA of the two terms

Page 20: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Plain Path Length

[Figure: example ontology graph with nodes numbered 1 to 15, together with the list of nodes visited during the search, annotated "Parents of 11", "Parents of 12", "Parent of 4", "Parents of 8" and "Parent of 5"]

• Considers the first-level ancestors of each node in the list
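One way to compute a plain path length is a breadth-first search that expands the graph one level at a time. The sketch below is an illustration only: the graph is hypothetical (the edges of the slide's figure are not recoverable from the transcript), and it walks edges in both directions, which is one possible reading of the level-by-level search described above rather than the author's exact procedure.

```python
from collections import deque

# Hypothetical toy ontology: term -> list of parents.
PARENTS = {1: [], 2: [1], 3: [1], 4: [2], 5: [2, 3], 6: [3], 7: [4, 5]}


def neighbours(term):
    """Parents and children of `term`; edges are walked in both directions."""
    children = [c for c, ps in PARENTS.items() if term in ps]
    return PARENTS[term] + children


def path_length(start, goal):
    """Number of edges on a shortest path between two terms (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        t = queue.popleft()
        if t == goal:
            return dist[t]
        for nb in neighbours(t):
            if nb not in dist:  # first visit gives the shortest distance
                dist[nb] = dist[t] + 1
                queue.append(nb)
    raise ValueError("terms are not connected")
```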

Page 21: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

PL between two Genes

• genep is annotated with terms {t1,..., tn}

• geneq is annotated with terms {t1,..., tm}

$PL(gene_p, gene_q) = avg\{\, d_{ij} \mid i = 1..n,\ j = 1..m \,\}$

dij: shortest PL between ti of genep and tj of geneq

[Figure: example gene Facl6, annotated with 3 MF terms]
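The gene-level average over all term pairs can be sketched as follows. The GO identifiers and path lengths in `TERM_PL` are hypothetical placeholders, and identical terms are assumed to be 0 edges apart.

```python
from itertools import product
from statistics import mean

# Hypothetical shortest path lengths d_ij between GO terms (symmetric).
TERM_PL = {
    frozenset(["GO:A", "GO:B"]): 2,
    frozenset(["GO:A", "GO:C"]): 5,
    frozenset(["GO:B", "GO:C"]): 3,
}


def term_pl(ti, tj):
    """Path length between two terms; identical terms are 0 edges apart."""
    return 0 if ti == tj else TERM_PL[frozenset([ti, tj])]


def gene_pl(terms_p, terms_q):
    """Average of d_ij over all n*m pairs (ti from gene p, tj from gene q)."""
    return mean(term_pl(ti, tj) for ti, tj in product(terms_p, terms_q))
```

For example, a gene annotated with `GO:A` and `GO:B` compared against a gene annotated only with `GO:C` averages the two distances 5 and 3.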

Page 22: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

PL Evaluation

• Based on correlation with sequence similarity

• Genomes used:
– SGD (Saccharomyces cerevisiae): 3 datasets
– FlyBase (Drosophila melanogaster): 1 dataset

• Divide datasets based on E-value:
– High Sequence Similarity (HSS): E-value ≤ $10^{-5}$
– Low Sequence Similarity (LSS): $10^{-5}$ < E-value < 1
– No Sequence Similarity (NSS): E-value = 1
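The E-value thresholds above translate directly into a small classifier. The function name is hypothetical, and grouping E-values larger than 1 with NSS is an assumption, since the slide only lists E-value = 1 for that class.

```python
def sequence_similarity_class(e_value):
    """Map a BLAST E-value to HSS / LSS / NSS using the thresholds on the slide."""
    if e_value <= 1e-5:
        return "HSS"
    if e_value < 1:
        return "LSS"
    # The slide lists E-value = 1; larger values are grouped here as an assumption.
    return "NSS"
```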

Page 23: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Evaluation: Compare PL with Sequence Similarity

[Figure: two bar charts, "Dataset1 From SGD" and "DataSet2 From SGD". x-axis: Path Length (PL<=2, 2<PL<=7, PL>7); y-axis: Percentage; series: HSS, LSS, NSS]

Page 24: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Evaluation: Compare PL with Sequence Similarity

[Figure: bar chart "Dataset3 From SGD". x-axis: Path Length (PL<=2, 2<PL<=7, PL>7); y-axis: Percentage; series: HSS, LSS, NSS]

[Figure: bar chart "Distribution of Path Length in FlyBase dataset". x-axis: Path Length (PL<=2, 3<PL<=7, PL>7); y-axis: Percentage; series: HSS, NSS]

• 70% of HSS have PL<=2
• 7% of HSS have PL>7
• 7% of NSS have PL<=2

• 80% of HSS have PL<=2
• 4% of HSS have PL>7
• 17% of NSS have PL<=2

Page 25: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

3 Proposed Measures

1. Plain Path Length (PL)
– Number of edges between the two terms

2. Path Length with Variation (PLm)
– Number of common terms
– Number of minimum paths

3. Path Length with Depth (SimPLD)
– Path length between two terms
– Depth of LCA of the two terms

Page 26: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Path Length with Variation

• More than one LCA

• Two minimum paths:
– "6-10-7-5-1"
– "6-10-11-5-1"

• Terms joined by several minimum paths have more functional similarity than terms with only one minimum path between them

[Figure: example ontology graph with nodes numbered 1 to 12]
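Counting the minimum paths can be done in the same breadth-first pass that finds the shortest distance. The sketch below uses only the edges implied by the two paths listed on the slide (the rest of the figure is not recoverable), and the helper name `min_paths` is hypothetical.

```python
from collections import deque

# Edges implied by the two minimum paths "6-10-7-5-1" and "6-10-11-5-1".
EDGES = [(6, 10), (10, 7), (10, 11), (7, 5), (11, 5), (5, 1)]

ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)


def min_paths(start, goal):
    """Return (length, count) of the shortest paths between two nodes."""
    dist = {start: 0}
    count = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in ADJ[node]:
            if nb not in dist:
                # First time this node is reached: record its distance.
                dist[nb] = dist[node] + 1
                count[nb] = count[node]
                queue.append(nb)
            elif dist[nb] == dist[node] + 1:
                # Another equally short route into an already-seen node.
                count[nb] += count[node]
    return dist[goal], count[goal]
```

Calling `min_paths(6, 1)` on these edges gives a path length of 4 with 2 minimum paths, matching the two paths named on the slide.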

Page 27: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

PL with Variation

$PL_m(go_x, go_y) = \begin{cases} PL(go_x, go_y) & \text{if } nmp = 1 \\ \dfrac{PL(go_x, go_y)}{w_1 \cdot nmp} & \text{otherwise} \end{cases}$

PL(gox, goy) = the minimum path length in the GO graph between the two GO terms gox and goy

nmp = the number of minimum paths between gox and goy

$PL_m(G_p, G_q) = \dfrac{\sum_{i=1}^{n} \sum_{j=1}^{m} PL_m(go^p_i, go^q_j)}{n \times m}$
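A minimal sketch of the two definitions on this slide is given below. It reads the second case as PL divided by the product w1 · nmp and the gene-level measure as an average over all n × m term pairs. The value of `W1` is a hypothetical placeholder, since the slides do not state the weight, and `pl_and_nmp` is a caller-supplied function invented for the example.

```python
from itertools import product

W1 = 1.0  # hypothetical value; the slides do not state the weight w1


def pl_m(pl, nmp, w1=W1):
    """Term-level PLm: plain PL for one minimum path, else PL / (w1 * nmp)."""
    return pl if nmp == 1 else pl / (w1 * nmp)


def gene_pl_m(terms_p, terms_q, pl_and_nmp):
    """Average of term-level PLm over all n*m term pairs.

    `pl_and_nmp(tp, tq)` is a caller-supplied function returning (PL, nmp).
    """
    total = sum(pl_m(*pl_and_nmp(tp, tq)) for tp, tq in product(terms_p, terms_q))
    return total / (len(terms_p) * len(terms_q))
```

With `W1 = 1.0`, a pair of terms at path length 4 joined by 2 minimum paths gets a modified length of 2, so it counts as closer than a pair at the same length with a single path.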

Page 28: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Path Length with Variation

• genep is annotated with terms {t1,..., tn}

• geneq is annotated with terms {t1,..., tm}

[Equation: $PL_m(G_p, G_q)$, built from the double sum $\sum_{i=1}^{n} \sum_{j=1}^{m} PL_m(go^p_i, go^q_j)$ and normalized using n, m and nct; the exact arrangement of the normalizing terms is not recoverable from the transcript]

nct = number of common GO terms between Gp and Gq.

Max go_pl = 15

$Sim(G_p, G_q) = max_{go\_pl} - PL_m(G_p, G_q)$

Page 29: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Validate PLm

• We measured the similarity of gene pairs in SGD pathways

• A pathway is a series of chemical reactions occurring within a cell
– Pathway #5 (allantoin degradation): 4 genes
– Pathway #6 (arginine biosynthesis): 7 genes
– Pathway #141 (tryptophan degradation): 12 genes

• Compare with
– Resnik measure
– Wang et al. measure

Page 30: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Validate PLm: Compare with Resnik

• Pathway 5: allantoin degradation (4 genes, 6 pairs)

gene1   gene2    Res   Ours
DAL1    DAL2     2.4   11
DAL1    DAL3     2.4   11
DAL1    DUR1,2   1.7   9.5
DAL2    DAL3     5.2   13
DAL2    DUR1,2   1.7   9.5
DAL3    DUR1,2   1.7   9.5

• The two measures correlate well with each other
– Maximum under both measures: DAL2, DAL3
– Minimum under both measures: the pairs involving DUR1,2
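The agreement between the two columns of the Pathway 5 table can be checked numerically. The sketch below computes a Pearson correlation over the six gene pairs; the slides do not say which correlation statistic was used, so Pearson is an assumption made for illustration.

```python
import math

# (gene1, gene2, Resnik score, path-length-based score) from the Pathway 5 table.
ROWS = [
    ("DAL1", "DAL2", 2.4, 11),
    ("DAL1", "DAL3", 2.4, 11),
    ("DAL1", "DUR1,2", 1.7, 9.5),
    ("DAL2", "DAL3", 5.2, 13),
    ("DAL2", "DUR1,2", 1.7, 9.5),
    ("DAL3", "DUR1,2", 1.7, 9.5),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


res = [row[2] for row in ROWS]
ours = [row[3] for row in ROWS]
r = pearson(res, ours)
```

With only six pairs and three distinct score levels this is a coarse check, but it is consistent with the slide's remark that the two measures rank the pairs the same way.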

Page 31: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Validate PLm: Compare with Resnik

• Pathway 6: 7 genes, 21 pairs

gene1   gene2    Res    Ours
ARG1    ARG3     0.28   8
ARG1    ARG4     0.28   8
ARG2    ARG3     1.38   7.5
ARG3    ARG5,6   1.01   8.5
ARG4    ARG8     0.28   7
ARG1    ARG8     0.28   8

PL(ARG2, ARG3) > PL(ARG3, ARG5,6)

PL(ARG4, ARG8) > PL(ARG1, ARG8)

Page 32: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Evaluation: Clusters of Genes, Wang et al. vs. Our Method

Page 33: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

3 Proposed Measures

1. Plain Path Length (PL)
– Number of edges between the two terms

2. Path Length with Variation (PLm)
– Number of common terms
– Number of minimum paths

3. Path Length with Depth (SimPLD)
– Path length between two terms
– Depth of LCA of the two terms

Page 34: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

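Similarity between GO terms

• PL(gox, goy) = minimum path length between the two GO terms gox and goy

[Equation: $Sim_{PLD}(go_x, go_y)$, combining a term $\log(depth(lca(go_x, go_y)) / MaxDepth)$ with a term $\log(PL(go_x, go_y) / MaxDepth)$ and a constant 2; the operators joining these terms are not recoverable from the transcript]

[Figure: example ontology graph with nodes numbered 1 to 15]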

Page 35: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

SimPLD between two Genes

• gp is annotated with terms {go1,..., gon}

• gq is annotated with terms {go1,..., gom}

$sim_{PLD}(g_p, g_q) = avg\{\, sim_{PLD}(go_x, go_y) \mid x = 1..n,\ y = 1..m \,\}$

Page 36: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Evaluation: SimPLD

• Correlation between SimPLD and sequence similarity

• Datasets:
– SGD
– FlyBase
– Human-Yeast

• Ontology used:
– Molecular function (MF)

Page 37: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Compare SimPLD with Sequence Similarity

[Figure: three bar charts, "Distribution of sim for Human-Yeast dataset", "Distribution of sim for FlyBase" and "Distribution of sim for SGD". x-axis bins: -2<sim<0 and 0=<sim<2.8; y-axis: Percentage; series: HSS, LSS, NSS (FlyBase shows HSS and NSS only)]

Based on BLAST E-value: High Sequence Similarity (HSS), Low Sequence Similarity (LSS), No Sequence Similarity (NSS)

Page 38: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Conclusion

• Gene Ontology is a reliable source for measuring functional similarity

• Our semantic measures
– Can be used as an automated tool to determine genes with similar functionality
– Agree fairly well with BLAST sequence similarity and with the results of other well-known semantic measures

Page 39: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Resulting Publications

• Khabiri E., Al-Mubaid H. (2007) "A path length method for gene functional similarity using GO annotations." 16th International Conference on Software Engineering and Data Engineering (SEDE 2007), Las Vegas, Nevada, USA, 2007

• Khabiri E. (2007) "A Preliminary Study of Correlation between Depth and Path Length of GO Nodes with Gene Sequence Similarity." IEEE 7th International Conference on BioInformatics and BioEngineering (BIBE07), Boston, Massachusetts, USA, 2007

• Al-Mubaid H., Khabiri E., "A New Path Length Based Measure for Functional Similarity of Genes with Evaluation Using SGD Pathways." Computational Structural Bioinformatics Workshop (CSBW), San Jose, CA (accepted, not finalized)

Page 40: By : Elham Khabiri Adviser : Dr. Hisham Al-Mubaid

Future Work

• Apply path length-based measures to more datasets from different model organisms

• More accurate evaluation
– Biomedical literature
– Microarray data analysis

• Consider the number of distinct paths

• Prediction of functionally unknown genes
