Semantic Relation Extraction for Linking Named Entities to Biomedical Databases. Presenter: Lê Hoàng Quỳnh, Knowledge of Technology Laboratory, UET, VNU. Hanoi, February 18th, 2012. [email protected]

Dec 25, 2015

Page 1

Semantic Relation Extraction for Linking Named Entities to Biomedical Databases

Presenter: Lê Hoàng Quỳnh
Knowledge of Technology Laboratory, UET, VNU

Hanoi, February 18th, 2012
[email protected]

Page 2

Main contents

• Motivation and purpose
• Some approaches: the pros and cons
• Discussion and Proposal
• Conclusion

Page 3

Motivation and purpose

“… developing a state of the art named entities tagger for full open source biomedical texts …”
• Deploying various named entity recognizers to see which works best
• Linking each named entity to its appropriate identifier in public databases

Page 4

Motivation and purpose (cont’)

Which named entities do we focus on?
• Phenotype descriptions
• Disease names
• Gene names
• Chemical names

Page 5

Motivation and purpose (cont’)

Ontology =
• Concept/Class
• Term/Individual
• Relation/Property

Page 6

Motivation and purpose (cont’)

The Biocaster Multilingual Ontology (biocaster.org)

Page 7

Motivation and purpose (cont’)

• How can named entities be linked to unique identifiers in a biomedical database?
• What is the difference between “linking” and “filling”?
• Method?
  o Clustering
  o Semantic relation extraction [LTB11]
  o …

[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.

Page 8

Motivation and purpose (cont’)

Semantic relation extraction
• Relation extraction is the task of extracting the underlying relation between two terms, as expressed by words or phrases [Gir08]
• Because biomedical relations follow their own patterns, techniques designed for extracting relations from general text may not be suitable for the biomedical domain

[Gir08] R. Girju. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.

Page 9

Motivation and purpose (cont’)

Which kinds of semantic relation do we focus on?
• Hyponymy
• Synonymy
• Cause/effect
• Indicate/hasSymptom
• Treat
• …

Entity:
• Phenotype descriptions
• Disease names
• Gene names
• Chemical names


Page 11

Some approaches

Three groups of existing methods:
• Pattern-based extraction relies on the occurrence of term pairs in the same context and uses the words in that context to identify the relation
• Distributional clustering uses the contexts in which each term occurs individually and attempts to group semantically related terms based on the similarity of these contexts
• Term variation is based on the form of the term and uses similarities between terms to identify which are semantically related

Page 12

Some approaches (cont’)

Distributional clustering:
• Consider the contexts that a term tends to occur in, then apply clustering to work out which terms are most “similar”
• With this methodology, classes of words that are similar in meaning can be found

For example, using the verb “fire” yields the following classes of nouns:
  o Gun, Missile, Weapon
  o Shot, Bullet, Rocket, Missile
  o Officer, Aide, Chief, Manager
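The idea can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: the (verb, noun) observations are invented, each noun is represented by counts of the verbs it occurs with, and cosine similarity stands in for a full clustering algorithm.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy (verb, object-noun) observations, invented for illustration only.
observations = [
    ("fire", "gun"), ("load", "gun"), ("aim", "gun"),
    ("fire", "missile"), ("load", "missile"), ("aim", "missile"),
    ("fire", "officer"), ("hire", "officer"), ("promote", "officer"),
    ("fire", "manager"), ("hire", "manager"), ("promote", "manager"),
]

def context_vectors(pairs):
    """Map each noun to a Counter of the verbs (contexts) it occurs with."""
    vectors = defaultdict(Counter)
    for verb, noun in pairs:
        vectors[noun][verb] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def most_similar(noun, vectors):
    """Return the other noun whose context vector is closest to the given one."""
    others = (n for n in vectors if n != noun)
    return max(others, key=lambda n: cosine(vectors[noun], vectors[n]))

vectors = context_vectors(observations)
print(most_similar("gun", vectors), most_similar("officer", vectors))
```

All four nouns share the context “fire”, yet the remaining contexts separate the weapon sense from the employment sense. Note that the nouns never need to co-occur with each other.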

Page 13

Some approaches (cont’)

Distributional clustering:
• Pros:
  o Does not require that the terms occur in the same sentence or even in the same document
  o Generally has a higher recall than pattern-based methods
• Cons:
  o Requires a mathematical approach to determine the clusters of terms that have a similar distribution of contexts
  o It is very difficult to work out the nature of the relationship between the terms from distributional clustering alone
  o Not suitable for extracting specific relationships such as “X is a causal agent of Y”

Page 14

Some approaches (cont’)

Term variation:
• Look at the form of the actual term and use the similarity of the words in it to deduce whether the terms are related. For example: “cancer of the mouth” and “mouth cancer”
• Jacquemin [Jac99] defines three main ways that term variation occurs:
  o Syntactic variations
  o Morpho-syntactic variations
  o Semantic variations

[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
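A minimal sketch of the simplest case, a syntactic variation in which the same content words are reordered around function words. The stopword list and both function names are assumptions made for this illustration; morpho-syntactic and semantic variations would need stemming and a thesaurus, which are not shown.

```python
# Small hand-picked list of function words (an assumption for this sketch).
STOPWORDS = {"of", "the", "a", "an", "in"}

def content_words(term):
    """Lowercase the term and keep only its content words, ignoring order."""
    return frozenset(w for w in term.lower().split() if w not in STOPWORDS)

def is_syntactic_variant(a, b):
    """Two terms are treated as variants when they share the same content words."""
    return content_words(a) == content_words(b)

print(is_syntactic_variant("cancer of the mouth", "mouth cancer"))
print(is_syntactic_variant("mouth cancer", "lung cancer"))
```

The check is deliberately strict, which is in line with the high precision and low recall usually attributed to term variation.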

Page 15

Some approaches (cont’)

Term variation:
• Pros:
  o Often has very high precision
  o Strongest for finding whether two terms are synonymous
  o Can prove useful for some other cases as well
• Cons:
  o Cannot help to identify relationships between terms with no similarity

Page 16

Some approaches (cont’)

Pattern-based extraction involves finding the terms in the same sentence and in some “pattern” that is suggestive of a particular relation.
• Hearst [Hea92] used patterns to extract terms that exhibit the hyponymy relation
• Her approach relied on the observation that such terms often occur near each other in stereotypical patterns

Example: “Some kinds of flu, such as bird flu are …”
Pattern: noun phrase - “such as” - noun phrase
Result: hyponym(“bird flu”, “flu”)
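The “such as” pattern can be sketched with a regular expression. This sketch assumes an upstream chunker has already wrapped noun phrases in square brackets; the bracket convention and the function name are invented for the illustration and are not part of [Hea92].

```python
import re

# Assumes noun phrases were already marked as [ ... ] by a chunker (hypothetical convention).
SUCH_AS = re.compile(r"\[([^\]]+)\],? such as \[([^\]]+)\]")

def extract_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs matched by: noun phrase - "such as" - noun phrase."""
    return [(hypo, hyper) for hyper, hypo in SUCH_AS.findall(chunked_sentence)]

pairs = extract_hyponyms("Some kinds of [flu], such as [bird flu] are …")
print(pairs)
```

Leaving noun-phrase detection to a separate step keeps the pattern itself trivial, at the cost of depending on the quality of that step.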

• Method for developing these patterns:
  o Decide on a lexical relationship
  o Collect a set of term pairs known to have this relationship, and a corpus that contains these pairs
  o Find the places where these terms co-occur
  o Find commonalities and hypothesize a pattern
  o Use this pattern to find more term pairs and repeat the process

[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.
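One round of this loop can be sketched as follows. The corpus, the seed pairs, and the simplification that every term is a single word are all assumptions made for the illustration; “hypothesizing a pattern” is reduced here to taking the most frequent text found between the two terms.

```python
import re
from collections import Counter

# Toy corpus and seed pairs, invented for illustration; terms are single words.
corpus = [
    "viruses such as influenza spread quickly",
    "bacteria such as salmonella cause illness",
    "drugs such as aspirin reduce pain",
    "aspirin is cheap",
]
seeds = {("influenza", "viruses"), ("salmonella", "bacteria")}  # (hyponym, hypernym)

def induce_patterns(seed_pairs, sentences):
    """Count the text found between hypernym and hyponym wherever a seed pair co-occurs."""
    counts = Counter()
    for hypo, hyper in seed_pairs:
        for sentence in sentences:
            m = re.search(rf"\b{re.escape(hyper)}\b(.+?)\b{re.escape(hypo)}\b", sentence)
            if m:
                counts[m.group(1)] += 1
    return counts

def apply_pattern(middle, sentences):
    """Find every (hyponym, hypernym) pair joined by the given connecting text."""
    rx = re.compile(rf"(\w+){re.escape(middle)}(\w+)")
    return {(m.group(2), m.group(1)) for s in sentences for m in rx.finditer(s)}

best_pattern = induce_patterns(seeds, corpus).most_common(1)[0][0]
new_pairs = apply_pattern(best_pattern, corpus) - seeds
print(repr(best_pattern), new_pairs)
```

In a real system the new pairs would be added to the seed set and the loop repeated, ideally with a filter that keeps weak patterns from introducing noise.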

Page 17

Some approaches (cont’)

Pattern-based extraction
• Pros:
  o Simple
  o Patterns can be specialised for different relationships
  o Can be used for various languages
• Cons:
  o The method is manual
  o There is no way to provide a strong comparison between the effectiveness of the different patterns, which may lead to the inclusion of a relatively “weak” pattern
  o It is not clear how to automatically generate patterns that are specific to a given relationship and domain
  o Because patterns rely on finding the two terms in the same context, recall is limited, and ambiguity in the text can cause errors in the extractions
  o Identifying the boundaries of the terms is a problem

Page 18

Some approaches (cont’)

McCrae’s approach [Mcc09] for synonymy and hyponymy relations
• Starts with the most general pattern, that is, the pattern consisting only of wildcards
• Develops more specific patterns by replacing wildcards with terms from a corpus (full text, chap. 3.1)

[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctor of Philosophy thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University of Advanced Studies, September 2009.

Page 19

Some approaches (cont’)

McCrae’s approach:
• Problem of identifying term boundaries

entity = (NN|JJ|NNS|NNP|FW|NNPS|JJR)* (NN|NNS|NNP|NNPS)

NN: A singular noun
NNS: A plural noun
NNP: A proper noun
NNPS: A pluralised proper noun
JJ: An adjective
FW: A foreign word
JJR: An adjective in comparative form
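A minimal sketch of how the entity pattern can be applied to a POS-tagged sentence. The tags in the example are assigned by hand for illustration; in practice they would come from a tagger. The function collects each maximal run of permitted tags and trims it so that it ends in a noun tag, which gives the same spans as a greedy match of the pattern.

```python
MODIFIER_TAGS = {"NN", "JJ", "NNS", "NNP", "FW", "NNPS", "JJR"}
HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def find_entities(tagged):
    """tagged: list of (token, tag) pairs; returns the token spans matching the entity pattern."""
    entities, run = [], []
    for token, tag in tagged + [("", "END")]:  # sentinel flushes the final run
        if tag in MODIFIER_TAGS:
            run.append((token, tag))
            continue
        while run and run[-1][1] not in HEAD_TAGS:  # a span must end in a noun tag
            run.pop()
        if run:
            entities.append(" ".join(tok for tok, _ in run))
        run = []
    return entities

# Hand-tagged example sentence (tags are an assumption, not tagger output).
tagged = [("acute", "JJ"), ("headache", "NN"), ("is", "VBZ"), ("a", "DT"),
          ("common", "JJ"), ("symptom", "NN"), ("of", "IN"),
          ("dengue", "NN"), ("fever", "NN")]
print(find_entities(tagged))
```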

Page 20

Some approaches (cont’)

McCrae’s approach:
• If every possible variation of the patterns is covered, the search space is far too large to be tractable

It is necessary to find a way to cover this search space more efficiently:
  o prioritizing “better” patterns
  o skipping patterns that are too similar to existing patterns

Page 21

Some approaches (cont’)

McCrae’s approach:
• Rule definition:
  Pattern: *1 * such as *2
  Rule: :- name() words(1,1) "such" "as" name()

• Simplifying the rules
  o Match-set (chap. 3.2.1 in full text)
    :- words(1,2) name() words(0,1) words(2,3) "literal" name()
    Simplified form: :- words(1,1) name() words(2,4) "literal" name()
  o Join-set and alignment (chap. 3.2.2 in full text)
    :- "a" name() "b" "c" "d" name()
    :- words(,1) name() words(2,3) "c" name()
    Alignment on these rules: {(2; 2); (4; 4); (6; 5)}
    The alignment-to-join conversion:
    :- words(,1) name() words(2,3) "c" words(0,1) name() words(0,0)
    Simplified form: :- name() words(2,3) "c" words(0,1) name()

• Classification
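A rough sketch of how a rule such as :- name() words(1,1) "such" "as" name() might be matched against a tokenized sentence. This is one reading of the notation, not code from [Mcc09]: name() is assumed to capture a single pre-chunked term token, words(i,j) to skip between i and j arbitrary tokens, and a quoted literal to match one token exactly. The tuple encoding of rules and both function names are invented for this illustration.

```python
import re

def compile_rule(elements):
    """Translate a list of rule elements into a regex over space-terminated tokens."""
    parts = []
    for el in elements:
        if el[0] == "name":            # ("name",) captures one term token
            parts.append(r"(\S+) ")
        elif el[0] == "words":         # ("words", lo, hi) skips lo..hi arbitrary tokens
            _, lo, hi = el
            parts.append(r"(?:\S+ ){%d,%d}" % (lo, hi))
        else:                          # ("lit", word) matches one token exactly
            parts.append(re.escape(el[1]) + " ")
    return re.compile(r"(?<!\S)" + "".join(parts))

def match_rule(rule, tokens):
    """Return the captured terms of the first match, or None if the rule does not apply."""
    m = rule.search(" ".join(tokens) + " ")
    return m.groups() if m else None

# :- name() words(1,1) "such" "as" name()
such_as = compile_rule([("name",), ("words", 1, 1), ("lit", "such"), ("lit", "as"), ("name",)])

# Multi-word terms are assumed to be pre-joined with "_" by an earlier chunking step.
tokens = ["kinds", "of", "flu", ",", "such", "as", "bird_flu", "are", "dangerous"]
print(match_rule(such_as, tokens))
```

Under this reading, words(1,1) is what absorbs the comma in “flu, such as bird flu”; the same rule does not match when no token sits between the first term and “such”.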

Page 22

Some approaches (cont’)

McCrae’s approach: Results
[results figure not preserved in the transcript]

Page 23

Some approaches (cont’)

McCrae’s approach: Results
[results figure not preserved in the transcript]

Page 24

Some approaches (cont’)

Approach utilizing the Web [SNR08] [TNN10]
• RDF describes a Semantic Web using RDF statements, which are triples of the form <Subject, Property, Object>
• Query search engines with lexico-syntactic patterns to retrieve relevant information
• The “seed” patterns are initially handcrafted but can be progressively learnt
• Extract relations from snippets

[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3, March 2008.
[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In IALP 2010, pages 170-173, Harbin, Heilongjiang, China, December 28-30, 2010.
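The steps above can be sketched without any network access. The seed patterns, the property name causedBy, and the snippets are assumptions made for this illustration and are not taken from [SNR08]; a real system would send the generated queries to a search engine and parse the returned snippets.

```python
import re
from typing import List, NamedTuple

class Triple(NamedTuple):
    """An RDF-style statement <Subject, Property, Object>."""
    subject: str
    prop: str
    obj: str

# Hypothetical handcrafted seed patterns, quoted so a search engine treats them as phrases.
SEED_PATTERNS = ['"{disease} is caused by"', '"causes of {disease} include"']

def build_queries(disease: str) -> List[str]:
    """Instantiate every seed pattern for the given disease name."""
    return [pattern.format(disease=disease) for pattern in SEED_PATTERNS]

def extract_causes(disease: str, snippets: List[str]) -> List[Triple]:
    """Turn '<disease> is caused by <X>' snippets into causedBy triples."""
    rx = re.compile(rf"{re.escape(disease)} is caused by ([^.,;]+)", re.IGNORECASE)
    return [Triple(disease, "causedBy", m.group(1).strip())
            for s in snippets for m in rx.finditer(s)]

# Stand-ins for search-engine snippets.
snippets = ["Malaria is caused by Plasmodium parasites.",
            "Cholera is caused by Vibrio cholerae."]
triples = extract_causes("malaria", snippets)
print(build_queries("malaria"))
print(triples)
```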

Page 25

Some approaches (cont’)

• [SNR08] focuses on discovering causal relationships between a disease and a biological entity
• Application: augmenting ontologies
• Purpose: given a disease, discover the likely causes of this disease

Page 26

Some approaches (cont’)

Approaches summary and evaluation

Method                    | Precision | Recall  | Applicability
Patterns                  | OK        | Limited | Produces specific results for any relationship
Distributional clustering | OK        | Good    | Only produces a concept of “semantic relatedness”
Term variation            | Good      | Poor    | Strongest for synonymy, some use elsewhere
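For reference, precision and recall for relation extraction can be computed over sets of extracted pairs. This is a generic sketch with invented example pairs; it is not the evaluation procedure behind the qualitative ratings in the table.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted, recall = correct / gold, over sets of relation pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented example: four gold synonym pairs, two extracted pairs, one of them correct.
gold = {("mouth cancer", "oral cancer"), ("flu", "influenza"),
        ("heart attack", "myocardial infarction"), ("high blood pressure", "hypertension")}
extracted = {("flu", "influenza"), ("flu", "cold")}
print(precision_recall(extracted, gold))
```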

Page 27

Some approaches (cont’)

What about using machine learning?
• Using CRF [BDS08]:
  o Extracts both the existence of a relation and its type
  o Uses two types of CRF
• Using kernel-based learning [LZL08]:
  o Relation detection: a binary classification of true and false relations
  o Relation classification: a 4-class classification of the four relation types

[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207, doi:10.1186/1471-2105-9-207.
[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756-769, 2008.
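The detection-then-classification split can be sketched as a two-stage cascade. The keyword rules below are toy stand-ins for trained models, and the cue words and labels are invented for the illustration; they are not the four relation types of [LZL08].

```python
from typing import Callable, Optional

def make_cascade(detect: Callable[[str], bool],
                 classify: Callable[[str], str]) -> Callable[[str], Optional[str]]:
    """Stage 1 decides whether any relation holds; stage 2 names its type."""
    def predict(sentence: str) -> Optional[str]:
        if not detect(sentence):
            return None              # no relation: the type classifier is never consulted
        return classify(sentence)
    return predict

# Toy stand-ins for the two trained classifiers (hypothetical cue words and labels).
CUES = {"treats": "treat", "causes": "cause"}

def toy_detect(sentence: str) -> bool:
    return any(cue in sentence.split() for cue in CUES)

def toy_classify(sentence: str) -> str:
    return next(label for cue, label in CUES.items() if cue in sentence.split())

predict = make_cascade(toy_detect, toy_classify)
print(predict("aspirin treats headache"), predict("aspirin and headache"))
```

Separating the stages lets each classifier be trained and evaluated on its own, at the cost of detection errors propagating to the second stage.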

Page 28

Discussion and Proposal

Challenges
• Language complexity
• Requires good pre-processing (POS tagging, chunking, NER, etc.)
• …
• Techniques designed for extracting relations from general text may not be suitable for the biomedical domain
• Lack of tools and data
• …
• It is unlikely that the extracted relations will match the structure of the ontology

Page 29

Discussion and Proposal

Challenges
• Modifiers: the inclusion of an adjective modifier in a term. For example: “acute headache” & “headache”; “mental retardation”
• Granularity: terms are nearly always used synonymously but have slight differences in meaning. For example: “HIV-1” is the most common strain of “HIV”, but “HIV-2” is less easily transmitted and mostly confined to a small area of West Africa
• Property: two terms refer to the same thing but with a slightly different property. For example: “dengue shock syndrome” is a late-stage development of “dengue fever”

Page 30

Discussion and Proposal (cont’)

Compromises
- Identify the type of relationship, or only whether one exists
- Binary classification or multi-label classification
- 1 or 2 classifiers
- Pattern-based extraction, distributional clustering or term variation
- Using machine learning or not
- …

Page 31

Discussion and Proposal (cont’)

Proposal
• Only deal with intra-sentence relations !!!
• 2 classifiers
• Pattern-based extraction and term variation
• Semi-supervised learning
• There is still no strong definition or training resource for phenotypes and diseases; this needs work using available resources such as the Human Phenotype Ontology and the CALBC data set from the EBI shared task 2011

Page 32

Discussion and Proposal (cont’)

What about the model?
[model figure not preserved in the transcript]

Page 33

Conclusion & Future Works

• Purpose: hyponymy, synonymy and causal relation extraction for phenotype descriptions, disease names, gene names and chemical names
• Improve the method (using semantic patterns & term variation, bootstrapping techniques, etc.)
• Explore data and ontology
• Review “linking to ontology”
• Propose a model
• Try to use other available resources

Page 34

References

[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.

[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In IALP 2010, pages 170-173, Harbin, Heilongjiang, China, December 28-30, 2010.

[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctor of Philosophy thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University of Advanced Studies, September 2009.

[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207, doi:10.1186/1471-2105-9-207.

[Gir08] R. Girju. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.

[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756-769, 2008.

[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3, March 2008.

[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.

[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.

[Bio] http://biocaster.org

Page 35

Thank you for your attention!