Natural Language Processing in the biomedical domain SBI Course WS 2005/2006 Thomas Karopka 19.01.2006
Dec 22, 2015

Page 1: Natural Language Processing in the biomedical domain SBI Course WS 2005/2006 Thomas Karopka 19.01.2006.

Natural Language Processing in the Biomedical Domain

SBI Course WS 2005/2006

Thomas Karopka

19.01.2006

Page 2: Outline

• Motivation
• Introduction to Natural Language Processing
• Named Entity Recognition (NER)
• Information Extraction (IE)
• GATE: General Architecture for Text Engineering
• Some tools, some applications
• (Short introduction to GATE)

Page 3: Motivation

• Huge amount of biomedical knowledge
• MEDLINE currently contains over 16 million biomedical abstracts
• 50,000 new abstracts per month
• Problem: unstructured text is difficult to analyze automatically
• Reading 40,000 abstracts at 5 min each takes approx. 400 days (8 h a day)
• Solution: NLP and Information Extraction
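The reading-time estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative and not part of the original slides.

```python
# Figures taken from the slide: 40,000 abstracts, 5 minutes each, 8 working hours a day.
abstracts = 40_000
minutes_per_abstract = 5
hours_per_day = 8

total_hours = abstracts * minutes_per_abstract / 60
working_days = total_hours / hours_per_day

print(f"{total_hours:.0f} hours, about {working_days:.0f} working days")
```

The result is slightly above the rounded "approx. 400 days" quoted on the slide.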

Page 4: What is NLP?

Definition 1: Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems inherent in the processing and manipulation of natural language, but not, generally, natural language understanding.

Definition 2: A study of how to use computers to do things with human languages.

Synonyms: Language Engineering, Human Language Technology

Page 5: Publications in MEDLINE

[Figure: publications in MEDLINE by year, in millions (axis from 0 to 14). Series: publications per year and cumulative number.]

Page 6: Main fields of NLP

• Text to speech
• Speech recognition
• Natural language generation
• Machine translation
• Question answering
• Information retrieval
• Information extraction
• Named entity recognition
• Text classification
• Translation technology
• Text summarization

Page 7: Why is NLP so hard?

• Ambiguity
• Context
• Acronyms
• Semantics

Page 8: Ambiguity

"Time flies like an arrow; fruit flies like a banana." (Groucho Marx)

Page 9: Global vs. local ambiguity

Local ambiguity means that part of a sentence can have more than one interpretation, but not the whole sentence.

Global ambiguity means that the whole sentence can have more than one interpretation.

Page 10: Global vs. local ambiguity (cont.)

Local ambiguity:
• "The old train..."
• "...the young."
• "...left the station."

Here syntax can tell us that "train" must be a verb in the first continuation ("The old train the young.").

Global ambiguity:
• "I saw the Grand Canyon flying to New York"
• "I saw a Boeing 747 flying to New York"

Here we know the meaning of the two sentences because we know what can and cannot fly.

Page 11: Types of ambiguity

Categorical ambiguity:
• Noun: "Time is money"
• Verb: "Time me on the last lap"
• Adjective: "Time travel is not likely in my lifetime"

Word sense ambiguity:
• Electrical: "The battery was charged with jump leads"
• Legal: "Thief was charged by PC Smith"
• Responsibility: "The lecturer was charged with student recruitment"

Page 12: Types of ambiguity (cont.)

Structural ambiguity:
• "You can have peas and beans or carrots with the set meal"

Referential ambiguity:
• What can THEY refer to in "After THEY finished the exam the students and lecturers left"? Lecturers only? Students only? Both?

Page 13: Problems in NLP

Polysemy: one word carrying different meanings in different contexts (Glück 1993, 474).
• beam ("ray of light" and "timber beam")

Synonymy: the semantic relation that holds between two words that can (in a given context) express the same meaning.
• ship / vessel
• buy / purchase

Semantics: the meaning of a word, phrase, clause, or sentence, as opposed to its syntactic construction.
• "Baby swallows fly"

Page 14: Basic NLP tasks

• Tokenization: split text into units called tokens (words and punctuation such as "." "," "-")
• Sentence splitting: detect sentence boundaries
• Part-of-speech (POS) tagging: assign parts of speech (verb, noun, adjective, ...)
• Parsing: work out parse trees
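The first two tasks can be illustrated with a minimal regular-expression sketch. The rules below are deliberately naive assumptions (a sentence ends at ".", "!" or "?" followed by whitespace and a capital letter; hyphenated names such as "IL-2" stay one token) and are not the tokenizer or splitter used in the course; the function names are hypothetical.

```python
import re

def split_sentences(text):
    # Break after ., ! or ? when followed by whitespace and an uppercase letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence):
    # Alphanumeric runs joined by internal hyphens form one token (e.g. "IL-2");
    # every other non-space character becomes a token of its own.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

text = "IL-2 activates T cells. The effect is dose-dependent."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Rules this simple break on abbreviations such as "e.g." and on names that contain periods, which is one reason biomedical text usually needs domain-specific tokenizers.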

Page 15: Basic NLP tasks (cont.)

• Verb phrase chunking: find verb phrases
• Noun phrase chunking: find noun phrases
• Acronym resolution: find long forms for acronyms
• Coreference resolution: link expressions that refer to the same entity (e.g. "New York" ... "the Big Apple")

Page 16: Basic NLP tasks (cont.)

• Named entity recognition: find named entities
• ...

Page 17: What is NER?

NER (Named Entity Recognition) includes two tasks:
• Identification of proper names in text
• Classification of proper names in text

Entity classes by domain:
• Newswire domain: Person, Location, Organization
• Biomedical domain: Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.

Page 18: NER in the biomedical domain

BioNER aims to recognize the following names:
• First priority: protein name, DNA name, RNA name
• Second priority: cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic

Page 19: Example of NER (biomedical)

[Figure: annotated text example with the entity labels "Protein/gene" and "Cell type".]

Page 20: Problems in BioNER

• Unknown words
• Long compound words
• Variations of expressions
• Nested NEs

Page 21: Unknown words

Words containing hyphens, digits, letters, Greek letters, or Roman numerals:
• Alpha B1
• Adenylyl cyclase 76E
• Latent membrane protein 1
• 4’-mycarosyl isovaleryl-CoA transferase
• oligodeoxyribonucleotide
• 18-deoxyaldosterone

Abbreviations and acronyms:
• IL, TECd, IFN, TPA

Page 22: Long compound words

• interleukin 1 (IL-1)-responsive kinase
• interleukin 1-responsive kinase
• epidermal growth factor receptor
• SH2 domain containing tyrosine kinase
• Syk SH2 domain (GENIA example)

Page 23: Various expressions of the same NE

• Spelling variation: N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
• Word permutation: beta-1 integrin, integrin beta-1
• Ambiguous expressions: epidermal growth factor receptor, EGF receptor, EGFR; c-jun, c-Jun, c jun
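One common way to cope with such surface variation is to map each name to a normalized key before comparing. The sketch below is an assumption about how this could be done, not a method described in the slides; `normalize_name` and `permutation_key` are hypothetical helpers.

```python
import re

def normalize_name(name):
    # Collapse spelling variants: drop hyphens and whitespace, ignore case.
    return re.sub(r"[-\s]+", "", name).lower()

def permutation_key(name):
    # Collapse word-order variants: compare the sorted, lowercased words.
    return tuple(sorted(name.lower().split()))

spelling = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
print({normalize_name(n) for n in spelling})
print(permutation_key("beta-1 integrin") == permutation_key("integrin beta-1"))
```

Neither key links a long form to its abbreviation (epidermal growth factor receptor vs. EGFR); that requires acronym resolution or a synonym dictionary.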

Page 24: Various expressions: the name explains its function

• the Ras guanine nucleotide exchange factor Sos
• the Ras guanine nucleotide releasing protein Sos
• the Ras exchanger Sos
• the GDP-GTP exchange factor Sos
• Sos (mSos), a GDP/GTP exchange protein for Ras

Page 25: Various expressions: the name includes a preposition and/or conjunction (ambiguity of dependencies)

• p85 alpha subunit of PI 3-kinase
• SH2 and SH3 domains of Src
• NF-AT1, AP-1, and NF-kB sites
• E2F1 and -3
• Residues 432, 435, 437, 438, and 440

Page 26: Nested named entities

An NE embedded in another NE:
• IL-2: protein
• IL-2 gene: gene
• CBP/p300 associated factor: protein
• CBP/p300 associated factor binding promoter: DNA

Page 27: Gene naming conventions

"Biologists would rather share their toothbrush than share a gene name" (Michael Ashburner [1])

[1] Pearson H. Biology's name game. Nature. 2001;411:631–632.

Page 28: Protein/gene name recognition

For comic relief, don't miss the "worst gene names" page: http://tinman.vetmed.helsinki.fi/eng/drosophila.html

My favourite ones:
• drop dead (FBgn0000494)
• lost in space (FBgn0016996)
• ken and barbie (FBgn0011236)

Source: FlyBase http://flybase.bio.indiana.edu/

Page 31: State-of-the-art systems for NER: two evaluation contests

BioCreative 2004 (March): Critical Assessment of Information Extraction Systems in Biology
• Task 1: entity extraction
• Target: genes (or proteins, where there is ambiguity)
• 10,000 sentences from MEDLINE as training data and 5,000 sentences as testing data

BioNLP 2004 (August)
• GENIA corpus as training data and 404 abstracts as testing data
• Target: 5 classes (protein, DNA, RNA, cell line and cell type)

Both use exact match scoring.
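Exact match scoring means that a predicted entity counts as correct only if its boundaries and its class both agree with the gold annotation. The following sketch illustrates the idea on token-offset spans; the span representation and the function name are assumptions for illustration, not the official scorer of either contest.

```python
def exact_match_counts(gold, predicted):
    # Each entity is a (start, end, type) tuple; only identical tuples match.
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    false_pos = len(pred_set) - true_pos
    false_neg = len(gold_set) - true_pos
    return true_pos, false_pos, false_neg

# Hypothetical annotations: the second prediction is one token too long.
gold = [(0, 2, "protein"), (5, 6, "DNA")]
predicted = [(0, 2, "protein"), (5, 7, "DNA")]
print(exact_match_counts(gold, predicted))
```

Under this scheme a boundary error is penalized twice: once as a false positive and once as a false negative.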

Page 32: BioNLP 2004 datasets

Set                      # of abstracts   # of sentences        # of tokens
Training set             2,000            20,546 (10.27/abs)    472,006 (236.00/abs) (22.97/sen)
Test set, total          404              4,260 (10.54/abs)     96,780 (239.55/abs) (22.72/sen)
Test set, 1978-1989      104              991 (9.53/abs)        22,320 (214.62/abs) (22.52/sen)
Test set, 1990-1999      106              1,115 (10.52/abs)     25,080 (236.60/abs) (22.49/sen)
Test set, 2000-2001      130              1,452 (11.17/abs)     33,380 (256.77/abs) (22.99/sen)
Test set, S/1998-2001    204              2,254 (11.05/abs)     51,628 (253.08/abs) (22.91/sen)

Page 33: R/P/F by test set

Values are recall / precision / F-score (R/P/F).

System    1978-1989 set        1990-1999 set        2000-2001 set        S/1998-2001 set      Total
[Zho04]   75.3 / 69.5 / 72.3   77.1 / 69.2 / 72.9   75.6 / 71.3 / 73.8   75.8 / 69.5 / 72.5   76.0 / 69.4 / 72.6
[Fin04]   66.9 / 70.4 / 68.6   73.8 / 69.4 / 71.5   72.6 / 69.3 / 70.9   71.8 / 67.5 / 69.6   71.6 / 68.6 / 70.1
[Set04]   63.6 / 71.4 / 67.3   72.2 / 68.7 / 70.4   71.3 / 69.6 / 70.5   71.3 / 68.8 / 70.1   70.3 / 69.3 / 69.8
[Son04]   60.3 / 66.2 / 63.1   71.2 / 65.6 / 68.2   69.5 / 65.8 / 67.6   68.3 / 64.0 / 66.1   67.8 / 64.8 / 66.3
[Zha04]   63.2 / 60.4 / 61.8   72.5 / 62.6 / 67.2   69.1 / 60.2 / 64.7   69.2 / 60.3 / 64.4   69.1 / 61.0 / 64.8
[Rös04]   59.2 / 60.3 / 59.8   70.3 / 61.8 / 65.8   68.4 / 61.5 / 64.8   68.3 / 60.4 / 64.1   67.4 / 61.0 / 64.0
[Par04]   62.8 / 55.9 / 59.2   70.3 / 61.4 / 65.6   65.1 / 60.4 / 62.7   65.9 / 59.7 / 62.7   66.5 / 59.8 / 63.0
[Lee04]   42.5 / 42.0 / 42.2   52.5 / 49.1 / 50.8   53.8 / 50.9 / 52.3   52.3 / 48.1 / 50.1   50.8 / 47.6 / 49.1
BL        47.1 / 33.9 / 39.4   56.8 / 45.5 / 50.5   51.7 / 46.3 / 48.8   52.6 / 46.0 / 49.1   52.6 / 43.6 / 47.7

Page 34: Current methods

• Machine learning: HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
• Hybrid methods
• Dictionary-based: approximate string matching algorithms
• Naming rules
• Dynamic programming

Page 35: Features for machine learning methods

• Morphological features
• Orthographical features
• POS features (GENIA POS tagger)
• Semantic trigger features
• Head-noun features, e.g. "NF-kappaB consensus site", "IL-2 gene"

Page 36: Morphological features

Prefix/Suffix    Example
~cin             actinomycin
~mide            cycloheximide
~zole            sulphamethoxazole
~lipid           phospholipids
~rogen           estrogen
~vitamin         dihydroxyvitamin
~blast           erythroblast
~cyte            thymocyte
~phil            eosinophil
phosph~          phosphorylation
methyl~          methyltransferase
immuno~          immunomodulator
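Affix features like these are typically turned into binary indicators for a classifier. The sketch below uses exactly the affixes listed on the slide; the feature naming scheme and the tolerance for a trailing plural "s" (needed for the slide's own example "phospholipids") are assumptions made for illustration.

```python
# Affixes copied from the slide.
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin", "blast", "cyte", "phil")
PREFIXES = ("phosph", "methyl", "immuno")

def affix_features(token):
    word = token.lower()
    # Assumption: a plural "s" after the suffix still counts as a match.
    feats = [f"suffix={s}" for s in SUFFIXES if word.endswith(s) or word.endswith(s + "s")]
    feats += [f"prefix={p}" for p in PREFIXES if word.startswith(p)]
    return feats

for token in ["Cycloheximide", "phospholipids", "thymocyte", "methyltransferase", "gene"]:
    print(token, affix_features(token))
```

A token can fire several features at once ("phospholipids" has both a prefix and a suffix), and a token with no listed affix yields an empty list.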

Page 37: Orthographical features

Orthographical feature   Example
AllCaps                  EBNA, NFAT
AlphaDigit               p50, p65
AlphaDigitAlpha          IL23R, E1A
ATGCSequence             CCGCCC
CapLowAlpha              Src, Ras, Epo
CapMixAlpha              NFkappaB
CapsAndDigits            IL2, STAT4, SH2
DigitAlpha               2xNFkappaB
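The slide names the feature classes but does not define them. A minimal sketch, assuming each class is a regular expression tested in a fixed order, could look as follows; the patterns and their ordering are my own guesses chosen to reproduce the slide's examples, not the definitions used by any of the cited systems.

```python
import re

# (feature name, assumed pattern); the first pattern that matches the whole token wins.
ORTHO_PATTERNS = [
    ("ATGCSequence", r"[ATGC]{4,}"),
    ("AllCaps", r"[A-Z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    ("CapMixAlpha", r"[A-Z][A-Za-z]+"),
    ("CapsAndDigits", r"[A-Z]+[0-9]+"),
    ("AlphaDigit", r"[A-Za-z]+[0-9]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    ("DigitAlpha", r"[0-9]+[A-Za-z]+"),
]

def orthographic_feature(token):
    for name, pattern in ORTHO_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "Other"

for token in ["EBNA", "p50", "IL23R", "CCGCCC", "Src", "NFkappaB", "STAT4", "2xNFkappaB"]:
    print(token, orthographic_feature(token))
```

The order matters: "CCGCCC" would also match AllCaps and "IL2" would also match AlphaDigit, so the more specific classes are tested first.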

Page 38: Head nouns

Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin

Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain

Page 39: Excursus: head noun, noun phrase

A noun is usually embedded in a noun phrase (NP), a syntactic unit of the sentence in which information about the noun is gathered.

The noun is the head of the noun phrase, the central constituent that determines the syntactic character of the phrase.

Page 40: Excursus: head noun, noun phrase (cont.)

A noun phrase normally consists of:
• An optional determiner
• Zero or more adjective phrases
• A head noun
• An optional post-modifier (prepositional phrase or clausal modifier)

Examples:
• The homeless old man in the park that I tried to help yesterday
• human umbilical vein endothelial cells
• lipopolysaccharide-stimulated human saphenous vein endothelial cells

Page 41: Zhou et al. approach

• HMM + SVM
• Post-processing, rule-based: used to resolve nested named entities
• Top 1 in the NLPBA task, F = 72.5%

Page 42: Manning et al. method

• Machine learning: ME Markov model, local features, external resources and larger context
• Post-processing: to correct gene boundaries (mainly for the BioCreative task)
• Top 1 in BioCreative, F = 83.2%
• Top 2 in the NLPBA task, F = 70.1%

Page 43: What is Information Extraction (IE)?

IE systems analyse unstructured text, extract predefined named entities and store these entities in a structured form.

Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.

Source: Marti Hearst, What is text mining? http://www.sims.berkeley.edu/~hearst/text-mining.html

Page 44: Targets of information extraction

• Protein-protein interaction/binding/inhibition
• Protein-small molecules
• Gene-gene regulation
• Gene-gene product interaction
• Gene-drug relation
• Protein-subcellular location
• Amino acid-protein relation

Example relationships between genes and drugs:
• The gene is the drug target
• The gene confers resistance to the drug
• The gene metabolizes the drug

Page 45: Information extraction tasks

• Identify target named entities
• Identify relations among named entities
• Identify relations among events and named entities
• Associate results with existing database records

Page 46: IE systems

• Rule-based systems: use rules for the extraction
• Machine learning: Support Vector Machine (SVM), Maximum Entropy (ME), Memory-Based Learning (MBL), Inductive Logic Programming (ILP), Artificial Neural Networks (ANNs)
• Hybrid systems: combine the two approaches

Page 47: GATE: General Architecture for Text Engineering

"GATE is an architecture, a framework and a development environment for LE (Language Engineering)" (Cunningham, 2002)

• Integrated development environment for LE applications
• Reusable components
• Extensive set of APIs
• Integration of different NLP platforms
• WEKA (machine learning), Protégé (ontology)
• Open source, Java

Page 48: Extractor

[Figure: extractor architecture operating on XML docs. Components: Tokenizer, Sentence Splitter, Gene Gazetteer, Gene-relation transducer, POS Tagger, Acronym Resolution, NP-Chunking. Legend: GATE standard components, external modules, newly developed modules.]

Gene-relation transducer:
• Finite state transducer
• Uses JAPE grammar
• JAPE rules are compiled to Java objects that are used by the GATE API

Gene Gazetteer:
• Consists of an index file which is used to access lists with keywords
• Lists are compiled to finite state machines
• Every keyword is annotated with a type (e.g. Gene, Relation, Organism, ...)
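The gazetteer idea (keyword lists, each keyword carrying a type) can be sketched as a longest-match dictionary lookup. This is only an illustration of the concept in plain Python: GATE compiles its lists into finite state machines, whereas the sketch uses a dictionary, and the entries and the `lookup` function are hypothetical.

```python
# Hypothetical keyword list: lowercased token tuples mapped to an annotation type.
GAZETTEER = {
    ("il-1beta",): "Gene",
    ("tnf-alpha",): "Gene",
    ("gm-csf",): "Gene",
    ("tumor", "necrosis", "factor"): "Gene",
    ("enhanced",): "Relation",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def lookup(tokens):
    # Return (start, end, type) annotations, preferring the longest match at each position.
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                annotations.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here, move on to the next token
    return annotations

sentence = "IL-1beta and TNF-alpha significantly enhanced the production of GM-CSF"
print(lookup(sentence.split()))
```

The resulting annotations play the role of the Lookup annotations that later grammar rules can test.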

Page 49: JAPE

Example: "IL-1beta and TNF-alpha significantly enhanced the production of GM-CSF"

Rule: G1ofG2
(
  (GENE):gene1
  (mRNA)?
  (ADVS)?
  ({Lookup.majorType == relverb}):rel
  ({Token.category == DT})?
  {Token.category == NN}
  {Token.string == "of"}
  (GENE):gene2
  (mRNA)?
):cgr --> { Java code }

Callouts in the figure: GENE, mRNA and ADVS are macros; :gene1, :rel, :gene2 and :cgr are labels; {Lookup.majorType == relverb} is a gazetteer lookup; {Token.category == ...} tests the POS tag; the part after --> is the right-hand side (RHS).
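To make the matching behaviour of the rule concrete, the pattern can be imitated in Python by running a regular expression over the sequence of token labels. This is an analogy only and not how GATE executes JAPE; the label names and the hand-annotated token list are assumptions standing in for gazetteer and POS tagger output.

```python
import re

# Hypothetical pre-annotated tokens (text, label).
TOKENS = [
    ("IL-1beta", "GENE"), ("and", "CC"), ("TNF-alpha", "GENE"),
    ("significantly", "ADV"), ("enhanced", "RELVERB"), ("the", "DT"),
    ("production", "NN"), ("of", "OF"), ("GM-CSF", "GENE"),
]

# Mirrors the rule: GENE (mRNA)? (ADVS)? relverb (DT)? NN "of" GENE (mRNA)?
PATTERN = re.compile(r"\bGENE( MRNA)?( ADV)? RELVERB( DT)? NN OF GENE\b( MRNA)?")

def match_g1_of_g2(tokens):
    labels = " ".join(label for _, label in tokens)
    m = PATTERN.search(labels)
    if m is None:
        return None
    start = labels[:m.start()].count(" ")    # index of the first matched token
    end = start + m.group(0).count(" ") + 1  # one past the last matched token
    span = tokens[start:end]
    genes = [text for text, label in span if label == "GENE"]
    rel = next(text for text, label in span if label == "RELVERB")
    return {"gene1": genes[0], "rel": rel, "gene2": genes[-1]}

print(match_g1_of_g2(TOKENS))
```

On the example sentence the pattern binds only TNF-alpha as the first gene; the coordinated IL-1beta is not covered by this single pattern.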

Page 50: Examples

Pattern                                   Example
G1(-)relverb G2                           TNF-alpha(-)mediated GM-CSF
G1 adverb? modal? rel G2                  IFNG (mRNA) significantly downregulates IL8 (mRNA)
G2 (mRNA) relation by G1 (mRNA)           CDC2 activation by cyclin B1
G1 relverb rel of G2                      TNF-alpha(-)mediated upregulation of GM-CSF
G1 relverb and/but G2 relverb G3 rel      IL1 upregulates but IFNG downregulates IL8 expression

Page 51: Gene-gene relation

[Figure: gene-gene relation.]

Page 52: Evaluation

$\mathrm{REC} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{POS}}$

$\mathrm{PRE} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{ACT}}$

COR = correct relations, POS = possible relations, ACT = actually extracted relations, PAR = partially correct extracted relations

Estimation based on 100 manually checked abstracts: PRE = 83%, REC = ?

A standard for evaluation is necessary: BioCreative? GENIA?
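The two measures give half credit to partially correct relations. A small sketch of the computation follows; the function name is hypothetical and the counts are made up purely to show the arithmetic, they are not the counts behind the 83% reported on the slide.

```python
def partial_credit_scores(cor, par, pos, act):
    # Partially correct relations (PAR) count half, as in the formulas on the slide.
    credit = cor + 0.5 * par
    precision = credit / act  # PRE: share of extracted relations that are right
    recall = credit / pos     # REC: share of possible relations that were found
    return precision, recall

# Hypothetical counts: 8 correct, 2 partially correct, 10 extracted, 12 possible.
pre, rec = partial_credit_scores(cor=8, par=2, pos=12, act=10)
print(f"PRE = {pre:.2f}, REC = {rec:.2f}")
```

Recall needs POS, the number of relations actually present in the text, which has to come from a manually annotated gold standard.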

Page 53: Precision, recall and F-value

[Figure: a query against a collection of documents; the set M of retrieved relations overlaps the set N of correct relations in C.]

N: correct relations
M: retrieved relations
C: correct relations that are actually retrieved

Precision (P): $P = \frac{C}{M}$

Recall (R): $R = \frac{C}{N}$

F-value: $F = \frac{2 \cdot P \cdot R}{P + R}$

More complicated due to partially filled templates.
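A short worked example of the three formulas, using the slide's symbols N, M and C. The function name and the numbers are illustrative assumptions only.

```python
def precision_recall_f(n_correct, m_retrieved, c_both):
    p = c_both / m_retrieved  # P = C / M
    r = c_both / n_correct    # R = C / N
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# Hypothetical counts: 50 correct relations exist, 40 were retrieved, 30 of those are correct.
p, r, f = precision_recall_f(n_correct=50, m_retrieved=40, c_both=30)
print(f"P = {p:.2f}, R = {r:.2f}, F = {f:.3f}")
```

The F-value is the harmonic mean of P and R, so it stays low unless both are reasonably high.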

Page 54: Recall vs. precision

High recall:
• You get all the right answers, but garbage too.
• Good when incorrect results are not problematic.
• More common from automatic systems.

High precision:
• When all returned answers must be correct.
• Good when missing results are not problematic.
• More common from hand-built systems.

In general, one can trade one for the other, but it is harder to score well on both.

[Figure: precision plotted against recall, with four example points.]

Page 55: Penn Treebank tagset

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present

Page 56: Tools for NLP in the biomedical domain
