Natural Language Processing in the biomedical domain SBI Course WS 2005/2006 Thomas Karopka 19.01.2006
Dec 22, 2015
Natural Language Processing in the Biomedical Domain
Outline
• Motivation
• Introduction to Natural Language Processing
• Named Entity Recognition (NER)
• Information Extraction (IE)
• GATE – General Architecture for Text Engineering
• Some tools, some applications ... (short introduction to GATE)
Motivation
• Huge amount of biomedical knowledge
• MEDLINE: currently contains over 16 million biomedical abstracts
• 50,000 new abstracts per month
• Problem: unstructured text is difficult to analyze automatically
• 40,000 abstracts at 5 min each: approx. 400 days (8 h a day)
• Solution: NLP – Information Extraction
What is NLP?
Definition 1:
Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems
inherent in the processing and manipulation of natural language, but not, generally, natural language
understanding.
Definition 2:
A study of how to use computers to do things with human languages.
Synonyms: Language Engineering, Human Language Technology
[Figure: Publications in MEDLINE. x-axis: year; y-axis: publications in millions (0 to 14); series: publications per year and cumulative count.]
Main fields of NLP
• Text to speech
• Speech recognition
• Natural language generation
• Machine translation
• Question answering
• Information retrieval
• Information extraction
• Named entity recognition
• Text classification
• Translation technology
• Text summarization
Why is NLP so hard?
• Ambiguity
• Context
• Acronyms
• Semantics
Ambiguity
Time flies like an arrow, fruit flies like a banana
(Groucho Marx)
Global vs. Local ambiguity
Local ambiguity means that part of a sentence can have more than 1 interpretation, but not the whole sentence.
Global ambiguity means that the whole sentence can have more than 1 interpretation.
Global vs. Local ambiguity cont.
Local ambiguity
• "The old train ..."
  • 1: "... the young."
  • 2: "... left the station."
• Here syntax can tell us that TRAIN must be a verb in sentence 1.
Global ambiguity
• "I saw the Grand Canyon flying to New York"
• "I saw a Boeing 747 flying to New York"
• Here we know the meaning of the two sentences because we know what can and cannot fly.
Types of Ambiguity
Categorical ambiguity
• Noun: "Time is money"
• Verb: "Time me on the last lap"
• Adjective: "Time travel is not likely in my lifetime"
Word sense ambiguity
• Electrical: "The battery was charged with jump leads"
• Legal: "Thief was charged by PC Smith"
• Responsibility: "The lecturer was charged with student recruitment"
Types of Ambiguity cont.
Structural ambiguity
• "You can have peas and beans or carrots with the set meal"
Referential ambiguity
• What can THEY refer to in: "After THEY finished the exam the students and lecturers left."
• Lecturers only? Students only? Both?
Problems in NLP
• Polysemy: one word carrying different meanings in different contexts (Glück 1993, 474)
  • beam ('ray of light' and 'wooden bar')
• Synonymy: the semantic relation that holds between two words that can (in a given context) express the same meaning
  • ship – vessel, buy – purchase
• Semantics: the meaning of a word, phrase, clause, or sentence, as opposed to its syntactic construction
  • "Baby swallows fly"
Basic NLP Tasks
• Tokenization: split text into units called tokens (words and punctuation such as . , -)
• Sentence splitting: detect sentence boundaries
• Part-of-speech (POS) tagging: assign parts of speech (verb, noun, adjective, ...)
• Parsing: work out parse trees
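The first two tasks can be sketched with regular expressions. This is a minimal illustration only: the function names and the splitting rules are assumptions made for this example, and real biomedical text (abbreviations, decimal numbers, names such as "E. coli") needs far more careful handling.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at . ! or ? when it is
    # followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    # Words may contain internal hyphens or apostrophes (e.g. "IL-2");
    # every other non-space character becomes its own token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "IL-2 activates T cells. The effect was dose-dependent."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Keeping "IL-2" as one token is a design choice; a tokenizer that splits on hyphens would make later gene name lookup harder.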
Basic NLP Tasks cont.
• Verb phrase chunking: find verb phrases
• Noun phrase chunking: find noun phrases
• Acronym resolution: find long forms for acronyms
• Coreference resolution: e.g. "New York" ... "the Big Apple"
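Acronym resolution can be illustrated with a very simple heuristic: look for an uppercase short form in parentheses and check whether the initials of the preceding words spell it. This is a sketch under that assumption only; the function name is hypothetical, and short forms such as "IL-1" or "NF-kB" would need a more flexible matching strategy.

```python
import re

def find_acronyms(text):
    """Map each parenthesised acronym to its long form, if the initials match."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short = match.group(1)
        # Candidate long form: as many preceding words as the acronym has letters.
        words = text[:match.start()].split()
        candidate = words[-len(short):]
        if len(candidate) == len(short) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, short)
        ):
            pairs[short] = " ".join(candidate)
    return pairs

text = ("The epidermal growth factor receptor (EGFR) and "
        "tumor necrosis factor (TNF) were measured.")
print(find_acronyms(text))
```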
Basic NLP Tasks cont.
• Named entity recognition: find named entities
• ...
What is NER?
• NER: Named Entity Recognition
• Includes two tasks:
  • Identification of proper names in text
  • Classification of proper names in text
• Newswire domain: Person, Location, Organization
• Biomedical domain: Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.
NER in biomedical domain
BioNER aims to recognize the following names:
• First priority: protein name, DNA name, RNA name
• Second priority: cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic
Example of NER - Biomedical
[Figure: example text with annotated named entities; legend: Protein/gene, Cell type]
Problems in BioNER
• Unknown words
• Long compound words
• Variations of expressions
• Nested NEs
Unknown Words
• Words containing hyphens, digits, letters, Greek letters, Roman numerals:
  • Alpha B1
  • Adenylyl cyclase 76E
  • Latent membrane protein 1
  • 4'-mycarosyl isovaleryl-CoA transferase
  • oligodeoxyribonucleotide
  • 18-deoxyaldosterone
• Abbreviations and acronyms: IL, TECd, IFN, TPA
Long Compound words
• interleukin 1 (IL-1)-responsive kinase
• interleukin 1-responsive kinase
• epidermal growth factor receptor
• SH2 domain containing tyrosine kinase Syk SH2 domain (GENIA example)
Various expressions of the same NE
• Spelling variation: N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
• Word permutation: beta-1 integrin, integrin beta-1
• Ambiguous expressions:
  • epidermal growth factor receptor, EGF receptor, EGFR
  • c-jun, c-Jun, c jun
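Spelling variants and word permutations like these are often collapsed by mapping each name to a normalized key before comparison. The sketch below is one possible normalization, not a method taken from the slides; both function names are hypothetical.

```python
import re

def spelling_key(name):
    # Drop whitespace, hyphens, underscores and slashes, then lowercase,
    # so that "c-jun", "c-Jun" and "c jun" share one key.
    return re.sub(r"[\s\-_/]+", "", name).lower()

def permutation_key(name):
    # Sort the lowercased words so that word order does not matter.
    return " ".join(sorted(name.lower().split()))

variants = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
print({spelling_key(v) for v in variants})
print(permutation_key("beta-1 integrin") == permutation_key("integrin beta-1"))
```

Aggressive normalization raises recall but can merge names that are actually different, so the keys are best used for candidate generation rather than as a final decision.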
Various expressions: the name explains its function
• the Ras guanine nucleotide exchange factor Sos
• the Ras guanine nucleotide releasing protein Sos
• the Ras exchanger Sos
• the GDP-GTP exchange factor Sos
• Sos (mSos), a GDP/GTP exchange protein for Ras
Various expressions: the name includes a preposition and/or conjunction (ambiguity of dependencies)
• p85 alpha subunit of PI 3-kinase
• SH2 and SH3 domains of Src
• NF-AT1, AP-1, and NF-kB sites
• E2F1 and -3
• Residues 432, 435, 437, 438, and 440
Nested Named Entity
An NE embedded in another NE.
• IL-2: protein
• IL-2 gene: gene
• CBP/p300 associated factor: protein
• CBP/p300 associated factor binding promoter: DNA
Gene Naming Conventions
"Biologists would rather share their toothbrush than share a gene name" (Michael Ashburner [1])
[1] Pearson H. Biology's name game. Nature. 2001;411:631–632.
Protein/Gene name recognition
For comic relief, don't miss the 'worst gene names' page:
http://tinman.vetmed.helsinki.fi/eng/drosophila.html
My favourite ones:
• drop dead FBgn0000494
• lost in space FBgn0016996
• ken and barbie FBgn0011236
Source: FlyBase http://flybase.bio.indiana.edu/
State-of-the-art Systems on NER: Two evaluation contests
• BioCreative 2004 (March): Critical Assessment of Information Extraction Systems in Biology
  • Task 1: Entity extraction
  • Target: genes (or proteins, where there is ambiguity)
  • 10,000 sentences from MEDLINE as training data, and 5,000 sentences as testing data
• BioNLP 2004 (August)
  • GENIA Corpus as training data and 404 abstracts as testing data
  • Target: 5 classes: protein, DNA, RNA, cell line and cell type
• Both use exact match scoring.
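Exact match scoring means that a predicted entity counts as correct only if its boundaries and its class both agree with the gold annotation. A minimal sketch, assuming entities are represented as (start, end, class) tuples; this representation and the function name are illustrative and not the contests' official scorer:

```python
def exact_match_counts(gold, predicted):
    """Return (correct, number of gold entities, number of predicted entities)."""
    gold_set, pred_set = set(gold), set(predicted)
    # Only identical (start, end, class) triples count as correct.
    correct = len(gold_set & pred_set)
    return correct, len(gold_set), len(pred_set)

gold = [(0, 4, "protein"), (10, 18, "DNA")]
# The second prediction has the right class but a wrong right boundary.
predicted = [(0, 4, "protein"), (10, 14, "DNA")]
print(exact_match_counts(gold, predicted))  # prints: (1, 2, 2)
```

Under this scheme a boundary error is penalized twice: the gold entity is missed and the prediction is a false positive.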
BioNLP 2004 Datasets
Set                       # of abstracts   # of sentences        # of tokens
Training Set              2,000            20,546 (10.27/abs)    472,006 (236.00/abs) (22.97/sen)
Test Set, Total           404              4,260 (10.54/abs)     96,780 (239.55/abs) (22.72/sen)
Test Set, 1978-1989       104              991 (9.53/abs)        22,320 (214.62/abs) (22.52/sen)
Test Set, 1990-1999       106              1,115 (10.52/abs)     25,080 (236.60/abs) (22.49/sen)
Test Set, 2000-2001       130              1,452 (11.17/abs)     33,380 (256.77/abs) (22.99/sen)
Test Set, S/1998-2001     204              2,254 (11.05/abs)     51,628 (253.08/abs) (22.91/sen)
Results per test subset (R/P/F = recall / precision / F-score)
R/P/F     1978-1989 set        1990-1999 set        2000-2001 set        S/1998-2001 set      Total
[Zho04]   75.3 / 69.5 / 72.3   77.1 / 69.2 / 72.9   75.6 / 71.3 / 73.8   75.8 / 69.5 / 72.5   76.0 / 69.4 / 72.6
[Fin04]   66.9 / 70.4 / 68.6   73.8 / 69.4 / 71.5   72.6 / 69.3 / 70.9   71.8 / 67.5 / 69.6   71.6 / 68.6 / 70.1
[Set04]   63.6 / 71.4 / 67.3   72.2 / 68.7 / 70.4   71.3 / 69.6 / 70.5   71.3 / 68.8 / 70.1   70.3 / 69.3 / 69.8
[Son04]   60.3 / 66.2 / 63.1   71.2 / 65.6 / 68.2   69.5 / 65.8 / 67.6   68.3 / 64.0 / 66.1   67.8 / 64.8 / 66.3
[Zha04]   63.2 / 60.4 / 61.8   72.5 / 62.6 / 67.2   69.1 / 60.2 / 64.7   69.2 / 60.3 / 64.4   69.1 / 61.0 / 64.8
[Rös04]   59.2 / 60.3 / 59.8   70.3 / 61.8 / 65.8   68.4 / 61.5 / 64.8   68.3 / 60.4 / 64.1   67.4 / 61.0 / 64.0
[Par04]   62.8 / 55.9 / 59.2   70.3 / 61.4 / 65.6   65.1 / 60.4 / 62.7   65.9 / 59.7 / 62.7   66.5 / 59.8 / 63.0
[Lee04]   42.5 / 42.0 / 42.2   52.5 / 49.1 / 50.8   53.8 / 50.9 / 52.3   52.3 / 48.1 / 50.1   50.8 / 47.6 / 49.1
BL        47.1 / 33.9 / 39.4   56.8 / 45.5 / 50.5   51.7 / 46.3 / 48.8   52.6 / 46.0 / 49.1   52.6 / 43.6 / 47.7
Current Methods
• Machine learning: HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
• Hybrid methods
• Dictionary based: approximate string matching algorithm
• Naming rules
• Dynamic programming
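Approximate string matching against a dictionary is commonly implemented with the edit (Levenshtein) distance, which is computed by dynamic programming. The sketch below shows that standard technique; the dictionary entries, the threshold and the function names are assumptions for illustration, not the setup of any particular system from the contests.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def approximate_lookup(term, dictionary, max_distance=1):
    # Return all dictionary entries within max_distance edits of the term.
    return [entry for entry in dictionary
            if edit_distance(term.lower(), entry.lower()) <= max_distance]

dictionary = ["N-acetylcysteine", "interleukin"]
print(approximate_lookup("N-acetyl-cysteine", dictionary))
```

Comparing every term with every entry is quadratic in practice; real systems usually restrict the candidates first, for example with an index over character n-grams.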
Features for Machine Learning Methods
• Morphological features
• Orthographical features
• POS features
  • Genia POS tagger
• Semantic trigger features
  • Head-noun features: NF-kappaB consensus site, IL-2 gene
Morphological Features
Prefix/Suffix   Example
~cin            actinomycin
~mide           cycloheximide
~zole           sulphamethoxazole
~lipid          phospholipids
~rogen          estrogen
~vitamin        dihydroxyvitamin
~blast          erythroblast
~cyte           thymocyte
~phil           eosinophil
phosph~         phosphorylation
methyl~         methyltransferase
immuno~         immunomodulator
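Such affixes can be turned into binary features for a classifier. A minimal sketch, using only the affixes from the table; the feature naming scheme, the function name and the tolerance for a plural "s" are assumptions made for this example:

```python
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin",
            "blast", "cyte", "phil")
PREFIXES = ("phosph", "methyl", "immuno")

def affix_features(token):
    """Return the set of affix features that fire for a token."""
    t = token.lower()
    features = set()
    for suffix in SUFFIXES:
        # Also accept a plural "s", as in "phospholipids".
        if t.endswith(suffix) or t.endswith(suffix + "s"):
            features.add("suffix=" + suffix)
    for prefix in PREFIXES:
        if t.startswith(prefix):
            features.add("prefix=" + prefix)
    return features

for word in ["actinomycin", "phospholipids", "methyltransferase"]:
    print(word, sorted(affix_features(word)))
```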
Orthographical Features
Orthographical Feature   Example
AllCaps                  EBNA, NFAT
AlphaDigit               p50, p65
AlphaDigitAlpha          IL23R, E1A
ATGCSequence             CCGCCC
CapLowAlpha              Src, Ras, Epo
CapMixAlpha              NFkappaB
CapsAndDigits            IL2, STAT4, SH2
DigitAlpha               2xNFkappaB
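One way to compute such a feature is an ordered list of regular expressions in which the first full match wins. The patterns below are assumptions reverse-engineered from the examples in the table; the original feature definitions are not given on the slide and may differ.

```python
import re

# Ordered (feature, pattern) pairs; more specific patterns come first.
ORTHO_PATTERNS = [
    ("ATGCSequence", r"[ACGT]{4,}"),
    ("AllCaps", r"[A-Z]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    ("CapsAndDigits", r"[A-Z]+[0-9]+"),
    ("AlphaDigit", r"[A-Za-z]+[0-9]+"),
    ("DigitAlpha", r"[0-9]+[A-Za-z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    ("CapMixAlpha", r"[A-Z][A-Za-z]+"),
]

def orthographic_feature(token):
    """Return the first feature whose pattern matches the whole token."""
    for name, pattern in ORTHO_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "Other"

for token in ["EBNA", "p50", "IL23R", "CCGCCC", "Src", "NFkappaB", "IL2", "2xNFkappaB"]:
    print(token, orthographic_feature(token))
```

The order matters: "CCGCCC" would also match AllCaps, and "IL23R" would otherwise be hard to separate from CapsAndDigits.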
Head Nouns
• Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
• Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
Excursus: Head Noun, Noun phrase
A noun is usually embedded in a noun phrase (NP), a syntactic unit of the sentence in which information about the noun is gathered.
The noun is the head of the noun phrase, the central constituent that determines the syntactic character of the phrase.
Excursus: Head Noun, Noun phrase cont.
A noun phrase normally consists of:
• An optional determiner
• Zero or more adjective phrases
• A head noun
• An optional post-modifier (prepositional phrase or clausal modifier)
Examples:
• The homeless old man in the park that I tried to help yesterday
• human umbilical vein endothelial cells
• lipopolysaccharide-stimulated human saphenous vein endothelial cells
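Given this structure, a rough heuristic for English noun phrases is to take the last word before the first post-modifier as the head noun. The sketch below assumes that heuristic and a small hand-made list of words that start a post-modifier; it is not a parser and will fail on many real phrases.

```python
# Words that typically open a prepositional or clausal post-modifier
# (a hand-made, incomplete list used only for this illustration).
POST_MODIFIER_STARTS = {"in", "of", "on", "at", "with", "for", "from",
                        "by", "that", "which", "who"}

def head_noun(noun_phrase):
    """Return the last word before the first post-modifier."""
    head = None
    for word in noun_phrase.split():
        if head is not None and word.lower() in POST_MODIFIER_STARTS:
            break
        head = word
    return head

print(head_noun("The homeless old man in the park that I tried to help yesterday"))
print(head_noun("human umbilical vein endothelial cells"))
print(head_noun("IL-2 gene"))
```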
Zhou et al. approach
• HMM + SVM
• Post-processing: rule-based, used to resolve nested named entities
• Top 1 in the NLPBA task, F = 72.6%
Manning et al. method
• Machine learning:
  • ME Markov model
  • Local features
  • External resources and larger context
• Post-processing: to correct gene boundaries (mainly for the BioCreative task)
• Top 1 in BioCreative, F = 83.2%
• Top 2 in NLPBA task, F = 70.1%
What is Information Extraction (IE)?
IE systems analyse unstructured text, extract predefined named entities and store these entities in a structured form.
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
Source: Marti Hearst, What is text mining? http://www.sims.berkeley.edu/~hearst/text-mining.html
Targets of Information Extraction
• Protein-protein interaction/binding/inhibition
• Protein-small molecules
• Gene-gene regulation
• Gene-gene product interaction
• Gene-drug relation
• Protein-subcellular location
• Amino acid-protein relation
Example relationships between genes and drugs:
• The gene is the drug target
• The gene confers resistance to the drug
• The gene metabolizes the drug
Information Extraction Tasks
• Identify target named entities
• Identify relations among named entities
• Identify relations among events and named entities
• Associate results with existing database records
IE-Systems
• Rule-based systems: use rules for the extraction
• Machine learning: Support Vector Machine (SVM), Maximum Entropy (ME), Memory Based Learning (MBL), Inductive Logic Programming (ILP), Artificial Neural Networks (ANNs)
• Hybrid systems: combine the two approaches
GATE – General Architecture for Text Engineering
"GATE is an architecture, a framework and a development environment for LE (Language Engineering)" (Cunningham, 2002)
• Integrated development environment for LE applications
• Reusable components
• Extensive set of APIs
• Integration of different NLP platforms
• WEKA (machine learning), Protégé (ontology)
• Open source, Java
Extractor
[Diagram: processing pipeline over XML docs with the modules Tokenizer, Sentence Splitter, Gene Gazetteer, Gene-relation transducer, POS Tagger, Acronym Resolution and NP-Chunking; legend: GATE standard components, external modules, newly developed modules]
• Transducer:
  • Finite state transducer
  • Uses a JAPE grammar
  • JAPE rules are compiled to Java objects that are used by the GATE API
• Gazetteer:
  • Consists of an index file which is used to access lists of keywords
  • Lists are compiled to finite state machines
  • Every keyword is annotated with a type (e.g. Gene, Relation, Organism ...)
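The gazetteer idea can be illustrated outside GATE with a plain dictionary lookup. This sketch is not GATE's implementation or API: a hash lookup stands in for the compiled finite state machine, and the keyword lists, function names and the longest-match strategy are assumptions made for the example.

```python
def build_gazetteer(lists):
    """lists maps a type (e.g. "Gene") to its keywords; returns keyword -> type."""
    return {keyword.lower(): entity_type
            for entity_type, keywords in lists.items()
            for keyword in keywords}

def annotate(tokens, gazetteer, max_len=3):
    """Return (start, end, type) annotations, preferring the longest match."""
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                annotations.append((i, i + n, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1  # no keyword starts at this token
    return annotations

gazetteer = build_gazetteer({
    "Gene": ["IL-1beta", "TNF-alpha", "GM-CSF"],
    "Relation": ["enhanced"],
})
tokens = "IL-1beta and TNF-alpha significantly enhanced the production of GM-CSF".split()
print(annotate(tokens, gazetteer))
```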
JAPE
Example: IL-1beta and TNF-alpha significantly enhanced the production of GM-CSF
    Rule: G1ofG2
    (
      (GENE):gene1                          // GENE is a macro, :gene1 a label
      (mRNA)?                               // macro, optional
      (ADVS)?                               // macro, optional
      ({Lookup.majorType == relverb}):rel   // gazetteer lookup
      ({Token.category == DT})?             // POS tag
      {Token.category == NN}                // POS tag
      {Token.string == "of"}
      (GENE):gene2
      (mRNA)?
    ):cgr
    -->
    { Java code }                           // RHS
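The logic of this rule can be mimicked in Python over a sequence of already tagged tokens. This is only an analogy to show what the pattern accepts, not how GATE executes JAPE: the tag names are hand-assigned assumptions, the optional mRNA macro is left out, and the function name is hypothetical.

```python
def match_g1_of_g2(tagged):
    """tagged: list of (token, tag). Returns (gene1, relation verb, gene2) or None."""
    tags = [tag for _, tag in tagged]
    for i, tag in enumerate(tags):
        if tag != "GENE":
            continue
        j = i + 1
        if j < len(tags) and tags[j] == "ADV":      # optional adverb
            j += 1
        if j >= len(tags) or tags[j] != "RELVERB":  # relation verb is required
            continue
        rel = j
        j += 1
        if j < len(tags) and tags[j] == "DT":       # optional determiner
            j += 1
        if tags[j:j + 3] == ["NN", "OF", "GENE"]:   # noun, "of", second gene
            return tagged[i][0], tagged[rel][0], tagged[j + 2][0]
    return None

sentence = [("IL-1beta", "GENE"), ("and", "CC"), ("TNF-alpha", "GENE"),
            ("significantly", "ADV"), ("enhanced", "RELVERB"), ("the", "DT"),
            ("production", "NN"), ("of", "OF"), ("GM-CSF", "GENE")]
print(match_g1_of_g2(sentence))
```

Like the rule itself, the sketch only captures the gene directly in front of the verb, so the coordinated "IL-1beta" is not returned; coordination needs a separate pattern.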
Examples
Pattern                                 Example
G1(-)relverb G2                         TNF-alpha(-)mediated GM-CSF
G1 adverb? modal? rel G2                IFNG (mRNA) significantly downregulates IL8 (mRNA)
G2 (mRNA) relation by G1 (mRNA)         CDC2 activation by cyclin B1
G1 relverb rel of G2                    TNF-alpha(-)mediated upregulation of GM-CSF
G1 relverb and/but G2 relverb G3 rel    IL1 upregulates but IFNG downregulates IL8 expression
[Figure: Gene-gene relation]
Evaluation
$$\mathrm{REC} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{POS}}$$
$$\mathrm{PRE} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{ACT}}$$
COR = correct relations, POS = possible relations, ACT = actually extracted relations, PAR = partially correct extracted relations
Estimation based on 100 manually checked abstracts:
PRE = 83%, REC = ?
A standard for evaluation is necessary: BioCreAtIvE? GENIA?
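The two measures give half credit to partially correct relations. A small sketch of the calculation; the counts in the usage example are made up for illustration and are not results from the 100 checked abstracts.

```python
def partial_credit_scores(cor, par, pos, act):
    """Return (REC, PRE) where partially correct relations count half."""
    credit = cor + 0.5 * par
    recall = credit / pos      # share of the possible relations that were found
    precision = credit / act   # share of the extracted relations that are right
    return recall, precision

# Hypothetical counts: 40 correct, 10 partially correct,
# 60 possible relations, 50 relations actually extracted.
rec, pre = partial_credit_scores(cor=40, par=10, pos=60, act=50)
print(f"REC = {rec:.2f}, PRE = {pre:.2f}")  # prints: REC = 0.75, PRE = 0.90
```

This also shows why REC is still open on the slide: POS requires knowing every relation that could have been extracted, which needs a fully annotated reference set.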
[Diagram: a query against a collection of documents; set N and set M overlap in C]
• N: correct relations
• M: retrieved relations
• C: correct relations that are actually retrieved
Precision (P): $P = \frac{C}{M}$
Recall (R): $R = \frac{C}{N}$
F-Value: $F = \frac{2 \cdot P \cdot R}{P + R}$
More complicated due to partially filled templates
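The three measures follow directly from the counts N, M and C. A minimal sketch with made-up counts; the function names are chosen for this example.

```python
def f_value(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def precision_recall_f(c, m, n):
    """c: correct and retrieved, m: retrieved, n: correct relations."""
    p = c / m
    r = c / n
    return p, r, f_value(p, r)

# Hypothetical counts: 50 correct relations exist, 40 are retrieved, 30 of those are correct.
p, r, f = precision_recall_f(c=30, m=40, n=50)
print(f"P = {p:.2f}, R = {r:.2f}, F = {f:.3f}")  # prints: P = 0.75, R = 0.60, F = 0.667
```

The same formula reproduces the F-scores in the BioNLP 2004 results: a recall of 76.0 and a precision of 69.4 give an F-value of about 72.6.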
Recall vs. Precision
• High recall: you get all the right answers, but garbage too. Good when incorrect results are not problematic. More common from automatic systems.
• High precision: all returned answers must be correct. Good when missing results are not problematic. More common from hand-built systems.
• In general, one can trade one for the other, but it is harder to score well on both.
[Figure: precision plotted against recall, four data points]
Penn Treebank Tagset
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
Tools for NLP in the biomedical domain