Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo

Mar 28, 2015


Mia Burton
Transcript
Page 1: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Natural Language Tools and Resources for Biomedical Information Extraction

Yoshimasa Tsuruoka

Tsujii laboratory

University of Tokyo

Page 2: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Outline

• NLP resources for bioNLP
  – GENIA corpus
• NLP tools
  – Machine learning
    • Maximum entropy modeling for feature forests
    • Maximum entropy modeling with inequality constraints
  – Part-of-speech tagger
  – Chunker (shallow parser)
  – HPSG parser
• Applications of NLP
  – Extracting disease-gene relationships from MEDLINE abstracts

Page 3: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Application of NLP to the Biomedical domain

• Plenty of text
  – MEDLINE database: 12 million abstracts
  – Need for effective IE and IR
• Domain knowledge
  – Gene ontology, KEGG, UMLS, ICD, …
• Other information sources
  – Molecular databases
    • DNA sequences, motifs, diseases, molecular interactions, etc.

Page 4: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Developing NLP resources

• Resources for NLP research
  – Domain knowledge
  – Training data for ML-based techniques
  – Test data for evaluating the transferability of a system
• GENIA resources
  – Ontology
  – Corpus

Page 5: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

GENIA corpus

• 4,000 MEDLINE abstracts
  – Selected by MeSH Terms (Human, Blood cells, Transcription factors)
• XML format
• Contents
  – Named-entity (Kim et al 2003)
  – Part-of-speech (Tateisi et al 2004)
  – Parse tree
  – Co-reference (Institute of Infocomm Research, Singapore)

Page 6: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

GENIA part-of-speech corpus

• Each token is annotated with its part-of-speech tag.
• Size
  – 2,000 abstracts
  – 20,544 sentences
  – 501,054 words (about half the size of the Penn Treebank)

The  peri-kappa  B   site  mediates  human  immunodeficiency
DT   NN          NN  NN    VBZ       JJ     NN

virus  type  2   enhancer  activation  in  monocytes  …
NN     NN    CD  NN        NN          IN  NNS

Page 7: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

GENIA named-entity corpus

• Terms are annotated based on the semantic classes in the GENIA ontology
• Size
  – 2,000 abstracts
  – Number of terms: 92,723
  – Vocabulary size: 36,568

The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …

[Figure: terms in the sentence above labeled with the semantic classes DNA, virus, and cell_type]

Page 8: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

GENIA treebank

• Based on the standard of the Penn Treebank
• Size
  – 500 abstracts
  – (1,500 abstracts by the end of this summer)

CD3-epsilon expression is controlled by a downstream T lymphocyte-specific enhancer element

[Figure: parse tree of the sentence above, with node labels NP, ADJP, NP, PP, VP, VP, S]

Page 9: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Event annotation

Few known genes (IL-2, members of the IL-8 family, interferon-gamma) are induced in T cells only through the combined effect of phorbol myristic acetate (PMA) and a Ca(2+)-ionophore, and expression of only these genes can be fully suppressed by Cyclosporin A (CyA).

[Figure: event annotation of the sentence above. Entities shown: T cell, IL-2, IL-8 family (IL-8), interferon-gamma (IFN-γ), Ca(2+)-i, PMA, and CyA (marked ×). Legend: Target, Interaction, Agent, Location]

Page 10: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Event annotation

Few known genes (IL-2, members of the IL-8 family, interferon-gamma) are induced in T cells only through the combined effect of phorbol myristic acetate (PMA) and a Ca(2+)-ionophore, and expression of only these genes can be fully suppressed by Cyclosporin A (CyA).

[Figure: event annotation of the sentence above. Entities shown: T cell, IL-2, IL-8 family (IL-8), interferon-gamma (IFN-γ), Ca(2+)-i, and PMA. Legend: Target, Interaction, Agent, Location]

Page 11: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Event annotation

Few known genes (IL-2, members of the IL-8 family, interferon-gamma) are induced in T cells only through the combined effect of phorbol myristic acetate (PMA) and a Ca(2+)-ionophore, and expression of only these genes can be fully suppressed by Cyclosporin A (CyA).

[Figure: event annotation of the sentence above. Entities shown: T cell, IL-2, IL-8 family (IL-8), interferon-gamma (IFN-γ), and CyA (marked ×). Legend: Target, Interaction, Agent, Location]

Page 12: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

GENIA corpus

• Used in more than 240 institutions
  – Japan (28), Asia (54), North America (63), Europe (62), etc.
• De facto standard for evaluating biomedical named-entity recognition systems
  – BioNLP workshop at Coling 2004
    • Named-entity recognition shared task
      – Institute for Infocomm Research (Singapore)
      – Stanford University (USA)
      – University of Edinburgh (UK)
      – University of Wisconsin-Madison (USA)
      – Pohang University of Science and Technology (Korea)
      – University of Alberta (Canada)
      – University Duisburg-Essen (Germany)
      – Korea University (Korea)
      – National Taiwan University (Taiwan)

Page 13: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

NLP tools

• Biomedical text mining
  – Huge amount of text.
• Machine learning
  – Training set can be very large.
  – Efficient training algorithms.
• Taggers (and parsers)
  – Decoding should be fast.

Page 14: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Machine learning

• Supervised learning
  – Learns the rules for classifying samples into predefined classes by seeing a large number of training samples with class labels.
• Algorithms
  – Naïve Bayes, Decision Tree, SVMs, AdaBoost, Perceptron, Random forests, Maximum Entropy, etc.

Page 15: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Maximum entropy learning

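• Log-linear modeling
• Maximum likelihood estimation
  – Determines the parameters so that they maximize the likelihood of the training data

$$ q(x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{F} \lambda_i f_i(x) \right) $$

where $f_i$ is a feature function, $\lambda_i$ is its feature weight, and $Z$ is the normalization constant.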
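As a rough illustration of the log-linear form on this slide, the Python sketch below evaluates q(x) over a small set of outcomes. The tag set, feature functions, and weights are hypothetical and untrained; they only show how the weighted feature sum and the normalizer Z fit together.

```python
import math

def log_linear_distribution(outcomes, feature_functions, weights):
    """Return q(x) = exp(sum_i w_i * f_i(x)) / Z for every outcome x."""
    scores = {
        x: math.exp(sum(w * f(x) for w, f in zip(weights, feature_functions)))
        for x in outcomes
    }
    z = sum(scores.values())  # normalization constant Z
    return {x: s / z for x, s in scores.items()}

# Hypothetical toy setup: choosing a POS tag for one ambiguous word.
outcomes = ["NN", "JJ", "VBG"]
feature_functions = [
    lambda tag: 1.0 if tag == "NN" else 0.0,   # f_1 fires for nouns
    lambda tag: 1.0 if tag == "VBG" else 0.0,  # f_2 fires for gerunds
]
weights = [1.2, 0.4]  # assumed feature weights, not trained values

q = log_linear_distribution(outcomes, feature_functions, weights)
```

Maximum likelihood training would adjust `weights` so that the probability assigned to the labeled training samples is as high as possible; that step is omitted here.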

Page 16: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Maximum entropy modeling with inequality constraints

(Kazama and Tsujii 2003)

• Advantages over standard ME modeling
  – Good regularization effects (as good as a Gaussian prior)
  – Sparse solution
• C++ implementation
  – Offers fast training.
  – Can be used as a library.
  – Can incorporate the model into your source code.
• The C++ library is used in many NLP programs (e.g. POS tagger, chunkers, IE modules)

Page 17: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Part-of-speech tagging

• A PoS tagger annotates each token with its part-of-speech tag.

The  peri-kappa  B   site  mediates  human  immunodeficiency
DT   NN          NN  NN    VBZ       JJ     NN

virus  type  2   enhancer  activation  in  monocytes  …
NN     NN    CD  NN        NN          IN  NNS

Page 18: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Chunking (shallow parsing)

• A chunker (shallow parser) segments a sentence into non-recursive phrases.

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .

Page 19: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Chunking (shallow parsing)

• Chunking tasks can be converted into a standard tagging task.

He   reckons  the  current  account  deficit  will  narrow  to
BNP  BVP      BNP  INP      INP      INP      BVP   IVP     BPP

only  #    1.8  billion  in   September  .
BNP   INP  INP  INP      BPP  BNP
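The conversion from chunks to per-token tags can be sketched in a few lines of Python. This is only an illustration: the function name and the chunk representation are made up for the example, and tagging the final period as "O" (outside any chunk) is an assumption borrowed from common IOB practice rather than something shown on the slide.

```python
def chunks_to_tags(chunks):
    """Convert (label, tokens) chunks into one B/I-style tag per token.

    A label of None marks tokens outside any chunk; they receive "O",
    which is an assumption (the slide leaves the period untagged).
    """
    tagged = []
    for label, tokens in chunks:
        for position, token in enumerate(tokens):
            if label is None:
                tag = "O"
            else:
                tag = ("B" if position == 0 else "I") + label
            tagged.append((token, tag))
    return tagged

chunks = [
    ("NP", ["He"]),
    ("VP", ["reckons"]),
    ("NP", ["the", "current", "account", "deficit"]),
    ("VP", ["will", "narrow"]),
    ("PP", ["to"]),
    ("NP", ["only", "#", "1.8", "billion"]),
    ("PP", ["in"]),
    ("NP", ["September"]),
    (None, ["."]),
]
tagged = chunks_to_tags(chunks)
```

Once every token carries a single tag, any sequence tagger (such as a PoS tagger) can be trained on the chunking data without modification.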

Page 20: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Sequential Classification Approaches

• Sequence tagging tasks
  – Find the tag sequence that maximizes the following probability given the observation (e.g. words):

$$ P(t_1 \ldots t_n \mid o) $$

• Left-to-right decomposition (with the first-order Markov assumption)

$$ P(t_1 \ldots t_n \mid o) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, o) $$

• Right-to-left decomposition (with the first-order Markov assumption)

$$ P(t_1 \ldots t_n \mid o) = \prod_{i=1}^{n} P(t_i \mid t_{i+1}, o) $$

Each local factor is a classification problem.
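To make the left-to-right decomposition concrete, here is a minimal Viterbi sketch in Python that searches for the tag sequence maximizing the product of local probabilities P(t_i | t_{i-1}, o). The local classifier `toy_local_prob` and its numbers are hypothetical stand-ins for a trained model such as a maximum entropy classifier.

```python
def viterbi(observations, tags, local_prob):
    """Find the tag sequence maximizing prod_i P(t_i | t_{i-1}, o).

    local_prob(i, tag, prev_tag, observations) is the local classifier;
    prev_tag is None at the first position.
    """
    # best[tag] = (probability, path) of the best sequence ending in tag
    best = {t: (local_prob(0, t, None, observations), [t]) for t in tags}
    for i in range(1, len(observations)):
        new_best = {}
        for t in tags:
            new_best[t] = max(
                ((p * local_prob(i, t, prev, observations), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])

def toy_local_prob(i, tag, prev_tag, observations):
    """Hypothetical local classifier with made-up probabilities."""
    if observations[i] == "the":
        return 0.9 if tag == "DT" else 0.1
    if prev_tag == "DT":  # a noun is likelier right after a determiner
        return 0.8 if tag == "NN" else 0.2
    return 0.5

probability, path = viterbi(["the", "site"], ["DT", "NN"], toy_local_prob)
```

The right-to-left decomposition can be decoded the same way by running the search over the reversed sentence with a classifier conditioned on the following tag.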

Page 21: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Bidirectional Inference

• Possible decomposition structures

[Figure: four possible decomposition structures, (a) to (d), over the tags t1, t2, t3]

• Bidirectional inference algorithm (Tsuruoka et al.)
  – We can find the “best” structure and tag sequences in polynomial time

Page 22: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

State-of-the-art PoS taggers

• Tagging speed and accuracy on Penn Treebank

Tagger                  Tagging speed    Accuracy (%)
Dependency Net (2003)   Very slow        97.24
Perceptron (2002)       ?                97.11
SVM (2003)              Fast             97.05
HMM (2000)              Extremely fast   96.48
Bidirectional MEMM      Very fast        97.10

Page 23: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

State-of-the-art Chunkers

• Chunking speed and accuracy on Penn Treebank

Chunker               Speed       Accuracy
Perceptron (2003)     ?           93.74
SVM + voting (2003)   Slow?       93.91
SVM (2000)            Fast        93.48
Bidirectional MEMM    Very fast   93.70

Page 24: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Named-entity recognition

• Recognizing named entities in text
• Similar to chunking
  – IOB tagging
• Named entities in the biomedical domain are long.
  – Sliding window

The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …

[Figure: terms in the sentence above labeled with the semantic classes DNA, virus, and cell_type]

Page 25: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

A sliding window approach to biomedical NE recognition

• We want to use rich features on a “term”.

• Enumerate all contiguous word sequences in a sentence.

• Classify them into semantic classes.

W1 W2 W3 W4
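A minimal Python sketch of the enumeration step follows. The span-length limit, the `toy_classifier` function, and its tiny lexicon are hypothetical; in a real system the classifier would be a trained model using features computed over the whole candidate term.

```python
def enumerate_spans(words, max_length):
    """Yield (start, end, tokens) for every contiguous span of up to max_length words."""
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_length, len(words)) + 1):
            yield start, end, words[start:end]

def toy_classifier(tokens):
    """Hypothetical stand-in for a trained classifier over candidate terms."""
    lexicon = {
        ("monocytes",): "cell_type",
        ("peri-kappa", "B", "site"): "DNA",
    }
    return lexicon.get(tuple(tokens), "other")

words = "The peri-kappa B site mediates enhancer activation in monocytes".split()
entities = [
    (start, end, label)
    for start, end, tokens in enumerate_spans(words, max_length=4)
    for label in [toy_classifier(tokens)]
    if label != "other"
]
```

Because every candidate span is classified as a unit, features such as the full term string or its head word are available directly, which is harder to arrange in token-by-token IOB tagging.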

Page 26: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Accuracy of biomedical NE recognition

System               Recall   Precision   F-score
SVM+HMM (Zho 2004)   76.0     69.4        72.6
Sliding window       71.5     70.2        70.8
MEMM (Fin 2004)      71.6     68.6        70.1
CRF (Set 2004)       70.3     69.3        69.8

• Shared task at Coling 2004 BioNLP workshop

Page 27: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

HPSG parsing

• HPSG
  – A few schemata
  – Many lexical entries
  – Deep syntactic analysis
• Grammar
  – Corpus-based grammar construction (Miyao et al 2004)
• Parser
  – Beam search (Tsuruoka et al.)

[Figure: HPSG derivation of "Mary walked slowly".
Lexical entries: Mary (HEAD: noun, SUBJ: <>, COMPS: <>); walked (HEAD: verb, SUBJ: <noun>, COMPS: <>); slowly (HEAD: adv, MOD: verb).
Head-modifier schema: "walked slowly" (HEAD: verb, SUBJ: <noun>, COMPS: <>).
Subject-head schema: "Mary walked slowly" (HEAD: verb, SUBJ: <>, COMPS: <>)]

Page 28: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Phrase structure

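The  company  is   run  by  him
DT   NN       VBZ  VBN  IN  PRP

[Figure: phrase-structure tree over the sentence above. Lexical nodes: dt, np, vp, vp, pp, np. Phrasal nodes, bottom to top: np, pp, vp, vp, s]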

Page 29: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Predicate-argument structure

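The  company  is   run  by  him
DT   NN       VBZ  VBN  IN  PRP

[Figure: the same phrase-structure tree (lexical nodes dt, np, vp, vp, pp, np; phrasal nodes np, pp, vp, vp, s), overlaid with predicate-argument links labeled arg1, arg2, and mod]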

Page 30: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

IR search engine using predicate-argument structures

Page 31: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Feature forest model (Miyao and Tsujii 2002)

• A maximum entropy model is defined for the entire tree structure
  – e.g. HPSG parse trees
• Exponentially many trees are represented with a packed forest of polynomial size
• The probability of each tree is estimated without unpacking the feature forest

[Figure: a feature forest with nodes S, NP1, NP2, VP1, VP2, contrasting the number of trees with the size of the feature forest]
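The idea of working on the packed representation can be illustrated with a small dynamic program in Python. This is a simplified sketch, not the estimation algorithm of Miyao and Tsujii: the forest layout, node names, and weights are hypothetical, and only the normalization constant and tree count are computed.

```python
from functools import lru_cache
from math import prod

# Hypothetical packed forest: each node maps to its alternatives, and each
# alternative is (local weight, child nodes). The weights are made up.
forest = {
    "S":  [(1.0, ("NP", "VP"))],
    "NP": [(2.0, ()), (0.5, ())],   # alternatives NP1, NP2
    "VP": [(1.5, ()), (1.0, ())],   # alternatives VP1, VP2
}

@lru_cache(maxsize=None)
def inside(node):
    """Sum of the weights of all trees rooted at node, without enumerating them."""
    return sum(
        weight * prod(inside(child) for child in children)
        for weight, children in forest[node]
    )

@lru_cache(maxsize=None)
def count_trees(node):
    """Number of distinct trees rooted at node."""
    return sum(
        prod(count_trees(child) for child in children)
        for _, children in forest[node]
    )

z = inside("S")            # normalization constant over all packed trees
n_trees = count_trees("S")
```

Each node is visited once thanks to memoization, so the cost grows with the size of the forest rather than with the number of trees it encodes. The probability of one particular tree is the product of its local weights divided by `z`.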

Page 32: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Automatic Generation of Spelling Variants

• Variant generator

  NF-Kappa B → Generator →
    NF-Kappa B (1.0)
    NF Kappa B (0.9)
    NF kappa B (0.6)
    NF kappaB (0.5)
    NFkappaB (0.3)
    :

Each generated variant is associated with its generation probability

Page 33: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Generation Algorithm

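• Recursive generation

  P = P' × P_op

[Figure: generation tree.
T cell (1.0) → T-cell (0.5), operation probability 0.5
T cell (1.0) → T cells (0.2), operation probability 0.2
T-cell (0.5) → T-cells (0.1), operation probability 0.2]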
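The recursion P = P' × P_op can be sketched in Python as a best-first search. The two rewrite operations and their probabilities below are hypothetical, chosen to mirror the "T cell" example; in the system described here the operations and their probabilities are learned from data.

```python
import heapq

def toy_operations(term):
    """Hypothetical rewrite operations, each with an assumed probability P_op."""
    results = []
    if " " in term:
        results.append((term.replace(" ", "-", 1), 0.5))  # space -> hyphen
    if not term.endswith("s"):
        results.append((term + "s", 0.2))                  # append plural "s"
    return results

def generate_variants(term, operations, threshold):
    """Best-first generation: each variant gets P = P' * P_op from its parent."""
    best = {}
    heap = [(-1.0, term)]  # negated probabilities turn heapq into a max-heap
    while heap:
        neg_p, current = heapq.heappop(heap)
        if current in best:
            continue  # already reached with an equal or higher probability
        best[current] = -neg_p
        for variant, p_op in operations(current):
            p = -neg_p * p_op
            if p >= threshold and variant not in best:
                heapq.heappush(heap, (-p, variant))
    return best

variants = generate_variants("T cell", toy_operations, threshold=0.05)
```

The threshold keeps the recursion finite: a branch is abandoned as soon as its accumulated probability falls below it.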

Page 34: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Learning Operation Rules

• Operations for generating variants
  – Substitution
  – Deletion
  – Insertion
• Context
  – Character-level context: preceding (following) two characters
• Operation probability

$$ P(\text{operation} \mid \text{context}) = \frac{f(\text{context}, \text{operation})}{f(\text{context})} $$
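The relative-frequency estimate of the operation probability can be written as a short Python sketch. The observations below are hypothetical; each pairs a character context (the two preceding and two following characters) with the operation seen in that context.

```python
from collections import Counter

def estimate_operation_probs(observations):
    """Estimate P(operation | context) = f(context, operation) / f(context)."""
    context_counts = Counter(context for context, _ in observations)
    pair_counts = Counter(observations)
    return {
        (context, operation): count / context_counts[context]
        for (context, operation), count in pair_counts.items()
    }

# Hypothetical training observations: ((preceding, following), operation).
observations = [
    (("ti", "in"), "insert '-'"),
    (("ti", "in"), "insert '-'"),
    (("ti", "in"), "no change"),
    (("mo", "ur"), "delete 'u'"),
]
probs = estimate_operation_probs(observations)
```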

Page 35: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Example of variant generation (1)

Generation probability   Generated variant           Frequency
1.0 (input)              antiinflammatory effect     7
0.462                    anti-inflammatory effect    33
0.393                    antiinflammatory effects    6
0.356                    Antiinflammatory effect     0
0.286                    antiinflammatory-effect     0
0.181                    anti-inflammatory effects   23
:                        :                           :

Page 36: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Example of variant generation (2)

Generation probability   Generated variant              Frequency
1.0 (input)              tumour necrosis factor alpha   15
0.492                    tumor necrosis factor alpha    126
0.356                    tumour necrosis factor-alpha   30
0.235                    Tumour necrosis factor alpha   2
0.175                    tumor necrosis factor-alpha    182
0.115                    Tumor necrosis factor alpha    8
:                        :                              :

Page 37: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Domain Adaptation

• Newspaper articles are widely used as training data for machine learning-based NLP systems.

• Domain adaptability
  – Part-of-speech tagging
  – HPSG parsing

Page 38: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Tagging errors by TnT tagger (Brants 2000)

…  and  membrane  potential  after  mitogen  binding  .
   CC   NN        NN         IN     NN       JJ

…  two  factors  ,  which  bind  to  the  same  kappa  B   enhancers  …
   CD   NNS         WDT    NN    TO  DT   JJ    NN     NN  NNS

…  by  analysing  the  Ag   amino  acid  sequence  .
   IN  VBG        DT   VBG  JJ     NN    NN

…  to  contain  more  T-cell  determinants  than  …
   TO  VB       RBR   JJ      NNS           IN

Stimulation  of  interferon  beta  gene  transcription  in  vitro  by
NN           IN  JJ          JJ    NN    NN             IN  NN     IN

Page 39: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Accuracy of TnT tagger on the GENIA corpus

• Ignoring unessential errors

Tag distinctions ignored   Accuracy
TnT (original)             84.4%
NNP = NN, NNPS = NNS       90.0%
LS = NN                    91.3%
JJ = NN                    94.9%

About 94% in practice
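Scoring that ignores unessential tag distinctions can be sketched in Python by mapping tags to a canonical form before comparing them. The four-token example is hypothetical and is not taken from the GENIA corpus; only the merged tag pairs (NNP = NN, NNPS = NNS, JJ = NN) come from the table.

```python
def accuracy(gold, predicted, merged=None):
    """Token accuracy after mapping each tag through `merged` (tag -> canonical tag)."""
    merged = merged or {}
    matches = sum(
        merged.get(g, g) == merged.get(p, p) for g, p in zip(gold, predicted)
    )
    return matches / len(gold)

# Hypothetical gold and predicted tags for four tokens.
gold = ["NN", "NNS", "NN", "VBZ"]
predicted = ["NNP", "NNPS", "JJ", "VBZ"]

strict = accuracy(gold, predicted)
lenient = accuracy(gold, predicted, {"NNP": "NN", "NNPS": "NNS"})
most_lenient = accuracy(gold, predicted, {"NNP": "NN", "NNPS": "NNS", "JJ": "NN"})
```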

Page 40: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

GENIA tagger

• An MEMM tagger trained on the WSJ and GENIA corpora

Training corpus   Accuracy on WSJ   Accuracy on GENIA
WSJ               97.0              84.3
GENIA             75.2              98.1
WSJ+GENIA         96.9              98.1

The tagger works well on both types of text.

Page 41: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Parsing MEDLINE with the HPSG parser

• Parsing accuracy on the GENIA Treebank

                    #sentences   LP / LR       UP / UR
All sentences       1,556        82.8 / 81.5   86.4 / 85.1
Covered sentences   1,104        86.8 / 86.5   88.7 / 88.4

Page 42: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Extracting Disease-Gene Associations from MEDLINE abstracts

These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles

Furthermore, AML cells with methylated p15(INAK4B) tended to express higher levels of DNMT1 and 3B.

Page 43: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Text

• 1.5 million MEDLINE abstracts
  – Selected by MeSH Terms
    • “Disease Category” AND (“Amino Acids, Peptides, and Proteins” OR “Genetic Structures”)
• Parsing
  – All the sentences were parsed by the HPSG parser
  – Using a PC cluster (100 processors with GXP)
  – Time: 10 days

Page 44: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Training data

All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.

Dominant radial drusen and Arg345Trp EFEMP1 mutation.

The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.

These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.

• All co-occurrences are classified into “relevant” or “irrelevant” by a domain expert.

Page 45: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Predicate-argument features (1)

• Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation.

[Figure: feature pattern in which a predicate X takes a gene or disease mention as its ARG2]

Page 46: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Predicate-argument features (2)

• These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.

• Furthermore, AML cells with methylated p15(INAK4B) tended to express higher levels of DNMT1 and 3B.

[Figure: feature pattern in which a predicate X links a disease mention and a gene mention as its ARG1 and ARG2]

Page 47: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Extraction accuracy

• Training/test data: 2,253 sentences

• 10-fold cross validation

Features                          Recall   Precision   F-score
N/A                               1.0      0.351       0.520
+ bag of words                    0.733    0.682       0.706
+ local context                   0.733    0.695       0.714
+ predicate-argument structures   0.759    0.710       0.733
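The F-score column is the harmonic mean of recall and precision. The short Python check below recomputes it from the reported recall and precision; because those inputs are themselves rounded to three digits, the recomputed values agree with the reported F-scores only to within about 0.001.

```python
def f_score(recall, precision):
    """Balanced F-score: the harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# (recall, precision, reported F-score) as given in the table.
rows = {
    "N/A": (1.0, 0.351, 0.520),
    "+ bag of words": (0.733, 0.682, 0.706),
    "+ local context": (0.733, 0.695, 0.714),
    "+ predicate-argument structures": (0.759, 0.710, 0.733),
}
recomputed = {name: f_score(r, p) for name, (r, p, _) in rows.items()}
differences = {name: abs(recomputed[name] - rows[name][2]) for name in rows}
```

The "N/A" row corresponds to labeling every co-occurrence as relevant, which gives perfect recall and a precision equal to the share of relevant pairs.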

Page 48: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

DGA explorer

Page 49: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Summary

• The GENIA corpus
  – Part-of-speech: 2,000 abstracts
  – Named entities: 2,000 abstracts
  – Parse tree: 500 abstracts
• Machine learning
  – Maximum entropy modeling
    • Inequality constraints
    • Feature forests
  – Bidirectional inference for sequence tagging
• NLP tools
  – Part-of-speech tagger: 97.11%
  – Chunker: 93.7%
  – HPSG parser: 87.5%
  – Term variant generation

• Extracting disease-gene associations from MEDLINE

Page 50: Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Software and resources

• Machine learning packages
  – Maximum entropy with inequality constraints
  – Maximum entropy for feature forests
• Taggers and parsers
  – PoS tagger
  – Chunker
  – Named-entity tagger
  – HPSG parser
• GENIA resource
  – Named-entity corpus
  – Part-of-speech corpus
  – Tree corpus
  – Co-reference corpus (Singapore Univ.)
  – HPSG parsed results (100,000 MEDLINE abstracts)