Page 1: Usami bionlp2011

Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition

Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii

Graduate School of Information Science and Technology, University of Tokyo

Page 4: Usami bionlp2011

Introduction

Named Entity Recognition

Token:  AM  ,  cystatin  C  and  cathepsin  B  are  present  as  ...
Label:  B   O  B         I  O    B          I  O    O        O

Labels: B = Beginning of NE, I = Inside of NE, O = Out of NE

Recent approach:

Machine learning on a manually annotated corpus

• BioCreAtIvE task 1A (Yeh et al., 2005)

• Semi-supervised (Vlachos and Gasperin, 2010)
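The B/I/O labels can be decoded back into entity mentions. The following is a minimal Python sketch, assuming the label sequence on the slide aligns with the example tokens as B O B I O B I O O O (this alignment is reconstructed from the transcript, and `bio_to_mentions` is a hypothetical helper name, not something from the talk).

```python
def bio_to_mentions(tokens, labels):
    """Collect each B token plus its following I tokens as one mention."""
    mentions, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open mention, closes the mention
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ["AM", ",", "cystatin", "C", "and", "cathepsin", "B", "are", "present", "as"]
labels = ["B", "O", "B", "I", "O", "B", "I", "O", "O", "O"]  # assumed alignment
mentions = bio_to_mentions(tokens, labels)
print(mentions)
```

Under that assumed alignment, the decoded mentions are "AM", "cystatin C" and "cathepsin B".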

Page 6: Usami bionlp2011

Introduction

Machine learning on a manually annotated corpus is expensive:

• Cost

• Time

Page 11: Usami bionlp2011

Our Idea

Utilize inexpensive and large resources:

• Lexical database

• Unlabeled text

1. Build a dictionary from the lexical database

2. String match the dictionary against the unlabeled text

3. Acquire an annotated corpus for training
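The string-match step can be sketched in Python as a greedy longest match of dictionary entries over tokens, emitting B/I/O labels. This is only an illustration under assumptions: the slides do not say how matching is implemented, `label_by_string_match` is a hypothetical name, and the sentence is invented for the example.

```python
def label_by_string_match(tokens, dictionary):
    """Label tokens B/I/O by greedy longest match against dictionary entries."""
    entries = {tuple(entry.split()) for entry in dictionary}
    max_len = max((len(entry) for entry in entries), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # try the longest candidate span first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entries:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


dictionary = {"CD177", "CD177 molecule", "NB1", "PRV1", "HNA2A"}
tokens = "Expression of CD177 molecule and PRV1 was measured".split()  # invented sentence
labels = label_by_string_match(tokens, dictionary)
print(list(zip(tokens, labels)))
```

Preferring the longest match labels "CD177 molecule" as one two-token mention rather than stopping at "CD177".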

Page 12: Usami bionlp2011

Dictionary Building

Example record in the lexical database:

• Symbol: CD177

• Official Name: CD177 molecule

• Synonyms: NB1, PRV1, HNA2A, CD177

Dictionary entries built from this record:

CD177, CD177 molecule, NB1, PRV1, HNA2A
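A minimal Python sketch of this step, assuming each lexical-database record is available as a dict; the field names `symbol`, `official_name` and `synonyms` are hypothetical and do not reflect the actual Entrez Gene file format.

```python
def build_dictionary(records):
    """Collect symbol, official name and synonyms of each record, dropping duplicates."""
    entries = []
    seen = set()
    for record in records:
        names = [record["symbol"], record["official_name"], *record["synonyms"]]
        for name in names:
            if name not in seen:
                seen.add(name)
                entries.append(name)
    return entries


# Values taken from the CD177 example; the dict layout is an assumption.
record = {
    "symbol": "CD177",
    "official_name": "CD177 molecule",
    "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
}
entries = build_dictionary([record])
print(entries)
```

The synonym list repeats the symbol CD177, so the sketch deduplicates while keeping first-seen order.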

Page 18: Usami bionlp2011

Task Settings

Task: Single class NER

Target Class: Gene-or-gene-product (GGP)

Resources:

• Lexical database: Entrez Gene, which includes 6,816,109 gene (protein) records

• Unlabeled text: 2009 MEDLINE, which includes 17,764,827 articles

Page 19: Usami bionlp2011

Ineffectiveness of Simple Approach

Dictionary-based NER

• String match the dictionary against the test data

ML-based NER trained on acquired training data

1. String match the dictionary against the unlabeled text

2. Use the matched text as training data

3. Learn a model

4. Apply the model to the test data

Page 27: Usami bionlp2011

Ineffectiveness of Simple Approach

Method                                          Prec.   Recall   F1
Dictionary-based NER                            39.03   42.69    40.78
ML-based NER trained on acquired training data  10.18   23.83    14.27

Page 28: Usami bionlp2011

Problem of Simple Approach

Stats: Acquired 1,715,344,107 labeled tokens, of which 10.0% are NEs

Examples

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.

String matching labels "AM" as an NE in both sentences, although in (B) it refers to the time of day.

Page 29: Usami bionlp2011

Goal of This Study

Our Contribution

Acquire huge, high-quality training data with a lexical database and unlabeled text

Methodology

1. Utilize references (links) for disambiguation

2. Expand NEs based on coordination analysis

3. Gain new NEs by using self-training

Page 30: Usami bionlp2011

Disambiguation

Utilize lexical database references

record: AM

reference: PMID 1984484

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.

The AM record references PMID 1984484, so "AM" is labeled in (A) but not in (B).
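The reference check can be sketched in Python as a filter on string matches: a dictionary entry is labeled only when its record cites the article being processed. The sketch assumes single-token entries and a hypothetical `references` mapping from entry to a set of PMIDs; neither detail is specified in the slides.

```python
def label_with_references(pmid, tokens, references):
    """Label a single-token dictionary match only if its record cites this article."""
    return ["B" if pmid in references.get(token, set()) else "O" for token in tokens]


# Hypothetical mapping built from the "AM" record on the slide.
references = {"AM": {1984484}}

tokens_a = "in culture media of AM , cystatin C and cathepsin B are present".split()
tokens_b = "Temperature in puerperium is higher in AM , lower in PM .".split()

labels_a = label_with_references(1984484, tokens_a, references)
labels_b = label_with_references(23456, tokens_b, references)
print(labels_a)
print(labels_b)
```

With this filter "AM" is labeled in the sentence from PMID 1984484 and left as O in the sentence from PMID 23456.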

Page 33: Usami bionlp2011

Side Effect of Using References

Lack of references in the lexical database

Record   Referenced PMID
entA     19025
entB     1021
entC     4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

String match if referred: only entC is labeled in PMID 4928016; entA and entB are missed.

Page 34: Usami bionlp2011

Expand NEs based on coordination structure

record: entA, entB, entC

ref: PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

1. Start from a mention labeled through its reference (here, entC).

2. Check whether the neighboring token is a coordinate token (here, "and", then ",").

3. If it is, check whether the next mention is included in the dictionary (entB, then entA: yes).

4. If yes, label that mention as an NE and repeat from step 2.

5. End when the neighboring token is not a coordinate token or the mention is not included in the dictionary.
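The walk shown on these slides can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it handles single-token mentions only, and the set `COORDINATE_TOKENS` is an assumption because the slides do not list which tokens count as coordinate tokens.

```python
COORDINATE_TOKENS = {",", "and", "or"}  # assumed set, not given in the slides


def expand_by_coordination(tokens, labels, dictionary):
    """Spread NE labels to coordinated single-token mentions found in the dictionary."""
    labels = list(labels)
    seeds = [i for i, label in enumerate(labels) if label == "B"]
    for seed in seeds:
        for step in (-1, 1):  # walk left, then right, from each labeled mention
            i = seed
            while (0 <= i + 2 * step < len(tokens)
                   and tokens[i + step] in COORDINATE_TOKENS
                   and tokens[i + 2 * step] in dictionary):
                i += 2 * step
                labels[i] = "B"
    return labels


tokens = ["three", "genes", "concerned", "(", "designated",
          "entA", ",", "entB", "and", "entC", ")"]
labels = ["O"] * len(tokens)
labels[9] = "B"  # only entC is labeled through its reference
expanded = expand_by_coordination(tokens, labels, {"entA", "entB", "entC"})
print(list(zip(tokens, expanded)))
```

Starting from entC, the walk crosses "and" to reach entB, crosses "," to reach entA, and stops at "designated", which is neither a coordinate token nor a dictionary entry.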

Page 48: Usami bionlp2011

Self-training

1. Learning: train a classifier model on the training data

2. Apply: run the model on the remaining data

3. Add new NEs: add the newly recognized NEs to the training data
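A generic self-training loop following this diagram could look like the Python sketch below. Everything here is illustrative: the toy `train` and `predict` functions (which memorize NE tokens and the word preceding them) stand in for a real classifier, the stopping rule is an assumption, and the sentences are invented.

```python
def train(training):
    """Toy model: memorize NE tokens and the words that directly precede them."""
    names, cues = set(), set()
    for tokens, labels in training:
        for i, label in enumerate(labels):
            if label == "B":
                names.add(tokens[i])
                if i > 0:
                    cues.add(tokens[i - 1])
    return names, cues


def predict(model, tokens):
    """Label a token B if it is a known name or follows a known cue word."""
    names, cues = model
    return ["B" if tok in names or (i > 0 and tokens[i - 1] in cues) else "O"
            for i, tok in enumerate(tokens)]


def self_train(training, remaining, rounds=3):
    """Learn, apply to the remaining data, add sentences with new NEs, repeat."""
    training, remaining = list(training), list(remaining)
    for _ in range(rounds):
        model = train(training)
        still_remaining = []
        for tokens in remaining:
            labels = predict(model, tokens)
            if "B" in labels:
                training.append((tokens, labels))
            else:
                still_remaining.append(tokens)
        if len(still_remaining) == len(remaining):
            break  # nothing new was added in this round
        remaining = still_remaining
    return train(training), training


seed = [(["gene", "entA", "was", "cloned"], ["O", "B", "O", "O"])]
pool = [["gene", "entB", "binds", "DNA"], ["nothing", "relevant", "here"]]
model, training = self_train(seed, pool)
print(sorted(model[0]), len(training))
```

In this toy run the cue word "gene" lets the model pick up entB from the remaining data, so the final training set holds two sentences.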

Page 49: Usami bionlp2011

Evaluation Settings

Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)

Learning and Decoding: Linear kernel SVM (predict each token label sequentially)

Page 50: Usami bionlp2011

NER Results

Dic-based

Method             Prec.   Recall   F1
String match       39.03   42.69    40.78
+ References       90.62   13.52    23.53
+ Coord Analysis   89.66   13.77    23.87

ML-based

Method             Prec.   Recall   F1
String match       10.18   23.83    14.27
+ References       69.25   39.12    50.00
+ Coord Analysis   66.79   47.44    55.47
+ Self-training    63.72   51.18    56.77
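F1 is the harmonic mean of precision and recall, so the reported scores can be recomputed from the other two columns. A short Python check, with the rows copied from the results on this slide (the function name `f1_score` is just a label for this sketch):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results
rows = [
    (39.03, 42.69, 40.78),  # Dic-based, string match
    (90.62, 13.52, 23.53),  # Dic-based, + references
    (89.66, 13.77, 23.87),  # Dic-based, + coordination analysis
    (10.18, 23.83, 14.27),  # ML-based, string match
    (69.25, 39.12, 50.00),  # ML-based, + references
    (66.79, 47.44, 55.47),  # ML-based, + coordination analysis
    (63.72, 51.18, 56.77),  # ML-based, + self-training
]
for p, r, reported in rows:
    print(f"P={p:6.2f} R={r:6.2f} reported={reported:6.2f} recomputed={f1_score(p, r):.3f}")
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.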

Page 52: Usami bionlp2011

Automatic vs Manual

Type        Total tokens   NE tokens
Manual      161,577        12,603
Automatic   48,677,426     3,055,362

NER performance, trained on each corpus

[Bar chart: P, R and F1 for Manual vs. Automatic. Values shown: 62.66, 67.89, 57.92, 58.56, 68.26, 80.76]

Highlighted: F1: 67.89, F1: 62.66

Page 53: Usami bionlp2011

Conclusion

Acquired high-quality training data automatically

• Use of references for high precision

• Improved recall with

‣ Coordination analysis

‣ Self-training

Acquired large-scale training data

• Used 10% (memory limitation)

Page 54: Usami bionlp2011

Future Work

Utilize all of the acquired training data for learning

‣ Online learning

Improve self-training performance

Semi-supervised approach with acquired data

Apply to another domain or semantic class