
TM and NLP for Biology: Research Issues in HPSG Parsing

Junichi TSUJII

School of Computer Science
National Centre for Text Mining
University of Manchester, UK

Department of Computer Science
School of Information Science and Technology
University of Tokyo, JAPAN

Increase in Medline

[Figure: yearly increments and accumulation of Medline records, 1964 to 2002. x-axis: year; one axis shows increments (0 to 600,000), the other accumulation (0 to 14,000,000).]

G-protein coupled receptor
– Before 1988: 9 papers
– 1992: 256 papers
– 2005: 14,000 papers

Articles added (MEDLINE alone)
– More than 0.5 million per year
– More than 1.3 thousand per day

Medline Access [D.L.Banville 2006]
– 1997: 0.163 M accesses/month
– 2006: 82.027 M accesses/month
– 500 times more
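The two headline ratios on this slide can be checked with a few lines of arithmetic. The sketch below only restates the quoted figures; the constant and function names are illustrative, and the yearly article count is taken at its stated lower bound.

```python
# Figures quoted on the slide.
ACCESSES_1997 = 0.163        # million accesses per month, 1997
ACCESSES_2006 = 82.027       # million accesses per month, 2006
ARTICLES_PER_YEAR = 500_000  # lower bound: "more than 0.5 million per year"


def growth_factor(old: float, new: float) -> float:
    """Ratio of the new value to the old one."""
    return new / old


def per_day(per_year: float, days_per_year: float = 365.0) -> float:
    """Average daily count for a yearly total."""
    return per_year / days_per_year


if __name__ == "__main__":
    growth = growth_factor(ACCESSES_1997, ACCESSES_2006)
    print(f"access growth 1997 to 2006: about {growth:.0f}x")
    print(f"articles added per day: about {per_day(ARTICLES_PER_YEAR):.0f}")
```

Both results are consistent with the rounded claims of "500 times more" accesses and "more than 1.3 thousand" articles per day.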

NaCTeM (www.nactem.ac.uk)

• First such centre in the world
• Funding: JISC, BBSRC, EPSRC
• Consortium investment
• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
• Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk funded by the Wellcome Trust
• Initial focus: biomedical academic community
• Extend services to industry
• Extend focus to other domains (social sciences)

http://www.nactem.ac.uk/
http://www.mib.ac.uk/

Consortium

• Universities of Manchester, Liverpool
• Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
• Self-funded partners
  – San Diego Supercomputing Center
  – University of California, Berkeley
  – University of Geneva
  – University of Tokyo
• Strong industrial & academic support
  – IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …

NLP and TM

Text Mining

Text as a bag of words

Words as surface strings

Natural Language Processing

Language as a complex system linking surface strings of characters with their meanings

Text and words as structured objects

NLP-based TM

Linking text with knowledge

Non-Trivial Mappings

Language Domain: linguistic expressions

Knowledge Domain: concepts and relationships among them, motivated independently of language

Terminology, Parsing, Paraphrasing

From surface diversities and ambiguities to conceptual invariants

Example

Non-trivial Mapping

Language Domain / Knowledge Domain (motivated independently of language)

Same relations with different structures: [A] protein activates [B] (Pathway extraction)

– Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
– Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
– Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.

Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein])
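A query such as [sentence] > ([arg1_activate] > [protein]) can be read as nested containment over text regions. The sketch below is a minimal illustration under the assumption that ">" means "regions on the left that contain at least one region on the right"; the region lists are toy data and the function name is hypothetical, not MEDIE's actual interface.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open character span [start, end)


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Return the regions of `outer` that contain at least one region of `inner`."""
    return [
        o for o in outer
        if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)
    ]


# Toy annotation layers over one document (assumed offsets).
sentences = [(0, 40), (40, 90)]
arg1_activate = [(0, 11), (50, 60)]  # spans filling arg1 of "activate"
proteins = [(5, 11), (70, 80)]       # spans tagged as protein names

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

Only the first sentence is returned here, because it is the only one whose arg1 span of "activate" contains a protein mention.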

Predicate-argument structure: parser based on Probabilistic HPSG (Enju)

p53 has been shown to directly activate the Bcl-2 protein

[Figure: parse tree of the sentence above (nodes S, NP, VP, ADVP) with predicate-argument links labelled arg1, arg2 and arg3.]

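Predicate-argument structures: output of the probabilistic HPSG parser (Enju)

The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP

[Figure: parse tree over the tagged sentence (lexical nodes dt, np, vp, vp, pp, np; phrasal nodes np, pp, vp, vp, s) with predicate-argument links labelled arg1, arg2 and mod.]

Semantic Retrieval System Using Deep Syntax: MEDIE

– Passive
– Passive and Infinitival Clause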
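The point of predicate-argument structures is that an active and a passive sentence end up with the same logical arguments. The sketch below illustrates this with a hand-written normalisation step; the class, the function and the convention that arg1 is the logical subject and arg2 the logical object are assumptions made for illustration, not Enju's output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PredArg:
    predicate: str
    arg1: Optional[str] = None  # logical subject (assumed convention)
    arg2: Optional[str] = None  # logical object (assumed convention)


def normalize(verb_lemma: str,
              surface_subject: str,
              surface_object: Optional[str] = None,
              by_agent: Optional[str] = None,
              passive: bool = False) -> PredArg:
    """Map a surface analysis to a voice-independent predicate-argument structure."""
    if passive:
        # In a passive clause the surface subject is the logical object,
        # and the by-phrase (if any) supplies the logical subject.
        return PredArg(verb_lemma, arg1=by_agent, arg2=surface_subject)
    return PredArg(verb_lemma, arg1=surface_subject, arg2=surface_object)


# "It activates the protein" vs. "The protein is activated by it"
active = normalize("activate", "it", surface_object="protein")
passive = normalize("activate", "protein", by_agent="it", passive=True)
print(active == passive)
```

A retrieval system that indexes these structures instead of surface strings can match both sentences with a single query.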


Demos

• MEDIE: http://www-tsujii.is.s.u-tokyo.ac.jp/medie/
• Info-PubMed: http://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/


Performance of Semantic Parser

                          Penn Treebank   GENIA
Coverage                  99.7%           99.2%
F-value (PA relations)    87.4%           86.4%
Sentence precision        39.2%           31.8%
Processing time           0.68 sec        1.00 sec

Scalability of TM Tools

Target Corpus: MEDLINE corpus

The number of papers        14,792,890
The number of abstracts      7,434,879
The number of sentences     70,815,480
The number of words      1,418,949,650
Compressed data size     3.2GB
Uncompressed data size   10GB

Suppose, for example, that it takes one second to parse one sentence: 70 million seconds, that is, about 2 years.
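The "about 2 years" estimate follows directly from the sentence count. A minimal check, assuming a single CPU running without interruption and a 365-day year (the names below are illustrative):

```python
SENTENCES = 70_815_480        # number of sentences in the MEDLINE corpus (from the slide)
SECONDS_PER_SENTENCE = 1.0    # the slide's supposition
SECONDS_PER_YEAR = 365 * 24 * 3600


def sequential_parse_years(n_sentences: int, sec_per_sentence: float) -> float:
    """Wall-clock years needed to parse the corpus on one CPU."""
    return n_sentences * sec_per_sentence / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = sequential_parse_years(SENTENCES, SECONDS_PER_SENTENCE)
    print(f"about {years:.2f} years")
```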

TM and GRID

• Solution
  – The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs
  – Parallel processing was managed by the grid platform GXP [Taura2004]
• Experiments
  – The entire MEDLINE was parsed in 8 days
• Output
  – Syntactic parse trees and predicate argument structures in XML format
  – The data sizes of compressed/uncompressed output were 42.5GB/260GB.
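Sentence-level parsing is easy to parallelise because each sentence can be parsed independently. The sketch below shows the general idea of sharding a corpus across workers; it uses Python's standard thread pool as a stand-in and a dummy parser function, and it does not reproduce GXP or the actual cluster setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def parse_sentence(sentence: str) -> Dict[str, object]:
    """Stand-in for a real parser call; returns a small record per sentence."""
    return {"text": sentence, "tokens": len(sentence.split())}


def shard(items: List[str], n_shards: int) -> List[List[str]]:
    """Split items into n_shards round-robin shards."""
    return [items[i::n_shards] for i in range(n_shards)]


def parse_shard(sentences: List[str]) -> List[Dict[str, object]]:
    return [parse_sentence(s) for s in sentences]


def parse_corpus(sentences: List[str], n_workers: int = 4) -> List[Dict[str, object]]:
    """Parse all sentences, one shard per worker; output is grouped by shard."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(parse_shard, shard(sentences, n_workers)))
    return [record for part in parts for record in part]


if __name__ == "__main__":
    for record in parse_corpus(["p53 activates Bcl-2", "I like it"], n_workers=2):
        print(record)
```

Because the shards share nothing, the same structure carries over to separate machines, with the shard list replaced by files distributed to each node.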

Efficient Parsing for HPSG

Background: HPSG

• Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
  – Lexicalized and constraint-based grammar
  – A few rule schemata: general constraints on linguistic constructions
  – Constraints embedded in the lexicon: word-specific constraints
  – Constraints between phrase structures and semantic structures

Parsing by HPSG: I like it

Assignment of Lexical Entries

I:    [HEAD noun, SUBJ < >, COMPS < >]
like: [HEAD verb, SUBJ <NP>, COMPS <NP>]
it:   [HEAD noun, SUBJ < >, COMPS < >]

Application of Rule Schema: Head-Complement

like: [HEAD verb, SUBJ <[1]>, COMPS <[2]>]  +  it: [2]
→ like it: [HEAD verb, SUBJ <[1]>, COMPS < >]

Application of Rule Schema: Subject-Head

I: [1]  +  like it: [HEAD verb, SUBJ <[1]>, COMPS < >]
→ I like it: [HEAD verb, SUBJ < >, COMPS < >]
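The derivation of "I like it" with the Head-Complement and Subject-Head schemata can be mimicked with a very small program. This is a simplified sketch: atomic labels such as "NP" stand in for full typed feature structures, and an equality test stands in for unification. The class and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sign:
    phon: str
    head: str
    subj: Tuple[str, ...] = ()   # unsaturated subject requirements
    comps: Tuple[str, ...] = ()  # unsaturated complement requirements


def category(sign: Sign) -> Optional[str]:
    """Label a saturated nominal sign as NP (a toy stand-in for unification)."""
    if sign.head == "noun" and not sign.subj and not sign.comps:
        return "NP"
    return None


def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: the head consumes its first COMPS element."""
    if head.comps and category(comp) == head.comps[0]:
        return Sign(f"{head.phon} {comp.phon}", head.head, head.subj, head.comps[1:])
    return None


def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: a COMPS-saturated head consumes its SUBJ element."""
    if head.subj and not head.comps and category(subj) == head.subj[0]:
        return Sign(f"{subj.phon} {head.phon}", head.head, head.subj[1:], head.comps)
    return None


# Lexical entries from the slide.
i_sign = Sign("I", "noun")
like = Sign("like", "verb", subj=("NP",), comps=("NP",))
it = Sign("it", "noun")

vp = head_complement(like, it)      # like it: SUBJ <NP>, COMPS < >
sentence = subject_head(i_sign, vp)  # I like it: SUBJ < >, COMPS < >
print(sentence)
```

The final sign has HEAD verb with empty SUBJ and COMPS lists, matching the last step of the derivation. A real HPSG parser performs the same bookkeeping through unification of structure-shared values rather than label comparison.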

Inefficiency of HPSG Parsing

• Complex DAGs: typed feature structures
  – Abstract machine for unification (LiLFeS)
• Unification: expensive operation (⇔ CFG approximation: CFG filtering)
• Assignment of lexical entries
  – High reduction of search space / supertagging

Filtering with CFG (1/5)

• 2-phased parsing
  – Approximate the HPSG with a CFG while keeping important constraints.
  – The obtained CFG might over-generate, but can be used in filtering.
  – Rewriting in a CFG is far less expensive than applying rule schemata, principles and so on.

[Figure: the HPSG is compiled into a CFG; input sentences pass through the built-in CFG parser plus LiLFeS unification parsing over feature structures; output: complete parse trees.]
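The two-phase idea can be sketched as a cheap CFG recogniser guarding an expensive parser. The grammar below is a hypothetical hand-written approximation in Chomsky normal form, not one compiled from an HPSG, and `expensive_parse` is a placeholder for unification-based parsing.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical CFG approximation (binary rules only).
BINARY_RULES: Dict[Tuple[str, str], Set[str]] = {
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}


def cfg_accepts(categories: List[str], start: str = "S") -> bool:
    """CKY recognition over a sequence of preterminal categories."""
    n = len(categories)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span positions i..j (j exclusive).
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, cat in enumerate(categories):
        chart[i][i + 1].add(cat)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]


def parse_with_filter(categories: List[str],
                      expensive_parse: Callable[[List[str]], object]) -> Optional[object]:
    """Run the expensive parser only on inputs the CFG does not rule out."""
    if not cfg_accepts(categories):
        return None
    return expensive_parse(categories)


if __name__ == "__main__":
    print(cfg_accepts(["NP", "V", "NP"]))  # category sequence for "I like it"
    print(cfg_accepts(["NP", "NP", "V"]))
```

Since the CFG may over-generate, acceptance does not guarantee a full parse; rejection, however, is safe as long as the CFG accepts everything the original grammar accepts.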

System Overview

[Figure: pipeline for the input sentence "I like it". The supertagger lists candidate lexical entries for each word ([HEAD noun, SUBJ < >, COMPS < >] and [HEAD verb, SUBJ <NP>, COMPS <NP>]) and orders the resulting assignments by probability (P high first). Each assignment, for example I: [HEAD noun, SUBJ < >, COMPS < >], like: [HEAD verb, SUBJ <NP>, COMPS <NP>], it: [HEAD noun, SUBJ < >, COMPS < >], goes through CFG filtering and is then passed to the deterministic shift/reduce parser.]
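The staged design can be sketched as a loop over lexical assignments in decreasing probability, with a cheap filter in front of the parser. Everything below is a toy illustration: the candidate probabilities are invented, the filter and parser are trivial placeholders, and the exhaustive enumeration is only viable for very short inputs (a real system would generate assignments lazily).

```python
from itertools import product
from math import prod
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

Candidates = List[Tuple[str, float]]  # (lexical entry label, probability) per word
Assignment = Tuple[str, ...]


def ranked_assignments(candidates: Sequence[Candidates]) -> Iterator[Tuple[Assignment, float]]:
    """Yield full-sentence assignments, most probable first."""
    scored = [
        (tuple(tag for tag, _ in combo), prod(p for _, p in combo))
        for combo in product(*candidates)
    ]
    yield from sorted(scored, key=lambda item: item[1], reverse=True)


def first_parsable(candidates: Sequence[Candidates],
                   passes_filter: Callable[[Assignment], bool],
                   parse: Callable[[Assignment], Optional[tuple]]) -> Optional[tuple]:
    """Return the parse of the most probable assignment that survives both stages."""
    for tags, _score in ranked_assignments(candidates):
        if not passes_filter(tags):
            continue  # rejected cheaply, parser never runs
        tree = parse(tags)
        if tree is not None:
            return tree
    return None


# Invented supertagger output for "I like it".
toy_candidates = [
    [("NP", 0.9), ("V", 0.1)],
    [("NP", 0.6), ("V", 0.4)],
    [("NP", 0.8), ("V", 0.2)],
]
toy_filter = lambda tags: tags.count("V") == 1
toy_parse = lambda tags: ("S",) + tags if tags == ("NP", "V", "NP") else None

result = first_parsable(toy_candidates, toy_filter, toy_parse)
print(result)
```

In this toy run the most probable assignment (NP NP NP) is rejected by the filter, so the parser is first invoked on the second-ranked assignment.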

Experiment Results

                                                LP(%)   LR(%)   F1(%)   Avg. time
Staged/Deterministic model                      86.93   86.47   86.70   30ms/snt
Previous method 1 (Supertagger + ChartParser)   87.35   86.29   86.81   183ms/snt
Previous method 2 (Unigram + ChartParser)       84.96   84.25   84.60   674ms/snt

6 times faster than previous method 1; 20 times faster than the initial model
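The speed-up claims and the F1 column can be recomputed from the table. The sketch below assumes that F1 is the harmonic mean of labelled precision (LP) and labelled recall (LR); small differences in the last digit can arise because the table's LP and LR are already rounded.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def speedup(baseline_ms: float, new_ms: float) -> float:
    """How many times faster the new system is than the baseline."""
    return baseline_ms / new_ms


# (LP, LR, average time in ms per sentence), copied from the table.
RESULTS = {
    "staged/deterministic": (86.93, 86.47, 30),
    "supertagger+chart": (87.35, 86.29, 183),
    "unigram+chart": (84.96, 84.25, 674),
}

if __name__ == "__main__":
    staged_time = RESULTS["staged/deterministic"][2]
    for name, (lp, lr, ms) in RESULTS.items():
        print(f"{name}: F1 = {f1(lp, lr):.2f}, "
              f"staged model is {speedup(ms, staged_time):.1f}x faster")
```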

Domain/Text Type Adaptation

                                        F-score   Training time (sec)
Baseline (PTB-trained, PTB-applied)     89.81     0
Baseline (PTB-trained, GENIA-applied)   86.39     0
Retraining (GENIA)                      88.45     14,695
Retraining (PTB+GENIA)                  89.94     238,576
Structure with RefDist                  88.18     21,833
Lexical with RefDist                    89.04     12,957
Lex/Structure with RefDist              90.15     31,637

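Adaptation with Reference Distribution

$$p_E(t \mid \mathbf{w}) = \frac{1}{Z_{\mathbf{w}}} \prod_i p_{lex}(l_i \mid \mathbf{w}, i)\; q_{syn}(t \mid \mathbf{l}), \qquad Z_{\mathbf{w}} = \sum_{t \in T(\mathbf{w})} \prod_i p_{lex}(l_i \mid \mathbf{w}, i)\; q_{syn}(t \mid \mathbf{l})$$

where $p_{lex}$ is the lexical assignment model and $q_{syn}$ is the syntactic preference model.

$$p_M(t \mid s) = \frac{1}{Z_s}\; p_0(t \mid s)\; \exp\Big(\sum_j \lambda_j\, g_j(t, s)\Big)$$

where $p_0(t \mid s)$ is the original model, $g_j$ is a feature function and $\lambda_j$ is a feature weight.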
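Adaptation with a reference distribution keeps the original model fixed and learns only a log-linear correction on top of it. The sketch below applies such a correction to a toy distribution over two candidate parses; the feature function and weight are invented for illustration, and no training is shown.

```python
import math
from typing import Callable, Dict, List


def adapt(p0: Dict[str, float],
          features: List[Callable[[str], float]],
          weights: List[float]) -> Dict[str, float]:
    """Reweight the original model p0 by exp(sum_j weight_j * g_j(t)) and renormalise."""
    unnormalised = {
        t: p * math.exp(sum(w * g(t) for w, g in zip(weights, features)))
        for t, p in p0.items()
    }
    z = sum(unnormalised.values())  # normalisation constant Z_s
    return {t: value / z for t, value in unnormalised.items()}


# Toy original model over two candidate parses of one sentence.
original = {"parse_a": 0.7, "parse_b": 0.3}

# Hypothetical feature that fires on a construction preferred in the new domain.
domain_feature = lambda t: 1.0 if t == "parse_b" else 0.0

adapted = adapt(original, [domain_feature], [1.0])
print(adapted)
```

With all weights at zero the adapted model equals the original one, which is why a small amount of target-domain data can be enough to learn a useful correction.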

[Figure: F-score (83 to 90) against the number of sentences in the GENIA training set (0 to 8,000). Curves: Baseline (PTB); Simple Retraining (GENIA); Retraining (GENIA+PTB); Structure with RefDist; Lexical with RefDist; Lexical/Structure with RefDist.]

[Figure: F-score (83 to 90) against training time in seconds (0 to 30,000). Curves: Retraining (GENIA); Structure with RefDist; Lexical with RefDist; Lex/Str with RefDist.]


Tool1: POS Tagger

• General-purpose POS taggers, trained on WSJ
  – Brill’s tagger, TnT tagger, MXPOST, etc.
  – 97% accuracy
• General-purpose POS taggers do not work well for MEDLINE abstracts

The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

Errors seen in TnT tagger (Brants 2000)

A/DT chromosomal/JJ translocation/NN in/IN …
… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN

Performance of GENIA Tagger

• GENIA tagger

Training corpus   WSJ    GENIA
WSJ               97.0   84.3
GENIA             75.2   98.1
WSJ+GENIA         96.9   98.1

No degradation of the tagger trained by the mixed corpus

• (Ref.) TnT tagger

Training corpus   WSJ    GENIA
WSJ               96.7   84.3
GENIA             80.1   97.9
WSJ+GENIA         96.5   97.5

Some degradations (0.2 ~ 0.4) were observed, compared with the taggers trained by “pure” corpora

CRF-based POS + Active Learning: GENIA

– 3,000 sentences: 98.4
– 20,000 sentences: 98.58

CRF-based POS + Active Learning: PTB

– 10,000 sentences: 96.76
– Best Performance: 97.18

Applications

GENIA Event Annotation - example

– For an identified event in the given sentence,
  • classify the type of events and record the text span giving the clue of it (ClueType).
  • identify the theme of the events and record the text span linking the theme to the event (LinkTheme).
  • identify the cause of the events and record the text span linking the cause to the event (LinkCause).
  • record the environment (location, time) of the events (ClueLoc, ClueTime).

[Figure: annotated example sentence with spans labelled ClueType, LinkTheme, LinkCause and ClueLoc.]
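The annotation scheme maps naturally onto a small record type: one event with a type, a theme, an optional cause, and text spans for each clue. The sketch below is a hypothetical in-memory representation, not the GENIA corpus file format, and the example sentence is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # character offsets [start, end) into the sentence


@dataclass
class EventAnnotation:
    event_type: str                    # e.g. "Localization"
    clue_type: Span                    # ClueType: span that signals the event type
    theme: Optional[str] = None
    link_theme: Optional[Span] = None  # LinkTheme: span linking theme and event
    cause: Optional[str] = None
    link_cause: Optional[Span] = None  # LinkCause: span linking cause and event
    clue_loc: Optional[Span] = None    # ClueLoc: location of the event
    clue_time: Optional[Span] = None   # ClueTime: time of the event

    def text(self, sentence: str, span: Optional[Span]) -> Optional[str]:
        """Return the surface text covered by a span, if the span is set."""
        return None if span is None else sentence[span[0]:span[1]]


def span_of(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in sentence."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


# Made-up sentence; "ProteinX" is a placeholder name.
sentence = "ProteinX translocation to the nucleus was observed."
event = EventAnnotation(
    event_type="Localization",
    clue_type=span_of(sentence, "translocation"),
    theme="ProteinX",
    clue_loc=span_of(sentence, "to the nucleus"),
)
print(event.text(sentence, event.clue_type), "|", event.text(sentence, event.clue_loc))
```

Storing spans rather than strings keeps each clue tied to its exact position in the sentence, which matters when the same word occurs more than once.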

Gene_expression

• Theme patterns observed (2,958)
  – Protein 2,308
  – DNA 591
  – RNA 25
  – Peptide 4
  – Protein Protein 2
  – Erroneous 27
• Keywords
  – coexpress, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, coexpression, …

Transcription

• Theme patterns observed (929)
  – DNA 449
  – RNA 272
  – Protein 167
  – Peptide 2
  – Erroneous 22
• Keyword
  – Transcrib, transcript, synthesi, express, …

Localization

• Theme patterns observed (730)
  – Protein 608
  – Lipid 31
  – Atom 29
  – Other_organic_compound 14
  – DNA 12
  – Virus 5
  – Carbohydrate 5
  – RNA 4
  – Inorganic 4
  – Peptide 3
• Keywords
  – Translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, …
• ClueLoc
  – NONE 241
  – nuclear 140
  – to the nucleus 12
  – into the nucleus 11
  – Cytoplasmic 8
  – in the cytoplasm 7
  – macrophages 5
  – nuclear … in T lymphocytes 4
  – monocytes 4
  – in the nucleus 4
  – in the cytosol 4
  – in colostrum 4
  – from the cytoplasm to the nucleus 4

Localization

• Keywords and Locations
  – translocation (166)
    • nuclear 108
    • NONE 38
    • …
  – secretion (100)
    • NONE 57
    • name_of_cells 43
  – release (80)
    • NONE 51
    • name_of_cells 19
    • …
  – localization (30)
    • nuclear 25
    • intracellular 3
  – uptake (24)
    • NONE 14
    • name_of_cells 20
• Keywords and Themes
  – translocation (166)
    • Protein 161
    • Virus 4
    • RNA 1
  – secretion (100)
    • Protein 98
    • Lipid 1
    • Peptide 1
  – release (80)
    • Protein 67
    • Other_organic_compound 6
    • Lipid 3
  – localization (30)
    • Protein 30
  – uptake (24)
    • Lipid 15
    • Carbohydrate 5
    • Protein 4
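Breakdowns of this kind (keyword by theme, keyword by location) are simple cross-tabulations over annotated events. The sketch below shows one way to compute them; the event tuples are toy data invented for illustration and do not reproduce the corpus counts above.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Optional, Tuple

Event = Tuple[str, str, Optional[str]]  # (keyword, theme type, location clue or None)


def tally(events: Iterable[Event]) -> Tuple[DefaultDict[str, Counter], DefaultDict[str, Counter]]:
    """Count theme types and location clues per event keyword."""
    by_theme: DefaultDict[str, Counter] = defaultdict(Counter)
    by_location: DefaultDict[str, Counter] = defaultdict(Counter)
    for keyword, theme, location in events:
        by_theme[keyword][theme] += 1
        # Events without a location clue are counted under "NONE", as on the slide.
        by_location[keyword][location if location is not None else "NONE"] += 1
    return by_theme, by_location


# Toy events (not corpus data).
toy_events = [
    ("translocation", "Protein", "nuclear"),
    ("translocation", "Protein", None),
    ("translocation", "Virus", "nuclear"),
    ("uptake", "Lipid", None),
]

themes, locations = tally(toy_events)
print(themes["translocation"].most_common())
print(locations["translocation"].most_common())
```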

Future Plan

Kitano’s group, Kell’s group

Future Directions

• Domain Adaptation + Inter-operability
  – High performance can be obtained by using domain specific characteristics and domain semantics
  – Differences among abstracts, full papers, comments in DBs
  – Standardized Interfaces (API) of NLP tools
• Text Archives
  – Abstracts + Full Papers + Comments/Summary Descriptions in DBs
• Combining NLP tools with Mining tools
  – Knowledge Discovery (Disease Gene Association)
  – Hypotheses Generation
  – Automatic Data Interpretation
