TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester, UK Department of Computer Science School of Information Science and Technology University of Tokyo, JAPAN
TM and NLP for Biology
Research Issues in HPSG Parsing
Junichi TSUJII
School of Computer Science
National Centre for Text Mining
University of Manchester, UK
Department of Computer Science
School of Information Science and Technology
University of Tokyo, JAPAN
2
Increase in Medline
[Figure: yearly increments (left axis, 0 to 600,000) and accumulation (right axis, 0 to 14,000,000) in Medline by year, 1964 to 2002]
G-protein coupled receptor
Before 1988: 9 papers
1992: 256 papers
2005: 14,000 papers
MEDLINE alone, articles added
More than 0.5 million per year
More than 1.3 thousand per day
Medline Access
1997: 0.163 M accesses/month
2006: 82.027 M accesses/month
[D.L. Banville 2006]
500 times more
3
NaCTeM (www.nactem.ac.uk)
• First such centre in the world
• Funding: JISC, BBSRC, EPSRC
• Consortium investment
• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
• Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
• Initial focus: biomedical academic community
• Extend services to industry
• Extend focus to other domains (social
• Universities of Manchester, Liverpool
• Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
• Self-funded partners
– San Diego Supercomputing Center
– University of California, Berkeley
– University of Geneva
– University of Tokyo
• Strong industrial & academic support
– IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …
11
NLP and TM
Text Mining
Text as a bag of words
Words as surface strings
Natural Language Processing
Language as a complex system linking surface strings of characters with their meanings
Text and words as structured objects
NLP-based TM
Linking text with knowledge
12
Non-Trivial Mappings
Language Domain: linguistic expressions
Knowledge Domain: concepts and relationships among them, motivated independently of language
Terminology, Parsing, Paraphrasing
From surface diversities and ambiguities to conceptual invariants
13
Example
14
Non-trivial Mapping
Language Domain Knowledge Domain
Motivated independently of language
Same relations with different structures
Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …
[A] protein activates [B] (Pathway extraction)
Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
• Assignment of Lexical Entries
– High reduction of search space / Supertagging
41
Filtering with CFG (1/5)
• 2-phased parsing
– Approximate the HPSG with a CFG while keeping important constraints.
– The obtained CFG might over-generate, but it can be used for filtering.
– Rewriting in a CFG is far less expensive than applying rule schemata, principles and so on.
[Diagram: Compile HPSG → CFG; Input Sentences → Built-in CFG Parser + LiLFeS Unification Parsing (Feature Structures) → Output: complete parse trees]
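The two-phase idea can be sketched as follows. This is only an illustration under stated assumptions, not the LiLFeS implementation: the toy grammar, the category names (NP, V, VP, S) and the functions `cfg_accepts` and `filter_assignments` are all hypothetical. A cheap CKY recognizer over an approximating CFG rejects category sequences that cannot yield a parse, so that only the survivors would need to reach the expensive unification step.

```python
from itertools import product

# Hypothetical binary CFG standing in for a grammar compiled from the HPSG.
# Maps (left child, right child) -> set of possible parent categories.
BINARY_RULES = {
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def cfg_accepts(categories, start="S"):
    """CKY recognition: True if the category sequence can derive `start`."""
    n = len(categories)
    if n == 0:
        return False
    # chart[i][j] holds the categories spanning positions i..j (inclusive).
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, cat in enumerate(categories):
        chart[i][i].add(cat)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n - 1]

def filter_assignments(candidates):
    """Keep only the category sequences the approximating CFG accepts."""
    return [c for c in candidates if cfg_accepts(c)]

if __name__ == "__main__":
    # "I like it" with two candidate lexical assignments (made up).
    print(filter_assignments([["NP", "NP", "NP"], ["NP", "V", "NP"]]))
```

Because the CFG only approximates the HPSG, a sequence that passes the filter may still fail unification; the filter is meant to be cheap and permissive, not exact.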
47
System Overview
[Diagram: input sentence "I like it" → Supertagger (candidate lexical entries for each word, ordered by probability P, e.g. HEAD noun, SUBJ < >, COMPS < > for "I" and "it"; HEAD verb, SUBJ <NP>, COMPS <NP> for "like") → CFG Filtering (applied to successive lexical entry assignments) → Deterministic Shift/Reduce Parser]
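The staged pipeline on this slide can be sketched roughly as below. Everything in the sketch is an assumption made for illustration: the probabilities, the atomic categories standing in for HPSG lexical entries, the toy rules, and the function names. The idea shown is that lexical assignments are tried in descending order of supertagger probability, a cheap filter discards hopeless ones, and the first survivor is handed to a deterministic shift/reduce parser that never backtracks.

```python
from itertools import product
from math import prod

# Hypothetical supertagger output for "I like it": per word, candidate
# categories with probabilities (numbers are made up for illustration).
SUPERTAGS = [
    [("NP", 0.9), ("V", 0.1)],  # I
    [("NP", 0.6), ("V", 0.4)],  # like
    [("NP", 0.8), ("V", 0.2)],  # it
]

RULES = {("V", "NP"): "VP", ("NP", "VP"): "S"}  # toy grammar (assumption)

def ranked_assignments(supertags):
    """All category sequences, most probable first (exhaustive for brevity)."""
    scored = [
        ([cat for cat, _ in combo], prod(p for _, p in combo))
        for combo in product(*supertags)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def shift_reduce(categories):
    """Deterministic shift/reduce: reduce greedily, never backtrack."""
    stack = []
    for cat in categories:
        stack.append(cat)  # shift
        while len(stack) >= 2 and (stack[-2], stack[-1]) in RULES:
            right, left = stack.pop(), stack.pop()
            stack.append(RULES[(left, right)])  # reduce
    return stack

def cheap_filter(categories):
    """Stand-in for the CFG filter: a cheap necessary condition only."""
    return categories[0] == "NP" and "V" in categories

def staged_parse(supertags, passes_filter):
    """Parse the most probable assignment that survives the filter."""
    for categories, _prob in ranked_assignments(supertags):
        if passes_filter(categories):
            return categories, shift_reduce(categories)
    return None

if __name__ == "__main__":
    print(staged_parse(SUPERTAGS, cheap_filter))
```

A real system would enumerate assignments lazily rather than sorting all combinations; the exhaustive version is used here only to keep the sketch short.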
Experiment Results
Model                                          LP(%)   LR(%)   F1(%)   Avg. time
Staged/Deterministic model                     86.93   86.47   86.70   30 ms/snt
Previous method 1 (Supertagger + ChartParser)  87.35   86.29   86.81   183 ms/snt
Previous method 2 (Unigram + ChartParser)      84.96   84.25   84.60   674 ms/snt
6 times faster
20 times faster than the initial model
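As a quick check of the figures above, F1 can be recomputed as the harmonic mean of LP and LR, and the speed-ups as ratios of average time per sentence. This is a small sketch; the dictionary keys are shorthand labels chosen here, and because the inputs are already rounded, a recomputed F1 may differ from the reported one in the last digit.

```python
def f1(lp, lr):
    """Harmonic mean of labeled precision and labeled recall (both in %)."""
    return 2 * lp * lr / (lp + lr)

# (LP, LR, average ms per sentence), copied from the results table.
RESULTS = {
    "staged/deterministic": (86.93, 86.47, 30),
    "supertagger+chart": (87.35, 86.29, 183),
    "unigram+chart": (84.96, 84.25, 674),
}

def speedup(slow_ms, fast_ms):
    """How many times faster the fast system is than the slow one."""
    return slow_ms / fast_ms

if __name__ == "__main__":
    fast_ms = RESULTS["staged/deterministic"][2]
    for name, (lp, lr, ms) in RESULTS.items():
        print(f"{name}: F1={f1(lp, lr):.2f}, {speedup(ms, fast_ms):.1f}x the staged model's time")
```

The ratios 183/30 and 674/30 come out at a little over 6 and a little over 22, in line with the "6 times" and "20 times" claims.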
49
Domain/Text Type Adaptation
50
Method                                   F-score   Training time (sec)
Baseline (PTB-trained, PTB-applied)      89.81     0
Baseline (PTB-trained, GENIA-applied)    86.39     0
Retraining (GENIA)                       88.45     14,695
Retraining (PTB+GENIA)                   89.94     238,576
Structure with RefDist                   88.18     21,833
Lexical with RefDist                     89.04     12,957
Lex/Structure with RefDist               90.15     31,637
51
Adaptation with Reference Distribution
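$$p_E(t \mid \mathbf{w}) = \frac{1}{Z_{\mathbf{w}}} \prod_i p_{lex}(l_i \mid w_i)\, q_{syn}(t \mid \mathbf{l}), \qquad Z_{\mathbf{w}} = \sum_{t \in T(\mathbf{w})} \prod_i p_{lex}(l_i \mid w_i)\, q_{syn}(t \mid \mathbf{l})$$
Lexical assignment: $p_{lex}$. Syntactic preference: $q_{syn}$.

$$p_M(t \mid s) = \frac{1}{Z_s}\, p_0(t \mid s)\, \exp\Big(\sum_j \lambda_j\, g_j(t, s)\Big)$$
Original model: $p_0(t \mid s)$. Feature function: $g_j$. Feature weight: $\lambda_j$.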
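The reference-distribution idea, in which the original model p_0(t|s) is kept fixed and reweighted by an exponential of weighted feature functions, can be illustrated with a few lines of code. This is a minimal sketch under assumptions: the candidate parses, feature names and weights are invented, and the function `adapt` is hypothetical rather than part of any parser described here.

```python
from math import exp

def adapt(p0, features, weights):
    """Reweight a reference distribution with a log-linear correction.

    p0:       {parse: probability under the original model}
    features: {parse: {feature name: value}}
    weights:  {feature name: weight}
    Returns p0(t) * exp(sum_j weight_j * feature_j(t)), renormalized
    over the candidate parses.
    """
    unnormalized = {
        t: p0[t] * exp(sum(weights.get(f, 0.0) * v for f, v in features[t].items()))
        for t in p0
    }
    z = sum(unnormalized.values())  # normalization constant
    return {t: u / z for t, u in unnormalized.items()}

if __name__ == "__main__":
    # Two hypothetical candidate parses of one sentence.
    p0 = {"t1": 0.7, "t2": 0.3}
    features = {"t1": {"np_attach": 1.0}, "t2": {"vp_attach": 1.0}}
    weights = {"vp_attach": 1.5}  # assumed to be learned on in-domain data
    print(adapt(p0, features, weights))
```

With all weights at zero the adapted model reduces to the original one, so only the correction weights have to be estimated from the new domain's data.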
52
[Figure: F-score (83 to 90) against the number of sentences of the GENIA training set (0 to 8,000). Curves: Baseline (PTB); Simple Retraining (GENIA); Retraining (GENIA+PTB); Structure with RefDist; Lexical with RefDist; Lexical/Structure with RefDist]
53
[Figure: F-score (83 to 90) against training time in seconds (0 to 30,000). Curves: Retraining (GENIA); Structure with RefDist; Lexicon with RefDist; Lex/Str with RefDist]
55
Tool1: POS Tagger
• General-purpose POS taggers, trained on WSJ
– Brill’s tagger, TnT tagger, MXPOST, etc.
– 97%
• General-purpose POS taggers do not work well for MEDLINE abstracts
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …
56
Errors seen in TnT tagger (Brants 2000)
A/DT chromosomal/JJ translocation/NN in/IN …
… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ.
… two/CD factors/NNS, which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN.
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN
57
Performance of GENIA Tagger
Accuracy (%); rows: training corpus, columns: corpus the tagger is applied to

• GENIA tagger
Training corpus   WSJ    GENIA
WSJ               97.0   84.3
GENIA             75.2   98.1
WSJ+GENIA         96.9   98.1
No degradation of the tagger trained on the mixed corpus

• (Ref.) TnT tagger
Training corpus   WSJ    GENIA
WSJ               96.7   84.3
GENIA             80.1   97.9
WSJ+GENIA         96.5   97.5
Some degradation (0.2 to 0.4) was observed, compared with the taggers trained on “pure” corpora
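Accuracy figures of this kind are per-token: the share of tokens whose predicted tag matches the gold tag. A minimal sketch follows. The predicted tags are the TnT output shown earlier for "and membrane potential after mitogen binding"; the gold tags are an assumption made here for illustration (with "binding" taken as NN), not taken from the corpus.

```python
def accuracy(predicted, gold):
    """Per-token tagging accuracy in percent."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        raise ValueError("tag sequences must not be empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

if __name__ == "__main__":
    # and membrane potential after mitogen binding
    predicted = ["CC", "NN", "NN", "IN", "NN", "JJ"]
    gold = ["CC", "NN", "NN", "IN", "NN", "NN"]  # assumed gold tags
    print(f"{accuracy(predicted, gold):.1f}%")
```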
58
CRF-based POS + Active Learning: GENIA
3,000 sentences: 98.4
20,000 sentences: 98.58
59
CRF-based POS + Active Learning: PTB
10,000 sentences: 96.76
Best Performance: 97.18
60
Applications
GENIA Event Annotation - example
– For an identified event in the given sentence,
• classify the type of the event and record the text span giving the clue to it (ClueType).
• identify the theme of the event and record the text span linking the theme to the event (LinkTheme).
• identify the cause of the event and record the text span linking the cause to the event (LinkCause).
• record the environment (location, time) of the event (ClueLoc, ClueTime).
[Figure labels on the annotated example: LinkCause, LinkTheme, ClueLoc, ClueType]
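One way to picture the annotation scheme is as a record with one text span per slot. The sketch below is an assumption-laden illustration, not the GENIA annotation format: the class `EventAnnotation`, its field names, the helper `span_of`, and the event label used in the example are all hypothetical. The example sentence is the PHO2/PHO5 fragment quoted earlier in these slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # character offsets [start, end) into the sentence

@dataclass
class EventAnnotation:
    event_type: str                     # class of the event
    clue_type: Span                     # ClueType: text giving the clue to the type
    theme: Optional[Span] = None        # what the event acts on
    link_theme: Optional[Span] = None   # LinkTheme: text linking theme to event
    cause: Optional[Span] = None        # what brings the event about
    link_cause: Optional[Span] = None   # LinkCause: text linking cause to event
    clue_loc: Optional[Span] = None     # ClueLoc: location of the event
    clue_time: Optional[Span] = None    # ClueTime: time of the event

def span_of(sentence: str, phrase: str) -> Span:
    """Offsets of the first occurrence of `phrase` (raises if absent)."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))

SENTENCE = ("only phosphorylated PHO2 protein could activate "
            "the transcription of PHO5 gene")

# Illustrative annotation only; the label "Transcription" is an assumption.
EVENT = EventAnnotation(
    event_type="Transcription",
    clue_type=span_of(SENTENCE, "transcription"),
    theme=span_of(SENTENCE, "PHO5 gene"),
    link_theme=span_of(SENTENCE, "of"),
)

if __name__ == "__main__":
    start, end = EVENT.theme
    print(EVENT.event_type, "theme:", SENTENCE[start:end])
```

Keeping every slot as a span into the original sentence, rather than as a copied string, keeps the annotation tied to the exact text that supports it.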
Gene_expression
• Theme patterns observed (2,958)
– Protein 2,308
– DNA 591
– RNA 25
– Peptide 4
– Protein Protein 2
– Erroneous 27