NLP Tools for Biology Literature Mining Qiaozhu Mei Jing Jiang ChengXiang Zhai Nov 3, 2004
Jan 13, 2016
What do we have?
Biology Literature (huge amount of text)
E.g. Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. … (from Biological Abstracts)
What do we want?
Named entities: gene names, protein names, drugs, etc.
Interaction events between entities: transcription, translation, post-translational modification, etc.
Relationships between basic events: caused by, inhibited by, etc.
(from Hirschman et al. 02)
Preliminary System Structure
Collections of raw textual data
Text Pre-processing (NLP):
POS Tagger: nouns, verbs, etc.
Parser: NPs, VPs, relations
Entity Extractor: genes, proteins, other entities
…
Pre-processed data ready to mine
Text Mining Modules (TM)
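The module structure on this slide could be sketched as a chain of interchangeable stages. This is only an illustration under stated assumptions: the `Document` class, the `Stage` type, and the two toy stages are hypothetical stand-ins, not the actual taggers, parsers, or extractors discussed in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Raw text plus the annotations added by each pre-processing stage."""
    text: str
    annotations: Dict[str, list] = field(default_factory=dict)


# A stage is any callable that takes a Document and returns it annotated.
Stage = Callable[[Document], Document]


def toy_pos_tagger(doc: Document) -> Document:
    # Hypothetical stand-in for a real tagger: every token is tagged "NN".
    doc.annotations["pos"] = [(tok, "NN") for tok in doc.text.split()]
    return doc


def toy_entity_extractor(doc: Document) -> Document:
    # Hypothetical stand-in: tokens containing "RNA" count as entities.
    doc.annotations["entities"] = [t for t in doc.text.split() if "RNA" in t]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    """Apply each stage in order; stages can be swapped without other changes."""
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("the 16S rRNA", [toy_pos_tagger, toy_entity_extractor])
print(doc.annotations)
```

Because each stage only reads and writes the shared `Document`, replacing one tagger or extractor with another would not require changes to the text mining modules downstream.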
POS Taggers
Tree Tagger
Brill Tagger
SNoW Tagger
LT Chunk
Stanford Tagger
Results of POS Tagging
Raw text:
Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. …
(from Biological Abstracts)
Results of POS Tagging (cont.)
Token       Tree  Brill  SNoW  LT   Stanford
the         DT    DT     DT    DT   DT
12S         JJ    CD     NN    JJ   CD
ribosomal   JJ    JJ     JJ    JJ   JJ
RNA         NP    NNP    NNP   NNP  NNP
subunit     NN    NN     NN    NN   NN
is          VBZ   VBZ    VBZ   VBZ  VBZ
inverted    VVN   JJ     JJ    VBN  JJ
and         CC    CC     CC    CC   CC
separated   VVN   VBN    VBD   VBN  JJ
from        IN    IN     IN    IN   IN
the         DT    DT     DT    DT   DT
Results of POS Tagging (cont.)
Token       Tree  Brill  SNoW  LT   Stanford
16S         JJ    CD     NN    JJ   CD
rRNA        NN    NN     NN    NNP  NNP
by          IN    IN     IN    IN   IN
a           DT    DT     DT    DT   DT
novel       NN    NN     NN    JJ   NN
non-coding  VVG   JJ     NN    JJ   JJ
region      NN    NN     NN    NN   NN
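One way to read the two result tables is to list the tokens on which the taggers disagree. The sketch below does this for the 18 tokens shown. The `TREE_TO_PENN` mapping is an assumption added here (Tree Tagger uses its own tag names, such as NP and VVN), not part of the original comparison.

```python
TAGGERS = ["Tree", "Brill", "SNoW", "LT", "Stanford"]

# Token and the five tags, in the column order of the tables.
ROWS = [
    ("the", "DT DT DT DT DT"),
    ("12S", "JJ CD NN JJ CD"),
    ("ribosomal", "JJ JJ JJ JJ JJ"),
    ("RNA", "NP NNP NNP NNP NNP"),
    ("subunit", "NN NN NN NN NN"),
    ("is", "VBZ VBZ VBZ VBZ VBZ"),
    ("inverted", "VVN JJ JJ VBN JJ"),
    ("and", "CC CC CC CC CC"),
    ("separated", "VVN VBN VBD VBN JJ"),
    ("from", "IN IN IN IN IN"),
    ("the", "DT DT DT DT DT"),
    ("16S", "JJ CD NN JJ CD"),
    ("rRNA", "NN NN NN NNP NNP"),
    ("by", "IN IN IN IN IN"),
    ("a", "DT DT DT DT DT"),
    ("novel", "NN NN NN JJ NN"),
    ("non-coding", "VVG JJ NN JJ JJ"),
    ("region", "NN NN NN NN NN"),
]

# Assumed mapping from Tree Tagger tag names to Penn Treebank tag names.
TREE_TO_PENN = {"NP": "NNP", "VVN": "VBN", "VVG": "VBG"}


def normalized_tags(tag_string):
    """Split a row of tags and map the Tree Tagger column to Penn names."""
    tags = tag_string.split()
    tags[0] = TREE_TO_PENN.get(tags[0], tags[0])
    return tags


# Keep only the tokens where at least two taggers give different tags.
disagreements = [
    (token, normalized_tags(tags))
    for token, tags in ROWS
    if len(set(normalized_tags(tags))) > 1
]

for token, tags in disagreements:
    print(token, dict(zip(TAGGERS, tags)))
```

On this sample the taggers agree on function words and common nouns; the disagreements include domain-specific tokens such as 12S, 16S, rRNA, and non-coding.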
Comparison of POS Taggers
        Tree           Brill                 SNoW                         LT                  Stanford
Src.    Stuttgart      Eric Brill            UIUC                         Edinburgh           Stanford
Alg.    decision tree  transformation based  network of linear functions  HMM disambiguation  maximum entropy
Speed   1min/5M        < 1min/5M             ~8mins/5M                    40mins/5M           80mins/5M
Adapt.  yes            source included       high                         low                 source & API included
Other notes (across taggers): punctuation sensitive; commonly used; help available; 96 – 98% precision
Conclusions
Existing general-purpose POS taggers work fine for our task: most nouns and verbs are correctly identified.
There is still room to improve existing POS taggers for biology data, e.g. to identify gene and protein names.
Speed and adaptability are important.
A Little Bit More on SNoW
SNoW has a POS tagger and a shallow parser.
Speed is reasonable.
Software is adaptable, as help is available from CCG.
The network model can be trained if we have training data.
Result of SNoW Shallow Parser
[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and [VP separated] [PP from] [NP the 16 S rRNA] [PP by] [NP a novel non-coding region]
(from online demo)
Problems: Currently the package is not available for download from the new CCG page. There are still problems running the old package on our machine (compilation, path setting, etc.).
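The bracketed shallow-parser output on this slide is easy to post-process. The sketch below assumes the `[LABEL word word ...]` layout seen in the demo output; `parse_chunks` is a hypothetical helper written for illustration, not part of the SNoW package.

```python
import re

# One chunk: an opening bracket, a label, the words, a closing bracket.
CHUNK = re.compile(r"\[(\w+)\s+([^\]]+)\]")


def parse_chunks(line):
    """Return (label, words) pairs; tokens outside brackets are skipped."""
    return [(label, words.split()) for label, words in CHUNK.findall(line)]


output = (
    "[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and "
    "[VP separated] [PP from] [NP the 16 S rRNA] [PP by] "
    "[NP a novel non-coding region]"
)

chunks = parse_chunks(output)
noun_phrases = [" ".join(words) for label, words in chunks if label == "NP"]
print(noun_phrases)
```

Noun phrases collected this way are natural candidates to pass on to an entity extractor.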
Parsers
SNoW (already covered)
LT-Chunk
MiniPar
Collins
Stanford
Result of LT-Chunk
[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] (( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN [[ a_DT novel_JJ non-coding_JJ region_NN ]]
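The LT-Chunk output above can be read back into groups of (word, tag) pairs. The sketch assumes that `[[ ... ]]` marks a noun group and `(( ... ))` a verb group, and that each token has the form `word_TAG`; the labels "NG", "VG", and "O" are names chosen here for illustration.

```python
def parse_lt_chunk(line):
    """Return a list of (kind, [(word, tag), ...]) groups in reading order."""
    groups = []
    current_kind = "O"  # "O" marks tokens outside any bracketed group
    for item in line.split():
        if item == "[[":
            current_kind = "NG"
            groups.append((current_kind, []))
        elif item == "((":
            current_kind = "VG"
            groups.append((current_kind, []))
        elif item in ("]]", "))"):
            current_kind = "O"
        else:
            # Split on the last underscore so hyphenated words stay intact.
            word, _, tag = item.rpartition("_")
            if current_kind == "O":
                groups.append(("O", [(word, tag)]))
            else:
                groups[-1][1].append((word, tag))
    return groups


output = (
    "[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] "
    "(( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN "
    "[[ the_DT 16S_JJ rRNA_NNP ]] by_IN "
    "[[ a_DT novel_JJ non-coding_JJ region_NN ]]"
)

groups = parse_lt_chunk(output)
for kind, tokens in groups:
    print(kind, tokens)
```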
Result of MiniPar
16 (the ~ Det 20 det (gov subunit))
17 (12S ~ N 20 nn (gov subunit))
18 (ribosomal ~ A 20 mod (gov subunit))
19 (RNA ~ N 20 nn (gov subunit))
20 (subunit ~ N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))
E4 (() subunit N 22 obj (gov invert)
23 (and ~ U 22 lex-mod (gov invert))
24 (separated separate V 22 lex-dep (gov invert))
25 (from ~ Prep 22 mod (gov invert))
26 (the ~ Det 28 det (gov rRNA))
27 (16S ~ N 28 nn (gov rRNA))
28 (rRNA ~ N 25 pcomp-n (gov from))
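Each MiniPar line carries a dependency relation that can be pulled out with a small parser. The sketch below assumes the field order `id (word lemma category head-id relation (gov governor))`, with `~` meaning the lemma equals the word; the `Node` type and `parse_node` helper are hypothetical names, not part of the MiniPar API. Only a few lines of the output are used.

```python
import re
from typing import NamedTuple, Optional


class Node(NamedTuple):
    node_id: str
    word: str
    lemma: str
    category: str
    head_id: str
    relation: str
    governor: str


LINE = re.compile(
    r"^(\S+)\s+\((\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+"
    r"\(gov\s+([^\s()]+)\)\)$"
)


def parse_node(line: str) -> Optional[Node]:
    """Parse one output line; return None if it does not fit the layout."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    node_id, word, lemma, category, head_id, relation, governor = match.groups()
    if lemma == "~":  # "~" is read here as "lemma is the same as the word"
        lemma = word
    return Node(node_id, word, lemma, category, head_id, relation, governor)


sample = [
    "16 (the ~ Det 20 det (gov subunit))",
    "17 (12S ~ N 20 nn (gov subunit))",
    "18 (ribosomal ~ A 20 mod (gov subunit))",
    "19 (RNA ~ N 20 nn (gov subunit))",
    "20 (subunit ~ N 22 s (gov invert))",
    "28 (rRNA ~ N 25 pcomp-n (gov from))",
]

nodes = [parse_node(line) for line in sample]
modifiers = [n.word for n in nodes if n is not None and n.governor == "subunit"]
print(modifiers)
```

Grouping words by governor in this way recovers the full noun phrase around "subunit" directly from the dependency output.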
Results of Collins Parser
(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1 is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1 inverted/JJ ) and/CC (VP~separated~3~1 separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3 the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN (NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ non-coding/JJ region/NN ,/PUNC, )
Comparison of Parsers
        LT                         MiniPar             Collins          Stanford
Src.    Edinburgh                  U Alberta           M. Collins       Stanford
Speed   40min/5M                   14min/5M            > 3 hrs/5M       very slow …
Adapt.  Low, training not allowed  High, provides API  Source included  Source & API included
Prec. (across parsers): part of LT-POS (LT); slightly over 88%; ~ 85%
Other:
LT: LT-Chunk is a part of LT-POS; readable output.
MiniPar: complex output of dependency and governing info.
Collins: well-known; tagged input needed.
Stanford: Java based.
Conclusion on Parsers
MiniPar has advantages so far:
Fast
Outputs dependency & governing info. and useful relations
Provides API
If SNoW is tuned for the task, we can easily plug it into the module.
Entity Extractors
Abner: extracts protein, DNA, RNA, cell line, and cell type
Yagi: extracts only gene names; a sibling of Abner
LingPipe: Named entity extraction that can be trained for different domains.
Result of Abner
Ten of <RNA>22 transfer RNAs</RNA> are in different locations relative to hard ticks , and the 12 <protein>S ribosomal RNA subunit</protein> is inverted and separated from the 16 S rRNA by a novel non-coding region, …
Result of LingPipe
Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit</ENAMEX> is inverted and separated from the <ENAMEX id="1" type="GENE">16S rRNA</ENAMEX> by a novel non-coding region, …
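The two extractors mark entities with different inline tags, so comparing them is easier after converting both outputs to (type, text) pairs. The sketch below assumes only the tag layouts visible in the two results above; the regular expressions and `extract_entities` are illustrative helpers, not part of either tool.

```python
import re

# Abner-style tags: <type>text</type>
ABNER_TAG = re.compile(r"<(\w+)>(.*?)</\1>")
# LingPipe-style tags: <ENAMEX id="N" type="TYPE">text</ENAMEX>
ENAMEX_TAG = re.compile(r'<ENAMEX id="\d+" type="(\w+)">(.*?)</ENAMEX>')


def extract_entities(tagged, pattern):
    """Return (entity type, entity text) pairs in order of appearance."""
    return [(etype, text) for etype, text in pattern.findall(tagged)]


abner_out = (
    "Ten of <RNA>22 transfer RNAs</RNA> are in different locations relative "
    "to hard ticks , and the 12 <protein>S ribosomal RNA subunit</protein> "
    "is inverted and separated from the 16 S rRNA by a novel non-coding region"
)
lingpipe_out = (
    'Ten of 22 transfer RNAs are in different locations relative to hard '
    'ticks, and the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit'
    '</ENAMEX> is inverted and separated from the <ENAMEX id="1" type="GENE">'
    '16S rRNA</ENAMEX> by a novel non-coding region'
)

abner_entities = extract_entities(abner_out, ABNER_TAG)
lingpipe_entities = extract_entities(lingpipe_out, ENAMEX_TAG)
print(abner_entities)
print(lingpipe_entities)
```

With both outputs in the same form, differences stand out, such as Abner's span starting at "S" rather than "12S" and its missing "16S rRNA".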
Comparison of Entity Extractors
        Abner                                            Yagi                                      LingPipe
Src.    U Wisconsin                                      U Wisconsin                               Alias-i, Inc.
Alg.    CRF Model                                        CRF Model                                 B-CUBED alg.
Prec.   89.3%/69.9% (seen/unseen) data, 72% for protein  75% on unseen data                        Exact match: 64.9%
Speed   40mins/5M                                        3mins/5M                                  5mins/5M (model1), 3hrs/5M (model2)
Adapt.  Java based, pre-trained                          Java based, pre-trained with BioCreative  Two trained models, training allowed
Recall (across extractors): 65%; exact: ~ 70%
Other:
Abner: graphic interface; files <= 500k.
Yagi: should split into small files <= 1M; can take directory as input.
LingPipe: command line & demo; also does co-referencing.
Conclusion on Entity Extractors
Still a lot of room to improve. However, with existing extractors we can begin high-level text mining work. Performance on honeybee data needs to be evaluated. As soon as a better extractor is constructed, we can plug it in easily.
Summary
Some existing NLP tools for supporting biology literature mining (POS taggers, parsers, and entity extractors) are evaluated.
Observations along two lines:
There is still considerable room for improvement beyond the existing NLP tools, especially in customizing them for special domains.
We can begin exploring higher-level text mining research with the support of these toolkits.
Text preprocessing modules are independent and easy to plug and play.
References
Hirschman, L. et al. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002.
Dekang Lin. Dependency-based evaluation of MiniPar. In Workshop on the Evaluation of Parsing Systems, 1998.
End of Talk
Thank you!