NLP Tools for Biology Literature Mining Qiaozhu Mei Jing Jiang ChengXiang Zhai Nov 3, 2004.


What do we have?

Biology Literature (huge amount of text)

E.g. Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. … (from Biological Abstracts)

What do we want?

Named entities: gene names, protein names, drugs, etc.

Interaction events between entities: transcription, translation, post-translational modification, etc.

Relationships between basic events: caused by, inhibited by, etc.

(from Hirschman et al. 02)

Preliminary System Structure

[Diagram] Collections of raw textual data -> POS Tagger -> Parser -> Entity Extractor -> … -> Pre-processed data ready to mine

Module outputs:
POS Tagger: Nouns, Verbs, etc.
Parser: NPs, VPs, Relations
Entity Extractor: Genes, proteins, other entities
…

Text Pre-processing: NLP

Text Mining Modules: TM
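
The structure above can be pictured as a chain of interchangeable stages, each adding one annotation layer to a document. The following is a minimal sketch under stated assumptions: the three `toy_*` functions are hypothetical stand-ins written for illustration, not wrappers around any of the tools evaluated in this talk, and the dictionary-based document format is likewise an assumption.

```python
from typing import Any, Callable, Dict, List

# A document is a dict of annotation layers; each stage adds one layer.
Doc = Dict[str, Any]
Stage = Callable[[Doc], Doc]


def toy_pos_tagger(doc: Doc) -> Doc:
    """Hypothetical stand-in for a POS tagger: tags every token as NN."""
    tokens = doc["text"].split()
    return {**doc, "pos": [(tok, "NN") for tok in tokens]}


def toy_parser(doc: Doc) -> Doc:
    """Hypothetical stand-in for a parser: one NP chunk over all tokens."""
    return {**doc, "chunks": [("NP", [tok for tok, _ in doc["pos"]])]}


def toy_entity_extractor(doc: Doc) -> Doc:
    """Hypothetical stand-in for an extractor: tokens containing 'RNA'."""
    return {**doc, "entities": [tok for tok, _ in doc["pos"] if "RNA" in tok]}


def run_pipeline(text: str, stages: List[Stage]) -> Doc:
    """Run the stages in order; any stage can be swapped for another."""
    doc: Doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "the 16S rRNA", [toy_pos_tagger, toy_parser, toy_entity_extractor]
)
print(result["entities"])
```

Because every stage has the same signature, replacing one tagger or extractor with another only changes the list passed to `run_pipeline`.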

POS Taggers

Tree Tagger
Brill Tagger
SNoW Tagger
LT Chunk
Stanford Tagger

Results of POS Tagging

Raw text: Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. …

(from Biological Abstracts)

Results of POS Tagging (cont.)

Token       Tree  Brill  SNoW  LT    Stanford
the         DT    DT     DT    DT    DT
12S         JJ    CD     NN    JJ    CD
ribosomal   JJ    JJ     JJ    JJ    JJ
RNA         NP    NNP    NNP   NNP   NNP
subunit     NN    NN     NN    NN    NN
is          VBZ   VBZ    VBZ   VBZ   VBZ
inverted    VVN   JJ     JJ    VBN   JJ
and         CC    CC     CC    CC    CC
separated   VVN   VBN    VBD   VBN   JJ
from        IN    IN     IN    IN    IN
the         DT    DT     DT    DT    DT

Results of POS Tagging (cont.)


Token       Tree  Brill  SNoW  LT    Stanford
16S         JJ    CD     NN    JJ    CD
rRNA        NN    NN     NN    NNP   NNP
by          IN    IN     IN    IN    IN
a           DT    DT     DT    DT    DT
novel       NN    NN     NN    JJ    NN
non-coding  VVG   JJ     NN    JJ    JJ
region      NN    NN     NN    NN    NN
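
One way to summarize the tagging results is to count the tokens on which all five taggers agree. The sketch below uses only the tags shown on these slides. One assumption is made explicit: Tree Tagger uses its own tag names (NP, VVN, VVG), which are mapped here to the Penn Treebank equivalents (NNP, VBN, VBG) before comparison.

```python
# Tagger order matches the columns of the result tables.
TAGGERS = ["Tree", "Brill", "SNoW", "LT", "Stanford"]

ROWS = [
    ("the", "DT DT DT DT DT"),
    ("12S", "JJ CD NN JJ CD"),
    ("ribosomal", "JJ JJ JJ JJ JJ"),
    ("RNA", "NP NNP NNP NNP NNP"),
    ("subunit", "NN NN NN NN NN"),
    ("is", "VBZ VBZ VBZ VBZ VBZ"),
    ("inverted", "VVN JJ JJ VBN JJ"),
    ("and", "CC CC CC CC CC"),
    ("separated", "VVN VBN VBD VBN JJ"),
    ("from", "IN IN IN IN IN"),
    ("the", "DT DT DT DT DT"),
    ("16S", "JJ CD NN JJ CD"),
    ("rRNA", "NN NN NN NNP NNP"),
    ("by", "IN IN IN IN IN"),
    ("a", "DT DT DT DT DT"),
    ("novel", "NN NN NN JJ NN"),
    ("non-coding", "VVG JJ NN JJ JJ"),
    ("region", "NN NN NN NN NN"),
]

# Assumption: map Tree Tagger tag names to Penn Treebank names.
TREE_TO_PENN = {"NP": "NNP", "VVN": "VBN", "VVG": "VBG"}


def normalized_tags(tag_string):
    """Split a row into five tags and normalize the Tree Tagger column."""
    tags = tag_string.split()
    tags[0] = TREE_TO_PENN.get(tags[0], tags[0])
    return tags


# Tokens on which at least two taggers disagree after normalization.
disputed = [
    (token, normalized_tags(tags))
    for token, tags in ROWS
    if len(set(normalized_tags(tags))) > 1
]
unanimous = len(ROWS) - len(disputed)

for token, tags in disputed:
    print(token, dict(zip(TAGGERS, tags)))
print(f"{unanimous} of {len(ROWS)} tokens tagged identically by all taggers")
```

On this excerpt the disputed tokens are 12S, 16S, rRNA, novel, non-coding and the two participles, which are the biology-specific or ambiguous words rather than the function words.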

Comparison of POS Taggers


        Tree           Brill                 SNoW                         LT                  Stanford
Src.    Stuttgart      Eric Brill            UIUC                         Edinburgh           Stanford
Alg.    decision tree  transformation based  network of linear functions  HMM disambiguation  maximum entropy
Speed   1min/5M        < 1min/5M             ~8mins/5M                    40mins/5M           80mins/5M
Adapt.  yes            source included       high                         low                 source & API included
Other   punctuation sensitive; commonly used; help available; 96 – 98% precision (four entries for five columns, in slide order)

Conclusions

Existing general-purpose POS taggers work fine for our task: most nouns and verbs are correctly identified.

There is still room to improve existing POS taggers for biology data, e.g. to identify gene and protein names.

Speed and adaptability are important.

A Little Bit More on SNoW

SNoW has a POS tagger and a shallow parser.
Speed is reasonable.
Software is adaptable, as help is available from CCG.
The network model can be trained if we have training data.

Result of SNoW Shallow Parser

[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and [VP separated] [PP from] [NP the 16 S rRNA] [PP by] [NP a novel non-coding region]

(from online demo)
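
Bracketed chunk output of this kind is easy to post-process. The following minimal sketch reads the `[LABEL words ...]` groups from the line above with a regular expression; the function name `parse_chunks` is made up for illustration, and the format is inferred from this single demo output rather than from SNoW documentation.

```python
import re

# One chunk looks like "[NP the 16 S rRNA]": a label, then the words.
CHUNK_RE = re.compile(r"\[(\w+) ([^\]]+)\]")


def parse_chunks(line):
    """Return (label, [words]) pairs; words outside brackets are skipped."""
    return [(label, words.split()) for label, words in CHUNK_RE.findall(line)]


snow_output = (
    "[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and "
    "[VP separated] [PP from] [NP the 16 S rRNA] [PP by] "
    "[NP a novel non-coding region]"
)

chunks = parse_chunks(snow_output)
noun_phrases = [" ".join(words) for label, words in chunks if label == "NP"]
print(noun_phrases)
```

The noun phrases recovered this way are natural candidates to pass on to an entity extractor.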

Problems: Currently the package is not available for download from the new CCG page. There are still problems running the old package on our machine (compilation, path setting, etc.).

Parsers

SNoW (already covered)
LT-Chunk
MiniPar
Collins
Stanford

Result of LT-Chunk

[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] (( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN [[ a_DT novel_JJ non-coding_JJ region_NN ]]
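
The LT-Chunk line carries both the chunking and the POS tags, so both can be recovered in one pass. This is a rough sketch that assumes, based on the example above, that `[[ ... ]]` marks noun groups, `(( ... ))` marks verb groups, and each token is written as `word_TAG`. The labels "NG" and "VG" and the function names are chosen here for illustration.

```python
import re

# Either a [[ ... ]] group (capture 1) or a (( ... )) group (capture 2).
GROUP_RE = re.compile(r"\[\[(.*?)\]\]|\(\((.*?)\)\)")


def split_tagged(span):
    """Turn 'the_DT 16S_JJ' into [('the', 'DT'), ('16S', 'JJ')]."""
    return [tuple(item.rsplit("_", 1)) for item in span.split()]


def parse_lt_chunk(line):
    """Return ('NG' | 'VG', [(word, tag), ...]) for each bracketed group."""
    groups = []
    for match in GROUP_RE.finditer(line):
        if match.group(1) is not None:
            groups.append(("NG", split_tagged(match.group(1))))
        else:
            groups.append(("VG", split_tagged(match.group(2))))
    return groups


lt_output = (
    "[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] "
    "(( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN "
    "[[ the_DT 16S_JJ rRNA_NNP ]] by_IN "
    "[[ a_DT novel_JJ non-coding_JJ region_NN ]]"
)

groups = parse_lt_chunk(lt_output)
for kind, tokens in groups:
    print(kind, tokens)
```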

Result of MiniPar

16 (the ~ Det 20 det (gov subunit))
17 (12S ~ N 20 nn (gov subunit))
18 (ribosomal ~ A 20 mod (gov subunit))
19 (RNA ~ N 20 nn (gov subunit))
20 (subunit ~ N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))
E4 (() subunit N 22 obj (gov invert)
23 (and ~ U 22 lex-mod (gov invert))
24 (separated separate V 22 lex-dep (gov invert))
25 (from ~ Prep 22 mod (gov invert))
26 (the ~ Det 28 det (gov rRNA))
27 (16S ~ N 28 nn (gov rRNA))
28 (rRNA ~ N 25 pcomp-n (gov from))
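
MiniPar's output can be turned into (relation, governor, dependent) triples for later mining. The sketch below is an illustration, not MiniPar's own API. It assumes each entry has the form `id (word lemma category head-id relation (gov governor))`, as the entries above suggest, and that `~` means the lemma is the same as the word. The `Dep` type and `parse_minipar` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Dep(NamedTuple):
    node_id: str
    word: str
    lemma: str
    category: str
    head_id: str
    relation: str
    governor: str


# Assumed entry shape: id (word lemma category head relation (gov governor))
LINE_RE = re.compile(
    r"^(\S+)\s+\((\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+\(gov\s+([^)]+)\)"
)


def parse_minipar(text: str) -> List[Dep]:
    """Parse one entry per line; lines that do not match are skipped."""
    deps = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        node_id, word, lemma, category, head_id, relation, governor = match.groups()
        if lemma == "~":  # assumption: "~" stands for "same as the word"
            lemma = word
        deps.append(Dep(node_id, word, lemma, category, head_id, relation, governor))
    return deps


sample = """\
20 (subunit ~ N 22 s (gov invert))
22 (inverted invert V E0 i (gov fin))
25 (from ~ Prep 22 mod (gov invert))
28 (rRNA ~ N 25 pcomp-n (gov from))
"""

deps = parse_minipar(sample)
triples = [(d.relation, d.governor, d.lemma) for d in deps]
print(triples)
```

Triples such as `("s", "invert", "subunit")` link an entity to the verb that governs it, which is the kind of relation the interaction-event mining step needs.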

Results of Collins Parser

(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1 is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1 inverted/JJ ) and/CC (VP~separated~3~1 separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3 the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN (NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ non-coding/JJ region/NN ,/PUNC, )

Comparison of Parsers

        LT                         MiniPar                    Collins           Stanford
Src.    Edinburgh                  U Alberta                  M. Collins        Stanford
Prec.   Part of LT-POS; slightly over 88%; ~ 85% (three entries for four columns, in slide order)
Speed   40min/5M                   14min/5M                   > 3 hrs/5M        very slow …
Adapt.  Low, training not allowed  High, provides API         Source included   Source & API included
Other   LT-Chunk is a part of      Complex output of          Well-known.       Java based.
        LT-POS; readable output    dependency and governing   Tagged input
                                   info.                      needed.

Conclusion on Parsers

MiniPar has advantages so far:
Fast
Outputs dependency & governing info. and useful relations
Provides API

If SNoW is tuned for the task, we can easily plug it into the module.

Entity Extractors

Abner: extracts protein, DNA, RNA, cell line, and cell type

Yagi: extracts only gene names; a sibling tool of Abner

LingPipe: Named entity extraction that can be trained for different domains.

Result of Abner

Ten of <RNA>22 transfer RNAs</RNA> are in different locations relative to hard ticks , and the 12 <protein>S ribosomal RNA subunit</protein> is inverted and separated from the 16 S rRNA by a novel non-coding region, …

Result of LingPipe

Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit</ENAMEX> is inverted and separated from the <ENAMEX id="1" type="GENE">16S rRNA</ENAMEX> by a novel non-coding region, …
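
Both extractors mark entities inline, Abner with tags named after the entity type and LingPipe with `ENAMEX` tags carrying a `type` attribute. A small sketch of how such output could be flattened into (type, text) pairs follows. The function `extract_entities` is illustrative only and handles just the two tag shapes shown on these slides, not nested tags.

```python
import re

# Matches <tag attrs>text</tag>; the closing tag must repeat the opening name.
TAG_RE = re.compile(r"<(\w+)([^>]*)>(.*?)</\1>")
TYPE_RE = re.compile(r'type="([^"]*)"')


def extract_entities(tagged):
    """Return (entity_type, text); use the type attribute when present."""
    entities = []
    for tag, attrs, text in TAG_RE.findall(tagged):
        type_match = TYPE_RE.search(attrs)
        label = type_match.group(1) if type_match else tag
        entities.append((label, text))
    return entities


abner_output = (
    "Ten of <RNA>22 transfer RNAs</RNA> are in different locations relative "
    "to hard ticks , and the 12 <protein>S ribosomal RNA subunit</protein> "
    "is inverted and separated from the 16 S rRNA by a novel non-coding region"
)
lingpipe_output = (
    'and the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit</ENAMEX> '
    'is inverted and separated from the <ENAMEX id="1" type="GENE">16S rRNA'
    "</ENAMEX> by a novel non-coding region"
)

abner_entities = extract_entities(abner_output)
lingpipe_entities = extract_entities(lingpipe_output)
print(abner_entities)
print(lingpipe_entities)
```

Putting both outputs into the same form makes the differences visible: on this sentence Abner splits "12S" and misses "16S rRNA", while LingPipe labels both spans as genes.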

Comparison of Entity Extractors

         Abner                         Yagi                            LingPipe
Src.     U Wisconsin                   U Wisconsin                     Alias-i, Inc.
Alg.     CRF Model                     CRF Model                       B-CUBED alg.
Prec.    89.3%/69.9% (seen/unseen)     75% on unseen data              Exact match: 64.9%
         data, 72% for protein
Recall   65%; Exact: ~ 70% (two entries for three columns, in slide order)
Speed    40mins/5M                     3mins/5M                        5mins/5M (model1), 3hrs/5M (model2)
Adapt.   Java based, pre-trained       Java based, pre-trained with    Two trained models, training allowed
                                       BioCreative
Other    Graphic interface;            Should split into small files   Command line & demo. Also does
         files <= 500k                 <= 1M; can take directory as    co-referencing.
                                       input

Conclusion on Entity Extractors

There is still a lot of room to improve. However, with existing extractors we can begin high-level text mining work. Performance on honeybee data needs to be evaluated. As soon as a better extractor is constructed, we can plug it in easily.

Summary

Some existing NLP tools for supporting biology literature mining were evaluated: POS taggers, parsers, and entity extractors.

Observations along two lines:

There is still considerable room for improvement beyond the existing NLP tools, especially in customizing them for special domains.

We can begin exploring higher-level text mining research with the support of these toolkits.

Text pre-processing modules are independent and easy to plug and play.

References

Hirschman, L. et al. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002.

Dekang Lin. Dependency-based evaluation of MiniPar. In Workshop on the Evaluation of Parsing Systems, 1998.

End of Talk

Thank you!
