NLP Tools for Biology Literature Mining Qiaozhu Mei Jing Jiang ChengXiang Zhai Nov 3, 2004

Jan 13, 2016

Page 1

NLP Tools for Biology Literature Mining

Qiaozhu Mei
Jing Jiang
ChengXiang Zhai

Nov 3, 2004

Page 2

What do we have?

Biology Literature (huge amount of text)

E.g. Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. … (from Biological Abstracts)

Page 3

What do we want?

Named entities: gene names, protein names, drugs, etc.

Interaction events between entities: transcription, translation, post-translational modification, etc.

Relationships between basic events: caused by, inhibited by, etc.

(from Hirschman et al. 02)

Page 4

Preliminary System Structure

Collections of raw textual data
  -> Text Pre-processing (NLP): POS Tagger, Parser, Entity Extractor, …
       POS Tagger: Nouns, Verbs, etc.
       Parser: NPs, VPs, Relations
       Entity Extractor: Genes, proteins, other entities
  -> Pre-processed data ready to mine
  -> Text Mining Modules (TM)
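The structure on this slide can be read as a chain of interchangeable pre-processing stages feeding the text mining modules. The following is only a minimal sketch of that idea in Python: `Document`, `run_pipeline`, and the two toy stages are hypothetical names invented for illustration, and the toy stages merely stand in for real taggers and extractors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Raw text plus the annotation layers added by each pre-processing stage."""
    text: str
    layers: Dict[str, list] = field(default_factory=dict)


# A stage takes a document, adds one annotation layer, and returns the document.
Stage = Callable[[Document], Document]


def toy_pos_tagger(doc: Document) -> Document:
    # Hypothetical stand-in for a real tagger: every token is tagged "NN".
    doc.layers["pos"] = [(tok, "NN") for tok in doc.text.split()]
    return doc


def toy_entity_extractor(doc: Document) -> Document:
    # Hypothetical stand-in: treat tokens containing "RNA" as entity mentions.
    doc.layers["entities"] = [tok for tok in doc.text.split() if "RNA" in tok]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    """Apply the stages in order; swapping one stage does not affect the others."""
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("the 16S rRNA", [toy_pos_tagger, toy_entity_extractor])
print(doc.layers["entities"])
```

Because each stage only adds its own layer, a better tagger or extractor could replace a toy stage without touching the rest of the chain.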

Page 5

POS Taggers

Tree Tagger
Brill Tagger
SNoW Tagger
LT Chunk
Stanford Tagger

Page 6

Results of POS Tagging

Raw text: Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. …

(from Biological Abstracts)

Page 7

Results of POS Tagging (cont.)

Token     | Tree | Brill | SNoW | LT  | Stanford
the       | DT   | DT    | DT   | DT  | DT
12S       | JJ   | CD    | NN   | JJ  | CD
ribosomal | JJ   | JJ    | JJ   | JJ  | JJ
RNA       | NP   | NNP   | NNP  | NNP | NNP
subunit   | NN   | NN    | NN   | NN  | NN
is        | VBZ  | VBZ   | VBZ  | VBZ | VBZ
inverted  | VVN  | JJ    | JJ   | VBN | JJ
and       | CC   | CC    | CC   | CC  | CC
separated | VVN  | VBN   | VBD  | VBN | JJ
from      | IN   | IN    | IN   | IN  | IN
the       | DT   | DT    | DT   | DT  | DT

Page 8

Results of POS Tagging (cont.)

Token      | Tree | Brill | SNoW | LT  | Stanford
16S        | JJ   | CD    | NN   | JJ  | CD
rRNA       | NN   | NN    | NN   | NNP | NNP
by         | IN   | IN    | IN   | IN  | IN
a          | DT   | DT    | DT   | DT  | DT
novel      | NN   | NN    | NN   | JJ  | NN
non-coding | VVG  | JJ    | NN   | JJ  | JJ
region     | NN   | NN    | NN   | NN  | NN
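One way to summarize the two result slides is to count the tokens on which all five taggers agree. The sketch below does this in Python with the tags copied from the slides. It assumes that the Tree Tagger labels `NP`, `VVN`, and `VVG` correspond to the Penn Treebank labels `NNP`, `VBN`, and `VBG`. That mapping is an assumption made for this comparison and is not stated on the slides.

```python
# Tags copied from the two result slides, in column order:
# Tree, Brill, SNoW, LT, Stanford.
ROWS = [
    ("the", "DT DT DT DT DT"),
    ("12S", "JJ CD NN JJ CD"),
    ("ribosomal", "JJ JJ JJ JJ JJ"),
    ("RNA", "NP NNP NNP NNP NNP"),
    ("subunit", "NN NN NN NN NN"),
    ("is", "VBZ VBZ VBZ VBZ VBZ"),
    ("inverted", "VVN JJ JJ VBN JJ"),
    ("and", "CC CC CC CC CC"),
    ("separated", "VVN VBN VBD VBN JJ"),
    ("from", "IN IN IN IN IN"),
    ("the", "DT DT DT DT DT"),
    ("16S", "JJ CD NN JJ CD"),
    ("rRNA", "NN NN NN NNP NNP"),
    ("by", "IN IN IN IN IN"),
    ("a", "DT DT DT DT DT"),
    ("novel", "NN NN NN JJ NN"),
    ("non-coding", "VVG JJ NN JJ JJ"),
    ("region", "NN NN NN NN NN"),
]

# Assumed mapping from Tree Tagger labels to Penn Treebank labels.
TREE_TO_PENN = {"NP": "NNP", "VVN": "VBN", "VVG": "VBG"}


def normalized(tags):
    """Map only the first (Tree Tagger) column onto Penn Treebank labels."""
    return [TREE_TO_PENN.get(tags[0], tags[0])] + tags[1:]


def unanimous(rows):
    """Return the tokens that receive the same tag from all five taggers."""
    return [tok for tok, tags in rows if len(set(normalized(tags.split()))) == 1]


agreed = unanimous(ROWS)
print(f"{len(agreed)} of {len(ROWS)} tokens tagged identically by all five taggers")
```

The disagreements are concentrated on domain-specific tokens such as 12S, 16S, and rRNA, and on participles such as inverted and separated.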

Page 9

Comparison of POS Taggers

       | Tree          | Brill                | SNoW                        | LT                 | Stanford
Src.   | Stuttgart     | Eric Brill           | UIUC                        | Edinburgh          | Stanford
Alg.   | decision tree | transformation based | network of linear functions | HMM disambiguation | maximum entropy
Speed  | 1min/5M       | < 1min/5M            | ~8mins/5M                   | 40mins/5M          | 80mins/5M
Adapt. | yes           | source included      | high                        | low                | source & API included

Other: punctuation sensitive; commonly used; help available; 96 – 98% precision

Page 10

Conclusions

Existing general-purpose POS taggers work fine for our task.
  Most nouns and verbs are correctly identified.

There is still room to improve existing POS taggers for biology data.
  E.g. to identify gene and protein names.

Speed and adaptability are important.

Page 11

A Little Bit More on SNoW

SNoW has a POS tagger and a shallow parser.
Speed is reasonable.
The software is adaptable, as help is available from CCG.
The network model can be trained if we have training data.

Page 12

Result of SNoW Shallow Parser

[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and [VP separated] [PP from] [NP the 16 S rRNA] [PP by] [NP a novel non-coding region]

(from online demo)

Problems: The package is currently not available for download from the new CCG page. There are still problems running the old package on our machine (compilation, path setting, etc.).
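The bracketed shallow-parser output shown on this slide is straightforward to post-process for later mining steps. The following Python sketch reads the chunks with a regular expression. `read_chunks` is a hypothetical helper written for this example, not part of the SNoW package, and it assumes chunks are flat (not nested), as in the demo output.

```python
import re

# One chunk looks like "[LABEL word word ...]"; chunks are assumed not to nest.
CHUNK = re.compile(r"\[(\w+) ([^\]]+)\]")


def read_chunks(line):
    """Return (label, tokens) pairs for every bracketed chunk in the line."""
    return [(label, words.split()) for label, words in CHUNK.findall(line)]


line = (
    "[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and "
    "[VP separated] [PP from] [NP the 16 S rRNA] [PP by] "
    "[NP a novel non-coding region]"
)
chunks = read_chunks(line)

# Keep only the noun phrases, e.g. as candidate entity mentions.
noun_phrases = [" ".join(tokens) for label, tokens in chunks if label == "NP"]
print(noun_phrases)
```

Tokens outside any bracket, such as the conjunction "and", are skipped by this sketch.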

Page 13

Parsers

SNoW (already covered)
LT-Chunk
MiniPar
Collins
Stanford

Page 14

Result of LT-Chunk

[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] (( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN [[ a_DT novel_JJ non-coding_JJ region_NN ]]

Page 15

Result of MiniPar

16 (the ~ Det 20 det (gov subunit))
17 (12S ~ N 20 nn (gov subunit))
18 (ribosomal ~ A 20 mod (gov subunit))
19 (RNA ~ N 20 nn (gov subunit))
20 (subunit ~ N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))
E4 (() subunit N 22 obj (gov invert))
23 (and ~ U 22 lex-mod (gov invert))
24 (separated separate V 22 lex-dep (gov invert))
25 (from ~ Prep 22 mod (gov invert))
26 (the ~ Det 28 det (gov rRNA))
27 (16S ~ N 28 nn (gov rRNA))
28 (rRNA ~ N 25 pcomp-n (gov from))
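Each MiniPar output node has the shape `id (word lemma category head relation (gov governor))`, so the dependency relations can be pulled out with a small parser. The Python sketch below is one possible reading of that format, inferred only from the sample on this slide. `Node` and `parse_node` are hypothetical names, and treating `~` as "lemma equals word" is an assumption.

```python
import re
from typing import NamedTuple, Optional


class Node(NamedTuple):
    node_id: str
    word: str
    lemma: str
    category: str
    head_id: str
    relation: str
    governor: str


# id (word lemma category head relation (gov governor))
NODE = re.compile(
    r"^(\S+)\s+\((\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+\(gov\s+(\S+?)\)\)$"
)


def parse_node(line: str) -> Optional[Node]:
    """Parse one output line; return None if it does not match the format."""
    match = NODE.match(line.strip())
    if match is None:
        return None
    node_id, word, lemma, category, head_id, relation, governor = match.groups()
    if lemma == "~":  # assumed to mean that the lemma is the word itself
        lemma = word
    return Node(node_id, word, lemma, category, head_id, relation, governor)


sample = [
    "20 (subunit ~ N 22 s (gov invert))",
    "22 (inverted invert V E0 i (gov fin))",
    "28 (rRNA ~ N 25 pcomp-n (gov from))",
]
nodes = [parse_node(line) for line in sample]
for node in nodes:
    print(node.governor, node.relation, node.lemma)
```

Triples of governor, relation, and dependent like these are the kind of information a relation-mining module could consume.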

Page 16

Results of Collins Parser

(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1 is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1 inverted/JJ ) and/CC (VP~separated~3~1 separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3 the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN (NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ non-coding/JJ region/NN ,/PUNC, )

Page 17

Comparison of Parsers

       | LT | MiniPar | Collins | Stanford
Src.   | Edinburgh | U Alberta | M. Collins | Stanford
Speed  | 40min/5M | 14min/5M | > 3 hrs/5M | very slow …
Adapt. | Low, training not allowed | High, provides API | Source included | Source & API included
Other  | LT-Chunk is a part of LT-POS; readable output | Complex output of dependency and governing info. | Well-known. Tagged input needed. | Java based.

Prec.: Part of LT-POS; slightly over 88%; ~ 85%

Page 18

Conclusion on Parsers

MiniPar has advantages so far:
  Fast
  Outputs dependency & governing info. and useful relations
  Provides API

If SNoW is tuned for the task, we can easily plug it into the module.

Page 19

Entity Extractors

Abner: extracts protein, DNA, RNA, cell line, and cell type

Yagi: extracts only gene names; a sibling of Abner

LingPipe: Named entity extraction that can be trained for different domains.

Page 20

Result of Abner

Ten of <RNA>22 transfer RNAs</RNA> are in different locations relative to hard ticks , and the 12 <protein>S ribosomal RNA subunit</protein> is inverted and separated from the 16 S rRNA by a novel non-coding region, …

Page 21

Result of LingPipe

Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit</ENAMEX> is inverted and separated from the <ENAMEX id="1" type="GENE">16S rRNA</ENAMEX> by a novel non-coding region, …
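The inline `ENAMEX` markup in this output can be turned into a list of entity mentions for downstream mining. The Python sketch below uses a regular expression rather than any LingPipe API. `read_entities` is a hypothetical helper, and the pattern assumes flat, non-nested tags with exactly the `id` and `type` attributes shown on the slide.

```python
import re

# Matches <ENAMEX id="N" type="T">mention</ENAMEX>; tags are assumed not to nest.
ENAMEX = re.compile(r'<ENAMEX id="(\d+)" type="(\w+)">(.*?)</ENAMEX>')


def read_entities(tagged):
    """Return (id, type, mention) triples in the order they appear."""
    return [
        (int(ent_id), ent_type, mention)
        for ent_id, ent_type, mention in ENAMEX.findall(tagged)
    ]


tagged = (
    'and the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit</ENAMEX> '
    'is inverted and separated from the '
    '<ENAMEX id="1" type="GENE">16S rRNA</ENAMEX> by a novel non-coding region'
)
entities = read_entities(tagged)
for ent_id, ent_type, mention in entities:
    print(ent_id, ent_type, mention)
```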

Page 22

Comparison of Entity Extractors

       | Abner | Yagi | LingPipe
Src.   | U Wisconsin | U Wisconsin | Alias-i, Inc.
Alg.   | CRF Model | CRF Model | B-CUBED alg.
Prec.  | 89.3%/69.9% (seen/unseen data), 72% for protein | 75% on unseen data | Exact Match: 64.9%
Speed  | 40mins/5M | 3mins/5M | 5mins/5M (model1), 3hrs/5M (model2)
Adapt. | Java based, pre-trained | Java based, pre-trained with BioCreative | Two trained models, training allowed
Other  | Graphic interface; files <= 500k | Should split into small files <= 1M; can take directory as input | Command line & demo. Also does co-referencing.

Recall: 65%; Exact: ~ 70%

Page 23

Conclusion on Entity Extractors

Still a lot of room to improve.
However, with existing extractors we can begin high-level text mining work.
Performance on honeybee data needs to be evaluated.
As soon as a better extractor is constructed, we can plug it in easily.

Page 24

Summary

Some existing NLP tools for supporting biology literature mining (POS taggers, parsers, and entity extractors) are evaluated.

Observations along two lines:

There is still considerable room for improvement beyond the existing NLP tools, especially in customizing them for special domains.

We can begin exploring higher-level text mining research with the support of these toolkits.

Text pre-processing modules are independent and easy to plug and play.

Page 25

References

Hirschman, L. et al. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002.

Dekang Lin. Dependency-based evaluation of MiniPar. In Workshop on the Evaluation of Parsing Systems, 1998.

Page 26

End of Talk

Thank you!