GMOD Meeting, May 2005 Patent Pending, Caltech Proprietary Textpresso Search engine for Biomedical Literature ~Eimear Kenny~

Dec 19, 2015

Page 1

GMOD Meeting, May 2005

Patent Pending, Caltech Proprietary

Textpresso

Search engine for Biomedical Literature

~Eimear Kenny~

Page 2

Page 3

Born out of frustration…

• Search systems effective at locating interesting papers … BUT … have to read the paper to get to the facts.

• Many data are not contained in abstract or index … therefore, important papers can be missed by search engines.

Page 4

The Perfect System

Type in question and the search engine tells you the answer!

Full text

“Conceptual search”

Page 5

Enter Textpresso

• Searches full text
– returns any sentences that match your query

• Provides two ways to query
– search raw data – Keyword search
– search meta-data – Category search
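
The two query modes can be illustrated with a minimal Python sketch. This is not Textpresso's code: the record layout, the function names `keyword_search` and `category_search`, and the second sentence are all made up for illustration. Keyword search looks at the raw sentence text, while category search looks only at the category labels attached to words.

```python
# Toy corpus: each sentence keeps its raw text plus (word, category) annotations.
# The record layout is an assumption made for this sketch.
SENTENCES = [
    {
        "text": "activation of let-7 RNA expression downregulates LIN-41 "
                "to relieve inhibition of lin-29.",
        "annotations": [("let-7", "gene"), ("downregulates", "regulation"),
                        ("inhibition", "regulation"), ("lin-29", "gene")],
    },
    {
        # Invented filler sentence with no annotated words.
        "text": "Worms were grown at 20 degrees.",
        "annotations": [],
    },
]


def keyword_search(sentences, keyword):
    """Search the raw data: sentences whose text contains the keyword."""
    needle = keyword.lower()
    return [s for s in sentences if needle in s["text"].lower()]


def category_search(sentences, category):
    """Search the meta-data: sentences with at least one word in the category."""
    return [s for s in sentences
            if any(cat == category for _, cat in s["annotations"])]
```

A keyword query for "lin-29" and a category query for "gene" both return the first sentence here, but only the category query would also find sentences that mention genes the user did not think to name.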

Page 6

….. activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29.

Categories marked on the sentence: Biological Process (activation, expression), Gene (let-7, lin-29), Regulation (downregulates, inhibition), Molecular Function (LIN-41)

<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE article SYSTEM "/var/www/html/textpresso.dtd">
<article>
//
<sentence id='s7'>
//
<process grammar ='NN' source='textpresso' type='general' biosynthesis='no'> activation</process>
<pposition grammar ='IN' type='of'> of </pposition>
<gene grammar ='JJ' reference='direct'> let-7 </gene>
<text>RNA</text>
<process grammar ='NN' source='textpresso' type='molecular' biosynthesis='expression'> expression</process>
<regulation grammar ='NNS' type='negative'> down regulates</regulation>
<function grammar ='NNP' reference='direct' source='textpresso' protein='yes'> LIN-41 </function>
<pposition grammar ='TO' type='to'>to </pposition>
<text>relieve</text>
<regulation grammar ='NNS' type='negative'> inhibition </regulation>
<pposition grammar ='IN' type='of'> of</pposition>
<gene grammar ='NNP' reference='direct'> lin-29 </gene>
<text>. </text>
</sentence>
//
</article>
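
As a rough illustration of how such markup can be read back, the Python sketch below parses a trimmed copy of the sentence element above with the standard library and groups the words by their category tag. The helper name `words_by_category` is hypothetical, some attributes are dropped for brevity, and Textpresso itself is built on Perl XML modules, so this only shows the idea.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annotated sentence from the slide (some attributes omitted).
SENTENCE_XML = """
<sentence id='s7'>
  <process grammar='NN' type='general'>activation</process>
  <pposition grammar='IN' type='of'>of</pposition>
  <gene grammar='JJ' reference='direct'>let-7</gene>
  <text>RNA</text>
  <process grammar='NN' type='molecular'>expression</process>
  <regulation grammar='NNS' type='negative'>down regulates</regulation>
  <function grammar='NNP' protein='yes'>LIN-41</function>
  <pposition grammar='TO' type='to'>to</pposition>
  <text>relieve</text>
  <regulation grammar='NNS' type='negative'>inhibition</regulation>
  <pposition grammar='IN' type='of'>of</pposition>
  <gene grammar='NNP' reference='direct'>lin-29</gene>
  <text>.</text>
</sentence>
"""


def words_by_category(xml_text):
    """Map each category tag to the words it marks, skipping plain <text>."""
    root = ET.fromstring(xml_text.strip())
    categories = {}
    for element in root:
        if element.tag == "text":
            continue  # unannotated words carry no category
        word = (element.text or "").strip()
        categories.setdefault(element.tag, []).append(word)
    return categories
```

Calling `words_by_category(SENTENCE_XML)` yields one list of words per tag, for example the two gene names under `gene` and the two regulation terms under `regulation`.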

Page 7

Categories

GENE: Locus, let-60, eat-4, LIN-12

REGULATION: repress, enhanced, upregulate, inhibition

PATHWAY: precursor, upstream, cascade, descendants

CELL: Neuron, EMS, HSN, AB, Vulva precursor

37 Categories!!!

Page 8

Page 9

Page 10

lin-39 acts downstream of Ras

lin-25 acts indirectly via sur-2

eor-1 and eor-2 are closely involved in Ras signaling

Page 11

Page 12

Using Textpresso to expedite curation

Find sentences from the literature that describe genetic interaction!

>= 2 named “Gene” && (>= 1 “Association” || >= 1 “Regulation”)
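
The genetic-interaction query on this slide can be written as a small predicate. The Python sketch below is an illustration only: it assumes each sentence is available as a list of (word, category) pairs, that category names are lower-case strings, and that every "gene" annotation counts as one named gene. The function name is hypothetical.

```python
from collections import Counter


def describes_genetic_interaction(annotations):
    """Apply: >= 2 Gene && (>= 1 Association || >= 1 Regulation).

    annotations: list of (word, category) pairs for one sentence.
    Assumption: each "gene" annotation counts as one named gene mention.
    """
    counts = Counter(category for _, category in annotations)
    return counts["gene"] >= 2 and (
        counts["association"] >= 1 or counts["regulation"] >= 1
    )


# Example built from the let-7 sentence shown earlier in the talk.
example = [("let-7", "gene"), ("down regulates", "regulation"), ("lin-29", "gene")]
candidate = describes_genetic_interaction(example)
```

A sentence passing this filter is only a candidate; a curator still has to confirm that it really describes an interaction.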

Page 13

Sentences containing gene-gene interactions

Sampling 200 sentences ……

Random                       1 (0.5%)
2 named genes               13 (6.5%)
2 named genes + 1 category  39 (19.5%)

Adding Textpresso category enriches 3-fold!
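
The percentages and the 3-fold figure follow directly from the counts on this slide. A short Python check, using only the numbers shown (the variable names are ours):

```python
SAMPLE_SIZE = 200  # sentences sampled per query type

# Sentences that contained a gene-gene interaction, per query type.
hits = {
    "random": 1,
    "2 named genes": 13,
    "2 named genes + 1 category": 39,
}

# Hit rate as a percentage of the 200 sampled sentences.
percentages = {name: 100 * count / SAMPLE_SIZE for name, count in hits.items()}

# Enrichment from adding one Textpresso category to the two-gene query.
enrichment = hits["2 named genes + 1 category"] / hits["2 named genes"]
```

The 3-fold enrichment is relative to the "2 named genes" query; relative to random sampling the same counts give 39-fold.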

Page 14

Installation and Adaptation of Textpresso for your Domain

Page 15

Dependencies

• Tested on Redhat 9.0 or Debian 3.1 (kernel 2.4.20 or higher)
– should work on any unix-based system

• Apache (1.3.29), Perl (5.6.1 or higher)

• Perl Modules:
– XML::Parser, XML::RegExp
– XML::XQL, XML::Checker
– XML::DOM, XML::Parser::PerlSAX
– PDF::Create

• Brill Tagger (C compiler)
– parts of speech tagger (http://research.microsoft.com/~brill/)

• XPDF
– pdftotext utility (http://www.foolabs.com/xpdf/)
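
Before installing, it can help to check which of these dependencies are already present. The Python sketch below is one possible pre-flight check, not part of the Textpresso distribution: the function names are hypothetical, and it assumes the `perl` and `pdftotext` executables are expected on the PATH. It relies on the common `perl -MModule -e 1` idiom, which exits non-zero when a module cannot be loaded.

```python
import shutil
import subprocess

# Module and executable names taken from the dependency list above.
PERL_MODULES = ["XML::Parser", "XML::RegExp", "XML::XQL", "XML::Checker",
                "XML::DOM", "XML::Parser::PerlSAX", "PDF::Create"]
EXECUTABLES = ["perl", "pdftotext"]


def missing_executables(names, which=shutil.which):
    """Return the executables that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


def missing_perl_modules(modules, run=subprocess.run):
    """Return the Perl modules that `perl -MModule -e 1` fails to load."""
    missing = []
    for module in modules:
        result = run(["perl", f"-M{module}", "-e", "1"], capture_output=True)
        if result.returncode != 0:
            missing.append(module)
    return missing
```

Run `missing_executables(EXECUTABLES)` first: if `perl` itself is absent, `missing_perl_modules(PERL_MODULES)` raises `FileNotFoundError` instead of returning a list. Apache and the Brill tagger are left out because their executable names vary between systems.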

Page 16

Download

http://www.textpresso.org

http://www.gmod.org

Page 17

Unpack and Install

Page 18

Web-site

Page 19

Web Scripts

Page 20

Database

Page 21

Build Scripts

[Diagram of the build process. Labels recovered from the slide:]

Inputs and stores: Journal Web-sites, Wormbase Database, Textpresso Ontology, Textpresso Database

Scripts: CollectPapers, CollectAbstracts, PDF2Text, Preprocessor, Text2XML, Index Maker

Data: Electronic PDF, Abstracts, Raw Text, Parts-of-speech Text, Annotated Text, Keywords
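
The arrows of the build diagram do not survive in this transcript, so the order of stages in the following Python sketch is an assumption inferred from the script names (raw text, then preprocessing, then annotation against the ontology, then indexing). Every function here is a toy stand-in with a hypothetical name; none of it is Textpresso's actual build code, and the PDF-to-text step is omitted.

```python
# Toy ontology: word -> category (stand-in for the Textpresso Ontology).
ONTOLOGY = {"let-7": "gene", "lin-29": "gene", "inhibition": "regulation"}


def preprocess(raw_text):
    """Stand-in for the Preprocessor: one sentence per line, split into tokens."""
    return [line.split() for line in raw_text.splitlines() if line.strip()]


def annotate(sentences, ontology):
    """Stand-in for Text2XML: attach a category to every token ('text' if none)."""
    return [[(token, ontology.get(token, "text")) for token in sentence]
            for sentence in sentences]


def make_index(annotated_sentences):
    """Stand-in for the Index Maker: category -> numbers of sentences containing it."""
    index = {}
    for number, sentence in enumerate(annotated_sentences):
        for _, category in sentence:
            if category != "text" and number not in index.setdefault(category, []):
                index[category].append(number)
    return index


raw = "let-7 relieves inhibition of lin-29\nworms were grown"
index = make_index(annotate(preprocess(raw), ONTOLOGY))
```

The resulting index lists sentence 0 under both `gene` and `regulation`, which is the kind of lookup a category search needs.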

Page 22

Tailoring Pt 1 – Text Collection

• Abstracts Collection
– can be downloaded from central resource such as PubMed – PubFetch!

• PDF Collection:
– limited to open access journals (PLoS Biology) or journals to which you subscribe
– inject_pmid script from Textpresso web-site (Allen Day)
– manual download from journal web-site

Page 23

Tailoring Pt 2 – Adapting Ontology

Page 24

Tailoring Pt 2 – Adapting Ontology

• Almost all “Relationship and Description” and “Syntax and Grammar” categories and some “Biological Concepts” categories are generic to the Biomedical domain.

• Some new categories can use existing category structure (yeast genes replace worm genes)

• Some de novo categories would be useful (Cell Cycle, Chromosomal Aberrations, Disease etc).

Page 25

Tailoring Pt 3 – Adapting Interface

Page 26

Tailoring Pt 3 – Adapting Interface

Page 27

Tailoring Pt 3 – Adapting Interface

Page 28

Textpresso 2.0

Page 29

Overhaul Code

• Adding another layer of abstraction
– definition files and modules

    use constant SY_ANNOTATION_FIELDS => {
        abstract => 'abstract/',
        body     => 'body/',
        title    => 'title/',
    };

… defines which fields are to be annotated during the build process

• Advantages:
– easy to adapt software (no script tweaking)
– easy to add new modules
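
To show why a definition file removes the need for script tweaking, here is a Python analogue of the idea (the original is Perl). It assumes the values in SY_ANNOTATION_FIELDS are subdirectories under a common base directory; that reading, the function name, and the example path are assumptions for this sketch.

```python
import os

# Python analogue of the Perl constant shown on the slide.
SY_ANNOTATION_FIELDS = {
    "abstract": "abstract/",
    "body": "body/",
    "title": "title/",
}


def annotation_directories(base_dir, fields=SY_ANNOTATION_FIELDS):
    """Resolve each field to annotate to a directory under base_dir.

    Assumption: each value in the definition is a subdirectory name.
    """
    return {name: os.path.join(base_dir, subdir) for name, subdir in fields.items()}


# Hypothetical base path, for illustration only.
directories = annotation_directories("/data/textpresso")
```

A build step that loops over `directories` picks up a new field as soon as it is added to the definition, without any change to the step itself.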

Page 30

New Features

Page 31

Distributed Searches

Page 32

Variable Scope

Page 33

New Sort Modes

Page 34

100 sentences per hour!

Page 35

Search for patterns in sentences

The life-extension phenotype of old-1 was completely suppressed by daf-16 ( m26 ) ( Figure 1e ) .

<determiner> <text> <phenotype> <preposition> <gene> <auxiliary> <effect> <regulation> <preposition> <gene> <bracket> <text> <bracket> <bracket> <text> <text> <bracket> <text>

Developed a hidden Markov model to identify common patterns of text that surround required entities.
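
The sentence and its tag sequence on this slide line up one token to one tag. The Python sketch below pairs them and runs a plain ordered-subsequence check for the required entities. This is a deliberately simpler stand-in for the hidden Markov model mentioned on the slide, and the function name is hypothetical.

```python
SENTENCE = ("The life-extension phenotype of old-1 was completely suppressed "
            "by daf-16 ( m26 ) ( Figure 1e ) .")

TAGS = ["determiner", "text", "phenotype", "preposition", "gene", "auxiliary",
        "effect", "regulation", "preposition", "gene", "bracket", "text",
        "bracket", "bracket", "text", "text", "bracket", "text"]

# One (token, tag) pair per whitespace-separated token.
tagged = list(zip(SENTENCE.split(), TAGS))


def contains_in_order(tags, required):
    """True if the required tags occur in this order, with any tags in between."""
    remaining = iter(tags)
    # `tag in remaining` advances the iterator, so matches must appear in order.
    return all(tag in remaining for tag in required)


has_pattern = contains_in_order(TAGS, ["gene", "regulation", "gene"])
```

Unlike this yes/no check, a probabilistic model can also score how typical the surrounding tags are.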

Page 36

Hidden Markov Model

[Diagram of the model. States shown: Begin, three Match states, Insert states (I), End. Labels under the states: <gene>, <gene>, <regulation>]
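
A model with Begin, Match, Insert and End states can score a tag sequence with the Viterbi algorithm. The Python sketch below is a minimal illustration under stated assumptions, not the model from the talk: it assumes the match states expect gene, regulation, gene in that order, has no delete states, and uses made-up probabilities rather than trained parameters.

```python
import math

MATCH_TAGS = ["gene", "regulation", "gene"]  # assumed order of the match states
P_MATCH_HIT = 0.9     # hypothetical: match state emits its expected tag
P_MATCH_MISS = 0.01   # hypothetical: match state emits any other tag
P_INSERT_EMIT = 0.05  # hypothetical: flat emission for insert states
P_TO_INSERT = 0.5     # hypothetical: Begin or a match state moves to an insert state
P_STAY_INSERT = 0.6   # hypothetical: insert state loops on itself
NEG_INF = float("-inf")


def match_emit(i, tag):
    """Log-probability that match state i (0-based) emits this tag."""
    return math.log(P_MATCH_HIT if tag == MATCH_TAGS[i] else P_MATCH_MISS)


def viterbi_score(tags):
    """Log-probability of the best state path that emits the whole tag sequence."""
    k = len(MATCH_TAGS)
    m = [0.0] + [NEG_INF] * k   # m[0] is the silent Begin state
    ins = [NEG_INF] * (k + 1)   # ins[j]: insert state after j matches
    for tag in tags:
        new_ins = [
            math.log(P_INSERT_EMIT)
            + max(m[j] + math.log(P_TO_INSERT),
                  ins[j] + math.log(P_STAY_INSERT))
            for j in range(k + 1)
        ]
        new_m = [NEG_INF] + [
            match_emit(j - 1, tag)
            + max(m[j - 1] + math.log(1 - P_TO_INSERT),
                  ins[j - 1] + math.log(1 - P_STAY_INSERT))
            for j in range(1, k + 1)
        ]
        m, ins = new_m, new_ins
    # Transition into End from the last match state or the last insert state.
    return max(m[k] + math.log(1 - P_TO_INSERT),
               ins[k] + math.log(1 - P_STAY_INSERT))
```

With these values a tag sequence that contains gene, regulation, gene in order scores higher than the same sequence with the gene tags replaced by plain text, which is the behaviour one would want from a pattern detector. Scores grow more negative with sentence length, so comparisons are only meaningful between sequences of similar length or after normalization.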

Page 37

True test sentences have similar score to training sentences

Page 38

Textpresso Team

Developers:
Eimear Kenny
Hans-Michael Müller

Code Contributors:
Allen Day (many patches including inject_pmid)
Robert Li (alternative pdf2text converter)
Stan Dong and Christopher Lane (code optimization for speed)
Juancarlos Chan (web-site scripting)

Information Extraction Analysis:
Andrei Petcherski

Paper Collection:
Daniel Wang

Principal Investigator:
Paul Sternberg