Gleaning Relational Information from Biomedical Text
Mark Goadrich
Computer Sciences Department
University of Wisconsin - Madison
Joint Work with Jude Shavlik and Louis Oliphant
CIBM Seminar - Dec 5th 2006
Outline
The Vacation Game
Formalizing with Logic
Biomedical Information Extraction
Evaluating Hypotheses
Gleaning Logical Rules
Experiments
Current Directions
My Secret Rule
– The word must have two adjacent letters which are the same letter.
Found by using inductive logic
– Positive and Negative Examples
– Formulating and Eliminating Hypotheses
– Evaluating Success and Failure
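The secret rule can be written as a one-line check. A minimal sketch in Python; the function name `has_double_letter` and the example words other than "apple" are illustrative and do not come from the talk.

```python
def has_double_letter(word: str) -> bool:
    """Return True if two adjacent letters in the word are the same letter."""
    w = word.lower()
    # Pair each letter with its right-hand neighbour and look for a match.
    return any(a == b for a, b in zip(w, w[1:]))

# Hypothetical positive and negative examples for the game.
positives = ["apple", "book", "Mississippi"]
negatives = ["orange", "magazine", "Ohio"]
assert all(has_double_letter(w) for w in positives)
assert not any(has_double_letter(w) for w in negatives)
```

A learner playing the game only sees the positive and negative words; the hypothesis it settles on should agree with a check like this one on every example.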
Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on train set and then measure performance on test set
In ILP, data are Objects …
– person, block, molecule, word, phrase, …
and Relations between them
– grandfather, has_bond, is_member, …
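To make the objects-and-relations view concrete, here is a small sketch of how a relation such as grandfather can be defined over simpler facts. The names, the `parent` and `male` facts, and the Python encoding are all assumptions for illustration; an ILP system would express the same thing as logic clauses.

```python
# Hypothetical facts: each tuple (x, y) means "x is a parent of y".
parent = {("george", "john"), ("john", "mary"), ("john", "peter")}
male = {"george", "john", "peter"}

def grandfather(x: str, z: str) -> bool:
    """grandfather(X, Z) holds if X is male and a parent of some parent of Z."""
    people = {p for pair in parent for p in pair}
    return x in male and any(
        (x, y) in parent and (y, z) in parent for y in people
    )

print(grandfather("george", "mary"))  # True
print(grandfather("john", "mary"))    # False: john is mary's parent, not grandparent
```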
Formalizing with Logic
apple
a b c d e f g h i j k l m n o p q r s t u v w x y z
Biomedical Information Extraction
NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
Biomedical Information Extraction
The dog running down the street tackled and bit my little sister.
Biomedical Information Extraction
NPL3 encodes a nuclear protein with …
Some Prolog Predicates
Biomedical Predicates
– phrase_contains_medDict_term(Phrase, Word, WordText)
– phrase_contains_mesh_term(Phrase, Word, WordText)
– phrase_contains_mesh_disease(Phrase, Word, WordText)
– phrase_contains_go_term(Phrase, Word, WordText)
0.86 Recall
0.12 Precision
0.21 F1 Score
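The F1 score is the harmonic mean of precision and recall, which is why a rule with high recall but low precision still scores poorly. A short sketch that reproduces the slide's number from its precision and recall (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.12, 0.86), 2))  # 0.21
```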
Precision-Focused Search
Recall-Focused Search
F1-Focused Search
Aleph - Learning
Aleph learns theories of rules (Srinivasan, v4, 2003)
– Pick positive seed example
– Use heuristic search to find best rule
– Pick new seed from uncovered positives and repeat until threshold of positives covered
Learning theories is time-consuming
Can we reduce time with ensembles?
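The seed-and-cover loop described above can be sketched as follows. This is a toy illustration, not Aleph's implementation: examples are assumed to be non-empty sets of features, a "rule" is a single feature, and `best_rule` is a hypothetical stand-in for Aleph's heuristic clause search.

```python
from typing import FrozenSet, List

Example = FrozenSet[str]  # toy stand-in: an example is a set of features

def best_rule(seed: Example, pos: List[Example], neg: List[Example]) -> str:
    """Toy 'heuristic search': among the seed's features, pick the one with
    the best (positives covered - negatives covered) score."""
    def score(f: str) -> int:
        return sum(f in e for e in pos) - sum(f in e for e in neg)
    return max(sorted(seed), key=score)

def learn_theory(pos: List[Example], neg: List[Example],
                 min_covered: float = 0.9) -> List[str]:
    """Covering loop: pick a seed, find a rule, drop covered positives, repeat
    until the requested fraction of positives is covered."""
    theory: List[str] = []
    uncovered = list(pos)
    while uncovered and 1 - len(uncovered) / len(pos) < min_covered:
        seed = uncovered[0]                      # pick a positive seed example
        rule = best_rule(seed, uncovered, neg)   # search for the best rule
        theory.append(rule)
        # The rule is a feature of the seed, so at least the seed is removed.
        uncovered = [e for e in uncovered if rule not in e]
    return theory

pos = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"d"})]
neg = [frozenset({"b"}), frozenset({"c", "d"})]
print(learn_theory(pos, neg))  # ['a', 'd']
```

Each pass through the loop runs a full rule search, which is where the time goes when learning a theory this way.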
Gleaner
Definition of Gleaner
– One who gathers grain left behind by reapers
Key Ideas of Gleaner
– Use Aleph as underlying ILP rule engine
– Search rule space with Rapid Random Restart
– Keep wide range of rules usually discarded
– Create separate theories for diverse recall
Gleaner - Learning
(Figure axes: Precision vs. Recall)
Create B Bins
Generate Clauses
Record Best per Bin
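The three steps on this slide can be sketched as below. The slide does not say how bins are laid out or what "best" means, so two assumptions are made here and should be read as such: bins are equal-width intervals of train-set recall, and "best" is the highest precision × recall (the `quality` parameter lets a different criterion be swapped in). The clause names are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Clause = Tuple[str, float, float]  # (clause text, precision, recall) on the train set

def glean(clauses: Iterable[Clause], num_bins: int,
          quality: Callable[[float, float], float] = lambda p, r: p * r
          ) -> List[Optional[Clause]]:
    """Keep the best clause seen in each of num_bins equal-width recall bins."""
    bins: List[Optional[Clause]] = [None] * num_bins
    for clause in clauses:
        _, p, r = clause
        # Map recall in [0, 1] to a bin index; recall == 1.0 goes in the last bin.
        i = min(int(r * num_bins), num_bins - 1)
        current = bins[i]
        if current is None or quality(p, r) > quality(current[1], current[2]):
            bins[i] = clause
    return bins

# Hypothetical clauses generated during search.
generated = [("c1", 0.9, 0.05), ("c2", 0.8, 0.08), ("c3", 0.5, 0.45), ("c4", 0.3, 0.95)]
print(glean(generated, num_bins=4))
```

With these inputs, `c2` replaces `c1` in the lowest-recall bin, and one bin stays empty because no generated clause fell into it.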
Gleaner - Learning
(Figure: Recall axis, repeated for Seed 1, Seed 2, Seed 3, …, Seed K)
Gleaner - Ensemble
Rules from bin 5
Examples             Score
pos1: prot_loc(…)    12
pos2: prot_loc(…)    47
pos3: prot_loc(…)    55
neg1: prot_loc(…)    5
neg2: prot_loc(…)    14
neg3: prot_loc(…)    2
neg4: prot_loc(…)    18
(Figure labels: Pos, Neg, Pos, Pos, Neg, Neg, Pos)
Gleaner - Ensemble
(Figure axes: Precision vs. Recall, each from 0 to 1.0)
Examples              Score
pos3: prot_loc(…)     55
neg28: prot_loc(…)    52
pos2: prot_loc(…)     47
neg4: prot_loc(…)     18
neg475: prot_loc(…)   17
pos9: prot_loc(…)     17
neg15: prot_loc(…)    16
…
Precision   Recall
1.00        0.05
0.50        0.05
0.66        0.10
…
0.12        0.85
0.13        0.90
0.12        0.90
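The precision and recall columns follow from walking down the ranked examples and treating each score as a threshold. The sketch below does this for the seven examples and scores listed on the first Gleaner - Ensemble slide; it assumes those seven are the whole data set, that at least one example is positive, and that ties can be broken arbitrarily. The function name is illustrative.

```python
from typing import List, Tuple

def recall_precision_points(scored: List[Tuple[str, bool, int]]
                            ) -> List[Tuple[str, float, float]]:
    """Walk examples from highest to lowest score, treating each score as a
    threshold, and report (example, precision, recall) at that point."""
    total_pos = sum(is_pos for _, is_pos, _ in scored)
    ranked = sorted(scored, key=lambda e: e[2], reverse=True)
    points, true_pos = [], 0
    for rank, (name, is_pos, _) in enumerate(ranked, start=1):
        true_pos += is_pos
        points.append((name, true_pos / rank, true_pos / total_pos))
    return points

# (example, is positive, score) as listed on the first Gleaner - Ensemble slide.
examples = [("pos1", True, 12), ("pos2", True, 47), ("pos3", True, 55),
            ("neg1", False, 5), ("neg2", False, 14), ("neg3", False, 2),
            ("neg4", False, 18)]
for name, p, r in recall_precision_points(examples):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Precision starts at 1.00 while only positives have been seen and drops each time a negative outranks a remaining positive, which is the same pattern the slide's table shows on the larger data set.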
Gleaner - Overlap
For each bin, take the topmost curve
(Figure axes: Precision vs. Recall)
How to use Gleaner
(Figure axes: Precision vs. Recall)
Generate Test Curve
User Selects Recall Bin
Return Classifications Ordered By Their Score
Recall = 0.50, Precision = 0.70
Aleph Ensembles
We compare to ensembles of theories
Algorithm (Dutra et al ILP 2002)
– Use K different initial seeds
– Learn K theories containing C rules
– Rank examples by the number of theories
Need to balance C for high performance
– Small C leads to low recall
– Large C leads to converging theories
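Ranking by the number of theories amounts to a vote count. A minimal sketch, assuming each learned theory can be treated as a yes/no classifier; the three string-matching "theories" below are hypothetical placeholders, not rules from the talk.

```python
from typing import Callable, List, Sequence, Tuple

Theory = Callable[[str], bool]  # a learned theory, viewed as a yes/no classifier

def rank_by_votes(examples: Sequence[str], theories: Sequence[Theory]
                  ) -> List[Tuple[str, int]]:
    """Score each example by how many of the K theories classify it positive,
    then sort from most to fewest votes."""
    votes = [(ex, sum(1 for t in theories if t(ex))) for ex in examples]
    return sorted(votes, key=lambda pair: pair[1], reverse=True)

# Hypothetical theories over plain strings, standing in for K learned theories.
theories = [lambda s: "nucle" in s, lambda s: "protein" in s, lambda s: len(s) > 10]
print(rank_by_votes(["nuclear protein", "gene", "protein"], theories))
# [('nuclear protein', 3), ('protein', 1), ('gene', 0)]
```

If the K theories converge to the same rules, every example gets either K votes or none, and the ranking carries little information; that is the risk with a large C.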
Evaluation Metrics
Area Under Recall-Precision Curve (AURPC)
– All curves standardized to cover full recall range
– Averaged AURPC over 5 folds
Number of clauses considered
– Rough estimate of
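For the AURPC metric above, a simple approximation is the trapezoidal area under the (recall, precision) points, averaged over folds. This is a sketch under assumptions: the talk does not say how the area was computed, the caller is expected to supply points that already span recall 0 to 1, and straight-line interpolation between widely spaced points can overestimate the true area.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (recall, precision)

def aurpc(points: Sequence[Point]) -> float:
    """Trapezoidal area under (recall, precision) points, sorted by recall."""
    pts = sorted(points)
    return sum((r1 - r0) * (p0 + p1) / 2
               for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

def mean_aurpc(folds: Sequence[Sequence[Point]]) -> float:
    """Average AURPC over cross-validation folds."""
    return sum(aurpc(f) for f in folds) / len(folds)

# Two hypothetical folds, each covering the full recall range.
fold_a = [(0.0, 1.0), (1.0, 1.0)]
fold_b = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
print(mean_aurpc([fold_a, fold_b]))  # 0.75
```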
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations
1.6 GB of background knowledge
– Structural, Statistical, Lexical and Ontological
– In total, 200+ distinct background predicates
Experimental Methodology
Performed five-fold cross-validation
Variation of parameters
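Five-fold cross-validation holds out each fifth of the data once for testing and trains on the rest. The sketch below shows one way to build the folds; how the talk's folds were actually drawn (for example, whether whole abstracts were kept together) is not stated, so the round-robin assignment here is an assumption.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def k_fold_splits(items: Sequence[T], k: int = 5) -> List[Tuple[List[T], List[T]]]:
    """Return k (train, test) pairs; each item is in exactly one test fold."""
    # Round-robin assignment: item i goes to fold i mod k.
    folds = [list(items[i::k]) for i in range(k)]
    return [([x for j, f in enumerate(folds) if j != i for x in f], folds[i])
            for i in range(k)]

for train, test in k_fold_splits(list(range(10))):
    print(len(train), test)
```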
Protein Localization Results
Genetic Disorder Results
Current Directions
Learn diverse rules across seeds
Calculate probabilistic scores for examples
Directed Rapid Random Restarts
Cache rule information to speed scoring
Transfer learning across seeds
Explore Active Learning within ILP
Take-Home Message
Biology, Gleaner and ILP
– Challenging problems in biology can be naturally formulated for Inductive Logic Programming
– Many rules constructed and evaluated in ILP hypothesis search
– Gleaner makes use of those rules that are not the highest scoring ones for improved speed and performance
Acknowledgements
USA DARPA Grant F30602-01-2-0571
USA Air Force Grant F30602-01-2-0571
USA NLM Grant 5T15LM007359-02
USA NLM Grant 1R01LM07050-01
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni