Gleaning Relational Information from Biomedical Text
Mark Goadrich
Computer Sciences Department
University of Wisconsin - Madison
Joint Work with Jude Shavlik and Louis Oliphant
CIBM Seminar - Dec 5th 2006
Outline
– The Vacation Game
– Formalizing with Logic
– Biomedical Information Extraction
– Evaluating Hypotheses
– Gleaning Logical Rules
– Experiments
– Current Directions
The Vacation Game
Positive
Negative
The Vacation Game
Positive
– Apple
– Feet
– Luggage
– Mushrooms
– Books
– Wallet
– Beekeeper
Negative
– Pear
– Socks
– Car
– Fungus
– Novel
– Money
– Hive
(Reveal: the repeated letters in the positive words are highlighted.)
The Vacation Game
My Secret Rule
– The word must have two adjacent letters which are the same letter.
Found by using inductive logic
– Positive and Negative Examples
– Formulating and Eliminating Hypotheses
– Evaluating Success and Failure
Inductive Logic Programming
Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on train set and then measure performance on test set
In ILP, data are Objects …
– person, block, molecule, word, phrase, …
… and Relations between them
– grandfather, has_bond, is_member, …
Formalizing with Logic
[Figure: the word "apple" represented as object w2169 with letter objects w2169_1 … w2169_5, each mapped to a letter value from a–z; labeled Objects and Relations]
Formalizing with Logic
word(w2169).
letter(w2169_1).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, 'p').
letter_value(w2169_3, 'p').

pos(X) :-
    has_letter(X, A),
    has_letter(X, B),
    next(A, B),
    letter_value(A, C),
    letter_value(B, C).
[Figure: the pos/1 rule annotated with its head, body, and Variables, mapped onto the 'apple' objects w2169_1 … w2169_5 above]
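Read procedurally, the pos/1 rule is just a test for two equal adjacent letters. A minimal Python sketch of the same check (the function name is my own, not from the talk):

```python
def has_adjacent_double(word):
    """True if the word contains two adjacent letters that are the same
    letter -- the pos/1 rule: next(A, B) with equal letter_value."""
    return any(a == b for a, b in zip(word, word[1:]))

# Positive examples from the game pass, negatives fail.
print([w for w in ["apple", "feet", "pear", "socks"] if has_adjacent_double(w)])
# → ['apple', 'feet']
```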
Biomedical Information Extraction
*image courtesy of SEER Cancer Training Site
Structured Database
Biomedical Information Extraction
http://www.geneontology.org
Biomedical Information Extraction
NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
Biomedical Information Extraction
The dog running down the street tackled and bit my little sister.
Biomedical Information Extraction
NPL3 encodes a nuclear protein with …
[Figure: parse tree of the sentence — sentence split into noun phrases, a verb phrase, and a prepositional phrase, with part-of-speech tags (noun, verb, article, adj, prep) on the words]
MedDict Background Knowledge
http://cancerweb.ncl.ac.uk/omd/
MeSH Background Knowledge
http://www.nlm.nih.gov/mesh/MBrowser.html
GO Background Knowledge
http://www.geneontology.org
Some Prolog Predicates
Biomedical Predicates
– phrase_contains_medDict_term(Phrase, Word, WordText)
– phrase_contains_mesh_term(Phrase, Word, WordText)
– phrase_contains_mesh_disease(Phrase, Word, WordText)
– phrase_contains_go_term(Phrase, Word, WordText)
Lexical Predicates
– internal_caps(Word)
– alphanumeric(Word)
Look-ahead Phrase Predicates
– few_POS_in_phrase(Phrase, POS)
– phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
– phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
– protein_before_location(ExampleID)
– word_pair_in_between_target_phrases(ExampleID, W1, W2)
Still More Predicates
High-scoring words in protein phrases
– bifunction, repress, pmr1, …
High-scoring words in location phrases
– golgi, cytoplasm, er
High-scoring words BETWEEN protein & location
– across, cofractionate, inside, …
Biomedical Information Extraction
Given: Medical Journal abstracts tagged with biological relations
Do: Construct system to extract related phrases from unseen text
Our Gleaner Approach
Develop fast ensemble algorithms focused on recall and precision evaluation
Using Modes to Chain Relations
[Figure: Sentence, Phrase, and Word types chained by mode declarations — predicates such as long_sentence(…), phrase_child(…, …), phrase_parent(…, …), noun_phrase(…), verb(…), alphanumeric(…), internal_caps(…)]
Growing Rules From Seed
NPL3 encodes a nuclear protein with …
prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).

phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0).
phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1).
…
noun_phrase(ab1392078_sen7_ph2).
word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3).
…
avg_length_sentence(ab1392078_sen7).
…
Growing Rules From Seed
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).
Rule Evaluation
Prediction vs Actual: Positive or Negative, True or False

                    actual positive   actual negative
prediction positive       TP               FP
prediction negative       FN               TN

Focus on positive examples
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 Score = 2PR / (P + R)
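These metrics are straightforward to compute; a small Python helper (mine, not code from the talk) makes the definitions concrete:

```python
def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of positive predictions that were actually positive."""
    return tp / (tp + fp)

def f1(p, r):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# The rule on the next slide reports recall 0.15 and precision 0.51:
print(round(f1(0.51, 0.15), 2))  # → 0.23
```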
Protein Localization Rule 1
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).

0.15 Recall   0.51 Precision   0.23 F1 Score
Protein Localization Rule 2
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_marked_up_arg2(Location, C),
    phrase_contains_some_internal_cap_word(Protein, _),
    word_previous(C, _).

0.86 Recall   0.12 Precision   0.21 F1 Score
Precision-Focused Search
Recall-Focused Search
F1-Focused Search
Aleph - Learning
Aleph learns theories of rules (Srinivasan, v4, 2003)
– Pick positive seed example
– Use heuristic search to find best rule
– Pick new seed from uncovered positives and repeat until threshold of positives covered
Learning theories is time-consuming
Can we reduce time with ensembles?
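The covering loop above can be sketched in Python. This is an illustration, not Aleph's actual interface: the rule search and the coverage test are passed in as stand-in functions, and the coverage threshold parameter is my own naming.

```python
def covering_loop(positives, negatives, learn_best_rule, covers,
                  min_covered_fraction=0.95):
    """Sketch of a covering algorithm: repeatedly pick an uncovered
    positive as seed, learn the best rule for it, and remove the
    positives that rule covers, until enough positives are covered."""
    theory = []
    uncovered = set(positives)
    # Stop once at most this many positives remain uncovered.
    allowed_uncovered = len(positives) * (1 - min_covered_fraction)
    while len(uncovered) > allowed_uncovered:
        seed = next(iter(uncovered))
        rule = learn_best_rule(seed, uncovered, negatives)
        theory.append(rule)
        uncovered -= {p for p in uncovered if covers(rule, p)}
    return theory
```

With a toy rule language where each learned "rule" covers exactly its seed, the loop simply walks through all positives once.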
Gleaner
Definition of Gleaner
– One who gathers grain left behind by reapers
Key Ideas of Gleaner
– Use Aleph as underlying ILP rule engine
– Search rule space with Rapid Random Restart
– Keep wide range of rules usually discarded
– Create separate theories for diverse recall
Gleaner - Learning
[Figure: precision vs. recall plot divided into recall bins]
– Create B Bins
– Generate Clauses
– Record Best per Bin
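The per-bin bookkeeping can be sketched as follows, assuming each candidate clause arrives with its recall and precision on the training set. Keeping the highest-precision clause per recall bin is one plausible criterion; Gleaner's actual scoring heuristic may differ.

```python
def record_best_per_bin(clauses, num_bins=20):
    """Place each candidate clause in a recall bin and keep only the
    best (here: highest-precision) clause seen in each bin.
    `clauses` is an iterable of (recall, precision, rule) triples."""
    best = [None] * num_bins
    for rec, prec, rule in clauses:
        b = min(int(rec * num_bins), num_bins - 1)  # clamp recall 1.0
        if best[b] is None or prec > best[b][1]:
            best[b] = (rec, prec, rule)
    return best
```

Because every generated clause is considered, rules that a greedy covering search would discard still populate the low- and high-recall bins.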
Gleaner - Learning
[Figure: recall bins filled separately for Seed 1, Seed 2, Seed 3, …, Seed K]
Gleaner - Ensemble
Rules from bin 5 score each example by the number of rules that match it:

pos1: prot_loc(…)   12
pos2: prot_loc(…)   47
pos3: prot_loc(…)   55
neg1: prot_loc(…)    5
neg2: prot_loc(…)   14
neg3: prot_loc(…)    2
neg4: prot_loc(…)   18
Gleaner - Ensemble
[Figure: recall-precision plot, both axes from 0 to 1.0]

Examples               Score   Precision   Recall
pos3:   prot_loc(…)     55      1.00        0.05
neg28:  prot_loc(…)     52      0.50        0.05
pos2:   prot_loc(…)     47      0.66        0.10
…
neg4:   prot_loc(…)     18      0.12        0.85
neg475: prot_loc(…)     17      0.12        0.85
pos9:   prot_loc(…)     17      0.13        0.90
neg15:  prot_loc(…)     16      0.12        0.90
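The ranking step behind this table can be sketched as a toy illustration (not Gleaner's code): rules are modeled as boolean predicates, each example is scored by how many rules match it, and precision/recall are computed at each cutoff walking down the ranking.

```python
def ranked_precision_recall(examples, rules):
    """Score each example by the number of matching rules, sort by score,
    and return the (recall, precision) pair at each cutoff.
    `examples` is a list of (is_positive, example); `rules` is a list of
    boolean predicates over examples."""
    total_pos = sum(1 for is_pos, _ in examples if is_pos)
    ranked = sorted(examples,
                    key=lambda e: sum(r(e[1]) for r in rules),
                    reverse=True)
    curve, tp = [], 0
    for n, (is_pos, _) in enumerate(ranked, start=1):
        tp += is_pos
        curve.append((tp / total_pos, tp / n))  # (recall, precision)
    return curve
```

Sweeping the cutoff from top to bottom of the ranking traces out one recall-precision curve per bin, as in the plot above.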
Gleaner - Overlap
For each bin, take the topmost curve
[Figure: overlapping recall-precision curves, one per seed, with the upper envelope highlighted]
How to use Gleaner
– Generate Test Curve
– User Selects Recall Bin
– Return Classifications Ordered By Their Score
[Figure: recall-precision curve with a selected operating point, e.g. Recall = 0.50, Precision = 0.70]
Aleph Ensembles
We compare to ensembles of theories
Algorithm (Dutra et al., ILP 2002)
– Use K different initial seeds
– Learn K theories containing C rules
– Rank examples by the number of theories
Need to balance C for high performance
– Small C leads to low recall
– Large C leads to converging theories
Evaluation Metrics
Area Under Recall-Precision Curve (AURPC)
– All curves standardized to cover full recall range
– Averaged AURPC over 5 folds
Number of clauses considered
– Rough estimate of time
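A rough AURPC sketch using the trapezoidal rule is below. Note this is only an approximation: straight-line interpolation between points in recall-precision space is not exact, since precision does not vary linearly with recall between operating points; the slides do not specify the interpolation used.

```python
def aurpc_trapezoid(points):
    """Approximate area under a recall-precision curve.
    `points` is a list of (recall, precision) pairs; they are sorted by
    recall and the area is accumulated trapezoid by trapezoid."""
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area
```

For curves that do not span the full recall range, the slides' "standardized to cover full recall range" step would extend the endpoints before computing the area.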
YPD Protein Localization
Hand-labeled dataset (Ray & Craven '01)
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations
– 1,810 positive & 279,154 negative
1.6 GB of background knowledge
– Structural, Statistical, Lexical and Ontological
– In total, 200+ distinct background predicates
Experimental Methodology
Performed five-fold cross-validation
Variation of parameters
– Gleaner (20 recall bins)
  # seeds = {25, 50, 75, 100}
  # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
– Ensembles (0.75 minacc, 1K and 35K nodes)
  # theories = {10, 25, 50, 75, 100}
  # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
PR Curves - 100,000 Clauses
PR Curves - 1,000,000 Clauses
Protein Localization Results
Genetic Disorder Results
Current Directions
– Learn diverse rules across seeds
– Calculate probabilistic scores for examples
– Directed Rapid Random Restarts
– Cache rule information to speed scoring
– Transfer learning across seeds
– Explore Active Learning within ILP
Take-Home Message
Biology, Gleaner and ILP
– Challenging problems in biology can be naturally formulated for Inductive Logic Programming
– Many rules constructed and evaluated in ILP hypothesis search
– Gleaner makes use of those rules that are not the highest scoring ones for improved speed and performance
Acknowledgements
USA DARPA Grant F30602-01-2-0571
USA Air Force Grant F30602-01-2-0571
USA NLM Grant 5T15LM007359-02
USA NLM Grant 1R01LM07050-01
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni