Biomedical Information Extraction using Inductive Logic Programming
Mark Goadrich and Louis Oliphant Advisor: Jude Shavlik
Acknowledgements to NLM training grant 5T15LM007359-02
Abstract

Automated methods for finding relevant information in the large amount of biomedical literature are needed. Information extraction (IE) is the process of finding facts in unstructured text, such as biomedical journals, and putting those facts into an organized system. Our research mines facts about a relationship (e.g. protein localization) from PubMed abstracts. We use Inductive Logic Programming (ILP) to learn a set of logical rules that explain when and where a relationship occurs in a sentence. We build rules by finding patterns in syntactic as well as semantic information for each sentence in a training corpus that has been previously marked with the relationship. These rules can then be used on unmarked text to find new instances of the relation. Some major research issues involved in this approach are handling unbalanced data, searching the enormous space of clauses, learning probabilistic logical rules, and incorporating expert background knowledge.
The Central Dogma

- Discoveries: protein-protein interactions, protein localizations, genetic diseases
- Most knowledge is stored in articles
- Just Google it?

*image courtesy of National Human Genome Research Institute
World of Publishing
Current authors write articles in Word, LaTeX, and publish
in conferences, journals, etc humans index and extract relevant information
(time and cost intensive) Future?
all published articles available on the Web semantic web – extension of HTML for content articles automatically annotated and indexed into
searchable databases
Information Extraction

Given: a set of abstracts tagged with biological relationships between phrases
Do: learn a theory (e.g., a set of inference rules) that accurately extracts these relations
Training Data
Why Use ILP?

- In the KDD Cup 2002 IE task, handcrafted logical rules did best
- Hypotheses are comprehensible: written in first-order predicate calculus (FOPC), and aim to cover only positive examples
- Background knowledge is easily incorporated: expert advice, linguistic knowledge of English (parse trees), biomedical knowledge (e.g., MeSH)
ILP Example: Family Tree

Positive examples: daughter(mary, ann), daughter(eve, tom)
Negative examples: daughter(tom, ann), daughter(eve, ann), daughter(ian, tom), daughter(ian, ann), …
Background knowledge: mother(ann, mary), mother(ann, tom), father(tom, eve), father(tom, ian), female(ann), female(mary), female(eve), male(tom), male(ian)
[Family tree diagram: Ann is the mother of Mary and Tom; Tom is the father of Eve and Ian]
Possible rules:
- daughter(A,B) if male(A) and father(B,A)
- daughter(A,B) if mother(B,A)
- daughter(A,B) if female(A) and male(B)
- daughter(A,B) if female(A) and mother(B,A)
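The scoring step of this example can be sketched in a few lines of Python (an illustration, not the authors' system): each candidate rule for daughter(A,B) is tested against the positive and negative examples, using the background knowledge above.

```python
# Background knowledge from the family-tree slide.
mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}
male = {"tom", "ian"}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

# Each candidate rule from the slide, as a predicate over (A, B).
rules = {
    "male(A) and father(B,A)":    lambda a, b: a in male and (b, a) in father,
    "mother(B,A)":                lambda a, b: (b, a) in mother,
    "female(A) and male(B)":      lambda a, b: a in female and b in male,
    "female(A) and mother(B,A)":  lambda a, b: a in female and (b, a) in mother,
}

# Score every rule by how many positives/negatives it covers.
coverage = {}
for name, rule in rules.items():
    pos_covered = sum(rule(a, b) for a, b in positives)
    neg_covered = sum(rule(a, b) for a, b in negatives)
    coverage[name] = (pos_covered, neg_covered)
    print(f"{name}: covers {pos_covered}/2 positives, {neg_covered}/4 negatives")
```

ILP searches for rules that cover the positives while covering no negatives; here, for instance, "mother(B,A)" wrongly covers the negative daughter(tom, ann).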
Sundance Parsing

Example sentence: "smf1 and smf2 are mitochondrial membrane_proteins", parsed by Sundance into an NP-Conj segment ("smf1 and smf2": unk, conj, unk), a VP segment ("are": cop), and an NP segment ("mitochondrial membrane_proteins": unk, noun).

Sentence Structure Predicates:
parent(smf1, np-conj seg), parent(np-conj seg, sentence), child(np-conj seg, smf1), child(sentence, np-conj seg), next(smf1, and), next(np-conj seg, vp seg), after(np-conj seg, np seg), …

Part of Speech Predicates:
noun(membrane_proteins), verb(are), unk(smf1), noun_phrase(np seg), verb_phrase(vp seg), …
Lexical Word Predicates:
novelword(smf1), novelword(smf2), alphabetic(and), alphanumeric(smf1), …

Biomedical Knowledge Predicates:
in_med_dict(mitochondrial), go_mitochondrial_membrane(smf1), go_mitochondrion(smf1), …
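As an illustration (not the actual Sundance/ILP code), lexical predicates like those above can be derived from raw tokens. The tiny LEXICON below is a stand-in for a real English word list, and we assume "novelword" marks tokens absent from it.

```python
# Toy lexicon standing in for a real English dictionary.
LEXICON = {"and", "are", "the", "protein", "membrane"}

def lexical_predicates(token):
    """Emit the lexical word predicates that hold for one token."""
    preds = []
    if token.isalpha():
        preds.append(f"alphabetic({token})")
    elif token.isalnum():          # letters + digits, e.g. gene names like smf1
        preds.append(f"alphanumeric({token})")
    if token.lower() not in LEXICON:
        preds.append(f"novelword({token})")
    return preds

for tok in ["smf1", "and", "smf2", "are"]:
    print(tok, "->", lexical_predicates(tok))
```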
Sample Learned Rule

gene_disease(E, A) :-
    isa_np_segment(E), isa_np_segment(A),
    prev(A, B), pp_segment(B),
    child(A, C), next(C, D), alphabetic(D), novelword(C),
    child(E, F), alphanumeric(F).

In words: a gene-disease relation holds between noun-phrase segments E and A when A is preceded by a prepositional-phrase segment B, A contains a novel word C followed by an alphabetic word D, and E contains an alphanumeric word F.
[Diagram: a sentence containing noun-phrase segments E and A, a prepositional-phrase segment B, a novel word C, an alphabetic word D, and an alphanumeric word F]
Ensembles for Rules

N heads are better than one…
- learn multiple (sets of) rules with the training data
- aggregate the results by voting on the classification of testing data
- Bagging (Breiman '96): each rule-set gets one vote
- Boosting (Freund and Schapire '96): each rule gets a weighted vote
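A minimal sketch of the bagging variant, under the assumption stated above that each rule-set casts one unweighted vote; the bootstrap-sampling "learner" here is a toy threshold classifier standing in for ILP.

```python
import random

def bagged_rulesets(train, learner, n_sets=5, seed=0):
    """Learn n_sets rule-sets, each from a bootstrap replicate of train."""
    rng = random.Random(seed)
    rulesets = []
    for _ in range(n_sets):
        # bootstrap replicate: sample |train| examples with replacement
        sample = [rng.choice(train) for _ in train]
        rulesets.append(learner(sample))
    return rulesets

def vote(rulesets, example):
    """Each rule-set gets one vote; classify positive on a simple majority."""
    fires = sum(rs(example) for rs in rulesets)
    return fires > len(rulesets) / 2

# Toy "learner": classify positive if the example exceeds the sample mean.
def toy_learner(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: x > threshold

train = [(i, i >= 5) for i in range(10)]   # positives are 5..9
rulesets = bagged_rulesets(train, toy_learner, n_sets=7)
print(vote(rulesets, 9), vote(rulesets, 1))
```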
Drawing a PR Curve

Conf   Class   Prec   Rec
.98    +       1.00   0.20
.97    +       1.00   0.40
.84    -       0.66   0.40
.78    +       0.75   0.60
.55    +       0.80   0.80
.43    -       0.66   0.80
.23    -       0.57   0.80
.22    -       0.50   0.80
.12    +       0.55   1.00
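The table can be reproduced by sweeping a confidence threshold down the ranked extractions, recomputing precision and recall after each example (the slide truncates the decimals, e.g. 0.66 for 2/3, where rounding gives 0.67):

```python
# Extractions ranked by confidence, with their true class (+/-).
ranked = [(.98, True), (.97, True), (.84, False), (.78, True), (.55, True),
          (.43, False), (.23, False), (.22, False), (.12, True)]

total_pos = sum(is_pos for _, is_pos in ranked)   # 5 true positives overall
tp = 0
curve = []
for i, (conf, is_pos) in enumerate(ranked, start=1):
    tp += is_pos
    precision, recall = tp / i, tp / total_pos
    curve.append((round(precision, 2), round(recall, 2)))
    print(f"{conf:.2f}  {'+' if is_pos else '-'}  {precision:.2f}  {recall:.2f}")
```

Note that recall only ever increases, while precision dips each time a negative outranks the remaining positives.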
[Plot: the precision-recall curve traced out by the table above]
Testset Results

[Plot: precision vs. recall (both 0-100%) on the test set, comparing the Craven Group system, Boosting, Rule Quality, and Bagging]
Handling Large Skewed Data

5-fold cross-validation:
- train: 1,007 positive / 240,874 negative
- test: 284 positive / 243,862 negative

With a 95%-accurate rule set:
- 270 true positives, but 12,193 false positives!
- recall = 270 / 284 ≈ 95.1%
- precision = 270 / (270 + 12,193) = 270 / 12,463 ≈ 2.2%
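The arithmetic above, spelled out (a sketch, assuming the 95% accuracy figure applies to each class separately):

```python
# Test-fold class counts from the slide.
test_pos, test_neg = 284, 243_862

tp = round(0.95 * test_pos)   # positives correctly covered
fp = round(0.05 * test_neg)   # negatives wrongly covered
recall = tp / test_pos
precision = tp / (tp + fp)
print(f"tp={tp}, fp={fp}, recall={recall:.1%}, precision={precision:.1%}")
```

The skew (roughly 1 positive per 859 negatives) is what turns a seemingly accurate rule set into a 2% precision extractor.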
Handling Large Skewed Data

Ways to handle the skew:
- assign different costs to each class: it is much more important not to cover negatives
- under-sample with bagging: negatives are under-represented, so the key is to pick good negatives
- filter the data to restore an equal class ratio in the testing data: use naïve Bayes to learn the relational parts
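The under-sampling idea can be sketched as follows (illustrative only; how to pick *good* negatives rather than random ones is the open issue noted above): each bag keeps every positive but only a random, equal-sized subset of the negatives, so each rule-set trains on balanced data.

```python
import random

def balanced_bags(positives, negatives, n_bags=3, seed=0):
    """Build n_bags training sets: all positives + an equal random
    subset of negatives (sampled without replacement per bag)."""
    rng = random.Random(seed)
    return [positives + rng.sample(negatives, len(positives))
            for _ in range(n_bags)]

pos = [("p", i) for i in range(5)]
neg = [("n", i) for i in range(1000)]
bags = balanced_bags(pos, neg, n_bags=3)
print([len(b) for b in bags])
```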
[Pipeline diagram: pos/neg examples → noun phrase filter → split into parts (genes, diseases) → naïve Bayes filter on each part → join back → filtered pos/neg examples]
Filters to Reduce Negatives: 1 : 485, 1 : 1,979, 1 : 39 (positive : negative ratios at different filtering stages)
Probabilistic Rules

- Logical rules are too strict and often overfit
- Add a probabilistic weight to each rule, based on accuracy on a tuning set
- Learn the parameters: make each rule a binary feature; use any standard machine-learning algorithm (naïve Bayes, perceptron, logistic regression, …) to learn the weights; assign a probability to examples based on the weights
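A sketch of this parameter-learning step, using a perceptron (one of the algorithms named above) over binary rule features; the "rules" and sentences here are invented for illustration.

```python
# Toy "learned rules", each a binary feature over a sentence.
rules = [
    lambda s: "localize" in s,
    lambda s: "protein" in s,
    lambda s: "the" in s,        # an uninformative rule
]

def features(sentence):
    return [int(r(sentence)) for r in rules]

train = [("protein x localize to y", 1), ("protein binds the dna", 0),
         ("x localize in membrane", 1), ("the cell divides", 0)]

# Mistake-driven perceptron: one weight per rule, plus a bias.
weights, bias = [0.0] * len(rules), 0.0
for _ in range(10):                       # training epochs
    for sent, label in train:
        f = features(sent)
        pred = int(sum(w * x for w, x in zip(weights, f)) + bias > 0)
        if pred != label:
            for i, x in enumerate(f):
                weights[i] += (label - pred) * x
            bias += (label - pred)

print(weights, bias)
```

After training, informative rules carry positive weight while the uninformative "the" rule is pushed negative or to zero.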
Weighted Exponential Model

    P(o | f) = (1/Z) * exp( sum_{i=1..N} w_i * f_i )

where w_i is a weight for each feature f_i and Z is a normalizing constant. Taking logs we get

    log P(o | f) = sum_{i=1..N} w_i * f_i - log Z

We need to set the weights w_i to maximize the log probability of the tuning set.
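A numeric sketch of the model above, under the assumption of two outcomes (relation present vs. absent) with the negative outcome's score fixed at zero, in which case the model reduces to logistic regression:

```python
import math

def prob_positive(weights, f):
    """P(o=1 | f) = exp(sum_i w_i f_i) / Z, with Z summing over both
    outcomes and the o=0 score fixed at 0."""
    score = sum(w * x for w, x in zip(weights, f))
    z = math.exp(score) + math.exp(0.0)   # normalizer over both outcomes
    return math.exp(score) / z

weights = [2.0, -1.0]                     # one weight per binary rule-feature
print(prob_positive(weights, [1, 0]))     # only feature 1 fires
print(prob_positive(weights, [0, 1]))     # only feature 2 fires
```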
Incorporating Background Knowledge

- Creation of predicates that capture salient features: endsIn(word, 'ase'), occursInAbstractNtimes(word, 5)
- Incorporation of prior knowledge into the learning system: protein(word) if endsIn(word, 'ase') and occursInAbstractNtimes(word, 5).
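These predicates are straightforward to implement; the sketch below assumes occursInAbstractNtimes means "at least N times", which the slide leaves implicit, and uses a made-up abstract.

```python
def ends_in(word, suffix):
    return word.endswith(suffix)

def occurs_in_abstract_n_times(word, n, abstract):
    return abstract.lower().split().count(word.lower()) >= n

def looks_like_protein(word, abstract):
    # prior-knowledge rule from the slide:
    # protein(word) if endsIn(word, 'ase') and occursInAbstractNtimes(word, 5)
    return ends_in(word, "ase") and occurs_in_abstract_n_times(word, 5, abstract)

abstract = "kinase " * 5 + "binds dna"
print(looks_like_protein("kinase", abstract))
```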
Searching in Large Spaces

- Probabilistic bottom clause: probabilistically remove the least significant predicates from the "bottom clause"
- Random rule generation: in place of hill-climbing, randomly select rules of a given length from the bottom clause, and retain only those rules which do well on a tuning set
- Learn the coverage of clauses: neural network, Bayesian learning, etc.
References

- Nelson, Stuart J.; Powell, Tammy; Humphreys, Betsy L. The Unified Medical Language System (UMLS) Project. In: Encyclopedia of Library and Information Science. Forthcoming.
- Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1999. MIT Press.
- Ellen Riloff. The Sundance Sentence Analyzer. 2002.
- Inês de Castro Dutra, et al. An Empirical Evaluation of Bagging in Inductive Logic Programming. 2002. In Proceedings of the International Conference on Inductive Logic Programming, Sydney, Australia.
- Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. 2000. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
- Soumya Ray and Mark Craven. Representing Sentence Structure in Hidden Markov Models for Information Extraction. 2001. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).
- Tina Eliassi-Rad and Jude Shavlik. A Theory-Refinement Approach to Information Extraction. 2001. In Proceedings of the 18th International Conference on Machine Learning.
- M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. 1999. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77-86. Germany.
- Leo Breiman. Bagging Predictors. 1996. Machine Learning, 24(2):123-140.
- Yoav Freund and Robert E. Schapire. Experiments with a New Boosting Algorithm. 1996. In International Conference on Machine Learning, pages 148-156.