Learning for Biomedical Information Extraction with ILP
Margherita Berardi, Vincenzo Giuliano, Donato Malerba
Department of Computer Science, University of Bari
Knowledge Acquisition & Machine Learning Lab
CILC 2006 (Convegno Italiano di Logica Computazionale), 26-27 June 2006, Dipartimento di Informatica, Bari
CILC 2006, 26-27 June 2006, Dipartimento di Informatica, Bari
Outline of the talk
IE for Biomedicine
Looking around
IE problem formulation
  which representation model on data?
  which features?
  which framework for reasoning?
Mutual Recursion in IE
Text processing & domain knowledge
Application to studies on mitochondrial genome
Conclusions & Future work
What is “Information Extraction”?
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
IE result:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
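The slot-filling task illustrated above can be sketched with a few hand-written patterns, in the spirit of the hand-built extractors that predate learned ones. This is only a toy under stated assumptions: the regular expressions, the PATTERNS list and the extract function are hypothetical names made up for this example, and the text is a shortened copy of the news excerpt. It is not the learning approach presented in this talk.

```python
import re

# Shortened copy of the news excerpt used on the slide.
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against "
    "the economic philosophy of open-source software. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered saying..."
)

# A person name is approximated as two capitalized words (an assumption).
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"

# Hypothetical hand-written patterns, one per phrasing found in the excerpt.
PATTERNS = [
    # "<Org> Corporation <TITLE> <First Last>"
    re.compile(r"(?P<organization>[A-Z][a-z]+) Corporation (?P<title>[A-Z]{2,}) " + NAME),
    # "<First Last>, a <Org> <TITLE>"
    re.compile(NAME + r", a (?P<organization>[A-Z][a-z]+) (?P<title>[A-Z]{2,})"),
    # "<First Last>, founder of the <Org ...>"
    re.compile(NAME + r", (?P<title>founder) of the "
               r"(?P<organization>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
]

SLOTS = ("name", "title", "organization")


def extract(text):
    """Return one record (dict of slot -> filler) per pattern match."""
    records = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            records.append({slot: match.group(slot) for slot in SLOTS})
    return records


if __name__ == "__main__":
    for record in extract(TEXT):
        print(record)
```

Each pattern covers exactly one phrasing, which is why hand-built extractors of this kind port poorly to new domains and why learning the rules becomes attractive.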
IE from Biomedical Texts: Motivation
Complexity of biological systems:
  Too many specialized biological tasks
  Several entities interacting in a single phenomenon
  Many conditions to verify simultaneously
Complexity of biomedical languages:
  Several nomenclatures, dictionaries and lexica that tend to become obsolete quickly
Too much to read!
  Genome decoding has led to an increasing amount of published literature
IE History
Message Understanding Conference (MUC), DARPA [’87-’95]; TIPSTER [’92-’96]
Most early work dominated by hand-built models
  E.g. SRI’s FASTUS, hand-built FSMs
But by the 1990s, some machine learning: Lehnert, Cardie, Grishman
  and then HMMs: Elkan [Leek ’97], BBN [Bikel et al. ’98]
Wrapper Induction: initially hand-built, then ML [Soderland ’96], [Kushmerick ’97], …
Most learning attempts based on statistical approaches
  Learning of production rules constrained by probability measures (e.g., HMMs, Probabilistic Context-Free Grammars)
Some recent logic-based approaches
  Rapier (Califf ’98)
  SRV (Freitag ’98)
  INTHELEX (Ferilli et al. ’01)
  FOIL-based (Aitken ’02)
  Aleph-based (Goadrich et al. ’04)
Learning Language in biomedicine
BioCreAtIvE - Critical Assessment for Information Extraction in Biology (http://biocreative.sourceforge.net/)
BioNLP, Natural language processing of biology text (http://www.bionlp.org)
ACL/COLING Workshops on Natural Language Processing in Biomedicine
SIGIR Workshops on Text Analysis for Bioinformatics
Special Interest Group in Text Mining since ISMB’03 (Intelligent Systems for Molecular Biology): BioLINK (Biology Literature, Information and Knowledge)
PSB (Pacific Symposium on Biocomputing) tracks
Genomic tracks in TREC (Text Retrieval Conference)
PASCAL challenges on information extraction (http://nlp.shef.ac.uk/pascal/)
Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic since ’99, challenge task on Extracting Relations from Biomedical Texts)
Is there “Logic” in language learning?
IE systems limitations, in general:
  Portability (domain-dependent, task-dependent)
  Scalability (work well on “relevant” data)
Statistics-based approaches: wide coverage, scalability, no semantics, no domain knowledge
Logic-based approaches: natural encoding of natural language statements and queries in first-order logic, human-comprehensible models, domain knowledge, refinement of models
[R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on Language Learning in Logic, 1999]
IE problem formulation for HmtDB
HmtDB: a resource of variability data associated with clinical phenotypes, concerning the human mitochondrial genome (http://www.hmdb.uniba.it/)
Textual Entity Extraction
Ex: “Cytoplasts from two unrelated patients with MELAS (mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring an A→G transition at nucleotide position 3243 in the tRNALeu(UUR) gene of the mitochondrial genome were fused with human cells lacking endogenous mitochondrial DNA (mtDNA)”
Entities to extract:
  pathology associated with the mutation under study
  substitution that causes the mutation
  type of the mutation
  position in the DNA where the mutation occurs
  gene correlated to the mutation
By modelling the sentence structure:
  substitution(X) ← follows(Y,X), type(Y)
Extractors cannot be learned independently!
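The dependency between extractors can be made concrete with a small forward-chaining sketch. It assumes a hypothetical token-level encoding (word/follows facts as tuples) and a hypothetical lexicon for mutation types; the substitution rule mirrors the clause on the slide, with follows(Y,X) read as "Y follows X". The substitution is written as A->G in plain ASCII. None of this is taken from the actual system.

```python
# Tokens from the example sentence fragment (substitution written as "A->G").
TOKENS = ["an", "A->G", "transition", "at", "nucleotide", "position", "3243"]

# Hypothetical lexicon of words that denote a mutation type.
TYPE_WORDS = {"transition", "transversion"}


def base_facts(tokens):
    """Encode tokens as ground facts: word(i, w) and follows(i, i-1)."""
    facts = set()
    for i, token in enumerate(tokens):
        facts.add(("word", i, token))
        if i > 0:
            facts.add(("follows", i, i - 1))  # token i follows token i-1
    return facts


def rule_type(facts):
    """type(Y) <- word(Y, W), W in TYPE_WORDS   (assumed rule)."""
    return {("type", f[1]) for f in facts
            if f[0] == "word" and f[2] in TYPE_WORDS}


def rule_substitution(facts):
    """substitution(X) <- follows(Y, X), type(Y)   (rule from the slide)."""
    type_ids = {f[1] for f in facts if f[0] == "type"}
    return {("substitution", f[2]) for f in facts
            if f[0] == "follows" and f[1] in type_ids}


def forward_chain(facts, rules):
    """Apply all rules repeatedly until no new fact is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_facts = rule(derived) - derived
            if new_facts:
                derived |= new_facts
                changed = True
    return derived


if __name__ == "__main__":
    facts = base_facts(TOKENS)
    alone = forward_chain(facts, [rule_substitution])
    together = forward_chain(facts, [rule_substitution, rule_type])
    print([f for f in alone if f[0] == "substitution"])
    print([f for f in together if f[0] == "substitution"])
```

Without a definition of type, the substitution rule derives nothing; once both rules are available, the token preceding the type word is recognized as the substitution. This is the sense in which the two extractors depend on each other.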
Textual Entity Extraction
Each entity is characterized by some slots defining a template
The task is to learn rules to fill slots (template filling)
The learning task
Classification
  Each class (slot) is a concept (target predicate); each model (template filler) induced for the class is a logical theory explaining the concept (set of predicate definitions)
  Predefined models of classification should be provided
Importance of domain knowledge and first-order representations
Usefulness of mutual recursion (concept dependencies)
ILP = Inductive Learning ∩ Logic Programming
  From IL: inductive reasoning from observations and background knowledge
  From LP: first-order logic as representation formalism
ATRE (Apprendimento di Teorie Ricorsive da Esempi, i.e. learning recursive theories from examples)
http://www.di.uniba.it/~malerba/software/atre/
Given
  a set of concepts C1, C2, ... , Cr
  a set of objects O described in a language LO
  a background knowledge BK described in a language LBK
  a language of hypotheses LH that defines the space of hypotheses SH
  a user’s preference criterion PC
Find
  a (possibly recursive) logical theory T for the concepts C1, C2, ... , Cr, such that T is complete and consistent with respect to the set of observations and satisfies the preference criterion PC.
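The two requirements on T can be illustrated with a toy check: a theory is complete if it covers every positive example of each concept, and consistent if it covers no negative example. The sketch below is a propositional simplification under stated assumptions. Examples are plain dictionaries, clauses are Python predicates, and all names (OBSERVATIONS, THEORY, covers, and so on) are hypothetical. ATRE itself works with first-order clauses, which this sketch does not attempt to reproduce.

```python
# Each observation: (example, concept, is_positive). Toy data, not from the talk.
OBSERVATIONS = [
    ({"word": "transition"}, "type", True),
    ({"word": "transversion"}, "type", True),
    ({"word": "position"}, "type", False),
    ({"word": "3243"}, "position", True),
    ({"word": "two"}, "position", False),
]

# A theory maps each concept to a list of clauses; a clause is a predicate
# over one example. The concept holds if at least one clause fires.
THEORY = {
    "type": [lambda e: e["word"].startswith("trans")],
    "position": [lambda e: e["word"].isdigit()],
}


def covers(theory, concept, example):
    """True if some clause defining the concept is satisfied by the example."""
    return any(clause(example) for clause in theory.get(concept, []))


def is_complete(theory, observations):
    """Every positive example is covered."""
    return all(covers(theory, concept, example)
               for example, concept, positive in observations if positive)


def is_consistent(theory, observations):
    """No negative example is covered."""
    return not any(covers(theory, concept, example)
                   for example, concept, positive in observations if not positive)


if __name__ == "__main__":
    print(is_complete(THEORY, OBSERVATIONS), is_consistent(THEORY, OBSERVATIONS))

    # An over-general clause keeps completeness but breaks consistency.
    too_general = dict(THEORY, position=[lambda e: True])
    print(is_complete(too_general, OBSERVATIONS), is_consistent(too_general, OBSERVATIONS))

    # An over-specific clause keeps consistency but breaks completeness.
    too_specific = dict(THEORY, type=[lambda e: e["word"] == "transition"])
    print(is_complete(too_specific, OBSERVATIONS), is_consistent(too_specific, OBSERVATIONS))
```

The preference criterion PC is left out here; it would rank the theories that pass both checks.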
ATRE Main Characteristics
Learning problem: induce recursive theories from examples
ILP setting: learning from interpretations
Observation language: ground multiple-head clauses
Hypothesis language: non-ground definite clauses
Constraints: linkedness + range-restrictedness
Generalization model: generalized implication
Search strategy for a recursive theory: separate-and-parallel-conquer
Continuous and discrete attributes and relations
Background knowledge: intensionally defined
Application
We considered 71 documents selected by biologists
Expert users manually annotated occurrences of entities of interest, namely
  Mutation: position, type, substitution, type_position, locus
  Subjects: nationality, method, pathology, category, number
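The two annotated entities can be pictured as templates whose slots are filled one by one. The sketch below is a minimal illustration, assuming string-valued slots. The class layout and the unfilled helper are hypothetical, and the fillers are only one reading of the MELAS example sentence, not annotations from the corpus. Slots whose filler is not evident from that sentence are left empty.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Mutation:
    # Slot names as listed on the slide; all values are optional strings.
    position: Optional[str] = None
    type: Optional[str] = None
    substitution: Optional[str] = None
    type_position: Optional[str] = None
    locus: Optional[str] = None


@dataclass
class Subjects:
    nationality: Optional[str] = None
    method: Optional[str] = None
    pathology: Optional[str] = None
    category: Optional[str] = None
    number: Optional[str] = None


def unfilled(template):
    """Names of the slots that have no filler yet, in declaration order."""
    return [f.name for f in fields(template) if getattr(template, f.name) is None]


if __name__ == "__main__":
    # Illustrative fillers read off the example sentence (an assumption).
    mutation = Mutation(position="3243", type="transition", substitution="A->G")
    subjects = Subjects(pathology="MELAS", number="two")
    print(mutation, unfilled(mutation))
    print(subjects, unfilled(subjects))
```

Keeping every slot optional reflects that a single text portion rarely mentions all slots of a template.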
The extraction process (both learning and recognition) is performed locally on the text portions of interest, which are automatically classified
Textual portions of papers were categorized into five classes: Abstract, Introduction, Materials & Methods, Discussion and Results
The abstract of each paper was processed
[Figure: bar chart of correctly classified text portions (%), scale 0 to 100, for the categories Abstract, Introduction, Methods, Results, Discussion. Legend: Avg. No. of categories correctly classified]
An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial tRNALeu(UUR) gene is closely associated with various clinical phenotypes of diabetes mellitus.