On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning Kassel, 22 July 2005 Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications NCSR “Demokritos” http://www.iit.demokritos.gr/~paliourg
[Diagram: grammar learning loop — Training Examples → Overly Specific Grammar → Final Grammar; the loop iterates while any inferred grammar is better than those in the beam]
Kassel, 22/07/2005 ICCS’05 26
Experimental results
• The Dyck language with k=1: S → S S | ( S ) | ε
• Errors of:
  – Omission: failures to parse sentences generated from the “correct” grammar (longer test sentences than in the training set) → overly specific grammar.
  – Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar → overly general grammar.
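The two error measures can be illustrated on the Dyck language itself. The sketch below (illustrative function names and rule probabilities, not the authors' code) samples sentences from the “correct” grammar S → S S | ( S ) | ε and counts how many an inferred parser fails on:

```python
import random

def is_dyck(s: str) -> bool:
    """Recognizer for the Dyck language with k=1 (balanced parentheses)."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:          # a ')' with no matching '('
            return False
    return depth == 0

def generate(rng, p_stop=0.4, depth=0, max_depth=12):
    """Sample a sentence from S -> S S | ( S ) | epsilon (probabilities illustrative)."""
    if depth >= max_depth or rng.random() < p_stop:
        return ""                                         # S -> epsilon
    if rng.random() < 0.5:                                # S -> S S
        return generate(rng, p_stop, depth + 1) + generate(rng, p_stop, depth + 1)
    return "(" + generate(rng, p_stop, depth + 1) + ")"   # S -> ( S )

def omission_rate(inferred_parse, n=1000, seed=0):
    """Errors of omission: valid sentences the inferred grammar fails to parse."""
    rng = random.Random(seed)
    sentences = [generate(rng) for _ in range(n)]
    return sum(not inferred_parse(s) for s in sentences) / n
```

Errors of commission would be measured symmetrically: generate from the inferred grammar and parse with the “correct” recognizer.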
Experimental results
[Plot: probability of parsing a valid sentence (1 − errors of omission)]
Experimental results
[Plot: probability of generating a valid sentence (1 − errors of commission)]
Outline
• Motivation and state of the art
• SKEL research
  – Vision
  – Information integration in CROSSMARC
  – Meta-learning for information extraction
  – Context-free grammar learning
  – Ontology enrichment
  – Bootstrapping ontology evolution with multimedia information extraction
• Open issues
Ontology Enrichment
• Highly evolving domain (e.g. laptop descriptions):
  – New instances characterize new concepts,
    e.g. ‘Pentium 2’ is an instance that denotes a new concept if it does not yet exist in the ontology.
  – New surface appearance of an instance,
    e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• We concentrate on instances.
• The poor performance of many information integration systems is due to their inability to handle the evolving nature of the domain.
Ontology Enrichment
[Diagram: a corpus is annotated using the multi-lingual domain ontology; information extraction and machine learning drive ontology enrichment/population; a domain expert validates the results and supplies additional annotations]
Finding synonyms
• The number of instances for validation increases with the size of the corpus and the ontology.
• There is a need to support the enrichment of the ‘synonymy’ relationship.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
  – Synonym: ‘Intel pentium 3’ – ‘Intel pIII’
  – Orthographical: ‘Intel p3’ – ‘intell p3’
  – Lexicographical: ‘Hewlett Packard’ – ‘HP’
  – Combination: ‘Intell Pentium 3’ – ‘P III’
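A rough sketch of how such surface variants might be detected — normalization plus acronym and edit-similarity checks. All function names and the threshold are illustrative assumptions, not CROSSMARC's actual method:

```python
import re
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Lowercase and drop non-alphanumerics: catches orthographic variation."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def acronym(s: str) -> str:
    """Initial letters of the words: catches lexicographic variants like 'HP'."""
    return "".join(w[0] for w in s.split()).lower()

def maybe_same_instance(a: str, b: str, threshold=0.85) -> bool:
    """Heuristic test: do two surface forms plausibly denote the same instance?"""
    na, nb = normalize(a), normalize(b)
    if acronym(a) == nb or na == acronym(b):
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold
```

For example, ‘Hewlett Packard’ matches ‘HP’ via the acronym check, while ‘Intel p3’ matches ‘intell p3’ via the edit-similarity ratio.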
COCLU
• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements/letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when adding a candidate string. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (based on a threshold on CCDiff).
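The CCDiff score and the greedy assignment step can be sketched as below. The code-length computation uses the standard identity that a Huffman code's total length equals the sum of the merged node weights; the threshold value and function names are illustrative, not COCLU's actual implementation:

```python
import heapq
from collections import Counter

def huffman_code_length(strings):
    """Bits needed to Huffman-code the characters of all strings in a cluster."""
    freq = Counter("".join(strings))
    if len(freq) <= 1:
        return sum(freq.values())       # degenerate: at most one distinct symbol
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:                # total bits = sum of merged weights
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def cc_diff(cluster, candidate):
    """CCDiff: growth in the cluster's code length when the candidate joins it."""
    return huffman_code_length(cluster + [candidate]) - huffman_code_length(cluster)

def assign(clusters, candidate, threshold):
    """One hill-climbing step: join the closest cluster, or open a new one."""
    scored = [(cc_diff(c, candidate), c) for c in clusters]
    best_score, best = min(scored, key=lambda t: t[0]) if scored else (None, None)
    if best is None or best_score > threshold:
        clusters.append([candidate])    # too dissimilar: start a new cluster
    else:
        best.append(candidate)
    return clusters
```

A string that shares most of its characters with a cluster compresses well under that cluster's model, so its CCDiff is small; typographic variants of the same instance therefore gravitate to the same cluster.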
Experimental results
Discovering lexical synonyms:
Assign an instance to a group, while proportionally decreasing the number of instances initially available in each group.

  Initial   2nd iter.
   15/58     48/58
   28/58     56/58
   40/58     57/58
Discovering new instances:
Hide part of the known instances; evolve the ontology and grammars to recover them.
[Plot: Accuracy (%) vs. Instances removed (%)]
Outline
• Motivation and state of the art
• SKEL research
  – Vision
  – Information integration in CROSSMARC
  – Meta-learning for information extraction
  – Context-free grammar learning
  – Ontology enrichment
  – BOEMIE: Bootstrapping ontology evolution with multimedia information extraction
• Open issues
BOEMIE – motivation
• Multimedia content grows at increasing rates in public and proprietary webs.
• Hard to provide semantic indexing of multimedia content.
• Significant advances in automatic extraction of low-level features from visual content.
• Little progress in the identification of high-level semantic features.
• Little progress in the effective combination of semantic features from different modalities.
• Great effort in producing ontologies for semantic webs.
• Hard to build and maintain domain-specific multimedia ontologies.
BOEMIE – approach
[Architecture diagram: content collection (crawlers, spiders, etc.) feeds multimedia content into a Semantics Extraction Toolkit (text, audio and visual extraction tools, plus information fusion tools combining results from visual, non-visual and fused content); the extraction results drive an Ontology Evolution Toolkit (learning tools, reasoning engine, matching tools, ontology management tool) that coordinates population and enrichment, evolving the initial ontology through intermediate ontologies, with reference to other ontologies, into the evolved ontology]
Outline
• Motivation and state of the art
• SKEL research
  – Vision
  – Information integration in CROSSMARC
  – Meta-learning for information extraction
  – Context-free grammar learning
  – Ontology enrichment
  – Bootstrapping ontology evolution with multimedia information extraction
• Open issues
KR issues
• Is there a common formalism to capture the necessary semantic + syntactic + lexical knowledge for IE?
• Is that better than having separate representations for different tasks?
• Do we need an intermediate formalism (e.g. grammar + CG + ontology)?
• Do we need to represent uncertainty (e.g. using probabilistic graphical models)?
ML issues
• What types and which aspects of grammars and conceptual structures can we learn?
• What training data do we need? Can we reduce the manual annotation effort?
• What background knowledge do we need and what is the role of deduction?
• What is the role of multi-strategy learning, especially if complex representations are used?
Content-type issues
• What is the role of semantically annotated content in learning, e.g. as training data?
• What is the role of hypertext as a graph?
• Can we extract information from multimedia content?
• How can ontologies and learning help improve extraction from multimedia?
SKEL Introduction
• This is the research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma ‘Tor Vergata’, Veltinet, Lingway).