Machine Learning for Information Integration on the Web
Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications, NCSR “Demokritos”
http://www.iit.demokritos.gr/skel
Dagstuhl, February 15, 2005
15/2/2005 Machine Learning for Information Integration 2
Dagstuhl
SKEL Introduction
• Areas of research activity:
– Information gathering (retrieval, crawling, spidering)
– Information filtering (text and multimedia classification)
– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
– Personalization (user stereotypes and communities)
SKEL’s research objective: innovative knowledge technologies for reducing the information overload on the Web

Structure of the talk
• Web Information integration in CROSSMARC
• Learning Context Free Grammars
• Meta-learning for Web Information Extraction
• Machine Learning for Ontology Maintenance
• Conclusions

CROSSMARC consortium
• National Centre for Scientific Research “Demokritos” (GR)
• University of Edinburgh (UK)
• Universita di Roma Tor Vergata (IT)
• VeltiNet A.E. (GR)
• Lingway (FR)

CROSSMARC Objectives
Develop technology for Information Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.

CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; the only surviving label is “Ontology”.]

[Figure: grammar-learning flow diagram; surviving labels: Training Examples, Overly Specific Grammar, Final Grammar, and the decision “Any inferred grammar better than those in beam?”]
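
Meta-learning for Web IE
Stacked generalization
[Figure: stacked generalization. Training: the base-level dataset D is split into D \ Dj and Dj; the learners L1…LN yield the classifiers C1(j)…CN(j), whose outputs form MDj, one part of the meta-level dataset MD; the meta-level learner LM yields the classifier CM. Run-time: a new vector x is passed to C1…CN, giving a meta-level vector from which CM predicts the class value y(x).]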
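
The stacking scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the setup used in the talk: `threshold_learner` and `table_learner` are hypothetical toy learners standing in for L1…LN and LM, the fold split is a simple stride over the list, and the six training vectors are invented for the example.

```python
from collections import Counter

def threshold_learner(feature):
    """Toy base learner L_i (assumption): predict 1 when one feature exceeds its training mean."""
    def learn(data):
        cut = sum(x[feature] for x, _ in data) / len(data)
        return lambda x: int(x[feature] > cut)
    return learn

def table_learner(data):
    """Toy meta-level learner L_M (assumption): majority class per meta-level vector."""
    votes = {}
    for vector, y in data:
        votes.setdefault(tuple(vector), Counter())[y] += 1
    default = Counter(y for _, y in data).most_common(1)[0][0]
    def classify(vector):
        seen = votes.get(tuple(vector))
        return seen.most_common(1)[0][0] if seen else default
    return classify

def train_stacking(D, base_learners, meta_learner, folds=3):
    MD = []                                            # meta-level dataset
    for j in range(folds):
        Dj = D[j::folds]                               # held-out part Dj
        rest = [ex for i, ex in enumerate(D) if i % folds != j]   # D \ Dj
        Cj = [learn(rest) for learn in base_learners]  # C1(j)…CN(j)
        MD += [([c(x) for c in Cj], y) for x, y in Dj] # MDj
    CM = meta_learner(MD)                              # meta-level classifier
    C = [learn(D) for learn in base_learners]          # final C1…CN, trained on all of D
    return lambda x: CM([c(x) for c in C])             # x -> meta-level vector -> y(x)

# Invented base-level dataset: (feature vector, class value).
D = [((0.1, 0.9), 0), ((0.2, 0.1), 0), ((0.3, 0.8), 0),
     ((0.7, 0.2), 1), ((0.8, 0.9), 1), ((0.9, 0.1), 1)]
predict = train_stacking(D, [threshold_learner(0), threshold_learner(1)], table_learner)
print(predict((0.85, 0.3)), predict((0.15, 0.7)))
```

The key point is that the meta-level vectors are always produced by classifiers that did not see the example during training, which is why each fold retrains the base learners on D \ Dj.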

Meta-learning for Web IE
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
Template T
t(s,e)                 s, e     Field f
Transport ZX           47, 49   model
15”                    56, 58   screenSize
TFT                    59, 60   screenType
Intel <b> Pentium III  63, 67   procName
600 MHz                67, 69   procSpeed
256 MB                 76, 78   ram
Each template is filled with instances <t(s,e), f>.
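
One way to hold such a template in code is as a list of instances <t(s,e), f>. The sketch below is illustrative only: the class name `Instance`, its attribute names, and the helper `by_field` are hypothetical, and the slides do not say whether s and e are token or character offsets, so they are stored as plain integers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    text: str    # t(s,e): the extracted text fragment
    start: int   # s: start offset (unit not specified in the slides)
    end: int     # e: end offset
    field: str   # f: the template field the fragment fills

# Template T from the slide, one Instance per row.
template_T = [
    Instance("Transport ZX", 47, 49, "model"),
    Instance("15”", 56, 58, "screenSize"),
    Instance("TFT", 59, 60, "screenType"),
    Instance("Intel <b> Pentium III", 63, 67, "procName"),
    Instance("600 MHz", 67, 69, "procSpeed"),
    Instance("256 MB", 76, 78, "ram"),
]

def by_field(template):
    """Group the extracted fragments by field name."""
    grouped = {}
    for inst in template:
        grouped.setdefault(inst.field, []).append(inst.text)
    return grouped

print(by_field(template_T)["procName"])
```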

Meta-learning for Web IE
Combining Information Extraction systems
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…
T1 filled by the IE system E1
t(s, e)                s, e     f
Transport ZX           47, 49   model
15”                    56, 58   screenSize
TFT                    59, 60   screenType
Intel <b> Pentium III  63, 67   procName
600 MHz                67, 69   procSpeed
256 MB                 76, 78   ram
1 GB                   81, 83   ram
T2 filled by the IE system E2
t(s, e)                s, e     f
Transport ZX           47, 49   manuf
TFT                    59, 60   screenType
Intel <b> Pentium      63, 66   procName
600 MHz                67, 69   procSpeed
256 MB                 76, 78   ram
1 GB                   81, 83   HDcapacity

Meta-learning for Web IE
Creating a stacked template
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…
Stacked template (ST)
s, e     t(s, e)              Field by E1   Field by E2   Correct field
47, 49   Transport ZX         model         manuf         model
56, 58   15”                  screenSize    -             screenSize
59, 60   TFT                  screenType    screenType    screenType
63, 66   Intel<b>Pentium      -             procName      -
63, 67   Intel<b>Pentium III  procName      -             procName
67, 69   600 MHz              procSpeed     procSpeed     procSpeed
76, 78   256 MB               ram           ram           ram
81, 83   1 GB                 ram           HDcapacity    -
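
The construction of the stacked template can be reproduced with a short sketch. It assumes, for illustration only, that each template is a dictionary from a span (s, e) to a field name; the function name `stack_templates` is hypothetical and the fragment text t(s, e) is left out for brevity. One row is emitted for every span proposed by at least one system, with "-" where a system (or the hand-filled template) has no entry.

```python
def stack_templates(templates, gold=None):
    """Merge per-system templates {(s, e): field} into stacked-template rows."""
    spans = sorted(set().union(*templates))      # every span proposed by any system
    rows = []
    for span in spans:
        predicted = [t.get(span, "-") for t in templates]   # field by E1, E2, ...
        correct = gold.get(span, "-") if gold is not None else None
        rows.append((span, predicted, correct))
    return rows

# T1 and T2 as filled by E1 and E2, and the hand-filled template T, from the slides.
T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}
T = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
     (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram"}

rows = stack_templates([T1, T2], gold=T)
for span, predicted, correct in rows:
    print(span, predicted, correct)
```

Spans are matched exactly, so (63, 66) and (63, 67) end up in separate rows, as in the slide.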

Meta-learning for Web IE
Training in the new stacking framework
[Figure: training flow. D is split into D \ Dj and Dj; the learners L1…LN yield the IE systems E1(j)…EN(j), whose stacked templates ST1, ST2, … form MDj; the meta-level learner LM yields CM. The learners L1…LN also yield the final systems E1…EN.]
D = set of documents, paired with hand-filled templates
MD = set of meta-level feature vectors
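
A meta-level classifier CM could be learned from such feature vectors roughly as follows. Everything here is an assumption made for illustration: the feature vector is just the pair of fields predicted by E1 and E2, `train_meta` is a hypothetical lookup-table learner standing in for LM, unseen vectors fall back to a plain vote among the base systems, and the spans in `new_rows` are invented. The slides do not state which features or which meta-level learner were actually used.

```python
from collections import Counter, defaultdict

def train_meta(md):
    """md: list of (feature_vector, correct_field). Returns a classifier CM."""
    votes = defaultdict(Counter)
    for features, correct in md:
        votes[tuple(features)][correct] += 1
    def CM(features):
        seen = votes.get(tuple(features))
        if seen:
            return seen.most_common(1)[0][0]
        # Assumption: for unseen vectors, fall back to a vote among the base systems.
        return Counter(features).most_common(1)[0][0]
    return CM

# Meta-level vectors taken from the stacked template of the example document.
MD = [(("model", "manuf"), "model"), (("screenSize", "-"), "screenSize"),
      (("screenType", "screenType"), "screenType"), (("-", "procName"), "-"),
      (("procName", "-"), "procName"), (("procSpeed", "procSpeed"), "procSpeed"),
      (("ram", "ram"), "ram"), (("ram", "HDcapacity"), "-")]
CM = train_meta(MD)

# Hypothetical stacked template of a new document: span -> fields by E1, E2.
new_rows = {(10, 12): ("model", "manuf"), (20, 22): ("ram", "HDcapacity"),
            (30, 31): ("-", "procName")}
# Keep only the spans for which CM predicts a real field.
final = {span: CM(f) for span, f in new_rows.items() if CM(f) != "-"}
print(final)
```

Because "-" is an ordinary class value at the meta level, CM can learn to reject a span that a base system extracted incorrectly, not only to choose between competing fields.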

Meta-learning for Web IE
Stacking at run-time
[Figure: a new document d is processed by E1, E2, …, EN, producing the templates T1, T2, …, TN; these are merged into a stacked template, from which CM produces the final template T of instances <t(s,e), f>.]

Ontology Enrichment
• Highly evolving domain (e.g. laptop descriptions)
– New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
– New surface appearance of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• We concentrate on instances.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

Ontology Enrichment
[Figure: ontology enrichment cycle; surviving labels: Multi-Lingual Domain Ontology, Annotating Corpus Using Domain Ontology, Corpus, Information extraction, machine learning, Additional annotations, Validation, Domain Expert, Ontology Enrichment / Population.]

Enrichment with synonyms
• The number of instances for validation increases with the size of the corpus and the ontology.
• There is a need for supporting the enrichment of the ‘synonymy’ relationship.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
– Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
– Orthographical: ‘Intel p3’ - ‘intell p3’
– Lexicographical: ‘Hewlett Packard’ - ‘HP’
– Combination: ‘Intell Pentium 3’ - ‘P III’
• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
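
The idea behind CCDiff can be illustrated with a small sketch. It is a simplified reading of the bullets above, not the published COCLU algorithm: the code length of a cluster is taken to be the number of bits needed to encode all of its characters with a Huffman code built from the cluster's own character frequencies, CCDiff is the raw increase in that length, and strings are processed in a single pass. The function names and the toy strings are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code_length(strings):
    """Bits needed to encode all characters of `strings` with their own Huffman code."""
    freq = Counter("".join(strings))
    if not freq:
        return 0
    if len(freq) == 1:
        # Assumption: a one-symbol alphabet still costs one bit per character.
        return sum(freq.values())
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b               # each merge adds one bit to every symbol below it
        heapq.heappush(heap, a + b)
    return total

def cc_diff(cluster, candidate):
    """Increase in the cluster's code length when `candidate` is added."""
    return huffman_code_length(cluster + [candidate]) - huffman_code_length(cluster)

def coclu(strings, threshold):
    """Assign each string to the closest cluster, or open a new one."""
    clusters = []
    for s in strings:
        scored = [(cc_diff(c, s), i) for i, c in enumerate(clusters)]
        if scored and min(scored)[0] <= threshold:
            clusters[min(scored)[1]].append(s)
        else:
            clusters.append([s])
    return clusters

print(coclu(["aab", "ab", "xy"], threshold=3))
```

A string that reuses the characters of a cluster lengthens its code only slightly, while a string over different characters forces new, longer codewords, so a small CCDiff signals typographic similarity.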

Conclusions
• Information integration can benefit from machine learning.
• Grammar learning methods have become efficient.
• Combining IE systems improves performance.
• Ontologies can be used to annotate examples to learn IE systems and enrich ontologies.
• Grammar learning in parallel or in combination with ontology learning?

Acknowledgements
• This is research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium.