Machine Learning for Information Integration on the Web Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications.

Machine Learning for Information Integration on the Web

Georgios Paliouras

Software & Knowledge Engineering Lab

Inst. of Informatics & Telecommunications, NCSR “Demokritos”

http://www.iit.demokritos.gr/skel

Dagstuhl, February 15, 2005


SKEL Introduction

• Areas of research activity:
  – Information gathering (retrieval, crawling, spidering)
  – Information filtering (text and multimedia classification)
  – Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
  – Personalization (user stereotypes and communities)

SKEL’s research objective: innovative knowledge technologies for reducing the information overload on the Web



Structure of the talk

• Web Information integration in CROSSMARC
• Learning Context Free Grammars
• Meta-learning for Web Information Extraction
• Machine Learning for Ontology Maintenance
• Conclusions




CROSSMARC consortium

• National Centre for Scientific Research “Demokritos” (GR)
• University of Edinburgh (UK)
• Universita di Roma Tor Vergata (IT)
• VeltiNet A.E. (GR)
• Lingway (FR)



CROSSMARC Objectives

Develop technology for Information Integration that can:

• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.



CROSSMARC Architecture

[Figure: CROSSMARC architecture diagram; the only surviving label is “Ontology”.]



CROSSMARC Ontology

Ontology:

…
<description>Laptops</description>
<features>
  <feature id="OF-d0e5">
    <description>Processor</description>
    <attribute type="basic" id="OA-d0e7">
      <description>Processor Name</description>
      <discrete_set type="open">
        <value id="OV-d0e1041">
          <description>Intel Pentium 3</description>
        </value>
        …

Lexicon:

<node idref="OV-d0e1041">
  <synonym>Intel Pentium III</synonym>
  <synonym>Pentium III</synonym>
  <synonym>P3</synonym>
  <synonym>PIII</synonym>
</node>

Greek Lexicon (the synonym is Greek for “Processor Name”):

<node idref="OA-d0e7">
  <synonym>Όνομα Επεξεργαστή</synonym>
</node>
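The lexicon entries tie surface forms to ontology identifiers through the idref attribute. A minimal sketch of how such a lexicon might be turned into a lookup table is given below. The enclosing <lexicon> root element, the function name and the case-insensitive matching are assumptions made for the example; they are not taken from the CROSSMARC file format.

```python
import xml.etree.ElementTree as ET

# The <node> element is copied from the slide; the <lexicon> wrapper is an
# assumption, added only so that the fragment is well-formed XML.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""

def build_synonym_index(lexicon_xml):
    """Map each lower-cased synonym to the ontology id it refers to."""
    index = {}
    root = ET.fromstring(lexicon_xml.strip())
    for node in root.iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index

index = build_synonym_index(LEXICON_XML)
print(index.get("piii"))       # OV-d0e1041
print(index.get("pentium 4"))  # None: not in the lexicon
```

A surface form that is missing from the index, such as the second lookup, is the kind of case that the ontology enrichment part of the talk addresses.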



Learning Context Free Grammars

Introducing eg-GRIDS

• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation controlled through a heuristic, based on MDL.
• Two basic/three auxiliary learning operators.
• Two search strategies:
  – Beam search.
  – Genetic search.




Minimum Description Length (MDL)

Model Length (ML) = GDL + DDL

Grammar Description Length (GDL): bits required to encode the grammar G.

Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.

[Figure: the space of hypotheses, from an overly specific grammar to an overly general grammar, with the GDL and DDL of each.]
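The trade-off between the two terms can be illustrated with a small sketch. The encoding below is a deliberately simplified assumption, not the one used by eg-GRIDS: GDL charges a fixed-length code for every symbol written in a rule, and DDL charges log2 of the number of alternative rules at every expansion step of a derivation. The two toy grammars and their hand-written derivations are hypothetical.

```python
import math

def gdl(grammar):
    """Bits to write down all rules with a fixed-length code per symbol."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    bits_per_symbol = math.log2(len(symbols))
    # every rule costs its head plus the symbols of its body
    n_written = sum(1 + len(body) for bodies in grammar.values() for body in bodies)
    return n_written * bits_per_symbol

def ddl(grammar, derivations):
    """Bits to state which rule was chosen at each expansion step."""
    return sum(math.log2(len(grammar[nt])) for d in derivations for nt in d)

examples = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c"),
            ("b", "c", "a"), ("c", "a", "b"), ("c", "b", "a")]

# Overly specific: one rule per training example.
specific = {"S": list(examples)}
specific_derivations = [["S"] for _ in examples]

# Overly general: any non-empty sequence of words.
general = {"S": [("W", "S"), ("W",)], "W": [("a",), ("b",), ("c",)]}
general_derivations = [["S", "W", "S", "W", "S", "W"] for _ in examples]

for name, g, d in [("specific", specific, specific_derivations),
                   ("general", general, general_derivations)]:
    print(name, "GDL =", round(gdl(g), 2), "DDL =", round(ddl(g, d), 2),
          "ML =", round(gdl(g) + ddl(g, d), 2))
```

Under this toy encoding the specific grammar has the larger GDL and the general grammar the larger DDL, which is the tension the MDL heuristic is meant to balance.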




eg-GRIDS Architecture

[Figure: eg-GRIDS architecture. The training examples initialise a beam of grammars. Depending on the operator mode, new grammars are produced by the learning operators (Merge NT, Create NT, Create Optional NT, Detect Center Embedding, Body Substitution) or by the evolutionary algorithm (Mutation), with components for search organisation and selection. A test, “Any inferred grammar better than those in beam?”, has a YES branch and a NO branch; the output is the final grammar.]
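As an illustration of what a merge operator over non-terminals might do, the sketch below renames one non-terminal to another throughout a grammar and drops the rules that become duplicates. The grammar representation (a dict from non-terminal to a list of rule bodies) and the function name merge_nt are assumptions for the example; the actual Merge NT operator of eg-GRIDS may differ in detail.

```python
def merge_nt(grammar, keep, drop):
    """Return a new grammar in which non-terminal `drop` is replaced by `keep`."""
    merged = {}
    for head, bodies in grammar.items():
        new_head = keep if head == drop else head
        rules = merged.setdefault(new_head, [])
        for body in bodies:
            new_body = tuple(keep if symbol == drop else symbol for symbol in body)
            if new_body not in rules:  # rules that became identical collapse
                rules.append(new_body)
    return merged

# Hypothetical toy grammar: A and B appear in the same context.
grammar = {
    "S": [("A", "x"), ("B", "x")],
    "A": [("a",)],
    "B": [("b",)],
}

merged = merge_nt(grammar, keep="A", drop="B")
print(merged)  # {'S': [('A', 'x')], 'A': [('a',), ('b',)]}
```

The merged grammar has three rules instead of four, so it is shorter to encode but accepts the same or more strings; in a search guided by MDL, each such candidate would be scored before it is allowed into the beam.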



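Meta-learning for Web IE

Stacked generalization

[Figure: stacked generalization. Training: for each fold j, the learners L1…LN are trained on D \ Dj (the base-level dataset D without Dj), and the resulting classifiers C1(j)…CN(j) are applied to Dj to produce MDj, a part of the meta-level dataset MD. The meta-level learner LM is trained on MD and gives the meta-level classifier CM. Classification: the classifiers C1...CN, learned by L1…LN, map a new vector x to a meta-level vector, which CM maps to the class value y(x).]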
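The stacking scheme can be sketched in a few lines. Everything concrete here is an assumption chosen to keep the example self-contained: the base learners are 1-nearest-neighbour classifiers that each look at a single feature, the meta-level learner is a majority-vote lookup table, and the data are a hypothetical toy set.

```python
from collections import Counter, defaultdict

def one_nn_on(feature):
    """Base-level learner: 1-nearest-neighbour using a single feature."""
    def learn(data):
        def classify(x):
            nearest = min(data, key=lambda ex: abs(ex[0][feature] - x[feature]))
            return nearest[1]
        return classify
    return learn

def table_learner(data):
    """Meta-level learner LM: majority class per meta-level vector."""
    votes = defaultdict(Counter)
    for x, y in data:
        votes[tuple(x)][y] += 1
    default = Counter(y for _, y in data).most_common(1)[0][0]
    def classify(x):
        counts = votes.get(tuple(x))
        return counts.most_common(1)[0][0] if counts else default
    return classify

def stack(D, learners, meta_learner, folds=3):
    MD = []
    for j in range(folds):
        Dj = D[j::folds]
        rest = [ex for i, ex in enumerate(D) if i % folds != j]  # D \ Dj
        Cj = [L(rest) for L in learners]                         # C1(j)…CN(j)
        MD += [([C(x) for C in Cj], y) for x, y in Dj]           # MDj
    CM = meta_learner(MD)
    C = [L(D) for L in learners]                                 # C1…CN on all of D
    return lambda x: CM([Ck(x) for Ck in C])

# Toy data: the class depends on the first feature only.
D = [((0.1, 0.9), "neg"), ((0.9, 0.8), "pos"), ((0.2, 0.1), "neg"),
     ((0.8, 0.2), "pos"), ((0.3, 0.7), "neg"), ((0.7, 0.6), "pos"),
     ((0.15, 0.3), "neg"), ((0.85, 0.4), "pos"), ((0.25, 0.5), "neg")]

classify = stack(D, [one_nn_on(0), one_nn_on(1)], table_learner)
print(classify((0.95, 0.0)))  # pos
```

On this toy set the two base classifiers disagree on the new vector, and the meta-level table, built from their cross-validated predictions, follows the classifier that uses the informative feature.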




Information Extraction is not naturally a classification task

In IE we deal with text documents, paired with templates

…TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…

Template T

t(s,e)              s, e     Field f
Transport ZX        47, 49   model
15”                 56, 58   screenSize
TFT                 59, 60   screenType
Intel Pentium III   63, 67   procName
600 MHz             67, 69   procSpeed
256 MB              76, 78   ram

Each template is filled with instances <t(s,e), f>




Combining Information Extraction systems

T1 filled by the IE system E1

t(s, e)             s, e     f
Transport ZX        47, 49   model
15”                 56, 58   screenSize
Intel Pentium III   63, 67   procName
256 MB              76, 78   ram
1 GB                81, 83   ram

T2 filled by the IE system E2

t(s, e)             s, e     f
Transport ZX        47, 49   manuf
Intel Pentium       63, 66   procName
256 MB              76, 78   ram
1 GB                81, 83   HDcapacity




Creating a stacked template

Stacked template (ST)

s, e     t(s, e)             Field by E1   Field by E2   Correct field
47, 49   Transport ZX        model         manuf         model
56, 58   15”                 screenSize    -             screenSize
59, 60   TFT                 screenType    screenType    screenType
63, 66   Intel Pentium       -             procName      -
63, 67   Intel Pentium III   procName      -             procName
67, 69   600 MHz             procSpeed     procSpeed     procSpeed
76, 78   256 MB              ram           ram           ram
81, 83   1 GB                ram           HDcapacity    -
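The construction of a stacked template can be sketched as a merge over text spans. The representation is an assumption for the example: each template is a dict from a span (s, e) to a field name, and a span that a system did not extract is marked with "-", as in the table. The three dicts below are read off the columns of the stacked template.

```python
MISSING = "-"

def stack_templates(outputs, gold=None):
    """One row per span proposed by any system: (span, field by E1, …, [correct field])."""
    spans = sorted(set().union(*outputs))
    rows = []
    for span in spans:
        row = [span] + [output.get(span, MISSING) for output in outputs]
        if gold is not None:
            row.append(gold.get(span, MISSING))
        rows.append(tuple(row))
    return rows

# Values taken from the "Field by E1", "Field by E2" and "Correct field" columns.
T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}
gold = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
        (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram"}

rows = stack_templates([T1, T2], gold)
for row in rows:
    print(row)
```

Each row, without its span, is then one meta-level training example: the fields proposed by the base systems are the features and the correct field is the class.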




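Training in the new stacking framework

[Figure: training in the new stacking framework. For each fold j, the learners L1…LN are trained on D \ Dj, and the resulting IE systems E1(j)…EN(j) are applied to Dj; their outputs form the stacked templates ST1, ST2, …, from which the meta-level vectors MDj are built. The meta-level learner LM is trained on MD and gives CM; L1…LN also give the IE systems E1…EN.]

D = set of documents, paired with hand-filled templates

MD = set of meta-level feature vectors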




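Stacking at run-time

[Figure: stacking at run-time. A new document d is processed by the IE systems E1, E2, …, EN, which fill the templates T1, T2, …, TN. These are combined into a stacked template, and CM classifies its entries to produce the final template of instances <t(s,e), f>.]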
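The run-time flow can be sketched as follows. The sketch assumes that a meta-level vector is simply the tuple of fields proposed by the base systems for a span, and that CM is a lookup table learned from the rows of the stacked template shown earlier; E1 and E2 are hypothetical stand-ins that return fixed templates rather than real IE systems.

```python
MISSING = "-"

# (field by E1, field by E2) -> correct field, from the stacked template.
TRAINING_ROWS = [
    (("model", "manuf"), "model"),
    (("screenSize", MISSING), "screenSize"),
    (("screenType", "screenType"), "screenType"),
    ((MISSING, "procName"), MISSING),
    (("procName", MISSING), "procName"),
    (("procSpeed", "procSpeed"), "procSpeed"),
    (("ram", "ram"), "ram"),
    (("ram", "HDcapacity"), MISSING),
]

def learn_table(rows):
    """Stand-in for LM: memorise the rows, reject unseen meta-level vectors."""
    table = dict(rows)
    return lambda meta_vector: table.get(meta_vector, MISSING)

def run_time_stacking(document, extractors, CM):
    templates = [E(document) for E in extractors]        # T1…TN
    spans = sorted(set().union(*templates))
    final = {}
    for span in spans:
        meta_vector = tuple(T.get(span, MISSING) for T in templates)
        field = CM(meta_vector)
        if field != MISSING:                             # keep accepted instances only
            final[span] = field
    return final

# Hypothetical base systems for a new document.
def E1(document):
    return {(3, 5): "model", (9, 11): "ram", (14, 16): "ram"}

def E2(document):
    return {(3, 5): "manuf", (9, 11): "ram", (14, 16): "HDcapacity"}

CM = learn_table(TRAINING_ROWS)
final_template = run_time_stacking("new laptop page", [E1, E2], CM)
print(final_template)  # {(3, 5): 'model', (9, 11): 'ram'}
```

The span on which the two systems disagree as "ram" versus "HDcapacity" is dropped, because the only training row with that combination had no correct field.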



Ontology Enrichment

• Highly evolving domain (e.g. laptop descriptions)
  – New instances characterize new concepts.
    e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
  – New surface appearance of an instance.
    e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’
• We concentrate on instances.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.



Ontology Enrichment

[Figure: ontology enrichment cycle. The multi-lingual domain ontology is used to annotate the corpus; machine learning on the annotated corpus yields an information extraction system, which produces additional annotations; after validation by the domain expert, these feed ontology enrichment / population.]



Enrichment with synonyms

• The number of instances for validation increases with the size of the corpus and the ontology.
• There is a need for supporting the enrichment of the ‘synonymy’ relationship.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
  – Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
  – Orthographical: ‘Intel p3’ - ‘intell p3’
  – Lexicographical: ‘Hewlett Packard’ - ‘HP’
  – Combination: ‘Intell Pentium 3’ - ‘P III’



Compression-based Clustering

• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.

• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added. Huffman trees are used as models of the clusters.

• COCLU implements a hill-climbing search: it iteratively computes the CCDiff of each new string with respect to each cluster. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
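A rough sketch of the idea, under several assumptions: the code length of a cluster is taken to be the length of its characters under a Huffman code built from their frequencies, a cluster with a single distinct character is charged one bit per character, the search is reduced to one greedy pass over the strings, and the threshold value is arbitrary. None of these details are confirmed by the slides, and the function names are made up for the example.

```python
import heapq
from collections import Counter

def code_length(strings):
    """Bits needed for all characters of the cluster under a Huffman code."""
    freq = Counter("".join(strings))
    if not freq:
        return 0
    if len(freq) == 1:                 # convention: one bit per character
        return sum(freq.values())
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:               # total cost = sum of internal node weights
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def ccdiff(cluster, s):
    """Increase in the code length of `cluster` when string `s` is added."""
    return code_length(cluster + [s]) - code_length(cluster)

def coclu(strings, threshold):
    """Greedy single pass: join the closest cluster or open a new one."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: ccdiff(c, s), default=None)
        if best is not None and ccdiff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters

names = ["Intel Pentium 3", "Intel Pentium III", "Pentium III",
         "Hewlett Packard", "Hewlet Packard"]
for cluster in coclu(names, threshold=50):
    print(cluster)
```

Because the score grows with the length of the candidate string, a usable threshold depends on typical string lengths; the value 50 above is only a placeholder.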




Conclusions

• Information integration can benefit from machine learning.
• Grammar learning methods have become efficient.
• Combining IE systems improves performance.
• Ontologies can be used to annotate examples to learn IE systems and enrich ontologies.
• Grammar learning in parallel to, or in combination with, ontology learning?




Acknowledgements

• This is research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium.



Announcement: IJCAI workshop

Workshop on Grammatical Inference Applications: Successes and Future Challenges

IJCAI-05, Edinburgh, Scotland

July 31, 2005

Paper submission deadline: March 19, 2005

URL: http://www.ics.mq.edu.au/~menno/IJCAI05/
