Information Integration: A Status Report Alon Halevy University of Washington, Seattle IJCAI 2003

Dec 19, 2015

Page 1

Information Integration: A Status Report

Alon Halevy

University of Washington, Seattle

IJCAI 2003

Page 2

Mediated Schema

[Figure: a mediated schema over biological data sources.]
Sources: OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, Locus-Link, GEO
Mediated schema concepts: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment

Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?

Page 3

Motivation and Activity

Application areas of data integration:
  Enterprise information integration ($$)
  The government
  Data sources on the web
  Scientific data sharing

Several data sharing architectures:
  Virtual data integration, warehousing, message-passing, web services

Many research projects:
  Mine: Information Manifold, Tukwila, LSD, Piazza

EII: a new industry buzzword.

Page 4

Today’s Agenda

Recent progress
  Mediation languages
  Query processing (XML and other)
  Some lessons from the commercial world

Current challenges
  Enabling large-scale data sharing: peer-data management systems
  The age-old problem: semantic heterogeneity
  A new agenda item for AI: corpus-based KR

AI is more vital than ever for progress here!

Page 5

Mediation Languages

Goal: a language for specifying semantic relationships (not full FOL).

[Figure: a query Q posed over the mediated schema is reformulated into queries Q’ over five sources.]

Assume: data at the sources is structured (or seems so).

Page 6

Global-as-View (GAV)

[Figure: a mediated schema (Title, Actor, …) over five sources with relations R1, R2, R3, R4, R5.]

Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
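
In GAV, a query over the mediated schema is answered by unfolding each mediated relation into its definition over the sources. The following is a minimal Python sketch of that unfolding for the two Actor rules. The tuples, and the reading of the columns (title, actor, year, movie id), are assumptions made for illustration; the slide does not say what the variables stand for.

```python
# Hypothetical source contents; the slide gives only the rules, not the data.
R1 = [("Casablanca", "Humphrey Bogart", 1942)]  # assumed R1(title, actor, year)
R2 = [("Vertigo", "m7")]                        # assumed R2(title, movie_id)
R3 = [("m7", "James Stewart")]                  # assumed R3(movie_id, actor)


def actor():
    """Evaluate the mediated relation Actor(x, y) by unfolding both GAV rules."""
    answers = set()
    # Actor(x,y) :- R1(x,y,z)
    for x, y, _z in R1:
        answers.add((x, y))
    # Actor(x,y) :- R2(x,z), R3(z,y)   (join R2 and R3 on z)
    for x, z in R2:
        for z2, y in R3:
            if z == z2:
                answers.add((x, y))
    return answers


actors = actor()
```

The union of the two rule bodies is the whole answer, which is why query reformulation in GAV reduces to view unfolding.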

Page 7

Local-as-View (LAV, GLAV)

[Figure: a mediated schema (Title, Actor, …) over five sources with relations R1, R2, R3, R4, R5.]

R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,”French”)
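
In LAV the rule runs the other way: each source is described as a view over the mediated schema, so answering a query means reasoning from source tuples back to mediated-schema facts. The sketch below inverts the R1 description only, as a rough illustration of the idea and not as any particular published algorithm. The tuples and the helper names (`certain_facts`, `movies_with`) are hypothetical.

```python
# Hypothetical contents of source R1(title, year, actor); the description
# R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970 says every tuple is pre-1970.
R1 = [
    ("Casablanca", 1942, "Humphrey Bogart"),
    ("Vertigo", 1958, "James Stewart"),
]


def certain_facts(r1_tuples):
    """Invert the R1 description: each tuple (x, y, z) implies Title(x, y) and Actor(x, z)."""
    title, actor = set(), set()
    for x, y, z in r1_tuples:
        title.add((x, y))
        actor.add((x, z))
    return title, actor


def movies_with(actor_name, r1_tuples=R1):
    """Answer Q(x, y) :- Title(x, y), Actor(x, actor_name) using only R1."""
    title, actor = certain_facts(r1_tuples)
    return sorted((x, y) for (x, y) in title if (x, actor_name) in actor)


bogart_movies = movies_with("Humphrey Bogart")
```

The answers are sound but may be incomplete: R1 only covers titles before 1970, so later movies would have to come from other sources.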

Page 8

Mediation Languages: Summary

A lot of nice theory and practical algorithms.

Careful choice of expressive power mattered.

Algorithms for answering queries using views are in every commercial DBMS.

Description Logics – also an attractive formalism for mediation.

Bottleneck is coming up with the mapping expressions.

Page 9

Outline

Recent progress
  Mediation languages
  Query processing (XML and other)
  Some lessons from the commercial world

Current challenges
  Enabling large-scale data sharing: peer-data management systems
  The age-old problem: semantic heterogeneity
  A new agenda item for AI: corpus-based KR

Page 10

Adaptive Query Processing

Problem: no statistics, unstable network
  Cannot ‘plan and then execute’
  Need to adapt the plan during execution
Ideas already in Ingres (1976), an early database system
Interleaving planning and execution (AI)

Key question: when to adapt, and at what granularity: For every tuple? At materialization points?
See [Ives et al. 2002] for our solution.
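
To make the granularity question concrete, here is a small Python sketch that adapts at chunk boundaries: it gathers pass-rate statistics while executing and reorders the predicates between chunks. This is a toy stand-in built on filters, not the approach of [Ives et al. 2002]; the function name, chunk size, and predicates are all assumptions for illustration.

```python
def adaptive_filter(rows, predicates, chunk_size=100):
    """Apply all predicates to rows, revising their order between chunks.

    No statistics are known up front. Pass rates are observed during
    execution (and are conditional on the predicates that ran earlier),
    and the order is revised at each chunk boundary.
    """
    order = list(predicates)
    seen = {p: 0 for p in order}
    passed = {p: 0 for p in order}
    out = []
    for start in range(0, len(rows), chunk_size):
        for row in rows[start:start + chunk_size]:
            for p in order:
                seen[p] += 1
                if not p(row):
                    break
                passed[p] += 1
            else:
                # Reached only when no predicate rejected the row.
                out.append(row)
        # Adaptation point: run the most selective predicate first.
        order.sort(key=lambda p: passed[p] / seen[p] if seen[p] else 1.0)
    return out, order


def is_even(r):
    return r % 2 == 0


def is_multiple_of_50(r):
    return r % 50 == 0


rows = list(range(1000))
result, final_order = adaptive_filter(rows, [is_even, is_multiple_of_50])
```

Setting `chunk_size=1` corresponds to adapting for every tuple; larger chunks trade responsiveness for lower bookkeeping overhead.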

Page 11

Convergent Query Processing [Ives et al., 2002]

[Figure: join of In-stock (I), Orders (O), and Shipping (S). Partitions I0 O0 S0, I1 O1 S1, and I2 O2 S2 are joined under different plans, followed by a “cleanup” query plan.]

Page 12

XML Query Processing

XML facilitates integration.
Mediator query processor may manipulate XML directly.

Challenges:
  XML is not flat, but nested; path queries.
  Can be irregular; doesn’t adhere to a strict schema.

Progress:
  Defining and optimizing XQuery.
  Going back and forth: XML to relational.
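
The nesting and irregularity mentioned above show up even in a tiny example. The sketch below uses Python's standard `xml.etree.ElementTree` (its limited path syntax, not XQuery) to run path queries over a nested document and flatten it into relational rows. The element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: nested, and irregular (the second sequence has no protein).
doc = ET.fromstring("""
<experiments>
  <experiment id="e1">
    <sequence acc="NM_1"><protein>P1</protein></sequence>
    <sequence acc="NM_2"/>
  </experiment>
</experiments>
""")


def flatten(root):
    """Turn nested experiment/sequence/protein elements into flat tuples."""
    rows = []
    for exp in root.findall("experiment"):
        for seq in exp.findall("sequence"):
            # findtext returns None when the child element is absent.
            protein = seq.findtext("protein")
            rows.append((exp.get("id"), seq.get("acc"), protein))
    return rows


rows = flatten(doc)
```

The missing protein becomes a null in the relational form, which is one small instance of the XML-to-relational translation issues listed under "Progress".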

Page 13

The Commercial World

Some startups: Nimble, MetaMatrix, Calixa, Composite, Enosys

Big guys making announcements: IBM, BEA, MS (Oracle still being defiant).
Integration technology in different layers:
  E.g., reporting companies want it (Actuate)

Progress: analysts have a buzzword: EII.
Challenges:
  Integration with EAI?
  Yet another middleware?
  Horizontal vs. vertical?

Page 14

What Worked?

Performance was not an issue.

Tools, tools, tools: for managing sources and creating mediated schemas.

XML query processing was needed.

Concordance: need common keys to join sources. Active research area!

Page 15

Outline

Recent progress
  Mediation languages
  Query processing (XML and other)
  Some lessons from the commercial world

Current challenges
  Enabling large-scale data sharing: peer-data management systems
  The age-old problem: semantic heterogeneity
  A new agenda item for AI: corpus-based KR

Page 16

Limitations of Mediated Schema

[Figure: a query Q over a single mediated schema is reformulated into queries Q’ over five sources.]

Page 17

Peer Data-Management

PDMS: a network of peers (data sources)

Peers can:
  Export base data, or combinations of data
  Serve as logical mediators for other peers

A peer can be both a server and a client.

Semantic relationships are specified locally (between small sets of peers).

This is a Semantic Web (different angle)

Page 18

Network of Mappings (Piazza)

[Figure: peers UW, Stanford, DBLP, Roma, Paris, CiteSeer, and Vienna connected by GAV, LAV, and GLAV mappings. A query Q posed at one peer is reformulated into Q’ at its neighbors and into Q’’ at peers further away.]

Page 19

Advantages of PDMS

No need for a central mediated schema.
Can map data opportunistically, as is most convenient.
Queries are posed using the peer’s schema.
Answers come from anywhere in the system.
Infrastructure for Semantic Web applications.
This is not P2P file sharing:
  Data has rich semantics.
  Membership is not as dynamic.

Page 20

Schema Mediation for PDMS

[Figure: peers UW, Stanford, DBLP, Roma, Paris, CiteSeer, and Vienna connected by GAV, LAV, and GLAV mappings; Q is reformulated into Q’ and Q’’.]

When can LAV and GAV be combined to form such a network structure? (Semantics not yet obvious.)

[ICDE-03], [WWW-03 for XML]

Page 21

Efficient Query Answering

[Figure: peers UW, Stanford, DBLP, Roma, Paris, CiteSeer, and Vienna; Q is reformulated into Q’ and Q’’ along the mapping paths.]

Problems:
• redundant paths
• expensive reformulation.

Possible solution:
• Pre-compose some paths
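
The redundant-path problem can be seen by enumerating mapping paths in a small network. The sketch below is a rough Python illustration only: the peer names are taken from the slides, but the edges between them are assumed, and real reformulation rewrites queries along each path instead of merely listing the paths.

```python
# Hypothetical mapping edges between peers (peer -> peers it has mappings to).
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "CiteSeer": [],
}


def reformulation_paths(start, mappings):
    """Enumerate every acyclic chain of mappings reachable from start."""
    paths = []

    def walk(path):
        for nxt in mappings.get(path[-1], []):
            if nxt in path:
                continue  # do not follow cycles
            extended = path + [nxt]
            paths.append(extended)
            walk(extended)

    walk([start])
    return paths


paths = reformulation_paths("UW", MAPPINGS)
# Group paths by the peer they end at; more than one path means redundancy.
paths_to = {}
for p in paths:
    paths_to.setdefault(p[-1], []).append(p)
```

Here CiteSeer is reached both through UW, DBLP and through UW, Stanford, DBLP, so a query from UW would be reformulated for CiteSeer twice. Pre-composing a frequently used chain into a single mapping would avoid repeating that work.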

Page 22

Mapping Composition [Jayant Madhavan and Halevy, VLDB 2003]

Incredibly subtle!
In general, composition can be an infinite set of GLAV formulas.
Results:
  Finite in many cases.
  Even when infinite, often has a finite, useful encoding.
  Hence, compositions can usually be pre-optimized.

Page 23

Other Research Issues

[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin; Q is reformulated into Q’ and Q’’ along the mapping paths.]

Intelligent data placement

Management of mapping networks

Improving networks: finding additional connections.

Handling inconsistencies

Page 24

PDMS-Related Projects

Hyperion (Toronto)

PeerDB (Singapore)

Local relational models (Trento)

Edutella (Hannover, Germany)

Semantic Gossiping (EPFL, Lausanne)

Raccoon (UC Irvine)

Orchestra (Ives, U. Penn)

Page 25

Outline

Recent progress
  Mediation languages
  Query processing (XML and other)
  Some lessons from the commercial world

Current challenges
  Enabling large-scale data sharing: peer-data management systems
  The age-old problem: semantic heterogeneity
  A new agenda item for AI: corpus-based KR

Page 26

Schema/Ontology Matching

Schema heterogeneity: a key roadblock for information integration
  Different data sources speak their own schema
  Mapping is key to any data sharing architecture

[Figure: a consumer queries a mediator that sits over several data sources. Mediator vocabulary: Hotel, Restaurant, AdventureSports, HistoricalSites. Source vocabularies include: Hotel, Gaststätte, Brauerei, Kathedrale; Lodges, Restaurants; Beaches, Volcanoes.]

Page 27

Schema Matching

Schema Matching: discovering correspondences between similar elements
Eventually… BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)

Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)

Page 28

Typical Approaches

Multiple sources of evidence in the schemas:
  Schema element names
    BooksAndCDs/Categories ~ BookCategories/Category
  Descriptions and documentation
    ItemID: unique identifier for a book or a CD
    ISBN: unique identifier for any book
  Data types, data instances
    DateTime vs. Integer; addresses have similar formats
  Schema structure
    All books have similar attributes

Use domain knowledge

Combine multiple techniques to exploit all available evidence

In isolation, techniques are incomplete or brittle
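
As a rough illustration of combining evidence, the Python sketch below scores candidate matches using two signals, element names and data types, and takes a weighted sum. The element names come from the Books and BooksAndMusic tables in the slides; the data types, the weights, and the function names are assumptions, and the name similarity is simply `difflib.SequenceMatcher` from the standard library, not a matcher described in the talk.

```python
from difflib import SequenceMatcher

# Element -> data type. Names are from the slides; the types are assumed.
BOOKS = {"Title": "string", "ISBN": "string", "Price": "decimal",
         "DiscountPrice": "decimal", "Edition": "int"}
BOOKS_AND_MUSIC = {"Title": "string", "Author": "string", "Publisher": "string",
                   "ItemID": "string", "ItemType": "string",
                   "SuggestedPrice": "decimal", "Categories": "string",
                   "Keywords": "string"}


def name_score(a, b):
    """String similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_score(ta, tb):
    """Crude data-type evidence: 1 if the declared types agree, else 0."""
    return 1.0 if ta == tb else 0.0


def best_match(element, source=BOOKS, target=BOOKS_AND_MUSIC,
               w_name=0.7, w_type=0.3):
    """Return the target element with the highest combined score."""
    def combined(candidate):
        return (w_name * name_score(element, candidate)
                + w_type * type_score(source[element], target[candidate]))

    best = max(target, key=combined)
    return best, combined(best)


price_match, price_score = best_match("Price")
```

Either signal alone is brittle: many elements share the type string, and names such as ISBN and ItemID look nothing alike even though their descriptions overlap, which is the case for adding more kinds of evidence.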

Page 29

Philosophy of Solutions

Effective schema matching requires a principled combination of techniques.
Like human experts, the matcher should improve over time.
LSD: mapping data sources to a mediated schema.
  Use a few mappings as training examples to learn hypotheses for elements of the mediated schema.
  See [Doan et al., SIGMOD-2001, MLJ-2003]

Next step: corpus-based matching.

Page 30

Corpus-Based Matching

Collection of schemas and mappings

Reuse extracted information to match new schemas

[Figure: a corpus of books and inventory schemas, with elements such as CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature, and Publisher.]

Identify common concepts and patterns:
  Books, Authors, Publishers, …
  Books: Title, Author, Price, Publisher

Page 31

Mapping Knowledge Base

[Figure: the Mapping Knowledge Base. Learners: Data Instances Learner, Name Learner, Data Type Learner, Description Learner, Structure Learner, and a Meta Learner. For each of C1 … CN, the knowledge base stores one model per learner (NL, DIL, DTL, DL, SL, ML).]

Learners: extract knowledge from schemas and mappings

Schemas and mappings: accumulated over time

Learned models: for each unique element in any schema.
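
The learner arrangement in the figure can be sketched as several base learners whose scores are combined by a meta learner. The Python below is a hand-built toy under stated assumptions: the two base learners are keyword and format heuristics standing in for trained models, the labels are limited to Title and Price, and the weights are fixed by hand where a real meta learner would learn them from training mappings.

```python
LABELS = ("Title", "Price")


def name_learner(name, values):
    """Score labels from the element name alone (keyword heuristic)."""
    lowered = name.lower()
    return {
        "Title": 1.0 if ("title" in lowered or "album" in lowered) else 0.0,
        "Price": 1.0 if "price" in lowered else 0.0,
    }


def instance_learner(name, values):
    """Score labels from data instances: numeric-looking values suggest Price."""
    if not values:
        return {label: 0.5 for label in LABELS}
    numeric = sum(v.replace(".", "", 1).isdigit() for v in values) / len(values)
    return {"Title": 1.0 - numeric, "Price": numeric}


# Assumed, hand-set weights; a trained meta learner would estimate these.
WEIGHTS = {name_learner: 0.4, instance_learner: 0.6}


def meta_learner(name, values):
    """Combine base-learner scores with a weighted sum and pick the best label."""
    combined = {label: 0.0 for label in LABELS}
    for learner, weight in WEIGHTS.items():
        for label, score in learner(name, values).items():
            combined[label] += weight * score
    return max(combined, key=combined.get), combined


label, scores = meta_learner("Cost", ["12.99", "8.50"])
```

The element name "Cost" gives the name learner nothing to go on, but the instance learner still pushes the prediction to Price, which is the benefit of combining learners.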

Page 32

Preliminary results: Corpus is useful

[Chart: Shipping Domain. x-axis: Schema Pairs (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b); y-axis: Avg Number of Matches (-15 to 15); series: Only MKB, Only BASIC.]

Page 33

With and without the corpus

[Chart: Inventory Domain. x-axis: Schema Pairs (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b); y-axis: Recall (0 to 1); series: MKB, BASIC, COMB.]

Page 34

Outline

Recent progress
  Mediation languages
  Query processing (XML and other)
  Some lessons from the commercial world

Current challenges
  Enabling large-scale data sharing: peer-data management systems
  The age-old problem: semantic heterogeneity
  A new agenda item for AI: corpus-based KR

Page 35

Corpus vs. Traditional KR

A large corpus of uncoordinated knowledge fragments

vs.

Carefully designed knowledge base

Can a corpus offer a more attractive solution for some KR problems?

Page 36

Pause: KR vs. Corpus

Knowledge base:
  Hard to engineer, brittle at the boundaries.
  Only one way of saying things.

Corpus:
  “Easier” to build, coverage not predefined.
  Many views of the domain.

See proceedings for full argument.

Page 37

Corpus-based KR

Contents:
  Schemas, ontologies, meta-data, data, queries, mappings.

Collect statistics on the corpus:
  How often does a word appear as a relation name?
  When it does, what tend to be the attribute names?
  What other tables are there?

Support a KR-style interface on the corpus (OKBC-like)
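
The three statistics listed above can be gathered with simple counters. The sketch below assumes a toy corpus of three hypothetical schemas, each a mapping from relation name to attribute names; the relation and attribute names echo the inventory examples earlier in the talk, but the corpus itself is invented.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each schema maps relation names to attribute lists.
CORPUS = [
    {"Books": ["Title", "ISBN", "Price"],
     "Authors": ["ISBN", "FirstName", "LastName"]},
    {"Books": ["Title", "Author", "Price", "Publisher"]},
    {"CDs": ["Album", "ASIN", "Price"],
     "Artists": ["ASIN", "ArtistName"]},
]

relation_count = Counter()                   # how often a word is a relation name
attrs_given_relation = defaultdict(Counter)  # attribute names seen for that relation
co_tables = defaultdict(Counter)             # other tables in the same schema

for schema in CORPUS:
    for relation, attributes in schema.items():
        relation_count[relation] += 1
        attrs_given_relation[relation].update(attributes)
        co_tables[relation].update(r for r in schema if r != relation)

books_attrs = attrs_given_relation["Books"].most_common()
```

A KR-style interface could then answer questions such as "what attributes does a Books relation usually have?" from these counts, with frequencies serving as confidence.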

Page 38

Other Applications of C-B-KR

Question answering on the web

Focused crawling

Natural language interfaces to DB’s

Schema and ontology authoring

Semantic query optimization.

Whenever we need knowledge to help us rank multiple answers/plans.

Page 39

Example Queries

How are two terms related?
  GPA(studentID, $value), Student(studentID, GPA, address)

Find different ways of saying the same thing:
  Class(Lexus, Luxury)
  LuxuryCar(Lexus, Toyota)

When do two terms play similar roles?
  IJCAIReview(p1, rev2, accept)
  AIJReferees(round2, p3, rev4, reject)

Page 40

Challenges for C-B-KR

Building the corpus.

How focused should the corpus be?

Is human tuning needed or helpful?

How do we accommodate inference?

How do we leverage traditional KR?

Page 41

Summary

The vision: data authoring, querying and sharing by everyone.
  We got the plumbing to work.
  To go further, we need AI techniques.

Challenge: cross the structure chasm:
  It’s hard to author & query structured data!
  PDMS: architecture for ad-hoc sharing.
  Ontology/schema matching is key!

Are we providing the right tools?
  Corpus-based knowledge representation.

We need benchmarks!

Page 42

Some References

www.cs.washington.edu/homes/alon
Piazza: ICDE03, WWW03, VLDB-03
The Structure Chasm: CIDR-03
Mediation surveys: VLDB Journal 01; Lenzerini tutorial.
Schema matching: Rahm and Bernstein, VLDB Journal 01.
Workshops: IJCAI, Semantic Web Conf.
Teaching integration to undergraduates: SIGMOD Record, September, 2003.