Model Management and the Future Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 20, 2005 Semex figures extracted.

Model Management and the Future

Zachary G. Ives

University of Pennsylvania

CIS 650 – Database & Information Systems

April 20, 2005

Semex figures extracted from NY DB/IR talk by A. Halevy

2

Administrivia

“Final exam” Fri, May 6, noon – 1:30
  Free pizza and soft drinks
  5-10 minute overviews of your projects
  Reports and code due

3

Metadata Management

The challenges:

There are lots of metadata representations
  Different data models; different definition types (e.g., Java classes, XML Schemas, SQL DDL, …)

Many of the problems are unsolvable in the abstract, e.g., schema matching
  But maybe we can customize tools for each task
  And maybe we can get user input to help

We want to create a clean, composable model of operators
  Should be “algebraic” in some sense, with nice properties
  Operators need to be generic but extensible

4

The Basic Algebraic Operators

Match
  Basically, schema matching: takes two models and returns a mapping between them
  Elementary vs. complex match; reliance on morphisms

Compose
  Takes two mappings and composes them

Diff
  Takes a model A and a mapping between A and B, and returns the part of A that’s not mapped

ModelGen
  Takes model A, creates new model B plus a mapping between A and B

Merge
  Takes models A, B, and a mapping between them; returns the union C, plus mappings between A and C and between B and C
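
To make the operator signatures concrete, here is a minimal sketch. It assumes a deliberately flat representation that is not from the slides: a model is a frozenset of element names and a mapping is a frozenset of (source, target) pairs. The name-equality rule in `match` is only a stand-in for a real matcher, and `merge` assumes unmatched elements of B do not collide with names in A.

```python
def match(a, b):
    """Naive Match (assumption): pair elements whose names agree ignoring case."""
    return frozenset((x, y) for x in a for y in b if x.lower() == y.lower())

def compose(m_ab, m_bc):
    """Compose: relational join of two mappings on the shared middle model."""
    return frozenset((x, z) for (x, y1) in m_ab for (y2, z) in m_bc if y1 == y2)

def diff(a, m_ab):
    """Diff: the elements of A that do not take part in the mapping."""
    mapped = {x for (x, _) in m_ab}
    return frozenset(a - mapped)

def merge(a, b, m_ab):
    """Merge: union of A and B with matched elements collapsed onto A's name.
    Returns (C, mapping A-C, mapping B-C)."""
    b_to_a = {y: x for (x, y) in m_ab}  # matched B element -> its A counterpart
    map_ac = frozenset((x, x) for x in a)
    map_bc = frozenset((y, b_to_a.get(y, y)) for y in b)
    c = frozenset(target for (_, target) in map_ac | map_bc)
    return c, map_ac, map_bc

s1 = frozenset({"Name", "Email", "Phone"})
s2 = frozenset({"name", "email", "Address"})
m12 = match(s1, s2)
merged, m1c, m2c = merge(s1, s2, m12)
print(sorted(diff(s1, m12)), sorted(merged))
```

Real models carry structure and constraints, so each of these operators is much harder in practice than this flat version suggests.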

5

Model Management in Action

6

Schematic of Changes

the new parts in S2 that need to be propagated to d2

Dest. w/o deleted items from s1

the XML version of s2

7

Actual Operations

8

What’s Hard?

Match
  We saw that LSD is far from perfect, and it’s the best out there…

Merge
  Can we make (A merge B) merge C = A merge (B merge C)? (Buneman, Davidson, Kosky 92)

Diff
  With Diff, how do we ensure a well-formed model as the result?
  They return a copy of the model, plus mappings showing what is actually part of the diff

Composition – it isn’t always closed within the mapping language!
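
The associativity question for Merge can be illustrated with a toy example. The sketch below is an illustration only, not the Buneman, Davidson, and Kosky construction: it assumes a model is a dict from attribute name to type, and compares two hypothetical conflict policies. Dropping conflicting attributes makes the result depend on merge order, while keeping every candidate type as a set does not.

```python
def merge_drop_conflicts(a, b):
    """Toy merge: union of attribute->type dicts; conflicting attributes are dropped."""
    out = {}
    for k in a.keys() | b.keys():
        if k in a and k in b and a[k] != b[k]:
            continue  # conflicting types: drop the attribute entirely
        out[k] = a[k] if k in a else b[k]
    return out

def merge_keep_all(a, b):
    """Toy merge: each attribute maps to the set of all types seen for it."""
    return {k: a.get(k, frozenset()) | b.get(k, frozenset())
            for k in a.keys() | b.keys()}

A, B, C = {"x": "int"}, {"x": "str"}, {"x": "str"}
left = merge_drop_conflicts(merge_drop_conflicts(A, B), C)
right = merge_drop_conflicts(A, merge_drop_conflicts(B, C))
print(left, right)  # the two groupings disagree under this policy

A2, B2, C2 = ({"x": frozenset({t})} for t in ("int", "str", "str"))
left2 = merge_keep_all(merge_keep_all(A2, B2), C2)
right2 = merge_keep_all(A2, merge_keep_all(B2, C2))
```

The second policy is associative because it reduces to set union, which hints at why a merge defined as a least upper bound behaves well.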

9

More Challenges

What about:
  Semantics of the meta-model – how do we handle, e.g., constraints?
  What to do about approximate correspondences?
  Can we actually make these things generic but expressive enough to be useful?

Do you think this vision is feasible?

10

Switching Gears

… to another unsolvable problem!

Personal information management

What does this mean?
  Google Desktop Search, Mac OS Tiger, Windows Longhorn – it means keyword search over your emails and documents
  Outlook, Lotus Agenda, …: a database of “stuff”
  ... or lots of new systems: Haystack (Karger, MIT); MyLifeBits (Bell, Microsoft Research); Semex (Dong and Halevy, U Wash)

11

What Should It Mean?

The hard disk is the database!

Two methods of interaction:
  Browsing – via “semantic links” (think of RDF edges, or relations in an ER diagram)
  On-the-fly integration – create a schema, maybe provide some examples, and have the system automatically map data into the schema

In some sense, this represents the sum total of most of the things we’ve talked about this semester
  Query processing; integration; information retrieval; schema matching; entity matching; semantic web; etc.

12

The Semex System

13

A Global Schema/Model

In general, it should be possible to define our own “schema” (or ontology)

Semex: a very simple domain model describing basic classes and relationships
  Their focus was on research-related topics: articles, messages, conferences, people, …
  The model is in RDF – why?

The two tasks:
  Map data into the appropriate classes
  Present associations to the user, allow them to be browsed and queried

14

Semex Interface

15

What’s the Central Problem?

Lots of data (typically with some tags) but fragmented across many sources and schemas – we want to grab it and fill in info about People, Papers, etc.

Paperref:
  title: “Distributed query processing in a …”
  author: Robert S. Epstein
  author: Michael Stonebraker
  author: Eugene Wong

Citation:
  title: “Distributed Query Processing in a …”
  author: Epstein, R. S.
  author: Stonebreaker, M.
  author: Wong, E.

EMail:
  title: “Your CIDR paper”
  sender: [email protected]

16

Reference Reconciliation

a.k.a. entity resolution, value matching, deduplication, …

Finding when two items refer to the same entity

Generally relies on some form of schema matching as a first step
  In Semex, this is done by “association extractors” (wrappers and mappings)

In our case, figuring out whether attributes from a data source should be:
  Merged into an existing (partial) “tuple”
  Or they should create a new tuple

e.g.:
  <person><name>Michael Stonebraker</name><email>?</email></person>
  <person><name>?</name><email>[email protected]</email></person>
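
The merge-or-create decision above can be sketched as follows. This is a minimal illustration under stated assumptions: partial tuples are dicts in which "?" marks an unknown value, the `same_entity` callback is a hypothetical placeholder for whatever evidence the reconciler uses, and the address `ms@example.org` is made up for the demo (the address on the slide is not recoverable).

```python
UNKNOWN = "?"

def compatible(t1, t2):
    """Two partial tuples are compatible if no shared attribute has two different known values."""
    return all(
        t1[k] == t2[k]
        for k in t1.keys() & t2.keys()
        if t1[k] != UNKNOWN and t2[k] != UNKNOWN
    )

def merge_tuples(t1, t2):
    """Fill the unknown attributes of t1 with values from t2."""
    merged = dict(t1)
    for k, v in t2.items():
        if merged.get(k, UNKNOWN) == UNKNOWN:
            merged[k] = v
    return merged

def add_reference(tuples, new, same_entity):
    """Merge `new` into the first compatible tuple judged to be the same entity,
    otherwise append it as a new tuple. Mutates and returns `tuples`."""
    for i, t in enumerate(tuples):
        if compatible(t, new) and same_entity(t, new):
            tuples[i] = merge_tuples(t, new)
            return tuples
    tuples.append(dict(new))
    return tuples

people = [{"name": "Michael Stonebraker", "email": UNKNOWN}]
# Hypothetical evidence source: here we simply assert the two references match.
add_reference(people, {"name": UNKNOWN, "email": "ms@example.org"}, lambda a, b: True)
print(people)
```

Compatibility alone says very little here, since the two tuples share no known attribute; all the real work is hidden inside `same_entity`.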

17

The Key Idea

In isolation, we can consider similarity of the data items, but that’s frequently not very helpful

But maybe we can consider other factors:
  co-occurrence – [email protected] is mentioned in one place as being associated with “M. Stonebraker”; “M. Stonebraker” co-authors with “Epstein and Wong”; etc.
  associations at a higher level – Stonebraker is at MIT’s CSAIL; csail.mit.edu is MIT CSAIL’s domain

Match multiple concepts at the same time, and use a “dependency graph” to determine whether merging at a higher level suggests merging at a lower level (and vice versa)

When we find a match, use that to try to transitively find more matches (“enrichment”)

18

Example of Dependency Graph

19

Graph Creation and Maintenance

For every pair, initialize similarity to be 0
  If the items are comparable, compute similarity

Add edges for each possible similarity relationship between attributes

Mark all nodes as active

For each active node, recompute its similarity score based on similarities of outgoing edges
  If above a (conservative) threshold, merge
    Mark all outgoing neighbors with similarity < 1 as active
  Else mark as inactive

Repeat until fixpoint

A few other details for enrichment (computing transitive effects of merging) and constraints (avoiding illegal merges)
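
The propagation loop above can be sketched roughly as follows. This is a simplified illustration, not the Semex implementation: each node stands for a candidate pair of references, `depends_on` lists the pairs whose merging counts as evidence for it, and the scoring rule (base similarity plus a fixed `BOOST` per merged dependency) and both constants are assumptions chosen for the demo. Enrichment and constraints are omitted.

```python
from collections import deque

THRESHOLD = 0.85  # conservative merge threshold (assumed value)
BOOST = 0.3       # evidence contributed by each merged dependency (assumed value)

def reconcile(base_sim, depends_on):
    """Return the set of candidate pairs that end up merged at the fixpoint."""
    sim = dict(base_sim)
    # Invert the dependency lists: dependents[d] holds the nodes that d's merge can affect.
    dependents = {n: set() for n in sim}
    for n, deps in depends_on.items():
        for d in deps:
            dependents[d].add(n)

    queue = deque(sim)  # all nodes start out active
    active = set(sim)
    while queue:
        n = queue.popleft()
        active.discard(n)  # inactive unless a neighbor reactivates it
        merged_deps = sum(1 for d in depends_on.get(n, ()) if sim[d] >= 1.0)
        score = min(1.0, base_sim[n] + BOOST * merged_deps)
        if score >= THRESHOLD:
            sim[n] = 1.0  # merge
            for m in dependents[n]:
                if sim[m] < 1.0 and m not in active:
                    active.add(m)
                    queue.append(m)
        else:
            sim[n] = score
    return {n for n, s in sim.items() if s >= 1.0}

base = {"author-pair": 0.6, "paper-pair": 0.9, "unrelated-pair": 0.4}
deps = {"author-pair": ["paper-pair"]}
print(sorted(reconcile(base, deps)))
```

In the demo the author pair is too weak to merge on its own, but merging the paper pair reactivates it and pushes it over the threshold. Because merged nodes are never re-queued, the loop terminates.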

20

Personal Info Management

In some ways, one of the real frontiers of data management
  Needs to have some info retrieval, databases, user interfaces, and even ontologies
  Indexing? Query processing?

Brings in all of the AI-complete issues, too!
  Schema matching, entity matching (in a very hard form), …

Lots of smart people are working on this

Do you think you’ll have a PIM system on your desktop in 3-5 years?

21

Wrapping up…

This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
  Query processing, storage, and transactions
  Issues relating to data distribution (both DB and Google)
  Heterogeneity, mappings, and reformulation (and the limitations thereof)
  Semantic webs of various kinds
  Metadata management
  PIM

I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…

22

Lots of Related Ideas at Penn

Orchestra: “Collaborative data sharing”
  Many databases or warehouses, each with its own schema
  Piazza-like mappings among the schemas
  Each is being independently modified
  How do you “synchronize” – esp. when each user may want to override the changes made elsewhere?
  A distributed Piazza “engine” underneath
  Approximate mappings?

Aspenn: Rethinking stream and sensor processing
  “Seeing the forest from the trees” – define the entities being sensed in a declarative way, associate streams with them
  Composite entities, approximation

Digital curation: databases as resources (how do we archive, do version control, maintain provenance, allow to evolve?)

23

Thanks!!!

I had a great time this semester – I hope you learned a lot and found it to be enjoyable

I’m looking forward to seeing your projects!

Best of luck to those of you who are finishing this year!
