Uncertainty in Data Integration Ai Jing 2007-11-10

Mar 26, 2015

Ian Nelson
Transcript
Page 1: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Data Integration

Ai Jing
2007-11-10

Page 2: Uncertainty in Data Integration Ai Jing 2007-11-10.

Outline

Data Integration with Uncertainty
Overview of Workshop on Management of Uncertain Data
Uncertainty in Deep Web

Page 4: Uncertainty in Data Integration Ai Jing 2007-11-10.

Data Integration with Uncertainty

Motivation and overview
Definition of probabilistic mappings
Query answering w.r.t. p-mappings
Complexity of query answering
Contributions

Page 6: Uncertainty in Data Integration Ai Jing 2007-11-10.

Traditional Data Integration Systems

Q: SELECT P.title AS title, P.year AS year, A.name AS author
FROM Author A, Paper P, AuthoredBy B
WHERE A.aid = B.aid AND P.pid = B.pid

[Figure: the query Q over the mediated schema is reformulated into queries Q1, Q2, Q3, Q4, Q5 over the data sources]

Page 7: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty Can Occur at Three Levels in Data Integration Applications

III. Query Level

II. Mapping Level

I. Data Level

Focus of the paper: probabilistic schema mappings

Page 8: Uncertainty in Data Integration Ai Jing 2007-11-10.

Example Probabilistic Mappings

S(pname, email-addr, current-addr, permanent-addr)
T(name, email, mailing-addr, home-addr, office-addr)

Three candidate mappings from S to T, with their probabilities:
m1: 0.5
m2: 0.4
m3: 0.1

[Figure: the attribute correspondences of m1, m2 and m3, drawn as arrows between S and T]

Page 9: Uncertainty in Data Integration Ai Jing 2007-11-10.

Top-k Query Answering w.r.t. Probabilistic Mappings

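Mediated schema:
Q: SELECT mailing-addr FROM T

Rewritten queries over the source, one per possible mapping:
Q1: SELECT current-addr FROM S (probability 0.5)
Q2: SELECT permanent-addr FROM S (probability 0.4)
Q3: SELECT email-addr FROM S (probability 0.1)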
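The rewriting on this slide can be sketched in Python with the standard sqlite3 module. This is only an illustration under assumptions: the source rows are made up, pairing the three probabilities with Q1, Q2, Q3 in slide order is assumed, and the names by_table_answers and top_k are hypothetical. Each answer's score is the total probability of the mappings whose rewritten query returns it.

```python
import sqlite3
from collections import defaultdict

# Hypothetical source instance for S; the rows are invented for illustration.
ROWS = [
    ("Alice", "alice@example.org", "Mountain View", "Sunnyvale"),
    ("Bob", "bob@example.org", "Sunnyvale", "Sunnyvale"),
]

# Rewritten queries from the slide, each paired with the probability of the
# mapping that produced it (pairing by slide order is an assumption).
REWRITINGS = [
    ('SELECT "current-addr" FROM S', 0.5),
    ('SELECT "permanent-addr" FROM S', 0.4),
    ('SELECT "email-addr" FROM S', 0.1),
]


def by_table_answers(rows, rewritings):
    """Score each answer by the summed probability of the mappings returning it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE S (pname TEXT, "email-addr" TEXT, '
        '"current-addr" TEXT, "permanent-addr" TEXT)'
    )
    conn.executemany("INSERT INTO S VALUES (?, ?, ?, ?)", rows)
    scores = defaultdict(float)
    for sql, prob in rewritings:
        # set(): a mapping contributes its probability once per distinct answer.
        for (value,) in set(conn.execute(sql)):
            scores[value] += prob
    conn.close()
    return sorted(scores.items(), key=lambda item: -item[1])


def top_k(rows, rewritings, k):
    """Return the k highest-scoring answers."""
    return by_table_answers(rows, rewritings)[:k]


if __name__ == "__main__":
    for value, score in top_k(ROWS, REWRITINGS, 2):
        print(value, round(score, 2))
```

With these made-up rows, "Sunnyvale" is returned by both Q1 and Q2, so its score combines the two mapping probabilities.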

Page 11: Uncertainty in Data Integration Ai Jing 2007-11-10.

Definition of probabilistic mappings

Schema Mapping

S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)

One-to-one schema matching; we have exact knowledge of the mapping.

Probabilistic Mapping

S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)

Probabilities on the candidate correspondences in the figure: 1.0, 0.1, 0.5, 0.4
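A small Python sketch of what a p-mapping has to satisfy, assuming the usual conditions: every candidate is a one-to-one mapping, candidates are distinct, and their probabilities sum to 1. The function names and the EXAMPLE data are hypothetical, and which correspondence carries 0.1, 0.5 or 0.4 is assumed from the order of the attributes of S on the slide.

```python
from collections import defaultdict

# Assumed reading of the slide: pname always maps to name, and mailing-addr is
# matched by home-addr (0.5), office-addr (0.4) or email-addr (0.1).
EXAMPLE = [
    (frozenset({("pname", "name"), ("home-addr", "mailing-addr")}), 0.5),
    (frozenset({("pname", "name"), ("office-addr", "mailing-addr")}), 0.4),
    (frozenset({("pname", "name"), ("email-addr", "mailing-addr")}), 0.1),
]


def validate_p_mapping(candidates, tol=1e-9):
    """Check the p-mapping conditions; raise ValueError on the first violation."""
    seen = set()
    total = 0.0
    for pairs, prob in candidates:
        sources = [s for s, _ in pairs]
        targets = [t for _, t in pairs]
        if len(set(sources)) < len(sources) or len(set(targets)) < len(targets):
            raise ValueError("mapping is not one-to-one")
        if pairs in seen:
            raise ValueError("the same mapping is listed twice")
        if not 0.0 <= prob <= 1.0:
            raise ValueError("probability outside [0, 1]")
        seen.add(pairs)
        total += prob
    if abs(total - 1.0) > tol:
        raise ValueError("probabilities do not sum to 1")
    return True


def marginals(candidates):
    """Probability of each attribute correspondence, summed over the candidates."""
    out = defaultdict(float)
    for pairs, prob in candidates:
        for pair in pairs:
            out[pair] += prob
    return dict(out)


if __name__ == "__main__":
    validate_p_mapping(EXAMPLE)
    for pair, prob in sorted(marginals(EXAMPLE).items()):
        print(pair, round(prob, 2))
```

Under this reading, the marginal of (pname, name) is 1.0 because it appears in every candidate, which matches the four numbers shown in the figure.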

Page 12: Uncertainty in Data Integration Ai Jing 2007-11-10.

By-Table Semantics

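Under by-table semantics, a single mapping m from the p-mapping is correct and applies to every tuple of the source table.

[Figure: a target instance DT consistent with the source data under mapping m, Pr(m) = 0.5]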

Page 13: Uncertainty in Data Integration Ai Jing 2007-11-10.

By-Tuple Semantics

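Under by-tuple semantics, different mappings may apply to different source tuples, so a mapping sequence assigns one mapping to each tuple.

[Figure: a target instance DT obtained under the mapping sequence <m1, m3>]
Pr(<m1, m3>) = 0.5 × 0.1 = 0.05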
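The probability of a mapping sequence is the product of the probabilities of the mappings it assigns to the individual tuples. A minimal Python sketch, using the probabilities of m1, m2 and m3 from the earlier example slide (the function name is hypothetical):

```python
from itertools import product
from math import prod

# Mapping probabilities from the example slide.
MAPPING_PROB = {"m1": 0.5, "m2": 0.4, "m3": 0.1}


def sequence_probability(sequence, mapping_prob=MAPPING_PROB):
    """Probability of a mapping sequence: one independent choice per tuple."""
    return prod(mapping_prob[name] for name in sequence)


if __name__ == "__main__":
    print(round(sequence_probability(("m1", "m3")), 4))
    # The probabilities of all sequences of a fixed length add up to 1.
    total = sum(sequence_probability(seq) for seq in product(MAPPING_PROB, repeat=2))
    print(round(total, 4))
```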

Page 15: Uncertainty in Data Integration Ai Jing 2007-11-10.

By-Table Query Answering

Page 16: Uncertainty in Data Integration Ai Jing 2007-11-10.

By-Tuple Query Answering

Page 18: Uncertainty in Data Integration Ai Jing 2007-11-10.

Complexity of query answering

Page 19: Uncertainty in Data Integration Ai Jing 2007-11-10.

More on By-Tuple Query Answering

The high complexity comes from computing probabilities: the number of mapping sequences is exponential in the size of the input data.
n tuples, m mappings: m^n mapping sequences

There are two subsets of queries that can be answered in PTIME by query rewriting:
SELECT mailing-addr FROM T
SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech

In general, query answering cannot be done by query rewriting.

One of Dt
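The exponential blow-up can be seen in a brute-force Python sketch that enumerates all m^n mapping sequences for the query SELECT mailing-addr FROM T. It is not the PTIME rewriting mentioned on the slide. The source rows are invented, by_tuple_answers is a hypothetical name, and the source attribute that each mapping sends to mailing-addr is assumed from the order of the rewritten queries on the top-k slide.

```python
from collections import defaultdict
from itertools import product
from math import prod

MAPPING_PROB = {"m1": 0.5, "m2": 0.4, "m3": 0.1}
# Assumed: the source attribute that each mapping matches to T.mailing-addr.
MAILING_ADDR_SOURCE = {
    "m1": "current-addr",
    "m2": "permanent-addr",
    "m3": "email-addr",
}

# Hypothetical source tuples of S.
ROWS = [
    {"email-addr": "alice@example.org", "current-addr": "Mountain View",
     "permanent-addr": "Sunnyvale"},
    {"email-addr": "bob@example.org", "current-addr": "Sunnyvale",
     "permanent-addr": "Sunnyvale"},
]


def by_tuple_answers(rows):
    """Enumerate every mapping sequence and add up answer probabilities."""
    names = list(MAPPING_PROB)
    scores = defaultdict(float)
    n_sequences = 0
    for sequence in product(names, repeat=len(rows)):  # m^n iterations
        n_sequences += 1
        p = prod(MAPPING_PROB[name] for name in sequence)
        # Answers to SELECT mailing-addr FROM T under this sequence (as a set).
        answers = {row[MAILING_ADDR_SOURCE[name]] for row, name in zip(rows, sequence)}
        for answer in answers:
            scores[answer] += p
    return dict(scores), n_sequences


if __name__ == "__main__":
    scores, count = by_tuple_answers(ROWS)
    print(count, "sequences")
    for answer, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(answer, round(score, 2))
```

With 2 tuples and 3 mappings the loop runs 9 times; each extra source tuple multiplies that count by 3, which is why this approach does not scale.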

Page 20: Uncertainty in Data Integration Ai Jing 2007-11-10.

Extensions to More Expressive Mappings

The complexity results for query answering carry over to three extensions to more expressive mappings:

Complex mappings
GLAV mappings
Conditional mappings

Page 22: Uncertainty in Data Integration Ai Jing 2007-11-10.

Contributions

Definition of probabilistic mappings

Semantics: by-table vs. by-tuple

Complexity of query answering

Page 24: Uncertainty in Data Integration Ai Jing 2007-11-10.

Overview of MUD 2007

Theory
A New Language and Architecture to Obtain Fuzzy Global Dependencies
About the Processing of Division Queries Addressed to Possibilistic Databases
Making Aggregation Work in Uncertain and Probabilistic Databases

Application
Materialized Views in Probabilistic Databases

Application
Flexible matching of Ear Biometrics
Consistent Joins Under Primary Key Constraints

Page 25: Uncertainty in Data Integration Ai Jing 2007-11-10.

A New Language and Architecture to Obtain Fuzzy Global Dependencies

SQL does not satisfy the minimum requirements to be a true data mining (DM) language

A New Language: dmFSQL (data mining Fuzzy Structured Query Language)

Fuzzy Database Data mining

Page 26: Uncertainty in Data Integration Ai Jing 2007-11-10.

About the Processing of Division Queries Addressed to Possibilistic Databases

The authors devised a data model that is a strong representation system for operations in possibilistic databases

A possibilistic database D can be interpreted as a weighted disjunctive set of regular databases

Division Queries

Page 27: Uncertainty in Data Integration Ai Jing 2007-11-10.

Making Aggregation Work in Uncertain and Probabilistic Databases

Trio is a prototype database management system for storing and querying data with uncertainty and lineage

Trio’s query language: TriQL

Trio data model and query semantics

Aggregation functions in the Trio system for uncertain and probabilistic data

Page 28: Uncertainty in Data Integration Ai Jing 2007-11-10.

Materialized Views in Probabilistic Databases

Materialized views over probabilistic databases may not define a unique probability distribution

View representation

Answer queries on large probabilistic data sets more efficiently with materialized views

Page 29: Uncertainty in Data Integration Ai Jing 2007-11-10.

Flexible matching of Ear Biometrics

Research area: image recognition (or identification)

Scenario: identifying found bodies in a large-scale disaster

Challenge: fast and cheap identification when no DNA databases or fingerprint databases are at hand

Page 30: Uncertainty in Data Integration Ai Jing 2007-11-10.

Consistent Joins Under Primary Key Constraints

Inconsistent databases with respect to primary keys

Will the natural join of the repaired relations always be nonempty, no matter which tuples are selected?

game theory, winning strategy
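The question on this slide can be made concrete with a brute-force Python sketch for two relations. It is not the game-theoretic method the slide alludes to: it simply enumerates every repair (one tuple kept per primary-key value) and checks the natural join each time, which is exponential in the number of conflicting keys. The function names and the example relations are hypothetical.

```python
from itertools import product


def repairs(rows, key):
    """Yield every repair: exactly one tuple kept for each primary-key value."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[attr] for attr in key), []).append(row)
    for choice in product(*groups.values()):
        yield list(choice)


def join_nonempty(left, right):
    """True if the natural join of two lists of dict-rows has at least one tuple."""
    if not left or not right:
        return False
    shared = set(left[0]) & set(right[0])
    return any(all(a[c] == b[c] for c in shared) for a in left for b in right)


def always_nonempty(r_rows, r_key, s_rows, s_key):
    """True if the join is nonempty for every pair of repairs."""
    return all(
        join_nonempty(r, s)
        for r in repairs(r_rows, r_key)
        for s in repairs(s_rows, s_key)
    )


if __name__ == "__main__":
    # R violates its key A: two tuples share A = 1.
    R = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}]
    S = [{"B": "x", "C": 10}, {"B": "y", "C": 20}]
    print(always_nonempty(R, ["A"], S, ["B"]))      # every repair of R joins with S
    print(always_nonempty(R, ["A"], S[:1], ["B"]))  # the repair keeping B = "y" does not
```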

Page 32: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Deep Web

No “perfect” data: noise, dirty data, redundancy, ...

No “perfect” solution: Web data extraction, interface integration, ...

Page 33: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Deep Web Data Integration (1)

[Figure: architecture of a Deep Web data integration system, from the Web databases (Web DB, RDB) of the Deep Web up to the Integrated Interface]
Modules: Interface Integration Module, Query Process Module, Result Process Module
Components: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration, WDB Selection, Query Translation, Query Submission, Results Extraction, Results Annotation, Data Merging

• Robust
• Evaluable

Page 34: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Deep Web Data Integration (2)

[Figure: architecture of a Deep Web data integration system, from the Web databases (Web DB, RDB) of the Deep Web up to the Integrated Interface]
Modules: Interface Integration Module, Query Process Module, Result Process Module
Components: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration, WDB Selection, Query Translation, Query Submission, Results Extraction, Results Annotation, Data Merging

• Tuning
• Feedback
• Evaluable

Page 35: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Jobtong(1)

Data level

Page 36: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Jobtong(2)

Query level

How can we give every result a probability to show its importance?

Page 37: Uncertainty in Data Integration Ai Jing 2007-11-10.

Uncertainty in Jobtong(3)

The automatic maintenance of configuration files

<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a/span</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a/span</xpath>
    </item>
  </items>
</record>

<record>
  <xpath>/html/body//table/tr[@class='list2' or @class='list3']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a</xpath>
    </item>
  </items>
</record>
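To show how such a configuration record might drive extraction, here is a sketch using Python's standard xml.etree.ElementTree. How Jobtong actually interprets these files is not stated on the slide, so the whole reading is an assumption: the record-level xpath selects result rows, each item xpath is evaluated relative to a row, and the combination element is ignored. The sample page and the extract function are hypothetical.

```python
import xml.etree.ElementTree as ET

# First record from the slide.
CONFIG = """<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item><name>title</name><xpath>td[2]/a/span</xpath></item>
    <item><name>company</name><xpath>td[3]/a/span</xpath></item>
  </items>
</record>"""

# Invented, well-formed sample page; real pages usually need an HTML parser first.
PAGE = """<html><body><div><table>
<tr class="nob"><td>1</td><td><a><span>DBA</span></a></td><td><a><span>Acme</span></a></td></tr>
<tr class="hdr"><td>No.</td><td>Title</td><td>Company</td></tr>
<tr class="nob"><td>2</td><td><a><span>Developer</span></a></td><td><a><span>Initech</span></a></td></tr>
</table></div></body></html>"""


def extract(config_xml, page_xml):
    """Apply one <record> configuration to a page; return one dict per row."""
    record = ET.fromstring(config_xml)
    page = ET.fromstring(page_xml)
    row_path = record.findtext("xpath")
    # ElementTree rejects absolute paths, so rebase "/html/..." onto the root element.
    prefix = "/" + page.tag + "/"
    if row_path.startswith(prefix):
        row_path = "./" + row_path[len(prefix):]
    items = [(item.findtext("name"), item.findtext("xpath"))
             for item in record.find("items")]
    return [
        {name: row.findtext(path) for name, path in items}
        for row in page.findall(row_path)
    ]


if __name__ == "__main__":
    for job in extract(CONFIG, PAGE):
        print(job)
```

ElementTree implements only a subset of XPath. The second record's predicate, which combines two class tests with "or", falls outside that subset as far as I know, so a fuller XPath engine would be needed for it.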

Page 38: Uncertainty in Data Integration Ai Jing 2007-11-10.

Q&A

Thank you!