Diego Milano, Monica Scannapieco and Tiziana Catarci Università di Roma “La Sapienza” Dipartimento di Informatica e Sistemistica {milano,monscan,catarci}@dis.uniroma1.it Structure-Aware XML Object Identification

Dec 19, 2015

Diego Milano, Monica Scannapieco and Tiziana Catarci

Università di Roma “La Sapienza”

Dipartimento di Informatica e Sistemistica

{milano,monscan,catarci}@dis.uniroma1.it

Structure-Aware XML Object Identification

Context

Object Identification problem: identifying different data instances that refer to the same real-world entity.

• Complex:

•no shared Identifiers

•errors (e.g. misspellings)

• Well-studied for relational data, but still an open problem in the case of semistructured data.

•Needed for Semantic Data Integration

Outline

•Issues in XML object Identification

•Drawbacks of existing tree similarity measures

•A structure-aware distance for XML data

•Experimental Evaluation

XML Object Identification

•Relational data:

•Flat and fixed structure

•Tuples compared pairwise field by field

•String similarity functions often used to compare fields

•XML data:

•Tree-like and flexible structure

•Optional data

•Unbounded length lists

•Structural correspondence more difficult
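The field-by-field comparison used for relational tuples can be sketched as follows. This is only an illustration: the record layout, the field names, and the choice of Python's `difflib.SequenceMatcher` as the string similarity function are assumptions, not taken from the slides.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; SequenceMatcher is a stand-in choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_similarity(t1: dict, t2: dict) -> float:
    """Average per-field similarity of two tuples with the same flat schema."""
    fields = sorted(t1)
    if fields != sorted(t2):
        raise ValueError("relational tuples are expected to share one schema")
    return sum(field_similarity(t1[f], t2[f]) for f in fields) / len(fields)


# Hypothetical records: r1 and r2 describe the same person with a misspelling.
r1 = {"name": "John Smith", "city": "Rome"}
r2 = {"name": "Jon Smith", "city": "Rome"}
r3 = {"name": "Mary Jones", "city": "Oslo"}

print(tuple_similarity(r1, r2))  # close to 1
print(tuple_similarity(r1, r3))  # much lower
```

With XML the set of fields is not fixed (optional elements, unbounded lists), so the pairing of values that this sketch gets for free from the schema has to be derived from the tree structure.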

Contribution

We propose a new distance for XML data, the structure-aware XML distance:

•Structure aware

•data comparison driven by tree structure

•Tailored to XML Object Identification

•avoid issues arising when using existing tree similarity measures for Object Identification

Tree-edit distance

Measures cost of making a tree isomorphic to another one by node insertions, deletions and relabellings

A cost defined for each operation

Distance = cost of a minimal-cost sequence of operations

Works well when only tree structure is important, and labels do not have semantics.

In XML, data is present on leaves as text, and the structure partially describes its meaning.

Issues

The XML model is ordered, but Object Identification requires unordered comparisons:

<!ELEMENT contact (phonenum?, email?)*>

Schema languages constrain only structural order, not data order.

But computing the tree-edit distance is NP-complete for unordered trees.

Note: other edit-based distances, such as the Alignment Distance, are also NP-complete in the unordered case.
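A small sketch of why order matters for identification: two `contact` elements carry the same data in a different child order. The element values are made up for illustration, and comparing children as sorted (tag, text) pairs is just one simple way to ignore order, not the method of the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical instances of the same contact; only the child order differs.
a = ET.fromstring(
    "<contact><phonenum>555-0100</phonenum><email>a@example.org</email></contact>"
)
b = ET.fromstring(
    "<contact><email>a@example.org</email><phonenum>555-0100</phonenum></contact>"
)


def ordered_children(elem):
    """Children as (tag, text) pairs in document order."""
    return [(child.tag, child.text) for child in elem]


def unordered_children(elem):
    """Children as (tag, text) pairs with document order discarded."""
    return sorted(ordered_children(elem))


print(ordered_children(a) == ordered_children(b))      # order-sensitive comparison
print(unordered_children(a) == unordered_children(b))  # order-insensitive comparison
```

An ordered comparison treats the two elements as different even though they describe the same contact, which is the behaviour Object Identification has to avoid.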

Examples

[Figure: example trees with element labels movie, title, year, director, actor, actress, awards, dog, owner and name, and text values “Lassie”, “1994”, “D. Petrie”, “T. Guiry”, “H. Slater” and “Oscar”.]

•Compares topology, not data:

•differences in optional elements influence identification

•difficult to define a cost model that preserves the semantics of labels

Outline

•Issues in XML object Identification

•Drawbacks of existing tree similarity measures

•A structure-aware distance for XML data

•Experimental Evaluation

Structure Aware XML Distance

• Preserves element-label semantics

• Ignores differences due to optional data

• Polynomial even for unordered trees

Overlays

An overlay O of two trees T1 and T2 is a subset of T1 × T2 s.t. for nodes vi, vi’ and inner nodes ni in Ti:

1. one-to-one:

if (v1, v2), (v1’, v2’) ∈ O then v1 = v1’ iff v2 = v2’

2. same-path:

if (v1, v2) ∈ O then path(v1) = path(v2)

3. to-leaves:

(n1, n2) ∈ O iff ∃ (v1, v2) ∈ O s.t. n1 = parent(v1) ∧ n2 = parent(v2)

We are interested in maximal overlays: ∄ O’ s.t. O ⊂ O’
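The three overlay conditions can be checked mechanically. The sketch below is one reading of the definition, under stated assumptions: the `Node` record, the node identifiers, and the representation of an overlay as a set of identifier pairs are all hypothetical, and maximality is not checked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    path: str               # label path from the root, e.g. "/A/C/F"
    parent: Optional[str]   # identifier of the parent node, None for the root
    is_leaf: bool


def is_overlay(pairs, nodes1, nodes2):
    """pairs: set of (id1, id2); nodes1, nodes2: dict from identifier to Node."""
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    # 1. one-to-one: no node of either tree occurs in two different pairs.
    if len(set(left)) != len(left) or len(set(right)) != len(right):
        return False
    # 2. same-path: matched nodes are reached through the same label path.
    if any(nodes1[a].path != nodes2[b].path for a, b in pairs):
        return False
    # 3. to-leaves: a pair of inner nodes is in O exactly when it is the
    #    pair of parents of some matched pair.
    parent_pairs = {
        (nodes1[a].parent, nodes2[b].parent)
        for a, b in pairs
        if nodes1[a].parent is not None and nodes2[b].parent is not None
    }
    inner_pairs = {
        (a, b) for a, b in pairs
        if not nodes1[a].is_leaf and not nodes2[b].is_leaf
    }
    return inner_pairs == parent_pairs


# Two tiny hypothetical trees, both of shape A / C / F with F a leaf.
t1 = {"a1": Node("/A", None, False), "c1": Node("/A/C", "a1", False),
      "f1": Node("/A/C/F", "c1", True)}
t2 = {"a2": Node("/A", None, False), "c2": Node("/A/C", "a2", False),
      "f2": Node("/A/C/F", "c2", True)}

full = {("a1", "a2"), ("c1", "c2"), ("f1", "f2")}
print(is_overlay(full, t1, t2))            # matched leaves with all ancestors
print(is_overlay({("f1", "f2")}, t1, t2))  # leaves matched, ancestors missing
```

The second call fails condition 3: the leaves are matched but their parents are not in the set, so the set is not an overlay.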

Example

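[Figure: two example trees, T1 and T2, each rooted at A with children labelled C and D. Other element labels in the figure: F, G, H and K. Text values in the figure: “jon”, “john”, “mary”, “lise”, “lisa”, “karl” and “tom”.]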

Distance

The cost of matching two nodes is zero if they are inner nodes; if they are leaves, it is the value of a string-distance measure on their textual values.

Any string-distance measure can be used. We use the string-edit distance.

Cost of an overlay O = sum of costs of all matches in O

An optimal overlay has minimal cost among all maximal overlays.

The structure aware XML distance of two XML trees T1 and T2 is the cost of any optimal overlay of T1 and T2.

Example

Distance calculated as: sdist(“john”, “jan”) + sdist(“lisa”, “lisa”) = 2

sdist(“john”, “jona”) + sdist(“karl”, “karl”) + sdist(“mary”, “tom”) = 10

[Figure: trees T1 and T2, each rooted at A with children labelled C and D. Other element labels: F and K. Text values: “john”, “jan”, “mary”, “lisa” and “Karl”.]
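The cost of an overlay is a sum of string distances over its matched leaves, with inner nodes contributing zero. The sketch below works through that arithmetic for the pairs “john”/“jan” and “lisa”/“lisa”. The function names `sdist` and `overlay_cost` and the list-of-pairs input are assumptions made for the example, and the string-edit distance is taken here to be the unit-cost Levenshtein distance.

```python
def sdist(s: str, t: str) -> int:
    """Unit-cost string-edit (Levenshtein) distance, row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def overlay_cost(leaf_matches) -> int:
    """Sum of string distances over matched leaf values.

    Matches between inner nodes cost zero, so only leaf pairs are listed.
    """
    return sum(sdist(a, b) for a, b in leaf_matches)


print(sdist("john", "jan"))                                   # 2
print(overlay_cost([("john", "jan"), ("lisa", "lisa")]))      # 2
```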

Computing Overlays

[Figure: “Best Assignment”, shown on the trees T1 and T2 of the preceding example.]

Complexity

Computes overlays bottom-up

For each pair of nodes, solves a minimum-weight bipartite matching problem using a variant of the Munkres algorithm.

The cost is bounded by O(|T1| × |T2| × (deg1 + deg2)³)

Since only nodes with the same label are matched, average performance is better than this bound.
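The matching step solved for each pair of nodes can be illustrated on a small cost matrix. The sketch below deliberately swaps in exhaustive search over assignments for the Munkres algorithm: it returns the same optimum but takes exponential time, so it is suitable only for tiny examples. The function name and the matrix values are hypothetical, and leaving surplus children unmatched at no cost is an assumption of this sketch.

```python
from itertools import permutations


def best_assignment(cost):
    """Minimum-cost matching between two groups of same-label children.

    cost[i][j] is the cost of matching child i of one node with child j of
    the other. Every child of the smaller group is matched; surplus children
    of the larger group stay unmatched and add no cost.
    Returns (total_cost, sorted list of (i, j) pairs).
    """
    if not cost or not cost[0]:
        return 0, []
    rows, cols = len(cost), len(cost[0])
    if rows > cols:
        # Solve the transposed problem, then swap the indices back.
        total, pairs = best_assignment([list(col) for col in zip(*cost)])
        return total, sorted((i, j) for j, i in pairs)
    best_total, best_pairs = None, None
    for perm in permutations(range(cols), rows):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if best_total is None or total < best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_total, best_pairs


# Hypothetical costs: 2 children on one side, 3 on the other.
costs = [
    [2, 4, 4],
    [4, 0, 4],
]
print(best_assignment(costs))  # (2, [(0, 0), (1, 1)])
```

Replacing the loop over permutations with a cubic-time assignment algorithm such as Munkres gives the polynomial bound stated above.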

Outline

•Issues in XML object Identification

•Drawbacks of existing tree similarity measures

•A structure-aware distance for XML data

•Experimental Evaluation

Evaluation (cont’d)

[Figure: F-score (y-axis, 0.0 to 1.0) against % differences (x-axis, 0% to 100%), with one series each for deletion, textchange and swap.]

Evaluation (cont’d)

[Figure: precision, recall and F-score (y-axis, 0.0 to 1.0) against % of data changes (x-axis, 1 to 8), for the XML distance and for the edit distance.]

Conclusions

• XML Object Identification requires comparing tree-like, flexibly structured data

• We have proposed a structure-aware distance for XML data

• More satisfactory than existing tree similarity measures:

•Respects XML structure

•Efficient even on unordered data

Ongoing/Future Work

•Extension to XML data with structural differences (Almost done!)

•More efficient algorithm(s)?