Diego Milano, Monica Scannapieco and Tiziana Catarci Università di Roma “La Sapienza” Dipartimento di Informatica e Sistemistica {milano,monscan,catarci}@dis.uniroma1.it.

Structure-Aware XML Object Identification

Context

Object Identification problem: identifying different data instances that refer to the same real-world entity.

• Complex:

•no shared identifiers

•errors (e.g. misspellings)

• Well-studied for relational data, but still an open problem in the case of semistructured data.

•Needed for Semantic Data Integration

Outline

•Issues in XML object Identification

•Drawbacks of existing tree similarity measures

•A structure-aware distance for XML data

•Experimental Evaluation

XML Object Identification

•Relational data:

•Flat and fixed structure

•Tuples compared pairwise field by field

•String similarity functions often used to compare fields

•XML data:

•Tree-like and flexible structure

•Optional data

•Unbounded length lists

•Structural correspondence more difficult
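For the relational case, the field-by-field comparison can be sketched as follows. This is only an illustration: the records are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string similarity function a real system would use.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_similarity(t1, t2) -> float:
    """Average field-by-field similarity of two non-empty tuples with the same schema."""
    if len(t1) != len(t2):
        raise ValueError("tuples must have the same number of fields")
    return sum(field_similarity(a, b) for a, b in zip(t1, t2)) / len(t1)


# Hypothetical records: probably the same person, with a misspelling in each field.
r1 = ("John Smith", "Rome")
r2 = ("Jon Smith", "Roma")
score = tuple_similarity(r1, r2)
```

The fixed schema is what makes `zip` sufficient here. With XML data, optional elements and lists of unbounded length mean there is no such positional correspondence.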

Contribution

We propose a new distance for XML data, the structure-aware XML distance:

•Structure aware

•data comparison driven by tree structure

•Tailored to XML Object Identification

•avoid issues arising when using existing tree similarity measures for Object Identification

Tree-edit distance

Measures cost of making a tree isomorphic to another one by node insertions, deletions and relabellings

A cost defined for each operation

Distance = cost of a minimal-cost sequence of operations

Works well when only tree structure is important, and labels do not have semantics.

In XML, data is present on leaves as text, and the structure partially describes its meaning.

Issues

XML model is ordered, but Object Identification requires unordered comparisons:

<!ELEMENT contact (phonenum?, email?)*>

Schema languages constrain only structural order, not data order.

But computing the tree-edit distance is NP-complete for unordered trees.

Note: other edit-based distances, such as the Alignment Distance, are also NP-complete in the unordered case.
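A small sketch of why order matters: the two `contact` elements below (hypothetical data) describe the same object, but an ordered comparison of their children fails while an unordered one succeeds. The comparison uses exact equality only to isolate the ordering issue; it is not the distance proposed in these slides.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical instances of the same contact, with children in different order.
a = ET.fromstring(
    "<contact><phonenum>555-0100</phonenum><email>a@example.org</email></contact>"
)
b = ET.fromstring(
    "<contact><email>a@example.org</email><phonenum>555-0100</phonenum></contact>"
)


def children(elem):
    """List of (tag, text) pairs for the direct children, in document order."""
    return [(c.tag, (c.text or "").strip()) for c in elem]


ordered_equal = children(a) == children(b)                     # position by position
unordered_equal = Counter(children(a)) == Counter(children(b))  # as multisets
```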

Examples

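[Figure: example trees, including movie trees (elements title, year, director, actor, actress, awards) and a dog tree (elements owner, name), whose leaves share values such as “Lassie” and “Oscar”; other leaf values are “1994”, “D. Petrie”, “T. Guiry” and “H. Slater”.]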

•Compares topology, not data:

•differences in optional elements influence identification

•difficult to define a cost model that preserves the semantics of labels

Structure Aware XML Distance

• Preserves element-label semantics

• Ignores differences due to optional data

• Polynomial even for unordered trees

Overlays

An overlay $O$ of two trees $T_1$ and $T_2$ is a subset of $T_1 \times T_2$ s.t. for nodes $v_i, v_i'$ and inner nodes $n_i$ in $T_i$:

1. one-to-one:

if $(v_1, v_2), (v_1', v_2') \in O$ then $v_1 = v_1'$ iff $v_2 = v_2'$

2. same-path:

if $(v_1, v_2) \in O$ then $path(v_1) = path(v_2)$

3. to-leaves:

$(n_1, n_2) \in O$ iff $\exists (v_1, v_2) \in O$ s.t. $n_1 = parent(v_1) \land n_2 = parent(v_2)$

We are interested in maximal overlays: $\nexists O'$ s.t. $O \subset O'$
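The three conditions can be checked mechanically. The sketch below is one possible reading of them, not the authors' code: each tree is assumed to be a dictionary mapping a node id to a `(label, parent_id)` pair, and all names are hypothetical.

```python
def path(tree, node):
    """Sequence of labels from the root down to `node`."""
    labels = []
    while node is not None:
        label, parent = tree[node]
        labels.append(label)
        node = parent
    return tuple(reversed(labels))


def inner_nodes(tree):
    """Nodes that are the parent of at least one other node."""
    return {parent for _, parent in tree.values() if parent is not None}


def is_overlay(pairs, t1, t2):
    pairs = set(pairs)
    # 1. one-to-one: no node of either tree appears in two pairs.
    if len({a for a, _ in pairs}) != len(pairs):
        return False
    if len({b for _, b in pairs}) != len(pairs):
        return False
    # 2. same-path: matched nodes are reached through the same labels.
    if any(path(t1, a) != path(t2, b) for a, b in pairs):
        return False
    # 3. to-leaves: inner nodes are matched exactly when a pair of their
    #    children is matched.
    inner1, inner2 = inner_nodes(t1), inner_nodes(t2)
    matched_inner = {(a, b) for a, b in pairs if a in inner1 and b in inner2}
    parents_of_matches = {
        (t1[a][1], t2[b][1]) for a, b in pairs if t1[a][1] is not None
    }
    return matched_inner == parents_of_matches


# Hypothetical trees: A -> C -> F, F   and   A -> C -> F
t1 = {"a": ("A", None), "c": ("C", "a"), "f1": ("F", "c"), "f2": ("F", "c")}
t2 = {"a": ("A", None), "c": ("C", "a"), "f": ("F", "c")}

ok = is_overlay({("a", "a"), ("c", "c"), ("f1", "f")}, t1, t2)
two_to_one = is_overlay({("a", "a"), ("c", "c"), ("f1", "f"), ("f2", "f")}, t1, t2)
no_leaves = is_overlay({("a", "a"), ("c", "c")}, t1, t2)
```

Here only the first set of pairs is an overlay: the second maps two nodes of T1 to the same node of T2, and the third matches the inner nodes labelled C without matching any of their children.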

Example

[Figure: two example trees T1 and T2 with node labels A, C, D, F, G, H and K and leaf values “jon”, “john”, “mary”, “lise”, “lisa”, “karl” and “tom”.]

Distance

The cost of matching two nodes is zero if they are inner nodes, and equal to the value of a string-similarity measure on their textual values if they are leaves.

Any string similarity measure can be used. We use the string-edit distance.

Cost of an overlay O = sum of costs of all matches in O

An optimal overlay has minimal cost among all maximal overlays.

The structure aware XML distance of two XML trees T1 and T2 is the cost of any optimal overlay of T1 and T2.
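As a minimal sketch of the cost computation, assuming the unit-cost string-edit (Levenshtein) distance as `sdist`: only matched leaves contribute, so the cost of an overlay reduces to a sum over its leaf pairs. The helper `overlay_cost` and its input format are hypothetical.

```python
def sdist(a: str, b: str) -> int:
    """Unit-cost string-edit (Levenshtein) distance between a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def overlay_cost(leaf_matches):
    """Cost of an overlay, given the text values of its matched leaf pairs.

    Matches between inner nodes cost zero, so they are left out.
    """
    return sum(sdist(x, y) for x, y in leaf_matches)


# Leaf matches of a hypothetical overlay: one misspelled value, one exact value.
cost = overlay_cost([("john", "jan"), ("lisa", "lisa")])
```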

Example

Distance calculated as: sdist(“john”, “jan”) + sdist(“lisa”, “lisa”) = 2

sdist(“john”, “jona”) + sdist(“karl”, “karl”) + sdist(“mary”, “tom”) = 10

[Figure: the two trees T1 and T2 being compared, with node labels A, C, D, F and K and leaf values “john”, “mary”, “lisa”, “jan” and “Karl”.]

Computing Overlays

[Figure: best assignment between the nodes of the example trees T1 and T2 (node labels A, C, D, F, K; leaf values “john”, “mary”, “lisa”, “jan”, “Karl”).]

Complexity

Computes overlays bottom-up

For each pair of nodes, solves a minimum weight bipartite matching problem using a variant of the Munkres algorithm.

The cost is bounded by $O(|T_1| \times |T_2| \times (deg_1 + deg_2)^3)$

Since only nodes with the same label are matched, average performance is better.
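The bottom-up computation can be sketched as below. This is a simplified illustration rather than the authors' algorithm: it assumes that any two nodes with the same label can be matched, matches as many same-label children as possible, and replaces the Munkres variant with a brute-force search over assignments, which is only practical for very small node degrees. The `Node` class and the tree shapes, which are one reading of the example figure, are assumptions.

```python
from collections import defaultdict
from itertools import permutations


class Node:
    def __init__(self, label, text="", children=()):
        self.label, self.text, self.children = label, text, list(children)


def sdist(a, b):
    """Unit-cost string-edit (Levenshtein) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_assignment(cost):
    """Minimum total cost of assigning every row to a distinct column.

    Brute force stand-in for a minimum weight bipartite matching algorithm;
    expects a matrix with at least as many columns as rows.
    """
    rows, cols = len(cost), len(cost[0])
    return min(
        sum(cost[i][p[i]] for i in range(rows))
        for p in permutations(range(cols), rows)
    )


def group_by_label(nodes):
    groups = defaultdict(list)
    for n in nodes:
        groups[n.label].append(n)
    return groups


def xml_distance(n1, n2):
    """Distance between two nodes with the same label, computed bottom-up."""
    if not n1.children and not n2.children:
        return sdist(n1.text, n2.text)  # leaves: compare text values
    g1, g2 = group_by_label(n1.children), group_by_label(n2.children)
    total = 0  # inner nodes themselves cost nothing
    for label in g1.keys() & g2.keys():  # only same-label children are matched
        xs, ys = sorted((g1[label], g2[label]), key=len)  # shorter list first
        cost = [[xml_distance(x, y) for y in ys] for x in xs]
        total += best_assignment(cost)
    return total


# Assumed shapes for the example trees T1 and T2.
t1 = Node("A", children=[
    Node("C", children=[Node("F", "john"), Node("F", "mary")]),
    Node("D", children=[Node("F", "lisa")]),
])
t2 = Node("A", children=[
    Node("C", children=[Node("K", "Karl"), Node("F", "jan")]),
    Node("D", children=[Node("F", "mary"), Node("F", "lisa")]),
    Node("K", "Karl"),
])
distance = xml_distance(t1, t2)
```

Under these assumptions the sketch returns 2 for the two trees, which agrees with the first sum on the Example slide. Children whose label has no counterpart in the other tree, such as the K elements, add nothing to the distance, which is how differences due to optional data are ignored.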

Evaluation (cont’d)

[Figure: F-score (0.0 to 1.0) as a function of % differences (0% to 100%), with one curve each for deletion, textchange and swap.]

Evaluation (cont’d)

[Figure: precision, recall and F-score (0.0 to 1.0) of the XML distance and of the edit distance as a function of % of data changes (1 to 8).]
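For reference, precision, recall and F-score are assumed here to have their standard definitions over the set of object pairs declared to be matches. A minimal sketch with hypothetical pair sets:

```python
def precision_recall_fscore(predicted, actual):
    """Standard precision, recall and F-score for sets of matched pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Hypothetical result: 3 pairs reported as matches, 4 true matches, 2 in common.
predicted = {(1, 2), (3, 4), (5, 6)}
actual = {(1, 2), (3, 4), (7, 8), (9, 10)}
p, r, f = precision_recall_fscore(predicted, actual)
```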

Conclusions

• XML Object Identification requires comparing tree-like, flexibly structured data

• We have proposed a structure-aware distance for XML data

• More satisfactory than existing tree similarity measures:

•Respects XML structure

•Efficient even on unordered data

Ongoing/Future Work

•Extension to XML data with structural differences (Almost done!)

•More efficient algorithm(s)?
