Change Detection in XML Documents using Semantic Identifiers, by KAILAASH BALACHANDRAN

Schemaless Change detection in XML Documents using Semantic Identifiers

Jun 21, 2015


Change detection is the process of comparing successive versions of a document to identify what has changed. The success of XML as the standard for data exchange has paved the way for a number of change detection techniques that focus on structural changes rather than on semantics. Existing structural change detection mechanisms tend to break down when the changes are large. This paper discusses a schemaless, semantics-based framework that associates semantic identifiers with elements in successive versions, making it possible to associate elements efficiently even when the structural change is significant.
Transcript
Page 1: Schemaless Change detection in XML Documents using Semantic Identifiers

Change Detection in XML Documents

using Semantic Identifiers

BY

KAILAASH BALACHANDRAN

1

Page 2: Schemaless Change detection in XML Documents using Semantic Identifiers

Outline

Motivation

Introduction

The Approach

• Identifiers

• 2-step Algorithm

• Axioms

Semantic Change Detection

• Finding Identifiers

• Matching Nodes

Examples

Conclusion

2

Page 3: Schemaless Change detection in XML Documents using Semantic Identifiers

Motivation (1)

Fig.1. Version 1

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>

Fig.2. Version 2

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <salesprice>$35</salesprice>
    <isbn>0385504209</isbn>
  </book>
  <book>
    <title>Angels &amp; Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>

3


Page 5: Schemaless Change detection in XML Documents using Semantic Identifiers

Motivation (2)

Fig.1. Version 1

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>

Fig.3. Version 3

<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $56</price>
  </book>
</publisher>

5


Page 7: Schemaless Change detection in XML Documents using Semantic Identifiers

Motivation(3)

Disadvantages of the structural detection approach:

It is difficult to associate elements in different versions.

It breaks down when the changes are significant.

It affects incremental evaluation.

The cost of changing data is high.

7

Page 8: Schemaless Change detection in XML Documents using Semantic Identifiers

Introduction

What is semantics-based change detection?

A process of identifying changes between successive versions of a document based on its semantics, rather than on its structure.

The Approach:

1. Find a semantic identifier for each node in the XML model.

2. Use these identifiers to associate nodes across multiple versions.

8

Page 9: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifiers

Type: the list of labels on the path from the root to the element, separated by ‘/’.

Identifier: serves to distinguish elements of the same type.
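As a rough illustration of the type definition, the sketch below computes the type of every element in a small document. The function name `node_types` and the sample document are assumptions made for this example; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

def node_types(root):
    """Map every element to its type: the '/'-joined labels from the root."""
    types = {}

    def walk(elem, prefix):
        # Extend the parent's path with this element's label.
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        types[elem] = path
        for child in elem:
            walk(child, path)

    walk(root, "")
    return types

# Hypothetical sample document, modeled on Version 1 of the slides.
doc = ET.fromstring(
    "<author><name>Dan Brown</name>"
    "<book><title>The Da Vinci Code</title></book>"
    "<book><title>Angels and Demons</title></book></author>"
)
types = node_types(doc)
book_types = [types[b] for b in doc.findall("book")]
title_type = types[doc.find("book/title")]
print(book_types, title_type)
```

Both books share the type author/book, which is why a type alone cannot tell them apart and an identifier is needed.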

Two nodes x and y are semantically the same if and only if their identifiers evaluate to the same result:

Eval(x, L) = Eval(y, L)

where
• x and y are the nodes,
• L = {E1, E2, …, En} is the list of expressions.
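A minimal sketch of Eval follows, using the bibliography example that appears later in the slides. It assumes each expression can be encoded as a list of steps, where ".." moves to the parent and any other step selects children with that label. That encoding, the `evaluate` function, and the choice of ["..", "name"] to mean "the name of the enclosing author" are assumptions of this sketch, not the author's implementation.

```python
import xml.etree.ElementTree as ET

def evaluate(node, identifier, parent_of):
    """Return, for each expression, the text values reached from `node`."""
    values = []
    for steps in identifier:
        current = [node]
        for step in steps:
            if step == "..":
                # Move up; the root has no parent and simply drops out.
                current = [parent_of[n] for n in current if n in parent_of]
            else:
                # Move down to children carrying the requested label.
                current = [c for n in current for c in n if c.tag == step]
        values.append(tuple(sorted((n.text or "").strip() for n in current)))
    return tuple(values)

V1 = """<bib>
<author><name>n1</name>
<book><title>t1</title><publisher>p1</publisher></book></author>
<author><name>n2</name>
<book><title>t2</title><publisher>p2</publisher></book>
<book><title>t1</title><publisher>p1</publisher></book></author>
</bib>"""

root = ET.fromstring(V1)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Assumed identifier for <book>: (author's name, book's title).
BOOK_ID = [["..", "name"], ["title"]]
books = list(root.iter("book"))
values = [evaluate(b, BOOK_ID, parent_of) for b in books]
print(values)
```

The three books evaluate to three different results, so under this identifier no two of them are semantically the same, even though two of them share the title t1.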

9

Page 10: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifiers

Local Identifier: An identifier is local if it evaluates to descendants of the context node, otherwise it is non-local.

Version 1:

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>

Version 3:

<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $56</price>
  </book>
</publisher>

10


<name> is local

<name> is non-local

11

Page 12: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifying Nodes Based on Their Semantics

The Algorithm

Phase 1:

Runs in a bottom-up fashion.

Identifies all local identifiers.

Identifies semantically different nodes.

Phase 2:

Runs recursively and identifies non-local identifiers.

Finds all semantically distinct nodes.

Any remaining node is a redundant copy of another node in the document.

12

Page 13: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifying Nodes Based on Their Semantics (Phase 1)

<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>

Semantically different.

Axiom 1: Nodes that are structurally different are semantically different.

13

Page 14: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifying Nodes Based on Their Semantics (Phase 1)


Are they semantically the same?

Axiom 1: Nodes that are structurally different are semantically different.

14

Page 15: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifying Nodes Based on Their Semantics (Phase 2)


No, because they are in the context of two different books.

Axiom 2: Nodes that are structurally identical are semantically identical if and only if their respective parents are semantically identical or they are both root nodes.
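The two axioms can be sketched as a short recursive check. This is only an illustration under stated assumptions: "structurally identical" is taken to mean the same label, text and children in the same order, and the `<catalog>` wrapper element is hypothetical, added so that the two publishers form one well-formed document. The names `shape` and `same_semantics` are likewise made up for this example.

```python
import xml.etree.ElementTree as ET

def shape(elem):
    """Canonical form of a subtree: label, own text, and children's shapes."""
    return (elem.tag, (elem.text or "").strip(), tuple(shape(c) for c in elem))

def same_semantics(x, y, parent_of):
    # Axiom 1: structurally different nodes are semantically different.
    if shape(x) != shape(y):
        return False
    px, py = parent_of.get(x), parent_of.get(y)
    # Axiom 2: both nodes are roots, or their parents are semantically identical.
    if px is None or py is None:
        return px is None and py is None
    return same_semantics(px, py, parent_of)

# Hypothetical <catalog> root wrapping the two publishers from the slide.
DOC = """<catalog>
<publisher>Doubleday<book><title>The Da Vinci Code</title>
<author><name>Dan Brown</name></author></book></publisher>
<publisher>Pocket Star<book><title>Angels and Demons</title>
<author><name>Dan Brown</name></author></book></publisher>
</catalog>"""

root = ET.fromstring(DOC)
parent_of = {child: parent for parent in root.iter() for child in parent}
first_author, second_author = list(root.iter("author"))

structurally_equal = shape(first_author) == shape(second_author)
semantically_equal = same_semantics(first_author, second_author, parent_of)
print(structurally_equal, semantically_equal)
```

The two Dan Brown author nodes have the same shape, but their parent books differ in title, so the check reports them as semantically different, matching the answer on the slide.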

15

Page 16: Schemaless Change detection in XML Documents using Semantic Identifiers

Semantic Change Detection

How to handle structural changes?

Assumption: Identifying information will remain nearby.

[Figure: two small trees. Version 1 contains nodes X, Y and Z; Version 2 contains nodes X, Y, A and Z.]

16

Page 17: Schemaless Change detection in XML Documents using Semantic Identifiers

Semantic Change Detection

Type Territory: The territory of a type T is the set of all text nodes that are descendants of the least common ancestor (lca) of all of the type T nodes.

Within the type territory is the territory controlled by individual nodes of that type.

Node Territory: The territory of a type T node p is the type territory of T excluding all text nodes that are descendants of other type T nodes.
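The two territory definitions can be tried out on the bibliography example from the later slides. The sketch below makes some simplifying assumptions: a text node is represented by the element that carries the text, text that follows a child element is ignored, and the helper names (`ancestors`, `lca`, `text_nodes`, `node_territory`) are invented for this example.

```python
import xml.etree.ElementTree as ET

def ancestors(node, parent_of):
    """Path from the document root down to `node`, inclusive."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]

def lca(nodes, parent_of):
    """Least common ancestor: deepest element shared by all root paths."""
    common = None
    for level in zip(*(ancestors(n, parent_of) for n in nodes)):
        if any(n is not level[0] for n in level):
            break
        common = level[0]
    return common

def text_nodes(elem):
    """Elements at or below `elem` that carry non-blank text."""
    return [e for e in elem.iter() if (e.text or "").strip()]

def node_territory(p, same_type, parent_of):
    """Type territory minus text owned by the other nodes of the same type."""
    type_territory = text_nodes(lca(same_type, parent_of))
    excluded = {e for q in same_type if q is not p for e in text_nodes(q)}
    return [e for e in type_territory if e not in excluded]

V1 = """<bib>
<author><name>n1</name>
<book><title>t1</title><publisher>p1</publisher></book></author>
<author><name>n2</name>
<book><title>t2</title><publisher>p2</publisher></book>
<book><title>t1</title><publisher>p1</publisher></book></author>
</bib>"""

root = ET.fromstring(V1)
parent_of = {child: parent for parent in root.iter() for child in parent}
books = list(root.iter("book"))  # all three share the type bib/author/book

type_territory = [e.text for e in text_nodes(lca(books, parent_of))]
first_book_territory = [e.text for e in node_territory(books[0], books, parent_of)]
print(type_territory)
print(first_book_territory)
```

Here the least common ancestor of the three books is bib, so the type territory covers every text value in the document, while the first book's node territory keeps both author names but drops the text belonging to the other two books.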

17

Page 18: Schemaless Change detection in XML Documents using Semantic Identifiers

Node and Type Territory

[Figure: a document tree showing the document root, lca(p), and the type p nodes p1, p2 and p3. Regions mark the node territory of p1, the node territory of p2, and the type territory of p.]

18

Page 19: Schemaless Change detection in XML Documents using Semantic Identifiers

Finding Identifiers

Version 1:

<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <title>t2</title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>

Version 2:

<bib>
  <pub> p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub> p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>

19

Page 20: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifiers


Node    Identifier
book    (../author/name/text(), title/text())

20

Page 21: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifiers

Values of identifiers for <book> in Version 1:

First book (under author n1): n1, t1

Second book (under author n2): n2, t2

Third book (under author n2): n2, t1

21


Page 23: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifiers

Values of identifiers for <book> in Version 2:

<bib>
  <pub> p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub> p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>

First book (under pub p1): p1, t1

Second book (under pub p2): p2, t2

23

Page 24: Schemaless Change detection in XML Documents using Semantic Identifiers

Identifiers

Values of identifiers for <book> in both versions:

Version 1
Node             Identifier
book (top)       n1, t1
book (middle)    n2, t2
book (bottom)    n2, t1

Version 2
Node               Identifier
book 1 (top)       p1, t1
book 2 (bottom)    p2, t2

How to map both?

24

Page 25: Schemaless Change detection in XML Documents using Semantic Identifiers

Matching

Admits: q admits p if and only if q is in the node territory of p.

Nodes p and q are matched if and only if p and q admit each other.

Consider nodes p and q that reside in different versions Vp and Vq.
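The slide states the admits relation very tersely, so the sketch below adopts one possible reading, which is an assumption rather than the author's definition: q admits p when every identifier value of p occurs among the text values in the node territory of q. The identifier values are the ones listed on the earlier slides; the territory sets were worked out by hand from the territory definitions and are part of that assumption. The `Node` type and function names are hypothetical.

```python
from typing import FrozenSet, NamedTuple

class Node(NamedTuple):
    label: str
    id_values: FrozenSet[str]   # values the node's identifier evaluates to
    territory: FrozenSet[str]   # text values in the node's territory

def admits(q: Node, p: Node) -> bool:
    """Assumed reading: all of p's identifier values occur in q's territory."""
    return p.id_values <= q.territory

def matched(p: Node, q: Node) -> bool:
    """Nodes are matched if and only if they admit each other."""
    return admits(q, p) and admits(p, q)

def node(label, ids, territory):
    return Node(label, frozenset(ids), frozenset(territory))

# <book> nodes of Version 1 (hand-derived territories, an assumption).
v1_books = [
    node("book (top)", ["n1", "t1"], ["n1", "t1", "p1", "n2"]),
    node("book (middle)", ["n2", "t2"], ["n1", "n2", "t2", "p2"]),
    node("book (bottom)", ["n2", "t1"], ["n1", "n2", "t1", "p1"]),
]
# <book> nodes of Version 2 (hand-derived territories, an assumption).
v2_books = [
    node("book 1", ["p1", "t1"], ["p1", "t1", "n1", "p2"]),
    node("book 2", ["p2", "t2"], ["p1", "p2", "t2", "n2"]),
]

matches = [(p.label, q.label) for p in v1_books for q in v2_books if matched(p, q)]
print(matches)
```

Under this reading the top and middle books of Version 1 pair up with book 1 and book 2 of Version 2, and the bottom book of Version 1 has no partner.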

[Figure: node q in Vq and node p in Vp, each shown with the values q1, q2, …, qn]

25

Page 26: Schemaless Change detection in XML Documents using Semantic Identifiers

Semantic Change Detection

[Figure: Version 1 and Version 2 drawn as trees. Version 1 is rooted at bib with author, name, book, title and pub nodes; Version 2 is rooted at bib with pub, book, title, author and name nodes. The leaves carry the values n1, n2, t1, t2, p1 and p2.]

Book matches:

26

Page 27: Schemaless Change detection in XML Documents using Semantic Identifiers

Semantic Change Detection

[Figure: the Version 1 and Version 2 trees, both rooted at bib, with an "admits" annotation]

Book matches:

27

Page 28: Schemaless Change detection in XML Documents using Semantic Identifiers

Semantic Change Detection

[Figure: the Version 1 and Version 2 trees, both rooted at bib, with a "Node match" annotation]

Book matches:

28


Page 30: Schemaless Change detection in XML Documents using Semantic Identifiers

Semantic Change Detection

[Figure: the Version 1 and Version 2 trees, both rooted at bib]

Author matches:

30

Page 31: Schemaless Change detection in XML Documents using Semantic Identifiers

Conclusion

Semantic change detection technique:

• Find identifiers for each node in the XML document.

• Associate nodes across versions.

Information that identifies an element is conserved across changes.

Time complexity is O(n log n).

Nodes can be matched even when structural changes are significant.

31