Requests to Tsong-Li • 1. Related work at end of each section • 2. Screen dumps of treebase at end of treesearch section (you’ll see where) • 3. Web addresses at the very end.

Dec 29, 2015


Requests to Tsong-Li

• 1. Related work at end of each section

• 2. Screen dumps of treebase at end of treesearch section (you’ll see where)

• 3. Web addresses at the very end.


Searching for and Comparing Trees and Graphs

Dennis Shasha

Courant Institute, NYU

Joint work with

Kaizhong Zhang and Jason Wang


Philosophy

• Trees and graphs represent data in many domains: linguistics, chemistry, and perhaps even the web.

• Question: why can’t I search for trees or graphs at the speed of keyword searches?

• Why can’t I compare trees (or graphs) as easily as I can compare strings?


Tree Searching

• Given a small tree t, is it present in a bigger tree T?


What does “present” mean?

• Preserving sibling order or not

• Preserving ancestor order

• Preserving distance

• Mismatches


Sibling Order

• Order of children of a node:

[Figure: A with children B, C  ?=  A with children C, B]


Ancestor Order

• Order between children and parent.

[Figure: A with children B, C  ?=  A with child C, which has child B]


Ancestor Distance

• Can children become grandchildren:

[Figure: A with children B, C  ?=  A with children B, X, where C is a child of X]


Mismatches

• Can there be relabellings, inserts, and deletes (Tolstoy problem):

[Figure: A with children B, C versus A with children X, C. How far apart?]


Bottom Line

• There is no one definition of mismatch or subtree (Tolstoy problem). You must choose the package that suits you.

• I will tell you about three.


TreeSearch Query Language

• The query language is simply a tree decorated with single-length don’t cares (?) and variable-length don’t cares (*).

[Figure: query tree with A at the top, then a * (annotated ">= 0, on each side"), then B and C, then a ? (annotated "= 1"), then D]
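
As a rough illustration of the two don’t cares, the sketch below matches a single query path against a single data path, with ? standing for exactly one node and * for zero or more nodes. The function name path_matches and the choice to require the pattern to cover the whole data path are assumptions made for this example; they are not taken from TreeSearch itself.

```python
def path_matches(pattern, path):
    """Return True if the label sequence `path` matches `pattern`.

    Both arguments are tuples of node labels read from top to bottom.
    '?' stands for exactly one node, '*' for zero or more nodes.
    (Hypothetical helper: the whole path must be consumed by the pattern.)
    """
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == '*':
        # Let '*' absorb 0, 1, 2, ... labels and try to match the remainder.
        return any(path_matches(rest, path[i:]) for i in range(len(path) + 1))
    if not path:
        return False
    return (head == '?' or head == path[0]) and path_matches(rest, path[1:])


query_path = ('A', '*', 'B', '?', 'D')
print(path_matches(query_path, ('A', 'W', 'Z', 'B', 'X', 'D')))  # * absorbs W, Z
print(path_matches(query_path, ('A', 'B', 'X', 'D')))            # * absorbs nothing
print(path_matches(query_path, ('A', 'B', 'D')))                 # ? has no node to match
```

The recursion is exponential in the number of * symbols in the worst case, which is acceptable for short query paths but is only a sketch.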


Exact Match

• A query matches exactly if it is contained in the data tree, regardless of sibling order or additional nodes.

[Figure: the query tree (A, *, B C, ?, D) = a match inside a data tree whose nodes are X, Y, A, W, Z, C, B, X, Q, D, U]


Inexact Match

• A match is inexact if node labels are missing or differ. Differences higher in the tree cost more.

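[Figure: the query tree (A, *, B C, ?, D) differs by 1 from a data tree whose nodes are X, Y, A, W, Z, C, B, X, Q, E, U; the data tree has E where the query has D]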


Treesearch Conceptual Algorithm

• Take all paths in query tree.

• Find out where each path is in the data tree.

• So the notion of distance is the number of paths that differ. Higher nodes are more important.

• Implementation: suffix array. A few seconds on several thousand trees.
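
The conceptual algorithm above can be sketched in a few lines. This is a naive illustration, not the suffix-array implementation: it tries every data node as a possible anchor for the query root and counts how many root-to-leaf query paths are missing below that anchor. The names tree, root_to_leaf_paths, downward_paths_from and path_distance are made up for this example, don’t cares are left out, and all paths are weighted equally.

```python
def tree(label, *children):
    """A tree is a pair (label, tuple of child trees)."""
    return (label, tuple(children))


def root_to_leaf_paths(t):
    """All label sequences from the root of t down to a leaf."""
    label, kids = t
    if not kids:
        return [(label,)]
    return [(label,) + p for k in kids for p in root_to_leaf_paths(k)]


def downward_paths_from(t):
    """All label sequences that start at the root of t and go down any number of steps."""
    label, kids = t
    paths = {(label,)}
    for k in kids:
        for p in downward_paths_from(k):
            paths.add((label,) + p)
    return paths


def nodes(t):
    """Yield every subtree of t, i.e. every node together with what hangs below it."""
    yield t
    for k in t[1]:
        yield from nodes(k)


def path_distance(query, data):
    """Smallest number of query paths missing under any anchor node of the data tree."""
    qpaths = root_to_leaf_paths(query)
    best = len(qpaths)
    for anchor in nodes(data):
        available = downward_paths_from(anchor)
        missing = sum(1 for p in qpaths if p not in available)
        best = min(best, missing)
    return best


query = tree('A', tree('B'), tree('C'))
data = tree('X', tree('Y'), tree('A', tree('C'), tree('B', tree('D'))))
print(path_distance(query, data))  # sibling order is ignored, extra nodes are allowed
```

Because each path is looked up independently, sibling order has no effect on the result, while the top-to-bottom order inside each path is preserved.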


Treesearch Review

• Ancestor order matters.

• Sibling order doesn’t.

• Don’t cares: * and ?

• Distance metric is based on the number of path differences.

• Sister system built by Divesh and Sihem at Bell Labs that allows terms to be “generalized”.


Tsong-Li: screen dumps of treebase then related work


Tree Edit

• Order of children matters

[Figure: A with children B, C becomes A’ with children C, B. Edit script: A -> A’, del(B), ins(B)]


Tree Edit in General

• Operations are relabel A->A’, delete (X), insert (B).

[Figure: a tree rooted at A (nodes X, C, C) becomes a tree rooted at A’ (nodes C, B, C). Caption: A -> A’, del(B), ins(B)]


Review of Tree Edit

• Generalizes string edit distance to trees; computed by a dynamic programming algorithm.

• O(|T1| |T2| depth(T1) depth(T2))

• The basis for XMLdiff.

• Also has * and best removal of subtrees.
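
To make the three operations concrete, here is a minimal sketch of ordered tree edit distance with unit costs. It is a plain memoized recursion over forests that peels off the rightmost root, not the optimized dynamic program, so it does not reach the bound quoted above. It also leaves out * and subtree removal. The names tree, forest_distance and tree_edit_distance are chosen for this example only.

```python
from functools import lru_cache


def tree(label, *children):
    """A tree is a pair (label, tuple of child trees); a forest is a tuple of trees."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1 and not f2:
        return 0
    if not f1:
        # Only inserts remain: insert the rightmost root of f2, then its children.
        _, kids = f2[-1]
        return forest_distance(f1, f2[:-1] + kids) + 1
    if not f2:
        # Only deletes remain: delete the rightmost root of f1; its children move up.
        _, kids = f1[-1]
        return forest_distance(f1[:-1] + kids, f2) + 1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    relabel = (forest_distance(kids1, kids2)
               + forest_distance(f1[:-1], f2[:-1])
               + (0 if label1 == label2 else 1))
    return min(delete, insert, relabel)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# A with children B, C versus A' with children C, B:
# one relabel plus a delete and an insert (child order matters).
print(tree_edit_distance(tree('A', tree('B'), tree('C')),
                         tree("A'", tree('C'), tree('B'))))
```

Deleting a node attaches its children to its parent in place, which is what allows a child to become a grandchild (or the reverse) at a cost of one operation.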


Tsong-Li: related work here


Graph Edit

• Thesis work of Rosalba Giugno.

• Find a small graph (with * and ?) in a big graph.

• Slow when the query graph is big, because subgraph isomorphism takes exponential time in the worst case.
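
To see where the exponential cost comes from, consider a plain backtracking search for a small labeled graph inside a big one. The sketch below is a generic textbook-style search written for this illustration, not the method of the thesis; it ignores * and ?, and the name find_embedding and the dictionary-based graph encoding are assumptions.

```python
def find_embedding(q_labels, q_adj, g_labels, g_adj):
    """Map each query node to a distinct data node with the same label such that
    every query edge is also an edge in the data graph. Returns a dict or None.

    q_labels / g_labels: dict node -> label
    q_adj / g_adj:       dict node -> set of neighbouring nodes (undirected)
    """
    order = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        for g in g_labels:
            if g in mapping.values() or g_labels[g] != q_labels[q]:
                continue
            # Every already-mapped neighbour of q must be adjacent to g.
            if all(mapping[n] in g_adj[g] for n in q_adj[q] if n in mapping):
                mapping[q] = g
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # dead end: undo and try the next candidate
        return None

    return extend({})


# Query: path A - B - C.  Data: A is adjacent to two B nodes, only one of which leads to C.
q_labels = {1: 'A', 2: 'B', 3: 'C'}
q_adj = {1: {2}, 2: {1, 3}, 3: {2}}
g_labels = {10: 'A', 13: 'B', 11: 'B', 12: 'C', 14: 'D'}
g_adj = {10: {11, 13}, 11: {10, 12}, 12: {11}, 13: {10, 14}, 14: {13}}
print(find_embedding(q_labels, q_adj, g_labels, g_adj))
```

Each query node may be tried against many data nodes, and a wrong early choice is only discovered later, so the number of partial mappings explored can grow exponentially with the size of the query graph.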


Example of GraphGrep

• Query graph has nodes and don’t cares

[Figure: query graph with nodes A, B, C, D and a variable-length don’t care (*)]


Summary of Tools

• Why can’t tree and graph search be like keyword search?

• We are getting there and will provide software if you are interested.

• About 50 downloads so far.