TASM: Top-k Approximate Subtree Matching

Nikolaus Augsten (1), Denilson Barbosa (2), Michael Böhlen (3), Themis Palpanas (4)

(1) Free University of Bozen-Bolzano, Italy, [email protected]
(2) University of Alberta, Canada, [email protected]
(3) University of Zurich, Switzerland, [email protected]
(4) University of Trento, Italy, [email protected]

ICDE 2010, March 3, Long Beach, CA, USA

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 1 / 28

Outline

1 Motivation and Problem Definition

2 TASM-Postorder
   Upper Bound on Subtree Size
   Prefix Ring Buffer Pruning

3 Experiments

4 Conclusion and Future Work


Motivation and Problem Definition

Motivation

Query (XML fragment): article(authors(author(Tim), author(John)), booktitle(ICDE))

Document (very large XML): DBLP, 28M nodes, 531MB

Rank the top-k matches for the article query in the DBLP document!

Example Answer: k = 3

inproceedings(authors(author(Tim), author(John)), booktitle(ICDE)) (1 error)

article(author(Tim), author(John), booktitle(TKDE)) (2 errors)

inproceedings(authors(author(Tim), author(John), author(Peter)), booktitle(ICDE)) (3 errors)

Motivation and Problem Definition

TASM: Top-k Approximate Subtree Matching

Definition (TASM: Top-k Approximate Subtree Matching)

Given: query tree $Q$, document tree $T$, size $k$ of ranking

Goal: Compute a top-k ranking $R = (T_1, T_2, \ldots, T_k)$ of all subtrees $T_i$ of document $T$ with respect to query $Q$, using the tree edit distance for the ranking.

Subtree $T_i$:

a node and all its descendants

largest subtree is document itself

top-k ranking $R = (T_1, T_2, \ldots, T_k)$:

subtrees sorted by distance to query

best k subtrees: $T_i \notin R \Rightarrow \mathrm{ted}(Q, T_k) \le \mathrm{ted}(Q, T_i)$
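The definition above can be read as "score every subtree of the document, keep the k closest". The following is a minimal sketch of that reading, not the authors' algorithm. The `(label, children)` tuple encoding, the function names, and the placeholder distance `size_gap` are assumptions made for illustration; a real ranking would plug in the tree edit distance for `dist`.

```python
import heapq
from typing import Callable, Iterator, List, Tuple

# Assumed encoding: a tree is a (label, children) pair, children is a tuple of trees.
Tree = Tuple[str, tuple]


def subtrees(tree: Tree) -> Iterator[Tree]:
    """Yield every subtree (a node and all its descendants) in postorder."""
    _, children = tree
    for child in children:
        yield from subtrees(child)
    yield tree


def size(tree: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def top_k_ranking(query: Tree, document: Tree, k: int,
                  dist: Callable[[Tree, Tree], int]) -> List[Tuple[int, Tree]]:
    """Return the k subtrees of `document` closest to `query` under `dist`.

    The postorder position is used as a tie-breaker so trees are never compared.
    """
    scored = ((dist(query, t), pos, t) for pos, t in enumerate(subtrees(document)))
    return [(d, t) for d, _, t in heapq.nsmallest(k, scored)]


def size_gap(q: Tree, t: Tree) -> int:
    """Placeholder distance (difference in node count); NOT the tree edit distance."""
    return abs(size(q) - size(t))


doc = ("a", (("b", ()), ("c", (("d", ()),))))
query = ("x", (("y", ()),))
ranking = top_k_ranking(query, doc, 2, size_gap)
```

This brute-force version evaluates the distance for every subtree, including the whole document, which is exactly the cost the rest of the talk sets out to avoid.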

Motivation and Problem Definition

Ranking Function: Tree Edit Distance (TED)

article(authors(author(Tim), author(John)), booktitle(ICDE))

del(authors)

article(author(Tim), author(John), booktitle(ICDE))

ren(ICDE)

article(author(Tim), author(John), booktitle(TKDE))

Tree Edit Distance: Minimum number of node edit operations (insert, rename, delete) that transform one tree into the other.

TASM computes TED between query and document subtrees

Size and number of computed subtrees define TASM complexity
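To make the unit-cost tree edit distance concrete, here is a small sketch based on the textbook rightmost-root forest recursion with memoization. It is meant only for tiny trees such as the slide examples and is not the TASM-Dynamic or TASM-Postorder algorithm. The `node` helper and the tuple encoding are assumptions of this sketch.

```python
from functools import lru_cache


def node(label, *children):
    """Build a tree as a hashable (label, children) tuple (assumed encoding)."""
    return (label, tuple(children))


def forest_size(forest):
    """Total number of nodes in a forest (a tuple of trees)."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)          # insert everything in g
    if not g:
        return forest_size(f)          # delete everything in f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children move up in its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Map the two rightmost roots onto each other (rename if labels differ).
    rename = (forest_distance(v_children, w_children)
              + forest_distance(f[:-1], g[:-1])
              + (0 if v_label == w_label else 1))
    return min(delete, insert, rename)


def ted(t1, t2):
    """Tree edit distance between two trees."""
    return forest_distance((t1,), (t2,))


authors = node("authors", node("author", node("Tim")), node("author", node("John")))
query = node("article", authors, node("booktitle", node("ICDE")))
flat = node("article", node("author", node("Tim")), node("author", node("John")),
            node("booktitle", node("TKDE")))
print(ted(query, flat))  # del(authors) and ren(ICDE), as on the slide
```

The three example answers from the motivation slide come out at distances 1, 2, and 3 from the query under this definition.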

Motivation and Problem Definition

State of the Art

TASM-Dynamic: dynamic programming solution (1)

computes distance to every subtree of the document
uses smaller subtrees to compute larger ones
ranks subtrees by visiting memoization table
Space complexity: $O(mn)$, $m$: query size, $n$: document size

Space complexity limits application to databases

in database applications $n$ is huge (database size!)
TASM-Dynamic maintains two $m \times n$ matrices in RAM
more than 6GB RAM for our tiny query ($m = 8$) on DBLP ($n = 28 \times 10^6$)

For database-sized documents, dynamic programming is too expensive.

State-of-the-art algorithms do not scale!

(1) Zhang and Shasha 1989, Demaine et al. 2007
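A quick back-of-the-envelope check of the memory figure, as a sketch only. The slides do not say how matrix entries are stored, so the only inputs used here are the numbers on the slide; reading "6GB" as 6 × 10^9 bytes is an assumption.

```python
m = 8                 # query size from the slide
n = 28 * 10**6        # DBLP document size from the slide

# Two m x n matrices are kept in RAM.
entries = 2 * m * n

# Assumption: "6GB" is read as 6 * 10**9 bytes.
implied_bytes_per_entry = 6 * 10**9 / entries

print(f"{entries:,} matrix entries")
print(f"about {implied_bytes_per_entry:.1f} bytes per entry to reach 6GB")
```

Whatever the exact per-entry footprint, the entry count grows linearly with the document size n, which is the scalability problem the slide points at.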

Motivation and Problem Definition

Problem Definition

Find a solution for TASM (Top-k Approximate Subtree Matching) that

scales to very large documents

runs in small memory

ranks subtrees correctly (no heuristics!)



TASM-Postorder Upper Bound on Subtree Size

Subtree Size Upper Bound in Three Steps

1. Rank first $k$ subtrees of $T$ in postorder: $R' = (T'_1, T'_2, \ldots, T'_k)$

The worst match $T'_k$ can always be reached from $Q$ by deleting all nodes of $Q$ ($Q \to \emptyset$) and inserting all nodes of $T'_k$ ($\emptyset \to T'_k$):

(i) $\mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$

Each of the first $k$ subtrees in postorder has at most $k$ nodes, because all descendants of a node precede it in postorder: $|T'_k| \le k$

2. Final ranking $R = (T_1, T_2, \ldots, T_k)$ (= TASM result)

The $T_i$'s in $R$ are no worse than the worst match $T'_k$ of $R'$:

(ii) $\mathrm{ted}(Q, T_i) \le \mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$

3. Size upper bound for subtree $T_i$

At least $|T_i| - |Q|$ missing nodes must be inserted to transform $Q$ into $T_i$:

$|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$

$|T_i| \le \mathrm{ted}(Q, T_i) + |Q| \le 2|Q| + |T'_k| \le 2|Q| + k$

TASM-Postorder Upper Bound on Subtree Size

Upper Bound on Subtree Size

Theorem (Upper Bound on Subtree Size)

TASM needs to consider only small document subtrees of size $\tau$ or less:

$\tau = 2|Q| + k$

Upper bound is very powerful:

independent of document size and structure!

linear in query size and k

Example: top-10 with example query $|Q| = 8$ on DBLP (28M nodes)

with bound: max subtree size $\tau = 2 \cdot 8 + 10 = 26$

without bound: maximum subtree size is 28M (whole document)!

Document-independent upper bound on subtree size!
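The bound from the theorem is simple enough to state as code. The sketch below uses hypothetical helper names (`subtree_size_bound`, `is_candidate`) and only restates the formula and the slide's example; it does not implement the pruning itself.

```python
def subtree_size_bound(query_size: int, k: int) -> int:
    """tau = 2|Q| + k: the largest subtree size TASM needs to consider."""
    return 2 * query_size + k


def is_candidate(subtree_size: int, query_size: int, k: int) -> bool:
    """A subtree larger than tau cannot be needed for the top-k ranking."""
    return subtree_size <= subtree_size_bound(query_size, k)


tau = subtree_size_bound(query_size=8, k=10)
print(tau)                                   # slide example: top-10, |Q| = 8
print(is_candidate(28_000_000, 8, 10))       # the whole DBLP document
```

Note that the document size never appears as an argument, which is the "document-independent" property the slide stresses.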


TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

Example document: dblp(article(auth(John), title(X1)), proceedings(conf(VLDB), article(auth(Peter), title(X3)), article(auth(Mike), title(X4))), book(title(X2)))

Postorder queue (nodes appended so far): (John,1) (auth,2) (X1,1) (title,2) (article,5) (VLDB,1) (conf,2) (Peter,1) (auth,2) (X3,1) (title,2)

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)

no random access!

Page 50: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 51: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 52: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 53: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 54: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 55: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 56: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5 proc,13

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 57: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5 proc,13 X2,1

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 58: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5 proc,13 X2,1 title,2

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 59: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5 proc,13 X2,1 title,2

book,3

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 60: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5 proc,13 X2,1 title,2

book,3 dblp,22

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 61: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

dblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

John,1 auth,2 X1,1 title,2 article,5

VLDB,1 conf,2 Peter,1 auth,2 X3,1

title,2 article,5 Mike,1 auth,2 X4,1

title,2 article,5 proc,13 X2,1 title,2

book,3 dblp,22

Postorder queue: queue of (label,size)-pairs

dequeue removes leftmost element, e.g., (John, 1)no random access!

Relevant and state-of-the-art for XML Parsing

full subtree known only at closing tagclosing tags appear in postorder

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 14 / 28

Page 62: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Document Format: Postorder Queue

Document tree:

dblp
  article
    auth
      John
    title
      X1
  proceedings
    conf
      VLDB
    article
      auth
        Peter
      title
        X3
    article
      auth
        Mike
      title
        X4
  book
    title
      X2

Postorder queue: queue of (label,size)-pairs

  John,1 auth,2 X1,1 title,2 article,5
  VLDB,1 conf,2 Peter,1 auth,2 X3,1
  title,2 article,5 Mike,1 auth,2 X4,1
  title,2 article,5 proc,13 X2,1 title,2
  book,3 dblp,22

- dequeue removes the leftmost element, e.g., (John, 1)
- no random access!

Relevant and state of the art for XML parsing
- the full subtree is known only at the closing tag
- closing tags appear in postorder

Implementation is efficient and heavily used for
- XML streams
- plain XML files (e.g., SAX)
- XML in databases (Dewey, interval encoding, ...)
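The remark that closing tags appear in postorder can be illustrated with a small sketch. It is not the authors' implementation: the function name `postorder_queue` is hypothetical, Python's standard `xml.etree.ElementTree.iterparse` stands in for a SAX parser, and the sketch assumes that every element holds either text or child elements, never both.

```python
import io
import xml.etree.ElementTree as ET


def postorder_queue(xml_source):
    """Yield (label, size) pairs in postorder (hypothetical helper).

    Assumption: no mixed content, i.e. an element has either text or
    child elements, as in the DBLP-style example.
    """
    open_sizes = []  # subtree size accumulated so far for each open element
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            open_sizes.append(1)  # count the element node itself
            continue
        size = open_sizes.pop()
        text = (elem.text or "").strip()
        if text:
            yield (text, 1)  # text content becomes a leaf child
            size += 1
        yield (elem.tag, size)  # closing tag: the subtree is now complete
        if open_sizes:
            open_sizes[-1] += size  # add this subtree to its parent's size
        elem.clear()  # release the parsed content; only the pairs are kept


doc = io.BytesIO(
    b"<dblp><article><auth>John</auth><title>X1</title></article>"
    b"<book><title>X2</title></book></dblp>"
)
queue = list(postorder_queue(doc))
```

Each pair is produced when the closing tag is seen, so the consumer only ever needs sequential access to the stream.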

Candidate Subtrees

Candidate subtrees are all subtrees Ti of the document such that
- |Ti| ≤ τ, AND
- Ti is not contained in a larger subtree Tj with |Tj| ≤ τ

Pruning: find the candidate subtrees
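The definition can be read as "the maximal subtrees of size at most τ". A minimal in-memory sketch, assuming the whole tree is available as nested `(label, children)` tuples, may serve as a reference; the names `size`, `candidate_subtrees`, and `leaf` are hypothetical, and the code recomputes sizes naively instead of streaming.

```python
def size(node):
    """Number of nodes in the subtree rooted at node."""
    _, children = node
    return 1 + sum(size(child) for child in children)


def candidate_subtrees(node, tau):
    """Return the maximal subtrees with at most tau nodes, in document order."""
    if size(node) <= tau:
        return [node]  # small enough, and its parent (if any) was too large
    _, children = node
    found = []
    for child in children:
        found.extend(candidate_subtrees(child, tau))
    return found


def leaf(label):
    return (label, [])


# The DBLP example tree from the slides.
dblp = ("dblp", [
    ("article", [("auth", [leaf("John")]), ("title", [leaf("X1")])]),
    ("proceedings", [
        ("conf", [leaf("VLDB")]),
        ("article", [("auth", [leaf("Peter")]), ("title", [leaf("X3")])]),
        ("article", [("auth", [leaf("Mike")]), ("title", [leaf("X4")])]),
    ]),
    ("book", [("title", [leaf("X2")])]),
])

roots = [label for label, _ in candidate_subtrees(dblp, tau=6)]
```

For τ = 6 the candidates are the three article subtrees, the conf subtree, and the book subtree; proceedings (13 nodes) and dblp (22 nodes) are too large.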

Simple Pruning Approach

Document tree (numbers are postorder positions):

dblp (22)
  article (5)
    auth (2)
      John (1)
    title (4)
      X1 (3)
  proceedings (18)
    conf (7)
      VLDB (6)
    article (12)
      auth (9)
        Peter (8)
      title (11)
        X3 (10)
    article (17)
      auth (14)
        Mike (13)
      title (16)
        X4 (15)
  book (21)
    title (20)
      X2 (19)

Simple pruning approach (τ = 6 in the example above):
- add nodes to a memory buffer until a non-candidate (|Ti| > τ) is added
- subtrees of the non-candidate with |Ti| ≤ τ are candidate subtrees

Problem: the memory buffer can grow very large!
- subtrees must be kept in memory until a non-candidate ancestor is read
- worst case: the memory buffer stores O(n) nodes (frequent in data-centric XML!)

Example: DBLP, τ = 50
- 99% of nodes are still in the buffer when the root node is read!

Simple pruning is not feasible for large documents!
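The simple approach and its memory problem can be sketched as follows. This is one possible reading of the two bullet points, not the authors' code: `simple_pruning`, `peak`, and the stack-of-subtrees layout are hypothetical choices, and `peak` is reported only to make the buffer growth visible.

```python
EXAMPLE = (
    "John,1 auth,2 X1,1 title,2 article,5 VLDB,1 conf,2 Peter,1 auth,2 X3,1 "
    "title,2 article,5 Mike,1 auth,2 X4,1 title,2 article,5 proc,13 X2,1 "
    "title,2 book,3 dblp,22"
)
queue = [(label, int(n)) for label, n in (p.split(",") for p in EXAMPLE.split())]


def simple_pruning(queue, tau):
    """Return (candidates, peak); peak is the largest number of buffered nodes."""
    buffer = []    # stack of (first_position, nodes) for buffered small subtrees
    buffered = 0   # nodes currently held in the buffer
    peak = 0
    candidates = []
    for position, (label, size) in enumerate(queue, start=1):
        first = position - size + 1  # postorder position of leftmost descendant
        children = []
        while buffer and buffer[-1][0] >= first:
            children.append(buffer.pop())  # buffered subtrees below this node
        children.reverse()
        if size <= tau:
            # still small: merge the children and this node into one subtree
            nodes = [pair for _, sub in children for pair in sub]
            nodes.append((label, size))
            buffer.append((first, nodes))
            buffered += 1
            peak = max(peak, buffered)
        else:
            # non-candidate: its buffered subtrees are candidates
            for _, sub in children:
                candidates.append(sub)
                buffered -= len(sub)
    candidates.extend(sub for _, sub in buffer)  # a small document is one candidate
    return candidates, peak


candidates, peak = simple_pruning(queue, tau=6)
roots = [sub[-1][0] for sub in candidates]
```

In the example, 17 of the 22 nodes sit in the buffer just before proceedings is read, and the first article subtree stays buffered until the root arrives, which is the behaviour the slide warns about.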

Efficient Pruning is Tricky!

Problem: when can we remove a node from the buffer?
- when we see |Ti| ≤ τ, we don't yet know about the parent (postorder!)
- the subtree of the parent might also have size ≤ τ!

Our solution does not wait for the parent
- prefix ring buffer: fixed-size buffer
- pruning rule: prune based on the following nodes

Pruning in Small Memory

Prefix ring buffer of size τ + 1 (main memory)
- stores a prefix (τ nodes in postorder) of the document
- two operations:
  - append a new node
  - remove the leftmost subtree/node

Example (τ = 6):
- buffer after appending VLDB,1: John,1 auth,2 X1,1 title,2 article,5 VLDB,1
- buffer after removing the leftmost subtree: VLDB,1

Pruning rule: if the leftmost node in the full ring buffer is a
- leaf: the leftmost subtree is a candidate subtree
- non-leaf: the leftmost node is a non-candidate node
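The pruning rule can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the authors' implementation: a `collections.deque` holding at most τ nodes stands in for the ring buffer of size τ + 1, the leftmost subtree is found by a linear scan (a real implementation would likely want a faster lookup), τ ≥ 1 is assumed, and the name `prefix_ring_buffer_pruning` is hypothetical.

```python
from collections import deque

EXAMPLE = (
    "John,1 auth,2 X1,1 title,2 article,5 VLDB,1 conf,2 Peter,1 auth,2 X3,1 "
    "title,2 article,5 Mike,1 auth,2 X4,1 title,2 article,5 proc,13 X2,1 "
    "title,2 book,3 dblp,22"
)
queue = [(label, int(n)) for label, n in (p.split(",") for p in EXAMPLE.split())]


def prefix_ring_buffer_pruning(queue, tau):
    """Yield candidate subtrees as lists of (label, size) pairs in postorder."""
    pending = iter(queue)
    buffer = deque()   # stand-in for the ring buffer; never holds more than tau nodes
    exhausted = False
    while True:
        # 1. fill the buffer from the postorder queue
        while not exhausted and len(buffer) < tau:
            node = next(pending, None)
            if node is None:
                exhausted = True
            else:
                buffer.append(node)
        if not buffer:
            return  # 3. queue and buffer are both empty
        # 2. check the leftmost node
        if buffer[0][1] == 1:
            # leaf: the leftmost subtree is the largest buffered subtree that
            # starts at the leftmost node, i.e. whose size equals its offset
            length = max(
                offset
                for offset, (_, size) in enumerate(buffer, start=1)
                if size == offset
            )
            yield [buffer.popleft() for _ in range(length)]
        else:
            buffer.popleft()  # non-leaf: non-candidate node, drop it


candidates = list(prefix_ring_buffer_pruning(queue, tau=6))
roots = [sub[-1][0] for sub in candidates]
```

The size-equals-offset test works because the buffer always holds a contiguous run of the postorder sequence, so a node at offset k with size k covers exactly the first k buffered nodes.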

Pruning Rule – Intuition

Candidate subtree: the leftmost node is a leaf
- Ti: leftmost subtree, starts with the leftmost node
- Tj: smallest subtree that properly contains Ti
- due to postorder: Tj contains all nodes in the buffer
- since |Ti| ≤ τ and |Tj| > τ: Ti is a candidate

Non-candidate node: the leftmost node is a non-leaf
- the leftmost non-leaf is the parent of previously removed nodes
- we remove either candidate subtrees or non-candidate nodes
- in both cases: the parent is a non-candidate

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)John,1 auth,2 X1,1 · · ·

prefix ring buffer (main memory)

s↑ e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Page 85: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)auth,2 X1,1 title,2 · · ·

prefix ring buffer (main memory)

s↑John,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Page 86: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X1,1 title,2 article,5 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Page 87: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)title,2 article,5 VLDB,1 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Page 88: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)article,5 VLDB,1 conf,2 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Page 89: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)VLDB,1 conf,2 Peter,1 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2 article,5

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Page 90: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)conf,2 Peter,1 auth,2 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2 article,5 VLDB,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): conf,2 Peter,1 auth,2 · · ·
prefix ring buffer (main memory, from s to e): VLDB,1
candidate subtrees (output): article(auth(John), title(X1))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): Peter,1 auth,2 X3,1 · · ·
prefix ring buffer (main memory, from s to e): VLDB,1 conf,2
candidate subtrees (output): article(auth(John), title(X1))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): auth,2 X3,1 title,2 · · ·
prefix ring buffer (main memory, from s to e): VLDB,1 conf,2 Peter,1
candidate subtrees (output): article(auth(John), title(X1))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): X3,1 title,2 article,5 · · ·
prefix ring buffer (main memory, from s to e): VLDB,1 conf,2 Peter,1 auth,2
candidate subtrees (output): article(auth(John), title(X1))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): title,2 article,5 Mike,1 · · ·
prefix ring buffer (main memory, from s to e): VLDB,1 conf,2 Peter,1 auth,2 X3,1
candidate subtrees (output): article(auth(John), title(X1))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): article,5 Mike,1 auth,2 · · ·
prefix ring buffer (main memory, from s to e): VLDB,1 conf,2 Peter,1 auth,2 X3,1 title,2
candidate subtrees (output): article(auth(John), title(X1))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): article,5 Mike,1 auth,2 · · ·
prefix ring buffer (main memory, from s to e): Peter,1 auth,2 X3,1 title,2
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB)

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): Mike,1 auth,2 X4,1 · · ·
prefix ring buffer (main memory, from s to e): Peter,1 auth,2 X3,1 title,2 article,5
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB)

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): auth,2 X4,1 title,2 · · ·
prefix ring buffer (main memory, from s to e): Peter,1 auth,2 X3,1 title,2 article,5 Mike,1
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB)

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): auth,2 X4,1 title,2 · · ·
prefix ring buffer (main memory, from s to e): Mike,1
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): X4,1 title,2 article,5 · · ·
prefix ring buffer (main memory, from s to e): Mike,1 auth,2
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): title,2 article,5 proc,13 · · ·
prefix ring buffer (main memory, from s to e): Mike,1 auth,2 X4,1
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): article,5 proc,13 X2,1 · · ·
prefix ring buffer (main memory, from s to e): Mike,1 auth,2 X4,1 title,2
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): proc,13 X2,1 title,2 · · ·
prefix ring buffer (main memory, from s to e): Mike,1 auth,2 X4,1 title,2 article,5
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): X2,1 title,2 book,3 · · ·
prefix ring buffer (main memory, from s to e): Mike,1 auth,2 X4,1 title,2 article,5 proc,13
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): X2,1 title,2 book,3 · · ·
prefix ring buffer (main memory, from s to e): proc,13
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): title,2 book,3 dblp,22
prefix ring buffer (main memory, from s to e): proc,13 X2,1
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): book,3 dblp,22
prefix ring buffer (main memory, from s to e): proc,13 X2,1 title,2
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): dblp,22
prefix ring buffer (main memory, from s to e): proc,13 X2,1 title,2 book,3
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): (empty)
prefix ring buffer (main memory, from s to e): proc,13 X2,1 title,2 book,3 dblp,22
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): (empty)
prefix ring buffer (main memory, from s to e): X2,1 title,2 book,3 dblp,22
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): (empty)
prefix ring buffer (main memory, from s to e): dblp,22
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4)); book(title(X2))

Prefix Ring Buffer Pruning: Example (τ = 6)

postorder queue (input): (empty)
prefix ring buffer (main memory): (empty)
candidate subtrees (output): article(auth(John), title(X1)); conf(VLDB); article(auth(Peter), title(X3)); article(auth(Mike), title(X4)); book(title(X2))
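The walkthrough above can be replayed in a few lines of code. This is a simplified sketch, not the authors' implementation: it uses a `collections.deque` holding at most τ nodes instead of a fixed-size ring buffer with s and e pointers, and it finds the candidate root by scanning the buffer. The names `prune_candidates` and `QUEUE` are assumptions made for illustration.

```python
from collections import deque


def prune_candidates(postorder, tau):
    """Yield candidate subtrees (lists of (label, size) in postorder).

    Simplification: a deque of at most `tau` nodes replaces the ring buffer,
    and the candidate root is located by a linear scan of the buffer.
    """
    queue = deque(postorder)
    buf = deque()
    while queue or buf:
        # 1. fill the buffer
        while queue and len(buf) < tau:
            buf.append(queue.popleft())
        # 2. check the leftmost node
        _, size = buf[0]
        if size == 1:
            # Leaf: a node at buffer position j with size j + 1 roots a
            # subtree that starts exactly at the leftmost node. Take the
            # largest such subtree and move it to the result.
            end = max(j for j, (_, s) in enumerate(buf) if s == j + 1)
            yield [buf.popleft() for _ in range(end + 1)]
        else:
            # Non-leaf: non-candidate, remove it.
            buf.popleft()
        # 3. the loop ends when queue and buffer are both empty


# Postorder queue of the example document, entries are (label, size).
QUEUE = [
    ("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
    ("VLDB", 1), ("conf", 2),
    ("Peter", 1), ("auth", 2), ("X3", 1), ("title", 2), ("article", 5),
    ("Mike", 1), ("auth", 2), ("X4", 1), ("title", 2), ("article", 5),
    ("proc", 13),
    ("X2", 1), ("title", 2), ("book", 3),
    ("dblp", 22),
]

CANDIDATES = list(prune_candidates(QUEUE, tau=6))
ROOTS = [subtree[-1][0] for subtree in CANDIDATES]
```

With τ = 6, `ROOTS` lists article, conf, article, article, book, the same candidate subtrees as in the example; proc and dblp are dropped as non-candidates. The memory held at any time is bounded by τ nodes regardless of the document length.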

TASM-Postorder

TASM-postorder
1. empty ranking R, tightening upper bound τ′ = τ
2. for each candidate subtree Ti
   a. if |R| = k: update τ′ = min(τ, max(R) + |Q|)
   b. compute tree edit distance for all subtrees of Ti within τ′
   c. update ranking R

Theorem (TASM-Postorder)
The space complexity of TASM-postorder is independent of the document size:
O(m² + mk)
(m: query size, k: result size)

TASM-postorder scales to very large documents!
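The ranking loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the tree edit distance is passed in as a callable, "within τ′" is read as subtree size at most τ′, and the distance used in the usage example is a placeholder (size difference to a hypothetical two-node query), not a real tree edit distance. All names (`rank_candidates`, `toy_distance`, `QUERY_SIZE`) are hypothetical.

```python
import heapq


def rank_candidates(candidates, query_size, k, tau, distance):
    """Keep the k subtrees with the smallest distance to the query.

    candidates: iterable of subtrees, each a list of (label, size) in postorder.
    distance:   callable mapping a subtree to its distance from the query
                (assumed to be supplied by a tree edit distance routine).
    """
    ranking = []   # max-heap via negated distances: (-dist, tiebreak, subtree)
    bound = tau    # the tightening upper bound τ′
    tiebreak = 0
    for cand in candidates:
        # a. tighten the bound once the ranking is full
        if len(ranking) == k:
            bound = min(tau, -ranking[0][0] + query_size)
        # b. distances for all subtrees of the candidate within the bound;
        #    the node at position i with size s roots cand[i - s + 1 : i + 1]
        for i, (_, s) in enumerate(cand):
            if s > bound:
                continue
            sub = cand[i - s + 1:i + 1]
            d = distance(sub)
            tiebreak += 1
            # c. update the ranking
            if len(ranking) < k:
                heapq.heappush(ranking, (-d, tiebreak, sub))
            elif d < -ranking[0][0]:
                heapq.heapreplace(ranking, (-d, tiebreak, sub))
    return sorted(((-nd, sub) for nd, _, sub in ranking), key=lambda t: t[0])


QUERY_SIZE = 2  # hypothetical query with two nodes


def toy_distance(subtree):
    """Placeholder only: size difference to the query, NOT a tree edit distance."""
    return abs(len(subtree) - QUERY_SIZE)


CANDIDATES = [
    [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5)],
    [("VLDB", 1), ("conf", 2)],
]

RESULT = rank_candidates(CANDIDATES, QUERY_SIZE, k=2, tau=6, distance=toy_distance)
```

The ranking never holds more than k subtrees, which is in line with the mk term of the space bound if each ranked subtree has O(m) nodes. In a real setting `toy_distance` would be replaced by an actual tree edit distance between the query and the subtree.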

Experiments

Pruning Effectiveness

Prefix ring buffer pruning is very effective! Maximum subtree reduced from 37M to 18 nodes.

Dataset: PSD protein sequences, 37M nodes, 683MB
Compute TASM (|Q| = 4, k = 1)
TASM-dynamic (state of the art)
TASM-postorder (our solution)

[Figure: histograms of computed subtrees, number of subtrees vs. subtree size (nodes), both axes logarithmic from 1e0 to 1e7. Panel TASM-Dynamic: largest subtree 37M (entire document). Panel TASM-Postorder: largest subtree 18.]

Experiments

Scalability: TASM-Postorder vs. TASM-Dynamic

TASM-postorder is much faster than TASM-dynamic.

Dataset: XMark (synthetic XML for benchmark)
Vary query size and document size
Compute TASM (k = 5)
TASM-dynamic (state of the art)
TASM-postorder (our solution)
Measure wall clock time

[Figure, left panel: time (seconds, logarithmic, 1e0 to 1e3) vs. query size (nodes: 4, 8, 16, 32, 64). Series: dyn, T:224MB; dyn, T:112MB; pos, T:224MB; pos, T:112MB.]
[Figure, right panel: time (seconds, logarithmic, 1e0 to 1e3) vs. document size (MB: 112, 224, 448, 896, 1792). Series: dyn, |Q|=8; dyn, |Q|=4; pos, |Q|=8; pos, |Q|=4.]

Experiments

Scalability with Result Size k

TASM-postorder scales well with k. Increasing k by 4 orders of magnitude only doubles runtime.

Dataset: XMark (synthetic XML for benchmark)
Vary k (size of ranking)
Compute TASM (|Q| = 16)
TASM-dynamic (state of the art)
TASM-postorder (our solution)
Measure wall clock time

[Figure: time (seconds, 0 to 300) vs. k (logarithmic, 1e0 to 1e4). Series: dyn, T:224MB; dyn, T:112MB; pos, T:224MB; pos, T:112MB.]

Page 132: TASM: Top-k Approximate Subtree Matchingaugsten/publ/icde10/icde10-slides.pdfTASM: Top-k Approximate Subtree Matching ... TASM computes TED between query and document subtrees Size

Experiments

Space complexity: TASM-Postorder vs. TASM-Dynamic

TASM-postorder: space independent of document size!

[Figure: memory (MB, 1e0 to 4e3) vs. document size (112 to 1792 MB); series: dyn and pos, each for |Q|=4 and |Q|=16; annotated values: 3GB and 8MB.]

Dataset: XMark (synthetic XML for benchmark)

Vary document size

Compute TASM (k = 5)

TASM-dynamic (state of the art)

TASM-postorder (our solution)

Measure main memory usage
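
To give an intuition for why a fixed-size buffer keeps memory independent of document size, the following is a minimal sketch of a generic ring buffer fed by a stream of node identifiers. It is not the prefix ring buffer of TASM-postorder and performs no pruning; the names `RingBuffer` and `scan`, and the use of integers as node identifiers, are assumptions made for this example only.

```python
from typing import Iterable, List


class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entry when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.slots: list = [None] * capacity  # allocated once, never grows
        self.start = 0  # index of the oldest stored item
        self.count = 0  # number of items currently stored

    def append(self, item: int) -> None:
        capacity = len(self.slots)
        end = (self.start + self.count) % capacity
        self.slots[end] = item
        if self.count < capacity:
            self.count += 1
        else:
            # Buffer was full: the oldest item was overwritten.
            self.start = (self.start + 1) % capacity

    def items(self) -> List[int]:
        capacity = len(self.slots)
        return [self.slots[(self.start + i) % capacity] for i in range(self.count)]


def scan(node_ids: Iterable[int], capacity: int) -> RingBuffer:
    """Read the stream once, keeping only the last `capacity` node ids."""
    buf = RingBuffer(capacity)
    for node_id in node_ids:
        buf.append(node_id)
    return buf


small = scan(range(10), capacity=4)
large = scan(range(100_000), capacity=4)
```

Both scans allocate the same four slots, regardless of how many nodes the stream contains.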

Conclusion and Future Work

Outline

1 Motivation and Problem Definition

2 TASM-Postorder
   Upper Bound on Subtree Size
   Prefix Ring Buffer Pruning

3 Experiments

4 Conclusion and Future Work

Conclusion and Future Work

Conclusion

Prefix Ring Buffer for space efficient pruning

Dynamic programming does not scale to database-size solutions.

Upper bound τ: limits the maximum subtree size for TASM

TASM-postorder: highly scalable TASM algorithm

TASM-postorder makes TASM feasible.

Future Work – New research opportunities:

tune tree edit distance to different applications

index the document: can we avoid a document scan?

parallel TASM algorithm: where to split document?

Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. In ICALP, volume 4596 of LNCS, pages 146–157, Wroclaw, Poland, July 2007. Springer.

K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. on Computing, 18(6):1245–1262, 1989.
