Purely Functional Data Structures for On-line LCA
Edward Kmett
Boston Haskell, May 30th, 2012

May 24, 2015

Edward Kmett

This talk improves the known asymptotic complexity of online lowest common ancestor search from O(h) to O(log h), opening the door to new uses in distributed computing and version control.
Page 1: Purely Functional Data Structures for On-Line LCA

Purely Functional Data Structures for On-line LCA

Edward Kmett

Boston Haskell, May 30th, 2012

Page 2: Purely Functional Data Structures for On-Line LCA

Overview

The Lowest Common Ancestor (LCA) Problem

Tarjan’s Off-line LCA

Off-line Tree-Like LCA

Off-line Range-Min LCA

Naïve On-line LCA

Data Structures from Number Systems

Skew-Binary Random Access Lists

Skew-Binary On-line LCA

Page 3: Purely Functional Data Structures for On-Line LCA

The Lowest Common Ancestor Problem

Given a tree, and two nodes in the tree, find the lowest entry in the tree that is an ancestor to both.

[Figure: an example tree with nodes labelled A through J.]

Page 4: Purely Functional Data Structures for On-Line LCA
Page 5: Purely Functional Data Structures for On-Line LCA

The Lowest Common Ancestor Problem

Given a tree and two nodes in the tree, find the lowest entry in the tree that is an ancestor to both.

Applications:

Computing Dominators in Flow Graphs

Three-Way Merge Algorithms in Revision Control

Common Word Roots/Suffixes

Range-Min Query (RMQ) problems

Computing Distance in a Tree

Page 6: Purely Functional Data Structures for On-Line LCA

The Lowest Common Ancestor Problem

Given a tree and two nodes in the tree, find the lowest entry in the tree that is an ancestor to both.

First formalized by Aho, Hopcroft, and Ullman in 1973.

They provided ephemeral on-line and off-line versions of the problem in terms of two operations, with their off-line version of the algorithm requiring O(n log*(n)) and their online version requiring O(n log n) steps.

Research has largely focused on the off-line versions of this problem where you are given the entire tree a priori.

Page 7: Purely Functional Data Structures for On-Line LCA

cons, link, or grow?

The original formulation of LCA was in terms of two operations: link x y, which grafts an unattached tree x on as a child of y, and lca x y, which computes the lowest common ancestor of x and y.

Alternatively, we can work with lca x y and cons a y, where cons a y returns a new version of the path y extended downward with the globally unique node ID a.

We can replace cons a y with a monadic grow y, which tracks the variable supply internally. Using a concurrent variable supply, like the one provided by the concurrent-supply package, lets you grow the tree in parallel.

Page 8: Purely Functional Data Structures for On-Line LCA

Tarjan’s Off-line LCA

In 1979, Robert Tarjan found a way to compute a predetermined set of distinct LCA queries at the same time given the complete tree by creatively using disjoint-set forests in O(nα(n)). (This is a stronger condition than the usual off-line problem statement.)

function TarjanOLCA(u)
    MakeSet(u);
    u.ancestor := u;
    for each v in u.children do
        TarjanOLCA(v);
        Union(u,v);
        Find(u).ancestor := u;
    u.colour := black;
    for each v such that {u,v} in P do
        if v.colour == black
            print "The LCA of " + u + " and " + v + " is " + Find(v).ancestor;

Page 9: Purely Functional Data Structures for On-Line LCA

Tarjan’s Off-line LCA

In 1979, Robert Tarjan found a way to compute a predetermined set of distinct LCA queries at the same time given the complete tree by creatively using disjoint-set forests in O(nα(n)).

In 1983, Harold Gabow and Robert Tarjan improved the asymptotics of the preceding algorithm to O(n) by noting special-case opportunities not available in general purpose disjoint-set forest problems.

Page 10: Purely Functional Data Structures for On-Line LCA

Tree-Like Off-line LCA

In 1984, Dov Harel and Robert Tarjan provided the first asymptotically optimal off-line solution, which converts the tree in O(n) into a structure that can be queried in O(1).

In 1988, Baruch Schieber and Uzi Vishkin simplified that structure by building arbitrary-fanout trees out of paths and binary trees, and providing fast indexing into each case.

Page 11: Purely Functional Data Structures for On-Line LCA

Range-Min Off-line LCA

In 1993, Omer Berkman and Uzi Vishkin found another conversion with the same O(n) preprocessing, using an Euler tour to convert the tree structure into a Range-Min structure that can be queried in O(1) time.

This was improved in 2000 by Michael Bender and Martin Farach-Colton.

Alstrup, Gavoille, Kaplan and Rauhe focused on distributing this algorithm.

Fischer and Heun reduced the memory requirements, but also showed that logarithmically slower RMQ algorithms are often faster at the common problem sizes of today!

Page 12: Purely Functional Data Structures for On-Line LCA

Backup Plans

Page 13: Purely Functional Data Structures for On-Line LCA

Naïve On-line LCA

Build paths as lists of node IDs, using cons as you go.

x = [5,4,3,2,1] :# 5

y = [6,3,2,1] :# 4

To compute lca x y, first cut both lists to have the same length.

x' = [4,3,2,1], y' = [6,3,2,1], len = 4

Then keep dropping elements from both until the IDs match.

lca x y = [3,2,1] :# 3
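The steps above can be sketched on ordinary Haskell lists. This is a minimal illustration rather than the code from the talk: the name naiveLca is hypothetical, the cached length written after :# on the slide is recomputed with the Prelude length instead, and node IDs are assumed to be globally unique so that matching heads imply matching tails.

```haskell
-- Naive on-line LCA: a path is a list of node IDs, newest node first and
-- root last, so the shared ancestry of two paths is their common suffix.
-- Assumes globally unique node IDs (hypothetical helper name).
naiveLca :: Eq a => [a] -> [a] -> [a]
naiveLca xs ys = go (drop (nx - n) xs) (drop (ny - n) ys)
  where
    nx = length xs
    ny = length ys
    n  = min nx ny
    -- Both arguments now have equal length; walk them in lock-step
    -- until the heads agree.
    go p@(a : p') (b : q')
      | a == b    = p
      | otherwise = go p' q'
    go _ _ = []
```

Both the trimming and the lock-step walk are linear in the path length, which is where the O(h) query cost comes from.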

Page 14: Purely Functional Data Structures for On-Line LCA

Naïve On-line LCA

No preprocessing step.

O(h) LCA query time where h is the length of the path.

O(1) to extend a path.

No need to store the entire tree, just the paths you are currently using. This helps with distribution and parallelization.

As an on-line algorithm, the tree can grow without requiring costly recalculations.

Page 15: Purely Functional Data Structures for On-Line LCA

Naïve On-line LCA

To go faster we’d need to extract a common suffix in sublinear time. Very Well…

Page 16: Purely Functional Data Structures for On-Line LCA

Data Structures from Number Systems

We are already familiar with at least one data structure derived from a number system.

data Nat = Zero | Succ Nat

data List a = Nil | Cons a (List a)

O(1) succ grants us O(1) cons
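One way to see the correspondence is to forget the elements of a list and keep only its shape. The sketch below is illustrative; the names size and cons are assumptions of this example, not definitions from the slides.

```haskell
data Nat = Zero | Succ Nat deriving (Eq, Show)
data List a = Nil | Cons a (List a) deriving (Eq, Show)

-- Forgetting the elements of a list leaves its length as a unary number.
size :: List a -> Nat
size Nil         = Zero
size (Cons _ xs) = Succ (size xs)

-- cons mirrors succ: a single constructor application, hence O(1).
cons :: a -> List a -> List a
cons = Cons
```

Under this view, size (cons a xs) is always Succ (size xs), which is the sense in which the list is "derived from" the unary number system.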

Page 17: Purely Functional Data Structures for On-Line LCA

Binary Random-Access Lists

We could construct a data structure from binary numbers as well, where you have a linked list of “flags” with 2^n elements in them.

However, adding 1 to a binary number can affect all log n digits in the number, yielding O(log n) cons.
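The carry behaviour can be seen on a plain list of bits. This is a small sketch with hypothetical names (Bit, incr), not part of the talk; digits are stored least significant first.

```haskell
-- Binary digits, least significant first.
data Bit = O | I deriving (Eq, Show)

-- Adding one flips trailing I digits to O until it reaches an O (or runs
-- out of digits), so a run of k ones costs k + 1 steps.
incr :: [Bit] -> [Bit]
incr []       = [I]
incr (O : bs) = I : bs
incr (I : bs) = O : incr bs
```

A cons on a binary random-access list has to do the analogous cascade of merges, which is why it costs O(log n) in the worst case.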

Page 18: Purely Functional Data Structures for On-Line LCA

Skew-Binary Numbers

The nth digit has value 2^(n+1) - 1, and each digit is 0, 1, or 2.

We only allow a single 2 in the number, which must be the first (least significant) non-zero digit.

Every natural number can be uniquely represented by this scheme.

succ is an O(1) operation.

There are 2^(n+1) - 1 nodes in a complete tree of height n.

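Digit weights: 15, 7, 3, 1

Counting from 0 to 15 in skew binary:

 0 = 0
 1 = 1
 2 = 2
 3 = 1 0
 4 = 1 1
 5 = 1 2
 6 = 2 0
 7 = 1 0 0
 8 = 1 0 1
 9 = 1 0 2
10 = 1 1 0
11 = 1 1 1
12 = 1 1 2
13 = 1 2 0
14 = 2 0 0
15 = 1 0 0 0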
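The O(1) successor is easiest to see in a sparse representation that lists only the weights of the non-zero digits, smallest first, with a digit 2 written as a repeated weight. This representation and the names Skew and inc are assumptions of this sketch, not taken from the slides.

```haskell
-- Sparse skew-binary number: the weights of its non-zero digits, smallest
-- first. A digit 2 shows up as the same weight listed twice, and by the
-- invariant that can only happen at the front.
type Skew = [Int]

-- If the two smallest weights are equal, merge them with the new unit into
-- the next weight up; otherwise just add a new digit of weight 1.
inc :: Skew -> Skew
inc (w1 : w2 : ws) | w1 == w2 = (1 + w1 + w2) : ws
inc ws = 1 : ws
```

For example, 13 is written 1 2 0 in the table above, which is [3,3,7] here, and inc turns it into [7,7], that is 2 0 0. Only the first two entries are ever inspected, so no carry chain can form.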

Page 19: Purely Functional Data Structures for On-Line LCA

Skew-Binary Random Access Lists

We store a linked list of complete trees, where we are allowed to have two trees of the same size at the front of the list, but after that all trees are of strictly increasing height.

data Tree a = Tip a | Bin a (Tree a) (Tree a)
data Path a = Nil | Cons !Int !Int (Tree a) (Path a)

length :: Path a -> Int
length Nil = 0
length (Cons n _ _ _) = n

I call these random-access lists a Path here, because of our use case.

Page 20: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

Page 22: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 1: a single tree of size 1 holding 1.]

Page 23: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 2: two trees of size 1, holding 2 and 1.]

Page 24: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 3: one tree of size 3, with root 3 and children 2 and 1.]

Page 25: Purely Functional Data Structures for On-Line LCA

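Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 4: a tree of size 1 holding 4, then the tree of size 3 with root 3 and children 2 and 1.]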

Page 26: Purely Functional Data Structures for On-Line LCA

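Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 5: two trees of size 1 holding 5 and 4, then the tree of size 3 with root 3 and children 2 and 1.]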

Page 27: Purely Functional Data Structures for On-Line LCA

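Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 6: two trees of size 3, the first with root 6 and children 5 and 4, the second with root 3 and children 2 and 1.]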

Page 28: Purely Functional Data Structures for On-Line LCA

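Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 7: one tree of size 7 with root 7, whose subtrees are 6 with children 5 and 4, and 3 with children 2 and 1.]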

Page 29: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

[Figure: the path after cons 8: a tree of size 1 holding 8, then the tree of size 7 rooted at 7.]

Page 30: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

-- O(1)
cons :: a -> Path a -> Path a
cons a (Cons n w t (Cons _ w' t2 ts))
  | w == w' = Cons (n + 1) (2 * w + 1) (Bin a t t2) ts
cons a ts = Cons (length ts + 1) 1 (Tip a) ts
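To watch cons reproduce the skew-binary counting pattern, the definitions from the slides can be assembled into a small module. The helpers fromList, weights and toList are assumptions of this sketch (fromList is used on later slides, but its definition is not shown there).

```haskell
import Prelude hiding (length)

data Tree a = Tip a | Bin a (Tree a) (Tree a)
data Path a = Nil | Cons !Int !Int (Tree a) (Path a)

length :: Path a -> Int
length Nil            = 0
length (Cons n _ _ _) = n

-- O(1): merge the two front trees when they have equal weight,
-- otherwise push a new singleton tree.
cons :: a -> Path a -> Path a
cons a (Cons n w t (Cons _ w' t2 ts))
  | w == w' = Cons (n + 1) (2 * w + 1) (Bin a t t2) ts
cons a ts = Cons (length ts + 1) 1 (Tip a) ts

-- Hypothetical helper: build a path whose head is the first list element.
fromList :: [a] -> Path a
fromList = foldr cons Nil

-- Hypothetical helper: the weight of each tree along the spine.
weights :: Path a -> [Int]
weights Nil             = []
weights (Cons _ w _ ts) = w : weights ts

-- Hypothetical helper: flatten a path back to a list, head first.
toList :: Path a -> [a]
toList Nil             = []
toList (Cons _ _ t ts) = go t (toList ts)
  where
    go (Tip a)     rest = a : rest
    go (Bin a l r) rest = a : go l (go r rest)
```

With these, weights (fromList [6,5,4,3,2,1]) is [3,3] and weights (fromList [8,7,6,5,4,3,2,1]) is [1,7], matching the skew-binary digits 2 0 and 1 0 1 for 6 and 8.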

Page 31: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

lca :: Eq a => Path a -> Path a -> Path a
lca xs ys = case compare nxs nys of
    LT -> lca' xs (keep nxs ys)
    EQ -> lca' xs ys
    GT -> lca' (keep nys xs) ys
  where
    nxs = length xs
    nys = length ys

Page 32: Purely Functional Data Structures for On-Line LCA

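Skew-Binary Keep

O(log (h - k)) to keep the top k elements (those nearest the root) of a path of height h.

[Figure: the six-element path as two trees of size 3: root 6 with children 5 and 4, then root 3 with children 2 and 1.]

keep 2 (fromList [6,5,4,3,2,1])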

Page 33: Purely Functional Data Structures for On-Line LCA

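Skew-Binary Keep

O(log (h - k)) to keep the top k elements (those nearest the root) of a path of height h.

[Figure: the six-element path as two trees of size 3: root 6 with children 5 and 4, then root 3 with children 2 and 1.]

keep 2 (fromList [6,5,4,3,2,1]) =

keep 2 (fromList [3,2,1])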

Page 34: Purely Functional Data Structures for On-Line LCA

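Skew-Binary Keep

O(log (h - k)) to keep the top k elements (those nearest the root) of a path of height h.

[Figure: the six-element path as two trees of size 3: root 6 with children 5 and 4, then root 3 with children 2 and 1.]

keep 2 (fromList [6,5,4,3,2,1])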

Page 35: Purely Functional Data Structures for On-Line LCA

Skew-Binary Keep

O(log (h - k)) to keep the top k elements (those nearest the root) of a path of height h.

keep :: Int -> Path a -> Path a
keep _ Nil = Nil
keep k xs@(Cons n w t ts)
  | k >= n = xs
  | otherwise = case compare k (n - w) of
      GT -> keepT (k - n + w) w t ts
      EQ -> ts
      LT -> keep k ts

consT :: Int -> Tree a -> Path a -> Path a
consT w t ts = Cons (w + length ts) w t ts

keepT :: Int -> Int -> Tree a -> Path a -> Path a
keepT n w (Bin _ l r) ts = case compare n w2 of
    LT -> keepT n w2 r ts
    EQ -> consT w2 r ts
    GT | n == w - 1 -> consT w2 l (consT w2 r ts)
       | otherwise -> keepT (n - w2) w2 l (consT w2 r ts)
  where w2 = div w 2
keepT _ _ _ ts = ts
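The keep example from the earlier slides can be run by putting the pieces together in one module. This is a sketch: the data types, cons, keep, consT and keepT follow the slides, while the fromList and toList definitions are assumptions added here for building and inspecting paths.

```haskell
import Prelude hiding (length)

data Tree a = Tip a | Bin a (Tree a) (Tree a)
data Path a = Nil | Cons !Int !Int (Tree a) (Path a)

length :: Path a -> Int
length Nil            = 0
length (Cons n _ _ _) = n

cons :: a -> Path a -> Path a
cons a (Cons n w t (Cons _ w' t2 ts))
  | w == w' = Cons (n + 1) (2 * w + 1) (Bin a t t2) ts
cons a ts = Cons (length ts + 1) 1 (Tip a) ts

-- Put a whole tree of weight w back on the front of a path.
consT :: Int -> Tree a -> Path a -> Path a
consT w t ts = Cons (w + length ts) w t ts

-- Keep the k elements nearest the root: skip whole trees along the spine,
-- then descend into at most one tree.
keep :: Int -> Path a -> Path a
keep _ Nil = Nil
keep k xs@(Cons n w t ts)
  | k >= n    = xs
  | otherwise = case compare k (n - w) of
      GT -> keepT (k - n + w) w t ts
      EQ -> ts
      LT -> keep k ts

-- Keep the last n elements of a tree of weight w, in front of ts.
keepT :: Int -> Int -> Tree a -> Path a -> Path a
keepT n w (Bin _ l r) ts = case compare n w2 of
    LT -> keepT n w2 r ts
    EQ -> consT w2 r ts
    GT | n == w - 1 -> consT w2 l (consT w2 r ts)
       | otherwise  -> keepT (n - w2) w2 l (consT w2 r ts)
  where w2 = div w 2
keepT _ _ _ ts = ts

-- Hypothetical helpers for building and inspecting paths.
fromList :: [a] -> Path a
fromList = foldr cons Nil

toList :: Path a -> [a]
toList Nil             = []
toList (Cons _ _ t ts) = go t (toList ts)
  where
    go (Tip a)     rest = a : rest
    go (Bin a l r) rest = a : go l (go r rest)
```

Here toList (keep 2 (fromList [6,5,4,3,2,1])) evaluates to [2,1]: the first tree is skipped whole, and the second is split once.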

Page 36: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go.

To compute lca x y, first cut both lists to have the same length.

Then keep dropping elements until the IDs match.

lca :: Eq a => Path a -> Path a -> Path a
lca xs ys = case compare nxs nys of
    LT -> lca' xs (keep nxs ys)
    EQ -> lca' xs ys
    GT -> lca' (keep nys xs) ys
  where
    nxs = length xs
    nys = length ys

Page 37: Purely Functional Data Structures for On-Line LCA

Comparing Node IDs

We can check to see if two paths have the same head or are both empty in O(1).

infix 4 ~=
(~=) :: Eq a => Path a -> Path a -> Bool
Nil ~= Nil = True
Cons _ _ s _ ~= Cons _ _ t _ = sameT s t
_ ~= _ = False

sameT :: Eq a => Tree a -> Tree a -> Bool
sameT xs ys = root xs == root ys

root :: Tree a -> a
root (Tip a) = a
root (Bin a _ _) = a

Page 38: Purely Functional Data Structures for On-Line LCA

Monotonicity

We can modify the algorithm for keep into an algorithm that takes any monotone predicate, one that transitions from False to True only once during the walk up the path, and yields a result in O(log h).

We have exactly one shape for a given number of elements, so we can walk the spines of the two random-access lists at the same time in lock-step. Because the shapes agree, this lets us modify the algorithm to work with a pair of paths.

(~=) is monotone, given globally unique IDs.

Page 39: Purely Functional Data Structures for On-Line LCA

Finding the Match

lca' requires the invariant that both paths have the same length. This is provided by the fact that lca, shown earlier, trims the lists first.

lca' :: Eq a => Path a -> Path a -> Path a
lca' h@(Cons _ w x xs) (Cons _ _ y ys)
  | sameT x y = h
  | xs ~= ys = lcaT w x y xs
  | otherwise = lca' xs ys
lca' _ _ = Nil

lcaT :: Eq a => Int -> Tree a -> Tree a -> Path a -> Path a
lcaT w (Bin _ la ra) (Bin _ lb rb) ts
  | sameT la lb = consT w2 la (consT w2 ra ts)
  | sameT ra rb = lcaT w2 la lb (consT w2 ra ts)
  | otherwise = lcaT w2 ra rb ts
  where w2 = div w 2
lcaT _ _ _ ts = ts
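The whole algorithm fits in one short module, sketched below from the definitions on the slides. Two things here are assumptions of this sketch: the fromList and toList helpers, and the choice to cons the right subtree back with its own weight w2 in the second lcaT branch, so that the cached weights and lengths stay consistent. For the maintained implementation, see the lca package on Hackage.

```haskell
import Prelude hiding (length)

data Tree a = Tip a | Bin a (Tree a) (Tree a)
data Path a = Nil | Cons !Int !Int (Tree a) (Path a)

length :: Path a -> Int
length Nil            = 0
length (Cons n _ _ _) = n

cons :: a -> Path a -> Path a
cons a (Cons n w t (Cons _ w' t2 ts))
  | w == w' = Cons (n + 1) (2 * w + 1) (Bin a t t2) ts
cons a ts = Cons (length ts + 1) 1 (Tip a) ts

-- Put a whole tree of weight w back on the front of a path.
consT :: Int -> Tree a -> Path a -> Path a
consT w t ts = Cons (w + length ts) w t ts

-- Keep the k elements nearest the root.
keep :: Int -> Path a -> Path a
keep _ Nil = Nil
keep k xs@(Cons n w t ts)
  | k >= n    = xs
  | otherwise = case compare k (n - w) of
      GT -> keepT (k - n + w) w t ts
      EQ -> ts
      LT -> keep k ts

keepT :: Int -> Int -> Tree a -> Path a -> Path a
keepT n w (Bin _ l r) ts = case compare n w2 of
    LT -> keepT n w2 r ts
    EQ -> consT w2 r ts
    GT | n == w - 1 -> consT w2 l (consT w2 r ts)
       | otherwise  -> keepT (n - w2) w2 l (consT w2 r ts)
  where w2 = div w 2
keepT _ _ _ ts = ts

root :: Tree a -> a
root (Tip a)     = a
root (Bin a _ _) = a

sameT :: Eq a => Tree a -> Tree a -> Bool
sameT x y = root x == root y

infix 4 ~=
(~=) :: Eq a => Path a -> Path a -> Bool
Nil ~= Nil = True
Cons _ _ s _ ~= Cons _ _ t _ = sameT s t
_ ~= _ = False

-- Trim the longer path, then search the two equal-shaped paths in lock-step.
lca :: Eq a => Path a -> Path a -> Path a
lca xs ys = case compare nxs nys of
    LT -> lca' xs (keep nxs ys)
    EQ -> lca' xs ys
    GT -> lca' (keep nys xs) ys
  where
    nxs = length xs
    nys = length ys

-- Requires both paths to have the same length, and so the same shape.
lca' :: Eq a => Path a -> Path a -> Path a
lca' h@(Cons _ w x xs) (Cons _ _ y ys)
  | sameT x y = h
  | xs ~= ys  = lcaT w x y xs
  | otherwise = lca' xs ys
lca' _ _ = Nil

lcaT :: Eq a => Int -> Tree a -> Tree a -> Path a -> Path a
lcaT w (Bin _ la ra) (Bin _ lb rb) ts
  | sameT la lb = consT w2 la (consT w2 ra ts)
  | sameT ra rb = lcaT w2 la lb (consT w2 ra ts) -- assumption: ra has weight w2
  | otherwise   = lcaT w2 ra rb ts
  where w2 = div w 2
lcaT _ _ _ ts = ts

-- Hypothetical helpers for building and inspecting paths.
fromList :: [a] -> Path a
fromList = foldr cons Nil

toList :: Path a -> [a]
toList Nil             = []
toList (Cons _ _ t ts) = go t (toList ts)
  where
    go (Tip a)     rest = a : rest
    go (Bin a l r) rest = a : go l (go r rest)
```

With the paths from the naïve example, toList (lca (fromList [5,4,3,2,1]) (fromList [6,3,2,1])) gives [3,2,1], the same answer the list-based version produces.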

Page 40: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

Naïve On-line LCA:

Build paths as lists of node IDs, using cons as you go. O(1)

To compute lca x y, first cut both lists to have the same length. O(h)

Then keep dropping elements until the IDs match. O(h)

Skew-Binary On-line LCA:

Build paths as lists of node IDs, using cons as you go. O(1)

To compute lca x y, first cut both lists to have the same length. O(log h)

Then keep dropping elements until the IDs match. O(log h)

Page 41: Purely Functional Data Structures for On-Line LCA

Skew-Binary On-line LCA

No preprocessing step.

O(log h) LCA query time where h is the length of the path.

O(1) to extend a path.

No need to store the entire tree, just the paths you are currently using. This helps with distribution and parallelization when working on large trees.

As an on-line algorithm, the tree can grow without requiring costly recalculations.

Preserves all of the benefits of the naïve algorithm, while drastically reducing the costs.

Page 42: Purely Functional Data Structures for On-Line LCA

Now What?

We found that skew-binary random access lists can be used to accelerate the naïve online LCA algorithm while retaining the desirable properties.

You can install a working version of this algorithm from Hackage:

cabal install lca

Next time I’ll talk about the applications of this algorithm to a “revision control” monad which can be used for parallel and incremental computation in Haskell.

I am working with Daniel Peebles on a proof of correctness and asymptotic performance in Agda.

Page 43: Purely Functional Data Structures for On-Line LCA

Any Questions?