Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart

Dec 13, 2015


Finding Optimal Probabilistic Generators for XML Collections

Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart


Adding probabilities to an XML Schema

• Given a collection of XML documents, we sometimes have a schema the documents conform to.

– E.g., DTD or XSD

– Restricts the structure, mostly parent-child node relations (using regular expressions)

• The schema may be very general (e.g., XHTML, RSS)

• We want to add probabilities that reflect the likelihood of different parts of the schema

– We will use the probabilities to turn the schema into a probabilistic generative model for XML documents

– In particular, we want them to maximize the likelihood of a given XML document or document collection


Motivation

Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer


One Application: XML Auto-Completion [SIGMOD 2012]

• Based on previous document versions / corpus of example documents

• Suggest nodes / sub-trees / node values to the user

• For example:

<MyPapers>
  <Paper>
    <title>XML for Beginners</title>
    <author>M. Jones</author>
    <author>H. Q. David</author>
    <author>L. Martin</author>
    <author>S. Smith</author>
  </Paper>
  <Paper>
    <title>Advanced XML</title>
    <author>M. Jones</author>
    <author>J. E. Peterson</author>
    <author>G. L. Williams</author>
  </Paper>
  <Paper>
    <title> </title>
    <author> </author>
    <author> </author>
    <author> </author>
  </Paper>
</MyPapers>

• Challenges:

– Allow editing in every part of the document

– What kind of completion to suggest?

– Finding the top-k best completions


Many Other Usages for a Probabilistic Schema

• Testing – e.g., generating many XML messages to simulate network load and test system performance.

• Explaining – e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTex, etc.

• Schema Evaluation – how well a given schema describes a given corpus.

• …


Our solution - An Outline


Preliminaries – Tree Automata

Generators for Schemas without Constraints

Restart Generators

Continuation-Test Generators

Leaf Values

Adding Constraints


Schema as a Deterministic Tree Automaton


Preliminaries

An XML document is modeled as an ordered tree.

Document d0: [Figure: root r with children a, b, c, in this order; leaf values shown as abcd, abcd, 532]

Schema validation: the children of an a-labeled node are accepted by the DFA A_a.

Automaton A_r, with L(A_r) = a*bc*$: [Figure: states q0, q1, q2; q0 loops on a and moves to q1 on b; q1 loops on c and moves to q2 on $]

Validation is performed for the children of every inner node.
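The validation step described on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `SCHEMA` dictionary and the function names are hypothetical, and only the root automaton A_r from the slide is encoded. Leaves are accepted without further checks, which is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import List

END = "$"  # end-of-children marker, as in L(A_r) = a*bc*$

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

# Transitions of A_r: q0 loops on a and reads b into q1;
# q1 loops on c and reads $ into q2.
A_R = {("q0", "a"): "q0", ("q0", "b"): "q1",
       ("q1", "c"): "q1", ("q1", END): "q2"}

# Hypothetical schema layout: one DFA per inner-node label,
# given as (transitions, initial state, accepting states).
SCHEMA = {"r": (A_R, "q0", {"q2"})}

def accepts(dfa, word):
    """Run the DFA over the child labels followed by the end marker."""
    delta, state, accepting = dfa
    for symbol in list(word) + [END]:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in accepting

def validate(node, schema=SCHEMA):
    """Check the children of every inner node against its label's DFA."""
    if not node.children:  # leaves are not checked in this sketch
        return True
    dfa = schema.get(node.label)
    if dfa is None:
        return False
    word = [child.label for child in node.children]
    return accepts(dfa, word) and all(validate(c, schema) for c in node.children)

d0 = Node("r", [Node("a"), Node("b"), Node("c")])
print(validate(d0))
```

A document whose root has children a, c would be rejected here, because A_r has no c-transition from q0.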


Using the Schema as a Generator

• Recall that we want to turn the schema from an acceptor into a probabilistic generative model.

• Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.

• Adding probabilities: we consider two problem settings

1. Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.

2. Additionally, imposing integrity constraints on the documents (e.g., key constraints)


Probabilistic Generator

• Each transition is assigned a probability

• We assume independent choices (a Markovian process), so the probability of a document is the product of the probabilities of the transitions used to generate it.

• In this case, Pr(d) = p_a · p_a · p_b · p_$

• The schema and generator ignore leaf values (for now!)

Without Constraints

[Figure: automaton A_r with states q0, q1, q2 and transition probabilities p_a on (q0, a), p_b on (q0, b), p_c on (q1, c), p_$ on (q1, $); example document d: root r with children a, a, b]
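As a rough illustration of the product rule above, the following Python sketch computes the probability of a sequence of child labels and samples such a sequence from the probabilistic automaton. The probability values in `PROBS` are placeholders chosen for the example (they are not from the talk), and the function names are hypothetical. Only the children of the root are modeled.

```python
import random

END = "$"
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", END): "q2"}
# Illustrative values only; outgoing probabilities of each state sum to 1.
PROBS = {("q0", "a"): 0.5, ("q0", "b"): 0.5,
         ("q1", "c"): 0.5, ("q1", END): 0.5}

def word_probability(word, delta=DELTA, probs=PROBS, start="q0"):
    """Product of the probabilities of the transitions used for `word`."""
    state, p = start, 1.0
    for symbol in list(word) + [END]:
        if (state, symbol) not in delta:
            return 0.0  # the word is not accepted by the automaton
        p *= probs[(state, symbol)]
        state = delta[(state, symbol)]
    return p

def sample_children(rng, delta=DELTA, probs=PROBS, start="q0"):
    """Draw transitions independently until the end marker is chosen."""
    state, word = start, []
    while True:
        symbols = [sym for (q, sym) in delta if q == state]
        weights = [probs[(state, sym)] for sym in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        if symbol == END:
            return word
        word.append(symbol)
        state = delta[(state, symbol)]

# Pr(d) = p_a * p_a * p_b * p_$ for the children a, a, b
print(word_probability(["a", "a", "b"]))
print(sample_children(random.Random(0)))
```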


Formal Problem Definition

• Given a corpus D of documents,

• and a deterministic schema S that accepts every document in D

• We want to find an optimal generator based on S:

– Find probabilities for the transitions of S that maximize the probability of generating D,

– i.e., the maximum likelihood estimator (MLE).


A Learning Algorithm


The frequency of using each transition during the corpus verification process is recorded.

Transition   Count
(q0, a)      1
(q0, b)      1
(q1, c)      1
(q1, $)      1

[Figure: automaton A_r (states q0, q1, q2) run over the children a, b, c of the root r]


An Algorithm for Probabilities Learning (Cont.)

This is repeated for every node in every corpus document. We set the probability of each transition to be its relative frequency.

Transition   Count   Probability
(q0, a)      1       1/2
(q0, b)      1       1/2
(q1, c)      1       1/2
(q1, $)      1       1/2

Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
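The counting-and-normalizing procedure can be sketched as below. This is a minimal Python illustration under the assumption that the corpus is given as a list of child-label sequences, one per inner node validated by the same automaton; the function names are hypothetical. Every sequence is assumed to be accepted by the automaton, as the problem definition requires.

```python
from collections import Counter
from fractions import Fraction

END = "$"
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", END): "q2"}

def count_transitions(corpus_words, delta=DELTA, start="q0"):
    """Record how often each transition is used while validating the corpus."""
    counts = Counter()
    for word in corpus_words:
        state = start
        for symbol in list(word) + [END]:
            counts[(state, symbol)] += 1
            state = delta[(state, symbol)]  # KeyError if the word is not accepted
    return counts

def learn_probabilities(corpus_words, delta=DELTA, start="q0"):
    """Relative frequency of each transition among those leaving its state."""
    counts = count_transitions(corpus_words, delta, start)
    totals = Counter()
    for (state, _), n in counts.items():
        totals[state] += n
    return {t: Fraction(counts[t], totals[t[0]])
            for t in delta if totals[t[0]] > 0}

# Corpus consisting of the single document d0, whose root has children a, b, c.
probs = learn_probabilities([["a", "b", "c"]])
for transition, p in probs.items():
    print(transition, p)
```

On this one-document corpus each of the four transitions is used once, so each gets probability 1/2, matching the table on the slide.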


An Additional Result

• Theorem: generation terminates with probability 1.

– Guaranteed only because of the choice of probabilities according to the corpus.


Integrity Constraints

• We want to support integrity constraints, which are used in XML schema languages.

• Key Constraint: the a-labeled leaves have unique values (unary key)

• Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves

• Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain

• Different types are considered in the literature [Fan & Libkin 2001; David, Libkin & Tan 2011]


Adding Constraints


New Problem

• We want to find optimal generators for XML schemas with constraints.

• Valid generator output: an XML document

1. that is accepted by the schema, and

2. for which there exists a valid leaf value assignment, i.e., one that does not violate the constraints

– Example: each of a, b, c is unique and contained in the others, so a valid document must have equally many a-, b- and c-labeled leaves

[Figure: example document trees rooted at r with a-, b- and c-labeled leaves]


Restart Generators

• A simple idea:

– Use a probabilistic generator to generate a document

– Check if it has a value assignment valid w.r.t. the constraints

– If not, 'restart' and try again until a valid document is generated

• Problem definition -- same as in the case without constraints (but now the schema includes constraints)

• Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME

– Proof: By translating the constraints to bounds on the number of unique values for each leaf label

• Bad news: number of restarts can be unboundedly large in an optimal generator
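The restart idea can be sketched as follows. This Python example is a simplified illustration, not the procedure from the paper: `has_valid_assignment` handles only unary keys and inclusion constraints over an unbounded value domain, using bounds on the number of distinct values per label. The transition probabilities, the `max_restarts` cap and all names are assumptions made for the sketch.

```python
import random
from collections import Counter

END = "$"
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", END): "q2"}
PROBS = {("q0", "a"): 0.5, ("q0", "b"): 0.5,   # illustrative values
         ("q1", "c"): 0.5, ("q1", END): 0.5}

def sample_children(rng, start="q0"):
    """Generate the root's children, ignoring the constraints."""
    state, word = start, []
    while True:
        symbols = [s for (q, s) in DELTA if q == state]
        symbol = rng.choices(symbols, weights=[PROBS[(state, s)] for s in symbols])[0]
        if symbol == END:
            return word
        word.append(symbol)
        state = DELTA[(state, symbol)]

def has_valid_assignment(counts, keys, inclusions):
    """Simplified check: bound the number of distinct values of each label.

    keys: labels whose leaves must carry pairwise distinct values.
    inclusions: pairs (sub, sup), values of sub-leaves appear among sup-leaves.
    """
    labels = set(counts) | set(keys) | {l for pair in inclusions for l in pair}
    upper = {l: counts.get(l, 0) for l in labels}
    lower = {l: upper[l] if l in keys else min(1, upper[l]) for l in labels}
    changed = True
    while changed:  # propagate lower bounds along the inclusions
        changed = False
        for sub, sup in inclusions:
            if lower[sup] < lower[sub]:
                lower[sup] = lower[sub]
                changed = True
    return all(lower[l] <= upper[l] for l in labels)

def restart_generate(rng, keys, inclusions, max_restarts=10_000):
    """Generate, test, and restart until a valid document is produced."""
    for restarts in range(max_restarts):
        word = sample_children(rng)
        if has_valid_assignment(Counter(word), keys, inclusions):
            return word, restarts
    raise RuntimeError("no valid document within max_restarts")

# Constraints: c is unique and c is included in a, hence |c| <= |a|.
word, restarts = restart_generate(random.Random(1), {"c"}, [("c", "a")])
print(word, restarts)
```

The cap on restarts is only a safety net for the example; as the slide notes, the number of restarts of an optimal generator can be unboundedly large.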


Continuation-test Generators

• Never make choices that lead to a 'dead end', thus always generate a valid document.

• We use a binary test to check if a choice has a continuation.

• Example: add to the schema of d0 the constraints:

– c is included in a

– c is unique

• The generation process:

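[Figure: automaton A_r with states q0, q1, q2 and transition probabilities p_a, p_b, p_c, p_$; generated document d: root r with children a, b, c]

Perform a continuation-test before taking the transition.

The constraints imply |c| ≤ |a|.

Pr(d) = p_a · p_b · p_c · 1 (the final $ has probability 1: a second c would violate |c| ≤ |a|, so $ is the only choice with a continuation)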


Learning Algorithm for Continuation-test Generators

• The probabilities are again relative frequencies, but only in cases where there was an alternative choice.

• The learned generator will generate as many c-s as a-s

Transition   Count   Probability
(q0, a)      1       1/2
(q0, b)      1       1/2
(q1, c)      1       1/1
(q1, $)      0       0/1

(q1, $) was chosen only when (q1, c) was not available.
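The learning rule and the resulting generator can be sketched for this specific example. In the Python sketch below the continuation test is hard-coded for the constraint |c| ≤ |a| on the automaton A_r; a general test is the NP-complete problem discussed in the talk and is not attempted here. The function names are hypothetical.

```python
import random
from collections import Counter
from fractions import Fraction

END = "$"
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q1", ("q1", END): "q2"}

def available(state, word):
    """Symbols at `state` that can still be continued to a valid document.

    Hard-coded for this example: c is unique and included in a, so |c| <= |a|.
    """
    options = [sym for (q, sym) in DELTA if q == state]
    if state == "q1" and word.count("c") + 1 > word.count("a"):
        options.remove("c")  # one more c could never be matched by an a
    return options

def learn(corpus_words, start="q0"):
    """Relative frequencies, counted only where an alternative existed."""
    chosen, offered = Counter(), Counter()
    for word in corpus_words:
        state, prefix = start, []
        for symbol in list(word) + [END]:
            if len(available(state, prefix)) > 1:
                chosen[(state, symbol)] += 1
                offered[state] += 1
            state = DELTA[(state, symbol)]
            if symbol != END:
                prefix.append(symbol)
    return {t: Fraction(chosen[t], offered[t[0]]) for t in DELTA if offered[t[0]]}

def generate(rng, probs, start="q0"):
    """Take only transitions that pass the continuation test."""
    state, word = start, []
    while True:
        options = available(state, word)
        if len(options) == 1:
            symbol = options[0]  # forced choice, probability 1
        else:
            weights = [float(probs[(state, s)]) for s in options]
            symbol = rng.choices(options, weights=weights)[0]
        if symbol == END:
            return word
        word.append(symbol)
        state = DELTA[(state, symbol)]

probs = learn([["a", "b", "c"]])
print(probs)
print(generate(random.Random(0), probs))
```

With d0 as the only corpus document, (q1, c) is learned with probability 1, so every generated document has exactly as many c-s as a-s.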


Results for Continuation-test Generators

• Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.

– Extensions to non-binary are discussed in the paper

• Theorem: Continuation-test is NP-Complete

– But only in the size of the schema; it is polynomial in the document size

– Both generation and finding the optimal generator are exponential in the schema size unless P=NP.

– Based on schema satisfiability test [David et al. 2011]

• Theorem: probability of termination for a continuation-test generator may be arbitrarily small!

– Proof: by construction of a simple, non-recursive schema

– Can be handled by adding a constraint on the document size.

– Sub-classes of schemas that guarantee termination?


Adding Values to the Structure

• So far our generators were used only for the document structure

• Leaf values may also have a distribution according to which they can be generated

– The distribution may be learned from the same document collection

• We will focus on the interesting case: generating leaf values for a schema with constraints


Leaf Values


Suggested Algorithm

• We start with a valid document skeleton

• Order labels by inclusion constraints (e.g., c, b, a)

• Choose a leaf with the 'smallest' (most included) label, together with leaves of the labels that include it

• Draw a value (from the domain) according to a given distribution.

• Use the PTIME test to verify validity; if it fails, revert the step

• Improvements are presented in the paper

[Figure: document skeleton, root r with leaves a, b, c; sample leaf values shown as abcd, abcd, efg]


Related Work

• Schema Satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]

• Probabilistic XML and Probabilistic Schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]

• Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]

• Schema Inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]

• AXML [Abiteboul, Benjelloun & Milo 2008]

• PCFGs [e.g., Chi & Geman 1998]


Summary


Conclusion

• A model for probabilistic XML generators

• Unconstrained case

– Generation and learning optimal generators can be done efficiently

– Termination is guaranteed

• Constrained case

– Restart generators

  • # of restarts is unbounded

– Continuation-test generators

  • Generation and learning optimal generators are expensive

  • Termination is not guaranteed

• Leaf value generation

• In the talk labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.

• Future work

– Efficient combinations of restart and continuation-test generators

– Experimental study


Thank You!

Q&A