
Reconstructing Textual Documents from n-grams

Apr 15, 2017

matthigalle
Transcript

Motivation: Privacy-preserving data mining

Share textual data for mutual benefit, the general good, or contractual reasons. But not all of it:

text analytics on private documents

marketplace scenarios [Cancedda ACL 2012]

copyright concerns


Problem

1. Given the n-gram information of a document d, how well can we reconstruct d?

2. If I want/have to share n-gram statistics, what is a good strategy to avoid reconstruction while preserving the utility of the data?


Example

s = $ a rose rose is a rose is a rose #

2-grams:

$ a: 1    a rose: 3    rose rose: 1    rose is: 2    is a: 2    rose #: 1

Note that the same 2-grams are obtained starting from:

s = $ a rose is a rose rose is a rose #

s = $ a rose is a rose is a rose rose #

⟹ Find large chunks of text of whose presence we are certain
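
For concreteness, here is a minimal sketch of the n-gram extraction step (Python; the function name and tokenization are mine, not from the talk):

```python
# Minimal sketch: count the n-grams of a tokenized document.
# $ and # are the begin/end markers used on the slides.
from collections import Counter

def ngram_counts(tokens, n=2):
    """Map each n-gram (a tuple of tokens) to its frequency."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

s = "$ a rose rose is a rose is a rose #".split()
for gram, count in sorted(ngram_counts(s).items()):
    print(" ".join(gram), count)   # reproduces the six counts above
```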


Problem Encoding

An n-gram corpus is encoded as a graph, a subgraph of the de Bruijn graph, where edges correspond to n-grams

[Figure: the de Bruijn graph of the 2-gram corpus, with nodes 0 = $, 1 = a, 2 = rose, 3 = is, 4 = # and labeled, weighted edges: ($ a, 1) from 0 to 1; (a rose, 3) from 1 to 2; (rose rose, 1) as a self-loop at 2; (rose is, 2) from 2 to 3; (is a, 2) from 3 to 1; (rose #, 1) from 2 to 4.]

The node path [2, 2, 3, 1] spells rose rose is a; the path [0, 1, 2] spells $ a rose.
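
A small sketch of this encoding, reusing ngram_counts and s from the example above (the dictionary layout is my choice, not the paper's):

```python
# Minimal sketch: encode an n-gram corpus as a subgraph of the
# de Bruijn graph. Nodes are (n-1)-grams; each n-gram becomes an
# edge labeled with the n-gram and weighted by its frequency.
def build_graph(counts):
    """counts: {n-gram tuple: frequency} -> {(src, tgt): (label, weight)}."""
    graph = {}
    for gram, freq in counts.items():
        src, tgt = gram[:-1], gram[1:]      # overlapping (n-1)-gram contexts
        graph[(src, tgt)] = (" ".join(gram), freq)
    return graph

graph = build_graph(ngram_counts(s))
for (src, tgt), (label, weight) in graph.items():
    print(src, "->", tgt, (label, weight))
```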


Problem Encoding

Given such a graph, each Eulerian path gives a plausible reconstruction.

Problem: find those parts that are common to all of them.

BEST Theorem (1951)

Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is

    T_w(G) · ∏_{v ∈ V} (d(v) − 1)!

where T_w(G) is the number of spanning trees directed towards the root at a fixed node w.
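
As an illustration, the count can be evaluated with the matrix-tree theorem, which gives T_w(G) as the determinant of a Laplacian minor. This is my sketch, not code from the paper; it assumes a connected graph in which every vertex has equal in- and out-degree, and it treats parallel edges as distinguishable:

```python
# Minimal sketch: count Eulerian circuits with the BEST theorem,
# ec = T_w(G) * prod over v of (d(v) - 1)!.
# edges: {(u, v): multiplicity}; parallel edges are treated as distinct.
from math import factorial
import numpy as np

def eulerian_circuit_count(edges):
    nodes = sorted({u for u, _ in edges} | {v for _, v in edges})
    idx = {v: i for i, v in enumerate(nodes)}
    L = np.zeros((len(nodes), len(nodes)))   # out-degree Laplacian D_out - A
    out_deg = dict.fromkeys(nodes, 0)
    for (u, v), m in edges.items():
        L[idx[u], idx[v]] -= m
        L[idx[u], idx[u]] += m
        out_deg[u] += m
    # Matrix-tree theorem: T_w(G) = det of L with row/column w removed
    # (the same value for every choice of w in an Eulerian graph).
    t_w = round(np.linalg.det(L[1:, 1:]))
    count = t_w
    for v in nodes:
        count *= factorial(out_deg[v] - 1)
    return count

# The rose graph, closed with a hypothetical edge 4 -> 0 so that an
# Eulerian circuit exists (each circuit crosses that edge once, so
# circuits correspond to the Eulerian paths from node 0 to node 4).
g = {(0, 1): 1, (1, 2): 3, (2, 2): 1, (2, 3): 2, (3, 1): 2, (2, 4): 1, (4, 0): 1}
print(eulerian_circuit_count(g))   # 72
```

On the rose graph this prints 72; since the 3!·2!·2! = 24 permutations of the identically labeled parallel edges all spell the same string, this is consistent with the 3 distinct reconstructions seen in the example.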


Definitions

ec(G): the set of all Eulerian paths of G

given a path c = e1, ..., en: ℓ(c) = [label(e1), ..., label(en)]

s(c) = label(e1).label(e2). ... .label(en) (overlapping concatenation)

Given G, we want a G* that:

1. is equivalent: {s(c) : c ∈ ec(G)} = {s(c) : c ∈ ec(G*)}

2. is irreducible: ∄ e1, e2 ∈ E* : [label(e1), label(e2)] appears in all ℓ(c), c ∈ ec(G*)

Given G*, we can just read the maximal blocks off the labels.
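
In the graph encoding sketched earlier, that final read-off step could look like this (my sketch; it assumes the {(src, tgt): (label, weight)} layout from above):

```python
# Minimal sketch: once the irreducible graph G* is computed, the
# maximal blocks are simply its edge labels; the longest labels are
# the text fragments whose presence in the document is certain.
def maximal_blocks(graph_star):
    return sorted((label for label, _ in graph_star.values()),
                  key=lambda lab: len(lab.split()), reverse=True)
```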


Example

s = $ a rose rose is a rose is a rose #

[Figure: the irreducible graph for this string, with nodes 0, 2, 4 and merged edges: ($ a rose, 1) from 0 to 2; (rose rose, 1) and (rose is a rose, 2) as self-loops at 2; (rose #, 1) from 2 to 4.]

Rule 1 (Pigeonhole rule)

α.δ occurs at least 4 times


Rule 2: non-local information

x is an “articulation point” [Tarjan 1971]

α.β occurs at least once


Main Result

Theorem

Both rules are correct and complete: their application on G leads to a graph G* that is equivalent to G and irreducible.


Experiments

Project Gutenberg: out-of-copyright (US) books; 1 000 random single books.

[Plot: mean of the average and the maximal block size (curves: average, maximal).]


Increasing Diversity

Instead of running on a single book, run on the concatenation of k books.

[Plot: average number of large blocks (≥ 100) vs k.]


Remove completeness assumption

Remove those n-grams whose frequency is < M.

[Plots: mean/max block size vs M, and error rate vs M (n = 5).]
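
The baseline is easy to state in code (a sketch; the names are mine):

```python
# Minimal sketch: the standard noisifying baseline, which withholds
# every n-gram observed fewer than M times before sharing the corpus.
def threshold_counts(counts, M):
    return {gram: c for gram, c in counts.items() if c >= M}

shared = threshold_counts(ngram_counts(s), M=2)   # keeps only: a rose, rose is, is a
```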


A better noisifying strategy

Instead of removing n-grams, add strategically chosen n-grams

[Plot: removing edges vs adding edges.]


Keep utility

[Plots: utility preserved when removing vs adding n-grams (panels: Removing, Adding).]


Conclusions

How well can textual documents be reconstructed from their list of n-grams

Resilience to standard noisifying approach

Better noisifying by adding (instead of removing) n-grams


Questions?


Appendix


Rule 1 (Pigeonhole rule)

Incoming edges of x: (⟨v1, x, ℓ1⟩, p1), ..., (⟨vn, x, ℓn⟩, pn); outgoing edges: (⟨x, w1, t1⟩, k1), ..., (⟨x, wm, tm⟩, km).

If ∃ i, j such that pi > d(x) − kj, then

E′ = (E \ {(⟨vi, x, ℓi⟩, a), (⟨x, wj, tj⟩, a)}) ∪ {(⟨vi, wj, ℓi.tj⟩, a)}, where a = pi − (d(x) − kj);

if a = d(x) then V′ = V \ {x}, else V′ = V.
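
A rough sketch of a single pass of Rule 1, using mutable [u, v, label, weight] edge records with token-tuple labels, where each node is named by its (n−1)-gram tuple. This is my illustration, not the paper's implementation: it skips self-loops at x and omits the a = d(x) vertex-removal case.

```python
# Minimal sketch of Rule 1 (pigeonhole): of the d(x) traversals of x,
# p arrive via a given in-edge and at most d(x) - k of them can avoid
# a given out-edge, so a = p - (d(x) - k) > 0 traversals are forced
# through the pair, and weight a can move onto a merged edge.
def pigeonhole_step(edges, x):
    ins  = [e for e in edges if e[1] == x and e[0] != x]
    outs = [e for e in edges if e[0] == x and e[1] != x]
    d = sum(e[3] for e in edges if e[1] == x)        # d(x): total in-weight
    for e_in in ins:
        for e_out in outs:
            a = e_in[3] - (d - e_out[3])             # a = p - (d(x) - k)
            if a <= 0:
                continue
            merged = e_in[2] + e_out[2][len(x):]     # overlapping concatenation
            edges.append([e_in[0], e_out[1], merged, a])
            e_in[3] -= a                             # shift weight a off both edges
            e_out[3] -= a
            d -= a                                   # x is traversed a fewer times
    edges[:] = [e for e in edges if e[3] > 0]        # drop exhausted edges
    return edges
```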


Rule 2: non-local information

x is a division point dividing G into components G1, G2. If d̂^in_G1(x) = 1 and d̂^out_G2(x) = 1 (with the corresponding edges (⟨v, x, ℓ⟩, p) and (⟨x, w, t⟩, k)), then

E′ = (E \ {(⟨v, x, ℓ⟩, 1), (⟨x, w, t⟩, 1)}) ∪ {(⟨v, w, ℓ.t⟩, 1)}, V′ = V.
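
The division points Rule 2 needs can plausibly be found as Tarjan-style articulation points of the underlying undirected graph, e.g. with networkx. This is my sketch, not the paper's implementation; it only locates candidates and leaves the per-component degree check, and the merge itself, to the caller.

```python
# Minimal sketch: candidate division points for Rule 2, computed as
# articulation points of the undirected view of the edge records.
import networkx as nx

def division_points(edges):
    """edges: iterable of [u, v, label, weight] records."""
    g = nx.Graph()
    g.add_edges_from((u, v) for u, v, _, _ in edges if u != v)  # drop self-loops
    return list(nx.articulation_points(g))
```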


Increasing Diversity

Instead of running on a single book, run on the concatenation of k books.

[Plot: mean of the average block size vs k.]
