Motivation: Privacy-preserving data mining
Share textual data for mutual benefit, for the general good, or for contractual reasons. But not all of it:
text analytics on private documents
marketplace scenarios [Cancedda ACL 2012]
copyright concerns
Problem
1. Given the n-gram information of a document d, how well can we reconstruct d?
2. If I want or have to share n-gram statistics, what is a good strategy to avoid reconstruction while preserving the utility of the data?
Example
s = $ a rose rose is a rose is a rose #
2-grams:
$ a: 1   a rose: 3   rose rose: 1   rose is: 2   is a: 2   rose #: 1
Note that the same 2-grams are obtained starting from:
s = $ a rose is a rose rose is a rose #
s = $ a rose is a rose is a rose rose #
⇒ Find large chunks of text of whose presence we are certain
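This ambiguity is easy to check mechanically on small inputs. The sketch below (plain Python; the function names are ours, not from the paper) enumerates every Eulerian path of the 2-gram multigraph by backtracking, and so rebuilds every string consistent with the counts:

```python
from collections import Counter

def bigram_counts(tokens):
    """Multiset of 2-grams of a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def reconstructions(counts, start):
    """All token strings whose 2-gram counts equal `counts`,
    i.e. all Eulerian paths of the de Bruijn graph, from `start`."""
    remaining = dict(counts)
    found = set()

    def walk(node, path):
        if not any(remaining.values()):      # every edge used: a full reconstruction
            found.add(" ".join(path))
            return
        for (u, v), c in list(remaining.items()):
            if u == node and c:              # traverse one copy of edge u -> v
                remaining[(u, v)] -= 1
                walk(v, path + [v])
                remaining[(u, v)] += 1       # backtrack

    walk(start, [start])
    return found

s = "$ a rose rose is a rose is a rose #".split()
candidates = reconstructions(bigram_counts(s), "$")
```

On this example the enumeration returns exactly the three strings above; only their common chunks are certain.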
Problem Encoding
An n-gram corpus is encoded as a graph, a subgraph of the de Bruijn graph, where edges correspond to n-grams.
[Figure: the de Bruijn graph, with nodes 0 ($), 1 (a), 2 (rose), 3 (is), 4 (#) and edges labelled with their 2-gram and count: $ a, 1; a rose, 3; rose rose, 1; rose is, 2; is a, 2; rose #, 1]
Problem Encoding
The node path [2, 2, 3, 1] spells out: rose rose is a
Problem Encoding
Given such a graph, each Eulerian path gives a plausible reconstruction
Problem: Find those parts that are common in all of them
BEST Theorem, 1951
Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is

T_w(G) · ∏_{v ∈ V} (d(v) − 1)!

where T_w(G) is the number of trees directed towards the root at a fixed node w.
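The theorem is easy to sanity-check numerically on a small loop-free example (the rose graph has a self-loop and is a path rather than a cycle, so this sketch, ours and not the paper's, uses a doubled triangle instead). T_w(G) is obtained from the matrix-tree theorem as a minor of the out-degree Laplacian, and the result is compared against direct enumeration:

```python
import math
import numpy as np

def best_count(edges, n):
    """Eulerian circuits of a connected, loop-free digraph on nodes 0..n-1
    with in-degree == out-degree everywhere, via the BEST theorem."""
    # T_w by the matrix-tree theorem: minor of the Laplacian L = D_out - A
    # (arborescences oriented towards root 0).
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1
        L[u, v] -= 1
    t_w = round(np.linalg.det(L[1:, 1:]))
    out_deg = [sum(1 for u, _ in edges if u == r) for r in range(n)]
    return t_w * math.prod(math.factorial(d - 1) for d in out_deg)

def brute_count(edges):
    """Direct count, fixing edges[0] as the starting edge so that each
    cyclic rotation class of circuits is counted exactly once."""
    def walk(node, used):
        if all(used):
            return int(node == edges[0][0])
        return sum(walk(v, used[:i] + (True,) + used[i + 1:])
                   for i, (u, v) in enumerate(edges)
                   if u == node and not used[i])
    return walk(edges[0][1], (True,) + (False,) * (len(edges) - 1))

K3 = [(0, 1), (1, 2), (2, 0), (0, 2), (2, 1), (1, 0)]  # triangle, both directions
```

Both counts agree here: T_w = 3 and (d(v) − 1)! = 1 at every node, giving 3 Eulerian circuits.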
Problem Encoding
The node path [0, 1, 2] spells out: $ a rose
Definitions
ec(G): the set of all Eulerian paths of G
Given a path c = e1, …, en: ℓ(c) = [label(e1), …, label(en)]
s(c) = label(e1).label(e2). … .label(en) (overlapping concatenation)
Given G, we want a graph G* that:
1. is equivalent: {s(c) : c ∈ ec(G)} = {s(c) : c ∈ ec(G*)}
2. is irreducible: ∄ e1, e2 ∈ E* such that [label(e1), label(e2)] appears in all ℓ(c), c ∈ ec(G*)
Given G*, we can just read the maximal blocks off the edge labels.
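For graphs small enough to enumerate, condition 2 can be tested directly: a pair of edges is mergeable exactly when their labels occur adjacently in every ℓ(c). A brute-force sketch (our own code, not the paper's rules; it re-enumerates the Eulerian paths of the 2-gram graph as label sequences):

```python
from collections import Counter

def label_sequences(counts, start):
    """All Eulerian paths of the 2-gram graph, as sequences of edge labels."""
    remaining = dict(counts)
    seqs = []

    def walk(node, labels):
        if not any(remaining.values()):
            seqs.append(tuple(labels))
            return
        for (u, v), c in list(remaining.items()):
            if u == node and c:
                remaining[(u, v)] -= 1
                walk(v, labels + [f"{u} {v}"])
                remaining[(u, v)] += 1

    walk(start, [])
    return seqs

def mergeable_pairs(seqs):
    """Label pairs that are adjacent (somewhere) in *every* path --
    exactly the pairs the irreducibility condition lets us merge."""
    pairs = {p for s in seqs for p in zip(s, s[1:])}
    return {p for p in pairs if all(p in zip(s, s[1:]) for s in seqs)}

s = "$ a rose rose is a rose is a rose #".split()
M = mergeable_pairs(label_sequences(Counter(zip(s, s[1:])), "$"))
```

On the rose example, ('$ a', 'a rose') and ('rose is', 'is a') are mergeable (yielding the blocks "$ a rose" and "rose is a"), while ('rose rose', 'rose #') is not.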
Example
s = $ a rose rose is a rose is a rose #
[Figure: the reduced, irreducible graph, with edges: $ a rose, 1; rose rose, 1; rose is a rose, 2; rose #, 1]
Rule 2: non-local information
x is an “articulation point” [Tarjan 1971]
α.β occurs at least once
Main Result
Theorem
Both rules are correct and complete: their application on G leads to a graph G* that is equivalent to G and irreducible.
Experiments
Gutenberg project: out-of-copyright (US) books; 1,000 randomly chosen single books.
[Figure: mean of the average and maximal block size]
Increasing Diversity
Instead of running on a single book, run on concatenation of k books.
[Figure: average number of large blocks (≥ 100)]
Remove completeness assumption
Remove those n-grams whose frequency is < M.
[Figures: mean/max block size vs M, and error rate vs M (n = 5)]
A better noisifying strategy
Instead of removing n-grams, add strategically chosen n-grams
[Figure: removing edges vs adding edges]
Conclusions
How well can textual documents be reconstructed from their list of n-grams?
Resilience to standard noisifying approach
Better noisifying by adding (instead of removing) n-grams
Rule 1 (Pigeonhole rule)
Incoming edges of x: (⟨v1, x, ℓ1⟩, p1), …, (⟨vn, x, ℓn⟩, pn); outgoing edges: (⟨x, w1, t1⟩, k1), …, (⟨x, wm, tm⟩, km).
If ∃ i, j such that pi > d(x) − kj, then
E′ = (E \ {(⟨vi, x, ℓi⟩, a), (⟨x, wj, tj⟩, a)}) ∪ {(⟨vi, wj, ℓi.tj⟩, a)}, where a = pi − (d(x) − kj);
if a = d(x) then V′ = V \ {x}, else V′ = V.
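A minimal sketch of one application of this rule (our own encoding, not the paper's: edges are a dict mapping (v, x, label) to a multiplicity, labels are token tuples for a 2-gram graph, and a self-loop is never paired with itself):

```python
def apply_rule1_once(edges):
    """Apply the pigeonhole rule at the first (node, edge pair) that fires.
    `edges`: {(u, v, label): count} with label a token tuple.
    Returns True iff a merge was performed."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    for x in nodes:
        ins = [(e, c) for e, c in edges.items() if e[1] == x]
        outs = [(e, c) for e, c in edges.items() if e[0] == x]
        d = sum(c for _, c in ins)            # d(x), assuming in-deg == out-deg
        for ei, p in ins:
            for ej, k in outs:
                if ei == ej:                  # never pair a self-loop with itself
                    continue
                a = p - (d - k)               # pigeonhole surplus
                if a > 0:
                    (vi, _, li), (_, wj, tj) = ei, ej
                    merged = li + tj[1:]      # overlapping concatenation l_i . t_j
                    for e in (ei, ej):        # remove a copies of each edge
                        edges[e] -= a
                        if edges[e] == 0:
                            del edges[e]
                    key = (vi, wj, merged)
                    edges[key] = edges.get(key, 0) + a
                    return True
    return False

# the 2-gram graph of "$ a rose rose is a rose is a rose #"
E = {("$", "a", ("$", "a")): 1, ("a", "rose", ("a", "rose")): 3,
     ("rose", "rose", ("rose", "rose")): 1, ("rose", "is", ("rose", "is")): 2,
     ("is", "a", ("is", "a")): 2, ("rose", "#", ("rose", "#")): 1}
did_merge = apply_rule1_once(E)
```

One application merges "$ a" with "a rose" into the forced block "$ a rose": every arrival at node a must leave through "a rose", so the merge is certain.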
Rule 2: non-local information
x is a division point dividing G into components G1 and G2. If d̂_in^{G1}(x) = 1 and d̂_out^{G2}(x) = 1, with edges (⟨v, x, ℓ⟩, p) and (⟨x, w, t⟩, k), then
E′ = (E \ {(⟨v, x, ℓ⟩, 1), (⟨x, w, t⟩, 1)}) ∪ {(⟨v, w, ℓ.t⟩, 1)}, V′ = V.
Increasing Diversity
Instead of running on a single book, run on concatenation of k books.
[Figure: mean of the average block size]