Finite state subautomata Application to Electronic Dictionaries Lamia Tounsi Polytech'Tours, Computer Science laboratory François Rabelais University of Tours, France [email protected]

Dec 17, 2015


Finite state subautomata: Application to Electronic Dictionaries

Lamia Tounsi

Polytech'Tours, Computer Science Laboratory

François Rabelais University of Tours, France

[email protected]


Motivation

• Deterministic finite-state automata (DFSA) are widely used in natural language processing.
• Goal: find all substructures in a given FSA.
• Searching for subautomata in a DFSA makes it possible to:
  - decompose a very large FSA into smaller ones
  - discover frequently occurring data
  - reduce memory consumption


Plan

• Mathematical preliminaries
  - Automaton
  - Subautomaton
• Search for subautomata
  - Smallest closed subautomaton (SCSA)
  - Smallest subautomaton (SSA)
• Application to automata representing dictionaries
• Factorization, indexing and compression
• Conclusion


Automaton

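A deterministic acyclic automaton is a tuple $A = \langle \Sigma, Q, \delta, q_i, q_f \rangle$ where:
• $\Sigma$ is the alphabet
• $Q$ is the finite set of states
• $\delta$ is the transition function, $\delta : Q \times \Sigma \to Q$
• $q_i$ is the initial state ($q_i \in Q$)
• $q_f$ is the final state ($q_f \in Q$)

Let $a \in \Sigma$ and $w \in \Sigma^*$:
$\delta(p, \varepsilon) = p$
$\delta(p, wa) = \delta(\delta(p, w), a)$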
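The recursive definition of the transition function on words can be illustrated with a minimal Python sketch. The automaton `DELTA` and the function name `delta_star` are hypothetical and chosen only for illustration; the sketch also assumes a partial transition function, returning `None` when a transition is undefined.

```python
from typing import Dict, Optional

# Hypothetical automaton: DELTA[state][symbol] = next state.
# State 0 plays the role of qi and state 5 the role of qf.
DELTA: Dict[int, Dict[str, int]] = {
    0: {"a": 1},
    1: {"b": 2, "c": 3},
    2: {"d": 4},
    3: {"d": 4},
    4: {"e": 5},
    5: {},
}


def delta_star(p: int, w: str) -> Optional[int]:
    """Extended transition function: follow the word w from state p."""
    state: Optional[int] = p
    for a in w:                      # delta(p, wa) = delta(delta(p, w), a)
        if state is None:
            return None              # an earlier transition was undefined
        state = DELTA[state].get(a)
    return state                     # for the empty word this is p itself


print(delta_star(0, "abde"))  # reaches the final state 5
print(delta_star(0, "ab"))    # stops in state 2
print(delta_star(0, "ax"))    # undefined transition: None
```

A word is recognized by this toy automaton when `delta_star(0, w)` equals the final state 5.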


Successors & predecessors

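$\mathrm{Succ}(p) = \{ q \in Q : \exists \sigma \in \Sigma,\ \delta(p, \sigma) = q \}$
$\mathrm{Succ}^*(p) = \{ q \in Q : \exists w \in \Sigma^*,\ \delta(p, w) = q \}$

$\mathrm{Pred}(p) = \{ q \in Q : \exists \sigma \in \Sigma,\ \delta(q, \sigma) = p \}$
$\mathrm{Pred}^*(p) = \{ q \in Q : \exists w \in \Sigma^*,\ \delta(q, w) = p \}$

Height:
• $H(q_f) = 0$
• $H(p) = \max_{q \in \mathrm{Succ}(p)} H(q) + 1$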
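The sets of successors and predecessors and the height can be sketched in Python as follows. The automaton `DELTA` and all function names are hypothetical. The sketch assumes every non-final state has at least one successor, as in a trimmed dictionary automaton.

```python
# Hypothetical automaton: DELTA[state][symbol] = next state; 5 is qf.
DELTA = {
    0: {"a": 1},
    1: {"b": 2, "c": 3},
    2: {"d": 4},
    3: {"d": 4},
    4: {"e": 5},
    5: {},
}
Q_F = 5


def succ(p):
    """Succ(p): states reached from p by one transition."""
    return set(DELTA[p].values())


def pred(p):
    """Pred(p): states with one transition leading to p."""
    return {q for q, out in DELTA.items() if p in out.values()}


def closure(p, step):
    """Succ*(p) or Pred*(p), depending on step; p itself is included (empty word)."""
    seen, stack = {p}, [p]
    while stack:
        for r in step(stack.pop()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def height(p):
    """H(qf) = 0 and H(p) = max over Succ(p) of H(q), plus 1."""
    if p == Q_F:
        return 0
    return 1 + max(height(q) for q in succ(p))


print(succ(1), pred(4), closure(1, succ), height(0))
```

The recursive `height` recomputes shared suffixes; on a large automaton a memoized or bottom-up version would be preferable.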

[Figure: an automaton that recognizes the inflected forms of nine verbs, annotated with the heights H(14) = 4 and H(13) = 5.]


Source (E) & Initial State (p)

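Let $E \subseteq Q$.
• $AP(E) = \{ w : w \text{ is a path from } q_i \text{ to } p,\ p \in E \}$
• $AN(E) = \{ p \in Q : \forall w \in AP(E),\ p \in w \}$

$\mathrm{Source}(E) \in AN(E)$ is the state such that:
$H(\mathrm{Source}(E)) = \min_{q \in AN(E)} H(q)$

Let $p \in Q$, $p \neq q_i$:
$IS(p) = \mathrm{Source}(\mathrm{Pred}(p))$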
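One possible way to compute Source(E) and IS(p) is sketched below. It reads AN(E) as the set of states lying on every path from qi to a state of E, and computes these sets by one pass over the states in decreasing height. The automaton and all names are hypothetical, and this is not necessarily the procedure used by the author.

```python
# Hypothetical automaton: DELTA[state][symbol] = next state.
DELTA = {
    0: {"a": 1},
    1: {"b": 2, "c": 3},
    2: {"d": 4},
    3: {"d": 4},
    4: {"e": 5},
    5: {},
}
Q_I, Q_F = 0, 5


def pred(p):
    return {q for q, out in DELTA.items() if p in out.values()}


def heights():
    """H(qf) = 0, H(p) = 1 + max height of the successors of p."""
    h = {}

    def visit(p):
        if p not in h:
            h[p] = 0 if p == Q_F else 1 + max(visit(q) for q in DELTA[p].values())
        return h[p]

    for p in DELTA:
        visit(p)
    return h


H = heights()


def states_on_all_paths():
    """on_all[q] = states found on every path from Q_I to q (q included)."""
    on_all = {}
    # Predecessors always have a strictly larger height, so decreasing
    # height is a valid processing order.
    for q in sorted(DELTA, key=H.get, reverse=True):
        preds = pred(q)
        shared = set.intersection(*(on_all[p] for p in preds)) if preds else set()
        on_all[q] = shared | {q}
    return on_all


ON_ALL = states_on_all_paths()


def source(E):
    an = set.intersection(*(ON_ALL[p] for p in E))  # AN(E)
    return min(an, key=H.get)                       # state of minimal height


def initial_state(p):
    """IS(p) = Source(Pred(p)), defined for p != Q_I."""
    return source(pred(p))


print(source({2, 3}), initial_state(4), initial_state(5))
```

Sink(E) and FS(p) can be obtained symmetrically by working on the reversed transitions and taking the state of maximal height.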


Source (E) & Initial State (p)

Source(q2, q3, q5) = Source(q3, q4) = q2

Source(q3, q4, q5) = Source(q3, q4, q5 , q6) = q1

IS(q3)= q2

IS(q5)= q1

IS(q6)= q1


Sink (E) & Final State (p)

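Let $E \subseteq Q$.
• $PP(E) = \{ w : w \text{ is a path from } p \text{ to } q_f,\ p \in E \}$
• $PN(E) = \{ p \in Q : \forall w \in PP(E),\ p \in w \}$

$\mathrm{Sink}(E) \in PN(E)$ is the state such that:
$H(\mathrm{Sink}(E)) = \max_{q \in PN(E)} H(q)$

Let $p \in Q$, $p \neq q_f$ (the final state has no successors):
$FS(p) = \mathrm{Sink}(\mathrm{Succ}(p))$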


Subautomaton (SA)

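$A' = \langle \Sigma, Q', \delta', s_i, s_f \rangle$ is a subautomaton of $A$ iff:
• $Q' \subseteq Q$
• $\{s_i, s_f\} \subseteq Q'$
• $\delta' : Q' \times \Sigma \to Q'$ with $\forall (q, \sigma) \in Q' \times \Sigma : \delta'(q, \sigma) = \delta(q, \sigma)$
• $\forall q \in Q' : q \in \mathrm{Succ}^*(s_i)$ and $q \in \mathrm{Pred}^*(s_f)$
• $\forall q \in Q' \setminus \{s_i, s_f\} : \mathrm{Succ}(q) \subseteq Q'$ and $\mathrm{Pred}(q) \subseteq Q'$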
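The conditions of the definition can be checked mechanically. The sketch below is a direct, unoptimized transcription under stated assumptions: the automaton `DELTA` and the function names are hypothetical, and the condition on the transition function is treated as implicit, since it is the restriction of the original one.

```python
# Hypothetical automaton with an extra transition 0 --f--> 4.
DELTA = {
    0: {"a": 1, "f": 4},
    1: {"b": 2, "c": 3},
    2: {"d": 4},
    3: {"d": 4},
    4: {"e": 5},
    5: {},
}


def succ(p):
    return set(DELTA[p].values())


def pred(p):
    return {q for q, out in DELTA.items() if p in out.values()}


def closure(start, step):
    """Succ*(start) or Pred*(start), depending on step."""
    seen, stack = {start}, [start]
    while stack:
        for r in step(stack.pop()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def is_subautomaton(states, s_i, s_f):
    """Check the set-based conditions of the subautomaton definition."""
    if not states <= set(DELTA) or not {s_i, s_f} <= states:
        return False
    # Every state must be reachable from s_i and must reach s_f.
    if not states <= closure(s_i, succ) & closure(s_f, pred):
        return False
    # Internal states keep all their successors and predecessors.
    return all(succ(q) <= states and pred(q) <= states
               for q in states - {s_i, s_f})


print(is_subautomaton({1, 2, 3, 4}, 1, 4))  # the diamond between 1 and 4
print(is_subautomaton({0, 1, 4}, 0, 4))     # state 1 loses successors 2 and 3
```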

[Figures (four slides): the automaton that recognizes the inflected forms of nine verbs, each time with a subautomaton (SA) highlighted.]

Closed subautomaton (CSA)

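Let $Q' \subseteq Q$ and let $s_i, s_f$ be two distinct states.

A subautomaton $A' = \langle \Sigma, Q', \delta', s_i, s_f \rangle$ is a closed subautomaton iff:
• $\forall q \in Q' \setminus \{s_i\} : \mathrm{Pred}(q) \subseteq Q'$
• $\forall q \in Q' \setminus \{s_f\} : \mathrm{Succ}(q) \subseteq Q'$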
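The two closure conditions can be tested in a few lines. This sketch assumes that the candidate set of states is already known to form a subautomaton. The automaton `DELTA` and the name `is_closed` are hypothetical.

```python
# Hypothetical automaton with an extra transition 0 --f--> 4.
DELTA = {
    0: {"a": 1, "f": 4},
    1: {"b": 2, "c": 3},
    2: {"d": 4},
    3: {"d": 4},
    4: {"e": 5},
    5: {},
}


def succ(p):
    return set(DELTA[p].values())


def pred(p):
    return {q for q, out in DELTA.items() if p in out.values()}


def is_closed(states, s_i, s_f):
    """Closed: only s_i may be entered from outside, only s_f may be left."""
    no_entry = all(pred(q) <= states for q in states - {s_i})
    no_exit = all(succ(q) <= states for q in states - {s_f})
    return no_entry and no_exit


print(is_closed({1, 2, 3, 4}, 1, 4))  # False: state 4 is also entered from 0
print(is_closed({4, 5}, 4, 5))        # True
print(is_closed(set(DELTA), 0, 5))    # True: the whole automaton
```

In this toy automaton the diamond between states 1 and 4 is a subautomaton but not a closed one, because of the transition from 0 to 4.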

[Figures (three slides): the automaton that recognizes the inflected forms of nine verbs, each time with a closed subautomaton (CSA) highlighted.]

Smallest Closed subautomaton (SCSA)

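Let $Q' \subseteq Q$ and let $s_i, s_f$ be two distinct states.

A closed subautomaton $A' = \langle \Sigma, Q', \delta', s_i, s_f \rangle$ is a smallest closed subautomaton iff, $\forall q \in Q'$:
• $(s_i, q)$ is a CSA $\Rightarrow q = s_f$
• $(q, s_f)$ is a CSA $\Rightarrow q = s_i$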

[Figure: the automaton that recognizes the inflected forms of nine verbs, with four smallest closed subautomata (SCSA) highlighted.]

Smallest subautomaton (SSA)

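Let $p \in Q \setminus \{s_i, s_f\}$.

The subautomaton $A' = \langle \Sigma, Q', \delta', s_i, s_f \rangle$ is $SSA(p)$ iff:
- $A'$ strictly contains $p$
- $\forall A'' = \langle \Sigma, Q'', \delta'', s''_i, s''_f \rangle$ which strictly contains $p$ : $Q' \subseteq Q''$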

[Figure: the automaton that recognizes the inflected forms of nine verbs, with SSA(6) and SSA(18) highlighted.]

Search for SCSA

Property 1.
$(s_i, s_f)$ is an SCSA iff $IS(s_f) = s_i$ and $FS(s_i) = s_f$.

Property 2. (Associativity)
If $E = E_1 \cup E_2$ with $E_1 \neq \emptyset$ and $E_2 \neq \emptyset$, then
$\mathrm{Source}(E) = \mathrm{Source}(\mathrm{Source}(E_1), \mathrm{Source}(E_2))$

Property 3. (Hierarchy between two SCSA)
• Either they have no common transitions,
• or one is strictly included in the other.
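Property 1 suggests a direct, if naive, way to list the SCSA: compute IS and FS for every state and keep the pairs that match. The sketch below does this on a hypothetical automaton, with hypothetical names; it is a brute-force illustration and not the O(n²) algorithm mentioned later in the talk. Under this reading, a single transition whose endpoints have no other link also satisfies the property.

```python
# Hypothetical automaton: DELTA[state][symbol] = next state.
DELTA = {
    0: {"a": 1},
    1: {"b": 2, "c": 3},
    2: {"d": 4},
    3: {"d": 4},
    4: {"e": 5},
    5: {},
}
Q_I, Q_F = 0, 5

SUCC = {p: set(out.values()) for p, out in DELTA.items()}
PRED = {p: {q for q in DELTA if p in SUCC[q]} for p in DELTA}

H = {}


def height(p):
    if p not in H:
        H[p] = 0 if p == Q_F else 1 + max(height(q) for q in SUCC[p])
    return H[p]


def common_sets(order, neighbours):
    """sets[q] = q plus the states shared by sets[n] for all neighbours n of q."""
    sets = {}
    for q in order:
        around = [sets[n] for n in neighbours[q]]
        sets[q] = (set.intersection(*around) if around else set()) | {q}
    return sets


by_height = sorted(DELTA, key=height)
ON_ALL_FROM_QI = common_sets(reversed(by_height), PRED)  # on every path from qi
ON_ALL_TO_QF = common_sets(by_height, SUCC)              # on every path to qf


def source(E):
    return min(set.intersection(*(ON_ALL_FROM_QI[p] for p in E)), key=height)


def sink(E):
    return max(set.intersection(*(ON_ALL_TO_QF[p] for p in E)), key=height)


def scsa_pairs():
    """Pairs (si, sf) with IS(sf) = si and FS(si) = sf (Property 1)."""
    pairs = []
    for s_f in DELTA:
        if s_f == Q_I:
            continue
        s_i = source(PRED[s_f])        # IS(s_f)
        if sink(SUCC[s_i]) == s_f:     # FS(s_i)
            pairs.append((s_i, s_f))
    return pairs


print(scsa_pairs())
```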


Search for SCSA

Let $p \in Q$.
1. p.IS: initial state associated with p.
2. p.FSmin: minimal final state associated with p, assuming that p is the initial state of an SCSA.
3. p.FSmax: maximal final state associated with p, assuming that p is the initial state of an SCSA.

Property 4.
$\forall p > q_i$, (p.IS, p) is an SCSA iff p.IS.FSmin $\leq$ p $\leq$ p.IS.FSmax

Complexity of the algorithm: $O(n^2)$

[Figures (two slides): example showing the states is, FSmin and FSmax.]

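Search for SSA

Let $A' = \langle \Sigma, Q', \delta', s_i, s_f \rangle$ be a subautomaton.

Property 5.
• $\forall E \subseteq Q' \setminus \{s_f\} : \mathrm{Succ}^*(s_i) \cap \mathrm{Pred}^*(E) \subseteq Q'$
• $\forall E \subseteq Q' \setminus \{s_i\} : \mathrm{Pred}^*(s_f) \cap \mathrm{Succ}^*(E) \subseteq Q'$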

SSA associated with the grey states

[Figures (five slides): step-by-step construction of the SSA associated with a set E of grey states, locating its source and then its sink.]

Search for SSA

Property 6.
Let $p, p', q, q' \in Q$ such that:
• $\{p, p'\} \subseteq \mathrm{Pred}(q)$ and $\{q, q'\} \subseteq \mathrm{Succ}(p)$
• $H(p') \geq H(p)$ and $H(q') \leq H(q)$

Then p and q belong to the same SSA.


All Subautomata of an automaton

Algorithm. Input: A. Output: subautomata.

1: repeat
2:   repeat
3:     Detect, store and replace each parallel by one transition;
4:     Detect, store and replace each sequence by one transition;
5:   until the automaton is free of parallels and sequences
6:   Detect, store and replace each smallest subautomaton by one transition;
7: until the automaton A is reduced to a single transition

Valdes J., Tarjan R. E., Lawler E. L., The recognition of series parallel digraphs, SIAM J. Comput. 11(2):298-313, 1982.
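The inner loop of the algorithm (parallels and sequences) can be sketched as a series-parallel reduction on a list of labelled edges. This is a minimal sketch under stated assumptions: the edge-list representation, the function name `reduce_series_parallel` and the string encoding of merged labels are all hypothetical, and step 6 (replacing the remaining smallest subautomata) is not covered.

```python
from collections import Counter


def reduce_series_parallel(edges, q_i, q_f):
    """Repeatedly merge parallel transitions and two-transition sequences.

    edges: list of (source, target, label) with string labels.
    Returns the reduced edge list and the log of stored structures.
    """
    edges = list(edges)
    stored = []
    changed = True
    while changed:
        changed = False
        # Parallels: several transitions between the same pair of states.
        groups = {}
        for u, v, lab in edges:
            groups.setdefault((u, v), []).append(lab)
        if any(len(labs) > 1 for labs in groups.values()):
            changed = True
            edges = []
            for (u, v), labs in groups.items():
                if len(labs) > 1:
                    stored.append(("parallel", u, v, tuple(labs)))
                    edges.append((u, v, "(" + "|".join(labs) + ")"))
                else:
                    edges.append((u, v, labs[0]))
        # Sequences: a middle state with exactly one incoming and one
        # outgoing transition; merge one such state per pass.
        indeg = Counter(v for _, v, _ in edges)
        outdeg = Counter(u for u, _, _ in edges)
        for m in list(indeg):
            if m not in (q_i, q_f) and indeg[m] == 1 and outdeg[m] == 1:
                (u, _, l1), = [e for e in edges if e[1] == m]
                (_, v, l2), = [e for e in edges if e[0] == m]
                stored.append(("sequence", u, v, (l1, l2)))
                edges = [e for e in edges if m not in (e[0], e[1])]
                edges.append((u, v, l1 + l2))
                changed = True
                break
    return edges, stored


EDGES = [(0, 1, "a"), (1, 2, "b"), (1, 3, "c"),
         (2, 4, "d"), (3, 4, "d"), (4, 5, "e")]
reduced, log = reduce_series_parallel(EDGES, q_i=0, q_f=5)
print(reduced)
print([kind for kind, *_ in log])
```

On this series-parallel example the automaton collapses to one transition; an automaton that is not series-parallel would stop with several transitions left, which is where step 6 of the algorithm takes over.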


Dictionaries and automata

10 dictionaries, with words in lexicographic order:

• 6 Delaf: French, English, Serbian, German, English multiword units, French cities.

• 4 Web: French, Hungarian, Bulgarian and Portuguese.

Properties of the automata: finite set of states, acyclic, deterministic, unique initial state, unique final state, minimal.

Internal structure of automata

[Figures (two slides): internal structure of the automata; not recoverable from the transcript.]

Experimental Results

[Figure or table not recoverable from the transcript.]

Factorization, indexing and compression

The search for subautomata detects sequences and parallels.

[Figure: a sequence subautomaton and a parallel subautomaton.]

Proposal:
- apply the directed acyclic word graph (DAWG), initially dedicated to indexing text, to index the subautomata,
- use a heuristic to select the most interesting substructure to factorize.


Storage of an automaton

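[Figure: the example automaton stored in the table below.]

Address  Boolean  Character  Address of arrival state
1        1        a          8
2        0        c          3
3        1        a          5
4        0        b          6
5        1        b          7
6        1        c          10
7        1        c          9
8        1        b          11
9        1        d          0
10       1        d          11
11       1        b          0

Size of the fields:
• Character: log2(|Σ|) bits
• Address of arrival state: log2(Max address + 1) bits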
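The size of this table can be estimated from the field widths given on the slide. The sketch below assumes one bit for the Boolean and rounds each logarithm up to a whole number of bits; both choices, and all names, are assumptions made for illustration.

```python
from math import ceil, log2

# Rows of the table: (address, boolean, character, address of arrival state).
ROWS = [
    (1, 1, "a", 8), (2, 0, "c", 3), (3, 1, "a", 5), (4, 0, "b", 6),
    (5, 1, "b", 7), (6, 1, "c", 10), (7, 1, "c", 9), (8, 1, "b", 11),
    (9, 1, "d", 0), (10, 1, "d", 11), (11, 1, "b", 0),
]


def bits_per_transition(alphabet_size, max_address):
    """1 bit (Boolean) + log2(|alphabet|) + log2(max address + 1), rounded up."""
    return 1 + ceil(log2(alphabet_size)) + ceil(log2(max_address + 1))


alphabet = {ch for _, _, ch, _ in ROWS}
max_address = max(addr for addr, _, _, _ in ROWS)
row_bits = bits_per_transition(len(alphabet), max_address)
total_bits = len(ROWS) * row_bits

print(row_bits, total_bits)
```

With four symbols and eleven addresses, each row takes 1 + 2 + 4 = 7 bits under these assumptions, so the whole table takes 77 bits.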


Factorization

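[Figure: the example automaton before and after factorization.]

After factorization (7 transitions; the characters of rows 4 and 7 were lost in extraction):

Address  Boolean  Character      Address of arrival state
1        1        a              5
2        0        c              3
3        1        a              7
4        0        [symbol lost]  6
5        1        b              6
6        1        b              0
7        1        [symbol lost]  0

Before factorization: the 11-transition table shown under "Storage of an automaton".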

[Figure: successive factorization steps on the example automaton.]

How can we choose the subautomata to factorize?

- The best candidates for factorization are those that improve storage efficiency and reduce the size of the initial automaton.

Profit = saved memory - consumed memory

- Memory is saved by eliminating all occurrences of the substructure.

- Memory is consumed by the extension of the alphabet and by the index.
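The profit formula can be made concrete with a toy cost model. The model below is an assumption made for illustration only, since the slides do not give the exact costs: each occurrence of a sequence of `length` transitions is replaced by one transition, and the index stores the sequence once. The function and parameter names are hypothetical.

```python
def profit(length, occurrences, transition_bits, index_bits_per_symbol,
           alphabet_extension_bits=0):
    """Profit = saved memory - consumed memory, in bits (toy cost model)."""
    # Assumption: every occurrence frees (length - 1) rows of the table.
    saved = occurrences * (length - 1) * transition_bits
    # Assumption: the index stores the sequence once, symbol by symbol,
    # and extending the alphabet may add a fixed extra cost.
    consumed = length * index_bits_per_symbol + alphabet_extension_bits
    return saved - consumed


# A two-transition sequence seen twice, with 7-bit rows and 2-bit symbols.
print(profit(length=2, occurrences=2, transition_bits=7, index_bits_per_symbol=2))
# A sequence seen only once with an expensive index entry is not worth it.
print(profit(length=3, occurrences=1, transition_bits=3, index_bits_per_symbol=8))
```

A candidate is worth factorizing only when its profit is positive; a greedy selection would pick the candidate with the highest profit first.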


Directed Acyclic word graph (DAWG)

The frequency and the profit associated with each sequence are computed with a DAWG.

[Figure: DAWG of the word aabba.]
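How a DAWG gives the frequency of every sequence can be sketched with the classical online construction of the DAWG of a single word, also known as the suffix automaton. The code below is a generic textbook construction applied to the word aabba, not the author's implementation, and all names are hypothetical.

```python
def build_dawg(text):
    """Build the DAWG (suffix automaton) of text with occurrence counts."""
    nxt, link, length, count = [{}], [-1], [0], [0]
    last = 0
    for ch in text:
        cur = len(nxt)
        nxt.append({}); link.append(-1)
        length.append(length[last] + 1); count.append(1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps the shorter factors.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1); count.append(0)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # Propagate counts along suffix links, longest states first.
    for s in sorted(range(1, len(nxt)), key=lambda s: length[s], reverse=True):
        count[link[s]] += count[s]
    return nxt, count


def occurrences(dawg, factor):
    """Number of occurrences of a non-empty factor in the indexed word."""
    nxt, count = dawg
    state = 0
    for ch in factor:
        if ch not in nxt[state]:
            return 0
        state = nxt[state][ch]
    return count[state]


DAWG = build_dawg("aabba")
for factor in ("a", "b", "ab", "abba", "c"):
    print(factor, occurrences(DAWG, factor))
```

Each query runs in time proportional to the length of the factor, which is what makes the DAWG convenient for ranking candidate sequences by frequency.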


Greedy Algorithm of Compression

Algorithm. Input: A. Output: A, Alphabet.

1: Iterative process
2:   Select the best sequence s from the DAWG
3:   Extend the alphabet to represent s
4:   Delete s from A and from the DAWG
5:   Update the DAWG

Compression FCM, FCNM and FCDic

[Figures (three slides): compression results for FCM, FCNM and FCDic.]

Best Compressions

[Figures (two slides): best compressions; the only surviving labels are "1024" and "FCNM".]

Conclusion

• Search for two kinds of smallest subautomata

• Statistical analysis of the internal structure of some automata associated with dictionaries

• Compression method based on the factorization of sequence or parallel subautomata

• A minimized automaton does not always lead to the best compression.


Future work

• Factorize more kinds of subautomata.

• Find a way to de-minimize an automaton in order to get a better compression.

• Work on alternative encodings of automata, for example a depth-first encoding.