1ICML 2006, Grammatical Inference1
Colin de la Higuera, Université de Saint-EtienneTim Oates, University of Maryland
Grammar Induction: Techniques and Theory
2ICML 2006, Grammatical Inference2
Acknowledgements• Laurent Miclet, Tim Oates, Jose Oncina, Rafael Carrasco, PacoCasacuberta, Pedro Cruz, RémiEyraud, Philippe Ezequel, Henning Fernau, Jean-Christophe Janodet, Thierry Murgue, Frédéric Tantini, Franck Thollard, Enrique Vidal,...
• … and a lot of other people to whom we are grateful
3ICML 2006, Grammatical Inference3
Outline
1 An introductory example
2 About grammatical inference
3 Some specificities of the task
4 Some techniques and algorithms
5 Open issues and questions
4ICML 2006, Grammatical Inference4
1 How do we learn languages?
A very simple example
Carmel and Markovitch 98 & 99http://www.cs.technion.ac.il/~carmel/papers.html
5ICML 2006, Grammatical Inference5
The problem:
• An agent must take cooperative decisions in a multi-agent world.
• His decisions will depend:
– on the actions of other agents;
– on what he hopes to win or lose.
6ICML 2006, Grammatical Inference6
Hypothesis: the opponent follows a rational strategy (given by a DFA/Moore machine):
(figure: a four-state Moore machine giving the opponent's strategy; my moves are equations (e) or pictures (p), your moves are listen (l) or doze (d))
7ICML 2006, Grammatical Inference7
Example: (the prisoner’s dilemma)
• Each prisoner can admit (a) or stay silent (s)
• If both admit: 3 years each;
• If A admits but not B, A=0 years, B=5 years;
• If B admits but not A, B=0 years, A=5 years;
• If neither admits: 1 year each.
8ICML 2006, Grammatical Inference8
Payoff matrix (years for A, years for B), written as negative gains:

           B: a        B: s
A: a    (-3, -3)    (0, -5)
A: s    (-5, 0)     (-1, -1)
9ICML 2006, Grammatical Inference9
• Here we consider an iterated version, played against an opponent that follows a rational strategy.
• Gain Function: limit of means.
• A game is a word in (His_moves × My_moves)*!
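This encoding can be made concrete. A minimal sketch, assuming the pair encoding above and approximating the "limit of means" gain of a finite game by its average payoff per round:

```python
# My years in prison, negated so that larger is better:
# both admit -> -3, I admit alone -> 0, he admits alone -> -5, both silent -> -1
PAYOFF = {('a', 'a'): -3, ('s', 'a'): 0, ('a', 's'): -5, ('s', 's'): -1}

def mean_gain(game):
    """Average payoff of a finite game, a list of (his_move, my_move) pairs."""
    return sum(PAYOFF[(his, mine)] for his, mine in game) / len(game)

game = [('a', 'a'), ('a', 's'), ('s', 'a'), ('s', 's')]
print(mean_gain(game))  # (-3 + -5 + 0 + -1) / 4 = -2.25
```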
10ICML 2006, Grammatical Inference10
The general problem
• We suppose that the strategy of the opponent is given by a deterministic finite automaton.
• Can we imagine an optimal strategy?
11ICML 2006, Grammatical Inference11
Suppose we know the opponent’s strategy:
• Then (game theory):
• Consider the opponent’s graph in which we value the edges by our own gain.
12ICML 2006, Grammatical Inference12
(figure: the opponent's automaton with its edges valued by our own gains: -3 for (a,a), 0 for (s,a), -5 for (a,s), -1 for (s,s))
13ICML 2006, Grammatical Inference13
Then:
1 Find the cycle of maximum mean weight.
2 Find the best path leading to this cycle of maximum mean weight.
3 Follow the path and stay in the cycle.
All that is needed is to find the opponent's automaton!
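Step 1 above can be done in polynomial time, for example with Karp's algorithm. A sketch on a small weighted digraph; the graph and weights here are illustrative, not the tutorial's exact automaton:

```python
from math import inf

def max_mean_cycle(n, edges):
    """Karp-style computation of the maximum mean weight over all cycles.
    n: number of vertices (0..n-1); edges: list of (u, v, weight)."""
    # d[k][v] = maximum weight of any k-edge path ending at v
    d = [[-inf] * n for _ in range(n + 1)]
    for v in range(n):
        d[0][v] = 0.0
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] > -inf:
                d[k][v] = max(d[k][v], d[k - 1][u] + w)
    best = -inf
    for v in range(n):
        if d[n][v] == -inf:
            continue
        # Karp's theorem: the mean of the best cycle through v
        best = max(best, min((d[n][v] - d[k][v]) / (n - k) for k in range(n)))
    return best

# Two cycles: 0->1->0 with mean (2 + 0)/2 = 1, and the self-loop 2->2 with mean 3
edges = [(0, 1, 2), (1, 0, 0), (1, 2, 1), (2, 2, 3)]
print(max_mean_cycle(3, edges))  # 3.0
```

Step 2 (the best path leading into that cycle) can then be found with an ordinary longest-path dynamic program, since the graph restricted to non-cycle vertices is small.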
14ICML 2006, Grammatical Inference14
(figure: the same valued automaton with the best path highlighted, leading to a cycle of mean weight -0.5)
15ICML 2006, Grammatical Inference15
Question
• Having seen a game of this opponent…
• Can we reconstruct his strategy?
16ICML 2006, Grammatical Inference16
Data (him, me): {aa as sa aa as ss ss ss sa}
E.g. when I play asa, his next move is a.
17ICML 2006, Grammatical Inference17
λ→ a
a→a
as → s
asa → a
asaa → a
asaas → s
asaass → s
asaasss → s
asaasssa → s
(slides 18-31 animate the construction: the prefix/response table above is replayed while a DFA is built from a single initial state, adding states and transitions until every observed response is explained)
32ICML 2006, Grammatical Inference32
How do we get hold of the learning data?
a) through observation
b) through exploration
33ICML 2006, Grammatical Inference33
An open problem: the strategy is probabilistic.
(figure: a three-state automaton whose states output a with probability 70%, 50% and 20% respectively, and s with the complementary probability)
34ICML 2006, Grammatical Inference34
Tit for Tat
(figure: the two-state Tit for Tat automaton, which repeats the opponent's previous move)
35ICML 2006, Grammatical Inference35
2 Specificities of grammatical inference
Grammatical inference consists (roughly) of finding the (or a) grammar or automaton that has produced a given set of strings (sequences, trees, terms, graphs).
36ICML 2006, Grammatical Inference36
The goal/idea
• The ancient Greeks:
A whole is more than the sum of its parts.
• Gestalt theory:
A whole is different from the sum of its parts.
37ICML 2006, Grammatical Inference37
Better said
• There are cases where the data cannot be analyzed by considering it in bits
• There are cases where intelligibility of the pattern is important
38ICML 2006, Grammatical Inference38
What do people know about formal language theory? (anywhere from nothing to lots)
39ICML 2006, Grammatical Inference39
A small reminder on formal language theory
• Chomsky hierarchy
• + and – of grammars
40ICML 2006, Grammatical Inference40
A crash course in Formal language theory
• Symbols
• Strings
• Languages
• Chomsky hierarchy
• Stochastic languages
41ICML 2006, Grammatical Inference41
Symbols are taken from some alphabet Σ.
Strings are sequences of symbols from Σ.
42ICML 2006, Grammatical Inference42
Languages are sets of strings over Σ.
Languages are subsets of Σ*.
43ICML 2006, Grammatical Inference43
Special languages
• Are recognised by finite state automata
• Are generated by grammars
44ICML 2006, Grammatical Inference44
(figure: a three-state DFA over {a, b})
DFA: Deterministic Finite State Automaton
(the same figure, showing abab ∈ L)
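The acceptance test abab ∈ L can be run mechanically. A minimal sketch; the concrete states and transitions below (a DFA for (ab)*) are assumptions for illustration, since the slide's drawing is not reproduced here:

```python
# A DFA for (ab)*: state 0 is initial and accepting, 2 is a sink
dfa = {
    'start': 0,
    'accept': {0},
    'delta': {(0, 'a'): 1, (0, 'b'): 2, (1, 'a'): 2, (1, 'b'): 0,
              (2, 'a'): 2, (2, 'b'): 2},
}

def accepts(dfa, w):
    """Run the DFA on string w and report membership."""
    q = dfa['start']
    for c in w:
        q = dfa['delta'][(q, c)]
    return q in dfa['accept']

print(accepts(dfa, 'abab'))  # True
print(accepts(dfa, 'aa'))    # False
```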
What is a context-free grammar? A 4-tuple (Σ, S, V, P) such that:
– Σ is the alphabet;
– V is a finite set of non-terminals;
– S ∈ V is the start symbol;
– P ⊆ V × (V∪Σ)* is a finite set of rules.
47ICML 2006, Grammatical Inference47
Example of a grammar: the Dyck1 grammar
– (Σ, S, V, P)
– Σ = {a, b}
– V = {S}
– P = {S → aSbS, S → λ }
48ICML 2006, Grammatical Inference48
Derivations and derivation trees
S → aSbS
  → aaSbSbS
  → aabSbS
  → aabbS
  → aabb
(figure: the corresponding derivation tree for aabb)
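The derivation above can also be checked mechanically: the Dyck1 grammar is LL(1), so a recursive-descent parser can follow the two rules S → aSbS and S → λ directly. A small sketch:

```python
def parse_S(w, i=0):
    """Try to derive w[i:] from S; return the position after the match, or None."""
    if i < len(w) and w[i] == 'a':        # rule S -> aSbS
        j = parse_S(w, i + 1)             # inner S
        if j is None or j >= len(w) or w[j] != 'b':
            return None
        return parse_S(w, j + 1)          # trailing S
    return i                              # rule S -> lambda

def in_dyck1(w):
    return parse_S(w) == len(w)

print(in_dyck1('aabb'))   # True (the derivation shown above)
print(in_dyck1('abba'))   # False
```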
49ICML 2006, Grammatical Inference49
Chomsky Hierarchy
• Level 0: no restriction
• Level 1: context-sensitive
• Level 2: context-free
• Level 3: regular
50ICML 2006, Grammatical Inference50
Chomsky Hierarchy
• Level 0: whatever Turing machines can do

(example slide content: an XML document, truncated on the slide)

<?xml version="1.0"?>
<?xml-stylesheet href="carmen.xsl" type="text/xsl"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE pagina [
<!ELEMENT pagina (titulus?, poema)>
<!ELEMENT titulus (#PCDATA)>
<!ELEMENT auctor (praenomen, cognomen, nomen)>
<!ELEMENT praenomen (#PCDATA)>
<!ELEMENT nomen (#PCDATA)>
<!ELEMENT cognomen (#PCDATA)>
<!ELEMENT poema (versus+)>
<!ELEMENT versus (#PCDATA)>
]>
<pagina>
<titulus>Catullus II</titulus>
<auctor>
<praenomen>Gaius</praenomen>
<nomen>Valerius</nomen>
<cognomen>Catullus</cognomen>
</auctor>
3 Hardness of the task
– Building algorithms is one thing; being able to state that they work is another.
– Some questions:
– Does this algorithm work?
– Do I have enough learning data?
– Do I need some extra bias?
– Is this algorithm better than the other?
– Is this problem easier than the other?
72ICML 2006, Grammatical Inference72
Alternatives to answer these questions:
– Use benchmarks
– Solve a real problem
– Prove things
73ICML 2006, Grammatical Inference73
Theory
• Because you may want to be able to say something more than "seems to work in practice".
74ICML 2006, Grammatical Inference74
Convergence
• Does my algorithm converge in some sense to a best solution?
• To be able to answer, we have to admit the existence of a best solution.
75ICML 2006, Grammatical Inference75
Issues
• Get close to the best?
– Metrics
– Distributions over strings
• PAC-related models and similar: very negative results
76ICML 2006, Grammatical Inference76
Identification in the limit
• L: a class of languages
• G: a class of grammars
• Pres ⊆ ℕ→X: the presentations
• yields: the naming function, with f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g)
• ϕ: a learner; identification requires L(ϕ(f)) = yields(f)
77ICML 2006, Grammatical Inference77
L is identifiable in the limit in terms of G from Pres iff
∀L∈L, ∀f∈Pres(L): given the successive values f1, f2, …, fn, … the learner outputs hypotheses h1, h2, …, hn, … such that from some n on, hi ≡ hn and L(hn) = L.
78ICML 2006, Grammatical Inference78
He did not want to compose another Quixote (which would be easy) but the Quixote. Needless to say, he never contemplated a mechanical transcription of the original; he did not propose to copy it. His admirable ambition was to produce a few pages which would coincide, word for word and line for line, with those of Miguel de Cervantes.
[…]
"My undertaking is not essentially difficult," I read in another part of the letter. "I should only have to be immortal to carry it out."
Jorge Luis Borges (1899–1986), Pierre Menard, autor del Quijote (El jardín de senderos que se bifurcan), Ficciones
79ICML 2006, Grammatical Inference79
4 Algorithmic ideas
80ICML 2006, Grammatical Inference80
The space of GI problems
• Type of input (strings)
• Presentation of input (batch)
• Hypothesis space (subset of the regular grammars)
• Success criteria (identification in the limit)
81ICML 2006, Grammatical Inference81
Types of input
• Strings: the cat hates the dog
• Structural examples (figure: a parse tree over "the cat hates the dog")
• Graphs (figure: a positive (+) and a negative (-) example graph)
82ICML 2006, Grammatical Inference82
Types of input - oracles
• Membership queries: is string s in the target language?
• Equivalence queries: is my hypothesis correct? If not, provide a counter-example.
• Subset queries: is the language of my hypothesis a subset of the target language?
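These query types can be phrased as a small oracle interface. A sketch, where the target language is a known predicate the learner is not shown, and the equivalence query is approximated over a finite test set (an assumption; a true equivalence oracle checks all strings):

```python
class Oracle:
    def __init__(self, target):           # target: str -> bool
        self.target = target

    def member(self, s):
        """Membership query: is s in the target language?"""
        return self.target(s)

    def equivalent(self, hypothesis, test_set):
        """Equivalence query, approximated on a finite test set;
        returns a counter-example, or None if none is found."""
        for s in test_set:
            if hypothesis(s) != self.target(s):
                return s
        return None

oracle = Oracle(lambda s: s.count('a') % 2 == 0)     # target: even number of a's
print(oracle.member('abab'))                         # True
print(oracle.equivalent(lambda s: True, ['', 'a']))  # 'a' is a counter-example
```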
83ICML 2006, Grammatical Inference83
Presentation of input
• Arbitrary order
• Shortest to longest
• All positive and negative examples up to some length
• Sampled according to some probability distribution
84ICML 2006, Grammatical Inference84
Presentation of input
• Text presentation
– A presentation of all strings in the target language
• Complete presentation (informant)
– A presentation of all strings over the alphabet of the target language labeled as + or -
85ICML 2006, Grammatical Inference85
Hypothesis space
• Regular grammars
– A welter of subclasses
• Context free grammars
– Fewer subclasses
• Hyper-edge replacement graph grammars
86ICML 2006, Grammatical Inference86
Success criteria
• Identification in the limit
– Text or informant presentation
– After each example, learner guesses language
– At some point, guess is correct and never changes
• PAC learning
87ICML 2006, Grammatical Inference87
Theorems due to Gold
• The good news:
– Any recursively enumerable class of languages can be learned in the limit from an informant (Gold, 1967).
• The bad news:
– A language class is superfinite if it includes all finite languages and at least one infinite language.
– No superfinite class of languages can be learned in the limit from text (Gold, 1967).
– That includes the regular and the context-free languages.
88ICML 2006, Grammatical Inference88
A picture
(figure: a chart placing learning settings between "little information / poor languages" and "a lot of information / rich languages": sub-classes of regular languages from positive data; DFA from positive and negative data; DFA from queries; context-free from positive data; mildly context-sensitive from queries)
89ICML 2006, Grammatical Inference89
Algorithms
• RPNI
• K-Reversible
• GRIDS
• SEQUITUR
• L*
90ICML 2006, Grammatical Inference90
4.1 RPNI
• Regular Positive and Negative Grammatical Inference
Identifying regular languages in polynomial time
Jose Oncina & Pedro García 1992
91ICML 2006, Grammatical Inference91
• It is a state-merging algorithm;
• It identifies any regular language in the limit;
• It works in polynomial time;
• It admits polynomial characteristic sets.
92ICML 2006, Grammatical Inference92
The algorithm

function rmerge(A, p, q)
    A = merge(A, p, q)
    while ∃a∈Σ, ∃r: p', q' ∈ δA(r, a) with p' ≠ q' do
        A = rmerge(A, p', q')
    return A
93ICML 2006, Grammatical Inference93
A = PTA(X+); Fr = {δ(q0, a): a∈Σ}
K = {q0}
while Fr ≠ ∅ do
    choose q from Fr
    if ∃p∈K: L(rmerge(A, p, q)) ∩ X- = ∅ then A = rmerge(A, p, q)
    else K = K ∪ {q}
    Fr = {δ(q, a): q∈K, a∈Σ} \ K
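The loop can be made runnable. A Python sketch of RPNI in the red/blue style; the state numbering and merge order are simplifications, so this illustrates the idea rather than reproducing Oncina & García's exact algorithm:

```python
from itertools import count

def build_pta(positives):
    """Prefix tree acceptor: transitions as {state: {symbol: state}}."""
    trans, accept, fresh = {0: {}}, set(), count(1)
    for w in sorted(positives, key=lambda w: (len(w), w)):
        q = 0
        for c in w:
            if c not in trans[q]:
                trans[q][c] = next(fresh)
                trans[trans[q][c]] = {}
            q = trans[q][c]
        accept.add(q)
    return trans, accept

def accepts(trans, accept, w):
    q = 0
    for c in w:
        if c not in trans.get(q, {}):
            return False
        q = trans[q][c]
    return q in accept

def do_merge(trans, accept, p, q):
    """Merge q into p, folding recursively to restore determinism (the
    rmerge above). Works on copies, so a failed merge is simply discarded."""
    trans = {s: dict(row) for s, row in trans.items()}
    accept = set(accept)
    def fold(p, q):
        if q in accept:
            accept.discard(q)
            accept.add(p)
        for s in trans:                       # redirect edges into q
            for c, t in trans[s].items():
                if t == q:
                    trans[s][c] = p
        for c, t in trans.pop(q).items():     # push q's outgoing edges to p
            if c in trans[p]:
                if trans[p][c] != t:
                    fold(trans[p][c], t)
            else:
                trans[p][c] = t
    fold(p, q)
    return trans, accept

def rpni(positives, negatives):
    trans, accept = build_pta(positives)
    red = {0}
    blue = sorted({t for r in red for t in trans[r].values()} - red)
    while blue:
        q = blue[0]
        for p in sorted(red):
            t2, a2 = do_merge(trans, accept, p, q)
            if not any(accepts(t2, a2, w) for w in negatives):
                trans, accept = t2, a2        # merge consistent with X-
                break
        else:
            red.add(q)                        # q must stay: promote to red
        blue = sorted({t for r in red for t in trans[r].values()} - red)
    return trans, accept

trans, accept = rpni(['', 'a', 'aa', 'aaa'], ['b', 'ab'])
print(accepts(trans, accept, 'aaaaa'))  # True: a* was learned
print(accepts(trans, accept, 'ab'))     # False
```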
94ICML 2006, Grammatical Inference94
X+ = {λ, aaa, aaba, ababa, bb, bbaaa}
X- = {aa, ab, aaaa, ba}
(figure: the prefix tree acceptor PTA(X+), with states numbered 1-15)
95ICML 2006, Grammatical Inference95
The run of RPNI on this sample (X- = {aa, ab, aaaa, ba}):
• Try to merge 2 and 1. This needs more merging for determinization, but then string aaaa is accepted, so the merge must be rejected.
• Try to merge 3 and 1. This requires merging 6 with {1,3}, then 2 with 10, then 4 with 13, and finally 7 with 15. No counter-example is accepted, so the merges are kept.
• Next possible merge to be checked is {4,13} with {1,3,6}. More merging for determinization is needed, but now aa is accepted: rejected.
• So we try {4,13} with {2,10}. After determinizing, the negative string aa is again accepted: rejected.
• So we try 5 with {1,3,6}. But again ab is accepted: rejected.
• So we try 5 with {2,10}, which is OK.
• Next possible merge is {7,15} with {1,3,6}, which is OK.
• Now try to merge {8,12} with {1,3,6,7,15}. But ab is accepted: rejected.
• Now try to merge {8,12} with {4,9,13}. This is OK and no more merge is possible, so the algorithm halts with states {1,3,6,7,11,14,15}, {2,5,10} and {4,8,9,12,13}.
11ICML 2006, Grammatical Inference117
Definitions
• Let ≤ be the length-lex ordering over Σ*.
• Let Pref(L) be the set of all prefixes of strings in some language L.
11ICML 2006, Grammatical Inference118
Short prefixes
Sp(L) = {u∈Pref(L): δ(q0,u)=δ(q0,v) ⇒ u≤v}
• There is one short prefix per useful state.
(figure: a three-state DFA with Sp(L) = {λ, a})
11ICML 2006, Grammatical Inference119
Kernel-sets
• N(L) = {ua∈Pref(L): u∈Sp(L)} ∪ {λ}
• There is an element in the kernel-set for each useful transition.
(figure: the same DFA with N(L) = {λ, a, b, ab})
12ICML 2006, Grammatical Inference120
A characteristic sample
• A sample is characteristic (for RPNI) if
– ∀x∈Sp(L), ∃u: xu∈X+
– ∀x∈Sp(L), ∀y∈N(L): δ(q0,x)≠δ(q0,y) ⇒ ∃z∈Σ*: (xz∈X+ ∧ yz∈X-) ∨ (xz∈X- ∧ yz∈X+)
12ICML 2006, Grammatical Inference121
About characteristic samples
• If you add more strings to a characteristic sample, it is still characteristic;
• There can be many different characteristic samples;
• Change the ordering (or the exploring function in RPNI) and the characteristic sample will change.
12ICML 2006, Grammatical Inference122
Conclusion
• RPNI identifies any regular language in the limit;
• RPNI works in polynomial time: complexity is in O(║X+║³·║X-║);
• There are many significant variants of RPNI;
• RPNI can be extended to other classes of grammars.
12ICML 2006, Grammatical Inference123
Open problems
• RPNI’s complexity is not a tight upper bound. Find the correct complexity.
• The definition of the characteristic set is not tight either. Find a better definition.
12ICML 2006, Grammatical Inference124
Algorithms
• RPNI
• K-Reversible
• GRIDS
• SEQUITUR
• L*
12ICML 2006, Grammatical Inference125
4.2 The k-reversible languages
• The class was proposed by Angluin (1982).
• The class is identifiable in the limit from text.
• The class consists of the regular languages that can be accepted by a DFA whose reversal is deterministic with a look-ahead of k.
12ICML 2006, Grammatical Inference126
Let A = (Σ, Q, δ, I, F) be an NFA. We denote by AT = (Σ, Q, δT, F, I) the reversal automaton, with:
δT(q,a) = {q'∈Q: q∈δ(q',a)}
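The reversal automaton AT can be computed directly from this definition. A sketch where A is given as (delta, initials, finals), with delta mapping (state, symbol) to a set of states (the example automaton below is an assumption for illustration):

```python
def reverse_nfa(delta, initials, finals):
    """delta: {(q, a): set_of_states}. Returns (delta_T, I_T, F_T)."""
    delta_t = {}
    for (q, a), targets in delta.items():
        for q2 in targets:
            # q2 in delta(q, a)  <=>  q in delta_T(q2, a)
            delta_t.setdefault((q2, a), set()).add(q)
    return delta_t, set(finals), set(initials)

# A small NFA: 0 -a-> 1, 1 -b-> 0, initial {0}, final {1}
delta = {(0, 'a'): {1}, (1, 'b'): {0}}
dt, i_t, f_t = reverse_nfa(delta, {0}, {1})
print(dt)  # {(1, 'a'): {0}, (0, 'b'): {1}}
```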
12ICML 2006, Grammatical Inference127
(figure: a five-state NFA A over {a, b} and its reversal AT)
12ICML 2006, Grammatical Inference128
Some definitions
• u is a k-successor of q if |u| = k and δ(q,u) ≠ ∅.
• u is a k-predecessor of q if |u| = k and δT(q,uT) ≠ ∅.
• λ is a 0-successor and a 0-predecessor of any state.
12ICML 2006, Grammatical Inference129
(figure: the five-state automaton A from above)
• aa is a 2-successor of 0 and 1 but not of 3.
• a is a 1-successor of 3.
• aa is a 2-predecessor of 3 but not of 1.
13ICML 2006, Grammatical Inference130
An NFA is deterministic with look-ahead k iff ∀q,q'∈Q with q ≠ q':
(q,q'∈I) ∨ (∃q'',a: q,q'∈δ(q'',a))
⇒
q and q' share no k-successor: (u is a k-successor of q) ∧ (v is a k-successor of q') ⇒ u ≠ v.
13ICML 2006, Grammatical Inference131
Prohibited:
(figure: two distinct states 1 and 2, both reached from the same state by the same symbol a and both with the same k-successor u, |u| = k)
13ICML 2006, Grammatical Inference132
Example
This automaton is not deterministic with look-ahead 1, but is deterministic with look-ahead 2.
(figure: the five-state NFA from above)
13ICML 2006, Grammatical Inference133
K-reversible automata
• A is k-reversible if A is deterministic and AT is deterministic with look-ahead k.
• Example
(figure: a three-state automaton that is deterministic, and whose reversal is deterministic with look-ahead 1)
13ICML 2006, Grammatical Inference134
Violation of k-reversibility
• Two states q, q' violate the k-reversibility condition iff
– they violate the determinism condition: q,q'∈δ(q'',a);
or
– they violate the look-ahead condition:
• q,q'∈F and ∃u∈Σk: u is a k-predecessor of both;
• ∃u∈Σk, ∃a: δ(q,a) = δ(q',a) and u is a k-predecessor of both q and q'.
13ICML 2006, Grammatical Inference135
Learning k-reversible automata
• Key idea: the order in which the merges are performed does not matter!
• Just merge states that do not comply with the conditions for k-reversibility.
13ICML 2006, Grammatical Inference136
K-RL Algorithm (φk-RL)
Data: k∈ℕ, X a sample of a k-RL L
A = PTA(X)
while ∃q,q' k-reversibility violators do
    A = merge(A, q, q')
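The loop can be made runnable for the simplest case k = 0 (Angluin's zero-reversible learner): build the PTA, then merge violators (more than one final state, two a-successors of one state, or two a-predecessors of one state) until none remain. A sketch using union-find rather than explicit merges:

```python
def zr_learn(positives):
    """Zero-reversible (k = 0) learner: PTA plus violator merging."""
    # --- prefix tree acceptor ---
    trans, finals, fresh = {0: {}}, set(), 1
    for w in positives:
        q = 0
        for c in w:
            if c not in trans[q]:
                trans[q][c] = fresh
                trans[fresh] = {}
                fresh += 1
            q = trans[q][c]
        finals.add(q)
    edges = [(s, c, t) for s in trans for c, t in trans[s].items()]

    parent = list(range(fresh))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        a, b = find(a), find(b)
        if a != b:
            parent[b] = a
            return True
        return False

    changed = True
    while changed:                                # repeat until fixpoint
        changed = False
        reps = sorted({find(x) for x in finals})  # >1 final state: merge them
        for r in reps[1:]:
            changed |= union(reps[0], r)
        succ, pred = {}, {}
        for s, c, t in edges:
            k = (find(s), c)                      # two c-successors of one state
            if k in succ and find(succ[k]) != find(t):
                union(succ[k], t); changed = True
            succ[k] = find(t)
            k = (find(t), c)                      # two c-predecessors of one state
            if k in pred and find(pred[k]) != find(s):
                union(pred[k], s); changed = True
            pred[k] = find(s)
    merged = {}
    for s, c, t in edges:
        merged.setdefault(find(s), {})[c] = find(t)
    return merged, find(0), {find(x) for x in finals}

def accepts0(aut, w):
    trans, q, finals = aut
    for c in w:
        if c not in trans.get(q, {}):
            return False
        q = trans[q][c]
    return q in finals

aut = zr_learn(['', 'ab', 'abab'])
print(accepts0(aut, 'ababab'))  # True: (ab)* is the smallest
print(accepts0(aut, 'ba'))      # False   0-reversible language learned
```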
13ICML 2006, Grammatical Inference137
Let X = {a, aa, abba, abbbba}, k = 2
(figure: the prefix tree acceptor of X; two states are violators, with common 2-predecessor u = ba)
13ICML 2006, Grammatical Inference138
(figure: after the merge, two further states violate 2-reversibility, with common 2-predecessor u = bb)
13ICML 2006, Grammatical Inference139
(figure: the automaton after both merges; no violators remain)
14ICML 2006, Grammatical Inference140
Properties (1)
• ∀k≥0, ∀X, φk-RL(X) is a k-reversible automaton.
• L(φk-RL(X)) is the smallest k-reversible language that contains X.
• The class Lk-RL is identifiable in the limit from text.
14ICML 2006, Grammatical Inference141
Properties (2)
• Any regular language L is k-reversible iff
(u1v)⁻¹L ∩ (u2v)⁻¹L ≠ ∅ and |v| = k ⇒ (u1v)⁻¹L = (u2v)⁻¹L
(if two strings are prefixes of a string of length at least k, then the strings are