Grammatical Inference
François Coste
SML, Master SIF
2020-2021
F. Coste (Inria) Grammatical Inference SML 2020-2021 1 / 123

Grammatical Inference
Learn the grammar of a language from correct (and incorrect) sentences
N. Chomsky, Syntactic Structures, Mouton, 1957, PhD thesis MIT 1955
E. M. Gold, Language Identification in the Limit, Information and Control, 1967
...
(Targeted) applications
Syntactic pattern recognition [Fu, 1982]
Natural language, molecular biology, structured texts, Web, action planning, intrusion detection...
Field
Theoretical (learnability), practical (algorithms)

Formal language theory
Sequence of symbols s1 s2 ... sp: word
Set of words {m1, m2, ...}: language
Set of production rules generating a language: grammar
Learning a grammar by induction: grammatical inference
(covers more broadly inductive learning of languages, even if the representation is not grammatical)

Grammar
Grammar: G = 〈Σ, N, S, R〉
Σ: finite set of terminals (a, b, c, ...)
N: finite set of non-terminals (S, T, U, ...)
S (∈ N): axiom (start symbol)
R: set of rewriting rules
Each rule is written as: α → β, with α ∈ (N ∪ Σ)* N (N ∪ Σ)* and β ∈ (N ∪ Σ)*
When some rules have the same left-hand side, we write: α → β1 | β2 | ···

Grammars and languages
Elementary derivation ⇒G: μαδ ⇒G μβδ iff ∃ α → β ∈ R, μ, δ ∈ (N ∪ Σ)*
Derivation ⇒*G: finite sequence of elementary derivations
Language generated by a grammar G: L(G) = {m ∈ Σ* | S ⇒*G m}
Free monoid Σ*: set of all the words on Σ
Empty word: ε (or λ)
Empty language: ∅ (≠ {ε})

Example
Dyck1's grammar (balanced parentheses): G = 〈Σ, N, S, R〉
Σ = {a, b}, N = {S}, R = {S → aSbS, S → ε}
Derivation: S ⇒ aSbS ⇒ aaSbSbS ⇒ aabSbS ⇒ aabbS ⇒ aabb

Exercises
Find the grammars generating the following languages:
- {aaba, aaa}
- All the words on {a, b} (Σ*)
- Words on {a, b} beginning with a
- Codons on {a, c, g, t} (letter count is a multiple of 3)
- Palindromes on {a, b}: R = {S → aSa | bSb | a | b | ε}
- Biological palindromes (on {a, c, g, t}, a-t, c-g): exercise...
- {a^n b^n c^n | n ≥ 1}: R = {S → abc | aSAc, bA → bb, cA → Ac}
  S ⇒ aSAc ⇒ aabcAc ⇒ aabAcc ⇒ aabbcc
- Copy: {ww | w ∈ {a, b}*}: exercise...

Chomsky Hierarchy
Hierarchy of recursively enumerable languages:
0 Unrestricted
1 Context-sensitive ("grammaires contextuelles"): α → β, |α| ≤ |β|
2 Context-free ("grammaires algébriques"): A → β, A ∈ N
3 Regular ("grammaires régulières", automata): A → aB or A → a, with A, B ∈ N, a ∈ Σ ∪ {ε}
The Chomsky Hierarchy
Regular languages are worth inferring
For practical applications, powerful recursive models may not be required
Regular languages can account for short-term dependencies (like N-grams), but also some long-term dependencies.
Any language can be approximated by a regular language (each finite language is regular!).
Properties of regular languages are well studied; this makes the development of inference methods easier.
Simple and efficient parsing of strings (O(|m|) for a DFA).
Outline
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
Automata
A = 〈Σ, Q, Q0, QF , δ〉
Example: multiples of 3 (binary):
Σ: finite alphabet, {0, 1}
Q: finite set of states, {q0, q1, q2}
Q0 (⊆ Q): initial states, {q0}
QF (⊆ Q): final states, {q0}
δ: transition function Q × Σ → P(Q)
(δ∗: P(Q) × Σ∗ → P(Q) denotes the extension of δ to words)
Language accepted by A:
L(A) = {m ∈ Σ∗ | δ∗(Q0, m) ∩ QF ≠ ∅}
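As a quick illustration, the "multiples of 3" automaton above is easy to encode and run; the dict encoding and integer state names below are assumptions made for this sketch, not the lecture's notation:

```python
# DFA for binary words encoding multiples of 3.
# State i means "the value read so far is congruent to i (mod 3)";
# reading bit b sends i to (2*i + b) mod 3.
delta = {(i, b): (2 * i + b) % 3 for i in range(3) for b in (0, 1)}
q0, QF = 0, {0}

def accepts(word):
    """Run the DFA on a binary string and test membership."""
    state = q0
    for c in word:
        state = delta[(state, int(c))]
    return state in QF
```

For instance `accepts("110")` is True (110 in binary is 6) while `accepts("101")` is False (5).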
Automata and languages
Language accepted/recognized by an automaton: regular language (equivalently described by regular expressions, built from +, ∗ and parentheses)
Exercises
Find automata on Σ = {a, b} recognizing:
- {abba, aab} (show that each finite language is regular)
- all the words on Σ: (a + b)∗ = {a, b}∗ = Σ∗
- all the words containing the motif aa
- all the words with 3 letters (extension to codons?)
- all the words with an even number of a.
Deterministic finite-state automata (DFA): |δ(q, a)| ≤ 1
Any non-deterministic automaton (NFA) can be determinized
⇒ L_NFA = L_DFA
Canonical automaton of L, A(L): smallest DFA accepting L
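The determinization step mentioned above is the classical subset construction; here is a minimal sketch, assuming the NFA is given as a dict from (state, letter) to sets of states (the example NFA for words containing "aa" is an assumption for illustration):

```python
from collections import deque

def determinize(Q0, nfa_delta, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset(Q0)
    dfa_delta, dfa_finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        S = queue.popleft()
        if S & finals:                       # accepts iff it contains an NFA final state
            dfa_finals.add(S)
        for a in alphabet:
            T = frozenset(t for q in S for t in nfa_delta.get((q, a), ()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return start, dfa_delta, dfa_finals

# Example NFA: words over {a, b} containing the factor "aa"
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0},
       (1, 'a'): {2},
       (2, 'a'): {2}, (2, 'b'): {2}}
start, dfa_delta, dfa_finals = determinize({0}, nfa, {2}, "ab")

def dfa_accepts(word):
    S = start
    for a in word:
        S = dfa_delta[(S, a)]
    return S in dfa_finals
```

Here `dfa_accepts("baab")` holds while `dfa_accepts("abab")` does not.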
Can we learn regular languages from positive examples only?
Theoretical framework: identification in the limit [Gold67]
Presentation : infinite sequence of examples
P: x1 x2 x3 ... xk ... xi ...
After each example, the learner outputs a hypothesis: H1, H2, H3, ..., Hk, ..., with Hi ≡ Hk ≡ H0 from rank k on
Identification in the limit of H0 :
∀P,∃k, ∀i > k,Hi ≡ H0
Let’s try!
a, aa, aaa . . .
Limit point
If a limit point exists:
L1 ⊂ L2 ⊂ L3 ⊂ ··· ⊂ L∞ = ⋃i Li
Then
The class of languages is not identifiable in the limit from positive examples
Results [Gold67]
No superfinite class of languages (⊃ regular) can be identified in the limit from text (i.e. positive examples only)
The class of primitive recursive functions ("fonctions récursives primitives") can be identified in the limit from informant (examples and counter-examples)
(False for the class of total recursive functions)
→ rationale for using counter-examples
Time needed for learning ???
Polynomial Time and Data Identification in the Limit [Gold 78] [Pitt 89] [de la Higuera 95]
Identification in the limit from Polynomial Time and Data (IPTD)
A representation class R is identifiable in the limit from polynomial time and data iff there exist two polynomials p and q and a learning algorithm A s.t.:
- Given any sample S = 〈S+, S−〉 of size m, A returns a representation R in R compatible with S in p(m) time
- For each representation R of size n, there exists a characteristic sample of size less than q(n)
Characteristic sample CS = 〈CS+, CS−〉: for any S = 〈S+, S−〉 s.t. CS+ ⊆ S+ and CS− ⊆ S−, A returns a representation R′ equivalent to R
Are automata IPTD?
Outline
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
      Problem definition
      RPNI
      Structural completeness hypothesis
      Utility of counter-examples
      EDSM heuristic
   Learning automata from positive examples
Remark: given a sample S = 〈S+, S−〉, an infinite number of automata are compatible with S
Searching for the smallest compatible DFA
Smallest compatible DFA problem
Given S+ ⊂ Σ∗ (examples) and S− ⊂ Σ∗ (counter-examples),
find the smallest DFA A s.t. S+ ⊆ L(A) and S− ∩ L(A) = ∅
Application of Occam’s razor
Canonical automaton of the language...
NP-complete problem [Gold 78] [Angluin 78]
Proof: reduction from SAT
Finding a DFA (only) polynomially bigger than the smallest DFA compatible with 〈S+, S−〉 is NP-complete [Pitt, Warmuth 93]
PAC-learning DFA is as hard as breaking the RSA cryptosystem [Pitt, Warmuth 88] [Kearns, Valiant 89]
A ← PTA(S+)
for all (p, q) in standard order¹ do
    A′ ← deterministic merge(A, p, q)
    if A′ accepts no counter-example from S− then
        A ← A′
    end if
end for

Complexity: O((|S+| + |S−|) · |S+|²)

¹ Standard order u ≺ v: (|u| < |v|) ∨ (|u| = |v| ∧ ∃k, ∀i < k, ui = vi ∧ uk < vk)
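The loop above can be sketched as runnable code. This is an illustrative simplification, not the lecture's reference implementation: PTA states are the prefixes themselves, and a partition dict (state → block representative) plays the role of the merged automaton.

```python
def pta(S_plus):
    """Prefix Tree Acceptor: one state per prefix of S+."""
    states, trans, finals = {""}, {}, set()
    for w in S_plus:
        for i in range(len(w)):
            trans[(w[:i], w[i])] = w[:i + 1]
            states.add(w[:i + 1])
        finals.add(w)
    return states, trans, finals

def merged(part, p, q):
    """New partition with the block of q merged into the block of p."""
    p, q = part[p], part[q]
    return {s: (p if r == q else r) for s, r in part.items()}

def fold(part, trans):
    """Keep merging blocks until the quotient automaton is deterministic."""
    changed = True
    while changed:
        changed = False
        out = {}
        for (s, a), t in trans.items():
            key, t = (part[s], a), part[t]
            if key in out and out[key] != t:
                part = merged(part, out[key], t)  # nondeterminism: merge targets
                changed = True
                break
            out[key] = t
    return part

def accepts(w, part, trans, finals):
    """Does the (deterministic) quotient automaton accept w?"""
    state, final_blocks = part[""], {part[f] for f in finals}
    for a in w:
        nxt = next((part[t] for (s, x), t in trans.items()
                    if x == a and part[s] == state), None)
        if nxt is None:
            return False
        state = nxt
    return state in final_blocks

def rpni(S_plus, S_minus):
    states, trans, finals = pta(S_plus)
    part = {s: s for s in states}
    order = sorted(states, key=lambda s: (len(s), s))  # standard order
    for q in order[1:]:
        if part[q] != q:              # q was already merged into an earlier block
            continue
        for p in order:
            if p == q:
                break
            if part[p] != p:
                continue
            trial = fold(merged(part, p, q), trans)
            if not any(accepts(w, trial, trans, finals) for w in S_minus):
                part = trial          # keep the merge: no counter-example accepted
                break
    return part, trans, finals
```

For instance, with S+ = {ε, aa} and S− = {a} the returned quotient has two blocks and accepts exactly the words with an even number of a's.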
[Figure: success rate vs. number of sequences in the training sample, from [Lang, 1992]]
Identification ?
Requirements for finding the solution with RPNI?
1. The target automaton has to be in the search space
and
2. The good merges have to be chosen
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
      Problem definition
      RPNI
      Structural completeness hypothesis
      Utility of counter-examples
      EDSM heuristic
   Learning automata from positive examples
Structural completeness hypothesis
S+ is structurally complete wrt A if an acceptance of S+ by A exists s.t.:
Every transition of A is used
Every final state of A is used for an acceptance
S+ = {aaa, bba, baaa} A =
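For the DFA case, the two conditions above can be checked directly; this is a sketch (the definition itself allows NFAs and any accepting run), reusing the "multiples of 3" DFA as an assumed example:

```python
def structurally_complete(S_plus, delta, q0, finals):
    """DFA check: parsing S+ must use every transition of the automaton
    and end at least once in every final state."""
    used_trans, reached = set(), set()
    for w in S_plus:
        q = q0
        for a in w:
            used_trans.add((q, a))
            q = delta[(q, a)]
        reached.add(q)
    return used_trans == set(delta) and set(finals) <= reached

# Assumed example: the 'multiples of 3' DFA over {0, 1}, final state 0
delta3 = {(0, '0'): 0, (0, '1'): 1, (1, '0'): 2,
          (1, '1'): 0, (2, '0'): 1, (2, '1'): 2}
```

With `S_plus = ["110", "1001", "101101"]` every transition is exercised and the final state is reached, so the sample is structurally complete; dropping "101101" leaves the transition (2, '1') unused.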
Maximal Canonical Automaton
Rote learning of S+ = {aaa, bba, baaa}
Union :
MCA(S+)
Only one initial state (classical but not required):
MCA(S+)
Merging states
Language generalisation operator
Preserve structural completeness
Theorem
All automata A s.t. S+ is structurally complete wrt A can be built by merging states of MCA(S+)
Search space
DFA search space
operator: deterministic merge
Theorem
All automata A s.t. S+ is structurally complete wrt A can be built by deterministic merges of states in MCA(S+) (or PTA(S+))
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
      Problem definition
      RPNI
      Structural completeness hypothesis
      Utility of counter-examples
      EDSM heuristic
   Learning automata from positive examples
Limiting generalisation with a set of counter-examples S−
Border Set: set of most general elements
(greater generalisation under control of S−)
Occam's razor → looking for the smallest automaton
S− also guides the search...
Characteristic sample for RPNI
How to ensure that RPNI returns A(L) ?
Ideas :
Sample has to be structurally complete wrt A(L)
Sample is informative enough to prevent merging distinct states
Characteristic sample for RPNI
Short prefixes and kernel
Let Pr(L) denote the set of prefixes of a language L: Pr(L) = {u ∈ Σ∗ : ∃v ∈ Σ∗, uv ∈ L}
Short prefixes
Smallest sequences enabling to reach each state of the target:
Sp(L) = {u ∈ Pr(L) : ∄v ∈ Pr(L), v ≺ u and δA(L)(q0, v) = δA(L)(q0, u)}
Kernel
Sequences of Sp concatenated with one letter, allowing to reach a new state (exercising all the possible transitions):
N(L) = {ua ∈ Pr(L) : u ∈ Sp(L), a ∈ Σ} ∪ {ε}
What would be N(L) for the following DFA target?
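Sp(L) and N(L) can be computed from a target DFA by a breadth-first search that records, for each state, the first word reaching it in standard order. A sketch, assuming a trim DFA (every state useful, so every ua with a defined transition is indeed a prefix of L); the "multiples of 3" DFA is an assumed example:

```python
from collections import deque

def short_prefixes_and_kernel(delta, q0, alphabet):
    """BFS in standard order: the first word reaching each state is its short prefix.
    The kernel adds one letter along every defined transition, plus the empty word."""
    sp = {q0: ""}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in sorted(alphabet):
            t = delta.get((q, a))
            if t is not None and t not in sp:
                sp[t] = sp[q] + a
                queue.append(t)
    Sp = set(sp.values())
    N = {u + a for q, u in sp.items()
         for a in sorted(alphabet) if (q, a) in delta} | {""}
    return Sp, N

# Assumed example: the 'multiples of 3' DFA (binary), q0 = 0
delta3 = {(0, '0'): 0, (0, '1'): 1, (1, '0'): 2,
          (1, '1'): 0, (2, '0'): 1, (2, '1'): 2}
```

For this target, Sp = {ε, 1, 10} and N = {ε, 0, 1, 10, 11, 100, 101}.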
Characteristic sample for RPNI
S = 〈S+, S−〉 is a characteristic sample of A(L) for RPNI if:
∀x ∈ N(L): ∃u ∈ Σ∗, xu ∈ S+ (u = ε if x ∈ L)
∀x, y ∈ N(L) with δA(L)(q0, x) ≠ δA(L)(q0, y):
∃u ∈ Σ∗, ((xu ∈ S+ and yu ∈ S−) or (xu ∈ S− and yu ∈ S+))
What would be a characteristic sample for ... ?
Is the characteristic sample unique for an automaton?
It can be shown that:
- adding new examples to the characteristic sample does not change the automaton returned by RPNI
- for each A(L), there exists a characteristic sample of size O(|A(L)|²)
What about merging states in random order? [Trakhtenbrot and Barzdin 1973]
Algorithm: deterministic merges, in random order, of pairs of states not resulting in an incompatible automaton
Algorithm complexity? At most |PTA|·|A|² [Lang 92] (where A is the target automaton)
Characteristic sample? {w ∈ Σ∗ : |w| ≤ d + 1 + ρ}
d: depth of the automaton
ρ: distinguishability degree (length of the suffix required to distinguish pairs of states, i.e. allowing to reach a final state and a non-final state)
Worst case: d = ρ = |A| − 1
On average, ρ = log_|Σ| log₂ |A| and d = C log_|Σ| |A| (where C is a constant wrt Σ)
For |Σ| = 2, the average size is ∼ 16|A|² − 1:
|A| = 32 → 16383 seq., 65 → 67599, 506 → 4096575 ...
RPNI
The solution returned by RPNI is:
- a DFA belonging to the Border Set
- the canonical automaton of the language that it accepts
- if the sample is characteristic, the smallest compatible DFA (contradiction with the NP-completeness of the problem?)
... would require ∼ 4 000 seq. with RPNI
Learning from positive and negative examples
[Gold 67]:
No superfinite class of languages can be identified in the limit from positive examples only
The class of primitive recursive functions can be identified in the limit from positive and negative examples
Efficient learning
DFA are IPTD from positive and negative examples (RPNI)
Extension to some closely related classes
NFA are not! CFG neither...
A heuristic (EDSM) that seems to perform better... (?)
What if negative examples are not available?
Outline
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
Learning from positive example (only)
Statistical criteria for not merging pairs of states: ALERGIA
"Characterizable" methods: k-RI, k-testable languages
Heuristic methods: ECGI
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
      ALERGIA
      k-reversible languages
      ECGI
ALERGIA
[Carrasco, Oncina 99]
Input: S+, precision parameter α
Output: (probabilistic) DFA A
A ← PPTA(S+)
for all (p, q) in standard order do
    if compatible(p, q, α) then
        A ← deterministic merge(A, p, q)
    end if
end for
ALERGIA
Compatibility between two states q1 and q2:
Transition probabilities are similar enough: ∀a ∈ Σ ∪ {#},
|C(q1, a)/C(q1) − C(q2, a)/C(q2)| < √((1/2) ln(2/α)) · (1/√C(q1) + 1/√C(q2))
Compatibility of successors:
∀a ∈ Σ, δ(q1, a) and δ(q2, a) are α-compatible
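The bound above is a Hoeffding-style test and transcribes directly; in this sketch n1, n2 play the role of the counts C(q1), C(q2) and f1, f2 the role of C(q1, a), C(q2, a):

```python
import math

def hoeffding_compatible(n1, f1, n2, f2, alpha):
    """ALERGIA's compatibility test at precision alpha: the observed
    frequencies f1/n1 and f2/n2 must differ by less than the Hoeffding bound."""
    if n1 == 0 or n2 == 0:
        return True  # no evidence either way
    bound = (math.sqrt(0.5 * math.log(2.0 / alpha))
             * (1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)))
    return abs(f1 / n1 - f2 / n2) < bound
```

For example, with 100 observations on each side and α = 0.05, frequencies 0.50 and 0.55 pass the test while 0.50 and 0.90 do not.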
ALERGIA
Local measure of suffix language similarity
Other measures...
→ Learning probabilistic automata
→ Identification of probability distributions on words
See:
- PAC-learnability of Probabilistic Deterministic Finite State Automata, A. Clark and F. Thollard, Journal of Machine Learning Research, 2004.
- Towards feasible PAC-learning of probabilistic deterministic finite automata, J. Castro and R. Gavaldà, ICGI 2008.
- Learning rational stochastic languages, F. Denis, Y. Esposito, A. Habrard, COLT 2006.
- Spectral learning of weighted automata: a forward-backward perspective, B. Balle, X. Carreras, F. M. Luque, A. Quattoni, Machine Learning, 2014.
1 Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
      ALERGIA
      k-reversible languages
      ECGI
Characterizable learning
The negative result of [Gold 67] applies to superfinite classes of languages.
To avoid over-generalization, an approach performing a minimal generalisation at each step ensures identification for particular classes of languages.
Biological palindrome: S → aSt | cSg | tSa | gSc | ε
Derivation tree of atgttcgaacat?
Consequence of adding a new rewriting rule: S → SS | aSt | cSg | tSa | gSc | ε?
Derivation tree of caaatcgatcatcgaagagctcttgttg? Of gaatattcgaatattc?
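For the base grammar (without the SS rule), a word is derivable iff it equals its own reverse complement, which gives a simple membership check (a sketch; the names are illustrative):

```python
# Complement pairing on the DNA alphabet (a-t, c-g).
COMP = {"a": "t", "t": "a", "c": "g", "g": "c"}

def is_bio_palindrome(w):
    """Membership test for S -> aSt | cSg | tSa | gSc | epsilon:
    w is derivable iff position i always pairs with position -1-i.
    Odd-length words are rejected (a letter never pairs with itself)."""
    return all(COMP[w[i]] == w[-1 - i] for i in range(len(w)))
```

With the extra rule S → SS, membership instead asks whether the word splits into a concatenation of such palindromes, as in the derivation-tree exercises above.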
Copy
S → AaS | CcS | GgS | TtS | X
X → ε
Aa → aA ; Ac → cA ; Ag → gA ; At → tA
Ca → aC ; Cc → cC ; Cg → gC ; Ct → tC
Ga → aG ; Gc → cG ; Gg → gG ; Gt → tG
Ta → aT ; Tc → cT ; Tg → gT ; Tt → tT
AX → Xa ; CX → Xc ; GX → Xg ; TX → Xt
Derivation tree of ctaacctaac ?
What we have seen in SML so far
Introduction to machine learning
Generalisation, necessity of a bias...
How to define properly a machine learning problem: choice of object description, choice of hypothesis space, choice of 'best' hypothesis, i.e. setting biases
Exploration of the search space
Evaluation of the risk
Learning on sequences
Vectorization of texts and Naive Bayes
Automata and learnability
Next: state-of-the-art algorithms for attribute-value representations of instances...