MASTER DI SCIENZE COGNITIVE, GENOVA 2005, 14-10-05
Natural Language Grammars and Parsing
Alessandro Mazzei, Dipartimento di Informatica, Università di Torino
Natural Language Processing
● Phonetics: acoustic and perceptual elements
● Phonology: inventory of basic sounds (phonemes) and basic rules for their combination, e.g. vowel harmony
● Morphology: how morphemes combine to form words; relationship of phonemes to meaning
● Syntax: sentence formation; word order and the formation of constituents from word groupings
● Semantics: how word meanings recursively compose to form sentence meanings (from syntax to logical formulas)
● Pragmatics: meaning that is not part of compositional meaning
Natural Language Syntax
Syntactic Parsing: deriving a syntactic structure from the word sequence

Word sequence: Paolo ama Francesca
Syntactic structure (parse tree):
[S [NP [N Paolo]] [VP [V ama] [N Francesca]]]
(Paolo = subject, Francesca = object)
Generative approach to Syntax
● Formal languages
● Generative grammars
● Context-Free Parser
● Probabilistic parsing
● Treebank
Formal Languages
Σ = {a1, a2, ..., an} : alphabet
Σ* : the set of all finite strings over Σ; e.g. for Σ = {0,1}: 001, 111110, ε, 0 ∈ Σ*
Formal Language: L ⊆ Σ*
Formal Languages
Σ = {0,1}
L1 = {01,0101,010101,01010101,...}
L2 = {01,0011,000111,00001111,...}
L3 = {11,1111,11111111,...}
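These three example languages can be checked directly in code. A minimal sketch (the function names, and the reading of L3 as strings of 1s whose length is a power of two, are my own interpretation of the examples):

```python
import re

# Membership tests for the three example languages over Σ = {0,1}.
def in_L1(s):
    # L1 = {01, 0101, 010101, ...} = (01)^n, n >= 1
    return re.fullmatch(r"(01)+", s) is not None

def in_L2(s):
    # L2 = {01, 0011, 000111, ...} = 0^n 1^n, n >= 1
    n = len(s) // 2
    return len(s) >= 2 and s == "0" * n + "1" * n

def in_L3(s):
    # L3 = {11, 1111, 11111111, ...} = 1^(2^n), n >= 1 (assumed reading)
    k = len(s)
    return s == "1" * k and k >= 2 and k & (k - 1) == 0

print(in_L1("0101"), in_L2("000111"), in_L3("1111"))  # True True True
print(in_L1("0011"), in_L2("0101"), in_L3("111"))     # False False False
```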
Grammar and derivation
If A → β ∈ P, then αAγ ⇒ αβγ   (directly derives)
If α1 ⇒ α2, α2 ⇒ α3, ..., αm-1 ⇒ αm, then α1 ⇒* αm   (derives)
L(G)={x ∈ Σ* : S ⇒* x}
Grammar 1
● G1=({0,1},{A,B},A,{A→0B,B→1A,B→1})
A⇒0B⇒01
A⇒0B⇒01A⇒010B⇒0101
A⇒0B⇒01A⇒010B⇒0101A⇒01010B⇒010101
L(G1)={01,0101,010101,...}
Grammar 2
● G2=({0,1},{S},S,{S→0S1,S→01})
S⇒01
S⇒0S1⇒0011
S⇒0S1⇒00S11⇒000111
L(G2)={01,0011,000111,...}
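The step-by-step derivations above can be automated. A minimal sketch of a breadth-first leftmost-derivation generator (the function `generate` and the dictionary encoding of the rules are illustrative, not from the slides):

```python
from collections import deque

def generate(rules, start, limit):
    """Enumerate the first `limit` terminal strings of L(G), by
    repeatedly rewriting the leftmost non-terminal (BFS over
    sentential forms, so shorter strings come out first)."""
    results, queue = [], deque([start])
    while queue and len(results) < limit:
        form = queue.popleft()
        nt = next((c for c in form if c in rules), None)
        if nt is None:                 # no non-terminal left: a string of L(G)
            results.append(form)
            continue
        i = form.index(nt)
        for rhs in rules[nt]:          # one direct-derivation step per rule
            queue.append(form[:i] + rhs + form[i + 1:])
    return results

G1 = {"A": ["0B"], "B": ["1A", "1"]}   # A→0B, B→1A, B→1
G2 = {"S": ["0S1", "01"]}              # S→0S1, S→01
print(generate(G1, "A", 3))  # ['01', '0101', '010101']
print(generate(G2, "S", 3))  # ['01', '0011', '000111']
```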
Generative Grammars and Natural Languages
● Generative grammars can model natural language as a formal language
● The derivation tree can model the syntactic structure of sentences
Grammar 3
● G4 = (Σ4, {S, NP, VP, V1, V2}, S, P4)
Σ4 = {I, Anna, John, Harry, saw, see, swimming}
P4 = {S → NP VP, VP → V1 S, VP → V2,
      NP → I | John | Harry | Anna,
      V1 → saw | see, V2 → swimming}
Grammar 3
● G4 = (Σ4, {S, NP, VP, V1, V2}, S, P4)
S ⇒ NP VP ⇒ I VP ⇒ I V1 S ⇒ I saw S ⇒ I saw NP VP ⇒ I saw Harry VP ⇒ I saw Harry V2 ⇒ I saw Harry swimming
L(G4) = {I saw Harry swimming, ...}
Grammar 3
[S [NP I] [VP [V1 saw] [S [NP Harry] [VP [V2 swimming]]]]]

S ⇒ NP VP ⇒ I VP ⇒ I V1 S ⇒ I saw S ⇒ I saw NP VP ⇒ I saw Harry VP ⇒ I saw Harry V2 ⇒ I saw Harry swimming
Generative Power
● What is the smallest class of generative grammars that can generate natural languages?
● Weak vs. Strong Generative power
Languages Chomsky hierarchy
Example languages: (ab)^n, a^n b^n, a^n b^n c^n, a^(2^n), LDiag
Grammar classes (sample rules):
● Linear: A → aB
● Context-free: S → aSb
● Context-sensitive: Caa → aaCa
● Type 0: Ψ → θ
The same hierarchy, refined with a class between context-free and context-sensitive:
● Mildly Context-sensitive: CB → f(C,B)
Constituency
Constituent = group of contiguous (?!) words
● that act as a unit [Fodor-Bever, Bock-Loebell]
● that have syntactic properties, e.g. preposing/postposing, substitutability
Noun Phrases (NP), Verb Phrases (VP), ...
● In a CFG: constituents ⇔ non-terminal symbols (V)
Anatomy of a Parser
(1) Grammar
Context-Free, ...
(2) Algorithm
I. Search strategy: top-down, bottom-up, left-to-right, ...
II. Memory organization: back-tracking, dynamic programming, ...
(3) Oracle
Probabilistic, rule-based, ...
Parser 1
(1) Grammar:
S → NP VP      S → AUX NP VP
NP → DET Nom   NP → PropN
AUX → does     DET → this
Nom → Noun     Noun → flight
VP → Verb
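A top-down, left-to-right, backtracking strategy over this grammar can be sketched in a few lines. The lexical rules `Verb → leave` and `PropN → Alitalia` are hypothetical additions of mine, since the slide's grammar has no words for those categories:

```python
# Top-down, left-to-right, backtracking recognizer for the slide's CFG.
GRAMMAR = {
    "S":     [["NP", "VP"], ["AUX", "NP", "VP"]],
    "NP":    [["DET", "Nom"], ["PropN"]],
    "Nom":   [["Noun"]],
    "VP":    [["Verb"]],
    "AUX":   [["does"]],
    "DET":   [["this"]],
    "Noun":  [["flight"]],
    "Verb":  [["leave"]],      # assumed lexical rule (not on the slide)
    "PropN": [["Alitalia"]],   # assumed lexical rule (not on the slide)
}

def parse(symbols, words):
    """Try to expand the symbol list to exactly the word list."""
    if not symbols:
        return not words               # success iff all input consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:               # non-terminal: try each rule, backtrack on failure
        return any(parse(rhs + rest, words) for rhs in GRAMMAR[first])
    # terminal: must match the next input word
    return bool(words) and words[0] == first and parse(rest, words[1:])

print(parse(["S"], "does this flight leave".split()))  # True
print(parse(["S"], "this flight does leave".split()))  # False
```

Since the grammar has no left recursion, the top-down search always terminates.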
Ambiguity
● One sentence can have several “legal” parse trees
● 15 words ⇒ on the order of 1,000,000 parse trees
● Dynamic Programming ⇒ Earley Algorithm
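The explosion behind that figure can be illustrated with Catalan numbers, which count the distinct binary bracketings of a word sequence (a standard back-of-the-envelope estimate, not a computation from the slides):

```python
from math import comb

def catalan(n):
    """Catalan number C(n) = C(2n, n) / (n + 1): the number of
    distinct binary bracketings of a sequence of n + 1 items."""
    return comb(2 * n, n) // (n + 1)

# Binary parse trees over a 15-word sentence grow with Catalan(14):
print(catalan(14))  # 2674440
```

Already in the millions for 15 words, which matches the slide's order-of-magnitude claim.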
PCFG: the probability of a parse tree is the product of the probabilities of the rules used to derive it.
P(Ta) = .15 * .4 * .05 * .05 * .35 * .75 * .4 * .4 * .4 * .3 * .4 * .5 = 1.5 × 10⁻⁶
P(Tb) = .15 * .4 * .4 * .05 * .05 * .75 * .4 * .4 * .4 * .3 * .4 * .5 = 1.7 × 10⁻⁶
Parser 2 (CKY)
CKY idea
Given binary rules A → B C [pA] and D → B C [pD], and words w1 ... w5:
if B covers span (1,2) and C covers span (3,4), then
P(1,4,A) = pA * P(1,2,B) * P(3,4,C)
P(1,4,D) = pD * P(1,2,B) * P(3,4,C)
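The cell-filling idea above can be sketched as a small probabilistic CKY recognizer. The toy grammar and its probabilities are illustrative only (the treebank slide's unary rule NP → N is collapsed into S → N VP here so that every rule is binary or lexical, as CKY's Chomsky normal form requires):

```python
from collections import defaultdict

LEXICAL = {  # (A, word): p  for rules A -> word [p]  (illustrative values)
    ("N", "Paolo"): 0.66, ("N", "Francesca"): 0.33,
    ("V", "ama"): 0.5,
}
BINARY = {   # (A, B, C): p  for rules A -> B C [p]  (illustrative values)
    ("S", "N", "VP"): 1.0,
    ("VP", "V", "N"): 0.5,
}

def cky(words):
    """best[(i, j, A)] = max probability that A derives words[i:j]."""
    n = len(words)
    best = defaultdict(float)
    for i, w in enumerate(words):                    # fill length-1 spans
        for (A, word), p in LEXICAL.items():
            if word == w:
                best[(i, i + 1, A)] = p
    for span in range(2, n + 1):                     # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                # every split point
                for (A, B, C), p in BINARY.items():
                    # the slide's update: P(i,j,A) = pA * P(i,k,B) * P(k,j,C)
                    score = p * best[(i, k, B)] * best[(k, j, C)]
                    if score > best[(i, j, A)]:
                        best[(i, j, A)] = score
    return best[(0, n, "S")]

print(cky("Paolo ama Francesca".split()))  # ≈ 0.054
```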
Treebank
● How can we estimate the probabilities of a PCFG? By counting
● Treebank: a collection of syntactically annotated sentences (trees)
● Penn Treebank: ~1M words
Treebank Grammars (PCFG)
P(A→β)=Count(A→β)/Count(A)
P(S→NP VP) =2/2=1 P(NP→N) =2/2=1
P(VP→V N) =1/2=.5 P(VP→V) =1/2=.5
P(N→Paolo) =2/3=.66 P(N→Francesca) =1/3=.33
P(V→corre) =1/2=.5 P(V→ama) =1/2=.5
Treebank with two trees:
[S [NP [N Paolo]] [VP [V ama] [N Francesca]]]
[S [NP [N Paolo]] [VP [V corre]]]
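The estimates on this slide can be reproduced by counting rules in the two trees. A sketch of the maximum-likelihood formula P(A→β) = Count(A→β)/Count(A); the tree encoding and helper names are my own:

```python
from collections import Counter

# The two treebank trees, as nested (label, children) tuples;
# string children are terminal words.
t1 = ("S", [("NP", [("N", ["Paolo"])]),
            ("VP", [("V", ["ama"]), ("N", ["Francesca"])])])
t2 = ("S", [("NP", [("N", ["Paolo"])]),
            ("VP", [("V", ["corre"])])])

rule_count, lhs_count = Counter(), Counter()

def count_rules(node):
    """Count the rule rewriting this node, then recurse into subtrees."""
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_count[(label, rhs)] += 1
    lhs_count[label] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c)

for t in (t1, t2):
    count_rules(t)

def prob(lhs, *rhs):
    # P(A -> beta) = Count(A -> beta) / Count(A)
    return rule_count[(lhs, rhs)] / lhs_count[lhs]

print(prob("S", "NP", "VP"))   # 1.0
print(prob("VP", "V", "N"))    # 0.5
print(prob("N", "Paolo"))      # 0.666...
```

The results match the slide: P(S→NP VP) = 2/2, P(VP→V N) = 1/2, P(N→Paolo) = 2/3.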