DEPARTMENT APPROVAL
Head of Department
Ng Hu Phc
DETAILED LECTURE OUTLINE
(for use in lecture periods)
Course: NATURAL LANGUAGE PROCESSING
Course group: .....................
Department: Computer Science
Faculty (Institute): Information Technology
On behalf of the course group
H Ch Trung
Course group information
No.  Lecturer           Academic rank   Degree
1    H Ch Trung        GVC             TS (PhD)
3    Nguyn Trung Tn    TG              TS (PhD)
Office: working hours, Department of Computer Science, 13th floor,
Building S4, Military Technical Academy (Le Quy Don Technical University).
Contact address: Department of Computer Science, Faculty of Information Technology,
Military Technical Academy, 236 Hoang Quoc Viet.
Phone, email: 01685582102, [email protected];
Bi ging 01: Tng quan v x l ngn ng t nhin
Chng I, mc:
Tit th: 1-3 Tun th: 1
- Mc ch yu cu
Mc ch: Trang b nhng hiu bit chung nht v mn hc; Nm vng
cc khi nim, bi ton c bn trong X l ngn ng t nhin, c s ton hc
lm c s hc tp mn hc.
Yu cu: sinh vin phi h thng li cc kin thc c s v ton ri rc,
kin thc lp trnh, t nghin cu v n tp li nhng vn l thuyt ngn ng
hnh thc v vn phm.
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
1. Ti sao cn hc XLNNTN?
2. ng dng ca x l ngn ng t nhin
3. Cc vn ca XLNNTN
4. Ni dung mn hc
1. Ti sao cn hc XLNNTN?
NLP - L mt nhnh ca tr tu nhn to tp trung vo cc ng dng
trn ngn ng ca con ngi. Trong tr tu nhn to th x l ngn ng t
nhin
l mt trong nhng phn kh nht v n lin quan n vic phi hiu ngha
ngn ng-cng c hon ho nht ca t duy, giao tip.
Cc thut ton NLP hin i l c c s da trn cc thnh tu ca hc
my, c bit l hc my thng k. Nghin cu cc thut ton NLP hin i i
hi mt s hiu bit trn nhiu lnh vc khc nhau, bao gm ngn ng hc,
khoa
hc my tnh, v xc sut thng k.
2. Ti sao XLNNTN l kh?
Ambiguity
"At last, a computer that understands you like your mother."
1. (*) It understands you as well as your mother understands you
2. It understands (that) you like your mother
3. It understands you as well as it understands your mother
Readings 1 and 3: does this mean well, or poorly?
Ambiguity at Many Levels
At the acoustic level (speech recognition):
1. "... a computer that understands you like your mother"
2. "... a computer that understands you lie cured mother"
Ambiguity at Many Levels
At the syntactic level:
Different structures lead to different interpretations.
Ông già đi rất nhanh. ("The old man walks very fast" / "He is aging very fast")
At the semantic (meaning) level:
Two definitions of "mother":
- a woman who has given birth to a child
- a stringy slimy substance consisting of yeast cells and bacteria; it is added
to cider or wine to produce vinegar
At the semantic (meaning) level:
They put money in the bank
= buried in mud?
I saw her duck with a telescope
At the discourse (multi-clause) level:
"Alice says they've built a computer that understands you like your mother."
But she...
- doesn't know any details
- doesn't understand me at all
This is an instance of anaphora, where "she" co-refers to some other
discourse entity.
Example: Ông già đi rất nhanh.
3. ng dng ca x l ngn ng t nhin
XLNN l mt trong nhng lnh vc mi nhn trong x hi thng tin
1. Xy dng kho thut ng (Terminological Resources
Construction)
Mc ch: xy dng t in thut ng chuyn ngnh; bng thut ng dng
trong nh my, x nghip; t in ln dng cho cc h thng ch mc ho ti
liu; t in thut ng song ng dng cho dch thut v.v; Thu thp thut ng
t
kho vn bn.
Cch tip cn: Xc nh t, ng on danh t. Xc nh cc nhm t
thng cng xut hin (collocation)
2. Tm kim, truy xut thng tin (Information
Retrieval/Extraction)
Mc ch: Tm kim cc vn bn c lin quan n truy vn; Sp xp cc
vn bn tm c.
Cch tip cn: Ch mc ho ti liu (indexation); X l cu truy vn
(chun ho, tm thut ng tng ng, v.v.); Sp xp kt qu truy vn (nh
gi lin quan ca ti liu so vi truy vn)
3. Tm tt vn bn (Text Summary)
Mc ch: Sinh tm tt vn bn t ng
Cch tip cn: Hiu vn bn t ng, rt gn, sinh tm tt; Xc nh cc
n v vn bn ni bt, chn on vn bn tng ng, gp tm tt; Lc tm tt
vn bn nh phn loi ng ngha cu da theo cc cu trc ngn ng.
4. Dch t ng (Machine Translation)
Mc ch: Dch t ng; Tr gip dch bng my
Cch tip cn: Phn tch vn bn ngun (sa li, chun ho, n gin ho,
ch gii ngn ng); Dch t ng (kh thi trn cc vn bn trong phm vi
hp)/bn t ng (can thip trn ngn ng ngun hoc ch); Sa bn dch
5. Hiu vn bn t ng (Automatic Text Comprehension)
Mc ch: Nhn bit ch vn bn; Thit lp quan h gia cc cu (cu
trc nguyn nhn, chui thi gian, i t, v.v).
Cch tip cn: Phn tch cu trc vn bn thit lp c quan h gia
cc thnh phn trong vn bn; Phn tch ch , hnh ng, nhn vt, cu trc
mnh v.v.
6. Sinh vn bn t ng (Automatic Text Generation)
Mc ch: Sinh vn bn cho h thng dch; Sinh vn bn cho h thng hi
thoi ngi my; Sinh vn bn din t cc d liu s
Cch tip cn: Phn tch ni dung mc su: mng ng ngha, khi nim;
T chc ni dung su thnh cc mnh cn din t; Xy dng cy c php,
chnh sa hnh thi t.
7. i thoi ngi - my (Human-Machine Dialogue)
Mc ch: Xy dng h thng giao tip ngi my.
Cch tip cn: Tin x l u vo: nhn dng ting ni; Hiu vn bn t
ng (c bit ch n vn phn tch tham chiu - reference); Sinh vn bn
t ng; Tng hp ting ni.
1.2. Cc vn ca XLNN
1. X l n ng (Monolingual Processing)
Phn tch vn bn tng cp
- T vng (Lexical/Morpho-syntactic Analysis)
- C php (Syntactic Analysis/Parsing)
- Ng ngha (Semantic Analysis)
- Ng dng (Pragmatics)
2. X l a ng (Multilingual Processing)
Xy dng cng c
- Ging hng a ng (Multilingual Alignment)
- Tr gip dch a ng (Machine Translation)
- Tm kim thng tin a ng (Multilingual Information Retrieval)
1.3. Ti nguyn ngn ng cho XLNN
1. Tm quan trng
Cng c v ti nguyn trong XLNN
Cng c, phng php: mang tnh tng qut, p dng c cho nhiu
ngn ng
Ti nguyn: c trng cho tng ngn ng; xy dng rt tn km) dn n
nhu cu chia s, trao i ti nguyn ngn ng
Cc "ngn hng" ng liu ln:
LDC (Linguistic Data Consortium), ELDA (Evaluations and
Language
resources Distribution Agency), OLAC (Open Language Archives
Community),
v.v.
2. X l n ng
T in (lexicon)
- Thng tin hnh thi (morphology)
- Thng tin c php (syntax)
- Thng tin ng ngha (semantics), bn th hc (ontology)
Ng php (grammar)
- Vn phm hnh thc (Grammar Formalisms)
Kho vn bn (Corpora)
- Kho vn bn th (Raw Corpus)
- Kho vn bn c ch gii ngn ng (Annotated Corpus) t, t loi,
cu trc ng php, v.v.
3. X l a ng
T in a ng
- T in song ng
- T in a ng
Ng php
- Vn phm song ng(Bilingual Grammar)
Kho vn bn a ng (Multilingual/Parallel Corpus)
- Kho vn bn a ng th
- Kho vn bn a ng ging hng (Aligned Multilingual Corpus),
c hoc khng c ch gii ngn ng
- B nh dch (Translation Memory)
1.4. Vn chun ho (Standardization)
1. Yu cu chun ho ti nguyn ngn ng
Nhu cu trao i ng liu: Biu din nht qun; M ho chun
Cc hot ng chun ho: Cc d n hng ti chun (EAGLES, TEI,
v.v.); D n ISO TC 37/SC 4
2. Cc kha cnh chun ho
M hnh biu din: T in; Ch gii kho vn bn, v.v.
Thut ng, phm tr d liu: Thut ng chun (Terminology); DCR (Data
Category Registry)
Ngn ng m ho: XML; RDF (Resource Description Framework), OWL
(Web Ontology Language), v.v.
4. Ni dung mn hc
1. Tng quan v x l ngn ng t nhin (1 lecture)
2. B tc mt s khi nim, thut ng trong NLP (1 lecture)
3. M hnh ngn ng v cc k thut lm mn (1 lecture)
4. Vn gn nhn v m hnh Markov n (2 lectures)
5. Phn tch da trn thng k (2 lectures)
6. Dch my (2 lectures)
7. Log-linear models (2 lectures)
8. Conditional random fields, and global linear models (2
lectures)
9. Unsupervised/semi-supervised learning in NLP (2 lectures)
- Ni dung tho lun
1. Phn bit cc dng thc ngn ng s ging, khc nhau gia ngn ng
lp trnh v ngn ng t nhin.
- Yu cu SV chun b
n tp li cc kin thc lin quan n l thuyt ngn ng hnh thc,
automata hu hn v biu thc chnh quy.
- Ti liu tham kho
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd
edition, Daniel Jurafsky and James H. Martin. Prentice Hall, 2008. Chapter 1.
2. Foundations of Statistical Natural Language Processing, Christopher
Manning and Hinrich Schütze, MIT Press, 1999. Chapter 1.
3. Data Mining: Practical Machine Learning Tools and Techniques (3rd ed),
Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2005. Chapter 1.
- Ghi ch: Cc mn hc tin quyt : tr tu nhn to, cu trc d liu v
gii
thut, lp trnh cn bn.
Bi ging 02: B tc mt s khi nim, thut ng trong XLNNTN
Chng I, mc:
Tit th: 1-3 Tun th: 2
- Mc ch yu cu
Mc ch: Cung cp cc khi nim v thut ng c bn trong x l ngn
ng t nhin; cc vn t ra trong x l ngn ng t nhin v ng dng
Yu cu: Sinh vin nm vng khi nim lm tin cho theo di cc bi
ging tip theo ca mn hc.
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
2.1. Tm tt c im ting Vit
1. Lch s pht trin ting Vit
Qu trnh pht trin: H Nam , nhnh Mn-Khmer, khi ng Mn-
Khmer, nhm Vit-Mng (A. Haudricourt, 1953); Quan h tip xc vi
cc
ngn ng trong khu vc, c bit l cc ting h Thi; Thi k Bc thuc,
vay
mn ting Hn (xp x 70% vn t vng ting Vit gc Hn); Thi k Php
thuc, vay mn t ting Php, "sao phng ng php" chu u
Loi hnh ngn ng ting Vit
Cc loi hnh ngn ng
Bin hnh (flexional languages)
- Bin i hnh thi t th hin quan h ng php
- Cu to t: cn t, ph t kt hp cht ch
- Mt ph t c th biu din nhiu ngha ng php
- V d: ting Anh, Php, Nga
Chp dnh (agglutinating languages)
- Cu to t mi bng cch chp dnh cn t vi cc ph t
- Cn t c th ng c lp
- Mi ph t ch th hin mt ngha nht nh
- V d: ting Th Nh K, Nht, Triu Tin
a tng hp (polysynthetic languages)
- C n v t c bit c th lm thnh cu
- C c tnh cht ca ngn ng bin hnh v chp dnh
- V d: Mt s ngn ng vng Kapkaz
n lp (isolating languages)
- T khng c hin tng bin hnh
- Quan h ng php c din t bng trt t t (word order) hoc cc h t
(tool words)
- n v hnh tit = m tit (syllable) = hnh v (morpheme)
- V d: Hn, Thi, Vit l ngn ng n lp.
Ch vit v h thng m
Ch vit
- Da trn bng ch ci latin
- Ch vit: k m (phonetic transcription)
- Cc quy nh chun ho cha c tn trng (i hay y, qui hay quy, phin
m
ting nc ngoi)
H thng m
- H thng m chun cho ting Vit ph thng (cha c a vo t in)
- Cc cch pht m a phng
- (Tham kho thm http://www.vietlex.com)
T v t loi ting Vit
T trong t in ting Vit (Trung tm t in hc)
T n: t n tit, mt s t a tit
T phc: t a tit
- Kt hp chnh ph (semantic subordination): xe p
- Kt hp song song (semantic coordination): qun o, non nc, giang
sn
- Ly (reduplication): trng trng
- Qun ng (expression): u b u bu
T loi trong t in ting Vit
- Danh t (noun), ng t (verb), tnh t (adjective), i t (pronoun),
ph t
(adverb), kt t (conjunction/linking word), tnh thi t (modal
word), thn t
(interjection)
- Hin tng chuyn loi (category mutation) ph bin
Ng php
Cu to ng
- Th t chnh - ph ng vai tr ch o
- S dng h t th hin s nhiu, quan h thi, quan h ph thuc v lin
hp
song song
- S dng dng ly, ng iu thay i sc thi ngha
Cu to cu
- Th t thng thng S-V-O
- Th t - thuyt (topic prominent): Cy l to. Nh xy ri.
2.2. Phn tch t vng
Mt s thut ng
T (word)
- Hnh v (morpheme), gc t (stem), t v (lexeme), t v chun tc
(lemma)
T loi (part-of-speech - POS)
- Phn loi t (word category): danh t, ng t, tnh t, v.v.
- c im hnh thi t (morphology): dng t bin hnh (inflectional
forms)
Phn tch t vng ting Vit
Phn on t (Word segmentation): Nhp nhng do t a tit; Cng
c hin c?
Gn nhn t loi (POS tagging): Xc nh tp t loi; Gii quyt
nhp nhng do hin tng chuyn loi, t ng ngha; Khng da c
vo hnh thi t; Cng c hin c?
2.3. Phn tch ng php
Phn tch cm t (chunking):
- Phn tch c php nng
- Hai cch tip cn: quy tc (vn phm chnh quy), thng k (bi ton
gn nhn)
- Ph thuc vo kt qu tch t v gn nhn t loi
- Cng c hin c?
Phn tch cy c php (parsing)
- C php thnh phn (constituency), c php ph thuc
- (dependency)
- Hai cch tip cn: Thng k, da vo quy tc
- Cng c hin c?
2.4. Phn tch ng ngha
- Ni dung tho lun
2. Kinh nghim trong qu trnh bin dch v debug khi lp trnh trong
mi
trng Turbo C v Visual C++.
3. S ging, khc nhau gia ngn ng lp trnh v ngn ng t nhin.
4. S ging, khc nhau gia trnh bin dch v ngi bin dch.
- Yu cu SV chun b
n tp li cc kin thc lin quan n l thuyt ngn ng hnh thc,
automata hu hn v biu thc chnh quy.
- Bi tp
- Ti liu tham kho
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd
edition, Daniel Jurafsky and James H. Martin. Prentice Hall, 2008. Chapter 2.
- Cu hi n tp
- Ghi ch: Cc mn hc tin quyt : ton ri rc, cu trc d liu v gii
thut,
lp trnh cn bn.
Bi ging 03: M hnh ngn ng v cc k thut lm mn
Chng I, mc:
Tit th: 1-3 Tun th: 3
- Mc ch yu cu
Mc ch: Trang b kin thc v m hnh ha cc m hnh biu din ngn
ng t nhin.
Yu cu: Nm vng cc m hnh biu din ngn ng trong hc my.
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
1. Vn m hnh ha ngn ng
2. M hnh N-gram (bigram, trigram)
3. nh gi m hnh ngn ng
4. Cc k thut lm mn
4.1. Ni suy tuyn tnh (Linear interpolation)
4.2. Chit khu (Discounting methods)
4.3. Truy hi (Back-off)
1. Vn m hnh ha ngn ng
M hnh ngn ng c p dng trong rt nhiu lnh vc ca x l ngn
ng t nhin nh: kim li chnh t, dch my hay phn on t... Chnh v
vy,
nghin cu m hnh ngn ng chnh l tin nghin cu cc lnh vc tip
theo.
M hnh ngn ng c nhiu hng tip cn, nhng ch yu c xy dng
theo m hnh Ngram.
M hnh ngn ng l mt phn b xc sut trn cc tp vn bn. Ni n
gin, m hnh ngn ng c th cho bit xc sut mt cu (hoc cm t) thuc
mt ngn ng l bao nhiu.
V d 1: khi p dng m hnh ngn ng cho ting Vit:
P[hm qua l th nm] = 0.001
P[nm th hm l qua] = 0
Example 2: We have some (finite) vocabulary,
say V = {the, a, man, telescope, Beckham, two, ...}
We have an (infinite) set of strings, V†:
the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP
We have a training sample of example sentences in English.
We need to learn a probability distribution p, i.e., p is a function that satisfies:
sum_{x ∈ V†} p(x) = 1,   p(x) ≥ 0 for all x ∈ V†
p(the STOP) = 10^-12
p(the fan STOP) = 10^-8
p(the fan saw Beckham STOP) = 2 × 10^-8
p(the fan saw saw STOP) = 10^-15
p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^-9
2. The N-gram model (bigram, trigram)
The task of a language model is to give the probability of a sentence w_1 w_2 ... w_m.
By the chain (multiplication) rule, P(AB) = P(A) * P(B|A), so:
P(w_1 w_2 ... w_m) = P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_m|w_1 ... w_{m-1})
With this formula the language model would need an enormous amount of memory to store the
probabilities of all word sequences shorter than m. This is clearly infeasible, since m is
the length of a natural-language text and can grow without bound. To compute the
probability of a text with an acceptable amount of memory, we use the Markov assumption
of order n:
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-n}, ..., w_{i-1})
Under the Markov assumption, the probability of a word w_i is taken to depend only on the
n words immediately preceding it rather than on the whole preceding sequence. The
probability of a text is then computed as:
P(w_1 w_2 ... w_m) = P(w_1) * P(w_2|w_1) * ... * P(w_n|w_1 ... w_{n-1}) * ∏_{i=n+1}^{m} P(w_i | w_{i-n} ... w_{i-1})
With this formula we can build a language model by counting word sequences of at most
n+1 words. Such a model is called an N-gram language model.
An N-gram is a contiguous subsequence of n elements of a given sequence.
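To make the counting concrete, here is a minimal Python sketch (not part of the original
outline) that estimates bigram probabilities by relative frequency from a toy corpus; the
corpus, function names, and sentence markers are illustrative only.

from collections import defaultdict

def train_bigram(sentences):
    """Count unigrams and bigrams, then return a maximum-likelihood bigram estimate."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

# Toy Vietnamese corpus, whitespace-tokenised for simplicity
corpus = ["hôm qua là thứ năm", "hôm nay là thứ sáu"]
p = train_bigram(corpus)
print(p("hôm", "qua"))   # 0.5
print(p("là", "thứ"))    # 1.0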
3. Evaluating a language model
When an N-gram model uses raw (unsmoothed) probabilities, the uneven distribution of the
training texts can lead to inaccurate estimates. When N-grams are sparsely distributed,
i.e. many N-grams never occur or occur only a few times, sentences containing those
N-grams receive poor probability estimates. With a vocabulary of size V there are V^n
possible N-grams, but in practice the number of meaningful, frequently occurring N-grams
is only a tiny fraction of that.
Example: Vietnamese has roughly 5,000 distinct syllables, so there are about
5,000^3 = 125,000,000,000 possible 3-grams; yet only about 1,500,000 3-grams are actually
observed in corpus statistics. Hence very many 3-grams never appear, or appear only rarely.
When computing the probability of a sentence we will therefore often meet N-grams that
never occurred in the training data. This makes the probability of the whole sentence
zero, even though the sentence may be perfectly well formed both syntactically and
semantically. To overcome this, smoothing (estimation) techniques are used.
4. Smoothing techniques
To overcome the sparse distribution of N-grams described above, smoothing methods
re-estimate the N-gram probabilities so that they are more accurate ("smoother").
Smoothing adjusts the statistics in two ways:
- N-grams with probability 0 (unseen N-grams) are given a small non-zero value;
- the probabilities of N-grams that were observed are adjusted accordingly, so that the
total probability mass still sums to 1.
Smoothing methods fall into the following families (a small code sketch follows this list):
- Discounting: reduce (slightly) the probabilities of N-grams with non-zero counts and
give the freed mass to N-grams that did not occur in the training set.
- Back-off: estimate the probability of an unseen N-gram from shorter N-grams that do
have non-zero counts.
- Interpolation: estimate the probability of every N-gram as a combination of the
probabilities of shorter N-grams.
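The sketch below illustrates the first two ideas (discounting and back-off) for bigrams.
It is a deliberately simplified version, assuming count dictionaries are already available;
the discount constant and the way the freed mass is redistributed are illustrative, not a
full Katz back-off implementation.

def backoff_bigram(bigram_count, unigram_count, total_tokens, discount=0.5):
    """Simplified discounting + back-off: subtract a constant from every seen bigram
    count, and give the freed probability mass to unseen bigrams via the unigram
    distribution. (A real back-off model renormalises the unigram part over the
    unseen continuations only; this sketch skips that step for brevity.)"""
    def p(w1, w2):
        c1 = unigram_count.get(w1, 0)
        c12 = bigram_count.get((w1, w2), 0)
        if c1 == 0:
            return unigram_count.get(w2, 0) / total_tokens    # nothing to condition on
        if c12 > 0:
            return (c12 - discount) / c1                      # discounted bigram estimate
        seen_types = sum(1 for (u, _) in bigram_count if u == w1)
        alpha = discount * seen_types / c1                    # mass freed by discounting
        return alpha * unigram_count.get(w2, 0) / total_tokens
    return p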
- Yu cu SV chun b
n tp li cc kin thc lin quan n l thuyt ngn ng hnh thc,
automata hu hn v biu thc chnh quy.
- Ti liu tham kho
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd
edition, Daniel Jurafsky and James H. Martin. Prentice Hall, 2008. Chapter 3.
2. Foundations of Statistical Natural Language Processing, Christopher
Manning and Hinrich Schütze, MIT Press, 1999. Chapter 2.
- Cu hi n tp
- Ghi ch: Cc mn hc tin quyt : ton ri rc, cu trc d liu v gii
thut,
lp trnh cn bn.
Bi ging 04: Ngn ng hnh thc v Automata hu hn
Chng 4, mc:
Tit th: 1-3 Tun th: 4,5
- Mc ch yu cu
Mc ch: Nm c cc khi nim c bn v cng c lm vic vi ngn
ng l vn phm phi ng cnh v automata hu hn.
Yu cu: Nm vng l thuyt ngn ng hnh thc, cc dng thc automata
hu hn v ng dng trong x l ngn ng.
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
1. An introduction to the parsing problem
2. Context free grammars
3. A brief(!) sketch of the syntax of English
4. Examples of ambiguous structures
5. Probabilistic Context-Free Grammars (PCFGs)
6. The CKY Algorithm for parsing with PCFGs
7. Lexicalization of a treebank
8. Lexicalized probabilistic context-free grammars
9. Parameter estimation in lexicalized probabilistic
context-free grammars
10. Accuracy of lexicalized probabilistic context-free
grammars
1. C php (Syntax)
Mc ch phn tch c php: Kim tra mt cu c ng ng php hay
khng; Ch ra cc ng on (syntagm) v quan h ph thuc gia chng cho
vic xy dng ngha ca cu
V d: Con mo con ang xi mt con chut cng to b.
[[[Con [mo]] con]NP[[ang xi] [mt [[con[chut cng]] to
b]]NP]VP]S.
T vng v ng php:
T vng (lexicon): T vng cha tt c cc t trong ngn ng; T vng
phi cha cc thng tin ng m, hnh thi, ng php, ng ngha ca mi t
Ng php (grammar): Phm tr ng php (t loi, ng on, v.v.); Quy tc
(ng m, hnh thi, ng php, ng ngha, ng dng)
T vng v ng php b sung cho nhau.
2. Formal grammars
A grammar G is an ordered 4-tuple G = <Σ, Δ, S, P>, where:
- Σ is the basic alphabet (the terminal symbols);
- Δ, with Δ ∩ Σ = ∅, is the auxiliary alphabet (the nonterminal symbols);
- S ∈ Δ is the start symbol (axiom);
- P is the set of production rules of the form α → β, with α, β ∈ (Σ ∪ Δ)*, where α
contains at least one nonterminal symbol (productions are also called rewrite rules).
Notational conventions used in this course:
- capital letters A, B, C, ... denote variables (nonterminals), with S the start symbol;
- X, Y, Z, ... denote symbols that may be either terminals or variables;
- a, b, c, d, e, ... denote terminal letters;
- u, v, w, x, y, z, ... denote strings of terminal letters;
- α, β, γ, ... denote strings of variables and/or terminal symbols.
Further notions: direct and indirect derivations, derivations with and without loops,
derivation trees, equivalent grammars, and the language generated by a grammar.
Avram Noam Chomsky (1956) introduced a classification of grammars based on the form of
their production rules:
- Type 0, unrestricted grammars (UG): no constraints on the set of productions;
- Type 1, context-sensitive grammars (CSG): every production has the form α → β with
|α| ≤ |β|;
- Type 2, context-free grammars (CFG): every production has the form A → α, where A is a
single variable and α is a string over (Σ ∪ Δ)*;
- Type 3, regular grammars (RG): every production is right-linear or left-linear.
Right-linear: A → aB or A → a;
Left-linear: A → Ba or A → a;
where A, B are single variables and a is a terminal symbol (possibly empty).
If L0, L1, L2, L3 denote the classes of languages generated by grammars of types 0, 1, 2,
3 respectively, then L3 ⊂ L2 ⊂ L1 ⊂ L0.
L0, L1: languages recognisable by Turing machines;
L2: context-free (algebraic) languages, recognisable by pushdown automata;
L3: regular languages, recognisable by finite-state automata.
The linguistic interpretation of a grammar G = <Σ, Δ, S, P> used for parsing:
- Σ represents the vocabulary (lexicon) of the language;
- Δ represents the grammatical categories: the sentence, the phrases (noun phrase, verb
phrase, ...), and the parts of speech (noun, verb, ...);
- the start symbol S corresponds to the category "sentence";
- the production set P encodes the syntactic rules; rules that contain at least one
terminal symbol (a word) are called lexical rules, the remaining rules are called
phrase-structure rules.
Each word of the lexicon (dictionary) is described by the set of productions containing
that word on their right-hand side.
Each derivation tree (parse tree) describes the analysis of a phrase into its immediate
constituents.
Nhp nhng ngn ng v vn phm:
Nhp nhng t vng: Mt t c nhiu t loi ()
Nhp nhng ng on: Mt cm t c th phn tch thanh cc cm con
theo nhiu cch khc nhau (ng gi i nhanh qu)
3. Regular languages and automata
An automaton is an abstract machine (a model of computation) with a simple structure and
simple behaviour, yet capable of recognising languages.
A finite automaton (FA) is a finite model of computation: it has a beginning and an end,
every component has a fixed, finite size that cannot grow during the computation, and it
operates in discrete steps.
In general, the output produced by an FA depends on both the current and the earlier
input. If memory is used, it is assumed to be unbounded; the distinction between the
different kinds of automata is based mainly on whether information can be written into
that memory.
Definition: a DFA is a 5-tuple A = (Q, Σ, δ, q0, F), where:
1. Q is a non-empty finite set of states (p, q, ...);
2. Σ is the input alphabet (a, b, c, ...);
3. δ: D → Q, with D ⊆ Q × Σ, is the transition function, i.e. δ(p, a) = q or δ(p, a) is
undefined, where p, q ∈ Q and a ∈ Σ;
4. q0 ∈ Q is the start state;
5. F ⊆ Q is the set of final (accepting) states.
When D = Q × Σ we say that A is a complete DFA. (A small simulation sketch follows below.)
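The following Python sketch simulates a DFA given as a transition table; the particular
automaton used here (strings over {a, b} that end in "ab") is invented for illustration.

def run_dfa(delta, q0, finals, s):
    """Follow delta from q0 over the string s; accept if the run ends in a final state."""
    q = q0
    for ch in s:
        if (q, ch) not in delta:      # partial DFA: an undefined transition rejects
            return False
        q = delta[(q, ch)]
    return q in finals

# Example DFA over {a, b} accepting strings that end in "ab"
delta = {("0", "a"): "1", ("0", "b"): "0",
         ("1", "a"): "1", ("1", "b"): "2",
         ("2", "a"): "1", ("2", "b"): "0"}
print(run_dfa(delta, "0", {"2"}, "aab"))   # True
print(run_dfa(delta, "0", {"2"}, "abb"))   # False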
Definition: a nondeterministic finite automaton (NFA) is a 5-tuple A = (Q, Σ, δ, q0, F),
where:
1. Q is a finite set of states;
2. Σ is a finite alphabet;
3. δ: Q × Σ → 2^Q is the transition mapping;
4. q0 ∈ Q is the start state;
5. F ⊆ Q is the set of final states.
The mapping δ is multi-valued (not single-valued), hence A is called nondeterministic.
Definition: an NFA with ε-moves (NFAε) is a 5-tuple A = (Q, Σ, δ, q0, F), where:
1. Q is a finite set of states;
2. Σ is a finite alphabet;
3. δ: Q × (Σ ∪ {ε}) → 2^Q;
4. q0 is the start state;
5. F ⊆ Q is the set of final states.
Regular expressions
Definition: regular expressions (REs) over an alphabet Σ are defined recursively as
follows:
1. ∅ is a regular expression, with L(∅) = ∅;
   ε is a regular expression, with L(ε) = {ε};
   for every a ∈ Σ, a is a regular expression, with L(a) = {a}.
2. If r and s are regular expressions, then:
   ((r)) is a regular expression, with L((r)) = L(r);
   r + s is a regular expression, with L(r + s) = L(r) ∪ L(s);
   r.s is a regular expression, with L(r.s) = L(r).L(s);
   r* is a regular expression, with L(r*) = L(r)*.
3. Nothing is a regular expression unless it follows from 1 and 2.
* Further reading on regular expressions:
1. Jeffrey E. F. Friedl. Mastering Regular Expressions, 2nd
Edition.
O'Reilly & Associates, Inc. 2002.
2. http://www.regular-expressions.info/
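As a quick illustration of the notation just defined (union "+", concatenation ".",
Kleene star "*"), the small sketch below tests membership with Python's re module, whose
syntax writes union as "|"; the example language is invented.

import re

# (a + b.c)* in the notation above corresponds to (a|bc)* in Python's re syntax
pattern = re.compile(r"(a|bc)*")

for s in ["", "a", "abc", "bca", "ab"]:
    print(s, bool(pattern.fullmatch(s)))
# "", "a", "abc", "bca" are in the language; "ab" is not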
Theorem 1: If L is accepted by an NFA, then there is a DFA that accepts L.
General algorithm for constructing a DFA from an NFA (subset construction):
Given an NFA A = (Q, Σ, δ, q0, F) accepting L, build a DFA A' = (Q', Σ, δ', q0', F')
accepting L as follows:
- Q' = 2^Q; an element of Q' is written [q0, q1, ..., qi] with q0, q1, ..., qi ∈ Q;
- q0' = [q0];
- F' is the set of states of Q' containing at least one final state of F;
- δ'([q1, q2, ..., qi], a) = [p1, p2, ..., pj] if and only if
δ({q1, q2, ..., qi}, a) = {p1, p2, ..., pj};
- finally, rename the states [q0, q1, ..., qi].
Theorem 2: If L is accepted by an NFAε, then L is also accepted by an NFA without
ε-moves.
Construction: given an NFAε A = (Q, Σ, δ, q0, F) accepting L, build the NFA
A' = (Q, Σ, δ', q0, F') as follows:
- F' = F ∪ {q0} if ε-closure(q0) contains at least one state of F; otherwise F' = F;
- δ'(q, a) = δ*(q, a) (the extended transition function, which takes ε-closures into
account).
Corollary: if L is accepted by an NFAε, then there is a DFA that accepts L.
The corresponding DFA can be built with the following procedure, rewritten here as a
small Python sketch of the textbook subset construction; the NFA transition function
delta and the epsilon_closure helper are assumed to be provided by the caller.

def nfa_eps_to_dfa(delta, q0, alphabet, epsilon_closure):
    """Subset construction with epsilon-closures: each DFA state is a frozenset of NFA states."""
    start = frozenset(epsilon_closure({q0}))
    dfa_states = {start}                 # step 1-2: T = eps-closure(q0); add T, unmarked
    dfa_delta = {}
    unmarked = [start]
    while unmarked:                      # step 3: while some state T is still unmarked
        T = unmarked.pop()               # step 3.1: mark T
        for a in alphabet:               # step 3.2: for every input symbol a
            moves = set()
            for q in T:
                moves |= set(delta(q, a))
            U = frozenset(epsilon_closure(moves))
            if U not in dfa_states:      # if U is not yet a state of the DFA
                dfa_states.add(U)
                unmarked.append(U)       # U is added, still unmarked
            dfa_delta[(T, a)] = U        # delta'[T, a] = U
    return dfa_states, dfa_delta, start
Theorem 3: if r is a regular expression, then there is an NFAε accepting L(r).
(Proof: see the lecture; Thompson's construction.)
Theorem 4: if L is accepted by a DFA, then L is denoted by some regular expression.
Proof sketch:
Let L be accepted by the DFA A = ({q1, q2, ..., qn}, Σ, δ, q1, F).
Let R^k_ij = {x | δ(qi, x) = qj, and for every proper prefix y of x (y ≠ x), if
δ(qi, y) = ql then l ≤ k}; that is, R^k_ij is the set of all strings that take the
automaton from state i to state j without passing through any intermediate state
numbered higher than k.
Recursive definition:
R^k_ij = R^{k-1}_ik (R^{k-1}_kk)* R^{k-1}_kj ∪ R^{k-1}_ij
We prove (by induction on k) the lemma: for every R^k_ij there is a regular expression
r^k_ij denoting R^k_ij.
Base case k = 0: R^0_ij is a finite set of one-symbol strings or ε.
Inductive step: assume the lemma holds for k-1, i.e. there are regular expressions
r^{k-1}_lm with L(r^{k-1}_lm) = R^{k-1}_lm. Then for R^k_ij we can choose the regular
expression
r^k_ij = (r^{k-1}_ik)(r^{k-1}_kk)*(r^{k-1}_kj) + r^{k-1}_ij
which proves the lemma.
Finally, note that L(A) = ∪_{qj ∈ F} R^n_1j, so L can be denoted by the regular expression
r = r^n_1j1 + r^n_1j2 + ... + r^n_1jp, where F = {qj1, qj2, ..., qjp}.
4. My chuyn hu hn trng thi (tmat hu hn c u ra)
ng dng my chuyn hu hn trng thi:
Phn on vn bn thnh cc cu, phn on cu thanh cc t
Phn tch t thnh cc hnh v (ngn ng bin hnh) Gn nhn t loi
Ci t trnh phn tch c php vn phm phi ng cnh: my chuyn
quy. Mi quy tc biu din bng mt my chuyn, vn phm l my chuyn vi
xu vo l cu cn phn tch, xu ra l cu phn tch c php t ngoc.
5. Vn phm phi ng cnh v phn tch c php
Thut ton phn tch c php:
Nguyn l chung
Cc chin lc:
Phn tch t di ln (nhn bit): qu trnh phn tch c hng
dn bi cu vo (quy tc sinh c dng t phi sang tri)
Phn tch t trn xung (on bit): qu trnh phn tch c hng
dn bi cc gi thuyt (quy tc sinh c dng t tri sang phi)
Kt hp 2 chin lc
hn ch vic lp li cc tnh ton, ngi ta s dng mt bng ghi nh
cc kt qu trung gian.
Hn ch ca vn phm phi ng cnh:
Cc hn ch chnh:
Cy kt qu khng th hin cc rng buc ng ngha trong cu phn tch
S a dng ca cc cu trc c php i hi mt s lng rt ln cc quy
tc ng php, nhng khng c cch biu din lin h gia chng vi nhau.
Chomsky a ra vn phm ci bin (transformational grammar), nhng
vn
phm ny cng b ph phn v mt ngn ng rt nhiu. Hn na phc tp
tnh ton tr v tng ng vi vn phm dng 0.
T ra i nhiu h hnh thc vn phm mi.
Parsing (Syntactic Structure)
INPUT:
Boeing is located in Seattle.
OUTPUT:
Syntactic Formalisms
Work in formal syntax goes back to Chomsky's PhD thesis in the
1950s
Examples of current formalisms: minimalism, lexical functional grammar
(LFG), head-driven phrase-structure grammar (HPSG), tree adjoining
grammars (TAG), categorial grammars
Data for Parsing Experiments
Penn WSJ Treebank = 50,000 sentences with associated trees
Usual set-up: 40,000 training sentences, 2400 test sentences
An example tree:
The Information Conveyed by Parse Trees
(1) Part of speech for each word
(N = noun, V = verb, DT = determiner)
(2) Phrases
Noun Phrases (NP): "the burglar", "the apartment"
Verb Phrases (VP): "robbed the apartment"
Sentences (S): "the burglar robbed the apartment"
(3) Useful Relationships
=> "the burglar" is the subject of "robbed"
An Example Application: Machine Translation
English word order is subject - verb - object
Japanese word order is subject - object - verb
English: IBM bought Lotus
Japanese: IBM Lotus bought
English: Sources said that IBM bought Lotus yesterday
Japanese: Sources yesterday IBM Lotus bought that said
Context-Free Grammars
A Context-Free Grammar for English
N = {S, NP, VP, PP, DT, Vi, Vt, NN, IN}
S = S
Σ = {sleeps, saw, man, woman, telescope, the, with, in}
Note: S=sentence, VP=verb phrase, NP=noun phrase,
PP=prepositional phrase, DT=determiner, Vi=intransitive
verb,
Vt=transitive verb, NN=noun, IN=preposition
Left-Most Derivations
For example: [S], [NP VP], [D N VP], [the N VP], [the man VP],
[the man
Vi], [the man sleeps]
Representation of a derivation as a tree:
An Example
DERIVATION        RULE USED
S                 S → NP VP
NP VP             NP → DT N
DT N VP           DT → the
the N VP          N → dog
the dog VP        VP → VB
the dog VB        VB → laughs
the dog laughs
Properties of CFGs
A CFG defines a set of possible derivations
A string s is in the language defined by the CFG if there is at
least one
derivation that yields s
Each string in the language generated by the CFG may have more
than
one derivation (ambiguity")
An Example of Ambiguity
The Problem with Parsing: Ambiguity
INPUT: She announced a program to promote safety in trucks and
vans
POSSIBLE OUTPUTS:
A Brief Overview of English Syntax
Parts of Speech (tags from the Brown corpus):
Nouns
o NN = singular noun e.g., man, dog, park
o NNS = plural noun e.g., telescopes, houses, buildings
o NNP = proper noun e.g., Smith, Gates, IBM
Determiners
o DT = determiner e.g., the, a, some, every
Adjectives
o JJ = adjective e.g., red, green, large, idealistic
- Ni dung tho lun
1. S ging, khc nhau gia ngn ng lp trnh v ngn ng t nhin.
2. Cc chin lc phn tch c php t trn xung v t di ln
- Yu cu SV chun b
n tp li cc kin thc lin quan n l thuyt ngn ng hnh thc,
automata hu hn v biu thc chnh quy.
- Ti liu tham kho
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd
edition, Daniel Jurafsky and James H. Martin. Prentice Hall, 2008. Chapter 4.
Bi ging 05: X l ngn ng da trn thng k
Chng 6, mc:
Tit th: 1-3 Tun th: 6
- Mc ch yu cu
Mc ch: Tm hiu phng php thng k ng dng cho x l ngn ng,
Yu cu: Nm c phng php, bit cch ng dng xy dng chng
trnh, vit c mt s ng dng n gin
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
1. The Tagging Problem
2. Generative models, and the noisy-channel model, for
supervised learning
3. Hidden Markov Model (HMM) taggers
a. Basic definitions
b. Parameter estimation
c. The Viterbi algorithm
Phn tch ngn ng:
Cc cch tip cn
Da trn lut: XD m hnh h thng vi tp cc lut ngn ng
Da trn thng k: XD m hnh h thng vi tp cc xc sut cho cc "s
kin" c th xy ra
Cc m hnh lai kt hp c 2 phng php trn
Phn tch da vo kho ng liu chim u th.
Ch gii kho ng liu v pht hin tri thc ( corpora annotation):
Ch gii n v t
Ch gii t loi
Ch gii cm t, cu trc cu
Ch gii ng ngha
Ch gii ng s ch (co-reference)
Ch gii song ng
...
Cc phng php hc my:
N -gram, m hnh Markov (Markov model)
SVM (Support Vector Machine)
CRF (Conditional Random Field)
Mng n-ron (Neural network)
Hc da trn cc lut bin i (transformation-based learning): phng
php Brill
Phn loi s dng cy quyt nh (decision trees)
...
N gram:
Mt n-gram l mt on vn bn c di n (n t)
Thng tin n-gram cho bit c tnh no ca ngn ng, nhng khng nm
bt c cu trc ngn ng
Vic tm kim v s dng cc b n-gram l nhanh v d dng
N -gram c th ng dng trong nhiu ng dng XLNN:
D on t tip theo trong mt pht ngn da vo n-1 t trc
Hu dng trong cc ng dng kim tra chnh t, xp x ngn ng, v.v.
V d
M hnh Markov: Mt m hnh bigram cng c gi l m hnh Markov
bc mt
M hnh ny v c bn l mt otomat hu hn trng thi c trng s: cc
trng thi l cc t, cc cung ni gia 2 trng thi gn vi 1 xc sut no
.
Trigram: Vic chn n = 3 (trigram) cho php ta c xp x tt hn Nhn
chung, cc trigram l ngn, cho php tnh xc sut tng i chnh xc t
d
liu quan st c
Vi n cng ln, vic tnh xc sut cng km chnh xc (do thiu d liu) v
phc tp b nh cng cng tng.
Hun luyn m hnh n-gram:
Tnh xc sut t kho ng liu hun luyn (k thut c lng kh nng cc
i MLE) nh vic tnh cc tn sut tng i:
Cn lu vic la chn kho ng liu hun luyn tu theo ng liu m ta s
p dng m hnh n-gram thu c.
K thut lm mn: Vn khi hun luyn m hnh n-gram: d liu tha, c
th c nhng n-gram vi xc sut tnh c = 0.
K thut lm mn: bin i cc xc sut bng 0 thnh khc 0, tc l iu
chnh cc xc sut tnh cho cc d liu cha quan st c.
Tnh tha ca d liu:
Tnh tha ca d liu: ~ 50% s t ch xut hin 1 ln
Lut Zipf: Tn s xut hin ca mt t t l nghch vi xp hng v tn sut
ca t
V d: Lm mn bng cch thm 1, gi s cn tnh xc sut bigram, thm 1
vo tt c cc t s, ng thi cng mu s vi s t xut hin trong kho ng
liu ( tng cc xc sut bng 1)
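The add-one example just described can be sketched in a few lines of Python; the toy
counts and the vocabulary size below are invented for illustration.

def add_one_bigram_prob(bigram_count, unigram_count, vocab_size):
    """Add-one (Laplace) smoothing for bigrams: add 1 to every bigram count and add the
    vocabulary size to the denominator, so the smoothed probabilities still sum to 1."""
    def p(w1, w2):
        return (bigram_count.get((w1, w2), 0) + 1) / (unigram_count.get(w1, 0) + vocab_size)
    return p

# Invented toy counts
unigram_count = {"một": 10}
bigram_count = {("một", "mô"): 2}
p = add_one_bigram_prob(bigram_count, unigram_count, vocab_size=1000)
print(p("một", "mô"))     # (2 + 1) / (10 + 1000)
print(p("một", "ngôn"))   # (0 + 1) / (10 + 1000): unseen bigram, but non-zero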
- Yu cu SV chun b
n tp li cc kin thc lin quan n l thuyt ngn ng hnh thc,
automata hu hn v biu thc chnh quy.
- Bi tp
- Ti liu tham kho
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd
edition, Daniel Jurafsky and James H. Martin. Prentice Hall, 2008. Chapter 5.
2. Foundations of Statistical Natural Language Processing, Christopher
Manning and Hinrich Schütze, MIT Press, 1999. Chapter 4.
Bi ging 06: Vn gn nhn v m hnh Markov n
Chng 5, mc:
Tit th: 1-3 Tun th: 7, 8
- Mc ch yu cu
Mc ch: Tm hiu bi ton gn nhn t loi, m hnh markov n
Yu cu: hiu v nm c vai tr ca gn nhn t loi trong x l ngn
ng, cc phng php v m hnh gn nhn t loi, m hnh markov n.
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
Markov Processes
Consider a sequence of random variables X_1, X_2, ..., X_n.
Each random variable can take any value in a finite set V.
For now we assume the length n is fixed (e.g., n = 100).
Our goal: model
P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
First-Order Markov Processes:
P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
  = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})
  = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})
The first-order Markov assumption: for any i ∈ {2 ... n} and any x_1 ... x_i,
P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})
Second-Order Markov Processes:
P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
  = P(X_1 = x_1) P(X_2 = x_2 | X_1 = x_1) ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
  = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
(For convenience we assume x_0 = x_{-1} = *, where * is a special "start" symbol.)
Modeling Variable Length Sequences
We would like the length of the sequence, n, to also be a random variable.
A simple solution: always define X_n = STOP, where STOP is a special symbol.
Then use a Markov process as before:
P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
  = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
(For convenience we assume x_0 = x_{-1} = *, where * is a special "start" symbol.)
Trigram Language Models
A trigram language model consists of:
1. A finite set V.
2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP} and
u, v ∈ V ∪ {*}.
For any sentence x_1 ... x_n, where x_i ∈ V for i = 1 ... (n-1) and x_n = STOP, the
probability of the sentence under the trigram language model is:
p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})
where we define x_0 = x_{-1} = *.
An Example
For the sentence "the dog barks STOP" we would have
p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog)
  × q(STOP | dog, barks)
The Trigram Estimation Problem
Remaining estimation problem: q(w_i | w_{i-2}, w_{i-1}),
for example q(barks | the, dog).
A natural estimate (the "maximum likelihood estimate"):
q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
q(barks | the, dog) = Count(the, dog, barks) / Count(the, dog)
Sparse Data Problems:
Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.
e.g., N = 20,000 => 20,000^3 = 8 × 10^12 parameters.
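The sentence probability just defined can be computed directly; the short Python sketch
below assumes a parameter function q(w, u, v) (standing for q(w | u, v)) is supplied,
e.g., a maximum-likelihood or smoothed estimate. Names are illustrative.

import math

def sentence_logprob(q, sentence):
    """log p(x_1 ... x_n) = sum_i log q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = '*'
    and an explicit STOP symbol appended to the sentence."""
    words = sentence.split() + ["STOP"]
    u, v = "*", "*"
    logp = 0.0
    for w in words:
        logp += math.log(q(w, u, v))   # q(w | u, v)
        u, v = v, w
    return logp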
Evaluating a Language Model: Perplexity
We have some test data: m sentences s_1, s_2, s_3, ..., s_m.
We could look at the probability of the test data under our model, ∏_{i=1}^{m} p(s_i).
Or, more conveniently, the log probability:
log ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log p(s_i)
In fact the usual evaluation measure is perplexity:
Perplexity = 2^{-l}   where   l = (1/M) ∑_{i=1}^{m} log_2 p(s_i)
and M is the total number of words in the test data.
Some Intuition about Perplexity
Say we have a vocabulary V with N = |V| + 1, and a model that predicts
q(w | u, v) = 1/N
for all w ∈ V ∪ {STOP} and all u, v ∈ V ∪ {*}.
It is easy to calculate the perplexity in this case:
Perplexity = 2^{-l}   where   l = log_2 (1/N)
=> Perplexity = N
Perplexity is a measure of "effective branching factor".
Typical Values of Perplexity
Results from Goodman ("A bit of progress in language modeling"), where |V| = 50,000:
A trigram model p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}): Perplexity = 74
A bigram model p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}): Perplexity = 137
A unigram model p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i): Perplexity = 955
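The perplexity definition can be checked with a short sketch; the uniform-model sanity
check (perplexity = N) mirrors the intuition above. The function names and the toy test
set are invented.

import math

def perplexity(logprob2, test_sentences, total_words):
    """Perplexity = 2^(-l), with l = (1/M) * sum_i log2 p(s_i); M counts all words incl. STOP."""
    l = sum(logprob2(s) for s in test_sentences) / total_words
    return 2.0 ** (-l)

# Sanity check: a model assigning q = 1/N to every word gives perplexity N
N = 10000
uniform_logprob2 = lambda s: (len(s.split()) + 1) * math.log2(1.0 / N)   # +1 for STOP
test = ["the dog barks", "the cat sleeps"]
M = sum(len(s.split()) + 1 for s in test)
print(perplexity(uniform_logprob2, test, M))    # 10000.0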
Some History:
Shannon conducted experiments on entropy of English i.e., how
good are
people at the perplexity game?
C. Shannon. Prediction and entropy of printed English. Bell
Systems
Technical Journal, 30:50-64, 1951.
Chomsky (in Syntactic Structures (1957)):
Second, the notion grammatical" cannot be identified with
meaningful"
or significant" in any semantic sense.
Sentences (1) and (2) are equally nonsensical, but any speaker
of English
will recognize that only the former is grammatical.
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
Third, the notion grammatical in English" cannot be identified
in any
way with the notion high order of statistical approximation to
English". It is
fair to assume that neither sentence (1) nor (2) (nor indeed any
part of these
sentences) has ever occurred in an English discourse. Hence, in
any statistical
model for grammaticalness, these sentences will be ruled out on
identical
grounds as equally `remote' from English. Yet (1), though
nonsensical, is
grammatical, while (2) is not.
Sparse Data Problems
A natural estimate (the "maximum likelihood estimate"):
q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
q(barks | the, dog) = Count(the, dog, barks) / Count(the, dog)
Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.
e.g., N = 20,000 => 20,000^3 = 8 × 10^12 parameters.
The Bias-Variance Trade-Off
Trigram maximum-likelihood estimate:
q_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})
Bigram maximum-likelihood estimate:
q_ML(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
Unigram maximum-likelihood estimate:
q_ML(w_i) = Count(w_i) / Count()
Linear Interpolation
Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be:
q(w_i | w_{i-2}, w_{i-1}) = λ1 × q_ML(w_i | w_{i-2}, w_{i-1}) + λ2 × q_ML(w_i | w_{i-1}) + λ3 × q_ML(w_i)
where λ1 + λ2 + λ3 = 1 and λi ≥ 0 for all i.
Our estimate correctly defines a distribution (define V' = V ∪ {STOP}):
∑_{w ∈ V'} q(w | u, v)
  = ∑_{w ∈ V'} [λ1 × q_ML(w | u, v) + λ2 × q_ML(w | v) + λ3 × q_ML(w)]
  = λ1 ∑_w q_ML(w | u, v) + λ2 ∑_w q_ML(w | v) + λ3 ∑_w q_ML(w)
  = λ1 + λ2 + λ3
  = 1
(We can also show that q(w | u, v) ≥ 0 for all w ∈ V'.)
How to estimate the λ values?
Hold out part of the training set as "validation" data.
Define c'(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen
in the validation set.
Choose λ1, λ2, λ3 to maximize
L(λ1, λ2, λ3) = ∑_{w_1, w_2, w_3} c'(w_1, w_2, w_3) log q(w_3 | w_1, w_2)
such that λ1 + λ2 + λ3 = 1 and λi ≥ 0 for all i, where
q(w_i | w_{i-2}, w_{i-1}) = λ1 × q_ML(w_i | w_{i-2}, w_{i-1}) + λ2 × q_ML(w_i | w_{i-1}) + λ3 × q_ML(w_i)
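Below is a small sketch of linear interpolation and of choosing the λ values on held-out
data. For simplicity it uses a coarse grid search rather than the iterative optimisation
usually applied, and the function names (q_tri, q_bi, q_uni) and the validation-count
dictionary are assumptions made for the example.

import math
from itertools import product

def interpolated_q(q_tri, q_bi, q_uni, lambdas):
    """q(w | u, v) = l1*q_ML(w | u, v) + l2*q_ML(w | v) + l3*q_ML(w)."""
    l1, l2, l3 = lambdas
    return lambda w, u, v: l1 * q_tri(w, u, v) + l2 * q_bi(w, v) + l3 * q_uni(w)

def choose_lambdas(q_tri, q_bi, q_uni, validation_counts, step=0.1):
    """Pick (l1, l2, l3) on a coarse grid, maximising the held-out objective
    sum over trigrams of c'(w1, w2, w3) * log q(w3 | w1, w2)."""
    best, best_ll = None, -math.inf
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < 0:
            continue
        q = interpolated_q(q_tri, q_bi, q_uni, (l1, l2, l3))
        try:
            ll = sum(c * math.log(q(w3, w1, w2))
                     for (w1, w2, w3), c in validation_counts.items())
        except ValueError:        # log(0): some trigram has zero smoothed probability
            continue
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best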
Discounting Methods
Summary
Three steps in deriving the language model probabilities:
1. Expand p(x_1 ... x_n) using the chain rule.
2. Make Markov independence assumptions:
p(x_i | x_1, x_2, ..., x_{i-2}, x_{i-1}) = p(x_i | x_{i-2}, x_{i-1})
3. Smooth the estimates using lower-order counts.
Other methods used to improve language models:
o "Topic" or "long-range" features.
o Syntactic models.
It's generally hard to improve on trigram models, though!
Tagging Problems, and Hidden Markov Models
1. The Tagging Problem
2. Generative models, and the noisy-channel model, for
supervised learning
3. Hidden Markov Model (HMM) taggers
a. Basic definitions
b. Parameter estimation
c. The Viterbi algorithm
Part-of-Speech Tagging
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street,
as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV
topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N
announced/V first/ADJ quarter/N results/N ./.
N = Noun
V = Verb
P = Preposition
Adv = Adverb
Adj = Adjective
Named Entity Recognition
INPUT: Profits soared at Boeing Co., easily topping forecasts on
Wall
Street, as their CEO Alan Mulally announced first quarter
results.
OUTPUT: Profits soared at [Company Boeing Co.], easily
topping
forecasts on [Location Wall Street], as their CEO [Person Alan
Mulally]
announced first quarter results.
Named Entity Extraction as Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on
Wall
Street, as their CEO Alan Mulally announced first quarter
results.
OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA
their/NA
CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA
results/NA
./NA
NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
Our Goal
Training set:
1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/,
will/MD
join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ
director/NN
Nov./NNP 29/CD ./.
2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP
N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
3 Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC
chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP
PLC/NNP
,/, was/VBD named/VBN a/DT nonexecutive/JJ director/NN of/IN
this/DT British/JJ industrial/JJ conglomerate/NN ./.
38.219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS
out/IN of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD
helping/VBG
Huricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG
them/PRP to/TO San/NNP Francisco/NNP instead/RB ./.
From the training set, induce a function/algorithm that maps new
sentences
to their tag sequences.
Two Types of Constraints
Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP
and/CC
Means/NNP Committee/NNP introduced/VBD legislation/NN
that/WDT
would/MD restrict/VB how/WRB the/DT new/JJ
savings-and-loan/NN
bailout/NN agency/NN can/MD raise/VB capital/NN ./.
Local": e.g., can is more likely to be a modal verb MD rather
than a noun
NN
Contextual": e.g., a noun is much more likely than a verb to
follow a
determiner
Sometimes these preferences are in conflict:
The trash can is in the garage
Supervised Learning Problems
We have training examples (x^(i), y^(i)) for i = 1 ... m. Each x^(i) is an input,
each y^(i) is a label.
Task is to learn a function f mapping inputs x to labels f(x).
Conditional models:
o Learn a distribution p(y | x) from training examples
o For any test input x, define f(x) = arg max_y p(y | x)
Generative Models:
We have training examples (x^(i), y^(i)) for i = 1 ... m. Task is to learn a function f
mapping inputs x to labels f(x).
Generative models:
o Learn a distribution p(x, y) from training examples
o Often we have p(x, y) = p(y) p(x | y)
Note: we then have
p(y | x) = p(y) p(x | y) / p(x)
where p(x) = ∑_y p(y) p(x | y)
Decoding with Generative Models
We have training examples (x^(i), y^(i)) for i = 1 ... m. Task is to learn a function f
mapping inputs x to labels f(x).
Generative models:
o Learn a distribution p(x, y) from training examples
o Often we have p(x, y) = p(y) p(x | y)
Output from the model:
f(x) = arg max_y p(y | x)
     = arg max_y p(y) p(x | y) / p(x)
     = arg max_y p(y) p(x | y)
Hidden Markov Models
We have an input sentence x = x_1, x_2, ..., x_n (x_i is the i-th word in the sentence).
We have a tag sequence y = y_1, y_2, ..., y_n (y_i is the i-th tag in the sentence).
We'll use an HMM to define
p(x_1, x_2, ..., x_n, y_1, y_2, ..., y_n)
for any sentence x_1 ... x_n and tag sequence y_1 ... y_n of the same length.
Then the most likely tag sequence for x is
arg max_{y_1 ... y_n} p(x_1, x_2, ..., x_n, y_1, y_2, ..., y_n)
Trigram Hidden Markov Models (Trigram HMMs)
For any sentence x_1 ... x_n where x_i ∈ V for i = 1 ... n, and any tag sequence
y_1, y_2, ..., y_{n+1} where y_i ∈ S for i = 1 ... n and y_{n+1} = STOP,
the joint probability of the sentence and tag sequence is:
p(x_1 ... x_n, y_1 ... y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i)
where we have assumed that y_0 = y_{-1} = *.
Parameters of the model:
q(s | u, v) for any s ∈ S ∪ {STOP}, u, v ∈ S ∪ {*}
e(x | s) for any x ∈ V, s ∈ S
An Example:
If we have n = 3, x_1 ... x_3 equal to the sentence "the dog laughs",
and y_1 ... y_4 equal to the tag sequence D N V STOP, then
p(x_1 ... x_n, y_1 ... y_{n+1})
  = q(D | *, *) × q(N | *, D) × q(V | D, N) × q(STOP | N, V)
    × e(the | D) × e(dog | N) × e(laughs | V)
STOP is a special tag that terminates the sequence.
We take y_0 = y_{-1} = *, where * is a special padding symbol.
Why the Name?
p(x_1 ... x_n, y_1 ... y_{n+1}) = p(y_1 ... y_{n+1}) × p(x_1 ... x_n | y_1 ... y_{n+1}):
a Markov chain over the tag sequence, times the word-emission probabilities e(x_i | y_i);
the words are observed, the tags are "hidden".
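The joint probability just illustrated can be computed directly; the sketch below assumes
q and e are supplied as probability functions, with q(s, u, v) standing for q(s | u, v)
and e(x, s) for e(x | s).

def hmm_joint_prob(q, e, words, tags):
    """p(x_1..x_n, y_1..y_{n+1}) for a trigram HMM: the product of
    q(y_i | y_{i-2}, y_{i-1}) for i = 1..n+1 (with y_{n+1} = STOP)
    times e(x_i | y_i) for i = 1..n."""
    seq = ["*", "*"] + list(tags) + ["STOP"]
    prob = 1.0
    for i in range(2, len(seq)):             # transition terms, including the STOP transition
        prob *= q(seq[i], seq[i - 2], seq[i - 1])
    for word, tag in zip(words, tags):       # emission terms
        prob *= e(word, tag)
    return prob

# Example from the notes: p(the dog laughs, D N V STOP)
#   = q(D|*,*) q(N|*,D) q(V|D,N) q(STOP|N,V) e(the|D) e(dog|N) e(laughs|V)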
Smoothed Estimation
q(s | u, v) = λ1 × Count(u, v, s)/Count(u, v)
            + λ2 × Count(v, s)/Count(v)
            + λ3 × Count(s)/Count()
where λ1 + λ2 + λ3 = 1 and λi ≥ 0 for all i.
Dealing with Low-Frequency Words: An Example
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as
their CEO Alan Mulally announced first quarter results.
A common method is as follows:
Step 1: Split vocabulary into two sets
Frequent words = words occurring >= 5 times in training
Low frequency words = all other words
Step 2: Map low frequency words into a small, finite set,
depending on
prefixes, suffixes etc.
The Viterbi Algorithm
Problem: for an input x_1 ... x_n, find
arg max_{y_1 ... y_{n+1}} p(x_1 ... x_n, y_1 ... y_{n+1})
where the arg max is taken over all sequences y_1 ... y_{n+1} such that y_i ∈ S for
i = 1 ... n and y_{n+1} = STOP.
We assume that p again takes the form
p(x_1 ... x_n, y_1 ... y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i)
Recall that we have assumed in this definition that y_0 = y_{-1} = * and y_{n+1} = STOP.
Brute Force Search is Hopelessly Inefficient
The Viterbi Algorithm
Define n to be the length of the sentence.
Define S_k for k = -1 ... n to be the set of possible tags at position k:
S_{-1} = S_0 = {*}
S_k = S for k = 1 ... n
Define
r(y_{-1}, y_0, y_1, ..., y_k) = ∏_{i=1}^{k} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{k} e(x_i | y_i)
Define a dynamic programming table:
π(k, u, v) = maximum probability of a tag sequence ending in tags u, v at position k,
that is,
π(k, u, v) = max_{⟨y_{-1}, y_0, y_1, ..., y_k⟩ : y_{k-1} = u, y_k = v} r(y_{-1}, y_0, y_1, ..., y_k)
An Example: "The man saw the dog with the telescope"
A Recursive Definition:
Base case: π(0, *, *) = 1
Recursive definition: for any k ∈ {1 ... n}, for any u ∈ S_{k-1} and v ∈ S_k,
π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
The Viterbi Algorithm
Input: a sentence x_1 ... x_n, parameters q(s | u, v) and e(x | s).
Initialization: set π(0, *, *) = 1.
Definition: S_{-1} = S_0 = {*}, S_k = S for k = 1 ... n.
Algorithm:
For k = 1 ... n:
  For u ∈ S_{k-1} and v ∈ S_k:
    π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
Return max_{u ∈ S_{n-1}, v ∈ S_n} ( π(n, u, v) × q(STOP | u, v) )
The Viterbi Algorithm with Backpointers
Input: a sentence x_1 ... x_n, parameters q(s | u, v) and e(x | s).
Initialization: set π(0, *, *) = 1.
Definition: S_{-1} = S_0 = {*}, S_k = S for k = 1 ... n.
Algorithm:
For k = 1 ... n:
  For u ∈ S_{k-1} and v ∈ S_k:
    π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
    bp(k, u, v) = arg max_{w ∈ S_{k-2}} ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
Set (y_{n-1}, y_n) = arg max_{(u, v)} ( π(n, u, v) × q(STOP | u, v) )
For k = (n-2) ... 1: y_k = bp(k+2, y_{k+1}, y_{k+2})
Return the tag sequence y_1 ... y_n
The Viterbi Algorithm: Running Time
O(n|S|^3) time to calculate q(s | u, v) × e(x_k | s) for all k, s, u, v.
n|S|^2 entries in π to be filled in.
O(|S|) time to fill in one entry.
=> O(n|S|^3) time in total.
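A compact Python rendering of the algorithm with backpointers is given below. It is a
sketch, not an optimised implementation, and it assumes q(v, w, u) = q(v | w, u) and
e(x, v) = e(x | v) are probability functions supplied by the caller.

def viterbi(words, tagset, q, e):
    """Viterbi decoding for a trigram HMM, following the recursion
    pi(k, u, v) = max_w [ pi(k-1, w, u) * q(v | w, u) * e(x_k | v) ]."""
    n = len(words)
    S = lambda k: ["*"] if k <= 0 else list(tagset)     # S_{-1} = S_0 = {*}
    pi = {(0, "*", "*"): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                # best previous tag w and the corresponding probability
                scores = [(pi.get((k - 1, w, u), 0.0) * q(v, w, u) * e(words[k - 1], v), w)
                          for w in S(k - 2)]
                pi[(k, u, v)], bp[(k, u, v)] = max(scores)
    # final step: include the q(STOP | u, v) transition
    _, (u, v) = max((pi[(n, u, v)] * q("STOP", u, v), (u, v))
                    for u in S(n - 1) for v in S(n))
    tags = [None] * (n + 1)            # tags[1..n] hold the answer
    tags[n - 1], tags[n] = u, v
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]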
Pros and Cons
Hidden Markov model taggers are very simple to train (we just need to compile counts
from the training corpus).
They perform relatively well (over 90% performance on named entity recognition).
The main difficulty is modeling
e(word | tag)
which can be very difficult if "words" are complex.
- Ni dung tho lun
Kho ng liu ting vit, thc trng v gii php xy dng.
- Yu cu SV chun b
Tm hiu v kho ng liu ting vit v vit chng trnh gn nhn s dng
m hnh markov n.
- Bi tp
- Ti liu tham kho
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd
edition, Daniel Jurafsky and James H. Martin. Prentice Hall, 2008. Chapter 6.
2. Foundations of Statistical Natural Language Processing, Christopher
Manning and Hinrich Schütze, MIT Press, 1999. Chapter 4.
3. Data Mining: Practical Machine Learning Tools and Techniques (3rd ed),
Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2005. Chapter 3.
Bi ging 07: Dch my
Chng I, mc:
Tit th: 1-3 Tun th: 9,10
- Mc ch yu cu
Mc ch: Tm hiu v dch my, cc khi nim c bn v cc k thut
ph bin.
Yu cu: Nm c kin trc, ni dung ca dch my, cc giai on v
mt s k thut c bn
- Hnh thc t chc dy hc: L thuyt, tho lun, t hc, t nghin cu
- Thi gian: Gio vin ging: 2 tit; Tho lun v lm bi tp trn lp: 1
tit;
Sinh vin t hc: 6 tit.
- a im: Ging ng do P2 phn cng.
- Ni dung chnh:
1. Challenges in machine translation
2. Classical machine translation
3. A brief introduction to statistical MT
4. The IBM Translation Models
a. IBM Model 1
b. IBM Model 2
c. EM Training of Models 1 and 2
5. Phrase-Based Translation
6. Decoding with Phrase-Based Translation Models
1. Challenges in machine translation
Challenges: Lexical Ambiguity
(Example from Dorr et al., 1999)
Example 1:
book the flight → đặt chỗ        read the book → đọc sách
Example 2:
the box was in the pen           the pen was on the table
Example 3:
kill a man → giết                kill a process → hủy
Challenges: Differing Word Orders
English word order is subject - verb - object
Japanese word order is subject - object - verb
English: IBM bought Lotus        Japanese: IBM Lotus bought
English: Sources said that IBM bought Lotus yesterday
Japanese: Sources yesterday IBM Lotus bought that said
Syntactic Structure is not preserved across Translations!
2. Classical machine translation
2.1. Direct Machine Translation
Translation is word-by-word
Very little analysis of the source text (e.g., no syntactic or
semantic
analysis)
Relies on a large bilingual dictionary. For each word in the source
language, the dictionary specifies a set of rules for translating that word.
After the words are translated, simple reordering rules are
applied (e.g.,
move adjectives after nouns when translating from English to
French)
An Example of a set of Direct Translation Rules
(From Jurafsky and Martin, edition 2, chapter 25. Originally
from a system
from Panov 1960)
Rules for translating much or many into Russian:
if preceding word is how return skol'ko
else if preceding word is as return stol'ko zhe
else if word is much
if preceding word is very return nil
else if following word is a noun return mnogo
else (word is many)
if preceding word is a preposition and following word is noun
return
mnogii
else return mnogo
Some Problems with Direct Machine Translation
Lack of any analysis of the source language causes several
problems, for
example:
Difficult or impossible to capture long-range reorderings
English: Sources said that IBM bought Lotus yesterday
Japanese: Sources yesterday IBM Lotus bought that said
Words are translated without disambiguation of their syntactic
role e.g.,
that can be a complementizer or determiner, and will often be
translated
differently for these two cases
They said that ...
They like that ice-cream
2.2. Transfer-Based Approaches
Three phases in translation:
Analysis: Analyze the source language sentence; for example,
build a
syntactic analysis of the source language sentence.
Transfer: Convert the source-language parse tree to a
target-language
parse tree.
Generation: Convert the target-language parse tree to an output
sentence.
Transfer-Based Approaches
The parse trees" involved can vary from shallow analyses to
much
deeper analyses (even semantic representations).
The transfer rules might look quite similar to the rules for
direct
translation systems. But they can now operate on syntactic
structures.
It's easier with these approaches to handle long-distance
reorderings
The Systran systems are a classic example of this approach
Japanese: Sources yesterday IBM Lotus bought that said
2.3. Interlingua-Based Translation
Two phases in translation:
Analysis: Analyze the source language sentence into a
(language-
independent) representation of its meaning.
Generation: Convert the meaning representation into an output
sentence.
One Advantage: If we want to build a translation system that
translates between n languages, we need to develop n analysis
and
generation systems. With a transfer based system, we'd need to
develop O(n2)
sets of translation rules.
Disadvantage: What would a language-independent representation
look
like?
Interlingua-Based Translation
How to represent different concepts in an interlingua?
Different languages break down concepts in quite different
ways:
German has two words for wall: one for an internal wall, one for
a
wall that is outside
Japanese has two words for brother: one for an elder brother,
one for
a younger brother
Spanish has two words for leg: pierna for a human's leg, pata
for an
animal's leg, or the leg of a table
An interlingua might end up simple being an intersection of
these
different ways of breaking down concepts, but that doesn't seem
very
satisfactory...
3. A Brief Introduction to Statistical MT
Parallel corpora are available in several language pairs
Basic idea: use a parallel corpus as a training set of
translation examples
Classic example: IBM work on French-English translation, using
the
Canadian Hansards. (1.7 million sentences of 30 words or less in
length).
Idea goes back to Warren Weaver (1949): suggested applying
statistical
and cryptanalytic techniques to translation.
4. The IBM Translation Models
4.1. IBM Model 1
Alignments
How do we model p(f | e)?
The English sentence e has l words e_1 ... e_l; the French sentence f has m words
f_1 ... f_m.
An alignment a identifies which English word each French word originated from.
Formally, an alignment a is {a_1, ..., a_m}, where each a_j ∈ {0 ... l}.
There are (l + 1)^m possible alignments.
Example: l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
One alignment is {2, 3, 4, 5, 6, 6, 6}
Another (bad!) alignment is {1, 1, 1, 1, 1, 1, 1}
Alignments in the IBM Models
We'll define models for p(a | e, m) and p(f | a, e, m), giving
p(f, a | e, m) = p(a | e, m) × p(f | a, e, m)
Also
p(f | e, m) = ∑_{a} p(a | e, m) × p(f | a, e, m)
A By-Product: Most Likely Alignments
Once we have a model p(f, a | e, m) = p(a | e, m) × p(f | a, e, m), we can also compute
p(a | f, e, m) = p(f, a | e, m) / ∑_{a} p(f, a | e, m)
for any alignment a.
For a given (f, e) pair, we can also compute the most likely alignment
a* = arg max_a p(a | f, e, m)
Nowadays the original IBM models are rarely (if ever) used for translation, but they are
used for recovering alignments.
An Example Alignment
French: le conseil rendu son avis, et nous devons prsent
adopter un nouvel avis sur la base de la premire position.
English: the council has stated its position, and now, on the
basis of
the first position, we again have to give our opinion.
Alignment:
the/le council/conseil has/ stated/rendu its/son position/avis
,/,
and/et now/prsent ,/NULL on/sur the/le basis/base of/de
the/la
first/premire position/position ,/NULL we/nous again/NULL
have/devons
to/a give/adopter our/nouvel opinion/avis ./.
IBM Model 1: Alignments
In IBM Model 1 all alignments are equally likely:
p(a | e, m) = 1 / (l + 1)^m
This is a major simplifying assumption, but it gets things started...
IBM Model 1: Translation Probabilities
Next step: come up with an estimate for p(f | a, e, m).
In Model 1, this is:
p(f | a, e, m) = ∏_{j=1}^{m} t(f_j | e_{a_j})
e.g., l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
a = {2, 3, 4, 5, 6, 6, 6}
p(f | a, e, m) = t(Le | the) × t(programme | program) × t(a | has) × t(ete | been)
  × t(mis | implemented) × t(en | implemented) × t(application | implemented)
IBM Model 1: The Generative Process
To generate a French string f from an English string e:
Step 1: Pick an alignment a with probability 1 / (l + 1)^m
Step 2: Pick the French words with probability
p(f | a, e, m) = ∏_{j=1}^{m} t(f_j | e_{a_j})
The final result:
p(f, a | e, m) = p(a | e, m) × p(f | a, e, m) = (1 / (l + 1)^m) ∏_{j=1}^{m} t(f_j | e_{a_j})
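The joint probability above is straightforward to compute once a table of translation
probabilities t is available; the sketch below and its toy values are invented for
illustration, with the NULL-word convention made explicit.

def model1_joint_prob(t, f_words, e_words, alignment):
    """IBM Model 1: p(f, a | e, m) = (1 / (l + 1)^m) * prod_j t(f_j | e_{a_j}).
    e_words = [NULL, e_1, ..., e_l]; alignment[j] in {0, ..., l} says which English
    word (0 = NULL) generated the j-th French word."""
    l = len(e_words) - 1
    m = len(f_words)
    prob = 1.0 / (l + 1) ** m                     # p(a | e, m): all alignments equally likely
    for f_j, a_j in zip(f_words, alignment):
        prob *= t.get((f_j, e_words[a_j]), 0.0)   # t(f_j | e_{a_j})
    return prob

# Toy usage (translation-table values are invented)
e = ["NULL", "And", "the", "program", "has", "been", "implemented"]
f = ["Le", "programme", "a", "ete", "mis", "en", "application"]
a = [2, 3, 4, 5, 6, 6, 6]
t = {("Le", "the"): 0.4, ("programme", "program"): 0.5, ("a", "has"): 0.3,
     ("ete", "been"): 0.4, ("mis", "implemented"): 0.2, ("en", "implemented"): 0.1,
     ("application", "implemented"): 0.3}
print(model1_joint_prob(t, f, e, a))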
An Example Lexical Entry
English French Probability
Position position 0.756715
Position situation 0.0547918
Position mesure 0.0281663
Position vue 0.0169303
Position point 0.0124795
Position attitude 0.0108907
de la situation au niveau des ngociations de l ' ompi
.. of the current position in the wipo negotiations
nous ne sommes pas en mesure de dcider ,
we are not in a position to decide , : : :
le point de vue de la commission face ce problme complexe.
the commission's position on this complex problem.
4.2. IBM Model 2
Only difference: we now introduce alignment or distortion parameters
q(j | i, l, m) = probability that alignment variable a_i takes the value j, conditioned
on the sentence lengths l and m.
Define
p(a | e, m) = ∏_{i=1}^{m} q(a_i | i, l, m)
where a = {a_1, ..., a_m}.
This gives
p(f, a | e, m) = ∏_{i=1}^{m} q(a_i | i, l, m) × t(f_i | e_{a_i})
An Example:
l = 6
m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
a = {2, 3, 4, 5, 6, 6, 6}
p(a | e, 7) = q(2 | 1, 6, 7) × q(3 | 2, 6, 7) × q(4 | 3, 6, 7) × q(5 | 4, 6, 7)
  × q(6 | 5, 6, 7) × q(6 | 6, 6, 7) × q(6 | 7, 6, 7)
p(f | a, e, 7) = t(Le | the) × t(programme | program) × t(a | has) × t(ete | been)
  × t(mis | implemented) × t(en | implemented) × t(application | implemented)
IBM Model 2: The Generative Process
To generate a French string f from an English string e:
Step 1: Pick an alignment a = {a_1, ..., a_m} with probability
∏_{i=1}^{m} q(a_i | i, l, m)
Step 2: Pick the French words with probability
p(f | a, e, m) = ∏_{i=1}^{m} t(f_i | e_{a_i})
The final result:
p(f, a | e, m) = p(a | e, m) × p(f | a, e, m) = ∏_{i=1}^{m} q(a_i | i, l, m) × t(f_i | e_{a_i})
Recovering Alignments
If we have parameters q and t, we can easily recover the most likely alignment for any
sentence pair.
Given a sentence pair e_1, e_2, ..., e_l, f_1, f_2, ..., f_m, define
a_i = arg max_{j ∈ {0 ... l}} q(j | i, l, m) × t(f_i | e_j)
for i = 1 ... m.
4.3. EM Training of Models 1 and 2
The Parameter Estimation Problem
Input to the parameter estimation algorithm: (e^(k), f^(k)) for k = 1 ... n.
Each e^(k) is an English sentence, each f^(k) is a French sentence.
Output: parameters t(f | e) and q(j | i, l, m).
A key challenge: we do not have alignments on our training examples, e.g.,
e^(100) = ...
f^(100) = ...
Parameter Estimation if the Alignments are Observed
First: the case where alignments are observed in the training data, e.g.,
e^(100) = ...
f^(100) = ...
a^(100) = {2, 3, 4, 5, 6, 6, 6}
Training data is (e^(k), f^(k), a^(k)) for k = 1 ... n. Each e^(k) is an English sentence,
each f^(k) is a French sentence, each a^(k) is an alignment.
Maximum-likelihood parameter estimates in this case are trivial:
t_ML(f | e) = c(e, f) / c(e),   q_ML(j | i, l, m) = c(j | i, l, m) / c(i, l, m)
Input: a training corpus (f^(k), e^(k), a^(k)) for k = 1 ... n, where
f^(k) = f^(k)_1 ... f^(k)_{m_k}, e^(k) = e^(k)_1 ... e^(k)_{l_k}, a^(k) = a^(k)_1 ... a^(k)_{m_k}.
Algorithm:
Set all counts c(...) = 0
For k = 1 ... n
  For i = 1 ... m_k, j = 0 ... l_k
    c(e^(k)_j, f^(k)_i) ← c(e^(k)_j, f^(k)_i) + δ(k, i, j)
    c(e^(k)_j) ← c(e^(k)_j) + δ(k, i, j)
    c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
    c(i, l, m) ← c(i, l, m) + δ(k, i, j)
where δ(k, i, j) = 1 if a^(k)_i = j, 0 otherwise.
Output: t(f | e) = c(e, f) / c(e),   q(j | i, l, m) = c(j | i, l, m) / c(i, l, m)
Parameter Estimation with the EM Algorithm
Training examples are (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence,
each f^(k) is a French sentence.
The algorithm is related to the algorithm used when alignments are observed, but with two
key differences:
1. The algorithm is iterative. We start with some initial (e.g., random) choice for the
q and t parameters. At each iteration we compute some counts based on the data together
with our current parameter estimates. We then re-estimate our parameters with these
counts, and iterate.
2. We use the following definition for δ(k, i, j) at each iteration:
δ(k, i, j) = q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j) / ∑_{j=0}^{l_k} q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j)
Input: a training corpus (f^(k), e^(k)) for k = 1 ... n, where
f^(k) = f^(k)_1 ... f^(k)_{m_k}, e^(k) = e^(k)_1 ... e^(k)_{l_k}.
Initialization: initialize the t(f | e) and q(j | i, l, m) parameters (e.g., to random
values).
For each iteration s = 1 ... S:
  Set all counts c(...) = 0
  For k = 1 ... n
    For i = 1 ... m_k, j = 0 ... l_k
      c(e^(k)_j, f^(k)_i) ← c(e^(k)_j, f^(k)_i) + δ(k, i, j)
      c(e^(k)_j) ← c(e^(k)_j) + δ(k, i, j)
      c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
      c(i, l, m) ← c(i, l, m) + δ(k, i, j)
    where
      δ(k, i, j) = q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j) / ∑_{j=0}^{l_k} q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j)
  Recalculate the parameters:
    t(f | e) = c(e, f) / c(e),   q(j | i, l, m) = c(j | i, l, m) / c(i, l, m)
The EM Algorithm for IBM Model 1
For each iteration s = 1 ... S:
  Set all counts c(...) = 0
  For k = 1 ... n
    For i = 1 ... m_k, j = 0 ... l_k
      c(e^(k)_j, f^(k)_i) ← c(e^(k)_j, f^(k)_i) + δ(k, i, j)
      c(e^(k)_j) ← c(e^(k)_j) + δ(k, i, j)
      c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
      c(i, l, m) ← c(i, l, m) + δ(k, i, j)
    where
      δ(k, i, j) = [ (1/(1 + l_k)) t(f^(k)_i | e^(k)_j) ] / [ (1/(1 + l_k)) ∑_{j=0}^{l_k} t(f^(k)_i | e^(k)_j) ]
                 = t(f^(k)_i | e^(k)_j) / ∑_{j=0}^{l_k} t(f^(k)_i | e^(k)_j)
  Recalculate the parameters: t(f | e) = c(e, f) / c(e)
An example training pair:
e^(100) = ...
f^(100) = ...
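A small Python sketch of the EM procedure above for IBM Model 1 (translation parameters
only, uniform-style initialisation) is given below; the toy parallel data at the end is
invented for illustration.

from collections import defaultdict

def ibm1_em(pairs, iterations=10):
    """EM training of IBM Model 1 translation parameters t(f | e).
    `pairs` is a list of (english_words, french_words); english_words[0] should be NULL."""
    t = defaultdict(lambda: 1.0)                       # uniform-like initialisation
    for _ in range(iterations):
        count_ef = defaultdict(float)                  # c(e, f)
        count_e = defaultdict(float)                   # c(e)
        for e_words, f_words in pairs:
            for f in f_words:
                # E-step: delta(k, i, j) = t(f_i | e_j) / sum_j t(f_i | e_j)
                z = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    d = t[(f, e)] / z
                    count_ef[(e, f)] += d
                    count_e[e] += d
        # M-step: t(f | e) = c(e, f) / c(e)
        t = defaultdict(float, {(f, e): count_ef[(e, f)] / count_e[e]
                                for (e, f) in count_ef})
    return t

# Invented toy parallel data: English sentences carry an explicit NULL word
pairs = [(["NULL", "the", "dog"], ["le", "chien"]),
         (["NULL", "the", "cat"], ["le", "chat"])]
t = ibm1_em(pairs)
print(t[("chien", "dog")])   # this probability increases across iterations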
Justification for the Algorithm
Training examples are (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence,
each f^(k) is a French sentence.
The log-likelihood function:
L(t, q) = ∑_{k=1}^{n} log p(f^(k) | e^(k)) = ∑_{k=1}^{n} log ∑_{a} p(f^(k), a | e^(k))
The maximum-likelihood estimates are
arg max_{t, q} L(t, q)
The EM algorithm will converge to a local maximum of the log-likelihood function.
Summary
Key ideas in the IBM translation models:
o Alignment variables
o Translation parameters, e.g., t(f | e)
o Distortion parameters, e.g., q(2 | 1, 6, 7)
The EM algorithm: an iterative algorithm for training the q and
t
parameters
Once the parameters are trained, we can recover the most
likely
alignments on our training examples
5. Phrase-Based Translation
1. Learning phrases from alignments
2. A phrase-based model
3. Decoding in phrase-based models
Phrase-Based Models
First stage in training a phrase-based model is extraction of a
phrase-based
(PB) lexicon
A PB lexicon pairs strings in one language with strings in
another
language, e.g.,
nach Kanada → in Canada
zur Konferenz → to the conference
Morgen → tomorrow
fliege → will fly
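Such phrase pairs are normally read off word-aligned sentence pairs. A minimal sketch of the usual consistency check used for extraction follows (the alignment representation, the max_len limit, and the names are assumptions, not given in the source):

def extract_phrase_pairs(e_words, f_words, alignment, max_len=4):
    """alignment: set of (e_index, f_index) pairs (0-based).
    A phrase pair is kept only if at least one link falls inside the box
    and no link leaves it."""
    pairs = []
    for e1 in range(len(e_words)):
        for e2 in range(e1, min(e1 + max_len, len(e_words))):
            for f1 in range(len(f_words)):
                for f2 in range(f1, min(f1 + max_len, len(f_words))):
                    links = [(e, f) for (e, f) in alignment
                             if e1 <= e <= e2 or f1 <= f <= f2]
                    consistent = links and all(
                        e1 <= e <= e2 and f1 <= f <= f2 for (e, f) in links)
                    if consistent:
                        pairs.append((" ".join(e_words[e1:e2 + 1]),
                                      " ".join(f_words[f1:f2 + 1])))
    return pairs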
6. Decoding with Phrase-Based Translation Models
- Discussion topics
Machine translation issues for Vietnamese; existing products.
- Student preparation
Install and study the machine translation libraries and modules provided by
the instructor.
- References
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd edition,
Daniel Jurafsky and James Martin. Prentice Hall, 2008. Chapter 7.
2. Data Mining: Practical Machine Learning Tools and Techniques (3rd ed),
Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2005. Chapter 4.
Lecture 08: Log-linear models
Chapter I, sections:
Periods: 1-3; Week: 11
- Objectives and requirements
Objective: study log-linear models in machine translation.
Requirements: master the technique and be able to write programs with it.
- Teaching format: lectures, discussion, self-study, independent research
- Time: lecturing: 2 periods; in-class discussion and exercises: 1 period;
student self-study: 6 periods.
- Location: lecture hall assigned by P2.
- Main content:
1. The Language Modeling Problem
2. Log-linear models
3. Parameter estimation in log-linear models
4. Smoothing/regularization in log-linear models
5. Global Linear Model
Log-linear models
Given
An input domain X and a finite set of labels Y
A set of m feature functions $\phi_k : X \times Y \to \mathbb{R}$ (very often these are
indicator/binary functions: $\phi_k : X \times Y \to \{0, 1\}$).
The feature vectors $\Phi(x, y) \in \mathbb{R}^m$ induced by the feature functions $\phi_k$ for
any $x \in X$ and $y \in Y$
Learn a conditional probability P(y | x, W), where
W is a parameter vector of weights ($W \in \mathbb{R}^m$)
$P(y \mid x, W) = \dfrac{e^{W \cdot \Phi(x, y)}}{\sum_{y' \in Y} e^{W \cdot \Phi(x, y')}}$
$\log P(y \mid x, W) = W \cdot \Phi(x, y) - \log \sum_{y' \in Y} e^{W \cdot \Phi(x, y')}$ [subtraction between
a linear term and a normalization term]
Examples of feature functions for POS tagging:
$\phi_1(x, y) = 1$ if current word $w_i$ is "the" and $y$ = DT; 0 otherwise
$\phi_2(x, y) = 1$ if current word $w_i$ ends in "ing" and $y$ = VBG; 0 otherwise
$\phi_3(x, y) = 1$ if $\langle t_{i-2}, t_{i-1}, t_i \rangle$ = $\langle$DT, JJ, Vt$\rangle$; 0 otherwise
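Written out as code, these indicators might look as follows (a small illustration; the history representation x is an assumption):

# x is a history: here a dict with the current word and the previous two tags;
# y is the candidate tag for the current position.
def f1(x, y):
    return 1 if x["word"] == "the" and y == "DT" else 0

def f2(x, y):
    return 1 if x["word"].endswith("ing") and y == "VBG" else 0

def f3(x, y):
    return 1 if (x["t2"], x["t1"], y) == ("DT", "JJ", "Vt") else 0

# Feature vector Phi(x, y) induced by the feature functions
def phi(x, y):
    return [f(x, y) for f in (f1, f2, f3)]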
It is natural to come up with as many feature functions as we can.
Learning in this framework then amounts to learning the weights $W_{ML}$
that maximize the likelihood of the training corpus:
$W_{ML} = \arg\max_{W \in \mathbb{R}^m} L(W) = \arg\max_{W \in \mathbb{R}^m} \prod_{i=1}^{n} P(y_i \mid x_i, W)$
$L(W) = \sum_{i=1}^{n} \log P(y_i \mid x_i, W) = \sum_{i=1}^{n} W \cdot \Phi(x_i, y_i) - \sum_{i=1}^{n} \log \sum_{y' \in Y} e^{W \cdot \Phi(x_i, y')}$
Note: Finding the parameters that maximize the
likelihood/probability of
some training corpus is a universal machine learning trick.
Summary: we have cast the learning problem as an optimization
problem.
Several solutions exist for solving this problem:
Gradient ascent
Conjugate gradient methods
Iterative scaling
Improved iterative scaling
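As a concrete instance of the first option, here is a minimal numpy sketch of gradient ascent on L(W), using the fact that the gradient is empirical minus expected feature counts (data format, step size, and names are illustrative, not from the source):

import numpy as np

def train_loglinear(examples, label_set, phi, n_features, lr=0.1, iters=100):
    """examples: list of (x, y_gold) pairs; label_set: list of possible labels;
    phi(x, y) -> numpy array of length n_features.
    Maximizes L(W) = sum_i [ W.Phi(x_i,y_i) - log sum_y exp(W.Phi(x_i,y)) ]."""
    W = np.zeros(n_features)
    for _ in range(iters):
        grad = np.zeros(n_features)
        for x, y_gold in examples:
            feats = {y: phi(x, y) for y in label_set}
            scores = np.array([W @ feats[y] for y in label_set])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                     # P(y | x, W) for each y
            grad += feats[y_gold]                    # empirical feature counts
            for p, y in zip(probs, label_set):
                grad -= p * feats[y]                 # expected feature counts
        W += lr * grad
    return W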
- Discussion topics
Building a programming library for machine translation.
- Student preparation
Build a module simulating Vietnamese-English translation.
- References
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd edition,
Daniel Jurafsky and James Martin. Prentice Hall, 2008. Chapter 7.
2. Data Mining: Practical Machine Learning Tools and Techniques (3rd ed),
Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2005. Chapter 5.
- Review questions
- Note: prerequisite courses: discrete mathematics, data structures and
algorithms, basic programming.
Lecture 09: Conditional random fields, and global linear models
Chapter I, sections:
Periods: 1-3; Weeks: 12, 13
- Objectives and requirements
Objective: study CRFs and GLMs.
Requirements: master the models and know how to build applications based on
the two kinds of models presented in the lecture.
- Teaching format: lectures, discussion, self-study, independent research
- Time: lecturing: 2 periods; in-class discussion and exercises: 1 period;
student self-study: 6 periods.
- Location: lecture hall assigned by P2.
- Main content:
A CRF (conditional random field) is a conditional-probability sequence model,
trained to maximize conditional likelihood. It is a framework for building
probabilistic models to segment and label sequence data [1]. According to [3],
a CRF, like a Markov random field, is an undirected graphical model in which
each vertex represents a random variable whose distribution is to be inferred,
and each edge represents a dependency between two random variables.
Figure 1: Chain structure of a CRF graph.
X is a random variable over the data sequence to be labeled and Y is the
random variable over the corresponding label (or state) sequence. For example,
X is the sequence of words observed in natural-language sentences, and Y is
the sequence of part-of-speech tags assigned to the sentences in X (the tags
are drawn from a predefined tag set). A linear-chain CRF with parameters
$\lambda = (\lambda_1, \lambda_2, \ldots)$ is given by the formula [2]:
$p(y \mid x) = \dfrac{1}{Z_x} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)$
where $Z_x$ is a normalization factor that makes the probabilities of all
state sequences sum to 1 [4].
$f_k(y_{t-1}, y_t, x, t)$ is a feature function, usually binary-valued but
possibly real-valued, and $\lambda_k$ is a learned weight associated with
feature $f_k$. The feature functions can measure any state transition
$y_{t-1} \to y_t$ together with the observation sequence x, centered on the
current position t. For example, a feature function may take the value 1 when
$y_{t-1}$ is the state TITLE, $y_t$ is the state AUTHOR, and $x_t$ is a word
that appears in a lexicon of person names.
CRFs are usually trained by maximizing the likelihood of the training data
with optimization techniques such as L-BFGS
(http://en.wikipedia.org/wiki/L-BFGS). Inference with the learned model means
finding the label sequence corresponding to an input observation sequence.
For CRFs this is typically done with a dynamic-programming algorithm, most
commonly Viterbi (http://en.wikipedia.org/wiki/Viterbi_algorithm), which
finds the most likely sequence of hidden states, to perform inference on new
data [5].
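A minimal Viterbi sketch for decoding with a trained linear-chain model; score(y_prev, y, x, t) stands in for the weighted feature sum Σk λk fk(y_{t-1}, y_t, x, t) (names and interface are assumptions, not from the source):

def viterbi(x, labels, score):
    """x: observation sequence; labels: list of possible tags;
    score(y_prev, y, x, t): local log-score; y_prev is None at position 0.
    Returns the highest-scoring label sequence."""
    n = len(x)
    # delta[t][y]: best score of any label sequence ending in y at position t
    delta = [{y: score(None, y, x, 0) for y in labels}]
    back = [{}]
    for t in range(1, n):
        delta.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[t - 1][yp] + score(yp, y, x, t))
            delta[t][y] = delta[t - 1][best_prev] + score(best_prev, y, x, t)
            back[t][y] = best_prev
    # Backtrace from the best final label
    y_last = max(labels, key=lambda y: delta[n - 1][y])
    path = [y_last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))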
Maximum Entropy Models
An equivalent approach to learning conditional probability
models is this:
There are lots of conditional distributions out there, most of
them very
spiked, overfit, etc. Let Q be the set of distributions that can
be specified
in log-linear form:
$Q = \Big\{ p : p(y \mid x_i) = \dfrac{e^{W \cdot \Phi(x_i, y)}}{\sum_{y' \in Y} e^{W \cdot \Phi(x_i, y')}} \Big\}$
We would like to learn a distribution that is as uniform as
possible
without violating any of the requirements imposed by the
training data.
$P = \Big\{ p : \sum_{i=1}^{n} \Phi(x_i, y_i) = \sum_{i=1}^{n} \sum_{y \in Y} p(y \mid x_i)\, \Phi(x_i, y) \Big\}$
(empirical count = expected count)
p is an $n \times |Y|$ vector defining $p(y \mid x_i)$ for all i, y.
Note that a distribution that satisfies the above equality always exists:
$p(y \mid x_i) = 1$ if $y = y_i$; 0 otherwise.
Because uniformity equates to high entropy, we can search for
distributions
that are both consistent with the requirements imposed by the
data and
have high entropy.
Entropy of a vector p:
$H(p) = -\sum_{x} p_x \log p_x$
Entropy is uncertainty, but also non-commitment.
What do we want from a distribution P?
o Minimize commitment = maximize entropy
o Resemble some reference distribution (data)
Solution: maximize entropy H, subject to constraints f.
Adding constraints (features):
o Lowers maximum entropy
o Raises maximum likelihood
o Brings the distribution further from uniform
o Brings the distribution closer to a target distribution
Let's say we have the following event space:
NN    NNS   NNP   NNPS  VBZ   VBD
and the following empirical data:
3     5     11    13    3     1
Maximize H:
1/e   1/e   1/e   1/e   1/e   1/e
but we wanted probabilities: E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1
1/6   1/6   1/6   1/6   1/6   1/6
This is probably too uniform:
NN NNS NNP NNPS VBZ VBD
1/6 1/6 1/6 1/6 1/6 1/6
we notice that N* are more common than V* in the real data, so we
introduce a feature fN = {NN, NNS, NNP, NNPS}, with E[fN] = 32/36:
8/36  8/36  8/36  8/36  2/36  2/36
and proper nouns are more frequent than common nouns, so we add
fP = {NNP, NNPS}, with E[fP] = 24/36:
4/36  4/36  12/36 12/36 2/36  2/36
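A quick check of the arithmetic: under maximum entropy the remaining mass spreads uniformly within each cell carved out by the constraints (a sketch, not part of the source):

# Constraints: total mass = 1, E[fN] = 32/36 over {NN, NNS, NNP, NNPS}, E[fP] = 24/36 over {NNP, NNPS}.
tags = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]
mass_P = 24/36                  # shared by NNP, NNPS
mass_N = 32/36 - mass_P         # left for NN, NNS
mass_V = 1 - 32/36              # left for VBZ, VBD
p = {"NNP": mass_P/2, "NNPS": mass_P/2,
     "NN": mass_N/2,  "NNS": mass_N/2,
     "VBZ": mass_V/2, "VBD": mass_V/2}
print({t: round(p[t]*36, 1) for t in tags})
# {'NN': 4.0, 'NNS': 4.0, 'NNP': 12.0, 'NNPS': 12.0, 'VBZ': 2.0, 'VBD': 2.0}, i.e. the 4/36 ... 2/36 row above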
we could keep refining the models, for example by adding a
feature to
distinguish singular vs. plural nouns, or verb types.
Fundamental theorem: It turns out that finding the maximum-likelihood
solution to the log-linear optimization problem above is the same as finding
the maximum-entropy solution to the constrained problem above.
The maximum entropy solution can be written in log-linear
form.
Finding the maximum-likelihood solution also gives the
maximum
entropy solution.
- Student preparation
Read the assigned lecture material in advance. Complete the exercises as
assigned.
- References
1. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd edition,
Daniel Jurafsky and James Martin. Prentice Hall, 2008. Chapter 9.
2. Foundations of Statistical Natural Language Processing, Christopher
Manning and Hinrich Schütze, MIT Press, 1999. Chapter 6.
3. Data Mining: Practical Machine Learning Tools and Techniques (3rd ed),
Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2005. Chapter 5.
Lecture 10: Machine learning in NLP
Chapter I, sections:
Periods: 1-3; Weeks: 14, 15
- Objectives and requirements
Objective: study machine learning methods in natural language processing and
their applications.
Requirements: understand the methods and their applications, and how to
build programs with them.
- Teaching format: lectures, discussion, self-study, independent research
- Time: lecturing: 2 periods; in-class discussion and exercises: 1 period;
student self-study: 6 periods.
- Location: lecture hall assigned by P2.
- Main content:
The concept of machine learning
Machine learning is a field of artificial intelligence concerned with
developing techniques that allow computers to "learn". More specifically,
machine learning is a method for creating computer programs by analyzing
data sets. Machine learning is closely related to statistics, since both
fields study data analysis, but unlike statistics, machine learning focuses
on the algorithmic complexity of carrying out the computation. Many inference
problems are NP-hard, so part of machine learning is the study of tractable
approximate inference algorithms.
Commonly used types of algorithms include:
Supervised learning:
A machine learning technique for building a function from training data. The
training data consist of pairs of input objects (usually vectors) and desired
outputs. The output of the function may be a continuous value (regression) or
a predicted class label for the input object (classification). The task of a
supervised learner is to predict the value of the function for any valid
input object after seeing a number of training examples (i.e., pairs of
inputs and corresponding outputs). To achieve this, the learner must
generalize from the available data so that unseen situations are predicted
in a "reasonable" way.
Supervised learning can produce two kinds of models. Most commonly, it
produces a global model that maps input objects to desired outputs. In some
cases, however, the mapping is realized as a set of local models based on an
object's neighbors.
To solve a supervised learning problem (e.g., handwriting recognition), one
must consider several steps (a small end-to-end sketch follows this list):
Determine the type of training data. Before doing anything else, we should
decide what kind of data will be used for training. For handwriting
recognition, for example, it could be a single handwritten character, a whole
handwritten word, or an entire line of handwriting.
Gather the training data. The training set must suit the function to be
built, so the input data should be checked for matching desired outputs.
Training data can be collected from many sources: from measurements and
computations, or from existing data sets.
Determine the input feature representation. The accuracy of the learned
function depends heavily on how the input objects are represented. Typically,
an input object is transformed into a feature vector containing a number of
features that describe the object. The number of features should not be too
large, to avoid a blow-up of the data, but must be large enough to predict
the output accurately. If the representation describes the objects in too
much detail, the outputs may fragment into many groups or labels, making it
hard to discern the relationships between objects, to find the dominant
group (label) in the data, or to predict a representative element for a
group; noisy objects can still be labeled, but the number of labels grows
while the share of each label shrinks. Conversely, a representation with too
few descriptors easily leads to mislabeled objects or to noisy objects being
over-weighted. Choosing roughly the right number of features reduces the
cost of evaluating the results after training and of handling new input data.
Determine the structure of the learned function and the corresponding
learning algorithm. For example, the engineer may choose to use artificial
neural networks or decision trees.
Complete the design. The designer runs the learning algorithm on the
collected training set. Parameters of the learning algorithm can be tuned by
optimizing performance on a subset of the training set (called a validation
set) or via cross-validation. After learning and parameter tuning, the
performance of the algorithm can be measured on a test set that is
independent of the training set.
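As referenced above, a small end-to-end illustration of this workflow with scikit-learn; the data here are synthetic placeholders, not from the source:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Training data as (feature vector, label) pairs -- synthetic placeholders
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Choose a model family, validate on held-out data, then test on an independent split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(cross_val_score(clf, X_train, y_train, cv=2))   # validation on the training portion
print(clf.score(X_test, y_test))                      # final evaluation on the test set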
Unsupervised learning:
Unsupervised learning is a method that seeks a model that fits a set of
observed data. It differs from supervised learning in that the correct
output for each input is not known in advance. In unsupervised learning, the
input is a collected data set. Unsupervised learning typically treats the
input objects as a set of random variables, and a joint density model is
then built for the data set.
Unsupervised learning can be combined with Bayesian inference to produce the
conditional probability of any random variable given the others.
Unsupervised learning is also useful for data compression: essentially,
every data compression algorithm relies, explicitly or implicitly, on a
probability distribution over a set of inputs.
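A tiny illustration of fitting a model to unlabeled data (synthetic points, scikit-learn KMeans; purely illustrative):

from sklearn.cluster import KMeans

# Unlabeled observations; the algorithm fits a model (two cluster centers) to them.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the fitted model: two centroids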
Semi-supervised learning:
Combines labeled and unlabeled examples to produce an appropriate function
or classifier.
Reinforcement learning:
The algorithm learns a policy of actions based on observations of its
environment. Every action affects the environment, and the environment
provides feedback that guides the learning process.
Transduction:
Similar to supervised learning, but without explicitly constructing a
function. Instead, it tries to predict new outputs based on the training
inputs, the training outputs, and the test data available during training.
Inductive learning:
This concerns the additional assumptions a learner uses to predict outputs
for cases it has never encountered before. When the designer of a learning
program aims at a particular algorithm, the program is meant to learn to
predict a given output. To do so, the learner is given a finite number of
training examples illustrating the intended relation between input and
output values. After successful learning, the program should compute a good
approximation of the correct output even for examples not seen during
training. Without additional assumptions this task cannot be solved, since
unseen situations could have arbitrary outputs. The kind of necessary
assumption about the nature of the target function is called the inductive
bias.
A more formal definition of inductive bias is based on mathematical logic.
Here, the inductive bias is a logical formula that, together with the
training data, logically entails the hypothesis produced by the learner. The
result obtained can be viewed as an approximate description of the complete
set of objects.
When deciding to build a machine learning system, the designer needs to
answer the following questions:
How does the system access its data? In other words: how can the learner use
the knowledge gathered from the training data?
If the learning program operates inside a concrete environment and can take
controlled actions on the input data, it can also update its knowledge
during execution, as in reinforcement learning; alternatively, it can do so
by distilling accumulated experience. The data may be encoded, or may
contain many noisy objects, which requires the learner to be able to decode
or to approximate the noisy objects in order to analyze them and obtain the
best possible result. From this point of view, the learning program can be
built on whichever formal model is appropriate, supervised or unsupervised,
depending on the designer.
What does the program need to learn? What is the goal to be achieved?
Different kinds of target functions can be defined inside a learning
program. These functions should be specified according to what one wants to
obtain from the analysis. The goal can be described by the outputs of the
functions being used. We can approximate this goal through the training data
set, or through the learner's behavior while processing real data.
How can the data be summarized (described)? How can the right algebraic
basis for the target functions be determined so that the functions can be
defined?
An inductive process can be constructed to approximate the characteristics
of the target function. This process can be understood as a search for a
hypothesis (or model) of the data, in a very large data space or in the
training set supplied by the designer. Choosing such an approximate
description helps limit the amount of data needed and can reduce cost.
This process can also be used to identify a representative of a group, or of
the whole data set.
Which algorithm can be applied?
Choosing a suitable algorithm is essential for building a learning program.
Because the learner should limit human intervention in the analysis, the
algorithm must meet the need of reaching as good an approximation as
possible on a large, continuously updated data set. What counts as a single
pass, and the required confidence level of the program, is determined case
by case.
Using machine learning in natural language processing:
Today there is a strong need to apply the achievements of machine learning
to natural language processing, and many different machine learning models
have been applied in this field. Previously, a large volume of data had to
be processed by hand, and the large number of rules used in human languages
added considerably to the workload. Most of the machine learning models that
have been applied keep their essential character, even though statistical
rules are not always applied, in order to find representative rules based on
the collected data samples.
Example: consider the task of part-of-speech tagging, i.e., determining the
correct tag for each word in a given sentence, often a sentence never seen
before. Based on machine learning, tagging methods usually proceed in two
steps:
The first step, training, makes use of a tagged training data set, which
includes