Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo 1*, Kaoru Yamamoto 2, Yuji Matsumoto 1
1 Nara Institute of Science and Technology
2 CREST, Tokyo Institute of Technology
* Currently, NTT Communication Science Labs.
Background
Conditional Random Fields [Lafferty 01]
a variant of Markov Random Fields
many applications: POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04]
Japanese Morphological Analysis
must cope with word segmentation
must incorporate many features
must minimize the influence of the length bias
Japanese Morphological Analysis
word segmentation (no explicit spaces in Japanese)
POS tagging, lemmatization, stemming
INPUT: 東京都に住む (I live in Metropolis of Tokyo.)
東京 / 都 / に / 住む
東京 (Tokyo): NOUN-PROPER-LOC-GENERAL
都 (Metro.): NOUN-SUFFIX-LOC
に (in): PARTICLE-GENERAL
住む (live): VERB, BASE-FORM
Simple approach for JMA
Character-based begin / inside tagging
a non-standard method in JMA: cannot directly reflect lexicons
over 90% accuracy can be achieved using the naïve longest prefix matching with a lexicon
decoding is slow
東 京 / 都 / に / 住 む
B  I     B    B    B  I
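As a concrete illustration of this encoding, here is a minimal sketch of mapping a segmentation to B/I character tags (the function name is ours, not part of any JMA toolkit):

```python
# Minimal sketch of the character-based begin/inside encoding: the first
# character of each word is tagged B, every following character I.
def to_bi_tags(words):
    """Map a segmentation like ["東京", "都", "に", "住む"] to per-character B/I tags."""
    tags = []
    for word in words:
        tags.append("B")
        tags.extend("I" * (len(word) - 1))
    return tags

print(to_bi_tags(["東京", "都", "に", "住む"]))  # ['B', 'I', 'B', 'B', 'B', 'I']
```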
Our approach for JMA
Assume that a lexicon is available: build a word lattice
represents all candidate outputs; reduces redundant outputs
Unknown word processing: invoked when no matching word can be found in the lexicon; candidates are generated from character types, e.g., Chinese characters, hiragana, katakana, numbers, etc.
Problem Setting
Input: "東京都に住む" (I live in Metropolis of Tokyo)
Lattice:
BOS
東 (east) [noun]
東京 (Tokyo) [noun]
京 (capital) [noun]
京都 (Kyoto) [noun]
都 (Metro.) [suffix]
に (in) [particle]
に (resemble) [verb]
住む (live) [verb]
EOS
Lexicon: 東 [noun], 京 [noun], 東京 [noun], 京都 [noun], に [particle, verb], …
GOAL: select the optimal path out of all candidates
Input: X
Output: Y = ((w_1, t_1), …, (w_{#Y}, t_{#Y})), a path through the lattice built with the lexicon
NOTE: the number of tokens #Y varies among candidates
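A toy sketch of building such a lattice by lexicon lookup (the inlined LEXICON is a hand-made stand-in; the actual system uses the JUMAN lexicon with ~2M entries):

```python
# Toy lattice construction by exhaustive lexicon lookup (a sketch; a real
# system would use a trie for common-prefix search over a large lexicon).
LEXICON = {
    "東": ["noun"], "京": ["noun"], "東京": ["noun"], "京都": ["noun"],
    "都": ["suffix"], "に": ["particle", "verb"], "住む": ["verb"],
}

def build_lattice(sentence):
    """edges[i] = all (word, tag, end) triples for lexicon words starting at i."""
    edges = {i: [] for i in range(len(sentence))}
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            for tag in LEXICON.get(sentence[i:j], []):
                edges[i].append((sentence[i:j], tag, j))
    return edges

# 東/東京 start at 0, 京/京都 at 1, 都 at 2, に (two tags) at 3, 住む at 4.
print(build_lattice("東京都に住む"))
```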
Long-standing Problems in JMA
Complex tagset
Hierarchical tagset
HMMs cannot capture it: how to select the hidden classes?
TOP level → lack of granularity
Bottom level → data sparseness
Some functional particles should be lexicalized
Semi-automatic hidden class selection [Asahara 00]
Example: 京都 (Kyoto): Noun - Proper - Loc - General, with the lexicalized leaf Kyoto
Complex tagset, cont.
Must capture a variety of features
[Figure: feature annotations on the running example]
word          POS hierarchy             inflection   lexicalization
京都 (Kyoto)   noun-proper-loc-general   φ            Kyoto
に (in)        particle-general-φ-φ      φ            に
住む (live)    verb-independent-φ-φ      base-form    live
feature types: POS hierarchy, overlapping features, inflections, character types, prefixes/suffixes, lexicalization
These features are important to JMA
JMA with MEMMs [Uchimoto 00-03]
Use a discriminative model, e.g., a maximum entropy model, to capture a variety of features
sequential application of ME models
BOS → 東 (east) [noun] vs. 東京 (Tokyo) [noun]: P(東 | BOS) < P(東京 | BOS)
都 (capital) [suffix] → に (in) [particle] vs. に (resemble) [verb]: P(に, particle | 都, suffix) > P(に, verb | 都, suffix)
P(Y|X) = \prod_i p(\langle w_i, t_i \rangle \mid \langle w_{i-1}, t_{i-1} \rangle)
Problems of MEMMs
Label bias [Lafferty 01]
[Figure: toy FSM. BOS branches to A (0.6) and B (0.4); A branches to C (0.4) and D (0.6); B reaches E (1.0); C, D, and E each reach EOS with probability 1.0]
P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
P(B, E | x) = 0.4 × 1.0 × 1.0 = 0.4
P(A, D | x) < P(B, E | x)
paths through low-entropy states are preferred
P(Y|X) = \prod_i p(\langle w_i, t_i \rangle \mid \langle w_{i-1}, t_{i-1} \rangle)
Problems of MEMMs in JMA
Length bias
[Figure: toy lattice. BOS branches to A (0.6) and to the long word B (0.4); A branches to C (0.4) and D (0.6); D reaches EOS with probability 1.0, and B, spanning the same input as A + D together, reaches EOS with probability 1.0]
P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
P(B | x) = 0.4 × 1.0 = 0.4
P(A, D | x) < P(B | x)
long words are preferred: the length bias has been ignored in JMA!
P(Y|X) = \prod_i p(\langle w_i, t_i \rangle \mid \langle w_{i-1}, t_{i-1} \rangle)
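The two toy lattices above can be checked numerically; this sketch just reproduces the slide arithmetic to show why locally normalized MEMM scores misbehave:

```python
from math import prod  # Python 3.8+

# Label bias: the low-entropy path B->E wins although its first step is less likely.
p_AD = prod([0.6, 0.6, 1.0])   # BOS->A->D->EOS = 0.36
p_BE = prod([0.4, 1.0, 1.0])   # BOS->B->E->EOS = 0.40
assert p_AD < p_BE

# Length bias: a path with fewer (longer) words multiplies fewer probabilities.
p_two_short = prod([0.6, 0.6, 1.0])  # two short words: 0.36
p_one_long  = prod([0.4, 1.0])       # one long word:   0.40
assert p_two_short < p_one_long
```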
Long-standing problems
must incorporate a variety of features
overlapping features, POS hierarchy, lexicalization, character types
HMMs are not sufficient
must minimize the influence of the length bias
another bias, observed especially in JMA
MEMMs are not sufficient
Can CRFs solve these problems? Yes!
Applying CRFs to JMA
CRFs for word lattice
Global feature vector: F(Y,X) = (… 1 … … 1 … … 1 …)
Parameter vector: Λ = (… 3 … … 20 … 20 …)
Lattice (as before): BOS / 東 (east) [noun] / 東京 (Tokyo) [noun] / 京 (capital) [noun] / 京都 (Kyoto) [noun] / 都 (Metro.) [suffix] / に (in) [particle] / に (resemble) [verb] / 住む (live) [verb] / EOS
features fired on the highlighted path include the tag bigrams BOS-noun and noun-suffix and the lexicalized unigram noun/Tokyo
encodes a variety of uni- or bi-gram features in a path
P(Y|X) = \frac{1}{Z_X} \exp\big(\Lambda \cdot F(Y, X)\big)

Z_X = \sum_{Y' \in \mathcal{Y}(X)} \exp\big(\Lambda \cdot F(Y', X)\big)

\mathcal{Y}(X): the set of all candidate paths
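A brute-force sketch of this globally normalized model, enumerating every candidate path (feasible only on toy lattices; real systems use forward-backward, and the feature set and weights here are illustrative):

```python
import math

# P(Y|X) = exp(Λ·F(Y,X)) / Z_X, with Z_X summed over all candidate paths.
# Paths are lists of (word, tag) tokens including BOS/EOS markers.
def score(path, weights):
    """Λ·F(Y,X) with simple tag-bigram features."""
    return sum(weights.get((t1, t2), 0.0)
               for (_, t1), (_, t2) in zip(path, path[1:]))

def crf_prob(path, all_paths, weights):
    z = sum(math.exp(score(p, weights)) for p in all_paths)  # Z_X
    return math.exp(score(path, weights)) / z

# Example usage with hypothetical weights on two bigram features:
# weights = {("BOS", "noun"): 3.0, ("noun", "suffix"): 20.0}
```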
CRFs for word lattice, cont.
A single exponential model over the entire set of paths
P(Y|X) = \frac{1}{Z_X} \exp\big(\Lambda \cdot F(Y, X)\big)
       = \frac{1}{Z_X} \exp\Big(\sum_i \sum_k \lambda_k f_k(\langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle)\Big)

e.g., f_{1234}(\langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle) = 1 if t_{i-1} = noun and t_i = particle, 0 otherwise
fewer restrictions in feature design
can incorporate a variety of features
can solve the problems of HMMs
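For concreteness, the indicator feature above written out in code (a sketch; the index 1234 is arbitrary and the (word, tag) token representation is ours):

```python
# The slide's binary tag-bigram feature: fires iff noun is followed by particle.
def f_1234(prev_token, cur_token):
    """1 iff the previous tag is noun and the current tag is particle."""
    (_, t_prev), (_, t_cur) = prev_token, cur_token
    return 1 if (t_prev == "noun" and t_cur == "particle") else 0
```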
Encoding
Maximum Likelihood estimation
\hat{\Lambda} = \arg\max_{\Lambda} L_\Lambda

L_\Lambda = \sum_{j=1}^{N} \log P(Y_j | X_j)
          = \sum_{j=1}^{N} \Big( \Lambda \cdot F(Y_j, X_j) - \log \sum_{Y' \in \mathcal{Y}(X_j)} \exp\big(\Lambda \cdot F(Y', X_j)\big) \Big)
all candidate paths are taken into account in encoding
the influence of the length bias will be minimized
can solve the problems of MEMMs
A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice
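For completeness, the gradient that this forward-backward computation serves (the standard CRF log-likelihood gradient, not stated on the slide): observed feature counts minus feature counts expected under the model.

```latex
\frac{\partial L_\Lambda}{\partial \lambda_k}
  = \sum_{j=1}^{N} \Big( F_k(Y_j, X_j)
      - \sum_{Y' \in \mathcal{Y}(X_j)} P(Y' \mid X_j)\, F_k(Y', X_j) \Big)
```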
MAP estimation
L2-CRF (Gaussian prior)
non-sparse solution (all features have non-zero weights)
good if most given features are relevant
unconstrained optimizers, e.g., L-BFGS, are used
L1-CRF (Laplacian prior)
sparse solution (most features have zero weight)
good if most given features are irrelevant
constrained optimizers, e.g., L-BFGS-B, are used
C is a hyper-parameter
L2-CRF: L_\Lambda = \sum_{j=1}^{N} \log P(Y_j | X_j) - \frac{\|\Lambda\|_2^2}{2C}

L1-CRF: L_\Lambda = \sum_{j=1}^{N} \log P(Y_j | X_j) - \frac{\|\Lambda\|_1}{C}
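A minimal sketch of the two penalized objectives, assuming the weights sit in a NumPy array; the exact scaling of the hyper-parameter C follows the reconstruction above:

```python
import numpy as np

# MAP objectives as penalties added to the log-likelihood (sketch; C > 0).
def l2_objective(log_likelihood, weights, C):
    return log_likelihood - float(weights @ weights) / (2.0 * C)  # Gaussian prior

def l1_objective(log_likelihood, weights, C):
    return log_likelihood - float(np.abs(weights).sum()) / C      # Laplacian prior
```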
Decoding
\hat{Y} = \arg\max_{Y \in \mathcal{Y}(X)} P(Y|X) = \arg\max_{Y \in \mathcal{Y}(X)} \Lambda \cdot F(Y, X)
Viterbi algorithm: essentially the same search architecture as HMMs and MEMMs
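A sketch of Viterbi over the word lattice (container layout and names are ours; bigram_score(p, n) plays the role of the weighted features Λ·f on the edge p → n):

```python
# Dynamic programming over lattice nodes in topological order: keep the best
# incoming score and a back-pointer per node, then backtrace from EOS.
def viterbi(nodes, preds, bigram_score, bos, eos):
    best = {bos: (0.0, None)}            # node -> (best score, back-pointer)
    for n in nodes:
        if n == bos:
            continue
        cands = [(best[p][0] + bigram_score(p, n), p)
                 for p in preds[n] if p in best]
        if cands:
            best[n] = max(cands, key=lambda c: c[0])
    path, n = [], eos                    # backtrace from EOS
    while n is not None:
        path.append(n)
        n = best[n][1]
    return list(reversed(path))
```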
Experiments
Data
KC corpus
source: Mainichi News Article '95
lexicon (size): JUMAN 3.61 (1,983,173 entries)
POS structure: 2-level POS, c-form, c-type, base form
# training sentences: 7,958
# training tokens: 198,514
# test sentences: 1,246
# test tokens: 31,302
# features: 791,798
KC and RWCP are widely-used Japanese annotated corpora
Features
f_{1234}(\langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle) = 1 if t_{i-1} = noun and t_i = particle, 0 otherwise
[Figure: same feature annotations as on the "Complex tagset, cont." slide]
word          POS hierarchy             inflection   lexicalization
京都 (Kyoto)   noun-proper-loc-general   φ            Kyoto
に (in)        particle-general-φ-φ      φ            に
住む (live)    verb-independent-φ-φ      base-form    live
feature types: POS hierarchy, overlapping features, inflections, character types, prefixes/suffixes, lexicalization
Evaluation
three criteria of correctness:
seg: word segmentation only
top: word segmentation + top level of POS
all: all information

F = 2 · recall · precision / (recall + precision)
recall = # correct tokens / # tokens in test corpus
precision = # correct tokens / # tokens in system output
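A sketch of the token-level metric; representing each token as a (start, end, tag) triple is an assumption that makes exact boundary-and-tag matching a set intersection:

```python
# Token-level F as defined above. Depending on the seg/top/all criterion,
# the tag field holds nothing, the top-level POS, or the full annotation.
def f_score(system_tokens, gold_tokens):
    correct = len(set(system_tokens) & set(gold_tokens))
    recall = correct / len(gold_tokens)
    precision = correct / len(system_tokens)
    return 2 * recall * precision / (recall + precision)
```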
Results (F-score)
          seg      top      all
L2-CRFs   98.96    98.31    96.75
L1-CRFs   98.80    98.14    96.55
HMMs      96.22    94.99    91.85
MEMMs     96.44    95.81    94.28
L1/L2-CRFs outperform HMMs and MEMMs
L2-CRFs outperform L1-CRFs
Significance Tests: McNemar’s paired test on the labeling disagreements
Influence of the length bias
HMMs, CRFs: the relative ratios are not much different
MEMMs: the number of long-word errors is large → influenced by the length bias
          # long word err.   # short word err.
HMMs      306 (44%)          387 (56%)
L2-CRFs   79 (40%)           120 (60%)
MEMMs     416 (70%)          183 (30%)
L1-CRFs vs. L2-CRFs
L2-CRFs > L1-CRFs
most given features are relevant (POS hierarchies, suffixes/prefixes, character types)
L1-CRFs produce a compact model
# of active features: L2: 791,798 vs. L1: 90,163 (11%)
L1-CRFs are worth examining if there are practical constraints
Conclusions
An application of CRFs to JMA
does not use character-based begin/inside tags, but a word lattice with a lexicon
CRFs offer an elegant solution to the problems with HMMs and MEMMs
can use a wide variety of features (hierarchical POS tags, inflections, character types, etc.)
can minimize the influence of the length bias (length bias has been ignored in JMA!)
Future work
Tri-gram features
using all tri-grams is impractical, as they make decoding significantly slower
need practical feature selection, e.g., [McCallum 03]
Apply to other non-segmented languages, e.g., Chinese or Thai
CRFs encoding
A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice
E_{P(Y|X)}[F(Y, X)]: the expectation of each feature f_k is accumulated edge by edge over the lattice:

E[f_k] = \frac{1}{Z_X} \sum_{\langle w', t' \rangle \to \langle w, t \rangle} \alpha_{\langle w', t' \rangle} \cdot f_k(\langle w', t' \rangle, \langle w, t \rangle) \cdot \exp\Big(\sum_{k'} \lambda_{k'} f_{k'}(\langle w', t' \rangle, \langle w, t \rangle)\Big) \cdot \beta_{\langle w, t \rangle}

where the forward and backward scores are accumulated over the tokens adjacent to the left and right of \langle w, t \rangle:

\alpha_{\langle w, t \rangle} = \sum_{\langle w', t' \rangle \,\text{left of}\, \langle w, t \rangle} \alpha_{\langle w', t' \rangle} \exp\Big(\sum_k \lambda_k f_k(\langle w', t' \rangle, \langle w, t \rangle)\Big)

\beta_{\langle w, t \rangle} = \sum_{\langle w', t' \rangle \,\text{right of}\, \langle w, t \rangle} \beta_{\langle w', t' \rangle} \exp\Big(\sum_k \lambda_k f_k(\langle w, t \rangle, \langle w', t' \rangle)\Big)

Z_X = \alpha_{EOS} = \beta_{BOS}
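A sketch of these recursions in code (node/edge containers and names are ours; assumes nodes are given in topological order and edge_score returns the weighted feature sum on an edge):

```python
import math

# Forward-backward over the word lattice: alpha/beta follow the recursions
# above; Z_X = alpha[EOS] = beta[BOS], and the marginal of an edge p -> n is
# alpha[p] * exp(edge_score(p, n)) * beta[n] / Z_X.
def forward_backward(nodes, preds, succs, edge_score, bos, eos):
    alpha = {bos: 1.0}
    for n in nodes:                      # forward pass
        if n != bos:
            alpha[n] = sum(alpha[p] * math.exp(edge_score(p, n)) for p in preds[n])
    beta = {eos: 1.0}
    for n in reversed(nodes):            # backward pass
        if n != eos:
            beta[n] = sum(math.exp(edge_score(n, s)) * beta[s] for s in succs[n])
    z = alpha[eos]                       # == beta[bos]
    marginal = lambda p, n: alpha[p] * math.exp(edge_score(p, n)) * beta[n] / z
    return alpha, beta, z, marginal
```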
Influence of the length bias, cont.
These errors are caused by the influence of the length bias (CRFs analyze these sentences correctly)
Example 1: 海 (sea) / に (particle) / かけた (bet) / ロマン (romance) / は (particle)
"The romance on the sea they bet is …"
MEMMs select the longer candidate ロマンは (romanticist) instead of ロマン / は
Example 2: 荒波 (rough waves) / に (particle) / 負け (lose) / ない (not) / 心 (heart)
"A heart which beats rough waves is …"
MEMMs select the longer candidate ない心 (one's heart) instead of ない / 心
Cause of label and length bias
MEMMs use only the correct path in encoding
transition probabilities of unobserved paths are distributed uniformly
BOS / 東 [noun] / 東京 [noun] / 京 [noun] / 京都 [noun] / 都 [suffix] / に [particle] / に [verb]