A Joint Segmenting and Labeling Approach for Chinese Lexical Analysis
Xinhao Wang, Jiazhong Nie, Dingsheng Luo, and Xihong Wu
Speech and Hearing Research Center, Department of Machine Intelligence, Peking University
September 18th, 2008
ECML PKDD 2008, Antwerp
Cascaded Subtasks in NLP
Chunking and Parsing
Word Segmentation and Named Entity Recognition
POS Tagging
Word Sense Disambiguation
Drawbacks:
Errors introduced by earlier subtasks propagate through the pipeline and cannot be recovered by downstream subtasks.
The pipeline architecture prevents information sharing among the subtasks.
Researchers’ Efforts on Joint Processing
Reranking (Shi, 2007; Sutton, 2005; Zhang, 2003)
As an approximation of joint processing, reranking may miss the true optimal result, which often lies outside the k-best list.
Taking multiple subtasks as a single one (Luo, 2003; Miller, 2000; Yi, 2005; Nakagawa, 2007; Ng, 2004)
The obstacle is the need for a corpus annotated with multi-level information.
Unified probabilistic models (Sutton, 2004; Duh, 2005)
Dynamic Conditional Random Fields (DCRFs) and Factorial Hidden Markov Models (FHMMs) are trained jointly and perform the subtasks all at once.
Both DCRFs and FHMMs suffer from the absence of data with multi-level annotation.
A Unified Framework for Joint Processing
A WFST-based approach is presented to jointly perform a cascade of segmentation and labeling tasks. It has two notable features:
WFSTs offer a unified framework that can represent many widely used models, such as lexical constraints, n-gram language models, and Hidden Markov Models (HMMs), so multiple knowledge sources can be modeled with a single transducer representation.
Multiple WFSTs can be composed into a single WFST, which makes it possible to perform a cascade of subtasks with one-pass decoding.
Weighted Finite State Transducers (WFSTs)
A WFST is a generalization of the finite-state automaton that can realize a weighted relation between strings.
Composition operation
Figure: Example of WFST composition. Two simple WFSTs are shown in (a) and (b); states are drawn as circles labeled with their unique numbers. Bold circles denote initial states and double circles denote final states. Each transition t is marked with its input label, output label, and weight as in(t):out(t)/weight(t). The composition of (a) and (b) is shown in (c).
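The composition operation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes epsilon-free transducers and treats weights as costs that add along a path, which is consistent with the figure (a:b/0.1 composed with b:c/0.3 gives a:c/0.4). The function name `compose` and the toy transducers `A` and `B` are hypothetical and are not the ones drawn in the figure.

```python
from collections import defaultdict

# A transition is (source, input_label, output_label, weight, target).
# Assumption: weights are costs that add along a path.

def compose(arcs_a, arcs_b, start_a, start_b, finals_a, finals_b):
    """Compose two epsilon-free WFSTs; returns (arcs, start, finals)."""
    by_source_a = defaultdict(list)
    for arc in arcs_a:
        by_source_a[arc[0]].append(arc)
    by_source_b = defaultdict(list)
    for arc in arcs_b:
        by_source_b[arc[0]].append(arc)

    start = (start_a, start_b)
    arcs, finals = [], set()
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        qa, qb = state
        # A paired state is final only if both components are final.
        if qa in finals_a and qb in finals_b:
            finals.add(state)
        for (_, in_a, out_a, w_a, next_a) in by_source_a[qa]:
            for (_, in_b, out_b, w_b, next_b) in by_source_b[qb]:
                # The output of the first machine must match the input of the second.
                if out_a != in_b:
                    continue
                target = (next_a, next_b)
                arcs.append((state, in_a, out_b, round(w_a + w_b, 10), target))
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
    return arcs, start, finals

# Hypothetical toy transducers, not the ones drawn in the figure.
A = [(0, "a", "b", 0.1, 1), (1, "c", "b", 0.2, 1)]
B = [(0, "b", "c", 0.3, 1), (1, "b", "a", 0.6, 1)]
composed_arcs, composed_start, composed_finals = compose(A, B, 0, 0, {1}, {1})
```

Only state pairs reachable from the paired initial state are built, so the result stays small even when the two input machines are large.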
Joint Chinese Lexical Analysis
The WFST-based approach
Uniform representation for multiple subtask models.
Integration of multiple models.
Tasks
Word segmentation, part-of-speech tagging, and person and location name recognition.
An n-gram language model based on word classes is adopted for word segmentation.
Hidden Markov Models (HMMs) are adopted for both name recognition and POS tagging.
In name recognition, both Chinese characters and words are used as model units, and recognition is performed simultaneously with word segmentation.
The Pipeline System vs. The Joint System
Figure: The pipeline baseline and the integrated analyzer. Components: input (FSA), dict (FST), ne (WFST), n-gram (WFST), pos (WFST). Pipeline baseline: Compose, Decode, The Best Segmentation, Compose, Decode, Output. Integrated analyzer: Compose, Decode, Output.
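The difference between the two decoding strategies can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: the candidate names and all cost values are hypothetical, costs stand in for negative log probabilities, and a real system searches a composed WFST rather than a two-entry dictionary.

```python
# Hypothetical costs for two candidate segmentations of one sentence,
# and the best tagging cost available for each candidate.
segmentation_cost = {"seg_A": 2.0, "seg_B": 2.5}
tagging_cost = {"seg_A": 4.0, "seg_B": 3.0}

def pipeline_decode():
    # Stage 1 commits to the best segmentation before tagging is consulted.
    best_seg = min(segmentation_cost, key=segmentation_cost.get)
    return best_seg, segmentation_cost[best_seg] + tagging_cost[best_seg]

def joint_decode():
    # One pass scores every candidate with both knowledge sources.
    total = {s: segmentation_cost[s] + tagging_cost[s] for s in segmentation_cost}
    best_seg = min(total, key=total.get)
    return best_seg, total[best_seg]

pipeline_result = pipeline_decode()
joint_result = joint_decode()
```

With these numbers the pipeline commits to `seg_A` because it is cheapest in isolation, while joint decoding prefers `seg_B`, whose combined cost is lower.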
Simulation Setup
Corpus: People’s Daily of China, annotated by the Institute of Computational Linguistics of Peking University
01-05 (98) is used as the training set
06 (98) is the test set
The first 2000 sentences of the test set are taken as the development set
System                 Word Segmentation F1 (%)   POS Tagging F1 (%)   Person Names Recognition F1 (%)   Place Names Recognition F1 (%)
Pipeline Baseline      95.94                      91.06                83.31                             89.90
Integrated Analyzer    96.77                      91.81                88.51                             90.91
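For readers unfamiliar with the metric, the sketch below shows one common way to compute word segmentation F1, by comparing character spans of gold and predicted words. The slides do not state the exact scoring protocol, so this span-based definition is an assumption, and the function names and the toy strings are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def segmentation_f1(gold_words, predicted_words):
    """F1 over words whose spans match exactly in gold and prediction."""
    gold, predicted = word_spans(gold_words), word_spans(predicted_words)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy example: letters stand in for Chinese characters.
gold = ["ab", "cd", "e"]
predicted = ["ab", "cde"]
score = segmentation_f1(gold, predicted)
```

Here only one predicted word matches a gold word, so precision is 1/2 and recall is 1/3.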
The Statistical Significance Test
The approximate randomization approach (Yeh, 2000) is adopted to test whether the improvement produced by joint processing is significant.
The evaluation metric tested is the F1-value of word segmentation.
The responses produced by the two systems for each sentence are shuffled and reassigned to the two systems with equal probability, and the significance level is then computed from the shuffled results.
Ten sets of 500 sentences each are randomly selected and tested. For all ten sets, the p-values are far smaller than 0.001.
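The shuffling procedure can be sketched as follows. This is a minimal illustration of approximate randomization, not the authors' code: it assumes each sentence's response is summarized as a (correct, predicted, gold) word-count triple, computes micro-averaged F1 from the totals, and uses add-one smoothing for the p-value estimate. The function names, the default number of shuffles, and the toy data are all hypothetical.

```python
import random

def f1(counts):
    """Micro F1 from per-sentence (correct, predicted, gold) word counts."""
    correct = sum(c for c, _, _ in counts)
    predicted = sum(p for _, p, _ in counts)
    gold = sum(g for _, _, g in counts)
    return 2 * correct / (predicted + gold) if predicted + gold else 0.0

def approximate_randomization(counts_a, counts_b, num_shuffles=2000, seed=0):
    """Estimate the p-value for the F1 difference between two systems."""
    rng = random.Random(seed)
    observed = abs(f1(counts_a) - f1(counts_b))
    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(counts_a, counts_b):
            # Swap the two systems' responses for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(f1(shuffled_a) - f1(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (num_shuffles + 1)

# Toy data: system A is perfect on 50 sentences, system B gets half the words.
system_a = [(10, 10, 10)] * 50
system_b = [(5, 10, 10)] * 50
p_value = approximate_randomization(system_a, system_b)
```

A small p-value means that a difference as large as the observed one rarely arises when the responses are assigned to the systems at random.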
Discussions
This approach keeps the full search space and chooses the optimal result based on the multi-level knowledge sources, rather than reranking the k-best candidates.
The models for each subtask are trained separately, while decoding is conducted jointly. Accordingly, the approach avoids the need for a corpus annotated with multi-level information.
When a segmentation task precedes a labeling task, the WFST-based approach naturally enforces the consistency restriction imposed by the segmentation.
The unified WFST framework makes it easy to apply the presented analyzer in other natural language applications that are also based on WFSTs, such as speech recognition and machine translation.
Conclusion
In this research, a joint processing approach within the unified framework of WFSTs is presented to perform a cascade of segmentation and labeling subtasks.
The joint processing approach is shown to be superior to the traditional pipeline approach.
The findings suggest two directions for future research:
More linguistic knowledge will be integrated into the analyzer, such as organization name recognition and shallow parsing.
Since rich linguistic knowledge plays an important role in difficult tasks such as ASR and MT, incorporating the integrated analyzer may lead to a promising performance improvement.
Thank you for your attention!
Figure: The WFSA representing a toy bigram language model over the words w1, w2, and w3, where un(w1) denotes the unigram of w1, and bi(w1,w2) and back(w1) denote, respectively, the bigram of w2 and the backoff weight given the word history w1. Transitions are labeled w/un(w), w2/bi(w1,w2), and ε/back(w).
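A backoff bigram model of this kind can be laid out as an automaton with one state per word history plus a shared backoff state. The sketch below follows that standard construction; the exact topology is an assumption and is not read off the figure. The names `build_bigram_wfsa` and `cost_of_next_word`, the use of negative log probabilities as arc costs, and all probability values are hypothetical.

```python
import math

EPSILON = "<eps>"
BACKOFF = "backoff"  # shared state reached when a bigram is unseen

def build_bigram_wfsa(unigram, bigram, backoff):
    """Return arcs (source, label, cost, target); costs are -log probabilities."""
    arcs = []
    for w, p in unigram.items():
        arcs.append((BACKOFF, w, -math.log(p), w))        # w/un(w)
    for w, b in backoff.items():
        arcs.append((w, EPSILON, -math.log(b), BACKOFF))  # eps/back(w)
    for (w1, w2), p in bigram.items():
        arcs.append((w1, w2, -math.log(p), w2))           # w2/bi(w1,w2)
    return arcs

def cost_of_next_word(arcs, history, word):
    """Cost of reading `word` after `history`: bigram arc if present, else back off."""
    for source, label, cost, _ in arcs:
        if source == history and label == word:
            return cost
    back = next(c for s, l, c, _ in arcs if s == history and l == EPSILON)
    uni = next(c for s, l, c, _ in arcs if s == BACKOFF and l == word)
    return back + uni

# Hypothetical toy probabilities.
unigram = {"w1": 0.5, "w2": 0.3, "w3": 0.2}
bigram = {("w1", "w2"): 0.6}
backoff = {"w1": 0.4, "w2": 1.0, "w3": 1.0}
lm_arcs = build_bigram_wfsa(unigram, bigram, backoff)
seen_cost = cost_of_next_word(lm_arcs, "w1", "w2")
unseen_cost = cost_of_next_word(lm_arcs, "w1", "w3")
```

The seen bigram is scored by its own arc, while the unseen one pays the backoff cost of the history and then the unigram cost of the word.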
Classes   Description
w_i       The i-th word listed in the dictionary
CNAME     Chinese person names
TNAME     Translated person names
LOC       Location names
NUM       Number expressions
LETTER    Letter strings
NON       Other non-Chinese character strings
BEGIN     Beginnings of sentences
END       Ends of sentences
Figure: POS WFSTs. (a) is the WFST representing the relationship between words and POS tags, with transitions labeled word:pos/p(word/pos); (b) is the WFSA representing a toy bigram model of POS tags, with transitions labeled pos1/un(pos1) and pos2/bi(pos1,pos2).
Figure: The CNAME model, with states 0 to 3 and transitions for the surname, the first character of the given name, and the second character of the given name.
The Statistical Significance Test
The approximate randomization approach (Yeh, 2000).
The responses produced by the two systems for each sentence are shuffled and reassigned to the two systems with equal probability, and the significance level is then computed from the shuffled results.
The number of shuffles is fixed as: [equation not preserved in the transcript]
Since our test set contains more than 21,000 sentences, using 2^20 shuffles to approximate 2^21000 shuffles is no longer reasonable. Thus, ten sets of 500 sentences each are randomly selected and tested.
For all ten sets, the p-values are far smaller than 0.001.