Page 1
Semantic Role Labeling for Learner Chinese: the Importance of Syntactic Analysis and L2-L1 Parallel Data
Zi Lin, Yuguang Duan, Yuanyuan Zhao, Weiwei Sun, Xiaojun Wan
Peking University
{zi.lin, ariaduan, zhao yy, ws, wanxiaojun}@pku.edu.cn
October 25, 2018
Page 2
Overview
Background
Data Set
Robustness of L1-annotation-trained SRL Systems
Analysis
Improving SRL Systems with L2-L1 Parallel Data
Page 6
What is interlanguage?
Interlanguage is a learner's second language (L2) that preserves some features of the learner's first language (L1).
[Figure: a speaker saying "你好~" ("Hello~"); labels: Mandarin Chinese, English; arrow label: influence]
Page 8
Interlanguage is everywhere...
Social Network
Page 9
Interlanguage is everywhere...
And perhaps your paper...
Page 11
L2-L1 Parallel Data
Collect a large dataset of L2-L1 parallel Mandarin texts by exploring a “language exchange” social networking service, lang-8¹.
¹ http://lang-8.com/
Page 12
Data for SRL annotation
Initial collection: 1,108,907 pairs
→ clean up: 717,241 pairs
→ manual selection: 600 pairs
→ segmentation and SRL annotation
4 typologically different mother tongues
Language    Family
Chinese     Sino-Tibetan
Russian     Slavic
Arabic      Semitic
Japanese    Unknown
English     Germanic
Page 17
Two Questions
1. Can humans understand interlanguage robustly?
2. Can automatic systems produce high-quality semantic structures?
Page 18
Can humans understand interlanguage robustly?
☹ It is difficult to define a syntactic formalism for learner language.
☺ But sometimes we can still understand what learners mean...
Why not semantics?
Page 21
Semantic Role Labeling
Argument (AN): Who did what to whom?
Adjunct (AM): When, where, why and how?
Example: [I]A0 ate [breakfast]A1 [quickly]AM-MNR [in the car]AM-LOC [this morning]AM-TMP [because I was in a hurry]AM-PRP.
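A minimal sketch of how the labeled example above could be stored and rendered. The names `Span` and `bracket` are hypothetical, the spans are assumed not to overlap, and the `rel` label for the predicate is borrowed from the later slides; the role labels are the ones shown on this slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)
    role: str   # e.g. "A0", "A1", "AM-LOC", or "rel" for the predicate


def bracket(tokens: List[str], spans: List[Span]) -> str:
    """Render tokens with each labeled span wrapped as [text]ROLE."""
    by_start = {s.start: s for s in spans}
    out, i = [], 0
    while i < len(tokens):
        span = by_start.get(i)
        if span is None:
            # Token is outside every labeled span: emit it unchanged.
            out.append(tokens[i])
            i += 1
        else:
            text = " ".join(tokens[span.start:span.end])
            out.append("[" + text + "]" + span.role)
            i = span.end
    return " ".join(out)


tokens = ("I ate breakfast quickly in the car this morning "
          "because I was in a hurry").split()
spans = [
    Span(0, 1, "A0"),
    Span(1, 2, "rel"),
    Span(2, 3, "A1"),
    Span(3, 4, "AM-MNR"),
    Span(4, 7, "AM-LOC"),
    Span(7, 9, "AM-TMP"),
    Span(9, 15, "AM-PRP"),
]
print(bracket(tokens, spans))
```

Storing spans as token offsets rather than strings keeps the annotation tied to a particular segmentation, which matters for Chinese, where segmentation is itself an annotation step.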
Page 22
Inter-annotator agreement
• Annotators: two linguistics students
• The first 50-sentence trial set: adapting and refining the CPB specification
• The remaining 100-sentence set: reporting the inter-annotator agreement
Page 26
Three SRL systems
• The Necessity of Parsing for Predicate Argument Recognition (2002). Gildea and Palmer.
• Semantic Role Labeling Using Different Syntactic Views (2005). Pradhan et al.
• Syntax for Semantic Role Labeling, To Be, Or Not To Be (2018). He et al.
• Linguistically-Informed Self-Attention for Semantic Role Labeling (2018). Strubell et al. EMNLP 2018 Best Paper.
SRL system                          Parser
PCFGLA-parser-based SRL system      Berkeley parser
Neural-parser-based SRL system      Minimal span-based parser
Neural syntax-agnostic SRL system   (no parser)
Parser performance: Berkeley parser < minimal span-based parser
Parsers: trained on the part of Chinese TreeBank that has SRL annotation in CPB
SRL systems: trained on Chinese PropBank (CPB)
Page 29
Findings
The syntax-based systems are more robust when handling learner texts.
Page 30
Findings
The better the parsing results, the better the performance we achieve on L2.
Page 32
Why is syntactic analysis important?
用 汉语 也 说话 快 对我来说 很 难 啊。
Gloss: Using Chinese also speaking quickly for me very hard.
Translation: Using Chinese and also speaking quickly is very hard for me.
[Figure: role labels (A0, AM, rel) for this sentence from the gold annotation, the syntax-based system, and the neural end-to-end system]
Page 33
Why is syntactic analysis important?
[Figure: constituency parse of 用汉语也说话快对我来说很难啊。 Nodes in the order shown on the slide: CP, IP, IP (用汉语 "using Chinese"), VP, ADVP (也 "also"), VP, VP (说话快 "speaking quickly"), VP, PP (对我来说 "for me"), ADVP (很 "very"), VP (难 "hard"), SP (啊 MOD), PU (。)]
• Though the whole structure is ill-formed,
• parts of the sentence can be well-formed.
Page 35
A New Question
1. Can humans understand interlanguage robustly?
2. Can automatic systems produce high-quality semantic structures?
3. Can we improve SRL performance on interlanguage?
Page 37
Leveraging L2-L1 Parallel Data
☺ 我 喜欢 做 中国菜 (I like cooking Chinese food)
☺ 我 喜欢 做饭 (I like cooking meal)
☹ 我 喜欢 做饭 中国菜 (I like cook-meal Chinese food)
Page 40
Leveraging L2-L1 Parallel Data
〈predicate, argument, role〉 tuples
L1: 我 喜欢 做 中国菜 (I like cooking Chinese food)
  Predicate 喜欢 (like): ARG0 = 我, ARG1 = 做中国菜
  Predicate 做 (cook): ARG0 = 我, ARG1 = 中国菜
L2: 我 喜欢 做饭 中国菜 (I like cook-meal Chinese food)
  Predicate 喜欢 (like): ARG0 = 我, ARG1 = 做饭中国菜
  Predicate 做饭 (cook-meal): ARG0 = 我, ARG1 = 中国菜
# of shared tuples = 1
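The tuple comparison on this slide can be sketched as a set intersection. The argument spans below follow one reading of the slide's example, and exact string equality of all three fields is an assumption about how tuples are matched; `shared_tuples` is a hypothetical helper name.

```python
# Each SRL result is modeled as a set of (predicate, argument, role) tuples.
l1_tuples = {
    ("喜欢", "我", "ARG0"),
    ("喜欢", "做中国菜", "ARG1"),
    ("做", "我", "ARG0"),
    ("做", "中国菜", "ARG1"),
}
l2_tuples = {
    ("喜欢", "我", "ARG0"),
    ("喜欢", "做饭中国菜", "ARG1"),
    ("做饭", "我", "ARG0"),
    ("做饭", "中国菜", "ARG1"),
}


def shared_tuples(a, b):
    """Tuples whose predicate, argument, and role all match exactly (assumed)."""
    return a & b


shared = shared_tuples(l1_tuples, l2_tuples)
print(len(shared), shared)
```

Under this reading only 〈喜欢, 我, ARG0〉 survives: the ARG1 spans of 喜欢 differ, and the second predicate differs (做 vs. 做饭), which agrees with the count of 1 given on the slide.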
Page 41
Leveraging L2-L1 Parallel Data
Metric for comparing SRL results
• L2-recall: (# of shared tuples) / (# of tuples in the L2 result)
• L1-recall: (# of shared tuples) / (# of tuples in the L1 result)
A sentence pair is considered well-formed if both recalls are greater than a threshold λ.
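A rough sketch of the two recalls and the λ filter, assuming tuples are compared by exact equality. The function names, the English-gloss toy tuples, the handling of empty results, and the threshold values are all illustrative assumptions; the slides do not state the value of λ.

```python
from typing import Set, Tuple

SrlTuple = Tuple[str, str, str]  # (predicate, argument, role)


def recall(shared: int, total: int) -> float:
    # Assumption: an empty SRL result gets recall 0.0 instead of raising.
    return shared / total if total else 0.0


def pair_recalls(l2: Set[SrlTuple], l1: Set[SrlTuple]) -> Tuple[float, float]:
    """Return (L2-recall, L1-recall) for one sentence pair."""
    shared = len(l2 & l1)
    return recall(shared, len(l2)), recall(shared, len(l1))


def is_well_formed(l2: Set[SrlTuple], l1: Set[SrlTuple], lam: float) -> bool:
    """A pair counts as well-formed if both recalls are strictly above lam."""
    l2_recall, l1_recall = pair_recalls(l2, l1)
    return l2_recall > lam and l1_recall > lam


# Toy pair written with English glosses (illustrative only).
l2 = {("like", "I", "ARG0"), ("like", "cook-meal Chinese food", "ARG1")}
l1 = {("like", "I", "ARG0"), ("like", "cooking Chinese food", "ARG1")}

print(pair_recalls(l2, l1))
print(is_well_formed(l2, l1, lam=0.4))
```

Requiring both recalls to pass, rather than only one, rejects pairs where either side has many tuples with no counterpart on the other side.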
Page 42
Retraining two essential modules
SRL system                          Parser
PCFGLA-parser-based SRL system      Berkeley parser
Neural-parser-based SRL system      Minimal span-based parser
Neural syntax-agnostic SRL system   (no parser)
Parser performance: Berkeley parser < minimal span-based parser
1. Retrain the parser: use the automatically generated syntactic trees of the well-formed sentence pairs.
Page 47
Thanks for your attention!
Zi Lin is planning to apply for a PhD program in CS or linguistics this fall. Email me at [email protected] . cn if you are interested!