English-to-Chinese Transliteration with a Phonetic Auxiliary Task
Yuan He∗
Department of Computer Science
University of [email protected]

Shay B. Cohen
School of Informatics
University of [email protected]
Abstract

Approaching named entity transliteration as a Neural Machine Translation (NMT) problem is common practice. While many have applied various NMT techniques to enhance machine transliteration models, few focus on the linguistic features particular to the relevant languages. In this paper, we investigate the effect of incorporating phonetic features for English-to-Chinese transliteration under the multi-task learning (MTL) setting, where we define a phonetic auxiliary task aimed to improve the generalization performance of the main transliteration task. In addition to our system, we also release a new English-to-Chinese dataset and propose a novel evaluation metric which considers multiple possible transliterations given a source name. Our results show that the multi-task model achieves performance similar to the previous state of the art with a model of a much smaller size.1
1 Introduction
Transliteration, the act of mapping a name from the orthographic system of one language to another, is directed by the pronunciation in the source and target languages, and often by historical reasons or conventions. It plays an important role in tasks like information retrieval and machine translation (Marton and Zitouni, 2014; Hermjakob et al., 2008).
Over the recent years, many have addressed transliteration using sequence-to-sequence (seq2seq) deep learning models (Rosca and Breuel, 2016; Merhav and Ash, 2018; Grundkiewicz and Heafield, 2018), enhanced with several NMT techniques (Grundkiewicz and Heafield, 2018). However, this recent work neglects the most crucial feature for transliteration, i.e. pronunciation. To

∗Work done at The University of Edinburgh.
1Our code and data are available at https://github.com/Lawhy/Multi-task-NMTransliteration.
English   IPA     Chinese   Pinyin
A         /ˈeɪ/   艾        ài
my        /mi/    米        mǐ

Table 1: An example of English-to-Chinese transliteration, from Amy to 艾米. Each row presents a group of corresponding subsequences in different representations.
bridge this gap, we define a phonetic auxiliary task that shares the sound information with the main transliteration task under the multi-task learning (MTL) setting.
Depending on the specific language, the written form of a word reveals its pronunciation to various extents. For alphabetical languages such as English and French, a letter, or a sequence of letters, usually reflects the word's pronunciation. For example, the word Amy (in the International Phonetic Alphabet, IPA, /ˈeɪ.mi/) has the sub-word A corresponding to /ˈeɪ/ and my corresponding to /mi/. In contrast, characters in a logographic2 writing system for languages like Chinese or Japanese do not explicitly indicate sound (Xing et al., 2006).
In this paper, we give a treatment to the problem of transliteration from English (alphabet) to Chinese3 (logogram) using an RNN-based MTL model with a phonetic auxiliary task. We transform each Chinese character to the alphabetical representation of its pronunciation via the official phonetic writing system, Pinyin,4 which uses Latin letters with four diacritics denoting tones to represent the sounds.

2A logogram is an individual character that represents a whole word or phrase.
3The Chinese language we mention in this paper refers explicitly to Mandarin, the official language that originated from the northern dialect in China.
4Pinyin is the official romanization system for Standard Chinese (Mandarin) in mainland China and to some extent in Taiwan. It does not apply to other Chinese dialects.
For example, the Chinese transliteration for Amy is 艾米 and the associated Pinyin representation is àimǐ. We summarize the correspondences occurring in this example in Table 1.
Due to the similarity between the source name and the Pinyin representation, Jiang et al. (2009) proposed a sequential transliteration model that uses Pinyin as an intermediate representation before transliterating a Chinese name to English. In contrast, our idea is to build a model with a shared encoder and dual decoders that can learn the mappings from English to Chinese and Pinyin simultaneously. By jointly learning source-to-target and source-to-sound mappings, the encoder is expected to generalize better (Ruder, 2017) and pass more refined information to the decoders.
Transliteration datasets are often extracted from dictionaries, or from aligned corpora generated by applying a named entity recognition (NER) system to parallel newspaper articles in different languages (Sproat et al., 2006). We use two datasets for our experiments, one taken from the NEWS Machine Transliteration Shared Task (Chen et al., 2018) and the other extracted from a large dictionary. We evaluate the transliteration system using both the conventional word accuracy and a novel metric designed for English-to-Chinese transliteration (see Section 5). Our contributions are as follows:
1. We make available a new English-to-Chinese named entity dataset ("DICT") particular to names of people. This dataset is based on the dictionary A Comprehensive Dictionary of Names in Roman-Chinese (Xinhua News Agency, 2007).

2. We propose a substitution-based metric called Accuracy with Alternating Character Table (ACC-ACT), which gives a better estimation of the system's quality than the traditional word accuracy (ACC).

3. We propose a multi-task learning transliteration model with a phonetic auxiliary task, and run experiments to demonstrate that it attains better scores than single-main-task or single-auxiliary-task models.
We report accuracy and F-score of 0.299 and 0.6799, respectively, on the NEWS dataset, with a model of 22M parameters, compared to the previous state of the art (Grundkiewicz and Heafield, 2018), which achieves accuracy and F-score of 0.304 and 0.6791, respectively, with a model of 133M parameters. On the DICT dataset, for
Source (x)   Target (y)   Pinyin (p)
Caleigh      凯莉          kai li

Table 2: An example data point under our multi-task learning setting.
the same model sizes, we report accuracy of 0.729 as compared to their 0.732.
2 Problem Formulation
We use the word vocabulary to describe the set of characters for the purpose of our task specification. Let V_src and V_tgt denote the source and target vocabularies, respectively. For a source word x of length I and a target word y of length J, we have:

x = (x_1, x_2, ..., x_I) ∈ V_src^I,   y = (y_1, y_2, ..., y_J) ∈ V_tgt^J,

where the k-th element in the vector denotes a character at position k.
We formulate the task of transliteration as a supervised learning problem: given a collection of n training examples, {(x^(i), y^(i))}_{i=1}^{n}, the objective is to learn a predictor function, f : x → y, whose parameters maximize the following conditional probability:

P(y|x) = ∏_{j=1}^{J} P(y_j | y_1, ..., y_{j−1}, x),

which follows from the chain rule.
For our multi-task transliteration model, the predictor becomes f_MTL : x → (y, p), where p denotes the written representation of the pronunciation of the target word y. For decoding, we maximize the conditional probabilities P(p|x, ỹ) and P(y|x, p̃), where ỹ and p̃ refer to the implicit information channeled by one task to the other.
The phonetic information we use for our task refers to the Pinyin version of the name in Chinese, without tone marks,5 because they are often removed when spelling Chinese names in an alphabetical language. We present an example data point in the form of (x, y, p) in Table 2.
3 Dataset Preparation
We experiment with two different English-to-Chinese datasets. For simplicity, we denote the one

5For example, the Pinyins chī, chí, chǐ and chì are all transformed to chi. Note that this process will decrease the vocabulary size.
taken from the NEWS Machine Transliteration Shared Task (Chen et al., 2018) as "NEWS," and the one extracted from the dictionary (Xinhua News Agency, 2007) as "DICT."
3.1 NEWS Dataset

We use the preprocessing script6 created by Grundkiewicz and Heafield (2018) to construct the NEWS dataset from the raw data provided in the Shared Task (Chen et al., 2018). This script merges the raw English-to-Chinese and Chinese-to-English datasets into a single one, then transforms it to uppercase7 and tokenizes all names into sequences of characters (words are treated as sentences, characters are treated as words). In addition, it takes 513 examples from the training data to form the internal development set and uses the official development set as the internal test set.

To make the final comparison, we download the source-side data of the official test set from the Shared Task's website,8 and submit the transliteration results (see Section 6.4).
3.2 DICT Dataset

The source dictionary contains approximately 680K name pairs for transliteration into Chinese from languages other than Chinese. We extracted 58,456 pairs that originated in English and performed the following preprocessing steps:

1. For the source side (English), we remove the inverted commas and white spaces from names that contain them (e.g. A'Court, Le Gresley).

2. For both sides, we lowercase9 all the words and tokenize them into sequences of characters.

3. Name pairs with multiple target transliterations are removed from the dataset and saved in a separate file for the construction of the ACT (see Section 3.3). As such, every name pair becomes unique in our preprocessed dataset. We randomly divide the rest in the ratio 8 : 1 : 1 to form training, development and test sets.
We report the final partitions of both datasets in Table 3.

6Available at https://github.com/snukky/news-translit-nmt.
7We lowercase all the words in both NEWS and DICT datasets as evaluating transliteration is case-insensitive.
8The official test set with task ID T-EnCh is available at: http://workshop.colips.org/news2018/dataset.html.
9Lowercasing does not affect Chinese characters as they are not alphabetical.
Source   Train    Dev     Test
NEWS     81,252   513     1,000
DICT     46,620   5,828   5,828

Table 3: Numbers of data points in the training, development and test sets of the NEWS and DICT datasets. Dev and Test for the NEWS dataset (first row) refer to the internal development and test sets, respectively.
3.3 Alternating Character Table
Chinese characters10 that sound alike can often replace each other in the transliteration of a name from another language. Unlike in an alphabetical language, where a similar pronunciation is bound to sub-words of various lengths, characters in Chinese have concrete and independent pronunciations. Thus, we can conveniently build the Alternating Character Table (ACT) with each row storing a list of interchangeable characters.

We construct the ACT based on the DICT dataset because it contains less noise after significant data cleansing. In total, 449 English names from the DICT dataset have more than one transliteration in Chinese. We purposely removed all these names from the DICT data during preprocessing so as to ensure that we are not using any knowledge from the test set. The final ACT contains 29 rows (see Appendix) and we use it with our adaptive evaluation metric (see Section 5).
3.4 Pinyin Conversion
In transliteration, the pronunciations of the Chinese characters are often unique (even for a polyphonic character, e.g. 什, that has more than one Pinyin, shí and shén, only shí is commonly used in transliteration). Therefore, we can directly transform each Chinese character into a unique Pinyin, thus forming the target data for the auxiliary task. The procedure is as follows: for each character y_t in the target name y, we use the Python package pypinyin11 to map y_t to the corresponding Pinyin (without the tone mark). The tool generates the most frequently used Pinyin for each y_t based on dictionary data. We then apply further manual correction to the Pinyins because the most frequent Pinyin is not necessarily the one used in transliteration.
10Limited to the set of characters (with size ≈1K out of 80K) commonly used in transliteration.
11Available at: https://github.com/mozillazg/python-pinyin. We use the lazy pinyin feature to generate Pinyins without tone marks.
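The conversion can be illustrated as follows. The character-to-Pinyin table below is a tiny hand-built stand-in for pypinyin's dictionary (the real system uses pypinyin's lazy pinyin feature plus manual correction), and the function name is ours:

```python
# Tiny hand-built stand-in for pypinyin's dictionary data; the actual
# system calls pypinyin and then applies manual correction.
PINYIN = {'凯': 'kai', '莉': 'li', '艾': 'ai', '米': 'mi'}

def to_pinyin(name):
    # map each target character to its toneless Pinyin
    return [PINYIN[ch] for ch in name]

# e.g. the target name of Table 2, 凯莉, maps to ['kai', 'li']
```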
Figure 1: Visualization of the Seq2MultiSeq model. The left half illustrates the components involved in the main task and the right half is for the auxiliary task. The shared part is the encoder, which consists of a source embedding layer and a stacked biRNN (top middle).
4 Model
Our model is intended to solve English-to-Chinese transliteration through joint supervised learning of source-to-target (main) and source-to-Pinyin (auxiliary) tasks. Training closely related tasks together can help the model learn information that is often ignored in single-task learning, thus obtaining a better representation in the shared layers (in our case, the encoder). Moreover, the auxiliary task implicitly provides the phonetic information that is not easily learned through the single main task given the characteristics of Chinese (see Section 1). Our model has a sequence-to-multiple-sequence (Seq2MultiSeq) architecture that contains a shared encoder and dual decoders. Between the encoder and each decoder is a bridge layer12 that transforms the encoder's final state into the decoder's initial state (see Figure 1).

12We call it a "bridge" because it connects the shared encoder to each decoder. It allows flexible choices of the hidden sizes of the encoder and decoder and serves as an intermediate "buffer" before passing the encoder's final state to each decoder.
The encoder has an embedding layer with dropout (Hinton et al., 2012), followed by a 2-layer biLSTM (Schuster and Paliwal, 1997). The bridge layer consists of a linear layer followed by a tanh activation. The shared encoder passes its final state to the main-task decoder and the auxiliary-task decoder via separate bridge layers. In each decoder, we use additive attention (Bahdanau et al., 2015) to compute the context vector (the weighted sum of the encoder outputs according to the attention scores), then concatenate it with the target embedding to form the input of the subsequent 2-layer feed-forward LSTM. The prediction is made by feeding the concatenation of the LSTM's output, the context vector and the target embedding into a linear layer followed by log-softmax.
Our model is expected to simultaneously maximize the conditional probabilities mentioned in Section 2. To achieve this goal, we use a linear combination of the main-task decoder's loss13 (negative log likelihood; l_y) and the auxiliary-task decoder's loss (l_p) as the model's objective function:

l_MTL = λ · l_y + (1 − λ) · l_p,

where the subscript MTL stands for multi-task learning and 0 < λ < 1. Note that for λ = 0 and λ = 1, it is equivalent to training on a single auxiliary task and a single main task, respectively. The whole system is implemented using the deep learning framework PyTorch (Paszke et al., 2019).14
5 Adaptive Evaluation Metrics
We evaluate the transliteration system using word accuracy (ACC) and its variants on the 1-best output:

ACC = (1/N) Σ_{(y, ŷ)} I_criterion(ŷ, y),

where N is the total number of test-set samples and I_criterion(ŷ, y) is an indicator function with value 1 if the prediction (top candidate) ŷ matches the reference y under a certain criterion. The simplest criterion is exact string match between ŷ and y. If the test set contains multiple target words for a single source word, we let the indicator be 1 if the prediction matches one of the references (Chen et al., 2018).
13We use nn.NLLLoss() from the PyTorch library.
14Available at https://pytorch.org/.
Source   Target (F)   Target (M)   MED
Mona     莫娜          莫纳          1
Colina   科莉娜        科利纳        2

Table 4: Examples of a single source name with more than one target transliteration, with (F) and (M) indicating female and male, respectively.
We use ACC and ACC+ to denote the original accuracy and its variant with multiple references.

The drawback of ACC is that it may underestimate the quality of the system because it neglects the possibility of having more than one transliteration for a given source name, as is the case for English-to-Chinese transliteration. For example, in Table 4, if the test set only includes Target (F) for a Source while the model predicts Target (M), ACC will mistakenly count it as wrong. Although ACC+ considers the alternatives appearing in the dataset, it is unrealistic to expect the dataset to contain all possible references. To resolve this issue, we propose a new variant of word accuracy specific to English-to-Chinese transliteration.
Based on the knowledge of a native Chinese speaker, we analyze the English-to-Chinese dataset and summarize the key observations for source names with multiple target transliterations as follows: the minimum edit distance (MED) between any two such target names is ≤ 2, and their lengths are the same; for any two such target names, distinct characters occur in the same positions, and they often indicate the gender of the name (see Table 4).
To use the ACT in accord with the above observations, we propose the following criterion for the accuracy indicator function (we refer to it as ACC-ACT). Let the subscript t denote the position of a character; then I_criterion(ŷ, y) = 1 if either MED(ŷ, y) = 0 (which covers all the cases for ACC) or the following conditions are met in order:

1. ŷ and y are of the same length, L;
2. MED(ŷ, y) ≤ 2 and distinct characters of ŷ and y must occur in the same position(s);
3. If ŷ_t ≠ y_t for 1 ≤ t ≤ L, replace ŷ_t by looking up the ACT; this condition is satisfied if any of the modified ŷ(s) matches y exactly.
There is no guarantee that characters that are interchangeable according to the ACT can replace each other in every scenario. But since we only apply
            Enc    Dec-M   Dec-A
Emb.  h     256    256     128
      δ     0.1    0.1     0.1
RNN   h     512    512     128
      δ     0.2    0.2     0.1

Table 5: Illustration of the model settings, where Emb. and RNN stand for the embedding layers and RNN units in each part (column) of the model; h and δ are the hidden size and dropout value, respectively. The column names (from left to right) stand for encoder, main-task decoder and auxiliary-task decoder.
substitution on the output predictions rather than the references, we are not manipulating the test set by creating any new instance. This new metric (ACC-ACT) will ensure cases like in Table 4 are captured without requiring extra data in the test set, thus giving a more reasonable estimate of the system's quality than both ACC and ACC+.
6 Experimental Setup
Recall from Section 4 that we use λ to denote the weighting of the two tasks we train. We set the single-main-task (λ = 1) and the single-auxiliary-task (λ = 0) models as the baselines, and compare the multi-task models of different weightings (λ ∈ {1/6, 1/4, 1/2, 2/3, 5/6, 8/9}) against them. We conduct experiments on both the NEWS and DICT datasets and select the best model for each of them to compare to the previous state of the art.
6.1 Model and Training Settings

The configurations of hidden sizes and dropout values of the embedding layers and RNN units are presented in Table 5. All RNN units are LSTMs and the number of layers is set to 2. Besides the bridge layer that transforms the encoder's final hidden state into the decoder's initial hidden state, we add another one to carry the final cell state required when using LSTMs (in total, we have 4 "bridges").

We use the Adam optimizer (Kingma and Ba, 2015) with the batch size set to 64. Evaluation on the development set is carried out every 500 batches. We record the validation score (ACC) and decrease the learning rate (initially set to 0.003) by 90% if the score does not surpass the previous best. We pick the final model that attains the highest validation score within 100 training epochs.
For decoding in the training phase, we apply teacher forcing (Williams and Zipser, 1989) with
         NEWS                               DICT
         Main                   Auxiliary   Main               Auxiliary
λ        ACC    ACC+   ACC-ACT  ACC         ACC    ACC-ACT     ACC
1        0.723  0.731  0.746    NA          0.725  0.748       NA
1/6      0.666  0.672  0.688    0.698       0.728  0.750       0.744
1/4      0.734  0.743  0.751    0.755       0.725  0.747       0.746
1/2      0.724  0.733  0.740    0.738       0.723  0.748       0.739
2/3      0.698  0.707  0.715    0.705       0.722  0.746       0.739
5/6      0.739  0.749  0.760    0.757       0.729  0.752       0.746
8/9      0.670  0.679  0.686    0.705       0.722  0.746       0.734
0        NA     NA     NA       0.743       NA     NA          0.743

Table 6: Experiment results on the NEWS internal test set and the DICT development set, where λ = 1 and λ = 0 are the baselines of the main task and auxiliary task, respectively. The maximum score in each metric is in bold.
Figure 2: The plots of main-task ACC against auxiliary-task ACC on the NEWS (left) and DICT (right) development sets. Colors indicate which multi-task model (by λ value) the evaluation points belong to. To highlight the dense regions, we set the minimum of the x-axis to 0.5 and 0.6 for the NEWS and DICT datasets, respectively.
the following empirical decay function:

tfr = max(1 − (10 + epoch × 1.5) / 50, 0.2),

where tfr refers to the teacher forcing ratio, i.e. the probability of feeding the true reference instead of the predicted token. We use beam search decoding with beam size 10 and length normalization (Wu et al., 2016) for evaluation.
6.2 Evaluation
We use ACC and ACC-ACT to evaluate the performance on the main task and ACC on the auxiliary task. Note that since the only data portion we have that contains multiple references for a given source word is the internal test set of the NEWS data, we apply ACC+ on this particular set exclusively.
6.3 Model Selection
In the experiments in this section, we tune λ on the NEWS internal test set and the DICT development set, and select the model with the highest ACC on the main task.

The experiment results in Table 6 show that λ = 5/6 yields the best models on both datasets. We observe a significant improvement against the baselines on NEWS but a less noticeable increase on DICT. Besides, the models are more sensitive to λ on NEWS than on DICT (with standard deviations of 0.03 and 0.003 on ACC, respectively).
Furthermore, we investigate the relationship between the main and the auxiliary tasks based on the evaluation points of the development set. In Figure 2, we observe a nearly total positive linear correlation between the main-task ACC and the auxiliary-task ACC, and this is further evident in the Pearson cor-
              Internal Test                        Official Test
              Main                    Auxiliary    Main
System        ACC    ACC+   ACC-ACT  ACC          ACC+
Baseline      0.724  0.733  0.742    0.736        NA
Multi-task    0.739  0.749  0.760    0.757        0.299
BiDeep        0.731  0.739  0.746    0.740        NA
BiDeep+       NA     0.765  NA       NA           0.304

Table 7: Experiment results on the NEWS internal test (official development) set and official test set, where "Baseline" refers to the single-task model and "BiDeep+" refers to the best system Grundkiewicz and Heafield (2018) submitted to the NEWS workshop; the corresponding scores are taken from their paper.
Table 7: Experiment results on the NEWS internal test (official
development) set and official test set, where“Baseline” refers to
the single-task model and “BiDeep+” refers to the best system
Grundkiewicz and Heafield(2018) submitted to the NEWS workshop, and
the corresponding scores are taken from their paper.
              Main              Auxiliary
System        ACC    ACC-ACT   ACC
Baseline      0.726  0.748     0.738
Multi-task    0.729  0.751     0.749
BiDeep        0.732  0.755     0.760

Table 8: Experiment results on the DICT test set, where Baseline refers to the single-task model.
User           ACC+         F-score
romang         0.3040 (1)   0.6791 (2)
Ours           0.2990 (2)   0.6799 (1)
saeednajafi    0.2820 (3)   0.6680 (3)
soumyadeep     0.2610 (4)   0.6603 (4)

Table 9: Table of the NEWS leaderboard (available at https://competitions.codalab.org/competitions/18905#results, accessed 19 June 2020). User "romang" refers to Grundkiewicz and Heafield (2018).
relation coefficients,15 which are 0.982 and 0.992 for NEWS and DICT, respectively. This means the multi-task model improves the performance on both tasks simultaneously.
6.4 Test-set Results and System Comparison
We submit our 1-best transliteration results on the NEWS official test set through the CodaLab link provided by the Shared Task's Committee and present the leaderboard partially in Table 9. Note that in addition to ACC+, the leaderboard also

15Computed by pearsonr() from the SciPy library, which is available at: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html.
records the mean F-score,16 on which we rank first. We report the test-set performance of our best multi-task model on NEWS in Table 7 and on DICT in Table 8, in comparison to the system built by Grundkiewicz and Heafield (2018). The baseline model of their work employs the RNN-based BiDeep17 architecture (Miceli Barone et al., 2017), which consists of 4 bidirectional alternating stacked encoders, each with a 2-layer transition RNN cell, and 4 stacked decoders with a base RNN of depth 2 and a higher RNN of depth 4 (Zhou et al., 2016; Pascanu et al., 2014; Wu et al., 2016). Besides, they strengthen the model by applying layer normalization (Ba et al., 2016), skip connections (Zhang et al., 2016) and parameter tying (Press and Wolf, 2017). We reproduce their model without changing any configurations in their paper (Grundkiewicz and Heafield, 2018), and train it on both tasks separately.
In Table 7, we can see that the multi-task model performs significantly better than both the single-task baseline and the BiDeep model in all metrics on NEWS. Note that the BiDeep model we reproduce achieves the same ACC+ as reported in the work of Grundkiewicz and Heafield (2018), and ACC+ is the only evaluation metric used in their paper. "BiDeep+" in the last row refers to the final system they submitted to the Shared Task, on which they adopted additional NMT techniques including ensemble modeling for re-ranking and synthetic data generated from back-translation (Sennrich et al., 2017). Our ACC+ score on

16The F-score metric measures the similarity between the target prediction and reference. Precision and Recall in this particular F-score are computed based on the length of the Longest Common Subsequence. See details in the NEWS whitepaper (Chen et al., 2018).
17Implemented with the Marian toolkit available at https://marian-nmt.github.io/docs/.
Source        Output (ST)   Output (MT)
ocallaghan    奥卡拉根       奥卡拉汉 ✓
holleran      霍尔伦         霍勒伦 ✓
ajemian       阿赫米安       阿杰米安 ✓

Table 10: Example outputs and the corresponding source words of our systems, where "ST" and "MT" refer to the "single-task" and "multi-task" models. The tick symbols indicate which outputs match the references.
the anonymized official test set is 0.299, which is slightly worse than their 0.304. However, we attain a better F-score (0.6799) than them (0.6791), as shown in Table 9. Moreover, our model has 22M parameters, which is much smaller than their baseline BiDeep with 133M parameters,18 and we do not apply as many NMT techniques as they did. Nevertheless, on the DICT test set, there is no prominent difference among the single-task baseline, multi-task and BiDeep models, possibly because the noise pattern in the DICT dataset is not complex enough to reflect the learning ability of these models.
7 Discussion
In our experiments, a system has ACC-ACT > ACC+ > ACC because both ACC-ACT and ACC+ consider the cases of ACC, but ACC-ACT can capture more acceptable transliterations. Despite a consistent ranking given by the three metrics, ACC-ACT reveals different information from ACC and ACC+. For example, in Table 6, the model of λ = 5/6 outperforms λ = 1/2 by 0.015 and 0.016 in ACC and ACC+, respectively, but the difference is 0.020 in ACC-ACT on the NEWS dataset. This suggests a more prominent gap between these two models. In contrast, looking at the same two rows but on the DICT dataset, ACC-ACT indicates a smaller gap (0.004) than ACC (0.006). If we conducted experiments on another dataset, the disagreement among the metrics might be significant enough to render an inconsistent ranking.
Furthermore, we present in Table 10 some typical examples in which the multi-task model generates better predictions than the single-task model. In the first

18We compute the size of our multi-task model by counting the number of trainable parameters extracted from model.parameters(). For the BiDeep model, we use the numpy package to load the model in .npz format and calculate the number of parameters via a simple for-loop.
example, the single-task model wrongly maps the sub-word ghan to 根 (emphasizing the character g) while the multi-task model correctly maps han to 汉. The erroneous grouping of the English characters also occurs in the second example, where the single-task model maps er to 尔 instead of, more reasonably, ler to 勒. Even in the third example, where both outputs are mismatched, the multi-task model predicts the character 杰, which is closer to the source sub-word je than the single-task model's 赫 in terms of pronunciation. Overall, it seems that the multi-task model can capture the source-word pronunciation better than the single-task one.
Still, the multi-task model does not consistently handle all names better than the single-task model, especially for exceptional names that do not have a regular transliteration. For instance, the name Fyleman is transliterated into 法伊尔曼, but the character 伊 does not have any source-word correspondence if we consider the pronunciation of the source name.
Finally, our model can be generalized to other transliteration tasks by replacing Pinyin with other phonetic representations, such as IPA for English and rōmaji for Japanese. In addition, ACC-ACT can be extended to alphabetical languages by, for instance, constructing an Alternating Sub-word Table which stores lists of interchangeable subsequences. Another possible direction for future work is to redesign the objective function by treating λ as a trainable parameter or by including correlation information (Papasarantopoulos et al., 2019).
8 Related Work
Previous work has demonstrated the effectiveness of applying MTL through the joint learning of various NLP tasks such as machine translation, syntactic parsing and dependency parsing (Luong et al., 2016; Dong et al., 2015; Li et al., 2014). Most of this work is underlain by a similar idea: to create a unified training setting for several tasks by sharing the core parameters. Besides, machine transliteration has a long history of using phonetic information, for example, by mapping a phrase to its pronunciation in the source language and then converting the sound to the target word (Knight and Graehl, 1997). There is also relevant work that uses both graphemes and phonemes to various extents for transliteration, such as the correspondence-based (Oh et al., 2006) and G2P-based (Le and Sadat, 2018) approaches. Our work is inspired by the intu-
itive understanding that pronunciation is essential for transliteration, and by the success of incorporating phonetic information, such as Pinyin (Jiang et al., 2009) and IPA (Salam et al., 2011), in model design.
9 Conclusion
We argue in this paper that language-specific features should be used when solving transliteration in a neural setting, and we exemplify a way of using phonetic information as transferred knowledge to improve a neural machine transliteration system. Our results demonstrate that the main transliteration task and the auxiliary phonetic task are indeed mutually beneficial in English-to-Chinese transliteration, and we discuss the possibility of applying this idea to other language pairs.
Acknowledgements
We thank the anonymous reviewers for their insightful feedback. We would also like to thank Zheng Zhao, Zhijiang Guo, Waylon Li and Pinzhen Chen for their help and comments.
References

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Nancy Chen, Rafael E. Banchs, Xiangyu Duan, Min Zhang, and Haizhou Li, editors. 2018. Proceedings of the Seventh Named Entities Workshop. Association for Computational Linguistics, Melbourne, Australia.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.

Roman Grundkiewicz and Kenneth Heafield. 2018. Neural machine translation techniques for named entity transliteration. In Proceedings of the Seventh Named Entities Workshop, pages 89–94, Melbourne, Australia. Association for Computational Linguistics.

Ulf Hermjakob, Kevin Knight, and Hal Daumé. 2008. Name translation in statistical machine translation - learning when to transliterate. In ACL.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv, abs/1207.0580.

Xue Jiang, Le Sun, and Dakun Zhang. 2009. A syllable-based name transliteration system. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 96–99, Suntec, Singapore. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Kevin Knight and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, ACL '98/EACL '98, pages 128–135, USA. Association for Computational Linguistics.

Ngoc Tan Le and Fatiha Sadat. 2018. Low-resource machine transliteration using recurrent neural networks of Asian languages. In Proceedings of the Seventh Named Entities Workshop, pages 95–100, Melbourne, Australia. Association for Computational Linguistics.

Zhenghua Li, Min Zhang, Wanxiang Che, Ting Liu, and Wenliang Chen. 2014. Joint optimization for Chinese POS tagging and dependency parsing. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22:274–286.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. CoRR, abs/1511.06114.

Yuval Marton and Imed Zitouni. 2014. Transliteration normalization for information extraction and machine translation. Journal of King Saud University - Computer and Information Sciences, 26(4):379–387. Special Issue on Arabic NLP.

Yuval Merhav and Stephen Ash. 2018. Design challenges in named entity transliteration. In Proceedings of the 27th International Conference on Computational Linguistics, pages 630–640, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 99–107, Copenhagen, Denmark. Association for Computational Linguistics.
Jong-Hoon Oh, Key-Sun Choi, and Hitoshi Isahara. 2006. A machine transliteration model based on correspondence between graphemes and phonemes. ACM Trans. Asian Lang. Inf. Process., 5:185–208.

Nikos Papasarantopoulos, Lea Frermann, Mirella Lapata, and Shay B. Cohen. 2019. Partners in crime: Multi-view sequential inference for movie understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2057–2067, Hong Kong, China. Association for Computational Linguistics.

Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to construct deep recurrent neural networks. CoRR, abs/1312.6026.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8026–8037. Curran Associates, Inc.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. ArXiv, abs/1608.05859.

Mihaela Rosca and Thomas Breuel. 2016. Sequence-to-sequence neural network models for transliteration. CoRR, abs/1610.09565.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098.

Khan Md. Anwarus Salam, Yamada Setsuo, and Tetsuro Nishino. 2011. Translating unknown words using WordNet and IPA-based-transliteration. 14th International Conference on Computer and Information Technology (ICCIT 2011), pages 481–486.

Mike Schuster and Kuldip Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673–2681.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nădejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Richard Sproat, Tao Tao, and ChengXiang Zhai. 2006. Named entity transliteration with comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 73–80, Sydney, Australia. Association for Computational Linguistics.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144.

Hongbing Xing, Hua Shu, and Ping Li. 2006. The acquisition of Chinese characters: Corpus analyses and connectionist simulations.

Xinhua News Agency. 2007. Names of the World's Peoples: a comprehensive dictionary of names in Roman-Chinese. China Translation & Publishing Corporation.

Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan Salakhutdinov, and Yoshua Bengio. 2016. Architectural complexity measures of recurrent neural networks. ArXiv, abs/1602.08210.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.
A Alternating Character Table in Full

Alternating Characters:
莉, 利, 里, 丽
弗, 夫, 芙
思, 斯, 丝
妮, 内, 娜, 纳, 尼
萨, 沙, 莎
亚, 娅
玛, 马, 穆
琳, 林
芭, 巴
茜, 西, 锡
萝, 罗
滕, 坦
莱, 来, 勒
代, 黛, 戴
瓦, 沃, 娃
吉, 姬, 基
雷, 蕾
薇, 维, 威
鲁, 卢, 露
塔, 特
尤, 于
安, 阿
菲, 费
纽, 努
范, 文
蒙, 莫
查, 恰
保, 葆
柯, 科

Table 11: The Alternating Character Table in full; each line lists one group of interchangeable characters.
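The groups in Table 11 induce an equivalence relation on characters: two transliterations that differ only in alternating characters can be treated as the same name. The sketch below illustrates one way this could be used, by normalising each character to a canonical group representative before comparing strings. This is a hypothetical illustration (the function names and the choice of exact-match-after-normalisation are assumptions, not the paper's released implementation).

```python
# A few groups from Table 11; the first character of each group is
# arbitrarily taken as the canonical representative.
ALTERNATING_GROUPS = [
    ["莉", "利", "里", "丽"],
    ["弗", "夫", "芙"],
    ["思", "斯", "丝"],
    ["妮", "内", "娜", "纳", "尼"],
]

# Map every character to its group's representative.
CANONICAL = {ch: group[0] for group in ALTERNATING_GROUPS for ch in group}

def normalise(name: str) -> str:
    """Replace each character with its group representative (or itself)."""
    return "".join(CANONICAL.get(ch, ch) for ch in name)

def same_up_to_alternation(a: str, b: str) -> bool:
    """True iff two transliterations differ only by alternating characters."""
    return normalise(a) == normalise(b)
```

Under this scheme, e.g. 艾丽斯 and 艾莉丝 count as the same transliteration, since 丽/莉 and 斯/丝 each fall in one group, while a substitution outside any group still counts as an error.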