Finding Translations for Low-Frequency Words in Comparable Corpora
Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev, Andrea Mulloni
ILP, University of Wolverhampton, UK
Contact email: [email protected]
Jan 17, 2016
Overview
Distributional Hypothesis and bilingual lexicon acquisition
The effect of data sparseness
Methods to model co-occurrence vectors of low-frequency words
Experimental evaluation
Conclusions
Distributional Hypothesis in the bilingual context
Words of different languages that appear in similar contexts are translationally equivalent
Acquisition of bilingual lexicons from comparable, rather than parallel, corpora
Bilingual comparable corpora: not translated texts, but texts with the same topic, size, and style of presentation
Advantages over parallel corpora:
Broad coverage
Easy domain portability
Virtually unlimited number of language pairs
Parallel corpora, by contrast, largely restore existing dictionaries
General approach
Comparable corpora in languages L1 and L2
Words to be aligned: N1 and N2
Extract co-occurrence data on N1 and N2 from the respective corpora: context vocabularies V1 and V2
Create co-occurrence matrices N1×V1 and N2×V2, each cell containing f(v,n) or p(v|n)
Create a translation matrix V1×V2 using a bilingual lexicon:
Equivalences between only the core vocabularies
Each cell encodes a translation probability
Used to map a vector from L1 into the vector space of L2
Words with the most similar vectors are taken to be equivalent (see the sketch below)
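A minimal sketch of the mapping step, assuming NumPy; all matrix values are toy numbers, and cosine similarity stands in for the similarity measure used later in the talk:

    import numpy as np

    # Co-occurrence matrix for source nouns: rows = N1 nouns, columns = V1
    # context words, cells = p(v|n) (rows sum to 1). Toy numbers.
    P1 = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.2, 0.7]])      # 2 source nouns x 3 source contexts

    # Translation matrix V1 x V2 from a bilingual lexicon: cell (i, j) is the
    # probability that source context word i translates as target word j.
    T = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 1.0]])            # 3 source contexts x 2 target contexts

    # Map source vectors into the target vector space: (N1 x V1) @ (V1 x V2).
    P1_mapped = P1 @ T

    # Target-language co-occurrence matrix: rows = N2 nouns, columns = V2.
    P2 = np.array([[0.75, 0.25],
                   [0.20, 0.80]])

    # Words with the most similar vectors are taken to be equivalent.
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    for i, v in enumerate(P1_mapped):
        best = max(range(len(P2)), key=lambda j: cosine(v, P2[j]))
        print(f"source noun {i} -> target noun {best}")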
Data sparseness
The approach is unreliable for all but very frequent words (e.g., Gaussier et al. 2004)
Polysemy and synonymy: many-to-many correspondences between the two vocabularies
Noise introduced during the translation between vector spaces
Data sparseness
[Figure: mean rank of the correct equivalent (y-axis, 0-180) by frequency rank (1-100 through 901-1000) for six language pairs: En-Fr, En-Ge, En-Sp, Fr-Ge, Fr-Sp, Ge-Sp.]
Dealing with data sparseness
How can one deal with data sparseness?
Various smoothing techniques exist: Good-Turing, Kneser-Ney, Katz's back-off
Previous comparative studies:
Class-based smoothing (Resnik 1993)
Web-based smoothing (Keller & Lapata 2003)
Distance-based averaging (Pereira et al. 1993; Dagan et al. 1999)
Distance-based averaging
The probability of an unseen co-occurrence, p*(v|n), is estimated from the known probabilities of N', a set of nearest neighbours of n:

p*(v|n) = (1/norm) · Σ_{n'∈N'} w(n,n') · p(v|n')

where w(n,n') is the weight with which n' influences the average, computed from the distance/similarity between n and n', and norm is a normalisation factor (the sum of the weights)
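A minimal sketch of the DBA estimate under these definitions; the probability table and the similarity-derived weight are toy stand-ins:

    def dba_estimate(v, n, p, neighbours, weight):
        """Estimate p*(v|n) from the known probabilities of n's nearest
        neighbours N', each weighted by its similarity to n."""
        norm = sum(weight(n, n2) for n2 in neighbours)   # normalisation factor
        return sum(weight(n, n2) * p(v, n2) for n2 in neighbours) / norm

    # Toy data: a p(v|n) table and a hypothetical similarity-based weight.
    probs = {("drink", "beer"): 0.4, ("drink", "wine"): 0.5}
    p = lambda v, n: probs.get((v, n), 0.0)
    weight = lambda n, n2: {"beer": 0.8, "wine": 0.6}[n2]   # stands in for sim(n, n')

    # "ale" is unseen with "drink"; estimate from its neighbours beer and wine.
    print(dba_estimate("drink", "ale", p, ["beer", "wine"], weight))  # ~0.443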
Adjusting probabilities for rare co-occurrences
DBA was originally used to predict unseen probabilities
We would like to predict unseen probabilities as well as adjust seen, but unreliable, ones:

p*(v|n) = (1 − γ) · p(v|n) + γ · (1/norm) · Σ_{n'∈N'} w(n,n') · p(v|n')

where 0 ≤ γ ≤ 1 is the degree to which the seen probability is smoothed with data on the neighbours
Problem: how does one estimate γ?
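Continuing the sketch above (reusing dba_estimate), the adjusted probability blends the corpus-attested estimate with the neighbour-based one, under the same reading of the reconstructed formula:

    def smoothed_prob(v, n, p, neighbours, weight, gamma):
        """p*(v|n) = (1 - gamma) * p(v|n) + gamma * DBA estimate,
        with gamma in [0, 1] the degree of smoothing."""
        dba = dba_estimate(v, n, p, neighbours, weight)
        return (1 - gamma) * p(v, n) + gamma * dba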
Heuristical estimation of γ
The less frequent n is, the more its probability is smoothed
Corpus counts are log-transformed to downplay the differences between frequent words

Performance-based estimation of γ
The exact relationship between the corpus frequency of n and γ is determined on held-out pairs:
The held-out data are split into frequency ranges
The mean rank of the correct equivalent in each range is computed
A function g(x) is interpolated along the mean rank points (see the sketch below)
g(n) is the predicted rank for n; RR is the random rank, the chance-level bound on the mean rank
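Illustrative sketches of both estimators; the slope and intercept, the range midpoints, the mean ranks, and the mapping from g(n) and RR to γ are all assumptions for illustration, not the paper's values:

    import numpy as np

    # Heuristical: gamma grows as log corpus frequency falls, so rarer
    # words are smoothed more; slope a and intercept b are made up.
    def gamma_heuristic(freq, a=0.25, b=1.0):
        return float(np.clip(b - a * np.log(freq + 1), 0.0, 1.0))

    # Performance-based: g(x), the mean rank of the correct equivalent, is
    # interpolated along the midpoints of the held-out frequency ranges.
    range_midpoints = np.array([50.0, 150.0, 250.0, 350.0, 450.0])  # toy ranges
    mean_ranks = np.array([140.0, 90.0, 60.0, 45.0, 40.0])          # toy ranks

    def g(freq):
        return float(np.interp(freq, range_midpoints, mean_ranks))

    RR = 500.0  # random rank: chance-level mean rank for the candidate set

    def gamma_performance(freq):
        # Assumed mapping (not stated on the slide): the closer the predicted
        # rank g(n) is to the random rank RR, the more smoothing is applied.
        return float(np.clip(g(freq) / RR, 0.0, 1.0))

    print(gamma_heuristic(5), gamma_performance(5))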
Smoothing functions
[Figure: γ (y-axis, 0-1) by frequency rank (1-100 through 901-1000) for En-Fr and Ge-Sp, under performance-based (perf) and heuristical (heur) estimation.]
Less frequent neighbours
Neighbours less frequent than n are removed, to avoid "diluting" corpus-attested probabilities with less reliable data
Experimental setup
6 language pairs: all pairwise combinations of English, French, German, and Spanish
Corpora and parsers:
EN: WSJ (87-89), Connexor FDG
FR: Le Monde (94-96), Xerox Xelda
GE: die Tageszeitung (87-89, 94-98), Versley
SP: EFE (94-95), Connexor FDG
Extracted verb-direct object pairs from each corpus
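For illustration, verb-direct object pairs can be extracted with a modern dependency parser such as spaCy (not one of the parsers used in the talk):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The company acquired a small firm and hired new staff.")

    # Collect (verb lemma, direct-object noun lemma) pairs.
    pairs = [(tok.head.lemma_, tok.lemma_)
             for tok in doc
             if tok.dep_ == "dobj" and tok.head.pos_ == "VERB"]
    print(pairs)   # [('acquire', 'firm'), ('hire', 'staff')]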
Experimental setup
Translation matrices:
Equivalents between verb synsets in EuroWordNet
Translation probabilities equally distributed among the different translations of a source word (see the sketch below)
Evaluation samples of noun pairs:
1000 pairs from EWN for each language pair
Sampled from equidistant positions in a frequency-sorted list
Divided into 10 frequency ranges
Each noun may have several translations in the sample (on average, 1.06 to 1.15 translations per noun)
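A sketch of building the V1×V2 translation matrix from a bilingual lexicon, with probability mass distributed equally among the translations of each source word; the toy dictionary below stands in for the EuroWordNet equivalences:

    import numpy as np

    V1 = ["boire", "manger"]               # source context verbs (French)
    V2 = ["drink", "eat", "consume"]       # target context verbs (English)
    lexicon = {"boire": ["drink"], "manger": ["eat", "consume"]}

    T = np.zeros((len(V1), len(V2)))
    for i, src in enumerate(V1):
        translations = lexicon.get(src, [])
        for tgt in translations:
            # Equal distribution over a source word's translations.
            T[i, V2.index(tgt)] = 1.0 / len(translations)

    print(T)   # [[1.  0.  0. ]
               #  [0.  0.5 0.5]]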
Experimental setup
Assignment algorithm:
Pairs each source noun with a correct target noun
Similarity measured using Jensen-Shannon divergence
The Kuhn-Munkres algorithm determines the optimal assignment over the entire set (see the sketch below)
Evaluation measure:
Mean rank of the correct equivalent
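A sketch of the assignment step using SciPy; linear_sum_assignment implements the Kuhn-Munkres (Hungarian) algorithm, and the vectors here are toy probability distributions:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import jensenshannon

    # Rows: mapped source-noun vectors; columns: target-noun vectors.
    S = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
    Tgt = np.array([[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]])

    # Cost matrix of pairwise Jensen-Shannon divergences
    # (jensenshannon returns the distance, i.e. the square root).
    cost = np.array([[jensenshannon(s, t) ** 2 for t in Tgt] for s in S])

    # Globally optimal one-to-one assignment over the whole set.
    rows, cols = linear_sum_assignment(cost)
    print(list(zip(rows, cols)))   # [(0, 1), (1, 0)]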
Baseline: no smoothing
[Figure: mean rank of the correct equivalent (y-axis, 0-180) by frequency rank (1-100 through 901-1000) for six language pairs: En-Fr, En-Ge, En-Sp, Fr-Ge, Fr-Sp, Ge-Sp.]
DBA: replace p(v|n) with p*(v|n)
Discarding less frequent neighbours: significant reduction of the mean rank for Fr-Ge, Fr-Sp, Ge-Sp
Heuristical estimation of γ: significant reduction of the mean rank for all language pairs
Performance-based estimation of γ: significant reduction of the mean rank for all language pairs
Relationship between k, frequency and Mean Rank
[Figure: mean rank (0-180) as a function of k, the number of nearest neighbours (0, 2, 5, 10, 20, 40, 60, 100, 140, 200, 300, 500, 800), across frequency ranges (100-900).]
Conclusions
Smoothing co-occurrence data on rare words using intra-language similarities improves retrieval of their translational equivalents
Extensions of DBA to smooth rare co-occurrences:
Heuristical (the amount of smoothing is a linear function of frequency)
Performance-based (the smoothing function is estimated on held-out data)
Both lead to considerable improvements:
up to a 48-rank reduction (from 146 to 99, 32%) in the low frequency ranges
up to a 27-rank reduction (from 81 to 54, 33%) overall