INTERNATIONAL JOURNAL OF TRANSLATION
Vol. 19, No. 1
Hierarchical Agglomerative Clustering for
Cross-Language Information Retrieval
RAYNER ALFRED¹, ELENA PASKALEVA², DIMITAR KAZAKOV¹, MARK BARTLETT¹
¹Computer Science Department, University of York, York, UK.
²Bulgarian Academy of Sciences, Sofia, Bulgaria.
ABSTRACT
In this article, we report on our work on applying hierarchical agglomerative
clustering (HAC) to a large corpus of documents where each appears both in
Bulgarian and English. We cluster these documents for each language and
compare the results both with respect to the shape of the tree and content of
clusters produced. Clustering multilingual corpora provides us with an insight
into the differences between languages when term frequency-based information
retrieval (IR) tools are used. It also allows one to use the natural language
processing (NLP) and IR tools in one language to implement IR for another
language. For instance, in this way, the most relevant articles to be translated
from language X to language Y can be selected after studying the clusters of
abstracts in language Y.
INTRODUCTION
Effective and efficient document clustering algorithms play an
important role in providing intuitive navigation and browsing
mechanisms by categorizing large amounts of information into a small
number of meaningful clusters. In particular, clustering algorithms build
illustrative and meaningful hierarchies out of large document
collections, and are ideal tools for their interactive visualization and
exploration, as they provide data views that are consistent, predictable
and contain multiple levels of granularity.
There has been a great deal of research on clustering text documents.
However, few experiments compare the results of clustering across
languages. It is also interesting to examine the impact
on clustering when we reduce the set of terms considered in the
clustering process to the set of the most descriptive terms taken from
each cluster. Using the reduced set of terms can be attractive for
several reasons. Firstly, clustering a corpus based on a set of reduced
terms can speed up the process. Secondly, with the reduced set of
terms, we can attempt to use a genetic algorithm to tune the weights of
terms to users’ needs, and subsequently classify unseen examples of
documents.
In this paper, we provide the results of clustering parallel corpora
of English-Bulgarian texts, looking at the similarities and differences in
three main areas: English-Bulgarian cluster mappings, English-
Bulgarian tree structures and the lists of terms that are the most
representative for each cluster in English and Bulgarian. Additionally,
the effect of term reduction on the cluster mappings and the application
of a genetic algorithm in tuning the clustering algorithm are examined.
We will first explain some of the background to (1) the vector
space model representation of documents, (2) the hierarchical
agglomerative clustering method, (3) genetic algorithms and (4) our
semi-supervised clustering technique. Next, we describe the
experimental design set-up and the experimental results and draw our
conclusions.
BACKGROUND
Vector Space Model Representation
In this work, we use the vector space model (Salton & Michael 1986),
in which a document is represented as a vector in an n-dimensional
space (where n is the number of different words in the collection of
documents). Here, documents are categorized by the words they contain
and their frequency. Before obtaining the weights for all the terms
extracted from these documents, stemming and stopword removal are
performed. Stopword removal eliminates unwanted terms (e.g., those
from the closed vocabulary) and thus reduces the number of dimensions
in the term-space. Once these two steps are completed, the frequency
of each term across the corpus is counted and weighted using term
frequency – inverse document frequency (tf-idf) (Salton & Michael
1986), as described in equation (1).
Weights are assigned to give an indication of the importance of a
word in characterizing a document as distinct from the rest of the
corpus. In summary, each document is viewed as a vector whose
dimensions correspond to words or terms extracted from the document.
The component magnitudes of the vector are the tf-idf weights of the
terms. In this model, tf-idf, as described in equation (1), is the product
of term frequency tf(t,d), which is the number of times term t occurs in
document d, and the inverse document frequency, equation (2), where
|D| is the number of documents in the complete collection and df(t) is
the number of documents in which term t occurs at least once. To
account for documents of different lengths, the length of each document
vector is normalized so that it is of unit length (van Rijsbergen 1979).
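As a minimal sketch of this weighting scheme (equations (1) and (2), with unit-length normalization), the following Python fragment computes tf-idf vectors from token lists; the corpus format (pre-stemmed, stopword-filtered tokens) is an illustrative assumption, not the paper's own code.

```python
# A minimal sketch of the tf-idf weighting of equations (1)-(2) with
# unit-length normalization; the corpus format is an assumption.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists, already stemmed and stopword-filtered."""
    n_docs = len(docs)
    # df(t): number of documents in which term t occurs at least once
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                    # tf(t,d): occurrences of t in d
        vec = {t: f * math.log10(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})  # unit length
    return vectors
```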
Hierarchical Agglomerative Clustering
In this work, we concentrate on hierarchical agglomerative clustering.
Unlike partitional clustering algorithms that build a hierarchical
solution from top to bottom, repeatedly splitting existing clusters,
agglomerative algorithms build the solution by initially assigning each
document to its own cluster and then repeatedly selecting and merging
pairs of clusters, to obtain a single all-inclusive cluster, generating the
cluster tree from leaves to root (Zhao & Karypis 2005). The main
parameters in agglomerative algorithms are the metric used to compute
the similarity of documents and the method used to determine the pair
of clusters to be merged at each step.

tf\text{-}idf(t, d) = tf(t, d) \cdot idf(t)   (1)

idf(t) = \log_{10} \frac{|D|}{df(t)}   (2)

sim(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}   (3)

Precision(C, L) = \frac{|C \cap L|}{|C|}   (4)

Purity = \frac{1}{|D|} \sum_{C \in C_{ALL}} \max_{L \in L_{ALL}} |C \cap L|   (5)

Precision(EBM) = \frac{|C(E) \cap C(B)|}{|C(E)|}   (6)

Precision(BEM) = \frac{|C(B) \cap C(E)|}{|C(B)|}   (7)
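As a small illustration of equation (3): for the unit-length vectors produced by the tf-idf sketch above (an assumption), cosine similarity reduces to a dot product over shared terms.

```python
# Equation (3) on unit-length tf-idf dictionaries: a plain dot product.
def cosine_sim(d_i, d_j):
    return sum(w * d_j[t] for t, w in d_i.items() if t in d_j)
```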
In these experiments, the cosine similarity, equation (3), is used to
compute the similarity between two documents d_i and d_j. This widely
used document similarity measure becomes one if the documents are
identical, and zero if they share no words. The two clusters to merge at
each step are found using the average link method. In this scheme, the
two clusters to merge are those with the greatest average similarity
between the documents in one cluster and those in the other. Given a set
of documents D, one can measure how consistent the results of
clustering are for each of the languages to which these documents are
translated in the following way. The clusters produced for one language
are used as a ‘gold standard’, a source of annotation assigning each
document in the set D a cluster label L from the list L_ALL of all clusters
for that language. Clustering in the other language is then carried out
and purity (Pantel & Lin 2002), equation (5), is used to compare each of
the resulting clusters C ∈ C_ALL to its closest match among all clusters
in L_ALL. (Precision is the probability of a document in cluster C being
labelled L; purity is the percentage of correctly clustered documents.)
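A sketch of this evaluation, assuming SciPy is acceptable: each language is clustered with average-link HAC over cosine distances, and the clusters of one language are scored against the other's as the 'gold standard' via purity, equations (4) and (5). The dense tf-idf matrices `X_en` and `X_bg` are illustrative placeholders.

```python
# Average-link HAC per language, then purity against the other language.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster(X, k):
    Z = linkage(X, method='average', metric='cosine')
    return fcluster(Z, t=k, criterion='maxclust')  # cut the tree into k clusters

def purity(pred, gold):
    # credit each predicted cluster with its best-matching gold label
    correct = sum(np.bincount(gold[pred == c]).max() for c in np.unique(pred))
    return correct / len(gold)

X_en = np.random.rand(200, 1000)     # placeholder English tf-idf matrix
X_bg = np.random.rand(200, 1500)     # placeholder Bulgarian tf-idf matrix
en_labels = cluster(X_en, k=10)
bg_labels = cluster(X_bg, k=10)
print(purity(bg_labels, en_labels))  # consistency of BG clusters vs EN 'gold'
```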
Genetic Algorithm
A Genetic Algorithm (GA) is a computational abstraction of biological
evolution that can be applied to optimization problems (Holland 1975;
Goldberg 1989). In its simplest form, a GA is an iterative process
applying a series of genetic operators such as selection, crossover and
mutation to a population of elements. These elements, called
chromosomes, represent possible solutions to the problem. Initially, a
random population is created, which represents different points in the
search space. An objective (fitness) function associated with each
chromosome represents the degree of goodness of that solution.
Based on the principle of the survival of the fittest, a few of the
chromosomes are selected and each is assigned a number of copies that
go into the mating pool. Biologically inspired operators like crossover
and mutation are applied on these strings to yield a new generation of
strings. The process of selection, crossover and mutation continues for a
fixed number of generations or until a termination condition is satisfied.
A more detailed survey of Genetic Algorithms can be found in (Filho et
al. 1994).
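The cycle just described can be summarized in a generic sketch; all operator functions here are placeholders (assumptions) to be instantiated for a concrete problem, and concrete versions for this paper's setting are sketched in the Experimental Design section below.

```python
# A generic sketch of the GA cycle: evaluate, select, recombine, mutate.
def genetic_algorithm(init_population, fitness, select, crossover, mutate,
                      n_generations=100):
    population = init_population()
    for _ in range(n_generations):
        scores = [fitness(c) for c in population]
        population = select(population, scores)   # survival of the fittest
        population = crossover(population)        # recombine selected pairs
        population = [mutate(c) for c in population]
    return max(population, key=fitness)           # best solution found
```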
Semi-Supervised Clustering Algorithm
As a base for our semi-supervised algorithm, we use an unsupervised
clustering method combined with a genetic algorithm incorporating a
measure of classification accuracy used in decision tree algorithms, the
GINI index (Breiman et al. 1984). Here, we examine a clustering
algorithm that minimizes an objective function defined over k cluster
centers; in our case, the objective combines cluster dispersion and
cluster purity. Before the clustering task, each term is assigned a
specific weight, normalized across all terms. The main objective is to
choose the weights for all terms that minimize the measures of cluster
dispersion and cluster impurity. In our GA, the fitness function is the
reciprocal of the objective function. A typical cluster dispersion metric
is the Davies-Bouldin Index (DBI) (Davies & Bouldin 1979), which uses
both the within-cluster and between-cluster distances to measure
cluster quality.
Let d_centroid(Q_k), defined in (8), denote the mean distance to the
centroid within cluster Q_k, where x_i ∈ Q_k, N_k is the number of
samples in cluster Q_k, c_k is the centre of the cluster, defined in (9),
and k ≤ K. Let d_between(Q_k, Q_l), defined in (10), denote the distance
between clusters Q_k and Q_l, where c_k is the centroid of cluster Q_k
and c_l is the centroid of cluster Q_l.
Therefore, given a partition of the N points into K clusters, DBI is
defined in (11). This cluster dispersion measure can be incorporated
into any clustering algorithm to evaluate a particular segmentation of
data. The Gini index (GI) has been used extensively in the literature to
determine the purity of a certain split in decision trees. Clustering using
K cluster centers partitions the input space into K regions. Therefore
clustering can be considered as a K-nary partition at a particular node in
a decision tree, and GI can be applied to determine the purity of such
partition (cluster purity). In this case, the GI of a certain cluster k is
computed as defined in (12), where n is the number of classes, P_kc is
the number of points belonging to the c-th class in cluster k and N_k is
the total number of points in cluster k.
d_{centroid}(Q_k) = \frac{\sum_{x_i \in Q_k} \|x_i - c_k\|}{N_k}   (8)

c_k = \frac{1}{N_k} \sum_{x_i \in Q_k} x_i   (9)

d_{between}(Q_k, Q_l) = \|c_k - c_l\|   (10)

DBI = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{d_{centroid}(Q_k) + d_{centroid}(Q_l)}{d_{between}(Q_k, Q_l)}   (11)

GiniC_k = 1.0 - \sum_{c=1}^{n} \left( \frac{P_{kc}}{N_k} \right)^2   (12)

impurity = \frac{\sum_{k=1}^{K} T_{C_k} \cdot GiniC_k}{N}   (13)

f(N, K) = \text{Cluster Dispersion} + \text{Cluster Impurity}   (14)

f(N, K) = DBI + \frac{\sum_{k=1}^{K} T_{C_k} \cdot GiniC_k}{N}   (15)
Equation (13) represents the impurity of a particular partitioning
into K clusters, where N is the number of points in the dataset and T_Ck
is the number of points in cluster k. The smaller this value, the better
the quality of the clustering. In order to obtain clusters of better
quality, we have to minimize the measure of impurity defined in (13). In
general, the objective function is defined in (14), and in our case, it is
computed in (15). By minimizing the objective function defined as the
sum of the cluster dispersion measure (DBI) and the cluster impurity
measure (represented by the second term in (15)), the algorithm
becomes semi-supervised. We use this expression to reflect the fact that
clustering, typically used as an unsupervised learning technique, has
now some of its parameters altered to produce results closer to a given
‘gold standard’. More specifically, given N points and K clusters, the
term weights are modified to minimize the objective function defined
in (15).
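The pieces of this objective can be sketched as follows, assuming NumPy; `X` is the document matrix, `assign` holds cluster indices 0..k-1 and `y` holds integer gold-standard labels, all illustrative names rather than the paper's implementation.

```python
# A sketch of the objective of equations (8)-(15): DBI plus Gini impurity.
import numpy as np

def objective(X, assign, y, k):
    centroids = [X[assign == i].mean(axis=0) for i in range(k)]        # (9)
    d_c = [np.linalg.norm(X[assign == i] - centroids[i], axis=1).mean()
           for i in range(k)]                                          # (8)
    dbi = np.mean([max((d_c[i] + d_c[j]) /
                       np.linalg.norm(centroids[i] - centroids[j])     # (10)
                       for j in range(k) if j != i)
                   for i in range(k)])                                 # (11)
    impurity = 0.0
    for i in range(k):
        labels = y[assign == i]
        p = np.bincount(labels) / len(labels)
        impurity += len(labels) * (1.0 - np.sum(p ** 2)) / len(y)  # (12),(13)
    return dbi + impurity                                              # (15)
```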
EXPERIMENTAL DESIGN
There are three main stages in this experiment. (I) In the first stage, we
perform the task of clustering parallel corpora of English-Bulgarian
texts. We look at the similarities and differences in three main areas:
English-Bulgarian cluster mappings, English vs Bulgarian tree
structures and the extracted most representative terms for English and
Bulgarian clusters. (II) Next, in the second stage, we cluster the
English texts based on the reduced set of terms and compare the results
with the previous clustering of the English texts using all terms. (III)
Finally, we apply the genetic algorithm to optimize the
weights of terms considered in clustering the English texts.
I. Clustering Parallel Corpora
In the first stage of the experiment, there are two parallel corpora (News
Briefs and Features), each in two different languages, English and
Bulgarian. In both corpora, each English document E corresponds to a
Bulgarian document B with the same content, see Table 1. It is worth
noting that the Bulgarian texts have a higher number of terms after
stemming and stopword removal.
Table 1. Statistics of the News Briefs and Features documents

Category (Num Docs)   Language    Total Words   Avg. Words   Different Terms
News briefs (1835)    English       279,758        152           8,456
                      Bulgarian     288,784        157          15,396
Features (2172)       English       936,795        431          16,866
                      Bulgarian     934,955        430          30,309
The process of stemming English corpora is relatively simple due
to the low inflectional variability of English. However, for
morphologically richer languages, such as Bulgarian, where the impact
of stemming is potentially greater, the process of building an accurate
algorithm becomes a more challenging task (Nakov 2003). In this
experiment, the Bulgarian texts are stemmed by the BulStem algorithm.
English documents are stemmed by a simple affix removal algorithm.
Figure 1 illustrates the experimental design set up for the first stage of
the experiment. The documents in each language are clustered
separately according to their categories (News Briefs or Features) using
hierarchical agglomerative clustering. The output of each run consists
of three elements: a list of terms characterizing the cluster, the cluster
members, and the cluster tree for each set of documents. The next
section contains a detailed comparison of the results for the two
languages looking at each of these elements.
Figure 1. Experimental set-up for the parallel clustering task.
II. Clustering Documents with a Set of Reduced Terms
In the next stage of the experiment, after clustering the English texts,
we examine the terms that characterize the clusters and extract them
into the set of terms used to cluster the English documents again. We
repeat the clustering process for the English texts using only the 10
(resp. 50) most descriptive terms taken from each of the k = 10 clusters,
so that the total number of terms t satisfies t ≤ 100 (resp. t ≤ 500),
since the same term may appear in more than one cluster.
Figure 2 illustrates the experimental design set up for the second stage
of the experiment, in which we repeat the clustering process with a
reduced set of terms and compare the results with the previous
clustering results.
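The term-reduction step can be sketched as follows: keep the union of the n highest-weighted terms of each cluster's summed tf-idf vector. The inputs (`vectors` as tf-idf dictionaries and `labels` from the clustering run) follow the earlier sketches and are assumptions.

```python
# Keep the union of the top-n terms per cluster (at most n * k terms).
from collections import Counter

def reduced_term_set(vectors, labels, k, n=10):
    kept = set()
    for i in range(1, k + 1):             # fcluster labels run from 1 to k
        centroid = Counter()
        for vec, lab in zip(vectors, labels):
            if lab == i:
                centroid.update(vec)      # sum tf-idf weights per term
        kept.update(t for t, _ in centroid.most_common(n))
    return kept   # fewer than n * k terms if clusters share terms
```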
III. A Semi-Supervised Clustering Technique Based on Reduced Terms
The last stage of the experiment uses a corpus where documents are
labeled with their target cluster ID. Clustering is then combined with a
genetic algorithm optimizing the weight of the terms so that clustering
matches as closely as possible the annotation provided. There are two
possible reasons for such an approach. Firstly, one can use the clusters
provided for some of the documents in language X as a cluster
membership annotation for the same documents in language Y. The
additional tuning the GA provides could help cluster the rest of the
documents in language Y in a way that more closely resembles the result
expected if the translation to language X were used. Secondly, experts,
such as professional reviewers, often produce clusters that are different
from the ones generated in an automated way. One can hope that some
of their expertise can be captured in the way some of the term weights
are modified, and reused subsequently when new documents from the
same domain are added for clustering. Here, we describe the
representation of the problem in the Genetic Algorithm setting.
Figure 2. Experimental set-up for clustering with the reduced set of
terms.
Population Initialization Step: A population of X strings of length m is
randomly generated, where m is the number of terms (i.e., the cardinality
of the reduced set of terms). Each string holds continuous values
representing the weights of the terms.
Fitness Computation: The computation of the fitness function has two
parts: cluster dispersion and cluster impurity. In order to obtain
clusters of better quality, we need to minimize the DBI, defined in (11).
On the other hand, in order to group objects of the same type together
in a cluster, we need to minimize the impurity function. Since a GA
maximizes its fitness function, the fitness function (OFF) that we want
to maximize is defined in (16).
OFF = \frac{1}{\text{Cluster Dispersion}} + \frac{1}{\text{Cluster Impurity}}
    = \frac{1}{DBI} + \frac{N}{\sum_{k=1}^{K} T_{C_k} \cdot GiniC_k}   (16)
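A sketch of this fitness computation follows. `cluster_with_weights`, `davies_bouldin` and `gini_impurity` are hypothetical helpers (the latter two as in equations (11) and (13), e.g. the `objective()` sketch above); `weights` is one chromosome of per-term weights.

```python
# Equation (16): reciprocals of dispersion and impurity after clustering
# in the weighted term space. Helper functions are assumptions.
def fitness(weights, X, y, k):
    assign = cluster_with_weights(X * weights, k)  # hypothetical helper
    return (1.0 / davies_bouldin(X * weights, assign, k)
            + 1.0 / gini_impurity(assign, y))
```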
Selection Process: For the selection process, a roulette wheel with slots
sized according to the fitness is used. The construction of such a
roulette wheel is as follows:
- Calculate the fitness value f_i, i ≤ X, for all chromosomes and obtain
the total fitness TFitness over all X chromosomes.
- Calculate the probability of selection p_i for each chromosome, i ≤ X:
p_i = f_i / TFitness.
- Calculate the cumulative probability q_i for each chromosome:
q_i = \sum_{j=1}^{i} p_j.
The selection process is based on spinning the roulette wheel X times;
each time we select a single chromosome for the new population in the
following way:
- Generate a random number r from the range [0..1].
- Select the i-th chromosome such that q_{i-1} < r ≤ q_i.
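A minimal sketch of this roulette-wheel selection, with slots sized proportionally to fitness, the cumulative probabilities q_i, and X spins:

```python
# Roulette-wheel selection: p_i = f_i / TFitness, then X spins.
import random

def roulette_select(population, fitnesses):
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]   # p_i = f_i / TFitness
    selected = []
    for _ in range(len(population)):         # spin the wheel X times
        r, q = random.random(), 0.0
        for chrom, p in zip(population, probs):
            q += p                           # cumulative probability q_i
            if r <= q:
                selected.append(chrom)
                break
        else:                                # guard against float rounding
            selected.append(population[-1])
    return selected
```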
Crossover: A pair of chromosomes, c_i and c_j, is chosen for applying
the crossover operator with probability p_c. In this experiment, we set
p_c = 0.25. This probability gives us the expected number p_c · X of
chromosomes that undergo the crossover operation. We proceed by:
- Generating a random number r from the range [0..1].
- Performing crossover if r < p_c. In this case, for each pair of
chromosomes we generate a random integer pos from the range [1..m-1]
(where m is the length of the chromosome), which indicates the position
of the crossing point (i.e., one-point crossover is used).
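A sketch of the one-point crossover step with p_c = 0.25, treating chromosomes as lists of term weights:

```python
# One-point crossover: swap tails after a random crossing point.
import random

def one_point_crossover(c_i, c_j, p_c=0.25):
    if random.random() < p_c:
        pos = random.randint(1, len(c_i) - 1)  # crossing point in [1..m-1]
        return c_i[:pos] + c_j[pos:], c_j[:pos] + c_i[pos:]
    return c_i, c_j
```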
Mutation: The mutation operator is applied on a bit-by-bit basis.
Another parameter of the genetic system, the probability of mutation
p_m, determines the expected number of mutated bits, equal to
p_m · m · X. In this experiment, we set p_m = 0.01. For each chromosome
and each bit within the chromosome, the mutation process:
- Generates a random number r from the range [0..1].
- Modifies (flips) the bit if r < p_m.
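A sketch of the mutation step with p_m = 0.01. The paper describes bit-flipping; since the chromosomes here hold continuous weights, one common adaptation (an assumption, not the paper's stated choice) is to resample a mutated gene uniformly:

```python
# Per-gene mutation with probability p_m; mutated genes are resampled.
import random

def mutate(chromosome, p_m=0.01):
    return [random.random() if random.random() < p_m else gene
            for gene in chromosome]
```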
As a result of selection, crossover and mutation, the next generation
of the population is produced. Its evaluation is used to build the
probability distribution for the construction of a roulette wheel with
slots sized according to the new fitness values. The rest of the
evolution is a cyclic repetition of selection, crossover, mutation and
evaluation until a specified number of generations has been completed
or a given threshold has been reached.
EXPERIMENTAL RESULTS
Clustering Parallel Corpora
Mapping of English-Bulgarian Cluster Membership

In the first experiment, every cluster in English is paired with the
Bulgarian cluster with which it shares the most documents. The same is
repeated in the direction of Bulgarian to English mapping. Two
precision values for each pair are then calculated, the precision of the
English-Bulgarian mapping (EBM) and that of the Bulgarian-English
mapping (BEM). Figures 3-8 show the precisions for the EBM and
BEM for the cluster pairings obtained with varying numbers of clusters,
k (k = 10, 20, 40) for each of the two domains, News Briefs and
Features. The X axis label indicates the ID of the cluster whose nearest
match in the other language is sought, while the Y axis indicates the
precision of the best match found. For example, in Figure 3, EN cluster
7 is best matched with BG cluster 6 with the EBM mapping precision
equal to 58.7% and BEM precision equal to 76.1%.
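A sketch of the EBM/BEM pairing of equations (6) and (7): each cluster is matched with the opposite-language cluster sharing the most documents, and a pair counts as 'aligned' when the two mappings agree. The inputs `en_clusters` and `bg_clusters` (dicts mapping cluster id to a set of document ids) are assumed names.

```python
# Best-match mappings in both directions, plus the aligned pairs.
def best_match(source, targets):
    pairs = {}
    for sid, s in source.items():
        tid, t = max(targets.items(), key=lambda kv: len(s & kv[1]))
        pairs[sid] = (tid, len(s & t) / len(s))  # (best match, precision)
    return pairs

ebm = best_match(en_clusters, bg_clusters)       # English -> Bulgarian
bem = best_match(bg_clusters, en_clusters)       # Bulgarian -> English
aligned = [(e, b) for e, (b, _) in ebm.items() if bem[b][0] == e]
```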
A final point of interest is the extent to which the EBM mapping
matches BEM. When this happens, that is, the best EBM match of BG
cluster X is EN cluster Y, and the best BEM match of EN cluster Y is
BG cluster X, we say the pair of clusters is aligned. Table 3 shows that
alignment between the two sets of clusters is 100% when k = 10 for
both domains, News Briefs and Features. However, as the number of
clusters increases, there are more clusters that are unaligned. This is
probably due to the fact that Bulgarian documents have a greater
number of distinct terms. As the Bulgarian language has more word
forms to describe English phrases, this may affect the computation of
weights for the terms during the clustering process.
Table 2. Purity for Cluster Mapping for English-Bulgarian Documents
Category k=5 k=10 k=15 k=20 k=40
News briefs 0.82 0.63 0.67 0.65 0.59
Features N/A 0.77 N/A 0.61 0.54
Table 3. Percentage Cluster Alignment
Category k = 10 k =20 k = 40
News briefs 100.0% 85.0% 82.5%
Features 100.0% 90.0% 80.0%
It is also possible to study the purity of the mappings. Table 2
indicates the purity of the English-Bulgarian document mapping for
various values of k. This measure is based only on the clusters that
have been aligned, so it is possible to have a case with high purity but
a relatively low number of aligned pairs.
Figure 3. Ten clusters, Features corpus.
Figure 4. Ten clusters, News Briefs corpus.
Figure 5. Twenty clusters, Features corpus.
Comparison of HAC Tree Structure
The cluster trees obtained for each language are reduced to a predefined
number of clusters (10, 20 or 40) and then the best match is found for
each of those clusters in both directions (EBM, BEM). Here, again, we
would only pair a Bulgarian cluster C_BG with an English cluster C_EN
if they are each other’s best match, that is, BEM maps C_BG to C_EN and
EBM maps C_EN to C_BG.
Figure 6. Twenty clusters, News Briefs corpus.
Figure 7. Forty clusters, News Briefs corpus.
Figure 8. Forty clusters, Features corpus.
The pair of cluster trees obtained for each corpus are compared by
first aligning the clusters produced, and then plotting the corresponding
tree for each language. Figure 9 and Figure 11 illustrate that when k =
10, all clusters can be paired, and the tree structures for both the English
and Bulgarian documents are identical (although distances between
clusters may vary). However, when k = 20, there are unpaired clusters
in both trees, and after the matched pairs are aligned, it is clear that the
two trees are different. We hypothesize that this may be a result of the
higher number of stems produced by the Bulgarian stemmer, which
demotes the importance of terms that would correspond to a single stem
in English.
Figure 9. Ten clusters, News Briefs corpus.
Figure 10. Twenty clusters, News Briefs corpus.
Figure 11. Ten clusters, Features corpus.
Comparison of Terms Extracted
The ten most representative terms that describe the matching English
and Bulgarian clusters have a similar meaning as illustrated in Tables 4
and 5. The only notable exception is listed in column 2 of Table 4,
where all top Bulgarian terms are related to the topic of ‘bird flu’,
whereas the English terms are split between this topic and that of
‘Olympic games’. This difference disappears when the number of
clusters is increased to 20 (and a consistent ‘bird flu’ 19EN/20BG pair of
clusters is formed).
Clustering Based on a Set of Reduced Terms
Having seen in the previous experiment that the most representative
words for each cluster are similar for each language, an interesting
question is whether clustering using only these words improves the
overall accuracy of alignment between the clusters in the two
languages. The intuition behind this is that, as the words characterizing
each cluster are so similar, removing most of the other words from
consideration may be more akin to filtering noise from the documents
than to losing information.
The clustering is rerun as before, but with only a subset of terms
used for the clustering. That is to say, before the tf-idf weights for each
document are calculated, the documents are filtered to remove all but n
of the terms from them. These n terms are determined by first obtaining
10 clusters for each language, and then extracting the top 10 (resp. 50)
terms which best characterize each cluster, with the total number of
terms equal to at most 10 × 10 = 100 (resp. 10 × 50 = 500). Four new
sets of clusters are thus created, one for each language and number of
terms considered. The results in the four cases are compared to each
other, and to the previously obtained sets of clusters for which the full
set of terms was used.
Figure 12. Twenty clusters, Features corpus.
The results of comparing clusters in English and Bulgarian are
shown in Table 6. These clearly indicate that as the number of terms
used in either language falls, the number of aligned pairs of clusters
also decreases. While term reduction in either language decreases the
matching between the clusters, the effect is fairly minimal for English
and far more pronounced for Bulgarian.
In order to explain this difference between the languages, it is
possible to repeat the process of aligning and calculating purity, but
using pairs of clusters from the same language, based on datasets with
different levels of term reduction. The results of this are summarized in
Table 7.
Table 4. Top ten terms for pairs of English and Bulgarian clusters