ORIGINAL ARTICLE
Semi-supervised clustering for gene-expression data in multiobjective optimization framework
Abhay Kumar Alok • Sriparna Saha •
Asif Ekbal
Received: 10 June 2014 / Accepted: 18 January 2015 / Published online: 15 February 2015
© Springer-Verlag Berlin Heidelberg 2015
Abstract Studying the patterns hidden in gene expression data helps in understanding the functionality of genes. But due to the large number of genes and the complexity of biological networks, it is difficult to study the resulting mass of data, which often consists of millions of measurements. Clustering techniques are applied to reveal natural structures and to identify interesting patterns in a given gene expression data set. Semi-supervised classification is a new direction of machine learning. It requires a large amount of unlabeled data and only a few labeled data, and in general performs better than unsupervised classification. But to the best of our knowledge there is no prior work on solving the gene expression data clustering problem using semi-supervised classification techniques. In the current paper we attempt to solve the gene expression data clustering problem using a multiobjective optimization based semi-supervised classification technique, with the aim of attaining good-quality partitions using only a few labeled data. In order to generate the labeled data, the Fuzzy C-means clustering technique is initially applied. In order to automatically determine the partitioning, multiple cluster centers corresponding to a cluster are encoded in the form of a string. In order to compute the quality of the obtained partitioning, the values of five objective functions are
computed. The effectiveness of this proposed semi-super-
vised clustering technique is demonstrated on five publicly
available benchmark gene expression data sets. Compari-
son results with the existing techniques for gene expression
data clustering prove that the proposed method is the most
effective one. Statistical and biological significance tests
have also been carried out.
Keywords Gene expression data clustering · Semi-supervised classification · Multiobjective optimization · Cluster validity index · AMOSA
1 Introduction
Due to the invention of DNA (deoxyribonucleic acid) microarray technology, it has become feasible to examine the expression levels of thousands of genes at a time during their different ongoing biological processes and across collections of related samples. Application areas of microarray technology include gene expression profiling, medical diagnosis, and bio-medicine [1, 22, 39]. Usually, gene expression values are measured at different time points during a biological experiment. A microarray gene expression data set is defined as a 2D matrix A = [f_ij] of size c × t, where c is the number of genes and t is the number of time points. Each element f_ij gives the expression level of the ith gene at the jth time point. To detect sets of genes exhibiting similar expression profiles, clustering or unsupervised learning [7, 33, 51] is in general used.
Clustering, also termed unsupervised learning, is the procedure of grouping data items into different partitions or clusters in such a way that data items belonging to the same group are similar to each other according to some similarity criterion, while items of different groups are dissimilar according to the same criterion [1]. In supervised classification, actual
class labels of some data points are available. These
A. K. Alok (✉) · S. Saha · A. Ekbal
Computer Science Engineering, Indian Institute of Technology,
Patna, India
e-mail: [email protected]
S. Saha
e-mail: [email protected]
A. Ekbal
e-mail: [email protected]
Int. J. Mach. Learn. & Cyber. (2017) 8:421–439
DOI 10.1007/s13042-015-0335-8
labeled data are used to build a model which is further used
to assign class labels to some unknown samples. The main
problem of supervised classification is generating the labeled data, which is both time consuming and expensive. In contrast, plenty of unlabeled data can be accessed easily. Unsupervised classification can use this abundance of unlabeled data to form a partitioning, which depends on the data distribution and its intrinsic properties. Semi-supervised classification lies halfway between supervised and unsupervised classification [3, 9, 11, 14, 15]. Here, in addition to
the unlabeled data, some amount of labeled data are also
available. The available labeled data helps to improve the
clustering result. Due to this improved performance, semi-
supervised classification has a large number of applications
in the field of pattern recognition, document classification,
data mining, information retrieval, image categorization,
and gene function classification. Literature studies show
that semi-supervised clustering is much more effective as
compared to unsupervised classification techniques [3, 9,
11, 14, 15]. As gene expression data clustering is an
important problem in the field of bioinformatics with
applications in diagnosis of diseases, it would be beneficial
to use some sophisticated techniques like semi-supervised
clustering for solving this problem. But the existing literature shows no prior effort in solving the gene expression data clustering problem using semi-supervised classification. Inspired by this, in the current paper we aim to solve the gene expression data clustering problem using a semi-supervised clustering technique, with the motivation of obtaining much improved results that will help in diagnosing some diseases more accurately. A new multiobjective optimization based semi-supervised classification technique is developed to solve the gene expression data clustering problem.
For the semi-supervised classification, some amount of
labeled data are required to fine-tune the obtained parti-
tioning based on the unlabeled data. In case of gene
expression data it is difficult to generate the labeled
information. In the current paper we have utilized a well-
known clustering technique, Fuzzy C-means [10], to gen-
erate some labeled data. Semi-supervised clustering has two different objectives: to determine a partitioning (i) for which some cluster quality measures are optimized, and (ii) which respects the class labels of the available labeled data. In order to simultaneously satisfy both the
objectives, the use of multiobjective optimization (MOO)
[8] is proposed. In order to represent clusters, center based
encoding is used. A single cluster is divided into several
non-overlapping hyperspherical sub-clusters and the cen-
ters of these sub-clusters are used to represent this cluster.
AMOSA (archived multiobjective simulated annealing
based technique) [8], a newly developed simulated
annealing based multiobjective optimization technique is
used as the underlying MOO technique. Here five objective
functions are used and simultaneously optimized by
AMOSA. The first four objective functions are internal cluster validity indices based on unsupervised properties of the data set. The last objective function is an external
cluster validity index which captures the compatibility of
the obtained partitioning with respect to the available
supervised information.
The performance of the multiobjective simulated
annealing based semi-supervised clustering technique
(Semi-GenClustMOO) has been tested on five well-known, publicly available real-life gene expression data sets, viz., Yeast sporulation, Arabidopsis thaliana, Rat CNS, Yeast cell cycle, and Human fibroblasts serum. The proposed Semi-GenClustMOO clustering technique is also compared with MO-fuzzy [51], MOGA
clustering [7], GenClustMOO [49], FCM [10], single
objective GA based clustering technique [41], Self
Organising Map (SOM) [59], Chinese Restaurant Cluster-
ing (CRC) [45], and Hierarchical average linkage cluster-
ing [62] techniques. Further, some statistical significance tests have also been performed to show the superiority of
the Semi-GenClustMOO clustering technique. We have
also conducted some biological significance tests to prove
that the formed clusters are biologically correct.
The current paper is unique in the following ways:
– To the best of our knowledge, this is the first
attempt where semi-supervised classification is applied
for solving the gene expression data clustering problem.
– A novel technique is devised to generate some labeled
data from a given unlabeled data set without taking
help of any human annotator. In general external
knowledge about a data set is generated by taking the
help of human annotators. In the current paper, we have
proposed a method which utilizes any popular fuzzy
clustering technique to generate the labeled data. Here
fuzzy C-means clustering is used for this very purpose.
– A multiobjective based approach is developed to solve
the semi-supervised classification problem. Here a new encoding strategy, the use of multiple centers corresponding to a particular cluster, is utilized to represent a particular partitioning.
– Five different objective functions capturing different
data properties are optimized simultaneously using the
search capability of a newly developed simulated
annealing based multiobjective optimization technique,
namely AMOSA (archived multiobjective simulated
annealing). The first four objective functions capture some unsupervised data properties. The last one checks the
compatibility of the obtained partitioning with respect
to the original class labels.
– Different mutation strategies are used to handle the new
encoding strategy.
– Results are shown on five benchmark gene expression
data sets. For all the cases, the proposed technique
outperforms seven existing techniques for gene expres-
sion data clustering. It also outperforms two recently
developed multiobjective clustering techniques for
gene expression data. This in turn shows the utility of
semi-supervised classification for solving the gene
expression data clustering problem.
– Obtained results are verified biologically and
statistically.
The rest of the paper is organized as follows. Section 2 discusses some existing techniques for gene expression data clustering and semi-supervised classification
techniques. The procedure of generating some labeled data
from the set of unlabeled data is illustrated in Sect. 3.
Section 4 describes the different steps of the proposed multiobjective based semi-supervised clustering
technique. Experimental results are discussed in detail in
Sect. 5 and finally Sect. 6 concludes the paper.
2 Background
In this section we discuss existing techniques
for gene expression data clustering and semi-supervised
classification.
2.1 Gene expression data clustering
Traditional approaches of genomic research concentrate on
local examination and collection of data for a single gene.
However, use of microarray technology enables monitoring
of expression levels of tens of thousands of genes simulta-
neously. There are mainly two different types of microarray
experiments [35]. These are cDNA microarray [53] and
oligonucleotide arrays (abbreviated oligo chip) [38].
During a microarray experiment, a large number of DNA
sequences (genes, cDNA clones, or co-expressed sequence
tags) are judged under multiple conditions. Examples of
conditions are time series during a biological process (e.g.,
the Yeast cell cycle) or a collection of different tissue
samples (e.g., normal versus cancerous tissues). In general
no distinction is made among DNA sequences, which are uniformly termed "genes". Similarly, all kinds of experimental conditions are termed "samples".
Microarray technology has been successfully applied for
solving problems from many areas like medical diagnosis,
bio-medicine, gene expression profiling, etc. In general, the
gene expression values during a biological experiment are
determined at different time points [42]. A microarray gene
expression data, consisting of x genes and y time points, is
generally arranged in a 2D matrix G = [g_ij] of size x × y. Each element g_ij gives the expression level of the ith gene at
the jth time point. In recent years, many new techniques
have been developed in the literature to deal with the gene
expression data [30, 56, 60, 67].
Some noise and missing values can be present in the
original gene expression matrix determined from the
scanning process. There may be some systematic variations
arising from the experimental procedure. Data pre-pro-
cessing is a necessary step before application of any clus-
tering technique. Missing values can be estimated using various methods [63]. After this, data normalization is done.
Thereafter any clustering technique can be applied on the
processed gene expression data.
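As a concrete illustration, the pre-processing pipeline just described can be sketched as follows. This is only a minimal sketch, not the paper's procedure: it assumes missing values are marked as `None`, imputes them with the gene's row mean, and then z-score normalizes each gene profile; the paper refers to more sophisticated estimation methods [63].

```python
import math

def preprocess(matrix):
    """Impute missing values (None) with the row mean, then z-score
    normalize each gene's expression profile across time points."""
    processed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        filled = [mean if v is None else v for v in row]
        # z-score normalization of the gene profile
        mu = sum(filled) / len(filled)
        sd = math.sqrt(sum((v - mu) ** 2 for v in filled) / len(filled))
        processed.append([(v - mu) / sd if sd > 0 else 0.0 for v in filled])
    return processed
```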
Clustering is a widely used microarray analysis tool.
The genes with similar expression profiles can be detected
by using clustering techniques. Clustering methods divide a
set of n objects into K partitions depending on some similarity/dissimilarity metric. Here the number of clusters, K, is not known a priori. Application of clustering
techniques aid in the understanding of gene function, gene
regulation, cellular processes, and subtypes of cells. Genes
with similar expression patterns (co-expressed genes)
should be put in a single cluster with similar cellular
functions. This helps to predict the functionality of some
unknown genes whose information was not previously
known [22, 61]. Regulatory motifs specific to each gene
cluster can be recognized by searching for
common DNA sequences at the promoter regions of genes
within the same cluster [35]. Cis-regulatory elements can
also be proposed based on this information [12, 61].
Hypotheses regarding the mechanism of the transcriptional
regulatory network can be revealed based on the inference
of regulation through the clustering of gene expression data
[21]. Finally, sub-cell types can be identified based on the clustering results on expression profiles. Such information was difficult to obtain using traditional morphology-based approaches [1].
Clustering methods have been widely used for the dis-
covery of cancer subtypes [1, 22, 39]. Many novel clus-
tering techniques have been developed by the
bioinformaticians which are suitable for clustering gene
expression data. But medical scientists prefer to use some
traditional clustering techniques for solving this particular
problem [4]. Using modern microarray technologies,
molecular signatures of cancer cells can be measured [58].
One of the most important and useful exploratory analyses of the
microarray data is to apply clustering techniques on the
cancer/patient samples (tissues) [58]. The basic goal is to
determine groups of samples sharing similar expression
patterns, which can help to discover novel cancer subtypes.
Clustering methods have been used extensively for solving
these kinds of problems. Several clustering techniques have
been proposed by bioinformaticians taking into account
different internal characteristics of gene expression data,
for example presence of noise and missing values, and
high-dimensional nature of data sets [13, 37, 58]. These
clustering techniques are tested on publicly available data from previously published clinical studies. In [18, 57] the authors used the K-means clustering
technique for solving the gene expression data clustering
problem. Recently, many new clustering algorithms [28,
29] have been proposed for solving the gene expression
data clustering problem, which can overcome the drawbacks of
K-means algorithm. Self organizing maps are used to
cluster gene expression data in [59]. These algorithms
perform better than K-means algorithm. Eisen et al. [22]
developed an agglomerative algorithm called UPGMA
(Unweighted Pair Group Method with Arithmetic Mean)
for gene expression data clustering and proposed a method
to graphically represent the clustered data set. Alon et al.
[2] devised an algorithm to partition the genes using a
divisive approach which is called the deterministic-
annealing algorithm (DAA) [47]. Eisen's method was very popular among biologists and has become the most widely used tool in gene expression data analysis [2, 22, 32]. However, these conventional agglomerative approaches suffer from the sensitivity of the hierarchical dendrogram structure to the data. High computational
complexity is another drawback of the hierarchical clus-
tering techniques. In [55], the authors used CLICK, a graph
theory based clustering technique, for clustering publicly
available gene expression data and compared their
approach with SOM based [59] and Eisen’s hierarchical
method [22]. In [34], a new clustering technique named
DHC (a density-based, hierarchical clustering method), is
proposed to identify the co-expressed gene groups from
gene expression data. Several model based clustering
approaches for gene expression data have been proposed in
[25, 27, 68] which provide some statistical frameworks to
model the gene expression data.
In recent years the problem of gene expression data
clustering is modeled as a multiobjective optimization
problem. Many multiobjective optimization based clustering
techniques are developed to solve this problem. In [51], a
simulated annealing based multiobjective fuzzy clustering
technique, MO-fuzzy [51], is developed for gene expression
data clustering. It again uses AMOSA as the underlying
optimization technique. But no supervised information is
used here to fine-tune the partitioning. In [7] a genetic
algorithm based multiobjective clustering technique, MOGA
clustering, is developed for gene expression data clustering.
It uses a genetic algorithm based multiobjective optimiza-
tion technique, NSGA-II, as the underlying optimization
strategy. In [44], a multiobjective based interactive cluster-
ing technique is developed for gene expression data clus-
tering. This approach interactively takes the input from the
human decision maker (DM) during execution and adap-
tively learns from that input to obtain the final set of validity
measures along with the final clustering result. In [20], a
Fuzzy C-means based multiobjective clustering technique is
developed for gene expression data clustering via optimi-
zation of multiple objectives. In [24], an algorithm for
cluster analysis that integrates aspects from cluster ensemble
and multiobjective clustering is developed. The algorithm is
developed using the concepts of a Pareto-based multiob-
jective genetic algorithm, with a special crossover operator,
which uses clustering validation measures as objective
functions. The approach is then applied on some gene
expression data sets. In [43], a multiobjective clustering
technique is first applied on the gene expression data.
Thereafter the solutions on the final Pareto front are com-
bined using the help of a post-processing technique using
support vector machine.
2.2 Semi-supervised learning
Semi-supervised classification is a new direction of pattern
recognition and machine learning [9, 11, 15]. It is popularly used in many real-life domains, such as text processing and bioinformatics, where sufficient labeled data are scarce. In case
of supervised learning the main goal is to build a classifier
which is further used to assign class labels to unknown
samples. Thus the main goal is prediction. But in case of
unsupervised learning, the major objective is to discover
the natural grouping from the data. Thus the main goal is to
determine a description. Semi-supervised learning utilizes
unlabeled data and a few labeled data for prediction. Here
the given data set can be divided into two parts (i) data
points for which actual class information is known, and (ii)
data points for which no class labels are known. The
supervised information can also be provided in other forms, such as must-link constraints (two points should belong to the same cluster) and cannot-link constraints (two points should not belong to the same cluster). Semi-supervised learning is, in general,
solved in two different ways: (i) unsupervised learning with additional labeled information as constraints. Here
the additional labeled information can be utilized while
initializing the cluster centers or during the assignment of
points to different clusters or during objective function
calculation. (ii) supervised learning with additional infor-
mation on the distribution of the examples. This interpre-
tation is more appropriate if the final aim is to predict the
class label of an unknown sample. But this is not applicable
if the number and the nature of classes are not known in
advance. In the current paper we have solved the semi-
supervised learning problem by using the first view. Here
we have devised an unsupervised classification technique
which also takes care of the available supervised
information.
In any semi-supervised classification technique [50], the
target is to satisfy two major objectives: the obtained partitioning of points into different clusters should be of good quality, and there should be no violation of the available supervised information. In order to measure these two
properties, two different types of cluster validity indices are
used: internal and external. Internal validity index utilizes
intrinsic properties of data items while external validity
index utilizes the supervised information given in the form
of class labels of data items. Different internal cluster
validity indices [5, 40, 52, 66] exist in the literature. Most
of the internal validity indices try to minimize the cluster
compactness and maximize the separation between clus-
ters. There are different ways of computing the cluster
compactness and separation. In order to capture clusters
having different shapes like hyperspherical shapes, sym-
metrical shapes, overlapping structures, different internal
cluster validity indices are used. In the current paper as the
internal cluster validity indices four different objective
functions are used. The first index is a Euclidean distance
based cluster validity index, I-index [40]. This utilizes the
maximum distance between any two cluster centers as the
separation between clusters. Sym-index [5] is the second
cluster validity index used in the current paper which uses a
newly developed point symmetry based distance for the
computation of cluster compactness. The next two cluster
validity indices, XB-index [66] and FCM-index [10], utilize
the concepts related to fuzzy logic in order to compute the
cluster compactness. These two indices try to obtain
overlapping structures from a given data set. As the
external cluster validity index, adjusted rand index (ARI)
[52] is used.
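Since ARI [52] is the external validity index used here, a from-scratch sketch may make it concrete. The function below computes ARI from two flat label vectors via the standard pair-counting formula; `math.comb` (Python 3.8+) supplies the binomial coefficients. The function name is ours, not from the paper.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat label assignments."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_comb = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # chance agreement
    max_index = (sum_rows + sum_cols) / 2
    return (sum_comb - expected) / (max_index - expected)
```

ARI is invariant to cluster relabeling: a partitioning identical to the reference up to a permutation of labels scores 1.0, while chance-level agreement scores near 0.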
2.3 Multiobjective optimization
As semi-supervised clustering requires optimization of
more than one objective function, the use of multiobjective
optimization (MOO) is required to solve such problems.
MOO has a different perspective compared to single
objective optimization (SOO). In SOO we need to optimize
a single objective function but in case of MOO we need to
optimize more than one objective function. SOO provides
a single solution as the final solution and MOO provides a
set of solutions on the final Pareto optimal front. All the
solutions produced by some MOO based technique are
equally important and non-dominated with respect to each other.
The concept of domination is an important aspect of
MOO. In the case of maximization of objectives, a solution x_i is said to dominate x_j if ∀k ∈ {1, 2, ..., O}: OB_k(x_i) ≥ OB_k(x_j), and ∃k ∈ {1, 2, ..., O} such that OB_k(x_i) > OB_k(x_j). Among a set of solutions SOL, the non-dominated set of solutions SOL′ are those which are not dominated by any
member of the set SOL. The non-dominated set of the
entire search space S is called the globally Pareto-optimal
set or Pareto front. In general, a MOO algorithm outputs a
set of solutions not dominated by any solution encountered
by it.
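The domination test above translates directly into code. The sketch below assumes all objectives are to be maximized and that solutions are given as equal-length objective vectors; the function name is ours, not from the paper.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives maximized):
    a is no worse than b in every objective and strictly better in at
    least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))
```

For example, `dominates((2, 3), (1, 3))` holds, while `(2, 1)` and `(1, 3)` are non-dominating with respect to each other.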
3 Generation of labeled data for semi-supervised
clustering
In the current paper we have solved the gene expression
data clustering problem using semi-supervised classifica-
tion techniques. In case of semi-supervised classification,
in addition to the unlabeled data some amount of labeled
data are also available. But for gene expression data, it is
difficult to generate the labeled information. In the current
paper we have proposed a way of generating the labeled
data without taking help from any human annotator. Generating labeled data with the help of a human annotator is both time consuming and costly. In the current paper, we have initially executed the popular Fuzzy C-means (FCM) clustering technique [10] to partition the available gene expression data
sets. The points which attain highest membership values
with respect to a given cluster are selected as the available
labeled data. Note that points with highest membership
values are the most certain points within a particular
cluster. Highest values of membership function signify that
those points are not the boundary points but the core points
of a given cluster. Instead of Fuzzy C-means we could have
used any other clustering techniques. But as Fuzzy
C-means quantifies the membership values and is very
popular in the field of pattern recognition, we have opted
for Fuzzy C-means.
Fuzzy C-means (FCM) [10] is a very popular clustering technique, widely used in pattern recognition, which incorporates the notion of fuzzy membership: a single data point may belong to two or more clusters. FCM is based on a single objective function (given below) which should be minimized:

J_m = Σ_{j=1}^{N} Σ_{c=1}^{C} u_{c,j}^m D²(z_c, x_j),   1 ≤ m ≤ ∞    (1)
Here, N represents the number of genes, C is the total number of clusters, u_{c,j} denotes the fuzzy membership of the jth gene in the cth cluster, and m represents the fuzzy exponent. Further, x_j denotes the jth gene, z_c is the centre of the cth cluster, and the distance of gene x_j from the cluster center z_c is denoted by D(z_c, x_j).
Initially, the FCM algorithm starts by randomly picking C cluster centers. Then, in every iteration, it evaluates the fuzzy membership of each gene with respect to every cluster according to the following equation:

u_{c,i} = (1 / D(z_c, x_i))^{1/(m−1)} / Σ_{j=1}^{C} (1 / D(z_j, x_i))^{1/(m−1)},   1 ≤ c ≤ C, 1 ≤ i ≤ N    (2)
where D(z_c, x_i) and D(z_j, x_i) represent the distances between x_i and z_c, and between x_i and z_j, respectively. After the fuzzy membership of each gene has been evaluated, the cluster centers are updated with the help of the following equation:

z_c = Σ_{i=1}^{N} u_{c,i}^m x_i / Σ_{i=1}^{N} u_{c,i}^m,   1 ≤ c ≤ C    (3)
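Equations (2) and (3) amount to one alternating update step, which can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance for D and guards against zero distances with a small epsilon.

```python
import math

def fcm_step(data, centers, m=2.0):
    """One FCM iteration: membership update (Eq. 2), then center
    update (Eq. 3). data and centers are lists of coordinate tuples."""
    C, N = len(centers), len(data)
    eps = 1e-12  # avoid division by zero when a point sits on a center
    dist = [[max(math.dist(z, x), eps) for x in data] for z in centers]
    exp = 1.0 / (m - 1.0)
    # Eq. (2): u_{c,i} proportional to (1 / D(z_c, x_i))^{1/(m-1)}
    u = [[(1.0 / dist[c][i]) ** exp /
          sum((1.0 / dist[j][i]) ** exp for j in range(C))
          for i in range(N)] for c in range(C)]
    # Eq. (3): each center is the membership-weighted mean of all points
    new_centers = []
    for c in range(C):
        w = [u[c][i] ** m for i in range(N)]
        tot = sum(w)
        new_centers.append(tuple(
            sum(w[i] * x[d] for i, x in enumerate(data)) / tot
            for d in range(len(data[0]))))
    return u, new_centers
```

Iterating `fcm_step` until the centers stop moving reproduces the convergence loop described in the text.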
The two steps mentioned above, evaluation of fuzzy memberships and re-computation of cluster centers, are repeated until the cluster centers no longer change. Final membership values are obtained considering each cluster individually. We sort all the points of an individual cluster c based on their membership values, select the top 10 % of points with the highest membership values with respect to that cluster center, and assign cluster label c to them. Note that here 10 % of the points from each cluster are selected based on the membership values. Points with the highest membership values are the most certain points within that cluster, while points with membership values below 0.5 are the most uncertain.
Fuzzy C-means is a widely used technique for data clustering. The literature shows that, for data sets with overlapping clusters, points with the highest membership values with respect to a particular cluster are the most central points of that cluster. Thus the class label information of those points is expected to be the most reliable. This information is used as the supervised information in the proposed semi-supervised clustering technique.
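The labeled-data generation step described above can be sketched as follows. The helper below is hypothetical (not from the paper): given a final FCM membership matrix, it assigns each point to its maximum-membership cluster and returns the top fraction (10 % by default) of most certain points per cluster as labeled data.

```python
def select_labeled(memberships, fraction=0.1):
    """memberships[c][i] is the membership of point i in cluster c.
    Returns {point_index: cluster_label} for the most certain
    (highest-membership) `fraction` of points of each cluster."""
    C = len(memberships)
    N = len(memberships[0])
    labeled = {}
    for c in range(C):
        # points whose maximum membership is attained in cluster c
        members = [i for i in range(N)
                   if max(range(C), key=lambda k: memberships[k][i]) == c]
        # most certain (core) points first
        members.sort(key=lambda i: memberships[c][i], reverse=True)
        take = max(1, int(fraction * len(members)))
        for i in members[:take]:
            labeled[i] = c
    return labeled
```

With two clusters of two points each and `fraction=0.1`, the single most certain point of each cluster is labeled.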
4 Semi-supervised multiobjective clustering algorithm: Semi-GenClustMOO
In this paper we have proposed a multiobjective based
solution for solving the semi-supervised clustering prob-
lem. The proposed algorithm, Semi-GenClustMOO, is a
generalized multiobjective framework to solve the semi-
supervised clustering problem. It can be associated with
any multiobjective optimization technique. In the current
paper we have used the archived multiobjective simulated
annealing based technique, AMOSA [8] as the underlying
optimization algorithm. Semi-GenClustMOO is an extended version of GenClustMOO [49] (a multi-center based multiobjective clustering technique) that handles semi-supervised clustering. A single cluster is represented by "multiple centers" in the form of a string. We assume that each cluster consists of several non-overlapping hyperspherical sub-clusters. The centers of these sub-clusters are then encoded in a state to represent a particular cluster.
Here, to optimize the five objective functions simulta-
neously, a newly developed simulated annealing based
optimization technique, AMOSA [8], is used. The first four objective functions quantify some unsupervised properties and the last one quantifies the supervised information. We assume supervised information is available for only a few of
the data items. The flowchart of the Semi-GenClustMOO
algorithm is shown in Fig. 1.
4.1 The SA based MOO algorithm: AMOSA
Here, AMOSA [8], the archived multiobjective simulated annealing based technique, is used as the underlying optimization strategy; it generalizes the probabilistic metaheuristic simulated annealing (SA) algorithm to multiobjective optimization (MOO). Simulated annealing is a search technique for
solving difficult optimization problems, which is based on
the principles of statistical mechanics [36]. SA can not only
replace exhaustive search to save time and resource, but
also converge to the global optimum if annealed suffi-
ciently slowly [26]. Although the single objective version
of SA has been quite popular, its utility in the multiob-
jective case was limited because of its search-from-a-point
nature. Recently Bandyopadhyay et al. developed an effi-
cient multiobjective version of SA called AMOSA [8] that
overcomes this limitation.
In AMOSA (archived multiobjective simulated anneal-
ing) [8], which is a multiobjective version of SA, several
concepts have been newly integrated. It utilizes the concept
of an archive as used in [69] where the non-dominated
solutions seen so far are stored. Two limits are kept on the
size of the archive: a hard or strict limit denoted by HL, and
a larger, soft limit denoted by SL, where SL > HL. The
non-dominated solutions are stored in the archive as and
when they are generated. In the process, if some members
of the archive get dominated by the new solutions, then
these are removed. If at some point of time, the size of the
archive exceeds a specified value, then the clustering pro-
cess, described below, is invoked.
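The archive bookkeeping just described can be sketched as below. This is a simplified illustration under the domination definition of Sect. 2.3 (objectives maximized); where AMOSA applies single-linkage clustering to shrink the archive from SL to HL, the sketch simply removes one point of the closest pair in objective space, which is only a crude stand-in.

```python
import math

def dominates(a, b):
    """a dominates b: no worse in every (maximized) objective,
    strictly better in at least one (Sect. 2.3)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def update_archive(archive, new_pt, soft_limit, hard_limit):
    """Insert new_pt into the archive of non-dominated solutions.
    If the soft limit SL is crossed, thin the archive down to the
    hard limit HL (crude stand-in for single-linkage clustering)."""
    if any(dominates(a, new_pt) for a in archive):
        return archive  # new_pt is dominated: archive unchanged
    # drop members dominated by the new solution, then add it
    archive = [a for a in archive if not dominates(new_pt, a)]
    archive.append(new_pt)
    if len(archive) > soft_limit:
        while len(archive) > hard_limit:
            # remove one point of the closest pair in objective space
            pairs = [(math.dist(archive[i], archive[j]), i)
                     for i in range(len(archive))
                     for j in range(i + 1, len(archive))]
            archive.pop(min(pairs)[1])
    return archive
```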
In AMOSA, the initial temperature is set to Tmax. Then,
one of the points is randomly selected from the archive.
This is taken as the current-pt, or the initial solution. The
current-pt is perturbed to generate a new solution called
new-pt, and its objective functions are computed. The
domination status of the new-pt is checked with respect to
the current-pt and the solutions in the archive. A new
quantity called the amount of domination, Δdom(a, b), between two solutions a and b is defined as follows:

Δdom_{a,b} = Π_{i=1, f_i(a) ≠ f_i(b)}^{M}  |f_i(a) − f_i(b)| / R_i    (4)

where f_i(a) and f_i(b) are the ith objective values of the two solutions and R_i is the corresponding range of the ith objective function computed from the individuals in the population.
M is the number of objectives. Based on the domination
status between the new-pt, current-pt and the points in the
archive, different cases may arise. These are discussed in detail in [8], and are briefly summarized here for completeness.
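As an illustration of Eq. (4), the amount of domination can be computed as below. This is our own sketch, not the authors' code; the function name, objective vectors and ranges are made-up values.

```python
def amount_of_domination(a, b, ranges):
    """Delta_dom(a, b) of Eq. (4): the product of range-normalized
    absolute objective differences, skipping objectives where a and b tie."""
    prod = 1.0
    for fa, fb, r in zip(a, b, ranges):
        if fa != fb:
            prod *= abs(fa - fb) / r
    return prod

# Hypothetical two-objective solutions and per-objective ranges.
a = [0.2, 0.5]
b = [0.4, 0.9]
ranges = [1.0, 2.0]
print(amount_of_domination(a, b, ranges))  # (0.2/1.0) * (0.4/2.0) ~= 0.04
```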
Case 1: new-pt is either dominated by the current-pt or it
is nondominating with respect to the current-pt, but some
points in the archive dominate the new-pt. Suppose the new-pt is dominated by k points of the archive, in addition to the current-pt. This case is demonstrated in
Fig. 2 (the points D, E, F, G and H in the figure signify the
content of the archive at any instant, while the other points
illustrate different cases that may arise with respect to the
archive) where F represents the current-pt and B represents
the new-pt. Then a quantity Δdom_avg is computed as

\[
\Delta dom_{avg} = \frac{\sum_{i=1}^{k} \Delta dom_{i,\,new\text{-}pt} + \Delta dom_{current\text{-}pt,\,new\text{-}pt}}{k+1}.
\]

The new-pt is accepted as current-pt with a probability

\[
p_{qs} = \frac{1}{1 + e^{\Delta dom_{avg}/T}}. \qquad (5)
\]
Note that Δdom_avg denotes the average amount of domination of the new-pt by the (k + 1) points, namely the current-pt and the k points of the archive. Also, as k increases, Δdom_avg will increase, since dominating points that are farther away from the new-pt then contribute to its value.
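The Case 1 acceptance rule of Eq. (5) can be sketched as below. This is an illustrative re-implementation: the function name, temperature and domination amounts are made up, and the exponent Δdom_avg/T follows the standard simulated annealing acceptance form.

```python
import math
import random

def accept_case1(ddom_values, T, rng=random.random):
    """Case 1 of AMOSA: the new-pt is dominated by (k+1) points.
    ddom_values holds the amounts of domination of new-pt by the
    current-pt and the k dominating archive points."""
    ddom_avg = sum(ddom_values) / len(ddom_values)
    p = 1.0 / (1.0 + math.exp(ddom_avg / T))  # Eq. (5)
    return rng() < p, p

# Made-up domination amounts at temperature T = 1.0.
accepted, p = accept_case1([0.04, 0.10, 0.16], T=1.0)
print(round(p, 4))  # 0.475: slightly below 0.5, since the move is dominated
```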
Fig. 1 Working principle of the Semi-GenClustMOO algorithm
Fig. 2 Pareto-optimal front and different domination examples
Case 2: Neither the current-pt nor the points in the
archive dominate the new-pt. This can be demonstrated
with different examples shown in Fig. 2, e.g., F represents
the current-pt and E represents the new-pt, G represents the
current-pt and I represents the new-pt, F represents the
current-pt and I represents the new-pt. For all these cases,
accept the new-pt as the current-pt. If there are any points
in the archive which are dominated by new-pt, remove
them from the archive. Add the new-pt to the archive. If the archive size crosses SL, apply single linkage clustering to reduce its size to HL.
Case 3: new-pt dominates the current-pt but k points in
the archive dominate the new-pt. This case can be dem-
onstrated using Fig. 2 where A represents the current-pt
and B represents the new-pt. Here the minimum of the differences of the domination amounts between the new-pt and the k archive points, denoted by Δdom_min, is computed. The point from the archive that corresponds to this minimum difference is selected as the current-pt with probability

\[
prob = \frac{1}{1 + \exp(-\Delta dom_{min})}. \qquad (6)
\]
Otherwise the new-pt is selected as the current-pt. This
may be considered as an informed reseeding of the
annealer only if the archive point is accepted.
The above process is repeated iter times for each temperature (temp). The temperature is reduced to α × temp using the cooling rate α, until the minimum temperature Tmin is attained. The process then stops, and the archive contains the final non-dominated solutions.
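The outer annealing loop (geometric cooling by rate α from Tmax down to Tmin, with iter moves evaluated per temperature) can be sketched as a temperature schedule. The generator below is our own illustration; the default values mirror the parameter settings reported later in Sect. 5.3.

```python
def anneal_schedule(t_max=100.0, t_min=1e-5, alpha=0.8):
    """Yield the annealer's temperature sequence: start at Tmax and
    multiply by the cooling rate alpha until Tmin is reached.
    At each yielded temperature, iter candidate moves would be evaluated."""
    t = t_max
    while t >= t_min:
        yield t
        t *= alpha

temps = list(anneal_schedule())
print(len(temps), temps[0])  # 73 temperature levels, starting at 100.0
```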
It has been demonstrated in [8] that the performance of
AMOSA is better than that of NSGA-II [19] and some
other well-known MOO algorithms. The pseudo-code of
the AMOSA algorithm is shown in Fig. 3.
4.2 State representation and archive initialization
In Semi-GenClustMOO, a set of real numbers represents
the state of AMOSA. These real numbers are in fact the
coordinates of the centers of the partitions. Hence AMOSA
can easily identify the appropriate set of cluster centers and
the respective partitionings of the data items. Suppose a
state comprises the encoded centers of K clusters, where each cluster center is further sub-divided into C sub-clusters. Then the length of that state will be K × C × d, where d is the dimension of the data set. Let a state contain K = 2 clusters, each divided into C = 10 sub-clusters, with data dimension d = 2. If the jth sub-cluster of the ith cluster is represented by c_ij = (cx_ij, cy_ij), then the entire state will look like (cx_11, cy_11, cx_12, cy_12, ..., cx_1,10, cy_1,10, cx_21, cy_21, ..., cx_2,10, cy_2,10). Here, K_i = (rand() mod (K_max − 1)) + 2, where K_i is the total number of clusters encoded in string i of the archive, K_max is the upper limit on the number of clusters, and rand() is a function returning a random integer. So the number of initial clusters ranges between 2 and K_max.
The initialization step consists of three sub-steps. One third of the solutions of the archive are initialized randomly, using a minimum center-to-center distance criterion. Single linkage clustering [23] is used to initialize another one third of the solutions, for different values of the number of clusters (K); these solutions work well when the clusters in a data set are well separated. To initialize the last one third of the solutions, the K-means algorithm is applied with different values of K; these solutions work well for data sets having hyperspherical clusters. After forming the initial partitions, C sub-cluster centers are chosen for each cluster. These sub-cluster centers are then encoded in the string to represent a particular partitioning, so the total number of centers encoded in that string is C × K.
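The string encoding described above can be sketched as follows. The function names `encode_state` and `random_k` are our own illustrative choices, the center coordinates are toy values, and `random_k` follows the initialization range 2..K_max stated above.

```python
import random

def encode_state(centers):
    """Flatten K clusters x C sub-cluster centers x d coordinates
    into one state vector of length K*C*d."""
    return [coord for cluster in centers for sub in cluster for coord in sub]

def random_k(k_max, rng=random):
    """Draw the number of clusters for a string: an integer in [2, k_max]."""
    return rng.randrange(k_max - 1) + 2

# Toy state: K = 2 clusters, C = 2 sub-clusters each, d = 2 dimensions.
centers = [[(0.0, 0.1), (0.2, 0.3)],
           [(1.0, 1.1), (1.2, 1.3)]]
state = encode_state(centers)
print(len(state))  # K*C*d = 8
```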
4.3 Assignment of points
Here, we have considered each sub-cluster as a separate
cluster for the assignment process. Now, assume that, each
state has K number of clusters and each cluster is divided
into C number of sub-clusters. For the assignment, a minimum Euclidean distance based criterion is used: a data point y_j is assigned to the (v, t)th sub-cluster, where

\[
(v,t) = \operatorname*{arg\,min}_{p=1,\ldots,K;\; n=1,\ldots,C} d_e(z_{np}, y_j),
\]

and z_{np} is the nth sub-cluster center of the pth cluster. Thereafter, the partition matrix is formed as X[(v − 1) × C + t][j] = 1 and X[c][j] = 0 for all other c = 1, ..., K × C.
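The minimum-distance assignment rule can be sketched as below; `assign_point` is our illustrative name, and the centers and query point are toy values.

```python
import math

def assign_point(y, centers):
    """Assign data point y to the nearest sub-cluster.
    centers[p][n] is the nth sub-cluster center z_np of the pth cluster.
    Returns the 0-based pair (v, t): owning cluster and sub-cluster."""
    best, v, t = float("inf"), -1, -1
    for p, cluster in enumerate(centers):
        for n, z in enumerate(cluster):
            d = math.dist(z, y)  # Euclidean distance d_e(z_np, y)
            if d < best:
                best, v, t = d, p, n
    return v, t

centers = [[(0.0, 0.0), (0.5, 0.5)],   # cluster 0, two sub-clusters
           [(5.0, 5.0), (6.0, 6.0)]]   # cluster 1, two sub-clusters
print(assign_point((5.2, 5.1), centers))  # (1, 0)
```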
4.4 Objective function
Five objective functions are considered for the purpose of optimization. The first four are internal cluster validity indices, which rely on intrinsic properties of the data sets. The last one measures the violation of the available supervised information; it is an external cluster validity index. These five objective functions are the Sym-index [5], I-index [40], XB-index [66], FCM-index [10], and adjusted Rand index [52]. The details of the objective functions are provided in the supplementary material.
4.4.1 Other steps
Here, the other steps of Semi-GenClustMOO clustering
technique are similar to those of GenClustMOO [49]
algorithm. To evaluate the five objective functions, the
sub-cluster centers are joined to generate the whole
partitioning. Thereafter, the above mentioned five
objective functions are evaluated on these obtained
partitionings. To optimize these five objective functions
simultaneously, AMOSA is used as the underlying
optimization technique. Three types of mutation opera-
tions have been used. In the first type of mutation
operation, total number of clusters present in a state is
reduced by 1. In the second type, total number of clusters
present in a state is increased by 1, and finally in the third
type of mutation operation, cluster centers present in a
particular state are modified by some amount. If a string is selected for mutation, then one of the above mutation operations is applied to it with uniform probability.
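The three mutation operators can be sketched as follows. This is only an illustrative sketch: the magnitude of the center perturbation and the way a new cluster is seeded are our own assumptions, not the paper's exact operators.

```python
import random

def mutate(state, rng=random):
    """state: list of clusters, each a list of sub-cluster center tuples.
    Apply one of three mutations with uniform probability:
    0) delete a cluster, 1) add a cluster, 2) perturb every center."""
    choice = rng.randrange(3)
    state = [list(cluster) for cluster in state]   # work on a copy
    if choice == 0 and len(state) > 2:             # keep at least 2 clusters
        state.pop(rng.randrange(len(state)))
    elif choice == 1:
        template = state[rng.randrange(len(state))]
        state.append([tuple(x + rng.uniform(-1.0, 1.0) for x in sub)
                      for sub in template])
    else:
        state = [[tuple(x + rng.gauss(0.0, 0.1) for x in sub)
                  for sub in cluster] for cluster in state]
    return state

# Toy state: K = 3 clusters with one sub-cluster each, d = 2.
rng = random.Random(42)
s = [[(0.0, 0.0)], [(5.0, 5.0)], [(9.0, 9.0)]]
print(len(mutate(s, rng)) in (2, 3, 4))  # cluster count changes by at most 1
```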
Fig. 3 The AMOSA algorithm [51] (source code is available at: http://www.isical.ac.in/~sriparna_r)
4.4.2 Selection of the best solution
In MOO, the final Pareto optimal front consists of a set of
non-dominated solutions [50]. Each solution provides partitioning information for the given data set. Every non-dominated solution is equally important, and none of them dominates any other. However, the user may sometimes want only a single solution. In the current paper
we have adopted the following method to select a single
solution from the final Pareto optimal front. For the
supervised information, using the method mentioned in
Sect. 3, we have generated class labels of 10 % data points.
We have computed the ARI index [52] value for these
10 % data points for each solution on the final Pareto front.
Finally we have selected the solution with the highest value
of ARI index. The use of 10 % labeled data helps to
evaluate the goodness of each solution with respect to
ground truth information.
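The selection of a single solution by ARI over the 10 % labeled points can be sketched as below. The pair-counting ARI implementation is the standard formula; the function names, front solutions and labels are made-up illustrations.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings (pair-counting form)."""
    n = len(labels_true)
    sum_ij = sum(comb(c, 2)
                 for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:          # degenerate case, e.g. one cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def select_best(solutions, labeled_idx, true_labels):
    """Pick the Pareto-front solution whose labels on the labeled subset
    best match the supervised information (highest ARI)."""
    return max(solutions,
               key=lambda pred: adjusted_rand_index(
                   true_labels, [pred[i] for i in labeled_idx]))

# Two made-up front solutions; points 0..3 carry supervised labels.
sols = [[0, 0, 1, 1, 1, 0], [0, 1, 0, 1, 0, 1]]
print(select_best(sols, [0, 1, 2, 3], [0, 0, 1, 1]) == sols[0])  # True
```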
5 Experiments
In this section we discuss the experiments conducted to prove the utility of the proposed Semi-GenClustMOO clustering method. We have analyzed the obtained results both biologically and statistically.
Obtained partitioning results are also compared with the
results of existing clustering techniques for gene expression
data.
5.1 Data sets used for experiment
Pre-processed datasets have been downloaded from the
site.1 The short descriptions of the data sets are given in
Table 1.
Yeast sporulation In this data set [17], seven time points
are considered for measurement of expression values of
6118 genes during the sporulation process of budding
yeast. Data are log-transformed. Genes whose expression levels did not change sufficiently during the sporulation process are discarded. The root mean square of the log-transformed expression ratios is then computed for each gene, and a threshold of 1.6 on this value is used for gene selection. Finally, a total of 474 genes out of 6118 are retained.
Yeast cell cycle Here, in this data set [16], approxi-
mately 6000 gene expression levels are considered over
two cell cycles and 17 time points. The genes whose
expression levels did not change sufficiently are ignored
and, finally 384 genes are selected among 6000 genes.
Arabidopsis Thaliana Here, in this data set [46] only 138
gene expression levels are considered. The expression level
measurements have been conducted over 8 time points.
Human fibroblasts serum Here, in this data set [32]
expression levels of 8613 human genes are considered. The
expression level measurements have been conducted over
12 time points. It consists of 13 dimensions including 12
time points and an unsynchronized sample. The genes
whose expression levels did not change sufficiently are
ignored and, finally 517 genes are selected among 8613
genes. This data is again log2-transformed.
Rat CNS Here, in this data set [64] expression levels of
112 genes have been considered. Reverse transcription
coupled PCR method has been used for expression level
measurement. This measurement has been conducted over
9 time points during rat central nervous system
development.
5.2 Performance metrics
To validate the performance of clustering algorithms,
mainly two validity indices, Silhouette index (S(C)) [48]
and ARI are considered. The value of the S(C) index lies between −1 and +1, so good partitioning results correspond to high positive values of the S(C) index. Simultaneously, for
the purpose of visualization, two methods have been used
in the form of Cluster profile plot [22, 42] and Eisen plot
[42].
5.2.1 Eisen plot
In an Eisen plot [22, 42], the gene expression value at a particular time point is represented in a natural way: each cell of the data matrix is colored to match its spotted color on the microarray (see Fig. 5 for an example). Higher expression levels are denoted by shades of red, low expression levels by shades of green, and the absence of differential expression by colors towards black. In this
paper before plotting the Eisen plot, genes have been
ordered in such a way that genes belonging to a particular
cluster have been placed one after another. White colored
blank rows are used to identify the cluster boundaries.
5.2.2 Cluster profile plot
With the help of cluster profile plot [42], it is possible to
visualize the normalized gene expression values of the
obtained gene clusters with respect to different conditions
like time points.
1 http://anirbanmukhopadhyay.50webs.com/mogasvm.html
Before plotting, at first for each gene cluster its average gene expression value is calculated with
respect to different time points. Thereafter, standard devi-
ation of each gene cluster is calculated. Finally, the gene
expression values of a cluster are plotted along with their
average expression values and standard deviation, which
can be shown as a black line. The cluster profile plot for
Yeast sporulation data is shown in Fig. 6.
5.3 Discussion of results
Semi-GenClustMOO clustering technique is applied on
five real-life publicly available benchmark data sets. These
data sets are Arabidopsis Thaliana, Yeast cell cycle, Human
fibroblasts serum, Yeast sporulation and Rat CNS data. The
partitioning results obtained after application of this clus-
tering technique on these five data sets are shown in this
paper. In the proposed clustering technique, a recently
developed simulated annealing based multiobjective opti-
mization technique, AMOSA [8] is used as the underlying
optimization technique. After a thorough sensitivity study,
the parameters of the proposed algorithm are selected.
These are: SL = 400, HL = 300, iter = 50, Tmax = 100,
Tmin = 0.00001 and cooling rate α = 0.8. A discussion regarding the parameter values of AMOSA is presented in [8]; inspired by this discussion, we have selected the parameter values used in the current paper. Increasing the Tmax
value does not improve the results further. Similarly
increasing the value of iter does not change the results. In
order to generate the labeled data, first Fuzzy C-means
clustering technique is applied on all the gene expression
data sets. The membership plots after application of Fuzzy
C-means clustering technique for all the data sets are
shown in Fig. 4a–e, respectively.
We have computed the Silhouette index and ARI values
of the final partitionings identified by the proposed Semi-
GenClustMOO technique. Those values are reported in
Table 2. The proposed Semi-GenClustMOO clustering
technique is able to automatically detect the number of
clusters from a data set. The automatically identified
number of clusters for different gene expression data sets
are reported in Table 2.
Results obtained by Semi-GenClustMOO clustering
technique are also compared with MO-fuzzy [51], fuzzy
MOGA clustering [7], GenClustMOO [49], FCM [10],
single objective GA based clustering technique [41], Self
Organising Map (SOM) [59], Chinese Restaurant Cluster-
ing (CRC) [45], and Hierarchical average linkage cluster-
ing technique [62]. Each of the algorithms is applied on
each data set five times. We have executed the algorithms
with default parameter values as specified in the corre-
sponding papers. Obtained clustering results are verified
after conducting several statistical and biological signifi-
cance tests. Mean Silhouette index values of the partiti-
onings for various data sets obtained by different clustering
techniques are shown in Table 3. Results reveal that the
proposed Semi-GenClustMOO clustering technique per-
forms the best for almost all data sets. For all the five data
sets the Silhouette index values attained by the proposed
Semi-GenClustMOO clustering technique are the highest
compared to Silhouette index values obtained by other
seven clustering techniques. In [7, 51], some multiobjective
clustering techniques are proposed to solve the gene
expression data clustering problem. In [51], AMOSA was
used as the underlying optimization technique and for the
assignment of genes to different clusters a newly developed
point symmetry based distance [5] is utilized. In [7], a
multiobjective genetic algorithm based technique, NSGA-
II [19], is used as the underlying optimization technique
and Euclidean distance based membership matrix compu-
tation is conducted. The GenClustMOO [49] clustering technique is similar to our proposed approach, but it uses only the four internal objective functions (Sym-index, I-index, XB-index and FCM-index) and does not exploit the extra 10 % labeled information. Note that the proposed semi-supervised clustering technique using the search capability of
AMOSA outperforms all these recent multiobjective tech-
niques for clustering the gene expression data in terms of
Silhouette index. This proves that the use of 10% labeled
data as the supervised information helps the proposed
technique to determine good quality partitions.
In order to visually demonstrate the results of Semi-
GenClustMOO clustering, Fig. 5 shows the Eisen plots of
the partitionings obtained by the proposed clustering
technique for all the gene expression data sets. From the
color representation of the Eisen plot, we can see that similar colors are grouped together, denoting that within a particular cluster all genes have similar gene expression profiles, since they produce similar color patterns.
The cluster profile plots (Fig. 6) are also drawn based on
the obtained clustering results by the proposed algorithm.
These plots also demonstrate how the expression profiles
for the different groups of genes differ from each other,
while the profiles within a group are reasonably similar.
Here cluster profile plots obtained by the proposed
Table 1 Data set description

Data set                  Number of genes   Total features
Yeast sporulation         474               7
Yeast cell cycle          384               17
Arabidopsis Thaliana      138               8
Human fibroblasts serum   517               13
Rat CNS                   112               9
algorithm are shown only for Yeast sporulation data in Fig.
6. For Arabidopsis Thaliana, Yeast cell cycle, Rat CNS,
Human fibroblasts serum, cluster profile plots are shown in
the supplementary file. The proposed technique performs
better compared to the other clustering methods mainly
because of the following reasons: first of all, this is a
multiobjective semi-supervised clustering method. Simul-
taneous optimization of multiple cluster validity measures
helps to handle clusters with different characteristics. This
in turn produces high quality solutions representing dif-
ferent possible partitionings. Secondly, the strength of
semi-supervised classification is utilized along with
Fig. 4 Fuzzy membership plots after application of FCM algorithm over data sets a Arabidopsis Thaliana, b Yeast cell cycle, c Rat CNS,
d Human fibroblasts serum, e Yeast sporulation
Table 2 Number of clusters automatically determined by the proposed Semi-GenClustMOO technique, and the ARI and Silhouette index values of the optimum partitionings identified by the proposed Semi-GenClustMOO clustering technique when applied on several gene expression data sets

Data set      K   ARI      Silhouette
Sporulation   6   0.8584   0.6786
Cell cycle    5   0.7962   0.4353
Arabidopsis   4   0.8316   0.4310
Serum         6   0.6611   0.4112
Rat CNS       6   0.6023   0.4772
Table 3 Mean values of Silhouette index corresponding to the partitionings identified by different gene expression clustering techniques

Algorithm            Sporulation   Cell cycle    Thaliana      Serum         Rat CNS
                     K   S(C)      K   S(C)      K   S(C)      K   S(C)      K   S(C)
Semi-GenClustMOO     6   0.6786    5   0.4353    4   0.4310    6   0.4112    6   0.5027
MO-fuzzy             6   0.5877    5   0.4342    4   0.4194    6   0.4073    6   0.4977
MOGA                 6   0.5754    5   0.4232    4   0.4023    6   0.3874    6   0.4832
GenClustMOO          6   0.6037    5   0.4253    4   0.4154    6   0.4078    6   0.4993
FCM                  7   0.4696    6   0.3856    4   0.3665    8   0.3125    5   0.4132
SGA                  6   0.5712    5   0.4232    4   0.3854    6   0.3445    4   0.4492
Average linkage      6   0.5023    4   0.4378    5   0.3162    4   0.3576    6   0.4142
SOM                  6   0.5794    6   0.3862    5   0.2352    6   0.3352    5   0.4354
CRC                  8   0.5623    5   0.4275    4   0.3965    10  0.3254    4   0.4576
multiobjective optimization efficiently. Here we have devised a novel technique for generating labeled data without using any human annotator. This information is then used in the proposed multiobjective semi-supervised classification technique to fine-tune the obtained partitionings.
In order to show the conflicting nature of the objective
functions, the final Pareto optimal fronts obtained by the
Semi-GenClustMOO clustering technique for different
gene expression data sets are shown in Fig. 7a–e. Note that, since we have optimized a total of five objective functions simultaneously, each solution of the final Pareto optimal front has five objective function values, but it is not possible to show the five-dimensional Pareto front.
Fig. 5 Eisen plot for a Arabidopsis Thaliana, b Yeast cell cycle, c Rat CNS, d Human fibroblasts serum, e Yeast sporulation after application of the Semi-GenClustMOO clustering technique
Fig. 6 Cluster profile plot for Yeast sporulation data obtained after application of the Semi-GenClustMOO clustering method
Here we have projected the Pareto front on three dimen-
sions. The different solutions of the final Pareto front are
shown with three objective function values, Sym-index, I-
index and ARI-index. These figures show that for all the
data sets multiple solutions are obtained on the final Pareto
optimal front. This proves that the objective functions used
in the current algorithm are conflicting in nature.
The objective functions used in the current paper are
conflicting to each other. The first objective is based on
symmetry property of the clusters; the second one is based
on Euclidean distance, third and fourth cluster validity
indices are based on the concepts of fuzzy logic. The first
four cluster validity indices capture different properties of
data partitionings. These are some internal cluster validity
indices which differ in the way of capturing the cluster
goodness. The I-index tries to minimize the cluster compactness based on Euclidean distance while maximizing the maximum separation between any two cluster centers. The Sym-index tries to determine some symmetrically shaped, well separated clusters. The XB-index helps to determine some overlapping clusters where the minimum distance between any two cluster centers is maximized. The FCM-index is
based on the concepts of fuzzy logic; it helps to detect
some overlapping clusters. The fifth objective is an external
cluster validity index which helps to check the matching of
the obtained partitioning by the proposed clustering
Fig. 7 Pareto optimal fronts obtained by Semi-GenClustMOO clustering technique for a Yeast sporulation, b Rat CNS, c Yeast cell cycle,
d Arabidopsis Thaliana and e Human fibroblasts serum data sets
technique with the available labeled information. This
helps to utilize the available supervised information for
reshaping the obtained clusters. Thus the objective func-
tions used in the current paper are conflicting to each other
and they capture different data properties. There is a trade-
off among the five objectives. AMOSA is used to optimize
the five objective functions. The final Pareto optimal fronts
identified by the proposed technique also show the set of
trade-off solutions.
In order to measure the quality of the obtained Pareto
optimal front by the proposed approach, Semi-GenClust-
MOO, we have computed the Purity [6, 31] and Minimal
Spacing [6] measurements. The measure named Purity
[6, 31] is used to compare the solutions obtained using
different MOO strategies. It calculates the fraction of
solutions from a particular method that remains nondomi-
nated when the final front solutions obtained from all the
algorithms are combined. A value near to 1(0) indicates
better (poorer) performance.
Purity Let N ≥ 2 MOO strategies be applied to a given problem, and let r_i = |R_i^1|, i = 1, 2, ..., N, denote the number of rank-one solutions obtained from the ith strategy. Combine all these solutions as R = ∪_{i=1}^{N} R_i^1. A ranking scheme is then applied to R, and a new rank-one solution set R^1 is obtained. Let r̄_i be the number of solutions of the ith strategy that remain of rank one in R^1, i.e., r̄_i = |{c | c ∈ R_i^1 and c ∈ R^1}|. The Purity measure for the ith MOO strategy, P_i, is then defined as:

\[
P_i = \frac{\bar{r}_i}{r_i}, \quad i = 1, 2, \ldots, N. \qquad (7)
\]
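The Purity computation can be sketched as follows, assuming minimization of all objectives; the function names and the two fronts are made-up illustrations.

```python
def dominates(a, b):
    """Minimization: a dominates b if it is no worse in all objectives
    and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def purity(fronts):
    """fronts[i]: rank-one solutions (objective vectors) of strategy i.
    P_i = fraction of strategy i's solutions that stay non-dominated
    when all strategies' fronts are pooled."""
    pool = [s for front in fronts for s in front]
    return [sum(1 for s in front
                if not any(dominates(o, s) for o in pool)) / len(front)
            for front in fronts]

# Two made-up strategies' fronts over two minimized objectives.
f1 = [(1.0, 4.0), (2.0, 3.0)]
f2 = [(1.5, 4.5), (3.0, 1.0)]
print(purity([f1, f2]))  # [1.0, 0.5]: (1.5, 4.5) is dominated by (1.0, 4.0)
```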
The second measure named Minimal Spacing reflects the
uniformity of the solutions over the non-dominated front.
Minimal Spacing Schott [54] suggested a relative distance measure between consecutive solutions on the obtained nondominated front. It is defined as

\[
S = \sqrt{\frac{1}{|Q|}\sum_{i=1}^{|Q|}\left(d_i - \bar{d}\right)^2} \qquad (8)
\]

where d_i = min_{k ∈ Q, k ≠ i} Σ_{m=1}^{M} |f_m^i − f_m^k|, f_m^i (respectively f_m^k) is the mth objective value of the ith (kth) solution in the final nondominated solution set Q, and d̄ = (Σ_{i=1}^{|Q|} d_i)/|Q| is the mean value of all the d_i. M is the number of objective functions. When the solutions are uniformly spaced, the corresponding distance measure values become low; thus a low value of S indicates well-spaced nondominated solutions.
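Schott's Minimal Spacing measure of Eq. (8) can be sketched as below; the function name and the example fronts are our own toy illustrations.

```python
import math

def minimal_spacing(front):
    """Schott's spacing over a nondominated set Q of objective vectors:
    S = sqrt((1/|Q|) * sum_i (d_i - dbar)^2), where d_i is the minimum
    L1 distance from solution i to any other solution in Q."""
    dists = []
    for i, fi in enumerate(front):
        dists.append(min(sum(abs(a - b) for a, b in zip(fi, fk))
                         for k, fk in enumerate(front) if k != i))
    dbar = sum(dists) / len(dists)
    return math.sqrt(sum((d - dbar) ** 2 for d in dists) / len(dists))

# A uniformly spaced front gives S = 0; an uneven one gives S > 0.
even = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]
uneven = [(0.0, 3.0), (0.1, 2.9), (3.0, 0.0)]
print(minimal_spacing(even), minimal_spacing(uneven) > 0)  # 0.0 True
```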
Smaller values of Minimal Spacing indicate better per-
formance. Table 4 shows the values of Purity and Minimal
Spacing measures of the obtained Pareto optimal fronts by
the proposed technique when combined with the solutions
obtained by other gene expression data clustering tech-
niques. Here after application of other multiobjective
optimization based clustering techniques like MO-fuzzy
and MOGA, we obtain a set of partitioning solutions on the
final Pareto optimal front. The five cluster validity indices
used as the objective functions of the proposed clustering
technique, Sym-index, I-index, XB-index, FCM-index, and
adjusted rand-index are calculated for those partitioning
solutions. Adjusted rand-index is calculated over the same
10 % data as used in the current paper. Similarly after
execution of a single objective based clustering technique,
only a single partitioning solution is obtained. The five
objective functional values are calculated for this solution.
All the solutions obtained by all the methods used in the
current paper are then combined and compared with
respect to the five objective functions. The fraction of
solutions which is still non-dominated with respect to all
the solutions is used as the purity value of that particular
method. These results show that the solutions generated by
the proposed Semi-GenClustMOO clustering technique are
of good quality. Low values of Minimal Spacing indicate that the solutions are well-spread over the final Pareto optimal front; similarly, high values of Purity indicate that the solutions obtained by the Semi-GenClustMOO clustering technique are mostly globally non-dominated.
5.4 Biological significance
The statistically significant Gene Ontology (GO) annotation database2 is used to establish the biological relevance of a cluster. Three structured vocabularies (ontologies) are used to measure the functional enrichment of a group of genes; these terms are given in the form of associated biological processes, molecular functions and cellular components. A cumulative hypergeometric distribution is used to determine the degree of functional enrichment (p-value). Thereafter, the probability of finding a given number of genes within a particular
Table 4 Purity and Minimal Spacing measurements of the obtained solutions by the proposed Semi-GenClustMOO clustering technique

Data set      Minimal Spacing   Purity
Sporulation   0.0548            0.9873
Cell cycle    0.0675            0.9583
Thaliana      0.0609            0.9637
Serum         0.1334            0.8723
Rat CNS       0.0769            0.9285
2 http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.
cluster is determined, which shows their involvement in a given GO term. So, for a particular GO category and a cluster of size K, the probability p of observing at least n genes of that category within the cluster can be formulated according to the following equation [42, 61]:

\[
p = 1 - \sum_{j=0}^{n-1} \frac{\binom{t}{j}\binom{l-t}{K-j}}{\binom{l}{K}} \qquad (9)
\]
where t represents the total number of genes belonging to a particular category and l is the total number of genes in the genome. Now, after evaluating the p-value
for each GO category, we can verify statistical significance
of the genes in a cluster. Due to this test, we can easily
quantify the matched label of a gene belonging to a par-
ticular cluster with the different types of GO categories.
The p-value of an obtained cluster for a particular GO category will be close to 0 if most of the genes belonging to that cluster possess the same biological function.
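The p-value of Eq. (9) can be sketched as below, assuming the standard cumulative hypergeometric form with denominator C(l, K); the function name and the genome, category and cluster sizes are made-up toy values.

```python
from math import comb

def enrichment_p_value(n, K, t, l):
    """Cumulative hypergeometric p-value: probability of observing at
    least n genes of a GO category (t genes category-wide, l genes in
    the genome) within a cluster of size K."""
    return 1.0 - sum(comb(t, j) * comb(l - t, K - j)
                     for j in range(n)) / comb(l, K)

# Toy numbers: genome of 100 genes, category of 10, cluster of 5 genes
# of which 3 belong to the category.
p = enrichment_p_value(n=3, K=5, t=10, l=100)
print(0.0 < p < 0.01)  # True: significant at the 1 % level
```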
In this paper, for the Yeast sporulation data set, the biological significance test has been performed at the 1 % significance level. The clustering results obtained by all the algorithms used in this paper have been tested biologically. For the proposed Semi-GenClustMOO clustering technique, all the
6 clusters obtained are biologically significant whereas for
MO-fuzzy, MOGA, FCM, SGA, Average linkage, SOM
and CRC the number of biologically significant clusters are
6, 6, 4, 6, 4, 4 and 6, respectively. In Table 5, top three
most relevant GO terms for genes of individual clusters
along with their respective p-values have been shown.
These results are obtained after applying Semi-GenClust-
MOO clustering technique on the Sporulation data set. Here, to evaluate the GO terms, we consider all p-values ≤ 0.01. The biological significance test reveals that the proposed Semi-GenClustMOO clustering technique generates biologically relevant and functionally enriched clusters for gene expression data.
5.5 Statistical significance
To show that the results obtained by the proposed Semi-
GenClustMOO algorithm are statistically significant, Wil-
coxon’s rank sum test [65] is used. In Table 6, the com-
parative p-values of Semi-GenClustMOO algorithm with
respect to other algorithms are reported. The null hypothesis states that there is no significant difference between the median values of the two groups, while the alternative hypothesis states that there is a significant difference. Here we have used the Silhouette index value to measure the per-
formance of the individual clustering techniques. In Table 6, all the p-values are below the 5 % significance level. This is strong evidence against the null hypothesis, indicating that the better median values of the performance metric produced by Semi-GenClustMOO are statistically significant and have not occurred by chance.
6 Conclusions
In this paper the problem of gene expression data clustering is
solved using a multiobjective optimization based semi-super-
vised clustering technique, namely Semi-GenClustMOO. To
the best of our knowledge this is the first attempt at solving the gene expression data clustering problem using semi-supervised classification techniques. A newly developed
multiobjective optimization technique using the concepts of
simulated annealing is utilized as the underlying optimization
technique. In order to represent a particular cluster in the form
of a solution, multiple centers are used. For the purpose of
assignment, sub-clusters are considered individually. In order
to determine the optimal partitioning automatically, five
objective functions are optimized simultaneously by the search
capability of AMOSA. First four objective functions are some
internal cluster validity indices, Sym-index, I-index, XB index
and FCM index. The fifth objective function is an external
cluster validity index, namely adjusted rand-index. Here, we
assume that for each data set, class label information of 10 %
data points are available as the supervised information. For the
gene expression data, it is difficult to generate the labeled
information. In order to generate the labeled data without
taking help of any human annotator, we have first executed
another popular clustering technique, Fuzzy C-means on the
given data set. The data points with highest values of mem-
bership with respect to a particular cluster are the core points
of the clusters. These core points are used as the available
labeled information of the proposed semi-supervised classifi-
cation technique. The proposed technique is applied for
solving clustering problems from five bench-mark gene
expression data sets. The qualities of the obtained partitioning
results are measured using four internal cluster validity indices
and one external cluster validity index. Partitions are also
verified biologically and statistically. The effectiveness of the
Semi-GenClustMOO clustering technique has been compared
with MO-fuzzy, MOGA, FCM, SOM, CRC clustering tech-
niques. Results prove that the proposed semi-supervised
classification technique is more effective compared to the
existing techniques of gene expression data clustering.
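The labeled-data generation step summarized above can be sketched as follows. This is a minimal NumPy implementation of standard Fuzzy C-means, not the authors' code; the synthetic blob data, the 0.9 membership threshold, and the helper names are illustrative assumptions:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard Fuzzy C-means; returns cluster centers and membership matrix U."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # each row sums to 1
    for _ in range(max_iter):
        Um = U ** m
        # Centers are membership-weighted means of the data points.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                # guard against division by zero
        # Membership update: u_ij proportional to d2_ij^(-1/(m-1)).
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

def core_points(U, threshold=0.9):
    """Indices and labels of points with near-unambiguous cluster membership."""
    best = U.max(axis=1)
    idx = np.where(best >= threshold)[0]
    return idx, U[idx].argmax(axis=1)

# Two well-separated synthetic blobs stand in for gene-expression profiles.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 4)), rng.normal(3, 0.2, (50, 4))])
centers, U = fuzzy_c_means(X, c=2)
idx, labels = core_points(U, threshold=0.9)
print(f"{len(idx)} of {len(X)} points selected as labeled core points")
```

The points returned by `core_points` play the role of the "core points" described above: their cluster labels are treated as the supervised information, with no human annotator involved.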
In the future we would like to develop other semi-supervised clustering techniques based on other optimization techniques such as genetic algorithms or differential evolution, and then to evaluate these techniques for gene expression data clustering. Another important direction of future work is the application of the proposed semi-supervised clustering technique to other domains such as satellite image segmentation.

Int. J. Mach. Learn. & Cyber. (2017) 8:421–439
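The external objective used in this work, the adjusted Rand index, can be computed from the contingency table of two partitionings. Below is a minimal, self-contained sketch, not the authors' implementation; the example label vectors are hypothetical:

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat label vectors."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Contingency table: n[i, j] = points in true class i and predicted cluster j.
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    n = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                   for k in clusters] for c in classes])
    sum_ij = sum(comb(int(nij), 2) for nij in n.ravel())
    sum_a = sum(comb(int(ai), 2) for ai in n.sum(axis=1))   # row marginals
    sum_b = sum(comb(int(bj), 2) for bj in n.sum(axis=0))   # column marginals
    total = comb(len(labels_true), 2)
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:          # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Agreement up to a permutation of cluster names yields the maximum score.
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # → 1.0
```

An equivalent, well-tested implementation is available as `adjusted_rand_score` in scikit-learn; the index is chance-corrected, so random partitionings score near zero.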
References
1. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald
A, Boldrick JC, Sabet H, Tran T, Yu X et al (2000) Distinct types
of diffuse large b-cell lymphoma identified by gene expression
profiling. Nature 403(6769):503–511
2. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D,
Levine AJ (1999) Broad patterns of gene expression revealed by
clustering analysis of tumor and normal colon tissues probed by
oligonucleotide arrays. Proceed Natl Acad Sci 96(12):6745–6750
3. Altun Y, Belkin M, Mcallester DA (2005) Maximum margin
semi-supervised learning for structured variables. In: Advances in
neural information processing systems, pp 33–40
4. Bandyopadhyay S (2007) Analysis of biological data: a soft
computing approach, World Scientific
5. Bandyopadhyay S, Saha S (2008) A point symmetry-based
clustering technique for automatic evolution of clusters. Knowl
Data Eng IEEE Trans 20(11):1441–1457
6. Bandyopadhyay S, Pal SK, Aruna B (2004) Multiobjective gas,
quantitative indices, and pattern classification. Syst Man Cybern
Part B Cybern IEEE Trans 34(5):2088–2099
7. Bandyopadhyay S, Mukhopadhyay A, Maulik U (2007) An
improved algorithm for clustering gene expression data. Bioin-
formatics 23(21):2859–2865
8. Bandyopadhyay S, Saha S, Maulik U, Deb K (2008) A simulated
annealing-based multiobjective optimization algorithm: AMOSA.
Evol Comput IEEE Trans 12(3):269–283
9. Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for
pairwise constrained clustering. In: Proceedings of the SIAM international conference on data mining
10. Bezdek JC (1981) Pattern recognition with fuzzy objective
function algorithms. Kluwer Academic Publishers
11. Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints
and metric learning in semi-supervised clustering. In: Proceed-
ings of the twenty-first international conference on Machine
learning, ACM, pp 81–88
12. Brazma A, Vilo J (2000) Gene expression data analysis. FEBS
lett 480(1):17–24
13. Brunet JP, Tamayo P, Golub TR, Mesirov JP (2004) Metagenes
and molecular pattern discovery using matrix factorization. Pro-
ceed Natl Acad Sci 101(12):4164–4169
14. Chapelle O, Zien A (2004) Semi-supervised classification by low
density separation. In: AISTATS
15. Chapelle O, Scholkopf B, Zien A, et al. (2006) Semi-supervised
learning, MIT press Cambridge
Table 5 Three most significant GO terms of each of the six clusters of the Yeast sporulation data and their p-values, after application of the Semi-GenClustMOO clustering technique

Clusters   Significant GO term                                            p-value
Cluster1   Carboxylic acid metabolic process: GO:0019752                  2.25E-05
           Oxoacid metabolic process: GO:0043436                          3.09E-05
           Organic acid metabolic process: GO:0006082                     3.17E-05
Cluster2   Single-organism cellular process: GO:0044763                   0.00019
           Single-organism process: GO:0044699                            0.00086
           Cell cycle process: GO:0022402                                 7.62E-19
Cluster3   Gene expression: GO:0010467                                    2.95E-06
           Cellular component biogenesis: GO:0044085                      4.47E-08
           Ribonucleoprotein complex biogenesis: GO:0022613               6.84E-16
Cluster4   Carbohydrate metabolic process: GO:0005975                     4.56E-09
           Single-organism carbohydrate metabolic process: GO:0044724     7.06E-07
           Generation of precursor metabolites and energy: GO:0006091     3.39E-07
Cluster5   Single-organism cellular process: GO:0044763                   0.00077
           Single-organism process: GO:0044699                            0.00638
           Cell cycle process: GO:0022402                                 3.15E-24
Cluster6   Metabolic process: GO:0008152                                  1.16E-06
           Cellular metabolic process: GO:0044237                         3.04E-06
           Organic substance metabolic process: GO:0071704                1.57E-06
Table 6 p-values produced by Wilcoxon's rank sum test comparing Semi-GenClustMOO with other algorithms

Data set      MO-fuzzy   MOGA       FCM        SGA        SOM        CRC
Sporulation   3.21E-03   3.57E-03   7.12E-05   3.77E-04   3.82E-04   3.22E-03
Cell cycle    1.29E-03   2.42E-05   5.60E-05   3.20E-04   2.44E-04   3.11E-04
Arabidopsis   1.11E-03   2.29E-03   6.42E-05   4.30E-03   2.12E-03   2.01E-03
Serum         2.10E-03   3.41E-03   6.21E-04   3.40E-03   2.62E-04   2.44E-03
Rat CNS       1.72E-03   2.71E-04   5.67E-05   3.43E-04   2.76E-03   2.71E-03
16. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A,
Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lock-
hart DJ et al (1998) A genome-wide transcriptional analysis of
the mitotic cell cycle. Mol cell 2(1):65–73
17. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO,
Herskowitz I (1998) The transcriptional program of sporulation in
budding yeast. Science 282(5389):699–705
18. De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, Moreau Y
(2002) Adaptive quality-based clustering of gene expression
profiles. Bioinformatics 18(5):735–746
19. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
20. Dembele D (2008) Multi-objective optimization for clustering
3-way gene expression data. Adv Data Anal Cl 2(3):211–225
21. Dhaeseleer P, Wen X, Fuhrman S, Somogyi R (1998) Mining the
gene expression matrix: Inferring gene relationships from large
scale gene expression data. In: Information processing in cells
and tissues, Springer, pp 203–212
22. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster
analysis and display of genome-wide expression patterns. Pro-
ceed Natl Acad Sci 95(25):14863–14868
23. Everitt B (1974/1993) Cluster analysis. Halsted Press
24. Faceli K, de Souto MC, de Araujo DS, de Carvalho AC (2009)
Multi-objective clustering ensemble for gene expression data
analysis. Neurocomputing 72(13):2763–2774
25. Fraley C, Raftery AE (1998) How many clusters? which clus-
tering method? answers via model-based cluster analysis. Com-
put J 41(8):578–588
26. Geman S, Geman D (1984) Stochastic relaxation, gibbs distri-
butions, and the bayesian restoration of images. Patt Anal Mach
Intell IEEE Trans 6:721–741
27. Ghosh D, Chinnaiyan AM (2002) Mixture modelling of gene
expression data from microarray experiments. Bioinformatics
18(2):275–286
28. Herwig R, Poustka AJ, Muller C, Bull C, Lehrach H, O’Brien J
(1999) Large-scale clustering of cdna-fingerprinting data. Gen-
ome Res 9(11):1093–1105
29. Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring expression
data: identification and analysis of coexpressed genes. Genome
Res 9(11):1106–1115
30. Hu Q, Pan W, An S, Ma P, Wei J (2010) An efficient gene
selection technique for cancer recognition based on neighborhood
mutual information. Int J Mach Learn Cybern 1(1–4):63–74
31. Ishibuchi H, Murata T (1998) A multi-objective genetic local
search algorithm and its application to flowshop scheduling. Syst
Man Cybern Part C Appl Rev IEEE Trans 28(3):392–403
32. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent
JM, Staudt LM, Hudson J, Boguski MS et al (1999) The tran-
scriptional program in the response of human fibroblasts to
serum. Science 283(5398):83–87
33. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review.
ACM Comput Surv 31(3):264–323
34. Jiang D, Pei J, Zhang A (2003) Dhc: a density-based hierarchical
clustering method for time series gene expression data. In: Pro-
ceedings of Bioinformatics and Bioengineering. Third IEEE
Symposium, pp 393–400
35. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene
expression data: a survey. Knowl Data Eng IEEE Trans
16(11):1370–1386
36. Kirkpatrick S, Gelatt CD, Vecchi MP et al (1983) Optimization
by simulated annealing. Science 220(4598):671–680
37. Liu L, Hawkins DM, Ghosh S, Young SS (2003) Robust singular
value decomposition analysis of microarray data. Proceed Natl
Acad Sci 100(23):13167–13172
38. Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M,
Mittmann M, Wang C, Kobayashi M, Horton H et al (1996)
Expression monitoring by hybridization to high-density oligo-
nucleotide arrays. Nature Biotech 14(13):1675–1680
39. Lockhart DJ, Winzeler EA (2000) Genomics, gene expression
and dna arrays. Nature 405(6788):827–836
40. Maulik U, Bandyopadhyay S (2002) Performance evaluation of
some clustering algorithms and validity indices. Patt Anal Mach
Intell IEEE Trans 24(12):1650–1654
41. Maulik U, Bandyopadhyay S (2003) Fuzzy partitioning using a
real-coded variable-length genetic algorithm for pixel classifica-
tion. Geosci Remote Sens IEEE Trans 41(5):1075–1081
42. Maulik U, Mukhopadhyay A, Bandyopadhyay S (2009) Com-
bining pareto-optimal clusters using supervised learning for
identifying co-expressed genes. BMC Bioinform 10(1):27
43. Mukhopadhyay A, Bandyopadhyay S, Maulik U (2010) Multi-
class clustering of cancer subtypes through svm based ensemble
of pareto-optimal solutions for gene marker identification. PloS
one 5(11):e13803
44. Mukhopadhyay A, Maulik U, Bandyopadhyay S (2013) An
interactive approach to multiobjective clustering of gene
expression patterns. Biomed Eng IEEE Trans 60(1):35–41
45. Qin ZS (2006) Clustering microarray gene expression data using
weighted chinese restaurant process. Bioinformatics 22(16):
1988–1997
46. Reymond P, Weber H, Damond M, Farmer EE (2000) Differ-
ential gene expression in response to mechanical wounding and
insect feeding in arabidopsis. Plant Cell Online 12(5):707–719
47. Rose K (1998) Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc IEEE 86(11):2210–2239
48. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the inter-
pretation and validation of cluster analysis. J Comput Appl Math
20:53–65
49. Saha S, Bandyopadhyay S (2013) A generalized automatic
clustering algorithm in a multiobjective framework. Appl Soft
Comput 13(1):89–108
50. Saha S, Ekbal A, Alok AK (2012) Semi-supervised clustering
using multiobjective optimization. In: Hybrid Intelligent Systems
(HIS), 12th International Conference, IEEE, pp 360–365
51. Saha S, Ekbal A, Gupta K, Bandyopadhyay S (2013) Gene
expression data clustering using a multiobjective symmetry based
clustering technique. Comput Biol Med 43(11):1965–1977
52. Santos JM, Embrechts M (2009) On the use of the adjusted
Rand index as a metric for evaluating supervised classification. In: Artificial Neural Networks-ICANN, Springer,
pp 175–184
53. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative
monitoring of gene expression patterns with a complementary
dna microarray. Science 270(5235):467–470
54. Schott JR (1995) Fault tolerant design using single and multi-
criteria genetic algorithm optimization. Tech Rep DTIC Doc
55. Sharan R, Shamir R (2000) Click: a clustering algorithm with
applications to gene expression analysis. Proceed Int Conf Intell
Syst Mol Biol 8:16
56. Sharma A, Imoto S, Miyano S, Sharma V (2012) Null space
based feature selection method for gene expression data. Int J
Mach Learn Cybern 3(4):269–276
57. Sherlock G (2000) Analysis of large-scale gene expression data.
Curr Opin Immunol 12(2):201–205
58. de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A
(2008) Clustering cancer gene expression data: a comparative
study. BMC Bioinform 9(1):497
59. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmit-
rovsky E, Lander ES, Golub TR (1999) Interpreting patterns of
gene expression with self-organizing maps: methods and appli-
cation to hematopoietic differentiation. Proceed Natl Acad Sci
96(6):2907–2912
60. Tang VT, Yan H (2012) Noise reduction in microarray gene
expression data based on spectral analysis. Int J Mach Learn
Cybern 3(1):51–57
61. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM
(1999) Systematic determination of genetic network architecture.
Nature Genet 22(3):281–285
62. Tou JT, Gonzalez RC (1974) Pattern recognition principles.
Addison-Wesley, Reading
63. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T,
Tibshirani R, Botstein D, Altman RB (2001) Missing value
estimation methods for dna microarrays. Bioinformatics
17(6):520–525
64. Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL,
Somogyi R (1998) Large-scale temporal gene expression map-
ping of central nervous system development. Proceed Natl Acad
Sci 95(1):334–339
65. Wilcoxon F, Katti S, Wilcox RA (1963) Critical values and
probability levels for the Wilcoxon rank sum test and the Wil-
coxon signed rank test. American Cyanamid Comp
66. Xie XL, Beni G (1991) A validity measure for fuzzy clustering.
IEEE Trans Patt Anal Mach Intell 13(8):841–847
67. Xu X (2013) Enhancing gene expression clustering analysis using
tangent transformation. Int J Mach Learn Cybern 4(1):31–40
68. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001)
Model-based clustering and data transformations for gene
expression data. Bioinformatics 17(10):977–987
69. Zitzler E, Laumanns M, Thiele L (2001) SPEA2: improving the
strength Pareto evolutionary algorithm. TIK-Report 103, ETH Zurich