Enhanced Particle Swarm Optimization Firefly Algorithm–Modified Artificial Neural Network for Cancer Classification Using Microarray Gene Data

1 N. Kanchana, Department of Computer Science, Dr.G.R.Damodaran College of Science, Coimbatore. [email protected]
2 N. Muthumani, Department of Computer Applications, S.N.R. Sons College, Coimbatore. [email protected]

Abstract

Microarray data classification is a difficult challenge for machine learning researchers because of its large number of features and small sample sizes. Cancer classification is performed using gene expression datasets. In the existing system, the improved minimum Redundancy Maximum Relevance with Glowworm Swarm Optimization (ImRMR-GSO) algorithm was introduced for efficient cancer classification. However, it produces inaccurate gene classification results owing to the number of errors that occur, so the overall classification performance is degraded on the given datasets. To overcome these issues, the proposed system introduces the Enhanced Particle Swarm Optimization Firefly Algorithm–Modified Artificial Neural Network (EPSOFFA-MANN). This research has three main modules: preprocessing, gene selection and classification. The preprocessing step is performed using an improved kNN (IKNN), which handles the missing and redundant values in the given gene dataset. The reduced, preprocessed dataset is passed to the Feature Selection (FS) step, in which the important and relevant features are selected optimally by EPSOFFA. It is used to select the most informative genes from the given gene array dataset. These features are fed into the MANN to produce more accurate gene classification results. In this module, training and testing are performed on the relevant features, which increases the classification accuracy over the preceding methods.
The results prove that the proposed EPSOFFA-MANN method has better performance in terms of higher classification accuracy, precision, recall and f-measure, and lower time complexity and error rate.

International Journal of Pure and Applied Mathematics, Volume 119, No. 15, 2018, 1693-1717. ISSN: 1314-3395 (on-line version). URL: http://www.acadpubl.eu/hub/ (Special Issue)
Key Words: Gene selection, microarray data, EPSOFFA, MANN, cancer classification.
1. Introduction
Gene expression technology using DNA microarrays allows for the monitoring
of the expression levels of thousands of genes at once. As a direct result of
recent advances in DNA microarray technology, it is now feasible to obtain
gene expression profiles of tissue samples at relatively low costs. Gene
expression profiles provide important insights into, and further our
understanding of, biological processes.
Cancer diagnosis is one of the most important emerging clinical applications of
gene expression microarray technology. An important emerging medical
application domain for microarray gene expression profiling technology is
clinical decision support in the form of diagnosis of disease as well as
prediction of clinical outcomes in response to treatment. Currently, cancer
diagnosis highly depends on a variety of histological observations, including
immunohistochemical assays, which detect cancer biomarker molecules [1].
Most gene expression data sets have a fairly small sample size compared to
the number of genes investigated. This data structure creates an unprecedented
challenge to some classification methodologies. Only a few methods have been successfully applied to the cancer diagnosis problem in previous studies,
including support vector machine (SVM), k-nearest neighbors (KNN), back
propagation neural networks (NN), and probabilistic neural networks (PNN).
Applying data mining techniques has been proved to be an effective approach
towards knowledge discovery based on probe-level data sets consisting of
millions of variables and often only several or dozens of observations (i.e., the
samples). However, since the practitioners and researchers engaged in this domain come from various backgrounds, they cannot all be expected to be deeply familiar with the techniques and algorithms of data mining. This is natural, and the trend of more people becoming interested in this interdisciplinary area is even encouraging, because everyone perceives and processes information from his or her own academic or practical background, which actually fosters the development of genetic engineering [2].
The microarray technique has enabled the concurrent measurement of the expression levels of thousands of messenger RNAs (mRNAs). By mining these data, it is possible to recognize the dynamics of a gene expression time series. Researchers have decreased the dimensionality of such data sets by employing Principal Component Analysis (PCA). An examination of the components provides insight into the underlying factors measured in the experiments. PCA has demonstrated that all of the rhythmic content of the data can be reduced to three main components [3].
The preprocessing step for tackling microarray data has rapidly become indispensable
among researchers, not only to remove redundant and irrelevant features, but
also to help biologists identify the underlying mechanism that relates gene
expression to diseases. The set of techniques used prior to the application of a
data mining method is named as data preprocessing for data mining and it is
known to be one of the most meaningful issues within the famous knowledge
discovery from data process [4]. Since data will likely be imperfect, containing inconsistencies and redundancies, it is not directly suitable for starting a data mining process. Data generation rates, and the sizes of the data, in business, industrial, academic and scientific applications are growing rapidly, and the larger amounts of data collected require more sophisticated mechanisms to analyze them. Data preprocessing adapts the data to the requirements posed by each data mining algorithm, enabling the processing of data that would otherwise be unfeasible.
FS methods are constantly emerging and, for this reason, there is a wide suite of
methods that deal with microarray gene data. Feature selection is often deployed
to select the most influential features from the original feature set for better
classification performance. Feature selection aims to find an ideal feature subset from a problem domain that improves classification accuracy while still representing the original features [5]. It can be perceived as an optimization process that searches for a subset of features which, ideally, is sufficient and appropriate to retain the representative power that describes the target concept in the original set of features of a given data set.
FS is the process of identifying and removing as much of the irrelevant and redundant information as possible. The optimality of a feature subset is measured by an
evaluation criterion. The selection of a small number of highly predictive
features is used to avoid overfitting the training data. It reduces the number of features; removes irrelevant, redundant, or noisy data; and thereby speeds up the data-mining algorithm, improving mining performance such as predictive accuracy and result comprehensibility [6].
Abeel et al. [7] are concerned with the analysis of the robustness of biomarker
selection techniques. They proposed a general experimental setup for stability
analysis that can easily be included in any biomarker identification pipeline. In
addition, they also presented a set of ensemble feature selection methods
improving biomarker stability and classification performance in four microarray
datasets.
The classification of gene expression data samples into distinct classes is a
challenging task. The dimensionality of typical gene expression data sets ranges from several thousand to over ten thousand genes. However, only small sample sizes are typically available for analysis [8]. The outcomes from this
classification process help domain experts to identify the “informative features”
embedded in the data and the relationships between the data items. As such,
they can be used to generate hypotheses about the correlation between genes
and their impact on a specific disease.
2. Related Work
In [9], Martinez et al. (2001) give a theoretical analysis of how well a two-stage algorithm approximates the exact LDA in the sense of maximizing the LDA objective function. The theoretical analysis motivated them to devise a new two-stage LDA algorithm. Their algorithm outperforms PCA+LDA while both have similar scalability. Furthermore, they provide an implementation of this algorithm on a distributed system to handle large-scale problems.
In [10], Alba et al. (2007) compared the use of Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA) (both augmented with Support Vector Machines, SVM) for the classification of high-dimensional microarray data. Both algorithms are used for finding small subsets of informative genes amongst thousands of them. A first contribution is to show that PSO-SVM is able to find interesting genes and to provide competitive classification performance. A second important contribution consists in the actual discovery of new and challenging results on six public datasets, identifying genes significant in the development of a variety of cancers (leukemia, breast, colon, ovarian, prostate, and lung).
In [11], Banerjee et al. (2007) emphasized an evolutionary rough feature selection algorithm to preprocess the array data and obtain a reduced set of features. The proposed algorithm uses redundancy reduction for effective handling of gene expression data, enabling faster convergence. The reducts are generated using rough set theory; they are a minimal set of non-redundant features capable of distinguishing between all objects in a multi-objective framework. The proposed algorithm was implemented on three different cancer samples. Experiments using KNN classifiers showed that the proposed algorithm improved the performance of the classifier on the test set.
In [12], Askarzadeh et al. (2012) used a harmony search (HS)-based parameter identification method to identify the unknown parameters of the solar cell single- and double-diode models. Its simple concept, easy implementation and high performance are the main reasons for the popularity of HS in solving complex optimization problems. For this aim, HS variants are used to determine the
unknown parameters of the models. The effectiveness of the HS variants is
investigated with comparative study among different techniques. Simulation
results manifest the superiority of the HS-based algorithms over the other
studied algorithms in modeling solar cell systems.
In [13], Xu et al. (2013) used a New Artificial Bee Colony (NABC) algorithm, which modifies the search pattern of both employed and onlooker bees. A
solution pool is constructed by storing some best solutions of the current swarm.
New candidate solutions are generated by searching the neighborhood of
solutions randomly chosen from the solution pool. Experiments are conducted
on a set of twelve benchmark functions. Simulation results show that this
approach is significantly better than, or at least comparable to, the original ABC and seven other stochastic algorithms.

Chiş et al. [14] presented clustering, the most popular method that attempts to separate data into disjoint groups such that data points within the same group are similar in their characteristics with respect to a referral point, whereas data points of different groups differ in their characteristics. The groups so described are called clusters; thus, clusters comprise several similar data items or objects with respect to a referral point. Clustering is one of the most important methods in the disciplines of engineering and science, including data compression and statistical data analysis.
3. Proposed Methodology
Input Data
The datasets are real gene expression data and gene samples generated using microarray technology. The results of both implementations are compared to the output from the classification algorithm. This gene expression data has been used to build cancer classifiers. A microarray experiment monitors the expression levels of genes. Patterns can be derived by analyzing the changes in expression of the genes, and new insights can be gained into the underlying biology. In this section, basic terminology, representations of the microarray data and the various methods by which expression data can be analyzed are introduced. Microarray measurements are carried out as differential hybridizations to minimize errors originating from DNA. The overall block diagram of the proposed system is shown in Fig. 1.
Fig. 1: Overall Block Diagram of the Proposed System

Gene expression dataset → Preprocessing using IKNN → Gene selection using EPSOFFA (calculate fitness values using the best fireflies in PSO; generate optimal genes) → Classification using MANN (perform training and testing on genes; neurons classify the relevant genes) → Accurate classification results
3.1. Preprocessing using IKNN
In this research, the k-NN algorithm is introduced to perform data preprocessing. The underlying idea of the k-NN algorithm has served as inspiration for tackling data imperfection. It distinguishes the kinds of data imperfection that need to be addressed, such as noisy data, redundancy and incomplete data. Preprocessing is an important step in the analysis of microarray data. The distance-based similarity idea of the kNN has been widely applied to detect and remove class noise. IKNN is used to eliminate all potentially noisy examples, and it may change the class label of clearly erroneous examples. The given dataset may also contain missing values (MVs) in its attribute values. Intuitively, an MV is simply an absent value for an attribute; human or equipment errors are some of the reasons for their existence. Once again, this imperfection in the data influences the mining process and its outcome. The simplest way of dealing with MVs is to discard the attributes that contain them. The imputation of MVs is a procedure that aims to fill in the MVs by estimating them. Data reduction aims to obtain a smaller representative set of attributes from the raw data without losing important information. This process alleviates data storage requirements as well as improving the later data mining process, and it may result in the elimination of noisy information as well as redundant or irrelevant data.
Algorithm 1: IKNN
Step 1: Begin
Input: D = {(x1, c1), (x2, c2), …, (xn, cn)}
Step 2: For every labeled instance (xi, ci)
Step 3: Calculate d(xi, x)
Step 4: Order d(xi, x) from lowest to highest, (i = 1, …, N)
Step 5: Compute missing values:

x = (Σ_{i=1}^{n} w_i x_i) / (Σ_{i=1}^{n} w_i)   (1)

Fill in the missing value. Compute redundancy:

r_{A,B} = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)   (2)

Filter out repeated values.
Step 6: Select the K nearest instances to x: D_x^K
Step 7: Assign to x the most frequent class in D_x^K
Step 8: End
This algorithm is used to provide a more accurate gene dataset, which increases the gene classification accuracy over the previous system.
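The imputation step of Algorithm 1 can be sketched as follows. This is a minimal illustration of equation (1), filling each missing value with a distance-weighted mean of the k nearest complete rows; the inverse-distance choice of weights w_i and the value of k are assumptions, since the paper leaves them unspecified.

```python
import numpy as np

def iknn_impute(X, k=3):
    """Fill missing values (NaN) in X using a distance-weighted mean of the
    k nearest complete rows, per equation (1): x = sum(w_i * x_i) / sum(w_i).
    Inverse-distance weights and k are assumed settings, not from the paper."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]          # rows with no missing values
    for r, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Euclidean distance to each complete row over the observed attributes
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-8)               # inverse-distance weights w_i
        filled[r, miss] = (w[:, None] * complete[nearest][:, miss]).sum(axis=0) / w.sum()
    return filled
```

The redundancy filter of equation (2) would then drop attribute pairs whose Pearson correlation r_{A,B} exceeds a chosen threshold.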
3.2. Feature Selection using EPSOFFA with Firefly Algorithm
In this research, gene selection is performed by using EPSOFFA. It is focused
to improve the relevant gene data and reduce the irrelevant genes from the given
gene dataset.
The PSO is a computational approach that optimizes a problem in continuous, multidimensional search spaces. PSO starts with a swarm of random particles. Each particle is associated with a velocity. Particles' velocities are adjusted according to the historical behavior of each particle and its neighbors as they fly through the search space. Thus, the particles have a tendency to move towards the better regions of the search space. The version of the PSO algorithm utilized is described mathematically by the following equations:
Each particle updates its own position and velocity according to formulas (3) and (4) in every iteration.

v_id^{k+1} = ω v_id^k + c1 γ1 (p_id^k − x_id^k) + c2 γ2 (p_gd^k − x_id^k) + α (rand − 1/2)   (3)

x_id^{k+1} = 1 if s(v_id^{k+1}) > rand(0, 1), 0 otherwise   (4)

where s(v_id^{k+1}) is the sigmoid function S(v_id) = 1/(1 + exp(−v_id)); i = 1, 2, 3, …, m, where m is the number of particles in the swarm; v_id^k and x_id^k stand for the velocity and position of the ith particle at the kth iteration, respectively; p_id^k denotes the previously best position of particle i; and p_gd^k denotes the global best position of the swarm. ω is the inertia weight, c1 and c2 are acceleration constants (the general values of c1 and c2 are in the interval [0, 2]), and γ1 and γ2 are random numbers in the range [0, 1].
Each feature subset can be considered as a point in feature space. The optimal
point is the subset with least length and highest classification accuracy. The
initial swarm is distributed randomly over the search space, each particle takes
one position. The goal of particles is to fly to the best position. By passing the
time, their position is changed by communicating with each other, and they
search around the local best and global best positions. Finally, they should converge on good, possibly optimal, positions, since they have an exploration ability that equips them to perform FS and discover optimal subsets.
The velocity of each particle is represented as a positive integer; particle velocities are bounded by a maximum velocity Vmax. The velocity indicates how many features should be changed to match the global best point; in other words, it is the rate at which the particle moves toward the best position. The number of differing features (bits) between two particles relates to the difference between their positions.
After updating the velocity, a particle's position will be updated by the new velocity. Suppose that the new velocity is V. In this case, V bits of the particle
are randomly changed, different from that of Pg. The particles then fly toward
the global best while still exploring the search area, instead of simply being
same as Pg. The Vmax is used as a constraint to control the global exploration
ability of particles. A larger Vmax provides global exploration, while a smaller
Vmax increases local exploitation. When Vmax is low, particles have difficulty getting out of locally optimal regions; if Vmax is too high, the swarm might fly past good solutions. The objective function is computed as follows:
F(X_i) = ϕ · γ_{S_i(t)} + φ · (n − |S_i(t)|)   (5)

where S_i(t) is the feature subset found by particle i at iteration t, and |S_i(t)| is its length. Fitness is computed according to both the measure of classifier performance, γ_{S_i(t)}, and the feature subset length. ϕ and φ are two parameters that control the relative weight of classifier performance and feature subset length, with ϕ ∈ [0, 1] and φ = 1 − ϕ. This formula denotes that the classifier performance and the feature subset length have different effects on gene selection.
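The binary update of equations (3)–(4) and the fitness of equation (5) can be sketched as below. The classifier-performance term γ is passed in as a plain number (computing it would require a wrapped classifier), and dividing the length term by n so that both terms share the [0, 1] scale is an assumption not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bpso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, alpha=0.1, vmax=4.0):
    """One binary-PSO iteration per equations (3) and (4).
    x: (m, d) 0/1 positions (1 = gene selected); v: (m, d) velocities."""
    g1, g2 = rng.random(x.shape), rng.random(x.shape)
    v = (w * v + c1 * g1 * (pbest - x) + c2 * g2 * (gbest - x)
         + alpha * (rng.random(x.shape) - 0.5))
    v = np.clip(v, -vmax, vmax)                  # bound velocities by Vmax
    s = 1.0 / (1.0 + np.exp(-v))                 # sigmoid S(v)
    x = (s > rng.random(x.shape)).astype(int)    # stochastic bit flip, eq. (4)
    return x, v

def fitness(subset, gamma, n, phi=0.8):
    """Equation (5): weigh classifier performance 'gamma' against subset
    length; phi and (1 - phi) trade off the two terms. Normalizing the
    length term by n is an assumption made here."""
    return phi * gamma + (1.0 - phi) * (n - subset.sum()) / n
```

With phi close to 1, accuracy dominates; lowering phi pushes the swarm toward smaller gene subsets.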
To enhance the optimal solutions for the gene selection dataset, the PSO is hybridized with the FFA. The PSO has a lower convergence rate when the number of genes grows large, so time complexity becomes an issue, and the accuracy of gene classification is lowered by the misclassification of important genes. To overcome these issues, EPSOFFA is proposed in this research.
The light intensity of each firefly determines its brightness and hence its attractiveness. The attractiveness of a firefly is calculated using equation (6):

β(r) = β0 e^{−γ r_ij²}   (6)

where r_ij = d(x_i, x_j) is the Euclidean distance between two data points i and j. In general, β0 ∈ [0, 1] describes the attractiveness at distance r = 0, i.e., when two data points are found at the same point of the search space S. The value of γ ∈ [0, 10] determines the variation of the attractiveness with increasing distance between communicating data points; it is basically the light absorption coefficient.
The movement of firefly i, which is attracted toward another, brighter firefly j, is defined using equation (7):

x_i = x_i + β0 e^{−γ r_ij²} (x_j − x_i) + α (rand − 1/2)   (7)

where α is the randomization parameter in the interval [0, 1] and rand is a random number generator with numbers uniformly distributed in the range [0, 1]. The parameter γ controls the variation in attractiveness and defines the convergence.
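The attraction step of equations (6)–(7) can be sketched as follows. The factor (x_j − x_i) multiplying the attractiveness is the standard firefly-algorithm form, assumed here because the printed equation omits it; β0, γ and α are left at illustrative defaults.

```python
import numpy as np

rng = np.random.default_rng(1)

def firefly_move(xi, xj, beta0=1.0, gamma=1.0, alpha=0.2):
    """Move firefly i toward a brighter firefly j per equations (6)-(7):
    beta = beta0 * exp(-gamma * r^2) with r the Euclidean distance, then
    x_i <- x_i + beta * (x_j - x_i) + alpha * (rand - 1/2)."""
    r2 = np.sum((xi - xj) ** 2)          # squared Euclidean distance r_ij^2
    beta = beta0 * np.exp(-gamma * r2)   # attractiveness, equation (6)
    return xi + beta * (xj - xi) + alpha * (rng.random(xi.shape) - 0.5)
```

With α = 0 the move is deterministic and strictly shortens the distance to the brighter firefly, since β ∈ (0, 1).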
Algorithm 2: EPSOFFA
Step 1: Initialize a population with N individuals
Step 2: Initialize the position and velocity of each particle (gene) in the swarm
Step 3: While the maximum number of iterations is not reached, do
Step 4: Set the algorithm factors: objective function f(x), where x = (x1, …, xd)^T
Step 5: Generate the primary population of fireflies x_i (i = 1, 2, …, n)
Step 6: Define the light intensity I_i at x_i via f(x_i)
Step 7: While (t < MaxGen) do
Step 8: For i = 1 to n (all n fireflies/genes)
Step 9: For j = 1 to n (all n fireflies)
Step 10: If (I_j > I_i), move f_i towards f_j
Step 11: End if
Step 12: Attractiveness varies with distance r via exp(−r²)
Step 13: Evaluate new solutions and update light intensities
Step 14: End for j
Step 15: End for i
Step 16: Construct the new position of the firefly solution
Step 17: Evaluate the fitness of the new firefly solution, which is directly proportional to its brightness
Step 18: If the fitness value is better than its personal best (pBest)
Step 19: Set the current value as the new pBest
Step 20: End
Step 21: Choose the particle with the best fitness value of all as gBest
Step 22: For each particle
Step 23: Calculate the particle velocity and update the particle position according to equations (6) and (7)
Step 24: Randomly select a gBest for particle i from the highest-ranked solutions
Step 25: Update the velocity and position of the particle based on the best firefly behavior, using (3) and (4)
Step 26: Return the most informative gene features
Step 27: Update pBest and gBest
Step 28: Return the positions of the genes
Step 29: End
The efficiency of the gene classification model based on the EPSOFFA algorithm relies on the learning acquired from the gene dataset. An appropriate dataset helps the hybrid model to attain the desired gene classification performance. The EPSOFFA approach finds and ranks the most informative gene features from the given microarray gene dataset to provide an optimal classification result. In this way, EPSOFFA evolves an optimal dataset that helps in building an effective gene classification model.
3.3. Classification using MANN
In the proposed system, the Modified Artificial Neural Network (MANN) is introduced to improve the overall gene classification results. The MANN proposed in this section is a faster algorithm than the TSVM approach and has the advantage of higher accuracy in both training and testing results.
There are two main phases in the operation of ANN such as learning and testing
[15]. Learning is the process of adapting or modifying the neural network
weights in response to the training input patterns being presented at the input
layer.
The optimal configuration for the gene dataset includes the total number of layers, the number of hidden units in the middle layers, the number of units in the input and output layers (in terms of accuracy on a test set), and the training algorithm used during the learning phase. This configuration performs the required extraction of knowledge from a noisy training set to achieve better gene prediction.
The aim of learning is to minimize a cost function based on the error signal
ei(t), with respect to network parameters (weights), such that the actual response
of each output neuron in the network approaches the target response [16]. A
criterion commonly used for the cost function is the MSE criterion, defined as the mean-square value of the sum of squared errors:

J = E[ (1/2) Σ_i (e_i(t))² ]   (8)
  = E[ (1/2) Σ_i (d_i(t) − y_i(t))² ]   (9)
Where E is the statistical expectation operator and the summation is over all the
neurons of the output layer. Usually the adaptation of weights is performed by
using the desired signal d_i(t) only. In [17] it is stated that a new signal d_i(t) + n_i(t) can be used as the desired signal for output neuron i instead of the original desired signal d_i(t), where n_i(t) is a noise term. This noise term is assumed to be white Gaussian noise, independent of both the input signals x_k(t) and the desired signals d_i(t).
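The cost of equations (8)–(9) can be evaluated over a batch as sketched below; approximating the expectation E by an average over the training patterns is an assumption made for the illustration.

```python
def mse_cost(targets, outputs):
    """Cost J from equations (8)-(9): the average over patterns of
    1/2 * sum_i (d_i(t) - y_i(t))^2, where d are desired and y actual
    outputs. Averaging stands in for the expectation operator E."""
    total = 0.0
    for d, y in zip(targets, outputs):
        total += 0.5 * sum((di - yi) ** 2 for di, yi in zip(d, y))
    return total / len(targets)
```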
This is important in every ANN model and needs to be cross-checked in order to find the optimal solution. ANNs are applied to many different tasks, such as prediction, classification and time series projection, that require different structures and features to be examined.

ANNs were developed as generalizations of mathematical models of biological nervous systems. A first wave of interest in neural networks (also known as connectionist models or parallel distributed processing) emerged after the introduction of simplified neurons.
The basic processing elements of neural networks are called artificial neurons,
or simply neurons or nodes. In a simplified mathematical model of the neuron,
the effects of the synapses are represented by connection weights that modulate
the effect of the associated input signals, and the nonlinear characteristic
exhibited by neurons is represented by a transfer function. The neuron impulse
is then computed as the weighted sum of the input signals, transformed by the
transfer function. The learning capability of an artificial neuron is achieved by
adjusting the weights in accordance to the chosen learning algorithm.
The neuron output signal o is given by the following relationship:

o = f( Σ_{j=1}^{n} w_j x_j )   (10)

where w_j is the weight vector and x_j is the input to the neuron.
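Equation (10) amounts to a weighted sum passed through a transfer function, as in this small sketch; the sigmoid is used here only because the text leaves f unspecified, so that choice is an assumption.

```python
import math

def neuron_output(x, w, f=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """Equation (10): o = f(sum_j w_j * x_j). The sigmoid transfer
    function f is an assumed default, not fixed by the paper."""
    return f(sum(wj * xj for wj, xj in zip(w, x)))
```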
The basic architecture consists of three types of neuron layers: input, hidden,
and output layers. In feed-forward networks, the signal flow is from input to
output units, strictly in a feed-forward direction. The data processing can extend
over multiple (layers of) units, but no feedback connections are present.
A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. One way is to set the weights explicitly, using a priori knowledge. Another is to train the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. The learning situation considered here is supervised learning: an input vector is presented at the inputs together with a set of desired responses, one for each node, at the output layer. A forward pass is done, and the errors or discrepancies between the desired and actual responses for each node in the output layer are found.
To create a robust and reliable network, some noise or other randomness is
added to the training data to get the network familiarized with noise and natural
variability in real data. Poor training data inevitably leads to an unreliable and
unpredictable network. The MANN is trained for a prefixed number of epochs or until the output error decreases below a particular error threshold. Special care must be taken not to overtrain the network. Through overtraining, the network may become too adapted to learning the samples from the training set and thus may be unable to accurately classify samples outside the training set. To overcome this issue, the MANN is improved in terms of a better choice of the number of neurons. With too few hidden neurons, the network may be unable to learn the relationships amongst the data, and the error is increased; a sufficiently large number of hidden neurons ensures correct learning, and the network is able to correctly predict the data it has been trained on.
Algorithm 3: MANN
Step 1: Create an initial ANN consisting of three layers, i.e., an input, an output, and a hidden layer.
Step 2: For each training pattern
Step 3: Apply the input gene features to the network
Step 4: Train the network on the training set until the error is almost constant for a certain number of training epochs τ specified by the user
Step 5: Consider the genes for the input nodes and hidden nodes
Step 6: Calculate the output of every neuron from the input layer, through the hidden layer(s), to the output layer
Step 7: Compute the error of the MANN using (9)
Step 8: If the error is found unacceptable (i.e., too large), then assume that the MANN has an inappropriate architecture and go to the next step.
Step 9: Otherwise, stop the training process. The error E is calculated according to the following equations:

E(w, v) = (1/2) Σ_{i=1}^{k} Σ_{p=1}^{C} (S_pi − t_pi)²   (11)

where k is the number of patterns and C is the number of output nodes. t_pi and S_pi are the target and actual outputs for the ith pattern at the pth output node. The actual output S_pi is calculated according to the following equation:

S_pi = σ( Σ_{m=1}^{h} δ(x_i^T w_m) v_m )   (12)

Here h is the number of hidden nodes in the network; x_i is an n-dimensional input pattern, i = 1, 2, …, k; w_m are the weights for the arcs connecting the input layer and the m-th hidden node, m = 1, 2, …, h; v_m are the weights for the arcs connecting the m-th hidden node and the output layer; σ is the activation function; and δ is the hidden-layer hyperbolic tangent function.
Step 10: Add one hidden node to the hidden layer. Randomly initialize the weights of the newly added node and repeat the training process using (10).
Step 11: The neurons check the error values
Step 12: Apply the weight adjustments
Step 13: The training and testing of the MANN use the gene features
Step 14: Classify the gene output features
Step 15: The genes are classified more accurately
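The grow-if-error-unacceptable loop of Algorithm 3 can be sketched as below: a one-hidden-layer network with tanh hidden units (δ) and a sigmoid output (σ), trained by gradient descent on the error of equation (11), adding a randomly initialized hidden node whenever the error stays too large (step 10). The learning rate, epoch budget, tolerance and node cap are all assumed settings, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_mann(X, T, h=2, max_hidden=16, epochs=500, lr=0.5, tol=0.05):
    """Sketch of Algorithm 3: train on E = 1/2 * sum (S - t)^2 (eq. 11)
    and grow the hidden layer while the error remains unacceptable."""
    n, d = X.shape
    W = rng.normal(0, 0.5, (d, h))              # input -> hidden weights w_m
    V = rng.normal(0, 0.5, (h, T.shape[1]))     # hidden -> output weights v_m
    while True:
        for _ in range(epochs):
            H = np.tanh(X @ W)                  # hidden activations (delta)
            S = 1.0 / (1.0 + np.exp(-(H @ V)))  # output activations (sigma)
            err = S - T
            dV = H.T @ (err * S * (1 - S))
            dW = X.T @ (((err * S * (1 - S)) @ V.T) * (1 - H ** 2))
            V -= lr * dV / n
            W -= lr * dW / n
        E = 0.5 * np.sum((S - T) ** 2)          # equation (11)
        if E <= tol or W.shape[1] >= max_hidden:
            return W, V, E
        # error unacceptable: add one randomly initialized hidden node (step 10)
        W = np.hstack([W, rng.normal(0, 0.5, (d, 1))])
        V = np.vstack([V, rng.normal(0, 0.5, (1, T.shape[1]))])
```

On a toy problem such as XOR, the loop keeps adding hidden nodes until the squared error falls below the tolerance or the node cap is reached.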
4. Experimental Results

In this section, we evaluate the overall performance of the gene selection methods using six popular binary and multiclass microarray cancer datasets, which were downloaded from http://www.gems-system.org/. These datasets have been widely used to benchmark the performance of gene selection methods in the bioinformatics field. The binary-class microarray datasets are colon [18], leukemia [18, 19], and lung [20], while the multiclass microarray datasets are SRBCT [21], lymphoma [22], and leukemia [23]. Table 1 gives a detailed description of these six benchmark microarray gene expression datasets with respect to the number of classes, number of samples, number of genes, and a brief description of each dataset's construction.
Table 1: Gene Datasets
Dataset          No. of classes   No. of samples   No. of genes
Colon [18]       2                62               2000
Leukemia1 [19]   2                72               7129
Lung [20]        2                96               7129
SRBCT [21]       4                83               2308
Lymphoma [22]    3                62               4026
Leukemia2 [23]   3                72               7129
In this study, the performance of the proposed EPSOFFA-MANN algorithm is tested by comparing it with other standard bio-inspired algorithms, namely ImRMR-HCSO, ImRMR-GSO and mRMR-ABC. We compare the performance of each gene selection approach on parameters such as classification accuracy, error rate, precision, recall, time complexity and the number of predictive genes used for cancer classification. Classification accuracy is the overall correctness of the classifier and is calculated as the number of correct cancer classifications divided by the total number of classifications.
Performance Metrics

4.1. Accuracy
Classification accuracy = (CC / N) × 100
where N is the total number of instances in the initial microarray dataset and CC is the number of correctly classified instances.

4.2. Precision
Precision is the ratio of true positive outcomes to the sum of true positive and false positive outcomes:
Precision = TP / (TP + FP)

4.3. Recall
Recall is computed from the true positive and false negative predictions:
Recall = TP / (TP + FN)

4.4. F-Measure
The F-measure summarizes a test's accuracy by combining its precision and recall:
F-measure = (2 × Precision × Recall) / (Precision + Recall)

4.5. Time Complexity
An algorithm is superior when it achieves a lower time complexity on the given dataset.

4.6. Error Rate
A system is better when its algorithm yields a lower error rate.
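As a hedged illustration, the metrics above can be computed from raw binary predictions as in the NumPy sketch below; the labelling is made up and not tied to the paper's datasets:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for a binary labelling,
    computed from TP/FP/FN counts as defined in the metrics section."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    cc = np.sum(y_pred == y_true)                # correctly classified instances
    accuracy = 100.0 * cc / y_true.size          # (CC / N) x 100
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

acc, p, r, f = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(acc, p, r, f)
```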
The comparison results for the binary-class microarray datasets colon, leukemia1, and lung are shown in Tables 2, 3, and 4, respectively, while Tables 5, 6, and 7 present the comparison results for the multiclass microarray datasets SRBCT, lymphoma, and leukemia2.
From these tables, it is clear that the proposed EPSOFFA-MANN algorithm performs better than the ImRMR-HCSO with TSVM, ImRMR-GSO with RF SVM and mRMR-ABC algorithms in every single case (i.e., on all datasets and for every number of selected genes).
Table 2: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with the RFSVM, mRMR-ABC Classification Performance for Colon
Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
3 87.50 88 89.99 92.01
4 88.27 89.9 91.21 93.33
5 89.5 90 92.56 94.77
6 90.12 90.80 93.83 95.99
7 91.64 92 94.41 96.32
8 91.8 92.2 95.79 97.45
9 92.11 92.75 96.10 98.22
10 92.74 93.1 96.77 98.99
15 93.6 94 97.55 99.01
20 94.17 94.8 97.89 99.65
Table 3: Comparison between EPSOFFA-MANN and ImRMR- HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Leukemia1 Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 89.63 90 91.34 93.5
3 90.37 91 92.76 94
4 91.29 92 93.99 95
5 92.82 93 94 96
6 92.82 93 94.41 97
7 93.10 93.50 95.82 97.66
10 94.44 95 96 98
13 94.93 95 96.33 98.3
14 95.83 96 97 98.99
Table 4: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Lung Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 95.83 96 97 98
3 96.31 97 98.2 98.55
4 97.91 98 98.7 98.99
5 97.98 99 98.99 99.34
6 98.27 98.6 98.99 99.78
7 98.53 98.85 98.99 99.79
8 98.95 99 99.2 99.8
Table 5: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for SRBCT Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 71.08 71.6 82 85
3 79.51 80 83 87
4 84.33 84.9 85 88
5 86.74 87 88 90
6 91.56 92 93 95
7 94.05 94.5 95 97
8 96.3 96.9 97 98
Table 6: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Lymphoma Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 86.36 86.9 88 90
3 90.90 91.2 92 94
4 92.42 92.8 94 96
5 96.96 97.1 97.99 98.55
Table 7: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Leukemia 2 Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 84.72 85.03 86 88
3 86.11 86.5 87 90
4 87.5 87.9 88 91
5 88.88 89 89.5 92
6 90.27 90.65 91 93
7 89.49 89.9 92 94.5
8 91.66 92.05 93 95.69
9 92.38 92.7 94 97.8
10 91.66 92.1 95 98
15 94.44 94.85 96 98.33
18 95.67 96 97 98.77
20 96.12 96.5 97.7 99
Fig. 2: Feature Selection Results Comparison for Colon Dataset
The comparison results for the binary-class microarray datasets colon, leukemia1, and lung are shown in Figs. 2, 3, and 4, respectively, while Figs. 5, 6, and 7 present the comparison results for the multiclass microarray datasets SRBCT, lymphoma, and leukemia2. From these figures, it is clear that the proposed EPSOFFA-MANN algorithm performs better than the ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC algorithms in every single case (i.e., on all datasets and for every number of selected genes).
Fig. 3: Feature Selection Results Comparison for Leukemia1 Dataset
(Each figure plots classification accuracy on the y-axis against the number of selected genes on the x-axis for mRMR-ABC, ImRMR-GSO with RF SVM, ImRMR-HCSO with TSVM and EPSOFFA-MANN.)
Fig. 4: Feature Selection Results Comparison for Lung Dataset
Fig. 5: Feature Selection Results Comparison for SRBCT Dataset
Fig. 6: Feature Selection Results Comparison for Lymphoma Dataset
Fig. 7: Feature Selection Results Comparison for Leukemia2 Dataset
Thus, ImRMR is a promising method for identifying relevant genes and omitting redundant and noisy genes. We can conclude that the proposed EPSOFFA-MANN algorithm achieves accurate classification performance with a minimum number of selected genes on all datasets, compared with the ImRMR-HCSO with TSVM, mRMR-ABC and ImRMR-GSO algorithms under the same cross-validation approach. Therefore, the EPSOFFA-MANN algorithm is a promising approach for solving gene selection and cancer classification problems.
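The evaluation described above rests on cross-validated classification accuracy over a small subset of selected genes. A minimal NumPy sketch of that protocol, using synthetic data and a nearest-centroid stand-in classifier rather than the authors' pipeline, might look like:

```python
import numpy as np

def cv_accuracy(X, y, n_splits=5, seed=0):
    """5-fold cross-validated accuracy of a nearest-centroid classifier,
    a placeholder for the classifiers compared in this section."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_splits)
    accs = []
    for i in range(n_splits):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        # per-class centroids estimated on the training fold only
        centroids = {c: X[tr][y[tr] == c].mean(axis=0) for c in np.unique(y[tr])}
        classes = sorted(centroids)
        d = np.stack([np.linalg.norm(X[te] - centroids[c], axis=1) for c in classes])
        pred = np.array(classes)[np.argmin(d, axis=0)]
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

# toy, well-separated data standing in for a matrix of selected gene features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(3, 1, (30, 10))])
y = np.array([0] * 30 + [1] * 30)
acc = cv_accuracy(X, y)
print(round(acc, 3))
```

Holding the fold assignment fixed across algorithms, as implied by "the same cross-validation approach", keeps the accuracy comparison fair.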
Fig. 8: Time Complexity Results Comparison for all Given Datasets
Fig. 8 shows the time complexity comparison for the given datasets, with the datasets on the x-axis and the time complexity (in seconds) on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm has a lower time complexity than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, confirming that the proposed algorithm is superior to the existing systems in terms of classification.
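As a rough illustration of how per-dataset timing values like those in Fig. 8 could be collected, the sketch below times one train-and-predict cycle of a placeholder nearest-mean classifier; the data sizes and classifier are assumptions, not the authors' setup:

```python
import time
import numpy as np

def timed_fit_predict(classify, X_train, y_train, X_test):
    """Wall-clock time of one train+predict cycle, one way to obtain the
    per-dataset timing values compared across algorithms."""
    start = time.perf_counter()
    preds = classify(X_train, y_train, X_test)
    return preds, time.perf_counter() - start

def nearest_mean(X_train, y_train, X_test):
    # nearest-mean toy classifier standing in for EPSOFFA-MANN
    means = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    classes = sorted(means)
    d = np.stack([np.linalg.norm(X_test - means[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(d, axis=0)]

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7129))       # leukemia-sized placeholder matrix
y = rng.integers(0, 2, size=72)
preds, seconds = timed_fit_predict(nearest_mean, X[:50], y[:50], X[50:])
print(preds.shape, seconds > 0)
```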
Fig. 9: Precision Results Comparison for All Given Datasets
Fig. 9 shows the precision comparison for the given datasets, with the datasets on the x-axis and the precision value on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a higher precision than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
Fig. 10: Recall Results Comparison for All Given Datasets
Fig. 10 shows the recall comparison for the given datasets, with the datasets on the x-axis and the recall value on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a higher recall than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
Fig. 11: F-measure Results Comparison for All Given Datasets
Fig. 11 shows the F-measure comparison for the given datasets, with the datasets on the x-axis and the F-measure value on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a higher F-measure than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
Fig. 12: Average Error Rate
Fig. 12 shows the error rate comparison for the given datasets, with the datasets on the x-axis and the error rate on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a lower error rate than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
5. Conclusion
Microarray data can be used in the discovery and prediction of cancer classes. Various approaches have been applied to select genes efficiently for cancer classification. In this research work, EPSOFFA-MANN is proposed to improve the overall system performance. The approach comprises three modules: preprocessing, feature selection and classification. Preprocessing removes noisy data from the dataset using the IKNN algorithm, which fills missing values and removes redundant values effectively; the reduced dataset is then passed to feature selection. The important and relevant genes are selected using EPSOFFA, whose fitness function yields an optimal feature subset. The selected features are then passed to the classification phase, where the MANN algorithm classifies the genes and provides more accurate results. The results show that the proposed system achieves higher classification accuracy, precision, recall and F-measure, indicating better cancer classification on the specified microarray databases, while also reducing time complexity and error rate significantly. Thus, the results confirm that the proposed system outperforms the existing systems.
References
[1] Ntzani, Evangelia E., and John PA Ioannidis. "Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment."The Lancet 362.9394 (2003): 1439-1444.
[2] Tseng and C.-P. Kao, “Efficiently mining gene expression data via a novel parameterless clustering method,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2(4), pp. 355–365, 2005.
[3] Layana, C. and Diambra, L. “Dynamical analysis of circadian gene expression”, International Journal of Biological and Life Sciences, Vol.8, No.3, pp.101–5, 2007
[4] García S, Luengo J, Herrera F. Data Preprocessing in Data Mining. Berlin: Springer; 2015.
[5] Fong, Simon, Xin-She Yang, and Suash Deb. "Swarm search for feature selection in classification." 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE). IEEE, 2013.
[6] Al-Ani, Ahmed, Mohamed Deriche, and Jalel Chebil. "A new mutual information based measure for feature selection." Intelligent Data Analysis 7.1 (2003): 43-57.
[7] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, "Robust biomarker identification for cancer diagnosis with ensemble feature selection methods," Bioinformatics 26 (3) (2010) 392–398.
[8] Y. Zhang and J. C. Rajapakse. Machine Learning in Bioinformatics. Wiley Series in Bioinformatics, 1st edition, 2008.
[9] Martinez, A. M. and Kak, A. C. Pca versus lda. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
[10] Alba, Enrique, et al. "Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms." Evolutionary Computation, 2007. CEC 2007. IEEE Congress on. IEEE, 2007.
[11] Banerjee, M. Mitra, S. and Banka, H. “Evolutionary Rough Feature Selection in Gene Expression Data”, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 37, No. 4, pp. 622-632, 2007
[12] Askarzadeh, Alireza, and Alireza Rezazadeh. "Parameter identification for solar cell models using harmony search-based algorithms." Solar Energy 86.11 (2012): 3241-3249.
[13] Xu, Yunfeng, Ping Fan, and Ling Yuan. "A simple and efficient artificial bee colony algorithm." Mathematical Problems in Engineering 2013 (2013).
[14] Chiş, M., "A new evolutionary hierarchical clustering technique," Babeş-Bolyai University Research Seminars, Seminar on Computer Science, 2000, 13-20.
[15] Badri, Lubna. "Development of Neural Networks for Noise Reduction." Int. Arab J. Inf. Technol. 7.3 (2010): 289-294.
[16] Dorronsoro J., López V., Cruz C., and Sigüenza J., “Auto associative Neural Networks and Noise Filtering,” IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1431-1438, 2003.
[17] Tsenov, Georgi T., and Valeri M. Mladenov. "Speech recognition using neural networks." Neural Network Applications in Electrical Engineering (NEUREL), 2010 10th Symposium on. IEEE, 2010.
[18] U. Alon, N. Barka, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
[19] T. R. Golub, D. K. Slonim, P. Tamayo et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
[20] D. G. Beer, S. L. R. Kardia, C.-C. Huang et al., “Gene-expression profiles predict survival of patients with lung adenocarcinoma,” Nature Medicine, vol. 8, no. 8, pp. 816–824, 2002.
[21] J. Khan, J. S. Wei, M. Ringnér et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[22] A. A. Alizadeh, M. B. Elsen, R. E. Davis et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000.
[23] S. A. Armstrong, J. E. Staunton, L. B. Silverman et al., “MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia,” Nature Genetics, vol. 30, no. 1, pp. 41–47, 2001.