Enhanced Particle Swarm Optimization Firefly Algorithm–Modified Artificial Neural Network for Cancer Classification Using Microarray Gene Data

1 N. Kanchana, Department of Computer Science, Dr.G.R.Damodaran College of Science, Coimbatore. [email protected]
2 N. Muthumani, Department of Computer Applications, S.N.R. Sons College, Coimbatore. [email protected]

Abstract

Microarray data classification is a difficult challenge for machine learning researchers because of its large number of features and small sample sizes. Cancer classification is performed using gene expression datasets. In the existing system, the improved minimum Redundancy Maximum Relevance with Glowworm Swarm Optimization (ImRMR-GSO) algorithm was introduced for efficient cancer classification. However, it produces inaccurate gene classification results owing to the number of errors that occur, so the overall classification performance is degraded on the given datasets. To overcome these issues, the proposed system introduces the Enhanced Particle Swarm Optimization Firefly Algorithm–Modified Artificial Neural Network (EPSOFFA-MANN). This research has three main modules: preprocessing, gene selection and classification. The preprocessing step is performed using an improved kNN (IKNN), which handles the missing and redundant values in the given gene dataset. The reduced, preprocessed dataset is passed to the Feature Selection (FS) step, in which the important and relevant features are selected optimally by EPSOFFA. It is used to select the most informative genes from the given gene array dataset. These features are fed into the MANN to produce more accurate gene classification results. In this module, training and testing are performed on the relevant features, which increases the classification accuracy over the preceding methods.
The results prove that the proposed EPSOFFA-MANN method has better performance in terms of higher classification accuracy, precision, recall and f-measure, and lower time complexity and error rate.

International Journal of Pure and Applied Mathematics, Volume 119, No. 15, 2018, 1693-1717. ISSN: 1314-3395 (on-line version). URL: http://www.acadpubl.eu/hub/ (Special Issue)
Key Words: Gene selection, microarray data, EPSOFFA, MANN, cancer classification.
1. Introduction
Gene expression technology using DNA microarrays allows for the monitoring
of the expression levels of thousands of genes at once. As a direct result of
recent advances in DNA microarray technology, it is now feasible to obtain
gene expression profiles of tissue samples at relatively low costs. Gene
expression profiles provide important insights into, and further our
understanding of, biological processes.
Cancer diagnosis is one of the most important emerging clinical applications of
gene expression microarray technology. An important emerging medical
application domain for microarray gene expression profiling technology is
clinical decision support in the form of diagnosis of disease as well as
prediction of clinical outcomes in response to treatment. Currently, cancer
diagnosis highly depends on a variety of histological observations, including
immunohistochemical assays, which detect cancer biomarker molecules [1].
Most gene expression data sets have a fairly small sample size compared to
the number of genes investigated. This data structure creates an unprecedented
challenge to some classification methodologies. Only a few methods have been successfully applied to the cancer diagnosis problem in previous studies,
including support vector machine (SVM), k-nearest neighbors (KNN), back
propagation neural networks (NN), and probabilistic neural networks (PNN).
Applying data mining techniques has been proved to be an effective approach
towards knowledge discovery based on probe-level data sets consisting of
millions of variables and often only several or dozens of observations (i.e., the
samples). However, since the practitioners and researchers engaged in this domain come from various backgrounds, they cannot all be expected to be deeply familiar with the techniques and algorithms of data mining. This is natural, and the trend of more people becoming interested in this interdisciplinary area is even encouraging, because everyone perceives and processes information from his or her own academic or practical background, which actually fosters the development of genetic engineering [2].
The microarray technique has enabled the concurrent measurement of the expression levels of thousands of messenger RNAs (mRNAs). By mining these data, it is possible to recognize the dynamics of a gene expression time series. Researchers have decreased the dimensionality of such data sets by employing Principal Component Analysis (PCA). An examination of the components provides insight into the underlying factors measured in the experiments. PCA has demonstrated that all of the rhythmic content of the data can be reduced to three main components [3].
The preprocessing step for tackling microarray data has rapidly become indispensable
among researchers, not only to remove redundant and irrelevant features, but
also to help biologists identify the underlying mechanism that relates gene
expression to diseases. The set of techniques used prior to the application of a
data mining method is named as data preprocessing for data mining and it is
known to be one of the most meaningful issues within the famous knowledge
discovery from data process [4]. Since data will likely be imperfect, containing inconsistencies and redundancies, it is not directly suitable for starting a data mining process. Data generation rates, and the sizes of the data, in business, industrial, academic and scientific applications are growing rapidly, and the larger amounts of data collected require more sophisticated mechanisms to analyze them. Data preprocessing adapts the data to the requirements posed by each data mining algorithm, enabling the processing of data that would otherwise be unfeasible.
FS methods are constantly emerging and, for this reason, there is a wide suite of
methods that deal with microarray gene data. Feature selection is often deployed
to select the most influential features from the original feature set for better
classification performance. Feature selection aims to find an ideal feature subset from a problem domain that improves classification accuracy while still representing the original features [5]. It can be perceived as an optimization process that searches for a subset of features which, ideally, is sufficient and appropriate to retain the representative power that describes the target concept in the original set of features of a given data set.
FS is the process of identifying and removing as much of the irrelevant and redundant information as possible. The optimality of a feature subset is measured by an
evaluation criterion. The selection of a small number of highly predictive
features is used to avoid overfitting the training data. It reduces the number of features; removes irrelevant, redundant, or noisy data; and thereby speeds up the data-mining algorithm, improving mining performance such as predictive accuracy and result comprehensibility [6].
Abeel et al. [7] are concerned with the analysis of the robustness of biomarker
selection techniques. They proposed a general experimental setup for stability
analysis that can easily be included in any biomarker identification pipeline. In
addition, they also presented a set of ensemble feature selection methods
improving biomarker stability and classification performance in four microarray
datasets.
The classification of gene expression data samples into distinct classes is a
challenging task. The dimensionality of typical gene expression data sets ranges from several thousand to over ten thousand genes. However, only small sample sizes are typically available for analysis [8]. The outcomes from this
classification process help domain experts to identify the “informative features”
embedded in the data and the relationships between the data items. As such,
they can be used to generate hypotheses about the correlation between genes
and their impact on a specific disease.
2. Related Work
In [9], Martinez et al. (2001) give a theoretical analysis of how well a two-stage algorithm approximates the exact LDA in the sense of maximizing the LDA objective function. The theoretical analysis motivated them to devise a new two-stage LDA algorithm. Their algorithm outperforms PCA+LDA while both have similar scalability. Furthermore, they provide an implementation of this algorithm on a distributed system to handle large-scale problems.
In [10], Alba et al. (2007) compared the use of Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA) (both augmented with Support Vector Machines, SVM) for the classification of high-dimensional microarray data. Both algorithms are used for finding small subsets of informative genes amongst thousands of them. A first contribution is to show that PSO-SVM is able to find interesting genes and to provide competitive classification performance. A second important contribution consists in the actual discovery of new and challenging results on six public datasets, identifying genes significant in the development of a variety of cancers (leukemia, breast, colon, ovarian, prostate, and lung).
In [11], Banerjee et al. (2007) emphasized an evolutionary rough feature selection algorithm to preprocess the array data and obtain a reduced set of features. The proposed algorithm uses redundancy reduction for effective handling of gene expression data, enabling faster convergence. The reducts are generated using rough set theory; they are a minimal set of non-redundant features capable of distinguishing between all objects in a multi-objective framework. The proposed algorithm was implemented on three different cancer samples. Experiments using KNN classifiers showed that the proposed algorithm improved the performance of the classifier on the test set.
In [12], Askarzadeh et al. (2012) used a harmony search (HS)-based parameter identification method to identify the unknown parameters of the solar cell single- and double-diode models. Its simple concept, easy implementation and high performance are the main reasons for the popularity of HS in solving complex optimization problems. For this aim, HS variants are used to determine the
unknown parameters of the models. The effectiveness of the HS variants is
investigated with comparative study among different techniques. Simulation
results manifest the superiority of the HS-based algorithms over the other
studied algorithms in modeling solar cell systems.
In [13], Xu et al. (2013) used a New Artificial Bee Colony (NABC) algorithm, which modifies the search pattern of both employed and onlooker bees. A
solution pool is constructed by storing some best solutions of the current swarm.
New candidate solutions are generated by searching the neighborhood of
solutions randomly chosen from the solution pool. Experiments are conducted
on a set of twelve benchmark functions. Simulation results show that this
approach is significantly better than, or at least comparable to, the original ABC and seven other stochastic algorithms.

Chiş et al. [14] presented clustering, the most popular method that attempts to separate data into disjoint groups such that data points within the same group are similar in their characteristics with respect to a referral point, whereas data points of different groups differ in their characteristics. The groups so described are called clusters; thus, clusters comprise several similar data items or objects with respect to a referral point. Clustering is one of the most important methods in the disciplines of engineering and science, including data compression and statistical data analysis.
3. Proposed Methodology
Input Data
The datasets are real gene expression data and gene samples generated using microarray technology. The results of both implementations are compared to the output from the classification algorithm. This gene expression data has been used to build cancer classifiers. A microarray experiment monitors the expression levels of genes. Patterns can be derived by analyzing the changes in expression of the genes, and new insights can be gained into the underlying biology. In this section, basic terminology, representations of the microarray data and the various methods by which expression data can be analyzed are introduced. Microarray measurements are carried out as differential hybridizations to minimize errors originating from DNA. The overall block diagram of the proposed system is shown in Fig. 1.
Fig. 1: Overall Block Diagram of the Proposed System

Gene expression dataset → Preprocessing using IKNN → Gene selection using EPSOFFA (calculate fitness values using the best fireflies in PSO; generate optimal genes) → Classification using MANN (perform training and testing on genes; neurons classify the relevant genes) → Accurate classification results
3.1. Preprocessing using IKNN
In this research, the k-NN algorithm is introduced to perform data preprocessing. The underlying idea of the k-NN algorithm has served as inspiration for tackling data imperfection. It distinguishes the kinds of data imperfection that need to be addressed, such as noisy data, redundancy and incomplete data. Preprocessing is an important step in the analysis of microarray data. The distance-based similarity idea of the kNN has been widely applied to detect and remove class noise. IKNN is used to eliminate all potentially noisy examples, and it may change the class label of clearly erroneous examples. The given dataset may also contain missing values (MVs) in its attribute values. Intuitively, an MV is simply an absent value for an attribute; human or equipment errors are some of the reasons for their existence. Once again, this imperfection in the data influences the mining process and its outcome. The simplest way of dealing with MVs is to discard the attributes that contain them. The imputation of MVs is a procedure that aims to fill in the MVs by estimating them. Data reduction aims to obtain a smaller representative set of attributes from the raw data without losing important information. This process alleviates data storage requirements as well as improving the later data mining process, and it may result in the elimination of noisy information as well as redundant or irrelevant data.
Algorithm 1: IKNN
Step 1: Begin
Input: D = {(x1, c1), (x2, c2), …, (xn, cn)}
Step 2: For every labeled instance (xi, ci)
Step 3: Calculate d(xi, x)
Step 4: Order d(xi, x) from lowest to highest, (i = 1, …, N)
Step 5: Compute missing values:

x = (Σ_{i=1}^{n} w_i x_i) / (Σ_{i=1}^{n} w_i)   (1)

Fill in the missing value. Compute redundancy:

r_{A,B} = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)   (2)

Filter out repeated values.
Step 6: Select the K nearest instances to x: D_x^K
Step 7: Assign to x the most frequent class in D_x^K
Step 8: End
This algorithm is used to provide a more accurate gene dataset, which increases the gene classification accuracy over the previous system.
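The imputation step of Algorithm 1 can be sketched as follows. This is a minimal illustration of equation (1), filling each missing value with a distance-weighted mean of the k nearest complete rows; the inverse-distance choice of weights w_i and the value of k are assumptions, since the paper leaves them unspecified.

```python
import numpy as np

def iknn_impute(X, k=3):
    """Fill missing values (NaN) in X using a distance-weighted mean of the
    k nearest complete rows, per equation (1): x = sum(w_i * x_i) / sum(w_i).
    Inverse-distance weights and k are assumed settings, not from the paper."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]          # rows with no missing values
    for r, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Euclidean distance to each complete row over the observed attributes
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-8)               # inverse-distance weights w_i
        filled[r, miss] = (w[:, None] * complete[nearest][:, miss]).sum(axis=0) / w.sum()
    return filled
```

The redundancy filter of equation (2) would then drop attribute pairs whose Pearson correlation r_{A,B} exceeds a chosen threshold.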
3.2. Feature Selection using EPSOFFA with Firefly Algorithm
In this research, gene selection is performed by using EPSOFFA. It is focused
to improve the relevant gene data and reduce the irrelevant genes from the given
gene dataset.
The PSO is a computational approach that optimizes a problem in continuous, multidimensional search spaces. PSO starts with a swarm of random particles. Each particle is associated with a velocity. Particles' velocities are adjusted according to the historical behavior of each particle and its neighbors as they fly through the search space. Thus, the particles have a tendency to move towards the better regions of the search space. The version of the PSO algorithm utilized is described mathematically by the following equations:
Each particle updates its own position and velocity according to formulas (3) and (4) in every iteration.

v_id^{k+1} = ω v_id^k + c1 γ1 (p_id^k − x_id^k) + c2 γ2 (p_gd^k − x_id^k) + α (rand − 1/2)   (3)

x_id^{k+1} = 1 if s(v_id^{k+1}) > rand(0, 1), 0 otherwise   (4)

where s(v_id^{k+1}) is the sigmoid function S(v_id) = 1/(1 + exp(−v_id)); i = 1, 2, 3, …, m, where m is the number of particles in the swarm; v_id^k and x_id^k stand for the velocity and position of the ith particle at the kth iteration, respectively; p_id^k denotes the previously best position of particle i; and p_gd^k denotes the global best position of the swarm. ω is the inertia weight, c1 and c2 are acceleration constants (the general values of c1 and c2 are in the interval [0, 2]), and γ1 and γ2 are random numbers in the range [0, 1].
Each feature subset can be considered as a point in feature space. The optimal
point is the subset with least length and highest classification accuracy. The
initial swarm is distributed randomly over the search space, each particle takes
one position. The goal of particles is to fly to the best position. By passing the
time, their position is changed by communicating with each other, and they
search around the local best and global best positions. Finally, they should converge on good, possibly optimal, positions, since they have an exploration ability that equips them to perform FS and discover optimal subsets.
The velocity of each particle is represented as a positive integer; particle velocities are bounded by a maximum velocity Vmax. The velocity indicates how many features should be changed to match the global best point; in other words, it is the rate at which the particle moves toward the best position. The number of differing features (bits) between two particles relates to the difference between their positions.
After updating the velocity, a particle's position will be updated by the new velocity. Suppose that the new velocity is V. In this case, V bits of the particle
are randomly changed, different from that of Pg. The particles then fly toward
the global best while still exploring the search area, instead of simply being
same as Pg. The Vmax is used as a constraint to control the global exploration
ability of particles. A larger Vmax provides global exploration, while a smaller
Vmax increases local exploitation. When Vmax is low, particles have difficulty getting out of locally optimal regions; if Vmax is too high, the swarm might fly past good solutions. The objective function is computed as follows:
F(X_i) = ϕ · γ_{S_i(t)} + φ · (n − |S_i(t)|)   (5)

where S_i(t) is the feature subset found by particle i at iteration t, and |S_i(t)| is its length. Fitness is computed according to both the measure of classifier performance, γ_{S_i(t)}, and the feature subset length. ϕ and φ are two parameters that control the relative weight of classifier performance and feature subset length, with ϕ ∈ [0, 1] and φ = 1 − ϕ. This formula denotes that the classifier performance and the feature subset length have different effects on gene selection.
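The binary update of equations (3)–(4) and the fitness of equation (5) can be sketched as below. The classifier-performance term γ is passed in as a plain number (computing it would require a wrapped classifier), and dividing the length term by n so that both terms share the [0, 1] scale is an assumption not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bpso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, alpha=0.1, vmax=4.0):
    """One binary-PSO iteration per equations (3) and (4).
    x: (m, d) 0/1 positions (1 = gene selected); v: (m, d) velocities."""
    g1, g2 = rng.random(x.shape), rng.random(x.shape)
    v = (w * v + c1 * g1 * (pbest - x) + c2 * g2 * (gbest - x)
         + alpha * (rng.random(x.shape) - 0.5))
    v = np.clip(v, -vmax, vmax)                  # bound velocities by Vmax
    s = 1.0 / (1.0 + np.exp(-v))                 # sigmoid S(v)
    x = (s > rng.random(x.shape)).astype(int)    # stochastic bit flip, eq. (4)
    return x, v

def fitness(subset, gamma, n, phi=0.8):
    """Equation (5): weigh classifier performance 'gamma' against subset
    length; phi and (1 - phi) trade off the two terms. Normalizing the
    length term by n is an assumption made here."""
    return phi * gamma + (1.0 - phi) * (n - subset.sum()) / n
```

With phi close to 1, accuracy dominates; lowering phi pushes the swarm toward smaller gene subsets.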
To enhance the optimal solutions for the gene selection dataset, the PSO is hybridized with the FFA. The PSO has a lower convergence rate when the number of genes grows large, so time complexity becomes an issue, and the accuracy of gene classification is lowered by the misclassification of important genes. To overcome these issues, EPSOFFA is proposed in this research.
The light intensity of each firefly determines its brightness and hence its attractiveness. The attractiveness of a firefly is calculated using equation (6):

β(r) = β0 e^{−γ r_ij²}   (6)

where r_ij = d(x_i, x_j) is the Euclidean distance between two data points i and j. In general, β0 ∈ [0, 1] describes the attractiveness at distance r = 0, i.e., when two data points are found at the same point of the search space S. The value of γ ∈ [0, 10] determines the variation of the attractiveness with increasing distance between communicating data points; it is basically the light absorption coefficient.
The movement of firefly i, which is attracted toward another, brighter firefly j, is defined using equation (7):

x_i = x_i + β0 e^{−γ r_ij²} (x_j − x_i) + α (rand − 1/2)   (7)

where α is the randomization parameter in the interval [0, 1] and rand is a random number generator with numbers uniformly distributed in the range [0, 1]. The parameter γ controls the variation in attractiveness and defines the convergence.
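The attraction step of equations (6)–(7) can be sketched as follows. The factor (x_j − x_i) multiplying the attractiveness is the standard firefly-algorithm form, assumed here because the printed equation omits it; β0, γ and α are left at illustrative defaults.

```python
import numpy as np

rng = np.random.default_rng(1)

def firefly_move(xi, xj, beta0=1.0, gamma=1.0, alpha=0.2):
    """Move firefly i toward a brighter firefly j per equations (6)-(7):
    beta = beta0 * exp(-gamma * r^2) with r the Euclidean distance, then
    x_i <- x_i + beta * (x_j - x_i) + alpha * (rand - 1/2)."""
    r2 = np.sum((xi - xj) ** 2)          # squared Euclidean distance r_ij^2
    beta = beta0 * np.exp(-gamma * r2)   # attractiveness, equation (6)
    return xi + beta * (xj - xi) + alpha * (rng.random(xi.shape) - 0.5)
```

With α = 0 the move is deterministic and strictly shortens the distance to the brighter firefly, since β ∈ (0, 1).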
Algorithm 2: EPSOFFA
Step 1: Initialize a population with N individuals
Step 2: Initialize the position and velocity of each particle (gene) in the swarm
Step 3: While the maximum number of iterations is not reached, do
Step 4: Set the algorithm factors: objective function f(x), where x = (x1, …, xd)^T
Step 5: Generate the primary population of fireflies x_i (i = 1, 2, …, n)
Step 6: Define the light intensity I_i at x_i via f(x_i)
Step 7: While (t < MaxGen) do
Step 8: For i = 1 to n (all n fireflies/genes)
Step 9: For j = 1 to n (all n fireflies)
Step 10: If (I_j > I_i), move f_i towards f_j
Step 11: End if
Step 12: Attractiveness varies with distance r via exp(−r²)
Step 13: Evaluate new solutions and update light intensities
Step 14: End for j
Step 15: End for i
Step 16: Construct the new position of the firefly solution
Step 17: Evaluate the fitness of the new firefly solution, which is directly proportional to its brightness
Step 18: If the fitness value is better than its personal best (pBest)
Step 19: Set the current value as the new pBest
Step 20: End
Step 21: Choose the particle with the best fitness value of all as gBest
Step 22: For each particle
Step 23: Calculate the particle velocity and update the particle position according to equations (6) and (7)
Step 24: Randomly select a gBest for particle i from the highest-ranked solutions
Step 25: Update the velocity and position of the particle based on the best firefly behavior, using (3) and (4)
Step 26: Return the most informative gene features
Step 27: Update pBest and gBest
Step 28: Return the positions of the genes
Step 29: End
The efficiency of the gene classification model based on the EPSOFFA algorithm relies on the learning acquired from the gene dataset. An appropriate dataset helps the hybrid model to attain the desired gene classification performance. The EPSOFFA approach finds and ranks the most informative gene features from the given microarray gene dataset to provide an optimal classification result. In this way, EPSOFFA evolves an optimal dataset that helps in building an effective gene classification model.
3.3. Classification using MANN
In the proposed system, the Modified Artificial Neural Network (MANN) is introduced to improve the overall gene classification results. The MANN proposed in this section is a faster algorithm than the TSVM approach and has the advantage of higher accuracy in both training and testing results.
There are two main phases in the operation of ANN such as learning and testing
[15]. Learning is the process of adapting or modifying the neural network
weights in response to the training input patterns being presented at the input
layer.
The optimal configuration for the gene dataset includes the total number of layers, the number of hidden units in the middle layers, the number of units in the input and output layers (in terms of accuracy on a test set), and the training algorithm used during the learning phase. This configuration performs the required extraction of knowledge from a noisy training set to achieve better gene prediction.
The aim of learning is to minimize a cost function based on the error signal
ei(t), with respect to network parameters (weights), such that the actual response
of each output neuron in the network approaches the target response [16]. A
criterion commonly used for the cost function is the MSE criterion, defined as the mean-square value of the sum of squared errors:

J = E[ (1/2) Σ_i (e_i(t))² ]   (8)
  = E[ (1/2) Σ_i (d_i(t) − y_i(t))² ]   (9)
Where E is the statistical expectation operator and the summation is over all the
neurons of the output layer. Usually the adaptation of weights is performed by
using the desired signal d_i(t) only. In [17] it is stated that a new signal d_i(t) + n_i(t) can be used as the desired signal for output neuron i instead of the original desired signal d_i(t), where n_i(t) is a noise term. This noise term is assumed to be white Gaussian noise, independent of both the input signals x_k(t) and the desired signals d_i(t).
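The cost of equations (8)–(9) can be evaluated over a batch as sketched below; approximating the expectation E by an average over the training patterns is an assumption made for the illustration.

```python
def mse_cost(targets, outputs):
    """Cost J from equations (8)-(9): the average over patterns of
    1/2 * sum_i (d_i(t) - y_i(t))^2, where d are desired and y actual
    outputs. Averaging stands in for the expectation operator E."""
    total = 0.0
    for d, y in zip(targets, outputs):
        total += 0.5 * sum((di - yi) ** 2 for di, yi in zip(d, y))
    return total / len(targets)
```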
This is important in every ANN model and needs to be cross-checked in order to find the optimal solution. ANNs are applied to many different tasks, such as prediction, classification and time series projection, that require different structures and features to be examined.

ANNs were developed as generalizations of mathematical models of biological nervous systems. A first wave of interest in neural networks (also known as connectionist models or parallel distributed processing) emerged after the introduction of simplified neurons.
The basic processing elements of neural networks are called artificial neurons,
or simply neurons or nodes. In a simplified mathematical model of the neuron,
the effects of the synapses are represented by connection weights that modulate
the effect of the associated input signals, and the nonlinear characteristic
exhibited by neurons is represented by a transfer function. The neuron impulse
is then computed as the weighted sum of the input signals, transformed by the
transfer function. The learning capability of an artificial neuron is achieved by
adjusting the weights in accordance to the chosen learning algorithm.
The neuron output signal o is given by the following relationship:

o = f( Σ_{j=1}^{n} w_j x_j )   (10)

where w_j is the weight vector and x_j is the input to the neuron.
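Equation (10) amounts to a weighted sum passed through a transfer function, as in this small sketch; the sigmoid is used here only because the text leaves f unspecified, so that choice is an assumption.

```python
import math

def neuron_output(x, w, f=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """Equation (10): o = f(sum_j w_j * x_j). The sigmoid transfer
    function f is an assumed default, not fixed by the paper."""
    return f(sum(wj * xj for wj, xj in zip(w, x)))
```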
The basic architecture consists of three types of neuron layers: input, hidden,
and output layers. In feed-forward networks, the signal flow is from input to
output units, strictly in a feed-forward direction. The data processing can extend
over multiple (layers of) units, but no feedback connections are present.
A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. One way is to set the weights explicitly, using a priori knowledge. Another is to train the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. The learning situation considered here is supervised learning: an input vector is presented at the inputs together with a set of desired responses, one for each node, at the output layer. A forward pass is done, and the errors or discrepancies between the desired and actual responses for each node in the output layer are found.
To create a robust and reliable network, some noise or other randomness is
added to the training data to get the network familiarized with noise and natural
variability in real data. Poor training data inevitably leads to an unreliable and
unpredictable network. The MANN is trained for a prefixed number of epochs or until the output error decreases below a particular error threshold. Special care must be taken not to overtrain the network. Through overtraining, the network may become too adapted to learning the samples from the training set and thus may be unable to accurately classify samples outside the training set. To overcome this issue, the MANN is improved in terms of a better choice of the number of neurons. With too few hidden neurons, the network may be unable to learn the relationships amongst the data, and the error is increased; a sufficiently large number of hidden neurons ensures correct learning, and the network is able to correctly predict the data it has been trained on.
Algorithm 3: MANN
Step 1: Create an initial ANN consisting of three layers, i.e., an input, an output, and a hidden layer.
Step 2: For each training pattern
Step 3: Apply the input gene features to the network
Step 4: Train the network on the training set until the error is almost constant for a certain number of training epochs τ specified by the user
Step 5: Consider the genes for the input nodes and hidden nodes
Step 6: Calculate the output of every neuron from the input layer, through the hidden layer(s), to the output layer
Step 7: Compute the error of the MANN using (9)
Step 8: If the error is found unacceptable (i.e., too large), then assume that the MANN has an inappropriate architecture and go to the next step.
Step 9: Otherwise, stop the training process. The error E is calculated according to the following equations:

E(w, v) = (1/2) Σ_{i=1}^{k} Σ_{p=1}^{C} (S_pi − t_pi)²   (11)

where k is the number of patterns and C is the number of output nodes. t_pi and S_pi are the target and actual outputs for the ith pattern at the pth output node. The actual output S_pi is calculated according to the following equation:

S_pi = σ( Σ_{m=1}^{h} δ(x_i^T w_m) v_m )   (12)

Here h is the number of hidden nodes in the network; x_i is an n-dimensional input pattern, i = 1, 2, …, k; w_m are the weights for the arcs connecting the input layer and the m-th hidden node, m = 1, 2, …, h; v_m are the weights for the arcs connecting the m-th hidden node and the output layer; σ is the activation function; and δ is the hidden-layer hyperbolic tangent function.
Step 10: Add one hidden node to the hidden layer. Randomly initialize the weights of the newly added node and repeat the training process using (10).
Step 11: The neurons check the error values
Step 12: Apply the weight adjustments
Step 13: The training and testing of the MANN use the gene features
Step 14: Classify the gene output features
Step 15: The genes are classified more accurately
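The grow-if-error-unacceptable loop of Algorithm 3 can be sketched as below: a one-hidden-layer network with tanh hidden units (δ) and a sigmoid output (σ), trained by gradient descent on the error of equation (11), adding a randomly initialized hidden node whenever the error stays too large (step 10). The learning rate, epoch budget, tolerance and node cap are all assumed settings, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_mann(X, T, h=2, max_hidden=16, epochs=500, lr=0.5, tol=0.05):
    """Sketch of Algorithm 3: train on E = 1/2 * sum (S - t)^2 (eq. 11)
    and grow the hidden layer while the error remains unacceptable."""
    n, d = X.shape
    W = rng.normal(0, 0.5, (d, h))              # input -> hidden weights w_m
    V = rng.normal(0, 0.5, (h, T.shape[1]))     # hidden -> output weights v_m
    while True:
        for _ in range(epochs):
            H = np.tanh(X @ W)                  # hidden activations (delta)
            S = 1.0 / (1.0 + np.exp(-(H @ V)))  # output activations (sigma)
            err = S - T
            dV = H.T @ (err * S * (1 - S))
            dW = X.T @ (((err * S * (1 - S)) @ V.T) * (1 - H ** 2))
            V -= lr * dV / n
            W -= lr * dW / n
        E = 0.5 * np.sum((S - T) ** 2)          # equation (11)
        if E <= tol or W.shape[1] >= max_hidden:
            return W, V, E
        # error unacceptable: add one randomly initialized hidden node (step 10)
        W = np.hstack([W, rng.normal(0, 0.5, (d, 1))])
        V = np.vstack([V, rng.normal(0, 0.5, (1, T.shape[1]))])
```

On a toy problem such as XOR, the loop keeps adding hidden nodes until the squared error falls below the tolerance or the node cap is reached.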
4. Experimental Results

In this section, we evaluate the overall performance of the gene selection methods using six popular binary and multiclass microarray cancer datasets, which were downloaded from http://www.gems-system.org/. These datasets have been widely used to benchmark the performance of gene selection methods in the bioinformatics field. The binary-class microarray datasets are colon [18], leukemia [18, 19], and lung [20], while the multiclass microarray datasets are SRBCT [21], lymphoma [22], and leukemia [23]. Table 1 gives a detailed description of these six benchmark microarray gene expression datasets with respect to the number of classes, number of samples, number of genes, and a brief description of each dataset's construction.
Table 1: Gene Datasets
Dataset          No. of classes   No. of samples   No. of genes
Colon [18]       2                62               2000
Leukemia1 [19]   2                72               7129
Lung [20]        2                96               7129
SRBCT [21]       4                83               2308
Lymphoma [22]    3                62               4026
Leukemia2 [23]   3                72               7129
In this study, the performance of the proposed EPSOFFA-MANN algorithm is tested by comparing it with other standard bio-inspired algorithms, namely ImRMR-HCSO, ImRMR-GSO and mRMR-ABC. We compare the performance of each gene selection approach on parameters such as classification accuracy, error rate, precision, recall, time complexity and the number of predictive genes used for cancer classification. Classification accuracy is the overall correctness of the classifier and is calculated as the number of correct cancer classifications divided by the total number of classifications.
Performance Metrics

4.1. Accuracy
Classification accuracy = (CC / N) × 100
where N is the total number of instances in the initial microarray dataset and CC is the number of correctly classified instances.

4.2. Precision
Precision is the ratio of true positive outcomes to the sum of true positive and false positive outcomes:
Precision = TP / (TP + FP)

4.3. Recall
Recall is computed from the true positive and false negative predictions:
Recall = TP / (TP + FN)

4.4. F-Measure
The F-measure summarizes a test's accuracy by combining its precision and recall:
F-measure = (2 × Precision × Recall) / (Precision + Recall)

4.5. Time Complexity
An algorithm is superior when it achieves a lower time complexity on the given dataset.

4.6. Error Rate
A system is better when its algorithm yields a lower error rate.
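As a hedged illustration, the metrics above can be computed from raw binary predictions as in the NumPy sketch below; the labelling is made up and not tied to the paper's datasets:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for a binary labelling,
    computed from TP/FP/FN counts as defined in the metrics section."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    cc = np.sum(y_pred == y_true)                # correctly classified instances
    accuracy = 100.0 * cc / y_true.size          # (CC / N) x 100
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

acc, p, r, f = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(acc, p, r, f)
```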
The comparison results for the binary-class microarray datasets colon, leukemia1, and lung are shown in Tables 2, 3, and 4, respectively, while Tables 5, 6, and 7 present the comparison results for the multiclass microarray datasets SRBCT, lymphoma, and leukemia2.
From these tables, it is clear that the proposed EPSOFFA-MANN algorithm performs better than the ImRMR-HCSO with TSVM, ImRMR-GSO with RF SVM and mRMR-ABC algorithms in every single case (i.e., on all datasets and for every number of selected genes).
Table 2: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with the RFSVM, mRMR-ABC Classification Performance for Colon
Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
3 87.50 88 89.99 92.01
4 88.27 89.9 91.21 93.33
5 89.5 90 92.56 94.77
6 90.12 90.80 93.83 95.99
7 91.64 92 94.41 96.32
8 91.8 92.2 95.79 97.45
9 92.11 92.75 96.10 98.22
10 92.74 93.1 96.77 98.99
15 93.6 94 97.55 99.01
20 94.17 94.8 97.89 99.65
Table 3: Comparison between EPSOFFA-MANN and ImRMR- HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Leukemia1 Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 89.63 90 91.34 93.5
3 90.37 91 92.76 94
4 91.29 92 93.99 95
5 92.82 93 94 96
6 92.82 93 94.41 97
7 93.10 93.50 95.82 97.66
10 94.44 95 96 98
13 94.93 95 96.33 98.3
14 95.83 96 97 98.99
Table 4: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Lung Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 95.83 96 97 98
3 96.31 97 98.2 98.55
4 97.91 98 98.7 98.99
5 97.98 99 98.99 99.34
6 98.27 98.6 98.99 99.78
7 98.53 98.85 98.99 99.79
8 98.95 99 99.2 99.8
Table 5: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for SRBCT Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 71.08 71.6 82 85
3 79.51 80 83 87
4 84.33 84.9 85 88
5 86.74 87 88 90
6 91.56 92 93 95
7 94.05 94.5 95 97
8 96.3 96.9 97 98
Table 6: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Lymphoma Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 86.36 86.9 88 90
3 90.90 91.2 92 94
4 92.42 92.8 94 96
5 96.96 97.1 97.99 98.55
Table 7: Comparison between EPSOFFA-MANN and ImRMR-HCSO with TSVM,
ImRMR-GSO with RFSVM, mRMR-ABC Classifier for Leukemia 2 Dataset
Classification Accuracy in (%)
Number of genes   mRMR-ABC   ImRMR-GSO with RF SVM   ImRMR-HCSO with TSVM   EPSOFFA-MANN
2 84.72 85.03 86 88
3 86.11 86.5 87 90
4 87.5 87.9 88 91
5 88.88 89 89.5 92
6 90.27 90.65 91 93
7 89.49 89.9 92 94.5
8 91.66 92.05 93 95.69
9 92.38 92.7 94 97.8
10 91.66 92.1 95 98
15 94.44 94.85 96 98.33
18 95.67 96 97 98.77
20 96.12 96.5 97.7 99
Fig. 2: Feature Selection Results Comparison for Colon Dataset
The comparison results for the binary-class microarray datasets colon, leukemia1, and lung are shown in Figs. 2, 3, and 4, respectively, while Figs. 5, 6, and 7 present the comparison results for the multiclass microarray datasets SRBCT, lymphoma, and leukemia2. From these figures, it is clear that the proposed EPSOFFA-MANN algorithm performs better than the ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC algorithms in every single case (i.e., on all datasets and for every number of selected genes).
Fig. 3: Feature Selection Results Comparison for Leukemia1 Dataset
(Each figure plots classification accuracy on the y-axis against the number of selected genes on the x-axis for mRMR-ABC, ImRMR-GSO with RF SVM, ImRMR-HCSO with TSVM and EPSOFFA-MANN.)
Fig. 4: Feature Selection Results Comparison for Lung Dataset
Fig. 5: Feature Selection Results Comparison for SRBCT Dataset
Fig. 6: Feature Selection Results Comparison for Lymphoma Dataset
Fig. 7: Feature Selection Results Comparison for Leukemia2 Dataset
Thus, ImRMR is a promising method for identifying relevant genes and omitting redundant and noisy genes. We can conclude that the proposed EPSOFFA-MANN algorithm achieves accurate classification performance with a minimum number of selected genes on all datasets, compared with the ImRMR-HCSO with TSVM, mRMR-ABC and ImRMR-GSO algorithms under the same cross-validation approach. Therefore, the EPSOFFA-MANN algorithm is a promising approach for solving gene selection and cancer classification problems.
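The evaluation described above rests on cross-validated classification accuracy over a small subset of selected genes. A minimal NumPy sketch of that protocol, using synthetic data and a nearest-centroid stand-in classifier rather than the authors' pipeline, might look like:

```python
import numpy as np

def cv_accuracy(X, y, n_splits=5, seed=0):
    """5-fold cross-validated accuracy of a nearest-centroid classifier,
    a placeholder for the classifiers compared in this section."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_splits)
    accs = []
    for i in range(n_splits):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        # per-class centroids estimated on the training fold only
        centroids = {c: X[tr][y[tr] == c].mean(axis=0) for c in np.unique(y[tr])}
        classes = sorted(centroids)
        d = np.stack([np.linalg.norm(X[te] - centroids[c], axis=1) for c in classes])
        pred = np.array(classes)[np.argmin(d, axis=0)]
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

# toy, well-separated data standing in for a matrix of selected gene features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(3, 1, (30, 10))])
y = np.array([0] * 30 + [1] * 30)
acc = cv_accuracy(X, y)
print(round(acc, 3))
```

Holding the fold assignment fixed across algorithms, as implied by "the same cross-validation approach", keeps the accuracy comparison fair.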
Fig. 8: Time Complexity Results Comparison for all Given Datasets
Fig. 8 shows the time complexity comparison for the given datasets, with the datasets on the x-axis and the time complexity (in seconds) on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm has a lower time complexity than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, confirming that the proposed algorithm is superior to the existing systems in terms of classification.
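As a rough illustration of how per-dataset timing values like those in Fig. 8 could be collected, the sketch below times one train-and-predict cycle of a placeholder nearest-mean classifier; the data sizes and classifier are assumptions, not the authors' setup:

```python
import time
import numpy as np

def timed_fit_predict(classify, X_train, y_train, X_test):
    """Wall-clock time of one train+predict cycle, one way to obtain the
    per-dataset timing values compared across algorithms."""
    start = time.perf_counter()
    preds = classify(X_train, y_train, X_test)
    return preds, time.perf_counter() - start

def nearest_mean(X_train, y_train, X_test):
    # nearest-mean toy classifier standing in for EPSOFFA-MANN
    means = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    classes = sorted(means)
    d = np.stack([np.linalg.norm(X_test - means[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(d, axis=0)]

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7129))       # leukemia-sized placeholder matrix
y = rng.integers(0, 2, size=72)
preds, seconds = timed_fit_predict(nearest_mean, X[:50], y[:50], X[50:])
print(preds.shape, seconds > 0)
```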
Fig. 9: Precision Results Comparison for All Given Datasets
Fig. 9 shows the precision comparison for the given datasets, with the datasets on the x-axis and the precision value on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a higher precision than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
Fig. 10: Recall Results Comparison for All Given Datasets
Fig. 10 shows the recall comparison for the given datasets, with the datasets on the x-axis and the recall value on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a higher recall than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
Fig. 11: F-measure Results Comparison for All Given Datasets
Fig. 11 shows the F-measure comparison for the given datasets, with the datasets on the x-axis and the F-measure value on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a higher F-measure than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
Fig. 12: Average Error Rate
Fig. 12 shows the error rate comparison for the given datasets, with the datasets on the x-axis and the error rate on the y-axis. The experimental results show that the proposed EPSOFFA-MANN algorithm achieves a lower error rate than the existing ImRMR-HCSO with TSVM, ImRMR-GSO and mRMR-ABC methods, again indicating better classification.
5. Conclusion
Microarray data can be used in the discovery and prediction of cancer classes. Various approaches have been applied to select genes efficiently for cancer classification. In this research work, EPSOFFA-MANN is proposed to improve the overall system performance. The approach comprises three modules: preprocessing, feature selection and classification. Preprocessing removes noisy data from the dataset using the IKNN algorithm, which fills missing values and removes redundant values effectively; the reduced dataset is then passed to feature selection. The important and relevant genes are selected using EPSOFFA, whose fitness function yields an optimal feature subset. The selected features are then passed to the classification phase, where the MANN algorithm classifies the genes and provides more accurate results. The results show that the proposed system achieves higher classification accuracy, precision, recall and F-measure, indicating better cancer classification on the specified microarray databases, while also reducing time complexity and error rate significantly. Thus, the results confirm that the proposed system outperforms the existing systems.
References
[1] Ntzani, Evangelia E., and John PA Ioannidis. "Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment."The Lancet 362.9394 (2003): 1439-1444.
[2] Tseng and C.-P. Kao, “Efficiently mining gene expression data via a novel parameterless clustering method,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2(4), pp. 355–365, 2005.
[3] Layana, C. and Diambra, L. “Dynamical analysis of circadian gene expression”, International Journal of Biological and Life Sciences, Vol.8, No.3, pp.101–5, 2007
[4] García S, Luengo J, Herrera F. Data Preprocessing in Data Mining. Berlin: Springer; 2015.
[5] Fong, Simon, Xin-She Yang, and Suash Deb. "Swarm search for feature selection in classification." 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE). IEEE, 2013.
[6] Al-Ani, Ahmed, Mohamed Deriche, and Jalel Chebil. "A new mutual information based measure for feature selection." Intelligent Data Analysis 7.1 (2003): 43-57.
[7] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, "Robust biomarker identification for cancer diagnosis with ensemble feature selection methods," Bioinformatics 26 (3) (2010) 392–398.
[8] Y. Zhang and J. C. Rajapakse. Machine Learning in Bioinformatics. Wiley Series in Bioinformatics, 1st edition, 2008.
[9] Martinez, A. M. and Kak, A. C. Pca versus lda. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
[10] Alba, Enrique, et al. "Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms." Evolutionary Computation, 2007. CEC 2007. IEEE Congress on. IEEE, 2007.
[11] Banerjee, M. Mitra, S. and Banka, H. “Evolutionary Rough Feature Selection in Gene Expression Data”, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 37, No. 4, pp. 622-632, 2007
[12] Askarzadeh, Alireza, and Alireza Rezazadeh. "Parameter identification for solar cell models using harmony search-based algorithms." Solar Energy 86.11 (2012): 3241-3249.
[13] Xu, Yunfeng, Ping Fan, and Ling Yuan. "A simple and efficient artificial bee colony algorithm." Mathematical Problems in Engineering 2013 (2013).
[14] Chiş, M., "A new evolutionary hierarchical clustering technique," Babeş-Bolyai University Research Seminars, Seminar on Computer Science, 2000, 13-20.
[15] Badri, Lubna. "Development of Neural Networks for Noise Reduction." Int. Arab J. Inf. Technol. 7.3 (2010): 289-294.
[16] Dorronsoro J., López V., Cruz C., and Sigüenza J., “Auto associative Neural Networks and Noise Filtering,” IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1431-1438, 2003.
[17] Tsenov, Georgi T., and Valeri M. Mladenov. "Speech recognition using neural networks." Neural Network Applications in Electrical Engineering (NEUREL), 2010 10th Symposium on. IEEE, 2010.
[18] U. Alon, N. Barka, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
[19] T. R. Golub, D. K. Slonim, P. Tamayo et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
[20] D. G. Beer, S. L. R. Kardia, C.-C. Huang et al., “Gene-expression profiles predict survival of patients with lung adenocarcinoma,” Nature Medicine, vol. 8, no. 8, pp. 816–824, 2002.
[21] J. Khan, J. S. Wei, M. Ringnér et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[22] A. A. Alizadeh, M. B. Elsen, R. E. Davis et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000.
[23] S. A. Armstrong, J. E. Staunton, L. B. Silverman et al., “MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia,” Nature Genetics, vol. 30, no. 1, pp. 41–47, 2001.