Mini Project Report Sample

Mini Project Report

Design and implementation of a classification system based on soft computing

and statistical approaches

Submitted By:

Ashish Kumar Agrawal (2001114)

Abstract


This project is being developed as part of an MHRD research project (Designing an Intelligent Robot for Explosive Detection and Decontamination, funded by MHRD, Govt. of India). It explores the design and development of a classifier, based on statistical methods and soft computing approaches, that is capable of distinguishing mines from non-mines. Various clustering, classification and rule establishment algorithms are implemented so that they can be compared on the basis of complexity and accuracy. Designing such a classifier is a big challenge because the data is not linearly separable, and since it has overlapping features it is not possible to design a classifier with 100% accuracy. This project deals with PVC tubes, wood pieces and copper cylinders as non-mine data, in addition to data of various mines. The basic idea of the classification rests on the fact that it is safe if non-mine data is predicted as mine, but not if mine data is predicted as non-mine. The unsupervised-learning-based ART algorithm therefore divides the data into several clusters, which are merged on the basis of this fact. A genetic algorithm extends the results of association rule mining to establish rules that have negation in the antecedent part. In addition to these approaches, fuzzy approaches give the membership values corresponding to each class, so that the class of the data can be visualized in a better way.

Candidate’s Declaration

I hereby declare that the work presented in this project titled “Design and implementation of a classification system based on soft computing and statistical approaches”, submitted towards completion of the mini-project in the sixth semester of B.Tech (IT) at the Indian Institute of Information Technology (IIIT), Allahabad, is an authentic record of my original work pursued under the guidance of Dr. G. C. Nandi, Associate Professor, IIIT, Allahabad. I have not submitted the matter embodied in this project for the award of any other degree.

(Ashish Kumar Agrawal)

Place: Allahabad

Date: 17-5-2004

Certificate

This is to certify that the above declaration made by the candidate is correct to the best of my knowledge and belief.

(Dr G.C. Nandi)

Associate Professor

IIIT (Deemed University), Deoghat, Jhalwa, Allahabad.

Place: Allahabad

Date: May, 2004


Acknowledgement

First and foremost, I would like to express my sincere gratitude to my project guide, Dr G.C. Nandi. I was privileged to experience a sustained, enthusiastic and involved interest from his side. This fueled my enthusiasm even further and encouraged me to boldly step into what was a totally dark and unexplored expanse before me.

I would also like to thank my seniors, who were ready with a positive comment all the time, whether it was an off-hand comment to encourage me or a constructive piece of criticism. Special thanks go to the JRC database provider, who arranged a good database for mines.

Last but not least, I would like to thank the IIIT-A staff members and the institute in general for extending a helping hand at every juncture of need.

Ashish Kumar Agrawal (2001114)


Table of Contents

Abstract
Declaration
Certificate
Acknowledgement
List of Figures

CHAPTER I
Introduction and Statement of Problem
1.1 Introduction
1.2 Statement of Problem

CHAPTER II
Challenges in This Field
2.1 Feature Extraction
2.2 Selection of an Algorithm

CHAPTER III
Approaches in This Direction
3.1 Statistical Approaches
3.1.1 Clustering Algorithm: K-mean
3.1.2 K-nearest Neighbour
3.2 Soft Computing Approaches
3.2.1 Genetic Algorithm
3.2.2 Adaptive Resonance Theory (ART)
3.2.3 Fuzzy C-mean
3.2.4 Gustafson-Kessel Algorithm
3.2.5 Gath-Geva Algorithm
3.2.6 Kohonen SOM

CHAPTER IV
System Architecture
4.1 Data Source Name and Login
4.2 Algorithm and Table Selection

CHAPTER V
Results and Conclusion
5.1 Results
5.2 Conclusion
5.3 Future Extensions
5.3.1 Improvement in the Genetic Algorithm
5.3.2 Distributed Computing Environment
5.3.3 Dealing with Various Platforms and Formats

References
- Books
- Research Papers


List of Figures

Fig 1: Flow of information
Fig 2: Main frame of algorithms
Fig 3: Result of genetic algorithm
Fig 4: Result of ART algorithm
Fig 5: Result of Fuzzy c-mean and Gustafson-Kessel algorithm
Fig 6: Result of K-mean algorithm
Fig 7: Result of K-nearest neighbour algorithm
Fig 8: Result of Kohonen SOM algorithm


Chapter 1

Introduction & Statement of Problem

Brief Overview

1.1 Introduction

“If we already know about the upcoming hazards, it is very easy to find a way to abolish them.”

Here this sentence is interpreted in the context of landmine detection and decontamination. My objective is to predict, with some confidence parameter, whether a particular point of the working area is occupied by mines or not. The robot is designed to move toward these predicted areas to decontaminate the mines. The mine-occupied areas can be known before the robot starts moving or can be predicted dynamically; designing an obstacle-free path for the robot is another aspect, beyond the domain of this module.

To tackle this problem, a classification toolkit has been designed using statistical and soft computing based approaches to cluster the data, to predict the possible class of incoming data, and to generate rules in terms of a confidence parameter. The data may be given in image form or in tabular form with all numeric or categorized attributes.

It is impossible to design a classifier with 100% correct classification, because it is not easy to differentiate between the data of metallic debris, PVC tubes and actual mines.

On the basis of this prediction, path designers develop the obstacle-free path to decontaminate these mines.

1.2 Statement of Problem

Anti-personnel landmines are a significant barrier to economic and social development in a number of countries, so we need a classification system that can differentiate a mine from metallic debris on the basis of given data. This data is generated by highly accurate sensors.

Chapter 2

Challenges in This Field

In the field of classification and rule establishment, the basic problems are feature extraction (features are the building blocks of the algorithms) and the selection of good algorithms that can generate results with a high certainty value.

2.1 Feature Extraction

The initial problem is feature extraction. Generally the image data contains a blurred image of an object, so it is very difficult to extract the exact boundary of the object. Various features could be used as the raw material of the system; here blobsize, blobaspectratio and blobintensity have been chosen. The given data may contain images of PVC tubes, metallic debris and mines.

The data can also be given in tabular format with numerical or categorized attribute values, which is more suitable for the algorithms.

2.2 Selection of an Algorithm

The second problem is to choose an algorithm that can interpret the problem in the best way. The algorithms can be categorized in two parts:

(1) Statistical approaches

(2) Soft computing based approaches

Three types of algorithms can be applied here: classification, clustering, and rule establishment with some certainty factor. The best way is therefore to design various algorithms and then check their efficiency and accuracy.


Chapter 3

Approaches in This Direction

This chapter discusses the various algorithms used to achieve the objective.

3.1 Statistical Approaches

Two algorithms have been used: one for clustering (the K-mean algorithm) and another for predicting the class of incoming data (K-nearest neighbour).

3.1.1 Clustering Algorithm: K-mean

Clustering is an unsupervised technique that partitions the data into groups (clusters), so that samples in the same cluster are more similar to each other than to samples in other clusters.

If the number of data points is less than the number of clusters, each data point is assigned as the centroid of its own cluster, and each centroid gets a cluster number. If the number of data points is greater than the number of clusters, then for each data point we calculate the distance to every centroid and take the minimum; the data point is said to belong to the cluster whose centroid is nearest. Since we are not sure about the location of the centroids, we adjust each centroid location based on the data currently assigned to it, and then reassign all the data to the new centroids. This process is repeated until no data point moves to another cluster. Mathematically this loop can be proved to converge.

Since there are only two classes, mine and non-mine, the number of clusters is given as 2 as input with the dataset [B1].
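The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the project: the function names, the choice of the first k samples as initial centroids, the squared Euclidean distance and the toy feature vectors are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Return (centroids, assignment) for a list of feature tuples."""
    # Fewer points than clusters: every point is its own centroid.
    if len(points) <= k:
        return list(points), list(range(len(points)))
    # Assumption: the first k samples seed the centroids.
    centroids = list(points[:k])
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # no point changed cluster: stop
            break
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical 2-D feature vectors forming two well-separated groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(data, k=2)
```

With k = 2, as in the mine/non-mine setting, the two cluster numbers still have to be mapped to class names afterwards, since k-means itself does not know which cluster contains the mines.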


3.1.2 K-nearest Neighbour

The K-nearest neighbour technique is used to predict the class of incoming data on the basis of the given training data. A density estimator (k-NN) estimates the confidence of the incoming sample for each class, and the class with the highest estimate is predicted.

Density estimator: q_c(x) = (number of neighbours of class c) / K

The neighbours are the K points closest to the given sample. Their distances to the sample are calculated using the city block distance. [B2]

The problem of choosing K still remains, but a general rule of thumb is to use

K = sqrt(N),

where N is the number of learning samples.

A disadvantage of this method is that it is computationally intensive for large data sets.
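As a rough illustration of the estimator q_c(x), the following Python sketch ranks the training samples by city block distance and counts the class labels among the K nearest. The function names and the training samples are hypothetical; only the distance measure, the estimator and the K = sqrt(N) rule come from the text.

```python
import math
from collections import Counter

def city_block(a, b):
    """City block (Manhattan) distance between two feature tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, sample, k=None):
    """train is a list of (features, label); returns (label, {label: q_c})."""
    if k is None:
        k = max(1, round(math.sqrt(len(train))))   # rule of thumb K = sqrt(N)
    # Sort the whole training set by distance: simple, but costly for large N.
    neighbours = sorted(train, key=lambda item: city_block(item[0], sample))[:k]
    counts = Counter(label for _, label in neighbours)
    q = {label: n / k for label, n in counts.items()}   # density estimator per class
    return max(q, key=q.get), q

# Hypothetical training data: nine labelled 2-D feature vectors.
train = [((1.0, 1.0), "non-mine"), ((1.2, 0.9), "non-mine"), ((0.8, 1.1), "non-mine"),
         ((1.1, 1.3), "non-mine"), ((5.0, 5.0), "mine"), ((5.1, 4.8), "mine"),
         ((4.9, 5.2), "mine"), ((4.7, 4.9), "mine"), ((3.0, 3.2), "mine")]
label, q = knn_predict(train, (4.8, 5.0))
```

The full sort makes the computational cost mentioned above visible: every prediction touches all N training samples.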

3.2 Soft computing approaches

Soft computing approaches can be classified into several categories, such as:

1.) Neural approaches

2.) Fuzzy clustering

3.) Adaptive resonance theory

4.) Kohonen SOM

5.) Genetic algorithm

3.2.1 Genetic algorithm to establish rules

Association rule mining can be used to establish rules between the attributes of the data, but it cannot predict the complete set of rules: rules that have negation in the attributes cannot be discovered. To overcome this disadvantage, genetic algorithms (GAs) have been used.

First, association rule mining is applied with support and confidence values entered by the user to generate some base rules. These rules are sent to the genetic algorithm as input, which helps to evolve new rules having negation in the attributes.

The three basic parts of the genetic algorithm are as follows:

(a) Selection: The roulette wheel technique is used to select the two parents [R1].

(b) Crossover: A random point (the crossover point) is generated, and the segments to the left of this point in the first parent and in the second parent are interchanged.

(c) Mutation: A mutation point is generated randomly and the bit value at this point is toggled.

After some iterations we find rules that follow the above properties and have a high fitness value, which can be calculated either from the confidence value or from the confusion matrix.
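The three operators (a) to (c) can be sketched as follows. This is a minimal sketch under stated assumptions: rules are assumed to be encoded as bit strings, the fitness values stand in for rule confidence, and all names and numbers are hypothetical rather than taken from the project code.

```python
import random

def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitness))
    running = 0.0
    for individual, f in zip(population, fitness):
        running += f
        if running >= pick:
            return individual
    return population[-1]   # guard against floating-point round-off

def crossover(parent_a, parent_b, rng):
    """Interchange the segments to the left of a random crossover point."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_b[:point] + parent_a[point:], parent_a[:point] + parent_b[point:]

def mutate(bits, rng):
    """Toggle the bit at one randomly chosen position (point mutation)."""
    point = rng.randrange(len(bits))
    flipped = "1" if bits[point] == "0" else "0"
    return bits[:point] + flipped + bits[point + 1:]

rng = random.Random(42)
population = ["110010", "001101", "111000", "000111"]   # hypothetical rule encodings
fitness = [0.9, 0.4, 0.7, 0.2]                          # hypothetical confidence values
mother = roulette_select(population, fitness, rng)
father = roulette_select(population, fitness, rng)
child_a, child_b = crossover(mother, father, rng)
mutant = mutate(child_a, rng)
```

A bit that encodes "attribute is negated" is what would let such a search reach rules that plain association rule mining cannot express; how the project actually encodes negation is not stated in the text.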

3.2.2 Adaptive resonance theory (ART)

A backpropagation network is very powerful in the sense that it can approximate any continuous function, given a sufficient number of hidden neurons and suitable forms of activation function. But once a backpropagation network is trained, the number of hidden neurons and the weights are fixed. The network cannot learn from new patterns unless it is re-trained from scratch, so there is no plasticity. [R2]

ART is a neural network technique designed to solve this problem. Our ultimate objective is to cluster the data into several chunks.

The samples are presented to the input neurons one at a time. For each sample, the activation value is calculated for each of the existing output neurons and the highest value is chosen. If this value is higher than the threshold value, the weights of this connection are updated; otherwise a new output neuron is added.

After a certain number of iterations we obtain the proper clusters of the data, but our application does not have more than two classes (mine and non-mine). The other fact is that predicting non-mine data as mine is acceptable, but the reverse is not, because it may be dangerous. So among all the clusters, the cluster whose center is farthest from the mine data center is classified as non-mine, and the rest of the clusters are classified as mine.

Here the activation function is calculated from the city block distance between the incoming normalized data and the weights of the connection.
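The procedure described above can be illustrated with a simplified, ART-like clustering loop in Python. It is not a full ART1/ART2 network. The text does not say how the city block distance is turned into an activation, so the mapping used here (one minus the mean city block distance, so that a closer weight vector scores higher) is an assumption, as are the threshold, the learning rate, the function names and the toy samples.

```python
def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def art_like_cluster(samples, threshold=0.8, rate=0.5):
    """Cluster samples whose attributes are normalized to [0, 1]."""
    weights = []   # one weight vector per output neuron
    labels = []
    for x in samples:
        # Assumed activation: 1 - mean city block distance to the weights.
        activations = [1.0 - city_block(x, w) / len(x) for w in weights]
        best = max(range(len(weights)), key=activations.__getitem__, default=None)
        if best is not None and activations[best] >= threshold:
            # Winner is similar enough: move its weights toward the sample.
            weights[best] = [wi + rate * (xi - wi) for wi, xi in zip(weights[best], x)]
        else:
            # No neuron passes the threshold: add a new output neuron.
            weights.append(list(x))
            best = len(weights) - 1
        labels.append(best)
    return weights, labels

def label_clusters(weights, mine_center):
    """The cluster farthest from the mine center is non-mine, the rest are mine."""
    farthest = max(range(len(weights)), key=lambda i: city_block(weights[i], mine_center))
    return ["non-mine" if i == farthest else "mine" for i in range(len(weights))]

# Hypothetical normalized samples and a hypothetical mine data center.
samples = [(0.10, 0.10), (0.15, 0.12), (0.90, 0.85), (0.88, 0.90), (0.50, 0.55)]
weights, labels = art_like_cluster(samples)
classes = label_clusters(weights, mine_center=(0.9, 0.9))
```

In this toy run the ambiguous sample in the middle ends up in its own cluster, which the merging rule then assigns to the mine class, reflecting the safety bias described in the text.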

3.2.3 Fuzzy c-mean

In a classical clustering algorithm we have crisp membership of a class (either one or zero). But while classifying the mine data it is not very easy to differentiate between mine and non-mine, so we need a method that can tell the membership of the data in each class. If this membership is average, then we treat this data as special data and classify it in the class of mine (as mines are dangerous!). [R3]

The formulas use |X| for the feature vector and p for the number of classes (p = 2 in our case), and define the following quantities: the membership values, the Euclidean distance, and the mean center prototype (Ci).

If the difference between the membership values and the previous membership values is less than a threshold, the algorithm terminates with the membership value for each class.
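The iteration can be sketched in Python using the standard fuzzy c-means update rules from the literature: memberships computed from relative Euclidean distances, and centers computed as membership-weighted means. The fuzzifier m = 2, the initial centers, the margin that defines an "average" membership and the toy data are assumptions made for this example, not values taken from the project.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fuzzy_c_means(points, centers, m=2.0, tol=1e-4, max_iter=100):
    """Standard FCM updates; returns (centers, memberships per point)."""
    centers = [tuple(c) for c in centers]
    previous = None
    for _ in range(max_iter):
        memberships = []
        for x in points:
            dists = [euclidean(x, c) for c in centers]
            if min(dists) == 0.0:          # point lies exactly on a center
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((di / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                       for di in dists]
            memberships.append(row)
        # Mean center prototype: mean of the points weighted by membership^m.
        new_centers = []
        for i in range(len(centers)):
            w = [row[i] ** m for row in memberships]
            total = sum(w)
            new_centers.append(tuple(
                sum(wj * x[d] for wj, x in zip(w, points)) / total
                for d in range(len(points[0]))))
        centers = new_centers
        # Stop when no membership changed by more than the threshold.
        if previous is not None:
            change = max(abs(a - b) for r, q in zip(memberships, previous)
                         for a, b in zip(r, q))
            if change < tol:
                break
        previous = memberships
    return centers, memberships

def classify_with_bias(mine_membership, margin=0.2):
    """Only a clearly low mine membership is called non-mine (assumed margin)."""
    return "non-mine" if mine_membership < 0.5 - margin else "mine"

# Hypothetical data: two groups and one ambiguous sample in between.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (3.0, 3.0)]
centers, memberships = fuzzy_c_means(points, centers=[(0.0, 0.0), (6.0, 6.0)])
decisions = [classify_with_bias(row[1]) for row in memberships]   # cluster 1 assumed to be mine
```

The ambiguous sample receives a membership close to 0.5 in both clusters, and the biased decision rule places it in the mine class.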

3.2.4 Gustafson-Kessel Algorithm

This is an improvement of the fuzzy c-mean clustering algorithm. The correlation between the data is not considered in c-mean. In this algorithm we redefine our distance formula as the Mahalanobis distance [R3], where Ai is the norm-inducing matrix of cluster i, and xj and cj are the sample attribute and the cluster center. Ai is derived from the fuzzy covariance matrix, which in turn is calculated from the membership values and the mean center prototype.
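For reference, the standard form of these quantities in the Gustafson-Kessel literature is given below. The notation is an assumption and may differ from the formulas originally used in this report: u_ij is the membership of sample x_j in cluster i, c_i is the cluster center, m is the fuzzifier, n is the number of attributes and rho_i is a fixed cluster volume (commonly 1).

```latex
F_i = \frac{\sum_{j} u_{ij}^{m}\,(x_j - c_i)(x_j - c_i)^{T}}{\sum_{j} u_{ij}^{m}},
\qquad
A_i = \left[\rho_i \det(F_i)\right]^{1/n} F_i^{-1},
\qquad
d_{ij}^{2} = (x_j - c_i)^{T} A_i\,(x_j - c_i)
```

Because F_i captures how the attributes of cluster i vary together, the resulting distance adapts to elongated or rotated clusters, which the plain Euclidean distance of fuzzy c-mean cannot do.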

3.2.5 Gath-Geva Algorithm

This algorithm assumes that the data is normally distributed. [R3]

The distance is defined in terms of the a-priori probability of data belonging to cluster i and the mean center prototype. The symbols have the same explanation as above.

Before applying this algorithm it is suggested to analyze whether the data is normally distributed or not.
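As a hedged reference, the distance usually associated with the Gath-Geva algorithm in the literature has the following form. The notation is an assumption and may differ from the formula originally used in this report: P_i is the a-priori probability of cluster i, F_i is the fuzzy covariance matrix of cluster i, c_i is its center, u_ij is the membership of sample x_j in cluster i and N is the number of samples.

```latex
d_{ij}^{2} = \frac{\sqrt{\det(F_i)}}{P_i}\,
\exp\!\left(\tfrac{1}{2}\,(x_j - c_i)^{T} F_i^{-1} (x_j - c_i)\right),
\qquad
P_i = \frac{1}{N}\sum_{j=1}^{N} u_{ij}
```

The exponential term makes the distance grow much faster than in the Gustafson-Kessel case, which is why the normality assumption matters in practice.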

3.2.6 Kohonen SOM

A competitive network learns to categorize the input vectors presented to it. If a neural network just needs to learn to categorize its input vectors, then a competitive network will do. Competitive networks also learn the distribution of inputs by dedicating more neurons to classifying parts of the input space with higher densities of input. [B3]

A self-organizing map learns to categorize input vectors. It also learns the distribution of input vectors. Feature maps allocate more neurons to recognize parts of the input space where many input vectors occur and allocate fewer neurons to parts of the input space where few input vectors occur.

Self-organizing maps also learn the topology of their input vectors. Neurons next to each other in the network learn to respond to similar vectors. The layer of neurons can be imagined to be a rubber net that is stretched over the regions in the input space where input vectors occur.

Self-organizing maps allow neurons that are neighbors to the winning neuron to output values. Thus the transition of output vectors is much smoother than that obtained with competitive layers, where only one neuron has an output at a time.
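The winner-plus-neighbours update can be illustrated with a very small one-dimensional map in Python. This is only a sketch under assumptions: the chain of four neurons, their initial placement along the diagonal, the linearly decaying learning rate, the half-strength update for neighbours and the toy samples are all chosen for the example and are not taken from the project.

```python
def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - xi) ** 2 for w, xi in zip(weights[i], x)))

def train_som(samples, n_neurons=4, epochs=50, rate=0.5, radius=1):
    """Train a 1-D chain of neurons on samples normalized to [0, 1]."""
    dim = len(samples[0])
    # Assumption: neurons start evenly spaced along the diagonal of the unit cube.
    weights = [[i / (n_neurons - 1)] * dim for i in range(n_neurons)]
    for epoch in range(epochs):
        lr = rate * (1.0 - epoch / epochs)   # learning rate decays linearly
        for x in samples:
            winner = best_matching_unit(weights, x)
            lo, hi = max(0, winner - radius), min(n_neurons - 1, winner + radius)
            for i in range(lo, hi + 1):
                # The winner moves at full strength, its chain neighbours at half.
                strength = lr if i == winner else lr / 2
                weights[i] = [w + strength * (xi - w) for w, xi in zip(weights[i], x)]
    return weights

# Hypothetical normalized samples: two near the origin, two near (1, 1).
samples = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (0.8, 0.9)]
weights = train_som(samples)
bmu_low = best_matching_unit(weights, (0.1, 0.1))
bmu_high = best_matching_unit(weights, (0.9, 0.9))
```

Because neighbouring neurons are dragged along with the winner, the neurons responding to the two groups stay at opposite ends of the chain, which is the topology-preserving behaviour described above.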


We now have some brief knowledge of the algorithms that have been implemented. Next I will discuss the architectural design of the classification system, followed by the results.

Chapter 4

System Architecture

As already discussed, the input can be in image form or tabular form. Matlab has been used to extract the features from the input images. We have a table of numerical attributes with an entry stating whether the data belongs to mine or non-mine. The genetic algorithm, however, requires a categorized table, so the data is categorized into three categories (Low, Medium and High), with the class value simply mine or non-mine.
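One simple way to produce such a categorized table is sketched below in Python. The text does not say how the category boundaries were chosen, so equal-width bins over the observed range are an assumption, and the function name and the blobsize values are hypothetical.

```python
def categorize_column(values):
    """Map numeric values to Low/Medium/High using equal-width bins (assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / 3.0

    def level(v):
        if width == 0 or v < lo + width:
            return "Low"
        if v < lo + 2 * width:
            return "Medium"
        return "High"

    return [level(v) for v in values]

# Hypothetical blobsize column from the numerical table.
blobsize = [12.0, 30.0, 45.0, 18.0, 60.0, 39.0]
levels = categorize_column(blobsize)
```

Applying the same function to each numeric attribute yields a table that the rule-based algorithms can use directly, with the class column left unchanged.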

4.1 Data source name and login

The data is maintained in MS Access. The user is free to enter any data, but first needs to configure the database using (Control Panel -> Administrative Tools -> Data Sources (ODBC) -> System DSN -> Configure). After the configuration a DSN name is assigned. This DSN name is asked for when you initialize the application, together with the user name and password, which can be obtained from the help (because the application is designed for demonstration, the username and password are given in the help). When the Connect button is pressed, if the username and password are correct and the entered DSN exists, a new page opens with all the algorithms and the table selection facility.

4.2 Algorithm and table selection

Any table existing in the input database can be selected along with the algorithm (select a categorized table if you apply the genetic algorithm). The algorithm-specific results are then displayed.

Different algorithms can ask for input parameters; for example, a clustering algorithm can ask for the number of clusters.

The interface is self-explanatory, with proper help. Java has been used for the front end and Microsoft Access XP for the database at the back end, with the JDBC-ODBC bridge used to communicate between the algorithms and the databases.

Fig 1: Flow of information

Fig 2: Main frame of algorithms

Chapter 5

Results and Conclusion

5.1 Results

This module has been implemented successfully. The ultimate objective of this module is to compare the various algorithms and differentiate between them on the basis of their accuracy and results.
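When comparing algorithms on accuracy, it can help to report the two kinds of error separately, since predicting a mine as a non-mine is the dangerous one. The Python sketch below shows one possible way to do this. The function name and the labels are hypothetical and are not results of this project.

```python
def evaluate(actual, predicted):
    """Accuracy plus the two error types, counted separately."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    missed_mines = sum(a == "mine" and p == "non-mine" for a, p in pairs)   # dangerous error
    false_alarms = sum(a == "non-mine" and p == "mine" for a, p in pairs)   # tolerable error
    mines = sum(a == "mine" for a in actual)
    return {
        "accuracy": correct / len(pairs),
        "missed_mine_rate": missed_mines / mines if mines else 0.0,
        "false_alarms": false_alarms,
    }

# Hypothetical ground truth and predictions for ten samples.
actual = ["mine"] * 4 + ["non-mine"] * 6
predicted = ["mine", "mine", "mine", "non-mine",
             "mine", "mine", "non-mine", "non-mine", "non-mine", "non-mine"]
report = evaluate(actual, predicted)
```

Two classifiers with the same accuracy can differ sharply in their missed-mine rate, so the second number is the more relevant one for this application.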

Genetic algorithm

Fig 3: Result of genetic algorithm

This snapshot displays the result of both the association rules and the genetic rules. It is clear that the genetic algorithm has generated rules with a negated attribute value in the antecedent part, so this algorithm is very useful for establishing rules.

ART

Fig 4: Result of ART algorithm

The ART algorithm also gives good results. ART gives a multiple-class distribution of the given data. Because predicting a non-mine as a mine is not as dangerous as predicting a mine as a non-mine, all the clusters at a greater distance from the non-mine center have been assigned to the mine class.


Fuzzy C-mean

The fuzzy c-mean algorithm gives results with a membership value in each class. It is therefore easy to identify data that cannot be clearly classified as mine or non-mine; this type of data can be put into the mine class to avoid danger.

Gustafson-Kessel

This algorithm removes the drawback of the fuzzy c-mean algorithm because it considers the correlation between the data.

Fig 5: Result of Fuzzy c-mean and Gustafson-Kessel algorithm

K-mean Algorithm

The K-mean algorithm is non-adaptive and time consuming, and gives an accuracy of 65%.

Fig 6: Result of K-mean algorithm

K-nearest neighbour algorithm

This algorithm is useful if someone wants to know the class of a given data sample. First of all, the training data must be given along with the input sample and the number of nearest neighbours. On the basis of the classes of the nearest neighbours, the algorithm predicts the possible class of the input data.

The algorithm gives good results when K is larger, which also makes it very time consuming.

Fig 7: Result of K-nearest neighbour algorithm

Kohonen SOM

This algorithm is also used for clustering, and it is quite a fast algorithm based on the ‘winner take all’ strategy. It differentiates mine and non-mine with up to 80% accuracy.

Fig 8: Result of Kohonen SOM algorithm

5.2 Conclusion

All eight different algorithms have been implemented to compare the results. This classifier gives results with 80% accuracy. The best results are given by ART and the genetic algorithm. Fuzzy c-mean and Gustafson-Kessel are also good because of the membership values for each class. This module can differentiate between PVC tube, wood piece, brass tube and copper cylinder (non-mine data) and the mine data obtained from JRC Israel (http://apl-database.jrc.it).


5.3 Future Extensions

We contemplate the following future features, which can be incorporated into this project:

5.3.1 Improvement in the genetic algorithm: The genetic algorithm implemented in this module incorporates only point mutation, so other types of mutation, such as deletion, insertion and segment mutation, can also be tried, and the crossover and mutation probabilities can be modified to get better results.

5.3.2 Distributed computing environment: Generally we have to deal with large databases, because on the basis of a 100-tuple database it is very hard to predict the exact class of data. In practical, real-life applications we have several GB of data. To operate on this much data we need distributed databases and distributed computing.

5.3.3 Dealing with various platforms and formats: The data may be in various formats and database systems, so the system should be flexible enough to handle various formats and DBMSs (such as Oracle and MySQL).


References

Books

B.1 Earl Gose, Steve Jost, Richard Johnsonbaugh (June 1996) Pattern Recognition and Image Analysis, Prentice Hall, ISBN 0132364158.

B.2 Richard O. Duda, Peter E. Hart, David G. Stork (2001) Pattern Classification (2nd edition), Wiley, New York, ISBN 0471056693.

B.3 Valluru B. Rao, C++ Neural Networks and Fuzzy Logic (2nd edition).

Research Papers

R.1 J. A. Vasconcelos, J. A. Ramírez, R. H. C. Takahashi, and R. R. Saldanha, Improvements in Genetic Algorithms, IEEE Transactions on Magnetics, Vol. 37, No. 5, September 2001.

R.2 Gail A. Carpenter, Marin N. Gjaja, Sucharita Gopal, and Curtis E. Woodcock, ART Neural Networks for Remote Sensing: Vegetation Classification from Landsat TM and Terrain Data.

R.3 J. C. Bezdek, S. K. Pal (1992) Fuzzy Models for Pattern Recognition, IEEE Press, New York.
