LARGE DATA CLASSIFICATION USING
NEURAL NETWORKS
A
MINI PROJECT
SUBMITTED TO
COMPUTER SCIENCE DEPARTMENT OF THE
UNIVERSITY OF AGRICULTURE, ABEOKUTA
BY
ADELANI DAVID IFEOLUWA, 06/1166
AIGBERUA TOBI DEBORAH, 06/1172
IWOBHO AKHAZE ANTHONY, 06/1199
OWODUNNI ELIAS ADEFARASIN, 06/1223
COURSE: CSC 328 (COMPUTER APPLICATIONS)
SUPERVISED BY:
DR ADEWOLE PHILIPS.
ABSTRACT
A three-layer artificial neural network (ANN) model trained with the back-propagation (BP) algorithm
was used to classify the customers of a German automobile company into various categories. The
classification covers three divisions of the company, namely Germany, South Africa and the
Maldives. The primary aim was to sort the customers into three categories, namely good, average
and below average, on the basis of invoicing data; such classification is an important component
of data mining. The classification of data using neural networks takes the customers' day-to-day
invoicing data as its base. Intelligent data is obtained from raw data through a process of data
cleaning and relevance analysis. Extraction of the data depends on a number of factors, such as
which customers order the maximum invoicing quantity in each of the three source systems. The
intelligent data then undergoes conditioning, averaging, preparing and normalizing; normalization
makes the data suitable for use in a three-layer feed-forward ANN trained with the back-propagation
algorithm. Over a number of iterations on the "supervised" input/output training pairs, the ANN
learns to master the classification of customer data. The error in each iteration is fed back to
adjust the weights of the previous layer, which makes the network an accurate classifier. The ANN
uses different learning-rate annealing schedules, various numbers of nodes in the hidden layer and
different activation functions, which not only provide various rates of error convergence but also
measure the confidence and support in data mining, helping to predict a customer's category.
For this test sample T, the MLFFNN should correctly classify the customer invoicing data in the
category of good customers.
3.5 IMPLEMENTATION OF NEURAL NETWORK USING MATLAB 2008
MATLAB is software that can easily be used to implement artificial neural networks; other
programming languages such as Java and C# can also be used. MATLAB simplifies the programming of
mathematical functions and neural networks, so little code needs to be written. In this work, a
MAT-file (which allows you to input an "m x n" matrix in the format of an Excel file) was created
to hold the training input (INPUT), the training output (RESULT), the test input (testINPUT) and
the test output (testRESULT). The data is given in section 3.4.2 (Initializing the weights and
training sample).
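By analogy, the same data layout can be sketched in Python with NumPy (a hypothetical stand-in for the MAT-file; the variable names mirror the MATLAB code, and the values are the training and test samples printed in the program output later in this section):

```python
import numpy as np

# Training input: 5 normalized invoicing features per customer,
# one column per training sample (mirrors the MATLAB INPUT matrix).
INPUT = np.array([
    [0.4301, 0.0803, 0.0479],
    [0.4889, 0.0940, 0.0461],
    [0.3084, 0.0122, 0.0025],
    [0.8013, 0.7013, 0.3062],
    [0.4851, 0.0984, 0.0408],
])

# Training targets: one-hot columns for the three classes
# good / average / below average (mirrors RESULT).
RESULT = np.eye(3)

# A single test sample and its expected one-hot target.
testINPUT = np.array([[0.3001], [0.4089], [0.2084], [0.8512], [0.6512]])
testRESULT = np.array([[1], [0], [0]])

print(INPUT.shape, RESULT.shape, testINPUT.shape)
```

Each column of INPUT is one customer's feature vector, and the matching column of RESULT marks the class it belongs to.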
A neural network can be implemented in MATLAB in three ways:
(1) Using command-line functions
(2) Using the Neural Network Toolbox™ pattern recognition tool GUI (nprtool)
(3) Using the graphical user interface (nntool)
This discussion is limited to the first two methods of implementation because the graphical user
interface (nntool) is basically used for the multilayer perceptron.
1. Using command-line functions
After inputting all the data needed, the next steps are as follows:
Create a new M-file, which takes the form of a text editor.
Write a neural network code for pattern recognition using a chosen training algorithm; here,
the Scaled Conjugate Gradient (trainscg) algorithm was used.
The code is written below.
PROGRAM
Backpropagation Algorithm for multilayer feed forward artificial neural network
fprintf('INPUT represents training input while RESULT represents training output');
INPUT
RESULT
net = newpr(INPUT,RESULT,4,{},'trainscg'); % using Scaled Conjugate Gradient (trainscg)
[net,tr] = train(net,INPUT,RESULT); % Training of the network
fprintf('to test the neural network, the RESULT needs to be tested with the result of the network - which appears below');
outInput = sim(net,INPUT)
testINPUT
testRESULT
fprintf('the result of testing data appears below');
outTest = sim(net,testINPUT)
if round(outTest) == [1; 0; 0]
    disp('this implies that the customer is in a GOOD category');
end
plotperf(tr)
plotconfusion(RESULT,outInput)
[y_out,I_out] = max(outTest);
[y_t,I_t] = max(testRESULT);
diff = [I_t - 3*I_out];
g_g = length(find(diff==-2));
g_a = length(find(diff==-3));
g_b = length(find(diff==-1));
a_a = length(find(diff==0));
a_g = length(find(diff==3));
a_b = length(find(diff==-3));
b_b = length(find(diff==2));
b_g = length(find(diff==-1));
b_a = length(find(diff==0));
N = size(testINPUT,2); % Number of testing samples (columns of testINPUT)
fprintf('Total testing samples: %d\n', N);
cm = [g_g g_a g_b; a_a a_g a_b; b_b b_g b_a]
cm_p = (cm ./ N) .* 100 % classification matrix in percentages
fprintf('Percentage Correct classification : %f%%\n', 100*(cm(1,1)+cm(2,2)+cm(3,3))/N);
fprintf('Percentage Incorrect classification : %f%%\n', 100*(cm(1,2)+cm(2,1)+cm(1,3)+cm(3,1)+cm(2,3)+cm(3,2))/N);
The output of the code appears on the command line; an interface of the training is shown, along
with the confusion matrix. The network performance is also plotted. The output is shown below.
OUTPUT OF THE PROGRAM IN COMMAND-LINE
INPUT represents training input while RESULT represents training output
INPUT =
0.4301 0.0803 0.0479
0.4889 0.0940 0.0461
0.3084 0.0122 0.0025
0.8013 0.7013 0.3062
0.4851 0.0984 0.0408
RESULT =
1 0 0
0 1 0
0 0 1
To test the neural network, the RESULT needs to be tested with the result of the network -which appears below
outInput =
1.0000 0.0003 0.0000
0.0007 0.9993 0.0007
0.0001 0.0006 0.9994
testINPUT =
0.3001
0.4089
0.2084
0.8512
0.6512
testRESULT =
1
0
0
the result of testing data appears below
outTest =
1.0000
0.0025
0.0001
this implies that the customer is in a GOOD category
Total testing samples: 1
cm =
1 0 0
0 0 0
0 0 0
cm_p =
100 0 0
0 0 0
0 0 0
Percentage Correct classification : 100.000000%
Percentage Incorrect classification : 0.000000%
>>
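For readers without MATLAB, the training procedure above can be sketched in pure Python/NumPy, replacing the toolbox's scaled conjugate gradient with plain gradient-descent back-propagation (an assumption made for simplicity; the learning rate, epoch count and random seed here are illustrative, not the project's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: 5 invoicing features per sample, one column per sample.
INPUT = np.array([
    [0.4301, 0.0803, 0.0479],
    [0.4889, 0.0940, 0.0461],
    [0.3084, 0.0122, 0.0025],
    [0.8013, 0.7013, 0.3062],
    [0.4851, 0.0984, 0.0408],
])
RESULT = np.eye(3)  # one-hot targets: good / average / below average

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three-layer network: 5 inputs -> 4 hidden neurons -> 3 outputs.
W1 = rng.normal(0, 0.5, (4, 5)); b1 = np.zeros((4, 1))
W2 = rng.normal(0, 0.5, (3, 4)); b2 = np.zeros((3, 1))

lr = 0.5  # illustrative learning rate
for epoch in range(5000):
    # Forward pass
    H = sigmoid(W1 @ INPUT + b1)   # hidden activations, shape (4, 3)
    O = sigmoid(W2 @ H + b2)       # network outputs, shape (3, 3)

    # Backward pass: the output error is fed back to adjust the
    # weights of the previous layer.
    dO = (O - RESULT) * O * (1 - O)     # output-layer delta
    dH = (W2.T @ dO) * H * (1 - H)      # hidden-layer delta
    W2 -= lr * dO @ H.T;     b2 -= lr * dO.sum(axis=1, keepdims=True)
    W1 -= lr * dH @ INPUT.T; b1 -= lr * dH.sum(axis=1, keepdims=True)

mse = np.mean((O - RESULT) ** 2)
print("final training MSE:", round(float(mse), 6))
print("predicted class per sample:", O.argmax(axis=0))
```

The delta rules are the standard back-propagation update for a sigmoid network with a squared-error loss; the toolbox's trainscg algorithm would converge faster but follows the same error-feedback principle.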
2. Using the Neural Network Toolbox™ pattern recognition tool GUI (nprtool)
(i) The data inputted in the MAT-file is used.
(ii) The command "nprtool" is typed on the command line, which opens the interface of the Neural
Network Pattern Recognition tool used for classification of data.
(iii) The input and target data are inputted; they can be accessed directly from the system
using the "browse" button.
(iv) By pressing next, you input the number of neurons in the hidden layer; for our project we
inputted four neurons.
(v) Next, an interface appears where you press the "train" button to train the neural network.
From the training interface in fig. 2, the performance and the confusion matrix can be
plotted, giving the same output as the command-line method.
(vi) The test input and test output are fed into the neural network and the network is tested
(test network).
Fig. 2
The training toolbox has the following characteristics:
(1) Performance plotting: used to plot the mean square error (MSE) against the epoch (one
presentation of the entire training set to the neural network), as shown in the graph below.
Fig. 3
(2) Confusion matrix: used to plot the output matrix (the output of the neural network) against
the target matrix (the output of the training data). The values on the diagonal (green)
represent data that are correctly classified, while those in red represent data that are
misclassified.
(3) Simulate data: the command "sim" is used to test whether the neural network has truly
learned from the training data.
(4) Mean square error: the average squared difference between outputs and targets. Lower
values are better; zero means no error.
(5) Percent error: the fraction of samples that are misclassified. A value of 0 means no
misclassifications; 100 indicates maximum misclassification.
Fig. 4
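As a hedged illustration, the three quantities above (MSE, percent error and the confusion matrix) can be computed directly from network outputs and targets; the small output matrix below is an illustrative stand-in, not the project's actual data:

```python
import numpy as np

# Illustrative network outputs (one column per sample) and one-hot targets.
outputs = np.array([[0.98, 0.02, 0.01],
                    [0.01, 0.95, 0.08],
                    [0.01, 0.03, 0.91]])
targets = np.eye(3)

# Mean square error: average squared difference between outputs and targets.
mse = np.mean((outputs - targets) ** 2)

# Predicted and true class indices are the largest entry in each column.
pred = outputs.argmax(axis=0)
true = targets.argmax(axis=0)

# Percent error: fraction of misclassified samples, as a percentage.
percent_error = 100.0 * np.mean(pred != true)

# Confusion matrix: rows = true class, columns = predicted class;
# correct classifications accumulate on the diagonal.
cm = np.zeros((3, 3), dtype=int)
for t, p in zip(true, pred):
    cm[t, p] += 1

print("MSE:", round(float(mse), 4))
print("Percent error:", percent_error)
print(cm)
```

For these stand-in outputs every sample lands on the diagonal, so the percent error is zero even though the MSE is small but non-zero, which matches the distinction drawn in the results below.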
4.0 RESULTS AND DISCUSSION
The neural network used for the data classification has proved to be a good classifier. The
efficiency of the neural network depends on the number of neurons in the hidden layer. There is
no formula for selecting the number of neurons; it is chosen mainly by trial and error. In
general, more neurons make the network more efficient (accurate), but too large a number of
neurons can complicate the classification. Having tried different numbers of neurons, we used
four (4), which provided accurate classification since our input data is not too large.
The neural network output is tested to be an accurate classifier using the mean square error
(MSE) and the percent error. The MSE obtained was 1.26518e-7, the average squared difference
between the output (the network's result for the training input, outInput in the code) and the
target (RESULT in the code); the function 'sim()' (simulate) is used to get the output data when
using code. Hence, the classification is good. An error of exactly zero cannot be obtained
because the neural network cannot be 100% accurate. The percent error is zero, which indicates
no error in classification; this is shown diagrammatically in the confusion matrix. The test
data gave an MSE of 4.90938e-7 and a percent error of zero, since the neural network classified
the test data well.
5.0 CONCLUSION
From the entire analysis of the classification of customer invoicing data with the help of a
multilayer feed-forward neural network (MLFFNN), the following conclusions were made:
1. The framework for classifying data into distinct classes is independent of the entities used
as examples (customers, parts, etc.), and thus the analysis is very general in nature.
2. After the MLFFNN learns to classify the customer invoicing data, it can serve as a
forecasting tool, where early invoicing data for an unknown customer would serve to forecast
that customer's classification in days to come.
The neural network has proven to be a good classifier, predictor and forecaster of data. It
overcomes the limitations of statistical methods for analysing data and also provides the
advantages of high accuracy, noise tolerance and ease of maintenance.