GENDER CLASSIFICATION THROUGH DYNAMIC KEYSTROKE BASED ON MOBILE PHONE USING ARTIFICIAL NEURAL
NETWORK
SITI HAJAR BT MAT ZAN
BACHELOR OF COMPUTER SCIENCE WITH HONOURS
(COMPUTER NETWORK SECURITY)
UNIVERSITI SULTAN ZAINAL ABIDIN
2018
GENDER CLASSIFICATION THROUGH DYNAMIC KEYSTROKE BASED ON MOBILE PHONE USING ARTIFICIAL NEURAL
NETWORK
SITI HAJAR BT MAT ZAN
BACHELOR OF COMPUTER SCIENCE WITH HONOURS
(COMPUTER NETWORK SECURITY)
FACULTY OF INFORMATICS AND COMPUTING
UNIVERSITI SULTAN ZAINAL ABIDIN, TERENGGANU, MALAYSIA
AUGUST 2018
DECLARATION
I hereby declare that the project report entitled Gender Classification Through
Dynamic Keystroke Based on Mobile Phone Using Artificial Neural Network is based
on the results of my own research, with information from other sources acknowledged
as stated. I also declare that it has not been submitted for any other degree or to any
other educational institution.
________________________________
Name : Siti Hajar Bt Mat Zan
Date : .................................................
CONFIRMATION
This project report, titled Gender Classification Through Dynamic Keystroke Based on
Mobile Phone Using Artificial Neural Network, was prepared and submitted by Siti Hajar Bt
Mat Zan. The report has met the requirements in terms of scope, quality and
presentation for the Bachelor of Computer Science (Network Security) with Honours at
Universiti Sultan Zainal Abidin.
_______________________________
Supervisor : Dr Mohamad Afendee b Mohamed
Date : ..................................................
DEDICATION
First of all, I would like to express my gratitude to Allah S.W.T, the Most Gracious and
the Most Merciful, for the blessings that enabled me to complete my
final year project, Gender Classification Through Dynamic Keystroke Based on
Mobile Phone Using Artificial Neural Network.
The research presented in this dissertation could not have been completed without
the support, encouragement and cooperation of many people. I would like to express
my deepest gratitude to my supervisor, Dr. Mohamad Afendee bin Mohamed, who patiently
supervised, advised, taught and encouraged me at every stage of the
development of this project. I would like to thank
him for the opportunity to work and learn under his guidance while
completing this project.
I would also like to thank all the lecturers, especially the panel members involved
in this final year project at the Faculty of Informatics and Computing, for their comments,
feedback and advice, which helped improve my project. In addition, I would like
to show my appreciation to my parents, Mat Zan A Rani and Zaida Che Muda, who gave me
moral support and attention so that I could finish this project.
My sincere thanks also go to my beloved friends for their encouragement
and valuable advice in completing this project. May Allah S.W.T bless all of you
for the effort you have given. This project would not have been accomplished without your
precious support.
ABSTRACT
Cyber crime, also known as computer-oriented crime, refers to the misuse of computers
and network equipment to steal, modify or damage data for a particular purpose.
It has been used to threaten people's personal security. Computer crimes are difficult to
detect and prove because they happen virtually and cannot easily be proven
physically. Today most cyber crimes remain unsolved because security experts are
limited in the evidence they can gather to identify the potential
attacker. One way to search for a criminal is to narrow down
the possible identities by identifying the criminal's gender. Gender
identification helps a cyber security department take further action against criminals
involved in cyber crime. Keystroke dynamics is an alternative approach to
identifying the gender of a criminal. Keystroke dynamics is a behavioural
biometric that refers to the rhythm with which an individual types on a touch keyboard,
and it serves as an automated method of identification. Keystroke
dynamics captures the unique behavioural characteristics of an individual's typing rhythm,
and a gender-labelled dataset is generated automatically by recording the typing patterns
of a group of mobile users. Gender is classified according to criteria
defined on the keystroke feature types. In this project, the artificial neural network
algorithm is applied. The artificial neural network creates a signature from the typing
pattern of each individual and classifies the data as male or
female. It is anticipated that this will make a great contribution to investigators by
providing gender information for their investigations.
ABSTRAK
Cyber crime, also known as computer-oriented crime, refers to the misuse of computers
and network equipment to steal, modify and damage data for a particular purpose.
It has been used to threaten people's personal security. Computer crimes are difficult
to detect because cyber crimes cannot easily be proven physically. At present, most
cyber crimes remain unsolved because the evidence-gathering process available to
security experts for identifying potential cyber criminals is limited. There is a method
that can be used to search for criminals by narrowing down the possible identity of the
criminal through identifying their gender. Gender identification is useful for cyber
security departments in taking further action against criminals involved in cyber crime.
Keystroke dynamics is an alternative approach to identifying the gender of a criminal.
Keystroke dynamics is known as a behavioural biometric that refers to the rhythm of an
individual's typing on a keyboard, whether a computer keyboard or a touch keyboard,
based on the way they type, and it is an automatic method of identification. Keystroke
dynamics captures the unique behavioural characteristics of an individual's typing
rhythm and automatically produces a gender dataset by recording the patterns of a group
of computer users. Gender is classified based on criteria that meet the requirements of
the keyboard feature types. In this project, the artificial neural network algorithm
will be used. The artificial neural network will create a model of each individual's
pattern and distinguish their data classification as either male or female. It is hoped
that this classification data on the gender of criminals will make a great contribution
to investigators by providing information for carrying out investigations.
TABLE OF CONTENTS
DECLARATION
CONFIRMATION
DEDICATION
ABSTRACT
ABSTRAK
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS / TERMS / SYMBOLS
LIST OF APPENDICES
CHAPTER 1 INTRODUCTION
1.1 Project Background
1.2 Problem Statement
1.3 Objectives
1.4 Scope of Work
1.5 Limitation of Work
1.6 Thesis Structure
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Literature Review
    2.2.1 Gender Classification
    2.2.2 Biometrics
    2.2.3 Biometrics Techniques
    2.2.4 Dynamic Keystroke
2.3 Method Use
2.4 Data Mining
    2.4.1 Artificial Neural Network
    2.4.2 Logistic Regression
    2.4.3 Naive Bayes
    2.4.4 Decision Table
    2.4.5 Sequential Minimal Optimization (SMO)
2.5 Review Summary
3.5 Summary
CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Scientific Research Method
3.3 Knowledge Discovery in Database
    3.4.1 Attribute Selection
    3.4.2 Data Pre-Processing
    3.4.3 Data Transformation
    3.4.4 Data Mining
    3.4.5 Interpretation / Evaluation
3.5 System Requirement and Specification
    3.5.1 Hardware Requirement
    3.5.2 Software Requirement
3.5 Framework
3.6 Datasets
3.7 Summary
CHAPTER 4 RESULT AND DISCUSSION
4.1 Introduction
4.2 Experimental Results
4.3 Comparison in Accuracy of Data Model
4.4 Duration of Building Data Model
4.5 Artificial Neural Network Algorithm Confusion Matrix
    4.5.1 Calculation
4.6 Summary
CHAPTER 5 CONCLUSION
5.3 Limitation
5.4 Future Work
REFERENCES
LIST OF TABLES
TABLE TITLE
2.5 Summary of Literature Review
3.5.1 List of Software Requirement
3.5.2 List of Hardware Requirement
3.6 Table of Dataset Keystroke Dynamic
4.6.2 Comparison Between Data Mining Based on Precision, Recall and F-Measure
4.6.4 Artificial Neural Network Algorithm Confusion Matrix Based on Gender
4.5.2 Comparison of Confusion Matrix of 5 Different Data Mining
LIST OF FIGURES
FIGURE TITLE
1.1 Keystroke Dynamics in the Field of Computer Security
2.4.1 Diagram of Artificial Neural Network
2.4.3 Formula of Naive Bayes
2.4.5 Formula of Sequential Minimal Optimization
3.2 Scientific Research Method
3.3 Knowledge Discovery in Database for Data Analysis
3.5 Framework
4.2 Graph of 54 Attribute of Different Keystroke Dynamic
4.4.1 Comparison Between Data Mining
LIST OF ABBREVIATIONS/TERMS/SYMBOLS
ANN Artificial Neural Network
KDD Knowledge Discovery in Databases
KD Key Down
KU Key Up
FAR False Acceptance Rate
FRR False Rejection Rate
EER Equal Error Rate
PP Key Press-Key Press
PR Key Press-Key Release
RP Key Release-Key Press
RR Key Release-Key Release
QP Quadratic Programming
SMO Sequential Minimal Optimization
LIST OF APPENDICES
APPENDIX TITLE
APPENDIX A
APPENDIX B
CHAPTER 1
INTRODUCTION
1.1 Project Background
Biometrics consists of traits such as keystroke dynamics, mouse dynamics, fingerprints,
voice and face, some of which are considered non-intrusive because they can be captured
without specialised hardware. The term "biometrics" is borrowed from the Greek words
'bio', meaning life, and 'metric', meaning to measure. Biometrics refers to the
classification of humans by their physical characteristics or traits. Biometrics is
classified into two types: physiological and behavioural. Physiological biometrics
relate to parts of the body, such as the fingerprint, voice and face.
Behavioural biometrics, on the other hand, relate to the
behaviour of a person. Keystroke dynamics and signature verification are
examples of behavioural biometrics. [1]
Keystroke dynamics is a behavioural biometric that aims to identify users based on
the way they type, using features such as keystroke duration, key hold time, keystroke
latency, typing errors and keystroke force, captured from input devices
ranging from a normal keyboard to the soft keyboard of a mobile phone. Many
previous studies have demonstrated that keystroke dynamics has the potential
to serve as a low-cost biometric for identifying gender. [2]
Gender is a type of soft biometric that can help cyber intelligence units investigate
and obtain relevant information about a person involved in cyber crime. Gender
classification has been successfully applied in several biometric identification systems
based on face, speech, iris or gait recognition. Face recognition methods often
perform gender classification first, before the face recognition process, to halve the
number of comparisons and produce a faster result.
A neural network is defined as a network composed of a number of interconnected
units [3]. It is designed to mimic the computing style of the human brain. As a
result, it is powerful enough to solve a variety of problems that have proved
difficult for conventional digital computational methods [3]. A neural network can
detect complex nonlinear relationships between inputs and outputs and does not
require extensive statistical training [4].
In this project, keystroke feature data from 51 students are used. The data
consist of keystroke dynamic features, including flight times and dwell times, that
were collected on mobile devices in a previous research study, in which 51 male and
female students were asked to type a password so that their keystroke dynamic
features could be extracted by gender. The Weka tool is used to train models on this
existing data and test their accuracy in classifying gender.
1.2 Problem Statement
Gender identification is one step towards solving cyber crime. The most common approach
to identifying a cyber criminal's gender in an investigation
uses several types of biometric data, such as face, iris and speech recognition.
This approach is very tough for cyber intelligence units to implement because it cannot
capture information about the cyber criminal while an intrusion occurs, and its
implementation cost is higher than that of keystroke dynamics. Consequently,
cyber crimes remain unsolved because of the limitations of the evidence-gathering
process used to identify the potential attacker, which motivated this proposal.
1.3 Objectives
The objectives are listed below:
1. To study the ability of keystroke dynamics to support gender classification.
2. To model gender classification on collected keystroke dynamics data using an
Artificial Neural Network (ANN).
3. To evaluate and test the accuracy of the model in
classifying gender.
1.4 Scope of Work
The scope of the project is listed below:
1. Pre-process the keystroke dynamics data of 51 students before passing it to the
algorithm, converting the raw data into a clean data set that is
feasible to analyse.
2. Extract keystroke features that include dwell times
(the time interval during which a key is pressed down), flight times (the duration
between keystrokes) and typing speed.
3. Create a machine learning model from the extracted keystroke
features using the WEKA tool.
4. Test the accuracy of the data model.
1.5 Limitation of Work
The accuracy of the gender classification may be low because of the sample size, since
the accuracy of the data model depends on the number of data samples used. The
accuracy of the data model, expressed as a percentage, is also influenced
by inaccuracies and errors in the collected keystroke dynamics data,
because the typing behaviour of each student may not be consistent: it can be
affected by emotion, stress level, switching between different
touch screen keyboards, the influence of medication or alcohol, and more. In addition,
switching the type of keyboard may affect how comfortably each user
operates the mobile data collection application, and therefore how precisely
the keystroke dynamic features are extracted.
1.6 Thesis Structure
The first chapter of this report is the introduction, which includes the project
background, problem statement, objectives and scope of this project. The main
contribution of the project is stated in this chapter. The second chapter is the
literature review. It provides knowledge and understanding of
previous research in related fields, which helps the project
to be carried out with as few imperfections as possible. The third chapter describes
the methodology used in this research. The project methodology depicts the
development phases used in the design, testing and implementation of the system. The
requirements needed to carry out the project, such as hardware
and software requirements, are also included in this chapter. Chapter 4 covers the
implementation and testing of the project.
Results from various inputs and outputs are tested and recorded to verify and predict the
accuracy of gender classification. Chapter 5 concludes with the general contribution of
the project, including future work that can improve it.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter discusses previous articles and research papers
related to this topic. Information has been gathered to gain a better understanding of
the technologies used and applied.
2.2.1 Gender Classification
Gender classification identifies a person's gender, for example male or
female, based on biometric information. Usually facial images are used to extract
features, and a classifier is then applied to the extracted features to learn a gender
recognizer. It is an active research topic in the computer vision and biometrics
fields. The result of gender classification is often a binary
value, 1 or 0, representing male or female. Gender recognition is therefore
a two-class classification problem. Although other
biometric traits, such as gait, can also be used for gender classification, face-based
approaches are still the most popular method for gender discrimination. [5]
According to the research paper by Gokhan Silahtaroglu (2015), the gender
of a customer is a very important parameter for retailing and marketing, since it is
well known to play an important role in purchasing habits. The study proposes a
model to predict the gender of an online customer by analysing mouse
movements, a behavioural biometric. To accomplish this,
a novel data cube model was developed. The model consists of six dimensions,
namely customer demographic data, customer visits, mouse movements, online
shopping cart, external data and time, and it detects customer gender using
an artificial neural network. The paper reports that the gender of an online customer
can be predicted with a success rate of up to 80%. The prediction or classification of
online user gender is useful for promotional and marketing purposes.
2.2.2 Biometrics
The word biometrics is derived from the Greek words 'bio', meaning life, and 'metric',
meaning to measure. Biometrics refers to the identification of humans by their traits or
characteristics and is used as a form of identification. Biometrics can be
categorized into two types: physiological and behavioural.
Physiological biometrics relate to the physical characteristics of a person, including
the iris, fingerprint, face, DNA and many more. Behavioural biometrics are associated
with the behaviour of a person, including voice, mouse dynamics, keystroke dynamics
and the signature of the user. Historically, biometrics have had the problem that they
tend to be rather expensive for the average end user. [6]
These characteristics and their related techniques are also commonly
classified as soft and hard biometrics. Soft biometrics are characteristics or
features, usually associated with behavioural traits, that provide some information about
the individual but lack the distinctiveness to differentiate effectively between any two
individuals [7]. Hard biometric traits, on the other hand, are considered better in
terms of distinctiveness; the fingerprint and the geometry of the face, for example, can
give good results when identifying an individual.
2.2.3 Biometric techniques
Many biometric techniques have been used in previous research to
distinguish human characteristics such as gender, emotional state and age.
In their research on identifying emotional states through keystroke dynamics, Clayton
Epp and his team propose a solution for determining user emotions by analysing the
rhythm of an individual's typing patterns on a standard keyboard. The keystroke dynamics
approach allows emotion to be determined without influencing the user, using technology
that is in widespread use today. They conducted a field study in which participants'
keystrokes were collected and their emotional states were recorded via self-reports;
various data mining techniques were then used to build models for 15 different
emotional states.
The list below shows the most common techniques and their main defining
characteristics for distinguishing human characteristics. [8,9]
1. Fingerprint scanning: A fingerprint is the pattern of furrows on the surface of a
fingertip. Fingerprints are so distinct that even the fingerprints of identical twins
are different. This technique has been used for centuries and its validity is
well established.
2. Face recognition: This technique focuses on recognizing the global positioning
and shape of the eyes, eyebrows, nose, lips and chin of the face of an individual.
Applications using identification based on face geometry range from the static,
where users are still in front of non-variable backgrounds, to dynamic, uncontrolled
face identification with dynamic backgrounds.
3. Iris scan: The iris is the annular region of the eye bounded by the pupil and the
sclera (white of the eye) on either side. The visual texture of the iris stabilizes
during the first two years of life and its complex structure carries very distinctive
information useful for the identification of individuals.
4. Hand geometry: This biometric technique focuses on the shape of the hand,
including the length of the fingers and their respective widths. The technique is very
simple, relatively easy to use and inexpensive. Unfortunately, the physical size of
a hand geometry-based system is too big for applications in laptop computers. In
addition, while using the shape of the hand for authentication is
viable, using it to continuously verify a user may not be feasible.
The most important feature, and the one most sought after in this proposal, is
accuracy in discriminating human characteristics in a computer security
environment, especially for authentication, password hardening and detecting cyber
criminals.
2.2.4 Keystroke dynamics
Keystroke dynamics biometrics dates back to the late 19th
century, when the telegraph revolution was at its peak. The telegraph was the major
long-distance communication instrument of that century. Telegraph operators could
readily tell each other apart simply by listening to the tapping rhythm of dots
and dashes. The telegraph key served as an input device in those days, just as the
computer keyboard, mobile keypad and touch screen are common input devices in the
21st century. Moreover, the handwritten signature, which humans have relied on to
verify the identity of an individual for many centuries, is governed by the same
neurophysiological factors as keystroke patterns. [10]
An individual's unique profile can be generated by monitoring keyboard keystrokes while
the individual types in the program provided. The keyboard input includes the
times at which each key is pressed down and released, the number of backspaces used,
the position of the keys used and the total number of keys pressed. Keystroke dynamics
systems are usually evaluated based on the following metrics [11]:
1 False Acceptance Rate (FAR) – the percentage of attempts in which the system wrongly
gives access to an unauthorized user
2 False Rejection Rate (FRR) – the percentage of attempts in which the system wrongly
denies access to an authorized user
3 Equal Error Rate (EER) – the error rate when the system's parameters are set such
that FRR and FAR are equal. The lower the EER, the more accurate the system.
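The three metrics above can be illustrated with a minimal Python sketch. The similarity scores, the accept rule (scores at or above a threshold are accepted) and the function names `far_frr` and `approximate_eer` are assumptions made for illustration only; they are not taken from the cited studies.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR), assuming scores >= threshold are accepted."""
    # FAR: share of impostor attempts that are wrongly accepted.
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    # FRR: share of genuine attempts that are wrongly rejected.
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


def approximate_eer(genuine_scores, impostor_scores):
    """Scan the observed scores as thresholds and return
    (threshold, FAR, FRR) where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, far, frr)
    return best[1], best[2], best[3]


# Hypothetical similarity scores (higher = more like the genuine user).
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]

print(far_frr(genuine, impostor, 0.55))    # (0.2, 0.2)
print(approximate_eer(genuine, impostor))  # (0.6, 0.2, 0.2)
```

With real data the two rates rarely meet exactly, so the EER is usually reported as the point where their difference is smallest, which is what the scan above approximates.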
Keystroke dynamics refers to the habitual patterns or rhythms an individual exhibits
while typing on a keyboard input device. These rhythms and patterns of typing are
idiosyncratic, in the same way as handwriting or signatures, due to their similar
governing neurophysiological mechanisms [12]. Keystroke dynamics (also known
as keystroke biometrics or typing dynamics) can also be defined as the detailed
timing information that describes when each key was pressed (KeyDown, KD) and
when it was released (KeyUp, KU) as a person types on a computer keyboard.
This also includes dwell times (the time interval during which a key is pressed down),
flight times (the duration between keystrokes), typing speed, frequency of errors and
use of modifier keys. The principal idea behind this biometric measurement is that every
user has a particular way of typing and that, like any other behavioural biometric
system, it allows the identification, authentication or classification of these users. [13]
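As a rough illustration of how dwell and flight times can be derived from KD and KU timestamps, consider the sketch below. The event format, the sample timings and the function name `keystroke_features` are hypothetical, and flight time is assumed here to mean the release-to-press (RP) interval; the press-to-press (PP) and release-to-release (RR) latencies from the list of abbreviations are computed alongside it.

```python
def keystroke_features(events):
    """events: list of (key, key_down_ms, key_up_ms) in typing order."""
    # Dwell time: how long each key is held down (KU - KD).
    dwell = [up - down for _, down, up in events]
    rp, pp, rr = [], [], []
    # Walk over consecutive key pairs to get the between-key timings.
    for (_, down1, up1), (_, down2, up2) in zip(events, events[1:]):
        rp.append(down2 - up1)    # release of key 1 to press of key 2
        pp.append(down2 - down1)  # press of key 1 to press of key 2
        rr.append(up2 - up1)      # release of key 1 to release of key 2
    return {"dwell": dwell, "RP": rp, "PP": pp, "RR": rr}


# Hypothetical timestamps in milliseconds for typing "pas".
sample = [("p", 0, 95), ("a", 180, 260), ("s", 240, 330)]
features = keystroke_features(sample)

print(features["dwell"])  # [95, 80, 90]
print(features["RP"])     # [85, -20]
```

The negative RP value in the sample arises because the third key is pressed before the second is released, which is common in fast typing.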
Figure 1.1 Keystroke dynamics in the field of Computer security
If keystroke dynamics, the technology of distinguishing people by their typing rhythm,
were demonstrably reliable, it would significantly advance computer security. For
criminal investigations, keystroke dynamics could tie a suspect to the "scene" of a
computer-based crime, much like a fingerprint does in a real-world crime. For access
control, keystroke dynamics could act as a second factor in authentication: an impostor
who compromised a password would still need to type it with the correct rhythm. For
insider-threat detection, keystroke dynamics could detect when a masquerader is using
another user's account; the technology could even identify who is using a backdoor
account (Kevin S. Killourhy et al., January 2012).
2.3 Method used
A keystroke dynamics biometric system is built from three main modules:
a data capture module, a feature extraction module and a classifier module. The data
capture module is the fundamental stage; it consists of a program that
collects data on the keystroke behaviour of an individual while the
individual interacts with the keyboard. The feature extraction module
analyses the raw keystroke data to generate user features, which are stored as a
reference template that can be used to distinguish user behaviour through keystrokes on
a mobile touch screen keyboard. Finally, the classifier module is used to identify a
user based on the extracted features.
2.4 Data Mining
The growth of computer technology has produced an enormous amount of data
nowadays. As a result, analysing a particular data set has become difficult.
Data mining is therefore useful for extracting the crucial and beneficial data
from a large amount of data. Data mining is the process of finding trends and patterns in
data to discover new information, and it is a step in KDD (Knowledge Discovery in
Databases). An algorithm is used to extract information and patterns from the data and
to produce a statistically sound result. Comparing the accuracy of the predictive models
built with different techniques is the main reason this research is done.
2.4.1 Artificial neural network
An artificial neural network, also known as a neural network, is a network composed
of interconnected units (neurons). It is designed to resemble the human brain, which is
a dynamic organ that trains and learns over a period of time. These
biological and behavioural characteristics of the human brain are modelled with
artificial neurons in order to attain better results in data mining [14]. Studies have
found that it produces very efficient and effective results in the data mining field
(Ripundeep et al., 2014). Optimization and time-consuming calculations are no longer
needed once the ANN has been trained, because it is fast and accurate after the training
process is completed: the network outputs are predicted directly from the provided
inputs based on what the network has learned for a specific system. There are many ANN
types that are used in various applications such as engineering, weather and flood
forecasting, business and medicine, because of their power and ability to generalize
to practical problems (Coit et al., 1998; Twomey et al., 1998).
According to Priyanka Mehtani and her team in their research paper Pattern
Classification using Artificial Neural Networks, the word network in neural network
refers to the interconnection between neurons present in the various layers of a system.
Every system is basically a three-layered system, consisting of an input layer, a hidden
layer and an output layer. The input layer has input neurons that transfer data via
synapses to the hidden layer, and similarly the hidden layer transfers this data to the
output layer via more synapses. The synapses store values called weights, which they use
to manipulate the data passed between layers. In neural networks,
backpropagation is one of the most commonly used learning algorithms.
The network's output is compared with the expected output and the error is
computed. The error is fed back and the weights are adjusted with each iteration, so the
error gradually declines until the neural model produces the expected output
(Giovanni et al., 2013).
Figure 2.4.1 Diagram Of Artificial Neural Network
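The three-layer structure and the error feedback described above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions: the toy samples, the two normalised input features, the class labels 0 and 1, the learning rate and the function names `train` and `predict` are all hypothetical, and this is not the WEKA implementation used in the project.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    # Each weight list ends with a bias term.
    h = [sigmoid(sum(w * xi for w, xi in zip(wj[:-1], x)) + wj[-1])
         for wj in w_hidden]
    y = sigmoid(sum(w * hj for w, hj in zip(w_out[:-1], h)) + w_out[-1])
    return h, y


def train(samples, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    errors = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            h, y = forward(w_hidden, w_out, x)
            err = target - y                      # compare with expected output
            total += err * err
            # Feed the error back: output delta first, then hidden deltas.
            delta_out = err * y * (1 - y)
            delta_h = [delta_out * w_out[j] * h[j] * (1 - h[j])
                       for j in range(n_hidden)]
            for j in range(n_hidden):
                w_out[j] += lr * delta_out * h[j]
            w_out[-1] += lr * delta_out
            for j in range(n_hidden):
                for i in range(n_in):
                    w_hidden[j][i] += lr * delta_h[j] * x[i]
                w_hidden[j][-1] += lr * delta_h[j]
        errors.append(total / len(samples))       # mean squared error per epoch
    return w_hidden, w_out, errors


def predict(w_hidden, w_out, x):
    return forward(w_hidden, w_out, x)[1]


# Hypothetical normalised (dwell, flight) features with class labels 0 and 1.
samples = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.7], 1)]
w_hidden, w_out, errors = train(samples)
print(errors[0], errors[-1])  # the error should shrink over the epochs
```

The hidden deltas are computed before the output weights are updated, so each backward pass uses the same weights that produced the forward output.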
2.4.2 Logistic Regression
Logistic regression is considered one of the most common predictive models and is
used in a variety of prediction and identification tasks, including the investigation of
cyber crime. Logistic regression determines the relative importance of each variable by
estimating probabilities using the logistic function. In logistic regression, the model
complexity is low, especially when few interaction terms and variable
transformations are used. This means that over-fitting and long training times are less
of an issue. Performing variable selection is a way to further reduce the
complexity of the model and consequently decrease the risk of over-fitting, at the cost
of a loss in the flexibility of the model (Stephan, 2003).
However, predicting continuous outcomes is difficult with logistic regression. Because
it attempts to predict outcomes based on a set of independent variables, logit models
may result in overconfidence: the models can appear to have more predictive
power than they actually do as a result of sampling bias.
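To make the role of the logistic function concrete, the short sketch below turns a weighted sum of features into a class probability. The weights, bias and feature values are hypothetical numbers chosen for illustration, not fitted coefficients from this project.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1 for one sample under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical coefficients for two keystroke features (dwell, flight).
weights = [0.8, -0.5]
bias = -0.1
p = predict_proba(weights, bias, [1.2, 0.4])
print(p > 0.5)  # True: the weighted sum is positive, so class 1 is more likely
```

The sign and size of each weight indicate how strongly the corresponding variable pushes the prediction towards one class, which is the sense in which the model expresses relative importance.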
2.4.3 Naive Bayes
Naive Bayes is one of the simplest classification techniques for constructing
classifiers: models that assign class labels to problem instances represented as vectors
of feature values, where the class labels are drawn from a finite set.
Figure 2.4.2 Formula of Naive Bayes
Naive Bayes can be trained on discrete data and classify in a short time, and it is not
sensitive to irrelevant features. Only a small amount of training data is
required to estimate the parameters necessary for
classification. Unfortunately, Naive Bayes assumes that the features are independent,
which may cause a loss of accuracy.
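The independence assumption means the score of a class C for features x1..xn is proportional to P(C) multiplied by each P(xi | C). A minimal sketch for discrete features is given below; the discretised feature values, the class names "A" and "B", the add-one (Laplace) smoothing and the function names are assumptions for illustration and do not describe WEKA's implementation.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def posterior(model, row):
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total            # prior P(C)
        for i, value in enumerate(row):
            # P(x_i | C) with add-one smoothing to avoid zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical discretised features: (dwell category, flight category).
rows = [("short", "fast"), ("short", "slow"), ("long", "slow"), ("long", "slow")]
labels = ["A", "A", "B", "B"]
model = train_naive_bayes(rows, labels)
result = posterior(model, ("short", "fast"))
print(max(result, key=result.get))  # A
```

Because each feature contributes its own independent factor, only simple counts are needed for training, which is why a small amount of data is enough to estimate the parameters.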
2.4.4 Decision Table
A decision table is a data mining classification
technique based on a hierarchical table in which each entry of the
higher-level table is split by the values of a pair of additional attributes to form
another table. A visualization method is presented that allows a model with many
attributes to be understood even by users who are not familiar with machine learning.
Various forms of interaction are used to make the visualization more useful
than a static design.
2.4.5 Sequential Minimal Optimization (SMO)
Sequential Minimal Optimization (SMO) is a data mining algorithm that provides a
fast way of training support vector machines. Training a support vector machine requires
solving a very large quadratic programming (QP) optimization problem. SMO breaks this
large QP problem down into a series of the smallest possible QP problems, which
are solved analytically. SMO can handle very large training sets, and its training time
scales somewhere between linear and quadratic in the training set size on various test
problems. In addition, SMO is fastest for linear support vector machines and sparse data
sets [John Platt, 1998].
Figure 2.4.3 Formula of Sequential Minimal Optimization
2.5 Review Summary
Author: Shing-hon Lau, Roy Maxion et al., 2014
Title: Clusters and Markers for Keystroke Typing Rhythms
Algorithm: Agnes clustering, Sparse Logistic Regression, Support Vector Machine (SVM)
Advantages: Typists can be grouped into a small number of types. Each type is distinguished from the rest of the population by characteristic keystroke features.
Disadvantages: The work presented in the paper is only a preliminary investigation. Only one data set was examined, and generalization to other data sets remains to be verified.

Author: Priyanka Mehtani et al., 2010
Title: Pattern Classification using Artificial Neural Networks (IRIS dataset)
Algorithm: Artificial Neural Networks, Probabilistic Neural Network (PNN), Optical Backpropagation Algorithm
Advantages: ANN gives the best accuracy in the classification of gender based on the IRIS dataset.
Disadvantages: The Optical Backpropagation Algorithm gives less accurate classification than the Artificial Neural Network.

Author: Gokhan Silahtaroglu et al., 2015
Title: Predicting Gender of Online Customer Using Artificial Neural Networks
Algorithm: Artificial Neural Networks, K-Means, K-Medoids
Advantages: The tests suggest that predictions are accurate enough to be used for business purposes such as marketing and production. The paper demonstrates the reliability, accuracy and feasibility of predicting online customer gender.
Disadvantages: -

Author: Dr. Regan Mandryk, Clayton Epp, Mike Lippold et al., 20
Title: Identifying Emotional States through Keystroke Dynamics
Algorithm: Decision Tree
Advantages: Determines the affective emotional state of the user without the user being aware of it and without the user being continuously reminded of being recorded.
Disadvantages: Depending on the frequency of the sample period, the interruption to subjects' daily activities can be burdensome.

Author: Anushri Jaswante, Asif Ullah Khan, Bhupesh Gour et al.
Title: Back Propagation Neural Network Based Gender Classification Technique Based on Facial Features
Algorithm: Back Propagation Neural Network, Viola-Jones Algorithm
Advantages: The proposed methodology gives 90% accurate results in identifying gender from images. The proposed system has low complexity, and its efficiency makes it a good choice for real-time systems.
Disadvantages: -

Author: Pranjali Pohankar, Snehalata Karmare et al., 2014
Title: Character Recognition using Artificial Neural Network
Algorithm: Artificial Neural Network (Back Propagation Neural Network)
Advantages: The paper shows that a simple character recognition program can be designed. The algorithm used works on the gradient descent rule.
Disadvantages: Accurate handwritten character recognition is very difficult to achieve due to the great variation in writing style and in the size and shape of the characters.

Author: Stephan Dreiseitl, Lucila Ohno-Machado, 2002
Title: Logistic Regression and ANN
Algorithm: Logistic Regression, Artificial Neural Network
Advantages: Neural networks are better in terms of discriminatory power.
Disadvantages: The model building process is easier for logistic regression.

Author: Ripundeep Digh Gill and Ashima, 2014
Title: Understanding of Neural Networks
Algorithm: Neural Networks
Advantages: Neural networks offer significant learning abilities and are able to represent highly nonlinear and multivariable relationships.
Disadvantages: There is no comparison between the neural network and logistic regression algorithms.

Author: Hongjun Lu et al.
Title: Decision Tables: Scalable Classification Exploring RDBMS Capabilities
Algorithm: Decision Table
Advantages: A novel approach to building efficient, scalable classifiers by exploring the capability of relational database management systems that support powerful data aggregation and summarization functions.
Disadvantages: -

Author: John Platt, 1998
Title: Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines
Algorithm: Sequential Minimal Optimization
Advantages: SMO is an improved training algorithm for SVMs. Its sub-problems are solved quickly and analytically, which significantly improves scaling and computation time.
Disadvantages: -
2.6 Summary
This chapter discussed the direction of the research and development that will be taken
to meet the requirements of the system design. This is to ensure that the end product is
able to predict or classify gender with high accuracy.
CHAPTER 3
METHODOLOGY
3.1 Introduction
This chapter explains the details of the methodology that is used. The project
methodology should be organized in a systematic and scientific way to solve a problem
and to ensure that the project objectives are achieved. This study focuses on a gender
classification project based on data from a group of students, covering the DSS model
and its testing and evaluation for usability. Therefore, a suitable methodology needs
to be adopted to ensure the successful completion of the project. The study first
pre-processes the keystroke dynamics data collected from a group of students using a
mobile-based system. Then, each record is classified according to the criteria it meets
using a Neural Network. The gender classifier using the Neural Network was developed
following the waterfall model, which consists of a detailed plan describing how to
develop, maintain, replace and alter the system with specific software tools.
3.2 Scientific Research Method
The scientific research method is a process of experimentation that is used to explore
observations and answer questions. The main purpose of an experiment is to
determine whether observations agree or conflict with the predictions derived from a
hypothesis. The experiment is designed so that changes to one item cause
something else to vary in a predictable way.
Figure 3.2 Scientific Research Method
Observations on the keystroke dynamics for gender classification lead to the research
questions in the next section.
At this stage, evidence from previous experiments, personal scientific observations
and previous research is evaluated to formulate the question.
The hypothesis is stated in a way that can easily be measured and is constructed to
answer the research question. In this case, keystroke dynamics is proposed to be
relevant to gender classification.
A literature study of previous research is done to find the best way to do things and
to avoid repeating the mistakes of the past. The purpose of testing the
hypothesis is to determine whether observations of the real world agree or conflict
with the predictions derived from the hypothesis. After the testing phase, general
theories are developed, but they must be consistent with most or all available data and
with other current theories.
3.3 Knowledge Discovery in Database
Knowledge Discovery in Database (KDD) is the process of searching for useful
information and patterns in data, and it consists of many steps. According to Gregory
Piatetsky-Shapiro, Christopher Matheus, Padhraic Smyth and Ramasamy Uthurusamy
(1996), KDD refers to the process of discovering useful knowledge from data, which
involves the evaluation and possibly the interpretation of the patterns in order to
decide what qualifies as knowledge. The core of the process is the data mining method
used for extracting and discovering patterns from data.
Figure 3.4 Knowledge Discovery in Database for Data Analysis
The data consists of the typing rhythms of 51 students and was taken from a previous
research paper. The dataset contains 954 instances and 54 attributes of keystroke
features captured on a mobile phone soft keyboard. The features are timing values
expressed in milliseconds: release-to-press (RP), the time interval between a key being
released and the next key being pressed; press-to-press (PP), the time interval between
two successive key presses; release-to-release (RR), the time interval between two
successive key releases; and press-to-release (PR), the time interval between a key
being pressed and released.
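The four timing features can be derived from raw key events as in the following
minimal sketch. It assumes that the press and release timestamps of each key are
available in milliseconds and takes PR as the hold time of a single key, following the
definition above. The function name and the sample timestamps are hypothetical.

```python
def timing_features(presses, releases):
    """Compute PP, PR, RR and RP intervals (ms) from per-key timestamps.

    presses[i] and releases[i] are the press and release times of the i-th key.
    """
    n = len(presses)
    return {
        # press-to-release of the same key (dwell time)
        "PR": [releases[i] - presses[i] for i in range(n)],
        # press-to-press of successive keys
        "PP": [presses[i + 1] - presses[i] for i in range(n - 1)],
        # release-to-release of successive keys
        "RR": [releases[i + 1] - releases[i] for i in range(n - 1)],
        # release of one key to press of the next (can be negative if keys overlap)
        "RP": [presses[i + 1] - releases[i] for i in range(n - 1)],
    }


# Hypothetical timestamps for three keystrokes.
features = timing_features([0, 150, 320], [90, 230, 400])
print(features)
```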
3.3.1 Attribute Selection
Attribute selection, also known as feature selection, is the selection of a subset of
relevant features (variables, predictors) for use in model construction. Its purpose is
to reduce training time, to simplify the model so that it is easier to interpret, and to
reduce over-fitting by enhancing generalization.
3.3.2 Data Pre-processing
Data pre-processing is the transformation of raw data into an understandable
format. Real-world data is commonly incomplete, inconsistent and likely to contain
errors. Techniques for cleaning such data include replacing a missing value with the
mean of the attribute and removing records with missing values.
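The two cleaning techniques mentioned above can be sketched as follows. This is a
minimal illustration that represents a missing value as None. The function names and
the sample rows are assumptions and are not taken from the study's dataset.

```python
def impute_mean(rows):
    """Replace each missing value (None) with the mean of its attribute (column)."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        present = [row[col] for row in rows if row[col] is not None]
        # Assumes every column has at least one observed value.
        means.append(sum(present) / len(present))
    return [
        [means[col] if row[col] is None else row[col] for col in range(n_cols)]
        for row in rows
    ]


def drop_incomplete(rows):
    """Remove every record that contains a missing value."""
    return [row for row in rows if None not in row]


# Hypothetical records with two timing attributes.
raw = [[100.0, None], [200.0, 50.0], [None, 70.0]]
print(impute_mean(raw))
print(drop_incomplete(raw))
```

Imputation keeps all records at the cost of inventing values, while removal keeps only
observed values at the cost of a smaller dataset.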
3.3.3 Data Transformation
Data transformation is the process of converting data from the format of a source
system into the format of a destination system. In this case, the dataset is in CSV
format. In order to train a model using Weka, the dataset must be converted from CSV
into ARFF format before the training and testing phase. The CSV format restricts the
dataset from containing special characters.
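The conversion can be illustrated with a minimal sketch that writes an ARFF header
followed by the unchanged data rows. It assumes that every attribute except the class
is numeric and that attribute names contain no spaces or special characters. The
function name, the relation name and the class column name "gender" are hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation, class_attr):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    class_idx = header.index(class_attr)
    # The class is declared as a nominal attribute listing its distinct labels.
    labels = sorted({row[class_idx] for row in data})
    lines = ["@relation " + relation, ""]
    for idx, name in enumerate(header):
        if idx == class_idx:
            lines.append("@attribute " + name + " {" + ",".join(labels) + "}")
        else:
            lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical two-attribute excerpt with the class labels L and P.
arff = csv_to_arff("PP1,PR1,gender\n120,80,L\n140,95,P\n", "keystroke", "gender")
print(arff)
```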
3.3.4 Data Mining
Data mining is the process of extracting information from a data set and
transforming it into a comprehensible structure for future use. The data mining
algorithm, an Artificial Neural Network, is applied to mine the data by discovering
the relationships within it.
3.3.5 Interpretation/Evaluation
The extracted data is interpreted into new knowledge. The relationship between the
selected attributes and the class is shown by the accuracy of the model. In this case,
the model needs to have high accuracy in distinguishing gender.
3.4 Software and hardware requirements
This section lists all of the software and hardware used to develop the project in an
efficient way.
3.4.1 Software Requirement
No  Software               Purpose
1   Microsoft Office 2016  Tool for writing the report, proposal and Gantt chart
2   Paint                  Tool for cropping and editing images
3   XAMPP v3.2.1           Tool to set up and run localhost
4   Google Chrome          Browser to open and run localhost
5   Dropbox 3.18.1         Tool for backing up data in cloud storage
6   MySQL Workbench 17.0   Tool for checking SQL syntax
7   Weka 3.6               Tool used for data analysis and data modelling
3.4.2 Hardware requirement
No  Hardware     Specification
1   Laptop       HP 14 Notebook PC
2   Processor    Intel(R) Core i3
3   Memory       4.00 GB RAM
4   Hard disk    Samsung SSD 500GB
5   System type  64-bit Operating System
6   Pendrive     Kingston 2GB
3.5 System Framework
Figure 3.6.1 System Framework
A framework is a conceptual structure that guides the development of something into a
useful structure. This project is divided into 2 phases: pre-processing the data, and
training and testing the data using the Weka tool. The data used is the keystroke
dynamics of 51 students, each of whom was required to type a 13-character password on
a touchscreen keyboard, on average 15-20 times per person, which produced 954
instances with 54 attributes. The attributes comprise the flight times and dwell times
of the key presses: PP (the time interval between a key press and the next key press),
PR (the time interval between a key press and a key release), RP (the time interval
between a key release and the next key press) and RR (the time interval between a key
release and the next key release), all expressed in milliseconds.
The data was converted from CSV format into ARFF format for use in the training and
test phases. The pre-processing phase included removing the attributes that are not
required in this study. In the phase where the artificial neural network algorithm is
applied in the Weka tool, the data is divided using an 81% percentage split: 81% of
the data is used to train the model and the remaining 19% is used to test its accuracy.
The test portion corresponds to about 10 of the 51 students and amounts to 181
instances. The expected result is to show the best and highest accuracy in gender
classification based on keystroke dynamic features.
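The split sizes can be checked with a minimal sketch of a percentage split. It assumes
that the instances are shuffled and that the training size is rounded to the nearest
whole instance. This mirrors the idea of a percentage split but is not claimed to
reproduce the exact procedure or random ordering of the Weka tool. The function name
and the seed are hypothetical.

```python
import random


def percentage_split(instances, train_percent=81, seed=1):
    """Shuffle the instances and split them into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


# 954 placeholder instances stand in for the keystroke records.
train, test = percentage_split(range(954))
print(len(train), len(test))
```

With 954 instances, an 81% split leaves 181 instances for testing, which matches the
size of the test data stated above.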
3.6 Datasets
Table 3.6 Keystroke Dynamics Dataset
Table 3.6 shows half of the keystroke dynamics dataset, which uses timing-based
features, namely flight times and dwell times. Column A, PP, represents the time
interval between a key press and the next key press for the 13 characters of the
word RHUUNIVERSITY. Column B, PR, represents the time interval between a key
press and a key release, while Column C, RR, represents the time interval between
a key release and the next key release. Column D, RP, represents the time interval
between a key release and the next key press. Lastly, column E shows the gender
class of each of the 954 instances, based on the 4 types of attributes mentioned
before: PP, PR, RR and RP. All of this data is expressed in milliseconds. The
dataset is taken from a previous research paper and has been customized to suit
this research, which focuses on classifying gender.
3.7 Summary
This chapter presented all the methodologies that are used by the prediction system.
It also explained the hardware and software required for this project. Every phase of
the project was briefly explained to give a better understanding. This chapter also
explained the design and modelling of the system.
CHAPTER 4
RESULTS & DISCUSSION
4.1 Introduction
This chapter presents the experimental results and the analysis of the proposed
technique, the Artificial Neural Network. The experimental results and the testing
phase show the accuracy of the algorithm for this project, expressed as the percentage
of correctly classified instances. The chapter also compares the proposed algorithm
with the other four algorithms discussed in Chapter 2, to show that it gives the best
accuracy and is the most reliable.
4.2 Experimental Results
In this study, the proposed algorithm for classifying gender based on keystroke
dynamic features is implemented using the WEKA tool, which trains and tests on the
keystroke dynamics data of 51 students, consisting of 954 instances and 54 attributes
as shown in Figure 4.2. The accuracy of the proposed algorithm, the Artificial Neural
Network, is compared with four other algorithms: Logistic Regression, Naive Bayes,
Decision Table and Sequential Minimal Optimization. This is to show that the proposed
algorithm gives the best accuracy and high recall and precision rates in classifying
gender based on typing rhythm.
A percentage split is used to evaluate the accuracy of the classifiers, with 81% of the
data used for training and the other 19% for testing. The same percentage split is used
for every algorithm so that the evaluation is consistent and not biased towards any of
them. The 954 labelled instances come from 51 students, each required to type the 13
characters an average of 18 times, giving time-based keystroke dynamics data expressed
in milliseconds.
Figure 4.2 Graph of the 54 attributes of keystroke dynamics
Figure 4.2 compares the 54 attributes between the two genders based on their typing
rhythm, showing the difference in the time interval of each keystroke between male and
female (blue represents male while red represents female). The attributes PP1 to PP13
represent the time interval between a key press and the next key press. In this case
there are 13 characters (RHUUNIVERSITY), which are R-H-U-U-N-I-V-E-R-S-I-T-Y. The
attributes PR1 to PR13 represent the time interval between a key press and the next
key release for the 13 characters. In addition, RR1 to RR14 represent the time
interval between a key release and the next key release. Lastly, the attributes RP1 to
RP13 represent the time interval between a key release and the next key press. All of
these keystroke features are useful for classifying gender.
The precision, recall (true positive rate) and accuracy (correctly classified
instances) metrics are obtained after the classifier is run with a percentage split of
81% for training and 19% for testing.
4.3 Comparison in Accuracy of Data Model
4.3.1 Comparison between data model based on Precision, Recall and F-Measure
Table 4.3.2 Comparison between data mining algorithms
Table 4.3.1 shows that the Artificial Neural Network gives the best result, with the
highest accuracy compared with Naive Bayes, Decision Table, Sequential Minimal
Optimization and Logistic Regression. Based on Table 4.3.2, the precision is 0.767 for
the L class (male) and 0.769 for the P class (female). The recall, which is equal to
the true positive rate, is 0.856 and 0.649 respectively. The table shows that Logistic
Regression is the second best algorithm in model accuracy, followed by SMO, Decision
Table and Naive Bayes.
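Section 4.3.1 names the F-measure alongside precision and recall. As a hedged
illustration, the F-measure is the harmonic mean of the two, and the sketch below
applies it to the precision and recall values quoted above. The function name is an
assumption, and the resulting values are computed here rather than read from the Weka
output.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall of the L (male) and P (female) classes quoted in the text.
f_male = f_measure(0.767, 0.856)
f_female = f_measure(0.769, 0.649)
print(round(f_male, 3), round(f_female, 3))
```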
4.4 Duration of Building Data Model
The time taken to build the model is 46.53 seconds for the artificial neural network
algorithm, while Naive Bayes takes 0.19 seconds, Decision Table 2.18 seconds, SMO 0.6
seconds and logistic regression 1.74 seconds. The longer training time may be a major
factor in the artificial neural network giving the highest accuracy, as it takes longer
to compute the weighted sums of the inputs.
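The summation between weights and inputs can be illustrated for a single neuron. This
is a minimal sketch that assumes a sigmoid activation function, and the function name
and the sample values are hypothetical. A network repeats this calculation for every
neuron and every training instance over many training passes, which is why building
the model takes longer.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights whose weighted sum is zero.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```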
4.5 Artificial Neural Network Algorithm Confusion Matrix
A confusion matrix is a technique for summarizing the performance of a classification
algorithm. Classification accuracy alone may be misleading or biased if the number of
observations in each class is unequal. Calculating a confusion matrix shows which
types of prediction the classification model gets right and which errors it makes. In
simple terms, a confusion matrix is a summary of the prediction results on a
classification problem: the numbers of correct and incorrect predictions, broken down
by class. Table 4.5.1 below shows the confusion matrix of the artificial neural network
algorithm based on gender.
                Positives (L)   Negatives (P)
Positives (L)   TP (a) 89       FP (b) 15
Negatives (P)   FN (c) 27       TN (d) 50
Table 4.5.1 Confusion matrix of the artificial neural network algorithm based on gender
4.5.1 Calculation
False-negative rate = c/(a+c)
27/(89+27) = 0.233
False-positive rate = b/(b+d)
15/(15+50) = 0.231
Positive-predictive value = a/(a+b)
89/(89+15) = 0.856
Negative-predictive value = d/(c+d)
50/(27+50) = 0.649
Sensitivity (power) = a/(a+c)
89/(89+27) = 0.767
Specificity = d/(b+d)
50/(15+50) = 0.769
Efficiency = (a+d)/(a+b+c+d)
(89+50)/(89+15+27+50) = 0.768
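The calculations above can be reproduced with a short sketch. It uses the same cell
names a, b, c and d and the same formulas as listed above. The function name and the
dictionary keys are assumptions made for illustration.

```python
def confusion_metrics(a, b, c, d):
    """Rates derived from the confusion matrix cells a (TP), b (FP), c (FN), d (TN)."""
    return {
        "false_negative_rate": c / (a + c),
        "false_positive_rate": b / (b + d),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "efficiency": (a + d) / (a + b + c + d),
    }


# Cell values of the artificial neural network confusion matrix.
metrics = confusion_metrics(89, 15, 27, 50)
for name, value in metrics.items():
    print(name, round(value, 3))
```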
Algorithm                          Confusion Matrix
Artificial Neural Network
Naive Bayes
Decision Table
Sequential Minimal Optimization
Logistic Regression
Table 4.5.2 Comparison of the confusion matrices of the 5 different data mining algorithms
4.6 Summary
This chapter presented all the results related to the study, which is data mining on
time-based keystroke dynamics features. The comparison of five algorithms, namely
Naive Bayes, ANN, SMO, Logistic Regression and Decision Table, was also shown in this
chapter to differentiate their capability in terms of model accuracy. Moreover, the
calculations based on the confusion matrix and the duration of building the data
models were shown in this chapter.
CHAPTER 5
CONCLUSION AND FUTURE WORKS
5.1 Conclusion
This research aimed to achieve high accuracy in modelling keystroke dynamics data for
classifying gender, and to study the effectiveness of 5 different algorithms in
classifying the gender class in order to show that the proposed algorithm models the
data best. The gender class is determined from the differences in typing behaviour
between the 2 genders, male and female, using the extracted keystroke features, namely
flight times and dwell times, from 954 samples. Based on the data, the relationship
between the class (L or P) and the timing-based keystroke dynamics features was found
and substantiated using an Artificial Neural Network, also known as a Neural Network.
In addition, the data model was tested to determine its accuracy. After testing, the
model correctly classified 139 of the 181 test instances (76.8%), the highest accuracy
of the five algorithms. In conclusion, the study of gender classification through
dynamic keystroke based on mobile phone using artificial neural network has two main
phases: pre-processing the data, and training and testing the data model.
This scientific data on the distinctiveness of keystroke dynamics between genders can
be useful in the cyber security environment. Implementing this model in the real world
can narrow down the possible identity of a cyber criminal by identifying their gender.
The approach is cost effective compared with other biometric approaches because it
needs no expensive installed hardware and uses people's natural typing behaviour on a
touch screen keyboard. Hopefully, this project can contribute to cyber security
departments in analysing the different typing behaviour of each gender. This can reduce
the potential for computer crimes such as hacking, phishing and online scams, fraud,
and child soliciting and abuse, because the gender of a criminal can be analysed and
monitored by the security department, which can then take further action to prevent
the crime.
5.2 Future Work
Based on the limitations of this work, some improvements can be made so that the data
model becomes more practical and useful in the cyber security world. Extracting more
keystroke dynamics features could give better accuracy in predicting and classifying
gender. The features need not be limited to the time-based flight times and dwell
times. The frequency of use of the Delete and Backspace keys could also be added, as
it may also distinguish the gender class. In addition, a larger data sample could
improve the accuracy of the model's predictions. There may be hidden relationships
between attributes that were not found due to the limitations of the data provided.
REFERENCES
[1] Stephen Mayhew, “History of Biometrics | BiometricUpdate,” January 14, 2015.
[Online]. Available: http://www.biometricupdate.com/201501/history-of-biometrics.
[Accessed: 26-Apr-2017].
[2] Thakshila R. Kalansuriya and Anuja T. Dharmaratne. “Neural Network based Age and
Gender Classification for Facial Images”.
[3] Bechtel, Jason, Serpen, Gursel, and Brown, Marcus. “Passphrase authentication
based on typing style through an ART 2 neural network”. In: International Journal of
Computational Intelligence and Applications 2.02 (2002), pp. 131–152.
[4] Kevin S. Killourhy. “A Scientific Understanding of Keystroke Dynamics”. January
2012. School of Computer Science, Computer Science Department, Carnegie Mellon
University, Pittsburgh, PA 15213.
[5] Bleha, Saleh Ali and Gillespie, Dave. “Computer user identification using the mean
and the median as features”. In: Systems, Man, and Cybernetics, 1998. 1998 IEEE
International Conference on. Vol. 5. IEEE. 1998, pp. 4379–4381.
[6] Epp, Clayton, Lippold, Michael, and Mandryk, Regan L. “Identifying Emotional
States using Keystroke Dynamics”. In: Conference on Human Factors in Computing
Systems. 2011.
[7] Furnell, Steven M., Morrissey, Joseph P., Sanders, Peter W., and Stockel, Colin T.
“Applications of keystroke analysis for improved login security and continuous user
authentication”. In: Information systems security. Chapman & Hall, Ltd. 1996, pp.
283–294.
[8] Hocquet, Sylvain, Ramel, Jean-Yves, and Cardot, Hubert. “User classification for
keystroke dynamics authentication”. In: Advances in biometrics. Springer, 2007, pp.
[9] Ilonen, Jarmo. “Keystroke dynamics”. In: Advanced Topics in Information
Processing (2003), pp. 03–04.
[10] Jammi Ashok, Vaka Shivashankar, V.G.S. Mudiraj. “An Overview of Biometrics”.
MCA Dept., Department of IT, Department of MCA, Adams Engg. College, GCET,
Hyderabad, India.
[11] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms.
Wiley-Interscience, 2004.
[12] Gutierrez, F.J., Lerma-Rascon, M.M. et al. 2002. Biometrics and Data Mining:
Comparison of Data Mining-Based Keystroke Dynamics Methods for Identity
Verification. Lecture Notes in Computer Science. 221-245.
[13] Rodrigues, Ricardo Nagel et al. “Biometric access control through numerical
keyboards based on keystroke dynamics”. In: Advances in biometrics. Springer, 2005,
pp. 640–646.
[14] Hempstalk, Kathryn. “Continuous typist verification using machine learning”.
PhD thesis. The University of Waikato, 2009.
[15] Hongjun Lu et al. “Decision Tables: Scalable Classification Exploring RDBMS
Capabilities”.
[16] Mohamad El-Abed, Mostafa Dafer, Ramzi El Khayat. “A mobile-based benchmark for
keystroke dynamics systems”. Rafik Hariri University, Meshref, Lebanon, 2014.
Gantt chart (FYP 1)
Week: 1 2 3 4 5 6 8 9 10 11 12 13 14 15 16
Activity:
Discuss the title for the project with supervisor
Submission of project title and abstract
Precision problem statement, objective, scope and literature review
Presentation Preparation
Proposal Presentation
Proposal Correction
Design CD, ERD, DFD
Prepare documentation of proposal
Proposal slide presentation
Designing the interface
Final Presentation FYP1
Report Submission
Final Submission to Supervisor
Gantt chart (FYP 2)
Week: 1 2 3 4 5 6 8 9 10
Activity:
Project Meeting with Supervisor
Project Development
Testing and Documentation
Project Progress Presentation, Panel's Evaluation
Project Development & Testing
Report, Seminar Registration
Seminar Presentation and Panel's Evaluation
Finalizing Report and Documentation of the Project
Report, Logbook Submission