Enhanced Preprocessing, Feature Selection and ...shodhganga.inflibnet.ac.in/bitstream/10603/90745/6/visa_intro.pdf · CHAPTER NO. TITLE PAGE NO 2.2.4 Imputation-based on K-Nearest

Enhanced Preprocessing, Feature Selection and

Classification for Automatic Contamination Detection

to Improve Water Quality

Thesis submitted in Partial Fulfilment of the

Degree of Doctor of Philosophy in Computer Science

By

S. Visalakshi 12PHCSF002

Department of Computer Science

Avinashilingam Institute for Home Science and Higher Education for

Women, Coimbatore – 641043

December 2015

ACKNOWLEDGEMENT

I record my sincere thanks to Dr. P. R. KRISHNA KUMAR, Chancellor,

Avinashilingam Institute for Home Science and Higher Education for Women,

Coimbatore, for providing the infrastructure facilities for the conduct of the study.

I express gratitude to Dr. T. S. K. MEENAKSHI SUNDARAM, M.A.,

M.Phil., Ph.D., Former Chancellor, Avinashilingam Institute for Home Science

and Higher Education for Women, Coimbatore, for providing the infrastructure

facilities for the conduct of the study.

I express my immense gratitude to Dr. (Mrs.) PREMAVATHY VIJAYAN,

M.Sc., M.Ed., Dip.Spl.Edn., M.Phil., Ph.D., Vice Chancellor (i/c),

Avinashilingam Institute for Home Science and Higher Education for Women,

Coimbatore, for the academic support and the facilities provided to carry out the

research work.

I express my special thanks to Dr. (Mrs.) A. VENMATHI, M.Sc., Dip.Ed.,

M.Phil., Ph.D., Registrar (i/c), Avinashilingam Institute for Home Science and

Higher Education for Women, Coimbatore, for extending precious help.

I record my gratefulness to Dr. (Mrs.) A. PARVATHI, M. Sc., Dip.Ed.,

M.Phil., Ph.D., Dean, Faculty of Science, Avinashilingam Institute for Home

Science and Higher Education for Women, Coimbatore for her timely help and

encouragement in carrying out the research work.

I also extend my thanks to Dr. (Mrs.) G. P. JEYANTHI, M.Sc., M.Phil.,

Ph. D., Controller of Examinations, Avinashilingam Institute for Home Science

and Higher Education for Women, Coimbatore, for her support, encouragement

and co-operation rendered towards the completion of this research.

I express my thanks to Dr. (Mrs.) G. PADMAVATHI, M.Sc., M.Phil.,

Ph.D., Professor and Head of the Department of Computer Science,

Avinashilingam Institute for Home Science and Higher Education for Women,

Coimbatore, for her support and encouragement rendered towards the completion

of this research.

I express my sincere gratitude to my Supervisor Dr. (Mrs.) V. RADHA,

M.Sc., P.G.D.C.A., P.G.D.O.R., B.Ed., M.Phil., Ph.D., Professor, Department of

Computer Science, Avinashilingam Institute for Home Science and Higher

Education for Women, Coimbatore, for her valuable guidance, intellectual inputs

and constant encouragement received throughout the research work. She patiently

provided necessary support and encouragement for the completion of my research.

Apart from the subject of my research, I learnt a lot about academic and

research-related processes, which I am sure will be useful in different stages of my

life and career. She always gave liberty to pursue my research work and I consider

it as a great opportunity to undergo my Doctoral programme under her guidance. I

solemnly submit my honest and humble thanks to her for converting my dreams

into reality.

I thank the Doctoral Committee Member, Dr. K. THANGAVEL, M.Sc.,

M.C.A., M.Phil., P.G.D.C.A., Ph.D., Professor and Head, Department of

Computer Science, Periyar University, Salem, for helping me to fine tune my

research work through his valuable discussions, comments and suggestions.

I am very much grateful to Prof. M. KARNAMURTHI, M.Phil., Professor

and Head, Department of English (Retd.), Government Arts College, for

proofreading my research papers and thesis. I deeply appreciate

his timely help and constructive criticism which brought the papers and document

to shape.

My heartfelt thanks to Mr. R. JAYACHANDRAN, Executive Engineer,

Velliangadu, Drinking Water Treatment Plant, Pillur, TWAD, for giving

permission to visit the Velliangadu treatment plant and for motivating me to

carry out the research work.

I accord my warm thanks to Mr. N. MATHESHAN, Electrical

Superintendent, Velliangadu, Drinking Water Treatment Plant, Pillur, TWAD, for

providing encouraging and constructive feedback towards my research work.

I express my thanks to Mrs. N. SUBULAKSHMI, Junior Water Analyst,

Velliangadu, Drinking Water Treatment Plant, Pillur, TWAD, for providing

sustained help towards my research work.

I express my sincere appreciation to Dr. (Mr.) B. ANIRUDHAN,

Principal, Nehru Arts and Science College, Coimbatore for his support and

encouragement rendered towards the research.

I accord my warm thanks to all the FACULTY MEMBERS, NON-

TEACHING STAFF and RESEARCH SCHOLARS of the Department of

Computer Science, Avinashilingam Institute for Home Science and Higher

Education for Women, Coimbatore, for their encouragement and support.

The thesis would not have come to a successful completion without the

help received from my family. Words cannot express how grateful I am to my

FATHER, MOTHER and SISTER for all the sacrifices they have made to

support me. They encouraged and helped me at every stage of my personal and

academic life, and longed to see this achievement come true. I owe every

achievement of mine to my family.

Finally, I express my warm gratitude to all my FRIENDS for their

valuable help and suggestions rendered for the completion of the research work.

Above all, I thank GOD Almighty for His blessings in this endeavour.

CONTENTS

CHAPTER NO. TITLE

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

1 INTRODUCTION
1.1 Water Quality
1.2 Water Contamination
1.2.1 Types of Water Contamination
1.3 Water Contamination Detection System (WCDS)
1.4 Data Mining
1.4.1 Importance of Data Mining in Water Contamination Detection
1.4.2 Data Mining Methods
1.5 Motivation and Objectives
1.6 Chapter Formulation
1.7 Chapter Summary

2 REVIEW OF LITERATURE
2.1 Water Contamination Related Studies
2.2 Missing Value Handling
2.2.1 Imputation-based on Artificial Neural Network Imputation
2.2.2 Imputation-based on Recurrent Neural Network
2.2.3 Imputation-based on Auto-Associative Neural Network Imputation
2.2.4 Imputation-based on K-Nearest Neighbour (K-NN) Algorithm
2.2.5 Imputation-based on Self-Organizing Map (SOM)
2.2.6 Traditional Imputation Methods
2.3 Feature Selection
2.4 Classification Algorithms
2.4.1 Decision Trees
2.4.2 Neural Networks
2.4.3 Statistical Learning Algorithms
2.4.4 K-Nearest Neighbour (K-NN)–based Techniques
2.5 Outlier Detection
2.5.1 Statistical Techniques for Outlier Detection
2.5.2 Depth-based Outlier Detection Approaches
2.5.3 Distance-based Approaches for Outlier Detection
2.5.4 Density-based Approaches for Outlier Detection
2.5.5 Classification-based Approaches for Outlier Detection


3 RESEARCH METHODOLOGY AND APPROACH
3.1 Research Methodology, Phases and Interactions
3.2 Phase I: Preprocessing – Missing Value Handling
3.3 Phase II: Feature Selection
3.3.1 Step 1: Multiple Filter-based Approach
3.3.2 Step 2: Genetic Algorithm based Wrapper Approach using SVM Classifier
3.4 Phase III: Contamination Detection
3.4.1 Anomaly Detection for Contamination Detection
3.4.2 Enhancing SVM Classifier
3.4.3 Integrated Anomaly Detection, Feature Selection and Classification
3.5 Experimental Results
3.5.1 Study Area
3.5.2 Datasets
3.5.3 Performance Metrics


4 DESIGN OF PREPROCESSING ALGORITHM
4.1 Traditional K-NN Imputation (K-NNI) Method
4.2 Fast Weighted K-NN Imputation (FWKNNI) Algorithm
4.2.1 Pruning Algorithm
4.2.2 K-Means Clustering Algorithm
4.2.3 Weighted K-NN Method


5 DESIGN OF FEATURE SELECTION ALGORITHM
5.1 Overview to Feature Selection
5.1.1 Generate Candidate Subset
5.1.2 Subset Evaluation Function
5.1.3 Stopping Condition
5.1.4 Validation Procedure
5.2 Filter and Wrapper-based Approaches
5.3 The 2FWFS Algorithm
5.3.1 Step 1: Multiple Filter-based Feature Pre-Selection (M2FPS) Algorithm
5.3.2 Step 2: GA-SVM Wrapper-based (GA-SVM) Algorithm
5.3.3 Step 3: 2FWFS Algorithm


6 DESIGN OF CONTAMINATION DETECTION ALGORITHM
6.1 Anomaly Detection Algorithm
6.1.1 Overview of Outliers
6.1.2 Proposed Anomaly Detection Algorithm
6.1.2.1 Boosting Algorithm
6.1.2.2 Enhanced K-Means Algorithm
6.1.2.3 Anomaly Detection in Normal Clusters
6.1.2.4 Merging of Similar Clusters
6.2 Enhanced SVM Classifier
6.2.1 Step 1: Pre-clustering
6.2.2 Step 2: Identify Crisp Clusters
6.2.3 Step 3: Removal of Irrelevant SVs
6.3 Integrated System


7 RESULTS AND DISCUSSION
7.1 Performance Evaluation of Missing Value Handling Algorithm
7.2 Performance Evaluation of Anomaly Detection Algorithm
7.3 Performance Evaluation of Feature Selection, Classification and Integrated Contamination Detection System

8 SUMMARY AND CONCLUSION
BIBLIOGRAPHY
PUBLICATIONS RELATED TO RESEARCH WORK

LIST OF TABLES

TABLE NO. TITLE

2.1 Comparative Evaluation of Missing Value Handling Techniques
5.1 Filters and Wrappers
5.2 Parameters setting of GA-SVM algorithm
7.1 Coding Scheme
7.2 NRMSE of Missing Value Handling Algorithms (Siruvani Dataset)
7.3 NRMSE of Missing Value Handling Algorithms (Pillur Dataset)
7.4 Speed (seconds) of Missing Value Handling Algorithms (Siruvani Dataset)
7.5 Speed (seconds) of Missing Value Handling Algorithms (Pillur Dataset)
7.6 Anomaly Detection Rate (%)
7.7 Anomaly Detection Speed (seconds)
7.8 Accuracy (%) of the Contamination Detection Systems
7.9 Error Rate (%) of the Contamination Detection Systems
7.10 Speed (seconds) of the Contamination Detection Systems

LIST OF FIGURES

FIGURE NO. TITLE

1.1 Crime Data Mining Model
1.2 Data Mining Process
1.3 Data Mining Methods
2.1 Feature Selection Approaches
3.1 Steps in Water Management System
3.2 Development Methodology
3.3 Interaction of Algorithms and Research Phases
3.4 Study Area
3.5 Sample Snapshot (Partial) of Pillur Dataset
3.6 Sample Snapshot (Partial) of Siruvani Dataset
4.1 General Steps of FWKNNI Algorithm
4.2 Steps in FWKNNI Algorithm
4.3 Pruning Process
4.4 Conventional K-Means Algorithm
5.1 Steps in Feature Selection
5.2 Filter-Based Feature Selection Method
5.3 Wrapper-Based Feature Selection Method

5.4 Flow of 2FWFS Algorithm
5.5 M2FPS Algorithm
5.6 Markov Blanket Filter
5.7 Encoding of Feature Subset in GA - An L-Dimensional Binary Vector
5.8 Roulette Wheel Selection
5.9 Support Vector Machine Hyperplane
5.10 Process of GA-SVM Hybrid Feature Selection Algorithm
5.11 Detailed Steps Involved in GA-SVM Algorithm
5.12 2-Step Feature Selection Algorithm Combining M2FPS and GA-SVM (2FWFS)
6.1 Examples of Outliers
6.2 Clustering-Based Anomaly Detection
6.3 Boosting Algorithm
6.4 Example of Local and Global Consistency (Toy Dataset)
6.5 K-Means with DLG Measure
6.6 Automatic Estimation of K
6.7 RNN Example
6.8 Proposed Initialization Procedure

6.9 Proposed Enhanced K-Means Algorithm
6.10 Procedure to Merge Similar Clusters
6.11 Crisp Cluster Identification Algorithm
6.12 Integrated AWCDS
7.1 Average NRMSE of the Missing Value Handling Algorithms (Siruvani Dataset)
7.2 Average NRMSE of the Missing Value Handling Algorithms (Pillur Dataset)
7.3 Average Speed (seconds) of the Missing Value Handling Algorithms (Siruvani Dataset)
7.4 Average Speed (seconds) of the Missing Value Handling Algorithms (Pillur Dataset)
7.5 Average Anomaly Detection Rate (%)
7.6 Average Speed (seconds) of Anomaly Detection Algorithms
7.7 Average Accuracy (%) of Contamination Detection Systems
7.8 Average Error Rate (%)
7.9 Average Speed (seconds)

LIST OF ABBREVIATIONS

1NN Single Nearest Neighbour

2FWFS 2-Step Filter and Wrapper-based Feature Selection Algorithm

AANN Auto-Associative Neural Network

ANN Artificial Neural Network

ANOVA Analysis of Variance

AWCDS Automatic Water Contamination Detection System

BN Bayesian Network

BP Back Propagation

CART Classification And Regression Tree

CBLOF Cluster-Based Local Outlier Factor

CFS Correlation-based Feature Selection

CI Conditional Independence

CS Chi-Square

CVI Cluster Validity Index

CWS Contamination Warning System

DLG Local and Global Consistency

EBIC Enhanced Bayesian Information Criterion

EDM Enhanced Distance Measure

EDS Event Detection System

EKPAD Enhanced K-Means Clustering with Pruning for Anomaly Detection

EPA Environmental Protection Agency

FN False Negatives

FP False Positives

FRR False Positive Rate

FS Feature Selection

FS Fisher Criterion Score

FSVM Proposed Fast SVM

FWKNN Fast Weighted KNN Imputation

FWKNNI Fast Weighted KNN Imputation Algorithm

GA Genetic Algorithm

GA-SVM Genetic Algorithm and Support Vector Machine Wrapper-Based Algorithm

ICF Iterative Case Filtering

ICT Information and Communication Technology

IT Information Technology

KCAD K-Means Clustering Based Anomaly Detection

K-NN K-Nearest Neighbour

KNNI KNN-Imputation

KNNI Conventional KNN Imputation Algorithm

LASSO Least Absolute Shrinkage and Selection Operator

LOF Local Outlier Factor

LPCF Linear Prediction Correction Filter

LVW Las Vegas Wrapper algorithm

M2FPS Multiple Filter-Based Feature Pre-Selection

MB Markov Blanket

MDI Median Imputation

MDVI Modified Dynamic Validity Index

MI Multiple Imputation

MI Mutual Information

MLD Million Litres per Day

MLP Multi-Layer Perceptron

MRMR Minimum Redundancy-Maximum Relevance

MTL Multi-Task Learning

NN Neural Network

NRMSE Normalized Root Mean Square Error

OHSR Over Head Service Reservoir

ORP Oxidation-Reduction Potential

PC Pearson Correlation

PCA Principal Component Analysis

PSO Particle Swarm Optimization

RBF Radial Basis Function

RGSS Random Generation plus Sequential Selection

RNN Recurrent Neural Network

ROC Receiver Operating Characteristics

SA Simulated Annealing

SCADA Supervisory Control and Data Acquisition

SMO Sequential Minimal Optimization

SMR Stepwise Multiple Regression

SOM Self Organizing Map

SPCA State Pollution Control Agencies

SVM Support Vector Machine

SVs Support Vectors

TEST Training-EStimation-Training

TN True Negatives

TP True Positives

TS-SOM Tree Structured-SOM

TWAD Tamilnadu Water Supply And Drainage

ULBs Urban Local Bodies

UNICEF United Nations Children’s Fund

USC Uncorrelated Shrunken Centroid

WCDS Water Contamination Detection System

WHO World Health Organization

W-KNN Weighted KNN

ABSTRACT

The quality of drinking water has always been a powerful environmental determinant of health and a worldwide concern. A secure and safe supply of drinking water is fundamental to public health. Water contamination, defined as the pollution of water bodies, is an important factor that reduces the quality of drinking water. The main aim of this research work is to design and develop algorithms based on data mining to detect the presence or absence of water contaminants.

The proposed water contamination detection system consists of three steps, namely, preprocessing, feature selection and classification. In preprocessing, an enhanced K-Nearest Neighbour Imputation method is used to handle the missing values in the water dataset. The enhanced algorithm uses a pruning algorithm to reduce the size of the dataset by removing irrelevant instances, the K-Means algorithm to group similar instances together, a weighted K-Nearest Neighbour search and imputation algorithm to impute the missing values, and a merging algorithm to combine all the imputed clusters into a dataset with no missing values.
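As an illustration of the weighted K-Nearest Neighbour imputation step only, the following minimal Python sketch fills each missing value with an inverse-distance-weighted average over the k nearest complete instances. The function name `weighted_knn_impute`, the inverse-distance weighting and the use of `None` to mark missing values are assumptions made for this sketch; the pruning, K-Means grouping and merging stages described above are not reproduced.

```python
import math


def weighted_knn_impute(rows, k=3, eps=1e-9):
    """Fill None entries with a distance-weighted mean of the k nearest complete rows.

    Hypothetical helper for illustration; not the FWKNNI algorithm itself.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete row is required")

    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue

        # Attributes that are present in this incomplete row.
        observed = [j for j, v in enumerate(row) if v is not None]

        # Euclidean distance measured on the observed attributes only.
        def distance(candidate):
            return math.sqrt(sum((row[j] - candidate[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        # Assumed weighting scheme: closer neighbours contribute more.
        weights = [1.0 / (distance(n) + eps) for n in neighbours]

        filled = list(row)
        for j, v in enumerate(row):
            if v is None:
                weighted_sum = sum(w * n[j] for w, n in zip(weights, neighbours))
                filled[j] = weighted_sum / sum(weights)
        result.append(filled)
    return result
```

For example, calling `weighted_knn_impute([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, None]])` fills the missing entry with a value dominated by the exactly matching neighbour `[2.0, 20.0]`.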

Feature selection is performed using a 2-step algorithm which combines the advantages of filter-based and wrapper-based feature selection algorithms. This algorithm first uses a multiple filter algorithm to prune irrelevant features. For this purpose, the algorithm makes use of four filter-based algorithms, namely, Mutual Information (MI), Pearson Correlation (PC), Chi-Squared test (CS) and Fisher Criterion Score (FS), along with the Markov Blanket Filter (MBF). The results are combined using a simple Boolean union operation. This result is then used by the wrapper-based algorithm, which is designed as a method combining a genetic algorithm and a Support Vector Machine (SVM) classifier. The final result is a set of optimal features that have a strong positive impact on water contamination detection.
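The Boolean union of filter outputs described above can be sketched as follows. This minimal Python example uses only two of the named filters (absolute Pearson correlation and a two-class Fisher criterion score); the function names, the top-n cut-off per filter and the binary 0/1 label encoding are assumptions for illustration, and the Mutual Information, Chi-Squared and Markov Blanket filters are omitted.

```python
import math


def pearson_abs(x, y):
    """Absolute Pearson correlation between one feature column and the labels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return abs(sxy / math.sqrt(sxx * syy))


def fisher_score(x, y):
    """Two-class Fisher criterion: squared mean gap over summed class variances."""
    pos = [a for a, b in zip(x, y) if b == 1]
    neg = [a for a, b in zip(x, y) if b == 0]
    if not pos or not neg:
        return 0.0

    def mean(v):
        return sum(v) / len(v)

    def var(v):
        m = mean(v)
        return sum((a - m) ** 2 for a in v) / len(v)

    gap = (mean(pos) - mean(neg)) ** 2
    spread = var(pos) + var(neg)
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def top_n(scores, n):
    """Indices of the n highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:n])


def union_of_filters(columns, labels, n=1, filters=(pearson_abs, fisher_score)):
    """Run every filter, keep its top-n features, and take the Boolean union."""
    selected = set()
    for score in filters:
        scores = [score(col, labels) for col in columns]
        selected |= top_n(scores, n)
    return selected
```

Because the union keeps a feature if any single filter ranks it highly, the pre-selected set is deliberately permissive; the wrapper stage is then left to narrow it down.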

The final step of contamination detection performs two tasks, namely, anomaly detection and classification. The anomaly detection algorithm identifies features that are abnormal as contaminants and removes them from the dataset. The research work proposes a clustering-based algorithm enhanced through the use of a pruning algorithm and an enhanced K-Means algorithm. The K-Means algorithm is enhanced by incorporating solutions to (i) automatically estimate the value of K and the initial centroids, (ii) handle the problem of large datasets, (iii) reduce time complexity and (iv) modify the distance metric to consider both intra and inter distances between data points to improve the process of clustering. The anomalies are detected using the Cluster-Based Local Outlier Factor.
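A minimal sketch of Cluster-Based Local Outlier Factor (CBLOF) scoring is given below for illustration. It assumes that cluster labels have already been produced (for example by K-Means), treats the largest clusters that together cover a fraction `alpha` of the data as "large", and omits the size-ratio criterion of the original CBLOF definition. The function names and the `alpha` default are assumptions, and the enhanced K-Means and pruning stages proposed in this work are not reproduced.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def cblof_scores(points, labels, alpha=0.9):
    """Return one CBLOF score per point; higher scores are more anomalous."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Walk clusters from largest to smallest until alpha of the data is covered.
    ordered = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    large, covered = [], 0
    for c in ordered:
        large.append(c)
        covered += len(clusters[c])
        if covered >= alpha * len(points):
            break

    centers = {c: centroid(members) for c, members in clusters.items()}

    scores = []
    for p, lab in zip(points, labels):
        size = len(clusters[lab])
        if lab in large:
            # Point in a large cluster: distance to its own centroid.
            d = math.dist(p, centers[lab])
        else:
            # Point in a small cluster: distance to the nearest large centroid.
            d = min(math.dist(p, centers[c]) for c in large)
        scores.append(size * d)
    return scores
```

Points can then be ranked by score and the top-ranked ones flagged as anomalies; the cut-off used for flagging is a separate design choice not shown here.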

In the second task, the presence or absence of contamination is identified in the normal data retained by the anomaly detection step. This is performed using a Fast Support Vector Machine classifier. The speed is improved by using the K-Means algorithm to form clusters; using only the crisp clusters, the irrelevant support vectors that are far from the boundaries are removed, thus reducing the size of the training data. This improves both the time complexity and the accuracy of the contamination detection process.

Experiments to evaluate the proposed algorithms used two datasets collected from Siruvani and Pillur, Coimbatore, Tamil Nadu, India. The missing value handling algorithm was evaluated using Normalized Root Mean Square Error and speed. The anomaly detection algorithm was analyzed using two metrics, namely, outlier detection rate and speed. The contamination detection algorithm was evaluated using three metrics, namely, accuracy, error rate and speed. All the proposed algorithms were compared with their respective existing (or conventional) counterparts.
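The evaluation metrics named above can be written compactly. The sketch below assumes that NRMSE is normalized by the range of the true values (other normalizations, such as by the standard deviation or the mean, are also in common use, and the abstract does not specify which applies) and that accuracy and error rate are computed from confusion-matrix counts. All function names are illustrative.

```python
import math


def nrmse(actual, imputed):
    """Root mean square error divided by the range of the actual values (assumed)."""
    if len(actual) != len(imputed) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    spread = max(actual) - min(actual)
    if spread == 0:
        raise ValueError("actual values have zero range; NRMSE is undefined")
    mse = sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual)
    return math.sqrt(mse) / spread


def accuracy(tp, tn, fp, fn):
    """Fraction of instances classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Fraction of instances classified incorrectly (1 - accuracy)."""
    return (fp + fn) / (tp + tn + fp + fn)
```

A lower NRMSE indicates imputed values closer to the true ones, while accuracy and error rate always sum to one for the same confusion matrix.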

In conclusion, the results showed that the proposed automatic water contamination detection system, which combines the Proposed Fast Weighted KNN Imputation Algorithm for handling missing values, the Proposed 2-Step Filter and Wrapper-based Feature Selection Algorithm, Enhanced K-Means Clustering with Pruning for Anomaly Detection and the Proposed Fast SVM, was efficient and produced a maximum average accuracy of 98.77% (Siruvani dataset) and 98.53% (Pillur dataset).
