Enhanced Preprocessing, Feature Selection and
Classification for Automatic Contamination Detection
to Improve Water Quality
Thesis submitted in Partial Fulfilment of the
Degree of Doctor of Philosophy in Computer Science
By
S. Visalakshi 12PHCSF002
Department of Computer Science
Avinashilingam Institute for Home Science and Higher Education for
Women, Coimbatore – 641043
December 2015
ACKNOWLEDGEMENT
I record my sincere thanks to Dr. P. R. KRISHNA KUMAR, Chancellor,
Avinashilingam Institute for Home Science and Higher Education for Women,
Coimbatore, for providing the infrastructure facilities for the conduct of the study.
I express gratitude to Dr. T. S. K. MEENAKSHI SUNDARAM, M.A.,
M.Phil., Ph.D., Former Chancellor, Avinashilingam Institute for Home Science
and Higher Education for Women, Coimbatore, for providing the infrastructure
facilities for the conduct of the study.
I express my immense gratitude to Dr. (Mrs.) PREMAVATHY VIJAYAN,
M.Sc., M.Ed., Dip.Spl.Edn., M.Phil., Ph.D., Vice Chancellor (i/c),
Avinashilingam Institute for Home Science and Higher Education for Women,
Coimbatore, for the academic support and the facilities provided to carry out the
research work.
I express my special thanks to Dr. (Mrs.) A. VENMATHI, M.Sc., Dip.Ed.,
M.Phil., Ph.D., Registrar (i/c), Avinashilingam Institute for Home Science and
Higher Education for Women, Coimbatore, for extending precious help.
I record my gratefulness to Dr. (Mrs.) A. PARVATHI, M. Sc., Dip.Ed.,
M.Phil., Ph.D., Dean, Faculty of Science, Avinashilingam Institute for Home
Science and Higher Education for Women, Coimbatore for her timely help and
encouragement in carrying out the research work.
I also extend my thanks to Dr. (Mrs.) G. P. JEYANTHI, M.Sc., M.Phil.,
Ph. D., Controller of Examinations, Avinashilingam Institute for Home Science
and Higher Education for Women, Coimbatore, for her support, encouragement
and co-operation rendered towards the completion of this research.
I express my thanks to Dr. (Mrs.) G. PADMAVATHI, M.Sc., M.Phil.,
Ph.D., Professor and Head of the Department of Computer Science,
Avinashilingam Institute for Home Science and Higher Education for Women,
Coimbatore, for her support and encouragement rendered towards the completion
of this research.
I express my sincere gratitude to my Supervisor Dr. (Mrs.) V. RADHA,
M.Sc., P.G.D.C.A., P.G.D.O.R., B.Ed., M.Phil., Ph.D., Professor, Department of
Computer Science, Avinashilingam Institute for Home Science and Higher
Education for Women, Coimbatore, for her valuable guidance, intellectual inputs
and constant encouragement received throughout the research work. She patiently
provided necessary support and encouragement for the completion of my research.
Apart from the subject of my research, I learnt a lot about academic and
research-related processes, which I am sure will be useful in different stages of my
life and career. She always gave me the liberty to pursue my research work, and I consider
it a great opportunity to have undertaken my Doctoral programme under her guidance. I
solemnly submit my honest and humble thanks to her for converting my dreams
into reality.
I thank the Doctoral Committee Member, Dr. K. THANGAVEL, M.Sc.,
M.C.A., M.Phil., P.G.D.C.A., Ph.D., Professor and Head, Department of
Computer Science, Periyar University, Salem, for helping me to fine tune my
research work through his valuable discussions, comments and suggestions.
I am very much grateful to Prof. M. KARNAMURTHI, M.Phil., Professor
and Head, Department of English (Retd.), Government Arts College, for
proofreading my research papers and thesis. I deeply appreciate
his timely help and constructive criticism, which brought the papers and the thesis
into shape.
My heartfelt thanks to Mr. R. JAYACHANDRAN, Executive Engineer,
Velliangadu, Drinking Water Treatment Plant, Pillur, TWAD, for giving
permission to visit the Velliangadu treatment plant and for motivating me to
carry out the research work.
I accord my warm thanks to Mr. N. MATHESHAN, Electrical
Superintendent, Velliangadu, Drinking Water Treatment Plant, Pillur, TWAD, for
providing encouraging and constructive feedback towards my research work.
I express my thanks to Mrs. N. SUBULAKSHMI, Junior Water Analyst,
Velliangadu, Drinking Water Treatment Plant, Pillur, TWAD, for providing
sustained help towards my research work.
I express my sincere appreciation to Dr. (Mr.) B. ANIRUDHAN,
Principal, Nehru Arts and Science College, Coimbatore for his support and
encouragement rendered towards the research.
I accord my warm thanks to all the FACULTY MEMBERS, NON-
TEACHING STAFF and RESEARCH SCHOLARS of the Department of
Computer Science, Avinashilingam Institute for Home Science and Higher
Education for Women, Coimbatore, for their encouragement and support.
The thesis would not have come to a successful completion without the
help received from my family. Words cannot express how grateful I am to my
FATHER, MOTHER and SISTER for all the sacrifices they have made to
support me. They encouraged and helped me at every stage of my personal and
academic life, and longed to see this achievement come true. I owe every
achievement of mine to my family.
Finally, I express my warm gratitude to all my FRIENDS for their
valuable help and suggestions rendered for the completion of the research work.
Above all, I thank GOD Almighty for His blessings in this endeavour.
CONTENTS
CHAPTER NO. TITLE PAGE NO
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
1 INTRODUCTION 1
1.1 Water Quality 2
1.2 Water Contamination 3
1.2.1 Types of Water Contamination 5
1.3 Water Contamination Detection System (WCDS) 8
1.4 Data Mining 10
1.4.1 Importance of Data Mining in Water Contamination Detection 13
1.4.2 Data Mining Methods 14
1.5 Motivation and Objectives 16
1.6 Chapter Formulation 18
1.7 Chapter Summary 19
2 REVIEW OF LITERATURE 20
2.1 Water Contamination Related Studies 20
2.2 Missing Value Handling 29
2.2.1 Imputation-based on Artificial Neural Network Imputation 32
2.2.2 Imputation-based on Recurrent Neural Network 33
2.2.3 Imputation-based on Auto-Associative Neural Network Imputation 33
2.2.4 Imputation-based on K-Nearest Neighbour (K-NN) Algorithm 34
2.2.5 Imputation-based on Self-Organizing Map (SOM) 34
2.2.6 Traditional Imputation Methods 35
2.3 Feature Selection 37
2.4 Classification Algorithms 44
2.4.1 Decision Trees 44
2.4.2 Neural Networks 45
2.4.3 Statistical Learning Algorithms 46
2.4.4 K-Nearest Neighbour (K-NN)-based Techniques 48
2.5 Outlier Detection 49
2.5.1 Statistical Techniques for Outlier Detection 50
2.5.2 Depth-based Outlier Detection Approaches 50
2.5.3 Distance-based Approaches for Outlier Detection 51
2.5.4 Density-based Approaches for Outlier Detection 51
2.5.5 Classification-based Approaches for Outlier Detection 52
2.6 Chapter Summary 53
3 RESEARCH METHODOLOGY AND APPROACH 55
3.1 Research Methodology, Phases and Interactions 56
3.2 Phase I: Preprocessing – Missing Value Handling 59
3.3 Phase II: Feature Selection 61
3.3.1 Step 1 : Multiple Filter-based Approach 62
3.3.2 Step 2 : Genetic Algorithm based Wrapper Approach using SVM Classifier 62
3.4 Phase III : Contamination Detection 63
3.4.1 Anomaly Detection for Contamination Detection 63
3.4.2 Enhancing SVM Classifier 65
3.4.3 Integrated Anomaly Detection, Feature Selection and Classification 65
3.5 Experimental Results 67
3.5.1 Study Area 67
3.5.2 Datasets 69
3.5.3 Performance Metrics 70
3.6 Chapter Summary 73
4 DESIGN OF PREPROCESSING ALGORITHM 74
4.1 Traditional K-NN Imputation (K-NNI) Method 74
4.2 Fast Weighted K-NN Imputation (FWKNNI) Algorithm 76
4.2.1 Pruning Algorithm 77
4.2.2 K-Means Clustering Algorithm 81
4.2.3 Weighted K-NN Method 82
4.3 Chapter Summary 89
5 DESIGN OF FEATURE SELECTION ALGORITHM 90
5.1 Overview to Feature Selection 91
5.1.1 Generate Candidate Subset 92
5.1.2 Subset Evaluation Function 92
5.1.3 Stopping Condition 92
5.1.4 Validation Procedure 93
5.2 Filter and Wrapper-based Approaches 93
5.3 The 2FWFS Algorithm 95
5.3.1 Step 1: Multiple Filter-based Feature Pre-Selection (M2FPS) Algorithm 96
5.3.2 Step 2: GA-SVM Wrapper-based (GA-SVM) Algorithm 102
5.3.3 Step 3: 2FWFS Algorithm 113
5.4 Chapter Summary 113
6 DESIGN OF CONTAMINATION DETECTION ALGORITHM 115
6.1 Anomaly Detection Algorithm 115
6.1.1 Overview of Outliers 115
6.1.2 Proposed Anomaly Detection Algorithm 118
6.1.2.1 Boosting Algorithm 120
6.1.2.2 Enhanced K-Means Algorithm 122
6.1.2.3 Anomaly Detection in Normal Clusters 134
6.1.2.4 Merging of Similar Clusters 136
6.2 Enhanced SVM Classifier 136
6.2.1 Step 1 : Pre-clustering 137
6.2.2 Step 2 : Identify Crisp Clusters 137
6.2.3 Step 3 : Removal of Irrelevant SVs 138
6.3 Integrated System 139
6.4 Chapter Summary 139
7 RESULTS AND DISCUSSION 141
7.1 Performance Evaluation of Missing Value Handling Algorithm 141
7.2 Performance Evaluation of Anomaly Detection Algorithm 149
7.3 Performance Evaluation of Feature Selection, Classification and Integrated Contamination Detection System 155
7.4 Chapter Summary 161
8 SUMMARY AND CONCLUSION 162
BIBLIOGRAPHY 166
PUBLICATIONS RELATED TO RESEARCH WORK 190
LIST OF TABLES
TABLE NO. TITLE PAGE NO.
2.1 Comparative Evaluation of Missing Value Handling Techniques 31
5.1 Filters and Wrappers 95
5.2 Parameters setting of GA-SVM algorithm 111
7.1 Coding Scheme 142
7.2 NRMSE of Missing Value Handling Algorithms (Siruvani Dataset) 143
7.3 NRMSE of Missing Value Handling Algorithms (Pillur Dataset) 144
7.4 Speed (seconds) of Missing Value Handling Algorithms (Siruvani Dataset) 147
7.5 Speed (seconds) of Missing Value Handling Algorithms (Pillur Dataset) 148
7.6 Anomaly Detection Rate (%) 150
7.7 Anomaly Detection Speed (seconds) 153
7.8 Accuracy (%) of the Contamination Detection Systems 156
7.9 Error Rate (%) of the Contamination Detection Systems 158
7.10 Speed (seconds) of the Contamination Detection Systems 160
LIST OF FIGURES
FIGURE NO. TITLE PAGE NO
1.1 Crime Data Mining Model 11
1.2 Data Mining Process 12
1.3 Data Mining Methods 15
2.1 Feature Selection Approaches 39
3.1 Steps in Water Management System 57
3.2 Development Methodology 58
3.3 Interaction of Algorithms and Research Phases 59
3.4 Study Area 68
3.5 Sample Snapshot (Partial) of Pillur Dataset 71
3.6 Sample Snapshot (Partial) of Siruvani Dataset 71
4.1 General Steps of FWKNNI Algorithm 75
4.2 Steps in FWKNNI Algorithm 78
4.3 Pruning Process 81
4.4 Conventional K-Means Algorithm 82
5.1 Steps in Feature Selection 91
5.2 Filter-Based Feature Selection Method 94
5.3 Wrapper-Based Feature Selection Method 95
5.4 Flow of 2FWFS algorithm 96
5.5 M2FPS Algorithm 97
5.6 Markov Blanket Filter 102
5.7 Encoding of Feature Subset in GA - A L-Dimensional Binary Vector 104
5.8 Roulette Wheel Selection 106
5.9 Support Vector Machine Hyperplane 108
5.10 Process of GA-SVM Hybrid Feature Selection Algorithm 110
5.11 Detailed Steps Involved in GA-SVM Algorithm 112
5.12 2-Step Feature Selection Algorithm Combining M2FPS and GA-SVM (2FWFS) 113
6.1 Examples of Outliers 116
6.2 Clustering-Based Anomaly Detection 119
6.3 Boosting Algorithm 121
6.4 Example of Local and Global Consistency (Toy Dataset) 123
6.5 K-Means with DLG Measure 125
6.6 Automatic Estimation of K 129
6.7 RNN Example 131
6.8 Proposed Initialization Procedure 133
6.9 Proposed Enhanced K-Means Algorithm 135
6.10 Procedure to Merge Similar Clusters 136
6.11 Crisp Cluster Identification Algorithm 137
6.12 Integrated AWCDS 140
7.1 Average NRMSE of the Missing Value Handling Algorithms (Siruvani Dataset) 146
7.2 Average NRMSE of the Missing Value Handling Algorithms (Pillur Dataset) 146
7.3 Average Speed (seconds) of the Missing Value Handling Algorithms (Siruvani Dataset) 151
7.4 Average Speed (seconds) of the Missing Value Handling Algorithms (Pillur Dataset) 151
7.5 Average Anomaly Detection Rate (%) 152
7.6 Average Speed (seconds) of Anomaly Detection Algorithms 154
7.7 Average Accuracy (%) of Contamination Detection Systems 157
7.8 Average Error Rate (%) 159
7.9 Average Speed (seconds) 161
LIST OF ABBREVIATIONS
1NN Single Nearest Neighbour
2FWFS 2-Step Filter and Wrapper-based Feature Selection
AANN Auto-Associative Neural Network
ANN Artificial Neural Network
ANOVA Analysis of Variance
AWCDS Automatic Water Contamination Detection System
BN Bayesian Network
BP Back Propagation
CART Classification And Regression Tree
CBLOF Cluster-Based Local Outlier Factor
CFS Correlation-based Feature Selection
CI Conditional Independence
CS Chi-Square
CVI Cluster Validity Index
CWS Contamination Warning System
DLG Local and Global Consistency
EBIC Enhanced Bayesian Information Criterion
EDM Enhanced Distance Measure
EDS Event Detection System
EKPAD Enhanced K-Means clustering with Pruning for Anomaly
Detection
EPA Environmental Protection Agency
FN False Negatives
FP False Positives
FRR False Positive Rate
FS Feature Selection
FS Fisher Criterion Score
FSVM Proposed Fast SVM
FWKNN Fast Weighted KNN Imputation
FWKNNI Fast Weighted KNN Imputation Algorithm
GA Genetic Algorithm
GA-SVM Genetic Algorithm and Support Vector Machine Wrapper-Based Algorithm
ICF Iterative Case Filtering
ICT Information and Communication Technology
IT Information Technology
KCAD K-Means Clustering Based Anomaly Detection
K-NN K-Nearest Neighbour
KNNI KNN-Imputation
LASSO Least Absolute Shrinkage and Selection Operator
LOF Local Outlier Factor
LPCF Linear Prediction Correction Filter
LVW Las Vegas Wrapper
M2FPS Multiple Filter-Based Feature Pre-Selection
MB Markov Blanket
MDI Median Imputation
MDVI Modified Dynamic Validity Index
MI Multiple Imputation
MI Mutual Information
MLD Million Litres per Day
MLP Multi-Layer Perceptron
MRMR Minimum Redundancy-Maximum Relevance
MTL Multi-Task Learning
NN Neural Network
NRMSE Normalized Root Mean Square Error
OHSR Over Head Service Reservoir
ORP Oxidation-Reduction Potential
PC Pearson Correlation
PCA Principal Component Analysis
PSO Particle Swarm Optimization
RBF Radial Basis Function
RGSS Random Generation plus Sequential Selection
RNN Recurrent Neural Network
ROC Receiver Operating Characteristics
SA Simulated Annealing
SCADA Supervisory Control and Data Acquisition
SMO Sequential Minimal Optimization
SMR Stepwise Multiple Regression
SOM Self Organizing Map
SPCA State Pollution Control Agencies
SVM Support Vector Machine
SVs Support Vectors
TEST Training-EStimation-Training
TN True Negatives
TP True Positives
TS-SOM Tree Structured-SOM
TWAD Tamilnadu Water Supply And Drainage
ULBs Urban Local Bodies
UNICEF United Nations Children’s Fund
USC Uncorrelated Shrunken Centroid
WCDS Water Contamination Detection System
WHO World Health Organization
W-KNN Weighted KNN
ABSTRACT
The quality of drinking water has always been a powerful environmental
determinant of health and a concern worldwide. A secure and safe supply of drinking
water is fundamental to public health. Water contamination, defined as the
pollution of water bodies, is an important factor that reduces the quality of
drinking water. The main aim of this research work is to design and develop
data mining algorithms that detect the presence or absence of water
contaminants.
The proposed water contamination detection system consists of three steps,
namely, preprocessing, feature selection and classification. In preprocessing, an
enhanced K-Nearest Neighbour Imputation method is used to handle the missing
values in the water dataset. The enhanced algorithm uses a pruning algorithm to
reduce the size of the dataset by removing irrelevant instances, the K-Means
algorithm to group similar instances together, a weighted K-Nearest Neighbour
search and imputation algorithm to impute the missing values, and a merging
algorithm to combine all the imputed clusters into a dataset with no missing
values.
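The weighted K-Nearest Neighbour imputation step can be illustrated with a minimal sketch. It is not the thesis's FWKNNI algorithm: the pruning, K-Means grouping and merging steps are omitted, and the inverse-distance weighting, the function name weighted_knn_impute and the sample values are assumptions made only for illustration.

```python
import math

def weighted_knn_impute(rows, k=3):
    """Fill None entries from the k nearest complete rows.

    Assumption: neighbours are weighted by inverse Euclidean distance,
    measured over the attributes that are observed in the incomplete row.
    """
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]
        # Rank complete rows by distance on the observed attributes only.
        scored = sorted(
            ((math.sqrt(sum((row[i] - c[i]) ** 2 for i in observed)), c)
             for c in complete),
            key=lambda pair: pair[0],
        )[:k]
        # A small constant avoids division by zero for exact matches.
        weights = [1.0 / (d + 1e-9) for d, _ in scored]
        filled = list(row)
        for j, value in enumerate(row):
            if value is None:
                weighted_sum = sum(w * c[j] for w, (_, c) in zip(weights, scored))
                filled[j] = weighted_sum / sum(weights)
        imputed.append(filled)
    return imputed

# Hypothetical two-attribute readings; the last row has a missing value.
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, None]]
result = weighted_knn_impute(samples, k=3)
```

In this toy input the incomplete row coincides with the second row on its observed attribute, so that neighbour dominates the weighted average and the imputed value is close to 20.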
Feature selection is performed using a 2-step algorithm that combines the
advantages of filter-based and wrapper-based feature selection algorithms.
The algorithm first uses a multiple filter algorithm to prune irrelevant features.
For this purpose, it makes use of four filter-based algorithms, namely,
Mutual Information (MI), Pearson Correlation (PC), Chi-Squared test (CS) and
Fisher Criterion Score (FS), along with the Markov Blanket Filter (MBF). The results
are combined using a simple Boolean union operation. The combined result is then
used by the wrapper-based algorithm, which combines a genetic
algorithm with a Support Vector Machine (SVM) classifier. The final result is a set
of optimal features that have a strong positive impact on water contamination
detection.
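The Boolean union of the filter outputs can be sketched as follows. The feature names and the per-filter selections are hypothetical, the Markov Blanket Filter is left out for brevity, and the function name union_of_filters is an assumption; only the union operation itself comes from the description above.

```python
def union_of_filters(selections):
    """Combine per-filter feature picks with a Boolean union (logical OR).

    A feature is kept if at least one filter selected it.
    """
    combined = set()
    for picked in selections.values():
        combined |= set(picked)
    return sorted(combined)

# Hypothetical selections from the four filters named in the text.
picks = {
    "MI": ["pH", "turbidity"],
    "PC": ["turbidity", "chlorine"],
    "CS": ["pH"],
    "FS": ["conductivity"],
}
candidates = union_of_filters(picks)
```

A union keeps any feature that at least one filter considers relevant, so the pre-selection stays permissive and leaves the finer pruning to the wrapper stage.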
The final step of contamination detection performs two tasks, namely,
anomaly detection and classification. The anomaly detection algorithm identifies
features that are abnormal as contaminants and removes them from the dataset.
The research work proposes a clustering-based algorithm enhanced through the
use of a pruning algorithm and an enhanced K-Means algorithm. The K-Means
algorithm is enhanced by incorporating solutions to (i) automatically estimate the K
value and initial centroids, (ii) handle the problem of large datasets, (iii)
reduce time complexity and (iv) modify the distance metric to consider both intra-
and inter-distances between data points to improve the process of clustering. The
anomalies are detected using the Cluster-Based Local Outlier Factor.
In the second task, the presence or absence of contamination in the normal data
retained by the anomaly detection step is identified using a Fast
Support Vector Machine classifier. The speed is improved by using the K-Means
algorithm to form clusters; using only the crisp clusters, the irrelevant
support vectors that are far from the boundaries are removed, thus reducing the
size of the training data. This improves both the time complexity and the accuracy
of the contamination detection process.
Experiments to evaluate the proposed algorithms used two datasets
collected from Siruvani and Pillur, Coimbatore, Tamil Nadu, India. The missing
value handling algorithm was evaluated using Normalized Root Mean Square
Error and speed. The anomaly detection algorithm was analyzed using two
metrics, namely, outlier detection rate and speed. The contamination detection
algorithm was evaluated using three metrics, namely, accuracy, error rate and
speed. All the proposed algorithms were compared with their respective existing
(or conventional) counterparts.
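The evaluation metrics named above can be sketched in a few lines. The abstract does not state how the Normalized Root Mean Square Error is normalized, so dividing by the range of the true values is an assumption here, as are the function names and the definition of error rate as the complement of accuracy.

```python
import math

def nrmse(actual, imputed):
    """RMSE between true and imputed values, divided by the range of the
    true values (the normalization choice is an assumption)."""
    if len(actual) != len(imputed) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, imputed)) / len(actual)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("range of actual values is zero")
    return math.sqrt(mse) / value_range

def accuracy_and_error_rate(tp, tn, fp, fn):
    """Accuracy and error rate in percent from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    return accuracy, 100.0 - accuracy
```

A lower NRMSE indicates that the imputed values lie closer to the true ones, while accuracy and error rate summarize how often the classifier labels a sample correctly or incorrectly.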
In conclusion, the results showed that the proposed automatic water
contamination detection system, which combines the proposed Fast Weighted KNN
Imputation Algorithm for handling missing values, the proposed 2-Step Filter and
Wrapper-based Feature Selection Algorithm, Enhanced K-Means Clustering with
Pruning for Anomaly Detection and the proposed Fast SVM, was efficient and
produced a maximum average accuracy of 98.77% (Siruvani dataset) and 98.53%
(Pillur dataset).