Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selection based on
Mutual Information
Abeer Alzubaidi
(PhD researcher)
School of Science and Technology
Nottingham Trent University
Content
• What is Breast Cancer?
• Breast Cancer Diagnosis
• Statistical Methods
• Predictive Modelling
• Evolutionary Computation
• The Hybrid Genetic Approach
• Breast Cancer Dataset
• Results
• Current Work & Conclusion
What is Breast Cancer?
• Breast cancer begins in the breast tissue and may start in a duct or lobe of the breast. When the “controls” in the breast cells stop working properly, the cells divide continually and form a lump or tumor.
Breast Cancer Statistics
• Breast cancer is the most common cancer in women in both developed and developing countries.
• Across all cancer types, an estimated 14.1 million new cases and 8.2 million deaths occurred worldwide in 2012.
Article Source: Model Comparison for Breast Cancer Prognosis Based on Clinical Data
Breast Cancer Diagnosis
• Successful early detection leads to:
– Better treatment for patients.
– Better clinical decision making.
Statistical Methods
• Statistical methods are the most popular approaches used in clinical practice for cancer diagnosis and prognosis.
• Challenges for statistical methods:
o Data diversity.
o High-dimensional data.
o Uncertainty and imprecision.
o Relevancy and redundancy.
Predictive Modelling
• Predictive modelling in medicine involves deriving a mathematical model that predicts an outcome for future patients.
• Our goal is to classify tumors into two types for breast cancer diagnosis: malignant or benign.
Prediction Model Challenges
• Generating a good model: accurate, stable, and general.
Evolutionary Computation
• Evolutionary Algorithms are suitable for constructing good predictive models.
The Hybrid Genetic Approach
• The proposed method combines a Genetic Algorithm (GA) with Mutual Information (MI) to identify cancer predictors.
• The genetic algorithm iterates through combinations of features. The best set of features (i.e. predictors) is then selected statistically and passed to the machine learning (ML) classifier.
• Prediction is based on the knowledge that the model acquires during the learning process.
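The search described above can be sketched as follows. This is a minimal illustration and not the method evaluated in these slides. It assumes a bit-mask chromosome (one bit per feature) and a fitness equal to the mean MI between the selected features and the class minus the mean MI between pairs of selected features. It also assumes tournament selection, one-point crossover, bit-flip mutation and elitism. The function names, hyperparameters and synthetic toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(xs)
    count_x, count_y, count_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def fitness(mask, columns, labels):
    """Assumed fitness: mean MI with the class minus mean MI between chosen features."""
    chosen = [col for bit, col in zip(mask, columns) if bit]
    if not chosen:
        return float("-inf")  # an empty feature set is never acceptable
    relevance = sum(mutual_information(c, labels) for c in chosen) / len(chosen)
    pairs = [(a, b) for i, a in enumerate(chosen) for b in chosen[i + 1:]]
    redundancy = (
        sum(mutual_information(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    )
    return relevance - redundancy


def select_features(columns, labels, pop_size=20, generations=30,
                    mutation_rate=0.1, seed=0):
    """Evolve bit masks (1 = keep the feature) and return the fittest mask found."""
    rng = random.Random(seed)
    n = len(columns)

    def score(mask):
        return fitness(mask, columns, labels)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(next_gen) < pop_size:
            parent_a = max(rng.sample(population, 3), key=score)  # tournament
            parent_b = max(rng.sample(population, 3), key=score)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # bit-flip mutation
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=score)
    return best


# Synthetic toy data (not the Wisconsin dataset): 40 cases, 4 discrete features.
labels = [i % 2 for i in range(40)]
columns = [
    labels[:],                                                # perfectly informative
    [1 if i % 2 == 1 or i % 10 == 0 else 0 for i in range(40)],  # partly informative
    [i // 20 for i in range(40)],                             # independent of the class
    labels[:],                                                # redundant copy of feature 0
]
best_mask = select_features(columns, labels)
print(best_mask, fitness(best_mask, columns, labels))
```

Under this assumed fitness, selecting both feature 0 and its copy scores lower than selecting either one alone, which is how the MI term discourages redundant predictors.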
Breast Cancer Dataset
• This study used the Wisconsin Breast Cancer dataset.
• The dataset was provided by the University of Wisconsin Hospitals, Madison.
• The dataset contains records collected from 699 patients.
• According to the class distribution, 458 cases (65.5%) came from patients with a benign tumor and 241 cases (34.5%) came from patients with a malignant tumor.
Attribute Information for the Breast Cancer Dataset
No.  Feature name                 Range
1    Clump thickness              1-10
2    Uniformity of cell size      1-10
3    Uniformity of cell shape     1-10
4    Marginal adhesion            1-10
5    Single epithelial cell size  1-10
6    Bare nuclei                  1-10
7    Bland chromatin              1-10
8    Normal nucleoli              1-10
9    Mitoses                      1-10
10   Diagnosis                    0 for benign, 1 for malignant
Article Source: Multisurface method of pattern separation for medical diagnosis applied to breast cytology
Leave-One-Out Cross Validation (LOOCV)
• The breast cancer dataset contained 699 patient cases.
• Evaluation using cross validation: a total of 699 iterations. In each iteration, 699-1 = 698 patient cases were used for training and the remaining case was used for testing. This is the most widely accepted approach in the clinical literature.
• Eventually, all patient cases pass through the testing process.
• The algorithm's performance is measured by its predictive accuracy on the test cases (i.e. all previously unseen patient records).
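The LOOCV procedure can be sketched in a few lines. This is an illustration only: the 1-nearest-neighbour classifier is a stand-in for the classifiers used in the study, and the eight toy records and all function names are hypothetical.

```python
import math


def nearest_neighbour_predict(train_X, train_y, x):
    """Stand-in classifier: return the label of the closest training record."""
    distances = [math.dist(row, x) for row in train_X]
    return train_y[distances.index(min(distances))]


def leave_one_out_accuracy(X, y, predict):
    """Hold out each record once, train on the other n-1, test on the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical toy records: (clump thickness, uniformity of cell size), scale 1-10.
X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (10, 10), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = benign, 1 = malignant
accuracy = leave_one_out_accuracy(X, y, nearest_neighbour_predict)
print(accuracy)
```

With 699 records the loop runs 699 times, and every record is used as a test case exactly once.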
Experimental Results Using The SVM Classifier
Evaluation measures by SVM kernel function
                  RBF      Linear   Quadratic  MLP
5 Features
  Correct Rate    0.9820   0.9822   0.9844     0.9795
  AUC             0.9605   0.9659   0.9669     0.9508
  ORP FPR         0.0332   0.0332   0.0290     0.0373
  ORP TPR         0.9541   0.9651   0.9629     0.9389
6 Features
  Correct Rate    0.9778   0.9823   0.9844     0.9683
  AUC             0.9607   0.9681   0.9669     0.9382
  ORP FPR         0.0415   0.0332   0.0290     0.0581
  ORP TPR         0.9629   0.9694   0.9629     0.9345
7 Features
  Correct Rate    0.9822   0.9845   0.9800     0.9909
  AUC             0.9648   0.9702   0.9617     0.9688
  ORP FPR         0.0332   0.0290   0.0373     0.0166
  ORP TPR         0.9629   0.9694   0.9607     0.9541
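The evaluation measures in the table above can be computed from true labels, predicted labels and classifier scores. The sketch below uses the standard definitions with malignant (1) as the positive class. The function names and toy values are hypothetical. It computes the TPR and FPR at one fixed set of predictions, whereas the slides report them at the ORP point of the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def correct_rate(y_true, y_pred):
    """Fraction of cases classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def true_positive_rate(y_true, y_pred):
    """Fraction of malignant cases detected (sensitivity)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def false_positive_rate(y_true, y_pred):
    """Fraction of benign cases wrongly flagged as malignant."""
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy values: 4 malignant and 6 benign cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.2, 0.1, 0.6]

print(correct_rate(y_true, y_pred))         # 8 of 10 correct
print(true_positive_rate(y_true, y_pred))   # 3 of 4 malignant cases detected
print(false_positive_rate(y_true, y_pred))  # 1 of 6 benign cases flagged
print(auc(y_true, scores))                  # 23 of 24 pairs ranked correctly
```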
Experimental Results Using The k-NN Classifier
Evaluation measures by k-NN distance measure
                  Correlation  Minkowski  Euclidean  Seuclidean
7 Features
  Correct Rate    0.9605       0.9887     0.9887     0.9865
  AUC             0.9156       0.9678     0.9678     0.9679
  ORP FPR         0.9017       0.0207     0.0207     0.0249
  ORP TPR         0.9017       0.9563     0.9563     0.9607
8 Features
  Correct Rate    0.9624       0.9843     0.9843     0.9844
  AUC             0.9133       0.9658     0.9658     0.9680
  ORP FPR         0.8930       0.0290     0.0290     0.0290
  ORP TPR         0.8930       0.9607     0.9607     0.9651
9 Features
  Correct Rate    0.9559       0.9910     0.9910     0.9888
  AUC             0.9104       0.9731     0.9731     0.9733
  ORP FPR         0.8996       0.0166     0.0166     0.0207
  ORP TPR         0.8996       0.9629     0.9629     0.9672
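The four distance measures compared above can be written out directly. The sketch below uses the standard definitions, reading "Seuclidean" as the standardized Euclidean distance. The function names and toy records are hypothetical. With exponent p = 2 the Minkowski distance reduces to the Euclidean distance. Identical Minkowski and Euclidean columns would be consistent with a default exponent of 2, but that is an assumption, because the slides do not state the exponent.

```python
import math
import statistics


def minkowski(a, b, p=2):
    """Minkowski distance with exponent p (p=2 is assumed as the default here)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    """Straight-line distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def seuclidean(a, b, stds):
    """Standardized Euclidean: each coordinate difference is divided by that feature's std."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stds)))


def correlation_distance(a, b):
    """One minus the Pearson correlation; undefined if either record is constant."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return 1 - numerator / denominator


# Hypothetical toy records with five features on the 1-10 scale.
records = [(5, 1, 1, 1, 2), (8, 10, 10, 8, 7), (3, 1, 1, 1, 2), (6, 8, 8, 1, 3)]
a, b = records[0], records[1]
# Per-feature sample standard deviation, computed over the toy records.
stds = [statistics.stdev(column) for column in zip(*records)]

print(euclidean(a, b), minkowski(a, b), minkowski(a, b, p=1))
print(seuclidean(a, b, stds), correlation_distance(a, b))
```

The correlation distance compares the shape of two records and ignores their overall level, so two records with very different magnitudes can still be close under it.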
Current Work & Conclusion
• Developed a hybrid approach for detecting breast cancer based on a Genetic Algorithm and Mutual Information.
• Experiments evaluated the performance of the proposed approach with two machine learning classifiers, k-NN and SVM, tuned using different distance measures and kernel functions, respectively.
• The results showed that the proposed hybrid approach is highly accurate for predicting breast cancer, with correct rates of up to 0.9909 with SVM and 0.9910 with k-NN.
TEAM
• Director of Studies:
– Dr Georgina Cosma
• Supervisory Team:
– Professor Graham Pockley
– Professor David Brown
Acknowledgements
• Support: Funding received from the Ministry of Higher Education and Scientific Research in Iraq.
Thank you. Any questions?