Jagdeep Singh HYBRID TECHNIQUE FOR ASSOCIATIVE CLASSIFICATION OF HEART DISEASES

Hybrid Technique for Associative Classification of Heart Diseases

Jan 19, 2015

Page 1: Hybrid Technique for Associative Classification of Heart Diseases

Jagdeep Singh

HYBRID TECHNIQUE FOR ASSOCIATIVE CLASSIFICATION OF HEART DISEASES

Page 2: Hybrid Technique for Associative Classification of Heart Diseases

Table of Contents

Ø  Introduction
    Ø  Motivation
    Ø  Data Mining
    Ø  Classification
    Ø  Association
    Ø  Heart Disease Database

Ø  Literature Survey

Ø  Problem Formulation

Ø  Objectives

Ø  Present Work

Ø  Result and Discussion

Ø  Conclusion

Ø  Future Scope

Ø  References

Page 3: Hybrid Technique for Associative Classification of Heart Diseases

Motivation

Ø  Accumulation of huge data sets in the fields of engineering and biomedical science.

Ø  Ability to extract hidden and useful knowledge from large databases.

Ø  Need to develop intelligent and cost-effective decision support systems.

Ø  How to teach people to ignore irrelevant data.

Ø  A major problem today is obtaining an optimal outcome from data that is largely irrelevant.

Page 4: Hybrid Technique for Associative Classification of Heart Diseases

Data Mining

Ø  Data mining is the computational process of finding patterns in large data sets, using methods at the intersection of machine learning, artificial intelligence, statistics and database systems.

Ø  The main focus of the data mining process is to obtain information from the data and convert it into an understandable structure for further use.

Page 5: Hybrid Technique for Associative Classification of Heart Diseases

Data Mining Process

The Data Mining Process [1]

Page 6: Hybrid Technique for Associative Classification of Heart Diseases

Classification

Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Page 7: Hybrid Technique for Associative Classification of Heart Diseases

Association

Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness.

For example, the rule {onions, potatoes} => {burger} says that a customer who buys onions and potatoes together is also likely to buy a burger.

Page 8: Hybrid Technique for Associative Classification of Heart Diseases

Example : Heart diseases Dataset

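| ID | Age | Gender | Chest pain | Blood pressure | Diagnosis |
|---|---|---|---|---|---|
| 1 | 63 | male | typ_angina | high | No |
| 2 | 67 | male | asympt | very_high | Yes |
| 3 | 67 | male | asympt | high | Yes |
| 4 | 37 | male | non_anginal | high | No |
| 5 | 41 | female | atyp_angina | high | No |
| 6 | 56 | male | atyp_angina | high | No |
| 7 | 62 | female | asympt | high | Yes |
| 8 | 57 | female | asympt | high | No |
| 9 | 63 | male | asympt | high | Yes |
| 10 | 53 | male | asympt | high | Yes |
| 11 | 57 | male | asympt | high | No |
| 12 | 56 | female | atyp_angina | high | No |
| 13 | 56 | male | non_anginal | high | Yes |
| 14 | 44 | male | atyp_angina | high | No |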

Page 9: Hybrid Technique for Associative Classification of Heart Diseases

Association rules example:

1. cp=atyp_angina trestbps=high 4 ==> diagnosis=No 4

2. gender=male cp=asympt trestbps=very_high 2 ==> diagnosis=Yes 1

3. gender=female cp=atyp_angina 2 ==> diagnosis=No 2

4. gender=male cp=atyp_angina trestbps=high 2 ==> diagnosis=No 2

5. gender=female cp=atyp_angina trestbps=high 2 ==> diagnosis=No 2

6. cp=atyp_angina 4 ==> diagnosis=No 4

7. gender=male cp=asympt trestbps=high 4 ==> diagnosis=Yes 2

8. gender=male cp=atyp_angina 2 ==> diagnosis=No 2
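Each rule above reads as: antecedent, number of records matching the antecedent, ==>, consequent, number of those records that also match the consequent. The ratio of the two counts is the rule's confidence. The following is a minimal Python sketch of that counting, assuming the 14 example records from the heart diseases dataset slide (age omitted because no example rule uses it, and the blood pressure of record 1 written in lower case). The function name `rule_counts` is hypothetical and is not part of Weka.

```python
# The 14 example records: (gender, cp, trestbps, diagnosis).
FIELDS = ("gender", "cp", "trestbps", "diagnosis")
RECORDS = [
    ("male", "typ_angina", "high", "No"),
    ("male", "asympt", "very_high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "non_anginal", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "atyp_angina", "high", "No"),
    ("female", "asympt", "high", "Yes"),
    ("female", "asympt", "high", "No"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "non_anginal", "high", "Yes"),
    ("male", "atyp_angina", "high", "No"),
]


def rule_counts(antecedent, consequent):
    """Return (antecedent count, antecedent-and-consequent count, confidence)."""
    rows = [dict(zip(FIELDS, record)) for record in RECORDS]
    # Records that satisfy every attribute=value pair of the antecedent.
    covered = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    # Of those, the records that also satisfy the consequent.
    correct = [r for r in covered if all(r[k] == v for k, v in consequent.items())]
    confidence = len(correct) / len(covered) if covered else 0.0
    return len(covered), len(correct), confidence


# Rule 6: cp=atyp_angina ==> diagnosis=No
print(rule_counts({"cp": "atyp_angina"}, {"diagnosis": "No"}))
# Rule 3: gender=female cp=atyp_angina ==> diagnosis=No
print(rule_counts({"gender": "female", "cp": "atyp_angina"}, {"diagnosis": "No"}))
```

For rule 6 the sketch counts 4 matching records, all 4 with diagnosis=No, which gives a confidence of 1.0.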

Page 10: Hybrid Technique for Associative Classification of Heart Diseases

Result: new prediction?

| Age | Gender | Chest pain | Blood pressure | Diagnosis |
|---|---|---|---|---|
| 52 | male | non_anginal | very_high | ? |
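One possible way to answer this question is to pick the best matching rule, in the spirit of the best-rule selection described for associative classification by Liu et al. [4]. The sketch below is an illustration under assumptions, not the method used in this work: it ranks the eight example rules by confidence and then by antecedent count, and falls back to a default class when no rule matches. The default class "No" is an assumption (the majority class among the 14 example records), and the function name `predict` is hypothetical.

```python
# Example rules as (antecedent, class, antecedent count, correct count),
# copied from the association rules example.
RULES = [
    ({"cp": "atyp_angina", "trestbps": "high"}, "No", 4, 4),
    ({"gender": "male", "cp": "asympt", "trestbps": "very_high"}, "Yes", 2, 1),
    ({"gender": "female", "cp": "atyp_angina"}, "No", 2, 2),
    ({"gender": "male", "cp": "atyp_angina", "trestbps": "high"}, "No", 2, 2),
    ({"gender": "female", "cp": "atyp_angina", "trestbps": "high"}, "No", 2, 2),
    ({"cp": "atyp_angina"}, "No", 4, 4),
    ({"gender": "male", "cp": "asympt", "trestbps": "high"}, "Yes", 4, 2),
    ({"gender": "male", "cp": "atyp_angina"}, "No", 2, 2),
]

# Assumption: fall back to the majority class of the 14 example records.
DEFAULT_CLASS = "No"


def predict(record):
    """Return (predicted class, matching antecedent or None if no rule fired)."""
    # Highest confidence first; ties broken by the larger antecedent count.
    ranked = sorted(RULES, key=lambda r: (r[3] / r[2], r[2]), reverse=True)
    for antecedent, label, _, _ in ranked:
        if all(record.get(k) == v for k, v in antecedent.items()):
            return label, antecedent
    return DEFAULT_CLASS, None


new_patient = {"age": 52, "gender": "male", "cp": "non_anginal", "trestbps": "very_high"}
print(predict(new_patient))
```

None of the eight example rules covers cp=non_anginal, so for this record the sketch returns the default class. This shows why a rule-based classifier needs a fallback for records that no rule matches.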

Page 11: Hybrid Technique for Associative Classification of Heart Diseases

Classifiers

Ø  ZeroR: Predicts the majority class and ignores all predictors. It has no predictive power, but it is useful for determining a baseline performance as a benchmark for other classification methods.

Ø  OneR: Builds classification rules based on the value of a single predictor. It generates one rule for each predictor in the data and keeps the rule with the smallest total error.

Ø  NaiveBayes: Applies Bayes' rule under the assumption that the predictors are independent given the class. It can handle missing entries in the data.

Ø  J48: Weka's implementation of the C4.5 decision tree algorithm. With this technique, a tree is constructed to model the classification process.

Ø  IBk (k-nearest neighbour): The nearest neighbour algorithm categorises a given instance based on a set of already categorised training instances, by measuring the distance to the closest instances.
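The two simplest classifiers in this list can be sketched in a few lines of Python. The code below is a simplified illustration, not Weka's implementation: it assumes the 14 example records from the heart diseases dataset slide (age omitted, blood pressure of record 1 in lower case), handles nominal attributes only, and reports accuracy on the training records themselves rather than a cross-validated estimate. The function names `zero_r` and `one_r` are hypothetical.

```python
from collections import Counter, defaultdict

# The 14 example records: (gender, cp, trestbps, diagnosis).
FIELDS = ("gender", "cp", "trestbps")
RECORDS = [
    ("male", "typ_angina", "high", "No"),
    ("male", "asympt", "very_high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "non_anginal", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "atyp_angina", "high", "No"),
    ("female", "asympt", "high", "Yes"),
    ("female", "asympt", "high", "No"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "non_anginal", "high", "Yes"),
    ("male", "atyp_angina", "high", "No"),
]


def zero_r(records):
    """Return the majority class and the fraction of records it covers."""
    counts = Counter(r[-1] for r in records)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(records)


def one_r(records):
    """Return (attribute, value -> class rule, training accuracy) with fewest errors."""
    best = None
    for i, name in enumerate(FIELDS):
        # Class counts for every value of this attribute.
        by_value = defaultdict(Counter)
        for r in records:
            by_value[r[i]][r[-1]] += 1
        # Each value predicts its most frequent class.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        hits = sum(c.most_common(1)[0][1] for c in by_value.values())
        if best is None or hits > best[2]:
            best = (name, rule, hits)
    return best[0], best[1], best[2] / len(records)


print(zero_r(RECORDS))
print(one_r(RECORDS))
```

On these records ZeroR always predicts "No" (8 of 14 records), while OneR selects chest pain type as its single predictor and classifies 11 of the 14 training records correctly.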

Page 12: Hybrid Technique for Associative Classification of Heart Diseases

Association Methods

Ø  Apriori algorithm: Finds rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Ø  FP-Growth algorithm: Allows frequent itemset discovery without candidate itemset generation. Extracts frequent itemsets from the FP-tree. Follows a divide-and-conquer approach.
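The level-wise idea behind Apriori (join frequent itemsets into larger candidates, prune candidates that have an infrequent subset, then count the survivors in one pass over the data) can be sketched as follows. This is a minimal illustration and not Weka's Apriori implementation. The five shopping transactions are hypothetical, chosen to match the {onions, potatoes} => {burger} example, and `min_support` is an absolute count.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical transactions for illustration only.
TRANSACTIONS = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"burger", "milk"},
    {"onions", "milk"},
]

itemsets = apriori(TRANSACTIONS, min_support=2)
for itemset, count in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Rules such as {onions, potatoes} => {burger} are then derived from the frequent itemsets by dividing support counts. FP-Growth reaches the same frequent itemsets without the candidate generation step.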

Page 13: Hybrid Technique for Associative Classification of Heart Diseases

Heart Disease Database

| Sr. No. | Attribute | Description | Values |
|---|---|---|---|
| 1 | age | Age in years | Continuous |
| 2 | gender | Male or female | 1 = male, 0 = female |
| 3 | cp | Chest pain type | 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic |
| 4 | trestbps | Resting blood pressure | Continuous value in mm Hg |
| 5 | chol | Serum cholesterol | Continuous value in mg/dl |
| 6 | thalach | Maximum heart rate achieved | Continuous value |
| 7 | fbs | Fasting blood sugar | 1 = > 120 mg/dl, 0 = ≤ 120 mg/dl |

Page 14: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

| Sr. No. | Attribute | Description | Values |
|---|---|---|---|
| 8 | restecg | Resting electrocardiographic results | 0 = normal, 1 = having ST-T wave abnormality, 2 = left ventricular hypertrophy |
| 9 | exang | Exercise-induced angina | 0 = no, 1 = yes |
| 10 | oldpeak | ST depression induced by exercise relative to rest | Continuous value |
| 11 | slope | Slope of the peak exercise ST segment | 1 = upsloping, 2 = flat, 3 = downsloping |
| 12 | ca | Number of major vessels colored by fluoroscopy | 0 - 3 |
| 13 | thal | Defect type | 3 = normal, 6 = fixed defect, 7 = reversible defect |
| 14 | diagnosis | Heart disease prediction | 1 = no heart disease, 0 = has heart disease |

Page 15: Hybrid Technique for Associative Classification of Heart Diseases

Literature Survey

Ø  Liao et al. [3] report on data mining techniques and their applications through a survey of the literature from 2000 to 2011. The paper surveys three areas of data mining research: knowledge types, analysis types, and architecture types. The discussion deals with future progress in social science and engineering methodologies that implement data mining techniques, and with the development of problem-oriented applications.

Ø  Liu et al. [4] presented associative classification, which integrates classification rule mining and association rule mining. The integration is done by focusing on mining a special subset of association rules whose consequents are restricted to the classification class labels, called Class Association Rules (CARs). The algorithm first generates all the association rules and then selects a small set of rules to form the classifier. When predicting the class label for a new sample, the best rule is chosen.

Page 16: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

Ø  The first association rule mining algorithm was the Apriori algorithm [5], developed by Agrawal and Srikant. In each pass, the Apriori algorithm generates the candidate itemsets using only the itemsets found to have large support in the previous pass, without considering the transactions in the database.

Ø  Palaniappan and Awang [6] developed a prototype Intelligent Heart Disease Prediction System (IHDPS) using data mining techniques, namely Decision Trees, Naive Bayes and Neural Networks. Results show that each technique has its unique strength in realizing the objectives of the defined mining goals. IHDPS can answer complex what-if queries which traditional decision support systems cannot. Using medical profiles such as age, gender, blood pressure and blood sugar, it can predict the likelihood of patients getting a heart disease. IHDPS is Web-based, user-friendly, scalable, reliable and expandable. It is implemented on the .NET platform.

Page 17: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

Ø  Srinivas et al. [7] presented the application of data mining techniques in healthcare and the prediction of heart attacks. They examined the potential use of classification-based data mining techniques such as rule-based classifiers, Decision Trees, Naive Bayes and Artificial Neural Networks on the massive volume of healthcare data. The Tanagra data mining tool was used for exploratory data analysis, machine learning and statistical learning algorithms. The training data set consists of 3000 instances with 14 different attributes.

Ø  Shouman et al. [8] proposed k-means clustering combined with the decision tree method to predict heart disease. In their work they suggested several centroid selection methods for k-means clustering to increase efficiency. The 13 input attributes were collected from the Cleveland Clinic Foundation heart disease data set. For the random attribute and random row methods, ten runs were executed and the average and best for each method were calculated. In addition, integrating k-means clustering and decision tree could achieve higher accuracy than the bagging algorithm in the diagnosis of heart disease patients. The accuracy achieved was 83.9% by the inlier method with two clusters.

| The algorithm used | Accuracy | Time taken |
|---|---|---|
| Naive Bayes | 52.33% | 609 ms |
| Decision list | 52% | 719 ms |
| K-NN | 45.67% | 1000 ms |

Page 18: Hybrid Technique for Associative Classification of Heart Diseases

Summary and Gaps Identified

Ø  Different methods such as Naive Bayes, decision trees, k-nearest neighbour and artificial neural networks have been implemented on heart disease datasets.

Ø  The performance of the classifiers is evaluated and their results are analysed.

Ø  The maximum accuracy achieved according to the survey is 83.9%, using k-means clustering with a decision tree.

Ø  The classification methods on their own do not provide better accuracy and experimental results.

Ø  Integration of associative classification has not yet been implemented on the heart disease dataset.

Page 19: Hybrid Technique for Associative Classification of Heart Diseases

Problem Formulation

Ø  Accuracy on heart disease data has been calculated only on the basis of classification methods.

Ø  The proportion of correctly classified instances is too low for predicting heart diseases.

Ø  Associative classification suffers from inefficiency because it often generates a very large number of insignificant rules.

Ø  Most associative classification algorithms adopt an exhaustive search method to discover the rules and require multiple passes over the database.

Ø  They find frequent items in one phase and generate the rules in a separate phase, consuming more resources such as storage and processing time.

Page 20: Hybrid Technique for Associative Classification of Heart Diseases

Objectives

Ø  To propose a technique that can generate Classification Association Rules (CARs) efficiently for heart disease prediction.

Ø  To evaluate the proposed approach.

Ø  To compare the proposed method with other state-of-the-art techniques.

Page 21: Hybrid Technique for Associative Classification of Heart Diseases

Present Work

The present work has been implemented using the data mining tool Weka. The implementation steps are listed below:

1. Review the classification and association rule generation methods.
2. Understand the existing classification algorithms.
3. Study the existing classification and association methods used to predict heart diseases.
4. Understand the heart disease dataset attributes used in prediction.
5. Study the ARFF file format, the standard for representing datasets in Weka.
6. Prepare the dataset for implementation of the association algorithms.

Page 22: Hybrid Technique for Associative Classification of Heart Diseases
Page 23: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

7. Implement association algorithms such as Apriori and FP-Growth on the prepared dataset.
8. Select the best 10 rules for each association algorithm.
9. Make classes and extract training datasets based on the different rules.
10. Implement classification algorithms on the extracted training datasets.
11. Compare the performance and the percentage of correctly classified instances of the classification methods.
12. Construct a system based on the classification method with the highest performance and accuracy.
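The ARFF format mentioned in step 5 is a plain-text format with a header that declares the relation and its attributes, followed by a `@data` section with one comma-separated record per line. The sketch below writes a few of the example records in that layout. The helper name `to_arff` and the relation name `heart_example` are hypothetical, and only numeric and nominal attributes without spaces or special characters are handled.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    `attributes` maps an attribute name to "numeric" or to a list of
    nominal values; `rows` holds one tuple per record, in attribute order.
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes.items():
        if spec == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(spec) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


ATTRIBUTES = {
    "age": "numeric",
    "gender": ["male", "female"],
    "cp": ["typ_angina", "asympt", "non_anginal", "atyp_angina"],
    "trestbps": ["high", "very_high"],
    "diagnosis": ["No", "Yes"],
}

# First three records of the example dataset.
ROWS = [
    (63, "male", "typ_angina", "high", "No"),
    (67, "male", "asympt", "very_high", "Yes"),
    (67, "male", "asympt", "high", "Yes"),
]

arff_text = to_arff("heart_example", ATTRIBUTES, ROWS)
print(arff_text)
```

Declaring the attributes as nominal matters here because association rule mining works on categorical items, so continuous values need to be discretized before this step.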

Page 24: Hybrid Technique for Associative Classification of Heart Diseases

Apriori algorithm best rules

1. gender=female fbs=f restecg=normal exang=no thal=normal 35 ==> diagnosis=No 35 conf:(1)

2. gender=female cp=non_anginal thal=normal 31 ==> diagnosis=No 31 conf:(1)

3. cp=asympt chol=high_risk thal=reversable_defect 42 ==> diagnosis=Yes 41 conf:(0.98)

4. cp=asympt restecg=left_vent_hyper thal=reversable_defect 41 ==> diagnosis=Yes 40 conf:(0.98)

5. gender=female fbs=f slope=up 39 ==> diagnosis=No 38 conf:(0.97)

6. gender=female restecg=normal exang=no thal=normal 38 ==> diagnosis=No 37 conf:(0.97)

7. gender=female fbs=f restecg=normal exang=no 37 ==> diagnosis=No 36 conf:(0.97)

8. gender=female fbs=f slope=up thal=normal 37 ==> diagnosis=No 36 conf:(0.97)

9. cp=asympt trestbps=high chol=high_risk thal=reversable_defect 37 ==> diagnosis=Yes 36 conf:(0.97)

10. gender=female cp=non_anginal 35 ==> diagnosis=No 34 conf:(0.97)

Page 25: Hybrid Technique for Associative Classification of Heart Diseases

FP-Growth algorithm best rules

1. (fbs_binarized=1, restecg=left_vent_hyper_binarized=1, diagnosis=Yes, exang_binarized=1): 31 ==> (cp=asympt_binarized=1): 31 conf:(1)

2. (chol=high_risk_binarized=1, cp=asympt_binarized=1, thal=reversable_defect_binarized=1): 42 ==> (diagnosis=Yes): 41 conf:(0.98)

3. (restecg=left_vent_hyper_binarized=1, cp=asympt_binarized=1, thal=reversable_defect_binarized=1): 41 ==> (diagnosis=Yes): 40 conf:(0.98)

4. (thal=normal_binarized=1, trestbps=normal_binarized=1): 37 ==> (fbs_binarized=1): 36 conf:(0.97)

5. (slope=up_binarized=1, thal=reversable_defect_binarized=1): 37 ==> (gender_binarized=1): 36 conf:(0.97)

6. (trestbps=high_binarized=1, chol=high_risk_binarized=1, cp=asympt_binarized=1, thal=reversable_defect_binarized=1): 37 ==> (diagnosis=Yes): 36 conf:(0.97)

7. (chol=high_risk_binarized=1, thal=reversable_defect_binarized=1, exang_binarized=1): 34 ==> (diagnosis=Yes): 33 conf:(0.97)

8. (fbs_binarized=1, chol=high_risk_binarized=1, cp=asympt_binarized=1, thal=reversable_defect_binarized=1): 34 ==> (diagnosis=Yes): 33 conf:(0.97)

9. (gender_binarized=1, chol=high_risk_binarized=1, cp=asympt_binarized=1, thal=reversable_defect_binarized=1): 34 ==> (diagnosis=Yes): 33 conf:(0.97)

10. (fbs_binarized=1, restecg=left_vent_hyper_binarized=1, cp=asympt_binarized=1, thal=reversable_defect_binarized=1): 33 ==> (diagnosis=Yes): 32 conf:(0.97)

Page 26: Hybrid Technique for Associative Classification of Heart Diseases

Sample data form of Heart Disease Prediction. Available online: http://gndec.ac.in/~jagdeepmalhi/ihdps/

Page 27: Hybrid Technique for Associative Classification of Heart Diseases

Sample Data of Heart Disease Prediction for Risk Level: No

Page 28: Hybrid Technique for Associative Classification of Heart Diseases

Sample Data of Heart Disease Prediction for Risk Level: Low

Page 29: Hybrid Technique for Associative Classification of Heart Diseases

Sample Data of Heart Disease Prediction for Risk Level: High

Page 30: Hybrid Technique for Associative Classification of Heart Diseases

Results and Discussion

The evaluation of results is done on the basis of two comparisons.

Ø  Compare parameters such as time taken, correctly and incorrectly classified instances, Kappa statistic, mean absolute error and root mean squared error of different classifiers combined with the Apriori and FP-Growth association algorithms.

Ø  Compare the accuracy reported by different authors on the heart disease dataset.

Page 31: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

Comparison of different classifiers using the Apriori association algorithm on the heart diseases dataset.

| Classifier | Time taken (s) | Correctly classified instances (%) | Incorrectly classified instances (%) | Kappa statistic | Mean absolute error | Root mean squared error |
|---|---|---|---|---|---|---|
| ZeroR | 0.001 | 67.2 | 32.79 | 0 | 0.441 | 0.470 |
| OneR | 0.01 | 97.31 | 2.6 | 0.94 | 0.027 | 0.164 |
| J48 | 0.04 | 97.85 | 2.15 | 0.951 | 0.031 | 0.143 |
| IBk | 0.003 | 99.19 | 0.81 | 0.982 | 0.010 | 0.090 |
| NaiveBayes | 0.01 | 97.58 | 2.42 | 0.946 | 0.023 | 0.137 |
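The Kappa statistic in this comparison measures agreement between predicted and actual classes after correcting for the agreement expected by chance, which is why ZeroR scores 0 despite classifying 67.2% of instances correctly. A minimal sketch of the usual Cohen's kappa formula, kappa = (p_o - p_e) / (1 - p_e), is given below. The confusion matrices are hypothetical and are not taken from the experiments reported here.

```python
def kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (actual total * predicted total).
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / (total * total)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical ZeroR-like classifier: every instance is predicted as the majority class.
majority_only = [[672, 0],
                 [328, 0]]

# Hypothetical classifier with 85% accuracy on a balanced dataset.
balanced = [[45, 5],
            [10, 40]]

print(kappa(majority_only))
print(kappa(balanced))
```

In the first matrix the observed accuracy equals the chance agreement, so kappa is 0. In the second, an accuracy of 85% against a chance agreement of 50% gives a kappa of about 0.7.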

Page 32: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

Comparison of different classifiers using the FP-Growth association algorithm on the heart diseases dataset.

| Classifier | Time taken (s) | Correctly classified instances (%) | Incorrectly classified instances (%) | Kappa statistic | Mean absolute error | Root mean squared error |
|---|---|---|---|---|---|---|
| ZeroR | 0.001 | 85.67 | 14.33 | 0 | 0.247 | 0.350 |
| OneR | 0.005 | 92.55 | 7.45 | 0.649 | 0.075 | 0.273 |
| J48 | 0.01 | 96.56 | 3.44 | 0.859 | 0.056 | 0.185 |
| IBk | 0.001 | 94.84 | 5.16 | 0.779 | 0.053 | 0.227 |
| NaiveBayes | 0.003 | 97.55 | 7.45 | 0.711 | 0.088 | 0.265 |

Page 33: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

Comparison of the Apriori and FP-Growth association algorithms on the heart diseases dataset.

| Association algorithm | ZeroR accuracy | OneR accuracy | J48 accuracy | IBk accuracy | NaiveBayes accuracy |
|---|---|---|---|---|---|
| Apriori | 67.2 | 97.31 | 97.85 | 99.19 | 97.58 |
| FP-Growth | 85.67 | 92.55 | 96.56 | 94.84 | 97.55 |

Page 34: Hybrid Technique for Associative Classification of Heart Diseases

Continue…

Comparison of results evaluated by different authors on the heart disease dataset.

| Author / Year | Technique | Accuracy (%) |
|---|---|---|
| Cheung, 2001 [11] | NaiveBayes | 81.48 |
| Polat et al., 2007 [12] | K-Nearest Neighbor | 87.00 |
| Shouman et al., 2012 [13] | Decision tree | 84.10 |
| Das et al., 2009 [14] | K-Nearest Neighbor | 97.40 |
| Tu et al., 2009 [15] | J4.8 Decision Tree | 78.90 |
| Proposed method, 2014 | IBk with Apriori algorithm | 99.19 |

Page 35: Hybrid Technique for Associative Classification of Heart Diseases

Conclusion

Ø  A hybrid technique for associative classification was developed and applied to the heart diseases dataset to obtain more accurate predictions.

Ø  The dataset was processed in the Weka environment, and the performance of different classifiers was compared after applying the association algorithms.

Ø  Results show that IBk (k-nearest neighbour) combined with the Apriori association algorithm gives better results than the other combinations.

Ø  The results of different classifiers were compared with the proposed implementation method.

Ø  Finally, an Intelligent Heart Diseases Prediction System (IHDPS) was developed so that end users can check their risk of heart disease.

Page 36: Hybrid Technique for Associative Classification of Heart Diseases

Future Scope

Ø  Future work plans to reduce the number of attributes and to determine which attributes contribute most towards the diagnosis of heart disease.

Ø  Additional data mining techniques can be incorporated to provide better results.

Ø  There is a need to build a system where anyone can check the risk of heart diseases using minimum resources and parameters.

Ø  Parameters such as processing time, resources and memory used can be further improved.

Page 37: Hybrid Technique for Associative Classification of Heart Diseases

References

1)  U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases,” AI Magazine, American Association for Artificial Intelligence, vol. 17, no. 3, pp. 37–54, 1996.

2)  D. Aha. (1988, July) Heart disease databases. [Online]. Available: http://repository.seasr.org/Datasets/UCI/arff/heart-c.arff

3)  S. H. Liao, P. H. Chu, and P. Y. Hsiao, “Data mining techniques and applications - a decade review from 2000 to 2011,” Expert Systems with Applications, Elsevier, vol. 39, no. 1, pp. 11303–11311, 2012.

4)  B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Knowledge Discovery and Data Mining, New York, vol. 2, pp. 80–86, 1998.

5)  R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in VLDB, Santiago, Chile, September 1994, pp. 487–499.

6)  S. Palaniappan and R. Awang, “Intelligent heart disease prediction system using data mining techniques,” in IEEE/ACS International Conference, Doha, 2008, pp. 108–115.

7)  K. Srinivas, B. K. Rani, and D. A. Govrdhan, “Application of data mining techniques in healthcare and prediction of heart attacks,” International Journal on Computer Science and Engineering, vol. 2, no. 2, pp. 250–255, 2011.

Page 38: Hybrid Technique for Associative Classification of Heart Diseases

Continue …

8)  M. Shouman, T. Turner, and R. Stocker, “Integrating decision tree and k-means clustering with different initial centroid selection methods in the diagnosis of heart disease patients,” in Proceedings of the International Conference on Data Mining, 2012.

10)  J. Singh, H. Singh, and A. Kamra, “Recent trends in data mining: A review,” in Proceeding of 3rd International Conference on Biomedical Engineering and Assistive Technologies, Chandigarh, India, 2014, pp. 138–144.

11)  N. Cheung, “Machine learning techniques for medical analysis,” B.Sc. Thesis, School of Information Technology and Electrical Engineering, University of Queensland, 2001.

12)  K. Polat, S. Sahan, and S. Gunes, “Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbor) based weighting preprocessing,” Expert Systems with Applications, pp. 625–663, 2007.

13)  M. Shouman, T. Turner, and R. Stocker, “Applying k-nearest neighbor in diagnosing heart disease patients,” International Journal of Information and Education Technology, vol. 2, no. 3, pp. 220–223, June 2012.

14)  R. Das, I. Turkoglu, and A. Sengur, “Effective diagnosis of heart disease through neural networks ensembles,” Expert Systems with Applications, Elsevier, pp. 7675–7680, 2009.

15)  M. C. Tu, D. Shin, and D. Shin, “Effective diagnosis of heart disease through bagging approach,” in Proceeding of 2nd International Conference on Biomedical Engineering and Informatics. Seoul, South Korea: IEEE, October 2009, pp. 1–4.

Page 39: Hybrid Technique for Associative Classification of Heart Diseases

Jagdeep Singh http://jagdeepmalhi.blogspot.com