Artech Journal of Engineering and Applied Technology (AJEAT), Volume 1, Issue 1, 2020, Pages: 7-14, ISSN: 2523-5478
Improvement of Framework for the Grouping of CA Diseases by Investigating Big Data
¹D. Stalin David and ²S. Namboodiri
¹Department of Computer Science and Engineering, IFET College of Engineering, Viluppuram, Tamilnadu, India
²Department of Computer Science and Engineering, PG MES College of Engineering, Kuttippuram, Kerala, India
Abstract: Heart disease is a leading cause of death. One of the major heart diseases is cardiac arrhythmia. There are different types of arrhythmia, such as Tachycardia, Bradycardia, Premature Ventricular Contractions and Premature Atrial Contractions. ECG is an important clinical tool for diagnosing and monitoring heart disorders. For a single patient, ECG is taken at random intervals. A single ECG reading contains 279 attributes, and many readings taken at random intervals lead to a huge volume of data. A large database is needed to store this data; hence, it is referred to as big data. The analysis of this big data is a tedious process, so a computer-based automatic system is needed for the detection of heart abnormalities and the classification of ECG signals by analyzing the big data. It is proposed to develop an automated system for the classification of various types of cardiac arrhythmias by analyzing the ECG big data, which is a very complex process. The proposed system includes data collection, pre-processing, attribute selection, rules formation and classification. In data collection, data is collected from the repository. In pre-processing, the missing values in the dataset are replaced by mean values. The attribute selection process selects the most important attributes. In rules formation, rules are formed based on the rule weights. In the classification process, the various types of arrhythmia are classified. In real time, the proposed system will be helpful for the clinical diagnosis of cardiac arrhythmias such as Tachycardia, Bradycardia, Coronary Artery Disease (CAD), Atrial Fibrillation and Atrial Flutter.
Arrhythmia is a life-threatening heart disease. During an arrhythmia, the heart can beat too fast, too slow or with an irregular rhythm. A heartbeat that is too fast is called tachycardia. A heartbeat that is too slow is called bradycardia. Most arrhythmias are harmless, but some can be serious or even life-threatening. During an arrhythmia, the heart may not be able to pump enough blood to the body. Lack of blood flow can damage the brain, heart and other organs. The different types of arrhythmia include Premature Atrial Contractions, Premature Ventricular Contractions, Atrial Fibrillation, Atrial Flutter, Tachycardia, Bradycardia, Ischemic Changes, Left Ventricular Hypertrophy, Left and Right Bundle Branch Blocks and Myocardial Infarction. In general, arrhythmia is diagnosed by an Electrocardiogram (ECG) procedure. ECG signals are comprised of the P wave, the QRS complex and the T wave. They are designated by the capital letters P, Q, R, S and T. A typical normal ECG signal is shown in Fig. 1.
Fig. 1: Schematic representation of normal ECG

Corresponding Author: D. Stalin David, Department of Computer Science and Engineering, IFET College of Engineering, Viluppuram, Tamilnadu, India

Citation: David, D.S. and S. Namboodiri, 2020. Improvement of framework for the grouping of CA diseases by investigating big data. Artech J. Eng. Appl. Technol., 1: 7-14.

The main parameters inspected in heart patients are the shape, the duration and the mutual relationship of the P wave, the QRS complex and the T wave components, together with the R-R interval. During normal atrial depolarization, the main electrical vector is directed from the SA node towards the AV node and spreads from the right atrium to the left atrium; this appears as the P wave on the ECG. The QRS complex reflects the rapid depolarization of the right and left ventricles. The ventricles have a large muscle mass compared to the atria, so the QRS complex usually has a much larger amplitude than the P wave. The period from the QRS complex to the apex of the T wave is referred to as the absolute refractory period, and the last half of the T wave is referred to as the relative refractory period. The R-R interval is the interval between an R wave and the next R wave; the normal resting heart rate is between 60 and 100 bpm. Any change in these parameters may signify an illness of the heart. Table 1 shows the normal values of these ECG signal features, measured in milliseconds. Irregular beat phases are collectively called arrhythmia, and some arrhythmias are very dangerous for a patient.

Table 1: ECG signal features
ECG feature      Normal value (msec)
P interval       80 – 100
QRS duration     80 – 120
T interval       120 – 160
PR interval      120 – 200
QT interval      < 440
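As a small illustration of how the normal values in Table 1 could be applied, the sketch below flags measured features that fall outside the tabulated ranges and converts an R-R interval to a heart rate. The function names, the inclusive treatment of the range bounds and the sample reading are assumptions made for this example; they are not part of the proposed system.

```python
# Normal values from Table 1, in milliseconds. Treating the bounds of each
# range as inclusive is an assumption; Table 1 only lists the ranges.
NORMAL_CHECKS_MS = {
    "P interval": lambda v: 80 <= v <= 100,
    "QRS duration": lambda v: 80 <= v <= 120,
    "T interval": lambda v: 120 <= v <= 160,
    "PR interval": lambda v: 120 <= v <= 200,
    "QT interval": lambda v: v < 440,
}


def out_of_range(features_ms):
    """Return the names of measured features that fall outside Table 1."""
    return [name for name, value in features_ms.items()
            if not NORMAL_CHECKS_MS[name](value)]


def heart_rate_bpm(rr_interval_ms):
    """Convert an R-R interval in milliseconds to beats per minute."""
    return 60000.0 / rr_interval_ms


# Made-up reading for illustration only.
reading = {"QRS duration": 130, "PR interval": 160, "QT interval": 400}
print(out_of_range(reading))   # ['QRS duration']
print(heart_rate_bpm(800))     # 75.0
```

An R-R interval of 800 msec corresponds to 75 bpm, which lies inside the normal resting range of 60 to 100 bpm quoted above.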
RELATED WORK
A lot of research work has been done in the field of Big Data in biomedical applications. These works are summarized as follows.
One of the works presented support vector machine based methods for arrhythmia classification in ECG datasets with selected features. Various existing SVM methods such as One Against One (OAO), One Against All (OAA), Fuzzy Decision Function (FDF) and Decision Directed Acyclic Graph (DDAG) are used to distinguish between the presence and absence of cardiac arrhythmia and to classify them into one of the arrhythmia groups [8]. The types of arrhythmia in the cardiac arrhythmia ECG database, chosen from the University of California at Irvine (UCI) to train the SVM, include ischemic changes (coronary artery disease), old inferior myocardial infarction and others. ECG arrhythmia datasets are generally of a complex nature. The results obtained by implementing the four well-known methods are compared by their accuracy rates in percentages, and the performance of the SVM classifier using the One Against All (OAA) technique was found to be of vital importance for classification-based diagnosis of diseases pertaining to abnormal heartbeats.
Another work is based on a system with adaptive feature selection and modified Support Vector Machines (SVMs) for cardiac arrhythmia detection in ECGs (Shen et al., 2012). The candidate features enumerated are wavelet transform-based coefficients and signal amplitude/interval parameters. The proposed system with adaptive feature selection integrates k-means clustering and SVMs (Kohli and Verma, 2011). The proposed system includes the ideas of enumerating more candidate features in the early stage but screening out useless ones for each class pair in the classification stage, partitioning large-variation classes into several subclasses to boost the training performance and duplicating the training samples to balance the number of samples for each class pair.
Another related work proposes the usage of a linguistic fuzzy rule based classification system called Chi-FRBCS-BigData (Lopez et al., 2014). This method is based on the MapReduce framework, one of the most popular approaches for big data nowadays, and has been developed in two different versions: Chi-FRBCS-BigData-Max and Chi-FRBCS-BigData-Ave. The good performance of the Chi-FRBCS-BigData approach is supported by an experimental study over six big data problems. The results show that the proposal is able to provide competitive results, obtaining more precise but slower models in the Chi-FRBCS-BigData-Ave alternative and faster but less accurate classification results for Chi-FRBCS-BigData-Max.
Another work described the application of competitive neural networks with the Learning Vector Quantization (LVQ) algorithm for the classification of Electrocardiogram (ECG) signals (Melin et al., 2014). The MIT-BIH arrhythmia database with 15 classes was used for their study. For the LVQ algorithm, it is desirable that the data are mostly different from each other when they belong to different classes; on the other hand, good results are obtained when the data belonging to the same class are more similar to each other.
An early and accurate detection of arrhythmia is highly desirable for improving survival. In this connection, intelligent automated decision support systems have been attempted with varying accuracies, tested on the UCI arrhythmia database. One of the attempted tools was neural networks for classification. For better classification accuracy, various feature selection techniques have been deployed. This work attempted Correlation-based Feature Selection (CFS) with linear forward selection search; an Incremental Back Propagation Neural Network (IBPLN) and Levenberg-Marquardt (LM) were used for classification, tested on the UCI database (Mitra and Samanta, 2013).
One of the approaches concerned big data as large-volume, complex, growing data sets with multiple, autonomous sources (Wu et al., 2013). They presented a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing
model, from the data mining perspective. With the fast development of networking, data storage and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological and biomedical sciences. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling and security and privacy considerations.
Another work presented a general methodology for automatic detection of the normal, Atrial Fibrillation (AF) and Atrial Flutter (AFL) beats of ECG (Martis et al., 2014). They designed a Computer Aided Diagnosis (CAD) tool that can help physicians significantly. Four methods for feature extraction are investigated: the Principal Components (PCs) of Discrete Wavelet Transform (DWT) coefficients, the Independent Components (ICs) of DWT coefficients, the PCs of Discrete Cosine Transform (DCT) coefficients and the ICs of DCT coefficients. Three different classification techniques are explored, namely K-Nearest Neighbor (KNN), Decision Tree (DT) and Artificial Neural Networks (ANN). The methodology is tested using data from the MIT-BIH arrhythmia and atrial fibrillation databases.
Another proposed approach is an effective Electrocardiogram (ECG) arrhythmia classification scheme consisting of a feature reduction method combining Principal Component Analysis (PCA) with Linear Discriminant Analysis (LDA) and a Probabilistic Neural Network (PNN) classifier to discriminate eight different types of arrhythmia from ECG beats (Wang et al., 2013). Each ECG beat sample, composed of 200 sampling points at a 360 Hz sampling rate around an R peak, is extracted from the ECG signals. The feature reduction method is employed to find important features from the ECG beats. With the selected features, the PNN is then trained to serve as a classifier for discriminating different types of ECG beats.
PROPOSED WORK
In this section, the arrhythmia classification system and the entire work are presented in detail.
Arrhythmia classification system: In this proposed work, an automated cardiac arrhythmia classification system for the classification of various cardiac arrhythmias is developed by analyzing the clinical big data. An electrocardiogram involves recording and analyzing the electrical signals generated by the heart. ECG is an important clinical tool for diagnosing and monitoring heart disorders. The ECG signal consists of P, Q, R, S and T waves, and these signals constitute the big data.
For developing the cardiac arrhythmia classification system, the arrhythmia dataset is taken from the database of the University of California at Irvine (UCI). The dataset contains 452 instances and 279 attributes. This dataset contains missing values, and these are replaced by the mean values of the attributes. Of the 279 attributes, the focus is limited to only 6 attributes, which is done in the attribute selection phase. This data is considered big data because the number of attributes in a single reading per patient is very high. An increase in the number of records makes it bigger still, and the analysis of such big data is very complex.

Fig. 2: Arrhythmia classification system
The development of the system is divided into the following modules (Fig. 2).
• Data collection
• Data pre-processing
• Attribute selection
• Rules formation
• Classification
The first module deals with the collection of the dataset from the UCI Repository. The second module deals with the pre-processing of the dataset, where the missing values are replaced by their mean values. The next module deals with attribute selection using Principal Component Analysis, selecting only a limited number of attributes for further processing.

The rules formation module deals with the rule weight calculation and the map reduce functionality. The final module performs the classification of the various types of arrhythmias to identify the type of arrhythmia disease.
Data collection: The arrhythmia dataset is collected from the University of California at Irvine (UCI) machine learning repository. The number of outpatients in hospitals is increasing every day, and the duty doctors change according to their duty timings. A situation therefore arises where the same disease may be diagnosed and treated by different cardiologists with considerable effort. This can be avoided by collecting and integrating all the medical records into a single dataset. Such a dataset is useful for cardiologists to treat similar kinds of diseases with ease.
The dataset contains 452 instances and 279 attributes. This dataset contains missing values for many attributes. The list of 279 attributes in the dataset is given below:
1 Age
2 Sex
3 Height
4 Weight
5 QRS duration
6 PR interval
7 QT interval
8 T interval
9 P interval
10 QRS vector angle
11 T vector angle
12 P vector angle
13 QRST vector angle
14 J vector angle
15 Heart rate

Of channel DI:
16 Average width-Q wave
17 Average width-R wave
18 Average width-S wave
19 Average width-R' wave
20 Average width-S' wave
21 Number of intrinsic deflections
22 Existence of ragged R wave
23 Existence of diphasic derivation of R wave
24 Existence of ragged P wave
25 Existence of diphasic derivation of P wave
26 Existence of ragged T wave
27 Existence of diphasic derivation of T wave

Of channel DII: 28..39 (similar to 16..27 of channel DI)
Of channel DIII: 40..51
Of channel AVR: 52..63
Of channel AVL: 64..75
Of channel AVF: 76..87
Of channel V1: 88..99
Of channel V2: 100..111
Of channel V3: 112..123
Of channel V4: 124..135
Of channel V5: 136..147
Of channel V6: 148..159

Of channel DI:
160 Amplitude-JJ wave
161 Amplitude-Q wave
162 Amplitude-R wave
163 Amplitude-S wave
164 Amplitude-R' wave
165 Amplitude-S' wave
166 Amplitude-P wave
167 Amplitude-T wave
168 QRSA
169 QRSTA

Of channel DII: 170..179 (similar to 160..169 of channel DI)
Of channel DIII: 180..189
Of channel AVR: 190..199
Of channel AVL: 200..209
Of channel AVF: 210..219
Of channel V1: 220..229
Of channel V2: 230..239
Of channel V3: 240..249
Of channel V4: 250..259
Of channel V5: 260..269
Of channel V6: 270..279
In this module, the dataset, which was originally available as a data file, is converted into a text file. The text file is then processed to read the values of the attributes for the 452 records. Finally, all the data is represented in a table format.
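The reading step can be sketched as follows. The sketch assumes that the text file holds one record per line with comma-separated values and that a missing value is written as "?"; the sample rows are made up and shortened to five columns, and the function name read_records is hypothetical.

```python
import csv
import io

# Two made-up rows, shortened to five columns for illustration. The real
# records carry 279 attributes each. "?" is assumed to mark a missing value.
SAMPLE = "75,0,190,80,?\n56,1,165,?,81\n"


def read_records(text_stream):
    """Parse rows into lists of floats, using None for missing values."""
    table = []
    for row in csv.reader(text_stream):
        if not row:
            continue  # skip blank lines
        table.append([None if cell.strip() == "?" else float(cell)
                      for cell in row])
    return table


records = read_records(io.StringIO(SAMPLE))
print(records[0])  # [75.0, 0.0, 190.0, 80.0, None]
```

For a file on disk, the same function could be called on a handle opened with open(path, newline=""), where path points to a local copy of the converted text file.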
Pre-processing: The arrhythmia dataset taken from the UCI repository has missing values for many attributes. Missing values cannot be processed accurately, so they are replaced by the mean values of the corresponding attributes. The main advantage of replacing with the mean values is that it is possible to recover the values if they are lost in the subsequent processes. In this module, the mean values are computed for all the attributes and are used to replace the missing values.
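A minimal sketch of this mean replacement is given below. It assumes that the table is a list of rows and that a missing value is represented by None; both the representation and the function names are assumptions made for illustration.

```python
def column_means(table):
    """Mean of each column, ignoring missing (None) entries."""
    means = []
    for column in zip(*table):
        present = [v for v in column if v is not None]
        # A column with no observed values has no mean; keep it as None.
        means.append(sum(present) / len(present) if present else None)
    return means


def impute_with_means(table):
    """Return a copy of the table with each None replaced by its column mean."""
    means = column_means(table)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in table
    ]


# Made-up values: three records, two attributes.
table = [[1.0, None], [3.0, 10.0], [None, 20.0]]
print(impute_with_means(table))  # [[1.0, 15.0], [3.0, 10.0], [2.0, 20.0]]
```

The means are computed only from the observed values, so a replaced entry does not shift the mean of its attribute.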
Attribute selection: A total of 279 attributes are available per record in the pre-processed dataset. Considering all 279 attributes is not necessary and is time consuming. Hence, attribute selection is performed. In this process, only the required attributes are selected using Principal Component Analysis.
Principal Component Analysis (PCA) is a method of analysis that finds the linear combination of a set of variables that has maximum variance, removes its effect and repeats this successively. PCA is mainly concerned with identifying correlations in the data. Correlation measures the simultaneous change in the values of two or more variables. The correlation between a pair of variables measures to what extent their values co-vary. The covariance between a pair of variables is calculated as in Eq. 1:
$$\mathrm{COV}(X_1, X_2) = \frac{1}{m} \sum_{i=1}^{m} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2) \tag{1}$$
This module selects a list of 15 attributes from the set of 279 attributes. Of them, the most discriminant attributes are listed in the arrhythmia classification system.
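The covariance of Eq. 1 can be computed directly, as in the sketch below. It divides by the number of records m, following Eq. 1, and builds the pairwise covariance matrix on which PCA operates; the eigenvector step of PCA and the way attributes are finally chosen are not shown. The attribute values are made up and the function names are hypothetical.

```python
def covariance(x1, x2):
    """Covariance of two equal-length attribute columns, as in Eq. 1."""
    m = len(x1)
    mean1 = sum(x1) / m
    mean2 = sum(x2) / m
    return sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / m


def covariance_matrix(columns):
    """Pairwise covariances of a list of attribute columns."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]


# Made-up values for two attributes over four records.
heart_rate = [60.0, 70.0, 80.0, 90.0]
qrs_duration = [100.0, 90.0, 95.0, 85.0]
print(covariance(heart_rate, heart_rate))    # 125.0
print(covariance(heart_rate, qrs_duration))  # -50.0
```

The covariance of an attribute with itself is its variance, and a negative value such as the second result means that the two attributes tend to move in opposite directions.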
Rules formation: This module deals with the formation of rules, based on the rule weights for each and every
instance. The entire dataset is divided into individual map files. Using the class information, the rule weight for the attributes is estimated.
Rule weights are calculated for the instances in each map in parallel. Rules are formed, and each rule is associated with a rule weight. Likewise, rules are formed for all the maps and finally combined. After that, the rules for the same class are merged by calculating the average values of the attributes. The distinct final set of rules is used for classification. The formula for the rule weight calculation is given in Eq. 2:
$$RW_j = \frac{\sum_{x_p \in c_j} \mu_{A_j}(x_p) - \sum_{x_p \notin c_j} \mu_{A_j}(x_p)}{\sum_{p=1}^{m} \mu_{A_j}(x_p)} \tag{2}$$
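The sketch below evaluates a rule weight for a single rule. It reads Eq. 2 as the total membership degree of the training examples of the rule's own class, minus that of the examples of all other classes, divided by the total membership degree over all m examples. This reading, the function name and the sample values are assumptions made for illustration; how the membership degrees themselves are obtained is not covered here.

```python
def rule_weight(memberships, labels, rule_class):
    """Rule weight of Eq. 2 for one rule.

    memberships[p] is the membership degree of training example p in the
    rule's antecedent; labels[p] is the class of that example.
    """
    same = sum(mu for mu, c in zip(memberships, labels) if c == rule_class)
    other = sum(mu for mu, c in zip(memberships, labels) if c != rule_class)
    total = same + other
    if total == 0:
        return 0.0  # no example matches the antecedent at all
    return (same - other) / total


# Made-up membership degrees and class labels for four training examples.
memberships = [0.75, 0.5, 0.25, 0.5]
labels = [1, 1, 2, 1]
print(rule_weight(memberships, labels, rule_class=1))  # 0.75
```

Under this reading, a rule matched only by examples of its own class gets a weight of 1, and the weight falls as examples of other classes match the same antecedent.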
Classification: The rules formed in the previous phase are used in this module for the classification process. The rules are formed using the training dataset and are tested using the instances from the test dataset. When a test dataset is given, the system first identifies the rule to which each instance belongs. After that, the class is identified and assigned to the instance from the test dataset. The system repeats this classification process for all the instances in the test dataset.
The classification is done using the decision tree classifier. Decision tree classifiers are tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. The decision tree classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem.
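To illustrate how the decisions in a tree translate into classification rules, the sketch below uses a small hand-written tree. The tree, its single attribute and its thresholds are hypothetical; the thresholds simply mirror the normal resting heart rate of 60 to 100 bpm, and the tree is not the one learned by the proposed system.

```python
# A hand-written toy tree for illustration only.
# Internal node: (attribute, threshold, subtree_if_<=, subtree_if_>)
# Leaf: a class label string.
TREE = (
    "heart_rate", 100,
    ("heart_rate", 59, "bradycardia", "normal"),
    "tachycardia",
)


def classify(tree, instance):
    """Walk from the root to a leaf, following one decision per node."""
    while not isinstance(tree, str):
        attribute, threshold, left, right = tree
        tree = left if instance[attribute] <= threshold else right
    return tree


def extract_rules(tree, conditions=()):
    """List every root-to-leaf path as an (IF-conditions, THEN-class) rule."""
    if isinstance(tree, str):
        return [(list(conditions), tree)]
    attribute, threshold, left, right = tree
    return (
        extract_rules(left, conditions + (f"{attribute} <= {threshold}",))
        + extract_rules(right, conditions + (f"{attribute} > {threshold}",))
    )


print(classify(TREE, {"heart_rate": 120}))  # tachycardia
for conditions, label in extract_rules(TREE):
    print("IF", " AND ".join(conditions), "THEN", label)
```

Each root-to-leaf path yields one rule, so this toy tree with three leaves produces three rules.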
EXPERIMENTS AND EVALUATION
This section presents the experimental results of the proposed work in detail.
Experiments: The dataset collected from the UCI machine learning repository contains a total of 452 records. During data collection, the raw data is converted into a text file. The entire dataset is considered for pre-processing, and the missing values in the dataset are replaced by the mean values computed by the averaging technique. This pre-processed dataset contains all 279 attributes, which is difficult to process. Hence, attribute selection is done using the principal component analysis technique. On experimenting with principal component analysis, only 16 attributes, including the class label, are selected from the entire 279 attributes.
Rules are formed using the rule weight calculation and, for the 13 classes that have instances, 13 individual rules are generated in total. According to the rules, the classification is done using the decision tree classifier.
Experimental results: The raw data collected from the repository is in data file format, which is difficult to process. Hence, it is converted into text file format for processing. Figures 3-10 show the output of each module.
The experimental results show that the system works much faster with the map and reduce concept than without it.
Fig. 3: The original Big Data
Fig. 4: Output of data collection
Fig. 5: Output of pre-processing
Fig. 6: Output of attribute selection
Fig. 7: Splitting into maps
Fig. 8: Rule weight calculation and rules formation for individual maps
Fig. 9: Overall rules formation
Fig. 10: Classification of test dataset

When analyzing big data, the major concerns are the size of the files and the speed of processing. When the map reduce framework is applied, the speed of execution increases to a great extent because of its parallel execution, and the file size can be limited by splitting the data into as many maps as the user wants.
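The splitting and combining steps can be sketched on a single machine as follows. The sketch forms one rule per class by averaging the attribute values of that class: each map computes partial sums for its chunk, and the reduce step merges them. The thread pool only stands in for the parallel execution of a map reduce framework, and the function names and sample records are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_maps(records, n_maps):
    """Split the records into n_maps roughly equal chunks."""
    return [records[i::n_maps] for i in range(n_maps)]


def map_chunk(chunk):
    """Map step: per class, sum the attribute vectors and count the records."""
    partial = {}
    for attributes, label in chunk:
        sums, count = partial.get(label, ([0.0] * len(attributes), 0))
        partial[label] = ([s + a for s, a in zip(sums, attributes)], count + 1)
    return partial


def reduce_partials(partials):
    """Reduce step: merge the per-chunk sums and average them per class."""
    merged = {}
    for partial in partials:
        for label, (sums, count) in partial.items():
            old_sums, old_count = merged.get(label, ([0.0] * len(sums), 0))
            merged[label] = ([o + s for o, s in zip(old_sums, sums)],
                             old_count + count)
    return {label: [s / count for s in sums]
            for label, (sums, count) in merged.items()}


# Made-up records: (attribute vector, class label).
records = [([1.0, 2.0], "a"), ([3.0, 4.0], "a"),
           ([10.0, 20.0], "b"), ([5.0, 6.0], "a")]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_chunk, split_into_maps(records, n_maps=2)))
print(reduce_partials(partials))  # {'a': [3.0, 4.0], 'b': [10.0, 20.0]}
```

Because each map returns sums and counts rather than averages, the reduce step gives the same per-class averages regardless of how many maps the data is split into.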
CONCLUSION
An efficient methodology for analyzing big data was proposed, based on rule weight processing and map reduce classification. The system involves the main processing techniques of pre-processing, selection of attributes, formation of rules and classification. The classification of the test data is effectively handled using the rule-based map reduce classification methodology.
REFERENCES
Kohli, N. and N.K. Verma, 2011. Arrhythmia classification using SVM with selected features. Int. J. Eng. Sci. Technol., 3: 122-131.
Lopez, V., S. del Rio, J.M. Benitez and F. Herrera, 2014. On the use of MapReduce to build linguistic fuzzy rule based classification systems for big data. Proceedings of the 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, pp: 1905-1912.
Martis, R.J., U.R. Acharya, H. Adeli, H. Prasad, J.H. Tan, K.C. Chua ... and L. Tong, 2014. Computer aided diagnosis of atrial arrhythmia using dimensionality reduction methods on transform domain representation. Biomed. Signal Process. Control, 13: 295-305.
Melin, P., J. Amezcua, F. Valdez and O. Castillo, 2014. A new neural network model based on the LVQ algorithm for multi-class classification of arrhythmias. Inf. Sci., 279: 483-497.
Mitra, M. and R.K. Samanta, 2013. Cardiac arrhythmia classification using neural networks with selected features. Procedia Technol., 10: 76-84.
Shen, C.P., W.C. Kao, Y.Y. Yang, M.C. Hsu, Y.T. Wu and F. Lai, 2012. Detection of cardiac arrhythmia in electrocardiograms using adaptive feature extraction and modified support vector machines. Expert Syst. Appl., 39: 7845-7852.
Wang, J.S., W.C. Chiang, Y.L. Hsu and Y.T.C. Yang, 2013. ECG arrhythmia classification using a probabilistic neural network with a feature reduction method. Neurocomputing, 116: 38-45.
Wu, X., X. Zhu, G.Q. Wu and W. Ding, 2013. Data mining with big data. IEEE Trans. Knowl. Data Eng., 26: 97-107.