Artech Journal of Engineering and Applied Technology ...

Artech Journal of Engineering and Applied Technology (AJEAT), Volume 1, Issue 1, 2020, Pages: 7-14, ISSN: 2523-5478

Improvement of Framework for the Grouping of CA Diseases by Investigating Big Data

1D. Stalin David and 2S. Namboodiri

1Department of Computer Science and Engineering, IFET College of Engineering, Viluppuram, Tamilnadu, India

2Department of Computer Science and Engineering, PG MES College of Engineering, Kuttippuram, Kerala, India

Abstract: Heart disease is a leading cause of death. One of the major heart diseases is cardiac arrhythmia. There are different types of arrhythmia, such as Tachycardia, Bradycardia, Premature Ventricular Contractions and Premature Atrial Contractions. ECG is an important clinical tool for diagnosing and monitoring heart disorders. For a single patient, ECG is taken at random intervals. A single ECG reading contains 279 attributes, and many readings taken at random intervals lead to a huge volume of data. A large database is needed to store this data, which is therefore referred to as big data. The analysis of this big data is a tedious process, so a computer-based automatic system is needed for the detection of heart abnormalities and the classification of ECG signals by analyzing the big data. It is proposed to develop an automated system for the classification of various types of cardiac arrhythmias by analyzing the ECG big data, which is a very complex process. The proposed system includes data collection, pre-processing, attribute selection, rules formation and classification. In data collection, data is collected from the repository. In pre-processing, the missing values in the dataset are replaced by mean values. The attribute selection process selects the most important attributes. In rules formation, rules are formed based on the rule weights. In the classification process, the various types of arrhythmia are classified. In real time, the proposed system will be helpful for the clinical diagnosis of cardiac arrhythmias such as Tachycardia, Bradycardia, Coronary Artery Disease (CAD), Atrial Fibrillation and Atrial Flutter.

Key words: Big data, electrocardiogram, arrhythmia, tachycardia, bradycardia, atrial fibrillation, flutter

INTRODUCTION

Arrhythmia can be a life-threatening heart disease. During an arrhythmia, the heart can beat too fast, too slow or with an irregular rhythm. A heartbeat that is too fast is called tachycardia. A heartbeat that is too slow is called bradycardia. Most arrhythmias are harmless, but some can be serious or even life threatening. During an arrhythmia, the heart may not be able to pump enough blood to the body. Lack of blood flow can damage the brain, heart and other organs. Different types of arrhythmia include Premature Atrial Contractions, Premature Ventricular Contractions, Atrial Fibrillation, Atrial Flutter, Tachycardia, Bradycardia, Ischemic Changes, Left Ventricular Hypertrophy, Left and Right Bundle Branch Blocks and Myocardial Infarction. In general, arrhythmia is diagnosed by an Electrocardiogram (ECG) procedure. ECG signals comprise the P wave, the QRS complex and the T wave. They are designated by the capital letters P, Q, R, S and T. A typical normal ECG signal is shown in Fig. 1.

Fig. 1: Schematic representation of normal ECG

Corresponding Author: D. Stalin David, Department of Computer Science and Engineering, IFET College of Engineering, Viluppuram, Tamilnadu, India

Citation: David, D.S. and S. Namboodiri, 2020. Improvement of framework for the grouping of CA diseases by investigating big data. Artech J. Eng. Appl. Technol., 1: 7-14.

© Artech Journals

The main parameters inspected in heart patients are the shape, the duration and the relationship with each other of the P wave, the QRS complex and the T wave components, and also the R-R interval. During normal atrial depolarization, the main electrical vector is directed from the SA node towards the AV node and spreads from the right atrium to the left atrium; this appears as the P wave on the ECG. The QRS complex reflects the rapid depolarization of the right and left ventricles. The ventricles have a large muscle mass compared to the atria, so the QRS complex usually has a much larger amplitude than the P wave. The interval from the QRS complex to the apex of the T wave is referred to as the absolute refractory period, and the last half of the T wave is referred to as the relative refractory period. The R-R interval is the interval between an R wave and the next R wave; the normal resting heart rate is between 60 and 100 bpm. Any change in these parameters signifies an illness of the heart. Table 1 shows the normal values of these ECG signal features measured in milliseconds. Irregular beat phases are collectively called arrhythmia, and some arrhythmias are very dangerous for a patient.

Table 1: ECG signal features
ECG features     Normal values (msec)
P interval       80-100
QRS duration     80-120
T interval       120-160
PR interval      120-200
QT interval      < 440
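The ranges in Table 1 and the 60 to 100 bpm resting heart rate lend themselves to a simple range check. The following is a minimal sketch of such a check, not part of the proposed system; the function names and the dictionary layout are illustrative assumptions, and the tachycardia/bradycardia labels simply apply the "too fast" and "too slow" definitions from the introduction to the stated normal range.

```python
# Normal ranges from Table 1, in milliseconds (low, high).
NORMAL_RANGES_MS = {
    "P interval": (80, 100),
    "QRS duration": (80, 120),
    "T interval": (120, 160),
    "PR interval": (120, 200),
    "QT interval": (0, 440),  # Table 1 gives only an upper bound (< 440)
}


def out_of_range_features(measurements_ms):
    """Return the names of measured features that fall outside Table 1."""
    flagged = []
    for name, value in measurements_ms.items():
        low, high = NORMAL_RANGES_MS[name]
        if name == "QT interval":
            ok = value < high  # strict upper bound, as in Table 1
        else:
            ok = low <= value <= high
        if not ok:
            flagged.append(name)
    return flagged


def heart_rate_bpm(rr_interval_ms):
    """Convert an R-R interval in milliseconds to beats per minute."""
    return 60000.0 / rr_interval_ms


def rate_label(bpm):
    """Label a resting heart rate against the 60-100 bpm normal range."""
    if bpm > 100:
        return "tachycardia"
    if bpm < 60:
        return "bradycardia"
    return "normal"
```

For example, an R-R interval of 500 msec corresponds to 120 bpm, which this sketch labels as tachycardia.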

RELATED WORK

A lot of research has been done on Big Data in biomedical applications. These works are summarized as follows.

One of the works presented support vector machine based methods for arrhythmia classification in ECG datasets with selected features. Various existing SVM methods such as One Against One (OAO), One Against All (OAA), Fuzzy Decision Function (FDF) and Decision Directed Acyclic Graph (DDAG) are used to distinguish between the presence and absence of cardiac arrhythmia and to classify cases into one of the arrhythmia groups [8]. The types of arrhythmia in the cardiac arrhythmia ECG database chosen from the University of California at Irvine (UCI) to train the SVM include ischemic changes (coronary artery disease), old inferior myocardial infarction and others. ECG arrhythmia datasets are generally of a complex nature. The results obtained through implementation of the four well-known methods are compared by their accuracy rate in percentages, and the performance of the SVM classifier using the One Against All (OAA) technique was found to be of vital importance for classification-based diagnosis of diseases pertaining to abnormal heart beats.

Another work is based on a system with adaptive feature selection and modified Support Vector Machines (SVMs) for cardiac arrhythmia detection in ECGs (Shen et al., 2012). The candidate features enumerated are wavelet transform-based coefficients and signal amplitude/interval parameters. The proposed system with adaptive feature selection integrates k-means clustering and SVMs (Kohli and Verma, 2011). The proposed system includes the ideas of enumerating more candidate features in the early stage but screening out useless ones for each class pair in the classification stage, partitioning large-variation classes into several subclasses to boost the training performance, and duplicating the training samples to balance the number of samples for each class pair.

Another related work proposes the use of a linguistic fuzzy rule based classification system called Chi-FRBCS-BigData (Lopez et al., 2014). This method is based on the MapReduce framework, one of the most popular approaches for big data nowadays, and has been developed in two different versions: Chi-FRBCS-BigData-Max and Chi-FRBCS-BigData-Ave. The good performance of the Chi-FRBCS-BigData approach is supported by an experimental study over six big data problems. The results show that the proposal is able to provide competitive results, obtaining more precise but slower models in the Chi-FRBCS-BigData-Ave alternative and faster but less accurate classification results for Chi-FRBCS-BigData-Max.

Another work described the application of competitive neural networks with the Learning Vector Quantization (LVQ) algorithm for the classification of Electrocardiogram (ECG) signals (Melin et al., 2014). The MIT-BIH arrhythmia database with 15 classes was used for the study. For the LVQ algorithm, it is desirable that data belonging to different classes are mostly different from each other; on the other hand, good results are obtained when data belonging to the same class are more similar to each other.

Early and accurate detection of arrhythmia is highly desirable for improving survival. In this connection, intelligent automated decision support systems have been attempted with varying accuracies, tested on the UCI arrhythmia database. One of the tools attempted was neural networks for classification. For better classification accuracy, various feature selection techniques have been deployed. This work attempted Correlation-based Feature Selection (CFS) with linear forward selection search, and Incremental Back Propagation Neural Network (IBPLN) and Levenberg-Marquardt (LM) were used for classification, tested on the UCI database (Mitra and Samanta, 2013).

One approach characterizes big data as large-volume, complex, growing data sets with multiple, autonomous sources (Wu et al., 2013). The authors present a HACE theorem that characterizes the features of the Big Data revolution and propose a Big Data processing model from the data mining perspective. With the fast development of networking, data storage and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling and security and privacy considerations.

Another work presented a general methodology for automatic detection of the normal, Atrial Fibrillation (AF) and Atrial Flutter (AFL) beats of ECG (Martis et al., 2014). The authors designed a Computer Aided Diagnosis (CAD) tool that can help physicians significantly. Four methods for feature extraction are investigated: the Principal Components (PCs) of Discrete Wavelet Transform (DWT) coefficients, the Independent Components (ICs) of DWT coefficients, the PCs of Discrete Cosine Transform (DCT) coefficients and the ICs of DCT coefficients. Three different classification techniques are explored, namely K-Nearest Neighbor (KNN), Decision Tree (DT) and Artificial Neural Networks (ANN). The methodology is tested using data from the MIT-BIH arrhythmia and atrial fibrillation databases.

Another proposed approach is an effective Electrocardiogram (ECG) arrhythmia classification scheme consisting of a feature reduction method combining Principal Component Analysis (PCA) with Linear Discriminant Analysis (LDA) and a Probabilistic Neural Network (PNN) classifier to discriminate eight different types of arrhythmia from ECG beats (Wang et al., 2013). Each ECG beat sample, composed of 200 sampling points at a 360 Hz sampling rate around an R peak, is extracted from the ECG signals. The feature reduction method is employed to find important features from the ECG beats. With the selected features, the PNN is then trained to serve as a classifier for discriminating different types of ECG beats.
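To make the beat extraction concrete: 200 sampling points at 360 Hz cover roughly 556 msec of signal. The sketch below illustrates cutting such a window around an R peak. It assumes the window is centred on the peak (100 points before, 100 from the peak onward), which the description above does not specify, and the function names are hypothetical.

```python
SAMPLING_RATE_HZ = 360   # sampling rate quoted above
SAMPLES_PER_BEAT = 200   # sampling points per beat quoted above


def window_duration_ms(n_samples=SAMPLES_PER_BEAT, rate_hz=SAMPLING_RATE_HZ):
    """Duration covered by n_samples points at rate_hz, in milliseconds."""
    return 1000.0 * n_samples / rate_hz


def beat_window(signal, r_peak_index, n_samples=SAMPLES_PER_BEAT):
    """Return n_samples points around r_peak_index, or None near the edges.

    Assumption: the window is centred on the R peak.
    """
    half = n_samples // 2
    start = r_peak_index - half
    end = start + n_samples
    if start < 0 or end > len(signal):
        return None  # not enough signal on one side of the peak
    return signal[start:end]
```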

PROPOSED WORK

In this section, the arrhythmia classification system and the entire work are presented in detail.

Arrhythmia classification system: In this proposed work, an automated cardiac arrhythmia classification system for the classification of various cardiac arrhythmias is developed by analyzing the clinical big data. Electrocardiography involves recording and analyzing the electrical signals generated by the heart. ECG is an important clinical tool for diagnosing and monitoring heart disorders. The ECG signal consists of P, Q, R, S and T waves, and these signals constitute the big data.

For developing the cardiac arrhythmia classification system, the arrhythmia dataset is taken from the database of the University of California at Irvine (UCI). The dataset contains 452 instances and 279 attributes. This dataset contains missing values, which are replaced by the mean values of the attributes. Of the 279 attributes, the focus is limited to only 6 attributes, which is done in the attribute selection phase. This data is considered big data since the number of attributes involved is very high for a single record per patient. An increase in the number of records makes it bigger still, and the analysis of such big data is very complex.

Fig. 2: Arrhythmia classification system

The development of the system is divided into the following modules (Fig. 2).

• Data collection
• Data pre-processing
• Attribute selection
• Rules formation
• Classification

The first module deals with the collection of the dataset from the UCI Repository. The second module deals with the pre-processing of the dataset, where the missing values are replaced by their mean values. The next module deals with attribute selection using Principal Component Analysis, selecting only a limited number of attributes for further processing.

The next module, rules formation, deals with the rule weight calculation and the map reduce functionality. The final module performs the classification of the various types of arrhythmia to identify the type of arrhythmia disease.

Data collection: The arrhythmia dataset is collected from the University of California at Irvine (UCI) machine learning repository. The number of outpatients in hospitals is increasing every day. The duty doctors also change according to their duty timings. As a result, the same disease may be diagnosed and treated by different cardiologists, each with considerable effort. This can be avoided by collecting and integrating all the medical records into a single dataset. Such a dataset helps cardiologists treat similar kinds of diseases with ease.



The dataset contains 452 instances and 279 attributes. This dataset contains missing values for many attributes. The list of 279 attributes in the dataset is given below:

1 Age
2 Sex
3 Height
4 Weight
5 QRS duration
6 PR interval
7 QT interval
8 T interval
9 P interval
10 QRS vector angle
11 T vector angle
12 P vector angle
13 QRST vector angle
14 J vector angle
15 Heart rate

Of channel DI:
16 Average width-Q wave
17 Average width-R wave
18 Average width-S wave
19 Average width-R' wave
20 Average width-S' wave
21 Number of intrinsic deflections
22 Existence of ragged R wave
23 Existence of diphasic derivation of R wave
24 Existence of ragged P wave
25 Existence of diphasic derivation of P wave
26 Existence of ragged T wave
27 Existence of diphasic derivation of T wave

Of channel DII: 28..39 (similar to 16..27 of channel DI)
Of channel DIII: 40..51
Of channel AVR: 52..63
Of channel AVL: 64..75
Of channel AVF: 76..87
Of channel V1: 88..99






Of channel DI:
160 Amplitude-JJ wave
161 Amplitude-Q wave
162 Amplitude-R wave
163 Amplitude-S wave
164 Amplitude-R' wave
165 Amplitude-S' wave
166 Amplitude-P wave
167 Amplitude-T wave
168 QRSA
169 QRSTA

Of channel DII: 170..179
Of channel DIII: 180..189
Of channel AVR: 190..199







In this module, the dataset, which was originally available as a data file, is converted into a text file. The text file is then processed to read the attribute values for the 452 records. Finally, all the data is represented in table format.
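The reading step can be sketched as follows. This is a minimal illustration rather than the module's actual code: it assumes the text file holds one comma-separated record per line and that a missing entry is written as "?"; neither detail is stated above, and the function name is hypothetical.

```python
import csv
import io

MISSING = "?"  # assumed marker for a missing attribute value


def parse_records(text):
    """Parse comma-separated records into rows of floats.

    Missing entries become None so that a later pre-processing step
    can recognise and replace them.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        rows.append(
            [None if f.strip() == MISSING else float(f) for f in fields]
        )
    return rows
```

Each returned row is one record of the table; for the full dataset, 452 such rows would be expected.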

Pre-processing: The arrhythmia dataset taken from the UCI repository has missing values for many attributes. Missing values cannot be processed accurately. Hence, the missing values are replaced by the mean values of the attributes. The main advantage of replacing with the mean values is that the values can be recovered if they are lost in subsequent processes. In this module, the mean values are computed for all the attributes and are used to replace the missing values.
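Mean replacement as described can be sketched in a few lines. The representation of missing values as None and the function names are assumptions made for illustration; the fallback of 0.0 for an attribute with no observed values at all is likewise an assumption, since that case is not discussed above.

```python
def column_means(rows):
    """Mean of each attribute, computed over the non-missing values only."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        # Assumed fallback: an attribute with no observed value gets 0.0.
        means.append(sum(present) / len(present) if present else 0.0)
    return means


def impute_with_means(rows):
    """Return a copy of rows with every None replaced by its column mean."""
    means = column_means(rows)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]
```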

Attribute selection: A total of 279 attributes are available per record in the pre-processed dataset. Considering all 279 attributes is unnecessary and time consuming. Hence, attribute selection is performed. In this process, only the required attributes are selected using Principal Component Analysis.

Principal Component Analysis (PCA) is a method of analysis that finds the linear combination of a set of variables that has maximum variance, removes its effect and repeats this successively. PCA is mainly concerned with identifying correlations in the data. Correlation measures the simultaneous change in the values of two or more variables. Correlation between a pair of variables measures to what extent their values co-vary. The covariance between a pair of variables is calculated as in Eq. 1:

$$\mathrm{COV}(X_1, X_2) = \frac{\sum_{i=1}^{m} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{m} \qquad (1)$$

This module selects a list of 15 attributes from the set of 279 attributes. Of them, the most discriminant attributes are listed in the arrhythmia classification system.
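The covariance of Eq. 1 (the sum of products of deviations from the mean, divided by the number of records m) and a PCA-based ranking of attributes can be sketched as below. How the PCA output is mapped to a list of selected attributes is not specified above, so ranking attributes by the absolute size of their loading on the first principal component is an assumption, as are the function names. The first component is approximated by power iteration to keep the sketch free of external libraries.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x1, x2):
    """Eq. 1: sum of products of deviations from the means, divided by m."""
    m = len(x1)
    m1, m2 = mean(x1), mean(x2)
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / m


def covariance_matrix(columns):
    """Covariance of every pair of attribute columns."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]


def first_principal_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all attributes constant: no direction of variance
        v = [x / norm for x in w]
    return v


def rank_attributes(columns, k):
    """Indices of the k attributes with the largest absolute loading
    on the first principal component (assumed selection criterion)."""
    loadings = first_principal_component(covariance_matrix(columns))
    order = sorted(
        range(len(columns)), key=lambda i: abs(loadings[i]), reverse=True
    )
    return order[:k]
```

Here `columns` holds one list of values per attribute. Because covariance depends on scale, attributes measured in large units dominate the ranking unless the columns are standardised first.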

Rules formation: This module deals with the formation of rules, based on the rule weights for each and every instance. The entire dataset is divided into individual map files. Using the class information, the rule weight for the attributes is estimated.

Rule weights are calculated for the instances in each map in parallel. Rules are formed, and each rule is associated with a rule weight. Rules are formed in this way for all the maps and are finally combined. After that, the rules for the same class are merged by calculating the average values of the attributes. The distinct final set of rules is used for classification. The formula for the rule weight calculation is given in Eq. 2:

$$RW_j = \frac{\sum_{x_p \in c_j} \mu_{A_j}(x_p) - \sum_{x_p \notin c_j} \mu_{A_j}(x_p)}{\sum_{p=1}^{m} \mu_{A_j}(x_p)} \qquad (2)$$
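As an illustration of the rule weight, the sketch below reads Eq. 2 as a penalized certainty factor: the membership mass of the instances in the rule's class, minus the membership mass of the instances in other classes, divided by the total membership mass. That reading, the argument layout and the function name are assumptions; how the membership degrees themselves are obtained is outside the scope of the sketch.

```python
def rule_weight(memberships, labels, rule_class):
    """Rule weight of one rule under the assumed reading of Eq. 2.

    memberships[p] is the membership degree of instance p in the rule's
    antecedent; labels[p] is the class of instance p.
    """
    same = sum(mu for mu, c in zip(memberships, labels) if c == rule_class)
    other = sum(mu for mu, c in zip(memberships, labels) if c != rule_class)
    total = same + other
    if total == 0.0:
        return 0.0  # the rule covers no instance at all
    return (same - other) / total
```

Under this reading, a rule covering only instances of its own class has weight 1, and the weight drops as instances of other classes match the rule.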

Classification: The rules formed in the previous phase are used in this module for the classification process. The rules are formed using the training dataset and are tested using the instances from the test dataset. When a test dataset is given, the system first identifies the rule to which each instance belongs. After that, the class is identified and assigned to the instance from the test dataset. The system repeats this classification process for all the instances in the test dataset.
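The matching of a test instance to a rule can be sketched as follows. The text does not say how the matching rule is identified, so this sketch makes two assumptions: each class rule is represented by a vector of average attribute values, and an instance is assigned to the rule at the smallest Euclidean distance. The function names are hypothetical.

```python
import math


def classify(instance, rules):
    """Return the class of the rule closest to the instance.

    rules maps a class label to that class's attribute vector; closeness
    is Euclidean distance (an assumed matching criterion).
    """
    best_class, best_distance = None, math.inf
    for rule_class, centre in rules.items():
        distance = math.dist(instance, centre)
        if distance < best_distance:
            best_class, best_distance = rule_class, distance
    return best_class


def classify_all(test_rows, rules):
    """Repeat the classification for every instance in the test dataset."""
    return [classify(row, rules) for row in test_rows]
```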

The classification is done using the decision tree classifier. Decision tree classifiers are tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. The decision tree classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem.

EXPERIMENTS AND EVALUATION

This section presents the experimental results of the proposed work in detail.

Experiments: The dataset collected from the UCI machine learning repository contains a total of 452 records. During data collection, the raw data is converted into a text file. The entire dataset is considered for pre-processing, and the missing values in the dataset are replaced by the mean values of the attributes. This pre-processed dataset contains all 279 attributes, which makes it difficult to process. Hence, attribute selection is done using principal component analysis. On experimenting with principal component analysis, only 16 attributes, including the class label, are selected from the entire 279 attributes.

Rules are formed using the rule weight calculation, and for the 13 classes that have instances, a total of 13 individual rules are generated. According to the rules, the classification is done using the decision tree classifier.

Experimental results: The raw data collected from the repository is in data file format, which is difficult to process. Hence, it is converted into text file format for processing. Figures 3-10 show the output of each module.

The experimental results show that the system works much faster when used with the map and reduce concept than without it.

Fig. 3: The original Big Data

Fig. 4: Output of data collection

Fig. 5: Output of pre-processing

Fig. 6: Output of attribute selection

Fig. 7: Splitting into maps

Fig. 8: Rule weight calculation and rules formation for individual maps

Fig. 9: Overall rules formation

Fig. 10: Classification of test dataset

When analyzing big data, the major concerns are the size of the files and the speed of processing. When the map reduce framework is applied, the speed of execution increases to a great extent because of its parallel execution, and the file size can be limited by splitting the data into as many maps as the user wants.
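The splitting into maps and the combination of per-map rules can be sketched in plain Python. This is an illustration of the structure only, not the system's implementation: it assumes each record is a list of attribute values followed by its class label, represents a rule as the per-class average attribute vector, and uses a thread pool in place of a real MapReduce framework (threads show the map/reduce shape but do not by themselves speed up CPU-bound Python code). All function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_maps(rows, n_maps):
    """Split rows into at most n_maps consecutive chunks of similar size."""
    size = max(1, -(-len(rows) // n_maps))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_rules(chunk):
    """Map step: per-class average attribute vector within one chunk."""
    sums, counts = {}, defaultdict(int)
    for *attributes, label in chunk:
        if label not in sums:
            sums[label] = [0.0] * len(attributes)
        sums[label] = [s + a for s, a in zip(sums[label], attributes)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def reduce_rules(per_map_rules):
    """Reduce step: average the per-map rules that share a class."""
    grouped = defaultdict(list)
    for rules in per_map_rules:
        for rule_class, vector in rules.items():
            grouped[rule_class].append(vector)
    return {
        c: [sum(col) / len(col) for col in zip(*vectors)]
        for c, vectors in grouped.items()
    }


def build_rules(rows, n_maps):
    """Run the map step over all chunks, then combine the results."""
    chunks = split_into_maps(rows, n_maps)
    with ThreadPoolExecutor() as pool:
        per_map = list(pool.map(map_rules, chunks))
    return reduce_rules(per_map)
```

Note that averaging per-map averages is not identical to averaging over the whole dataset when a class is unevenly spread across the maps; the result depends on how the data is split.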

CONCLUSION

An efficient methodology for analyzing big data was proposed based on rule weight processing and map reduce classification. The system involves the main processing techniques of pre-processing, selection of attributes, formation of rules and classification. The classification of the test data is handled effectively using the rule-based map reduce classification methodology.

REFERENCES

Kohli, N. and N.K. Verma, 2011. Arrhythmia classification using SVM with selected features. Int. J. Eng., Sci. Technol., 3: 122-131.

Lopez, V., S. del Rio, J.M. Benitez and F. Herrera, 2014. On the use of MapReduce to build linguistic fuzzy rule based classification systems for big data. In 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 1905-1912). IEEE.

Martis, R.J., U.R. Acharya, H. Adeli, H. Prasad, J.H. Tan, K.C. Chua ... and L. Tong, 2014. Computer aided diagnosis of atrial arrhythmia using dimensionality reduction methods on transform domain representation. Biomed. Signal Process. Control, 13: 295-305.

Melin, P., J. Amezcua, F. Valdez and O. Castillo, 2014. A new neural network model based on the LVQ algorithm for multi-class classification of arrhythmias. Inf. Sci., 279: 483-497.

Mitra, M. and R.K. Samanta, 2013. Cardiac arrhythmia classification using neural networks with selected features. Procedia Technol., 10: 76-84.

Shen, C.P., W.C. Kao, Y.Y. Yang, M.C. Hsu, Y.T. Wu and F. Lai, 2012. Detection of cardiac arrhythmia in electrocardiograms using adaptive feature extraction and modified support vector machines. Expert Syst. Appl., 39: 7845-7852.

Wang, J.S., W.C. Chiang, Y.L. Hsu and Y.T.C. Yang, 2013. ECG arrhythmia classification using a probabilistic neural network with a feature reduction method. Neurocomputing, 116: 38-45.

Wu, X., X. Zhu, G.Q. Wu and W. Ding, 2013. Data mining with big data. IEEE Trans. Knowl. Data Eng., 26: 97-107.
