Analyzing Predictive Algorithms in Data Mining for Cardiovascular Disease using WEKA Tool

Cardiovascular Disease (CVD) is the foremost cause of death worldwide and generates a high percentage of Electronic Health Records (EHRs). Analyzing the complex patterns in these EHRs is a tedious process. To address this problem, medical institutions require effective predictive algorithms for the prognosis and diagnosis of patients. In this work, the current state-of-the-art was studied to identify the leading predictive algorithms. These algorithms, namely Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), Artificial Neural Network (ANN), Logistic Regression (LR), AdaBoost and k-Nearest Neighbors (k-NN), were then analyzed against two datasets using the open-source WEKA software. This work used two similarly structured datasets, i.e., the Statlog dataset and the Cleveland dataset. During pre-processing, missing values were replaced with the mean value, and 10-fold cross-validation was then used for the evaluation. The performance analysis showed that SVM outperforms the other algorithms on both datasets in terms of accuracy. SVM showed an accuracy of 84.156% against the Cleveland dataset and 84.074% against the Statlog dataset. LR showed a ROC Area of 0.9 against both datasets. The findings of this work will help health institutions to understand the importance and usage of predictive algorithms for the automatic prediction of CVD based on symptoms.

Keywords—Logistic regression (LR); support vector machine (SVM); Statlog; Cleveland; WEKA


I. INTRODUCTION
The heart is a vital organ that circulates oxygen-rich blood through the coronary arteries. When these arteries become blocked, the condition is termed CVD. The major risk factors mostly relate to the patient's lifestyle (e.g., eating behaviour, obesity, smoking, alcohol, and physical inactivity). The Global Burden of Disease study (2019) reported that nearly a quarter of all deaths in India are due to CVD [1]. The World Health Organization (2019) estimated that, on average, 17 million people die from CVD every year [2]. The Lancet medical journal (2019) further reported that women in India are more vulnerable than men [3]. Analyzing complex and similar EHRs by hand is neither cost-effective nor effort-effective.
Predictive algorithms in data mining have been used over the last few decades to find patterns and generalize them for prediction. In our previous work [4], we discussed: (1) the state-of-the-art in the usage of data mining in the health sector, and (2) the top ten causes of death from chronic disease. One of the foremost applications of data mining in the health sector is to build an effective Clinical Disease Prediction System (CDPS) from such algorithm(s). A poorly designed CDPS can be devastating and may result in unwanted outcomes, whereas a properly designed and analyzed CDPS will help hospitals to reduce their expenses. Traditional decision making in healthcare facilities relies heavily on the instincts and skills of doctors rather than on the data concealed in EHRs. The consequences are unintentional biases, mistakes, and superfluous medical costs that impact patient care.
Before analyzing the algorithms, we had several questions, such as which algorithms to choose for CVD prediction and on what basis. We therefore framed them as Research Questions (RQ) and later analyzed them with the WEKA tool. The RQs for an unbiased and effective analysis of the algorithms are as follows:
• RQ1: What are the leading algorithms for the prediction of CVD after an extensive study of related work?
• RQ2: Out of these, which algorithm(s) outperform the others in terms of performance analysis?
This work is divided into multiple sections. Section II discusses related work by various researchers on the prediction of CVD using data mining algorithms. Section III outlines the methodology for the performance analysis of the algorithms; it briefly discusses the datasets, performance metrics, software, and leading predictive algorithms. Section IV discusses the results of the analysis.

II. RELATED WORK
To answer RQ1, we collected several research papers related to CVD from various sources such as IEEE Xplore, Google Scholar, Scopus, and Springer. These papers were then filtered by year of publication (2019-2021), which helps to identify the recent usage of algorithms in the prediction of CVD. After an extensive study, we compiled the list of popular algorithms in Table I, which answers the first research question. These algorithms are then used for the performance analysis.
Muniasamy et al. [5] stressed the usage and applications of Machine Learning (ML) techniques for CVD prediction. They used six algorithms, viz. SVM, DT, k-NN, RF, Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP/ANN). They used four heart datasets (i.e., Cleveland, Switzerland, Hungary, Long Beach VA) available in the UCI (University of California, Irvine) repository. They used 10-fold cross-validation for splitting training and testing data in the WEKA software, and the performance of the algorithms was then evaluated using performance metrics. Their work concluded that four algorithms, i.e., LDA, RF, DT, and MLP, are suitable for the prediction of CVD.
Deshmukh et al. [6] suggested a Heart Disorder Prognosis System, in which they used two datasets from the UCI ML repository (i.e., the Hungary and Cleveland datasets). They applied k-NN, ANN, DT, and SVM to these datasets using the Python programming language. Their results concluded that DT/ID3 outperforms the other algorithms on both datasets, with accuracies of 84.08% and 100%, respectively.
Garg et al. [7] performed a comparative analysis of five data mining algorithms, namely k-NN, NB, RF, SVM, on four datasets collected from the UCI repository (i.e., Cleveland, Switzerland, Hungary, Long Beach VA). The analysis was performed using the Python programming language and concluded that SVM outperforms the others in terms of accuracy.
Katarya & Meena [8] used the Python programming language to study the advantages and disadvantages of eight algorithms, viz. LR, NB, SVM, k-NN, DT, RF, ANN/MLP, and Deep Neural Network (DNN), for the prediction of CVD.
Karun [9] performed a comparative analysis to find the most suitable model for the prediction of CVD. The work used the heart disease dataset from the UCI repository and concluded that RF outperforms the other algorithms, i.e., SVC/SVM and k-NN.
Li et al. [10] proposed a feature selection algorithm, "Fast Conditional Mutual Information (FCMIM)". Their work used the Cleveland heart disease dataset collected from the UCI repository. During pre-processing, the data were normalized with a min-max scaler and then visualized using a heatmap to understand the correlations. In the next phase, the feature selection techniques LASSO, MRMR, Relief, and FCMIM were applied to extract the relevant features from the dataset. To check the performance of each feature selection technique, the data were passed to various classifiers (i.e., DT, ANN, LR, k-NN, SVM, and NB). The work concluded that FCMIM used with an SVM classifier gives better accuracy and a lower execution time than the other combinations.
Singh & Kumar [11] calculated the accuracy of various heart disease prediction algorithms such as SVM, k-NN, and Linear Regression classifiers. This work utilized the heart disease dataset from the UCI repository and then split it into 73% as a training dataset, 37% as a testing dataset. During the pre-processing phase, data balancing and feature selection were carried out on Jupyter (Python). The work concluded that k-NN performs better than the other classifiers in terms of accuracy (87%).
Choudhary & Narayan Singh [12] suggested using AdaBoost over DT because DT may lead to the over-fitting problem. They used the Cleveland dataset and experimented with the Python programming language. Their results concluded that AdaBoost gives almost the same accuracy (89%) at test sizes of 40% and 10%.
Sangle et al. [13] analyzed the theoretical aspects of different works in the field of ML and Deep Learning (DL) for the prediction of cardiovascular disease. They studied the pros and cons of techniques such as DT, k-NN, SVM, NB, ANN, and ensemble learning. Finally, the authors suggested using ensemble learning/hybrid models to boost the prediction accuracy of CVD models.
Shah et al. [14] discussed and experimented with various predictive algorithms such as NB, k-NN, DT, and RF, where k-NN outperformed the other algorithms at k=7 in terms of accuracy. They used the Cleveland dataset and analyzed it with the Python programming language.
Peng et al. [15] presented and discussed the importance/usage of ANN in the prediction of Cardiovascular disease. They have discussed previous work by various researchers related to neural networks for the prediction of CVD.
Hamdaoui et al. [16] proposed a clinical predictive system for cardiovascular disease. They applied various algorithms such as NB, k-NN, SVM, RF, and DT to the Cleveland dataset. They used two separate validation techniques, i.e., 10-fold cross-validation and 70-30 split validation. In both scenarios, NB outperforms the other algorithms. With split validation, NB achieves a higher accuracy (84.28%) than with cross-validation (82.17%).
Kumar et al. [17] calculated various performance metrics, such as accuracy, ROC AUC score, and execution time, for classifiers such as RF, DT, LR, SVM, and k-NN. The work utilized a heart disease dataset from the UCI repository and was carried out on Jupyter (Python). It concluded that RF performs better in terms of accuracy (85%), ROC AUC score (0.8675), and execution time (1.09 sec).
Santhana Krishnan & Geetha [18] concluded that DT (accuracy=91%) performs better than NB (accuracy=87%) in handling a heart medical dataset. The experiment was carried out using the Python programming language on the heart disease dataset from the UCI repository.

III. METHODOLOGY
To answer RQ2, this work proposes a methodology for finding which algorithm outperforms the others in terms of performance. The complete step-by-step workflow is shown in Fig. 1. This section is divided into four subsections: (1) Datasets used and their pre-processing, (2) Algorithms selected from the first research question, (3) Software used for analysis, (4) Performance metrics.

A. Datasets
We have used two similarly structured datasets related to CVD (i.e., the Cleveland dataset and the Statlog dataset). Both were collected from the UCI ML Repository [21] [22], and their properties are given in Table II. The Cleveland dataset contains 76 attributes, but only 14 attributes are usable for CVD prediction. In this dataset, Age, Tresbps, Chol, Thalach, Oldpeak, and Ca are of numeric type and the others are of nominal type. The Statlog dataset has 13 feature attributes. Unlike the Cleveland dataset, it does not have any missing values. The goal with these datasets is to predict, based on the feature attributes, whether or not the patient may suffer from CVD in the future. If the outcome of the target variable is Yes, cardiac disease is present; otherwise it is not.
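The mean replacement of missing values applied during pre-processing can be sketched in plain Python. This is only an illustration under stated assumptions: the sample values and the use of `None` as the missing-value marker are hypothetical, and the sketch handles numeric attributes only. WEKA provides a `ReplaceMissingValues` filter for this task; the paper does not state which mechanism was actually used.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries of a numeric column with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical values of a numeric attribute; None marks a missing entry.
ca = [0.0, 3.0, None, 2.0, 1.0, None]
print(impute_mean(ca))  # [0.0, 3.0, 1.5, 2.0, 1.0, 1.5]
```

Computing the fill value from the observed entries only keeps the missing markers from distorting the mean.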

B. Selected Algorithms
The selection of algorithm(s) largely depends on the dataset and the type of problem (e.g., classification, clustering, etc.). Table I shows the list of popular algorithms found in the extensive study (RQ1). In this sub-section, the algorithms that had h ≥ 2 in Table I are discussed.

1) Support vector machine:
SVM identifies the hyperplane with the greatest distance between two classes (see Fig. 2) [23]. The support vectors are the vectors (cases) that define the hyperplane. The distance (margin) between the hyperplane and these cases must be maximized. Where a linear hyperplane cannot separate the data, SVM employs a non-linear kernel function that maps the data into a higher-dimensional space in which linear separation becomes possible; this is known as the kernel trick. In this work, we have used the SMO (Sequential Minimal Optimization) function in the WEKA tool.

2) Naïve Bayes:
The NB classifier is based on Bayes' theorem (see Equation (1)) with the assumption of independence among predictors [24]. An NB model is simple to construct, with no complicated iterative parameter estimation, which makes it especially useful for very large datasets. Despite its simplicity, the NB classifier often works extremely well and often beats more complex classification methods. Here, we have used the NaiveBayes classifier in the WEKA tool.
$$P(K \mid L) = \frac{P(L \mid K)\, P(K)}{P(L)} \tag{1}$$
where $P(K \mid L)$ is the probability of occurrence of K given that L has already happened; $P(L \mid K)$ is the probability of occurrence of L given that K has already happened; $P(K)$ and $P(L)$ are the independent probabilities of events K and L, respectively.
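As a small worked illustration of Bayes' theorem with the events K and L, the sketch below computes the probability of K given L from a prior and two conditional probabilities. The function name and all numeric values are hypothetical and chosen only for the example; the probability of L is expanded with the law of total probability.

```python
def posterior(prior_k, p_l_given_k, p_l_given_not_k):
    """Return P(K | L) using Bayes' theorem.

    P(L) is obtained from the law of total probability:
    P(L) = P(L | K) P(K) + P(L | not K) (1 - P(K)).
    """
    evidence = p_l_given_k * prior_k + p_l_given_not_k * (1.0 - prior_k)
    return p_l_given_k * prior_k / evidence


# Hypothetical numbers: K = "patient has CVD", L = "symptom observed".
p = posterior(prior_k=0.3, p_l_given_k=0.8, p_l_given_not_k=0.1)
print(round(p, 3))  # 0.774
```

Here the evidence term is 0.8 × 0.3 + 0.1 × 0.7 = 0.31, so the posterior is 0.24 / 0.31 ≈ 0.774.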

3) Decision tree:
DT builds a prediction model in the shape of a tree structure [25]. DT provides a simple graphical solution to the problem, which makes it the most easily understandable of all classifiers. DT divides a dataset into successively smaller subgroups while incrementally building the decision tree. The end output is a tree with decision nodes and leaf nodes. A decision node has two or more branches (for example, Obesity? Exercise?). A leaf node (e.g., Unfit, Fit) represents a classification, as shown in Fig. 3. If Age < 40 and the person is obese, then the patient is unfit. If Age > 40 and the person does not exercise, then the patient is unfit. DT is capable of handling both numerical and nominal/categorical attribute types. We have used the J48 function (a Java-based implementation of DT) in the WEKA tool.

4) Random forest:
RF is illustrated in Fig. 4 and consists of many decision trees [26]. Each decision tree receives training data as input; the results of the trees are then aggregated, and the most-voted class is returned as the prediction. Overfitting is a common concern in DT; RF helps to prevent this problem. Here, we have used the RandomForest function in the WEKA tool.
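The aggregation step of RF, in which the most-voted class becomes the prediction, can be sketched as follows. The function name and the example votes are hypothetical; training of the individual trees is deliberately left out, so this shows only the voting step and not WEKA's RandomForest implementation.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees.

    On a tie, the label that was encountered first is returned.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one patient.
votes = ["Unfit", "Fit", "Unfit", "Unfit", "Fit"]
print(majority_vote(votes))  # Unfit
```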

5) Artificial neural network:
ANN is composed of three types of layers: input, hidden, and output, as shown in Fig. 5 [27]. The input layer nodes are connected to the hidden layer nodes, and each hidden layer node is connected to the output layer nodes. The network takes its data from the input layer. The hidden layer receives raw data from the input layer and processes it. The resulting values are passed to the output layer, which processes them further and returns the result. ANN's most critical shortcoming is its inability to justify its decisions. Here, we have used the MultilayerPerceptron function in the WEKA tool.

6) Logistic regression:
LR is illustrated in Fig. 6 [28]. In Fig. 6, one curve represents linear regression and the probability curve represents LR. The vertical axis is the probability of a particular outcome, and the horizontal axis represents the value of the input variable. The logistic function uses a sigmoid function to map values from a wide scale into the range (0, 1). Here, we have used the SimpleLogistic function in the WEKA tool.
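The way the sigmoid function maps a value from a wide scale into the range (0, 1) can be sketched as below. The weights, bias, and feature values are hypothetical, and the sketch shows only the prediction step of a logistic model, not how WEKA fits it.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one instance."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model with two features.
print(sigmoid(0.0))  # 0.5
print(predict_probability([1.0, 2.0], weights=[0.5, -0.25], bias=0.0))  # 0.5
```

In the second call the linear score is 0.5 × 1.0 − 0.25 × 2.0 = 0, which the sigmoid maps to a probability of 0.5.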

7) Adaptive boosting:
Adaptive Boosting (AdaBoost) is an ensemble learning technique used to enhance the accuracy of weak binary classifiers such as DT. Unlike in RF, the weak classifiers are added sequentially. For a dataset with N feature variables, N decision stumps are created. Initially, all decision stumps are given equally weighted data. The base model (first stump) is the stump with the lowest entropy. After that, each observation is updated with a new, normalized weight based on the stump's performance and total error. Finally, a new decision stump is selected based on a random number and the normalized weights, and so on. In the WEKA tool, the implementation of Adaptive Boosting is called AdaBoostM1.
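One round of the weight update described above can be sketched as follows. This uses the common textbook form of discrete AdaBoost, in which the stump's performance is alpha = 0.5 ln((1 - error) / error); this form is an assumption for illustration and is not claimed to match the internals of WEKA's AdaBoostM1. The function name and the example data are hypothetical.

```python
import math


def update_weights(weights, misclassified):
    """Return (alpha, new_weights) after one boosting round.

    weights       -- current observation weights, assumed to sum to 1
    misclassified -- booleans, True where the stump was wrong
    """
    total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < total_error < 0.5:
        raise ValueError("the stump must be better than chance and not perfect")
    alpha = 0.5 * math.log((1.0 - total_error) / total_error)
    # Increase the weight of misclassified observations, decrease the rest.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]


# Five equally weighted observations, the first one misclassified.
alpha, new_w = update_weights([0.2] * 5, [True, False, False, False, False])
print(round(alpha, 3), [round(w, 3) for w in new_w])
```

With a total error of 0.2, the misclassified observation ends up with weight 0.5 and each correctly classified one with 0.125, so the next stump concentrates on the hard case.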

8) k-Nearest Neighbors:
k-NN is a classifier that classifies data points based on their closest neighbours. The implementation of k-NN consists of simple steps. Initially, the data points are transformed into vectors. In the next step, the distance between the vector points is found using a distance measure such as the Euclidean distance or the Manhattan distance (see Equation (2)). Then the probability that these points are similar to the test data is calculated. Finally, the test point is assigned the class of the vector points with the highest probability. Here, we have used the IBk (Instance-Based Learner) function, an implementation of k-NN, in the WEKA tool.
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2}$$
where $d(x, y)$ is the distance between vectors $x$ and $y$; $n$ denotes the number of data points in the vector.
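The k-NN steps above can be sketched in a few lines of Python. The training points, labels, and function names are hypothetical, and the final step is implemented as a simple majority vote among the k nearest neighbours, which is one common reading of "highest probability"; it is not a reproduction of WEKA's IBk.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points.

    train -- list of (vector, label) pairs
    """
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data.
train = [((1.0, 1.0), "absent"), ((1.5, 2.0), "absent"),
         ((5.0, 5.0), "present"), ((6.0, 5.5), "present"),
         ((5.5, 6.5), "present")]
print(knn_predict(train, (1.2, 1.4)))  # absent
print(knn_predict(train, (5.2, 5.1)))  # present
```

Passing `distance=manhattan` switches the distance measure without changing the rest of the procedure.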

C. Software used
WEKA (Waikato Environment for Knowledge Analysis) is a free and open-source software application designed to address a range of data mining problems [29]. The framework implements several algorithms for data analysis and provides an API for calling the built-in algorithms from an application written in the Java programming language. It provides a variety of tools for classification, regression, clustering, removal of irrelevant features, association rule building, and dataset visualization. We have used WEKA v3.8.5 on an Intel® Core™ i3 @ 1.70GHz with 8GB RAM running the 64-bit Windows 10 operating system.

D. Performance Metrics used
1) Confusion matrix:
The confusion matrix is represented by an N × N table, shown in Fig. 7, that describes how well a classifier performs on data for which the true values are known. It consists of 4 entities. True positives (TP) are the cases where the classifier predicted that patients have the illness and they do have it. True negatives (TN) are the cases where the classifier predicted that patients do not have the illness and they do not have it. A false positive (FP) is also referred to as a Type I error; here, the classifier predicted that patients have the illness but they do not. A false negative (FN) is also referred to as a Type II error; here, the classifier predicted that patients do not have the illness but they do. The confusion matrix is then used to determine Accuracy, Precision, Recall (Sensitivity), and F-Measure. Accuracy answers the question of how often the model is correct; mathematically, it is shown in Equation (3). Precision is defined as the ratio of true positives to all predicted positives, and Recall is the proportion of actual positives that were found by the model. Mathematically, Precision and Recall are shown in Equation (4) and Equation (5), respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$
F-Measure is defined as the harmonic mean of Precision and Recall, as stated in Equation (6). Instead of balancing the trade-off between Precision and Recall, researchers can look for a good F-Measure score.
$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$
The Receiver Operating Characteristic (ROC) curve is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold levels. The greater the ROC Area, the better the model's ability to differentiate between the positive and negative groups.
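The metrics derived from the confusion matrix can be computed directly from the four counts, as in the sketch below. The counts are hypothetical and are not results from this work; the function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F-measure from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 so that no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts for 100 patients.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 4) for name, value in m.items()})
# {'accuracy': 0.85, 'precision': 0.8889, 'recall': 0.8, 'f_measure': 0.8421}
```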
2) Cohen's kappa: This metric measures how closely the instances classified by the classifier match the labelled data used as ground truth. It is shown mathematically in Equation (7). The greater the value of Cohen's kappa, the greater the level of agreement and the higher the percentage of reliable data. A value below 0.60 is usually considered to indicate a weak classifier.
$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{7}$$
where $p_o$ is the actual percentage agreement and $p_e$ is the expected percentage agreement based only on chance.
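For a binary classifier, Cohen's kappa can be computed from the confusion-matrix counts as sketched below. The counts are hypothetical, and the names `p_o` (actual agreement) and `p_e` (agreement expected by chance) are labels chosen for this sketch. The chance agreement is obtained from the row and column totals of the confusion matrix.

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # actual agreement between classifier and ground truth
    # Chance agreement: both say "positive" plus both say "negative".
    p_positive = ((tp + fp) / n) * ((tp + fn) / n)
    p_negative = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_positive + p_negative
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical counts for 100 patients.
print(round(cohens_kappa(tp=40, tn=45, fp=5, fn=10), 2))  # 0.7
```

For these counts the actual agreement is 0.85 and the chance agreement is 0.5, giving a kappa of 0.7, which is above the 0.60 threshold mentioned above.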

IV. EXPERIMENTAL RESULTS
This paper examined two research questions for an effective and unbiased analysis of the algorithms. To answer RQ1, we inspected the extensive state-of-the-art related to predictive algorithms and CVD. Following data pre-processing, each dataset was divided into training and testing data (for validation) using 10-fold cross-validation. The algorithms from RQ1 were applied to these datasets. To measure the effectiveness of these algorithms, each one was put to the test using performance measures, the results of which are displayed in Table III and Table IV. Against the Cleveland dataset, the performance analysis showed that both SVM and ANN perform better than the other selected algorithms, with accuracies of ~84.15% and ~84.09%, respectively (Table III). DT scored 73.62%, Naïve Bayes scored ~81.67%, RF scored ~81.37%, Logistic Regression scored ~81.37%, AdaBoost scored ~82.99%, and k-NN scored ~75.74% in terms of accuracy. The accuracy of the ANN classifier is very close to that of SVM, but the ROC Area value of ANN (0.907) is higher than that of SVM (0.842) (see Table III). So, both ANN and SVM are suitable choices for the prediction of CVD against the Cleveland dataset.
www.ijacsa.thesai.org
The analysis against the Statlog dataset showed three algorithms whose performance is worth discussing (Table IV). SVM scored the highest accuracy of ~84.07%. Next in order, NB and LR showed the same accuracy of ~83.70%. DT scored ~76.66%, RF scored ~76.29%, ANN scored ~78.14%, AdaBoost scored 80%, and k-NN scored ~75.18% in terms of accuracy. If we compare the ROC Area, then both NB and LR are better than SVM (see Table IV).
The results discussed so far concerned the individual datasets. If we compare the accuracy of the algorithms across the Cleveland and Statlog datasets, then SVM performed better than the other algorithms (see Fig. 8). It showed an accuracy of ~84.15% against the Cleveland dataset and ~84.07% against the Statlog dataset. Next in order, NB showed an accuracy of ~81.67% against the Cleveland dataset and ~83.70% against the Statlog dataset. An algorithm with a ROC Area value near 1 is generally considered a good classifier for the dataset. LR scored a ROC Area of 0.9 against both datasets (see Fig. 9). Next in order, ANN showed 0.907 against the Cleveland dataset and 0.839 against the Statlog dataset.

V. CONCLUSION
Predictive algorithms are found to be very effective in the automatic prediction of CVD. In this work, we analyzed popular predictive algorithms, namely SVM, NB, DT, RF, LR, ANN, AdaBoost and k-NN. They were chosen based on the state-of-the-art related to CVD and predictive algorithms. The experiment was conducted using two similarly structured datasets (i.e., the Cleveland and Statlog datasets) on the open-source WEKA software. The experiment concluded that (1) SVM showed the maximum accuracy against the datasets, and (2) LR showed a ROC Area of 0.9 against both datasets. These results imply that (1) SVM shows better accuracy against most of the datasets by finding the optimal hyperplane using kernel tricks, and (2) LR shows a better ROC Area against the binary classification datasets.
These findings will help researchers and health institutions (1) to understand the current trends related to CVD prediction using the algorithm(s), and (2) to build a successful and effective CDPS (i.e., Clinical Disease Prediction System) for CVD. Unfortunately, we were unable to study and analyze hybrid models/algorithms, but this work can be extended in the future by using it as a blueprint/base. Future work should give priority to (1) real-time and complex CVD data, (2) ensemble learning and hybrid models for analysis, and (3) checking the effects of different validation and feature selection techniques on the values of the performance metrics.