Hybrid Feature Selection Algorithm and Ensemble Stacking for Heart Disease Prediction

Abstract: In cardiology, as in other medical specialties, early and accurate diagnosis of heart disease is crucial, as heart disease has been the leading cause of death over the past few decades. However, state-of-the-art heart disease prediction strategies put more emphasis on classifier selection to enhance accuracy and performance, and seldom consider feature reduction techniques. Furthermore, several factors lead to heart disease, and it is critical to identify the most significant characteristics in order to achieve the best prediction accuracy. Feature reduction lowers the dimensionality of the data, which may allow learning algorithms to work faster and more efficiently and to produce more accurate predictive models. In this study, we explore and propose a hybrid of two distinct feature reduction techniques, chi-squared and analysis of variance (ANOVA). Classification is then performed on the selected features using the ensemble stacking method. Using the optimal features from the hybrid combination, a stacking ensemble based on logistic regression yields the best result, with an accuracy of 93.44%. These results suggest that feature selection can be considered an effective component of heart disease prediction.


I. INTRODUCTION
In machine learning, supervised learning is the process of learning a function that maps an input to an output based on examples of input-output pairs. It accomplishes this by inferring a function from a collection of labelled training samples [1]. Feature selection has been utilized in a variety of fields, including marketing, commercial applications, pattern recognition, image processing, classification, and prediction. Real applications commonly involve a sizable data collection and a high number of features. Most of the time, only a few of the features are important and pertinent to the objective. Since the remaining features are unimportant and unnecessary, removing them can benefit both computational performance and classification accuracy. As a result, choosing a suitable and compact feature subset from the original features is crucial to improving classification performance and accuracy as well as overcoming the curse of dimensionality. Feature selection techniques determine the importance of attributes, with the aim of reducing the input variables to those most relevant to the model. Besides reducing the number of attributes, feature selection also reduces processing time.
According to [2], medical records from the National Heart Institute Malaysia (IJN) recorded between January 1, 2009, and December 31, 2018, were used in a non-interventional study covering a 10-year retrospective period. From the IJN database, 3923 out of 4739 records were eligible and used in the analysis. Another study [3], conducted by the Department of Statistics of Malaysia in 2019, found that heart disease was the leading cause of death in Malaysia, representing 15% of all fatalities requiring rapid medical attention. However, heart disease can be prevented by avoiding risk factors. In machine learning, a variety of learning paradigms, such as supervised, unsupervised, semi-supervised, reinforcement, and transduction, are frequently employed. Supervised learning is the ability of an algorithm to synthesize knowledge from previously labelled data in order to predict future unlabelled cases [4].
In this study, 13 attributes from the UCI dataset are used in the experiment to determine the causes of heart disease. However, not all attributes are useful, and a feature selection method is needed to identify the most important ones. The attributes chosen may vary depending on the feature selection method used. Heart disease can be predicted from patients' symptoms, which makes the specialist's task easier. Prediction is one of the most frequently used applications of machine learning. With the assistance of machine learning, data mining is quickly becoming an essential part of the healthcare industry, employing classification and prediction techniques to generate models that describe the necessary classes [5].
According to the dataset used, the major risk factors for heart disease are age, sex, chest pain type, trestbps, chol, fasting blood sugar, restecg, thalach, exang, oldpeak, slope, number of major vessels, and thalassemia. In light of these considerations, this research employed a feature selection method to build a heart disease risk assessment model that could aid specialists in making accurate early predictions [6].
Even though several feature selection strategies have been used in decision support systems for medical datasets, there is always room for improvement. The combination of feature selection algorithms and classifiers has to be tuned for heart disease datasets with a large feature space in order to deliver high performance. The proposed framework is based on a well-balanced mix of two different types of feature selection algorithms that work well together.
www.ijacsa.thesai.org
This study aims to propose a hybrid feature selection method that combines the chi-squared and ANOVA techniques. Chi-squared is utilized for the selection of categorical features, whereas ANOVA is applied to numerical features. The research proposes to combine the highest-ranked features from both techniques, so that the five most influential features are derived from a total of 13 features. These five features are then evaluated using an ensemble stacking approach to improve the accuracy of heart disease prediction. This paper is organized as follows: Section II reviews related work, including the accuracy achieved by other authors using feature selection techniques for heart disease prediction. Section III discusses the dataset used for the experiment, along with the feature selection techniques and the framework that visualizes the whole experimental process. Section IV discusses the results obtained from the experiment, and Section V presents the conclusion.

II. LITERATURE REVIEW
Several factors lead to heart disease; however, the present approaches to heart disease prediction are inadequate and need to be improved. Prediction accuracy might be improved by using reduction approaches to remove some of the redundant features. Feature selection is the process of selecting the important attributes of a dataset, and pre-processing is the main step in selecting them. In this research, ten base classification algorithms and three subsets of meta-models are tested for prediction accuracy.

A. Filter Method
The filter method is one of the feature selection methods that independently evaluates the importance of each feature. The selected features are subsequently used as input for a model-building process.
Before induction takes place, the filter method removes unwanted attributes using a criterion that acts independently of the learning algorithm [7]. Karl Pearson pioneered the use of chi-squared statistics for categorical data, but it took some time before the asymptotic distribution of these statistics was thoroughly understood [8].
However, a valid conclusion from the chi-squared test depends on several assumptions, such as [9]:
1) A cross-tabulation can be used to figure out actual frequencies.
2) The chi-squared test should not be used for percentages or other derived statistics.
3) The two variables are nominal, that is, the categories have no natural ordering.
4) More than 75%-80% of contingency table cells have an expected count of ≥ 5, and none have an expected count of 0.
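The expected-count assumption can be checked before a chi-squared test is run. The following is a minimal sketch in plain Python, not the procedure used in [9]: the function names, the example table, and the fixed 0.8 share (the upper end of the 75%-80% range) are illustrative assumptions.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_expected_count_rule(table, min_count=5, min_share=0.8):
    """True if at least `min_share` of cells have an expected count of
    `min_count` or more and no cell has an expected count of zero.
    The 0.8 default is an assumption taken from the 75%-80% range."""
    cells = [e for row in expected_counts(table) for e in row]
    share_ok = sum(e >= min_count for e in cells) / len(cells)
    return share_ok >= min_share and all(e > 0 for e in cells)


# Hypothetical cross-tabulation: rows = feature category, columns = target class
observed = [[30, 10],
            [20, 40]]
expected = expected_counts(observed)
rule_ok = meets_expected_count_rule(observed)
```

For this table every expected count is 20 or 30, so the rule is satisfied; a sparse table such as `[[1, 0], [0, 1]]` fails it.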
Aside from chi-squared, the ANOVA test [10] is another filter-based feature selection technique used in this research. The f_classif() function is passed to the SelectKBest class to determine the most important features. SelectKBest is part of the scikit-learn library and uses a scoring function to retain the features with the highest scores.
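As an illustration of how SelectKBest and f_classif() fit together, the sketch below ranks three synthetic numerical features and keeps the best one. The data are randomly generated stand-ins, not the heart disease dataset, and the choice of k=1 is only for demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # synthetic binary target

# Feature 0 shifts with the class label; features 1 and 2 are pure noise
X = np.column_stack([
    rng.normal(loc=y * 2.0, scale=1.0),
    rng.normal(size=n),
    rng.normal(size=n),
])

# Score every feature with the ANOVA F-test and keep the single best one
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)  # column indices that were kept
f_scores = selector.scores_                    # one F-score per feature
```

The `scores_` attribute gives the ranking itself, which is useful when the number of features to keep has not yet been decided.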
According to [11], Classification and Regression Trees (CART), Gradient Boosting Machine (GBM), AdaBoost, K-Nearest Neighbor (KNN), Multilayer Perceptron (MLP), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), and Naïve Bayes were tested with feature selection to find the most accurate algorithm. CART achieved the best accuracy, 87.65%. Four important attributes out of eleven features were selected by the feature selection. The author used majority voting to identify the best attributes: st_slope_flat and st_depression ranked highest, followed by max_heart_rate_achieved, exercise_induced_angina, and cholesterol. The author claimed these attributes are the leading causes of heart disease.
In 2019, the author of [12] used RapidMiner to test the accuracy of each algorithm. Six algorithms, including Decision Tree, Logistic Regression, Logistic Regression SVM, Naïve Bayes, and Random Forest, were tested, and Logistic Regression SVM achieved the highest accuracy, 84.85%. However, the author did not report which attributes were the leading causes.
On the other hand, [13] investigated the use of principal component analysis (PCA) in a clinical setting. To measure the effectiveness of reducing infection risk among university students, a pilot study of 200 volunteers was carried out. Essential clinical parameters were identified and confirmed by medical experts. From clinical history variables with 49 parameters, the disease was identified through the use of PCA. PCA was used to confirm the weightage of the risk level for the disease, so that the system would be as accurate, reliable, and effective as possible. The cumulative percentage achieved with PCA is 58.288%, and the author reports optimal accuracy, reliability, and efficiency for conducting mass screening of students.
The author in [14] found that the most accurate configuration, chi-squared feature selection with the BayesNet classifier, achieved 85.00%. The heart disease dataset was tested using principal component analysis (PCA), chi-squared testing, ReliefF, and symmetrical uncertainty. PCA feature extraction with IBK gave the highest recall, 87.22%, but its accuracy was lower than the chi-squared result. Based on the results, cp is the most influential feature for heart disease prediction, followed by exang, chol, and thal. Features are ranked differently depending on which feature selection method is used.
The research in [15] compares several machine learning techniques to determine the most efficient classification technique. Four classification algorithms, KNN, NB, decision tree (J48), and RF, along with other techniques such as SVM, were compared with affinity degree (AD) classification. All these algorithms were tested on three different UCI datasets. J48 demonstrated the highest level of performance when compared to the other four classifiers; the purpose of that research was to investigate the suitability of affinity as a classification method.
The study by [16] reports that the backward feature selection technique resulted in the highest accuracy, 88.52%, using the decision tree algorithm. Random forest, support vector machine, decision tree, k-nearest neighbor, logistic regression, and Gaussian naïve Bayes were tested, and the decision tree outperformed the other five algorithms. The authors also measured accuracy using ten different feature selection techniques: ANOVA, chi-squared, mutual information, ReliefF, forward feature selection, backward feature selection, exhaustive feature selection, recursive feature elimination, lasso regression, and ridge regression. Backward feature selection was the technique that led to the best result.
The research in [17] tested a dataset of 70000 patients and 11 features with the chi-squared feature selection method. The features consist of age, gender, height, weight, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking, alcohol intake, and physical activity. Seven algorithms and the chi-squared method were used to filter the most influential features. The author adjusted some features of the dataset to discover the factors with the greatest impact on cardiovascular disease, and found weight and height to be the most influential factors. Multi-Layer Perceptron achieved the highest accuracy, 87.23%.
The authors of [18] used two different datasets and applied a feature selection technique to find the best features. They also tested an ensemble classifier with a sampling technique to find the best accuracy. ANOVA is one of the feature selection techniques they used to find the best features for improved accuracy.
The study in [19] proposed a model to predict multiple diseases, as very few proposals address the detection of multiple diseases. The author considers conditions such as heart disease, diabetes, and kidney disease. Features that do not affect how well the prediction system works are set aside, and only important features are taken into consideration for decision-making. Chi-squared and ANOVA are applied to identify the best features from the dataset. Exang, cp, ca, oldpeak, and thalach are chosen as the most influential features.
A study conducted by [20] shows that the size of the dataset increases as the complexity of the model increases. Classification and regression are each tested in that research for comparison purposes, as they may help researchers decide on appropriate algorithms. Chi-squared, as one of the feature selection methods, is used for categorical predictors, ordered predictors with missing values, and ordered predictors without missing values. The major benefit of chi-squared is that it decreases computing complexity through a merging procedure that reduces the number of categories for each predictor.
Recently, [21] developed a heart failure survival prediction model with the help of an ensemble tree machine learning approach. Extreme Gradient Boosting (XGBoost) was shown to be the most accurate classifier, with 83.00%. During the pre-processing stage, unimportant features are removed to obtain better accuracy. The author uses ANOVA and chi-squared to analyze numerical and binary features, respectively. The most influential features are anemia, time, ejection_fraction, and serum_creatinine, with the 'time' feature contributing the most to the improvement in accuracy.
A comparison of these results shows that different authors obtained different outcomes. The highest accuracy achieved in past work is 88.52%, from the decision tree. Thal can be regarded as the most influential feature, since the experiments conducted with feature selection show that thal ranked highest among the features. In the succeeding sections, we analyze and compare our feature selection technique and the resulting accuracy for heart disease prediction.

III. METHODOLOGY
This study is based on the UCI heart disease dataset, which consists of 303 records with 13 attributes and a target label. The attributes are age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and thal, together with target. The data dictionary in Table I explains these attributes further. Cases of heart disease and non-heart disease are extracted from the dataset and displayed visually. According to Fig. 1, 54.50% of patients suffer from heart disease and the remaining 45.50% are free from heart disease.
The execution is accomplished using the following procedures:

1) The UCI Cleveland dataset is obtained.
2) Data visualization is performed.
3) The dataset is divided into testing and training data.
4) The algorithms are applied for training.
5) The model is trained.
6) Heart disease is predicted and evaluated based on the accuracy obtained.
From the UCI dataset, 80% of the records are used as training input for the machine learning methods, and the model is fitted accordingly. The remaining 20% is test data for predicting heart disease [22].
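The 80/20 division described above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' code: the function name, the seed, and the choice to round the test size up are hypothetical, and only the 303-record count and the 20% test share come from the text.

```python
import math
import random


def train_test_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Rounding the test size up is an assumption made for this sketch
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 303 records, as in the UCI heart disease dataset
train_idx, test_idx = train_test_indices(303)
```

With these assumptions the split gives 242 training records and 61 test records, with no record appearing in both sets.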

B. Pre-Processing
Dimensionality reduction is a pre-processing procedure that can eliminate irrelevant data, noise, and redundant features to improve the accuracy of learning and save training time [23]. Data pre-processing often encompasses the following tasks [24]:
Data cleansing: The first stage in data cleansing is identifying mistakes and inconsistencies in the database by evaluating the data. This phase is also known as a data audit and identifies all forms of database irregularities [25].
Normalization: Pre-processing is not only a method for transforming raw data into a clean dataset; it also improves the performance of machine learning. Data acquired from various sources is collected in a raw format that is incompatible with analysis and machine learning [26].
Feature discovery: Feature discovery operates on the data filtered by the earlier pre-processing steps. Its advantage is that it extracts meaningful data from identified correlations or patterns [27].
Management of imbalanced data: Imbalanced data classification arises when the class sizes of a dataset differ by a significant margin. The smaller group is called the minority class, and the remaining group is called the majority class [28].

C. Feature Selection
Attribute or feature selection is a data reduction method applied to the dataset. It decreases the size of the data by eliminating unnecessary or duplicate attributes. Methods for selecting a feature subset fall into four categories: the embedded method, the wrapper method, the filter method, and the hybrid method [29]. In our research, the filter method is applied, with the features divided into numerical and categorical groups. Because it operates independently of the induction algorithm, the filter method is faster than the wrapper approach and generalizes better. However, the chi-squared method favours selecting a subset with a large number of features, so a threshold is needed to select a subset [30].
According to [31], filter approaches are effective, scalable, computationally straightforward, and independent of the classifier. In this research, the categorical features are sex, cp, fbs, restecg, exang, slope, ca, and thal, with target as the categorical outcome, while age, trestbps, chol, thalach, and oldpeak are numerical features. Chi-squared is used to rank the categorical features and ANOVA is used for the numerical features. Both methods rank the features according to their importance. Table II shows the selected categorical and numerical features. Fig. 2 depicts the ranking of the features based on their importance. Ca, cp, and exang rank highest among the categorical features under chi-squared, while oldpeak and thalach are the most influential features in the numerical group according to the ANOVA score.
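The hybrid ranking described above can be sketched with scikit-learn's chi2() and f_classif() scoring functions. This is an illustrative sketch, not the authors' implementation: the helper names, the synthetic data, and the placeholder feature names are hypothetical. The default of three categorical and two numerical features mirrors the ca/cp/exang and oldpeak/thalach split reported in the text.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif


def top_k(names, scores, k):
    """Return the k feature names with the largest scores, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]


def hybrid_select(X_cat, cat_names, X_num, num_names, y, k_cat=3, k_num=2):
    """Rank categorical features by chi-squared and numerical features by
    the ANOVA F-score, then concatenate the top of each ranking."""
    chi_scores, _ = chi2(X_cat, y)      # needs non-negative integer codes
    f_scores, _ = f_classif(X_num, y)
    return top_k(cat_names, chi_scores, k_cat) + top_k(num_names, f_scores, k_num)


# Synthetic stand-in data (not the UCI records)
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
X_cat = np.column_stack([y, rng.integers(0, 2, size=n)])            # signal, noise
X_num = np.column_stack([rng.normal(y * 2.0), rng.normal(size=n)])  # signal, noise

selected = hybrid_select(
    X_cat, ["cat_signal", "cat_noise"],
    X_num, ["num_signal", "num_noise"],
    y, k_cat=1, k_num=1,
)
```

Keeping the two rankings separate avoids comparing chi-squared scores with F-scores directly, since the two statistics are on different scales.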

D. Chi-Squared
Chi-squared is one of the techniques for categorical data. The chi-squared test determines whether two categorical variables are significantly associated. A two-sided chi-squared test is carried out between each categorical feature and the binary outcome to obtain a p-value. Features with a two-sided p<0.05 are retained [32].
The steps involved in the chi-squared process are as follows [33]:
Step 1: All features from the original dataset are selected.
Step 2: The chi-squared function (chi2()) from scikit-learn is used to determine whether each feature is independent of the target. Use (1) to find the chi-squared score for each feature.
Step 3: The features with the highest chi-squared scores are the most likely to depend on the target feature and are therefore selected for model creation. SelectKBest() was used to choose the five features with the highest chi-squared scores.
Step 4: A threshold is then determined to construct a subset of n features. In this research, the five features with the highest chi-squared test scores are used to create the feature subset.
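Equation (1), referenced in Step 2, does not appear in this text. As a point of reference, and assuming the standard Pearson statistic is the one intended, the chi-squared score for a feature cross-tabulated against the target can be written as:

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

Here $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$ and column $j$ of the contingency table, $R_i$ and $C_j$ are the row and column totals, and $N$ is the total number of records. These symbols are introduced for this illustration and are not taken from the source.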
According to [34], the chi-squared strategy incrementally adds important characteristics to the feature subset. At each level, the method determines the significance threshold and discards features that fall below it. As a result, the chi-squared strategy is more efficient than similar step-wise selection methods. Most studies report that the chi-squared method, compared with other feature selection methods, improves the performance of most classifiers and achieves outstanding results [35].
Based on [36], the evolution of the chi-squared test up to 1900 can be divided into six related stages:
1) From the multivariate error law to the multivariate normal distribution.
2) Exponent distribution in multivariate normal density.
3) Multinomial distribution approximation by multivariate normal density.
4) Evaluation of the exponent when the moment is multinomial.
5) The definition to which probability refers.
6) Provision for the effect of estimating an undetermined parameter.

E. ANOVA
Analysis of Variance (ANOVA) is the other feature selection technique used in this study. ANOVA is applied to the numerical features of the dataset, and the ratio between two variance estimates is computed [33]. The ANOVA technique is carried out in the following steps [33]:
Step 1: All features are selected from the original dataset.
Step 2: The ANOVA F-score of each feature with respect to the target feature is calculated using scikit-learn. Equations (2), (3), and (4) give the formulas used to calculate ANOVA:
$$F = \frac{\text{variance between groups}}{\text{variance within groups}} \qquad (2)$$
$$\text{Variance between groups} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2}{k - 1} \qquad (3)$$
$$\text{Variance within groups} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}{N - k} \qquad (4)$$
where $k$ is the number of groups, $n_i$ is the size of group $i$, $\bar{x}_i$ is the mean of group $i$, $\bar{x}$ is the overall mean, and $N$ is the total number of observations.
Step 3: The result of the test is used to perform feature selection, which enables the removal of features that are unrelated to the target variable. The most influential features are chosen with SelectKBest(), where K is the number of features kept for the final dataset.
Step 4: The n features with the highest ranking are used to create various feature subsets.
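The F-score from Step 2 can be computed by hand for a single feature. The sketch below is a plain-Python illustration of the standard one-way ANOVA calculation; the function name and the sample values are hypothetical and do not come from the dataset.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups (each a list of numbers)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of the values around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    variance_between = ss_between / (k - 1)
    variance_within = ss_within / (n_total - k)
    return variance_between / variance_within


# Hypothetical values of one numerical feature, grouped by target class
no_disease = [1.0, 2.0, 3.0]
disease = [3.0, 4.0, 5.0]
f_value = anova_f([no_disease, disease])
```

For these values the between-group variance is 6 and the within-group variance is 1, so the F-score is 6. A larger F-score means the class means are further apart relative to the spread inside each class.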
Research conducted by [37] shows that the use of ANOVA can enhance accuracy, with a 9.1% increase from 72.70%.
As shown in Fig. 3, several steps are applied, including data gathering and pre-processing, before feature selection. The 13 attributes are extracted from the dataset, missing values are removed, and the data are visualized. Before the base and meta-classifiers are examined, the feature selection method is applied to the data. The chi-squared technique is applied to the categorical features, while the ANOVA technique is applied to the numerical features. Accuracy is tested for each feature selection method, and the best accuracy is selected before the important features are filtered. From this experiment, five important features were identified.
The data are then tested with the base and meta-classifier methods. Ten base algorithms, consisting of logistic regression (LR), support vector classifier (SVC), random forest (RF), extra tree classifier (ETC), naïve Bayes (NB), extreme gradient boosting (XGB), decision tree (DT), k-nearest neighbor (KNN), multilayer perceptron (MLP), and stochastic gradient descent (SGD), are tested, and the results are used to find the optimum number of base classifiers. Then, the MLP, LR, NB, and SVC algorithms are applied as meta-classifiers.
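A two-level stacking arrangement of this kind can be sketched with scikit-learn's StackingClassifier. The source does not state which implementation was used, so this is an assumption; the data are synthetic stand-ins for the five selected features, and only three of the ten base algorithms are included to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five selected features (not the UCI data)
X, y = make_classification(n_samples=303, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Level 1: a small subset of base classifiers
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# Level 2: logistic regression learns from the base classifiers' predictions
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)  # fraction of correct test predictions
```

The meta-classifier is trained on cross-validated predictions of the base classifiers (`cv=5`), which limits the leakage that would occur if it saw predictions made on the same rows the base models were fitted on.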

IV. RESULTS AND DISCUSSION
The proposed work uses the chi-squared method for categorical features and ANOVA for numerical features. The 13 features from the UCI dataset are reduced to five features and tested accordingly. The highly ranked features are tested using the required method, and accuracy improves for each algorithm. Table III details the five attributes selected by chi-squared and ANOVA and the accuracy achieved with each feature selection method. Feature selection with the chi-squared and ANOVA techniques was the focus of the subsequent testing phase. Ca, cp, exang, oldpeak, and thalach were chosen as the top five attributes, a selection that is superior to those of the other feature selection algorithms.
Accuracy is tested for both the base and meta-classifiers using these five features, and the results show an improvement over the accuracy of the base classifiers. The results for both feature selection techniques, chi-squared and ANOVA, are summarized in the following table. According to Table IV, logistic regression obtains the highest accuracy compared to the other nine algorithms. For the level 1 base classifier, 85.24% is achieved before feature selection is applied, increasing by 6.56 percentage points to 91.80% after feature selection. For the level 2 meta-classifier, accuracy increases from 90.16% to 93.44%.
For SVC, the accuracy of the base classifier increases by 4.92 percentage points from 86.88%, and the meta-classifier reaches 91.80%, up from 83.60%. MLP achieves 90.16% accuracy as a base classifier, up from 88.52%, while its accuracy as a meta-classifier rises from 88.52% to 91.80%.
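The reported percentages can be related to counts of correctly classified test records. The sketch below assumes a test set of 61 records (20% of 303, rounded up), which is an inference and is not stated in the text; under that assumption each quoted accuracy matches a whole number of correct predictions to within 0.01 percentage points.

```python
TEST_SIZE = 61  # assumption: 20% of 303 records, rounded up

# Accuracies (%) quoted in the results above
reported = [85.24, 90.16, 93.44, 86.88, 91.80, 83.60, 88.52]


def correct_count(accuracy_pct, n=TEST_SIZE):
    """Nearest whole number of correct predictions for a given accuracy."""
    return round(accuracy_pct * n / 100)


counts = {acc: correct_count(acc) for acc in reported}

# Largest difference between a quoted accuracy and count / TEST_SIZE
max_gap = max(abs(100 * k / TEST_SIZE - acc) for acc, k in counts.items())

# Percentage-point gain of the logistic regression meta-classifier
lr_meta_gain = round(93.44 - 90.16, 2)
```

Under this assumption, the 93.44% result corresponds to 57 correct predictions out of 61, and the gain from 90.16% amounts to two additional correct predictions, so differences between classifiers rest on very few test records.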
According to the literature [11], classification and regression trees (CART) achieved an accuracy of 87.65%. The author of that work attempted to boost accuracy by employing feature selection and an ensemble technique. There was some improvement, but the accuracy remained low. In the current research, we followed the same process but with a new set of features and an alternative method of feature selection. Logistic regression provided an accuracy of 93.44%, which is an increase over earlier efforts.

V. CONCLUSION
The main goal of this work is to develop a hybrid feature selection method for heart disease prediction that combines the chi-squared and ANOVA approaches. ANOVA is used to select numerical features, whereas chi-squared is used to select categorical features. The algorithms involved are logistic regression, k-nearest neighbor, decision tree, random forest, Gaussian naïve Bayes, extreme gradient boosting, support vector classifier, multilayer perceptron, stochastic gradient descent, and extra tree classifier. These algorithms are tested as base classifiers. The meta-classifier is evaluated using the logistic regression, support vector classifier, and multilayer perceptron methods. Then, feature selection techniques are used to evaluate the base and meta-classifiers.
In this study, we assessed the efficacy of two distinct feature selection algorithms: the chi-squared test and analysis of variance. The experimental results show that the accuracy of heart disease prediction may be improved by employing the hybrid feature selection technique.
In addition to the feature selection techniques themselves, the features selected from the dataset deserve emphasis. The chi-squared test and analysis of variance (ANOVA) identified five characteristics, namely ca, cp, exang, oldpeak, and thalach, which were used in the experiment. The stacking ensemble with logistic regression achieved an accuracy of 93.44%, the best result among the ensemble stacking techniques tested. Because the accuracy of the approach might change depending on the dataset used, future work can evaluate the feature selection technique on a variety of different datasets.