Proficiency Assessment of Machine Learning Classifiers: An Implementation for the Prognosis of Breast Tumor and Heart Disease Classification

Breast cancer and heart disease are dangerous and common diseases in many countries, including Pakistan. In this paper, a comparative study of classifiers has been performed for tumor and heart disease classification. Around one lakh (100,000) women with no family history of breast cancer are diagnosed with this life-threatening disease annually. If it is not treated in time, it may grow and spread to other parts of the human body. Mammograms are X-rays of the breast which can be used for the screening of cancerous tumors. Early identification of breast cancer may increase the chance of survival up to 70 percent. Breast tumors can be categorized into two types: a) Benign and b) Malignant. A benign tumor is one that is not attached to neighboring tissues and does not spread to other parts of the body. A malignant tumor can grow and spread, so other parts of the body may be affected by it. Classifying a tumor as malignant or benign is very complex because a cancerous tumor and a lump caused by skin inflammation look almost the same. Early identification of malignant tumors is essential to protect the patient's life. Diverse medical methods based on deep learning and machine learning have been developed to treat patients, as cancer is a very serious and critical issue in this era. In this research paper, machine learning algorithms such as logistic regression, K-NN and decision tree have been applied to the breast cancer data set taken from the UCI Machine Learning Repository. A comparative study of classifiers has been performed to determine the better classifier for the robust prediction of breast tumors. Simulated results showed that logistic regression achieved an accuracy of ninety-one percent. The research showed that logistic regression can be applied for the accurate and precise early prediction of breast cancer. Cardiovascular disease is very common throughout the world.
It has been observed in cardiac patients that many factors cause heart disease or heart attack. The factors leading to heart failure include varying blood pressure, high blood sugar, cardiac pain, abnormal heart rate, high cholesterol level (LDL), artery blockage and irregular ECG signals. Many researchers have shown that stress can also be a cause of heart disease. A high number of cardiac surgeries, such as angioplasty and heart bypass, are performed annually. People often neglect their lifestyle and diet and ignore the symptoms. Heart disease can be predicted early and treated if proper testing and medication are provided. Sometimes a false pain occurs that feels like the angina pain associated with cardiovascular disease. To reduce false alarms and robustly classify heart disease, several machine learning approaches have been adopted. In the proposed research, support vector machine (SVM), K-nearest neighbors (K-NN) and linear discriminant analysis have been compared for the accurate classification of heart disease. Simulated results demonstrated that the support vector machine was the better classifier, with an accuracy of 80.4%.

Keywords: Breast cancer; benign; malignant; logistic regression; cardiovascular disease; heart disease diagnosis; support vector machine; classifiers; k-nearest neighbors


I. INTRODUCTION
Twenty-five percent of women die due to breast cancer between the ages of thirty-five and forty. Mammography is usually performed to support radiographic decisions. SENOLOG was developed for breast therapy assistance using SENOBASE [1]. An RF-ELM classifier was applied to detect tumors in digital mammograms. The mammogram images were taken from the MIAS database. Kurtosis, mean, standard deviation, correlation coefficient, entropy and variance were chosen as features for accurate classification. RF-ELM was found to be a very competent classifier for the diagnosis of breast cancer [2]. This research paper is divided into four main sections. Section one gives the introduction. Section two discusses the implementation of classifier algorithms for breast tumor identification. Section three describes the heart disease classification and its implementation. Results and conclusion are discussed in section four.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020, www.ijacsa.thesai.org

A. Existing Methods for the Identification of Malignant Tumor
Local binary patterns (LBP) were applied to mammograms. The data set was collected from DDSM. LBP on mammograms achieved an accuracy of 84% [3]. Mammography is acknowledged as a good strategy for the screening of tumors. Generally, mammogram analysis is very complex, as the image consists of many small differences between tissues [4]. A novel hybrid approach of digital image processing was adopted to analyze mammograms. Using this approach, early identification of breast cancer at the stage of micro-calcification was achieved, leading to a higher accuracy of the proposed technique [5]. Digital image based elasto-tomography was developed as a prototype for the evaluation of breast cancer. Segmentation was performed to identify the model of the breast. Using this system, tumors as small as 10 mm could be detected in a silicone phantom breast [6]. Ensemble empirical mode decomposition (EEMD) has been proposed for the early detection of breast tumors using ultra-wide band (UWB) microwave imaging. A tumor of approximately 4 mm was identified inside glandular tissue with a dielectric constant of 35 in a breast model [7]. The pulsed confocal approach was proposed to improve the identification of breast tumors. Two-dimensional finite difference time domain (FDTD) analysis was conducted to detect a 2 mm tumor in the presence of clutter [8]. It has been observed that tumors have different permittivity and conductivity from the surrounding tissues. Electrical impedance spectroscopy (EIS) was used to distinguish normal tissue from malignant tissue [9]. As shown in Fig. 1(a) and 1(b), a homogeneous breast model was designed. An incident wave of 6 GHz with vertical polarization was tested. An artificial neural network was designed to evaluate the scattered electromagnetic waves. The dielectric values for malignant and normal tissue were randomly chosen [11][12].

B. Problem Statement
Early detection of breast cancer has become a crucial issue in medical science, as 30% of women die annually due to breast cancer. Women often ignore the tumor because of a lack of awareness of breast cancer. Classifying a tumor as malignant or benign is very complex, as there is misconception and confusion regarding these two classes [1][2][3][4]. The late-stage symptoms are also similar to normal inflammatory conditions. Therefore, robust early breast cancer detection is needed. Mammography-based screening is usually used to evaluate the cancerous tumor at an early stage, while it is still small. Many clinical laboratories record mammograms, which are X-rays of the breast. Data acquisition for the breast tumor is also possible, as the size, shape, adhesion, location and other attributes of the tumor can be recorded [2][3][4][5][6]. To make the identification of malignant tumors more certain and more accurate, machine learning based decision making is required. The symptoms of a tumor are almost the same as those of skin inflammation. Breast pain, swelling and reddish skin are very common symptoms of cancer, but people ignore them because they take them for a normal skin inflammation [9][10][11]. Fig. 2 shows the block diagram of the proposed breast cancer classification using machine learning. Clinical data were collected from the UCI machine learning repository for applying the proposed classification models. Logistic regression, K-NN and decision tree classifiers were applied to determine the best predictive model for breast cancer. A cross-validation curve was also obtained for the comparison. Results were obtained in terms of accuracy, precision, prediction speed, ROC, true positive rate and false positive rate.

Table I shows the breast cancer dataset, which has been gathered from the UCI machine learning repository for benign and malignant classification. The data set has been used in many studies based on neural networks and machine learning classification [13][14]. The data set comprises eleven attributes, including the patient ID. Column 1 gives the ID of each individual patient. Clump thickness is given in column 2. A clump is a bunch of close roots which have grown with the tumor tissue. Uniformity of cell size is given in column no. 3. The cell shape of the tumor is described in column no. 4. Column 5 shows the marginal adhesion of the tumor. Column no. 6 gives the single epithelial cell size. An epithelial cell is a cell which protects the upper surface of the skin against germs and bacteria. Bare nuclei are given in column no. 7. Bare nuclei are a cytology finding which can be observed in the degeneration of a cell. Bland chromatin is given in column no. 8. Bland chromatin describes the pattern and texture of a benign tumor; in a cancer cell the texture is usually found to be rough and harsh. Column no. 9 gives the normal nucleoli. Nucleoli reflect the cell's response to stress. Column no. 10 gives the mitosis attribute of the breast tumor. Mitosis is the division of a parent cell into two daughter cells that have the same properties and number of chromosomes as the parent cell, as in ordinary tissue. The output is given in the last column. Column 11 shows the output classes 1 and 2, 1 for Benign and 2 for Malignant [13][14].

A. Decision Tree Implementation
A decision tree resembles a tree in which the outputs are represented by leaves. The decision tree algorithm classifies a sample by sorting it from the root down to a leaf. Information gain is computed from the entropy, a measure taken from information coding theory.
A probabilistic classification can be positive or negative. The entropy of a training sample $t$, with $p_{+}$ and $p_{-}$ denoting the proportions of positive and negative examples in $t$, can be expressed as:
$$E(t) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$
For the class "2-Benign" classification, the true positive rate was found to be 92% and the false positive rate was 8%. For the class "4-Malignant" classification, a 60% true positive rate was observed with a 40% false positive rate.
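The entropy and information gain used to choose decision tree splits can be illustrated with a short sketch. It assumes the standard Shannon entropy in bits; the function names and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical labels (illustration only): 1 = benign, 2 = malignant.
parent = [1, 1, 1, 1, 2, 2, 2, 2]
perfect_split = [[1, 1, 1, 1], [2, 2, 2, 2]]
print(entropy(parent))                          # 1.0 for a 50/50 mix
print(information_gain(parent, perfect_split))  # 1.0 for a perfect split
```

A split that separates the two classes completely removes all uncertainty, so its information gain equals the parent entropy; a tree learner prefers the attribute with the largest gain at each node.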

B. K-Nearest Neighbors K-NN Classifier
Fig. 5 shows the confusion matrix of the K-NN algorithm. K-NN achieved an accuracy of 88.2% for the classification of benign and malignant tumors. The efficiency was estimated using the true positive rate and false positive rate values. The confusion matrix showed that a 92% true positive rate was achieved with an 8% false positive rate for the classification of "class 1-Benign". In the "Class-Malignant" classification, an 80% true positive rate was achieved with a 20% false positive rate.
The Euclidean distance is computed to find the K closest neighbors of a sample. For two samples $x$ and $y$ with $n$ attributes:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (3)$$
Fig. 6 shows the ROC and the area under the curve (AUC) for the K-NN classifier. The ROC curve is plotted as true positive rate against false positive rate. The AUC was found to be 0.86, which is slightly below 1. For good classification the AUC must be close to 1.
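The role of the Euclidean distance in K-NN can be sketched as follows. This is a minimal illustration with majority voting, not the implementation used in the paper; the function names, the value of K and the two-attribute toy samples are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority class among the k training samples nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-attribute samples (illustration only): class 1 and class 2.
train_x = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
train_y = [1, 1, 1, 2, 2, 2]
print(euclidean((0, 0), (3, 4)))              # 5.0
print(knn_predict(train_x, train_y, (2, 2)))  # 1
print(knn_predict(train_x, train_y, (8, 8)))  # 2
```

Because the distance sums squared differences over all attributes, attributes with larger numeric ranges dominate the vote unless the data are scaled first.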

C. Logistic Regression
Logistic regression is a regression model in which the predictive model produces a binary output. Outliers and missing values must be filtered out before the regression analysis is performed. The logistic function may be defined as:
$$o = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$
where $o$ is the predicted output, $b_0$ is the bias and $b_1$ is the coefficient for the single input value $x$. Each column of the input data has an associated coefficient $b$ (a constant real value) that must be learned from the training data.
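A minimal sketch of the logistic function for a single input, assuming the standard form $o = 1 / (1 + e^{-(b_0 + b_1 x)})$, which is algebraically the same as $e^{b_0 + b_1 x} / (1 + e^{b_0 + b_1 x})$. The coefficient values and the 0.5 decision threshold are hypothetical choices for illustration.

```python
import math

def logistic(x, b0, b1):
    """Predicted output o in (0, 1) for input x, bias b0 and coefficient b1."""
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, b0, b1, threshold=0.5):
    """Binary decision: 1 if the predicted output reaches the threshold, else 0."""
    return 1 if logistic(x, b0, b1) >= threshold else 0

# Hypothetical coefficients (not learned from the paper's data).
b0, b1 = -4.0, 1.0
print(logistic(4.0, b0, b1))       # 0.5, since b0 + b1 * x = 0
print(predict_class(6.0, b0, b1))  # 1
print(predict_class(2.0, b0, b1))  # 0
```

Training consists of choosing $b_0$ and $b_1$ so that the predicted outputs match the observed classes as closely as possible; the sketch only covers prediction.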
In Fig. 8, the confusion matrix of logistic regression shows that a mean accuracy of 91.2% was achieved, based on the true positive rate (TPR) and false positive rate (FPR). For the classification of Benign, the confusion matrix showed that a 100% true positive rate was achieved in classifying class 2, while a 0% false positive rate was achieved in the classification of class 1. It was also observed that a 70% true positive rate was obtained in the classification of "class 2-Malignant", with a 30% false positive rate.

D. Fine Gaussian SVM
In Fig. 10, the confusion matrix of the fine Gaussian SVM shows that the algorithm performed very poorly in the prediction of breast cancer, as it classified all samples as class 1. Class 2 was not predicted at all; therefore the false negative rate was found to be 100% and the false positive rate for class 2 was found to be 0%.
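The rates reported for a classifier that assigns every sample to class 1 can be reproduced with a small sketch. The labels below are hypothetical and only illustrate the arithmetic; they are not the paper's data, and the function name is an assumption.

```python
def per_class_rates(actual, predicted, positive):
    """True positive, false negative and false positive rates for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"TPR": tpr, "FNR": fnr, "FPR": fpr}

# Hypothetical labels: the classifier predicts class 1 for every sample.
actual = [1, 1, 1, 2, 2]
predicted = [1, 1, 1, 1, 1]
print(per_class_rates(actual, predicted, positive=2))
# {'TPR': 0.0, 'FNR': 1.0, 'FPR': 0.0}
```

Since class 2 is never predicted, none of its samples can be a true positive and no sample of class 1 can be wrongly labelled as class 2, which gives the 100% false negative rate and 0% false positive rate for class 2.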

E. Comparative Study of Classifiers for the Prognosis of Breast Tumor
Fig. 11(a) shows the accuracy, prediction speed and training time for the decision tree. Fig. 11(b) shows the parametric analysis for logistic regression. It can be observed from the parametric analysis that logistic regression achieved an accuracy of 91.2%. Moreover, it can also be noticed that an accuracy of 88.2% was achieved by the trained predictive model for breast tumor classification. Table II shows that logistic regression and K-NN performed better than the decision tree and the fine Gaussian SVM in terms of accuracy, prediction speed, training time, precision and area under the curve. Logistic regression achieved an accuracy of 91.2% for benign and malignant classification.

III. RECENT TRENDS FOR HEART DISEASE CLASSIFICATION
Cardiovascular disease is very common throughout the world. Even in the United States of America, one patient loses his or her life to this silent disease every thirty-four seconds [15]. Electrocardiography signals (P-wave, QRS complexes) and cardiac arrhythmias have also been processed and classified with a convolutional neural network (CNN) to identify heart disease [16]. Smoking and hypertension may also increase the chances of heart disease. Data mining techniques were applied to heart disease data (HDD) for the classification of heart disease [17]. Quadratic support vector machine and discriminant analysis have been performed in the MATLAB environment for the classification of heart disease [18]. A fuzzy based K-NN classifier was developed for the classification of pure cardiovascular disease. The training, testing and validation curves demonstrated that the fuzzy based K-NN classifier worked better [19]. A wearable gadget was also fabricated for real-time data transmission. In previous research, health parameters including blood pressure, temperature and heart rate were optimized using particle swarm optimization [20]. Fig. 12 illustrates that the classification of cardiac arrhythmias and ECG signals was performed using a fuzzy logic controller, MLP-PSO, improved PSO (ImpSO) and a genetic algorithm [21][22]. The data were collected from PHYSIONET. Heart rate variability (HRV) has been used as a yardstick to measure heart health. Heart rate variability signals have been processed for the classification of heart disease using an artificial neural network (ANN) [23][24][25][26]. Cardiac arrhythmia can be divided into the following categories: asystole, bradycardia, tachycardia, ventricular tachycardia and ventricular flutter. Fuzzy logic was used to classify heart disease, as cardiac arrhythmia can be used for measuring heart health [27].
R waves and QRS complexes were cleaned and classified using the Pan and Tompkins algorithm [28]. Cardiac abnormalities have been observed for heart rate classification [29]. ECG signals have been classified using a fuzzy network for heart disease classification [30]. Cardiac arrhythmias based on rate time series were forecasted using a radial basis function [31]. 116 heart sounds were segmented using classification and regression trees [32]. An optimal architecture of a multi-layer perceptron combined with particle swarm optimization has been designed for the prediction of cardiac arrhythmias [33]. A stand-alone system using the DSK6713 was developed to measure abnormalities in heart sound [34].

A. Problem Statement for Heart Disease Classification
Cardiovascular disease is very common throughout the world. People usually ignore the early symptoms of heart disease; for example, they mistake actual cardiac chest pain for atypical angina or non-cardiac chest pain. Neglecting blood pressure, blood sugar and serum cholesterol may lead to heart disease. Sometimes gastric pain and non-cardiac chest pain occur as a false alarm in heart disease identification. Accurate prediction of heart disease may be performed using machine learning based on the patient's historical data with respect to age.

B. Implementation of Classifiers for Heart Disease Classification
Table III shows the dataset, which has been collected from the UCI machine learning repository for heart disease classification. The data set has been used in many studies based on neural networks and ensemble classification [35]. The data set contains thirteen attributes to predict heart disease. The output is represented as two classes, 1 or 2. Class "1" confirms the absence of heart disease, meaning the heart is working fine, while class "2" indicates a high risk of heart attack, as heart disease has been found and urgent consultation or treatment by a doctor or heart specialist is needed. The age and sex of the patients are given in columns 1 and 2, respectively. Column no. 3 gives the type of chest pain (CP), of which there are four: typical angina, atypical angina, non-angina pain and asymptomatic. Heart pain or chest trouble caused by the heart muscle receiving too little oxygen from the blood is usually referred to as typical angina pain. Atypical angina pain is a symptom of a problem that is not actually related to the heart. Non-angina pain is also known as non-cardiac chest pain (NCCP); it feels like heart pain but does not indicate heart disease. Asymptomatic shows that no heart disease was detected. Column no. 4 shows the blood pressure of the patients at rest. Attribute no. 5 gives the serum cholesterol level in mg/dl. The sixth column shows whether the fasting blood sugar (FBS) of the patient was greater than 120 mg/dl. The resting electrocardiographic results of the patients are given in column no. 7, with values of 0, 1 and 2. The maximum heart rate of the patient is given in column no. 8. Column no. 9 marks those patients who had angina pain induced by exercise. Column no. 10 gives the old peak, which is the ST depression induced by exercise relative to rest. The slope of the peak exercise ST segment is given in column no. 11 (up, flat, down). The number of main vessels (0-3) colored by fluoroscopy is recorded in column no. 12. Column no. 13 gives the thallium stress scintigraphy result, which is normal, fixed defect or reversible defect. The output is given in the 14th column, which has two classes. Class "1" shows that no heart disease was identified and the heart is working properly. Class "2" indicates that the presence of heart disease has been confirmed, so emergency consultation or treatment is needed.

Table III. Heart disease dataset

Age  Sex  CP  RBP  SCHL  FBS  RECR  MHR  EIA  OP  SP  MV  TH  Output
67   0    3   15   564   0    2     60   0    1   62  0   7   1
57   1    3   24   261   0    0     41   0    0   31  0   7   2
67   1    2   28   263   0    0     5    1    0   22  1   7   1
74   0    4   20   269   0    2     21   1    0   21  1   3   1
65   1    2   20   77    0    0     40   0    0   41  0   7   1
56   1    4   30   256   0    2     42   1    0   62  1   6   2
59   1    3   10   239   0    2     42   1    1   22  1

C. Support Vector Machine (SVM)
Generally, support vector machine (SVM) classifiers are applied to solve complicated real-world engineering problems; it has been observed in many classification applications that SVM gives better classification. A support vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space that can be used for classification, regression or outlier detection. In the proposed research, two hyperplanes have been designed for the two classes.
H1 and H2 are the planes:
$$H_1: w \cdot x_i + b = 1 \qquad (6)$$
$$H_2: w \cdot x_i + b = -1$$
The plane H0 is the median in between, where $w \cdot x_i + b = 0$. To maximize the margin, $\|w\|$ is minimized, subject to the condition that there are no data points between H1 and H2.
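The relation between the weight vector and the margin can be sketched numerically. The sketch assumes the conventional formulation in which the two margin planes satisfy $w \cdot x + b = +1$ and $w \cdot x + b = -1$, so that the distance between them is $2 / \|w\|$; the weight vector, bias and points below are hypothetical.

```python
import math

def decision_value(w, x, b):
    """Signed value w.x + b; it is 0 on H0 and +1 or -1 on the margin planes."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two margin planes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separating plane in two dimensions (illustration only).
w, b = (1.0, 1.0), -3.0
print(decision_value(w, (2.0, 2.0), b))  # 1.0, a point on the +1 margin plane
print(decision_value(w, (1.0, 1.0), b))  # -1.0, a point on the -1 margin plane
print(margin_width(w))                   # about 1.414
```

A smaller $\|w\|$ gives a wider margin, which is why maximizing the margin is stated as minimizing $\|w\|$.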
Non-linear SVMs are also used to make the classes linearly separable by using the quadratic equation:
$$(x - a)(x - b) = x^2 - (a + b)x + ab \qquad (10)$$
The optimization of the weight values can be solved using the following formulations for SVMs. For the maximization:
$$\max_{w, b} \frac{1}{\|w\|} \quad \text{subject to} \quad \min_{n = 1, 2, \ldots, N} |w^T x_n + b| = 1 \qquad (11)$$
For the minimization:
$$\min_{w, b} \; w^T w \quad \text{subject to} \quad y_n (w^T x_n + b) \ge 1 \quad \text{for } n = 1, 2, \ldots, N \qquad (12)$$
The data was preprocessed to make it ready for analysis. The data was filtered and the missing values were recovered. The data was given in the form of numbers and fractions and can be used for training. Fig. 13 shows the data points of the parameters for the proposed model. The proposed predictive model was trained with different algorithms and cores to verify performance. The way the algorithm is trained also makes a big difference: if the wrong type of training data is provided, the algorithm cannot achieve useful results. Fig. 14 shows the confusion matrix of the support vector machine. A confusion matrix is usually read along its diagonal; the values on the diagonal show the correctly classified observations of each class. The confusion matrix shows the performance of the classifier. The confusion matrix of the SVM shows that classification has been performed for the two classes. According to the confusion matrix, an 89% true positive rate with an 11% false positive rate was obtained when classifying class 1, the absence of heart disease. A 68% true positive rate with a 32% false positive rate was achieved for the classification of class 2. The SVM algorithm achieved an accuracy of 80.4% in the classification of heart disease. The training time for the proposed algorithm was found to be 0.75427, with a prediction speed of 5900 observations per second. Fig. 15 illustrates the SVM receiver operating characteristic (ROC). To evaluate the performance of a multi-class classifier, it must be visualized for analysis. The area under the curve (AUC) indicates the degree to which the model can separate the classes.
The receiver operating characteristic (ROC) is usually measured as the probability of classification. The ROC graph is plotted as true positive rate against false positive rate. The area under the curve (AUC) was determined to be 0.88, and the SVM may be regarded as a competent classifier because its AUC is close to 1.

D. K-Nearest Neighbors (K-NN)
Fig. 16 shows the confusion matrix of the K-NN algorithm. The K-NN algorithm achieved an accuracy of 76.1% for the classification of heart disease. The efficiency was calculated using the true positive rate and false positive rate. The confusion matrix showed that for class 1 a true positive rate of 85% was achieved with a false positive rate of 15%. For the class 2 classification using K-NN, a 62% true positive rate was achieved with a 38% false positive rate. Usually the Euclidean distance is calculated to find the K closest neighbors. Fig. 17 shows the ROC and the area under the curve (AUC) for the K-NN classifier. The ROC curve is plotted as true positive rate against false positive rate. The AUC was found to be 0.80, which is slightly below 1. For good classification the AUC must be close to 1.
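The area under an ROC curve, as reported for the classifiers above, can be approximated from a list of (false positive rate, true positive rate) points with the trapezoidal rule. The sketch below is illustrative only; the ROC points are hypothetical and do not reproduce the curves in the figures.

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve given matching lists of FPR and TPR values."""
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Area of the trapezoid between two neighbouring ROC points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical ROC points running from (0, 0) to (1, 1).
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]
print(auc_trapezoid(fpr, tpr))                # 0.875
print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))  # 0.5, the chance diagonal
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the sense in which a value close to 1 indicates good classification.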

E. Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are two of the most popular classifiers based on a probabilistic method. For each class $k$, predictions can be computed using the following formulation of Bayes' rule:
$$P(y = k \mid x) = \frac{P(x \mid y = k)\, P(y = k)}{P(x)}$$
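How Bayes' rule yields class predictions in LDA can be sketched for a single attribute. The sketch assumes Gaussian class-conditional densities that share one variance, which is the usual LDA assumption but is not stated in the paper; the class means, priors and variance are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lda_posteriors(x, means, priors, shared_var):
    """Posterior P(y = k | x) for every class k via Bayes' rule."""
    joint = {k: gaussian_pdf(x, means[k], shared_var) * priors[k] for k in means}
    evidence = sum(joint.values())  # P(x), the normalizing term
    return {k: value / evidence for k, value in joint.items()}

# Hypothetical parameters: class 1 = no heart disease, class 2 = heart disease.
means = {1: 0.0, 2: 2.0}
priors = {1: 0.5, 2: 0.5}
print(lda_posteriors(1.0, means, priors, shared_var=1.0))  # 0.5 for each class
post = lda_posteriors(0.0, means, priors, shared_var=1.0)
print(max(post, key=post.get))                             # 1
```

The predicted class is the one with the largest posterior; with a shared variance the boundary between the two classes is linear in $x$, which gives the method its name.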
In Fig. 18, the confusion matrix of linear discriminant analysis (LDA) shows that an overall accuracy of 79.3% was achieved, based on the true positive rate (TPR) and false positive rate (FPR). For the classification of heart disease, the confusion matrix showed that an 87% true positive rate was achieved in classifying class 1, with a 13% false positive rate for class 1. Fig. 19 shows that the area under the curve (AUC) for linear discriminant analysis was found to be 0.85.

F. Fine Gaussian SVM
Fig. 20 shows the confusion matrix for the fine Gaussian SVM. The confusion matrix shows that the algorithm performed very poorly, as it classified all samples as class 1. Class 2 was not predicted at all; therefore the false negative rate was found to be 100% and the false positive rate for class 2 was found to be 0%.

G. Performance Comparison of Classifiers for Heart Disease Classification
Table IV shows that the SVM performed better than K-NN and LDA in terms of accuracy, prediction speed, training time, precision and area under the curve. SVM achieved an accuracy of 80.4% for heart disease classification, which was greater than the accuracies of K-NN and LDA.

IV. RESULTS AND CONCLUSION
A comparative study of classifiers was performed to determine the better classifier for breast cancer prediction. The results show that logistic regression achieved the highest accuracy, 91.2%. K-NN also performed well, with an accuracy of 88.2%. The study shows that logistic regression may be adopted on real-time patient data to reduce the false alarm rate in the prediction of breast cancer tumors. Moreover, simulated results showed that, for heart disease classification, SVM performed classification rapidly, in only 0.74237 seconds, compared to K-NN and LDA. Its prediction speed was 5900 observations per second, which is higher than that of the LDA and K-NN classification algorithms. The accuracy and area under the curve of SVM were found to be 80.4% and 0.88, respectively. SVM proved to be a better and more robust classifier for heart disease classification.