An Improved Breast Cancer Classification Method Using an Enhanced AdaBoost Classifier

Abstract: The goal of this research is to create a machine learning (ML) classifier that can improve breast cancer (BC) diagnosis and prediction. The principal component analysis (PCA) technique is used to reduce the dimensions of the BC dataset and achieve better classification metrics. The developed classifier outperformed the others in terms of F1 score and accuracy score. First, four classifiers (RandomForest, DecisionTree, AdaBoost, and GradientBoosting) are applied to the original BC dataset to determine the best classifier in terms of performance metrics. The RandomForest classifier obtained an F1 score of 95.7% and an accuracy score of 94.5%, the DecisionTree classifier obtained an F1 score of 93% and an accuracy score of 91%, the GradientBoosting classifier obtained an F1 score of 95% and an accuracy score of 93.5%, and the AdaBoost classifier obtained an F1 score of 95.8% and an accuracy score of 94.5%. Because it scored the highest performance metrics, the AdaBoost classifier was used to create the final model on the PCA-reduced dataset. The developed classifier is named “pcaAdaBoost”. The optimized pcaAdaBoost achieved higher performance metrics, with an F1 score of 99% and an accuracy score of 98.8%. The results reveal that the optimized pcaAdaBoost scored the highest performance measures in cross-validation and testing, with an overall accuracy of 99%. The improved results justify the use of dimensionality reduction on high-dimensional datasets to reduce complexity and improve performance measures.


I. INTRODUCTION
Numerous studies have been conducted and technologies developed worldwide to screen for and investigate the risk of breast cancer (BC). Despite significant advances in screening and patient management, BC remains one of the most common malignancies in women worldwide and is ranked as the second most likely cause of cancer mortality. In 2019, 268,600 new cases of BC were diagnosed in American women, with 41,760 deaths [1][2][3]. BC is a very diverse illness with numerous forms and subtypes. Approximately 95% of BCs respond to endocrine and targeted therapy, and their prognosis and survival rates are generally favorable. The most widely used screening instrument is two-dimensional mammography, which can detect tumors that are otherwise too small to perceive. In a conventional mammogram, the breast is compressed between two rigid plates and X-rays are used to capture images of the breast tissue. Such techniques are invasive, costly, and tedious to conduct. With the advent of new computational power in the form of big data, machine learning (ML), and data science (DS), scholars have attempted to apply these techniques to the analysis of BC datasets and to develop promising low-cost and fast BC classification techniques. To reduce processing time and increase prediction performance, data reduction techniques are used. Removing irrelevant and redundant inputs is likely to enhance a classifier's performance measures. In this study, the PCA technique is used to reduce the dimensions of the BC dataset and achieve better classification metrics. The resulting classifier outperformed the others in terms of F1 score and accuracy score. Using the original BC dataset, four different classifiers are applied to determine the best classifier in terms of performance metrics.
Specifically, the RandomForest, DecisionTree, AdaBoost, and GradientBoosting classifiers are used and evaluated. PCA is then used to reduce the number of features in the original BC dataset, which improved the performance of the ML model at hand and enabled better data visualization. Reducing the dimensions of the BC dataset in this way makes it less sparse and more statistically significant.

II. LITERATURE REVIEW
Scholars have defined ML as a subset of artificial intelligence (AI). It denotes a mathematical model that makes decisions or predictions based on a training dataset. It is frequently described as an evolving prediction model that improves classification capabilities in a variety of fields, including disease diagnosis and screening in the medical industry [4]. A proactive ML model is needed to reduce the danger of diseases, infections, disorders, or pandemics [5][6]. To deal with the increasing complexity of vast data and convert it into meaningful scientific knowledge for the benefit of humanity, new bioinformatics methods must be developed [7][8]. The use of ML classification techniques in medical diagnosis applications is highly valued [6]. However, traditional classification methods may not have performed as well as expected, which motivates investigations that could improve current classification technologies in the medical sector. The goal of medical AI and ML research is to create applications that use AI technologies to aid practitioners in providing treatment based on better decision making [9][10]. Some research has been conducted on explainable AI (XAI) to overcome the drawback of AI analysis tools being black boxes. In comparison to AI systems such as deep learning, XAI can present a model's explanations and decision-making capabilities [11]. Many traditional ML approaches are used for classification problems, such as logistic regression (LR), support vector machine (SVM), decision tree (DT), and RandomForest (RF). RF is an ML classification method that comprises several decision trees in an ensemble. The outcome of the vote among these DTs represents the RF decision.
*Corresponding Author. www.ijacsa.thesai.org
Regardless of the ML algorithm used, several evaluation measures such as F1_score, accuracy_score, precision, recall, AUC, and ROC are commonly used to assess the effectiveness of each proposed method [12]. One of the most important areas of medical ML applications is BC classification. BC has now surpassed lung cancer as the most prevalent malignancy in women worldwide [13]. For the prognosis and diagnosis of BC, researchers developed an SVM-based classifier and contrasted it with Bayesian classifiers and artificial neural networks (ANN); however, they gave an implementation summary of the findings for the evaluated classifiers [14]. Another study [15] used the BC dataset to classify BC using several ML methods: KNN, RF, SVM, DT, and LR. The authors evaluated the results of each method and concluded that the SVM algorithm outperformed the others with an accuracy of 97.2%. Another study [16] examined and analyzed several ML approaches for detecting BC using the same dataset, and showed that the logistic regression classifier beats the other classifiers in predicting BC on the BC Wisconsin (Diagnostic) data set (BCWD). Another study [17] proposed an efficient recursive neural network (RNN) approach for BC classification using an RNN and the “Keras-Tuner” enhancement method, and claimed that the developed model achieved high accuracy.
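The threshold-based measures named at the start of this passage (accuracy, precision, recall, and F1) all derive from the four confusion-matrix counts. The following is a minimal, dependency-free sketch of those standard definitions. The function names and the toy labels are assumptions made for illustration and do not come from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented purely for illustration (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

On this toy input there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75. On imbalanced data accuracy and F1 generally diverge, which is one reason to report both.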

III. RESEARCH METHODS
The current study reduces the dimensionality of the dataset before feeding the selected features to the classifier. The proposed method employs the simplest model that meets the performance requirements of more complicated models. Accordingly, the dataset is dimensionally reduced in order to improve model performance and eliminate extraneous features. The proposed model begins with data preprocessing, followed by feature selection, dimensionality reduction, and classification (Fig. 1). In this methodology, four supervised classification algorithms are used: RandomForest, AdaBoost, GradientBoost, and DecisionTree. The classifier that obtains the best performance measures (accuracy_score and F1_score) is selected to perform the classification. After the best classifier is selected, the principal component analysis (PCA) procedure is applied to perform dimensionality reduction. The reduced dataset is then fed to the chosen classifier to perform BC classification. The developed model is validated using k-fold cross-validation and tested using a held-out subset of the original dataset (30%); lastly, the performance measures are evaluated.
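The validation scheme described above, a 30% held-out test subset plus k-fold cross-validation on the remainder, can be sketched with index bookkeeping alone. The text does not state the value of k, the random seed, or whether the split was stratified. The choices below (k = 5, seed 0, unstratified shuffling) and the function names are therefore assumptions for illustration only.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.30, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 569 is the number of rows reported for the dataset; k=5 is an assumed value.
train_idx, test_idx = train_test_split_indices(569, test_fraction=0.30)
folds = list(k_fold_indices(train_idx, k=5))
```

With 569 samples this holds out 171 rows for testing and cross-validates on the remaining 398. Each row of the training portion appears in exactly one validation fold, and the test rows are never seen during cross-validation.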

A. Dataset Description
In this study, the BCWD data are used for model development. The data include 31 columns: the input features and the class feature (target). The independent variables describe the cell nuclei detected in the breast image clip. The dependent variable holds a binary outcome: zero indicates benign and one indicates malignant, so each output is classified as benign or malignant. The shape of the dataset used is (569, 31), and its descriptive analysis is shown in Fig. 2. Additionally, the bar chart (Fig. 3) depicts the counts of malignant (M) and benign (B) cases in the target variable.
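The target encoding and class count described above can be illustrated in a few lines. This is a sketch under the assumption that the raw diagnosis column holds the strings 'M' and 'B'. The labels listed are made up and are not rows from the actual dataset.

```python
from collections import Counter

# Mapping taken from the text: benign -> 0, malignant -> 1.
LABEL_TO_INT = {"B": 0, "M": 1}


def encode_diagnosis(labels):
    """Convert 'B'/'M' strings into the 0/1 target described in the text."""
    return [LABEL_TO_INT[label] for label in labels]


# A handful of made-up labels, NOT rows from the real dataset.
diagnosis = ["M", "B", "B", "M", "B", "B"]
target = encode_diagnosis(diagnosis)
class_counts = Counter(diagnosis)  # the quantity a class-count bar chart plots
```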

B. Dimensionality Reduction
Feature reduction is a very important preprocessing stage that eliminates redundant, inconsistent, and unimportant features to optimize learning, improve classification accuracy, and minimize training cost [18]. One approach that lowers the computational cost of the learning process is “Principal Component Analysis (PCA)”. Feature reduction is useful in several realms because it reduces the computational burden as well as other unfavorable characteristics of high-dimensional spaces. Many scholars have recommended feature reduction techniques to improve computational efficiency and enhance accuracy [5]. The literature reports several uses of dimensionality reduction techniques. For example, Zhao and Du [19] advocated the “spectral-spatial feature-based classification” (SSFC) structure. Another study by Xu et al. [20] proposed spontaneous removal of piece picture from side to side deep learning. Nevertheless, PCA remains a commonly used method for feature reduction.
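To make the PCA step concrete, the sketch below centers the data, forms the covariance matrix, and extracts the leading directions by power iteration with deflation. It is a teaching sketch in plain Python, not the implementation used in the study (which the text does not specify). In practice a numerical library would be used, and features are often standardized first. The function names and the toy data are assumptions.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and a unit eigenvector."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec


def pca(rows, n_components=2):
    """Return (components, explained_variance_ratio, projected_rows)."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    total_var = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total_var if total_var else 0.0)
        # Deflate: remove the direction just found before searching again.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return components, ratios, projected


# Toy 3-feature data invented for illustration; the third feature is constant.
rows = [[2.0, 0.0, 5.0], [-2.0, 0.0, 5.0], [0.0, 1.0, 5.0], [0.0, -1.0, 5.0]]
components, ratios, projected = pca(rows, n_components=2)
```

Here the first component aligns with the first feature and explains 80% of the variance, and the second aligns with the second feature and explains the remaining 20%. The constant third feature contributes nothing and is dropped by the projection.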

C. Classification
Classification is the most important task in supervised learning [21][22][23]. It is normally used to assign each sample in the dataset to a class according to the value of the dependent variable [24][25]. To select the classifier that provides the best performance metrics on the BC dataset, four classifiers are used: RandomForest, AdaBoost, GradientBoost, and DecisionTree. A brief description of each classifier is given below:
1) Random forest classifier: Because of its ease of implementation and high versatility, it is one of the most often used supervised learning algorithms. It is a collection of prediction trees capable of handling large datasets.
2) AdaBoost classifier: AdaBoost was among the first algorithms to employ boosting. It works by combining numerous weak classifiers into a single strong classifier.
3) GradientBoost classifier: It is an ensemble method based on iterative functional gradient descent that minimizes a “loss function” by repeatedly selecting a function that points in the direction of the negative gradient.
4) Decision tree classifier: The decision tree is a supervised learning technique that is commonly used for solving binary classification problems. It bases its decisions on a set of rules.
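The idea of combining weak classifiers into a strong one can be sketched with the classic discrete AdaBoost procedure over decision stumps (single-threshold rules). This is an illustrative sketch and not necessarily the variant or library used in this study. The labels are coded as -1/+1 rather than 0/1 for convenience, and all function names and the toy data are assumptions.

```python
import math


def stump_predict(stump, row):
    """A stump predicts `polarity` at or below the threshold, else the opposite."""
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity


def best_stump(rows, labels, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for r, y, w in zip(rows, labels, weights)
                          if stump_predict(stump, r) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost_fit(rows, labels, n_rounds=10):
    """Discrete AdaBoost; labels must be -1 or +1."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump, err = best_stump(rows, labels, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, r))
                   for r, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data invented for illustration: no single threshold separates it.
rows = [[float(x)] for x in range(1, 10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1]
ensemble = adaboost_fit(rows, labels, n_rounds=3)
predictions = [adaboost_predict(ensemble, r) for r in rows]
```

On this toy set the best single stump classifies only six of the nine points correctly, whereas the weighted vote of three stumps classifies all nine. This reweighting of hard examples is the mechanism the description of AdaBoost refers to.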

IV. RESULTS
In this study, four classifiers were initially employed to obtain performance metrics on the BC dataset. The approaches were assessed using the accuracy score and F1 score metrics, and the model with the best metrics was used to develop the enhanced model. After data preprocessing and visualization, the dataset is reduced to two components, namely the first principal component and the second principal component. Fig. 4 illustrates the reduced PCA components. As can be seen in Fig. 4, the reduced dataset consists of two principal components that serve as a compact representation of the original dataset. The reduced dataset can be used to clearly discriminate between the two classes of the target variable ('diagnosis'). The PCA components form a NumPy matrix array in which each row represents a principal component and each column relates back to one of the original features. This relationship can be visualized as a heatmap (Fig. 5).

V. DISCUSSIONS
The results show that the AdaBoost classifier performs best among the classifiers considered in terms of F1_score and accuracy_score; it was therefore used to implement the final enhanced model. The RandomForest classifier achieved the same accuracy_score as the AdaBoost classifier (94.5%), but the AdaBoost classifier achieved a higher F1_score (95.8% versus 95.7%). Hence, the reduced PCA components were fed to the AdaBoost classifier to validate the enhancement provided by the PCA-reduced features. Fig. 6 shows the enhanced performance metrics, with an overall accuracy of 99%. The enhanced model is then used to make predictions on new data to validate its performance. The developed classifier was able to classify the new data into the correct classes, 'Malignant' or 'Benign'. The final accuracy_score and F1_score are depicted in Fig. 7: the model obtained a noticeably higher F1_score (99%) and accuracy_score (98.8%). Therefore, it can be concluded that reducing the dataset with PCA can enhance classification performance on high-dimensional datasets. Furthermore, dimensionality reduction simplifies the classification process in ML, resulting in a better fit of the constructed classifier.

VI. CONCLUSIONS
This research used the PCA technique to reduce the input features of the BC dataset, seeking to enhance BC classification in terms of F1_score and accuracy_score. The work started with a comparison of performance metrics between four supervised classification techniques, namely RandomForest, DecisionTree, AdaBoost, and GradientBoosting. The RandomForest classifier obtained an F1_score of 95.7% and an accuracy_score of 94.5%, the DecisionTree classifier obtained an F1_score of 93% and an accuracy_score of 91%, the GradientBoosting classifier obtained an F1_score of 95% and an accuracy_score of 93.5%, and the AdaBoost classifier obtained an F1_score of 95.8% and an accuracy_score of 94.5%. Since the AdaBoost classifier scored the highest performance metrics, it was used to implement the final model on the PCA-reduced dataset. The developed classifier is named “pcaAdaBoost”. The optimized pcaAdaBoost achieved higher performance metrics, with an F1_score of 99% and an accuracy_score of 98.8%. The results show that the optimized pcaAdaBoost delivered the best results in cross-validation and testing, with an overall accuracy of 99%. As future work, the developed classifier should be trained and tested on different datasets to validate its ability to enhance performance metrics. Finally, it is hoped that the developed model will provide a predictive tool for early diagnosis and classification of BC in the wider community.

VII. DATA AVAILABILITY
The data used to develop this model and to support the findings of this research can be accessed online at the UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set.