A Comparative Analysis of Data Mining Techniques on Breast Cancer Diagnosis Data Using WEKA Toolbox

Breast cancer is considered the second most common cancer in women. It is fatal in less than half of all cases and is the main cause of mortality in women. It accounts for 16% of all cancer mortalities worldwide. Early diagnosis of breast cancer increases the chance of recovery, and data mining techniques can support early diagnosis. In this paper, an academic experimental breast cancer dataset is used to perform a practical data mining experiment using the Waikato Environment for Knowledge Analysis (WEKA) tool. The WEKA Java application is a rich resource for computing performance metrics during the execution of experiments. Pre-processing and feature extraction are used to optimize the data. The classification process used in this study is summarized through thirteen experiments, ten of which used different classification algorithms: Naïve Bayes, Logistic Regression, Lazy IBK (Instance-Based learning with parameter K), Lazy Kstar, Lazy Locally Weighted Learner, Rules ZeroR, Decision Stump, Decision Trees J48, Random Forest and Random Trees. The process of producing a predictive model was automated with the use of classification accuracy. Further, several experiments on the classification of the Wisconsin Diagnostic Breast Cancer and Wisconsin Breast Cancer datasets were conducted to compare the success rates of the different methods. The results show that the Lazy IBK (k-NN) classifier achieved the highest accuracy among the tested classifiers, 98%. The main advantages of the study are the compactness of using 13 different data mining models and 10 different performance measurements, and the plotting of classification errors.
Keywords: Data mining; breast cancer; data mining techniques; classification; WEKA toolbox


I. INTRODUCTION
Worldwide, breast cancer has become one of the most common cancers [1]. It originates in the area of the breast tissue that has a concentration of milk ducts. Although most cases occur in women, cases have also been reported in men. Breast cancer has noticeable signs and symptoms. The first noticeable symptom is usually a mass that feels different from the rest of the breast tissue. Most women, about 80%, discover these masses during self-examinations.
Breast cancer can be classified as benign or malignant; however, this classification is determined through diagnostic testing. Some criteria to consider are uniformity of cell size and shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, and normal nucleoli. By observing these criteria, doctors or scientists are able to make a diagnosis according to the patient's diagnostic test results.
Certain unhealthy lifestyle choices tend to put some people in danger of developing breast cancer: smoking, consuming fatty food and alcohol, lack of exercise, and stress. Although genetics plays a minor role in many breast cancer cases, unhealthy habits contribute more.
The rapid spread of breast cancer and the inability to accurately diagnose and recognize its presence represent a challenge for researchers and developers in biomedical engineering [2]. This challenge motivates the deployment of new data mining techniques. Data mining is the extraction of previously unknown and potentially useful information from historical data. It also includes knowledge discovery in, and analysis of, data saved in a data repository. Important data mining methods include classification, association, clustering, and regression [3]. The focus of this paper is to drive research towards new feasible solutions for mining breast cancer data. Thus, a data mining-based experiment for breast cancer classification is introduced with different types of classifiers. In addition to identifying the classifier model that yields the highest classification accuracy for the predefined dataset used in this study, the data mining process is implemented by applying pre-processing operations to, and extracting features from, the specified data records of the dataset using WEKA.
WEKA (Waikato Environment for Knowledge Analysis) is open-source software that contains a set of algorithms for data mining tasks [4]. These algorithms can be applied to a dataset either directly through the WEKA interface or via Java code. The different classifiers are then run with different variables, using several algorithms and multiple options, to find the best accuracy.
In this study, the classification process was summarized through 13 experiments: three experiments using the Bayes Net algorithm with three different search mechanisms, and ten experiments using the classification algorithms Naïve Bayes (NB), Logistic Regression, Lazy IBK (Instance-Based learning with parameter K), Lazy Kstar, Lazy Locally Weighted Learner (LWL), Rules ZeroR, Decision Stump, Decision Trees J48, Random Forest, and Random Trees. The aim was to create a predictive model that can be tested with new records, to obtain its classification accuracy, and to compare the results of the different algorithms with those of IBK (k-NN), which is regarded as a slow algorithm. Nevertheless, k-NN ranked best in terms of time and accuracy. The optimal classifier was determined so that new records can be classified for accurate breast cancer identification.
Moreover, this study aimed at utilizing data mining techniques to diagnose breast cancer using breast cancer patients' datasets [5]. A review of the literature shows many efforts to apply data mining to breast cancer datasets; however, previous studies lack comparisons of WEKA classifiers under different parameter values and attributes. The experiments in this research used thirteen different data mining algorithms as well as feature selection for data cleansing. This study shows competitive results compared to the previous studies mentioned in the literature.
The organization of this paper is as follows: Related works on breast cancer datasets using thirteen different classifiers are discussed in Section II, followed by Section III, which introduces the classification algorithm used in the experiment. Section IV introduces the methodology used in this work; all the data mining techniques that are compared and analyzed are illustrated in Section V. Section VI concludes the proposed study and highlights the most accurate classifiers.

II. RELATED WORK
The comparisons made in [6] were based on the performance of four machine learning algorithms, Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (C4.5) and k-Nearest Neighbors (k-NN), and were conducted on the Wisconsin Breast Cancer (WBC) datasets. The objective of the experiments was to determine the effectiveness of each algorithm with regard to precision, accuracy, specificity, and sensitivity. The results showed that SVM obtained 97.13% accuracy and outperformed Naïve Bayes, C4.5, and k-NN, whose accuracies ranged between 95.12% and 95.28%.
In [7], Genetic Algorithm (GA) was used alongside different data mining techniques for WBC. GA was used to extract significant and informative features to reduce computational complexity and enhance the data mining processing speed. Data mining techniques used in this study were: Decision Trees (DT), Bayesian Network (BN), Logistic Regression (LR), Random Forest (RF), SVM, Rotation Forest, Radial Basis Function Networks (RBFN) and Multilayer Perceptron (MLP). Two WBC medical datasets (WBC and Wisconsin Diagnostic Breast Cancer (WDBC)) were used to test the performance of the algorithm models. The highest accuracy of 99.48% was obtained by Random Forest and GA feature selection.
The study conducted by [8] aimed at diagnosing breast cancer using three different techniques: SVM, DT, and Artificial Neural Network (ANN). The study was applied to the WDBC dataset from the UCI repository. Feature selection was applied to increase the effectiveness of the methods. The ensemble method yielded the best results among the methods used, with 98.77% accuracy, 98.05% sensitivity and 100% specificity.
In [9], the authors used three well-known data mining methods, namely Naïve Bayes (NB), J48, and RBF Network, to develop prediction models for the survivability of breast cancer. The data, which contain 683 instances, were acquired from the UCI repository [5]. To develop the prediction models, data selection, pre-processing, and transformation were applied. The results showed that Naïve Bayes performed best with a classification accuracy of 97.36%, followed by RBF Network with 96.77% and J48 with 93.41%.
The work by [10] used 12 different machine learning techniques for the diagnosis of breast cancer: NB, Decision Table, Ada Boost M1, J48, J-Rip, Logistic Regression, Lazy IBK, Lazy K-star, Multiclass Classifier, Multilayer-Perceptron, RF, and RT. The WBCD dataset was used to train the models. Most of the applied methods scored above 94%. Only NB underperformed compared to the other models, with an accuracy of 73.21%. RT and the Lazy classifier algorithms outperformed the others with an accuracy close to 99%.
In [11], the authors evaluated six different data mining techniques, namely SVM, Bayes Network (BN), ANN, k-NN, Decision Tree (C4.5) and Logistic Regression. The WEKA tool was used for the experiment on the WBC dataset. SVM and BN yielded the highest accuracy of 97.28%. However, the BN classifier required less time than SVM, which makes the BN classifier the better choice.
In [12], researchers employed eight different data mining techniques for breast cancer prediction. The dataset used for the experiment was WPBC [5]. The experiments covered four classification algorithms (SVM, DT C5.0, NB and k-NN) and four clustering algorithms (EM, K-means, PAM and Fuzzy c-means), and were implemented using the R programming tool. The results showed that the classification algorithms performed better than the clustering algorithms: SVM and DT (C5.0) had the best accuracy of 81%, and Fuzzy c-means had the lowest accuracy of 37% among the tested algorithms.
The study conducted by [13] utilized three data mining techniques to classify breast cancer as either malignant or benign. The techniques applied to the WDBC breast cancer dataset [5] were LR, NB and DT. The results showed that Logistic Regression (LR) achieved the highest classification accuracy of the three tested classifiers, 97.90%.
The study conducted by [14] proposed nested ensemble methods to distinguish benign tumors from malignant breast cancers. Each ensemble method contains "Classifiers" as well as "Metaclassifiers", which can have more than two classification algorithms. Metaclassifiers were developed in the two-layer nested ensemble. The dataset used for the experiments was WDBC. The proposed method was compared to conventional single classifiers such as BN and NB. The results indicated that the two-layer nested ensemble method outperforms the single classifiers.
To analyze breast cancer data, [15] utilized four different DT classification algorithms, namely, Classification and Regression Trees (CART), J48, Best First Tree (BF Tree) and DT (AD Tree). The experiment employed the WEKA tool, and the results demonstrated that the J48 classifier reached the highest accuracy of 99% whereas the CART algorithms resulted in 96% accuracy; AD Tree algorithm resulted in 97%, and BF Tree algorithm resulted in 98%.
In the experiment of this research study, k-NN achieved the highest accuracy of 98%, whereas in [7,4,1], RF achieved the highest accuracy. The highest accuracy in [8] was achieved by the ensemble of SVM, DT, and ANN. The literature shows that a different technique achieved the highest accuracy in each study: in [11], the highest accuracy was achieved by Bayes Network (BN); in [12], by SVM and DT (C5.0); in [13], by LR; and in [15], by the J48 classifier.
The analysis of the literature also indicates that the proposed research yields competitive results in terms of accuracy. However, not all of the previous works that used WEKA for data mining addressed the same task. For example, [10], [13] and [14] used data mining methods to classify cases of breast cancer into malignant and benign. Moreover, the other studies experimented with only a few techniques, while this study tested thirteen different algorithms.

III. CLASSIFICATION ALGORITHM
The experiment for this study ran the k-NN algorithm on the dataset; this algorithm is known as IBK in the WEKA toolbox. For each test instance, the IBK classification algorithm measures the distance to the training instances to identify the k nearest ones, and then uses those selected instances to derive the prediction. This mechanism is referred to as the k-NN algorithm. Fig. 1 shows the pseudo code of the k-NN algorithm [16,17]. In the IBK algorithm, no model is built in advance; instead, a just-in-time prediction is generated for each test case. The experiments aimed to determine which classifier in the WEKA toolbox had the highest accuracy on the breast cancer patients' dataset, and they compared thirteen algorithms across various accuracy measurements.
Each algorithm is described briefly in terms of how it operates and its key parameters. The parameters of the algorithms used in this study are highlighted in Table III. In k-NN, the size of the neighborhood is controlled by the parameter k. The distance metric is another important parameter. By default, k-NN uses the Euclidean distance, which is well suited to quantitative attributes on the same scale.
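The k-NN mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the WEKA IBK implementation: the function names `euclidean` and `knn_predict` and the toy records are hypothetical, and ties are broken arbitrarily.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training instances."""
    # No model is built in advance: all of the work happens at prediction time.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy records, made up for illustration (not taken from the study's dataset).
train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_X, train_y, (2, 2), k=3))
print(knn_predict(train_X, train_y, (8, 8), k=3))
```

Because the Euclidean distance sums squared differences over all attributes, an attribute with a much larger range would dominate the vote, which is why attributes on the same scale suit this metric.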
In SVM, the parameter C, called the complexity parameter in WEKA, governs the flexibility with which the line separating the groups can be drawn. A value of 0 allows no margin violation, while the default is 1. The type of kernel is another key parameter in SVM. The simplest kernel is the linear kernel, which separates the data with a straight line or hyperplane. The default kernel in WEKA is the polynomial kernel; the higher its exponent, the wigglier the boundary, so that the classes are separated by a curved or angled line. In the LR algorithm, the ridge parameter determines how strongly the algorithm is forced to shrink the coefficient values. This regularization is deactivated by setting the parameter to 0.
www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 8, 2020
To implement the NB algorithm, the Kernel Estimator option was used, which may suit the actual distribution of the attributes in the dataset more accurately. For DT, the researchers of this study chose to set the no-pruning parameter to True. The minimum number of instances per leaf node when constructing the tree from the training data was set to 7.
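The role of the ridge parameter in LR can be made concrete with a small sketch. The code below assumes a binary logistic model and a penalty equal to the ridge value times the sum of squared coefficients, with the intercept left unpenalized; the function name `penalized_loss`, these modelling choices and the toy numbers are assumptions for illustration and are not taken from the study or guaranteed to match WEKA's internals.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def penalized_loss(weights, bias, X, y, ridge):
    """Negative log-likelihood of a binary logistic model plus a ridge penalty."""
    nll = 0.0
    for features, label in zip(X, y):
        p = sigmoid(bias + sum(w * f for w, f in zip(weights, features)))
        nll -= label * math.log(p) + (1 - label) * math.log(1 - p)
    # Ridge term: grows with the squared size of the coefficients.
    # Assumption: the intercept (bias) is not penalized.
    penalty = ridge * sum(w * w for w in weights)
    return nll + penalty

# Toy one-attribute data, made up for illustration.
X = [(0.5,), (1.5,), (3.0,), (4.0,)]
y = [0, 0, 1, 1]
weights, bias = (2.0,), -4.5

print(penalized_loss(weights, bias, X, y, ridge=0.0))  # regularization off
print(penalized_loss(weights, bias, X, y, ridge=1.0))  # large coefficients cost more
```

With `ridge=0.0` the penalty vanishes and the loss reduces to the plain negative log-likelihood; any positive value makes large coefficients more expensive, which pushes the fitted coefficients toward zero.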

IV. METHODOLOGY
The methodology followed in this paper started with data collection and data preparation from the BC dataset. Data mining techniques were then applied to generate a classification model that was used for evaluation and deployment. The proposed study framework is shown in Fig. 2. This framework represents the standard data mining framework that should be followed in various data mining applications. It states that, before data processing, selection and pre-processing should be performed to obtain clean and complete data records. The data are then transformed, that is, converted into a format compatible with the data mining tool, so that the data mining algorithms can be run and the desired knowledge can be generated.
The analysis of the breast cancer dataset with the WEKA machine learning software aims at mining the relationships in breast cancer data for efficient classification. A dataset is a collection of data that corresponds to the contents of one statistical data matrix or one database table. This dataset is then processed using the WEKA toolbox. Table I provides an overview of the dataset used in the experiments. A dataset lists the values of each variable, such as the height and weight of an object, for each member of the dataset, and each such value is known as a datum [17][18]. The dataset may include data for one or more members, corresponding to the number of rows [19][20].
For this study, a breast cancer patients' dataset was collected. It consists of 699 patient records with 10 different attributes, of which nine nominal attributes were selected, as shown in Table II. The data mining algorithms were explored to identify efficient classification of the breast cancer dataset [5]. Accuracy was used as the main basis for comparison, while other metrics were also considered, such as the Precision-Recall Curve (PRC), sensitivity (recall), the Receiver Operating Characteristic (ROC) and the Matthews Correlation Coefficient (MCC). In this section, an experimental analysis of the effectiveness of the thirteen classifiers on the same dataset is presented. The experiments were implemented using the WEKA toolbox and run on a Toshiba desktop computer with an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz and 8192MB of RAM. Each classification model was trained on 75% of the selected dataset records and evaluated on the remaining 25%.
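Several of the metrics named above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names, the label codes "M" and "B", and the toy predictions are hypothetical and do not reproduce the study's results.

```python
import math

def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class treated as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall (sensitivity) and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # MCC is defined as 0 here when a row or column of the matrix is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Toy test-set labels, made up for illustration ("M" = malignant, "B" = benign).
actual    = ["M", "M", "M", "B", "B", "B", "B", "B"]
predicted = ["M", "M", "B", "B", "B", "B", "B", "M"]

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="M")
for name, value in metrics(tp, tn, fp, fn).items():
    print(f"{name}: {value:.3f}")
```

Accuracy counts all correct predictions equally, whereas recall and MCC remain informative when one class is much more frequent than the other, which is one reason to report them alongside accuracy.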
An alternative performance measurement was therefore based on the confusion matrix, specifically the ROC curve shown in Fig. 3. The accuracy of Algorithm 9 (the Rules ZeroR algorithm) was very weak; in contrast, the PRC is more useful here because it shows how the classifier behaved on one class.
The strong and weak classifiers (Fig. 3) were reviewed and analysed in order to investigate these results further (Table IV). Groups of classifiers with very similar performance were found; similar performances from similar classifiers are highlighted in bold in Table IV.
In Fig. 4, Rules ZeroR was the weakest classifier, which ultimately led to a lower accuracy rate, while the other investigated algorithms produced accuracy rates above 90%. Fig. 4 also shows that the best-performing classifiers in the three measurements were Algorithm 5 (the logistic algorithm) and Algorithm 6 (the lazy-IBK algorithm). In Fig. 4, the results of the thirteen classifiers show the performance of both the weak classifiers (shown in orange) and the strong classifiers (shown in blue). The classifiers were evaluated on the dataset illustrated in Table II.
The statistical records for the execution of the experiments, shown in Fig. 5, are the kappa statistic, the absolute error, and the mean error. The classifier that failed in the experiment was Algorithm 5 (the logistic algorithm). All statistics correspond to the classifiers' results, where the types of errors are computed as part of the Kappa statistics plots.
Table V shows the accuracy and the elapsed time of each algorithm and its configurations during the experiment setup. The worst performance was by the Rules ZeroR algorithm, and the best by the Lazy IBK and K-star algorithms, while all other tests achieved more than 90% accuracy. The highest performance results are grouped and highlighted in bold.
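The weak result of Rules ZeroR and the role of the kappa statistic can be illustrated together. The sketch below assumes that ZeroR simply predicts the majority class of the training data and that the kappa statistic is Cohen's kappa; the function names and toy labels are hypothetical and do not reproduce the values in Fig. 5 or Table V.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the majority class of the training labels (a ZeroR-style baseline)."""
    return Counter(train_labels).most_common(1)[0][0]

def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement between predictions and truth, corrected for chance."""
    n = len(actual)
    observed = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    actual_freq = Counter(actual)
    predicted_freq = Counter(predicted)
    # Agreement expected by chance from the marginal class frequencies.
    expected = sum(actual_freq[c] * predicted_freq[c] for c in actual_freq) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present on both sides
    return (observed - expected) / (1.0 - expected)

# Toy labels, made up for illustration.
train_labels = ["benign"] * 6 + ["malignant"] * 4
test_labels = ["benign"] * 7 + ["malignant"] * 3

majority = zero_r(train_labels)
baseline_predictions = [majority] * len(test_labels)
accuracy = sum(1 for a, p in zip(test_labels, baseline_predictions) if a == p) / len(test_labels)

print(majority, accuracy, cohen_kappa(test_labels, baseline_predictions))
```

In this toy case the baseline's accuracy equals the share of the majority class in the test labels, yet its kappa is zero because it agrees with the truth no more often than chance would. This is why a chance-corrected statistic is a useful companion to raw accuracy when comparing classifiers.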

VI. CONCLUSIONS
This paper studied a common practical problem in the detection or recognition of data patterns using data mining techniques. The comparative analysis proposed here for the breast cancer dataset, using different pre-processing techniques, was conducted with the WEKA data mining tool. The final evaluation of the classification processes was done by extracting the accuracy of all experiments, and the results showed rates ranging between 72% and 98%. The other ten experiments used different classification algorithms to obtain the highest accuracy; among them, the IBK and Kstar algorithms from the Lazy family showed the best performance, up to 98.2%, in terms of time and accuracy. These results can be further used in real-world applications such as development models introduced for biomedical laboratory technology.
In the future, this study will be extended by utilizing deep learning techniques to achieve higher accuracy. Moreover, the proposed technique will be applied to datasets for different cancer types.