Classification based on Clustering Model for Predicting Main Outcomes of Breast Cancer using Hyper-Parameters Optimization

Breast cancer is a deadly disease in women. Predicting breast cancer outcomes is very useful in determining an efficient treatment plan for new breast cancer patients. Prediction of breast cancer outcomes (also called prognosis) is based on previous patients’ data, which record the patients’ characteristics and how the doctors treated them. In this paper we propose a new efficient model for predicting the main outcomes of breast cancer: survival rate, disease-free survival, and recurrence. The proposed model uses two techniques to increase the accuracy of the predictions. The first technique applies the classification model to individual data clusters rather than to the full dataset: the data are grouped into clusters according to the similarity of their main characteristics, and the classification model is then applied to each cluster. The second technique uses Hyper-Parameters Optimization (also called Hyper-Parameters Tuning) to increase the accuracy of the classification model: the proposed model searches for a tuple of hyper-parameters that yields an optimal model, one that minimizes a predefined loss function on the given dataset. The experimental study shows in detail how these two techniques together yield an efficient prediction model that produces accurate results.
Keywords: Breast cancer; Survival Rate (SR); Disease Free Survival (DFS); recurrence detection; egy; prediction; data mining; classification; clustering; hyper-parameters optimization


I. INTRODUCTION
Breast cancer is a well-known type of cancer that affects women; it is considered one of the deadliest diseases today [1]. Diagnosing the disease at an early stage gives better chances of a good outcome. The main types of breast cancer prognosis are the five-year survival rate, disease-free survival, and recurrence detection [2]. The survival rate indicates how many patients diagnosed with breast cancer have survived for a given time, which can be 5, 7, or 10 years depending on the time interval used. Survival rates give a good indication of how successful the treatment was in these cases. Recurrence is the case in which the cancer comes back, after a time, in the same breast, in the other breast, or in the chest. Disease-Free Survival (DFS) is a measure of time: the interval, starting from the beginning of the treatment plan, during which the patient survives without any symptoms of the cancer. Measuring disease-free survival is a way to see how well the treatment works. It is also called relapse-free survival (RFS). In Egypt, breast cancer has an incidence rate of 48.8 per 100,000 and a mortality rate of 19.2 per 100,000 [3]. According to the WHO cancer country profiles of 2014, 21.6% of cancer deaths among women in Egypt are due to breast cancer.
Data mining is of great importance in the healthcare field, especially in building models for detection, diagnosis, and prognosis [4]. In this paper, we propose a classification model based on data clustering and Hyper-Parameters Optimization techniques to predict the main outcomes of breast cancer prognosis with the highest possible accuracy.

II. RELATED WORK
Rohit J. Kate et al. [5] built a predictive model for stage-specific survival using the following classifiers: decision tree, naïve Bayes, and logistic regression on the SEER data set. The work showed that predicting survivability per stage achieves higher accuracy than predicting survivability on the full data set.
Houriyeh Ehtemam et al. [6] compared 64 classification techniques for early diagnosis and prognosis of breast cancer. The research was done on 208 Iranian records with 10 attributes, collected between 2014 and 2015. The Bayesian network was the most accurate classifier, with an accuracy of 95.7%.
M. Mehdi Owrang O. [7] predicted the breast cancer survival rate using association rules and naïve Bayes on the SEER dataset. The results show that the two techniques performed very similarly in most cases.
www.ijacsa.thesai.org
Hadi Lotfnezhad Afshar et al. [8] predicted breast cancer survivability using the SEER dataset and three classifiers: SVM, Bayes net, and CHAID. The SVM classifier achieved the highest accuracy with 97.7%, while the other two classifiers achieved 81.8% and 82.2%, respectively.
Woojae Kim et al. [9] built a recurrence prediction model using an SVM classifier on a data set of 733 records and 7 attributes. The study compared the SVM and ANN techniques; the Support Vector Machine scored 84.58% accuracy, better than the Artificial Neural Network.
Hamid Karim Khani Zand [10] predicted breast cancer survivability by comparing several classifiers on the SEER dataset. The C4.5 classifier was the most accurate with 86.7%, compared with 86.5% for the Artificial Neural Network and 84.5% for Naïve Bayes.
Mamour Gueye et al. [11] studied the outcomes of inflammatory breast cancer patients treated in Dakar. Recurrence was found in 45.5% of cases, with a mean time to recurrence of 11.2 months. The survival rate was 31.8% and the median overall survival was 13.3 months.
Omar Farouk et al. [13] studied the characteristics of breast cancer in Egyptian women aged 35 or younger. The results indicate that breast cancer shows more aggressive biological behavior at advanced stages in young Egyptian women. The 3-year overall survival (OS) was 88% and the 5-year OS was 68%. The median disease-free survival was 61 months; the 3-year disease-free survival was 58% and the 5-year disease-free survival was 50%.
S. Kharya [14] predicted breast cancer prognosis on the SEER data set using different classifiers. The decision tree achieved the highest accuracy with 93.62%.
Si Chen et al. [15] introduced a new post-labelling algorithm that partitions the data sample and then analyzes the resulting clusters. Unlabeled data with high labeling confidence are selected, labeled, and added to the labeled training set. The results show that the algorithm achieves the highest average result over all the experimental data sets and outperforms self-training on all of them.
M. I. López et al. [16] used clustering for classification in a meta-classifier. Clustering is performed on the training set, and the resulting model is then used to classify the unseen data in the test set. The results show that the EM clustering algorithm produces results similar to those of the best classification algorithms, especially when only a group of selected attributes is used.
Ritu Yadav et al. [17] described the Nottingham Prognostic Index (NPI), which uses lymph node status, tumor size, and histological grade to define three groups of patients with different probabilities of dying from breast cancer: good, moderate, and poor prognosis. A higher NPI value is associated with a poorer prognosis.
Gordon C. Wishart et al. [18] proposed PREDICT, a prognostic model developed for patients diagnosed with early breast cancer, based on United Kingdom cancer registry data. The model predicts survival after surgery for invasive breast cancer and, for the first time, includes the mode of detection.
Mohammad R. Mohebian et al. [19] used particle swarm optimization (PSO) and bagged decision trees (BDT) to achieve high accuracy in a breast cancer recurrence prediction model (HPBCR). Three classifiers (SVM, decision tree, and multilayer perceptron neural network) were used for comparison. The results show that HPBCR achieved the highest accuracy with 89.2%, while the other three classifiers achieved 77.6% (SVM), 77.1% (decision tree), and 75% (MLP).
Chi-Hyuck Jun et al. [20] used PSO to improve tree-based classification rules. The study used the CART classification algorithm in three stages: tree building, threshold optimization, and simplification of the rules.
To conclude, the previous work on predicting breast cancer outcomes suffers from the following problems: working with a small data set, as in [6]; working with an outdated data set, as in [12]; and predicting only some of the breast cancer outcomes, as in [9, 13]. In addition, from our point of view, working with local data sets such as SEER and the Breast Cancer Wisconsin Data Set does not reflect the main characteristics of other nations, as in [5, 7, 8, 10, 14].

1) Dataset:
The first branch of the Egyptian National Cancer Institute (NCI) for breast tumors is the "NCI First Settlement Hospital". Our study digitized the manual case files of this hospital for the first time and presents the required information in a newly developed dataset for Egyptian patients. The proposed mining model uses this dataset to predict the main outcomes of breast cancer for Egyptian patients; Table 1 shows the selected dataset attributes. The dataset captures the important attributes related to the prognosis process. More than 40 attributes, selected from the manual records, specify the main characteristics of the patients, and about 30 of them are used in the prediction model.
The total number of processed cases is 1692. These cases were diagnosed as breast cancer patients in the period from 2010 to 2012. After preprocessing the data and excluding records with missing data, 1471 records were selected from the whole sample for use in the experimental study. The cases were selected according to the criteria listed in items a) to c).
The BCOAP model uses the TwoStep-AS Cluster algorithm, which is built into the IBM SPSS Modeler tool. TwoStep-AS Cluster is designed for clustering a data set. The algorithm has several desirable features that differentiate it from traditional clustering techniques:
• Handles two variable types, categorical and continuous.
• Selects the number of clusters automatically.
• Can analyze large data files.
2) Phase 2: Features selection: Feature selection is an important ML technique used in building classification models. It reduces the number of features, which results in better classification performance. Its main purpose is to determine the important input features so that the classification model reaches the highest accuracy. A feature is considered good when it is relevant to the predicted class but not redundant with the other features. Feature selection algorithms restrict the inputs to only those features that improve task performance.
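As an illustration of the relevance criterion described in Phase 2, the following minimal sketch ranks categorical features by information gain with respect to the outcome label and keeps the top-ranked ones. It is not the feature selection procedure used in the study, which is not specified here; the attribute names, values, and the helper `select_features` are hypothetical, and the sketch does not check redundancy between features.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_features(rows, labels, feature_names, top_k):
    """Return the top_k feature names ranked by information gain."""
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in feature_names
    }
    return sorted(feature_names, key=lambda n: scores[n], reverse=True)[:top_k]


# Toy records; attribute names and values are hypothetical, not from the study's dataset.
rows = [
    {"tumor_grade": "high", "node_status": "pos", "ward": "A"},
    {"tumor_grade": "high", "node_status": "pos", "ward": "B"},
    {"tumor_grade": "low", "node_status": "neg", "ward": "A"},
    {"tumor_grade": "low", "node_status": "pos", "ward": "B"},
]
labels = ["recurrence", "recurrence", "none", "none"]
selected = select_features(rows, labels, ["tumor_grade", "node_status", "ward"], top_k=2)
print(selected)  # ['tumor_grade', 'node_status']
```

In this toy data the `ward` attribute carries no information about the label (its information gain is zero), so it is the one that is dropped.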
3) Phase 3: Classification: The decision tree is one of the most important and popular ML algorithms. Decision trees are supervised learning algorithms and are very easy to understand. They use the training data to build a predictive model in the form of a tree. The objective is to achieve highly accurate classification with a small number of decisions. A decision tree consists of three types of nodes: decision nodes, chance nodes, and end nodes. Decision trees are drawn from top to bottom, with the root at the top.

IV. EXPERIMENTAL STUDY
The aim of this study is to demonstrate the efficiency of the proposed model in predicting the main outcomes of breast cancer. In the first step, as described in Section 3.2, we use the TwoStep-AS clustering technique to assign each patient's file to a data group (cluster). The TwoStep-AS algorithm recommended ten clusters as the optimal number for the dataset. Next, the Features Selection phase is applied to determine the most important features in the selected dataset as input to the classification phase. The list of selected features is shown in Table 2. For the prediction phase, the Decision Jungle algorithm, a supervised ensemble learning algorithm, is used to create a machine learning model, which is trained to predict the main outcomes of breast cancer in a decision tree form.
The first set of experiments predicts the main breast cancer outcomes at the level of the full dataset and at the level of the clusters, to highlight how clustering yields more accurate predictions.
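The comparison between the full dataset and the individual clusters can be sketched as follows. The example assumes that cluster identifiers have already been computed, and it uses a deliberately simple stand-in classifier (a majority-label lookup on one feature) rather than Decision Jungle. The data are invented, and accuracy is measured on the training records only to keep the example short; a real evaluation would use held-out data.

```python
from collections import Counter, defaultdict


def fit_lookup(rows, labels, feature):
    """Stand-in classifier: majority label for each value of one feature."""
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[feature]][label] += 1
    default = Counter(labels).most_common(1)[0][0]
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda row: table.get(row[feature], default)


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(r) == l for r, l in zip(rows, labels)) / len(labels)


def per_cluster_accuracy(rows, labels, cluster_ids, feature):
    """Fit and score one model per cluster; cluster_ids are assumed precomputed."""
    groups = defaultdict(lambda: ([], []))
    for row, label, cid in zip(rows, labels, cluster_ids):
        groups[cid][0].append(row)
        groups[cid][1].append(label)
    return {
        cid: accuracy(fit_lookup(r, l, feature), r, l)
        for cid, (r, l) in sorted(groups.items())
    }


# Invented data: the meaning of a "high" grade differs between the two clusters.
rows = [{"grade": g} for g in ["high", "high", "low", "low", "high", "high", "high", "low"]]
labels = ["recurrence", "recurrence", "none", "none", "none", "none", "none", "none"]
cluster_ids = [1, 1, 1, 1, 2, 2, 2, 2]

full_accuracy = accuracy(fit_lookup(rows, labels, "grade"), rows, labels)
cluster_accuracy = per_cluster_accuracy(rows, labels, cluster_ids, "grade")
print(full_accuracy)     # 0.75
print(cluster_accuracy)  # {1: 1.0, 2: 1.0}
```

A single model fitted on all records has to compromise between the two groups, whereas a model fitted inside each cluster only has to describe records that resemble each other.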
The second set of experiments applies Hyper-Parameters Optimization to the classifier to test how much it improves the final prediction model. The model is configured to run a loop of 70 iterations to find the optimal classification model. The dataset we use was collected from the patients' files at the NCI First Settlement Hospital. The IBM SPSS Modeler Subscription tool is used. The machine configuration is: Processor: Intel Core i3-3110M CPU @ 2.40GHz; installed memory (RAM): 4.00 GB; system type: Windows Professional 64-bit.
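A 70-iteration tuning loop of the kind mentioned above can be sketched with random search. The search strategy is an assumption, since the text does not state which one was used; the hyper-parameter names, candidate values, and the `toy_loss` objective are hypothetical stand-ins for training a classifier and measuring its validation error.

```python
import random


def random_search(objective, search_space, n_iter=70, seed=0):
    """Sample n_iter hyper-parameter tuples and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(choices) for name, choices in search_space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Hypothetical search space; the names only loosely mirror tree-ensemble settings.
search_space = {
    "n_trees": [4, 8, 16, 32],
    "max_depth": [2, 4, 8, 16, 32],
    "max_width": [16, 32, 64, 128],
}


def toy_loss(params):
    """Stand-in for 1 - validation accuracy; a real objective would train and score a model."""
    return (
        abs(params["n_trees"] - 16) / 32
        + abs(params["max_depth"] - 8) / 32
        + abs(params["max_width"] - 64) / 128
    )


best_params, best_loss = random_search(toy_loss, search_space, n_iter=70, seed=0)
print(best_params, round(best_loss, 3))
```

In practice the objective would wrap the full train-and-validate step for one cluster, so the cost of the loop is roughly 70 model fits per cluster and per outcome.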

V. RESULTS
The results show the efficiency of the proposed BCOAP model in predicting the main outcomes of breast cancer. The model achieved high prediction accuracy for the three main breast cancer outcomes: 5-year survival rate (SR), breast cancer recurrence, and disease-free survival (DFS). The following section shows the prediction results in detail. Table 3 shows the accuracy when the classification model is applied to the full dataset and to the individual data clusters. The comparison shows that accuracy improved in most of the dataset clusters. On the other hand, some clusters showed lower accuracy than the full dataset. The reasons for this decrease are that some clusters contain only one label value in the records to be predicted, and that some records have missing values that affect the prediction. In the 5-year survival rate prediction, clusters 1 to 6 achieved higher accuracy than the full dataset, while clusters 7 to 10 achieved lower accuracy because of missing data that affects the 5-year survival rate prediction. In the recurrence prediction, clusters 1 and 3 to 9 achieved higher accuracy than the full dataset. For disease-free survival, classification based on clustering brought no improvement in prediction accuracy, because the outcome takes a single value in most of the dataset records. This problem affects most classification algorithms.

B. Predicting the Breast Cancer Main Outcomes using the Hyper-Parameters Optimization:
In this section we present the results of the second set of experiments, in which we apply hyper-parameters optimization after classification based on clustering to predict the main breast cancer outcomes. The results show that hyper-parameters optimization is effective in improving the accuracy of predicting the main breast cancer outcomes, which is the purpose of our contribution.

VI. CONCLUSIONS
In this paper we introduced an optimized classification model for predicting the main outcomes of breast cancer. The proposed BCOAP model uses two techniques to increase the prediction accuracy. The first technique is classification based on clustering, in which we used the TwoStep-AS clustering technique to group the data into clusters. After clustering, feature selection is applied to select the most important features as input to the third phase, the classification phase, in which the Decision Jungle algorithm is used as the classification model. The experimental results show that classifying each cluster separately is more accurate than classifying the full dataset for most of the clusters. In the fourth phase, hyper-parameters optimization is used to increase the accuracy by tuning the model parameters to find the optimal classification model. The experimental study demonstrated the efficiency of the proposed BCOAP model and shows that it significantly increases the accuracy of predicting the main outcomes of breast cancer.

a) Being a female patient.
b) The case was diagnosed five or more years ago.
c) A complete data record of the main required data.

B. Proposed Model
In this study, we propose a model for predicting the main breast cancer outcomes, using classification based on clustering and Hyper-Parameters Optimization to achieve the highest possible accuracy. The model is tested on a data set of Egyptian patients developed during the study. The BCOAP model consists of four phases: a Clustering phase, a Features Selection phase, a Classification phase, and a Hyper-Parameters Optimization phase. The following sections explain each of these phases in detail.
1) Phase 1: Clustering: Clustering is a method of unsupervised learning; it is an ML (Machine Learning) technique that groups data points. Given a set of data points, a clustering algorithm assigns each data point to a group. Data points in the same group should have similar properties and/or features, while data points in different groups should have dissimilar properties and/or features. Clustering methods can be used to understand the relations between the data points in each group.
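The grouping idea in Phase 1 can be illustrated with a short k-means sketch. K-means is used here only because it is compact; it is not the TwoStep-AS algorithm, it handles numeric features only, and it needs the number of clusters as an input. The sample points are invented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Invented two-dimensional points forming two well-separated groups.
points = [(1.0, 2.0), (1.2, 2.2), (0.9, 1.9), (8.0, 9.0), (8.2, 9.2), (7.8, 8.8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

The first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.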

4) Phase 4: Hyper-Parameters optimization: Hyper-Parameters Optimization (or tuning) is the problem of choosing a set of optimal hyper-parameters for a learning algorithm. A machine learning model may require different parameter settings to generalize to different data patterns. These settings are called hyper-parameters and must be tuned so that the model can optimally solve the machine learning problem. Many techniques exist for Hyper-Parameters Optimization. Because the BCOAP model is organized in phases, different algorithms can be applied at each phase, which offers more flexibility. In our research we implemented the BCOAP model with the following algorithms: TwoStep-AS Cluster in the clustering phase, Decision Jungle in the classification phase, and Hyper-Parameters Optimization in the final phase. The structure of the BCOAP model is presented in Fig. 1.
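The tuning problem of Phase 4 can be stated compactly as follows. The symbols are notation introduced here for illustration and do not appear elsewhere in the paper: \Lambda is the set of candidate hyper-parameter tuples, A_\lambda is the learning algorithm configured with the tuple \lambda, D_train and D_valid are the training and validation parts of the data, and \mathcal{L} is the predefined loss function.

```latex
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda \in \Lambda}\;
\mathcal{L}\bigl(A_{\lambda}(D_{\text{train}}),\; D_{\text{valid}}\bigr)
```

Under this reading, each iteration of a tuning loop evaluates the right-hand side for one candidate \lambda and keeps the best tuple found so far.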

Fig. 3. Predicting the Breast Cancer Recurrence Type.

Fig. 4. Predicting the Breast Cancer Disease Free Survival.


FUTURE WORK
Future work of this research includes the following main points:
• Utilizing different clustering and decision tree algorithms to increase the efficiency of the proposed model.
• Considering the different therapy methods in the data set and the prediction outcomes.
• Extending the proposed model to highlight the outliers in the patients' characteristics.