Optimized Machine Learning based Classifications of Staging in Gynecological Cancers using Feature Subset through Fused Feature Selection Process

After a cancer is diagnosed, the next step is to identify its stage so that an appropriate treatment plan can begin. Of the several kinds of gynaecological cancer, this research focuses on cervical and ovarian cancers and the classification of their stages, using cervical and ovarian cancer data from the SEER registry. The work proposes an optimized classification method for staging prediction in gynaecological cancers through a fused feature selection process that aims to provide an optimal feature subset. The fused process hybridizes the Relief filter approach with the genetic algorithm wrapper method to produce a revised feature subset of the data. Accordingly, this work attained an improved feature subset for precise classification of cervical and ovarian cancer stages by identifying their significant features. The predictive models are built with 10-fold cross validation using major classification algorithms: C5.0, Random Forest and KNN. Classification results are obtained for the respective cervical and ovarian cancer stages, and stage-wise classification by patient age is also obtained with the proposed method. The results show that the incidence of cervical and ovarian cancers is more critical among women aged 45 and above. The Random Forest method showed improved accuracy together with improved values for the other performance outcomes. This work also found that selecting an optimal feature subset can reduce the complexity of the predictive model.

Keywords: Ovarian cancer; cervical cancer; diagnosis; gynaecological cancers; staging; feature selection; machine learning; classification


I. INTRODUCTION
Gynaecological cancer denotes five types of cancer that start in a woman's reproductive organs. Cervical cancer is a form of gynaecological cancer that originates in the cells lining the cervix. It is most often identified in women between the ages of 35 and 44, and the average age at diagnosis is 50 years. The risk of developing cervical cancer also rises as patients grow older. Ovarian cancer is a type of gynaecological cancer that has become more dangerous in recent times and ranks fifth in cancer deaths among women [1]. Early detection of ovarian cancer could have a large influence on the cure rate and is urgently needed [2], but only 20% of cases are found at an early stage. The study [3], which examined survival outcomes in ovarian cancer patients, stressed that accurate estimation is essential because prognosis can determine how aggressive the medication should be. Both cervical and ovarian cancers are critical, yet early detection of these cancers is unreliable for most women. After any type of cancer is diagnosed, its stage must be identified to establish how far it has affected other organs of the body. The staging procedure helps to decide on better treatment plans and provides survival information. It is therefore essential to identify the stages of these two gynaecological cancers accurately so that effective treatment can begin. The paper is structured as follows. Section II discusses the literature study, and Section III presents the staging classes in cervical and ovarian cancers. Section IV discusses the proposed methodology for staging classification, and Section V the implementation procedure. The classification outcomes for cervical and ovarian cancer stages are discussed in Section VI, the experimental results are shown in Section VII, and the conclusion is given in Section VIII.

II. LITERATURE STUDY
Machine Learning (ML) techniques are effective in diagnosing various types of cancer and predicting their stages. The study [4], which diagnosed and classified the stages of ovarian cancer, used classification and clustering methods to train on cancer images with respect to the ovarian cancer stages and attained 94% accuracy. That work aimed to improve the sensitivity measure. The work [5], which proposed a system for staging prediction in cervical cancer, stressed that genetic algorithms are efficient in processing huge quantities of information, but it did not compare the classifiers across multiple performance metrics. The study [6], which used an open dataset supported by the Gynecologic Cancer Society, applied the SVM technique and suggested that SVM achieves better results in classifying cervical cancer stages; the dataset used, however, is moderately small. For staging prediction in cervical cancer, the study [7] designed the CVSS dictionary learning framework using multi-view MR images. It reported classification accuracy in identifying the stages of cervical cancer, but the accuracy is not satisfactory. The comparative study [8], which used various classifiers to identify the stages in cervical cancer, suggested J48 as the suitable method for stage classification with an accuracy of 93%, although the sensitivity and specificity scores were low. A decision tree-based procedure applied in the study [9] used cervical cancer data from IGCS for staging classification. This method used correlation-based feature selection and the C5 algorithm, but its accuracy is not consistent. A staging classification study on ovarian cancer data combined the outcomes of ten feature selection methods to select the subset for the eventual classification and thereby improve the accuracy [10].
The literature study shows that the accuracy attained in staging classification of cervical and ovarian cancer is not adequate. Classifying these cancer stages among women in various age groups is also necessary to provide better treatment plans. Accordingly, a methodology is proposed here, and the results are extended with broad staging classifications for cervical and ovarian cancers among women in various age groups.

III. TYPES OF STAGING CLASSES IN CERVICAL AND OVARIAN CANCERS
Staging is the procedure of identifying the amount of cancer in an individual's body and its location. It helps to determine how severe the cervical or ovarian cancer is and to decide on the most suitable treatment.

A. Cervical Cancer Staging Classes
The most common staging classification for cervical cancer is the FIGO (International Federation of Gynaecology and Obstetrics) system. The cervical cancer stages are summarized in Table I.

TABLE I. CERVICAL CANCER STAGES

FIGO Stage    Stage Description
Stage I       Cancer has spread from the lining of the cervix into the deeper tissue.
  IA          Tumour is < 5 mm deep and < 7 mm wide.
  IA1         Tumour is <= 3 mm deep and < 7 mm wide.
  IB          Tumour is in the cervix, and its size is wider than in IA2.
  IB1         Tumour is <= 4 cm at its widest part.
  IB2         Tumour is >= 4 cm at its widest part.
Stage II      Cancer has spread outside the cervix to nearby parts.
  IIA         Tumour has not spread into tissues adjacent to the cervix and uterus.
  IIA1        Tumour is <= 4 cm at its widest part.
  IIA2        Tumour is >= 4 cm at its widest part.
  IIB         Tumour has spread to tissues adjacent to the cervix and uterus.
Stage III     Tumour has spread to the pelvic walls.
  IIIA        Tumour has grown into the lower part of the vagina.
  IIIB        Tumour has grown into the pelvic walls and has blocked a ureter.
  IVA         Tumour has grown into the bladder or rectum.
  IVB         Cancer has spread to more distant parts of the body.

B. Ovarian Cancer Staging Classes
The FIGO (International Federation of Gynaecology and Obstetrics) system and the AJCC (American Joint Committee on Cancer) TNM staging system are the two main systems used for staging classification in ovarian cancer. The ovarian cancer stages are shown in Table II.

TABLE II. OVARIAN CANCER STAGES

FIGO Stage    Stage Description
Stage I       Tumour is confined to the ovaries.
  IA          Tumour is limited to one ovary.
  IB          Tumour affects both ovaries.
  IC          Tumour involves one/both ovaries with subsequent consequences.
Stage II      Cancer has affected one/both ovaries and has spread to other pelvic areas.
  IIA         Cancer has extended to and/or implanted on the uterus.
  IIB         Cancer has extended to other pelvic intraperitoneal tissues.
  IIC         Stage IIA or IIB with positive washings/ascites.
Stage III     Tumour involves one/both ovaries, with cytologically confirmed spread to the peritoneum outside the pelvis.
  IIIA        Cancer is limited to the pelvis, but cancer cells visible only under a microscope have spread to the peritoneum outside the pelvis.
  IIIB        Cancer has spread to the peritoneum; its size is <= 2 cm.
  IIIC        Cancer has spread to the peritoneum with a size >= 2 cm and/or has spread to the abdominal lymph nodes.
Stage IV      Cancer has spread beyond the abdomen to other organs, such as the lungs or the tissue inside the liver.

IV. PROPOSED METHODOLOGY FOR STAGING CLASSIFICATIONS
Various studies show that combined feature selection approaches handle high-dimensional data effectively and achieve better classification results [11] [12]. This research aims to provide an optimized classification method for staging prediction on cervical and ovarian cancer data with improved performance. Accordingly, a methodology is proposed here based on the Revised and Improved Feature Subset through Fused Feature Selection process (RIFSt_2FS) framework. The proposed methodology is depicted in Fig. 1.
After the data are obtained from the registry, they are pre-processed to handle missing values and to format the data. An initial feature set is generated from the initial features. Irrelevant and redundant features need to be removed to attain an effective classifier model [13]. To attain an enhanced feature subset, the procedure defined in the RIFSt_2FS framework is implemented using Relief and the Genetic Algorithm. Once the enhanced and revised feature subset is available, prominent ML classification algorithms are applied for the various types of staging classification. The best, optimized classification approach is selected by evaluating the resulting models.
The RIFSt_2FS framework, shown in Fig. 2, is designed to produce an optimal feature subset that improves prediction performance for staging classification in gynaecological cancers. The implementation procedures of the proposed framework for gynaecological cancer staging prediction with the fused feature selection process are presented in this section.

A. Procedure for Classification of Cervical and Ovarian Cancer Stages with RIFSt_2FS Framework
The procedure for implementing the proposed methodology is designed as a sequence of phases, as follows.

B. Dataset
SEER (Surveillance, Epidemiology, and End Results) [14] is a database that contains comprehensive information on the incidence of all cancer types. The registry holds population data on cancer patients from America and from Asian/Pacific regions. To select only relevant data, a few conditions were applied: region (Asian/Pacific Islander/Indian patients), cancer type of ovary (C56.9) or cervix (C53.9) with confirmed diagnosis, and year of diagnosis from 2000 to 2017. The selected instances were organized as CSV files.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 7, 2020, www.ijacsa.thesai.org
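The selection conditions for the dataset can be illustrated as a simple row filter. The sketch below is only an illustration in Python: the column names (`site`, `race_group`, `confirmed`, `year_dx`) and the sample rows are hypothetical and do not correspond to the actual SEER export fields.

```python
import csv
import io

# Site codes named in the text; the column names used below are hypothetical.
SITE_CODES = {"C56.9": "ovary", "C53.9": "cervix"}

def select_cases(rows, first_year=2000, last_year=2017):
    """Keep confirmed cases that match the site, region and year conditions."""
    selected = []
    for row in rows:
        if row["site"] not in SITE_CODES:
            continue
        if row["race_group"] != "Asian/Pacific Islander":
            continue
        if row["confirmed"] != "yes":
            continue
        if not first_year <= int(row["year_dx"]) <= last_year:
            continue
        selected.append(row)
    return selected

# Made-up rows standing in for a CSV export.
sample = io.StringIO(
    "site,race_group,confirmed,year_dx\n"
    "C56.9,Asian/Pacific Islander,yes,2005\n"
    "C53.9,Asian/Pacific Islander,yes,1998\n"
    "C50.9,Asian/Pacific Islander,yes,2010\n"
    "C53.9,Asian/Pacific Islander,no,2012\n"
    "C53.9,Asian/Pacific Islander,yes,2017\n"
)
cases = select_cases(csv.DictReader(sample))
print([(c["site"], c["year_dx"]) for c in cases])
```

Only the first and last sample rows satisfy all four conditions; the others fail on year, site code, or confirmation status.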

C. Data Pre-Processing
The selected instances had missing values in a few columns. Missing values were replaced with the mean or the mode, depending on the attribute type, and the few rows with a large number of missing fields were removed. After pre-processing, the cervical cancer dataset comprises 4062 instances and the ovarian cancer dataset comprises 5843 instances. The cervical and ovarian cancer instances are split into training and test data for training and evaluating the prediction models.
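A minimal sketch of this kind of pre-processing is given below, assuming Python and plain dictionaries as rows. The attribute names, the `max_missing` threshold and the 25% test fraction are illustrative assumptions; the text does not state the threshold or the split ratio that were actually used.

```python
import random
from collections import Counter
from statistics import mean

def clean(rows, numeric_cols, max_missing=1):
    """Drop rows with more than `max_missing` gaps, then impute the rest."""
    kept = [r for r in rows if sum(v is None for v in r.values()) <= max_missing]
    fill = {}
    for col in kept[0]:
        present = [r[col] for r in kept if r[col] is not None]
        if col in numeric_cols:
            fill[col] = mean(present)                          # mean for numeric attributes
        else:
            fill[col] = Counter(present).most_common(1)[0][0]  # mode for categorical ones
    return [{c: fill[c] if v is None else v for c, v in r.items()} for r in kept]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly and hold out a fraction of the rows for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 50, "marital": "married", "size_mm": 30},
    {"age": None, "marital": "married", "size_mm": 20},
    {"age": 40, "marital": None, "size_mm": 10},
    {"age": None, "marital": None, "size_mm": None},
]
cleaned = clean(rows, numeric_cols={"age", "size_mm"})
train, test = train_test_split(cleaned)
print(len(cleaned), len(train), len(test))
```

The last record is dropped because all three of its fields are missing; the two partial records are filled with the mean age and the most frequent marital status of the remaining rows.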

D. Feature Selection
Feature selection is a vital stage in obtaining optimized prediction models, since it can improve classification accuracy while using fewer features of the dataset. As our objective is to obtain a reduced and improved feature subset, we apply a fused feature selection process that integrates filter and wrapper methods. The pre-processed dataset had more than 30 attributes, representing various test results of the patients along with their age and marital status. The integrated feature selection process aims to reduce these attributes and thereby obtain an improved feature subset that can be used for classifying the cancer stages. The expected outcome of this phase is an optimal feature subset containing the significant features that are essential for staging classification of the cervical and ovarian cancer data.

E. Incorporation of Filter and Wrapper Feature Selection Methods
Several filter and wrapper methods are available for feature selection. Prior research shows that the Relief algorithm, a filter approach, and the genetic algorithm, a wrapper approach, are effective in obtaining optimal feature subsets by assigning precise rankings and selecting the relevant features of a dataset. Based on our prior findings, the Relief filter method is fused with the genetic algorithm wrapper method to produce an optimal feature subset of the cervical and ovarian cancer datasets. The phases of the fused feature selection process, which generates the feasible combinations of improved feature subsets, are described in the following procedure:
1) Procedure: RIFSt_2FS - Revised and Improved Feature Subset through Fused Feature Selection Process.
The feature subset obtained through this integrated feature selection process is expected to contain an enhanced and revised set of features that can be used in the subsequent phase, the classification of the gynaecological cancer stages discussed.
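One way a Relief filter could be fused with a genetic-algorithm wrapper is sketched below in Python. This is not the RIFSt_2FS procedure itself. It assumes a two-class problem with features scaled to [0, 1], uses leave-one-out 1-NN accuracy as the wrapper fitness, and introduces hypothetical parameters (`keep`, `pop_size`, `generations`, a 0.01 per-feature penalty) and synthetic data purely for illustration.

```python
import random

def relief_weights(X, y, n_iter=50, seed=0):
    """Basic two-class Relief; features are assumed scaled to [0, 1]."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(len(w)):
            # Reward distance to the nearest miss, penalise distance to the nearest hit.
            w[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n_iter
    return w

def loo_accuracy(X, y, feats):
    """Leave-one-out 1-NN accuracy restricted to the features in `feats`."""
    if not feats:
        return 0.0
    correct = 0
    for i in range(len(X)):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in feats))
        correct += y[nearest] == y[i]
    return correct / len(X)

def genetic_search(X, y, candidates, pop_size=12, generations=15, seed=0):
    """Small GA over bit masks of `candidates` (the wrapper step)."""
    rng = random.Random(seed)

    def decode(mask):
        return [f for f, bit in zip(candidates, mask) if bit]

    def fitness(mask):
        feats = decode(mask)
        return loo_accuracy(X, y, feats) - 0.01 * len(feats)  # prefer small subsets

    # Start from the full candidate set plus random masks.
    pop = [[1] * len(candidates)]
    pop += [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]               # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(candidates))  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                   # occasional bit-flip mutation
                child[rng.randrange(len(child))] ^= 1
            children.append(child)
        pop = parents + children
    return decode(max(pop, key=fitness))

def fused_selection(X, y, keep=4):
    """Filter with Relief, then refine the top-ranked features with the GA."""
    w = relief_weights(X, y)
    ranked = sorted(range(len(w)), key=lambda f: w[f], reverse=True)
    return genetic_search(X, y, ranked[:keep])

# Synthetic data: feature 0 tracks the label, features 1-5 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[0.9 * label + 0.1 * data_rng.random()] + [data_rng.random() for _ in range(5)]
     for label in y]
subset = fused_selection(X, y)
print(sorted(subset))
```

In this arrangement the filter step shrinks the search space before the more expensive wrapper step runs, which is the usual motivation for combining the two. For more than two stage classes, a multi-class variant such as ReliefF would be needed in place of basic Relief.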

F. Gynaecological Cancer Stages Classification with K-Fold Cross Validation
To classify all the stages of cervical and ovarian cancers, prominent ML classification algorithms are run on the datasets reduced to the optimal features. The repeated K-fold cross validation technique is recommended for model training and for building the various classifiers from which an optimized classification technique is determined [15]. The classification algorithms used for model training and validation are Random Forest, C5.0 and K-Nearest Neighbour. The classifier models are constructed with these ML algorithms first using all the preliminary features and then using the feature subset gained through the fused feature selection process that combines Relief and the Genetic Algorithm. The classification results are presented in the Results section with a detailed analysis of stage-wise and age-wise groupings.
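The K-fold scheme can be sketched in a few lines of Python. The example below is a simplified illustration, not the experimental setup of this work: it uses a plain (non-repeated) 10-fold split, a small 3-nearest-neighbour classifier written by hand, and synthetic two-class data with hypothetical stage labels.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)))
    votes = Counter(train_y[j] for j in order[:k])
    return votes.most_common(1)[0][0]

def cross_validate(X, y, k=10):
    """Mean held-out accuracy over k folds."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held = set(fold)
        train = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_X, train_y, X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

# Synthetic, well-separated data with two made-up stage labels.
rng = random.Random(2)
y = ["I" if i < 30 else "III" for i in range(60)]
X = [[rng.random() + (0 if label == "I" else 5), rng.random()] for label in y]
print(cross_validate(X, y))
```

Each instance is held out exactly once, so every prediction is made by a model that never saw that instance. A repeated variant would call `k_fold_indices` with several different seeds and average the resulting scores.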

G. Performance Evaluation
The performance of the predictive models obtained through the ML classification algorithms is assessed using the accuracy metric. Initially, the predictive performance of Random Forest, C5.0 and K-Nearest Neighbour is evaluated on the training datasets with the preliminary features, using the test data for validation. Subsequently, the model generated through the fused feature selection process is assessed on the test data with the same classification algorithms, based on how well they assign the data to the appropriate cervical and ovarian cancer stages. The performance metric used in this work is accuracy, which is calculated as shown below.

Accuracy = (Total No. of correct predictions)/(Total No. of instances)
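As a worked illustration of this formula, the short Python sketch below computes the overall accuracy and, as an additional assumption not taken from the text, a per-stage breakdown. The stage labels in the example are made up.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of instances whose predicted stage equals the actual stage."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def per_stage_accuracy(actual, predicted):
    """Accuracy computed separately for each actual stage."""
    total = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {stage: correct[stage] / total[stage] for stage in total}

# Made-up labels: four of the five predictions are correct.
actual = ["IB1", "IIIB", "IIA", "IVB", "IB1"]
predicted = ["IB1", "IIIB", "IIB", "IVB", "IB1"]
print(accuracy(actual, predicted))            # 0.8
print(per_stage_accuracy(actual, predicted))
```

The per-stage view is useful with many stage classes, since a high overall accuracy can hide a stage that is rarely predicted correctly.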
The optimal feature subset generated through the RIFSt_2FS method is effective for staging classification of cervical and ovarian cancer, with high accuracy and improved performance outcomes. This approach has shown prominent results compared with existing techniques based on image classification [16], [17]. The outcomes are presented in the subsequent section.

VI. DISCUSSION
The proposed framework is efficient in identifying the important and relevant features to be used for cervical and ovarian cancer staging classification.

A. Variable Importance
The significant features derived through the RIFSt_2FS method form the optimal feature subset for staging classification of the gynaecological cancer data. The overall variable importance in cervical and ovarian staging classification is obtained, and the chart for the C5.0 method for combined staging prediction in ovarian cancer is depicted in Fig. 3.

B. Classification of Cervical and Ovarian Cancers Test Data
Based on the dataset retrieved from the SEER registry, we predicted the stages for 1078 cervical cancer patients and 1446 ovarian cancer patients, who constitute the test data in this work. This work aimed to classify the stages of cervical and ovarian cancers, including their sub-types, and to analyse each stage by patient age.

1) Comprehensive Stagewise Classification Results of Cervical and Ovarian Cancers. The overall classification of all the stages of cervical cancer is shown in Table III and Fig. 4. The results show that the incidence of Stage IIIB and Stage IB1 cervical cancer is higher, and Stages III and IV are considered more critical. Similarly, the overall classification of all the stages of ovarian cancer is shown in Table IV and Fig. 5. The results show that the incidence of Stage IIIC and Stage IV ovarian cancer is higher, and these stages are also considered more critical. The study [18], which examined the association between diagnostic patterns and stages in ovarian cancer using medical indicative features and symptoms, stressed that self-attention is vital for all women.
2) Age-Wise Classification Aspects with Specific Stages of Cervical and Ovarian Cancer. The comprehensive classification of all the stages of cervical cancer by age group is depicted in Fig. 6. The outcomes show that women in the age groups 35-44 and 45-54 are the most affected across cervical cancer stages I to IV, and the incidence of stages IB1 and IIIB is higher. Women in the age group 35-44 also have a higher percentage of occurrences in stages IA1, IB1 and IIIB. Consistent follow-up and timely treatment could therefore reduce critical and life-threatening situations for these women. Likewise, the comprehensive classification of all the stages of ovarian cancer by age group is depicted in Fig. 7. The findings show that women in the age group 45-54 are the most affected across ovarian cancer stages I to IV, while women aged 55 and above are more affected by Stage IV. Correspondingly, this classification shows that ovarian cancer Stage IIIC has a higher incidence among women in the age group 45-54.
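An age-wise breakdown of this kind amounts to a cross-tabulation of age band against stage. The Python sketch below shows one way to build it. The bands 35-44, 45-54 and 55 and above follow the groups named in the text, while the "<35" band, the upper age limit and the sample records are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# (low, high, label); the "<35" band and the upper limit of 200 are assumptions.
AGE_BANDS = [(0, 34, "<35"), (35, 44, "35-44"), (45, 54, "45-54"), (55, 200, "55+")]

def age_band(age):
    """Map an age in years to its band label."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")

def stage_by_age(records):
    """Count stages within each age band; records are (age, stage) pairs."""
    table = defaultdict(Counter)
    for age, stage in records:
        table[age_band(age)][stage] += 1
    return table

# Made-up (age, predicted stage) pairs.
records = [(38, "IB1"), (41, "IIIB"), (47, "IIIB"), (52, "IB1"),
           (60, "IVB"), (29, "IA1"), (49, "IIIB")]
table = stage_by_age(records)
for band in ("<35", "35-44", "45-54", "55+"):
    print(band, dict(table[band]))
```

Dividing each count by the band total would give the within-group percentages discussed in the text.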

VII. RESULTS
The Random Forest and C5.0 classifiers performed very well in categorizing all the stages of cervical and ovarian cancers using the RIFSt_2FS framework. The results indicate that RF and C5.0 are the best classifiers, while the results attained with KNN are not satisfactory. Consequently, we attained optimized classification results using the Random Forest classifier. The performance results of the classifiers are shown in Table V and Fig. 8.
The performance of the proposed approach with the RF classifier is aggregated and compared with some existing studies; the findings are shown in Table VI and Fig. 9.

VIII. CONCLUSION AND FUTURE WORK
This work set out to obtain an improved feature subset through a fused feature selection process for precise classification of cervical and ovarian cancer stages by identifying the significant features. Integrating the feature selection methods through the fused approach enhances the performance of staging classification, as reflected in the positive and negative predictive values of the results, with high accuracy and improved performance measures. The predictive models are built with 10-fold cross validation using major classification algorithms: C5.0, Random Forest and KNN. Classification results are obtained for the respective cervical and ovarian cancer stages, and stage-wise classification by patient age is also obtained with the proposed method.
The proposed method has shown better performance than the studies discussed in the literature. The results show that the incidence of cervical and ovarian cancers is more critical among women aged 45 and above. Regular follow-up and timely treatment are essential for all women to reduce complications in the advanced stages. The Random Forest method showed a high accuracy rate, with 97 percent for the combined performance outcomes. The C5.0 algorithm also showed improved accuracy in all the types of staging classification of cervical and ovarian cancers, whereas the performance of the KNN algorithm is lower than that of the RF and C5.0 methods. The experiments revealed that, with the fused feature selection approach, an optimal and reduced feature subset improves classification accuracy at a reduced computational cost. This work also found that selecting an optimal feature subset can reduce the complexity of the predictive model.
In future work, staging classification for other types of gynaecological cancer, such as uterine, vaginal and vulvar cancers, will be analysed using further types of ML classifiers and other performance metrics such as sensitivity, specificity, precision, and F-score.