Cervical Cancer Prediction through Different Screening Methods using Data Mining

Cervical cancer remains a leading cause of death worldwide because effective access to cervical screening is still a major challenge. Data mining techniques, including decision tree algorithms, are widely used in biomedical research for predictive analysis. The imbalanced dataset used in this study was obtained from the dataset archive of the University of California, Irvine (UCI). The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the dataset, which increased the number of instances. The dataset consists of patient age, number of pregnancies, contraceptive usage, smoking patterns and chronological records of sexually transmitted diseases (STDs). The Microsoft Azure Machine Learning tool was used to simulate the results. This paper focuses on cervical cancer prediction through different screening methods using data mining techniques, namely boosted decision tree, decision forest and decision jungle algorithms. Performance was evaluated on the basis of the area under the receiver operating characteristic curve (AUROC), accuracy, specificity and sensitivity. 10-fold cross-validation was used to validate the results, and the boosted decision tree gave the best results, reaching an AUROC of 0.978 when the Hinselmann screening method was used as the target. The results obtained by the other classifiers were considerably worse than those of the boosted decision tree.

Keywords—Boosted decision tree; cervical cancer; data mining; decision trees; decision forest; decision jungle; screening methods


I. INTRODUCTION
Cancer is a dangerous disease in which a group of abnormal cells grows uncontrollably by ignoring the usual rules of cell division. Cancer develops when normal cells in a particular part of the body begin to grow out of control [1]. Each year around 8.2 million people die from cancer, which is 13% of total deaths worldwide. In 2017, only 26% of developing countries reported having screening services available to the public. Treatment services are available in 90% of developed countries, compared with less than 26% of low-income countries. The number of cancer cases is expected to reach 22 million by 2030 [2,3]. Millions of early deaths among women are due to lung and breast cancer, but cervical cancer is of particular concern because it is diagnosed only in women. The female reproductive system consists of the cervix, uterus, vagina and ovaries. The cervix is the opening to the uterus from the vagina, and it is where cervical cancer occurs [4]. Sexually transmitted human papillomavirus (HPV) is the main cause of cervical cancer [5][6][7][8]. Cervical cancer is most prevalent in low- and middle-income countries [9]. Screening is the most important task in managing cervical cancer. An ideal screening test is minimally invasive, easy to perform, acceptable to the subject, inexpensive and effective in diagnosing the disease at an early stage, when treatment is easiest. There are four screening methods: cervical cytology (also called the Pap smear test), biopsy, Schiller and Hinselmann [10]. Cytology is a microscopic analysis of cells scraped from the cervix and is used to detect cancerous or precancerous conditions of the cervix [11]. Biopsy is a surgical procedure in which a sample of living tissue is taken for diagnosis [12]. In the Hinselmann test, an iodine solution is applied for visual inspection of the cervix. In the Schiller test, Lugol's iodine is smeared over the cervix and suspicious regions are then detected by visual inspection [13].
The volume of data is increasing steadily. Large, complex and useful datasets now exist in every field of science and business, and especially in the healthcare domain. With these larger datasets, the capacity to mine useful hidden knowledge from huge volumes of data is increasingly important in today's economy. The process of applying novel techniques to discover knowledge from data is called data mining [14]. Medical data consists of information about patients and their symptoms with respect to a specific disease, and the volume of such data is growing quickly. With traditional techniques, it is exceptionally difficult to separate the important information from raw medical data.
Owing to advances in statistics, mathematics and other domains, it is now possible to extract meaningful information from raw data. Data mining is helpful wherever large collections of healthcare data are available [15]. Several data mining techniques, such as support vector machines (SVM), kernel learning methods and clustering techniques, have been used in healthcare [16]. With the rise of computing methods for disease prediction, the WHO and other international organizations are working together on effective screening methods to detect cervical cancer. These initiatives are raising public awareness of effective screening, but over time these measures have proved ineffective because the set of parameters to use for cervical cancer screening is still debated [4,8,10]. The methods and techniques that have been used for cervical cancer screening are limited to a small number of parameters. The available literature mainly explores the Papanicolaou (Pap) smear test [17], hormonal status, FIGO stage [18] and cervical intraepithelial neoplasia (CIN) [19], but only a single parameter was used for screening prediction. The available data mining techniques that use a large number of parameters [20][21][22][23] did not give effective results. A comparison of studies on screening prediction of cervical cancer, along with their approaches, is presented in Table 1. Because the current techniques are not sufficient, it is necessary to explore all parameters or symptoms for screening prediction of cervical cancer. Decision tree methods have been used to predict cervical cancer, but the demographic and medical attributes differed across previous studies. The aim of this study was to predict cervical cancer based on demographic information, tumor-related parameters, sexually transmitted disease (STD) related parameters and important medical records.

The authors of [20] presented an automated method for predicting the outcome of a patient's biopsy for the diagnosis of cervical cancer from the patient's medical history. Their technique allows joint and fully supervised optimization of dimensionality reduction and classification. They discovered certain medical findings from the embedding spaces and confirmed them through the medical literature. R. Vidya and G. M. Nasira [24] predicted cervical cancer using random forest with K-means learning and implemented the techniques in MATLAB. These experiments were performed on an NCBI dataset to construct decision trees using classification methods. Yulia et al. [25] predicted cervical cancer using Pap smear test results, which were divided into two categories: cancerous and non-cancerous patients. Three classification methods, Naïve Bayes, support vector machine and random forest, were used to compute the results, of which random forest gave the best results. Jimin Kahng et al. [21] predicted the development of cervical cancer using SVM. Weka was used to train and test the dataset as well as to analyze relationships between attributes. Chang et al. [17] predicted the recurrence of cervical cancer in patients using MARS (Multivariate Adaptive Regression Splines) and the C5.0 algorithm. MARS estimated the relationship between a dependent variable and a set of explanatory variables by piecewise regression. C5.0 used a greedy, top-down approach to build the decision tree and then trained on the data using the significant attributes. Maciej Kusy et al. [18] used neural networks to predict adverse events in cervical cancer patients. A multilayer perceptron (MLP) is a type of neural network in which the input signal is fed forward through a number of layers: an input layer, a hidden layer and an output layer. The GEP classifier delivered efficient results in predicting adverse events in cervical cancer compared with the other methods. Kelwin Fernandes et al. [26] used a transfer learning technique for cervical cancer screening. Their study is based on linear predictive models, and positive results were obtained in most experiments compared with other methods. Bogdan Obrzut et al. [27] used computational intelligence methods for prediction in cervical cancer patients. The probabilistic neural network (PNN) was a very efficient method for predicting overall survival in cervical cancer patients treated with radical hysterectomy.

III. METHODOLOGY
Our methodology consists of three main steps. The first step is dataset selection. The second step is preprocessing, in which the original data is prepared for classification. The last step is building an effective classification-based model for prediction.

A. Dataset
A publicly available dataset [28], obtained from the UCI repository, was used in this research. The dataset contains 858 patients and 36 attributes, which include patient age, number of pregnancies, contraceptive usage, smoking patterns and chronological records of sexually transmitted diseases (STDs).

B. Data Preprocessing
Data mining fundamentally depends on the quality of the data. Raw data is generally vulnerable to noise, missing values, outliers and inconsistency, so the selected data must be processed before being mined. Data preprocessing is one of the most vital steps in data mining: it covers preparation and transformation of the dataset and makes knowledge discovery more efficient. The following steps were used to preprocess the data for the experiments in this study.
Step 1: Instances and attributes with a high ratio of missing values were ignored, which makes the data consistent. This was effective because several instances and attributes in the dataset had missing values. For the attributes STDs: Time since first diagnosis and STDs: Time since last diagnosis, more than 80% of the data was missing, so these attributes were deleted. Two attributes, STDs:cervical condylomatosis and STDs:AIDS, have a constant value, so these were also deleted.
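As an illustration of Step 1, the attribute filtering could be sketched in Python as follows. The function name, the one-dictionary-per-patient row format and the `max_missing_ratio` parameter are assumptions made for this example. The 80% cut-off and the "?" missing-value marker are taken from this section, and the original experiments were run in Azure Machine Learning rather than with this code.

```python
def drop_uninformative_columns(rows, missing="?", max_missing_ratio=0.8):
    """Drop columns that are mostly missing or that hold one constant value.

    `rows` is a list of dicts (hypothetical format: one dict per patient).
    """
    columns = list(rows[0].keys())
    kept = []
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v != missing]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > max_missing_ratio:
            continue  # too sparse, e.g. "STDs: Time since first diagnosis"
        if len(set(present)) <= 1:
            continue  # constant attribute, e.g. "STDs:AIDS"
        kept.append(col)
    return [{col: row[col] for col in kept} for row in rows]
```

Rows with many missing values could be filtered in the same way by applying the ratio test per patient instead of per attribute.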
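Step 2: Many attributes, such as number of pregnancies and hormonal contraceptives, had missing values, denoted in the data as "?". These values were replaced with the median value of the respective class. The median was computed as follows [29]:

$$\tilde{x} = \begin{cases} x_{((n+1)/2)}, & n \text{ odd} \\ \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right), & n \text{ even} \end{cases}$$

where $x_{(1)} \le \dots \le x_{(n)}$ are the $n$ observed values of the attribute sorted in ascending order.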
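A minimal sketch of the class-wise median imputation described in Step 2 is given below. The function and argument names are hypothetical, and the sketch assumes that every class has at least one observed value for the attribute; it is not the Azure Machine Learning module used in the study.

```python
from statistics import median


def impute_by_class_median(rows, attribute, label, missing="?"):
    """Replace missing entries of `attribute` with the median of the row's class."""
    observed = {}
    for row in rows:
        if row[attribute] != missing:
            observed.setdefault(row[label], []).append(float(row[attribute]))
    # One median per class (assumes each class has at least one observed value).
    medians = {cls: median(values) for cls, values in observed.items()}
    return [
        {**row, attribute: (medians[row[label]] if row[attribute] == missing
                            else float(row[attribute]))}
        for row in rows
    ]
```

For example, calling `impute_by_class_median(rows, "preg", "biopsy")` fills each "?" in the hypothetical `preg` column with the median computed over the patients that share the same `biopsy` label.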
Step 3: The other important task was outlier detection. An outlier is a data object that deviates significantly from the rest of the objects. In this study, two attributes, age and number of partners, contain outliers. To solve this issue, lower and upper threshold limits were defined, and values outside these limits were replaced with the median value.
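The threshold-based outlier replacement of Step 3 might look like the sketch below. The threshold values themselves are not reported, so `lower` and `upper` are left as parameters, and taking the median over the in-range values only is an assumption of this example.

```python
from statistics import median


def replace_outliers_with_median(values, lower, upper):
    """Replace values outside [lower, upper] with the median of the in-range values."""
    in_range = [v for v in values if lower <= v <= upper]
    fill = median(in_range)  # assumption: median is taken over non-outliers only
    return [v if lower <= v <= upper else fill for v in values]
```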
Step 4: Normalization is a scaling technique used in data preprocessing. There are several normalization methods, e.g. Min-Max, Z-score and decimal scaling normalization [30]. Decimal scaling normalization was applied using the following method [31]:

$$v' = \frac{v}{10^{j}}$$

where $v'$, $v$ and $j$ denote the scaled value, the value in its original range and the smallest integer such that $\max(|v'|) < 1$, respectively.
In this study, all integer-valued attributes, such as age and hormonal contraceptives, are scaled to [0-10], and Boolean attributes, such as smokes, HPV and STD, are scaled to [0,1].
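Decimal scaling in its standard textbook form can be sketched as follows. The function name is hypothetical, and the sketch follows the usual definition in which j is the smallest integer that brings every scaled magnitude below 1; the exact target ranges used in the study may differ.

```python
def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v / 10**j|) < 1.

    Returns the scaled values together with j.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j
```

Applied to the reported age range of 13 to 84, this yields j = 2, so an age of 26 is scaled to 0.26.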
Step 5: After data cleaning, the cervical cancer dataset consists of 734 instances and 32 attributes. This data is imbalanced because only 70 instances are cancerous and 663 are non-cancerous diagnosed patients. To overcome this problem, the Synthetic Minority Oversampling Technique (SMOTE) was used. This is a statistical method for increasing the number of instances in a dataset in a balanced way. It works by producing new instances from the existing minority cases supplied as input; the majority instances do not change. The new instances are not just copies of existing minority cases: the algorithm takes samples of the feature space for each target class and its nearest neighbors and generates new instances that combine the features of the target case with those of its neighbors. This makes the samples more general [32]. For a minority class sample $x_i$, the algorithm searches its $k$ nearest neighbors, randomly selects one neighbor $\hat{x}_i$ and then draws a random number $\delta \in [0,1]$. The new sample is created as:

$$x_{new} = x_i + \delta\,(\hat{x}_i - x_i)$$

SMOTE outperforms random oversampling because it also helps to avoid the overfitting problem [33]. With SMOTE, the minority class was oversampled from 70 to 563 instances, which increased the total number of instances.
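The interpolation at the core of SMOTE can be illustrated with the short sketch below. It is a simplified version written for this explanation, not the Azure Machine Learning SMOTE module used in the study: the function signature, the Euclidean distance and the default k = 5 are assumptions, and numeric features are assumed throughout.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_nn = rng.choice(neighbours)
        delta = rng.random()  # random number in [0, 1)
        # new sample lies on the segment between x and its chosen neighbour
        synthetic.append([a + delta * (b - a) for a, b in zip(x, x_nn)])
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority samples, the majority class is left untouched, which matches the behaviour described above.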

C. Classification Models
Decision trees are a supervised classification method and are very popular: many biomedical data mining tasks have used decision trees for efficient prediction [18]. Three decision tree methods were used in this study, as follows.

1) Boosted decision tree:
The key role of boosting is to transform a weak classifier into a strong one. A weak classifier is a prediction model with poor performance, i.e. low accuracy due to a high misclassification rate. Boosting works well when the votes of all weak learners for each prediction are combined in such a way that the final prediction is effective. At each iteration, a new weak learner is added to the ensemble and is trained with respect to the error of the whole ensemble. As weak learners are added iteratively, the ensemble delivers increasingly precise classification. Gradient boosting consecutively fits new models to provide a more accurate estimate of the class variable. Each new model is correlated with the negative gradient of the loss function, which tends to minimize the error. Friedman [34] presented the boosted decision tree in full detail.
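Step 1: At the $m$-th iteration, a decision tree $h_m(x)$ is fitted to the pseudo-residuals. $J_m$ represents the number of leaves; the tree divides the input space into disjoint regions $R_{1m}, \ldots, R_{J_m m}$ and predicts a constant value in each region. The output can be written as:

$$h_m(x) = \sum_{j=1}^{J_m} b_{jm}\,\mathbf{1}_{R_{jm}}(x)$$

where $b_{jm}$ denotes the predicted value in region $R_{jm}$.
Step 2: $b_{jm}$ is multiplied by a value $\gamma_m$, chosen to decrease the error by minimizing the loss function $L$, and the model is updated:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)$$

Step 3: Once the updated value has been determined, the previous value is discarded. The new function is written as:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\,\mathbf{1}_{R_{jm}}(x), \qquad \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$$

The terminal nodes, or leaves, of the tree are denoted by $J$. The accuracy of the boosted decision tree improves as the number of leaves and the size of the tree increase, but overfitting and longer processing time may occur.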
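To make the iterative update concrete, the sketch below implements a heavily simplified gradient boosting loop. It is not the Azure boosted decision tree module: it assumes a squared-error loss (so the pseudo-residuals are plain residuals), a single numeric feature with at least two distinct values, two-leaf trees (stumps) and a fixed shrinkage factor in place of the line search for the multiplier. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        b_left, b_right = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - b_left) ** 2 for r in left)
                 + sum((r - b_right) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, b_left, b_right)
    _, threshold, b_left, b_right = best
    # Constant prediction in each of the two regions
    return lambda x: b_left if x <= threshold else b_right


def gradient_boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Return F_M(x) = F_0 + learning_rate * sum of the fitted trees h_m(x)."""
    f0 = sum(ys) / len(ys)  # constant initial model
    trees = []
    predictions = [f0] * len(xs)
    for _ in range(n_trees):
        # For squared error, the pseudo-residuals are y - F_{m-1}(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: f0 + learning_rate * sum(tree(x) for tree in trees)
```

With a small learning rate each tree corrects only a fraction of the remaining error, which is why many trees are needed and why processing time grows with the number of trees.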

2) Decision forest:
Decision forest is another algorithm that performs classification using ensemble learning. Ensemble methods generalize better than a single model: they generate multiple related models and merge them, which gives better results. In most cases, ensemble models offer better accuracy than a single decision tree. Decision forest differs from the random forest method, in which the individual decision trees might use only some randomized portion of the data or features. There are many ways to combine decision trees, but voting is one of the most effective for producing the result of an ensemble model [35]. Decision forest works by constructing multiple decision trees and then voting on the most popular output class. A set of classification trees is constructed using the whole dataset and different starting points. Each decision tree outputs a non-normalized frequency histogram of labels. The probability of each label is determined by an aggregation step, which sums the histograms and then normalizes the result. The final decision of the ensemble gives more weight to trees with high prediction confidence. Criminisi [36] presented the decision forest in full detail.
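Step 1: Forest training is done by optimizing the parameters of the weak learner at each split node $j$:

$$\theta_j^{*} = \arg\max_{\theta_j} I_j(S_j, \theta_j)$$

where $S_j$ and $\theta_j$ denote the parent set and the split parameters, and $I_j$ is the objective function of the split.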
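As a rough illustration of how a candidate split can be scored during forest training, the sketch below computes an information-gain objective from label lists. It assumes the standard entropy-based information gain of the decision forest literature, with natural logarithms and a binary (left/right) split; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of the empirical label distribution (natural log)."""
    total = len(labels)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    weighted = sum(len(child) / len(parent) * entropy(child)
                   for child in (left, right) if child)
    return entropy(parent) - weighted
```

A split that separates the classes perfectly attains the maximum gain, equal to the parent entropy, while a split that leaves both children with the parent's class mix scores zero.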
This method contrasts with the random forest method, in which a decision tree may use only some random features of the data instead of the complete feature set.

3) Decision jungles:
Decision forests and trees have been used in a large number of data science applications, but these methods have some limitations: given large amounts of data, the number of nodes in decision trees grows exponentially with depth. The decision jungles method compares two new node merging algorithms that jointly optimize both the features and the structure of directed acyclic graphs (DAGs). DAGs have the same structure as decision trees except that nodes can have multiple parents. Node splitting and node merging are driven by an objective function, the weighted sum of entropies at the leaves. DAGs are trained level by level by optimizing an objective function over both the structure of the DAG and the split functions. At each level, the algorithm jointly learns the features and the branching structure of the nodes by minimizing an objective function defined over the predictions. Decision jungles require dramatically less memory while considerably improving generalization. Shotton [37] presented decision jungles in comprehensive detail.
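Step 1: The set of parent nodes and the set of child nodes are denoted by $N_p$ and $N_c$. $\theta_i$ denotes the parameters of the split feature function for parent node $i \in N_p$, and $S_i$ denotes the set of labeled training instances that reach node $i$. Writing $S_i^{L}(\theta_i)$ and $S_i^{R}(\theta_i)$ for the instances that leave node $i$ through its left and right branches, and $l_i, r_i \in N_c$ for the child nodes to which those branches are assigned, the set of instances that reach any child node $j$ is:

$$S_j(\{\theta_i\}, \{l_i\}, \{r_i\}) = \Big[\bigcup_{i \in N_p:\, l_i = j} S_i^{L}(\theta_i)\Big] \cup \Big[\bigcup_{i \in N_p:\, r_i = j} S_i^{R}(\theta_i)\Big]$$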
Step 4: To solve the minimization problem, the cluster search method was used, which alternates between optimizing the branching variables and the split parameters but optimizes the branching variables more globally.

IV. RESULTS AND DISCUSSION
In this study, numerous methods were examined, and the three methods with the best performance are presented. 10-fold cross-validation was used to evaluate the proposed methods because it uses the entire training dataset for both training and evaluation instead of only a portion [38]. Among the 858 patients, 124 had a large number of missing values due to privacy concerns, and the remaining 734 were considered. SMOTE was used to overcome the imbalanced dataset problem, which increased the number of instances. The new balanced dataset consists of 32 attributes and 1226 patients, of which 563 were cancer patients and 663 were non-cancer patients, as shown in the confusion matrix in Fig. 1. The median patient age was 26 years (range, 13-84). The median number of sexual partners was 2 (range, 1-10). The median age at first sexual intercourse was 17 (range, 10-32), and the median number of pregnancies was 2 (range, 0-10).
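The 10-fold cross-validation protocol mentioned above can be sketched as an index-splitting routine. This is a plain, unstratified split written for illustration; the function name and the fixed random seed are assumptions, and the fold assignment used by the Azure tool may differ.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each instance is used for testing exactly once and for training in the remaining nine rounds, so all 1226 instances of the balanced dataset contribute to both training and evaluation.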
Fig. 1. Confusion Matrix Obtained by using Different Models.

There were four screening methods (target attributes) in the dataset, labeled biopsy, cytology, Schiller and Hinselmann. These four screening methods have been used to diagnose cancer, and a model was trained for each screening method individually on the same dataset. The boosted decision tree outperformed all other methods, as shown in Table 2. With the boosted decision tree, the Hinselmann target gave the highest AUROC, 0.978, which was slightly higher than biopsy (0.974) and higher than cytology (0.959) and Schiller (0.943). The complete performance of the proposed models is given in Fig. 3, and the AUROC curves are shown in Fig. 2.
Boosted decision tree, decision forest and decision jungle algorithms were used, and the predictive ability of the tested models was determined by computing accuracy, sensitivity, specificity and AUROC. AUROC is regarded as one of the best measures for evaluating the performance of classification models [39][40][41][42].
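The four evaluation measures named above can be computed from predictions as in the following sketch. The function names are hypothetical, labels are assumed to be coded as 1 (cancer) and 0 (non-cancer), and AUROC is computed through its rank interpretation, i.e. the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Both functions assume that each class is present at least once; otherwise the corresponding denominators are zero.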
AUROC is a summary measure of performance that indicates whether, on average, a true positive is ranked higher than a false positive. AUROC has also been used to evaluate different techniques in biomedical data mining [18,27]. About 50% of cervical cancer cases are identified in women aged 35-54, around 20% in women more than 65 years old and around 15% in women between the ages of 20 and 30. The median age at diagnosis of cervical cancer is 48 years. Cervical cancer is very unusual in women younger than 20. However, many young women become infected with various types of human papillomavirus (HPV), which can increase their risk of developing cervical cancer in the future. Young women with early abnormal changes who do not have regular checkups are at high risk of cervical cancer by the time they reach the age of 40 [43][44][45]. The main risk factor for cervical cancer is HPV. Sexual relations with infected persons are a risk factor for HPV, and related factors, such as sexual relations with multiple partners, are also risk factors that can lead to cervical cancer. Sexually active women are reported to be at lower risk of cervical cancer than those who have multiple sexual partners [46,47]. Smoking is associated with a higher risk of precancerous changes in the cervix and of progression to invasive cervical cancer, particularly in women infected with HPV. Women with a weak immune system are more prone to HPV infection [48].
This study exploited recent advances in statistical learning for handling high-dimensional data with numerous features. Ensemble learning methods have also been used in other promising areas of research under these conditions [49]. Classification algorithms based on decision trees have a wide range of applications beyond the biomedical domain: astronomical object detection [50], fraud detection in banking [51] and financial failure prediction [52] have all used decision trees for classification. Several classification algorithms are presented in the literature, but decision trees are widely used because they are simple to implement and easy to understand compared with other classification methods. Recently, high-dimensional classification problems have become abundant due to substantial developments in technology [53]. Generally, the problem of modelling high-dimensional data has been solved by variable reduction methods in the preprocessing and post-processing stages. Several data mining methods, such as artificial neural networks, support vector machines and the k-nearest neighbor method, have also been used to address the high-dimensional classification problem [54]. In this study, the high-dimensional classification problem was addressed using decision tree methods because only those attributes that showed the highest relevance to the screening method (target class) were considered. The Hinselmann screening method showed high performance; Hinselmann is also a traditional and effective method of screening for cervical cancer [55][56][57]. The performance of the biopsy screening method was slightly lower than that of Hinselmann. Various studies have also found that biopsy screening has a large impact on cervical cancer detection [58,59]. The boosted decision tree was preferred because it focuses on misclassified instances and tends to increase accuracy. Boosting is one way to decrease the misclassification rate because it is iterative [60], which in general increases classification accuracy. The boosted decision tree is an ensemble model in which the results of various models are consolidated, and the outcome of an ensemble model is normally better than that of any individual model. In this study, the maximum number of leaves per tree was 20 and the minimum number of leaves per tree was 10. The learning rate was set to a low value of 0.1, but processing time increased slightly because 100 trees were constructed for the boosted decision tree ensemble. A higher learning rate and number of trees lead to better performance, but processing time also increases. The boosted decision tree has also been used for sentiment analysis of the Greek language, where it coped efficiently with both high-dimensional and imbalanced datasets and performed considerably better than other traditional machine learning methods [61], as well as for cardiovascular risk prediction [62] and risk prediction for inflammatory bowel disease [63]. Due to some limitations, the decision forest did not give better results. The main limitation of decision forests is that real-time prediction is slow when a large number of trees is built: these algorithms are fast to train but quite slow to produce predictions once trained. Accuracy may increase as the number of trees increases [64], but this also leads to a slower model for prediction. In most real-world applications the decision forest is fast enough, but in situations where run-time performance is important other methods would be chosen. Decision forest has also been used to understand protein interactions and to make predictions based on all the protein domains [65]. Other applications of decision forest include the prediction of different types of liver disease, including alcoholic liver damage and liver cirrhosis [66]. Decision forest has further been applied to academic data analysis [67] as well as to classification and forecasting of chronic kidney disease [68]. Decision jungles have been used for feature selection for images, with some modifications to achieve efficient results with modest training time [69].

V. CONCLUSION
Cervical cancer is a common disease, and its screening often involves very time-consuming clinical tests. In this context, machine learning can deliver efficient methods to speed up the diagnostic procedure. In this work, data mining methods, especially tree-based algorithms, enabled sound prediction for cervical cancer patients. The imbalanced dataset problem, in which cancerous patients were far fewer than non-cancerous patients, was resolved using SMOTE. The predictive ability of the boosted decision tree, measured by AUROC, exceeded that of decision forest and decision jungle. The lower AUROC values of the decision forest and decision jungle methods disqualified them as the best predictive classifiers. We believe that with the growing collection of data on cervical cancer patients and the rapidly advancing methods for analyzing this data, it will become possible to identify the best screening method for cervical cancer patients, which will be informative for patient care. In future, this study can be used as a prototype to develop a healthcare system for cervical cancer patients.

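Step 2 (decision forest): The objective function, or loss function, is denoted by $I$ and takes the value of the information gain:

$$I_j = H(S_j) - \sum_{i \in \{L, R\}} \frac{|S_j^{i}|}{|S_j|}\, H(S_j^{i})$$

where $H(S_j)$ is the entropy of the example set at the parent node, $|S_j^{i}|/|S_j|$ is the weighting of the left/right children and $H(S_j^{i})$ is the entropy of the example sets at the child nodes.
Step 3 (decision forest): The entropy of a generic set $S$ of training points is:

$$H(S) = -\sum_{c \in C} p(c) \log p(c)$$

where $p(c)$ is the normalized empirical histogram of the labels corresponding to the training points in $S$.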

Step 2 (decision jungles): The objective function $E$ associated with the current level of the DAG is a function of the sets $\{S_j\}$ of instances that reach the child nodes. The problem of learning the parameters of the decision DAG is posed as a joint minimization of the objective over the split parameters $\{\theta_i\}$ and the child assignments $\{l_i\}, \{r_i\}$. The task of learning the current level of a DAG can be written as:

$$\min_{\{\theta_i\}, \{l_i\}, \{r_i\}} E(\{\theta_i\}, \{l_i\}, \{r_i\})$$

Step 3 (decision jungles): The information gain objective requires minimizing the total weighted entropy of instances, defined as:

$$E(\{\theta_i\}, \{l_i\}, \{r_i\}) = \sum_{j \in N_c} |S_j|\, H(S_j)$$

where $\{\theta_i\}, \{l_i\}, \{r_i\}$ are the features and branches for all parent nodes $i$, the sum runs over the child nodes $j \in N_c$, $|S_j|$ is the number of examples at node $j$ and $H(S_j)$ denotes the entropy of the examples that reach child node $j$.

Fig. 3. The Results in Terms of Accuracy, Sensitivity, Specificity and AUROC Curve in the Prediction of Cervical Cancer.

TABLE II. AUROC CURVE OBTAINED BY THE ML TECHNIQUES ON THE RISK PREDICTION TASK WITH MULTIPLE SCREENING METHODS: BIOPSY, CYTOLOGY, SCHILLER AND HINSELMANN. PERFORMANCE WAS ALSO EVALUATED IN TERMS OF ACCURACY, SENSITIVITY AND SPECIFICITY

Fig. 2. Comparison of Area under Receiver Operating Characteristic (AUROC) Curve between Boosted Decision Tree (Blue Line) and Decision Forest (Red Line), as these Models Give the Best Results. Plots are Shown for the Models with Threshold=5.

www.ijacsa.thesai.org