A Proposed Hybrid Effective Technique for Enhancing Classification Accuracy

The automatic prediction and detection of breast cancer is an important and challenging problem in medical applications. This paper presents a model for improving the accuracy of classification algorithms. A new approach to designing an effective pre-processing stage is introduced. The approach integrates the K-means clustering algorithm with fuzzy rough feature selection or correlation feature selection for data reduction. The attributes of the reduced clustered data are merged to form a new data set to be classified. Simulation results show that the proposed approach enhances classification accuracy. Moreover, a new hybrid classification model composed of the K-means clustering algorithm, fuzzy rough feature selection and the discernibility nearest neighbour classifier is obtained. Compared with previous studies on the same data, the presented model outperforms other classification models. The proposed model is tested on the breast cancer data set from the UCI machine learning repository.

Keywords: Data mining; bioinformatics; fuzzy rough feature selection; correlation feature selection; data classification


I. INTRODUCTION
Medical data are intricate, noisy and voluminous, which makes decision making about patient health challenging. Therapeutic data sets contain details about patients, past diagnoses, treatment costs and so on, so new approaches are required to extract and analyse valuable information from such data. These approaches improve decision making with regard to patient treatment. Because of the significant increase in digital data, good exploration and analysis of data are needed. Healthcare data stored digitally amounted to about 500 petabytes worldwide in 2012 and was expected to reach 25,000 petabytes in 2020 [6]. Breast cancer is one of the leading causes of cancer deaths among women. Early prediction of the nature of breast lumps (benign or malignant) occurring in patients helps to determine a suitable treatment for the cancer. Extracting valuable information from breast cancer data sets may therefore help in the early prediction of the disease. Data mining is the methodology of extracting knowledge or finding models from huge amounts of data; it can also be called "knowledge mining from data" [1]. Predictive data mining is used to forecast some property of incoming data, for example how to classify it. Data mining (DM) techniques provide efficient methods that improve on statistical tools for forecasting future patterns [2]. The idea behind this advanced analysis is to extract useful information from large data sets and transform it into a meaningful pattern or structure that can be used later. The methods involved in that step come from machine learning, database systems, artificial intelligence, statistics and business intelligence. Data mining is about finding solutions by analysing the data present in data sets [3]. Because of the development of academic and industrial models, there is a need to develop hybrid models such as the one presented by Cios et al. [4] (Fig. 1).
Data mining provides technologies that can be used in many organizations. Healthcare organizations can use these technologies to group patients who have similar characteristics and to propose effective treatment.
Data mining has become important in healthcare management because healthcare requires effective analysis techniques to identify important, hidden and valuable information in medical data sets. Data classification problems in healthcare services result from the uncertainty and high dimensionality of the collected medical information. In therapeutic systems, data mining can deliver benefits such as lower-cost solutions for patients, detection of fraud in health insurance, identification of the causes of different diseases and discovery of medical treatments for them [5].
www.ijacsa.thesai.org
This paper proposes a hybrid model that represents a unified schema for classification algorithms. The model studies how the efficient preparation and selection of data contribute to the improvement of classification algorithms. The model is a hybrid of a pre-processing phase and a classification phase. The pre-processing phase clusters the data set using the K-means clustering algorithm and then applies fuzzy rough feature selection (FRFS) or correlation feature selection (CFS) to each cluster to produce a reduct. The reducts of the clusters are then combined to produce the final set of features used for classification. The discernibility nearest neighbour algorithm is trained on the reduced data set and classifies unseen cases from the test data. The accuracy of the hybrid model is assessed by 10-fold cross validation and by an 80-20 percentage split for training and testing respectively. The experiments were carried out using the WEKA and RapidMiner data mining tools. The results showed that the proposed model improved classification accuracy by up to 2.92% compared with earlier models that classify the original data or data reduced directly.
The rest of the paper is organized as follows. Section II briefly reviews previous studies on the use of data mining techniques for medical applications and on the classification of the breast cancer data set. Section III describes the techniques and methodologies used in this study, such as data classification, data clustering and feature reduction algorithms. Section IV presents an overview of the proposed system and its modules. Experimental results and the conclusion are presented in Sections V and VI, respectively.

II. RELATED WORK
Because biomedical data are considered an important and critical issue, many research papers seek to enhance the accuracy of medical data classification. Agarwal and Pandey [7] performed a comparative study of different machine learning techniques such as fuzzy inference systems (FIS), perceptron neural networks and backpropagation neural networks. Their experiments were applied to eye clinic data sets using MATLAB simulation and showed that perceptron neural networks gave better results than the other techniques. Their study also observed that perceptron neural networks are simple and that, although fuzzy logic and backpropagation neural networks are widely used in several research areas, they did not produce good results for the data. Anushya and Pethalakshmi [8] used fuzzy logic, because of its comprehensible results, to evaluate the accuracy of predicting the occurrence of heart disease with several data mining classification techniques such as decision trees, k-means, naïve Bayes and neural networks. The authors used the classifiers to classify a heart data set as healthy or sick, and used sensitivity and specificity to measure the accuracy of the classifiers. The results showed better accuracy when fuzzy logic was used with the k-means classifier, but on the whole data set without reducing its features.
Rawat and Burse [9] proposed a soft computing genetic-neuro-fuzzy system for medical data mining diagnosis. They used genetic algorithms for feature selection combined with an adaptive neuro-fuzzy inference system (ANFIS) for classification. The ovarian cancer data were collected from UCI. The system achieved higher accuracy at minimum cost; the cost decreased when genetic algorithms were used to reduce the data. Shukla and Agarwal [10] presented a hybrid system that combines clustering with classification and includes some improvements to k-means. The system was tested on a tuberculosis data set. The model starts with data pre-processing and feature selection using principal components analysis (PCA). It then applies clustering using a modified k-means and compares classification accuracy across three classifiers: naïve Bayes, decision tree and artificial neural networks (ANN). The model showed that the modified k-means gives better results than the original k-means. Hamdan and Garibaldi [11] proposed a framework for survival modelling using the ANFIS fuzzy inference system. The framework pre-processes the data for missing values by replacing or discarding them according to the data types and volumes, then uses ANFIS for implementation on a data set related to operative surgery for ovarian cancer patients. They demonstrated the predictive power of the proposed framework and showed that a set of linguistic rules helps clinicians to understand how the data are processed. Cedeño et al. [12] presented a novel enhancement in neural network training for pattern classification. The proposed training algorithm is inspired by the biological metaplasticity property of neurons and Shannon's information theory. Joshi et al. [13] concluded that clustering with the k-means and FF algorithms is useful for the early diagnosis of breast cancer patients.
Many previous studies focused on using powerful and intelligent classification algorithms. The proposed model concentrates on pre-processing, not only for data reduction but also for selecting the features that play a role in improving classification accuracy, which helps prediction models, especially in healthcare.

A. Methodology
In this paper, the proposed system focuses first on applying pre-processing to medical data, including handling missing data, clustering the data and data reduction, performing a comparative study between different algorithms and measuring the contribution of each method to improving performance. Second, different techniques that can be used for optimizing the classification accuracy of bioinformatics data are applied. The proposed system applies different data mining and artificial intelligence techniques.

B. Data Pre-Processing
Medical data are incomplete and need several pre-processing steps, which are performed by several techniques [14], [15]. The machine learning field offers several techniques for model analysis, design and data pre-processing that aim at high accuracy in its results. Many problems in biomedical and medical research can make use of machine learning techniques [16].
The pre-processing stage is critical to the success of data mining and data warehousing in terms of time and space. Medical data are incomplete, noisy and inconsistent, and there are many different ways to address such problems. Data pre-processing includes several methods such as data reduction and data cleaning [4], [17].

1) Data Reduction:
A central issue in machine learning is identifying a specific set of features from which a classification model can be built. Data reduction reduces the data set while maintaining the integrity of the original data. It is preferable to apply data mining to a reduced version of the data that produces the same, or almost the same, results as the original data. Data reduction methods involve data representation, dimensionality reduction and data compression [14]. This paper concentrates mainly on applying CFS and fuzzy rough feature selection.

a) Correlation based Feature Selection (CFS)
Correlation based feature selection (CFS) assesses the value of a subset of attributes by considering the individual predictive ability of every feature together with the degree of redundancy between them. The CFS algorithm couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy. Experiments on standard data sets have demonstrated that CFS quickly identifies irrelevant, redundant and noisy features. CFS also identifies relevant features as long as their relevance does not strongly depend on other features. In medical domains, CFS commonly eliminated a large proportion of the features. In most cases, classification accuracy using the reduced feature set remains good [18], [19].
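The trade-off between predictive ability and redundancy can be illustrated with a minimal sketch of the CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch assumes Pearson correlation as the correlation measure and numeric class labels; the function names `pearson` and `cfs_merit` are illustrative and are not taken from WEKA.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, labels):
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    `columns` is a list of feature columns (one list of values per feature).
    Using absolute Pearson correlation is an assumption of this sketch.
    """
    k = len(columns)
    r_cf = sum(abs(pearson(col, labels)) for col in columns) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if pairs:
        r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    else:
        r_ff = 0.0  # a single feature has no redundancy term
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

Under these assumptions, adding a feature that is uncorrelated with the class lowers the merit of a subset, which is the behaviour the heuristic is designed to reward.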

b) Fuzzy Rough Feature Selection (FRFS)
The rough set attribute selection (RSAR) methodology works effectively only with data sets containing discrete values. Furthermore, it has no way of handling noisy data. Because most data sets contain real-valued attributes, a discretization step must be performed beforehand, which is normally implemented by standard fuzzification methods. Fuzzy-rough feature selection (FRFS) provides a means by which discrete or real-valued noisy data can be effectively reduced without the need for user-supplied information. Furthermore, this technique can be applied to nominal or continuous attributes, as found in classification and regression data sets [27], [30]-[38].

2) Data Cleaning:
To achieve high data quality and accuracy in any information system, the data must be cleaned. Data cleaning is the process of discovering and reducing artifacts in order to improve the data quality needed for building any knowledge discovery system or data warehouse [20]. Data cleaning methods differ according to the nature of the problem or the area to which they are applied, but in general they detect incomplete, inaccurate or unreasonable data and then improve the data by correcting what was detected.

 Data Classification
Classification [18] is a form of data analysis based on supervised learning that extracts models describing important data classes. Such models, called classifiers, predict discrete, unordered class labels. Data classification consists of two main steps: a learning step, in which the model is built, and a classification step, in which the model is used to predict class labels for given data.

 Neural Networks
Neural networks are one of the most popular approaches to machine learning for improving the performance of intelligent systems. Neural networks simulate the human brain, a biological system, and can be used for pattern recognition. Artificial neural networks (ANN) are artificial intelligence techniques widely used to solve pattern recognition and decision support problems in bioinformatics. Neural networks can be combined with other techniques; for example, rough neural networks are produced by combining neural networks with rough sets [21].

K-Nearest Neighbours
K-nearest neighbour [22] is a classification technique that assumes the class of an instance to be the same as the class of the closest instance to it. It uses a similarity metric to quantify the closeness of an instance to others. Nearest neighbour assumes that the instances in the data are independently and identically distributed, so instances that are in close proximity have the same classification. To predict a class, the algorithm must calculate how far the attributes of the new instance differ from those of previous instances.
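As an illustration of the nearest neighbour idea, the following is a minimal sketch of k-nearest-neighbour prediction. It assumes Euclidean distance as the similarity metric and a simple majority vote; the names `euclidean` and `knn_predict` are illustrative, and the sketch is not the WEKA implementation used in the experiments.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training instances (Euclidean distance is an assumption of this sketch)."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 the sketch reduces to the plain nearest neighbour rule described above; larger odd values of k make the vote less sensitive to a single noisy instance.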

 Naive Bayes
The naive Bayes algorithm uses a simplified form of Bayes' formula to decide which class a new instance belongs to. The posterior probability of every class is calculated. Given the feature values present in the instance, the instance is assigned the class with the highest probability. Equation 1 shows the naive Bayes formula, which assumes that feature values are statistically independent within every class [19].

$$p(C_i \mid v_1, v_2, \ldots, v_n) = \frac{p(C_i) \prod_{j=1}^{n} p(v_j \mid C_i)}{p(v_1, v_2, \ldots, v_n)} \qquad (1)$$
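The naive Bayes decision rule described above can be sketched as follows. This is a minimal illustration that treats every feature as categorical and adds Laplace smoothing, both of which are assumptions of the sketch rather than details given in the paper; the function names are illustrative.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    domain = defaultdict(set)            # feature index -> distinct values seen
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1
            domain[j].add(v)
    return class_counts, value_counts, domain


def predict_naive_bayes(model, row):
    """Return the class with the highest unnormalised posterior probability."""
    class_counts, value_counts, domain = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior p(C)
        for j, v in enumerate(row):
            # Laplace-smoothed p(v_j | C); the smoothing is an assumption
            score *= (value_counts[(c, j)][v] + 1) / (n_c + len(domain[j]))
        scores[c] = score
    return max(scores, key=scores.get)
```

The denominator of the Bayes formula is the same for every class, so the sketch compares unnormalised scores instead of computing it.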
C4.5
C4.5 [19] is an algorithm that represents its training data as a decision tree. C4.5 uses a greedy approach, guided by an information-theoretic measure, to build a decision tree from training data. Training instances are divided into subsets according to the values of the attribute selected for the root of the tree. C4.5 uses the gain ratio to rank attributes and select the one that becomes the root of the tree. Other tree-based classification algorithms include ID3, NB tree and FT (functional trees). The functional trees classifier builds a classification tree that can have logistic regression functions at the internal nodes and/or leaves.
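The gain ratio that C4.5 uses to rank attributes can be sketched for a single nominal attribute as below. The sketch assumes base-2 entropy and returns 0 when the split information is 0; the handling of continuous attributes and missing values in C4.5 is not covered, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Information gain of splitting on a nominal attribute, divided by the
    split information (the entropy of the attribute itself)."""
    n = len(labels)
    groups = {}
    for v, y in zip(attribute_values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    # Returning 0 for a single-valued attribute is an assumption of this sketch
    return gain / split_info if split_info > 0 else 0.0
```

An attribute that separates the classes perfectly with two equally sized branches obtains a gain ratio of 1, whereas an attribute that is independent of the class obtains 0.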

 Decision Table/Naive Bayes Hybrid (DTNB)
The decision table/naive Bayes hybrid (DTNB) [23] combines naive Bayes with a decision table; it can be viewed as a simple Bayesian network in which the decision table (DT) represents a conditional probability table. The learning algorithm for the hybrid model proceeds in a similar way to the one for stand-alone DTs. At each point in the search, it evaluates the merit of splitting the attributes into two disjoint subsets: one for the DT and the other for NB. The DTNB hybrid demonstrated better performance than either model applied on its own.

K-Star (K*)
K* is an instance-based learner. Instance-based learners classify an instance by comparing it with pre-classified examples from a data set [24]. The fundamental assumption is therefore that similar instances have similar classifications. The question is how to determine whether two instances are similar and how similar they are. The distance function, one of the two components of an instance-based learner, is used to measure instance similarity. The second component is the classification function, which determines the final classification of a new instance from the instance similarities.

Sequential Minimal Optimization Algorithm (SMO)
Sequential minimal optimization (SMO) is a simple algorithm that can rapidly solve the support vector machine quadratic programming (SVM QP) problem without any additional matrix storage and without using numerical QP optimization steps at all. SMO decomposes the overall QP problem into QP sub-problems, using Osuna's theorem to guarantee convergence [25].
Data clustering is a data mining task that groups data or objects with similar properties together to facilitate their processing. Data clustering has been applied in many domains, including medicine. Several clustering algorithms exist in the literature, but the k-means algorithm is popular because it is simple to implement and delivers good results.
The k-means algorithm is an unsupervised clustering technique. It is widely applied in bioinformatics and related fields, where the number of clusters appropriate for the specific problem must be determined. The k-means algorithm includes five steps [14] (Fig. 2). The goal of the algorithm is to minimize the objective function (squared error function) given by:

$$J(V) = \sum_{i=1}^{C} \sum_{j=1}^{C_i} \left( \lVert x_i - v_j \rVert \right)^2 \qquad (2)$$

where $\lVert x_i - v_j \rVert$ is the Euclidean distance between $x_i$ and $v_j$, $C_i$ is the number of data points in the $i$-th cluster and $C$ is the number of cluster centres.

The k-means clustering method calculates the distance $d$ between two objects $o_i$ and $o_j$ by the Euclidean distance, given as:

$$d(o_i, o_j) = \sqrt{\sum_{k=1}^{n} (o_{ik} - o_{jk})^2} \qquad (3)$$

where $o_{ik}$ is the value of the $k$-th attribute of object $o_i$ and $n$ is the number of attributes.
Step 1. Determine the number of clusters.
Step 2. Create clusters randomly and select the cluster midpoints.
Step 3. Assign each object to the group that has the closest centroid.
Step 4. When all objects have been assigned, recompute the positions of the new cluster midpoints.
Step 5. Repeat steps 3 and 4 until the centroids no longer move, which produces a separation between the objects.

The hybrid system proposed in this study evaluates a new pre-processing step. It consists of data cleaning, clustering of the data and data reduction. Moreover, a comparative study between different algorithms for data reduction, and of the effect of the whole pre-processing step on enhancing classification algorithms, is introduced. Data cleaning is important for overcoming incomplete or inaccurate data. Selecting data without missing values in columns or rows contributes to more accurate results in the pre-processing step. There are many algorithms for clustering, but k-means is adopted for its simplicity and its wide use in bioinformatics and related domains.
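The five k-means steps listed above (Fig. 2) can be sketched in a few lines. This is a minimal illustration, not the RapidMiner implementation used in the experiments: it assumes that the initial midpoints are sampled from the data with a fixed random seed, and the names `kmeans` and `squared_error` are illustrative.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of tuples) into k groups.

    Step 1 is the choice of k by the caller.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random initial midpoints
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the closest centroid
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # step 5: no object changed cluster, so centroids are stable
        assignment = new_assignment
        # Step 4: recompute each midpoint as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


def squared_error(points, assignment, centroids):
    """Sum of squared distances of each point to its cluster centre."""
    return sum(euclidean(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
```

The function `squared_error` evaluates the squared error objective that the iteration tries to reduce; k-means only guarantees a local minimum of it, so results can depend on the initial midpoints.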
After data clustering, the new data set is composed of two subsets. The feature reduction step is applied to each subset. Data reduction can be done by many algorithms, but only two are chosen in this step. The first data reduction algorithm is CFS, which rapidly screens out irrelevant, redundant and noisy features. Moreover, CFS identifies relevant features as long as their relevance does not strongly depend on other features.
The core element of CFS is a heuristic [28] that evaluates the worth (merit) of a specific subset of features. The heuristic measures how well a set of features predicts the class label while accounting for the inter-correlation among the features. The heuristic is formalized in (4):

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (4)$$

where $\mathrm{Merit}_S$ is the heuristic 'merit' of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the mean feature-class correlation and $\overline{r_{ff}}$ is the average feature-feature inter-correlation.

The second algorithm is FRFS, which has many advantages when working with discrete, real-valued, noisy, nominal or continuous data without requiring additional user-supplied information. Fuzzy rough sets combine the vagueness of fuzzy sets with the rough set concept of indiscernibility. FRFS generalizes the rough set by a fuzzification strategy that leaves the underlying attribute values unchanged yet generates a collection of fuzzy sets for each attribute. Fuzzy partitioning of the input space or a fuzzy similarity relation for approximating fuzzy concepts (5) can be used in implementations of the FRFS algorithm.
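$$\mu_{R_a}(x, y) = \max\left( \min\left( \frac{a(y) - (a(x) - \sigma_a)}{a(x) - (a(x) - \sigma_a)},\; \frac{(a(x) + \sigma_a) - a(y)}{(a(x) + \sigma_a) - a(x)} \right),\; 0 \right) \qquad (5)$$

where $a(x)$ is the value of attribute $a$ for object $x$ and $\sigma_a$ is the standard deviation of attribute $a$.

FRQuickReduct [29] implements FRFS on the basis of the dependency function (6), which calculates the degree of dependency from the membership of the objects in the fuzzy positive region determined by the fuzzy attributes and the equivalence classes:

$$\gamma'_P(D) = \frac{\sum_{x \in U} \mu_{POS_P(D)}(x)}{|U|} \qquad (6)$$

where $\mu_{POS_P(D)}(x) = \sup_{X \in U/D} \mu_{\underline{P}X}(x)$ is the membership of object $x$ in the fuzzy positive region, $\underline{P}X$ is the fuzzy lower approximation of the decision class $X$, $U$ is the set of all objects and $P$ is a subset of the conditional attributes.

FRQuickReduct(C, D) is illustrated in Fig. 3, where C is the set of all conditional attributes and D is the set of decision attributes. The candidate reduct starts as the empty set {} and the dependency values are initialized to 0. To select which attribute should be appended to the candidate reduct, the algorithm uses the dependency function γ′. The stopping criterion is reached when no remaining attribute increases the dependency; the algorithm then terminates and returns the reduct.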
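The greedy search performed by a QuickReduct-style algorithm can be sketched as below. To keep the example short, the sketch swaps in the crisp rough-set dependency (the fraction of objects whose equivalence class maps to a single decision value) in place of the fuzzy-rough dependency γ′; this substitution, and the names `crisp_dependency` and `quick_reduct`, are assumptions made for illustration and do not reproduce the FRFS implementation used in the experiments.

```python
from collections import defaultdict


def crisp_dependency(rows, decisions, attrs):
    """Crisp rough-set dependency of the decision on `attrs`: the fraction of
    objects whose equivalence class under `attrs` has one decision value.
    This is a stand-in for the fuzzy-rough dependency, not the same measure."""
    decision_sets = defaultdict(set)
    sizes = defaultdict(int)
    for row, d in zip(rows, decisions):
        key = tuple(row[a] for a in attrs)
        decision_sets[key].add(d)
        sizes[key] += 1
    consistent = sum(sizes[k] for k, ds in decision_sets.items() if len(ds) == 1)
    return consistent / len(rows)


def quick_reduct(rows, decisions, gamma=crisp_dependency):
    """Greedy forward selection: repeatedly add the attribute that raises the
    dependency the most, and stop when no attribute increases it."""
    all_attrs = range(len(rows[0]))
    reduct, best = [], 0.0
    while True:
        candidate = None
        for a in all_attrs:
            if a in reduct:
                continue
            score = gamma(rows, decisions, reduct + [a])
            if score > best:
                best, candidate = score, a
        if candidate is None:  # stopping criterion: dependency did not increase
            return reduct
        reduct.append(candidate)
```

Because the dependency function is passed in as `gamma`, a fuzzy-rough dependency could be plugged into the same greedy loop without changing the search itself.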
The next stage of the system merges the reduced features of the different clusters. This methodology (clustering, then reduction, then merging the data again) avoids eliminating attributes that can contribute in any degree to classification accuracy.
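The cluster, reduce and merge sequence can be sketched as follows. The sketch assumes that merging means keeping the union of the features selected in any cluster, which is one reading of the description above; `varying_features` is a hypothetical stand-in for CFS or FRFS, and all names are illustrative.

```python
def varying_features(rows, labels):
    """Hypothetical stand-in selector: keep features that are not constant
    within the given rows. A real run would call CFS or FRFS here."""
    return [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]


def merge_cluster_reducts(rows, labels, assignment, select_features):
    """Run a feature selector inside each cluster and keep the union of the
    selected feature indices (the union rule is an assumption of this sketch)."""
    clusters = {}
    for row, y, c in zip(rows, labels, assignment):
        cluster_rows, cluster_labels = clusters.setdefault(c, ([], []))
        cluster_rows.append(row)
        cluster_labels.append(y)
    kept = set()
    for cluster_rows, cluster_labels in clusters.values():
        kept |= set(select_features(cluster_rows, cluster_labels))
    kept = sorted(kept)
    # Project the whole data set onto the merged feature set
    reduced = [tuple(row[j] for j in kept) for row in rows]
    return kept, reduced
```

Under the union rule, a feature that is informative in only one cluster still survives in the merged data set, which matches the stated aim of not discarding attributes that contribute to classification accuracy.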
The last step compares several classification algorithms to test to what extent the pre-processing step affects the classification of medical data. This study used a neural network (NN), K-nearest neighbours, fuzzy-rough K-nearest neighbours, the discernibility NN classifier, naive Bayes, K-Star (K*), functional trees, C4.5, the decision table/naive Bayes hybrid and a support vector classifier trained by the sequential minimal optimization (SMO) algorithm. The main components of the proposed framework are shown in Fig. 4. The model includes three comparative studies:
a) The first compares CFS and FRFS and their effect on the classification algorithms.
b) The second compares applying CFS and FRFS directly to the whole data set with applying them to each cluster separately, and studies the effect on the accuracy of the classification algorithms.
c) The third compares the classification algorithms in terms of time complexity for the original data and the proposed model.

a) Data Set
For the experiments, the breast cancer data set from the UCI machine learning repository is used to test the model [26]. The features of the data set are given in Table I.
The data set contains 699 cases of patients who had undergone surgery for breast cancer. The output value is either 2 or 4, indicating a benign or a malignant tumour respectively. Nine attributes take values from 1 to 10, in addition to the ID number; they are itemized in Table II. The task is to determine whether the identified tumour is benign (2) or malignant (4) given the values of the nine attributes described in Table II. The data set contains 458 benign instances and 241 malignant instances. For ease of processing, the class values 2 and 4 are replaced by 0 and 1.

b) Clustering and Reduction
The first step in the proposed model is preparing the data set for clustering and handling missing and noisy data. The K-means clustering algorithm is used for clustering because of its simplicity. The value of K is set to 2 clusters, which suits the nature of the data and its classification. RapidMiner Studio was used for data clustering. Clustering produced 2 sub data sets: one cluster has 354 instances and the other has 345 instances. For each sub data set, data reduction was applied using the correlation feature selection and fuzzy rough feature selection algorithms.
The WEKA tool was used to apply the reduction algorithms. Feature reduction showed that, when applied directly, the correlation feature selection (CFS) algorithm keeps the same number of attributes as in the original data set, whereas applying it to the clustered data yields 8 attributes. The fuzzy rough feature selection (FRFS) algorithm yields the same number of attributes (7 attributes) in both modes of reduction (FRFS applied directly to the original data and to the clustered data sets), but with different attributes. The level of reduction is not large, but the accuracy of medical data classification is the most important factor in patient treatment.

c) Classification Algorithms
At this step of the model, different machine learning algorithms for classifying data are investigated, and the effect of the pre-processing step (cleaning + clustering + reduction + merging) on the accuracy of the classification algorithms is tested. Tables III and IV show the classification algorithms and their accuracy for the proposed model compared with the original data, for both feature reduction algorithms. The results show that pre-processing with clustering improves accuracy compared with applying the same algorithms directly to the original data or to the original data after direct feature reduction by CFS or FRFS.
Fig. 5 and 6 show that adding clustering to the pre-processing step plays a main role in improving the accuracy of the classification algorithms. Fig. 7 shows that the proposed model brings enhancements for both reduction algorithms under the two test modes. It is also noted that the accuracy levels obtained with FRFS exceed those obtained with CFS.
The proposed model increases accuracy by up to 2.92% compared with using reduction techniques alone. Fig. 8 and 9 show the levels of improvement for all classification algorithms when the proposed pre-processing step is used; all classification algorithms improve. On the other hand, applying feature selection algorithms directly to the data set may decrease or increase the accuracy of some algorithms, without reaching the rate of increase achieved by the proposed system.
The enhancements of the proposed model are summarized in Fig. 10 in terms of added value. In addition to improving the accuracy of the classification algorithms, the proposed model reduces the time consumed by those algorithms. Time complexity is an important factor, together with accuracy, in dealing with fields that are critical to human life. Fig. 11 and 12 show the time (in seconds) for building the classification model under the two reduction algorithms for the original data, the reduced data and the proposed model. The proposed model not only enhances the classification algorithms but also suggests new hybrid models that can be compared with previous studies on the same data set. The proposed system reaches an accuracy of 98.9% in classifying breast cancer with a hybrid composed of FRFS + K-means + discernibility NN; Table V shows this result alongside the results of previous studies. The relation between the proposed system and previous studies is displayed graphically in Fig. 13.

V. CONCLUSION
Medical data provide a challenging field for data mining researchers. Machine learning algorithms were used to mine information from ambiguous and vague data. Data pre-processing is intended to enhance the final accuracy of medical data classification. A hybrid pre-processing model for enhancing the performance of classification algorithms has been presented. The model combines the K-means clustering algorithm with fuzzy rough feature reduction or correlation feature reduction to achieve effective data reduction. The proposed model has been applied to the breast cancer data set from the UCI machine learning repository. Simulation results have shown the effectiveness of the proposed model in enhancing the performance of classification algorithms. Furthermore, the results show that fuzzy rough feature selection performs better than correlation feature selection in data reduction and, in addition, increases classification accuracy. Compared with previous studies on the same data, the hybrid model of k-means, fuzzy rough feature selection and discernibility nearest neighbour has been shown to be more accurate than other algorithms in the same field.

From Fig. 4, the whole system can be summarized in the following phases:
- Data pre-processing:
    Noise and missing values handling
    Data clustering
    Feature extraction
    Merging data to produce one data set
- Applying classification algorithms
- Testing the model
This model includes several comparative studies, listed as items a) to c) in the description of the proposed framework.

Fig. 11. Time for building the model for the original data, the reduced data and the proposed model using FRFS.

Fig. 12. Time for building the model for the original data, the reduced data and the proposed model using CFS.

TABLE I
The classification algorithms are run with the WEKA tool on the new data produced by the proposed model and, for comparison, on the original and reduced data. The classification algorithms were tested by 10-fold cross validation and by an 80-20 percentage split for training and testing respectively. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
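The two test modes and the accuracy measure can be sketched as below. This is a minimal illustration under stated assumptions: the shuffling and the fixed seed are choices of the sketch, no stratification is performed, and the splits will not coincide with the folds that WEKA generates; the function names are illustrative.

```python
import random


def accuracy(y_true, y_pred):
    """Percentage of test tuples that are classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def percentage_split(n, train_fraction=0.8, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.
    Each index appears in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

For the 699 instances of the data set, the 80-20 split in this sketch gives 559 training and 140 test instances, and each of the 10 folds holds 69 or 70 test instances.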

TABLE III. THE ACCURACY OF CLASSIFICATION BY USING FRFS

TABLE IV. THE ACCURACY OF CLASSIFICATION BY USING CFS

TABLE V. COMPARISON BETWEEN PROPOSED AND PREVIOUS STUDIES ON BREAST CANCER DATA SET