Efficient Cancer Classification Using Fast Adaptive Neuro-Fuzzy Inference System (FANFIS) Based on Statistical Techniques

The incidence of cancer is increasing throughout the world. This creates a need for new techniques that can detect the occurrence of cancer. Better diagnosis can in turn help reduce the number of cancer patients. This paper aims at finding the smallest set of genes that can ensure highly accurate classification of cancer from microarray data by using supervised machine learning algorithms. The significance of finding the minimum subset is threefold: a) the computational burden and noise arising from irrelevant genes are much reduced; b) the cost of cancer testing is reduced significantly, as the gene expression tests need to include only a very small number of genes rather than thousands of genes; c) it calls for more investigation into the probable biological relationship between this small number of genes and cancer development and treatment. The proposed method involves two steps. In the first step, some important genes are chosen with the help of the Analysis of Variance (ANOVA) ranking scheme. In the second step, the classification capability is tested for all simple combinations of those important genes using a classifier. The proposed method uses the Fast Adaptive Neuro-Fuzzy Inference System (FANFIS) as the classification model. This classification model uses the Modified Levenberg-Marquardt algorithm in its learning phase. The experimental results suggest that the proposed method yields better accuracy and takes less time for classification than the conventional techniques.


I. INTRODUCTION
Microarray data analysis has been successfully applied in a number of studies over a broad range of biological disciplines, including cancer classification [3,10] by class discovery and prediction, identification of the unknown effects of a specific therapy, identification of genes relevant to a certain diagnosis or therapy, and cancer prognosis.
Multivariate supervised classification techniques such as support vector machines (SVMs) [13], and multivariate statistical analysis methods such as principal component analysis (PCA), singular value decomposition (SVD) [9] and generalized singular value decomposition (GSVD), cannot be applied to data with missing values. Estimating the missing values is therefore an essential preprocessing step. Gene expression data may be lost for various reasons [8,11,12], e.g. inadequate resolution, image corruption, dirt or scratches on the slides, or experimental error during the laboratory process. Several algorithms have been developed for recovering the data, because repeating the experiment is costly and time consuming. Moreover, estimating unknown elements in given data has many potential applications in other fields. There are several approaches for estimating the missing values. Recently, the singular value decomposition based method (SVDimpute) and weighted k-nearest neighbors imputation (KNNimpute) have been introduced for missing value estimation. It has been shown that KNNimpute performs better on non-time series data or noisy time series data, whereas SVDimpute works well on time series data with low noise levels. Taken as a whole, the weighted k-nearest neighbor based imputation offers a more robust method for missing value estimation than the SVD based method.
In this paper, the Fast Adaptive Neuro-Fuzzy Inference System (FANFIS) is used together with a gene ranking technique, Analysis of Variance (ANOVA). The learning technique used in this paper is the Modified Levenberg-Marquardt algorithm.

II. RELATED WORKS
Isabelle et al. [1] proposed gene selection for cancer classification using Support Vector Machines. The authors address the problem of selecting a small subset of genes from broad patterns of gene expression data [4,5] recorded on DNA microarrays. Using available training examples from cancer and normal patients, the approach builds a classifier suitable for genetic diagnosis as well as drug discovery. Previous attempts to address this problem selected genes with correlation techniques. The authors propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). It is experimentally demonstrated that the genes selected by this technique yield better classification [14] performance and are biologically relevant to cancer.
Jose et al. [2] present a Genetic Embedded Approach for Gene Selection [15,16] and Classification of Microarray Data [7,17].
Murat et al. [6] address early prostate cancer diagnosis by using artificial neural networks. The aim of that study is to design a classifier based expert system for early diagnosis of the organ in constraint phase, to reach informed decision making without biopsy by using some selected features. The other purpose is to investigate the relationship between BMI (body mass index), smoking factor, and prostate cancer. The data used in the study were collected from 300 men (100: prostate adenocarcinoma, 200: chronic prostatism or benign prostatic hyperplasia). Weight, height, BMI, PSA (prostate specific antigen), free PSA, age, prostate volume, density, smoking, systolic, diastolic, pulse, and Gleason score features were used, and an independent sample t-test was applied for feature selection. To classify the data, the authors used the following classifiers: the scaled conjugate gradient (SCG), Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Levenberg-Marquardt (LM) training algorithms of artificial neural networks (ANN).

III. METHODOLOGY
The cancer classification proposed in this paper comprises two steps. In the first step, all genes in the training data set are ranked using a scoring scheme, and the genes with high scores are retained. This paper uses the Analysis of Variance (ANOVA) method for ranking. In the second step, the classification capability of all simple two-gene combinations among the selected genes is tested using the Fast Adaptive Neuro-Fuzzy Inference System (FANFIS), in which training is performed using the Modified Levenberg-Marquardt algorithm.

Step 1: Gene Importance Ranking
This step computes the importance ranking of each gene by means of the Analysis of Variance (ANOVA) method.
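The ranking step can be illustrated with a minimal sketch, assuming the score of a gene is its one-way ANOVA F statistic across the class labels (the exact scoring formula is not spelled out here). The function names `anova_f_score` and `rank_genes` and the toy expression values are hypothetical and serve only as an illustration.

```python
from statistics import mean

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one gene across the class labels."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand = mean(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(expression, labels, top=100):
    """expression is a list of per-gene rows (genes x samples).
    Returns the indices of the `top` highest-scoring genes, best first."""
    scores = [anova_f_score(row, labels) for row in expression]
    order = sorted(range(len(expression)), key=lambda g: scores[g], reverse=True)
    return order[:top]

# Toy data: 3 genes, 6 samples, 2 classes (illustrative values only).
labels = ["A", "A", "A", "B", "B", "B"]
expression = [
    [1.0, 1.1, 0.9, 1.0, 1.2, 0.8],  # gene 0: no class separation
    [0.1, 0.2, 0.0, 2.0, 2.1, 1.9],  # gene 1: strong class separation
    [0.5, 1.5, 1.0, 1.2, 2.0, 1.6],  # gene 2: weak class separation
]
ranking = rank_genes(expression, labels, top=3)
```

In this toy case the gene whose class means differ most relative to its within-class spread is ranked first, which is the behavior the ranking step relies on.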
Step 2: Finding the minimum gene subset
After several top genes are selected from the importance ranking list, this step first attempts to classify the data set with a single gene. Each selected gene is given in turn as the input to the classifier. When good accuracy is not obtained, the data set is classified with all possible 2-gene combinations within the selected genes.
If good accuracy is still not obtained, the procedure is repeated with all 3-gene combinations, and so on, until good accuracy is obtained.
The fuzzy inference system considered here is a model that maps
• input characteristics to input membership functions,
• input membership functions to rules,
• rules to a set of output characteristics,
• output characteristics to output membership functions, and
• the output membership function to a single-valued output, or
• a decision associated with the output.
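The incremental subset search of Step 2 can be sketched as follows. This is only an illustration under stated assumptions: the classifier is a simple nearest-centroid stand-in (not FANFIS), and the names `centroid_accuracy`, `smallest_gene_subset`, the `target` threshold and the toy samples are all hypothetical.

```python
from itertools import combinations

def centroid_accuracy(train_x, train_y, test_x, test_y):
    """Stand-in classifier (nearest class centroid); NOT the paper's FANFIS."""
    centroids = {}
    for c in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)

def smallest_gene_subset(train, train_y, test, test_y, candidates,
                         target=0.95, max_size=3, evaluate=centroid_accuracy):
    """Try all 1-gene, then 2-gene, ... combinations until `target` is reached."""
    best = (0.0, ())
    for size in range(1, max_size + 1):
        for genes in combinations(candidates, size):
            tr = [[s[g] for g in genes] for s in train]
            te = [[s[g] for g in genes] for s in test]
            acc = evaluate(tr, train_y, te, test_y)
            if acc > best[0]:
                best = (acc, genes)
        if best[0] >= target:
            break  # stop at the smallest subset size that is accurate enough
    return best

# Toy samples (rows = samples, columns = 3 candidate genes).
train = [[0, 1, 5], [1, 2, 6], [2, 3, 5], [1, 0, 6], [2, 1, 5], [3, 2, 6]]
train_y = ["A", "A", "A", "B", "B", "B"]
test = [[0.4, 1.4, 6], [2.0, 3.0, 5], [1.0, 0.0, 5], [2.6, 1.6, 6]]
test_y = ["A", "A", "B", "B"]
best_acc, best_genes = smallest_gene_subset(train, train_y, test, test_y, [0, 1, 2])
```

In the toy data no single gene separates the classes, so the search moves on to pairs and stops at the first subset size that reaches the target. Any classifier exposing the same `evaluate` signature could be substituted for the stand-in.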

Architecture of ANFIS
ANFIS is an adaptive framework that supports learning and adaptation. This makes ANFIS modeling more systematic and less dependent on expert knowledge. To illustrate the ANFIS architecture, two fuzzy if-then rules based on a first order Sugeno model are considered:

$$\begin{aligned}
&\text{Rule 1: if } x \text{ is } A_1 \text{ and } y \text{ is } B_1, \text{ then } f_1 = p_1 x + q_1 y + r_1 \\
&\text{Rule 2: if } x \text{ is } A_2 \text{ and } y \text{ is } B_2, \text{ then } f_2 = p_2 x + q_2 y + r_2
\end{aligned} \tag{1}$$

where $x$ and $y$ are the inputs, $A_i$ and $B_i$ are the fuzzy sets, $f_i$ are the outputs within the fuzzy region specified by the fuzzy rule, and $p_i$, $q_i$ and $r_i$ are the design parameters that are determined during the training process.
The ANFIS architecture that implements these two rules is shown in figure 2, in which a circle represents a fixed node and a square represents an adaptive node. In the first layer, every node is an adaptive node. The outputs of the first layer are the fuzzy membership grades of the inputs, given by:

$$O_i^1 = \mu_{A_i}(x), \quad i = 1, 2; \qquad O_i^1 = \mu_{B_{i-2}}(y), \quad i = 3, 4 \tag{2}$$

where $\mu_{A_i}(x)$ and $\mu_{B_{i-2}}(y)$ can be any fuzzy membership function. For example, if the bell shaped membership function is employed, $\mu_{A_i}(x)$ is given by:

$$\mu_{A_i}(x) = \frac{1}{1 + \left| \dfrac{x - c_i}{a_i} \right|^{2 b_i}} \tag{3}$$

where $a_i$, $b_i$ and $c_i$ are the parameters of the membership function, which govern the shape of the bell function.
In layer 2, the nodes are fixed nodes. These nodes are labeled with M, indicating that they act as a simple multiplier. The outputs of this layer are given by:

$$O_i^2 = w_i = \mu_{A_i}(x)\, \mu_{B_i}(y), \quad i = 1, 2 \tag{4}$$

which are called the firing strengths of the rules.
The nodes in layer 3 are fixed as well. They are labeled with N, indicating that they normalize the firing strengths from the previous layer.
The outputs of this layer are given by:

$$O_i^3 = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2 \tag{5}$$

which are called the normalized firing strengths.
In layer 4, all the nodes are adaptive nodes. The output of every node in this layer is simply the product of the normalized firing strength and a first order polynomial. Therefore, the outputs of this layer are given by:

$$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i), \quad i = 1, 2 \tag{6}$$

In layer 5, there is only a single fixed node, labeled with S. This node sums all incoming signals. Therefore, the overall output of the model is given by:

$$O^5 = \sum_{i=1}^{2} \bar{w}_i f_i = \frac{\sum_{i=1}^{2} w_i f_i}{w_1 + w_2} \tag{7}$$

It can be noted that layer 1 and layer 4 are the adaptive layers. Layer 1 contains three modifiable parameters $\{a_i, b_i, c_i\}$ that are associated with the input membership functions. These parameters are called premise parameters. Layer 4 also contains three modifiable parameters $\{p_i, q_i, r_i\}$, related to the first order polynomial. These parameters are called consequent parameters.
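The five layers described above can be traced in a short forward-pass sketch. It assumes the two-rule, two-input first order Sugeno model with bell shaped membership functions; the function names `bell` and `anfis_forward` and all parameter values are hypothetical choices made for illustration.

```python
def bell(x, a, b, c):
    """Bell shaped membership function 1 / (1 + |(x - c) / a|^(2b)); assumes a != 0."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def anfis_forward(x, y, premise_a, premise_b, consequent):
    """Forward pass of a two-input first order Sugeno ANFIS.

    premise_a / premise_b: one (a, b, c) tuple per rule for inputs x and y.
    consequent: one (p, q, r) tuple per rule.
    """
    # Layer 1: membership grades of the inputs.
    mu_a = [bell(x, *params) for params in premise_a]
    mu_b = [bell(y, *params) for params in premise_b]
    # Layer 2: firing strength of each rule (product node).
    w = [ma * mb for ma, mb in zip(mu_a, mu_b)]
    # Layer 3: normalized firing strengths.
    total = sum(w)
    w_bar = [wi / total for wi in w]
    # Layer 4: normalized strength times the first order polynomial.
    rule_out = [wb * (p * x + q * y + r) for wb, (p, q, r) in zip(w_bar, consequent)]
    # Layer 5: summation node gives the overall output.
    return sum(rule_out), w_bar

# Illustrative parameters only: two rules with bells centered at 0 and 2.
premise_a = [(1.0, 2.0, 0.0), (1.0, 2.0, 2.0)]
premise_b = [(1.0, 2.0, 0.0), (1.0, 2.0, 2.0)]
consequent = [(1.0, 1.0, 0.0), (2.0, -1.0, 1.0)]
output, w_bar = anfis_forward(1.0, 1.0, premise_a, premise_b, consequent)
```

At the input (1, 1) both rules fire equally in this toy setup, so the output is the plain average of the two rule polynomials.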

Learning algorithm of ANFIS
The purpose of the learning algorithm is to adjust all the modifiable parameters, namely $\{a_i, b_i, c_i\}$ and $\{p_i, q_i, r_i\}$, so that the ANFIS output matches the training data.
If the parameters $a_i$, $b_i$ and $c_i$ of the membership functions are fixed, the output of the ANFIS model can be written as:

$$f = \frac{w_1}{w_1 + w_2} f_1 + \frac{w_2}{w_1 + w_2} f_2 \tag{8}$$

Substituting the normalized firing strengths of Eq. (5) into Eq. (8) yields:

$$f = \bar{w}_1 f_1 + \bar{w}_2 f_2 \tag{9}$$

Substituting the fuzzy if-then rules into Eq. (9), it becomes:

$$f = \bar{w}_1 (p_1 x + q_1 y + r_1) + \bar{w}_2 (p_2 x + q_2 y + r_2) \tag{10}$$

After rearrangement, the output can be expressed as:

$$f = (\bar{w}_1 x)\, p_1 + (\bar{w}_1 y)\, q_1 + \bar{w}_1 r_1 + (\bar{w}_2 x)\, p_2 + (\bar{w}_2 y)\, q_2 + \bar{w}_2 r_2 \tag{11}$$

which is a linear combination of the adjustable consequent parameters $p_1$, $q_1$, $r_1$, $p_2$, $q_2$ and $r_2$. The least squares technique can therefore be used to find the optimal values of these parameters easily. If the premise parameters are not fixed, the search space becomes larger and convergence takes more time. A hybrid algorithm combining the least squares technique and the gradient descent technique is used to solve this difficulty. The hybrid algorithm consists of a forward pass and a backward pass. In the forward pass, the least squares technique is used to determine the consequent parameters while the premise parameters are held fixed. Once the optimal consequent parameters are determined, the backward pass begins straight away. In the backward pass, the gradient descent technique is used to fine-tune the premise parameters corresponding to the fuzzy sets in the input domain. The output of the ANFIS is computed using the consequent parameters identified in the forward pass.
The output error is used to adapt the premise parameters by means of the standard backpropagation method. It has been shown that this hybrid technique is highly efficient in training the ANFIS.
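The forward pass of the hybrid algorithm can be sketched as an ordinary least squares fit of the consequent parameters with the premise parameters held fixed. The sketch below is an assumption-laden illustration: it solves the normal equations with a small hand-written Gaussian elimination, uses bell shaped membership functions, and all names (`design_row`, `solve`, `fit_consequents`) and numeric values are hypothetical. The backward (gradient descent) pass is not shown.

```python
def bell(x, a, b, c):
    """Bell shaped membership function; assumes a != 0."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def design_row(x, y, premise_a, premise_b):
    """Row [w1*x, w1*y, w1, w2*x, w2*y, w2] of normalized firing strengths."""
    w = [bell(x, *pa) * bell(y, *pb) for pa, pb in zip(premise_a, premise_b)]
    total = sum(w)
    return [v for wi in w for v in (wi / total * x, wi / total * y, wi / total)]

def solve(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_consequents(samples, targets, premise_a, premise_b):
    """Least squares estimate of (p1, q1, r1, p2, q2, r2) via the normal equations."""
    rows = [design_row(x, y, premise_a, premise_b) for x, y in samples]
    m = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    return solve(ata, atb), rows

# Toy check: targets generated from known consequent parameters (illustrative).
premise_a = [(1.0, 2.0, 0.0), (1.0, 2.0, 2.0)]
premise_b = [(1.0, 2.0, 0.0), (1.0, 2.0, 2.0)]
true_theta = [1.0, 1.0, 0.0, 2.0, -1.0, 1.0]
samples = [(0.5 * i, 0.5 * j) for i in range(5) for j in range(5)]
targets = [sum(r * t for r, t in zip(design_row(x, y, premise_a, premise_b), true_theta))
           for x, y in samples]
theta, rows = fit_consequents(samples, targets, premise_a, premise_b)
fitted = [sum(r * t for r, t in zip(row, theta)) for row in rows]
max_error = max(abs(f - t) for f, t in zip(fitted, targets))
```

Because the model output is linear in the consequent parameters once the premise parameters are fixed, a single linear solve suffices here; no iterative search is needed for this half of the training.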

Modified Levenberg-Marquardt algorithm
A Modified Levenberg-Marquardt algorithm is used for training the neural network.
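Consider the performance index $F(w) = e^{T} e$, where $e$ is the error vector and $w$ is the parameter vector. Using the Newton method we have:

$$w_{k+1} = w_k - A_k^{-1} g_k, \qquad A_k = \nabla^2 F(w)\big|_{w = w_k}, \qquad g_k = \nabla F(w)\big|_{w = w_k} \tag{12}$$

The gradient can be written as:

$$\nabla F(w) = 2 J^{T}(w)\, e(w) \tag{13}$$

where $J(w)$, with elements $[J(w)]_{i,j} = \partial e_i(w) / \partial w_j$, is called the Jacobian matrix.
Next we want to find the Hessian matrix. The $k, j$ element of the Hessian matrix is:

$$\left[ \nabla^2 F(w) \right]_{k,j} = 2 \sum_{i} \left\{ \frac{\partial e_i(w)}{\partial w_k} \frac{\partial e_i(w)}{\partial w_j} + e_i(w) \frac{\partial^2 e_i(w)}{\partial w_k\, \partial w_j} \right\} \tag{14}$$

The Hessian matrix can then be expressed as follows:

$$\nabla^2 F(w) = 2 J^{T}(w) J(w) + 2 S(w), \qquad S(w) = \sum_{i} e_i(w)\, \nabla^2 e_i(w) \tag{15}$$

If $S(w)$ is assumed to be small, the Hessian matrix can be approximated as:

$$\nabla^2 F(w) \approx 2 J^{T}(w) J(w) \tag{16}$$

Using (12), (13) and (16) we obtain the Gauss-Newton method as:

$$w_{k+1} = w_k - \left[ J^{T}(w_k) J(w_k) \right]^{-1} J^{T}(w_k)\, e(w_k) \tag{17}$$

The advantage of Gauss-Newton is that it does not require calculation of second derivatives.
A problem with the Gauss-Newton method is that the matrix $H = J^{T} J$ may not be invertible. This can be overcome by using the following modification.
The approximate Hessian matrix can be written as:

$$G = H + \mu I \tag{18}$$

Suppose that the eigenvalues and eigenvectors of $H$ are $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{z_1, z_2, \ldots, z_n\}$. Then:

$$G z_i = [H + \mu I] z_i = H z_i + \mu z_i = \lambda_i z_i + \mu z_i = (\lambda_i + \mu) z_i \tag{19}$$

Therefore the eigenvectors of $G$ are the same as the eigenvectors of $H$, and the eigenvalues of $G$ are $(\lambda_i + \mu)$.
The matrix $G$ can be made positive definite by increasing $\mu$ until $(\lambda_i + \mu) > 0$ for all $i$; the matrix will then be invertible.
This leads to the Levenberg-Marquardt algorithm:

$$w_{k+1} = w_k - \left[ J^{T}(w_k) J(w_k) + \mu I \right]^{-1} J^{T}(w_k)\, e(w_k) \tag{20}$$

The learning parameter $\mu$ governs the step with which the actual output moves toward the desired output. In the standard LM method, $\mu$ is a constant number. This paper modifies the LM method by making $\mu$ depend on the error:

$$\mu \propto e^{T} e \tag{21}$$

where $e$ is a column matrix, so $e^{T} e$ is a $1 \times 1$ scalar and $[J^{T} J + \mu I]$ is invertible.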
Therefore, if the actual output is far from the desired output, that is, if the errors are large, the algorithm moves toward the desired output with large steps. Likewise, when the error is small, the actual output approaches the desired output with small steps. Error oscillation is therefore greatly reduced.
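A minimal sketch of the modified update is given below, assuming the damping term is a scalar multiple of the squared error, μ = `mu_scale` · eᵀe. The factor `mu_scale`, the function name `lm_fit` and the straight-line model are hypothetical and chosen only to keep the example small; this is not the neural network training used in the paper.

```python
def lm_fit(ts, ys, w, mu_scale=0.01, iterations=10):
    """Fit y = w[0] + w[1] * t with Levenberg-Marquardt steps in which the
    damping term follows the current squared error: mu = mu_scale * e^T e.
    mu_scale is an assumed, illustrative factor."""
    history = []
    for _ in range(iterations):
        e = [w[0] + w[1] * t - y for t, y in zip(ts, ys)]  # error vector e(w)
        jac = [(1.0, t) for t in ts]                        # rows of J = de/dw
        sse = sum(ei * ei for ei in e)                      # e^T e
        history.append(sse)
        mu = mu_scale * sse
        # G = J^T J + mu * I (2 x 2) and g = J^T e.
        g00 = sum(j[0] * j[0] for j in jac) + mu
        g01 = sum(j[0] * j[1] for j in jac)
        g11 = sum(j[1] * j[1] for j in jac) + mu
        b0 = sum(j[0] * ei for j, ei in zip(jac, e))
        b1 = sum(j[1] * ei for j, ei in zip(jac, e))
        # delta = G^{-1} g using the closed-form inverse of a 2 x 2 matrix.
        det = g00 * g11 - g01 * g01
        d0 = (g11 * b0 - g01 * b1) / det
        d1 = (g00 * b1 - g01 * b0) / det
        w = [w[0] - d0, w[1] - d1]
    return w, history

# Noise-free toy data generated from y = 1 + 2t (illustrative only).
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * t for t in ts]
w_fit, sse_history = lm_fit(ts, ys, [0.0, 0.0])
```

As the squared error shrinks, the damping term shrinks with it, so the later iterations behave almost like plain Gauss-Newton steps.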

Lymphoma Data Set
The lymphoma data set [18] contains 62 samples: 42 samples from diffuse large B-cell lymphoma (DLBCL), 9 samples from follicular lymphoma (FL), and 11 samples from chronic lymphocytic leukaemia (CLL). The whole data set contains the expression data of 4026 genes. In this data set, a small portion of the data is missing. A k-nearest neighbor technique was used to fill in the missing data. In the initial step, the 62 samples are randomly separated into 2 groups, with 31 samples for training and 31 samples for testing. Next, all 4026 genes are ranked with the help of the ANOVA technique, and the 100 highest-ranked genes are retained. Then the proposed FANFIS technique is applied for classification.

Liver Cancer Data Set
The liver cancer data set [19] contains two classes, i.e. nontumor liver and hepatocellular carcinoma (HCC). The data set consists of 156 samples and the expression data of 1648 important genes. Of these samples, 82 are HCCs and the remaining 74 are nontumor livers.
The data is randomly separated into 78 training samples and 78 testing samples. In this data set, there are some missing values. The k-nearest neighbor technique is used to fill in those missing values. Initially, 100 important genes are chosen in the training data set. Next, all possible 1-gene and 2-gene combinations within the 100 important genes are tested.

V. CONCLUSION
This paper suggests a better technique for the classification of cancer. In the proposed technique, the ANOVA ranking technique is first applied to the data set in order to find the highest ranked genes.
After ranking the genes, an Adaptive Neuro-Fuzzy Inference System, which has the advantages of both neural networks and fuzzy logic, can be used for classification, but it takes more time for classification. To overcome this, the paper uses the Fast Adaptive Neuro-Fuzzy Inference System (FANFIS). The learning is performed using the Modified Levenberg-Marquardt algorithm. The proposed technique is tested using two data sets, namely the lymphoma data set and the liver cancer data set. The experimental results show that the proposed technique yields better classification accuracy and takes less time to converge.

Fig. 5. Classification Accuracy for Liver Cancer Data Set with ANOVA Ranking
Figure 5 represents the results of classifying the liver cancer data set, and Figure 6 represents the convergence behavior for the liver cancer data set. From these results, it can be observed that the proposed technique yields better classification accuracy and takes less time to converge.

Fig. 6. Convergence Behavior for Liver Cancer Data Set with ANOVA Ranking