An NB-ANN based Fusion Approach for Disease Genes Prediction and LFKH-ANFIS Classifier for Eye Diseases Identification

Identifying disease genes is a key step toward understanding the cellular mechanisms behind a particular disease. Computational prediction of disease genes is less expensive and easier than biological experiments. Here, an effective deep learning-based fusion algorithm called Naive Bayes-Artificial Neural Network (NB-ANN) is proposed for disease gene identification. Additionally, this paper proposes an effective classifier, namely a Levy Flight Krill Herd (LFKH) based Adaptive Neuro-Fuzzy Inference System (ANFIS), for predicting eye diseases caused by human disease genes. Using this technique, 10 different types of eye diseases are identified. The NB-ANN comprises four steps: a) construction of four Feature Vectors (FV), b) selection of negative data, c) training on the FV using NB, and d) prediction using the ANN. The LFKH-ANFIS performs Feature Extraction (FE), Feature Reduction (FR), and classification for eye disease prediction. The experimental outcomes demonstrate the method's efficiency in terms of precision and recall.

Keywords: Disease gene identification; eye disease identification; deep learning; adaptive neuro-fuzzy inference system (ANFIS); Levy flight based krill herd (LFKH); principal component analysis (PCA)


I. INTRODUCTION
Complex diseases arise from the dysfunction of a collection of genes, referred to as disease genes [1] [2]. Recognizing the genes involved in genetic as well as rare diseases is a key step toward uncovering the fundamental molecular mechanisms of disease [3]. Prioritizing candidate genes with experimental approaches is costly and tedious [4]. Existing computational prediction techniques fall into two categories: matrix decomposition and network propagation [5]. In recent years, high-throughput [6] gene expression profiling has allowed the molecular differences between healthy and disease states to be characterized, leading to the recognition of a growing number of disease-linked genes [7]. A large number of machine learning-based computational methods have been developed for predicting disease genes [8], such as restricted Boltzmann machines [9], deep belief networks [10], linear regression models [11], support vector machines [12], and multilayer perceptrons (MLP) [13]. These often attain higher prediction accuracy on larger data sets [14]. However, biomedical data sets are typically small, and the resulting low statistical power causes poor reproducibility of prediction outcomes across different patients [15]. To overcome these drawbacks, this paper proposes an NB-ANN for identifying disease genes and an LFKH-ANFIS for identifying the eye-linked diseases triggered by those disease genes.

II. LITERATURE REVIEW
Chen BoLin et al. [16] proposed a kernel-based Markov random field approach for capturing gene-disease associations on the basis of biological networks. Three kinds of kernels were used to describe the overall relations of vertices in five biological networks, and a weighted scheme was built to merge those data. The approach achieved an Area Under the ROC Curve (AUC) score of 0.771 when merging all the biological data. However, the Markov Exponential Diffusion (MED) kernel gave a lower AUC than the Laplacian Exponential Diffusion (LED) kernel in the integrated three-network setting. Abdulaziz Yousef and Nasrollah Moghadam Charkari [17] presented a disease gene identification technique based on the physicochemical properties of amino acids and a classification algorithm. The physicochemical properties were used to convert protein sequences into numerical vectors for feature vector generation, and a support vector data description algorithm was employed to predict the disease genes. The method outperformed prevailing methods in precision, recall, and F-measure. However, Principal Component Analysis (PCA) requires standardized data, and the absence of standardization caused PCA to fail in finding the optimal components, which in turn affected the model's performance.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 10, 2021, www.ijacsa.thesai.org

Zhen Tian et al. [18] focused on a framework, termed RWRB, for inferring the causal genes of diseases. Five gene (protein) Similarity Networks (SN) were individually constructed from various genomic data, and an integrated gene SN was then rebuilt using an SN fusion approach. The random walk with restart algorithm was applied to a Phenotype-Gene (PG) bi-layer network, which integrated the phenotype SN, the PG associations, and the integrated gene SN, to prioritize the candidate (disease-associated) genes. The results confirmed that RWRB was more accurate than certain other methods on the evaluation metrics. However, its performance on the Number of Successful Predictions (NSP) metric degraded when the jumping probability exceeded 0.6 in different experiments.
Mehdi Joodaki et al. [19] put forward a gene ranking approach named Random Walk with Restart on a Heterogeneous Network with Fuzzy Fusion (RWRHN-FF). First, four gene-gene similarity networks were generated from different genomic sources and then combined using a type-II fuzzy voter scheme. The resulting gene-gene network was linked with a disease-disease similarity network, which was created by integrating four sources via a two-part disease-gene network. RWRHN was then used to analyze this network. In terms of Area Under the ROC Curve (AUC) and convergence time, the approach outperformed the prevailing methods. However, its precision declined because of poor integration of data from multiple sources.
Pradipta Maji and Ekta Shah [20] identified disease-associated genes using a gene selection algorithm named SiFS. The SiFS algorithm selected a set of genes from micro-array data as disease genes by maximizing both the significance and the functional similarity of the chosen gene subset. A similarity metric was also introduced for computing the functional similarity between two genes. Experimental results on different data sets confirmed that the algorithm recognized more disease-associated genes than prevailing disease gene selection methods. However, the similarity measure was affected by the low coverage of human genes and the reliability of protein-protein interaction (PPI) data.

III. PROPOSED METHODOLOGY
Here, a novel sequence-based fusion method (NB-ANN) is proposed for disease gene identification, and the LFKH-ANFIS is proposed for identifying the eye-related diseases triggered by those disease genes, such as Age-related Macular Degeneration (AMD), cataract, glaucoma, inherited optic neuropathies, Marfan syndrome, polypoidal choroidal vasculopathy, retinitis pigmentosa, Stargardt disease, and uveal melanoma.
The architecture of the proposed method is shown in Fig. 1. In the first phase, representation methods are used to obtain the FV from the disease genes and the unknown genes. In the second phase, a dataset with positive and reliable negative instances is created by selecting a negative protein set. In the third phase, the different FV of the same instances are classified using the NB classifier. In the fourth phase, the ANN fuses the NB classifiers to enhance accuracy. After disease gene identification, FE is performed: features are extracted from the identified disease genes in order to classify the eye-related diseases caused by those genes. Next, PCA is employed for FR to remove redundant features. Lastly, the LFKH-ANFIS algorithm performs the eye disease classification.

A. Disease Gene Identification
Here, the technique for identifying and prioritizing disease genes is described. The proposed work comprises four steps: (i) translating the gene products (proteins) into four numerical FV using four kinds of protein sequence translators, (ii) choosing negative data from the unknown genes, (iii) modeling every FV using NB, and (iv) using the ANN to make the final decision by fusing the predictions of the base NB classifiers.

B. Protein Sequence Translation
Extracting FV for the disease genes and the unknown genes is one of the most important challenges when addressing disease gene identification with a machine learning algorithm. Here, genes are characterized through their equivalent gene products (proteins). To extract the essential information encoded in each protein, four representation techniques are used: i) Normalized Moreau-Broto autocorrelation (NA), ii) Geary autocorrelation (GA), iii) auto covariance (AC), and iv) Moran autocorrelation (MA). These techniques are used to avoid losing important information hidden in the protein sequences. All of them are based on the physicochemical properties of amino acids, because the amino acid sequence determines the protein; in other words, amino acids are the building blocks of proteins. Here, 12 physicochemical properties are employed as descriptors to provide more information about the amino acid sequence.
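The four descriptors can be illustrated with a minimal sketch. It assumes the widely used textbook definitions of the NA, AC, MA, and GA descriptors, which the text above does not spell out, and it operates on a sequence that has already been mapped to the values of a single physicochemical property. The function names, the `max_lag` parameter, and the toy `profile` values are hypothetical and are not taken from the paper.

```python
def _mean(values):
    return sum(values) / len(values)

def moreau_broto(values, lag):
    """Normalized Moreau-Broto autocorrelation (NA) at one lag."""
    n = len(values)
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)

def auto_covariance(values, lag):
    """Auto covariance (AC): mean product of centered values `lag` apart."""
    n, m = len(values), _mean(values)
    return sum((values[i] - m) * (values[i + lag] - m) for i in range(n - lag)) / (n - lag)

def moran(values, lag):
    """Moran autocorrelation (MA): auto covariance scaled by the variance."""
    m = _mean(values)
    variance = sum((v - m) ** 2 for v in values) / len(values)
    return auto_covariance(values, lag) / variance

def geary(values, lag):
    """Geary autocorrelation (GA): squared differences scaled by sample variance."""
    n, m = len(values), _mean(values)
    num = sum((values[i] - values[i + lag]) ** 2 for i in range(n - lag)) / (2 * (n - lag))
    den = sum((v - m) ** 2 for v in values) / (n - 1)
    return num / den

def feature_vector(values, descriptor, max_lag):
    """One descriptor value per lag 1..max_lag for a single property profile."""
    return [descriptor(values, lag) for lag in range(1, max_lag + 1)]

# Hypothetical property profile: each number stands for one residue's property value.
profile = [1.0, 2.0, 3.0, 4.0]
print(feature_vector(profile, moran, max_lag=2))
```

In a full pipeline, one such vector would be computed for each of the 12 properties and the results concatenated, giving one FV per descriptor type.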

C. Negative Data Generation
After generating the FV for all genes, a negative protein set must be selected from the unknown proteins in order to construct a dataset with positive and reliable negative instances. For this purpose, a six-step algorithm is proposed.
Step1: Define four negative sets, one for each of the feature vectors, as empty sets (Eq. (1)).
Step2: Represent each protein $R_i$ (disease and unknown proteins) as four vectors using the AC, GA, MA, and NA representation methods.
Step5: For each FV, choose $g$ negative proteins from the unknown set $R_U$ by selecting the $g$ proteins farthest from $M_p$. The Manhattan distance is used to measure the distance between $R_j$ and $M_p$. As the number of unknown proteins is much larger than the number of disease proteins, the number ($g$) of chosen negative proteins has a direct effect on the construction of the prediction model.
Step6: Finally, the proteins in the intersection of the chosen negative protein sets are selected as the reliable negative data ($R_{NS}$).
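The selection and intersection steps can be sketched as follows. The sketch assumes that $M_p$ is the centroid (mean vector) of the positive proteins in a given feature space, which the surviving text does not define. The function names and the data layout (a dictionary from protein identifier to vector) are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean; used here as an assumed stand-in for M_p."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def farthest_unknown(positives, unknowns, g):
    """Return the ids of the g unknown proteins farthest from the positive centroid."""
    m_p = centroid(positives)
    ranked = sorted(unknowns, key=lambda pid: manhattan(unknowns[pid], m_p), reverse=True)
    return set(ranked[:g])

def reliable_negatives(views, g):
    """Intersect the per-FV selections (one (positives, unknowns) pair per FV type)."""
    selections = [farthest_unknown(pos, unk, g) for pos, unk in views]
    return set.intersection(*selections)

# Two hypothetical feature spaces over the same four unknown proteins.
view_1 = ([[0.0, 0.0], [2.0, 2.0]],
          {"a": [1.0, 1.0], "b": [5.0, 5.0], "c": [4.0, 1.0], "d": [10.0, 10.0]})
view_2 = ([[0.0], [2.0]],
          {"a": [1.5], "b": [20.0], "c": [30.0], "d": [2.0]})
print(reliable_negatives([view_1, view_2], g=2))
```

A protein is kept only if every feature space ranks it among the $g$ farthest, so a larger $g$ yields more negatives at the cost of including proteins closer to the positive set.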

D. Naive Bayes Algorithm
Naive Bayes (NB) is a probabilistic classifier based on Bayes' theorem under a simple assumption, namely that the attributes are conditionally independent given the class. NB is particularly simple to implement and has achieved good results in most cases. Nevertheless, using the same classifier (NB) to classify the different FV of the same instances introduces some uncertainty, and each classifier makes its own individual errors. A practical fusion of these classifiers is therefore likely to lessen the overall prediction error and give better predictions by reducing the negative effect of noisy data, which grows in proportion to the negative data ratio. Here, the ANN is used as the fusion method in the fourth layer. A general description of the ANN is given in the section below.
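A minimal sketch of one base NB classifier is given here for illustration. It assumes Gaussian class-conditional likelihoods, a common choice for continuous features that the text does not state, and the class name `GaussianNaiveBayes` and the toy data are hypothetical.

```python
import math

class GaussianNaiveBayes:
    """Naive Bayes with per-feature Gaussian likelihoods (assumed, not from the paper)."""

    def fit(self, X, y):
        self.priors, self.stats = {}, {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.priors[label] = len(rows) / len(X)
            self.stats[label] = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Small constant keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
                self.stats[label].append((mean, var))
        return self

    def predict_proba(self, x):
        """Posterior P(class | x) under the conditional-independence assumption."""
        log_post = {}
        for label, prior in self.priors.items():
            total = math.log(prior)
            for value, (mean, var) in zip(x, self.stats[label]):
                total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
            log_post[label] = total
        top = max(log_post.values())  # subtract the max for numerical stability
        weights = {label: math.exp(v - top) for label, v in log_post.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

# Hypothetical one-dimensional FV: label 1 = disease protein, 0 = negative.
X = [[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]]
y = [0, 0, 0, 1, 1, 1]
model = GaussianNaiveBayes().fit(X, y)
print(model.predict_proba([2.9]))
```

One such model would be trained per FV type (AC, GA, MA, NA), and the four posterior probabilities for an instance become the inputs to the fusion stage.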

E. Artificial Neural Network
An ANN classifier comprises many interconnected artificial neurons whose connections carry adjustable weights. The input patterns are passed through the layers to solve the problem, and the information is mapped using the corresponding synaptic weights.
Step3: To find the final output unit, the hidden units are multiplied by the weights of the hidden-to-output layer, as given in equation (23).
The NB-based classifiers build models for the same dataset using different FV. Fusing the outputs of the NB-based predictors with the ANN therefore makes simultaneous use of the different feature descriptors and classification procedures.
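The fusion step can be illustrated with a forward pass through a small network whose inputs are the four NB posterior probabilities. This is only a sketch: the sigmoid activation, the single hidden layer with two units, and all weight values are assumptions chosen for illustration, and the weights are untrained placeholders rather than values from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse(nb_probs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: NB probabilities -> hidden units -> fused disease score."""
    hidden = [
        sigmoid(sum(w * p for w, p in zip(weights, nb_probs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Output unit: hidden activations multiplied by the hidden-to-output weights.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical, untrained weights for a 4-input, 2-hidden-unit network.
W_HIDDEN = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
B_HIDDEN = [-2.0, -1.0]
W_OUT = [2.0, 2.0]
B_OUT = -2.0

# Posteriors from the AC-NB, GA-NB, MA-NB, and NA-NB classifiers (made-up values).
print(fuse([0.9, 0.8, 0.85, 0.7], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT))
```

In practice the weights would be learned by training on the NB outputs for labeled instances, so that the network learns how much to trust each base classifier.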

F. Feature Extraction
After disease gene identification, this phase extracts features from the identified disease genes in order to classify the eye-related diseases caused by those genes. The extracted features are Katz Fractal Dimension (KFD), Log Energy (LE), Hurst Exponent (HE), Shannon Entropy (SE), Skewness, Mean, Kurtosis, Detrended Fluctuation Analysis (DFA), Discrete Wavelet Transform, and Standard Deviation.
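Several of these features can be sketched for a numeric sequence as follows. The definitions are common ones and are assumptions here, since the text does not give formulas: population moments for skewness and kurtosis, the sum of log squared values for log energy, entropy over value frequencies for Shannon entropy, and the simplified one-dimensional form of the Katz fractal dimension. The Hurst exponent, DFA, and wavelet features are omitted for brevity, and all function names are hypothetical.

```python
import math
from collections import Counter

def central_moment(signal, order):
    mean = sum(signal) / len(signal)
    return sum((v - mean) ** order for v in signal) / len(signal)

def basic_statistics(signal):
    """Mean, standard deviation, skewness, and (non-excess) kurtosis."""
    m2 = central_moment(signal, 2)
    return {
        "mean": sum(signal) / len(signal),
        "std": math.sqrt(m2),
        "skewness": central_moment(signal, 3) / m2 ** 1.5,
        "kurtosis": central_moment(signal, 4) / m2 ** 2,
    }

def log_energy(signal):
    """Sum of log squared values; zero samples contribute nothing."""
    return sum(math.log(v * v) for v in signal if v != 0)

def shannon_entropy(signal):
    """Entropy in bits of the empirical distribution of the observed values."""
    counts = Counter(signal)
    n = len(signal)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def katz_fd(signal):
    """Simplified Katz fractal dimension using amplitude differences only."""
    length = sum(abs(b - a) for a, b in zip(signal, signal[1:]))  # total path length
    extent = max(abs(v - signal[0]) for v in signal[1:])          # farthest point from start
    steps = len(signal) - 1
    return math.log10(steps) / (math.log10(steps) + math.log10(extent / length))

signal = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical numeric sequence
features = {**basic_statistics(signal),
            "log_energy": log_energy(signal),
            "shannon_entropy": shannon_entropy(signal),
            "kfd": katz_fd(signal)}
print(features)
```

A straight ramp such as the toy signal has a Katz dimension of 1, which is a quick sanity check on the implementation.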

G. Feature Reduction
Following feature extraction, feature reduction is performed using PCA. PCA preserves the existing information while eliminating redundant components, and is employed to discover the significant features.
Step3: Sort the outcomes in decreasing order of $\lambda_i$.
Step4: Choose the indispensable components (that is, features).
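The PCA procedure can be sketched with NumPy. Because the first two steps are not recoverable from the text, the sketch assumes the standard ones: standardize the features, then compute the eigenvalues and eigenvectors of their covariance matrix. The function name `pca_reduce` and the toy matrix are hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project standardized data onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    # Assumed step: standardize each feature (zero mean, unit variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Assumed step: eigen-decomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # Sort eigenvalues (and matching eigenvectors) in decreasing order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the leading components and report the variance share of each.
    explained = eigvals / eigvals.sum()
    return Z @ eigvecs[:, :n_components], explained

# Two perfectly correlated hypothetical features: one component carries everything.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
reduced, explained = pca_reduce(X, n_components=1)
print(reduced.shape, explained)
```

The `explained` ratios give a simple criterion for the last step, for example keeping the smallest number of components whose cumulative share exceeds a chosen threshold.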

H. Classification for the Identification of Eye Diseases
Here, the related eye disease prediction is performed using the LFKH-ANFIS algorithm. Gradient-based learning is the standard learning process in ANFIS, but it is prone to getting trapped in local minima. For this reason, the ANFIS is improved using LFKH to lessen its complexity and raise the classification accuracy; the resulting model is termed LFKH based ANFIS (LFKH-ANFIS). The ANFIS has two fuzzy IF-THEN rules, as specified in equations (28) and (29). The rule parameters are optimized with the help of the LFKH algorithm to attain a better outcome. The ANFIS comprises the layers described below.
Layer1: The first layer, named the fuzzification layer, takes the input values and determines their membership functions (MF).

 
Each node here is an adaptive node with a node function. The output of each node is the degree of membership given by the MF for its input. The MF used in the proposed work is the Gaussian kernel MF, chosen to reduce the computational cost of ANFIS since it has the fewest adjustable parameters.
In the MF equation, $c_i$, $d_i$, and $e_i$ are the MF parameters that alter the shape of the MF, and they are referred to as the premise parameters.
Layer2: This layer, named the rule layer, is responsible for generating the firing strengths (FS) of the rules. The incoming signals are multiplied together, and the product is the FS of a rule.
Layer4: This layer takes the normalized values obtained above as inputs, together with the consequent parameter set, and it has adaptive nodes with a node function.
Layer5: The fourth layer provides the defuzzified values, which are passed to the fifth layer to obtain the final output. All incoming signals are summed to obtain the overall output, and the circle node here is labeled $\Sigma$.
From the LFKH-ANFIS, the 10 classes of eye diseases for the identified disease gene are obtained, that is, Age-related Macular Degeneration (AMD), cataract, Marfan syndrome, glaucoma, inherited optic neuropathies, polypoidal choroidal vasculopathy, retinitis pigmentosa, uveal melanoma, and Stargardt disease.
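The layer-by-layer computation can be sketched as a forward pass for a two-input, two-rule ANFIS. The sketch assumes the standard first-order Sugeno form, in which each rule output is a linear function of the inputs, and a two-parameter Gaussian MF (center and width). The normalization layer between the rule layer and the fourth layer is assumed to be the usual division by the sum of firing strengths. All parameter values and names are hypothetical.

```python
import math

def gaussian_mf(x, center, sigma):
    """Layer 1: Gaussian membership degree of x."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def anfis_forward(x, y, premise, consequent):
    """Forward pass through the five ANFIS layers for inputs x and y.

    premise:    per rule, ((center, sigma) for x, (center, sigma) for y)
    consequent: per rule, (p, q, r) of the linear output p*x + q*y + r
    """
    # Layer 2: firing strength = product of the incoming membership degrees.
    firing = [gaussian_mf(x, *mf_x) * gaussian_mf(y, *mf_y) for mf_x, mf_y in premise]
    # Layer 3 (assumed): normalize the firing strengths.
    total = sum(firing)
    normalized = [w / total for w in firing]
    # Layer 4: weight each rule's linear output by its normalized strength.
    rule_outputs = [w * (p * x + q * y + r) for w, (p, q, r) in zip(normalized, consequent)]
    # Layer 5: sum all incoming signals.
    return sum(rule_outputs)

# Hypothetical parameters: rule 1 is centered at 0 and outputs 1, rule 2 at 2 and outputs 3.
PREMISE = [((0.0, 1.0), (0.0, 1.0)), ((2.0, 1.0), (2.0, 1.0))]
CONSEQUENT = [(0.0, 0.0, 1.0), (0.0, 0.0, 3.0)]
print(anfis_forward(1.0, 1.0, PREMISE, CONSEQUENT))
```

The premise and consequent parameters are exactly the values that an optimizer such as LFKH would tune in place of gradient-based learning.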

I. Levy Flight based Krill Herd Algorithm
The Krill Herd (KH) algorithm can effectively determine the optimum solution for certain search space configurations. However, because the exploration of its search strategy is weak, KH cannot guarantee convergence. The proposed method incorporates Levy flight (LF) into KH to resolve this difficulty. Hence, the parameter tuning for ANFIS using this optimization is termed LF based KH (LF-KH). Using the Lagrangian model, the krill's location is evaluated.
To enhance exploration, a random value $rd$ in the range (0, 1) is used. The proposed approach uses LF for the random walk, rather than a simpler one, to overcome the inability of the KH search strategy to ensure convergence. LF maximizes the efficiency of searches in uncertain environments. A new candidate solution $X_i'$ is generated for the $i$-th solution by performing LF.
Step2: The foraging motion, also known as the searching motion, is evaluated with respect to two main parameters: i) the food location and ii) the previous experience of the krill individuals (KI) regarding the food location. Here, $F_s$ is the foraging speed. The KH movement is regarded as a process toward the BF, which determines the KI position.
Here, $\Delta t$, the scale factor of the speed vector, is an important parameter that must be adjusted according to the optimization problem. Its value depends entirely on the given search space.
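The Levy flight random walk can be sketched in isolation. The sketch assumes Mantegna's algorithm for drawing Levy-distributed steps and a greedy acceptance rule; neither is stated in the text, and the sketch does not implement the induced, foraging, or diffusion motions of the full krill herd algorithm. The function names, the step scale `alpha`, and the exponent `beta` are hypothetical choices.

```python
import math
import random

def mantegna_sigma(beta):
    """Scale of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)

def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step length with Levy exponent beta."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def levy_candidate(position, lower, upper, alpha=0.01, beta=1.5, rng=random):
    """Perturb each coordinate with a Levy step and clip it to the bounds."""
    candidate = []
    for x, lo, hi in zip(position, lower, upper):
        moved = x + alpha * (hi - lo) * levy_step(beta, rng)
        candidate.append(min(max(moved, lo), hi))
    return candidate

def levy_search(fitness, start, lower, upper, iterations=100, rng=random):
    """Greedy random walk: keep a Levy candidate only if it lowers the fitness."""
    best, best_fit = list(start), fitness(start)
    for _ in range(iterations):
        candidate = levy_candidate(best, lower, upper, rng=rng)
        candidate_fit = fitness(candidate)
        if candidate_fit < best_fit:
            best, best_fit = candidate, candidate_fit
    return best, best_fit

# Toy objective standing in for the ANFIS training error (hypothetical).
sphere = lambda p: sum(x * x for x in p)
print(levy_search(sphere, [0.9, -0.8], [-1.0, -1.0], [1.0, 1.0], rng=random.Random(0)))
```

Most Levy steps are small, which refines the current solution, while the occasional long jump lets the search leave a local minimum; this is the exploration behavior the text attributes to LF.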

IV. RESULT AND DISCUSSION
Here, the proposed system is analyzed and its performance is compared with existing algorithms on several performance metrics. To assess the robustness of the proposed method, reduce overfitting, and reduce the bias in the estimate of the classification model, 5-fold cross-validation was employed on a dataset with 10,000 instances (5000 positive and 5000 negative). Table I presents the prediction performance obtained by the proposed NB-ANN predictor and the individual NB-based classifiers.
Table I shows the f-measure, precision, and recall values attained by the NB-based and fusion-based predictors. AC-NB achieves 81.6% precision, which is higher than that of GA-NB (74%), NA-NB (76.54%), and MA-NB (72.4%). The proposed fusion method, however, attains the highest precision of all (83.52%). Likewise, for f-measure and recall, the proposed NB-ANN classifier gives the highest values among the NB-based approaches. The fusion predictor thus outperforms each individual NB-based predictor. Because classifying different FV of the same data with the same classifier introduces some uncertainty, fusing the classifier outputs reduces the overall classification error.
Fig. 2 compares NB-ANN with existing approaches in terms of f-measure, precision, and recall. The existing SVM-C4.5 performs well, but all the existing approaches perform worse than the proposed NB-ANN. This comparison confirms that the proposed NB-ANN performs better than the other approaches for disease gene identification.
The next experiment analyzes the proposed LFKH-ANFIS and compares it with existing techniques in terms of sensitivity, precision, specificity, recall, accuracy, f-measure, PPV, NPV, MCC, and FDR. Table II presents the results of LFKH-ANFIS and several existing algorithms. The LFKH-ANFIS achieves 0.9412 for precision, f-measure, and recall, whereas the existing ANFIS, DNN, ANN, and KNN give values of 0.8417, 0.8254, 0.8347, and 0.7648, respectively, for these metrics. The LFKH-ANFIS also shows superior performance in sensitivity, specificity, accuracy, NPV, and MCC.
Likewise, for MCC and NPV, the LFKH-ANFIS gives the best results compared with the existing algorithms. These results confirm that the proposed LFKH-ANFIS is better than the other existing algorithms for eye disease identification. The error rate measures of a classification algorithm, namely FPR, FRR, and FNR, quantify the errors that occur during classification.
For an effective classification algorithm, the error rate measures must be low, and among the compared methods the lowest values are achieved by the proposed LFKH-ANFIS algorithm.
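The reported metrics all follow from the four confusion-matrix counts, as the sketch below shows for the binary case. It uses the standard definitions; how the paper aggregates the metrics over the 10 eye disease classes is not stated, so a per-class (one-versus-rest) application is an assumption. The function name and the example counts are hypothetical and are not results from the paper.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)          # also called PPV
    recall = tp / (tp + fn)             # also called sensitivity
    specificity = tn / (tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "fpr": fp / (fp + tn),          # 1 - specificity
        "fnr": fn / (fn + tp),          # 1 - recall
        "fdr": fp / (fp + tp),          # 1 - precision
    }

# Hypothetical counts for illustration only.
for name, value in classification_metrics(tp=45, fp=5, fn=5, tn=45).items():
    print(f"{name}: {value:.4f}")
```

Because FPR, FNR, and FDR are the complements of specificity, recall, and precision, a method that leads on those three metrics necessarily has the lowest corresponding error rates.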

V. CONCLUSION
Compared with the existing PUDI, ProDige, and SVM-C4.5, the proposed NB-ANN attains the highest values of precision, f-measure, and recall. Likewise, the LFKH-ANFIS shows the best performance, with the highest sensitivity, precision, specificity, recall, accuracy, f-measure, NPV, and MCC when compared with ANN, KNN, DNN, and ANFIS. The proposed LFKH-ANFIS also attains the lowest error rates (FNR, FPR, and FRR) for eye disease identification, which demonstrates the efficiency of the proposed method. Therefore, disease genes and the eye diseases likely to be caused by them are identified more accurately using the two classification algorithms. For future work, a larger number of physicochemical properties of amino acids will be considered to improve classification performance.