Binning Approach based on Classical Clustering for Type 2 Diabetes Diagnosis

In recent years, numerous studies have focused on metagenomic data to improve human disease prediction. Despite the complexity of disease, several proposed frameworks show promising performance when using metagenomic data to predict disease. Diagnosing Type 2 diabetes (T2D) from metagenomic data is one of the more challenging tasks compared with other diseases: state-of-the-art prediction performance for T2D is usually poor, at around 65% accuracy. In this study, we propose a method that combines the K-means clustering algorithm with unsupervised binning approaches to improve the performance of metagenome-based disease prediction. Experiments on metagenomic datasets related to Type 2 diabetes show that the proposed method, which uses clusters generated by K-means to define the bins, increases prediction accuracy to approximately 70% or more.

Keywords: Unsupervised binning; K-means clustering algorithm; metagenomics; metagenome-based disease prediction; Type 2 diabetes diagnosis


I. INTRODUCTION
Metagenomics (Environmental Genomics, Ecogenomics or Community Genomics) is the study of genetic material recovered directly from environmental samples. It studies communities of microbial organisms directly in their natural environments by applying modern genomic techniques that bypass the need for isolation and lab cultivation of individual species [1], [2], [3], [4], [5], [6]. Reassembly of multiple genomes has provided insight into energy and nutrient cycling within the community, genome structure, gene function, population genetics and microheterogeneity, and lateral gene transfer among members of an uncultured community. The application of metagenomic sequence information will facilitate the design of better culturing strategies to link genomic analysis with pure culture studies. Why do we study metagenomics? As mentioned in [2], metagenomics has led to the discovery of novel natural products, new antibiotics, new molecules with new functions, and new enzymes and bioactive molecules. It has also shed light on what a genome or species is, the diversity of life, the interplay between humans and microbes, how microbial communities work and how stable they are, and it offers a holistic view of biology. Metagenomic studies have cloned specific gene sequences (usually 16S rRNA genes) to collect data on the biodiversity of environmental samples. Traditional genetic and microbiological studies, which sequence the genomes of microorganisms from cultured samples, were found to be unable to capture the biodiversity of microorganisms. Therefore, metagenomics plays an important role in helping humans discover microbial diversity. In medicine, the microbial community plays a very important role in protecting human health. The purpose of metagenomics is thus to understand the composition and activity of complex microbial groups in environmental samples through analysis of their DNA sequences.
On the other hand, the large amount of data available on multiple genomes allows a range of gene isolation projects to be carried out, depending on the purpose of the research.
Metagenomics is an improvement over traditional microbiology: metagenomes are studied from genetic material obtained directly from the original samples, without the need for laboratory cultures. This method is commonly applied to the human intestine because it is where digestion and metabolism take place and because it hosts about 10 times as many cells as the rest of the body. Based on metagenomics, we can develop algorithms to predict disease, determine a patient's sensitivity and then offer reasonable treatments. However, disease is complicated to diagnose and to prognose, and we only have a limited amount of data to observe. Type 2 diabetes (T2D) is a heterogeneous metabolic disorder that damages many organs of the body. Its prevalence tends to increase under the influence of modern lifestyles and unhealthy habits. Nowadays, prediction is not highly accurate, and the same treatment is commonly applied to patients diagnosed with similar manifestations. With such treatment, genetic diversity is not effectively taken into account, so health improves for only some patients. Models for predicting T2D usually yield poor results.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 3, 2020

II. RELATED WORK
As mentioned above, metagenomics is an approach that extracts genomic information directly from the environmental sample. As a result, the genetic information sampled is more representative of a given environment and gives better insight into microbial environmental and metabolic diversity. Metagenomics projects use next-generation sequencing to determine the genetic potential of microbial communities from a wealth of environmental niches, including those linked with the human body and relevant to human healthcare. The role of the human microbiome in health and disease has recently received considerable attention [7], and distinct diseases have been associated with the gut microbiota [7], [8], [9], [10], [11], [12], [13], [14], [15]. Maja et al. [8] showed that a bias in codon usage is present throughout the entire microbial community by applying definitions of translational optimization through codon usage adaptation to complete metagenomic datasets. This bias can be used as a powerful analytical tool for predicting community lifestyle-specific metabolism. Moreover, Maja et al. demonstrated this approach, combined with machine learning, to classify human gut microbiome samples according to the pathological condition diagnosed in the human host. In addition, predicting disease-relevant features in gut microbial metagenomes by combining the prokaryotic translational optimization effect with machine-learning-based classification and enriched gene datasets provides a supportive method for analyzing metagenomic datasets. The authors of [8], [16] proposed methods using machine learning and deep learning for disease prediction tasks and obtained promising results.
K-means clustering is an unsupervised learning algorithm. Given unlabeled input data and the number of clusters to form, the algorithm divides the data into clusters with similar properties. Clustering algorithms are commonly used to group data, and they give a meaningful intuition of the structure of the data. Moreover, we can use a "cluster-then-predict" strategy: we observe the generated clusters and then build different models for different subgroups if their behaviors vary widely. Numerous studies in computational biology have applied K-means for specific analyses. The authors of [17] used K-means to process microarray data for bioinformatics tasks. The work in [18] also implemented K-means to cluster biological sequences by first converting them into an intermediate binary format, with Hamming distance as the metric of comparison. The research in [19] presented an enhanced K-means for bioinformatics data clustering. In 2019, a study [20] introduced a modified sparse K-means clustering method to detect risk genes involved in Type II Diabetes Mellitus. These previous results suggest potential benefits of leveraging K-means in bioinformatics tasks.
In recent years, the application of machine learning algorithms to metagenomics has become popular, and diagnostic accuracy has improved over time. In this article, we propose applying the K-means clustering algorithm in the binning approach to improve the accuracy of T2D prediction. We leverage K-means clustering as a tool to support data binning. By identifying clusters that may exist in the data, we hope to improve the performance of the binning approach. Our study's contribution is multi-fold:
• We present results of various binning approaches for Type 2 Diabetes using metagenomic data, which remains a very big challenge for diagnosis.
• The work aims to illustrate a potential advantage of using clustering algorithms to identify breaks for binning approaches, in order to obtain a better result in T2D prediction compared to other binning methods.
• The results show that a state-of-the-art deep learning algorithm, the convolutional neural network, performs well compared to traditional neural networks such as the Multi-Layer Perceptron. Convolutional neural networks can work efficiently even on one-dimensional data.
• In most cases, classical machine learning outperforms deep learning algorithms. For numeric data in 1D form, classical machine learning shows a robust prediction ability.
• Previous studies have not investigated the efficiency of classic machine learning with binning approaches. Our study shows, using Random Forest, that combining classic machine learning with binning approaches can be the best choice for improving prediction performance on numeric species abundance datasets.
In the remainder of this paper, we give a short description of the two considered T2D datasets in Section III. The methods we use are introduced in Section IV. Experimental results of the proposed methods are presented in Section V. Finally, Section VI and Section VII discuss the results and summarize important remarks for this research.

III. DATA BENCHMARKS FOR METAGENOMIC ANALYSIS
We run the experiments on metagenomic abundance data that indicates how abundant (or absent) an OTU (operational taxonomic unit) is in the human gut. The abundance datasets are obtained using the default parameters of MetaPhlAn2, as detailed in [14].
The process of generating abundance data is shown in Fig. 1. In a little more detail, a stool sample collected from a person is fed into machines that extract the total deoxyribonucleic acid (DNA). The DNA is then sequenced to create millions of reads. New-generation sequencing techniques can process millions of sequencing reads in parallel. These reads are mapped to a catalog of references that includes all known gut microbial genes and known bacteria at the levels of species, genus and so on. The techniques also indicate the presence and abundance of each gene and each species in any sample. As revealed in numerous studies, species abundance and gene abundance can distinguish patients from healthy controls. Moreover, genes and species can be leveraged to develop robust tools for diagnosis and prognosis.
We evaluated our approach on Type 2 Diabetes with two datasets. The first one (T2D1) includes 344 Chinese individuals [22], and the other dataset (T2D2) includes 96 western women [23]. The datasets are characterized by bacterial species abundance. For each sample in each dataset, species abundance is a relative proportion represented as a real number.
The total abundance of all features in each sample is equal to 1. More details are shown in Table I. We investigate T2D because it is considered one of the most challenging disease prediction tasks.
Let D be the set of considered datasets. For each dataset $d_i$ in D:
• $F_i = \{f_1, f_2, ..., f_m\}$ includes the m features corresponding to $d_i$.
• $P_i = \{p_1, p_2, ..., p_k\}$ includes the k patients affected by T2D corresponding to $d_i$.
The total abundance of all features in one sample sums to 1:

$$\sum_{i=1}^{m} f_i = 1$$

With:
• m is the number of features for a sample.
• $f_i$ is the value of the i-th feature.
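The sum-to-one constraint above can be illustrated with a minimal Python sketch. The function name `to_relative_abundance` and the raw input values are hypothetical and chosen only for illustration; in the study itself, the relative abundances are produced by MetaPhlAn2.

```python
def to_relative_abundance(values):
    """Scale one sample's non-negative feature values so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("sample has no abundance to normalise")
    return [v / total for v in values]


# Hypothetical raw values for one sample with four features.
sample = to_relative_abundance([30, 10, 0, 60])
```

After this scaling, each feature value is a proportion of the sample, and absent species keep the value 0.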

A. Binning Approaches for Metagenomic Data
Some binning approaches were introduced in [24], including Species Bins (SPB) based on the species abundance distribution of 6 datasets, binning based on equal width (EQW), and binning based on equal frequency (EQF).
• Species Bins (SPB) are derived from the data distribution of six metagenomic bacterial species abundance datasets related to various diseases. The authors of [25] observed that the original species abundance almost follows a zero-inflated distribution. When the data are log-transformed (with logarithm base 4), the scaled data are more normally distributed (see the example of the raw and log-transformed (base 4) species abundance of the two considered T2D datasets in Fig. 2). From that, the authors proposed breaks for binning where each break corresponds, in logarithm base 4, to a one-fold increase from the previous bin. In more detail, the first breaks are 0 and $10^{-7}$ (the minimum value in the six considered datasets), the next break is $4 \times 10^{-7}$, and so on. These bins seem to be efficient for prediction.
• Binning based on the frequency of values is also an effective method. In equal frequency binning (EQF), each bin contains approximately the same number of elements. Therefore, the interval widths can be very different. The breaks can be, for example, 0.1, 0.11, 0.2, 0.5 and so on, depending on the value distribution.
• The last binning described in this section is binary bins. This method only considers whether the value of a feature is greater than 0 or not. Since it determines the presence of the feature in the samples, we also call it "PR".
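The three schemes above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, the EQF breaks are taken at evenly spaced positions of the sorted values, and each bin is assumed to include its lower break.

```python
import bisect


def spb_breaks(n_bins=10, start=1e-7, base=4):
    """SPB-style breaks: 0, start, start*base, start*base**2, ..."""
    return [0.0] + [start * base ** i for i in range(n_bins - 1)]


def eqf_breaks(values, n_bins=10):
    """Equal-frequency breaks: lower edges of n_bins groups of similar size."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // n_bins] for i in range(n_bins)]


def presence(values):
    """Binary bins (PR): 1 if the feature is present (value > 0), else 0."""
    return [1 if v > 0 else 0 for v in values]


def assign_bins(values, breaks):
    """Bin index of each value: position of the last break that is <= value."""
    return [bisect.bisect_right(breaks, v) - 1 for v in values]


breaks = spb_breaks()
bins = assign_bins([0.0, 2e-7, 0.5], breaks)
```

With 10 bins, `spb_breaks` yields 10 breaks starting at 0 and $10^{-7}$, and `assign_bins` maps every abundance value to one of the bin indices 0 to 9.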

B. Binning based on K-means Algorithm
With different distributions of data, a clustering algorithm is a crucial tool for identifying groups in the data. By determining groups for binning, we hope to improve performance by identifying areas of high data density. K-means clustering is a common method in cluster analysis and data mining. Its purpose is to partition n elements into clusters such that each element belongs to the cluster with the closest mean value, which acts as the cluster's prototype. The assignment is based on the smallest Euclidean distance between an element and the central element of a group. Assume each object has m attributes. The attributes of an object are like coordinates in an m-dimensional space, and each object is a point in that space. The Euclidean distance between two objects $x = (x_1, ..., x_m)$ and $y = (y_1, ..., y_m)$ is calculated by the formula:

$$d(x, y) = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$$

The central element is determined by the average of the elements in the group. Initially, the central elements are selected randomly, and after each addition of objects to groups they are recalculated. To calculate $c_{ij}$, the j-th coordinate of the central element of group i, we have the formula:

$$c_{ij} = \frac{1}{t} \sum_{s=1}^{t} x_{sj}$$

With:
• j = 1..m (m is the number of attributes)
• $x_{sj}$ is the j-th attribute of element s (s = 1..t), where t is the number of elements in group i

By binning with K-means clustering, we expect to obtain better results than with the methods mentioned earlier. Suppose we need to bin with n = 10 (the number of bins). The method is performed as follows:

Algorithm 1 Algorithm for identifying the list of binning breaks based on the K-means clustering algorithm
Input: n - number of bins, matrix C to find bin breaks
Output: B - array containing the list of n bin breaks found
Begin
Step 1: Initialize data
- Convert matrix C to a 1-dimensional array.
- Remove 0 or uncountable values in the array.
- Sort the array in ascending order.
Step 2: Run the K-means algorithm with a total number of clusters n - 1. We obtain array A containing the grouped elements.
Step 3: Construct array B containing n bin breaks
- Find n - 2 bin breaks by calculating the average of the two boundaries of each pair of adjacent groups.

For easier comparisons, all binning approaches in this study are implemented with the same number of bins (10 bins) for all classifiers. We underline that the breaks for binning are computed using only the training sets to avoid overfitting issues.
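The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation. It assumes a simple one-dimensional Lloyd iteration with evenly spaced (rather than random) initial centres so that the result is deterministic, and it treats "uncountable" values as non-finite numbers. It returns only the n - 2 interior breaks, since the text does not state how the remaining breaks are chosen. The function names are hypothetical.

```python
import math


def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns sorted values and their labels."""
    values = sorted(values)
    n = len(values)
    # Assumption: deterministic start at evenly spaced positions of the
    # sorted data, instead of the random start described in the text.
    centres = [values[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every value to its nearest centre (1-D Euclidean distance).
        new_labels = [min(range(k), key=lambda c: abs(v - centres[c]))
                      for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centre as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return values, labels


def kmeans_breaks(matrix, n_bins):
    """Interior bin breaks: midpoints between adjacent K-means groups."""
    # Step 1: flatten, drop zeros and non-finite values (sorted in kmeans_1d).
    flat = [v for row in matrix for v in row if v != 0 and math.isfinite(v)]
    # Step 2: cluster the values into n_bins - 1 groups.
    values, labels = kmeans_1d(flat, n_bins - 1)
    # Step 3: average the two boundary values of each pair of adjacent groups.
    return [(values[i - 1] + values[i]) / 2
            for i in range(1, len(values)) if labels[i] != labels[i - 1]]


# Toy training matrix (hypothetical values) with three well-separated groups.
train = [[0, 0.001, 0.0012], [0.1, 0.11, 0], [0.5, 0.52, 0.0011]]
interior_breaks = kmeans_breaks(train, n_bins=4)
```

In line with the remark above, the matrix passed to `kmeans_breaks` would contain only training samples, and the resulting breaks would then be applied unchanged to the test samples.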

V. EXPERIMENTS
To compare the efficiency of the binning approaches in improving T2D prediction performance across learning algorithms, the results for each learning architecture are presented in a separate table. Table II gives the results using MLP, while Table III illustrates the performance of CNN1D. In the last table (Table IV), we present the best results, obtained with Random Forest, and compare them to the state-of-the-art MetAML [14]. The datasets used were described in Section III. The details of the models used in the experiments and the results are presented as follows.

A. Learning Models for Comparison
In order to evaluate and compare the efficiency across a wide range of learning models, we use 3 different learning algorithms. Random Forest, a state-of-the-art machine learning algorithm, is used to run the experiments on the datasets. Moreover, the Multi-Layer Perceptron (MLP), a traditional neural network, is also included in the comparison. We also evaluate a one-dimensional convolutional neural network (CNN1D) on the considered datasets.
• In previous studies, the most successful method applied to numeric omics datasets has mainly been Random Forest (RF). The authors of [14] introduced MetAML using Random Forest and obtained the best results among the considered algorithms. Applying the same parameters proposed in [14], we use 500 trees for this algorithm.
• The MLP is used in this study with the parameters proposed in [16], including one hidden layer with 128 neurons.
• CNN1D consists of one one-dimensional convolutional layer of 128 filters, followed by a max pooling of 2 and ending with a fully connected layer. MLP and CNN1D use the Adam optimizer with a batch size of 16. Other parameters are also the same, with a default learning rate of 0.001 and an epoch patience of 5 for early stopping (to reduce overfitting).

B. Metrics for Comparison
The performances are assessed by 10-fold cross validation. We compute Average Accuracy and Average Matthews Correlation Coefficient (MCC) as performance measurement for evaluating the generalization of the classifiers. Training and test sets are exactly the same for each classifier, or we can say that the same folds are used for all classifiers. With this technique, the changes when comparing performance of any two classifiers could be computed directly as the difference in metrics within each test fold.
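The idea of sharing folds between classifiers can be sketched as follows. This is a minimal illustration under assumptions: the helper names are hypothetical, the folds are built by a single seeded shuffle (the text does not say whether the folds are stratified), and the per-fold scores in the usage line are made-up numbers, not results from this study.

```python
import random


def make_folds(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices once and split them into n_folds test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def paired_differences(scores_a, scores_b):
    """Per-fold metric difference between two classifiers on the same folds."""
    return [a - b for a, b in zip(scores_a, scores_b)]


# The same fold list is reused for every classifier being compared.
folds = make_folds(96)
# Hypothetical per-fold accuracies of two classifiers on two shared folds.
diffs = paired_differences([0.70, 0.60], [0.65, 0.62])
```

Because every classifier is evaluated on exactly the same test folds, each entry of `diffs` compares two models on identical samples.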
Accuracy is a common measurement of a model's performance, while MCC is considered a good performance evaluation score for biology datasets and helps to evaluate whether the model is performing well or not. As the authors of [28] state, "among the common performance evaluation scores, MCC is the only one which correctly takes into account the ratio of the confusion matrix size". The Matthews correlation coefficient is computed with the following formula:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

With:
• TP stands for True Positive
• TN stands for True Negative
• FP stands for False Positive
• FN stands for False Negative

And

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

The model is at its best when MCC = 1, while the worst value is MCC = −1. The authors of [28] recommended using this metric for evaluating algorithm performance.
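As a worked example of the two metrics, the following Python sketch computes accuracy and MCC from binary labels. The function names and the toy labels are illustrative; returning 0 when the MCC denominator is zero is a common convention that is assumed here, not something specified in the text.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = patient, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
```

For these toy labels the accuracy is 6/8 = 0.75, while the MCC is (3·3 − 1·1)/√(4·4·4·4) = 0.5.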

C. Experimental Results
1) Evaluation of binning approaches with MLP: We consider the two datasets T2D1 and T2D2, with results using MLP in Table II. The binning approach with K-means achieves higher val_acc and val_mcc values on both datasets than all other approaches (EQW, EQF, PR, SPB). On dataset T2D1, K-means is significantly higher than SPB: its val_acc is 0.034 higher and its val_mcc is 0.044 higher than those of SPB. Compared with EQW, PR and EQF, K-means also shows a clear margin. On dataset T2D2, the val_acc of K-means is 0.069 higher and its val_mcc is 1.46 times higher than those of EQF. EQW ranks second on this dataset and differs from K-means by 0.022. In summary, binning with K-means clustering gives the best results with the Multi-Layer Perceptron compared to the remaining methods.
2) Evaluation of binning approaches with Convolutional Neural Network on 1D data: Table III shows the performance using CNN1D. With the one-dimensional convolutional neural network, the results of K-means are 0.692 for val_acc and 0.740 for val_mcc. Both results are better than those obtained with the Multi-Layer Perceptron (val_acc = 0.686, val_mcc = 0.727). On T2D1, K-means is higher than the next-best approach, EQW, by 0.014 for val_acc and 0.076 for val_mcc. Compared with SPB, the lowest-scoring approach on this dataset, K-means is 0.076 higher for val_acc and 0.043 higher for val_mcc. On T2D2, the lowest-scoring approach is EQF: the val_acc of K-means is 0.066 higher, and its val_mcc is 1.367 times that of EQF. The differences between EQW and K-means are also clear: 0.033 for val_acc and 0.06 for val_mcc. In summary, with the one-dimensional convolutional neural network, the K-means approach gives better results than with the Multi-Layer Perceptron, and it remains the best among the considered approaches.
3) Random Forest obtains promising results with the proposed binning, compared to state-of-the-art MetAML: We also used Random Forest for the comparison of results in Table IV. As in the previous two tables, binning with K-means gives very good results compared to the other approaches. Compared with the previously proposed framework MetAML, K-means gives a val_acc that is higher by 0.036 for T2D1 and by 0.056 for T2D2. On T2D1, the val_acc of K-means is 0.04 higher and its val_mcc is 0.07 higher than those of SPB. The second-best result in the table for both datasets is the PR approach, and the difference between K-means and PR is still clear: on T2D1, K-means has a val_acc 0.014 higher and a val_mcc 0.017 higher than PR. On T2D2, the val_acc of K-means is 0.107 higher and its val_mcc is 1.683 times higher than the SPB results, while its val_acc is 0.023 higher and its val_mcc is 0.032 higher than those of PR. In short, choosing K-means as the binning approach gives better results than common approaches such as PR, EQW, EQF or SPB, and in particular better results than the approach used in MetAML.

4) Random Forest obtains better results compared to neural networks: The chart in Fig. 3 shows the results obtained on the two T2D datasets. We test five approaches, namely EQF, EQW, K-means, PR and SPB. On T2D1, the K-means approach has the largest Average Accuracy, reaching 0.7. SPB has an Average Accuracy of 0.66, which is the smallest value and is 0.04 lower than K-means. Similarly, on T2D2, the Average Accuracy of K-means is 0.759, the highest among the approaches. This value is 0.023 higher than the next-best value, obtained by PR. The Average Accuracy of SPB is 0.107 lower than that of K-means.
The chart in Fig. 4 shows the Average MCC values on the 2 T2D datasets for the 5 approaches. K-means has the highest Average MCC on both datasets: 0.4 for T2D1 and 0.515 for T2D2. The Average MCC of K-means is 0.07 higher than that of SPB on T2D1 and 1.683 times that of SPB on T2D2. The gap to the next-highest value, obtained by PR, is also quite clear: 0.017 for T2D1 and 0.032 for T2D2.

VI. DISCUSSION
From the collected results, we can see that RF obtains the best performance among the considered models. These results are similar to [25], where the authors also attempted to apply deep learning but the performance on T2D was still worse than RF. This reflects a fact mentioned in [26]: "the deep learning approaches may not be suitable for metagenomic applications". As stated in [27], we face challenges when applying deep learning to biological and clinical tasks because of limited data availability, result interpretation and hyperparameter tuning for deep learning algorithms.
Although PR only considers whether a bacterial species exists in a patient or not, it shows a better performance (using RF) than several other binning methods such as SPB, EQW and EQF. These results suggest that medical examinations for T2D diagnosis could determine only the presence of bacterial species in the human body. Such examinations can be simpler than computing the quantitative composition of bacteria.
In most situations, SPB performs poorly compared to the others because SPB was derived from the species abundance distributions of various diseases. Each disease should be considered independently because a disease can have its own complexity, characteristics and data density.

VII. CONCLUSION
We introduce a novel binning approach that uses a classical clustering algorithm, K-means. The comparison with existing binning approaches, namely binning based on species distribution, on equal width, on equal frequency, and binary bins, shows encouraging results for the use of clustering methods to identify breaks for binning and thereby enhance prediction performance.
The analysis of two architectures, the one-dimensional convolutional neural network and the Multi-Layer Perceptron, shows that convolutional neural networks not only achieve good performance on images but also obtain promising results compared to a traditional neural network such as the MLP.
As in some previous studies, classic machine learning such as Random Forest still works better than more complex models such as MLP and CNN1D in T2D diagnosis from metagenomic data. Further research can investigate deeper and more sophisticated models to improve the performance.
Using the classic clustering algorithm K-means with default parameters for binning gives encouraging results. This could encourage further studies on the use of clustering methods to generate breaks for binning. It also illustrates the potential of exploring data density to improve prediction not only for T2D but also for other diseases.