Very Deep Neural Networks for Extracting MITE Families Features and Classifying them based on DNA Scalograms

DNA sequencing has recently generated a very large volume of data in digital format. These data can be compressed, processed and classified only with automatic tools, which are now employed in biological experiments. In this work, we are interested in the classification of particular regions of the C. elegans genome: a recently described group of transposable elements (TEs) called Miniature Inverted-repeat Transposable Elements (MITEs). We focus on four MITE families (Cele1, Cele2, Cele14 and Cele42). These elements have distinct chromosomal distribution patterns and a specific number of copies conserved on the six chromosomes of C. elegans. Thus, it is necessary to define specific chromosomal domains and the potential relationship between MITEs and Tc/mariner elements, a relationship that makes it difficult to determine the similarities between the MITE and TC classes. To solve this problem, and more precisely to identify these TEs, the data are classified and compressed in this study using an efficient classifier model. The application of this model consists of four steps. First, the DNA sequences are mapped into scalograms. Second, the characteristic motifs are extracted in order to obtain a genomic signature. Third, the MITE database is randomly divided into two data sets: 70% for training and 30% for testing. Finally, the scalograms are classified using a transfer learning approach based on pre-trained models such as VGGNet. The introduced model is efficient: it achieved the highest accuracy rates thanks to the recognition of the correct characteristic patterns, with an overall accuracy of 97.11% in classifying these TE samples. Our approach also made it possible to classify the MITE classes and distinguish them from the TC class despite their strong similarity. By extracting the features and the characteristic patterns, the volume of data was considerably reduced.
Keywords—DNA scalograms; genomic signature; classification; deep learning; transfer learning; VGGNET; accuracy


I. INTRODUCTION
DNA is a molecule composed of a long chain of four nucleotides: Adenine (A), Thymine (T), Cytosine (C) and Guanine (G) [1,2]. It comprises a multitude of periodic structures, the majority of which have an unknown biological function. This molecule adopts a three-dimensional double helix with a curved shape [3]. In our work, the character string was mapped into a scalogram, based on a wavelet transform applied to a signal derived from experimental measurements of the DNA curvature. From these scalograms, we extracted patterns to classify some DNA regions. We chose, as a model, the organism Caenorhabditis elegans, an invertebrate combining simplicity and complexity. This duality makes it one of the most widely used and versatile models for nearly all aspects of biological and genomic research. We also investigated a recently described TE group, called Miniature Inverted-repeat Transposable Elements (MITEs). The latter were first discovered when studying the genes of several grass species including maize [4,5], rice [6] and barley [7]. They are genomic components abundant in many species, such as green pepper [8] and Arabidopsis [9,10], as well as in several animal genomes including Caenorhabditis elegans [11], insects [12], humans [13] and zebrafish [14]. These elements represent 1% to 2% of the total sequence of the genomes. Among these MITEs, we focused on four families, Cele1, Cele2, Cele14 and Cele42, because they have distinct chromosomal distribution patterns. In fact, Cele14 MITEs cluster near the ends of the autosomes. In contrast, Cele2 MITEs display an even distribution through the central autosome domains, with no evidence of clustering at the ends. These patterns complicate the classification task. So far, there is no model for the systematic classification of the four MITE families.
However, more extensive sequence relationships between the MITEs and the Tc/mariner elements were established for the first time in C. elegans. Most MITE families of this genome share their ends (~20 bp to 150 bp) and their TSD sequence with at least one of the Tc1/mariner transposons described in this species. The comparison of the transposase-coding Tc elements with the numerous MITE families suggests possible scenarios for the origin of MITEs in the C. elegans genome.
As the distinction between the MITE families and the "transposable elements" (TC1, TC2, TC5) [15,16,17,18] is a very difficult task, we set out to create an efficient automatic model to classify them. In this paper, we introduce a new approach to classify DNA scalograms employing VGGNET while considering these scalograms as characteristic motifs of DNA. Our proposed method first converts the DNA string into DNA scalograms. Afterward, a deep learning approach is used, in which a Deep Neural Network (VGGNET) [19] is trained to predict database-derived labels from the original scalogram. It allows extracting high-level abstractions of features from minimally preprocessed data. An evaluation of different CNN architectures, namely ResNet [37,38,39], InceptionV3 [40,41], MobileNet [35,36] and Xception [42], was also performed. This assessment shows that transfer learning achieved top-scoring performance.
The paper is organized as follows. Section II describes the materials used (the MITE and transposon families) and the methods applied (DNA coding, continuous wavelet transform and the VGGNET classification methodology). It also details the criteria considered to evaluate the model performance (accuracy and confusion matrix). Section III presents the proposed approach for classifying and identifying the four MITE families. Section IV presents the experiments carried out to classify the MITE classes of C. elegans and discusses the obtained results. Finally, Section V presents some concluding remarks.

II. MATERIALS AND METHODS
A. Materials
In this study, we focus on Caenorhabditis elegans, an invertebrate combining simplicity and complexity, which makes it well suited to examining the important biological processes relevant to all eukaryotes. C. elegans sequences were extracted from the National Center for Biotechnology Information (NCBI) public database [20]. Two sets of genomic data are considered: the MITE dataset, composed of Cele1, Cele2, Cele14 and Cele42 [12], and the non-MITE sequences, which are the TEs TC1, TC2 and TC5 [15,16,17,18].
MITEs are small non-autonomous elements derived from transposons. Their identification is usually based on the presence of target site duplications and terminal inverted repeats [10,11,12]. These elements are structurally comparable to defective class II elements. They are characterized by their small size (usually between 100 bp and 458 bp in length) and their lack of coding capacity for transposase. They carry Terminal Inverted Repeats (TIRs) and two adjacent short direct repeats called Target Site Duplications (TSDs). MITEs are often located near or within genes, where they can affect gene expression [13,14]. They are preferentially located in single- or low-copy regions. Thus, they can be used as genetic markers, especially for large genomes with low gene content [21]. MITEs can be grouped into super-families based on their association with TEs because they have almost the same TIRs. The relation between a given MITE family and its potential source of transposase is often based on limited sequence similarity in the TIRs. The choice of a given TC family as the non-MITE class is justified by the fact that MITEs themselves contain TC sequences, which considerably increases MITE recognition rates in most bioinformatics tools. The studied MITE families are CELE1, CELE2, CELE14 and CELE42. They have complex and variable structures and sizes.
Our database is composed of 7862 MITE elements whose frequency of occurrence in the C. elegans genome varies from 20 to 458, according to the family they belong to (Table I). The variability in length, composition and structure of these regions complicates their identification. Table I shows that MITEs also have a non-uniform distribution across the chromosomes. In fact, chromosome I (Chr I) contains the largest number of MITEs, 1799, with sizes varying between 29 base pairs (bp) and 380 bp. Table I also demonstrates the high variability characterizing the sequences of the MITE families; hence, it is challenging to introduce an automated algorithm to predict them. Table II reveals that the sequences of the transposon families (TC1, TC2, TC5) are completely different. Although TC1, TC2 and TC5 occur in smaller numbers, they are large (usually between 12 bp and 2088 bp in length).
N_Occ is the number of occurrences of a class in the six chromosomes of C. elegans, and S_min-max is the range between the minimum and maximum sizes of the occurrences of a class in the six chromosomes of C. elegans.
In this research work, DNA scalograms are used to characterize these regions and transfer learning is applied to classify them. In bioinformatics, two sequences are considered homologous if they derive from a common ancestor. Multiple sequence alignment techniques allow the homologous regions of each sequence to be identified. Fig. 1 shows that the DNA scalograms highlight DNA homology, i.e. a degree of identity or similarity, between scalograms of different regions of CELE2, and a similar homology between different elements of TC2. It also reveals a slight difference between the CELE2 and TC2 images [12,21].

B. Methods
To classify the MITE families, it is necessary to parameterize the DNA sequences regardless of their heterogeneity. Thus, we choose to map the DNA into images based on scalograms. For this purpose, we use the PNUC coding technique [22,23] and the Continuous Wavelet Transform (CWT) [24,25,26] to highlight features. Then, we extract these features from the DNA images using VGGNet, a powerful CNN architecture pre-trained on ImageNet. Finally, classification is performed with deep learning models (VGG19 and VGG16) [19,27]. Transfer learning consists in first training a base network on a source dataset and then transferring the learned features to a second, target network, which is trained on the target dataset.

1) PNUC coding technique and Wavelet Transform:
For our classification technique, we consider DNA images. These images are scalograms, i.e. energy distributions obtained by taking the squared modulus of the continuous wavelet transform applied to sequences encoded in PNUC [22,23]. Taking the squared modulus enhances time-frequency localization, and a new database of DNA scalograms is generated.
PNUC coding is based on curvature measurements. This curvature is directly related to the presence of nucleosome structures. With this technique, the pairing of the two DNA strands (A-T and C-G) along the helix is taken into account. PNUC coding consists in assigning to each codon, or trinucleotide, the numerical value given by the experimental measurements associated with it [23].
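As an illustration, the trinucleotide-to-value mapping can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the table `PLACEHOLDER_PNUC` holds made-up numbers (not the experimental values of [23]), the function name `pnuc_encode` is hypothetical, and a sliding window of width 3 and step 1 is assumed.

```python
# Hypothetical trinucleotide-to-value table: the numbers below are
# placeholders for illustration, NOT the experimental PNUC values of [23].
PLACEHOLDER_PNUC = {
    "AAA": 0.0, "AAT": 1.0, "ATT": 1.0, "TTT": 0.0,
    "GCC": 2.0, "GGC": 2.0, "ACG": 3.0, "CGT": 3.0,
}


def pnuc_encode(sequence, table, default=0.0):
    """Map a DNA string to a 1D signal, one value per overlapping trinucleotide.

    Assumption: a sliding window of width 3 and step 1 is used, so a
    sequence of length N yields N - 2 samples. Trinucleotides missing
    from the table fall back to `default`.
    """
    seq = sequence.upper()
    return [table.get(seq[i:i + 3], default) for i in range(len(seq) - 2)]


# Windows: AAA, AAT, ATT, TTT, TTG, TGC, GCC
signal = pnuc_encode("AAATTTGCC", PLACEHOLDER_PNUC)
```

With the real experimental table in place of the placeholder, the resulting list is the 1D signal on which the wavelet analysis operates.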
For example, a DNA sequence S_DNA is replaced by the numerical sequence given by its PNUC coding.
DNA has a multitude of periodic structures, and wavelet analysis was proposed to reveal the local and frequency properties of the periodic DNA motifs. The analysis based on the complex Morlet wavelet allows detecting the different periodicities in various types of C. elegans chromosomal DNA [24,25,26].
Wavelet analysis relies essentially on the decomposition of the signal into a sum of time-frequency atoms. The latter, called "wavelets", are obtained by dilating or contracting a mother wavelet ψ(t) [28,29] and translating it along the time axis. The versions obtained after these transformations are denoted ψ[(t-b)/a].
The dilation and compression of the mother wavelet depend on a scaling factor (a), while the translation is controlled by a translation parameter (b). The wavelet family of scales and positions is then generated by the following expression:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \tag{1}$$

In general, the wavelet transform of a signal f(t) is given by Equation (2):

$$T_\psi(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right) dt \tag{2}$$

where the symbol * indicates the complex conjugate. The obtained Tψ(a, b) numbers are called wavelet coefficients. The complex Morlet wavelet is the most efficient wavelet for analyzing and characterizing DNA structures [24]; it is a Gaussian envelope modulated by a complex exponential. It is defined by the following equation:

$$\psi(t) = \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^2/2} \tag{3}$$

where the parameter ω0 designates the number of oscillations of the mother wavelet. It must be greater than 5 to satisfy the admissibility condition. Using M scales, we obtain a matrix of N × M coefficients representing the time-frequency plane, where N is the length of the analyzed signal. The modulus of the wavelet coefficients |Tψ(a, b)| is called the "scalogram".
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 10, 2020, www.ijacsa.thesai.org
In our study, the analysis consists in applying a continuous wavelet transform based on the complex Morlet wavelet [24,25,26]. This analysis highlights the periodicities present in the DNA (represented by their inverse, the frequency, on the y axis) and localizes them precisely by nucleotide position (equivalent to time, on the x axis). The analysis produces scalograms, which are the images used in the classification.
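The scalogram computation described in this subsection can be sketched with the Python standard library alone. The sketch assumes the usual complex Morlet form (a Gaussian envelope times a complex exponential), the value ω0 = 5.5, a unit sampling step and a direct summation; the function names `morlet` and `scalogram`, the toy signal and the chosen scales are illustrative and are not taken from the paper.

```python
import cmath
import math


def morlet(t, omega0=5.5):
    """Complex Morlet mother wavelet: Gaussian envelope times a complex exponential."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * t) * math.exp(-t * t / 2.0)


def scalogram(signal, scales, omega0=5.5):
    """Return |T(a, b)|^2 as a list of rows, one row per scale a.

    Direct O(N^2 * M) evaluation of the discretised transform. This is
    adequate for short signals; an FFT-based routine would be preferable
    for whole chromosomes.
    """
    n = len(signal)
    rows = []
    for a in scales:
        row = []
        for b in range(n):
            coeff = sum(
                signal[t] * morlet((t - b) / a, omega0).conjugate()
                for t in range(n)
            ) / math.sqrt(a)
            row.append(abs(coeff) ** 2)  # energy at scale a, position b
        rows.append(row)
    return rows


# Toy signal with a period of 10 samples (stand-in for a PNUC signal).
toy = [math.sin(2 * math.pi * t / 10.0) for t in range(64)]
scales = [2.0, 4.0, 8.0, 16.0]
energy = scalogram(toy, scales)
```

For this toy signal, the energy away from the borders concentrates at the scale closest to a = ω0·T/(2π) ≈ 8.75, where T = 10 samples is the period. Rendering `energy` as an image (scales on the y axis, positions on the x axis) gives the kind of picture used as classifier input.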

2) VGGNET model for classification of DNA scalograms:
We are interested in studying the VGGNET which is a convolutional neural network trained on more than one million images from the ImageNet database [30].
We use transfer learning to classify the DNA scalograms. The main idea of transfer learning based on very deep neural networks is to take a deep learning model previously trained on a large-scale dataset such as ImageNet, which contains 1.2 million training images, another 50,000 images for validation and 100,000 images for testing, across 1000 different categories, and to re-purpose it to handle an entirely different problem [31]. After certain features have been learned from the large dataset (ImageNet), they are used by the VGGNet model as a base to learn the presented classification problem. As shown in Fig. 2, we employ a popular and reliable CNN architecture called VGGNet with 16 convolutional and 3 fully-connected layers [27]. The width of the convolutional layers (the number of channels) is rather small, starting from 64 in the first layer and increasing by a factor of 2 after each max-pooling layer, up to 512. The input of the CNN is a fixed-size 224 × 224 RGB image. Each image passes through a stack of convolutional (Conv.) layers. The convolution stride and the padding of the Conv. layers are chosen such that the spatial resolution is preserved after convolution. Spatial pooling is carried out by five max-pooling layers, which follow some but not all of the Conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. The stack of Conv. layers, whose depth differs between architectures, is followed by three fully-connected (FC) layers: each of the first two has 4096 channels, while the third one performs the classification [32,33].
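The way the spatial size and the channel width evolve through the convolutional stack can be traced with a short calculation. The sketch below assumes what the original VGG design specifies (size-preserving 3 × 3 convolutions and 2 × 2 max-pooling with stride 2); the function name `vgg_feature_shape` is hypothetical.

```python
def vgg_feature_shape(input_size=224, base_width=64, max_width=512, n_pools=5):
    """Trace (height, width, channels) through the VGG convolutional stack.

    Assumptions: 3x3 convolutions with stride 1 and 1-pixel padding keep
    the spatial size; each 2x2 max-pool with stride 2 halves it. The
    channel width doubles after each pool until it reaches max_width.
    """
    size, width = input_size, base_width
    shapes = []
    for _ in range(n_pools):
        shapes.append((size, size, width))  # shape inside the current conv block
        size //= 2                          # effect of the max-pooling layer
        width = min(width * 2, max_width)   # width used by the next block
    final_channels = shapes[-1][2]
    return shapes, (size, size, final_channels)


block_shapes, bottleneck = vgg_feature_shape()
n_features = bottleneck[0] * bottleneck[1] * bottleneck[2]
```

Here `block_shapes` lists the five convolutional blocks, from (224, 224, 64) to (14, 14, 512), and `bottleneck` is the (7, 7, 512) output that feeds the fully-connected layers.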

III. PROPOSED APPROACH
The adopted methodology includes three steps. Fig. 3 represents the flowchart describing our proposed approach, whose application consists in:
• Extracting the MITE sequences (CELE1, CELE2) and the TEs (TC1, TC2 and TC5) of all the chromosomes of C. elegans from the NCBI database. The extraction phase can be divided into the following two sub-steps:
  o Generating the corresponding PNUC sequences to convert the DNA string into a 1D signal.
  o Applying continuous wavelet analysis to transform the signal into scalogram images.
• Extracting features using convolutional neural networks.
• Using the VGGNET model to classify the studied sequences.

A. Creating MITE Signal Database
In the first step of our methodology, we extract the entire DNA sequences of the C. elegans genome from the NCBI database [20]. Then, we apply PNUC coding to all six chromosomes. Thereby, a chromosomal signal database (1D signals) is created. Taking the squared modulus of the continuous wavelet transform [24,25,26] of these signals enhances time-frequency localization and generates a new database of DNA images (scalograms), which represent energy distributions.

B. Extraction of Features using Convolutional Neural Networks
Several models have been used in the field of deep learning to extract the characteristics of DNA scalograms. In this work, we use a model adapted for their extraction (Fig. 4). The extracted values, treated as independent variables, are also used as the input of the classifier, which predicts the class to which each of them belongs. The architecture of the introduced model is presented in Table III.
As shown in Fig. 4, the shape of the input image is (224, 224, 3) and the last layer produced by VGGNet has the shape (7, 7, 512). This means that VGGNet returns a feature vector of 7 × 7 × 512 = 25088 features. To perform transfer learning with VGGNet, we first save the features extracted by the pre-trained model (bottleneck features). Then, a top model is trained to classify our data using the saved bottleneck features. Finally, we combine our training data and the VGGNet model with the top model to predict the DNA patterns of the scalograms [34].
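The bottleneck-feature procedure can be sketched with the Keras API shipped with TensorFlow. This is only one possible arrangement, not the authors' implementation: the function names, the hidden layer size, the dropout rate and the optimizer are assumptions, and the images are assumed to be already resized to 224 × 224 and passed through the VGG16 `preprocess_input` function. TensorFlow is imported inside the functions, so it is needed only when they are called.

```python
BOTTLENECK_SHAPE = (7, 7, 512)  # VGG output shape for a 224 x 224 x 3 input


def extract_bottleneck_features(images):
    """Run preprocessed images through the pre-trained VGG16 convolutional base."""
    from tensorflow.keras.applications import VGG16

    # include_top=False drops the three ImageNet fully-connected layers.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    return base.predict(images)  # array of shape (n_images, 7, 7, 512)


def build_top_model(n_classes=2, hidden_units=256, dropout=0.5):
    """Small classifier trained on the saved bottleneck features.

    hidden_units, dropout and the optimizer are illustrative choices,
    not values reported in the paper.
    """
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=BOTTLENECK_SHAPE),
        layers.Flatten(),                      # 7 * 7 * 512 = 25088 features
        layers.Dense(hidden_units, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In such a setup, the features would typically be computed once, stored on disk, and then passed with one-hot labels to `build_top_model().fit(...)`, so that the expensive convolutional base is not re-evaluated at every epoch.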

C. Classification Algorithm
For our classification algorithm, we use a third fully-connected layer containing only two channels (one for each class). The final layer is the soft-max layer. All hidden layers are equipped with the rectification non-linearity (ReLU) [19,27]. For each image X of study type T in the training set, the weighted binary cross-entropy loss is optimized. The VGGNet specifications are described in Fig. 3.
The major limitation of VGGNet lies in its huge memory requirements. Because of its depth and its number of fully-connected nodes, the size of VGGNet is 574 MB, which complicates its use as a feature extractor.
We also employ VGG16 and compare its results with those provided by VGG19. VGG-16 is a 16-layer CNN developed by Simonyan and Zisserman for image recognition in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [19]. Filters of size 3 × 3 are employed in all convolutional layers. This network accepts input images with a dimension of 224 × 224. The image passes through a sequence of 13 convolutional layers. A multilayer perceptron (MLP) classifier, made of three fully connected (FC) layers placed after the convolutional layers, is used in the classification step. Rectified linear unit (ReLU) layers and max-pooling layers are also used throughout the network to prevent overfitting.
To evaluate our classification model, we use the classification rate and the confusion matrix as criteria. The performance of the proposed approach is tested in terms of accuracy, recall, precision, sensitivity, specificity, F-measure (F1) and the confusion matrix illustrated in Fig. 5. In these measures, "TPs" (True Positives) are the CELE samples correctly labeled by the classifier, "TNs" (True Negatives) are the transposon samples correctly labeled by the classifier, "FPs" (False Positives) are the transposon samples mislabeled as CELE and "FNs" (False Negatives) are the CELE scalograms incorrectly labeled as transposons (TC).
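The listed measures follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions with CELE taken as the positive class; the function name `binary_metrics` is hypothetical and the counts in the usage line are made up for illustration, not results from the paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one sample of each
    class and at least one positive prediction).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts only (positive class = CELE, negative class = TC).
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

Recall and sensitivity are the same quantity, so the six names in the text correspond to five distinct numbers plus the confusion matrix itself.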
The two most important and most widely used loss functions are the cross-entropy function and the mean squared error (MSE) function. They are applied in classification and regression problems, respectively. They can be formulated as follows:

$$L_{CE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} y_{i,c} \log\left(\hat{y}_{i,c}\right)$$

$$L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} \left(y_{i,c} - \hat{y}_{i,c}\right)^{2}$$

where n is the total number of samples in the dataset, M denotes the number of classes within the dataset, y_{i,c} is a binary indicator of whether class c is the correct classification for sample i, and ŷ_{i,c} is the predicted probability that sample i belongs to class c. The MSE loss is high when the predicted value is far from the true value, whereas the cross-entropy loss penalizes uncertain prediction probabilities.
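The behaviour of the two losses can be checked numerically. The sketch below assumes that both losses are averaged over the n samples and summed over the M classes, with one-hot targets; the function names and the example probabilities are illustrative.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum_c y[i][c] * log(p[i][c]); eps guards log(0)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        for y, p in zip(truth, pred):
            total -= y * math.log(max(p, eps))
    return total / len(y_true)


def mse(y_true, y_pred):
    """Mean over samples of sum_c (y[i][c] - p[i][c]) ** 2."""
    total = sum(
        (y - p) ** 2
        for truth, pred in zip(y_true, y_pred)
        for y, p in zip(truth, pred)
    )
    return total / len(y_true)


targets = [[1, 0], [0, 1]]                 # one-hot labels for two samples
confident = [[0.9, 0.1], [0.1, 0.9]]       # correct and confident predictions
uncertain = [[0.55, 0.45], [0.45, 0.55]]   # correct but hesitant predictions

ce_confident = cross_entropy(targets, confident)
ce_uncertain = cross_entropy(targets, uncertain)
```

Both sets of predictions select the correct class, yet the hesitant ones receive a cross-entropy several times larger than the confident ones, which is the sense in which this loss penalizes uncertain probabilities.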

IV. RESULTS
The main objectives of this study are to characterize MITE families and distinguish them from other regions. For this purpose, genomic sequences used in this work are composed of two parts: a part containing MITE family sequences (CELE1, CELE2, CELE14 and CELE42) and another one including TC1, TC2, TC5.
The examination of the structure and distribution of MITEs reveals that the number of occurrences of these elements is variable and that the TC families yield far fewer scalograms, which complicates the MITE classification process. To solve this problem of unbalanced data, we enlarge the database of TC and MITE signals by grouping all the elements of the TC families in the same class and applying a binary classification. Here, the idea is to identify each MITE family (CELE1, CELE2, CELE14 and CELE42) with respect to the non-MITE families (Tc1, Tc2, Tc5). Thereafter, the dataset is split into two parts (70% for training and 30% for testing). Then, VGGNET is applied with a softmax activation. The recognition process consists of two stages: feature extraction and feature recognition. The performance of the proposed system strongly depends on the choice of the extraction method.
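A random 70/30 split for such a binary task can be sketched as follows. The split is done per class (stratified), which is an assumption: the paper only states that the split is random. The function name, the file names and the class sizes are hypothetical.

```python
import random


def split_70_30(samples, labels, train_fraction=0.7, seed=0):
    """Randomly split each class into train/test so both sets keep the class ratio."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = round(train_fraction * len(items))
        train += [(s, label) for s in items[:cut]]
        test += [(s, label) for s in items[cut:]]
    return train, test


# Hypothetical file names and class sizes, for illustration only.
names = [f"scalogram_{i}.png" for i in range(100)]
labels = ["MITE"] * 80 + ["TC"] * 20
train, test = split_70_30(names, labels)
```

Splitting per class keeps the minority TC class represented in the test set in the same proportion as in the full dataset, which matters when the classes are unbalanced.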
The experimental results demonstrate that most of the CELE elements are correctly distinguished from the Tc elements. The trained VGG16 model achieves an accuracy of 97.11% for CELE14 identification, 93.38% for CELE1 identification, 91.79% for CELE42 identification and 89.66% for CELE2 identification. Fig. 6 illustrates the accuracy of the VGG-16 model on the test images.
Similarly, Fig. 8 illustrates the accuracy obtained by applying the VGG-19 model to the validation dataset. The trained model reaches an accuracy of 96.44%, 94.52%, 91.05% and 90.17% for the identification of CELE14, CELE42, CELE1 and CELE2, respectively. Fig. 6, 7, 8 and 9 show that the learning and validation curves improve markedly for both the VGG16 and VGG19 models. It is also clear that the network converges from the second epoch and that, as the number of epochs rises, the cross-entropy loss tends to zero.
These figures represent the accuracy curves of the training set and of the validation set. Each point of an accuracy curve corresponds to the rate of correct predictions for the training or validation images. The accuracy curve follows a smooth progression similar to that of the loss curve. The training and validation accuracies approach 100% after 2 epochs. Fig. 6 and 7 show that the accuracy of the VGG16 model is higher than that of the VGG19 model. However, this does not hold for every element to be identified. Thus, the average accuracy over the four MITE families is computed. It reaches 92.985% for VGG16 and 93.045% for VGG19: the two models perform almost identically on average, with VGG16 achieving the highest single-family accuracy (97.11% for CELE14).
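The averages can be recomputed from the per-family accuracies reported above for the two models. The sketch below is plain arithmetic; the variable and function names are illustrative.

```python
# Per-family accuracies (%) as reported in the text for each model.
vgg16 = {"CELE14": 97.11, "CELE1": 93.38, "CELE42": 91.79, "CELE2": 89.66}
vgg19 = {"CELE14": 96.44, "CELE42": 94.52, "CELE1": 91.05, "CELE2": 90.17}


def mean_accuracy(per_family):
    """Unweighted mean of the per-family accuracies."""
    return sum(per_family.values()) / len(per_family)


mean16 = mean_accuracy(vgg16)
mean19 = mean_accuracy(vgg19)
gap = mean19 - mean16
```

The two means come out at about 92.985 and 93.045, a gap of roughly 0.06 percentage points. The mean is unweighted; weighting each family by its number of test scalograms could change the comparison.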
Additionally, testing results are given in Fig. 10 and 11 representing the Confusion Matrix for the validation data.
Performance is measured with the four possible combinations of predicted and target classes: true positive, false positive, false negative and true negative. In this format, the number and percentage of correct classifications performed by the trained network are indicated on the diagonal.
The confusion matrix shows that the models clearly differentiate the MITE families from TC1, TC2 and TC5, despite the similarities between the CELE and TC elements noted in the first part of this paper [21].
As seen in Fig. 10, the classes of MITEs are correctly classified. Our model, using VGG16, recognizes CELE2 with a very promising rate of 99.52%, CELE14 with 99.19%, CELE1 with 96.48% and CELE42 with 86.19%.

Comparison of our models with other CNN architectures
Similarly, a comparative analysis of the results obtained by the VGGNET framework and by four well-known architectures was carried out to shed light on the efficiency of VGGNET in identifying the four MITE families, as given in Table IV. We show that MobileNet [35,36], ResNet [37,38,39], InceptionV3 [40,41] and Xception [42] give average accuracy rates (Acc.) of 88.85%, 86.92%, 86.23% and 88.06%, respectively, in classifying the four MITE families. However, VGGNET provides the highest accuracy, 97.11%, for the identification of the CELE14 element. These results reveal that our architectures are more powerful and promising than the other deep models.
Table V presents the existing works based on supervised machine learning algorithms used in the classification step and compares the results obtained on DNA sequence databases. As shown in this table, several studies used convolutional neural networks (CNN) [47], C-KNN [45] and support vector machines [46]. Nevertheless, their successful categorization rates range between 70% and 90%. Obviously, a successful categorization relies mainly on the variability of the input. The majority of the studies listed in Table V

V. CONCLUSION
In this paper, we focused on DNA images. Our main purpose was to distinguish the MITE families from the transposon families and to classify them. The DNA images are scalograms. The ATCG chain was first converted, using the PNUC coding technique, into a signal based on experimental DNA curvature measurements. Then, the continuous wavelet transform with the complex Morlet wavelet converted this signal into images. Thirdly, features were selected by applying a transfer learning approach. Finally, each produced feature set was tested with several classifiers to validate the proposed model. This approach showed high performance, achieving accuracy, loss, recall, precision and F1-score of 97.11%, 09.03, 99.18, 97.34 and 85.64, respectively. The obtained results are the highest among all known published works on the same dataset, even when compared to other convolutional network models. In fact, the classification rates obtained in previous works did not exceed 90%.