Helitron's Periodicities Identification in C. elegans Based on the Smoothed Spectral Analysis and the Frequency Chaos Game Signal Coding

Helitrons are typical rolling-circle transposons which make up the bulk of eukaryotic genomes. Unlike other DNA transposons, these transposable elements (TEs) do not create target site duplications or end in inverted repeats, which makes them particularly challenging to identify and annotate. To date, these elements are not well studied; they have attracted the interest of researchers in biology only. This paper focuses on identifying helitrons in the C. elegans genome from a signal processing perspective. To this end, a novel two-step methodology is proposed: coding followed by spectral analysis. The first step consists in converting DNA into a 1-D signal based on the statistical features of the sequence (FCGS coding). The second step aims to identify the global periodicities in helitrons using the Smoothed Fourier Transform. The resulting spectrum and spectrogram are shown to present a specific signature for each helitron type.
Keywords: Helitrons; C. elegans; Frequency Chaos Game Signal (FCGS) coding; spectral analysis; tandem periodicities


I. INTRODUCTION
Helitrons are a distinguished type of DNA transposable elements (TEs) which transpose by a rolling-circle replication mechanism. Due to their ability to move rapidly and replicate within genomes, helitrons play a major role in genome evolution. In fact, by transferring a DNA segment from one genomic site to another, these elements are responsible for intragenomic multiplication [1]. Many types of genetic variation caused by TEs in animals and plants are described in [2].
Since their discovery, helitrons have attracted widespread attention. Many computational tools have been developed to identify and analyze helitrons in genomes, among them HelitronFinder [20], HelSearch [21], a combination of BLAST search and hidden Markov models [22], and HelitronScanner [23].
HelitronFinder and HelSearch are similar [24], [25]; they are based on the conserved terminal sequences 5′-TC and CTRR-3′ (R = A or G) of most helitrons. Both programs look for the hairpin structure and the CTRR 3′ terminus.
Users of HelSearch have to search manually for the 5′ end of helitrons, whereas HelitronFinder identifies the 5′ end automatically. The combination of BLAST search and hidden Markov models is limited when it comes to identifying helitrons with more diverse termini. HelitronScanner, based on a two-layered local combinational variable (LCV), is a tool that identifies helitrons missed by both HelSearch and HelitronFinder [23]. This tool aims to extract helitron features such as the hairpin structure, CTRR at the 3′ end, the TC dinucleotide at the 5′ end, and the A and T residues flanking the 5′ and 3′ ends, respectively. In an automated way, it uncovers many new helitrons which were missed by other tools. However, this method is also limited in finding helitrons which do not have hairpins. Given that these transposable elements lack typical transposon features, the investigation of helitrons is limited. The automated identification and localization of helitrons remain purely based on previously known sequences.
Helitron elements are widespread and highly heterogeneous, which makes their identification a difficult task. The novelty of this work consists in the use of signal processing techniques to search for helitrons in a large genomic database without prior knowledge about the content of the DNA sequences. The idea is to identify each helitron type based on a signature that characterizes it.
To apply signal processing methods to biological sequences, DNA must first be converted into a digital signal, a process known as DNA coding. The numerical representation of DNA using coding techniques is important since it plays a great role in visualizing, characterizing and highlighting the information contained in it. Different coding techniques exist: the binary coding [26], the structural bending trinucleotide coding (PNUC) [27], the electron-ion interaction pseudo-potential (EIIP) mapping [28], the Frequency Chaos Game Signal [29]-[31], etc.
In the signal processing field, several techniques have been applied with success to detect biological sequences. For example, genes were segmented into coding and non-coding regions using the windowed Fourier Transform, based on the 3-bp periodicity that characterizes exons [32], [33]. In addition, the Fast Fourier Transform (FFT) was used to reflect the correlation properties of coding and non-coding DNA sequences [32]. In [34], another analysis technique, the Fourier transform infrared spectroscopy, is used to detect short exons. The latent periodicities in different genome elements, including exons and microsatellite DNA sequences, were detected using the Fourier transform [35]. In addition, the modified Gabor-wavelet transform was used to identify protein coding regions [36]. The Auto-Regressive technique was also used to predict genes and exon locations using all-pass-based filters [37].
Further, the Wavelet Transform (WT) balances the resolution in time and frequency, which makes it possible to automatically capture different periodicities (the inverse of frequencies): periodicity 3 in exons [38] and periodicity 10 in nucleosomes [39]. The wavelet transform was also shown to reflect the characteristic signature associated with tandem repeats. Indeed, some biological sequences (such as tandem repeats) are characterized by periodicities; scalograms serve to visualize the way these features appear as well as their locations [40]. Given the efficacy of these signal processing tools, applying them to the helitron DNA category, which is governed by complex latent periodicities, is a natural yet particularly challenging task.
Our main goal is to characterize helitrons within the framework of signal processing. First, the Frequency Chaos Game Signal (FCGS) is selected to preserve the statistical properties of DNA sequences. Second, signal processing analysis techniques are applied to identify each helitron type by a genomic signature. The key component of this system is the combination of the FCGS coding of DNA with the windowed Smoothed Fourier technique, which enhances the spectral signature of DNA (the overall periodicities). This paper is organized as follows. The first section introduces the work. The second section describes the methodology adopted for helitron identification. The third section explains the establishment of the helitron signal database. The fourth section explains the spectral analysis used to characterize helitrons. Then, the experimental results are provided and discussed. Finally, a summary describes the effectiveness of this new approach in capturing the periodicities in helitron sequences. The paper ends with a perspective that opens questions concerning the biology of this type of TE.

II. METHODOLOGY OF HELITRON'S IDENTIFICATION
With the wealth of genomic sequences now available, the identification of a specific DNA element has to be an automatic task. The following methodology consists of three steps. The flowchart describing this work is given in Fig. 1. The first step: A DNA sequence must be converted into a numerical sequence before processing. This step consists in generating different signals for each C. elegans chromosome. For this, chromosome sequences of the C. elegans model are extracted from the NCBI database (http://www.ncbi.nlm.nih.gov/Genbank). Then, 1-D signals are generated by applying the Frequency Chaos Game Signal of order i, FCGSi (i = 2, 4, 6), to the whole chromosomes. This coding technique is based on the probability of occurrence of groups of N successive nucleotides in an input DNA sequence [29]-[31]. Here, three values of the length are selected: N = 2, 4, 6.
The probability (P_{N_nuc}) of a given group of N nucleotides in the chromosome is as follows:
$$P_{N\_nuc} = \frac{N_{N\_nuc}}{N_{ch}} \quad (1)$$
where N_{N_nuc} represents the number of occurrences of the N nucleotides in the whole sequence and N_{ch} represents the length in base pairs of the DNA sequence. In other words, the number of occurrences of each of these elements (N nucleotides) in the genome is counted. It is important to note that the FCGS signal depends on the DNA sequence to be encoded since it is based on counting the words contained in it. It reflects, therefore, the statistical features of the DNA sequence itself. In fact, changing the DNA sequence would affect the probability of occurrence of words and hence the FCGS values. The interesting point here is that this coding technique is useful for enhancing repetitive DNA (as is the case for helitronic DNA) at any desired scale (i.e. for any word size). Specifically, the highest levels of repetition in DNA are detected by high-order FCGS coding. It must be noted that information hidden at a certain scale can be highlighted at another scale, which makes the FCGS very suitable for investigating helitrons.
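The word-counting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `kmer_probabilities` is hypothetical, words are assumed to overlap (one word per position), and words containing symbols other than A, C, G, T are assumed to be skipped. The normalisation by the sequence length follows the definition of N_{ch} given in the text.

```python
from collections import Counter


def kmer_probabilities(sequence, n):
    """Return {word: occurrences / sequence length} for overlapping n-letter words."""
    seq = sequence.upper()
    words = (seq[i:i + n] for i in range(len(seq) - n + 1))
    # Assumption: words with ambiguous symbols (e.g. N) are ignored.
    counts = Counter(w for w in words if set(w) <= set("ACGT"))
    # Normalise by the sequence length in base pairs (N_ch in the text).
    return {word: c / len(seq) for word, c in counts.items()}


# Toy sequence standing in for a chromosome.
probs = kmer_probabilities("ACGTACGTAC", 2)
```

On this toy input the word AC occurs three times in ten bases, so its probability is 0.3. For a real chromosome the same function would be called once per order (N = 2, 4, 6).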
The second step: It consists in the establishment of a 1-D signal database of the helitrons. This is done by associating the FCGSi values with each group of letters in the helitron sequence.
In position (k), the oligomer (i), which consists of N nucleotides, is replaced by its corresponding occurrence probability. The sum of the N-nucleotide indicators (S_{N_nucl}) is then computed. Then, a database of helitron signals for the different oligomers is prepared. As a result, helitrons are represented by three levels of FCGS (FCGSi, i = 2, 4, 6).
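The substitution of each oligomer by its occurrence probability might look as follows. This is a sketch under stated assumptions: `fcgs_signal` is a hypothetical name, the probabilities are taken from a reference sequence (the chromosome) as described in the first step, one sample is produced per position k without padding, and words absent from the reference are mapped to 0.0.

```python
from collections import Counter


def fcgs_signal(region, reference, n):
    """Replace each overlapping n-letter word of `region` by its probability in `reference`."""
    ref = reference.upper()
    counts = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    probs = {word: c / len(ref) for word, c in counts.items()}
    reg = region.upper()
    # One sample per position k; unseen words get 0.0 (an assumption).
    return [probs.get(reg[k:k + n], 0.0) for k in range(len(reg) - n + 1)]


chromosome = "ACGTACGTAC"    # toy stand-in for a chromosome
helitron = chromosome[2:8]   # toy stand-in for an annotated helitron ("GTACGT")
signal = fcgs_signal(helitron, chromosome, 2)
```

Because the probabilities come from the whole reference, the same helitron would receive different signal values if it were encoded against another chromosome, which is the dependence on the encoded sequence noted in the text.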
As an example, Fig. 2 outlines the resulting FCGS signals of a helitron of type NDNAX2. After the coding process, several numerical signals, denoted by H[n], are obtained for each helitron. A signal database for each helitron type and for each chromosome of the considered genome is then established. For a desired level i of the FCGS (FCGSi) and for a specific helitron class, the signals of all existing helitron elements are concatenated so as to obtain one global sequence. The resulting database comprises three FCGSi signals (i = 2, 4, 6) for each class of helitron contained in the chromosomes of the C. elegans genome.

The third step: The spectral analysis method is adopted to detect the periodicity of each helitron type. Based on the classic discrete Fourier transform of a numerical sequence (4), a windowed local analysis is used to give a more precise location in time and frequency.
The reason for selecting the spectral analysis is to characterize the helitron elements by a global spectral signature.
$$X[k] = \sum_{n=0}^{N-1} H[n]\, e^{-j 2 \pi k n / N}, \quad k = 0, \ldots, N-1 \quad (4)$$
At this stage, two types of signatures for each helitron class are revealed: the 1-D spectrum and the 2-D spectrogram. The spectral and the time-frequency signatures are established for different orders of our coding technique (FCGSi, i = 2, 4, 6).
Table I shows that HelitronY1A is the most frequent class (occurrence number of 1093) and the longest (size of 425487 bp) in the genome.
The least frequent helitron in the genome is Helitron NDNAX1 (with an occurrence number of 77). At the chromosomal level, Helitron NDNAX1 and HelitronY3 are the least frequent elements; they are present in chromosome II with an occurrence number of 8. The smallest helitron class is HelitronY3, with a size of 16404 bp; in other words, it is the shortest helitron in the genome.
Because the number and size of helitrons vary across the six chromosomes, it is useful to regroup all the helitron signals so as to obtain a global signature of each class. The idea consists in concatenating all the signals into one vector for a well-defined FCGS order, while keeping the order of appearance of the helitrons in the chromosome. This reunification links all the local signatures of the helitrons. Our main goal is to find the structure that could be repeated in each helitron.
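The concatenation described above amounts to sorting the helitron signals of one class by their position and joining them into a single vector. A minimal sketch follows, assuming each helitron is given as a hypothetical `(start_position, signal)` pair; the paper does not specify the data layout.

```python
def concatenate_signals(helitrons):
    """Join the signals of one helitron class, ordered by start position."""
    ordered = sorted(helitrons, key=lambda item: item[0])
    global_signal = []
    for _, signal in ordered:
        global_signal.extend(signal)
    return global_signal


# Two toy helitron signals listed out of order.
records = [(500, [0.3, 0.1]), (120, [0.2, 0.2, 0.4])]
combined = concatenate_signals(records)
```

For a genome-wide signature, the same operation would be applied chromosome by chromosome and the resulting vectors joined in chromosome order, which is again an assumption about the layout rather than a detail given in the text.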

IV. SMOOTHED DISCRETE FOURIER TRANSFORM
The smoothed spectral analysis is a convenient tool to search for base periodicities within DNA sequences. In fact, periodically placed motifs may indicate the presence of genes, regulatory elements or other significant hotspots like helitrons; hence the need to locate them and exhibit the corresponding frequency (or periodicity). This technique is applied to investigate the global signature of helitrons within the genome by considering the concatenated FCGS signals of all helitrons, which constitutes the novelty of this work. Therefore, two tasks are carried out here:


Revealing the periodicities in the helitron sequences by enhancing their global periodicities with the Smoothed Fourier Transform.
d). Calculating the mean DFT value for each portion (1:L), then performing the same operation over the N segments to obtain the mean smoothed spectrum. The resulting time-frequency representation gives the spectrogram amplitude for a specific periodicity index at a specific nucleotide position in the DNA sequence.
Fig. 3 provides an example of the spectra and the spectrograms of the concatenated helitrons of type Helitrons2_CE that exist in chromosome IV of C. elegans. The frequency and the time-frequency representations are generated by the mean-value technique based on the smoothed Discrete Fourier Transform. As for the sliding window, we suggest using a Blackman window with the parameters L=1024, Δl=512, N=256 and Δn=64. In this example, two levels of the coding technique are used: FCGS2 and FCGS4. From these figures, it is noticeable that increasing the FCGS order induces a general smoothing of the 1-D spectrum and the 2-D spectrogram, which enhances the spectral and the time-frequency behaviours of the considered helitron by highlighting its periodicities.
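To make the role of the parameters L, Δl, N and Δn concrete, the following sketch averages Blackman-windowed DFT magnitudes over all segments of all frames. It is one possible reading of the procedure, not the authors' code. The assumptions are: N is the segment length, Δl and Δn are overlap lengths (so the hops are L-Δl and N-Δn), and magnitudes rather than powers are averaged. The function names are hypothetical, and a direct DFT from the standard library is used for self-containment; an FFT routine would normally replace it.

```python
import cmath
import math


def blackman(m):
    """Blackman window of length m."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (m - 1))
            + 0.08 * math.cos(4 * math.pi * i / (m - 1)) for i in range(m)]


def dft_magnitude(block):
    """Magnitude of the DFT for frequency indices 0..len(block)//2."""
    m = len(block)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / m)
                    for i, x in enumerate(block)))
            for k in range(m // 2 + 1)]


def smoothed_spectrum(signal, L=1024, dl=512, N=256, dn=64):
    """Average windowed DFT magnitudes over all segments of all frames."""
    window = blackman(N)
    total = [0.0] * (N // 2 + 1)
    count = 0
    for f0 in range(0, len(signal) - L + 1, L - dl):   # frames of length L
        frame = signal[f0:f0 + L]
        for s0 in range(0, L - N + 1, N - dn):         # segments of length N
            seg = [x * w for x, w in zip(frame[s0:s0 + N], window)]
            for k, mag in enumerate(dft_magnitude(seg)):
                total[k] += mag
            count += 1
    if count == 0:
        raise ValueError("signal is shorter than one frame")
    return [t / count for t in total]


# Toy input: a pure period-10 oscillation, so a peak near frequency 0.1 is expected.
toy = [math.sin(2 * math.pi * n / 10) for n in range(2048)]
spectrum = smoothed_spectrum(toy)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_frequency = peak_bin / 256
```

With N = 256 the frequency axis is sampled in steps of 1/256, so a period-10 component can only be located to within about 0.004 of its true frequency 0.1. Real FCGS signals are strictly positive, so subtracting the signal mean before the analysis may be needed to keep the zero-frequency bin from dominating.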

V. HELITRON'S CHARACTERIZATION IN C. ELEGANS: TESTS AND RESULTS
The characterization of helitron DNA with the aim of its identification is the main focus of this paper. This goal can be reached by studying the global resident periodicities of each helitron class. Considering the helitron representations in the frequency and the time-frequency planes, the spectral attributes provide, for each type, a specific signature (pattern or form) characterizing the repetitions in the sequence. The windowed smoothed Fourier analysis makes it possible to follow the path of the existing periodicities (engendered by the repetitions) in each helitron class. For experimentation, all helitrons existing in the C. elegans genome are first encoded by the FCGSi (i = 2, 4, 6) coding technique. Secondly, the FCGSi fragments which correspond to the helitron sequences in the same chromosome are concatenated for each helitron type. For comparison, the concatenation of all helitrons contained in the genome (all chromosomes) is also considered. Finally, the smoothed spectral analysis is applied to these sequences. In this step, many types of windows with different values of length and overlap are tested. The optimal parameters are fixed to L=1024, Δl=512, N=256 and Δn=64, and the most accurate smoothed spectrum is given by a Blackman window. Consequently, two types of spectral representations are obtained: the 1-D spectrum and the 2-D spectrogram. For illustration, the example of HelitronY4 is selected and encoded with FCGS2. The resulting 1-D spectra are presented in Fig. 4.
A close examination of each sub-figure shows that HelitronY4 is identified by five peaks whose power differs from one chromosome to another. These peaks are located around the following frequencies: 0.02 (periodicity 50), 0.1 (periodicity 10), 0.2 (periodicity 5), 0.3 (periodicity 3) and 0.4 (periodicity 2). Since the spectrum of the whole C. elegans genome behaves like that of each chromosome taken separately, these periodicities characterize the HelitronY4 type. Here, the idea is to identify each helitron class with a specific spectrum, since the behavior of all spectra remains almost the same with a small variation in amplitude.
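Reading peaks off a spectrum and converting them to periodicities can be automated. The sketch below is illustrative only: `top_periodicities` is a hypothetical helper, peaks are taken as simple local maxima with the zero-frequency bin excluded, and the toy spectrum is invented for the example rather than taken from Fig. 4. The frequency of bin k is k divided by the DFT length, and the periodicity is its inverse.

```python
def top_periodicities(spectrum, n_fft, n_peaks=5):
    """Return (frequency, period) pairs for the strongest local maxima."""
    peaks = [k for k in range(1, len(spectrum) - 1)
             if spectrum[k - 1] < spectrum[k] >= spectrum[k + 1]]
    peaks.sort(key=lambda k: spectrum[k], reverse=True)
    # Bin k of an n_fft-point DFT: frequency k / n_fft, period n_fft / k.
    return [(k / n_fft, n_fft / k) for k in peaks[:n_peaks]]


# Hypothetical one-sided spectrum of a length-20 DFT with maxima at bins 2 and 6.
toy_spectrum = [9.0, 1.0, 5.0, 1.0, 0.5, 1.0, 3.0, 1.0, 0.2, 0.1, 0.0]
result = top_periodicities(toy_spectrum, n_fft=20, n_peaks=2)
```

Here the strongest peak sits at frequency 0.1 (period 10) and the second at frequency 0.3, whose exact inverse is about 3.33. Periodicities quoted as integers are therefore roundings of the inverse frequency.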

A. Helitron's Signature and Effect of the FCGS Order on the Spectrum
In this section, the role played by the FCGS coding in enhancing the helitron signature is examined. For this purpose, the mean values of the Smoothed Discrete Fourier Transform are computed for each type of helitron sequence and for different FCGS orders (order = 2, 4, 6).
These representations reflect the specific periodicities (frequencies) that characterize each helitron type, which makes them a pertinent tool for identification. In fact, each helitron class is shown to possess a specific spectral signature.
Further, for each helitron type, increasing the FCGS order has a noticeable effect on the evolution of the spectrum shape, but the overall shape is preserved. This confirms that a helitron can be identified by its spectral signature, even as the spectrum shape evolves.
On the other hand, a high similarity between these three helitrons is striking: they have in common the frequencies 0.02734 (periodicity 37), 0.05859 (periodicity 17) and 0.1 (periodicity 10), which have the highest amplitudes. These periodicities reflect the presence of hidden minisatellites in the helitron sequences.
In addition, Helitron2_CE has important amplitudes within the frequency bands [0.09, 0.125] and [0.15, 0.2]. It shares several frequencies with the HelitronY2_CE class. The latter helitron (HelitronY2_CE) is shown to be easily recognized through its distinctive spectrum. In fact, during tests, HelitronY2_CE is the only subclass whose behavior remains invariant from one chromosome to another along the whole genome.
The Helitron NDNAX1 has two principal frequency bands, centered around the frequency 0.1 (periodicity 10) and around the frequency 0.03 (periodicity 30).
Finally, the Helitron NDNAX3_CE has a pronounced amplitude around the frequency 0.1 (periodicity 10). Other important frequencies are also present, such as 0.01953, which corresponds to periodicity 51, and 0.02344, which corresponds to periodicity 42. In addition, the periodicities 10 and 9 are present in this helitron family.

B. Helitron's Spectrograms with different FCGS Orders
The time-frequency representation allows an energetic interpretation of signals. Based on these sub-figures, each helitron class seems to possess a unique time-frequency behavior. In other words, it forms a time-frequency signature which will be taken as a basis for helitron identification.
In this case, the FCGS2 coding seems to offer the best way of characterizing helitrons since it shows more details about the energy distribution in the spectrograms. On the other hand, increasing the coding order has no major impact on the overall signature for all helitron types. Nevertheless, higher levels of the FCGS signals yield smoothed spectrograms.

VI. CONCLUSIONS
Exploring the latent periodicities of helitrons (eukaryotic rolling-circle transposons) can play a key role in the identification of this subclass of DNA. The way these periodicities appear can mark specific regions of helitrons in a unique manner, which constitutes a signature allowing us to distinguish these elements.
This study is based on spectral analysis. In fact, the recognition of each class's periodicities can be very useful for helitron classification.
To be able to apply the spectral method, the DNA sequence is converted into a numerical 1-D signal using the FCGS coding. This technique offers the possibility to encode DNA into several signals according to a well-defined order. Three levels of the representation are taken into account: FCGS2, FCGS4 and FCGS6.
To assess how similar these helitrons are along the genome, all the helitron sequences that exist in the C. elegans chromosomes are joined in order to obtain a global genomic signature.
After that, the Smoothed Fourier Transform analysis is applied, and the 1-D spectra and the 2-D spectrograms are collected as characteristic signatures of the studied elements. In fact, for each helitron type, the periodicities appear in the spectral and the time-frequency representations in a unique manner, which forms a pertinent tool for helitron characterization and identification.
Comparing the signature of helitrons in a chromosome with the signature of the overall genome reveals a great similarity between them. In addition, increasing the FCGS order has not affected the global behavior of the spectra and the spectrograms: the main periodicities of each helitron class remain, and only a smoothing effect is observed in the representations. This confirms that each helitron adopts a distinctive behavior which characterizes it and thus permits its identification.
In this work, the major advantage of the frequency representation (spectrum) and the time-frequency representation (spectrogram) is that they capture the periodicities (repetitive sequences) of the helitron sequences. It is worth mentioning that this approach can be used to detect periodicities in other transposable elements. The limitations of the approach are those of the Fourier transform: since the window size is fixed, temporal localization remains limited.

VII. FUTURE WORK
In this work, the importance of helitron characterization has been highlighted, and a method for their identification based on the Smoothed Fourier Analysis has been suggested. As future work, classification methods based on feature analysis might be investigated to improve the classification accuracy of helitrons.
Fig. 2. The resulting signals FCGS2, FCGS3, FCGS4, FCGS5 and FCGS6 of a helitron of type NDNAX2 with a size of 341 base pairs, positioned at [274811 bp : 275151 bp] in chromosome II of the C. elegans genome.
The technique consists in:
a). After converting the DNA sequence into a numerical one, the signal H[n] is divided into frames of length L with an overlap Δl.
b). Using a sliding analysis window w[n], each portion of length L is in turn divided into N overlapping segments with an overlap length Δn:
$$H_w[n,k] = H[n]\, w[n - k\Delta n] \quad (5)$$
where k denotes the index of the frequency ([0, N-1]). The choice of the windowing function determines the time-frequency resolution.
c). Using the Discrete Fourier Transform (DFT), each weighted block of the frame H_w[n] is transformed into the spectral domain. In the DFT of each segment, i is the index of the segment among the N segments ([1...N]), k is the index of the frequency and j is the index of the frame among the L frames ([1:L]).
e). With the obtained values, a matrix containing the joint time-frequency information is constructed.
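The time-frequency matrix of the last step can be illustrated with a plain sliding-window spectrogram. This sketch simplifies the procedure above: it omits the outer framing (L, Δl) and keeps only the Blackman-windowed segments, with N assumed to be the segment length and Δn the overlap length. The function name `spectrogram` and the toy signal are illustrative, and a direct DFT is used to stay within the standard library.

```python
import cmath
import math


def spectrogram(signal, N=256, dn=64):
    """Return (segment start positions, matrix[k][t] of DFT magnitudes)."""
    window = [0.42 - 0.5 * math.cos(2 * math.pi * i / (N - 1))
              + 0.08 * math.cos(4 * math.pi * i / (N - 1)) for i in range(N)]
    starts = list(range(0, len(signal) - N + 1, N - dn))
    columns = []
    for s0 in starts:
        seg = [x * w for x, w in zip(signal[s0:s0 + N], window)]
        columns.append([abs(sum(x * cmath.exp(-2j * math.pi * k * i / N)
                                for i, x in enumerate(seg)))
                        for k in range(N // 2 + 1)])
    # Transpose so that matrix[k][t] is the amplitude of frequency index k at segment t.
    matrix = [list(row) for row in zip(*columns)]
    return starts, matrix


# Toy signal: period 10 in the first half, period 4 in the second half.
toy = ([math.sin(2 * math.pi * n / 10) for n in range(512)]
       + [math.sin(2 * math.pi * n / 4) for n in range(512)])
starts, matrix = spectrogram(toy)
first_column = [row[0] for row in matrix]
last_column = [row[-1] for row in matrix]
first_peak = first_column.index(max(first_column))
last_peak = last_column.index(max(last_column))
```

In this toy case the dominant frequency index moves from about 26 (frequency near 0.1) in the first segment to 64 (frequency 0.25) in the last one, which is the kind of position-dependent change that a 1-D spectrum cannot show.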

Fig. 4. HelitronY4 spectra: (a) spectrum of the concatenated chromosomes; (b-g) spectrum of each chromosome taken separately. The sub-figures (b, c, d, e, f, g) provide the spectrum shape for each chromosome; the sub-figure (a) gives the spectrum shape of the overall genome. The horizontal axis of each sub-figure indicates the frequency (which is equivalent to the inverse of periodicity) measured by the Smoothed Fourier Transform and the vertical axis indicates the spectrum's amplitude.
Fig. 6 presents the spectrogram of each helitron type considering three levels of FCGS: FCGS2, FCGS4 and FCGS6. The vertical axis of each sub-figure indicates the frequency measured by the Smoothed Fourier Transform. The horizontal axis indicates the position in base pairs.

TABLE I. THE OCCURRENCE NUMBER AND THE LENGTH (IN BASE PAIRS) OF HELITRONS IN C. ELEGANS.