A Coding Technique Based on the Frequency Evolution Creates with a Time Frequency Analysis a New Genome's Landscape

Abstract: In recent years, considerable effort has been devoted to studying biological data sets within the framework of genomic signal processing. However, the enormous amount of data deposited in public databases makes the search for useful information a difficult task, and the choice of a convenient analysis approach is not at all obvious. In this work, we provide a new way to map genomes in the form of images. The mapping uses the complex Morlet wavelet as the analysis technique and the Frequency Chaos Game Signal (FCGS) as the digital data set. Before performing the wavelet analysis, we build the FCGS in such a way that we can follow the frequency evolution of nucleotide occurrences along the genome. The time-frequency analysis of the FCGS signals constitutes a pertinent tool for exploring the DNA structures in the C. elegans genome-wide landscape.


I. INTRODUCTION
With advances in genomics, sequencing techniques keep improving, which speeds up the collection of biological data. Consequently, both the amount and the variety of data are continually increasing, hence the need for a new tool that permits easy navigation within genomes. Nowadays, researchers in genomics rely on a standard graphical representation of chromosomes called the ideogram. Ideograms allow genomic data visualization using points, lines and other shapes to indicate the location of particular sites along the chromosomes [1]. However, the ideogram annotations must be updated each time new DNA hotspots are discovered within the chromosomal sequences. Thus, it is preferable to find another tool that describes all the chromosome structures permanently, regardless of their complexity. The idea is to find an adequate representation tool that directly maps the DNA produced by sequencing (i.e. in its character form) [2] [3], without the need for biological experiments, alignment algorithms [4] [5] [6] or automatic pipelines [7] to annotate all the inherent components.

II. RELATED WORK
Within the context of genomic signal processing, joint time-frequency analysis based on the short-time Fourier transform (STFT) has played a key role in data characterization and visualization [8][9][10][11]. In fact, color spectrograms were shown to provide significant information about periodicities and recurrent motifs along bio-molecular sequences [8] [9]. Furthermore, tri-color spectrograms, which are obtained by dimensionality reduction, give a unique visual signature of specific regions of the DNA [12]. Nevertheless, the STFT suffers from a limited time-frequency resolution. Indeed, according to the Heisenberg uncertainty principle, one cannot obtain a good resolution in both the time and the frequency domains, and the fixed length of the STFT window imposes the same trade-off at every frequency. Thus, the wavelet transform appears to be a good solution to overcome the STFT resolution limitation [13][14][15][16][17]. Given the localization properties of wavelets, recent works were oriented towards investigating DNA correlations [18][19][20] as well as identifying the coding regions in genes [21] and some of the repeating protein motifs [22].
In this framework, we propose the continuous wavelet analysis to map the genomic DNA as scalogram images based on the complex Morlet wavelet. The main objective of this work is to unravel the localized spectral behaviors of different DNA structures. Since DNA sequences are stored in biological databases as strings, it is necessary to convert them into numerical data, which in turn enables signal processing applications. This operation defines the so-called "DNA coding". Here, we are concerned with a new numerical assignment scheme, the Frequency Chaos Game Signal (FCGS), which allows us to follow the evolution of the oligomers' frequency of occurrence. Furthermore, by combining the FCGS with the wavelet analysis, we create a new perspective for representing the information within any given genomic sequence. The method is not only effective and original; it also enhances specific information about the sequence at each FCGS level. To prove the efficacy of this work in terms of visualizing genome landscapes, we consider the C. elegans genome.

III. MAPPING THE DNA SEQUENCES: FROM CHARACTER STRINGS TOWARDS THE FREQUENCY CHAOS GAME SIGNAL
The Frequency Chaos Game Signal is a new DNA coding technique that consists of a linear form of the Frequency Chaos Game Representation (FCGR). The FCGR depicts a DNA sequence as a 2D image [23][24][25], based on the Chaos Game theory [26] [27]. It encodes the oligomers' frequencies of occurrence according to a given color scale, where each frequency occupies a precise location in the representation area. Consider the following sequence:
Seq_comp = ACGATACAGATCAGATTTAGACAGACCGATAGTAGACGATCAGATCACCAGTGAC.
The frequencies of occurrence of the monomers, dimers and trimers, as well as the organization of the related words, are given in TABLE I and TABLE II.
In our coding approach, we take the frequency matrices as the assignment base [28] [29]. For this purpose, we fix the representation order k and generate the FCGR_k for the whole input sequence. We generally take the chromosomal sequence as the input data set, because we want the FCGS to reflect the statistical properties of the genome itself. Then, we read the succession of bases in groups of k nucleotides and assign to each position the corresponding frequency value. Take, for example, seq = ACGATACA, which is a portion of Seq_comp. Based on the FCGR matrices previously extracted, we assign to the monomers, dimers and trimers the frequency values illustrated in TABLE III.
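The coding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `kmer_frequencies` and `fcgs` are hypothetical, k-mers are read with an overlapping window of step 1, and frequencies are taken as counts normalized by the number of k-mers (the FCGR matrices in the paper may use a different normalization).

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer in seq (assumed normalization)."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}

def fcgs(seq, k, reference=None):
    """Replace each position i by the frequency of the k-mer starting at i.

    Frequencies are measured on `reference` (e.g. a whole chromosome) when it
    is given, otherwise on `seq` itself. k-mers absent from the reference get 0.
    """
    freqs = kmer_frequencies(reference if reference is not None else seq, k)
    return [freqs.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]

# Example sequence from the text (line-wrap space removed).
SEQ_COMP = "ACGATACAGATCAGATTTAGACAGACCGATAGTAGACGATCAGATCACCAGTGAC"

# One signal per order k, each of length len(SEQ_COMP) - k + 1.
signals = {k: fcgs(SEQ_COMP, k) for k in (1, 2, 3)}

# Coding the portion ACGATACA with frequencies measured on the whole sequence.
portion = fcgs("ACGATACA", 2, reference=SEQ_COMP)
```

Passing a `reference` mirrors the idea that the frequency values should reflect the statistics of the whole input sequence rather than those of the portion being coded.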
To encode the whole sequence Seq_comp, we proceed in the same way. The resulting FCGS_1, FCGS_2 and FCGS_3 plots are given in Figure 1.

IV. COMPLEX MORLET WAVELET ANALYSIS
Unlike the windowed Fourier transform, which averages the signal content within a fixed window, the wavelet transform offers very good time-frequency localization within the limits set by the uncertainty principle [30]. Indeed, the wavelet transform adapts its window (called the mother wavelet and denoted ψ) so that it shortens at high frequencies and expands at low frequencies, depending on a scale parameter a. At each scale, the daughter wavelet is shifted in time by a shift parameter b, which permits the convolution of the input signal with the analysis window. This principle is the basis of the Continuous Wavelet Transform (CWT). Thus, the daughter wavelets are generated by equation (1), and the continuous wavelet transform is defined by equation (2).

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \tag{1}$$

$$T_{\psi}(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} X(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right) dt \tag{2}$$
Here * denotes complex conjugation and X(t) is the input signal. The most usual way to display a wavelet transform is to plot the absolute values of the wavelet coefficients, |T_ψ(a,b)|, which gives a time-scale representation named the scalogram. Since the scale is inversely proportional to the frequency, one can project the wavelet coefficients onto the time-frequency plane [31][32][33]. In general, it is preferable to choose a continuously differentiable mother wavelet with compact support [34]. In our analysis, we choose the complex Morlet wavelet, which is a Gaussian-windowed complex sinusoid:

$$\psi(t) = \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^{2}/2} \tag{3}$$

In practice, one often takes ω_0 ≥ 5 to satisfy the admissibility condition required by the CWT [34].
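A minimal NumPy sketch of a scalogram computed with a complex Morlet wavelet is given below. It is an illustration under assumptions rather than the implementation used in this work: the helper names (`morlet`, `scales_for`, `cwt_morlet`) are hypothetical, the wavelet is the simplified Gaussian-windowed sinusoid, ω_0 is set to exactly 5.4285, the sampling step is one sample, the Gaussian envelope is truncated at four standard deviations, and the 64 analysis frequencies are spread linearly between 0.01 and 0.5.

```python
import numpy as np

OMEGA0 = 5.4285  # assumed value of omega_0 for this sketch

def morlet(t, omega0=OMEGA0):
    """Simplified complex Morlet wavelet: a Gaussian-windowed complex sinusoid."""
    return np.pi ** -0.25 * np.exp(1j * omega0 * t) * np.exp(-0.5 * t ** 2)

def scales_for(freqs, omega0=OMEGA0):
    """Scale at which the wavelet oscillates at each normalized frequency."""
    return omega0 / (2 * np.pi * np.asarray(freqs, dtype=float))

def cwt_morlet(x, scales, omega0=OMEGA0):
    """Discretized T(a, b) = a**-0.5 * sum_t x[t] * conj(psi((t - b) / a))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # Truncate the Gaussian envelope at 4 standard deviations and never
        # let the kernel grow longer than the signal.
        half = min(int(np.ceil(4 * a)), (n - 1) // 2)
        u = np.arange(-half, half + 1)
        # Correlation with conj(psi), written as a convolution with
        # h(u) = conj(psi(-u / a)) / sqrt(a).
        h = np.conj(morlet(-u / a, omega0)) / np.sqrt(a)
        out[i] = np.convolve(x, h, mode="same")
    return out

# Toy check: a cosine with a 10-sample period should light up near f = 0.1.
freqs = np.linspace(0.01, 0.5, 64)                   # assumed frequency grid
signal = np.cos(2 * np.pi * 0.1 * np.arange(1024))
scalogram = np.abs(cwt_morlet(signal, scales_for(freqs)))
peak_freq = freqs[scalogram.mean(axis=1).argmax()]
```

The direct convolution keeps the sketch close to the definition of the transform; for chromosome-length signals an FFT-based convolution would be the more practical choice.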

V. GENOME-WIDE VISUALIZATION OF C. ELEGANS
This paper concerns the representation of DNA as time-frequency images based on the complex Morlet wavelet analysis. For this purpose, we turn our attention to the C. elegans genome, whose sequences and annotations are extracted from the NCBI database [35]. Concerning the coding, we consider the FCGS technique with orders 1, 2 and 3. As for the wavelet analysis, we take a mother wavelet with ω_0 > 5.4285 and fix the number of scales to 64. Afterward, we perform the CWT on the FCGS signals of the six chromosomes. Since the scalograms yield almost the same global behavior, we selected the results relating to chromosome 1 for illustration. Given the considerable length of the chromosome, we proceed by zooming into the scalogram. A myriad of structures are shown to possess a specific time-frequency behavior; the boundaries and lengths of the selected sequences are listed in TABLE IV.
The first sequence consists of two (TTAGGC)n motifs which are spread over 10^3 bp. Figure 2 illustrates the corresponding representation when coding with FCGS_1, FCGS_2 and FCGS_3.
From subfigures (a) and (b), we mainly note that the structure is characterized by energy concentrated around the frequency 0.1 (which is equivalent to a 10 bp periodicity). This frequency is generally related to nucleosome formation [10] [11] [36][37][38][39]. However, in subfigure (c), the energy level decreases. By comparing all the subfigures, we conclude that, as the FCGS order increases, the energy indicating the 10 bp periodicity decreases while other shapes within other frequency bands are enhanced.
Figure 3 reveals the time-frequency behavior of a CEMUDR1 example. At first glance, we see periodic motifs around the frequency 0.15 (corresponding to a periodicity of approximately 6.5 bp, which is generally found in the noncoding DNA of genes [21]); this offers an easy way to detect the presence of the structure. A careful inspection of subfigures (a) to (c) reveals that each level of the FCGS coding highlights specific repetitive shapes within a given frequency band while keeping the overall aspect.
[...] much when the FCGS order changes (Figure 5). However, at each scale, the difference lies in the distribution of the energy bands. Further, thanks to our representation, we can easily detect the boundaries of each sequence.
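As a quick check of the correspondence between normalized frequency and periodicity used in this section, the relations below assume a sampling step of one base pair and the usual relation between Morlet scale and frequency; the scale value is computed for ω_0 = 5.4285 and is given only as an illustration.

$$P = \frac{1}{f}: \qquad f = 0.1 \;\Rightarrow\; P = 10\ \text{bp}, \qquad P = 6.5\ \text{bp} \;\Rightarrow\; f \approx 0.154$$

$$a = \frac{\omega_0}{2\pi f}: \qquad \omega_0 = 5.4285,\ f = 0.1 \;\Rightarrow\; a \approx 8.64$$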

VI. CONCLUSION
The work presented here describes a new way to represent genomic DNA in the form of images. Our goal was to offer a visual navigation of the genome in which different structures can be easily distinguished through a specific signature in the representation plane. This has the advantage of avoiding inaccurate or unavailable annotations as well as long and expensive experiments. This is why we based our study on the Frequency Chaos Game Signal (FCGS) and the complex Morlet wavelet analysis. The FCGS approach is a new coding tool that converts DNA strings into signals reflecting the statistical properties of genomes. These signals are in turn converted, using the complex Morlet wavelet, into scalogram images that reflect the periodic features of DNA. The analysis shows that different structures are characterized by particular periodic motifs localized within particular frequency bands with different energy levels. Furthermore, varying the FCGS order provides complementary information while preserving the global aspect of the sequence's behavior.

VII. FUTURE WORK
Overall, our analysis is a promising direction in the sense that it forms a strong base for classifying DNA structures, recognizing unknown sequences and rectifying the available annotations. Future work will therefore focus on extracting pertinent parameters from the generated scalograms for the classification of DNA structures.

TABLE III. LINEARIZATION OF THE OLIGOMERS FREQUENCIES TO OBTAIN FCGS_k WITH k = {1, 2, 3}

TABLE IV. BOUNDARIES AND LENGTHS OF THE SEQUENCES HAVING SPECIFIC TIME-FREQUENCY SIGNATURES