Microsatellite's Detection using the S-Transform Analysis based on the Synthetic and Experimental Coding

A microsatellite, or Short Tandem Repeat (STR), is a class of tandem repeats in genomic DNA sequences whose repeated pattern, 2 to 6 base pairs in size, occurs in adjacent copies. The detection of specific tandem repeats is an important part of genetic disease identification, and it is also used in DNA fingerprinting and in evolutionary studies. Many tools based on string matching have been developed to detect microsatellites. However, these tools rely on prior information about the repetitions in the sequence, which is not always available. For this reason, signal processing techniques have been suggested to overcome the limitations of the bioinformatic tools. In this paper, we use a new variant of the S-Transform, which we apply to short tandem repeat signals. These signals are first obtained by applying different coding techniques to the DNA sequences. To further study the performance of the proposed method, we compare it with different bioinformatics approaches (TRF, Mreps, Etandem) and three other signal processing methods: the Adaptive S-Transform (AST), the Empirical Mode and Wavelet Decomposition (EMWD) and the Parametric Spectral Estimation (PSE) considering the AR model. This study indicates that our approach outperforms the earlier methods in identifying short tandem repeats; in fact, our method detects the exact number and positions of the trinucleotides present in the tested real DNA sequence.

Keywords: DNA sequence; microsatellites; synthetic and experimental coding; S-transform; bioinformatic tools; Empirical Mode and Wavelet Decomposition (EMWD); Parametric Spectral Estimation (PSE)


I. INTRODUCTION
Computational analysis of DNA sequences is a fundamental subject, which aims to understand the biological functionality of all living organisms. Particular attention has been paid to microsatellites, which ensure many biological functions. In fact, they are implicated in cell metabolism, the mismatch repair system [1], the regulation of chromatin organization [2], gene activity and many other functions [3].
A microsatellite sequence, also called Short Tandem Repeat (STR), consists of two or more adjacent copies of a short nucleotide pattern unit [4]. The STR is defined by a specific period (pattern unit). The microsatellite's period is typically between 2 and 6 nucleotides per unit, named di-, tri-, tetra-, penta- and hexa-nucleotides, respectively [5]. These elements (STRs) have a length of less than 150 base pairs [6]. Microsatellites occur frequently at different locations within an organism's genome. They are highly redundant, short and dispersed; therefore, they are detected through automatic tools. Their importance is twofold. On the one hand, in the human genome, approximately 10% of the DNA consists of microsatellites [7]. This special repeat can be a direct cause of many human diseases such as Huntington's chorea, spinal and bulbar muscular atrophy [8], myotonic dystrophy [9] and Friedreich's ataxia [10]. On the other hand, in other genomes, microsatellite elements are useful in many research domains such as DNA forensics [11], population genetic analysis [12], conservation biology and phylogenetics [13], [14].
Taking into account the importance of these regions, many studies have focused on tandem repeats or microsatellites using bioinformatic tools [15], [16]: MISA [17], Sputnik [18], Mreps [19], EMBOSS (etandem and equitandem) [20], RepeatMasker [21] and TRF [22]. These tools take repeat candidates and compare them to DNA consensus sequences to detect microsatellites. Their algorithms use regular expressions [21], the Hamming distance [15], recursive matching and penalty scores [22]. They also use k-mers with suffix trees [23] and a heuristic alignment procedure [24]. Among these tools, Tandem Repeats Finder (TRF) is the most widely used for detecting short tandem repeats in DNA sequences [22]. Nevertheless, it is not easy to use because its settings must be chosen carefully. This is not specific to TRF, as most bioinformatics tools also need prior information for their input parameters [16], such as the pattern, pattern size, number of repeats, reference sequence or score [25]. However, this prior information is sometimes unavailable because the characteristics of the short tandem repeats are not known.
Aiming to overcome these limitations, scientists have sought effective approaches based on signal processing techniques that need no prior information on the targeted sequence. These approaches mainly use periodicity to detect STRs [26]. In this sense, spectral analyses based on the exact periodic subspace decomposition and on the autoregressive (AR) model were carried out [27]. On the other hand, methods providing a time-frequency representation have been proposed [29], [30]. Thus, the Short Time Fourier Transform [28] and the complex Morlet wavelet transform were used for pattern visualization [31], [32]. In an attempt to detect microsatellites, the adaptive and modified S-transforms have also been used [33], [34].
In this paper, we are interested in the identification of microsatellites in genomic sequences. As part of the genomic signal processing domain, this work proposes a new method that combines the S-transform and a particular coding technique. Our detection system achieves accurate results without using any prior knowledge about the input data. This paper is organized as follows. Section 2 presents the S-Transform, which we use as a time-frequency representation technique. A coding step is needed in order to apply the S-transform to the DNA sequence. The different coding techniques used for the genomic sequences are described in Section 3. In Section 4, the STR detection algorithm is detailed and illustrative examples are included. Section 5 provides the experimental results and evaluates the short tandem repeat identification performance by comparison with other methods. Finally, Section 6 concludes this paper.

II. S-TRANSFORM AS ANALYSING TECHNIQUE
The S-Transform (ST) is a time-frequency distribution which was developed by Stockwell et al. in 1994 for analyzing geophysics data [35]. It is a hybrid of the Short Time Fourier Transform (STFT) and the Continuous Wavelet Transform (CWT). It retains the phase information as in the STFT and provides a variable resolution similar to the CWT. There are several ways to deal with these characteristics. Here, we present three existing variants of the S-Transform: the Standard S-Transform (SST) [35], the Generalized S-Transform (GST) [36] and the Width Window Optimized S-Transform (WWOST) [37]. Finally, we propose our S-Transform modification, which aims to enhance the time-frequency resolution of microsatellite representations.

A. The Standard S-Transform (SST)
The S-Transform, in its standard form, consists in calculating the Fourier Transform of a signal $x(t)$ multiplied by a Gaussian window. Therefore, the Standard S-Transform calculation formula is:

$$S(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, w(\tau - t, f)\, e^{-j 2 \pi f t}\, dt$$

where $f$ represents the frequency, $t$ represents the time, $w$ is the Gaussian window and $\tau$ controls the position of $w$ on the $t$-axis.
In the time domain, the Gaussian window is given by:

$$w(t, f) = \frac{1}{\sigma(f)\sqrt{2\pi}}\, e^{-\frac{t^{2}}{2\sigma(f)^{2}}}$$

where $\sigma(f)$ is the Gaussian standard deviation. It depends on the frequency as follows:

$$\sigma(f) = \frac{1}{|f|}$$

This function controls the window's width. The S-Transform can be defined as:

$$S(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, \frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2} f^{2}}{2}}\, e^{-j 2 \pi f t}\, dt$$

At low frequencies, the S-Transform offers good frequency resolution, whereas at high frequencies it gives better time resolution. The main drawback of the S-Transform is thus its time-frequency resolution. More efficient representations have been introduced through several modifications [33]. The S-Transform optimization consists in controlling the Gaussian window's width by adding new parameters [34].
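The definition of the Standard S-Transform can be illustrated with a minimal discrete sketch. It assumes a uniformly sampled signal, normalized frequencies in cycles per sample, and a direct summation in place of the integral; the function name `s_transform` is ours and is not taken from any library.

```python
import numpy as np

def s_transform(x, freqs):
    """Direct discrete sketch of the Standard S-Transform.

    x     : 1-D signal (real or complex), uniformly sampled
    freqs : non-zero normalized frequencies, in cycles per sample
    Returns a complex array of shape (len(freqs), len(x)),
    indexed as [frequency, time position tau].
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    t = np.arange(n)
    # d[tau, t] = t - tau, the lag between each sample and the window centre
    d = t[None, :] - t[:, None]
    out = np.empty((len(freqs), n), dtype=complex)
    for i, f in enumerate(freqs):
        sigma = 1.0 / abs(f)  # standard window width: sigma(f) = 1 / |f|
        window = np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
        # sum over t of x(t) * w(tau - t, f) * exp(-j 2 pi f t), for every tau
        out[i] = window @ (x * np.exp(-2j * np.pi * f * t))
    return out

# Usage: a pure period-3 signal concentrates its energy around f = 1/3.
t = np.arange(90)
x = np.cos(2 * np.pi * t / 3)
S = s_transform(x, [1 / 3, 0.2])
print(abs(S[0, 45]), abs(S[1, 45]))
```

The direct summation costs O(N^2) operations per frequency, which is acceptable for sequences of a few hundred bases; an FFT-based formulation would be preferable for long sequences.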

B. The Generalized S-Transform (GST)
The Generalized S-Transform was proposed by McFadden [36] as a modified form of the Standard S-Transform. This modification consists in introducing a new parameter $\alpha$. This parameter controls the Gaussian window's width as follows:

$$\sigma(f) = \frac{\alpha}{|f|}$$

Consequently, the Generalized S-Transform is written as follows:

$$S_{\alpha}(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, \frac{|f|}{\alpha\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2} f^{2}}{2\alpha^{2}}}\, e^{-j 2 \pi f t}\, dt$$

C. The Width Window Optimized S-Transform (WWOST)
Sejdic and his team [38] have suggested another modification of the Gaussian window width by introducing a new parameter $p$ in the expression of $\sigma(f)$:

$$\sigma(f) = \frac{1}{|f|^{p}}$$

Thus, the S-Transform becomes:

$$S_{p}(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, \frac{|f|^{p}}{\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2} f^{2p}}{2}}\, e^{-j 2 \pi f t}\, dt$$

In order to enhance the energy concentration in the time-frequency representation given by the S-Transform, we propose another way to control the window width.

D. Proposed Modification of the S-Transform
In this work, we propose a new variant of the S-Transform by combining the two modified versions of the S-Transform. The Gaussian standard deviation in this case is defined as:

$$\sigma(f) = \frac{\alpha}{|f|^{p}}$$

The S-Transform becomes:

$$S_{p,\alpha}(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, \frac{|f|^{p}}{\alpha\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2} f^{2p}}{2\alpha^{2}}}\, e^{-j 2 \pi f t}\, dt$$

In Fig. 1, we represent the Gaussian window function around the frequency 0.33 (which is equivalent to periodicity 3 in DNA). We take into account different values of $p$ and $\alpha$ and we provide the temporal and the spectral supports of the corresponding window. When $p = 1$ and $\alpha = 1$, we are in the case of the Standard S-Transform (SST). When $p = 1$ and $\alpha \neq 1$, it is a Generalized S-Transform (GST). For $p \neq 1$ and $\alpha = 1$, it is the Width Window Optimized S-Transform (WWOST). Finally, for $p \neq 1$ and $\alpha \neq 1$, we are in the presence of the proposed S-Transform.
The combination of the parameters $p$ and $\alpha$ in the S-Transform gives the Gaussian window more flexibility to capture periodicity 3 than the previous versions. It has the property of minimizing the band in both the spatial and the frequency domains. Hence the importance of this new variant of the S-Transform for detecting the characteristic periodicities in DNA, especially those of microsatellites.
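The effect of the two parameters on the window can be sketched numerically. The snippet below assumes that the combined width takes the form sigma(f) = alpha / |f|^p, which reduces to the SST for p = 1 and alpha = 1; this form and the numerical values of p and alpha are illustrative assumptions and are not the values retained in this paper.

```python
import numpy as np

def window_sigma(f, p=1.0, alpha=1.0):
    """Time-domain standard deviation, ASSUMED form alpha / |f|**p."""
    return alpha / abs(f) ** p

def gaussian_window(t, f, p=1.0, alpha=1.0):
    """Unit-area Gaussian window evaluated at lags t for frequency f."""
    s = window_sigma(f, p, alpha)
    return np.exp(-np.asarray(t, dtype=float) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

f0 = 1 / 3  # frequency associated with periodicity 3
# (p, alpha) pairs below are hypothetical, chosen only to show the four cases
variants = {
    "SST": (1.0, 1.0),
    "GST": (1.0, 2.0),
    "WWOST": (0.5, 1.0),
    "proposed": (0.5, 2.0),
}
for name, (p, alpha) in variants.items():
    sigma_t = window_sigma(f0, p, alpha)
    # a Gaussian of time spread sigma_t has frequency spread 1 / (2 pi sigma_t)
    sigma_f = 1.0 / (2 * np.pi * sigma_t)
    print(f"{name:9s} p={p} alpha={alpha} sigma_t={sigma_t:.3f} sigma_f={sigma_f:.4f}")
```

The printed spreads make the trade-off explicit: a wider temporal window narrows the frequency band around 0.33, and conversely.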

III. DNA CODING TECHNIQUES
To be able to apply suitable signal processing methods to the DNA sequence, we must first convert the nucleotide string into a numeric signal. Recall that DNA is a character string combining 4 nucleotides: A, C, T and G.
To achieve this conversion, different coding techniques have been proposed. The choice of the coding technique is delicate, since each method must be tested to see whether it can enhance particular useful information [39]. These techniques can be defined by substituting nucleotides with numerical values according to the user's choice. Alternatively, they can be based on statistical or structural properties of DNA, which reflect interesting specificities of the sequence. Thus, there are two broad families of DNA coding methods: synthetic and experimental coding.

A. Synthetic Coding
The synthetic coding principle consists in assigning a real or an imaginary value to a nucleotide base or a group of nucleotides.The most widely used synthetic mapping techniques are the binary coding, the complex binary [40] and the random walk [41].

1) Binary coding:
The binary coding simply assigns 1 or 0 to indicate the presence or the absence of a nucleotide base in the original sequence. For example, we can apply the following formula to seek the presence of the base A at position $n$ of the sequence $s$:

$$U_A[n] = \begin{cases} 1 & \text{if } s[n] = A \\ 0 & \text{otherwise} \end{cases}$$

2) Binary complex coding:
The binary complex coding consists in assigning a complex value to each nucleotide [40].

3) Random walks:
DNA nucleotides can be classified according to their chemical structure [41]. The pyrimidine class contains the nucleotides C and T, and the purine class contains A and G. The random walk coding assigns the value -1 if the base is C or T and 1 if the base is A or G.
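As an illustration, the binary indicator and the purine/pyrimidine assignment described above can be sketched as follows. The function names are ours; the complex binary coding is left out because its mapping values are not reproduced here.

```python
import numpy as np

def _clean(seq):
    """Upper-case the sequence and reject symbols other than A, C, G, T."""
    seq = seq.upper()
    bad = set(seq) - set("ACGT")
    if bad:
        raise ValueError(f"unexpected symbols: {sorted(bad)}")
    return seq

def binary_indicator(seq, base):
    """U_base[n] = 1 where the sequence holds `base`, 0 elsewhere."""
    return np.array([1 if c == base else 0 for c in _clean(seq)])

def random_walk_coding(seq):
    """+1 for purines (A, G), -1 for pyrimidines (C, T)."""
    return np.array([1 if c in "AG" else -1 for c in _clean(seq)])

seq = "ACGTACGA"
print(binary_indicator(seq, "A"))
print(random_walk_coding(seq))
```

Each of the four indicators only sees repeats containing its own base, which is why several indicators may be needed to cover all motifs.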

B. Experimental Coding
Experimental coding techniques make use of experimental tables to reflect the chemical and the structural properties of DNA in the produced signal.As examples, we present here the EIIP [42], EIIPc [43] and the PNUC coding [44].

1) EIIP:
The EIIP mapping is based on measuring the energy of the electrons delocalized in nucleotides [42]. The energy values corresponding to each nucleotide are given in Table I.

2) EIIPc:
The EIIPc technique [43] uses the energy of the electrons delocalized in amino acids. The DNA sequence is read by overlapping codons, and a numerical value is allocated to each position according to Table II.

3) PNUC:
The PNUC coding [44] is based on the measurement of curvatures related to tri-nucleotides during nucleosome positioning. The corresponding experimental values are given in Table III. The principle consists in assigning these values after an overlapping reading of the tri-nucleotides in the sequence.
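The two reading schemes, one value per nucleotide and one value per overlapping tri-nucleotide, can be sketched as below. The EIIP values are the ones commonly reported in the literature and should be checked against Table I; the tri-nucleotide table `toy_table` is a hypothetical placeholder and does not contain real EIIPc or PNUC values.

```python
import numpy as np

# Commonly reported EIIP values per nucleotide (to be checked against Table I)
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def eiip_coding(seq):
    """Replace each nucleotide by its EIIP value."""
    return np.array([EIIP[c] for c in seq.upper()])

def overlapping_trinucleotide_coding(seq, table):
    """Slide a window of 3 bases one position at a time and look up each triplet.

    `table` maps every triplet present in `seq` to a number; an experimental
    table (EIIPc or PNUC) would be supplied here.
    """
    seq = seq.upper()
    return np.array([table[seq[i:i + 3]] for i in range(len(seq) - 2)])

# Hypothetical values, for illustration only
toy_table = {"CGT": 1.0, "GTC": 2.0, "TCG": 3.0}
print(eiip_coding("ACGT"))
print(overlapping_trinucleotide_coding("CGTCGT", toy_table))
```

With the overlapping reading, a sequence of length N gives a signal of length N - 2, and a perfect trinucleotide repeat gives a signal of period 3.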
In the next section, we test these coding techniques with the S-Transform to select the most suitable one for tandem repeat detection.

IV. MICROSATELLITES DETECTION METHOD
After presenting the mathematical principle of the S-Transform as well as the DNA coding methods, we present our algorithm for microsatellite detection. The model we provide here is based on the new variant of the S-transform and the PNUC coding. The corresponding flowchart is given in Fig. 2.
The algorithm consists, first, in converting the DNA nucleotides with the PNUC technique. After that, the obtained signal is smoothed using the OTSU method to maximize energy concentration. Afterwards, the new variant of the S-Transform is applied to the preprocessed signal. Then, we apply a binarization operation to the S-transform time-frequency representation to locate the short tandem repeat. The final step consists in extracting the motif from the detected microsatellite.

A. Choice of the Coding Technique
In order to choose the appropriate coding, which allows the best detection of the existing periodicities in the genomic sequences with the best energy concentration, we test different coding techniques. As experimental codings, we consider PNUC, EIIP and EIIPc. As synthetic codings, we consider the random walk and the binary and complex indicators. For the experiments, we investigate an artificial microsatellite sequence S that contains the periodicities 2, 3, 4, 5 and 6 bp.
A description of the short tandem repeats (STRs) existing in the sequence S is given in Table IV. In Fig. 3, we provide the time-frequency representation of this sequence after application of the new S-Transform with fixed values of $p$ and $\alpha$.
The evaluation of the coding technique is based on a visual inspection of the time-frequency representation. The latter must enhance all the periodicities in the DNA sequence in both the time and the frequency planes.
For the synthetic coding group, we notice that the random walk coding does not give a good temporal localization. The complex binary technique gives the best result, since it detects the majority of the periodicities (all periodicities except periodicity 6). As for the binary coding, we consider the $U_A$, $U_C$, $U_T$ and $U_G$ indicators. Based on the corresponding time-frequency representations, we find that the best result is obtained with $U_G$.
For the coding based on experimental tables, PNUC is shown to be the best method compared to EIIP and EIIPc.
The binary coding depends on the nucleotides constituting the sequence. In fact, if the microsatellite does not contain the nucleotide G, the binary indicator $U_G$ will not detect it. The same holds for the other indicators. Therefore, we use the PNUC coding.

B. OTSU Thresholding
In this sub-section, we study the effect of preprocessing on the microsatellite time-frequency representation given by the new S-Transform. For this purpose, we take an artificial sequence Seq characterized by periodicity 3, which is caused by two different motifs (CGT and TAC). The two microsatellites contained in the sequence Seq have almost the same size.
The time-frequency representation of the sequence Seq is given in Fig. 4(a). As we can see, the two STRs are detected, but one of them is more pronounced than the other. We therefore need a method that allows the detection of both microsatellites (i.e. the two STRs must have the same energy in the time-frequency plane). That is why we apply the OTSU thresholding technique to the signal before applying the S-Transform.
The principle of this preprocessing technique consists in calculating a threshold based on the shape of the histogram. This method assumes that the data can be divided into two classes, and then determines the optimum threshold that minimizes the combined spread of the two classes [45].
In order to obtain the threshold established by OTSU's method, we first select the locations of the minima/maxima of the signal. Then, we build the histogram of the peak strengths in absolute value. Finally, we apply OTSU's algorithm. Indeed, we notice that the preprocessing chosen here allows us to easily identify all the existing periodicities in the artificial sequence. It enhances the energy concentration for the pattern CGT as well as for the pattern TAC. By contrast, the ST without OTSU's method cannot highlight the pattern TAC well.
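The threshold computation described above can be sketched as follows, under two assumptions of ours: local extrema are taken as interior samples where the slope changes sign, and OTSU's criterion is evaluated on a histogram with 64 bins. The sketch stops at the threshold itself, since the way it is then applied to the signal is not detailed here. Maximizing the between-class variance, as done below, is equivalent to minimizing the combined within-class spread.

```python
import numpy as np

def local_extrema_strengths(x):
    """Absolute values of the signal at its interior local minima/maxima."""
    x = np.asarray(x, dtype=float)
    d = np.diff(x)
    idx = np.where(d[:-1] * d[1:] < 0)[0] + 1  # slope changes sign at idx
    return np.abs(x[idx])

def otsu_threshold(values, bins=64):
    """Otsu threshold of a 1-D sample, returned as a histogram bin edge."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist).astype(float)   # class-0 size for each candidate split
    w1 = w0[-1] - w0                     # class-1 size for each candidate split
    s0 = np.cumsum(hist * centers)       # running sum of binned values
    valid = (w0 > 0) & (w1 > 0)          # both classes must be non-empty
    if not valid.any():
        raise ValueError("values cannot be split into two classes")
    m0 = s0[valid] / w0[valid]
    m1 = (s0[-1] - s0[valid]) / w1[valid]
    between = w0[valid] * w1[valid] * (m0 - m1) ** 2  # between-class variance
    return edges[1:][valid][np.argmax(between)]

signal = np.array([0.0, 1.0, 0.0, -1.1, 0.0, 9.8, 0.0, -10.0, 0.0, 0.9, 0.0])
strengths = local_extrema_strengths(signal)
print(strengths, otsu_threshold(strengths))
```

A library routine such as a standard Otsu implementation from an image processing package could replace `otsu_threshold`; the explicit version is kept here to show the criterion.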

C. New Variant of the S-Transform
After preprocessing the DNA sequence, we move to the analysis step. Since the proposed variant of the S-Transform is a general form of the latter, we test its parameters in such a way that the SST, GST and WWOST are also covered. Therefore, we have to modify the frequency-dependent control parameters $p$ and $\alpha$ to get the appropriate window length allowing the detection of the microsatellite periodicities (period = 1/frequency). Fig. 5 illustrates the time-frequency representations of the DNA sequence Seq obtained by testing different values of $p$ and $\alpha$; Seq is the same sequence presented in sub-section B.
These sub-figures demonstrate that the proposed S-Transform gives the best time-frequency resolution, so it is the most suitable setting to study microsatellites. These parameter values are chosen empirically: we retain the values of $p$ and $\alpha$ that give the narrowest frequency band.

D. Binarization
After enhancing the microsatellite representation in the time-frequency plane with the selected parameter values, we go on to the detection step. However, in order to delimit the start and the end of the short tandem repeat from the ST representation, we must eliminate the noise existing in it. For this purpose, we transform the time-frequency representation into a binary one using a thresholding operation. The binarization step consists in giving the value 0 or 1 to a pixel after comparing it to a threshold. In our work, we tested different threshold values. The optimal STR detection was obtained for a threshold equal to 0.83. In Fig. 6, we present the time-frequency representation of Seq after binarization with the best threshold value.
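A minimal sketch of the binarization and of the delimitation of a repeat is given below. It assumes that the magnitude of the representation is normalized by its maximum before being compared to the threshold, which is our reading of a threshold of 0.83; the function names are ours.

```python
import numpy as np

def binarize(tfr, threshold=0.83):
    """Set to 1 the pixels whose normalized magnitude reaches the threshold."""
    mag = np.abs(np.asarray(tfr))
    peak = mag.max()
    norm = mag / peak if peak > 0 else mag  # ASSUMED normalization to [0, 1]
    return (norm >= threshold).astype(np.uint8)

def runs_of_ones(row):
    """(start, end) index pairs, inclusive, of consecutive ones in a binary row."""
    padded = np.concatenate(([0], np.asarray(row, dtype=int), [0]))
    change = np.diff(padded)
    starts = np.where(change == 1)[0]
    ends = np.where(change == -1)[0] - 1
    return list(zip(starts.tolist(), ends.tolist()))

# One frequency row of a toy representation (values are illustrative)
tfr = np.array([[0.1, 0.9, 1.0, 0.2, 0.85]])
binary = binarize(tfr)
print(binary)
print(runs_of_ones(binary[0]))
```

Applied to the row closest to the frequency 0.33, `runs_of_ones` gives the start and end positions of each candidate trinucleotide repeat.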

E. Pattern Extraction from Short Tandem Repeats
After identifying the microsatellite length and its periodicity, we now want to determine its specific pattern. To do so, we use an automatic algorithm that captures the repetitive pattern.
Pattern extraction consists in comparing DNA subsequences of size p with the DNA sequence of length n. The consensus repeat pattern is the most frequently repeated pattern in the sequence.
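One simple way to obtain the most repeated pattern is sketched below. It assumes that the detected region is given by 0-based inclusive positions and that the units are read without overlap from the start of the region; the function `consensus_pattern` is hypothetical and is not the exact procedure used in this paper.

```python
from collections import Counter

def consensus_pattern(seq, start, end, period):
    """Most frequent unit of length `period` in seq[start..end] (inclusive).

    Units are read without overlap, starting at `start`.
    Returns the pattern and its number of occurrences.
    """
    region = seq[start:end + 1].upper()
    units = [region[i:i + period]
             for i in range(0, len(region) - period + 1, period)]
    if not units:
        raise ValueError("region is shorter than one period")
    pattern, count = Counter(units).most_common(1)[0]
    return pattern, count

# Toy sequence: a CGT repeat with one mutated unit (CAT), flanked by other bases
seq = "AAT" + "CGT" * 2 + "CAT" + "CGT" * 2 + "GGA"
print(consensus_pattern(seq, 3, 17, 3))
```

If the detected start is shifted by one or two bases, this reading returns a rotation of the motif (for instance GTC instead of CGT), which designates the same repeat.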
In the previous example, the algorithm above gave the results shown in Table V. This table presents the short tandem repeats that exist in Seq with their characteristics (beginning, end, periodicity, pattern). We notice that the patterns are located with good precision.

V. EXPERIMENTAL RESULTS
In this section, we test the performance of our algorithm in detecting STRs. The sequence X64775 of Oryza sativa Indica Group is selected for our experiments. This DNA sequence is obtained from the NCBI database [46]. It has a short tandem repeat starting at position 142 bp and extending to position 186 bp. This repeat region is characterized by the repetition of the pattern "GGC". We chose this sequence because of its common use in previous studies [27], [33].
The result obtained after applying each step of our algorithm is illustrated in Fig. 7. We also give the ST representation of the sequence without applying the OTSU method (Fig. 7(b)). The sub-figures (b) and (c) demonstrate the role played by the OTSU method in smoothing the ST representation, which enhances the three regions with high energy around the frequency 0.33 (i.e. periodicity 3). In fact, these zones represent repeats of trinucleotide motifs in the X64775 sequence. Periodicity 3 is well localized in both the time and the frequency domains. Hence the importance of our algorithm in characterizing microsatellites by the corresponding frequency and locating their position.
To evaluate the efficiency of the proposed method, we first compared the obtained results with those of bioinformatics tools. We chose Mreps, Etandem and TRF.
For Mreps and Etandem, we kept the default parameters. For TRF, we first kept the default settings as well. Then, we changed the settings as follows: match = 2, mismatch = 5 and indels = 5, with a Minimum Alignment Score of 30.
The obtained results are presented in Table VI.
We notice that periodicity 3 is detected once by Etandem and by TRF with the default settings. However, Mreps detects 2 regions with periodicity 3 and one region with periodicity 6. As for our method, it locates 3 regions with periodicity 3. Our results are the closest to those of TRF with the adjusted parameters, which are the most suitable.
We also compared our method with signal analysis techniques: the Adaptive S-Transform (AST), the Empirical Mode and Wavelet Decomposition (EMWD) and the Parametric Spectral Estimation (PSE) considering the AR model.
The results of the microsatellite detection in X64775 by the AST, EMWD and PSE methods are detailed in [27] and [33].
From Table VII, periodicity 3 is detected by EMWD only twice. The remaining techniques identify 3 regions with periodicity 3. These techniques succeed in identifying the microsatellites listed in the NCBI database. In addition, PSE and our method detect another short tandem repeat similar to one detected by TRF in terms of localization. Only our proposed method detected the same three microsatellites as TRF from the standpoint of both periodicity and position. To conclude, we succeeded in finding an efficient method for STR detection. Furthermore, the obtained results match those of the bioinformatics tools, with the advantage of being independent of any prior knowledge about the searched repeat.

VI. CONCLUSION
This study reveals the advantage of signal and image processing tools over bioinformatics ones in highlighting short tandem repeats in DNA sequences. The system we propose here is based on a DNA coding technique and the S-transform.
First, we investigated the role played by the coding technique in enhancing the time-frequency representation of microsatellites. We tested six coding techniques: PNUC, EIIPc, EIIP, the binary coding, the complex binary coding and the random walk. The best resolution was obtained with the PNUC coding technique.
Next, we presented our new approach for microsatellite detection. The algorithm consists of four steps.
As a first step, we encoded the DNA sequence into a numerical signal using the PNUC technique. Secondly, we preprocessed the obtained signal with Otsu's method in order to maximize the useful information. Then, we applied a new variant of the S-Transform to get the time-frequency representation of the sequence under study. After a binarization step, this representation allowed us to easily localize the microsatellite position and periodicity. The final step consists in automatically extracting the pattern from the microsatellites.
To prove the effectiveness of our method, we compared its results with those of some bioinformatics tools: TRF, Mreps and Etandem. We also established a comparison with other signal processing tools: AST, Parametric Spectral Estimation and EMWD. In all cases, our approach outperforms these methods in terms of STR detection.
The main advantage of our algorithm is that it is independent of any prior knowledge of the repeat's characteristics. Moreover, it offers the possibility of a simple graphic visualization of microsatellites.
In future work, this approach can be extended to identify tandem repeats with a longer repetition unit (minisatellites and satellites).

TABLE I. THE EXPERIMENTAL VALUES FOR THE EIIP CODING TECHNIQUE

TABLE II. THE EXPERIMENTAL VALUES FOR THE EIIPc CODING

TABLE IV. THE EXISTING STRS IN THE SEQUENCE S

TABLE V. SHORT TANDEM REPEAT DETECTION IN SEQ

TABLE VI. MICROSATELLITES DETECTION IN X64775 WITH BIOINFORMATIC TOOLS AND OUR PROPOSED METHOD

TABLE VII. MICROSATELLITES DETECTION IN X64775 WITH ANALYSIS TECHNIQUES AND PROPOSED METHOD