An Efficient and Rapid Method for the Detection of Mutations in Deoxyribonucleic Acid Sequences

Abstract—The comparison of genomic sequences plays a key role in determining the structural and functional relationships between genes. This comparison is carried out by identifying the similarities, differences and mutations between genomic sequences, which makes it possible to study and analyze the genetic and evolutionary relationships between organisms. Alignment algorithms have been in the spotlight for the last few decades, due to the explosion of genomic data. They have attracted a great deal of interest from researchers who focus on the development of practical solutions that ensure effective alignments with an optimal response time. In this paper, a novel algorithm based on the Discrete To Continuous (DTC) approach is developed. The proposed methodology is compared against existing methods, which are largely based on the concept of string matching. Experimental results show that the DTC algorithm delivers highly accurate alignments with a reduced response time.


I. INTRODUCTION
Bioinformatics lies at the intersection of biology and computer science: it applies computational methods to life-science disciplines such as genomics, proteomics and molecular biology. A central mission of this research area is to analyze and interpret deoxyribonucleic acid (DNA) sequences stored in central databases, accessible worldwide, enabling scientists to publish and search biological information.
A DNA sequence is an ordered string over the four-nucleotide alphabet A, C, G and T, containing the information necessary for the survival and reproduction of living beings. Analyzing this sequence is therefore important both for research on living organisms and for biomedical engineering.
Comparison of DNA sequences is performed by software based on alignment algorithms that return results as scores and percentages of similarity and identity, and in which dynamic programming plays a considerable role. Dynamic programming relies on a relationship between the optimal solution of a problem and the optimal solutions of a finite number of subproblems. Concretely, this means that the optimal solution of a problem can be deduced from the optimal solutions of its subproblems.
With regard to sequence alignment, three types of alignment of DNA sequences can be distinguished:
1) Global alignment: used when the sequences are about the same length, because the alignment is computed over their entire lengths. This type of alignment was first proposed by Needleman and Wunsch [1].
2) Semi-global alignment: used when one sequence is much shorter than the other, or when one looks for overlaps at the ends without counting gap penalties.
3) Local alignment: searches for the two most conserved sub-regions between two sequences; only these two regions are aligned. The Smith-Waterman algorithm [2] is the most widely used for this purpose.
In this study, a new DNA sequence alignment algorithm, Discrete-To-Continuous (DTC), is presented; it supports the three types of alignment: global, semi-global and local. DTC relies on dynamic programming based on polynomial interpolation of the data. This approach was originally applied in shape recognition and chirality measurement [3]. Subsequently, DTC has been adapted to other application areas, namely: the alignment of time-shifted signals [4], correction of DNA-electropherogram errors resulting from capillary electrophoresis sequencing experiments [5], online signature matching [6], speech recognition [7], algorithmic geometry [8] and fingerprint matching [9].
Unlike string matching algorithms, which try to find a point-to-point correspondence between the strings, the DTC approach solves this problem in its entirety by superimposing the discrete representation of the test points on the continuous representation of the reference points. In order to assess the performance of this approach, DTC was programmed and adapted to the DNA sequence alignment domain. For this purpose, a comparative study was carried out, in terms of accuracy, time complexity and response time, against other algorithms that were the subject of a benchmarking study on a DNA sequence. The sequence studied in this work is JN222368 from GenBank, belonging to a marine sponge.
For a fair comparison, the test environment and conditions were unified by downloading a partial GenBank database containing 7682 sequences of different sizes, including the sequence in question, JN222368. Subsequently, the DTC algorithm and the reference algorithms were implemented in the Java programming language. The machine used was a 2.40 GHz Intel Core i7 processor with 8 GB of RAM. Experimental results show that the DTC algorithm delivers highly accurate alignments with a reduced response time, including in the detection of mutations and gaps.
This work is organized as follows. Section I is a general introduction to the problem. Section II presents the studies and works carried out during the last five years in this area. Section III gives an overview of the different string matching algorithms. Section IV presents, in a detailed and in-depth way, the operating principle of the DTC approach. The results of comparing the DTC algorithm with the other approaches are presented and discussed in Section V. Section VI concludes the paper.

II. RELATED WORKS
Given the importance of string matching algorithms in determining the functional and structural relationships of biological sequences, several studies have been carried out. This section presents the studies and works carried out during the last five years in this area. In 2015, Nadia B.N., Lecroq T. and Elloumi M. conducted a study [10] presenting an algorithm that extends the variants of Boyer-Moore's exact string matching algorithm. The goal of this work is to solve the problem of exact pattern matching in a set of similar DNA sequences, in which only the pattern can be preprocessed.
In another work carried out in 2015 [11], new methods for matching key motifs in secondary RNA structures, based on the notion of structural strings, were proposed. In this approach, new correspondence algorithms were used to solve the structural matching problem. This solution also made it possible to answer various combinatorial queries encountered during the pairing of secondary RNA structures.
In 2017, a comparative study [12] of exact string matching algorithms in the field of DNA sequence analysis was performed by Iji and Mahalakshmi. This work essentially focused on the response time, the alignment accuracy on DNA sequences and the time complexity of the algorithms in question. The results revealed that the Boyer-Moore algorithm provides the highest accuracy, while the Reverse Colussi algorithm provides the shortest run time.
In another study carried out in 2019 [13], a new solution was proposed, based on massive multithreading with a focus on the latest Intel architectures featuring Advanced Vector Extensions 512 (AVX-512). The goal is to address the limited adoption of the Smith-Waterman algorithm caused by the computational requirements of the large protein databases often used for local sequence alignment.
Also in 2019, a new processing method [14], based on the comparison of sequences without explicit pairwise matching, was proposed by S. Kouchaki, A. Tapinos and D. L. Robertson. This approach provides a viable solution for processing the textual representation of sequence data.
One of the most recent studies in this area was carried out in 2020. In this study [15], an algorithm called Maximal Average Shift (MAS) was presented. Its operating principle consists in finding a pattern scan order that maximizes the average shift length. Two MAS extensions were also presented: the first optimizes the MAS scanning speed by means of the result of the analysis in the previous window, while the second optimizes its processing time by deploying q-grams. The results of this study revealed that these methods achieve better average scanning speed than previous string matching algorithms for DNA sequences.

III. STRING MATCHING ALGORITHMS
String matching algorithms play a key role in the analysis of biological sequences. They are divided into two categories: 1) exact string matching, whose algorithms, presented below, are used to find exact substring matches; 2) approximate matching, which attempts to find strings that approximately correspond to a given pattern. Algorithms such as Rabin-Karp [16], [17], Brute Force [18] and fuzzy string searching [19] are often used in this area of matching. In this section we give a brief overview of string matching algorithms, focusing on their space and time complexities. A comparative study of these algorithms is then described.
A. Description of the String Matching Algorithms
1) Smith-Waterman algorithm [2]: This algorithm was invented by Temple F. Smith and Michael S. Waterman in 1981. It is often used in DNA sequence alignment, especially for gene prediction, phylogeny or function prediction. Its operating principle is to produce the alignment with the best matching score between the nucleotides of the subject sequences. It relies on dynamic programming using similarity or substitution matrices. Alignment is accomplished by inserting "gaps" or "INDELs" into the reference or subject sequence in order to increase the number of matching characters between the two sequences. The preprocessing phase requires O(m + σ) time and O(σ) space, where σ is the alphabet size. The search phase of the algorithm requires quadratic time complexity.
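To make the recurrence concrete, here is a minimal Java sketch of the Smith-Waterman score computation; the scoring scheme (match = +2, mismatch = -1, gap = -1) is an illustrative assumption, not the one used in the benchmarked tools.

// Minimal Smith-Waterman local alignment score (illustrative scoring scheme).
public final class SmithWaterman {
    static final int MATCH = 2, MISMATCH = -1, GAP = -1; // assumed costs

    // Returns the best local alignment score between s and t.
    public static int score(String s, String t) {
        int[][] h = new int[s.length() + 1][t.length() + 1];
        int best = 0;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int sub = s.charAt(i - 1) == t.charAt(j - 1) ? MATCH : MISMATCH;
                int v = Math.max(0, h[i - 1][j - 1] + sub); // substitute, or restart (local)
                v = Math.max(v, h[i - 1][j] + GAP);         // gap (INDEL) in t
                v = Math.max(v, h[i][j - 1] + GAP);         // gap (INDEL) in s
                h[i][j] = v;
                best = Math.max(best, v);                   // track the maximum cell
            }
        }
        return best;
    }
}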
2) Needleman-Wunsch algorithm [1]: This algorithm is often used for the maximal global alignment of two character strings, especially protein or DNA sequences. The algorithm looks for the maximum-score alignment. This was the first application of dynamic programming to the comparison of biological sequences. The processing time to search for a pattern in a given text is O(mn).
3) Boyer-Moore algorithm [20]: Boyer-Moore is considered one of the most commonly used string matching algorithms in everyday applications. The operating principle of this approach is based on scanning the characters of the window from right to left, starting with the rightmost one. After a mismatch or a complete match, it deploys two precomputed functions to shift the window to the right, known as the matching shift (good suffix) and the occurrence shift (bad character). The worst-case time complexity of Boyer-Moore is O(mn).
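As an illustration, the following Java sketch applies only the occurrence shift (bad-character) rule; the full Boyer-Moore algorithm also uses the matching shift (good-suffix) rule, omitted here for brevity.

// Boyer-Moore style search using only the occurrence shift (bad-character) rule.
public final class OccurrenceShiftSearch {
    public static int indexOf(String text, String pat) {
        int n = text.length(), m = pat.length();
        int[] last = new int[256];                  // rightmost position of each character in pat
        java.util.Arrays.fill(last, -1);
        for (int i = 0; i < m; i++) last[pat.charAt(i)] = i;
        int s = 0;                                  // current shift of the window
        while (s <= n - m) {
            int j = m - 1;
            while (j >= 0 && pat.charAt(j) == text.charAt(s + j)) j--; // scan right to left
            if (j < 0) return s;                    // complete match at shift s
            s += Math.max(1, j - last[text.charAt(s + j)]);            // occurrence shift
        }
        return -1;
    }
}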
4) Turbo-Boyer-Moore algorithm [21]: The Turbo-BM algorithm is a variant of the Boyer-Moore algorithm. Unlike the original Boyer-Moore, this modified version requires no additional preprocessing and occupies only constant additional space. It consists in remembering the factor of the text that matched a suffix of the pattern during the last attempt, and this only when a good-suffix shift has been performed. The peculiarity of this improvement is that it makes it possible to perform a turbo shift by skipping that text factor. The time complexity of this algorithm is O(mn).

5) Tuned Boyer-Moore algorithm [22]: The Tuned Boyer-Moore is another variant of the Boyer-Moore algorithm, intended to increase processing speed. The principle of this approach is to optimize the verification of the match between the characters of the pattern and the characters of the window. To avoid redoing this verification, which is very expensive in terms of response time, this method performs several shifts before carrying out an actual character comparison; the order of the comparisons between the characters of the pattern and the text during each attempt is then no longer a constraint. The time complexity of this algorithm is also O(mn).
6) Brute force algorithm [18]: The brute force string matching algorithm is a classic alignment model which requires no preprocessing. This approach checks, at every position of the text, whether an occurrence of the pattern starts there. The search window is shifted by exactly one position to the right after each attempt, and the comparisons within a window can be performed in any order (left to right or right to left). The time complexity of the search phase is O(mn), with 2n expected text character comparisons.
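This description translates directly into a few lines of Java:

// Brute force matcher: tests the pattern at every position of the text.
public final class BruteForce {
    public static int indexOf(String text, String pat) {
        int n = text.length(), m = pat.length();
        for (int s = 0; s + m <= n; s++) {          // window shifted one position to the right
            int j = 0;
            while (j < m && text.charAt(s + j) == pat.charAt(j)) j++;
            if (j == m) return s;                   // occurrence found at shift s
        }
        return -1;                                  // no occurrence
    }
}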
7) Deterministic Finite Automaton algorithm [23]: This algorithm searches for a given pattern using a finite state automaton. Each prefix of the pattern has a state, and each matched character sends the automaton to a new state. After all the characters of the pattern have been matched, the automaton reaches the accepting state. On a mismatch, the automaton falls back to the appropriate state determined by the current state and the character read. This algorithm has a time complexity of O(n), since each character of the text is examined exactly once, and it reports all valid shifts.
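A compact Java sketch of this idea follows; the restriction to the DNA alphabet A, C, G, T and the naive construction of the transition table are simplifying assumptions made for readability.

// String matching with a deterministic finite automaton over the DNA alphabet.
public final class DfaMatcher {
    static final String SIGMA = "ACGT";             // assumed alphabet (text must use it too)

    public static int indexOf(String text, String pat) {
        int m = pat.length();
        int[][] delta = new int[m + 1][SIGMA.length()];
        // delta[q][c]: longest prefix of pat that is a suffix of pat[0..q-1] followed by c.
        for (int q = 0; q <= m; q++)
            for (int c = 0; c < SIGMA.length(); c++) {
                String w = pat.substring(0, q) + SIGMA.charAt(c);
                int k = Math.min(m, w.length());
                while (k > 0 && !w.endsWith(pat.substring(0, k))) k--;
                delta[q][c] = k;
            }
        int q = 0;
        for (int i = 0; i < text.length(); i++) {
            q = delta[q][SIGMA.indexOf(text.charAt(i))]; // each character examined once
            if (q == m) return i - m + 1;                // accepting state reached
        }
        return -1;
    }
}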
8) Karp-Rabin algorithm [17]: The Rabin-Karp algorithm computes a numeric value (hash) for the pattern p and for each substring of m characters of the text. It then compares numeric values instead of comparing the actual symbols. When a hash match is detected, the pattern is compared to the substring by the naive approach. Otherwise, it moves on to the next substring of the sequence to compare with p. The hashing method deployed in this algorithm provides a simple process that avoids a quadratic number of character comparisons in most practical situations. The expected time complexity of the algorithm is O(m + n).
9) Knuth-Morris-Pratt algorithm [24]: This algorithm was developed by Knuth, Morris and Pratt as the first linear-time matching algorithm, based on the analysis of the naive algorithm. The Knuth-Morris-Pratt algorithm preserves the information that the naive approach discards during the scan of the text. By avoiding this waste of information it achieves a time complexity of O(m + n). The use of this algorithm is effective because it minimizes the total number of comparisons of the pattern with the input string.
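The hashing mechanism of Karp-Rabin can be sketched as follows in Java; the base and modulus are arbitrary illustrative choices.

// Rabin-Karp: compares rolling hashes first, characters only on a hash hit.
public final class RabinKarp {
    static final long B = 256, Q = 1_000_000_007L;  // assumed base and modulus

    public static int indexOf(String text, String pat) {
        int n = text.length(), m = pat.length();
        if (m == 0 || m > n) return -1;
        long hp = 0, ht = 0, pow = 1;
        for (int i = 0; i < m - 1; i++) pow = pow * B % Q;        // B^(m-1) mod Q
        for (int i = 0; i < m; i++) {
            hp = (hp * B + pat.charAt(i)) % Q;                    // hash of the pattern
            ht = (ht * B + text.charAt(i)) % Q;                   // hash of the first window
        }
        for (int s = 0; s + m <= n; s++) {
            if (hp == ht && text.regionMatches(s, pat, 0, m)) return s; // verify on hash match
            if (s + m < n)                                              // roll the hash
                ht = ((ht - text.charAt(s) * pow % Q + Q) * B + text.charAt(s + m)) % Q;
        }
        return -1;
    }
}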
10) Reverse Colussi algorithm [25]: The Reverse Colussi string matching algorithm is another Boyer-Moore derivative. This algorithm partitions the positions of the pattern into two disjoint subsets. The comparison of characters is carried out in a specific order declared in a matrix. The preprocessing step requires O(m²) time, while the search phase runs in O(n) time, performing at most 2n character comparisons in the worst case.
11) Apostolico-Giancarlo algorithm [26]: The Boyer-Moore algorithm is difficult to analyze because, after each attempt, it forgets the characters already matched. To remedy this, Apostolico and Giancarlo designed an algorithm that records, at the end of each attempt, the length of the longest suffix of the pattern matching the text that ends at the right position of the window. The space and time complexity of the preprocessing of this algorithm is similar to that of Boyer-Moore, O(m + σ). During the search phase, only the last m entries of the skip table are needed for each attempt, so the size of the skip table can be reduced to O(m). The Apostolico-Giancarlo algorithm performs at most (3/2)n text character comparisons in the worst case.
12) Raita algorithm [27]: Raita's algorithm was introduced by Timo Raita in 1992. The preprocessing phase of the Raita algorithm consists in computing the bad-character shift function (as in Boyer-Moore). It first compares the last character of the pattern with the rightmost text character of the window; in case of a match, it then compares the first character of the pattern with the leftmost text character of the window. If this also matches, it compares the middle character of the pattern with the middle text character of the window. Finally, if that matches, it compares the remaining characters from the second to the penultimate, possibly comparing the middle character again. The preprocessing phase requires O(m + σ) time and O(σ) space, while the search phase has quadratic worst-case time complexity.
13) Reverse Factor algorithm [28]: The Reverse Factor algorithm uses the smallest suffix automaton of the reversed pattern to match prefixes of the pattern by scanning the characters of the window from right to left, thereby improving the shift length. The preprocessing phase, during which the algorithm computes the smallest suffix automaton of the reversed pattern, is linear in time and space. During the search phase, the Reverse Factor algorithm reads the characters of the window from right to left until no transition is defined for the current character from the current state of the automaton. At this point, the length of the longest matched prefix of the pattern is known. The Reverse Factor algorithm requires quadratic time in the worst case, but is optimal on average: it performs O(n·log(m)/m) inspections of text characters on average.
14) Berry-Ravindran algorithm [29]: This algorithm performs its shifts by applying the bad-character rule (of the Boyer-Moore algorithm) to the two consecutive text characters immediately to the right of the window. In the preprocessing phase, which requires O(m + σ²) time and space, the algorithm computes, for each pair of characters (a, b) with a, b in Σ, the rightmost occurrence of ab in the pattern. The search phase of the Berry-Ravindran algorithm has O(mn) worst-case time complexity.
15) Aho-Corasick algorithm [30]: The Aho-Corasick algorithm falls into the category of dictionary matching algorithms, because it locates the elements of a finite set of strings (the "dictionary") in an input text. This is achieved by matching all the strings simultaneously. Both the preprocessing phase and the search phase require O(m + n) complexity.
16) Alpha Skip Search algorithm [31]: This algorithm uses buckets of positions for each factor of length log_σ(m) of the pattern. The preprocessing phase requires O(m) time and space; it is linear provided the alphabet size is considered constant. The worst-case time complexity of the search phase is quadratic, but the expected number of text character comparisons is O(log_σ(m) · n / (m − log_σ(m))).

B. Comparative Study of the String Matching Algorithms
Iji and Mahalakshmi [12] performed a comparative study of the aforementioned algorithms (Table I) using the sequence JN222368 (GenBank), with a size of 3481 characters. For larger or smaller sequences the procedure and the accuracy results do not change, in contrast to the execution time, which depends proportionally on the size of the sequences. The tests were conducted using the online tools EMBOSS and GENE Wise.
The results of this study revealed that the Boyer-Moore (BM) string matching algorithm provides the highest accuracy, 83%, with an execution time of about 84 ms. The Reverse Colussi (RC) string matching algorithm provides the shortest execution time (≈57 ms), with an accuracy of 79%. To assess the performance of the proposed approach, DTC was tested against the two best algorithms, BM and RC. The experimental results of this test are presented in Section V.

IV. DTC ALGORITHM
Unlike the aforementioned algorithms, which attempt to find a point-by-point correspondence between strings, the DTC approach addresses this problem in its entirety by superimposing the discrete representation of the test points on the continuous representation of the reference points. In this section, the operating principle of the algorithm is presented in detail. For this application, SF and F respectively denote the query DNA sequence and the reference sequence. These sequences are composed of nucleotides, represented by the letters T, C, A and G.
The alignment attempts to find the correspondences between these two point clouds. In this work, we propose an implementation of DTC as a method of DNA sequence alignment according to mathematical metrics. Each nucleotide of F and SF is represented by an abscissa (its position in the sequence) and an ordinate (a code corresponding to the type of the nucleotide). For our application, the nucleotides were assigned the following codes: A = 200, C = -200, T = 400 and G = -400.
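For concreteness, a minimal Java sketch of this encoding, using the codes given above, is shown below (the class and method names are ours):

// Encodes a DNA sequence as a 2D point cloud: x = position, y = nucleotide code.
public final class NucleotideCloud {
    static int code(char nucleotide) {
        switch (nucleotide) {
            case 'A': return 200;
            case 'C': return -200;
            case 'T': return 400;
            case 'G': return -400;
            default:  throw new IllegalArgumentException("unknown nucleotide: " + nucleotide);
        }
    }

    // points[i] = {x_i, y_i} for the i-th nucleotide of the sequence.
    static double[][] toPoints(String seq) {
        double[][] pts = new double[seq.length()][2];
        for (int i = 0; i < seq.length(); i++) {
            pts[i][0] = i;                    // abscissa: position in the sequence
            pts[i][1] = code(seq.charAt(i));  // ordinate: nucleotide code
        }
        return pts;
    }
}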
Generally, to decide whether the form SF is included in the form F (Fig. 1), one starts by finding a point-by-point correspondence between SF and F and then possibly looks for a transformation that would superimpose SF on F.
If one opts instead for a direct search for the transformation T (which would achieve the superposition of SF on F) without any prior knowledge of the correspondence between the points of SF and those of F, the search may be very costly in execution time, because of the large number of possibilities to test (combinatorial explosion).
It should be noted that the existence of such a transformation T would obviously confirm the inclusion of SF in F. In this respect, the DTC algorithm has the ultimate goal of finding the transformation T in order to confirm the inclusion of SF in F.
Recall that the origin of the difficulty (combinatorial explosion) is the discrete nature of the point clouds to be processed. The solution proposed by DTC to avoid this problem is to move from the discrete representation to a continuous representation of one of the entities (F).
In this case, with a continuous representation of F by polynomial interpolation (SF being retained in its discrete representation), the problem of deciding whether SF is included in F becomes the search, not for T, but first for a transformation T' that would bring SF onto the continuous representation of F. The existence of T' could then induce the probable existence of T, and would therefore confirm the inclusion of SF in F.
In other words, the algorithm avoids a direct search for T and instead searches for a transformation T' that ensures the superposition of SF on the continuous representation of F.
It should be noted that the existence of such a transformation T' is necessary but not sufficient to confirm the (total or partial) inclusion of SF in F. Indeed, the transformation T' (if it exists) must bring SF onto the continuous representation of F; and if, for each point p of SF, there exists a point q of F such that T'(p) = q, then T' = T and SF is totally or partially included in F.
The DTC algorithm is designed to handle arbitrary models defined by point clouds in an N-dimensional (ND) space. In our application, the point clouds are considered in a two-dimensional (2D) space.
The points of F are given in the plane.
Let P be the interpolation polynomial in the plane. Then, for each point p = (x_p, y_p) belonging to F, we have y_p = P(x_p). This representation will be called (R).
Several interpolation methods can be used to build the representation (R). In order for the degree of the polynomial not to depend on the size of the point cloud F, the DTC algorithm uses "cubic spline" interpolation, a succession of third-degree polynomials (piecewise interpolation) that also ensures continuity and differentiability over the entire interpolation interval (Fig. 2).
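A possible Java sketch of this step is given below. It builds a natural cubic spline over the unit-spaced abscissas produced by the encoding above; the natural boundary conditions are our assumption, the paper not specifying which spline variant is used.

// Natural cubic spline through the points (i, y[i]), unit spacing (assumes >= 2 points).
public final class CubicSpline {
    final double[] y, m;                       // knot values and second derivatives

    CubicSpline(double[] y) {
        int n = y.length - 1;
        this.y = y;
        this.m = new double[n + 1];            // natural conditions: m[0] = m[n] = 0
        double[] c = new double[n + 1], d = new double[n + 1];
        for (int i = 1; i < n; i++) {          // forward sweep of the tridiagonal system
            double w = 4.0 - c[i - 1];
            c[i] = 1.0 / w;
            d[i] = (6.0 * (y[i + 1] - 2 * y[i] + y[i - 1]) - d[i - 1]) / w;
        }
        for (int i = n - 1; i >= 1; i--)       // back substitution
            m[i] = d[i] - c[i] * m[i + 1];
    }

    // Evaluates the spline at abscissa x (clamped to the interpolation interval).
    double value(double x) {
        int i = Math.max(0, Math.min(y.length - 2, (int) Math.floor(x)));
        double t = x - i, u = 1 - t;
        return m[i] * u * u * u / 6 + m[i + 1] * t * t * t / 6
             + (y[i] - m[i] / 6) * u + (y[i + 1] - m[i + 1] / 6) * t;
    }
}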

Search for transformation T':
The purpose of the sought transformation T' is to bring the cloud SF back onto the cubic interpolation of the form F in the plane.
The desired transformation T' is expressed in homogeneous coordinates.
The parameters of the transformation T' described in DTC include, for each axis, a scale factor and a translation; in the plane they are:
s_x: scale factor along the Ox axis.
s_y: scale factor along the Oy axis.
t_x, t_y: translations along the Ox and Oy axes.
Since in this work we deal with the alignment of biological sequences, the transformation T' reduces to a function with a single parameter, the translation t_x along the Ox axis (Fig. 3).
The representation (R) can thus be written, for a point p = (x_p, y_p) of SF brought back by T', as y_p = P(x_p + t_x). Applied to all the points (x_i, y_i) of SF, consider QT as the following expression:

QT(t_x) = Σ_i (y_i − P(x_i + t_x))²

Based on this definition, we now look for the parameter of the transformation T' which minimizes the QT function.
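As an illustration, the sketch below evaluates QT and minimizes it over t_x with a golden-section search, reusing the CubicSpline sketch above; the choice of this one-dimensional minimizer (and its unimodality assumption) is ours, as the paper does not state which numerical method is used.

// Illustrative minimization of QT(tx) = sum_i (y_i - P(x_i + tx))^2 over the translation tx.
public final class QtMinimizer {
    static double qt(double[][] sf, CubicSpline p, double tx) {
        double s = 0;
        for (double[] pt : sf) {
            double r = pt[1] - p.value(pt[0] + tx);  // residual of one SF point
            s += r * r;
        }
        return s;
    }

    // Golden-section search for the tx minimizing QT on [lo, hi] (unimodality assumed).
    static double bestShift(double[][] sf, CubicSpline p, double lo, double hi) {
        final double g = (Math.sqrt(5) - 1) / 2;
        double a = lo, b = hi;
        while (b - a > 1e-6) {
            double x1 = b - g * (b - a), x2 = a + g * (b - a);
            if (qt(sf, p, x1) < qt(sf, p, x2)) b = x2; else a = x1;
        }
        return (a + b) / 2;
    }
}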
The obtained function QT is non-linear, continuous and differentiable.
After the step of adjusting the points of SF on the continuous representation of F (defined by T'), we associate each point of SF with its isomorph, i.e. its nearest neighbor in F according to a chosen distance and a predetermined threshold ε (Fig. 4(a)).
Each point q_i of F associated in this way is the isomorph of the point p_i of SF under T' (Fig. 4(b)). The Root Mean Square (RMS), which is used to measure the global precision of the superposition of SF on F, is:

RMS = sqrt( (1/n) Σ_i d(T'(p_i), q_i)² )

At this stage, since the isomorphs of the points of SF in F are known, it would be possible, if necessary, to refine the superposition.
The RMS is also formulated as a non-linear function. Generally, the transformation T' thus found directly provides T, and therefore the minimizer of the RMS is already at hand (Fig. 4).
Nevertheless, some refinements can be made according to a predefined RMS threshold. After determining the parameter minimizing the RMS, only the points of SF whose distance to their isomorphs in F is less than a threshold ε fixed in advance are retained. If none of the distances between the points of SF and their isomorphs in F exceeds ε, then SF is declared included in F. If SF is declared not included in F, the DTC algorithm can compute the Largest Common Point Set (LCP) between the two sequences: the isomorphic pairs that do not respect the predefined threshold ε are eliminated, and a new refinement of the RMS takes place for a better superposition of the remaining points of SF on the points of F.
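A hedged Java sketch of this thresholding and refinement step follows; representing the matching as an array of point-to-isomorph distances is our simplification.

// Inclusion test and LCP-style refinement over point-to-isomorph distances.
public final class Superposition {
    // dist[i] = distance from the i-th SF point to its isomorph (nearest neighbor) in F.
    static boolean included(double[] dist, double eps) {
        for (double d : dist)
            if (d > eps) return false;        // one distance above the threshold breaks inclusion
        return true;
    }

    // RMS recomputed over the pairs kept after eliminating those exceeding eps.
    static double refinedRms(double[] dist, double eps) {
        double sum = 0;
        int kept = 0;
        for (double d : dist)
            if (d <= eps) { sum += d * d; kept++; }
        return kept == 0 ? Double.NaN : Math.sqrt(sum / kept);
    }
}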

V. EXPERIMENTAL RESULTS
As mentioned in Section III, a survey of exact string matching algorithms for motif detection in biological sequences was performed; the results of this comparative study are shown in Table I. In that study, the aforementioned algorithms were analyzed in terms of time complexity, response time and accuracy using online tools such as EMBOSS and GENE Wise. The sequence studied in this work is JN222368 from GenBank, belonging to a marine sponge. The results revealed that the Boyer-Moore (BM) string matching algorithm provided the highest accuracy, 83%, with a run time of about 84 ms, while the Reverse Colussi (RC) string matching algorithm provided the shortest execution time (≈57 ms), with an accuracy of 79%. These results sparked our interest in implementing the DTC algorithm and testing it in terms of execution time and accuracy. To do this, the test environment and conditions were unified by downloading a partial GenBank database containing 7682 sequences of different sizes, including our sequence of interest, JN222368. Subsequently, the DTC algorithm and the two reference algorithms, Boyer-Moore and Reverse Colussi, were implemented in the Java programming language. The machine used was a 2.40 GHz Intel Core i7 processor with 8 GB of RAM. Table II presents the results of the comparison of the DTC approach with the aforementioned algorithms.
Here m represents the size of F and n represents the size of SF. For this type of test, the three algorithms have an accuracy of 100%. This is an excellent result for alignment in restricted databases, for example for deployment in mutation prediction software stored on a local server, which would make it possible to compare stored sequences with newly sequenced genomes. However, to increase the challenge in terms of accuracy, it would be wise to perform large-scale (Big Data) tests by accessing online databases.
As far as time complexity is concerned, the proposed algorithm has a logarithmic complexity O(m log(n)), unlike BM, O(m + n), and RC, O(m²). This explains the reduced response time of the DTC approach (42 ms) compared to the BM (74 ms) and RC (51 ms) algorithms, and means that our algorithm is faster than all the algorithms that were the subject of the aforementioned study. Another factor behind this performance in terms of processing time is the possibility of storing the interpolations of the reference sequences, which saves preprocessing time. To illustrate the time complexity of DTC, an additional test was performed: it consists in aligning the form SF on the reference form F (JN222368) for different sizes of SF. The response time and the success rate of this test are shown in Table III. For the different sizes of SF, the response time follows a logarithmic evolution (Fig. 5).
This logarithmic complexity makes DTC more efficient in terms of response time when processing long sequences. As the results of this test show, the processing time for 600 characters is the same as for 3481 characters (15 ms).
Contrary to our algorithm, BM and RC fail to detect mutations and gaps. To determine the performance of DTC in detecting mutations and gaps, tests were performed on the same sequence (of size 3481) by simulating mutations (Table IV) and gaps (Table V). The search was carried out, as indicated above, in a database containing 7682 other sequences of different sizes.
Mutation test: To simulate mutations, nucleotide modifications were made (from 5% to 60% of the nucleotides modified) on our sequence of interest. The results of this test are shown in Table IV.
The results of this test revealed that, up to a mutation rate of 60%, the algorithm remains insensitive to mutations, and the variation of the response time remains marginal despite the considerable change in the mutation rate. The 60% limit was chosen because, beyond this rate of change, it would no longer be a mutation but another problem, for which the algorithm provides another solution.
Gap test: A DNA sequence is represented as a succession of the letters A, C, G and T. To simulate gaps, some SF nucleotides are replaced by a letter x (unknown) which, in F, would be isomorphic to A, C, G or T. The gap phenomenon is often encountered in sequence alignment, and some algorithms (such as Boyer-Moore and Reverse Colussi) find it difficult to handle.
The DTC approach remains very efficient in terms of gap treatment, because any gap corresponds to a reduction in the size of the SF sub-sequence, which reduces the processing time as the gap rate increases.
As shown in Table V, processing the sequence with 5% gaps took longer (25 ms) than with 60% gaps (18 ms).
This demonstrates the performance of DTC in handling this phenomenon, which is often encountered during DNA sequencing.

VI. CONCLUSION
The analysis and interpretation of DNA sequences is essential for determining the functional and structural relationships of said sequences. To this end, software based on intelligent algorithms has been made available to the scientific community. In this work, we have presented the DTC algorithm, based on polynomial interpolation, and compared it with algorithms commonly used in this field, namely Boyer-Moore and Reverse Colussi. The peculiarity of the DTC algorithm is that it ensures both exact and approximate string matching with a very short response time, avoiding preprocessing time by storing the interpolations of the reference sequences. The satisfactory results of the proposed approach encourage us to develop our own software for the detection of gene mutations predisposing to various genetic diseases.
On the other hand, it is possible to apply the DTC algorithm to facilitate and accelerate metagenomic analysis as a tool for rapid and precise diagnosis or prognosis of cancers. The metagenomic approach makes it possible to gain a fairly precise understanding of the molecular mechanisms at work in the emergence and progression of cancer. The application of DTC would make it possible to identify relevant bioindicators/biomarkers (bacterial taxa/genes or metabolic profiles) in order to propose population-specific diagnostic and therapeutic approaches. The expected speed of DTC, combined with the "exhaustive" aspect of metagenomics, would also allow the discovery of new genes or biotechnological functions.