An Efficient Computational Method of Motif Finding in the Zika Virus Genome

The Zika virus (ZIKV) outbreak and its spread were declared a global health emergency by the World Health Organization. ZIKV spread rapidly across the world, causing neurological disorders, and it is gaining public and scientific attention. Published studies have improved the understanding of ZIKV genome biology and molecular structure. Genetic regulation is better understood by finding motifs in the genome sequence: transcription factor binding sites need to be identified to understand how the diversity in gene expression arises. Motif-finding methods aim to identify repeated patterns in the genome efficiently. The ZIKV genome sequence is used in this study. Identifying motifs is still a difficult task because the probability of identifying the binding sites is low. Finding all possible solutions is challenging, as it requires a lot of time and has high space complexity for long motifs. The Greedy search technique with pseudocount finds the motif in real time. The count matrix is computed, and the profile matrix is constructed from the genome of the Zika virus. The calculated consensus string is used to calculate the score of the motif. The Greedy motif search technique is applied in this paper to find motifs in the Zika virus genome; this technique has not previously been applied to the Zika virus. The motifs are identified using the Greedy motif search both without pseudocount and with pseudocount.


I. INTRODUCTION
The Zika virus (ZIKV) was first isolated from a rhesus macaque in Uganda in 1947. The study of ZIKV received little attention until 2015, when a large epidemic of ZIKV infections in Brazil was linked to an increase in microcephaly cases. ZIKV can spread sexually and is known to persist in the male and female reproductive systems. West Nile virus (WNV), Dengue virus (DENV), Yellow fever virus (YFV), and Japanese encephalitis virus (JEV) belong to the family Flaviviridae, genus Flavivirus. These are mosquito-borne pathogens, and the Zika virus belongs to the same family. The ZIKV transmission cycle involves Aedes mosquitoes and monkeys, whereas WNV is transmitted by a broader range of mosquito species.
The World Health Organization (WHO) has acknowledged the Zika virus (ZIKV) as a worldwide public health crisis. ZIKV is a flavivirus of the family Flaviviridae. It has spread over several parts of Africa, Southeast Asia, and the Pacific islands. As of August 2016, the ZIKV outbreak in Brazil was the largest ever recorded, with an estimated 165,000 suspected cases. ZIKV circulates among monkeys and mosquitoes of the genus Aedes [1]. ZIKV infection causes mild illness, headache, and rash. A recent study suggested that the virus causes neurological disorders such as Guillain-Barre syndrome. ZIKV is transmitted from mother to child and is also transmitted sexually. Its symptoms are the same as those of other arboviral diseases such as DENV, so diagnosis based on symptoms is unreliable for precise detection, and laboratory analysis is vital to obtain conclusive results. A suitable choice of molecular analysis is important for routine ZIKV or flavivirus detection. No specific treatment for the infection is available. Vaccine and drug development is extremely complicated, and anti-ZIKV vaccine development may take several years. Traditional drug development strategies make the situation worse. In silico methods are helpful in identifying possible vaccine candidates.
The fast rise of ZIKV cases and its consequences are severe. The disease has prompted the research community to develop interventions to battle it. The disease mechanism is not yet understood. Reviewing and analyzing ZIKV genome ecology and pathogenesis can provide insight into ZIKV. MicroRNAs are also found to play a significant part in virus-related diseases and in the activation of the immune response. A methodical genome-wide study of the ZIKV genome may assist in designing antiviral therapeutics. The Zika virus has a 10.7 kb genome of single-stranded RNA. Genomic data produced during the spread of the infection and available in public databases were combined in a multiscale examination. The focus of that study was to inspect virus-related patterns at different scales [2] [3].
The research problem is understanding the regulatory mechanism of genes. This involves finding transcription factor binding sites. Identifying regulatory networks is also challenging. The objective of the research is to find the regulatory relations of genes, which are better understood by identifying motifs. Motifs are known to have complex forms. An essential class of motifs is spaced motifs, which consist of short segments separated by gaps of varying length. Locating motifs is a difficult task. Existing algorithms identify short contiguous motifs, whereas better algorithms identify spaced motifs with a varying number of spaces in between [4]. This research is therefore significant, as it will help in identifying spaced motifs with a mutation at one or more places in the genome.
In this research paper, a literature review is conducted on various studies of the Zika virus. The studies cover phylogenetic analysis, ZIKV circulation in different countries, RNA-protein interaction, viral RNA targets, the host cell-binding mechanism of ZIKV, motif finding and genetic behaviors of ZIKV, ZIKV infection, the role of E-protein amino acids, and many other aspects of the Zika virus. The count matrix and profile matrix are calculated on the dataset of the Zika virus genome with pseudocount. The Greedy motif search method is applied to the dataset, and results are tabulated with and without pseudocount.
*Corresponding Author. www.ijacsa.thesai.org

II. LITERATURE REVIEW
The Zika virus has become a worldwide health problem, as it is linked to potential congenital disabilities. The virus was discovered 70 years ago; still, its genomic organization and genetic variation are not fully known. Its genome structure has been compared with those of other flaviviruses, and structural and functional similarity is found between them, particularly in the conserved terminal structures. From this genomic comparison, it is concluded that the Zika virus shares a high structural and functional similarity with other viruses of the Flaviviridae family. The prediction of motifs in the viral proteins of the Zika virus and of other viruses also shows similarities. All Zika virus strains in America are similar to the strains in Asia. Some conserved amino acids differentiate earlier African strains from Asian and American strains. These studies provide clues for the study of other viruses.
The Zika virus has spread to more than fifty nations of the world. Its evolution and spread are understood by studying replication of the genome. Zika infection happens when an infected mosquito bites a person, and it is also transmitted from one person to another by other routes. The ZIKV molecule needs to be studied to understand the infection in detail, which can also help in finding a solution to the problem. ZIKV infection affects the growth of neural progenitor cells in monkeys [5] [6], and the same happens in humans. The virus damages the DNA of humans and monkeys and initiates DNA damage responses, which are then attenuated. The growth cycle of the virus is examined to determine its behavior at different times of the day, since the virus may behave differently during the day and during the night.
A different mutation of the Zika virus is reported in America; it differs from the mutations found in Asia and Africa. The Aedes aegypti mosquito was studied with the mutation, and the results were compared with studies performed without it. Fitness is increased for the new mutations, which raises the risk of spread of the virus, and is reduced for the original mutations. The Zika virus infects immunocompetent adults, and antiviral responses precipitate and increase brain damage.
The immune system of mice was studied, and the modifications induced by the Zika virus were identified. A significant decline in micro-organisms of the phyla Actinobacteria and Firmicutes was found following Zika virus infection, while an increase in Spirochaetes and Deferribacteres was prominent in infected mice compared to healthy mice. This modulation of the microbiota, induced by the Zika virus, caused an increase in white blood cells. Birth defects are caused by in utero exposure to the Zika virus. The ill effects of the virus were not prominent in the early stages after birth.
The researchers concluded that the effects would become clear if toddlers were followed until twenty-four months of age, and that women who became pregnant during 2016, the time of the Zika virus outbreak in America, should be studied. The study was conducted on the neurodevelopmental stages of toddlers from birth to twenty-four months. The behavior of the child was normal before and after birth, but the activities of the body seemed to be reduced as the child started growing, and the child started showing symptoms of birth defects. The connection of these birth defects to Zika virus infection was identified later. No vaccine was available for the disease, treatment of the condition was not possible, and people were not aware of the infection. A vaccine named VacDZ was produced using the dengue vaccine as a reference [7]. It elicits immune responses to the Zika virus and appears to show positive results in mice infected with the Zika virus. Blood samples were collected in Brazil to study the impact and spread of the virus. The people of Brazil were affected by three viruses: Zika, chikungunya, and dengue, which made it challenging to handle the situation and to trace the spread of the virus [8]. The review of the literature is presented in Table I. TBEV might, to some extent, escape interferon- and IFITM-mediated restriction during high-density co-culture infection. Cell-to-cell spread may be a way for viruses to escape from innate host defenses.
From the detailed literature review, it is identified that different categories of algorithms have been applied to find motifs in DNA sequences. A limitation of previous research is that the Greedy motif search, without pseudocount and with pseudocount, has not been applied to find motifs in the Zika virus [16]. Heuristic algorithms worked for solving combinatorial problems, but they did not work for large datasets [17]. The greedy motif search algorithm has been proposed to find motifs in RNA sequences [18]. A greedy mixture learning technique has been proposed for recovering already known motifs in real proteins using the PRINTS datasets [19]. Time series data of different lengths have been aligned and joined using the Greedy search technique [20]. The Greedy search algorithm has been used to discover motifs in hm03r, yst04r, and yst08r, and the results show that the algorithm is effective in finding motifs in these DNA sequences [21]. DNA motif discovery has been performed using the Greedy motif search method on the datasets GATA1, SOX2, OCT4, STAT3, and KLF1 [22]. The research gap identified is therefore that the Greedy search technique for motif finding has not been applied to the genome of the Zika virus.

A. Data Collection: Zika virus genome
ZIKV genome data from publicly available databases are collected and used in this study. The Zika virus genome sequence is available at NCBI [23]. ZikaVirus.fasta is the file name of the dataset. FASTA is a text file format in which the genome is stored as a nucleotide sequence. The size of the dataset is 10.7 KB.
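Reading such a file can be sketched as follows. This is a minimal illustration that assumes a single-record FASTA file; the function name `read_fasta_sequence` and the tiny demo record are hypothetical and are not taken from the study.

```python
from typing import Iterable


def read_fasta_sequence(lines: Iterable[str]) -> str:
    """Concatenate the sequence lines of a single-record FASTA file.

    Header lines (starting with '>') and blank lines are skipped.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Made-up record for illustration only (not the ZIKV genome).
example = [">demo record", "ACGT", "ttga", ""]
sequence = read_fasta_sequence(example)
print(len(sequence))
```

With the actual dataset, the same function could be called on an open file handle, for example `with open("ZikaVirus.fasta") as fh: genome = read_fasta_sequence(fh)`, after which `len(genome)` gives the sequence length.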

B. Gene Representations
RNA is formed from DNA, and RNA is in turn translated into proteins. Four ribonucleotides form the RNA, namely Adenine, Cytosine, Guanine, and Uracil; RNA has Uracil where DNA has Thymine. Amino acid sequences of proteins are formed from RNA transcripts [9]. Proteins regulate the function of the cell. Ori is the origin of replication, so DNA replication starts at ori. Ori has some specific properties, and biologists find it difficult to identify its position. Other complex processes happening inside the cell are transcription and translation. During transcription, the Thymine (T) that occurs in DNA is replaced by Uracil (U) in the RNA. The amino acid sequence is then formed from the RNA. A total of twenty different amino acids are encoded by RNA [13]. Each amino acid is specified by three nucleotides, called a 3-mer or codon, and each combination of these 3-mers maps to an amino acid according to the genetic code. Through transcription, different genes form RNA, and different genes may be transcribed at different rates. This property is called gene expression. Due to gene expression, cells in different parts of the body of a living being behave differently. Brain cells behave differently compared to skin cells; they differ in features and functionality. Cells of different types also know how to keep track of time. Variation in cell functionality is known to occur in people infected with ZIKV. Pro-inflammatory reactions are prominent in women infected with the Zika virus.
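The replacement of T by U and the grouping of an RNA string into codons can be illustrated with a short sketch. The function names `transcribe` and `codons` and the example string are illustrative assumptions; the sketch treats the input as a coding-strand DNA string and does not map codons to amino acids.

```python
def transcribe(dna: str) -> str:
    """Return the RNA form of a DNA coding strand (every T replaced by U)."""
    return dna.upper().replace("T", "U")


def codons(rna: str) -> list[str]:
    """Split an RNA string into consecutive 3-mers, dropping an incomplete tail."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


rna = transcribe("ATGGCTTAA")
print(rna)          # AUGGCUUAA
print(codons(rna))  # ['AUG', 'GCU', 'UAA']
```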

IV. COMPUTATION OF MOTIF IN ZIKA VIRUS GENOME
The Zika virus genome nucleotide sequence has a length of 10780 nucleotides; the length is calculated programmatically. Based on the profile matrix of the Zika virus, the probability of a string can be calculated. A regulatory protein binds to a specific short DNA sequence, the regulatory motif, and thereby regulates the gene. The binding site generally lies in the upstream region of the gene and is important to identify. A method to identify the motif is therefore useful for gene study.

A. Importance of Motif
Motifs are important to identify and study. Motifs are short sequences of finite length in DNA. Sequence motifs are used to represent transcription factor binding sites, and transcriptional regulation is better understood with motif sequences. Active sites of enzymes and proteins are also characterized by motifs. The individual instances of a motif are scored against the ideal motif. The ideal motif is not known and can only be predicted. To recognize a motif, a list of t strings is created, each of length n, and a k-mer is selected from each string. The selected k-mers form a motif collection, which is written as a t × k motif matrix, and each collection is scored based on how many nucleotides are identical within its columns. From the t × k matrix, the nucleotides are counted column by column and stored in an array. There are four types of nucleotides, so the array has four rows: the first row represents nucleotide A, the second C, the third G, and the fourth T. The columns are then examined to find the nucleotide with the highest count. In each column, the nucleotide with the highest count is written in uppercase and the others in lowercase. Different motif matrices are formed from the DNA strings using different values of k. The aim is to obtain the most conserved motif matrix, that is, the matrix with the most capital letters. Equivalently, the collection of k-mers with the minimum score is to be obtained.
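The uppercase/lowercase view of a motif matrix and its score can be sketched as below. The helper names `mark_conserved` and `score` and the three toy 4-mers are illustrative assumptions and not data from the Zika virus genome.

```python
from collections import Counter


def mark_conserved(motifs: list[str]) -> list[str]:
    """Uppercase the most frequent nucleotide of each column, lowercase the rest."""
    rows = [m.upper() for m in motifs]
    k = len(rows[0])
    # Most common nucleotide per column (a tie goes to the one seen first).
    top = [Counter(row[j] for row in rows).most_common(1)[0][0] for j in range(k)]
    return [
        "".join(ch if ch == top[j] else ch.lower() for j, ch in enumerate(row))
        for row in rows
    ]


def score(motifs: list[str]) -> int:
    """Count the lowercase (non-conserved) letters in the marked motif matrix."""
    return sum(ch.islower() for row in mark_conserved(motifs) for ch in row)


toy_motifs = ["ACGT", "ACGA", "TCGT"]  # made-up k-mers, t = 3, k = 4
print(mark_conserved(toy_motifs))  # ['ACGT', 'ACGa', 'tCGT']
print(score(toy_motifs))           # 2
```

A lower score means fewer lowercase letters, that is, a more conserved matrix.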

B. Calculating the Count Matrix
A count matrix is formed for the motifs. It is a 4 × k matrix, denoted Count(Motifs), that holds the number of occurrences of each nucleotide in each column. The element (i, j) is the number of times nucleotide i appears in column j of the motif matrix. The count matrix is used for the calculations in the next steps [4].
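A minimal sketch of the count matrix follows, assuming the motifs are equal-length strings over A, C, G, and T. The name `count_matrix`, the dictionary representation (one row per nucleotide), and the toy motifs are illustrative choices.

```python
def count_matrix(motifs: list[str]) -> dict[str, list[int]]:
    """Return the 4 x k count matrix as a dict: nucleotide -> per-column counts."""
    k = len(motifs[0])
    counts = {nt: [0] * k for nt in "ACGT"}
    for row in motifs:
        for j, nt in enumerate(row):
            counts[nt][j] += 1
    return counts


toy_motifs = ["ACGT", "ACGA", "TCGT"]  # made-up k-mers, not ZIKV data
counts = count_matrix(toy_motifs)
for nt in "ACGT":
    print(nt, counts[nt])
```

Each column of the result sums to t, the number of motifs.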

C. Finding the Count Matrix with Pseudocount
A pseudocount is a small number added to the counts so that no entry remains zero. This corrects the unfair scoring that arises when a nucleotide that was never observed in a column is assigned zero probability. The method is known as Laplace's Rule of Succession. In the pseudocount method for motifs, one or another small number is added to the count matrix. Three matrices are formed for the calculations: the motif, count, and profile matrices. A 4 × k count matrix is formed for a given motif matrix; its (i, j) element is the count of nucleotide i in column j. A pseudocount of one is then added to each element of this count matrix.
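Adding the pseudocount to an existing count matrix can be sketched as follows. The function name `add_pseudocount`, the default value of one, and the small literal count matrix are assumptions made for illustration.

```python
def add_pseudocount(counts: dict[str, list[int]], pseudocount: int = 1) -> dict[str, list[int]]:
    """Return a new count matrix with the pseudocount added to every element."""
    return {nt: [c + pseudocount for c in row] for nt, row in counts.items()}


# Toy 4 x 4 count matrix (made-up values, not ZIKV data).
raw_counts = {
    "A": [2, 0, 0, 1],
    "C": [0, 3, 0, 0],
    "G": [0, 0, 3, 0],
    "T": [1, 0, 0, 2],
}
smoothed = add_pseudocount(raw_counts)
for nt in "ACGT":
    print(nt, smoothed[nt])
```

After smoothing, no entry is zero, so no k-mer is assigned zero probability by the derived profile.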

D. Framing the Profile Matrices with Pseudocount
Finding the ideal motif is a challenging task. The ideal motif is the one to which the transcription factor binds best. Because the ideal motif is not known, the motif discovery problem is solved by finding motifs that are most similar to it. Capital letters are used to denote the most common nucleotide in each column. Motif[i][j] denotes the entry in the ith row and jth column. If the matrix has more capital letters, the matrix is more conserved. The genome of the Zika virus is used to create different motif matrices for different values of k. The lowercase letters in each column are counted, and the sum of these counts gives the score of the matrix. k-mers are then chosen so as to reduce the score. The profile of the motif matrix is calculated by dividing the elements of the count matrix by the number of rows; when a pseudocount of one is added to each of the four nucleotide counts, the divisor becomes the number of rows plus four. The sum of any column of the profile matrix is then unity. The profile matrix for the Zika virus genome is given below.
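The profile computation with pseudocount can be sketched as below. The function name `profile_with_pseudocount` and the toy motifs are illustrative assumptions; the divisor t + 4 × pseudocount is chosen so that every column sums to one.

```python
def profile_with_pseudocount(motifs: list[str], pseudocount: int = 1) -> dict[str, list[float]]:
    """Return the 4 x k profile matrix built from pseudocounted column counts."""
    t, k = len(motifs), len(motifs[0])
    counts = {nt: [pseudocount] * k for nt in "ACGT"}
    for row in motifs:
        for j, nt in enumerate(row):
            counts[nt][j] += 1
    # Each column now holds t observed letters plus one pseudocount per nucleotide.
    total = t + 4 * pseudocount
    return {nt: [c / total for c in row] for nt, row in counts.items()}


toy_motifs = ["ACGT", "ACGA", "TCGT"]  # made-up k-mers, not ZIKV data
profile = profile_with_pseudocount(toy_motifs)
for nt in "ACGT":
    print(nt, [round(p, 3) for p in profile[nt]])
```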

E. Calculating a Consensus String for the Genome of the Zika Virus and Score
The consensus string of the Zika virus is calculated from the genome and is given below.
AGGGGGGAGAGGGAGGAGGACGAAGGAGGTGGCAGATAGAGAGGAGAAGGGGGAGAGGGAGAAGAGGAAA
The score of the consensus string is calculated to be 7444.
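A consensus string and its score can be computed from a motif matrix as in the following sketch. The names `consensus` and `score`, the tie-breaking order A, C, G, T, and the toy motifs are assumptions; the sketch does not reproduce how the genome was partitioned to obtain the values reported above.

```python
def consensus(motifs: list[str]) -> str:
    """Pick the most frequent nucleotide in every column (ties broken in ACGT order)."""
    k = len(motifs[0])
    result = []
    for j in range(k):
        column = [m[j] for m in motifs]
        result.append(max("ACGT", key=column.count))
    return "".join(result)


def score(motifs: list[str]) -> int:
    """Total number of positions where the motifs differ from their consensus."""
    cons = consensus(motifs)
    return sum(a != b for m in motifs for a, b in zip(m, cons))


toy_motifs = ["ACGT", "ACGA", "TCGT"]  # made-up k-mers, not ZIKV data
print(consensus(toy_motifs))  # ACGT
print(score(toy_motifs))      # 2
```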

A. Using Profile Matrix for Calculation
Iterative algorithms select among different alternatives during each iteration. The greedy search technique selects the most attractive alternative in each iteration. The profile matrix of the Zika virus calculated in the preceding section is used here. With it, the probability of any string can be computed. The probability of the consensus string is also calculated.
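The probability of a k-mer under a profile, and the greedy choice of the most probable k-mer in a string, can be sketched as follows. The function names, the small 4 × 2 profile, and the test string are illustrative assumptions; ties are resolved in favor of the first occurrence.

```python
def probability(kmer: str, profile: dict[str, list[float]]) -> float:
    """Multiply the profile entries selected by each nucleotide of the k-mer."""
    p = 1.0
    for j, nt in enumerate(kmer):
        p *= profile[nt][j]
    return p


def most_probable_kmer(text: str, k: int, profile: dict[str, list[float]]) -> str:
    """Return the k-mer of text with the highest probability (first one on ties)."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return max(kmers, key=lambda kmer: probability(kmer, profile))


# Made-up 4 x 2 profile; every column sums to one.
toy_profile = {
    "A": [0.5, 0.25],
    "C": [0.25, 0.5],
    "G": [0.125, 0.125],
    "T": [0.125, 0.125],
}
print(probability("AC", toy_profile))            # 0.25
print(most_probable_kmer("GGACT", 2, toy_profile))  # AC
```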

B. The Search of Binding Sites
The Greedy motif algorithm is implemented with pseudocount. The results of the Greedy motif search with pseudocount are summarized in Table II. The table contains values from 3-mer strings to 15-mer strings; strings of one and two nucleotides have little significance and are hence ignored. The score of the Greedy motif search without pseudocount is compared with the score of the Greedy motif search with pseudocount for strings of length three to fifteen. A lower score means better performance of the algorithm. The score is lower for the calculations with pseudocount, so these results are more promising than the motifs obtained without pseudocount. For the 15-mer string, a score of 1190 is obtained for the Greedy motif search without pseudocount and a score of 1138 for the Greedy motif search with pseudocount. So, using the Greedy motif search with pseudocount lowers the score by 52, or about 4.4%.
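The greedy motif search can be sketched in its standard textbook form as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be a list `dna` of strings, the way the genome is divided into such strings is not specified here, all function names are hypothetical, and the three short strings are made up with the 4-mer ACGT planted in each. Setting `pseudocount=0` gives the variant without pseudocount.

```python
def profile(motifs: list[str], pseudocount: int) -> dict[str, list[float]]:
    """Profile matrix from column counts plus the given pseudocount."""
    t, k = len(motifs), len(motifs[0])
    total = t + 4 * pseudocount
    return {
        nt: [(sum(m[j] == nt for m in motifs) + pseudocount) / total for j in range(k)]
        for nt in "ACGT"
    }


def probability(kmer: str, prof: dict[str, list[float]]) -> float:
    """Probability of a k-mer under the profile."""
    p = 1.0
    for j, nt in enumerate(kmer):
        p *= prof[nt][j]
    return p


def most_probable_kmer(text: str, k: int, prof: dict[str, list[float]]) -> str:
    """Most probable k-mer in text; the first occurrence wins ties."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return max(kmers, key=lambda kmer: probability(kmer, prof))


def score(motifs: list[str]) -> int:
    """Number of letters that differ from the most frequent letter of their column."""
    total = 0
    for j in range(len(motifs[0])):
        column = [m[j] for m in motifs]
        total += len(column) - max(column.count(nt) for nt in "ACGT")
    return total


def greedy_motif_search(dna: list[str], k: int, pseudocount: int = 1) -> list[str]:
    """Greedily build one motif per string, keeping the lowest-scoring collection."""
    best = [text[:k] for text in dna]
    for i in range(len(dna[0]) - k + 1):
        motifs = [dna[0][i:i + k]]  # seed with a k-mer from the first string
        for text in dna[1:]:
            motifs.append(most_probable_kmer(text, k, profile(motifs, pseudocount)))
        if score(motifs) < score(best):
            best = motifs
    return best


toy_dna = ["TTACGTT", "GACGTGG", "CCCACGT"]  # made-up strings, not ZIKV data
best_motifs = greedy_motif_search(toy_dna, k=4, pseudocount=1)
print(best_motifs, score(best_motifs))
```

On this toy input the planted 4-mer is recovered from every string with a score of zero; on real data the two variants generally return different motif collections and scores.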

VI. CONCLUSION AND FUTURE WORK
The genome of ZIKV is considered in this study. In the case of an infected mother, ZIKV causes neurological disorders in babies. The causes and effects of ZIKV were not identified earlier, as the symptoms of the disease are mild headache and fever. Later, the infection was linked to reduced brain activity in babies. The disease passes from the infected mother to the child. Medicines or vaccines for this infection are not available. This research paper studies the Zika virus genome to gain more insight into its molecular structure. For a given profile matrix, the probability of every k-mer string is calculated and tabulated. The score is calculated with the Greedy motif search without pseudocount and with pseudocount, and the results are computed and tabulated. The aim is to reduce the score, and the comparison shows that this is achieved with the Greedy motif search with pseudocount. In earlier research, the Greedy motif search for motif finding was applied to the PRINTS datasets and to hm03r, yst04r, and yst08r, as well as to the datasets GATA1, SOX2, OCT4, STAT3, and KLF1. It had not been applied to identify motifs in the Zika virus genome. It is concluded that the Greedy motif search with pseudocount performs better than the Greedy motif search without pseudocount, as it gives a score of 1138 compared with a score of 1190 for the 15-mer string.