Genetic Behaviour of Zika Virus and Identification of Motif

Zika virus (ZIKV) causes a mosquito-borne disease that is known to lead to neurological disorders and congenital disabilities in newborns. The genome sequence of the Zika virus is used for this study. Essential cell functionalities such as circadian behaviour and gene expression are studied, including regulatory proteins whose activity alternates between daytime and night time. Motifs are identified by understanding the features of motifs, finding the count matrix, and formulating the profile matrix. The consensus string of the Zika virus is computed, and the motif score is calculated. Two motif-finding techniques are considered: the Brute Force technique and the Greedy Search technique. In the Brute Force technique, every candidate collection of motifs is selected, its score is calculated, and the collection with the minimum score is returned. The Brute Force technique takes an enormous amount of time, but it is guaranteed to find a solution. The Greedy Search technique is not guaranteed to find the best motif, but it can give a close answer in a realistic time. This paper presents the identification of motifs in the Zika virus genome using programming techniques.
Keywords: circadian behaviour; consensus string; genome study; greedy search technique; motif search; regulatory proteins


I. INTRODUCTION
Zika virus (ZIKV) was discovered in 1947. It causes a mosquito-borne disease. In 2013, it spread in the South Pacific, and it continued to spread to all parts of the Americas. Early studies revealed that the virus originated and remained in Africa for many decades. The study of the Zika virus is interesting as it can help solve many biological problems. Genome sequences are quite complex, and it is not possible to explain them fully with a probabilistic model; even so, low-order Markov models explain their properties quite well. The DNA k-mer frequencies in the genome sequence of the Zika virus provide an insight into the genome complexity. It is possible to study k-mers of different lengths, as segments of genome sequence from different animals are available. A study was carried out on the k-mer spectra of different species from Archaea and Bacteria, including E. coli, and on the modalities of their distributions. Some species have multimodal spectra, whereas all other species have a unimodal k-mer spectrum [1].
When a patient falls ill, symptom-based categorization is possible from the medical signs and symptoms. Frequent associations can be extracted using association mining [2].
There is a need to extract the sequences of existing strains. Different disciplines are collaborating to combat the outbreak of the disease. The methodologies used include RNA extraction and material validation. Genome sequencing, consensus variation, and sequence analysis are also carried out to understand the whole genome. The scientific method does not need to authenticate the virus reagents separately. The data is available through public repositories. It helps to study the pathogenesis, neurotropism, prevention, and possible spread [3]. Microcephaly, primers, and probes in ZIKV are detected. ZIKV is transmitted from a viremic host to healthy people. Non-African mosquitoes have more potential to transmit the disease than African mosquitoes. Mosquito-to-mosquito transmission is possible and was evaluated in a study. In Africa, Aedes aegypti is less susceptible to ZIKV [4]. The spread of ZIKV is associated with neurological complications, so the spread and mode of transmission of ZIKV are studied carefully. Serological tests for ZIKV cross-react with antibodies from other viral infections. Viral nucleic acids are detected by polymerase chain reaction testing, and virus isolation is also done to confirm the presence of ZIKV. These findings are combined with the global distribution of the Aedes vector. Mother-to-foetus and sexual transmission are very common modes of person-to-person transmission [5].
The detection of the infection is difficult because ZIKV is closely related to other flaviviruses. The potential for antibody-dependent enhancement also increases because of cross-reactivity with flaviviruses such as the dengue virus and the West Nile virus. Serum samples were collected and tested. A study was conducted on the dengue virus and the West Nile virus to find whether these viruses can enhance or neutralize ZIKV. The West Nile virus enhanced ZIKV and failed to neutralize it [6].
People with ZIKV have a mild fever and few symptoms by which to identify the infection. Babies born to infected pregnant women may have birth defects. Deterministic models were designed, considering sexual transmission, mosquito-human transmission, and the release of Wolbachia-infected male mosquitoes. The disease-free equilibrium and its stability were studied, as was the impact of the parameters on reproduction. The intervals at which Wolbachia-infected male mosquitoes are released were studied. A bounded global solution was derived for the extinction of ZIKV. As per the numerical simulations, ZIKV may be destroyed when the amount of white noise reaches a threshold value. The wild mosquitoes may become extinct with the release of Wolbachia-infected mosquitoes [7].
ZIKV is a severe public health issue. Still, little work has been done on the transmission of the Zika virus in sexual groups. A study was done on controlling the spread of the virus between individuals by changing contact patterns. A heterosexual network-based model was designed based on the Costa Rica case study. The effect of changing the degree of heterogeneity was measured by removing the sexual contacts of a limited number of persons with a greater degree, at different places. A threshold time for Zika virus infection next to the peak time was devised [8].

II. LITERATURE REVIEW
The ZIKV genome is studied to better understand the evolution and spread of Zika virus infection in more than fifty countries of the world. The spread is caused by infected mosquito bites and person-to-person transmission. The disease is better understood with molecular insights into ZIKV [9], which can help to combat it. In monkeys and humans, neural progenitor cell growth is attenuated by infection. The virus damages the DNA and activates DNA damage responses [10]. The biological cycle of the virus is studied to understand its behaviour during the daytime, the night time, and the replication process. With the introduction of ZIKV to the Americas, four mutations of ZIKV were reported. These represent direct reversions of earlier mutations that arose during the spread from Africa to Asia. Studies were performed with and without the mutations on the experimental infection of Aedes aegypti mosquitoes and human cells. It was found that fitness for urban human-amplified transmission was reduced by the original mutations, whereas fitness was enhanced by the new mutations, increasing the risk. The findings include three adaptive mutations of ZIKV [11].
Adults with moderate immunocompetence may get infected by ZIKV. The infection triggers and enhances antiviral responses and brain damage. The neuroendocrine functions, inflammation, and immune reaction to different pathogens can be modulated by the composition of the gut microbiota. The modification induced by ZIKV in the gut microbiome of immunocompetent mice was studied. It was found that the infection caused a considerable decline in microbes of the Actinobacteria and Firmicutes phyla compared to healthy mice. A significant increase in Deferribacteres and Spirochaetes was identified. Intestinal damage and intense white blood cell recruitment were caused by the modulation of the microbiota induced by the Zika virus [12]. Birth defects are associated with in utero exposure to ZIKV, but the impact in early childhood remained unclear. A study was conducted on the neurodevelopment of toddlers up to 24 months old born to women infected with ZIKV during pregnancy. These women were pregnant during the 2016 ZIKV outbreak in the Americas. There was no abnormal transfontanelle cerebral ultrasound finding before and after delivery. But later, the child had reduced brain activity and birth defects [13]. There are no approved vaccines or antiviral treatments available. Using the dengue vaccine as a reference, a chimeric dengue/ZIKV vaccine named VacDZ was created. It is a live attenuated vaccine against ZIKV. It shows key markers of attenuated pathogenicity in interferon-deficient adult mice. The vaccine elicits an immune response to ZIKV. It neutralizes the virus and shows successful protection against ZIKV in mice [14]. The emergence of SARS-CoV-2 prompted the mobilization of the health test center network to detect COVID-19 patients. Contacts were traced to identify the hot-spot areas prone to active community transmission. The Brazilian public health system faced difficulties amid triple epidemics, i.e., dengue, chikungunya, and Zika virus.
Various samples were collected from Brazil and tested. An inter-disciplinary response to health gained importance. The search for an effective vaccine became important, as no vaccine is 100% effective against any virus [15]. The Literature Review is shown in Table I.

A. Data Collection: Zika Virus Genome
ZIKV belongs to the virus family Flaviviridae. The entire genome sequence is available at NCBI. The functionalities of DNA were studied and discussed by Watson and Crick [19]. The filename of the dataset is ZikaVirus.fasta. The genome of the Zika virus is stored in this file as a nucleotide sequence, and the extension fasta indicates the FASTA file format. The size of the dataset is 11 KB.
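As an illustration of how such a file could be loaded, the following is a minimal sketch, assuming the file holds a single FASTA record (a header line starting with ">" followed by sequence lines). The function name parse_fasta and the demo record are hypothetical and are not taken from the study's own program.

```python
def parse_fasta(lines):
    """Return (header, sequence) for the first record of a FASTA file.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    header = None
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        else:
            parts.append(line.upper())
    return header, "".join(parts)


# Made-up demo record; with the real dataset one would instead write
#   with open("ZikaVirus.fasta") as handle:
#       header, genome = parse_fasta(handle)
demo = [">demo record (made up)", "ACGTAC", "GTTA"]
header, genome = parse_fasta(demo)
print(header, len(genome))
```

The length of the genome string is then simply len(genome).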

B. Circadian behavior in Zika Virus
The daily activities of a virus or any living organism are controlled by an internal clock called the circadian clock. Animals also follow their daily routine based on the circadian clock. The clock maintains a 24-hour activity cycle. When it malfunctions, many organisms face disorders; one such disorder is called delayed sleep-phase syndrome. The circadian clock has its base at the molecular level. Because of the malfunction of the circadian clock, people become prone to many diseases. Heart attack is more common in the daytime, whereas asthma attack is more common in the night time.
Scientists Ron Konopka and Seymour Benzer identified abnormal circadian patterns in mutant flies and traced their causes. They found that the mutations were in a single gene. Many years later, a similar clock gene was discovered in mammals, and then many more circadian genes were discovered. These genes display a high degree of evolutionary conservation across different species. Maintaining the circadian clock is very important for a plant, as its entire life cycle depends on it. It is a matter of life and death for plants. More than a thousand plant genes are circadian, including the genes that control photosynthesis, photoreception, budding, and flowering. The circadian behaviour of the Zika virus is studied [16]. The immune system is regulated by the circadian clock. The immune system reacts to microbes, and pathogen replication is affected. BMAL1 and REV-ERBα are circadian components related to flaviviruses such as dengue and Zika. The replication of flaviviruses is regulated by the circadian clock.

C. Representations of Genes
DNA makes RNA, which makes proteins. RNA is composed of four ribonucleotides, namely adenine, cytosine, guanine, and uracil; the thymine of DNA is replaced by uracil in RNA. The RNA transcript is translated into the amino acid sequence of a protein. These proteins regulate the functions inside the cell. DNA replication begins at the origin of replication, called ori. Finding the position of ori is a complicated task even for biologists. The processes of transcription and translation happening inside the cells are also complicated [20]. During transcription, all occurrences of thymine (T) in DNA are replaced with uracil (U). The RNA strand is then partitioned into 3-mers called codons, and each codon is translated into one of the 20 amino acids according to the genetic code. Of the 64 codons, 61 encode amino acids and 3 are stop codons, which halt the translation. For example, the DNA string "ATATCGAAA" transcribes into the RNA string "AUAUCGAAA", which translates into the amino acid sequence "ISK".
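The transcription and translation steps can be sketched in Python as follows. This is a minimal illustration that assumes the standard genetic code, written compactly with the bases in the order U, C, A, G; the names transcribe, translate, and CODON_TABLE are chosen here for illustration and do not come from the study's program.

```python
from itertools import product

# Standard genetic code; '*' marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Replace every thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an RNA string codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATATCGAAA")
print(rna, translate(rna))  # AUAUCGAAA ISK
```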
Cells can transcribe different genes into RNA, and the rates of transcription may differ from gene to gene. This is known as gene expression. That is the reason why brain cells and skin cells behave differently: they have different functionalities and vary greatly in their features. These variations help the cells to understand the time and keep track of it. Pregnancy-associated variations in reactions to ZIKV were identified using DNA expression of samples from different women. ZIKV-infected pregnant women showed proinflammatory responses [17].

IV. REGULATORY PROTEINS
The dataset contains the nucleotide sequence of the Zika virus. The length of the string can be found using a Python program; it was found to be 10780. Each cell in a plant keeps track of day and night. There are three master genes, which are called clock masters. These are CCA1, TOC1, and LHY. These genes are controlled by external factors like sunlight and the availability of nutrients in the soil. This helps the organism to adjust its gene expression.
The regulatory protein TOC1 promotes the expression of LHY and CCA1, whereas the expression of TOC1 is suppressed by LHY and CCA1. The three genes thus work in a negative feedback loop. Sunlight activates the transcription of LHY and CCA1, which represses the transcription of TOC1. At night time, TOC1 peaks and starts promoting the transcription of LHY and CCA1. LHY and CCA1 in turn repress the transcription of TOC1, and the loop continues. The codon usage in the Zika virus is controlled by biased nucleotide composition [21].
A transcription factor regulates a gene by binding to a specific short DNA interval called a regulatory motif [22,23], also called the transcription factor binding site. It lies in the upstream region of the gene, a region 600-1000 nucleotides long that precedes the start of the gene. For example, CCA1 can bind to "AAAAAATCT" in the upstream region. It will be helpful for bioinformaticians if the regulatory motifs of a gene can be located, so an algorithm to find motifs will be useful.

A. Importance of Motif
Motifs are short sequence patterns of finite length. They are used to study the features of DNA, RNA, and proteins. Transcription factor binding sites are represented using sequence motifs, so finding motif sequences can help in understanding transcription regulation [24,25]. Motifs also represent the active sites of enzymes and features of protein structure and stability. DNA arrays are studied to identify the genes that are active during the daytime in plants. The upstream regions of nearly 500 genes were extracted to find the circadian behaviour, and the most frequently appearing pattern in the upstream regions was identified. It was found that "AAAATATCT" is the most frequent word, appearing more than 40 times. It was named the evening element of the upstream region. The gene loses its circadian behaviour if the evening element is mutated. In plants, the evening element is quite conserved, so it is easy to find, whereas in animals, finding the evening element is quite difficult because of many mutations. If a fly is infected with a bacterium, its immunity genes get activated to fight the bacterium, and these genes show elevated expression levels. The most common 12-mer in the upstream region of many of these genes is "TCGGGGATTTCC". It is the binding site of the transcription factor NF-κB, which activates the various immunity genes in flies. The biological challenge of finding a regulatory motif is to be converted into a computational problem.
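Finding the most frequent word of a given length in a region can be sketched as below. This is a generic frequent-words illustration on a made-up text, not the analysis that identified the evening element; the function name most_frequent_kmers is hypothetical, and overlapping occurrences are counted.

```python
from collections import Counter


def most_frequent_kmers(text, k):
    """Return (sorted list of the most frequent k-mers, their count)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return [], 0
    best = max(counts.values())
    winners = sorted(kmer for kmer, c in counts.items() if c == best)
    return winners, best


# Made-up demo text, not a real upstream region.
print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
```

In the demo text, the 4-mers "CATG" and "GCAT" each occur three times, so both are reported.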
Individual instances of motifs could be scored by their similarity to the ideal motif, i.e., the transcription factor binding site that best binds to the transcription factor. As the ideal motif is not known, a k-mer is instead selected from each string, and the selected k-mers are scored by their similarity to each other. A list DNA of t DNA strings is taken, each of length n. A k-mer is selected from each string to form a collection of motifs, which is represented as a t × k motif matrix. The motif matrix of the Zika virus is formulated. The most frequent nucleotide in each column is identified and denoted in upper case, and the other nucleotides in lower case. By choosing different k-mers from each string, different motif matrices are created from the same DNA strings. The most conserved motif matrix is to be obtained, i.e., the matrix with the most uppercase characters or the fewest lowercase characters. The goal is to compute a collection of k-mers that minimizes the score.

C. Formulating the Profile Matrices
An ideal motif is a transcription factor binding site that binds the best to the transcription factor. The motif finding problem would score instances of motifs by their similarity to the ideal motif, but the ideal motif is not known to us. Our aim is therefore to find a k-mer from each string of the array and to score the collection by the similarity of its k-mers to each other.
The most frequent nucleotide in each column is identified and denoted in upper case. If two nucleotides are equally frequent, one of them is selected at random. The motif matrix is represented as a list of strings, and the entry in the i-th row and j-th column is accessed as motif[i][j]. A conserved matrix is a matrix with fewer lowercase characters, i.e., more uppercase characters. The most conserved motif matrix is to be selected from several different motif matrices, since a different motif matrix can be created from a given sample of DNA strings for each choice of k-mers. The score of the motif matrix is found by counting the number of lowercase letters in the motif matrix, and the aim is to find a set of k-mers that minimizes this score. The count matrix records how many times each nucleotide occurs in each column of the motif matrix. All elements of the count matrix are divided by the number of rows in the motif matrix, i.e., t. The resultant matrix is the profile of the motif matrix. The element (i, j) of the profile is the frequency of the i-th nucleotide in the j-th column of the motif matrix. The sum of any column of the profile matrix is 1.
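The count and profile matrices can be sketched as follows. This is a minimal illustration on a small made-up motif matrix, assuming the rows are equal-length strings over A, C, G, T; the function names and the dictionary layout (one row per nucleotide) are choices made here for illustration.

```python
def count_matrix(motifs):
    """counts[nuc][j] = number of rows with nucleotide nuc in column j."""
    k = len(motifs[0])
    counts = {nuc: [0] * k for nuc in "ACGT"}
    for motif in motifs:
        for j, nuc in enumerate(motif.upper()):
            counts[nuc][j] += 1
    return counts


def profile_matrix(motifs):
    """Divide every count by t, the number of rows of the motif matrix."""
    t = len(motifs)
    counts = count_matrix(motifs)
    return {nuc: [c / t for c in row] for nuc, row in counts.items()}


# Made-up 4 x 4 motif matrix for demonstration only.
motifs = ["ACGT", "ACGA", "TCGT", "AGGT"]
counts = count_matrix(motifs)
profile = profile_matrix(motifs)
print(counts["A"], profile["A"])  # [3, 0, 0, 1] [0.75, 0.0, 0.0, 0.25]
```

Each column of the resulting profile sums to 1, as stated in the text.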

D. Finding a Consensus String for Zika Virus
A consensus string for the Zika virus is derived by identifying the most common nucleotide in each column of the motif matrix. If two nucleotides have the same frequency, one of them is selected at random. If the motifs are selected correctly, the consensus string provides a candidate regulatory motif.
The string of the most frequent nucleotides in each column, i.e., Consensus(Motifs), of the Zika virus genome is: AGGGGGGAGAGGGAGGAGGACGAAGGAGGTGGCAGATAGAGAGGAGAAGGGGGAGAGGGAGAAGAGGAAA. So, the consensus string of the Zika virus is known.
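A consensus string can be computed from a motif matrix as sketched below. This is an illustration on a made-up matrix; as an assumption made here for reproducibility, ties are broken alphabetically rather than at random as described in the text.

```python
from collections import Counter


def consensus(motifs):
    """Most frequent nucleotide of each column, joined into one string."""
    result = []
    for column in zip(*motifs):
        counts = Counter(column)
        # Assumption: on a tie, the alphabetically first nucleotide wins.
        result.append(max(sorted(counts), key=counts.get))
    return "".join(result)


# Made-up motif matrix for demonstration only.
print(consensus(["ACGT", "ACGA", "TCGT", "AGGT"]))  # ACGT
```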

E. Score Motif
The motif score of the Zika virus can be calculated using the consensus string. For each column j of the motif matrix, the number of symbols that do not match the symbol at position j of the consensus string is counted, and these counts are added over all columns. The score of the Zika virus genome is 7444.
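The score can be sketched as follows. In each column, the number of symbols that differ from the consensus symbol equals the column height minus the count of the most frequent nucleotide, so this sketch does not need to build the consensus string explicitly. The motif matrix is made up for demonstration.

```python
from collections import Counter


def score(motifs):
    """Total number of symbols that differ from the column-wise consensus."""
    total = 0
    for column in zip(*motifs):
        most_common_count = max(Counter(column).values())
        total += len(column) - most_common_count
    return total


# Made-up motif matrix: one mismatch in columns 0, 1 and 3, none in column 2.
print(score(["ACGT", "ACGA", "TCGT", "AGGT"]))  # 3
```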

V. FINDING THE BINDING SITES
The motif finding problem is solved over a collection of strings: a k-mer is to be identified in each string such that the score of the resulting motifs is minimized. The input to the problem is a collection of DNA strings and an integer k. The output is a collection of k-mers, Motifs, one from each DNA string, that minimizes the motif score over all possible choices of k-mers. A general problem-solving technique like the Brute Force algorithm can find a solution, but it takes a lot of time to execute for a large genome. The brute force algorithm considers each possible collection of k-mers, Motifs, and returns the motifs with the least score.
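The brute force approach can be sketched as below for very small inputs. This is an illustration under the problem statement above, not the program used in the study; the helper names and the three demo strings are made up. It enumerates every way of choosing one k-mer per string and keeps the collection with the lowest score.

```python
from collections import Counter
from itertools import product


def kmers(text, k):
    """All k-mers of text, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def score(motifs):
    """Number of symbols that differ from the column-wise consensus."""
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))


def brute_force_motif_search(dna, k):
    """Try every combination of one k-mer per string; return the best."""
    best, best_score = None, None
    for candidate in product(*(kmers(s, k) for s in dna)):
        s = score(candidate)
        if best_score is None or s < best_score:
            best, best_score = list(candidate), s
    return best, best_score


# Made-up strings that all contain the 3-mer "ATT".
print(brute_force_motif_search(["AAATTT", "CCATTC", "GATTGG"], 3))
```

Even this tiny example checks 4 × 4 × 4 = 64 candidate collections, which shows why the approach does not scale.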

A. Comparing the Working of Brute Force Motif Finding
The Brute Force motif finding technique examines all possible solutions. Such algorithms are easy to design. The technique is guaranteed to find a solution, as it verifies every possible solution and identifies the best one, i.e., the motifs with the lowest score. However, it takes an enormous amount of time, as it has to check every possible solution before it can discard a motif. The number of candidate motifs is too large to verify.
In the brute force algorithm, there are n-k+1 choices of k-mer in each of the t strings, so the number of ways to form Motifs is $(n-k+1)^t$.
The algorithm can calculate the score of each candidate in $k \cdot t$ steps, so the running time of the algorithm is of the order $(n-k+1)^t \cdot k \cdot t$. This value is too high to be computed using even the fastest computer. Knowing the value of k in advance would make the problem a little easier, but k is usually not known. So, another method needs to be explored.
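The growth of this step count can be checked with a few lines of Python. The values of n, k, and t used below are hypothetical and chosen only to show how quickly the number of steps grows with t.

```python
def brute_force_steps(n, k, t):
    """(n - k + 1)^t candidate collections, each scored in k * t steps."""
    return (n - k + 1) ** t * k * t


# Hypothetical sizes: strings of length 70, 15-mers, t strings.
for t in (2, 5, 10):
    print(t, brute_force_steps(70, 15, t))
```

The step count is multiplied by roughly n-k+1 for every additional string, so even modest values of t are out of reach.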

A. Use of Profile Matrix
Many algorithms use iterative procedures that select among different alternatives at each iteration. Some of these choices are correct, whereas some are not. Greedy search algorithms select the most attractive alternative at each step. In a chess game, a greedy algorithm tries at every move to capture the most valuable piece available. A greedy algorithm may not find the best solution, but it can quickly find an approximate solution in many cases. So, greedy search is applied to biological problems, and here to motif finding, to approximate a solution. A collection of k-mers, one from each DNA string, forms the motifs. The columns of the profile matrix are viewed as four-sided dice, with one nucleotide from {A, C, G, T} on each side. The first column of the profile matrix has the data (0.2922077922077922, 0.14935064935064934, 0.2662337662337662, 0.2922077922077922).
The sum of all probabilities is 1 for any column. For the first column, this means that the probability of generating A is 0.2922077922077922, C is 0.14935064935064934, G is 0.2662337662337662, and T is 0.2922077922077922. The profile matrix for the Zika virus is given in the previous section. The probability of any selected string is calculated by multiplying, over all positions j, the profile entry in the j-th column for the nucleotide found at position j of the string. For example, the probability of the string "ACGG" is found to be 0.006458989565084851. A k-mer has a higher probability when it is more like the consensus string, "AGGGGGGAGAGGGAGGAGGACGAAGGAGGTGGCAGATAGAGAGGAGAAGGGGGAGAGGGAGAAGAGGAAA".
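The probability of a k-mer under a profile, and the selection of the Profile-most probable k-mer in a text, can be sketched as follows. The profile below is a small made-up example, not the Zika virus profile; the function names are illustrative, and ties are resolved in favour of the first k-mer found, which is an assumption of this sketch.

```python
def kmer_probability(kmer, profile):
    """Product of the profile entries selected by each symbol of kmer."""
    prob = 1.0
    for j, nuc in enumerate(kmer):
        prob *= profile[nuc][j]
    return prob


def profile_most_probable_kmer(text, k, profile):
    """First k-mer of text with the highest probability under profile."""
    best_kmer, best_prob = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        p = kmer_probability(kmer, profile)
        if p > best_prob:
            best_kmer, best_prob = kmer, p
    return best_kmer


# Made-up profile with k = 3; every column sums to 1.
profile = {
    "A": [0.5, 0.0, 0.25],
    "C": [0.25, 0.5, 0.25],
    "G": [0.0, 0.5, 0.25],
    "T": [0.25, 0.0, 0.25],
}
print(kmer_probability("ACG", profile))                     # 0.0625
print(profile_most_probable_kmer("TTACGAGC", 3, profile))   # ACG
```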

B. The Search for Binding Sites
The search for binding sites is carried out using greedy motif search. The collection of Deoxyribonucleic acid (DNA) strings is denoted by DNA. The best motifs are initially set to the first k-mer from each string in DNA. The algorithm then ranges over all possible k-mers in DNA[0], trying each one as motif[0]. It builds a profile matrix from this k-mer and sets motif[1] equal to the Profile-most probable k-mer in DNA[1]. Greedy motif search then iterates, updating Profile each time. In general, having found k-mers in the first i strings of DNA, greedy motif search constructs a profile matrix from them and sets motif[i] equal to the Profile-most probable k-mer from DNA[i]. In this way, a k-mer from each string in DNA is obtained as a collection of strings. Greedy motif search then checks whether the score of this collection is lower than that of the best-scoring collection of motifs found so far. If it is lower, the best motifs are updated; otherwise, the collection is ignored. The execution moves to the beginning of the loop, and the next k-mer in DNA[0] is selected. The results of the greedy motif search for different k-mer lengths are summarized in Table II. The 1-mer and 2-mer strings have less significance, so the results are shown for 3-mers up to 15-mers. The 15-mer has a score of 1190. The score obtained using the greedy method may not be optimal, as there may exist a collection of motifs with a lower score, which could be obtained by examining all possible solutions. But this method provides a 15-mer motif in a reasonable time. This score can be improved using other algorithms.
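The greedy procedure can be sketched in Python as below. This is a minimal illustration of the standard greedy motif search without pseudocounts, not necessarily identical to the program used to produce Table II. Because a lower score means a more conserved collection, the sketch updates the best motifs only when the new score is strictly lower. The helper names and the five demo strings are illustrative.

```python
from collections import Counter


def score(motifs):
    """Number of symbols that differ from the column-wise consensus."""
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))


def profile(motifs):
    """Column-wise nucleotide frequencies of the motifs chosen so far."""
    t = len(motifs)
    return {n: [col.count(n) / t for col in zip(*motifs)] for n in "ACGT"}


def most_probable(text, k, prof):
    """First k-mer of text with the highest probability under prof."""
    best, best_p = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        p = 1.0
        for j, n in enumerate(kmer):
            p *= prof[n][j]
        if p > best_p:
            best, best_p = kmer, p
    return best


def greedy_motif_search(dna, k):
    best = [s[:k] for s in dna]  # start from the first k-mer of each string
    for i in range(len(dna[0]) - k + 1):
        motifs = [dna[0][i:i + k]]
        for s in dna[1:]:
            motifs.append(most_probable(s, k, profile(motifs)))
        if score(motifs) < score(best):  # lower score is better
            best = motifs
    return best


dna = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC",
       "CACGTCAATCAC", "CAATAATATTCG"]
print(greedy_motif_search(dna, 3))  # ['CAG', 'CAG', 'CAA', 'CAA', 'CAA']
```

The nested loop evaluates only n-k+1 starting k-mers rather than every combination, which is why the greedy method finishes in a reasonable time.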

VII. CONCLUSION AND FUTURE WORK
The study is conducted to understand the behaviour of the Zika virus. ZIKV shows circadian expression, which also regulates the day-to-day functions of genes. Zika virus infection causes birth defects like neurological disorders in babies, and no proper cure or vaccine is available. This paper attempts to find the probability of every k-mer for a given profile matrix. The Profile-most probable k-mer is calculated; this k-mer is the most likely to be generated by Profile among all k-mers in the text. For example, if the selected k-mer length is 12, the highest-probability k-mer found using the probability values is "AGGGGGGAGAGG". Similarly, if the selected k-mer length is 15, the highest-probability k-mer is "AGGGGGGAGAGGGAG". Greedy motif search is used as it has a much better running time than Brute Force search. The best motifs are initially set to the first k-mer from each string in DNA. A k-mer from each string in DNA is then obtained as a collection of strings. Greedy motif search checks whether the score of this collection is lower than that of the best-scoring collection of motifs found so far. If it is lower, the best motifs are updated; otherwise, the collection is ignored. Python programming is used to study the genetic behaviour of the Zika virus genome.
If any value in the profile matrix is zero, then the probability of every string that uses that entry becomes zero, and the string is completely rejected, however well it matches in the other positions. Such results can be improved using other methods like Laplace's rule of succession, which can further improve the score obtained.
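Laplace's rule of succession can be sketched as adding a pseudocount to every entry of the count matrix before normalizing, so that no profile entry is zero. The following is a minimal illustration on a made-up motif matrix; the function name and the default pseudocount of 1 are choices made here, and this is suggested future work rather than something the present study reports results for.

```python
def profile_with_pseudocounts(motifs, pseudocount=1):
    """Profile matrix with `pseudocount` added to every count."""
    t = len(motifs)
    k = len(motifs[0])
    counts = {n: [pseudocount] * k for n in "ACGT"}
    for motif in motifs:
        for j, n in enumerate(motif):
            counts[n][j] += 1
    total = t + 4 * pseudocount  # each column now sums to t + 4 * pseudocount
    return {n: [c / total for c in row] for n, row in counts.items()}


# Made-up motif matrix; without pseudocounts, T would have frequency 0
# in the first column.
smoothed = profile_with_pseudocounts(["ACG", "ACG", "ATG", "CCG"])
print(smoothed["A"][0], smoothed["T"][0])  # 0.5 0.125
```

With this profile, a k-mer that differs from all chosen motifs in one position still receives a small non-zero probability instead of being rejected outright.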