Modeling of Coronavirus Behavior to Predict Its Spread

With the increasing presence and spread of infectious diseases and their fatalities in densely populated areas, many academics and societies have become interested in finding new ways to predict how these diseases spread. Such methods will help them to plan for and contain a disease better in local areas and thus reduce the loss of human lives. Several cases of pneumonia of unknown cause occurred in Wuhan, Hubei, China, in December 2019, with clinical presentations closely resembling viral pneumonia. In-depth sequencing analyses of lower respiratory tract samples revealed a novel coronavirus, called 2019 novel coronavirus (2019-nCoV). Recent events have shown how easily a coronavirus can take root and spread, as such viruses are transmitted easily between persons. To deal with these infections, we apply a time series forecasting model in this paper to predict possible coronavirus cases. The forecasting model applied is SIR. The results of the implemented model are compared with the actual data.
Keywords: COVID-19; coronavirus; SIR model; data mining; R Software; forecasting model


I. INTRODUCTION
On 30 January 2020, the WHO declared the outbreak of COVID-19 a public health emergency of international concern (PHEIC) [14]. Since 3 December 2019, this continuing outbreak has spread to over 50 other countries [13]. As of 20 February, there were over 1 million confirmed cases of COVID-19 worldwide and over 50 000 deaths [15]. Together with MERS-CoV and SARS-CoV, it is the seventh member of the coronavirus family known to infect humans [1,2,16].
Human coronaviruses, which include hCoV-229E, OC43, NL63, and HKU1, cause mild respiratory disease. The fatal coronavirus infections that have emerged over the past two decades are severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) [3]. Coronavirus disease (COVID-19) is an infectious disease caused by a recently discovered coronavirus. The majority of people diagnosed with COVID-19 have mild to moderate respiratory illness [4]. The disease has drawn considerable attention not only in China but globally.
A brief definition of the term "data mining" is "to extract the useful information and patterns from large data sources" [11]. Across different areas, data mining is used to help increase the quality and efficacy of pattern detection and the analysis of events using existing statistical evidence [32]. Data mining technology is the process by which we perform all sorts of analyses on vast volumes of data [33,34]. This paper aims to use the insights made possible by data mining methods to help predict the spread of coronavirus. Large volumes of data can be cumbersome and repetitive to compile and analyze. However, once the underlying patterns in the data are identified, the occurrence of coronavirus can be predicted beforehand. Advanced techniques drawing on the fields of computer science, mathematics, and data science are necessary for discovering these underlying patterns [12]. Because data mining is a vast collection of procedures with extensive applications, the accepted concept of data mining differs slightly depending on the source. Data mining is also referred to by alternative names such as "Big Data" or "Data Science" [19].
Many important aspects of our lives rely on the statistical analysis of historical data. For example, illness, changes in stock market activity, and the weather can be forecast only if we can find a pattern in historical data over time, whether the data are recorded daily, weekly, monthly, or annually. This form of forecasting is commonly called time series forecasting. Observations taken sequentially in time are usually called a time series [5]. Mathematical models have been applied to study a variety of communicable disease outbreaks [8], [9], [10].
Predictive modeling is described in [18] as the process of developing a mathematical tool or model that produces an accurate prediction. Predictive modeling can be used to analyze the available information and to make better-informed decisions based on what the data reveal. Modeling techniques offer computational capabilities in situations with vast quantities of data, building models with predictive power that support practical decisions. Because both rely on overlapping algorithms to discover hidden facts in data, predictive modeling is closely related to data mining [19].
Data cleaning, structuring, statistical analysis, predictive modeling, and data visualization were performed for this project with the R software system and the widely used RStudio GUI. R is an open-source programming language and platform with a range of statistical functionality. The R environment is a suite for interactive computing, graphical display, statistics, visualization, and data manipulation. Robert Gentleman and Ross Ihaka of the Department of Statistics at the University of Auckland initially wrote R. Currently, R is the product of user contributions from around the world. Apart from users who build new features, there are thousands of ready-built packages available on the Comprehensive R Archive Network (CRAN) with countless features and capabilities [20].
*Corresponding Author. www.ijacsa.thesai.org
In mathematical epidemiology, the SIR model is perhaps the most commonly used model [23, 24, 25]. It is widely used for a variety of purposes: often as the core of more complex epidemiological models for the transmission of communicable diseases [24,26,27], and at other times to study the propagation of phenomena in other domains, such as rumor spread [28], computer viruses [29], the dissemination of information in web forums [30], or the behavior of investors in stock markets [31].
In this paper, we use statistical time series models to forecast possible cases of coronavirus. Because coronavirus is a contagious virus, it is essential to estimate the number of cases so that the government can take appropriate steps and measures to limit its spread and treat those infected.

II. RELATED WORKS
S. Eubank et al. [6] analyzed the algorithmic and structural properties of extensive social contact networks in Portland, Oregon. A bipartite graph composed of individuals and places was created: individuals and places form the two sets of nodes, and an edge links an individual to a place. This bipartite configuration does not give users direct information about interactions between individuals.
They used a random graph approach based on a CL-model, which correctly mimics the dataset's critical features. In addition, fast approximation algorithms were developed to measure basic structural properties. Such studies investigated the impact of policy decisions on the management of large-scale urban disease outbreaks. The study demonstrates the effect of different algorithms that deal with the pattern of disease spread but does not visualize that effect.
K. Wang et al. [7] developed an epidemiological model to explain the spread of flu through a network of human contacts. The aim of that study was to build an agent-based modeling (ABM) method combining GIS and civic ecosystems to mimic the spread of influenza. The model was developed in Java with GIS software, using the Repast Simphony framework. A system was developed to simulate the spread and control of influenza in a specific area. The model defined influenza using a mathematical relationship among the probability of transmission, the distance between two persons, the latent duration, the time between infection and death, and the rate of cure. Users could modify the results of the simulation by, for example, changing the time-to-hospital value.
The research and models mentioned above offer useful information that allows medical leaders to make better decisions about outbreak protocols. Nonetheless, each model mentioned above lacks a particular feature, whether it is computational capacity, the variables that are essential for calculating an accurate pattern of virus spread, sensitivity to disease expansion, or the ability to model more than one form of disease spread.
SEIR stands for Susceptible, Exposed, Infectious, and Removed (or Recovered). The model is built on the basis of SIR but adds a compartment for the Exposed. The susceptible are persons who may acquire the infection and become carriers if infected; the exposed are persons already infected but asymptomatic; the infectious are persons who display signs of infection and may spread the disease; and the removed or recovered are previously infected persons who are no longer infectious and who are now immune to the virus [17].

A. Dataset
The dataset used is the "COVID 19 data.csv" file, which the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) periodically updates. The dataset contains the Province/State and Country/Region, the date, and the numbers of confirmed cases, deaths, and recovered cases. The data were collected from the start of the virus's spread, as shown in the screenshot in Fig. 1.
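As an illustration of how a file with this layout could be read and aggregated, the following is a minimal sketch in Python (the study itself used R). The exact header names (`Province/State`, `Country/Region`, `Date`, `Confirmed`, `Deaths`, `Recovered`) are assumptions based on the description above, and the sample rows are made-up values used only to show the mechanics.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: the numbers are invented and the header names are assumed.
SAMPLE_CSV = """Province/State,Country/Region,Date,Confirmed,Deaths,Recovered
Hubei,China,2020-01-22,100,2,5
Beijing,China,2020-01-22,10,0,1
,Italy,2020-01-22,0,0,0
Hubei,China,2020-01-23,150,3,8
Beijing,China,2020-01-23,12,0,1
,Italy,2020-01-23,2,0,0
"""


def load_records(text):
    """Parse CSV text into a list of dicts with integer counts."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "province": row["Province/State"],
            "country": row["Country/Region"],
            "date": row["Date"],
            "confirmed": int(row["Confirmed"]),
            "deaths": int(row["Deaths"]),
            "recovered": int(row["Recovered"]),
        })
    return records


def totals_by_country_and_date(records):
    """Sum over provinces so each (country, date) pair has one set of counts."""
    totals = defaultdict(lambda: {"confirmed": 0, "deaths": 0, "recovered": 0})
    for rec in records:
        key = (rec["country"], rec["date"])
        for field in ("confirmed", "deaths", "recovered"):
            totals[key][field] += rec[field]
    return dict(totals)


records = load_records(SAMPLE_CSV)
totals = totals_by_country_and_date(records)
```

To run this against the real file, the sample string would be replaced by the contents of the CSV, after checking that its header row matches the assumed names.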

B. Data Visualization
Figs. 2 to 7 visualize the countries where COVID-19 has spread most widely, the rates of infection, recovery, and death caused by the epidemic, and the rates of new cases in the most affected countries.
Using R, we visualize the major outbreaks in the top 10 countries in Fig. 2.
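The ranking behind a top-10 chart can be sketched as follows. This is a Python illustration rather than the R code used for the figures; the function names are hypothetical, and the country labels and counts are placeholders, not real data.

```python
def top_countries(confirmed_by_country, n=10):
    """Return the n (country, count) pairs with the largest confirmed counts."""
    ranked = sorted(confirmed_by_country.items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:n]


def bar_lines(ranked, width=40):
    """Render one text bar per country, scaled to the largest count."""
    largest = ranked[0][1] if ranked else 0
    lines = []
    for country, count in ranked:
        bar = "#" * (round(width * count / largest) if largest else 0)
        lines.append(f"{country:<12}{bar} {count}")
    return lines


# Placeholder values for illustration only.
latest = {"Country A": 500, "Country B": 1200, "Country C": 80, "Country D": 950}
ranked = top_countries(latest, n=3)
for line in bar_lines(ranked):
    print(line)
```

A plotting library would normally replace the text bars; the sorting and truncation step is the part that determines which countries appear.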

C. SIR Model
Here we briefly explain the features of the basic Susceptible-Infected-Recovered (SIR) model, which is used to describe the recent COVID-19 outbreak. The original SIR model, in which Kermack and McKendrick modified a Malthusian growth model, is a well-known model for simulating epidemic growth using ordinary differential equations (ODEs).
The SIR model describes the infection through three states. The first is the susceptible state (S), in which an agent may become infected at any given point in time. The second is the infected state (I): when many neighbors are in this state, the agent switches to it. An agent moves to the third state, recovered (R), after a given period [21]. The SIR diagram in Fig. 7 shows how individuals move between states.
According to [22], S, I, and R denote the susceptible, infectious, and removed individuals, and the parameters β and γ are the rate of infection and the rate of recovery. The equations for every time t are defined as follows:

$$\frac{dS(t)}{dt} = -\beta S(t) I(t), \qquad \frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t), \qquad \frac{dR(t)}{dt} = \gamma I(t)$$
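To make the dynamics of these equations concrete, the following minimal sketch integrates them numerically in Python with a forward-Euler scheme. It is not the implementation used in this study (which was done in R); the step size, the initial conditions, and the values of β and γ are hypothetical, and S, I, and R are expressed as fractions of the population.

```python
def sir_derivatives(s, i, r, beta, gamma):
    """Right-hand sides of the SIR equations for the current state."""
    ds = -beta * s * i
    di = beta * s * i - gamma * i
    dr = gamma * i
    return ds, di, dr


def simulate_sir(s0, i0, r0, beta, gamma, days, steps_per_day=10):
    """Forward-Euler integration; returns one (S, I, R) tuple per day, day 0 included."""
    dt = 1.0 / steps_per_day
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            ds, di, dr = sir_derivatives(s, i, r, beta, gamma)
            s, i, r = s + dt * ds, i + dt * di, r + dt * dr
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameters chosen only for illustration.
trajectory = simulate_sir(s0=0.99, i0=0.01, r0=0.0, beta=0.5, gamma=0.1, days=150)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
```

Because the three derivatives sum to zero, the total S + I + R stays constant throughout the run, which is a useful sanity check on any implementation. A higher-order solver would be preferable for quantitative work.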
We first created a time series from the data, divided into infected and recovered counts. The generated time series is visualized in Fig. 7.
It is interesting to note that there are two sub-waves of infection, which may correspond to the time the coronavirus left China and spread worldwide. The two different growth phases in the number of reported cases (indicating spread) can be seen more clearly in the graph in Fig. 8, after calculating the rates of infection and recovery.
The first wave is considered to run from the beginning of the dataset until 14 February, and the second wave from 15 February onwards until four days before the current date. This cut-off lets us later compare the output of the model with the actual data. At the 95% confidence level, both a two-sided t-test and a KS-test were conducted with the null hypothesis that the two infection growth rates are the same; the output is shown in Table I. The t-test's p-value is only marginally below 0.05, whereas the KS-test's p-value is above 0.05. The inference is that the second wave is more aggressive on average than the first (rejecting H0 at the 5% significance level for the t-test), while the overall shape of the distribution is similar (not enough evidence to reject H0 at 5% for the KS-test).
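The comparison of the two waves can be sketched as follows in Python. This is an illustration under stated assumptions, not the procedure used to produce Table I: the growth rate is taken to be the relative day-over-day change, the t statistic is Welch's unequal-variance form (the source does not say which variant was used), and the input series are invented. Only the test statistics are computed; p-values would normally come from a statistics library, for example `scipy.stats.ttest_ind` and `scipy.stats.ks_2samp`.

```python
import math
from statistics import mean, variance


def growth_rates(series):
    """Relative day-over-day change; days whose previous value is 0 are skipped."""
    return [(cur - prev) / prev
            for prev, cur in zip(series, series[1:]) if prev != 0]


def welch_t_statistic(a, b):
    """Welch's t statistic; assumes each sample has at least two distinct values."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for value in sample if value <= x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in list(a) + list(b))


# Invented infected counts for two waves, for illustration only.
wave1 = growth_rates([100, 110, 125, 135, 150])
wave2 = growth_rates([150, 180, 210, 260, 300])
t_stat = welch_t_statistic(wave1, wave2)
d_stat = ks_statistic(wave1, wave2)
```

The two statistics answer different questions, which is why they can disagree: the t statistic compares mean growth, while the KS statistic compares the whole distribution of growth rates.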

IV. APPLYING THE SIR MODEL AND DISCUSSION
N denotes the total population considered. One condition the SIR equations have to satisfy for all t is:

$$S(t) + I(t) + R(t) = N$$

We look for linear growth in I in order to evaluate b and c, hence the motivation to divide the data into two waves, as discussed earlier. As we have found that both waves behave similarly, we carry out the modeling using the second wave only. The output is shown in Table II. The p-value for the coefficient is small, which shows that we have sufficient evidence to reject the null hypothesis and conclude that there is a linear correlation between growth and time. The R-squared values are also very close to 1, which means that the regression explains most of the variance. With c and b determined, we have the necessary coefficients to build the model. That is where the last equation comes in. The model is built on the data already available and predicts 150 days into the future.
Fig. 9 shows the output of the SIR model along with the data. Table III compares the actual data with the output of the SIR model; the numbers are very close to each other and some match exactly, which supports the effectiveness of the model. It is important to note that the recovery rate tends to be reliable, at least in the short term. In contrast, the infection rate led to underestimation when this model was previously tested. As more data become available with the progression of the disease, the predictions will improve. The results are generally reassuring in that, according to the model, the epidemic will soon be contained.
Coronavirus has recently become a concern for many, and the focus on it has intensified because of its rapid spread and lethality. In this paper, the SIR model was applied to predict 150 days into the future, and the predicted numbers were compared with the real numbers. The numbers are very close to each other and some match exactly, which supports the effectiveness of the model.
The results are generally reassuring that the epidemic, according to the model, will soon be contained.
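The coefficient-estimation step of Section IV can be illustrated with a short Python sketch. This is one common way of recovering the infection and recovery rates from daily S, I, R series by least squares through the origin; it is an assumption for illustration and not necessarily the regression reported in Table II. The helper names and the synthetic data generator are hypothetical, and the series are generated with known rates so that the estimate can be checked.

```python
def slope_through_origin(x, y):
    """Least-squares slope k of the model y = k * x (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def estimate_rates(s, i, r):
    """Estimate (beta, gamma) from daily series using one-day differences."""
    new_infections = [s[t] - s[t + 1] for t in range(len(s) - 1)]
    new_removals = [r[t + 1] - r[t] for t in range(len(r) - 1)]
    contact_term = [s[t] * i[t] for t in range(len(s) - 1)]
    beta = slope_through_origin(contact_term, new_infections)
    gamma = slope_through_origin(i[:-1], new_removals)
    return beta, gamma


def synthetic_sir(beta, gamma, days, s0=0.999, i0=0.001):
    """Generate daily S, I, R fractions with a one-day Euler step (test data only)."""
    s, i, r = [s0], [i0], [0.0]
    for _ in range(days):
        infections = beta * s[-1] * i[-1]
        removals = gamma * i[-1]
        s.append(s[-1] - infections)
        i.append(i[-1] + infections - removals)
        r.append(r[-1] + removals)
    return s, i, r


# Hypothetical true rates used to generate the synthetic series.
s, i, r = synthetic_sir(beta=0.3, gamma=0.1, days=30)
beta_hat, gamma_hat = estimate_rates(s, i, r)
```

On real data the fit will not be exact, because reported counts are noisy and the susceptible population is not observed directly; the residuals of the two regressions give a rough indication of how well the SIR assumptions hold.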

ACKNOWLEDGMENT
The researcher is a master's student in the big data course and wishes to thank the instructor, Dr. Shakir Khan, for his valuable support in this research.