A Novel Machine Learning based Model for COVID-19 Prediction

The World Health Organization (WHO) gave the name COVID-19 to the disease caused by the novel coronavirus that emerged at the end of 2019. Coronaviruses are a family of viruses named for the spiky crown on their outer surface. The novel coronavirus, also known as SARS-CoV-2, is a contagious respiratory virus that was first reported in Wuhan, China. Because of its rapid and sudden spread, COVID-19 has attracted the attention of scientists and researchers all over the world. Researchers in the data science field are analyzing worldwide infection cases day by day to gain a complete statistical view of the current situation. In this paper, a novel approach to predicting the daily infection records for COVID-19 is presented. The model is applied to Egypt as well as the 10 highest-ranked countries based on the number of cases and rate of change. The proposed model is implemented using supervised machine learning regression algorithms. The dataset used for prediction was issued by WHO, starting from 22 Jan 2020.
Keywords: Coronavirus; COVID-19; coronavirus in Egypt; supervised machine learning; regression models


I. INTRODUCTION
The outbreak of COVID-19 began in Wuhan, China, at the end of 2019. The new virus is a coronavirus that, like the SARS virus, affects the respiratory system. The virus has a protein membrane with a diameter of 50-200 nm that envelops the viral RNA and carries the spike-like projections that give the virus its distinctive crown-like shape [1], [2]. "Fig. 1" shows the internal structure of the virus [3].
Governments seek rational decisions to address the COVID-19 epidemic, and prediction is one of the most important tools for doing so. Prediction models estimate the number of new daily confirmed cases, recoveries, and deaths. Predicting newly confirmed cases helps governments update their precautionary procedures and prepare the required hospital equipment and staff.
Currently, the main goals are to find a cure for the deadly virus and to predict its rate of spread. Many studies in the data science field have examined the statistics of the virus. Furqan Rustam et al. [4] tested a set of machine learning models to predict the number of future COVID-19 patients: linear regression, least absolute shrinkage and selection operator (LASSO), support vector machines, and exponential smoothing. The exponential smoothing model provided the most accurate predictions, while support vector machines performed worst among the four models.
Nanning Zheng et al. [5] proposed a hybrid Artificial Intelligence (AI) model that combines Natural Language Processing (NLP) and a Long Short-Term Memory (LSTM) network with the Improved Susceptible-Infected (ISI) model to predict future cases in China. The predictions of the proposed model closely matched the actual epidemic cases.
In [6], Li Yan et al. proposed a machine learning model that uses 3 clinical parameters to predict the death rate of current patients. The accuracy of the proposed model is more than 90%.
Renato R. Silva et al. [7] used a Bayesian methodology to detect the peak of the outbreak in one of the Brazilian states (Goiás) based on the number of confirmed cases. They found that the peak would be reached 7 to 10 weeks after the beginning of the crisis, assuming no change in governmental control during that period.
In this paper, a novel prediction model that predicts the number of new confirmed cases is presented. The proposed model uses a set of statistical techniques in a supervised machine learning process. The model is tested on Egypt as well as the top 10 ranked countries for COVID-19 until the end of September 2020. The results of the proposed model are compared against the Bayesian Ridge regression model. The rest of the paper is organized as follows. In Section 2, the distribution of COVID-19 all over the world is presented. Section 3 shows COVID-19 in Egypt. The proposed model is explained in Section 4, followed by the experimental results in Section 5. Finally, the conclusion is presented in Section 6.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020, www.ijacsa.thesai.org

II. STATE OF THE ART
As subsets of Data Science, Artificial Intelligence (AI) and Machine Learning (ML) are playing a major role in the analysis and visualization of the COVID-19 crisis. Their predictions help healthcare systems and government institutions speed up investigations into the rapid and severe spread of the virus.
"Fig. 2" illustrates the distribution of COVID-19 cases all over the world. From 22 January until the end of September 2020, the numbers of confirmed cases, deaths, and recoveries were 38,917,803, 1,098,254, and 26,885,286, respectively.
"Table I" and "Fig. 3" show the numbers of confirmed cases, deaths, and recoveries worldwide, grouped by continent. As seen, Europe is the most affected continent, followed by Asia, America, Africa, and finally Australia. "Table II" and "Fig. 4" show the numbers of confirmed cases, deaths, and recoveries for the 10 most infected countries as classified by WHO on 30 September. "Table III" and "Fig. 5" show the percentage rate of change for the top 10 countries.
In this paper, the rank of Egypt based on the number of confirmed cases and the rate of change is calculated. Egypt ranks 43rd worldwide by number of infections, while it ranks 143rd by rate of change. These calculations use the WHO dataset until the end of September 2020.
Several regression models in the literature predict the number of new confirmed COVID-19 cases, such as Support Vector Machines (SVM) [8], linear regression [9], binomial regression [10], and Bayesian Ridge regression [11].
In our experiments, the Bayesian Ridge regression model provided the most accurate predictions among these techniques.

IV. PROPOSED PREDICTION MODEL
The main idea of this paper is to build a hybrid model based on mathematical and statistical approaches in a machine learning environment. The model is built on the rate of change, the geometric mean, and the standard deviation [12] [13] [14] [15].
As seen in "Fig. 7", the dataset, consisting of the number of confirmed COVID-19 cases [16], the number of deaths [17], and the number of recoveries [18], is collected. Then the data is preprocessed and split into 80% training and 20% testing. The steps of the proposed model are explained in detail with an example on Egypt.
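The 80%/20% split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say how the split is drawn, so the sketch assumes a chronological split (earliest days for training, latest days for testing), and the function name `chronological_split` and the sample counts are hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered list into a leading training part and a trailing test part.

    Assumption: the split is chronological; the paper only states the 80/20 ratio.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical confirmed-case counts, for illustration only (not WHO data).
confirmed = [100, 120, 150, 190, 240, 300, 370, 450, 540, 640]
train, test = chronological_split(confirmed)
print(len(train), len(test))
```

Keeping the time order intact means the test set contains only days that come after every training day, which matches how a daily forecast would be used in practice.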
"Table V" shows the sample of data used as an example where, X represents the days starting from the 110th to the 119th day from the start of the epidemic while Y represents the number of confirmed cases on each day (each of the numbers below is multiplied by ).

A. Steps of the Proposed Model
Step 1: Splitting the dataset. The dataset was split into a training set and a testing set of 80% and 20%, respectively.
Step 2: Calculating the number of newly confirmed cases. The new cases are calculated as
$N[i] = C[i] - C[i-1]$
where $C[i]$ is the number of confirmed cases at day $i$.
Step 3: Calculating the Rate of Change (RoC). The RoC of the newly confirmed cases is calculated by the following formula [19]:
$RoC[i] = \frac{N[i] - N[i-1]}{N[i-1]}$
Step 4: Calculating the Geometric Mean (GM). It is the $n$-th root of the product of the RoC values for $n$ days [20] [21]:
$GM = \sqrt[n]{\prod_{i=1}^{n} RoC[i]}$
Here, in this example, GM = 0.149956677.
Step 5: Calculating the standard deviation. The standard deviation (STD) measures the extent of deviation of the values from the average value:
$STD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}$
where $x_i$ are the values, $\bar{x}$ is their average, and $n$ is the number of days.
Here, in this example, STD = 0.1334.
Step 6: Calculating the newly expected cases and boundaries. The new expected cases based on the proposed model are calculated using the following formula:
"Table VII" presents the results of steps 4, 5, and 6: GM, STD, and the boundaries.
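The intermediate quantities of the steps above can be sketched in Python as follows. The sketch rests on several assumptions that are not confirmed by the text: new cases are taken as the day-over-day difference of the confirmed counts, the RoC as the relative change between consecutive days, the GM as the n-th root of the product of n strictly positive RoC values, and the STD in its population form (dividing by n). The function names and the sample counts are hypothetical, and the final formula for the expected cases and boundaries is not reproduced because it cannot be recovered from the text.

```python
import math


def new_cases(confirmed):
    """Day-over-day difference of the confirmed counts (assumed form of Step 2)."""
    return [confirmed[i] - confirmed[i - 1] for i in range(1, len(confirmed))]


def rate_of_change(values):
    """Relative change between consecutive days (assumed form of Step 3).

    Raises ZeroDivisionError if a previous-day value is zero.
    """
    return [(values[i] - values[i - 1]) / values[i - 1] for i in range(1, len(values))]


def geometric_mean(values):
    """n-th root of the product of n values, computed via logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def population_std(values):
    """Square root of the mean squared deviation from the average (divides by n)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Hypothetical confirmed counts, for illustration only (not WHO data).
confirmed = [1000, 1100, 1220, 1364, 1537, 1745]
daily = new_cases(confirmed)
roc = rate_of_change(daily)
gm = geometric_mean(roc)
std = population_std(roc)
print(daily, roc, gm, std)
```

The logarithmic form of the geometric mean avoids overflow or underflow when many values are multiplied. A series with falling case counts produces negative RoC values, for which this form of the geometric mean is undefined; the sketch raises an error in that case rather than guessing how the model handles it.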

B. Testing the Model Accuracy
The accuracy of the proposed model is tested using both the Mean Square Error (MSE) [22] [23] and the correlation (R) between the expected values and the real values [24] [25].
$MSE = \frac{1}{n}\sum_{i=1}^{n} d_i^{2}$
where $d_i$ is the difference between the expected and real data on day $i$ and $n$ is the number of days.
Here, in the example, MSE = 4716.571429.
Here, in the example, R = 0.880848. This indicates a strong relation between the expected and real values, which suggests that the model has high accuracy.
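Both accuracy measures can be computed with a short Python sketch. The MSE follows its standard definition as the mean of the squared differences. For R, the sketch assumes the Pearson correlation coefficient, since the text does not state which correlation is used. The function names and the sample values are hypothetical and are not taken from the paper's tables.

```python
import math


def mean_square_error(expected, real):
    """Mean of the squared differences between expected and real values."""
    if len(expected) != len(real) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((e - r) ** 2 for e, r in zip(expected, real)) / len(expected)


def pearson_r(expected, real):
    """Pearson correlation coefficient (assumed meaning of R in the text)."""
    if len(expected) != len(real) or len(expected) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_r = sum(real) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(expected, real))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_r = sum((r - mean_r) ** 2 for r in real)
    return cov / math.sqrt(var_e * var_r)


# Hypothetical expected and real daily cases, for illustration only.
expected = [110.0, 125.0, 140.0, 170.0]
real = [100.0, 120.0, 144.0, 173.0]
mse = mean_square_error(expected, real)
r = pearson_r(expected, real)
print(mse, r)
```

The two measures answer different questions: MSE penalizes the size of the errors in the units of the data squared, while R only measures how closely the expected values track the shape of the real series, so a model can have a high R and still a large MSE.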

V. EXPERIMENTAL RESULTS
The proposed model is compared against the Bayesian Ridge regression model, as it was the most accurate model for COVID-19 prediction among the state-of-the-art techniques considered.
"Fig. 8" illustrates the daily cases predicted by the proposed model for the 10 highest-ranked countries for COVID-19 until the end of September 2020. These 10 countries are the USA, India, Brazil, Russia, Argentina, Colombia, Spain, Peru, Mexico, and France. "Fig. 9" compares the proposed model with the Bayesian Ridge model for Egypt as well as the 10 highest-ranked countries over the same period. The red lines in the figure represent the daily prediction results, while the blue lines represent the real values. As seen, the proposed model is more accurate than its counterpart for all the countries.

The sharp peaks in the prediction results for France are due to the sudden jump in the number of cases during that period.
The Mean Square Error (MSE) values for the two models are presented in "Table VIII". As seen, the MSE of the proposed model is lower than that of the Bayesian Ridge model for all the tested countries.

VI. CONCLUSION
The COVID-19, or coronavirus, pandemic is a danger that threatens both people and governments all over the world. Many studies have tried to predict the number of newly infected cases, deaths, and recoveries. In this paper, a new hybrid machine learning model is proposed to predict the newly expected infections. The model is tested on Egypt as well as the 10 highest-ranked COVID-19 countries until the end of September 2020. The proposed model is compared against one of the most accurate prediction models found in the literature, the Bayesian Ridge model. Results showed the strength of the proposed model compared to its counterpart for all the countries under study.