A Covid-19 Positive Case Prediction and People Movement Restriction Classification

The world experienced a pandemic, Coronavirus Disease 2019 (COVID-19), that changed people's daily lives. In Jakarta, COVID-19 cases were discovered on March 18, 2020, and the number of cases increased uncontrollably until the government imposed a movement restriction called pembatasan sosial berskala besar (PSBB). The effectiveness of the movement restriction has not been evaluated in detail. Therefore, we investigated the COVID-19 cases in the PSBB period to understand the contribution of movement restriction. Moreover, a prediction model is proposed to automate the movement restriction decision. The models are divided into regression and classification models. The regression model is developed to forecast the number of infected cases, while the classification model is used to identify the most suitable movement restriction type. We use a data transformation, Principal Component Analysis (PCA), to reduce the number of features. In our case, the best regression method is Multiple Linear Regression (MLR), and the best classification method is the Support Vector Machine (SVM). The MLR results are 148.38, 37036.37, and 0.250336 for Mean Absolute Error (MAE), Mean Square Error (MSE), and R², respectively. The SVM achieved an accuracy of 84.81%. Moreover, the prediction system was successfully deployed on a website.


I. INTRODUCTION
Coronavirus Disease 2019 (COVID-19) was first detected in China at the end of 2019. The virus causes symptoms such as fever, coughing, pneumonia, and diarrhea in patients [1]. On December 31, 2019, the World Health Organization (WHO) received a report from the Chinese government about cases of pneumonia in Wuhan, Hubei Province, China. A week later, the Chinese government confirmed that COVID-19 had been identified as the cause of the pneumonia [2]. The media then reported that many new cases were recorded in other countries because international travel and trade were operating as usual.
As the center of government and trade in Indonesia, Jakarta has the highest population density compared to other regions [3]. Consequently, the number of confirmed positive COVID-19 cases in Jakarta increased. On March 18, 2020, there were 171 cases. The number of positive cases then grew drastically: by April 9, 2020, Jakarta had recorded 1776 cases, about ten times the number on March 18, 2020. The average increase from March 18 to April 9, 2020, was 70 cases per day. However, at that time, the government had not taken any corrective action. In contrast, Malaysia imposed a national-scale Movement Control Order (MCO), which began on March 18, 2020 [4]. After three weeks without corrective action, on April 10, 2020, the DKI Jakarta Regional Government imposed a pembatasan sosial berskala besar (PSBB) consisting of two categories: a Strict PSBB and a Transition PSBB. The Strict PSBB was in force from April 10 to June 4, 2020, and from September 14 to October 11, 2020. The Transition PSBB took place from June 5 to September 10, 2020, and from October 12 to October 25, 2020.
Based on [5], [6], the daily trend of confirmed positive cases under the Strict PSBB tends to be stable compared to the period without PSBB (No-PSBB). This shows the effect of restricting community movement. To reduce the economic burden, the government relaxed community activities by imposing the Transition PSBB. However, the number of positive cases per day then increased significantly. Comparing the Transition PSBB and No-PSBB, the condition under the Transition PSBB is more worrying, as both the rate and the cumulative number of cases are much higher. Comprehensive data collection can help to gain more insight. Moreover, an artificial intelligence system can help the government decide the type of movement restriction. Therefore, the research questions are as follows:
1) What factors or features affect the number of daily positive cases?
2) How can the number of daily positive cases be predicted?
3) How can the type of movement restriction be determined under certain conditions?
This paper is organized into five sections: introduction, related work, methods, result and discussion, and conclusion and future work. The introduction section gives a brief explanation of the problem and the purpose of the investigation. The related work section summarizes the work of other researchers who solved COVID-related prediction problems. The methods section describes the data acquisition, preparation, exploration, transformation, and framework. The result and discussion section reports the experimental findings. The conclusion and future work section contains a final comment on the current work and potential future research.

II. RELATED WORK
Infectious disease outbreaks such as food contamination, severe acute respiratory syndrome (SARS), dengue, malaria, and influenza [7]- [9] have been analyzed using artificial intelligence (AI) approaches. The most common research task is to forecast the number of infected individuals using machine learning methods. As the data is mostly time series, the model can be linear or nonlinear [10]- [12]. An AI system can help prevent an outbreak from becoming uncontrollable because it estimates the number of infected people in advance. The government can then prepare medical facilities and staff to match the expected number of patients. Anticipating the worst-case scenario makes us better prepared to confront the outbreak.
A machine learning-based model [13] has been deployed to predict positive COVID-19 cases. The results are satisfying, with R² values of 0.9998, 0.9996, and 0.9999 for the Gompertz, Logistic, and Artificial Neural Network models, respectively. Based on the results, trends can be forecasted and extended until the end of the pandemic in Mexico. However, their models were only able to predict new COVID-19-positive cases. Velásquez and Lara [14] used a reduced-space Gaussian process regression model to forecast 82 days of positive, dead, and recovered cases in the USA. Compared to the actual values, the model generated the expected case values with a significant correlation coefficient. Buckingham-Jeffer [15] performs prediction using stochastic SIR (susceptible-infectious-removed) and SEIR (susceptible-exposed-infectious-removed) models. Both models can forecast the number of positive cases over a given time frame using maximum likelihood inference.
A typical time-series experiment uses the Autoregressive Integrated Moving Average (ARIMA) model to estimate the number of COVID-19 confirmed cases, fatalities, and recoveries [16]. If the current trend in Pakistan continues, the number of actual cases might triple by May. The investigation concludes that the statistics are accurate and that the trends will continue to rise in the next month. Unfortunately, the ARIMA model cannot comprehend the provided historical data pattern, so their findings are questionable.
Kamarudin et al. [17] compared ARIMA with machine learning methods for positive, dead, and recovered cases in Malaysia. They compared support vector regression (SVR), Gaussian process (GP), linear regression (LR), neural network (NN), and ARIMA. They found that the NN outperforms most of the methods in positive and recovered case prediction, while ARIMA predicts the number of dead cases accurately. The most unreliable model is LR because it greatly overestimated the number of cases. Meanwhile, the GP and SVR show promising results, as their Root Mean Square Error (RMSE) values are close to the ARIMA value.
The previous literature shows that machine learning methods can forecast the severity of an outbreak by predicting the number of infected people. In this study, not only is the number of positive cases predicted, but the type of movement restriction is also predicted using machine learning methods. Many countries have shown that movement restriction can reduce the number of positive and dead cases [4], [18]- [23]. Multiple Linear Regression (MLR), Support Vector Regression (SVR), Random Forest Regressor, and Decision Tree Regressor are compared to find the best regression method [24]- [26]. Classification methods are then used to determine the movement restriction status. The Logistic Regression, Support Vector Machine (SVM), Decision Tree, Naïve Bayes, KNN, AdaBoost, and XGBoost methods are compared to find the most successful approach for predicting the correct movement restriction. Moreover, Principal Component Analysis (PCA) and the grid search algorithm are applied to understand the effect of feature reduction and hyperparameter optimization on accuracy, Mean Absolute Error (MAE), and Mean Square Error (MSE) [27]- [29].
The number of positive cases can be predicted using a regression model. The predicted number of cases is then combined with factual information available at that time (e.g., the number of public transport passengers, wind speed, and congestion level) to predict the type of restriction. The number of positive cases needs to be predicted because the actual number is unknown at that specific time. A prototype web application is also developed and deployed to demonstrate the proposed system in a real environment.

III. METHOD
We collect all data, such as COVID-19 patient data, weather and climate data, and public transportation user data. The data is then cleaned using an imputation method to recover missing values and a scaling method where features are not on the same scale. The clean data is then examined with statistical and visual analysis. Only after gaining a deeper understanding can features be added or removed, depending on the findings from the visualized data. Once the data is ready, a regression algorithm is trained on it to predict the daily number of COVID-19 patients. The prediction results are fed into the classification system, which determines the type of restriction. The mean absolute error (MAE), the mean square error (MSE), and accuracy are used as evaluation metrics.
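The cleaning step can be illustrated with a minimal sketch. The paper does not state which imputation or scaling method was used, so mean imputation and min-max scaling are assumptions here, and the feature names and values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical daily records; None marks a missing measurement.
raw = {
    "wind_speed": [2.0, None, 4.0, 3.0],
    "humidity": [70.0, 80.0, None, 90.0],
}

# Impute first, then scale, so the filled value is scaled like the rest.
clean = {name: min_max_scale(impute_mean(col)) for name, col in raw.items()}
```

After this step every feature lies in the same range, so features with large units (such as passenger counts) do not dominate features with small units (such as wind speed).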

A. Data Acquisition
Based on the problem formulation, we collected data that might affect the number of positive cases. To get complete COVID-19 data for Jakarta, we accessed the following website: https://corona.jakarta.go.id. The website is shown in Fig. 1. It presents various information related to COVID-19 in Jakarta, such as the number of specimens tested, the number of people vaccinated, and the number of recoveries. Because we could not download the data directly from the website, we traced the data source and found that the dashboard was built with Tableau. The Tableau address is as follows: https://public.tableau.com/app/profile/jsc.data/viz/dashboardcovid-19jakarta_15837354399300/dashboard22 From that page, we can download the data as a PDF file, as shown in Fig. 2. We then converted the PDF data to an Excel file with a PDF editor application (Nitro Pro). Next, we looked for movement data of Jakarta residents in the form of the number of users of public transportation modes. Train passenger data are taken from the following site: https://www.bps.go.id/indicator/17/72/2/sum-penumpangkereta-api.html. From that website, we can download the data as an Excel file, as shown in Fig. 3. We then took the passenger data for Transjakarta, Mass Rapid Transit (MRT), and Light Rail Transit (LRT) from https://data.jakarta.go.id/dataset. In addition to the three transportation datasets, the site provides other data, such as the number of residents moving to Jakarta and the number of family planning participants, as shown in Fig. 4. We can perform a search, as shown in Fig. 4, and then select the data we want to access. For the MRT data, we can download the data by clicking Download Data, as shown in Fig. 5, which yields a comma-separated values (CSV) file. We used the same method to obtain the LRT and Transjakarta data.
The addresses of the datasets are as follows:
https://data.jakarta.go.id/dataset/data-penumpangtransjakarta-2020/
https://data.jakarta.go.id/dataset/data-penumpang-mrt-diprovinsi-dki-jakarta
https://data.jakarta.go.id/dataset/data-penumpang-lrt-diprovinsi-dki-jakarta
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 8, 2022 www.ijacsa.thesai.org
For data on weather conditions and air quality in Jakarta, we use two sources:
1) https://dataonline.bmkg.go.id/data_iklim
2) https://aqicn.org/data-platform/covid19/
For the Badan Meteorologi, Klimatologi, dan Geofisika (BMKG) online data, we need to create an account first, and we can only download the data in periods of 30 days, as shown in Fig. 6. On the Air Quality Open Data Platform, we can download air condition data from around the world for 2015 to 2021, as shown in Fig. 7. From BMKG, we can obtain data on rainfall, duration of sun exposure, wind speed, humidity, etc. From the Air Quality Open Data Platform, we collected PM10 and PM2.5 data, which are international standard air condition indexes. Congestion data are taken from https://www.tomtom.com/en_gb/traffic-index/jakarta-traffic/. Only monthly data are available as an indicator of congestion, as shown in Fig. 8. The index ranges from 0 to 1, where 0 means no congestion and 1 means severe congestion.
For data on the status of people's movement restrictions, we use the following news article: https://news.detik.com/berita/d-5167032/timeline-psbbjakarta-to-tarik-rem-darurat From this article, we can determine the periods of strict PSBB, transition PSBB, and no PSBB. At this point, the data we need is complete, but it is not yet well organized or validated.

B. Data Preparation
Once we have all the required data, we inspect it. The first criterion is the availability of data during the PSBB period, which is March 10, 2020 to November 25, 2020. After investigating the dataset, we found that the downloaded data meets this criterion. We then discard unnecessary information, such as the highest wind direction, maximum wind speed, and maximum/minimum temperature, because we only use average values such as average wind speed and average temperature. In addition, we discard data on the number of suspects, the number of negative COVID-19 results, and the daily percentage of cases because this information is already captured by the daily positive number and the daily test number. After the unneeded data is removed, we combine the data into one comprehensive table.
This research addresses two problems: daily positive case prediction and movement restriction type classification. We expect the predicted positive case count to be used as pseudo data to indicate the type of movement restriction. Table I

C. Data Exploration
Descriptive statistics are given in Table II. From Table II, we can find the average, maximum, and minimum value of each attribute or feature. None of these statistical values is anomalous, so we consider the combined data valid. Fig. 9 shows that the daily positive number (red) changes along with the number of people tested for COVID-19 per day (blue). This indicates that the daily positive number is strongly influenced by the number of people tested.
Fig. 9. The Positive Case vs the Sample Tested.
Fig. 10 shows a strong correlation between the number of daily tests and daily positives. There is a weak correlation between the number of bus passengers and the daily tests. There is also a weak correlation between congestion levels and daily tests. Since we do not have daily data on congestion levels, we use monthly data to create the scatter plot shown in Fig. 12. There is a weak correlation between congestion and the positive total. Because the number of bus passengers per day is not available, we use the daily average derived from the monthly number of passengers, as shown in Fig. 13. We find a weak positive correlation between bus passengers and the positive total.
There is a very weak correlation between solar irradiance and daily positives, as shown in Fig. 14. There is a very weak negative correlation between wind speed and the positive total, as shown in Fig. 15. Based on Fig. 16, there is a difference in the median value between no PSBB and strict PSBB. However, there was no significant difference between the three PSBB statuses in the number of daily positive cases.

D. Data Transformation
Since we do not have daily data on the number of KRL passengers, we use the mean of the total monthly passengers. However, assigning the same value to every day of a month is very unrealistic, so we transform the data by generating random daily values whose mean is the mean of total passengers and whose standard deviation is 0.2 * mean, as shown in Fig. 17. The same transformation is needed to use the total monthly passengers, the number of LRT passengers, the number of bus passengers, and the congestion data in a realistic way. The random data are generated using the mean and the standard deviation (calculated as 0.2 * mean) of the respective datasets. The transformed datasets are shown in Fig. 18, Fig. 19, Fig. 20, and Fig. 21, respectively. To see the correlation between the PSBB status and the other attributes, the PSBB status must be transformed with one-hot encoding. The Pearson correlation method then yields the correlation table shown in Fig. 22. Based on Fig. 22, the daily test has the strongest correlation, at 0.92. Bus_baru, the number of bus passengers per day, has a weak correlation of 0.26. Light, the duration of irradiation in hours, has a correlation of 0.21. pm10, the air pollution index, has a weak correlation of 0.21. Humidity has a weak correlation of -0.14. Table III.
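The transformation, encoding, and correlation steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the paper gives only the mean and the standard deviation (0.2 * mean), so the normal distribution, the clipping at zero, the fixed seed, the monthly mean of 100,000, and the 10-day status blocks are all hypothetical choices.

```python
import math
import random

def daily_from_monthly_mean(monthly_mean, n_days, rng):
    """Draw n_days values with mean monthly_mean and std 0.2 * mean.

    Assumption: values are normally distributed and clipped at zero.
    """
    std = 0.2 * monthly_mean
    return [max(0.0, rng.gauss(monthly_mean, std)) for _ in range(n_days)]

def one_hot(labels):
    """Map each distinct label to a 0/1 indicator column."""
    categories = sorted(set(labels))
    return {c: [1 if lab == c else 0 for lab in labels] for c in categories}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
bus_daily = daily_from_monthly_mean(monthly_mean=100_000.0, n_days=30, rng=rng)

# Hypothetical PSBB status for the same 30 days.
status = ["no_psbb"] * 10 + ["strict"] * 10 + ["transition"] * 10
encoded = one_hot(status)

# Correlation between one indicator column and the generated series.
r = pearson(encoded["strict"], bus_daily)
```

Note that the sample mean of the generated values only approximates the monthly mean. Because the daily values are random, any correlation computed from them reflects the monthly level rather than true day-to-day variation.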

E. Framework
The aim of this project is to create a system that can predict the number of people infected with COVID-19 and determine which restrictions should be imposed. Therefore, the PSBB status and the number of daily positive confirmations are the labels to be predicted. We created two models: a regression model and a classification model. The regression model predicts the daily confirmed number of COVID-19 cases under given conditions, and the classification model determines which restriction should be imposed given the daily positive count, as shown in Fig. 24.
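The two-stage framework can be sketched with scikit-learn as below. The data are synthetic stand-ins, not the paper's dataset, and the feature count, the label thresholds, the number of PCA components, and the helper name `recommend` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's dataset): 200 days, 5 features,
# e.g. daily tests, bus passengers, sunshine hours, PM10, humidity.
n_days, n_features = 200, 5
X = rng.normal(size=(n_days, n_features))
daily_positive = 500 + 300 * X[:, 0] + rng.normal(scale=50, size=n_days)
# Hypothetical labels: 0 = no PSBB, 1 = strict PSBB, 2 = transition PSBB.
# The thresholds are arbitrary and only give the classifier something to learn.
restriction = np.digitize(daily_positive, bins=[400, 600])

# Stage 1: regression model for the daily positive count.
regressor = LinearRegression().fit(X, daily_positive)

# Stage 2: classifier trained on the features plus the positive count.
X_cls = np.column_stack([X, daily_positive])
classifier = make_pipeline(StandardScaler(), PCA(n_components=3), SVC())
classifier.fit(X_cls, restriction)

def recommend(features):
    """Predict the positive count, then feed it to the classifier."""
    features = np.asarray(features, dtype=float).reshape(1, -1)
    predicted_cases = regressor.predict(features)
    stage2_input = np.column_stack([features, predicted_cases])
    label = classifier.predict(stage2_input)[0]
    return float(predicted_cases[0]), int(label)

cases, label = recommend(X[0])
```

At prediction time the true positive count is unknown, so the classifier receives the regression output in its place. This is why errors in the first stage can propagate to the recommended restriction type.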

IV. RESULT AND DISCUSSION
We first built a regression model and compared the evaluation results with and without PCA. The grid search method is used to determine the best method and the best parameters. To produce a model that can predict future values, the data must first be divided into training and testing sets, as shown in Fig. 25. The best methods, parameters, and evaluation results are shown in Table IV. Based on Table IV, the best method is Multiple Linear Regression (MLR), with MAE = 148.3892 and MSE = 37036.37. Looking at the R² value, only MLR without PCA obtains a positive value. Therefore, MLR without PCA was chosen as the regression method.
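A grid search that compares regressors with and without PCA can be sketched as follows. The data are synthetic, and the chronological 80/20 split, the `TimeSeriesSplit` cross-validation, the candidate models, and the parameter values are assumptions; the paper's actual grid is the one reported in Table IV.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's dataset).
X = rng.normal(size=(250, 5))
y = 500 + 300 * X[:, 0] + 40 * X[:, 1] + rng.normal(scale=50, size=250)

# Chronological split: the last 20% of days form the test set.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("model", LinearRegression()),
])

# Each grid entry toggles PCA on or off and swaps the regressor.
param_grid = [
    {"pca": [PCA(n_components=3), "passthrough"],
     "model": [LinearRegression()]},
    {"pca": [PCA(n_components=3), "passthrough"],
     "model": [SVR()], "model__C": [1.0, 10.0, 100.0]},
]

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test period.
y_pred = search.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```

`search.best_params_` reports whether the winning configuration used PCA. R² is negative whenever a model's squared error on the test set exceeds that of always predicting the test-set mean, so a negative R² indicates a model that is worse than that constant baseline.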
Next, we determine a suitable classification method, with and without PCA, and find the best parameters with grid search. The results of the grid search are shown in Table V. Based on Table V, the selected classification method is a support vector machine (SVM) with PCA, with a training accuracy of 0.72 and a testing accuracy of 0.8481. Because the testing accuracy exceeds 0.7, the SVM method is good enough to predict which kind of restriction should be applied to the people of Jakarta. In Fig. 26, after all form fields are filled in, the system predicts that the confirmed number of COVID-19 cases on that day is 871 people. In addition, the system recommends implementing the Transition PSBB.
In this study, the data collection and data preprocessing were explained in detail to show the reader how to collect and prepare the data before it is fed to the classification or regression methods. With the detailed explanation of both tasks, readers can follow and develop their own data acquisition and preprocessing tasks. The correlation analysis showed that the number of people who take the COVID-19 test significantly affects the number of confirmed COVID-19 cases. Adding a data transformation such as PCA can enhance the accuracy of SVM. However, MLR did not improve with PCA. In addition, based on the test results, we obtained an accuracy of 84%, which is a fairly good result. With this level of accuracy, we can be confident in using the model in actual cases. The deployment results show that the machine learning model makes predictions as intended. The model can predict the number of COVID-19 cases and provide recommendations for restricting the movement of people.
Even though the classification accuracy is relatively high, the system is still limited to the Jakarta region. Expanding the research to the whole nation (Indonesia) would make it more conclusive and comprehensive. The challenge will be data collection, since most regions do not have interactive web-based COVID-19 data, so collaboration with the government must be established. The data transformation is still limited to PCA. A more extensive study of data transformation would be worthwhile, as many feature extraction and selection methods are available. Moreover, the technique is still limited to conventional machine learning approaches. Deep learning algorithms such as long short-term memory (LSTM) and convolutional neural networks (CNN) are interesting directions to explore.