Light Gradient Boosting with Hyper Parameter Tuning Optimization for COVID-19 Prediction

Abstract: The 2019 coronavirus disease (COVID-19) has caused a pandemic and a huge number of deaths worldwide. COVID-19 screening is needed to identify suspected positive cases, and it can reduce the spread of the disease. The polymerase chain reaction (PCR) test for COVID-19 analyzes a respiratory specimen. A blood test can also be used to identify people who have been infected with SARS-CoV-2. In addition, age contributes to the susceptibility to COVID-19 transmission. This paper presents Light Gradient Boosting classification with random over-sampling, considering blood and age parameters, for COVID-19 screening. This research proposes enhanced data preprocessing that uses the KNN Imputer to handle the large number of missing values. The experiments evaluated existing classification methods, namely Random Forest, Extra Trees, AdaBoost and Gradient Boosting, and the proposed Light Gradient Boosting with hyperparameter tuning, to measure the prediction of patients infected with SARS-CoV-2. The experiments used test data from the Albert Einstein Hospital in Brazil, consisting of 5,644 samples, 559 of which are from patients infected with SARS-CoV-2. The experimental results show that the proposed scheme achieves an accuracy of 98.58%, a recall of 98.58%, a precision of 98.61%, an F1-Score of 98.61%, and an AUC of 0.9682.


I. INTRODUCTION
Coronavirus disease 2019 (COVID-19) is a highly contagious viral infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. SARS-CoV-2 can cause tissue damage and acute respiratory distress syndrome. Its rapidly increasing transmission rate demands an early response to diagnose the disease and prevent its rapid spread [2]. COVID-19 is transmitted from human to human through the air, which causes the disease to spread widely [3]. One way to detect COVID-19 is the Reverse-Transcriptase Polymerase Chain Reaction test, also known as RT-PCR [4]. RT-PCR has high specificity and high sensitivity, but the resources for it are limited [5]. However, according to a validation study of the SARS-CoV-2 RT-PCR test [6], blood or hematological parameters showed high sensitivity and specificity as well as intra-test and inter-test precision and efficiency.
Machine learning can be an alternative for diagnosing and analyzing COVID-19 infection [7]. Machine learning has been widely used to investigate and help screen patients with suspected COVID-19 infection [8]. Applying machine learning to RT-PCR data with blood assessments plays a critical role in diagnosing COVID-19 and other respiratory diseases. The parameters involved include white blood cells, C-reactive protein, neutrophils, lymphocytes, monocytes, eosinophils, basophils, aspartate and alanine, lactate dehydrogenase, and others. These parameters have shown a strong correlation in patients diagnosed with COVID-19 [9]. In addition, age [10] also affects the susceptibility to COVID-19 transmission. This motivates researchers to investigate the parameters that significantly affect COVID-19 prediction.
This research presents a predictive model for diagnosing COVID-19 that considers C-reactive protein, neutrophils, lymphocytes, monocytes, eosinophils, basophils, aspartate and alanine, and lactate dehydrogenase, together with other blood parameters and age. The predictive model is built with ensemble learning methods, namely Random Forest, Extra Trees, AdaBoost, Gradient Boosting and Light Gradient Boosting, and the best model is then optimized with hyperparameter tuning. The experiments also investigate the best solution for imbalanced data by applying existing sampling methods: Random Under Sampling (RUS), Random Over Sampling (ROS) and the Synthetic Minority Over-Sampling Technique (SMOTE). Sampling approaches for class imbalance have already been used to handle imbalanced data in research related to COVID-19 [11]. This research is expected to yield a predictive model that achieves higher accuracy, recall, precision, F1-score and AUC than existing schemes.

II. RELATED WORK
Several studies have demonstrated the significance of blood tests for the diagnosis of COVID-19. One study [12] analyzed the blood indices of 69 COVID-19 patients, all of whom were treated at the National Center for Infectious Diseases (NCID) in Singapore. Among these patients, 65 underwent a complete blood count on the day of admission. In addition, demographic data including age, gender, ethnicity, and location were provided for the study. Around 13.4% of the patients required intensive care unit (ICU) care, especially the elderly. During the first examination, 19 patients had leukopenia (low white blood cell count) and 24 had lymphopenia (low lymphocyte level in the blood), with five cases categorized as severe by absolute lymphocyte count (ALC).
COVID-19 diagnosis based on blood tests has previously been applied to provide explainable results based on machine learning techniques, using public data from the Albert Einstein Hospital. Data preprocessing was first carried out to select the blood features. The features were then normalized with the z-score, and an iterative imputer method was used to fill in missing values. The remaining data covered 608 patients, 84 of whom were positive for COVID-19 as confirmed by RT-PCR [13]. To understand the decisions, a local Decision Tree Explainer (DTX) approach was used to obtain the results.
Data from the Hospital Israelita Albert Einstein in São Paulo, Brazil have also been used to apply machine learning to the diagnosis of COVID-19 with hematological parameters. Preprocessing was done by selecting features using particle swarm optimization (PSO) and evolutionary search (ES). Experiments were then carried out with different machine learning techniques. The experimental results show that Bayesian networks [7] outperform the other techniques, with an overall accuracy of 95.159%, a kappa index of 0.903, a sensitivity of 0.968, a precision of 0.938, and a specificity of 0.936.
Another study identified SARS-CoV-2 positive patients using 598 complete records; 5,046 records were not used because they were incomplete. An artificial neural network (ANN) model was tested on the dataset obtained from the Hospital Israelita Albert Einstein in São Paulo, Brazil, using various hematological parameters. The flexible ANN model [14] predicted COVID-19 patients with high accuracy, with an AUC of 94-95% for the population in the regular ward and an AUC of 80-86% for those not hospitalized or in the community.
Other research built a two-stage test: in stage one no preprocessing technique is applied, while in stage two preprocessing is emphasized to obtain better predictive results. Blood samples from patients at the Einstein Hospital in Brazil were collected and used to predict the severity of COVID-19 with learning algorithms. The Tuned Random Forest algorithm [15] produced an accuracy of 0.98 with several preprocessing methods.
Based on the related research described above, existing work considers few parameters for diagnosing COVID-19. There are quite a few studies on blood tests for the diagnosis of COVID-19. However, studies on eosinophils, age and blood parameters are rare in the literature. This study proposes preprocessing with the KNN imputer to handle the large number of missing values. Various sampling approaches for class imbalance are then evaluated to find the best one for imbalanced datasets. The prediction model is generated by classifying the data with ensemble methods, namely Extra Trees, Bagging Decision Tree, Random Forest, AdaBoost, Gradient Boosting and Light Gradient Boosting.

A. Ensemble Learning Classification Model
Ensemble learning classification models can increase computational costs [16], because several individual classifiers must be trained, and their computational requirements can grow exponentially when dealing with large-scale data.

B. Extra Trees
The extra trees classifier builds an ensemble of unpruned decision trees following the standard top-down procedure. The predictions of all trees are combined to determine the final prediction by majority vote [17]. The extra trees classifier generates multiple randomized decision trees from different sub-samples without bootstrapping. Extra trees can avoid over-fitting and improve accuracy [18]. Efficiency is also a main strength of this method.

C. AdaBoost
AdaBoost is an iterative algorithm in which, at each iteration, instances that were misclassified in the previous iteration are given more weight. The learning algorithm is applied sequentially to reweighted samples of the original training data. Initially, every instance is assigned the same weight; from one iteration to the next, the weights of all misclassified instances are increased and the weights of correctly classified instances are reduced [17]. The AdaBoost algorithm [19] is defined by:
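As an illustration of the reweighting described above, the following is a minimal sketch of one round of the discrete binary AdaBoost update. It is not taken from [19]: the function name `adaboost_round`, the label encoding in {-1, +1} and the clipping constant are assumptions made for the example.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting round of discrete binary AdaBoost (illustrative sketch).

    weights: current instance weights, assumed to sum to 1
    y_true, y_pred: labels encoded as -1 or +1 (an assumption of this sketch)
    Returns (alpha, new_weights).
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Clip to avoid division by zero (hypothetical guard value).
    err = min(max(err, 1e-12), 1 - 1e-12)
    # Weight of the weak learner in the final vote.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified instances (t * p == -1) gain weight, the others lose weight.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]          # the second instance is misclassified
weights = [1 / len(y_true)] * len(y_true)
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

In this toy case the single misclassified instance ends up with half of the total weight, so the next weak learner is pushed to classify it correctly.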

D. Gradient Boosting
Gradient Boosting is a machine learning algorithm that can solve regression and classification problems. It generates a prediction model in the form of an ensemble of weak prediction models, typically decision trees [20]. The idea of a gradient boosting decision tree (GBDT) is to combine a series of weak base classifiers into one strong classifier. Unlike conventional boosting methods that weight positive and negative samples, GBDT achieves global convergence by following the direction of the negative gradient [21]. The GBDT steps [21] are presented as follows.

Step 1: The initial constant value of the model, β, is given.

Step 2: For each iteration m = 1, ..., M (M is the number of iterations), the gradient direction of the residuals is calculated.

Step 3: Base classifiers are fitted to the sample data to obtain the initial model. Using the least squares approach, the parameters $a_m$ of the model are obtained and the model $h(x_i; a_m)$ is fitted.

Step 4: The loss function is minimized. According to Eq. (4), the new step size of the model, i.e. the weight of the current model, is calculated.

Step 5: The model is updated.
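The five steps above can be sketched in a few lines for the special case of squared loss on a single feature. This is only an illustration under stated assumptions: the base learner is a one-split regression stump, the step size of Step 4 is replaced by a fixed learning rate instead of a line search, and the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature (needs >= 2 distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    # Step 1: initial constant model (the mean minimizes squared loss).
    f0 = sum(ys) / len(ys)
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        # Step 2: negative gradient of squared loss, i.e. the residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 3: fit the base learner to the residuals by least squares.
        h = fit_stump(xs, residuals)
        stumps.append(h)
        # Steps 4-5: fixed step size (assumption) and model update.
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(h(x) for h in stumps)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((mean_y - y) ** 2 for y in ys) / len(ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Each round fits the current residuals, so the training error of `model` falls well below that of the constant baseline.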

E. Light Gradient Boosting
Light Gradient Boosting Machine, or LightGBM, is built on gradient boosting. Unlike the conventional approach, LightGBM does not split on feature values one by one, which would require calculating the split gain of every feature value. That exhaustive approach can indeed find the optimal split value, but it is costly and may generalize poorly when the amount of data is large [23]. The LightGBM algorithm is applied to the model to improve forecasting accuracy and robustness [22].
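Recall that the goal of supervised training in LightGBM is to find an approximation $\hat{f}$ of a particular function $f^{*}(x)$ [24]. LightGBM integrates $T$ regression trees to approximate the final model, which is
$$f_T(X) = \sum_{t=1}^{T} f_t(X)$$
Each regression tree can be expressed as $w_{q(x)}$ with $q \in \{1, 2, \dots, J\}$, where $J$ denotes the number of leaves, $q$ represents the decision rule of the tree, and $w$ is the leaf node weight vector. LightGBM is therefore trained additively, and Newton's method is used to approximate the objective function $\Gamma_t$ at step $t$. Here $g_i$ and $h_i$ indicate the first- and second-order gradient statistics of the loss function, and $I_j$ denotes the instance set of leaf $j$.
For a given tree structure $q(x)$, the optimal leaf weight $w_j^{*}$ of each leaf node and the extreme value $\Gamma_T^{*}$ can be solved as follows:
$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \Gamma_T^{*} = -\frac{1}{2} \sum_{j=1}^{J} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda}$$
where $\lambda$ is the regularization coefficient on the leaf weights.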
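For illustration, the closed-form leaf weight commonly used in Newton-style boosting, w* = -G / (H + λ) with G and H the sums of the first- and second-order gradients of the instances in a leaf, can be computed as sketched below. The choice of the logistic (log) loss, the function names and the value `reg_lambda=1.0` are assumptions made for this example, not settings reported in the paper.

```python
import math

def logistic_grad_hess(y_true, raw_scores):
    """First- and second-order gradients of log loss w.r.t. the raw scores."""
    grads, hesss = [], []
    for y, s in zip(y_true, raw_scores):
        p = 1.0 / (1.0 + math.exp(-s))   # predicted probability
        grads.append(p - y)              # g_i
        hesss.append(p * (1.0 - p))      # h_i
    return grads, hesss

def optimal_leaf_weight(grads, hesss, reg_lambda=1.0):
    """w* = -G / (H + lambda) for the instances falling in one leaf."""
    return -sum(grads) / (sum(hesss) + reg_lambda)

def leaf_objective(grads, hesss, reg_lambda=1.0):
    """Contribution of one leaf to the minimized objective: -G^2 / (2 (H + lambda))."""
    g_sum = sum(grads)
    return -0.5 * g_sum * g_sum / (sum(hesss) + reg_lambda)

# One leaf holding three positive and one negative instance, all with raw score 0.
g, h = logistic_grad_hess([1, 1, 1, 0], [0.0, 0.0, 0.0, 0.0])
w_star = optimal_leaf_weight(g, h)
gain = leaf_objective(g, h)
```

A positive `w_star` here means the leaf pushes its instances toward the positive class, as expected for a leaf dominated by positive samples.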

F. Random Forest
Random Forest is an ensemble learning method based on bagging. In essence, it applies the bootstrap method to the CART algorithm: samples are drawn using the bootstrap method, and an independent decision tree model is then built on each sample using the CART algorithm [25]. The random forest algorithm (for both classification and regression) [26] proceeds as follows: 3) Predict new data by aggregating the predictions of the n_tree trees (i.e., majority vote for classification, average for regression).
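The two ingredients named above, bootstrap sampling and aggregation by majority vote, can be sketched as follows. The three threshold rules stand in for fitted CART trees and are purely hypothetical, as are the function names.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def forest_predict(trees, x):
    """Majority vote over the class predictions of all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)

# Hypothetical stand-ins for three fitted CART trees on a single feature.
trees = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
high = forest_predict(trees, 0.6)   # votes: 1, 1, 0
low = forest_predict(trees, 0.4)    # votes: 1, 0, 0
```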

G. Random Over Sampling (ROS)
The ROS algorithm randomly replicates samples from the minority classes [27]. Oversampling [28] can be done by increasing the number of minority class instances, either by producing new instances or by repeating existing instances multiple times.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 8, 2022, www.ijacsa.thesai.org

H. Random Under Sampling (RUS)
The RUS technique randomly eliminates samples from the majority classes until a relative class balance is achieved [27]. In the under-sampling approach, majority class instances are discarded until a more balanced distribution of the data is achieved. This data removal is done at random. Consider a data set with 200 minority class instances and 2,000 majority class instances: a total of 1,800 majority class instances are deleted at random by the RUS technique. The resulting dataset is balanced with 400 instances, 200 from the majority class and 200 from the minority class.
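The two random resampling schemes of Sections G and H can be sketched as follows, using illustrative class counts of 200 minority and 2,000 majority instances. The functions return row indices rather than data; their names and the fixed seed are assumptions of this sketch.

```python
import random
from collections import Counter

def _indices_by_class(y):
    """Group row indices by class label."""
    groups = {}
    for i, label in enumerate(y):
        groups.setdefault(label, []).append(i)
    return groups

def random_over_sample(y, seed=0):
    """ROS: repeat randomly chosen minority rows until all classes are equal."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(idx)
        keep.extend(rng.choice(idx) for _ in range(target - len(idx)))
    return keep

def random_under_sample(y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(rng.sample(idx, target))
    return sorted(keep)

y = [1] * 200 + [0] * 2000            # 200 minority, 2,000 majority labels
over = random_over_sample(y)
under = random_under_sample(y)
over_counts = dict(Counter(y[i] for i in over))
under_counts = dict(Counter(y[i] for i in under))
```

With these counts, RUS discards 1,800 majority rows and keeps 400 rows in total, while ROS grows the data to 4,000 rows by repeating minority rows.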

I. Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE produces synthetic samples from the minority class by interpolating between existing instances that are very close to each other [27]. For the minority class in the data set, SMOTE first selects a minority class instance at random. The distance from this sample to the other minority class samples is calculated using the Euclidean distance D, and its K nearest neighbors are obtained. The Euclidean distance D between two samples $x$ and $y$ with $n$ features is defined by:
$$D(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$
According to the proportion of the unbalanced data set, the sampling rate N is set. The six samples closest by D are selected as one group. The samples in each group are connected to each other to generate several new samples at random, which are added to the data set and recycled [29]. This results in a new formula:
$$x_{new} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x)$$
where $\tilde{x}$ is one of the selected nearest neighbors of $x$ and $\mathrm{rand}(0, 1)$ is a random number between 0 and 1.

IV. EXPERIMENTAL SETUP
Images are divided into 70% for training, 20% for validation, and 10% for testing. The YOLO architectural model is then trained and validated, after which testing is carried out on the test data to detect disease. After that, a performance evaluation is carried out for the architectural model used. The block diagram of the proposed COVID-19 classification is depicted in Fig. 1.
This study uses machine learning techniques to predict negative and positive cases using RT-PCR data with blood parameters. Before the machine learning classification methods are applied, the data are prepared in several steps: removal of non-blood parameters, imputation of missing values, label encoding of the class, and normalization with the Z-Score. The processed data are tested with several ensemble classification methods, namely Extra Trees, Bagging Decision Tree, Random Forest, AdaBoost, Gradient Boosting and Light Gradient Boosting. The best method is chosen based on the evaluation results in terms of accuracy, precision, recall, F1-score and AUC. The best method is then optimized by searching for the best parameters with hyperparameter tuning, and the results before and after hyperparameter tuning are compared. The results of the best method can be used for the prediction of COVID-19.

A. Data Collection
The dataset is collected from an existing benchmark [30]. It consists of 5,644 patients treated at the Hospital Israelita Albert Einstein in São Paulo, Brazil, and is publicly available on Kaggle. The data were collected from 28 March 2020 to 3 April 2020 and cover more than 100 laboratory tests, including blood tests, urine tests, the SARS-CoV-2 test, the RT-PCR test, and the presence of influenza virus [30]. Missing values make up 89% of the dataset, so they are filled in using the KNN Imputer method with K = 5 [31]. Label encoding is then applied to the class label: it converts the labels into integers from 0 to n_classes-1, which makes training easier. The data are normalized using the Z-Score [32]. The best method is then optimized by hyperparameter tuning using GridSearchCV from scikit-learn [33]. This study considers several features for classification, as shown in Table I.
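The preprocessing steps described above (KNN imputation, label encoding and Z-Score normalization) can be illustrated with the simplified sketch below. It is not the scikit-learn implementation: missing values are represented by `None`, the distance only approximates the NaN-aware Euclidean distance used by KNN imputers, and all function names and the toy data are assumptions.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over features present in both rows, rescaled for the
    skipped features. Returns inf when the rows share no observed feature."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    if not pairs:
        return math.inf
    return math.sqrt(len(a) / len(pairs) * sum((u - v) ** 2 for u, v in pairs))

def knn_impute(rows, k=5):
    """Fill each missing value with the mean of that feature over the k nearest
    rows in which the feature is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            if donors:
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled

def label_encode(labels):
    """Map class labels to integers 0 .. n_classes-1 (sorted label order)."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels]

def z_score(column):
    """Standardize a non-constant column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

rows = [[1.0, 2.0], [1.1, None], [5.0, 6.0], [0.9, 1.8]]   # toy data
filled = knn_impute(rows, k=2)
encoded = label_encode(["negative", "positive", "negative"])
scaled = z_score([1.0, 2.0, 3.0])
```

In the toy data the missing value is filled from the two rows closest on the observed feature, which is the behavior that makes KNN imputation attractive when most values are missing.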

B. Split Validation
In this study, the experiments divide the data according to a specified ratio, for example 80:20 [34]: 80% of the data are used as the training set and 20% as the test set.
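A minimal sketch of such an 80:20 split is given below. Whether the split in the experiments was shuffled or stratified is not stated, so the shuffling, the seed and the function name are assumptions.

```python
import random

def split_80_20(rows, test_ratio=0.2, seed=42):
    """Shuffle row indices and hold out test_ratio of the rows for testing."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_ratio)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(5644))          # one placeholder entry per sample
train, test = split_80_20(rows)
```

For 5,644 samples this yields 4,515 training rows and 1,129 test rows.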

C. Evaluation
To compare the overall performance of the proposed scheme, we selected five metrics: accuracy, recall, precision, F1-Score, and the area under the receiver operating characteristic (ROC) curve (ROC AUC). Accuracy is the most commonly used evaluation metric for classification. However, for imbalanced data classification problems, accuracy may not be a good choice because it is often biased toward the majority class [35] [36]. The accuracy can be defined by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Recall is the fraction of the relevant data that is successfully retrieved [37]. The recall is defined by:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision is the fraction of the retrieved data that matches the required information [38]-[40]. The precision is defined by:
$$\text{Precision} = \frac{TP}{TP + FP}$$
The F1-Score is the harmonic mean of precision and recall [41]. It indicates how precise the classifier is (how many instances are correctly classified), as well as how robust it is (it does not miss a large number of instances). The F1-Score is defined by:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). TPR is the ratio of positive samples that are correctly detected by the algorithm, and FPR is the ratio of negative samples that are incorrectly labeled as positive. The expressions for TPR and FPR are as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
where TP is the number of true positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives.
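The metrics above follow directly from the confusion matrix counts, as the sketch below shows. The counts and scores are made-up illustrative values, and the AUC is computed with the pairwise ranking definition (the share of positive and negative pairs that are ranked correctly), which is one standard way to obtain it.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall (TPR), precision, F1 and FPR from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    sample receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)   # illustrative counts
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])            # illustrative scores
```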

V. EXPERIMENTAL RESULTS
After the data were preprocessed to handle missing values, normalized with the Z-Score and the class labels encoded, the specified models were first tested without any sampling approach for class imbalance. These results serve as the baseline for comparison with the various sampling approaches tested afterwards. The test results are listed in Table II. The best accuracy was obtained by the extra trees method, with an average accuracy of 98.40% for the imbalance sampling methods. Light gradient boosting, however, achieved higher accuracy with random under sampling than the extra trees, AdaBoost, Gradient Boosting, and Random Forest methods. Overall, the extra trees method performs better than the other methods for all sampling methods except random under sampling. The experimental results in terms of recall, precision, F1-Score, and AUC are listed in Tables III, IV, V and VI.
The light gradient boosting method achieved a recall of 91.96%. The best recall results were obtained without an imbalance sampling method and with random under sampling, SMOTE and SMOTE-Tomek. The experiments also evaluate the precision of the classification methods: extra trees produced high precision for every sampling technique except random under sampling. Light gradient boosting achieves a good F1-Score and AUC under the various sampling techniques. The visual comparison of accuracy, recall, precision, F1-score and AUC is shown in Figs. 2, 3, 4, 5 and 6. The best AUC, 0.9693, was produced by light gradient boosting with the RUS sampling technique. It can be concluded that light gradient boosting is the best model, improving the majority of the performance measures in terms of accuracy, precision, recall, F1-score and AUC. Light Gradient Boosting produces the best accuracy of 98.49%, a recall of 97.32% with the RUS sampling technique, and an AUC of 0.9693.
Furthermore, hyperparameter tuning tests were carried out to optimize the results of Light Gradient Boosting. The parameters used in the hyperparameter tuning are listed in Table VII. The Grid Search process found the best parameters to be tested on the Light Gradient Boosting model; these parameters are listed in Table VIII. Hyperparameter tuning increased the accuracy of light gradient boosting to 98.58%. The recall of light gradient boosting increased in almost all tests that use sampling techniques. Random forest before the sampling technique achieved 92.59%. The F1-score of light gradient boosting after hyperparameter tuning reached 98.61% with the ROS sampling technique.
Based on the results, it can be concluded that light gradient boosting with hyperparameter tuning improves the accuracy, recall, precision, F1-score and AUC. The ROS sampling technique has advantages in terms of accuracy, recall, precision and F1-score. The final results are an accuracy of 98.58%, a recall of 98.58%, a precision of 98.61%, an F1-Score of 98.61% and an AUC of 0.9682. The resulting feature importance is shown in Fig. 7. According to Fig. 7, the most important feature is eosinophils, followed by leukocytes, monocytes, creatinine, platelets, MPV, neutrophils, age, RBC and potassium. The age parameter added in the proposed test becomes the seventh most important feature of the best model. A comparison with related research was conducted to assess the performance of the proposed research; the comparison results are listed in Table X.
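The grid search used for hyperparameter tuning evaluates every combination of candidate values and keeps the best-scoring one. The sketch below shows this exhaustive loop in plain Python. The grid values and the scoring function are hypothetical stand-ins: the actual grid is the one in Table VII, and in practice the score would be the cross-validated performance of the model rather than the toy function used here. `learning_rate` and `num_leaves` are used only as example parameter names.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid, not the grid reported in the paper.
param_grid = {"learning_rate": [0.01, 0.1, 0.3], "num_leaves": [15, 31, 63]}

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at (0.1, 31) by construction."""
    return (1.0
            - abs(params["learning_rate"] - 0.1)
            - abs(params["num_leaves"] - 31) / 100)

best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the list lengths (nine evaluations here), which is why grids are usually kept small.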

VI. CONCLUSION
This paper has presented various classification methods for COVID-19 prediction. Light gradient boosting with hyperparameter tuning and the ROS sampling technique performs better than existing classification methods such as extra trees, random forest, AdaBoost and gradient boosting in predicting the COVID-19 data. Eosinophils, blood and age parameters have the potential to become important parameters for COVID-19 prediction. On the data taken from kaggle.com, with 5,644 samples, the classification improves on the majority of the performance measures (recall, precision, F1-score and AUC) due to the eosinophils, blood and age parameters. Hyperparameter tuning with the ROS sampling technique achieved an accuracy of 98.58%, a recall of 98.58%, a precision of 98.61%, an F1-score of 98.61% and an AUC of 0.9682. The most important feature in these experiments is eosinophils, which can significantly influence the classification results, while the age feature is seventh in the order of important features. In future research, the proposed model has the potential to predict monkeypox disease by identifying important features.