COVID-19 Dataset Clustering based on K-Means and EM Algorithms

Abstract: In this paper, a COVID-19 dataset is analyzed using a combination of the K-Means and Expectation-Maximization (EM) algorithms to cluster the data. The purpose of this method is to gain insight into, and interpret, the various components of the data. The study tracks the evolution of confirmed, death, and recovered cases from March to October 2020, using a two-dimensional dataset approach. The data are organized into three sets, “Confirmed-Recovered”, “Confirmed-Death”, and “Recovered-Death”, each of which is clustered with K-Means and modeled using bivariate Gaussian densities. The optimal value of k, which represents the number of clusters, is determined using the Elbow method. The results indicate that the clusters generated by K-Means provide limited information, whereas the EM algorithm reveals the correlation within “Confirmed-Recovered”, “Confirmed-Death”, and “Recovered-Death”. The advantages of using the EM algorithm include stability in computation and improved clustering through the Gaussian Mixture Model (GMM).


I. INTRODUCTION
Cluster analysis involves organizing data into meaningful and valid groups [1] whose members are homogeneous and similar. This technique classifies each data point into a specific set using clustering algorithms [2], [3]. A method proposed by the authors in [4] determines the optimal number of clusters, k, which represents the inherent significant clustering structures of the dataset. The K-Means and Expectation-Maximization (EM) algorithms are commonly used for clustering [5]. The EM algorithm, initially designed for finding the maximum likelihood parameters of a statistical model, has been applied to various domains such as speech recognition [6] and interactive systems [7].
On the other hand, the researchers in [8] have proposed a new epidemiological mathematical model for the spread of the COVID-19 disease, with a special focus on the transmissibility of individuals with severe symptoms. A recent report describes a C++ program that can be used to "track" the daily evolution of new confirmed cases of the COVID-19 epidemic [9]. Rizvi et al. [10] performed K-Means clustering of 79 countries on COVID-19 confirmed cases and death cases, based on 18 feature variables.
This study presents a fresh approach to analyzing the COVID-19 dataset using clustering techniques. Specifically, we apply standard versions of the K-Means algorithm and of the EM algorithm based on GMM to partition the local Moroccan COVID-19 dataset into three sets: "Confirmed-Recovered", "Confirmed-Death", and "Recovered-Death", with varying cluster numbers. Our primary objective is to identify the optimal classification for each data cluster. This paper is organized as follows: Section II gives the literature review. The K-Means and EM algorithms are introduced in Section III. The COVID-19 pandemic is presented in Section IV. Section V describes the COVID-19 dataset. Section VI presents the results and discussion. Finally, Section VII concludes this work and gives perspectives.

II. LITERATURE REVIEW
The COVID-19 pandemic has led to an increase in the use of data mining and machine learning techniques to understand and analyze the spread of the virus. Clustering is a popular technique used to group similar data points together. K-Means, EM Algorithm and GMM are three commonly used clustering algorithms in machine learning. Several clustering methods have been developed with the objective to find the correct number of clusters [11], [12], [13], [14], [15]. In [16] the authors focus on utilizing Probabilistic Graphical Models for detecting COVID-19, resulting in excellent detection of the disease. One potential use of the EM algorithm is to estimate the parameters of a mixture model in cases where the data is incomplete. This technique is sometimes referred to as finding the parameters of Gaussian mixture densities [17], [18]. Eva and Dharmende [19] conducted a comparison between K-Means and GMM to assess their effectiveness in representing clusters of heterogeneous resource usage in Cloud workloads. Their experiments, which utilized Google cluster trace and business critical workloads by Bitbrains, revealed that K-Means provided a more generalized representation, whereas GMM resulted in better clustering with clearly defined usage boundaries. Despite Gaussian Mixture Model's longer computation time compared to K-Means, it is preferred for more detailed workload analysis and characterization.
Appiah et al. [20] proposed a study that utilizes the EM algorithm, initialized by a semi-supervised K-Means clustering approach based on geodesic distance, for the classification of a crime dataset. The aim is to track changes in cluster centroids (means), shape and orientation, volume, and predictive trends of criminal activities. In this approach, the cluster assignment obtained from K-Means is taken as the initial distribution of the GMM. The model-based clustering algorithm is then used to estimate the parameters of the mixture model while maintaining the probabilistic assignment and multivariate nature of the data.
Aungkulanon et al. [22] clustered different regions of Thailand based on financial conditions and mortality differentials, revealing super-regions that are mainly urban and have a low all-cause normalized mortality proportion but a high colorectal disease-specific death rate. The study also found that deaths caused by liver cancer, diabetes, and renal diseases are common in low economic super-regions. Malav et al. [23] conducted a study to predict coronary heart disease using K-Means and artificial neural networks. The combined approach led to a system with a very high accuracy rate. Another work by Singh et al. [24] used clustering and classification techniques to forecast heart diseases with high accuracy.
Isikhan et al. [25] clustered countries based on causes of deaths, health profiles, and risk factors using unsupervised K-means. The study analyzed clusters based on some financial and socio-demographic indicators and found that climate and ethnicity were more significant factors for clustering than socio-economic factors. These studies demonstrate the importance of COVID-19 dataset clustering in identifying patterns and trends associated with the virus, which can aid in developing effective strategies to combat its spread.

III. K-MEANS AND EM ALGORITHMS
Let $Y = (Y_1, \ldots, Y_N)$ be a set of independent and identically distributed (i.i.d.) observations, where each observation $Y_t = (y_{t1}, \ldots, y_{tj}, \ldots, y_{td}) \in \mathbb{R}^d$ is a d-dimensional real vector. The objective of both K-Means and EM is to partition the N observations into G clusters [26].

A. K-Means Algorithm
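In this part, the objective is to find values of the indicators $z_{tk}$ and of the means $\mu_k$ that minimize the distortion D. Let $\Phi = \mu = \{\mu_1, \ldots, \mu_G\}$ be the set of cluster means, where $\mu_k$ is the mean of cluster $C_k \in \{C_1, \ldots, C_G\}$, the set of G clusters, and let $Z = (z_1, z_2, \ldots, z_N)$ be the set of binary indicator variables, where $z_{tk} = 1$ when $Y_t$ is a member of $C_k$ and $z_{tk} = 0$ otherwise. The distortion is

$$ D(\Phi, Z) = \sum_{t=1}^{N} \sum_{k=1}^{G} z_{tk} \, \| Y_t - \mu_k \|^2 , $$

and we seek $\arg\min_{\Phi, Z} D(\Phi, Z)$. When D reaches its minimal value, the sum of the squared distances $\| Y_t - \mu_k \|^2$ between each point and the mean of its cluster, measured by Euclidean distance, is minimal [27].
We can do this through an iterative procedure in which each iteration involves two successive steps, corresponding to successive optimizations with respect to $z_{tk}$ and $\mu_k$. We first initialize the class centers $\mu_k$ of the clusters $\{C_1, \ldots, C_G\}$ with initial values called seed points, chosen by systematic sampling.
Step 1: We minimize D and update $z_{tk}$, keeping the $\mu_k$ fixed:

$$ z_{tk}^{(m+1)} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \| Y_t - \mu_j^{(m)} \|^2 , \\ 0 & \text{otherwise.} \end{cases} $$

Step 2: We minimize D and update $\mu_k$, keeping the $z_{tk}$ fixed:

$$ \mu_k^{(m+1)} = \frac{\sum_{t=1}^{N} z_{tk}^{(m+1)} \, Y_t}{\sum_{t=1}^{N} z_{tk}^{(m+1)}} , $$

where (m) is the current iteration. This two-stage optimization is then repeated until convergence.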
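
The two-step iteration described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the implementation used in this study; the function names `kmeans` and `distortion` and the toy points are assumptions introduced for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Two-step K-Means on a list of 2-D points.

    points  : list of (x, y) tuples (the observations Y_t)
    centers : list of (x, y) tuples (the seed points mu_k)
    Returns (centers, labels), where labels[t] is the index k with z_tk = 1.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Step 1: update z_tk with the centers fixed (nearest center wins).
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 2: update mu_k with the assignments fixed (cluster means).
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels

def distortion(points, centers, labels):
    """D = sum over t of ||Y_t - mu_k||^2 for the cluster k assigned to Y_t."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Toy data, made up for illustration (not the Moroccan dataset).
toy = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
final_centers, final_labels = kmeans(toy, centers=[toy[0], toy[3]])
```

Stopping when the assignments no longer change is one common convergence test; a threshold on the decrease of D would serve equally well.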

B. Expectation Maximization Algorithm
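In this work, the EM algorithm is used to complete the missing COVID-19 data by introducing the latent variable Z. Each observation $Y_t$ can describe the mixture "Confirmed cases-Recovered cases"; the same study is carried out for the mixtures "Confirmed cases-Death cases" and "Recovered cases-Death cases". We assume that the observations $Y_t$ are i.i.d. and that the observations from each cluster follow a correlated bivariate Gaussian density. If observation t belongs to cluster $C_k$ (denoted by $t \in C_k$), then

$$ p(Y_t \mid t \in C_k) = \mathcal{N}(Y_t \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (Y_t - \mu_k)^{\top} \Sigma_k^{-1} (Y_t - \mu_k) \right) , $$

where $\mu_k$ and $\Sigma_k$ denote the mean vector and covariance matrix of cluster $C_k$, and d = 2 in the bivariate case. Each data point is assigned to the nearest cluster by computing the following quantity [28]:

$$ \gamma(z_{tk}) = P(z_{tk} = 1 \mid Y_t) = \frac{\Pi_k \, \mathcal{N}(Y_t \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{G} \Pi_j \, \mathcal{N}(Y_t \mid \mu_j, \Sigma_j)} , $$

where $P(z_{tk} = 1 \mid Y_t)$ is the posterior probability that $Y_t$ belongs to the k-th class $C_k$, $\Pi_k$ is the mixing weight of the k-th component, and $z_t$ identifies the Gaussian that generated the entry $Y_t$.
Step 1 (Expectation): Given the current estimates of $\Pi_k$, $\mu_k$, and $\Sigma_k$, compute the posterior probabilities $\gamma(z_{tk})$.
Step 2 (Maximization): Compute the parameters that maximize the likelihood of the dataset, $P(Y \mid \mu_k, \Sigma_k, \Pi_k, z_{tk})$, which is the probability of all of the data under the GMM. The probability P(Y) that generated the COVID-19 dataset gives the log-likelihood

$$ l(\Phi) = \ln P(Y) = \sum_{t=1}^{N} \ln \sum_{k=1}^{G} \Pi_k \, \mathcal{N}(Y_t \mid \mu_k, \Sigma_k) , $$

where $\Phi$ here denotes the full set of parameters $(\Pi_k, \mu_k, \Sigma_k)$. With the $\gamma(z_{tk})$ held fixed, maximizing this with respect to each of the parameters can be done in closed form:
1) Re-estimation of the mixing weights: To find $\Pi_k$, we use a Lagrange multiplier $\lambda$ [29] with the constraint $\sum_{i=1}^{G} \Pi_i = 1$ and maximize the following quantity:

$$ l(\Phi) + \lambda \left( \sum_{i=1}^{G} \Pi_i - 1 \right) . $$

Setting the derivative with respect to $\Pi_k$ to zero and multiplying by $\Pi_k$ gives $\sum_{t=1}^{N} \gamma(z_{tk}) + \lambda \Pi_k = 0$; summing over k yields $\lambda = -N$. We then obtain the new estimate of $\Pi_k$ (see Eq. 8):

$$ \Pi_k = \frac{N_k}{N} , \qquad N_k = \sum_{t=1}^{N} \gamma(z_{tk}) . \tag{8} $$

2) Re-estimation of the mean vectors: We assume the $\gamma(z_{tk})$ fixed and set the derivative of $l(\Phi)$ with respect to the mean $\mu_k$ to zero:

$$ \frac{\partial l(\Phi)}{\partial \mu_k} = \sum_{t=1}^{N} \gamma(z_{tk}) \, \Sigma_k^{-1} (Y_t - \mu_k) = 0 . $$

The new $\mu_k$ is then given by (Eq. 9):

$$ \mu_k = \frac{1}{N_k} \sum_{t=1}^{N} \gamma(z_{tk}) \, Y_t . \tag{9} $$

3) Re-estimation of the covariance matrix: In the same way, we differentiate $l(\Phi)$ with respect to $\Sigma_k$ and set $\partial l(\Phi) / \partial \Sigma_k = 0$, which gives the new values of the covariance matrix (see Eq. 10):

$$ \Sigma_k = \frac{1}{N_k} \sum_{t=1}^{N} \gamma(z_{tk}) \, (Y_t - \mu_k)(Y_t - \mu_k)^{\top} . \tag{10} $$

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 3, 2023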
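
The Expectation and Maximization steps, with the re-estimations of Eqs. 8 to 10, can be sketched in Python for the bivariate case. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in this study: the names `gauss2` and `em_step`, the toy data, the starting values, and the small ridge term added to the covariance diagonal are all introduced here for the example.

```python
import math

def gauss2(y, mu, cov):
    """Bivariate Gaussian density N(y | mu, cov); cov = ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    # Quadratic form (y - mu)^T cov^{-1} (y - mu) via the explicit 2x2 inverse.
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def em_step(data, weights, means, covs, ridge=1e-6):
    """One EM iteration.

    Returns (weights, means, covs, log_lik), where log_lik is the
    log-likelihood under the parameters that were passed in.
    """
    G, N = len(weights), len(data)
    # E-step: responsibilities gamma[t][k] = P(z_tk = 1 | Y_t).
    gamma, log_lik = [], 0.0
    for y in data:
        joint = [weights[k] * gauss2(y, means[k], covs[k]) for k in range(G)]
        total = sum(joint)
        log_lik += math.log(total)
        gamma.append([j / total for j in joint])
    # M-step: closed-form re-estimation of weights, means, covariances.
    new_w, new_mu, new_cov = [], [], []
    for k in range(G):
        Nk = sum(g[k] for g in gamma)
        mx = sum(g[k] * y[0] for g, y in zip(gamma, data)) / Nk
        my = sum(g[k] * y[1] for g, y in zip(gamma, data)) / Nk
        sxx = sum(g[k] * (y[0] - mx) ** 2 for g, y in zip(gamma, data)) / Nk
        syy = sum(g[k] * (y[1] - my) ** 2 for g, y in zip(gamma, data)) / Nk
        sxy = sum(g[k] * (y[0] - mx) * (y[1] - my)
                  for g, y in zip(gamma, data)) / Nk
        new_w.append(Nk / N)
        new_mu.append((mx, my))
        # The ridge is an assumption of this sketch: it keeps the matrix
        # invertible and is not part of Eq. 10.
        new_cov.append(((sxx + ridge, sxy), (sxy, syy + ridge)))
    return new_w, new_mu, new_cov, log_lik

# Toy data, made up for illustration (not the Moroccan dataset).
toy = [(1.0, 1.2), (1.4, 0.8), (0.6, 1.0), (1.2, 1.6), (0.8, 0.4),
       (5.0, 5.4), (5.6, 4.8), (4.4, 5.0), (5.2, 5.8), (4.8, 4.0)]
w = [0.5, 0.5]
mu = [(0.0, 0.0), (6.0, 6.0)]
cov = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 0.0), (0.0, 1.0))]
history = []
for _ in range(20):
    w, mu, cov, ll = em_step(toy, w, mu, cov)
    history.append(ll)
```

Unlike the hard labels of K-Means, each row of `gamma` is a probability vector over the G components, which is what allows the model to represent overlapping clusters.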

IV. COVID-19 PANDEMIC
Late in 2019, in the city of Wuhan, China, a newly discovered coronavirus was identified as the principal cause of an unusual cluster of pneumonia cases. Local scientists responded by isolating SARS-CoV-2 from a patient in early January 2020, which led to the sequencing of the SARS-CoV-2 genome [30].
According to the authors of the sequencing, phylogenetic analysis of this genome made it possible to establish that the initial host of this virus is an animal sold at the market in Wuhan. Several studies have suggested that bats could be at the origin of SARS-CoV-2 [31]. The virus was referred to as 2019-nCoV before the disease received the name COVID-19; it is now designated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The WHO declared the infection a pandemic on March 11, 2020. It spread rapidly, followed by an increase in the number of infected cases around the globe. As of August 16, 2020, the world had recorded 21,294,845 confirmed cases and 761,779 deaths [32].

V. COVID-19 DATASET DESCRIPTION
In the present study, we use public data from the COVID-19 outbreak in Morocco to estimate the evolution of this epidemic. The data were obtained from the official website created by the Moroccan Ministry of Health. According to the Director of Epidemiology and Disease Control at the Ministry of Health, as of October 24, 2020, Morocco had recorded 194,461 confirmed cases, 3,255 deaths, and 160,372 recovered cases (see Fig. 1) [33].
The training dataset is composed of the real daily COVID-19 counts of Confirmed, Recovered, and Death cases. The clustering is performed on the two-dimensional datasets "Confirmed-Recovered", "Confirmed-Death", and "Recovered-Death", each with 237 samples. Table I shows part of the complete data.
The 237 daily COVID-19 records are stored in Table I. Each feature is a combination of two variables: the Confirmed and Death cases, the Confirmed and Recovered cases, and the Recovered and Death cases, respectively.

VI. RESULTS AND DISCUSSION
In this work, we select the k initial groups following the order in which the points were recorded. Each initial group $c_k$ contains 237/k points, from which we define the initial centroids $\mu_k$. The test data consists of three groups: "Confirmed-Recovered" (see Fig. 2), "Confirmed-Death" (see Fig. 3), and "Recovered-Death" (see Fig. 4). Each group is then divided into k = 2 to k = 6 clusters. The hybrid of the K-Means algorithm and the Elbow method is used to determine the best clustering, as in [34].
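
One possible reading of this initialization is that the ordered samples are cut into k consecutive blocks of about N/k points and that each block mean serves as an initial centroid. The sketch below illustrates that reading only; the function name `block_init` and the toy series are assumptions, and the handling of the remainder when N is not divisible by k is a choice made for the example.

```python
def block_init(points, k):
    """Split ordered points into k consecutive blocks and return block means."""
    n = len(points)
    size = n // k  # about 237 / k points per block in the setting described
    centers = []
    for i in range(k):
        start = i * size
        # The last block absorbs the remainder when n is not divisible by k.
        end = (i + 1) * size if i < k - 1 else n
        block = points[start:end]
        centers.append((
            sum(p[0] for p in block) / len(block),
            sum(p[1] for p in block) / len(block),
        ))
    return centers

# Made-up ordered series, for illustration only.
series = [(float(t), float(2 * t)) for t in range(10)]
seeds = block_init(series, 3)
```

Because the result depends only on the order of the samples, this initialization is deterministic, so repeated runs of K-Means from these seeds give the same clusters.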
Each data point is classified by computing the distance between that point and each group center, and then assigning the point to the group whose center is closest to it. The sum of squared errors (SSE) shows its greatest decrease at k = 4 for the groups "Confirmed-Recovered" and "Confirmed-Death", and at k = 3 for the "Recovered-Death" data, as can be seen in Fig. 5, Fig. 6, and Fig. 7.
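
The Elbow computation can be sketched in Python as follows: K-Means is run for each candidate k, the SSE is recorded, and the k with the largest SSE decrease relative to the previous k is retained. This is an illustration under assumptions, not the procedure of [34]: the function names, the consecutive-block seeding, the toy data, and the "largest drop" selection rule are choices made for the example.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def mean(block):
    """Component-wise mean of a non-empty list of 2-D points."""
    return (sum(p[0] for p in block) / len(block),
            sum(p[1] for p in block) / len(block))

def kmeans_sse(points, k, max_iter=100):
    """Run K-Means from consecutive-block seeds and return the final SSE."""
    size = len(points) // k
    centers = []
    for i in range(k):
        end = (i + 1) * size if i < k - 1 else len(points)
        centers.append(mean(points[i * size:end]))
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def elbow(points, k_values):
    """Return (sse_by_k, best_k), best_k having the largest SSE decrease."""
    sse = {k: kmeans_sse(points, k) for k in k_values}
    ks = sorted(sse)
    drops = {ks[i]: sse[ks[i - 1]] - sse[ks[i]] for i in range(1, len(ks))}
    best_k = max(drops, key=drops.get)
    return sse, best_k

# Made-up ordered data with three well-separated groups.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
       (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]
sse, best_k = elbow(toy, range(2, 7))
```

With k ranging from 2 to 6 as in the text, a decrease can only be measured from k = 3 onward, so k = 2 can never be selected by this rule; starting the scan at k = 1 would remove that restriction.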
Using the hybrid of the K-Means algorithm and the Elbow method, the best clustering was obtained with 4 and 3 clusters. This result is exploited in the EM classification based on GMM: the "Confirmed-Recovered", "Confirmed-Death", and "Recovered-Death" data can be divided into 4, 4, and 3 subsets, respectively. We also analyze the correlation of the feature variables for COVID-19. A correlation matrix is used to find the relationship between the two variables of each pair "Confirmed-Recovered", "Confirmed-Death", and "Recovered-Death". The correlation coefficient r measures the strength of the relationship between two quantitative variables $Y_i$ and $Y_j$, using the formula given in (Eq. 13):

$$ r_{ij} = \frac{\sum_{t=1}^{N} (y_{ti} - \bar{y}_i)(y_{tj} - \bar{y}_j)}{\sqrt{\sum_{t=1}^{N} (y_{ti} - \bar{y}_i)^2 \, \sum_{t=1}^{N} (y_{tj} - \bar{y}_j)^2}} , \tag{13} $$

where i, j ∈ {Confirmed, Recovered, Death}, $y_{ti}$ is the value of variable i for sample t, and $\bar{y}_i$ is its mean over the N samples. The correlation coefficient r is a unitless value between -1 and 1.
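
The correlation coefficient of (Eq. 13) and the resulting correlation matrix can be computed as in the following Python sketch. The function names `pearson_r` and `correlation_matrix` and the numbers in `toy` are assumptions made for illustration; they are not values from the Moroccan dataset.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Numerator: sum of products of deviations from the means.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)

def correlation_matrix(columns):
    """Return {(i, j): r_ij} for every pair of named columns."""
    names = list(columns)
    return {(i, j): pearson_r(columns[i], columns[j])
            for i in names for j in names}

# Made-up daily counts, for illustration only.
toy = {
    "Confirmed": [10, 20, 30, 40, 50],
    "Recovered": [1, 3, 2, 5, 4],
    "Death": [5, 4, 3, 2, 1],
}
corr = correlation_matrix(toy)
```

The matrix is symmetric with ones on the diagonal, so in practice only the three off-diagonal pairs need to be reported.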
Table II gives the initial parameters of the different groups. To start the K-Means and EM algorithms, we use the same mean values and the same coefficients of the covariance matrix.
In this part, we use a C++ implementation from [35] of the K-Means algorithm to partition the first, second, and third groups into four, four, and three clusters, respectively. We also apply EM with a GMM, implemented in MATLAB, to all three groups: "Confirmed-Recovered" (see Figs. 8 and 9), "Confirmed-Death" (see Figs. 10 and 11), and "Recovered-Death" (see Figs. 12 and 13).
Fig. 11. Contours of the probability density function (PDF) with four mixture components of the "Confirmed-Death" data, for panels (g) and (h).
We obtain values at convergence by using K-Means algorithm and EM algorithm (see Table III).
The correlation matrices of the initial values (see Table IV) and of the values at convergence (see Table V) for the features "Confirmed-Recovered-Death" are computed using (Eq. 13). Positive values of r indicate a positive correlation: the values of the two variables tend to increase together. Negative values of r indicate a negative correlation: the values of one variable tend to increase while the values of the other decrease. In the data mining of COVID-19 in Morocco, K-Means is a simple and fast algorithm for solving clustering problems, but it requires the exact number of clusters k to be specified in advance, which is often difficult.
The "Confirmed-Recovered", "Confirmed-Death", and "Recovered-Death" groups are mixture models of four and three two-dimensional Gaussians. The K-Means algorithm only considers the mean when updating the centroids, whereas the EM-based GMM takes into account both the mean and the covariance matrix of each data group. We use this partitioning to start K-Means and EM. Since we start from a real model with correlated covariance matrices, the values at convergence are of the same nature. It can be interpreted that a high positive correlation exists, in the third phase of the epidemic's spread, between Confirmed cases and Recovered cases (0.70), Confirmed cases and Death cases (0.67), and Recovered cases and Death cases (0.85) [10]. To evaluate the clusters, the "Confirmed-Recovered" and "Recovered-Death" values fall into four categories (low, lower-middle, upper-middle, and high); on the other hand, the "Confirmed-Death" data fall into three phases (low, medium, and high).
We notice a clear difference between the means obtained by the K-Means algorithm and the means of the GMM. The EM-based GMM has a higher computation time than K-Means, because K-Means does not account for variance; clustering with K-Means is therefore faster than with the Gaussian Mixture Model method. Membership of data points in clusters is probabilistic in GMM, as opposed to the non-probabilistic, hard clustering of the K-Means process, which resolves the membership ambiguity that may appear in overlapping clusters. These findings are in accordance with those of [19], whose analysis revealed a more meaningful clustering of workloads with GMM than with K-Means, enabling a detailed characterization of the resource usage needs of Cloud workloads.
K-Means clustering faces a major challenge in determining the optimal number of clusters, especially when working with COVID-19 data. Depending on the type of data being analyzed, the number of clusters may vary, and selecting the correct number of clusters is crucial for obtaining meaningful results. Furthermore, K-Means clustering relies on the Euclidean distance metric, which may not be suitable for all COVID-19 data. Other distance metrics, such as cosine distance, may be necessary to accurately capture the similarity between data points. Another clustering algorithm, EM clustering, is also sensitive to the initial conditions of the algorithm. Different initial conditions may result in different cluster assignments, leading to inconsistent results. Additionally, EM clustering may struggle to converge to a solution when working with high-dimensional data or complex probability distributions. Preprocessing and tuning of the algorithm may be necessary to ensure reliable results.

VII. CONCLUSION
This study focuses on analyzing the COVID-19 situation in Morocco using the K-Means and EM clustering algorithms. The dataset includes daily Confirmed, Death, and Recovered cases from March 2 to October 24, 2020. For the K-Means algorithm, discovering intra-cluster similarity in complex nonlinear models using Euclidean distance is difficult. The EM algorithm is more computationally intensive and requires larger sample sizes for accurate parameter estimates. The results indicate that the EM-based GMM method is the preferred clustering method, as it yields smaller classification error rates. The clusters generated by K-Means provide limited information, and the best clustering was found with four and three clusters. Furthermore, the EM algorithm demonstrates the correlation within "Confirmed-Recovered", "Confirmed-Death", and "Recovered-Death". The number of clusters, as determined by the process of identifying the optimal number of clusters, corresponds to the number of phases of the epidemic propagation. In future work, we will focus on enhancing our clustering model for multi-dimensional datasets with several features.