A k-interpolation Model Clustering Algorithm Based on Kriging Method

Abstract: In this work, a k-interpolation model clustering algorithm is proposed based on Kriging method, aiming to partition data according to the relationship between the response of interest and input variables. Kriging method is used to describe this relationship. For each datum, the estimation errors of the interpolation models of the clusters are used to decide its assignment. An optimization strategy is proposed to obtain the final clustering results. The key factors affecting the performance of the proposed algorithm are studied through synthetic and real-world datasets. The results show that the proposed algorithm is able to cluster the data according to the relationship between the response of interest and input variables, and provides competitive clustering performance compared with other clustering algorithms.


I. INTRODUCTION
In recent years, massive data have been generated and recorded from real-world systems. The information mined from these data represents the characteristics of the real-world system, which can be used to analyze and improve the performance of the system. In most data mining tasks, it is necessary to build the performance prediction model first, aiming to accurately estimate the response of interest according to the input variables. However, the relationship between the response and input often changes greatly, which is difficult to evaluate through a unified prediction model [1][2][3]. Obviously, this issue can be solved by partitioning the data so that the data in the same part have a more similar relationship between response and input than the data from the other parts, and this work can be accomplished by data clustering.
Data clustering is a class of algorithms and techniques that partition a dataset such that the data in the same cluster are more similar to each other than to the data in other clusters [4]. Many clustering algorithms have been proposed in the past decades, such as the k-means algorithm, the fuzzy c-means algorithm, and the Gaussian mixture model [5][6]. Since the k-means algorithm is easy to understand and implement, it has been widely used in data mining tasks such as image recognition, modal analysis, and outlier detection. Shubair and Al-Nassiri used the least square method to estimate the centers of clusters in the k-means algorithm and applied the clustering algorithm in the preparation process of data streams [7]. Aldino et al. used the k-means algorithm to group corn-producing regions based on the collected data of corn crops to assist in the formulation of corn planting [8]. Yu et al. proposed a multi-layer framework to improve the performance of the k-means algorithm on datasets with outliers and noisy values [9]. In addition, a genetic algorithm is used to obtain the optimal clustering results. Zhu et al. proposed a grid k-means algorithm to improve the clustering accuracy and stability and validated its performance on a dataset with noise points [10]. Cuomo et al. used parallel techniques to reduce the computation cost of the k-means algorithm for large data analytic problems and provided solutions for the problems of GPU space limitation and host-device data transfer time [11].
The k-means algorithm clusters data according to their spatial distance, which makes it difficult to ensure that the data in the same cluster have a similar or identical relationship between the response of interest and input variables. Thus, it is necessary to develop a new clustering method under the framework of the k-means algorithm.
In recent years, an interpolation model, Kriging method, has been widely used to model the relationship between the response and input variables of the measured data of real-world systems. For example, Echard et al. assessed the failure probabilities of an engineering system using the importance sampling method and Kriging method, which has been successfully used in the reliability analysis of engineering systems [12]. Keshtegar et al. used Kriging method to estimate the solar radiation based on meteorological data [13]. Wojciech proposed a digital terrain estimation method based on Kriging method, in which a neighbor point selection method is designed to accelerate the training of the Kriging model [14]. Belkhiri et al. estimated the groundwater quality for drinking purposes using Kriging method [15]. Their results indicate that the Kriging model with electrical conductivity as co-variable produces the best performance compared with the other Kriging models. From the above works, it can be seen that Kriging method can effectively learn the relationship between the response of interest and input variables from the measured data.
Thus, a k-interpolation model clustering algorithm is proposed based on Kriging method under the framework of the k-means algorithm in this work, aiming to partition data according to the relationship between the response of interest and input variables. Kriging method is used to evaluate the relationship between the response and input variables. For each datum, the estimation errors of the interpolation models of the clusters are used to decide its assignment. An optimization strategy is designed to obtain the clustering results. Finally, the performance of the proposed algorithm is validated through several synthetic and real-world datasets.
The remainder of this work is organized as follows. The proposed algorithm, including the background of the k-means algorithm and Kriging method, is introduced in Section 2. The synthetic datasets and engineering datasets are used to test and compare the performance of the proposed algorithm with the conventional clustering algorithms in Sections 3 and 4. The conclusions are provided in Section 5.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022, www.ijacsa.thesai.org

II. LITERATURE REVIEW
In recent years, several data clustering algorithms have been proposed to partition data according to the relationship between the response of interest and input. Peng et al. [16] introduced ridge regression to evaluate the relationship of two-dimensional data in their clustering. Chen et al. [17] used the least square method to evaluate the features of data, and then applied the fuzzy c-means algorithm to cluster them. However, only linear relationships are considered in the above methods.
To realize data clustering based on their nonlinear relationship between the response of interest and input, artificial neural networks and Gaussian process regression have been used to replace linear models. For example, Blažič et al. [18] used artificial neural networks to evaluate the nonlinear regression relationship to identify the state of engineering systems. Fuhg et al. [19] applied Gaussian process regression to evaluate the relationship among attributes to partition data according to their variation ranges. Fang et al. [20] used artificial neural networks to evaluate the relationship among data attributes to cluster the in-situ data of a tunnel boring machine.

A. K-means Algorithm
The k-means algorithm was developed in the area of signal processing. It aims to partition $n$ data into $k$ clusters in which each datum belongs to the cluster with the nearest mean (the prototype of the cluster). Generally, the clustering process of the k-means algorithm can be subdivided into two steps, the assignment step and the update step, as follows.
Assignment step: each datum is assigned to the cluster with the nearest prototype as follows
$$S_i^{(t)} = \left\{ x_p : d\left(x_p, m_i^{(t)}\right) \le d\left(x_p, m_j^{(t)}\right),\ \forall j,\ 1 \le j \le k \right\} \tag{1}$$
where $d(x_p, m_i^{(t)})$ represents the distance between the datum $x_p$ and the mean $m_i^{(t)}$ (Euclidean distance is usually used), and each $x_p$ is assigned to exactly one $S_i^{(t)}$.
Update step: the mean (prototype) of each cluster is recalculated as follows.
$$m_i^{(t+1)} = \frac{1}{\left|S_i^{(t)}\right|} \sum_{x_j \in S_i^{(t)}} x_j \tag{2}$$
The iterations are carried out until the assignments no longer change.
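The assignment and update steps above can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch: the function name `k_means`, the choice of initial prototypes (k randomly sampled data), and the `max_iter` safeguard are assumptions made here, not details taken from this work.

```python
import math
import random


def k_means(data, k, max_iter=100, seed=0):
    """Plain k-means on a list of points (each point is a list of floats)."""
    rng = random.Random(seed)
    # Assumption: the initial prototypes are k distinct data drawn at random.
    means = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each datum goes to the nearest prototype
        # (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, means[j])) for x in data
        ]
        if new_labels == labels:  # assignments no longer change
            break
        labels = new_labels
        # Update step: each prototype becomes the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old prototype if a cluster is empty
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, means


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, means = k_means(points, k=2)
```

Because the distance is purely spatial, two data with nearby inputs end up in the same cluster regardless of how their responses relate to those inputs.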

B. Kriging Method
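In Kriging method (KRG), the following model is used to model the outputs at the samples:
$$y(x) = f(x)^{T} \beta + z(x) \tag{3}$$
where $y(x)$ is the function of interest, $f(x) = [f_1(x), f_2(x), \ldots, f_p(x)]^{T}$ is the vector of basis functions, and $\beta$ is the corresponding coefficient vector. $z(x)$ is a Gaussian stationary process with zero mean and covariance
$$\mathrm{Cov}\left[z(x_i), z(x_j)\right] = \sigma^{2} R(x_i, x_j; \theta), \quad i, j = 1, 2, \ldots, n \tag{4}$$
where $\sigma^{2}$ is the process variance, $R(\cdot)$ is the correlation function of the stochastic process, $\theta$ is the vector of hyperparameters of $R(\cdot)$, and $n$ is the sample number. The maximum likelihood method is used to optimize $\theta$, where the likelihood function is expressed as follows:
$$L(\theta) = \frac{1}{(2\pi\sigma^{2})^{n/2} |R|^{1/2}} \exp\left[ -\frac{(y - F\beta)^{T} R^{-1} (y - F\beta)}{2\sigma^{2}} \right] \tag{5}$$
where $R$ is the correlation matrix, $F$ collects the values of $f(x)$ at the samples, and $y$ is the vector of responses at the samples. $\beta$ and $\sigma^{2}$ are estimated through the least-square method as follows.
$$\hat{\beta} = \left(F^{T} R^{-1} F\right)^{-1} F^{T} R^{-1} y \tag{6}$$
$$\hat{\sigma}^{2} = \frac{1}{n} (y - F\hat{\beta})^{T} R^{-1} (y - F\hat{\beta}) \tag{7}$$
By taking the logarithm of Eq. (5) with the imposed values $\hat{\beta}$ and $\hat{\sigma}^{2}$, multiplying by -1, and omitting the constant terms, the maximization problem to obtain the optimal $\theta$ is revised as
$$\min_{\theta} \ \frac{n}{2} \ln \hat{\sigma}^{2} + \frac{1}{2} \ln |R| \tag{8}$$
The prediction of Kriging method for a new sample $x$ is
$$\hat{y}(x) = f(x)^{T} \hat{\beta} + r(x)^{T} R^{-1} (y - F\hat{\beta}) \tag{9}$$
where $r(x)$ is the vector of correlations between $x$ and the $n$ samples.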
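To make the Kriging predictor concrete, the following is a minimal, dependency-free sketch under several simplifying assumptions that are not taken from this work: a constant basis function (ordinary Kriging), a Gaussian correlation function with a single fixed hyperparameter `theta` (the likelihood-based optimization of the hyperparameters is omitted), and a tiny `nugget` on the diagonal for numerical stability. All function names are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def corr(a, b, theta):
    """Gaussian correlation between two input vectors (assumed form)."""
    return math.exp(-theta * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def fit_kriging(X, y, theta=10.0, nugget=1e-10):
    """Ordinary Kriging: constant trend beta plus a correlated residual."""
    n = len(X)
    R = [[corr(X[i], X[j], theta) + (nugget if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Generalized least-squares estimate of beta for F = vector of ones:
    # beta = (1^T R^-1 y) / (1^T R^-1 1).
    beta = sum(solve(R, y)) / sum(solve(R, [1.0] * n))
    # Weights R^-1 (y - F beta), reused by every prediction.
    weights = solve(R, [yi - beta for yi in y])
    return X, theta, beta, weights


def predict_kriging(model, x):
    """Prediction: beta + r(x)^T R^-1 (y - F beta)."""
    X, theta, beta, weights = model
    return beta + sum(corr(x, xi, theta) * w for xi, w in zip(X, weights))


X = [[0.0], [0.25], [0.5], [0.75], [1.0]]
y = [math.sin(2 * math.pi * x[0]) for x in X]
model = fit_kriging(X, y)
estimate = predict_kriging(model, [0.6])
```

With a negligible nugget the predictor reproduces the training responses almost exactly, which is the interpolation property that gives the method its name.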

C. The Proposed Algorithm
A k-interpolation model clustering algorithm is proposed based on Kriging method in this section. From Eq. (1), it can be seen that the distance should involve the relationship between the response of interest and input variables if the data are to be clustered based on that relationship. In this work, Kriging method is used to evaluate the relationship, and the estimated response of each datum can be obtained as follows.
$$\hat{y}_{ij} = \hat{y}_{j}(x_i) \tag{10}$$
where $\hat{y}_{ij}$ is the estimated response of the $i$-th datum given by the KRG model $\hat{y}_{j}(\cdot)$ of the $j$-th cluster, and $x_i$ is the vector of input variables. The distance is defined as the estimation error as follows.
$$d_{ij} = \left| y_i - \hat{y}_{ij} \right| \tag{11}$$
where $y_i$ is the measured response of the $i$-th datum.
Similar to the k-means algorithm, the clustering process of the proposed algorithm (named k-IM) is summarized as follows.
Step 1. Set the number of clusters k.
Step 2. Generate the assignment of the data randomly.
Step 3. Construct the KRG model of each cluster based on the data contained in the cluster.
Step 4. Use the obtained KRG models to estimate the responses of all the data and create the response matrix.
Step 5. Assign each datum to a cluster using Eq. (1) and Eq. (11).
Step 6. If any stop condition is satisfied, the procedure is stopped and the current assignment is taken as the final clustering result; otherwise, return to Step 3.
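Steps 1 to 6 can be sketched as the loop below. It is a rough illustration under stated assumptions rather than the implementation used in this work: the interpolation model is an ordinary Kriging model with a Gaussian correlation and a fixed `theta`; the distance is the absolute estimation error; the loop stops when the assignment no longer changes or after `max_iter` passes. One further assumption matters: a non-negligible `nugget` is added so that each cluster model smooths rather than exactly reproduces its own data, since an exact interpolator would give every datum zero error in its current cluster and the random initial assignment would never change. All names are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def corr(a, b, theta):
    return math.exp(-theta * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def fit_krg(X, y, theta=50.0, nugget=1e-2):
    """Ordinary Kriging with a nugget (assumed regularization)."""
    n = len(X)
    R = [[corr(X[i], X[j], theta) + (nugget if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    beta = sum(solve(R, y)) / sum(solve(R, [1.0] * n))
    weights = solve(R, [yi - beta for yi in y])
    return X, theta, beta, weights


def predict_krg(model, x):
    X, theta, beta, weights = model
    return beta + sum(corr(x, xi, theta) * w for xi, w in zip(X, weights))


def assign(models, X, y):
    """Steps 4-5: estimate every response with every cluster model and
    assign each datum to the model with the smallest estimation error."""
    labels = []
    for xi, yi in zip(X, y):
        errors = [abs(yi - predict_krg(m, xi)) for m in models]
        labels.append(errors.index(min(errors)))
    return labels


def k_im(X, y, k, max_iter=20, seed=0):
    if len(X) < 2 * k:
        raise ValueError("need at least two data per cluster")
    labels = [i % k for i in range(len(X))]      # Step 1: k clusters
    random.Random(seed).shuffle(labels)          # Step 2: random assignment
    models = [None] * k
    for _ in range(max_iter):
        for j in range(k):                       # Step 3: one model per cluster
            idx = [i for i, lab in enumerate(labels) if lab == j]
            if len(idx) >= 2:                    # otherwise keep the old model
                models[j] = fit_krg([X[i] for i in idx], [y[i] for i in idx])
        new_labels = assign(models, X, y)        # Steps 4-5
        if new_labels == labels:                 # Step 6: stop condition
            break
        labels = new_labels
    return labels
```

As with k-means, the result depends on the random initial assignment, so in practice the loop would be run several times from different seeds.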

IV. EXPERIMENTS ON SYNTHETIC DATASETS
In this section, synthetic datasets are used to validate the proposed algorithm. For each dataset, the data of each cluster are generated first and then combined into the final dataset. The Latin hypercube sampling method is used to generate the input variables, and the corresponding responses are then calculated through the prescribed relationship between the response and input variables. Each dataset is named after its sample number and cluster number. For example, N400C2 means that the dataset has 400 samples and two clusters. The proposed algorithm is compared with three popular clustering methods: the k-means algorithm (KM), the fuzzy c-means algorithm (FCM), and the Gaussian mixture model (GMM). The clustering performance is evaluated through the following indexes.
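The data generation procedure described above can be sketched as follows. The two response functions passed in the usage example are placeholders invented for illustration; they are not the relationships used in the experiments, and the function names are hypothetical.

```python
import random


def latin_hypercube(n, dim, rng):
    """n points in [0, 1]^dim with exactly one point per stratum
    [i/n, (i+1)/n) along every dimension."""
    columns = []
    for _ in range(dim):
        column = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns.append(column)
    return [list(point) for point in zip(*columns)]


def make_dataset(n_per_cluster, relations, dim=1, seed=0):
    """Generate each cluster separately, then combine them."""
    rng = random.Random(seed)
    X, y, labels = [], [], []
    for cluster, relation in enumerate(relations):
        inputs = latin_hypercube(n_per_cluster, dim, rng)
        X += inputs
        y += [relation(x) for x in inputs]
        labels += [cluster] * n_per_cluster
    return X, y, labels


# Placeholder relationships (assumptions, not those of the experiments).
X, y, labels = make_dataset(
    200, [lambda x: x[0] ** 2, lambda x: 1.0 - x[0]]
)
```

Both clusters cover the same input range, so they overlap spatially and differ only in how the response depends on the input.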

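1) Misclassification rate (MS):
$$MS = \frac{N_m}{N}$$
where $N_m$ is the number of misclassified data and $N$ is the total number of data. The lower the MS, the higher the cluster validity.

2) Adjusted Rand index (ARI) [21]: Given a set of $n$ elements and two partitions of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j$: $n_{ij} = |X_i \cap Y_j|$, as shown in Table I. The adjusted Rand index is defined as follows:
$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2} \right] - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}$$
where $a_i = \sum_{j} n_{ij}$ and $b_j = \sum_{i} n_{ij}$ are the row and column sums of the contingency table. The closer the ARI is to 1, the higher the cluster validity.

3) Normalized mutual information (NMI) [22]: the NMI is obtained by normalizing $I(X, Y)$ with $H(X)$ and $H(Y)$, where $I(\cdot)$ is the mutual information metric and $H(\cdot)$ is the entropy metric. The closer the NMI is to 1, the higher the cluster validity.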
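The three indexes can be computed from two label lists as in the sketch below. Two points are assumptions of this sketch: the misclassification rate is taken over the best one-to-one matching between cluster labels and true labels (cluster labels are arbitrary), and the NMI is normalized by the arithmetic mean of the two entropies, which is only one of several conventions in use. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def misclassification_rate(true, pred, k):
    """Fraction of misclassified data under the best relabeling of the
    predicted clusters (labels are assumed to be 0 .. k-1)."""
    fewest = min(
        sum(t != perm[p] for t, p in zip(true, pred))
        for perm in permutations(range(k))
    )
    return fewest / len(true)


def pairs(m):
    """Number of unordered pairs among m objects."""
    return m * (m - 1) / 2


def adjusted_rand_index(u, v):
    sum_ij = sum(pairs(c) for c in Counter(zip(u, v)).values())
    sum_a = sum(pairs(c) for c in Counter(u).values())
    sum_b = sum(pairs(c) for c in Counter(v).values())
    expected = sum_a * sum_b / pairs(len(u))
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(u, v):
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    mutual_information = sum(
        c / n * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in Counter(zip(u, v)).items()
    )
    # Assumed normalization: arithmetic mean of the two entropies.
    return 2 * mutual_information / (entropy(u) + entropy(v))


true = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
ms = misclassification_rate(true, pred, k=2)
ari = adjusted_rand_index(true, pred)
nmi = normalized_mutual_information(true, pred)
```

The ARI and NMI are undefined when both partitions place all data in a single cluster (the denominators vanish), a case this sketch does not guard against.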

A. Effect of Sample Number
In this section, four synthetic datasets are used to study the effect of sample number on the performance of the k-IM algorithm. Each dataset contains two clusters, and each cluster has its own relationship between the response and input.
In the four synthetic datasets, each cluster has 150, 200, 250, and 300 samples, respectively. Thus, the four synthetic datasets are denoted as N300A2C2, N400A2C2, N500A2C2, and N600A2C2, respectively. The obtained N400A2C2 dataset is shown in Fig. 1. From Fig. 2, it can be seen that the samples of the two clusters are distributed similarly, but the relationship between the response of interest and input is different. Each experiment is repeated 30 times for each dataset. The average experimental results are shown in Tables II to IV. From these tables, it can be seen that the k-IM algorithm produces much better results than the FCM, KM, and GMM algorithms. The mean misclassification rate of the proposed algorithm is less than 0.03, which is much smaller than those of FCM, KM, and GMM, indicating that the k-IM algorithm is able to accurately cluster the synthetic datasets.
To further compare the performance of the clustering algorithms, the clustering results of one experiment on the N400A2C2 dataset are shown in Fig. 3. From this figure, it can be seen that the proposed algorithm clusters the data based on the relationship between the output and input, whereas the FCM and KM algorithms cluster the data according to their spatial distribution. It is noted that the GMM algorithm assigns most data to one cluster. The main reason is that it clusters data under the assumption that the data obey a Gaussian mixture distribution, and this assumption cannot be satisfied for the N400A2C2 dataset. Thus, the clustering results of the GMM algorithm are much worse than those of the other algorithms.
From the experimental results shown in Tables II to IV, it is observed that the sample number has an effect on the proposed algorithm. With the sample number increasing from 300 to 600, the MS of the proposed algorithm first decreases to 0.013 and then increases to 0.019. Similar results can be found in the indexes ARI and NMI. The reason is explained as follows. As the sample number increases, more samples can be utilized to construct the KRG models, which means that the relationship between the output and input can be evaluated more accurately, so the performance of the k-IM algorithm improves. However, the KRG model tends to overfit when the sample number is too large. Thus, the performance of the k-IM algorithm decreases as the sample number increases from 500 to 600. Overall, the proposed algorithm produces competitive clustering results for the datasets with different sample numbers tested in this section.

B. Effect of Cluster Number
Three datasets with different cluster numbers are used to test the effect of the cluster number on the proposed algorithm, as shown in Table V. The clusters of each dataset have similar but different relationships between the response and input variables. For each dataset, the experiment is repeated 30 times. The average clustering results are shown in Fig. 4. From Fig. 3, it is observed that the k-IM algorithm still produces the best results among the four tested algorithms. The highest misclassification rate of the k-IM algorithm is around 0.10, which is much smaller than those of the conventional FCM, KM, and GMM algorithms. The ARI and NMI indexes of the k-IM algorithm are around 0.80 or higher for all three datasets, which is higher than those of the other clustering algorithms as well. It is noted that the misclassification rate of the GMM algorithm is higher than 0.50 for the N600C3 and N800C4 datasets. The reason is that the GMM algorithm clusters almost all the data into one class, which means that most data are misclassified. Thus, its MS is higher than 0.50. As the cluster number increases, the performance of the k-IM algorithm decreases, but it is still much better than that of the other popular clustering algorithms. The proposed algorithm produces competitive clustering results for the datasets with different cluster numbers tested in this section.

C. Effect of Noise
The measured data of real-world systems usually contain noise. The N400A2C2 dataset is used to test the performance of the k-IM algorithm under noise. The synthetic datasets are generated as follows. For each cluster, the input variables are generated. The response is calculated according to the set function. For each datum, a random value is generated according to the set interval, as shown in Table VI. The experimental results are given in Tables VI to VIII.
TABLE V. EFFECT OF SAMPLE ON THE CLUSTERING PERFORMANCE
The performance of the k-IM algorithm is better than that of the other popular clustering algorithms even if the dataset has noise in the relationship between the response of interest and input variables. The MS of the k-IM algorithm is smaller than 0.05, which means that less than five percent of the data are misclassified. Similar results can be found for the performance indexes ARI and NMI. As the noise level increases, the performance of the k-IM algorithm decreases. When the dataset has higher noise in the relationship between the response of interest and input variables, it is more difficult for the Kriging method to evaluate the relationship accurately. Thus, the performance of the proposed algorithm is worse when the noise level is higher. However, the MS of the k-IM algorithm is still smaller than 0.045. The k-IM algorithm produces competitive clustering results for the datasets tested in this section.

V. EXPERIMENTS ON ENGINEERING DATASETS
In this section, two engineering datasets are used to further test the proposed algorithm. Since the classification information of the engineering datasets is unknown, the experiments are conducted as follows. Each engineering dataset is first clustered into several subsets. For each subset, five-fold cross-validation is used to test whether the data in the same subset have a similar relationship between the response of interest and input variables. The subset is randomly divided into five parts; one part is selected as the testing data, and the remaining four parts are used as the training data. The experiment is conducted five times, once for each part, and the average R-square of the five experiments is used to assess the consistency of the relationship within the subset. R-square is calculated as follows.
$$R^{2} = 1 - \frac{\sum_{i=1}^{n_t} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n_t} (y_i - \bar{y})^{2}}$$
where $n_t$ is the number of the testing data, $y_i$ is the real response, $\hat{y}_i$ is the estimated response, and $\bar{y}$ is the mean of the real responses. The closer R-square is to 1, the better the performance.
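The evaluation procedure above can be sketched as follows. The prediction model is passed in as a pair of `fit` and `predict` callables; the straight-line least-squares model used in the example is only a stand-in chosen to keep the sketch short, not the model used in the experiments, and all function names are hypothetical.

```python
import random


def r_square(y_true, y_pred):
    """R-square: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def cross_validated_r_square(X, y, fit, predict, folds=5, seed=0):
    """Average R-square over `folds` train/test splits of one subset."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(folds):
        test = order[f::folds]                 # every folds-th index
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        predictions = [predict(model, X[i]) for i in test]
        scores.append(r_square([y[i] for i in test], predictions))
    return sum(scores) / folds


def fit_line(X, y):
    """Stand-in model: least-squares line on the first input variable."""
    n = len(X)
    mean_x = sum(x[0] for x in X) / n
    mean_y = sum(y) / n
    slope = (sum((x[0] - mean_x) * (yi - mean_y) for x, yi in zip(X, y))
             / sum((x[0] - mean_x) ** 2 for x in X))
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x[0] + intercept


X = [[i / 10] for i in range(20)]
y = [2 * x[0] + 1 for x in X]
score = cross_validated_r_square(X, y, fit_line, predict_line)
```

A subset whose data share one relationship yields an average R-square close to 1, while a subset mixing different relationships is harder to predict and scores lower.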

A. Yacht Hydrodynamics Dataset
The yacht hydrodynamics dataset is used first. The dataset comes from a series of 308 experiments on the residuary resistance of sailing yachts [23]. Six input variables are considered: the prismatic coefficient, longitudinal position of the center of buoyancy, length-displacement ratio, beam-draught ratio, length-beam ratio, and Froude number. The response is the residuary resistance per unit weight of displacement. The k-IM, FCM, KM, and GMM algorithms are used to cluster the yacht hydrodynamics dataset into two subsets. Five-fold cross-validation is then applied to each subset to test whether the data in the same subset have a similar relationship between the residuary resistance and input variables. The experiments are conducted 30 times, and the corresponding results are shown in Fig. 5. The average R-square of the k-IM algorithm is higher than 0.98 for the two obtained clusters, which is much better than that of the FCM, KM, and GMM algorithms, indicating that the data in the same cluster obtained by the proposed algorithm have a more similar relationship between the response of interest and input variables. The k-IM algorithm is able to cluster the yacht hydrodynamics dataset according to the relationship between the residuary resistance and input variables.

B. Bolt Tensioner Dataset
The bolt tensioner is a widely used tensioning tool in the assembly of large equipment such as nuclear power generators and in the construction of large buildings [24]. It is an annular jack that is raised by hydraulic pressure. The bolt tensioner dataset records the data from 40 simulations, including the maximum stress at the piston of the bolt tensioner and the corresponding structural parameters under the same hydraulic pressure. In the experiment, the cluster number is set to two as well, and the k-IM, FCM, KM, and GMM algorithms are used to cluster the dataset. Based on the clustering results of each clustering algorithm, five-fold cross-validation is used to test whether the data in the same cluster have a similar relationship between the maximum stress and structural parameters. Fig. 5 shows the experimental results. It is noted that the GMM algorithm cannot provide clustering results since the covariance matrix is ill-conditioned. From Fig. 6, it can be seen that the average R-square of the k-IM algorithm is the highest among the four tested clustering algorithms. The k-IM algorithm is able to cluster the bolt tensioner data such that the data in the same cluster have a similar relationship between the maximum stress and structural parameters.

VI. CONCLUSION
In this work, we proposed a k-interpolation model clustering algorithm (named k-IM) to cluster data according to the relationship between the response of interest and input variables. In the proposed algorithm, Kriging method is used to construct the interpolation models. For each datum, the estimation errors of the interpolation models of the clusters are used to decide its assignment. An optimization strategy is designed to obtain the clustering results under the framework of the k-means algorithm. The effects of the sample number, cluster number, and noise level on the k-IM algorithm are studied through several synthetic datasets. The results indicate that the k-IM algorithm can provide competitive clustering results. Two engineering datasets are further used to test the performance of the k-IM algorithm, and the experimental results show that the k-IM algorithm is able to cluster the data such that the data in the same part have a similar relationship between the response of interest and input variables.