Determining Optimal Number of K for e-Learning Groups Clustered using K-Medoid

e-Learning is most effective when learners are grouped and supported to learn according to their learning style and at their own pace. Extensive research has sought to categorize learners based on various e-learning parameters. Most of these studies apply clustering to group eLearners and, in particular, use the K-Medoid algorithm for better clustering. In the classical K-Medoid algorithm, determining the value of K is critical; two methods, the Elbow and Silhouette methods, are widely applied. In this paper, we experiment with both methods to determine the value of K for clustering eLearners with K-Medoid and find that the Silhouette method best predicts the value of K.
Keywords—Clustering; e-learning; elbow method; k-means; k-medoid; machine learning; silhouette method


I. INTRODUCTION
Educational systems are gradually moving from traditional teaching methods to electronic teaching methods. A variety of tasks can be performed in e-learning, such as assignments and quizzes. These activities are used to assess the learner's performance. Appropriate e-learning activities can be facilitated by dividing users into a suitable number of groups. Clustering is the Machine Learning (ML) technique used to group related objects, and numerous methods for cluster analysis exist in the field of Data Analytics. Determining the optimal number of clusters in a data set is a fundamental problem in partitioning clustering, such as K-means clustering, which requires the user to specify the number of clusters K to be generated. The appropriate number of clusters is somewhat subjective and depends on the method used for measuring similarities and the parameters used for partitioning [1]. Clustering algorithms are used to group similar objects in many domains, such as medicine, education, and governance. Clustering is well suited to grouping users based on their preferred learning activities in e-learning.
The existing methods mainly focus on the majority of learning activities based on the learner's style. This could be improved further by grouping the users based on their learning activities. The main objective of this paper is to identify the optimal number of groups by using cluster validation methods. If we identify the possible number of groups of learners, we can easily enhance their learning abilities according to their preferred learning activities.
The paper is organized as follows. It first introduces methods for identifying the value of K and elaborates two major ones, the Elbow and Silhouette methods. It then experiments with eLearner data using both methods and identifies the better method for determining the value of K when clustering eLearners.

A. Choosing the Optimum Number of Clusters
The optimum number of clusters can be obtained using two methods, the Elbow and Silhouette methods [2] [3].

1) Elbow method:
The number of clusters (K) in the Elbow method ranges from 1 to n. We calculate the WCSS (Within-Cluster Sum of Squares) for each value of K. For a cluster, the WCSS is the sum of squared distances between each point and the centroid. When the WCSS is plotted against K, the plot looks like an elbow. The WCSS is highest at K = 1 and decreases as the number of clusters grows. At some point the curve bends sharply, forming an elbow shape, after which it runs almost parallel to the X-axis [2] [3]. The optimal K value, or the optimum number of clusters, corresponds to this point.
Step 1: Run the clustering algorithm for each K from 1 to n clusters.
Step 2: For each K, calculate the WCSS:
$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2$$
where $S_k$ is the set of observations in the k-th cluster and $\bar{x}_{kj}$ is the j-th variable of the cluster center for the k-th cluster.
Step 3: Plot the curve of WCSS according to the number of clusters K.
Step 4: The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
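The Elbow steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function names `wcss` and `elbow_k` are hypothetical, the cluster center is taken as the per-variable mean (for K-Medoid the medoid could be substituted), and locating the bend by the largest second difference is an assumed heuristic, since the paper reads the elbow off the plot.

```python
from statistics import fmean


def wcss(clusters):
    """Within-cluster sum of squares for one partition.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    equal-length numeric tuples (one tuple per observation).
    """
    total = 0.0
    for members in clusters:
        # Cluster center: per-variable mean of the cluster's members.
        center = [fmean(column) for column in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


def elbow_k(wcss_by_k):
    """Return the K at which the WCSS curve bends most sharply.

    `wcss_by_k[0]` is the WCSS for K = 1, `wcss_by_k[1]` for K = 2, and so on.
    The bend is located by the largest second difference, which is an
    assumption of this sketch rather than a rule stated in the paper.
    """
    if len(wcss_by_k) < 3:
        raise ValueError("need WCSS for at least three values of K")
    best_k, best_bend = None, float("-inf")
    for idx in range(1, len(wcss_by_k) - 1):
        bend = wcss_by_k[idx - 1] - 2 * wcss_by_k[idx] + wcss_by_k[idx + 1]
        if bend > best_bend:
            best_k, best_bend = idx + 1, bend
    return best_k
```

In use, `wcss` would be called once per candidate K on the partition produced by the clustering routine, and the resulting list passed to `elbow_k`. When the curve has no pronounced bend, the heuristic still returns a value, so the plot should be inspected as well.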
2) Silhouette method:
This method measures how similar an object is to its own cluster (cohesion) compared with other clusters (separation). The Silhouette value lies in the range [-1, 1]; a value close to 1 indicates that the object is well matched to its own cluster, while a value close to -1 indicates the opposite [2] [3]. A clustering that produces mostly high Silhouette values is most likely acceptable and reasonable.

Algorithm
Step 1: Choose K from 2 to n clusters (the Silhouette value is not defined for a single cluster).
Step 2: For each K, calculate the Silhouette value.
Let C(i) be the cluster to which the i-th data point has been allocated.
Let |C(i)| be the number of data points in the cluster to which the i-th data point has been allocated.
Let a(i) be the average dissimilarity between the i-th data point and the other points in C(i); it indicates how well the i-th data point fits its cluster.
Let b(i) be the average dissimilarity between the i-th data point and the points of the nearest cluster other than its own.
The Silhouette Coefficient S(i) is given by:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
Step 3: Plot the curve of the Silhouette value according to the number of clusters K.
Step 4: The location of the highest point is taken as the suitable number of clusters.
Table I contrasts the Elbow and Silhouette methods.

| Elbow method | Silhouette method |
| --- | --- |
| It is more of a criterion for making decisions. | The Silhouette is a validation metric used in clustering. |
| The WCSS is a metric for clustering compactness, and it should be as low as possible. | This approach is useful for determining the consistency of clustering, or how well an object fits into its cluster. |
| The Elbow method is not computationally demanding. | The Silhouette method is more computationally demanding. |
| Sometimes the plot shows no clear elbow point; in such cases it is very hard to finalize the number of clusters. | The number of clusters can be identified from the Silhouette value. |
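The Silhouette computation described in the algorithm above can be sketched as follows. The function names are hypothetical, Euclidean distance is assumed as the dissimilarity measure, and the score of a point in a singleton cluster is set to 0 by the usual convention; none of these choices is stated in the paper. The nested loops over all pairs of points also show why Table I describes the method as computationally demanding.

```python
from math import dist
from statistics import fmean


def silhouette_values(points, labels):
    """Return S(i) for every point, given one cluster label per point."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette value needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Singleton cluster: S(i) is set to 0 by convention.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the point's cluster.
        a = fmean(dist(points[i], points[j]) for j in own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            fmean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        worst = max(a, b)
        scores.append((b - a) / worst if worst > 0 else 0.0)
    return scores


def mean_silhouette(points, labels):
    """Average Silhouette value of a partition; higher is better."""
    return fmean(silhouette_values(points, labels))
```

To choose K, `mean_silhouette` would be evaluated on the partition obtained for each candidate K, and the K with the highest value kept.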

II. RELATED WORKS
In cluster analysis, especially in the field of Data Analytics, determining the optimum number of clusters is a major challenge. Of the many methods used to find the optimal number of clusters, the Elbow and Silhouette methods are the most frequently used.
H. Humaira et al. [4] proposed identifying the number of clusters for the K-Means algorithm using the Elbow method. However, this approach lacks a comparison with the Silhouette method.
401 | Page www.ijacsa.thesai.org
M. A. Syakur et al. [5] suggested integrating the K-Means clustering method with the Elbow method to identify the best customer profile cluster. The approach determines the best number of clusters with the Elbow method, and that number becomes the default for the characterization process in the case study.
Mohammad Khalil et al. [6] clustered patterns of engagement in Massive Open Online Courses (MOOCs), using learning analytics to identify student groups. This research predicts learners' engagement in MOOCs by choosing optimal clusters. Brahim Hmedna et al. [7] investigated how learners prefer to process information in MOOCs. Using the K-Means clustering algorithm, the study found that the majority of learners favour active learning styles.
Alaa A. Qaffas et al. [8] worked towards an optimal personalization strategy in MOOCs. Using K-Means clustering, their approach aimed to increase the retention rate and quality of learning in MOOCs.

Feng Zhang et al. [9] proposed a personalized classification of MOOC videos based on cluster analysis and process mining. They propose a process model for a group of students that represents the students' overall video-watching behaviour. Then, based on the video-watching data of the students involved, they use process mining to mine the process model of each student cluster. Finally, the difficulty and importance of a video are measured based on the process model.
Mohammad Khalil et al. [10] portrayed MOOC learners through a clustering experience using learning analytics. The study used cluster analysis to group the students into suitable profiles based on their participation in a university-mandated MOOC that was also accessible to the public.
Delali Kwasi Dake et al. [11] applied the K-means clustering algorithm to analyze students' clusters for centered project-based learning. K clusters of 20 are used in this study. The findings show that the K-means clustering algorithm is good at grouping learners based on similar performance characteristics.
Wenbing Chang et al. [12] analysed university students' behaviour using a fusion K-Means clustering algorithm. They proposed a new algorithm, K-CFSFDP, based on K-means and clustering by fast search and find of density peaks (CFSFDP), which improves data point distance and density.
Abdallah Moubayed et al. [13] proposed a model of student engagement levels in an e-learning environment using K-means clustering. The study recommends that students be clustered using the K-means algorithm based on 12 engagement metrics divided into two categories: interaction-related and effort-related.
Youngjin Lee et al. [14] used a self-organizing map and clustering to investigate problem-solving patterns in a Massive Open Online Course in an exploratory study. The study suggests that combining self-organizing map and hierarchical clustering algorithms can be a useful exploratory data analysis method for MOOC instructors to group related students based on a large number of variables and analyse their characteristics from multiple perspectives.
Prerna Joshi et al. [15] proposed a model for predicting students' academic performance using the K-Means and K-Medoids unsupervised clustering techniques. The K-Means and K-Medoids algorithms were used in this study to examine students' result data.
Yaminee S. Patil et al. [16] presented K-Means clustering with the MapReduce technique. The article describes an implementation of the K-Means clustering algorithm in a distributed environment using Apache Hadoop.
Xin Lu et al. [17] proposed an improved K-means distributed clustering algorithm based on the Spark parallel computing framework. The work introduces a density-based method for selecting initial cluster centers to improve the K-means distributed clustering algorithm.
The literature review reveals that authors have mostly focused on evaluating learners' performance with clustering techniques based on learning styles, learning activities, and e-learning tools. The existing approaches do not address how to choose an optimum number of clusters K to group the learners. Hence, this research proposes an algorithm to identify the optimum number of clusters K for grouping learners based on their preferred e-learning activities.

1) To identify learners' preferred e-learning activities using PCA.
2) To find the correlation coefficient for selected e-learning activities using Pearson Correlation.
3) To identify an optimal number of clusters using the Elbow and Silhouette methods.
4) To select the best method for choosing the optimal cluster size.
5) To group the learners based on their preferred e-learning activities with the optimal cluster size.

IV. PROPOSED ARCHITECTURE OF CHOOSING OPTIMAL CLUSTER
The purpose of finding the optimal number of clusters is to extract the most suitable number of groups of learners with their preferred e-learning activities. The course teacher can assign the selected activities to learners based on the clusters, which helps the learners enhance their learning abilities. Fig. 2 portrays the process of finding an optimal number of groups with the help of the Elbow and Silhouette methods. It comprises the following stages:
Stage 1: Identify learners' preferred e-learning activities.
Stage 2: Apply cluster validation to fix the cluster size.
Stage 3: Identify the possible number of clusters using the Elbow and Silhouette methods.
Stage 4: Select the method which is suitable to group the learners based on their preferred learning activities.
Stage 5: List the optimal cluster to group the learners.

Stage 1: Identify learner's preferred e-learning activities:
In the first stage, identify the learners' preferred e-learning activities by using Principal Component Analysis (PCA) and compute the correlation coefficient using Pearson Correlation.
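The correlation part of Stage 1 can be illustrated with a short sketch of the Pearson correlation coefficient between the scores of two activities. The function name `pearson` is hypothetical and the PCA step is not shown; this is an illustration of the standard formula, not the paper's own code.

```python
from math import sqrt
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = fmean(xs), fmean(ys)
    # Sum of cross-products and sums of squares of the deviations from the mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    square_x = sum((x - mean_x) ** 2 for x in xs)
    square_y = sum((y - mean_y) ** 2 for y in ys)
    if square_x == 0 or square_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cross / sqrt(square_x * square_y)
```

A coefficient near 1 or -1 indicates that two activities carry largely the same information about learners, while a value near 0 indicates that they vary independently.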

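3: Apply the cluster validation methods
Elbow method:
$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2$$
where $S_k$ is the set of observations in the k-th cluster and $\bar{x}_{kj}$ is the j-th variable of the cluster center for the k-th cluster.
Silhouette method:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
4: Choose an appropriate method to fix the cluster size
5: List the optimal cluster size
6: End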
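Both validation methods presuppose a K-Medoid partition for each candidate K. The paper does not list its K-Medoid implementation, so the following is only a minimal sketch of one common variant (alternating assignment and medoid update). The function name `k_medoids`, the Euclidean distance, and the choice of the first k points as initial medoids are assumptions made for illustration.

```python
from math import dist


def k_medoids(points, k, max_iter=100):
    """Partition `points` (numeric tuples) around k medoids.

    Returns (medoids, labels): the medoid points and, for each input point,
    the index of the medoid it is assigned to.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    def assign(medoid_ids):
        # Each point joins the cluster of its nearest medoid.
        return [
            min(range(k), key=lambda c: dist(p, points[medoid_ids[c]]))
            for p in points
        ]

    # Assumed initialisation: the first k points serve as starting medoids.
    medoid_ids = list(range(k))
    for _ in range(max_iter):
        labels = assign(medoid_ids)
        new_ids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if its cluster lost all members.
                new_ids.append(medoid_ids[c])
                continue
            # New medoid: the member with the least total distance to the rest.
            new_ids.append(
                min(
                    members,
                    key=lambda i: sum(dist(points[i], points[j]) for j in members),
                )
            )
        if new_ids == medoid_ids:
            break
        medoid_ids = new_ids
    return [points[i] for i in medoid_ids], assign(medoid_ids)
```

Because the result depends on the starting medoids, a practical run would typically try several initialisations and keep the partition with the lowest total distance.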

VI. RESULTS AND DISCUSSION
The optimal number of clusters was determined using the following e-learning activities: Continuous Assessment (CA), Assignment, Test, Practical, Seminar, and Course Work.
Step 1: Preferred e-learning activities
Table II lists the preferred e-learning activities of 70 users and their performances. These preferred e-learning activities are identified through PCA, and the Pearson Correlation is used to find the correlation coefficient for each attribute.
Step 2: Compute the sum of activities for each learner
Compute the sum of the selected activities for each learner. Table III lists the sum of e-learning activities of 70 users and their performances.
Step 3: Apply the cluster validation methods
Graph 1 portrays the dataset of 70 users and their performance with their preferred e-learning activities.
Step 4: Choose an appropriate method to fix the cluster size
From Step 3, the appropriate method for validating the cluster size for the given data set is the Silhouette method. This is determined by comparing the cluster sizes returned by both methods.
Step 5: List the optimal cluster size
The optimal number of clusters is K = 10. Accordingly, the learners fall into 10 groups according to their preferred e-learning activities.
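Step 2 above, the per-learner sum of selected activities, can be sketched as follows. The activity names come from the list at the start of this section; the function name `activity_totals`, the record layout, and the scores in the usage example are hypothetical and are not taken from Tables II or III.

```python
# Activity names as listed in this section.
ACTIVITIES = ("CA", "Assignment", "Test", "Practical", "Seminar", "Course Work")


def activity_totals(records, selected=ACTIVITIES):
    """Map each learner id to the sum of that learner's scores over `selected`.

    `records` maps a learner id to a dict of activity name -> score.
    """
    return {
        learner: sum(scores[name] for name in selected)
        for learner, scores in records.items()
    }


# Hypothetical scores for two learners, for illustration only.
example = {
    "L01": {"CA": 8, "Assignment": 7, "Test": 9, "Practical": 6, "Seminar": 5, "Course Work": 7},
    "L02": {"CA": 4, "Assignment": 6, "Test": 5, "Practical": 8, "Seminar": 9, "Course Work": 6},
}
totals = activity_totals(example, selected=("CA", "Test"))
```

The resulting totals form the one-dimensional input on which the cluster validation methods of Step 3 would operate.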
The paper experiments with data using two cluster validation methods, the Elbow and Silhouette methods, which are frequently used to validate cluster size. The cluster size returned by the Elbow method for the given data set is 5 (K = 5), while the Silhouette method returns 10 (K = 10). With a larger number of clusters, the learning abilities of each learner are identified in greater depth. Based on their learning abilities, we can optimize and predict e-learning activities for each user. This paper therefore concludes that the Silhouette method is the better method for validating cluster size for the given data set.

VII. CONCLUSION AND FUTURE WORK
In this paper, we have shown that determining the optimal value of K is essential when clustering eLearners with the K-Medoid algorithm. We experimented with the Elbow and Silhouette methods to compute the optimal value of K. To decide which of the two is more suitable, experiments were conducted and their results examined carefully. The results indicate that the Silhouette method is better suited to fixing the optimal value of K. This paper focuses on estimating K for K-Medoid-based clustering of eLearners. Computing the optimum number of clusters K for other clustering methods applied to grouping eLearners is suggested as a future enhancement.