OvSbChain: An Enhanced Snowball Chain Approach for Detecting Overlapping Communities in Social Graphs

Overlapping Snowball Chain is an extension of Snowball Chain, which is based on the concept of community formation in line with the snowball chaining process. The approach is inspired by the snowball sampling process, wherein a snowball grows to form a chain of nodes, leading to the formation of mutually exclusive communities in Snowball Chain. In the current work, nodes are allowed to be shared among different snowball chains in a graph, leading to the formation of overlapping communities. Unlike its predecessor Snowball Chain, the proposed technique does not require any hyper-parameter, which is often difficult to tune in most existing methods. The proposed algorithm works in two phases: overlapping chains are formed in the first phase, and they are then combined using a similarity-based criterion in the second phase. The communities identified at the end of the second phase are evaluated using different measures, including modularity, overlapping NMI, and running time, over both real-world and synthetic benchmark datasets. The proposed Overlapping Snowball Chain method is also compared with eleven state-of-the-art community detection methods.


I. INTRODUCTION
In recent years, there has been tremendous growth in the study of linked data in the form of networks, such as the Internet, the World Wide Web, and social networks. The relationships among the entities in these networks provide rich insights into various dynamic interactions and may prove beneficial in various applications [1]. To analyse and study these networks, a graph is used as a data structure, which consists of a set of nodes joined by links or edges that can be labelled/unlabelled, directed/undirected, or signed/unsigned. The representation of an online social network is termed a social graph, which provides a good visualization and eases the interpretation of the network.
One of the emerging research areas in social network analysis is community detection, which digs deep into the social graph and mines the densest subgraphs that are highly cohesive in nature. A community in a network is represented by a set of nodes with high-density links among themselves, but low-density links to nodes in other communities [2]. These subgraphs are called communities or modules. Community detection in a social graph mainly involves splitting it into its constituent functional groups. The task has largely been addressed in a disjoint community context, wherein the communities are considered to be mutually exclusive. However, in real-world networks, community structures can be overlapping, wherein a node belongs to multiple communities. A density-based approach called CMiner [3] defines a distance function to find similarity among nodes, and overlapping communities are identified based on this distance function. Another work [4], called OCTracker, detects overlapping communities along with their evolution. A similar work [5], called HOCTracker, identifies hierarchical communities in dynamic social networks.
The work in this paper aims to address this issue by proposing a novel overlapping community detection algorithm that extends the existing SbChain algorithm. The proposed method, named OvSbChain, starts with the identification of the seed or core nodes in a social graph based on a node parameter called normalized degree. The nodes in the entire social graph are ranked on this parameter and processed in non-increasing order of their ranks. The method works in two phases. In the first phase, every node is paired with its best suited neighbor according to a score value in each iteration. After several iterations, chains of nodes are formed that may share nodes with each other, i.e., there could be overlapping nodes among different chains. Therefore, the proposed technique is called overlapping snowball chains. The second phase tries to combine chains based on a similarity criterion, as discussed in Section III, which finally leads to the formation of overlapping communities. Thus, the technique addresses the problem at hand, i.e., community detection, using an uncomplicated and elementary strategy. The major enhancements in this work can be summarized as follows:
1) OvSbChain produces overlapping communities, unlike SbChain, which produces only crisp communities.
2) No hyper-parameter tuning is required in OvSbChain; hence, it produces the same set of communities every time it is run.
3) SbChain uses a maximum common neighbor criterion to find the best neighbor of a node, whereas OvSbChain uses the normalized degree function. The two techniques also differ in the way they find the seed nodes. This is discussed in detail in Section III.
4) The results are evaluated and compared based on the Nicosia modularity measure [6], two types of ONMI [7], [8], and running time, as discussed in Section IV.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022
The rest of the paper is organized as follows. Section II presents a brief review of the existing literature on overlapping community detection. Section III presents the preliminary concepts, along with the proposed approach and the functional details of the OvSbChain method. Section IV describes the datasets, evaluation parameters, experimental settings, and analysis of the results. Section V concludes the paper, and Section VI provides future directions of research.

II. RELATED WORK
This section presents a brief description of the state of the art in the area of overlapping community detection. A review of the current community detection methods is given in [19]. It segregates the detection methods into probability-based and deep learning-based. The classical methods use probability-based models for community identification, whereas deep learning methods generally convert complex networks to lower-dimensional data so as to ease the process. A few other works, such as [20], [21], [22], discuss various community detection algorithms in terms of their weaknesses and strengths, algorithm performance, and other aspects. We mainly discuss the traditional approaches for overlapping community detection and compare them with OvSbChain in Section IV.
CFinder is an overlapping community detection technique that makes use of the Clique Percolation Method (CPM) [23] to identify the k-cliques in a network. A k-clique is a complete subgraph consisting of k nodes. This method finds dense groups of overlapping nodes in a network [9]. LAIS [10] is an algorithm that combines two procedures, List Aggregate (LA) and Improved Iterative Scan (IS^2). The LA procedure initializes the clusters, and the IS^2 procedure improves upon this set of clusters in an iterative manner. The IS procedure starts with a seed node and processes clusters by expanding or shrinking them according to a metric value, and IS^2 improves upon this by focussing on the nodes within a cluster and its neighboring nodes, instead of considering the entire graph. The overall algorithm detects overlapping communities in a network.
CONGA (Cluster-Overlap Newman Girvan Algorithm) [11] is an overlapping community detection algorithm that uses the concept of split-betweenness, which is based on counting the shortest paths between all pairs of nodes in the network. It keeps removing edges with high betweenness and thus keeps splitting the network, eventually into singleton clusters. The partition with the desired number of clusters is then selected. However, the algorithm requires the number of communities as an input.
The PEACOCK algorithm [11] consists of two phases. The first phase is similar to CONGA, where the network is split using split-betweenness. The altered network is then processed by a disjoint community detection algorithm, such as one that detects communities based on node centrality, or CNM.
COPRA [12] technique extends the previous work on the label propagation by Raghavan, Albert, and Kumara [24], and it is able to detect overlapping communities in a social network. The main extension is to make the label and propagation step to include information about more than one community. Therefore, it allows each node to belong to up to v communities, where v is a hyper-parameter.
In SLPA (Speaker-listener Label Propagation algorithm) [13], the nodes store multiple labels, and act either as the provider or consumer of information. A node keeps gathering information about the observed labels without removing the previously stored label. The frequency of observation of a label by a node is directly related to the spreading of the label among other nodes. It requires a threshold input parameter that gives the minimum probability of occurrence of a label, before it is deleted from the memory of the node.
Demon (Democratic Estimate of the Modular Organization of a Network) [14] is a simple approach for community detection which works on the modular structure of networks. First, each node finds and votes for the communities present in its local neighborhood, using a label propagation algorithm. These local communities are merged to form a global collection by combining all the votes, leading to the formation of overlapping modules. However, this algorithm requires a minimum threshold parameter.
BIGCLAM (Cluster Affiliation Model for Big Networks) [15] is a model-based community detection algorithm that allows for identification of dense overlapping, hierarchical communities in massive networks. Each node-community pair is assigned a non-negative latent factor that decides the degree of membership between them. The probability of a connection between a pair of nodes in the network is modeled as a function of the shared community affiliations. Further, the communities are identified using non-negative matrix factorization methods and block stochastic gradient descent.
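The edge model that BIGCLAM builds on can be illustrated with a short sketch. It assumes the commonly stated form of the model, in which the connection probability of nodes u and v is 1 - exp(-F_u · F_v) for their non-negative affiliation vectors F_u and F_v; the function name `edge_probability` is hypothetical and is not taken from [15].

```python
import math
from typing import Sequence


def edge_probability(f_u: Sequence[float], f_v: Sequence[float]) -> float:
    """Return 1 - exp(-F_u . F_v) for two non-negative affiliation vectors.

    Assumed form of the BIGCLAM edge model; each vector holds one latent
    membership strength per community.
    """
    if len(f_u) != len(f_v):
        raise ValueError("affiliation vectors must have the same length")
    if any(x < 0 for x in (*f_u, *f_v)):
        raise ValueError("affiliation strengths must be non-negative")
    dot = sum(a * b for a, b in zip(f_u, f_v))
    return 1.0 - math.exp(-dot)


# Nodes that share no community are never linked by the model itself,
# while stronger shared membership pushes the probability towards 1.
p_none = edge_probability([1.0, 0.0], [0.0, 2.0])
p_weak = edge_probability([0.5, 0.0], [0.5, 0.0])
p_strong = edge_probability([2.0, 0.0], [2.0, 0.0])
```

Under this form, the probability depends only on the communities that both nodes belong to, which is why densely overlapping regions receive more edges.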
MULTICOM is another community detection technique that produces overlapping communities starting with an initial seed set. A local community is detected around the seed nodes using a transformation function. After this step, each node belongs to a single community. Thereafter, the transformation function is used to transform each node into its respective vector, and these vectors are clustered using a local clustering technique. For each cluster produced in the previous step, a ratio value is calculated using a function mentioned in [16]. The clusters having a ratio value less than a pre-defined threshold are considered for further exploration. The process repeats until the number of communities is greater than the set value or there is no new seed.
In [17], the technique called Lemon (Local Expansion via Minimum One Norm) detects overlapping communities by finding a sparse vector in the local spectra span, such that all the seeds are in its support. The span of vector dimensions produced by random walk is used as an approximate invariant subspace, called the local spectra. However, this local spectral approach is used for community detection from a small seed set.
ANGEL [18] is a faster successor of Demon that uses a bottom-up approach to find overlapping communities. It works in two phases, where the first phase produces local communities using ego network of the nodes. The second phase merges communities until convergence or a threshold value is met.
The work in [25] develops a PageRank algorithm with constraints so as to obtain tightly packed overlapping communities. Using probability-based methods, a walker avoids irrelevant communities, which results in communities with a good fitness score. In [26], a method called Adjacency Propagation Algorithm (APA) is developed using adjacent nodes as seed nodes. It uses a threshold parameter to identify subgraphs based on their intra-connectivity. Another work [27] can produce disjoint as well as overlapping communities in a two-step process that uses a genetic algorithm. In the first step, the mean path length of a community is calculated in relation to its respective ER random graph. The second step shrinks the search space by selecting a subset of nodes. In another work [28], influential nodes are identified to form local communities. These communities expand as nodes join them. Overlapping communities are then merged and evaluated on a model.
An application-based work in [29] exploits community detection to protect the privacy of individuals on social platforms. It discusses community detection attacks and the rewiring of connections for the development of an effective attack approach.
The OvSbChain approach focuses on local community detection using graph parameters such as degree and global clustering coefficient. If these local communities are sufficiently similar, they are merged. The motivation behind this work is that it exploits simple topological features of the graph to detect communities without any expensive overhead, in two simple levels: (i) formation of local communities, and (ii) combining local communities based on two criteria.

III. PROPOSED APPROACH
The OvSbChain approach discussed in this section is an extension of the previously developed SbChain [30] method. It detects overlapping communities, i.e., nodes are allowed to be shared among more than one community. The approach works on two levels. In the first level, it starts by finding the best suited pairs of nodes according to an initial criterion. This level ends with the formation of overlapping snowball chains. In the second level, these chains are merged to form larger chains, and eventually communities, based on the global clustering coefficient or the majority overlapping criterion.

A. Preliminaries
For a graph G(V, E), V represents the set of vertices or nodes in the graph, and n = |V| is the number of nodes. E is the set of edges, i.e., {e_uv = (u, v) : u, v ∈ V}. This section presents the details about frequently used terms and their meanings, as listed in Table I.

Table I. Frequently used terms and their meanings
Symbol            Description
(symbol missing)  Maximum degree value in the graph
N_best(v)         Best scoring neighbor of node v
s^(n)             n-th snowball chain
GCC(s^(n))        Global clustering coefficient of a snowball chain s^(n)

OvSbChain works at two levels that are described in the following paragraphs:
1) Level-I: It starts by finding the seed nodes and sorting them in non-increasing order, based on the following criteria, so as to begin the processing.
1) Seed function: A seed v can be identified by sorting nodes according to their normalized degree value, given by equation 1. This value also represents the score, score(v), of a node v. The sorted nodes are processed in non-increasing order of this value. It should be noted that SbChain used a combination of normalized degree and normalized local clustering coefficient for sorting nodes.
2) N_best(v) function: The best suited neighbor for a seed v is identified using the same score value, i.e., the normalized degree. This neighbor then combines with the seed v to form a snowball chain. SbChain, by contrast, used the maximum number of overlapping neighbors to find the best neighbor.
It should be noted that these functions have been chosen and designed empirically.
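The two Level-I functions can be written out as follows. This is a sketch that follows the verbal description only: k(v) is the degree of node v and N(v) its neighbor list, while the symbol k_max for the maximum degree in the graph and the exact normalization are assumptions.

```latex
\begin{equation*}
\mathrm{score}(v) = \frac{k(v)}{k_{\max}},
\qquad k_{\max} = \max_{u \in V} k(u)
\end{equation*}
\begin{equation*}
N_{best}(v) = \operatorname*{arg\,max}_{u \in N(v)} \mathrm{score}(u)
\end{equation*}
```

Under this reading, score(v) lies in (0, 1] for every non-isolated node, and the highest-degree node is processed first.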
2) Level-II: The second level starts with the chains formed in the first level. These chains are merged to form communities, so as to eliminate nearly identical chains. The snowball pairs/chains formed in the first level are combined based on the global clustering coefficient (GCC) or the majority overlapping criterion to form a community. GCC is the ratio of the number of closed triplets (three times the number of triangles) to the total number of connected triplets in a graph. The technique therefore looks for higher values of GCC for a community, so as to find coherent communities. The first criterion involves calculating the GCC of the combined community, along with the GCC of each individual snowball chain. If the combined GCC is higher than the GCC of each chain, then their combination is permitted; otherwise it is discarded, i.e., the chains remain undisturbed. Communities can also be combined as per the second criterion of majority overlapping. This allows communities to be merged if they have at least 70% overlapping nodes. This percentage was decided empirically, as the resulting communities do not change beyond this point. Also, the minimum percentage overlap was required to be above 50% so as to form coherent communities. The majority overlapping test prevents the existence of two or more similar communities.
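The two Level-II merge tests can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the graph is a dictionary of adjacency sets, GCC is computed on the subgraph induced by a chain, and the overlap is measured relative to the smaller chain (the text does not say which denominator is used). The names `gcc`, `overlap_fraction`, and `should_merge` are hypothetical.

```python
from itertools import combinations


def gcc(graph, nodes):
    """Global clustering coefficient of the subgraph induced by `nodes`:
    closed triplets divided by all connected triplets."""
    nodes = set(nodes)
    closed = 0    # triplets whose two end nodes are also adjacent
    triplets = 0  # paths of length two, counted once per centre node
    for v in nodes:
        nbrs = graph[v] & nodes
        for u, w in combinations(nbrs, 2):
            triplets += 1
            if w in graph[u]:
                closed += 1
    return closed / triplets if triplets else 0.0


def overlap_fraction(a, b):
    """Shared nodes as a fraction of the smaller chain (assumed definition)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def should_merge(graph, a, b, overlap_threshold=0.70):
    """Merge if the union has a higher GCC than each chain, or if the
    chains share at least 70% of their nodes."""
    union = set(a) | set(b)
    gcc_improves = gcc(graph, union) > max(gcc(graph, a), gcc(graph, b))
    return gcc_improves or overlap_fraction(a, b) >= overlap_threshold


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
merge_triangle = should_merge(toy, {"a", "b"}, {"b", "c"})      # union closes a triangle
merge_pendant = should_merge(toy, {"a", "b", "c"}, {"a", "d"})  # union lowers the GCC
```

In the toy graph, the two pairs that together close the triangle are merged, whereas adding the pendant node to the triangle lowers the GCC from 1.0 to 0.6 and the overlap is only 50%, so those chains stay separate.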
It should be noted that OvSbChain creates overlapping communities because it does not follow the non-redundant node strategy previously used by SbChain. According to this strategy, a node could join with only a single node per iteration, which creates mutually exclusive communities. The focus of OvSbChain is to develop communities that share nodes among themselves. Therefore, it discards this strategy and allows a node to be a part of multiple chains within a single iteration.

B. Algorithm
As outlined in Algorithm 2, OvSbChain starts with pre-processing, i.e., it calculates the neighbor list N(v), degree k(v), and score(v) (equation 1) for each node v in the social graph. These nodes are then sorted in non-increasing order of their respective scores and processed one at a time. Snowball chains are formed by finding the best neighbor (Algorithm 1) for each node on the basis of this score value, i.e., for a given node v, the neighboring node N_best(v) with the highest score is chosen.
In the first iteration, the best suited node pairs are combined. The snowball chains s^(n) so formed grow internally, and new chains are also formed in each iteration as the nodes find their matches. This concludes level-I of the proposed technique.
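The first iteration of level-I can be illustrated with the following sketch. It covers only the initial pairing step, since the later growth of chains is specified by Algorithms 1 and 2 and is not reproduced here. The normalized degree k(v)/k_max as the score, the tie-breaking by node label, and the function name `first_iteration_pairs` are assumptions made for illustration.

```python
def first_iteration_pairs(graph):
    """Pair every node with its best-scoring neighbor, visiting nodes in
    non-increasing order of score. A node may appear in several pairs,
    which is what allows chains to overlap."""
    # Assumed score: degree divided by the maximum degree in the graph.
    k_max = max((len(nbrs) for nbrs in graph.values()), default=0) or 1
    score = {v: len(nbrs) / k_max for v, nbrs in graph.items()}

    # Ties are broken by node label so that the output is deterministic.
    order = sorted(graph, key=lambda v: (-score[v], str(v)))

    pairs = []
    for v in order:
        if not graph[v]:
            continue  # isolated nodes have no neighbor to pair with
        best = min(graph[v], key=lambda u: (-score[u], str(u)))
        pair = frozenset((v, best))
        if pair not in pairs:  # skip the mirror image of an existing pair
            pairs.append(pair)
    return pairs


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
pairs = first_iteration_pairs(toy)
```

On the toy graph, the hub node a is the best neighbor of b, c, and d, so it ends up in three pairs. Because no hyper-parameter or random choice is involved, repeated runs return the same pairs.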
Level-II starts with the calculation of the global clustering coefficient GCC(s^(n)) for each snowball chain s^(n) formed in level-I. These chains are combined and added to the community list C if the GCC of the union of two chains, GCC(s^(j) ∪ s^(k)), is higher than the GCC of each individual chain, as described in Section III-A.

IV. EXPERIMENTAL SETUP AND RESULTS
In this section, the performance of the OvSbChain algorithm is evaluated over different datasets using various parameters. OvSbChain is compared with several other overlapping community detection techniques. The following subsections describe the datasets, the evaluation metrics, and the results.

A. Dataset
The efficacy of OvSbChain and other techniques is evaluated over ten real-world datasets and five computer-generated Lancichinetti Fortunato Radicchi (LFR) benchmark datasets [40], as described in Tables II and III. The LFR benchmark datasets consist of 1000 nodes each, with the value of the mixing parameter (µ) varying from 0.1 to 0.5. Hence, the datasets are named LFR1K-0.1, LFR1K-0.2, ..., LFR1K-0.5, respectively.

B. Evaluation Metrics
The communities identified as a result of Algorithm 2 are analyzed using the overlapping modularity measure given by [6], two types of ONMI (Overlapping Normalized Mutual Information), and running time.
It should be noted that the modularity measure given by [6] is represented as Q_ov. By definition, Q_ov = 0 for singleton communities or if all nodes belong to a single community. Q_ov uses a belonging coefficient for each node, which defines the percentage contribution of a node to a community. For each node, the sum of these coefficients is 1.
ONMI is an extension of the NMI score that accommodates overlapping partitions within a network. Two types of ONMI are used in this section. The first is LFK (Lancichinetti Fortunato Kertesz) [7], referred to as NMI_LFK, which overestimates the similarity of two clusters in some cases. To fix this, another ONMI called MGH (McDaid Greene Hurley) is used. This version uses a different normalization than the original LFK-based ONMI [8], and it is represented as NMI_MGH.
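The constraint on the belonging coefficient can be made concrete with a small sketch. The text does not state how the coefficients are assigned, so the equal split used below (each of a node's communities receives the same share) is an assumption, and the function name `equal_belonging` is hypothetical.

```python
from collections import defaultdict


def equal_belonging(communities):
    """Map each node to {community index: coefficient}, splitting the
    node's unit membership equally over the communities it belongs to."""
    membership = defaultdict(list)
    for index, members in enumerate(communities):
        for v in members:
            membership[v].append(index)
    return {
        v: {index: 1.0 / len(indices) for index in indices}
        for v, indices in membership.items()
    }


# Node "c" is shared by both communities, so it contributes half to each.
coefficients = equal_belonging([{"a", "b", "c"}, {"c", "d"}])
```

Whatever assignment rule is used, the coefficients of a node must add up to 1, which the equal split satisfies by construction.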

C. Results
Techniques like COPRA, PEACOCK, CONGA, SLPA, CFinder, Demon, and ANGEL use a tuning parameter. Hence, the values reported for them in this paper are those that give the best Q_ov. Fig. 1a shows the Q_ov results of the various overlapping techniques compared with OvSbChain. The same is also presented in Table IV. Fig. 2a shows, for each technique, the number of datasets on which its Q_ov is greater than or equal to 80% of the maximum Q_ov obtained across all the techniques. It can be observed that OvSbChain has an above-average performance in terms of Q_ov. Though other techniques show better Q_ov values, their respective ONMI values drop. Hence, high modularity does not necessarily guarantee good partitions. Although the modularity values of OvSbChain are comparable or average in comparison to existing techniques, its ONMI values are promising. As an example, although SLPA has the highest modularity among all techniques, it does not produce high ONMI values. OvSbChain is faster for smaller datasets and produces comparable or better results in certain cases in terms of both NMI_LFK and NMI_MGH.
Both NMI_LFK and NMI_MGH are calculated and compared on both real-world and LFR datasets, as shown in Fig. 1b and 1c for OvSbChain and other techniques. OvSbChain is seen to perform well in most of the cases. Fig. 2b and 2c also show, for each technique, the number of datasets that have NMI values greater than or equal to 80% of the maximum existing value of NMI (among all the given datasets). Tables VI and VII show both the ONMI values. It can be observed that the performance of OvSbChain is above average for both the NMI_LFK and NMI_MGH measures.
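The counting behind the 80% comparison can be sketched as follows. The sketch assumes that the maximum is taken per dataset across all techniques. The technique names, dataset names, and scores are placeholders for illustration only and are not results from this paper. The function name `count_near_best` is hypothetical.

```python
def count_near_best(scores, fraction=0.8):
    """For each technique, count the datasets on which its score is at
    least `fraction` of the best score any technique achieved there."""
    datasets = {d for per_dataset in scores.values() for d in per_dataset}
    best = {
        d: max(per[d] for per in scores.values() if d in per)
        for d in datasets
    }
    return {
        technique: sum(
            1 for d, value in per.items() if value >= fraction * best[d]
        )
        for technique, per in scores.items()
    }


# Placeholder values (NOT taken from the paper's tables).
placeholder_scores = {
    "technique_A": {"dataset_1": 0.50, "dataset_2": 0.10, "dataset_3": 0.40},
    "technique_B": {"dataset_1": 0.45, "dataset_2": 0.25, "dataset_3": 0.35},
}
counts = count_near_best(placeholder_scores)
```

With these placeholder numbers, technique_A stays within 80% of the best score on two datasets and technique_B on all three. The same function applies unchanged to Q_ov, NMI_LFK, or NMI_MGH values.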
A comparison of the running time of all the techniques is presented in Fig. 1d. A logarithmic scale is used for this comparison because it provides a better visualization. The CFinder technique is excluded from this comparison because it does not report the time it takes to compute the communities. It can be observed that OvSbChain works fast on smaller datasets, and it is comparable to other techniques on larger datasets. The same can be seen in Table V. As mentioned before, a few techniques use a parameter which needs to be defined every time they are executed. Therefore, in our experimental evaluation, these techniques were run for different parameter values, the run with the best Q_ov was chosen, and the corresponding ONMI and running time values are reported. On the other hand, our proposed OvSbChain approach does not need any parameter value to be set and hence produces the same result every time it is run.

V. CONCLUSION
It can be seen that the OvSbChain technique discussed in the current article works well on real-world datasets, with good results in terms of NMI_LFK and NMI_MGH. It gives comparable results on a few benchmark datasets as well. It should be noted that the running speed of the algorithm was on par with other techniques, or even better in a few cases. The experiments show average results on the modularity measure as well. OvSbChain does not use any external parameter, unlike most of its counterparts. Also, it produces the same results every time it is run, unlike other techniques such as COPRA, which gives different results each time it is run (for the same parameter value) and was therefore run ten times, with the results averaged. Therefore, it can be established that our technique works well without any parameter tuning, unlike the other approaches. The overhead of the calculations involved in the technique slows it down, but that can be mitigated using better hardware options.

VI. FUTURE WORK
The future scope of improvement includes extending the technique to directed graphs, as real-world networks are generally directed in nature. OvSbChain can also be improved to find seed nodes faster and with higher coverage for the formation of snowball chains and, eventually, communities.