Clustering Social Networks using Nature-inspired BAT Algorithm

The widespread availability of low-cost internet access drives user activity on social media. As a result, a huge number of networks of many varieties are easily accessible. Community detection is one of the significant tasks in understanding the behavior and functionality of such real-world networks. Mathematically, the community detection problem has been modeled as an optimization problem, and various meta-heuristic approaches have been applied to solve it. Many new nature-inspired algorithms have also been explored over the last decade to handle diverse optimization problems. In this paper, the nature-inspired Bat Algorithm (BA) is adopted and a new variant of the Discrete Bat Algorithm (NVDBA) is proposed to identify communities in social networks. The proposed scheme does not require the number of communities as a prerequisite. Experiments on a number of real-world networks assess the performance of the proposed approach and support its validity. The results indicate that the proposed algorithm is competitive with existing methods and offers promising results for identifying communities in social networks.

Keywords: Community detection; nature inspired optimization; bat algorithm; discrete particle swarm optimization; social network


I. INTRODUCTION
In the era of the internet, social network analysis is a lively research area in the field of complex systems. Complex systems in various disciplines can be modeled as networks, such as collaboration networks, biological networks, transportation networks and social networks like Facebook and Twitter. Social networks evolve from communication among users with similar or comparable interests. The basic components of a social network are nodes (vertices) and edges (links), which represent individual users and their associations respectively. Consequently, social networks are usually represented as graphs. Community detection is one of the fundamental tasks in social network analysis. Identifying communities means clustering the nodes, or dividing the network, so that connections are dense among the nodes within a group and sparse between groups. In other words, a community is a tightly knit group of nodes that is loosely connected to the rest of the network. Community detection helps in understanding the topological arrangement and functionality of a social network's users. It is a well-established research domain that has attracted researchers from many disciplines. Social networks have become part of almost everyone's life, and the wide range of available datasets and multidisciplinary applications keeps increasing researchers' interest in community detection. The community detection problem has been viewed from two different perspectives: as a partitioning/clustering problem and as an optimization problem. Girvan and Newman (GN) introduced modularity, a quantitative measure for assessing the goodness of community structures in complex networks.
The Girvan-Newman algorithm [1] is a well-recognized and widely accepted method for community detection in complex networks. Because community detection is applicable to diverse tasks in different disciplines, researchers have proposed numerous algorithms with different lines of attack. Some established algorithms include Fast Newman (FN) [2], Clauset, Newman & Moore (CNM) [3], Communities Overlapping based on label PRopagation Algorithm (COPRA) [4], Clique Percolation Method (CPM) [5], Louvain [6] and Label Propagation Algorithm (LPA) [7].
In another view, the community detection problem has been formulated as an optimization problem, i.e. a modularity maximization problem. Brandes et al. [8] proved that modularity maximization is an NP-hard problem. Metaheuristic methods are a preferred choice for dealing with NP-hard problems, and many optimization algorithms have been explored for community detection, for example Genetic Algorithm (GA) [9], Particle Swarm Optimization (PSO) [10] and Ant Colony Optimization (ACO) [11]. These optimization techniques have shown accurate and promising results.
The growing size of networks and the rise in the number of available networks have drawn researchers' attention to developing more effective and efficient methods for identifying community structure in complex networks. For example, Brentan et al. [8] used a community detection method to propose a District Metered Area (DMA) design for managing large Water Distribution Systems (WDSs). They further enhanced it by applying a multilevel particle swarm optimization approach. The approach was tested on a real water supply network, and the experimental results provide evidence for its validity and a significant improvement in the obtained results. [...] community detection problem. They adopted a locus-based representation and encoding scheme and redesigned the position update rule for bat movement for the community discovery problem. Their results show improved performance with respect to the ground truth of the networks under test. Chunyu et al. [13] used an ordered adjacency list encoding scheme for representation. Their work requires the number of communities as an input parameter, and experimental results are shown for the karate network only. In recent work, Song et al. [14] employed a totally random initialization scheme for representation and gave a discrete velocity and a new position update function. Their results show better performance than the existing approaches they compared against.
Originally, the Bat algorithm was proposed for continuous optimization problems [15]. Later, a binary bat algorithm was developed for binary optimization problems [16]. In this work, a new variant named NVDBA is proposed for the community detection problem. In the proposed algorithm, the discrete position vector of each virtual bat is coded with node ids as its elements. The associated velocity vector of the bat is encoded as a binary vector, i.e. its elements are 0 or 1 only. The discrete bat update rules are redesigned so that the position and velocity of each bat are updated to move the bat in the direction of the prey/food. To enhance the local search ability of a bat, a new solution is generated around the global best solution found so far. Additionally, a position vector reshuffle operation [17] is carried out to reduce redundant computations and save computational time. Modularity optimization is a prominent and regularly used approach for community detection in complex networks, so the proposed method uses modularity as the fitness function. Experiments conducted on several real-world datasets show its efficacy and competitiveness against LPA [7], Discrete Particle Swarm Optimization (DPSO) [17] and the Discrete Bat Algorithm (DBA) [14]. The algorithms are compared in terms of the quality of the community structure and the statistical significance of the results.
The motivation and major contributions of the proposed work are as follows:
• The work is inspired by the success of the nature-inspired bat algorithm [15], which combines properties of the existing algorithms PSO [18] and simulated annealing [19]. The proposed work relies on the inherent properties of the bat algorithm, which has been reported to perform better than existing algorithms [14]. The proposed approach is a new variant of the Discrete Bat algorithm, referred to as NVDBA.
• A problem-specific, random neighborhood-based initialization scheme is used. The community structure is represented based on the network topology and the labels of the nodes in the network. The bat status update rules are redefined for the community detection problem.
• The proposed method does not require the number of communities as a prerequisite. Experiments on a variety of datasets validate the performance of the proposed algorithm by comparing it with a traditional method, LPA, and an evolutionary method, DPSO. In addition, results are compared with the experimental values quoted in recent work [14], referred to as DBA.
• The statistical investigation is done by box-plot analysis and the Wilcoxon signed-rank test, which indicate that the proposed algorithm shows a significant improvement.
The paper is organized as follows. Section 2 presents the background of the related problem, the DPSO algorithm and the Bat algorithm. Section 3 provides a detailed description of the standard Bat algorithm. In Section 4, the proposed method is described in detail. Section 5 presents the workflow of the proposed method along with its pseudocode and workflow diagram, and Section 6 discusses the experimental results obtained on several real-world datasets and their statistical analysis. Finally, Section 7 concludes the paper and outlines future prospects.

A. Community Discovery Problem
A social network is commonly represented by a graph $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Each node represents a user or an entity, and each edge represents a relation between two nodes. The community detection problem is defined as partitioning the network into subgraphs such that nodes within a subgraph are connected by a high density of edges and the density of edges between subgraphs is low. Mathematically, for an input $G = \{V, E\}$, the task is to find the set of communities $C = \{C_1, C_2, \ldots, C_t\}$, where $t$ is the number of communities, such that $V = C_1 \cup C_2 \cup \ldots \cup C_t$ and $C_1 \cap C_2 \cap \ldots \cap C_t = \emptyset$. The quality of a partition is determined by dense edges inside each community and sparse edges to the rest of the network. Quantitatively, the most widely used and accepted criterion for measuring the quality of a partition is modularity.
In an undirected, unweighted graph, modularity $Q$ [20] is defined as:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \tag{1}$$

where $A$ is the adjacency matrix, $m$ is the number of edges, $k_i = \sum_{j=1}^{n} A_{ij}$ is the degree of the $i$th vertex and $c_i$ is the community containing node $i$. If nodes $i$ and $j$ belong to the same community, $\delta(c_i, c_j) = 1$; otherwise it is 0. A higher $Q$ value indicates a better community structure. If all the nodes belong to one community, then $Q = 0$. If every node lies in its own community, then $Q$ is negative. The maximum value attainable by $Q$ is 1.
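As a minimal sketch of how Eq. (1) can be evaluated, the following Python function computes modularity in its equivalent community-wise form, $Q = \sum_c [l_c/m - (d_c/2m)^2]$, where $l_c$ is the number of edges inside community $c$ and $d_c$ is the sum of the degrees of its nodes. The function name, the edge-list input format and the toy graph are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from collections import defaultdict

def modularity(edges, labels):
    """Modularity Q of an undirected, unweighted graph without self-loops.

    edges  : list of (u, v) pairs, each undirected edge listed once
    labels : dict mapping node -> community identifier
    """
    m = len(edges)
    degree = defaultdict(int)   # k_i for every node
    intra = defaultdict(int)    # l_c: edges with both endpoints in community c
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
    deg_sum = defaultdict(int)  # d_c: total degree of community c
    for node, k in degree.items():
        deg_sum[labels[node]] += k
    return sum(intra[c] / m - (deg_sum[c] / (2 * m)) ** 2 for c in deg_sum)

# Hypothetical toy graph: two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_group = {n: "a" for n in range(1, 7)}
singletons = {n: n for n in range(1, 7)}

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
q_single = modularity(toy_edges, singletons)
```

On this toy graph, separating the two triangles gives $Q = 5/14 \approx 0.357$, a single community gives $Q = 0$, and singleton communities give a negative value, in line with the properties stated for $Q$.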
The problem of finding $t$ communities with maximum modularity, without a priori input on the number of communities in the network, is formulated as a modularity optimization problem, which is NP-hard. Hence, several evolutionary approaches have been applied to it, and the problem is attracting attention from different groups of researchers.

B. Discrete Particle Swarm Optimization (DPSO)
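Cai et al. [17] proposed a DPSO framework for identifying community structures in signed social networks, using a locus-based adjacency initialization strategy. The particle updating rules are defined by the following equations:

$$V_i^{t+1} = Sig\left( \omega V_i^t + c_1 r_1 (Pbest_i \oplus X_i^t) + c_2 r_2 (Gbest \oplus X_i^t) \right) \tag{2}$$

$$X_i^{t+1} = X_i^t \,\Theta\, V_i^{t+1} \tag{3}$$

In the algorithm, the maximum and minimum inertia weights $\omega_{\max}$ and $\omega_{\min}$ are taken as 0.9 and 0.4, respectively, and the learning factors $c_1$ and $c_2$ are both set to 1.494.
The symbol $\oplus$ in Eq. (2) is the XOR operator, and the function $Y = Sig(X)$, where $Y = (y_1, y_2, \ldots, y_n)$ and $X = (x_1, x_2, \ldots, x_n)$, is defined as

$$y_i = \begin{cases} 1 & \text{if } rand(0,1) < sigmoid(x_i) \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

and the sigmoid function is given as

$$sigmoid(x) = \frac{1}{1 + e^{-x}} \tag{5}$$

The operator $\Theta$ in Eq. (3) guides the particle toward a better location in the search space. If the position is $X = (x_1, x_2, \ldots, x_n)$ and the velocity is $V = (v_1, v_2, \ldots, v_n)$, then $X' = X \,\Theta\, V$ is a new position vector $X' = (x'_1, x'_2, \ldots, x'_n)$ that maps to a new solution such that

$$x'_i = \begin{cases} x_i & \text{if } v_i = 0 \\ Nbest_i & \text{if } v_i = 1 \end{cases} \tag{6}$$

where $Nbest_i$ is the label identifier possessed by the majority of the $i$th node's neighbors:

$$Nbest_i = \arg\max_r f(r) \tag{7}$$

$$f(r) = \sum_{j \in N} \delta(x_j, r) \tag{8}$$

In Eq. (8), $N = \{n_1, n_2, \ldots, n_k\}$ is the set of the $i$th node's neighbours and $r$ is the integer that maximizes $f(r)$.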

C. Bat Algorithm
Nature-inspired algorithms have shown promising results for solving NP-complete problems. Biological, physical and chemical processes in nature motivate researchers to propose new algorithms and use them for different applications. Out of 1240 species of bats, most rely on a sophisticated sense of hearing [21]. X. Yang [15] proposed a new meta-heuristic algorithm, the Bat algorithm (BA), inspired by the echolocation behavior of bats. Mirjalili et al. [16] introduced a binary bat algorithm for solving discrete optimization problems. The success of the bat algorithm has inspired researchers to apply it to different optimization problems. For example, the bat algorithm has been applied to the optimal power flow (OPF) problem [22], where the reported experimental results show an effective, robust and high-quality solution. Section 3 gives a detailed description of the standard bat algorithm.

III. DESCRIPTION OF BAT ALGORITHM
The bat algorithm simulates the echolocation characteristics of bats. In the original bat algorithm [15], virtual bats are used and the update rules for each bat are as follows.

1) Update rules:
The position vector and velocity vector of each virtual bat are represented as $X = (x_1, x_2, \ldots, x_n)$ and $V = (v_1, v_2, \ldots, v_n)$, each composed of $n$ variables. The position vector $X$ represents an instance of the solution to the optimization problem. The current frequency of the bat is denoted by $f$, which ranges between $f_{\min}$ and $f_{\max}$. The new population is generated by the movement of bats in the search space, governed by the update rules for the velocity vector $V$ and the position vector $X$ given by Eq. (10) and Eq. (11) respectively.

$$f = f_{\min} + (f_{\max} - f_{\min}) \beta \tag{9}$$

$$V^{t+1} = V^t + (X^t - X_*) f \tag{10}$$

$$X^{t+1} = X^t + V^{t+1} \tag{11}$$

Here, $X_*$ denotes the current global best solution obtained so far from all bats, $\beta \in [0,1]$ is a uniformly distributed random number, and $f_{\min} = 0$ and $f_{\max} = 1$ are chosen.
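The continuous update step can be sketched in a few lines of Python. The sketch assumes the commonly cited form of the original bat algorithm [15]: a frequency drawn as $f_{\min} + (f_{\max} - f_{\min})\beta$, a velocity incremented by $(X - X_*) f$, and a position incremented by the new velocity. The function name `bat_step` and the three-dimensional example values are hypothetical.

```python
import random

def bat_step(x, v, x_best, f_min=0.0, f_max=1.0, rng=random):
    """One continuous bat move; returns (new_position, new_velocity, frequency)."""
    beta = rng.random()                    # uniform random number in [0, 1)
    f = f_min + (f_max - f_min) * beta     # frequency used for this move
    # Velocity update: previous velocity plus frequency-scaled offset from the best.
    v_new = [vi + (xi - bi) * f for vi, xi, bi in zip(v, x, x_best)]
    # Position update: previous position plus the new velocity.
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new, f

# Hypothetical example: one bat in a 3-dimensional search space.
x0 = [2.0, 0.0, -1.0]
v0 = [0.0, 0.0, 0.0]
best = [1.0, 1.0, 1.0]
x1, v1, freq = bat_step(x0, v0, best, rng=random.Random(0))
```

A bat that already sits on the global best with zero velocity does not move under this rule, which is one reason a separate local search step is used.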

2) Local Search ability:
To boost the local search capability, a new solution is generated locally using a random walk, expressed in Eq. (12):

$$X_{new} = X_{old} + \epsilon A^t \tag{12}$$

where $\epsilon \in [0,1]$ and $A^t$ is the average loudness of all bats in the current population.

3) Update Loudness (A):
The loudness is updated for each virtual bat as the iterations proceed and a new population is generated. When a bat gets near its prey/food, its loudness decreases according to the following expression:

$$A_i^{t+1} = \alpha A_i^t \tag{13}$$

where $\alpha$ is a constant in the range $0 < \alpha < 1$ such that $A_i^t \to 0$ as $t \to \infty$. The rate of decrement in loudness is determined by $\alpha$.

4) Pulse emission rate update(r):
To search for the best solution, a bat explores the search space by updating its pulse emission rate. As the bat gets near its prey, the pulse emission rate increases according to the following equation:

$$r_i^{t+1} = r_i^0 \left[ 1 - \exp(-\gamma t) \right] \tag{14}$$

where $\gamma > 0$ is a constant such that $r_i^{t+1} \to r_i^0$ as $t \to \infty$, and $r_i^0$ is the initial pulse emission rate.
Both loudness and pulse emission rate are updated only when a new solution improves on the previous one, which indicates that the bats are moving toward a near-optimal solution.

IV. FRAMEWORK OF PROPOSED METHOD
In a theoretical sense, community detection is a modularity maximization problem and is NP-hard. The strengths of the bat algorithm motivate its use for the modularity maximization problem. The representation schema and bat movement update rules are defined in the following subsections.

A. Discrete Bat Position Encoding / Decoding Scheme
In population-based approaches, population initialization should take the problem domain into account instead of being purely random. In the proposed NVDBA, random neighborhood-based encoding is used to represent the position vector of each virtual bat. This encoding makes it possible to find the number of communities without prior knowledge or a priori input on the number of communities. The position vector of a virtual bat represents an instance of the solution to the optimization problem. For the community detection problem, the discrete position vector of each virtual bat (say the $b$th bat) is an $n$-dimensional vector defined as $X_b = \{x_1, x_2, \ldots, x_n\}$, where each $x_i \in [1, n]$ is an integer and $n$ is the total number of nodes in the network. Each dimension of a position vector is a random neighbor of the respective node, as depicted in Fig. 1. If $x_i = j$, then nodes $i$ and $j$ belong to the same community, as shown in Fig. 2.
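A minimal Python sketch of this encoding is given below. It assumes that every node has at least one neighbor and that decoding groups nodes by following the links from each node to the neighbor stored in its position entry, i.e. by taking connected components, which is a common way to decode neighbor-based encodings. The paper's Fig. 1 and Fig. 2 are not reproduced here, so the decoding rule, the function names and the toy adjacency list are assumptions.

```python
import random

def init_position(adj, rng=random):
    """Random neighborhood-based initialization: each node picks one of its neighbors.

    adj : dict mapping node -> list of neighbors (every list must be non-empty)
    """
    return {i: rng.choice(sorted(nbrs)) for i, nbrs in adj.items()}

def decode(position):
    """Group nodes into communities: node i and position[i] share a community.

    Implemented with a small union-find; returns dict node -> community root.
    """
    parent = {i: i for i in position}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in position.items():
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    return {i: find(i) for i in position}

# Hypothetical toy network: two triangles joined by the edge (3, 4).
toy_adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
random_position = init_position(toy_adj, rng=random.Random(42))
communities = decode({1: 2, 2: 3, 3: 1, 4: 5, 5: 6, 6: 4})
```

Because the number of connected groups falls out of the decoding, no community count has to be supplied in advance.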

B. Discrete Bat Velocity Encoding
The velocity of each virtual bat is an $n$-dimensional vector written as $V_b = \{v_1, v_2, \ldots, v_n\}$, where each $v_i$ is binary coded such that $v_i \in \{0, 1\}$ for $1 \le i \le n$, and $n$ denotes the number of nodes in the network. Each component of the velocity vector governs the change in the corresponding dimension of the position vector: if $v_i = 1$, the associated element of the position vector is changed; otherwise it remains in the same state.
During initialization, all dimensions of the discrete velocity vector of each virtual bat are set to 0, as depicted in Fig. 3.
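In Eq. (15), $\oplus$ is the XOR operator, and the function $sig(\cdot)$ maps each real-valued velocity component to either 0 or 1, as defined below:

$$sig(x) = \begin{cases} 1 & \text{if } rand(0,1) < sigmoid(x) \\ 0 & \text{otherwise} \end{cases} \tag{16}$$

$$sigmoid(x) = \frac{1}{1 + e^{-x}} \tag{17}$$

Here, $rand(0,1)$ is a uniformly generated random number between 0 and 1.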
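The mapping from real values to binary velocity components can be sketched as follows. The sketch assumes the usual stochastic thresholding rule from binary swarm algorithms: a component becomes 1 when a uniform random number falls below the logistic sigmoid of the real value, and 0 otherwise. The function names and input values are hypothetical, and the real-valued inputs stand in for the argument of the velocity rule, which is not reproduced here.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # avoids overflow for large negative x
    return z / (1.0 + z)

def binarize(values, rng=random):
    """Map each real value to 1 with probability sigmoid(value), else 0."""
    return [1 if rng.random() < sigmoid(x) else 0 for x in values]

# Hypothetical real-valued inputs for a 4-node network.
bits = binarize([-2.0, 0.0, 3.5, 1000.0], rng=random.Random(7))
```

Large positive inputs are almost always mapped to 1 and large negative inputs to 0, so the real value acts as a soft switch for changing the corresponding position entry.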

3) Position:
According to the redefined discrete velocity updating strategy above, the new position update rule is given in Eq. (18):

$$X_b^{t+1} = X_b^t \circledast V_b^{t+1} \tag{18}$$

Here, the operator $\circledast$ is applied between the previous position vector and the newly computed velocity vector and yields a new position vector. The definition of the operator is based on the label propagation updating strategy. Given a position vector $X = \{x_1, x_2, \ldots, x_n\}$ and a velocity vector $V = \{v_1, v_2, \ldots, v_n\}$, the new position vector $X_{new} = \{x'_1, x'_2, \ldots, x'_n\}$ is defined as

$$x'_i = \begin{cases} x_i & \text{if } v_i = 0 \\ \arg\max_r f(r) & \text{if } v_i = 1 \end{cases} \qquad f(r) = \sum_{j \in N(i)} \delta(x_j, r)$$

Here, $N(i)$ is the set of the $i$th node's neighbors. The function $\arg\max_r f(r)$ returns an integer $r$ that maximizes $f(r)$. If more than one value of $r$ maximizes $f(r)$, one of them is chosen at random. So $\arg\max_r f(r)$ is the community identifier possessed by the largest number of the $i$th node's neighbors.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020, www.ijacsa.thesai.org
An example on a toy network, with a visual representation of the discrete velocity and discrete position update rules, is illustrated in Fig. 4.
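The label propagation style position update can be sketched as below. The sketch assumes that the position vector holds one community label per node, that all nodes are updated from the labels of the previous position (a synchronous update), and that ties are broken uniformly at random. The function name, the dictionary-based data layout and the toy network are hypothetical.

```python
import random
from collections import Counter

def update_position(position, velocity, adj, rng=random):
    """Apply a binary velocity to a label vector.

    position : dict node -> community label
    velocity : dict node -> 0 or 1
    adj      : dict node -> list of neighbors
    Entries with velocity 0 are kept; entries with velocity 1 take the label
    held by the largest number of neighbors.
    """
    new_position = dict(position)
    for i in position:
        if velocity[i] == 1 and adj[i]:
            counts = Counter(position[j] for j in adj[i])
            top = max(counts.values())
            candidates = sorted(r for r, c in counts.items() if c == top)
            new_position[i] = rng.choice(candidates)   # random tie-break
    return new_position

# Hypothetical toy network: two triangles joined by the edge (3, 4).
toy_adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
old_labels = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2}
vel = {1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0}
new_labels = update_position(old_labels, vel, toy_adj, rng=random.Random(0))
```

In this example only node 3 is switched on by the velocity. Two of its three neighbors carry label 1, so node 3 moves from community 2 to community 1 while all other entries stay unchanged.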

4) Position reshuffle:
A position vector at any stage represents an instance of the solution to the optimization problem. At a given point in time, two different position vectors may correspond to an identical community structure. For example, $X_1 = \{3,3,3,3,8,8,8,8\}$ and $X_2 = \{1,1,1,1,7,7,7,7\}$ both correspond to the same community structure, as represented in Fig. 5.
Position reshuffling is carried out to reduce redundant computation and save computational time [17].
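One simple way to detect such equivalent vectors is to rewrite every position vector into a canonical form. The sketch below relabels communities in order of first appearance; this canonical form is an assumption chosen for illustration, and the exact reshuffle rule of [17] may differ.

```python
def reshuffle(position):
    """Relabel communities as 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    canonical = []
    for label in position:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
        canonical.append(mapping[label])
    return canonical

# The two equivalent vectors from the example above.
x1 = [3, 3, 3, 3, 8, 8, 8, 8]
x2 = [1, 1, 1, 1, 7, 7, 7, 7]
same_structure = reshuffle(x1) == reshuffle(x2)
```

Both vectors map to `[1, 1, 1, 1, 2, 2, 2, 2]`, so a fitness value cached for one of them can be reused for the other instead of being recomputed.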

D. Local Search
To enhance the local search capability of the bat, a new solution is generated from the global best solution. In the continuous algorithm this is done with Eq. (12), but Eq. (12) cannot be applied directly to a discrete position vector [12]. In the proposed method, a new solution is created based on the definition written in Eq.
Here, $\hat{j} \in N(i)$, that is, $\hat{j}$ is one of the $i$th node's neighbors.
An example on a toy network, with a graphical representation of forming a new solution, is depicted in Fig. 6.

E. Loudness(A)
The loudness of each virtual bat is updated by Eq. (13). After experimentation, the initial loudness and the decay constant are set to $A = 0.95$ and $\alpha = 0.95$. As bats get near their food, the loudness decreases. The rate of decrease in loudness is controlled by the parameter $\alpha$, as shown in Fig. 7.

F. Pulse Emission Rate (r)
The pulse emission rate of each virtual bat is updated by Eq. (14), which helps in finding a better solution near the global best solution $X_*$. Generally, the pulse emission rate increases as the bats get near their prey/food. The rate of increase is controlled by the initial value $r_0$ and the constant $\gamma$. The values $r_0 = 0.5$ and $\gamma = 0.03$ are assumed, and the resulting increase in $r$ is depicted in Fig. 8.
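The two schedules can be tabulated with a short Python sketch. It assumes the standard bat algorithm forms, a geometric decay $A^{t} = A^{0} \alpha^{t}$ for loudness and $r^{t} = r_0 [1 - \exp(-\gamma t)]$ for the pulse emission rate, together with the parameter values quoted above. Here $t$ counts the updates actually applied, since both quantities change only when a solution improves. The function names are hypothetical.

```python
import math

def loudness(t, a0=0.95, alpha=0.95):
    """Loudness after t updates under geometric decay."""
    return a0 * alpha ** t

def pulse_rate(t, r0=0.5, gamma=0.03):
    """Pulse emission rate after t updates; rises from 0 toward r0."""
    return r0 * (1.0 - math.exp(-gamma * t))

# Values at a few update counts, for inspection or plotting.
schedule = [(t, loudness(t), pulse_rate(t)) for t in (0, 10, 50, 100)]
```

With these settings the loudness shrinks toward 0 while the pulse emission rate approaches 0.5, so later iterations favor exploitation near the global best solution.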

V. WORKFLOW OF NVDBA
In this work, a discrete position vector and a binary velocity vector are adopted. The initialization and update strategies are explained with graphical examples in the preceding subsections. Modularity is one of the prominent optimization functions for evaluating the goodness of a community partition. Thus, the proposed method employs modularity as the fitness function and the number of iterations as the termination criterion. Fig. 9 presents the workflow of NVDBA for the community discovery problem.
The proposed NVDBA pseudocode is presented in Table I.

In this section, experiments were performed on standard real-world networks on a desktop machine with an Intel Core i7-3770 CPU @ 3.40 GHz and 4 GB of RAM, running Windows 7. The algorithms are written in Python with networkx and matplotlib. DPSO [17] is simulated with the random neighborhood-based initialization scheme for undirected and unweighted social networks. LPA results are obtained using the built-in function of the 'igraph' package in R. In [14], DBA was tested on three real-world networks, and those results are taken for comparison with NVDBA.
Each social network dataset used in the experiments is treated as an unweighted and undirected network. The datasets and their basic properties are listed in Table II. The DPSO parameters are taken as in [17] for simplicity. In the proposed NVDBA, the population size and maximum number of iterations are the same as in DPSO. The minimum and maximum frequency are initialized to 0 and 1 respectively [12]. The other parameters (initial pulse rate, initial loudness, gamma and alpha) are initialized after prior experimentation so that the pulse emission rate and loudness constraints hold. The experimental parameters for DPSO and NVDBA are listed in Table III. Because a large body of existing work uses modularity to assess the quality of a community partition, the proposed method adopts it as the evaluation metric. Each algorithm is executed independently 15 times on every dataset to reduce statistical error. For statistical analysis and performance evaluation, the maximum modularity, average modularity and standard deviation are reported. The number of communities identified by each algorithm on each network corresponds to the partition with the maximum modularity.

A. Performance Evaluation
The quantitative results based on experimentation performed on real-world networks are reported in Table IV.
The comparative analysis of the results indicates the effectiveness of the proposed algorithm relative to a traditional community detection algorithm (LPA) and a nature-inspired approach (DPSO). Additionally, it is compared with the recent algorithm DBA. NVDBA yields higher values of the objective function, i.e. modularity. High modularity characterizes dense connections within a community and sparse connections between communities. The community structure obtained by the proposed algorithm on each network at the highest modularity is shown in Fig. 10. The results in Table IV support the soundness of the proposed method. The increase in modularity for almost all datasets considered shows that the proposed method enhances the quality of community structures. The average modularity over 15 independent runs and the standard deviation on each dataset are also reported, and the low standard deviation on each dataset suggests that the proposed method is reliable. Overall, the experimental results indicate that NVDBA provides promising results and is effective.

B. Statistical Analysis
SPSS statistics software is used for the statistical analysis. For each network under study, the proposed algorithm is executed 15 times independently. The experimental parameters required for the implementation are listed in Table III, and the experimental values are recorded in Table IV. The box-plot analysis (Fig. 11) of the proposed algorithm portrays the variability of the experimental values around the mean over 15 runs.
The variation in modularity across independent runs is low, which indicates that the proposed algorithm is reliable.
The proposed algorithm was expected to improve the quality of community structure relative to the existing algorithm. To further validate its performance, a non-parametric test, the Wilcoxon signed-rank test [31], is conducted. DPSO and NVDBA detect community structure in undirected and unweighted real-world networks with the aim of finding the highest modularity. Both algorithms are executed 15 times independently on each dataset with the same population initialization methodology. Modularity values for each run on each dataset by both algorithms are recorded to four decimal places. The following steps address the question: "Is the change in modularity value between DPSO and NVDBA statistically significant?" To answer it, the Wilcoxon signed-rank test is performed on the datasets listed in Table II. First, the hypotheses are framed as:
Null Hypothesis H0: There is no significant difference in the modularity values obtained by DPSO and NVDBA.
Alternative Hypothesis H1: There is a significant difference in the modularity values obtained by DPSO and NVDBA.
Test statistics for all datasets are evaluated and compared using the Z statistic and Asymp. Sig. (2-tailed), i.e. the asymptotic two-tailed significance, at the 5% significance level. The Asymp. Sig. (2-tailed) value is the p-value of the test and is used to decide whether to reject H0. If the p-value is less than the specified level of 0.05, the null hypothesis H0 is rejected; otherwise it is retained.
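The decision rule can be illustrated with a small, self-contained Python sketch of the Wilcoxon signed-rank test using the normal approximation. This is not the SPSS procedure used in the paper: it omits the tie correction of the variance, it assumes at least one non-zero paired difference, and the two modularity lists are hypothetical values invented for the example, not results from the paper.

```python
import math
from statistics import NormalDist

def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, z, two_sided_p) for paired samples x and y."""
    diffs = [b - a for a, b in zip(x, y) if b != a]   # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    k = 0
    while k < n:                                      # average ranks over tied |d|
        j = k
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[k]]):
            j += 1
        avg_rank = (k + j) / 2 + 1
        for idx in order[k:j + 1]:
            ranks[idx] = avg_rank
        k = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # no tie correction
    z = (min(w_plus, w_minus) - mean) / sd
    p = 2 * NormalDist().cdf(z)
    return w_plus, w_minus, z, p

# HYPOTHETICAL modularity values for 15 paired runs (not from the paper).
baseline_q = [0.4000 + 0.0010 * k for k in range(15)]
proposed_q = [q + 0.0100 + 0.0020 * k for k, q in enumerate(baseline_q)]

w_plus, w_minus, z, p = wilcoxon_signed_rank(baseline_q, proposed_q)
reject_h0 = p < 0.05
```

When the second method is better in all 15 runs, the smaller rank sum is 0 and the statistic is roughly $z \approx -3.41$, which gives a two-sided p-value well below 0.05, so H0 would be rejected for such data.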
To understand the results of the Wilcoxon signed-rank test, two tables are used: Ranks (Table V) and Test Statistics (Table VI). In Table V, a-z and aa represent negative ranks, positive ranks and ties respectively for each dataset listed in Table II. From Table V, it can be seen that in all runs NVDBA obtained a higher modularity value than DPSO. Table VI shows a p-value of less than 0.05 for all the datasets. Hence, the null hypothesis H0 is rejected in favor of the alternative hypothesis, i.e. the change in modularity value obtained by NVDBA is statistically significant.

Many meta-heuristic approaches have been applied to the community discovery problem. In this work, NVDBA is proposed for community detection. NVDBA uses modularity, a widely used metric, as the fitness function. The input parameters of the proposed algorithm are initialized after a set of experiments while maintaining the constraints on loudness and pulse emission rate that reflect the behavior of a bat. The experimental outcomes indicate that the performance of NVDBA is encouraging, as the highest modularity is attained for almost all the datasets. Comparative analysis shows that it produces reliable, high-quality community structures in comparison to LPA, DPSO and DBA. However, there is still scope for improving NVDBA. The proposed method is tested on undirected and unweighted real-world networks, and the work may be extended to weighted and directed networks. In future, an in-depth analysis of NVDBA can be carried out by incorporating the influence of neighboring nodes. Such an investigation is expected to yield higher-quality community structures.