Identifying Influential Nodes with Centrality Indices Combinations using Symbolic Regressions

Abstract: Numerous strategies for determining the most influential nodes in a connected network have been developed. Centrality indices allow the most important nodes in a network to be identified. A single index, however, cannot capture a network's full structure because it measures only one attribute. Researchers frequently overlook an index's characteristics in favour of focusing on its application. The purpose of this research is to integrate selected centrality indices classified by their various properties. A symbolic regression approach was used to find meaningful mathematical expressions for this combination of indices. When the efficacy of the combined indices is compared to other methods, the combined indices behave similarly to, and in some cases outperform, the previous methods. Using this adaptive technique, network researchers can identify the most influential network nodes.


I. INTRODUCTION
For years, the prediction of the most influential nodes has been a source of contention [1]. Nodes are ranked by impact, with the most influential node first and the least influential last [2]-[4]. Several studies have shown the importance of detecting such nodes, for example in finding important suppliers [5], [6], detecting cancer or virus genes [7], and monitoring terrorist activities [8].
Over the years, more than 403 indices have emerged from the four major indices: Degree Centrality (DC), Betweenness Centrality (BC), Eigenvector Centrality (EV), and Closeness Centrality (CC). Differences in individual tasks and in how node importance is prioritized are said to have driven the development of these indices. Various centrality measures have been employed to predict node outcomes, with the underlying assumption being that the more centrally situated the nodes are in the network, the greater their spreading potential [9], [10].
However, using a single centrality index has limitations because each index tends to focus on one application area. DC, for example, is a good indicator of a node's total connections [11]. Still, it does not necessarily reflect a node's value in linking other nodes or how central it is to the main group. CC, on the other hand, measures how close a node is to the other nodes, but it is not informative when the network is disconnected, because nodes placed in distinct components cannot reach each other [9]. In the case of BC, many nodes will score zero if they do not lie on any shortest path between other nodes of the network [12].
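The limitations of DC and CC described above can be illustrated with a minimal sketch. The toy graph and the function names below are hypothetical and are not taken from the datasets used in this study; closeness is computed in its classical form, which is undefined when some node is unreachable.

```python
from collections import deque

def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def bfs_distances(adj, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(adj):
    """Classical closeness (n - 1) / sum of distances; None if a node is unreachable."""
    n = len(adj)
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        if len(dist) < n:
            result[v] = None  # undefined: another component is infinitely far away
        else:
            result[v] = (n - 1) / sum(dist.values())
    return result

# Hypothetical toy graph: a path a-b-c plus a separate pair d-e.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"e"},
    "e": {"d"},
}
dc = degree_centrality(adj)     # b has the most direct connections
cc = closeness_centrality(adj)  # undefined for every node: the graph is disconnected
```

In this sketch DC still ranks node b highest, while classical CC cannot be evaluated for any node because the graph has two components.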
A single centrality metric proved to be insufficient for accurately predicting the network's most important nodes [13].
Combining centrality indices to determine the most influential nodes has therefore been proposed. According to [7], [14], no single centrality measure can accurately identify key nodes, whereas a combination of at least two centrality measures gives the most accurate result. Combining multiple indices is considerably more accurate than using one index when assessing a node's influence capacity [11]. The influence of a node may be evaluated by its location and surroundings.
Researchers have repeatedly investigated this topic and determined that each index has distinct features. This attribute, known as network topology, is reflected differently by different methods, and the evaluation results may include flaws or deficiencies. Borgatti [15] observed two forms of network topology: geodesic paths and walking paths. In subsequent research, Ashtiani [16] determined that centrality measures may be categorized into five classes based on the reasoning and formulas used. This characterization of centrality indices is also used by [17] when assessing the topological structure of a student network.
The goal of this research is to investigate the effectiveness of combining centrality indices. Genetic programming-based symbolic regression (SR) is used here to find expressions that combine the selected indices. Two datasets are examined to test the performance of combinations of indices based on their individual properties. Vignery's topology principles guided the selection of the centrality indices in this study. The combination outcomes are compared to results from previous combination strategies to get a better sense of how effective the combinations are. This research clarifies the feasibility of applying symbolic regression to combine centrality indices, and whether categorizing indices according to their shared characteristics has an impact on the combined indices.

A. Data
Zachary and Les Misérables (Les-M) datasets, both weighted and unweighted, were used in this investigation. The Zachary dataset covers the thirty-four members of a karate club and documents 78 connections between members who interacted outside the club. The Les-M network has 77 nodes and 254 edges, where the nodes are characters of the novel and the edges link co-occurring characters. Both networks are depicted in Fig. 1, with information on the most connected nodes.

B. Theoretical Topology of Centrality Indices
The definitions of Vignery's eleven centrality indices are simplified and shown in Table I. Several of the indices share the same features, so hierarchical clustering analysis is carried out to observe whether the indices can be grouped into components and to justify where the indices converge. The dendrogram is built up by clustering observations at each stage and assessing the similarity (or distance) levels of the produced clusters. As a first stage in the modeling technique, the value of each index is computed for each node. Following that, the indices were categorized based on their commonalities. The higher a cluster's similarity level, the more related the variables in that cluster are. These indices are divided into five theoretical groups, explained in Table I.
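A minimal sketch of how indices might be grouped by similarity is given below. The paper does not state the distance measure or linkage rule, so the choices here (distance 1 - Pearson r between per-node index values, average linkage) are assumptions, and the index names and values are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_indices(scores, k):
    """Agglomerative clustering of indices into k groups.

    Assumed settings: distance = 1 - Pearson r, average linkage.
    scores maps an index name to its list of per-node values.
    """
    clusters = [[name] for name in scores]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(scores[a], scores[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the two clusters that are currently closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical per-node values for four made-up indices.
scores = {
    "idx_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "idx_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to idx_a
    "idx_c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "idx_d": [4.8, 1.2, 4.1, 1.9, 3.2],  # nearly identical to idx_c
}
groups = cluster_indices(scores, k=2)
```

With these made-up values the two strongly correlated pairs end up in separate groups, which mirrors the idea that indices sharing the same features converge into one cluster.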

C. Combinations of Indices
Network dynamics are examined as a function of the structure. The best estimate is made by combining a given number of indices based on the features of the component clusters. Genetic programming (GP) with symbolic regression (SR) is employed to generate mathematical expressions that may predict the simulation response values based on the topological indices used. SR is a technique that uses collected data to construct mathematical equations that may be used to test hypotheses [18]. With SR, the parameters and equation form are searched automatically, unlike typical regression methods that require a fixed-form model built from prior knowledge. GP is commonly used for SR because the vast search space imposes high computational complexity; as a meta-heuristic, GP generates new solutions using the notion of biological evolution [18], [19]. The results of the SR method are then compared with those of other methods that combine centrality indices. Three algorithms are chosen for comparison, given by Eq. (1), (2) and (3).
• C(v) algorithm: Wang [20] developed a combination formula that integrates DC, diffusion degree (DD) and BC, as denoted in Eq. 1.
• BC and Katz (BKC) algorithm: Zhang [21] merged BC and Katz's centrality (KC). Eq. 2 expresses the relationship between BC and KC.

A. Cluster Analysis
Clustering is used to group comparable data objects using a similarity measure. The similarity is a value that expresses the strength of the relationship between two data items; it represents how similar their data patterns are. The topological framework classifications from Vignery will be extended. We wanted to see how well the indices matched, so we applied a simple hierarchical clustering algorithm.
The results of clustering are depicted in Fig. 2a and 2b. We discovered that for both networks, all the indices could be clustered into four groups. Note that the component clustering is quite close to Vignery's recommendations. The final partition specifies how the indices will be clustered. Across the eleven indices, both networks cluster similarly, except that in Les-M PR falls in Cluster 2 rather than Cluster 3 (as in Zachary). EC has the lowest similarity score (10.56 and 7.00) compared to the other indices and is therefore not assigned to any cluster.

B. Combinations of Indices
TuringBot software was used to run the symbolic regression, which combines a set of base functions into simple formulas to produce a regression model. Fig. 3 depicts the steps taken to generate different mathematical formulations for each cluster. Because the output of SR varies, we choose the expression with the lowest root mean square error (RMS error) and the highest R-squared (R-sq). Finally, we obtained four distinct expressions: one for each cluster specified (C1, C2 and C3) and C, an expression that includes all the indices involved. Following training and shifting, the analytic equations derived for both networks are shown in Table II.
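The idea of combining base functions into simple formulas and keeping the one with the lowest RMS error can be sketched as follows. This is not the search used by the software in this study: exhaustive enumeration of formulas of the form (a op b) op c stands in for it, and the feature values, the target, and all function names are hypothetical.

```python
from itertools import product
from math import sqrt

# Base functions available to the search (an assumed, deliberately tiny set).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def candidates(features):
    """Yield (formula text, predictions) for every formula (a op1 b) op2 c."""
    names = list(features)
    for a, b, c in product(names, repeat=3):
        for op1, op2 in product(OPS, repeat=2):
            inner = [OPS[op1](x, y) for x, y in zip(features[a], features[b])]
            pred = [OPS[op2](x, y) for x, y in zip(inner, features[c])]
            yield f"({a} {op1} {b}) {op2} {c}", pred

def best_formula(features, target):
    """Return the formula with the lowest RMSE, plus its RMSE and R-squared."""
    scored = ((rmse(target, pred), text, pred) for text, pred in candidates(features))
    err, text, pred = min(scored, key=lambda item: item[0])
    return text, err, r_squared(target, pred)

# Hypothetical index values per node and a made-up response to be fitted.
features = {"DC": [1.0, 2.0, 3.0, 4.0], "BC": [0.5, 0.0, 2.0, 1.5]}
target = [d * b + d for d, b in zip(features["DC"], features["BC"])]
formula, err, r2 = best_formula(features, target)
```

A practical symbolic regression tool searches a far larger space with constants and more base functions, which is why a meta-heuristic is needed instead of enumeration; the selection criterion, however, is the same as in this sketch.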

C. Pearson Correlation Analysis
The combined indices and component clusters are compared with the methods employed in the earlier approaches. The correlation technique can be used to discover the similarity of combined centrality indices. The dataset also includes an average value (AVE), the mean result for each node across the Combined, IVI, BKC, and C(v) methods. This AVE value serves as the reference result to which the correlation converges.
Correlations for both networks show a significant and positive relationship, as shown in Tables IIIa and IIIb. In Zachary, there is a high correlation between AVE and IVI, BKC, and C(v). There is also a high association between IVI and the clustered groups and all other combined indices. The concept of combining indices while considering their spreaders and hubs can be extended for future use. With a correlation coefficient of 0.797, the relationship with C is likewise satisfactory. Notably, C1 and C2 are more closely correlated with AVE than C is, while C3 clearly diverges from the others.
Results for Les-M resemble those for Zachary's network when looking at each cluster component, with C1 and C2 being more correlated than C and C3. However, IVI diverges considerably for this network. These findings suggest that the SR-based combination model has successfully identified influential nodes. In the following section, the nodes' ranking positions are examined.
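The construction of the AVE reference and its correlation with each method can be sketched as follows. The scores are hypothetical, and the sketch assumes that all methods are already expressed on a comparable scale, which the text does not specify.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def node_average(method_scores):
    """Per-node mean across methods; maps method name -> list of node scores."""
    columns = zip(*method_scores.values())
    return [sum(col) / len(col) for col in columns]

# Hypothetical scores of four nodes under each method (not the paper's data).
method_scores = {
    "Combined": [0.9, 0.4, 0.7, 0.1],
    "IVI": [0.8, 0.5, 0.6, 0.2],
    "BKC": [1.0, 0.3, 0.8, 0.1],
    "C(v)": [0.9, 0.4, 0.7, 0.2],
}
ave = node_average(method_scores)
corr_to_ave = {name: pearson(vals, ave) for name, vals in method_scores.items()}
```

Each entry of `corr_to_ave` plays the role of one cell in the AVE column of a correlation table: the closer it is to one, the more closely that method follows the reference.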

D. Node's Ranking of Position
In this section, the placements of nodes are analyzed and arranged in descending order. When comparing procedures side by side in Tables IVa and IVb, the top ten ranking positions of each approach are considered. NodeAVE is a reference column that contains the average positional value of each node over the methods. For both Zachary and Les-M, BKC and C(v) detect nodes very similarly to AVE.
Nodes in IVI react similarly in Zachary, whereas they deviate significantly in Les-M. In both networks, however, C1 and C2 show more comparable node detection than C. To better understand the similarity result, we use the Jaccard similarity (JS) score and Kendall's tau-b to assess the top ten nodes' ranking positions. Kendall's tau-b measures the strength of agreement between the methods' ranking positions and complements the JS score.
Jaccard similarity score
The Jaccard similarity (JS) score compares two sets by counting the elements they share relative to the elements found in either set. JS is calculated by dividing the size of the intersection of the sets by the size of their union [23]. The closer the score is to one, the more similar the two sets are. JS is defined in Eq. 4.
$$JS(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{number of observations in both sets}}{\text{number of observations in either set}} \quad (4)$$
Looking at the Jaccard scores for the two networks, C has higher similarity scores in Zachary than in Les-M. In Zachary, C shares most of its top ten ranked nodes with IVI (0.6667), BKC (0.8182), and AVE (0.6667). In Les-M, C has a low degree of similarity (less than 0.5) to IVI, BKC, and AVE. Interestingly, C1 and C2 are comparable to C in their similarity to IVI, BKC, and C(v), following similar patterns. The JS-score heatmaps for both networks are shown in Tables Va and Vb. The greater the degree of resemblance between methods, the darker the color.
When comparing the rankings of different methods, Kendall's tau-b is applied to determine the ordinal relationship between pairs of observations [24], [25]. Kendall's tau-b correlation coefficient measures the strength and direction of the correlation. The coefficient ranges from -1 to 1; within this range, positive values indicate concordance (ranks that increase together) and negative values indicate discordance (one rank increasing while the other decreases). The correlation value between two variables increases when the ranks of the observations are similar and decreases when the positions of the observations differ. Kendall's tau-b is defined as in Eq. 5.

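$$\tau_b(X, Y) = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \quad (5)$$
where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs, $n_0 = n(n-1)/2$ is the total number of pairs among $n$ observations, and $n_1$ and $n_2$ are the numbers of pairs tied in $X$ and in $Y$, respectively.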
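Both ranking comparisons can be illustrated with a short sketch. The node lists below are hypothetical; they were chosen so that the overlaps reproduce Jaccard values of the same size as those reported for Zachary. How nodes that appear in only one top-ten list enter the tau-b computation is not described in the text, so the tau-b function here simply assumes two score lists over the same nodes.

```python
from itertools import combinations
from math import sqrt

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x
        if dy == 0:
            ties_y += 1  # pair tied in y
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2  # total number of pairs
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))

# Hypothetical top-ten node lists of three methods.
top_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
top_b = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12]  # shares 8 nodes with top_a
top_c = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]   # shares 9 nodes with top_a
js_ab = jaccard(top_a, top_b)  # 8 shared / 12 distinct
js_ac = jaccard(top_a, top_c)  # 9 shared / 11 distinct

# Hypothetical rank positions of the same five nodes under two methods.
tau = kendall_tau_b([1, 2, 3, 4, 5], [1, 2, 3, 5, 4])  # one swapped pair out of ten
```

Two top-ten lists that share eight nodes give 8/12, about 0.6667, and lists that share nine nodes give 9/11, about 0.8182, so a Jaccard score below one always means the two lists differ in at least one node.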
The Kendall's tau-b results for the top ten ranking of nodes in Zachary and Les-M are shown in Tables VIa and VIb. Comparing C with IVI, BKC, C(v), and AVE shows that C has rather low to moderate positive tau values in the Zachary network, while it has a high rank similarity for the Les-M network. C also has a significant and strong positive tau correlation with C1 (0.644). C1 likewise shows a significant positive correlation with C2. Relative to AVE, the results were significant in Les-M compared to Zachary, except for C3. Because C2 and C3 differ between Zachary and Les-M, the ranking behavior may be affected.
The selection of influential nodes is crucial for fostering knowledge and behavior adoption in a network because they can influence other nodes. It is possible to gain a better understanding of network structure and behavior by using prominent nodes. The importance of centrality in identifying influential network spreaders in this scenario cannot be overstated.
Our results show that combining centrality indices can identify the influential nodes in a network. To combine these indices, symbolic regression can identify appropriate mathematical expressions that fit the network's features. When it comes to recognizing significant nodes, the newly constructed expressions perform similarly to or better than previous methods (IVI, BKC, and C(v)), as validated using Pearson correlation, the Jaccard similarity score, and Kendall's tau-b correlation of ranking.
Clustering indices based on similar attributes showed that each cluster component may have the same impact as aggregating all indices. Clustering can reduce the total number of indices to be combined while achieving the same overall result. It was also discovered that index selection is critical. A few indices, including Katz centrality and Closeness centrality, could not be computed. Katz's centrality fails to detect connections between high-centrality nodes. To function, closeness centrality requires a well-connected network, and it fails when two nodes belong to different components. Large, complex datasets necessitate more computationally intensive methods. Clustering for the indices involved would be difficult, as it is here.
In future work, the selection of suitable indices to be combined is important. Analyzing their features as well as the computational time required to run each of the indices may be factored into the selection. It would also be worthwhile to repeat the analysis on a different network with a different number of weighted nodes, if possible.

TABLE IV. (A) TOP TEN NODES BASED ON RANKING IN ZACHARY