Method of Graph Mining based on the Topological Anomaly Matrix and its Application for Discovering the Structural Peculiarities of Complex Networks

Abstract: The article introduces the mathematical concept of the topological anomaly matrix, which provides the foundation for the qualitative assessment of the topological organization underlying large-scale complex networks. The basic idea of the proposed concept consists in translating the distributions of individual vertex-level characteristics (such as the degree, closeness, and betweenness centrality) into integrative properties of the overall graph. The article analyzes the lower bounds imposed on the items of the topological anomaly matrix and obtains new fundamental results enriching graph theory. With a view to improving the interpretability of these results, the article introduces and proves a theorem regarding the smoothness of the closeness centrality distribution over the graph’s vertices. Through a series of experiments, the article illustrates the application of the proposed matrix for evaluating the topology of a real-world power grid network and its post-attack damage.


I. INTRODUCTION
The distinctive feature characterizing the upcoming fourth wave of the industrial revolution lies in the rapid expansion, complication, and integration of the complex networks serving the needs of humanity and the world economy [1,2]. While enabling the development of more efficient business processes leading to the increase in the produced outcome and quality of service, this tendency makes the entire society extremely vulnerable to the disruptions of the most critical infrastructural networks [3,4]. Meanwhile, the functionality and reliability of any complex network heavily rely on its topology, which inspires the emergent properties that could not be deduced from the separate network's entities and arise only as a result of their interaction [5,6]. For example, the United States of America has suffered from several catastrophic blackouts caused by the cascading failures in the power grid stemming largely from the low redundancy of its topological design [7-9]. These observations make it particularly reasonable to assess the topology of complex networks while making decisions regarding their reliability or need for additional protection.

Remark that this article focuses on considering the complex networks modeled by the undirected simple graphs $G = (V, E)$. In turn, the topology of any graph $G$ could be regarded as the class of all possible graphs that are isomorphic to $G$. The evaluation of such topology is extremely challenging due to its underlying combinatorial nature and serves as a core problem of the emerging Big Data graph mining and analytics [10,11]. In the prior works, the graph topology is assessed based on applying the quantitative metrics summarized in the review [12]. However, these metrics give a limited insight into the qualitative topological properties such as the concentration of bottlenecks inspiring the non-uniform load on the entities and links of the modeled network, which points to the presence of a research gap. Thereby, the objective of this article lies in constructing the mathematical object of the topological anomaly matrix providing the qualitative evaluation of the graph topology and its richness in bottlenecks, while satisfying the computational efficiency demands imposed on the instruments of the Big Data analytics.

II. RELATED WORK: VERTEX IMPORTANCE METRICS
The inhomogeneous topology of a graph gives rise to the differentiation in the relative importance of its nodes for ensuring the normal activity of the modeled complex network. However, the vertex importance is difficult to analyze because it can be considered from radically different conceptual viewpoints. Thereby, in the existing works, the comprehensiveness of assessing the importance of the graph's nodes is ensured through applying a family of the formalized centrality metrics. In particular, the degree $d(v)$ of the vertex $v$ reflects the extent of its local importance and serves as the simplest centrality metric. Nevertheless, the value of degree is incapable of capturing the position of the examined vertex within the entire graph. At the same time, the metrics of the closeness and betweenness centrality [13,14] provide the formal way for evaluating the global importance of the graph's nodes and are defined in the following way:

Definition 1. The closeness centrality $c(v)$ of the node $v$ belonging to the vertex set $V$ of the connected graph $G$ represents the inverted value of its average geodesic distance $\operatorname{dist}(v, u)$ to all other vertices $u \in V \setminus \{v\}$:

$$c(v) = \frac{|V| - 1}{\sum_{u \in V \setminus \{v\}} \operatorname{dist}(v, u)}$$

Definition 2. The betweenness centrality $b(v)$ reflects the likelihood that the examined vertex $v$ appears on the shortest path between a pair of other nodes and is calculated, up to normalization, as follows:

$$b(v) \propto \sum_{k, l \in V \setminus \{v\},\ k \neq l} \frac{\sigma_{kl}(v)}{\sigma_{kl}}$$

Here $\sigma_{kl}$ denotes the total number of the shortest paths between the vertices $k$ and $l$ that differ in at least one edge, while $\sigma_{kl}(v)$ stands for the number of such paths transiting the vertex $v$.
Intuitively, the closeness centrality could be interpreted as the velocity of the information broadcasting from the examined vertex to all other nodes of the graph. For example, by starting to spread from the nodes with the highest closeness centrality, the computer worms could potentially reduce the time required for infecting all vertices. For its part, the betweenness centrality could be viewed as the extent to which the examined vertex is involved as an intermediate in the communication flows between the other graph's nodes. Moreover, the vertices that ensure gluing together multiple implicit communities take the crucial responsibility for the exchange of information between them and, thereby, are typically characterized by the high betweenness centrality (especially in the case of the strong community structure) [15,16].
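The three metrics discussed above can be computed directly from an adjacency list. The sketch below is a minimal illustration rather than the implementation used in this article. It assumes an unweighted connected graph stored as a Python dict, normalizes the closeness centrality by $|V| - 1$, and returns the betweenness centrality without normalization (each unordered pair of endpoints counted once). All function names are hypothetical; the betweenness routine follows the accumulation scheme of Brandes' algorithm.

```python
from collections import deque


def degree(adj):
    """Number of direct neighbors of every vertex."""
    return {v: len(adj[v]) for v in adj}


def closeness(adj):
    """Inverse of the average geodesic distance (connected graph assumed)."""
    n = len(adj)
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[s] = (n - 1) / sum(dist.values())
    return result


def betweenness(adj):
    """Unnormalized betweenness via Brandes-style dependency accumulation."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest s-v paths
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # vertices in order of non-increasing distance from s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints.
    return {v: value / 2 for v, value in bc.items()}


if __name__ == "__main__":
    # Path graph 0-1-2-3-4 used as a toy example.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(degree(path))
    print(closeness(path))
    print(betweenness(path))
```

On the path graph, the middle vertex receives both the largest closeness and the largest betweenness, while the two leaves have zero betweenness.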

III. PROPOSED CONCEPT OF THE TOPOLOGICAL ANOMALY MATRIX AND ITS FUNDAMENTAL PROPERTIES
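The main contribution of this article lies in introducing the following mathematical object embodying the strategy of translating the local vertex-level characteristics into the property of the overall graph $G$:

Definition 3. The topological anomaly matrix $A_\Omega(G)$ of the graph $G$ with respect to the base vector $\Omega = (\omega_1, \ldots, \omega_n)$ containing $n$ vertex importance metrics $\omega_i : V \to \mathbb{R}$ is given in the form of the following $n \times n$ array:

$$A_\Omega(G) = \begin{pmatrix} a_{\omega_1 \omega_1}(G) & \cdots & a_{\omega_1 \omega_n}(G) \\ \vdots & \ddots & \vdots \\ a_{\omega_n \omega_1}(G) & \cdots & a_{\omega_n \omega_n}(G) \end{pmatrix}$$

[...] $a_{\omega_i \omega_k}(G)$ is taken to be undefined if either $\omega_i$ or $\omega_k$ is constant on the entire vertex set $V$ (i.e. if the corresponding metric assigns the same value to every vertex of $V$).

By definition, the matrix $A_\Omega(G)$ is symmetric, while its undefined components should be organized into the rows and columns crossing at the diagonal entries $a_{\omega_i \omega_i}$ and, thereby, indicating the incapability of the corresponding metrics $\omega_i$ to distinguish the vertices of $G$. In turn, all defined components comprising the main diagonal of $A_\Omega(G)$ should be equal to one. For convenience, the matrices $A_\Omega(G)$ deprived of the undefined entries are referred to as perfect throughout this article.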
The selection of metrics into the base vector $\Omega$ is driven by the essential need for ensuring the descriptiveness of the constructed matrix in assessing the topology of $G$ at the optimal utilization of the resources involved in its calculation. In particular, the conceptual interpretability and computational efficiency of the metrics discussed in the previous section point to the reasonableness of introducing the canonical base vector defined as $\Omega = (d, c, b)$. Remark that the matrices characterizing the purely random (and thereby unstructured) connected graphs $R$ following the binomial distribution of vertex degrees tend to have close-to-one values of all non-diagonal components. This tendency stems from the fact that, simply by chance, the higher-degree vertices demonstrate a larger probability of being located at a lower average distance to all other nodes and are likely to participate in a larger fraction of the shortest paths between them. In view of these considerations, every low (i.e. close-to-zero or negative) entry of the matrix $A_\Omega(G)$ clearly points to the significant non-randomness of the graph $G$ and reveals the presence of an unexpected anomaly in its topology. In total, the matrix could encapsulate three major anomalies originating from the manner of fragmenting the graph $G$ into the cohesive implicit communities.

In particular, the low value of $a_{dc}(G)$ indicates that the larger number of the direct neighbors attached to an arbitrary vertex of $G$ does not shrink its farness from the remaining nodes of the graph to a statistically significant extent. The main topological property responsible for producing such anomaly consists in differentiating the entire communities of $G$ into the central and peripheral ones (depending on the average distance to the other communities in terms of the inter-community edges). In this context, the high-degree vertices involved in the peripheral communities as well as the low-degree nodes occurring in the central ones serve as the key factors contributing to the reduction in the value of $a_{dc}(G)$.

Conversely, the topological anomaly evidenced by the low value of $a_{db}(G)$ implies that the higher-degree vertices do not act as the significantly more preferred intermediates in the shortest paths of the graph $G$. The topological pattern provoking such effect is characterized by the incidence of many critical inter-community edges to the low-degree nodes along with the presence of the high-degree vertices adjacent exclusively to the members of their own communities.

Finally, at the low value of $a_{cb}(G)$, the ability of an arbitrary vertex to be involved in the shortest paths of the graph $G$ (and control the corresponding communication flows) is not strongly dependent on its average distance to the other vertices. From the topological viewpoint, the anomalous decrease in $a_{cb}(G)$ is driven by the nodes that, while being located in the central communities, are neither directly incident to the inter-community edges nor lie on the shortest path between any pair of vertices equipped with such edges.
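A minimal sketch of how such a matrix could be assembled from precomputed metric values is given below. It assumes, as an illustration only, that each entry is the Pearson correlation coefficient between two metrics over the vertex set. This choice is consistent with the stated properties (symmetry, unit diagonal, undefined entries for constant metrics, possible negative values), but it is an assumption and not necessarily the definition used in this article. The names `pearson` and `anomaly_matrix` are hypothetical, and undefined entries are represented by `None`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences, or None if undefined."""
    # A metric that is constant over the vertices cannot distinguish them,
    # so every entry involving it is reported as undefined.
    if max(x) == min(x) or max(y) == min(y):
        return None
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def anomaly_matrix(metrics):
    """Map each pair of metric names to its entry (assumed: Pearson correlation).

    `metrics` maps a metric name to the list of its values, one per vertex,
    with all lists following the same vertex order.
    """
    names = list(metrics)
    matrix = {}
    for i in names:
        for k in names:
            value = pearson(metrics[i], metrics[k])
            if i == k and value is not None:
                value = 1.0  # defined diagonal entries equal one
            matrix[i, k] = value
    return matrix


# Canonical base vector (d, c, b) for the path graph 0-1-2-3-4,
# with the metric values computed by hand.
path_metrics = {
    "d": [1, 2, 2, 2, 1],
    "c": [4 / 10, 4 / 7, 4 / 6, 4 / 7, 4 / 10],
    "b": [0, 3, 4, 3, 0],
}
A = anomaly_matrix(path_metrics)
for pair in (("d", "c"), ("d", "b"), ("c", "b")):
    print(pair, round(A[pair], 3))
```

For the canonical base vector, only the three off-diagonal values printed above need to be reported, since the matrix is symmetric and its defined diagonal entries equal one. On a regular graph (for instance a cycle) the degree is constant, so the whole row and column of `d` would be `None` under this sketch.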
In order to provide a fruitful insight into the entries of $A_\Omega(G)$, let us introduce and prove the following fundamental relationship between the closeness centrality values of the adjacent graph's nodes:

Theorem 1. In the connected graph $G$ on $|V| \geq 3$ vertices, the closeness centrality of any vertex $v$ adjacent to the node $u$ satisfies the bound

$$c(v) \geq \frac{(|V| - 1)\, c(u)}{(|V| - 2)\, c(u) + |V| - 1}$$

▲ Let us assume that $v$ is adjacent to the node $u$ having the closeness centrality of $c(u)$. This, for its part, implies that every vertex $h \in V \setminus \{v, u\}$ could be reached from $v$ based on the walk (i.e. sequence of edges with allowed repetitions) composed of the edge $(v, u)$ and the shortest path from $u$ to $h$. Accordingly, the geodesic distance between $v$ and $h$ is bounded above by the condition $\operatorname{dist}(v, h) \leq 1 + \operatorname{dist}(u, h)$. In view of this observation, and since $\operatorname{dist}(v, u) = 1$ and $\sum_{h \in V \setminus \{v, u\}} \operatorname{dist}(u, h) = (|V| - 1) / c(u) - 1$, the entire closeness centrality of $v$ is constrained in the next manner:

$$c(v) = \frac{|V| - 1}{1 + \sum_{h \in V \setminus \{v, u\}} \operatorname{dist}(v, h)} \geq \frac{|V| - 1}{1 + \sum_{h \in V \setminus \{v, u\}} \bigl(1 + \operatorname{dist}(u, h)\bigr)} = \frac{|V| - 1}{|V| - 2 + \dfrac{|V| - 1}{c(u)}},$$

which completes deriving the desired relationship. At the same time, the increase in $c(u)$ over the whole allowed range $(0, 1]$ leads to the monotonic growth of the imposed bound at any fixed $|V| \geq 3$. This remark clearly points to the largest restrictiveness of the bound produced by the neighbor with the highest closeness centrality. ▼

The most significant implication of the above theorem lies in the smooth nature of distributing the closeness centrality values over the graph’s vertices. On the contrary, the values of the betweenness centrality could be distributed in a much more rugged manner implying the extreme differences between the adjacent nodes. For example, each leaf vertex $l$, by definition, is associated with zero betweenness centrality $b(l) = 0$ regardless of the properties of its single neighbor. Conversely, the closeness centrality of $l$ takes the lowest possible value satisfying the bound given in Theorem 1. Remark that such bound demonstrates the close-to-linear behavior at the low values of $c(u)$ [...]

[...] Remark that this graph implies the inclusion of the three-degree vertices into the peripheral communities (represented by cycles) and placement of the two-degree node as the connector between these communities. As a result, such connector is associated with the largest closeness centrality compared to all other vertices. The graphs on six or fewer nodes, in turn, could not contain the lower-degree vertex characterized by the larger closeness centrality than the higher-degree one due to the influence of the structural restrictions.
At the same time, the shape of the surfaces constructed in Figs. 1b and 1c [...] $\varphi(G)$. These effects are fully attributable to the fact that the betweenness centrality is capable of producing the rugged distributions over the graph's nodes, while the closeness centrality is unavoidably subjected to the smoothing requirement proved in Theorem 1.
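The smoothing requirement of Theorem 1 can be checked numerically on small graphs. The sketch below assumes that the closeness centrality is normalized as the inverse of the average geodesic distance, in which case the walk argument of the proof yields $c(v) \geq (|V| - 1)\, c(u) / ((|V| - 2)\, c(u) + |V| - 1)$ for every pair of adjacent vertices $v$ and $u$. This explicit form of the bound is a reconstruction under that assumption, and the function names are hypothetical.

```python
from collections import deque


def closeness(adj):
    """Inverse of the average geodesic distance (connected graph assumed)."""
    n = len(adj)
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[s] = (n - 1) / sum(dist.values())
    return result


def neighbor_bound(c_u, n):
    """Assumed lower bound on c(v) for a vertex v adjacent to u."""
    return (n - 1) * c_u / ((n - 2) * c_u + n - 1)


def satisfies_smoothness(adj, tol=1e-12):
    """Check the bound for every ordered pair of adjacent vertices."""
    n = len(adj)
    c = closeness(adj)
    return all(
        c[v] >= neighbor_bound(c[u], n) - tol
        for v in adj
        for u in adj[v]
    )


# Triangle 0-1-2 with the pendant path 2-3-4 (vertex 4 is a leaf).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
c = closeness(graph)
print(satisfies_smoothness(graph))
print(c[4], neighbor_bound(c[3], len(graph)))  # the leaf attains the bound
```

For the leaf vertex the two printed numbers coincide, because every shortest path leaving the leaf passes through its single neighbor, which is the case where the bound is tight.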
The inspection of the callouts in Fig. 1

V. APPLICATION OF THE PROPOSED MATRIX FOR ASSESSING THE TOPOLOGY OF THE POWER GRID NETWORK AND ITS POST-ATTACK DAMAGE
The role of this section lies in demonstrating the descriptive potential of the introduced mathematical structure in evaluating the qualitative topological properties of the real-world complex networks. As a sample dataset for investigation, this work uses the benchmark model of the power grid infrastructure of the United States of America available at the open-access network collection [17] and given by the undirected graph $P$. Notice that this graph is connected and contains 4 941 vertices reflecting the facilities responsible for producing and distributing electricity along with 6 594 edges modeling the high-voltage transmission lines. The canonical matrix $A_\Omega(P)$ calculated for the described graph $P$ is given by [...]. These values indicate the involvement of all considered structural anomalies in the topological organization of $P$, which serves as the natural result for a spatially distributed technological man-made system needing constant supervision for preserving the desired functionality.

With a view to illustrating the usefulness of applying the proposed matrix as a measure of the topological damage, let us consider the attack on $P$ implying the removal of all its nodes having the degree of at least $t$, i.e. comprising the subset [...] additionally referred to as the giant component of $P_t$, while its topology accumulates the majority of damage that is not related to the connectivity issues [3].
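The degree-threshold attack described above can be sketched as follows. The code is an illustration under simple assumptions: the graph is stored as an adjacency dict, all vertices of degree at least `t` are removed simultaneously (degrees are taken from the intact graph), and the giant component is the largest connected component of what remains. The function names and the toy graph are hypothetical and do not reproduce the power grid dataset.

```python
from collections import deque


def remove_high_degree(adj, t):
    """Delete every vertex whose degree in the intact graph is at least t."""
    removed = {v for v in adj if len(adj[v]) >= t}
    return {
        v: [w for w in adj[v] if w not in removed]
        for v in adj
        if v not in removed
    }


def giant_component(adj):
    """Subgraph induced by the largest connected component."""
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        component = {s}
        queue = deque([s])
        while queue:  # breadth-first search restricted to unseen vertices
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return {v: [w for w in adj[v] if w in best] for v in best}


# Toy network: hub 0 holds three branches together.
toy = {0: [1, 3, 5], 1: [0, 2], 2: [1, 6], 3: [0, 4], 4: [3], 5: [0], 6: [2]}
post_attack = remove_high_degree(toy, t=3)
giant = giant_component(post_attack)
print(sorted(post_attack))  # surviving vertices
print(sorted(giant))        # vertices of the giant component
```

The anomaly matrix of the resulting giant component could then be compared with that of the intact graph to quantify the topological part of the damage.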
[...] This result allows noting that the topological damage of $W_t^c$ is expressed primarily by the more significant differentiation of its communities into the central and peripheral ones, which follows from the degradation of the inter-community relationships.

[...] $a_{db}(G)$, the network modeled by the graph $G$ is characterized by the tendency of the entities accommodating only a few neighbors to act as hubs managing the significant portions of traffic. In turn, the links attached to such entities are subjected to the enhanced risks of overloading and, thereby, play the role of the primary structural bottlenecks.

Meanwhile, the low value of $a_{cb}(G)$ is caused by the entities that, while being located close to all other nodes, do not use their beneficial geodesic position to support the traffic transmission in the network and, in this sense, contribute to the formation of bottlenecks. In sum, the opportunity of ensuring the balance between the descriptive potential and computational complexity of the matrix (through the selection of metrics into the base vector $\Omega$) allows its consideration as a promising tool in the Big Data graph mining and analytics. Due to its usefulness in describing the topological damage (as illustrated in the previous section), the topological anomaly matrix could be potentially applied as one of the robustness metrics in assessing the attack tolerance of complex networks. Similarly, the proposed matrix could assist in detecting the differences in the topological organization between the whole network and its important subnetworks (such as the rich-clubs). Another possible application lies in tracing the evolutionary topological transformation of complex networks (by comparing the matrices calculated for the giant connected components of the graph models constructed for the series of the time-indexed network snapshots).

Definition 3. The topological anomaly matrix $A_\Omega(G)$ of the graph $G$ with respect to the base vector $\Omega = (\omega_1, \ldots, \omega_n)$ [...]

[...] vector has the size of $3 \times 3$, while its full specification requires values of only three items [...]

[...] contributing to the reduction in the value of $a_{dc}(G)$. Conversely, the topological anomaly evidenced by the low value of $a_{db}(G)$ [...]

[...] This observation clearly shows that the leaf nodes of the sparse large-scale graph $G$ typically tend to have almost the same closeness centrality as their neighbors. In view of such relationship, the leaf vertices appearing in the central communities are characterized by the relatively high closeness centrality compared to the other graph's nodes and, thereby, serve as the most evident contributors to the reduction in the value of $a_{cb}(G)$.

IV. ANALYSIS OF THE LOWER BOUNDS IMPOSED ON THE ENTRIES OF THE CANONICAL TOPOLOGICAL ANOMALY MATRIX
Meanwhile, the anomalous effects indicated by the matrix [...] solely by the intentional self-organizing process of the complex network modeled by the graph $G$. Additionally, the values of [...] are affected by the structural constraint taking the form of the vertex degree multiset [...] all nodes in $G$. Each multiset $D(G)$, for its part, characterizes the family $\Gamma(D(G))$ composed of all non-[...] sense, the specification of $D(G)$ restricts the possible topologies of $G$ only to the ones contained in $\Gamma(D(G))$ and imposes the structural bounds on the components of [...]. Notice that all these lower bounds are defined over the domain restricted by $\varphi_{tree}(|V|) \leq \varphi(G) < 1$, where [...] restriction stems from the impossibility of constructing any connected graph sparser than a tree along with the presence of only undefined items in the matrix [...] graph $K$ having all possible edges.

With a view to simplifying the discussion of the results given in Fig. 1, let us use the notations [...] $G$ at the value of $|V|$ fixed to $k$ (representing slices of the illustrated surfaces). As evident from Fig. 1a, the dependence [...] exhibits a single minimum located close to the lowest allowed density $\varphi_{tree}(k)$. Moreover, such minimum becomes deeper with the increase in $k$, which is directly attributed to the growing number of possible topologies. Another notable feature of the analyzed surface consists in the presence of the wide plateau-like region where [...] takes close-to-one values. While being located at the high density $\varphi(G)$, this region complies with the limited suitability of the dense graphs for elaborating the highly modular topology underlying the emergence of the structural anomalies. Conversely, the tree graphs could be strongly segregated into the sparse implicit communities, which acts as an explanation for the relatively low values of [...], the requirement regarding the sparsity of communities also hinders the formation of the structural anomalies. Accordingly, the minima of all considered dependences [...] $\varphi_{tree}(k)$. For example, Fig. 1a depicts the graph responsible for producing the minimum of [...]

[...] requires more careful investigation. The distinctive feature expressed by the experimentally [...] dependences [...] local minima whose number grows with the increase in $k$ (one at $k = 4$ and $k = 5$, two at $k = 6$, and three at $k = 7$). Remark that for every considered $k$, the local minima of both [...] the identical graph topologies and the same values of $\varphi(G)$. Furthermore, the presented results allow noticing that the bounds [...] the presence of the densely interconnected group of the highest-degree vertices along with the inclusion of the low-degree nodes into the chain-like substructures. Moreover, the collected results allow discovering that the formation of such topologies is driven by hidden fundamental rules. In particular, each graph labeled in Fig. 1 as 1 [...] be obtained based on constructing the diamond graph (i.e. the complete graph on four vertices with one removed edge) with the subsequent linking of its two-degree nodes by the path containing $k - 3$ edges. Each graph labeled as 2 [...] vertices. Its formation involves placing all possible edges between the nodes of $C_3$ [...] $r$ denotes the central node of the star $S_{k-3}$. These trends suggest that the additional local minima arising in the dependences [...] in $k$ are caused by the graph topologies following the new fundamental rules.

Fig. 2. Entries of the matrix [...] depending on the threshold $t$. The post-attack graph on the remaining [...]