An Improved K-anonymization Approach for Preserving Graph Structural Properties

Privacy risks are an important issue to consider when releasing network data, since personal information must be protected from potential attacks. Network data anonymization is a successful procedure used by researchers to prevent an adversary from revealing a user's identity; such an attack is called a re-identification attack. However, anonymization is a delicate task, because the original graph structure should be maintained as much as feasible during the process. Most existing solutions apply edge-perturbation methods directly, without any concern for the structural information of the graph, whereas preserving graph structure during anonymization requires keeping the most important edges of the graph unmodified. This paper introduces a high-utility K-degree anonymization method that uses edge betweenness centrality (EBC) as a measure to identify the edges that play a central role in the graph. Experimental results showed that preserving these edges during the modification process leads the anonymization algorithm to better preserve the most important structural properties of the graph. The method also proved efficient at preserving community structure, as a trade-off between graph utility and privacy.
Keywords: Privacy; social networks; anonymization; edge-perturbation methods


I. INTRODUCTION
Social network sites have become one of the largest sources of personal information. Every day, millions of users use social applications like Twitter, Facebook, and LinkedIn to communicate with others. The increase in data being collected from different social network sites has attracted many researchers and social network analysts interested in extracting knowledge from the data [1]. Hence, publishing social network data for analysis purposes has become inevitable, as analyzing structural data and studying the relations between individuals can serve many fields, including marketing and business. Social data includes a large amount of sensitive information about individuals, so releasing social network data in its original form without anonymizing it could expose the data to many attacks [2], [3], which harms users' privacy. Many types of data privacy-related attacks have been discussed in previous literature [4], [5], and they can be summarized as follows: identity disclosure, sensitive attribute disclosure, and link disclosure. That is why privacy preservation methods must be applied by specialists before network data is released to the public.
The re-identification attack causes dangerous violations in social networks that harm users' privacy. An adversary can violate a user's privacy in two ways: (1) by reaching the target's personal information such as name, age, and salary, known as profile data, or (2) by utilizing the graph's structural information. Knowledge of the topological structure of the graph and of the relations between individuals enables an adversary to use background knowledge to re-identify individuals. Once an adversary recognizes a specific person in the social network, all sensitive information related to that person becomes identified. Confidential information about an individual's membership in a particular community is also disclosed. For example, in the healthcare domain, PatientsLikeMe is a social network site that consists of several communities of patients. Each community represents the patients who suffer from the same illness. To keep track of their health and benefit from patient-reported concerns, members of this site are allowed to exchange private information such as health status and treatments [6]. In such a case, disclosing a patient's existence in a particular group reveals all the secret information that they share with others and violates their privacy.
The primitive way to prevent re-identification attacks when publishing social network data is to delete a user's identifier attributes and replace them with symbols or synthetic identifiers. This method is known as simple or naïve anonymization. The authors in [7] presented two types of attacks on the naïve-anonymized graph, the passive attack and the active attack, which means that this simple method of anonymizing graphs is not enough to prevent the re-identification attack. The attacker can exploit background knowledge concerning only the graph structure to reach the target and breach privacy.
For example, in the graph shown in Fig. 1, each vertex (node) represents an individual, and an edge connecting two individuals represents the relation between them. After performing naïve anonymization on the original graph $G$, we get an anonymous version as shown in Fig. 1(B). If an attacker has some background knowledge about Carl and knows that Carl has five friends, the attacker can re-identify Carl in the anonymously published graph and reach all sensitive information about Carl. Once an attacker has obtained the information about Carl, the probability that this attacker will reach all of Carl's friends also increases. So, such a method cannot preserve the user's privacy. Therefore, researchers extended the well-known K-anonymity model [8], introduced to protect statistical data from disclosure risk, to develop different privacy models for graphs according to various assumptions about an attacker's background knowledge.
This paper assumes that structural information is the only information available to an attacker, who can exploit it to carry out a re-identification attack. Specifically, we assume that the attacker is aware of the degree of the target vertex. Previous studies made many attempts to tackle this case by applying the K-degree anonymity model. This model perturbs the network structure by adding or deleting edges so that each vertex of the adjusted version shares its degree with at least $K-1$ other vertices. However, this approach may cause a large distortion of the local structure of the original graph. This distortion harms the utility of the data, especially when the anonymized data is used to meet analytical needs.
The main reason behind this large distortion is that most existing anonymization algorithms based on edge modification do not take into consideration the concept of edge relevance proposed in [12], which aims at maximizing data utility by keeping the important edges of the graph unmodified.
In this paper, we introduce the edge betweenness centrality measure [13] to highlight the most valuable edges in the graph, and we apply the K-degree anonymity model only to edges with zero or low betweenness values in order to preserve privacy and maximize data utility, especially for clustering processes. This matters because edges with high betweenness values carry the most important information that some popular community detection algorithms use to discover communities [14].
The remainder of this paper is structured as follows: Section II discusses the literature review, Section III introduces the proposed method, Section IV presents the results and evaluation, and Section V highlights the conclusions and directions for future work.

II. LITERATURE REVIEW
According to the previous literature [15], [16] on the anonymization of social networks, anonymization approaches are categorized into three main groups: edge modification-based anonymization approaches, clustering-based generalization approaches, and differential privacy approaches [17].
Edge modification-based anonymization [4], [9]: these methods anonymize the graph structure by modifying edges (adding and/or deleting) until the desired level of K-anonymity is reached, while some other methods modify the edges of the graph randomly.
Clustering-based generalization approaches [18]: these methods cluster similar nodes together into groups. Each group is then generalized into an obscure cluster without any information about a specific individual. Although such methods succeed in hiding the details about individuals, they fail to preserve the local graph structure of the social network, because the graph structure is shrunk during the anonymization process. Consequently, these methods are not suitable for analyzing the graph structure [19].
Differential privacy approaches: such methods seek to preserve users' privacy by imposing restrictions on the data release mechanisms. Differential privacy-based algorithms aim at providing statistical information about the data without allowing direct access to the whole database. Consequently, such methods prevent a malicious attacker who can query the database from disclosing the target's identity.
In this paper, we focus on previous studies that addressed the anonymization problem through edge modification-based approaches. Some authors concentrate on preserving the general structural properties of the anonymized network [20]-[22], while others are interested in preserving the community structure in the anonymized version [23], [24].
The authors in [25] compared the results of four algorithms used for implementing K-degree anonymity in terms of both information loss and data utility. These algorithms were introduced by different authors. The first one introduced the concept of K-degree anonymity in [4]. The second and third algorithms are EAGA and UMGA, presented in [26] and [27] respectively. The last one, introduced in [28], is based on the vertex addition method. They tested all algorithms using the same configurations. Each one follows its own method for minimizing the changes performed on the graph structure. Their results showed that UMGA scored the best results on all tested networks because it succeeds in minimizing the number of edges modified during the anonymization phase.
The authors in [12] proposed an efficient anonymization approach for creating a K-degree anonymized graph. They used neighborhood centrality as a measure for identifying the most significant edges in the graph. They showed that preserving these edges during the anonymization process decreases the amount of information loss. At the same time, their method proved efficient in increasing the usefulness of the anonymized graph for the clustering process. Their algorithm also achieves the best results, with less information loss, compared to other popular anonymization algorithms.
The authors in [29] presented a new method to satisfy K-degree anonymity through node addition and edge set modification. Instead of adding nodes randomly, they gave priority to the nodes with low betweenness centrality values to be modified. Their results showed that their approach could preserve APL, closeness centrality, and node degree. However, they did not clarify how well their proposed method preserves the community structure of the anonymous graph.
The authors in [30] introduced a genetic K-degree anonymity method in two steps to enhance the preservation of structural information in anonymized graphs. In the first step, they partitioned the vertices of the graph and assigned a label to each vertex to show how many edges needed to be added to achieve the required K-degree anonymized sequence. Then, they identified the set of vertices that should exist in each community. In the second step, within each community, a few edges were added between the vertices to modify the graph using a meta-heuristic algorithm [31].

III. THE PROPOSED METHOD
In our proposed approach, we seek to preserve the most impactful edges during the modification phase, which in turn helps us limit the number of modified edges. We use the edge betweenness centrality (EBC) measure to determine the most essential edges in the graph. Keeping these edges in the anonymized network also leads the suggested approach to optimize data utility for clustering analysis.

A. Overview
Consider an undirected and unlabeled graph $G = (V, E)$, where $V$ describes the set of vertices and $E$ defines the set of edges in the graph. Let $d$ denote the degree sequence of graph $G$, where $d$ is a vector of $n$ elements, i.e. $d = \{d(v_1), d(v_2), \ldots, d(v_n)\}$. Each element is an integer, where $d(v_i)$ is the degree of vertex $v_i$ and $n$ is the number of elements (vertices).
Regarding graph anonymization, Liu and Terzi introduced two essential definitions in [4] for satisfying the K-degree anonymity concept:
1) A degree sequence is described as K-anonymous when each distinct value appears not less than K times.
2) A graph $G = (V, E)$ is known as a K-degree anonymous graph when the degree sequence of the graph G is K-anonymized, as shown in Fig. 2.
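Definition 1) can be checked mechanically. The following is a minimal sketch, assuming the degree sequence is given as a plain list of integers; the function name `is_k_anonymous` is hypothetical and is not taken from [4].

```python
from collections import Counter
from typing import Iterable


def is_k_anonymous(degree_sequence: Iterable[int], k: int) -> bool:
    """Return True when every distinct degree value occurs at least k times."""
    counts = Counter(degree_sequence)
    return all(count >= k for count in counts.values())


# Degrees 2 and 3 each appear at least twice, so this sequence is 2-anonymous.
print(is_k_anonymous([2, 2, 3, 3, 3], 2))
# Degree 1 appears only once, so a vertex with that degree is uniquely identifiable.
print(is_k_anonymous([1, 2, 2], 2))
```

A graph then satisfies definition 2) when this check passes on the list of its vertex degrees.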
By considering the previous definitions, we introduce our enhanced approach to anonymizing the graph, as described in Fig. 3. Our approach goes through two main stages. The first stage accepts the original graph and anonymizes its degree sequence. Once this stage has produced the anonymized degree sequence, the second stage constructs the anonymized graph. Finally, the utility of the anonymized graph is evaluated in the experimental results section by extracting the community structure of both the original and the anonymized versions of the graph.

B. Stage1: Degree Sequence Anonymization
Taking definition 1) into consideration, we must adjust the values of the degree sequence to construct groups of at least K copies of each element. To satisfy this definition, we adopt the well-known univariate microaggregation technique proposed in [32] to perturb the degree sequence of the original graph. The main objective is to obtain the optimal solution that minimizes the distance between the original degree sequence and the resulting K-anonymous sequence, using the distance function referred to as Eq. 1.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 9, 2021, www.ijacsa.thesai.org
Our method starts by obtaining an optimal partition of the graph vertices, that is, an ordered degree sequence divided into several groups, and then modifies the values of each group to achieve the required degree sequence that minimizes the distance calculated by Eq. 1. As stated in [32], given a directed graph, the optimal partition is defined as the set of groups that match the arcs of the shortest path from the source vertex 0 to vertex $n$ of that graph. Each group that belongs to the optimal partition represents an arc on the shortest path in the graph. The group size is in the range of $K$ to $2K-1$ items. Then, to modify the values of each group of the optimal partition, we calculate the differences matrix as computed in [12]. Using this matrix, several solutions exist that satisfy K-degree anonymity. Finally, the greedy method is selected to find the best solution among all possible ones using a probability distribution matrix.
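The optimal-partition step can be illustrated with a short dynamic program, which is equivalent to the shortest-path formulation on sorted values. This is a sketch under stated assumptions, not the procedure of [32] or [12]: it assumes the distance is the sum of absolute differences between original and anonymized degrees, it replaces every group by its lower median (so values stay integers), and it omits the differences matrix and the probabilistic greedy step. The function name `anonymize_degree_sequence` is hypothetical.

```python
from typing import List


def anonymize_degree_sequence(degrees: List[int], k: int) -> List[int]:
    """Group sorted degrees into runs of size k..2k-1 and flatten each run.

    Assumption: the cost of a group is the sum of absolute deviations from
    its lower median, and every member of the group receives that median.
    """
    n = len(degrees)
    if k < 1 or n < k:
        raise ValueError("need at least k vertices and k >= 1")
    order = sorted(range(n), key=lambda i: degrees[i])
    s = [degrees[i] for i in order]  # sorted degree values

    def median(a: int, b: int) -> int:
        return s[(a + b - 1) // 2]  # lower median of s[a:b]

    def cost(a: int, b: int) -> int:
        m = median(a, b)
        return sum(abs(x - m) for x in s[a:b])

    inf = float("inf")
    best = [inf] * (n + 1)   # best[j]: minimal cost of partitioning s[:j]
    prev = [-1] * (n + 1)    # prev[j]: start index of the last group
    best[0] = 0
    for j in range(k, n + 1):
        for size in range(k, min(2 * k - 1, j) + 1):
            i = j - size
            if best[i] == inf:
                continue
            candidate = best[i] + cost(i, j)
            if candidate < best[j]:
                best[j], prev[j] = candidate, i

    # Walk back through the chosen groups and write the anonymized values
    # at the original vertex positions.
    result = [0] * n
    j = n
    while j > 0:
        i = prev[j]
        m = median(i, j)
        for pos in range(i, j):
            result[order[pos]] = m
        j = i
    return result


print(anonymize_degree_sequence([1, 1, 2, 3, 3, 5], k=2))
```

The output sequence is K-anonymous by construction, but it is not guaranteed to be realizable as a graph; that is handled by the graph reconstruction stage.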

C. Stage2: Graph Reconstruction
In this stage, we adjust the original graph according to the anonymous degree sequence resulting from the first stage. In our approach, the modifications are limited to the set of edges, while the set of vertices does not change. We deploy three types of operations to modify the set of edges of the original version:
• Edge insertion links two vertices $v_i$ and $v_j$ by including a new edge $(v_i, v_j)$.
• Edge deletion eliminates an existing edge $(v_i, v_j)$ between two vertices.
• Edge swap switches between two edges, i.e., $(v_i, v_j)$ with $(v_k, v_l)$.
Instead of selecting the edges to be modified randomly, as most previous anonymization methods do, we select the set of auxiliary edges that helps preserve the graph structure for community analysis. We therefore use the edge betweenness centrality measure to quantify the most significant edges in the graph. The betweenness centrality of an edge is estimated by counting the number of times the edge lies on the shortest paths between each pair of graph vertices. It is computed as follows:
$$EBC(e) = \sum_{s \neq t \in V} \frac{\sigma_{st}(e)}{\sigma_{st}} \qquad (2)$$
where $\sigma_{st}$ indicates the number of shortest paths from a vertex $s$ to a vertex $t$, while $\sigma_{st}(e)$ is the number of those paths that go across edge $e$.
Among all available edges to be adjusted, we choose edges with high betweenness values to be preserved during the modification process. These edges have more importance than others. Only edges with no or low betweenness values are allowed to be modified during the modification process.
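The selection rule above can be sketched in plain Python. The example computes edge betweenness with a Brandes-style breadth-first accumulation on an unweighted, undirected graph and then keeps only low-betweenness edges as candidates for modification. The adjacency-dictionary representation, the function names, and the `threshold` parameter are assumptions made for illustration; the cut-off between "low" and "high" betweenness used in the paper is not stated here. Each unordered vertex pair is counted once.

```python
from collections import deque
from typing import Dict, FrozenSet, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]   # vertex -> set of neighbours
Edge = FrozenSet[Hashable]              # undirected edge {u, v}


def edge_betweenness(adj: Graph) -> Dict[Edge, float]:
    """Unnormalized edge betweenness of an unweighted, undirected graph."""
    score: Dict[Edge, float] = {
        frozenset((u, v)): 0.0 for u in adj for v in adj[u]
    }
    for source in adj:
        # Breadth-first search that counts shortest paths (sigma).
        dist = {source: 0}
        sigma = {v: 0 for v in adj}
        sigma[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path shares from the farthest vertices back to the source.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {edge: value / 2.0 for edge, value in score.items()}


def modifiable_edges(adj: Graph, threshold: float) -> Set[Edge]:
    """Edges whose betweenness is at most `threshold` (hypothetical cut-off)."""
    return {e for e, value in edge_betweenness(adj).items() if value <= threshold}


# Two triangles joined by the bridge (2, 3): the bridge has the highest score.
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(edge_betweenness(barbell)[frozenset((2, 3))])
print(modifiable_edges(barbell, threshold=1.0))
```

In this toy graph the bridge between the two triangles is protected, while the two edges that carry only a single shortest path remain available for insertion, deletion, or swap.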

D. Summary
Algorithm: High-utility K-degree anonymity algorithm
Input: A graph $G = (V, E)$ and anonymity parameter $K$.
Output: K-degree anonymous graph.

IV. COMPUTATIONAL RESULTS
In this section, we present empirical results to assess the performance of our proposed algorithm. We compare our method with two well-known approaches for K-degree anonymity: the KDA approach presented in [4] and the UMGA-NC approach proposed in [12]. We vary the value of K from 2 to 10 and run all algorithms on the same datasets with the same configuration. First, we show how far the structural properties of the graph can be conserved. Second, we measure how well our anonymization approach preserves the community structure of the original graph.

A. Datasets and Environment
We test all algorithms on three real datasets, which are unlabeled and undirected networks: Polbooks [33], American College Football [13], and Jazz Musicians [34]. Table I lists these networks.

B. Assessment Measures
To assess the performance of our proposed approach compared to the others, we test four important measures that are commonly used in social network analysis. The four measures are:
• Average path length (APL) is the average distance in the graph between every pair of vertices, as described in Eq. 3:
$$APL = \frac{1}{n(n-1)} \sum_{v_i \neq v_j \in V} dist(v_i, v_j) \qquad (3)$$
where $V$ is the set of vertices in the graph G, $dist(v_i, v_j)$ is the shortest path length from vertex $v_i$ to vertex $v_j$, and $n$ is the number of vertices in G.
• Closeness centrality ($C_C$) [35] is the inverse of the average distance to all reachable vertices. We calculate the closeness of a vertex $v_i$ of G as follows:
$$C_C(v_i) = \frac{n-1}{\sum_{v_j \neq v_i} dist(v_i, v_j)} \qquad (4)$$
• Betweenness centrality ($C_B$) [35] of a vertex $v_i$ is specified as in Eq. 5:
$$C_B(v_i) = \sum_{s \neq v_i \neq t} \frac{\sigma_{st}(v_i)}{\sigma_{st}} \qquad (5)$$
where $\sigma_{st}$ indicates the number of shortest paths from vertex $s$ to vertex $t$, while $\sigma_{st}(v_i)$ is the number of those shortest paths that go across $v_i$.
• Transitivity (T) is defined as the fraction of all possible triangles that are present in graph G. Possible triangles are determined by the number of triads (two edges with a common vertex). We can compute the transitivity of a graph as:
$$T = \frac{3 \times \text{number of triangles}}{\text{number of triads}} \qquad (6)$$
To analyze the performance of our approach compared to the other two methods clearly, we evaluate the perturbation that the anonymization process produces in the four metrics listed above. As in Table II, we calculate the mean absolute error (MAE) between the original and anonymized versions of the tested networks over ten K levels as follows:
$$MAE = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{y}_i - y \right| \qquad (7)$$
where $\hat{y}_i$ is the value of the tested metric, e.g. APL, of the anonymized graph at the $i$-th level of K, $y$ is the true value of the tested metric in the original graph, and $m$ is the number of K levels.
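Three of these quantities are simple enough to sketch directly. The example below is an illustration under stated assumptions rather than the evaluation code used for the experiments: the graph is an adjacency dictionary, it is assumed to be connected when computing the average path length, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Sequence, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours


def average_path_length(adj: Graph) -> float:
    """Mean shortest-path length over ordered vertex pairs (connected graph)."""
    n = len(adj)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total / (n * (n - 1))


def transitivity(adj: Graph) -> float:
    """3 * triangles / triads, counted per centre vertex."""
    closed = 0   # neighbour pairs that are themselves linked (3 per triangle)
    triads = 0   # all neighbour pairs around each vertex
    for neighbours in adj.values():
        nbrs = list(neighbours)
        triads += len(nbrs) * (len(nbrs) - 1) // 2
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1
    return closed / triads if triads else 0.0


def mean_absolute_error(anonymized_values: Sequence[float], original_value: float) -> float:
    """Average absolute gap between per-K metric values and the original value."""
    return sum(abs(v - original_value) for v in anonymized_values) / len(anonymized_values)


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(average_path_length(path), transitivity(path))
```

To score one metric for one method, `mean_absolute_error` would receive the list of metric values measured on the anonymized graphs, one per K level, together with the value measured on the original graph.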

C. Structural Analysis of the Perturbed Graph
In this section, we show the results of KDA, UMGA-NC, and our algorithm on the three networks listed in Table I. We calculate the four measures described previously for both the original graph and its anonymized version to show how much information is lost during the anonymization process. The actual metric values of the original graph are constant for all K values and are represented by horizontal lines.
Fig. 4a, 5a, and 6a show the average path length (APL) of the three anonymized networks as the parameter K varies from 2 to 10. As we can see, the values of our proposed method are closer to the actual ones than the values of KDA and UMGA-NC, which means lower information loss on APL. Fig. 4b, 5b, and 6b refer to the average closeness (ACLN) of the perturbed networks. All figures indicate that the changes produced by our anonymization method in the average closeness also stay much closer to the real values than those produced by the two other methods. Fig. 4c, 5c, and 6c describe the average node betweenness (ABTW) values. From these figures, we note that our method preserves node betweenness values that are identical to the original values as the anonymity parameter K varies in both the Football and Jazz Musicians networks. As for the Polbooks network, there are a few changes in the betweenness values.
Lastly, Fig. 4d, 5d, and 6d present the transitivity results on the three perturbed graphs. The performance of our proposed method compared to the two other perturbation methods is not clear from these figures, so we quantify the performance of the three perturbation methods on transitivity in Table II.
We calculate the amount of error on the four tested metrics over 10 K levels as described in Eq. 7. As can be seen in Table II, our method gets the best results on APL, ACLN, and ABTW, but not on Transitivity, which is strongly affected in the anonymous graph. The UMGA-NC method ranks first for the Transitivity metric on the three tested networks, where its mean error values are lower than those obtained by both KDA and our method. Compared with KDA, our method achieves better Transitivity results on both the Football and Jazz Musicians networks.

D. Community Structure Preservation
Community detection is one of the most significant tasks in graph mining. This section assesses the utility of the perturbed graph for three different community detection algorithms: (1) the Girvan-Newman algorithm (GN) [13], a hierarchical decomposition algorithm in which edges are deleted in descending order of their edge betweenness scores; (2) Walktrap (WT), introduced in [36], which is based on the idea that short random walks tend to stay within the same community; and (3) Label Propagation (LP), proposed in [37], whose main notion is to assign each vertex in a graph to the community to which most of its adjacent vertices belong. For more details, see [38].
Using the NetworkX library, we extract the community structure of both the original and the anonymized versions of the three networks described in Table I. We use the F1-score measure [39] to assess the accuracy of our approach in preserving the actual community structure, as described in Eq. 8. This measure tests the similarity between the set of predicted communities of the anonymized graph and the ground-truth communities of the original version. We compute the F1-score values of K-anonymity for our algorithm and UMGA-NC using the three community algorithms. Then, we estimate the mean error on the F1-score over ten K levels.
$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}, \quad precision = \frac{|C \cap \hat{C}|}{|\hat{C}|}, \quad recall = \frac{|C \cap \hat{C}|}{|C|} \qquad (8)$$
where $C$ is the set of vertices that belong to the ground-truth communities and $\hat{C}$ denotes the set of vertices in the predicted communities produced by the community algorithm.
As shown in Table III, our EBC-based method achieves the lowest error on the tested networks using the three community algorithms. Consequently, it loses less information and preserves the community structure better than UMGA-NC. Comparing the three community algorithms, the Girvan-Newman algorithm (GN) performs best on the three networks anonymized by our method. The reason is that the Girvan-Newman algorithm is essentially based on edge betweenness centrality values to detect communities, and our approach preserves this metric well during the anonymization process.
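The F1 comparison between a ground-truth community and a predicted community can be sketched as follows. How communities from the two graphs are paired with each other is not described above, so the best-match averaging in `average_best_match_f1` is an assumption made for illustration, and both function names are hypothetical.

```python
from typing import Hashable, Sequence, Set


def f1_score(truth: Set[Hashable], predicted: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall between two vertex sets."""
    overlap = len(truth & predicted)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


def average_best_match_f1(
    truth_communities: Sequence[Set[Hashable]],
    predicted_communities: Sequence[Set[Hashable]],
) -> float:
    """Assumed pairing rule: match each true community to its best prediction."""
    best = [
        max(f1_score(t, p) for p in predicted_communities)
        for t in truth_communities
    ]
    return sum(best) / len(best)


original = [{1, 2, 3}, {4, 5, 6}]       # communities found in the original graph
anonymized = [{1, 2}, {3, 4, 5, 6}]     # communities found after anonymization
print(average_best_match_f1(original, anonymized))
```

A score of 1.0 means the anonymized graph yields exactly the same communities as the original, and lower values indicate that vertices moved between communities.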

V. CONCLUSION
Most of the previous works seek to anonymize graph data, regardless of the role of some edges that have proven their usefulness in analyzing the graph data. In this paper, we focus on optimizing the utility of an anonymized graph by minimizing the changes made to these edges. For this reason, we introduce the edge betweenness measure to identify and preserve the most relevant edges in the graph during the modification operation. Those edges, if modified, will cause large distortion to the local structure of the anonymized graph.
We perform an analysis of the graph structure using several structural metrics and different community algorithms. The final results showed that our method achieves the best performance on average path length, closeness, and betweenness, although not on transitivity, as less information is lost compared with other popular anonymization algorithms. Besides that, it provides better preservation of the community structure than other similar methods.
In our future work, we plan to enhance the performance of our proposed approach. We intend to implement our algorithm on big data platforms to utilize graph computation systems such as GraphX on the Apache Spark platform and to test our proposed method on large graphs.