Incremental Diversity: An Efficient Anonymization Technique for PPDP of Multiple Sensitive Attributes

Data collected at organizations such as schools, offices, healthcare centers and e-commerce websites contain multiple sensitive attributes. Sensitive information held by these organizations, such as marks obtained, salary, disease, treatment and travel history, is personal information that individuals do not want disclosed to the public, since disclosure may lead to privacy threats. It is therefore necessary to preserve the privacy of the data before publishing. Privacy Preserving Data Publishing (PPDP) algorithms aim to publish data without compromising the privacy of individuals. In recent years several algorithms have been designed for PPDP of multiple sensitive attributes. They have three major limitations. First, these algorithms treat one of the sensitive attributes as the primary sensitive attribute and anonymize the data accordingly, although there may be other dominant sensitive attributes that need to be preserved. Second, there is no consistent way to categorize multiple sensitive attributes. Third, a larger proportion of residual records is generated because of the use of generalization and suppression techniques. To overcome these limitations, the current work proposes an efficient approach that categorizes the sensitive attributes based on their semantics and anonymizes the data using an anatomy technique. This reduces the residual records and provides a basis for categorizing the attributes. The results are compared with popular techniques such as Simple Distribution of Sensitive Values (SDSV) and (l, e) diversity. Experiments show that our method outperforms the existing methods in terms of categorizing multiple sensitive attributes, reducing the percentage of residual records and preventing the existing privacy threats.


I. INTRODUCTION
Developments in digital devices and information systems have created various opportunities and challenges. An enormous amount of data is collected by digital devices in sectors such as healthcare, education, e-commerce, banking and government, and stored in information systems. Data that is specific to a single organization is called microdata; among other attributes, it contains the sensitive attributes of individuals. The main purpose of data collection is to glean actionable insights and to help organizations perform analysis and research, and succeed in terms of greater productivity and return on investment. Some organizations, such as healthcare centres, educational institutions and e-commerce companies, share the microdata with third parties for investigation, or store it in the cloud and make it available to researchers for fact-finding [1].
Amid the constructive usage of microdata, there may be an intruder whose purpose is to steal individual information and cause privacy threats. Fig. 1 shows the process of microdata collection, storage, publishing and usage. Primary data is collected directly from the source and contains personal information such as marks obtained, salary, credit card information, treatment history and disease. When such data is shared with the public, care must be taken not to disclose the sensitive information of individuals. Privacy Preserving Data Publishing (PPDP) provides methods and tools that aim to protect the privacy of individuals while making sure that the data remains usable by the public for analysis [2].
Anonymization algorithms are commonly used to achieve PPDP [2] [3]. Existing algorithms were developed over time to overcome various privacy threats [4]. These algorithms can be broadly classified into those that preserve the privacy of a Single Sensitive Attribute (PPDP-SSA) and those that preserve the privacy of Multiple Sensitive Attributes (PPDP-MSA).
2. PPDP-MSA: With big data, IoT and cloud storage, microdata in effect consists of MSA that must be preserved [18]-[20]. Many algorithms have been proposed under this category [21]-[25], but they have the following major limitations: i) one of the attributes is chosen as the primary sensitive attribute and the data is anonymized accordingly, so the other dominant sensitive attributes are not preserved; ii) the algorithms do not provide any basis for categorizing the sensitive attributes; iii) the algorithms use generalization and suppression techniques to anonymize the data, which leads to the generation of residual records.
Simple Distribution of Sensitive Values (SDSV) [26] is a recent approach that discusses the distribution of MSA. Here, the ranking of the sensitive attributes is based on frequency of occurrence, and l-diversity is used to group the records. However, the approach does not consider the semantic similarity between the attributes, so data anonymized using this approach is vulnerable to semantic attacks.
In the current work, semantic hierarchical trees are constructed for the sensitive attributes of the microdata. Based on the similarity indicator 'e' proposed in (l, e)-diversity [17], the sensitive attributes are categorized into primary, secondary and tertiary sensitive attributes. The records of the microdata are then recursively grouped into equivalence classes such that each class satisfies l-diversity [9]. The experimental results show that the proposed algorithm is efficient in terms of preventing the existing privacy threats associated with MSA, reducing the generation of residual records and providing a basis for categorizing the sensitive attributes.

A. Organization of the Paper
The paper is organized as follows: Section II presents data anonymization and basic definitions, Section III presents the related work, Section IV discusses the proposed method and empirical results, Section V presents the experiments and performance equations, Section VI presents the results and discussion, and Section VII presents the conclusion and future work.

II. DATA ANONYMIZATION AND BASIC DEFINITIONS
Data anonymization is the process of protecting an individual's sensitive information so as to prevent disclosures and privacy threats. Fig. 1 shows the process of microdata collection, storage, publishing and usage. The collected data contains Multiple Sensitive Attributes (MSA) such as disease, treatment, salary, marks obtained, travel history and health conditions. Data owners do not want such data to be disclosed to others. Consider Table I, a sample microdata table (M) of a healthcare data center. In the literature, microdata attributes are classified as identifiers, quasi identifiers, sensitive attributes and non-sensitive attributes.
These attributes are defined as follows:
Identifiers (ID) - Directly identifying attributes are called identifiers. For example: Name, Patient ID and Social Security number. Such attributes are removed before publishing the microdata.
Quasi Identifiers (QID) - These attributes can be used to indirectly identify a particular person. For example: Age, Gender and ZipCode. In any anonymization algorithm, the QID values are altered to prevent disclosures.
Sensitive Attributes (SA) - These attributes provide valuable information to researchers and analysts and are used in data analysis. For example: Disease, Salary and Marital Status.
The following are the definitions of the terms that are used throughout the paper.
Definition 2: (e-Similar) - Let a1 and a2 be the levels of two sensitive values v1 and v2 in their semantic tree, and let a0 be the level of their closest common ancestor. Then e = [(a1 - a0) + (a2 - a0)]/2, and v1 and v2 are said to be e-similar. In other words, the similarity between v1 and v2 is 'e'.
Definition 3: (l, e) Diversity - A data set is said to satisfy (l, e) diversity [14] if every EQ is l-diverse and the similarity between any two values in an EQ is equal to or more than 'e'.
Definition 4: Anatomy - Anatomy [8] is an anonymization technique that disassociates the sensitive attributes and the quasi identifier attributes into two tables. These tables increase utility compared with k-anonymity because the attribute values are published in their original form. Anatomy breaks the correlation between the SA and the QIDs, which increases privacy.
Definition 5: Residue Records - Records that do not fit into any equivalence class, because they do not satisfy the constraints of the equivalence class, are called residues. When an anonymization algorithm is applied, care must be taken to keep the residue percentage as low as possible.
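Definitions 2 and 3 can be illustrated with a minimal Python sketch. The semantic tree below is hypothetical (it is not the tree of Fig. 3), the function names are illustrative, and l-diversity is assumed here to mean "at least l distinct values in the equivalence class", with the 'e' condition checked between distinct values only.

```python
from itertools import combinations

# Hypothetical semantic tree, stored as child -> parent (the root maps to None).
PARENT = {
    "Disease": None,
    "Respiratory Disease": "Disease",
    "Digestive System Disease": "Disease",
    "Flu": "Respiratory Disease",
    "Pneumonia": "Respiratory Disease",
    "Gastritis": "Digestive System Disease",
    "Gastric Ulcer": "Digestive System Disease",
}

def path_to_root(value):
    """Return the list [value, parent, ..., root]."""
    path = []
    while value is not None:
        path.append(value)
        value = PARENT[value]
    return path

def level(value):
    """Level of a node in the tree; the root is at level 0."""
    return len(path_to_root(value)) - 1

def e_similarity(v1, v2):
    """e = [(a1 - a0) + (a2 - a0)] / 2, with a0 the level of the closest common ancestor."""
    ancestors_of_v1 = set(path_to_root(v1))
    common = next(node for node in path_to_root(v2) if node in ancestors_of_v1)
    a0 = level(common)
    return ((level(v1) - a0) + (level(v2) - a0)) / 2

def satisfies_l_e_diversity(values, l, e):
    """Assumed check: at least l distinct values, every distinct pair at least e apart."""
    distinct = set(values)
    if len(distinct) < l:
        return False
    return all(e_similarity(a, b) >= e for a, b in combinations(distinct, 2))
```

With this tree, two respiratory diseases are 1-similar, while a respiratory and a digestive disease are 2-similar, so a class holding "Flu", "Pneumonia" and "Gastritis" would pass a (3, 1) check but fail a (3, 2) check.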

III. RELATED WORK
The datasets used in technologies such as big data, IoT and cloud computing contain multiple sensitive attributes that need to be preserved [18]. Fig. 2 shows the advancement of the anonymization algorithms.
Initially, algorithms were designed for SSA, for example k-anonymity [5], [7], [27], l-diversity [9], t-closeness [10], anatomy [11] and slicing [12]. These algorithms failed to consider the semantics between the sensitive attributes. Similarity and semantic similarity attacks [17] occur when anonymization algorithms do not consider the semantics between the sensitive attributes. For example, Gastritis, Gastric Ulcer and Gastroparesis are all diseases related to the stomach. An intruder who has some background knowledge about a person can learn that the person is suffering from a stomach infection. Later, algorithms such as p-sensitive k-anonymity [28], (p+, α) sensitive k-anonymity [29] and (p+, α, t) anonymity [30] were proposed. Although these algorithms considered the semantic relationship between the attributes, they performed poorly when applied to datasets with MSA. Therefore, a new set of algorithms was proposed to protect MSA.
Algorithms such as Rating [31], p-cover k-anonymity [32], Decomposition [33], Decomposition+ [34], KC slice [35] and KCi slice [36] were designed to prevent privacy threats that occurred on data with multiple sensitive attributes, such as association attacks [37] and semantic similarity attacks [16]. In these algorithms, one of the attributes was considered the primary sensitive attribute and the others secondary attributes. l-diversity, Anatomy or Slicing methods were used to group the records and anonymize the data. These algorithms did not discuss any method for selecting the sensitive attributes.
Simple Distribution of Sensitive Values for MSA (SDSV) [26] is a recent approach to distributing the MSA. In this method, two sensitivity levels are considered: High Sensitive Value (HSV) and Low Sensitive Value (LSV). The sensitive attribute with the most HSV occurrences is considered to be the Primary Sensitive Attribute (PSA) and the others are called Contributory Sensitive Attributes (CSA). The approach is illustrated with an example below.

A. SDSV Approach to Select and Distribute the Multiple Sensitive Attributes
The user first selects the High Sensitive Values (HSVs) that they want to preserve. Consider Table I, and let the selected HSVs be 'Heart Infection', '> 50K' and 'Unmarried' for the Disease, Salary and Marital Status attributes respectively. These values occur 2, 3 and 4 times in the table respectively. Since the attribute value 'Unmarried' occurs most often, the attribute Marital Status is considered the Primary Sensitive Attribute (PSA), and Disease and Salary are treated as Contributory Sensitive Attributes (CSA). The table is anonymized as per anatomy [9] and then published. The resulting tables are Table II, Table III, Table IV and Table V. Table II contains the QIDs of the microdata, which are grouped and assigned a Group ID. Table III contains Marital Status as the PSA, grouped such that within each EQ there is equal diversity of the attribute value.
Similarly, Table IV and Table V contain the groupings for the Salary and Disease attributes. A Group ID is assigned to each EQ created in the tables. From the generated tables it can be observed that Marital Status is considered to be the PSA and that the distribution of all the records was done based on this attribute. In Table III, the first EQ contains two occurrences of the attribute value 'Unmarried' and one occurrence of 'Married'. In Table V, the diseases in EQ1, 'flu' and 'pneumonia', both belong to chest infection, so an intruder with some background knowledge (age, gender and zip code) can easily learn the sensitive information. For example, if a person is a neighbor of Alice and knows her QIDs, then on getting access to the published Tables II, III, IV and V, he can conclude that Alice's record belongs to EQ1 and that she is suffering from some chest infection. This happens because all diseases in EQ1 are semantically similar. It can be seen that the PSA is chosen based on the number of occurrences. However, when the equivalence classes are created, the attribute values may be grouped such that they are semantically similar, which leads to semantic attacks. In addition, because there are multiple sensitive attributes, association attacks are also possible.
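The frequency-based selection described above can be sketched in Python as follows. This is only an illustration of the selection rule as described in this section, not the SDSV authors' implementation; the function name and the five toy records are hypothetical (Table I is not reproduced here), chosen so that the HSV counts are 2, 3 and 4 as in the example.

```python
def select_primary_attribute(records, high_sensitive_values):
    """Pick the attribute whose chosen HSV occurs most often as the PSA."""
    counts = {
        attr: sum(1 for rec in records if rec[attr] == hsv)
        for attr, hsv in high_sensitive_values.items()
    }
    primary = max(counts, key=counts.get)  # ties resolve to the first attribute listed
    contributory = [attr for attr in counts if attr != primary]
    return primary, contributory, counts

# Hypothetical toy records (not the paper's Table I).
records = [
    {"Disease": "Heart Infection", "Salary": "> 50K", "Marital Status": "Unmarried"},
    {"Disease": "Heart Infection", "Salary": "> 50K", "Marital Status": "Unmarried"},
    {"Disease": "Flu", "Salary": "> 50K", "Marital Status": "Unmarried"},
    {"Disease": "Pneumonia", "Salary": "<= 50K", "Marital Status": "Unmarried"},
    {"Disease": "Gastritis", "Salary": "<= 50K", "Marital Status": "Married"},
]
hsv = {"Disease": "Heart Infection", "Salary": "> 50K", "Marital Status": "Unmarried"}
primary, contributory, counts = select_primary_attribute(records, hsv)
```

Note that this rule looks only at occurrence counts; nothing in it inspects how close the grouped values are in meaning, which is the weakness discussed above.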
The following research gaps are observed from the background study:
• Among the existing PPDP algorithms for MSA, very few discuss how to select the primary, secondary and tertiary sensitive attributes.
• Most of the algorithms do not deal with residue records, that is, records that are skewed and do not fit into any of the equivalence classes.

B. Main Contribution of the Article
The main contributions of this work are:
• To provide an efficient method to select the sensitive attributes.
• Distributing the records within the EQ groups based on parameter 'e'.
• Applying incremental diversity so as to distribute the records appropriately within EQ with minimal residue records and preventing semantic attacks.
• Comparing the performance of the proposed algorithm (changing the primary sensitive attributes) against various parameters like residue percentage, diversity parameter (e) and computation time.

IV. PROPOSED METHOD AND EMPIRICAL RESULTS
Initially, a semantic hierarchy tree is constructed for each of the selected sensitive attributes. For example, if Disease, Marital Status and Relationship are considered as sensitive attributes, their semantic hierarchical trees are shown in Fig. 3, 4 and 5 respectively. In the semantic hierarchical tree for the Disease attribute, the root node Disease is at Level 0, its children, namely Respiratory Disease and Digestive System Disease, are at Level 1, the values under these diseases are at Level 2, and so on. Similarly, the attribute Marital Status has 3 levels (0, 1 and 2) and Relationship has 2 levels. Once the semantic hierarchy trees are constructed, the attributes whose trees have more levels and more child nodes can be selected as Initial Sensitive Attributes (ISA). This selection is essential to achieve optimal diversity of the sensitive attributes in each equivalence class. For example, if Disease is chosen as the ISA and an equivalence class consists of the sensitive values "Flu", "Heart Infection" and "Jaundice", the class satisfies (3, 2) diversity: the class contains different values, and the values are also semantically far from each other. If Marital Status is chosen as the ISA, it is difficult to achieve (3, 2) diversity; only (3, 1) diversity can be achieved, by repeating one of the values in each equivalence class. If Relationship is chosen as the ISA, it is possible to achieve only 'l' diversity, and achieving (l, e) diversity is not viable. A (3, 1) diversity table is shown in Table VI. Here, the Disease sensitive attribute is chosen as the ISA.
After choosing the ISA, it is necessary to choose the secondary sensitive attribute, the tertiary sensitive attribute and so on. This is necessary because if Tables VI and VII are published as they are, it may lead to an association attack.
For example, consider equivalence class 3. Even though the Disease attribute satisfies (3, 1) diversity, the other associated attributes are predictable. If the intruder knows that a woman is more than 50 years old and that she is married, he will easily be able to learn that she belongs to group 3 and is suffering from gastric ulcer. Such an attack is known as an association attack [34]. These attacks happen in data sets with multiple sensitive attributes.
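The ISA selection rule described above (prefer trees with more levels, then more child nodes) can be sketched as below. The three nested-dictionary trees are hypothetical stand-ins for Fig. 3, 4 and 5, and the tie-breaking order (levels first, node count second) is an assumption, since the text does not state how the two criteria are combined.

```python
def tree_height(children):
    """Number of levels below the root in a nested-dict tree."""
    if not children:
        return 0
    return 1 + max(tree_height(sub) for sub in children.values())

def node_count(children):
    """Number of nodes below the root."""
    return sum(1 + node_count(sub) for sub in children.values())

def rank_sensitive_attributes(trees):
    """Order attributes by (levels, nodes), largest first; rank 1 is the ISA."""
    return sorted(
        trees,
        key=lambda attr: (tree_height(trees[attr]), node_count(trees[attr])),
        reverse=True,
    )

# Hypothetical trees: each maps an attribute (the root) to its nested children.
trees = {
    "Relationship": {"Husband": {}, "Wife": {}, "Own-child": {}},
    "Marital Status": {
        "Married": {},
        "Not Married": {"Unmarried": {}, "Divorced": {}},
    },
    "Disease": {
        "Respiratory Disease": {"Flu": {}, "Pneumonia": {}},
        "Digestive System Disease": {
            "Gastritis": {}, "Gastric Ulcer": {}, "Gastroparesis": {},
        },
    },
}
ranking = rank_sensitive_attributes(trees)
```

Under these assumed trees, Disease and Marital Status both reach Level 2, and Disease is ranked first because it has more nodes.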

A. Choosing Secondary and Tertiary Sensitive Attributes
From the previous discussion it is clear that, as a primary phase of data anonymization, it is necessary to assign ranks to the sensitive attributes. The attributes can be ranked based on the structure of the semantic tree. Attributes whose values have more parents in the tree can be chosen as ISA and marked as rank 1. The next attributes are those with fewer parents, as in the case of Marital Status; these are termed Subsequent Sensitive Attributes (SSA) and have rank 2. Sensitive attributes that are numerical in nature and do not have many unique values can be replaced, within each equivalence class, by the mean of their values. For example, the Salary attribute can be replaced with its mean value in each equivalence class. The resulting tables, generated based on the categorization of the multiple sensitive attributes, are shown in Tables VIII and IX.

TABLE VIII. QID TABLE OF TABLE I

Age  Gender  ZipCode  Group ID
52   F       560298   1
25   M       560096   1
39   M       560094   1
36   F       560091   2
30   M       560190   2
61   F       560092   3
53   M       560090   3
23   F       560098   3
42   M       560099   2

TABLE IX. ANONYMIZED TABLE BASED

When implementing the algorithm, the records are recursively reordered to make sure that in every EQ there is high diversity between the ISA values, average diversity between the SSA values and so on. That is, incremental diversity is achieved over the ranks of the sensitive attributes. The algorithm proposed next takes the microdata table with identifiers, quasi identifiers and sensitive attributes as input. I, Q and S represent the number of identifiers, quasi identifiers and sensitive attributes respectively. The output of the algorithm is the separate QID and sensitive attribute (SAT) tables.
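The two table-level operations mentioned above, replacing a numerical sensitive attribute by its group mean and separating the QIDs from the sensitive attributes, can be sketched in Python as follows. This is a sketch under assumptions: the function names are illustrative, the sensitive table is published as per-group value counts (one common way to publish anatomy output), and the Disease and Salary values are hypothetical because the contents of Table IX are not available here. Only the QID values of group 1 come from Table VIII.

```python
from collections import Counter, defaultdict

def replace_with_group_mean(records, attribute):
    """Replace a numerical sensitive attribute by its mean within each group."""
    values = defaultdict(list)
    for rec in records:
        values[rec["GroupID"]].append(rec[attribute])
    means = {group: sum(v) / len(v) for group, v in values.items()}
    return [{**rec, attribute: means[rec["GroupID"]]} for rec in records]

def anatomize(records, qid_names, sa_names):
    """Split records into a QID table and a per-group sensitive-value count table."""
    qid_table = [
        {**{q: rec[q] for q in qid_names}, "GroupID": rec["GroupID"]}
        for rec in records
    ]
    counts = Counter(
        (rec["GroupID"], sa, rec[sa]) for rec in records for sa in sa_names
    )
    sa_table = [
        {"GroupID": g, "Attribute": sa, "Value": value, "Count": n}
        for (g, sa, value), n in sorted(
            counts.items(), key=lambda kv: (kv[0][0], kv[0][1], str(kv[0][2]))
        )
    ]
    return qid_table, sa_table

# QID values of group 1 from Table VIII; Disease and Salary are hypothetical.
group1 = [
    {"Age": 52, "Gender": "F", "ZipCode": "560298", "GroupID": 1,
     "Disease": "Flu", "Salary": 30000},
    {"Age": 25, "Gender": "M", "ZipCode": "560096", "GroupID": 1,
     "Disease": "Gastritis", "Salary": 50000},
    {"Age": 39, "Gender": "M", "ZipCode": "560094", "GroupID": 1,
     "Disease": "Heart Infection", "Salary": 40000},
]
averaged = replace_with_group_mean(group1, "Salary")
qid_table, sa_table = anatomize(averaged, ["Age", "Gender", "ZipCode"], ["Disease", "Salary"])
```

Publishing counts per group, rather than one sensitive row per record in the original order, avoids re-linking a QID row to its sensitive values by row position.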

   3) If not satisfied, place the tuples into the Residue Dictionary RD and select the next tuple.
   4) If the size of EG > K, break and place ti in the next EG.
8: end while
9: while RD ≠ Empty do
   1) Reiterate the above steps (9-13) to reduce the residual records.
   2) If the SA is numerical, within each EG, replace all the values by the mean.
   3) Separate the SAs and QIDs into separate tables. Assign the Group IDs for the groups generated.
10: end while

The algorithm is implemented in Python, and results are obtained by varying k, the number of records and 'l', and by choosing different sensitive attributes. This is discussed in the next section.
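Since only the tail of the algorithm listing survives above, the following Python sketch is a simplified greedy grouping in the same spirit, not a reconstruction of the authors' algorithm. It assumes two-level hypothetical semantic trees (so 'e' is 0 for equal values, 1 for values in the same category and 2 otherwise), a per-attribute threshold dictionary with a higher 'e' for the ISA than for the SSA, and a second pass that tries to absorb residue records into finished groups. All names are illustrative.

```python
# Hypothetical two-level semantic trees: leaf value -> level-1 category.
CATEGORY = {
    "Disease": {
        "Flu": "Respiratory", "Pneumonia": "Respiratory",
        "Gastritis": "Digestive", "Gastric Ulcer": "Digestive",
        "Heart Infection": "Cardiac",
    },
    "Marital Status": {
        "Married": "Married", "Unmarried": "Not Married", "Divorced": "Not Married",
    },
}

def e_similarity(attr, v1, v2):
    """e for leaves at level 2: 0 if equal, 1 if same category, else 2."""
    if v1 == v2:
        return 0
    return 1 if CATEGORY[attr][v1] == CATEGORY[attr][v2] else 2

def fits(record, group, thresholds):
    """True if the record stays at least e apart from every member, per attribute."""
    return all(
        e_similarity(attr, record[attr], member[attr]) >= e
        for attr, e in thresholds.items()
        for member in group
    )

def incremental_diversity(records, thresholds, k):
    """Greedily build groups of size k; return (groups, remaining residue)."""
    groups, residue, pending = [], [], list(records)
    while pending:
        group, rest = [pending[0]], []
        for rec in pending[1:]:
            if len(group) < k and fits(rec, group, thresholds):
                group.append(rec)
            else:
                rest.append(rec)
        if len(group) == k:
            groups.append(group)
            pending = rest
        else:
            residue.append(pending[0])  # seed could not be completed
            pending = pending[1:]       # its partial members are tried again
    leftover = []
    for rec in residue:  # second pass: absorb residue into existing groups
        target = next((g for g in groups if fits(rec, g, thresholds)), None)
        if target is None:
            leftover.append(rec)
        else:
            target.append(rec)
    return groups, leftover

records = [
    {"Disease": "Flu", "Marital Status": "Married"},
    {"Disease": "Pneumonia", "Marital Status": "Unmarried"},
    {"Disease": "Gastritis", "Marital Status": "Unmarried"},
    {"Disease": "Heart Infection", "Marital Status": "Divorced"},
    {"Disease": "Gastric Ulcer", "Marital Status": "Married"},
    {"Disease": "Heart Infection", "Marital Status": "Divorced"},
    {"Disease": "Flu", "Marital Status": "Married"},
]
# Higher e for the rank-1 attribute (ISA), lower e for the rank-2 attribute (SSA).
groups, leftover = incremental_diversity(
    records, {"Disease": 2, "Marital Status": 1}, k=3
)
```

In this toy run the last record cannot join either group without placing two respiratory diseases together, so it remains a residue record.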

V. EXPERIMENTS AND PERFORMANCE EQUATIONS
The algorithm is implemented in Python using the native Python data types tuple, dictionary and list. The use of external libraries such as NumPy and Pandas is avoided because it adds to the running time of the algorithm. Iterating through the tuples is faster when dictionaries are used. The implemented algorithm is tested on the demographic data set obtained from the University of California, Irvine (UCI) machine learning repository [38]. This microdata contains 30162 records. Occupation, Education, Marital Status, Work Class and Race are chosen as the MSA. Age, Gender and ZipCode are chosen as the QIDs. The number of unique values for each of these is shown in Table X.
The following equations are used to compute the various performance parameters. The residue percentage is computed as per equation 1.

VI. RESULTS AND DISCUSSION
The results obtained from the experiments are presented in this section. The first three experiments vary the primary sensitive attribute and k, and observe the residue percentage, computation time and diversity. The next set of experiments compares the performance of the proposed algorithm with the (l, e) diversity algorithm in terms of residue percentage and computation time. Finally, the proposed method is compared with both (l, e) diversity [17] and the SDSV algorithm [29].
A. Performance of the Proposed Algorithm
1) Percentage of residue records based on choosing different primary sensitive attributes: The main objective is to reduce the residue percentage. With k = 3 and 1000 to 5000 records, each line in Fig. 6 indicates the residue records left over when a particular attribute is chosen as the ISA. It can be observed in Fig. 6 that the percentage of residue records is highest when Race is chosen as the ISA and lowest when Education is chosen as the ISA. The percentage of residue records is computed as per equation 1.
2) Computation time: The computation time is the time required to generate the final QID and SAT tables. For this, equivalence classes are created on the chosen number of records using the diversity parameter 'l' of l-diversity [12], the levels of the sensitive attributes and the group size k as defined in [8]. The experiments show that choosing Education as the ISA consumes more time than choosing Occupation or Race. This is expected, because the Education attribute has more unique values. The time performance for different attributes is shown in Fig. 7. The computation time is as per equation 2.
3) Diversity among the attribute values within the equivalence class: The diversity is computed as per the (l, e) diversity discussed previously. It can be observed in Fig. 8 that the attribute with more unique values (Education) achieves better diversity among the attribute values within the equivalence classes. For values of 'k' from 5 to 8, the Education attribute achieves the highest diversity. The diversity percentage is computed according to equations 3 and 4.
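Equations 1 and 2 are not reproduced in this text, so the following Python sketch shows only one plausible reading of the first two metrics: the residue percentage as the share of records left outside every equivalence class, and the computation time as wall-clock time measured around the anonymization call. Both forms, and the helper names, are assumptions rather than the paper's exact equations; the diversity percentage (equations 3 and 4) is not sketched because its definition cannot be inferred from the text.

```python
import time

def residue_percentage(residue_count, total_count):
    """Assumed form: residue records as a percentage of all input records."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return 100.0 * residue_count / total_count

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: 50 residue records out of 1000, and timing an arbitrary call.
pct = residue_percentage(50, 1000)
result, elapsed = timed(sorted, [3, 1, 2])
```

Here `time.perf_counter` is used because it is intended for measuring short durations; any anonymization function could be passed to `timed` in place of `sorted`.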

B. Comparing with (l,e) Diversity Algorithm
The performance of the proposed algorithm is compared with the existing (l, e) diversity algorithm. The (l, e) diversity algorithm chooses only one sensitive attribute, i.e., Education. It can be seen that choosing multiple sensitive attributes and then diversifying the records achieves better performance in terms of reducing the residue percentage. However, the time taken is higher, since multiple attributes are considered.

1) Residual percentage: The comparison is done for the number of records against the residue records and the value of k. It can be observed from Fig. 9 that the proposed algorithm, which chooses the attributes based on their ranks and then anonymizes, results in a reduction of residue records. Since (l, e) diversity uses generalization for anonymization, it leads to a larger number of residue records.
2) Computational time: As shown in Fig. 10, the time taken by the proposed algorithm is slightly higher than that of (l, e) diversity, because the proposed algorithm considers MSA whereas (l, e) diversity preserves the privacy of SSA.
3) Diversity percentage: The diversity percentage achieved by the proposed method with multiple sensitive attributes is much better than that of (l, e) diversity. This is mainly because the attributes are selected based on their semantics and every EQ has a diversified primary sensitive attribute. This is shown in Fig. 11.
Fig. 11. Diversity percentage in proposed algorithm vs (l,e) diversity.

C. Comparison of Incremental Diversity, SDSV and (l,e) Diversity
As discussed in the related work section, one recent algorithm that discusses the distribution of sensitive attributes is the SDSV algorithm. However, that algorithm does not consider the semantic similarity between the attribute values within an EQ. This leads to semantic diversity attacks and weaker diversity among the attribute values within an EQ. It can be seen from Fig. 12 that (l, e) diversity has the highest diversity among the attribute values within an EQ, since it considers a single sensitive attribute. The diversity percentage of the proposed algorithm is average, considering that it handles multiple sensitive attributes and their semantics.

D. Security Evaluations
As mentioned before, the privacy attacks considered in this work are semantic attacks, similarity attacks and association attacks, which are predominant in data sets with MSA. The proposed algorithm overcomes all these threats, since the semantics of the sensitive attributes are addressed. Consider Tables II, III, IV and V, which were generated by the SDSV algorithm. Because the algorithm did not consider the semantic relationship between the sensitive attributes, there were semantically similar attribute values for disease within an equivalence class. Also, the algorithm generates multiple tables, and their number increases as the number of sensitive attributes increases.
The proposed algorithm overcomes the semantic and similarity attacks. Consider Tables VI and VII, which are generated using the proposed algorithm. Every equivalence class has diversity among the sensitive attributes, which makes it difficult for the intruder to cause privacy threats. Even if the intruder knows one of the sensitive attributes and a quasi identifier, it is difficult to mount an association attack. For example, if the intruder is a neighbour of Trudy (from Table I

and results it can be concluded that the proposed algorithm is efficient in terms of reducing the residual records and computation time while achieving optimal diversity across multiple sensitive attributes. Also, as discussed, the proposed algorithm overcomes the privacy threats that exist for MSA.

VII. CONCLUSION AND FUTURE WORK
Concern about data privacy has grown with the growth of digital technology. Personal data containing multiple sensitive attributes (MSA) is collected at various places. These attributes must be handled carefully to prevent privacy threats when the data set is published to the outside world. Many algorithms have been proposed in the literature to preserve the privacy of MSA. In these algorithms, one of the attributes is chosen as the primary sensitive attribute and the microdata is anonymized accordingly. These algorithms do not discuss how to rank the sensitive attributes, which is an essential step in anonymizing the data. In this paper we discuss an efficient approach to rank the sensitive attributes and then anonymize the data. Experiments, along with the performance parameters, show that our algorithm outperforms the existing methods and can be used efficiently to anonymize the data. As part of future work, we plan to propose an infrastructure framework in which the tables can be published.