Evaluation of Re-identification Risk using Anonymization and Differential Privacy in Healthcare

At present, data privacy regulations make it difficult for healthcare organizations to share data with other organizations for research or other medical purposes. Preserving patient privacy is therefore a crucial challenge for healthcare centres. Numerous techniques are used to preserve privacy, such as perturbation, anonymization, and cryptography. Anonymization is a well-known practical solution to this problem, and researchers have proposed a number of anonymization methods. In this paper, an improved approach based on k-anonymity and differential privacy is proposed. Its purpose is to protect the dataset more effectively against re-identification through linking attacks, using generalization and suppression techniques.
Keywords: Data privacy; anonymization; differential privacy; re-identification risk analysis; privacy preserving data publishing


I. INTRODUCTION
Owing to advances in business intelligence, organizations such as banks, healthcare providers, and health insurers have become "data-driven" organizations. These organizations apply new mechanisms to analyze high volumes of data. It is the responsibility of the data controller to assure users of their privacy, and this must be done before the data are published to a third party. The original dataset offers no privacy protection. Privacy-Preserving Data Publishing (PPDP) offers numerous tools and mechanisms to preserve privacy [1] [2] [3] [4]. Datasets must be anonymized before they are published to other organizations because they may contain personal information. It is well known that personal information can be gathered from such records, and many parties assess the re-identification risk. The European Medicines Agency (EMA) recommends an anonymization approach for risk analysis based on qualitative and quantitative techniques [5].
The PPDP process consists of several phases: collecting data, storing the collected data, anonymizing it, publishing the modified data, and performing data mining, as shown in the conceptual scenario of PPDP in Fig. 1. Several parties are involved in this process: the record owner, data holder, data publisher, data recipient, and adversary. The record owner is the entity to which a record refers; the data holder is the person or organization that holds the data; the data publisher is responsible for publishing the anonymized data; the data recipient is any entity that has access to the published data; and the adversary is an entity whose objective is to gather users' information. Sensitive records may be leaked during the data publishing process. One possible solution to this problem is to modify the dataset, and PPDP offers many methods for doing so [6].
Data anonymization is the most commonly used way to achieve privacy protection in data publishing. Several methods have been proposed to handle the security issues related to datasets; in particular, anonymization and differential privacy are two techniques that have been implemented in practice. k-anonymity transforms datasets by generalization and suppression and is used to protect users' identities against linking attacks [7]. Differential privacy is also used to protect individuals' personal information. However, instead of k-anonymity's deterministic approach to indistinguishability, differential privacy provides stochastic indistinguishability by adding noise or perturbing values. Both k-anonymity and ε-differential privacy suffer from a number of drawbacks. In particular, because of the curse of dimensionality, adding extra quasi-identifiers to the k-anonymity framework results in greater information loss [8].
On the other hand, differential privacy has long been criticized for the large information loss it imposes on records. The technique proposed in this paper shows how to overcome these drawbacks by combining k-anonymity and ε-differential privacy while benefiting from the advantages of both. Each technique has its own limitations, which can be mitigated by combining them. The focus of this paper is on re-identification risk analysis.
The rest of the paper is organized as follows: Section II provides the literature survey related to anonymization and differential privacy. Section III elaborates the materials and methods used in the paper. Section IV describes the proposed work. Section V presents the experimental details of the proposed technique and the corresponding results. Section VI concludes the paper. www.ijacsa.thesai.org

II. RELATED WORK
Protecting sensitive data while extracting useful information from distributed data is a challenging task, and the privacy of the data must be preserved before publishing. A considerable body of work has been proposed and implemented in the field of privacy-preserving data publishing, including several methods for protecting sensitive data and various privacy-enhancing mechanisms [9]. Luc Rocher et al. [10] proposed an approach based on a generative copula method. This approach estimates more accurately the probability that an individual is correctly re-identified.
Boris Lubarsky [11] described a method that proved successful even on heavily incomplete shared datasets. Re-identification can occur because of insufficient anonymization of datasets or the combination of datasets. Pseudonym reversal may also be a cause of re-identification risk.
Branson et al. [12] presented a study that tested the re-identification problem, examining how information about a prescribed drug can lead to re-identification.
Suman et al. [13] introduced a novel technique based on anonymization. The performance of the proposed algorithm was measured in terms of information loss and accuracy. Across various experiments, the approach yielded minimum information loss and maximum accuracy.
Sumana and Hareesh [14] described various anonymization methods used in PPDM to protect the privacy of data. The main goal of anonymization is to secure access to personal information; it is also used to provide aggregated information.
Vibhor Sharma et al. [15] presented a new evolutionary privacy-preserving technique for data mining. Whenever data mining is applied to large datasets, a number of privacy threats are introduced. To protect individuals' sensitive data, the data should be masked before they are released for data mining.
Marques et al. [16] presented a comprehensive analysis of anonymization. A number of anonymization techniques can be applied to datasets to reduce re-identification risk. They discussed different tools such as ARX, µ-Argus, SDC Micro, and Privacy Analytics Eclipse.
Manoj Kumar Gupta et al. [17] examined various approaches such as generalization, k-anonymity, l-diversity, suppression, shuffling, and noise addition. l-diversity is based on the diversity of sensitive attributes within each group. According to the definition of l-diversity, every group of records sharing one combination of key attributes must contain at least a minimum number (l) of distinct values for each sensitive attribute. Only then is the dataset considered l-diverse.
P. Ram Mohan Rao et al. [18] introduced a novel approach named "Synthesize Quasi Identifiers and apply Differential Privacy" (SQIDP) for privacy preservation in data mining. This approach was applicable to text datasets with 100% data utility.
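The l-diversity condition summarized from [17] can be illustrated with a short check. This is a minimal sketch of the "distinct values" reading of l-diversity; the function name, the record layout, and the sample rows are hypothetical and are not taken from any of the surveyed papers.

```python
from collections import defaultdict


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if every group sharing the same quasi-identifier values
    contains at least l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())


# Hypothetical, already generalized records.
records = [
    {"age": "30-39", "sex": "M", "disease": "Flu"},
    {"age": "30-39", "sex": "M", "disease": "Cancer"},
    {"age": "40-49", "sex": "F", "disease": "Flu"},
    {"age": "40-49", "sex": "F", "disease": "Flu"},
]
```

Here the second group holds only one distinct disease, so the table is 1-diverse but not 2-diverse even though it is 2-anonymous.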

III. EXISTING METHODS AND TECHNIQUES
This section highlights the existing techniques and algorithms used in the proposed technique, i.e., anonymization and differential privacy. These techniques are used to preserve privacy before publishing.

A. Anonymization
Anonymization is a modification technique used to preserve privacy [19]. In data anonymization, sensitive information is either encrypted or removed from the dataset in order to preserve privacy. There are two methods of anonymization: generalization and suppression [20]. In generalization, individual attribute values are replaced with a broader category. Generalization can be applied to both categorical attributes and continuous numeric attributes. Suppression means removing attribute values; the suppressed values are replaced with an asterisk '*'. The various types of attributes are described below [21].
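Generalization and suppression as described above can be sketched in a few lines. This is an illustrative example only: the helper names, the interval width, the `zip` field, and the sample records are assumptions introduced here and do not come from the paper.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a fixed-width interval, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(value, keep=0):
    """Keep the first `keep` characters and replace the rest with '*'."""
    text = str(value)
    return text[:keep] + "*" * (len(text) - keep)


# Hypothetical patient records (not the paper's tables).
records = [
    {"age": 34, "zip": "13053", "disease": "Flu"},
    {"age": 37, "zip": "13068", "disease": "Cancer"},
]

anonymized = [
    {
        "age": generalize_age(r["age"]),
        "zip": suppress(r["zip"], keep=3),
        "disease": r["disease"],
    }
    for r in records
]
```

After the transformation both records share the quasi-identifier values ('30-39', '130**'), so they can no longer be told apart by those attributes.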
Although these types of information may seem harmless and may not pose any threat individually, attackers can link them to one another and then misuse or even alter the information. The original data therefore need to be hidden and secured, which in turn presents another challenge: information loss.
Nowadays, it is common for datasets to be made openly available for research purposes. To preserve the privacy of shared data, the data owner can apply different types of anonymization to the datasets; generalization, suppression, permutation, and perturbation are some examples. Furthermore, more than one approach can be applied to the same dataset, which has proved more beneficial for protecting data privacy [22]. It is therefore necessary to consider the concepts of de-identification and re-identification of data. For this purpose, a medical dataset containing the information of some patients is used, as depicted in Table I. Here the name attribute is the personal identification attribute and disease is the sensitive attribute. These attributes are removed in the anonymization process.

Quasi-identifier
Attributes that disclose private information when linked with other attributes are called quasi-identifier attributes. For example, age and sex, when linked to another database, can easily disclose a person's identity. These attributes are suppressed or generalized in order to preserve the privacy of an individual.

Sensitive
These attributes are crucial and should not be shared. For example, disease information and salary information should not be shared with any organization. They are mostly left unchanged for data analysis.

Non-Sensitive
These attributes can be published publicly because they do not create any privacy problem. Examples include weight, hair color, and height. They are not collected in most cases; if collected, they are shared as is.

Table II is used as an example for de-identification. De-identification is the process of altering a dataset so that it can be put to other uses while the identities of the individuals cannot be recognized. The de-identified version of Table II is shown in Table III, where the field "Name" has been deleted, so the names of patients no longer appear. However, anyone who has access to Aadhar Card data (as shown in Table IV) can very easily recover the information for all records by joining the two tables on their common attributes. These common attributes are known as quasi-identifiers. Using the data in Table III and Table IV, an attacker can easily learn that Bob is suffering from Flu Holdon. Removing the personal information alone therefore does not give the data complete privacy. Reversing de-identification by connecting the data back to the identity of the data subject is referred to as re-identification. In short, deleting the personal identification data from a relation does little to protect privacy [23]. To protect privacy, the personal identification data must first be removed, and the quasi-identifiers must also be anonymized.
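The linking attack described above amounts to a join on the quasi-identifiers. The sketch below illustrates the idea with invented rows; the function name, the second disease values, and every person other than Bob are hypothetical, and the rows are not the contents of Tables III and IV.

```python
from collections import defaultdict


def link_records(deidentified, public, quasi_identifiers):
    """Join two tables on the quasi-identifiers and return unique matches."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[q] for q in quasi_identifiers)].append(person)

    matches = []
    for row in deidentified:
        candidates = index.get(tuple(row[q] for q in quasi_identifiers), [])
        if len(candidates) == 1:  # exactly one candidate: identity is disclosed
            matches.append((candidates[0]["name"], row["disease"]))
    return matches


# Hypothetical de-identified medical table (names already removed).
deidentified = [
    {"age": 29, "sex": "M", "disease": "Flu"},
    {"age": 41, "sex": "F", "disease": "Diabetes"},
    {"age": 35, "sex": "M", "disease": "Asthma"},
]
# Hypothetical public identity table sharing the attributes age and sex.
public = [
    {"name": "Bob", "age": 29, "sex": "M"},
    {"name": "Alice", "age": 41, "sex": "F"},
    {"name": "Carl", "age": 35, "sex": "M"},
    {"name": "Dan", "age": 35, "sex": "M"},
]
reidentified = link_records(deidentified, public, ["age", "sex"])
```

The third medical record matches two people in the public table, so it is not uniquely re-identified. This is the effect that anonymizing the quasi-identifiers aims to achieve for every record.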

B. Differential Privacy
Differential privacy is also a widely used privacy preservation method. It allows analysts to obtain the answers they need from data repositories that contain sensitive information while securely protecting privacy [24] [25]. A randomized function R provides ε-differential privacy if, for all datasets DS1 and DS2 that differ in at most one data element [26] and for every set of outputs S ⊆ Range(R),
$$\Pr[R(DS_1) \in S] \le e^{\varepsilon} \cdot \Pr[R(DS_2) \in S]$$
Here ε bounds the statistical distance between the two output distributions and defines the strength of the privacy guarantee; a lower value of ε means stronger privacy [27]. The steps of the differential privacy approach are shown in Fig. 2(a), and Fig. 2(b) describes GDP (Global Differential Privacy) and LDP (Local Differential Privacy). In GDP, a trusted curator is employed who applies calibrated noise in order to produce DP (Differential Privacy) outputs. The curator has to devise practical algorithms or mechanisms, which can be unsuitable for deep learning: the algorithm resides on the server, and the original dataset has to be uploaded to the server for training. In LDP, by contrast, the data owners modify the data before publishing, so no trusted curator or third party is needed to preserve privacy. LDP provides a stronger privacy guarantee than GDP. It should be noted that the stored data values are not changed in DP; users cannot access the database directly and receive only perturbed answers. The inaccuracy introduced is sufficient to protect privacy yet small enough for the answers to remain helpful to analysts and researchers. Privacy and utility are not mutually exclusive [28].
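As an illustration of the calibrated noise a curator might add in the global setting, the following sketch applies the standard Laplace mechanism to a counting query. The paper does not state which mechanism it uses, so this choice is an assumption; the function names and sample data are likewise hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: 40 of 100 patients have the condition.
patients = [{"has_condition": i < 40} for i in range(100)]
rng = random.Random(0)
answer = private_count(patients, lambda r: r["has_condition"], epsilon=0.5, rng=rng)
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of a less accurate answer, which matches the role of ε described above.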

IV. PROPOSED TECHNIQUE
This paper presents an enhanced privacy-preserving approach based on anonymization and differential privacy techniques. It helps to hide information without drastically changing the records. The records are k-anonymized, meaning that at least k records share the same value in each quasi-identifier field. Generalization is used to anonymize the original dataset and is always applied to the quasi-identifier attributes [29]; suppression is used alongside it. The quasi-identifier values are transformed into intervals of equal size to keep the dataset uniform. The proposed approach aims to solve the privacy issues related to various attacks.
Generalization is the process through which data can be presented in clustered form. The basic objective of this technique is to collect the links into a cluster and then form a super vertex; every vertex provides the merged information of the super network. With this approach, identifying local data or information is very difficult. Various PPDM (Privacy Preserving Data Mining) techniques [30] are used to protect against re-identification risk, but anonymization is the most widely used.
This paper proposes a technique that combines k-anonymity and ε-differential privacy. The proposed method anonymizes the dataset using a k-anonymity algorithm with k=2 and k=5. The first step is to classify the features into sensitive, quasi-identifier, and identifier features. The quasi-identifiers are then partitioned into k-quasi attributes, which are processed to provide k-anonymity. In the next step, ε-differential privacy is applied to the k-quasi attributes. The motivation for using differential privacy is its stochastic indistinguishability. Once k-anonymity has been applied, an attacker can still uniquely recognize the equivalence class to which an individual's record belongs. Applying ε-differential privacy to the k-quasi attributes ensures that records cannot be re-identified.
The proposed method is shown as a flowchart in Fig. 3. It protects against re-identification risk between equivalence classes. In differential privacy, every equivalence class is treated as a single independent class of individuals' records. It is important to note that a differential privacy equivalence class is not a set of attributes. To protect against re-identification risk, the records are shuffled.

[Figure residue: flowchart steps "Addition of noise" and "Provide results to analysts".]
The proposed work is described in Algorithm 1.
Step 5. Apply the k-ADP (k-Anonymity Differential Privacy) technique to each equivalence class of the k-anonymized dataset.
Step 6. Merge the k-anonymized records and the ε-differential privacy records.
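One possible reading of the combined pipeline (generalize the quasi-identifiers, enforce k-anonymity, add noise per equivalence class, then shuffle) is sketched below. It is not the authors' Algorithm 1: the choice of Laplace-noised counts per class, all function names, and the sample records are assumptions made for illustration, and the sketch alone is not a formal proof of ε-differential privacy for the whole release.

```python
import random
from collections import Counter, defaultdict


def generalize_age(age, width=10):
    """Map an exact age to an equal-size interval such as '50-59'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymize(records, k):
    """Group records by generalized (age, sex); suppress classes smaller than k."""
    classes = defaultdict(list)
    for r in records:
        classes[(generalize_age(r["age"]), r["sex"])].append(r["result"])
    return {key: vals for key, vals in classes.items() if len(vals) >= k}


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_release(classes, epsilon, domain, rng):
    """Per equivalence class, publish Laplace-noised counts of each sensitive value."""
    released = []
    for key, vals in classes.items():
        counts = Counter(vals)
        noisy = {v: counts[v] + laplace(1.0 / epsilon, rng) for v in domain}
        released.append((key, noisy))
    rng.shuffle(released)  # output order no longer reflects input order
    return released


# Hypothetical records using the attribute names age, sex, and result.
records = [
    {"age": 52, "sex": 1, "result": 1},
    {"age": 57, "sex": 1, "result": 0},
    {"age": 54, "sex": 1, "result": 1},
    {"age": 61, "sex": 0, "result": 0},  # alone in its class, suppressed for k=2
]
classes = k_anonymize(records, k=2)
release = noisy_release(classes, epsilon=1.0, domain=(0, 1), rng=random.Random(7))
```

Counting over a fixed `domain` rather than only the observed values avoids revealing which sensitive values are absent from a class.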

V. EXPERIMENTS AND RESULTS
There are numerous tools and mechanisms for privacy preservation of datasets. In this paper, anonymization and differential privacy methods are used to protect against re-identification risk. The Heart dataset from the UCI Machine Learning Repository is selected for analysis. The dataset has 14 attributes and 2602 records. Of all the attributes, only the quasi-identifier and sensitive attributes are considered: the two attributes 'age' and 'sex' are treated as quasi-identifiers, and the class attribute 'result' is treated as the sensitive attribute. Users can apply anonymization methods to datasets directly using the ARX tool, which accepts files in .csv, .xls, and .xlsx format. Here, k-anonymity with k=2 and k=10, together with the generalization method, is selected to anonymize the dataset. Differential privacy is then applied to the anonymized dataset. The proposed technique is used to evaluate the re-identification risk. For this purpose, the relationship between k and ε is evaluated: the risk decreases as the value of k increases, and it also decreases as the value of ε decreases. Re-identification risk analysis is then carried out on three datasets: the original dataset, the anonymized dataset, and the enhanced anonymized dataset. Experimental results are presented in tabular and graphical form.
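The dependence of the risk on k can be made concrete under the prosecutor model, assuming that the risk of a record r is the reciprocal of the size of its equivalence class E(r). The symbols risk(r) and E(r) are introduced here for illustration only. Because k-anonymity guarantees that every class has at least k records,

$$\text{risk}(r) = \frac{1}{|E(r)|} \le \frac{1}{k}, \qquad k = 2 \Rightarrow \text{risk}(r) \le 50\%, \qquad k = 10 \Rightarrow \text{risk}(r) \le 10\%$$

so raising k lowers the worst-case risk, in line with the trend stated above.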

A. Effect on Re-identification Risk
Privacy risks can be analyzed using the ARX tool [31]. These risks relate to re-identification under the prosecutor, journalist, and marketer attacker models. The risk that can be derived from population uniqueness is also included. The impact of data anonymization on the re-identification risk profile of the Heart disease dataset is shown in Fig. 4 and Fig. 5. Fig. 4(a) highlights the re-identification risk of the original dataset at the prosecutor level. Here approximately 3.47% of the records are at risk. The highest risk calculated is 100%, meaning that at least one record in the original dataset can be re-identified with certainty. The success rate is 5.912% for the original dataset. At the journalist level, the highest risk is also 100% and the success rate is also 5.912%, the same as in the prosecutor scenario. Fig. 4(b) shows the re-identification risk of the anonymized dataset at the prosecutor level; the highest risk in this case is 5.08%. The effect of the proposed technique is displayed in Fig. 4(c). In this case the data are completely safe, i.e., the share of records at risk is 0% in all scenarios.
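The prosecutor-level quantities reported above (records at risk, highest risk, success rate) can be approximated from equivalence-class sizes. The sketch below assumes the usual definitions, where a record's risk is one over its class size and the success rate is the average risk; it is not ARX's implementation, and the function name, threshold, and sample records are hypothetical.

```python
from collections import Counter


def prosecutor_risk(records, quasi_identifiers, threshold=0.2):
    """Compute simple prosecutor-model risk measures from class sizes."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    risks = [1.0 / sizes[key] for key in keys]  # per-record risk
    return {
        "highest_risk": max(risks),
        "success_rate": sum(risks) / len(risks),  # average risk
        "records_at_risk": sum(1 for x in risks if x > threshold) / len(risks),
    }


# Hypothetical generalized records: one class of three and one unique record.
sample = [
    {"age": "30-39", "sex": "M"},
    {"age": "30-39", "sex": "M"},
    {"age": "30-39", "sex": "M"},
    {"age": "40-49", "sex": "F"},
]
summary = prosecutor_risk(sample, ["age", "sex"], threshold=0.5)
```

In this sample the unique record drives the highest risk to 100%, which is the same situation the original dataset shows at the prosecutor level.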

Comparison of Results
Re-identification Risk Analysis
A comparative study of the risk posed by various attackers to the original dataset, the anonymized dataset, and the enhanced anonymized dataset is given in Table V, which lists the risk estimated at the prosecutor, journalist, and marketer levels. The estimated journalist risk is highest in the original dataset (33.3%) and lowest in the enhanced anonymized dataset (0.11%). The estimated marketer risk is likewise lowest in the enhanced anonymized dataset and highest in the original dataset. Details of the various risks are also listed in Table V. The experiments show that the enhanced anonymized data are safer than both the original data and the anonymized data, as shown in Fig. 4. The re-identification risk of the original dataset and the anonymized dataset is described in Fig. 5 and Fig. 6, respectively.
According to Table V, the highest prosecutor risk is greatest in the original dataset (100%) and much lower in the enhanced anonymized dataset (0.11%). The estimated journalist risk is highest in the original dataset (33.30%) and lowest in the enhanced anonymized dataset (0.11%). The estimated marketer risk is highest in the original dataset (7.12%) and very low in the enhanced anonymized dataset (0.09%). The re-identification risk estimated by the various approaches, relative to the number of records, is shown in the following figures.
These figures display the distribution of re-identification risk among the dataset's records, calculated for both the input dataset and the output dataset. Fig. 6(a) highlights the maximum risk, the records at risk, and the risk threshold, in percent, for prosecutor re-identification on the original data. Fig. 6(b) depicts the same three quantities for the anonymized dataset. Fig. 6(c) shows that when anonymization with differential privacy is applied to the original dataset, all three estimates approach zero, so the proposed method is highly effective at minimizing the re-identification risk.

VI. CONCLUSION AND FUTURE WORK
In the era of data sharing, privacy protection has become an important matter for many organizations, and in the healthcare industry it directly concerns patients. This paper proposed an enhanced anonymization approach to preserve the privacy of patients' data. The proposed technique was implemented on a dataset related to heart disease, using anonymization (k-anonymity) and differential privacy to protect the dataset. The experimental results show that the enhanced anonymized dataset achieves greater security: its re-identification risk is much lower than that of the original dataset. In future work, different classification algorithms will be applied to the anonymized dataset to measure accuracy, execution time, kappa statistic, etc.