Improving the Performance of Various Privacy Preserving Databases using Hybrid Geometric Data Perturbation Classification Model

As privacy preserving databases grow, it becomes difficult to improve both their privacy and their accuracy because of high dimensionality and runtime cost. Most traditional privacy preserving models do not jointly account for privacy and runtime. It is also essential to preserve the privacy of the many sensitive attributes before publishing them to third-party servers. A framework is therefore required that improves privacy as well as accuracy on high dimensional privacy preserving data with less runtime. In this work, a hybrid perturbation based privacy preserving classification model is proposed for multiple databases. It combines a new data transformation approach, a hybrid geometric perturbation approach and a hybrid boosting classifier to enhance the overall efficiency of the model on privacy preserving databases. First, a pre-processing method is applied to the input dataset to remove noise in the feature values. The hybrid geometric perturbation approach is then used to strengthen privacy preservation on the sensitive attributes. Finally, a hybrid machine learning classifier predicts the privacy preserving class label from the training data. Experimental results show that the proposed hybrid geometric perturbation based boosting classifier achieves better accuracy, recall, precision and runtime than the conventional models.

Keywords: Privacy preserving databases; machine learning; perturbation; high dimensionality; data filtering; data classification


I. INTRODUCTION
Data mining focuses on discovering patterns that are unknown or hidden. It includes building data models, providing human-comprehensible statistical summaries of data, and deciding strategies based on the mined information [1]. Recently, researchers have paid much attention to integrating utility constraints into data mining tasks, and utility mining is now used in many practical applications. A sensitive pattern is a recurring pattern that carries sensitive information. The datasets used for data mining are either centralized or distributed. In the centralized case, the data are stored in one physical location, which raises questions of data access and ownership. In the distributed case, the data are shared by two or more parties who do not trust one another with their private information but are interested in mining their common data. The dataset is heterogeneous, i.e. vertically partitioned, if each party holds the same set of records with different subsets of attributes. Centralized data is usually more complete than any one portion of the distributed data, since it contains complete records and attributes for collection and mining. Data mining tools and techniques are widely used in real-time applications, telecommunication networks, internet traffic flows, online banking and financial transactions, retail markets, manufacturing process data, sensor-based data flows, satellite data, research laboratory data, electrical grids, engineering data and other dynamic environments. Data streams are enormous in volume and possibly infinite. They need to be analysed to recognize trends and patterns, which helps in isolating anomalies and predicting future behaviour. However, for several reasons, most notably privacy considerations, data owners or originators may not be willing to disclose the true values of their data.
A certain amount of privacy preservation must therefore be applied to the data before it can be made widely accessible. Understanding the data is important and must be combined with appropriate privacy preserving algorithms. Various approaches such as data perturbation, k-anonymity, association rule mining, masking and encryption have been suggested for this purpose. Existing techniques cannot be applied directly to data streams. In addition, data mining applications require robust guarantees on the maximum permitted delay between incoming data and its anonymized output, with minimum data loss and maximum privacy gain. Another approach to privacy preservation is anonymization, which ensures that the record of any individual in a dataset cannot be distinguished from a group of similar individuals. The availability of raw data is the most significant consideration in data mining privacy. The data miner should be able to obtain detailed statistical information about the data without accessing all sensitive information in its original form. This calls for more rigorous data mining techniques that intentionally modify the data in order to mask sensitive information while preserving the statistics needed for mining. A recent trend is for corporations to exchange data and mining findings to help each other. However, such exchange also increases the risk of disclosing sensitive information. Sanitization is the process that hides the sensitive items in the source database by appropriate modification and releases the updated database [2]. In that work, an efficient algorithm was presented to hide high-value items from mining, with an extension to weighted utilities. Most privacy preserving data mining techniques transform the original data, which tends to decrease the performance of the mining algorithms.
There is also a general trade-off between privacy and accuracy, and its extent depends on the particular algorithm used for protecting privacy. Deep learning is a multi-layered data processing network that uses multiple levels of abstraction to train on the data for pattern analysis [3]. This network uses a non-linear transformation to transform and learn the data at each level. Recently, a large number of composite functions have been used in the deep learning framework for pattern analysis.
With data partitioning, there are two scenarios that require cluster analysis to be performed in a distributed way. In the first, the volume of data to be analysed is very large, so the computational effort is huge and sometimes infeasible. In such a case, a better alternative is to split the data, cluster it in a distributed manner and finally merge the distributed results. In a centralized database, data is located and maintained at a single place, whereas in a distributed database, data may be partitioned vertically or horizontally across various sources. One issue with a centralized database is that, because all the data resides at one central location [4], bottlenecks can occur at the key points where data is released or assimilated.
Anonymity literally means "nameless": the information cannot be linked to the identity of the person it describes. Data anonymization is the process of removing personal information from a dataset to protect the privacy of individuals. It allows data users and holders to release data safely for analysis, decision making, testing and other purposes, so that the people whose information is in the dataset remain anonymous. Even if the explicit identifiers are removed, the availability of background information about individuals (e.g. in the public voter list) makes it easier for an adversary to re-identify individuals by linking it with the released data, which makes it very hard to publish data without disclosing private information [5]. Once the data is released to a third party, it is hard for the owners to control how the data is manipulated.
K-anonymity protects privacy against the identification of records; however, it is generally not successful in protecting against inference attacks on the sensitive attributes. K-anonymity characterizes the degree of protection against linking released data to individuals. For example, a politician who intends to be elected to a post in the governance of a state may use the medical history of an opponent to demonstrate to the populace that the opponent is not able to carry out the obligations of the office because of medical problems. In such a scenario, l-diversity [6] fails to prevent attribute disclosure when the distribution in the real population differs from that in the dataset. K-anonymity is designed for a single dataset in which each row represents a different person. In the case of a relational database, k-anonymity might distort the data too much or leak private information. L-diversity was proposed to prevent attribute linkage attacks. L-diversity demands at least l distinct sensitive attribute values in each quasi-identifier (QID) class [7]. This provision also satisfies the k-anonymity criterion with k = l. L-diversity differs from k-anonymity in that k-anonymity demands that a group contain at least k individuals with the same QID, whereas l-diversity demands that a group contain at least l distinct values of the sensitive attribute.
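The two criteria above can be checked mechanically. The following is a minimal sketch, assuming records are plain dictionaries; the function names, attribute names and toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def group_by_qid(records, qid):
    """Group records into equivalence classes keyed by their QID values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qid)].append(rec)
    return groups

def is_k_anonymous(records, qid, k):
    """True if every QID class contains at least k records."""
    return all(len(g) >= k for g in group_by_qid(records, qid).values())

def is_l_diverse(records, qid, sensitive, l):
    """True if every QID class has at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in g}) >= l
               for g in group_by_qid(records, qid).values())

# Hypothetical, already generalized toy records.
records = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "cancer"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]
qid = ["age", "zip"]
print(is_k_anonymous(records, qid, 2))           # every class has 2 records
print(is_l_diverse(records, qid, "disease", 2))  # second class has only "flu"
```

The toy table is 2-anonymous but not 2-diverse: an adversary who knows that a person falls in the second class learns the disease without identifying the exact record.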
L-diversity does not offer sufficient protection against probabilistic attacks, because some sensitive values appear more often than others [8]. In a probabilistic attack, a sensitive value is inferred because it appears more frequently than the other sensitive values in a group, so the attacker can infer that the victim is likely to have that value as well. A sensitive attribute that can be isolated in this way is not anonymous. The underlying principle here is isolation: a record is private if it cannot be isolated from its neighbours. An opponent takes advantage of isolated records to discover the identity behind the data, and this is how the privacy of an anonymized server is breached. The attacker targets a server when the entire data is accessed as a single large entity. If the selected data are removed from the server, the opponent cannot detect the missing data and must change the attack strategy. Re-identification of individual records through quasi-identifiers is one of the major types of privacy breach, and anonymization addresses this type of attack. The idea behind k-anonymity is to suppress or generalize selected publicly available data so that each record is indistinguishable from at least k-1 other records. Sensitive data can therefore only be linked to groups of at least k records. A quasi-identifier is a minimal set of attributes whose values can identify individuals in combination with other datasets. K-anonymity is intended to protect the privacy of individuals without altering the attribute values. Traditional k-anonymity is designed primarily for static datasets and cannot be applied directly to census data. The k-anonymity approach is the most widely used in PPDM for maintaining confidentiality. The authors of [9] proposed a k-anonymity approach that splits the original dataset into projections, so that each one satisfies k-anonymity. A classifier was trained on each projection, and an unknown instance was then classified by combining all classifiers.
Perturbation is known for its long history, simplicity and effectiveness. It works by replacing the original data with synthetic data that has similar statistical properties. An attacker cannot gain sensitive information from the perturbed data because it does not correspond to the original data. The downside of perturbation is that the data is meaningless to humans and is only useful for computing statistical properties such as the minimum, maximum and mean. Additive noise is a perturbation method that adds a random value to each original value so that the statistical properties of the perturbed table do not differ too much from those of the original. The downside of additive noise is that it does not always offer sufficient protection to the sensitive attribute. For example, when there is high correlation between the QID and the sensitive attribute and the noise is low, the original value of the sensitive attribute can be recovered from the perturbed data [10]. A perturbation function applies a minor or major alteration to the problem-solving scenario in order to obtain the expected result mathematically. Perturbation functions originate in mathematical problems dealing with primal and dual formulations: the function alters the problem at its starting point and is generally used to modify the constraints in order to obtain the desired solution. This contrasts with previously proposed data mining strategies based on additive random perturbation, which have been shown to permit a significant breach of privacy. The possibilities of feature filtering techniques on various data types, such as discrete data, and under noise are also discussed. Such data are commonly available as numerical or categorical data; numerical data take measurable values, whereas categorical data take values from an enumerable set.
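The additive-noise idea described above can be illustrated with a short sketch. It assumes zero-mean Gaussian noise, which is only one possible choice; the function name, the noise scale and the toy data are hypothetical.

```python
import random
import statistics

def additive_noise(values, sigma, seed=None):
    """Return a perturbed copy of values with zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

# Hypothetical numeric attribute (e.g. ages) with 10,000 values.
data_rng = random.Random(0)
ages = [data_rng.randint(18, 90) for _ in range(10_000)]
perturbed = additive_noise(ages, sigma=5.0, seed=1)

# Individual values change, but the mean stays close to the original.
print(statistics.mean(ages), statistics.mean(perturbed))
```

Because the noise has zero mean, aggregate statistics such as the mean are approximately preserved, while any single perturbed value no longer equals the original. A smaller `sigma` keeps the data more useful but makes recovery of the original values easier, which is the weakness noted above.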
As the data in a database typically consists of structured objects such as tables and instances, the whole table or instance need not be perturbed as a unit. The analyst or miner is aware of the table or instance, but the information within it is held privately by the organization. Sections or structural elements of the object are therefore chosen for randomization. In a database, each user typically works with a table consisting of multiple attributes, from which the user picks the set of attributes appropriate for the query operations.
The increasing amount of personal data collected and processed by companies also increases the complexity of the information systems that protect it. The Privacy Preserving Data Mining (PPDM) problem focuses mainly on two aspects. The first is maintaining the confidentiality of the database based on the trust levels of the analysts and the key attributes of their data mining queries. The second is determining the sensitivity level of the information disseminated from the database in response to the analysts' queries. As noted above, a centralized database can suffer bottlenecks at the key points where data is released or assimilated. As a result, data retrieval is less efficient than in a distributed database system.
The rest of the paper is organized as follows. Section II describes the related work on privacy preserving models and its limitations. Section III describes the proposed privacy preserving machine learning framework for high dimensional data. Section IV presents the experimental results and analysis. Finally, Section V concludes the paper.

II. RELATED WORKS
Privacy Preserving Data Mining (PPDM) is a data-protection research field concerned with personally identifiable information in the design of data mining information systems. Numerous efforts have therefore been made to integrate data protection techniques with data mining algorithms. Current privacy preserving techniques for data mining can be viewed along four dimensions: (i) data distribution (centralized or distributed); (ii) the modification used to sanitize the data (encryption, perturbation, generalization, etc.); (iii) data mining algorithms optimized for the privacy protection techniques; and (iv) data mining techniques. This study incorporates noise generation techniques that reflect the sensitivity of the attributes, together with perturbation techniques. Data analysis is usually a practical, multi-stage business process in which people use standardized methods to identify and analyse suitable problems, find approaches and techniques for implementation, and achieve measurable results. In general, private data for data mining is represented as tuples that contain several attributes. Each private record is scanned and transformed into normalized continuous data. The main issues with privacy datasets are high dimensionality and class imbalance. Traditional machine learning classifiers consider a subset of features for classification and privacy prediction, with high true negative and error rates. Attribute selection computes a measure for each feature and ranks the features accordingly. These ranking methods select the top k features with the highest ranks and eliminate those with lower ranks.
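The rank-and-select step described above can be sketched as follows. The scoring measure used here, the absolute Pearson correlation between a feature and a numeric class label, is an assumption chosen for illustration; the text does not fix a particular measure, and the feature names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def top_k_features(columns, labels, k):
    """Score each feature, rank by score (descending) and keep the top k."""
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical toy dataset: four features, binary class label.
labels = [0, 0, 1, 1]
columns = {
    "trend":    [1, 2, 3, 4],   # strongly related to the label
    "constant": [5, 5, 5, 5],   # carries no information
    "exact":    [0, 0, 1, 1],   # identical to the label
    "noise":    [1, 0, 1, 0],   # uncorrelated with the label
}
print(top_k_features(columns, labels, 2))
```

Any other per-feature measure (information gain, chi-square, and so on) could replace `pearson` without changing the ranking logic.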
The Privacy Preserving Data Mining (PPDM) problem in traditional work concentrates on two important aspects. The first is assuring the privacy of the database based on the trust levels of the analysts and with respect to the key attributes of their data mining queries. The second is assessing the sensitivity level of the information that is disseminated from the database in response to the analysts' queries. The issue of utility-based privacy preserving data mining was reviewed in [11]. In [12], a suppression-based technique for data anonymization was proposed. Its top-down disclosure does not require a taxonomy tree. The process begins with a set of suppressed records and identifies the best candidate value to disclose that still satisfies the privacy constraint. Multidimensional k-anonymity is a global recoding technique over multidimensional QIDs. In order to determine the optimum generalization, the discernibility metric and the equivalence class size metric are considered. Multidimensional partitioning is compared with single-dimensional partitioning in terms of the generalization error rate. The principle of t-closeness is that the distribution of sensitive values in each equivalence class should be as close as possible to the distribution of sensitive values in the original dataset. The support vector machine (SVM) is an optimization technique for solving a variety of problems such as classification, learning and outlier detection. The basic SVM solves two-class problems, in which the data are separated by a hyperplane defined by support vectors. If a linear SVM fails to separate the two classes, the problem is solved using a kernel function. Various kernel functions can be used in the SVM model, such as linear, polynomial, Gaussian, regression, etc., to preserve privacy on the training dataset [13]. The author in [14] studied the utility-based PPDM problem on large datasets.
The idea was to address the curse of dimensionality by distributing disjoint matrices covering the useful attributes (utility), but preserving privacy in this setting remains challenging. Xu et al. proposed a local utility-based data mining method. The method is based on the fact that different attributes have different utility from an application point of view. In local data partitioning, the data space is separated into many areas, and the mapping of instances to generalized values is local to each area. An alternative use of utility-based PPDM is to anonymize the data so that it remains beneficial to specific types of knowledge discovery processes. This form of approach is frequently modelled with the k-anonymity framework and its derivatives: l-diversity, t-closeness, etc. Another popular privacy model is ε-differential privacy, which ensures that the addition or removal of one record from a dataset changes any published information by at most a factor governed by ε [15]. This ensures that the presence or absence of a particular individual in the dataset has a limited impact on the information released, thus protecting the privacy of each individual.
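The differential privacy guarantee mentioned above is commonly achieved with the Laplace mechanism. The sketch below applies it to a counting query, whose sensitivity is 1 because adding or removing one record changes the count by at most 1. It is only an illustration of that standard mechanism and is not part of the proposed model; the function names and the toy records are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical records: (age, subscribed) pairs.
records = [(25, True), (41, False), (37, True), (58, True), (33, False)]
rng = random.Random(42)
noisy = private_count(records, lambda r: r[1], epsilon=1.0, rng=rng)
print(noisy)  # true count is 3; the released value is 3 plus Laplace noise
```

A smaller `epsilon` means a larger noise scale and therefore stronger privacy at the cost of accuracy, which mirrors the privacy and accuracy trade-off discussed in the introduction.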
Some of the traditional approaches, including k-anonymity, l-diversity, t-closeness and Incognito, provide solutions to the problem of disclosure. K-anonymity was introduced as a solution to the linking attack and is considered a standard approach to that problem. Anonymization-based research on protecting individual privacy has been popular for the past decade. In [16], a survey of U.S. census summary data was conducted to assess the privacy risk of individuals.

III. PROPOSED GEOMETRIC PERTURBATION BASED PRIVACY PRESERVING CLASSIFIER
In the proposed approach, an advanced privacy preserving classification model is designed and implemented on various datasets. Initially, each input dataset is pre-processed using the novel data filtering method. This transformation method is used to transform the numerical and nominal values and to fill the sparse values in large datasets. After the data pre-processing step, a hybrid geometric perturbation method is applied to the filtered data to improve the classification rate. Finally, a novel boosting classification model is applied to the perturbed data for privacy preserving, as shown in Fig. 1.
In this work, a hybrid data filtering method is designed and implemented on each PPDM input dataset. In the proposed data filtering method, each numerical attribute is normalized using the hybrid data transformation equation.

If (PD_A[j] is nominal && PD_A[j] is not null) Then
    Replace PDV_Aj[k] using eq. (2)

In Algorithm 1, each attribute of the input privacy preserving dataset is taken as input and its values are transformed using eqs. (1) and (2). Each attribute is first tested for numerical or nominal type. If the attribute is numerical and not empty, each of its values is transformed to a new value using eq. (1). Similarly, if the attribute is nominal, each value is estimated using the maximization and minimization of its class probabilities.
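The control flow of the filtering step can be sketched as follows. Eqs. (1) and (2) are not reproduced in this text, so the sketch substitutes two stand-ins that are assumptions and not the authors' equations: min-max normalization for numerical attributes, and the difference between the largest and smallest class probability of each category for nominal attributes. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def min_max(values):
    """Assumed stand-in for eq. (1): rescale non-null values to [0, 1]."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    return [None if v is None else (0.0 if span == 0 else (v - lo) / span)
            for v in values]

def class_probability_spread(values, labels):
    """Assumed stand-in for eq. (2): max minus min class probability."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        if v is not None:
            counts[v][c] += 1
    classes = set(labels)
    spread = {}
    for v, per_class in counts.items():
        total = sum(per_class.values())
        probs = [per_class[c] / total for c in classes]
        spread[v] = max(probs) - min(probs)
    return [None if v is None else spread[v] for v in values]

def filter_dataset(columns, labels):
    """Test each attribute's type and apply the matching transformation."""
    out = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if present and all(is_number(v) for v in present):
            out[name] = min_max(values)
        else:
            out[name] = class_probability_spread(values, labels)
    return out

# Hypothetical toy input with one numerical and one nominal attribute.
columns = {"age": [20, 30, 40, None], "job": ["a", "a", "b", None]}
labels = ["y", "n", "y", "n"]
print(filter_dataset(columns, labels))
```

Null values are left untouched here, matching the "is not null" test in the algorithm fragment; how the original method fills sparse values is not specified in this text.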
In the geometric homomorphic perturbation, two keys are generated for each communicating party for the data sharing and data reconstruction processes. The two keys, a public key and a private key, are generated using non-linear cyclic group elements.
Choose two cyclic group elements with prime orders k1, k2.
Step 3: Geometric attribute perturbation is given as n n r mod(s). mod(s).mod(s)
Step 4: Geometric data reconstruction process is given as

Algorithm 3: Boosting Privacy Preserving Classification model
In this algorithm, a hybrid privacy preserving classification model is designed and implemented on the input datasets. The algorithm is used to check the performance of the privacy preserving model on both the geometrically perturbed data and the original data. Multiple base classifiers are integrated to improve the voting rate of the overall classification model. The proposed classification model uses a novel random tree and a multi-class SVM with a non-linear kernel function. In the boosting classification model, KNN, the random tree and the non-linear kernel based SVM are combined to improve the overall accuracy on the perturbed data.
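The way several base classifiers are combined by voting can be sketched as below. The three base learners are deliberately simplified stand-ins: a 1-nearest-neighbour rule for KNN, a single threshold rule in place of the random tree, and a fixed linear rule in place of the kernel SVM. None of them is the authors' implementation, and all names and data are hypothetical.

```python
from collections import Counter

# Hypothetical labelled training points (two numeric features).
train = [((1.0, 1.0), "low"), ((2.0, 1.5), "low"),
         ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]

def knn_1(x):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    def dist2(p):
        return (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2
    return min(train, key=lambda item: dist2(item[0]))[1]

def stump(x):
    """Stand-in for the random tree: one threshold on the first feature."""
    return "high" if x[0] > 5.0 else "low"

def linear_rule(x):
    """Stand-in for the SVM: a fixed linear decision boundary."""
    return "high" if x[0] + x[1] > 10.0 else "low"

def majority_vote(classifiers, x):
    """Return the label predicted by most of the base classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

ensemble = [knn_1, stump, linear_rule]
print(majority_vote(ensemble, (7.0, 2.0)))  # stump disagrees, others say low
print(majority_vote(ensemble, (8.5, 9.0)))
```

With an odd number of base classifiers and two classes, the vote cannot tie; for more classes a tie-breaking rule (for example, weighting votes by each classifier's validation accuracy) would be needed.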

Non-linear SVM
Apply the multi-class SVM optimization model as
In the above multi-objective function, a new kernel function is defined to improve the performance of the privacy preserving classification model. Here, the kernel function ker(x, y) maps the v input values to an m-dimensional space as:
For each pattern in the decision tree construction, the rule type is considered as either the left side or the right side of the pattern for privacy preserving.

IV. EXPERIMENTAL RESULTS
Experiments are carried out in a Java environment on multiple privacy preserving datasets. The proposed privacy preserving model is simulated on both the original datasets and the transformed datasets. Statistical measures such as accuracy, recall, precision and runtime are computed on the different datasets and are used to check the performance of the privacy preserving model on the perturbed data. All the sensitive features are perturbed in order to preserve privacy in the machine learning decision patterns. The results are compared against other privacy preserving models: geometric perturbation, rotational perturbation and PABIDOT.
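As background for the rotational perturbation baseline named above, the following sketch rotates two-dimensional records by a secret angle. It illustrates the general idea of rotation-based perturbation only (attribute values change while pairwise distances are preserved) and makes no claim about the exact baseline configuration used in the experiments; the angle and the data are hypothetical.

```python
import math

def rotate(points, theta):
    """Rotate 2-D points about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical records with two numeric attributes.
points = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
theta = math.radians(37)          # the secret rotation angle
rotated = rotate(points, theta)

for i in range(len(points)):
    for j in range(i + 1, len(points)):
        # Distances before and after rotation agree up to rounding error.
        print(dist(points[i], points[j]), dist(rotated[i], rotated[j]))
```

Because distances are preserved, distance-based classifiers such as KNN behave the same on the rotated data as on the original, which is why rotation is attractive for privacy preserving classification.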
Among the attributes of the bank marketing dataset, the 'client subscribed to term deposit' attribute is the sensitive attribute. There are no identifier attributes to be removed from the dataset. The attributes age, job, marital status and education are considered quasi-identifiers: age is a numerical quasi-identifier, and job, marital status and education are categorical quasi-identifiers. Various utility measures are used to assess the usefulness of generalized data, including loss metrics, ambiguity metrics, the discernibility metric, KL-divergence and entropy-based information loss. In this work, the traditional PABIDOT model and other perturbation models are compared with the proposed model on the input training data. These traditional models have problems with high dimensionality and sparsity.
Discernibility metric (DM): Each tuple in the database is assigned a penalty based on the number of other tuples that cannot be distinguished from it. For a database of size n, DM assigns a penalty of n to each deleted (suppressed) tuple. For tuples that are not suppressed, the penalty is the total number of tuples with the same quasi-identifier values. Thus, if the tuples are grouped by quasi-identifier, DM is the sum of the squared group sizes plus n times the number of deleted tuples.
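The discernibility metric defined above reduces to a few lines of code. This is a minimal sketch, assuming records are dictionaries and that n counts both released and suppressed tuples; the function name and the toy data are hypothetical.

```python
from collections import Counter

def discernibility_metric(released, qid, num_suppressed=0):
    """Sum of squared QID group sizes plus n for every suppressed tuple."""
    n = len(released) + num_suppressed  # size of the original database
    group_sizes = Counter(tuple(r[a] for a in qid) for r in released)
    return sum(size * size for size in group_sizes.values()) + n * num_suppressed

# Hypothetical released table: one group of 2 and one group of 3 tuples,
# with 1 further tuple suppressed, so n = 6.
released = (
    [{"age": "30-39", "job": "admin"}] * 2
    + [{"age": "40-49", "job": "technician"}] * 3
)
print(discernibility_metric(released, ["age", "job"], num_suppressed=1))  # 19
```

Here the score is 2² + 3² + 6 × 1 = 19. Lower values indicate smaller groups and fewer suppressed tuples, that is, less information loss.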
Ambiguity metric (AM): This metric is well suited to the k-anonymity framework. For every tuple t that is generalized to a tuple t* in the sanitized data, AM counts the number of tuples in the data domain that could have been generalized to t*. This count is the ambiguity of t*. The AM of the sanitized data is the average ambiguity of all tuples in the sanitized dataset.
KL-divergence: To use KL-divergence, the original table is treated as a probability distribution $p_1$, where $p_1(t)$ is the fraction of tuples equal to $t$. The sanitized data is likewise converted to a distribution $p_2$ (possible ways to do this will be discussed). The KL-divergence between the two is $\sum_t p_1(t) \log \frac{p_1(t)}{p_2(t)}$.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 10, 2020

Table I compares the proposed hybrid perturbation-based privacy preserving model with the traditional models on different training datasets, using the average F-measure. As shown in the table, the proposed geometric perturbation based boosting classifier has a better F-measure than the traditional models.
Fig. 2 compares the same models using the average recall on the training datasets. As shown in the figure, the proposed geometric perturbation based boosting classifier has better recall than the traditional models.
Fig. 3 compares the models using the average precision on the training datasets. As shown in the figure, the proposed classifier has better precision than the traditional models.
Fig. 4 compares the models using the average accuracy on the training datasets. As shown in the figure, the proposed classifier has better accuracy than the traditional models.
Fig. 5 compares the models using the average error rate on the training datasets. As shown in the figure, the proposed classifier has a lower error rate than the traditional models.
Table II compares the models using the average runtime (ms) on the training datasets. As shown in the table, the proposed geometric perturbation based boosting classifier has a lower runtime than the traditional models.
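For reference, the measures reported in the tables and figures above can be computed from predicted and true labels as in the sketch below. It uses the standard binary-classification definitions; the function name and the label vectors are hypothetical and do not reproduce any reported result.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, error rate, precision, recall and F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "error_rate": 1.0 - accuracy,
            "precision": precision, "recall": recall, "f_measure": f_measure}

# Hypothetical labels: 3 true positives, 1 false negative,
# 1 false positive and 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

In this toy case every measure except the error rate equals 0.75, and the error rate is 0.25. For multi-class datasets the same function can be applied once per class and the results averaged.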

A. Results Analysis
A new privacy preserving data mining method is proposed, applied to various datasets, and its results observed. The proposed method retains classification accuracy while balancing data utility. Traditional approaches are limited to a fixed set of sensitive attributes for privacy preserving and are not appropriate for large data sizes. In the experiments, the results on the perturbed and anonymized bank data improved by nearly 2% over the original data and by roughly 1% over the perturbed bank data. The experimental results suggest that the proposed geometric perturbation model handles high dimensionality and large data sizes more efficiently than the conventional models.

V. CONCLUSION
In this work, a novel filter-based privacy preserving model is designed and implemented on different datasets. Since most conventional privacy preserving models depend on the data size and the number of features, it is difficult for them to protect a large number of attributes within acceptable computation time and accuracy. It is also essential to preserve the privacy of the many sensitive attributes before publishing them to third-party servers. A framework is therefore required that improves privacy as well as accuracy on high dimensional privacy preserving data with less runtime. The filter-based hybrid privacy preserving model proposed here is designed and implemented on different complex datasets in order to optimize both the privacy preserving accuracy and the runtime. Experimental results show that the proposed model is more efficient on datasets from different domains than the conventional models. In future work, this approach can be extended to a cryptography-based perturbation method for big datasets in order to minimize the error rate and to improve the privacy preserving policies.