Swapping-based Data Sanitization Method for Hiding Sensitive Frequent Itemset in Transaction Database

Abstract: Sharing a transaction database with other parties to explore valuable information is increasingly common among business institutions such as retailers and supermarkets. It offers various benefits for the institutions, such as revealing customer shopping behavior and frequently bought items, known as frequent itemsets. Because of the importance of such information, some institutions may consider certain frequent itemsets sensitive information that should be kept private. Therefore, prior to handing over a database, the institutions should apply privacy preserving data mining (PPDM) techniques to prevent sensitive information breaches. Presently, several PPDM methods, such as item suppression-based methods and item insertion-based methods, have been developed. Unfortunately, these methods cause significant changes to the database and induce side effects such as hiding failure, significant data dissimilarity, misses cost, and artificial frequent itemset occurrence. In this paper, we propose a swapping-based data sanitization method that hides the sensitive frequent itemsets while minimizing the side effects of the data sanitization process. Experimental results indicate that the proposed method outperforms existing methods in terms of minimizing the side effects.


I. INTRODUCTION
Retailers and supermarkets actively collect their customers' transaction data. The collected data is stored in a database, referred to as a transaction database. A transaction database D contains a set of transactions, such as those in Table I. In general, a set of transaction records T is a nonempty set where T = {t_1, t_2, t_3, ..., t_x}. Each transaction t is composed of a transaction id Tid, a customer name or id number Cname, and a set of items bought by the customer, IID. The transaction database provides various benefits for business institutions when they perform data analysis, for example with data mining technology. Unfortunately, analyzing such a transaction database with data mining techniques is not a trivial task for these institutions, since many of them do not have sufficient resources, i.e., computation resources and human resources, to perform the data mining task. Therefore, they opt to hand over the transaction database to other parties, for example a data mining company, to conduct the task. Even though this may solve the problem, sharing the transaction database brings a hidden threat, since sensitive information might reside in the database.
One of the data mining tasks widely employed in various domains is frequent itemset mining [1]. Frequent itemset mining is very useful for finding frequently bought items as well as for analyzing customer buying patterns in transaction databases. Moreover, understanding such information allows companies to enhance their marketing strategy as a way to increase business revenue. Referring to Table I as an illustration, suppose a company decides that the itemset {1, 3} carries valuable information that should not be learned by others. The table shows that the items with iid = 1 and iid = 3 frequently appear together in several transactions, such as t_1, t_5, t_7, and t_10. Due to the importance of this information, the company does not want any other party to explore this itemset. Concealing sensitive information is mandatory prior to sharing databases [2]. Therefore, the database owner should apply a data sanitization method that enables database sharing while preventing sensitive frequent itemsets from being disclosed by external parties during the data mining process.
Recently, various data sanitization methods have been proposed with different settings and assumptions. Most of them rely on item suppression-based and item insertion-based strategies to address the aforementioned problem. However, the methods that follow the suppression-based strategy [3], [4] incur significant side effects such as hiding failure, significant data dissimilarity, misses cost, and artificial frequent itemset occurrence. Accordingly, the data utility of the sanitized database degrades drastically, leading to inaccurate information for data recipients. The term hiding failure represents the percentage of sensitive frequent itemsets that fail to be hidden by the data sanitization algorithm. Meanwhile, data dissimilarity measures the difference between an original database and its anonymized version in terms of item frequency. Misses cost indicates the percentage of non-sensitive frequent itemsets that cannot be discovered in a sanitized database. Finally, an artificial frequent itemset is any frequent itemset that does not exist in the original database but newly appears as a frequent itemset in the sanitized database. Therefore, in this paper, a distinct data sanitization method is proposed. The proposed method follows a swapping-based strategy to ensure privacy protection in a database while preventing excessive side effects of the data sanitization process. The method builds on a recent data swapping method developed in [5] to generate an anonymized database. It uses an item collision detection strategy, and it carefully selects a pair of transaction records for swapping by evaluating item similarity in the transaction records. To the best of our knowledge, our proposed method is the first to use the swapping technique in PPDM to hide sensitive frequent itemsets.
The rest of the paper is organized as follows: Section II explores related work. The proposed method is explained in Section III. Section IV describes the experimental results, followed by a discussion of threats to validity, and Section VI concludes the paper.

A. Frequent Itemset Mining
Frequent itemset mining is a data mining task that aims to find all itemsets contained in transaction records whose occurrence frequency meets a given threshold [6], [7]. Prior to performing frequent itemset mining, a database owner needs to determine a minimum support threshold value. There is no fixed value for the minimum support threshold: if a database owner sets the threshold too low, the mining may output a very large number of frequent itemsets, and if it is set too high, very few.
Suppose we have a transaction database denoted as D. The support supp of an itemset X is the fraction of transactions in D that contain X. We denote the support of itemset X in a database D as supp(X, D). To compute supp(X, D), one divides the frequency of itemset X in D, f(X), i.e., the number of transactions containing X, by the total number of transaction records in the database, |D|, as in (1). An itemset X is called a frequent itemset FI if supp(X, D) is greater than or equal to the chosen minimum support minSupp [8]. Thus, any itemset whose support value is below minSupp is not a frequent itemset.
$$\mathrm{supp}(X, D) = \frac{f(X)}{|D|} \qquad (1)$$
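As a minimal sketch of the support definition, the following Python fragment computes supp(X, D) as a fraction and enumerates frequent itemsets by brute force. The function names, the `max_size` limit, and the toy database are illustrative assumptions; they are not Table I and not the mining procedure used in the experiments.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of transactions in `database` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in database if itemset <= t)
    return hits / len(database)

def frequent_itemsets(database, min_supp, max_size=2):
    """Brute-force search for itemsets (up to max_size items) with support >= min_supp."""
    items = sorted(set().union(*database))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = support(candidate, database)
            if s >= min_supp:
                result[frozenset(candidate)] = s
    return result

# Hypothetical toy database (not Table I from the paper).
D = [frozenset(t) for t in ([1, 3, 5], [2, 4], [1, 3], [1, 2, 3], [4, 5])]
FI = frequent_itemsets(D, min_supp=0.6)
```

With this toy data, only items 1 and 3 and the pair {1, 3} reach the threshold. A brute-force search is only practical for tiny examples; real workloads would rely on an algorithm such as FP-Growth.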

B. Sensitive Frequent Itemset
A sensitive frequent itemset is any frequent itemset whose disclosure during a mining process conducted by other parties may harm the interests of the database owner. In general, the database owner determines the set of sensitive frequent itemsets. Thus, if we formally denote the set of sensitive frequent itemsets as Fs(X, D), then Fs(X, D) ⊂ FI. Any other frequent itemset that is not considered sensitive is referred to as a non-sensitive frequent itemset Fn, where Fs ≠ Fn and FI = Fs ∪ Fn. The relation between sensitive frequent itemsets and frequent itemsets is depicted in Fig. 1.

C. Data Sanitization Method
Data sanitization methods can be grouped into three main categories: perturbation-based methods, cryptographic-based methods, and heuristic-based methods [9]. It has been proved that achieving a sanitized database that guarantees privacy protection and preserves maximum database utility is an NP-hard problem [10], [11]. Therefore, various data sanitization methods with distinct parameters and settings have been proposed to address the issue. In addition, each proposed method is application-specific: it is designed for a particular problem and may not be adequate for another. For example, a data sanitization method intended for hiding sensitive frequent itemsets is not suitable for privacy preserving data clustering. Thus, no single method fits all problems.

1) Perturbation-based Method:
A perturbation-based method perturbs a database either by removing items or by inserting artificial items into transactions in the database. An initial data sanitization method that follows the reconstruction-based concept to hide sensitive frequent itemsets has been proposed in [4]. One of the solutions in that work is called the Naïve approach. It removes all the sensitive itemsets from a transaction database so that the sensitive information cannot be disclosed. While the technique effectively addresses the privacy problem, it causes significant item loss due to the removal process.
In reality, items in the transaction database may have different levels of importance. For example, item x may be less important than item y because x generates low profit in a business process, while item y is considered essential due to its economic value. Therefore, a method that considers multiple sensitivity thresholds has been proposed in [12]. The technique does not arbitrarily suppress all the sensitive frequent itemsets; instead, it creates a template containing possible victim items to disguise them. Another perturbation-based method, namely rotation perturbation, has been proposed in [13]. However, that method is specifically designed to address sensitive information issues in clustering. To solve the item loss issue, a technique that uses transaction insertion has been introduced in [14]. However, the method results in a significant difference between the original database and the sanitized one.
To optimize the performance of data sanitization, a method based on particle swarm optimization (PSO) has been proposed in [15]. The method achieves a sanitized database by removing sensitive items in specific transaction records while reducing the side effects. The size of the database is another challenge. To address it, a method called MR-OVnTSA has been proposed in [16]. The method hides sensitive frequent itemsets in a big data environment by removing items and transactions in a way that balances privacy and knowledge in the database.
2) Cryptographic-based Method: Since a transaction database may be analyzed by several geographically separated parties, hiding sensitive frequent itemsets in a distributed system has also been intensively studied. Pioneering work in this area was proposed in [17], [18]. These methods use a secure multi-party computation technique in which several parties perform data mining analysis. To improve the quality of the sanitized database, a more recent approach in [19] proposes a cryptographic technique to hide sensitive rules in transaction databases. The method successfully protects the transaction database from inference attacks. A recent method proposed in [20] employs a cryptographic technique that improves the mining process by splitting the encrypted transactions into a certain number of blocks and using only bilinear pairs of ciphertexts from the blocks. Therefore, the approach becomes more applicable in real-life cases. Even though cryptographic-based methods provide a strong privacy guarantee, their performance decreases drastically on very large transaction databases due to the encryption and decryption process.
3) Heuristic-based Method: As mentioned above, achieving both a maximum privacy guarantee and maximum database utility is an NP-hard problem, so a near-exact solution based on a heuristic approach needs to be devised to address the problem in real-life scenarios. Presently, various heuristic-based methods have been proposed under different settings and parameters. Pioneering works in this area include [4], [21]. In the literature, most heuristic-based methods apply either an item pruning or an artificial transaction insertion strategy to reduce the support of an itemset, and thereby hide the sensitive frequent itemsets in a database.
Distinct from the previous approaches, [22] proposed a method with a unique strategy that does not reduce the support of an itemset to hide the sensitive frequent itemsets; instead, it considers representative rules to remove the rules at the beginning. Another recent study in [3] also adopts a heuristic-based data sanitization method, which applies an item pruning strategy and successfully hides sensitive itemsets in a database. To select the items for the pruning process, the method calculates the frequency of sensitive items and removes the one that causes the minimum item loss.
Heuristic-based methods that use either an item pruning strategy or artificial transaction insertion can undoubtedly hide sensitive frequent itemsets in a database. Unfortunately, such strategies cause the database to lose useful information because some items are missing from it. In addition, the artificial transaction insertion strategy results in excessive changes to the database; as a result, the item composition of the original database and the sanitized one differs significantly.

D. Swapping Techniques
The principle of the data swapping technique is to move items from one transaction record to another record and vice versa. It does not remove or add items in the transaction records; as a result, the database content can be well preserved. Data swapping techniques have been widely adopted for statistical disclosure control in microdata sharing. Pioneering work on protecting sensitive information using the swapping technique was developed in [23], [24]. The method successfully protects sensitive information in numerical and categorical attributes.
There is a debate about the side effect of these techniques, i.e., they cause incorrect information at the record level because items of one transaction record are swapped into another record. Nevertheless, the techniques successfully protect the items in the transaction database from loss. Thus, data recipients can still perform data exploration to obtain all information about the items in the sanitized database.
An illustration of the swapping technique in a transaction database is given in Fig. 2.

III. PROPOSED METHOD
To hide sensitive frequent itemsets while maintaining the database utility, in this research we propose a swapping-based data sanitization method. To the best of our knowledge, our proposal is the first data sanitization method for hiding sensitive frequent itemsets that adopts a swapping strategy. The swapping strategy neither removes items from a database nor inserts new artificial transactions into it; instead, it swaps items from one transaction to another. Accordingly, side effects such as the number of artificial frequent itemsets in the sanitized database can be minimized. The swapping strategy was first introduced in [25] to control data disclosure. The proposed method is distinct from that initial work, which relies on a randomization strategy to protect the database. Our solution framework is described in Fig. 3.
To determine whether an itemset is a frequent itemset in D, the data owner needs to choose a minimum support threshold, minSupp, and perform frequent itemset mining. All the obtained itemsets with a support value greater than or equal to minSupp are called frequent itemsets, FI. Next, the database owner defines a set of sensitive frequent itemsets Fs from FI, where Fs ⊂ FI. Fs is a nonempty set of sensitive frequent itemsets si, thus Fs = {si_1, si_2, ..., si_n}. Meanwhile, all the frequent itemsets that are not in Fs are called non-sensitive frequent itemsets Fn; they do not need to be hidden in D, and FI = Fs ∪ Fn. In general, Fs can be determined in two ways. In the first, database owners define Fs according to their intention from a business perspective; in the second, customers freely mark their purchased items as either sensitive or non-sensitive [26]. In this research, we follow the first approach, where the database owner determines the set of itemsets that he or she considers sensitive information.

A. Reading and Segmenting Database
Initially, our proposed method scans a database D and reads each transaction record t_x ∈ D. During the reading process, the method checks whether each t_x contains a sensitive frequent itemset si. Each t_x containing si is appended to a bucket TFs; otherwise, it is appended to another bucket TFn. In this step, TFs and TFn separate the sensitive and non-sensitive transactions in the database. Therefore, TFs contains only transactions that contain si, while TFn contains only transactions that do not. The pseudo-code of this procedure is presented in Algorithm 1.
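The segmentation step can be sketched in Python as follows. This is an illustration under the assumption that transactions and sensitive itemsets are represented as frozensets of item ids; the function name `segment_database` and the sample data are hypothetical.

```python
def segment_database(database, sensitive_itemsets):
    """Split D into TFs (records containing some si) and TFn (all other records)."""
    tfs, tfn = [], []
    for t in database:
        # A record is sensitive if at least one sensitive itemset is a subset of it.
        if any(si <= t for si in sensitive_itemsets):
            tfs.append(t)
        else:
            tfn.append(t)
    return tfs, tfn

# Hypothetical data, not taken from the paper.
D = [frozenset(t) for t in ([1, 3, 5], [2, 4], [1, 3], [2, 7, 8], [4, 5])]
Fs = [frozenset({1, 3}), frozenset({2, 7})]
TFs, TFn = segment_database(D, Fs)
```

Only the records in TFs are candidates for swapping; TFn is carried over unchanged.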

B. Measuring Transactions Similarity and Pairing the Transactions
Following the previous step, the proposed method measures the similarity among transactions to obtain a pair of transactions for swapping, P. P simplifies the pairing of the two transactions used in the swapping procedure. In this research, we follow the idea of [27], where the Jaccard coefficient is adopted to measure the similarity of transactions. In essence, the Jaccard coefficient Jc is the number of items that occur in both records divided by the total number of distinct items in those records. The formula for Jc is given in (2).
$$Jc(t_x, t'_x) = \frac{|t_x \cap t'_x|}{|t_x \cup t'_x|} \qquad (2)$$
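For illustration, the Jaccard coefficient of two transactions can be computed as below. The treatment of two empty records (returning 0) is an assumption made here to avoid division by zero; the source does not discuss that case.

```python
def jaccard(t_a, t_b):
    """Jaccard coefficient: shared items divided by distinct items of both records."""
    a, b = set(t_a), set(t_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty records are treated as dissimilar
    return len(a & b) / len(union)

# Hypothetical transactions: they share items 3 and 5 out of 5 distinct items.
jc = jaccard({1, 3, 5}, {3, 5, 7, 9})
```

A value of 0 means the records share no item, which is the most favorable case for swapping without collision.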
Algorithm 1: Reading and segmenting database
Input: D, si ∈ SI
Result: TFs and TFn
Scan D
∀ t_x ∈ D
    if si ⊆ t_x then
        add t_x to TFs
    else
        add t_x to TFn
    end
Algorithm 2: Measuring similarity and finding a pair
Input: TFs
Result: ...
    ...
    end
select a pair P having the minimum Jc
To avoid an item collision, which may result in item loss, and to reduce the number of artificial frequent itemsets generated in the sanitized dataset, the proposed method implements two protocols. First, our method only selects a pair of records that has the minimum similarity. Initially, the method selects a transaction t_x ∈ TFs randomly, and then it picks another transaction t′_x ∈ TFs; the selected pair of transactions is referred to as P.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 11, 2021
Second, our method ensures that the sensitive itemsets si do not coexist in both transactions, i.e., si ∈ t_x ≠ si′ ∈ t′_x of P. When the si of both transactions are different, the algorithm computes Jc. The next step is selecting the pair of records P with the minimum Jc. Therefore, when the items i ∈ si of the pair P are swapped with each other, the process does not cause significant item collision in either transaction. In addition, this procedure helps ensure the hiding of sensitive frequent itemsets and minimizes data dissimilarity. Algorithm 2 presents the pseudo-code of this procedure in detail.
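The two pairing protocols can be sketched as follows. The sketch assumes that "different sensitive itemsets" means the sets of sensitive itemsets contained in the two records are disjoint, and it breaks ties on Jc by the lower record index; both choices, as well as the names `find_partner` and `find_pair`, are assumptions rather than details given in the source.

```python
import random

def jaccard(t_a, t_b):
    union = t_a | t_b
    return len(t_a & t_b) / len(union) if union else 0.0

def contained(t, sensitive_itemsets):
    """Set of sensitive itemsets that occur in transaction t."""
    return {si for si in sensitive_itemsets if si <= t}

def find_partner(x, tfs, sensitive_itemsets):
    """Index of the record with minimum Jc to tfs[x] whose sensitive itemsets differ."""
    t_x = tfs[x]
    own = contained(t_x, sensitive_itemsets)
    candidates = [
        (jaccard(t_x, t), idx)
        for idx, t in enumerate(tfs)
        if idx != x and own.isdisjoint(contained(t, sensitive_itemsets))
    ]
    return min(candidates)[1] if candidates else None

def find_pair(tfs, sensitive_itemsets, rng=random):
    """Pick t_x at random, then its least similar admissible partner."""
    x = rng.randrange(len(tfs))
    return x, find_partner(x, tfs, sensitive_itemsets)

# Hypothetical data, not taken from the paper.
Fs = [frozenset({1, 3}), frozenset({2, 7})]
TFs = [frozenset(t) for t in ([1, 3, 5], [2, 7, 5], [2, 7, 8, 9], [1, 3, 4])]
```

For the first record, the partner is the third record: it holds a different sensitive itemset and shares no item with it, so Jc is 0. If no admissible partner exists, the sketch returns None and leaves the handling of that case open.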

1) Selecting Item for Swapping:
Once the pair P has been determined, the following step is selecting items from P to swap. In general, arbitrarily swapping items between these transactions may also hide the sensitive frequent itemsets in both transactions. However, this may distort the item correlation in the transactions, resulting in significant changes to the content of the sanitized database [28]. To address this problem, in this research the strategy in [5] is adopted. The key point of the strategy is checking whether the items i ∈ t_x that will be swapped also exist in t′_x.
Referring to Table I as an illustration, we aim to swap si ∈ t_2 with si′ ∈ t_3. Let iid denote the item id. For example, the item coffee, with iid = 7, is part of a sensitive frequent itemset si that appears in t_2, and it also exists in t_3. Swapping iid = 7 from t_2 with another item such as bread, i.e., the item with iid = 6 that is present in t_3, successfully hides si ∈ t_2. However, because iid = 7 already exists in t_3 while iid = 6 is not present in t_2, moving iid = 7 from t_2 into t_3 causes an item collision in t_3: the incoming item merges with the copy already there, so t_3 loses one of its items, i.e., iid = 6 no longer exists in t_3. Accordingly, to hide si ∈ t_x while reducing the number of lost items in the transactions, the proposed method selects items that do not cause item collision.
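The collision effect can be reproduced with a small Python sketch. The contents of the two transactions are hypothetical stand-ins (Table I is not reproduced here); only the roles of iid = 7 and iid = 6 follow the example above, and transactions are assumed to behave as sets.

```python
def swap_items(t_a, t_b, item_a, item_b):
    """Move item_a from t_a to t_b and item_b from t_b to t_a (set semantics)."""
    new_a = (set(t_a) - {item_a}) | {item_b}
    new_b = (set(t_b) - {item_b}) | {item_a}
    return new_a, new_b

def causes_collision(t_a, t_b, item_a, item_b):
    """A swap collides if an incoming item is already present in the receiving record."""
    return item_a in t_b or item_b in t_a

# Hypothetical contents for t_2 and t_3; both contain item 7.
t2 = {2, 7, 9}
t3 = {6, 7, 8}

# Colliding swap: item 7 goes to t3, which already holds it, so t3 shrinks.
new_t2, new_t3 = swap_items(t2, t3, 7, 6)

# Collision-free swap: items 2 and 8 are each absent from the receiving record.
ok_t2, ok_t3 = swap_items(t2, t3, 2, 8)
```

After the colliding swap the two records hold five items in total instead of six, whereas the collision-free swap keeps all six.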
In addition, to minimize the loss of data utility, the proposed method selects the sensitive item i ∈ si that has the minimum support Pr in D. Selecting the item i ∈ si with the lowest Pr minimizes the changes to item correlation in t_x. For example, suppose we have a sensitive itemset {2, 3}, with iid = 2 and iid = 3. Referring to Table I, the Pr of iid = 2 is 3/10 = 0.30, while the Pr of iid = 3 is 5/10 = 0.50. To hide the sensitive itemset, we swap either iid = 2 or iid = 3. If we select iid = 3 as the item to swap, its item correlation with other items is significantly disrupted, since it appears five times in D. On the other hand, when iid = 2 is selected, its item correlation with other items is not significantly reduced because it appears less often in D than iid = 3; as a result, only a small part of the transactions in D experience changes.
Thus, to be selected for the swapping process, an item i ∈ si has to satisfy two conditions. Firstly, the item i ∈ si should not collide with any item in t′_x. Secondly, it should have the lowest Pr in D. In this way, the number of artificial frequent itemsets in the sanitized database D′ can be minimized. The details of item selection are described in Algorithm 3.
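The two selection conditions can be sketched as below. The sketch filters out colliding items first and then takes the one with the lowest Pr, breaking ties by item id; this ordering and the names `item_support` and `select_item` are assumptions. The toy database is built so that iid = 2 occurs in 3 of 10 records and iid = 3 in 5 of 10, matching the frequencies in the example above, but it is not Table I.

```python
def item_support(item, database):
    """Pr of a single item: fraction of transactions that contain it."""
    return sum(1 for t in database if item in t) / len(database)

def select_item(si, t_partner, database):
    """Item of si with the lowest Pr that does not already occur in t_partner."""
    candidates = [i for i in si if i not in t_partner]
    if not candidates:
        return None  # every item of si would collide with the partner record
    return min(candidates, key=lambda i: (item_support(i, database), i))

# Hypothetical database with 10 transactions.
D = [set(t) for t in ([1, 3], [2, 3], [2, 3, 4], [5], [1, 3],
                      [2, 6], [3, 7], [4], [5, 6], [1])]
chosen = select_item({2, 3}, {6, 7, 8}, D)
```

Here iid = 2 is chosen because its Pr (0.3) is lower than that of iid = 3 (0.5). If the partner record already contains iid = 2, the sketch falls back to iid = 3.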

Algorithm 3: Procedure of items selection for swapping
Input: P
Result: i′
calculate the Pr of i ∈ si of t_x
select i ∈ si of t_x that has the minimum Pr
check whether the i exists in t′ ...
...
create Buffer br_tk and br_t′k;
2) Swapping the Selected Items:
Once the items for the swapping process have been determined, the next step is swapping the items between t_x and t′_x. To swap the items, the proposed method creates two buffers for storing the items i ∈ t_x and i′ ∈ t′_x. First, the item from t_x is stored in buffer br_tk and that from t′_x is stored in br_t′k. Here, br is a buffer that temporarily stores the modified transaction records during the swapping process. Second, the items in br_t′k are appended to t_x. Following that, the items in br_tk are appended to t′_x. The procedure is repeated until all i′ from the pairs of records P have been swapped. Once the swapping process is finished, the algorithm combines the swapped transaction records with the records from TFn to generate the sanitized database D′. Algorithm 4 presents the pseudo-code of item swapping in detail.
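A minimal sketch of the buffer-based swap is given below, assuming that one item per record is exchanged and that transactions are lists of item ids. The names `swap_pair` and `build_sanitized`, and the sample records, are hypothetical.

```python
from collections import Counter

def swap_pair(t_x, t_y, item_x, item_y):
    """Exchange item_x (from t_x) and item_y (from t_y) through two buffers."""
    br_x = [item_x]  # buffer holding the item leaving t_x
    br_y = [item_y]  # buffer holding the item leaving t_y
    new_x = [i for i in t_x if i not in br_x]
    new_y = [i for i in t_y if i not in br_y]
    new_x.extend(br_y)  # append the partner's item to t_x
    new_y.extend(br_x)  # append t_x's item to the partner
    return new_x, new_y

def build_sanitized(swapped_records, tfn):
    """Sanitized database D': swapped sensitive records plus untouched TFn."""
    return swapped_records + tfn

# Hypothetical pair: t_x holds the sensitive itemset {1, 3}, t_y holds {2, 7}.
t_x, t_y = [1, 3, 5], [2, 7, 8]
new_x, new_y = swap_pair(t_x, t_y, 1, 2)
D_sanitized = build_sanitized([new_x, new_y], [[4, 5]])

before = Counter(t_x + t_y)
after = Counter(new_x + new_y)
```

After the swap neither record contains its original sensitive itemset, while the item counts over the two records are unchanged.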

IV. EXPERIMENTAL RESULTS
To evaluate the effectiveness of the proposed method, we conduct several experiments using a real dataset, the foodmart dataset [29]. The properties of the dataset are described in Table II, while the testing parameters are presented in Table III. We implement the algorithm in Java and run it on a UNIX operating system with 8 GB of memory and 256 GB of storage. An additional tool, namely SPMF [30], is used to generate frequent itemsets with the FP-Growth algorithm [31].

A. Evaluation Metrics
To verify the performance of the proposed method, we compare the proposed method, SW, with several existing methods for hiding sensitive frequent itemsets, i.e., the heuristic method, HEU [3], and the naïve method, NV [4]. The testing parameters of this experiment are presented in Table III. Several metrics are adopted to evaluate the performance of the proposed method, namely hiding failure, misses cost, dissimilarity, and artificial frequent itemset [32].
1) Hiding Failure: Hiding failure, HF, is a metric that evaluates the percentage of sensitive frequent itemsets that fail to be hidden. Ideally, a data sanitization method should hide all the sensitive frequent itemsets in a database, i.e., HF is 0. However, in some cases, because of the inaccuracy of the data sanitization method, several sensitive frequent itemsets fail to be hidden. The metric for HF is given in (3), where #Fs(D) represents the number of sensitive frequent itemsets in the original database and #Fs(D′) refers to that in the sanitized one.
$$HF = \frac{\#F_s(D')}{\#F_s(D)} \qquad (3)$$
Referring to Fig. 4, we can observe that the proposed method results in the lowest percentage of hiding failure. Even though SW fails to hide some si, its failure percentage is small compared to that of the other methods. The HF of SW is around 7.143%, while the HF values of HEU and NV are 47.619% and 66.667%, respectively. The method achieves these results because it takes a pair of records and swaps the items in si between the records.
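As an illustration of the HF metric, the following sketch counts how many sensitive itemsets are still frequent in a sanitized database. It assumes every itemset in the input list was frequent in the original database, so that the list length equals #Fs(D); the function name and the data are hypothetical.

```python
def hiding_failure(sensitive_itemsets, sanitized_db, min_supp):
    """Percentage of sensitive itemsets that remain frequent in the sanitized database."""
    def supp(itemset, db):
        return sum(1 for t in db if itemset <= t) / len(db)

    still_frequent = [si for si in sensitive_itemsets
                      if supp(si, sanitized_db) >= min_supp]
    return 100.0 * len(still_frequent) / len(sensitive_itemsets)

# Hypothetical sanitized database and sensitive itemsets.
D_prime = [frozenset(t) for t in ([1, 3], [1, 3], [2], [4])]
Fs = [frozenset({1, 3}), frozenset({2, 4})]
hf = hiding_failure(Fs, D_prime, min_supp=0.5)
```

In this toy case {1, 3} still reaches the threshold while {2, 4} does not, giving an HF of 50%.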
2) Misses Cost: The term misses cost, MC, refers to the percentage of non-sensitive frequent itemsets Fn that are accidentally hidden when performing data sanitization. Ideally, MC is 0%. The formula for MC is given in (4), where #Fn(D) and #Fn(D′) represent the number of non-sensitive frequent itemsets that can be discovered in D and the number of non-sensitive frequent itemsets that cannot be discovered in D′, respectively.
As can be observed in Fig. 5, when the sanitized database produced by SW is mined under minSupp = 0.03%, our proposal induces a slightly higher percentage of MC than HEU. However, as the minSupp value increases to 0.1%, the proposed scheme achieves the same results as HEU. In addition, SW achieves better results than NV in terms of minimizing MC for all minSupp values. The detailed MC values of the methods are given in Table IV.
The main reason for these results is that the proposed method does not limit the number of modified records as HEU does. HEU leaves some records containing si unmodified to reduce MC. However, such a strategy allows si to remain discoverable when data recipients perform frequent itemset mining with a threshold lower than the minSupp value. As our goal is to design strong data sanitization, the proposed method does not apply the strategy used in HEU.
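The MC metric can be illustrated with the sketch below. It assumes that MC is the share of non-sensitive frequent itemsets of D that are no longer frequent in D′, which follows the verbal definition above; the exact form of (4) is not reproduced here, and the function name and sample itemsets are hypothetical.

```python
def misses_cost(fn_original, fi_sanitized):
    """Percentage of non-sensitive frequent itemsets of D that are missing from D'."""
    fn_original = set(fn_original)
    if not fn_original:
        return 0.0  # nothing could be missed
    missed = fn_original - set(fi_sanitized)
    return 100.0 * len(missed) / len(fn_original)

# Hypothetical mining results before and after sanitization.
Fn_D = {frozenset({1}), frozenset({2}), frozenset({1, 2}), frozenset({4})}
FI_D_prime = {frozenset({1}), frozenset({2}), frozenset({5})}
mc = misses_cost(Fn_D, FI_D_prime)
```

Two of the four non-sensitive itemsets, {1, 2} and {4}, can no longer be discovered, so MC is 50%.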
3) Dissimilarity: Applying a data sanitization method to a database always changes the database content to some extent. These changes are considered a side effect of the data sanitization method, referred to as dissimilarity. To evaluate the dissimilarity between an original database and its sanitized version, one can compare the item frequencies in both databases. The formula for the dissimilarity Diss is given in (5), where f_D(i) represents the frequency of item i in the original database D and f_D′(i) refers to that in the sanitized one.
As can be observed from Fig. 6a, the item frequencies of the sanitized database D′ generated by our proposed method SW are almost the same as those of the original database D. Even though the frequencies of certain items differ between the two databases, the deviation is not significant. Referring to Fig. 6b, the item frequencies in the sanitized database D′ generated by HEU also show a small dissimilarity. Meanwhile, Fig. 6c shows that the item frequencies in D′ obtained from NV differ significantly from those in the original database D.
The data dissimilarity of these databases is summarized in Fig. 7. According to the figure, the proposed method yields the lowest Diss value among the compared methods. The Diss value of the proposed method is 1.372, while those of HEU and NV are 4.327 and 366.436, respectively. This result is achieved because the proposed method minimizes the number of item losses in the sanitized database. Meanwhile, since the other two methods adopt a suppression strategy that removes items from a database, their dissimilarity values are higher than that of our proposed method.
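To illustrate why swapping keeps item frequencies close to the original, the sketch below compares per-item frequencies of two databases. It assumes one common formulation, the sum of absolute frequency differences divided by the total item frequency in D; the scaling of the paper's (5) may differ, so the values are not comparable with the reported Diss figures. All names and data are hypothetical.

```python
from collections import Counter

def item_frequencies(db):
    """Number of transactions in which each item occurs."""
    return Counter(i for t in db for i in set(t))

def dissimilarity(original, sanitized):
    """Sum of absolute item-frequency differences, normalized by the total in D."""
    f_d = item_frequencies(original)
    f_s = item_frequencies(sanitized)
    total = sum(f_d.values())
    diff = sum(abs(f_d[i] - f_s[i]) for i in set(f_d) | set(f_s))
    return diff / total

# Hypothetical databases.
D = [{1, 2}, {2, 3}, {1, 3}]
swapped = [{2, 3}, {1, 2}, {1, 3}]   # item 1 and item 3 exchanged between two records
suppressed = [{2}, {2, 3}, {1, 3}]   # item 1 removed from the first record

diss_swap = dissimilarity(D, swapped)
diss_supp = dissimilarity(D, suppressed)
```

A collision-free swap leaves every item frequency unchanged, so this measure is 0, whereas suppression lowers the frequency of the removed item.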

4) Artificial Frequent Itemset:
Artificial frequent itemsets, AFI, is defined as the percentage of frequent itemsets that are not present in the original database but newly appear in the sanitized one. Ideally, the percentage is 0. The formula for AFI is stated in (6).
The notations |FI′| and |FI| represent the cardinality of the frequent itemsets in the sanitized database and in the original database, respectively. As can be seen in Fig. 8, the sanitized database produced by SW has a considerably lower AFI than that of NV, and it has the same AFI as HEU when the minSupp value is more than 0.03%. SW can minimize AFI because it does not remove items from or add items to a database. Therefore, the frequent itemsets in D′ remain largely the same as those of the original database. The detailed AFI values are presented in Table V.
Threats to construct validity relate to the performance of the proposed method in handling various databases with different properties. In our study, we used only one transaction database, as described in Table II. However, it has more complex data properties than other databases commonly used in the PPDM area, such as BMS-WebView1 and BMS-WebView2 [33], specifically in the number of distinct items and the average tuple length. Thus, we consider that the impact of using various databases is not significant.
The second threat to validity relates to the performance of the proposed method compared to other, more recent methods. Even though NV is not a recent method, recent studies in PPDM [34], [35] still use it as a benchmark to evaluate the performance of their proposed methods. Therefore, the impact of using other recent methods is small.

VI. CONCLUSION
In this paper, a data sanitization method based on a swapping approach, called SW, has been proposed. The main property of the proposed method is that it does not add or remove items in the database. The method obtains a sanitized database in several steps: finding transactions containing sensitive frequent itemsets, measuring their similarity to determine a pair of records, and deciding which items in the sensitive frequent itemsets to swap.
Experimental results show that, in general, the proposed method performs better than some existing methods. The method hides the sensitive frequent itemsets with the lowest HF among the compared methods, indicating that it provides stronger privacy protection in the sanitized database. In addition, since the method does not remove or add items in a database, the dissimilarity between the original database and the sanitized one produced by our method is lower than that of HEU and NV. In terms of data utility preservation, our method performs similarly to HEU, with a percentage of AFI close to zero.
In the future, a deeper analysis of the proposed method needs to be conducted, specifically in handling various transaction databases with different properties and in evaluating the algorithm complexity. The proposed method SW also needs to be compared with more recent works in the same field to evaluate its performance.