Protecting User Preference in Ranked Queries

Abstract. Protecting data privacy is of extreme importance for users who outsource their data to a third party in cloud computing. Although there is plenty of research on data privacy protection, the problem of protecting users' preference information has received less attention. In this paper, we consider how to prevent the third party from learning user preference information when processing ranked queries. We propose two algorithms to solve the problem: the first is based on distortion of the preference vector, and the second is based on homomorphic encryption. We conduct extensive experiments to verify the effectiveness of our approaches.


I. INTRODUCTION
Ranked queries are useful in many real-world applications because they let users express their tastes in queries, so that the returned results are preferable to the other records in the database. Currently, there are two kinds of approaches to capturing user preferences: the quantitative approach [1] and the qualitative approach [2]. In this paper, we focus on the quantitative approach for modeling user preference.
Although there is extensive research on processing ranked queries, the privacy of user preferences has so far received little attention. A user's preference is a strong and direct link to his/her identity, which the user wants to keep private; otherwise the information may be used by an adversary against him/her. Thus, one may want to hide the preference embedded in a ranked query in order to protect privacy. To illustrate, we give two examples below.
Example 1. Consider a user looking for a second-hand car on a website run by a car dealer. The dealer maintains a used-car database that records features such as Make, Model, Year, MPG (miles per gallon), Mileage, and Price for every car. The user may care more about some of the features, such as MPG, Mileage, and Year, and wants to find a suitable car at the lowest price. By collecting and observing the user's queries with preferred features, the dealer learns that the user strongly favors MPG. The dealer may then slightly increase the prices of cars with favorable MPG in order to profit more from the user. To keep an edge over the dealer when making a deal, the user wishes to hide his preferences during the querying process.
Example 2. Consider a customer searching for financial information on NYSE-listed companies through websites such as Yahoo! Finance or Google Finance. By giving preferences on attributes such as cash flow, P/E (price-to-earnings) ratio, ROA (return on assets) ratio, and debt ratio, the customer uses the query interface to search for favorable companies, based on which he/she may make buy or sell decisions for his/her investment portfolio. However, a curious adversary may sniff the customer's search results at the server side and extract the preferred attributes, which may be used to infer the customer's likely investments. Since portfolio information is critical for the customer to withstand fierce business competition, he/she may strongly oppose the exposure of his/her preferences to anyone else.
Problem Statement: Our model includes the user and the service provider (the server). The server maintains a database $D$ and processes users' queries with different preferences on the attributes of $D$. We assume that the server is semi-honest: it correctly performs the query processing and returns the results, but it is curious and tries to find out the user's preferences. Each user composes a ranked query as a weight vector $f = \{\omega_1, \omega_2, \ldots, \omega_d\}$ on the attributes of $D$ and submits it to the server for processing. Our objective is to prevent the server from learning the exact preferences, i.e., the user's weight vector, without degrading query processing efficiency and accuracy.
The contributions of this paper are as follows:
• We consider the problem of protecting user's preferences in ranked queries, which is of great importance in many real world applications.
• We propose a simple strategy to distort the user's true preferences, and evaluate the strategy experimentally.
• To strengthen the privacy level, we devise another algorithm based on homomorphic encryption to protect the user's preference.
• We conduct extensive experiments to verify the effectiveness of the proposed approaches.

A. Preliminaries
Ranked query. Ranked queries [3] are useful in many real-world applications. They are a powerful technique for modeling users' personalized information needs and have attracted considerable research attention. A ranked query allows users to express their preferences and to specify a weight for each query attribute, so that the returned results meet the users' needs better than other data records in the database. At present, there are two methods for describing user preferences, namely the qualitative method and the quantitative method.
The qualitative method [2] usually uses a preference formula to evaluate which data tuple is favored over the others, i.e., it determines a partial order among tuples. The quantitative method [1] defines a preference function that expresses the user's preference for different data attributes, and then finds the data records that best meet the user's needs according to the numeric score given by this function. For example, given a hotel dataset containing three dimensions $D = \{d_1, d_2, d_3\}$, where $d_1$ represents price, $d_2$ represents score, and $d_3$ represents distance, a preference function $f$ assigns each hotel a score as a weighted combination of $d_1$, $d_2$, and $d_3$. The hotel with the highest score is the most desirable.
In practical applications, ranked queries are often directly related to other problems, such as skyline computation [4] and top-k queries [5], [6]. Combining these query types can usually retrieve the desired results more quickly.
Top-k query. A Top-k query [7], [8] returns the best k data records according to an objective function. It is widely used in many fields, such as e-commerce, recommendation systems, and search engines. When a Top-k query is combined with user preferences, the score of each data record is computed with the preference function given by the user, and the k objects with the smallest (or largest) scores are returned.
The concept of the Top-k query has been around for many years and is widely used in practice. Many Top-k query algorithms have been proposed, and they can be roughly divided into three categories. The first category is index-based algorithms [9], [10], such as the Onion algorithm [11]. Their main idea is to divide the whole data set into multiple layers according to a partitioning rule, index and mark each layer, and then retrieve layer by layer in index order to return the best k results. The second category is view-based Top-k query algorithms [12]. Such an algorithm first calculates the score of each tuple according to the preference function provided by the user and arranges the tuples in score order; the resulting view contains the identifier and score of each tuple, from which the Top-k query results are returned. The third category is Top-k query algorithms based on ordered lists [13], [14], which are realized by using multiple column files.

B. Related Work
Skyline query. The skyline query [4], [15], [16] is a popular technique for processing user preferences and Top-k queries. It selects the objects that meet user preferences and are not dominated by other objects. Consider a dataset $D$ with $d$ dimensions $\{d_1, d_2, \ldots, d_d\}$ and $n$ objects $\{A_1, A_2, \ldots, A_n\}$, where $A_i.d_j$ denotes the $j$-th dimension value of object $A_i$. Dominance and skyline are defined as follows. An object $A_i$ dominates an object $A_j$ if $A_i$ is no worse than $A_j$ on every attribute and strictly better than $A_j$ for at least one attribute. We call such $A_i$ the dominant object and such $A_j$ the dominated object between $A_i$ and $A_j$. The skyline of $D$ is the set of objects that are not dominated by any other object in $D$.
At present, there are mainly two kinds of methods for skyline query processing. Methods of the first kind do not preprocess the data set but answer the query by scanning the whole database at least once; examples include block nested loop (BNL) and divide and conquer. The BNL algorithm [17] adopts the most straightforward method: a point $p$ is compared with every other point to decide whether it is dominated, and thus whether it is part of the skyline. The divide-and-conquer algorithm [17] divides the universe into several regions, calculates the skyline in each region, and produces the final skyline from the regional skylines. Because they scan the whole database, methods of this kind perform poorly.
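The dominance test and the BNL idea can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation of [17]: it assumes that larger attribute values are better and that the whole window fits in memory, and the names `dominates` and `bnl_skyline` are chosen here for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: larger attribute values are better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def bnl_skyline(points):
    """In-memory block-nested-loop style skyline computation."""
    window = []  # current skyline candidates
    for p in points:
        # Discard p if some candidate already dominates it.
        if any(dominates(w, p) for w in window):
            continue
        # Otherwise drop every candidate that p dominates, then keep p.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


points = [(1, 1), (2, 2), (3, 3), (1, 4), (4, 1)]
print(bnl_skyline(points))  # [(3, 3), (1, 4), (4, 1)]
```

Each point is compared only against the current candidates, but in the worst case every pair of points is still compared, which is why such scan-based methods are costly on large datasets.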
Methods of the second kind [18] reduce the query cost by using an index structure; examples include nearest neighbor (NN) and branch-and-bound skyline (BBS), both of which find the skyline by using an R-tree. The NN algorithm [19] divides the data space iteratively based on the nearest object in the space, and prunes dominated objects quickly and effectively. The BBS algorithm [20], in contrast, uses a heap to realize a progressive search without redundant queries in subspaces. The difference is that NN issues multiple NN queries [21], whereas BBS performs only a single traversal of the tree. It has been proved [22] that BBS is I/O optimal; that is, it accesses the least number of disk pages among all algorithms based on R-trees (including NN).
Homomorphic encryption (HE). A homomorphic cryptographic system [23], [24] is a public-key cryptosystem that lets users perform algebraic operations directly on ciphertexts without decrypting them.
Given two messages $m_1$ and $m_2$, suppose a homomorphic cryptographic system encrypts them, using public key $PK$, to ciphertexts $C_1 = E(PK, m_1)$ and $C_2 = E(PK, m_2)$. Without knowing the corresponding secret key, one can compute $E(PK, m_1 + m_2)$, i.e., the ciphertext of the sum of $m_1$ and $m_2$, by simply multiplying the two ciphertexts, $C_1 \times C_2$. This property is called additive homomorphism. Similarly, a cryptographic system is multiplicatively homomorphic if one can derive $E(PK, m_1 \times m_2)$ from $C_1$ and $C_2$ directly.
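As a concrete instance of the additive property, consider the Paillier cryptosystem. It is used here only as an illustration and is not the scheme adopted in this paper. Assume a public key $(n, g)$ and random values $r_1, r_2$ coprime to $n$; multiplying two ciphertexts then gives:

```latex
\begin{align*}
E(PK, m_i) &= g^{m_i} \, r_i^{\,n} \bmod n^2, \qquad i \in \{1, 2\}, \\
C_1 \times C_2 &= g^{m_1} r_1^{\,n} \cdot g^{m_2} r_2^{\,n} \bmod n^2
               = g^{m_1 + m_2} \, (r_1 r_2)^{n} \bmod n^2 .
\end{align*}
```

The right-hand side has exactly the form of an encryption of $m_1 + m_2 \bmod n$ under the randomness $r_1 r_2$, so decrypting $C_1 \times C_2$ yields the sum of the two messages.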
The concept of homomorphic encryption [25], [26] was proposed to perform operations directly in the encrypted domain; that is, decrypting the result of operations performed on ciphertexts gives the same result as performing the same operations on the plaintexts. However, most existing homomorphic encryption schemes only support exact computation over discrete spaces, so they cannot be applied to tasks requiring floating-point or real-number computation. For example, in a bit-wise encryption scheme, an integer is first converted into binary and then encrypted bit by bit, and the addition and multiplication operations are also performed on bits, so the scheme cannot be applied to floating-point numbers. In a word-wise encryption scheme, multiple numbers can be encrypted in a single ciphertext, but the rounding operation is difficult to evaluate because it cannot be expressed as a low-degree polynomial.
After Regev [27] introduced the learning with errors (LWE) problem, homomorphic encryption schemes based on it were proposed one after another. Since the keys of LWE-based methods are matrices, their efficiency decreases rapidly as the security parameter grows. The ring-LWE (RLWE) problem [28] was then proposed. The keys of encryption schemes based on this problem are expressed by several polynomials, which greatly reduces the key size and speeds up encryption and decryption. Schemes based on the RLWE problem include BFV [26], BGV [29], and HEAAN [30]. The BFV and BGV encryption schemes only support exact computation on integers.
HEAAN, in contrast, can encrypt floating-point numbers, and its goal is efficient approximate computation on encrypted data. Its main idea is to add noise below the significant digits that carry the main message, so that additions and multiplications of encrypted messages can be computed approximately. In the HEAAN encryption scheme, decryption has the form $\langle c, sk \rangle = m + e \pmod{q}$, where $e$ is a small error inserted to guarantee security. In addition, HEAAN provides a rescaling operation that removes the error in the least significant bits, which ensures that the bit length of the error grows linearly with the number of levels consumed, rather than exponentially. The efficiency of HEAAN [30] has been demonstrated in many practical applications, and it is still being improved by better bootstrapping algorithms [31], [32]. Therefore, since the data used in this paper are floating-point numbers, we adopt the HEAAN homomorphic encryption scheme.
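To make the role of the small error and of rescaling more concrete, the following plaintext-only Python sketch mimics approximate fixed-point arithmetic. It is an analogy and not the HEAAN scheme or its API: nothing is encrypted, and `SCALE`, `NOISE_BOUND`, and the function names are assumptions chosen for illustration.

```python
import random

SCALE = 2 ** 20      # hypothetical scaling factor
NOISE_BOUND = 4      # hypothetical bound on the small error e


def encode(x, rng):
    """Scale x to an integer and add a small error, mimicking m + e."""
    return round(x * SCALE) + rng.randint(-NOISE_BOUND, NOISE_BOUND)


def multiply(a, b):
    """The product of two encodings carries the scale SCALE ** 2."""
    return a * b


def rescale(v):
    """Divide by SCALE with rounding, dropping the least significant bits."""
    return (v + SCALE // 2) // SCALE


def decode(v):
    """Map an encoding at scale SCALE back to a real number."""
    return v / SCALE


rng = random.Random(0)
a, b = encode(1.5, rng), encode(2.25, rng)
product = decode(rescale(multiply(a, b)))
print(product)  # close to 1.5 * 2.25 = 3.375, up to the injected error
```

Without the rescaling step, the scale of the encoded value would square after every multiplication; dividing by `SCALE` brings it back to the original scale at the cost of a small rounding error.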
Approximate Algorithm for Comparison Function. Logical operations have always been difficult in HE. Bit-wise FHE schemes encrypt data bit by bit; they support fast logical operations such as comparison, but they cannot encrypt floating-point numbers. In contrast, word-wise FHE schemes, which store messages as word-sized numbers, support fast arithmetic operations between messages but offer only addition and multiplication. Therefore, in order to evaluate the comparison function, approximations of it by polynomials have been proposed.
Cheon et al. [33] first proposed the identity $\mathrm{comp}(a, b) = \lim_{k \to \infty} \frac{a^k}{a^k + b^k}$ and showed that it can be computed by an iterative algorithm. Because of its iterative nature, this algorithm is slow in HE implementations. They then proposed a new comparison method, SIMD [34], which uses a composite polynomial approximation of the sign function; the sign function is equivalent to the comparison function. That is, repeated compositions of the degree-$(2n+1)$ polynomials $f_n(x)$ and $g_n(x)$ output an approximate value of the sign function.
We denote the approximate comparison of two inputs $x, y$ by $(x > y)$ or $(y < x)$. According to [34], given iteration numbers $d_f$ and $d_g$, $(x > y)$ is computed as
$$(x > y) = \frac{f_n^{d_f}\big(g_n^{d_g}(x - y)\big) + 1}{2}.$$
Here $f^d$ means $f \circ f \circ \cdots \circ f$, i.e., $f$ is applied $d$ times.
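Both approximations can be tried out on plaintext numbers. The sketch below is an illustration under stated assumptions rather than the algorithm of [34]: it uses only the degree-3 polynomial $f_1(x) = (3x - x^3)/2$, omits the accelerating polynomial $g_n$, and maps the approximate sign of $x - y$ to a comparison value via $(\mathrm{sign} + 1)/2$. Inputs are assumed to lie in $[0, 1]$ so that $x - y \in [-1, 1]$.

```python
def comp_limit(a, b, k):
    """a^k / (a^k + b^k) for positive a, b; tends to 1, 1/2, or 0 as k grows."""
    ak, bk = a ** k, b ** k
    return ak / (ak + bk)


def f1(x):
    """Odd degree-3 polynomial that pushes values in [-1, 1] toward -1 or 1."""
    return (3.0 * x - x ** 3) / 2.0


def approx_sign(x, d):
    """Apply f1 d times, i.e. compute the d-fold composition f1(f1(...f1(x)))."""
    for _ in range(d):
        x = f1(x)
    return x


def approx_greater(x, y, d=20):
    """Approximate (x > y): near 1 if x > y, near 0 if x < y, 0.5 if equal."""
    return (approx_sign(x - y, d) + 1.0) / 2.0


print(approx_greater(0.8, 0.3))   # close to 1
print(approx_greater(0.3, 0.8))   # close to 0
print(comp_limit(0.8, 0.3, 64))   # close to 1
```

Only additions and multiplications are used, which is what makes this style of comparison suitable for evaluation on word-wise ciphertexts. The closer the two inputs are, the more compositions are needed before the output separates from 0.5.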

III. OUR APPROACHES FOR PREFERENCE PROTECTION
Ranked queries are complementary to traditional SQL query semantics. The preferences expressed by the user are considered soft constraints of the queries, whereas the hit-or-miss query conditions, e.g., ≤, >, and ≠, are known as hard constraints [1], [2], [35]. To evaluate a ranked query $q$, the server computes, for each record, a linear combination of its attributes weighted by the user preference vector $f = \{\omega_1, \omega_2, \ldots, \omega_d\}$ (note that we use vector and weight interchangeably), and then returns to the user the top-k objects with the highest sums. Table I gives an example of a toy database with 2 attributes $A_1$ and $A_2$. Suppose the user expresses the preference $f = \{0.4, 0.6\}$ in a query and wants to pull the top 3 records from the server. Starting from the first record $r_1$, the server computes its preference score $score(r_1) = 0.4 \times 1.4 + 0.6 \times 4.4 = 3.2$, and likewise the scores of the remaining records. Among the computed scores $\{3.2, 2.7, 2, 3.8, 3.4, 2.1\}$, $r_1$, $r_4$, and $r_5$ are the top 3 records with the highest scores, and they are returned to the user. As this example shows, user queries are not protected, which means that the server can easily obtain the preference information of a user. In this section, we propose two approaches to prevent leakage of user preference, which are described below.
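The unprotected evaluation in this example can be reproduced with a short Python sketch. Only the attribute values of $r_1$ are taken from the text; for the other records the sketch uses the scores listed above directly, and the identifiers `r1` to `r6` are assumed to follow the order of that list.

```python
import heapq


def score(weights, record):
    """Linear preference score: sum of weight * attribute value."""
    return sum(w * a for w, a in zip(weights, record))


f = (0.4, 0.6)    # user preference vector, visible to the server
r1 = (1.4, 4.4)   # attribute values (A1, A2) of record r1
print(round(score(f, r1), 2))  # 3.2

# Scores of all six records as listed in the example.
scores = {"r1": 3.2, "r2": 2.7, "r3": 2.0, "r4": 3.8, "r5": 3.4, "r6": 2.1}
top3 = heapq.nlargest(3, scores, key=scores.get)
print(top3)  # ['r4', 'r5', 'r1']
```

The server needs the plaintext weights in `f` to compute every score, which is exactly the leakage that the two approaches in this section aim to prevent.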

A. The First Approach
Our first approach, called PD, is based on preference distortion. In this model we assume that no encryption is involved, i.e., the user queries and database content are all in plaintext. We defer the case in which encryption is employed to the next subsection.
Our strategy to protect the user's true preference is simple. Instead of directly submitting the true preference vector $f$ to the server, the user distorts $f$ by randomly adding a small number to, or subtracting it from, each component of $f$, and then sends the modified preference vector to the server. Specifically, given a preference vector $f = \{\omega_1, \omega_2, \ldots, \omega_d\}$ with $\sum_{i=1}^{d} \omega_i = 1$, the user produces a distorted vector $f' = \{\omega'_1, \omega'_2, \ldots, \omega'_d\}$ with $\sum_{i=1}^{d} \omega'_i = 1$. Suppose the increment/decrement added to component $\omega_i$ of $f$ is $\delta_i$. Based on the relation $\omega'_i = \omega_i + \delta_i$, we can easily verify the following equation:
$$\sum_{i=1}^{d} \delta_i = 0.$$
We outline the preference distortion in Algorithm 1. We generate $d$ random numbers and normalize them to maintain the convex property. These random numbers are sorted in descending order and assigned to the attributes in descending order of their original weights, so that the relative preference relation among attributes is maintained. For example, suppose we are given a vector $f = \{0.3, 0.5, 0.2\}$ and the generated random numbers are $R = \{0.4, 0.2, 0.8\}$. After normalization and sorting (with values rounded to one decimal place), we get $R' = \{0.6, 0.3, 0.1\}$, from which we get the distorted vector $f' = \{0.3, 0.6, 0.1\}$ and the increment/decrement vector $\Delta = \{0, 0.1, -0.1\}$. After distortion, the new vector still preserves the relative preference relation among attributes as in the original vector, but the actual values of the preference weights have changed.

Algorithm 1: Preference Distortion
input : Preference vector f
output: Distorted vector f', increment/decrement vector ∆
R = d random numbers; R' = ∅;
for rand_i ∈ R do
    rand'_i = rand_i / (rand_1 + ... + rand_d);
    Insert rand'_i into R';
end
Sort R' in descending order;
count = len(f);
j = 1;
while count > 0 do
    i = the index of the greatest weight in f not yet processed;
    ω'_i = R'[j]; δ_i = ω'_i − ω_i;
    Insert ω'_i and δ_i into f' and ∆, respectively;
    j = j + 1; count = count − 1;
end
Return f' and ∆;

Having obtained the distorted vector $f'$, the user sends it to the server for processing. At the server side the query processing is the same as before, and the query results and scores are sent back to the user. Since the scores do not reflect the true values for the original preference vector, the user has to revert the scores returned by the server. Consider the score of a record $r_i$ with respect to $f'$, $score(i) = f' \cdot r_i$, which can be represented as
$$score(i) = \sum_{j=1}^{d} \omega'_j \, r_i.a_j = \sum_{j=1}^{d} (\omega_j + \delta_j) \, r_i.a_j.$$
After rearranging the above formula we have
$$score(i) = \Big(\sum_{j=1}^{d} \omega_j \, r_i.a_j\Big) + \Big(\sum_{j=1}^{d} \delta_j \, r_i.a_j\Big).$$
It is clear that the first parenthesized part is the correct score with respect to $f$, and the second part is the noise artificially added by the distorted vector $f'$. Thus, for each $r_i$ of the returned top-k records, the user computes the noise by multiplying $\Delta$ with $r_i$ and then subtracts the noise from the returned score. After the reverting procedure, the user re-orders the resulting records according to the restored scores. We summarize the reverting procedure in Algorithm 2.

Algorithm 2: Score Reverting
input : Query result set S, the set Score, and the increment/decrement vector ∆
output: The restored score set Score'
Score' = ∅;
for Score(i) ∈ Score do
    Noise = δ_1 × r_i.a_1 + ... + δ_d × r_i.a_d;
    Score'(i) = Score(i) − Noise;
    Insert Score'(i) into Score';
end
Sort the records in S according to Score';
Return Score';

The preference distortion algorithm is simple and efficient for protecting user preferences; however, it may introduce false negative and false positive query results, as we will show in the next section.
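The distortion and reverting procedures can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the weights are assumed to sum to 1, the random numbers are drawn uniformly from $[0, 1)$, no rounding is applied, and the function names and the sample records are hypothetical.

```python
import random


def distort(f, rng=random):
    """Return (f_prime, delta): a rank-preserving random replacement of f."""
    d = len(f)
    rand = [rng.random() for _ in range(d)]
    total = sum(rand)
    # Normalize so the new weights sum to 1, then sort in descending order.
    r_sorted = sorted((x / total for x in rand), reverse=True)
    # Attribute indices ordered from the greatest original weight down.
    order = sorted(range(d), key=lambda i: f[i], reverse=True)
    f_prime = [0.0] * d
    for rank, i in enumerate(order):
        f_prime[i] = r_sorted[rank]
    delta = [wp - w for wp, w in zip(f_prime, f)]
    return f_prime, delta


def score(weights, record):
    """Linear preference score of one record."""
    return sum(w * a for w, a in zip(weights, record))


def revert(records, returned_scores, delta):
    """Subtract the noise delta . r_i from each returned score and re-sort."""
    restored = [s - score(delta, r) for r, s in zip(records, returned_scores)]
    order = sorted(range(len(records)), key=lambda i: restored[i], reverse=True)
    return [(records[i], restored[i]) for i in order]


f = [0.3, 0.5, 0.2]
f_prime, delta = distort(f, random.Random(7))
records = [(1.4, 4.4, 2.0), (3.0, 1.0, 5.0)]      # hypothetical records
returned = [score(f_prime, r) for r in records]   # what the server computes
restored = revert(records, returned, delta)
```

Reverting recovers the true scores only for the records the server actually returned; a record that belongs to the true top-k but was pushed out by the distorted weights is never seen by the user.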
Security Discussion. When the user sends the preference vector to the server, even if the vector is distorted, the server may try to infer the user's real vector from the returned top-k result; the higher the precision of the result, the higher the probability that the inference succeeds. Each weight of the user preference vector is a real number in [0, 1], and the server cannot know how many decimal places each weight in the real vector has. Moreover, all vectors within a small range produce the same top-k result. Therefore, when the server tries to infer the user preference vector from the top-k query results, there are countless candidates, given the unknown number of decimal places of each weight. In particular, the higher the dimensionality and the more decimal places the weights have, the lower the probability that the server infers the real vector.

B. The Second Approach
To further strengthen the privacy level of user's preferences, in this section we propose the second approach called HE, which is based on homomorphic encryption.

Algorithm 3: Preference Encryption
input : Preference vector f, the public key PK of the CKKS encryption system
output: Encrypted preference vector E(f)
E(f) = ∅;
for ω_i ∈ f do
    E(ω_i) = Encrypt(PK, ω_i);
    Insert E(ω_i) into E(f);
end
Return E(f);

After the preference vector has been encrypted, the next step is to compute the linear weighted sum for the data records in the data set and to select the k results with the top scores through comparison.
The comparison of ciphertexts is realized by the SIMD scheme [34]. In order to reduce the computational cost of operating on encrypted data, the server first performs a K-Skyband operation on the dataset $D$, and then only scores and compares the data records in the K-Skyband. Similar to K nearest-neighbor queries, a K-Skyband query [22] reports the set of points that are dominated by at most $K - 1$ points. For a scoring function with non-negative weights, a record dominated by $K$ or more records cannot score higher than any of them, so the top-$K$ result can be found within the K-Skyband. The definition of K-Skyband is as follows.
K-Skyband: An object $A_i \in D$ is said to be a K-Skyband object of $D$ if $A_i$ is dominated by at most $K - 1$ objects in $D$.
As can be seen from the definition, K-Skyband [22] is a variant of skyline in which $K$ can be regarded as the thickness of the skyline; the case $K = 1$ corresponds to a conventional skyline, since skyline objects are dominated by no other object. Thus, we implement K-Skyband by modifying the BBS algorithm. Based on this, the query process is shown in Algorithm 4.
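The K-Skyband definition can be illustrated with a naive Python sketch that counts dominators by pairwise comparison. It is not the BBS-based implementation used in this paper; it assumes that larger attribute values are better, and the function names are chosen for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: larger attribute values are better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def k_skyband(points, k):
    """Points dominated by at most k - 1 other points (quadratic-time scan)."""
    band = []
    for p in points:
        dominators = sum(1 for q in points if dominates(q, p))
        if dominators <= k - 1:
            band.append(p)
    return band


points = [(1, 1), (2, 2), (3, 3), (1, 4), (4, 1)]
print(k_skyband(points, 1))  # [(3, 3), (1, 4), (4, 1)]
print(k_skyband(points, 2))  # [(2, 2), (3, 3), (1, 4), (4, 1)]
```

With `k = 1` the result is the set of points that no other point dominates, and each increase of `k` admits points that lie one layer deeper.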

Algorithm 4: Query Evaluation
input : Encrypted preference vector E(f)
output: Result records and their encrypted scores
Score = ∅;
Compute the set Band_k of k-skyband points of D;
for ∀ r_i ∈ Band_k do
    E(score(i)) = Σ_{a_j ∈ r_i} E(ω_j) * a_j;
    Insert E(score(i)) into Score, and compare;
end
Send Score and Band_k to the user;

Security Discussion. This scheme encrypts the preference vector. The IND-CPA security of the FHE scheme ensures that an adversary who observes the ciphertexts and the results of homomorphic operations on them cannot extract any information about the encrypted messages. Our HE method is therefore secure based on the security of FHE, because the server can only access ciphertexts, and no information about the user preference is leaked.
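The encrypted scoring step, in which the server combines encrypted weights with plaintext attribute values, can be illustrated with a self-contained Python sketch. Note the substitution: instead of CKKS/HEAAN, the sketch uses a toy Paillier cryptosystem, which is additively homomorphic over integers, so weights and attributes are converted to integers with the hypothetical fixed-point scales `W_SCALE` and `A_SCALE`. The key is built from two small Mersenne primes and is far too small to be secure, and the encrypted comparison is omitted (the user decrypts and ranks the scores).

```python
import math
import random

# Toy Paillier key from two Mersenne primes; illustration only, NOT secure.
P, Q = 2 ** 31 - 1, 2 ** 61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)          # valid because the generator is g = N + 1

W_SCALE, A_SCALE = 1000, 10   # hypothetical fixed-point scales


def encrypt(m, rng=random):
    """Paillier encryption of a non-negative integer m < N."""
    r = rng.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = rng.randrange(1, N)
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Paillier decryption with the secret values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def encrypt_preferences(f):
    """User side: encrypt each weight separately (analogue of Algorithm 3)."""
    return [encrypt(round(w * W_SCALE)) for w in f]


def encrypted_score(enc_f, record):
    """Server side: E(sum_j w_j * a_j) computed from E(w_j) and plaintext a_j."""
    c = 1  # a trivial encryption of 0
    for ew, a in zip(enc_f, record):
        # Raising a ciphertext to a plaintext power multiplies the message.
        c = (c * pow(ew, round(a * A_SCALE), N2)) % N2
    return c


def decode_score(c):
    """User side: decrypt and undo the fixed-point scaling."""
    return decrypt(c) / (W_SCALE * A_SCALE)


enc_f = encrypt_preferences((0.4, 0.6))
print(decode_score(encrypted_score(enc_f, (1.4, 4.4))))  # 3.2
```

The server only ever handles `enc_f` and the resulting ciphertexts, so it learns the records it scores but not the weights applied to them.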

IV. EXPERIMENTAL RESULTS
In this section we conduct extensive experiments to evaluate the performance of the two proposed approaches. All experiments are conducted on a PC running Ubuntu 18.04.2 LTS. We discuss the experiment settings below.
Dataset. We use synthetic datasets with three data distributions, namely independent (uniform), correlated, and anti-correlated, with dimensionality $d$ varying in the range [2, 5] and cardinality $N$ in the range [1k, 20k].
Performance Indicators. Since our first approach, PD, is an approximate method, we focus on precision and recall, i.e., how many of the real results PD can find. Our second approach, HE, is an exact method, i.e., it finds all the correct results, so we use running time as its performance indicator. Precision and recall are defined as
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},$$
where $TP$ is the number of retrieved results that are actually positive, $FP$ is the number of retrieved results that are actually negative, and $FN$ is the number of positive results that have not been retrieved. Each precision and recall result is the average of 10 trials.
Suppose $Q(x)$ is defined as the number of data points in area $x$. Because every Top-k preference query in this paper returns exactly $k$ records, we have $Q(TP) + Q(FP) = k$ and $Q(TP) + Q(FN) = k$, and hence $Q(FP) = Q(FN)$. Therefore, precision and recall are the same.
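The equality of precision and recall for two result sets of the same size can be checked with a short Python sketch. The record identifiers below are those of the Table I example discussed later in this section (true top 3: $r_4$, $r_5$, $r_1$; distorted top 3: $r_4$, $r_5$, $r_3$); the function name is chosen for illustration, and both sets are assumed to be non-empty.

```python
def precision_recall(true_topk, returned_topk):
    """Precision and recall of the returned set against the true top-k set."""
    true_set, returned_set = set(true_topk), set(returned_topk)
    tp = len(true_set & returned_set)   # retrieved and truly in the top-k
    fp = len(returned_set - true_set)   # retrieved but not in the true top-k
    fn = len(true_set - returned_set)   # in the true top-k but not retrieved
    return tp / (tp + fp), tp / (tp + fn)


precision, recall = precision_recall(["r4", "r5", "r1"], ["r4", "r5", "r3"])
print(precision == recall)  # True: both equal 2/3 because both sets have size k
```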

A. Effect of Dimensionality d
We vary the data dimensionality $d$ from 2 to 5, with the cardinality $N$ fixed at 10k and $K$ at 50. As shown in Fig. 1, the average precision and recall on the independent (uniform), correlated, and anti-correlated datasets are basically stable and are not affected by the dimensionality. The average precision and recall are highest on the correlated dataset and lowest on the anti-correlated dataset.

B. Effect of K
In order to study the effect of $K$, we carried out experiments with $K = 10, 20, 50, 100$. Fig. 2 shows the precision and recall for each case. The query results are largely unaffected by the value of $K$.

C. Effect of Cardinality N
Fig. 3 shows the precision and recall when the cardinality $N$ varies in the range [1k, 20k]. The precision and recall values on the three types of datasets remain stable.
In conclusion, the precision and recall of our distortion algorithm depend little on $K$, the data dimensionality, and the dataset cardinality, so the distortion scheme is practical across these settings. In addition, it achieves good precision and recall on correlated datasets but poor results on anti-correlated datasets.

D. Discussion
From the above experimental results, it can be seen that the precision fluctuates. This is due to the existence of false negative and false positive points, whose number is uncertain.
For ease of discussion, we use the example database in Table I, and assume that the data space ranges for attributes $A_1$ and $A_2$ are $x = [0, U_x]$ and $y = [0, U_y]$, respectively, and that $k = 3$ for the returned top-k results. Now consider an original vector $f = \{\omega_1, \omega_2\}$ and the corresponding transformed vector $f' = \{\omega'_1, \omega'_2\}$. After examining each vector against the records in $D$, the qualified records (represented as data points in 2-D space) are highlighted in Fig. 4(A). Specifically, the top 3 records with respect to the original vector $f$ are $r_4$, $r_5$, and $r_1$, whereas the answers for the distorted vector $f'$ are $r_4$, $r_5$, and $r_3$. There is a false positive result, $r_3$, and a false negative, $r_1$, due to the weight distortion.
To compare the original and distorted preference vectors, one has to quantify the number of false positive and false negative results. As depicted in Fig. 4(B), the larger striped area (denoted by $FN$) contains the false negative points, and the smaller striped area (denoted by $FP$) contains the false positive points. The areas of $FN$ and $FP$ can be calculated by integrating the difference between the lines defined by the two vectors over the corresponding ranges of $x$, i.e., by integrals of the form
$$\int (f - f')\,dx. \quad (6)$$
Under the assumption that the data records in $D$ are uniformly distributed, we can estimate the number of false negative and false positive results as $N \times FN$ and $N \times FP$, respectively, where $N$ is the cardinality of $D$ and the areas are expressed as fractions of the data space. The computation of $FP$ and $FN$ becomes complicated for high-dimensional data. However, one may resort to techniques such as the Monte Carlo method to approximate the integrals in the high-dimensional case. This is a future direction of our work.
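A Monte Carlo estimate of the two regions could look like the following Python sketch. It is only an illustration of the idea mentioned above, under several assumptions that are not part of this paper: the data space is the unit hypercube, the qualifying region of each vector is the half-space where the score reaches a given cutoff (`t` for $f$, `t_dist` for $f'$), and the function and parameter names are hypothetical.

```python
import random


def estimate_fn_fp(f, t, f_dist, t_dist, samples=200_000, seed=0):
    """Estimate the FN and FP regions as fractions of the unit hypercube."""
    rng = random.Random(seed)
    d = len(f)
    fn = fp = 0
    for _ in range(samples):
        x = [rng.random() for _ in range(d)]
        in_true = sum(w * a for w, a in zip(f, x)) >= t
        in_dist = sum(w * a for w, a in zip(f_dist, x)) >= t_dist
        if in_true and not in_dist:
            fn += 1   # qualifies under f but is missed under f'
        elif in_dist and not in_true:
            fp += 1   # qualifies under f' but not under f
    return fn / samples, fp / samples


# Sanity check with a known answer: the regions x1 >= 0.5 and x2 >= 0.5
# differ in two quarter-squares, so both fractions should be near 0.25.
fn_frac, fp_frac = estimate_fn_fp((1.0, 0.0), 0.5, (0.0, 1.0), 0.5)
print(fn_frac, fp_frac)
```

Multiplying the estimated fractions by the cardinality $N$ gives the expected numbers of false negatives and false positives under the uniform-distribution assumption. The same code works unchanged for higher dimensions, where the exact integrals are harder to evaluate.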
When K in K-Skyband is large, the data to be calculated decreases relatively, but the running time still increases steadily with the increase of dimensionality.

V. CONCLUSION
To protect the privacy of user preferences, we propose two approaches for ranked queries. The first method, called PD, hides the user's real preference information by introducing perturbation into the user's ranked query vector. This scheme leads to a slight degradation of precision in the query results. The experimental results also show that PD achieves the best performance on datasets with correlated distribution, whereas it performs relatively poorly on datasets with anti-correlated distribution. Therefore, PD is suitable for real-time ranked query processing scenarios with lower accuracy requirements for the query result. To obtain exact query results, a homomorphic encryption-based method called HE is proposed to encrypt the preference vector, which enables the third-party server to process ranked queries by computing on ciphertexts, so as to fully protect the privacy of user preferences.