Permutation of Web Search Query Types for User Intent Privacy

Privacy remains a major concern when using search engines to find for information on the web due to the fact that search engines own massive resources in preserving search logs of each user and organizations. However, many of the present query search privacy practices require the very same search engine and third party to collaborate, making privacy even more difficult. Therefore, as a contribution, we present a heuristic, permutation of web search query types, a noncryptographic heuristic that works by formation of obfuscated search queries via permutation of query keyword categories. Preliminary results from this study show that web search query and specific user intent privacy might be achievable from the user side without involvement of the search engine or other third parties by the permutation of web search query types. Keywords—Web search query privacy; user intent privacy; search engines; Information Retrieval


INTRODUCTION
Privacy remains a major concern when using search engines to search for information on the Internet due to the fact that search engines own considerable resources in keeping user and organization search logs.However, many of the present query search privacy practices necessitate the very same search engine and third party applications to collaborate, making web search privacy a challenge.As illustrated in Figure 1, many web search query privacy techniques require collaboration with a search engine or third party cryptographic system.In our proposed heuristic, as shown in Figure 2, the user has full control in obfuscating their search intent without the need of a third party.In this paper, we present a heuristic, permutation of web search query types, a non-cryptographic heuristic that works by formation of obfuscated search queries via permutation of query keyword categories.We also make a distinction between concept user intent and specific user intent, with the goal of giving users privacy controls over their specific user search intent without involvement of third parties.Concept user intent is concerned with general aspects that users search for, while specific user intent is concerned with the specific item the user intends to search for.Preliminary results from this study show that web search query and specific user intent privacy might be achievable from the user side without involvement of the search engine collaboration or other third parties by the permutation of web search query types.In this proposed heuristic, users initially generate obfuscated queries based on permutations of different query keyword types.The generated permutated queries are then combined with the original search terms to search for information in the search engine at the same time.The rest of the paper is organized as follows.In Section 2, we discuss the background and related work.In Section 3, we outline the suggested heuristic.In Section 4, we discuss preliminary results.In Section 5, a conclusion and future works is given.

II. BACKGROUND AND RELATED WORK
A web search engine: This is a software application normally hosted by search engine organizations and used mainly to search for information and index documents on the web [1][2] [3].Web search queries: These are words and phrases input by a user into a search engine to retrieve relevant indexed documents stored by the search engine [4].There are three main web search queries categories [5] [6] [7]: Informational queriesweb search queries that deal with general subjects, and retrieve large related result numbers, e.g."Trains" and "Tourism".Navigational queriesthese are queries that deal with searching for a specific website or webpage e.g."Twitter" and "Yahoo Movies".Transactional www.ijacsa.thesai.orgqueriesthese are queries in which the user seeks to make an online action like buy an item, stream music, or watch a movie; e.g., "Buy airline tickets".Web search queries in most cases are always concise, imprecise, contain subtopics, and can be viewed as in two major categoriesfaceted and ambiguous queries [6]: Faceted queries: Faceted queries can be comprised of subtopics, but are non-ambiguous, precise, and return particular and relevant results.Ambiguous queries: these types of queries typically have more than one denotation, and so the search engine returns results that might not be pertinent to the user.
Data privacy: This is the process in which individual or entity information is protected against unauthorized disclosure.In this case, user intent, which is the real purpose for an individual or entity issuing a query on a search engine, could be considered private information [8].
Single-party privacy search: Single-party privacy search is a privacy technique that works by permitting users to generate their own public profiles with need to make changes on the server side, by using cataloged topics of interests, generating false queries that are amalgamated with real queries, and implementing all produced queries simultaneously in the web search engine [9].The phony queries are produced using a knowledge-base, such as, the open directory project, to interpret the singular intent based on the sematic distance between the false and real intent in the query outcomes [9].We chart a parallel methodology to the single party privacy search, nevertheless, in our heuristic, focus is placed on permutation of topical query keywords in the production of disguised queries and the pseudo-user profile in this case could be created via deflecting URL clicks produced from retrieved search results.
Web search query disambiguation: This is the process in which query search terms undergo reformation and refinement to eliminate any ambiguity so as to better predict user intent in order to and retrieve highly relevant search results for o the user [10].
Web search query reformation: Similar to query disambiguation, reformation is the process in which search engines use query enhancement methods to modify and accurately capture user intent.The reformatted query is then presented as an alternative to the user in replacement for what the search engine perceived as ambiguous.It is important for privacy practitioners to note that search engines store both the original query issued by the user and the reformed query selected by the user, to correct future errors, typos, and for adjustments of the query for personalized results.Although search engines could yield improved and more unambiguous search results for the web user, search query reformation can be viewed as vulnerability against web search query obfuscation [11][12][13] [14].
Precision and recall: these are the two main measures employed by search engines to calculate the efficiency of web search queries in regards to retrieved relevant documents.The value for these measures is between 0 and 1, with 1 being the best value for both precision and recall [10].The formal expression for precision and recall are: The average precision metric: can be employed to measure how efficient the obfuscated web search queries are by measuring the precision and relevance of documents returned to a user.Suggested by Turpin and Scholer (2006), the average precision (AP) quantifies the efficiency of search queries in retrieving relevant documents and is formally expressed as follows [15]: The symbol returns a value of 1 if the retrieved document is relevant otherwise a 0 is returned.k symbolizes the amount of retrieved items, that is, the top-k retrieved items.
Plausible deniability: Plausible deniability search is web search query privacy process in which a set of k-1 dummy queries with attributes analogous to the original but on dissimilar subjects, are produced and used to obscure original queries [16].Plausible deniability search demands that every unprecedented query be replaced with a consistent but analogous dummy query with the intention to retrieve outcomes very comparable to those projected from the original query.Any subsection of k dummy queries will generate statistically indistinguishable outcomes to a corresponding unprecedented set of k queries [16].The produced dummy queries are implemented at the identical time to cover the intent of the user, creating a difficulty in detecting which precise query was deliberated on by the user; leaving the burden of proof to the search engine to ascertain which query fits a particular user [16].Formally plausible deniability search is expressed as follows [16]:

"The conditions for plausible deniability privacy (PD-Privacy) are realized if: (i) the user can show that any query would have produced the set with the same likelihood as ; (ii) all are on distinctive subjects; (iii) all
are likely similar to the authentic query.Where * +, the set of queries at time , kept in a log on a server; , is the authentic query by the user; is the set of dummy queries similar to the authentic query".Furthermore, plausible deniability necessitates that at the query implementation phase, all queries must be correlated to the leading theme but with low Euclidean distance in the semantic space [16].Although our proposed heuristic meets some benchmarks of plausible deniability search, our methodology deviates from the standard plausible deniability search approach by producing dummy queries based on the query type instead of the query topic.We focus on creating dummy query keywords that comprise a permutation of navigational, informational, transactional, temporal, and natural language processing query keywords.The query keywords could be correlated or dissimilar to the unprecedented query but are joined together with the unprecedented query keywords during query implementation www.ijacsa.thesai.orgphase.Each permutation is expected to produce fluctuating outcomes.

III. METHODOLOGY
Permutation of Web Search Query Types Heuristic: In this section, we discuss how permutation is used in our suggested heuristic after identifying the major web search query categories.The typical search query groups as observed in literature is used in this proposed heuristic in the following way [5] [6] [7]: Navigational queries symbolized as N. Informational queries symbolized as I. Transactional queries symbolized as T. Natural language processing queries symbolized as L. Temporal queries symbolized as P.For effectual privacy, the user should create a set of dummy queries that comprise, navigational, informational, transactional, natural language processing, and temporal search query types.The suggested set of search query formation will be a permutation of items in set Q = {NITLP}.This entails that any obfuscated query formation will comprise a permutation of N, I, T, L, P keywords.Formally the quantity of permutations of any k items can be calculated as follows [17]: Where n is the number in set Q, and k is the quantity of permutations for any k items.Assuming the set Q = {NITLP}, the quantity of permutations of the five objects in the set Q at an occurrence is: ( ) Consequently the quantity of permutations from the five objects in set Q produces 120 arrangements of conceivable obfuscated query formations.For the preliminary experiment done suggested heuristic, not every permutation of each k or all the k=n (where n=5) items was used in the query implementation phase.Nonetheless, an assortment was done from the k=1, k=2, k=3, k=4 and k=n permutation categories.The subsequent permutations were chosen for the preliminary experiment to examine the hypothesis: Q p ={I, IT, IP, TP, IL, NI, NIT, NIP, IPL, ITP, NITP, ITPL, NIPL, NITL, NITPL}.Because order is essential to the query formation, permutation is employed as core for producing several query sets.Although the permutations are determinate, the incentive is that for each search term that necessities obfuscation, the user has the prospect to produce permutations (n!) of obfuscated queries that are problematic for the search engine to disambiguate immediately.This is with the assumption that the search engine does not know apriori what that particular user intent is and how that same user intends to construct their search queries.A sample of the obfuscated queries using the suggested heuristic is shown in Figure 3.

IV. EXPERIMENT AND PRELIMINARY RESULTS
The experiment: The aim of the experiment was to disguise queries using the suggested search heuristic of permutation of web search query types, namely, transaction, navigational, temporal, and informational queries.In the experiment, obfuscation of web search queries is done for a hypothetical entity that wants to unambiguously "buy a Toyota" automobile.In the experiment, real search queries are embedded in a set of distracting keywords but keeping with the permutations as suggested in the heuristic.The goal is that retrieved outcomes should allow the hypothetical user to read or surf Toyota related web snippets but without letting the search engine know the true intent of the search.The intention of the experiment was to apply the suggested heuristic to obfuscate the search query, "Buy Toyota", and explicit intent that the user wanted to "buy a Toyota".In the experiment, 121 search queries were produced after the permutation process.Each created query permutation was then implemented simultaneously with the original search query in the search engine.In the experiment, the user is logged into their browser/search engine account.The aim is that the browser/search engine logs the user query activity and generates a browsing and search history that further helps serve as a decoy and make it difficult for the search engine to decipher user intent.In this experiment, the Google search engine and Chrome browser were employed.A sum of 12,177 Google articles (snippets) was retrieved and analyzed using text mining tools for relevant and non-relevant documents.www.ijacsa.thesai.orgDocuments that were viewed as relevant contained phrases that were related to "Buy Toyota".Non-relevant articles were those that did not contain any phrase related to "Buy Toyota" and were viewed as diversionary.
Preliminary results: The examination of results from the experiment was to show the number of retrieved documents that were considered relevant.In Figure 4, retrieval search outcomes from the obfuscated queries are shown with overall 121 queries formulated and implemented.The amount of retrieved and relevant documents (snippets) is presented on the y-axis with each query (Q1 to Q121) corresponding on the y-axis.Additionally Figure 4 shows a summary of how the obfuscated web search queries performed as compared to the number of retrieved relevant results.The x-axis shows the obfuscated queries Q1 to Q121, whereas the y-axis depicts the quantity of retrieved relevant documents.The two series illustrated in the graph show the documents with the top series showing overall retrieved documents whereas the lower series depicts the relevant documents.In this experiment, the key phrase "Toyota" was employed in selecting all documents that were viewed as relevant.For example, for the query Q1, 126 overall documents were retrieved but only ten documents were relevant, that is, only documents that comprised the key search phrase, "Toyota".The other residual non-relevant documents are associated to the dummy queries and deflecting key phrases employed to obfuscate the main intent and key search phrase, "Toyota".The specific intent of the user was to intentionally buy a Toyota.Yet deprived of any encryption methods, this experiment pursued if it were achievable to obfuscate particular user intent and yet retrieve relevant documents to the user.As shown in Figure 4, the amount of retrieved documents was cut off at an average of 100 documents per query.In this experiment, the top-k articles were chosen, with k being an average of 100 articles.The rational was grounded on research that the average user is inclined to glance the first page of the search engine results and browse the first 20 top articles (snippets) [18][19] [20][21].In certain occasions, particular search engine outcomes were less than 100.In such situations, owing to the type of the query formation employing a set of key words to obscure the intended key search term, retrieved results were abridged with demands to reformulate the query, a technique that Google and other search engines do in order to have the user to enter a more "correct" search term.However, query reformation was evaded in this experiment; instead focus was placed on observing the effect of the obfuscation in terms of the retrieved relevant documents.The other observation from Figure 4 is that some queries retrieved larger amounts of relevant documents in comparison with other queries.For example, queries Q1 and Q4 retrieved 126 and 106 documents correspondingly.However, only 10 and 17 were considered relevant for each of those queries.
Meanwhile, for queries Q22 and Q50, a total of 105 and 107 documents were retrieved correspondingly.Yet only 74 and 82 were considered relevant documents for each respective query.One reason for such an effect in the various numbers of retrieved documents is that the obfuscated search query formation, in this case, could be viewed as an adjustable parameter.This is to say that the amount of obfuscation during query formation affects the number of relevant articles retrieved in context of the user's intended original query.Therefore, the reverse is also true; in that less obfuscated the query is the more retrieved relevant results one generates.Additionally, the low correlation value shown in Table 1, among the retrieved and relevant results, is at 0.21 signifying that there exists little relation.Yet this correlation value could be viewed, as a signal that the obfuscation methods used in the making of the web search query could be effectual.Although the low correlation value could be a good gauge for a better privacy, the difficulty of usability still remains.The relevancy of retrieved documents to the user in regards to the obfuscated queries remains essential to the user.Moreover, as shown in Table 1, the null hypothesis that the entire retrieved articles are relevant is rejected given that the Chi-square shows the p-value < 0.05.However, caution should be taken when reading these results.Although the pvalue is less that 0.05 and could signify that the permutation of the web search query type heuristic is effective in granting obfuscation and confidentiality for specific user intent, other factors such as user clicks on URLs and advertisement links could still reveal the specific user intent to a search engine.Furthermore, we are only presenting our suggested heuristic and preliminary results.More extended experiments need to be done that will take the user clicks on URLs and advertisement links into consideration and will be done as part of our future works.
Privacy verses usability: Furthermore, the type of query permutation determines the retrieved relevant results.This highlights the challenge of finding equilibrium between privacy and usability while showing the need to consider necessary trade-offs.To highlight this challenge, in Figures 6, 7, 8, and 9, four queries, Q12, Q27, Q22, and Q102 are chosen two with the least average precision values two with the highest average precision values as shown in Figure 5.The precision and recall values are depicted in each graph; the xaxis shows the amount of retrieved articles, while the y-axis shows the precision and recall values.The recall values continuously move near the value 1, since precision is computed basing on the top-k retrieved articles.Hence the average precision is that value at which the recall is 1.Thus for queries Q12, Q27, Q22, and Q102, the average precision values are 0.376, 0.342, 0.735, and 0.916 correspondingly as illustrated in Figure 5. www.ijacsa.thesai.orgConsidering the average precision metric, the values vary between 0 and 1, where 1 as a signal for better performance.The queries Q12 and Q27 outcomes show low average precision values but effective obfuscation, which can be attributed to the level of permutation.This is additionally seen in Figure 4, where the retrieved articles for Q12 and Q27 are 148 and 104 correspondingly, yet the relevant articlesarticles that comprised the keyword "Toyota", for Q12 and Q27 recorded values 31 and 27 in that order.It could be reasoned that this is a signal of good obfuscation due to a robust permutation set being chosen.However, this only brings the question of usability to the forefront.Retrieving articles that are of no meaning in regards to what the user is querying, does grant some level of user intent privacy yet the usefulness of such outcomes remains challenging.The permutation of web search query types: In the case of the permutation of web search query types heuristic, the aim is to take advantage of the distortion caused by the different permutation of the query keywords to conceal user intent but at the expense of disregarding useful retrieved links associated with the original "Toyota" keyword.Yet still, one significant aspect that our preliminary results reveal is that various permutations of the web search query type have an effect on retrieved relevant results.For example the query Q12 is of the formation, "Acquire Toyota 2014 get Samsung Obtain shoes.comwestern civilization"; Q27 is of the formation, "I want a Toyota, get Samsung phone, shoes.com,Nairobi Kenya, influence".Q12 and Q27 do yield the least values in regards to relevant documents retrieved and the average precision.Though the main intended key search term by the user, "Toyota", is in the first segment of the query text, the search engine appear incapable of deciphering the obfuscation and retrieve the most relevant documents associated to "Toyota".Yet this result continues beyond queries Q12 and Q27.Meanwhile queries Q22 and Q102 returned results of relevant retrieved documents at 74 out of 105 and 105 out of 111 correspondingly.The average precision results for queries Q22 and Q102 respectively returned values 0.735 and 0.916.The permutation of the web search query type NIT and IP were used for queries Q22 and Q120, while NITP and NITPL were used for Q12 and Q27 in that order.The query Q22 has the formation "Samsung Obtain shoes.comBarack Obama Toyota", the keyword "Toyota" is positioned at the end of the query text.Nevertheless, for query Q102, the formation is instinctive, "Toyota 2014".Queries Q22 and Q102 yield the highest retrieved relevant documents and uppermost average precision values, with query Q102 returning a significant average precision value of 0.916.Nonetheless, while queries Q22 and Q102 yield some of the best results in terms of retrieved relevant documents, such yields come at the expense of privacysignifying low user intent privacy.Finally, the process of the query formation indubitably plays an essential part in the retrieval of relevant documents, making it a challenge as to which query formation or permutation of the query type yields the most optimal results in context of privacy and usability.

V. CONCLUSION AND FUTURE WORK
Preliminary experimental outcomes from this study indicate that the permutation of web search query types heuristic might be an effective means of providing obfuscation and privacy for user intent.Preliminary outcomes indicate that www.ijacsa.thesai.orgweb search query and intent privacy might be achievable from the user side using non-cryptographic means such as the suggested permutation of web search query types heuristic, without the requirement of third parties.However, the privacy versus usability challenge remains suggesting that more study is necessary on the formation of obfuscated web search queries in the context of both optimal privacy and usability.The practice of retrieving relevant documents yet preserving satisfactory levels of privacy also remain a challenge and necessitates a more comprehensive study that takes into consideration all aspects of security.Another issue is that trade-offs would be obligatory between user privacy and usability needs; yet finding optimal trade-off is another challenge.For instance, web search queries with lower levels of obfuscation might yield higher relevant retrieved documents but at the expense of revealing user intent and undermined web search privacy.Finally, it is essential that web search users and privacy custodians give considerable attention to the resource capabilities of search engines such as, computational, storage, query disambiguation, and semantic processing when employing any query obfuscation techniques.Search engines continuously analyze web search queries with the objective of decoding user intent by separating dummy from real queries.Consequently, a constant revision of search query obfuscation and permutation of query search types techniques would have to be done to prevent classification attacks, a topic to be investigated as part of our future.We intend to consider application of the suggested heuristic on big data and enterprise systems with multiple logged in users, simulating a real world scenario.

Fig. 1 .
Fig. 1.Third party web search privacy system

Fig. 3 .
Fig. 3.A Sample of Obfuscated Queries using the Permutation Heuristic

TABLE I .
CORRELATION -RETRIEVED VS.RELEVANT