Special Negative Database (SNDB) for Protecting Privacy in Big Data

Despite its importance, big data faces many challenges, the most important of which are data storage, heterogeneity, inconsistency, timeliness, security, scalability, visualization, fault tolerance, and privacy. This paper concentrates on privacy, which is one of the most pressing issues with big data. As mentioned in the Literature Review below, there are numerous methods for safeguarding privacy in big data. This paper introduces an efficient technique called Special Negative Database (SNDB) for protecting privacy in big data. SNDB is proposed to avoid the drawbacks of previous techniques. It is based on deceiving bad users and hackers by replacing only the sensitive attribute with its complement, so that a bad user cannot differentiate between the original data and the data after applying the technique.
Keywords—Big data; big data challenges; privacy violations; privacy-preserving techniques; special negative database; data integrity


I. INTRODUCTION
One of the most pressing challenges in big data is data privacy. Patients' data must be kept private, since personal information may be exposed and improperly used when data from multiple sources is combined. With respect to privacy, every person has the right to select the extent of his or her interaction with the environment, as well as the amount of data that can be accessed by a third party. In security issues it is sufficient to protect information with, for example, a "password", since security is between two trusted parties; in privacy issues, however, the service provider (SP) itself may be an adversary. Based on a literature review, we classified likely privacy violations in big data systems into four categories: data breaches, re-identification attacks, information gathering by service providers, and government tracking. The motivation of this manuscript is the importance of preserving privacy for everyone, especially when dealing with big data. A further motivation is the drawbacks of previous techniques, namely time consumption, loss of data integrity, increased data size, a low level of privacy, and high complexity, which led the author to propose a new technique called SNDB that avoids them. The next section reviews the literature on previous techniques and their drawbacks. The third section introduces the proposed technique and the contribution of the manuscript. The fourth section introduces the datasets used with the proposed technique. The fifth section discusses the results and evaluates the proposed technique in comparison with other techniques. Finally, the conclusion and future work are presented [1], [2], [3], [4].

II. LITERATURE REVIEW
A. Privacy Preserving by Slicing
Slicing is a method of dividing a dataset both vertically and horizontally. Vertical partitioning divides attributes into columns based on their correlations, which lets slicing handle high-dimensional data. Horizontal partitioning groups records into buckets, and the values in each column are randomly permuted within each bucket to disrupt the relationship between columns. Slicing thus breaks the links between columns while preserving the associations within each column [5].
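The slicing idea described above can be illustrated with a minimal Python sketch. The function name `slice_table`, the column grouping, the bucket size and the sample records are all assumptions made for illustration; they are not taken from [5].

```python
import random

def slice_table(rows, column_groups, bucket_size, seed=0):
    """Slice a table given as a list of dicts.

    column_groups: tuples of correlated column names (vertical partitioning).
    bucket_size:   number of records per bucket (horizontal partitioning).
    """
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        projections = []
        for group in column_groups:
            # Values inside one column group stay together ...
            part = [tuple(row[col] for col in group) for row in bucket]
            # ... but each group is permuted independently within the bucket.
            rng.shuffle(part)
            projections.append(part)
        # Re-assemble records from the independently permuted column groups.
        for parts in zip(*projections):
            new_row = {}
            for group, values in zip(column_groups, parts):
                new_row.update(zip(group, values))
            sliced.append(new_row)
    return sliced

# Hypothetical sample records.
rows = [
    {"age": 22, "sex": "M", "zip": "47906", "disease": "flu"},
    {"age": 25, "sex": "F", "zip": "47905", "disease": "asthma"},
    {"age": 52, "sex": "F", "zip": "47302", "disease": "diabetes"},
    {"age": 60, "sex": "M", "zip": "47304", "disease": "flu"},
]
groups = [("age", "sex"), ("zip", "disease")]
sliced = slice_table(rows, groups, bucket_size=2)
```

Within each bucket the (age, sex) pairs and the (zip, disease) pairs are preserved, while the link between the two groups is randomized.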
B. Privacy in Big Data Generation Phase
1) Access restriction: Advertisement blockers, encryption methods, anti-tracking extensions, anti-virus software and anti-malware software are used to limit access to sensitive data [6].
2) Falsifying data:
• Sockpuppet is a deception-based method of masking an individual's internet identity [6].
• Users can use MaskMe to establish aliases for personal information such as their credit card number or email address [6].
C. Privacy in Big Data Storage Phase
1) Attribute based encryption (ABE): ABE is a cloud storage encryption technique that assures big data privacy. The data owner defines the access policies in ABE, and data is encrypted according to those policies. Users whose attributes match the data owner's access requirements can decrypt the encrypted data [6].
2) Identity based encryption (IBE): IBE is used to simplify key management in a certificate-based public key infrastructure (PKI) by employing personal identities as public keys, such as an IP address or an email address, to maintain sender and receiver anonymity [6].
3) Homomorphic encryption: By calculating directly on the encryption of a message, it is possible to obtain the encryption of a function of that message [6].
4) Storage path encryption: The huge amount of data is first divided into numerous sequential parts, and each part is then saved on a distinct storage medium controlled by several cloud storage providers [6].
5) Usage of hybrid clouds: The inherent qualities of public clouds, such as scalability and processing capacity, are combined with the inherent features of private clouds, such as security, to open up possible research opportunities in the processing and storage of enormous amounts of data [6].

*ta1284@fayoum.edu.eg www.ijacsa.thesai.org

D. Privacy in Big Data Processing Phase using Anonymization Techniques
1) Generalization: Some values are replaced with a parent value from the attribute's taxonomy. For example, a job attribute might be represented by artist rather than singer or actor [6].
2) Suppression: Some values are replaced with a special character (e.g., "*") to indicate that the original value is not disclosed. Value suppression, record suppression and cell suppression are examples of suppression schemes [6].
3) Anatomization: Rather than changing the quasi-identifier or sensitive attributes, anatomization breaks the connection between the two [6].
4) Permutation: The data is divided into groups and the sensitive values are rearranged within each group, which de-associates the quasi-identifier from the numerical sensitive attribute [6].
5) Perturbation: The actual data values are replaced with generated values such that statistical information computed from the modified data is similar to that computed from the original data [6].
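Three of the anonymization operations listed above (generalization, suppression and permutation) can be sketched in a few lines of Python. The taxonomy, the function names and the sample records below are hypothetical and serve only to make the definitions concrete.

```python
import random

# Hypothetical taxonomy for a job attribute: child value -> parent value.
JOB_TAXONOMY = {"singer": "artist", "actor": "artist"}

def generalize(value, taxonomy):
    """Replace a value with its parent in the taxonomy, if it has one."""
    return taxonomy.get(value, value)

def suppress(value, keep=0):
    """Keep the first `keep` characters and mask the rest with '*'."""
    text = str(value)
    return text[:keep] + "*" * (len(text) - keep)

def permute_sensitive(records, sensitive, group_size, seed=0):
    """Shuffle the sensitive values within each group of records."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # work on copies, keep the input intact
    for start in range(0, len(out), group_size):
        group = out[start:start + group_size]
        values = [r[sensitive] for r in group]
        rng.shuffle(values)
        for record, value in zip(group, values):
            record[sensitive] = value
    return out

# Hypothetical sample records.
records = [
    {"zip": "47906", "salary": 3000},
    {"zip": "47905", "salary": 4500},
    {"zip": "47302", "salary": 5200},
    {"zip": "47304", "salary": 6100},
]
permuted = permute_sensitive(records, "salary", group_size=2)
```

For example, `generalize("singer", JOB_TAXONOMY)` yields `artist`, and `suppress("47906", keep=2)` yields `47***`.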

1) Privacy laws and regulations:
Laws and regulations have helped protect privacy by limiting government tracking and by restricting the reading, analysing, and publishing of users' personal information. Laws can also compel service providers to put in place the safeguards necessary to protect data confidentiality and prevent data theft, which strengthens privacy protection by preventing privacy violations [7].
• Honeypots and other espionage devices.
• Firewalls and other preventative measures.
• Malicious behavior is also detected via access logs and alert systems.
• Mechanisms for encrypting data.

F. Foggy Dummies
This approach is used in fog computing, and the fundamental idea is to create highly intelligent dummies to preserve the user's privacy. The fogs cooperate to swap requests among themselves before sending them to the service provider, and then swap the responses back [9].

G. Blind Third Party (BTP)
The essential question is why we must rely on a third party (TP) to keep the user safe from the SP, since doing so merely transfers the problem from one server to another. This strategy depends on the fog's role as a middleman between the user and the SP in each location [9].

H. Double Foggy Cache
The primary idea behind this method is to use traditional cooperation to tackle the problem of peer trust and, furthermore, to use the SP to preserve privacy. This strategy, in particular, can be seen as a significant advancement in the field. To accomplish this, two caches are placed in the fog to operate as intermediaries between peers: the first is for queries, while the second is for responses [9].

I. Secured Map Reduce Model (SMR)
SMR is a suggested paradigm that adds a privacy layer between the Hadoop Distributed File System (HDFS) and the Map Reduce (MR) layer. Processing begins with the collection of data from social media, weblogs, and streaming sources, which is then delivered to HDFS. As the data passes through the map-reduce phase, the new layer applies security techniques to each individual piece of data. The security technique should be a simple encryption scheme, so that its complexity does not interfere with the fundamental functioning of big data processing. Data processed through the SMR layer can also be stored securely. Randomized procedures and perturbation are employed to strengthen the data's privacy [10].

J. Blind Peer Approach
This technique fixes the fundamental flaw in the prior technique, in which the blind third party may collude with the service provider to infringe on users' privacy. The new notion in the blind peer (BLP) strategy is to rely on collaboration with a large number of peers rather than dealing with a single TP. The user's request is encrypted with SPPK and sent to another peer in the same area, giving the other peer no choice but to pass the query on to the SP, which decrypts and resolves it [11].

K. Integrated Blind Parties (IBPs)
By integrating BTP and BLP, the IBPs strategy raises the level of privacy while removing the disadvantages of the other seven options. When a peer isn't active in the area, the user can only rely on the BLP in this case. Furthermore, in the event of a resource shortage, the user might exchange the query with another peer without encrypting it. In that circumstance, the peer can perform the BTP strategy rather than the user. This strategy can be used in any of the seven techniques [11].

L. Negative Database Conversion Algorithm
Instead of a single tuple, the negative database conversion technique generates a large set of values, and the generated data sets are inserted into the database. In contrast to normal database applications, a malicious request to the negative database will be unable to access the database's data. The term negative is used because fake sets of data are fabricated in contrast to the original data. Both database encryption algorithms and virtual database encryption are used to encrypt the actual data [12].

M. Negative Database and Generic Database
The Entity, Attribute, and Value (EAV) model of generic database design was evaluated using the blob data storage type. As the name implies, EAV is made up of three parts: entities, attributes, and values. The collected data, such as exam results, is organised into three columns: Entity, Attribute and Value. The most straightforward approach to apply this principle is to create three tables for each data input (entity). There are two ways to implement the negative database concept: one generates negative data statically, i.e., the system administrator defines the negative data; the other generates negative data both statically and dynamically, i.e., the user generates the dynamic negative data [13].

N. Enhanced BTP
In this enhanced approach, a new factor, a unique token, is added to the old BTP, so that the new technique consists of seven factors. A unique token is defined when the user sends a hidden code within a query to the service provider (SP), while the SP returns the previous query token. The SP then stores the token for each ID generated by the third party, so that a previous token cannot be used in a later query. When the third party queries the SP, the user's token changes, and the user discovers the unauthorized access to his data by the third party. The proposed technique is therefore a strong guarantee that no breach has occurred [14].

O. Light Weight Cryptography Techniques (LWCT)
Based on an oil spill detection application, LWCT is utilized to secure a data transmission framework for the internet of things. Through locative and boundary value aggregation, this strategy eliminates duplicate data transmission. The suggested method protects data transfer by combining known lightweight cryptographic techniques with simple ID-based authentication [15], [16].

P. Block Nested Loop (BNL) Skyline Algorithms
This method is used to determine which encryption algorithm is best for ensuring data protection and privacy issue. The author of the Skyline algorithm considers two primary parameters: the rate of variation and the number of dimensions [17].
The author summarizes the important drawbacks of previous techniques in the following factors: time consumption, loss of data integrity, increased data size, low level of privacy, and high complexity.
The next section introduces the major contribution of this manuscript. The author proposes a new technique, the special negative database, which is based on the negative database and on the deception of bad users or hackers. The contribution can be summarized as enhancing privacy preservation in big data while avoiding time consumption, loss of data integrity and complexity, and without violating other big data challenges such as size, fault tolerance, timeliness and scalability.

III. PROPOSED TECHNIQUE
In this section, the authors introduce a new technique that protects privacy in big data by deceiving bad users or hackers.
A bad user cannot differentiate between the original data and the data after applying this technique. The technique is called Special Negative Database (SNDB). SNDB is based on deceiving bad users and hackers by replacing only the sensitive attribute with its complement. SNDB takes into consideration all attribute types, such as binomial, numeric and polynomial, as shown in Fig. 1. The authors divide the technique into different cases, as described in the following subsections.

A. Binomial
A binomial attribute has only two values. It falls into one of two categories: binary, with the values 0 and 1, and Boolean, with values such as True and False or any other pair of opposite values. SNDB handles a binomial attribute by replacing each value with its complement.
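The binomial case can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `binomial_complement` and the explicit `pair` argument are assumptions.

```python
def binomial_complement(value, pair):
    """Return the other member of a two-valued domain.

    pair: the two admissible values, e.g. (0, 1), (True, False) or ("sim", "nao").
    """
    first, second = pair
    if value == first:
        return second
    if value == second:
        return first
    raise ValueError(f"{value!r} is not one of {pair!r}")

# Binary, Boolean and textual two-valued attributes are handled the same way.
flipped_bit = binomial_complement(0, (0, 1))
flipped_flag = binomial_complement("sim", ("sim", "nao"))
```

Applying the function twice returns the original value, which is how the data owner would recover the original data under this sketch.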

B. Numeric
A numeric attribute can hold either integers or real numbers. SNDB handles numeric values in a different manner: it replaces each digit with its complement to 9, for at most four digits counted from the right, taking into consideration the national ID.
For example, if the numeric value is 2896547, the value after complementing will be 2893452. For real numbers, SNDB replaces each digit with its complement to 9 for at most four digits counted from the left.
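The digit-wise rule can be sketched in Python as below. Both function names are hypothetical. Values are returned as strings so that leading zeros survive, and the handling of real numbers (skipping the decimal point while counting digits from the left) is one possible reading of the rule, not something the text specifies. Non-negative values are assumed.

```python
def nines_complement_right(number, digits=4):
    """Complement to 9 each of the last `digits` digits of a non-negative integer."""
    text = str(number)
    head, tail = text[:-digits], text[-digits:]
    return head + "".join(str(9 - int(d)) for d in tail)

def nines_complement_left(number_text, digits=4):
    """Complement to 9 the first `digits` digits of a real number given as text.

    Assumption: the decimal point is skipped and does not count as a digit.
    """
    out, done = [], 0
    for ch in number_text:
        if ch.isdigit() and done < digits:
            out.append(str(9 - int(ch)))
            done += 1
        else:
            out.append(ch)
    return "".join(out)

masked = nines_complement_right(2896547)      # the example value from the text
restored = nines_complement_right(masked)     # applying the rule again undoes it
masked_real = nines_complement_left("3.14159")
```

Here `masked` is `"2893452"`, matching the worked example, and `restored` is `"2896547"`.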

C. Date and Time
In the case of a date attribute, the date is divided into year, month, and day. If the value represents a year, SNDB complements each value as a whole to the current year when the values are less than the current year. If some values are equal to the current year, each value is complemented as a whole to the current year + 1. If the value represents a month, SNDB computes the complement of the value to 13. If the value represents a day, SNDB computes the complement of the value to 31.
In the case of a time attribute, the time is divided into hours, minutes, and seconds. If the value represents hours, SNDB computes the complement of the value to 24. If the value represents minutes or seconds, SNDB computes the complement to 9 for the right digit and to 6 for the left digit.
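The year and month rules can be sketched as follows. The function names are hypothetical, and the year formula is an interpretation inferred from the mapping reported in the results (with 2021 as the current year, the years 2005 to 2019 become 2002 to 2016): the offset to the current year is re-attached to the current century. Day and time components are left out of the sketch because the text does not say how boundary values such as day 31 or hour 0 are handled.

```python
def complement_year(year, current_year, any_equal_current=False):
    """Complement a year against the current year (or current year + 1).

    any_equal_current: set to True when some value in the column equals the
    current year, in which case the complement is taken to current_year + 1.
    """
    base = current_year + 1 if any_equal_current else current_year
    century = (current_year // 100) * 100
    return century + (base - year)

def complement_month(month):
    """Complement a month number (1..12) to 13, e.g. January <-> December."""
    return 13 - month

first_year = complement_year(2005, 2021)
last_year = complement_year(2019, 2021)
december = complement_month(1)
```

Both functions are their own inverse, so applying them a second time with the same parameters restores the original value.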

D. Polynomial
Polynomial means that the attribute has more than two text values. Polynomial is divided into two main types. The first is called ordinal and the second is called nominal.

1) Ordinal:
Ordinal means that the values can be categorized, classified, ranked, and ordered. Drink size, for example, is an ordinal attribute that corresponds to the sizes of drinks available at fast-food restaurants. Small, medium, and large are the three possible values for this ordinal attribute. These values have a logical order (which corresponds to increasing drink size), but they do not indicate how much larger a large is than a medium. Grade (e.g., A+, A, A-, B+, and so on) and professional rank are two further examples of ordinal attributes [18].
For ordinal attributes, the number of categories is very important for the swapping technique. Swapping is used to deceive the bad user into believing that the data is real. With an even number of categories, swapping is very easy to implement: SNDB swaps the first value with the third, the second with the fourth, and so on. With an odd number of categories, the not allowed value (NA) is first appended as the last category, and swapping is then applied as in the even case.
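The ordinal swapping rule can be sketched as a lookup table. The text only gives the four-category pattern (first with third, second with fourth); the sketch below generalizes it by pairing position i with position i + n/2, which is an assumption, as is the function name `ordinal_swap_map`.

```python
def ordinal_swap_map(categories, filler="NA"):
    """Build a value -> swapped value mapping for an ordered category list.

    An odd-length list is padded with `filler` so every category has a partner.
    """
    cats = list(categories)
    if len(cats) % 2 == 1:
        cats.append(filler)
    half = len(cats) // 2
    mapping = {}
    for i in range(half):
        a, b = cats[i], cats[i + half]
        mapping[a], mapping[b] = b, a
    return mapping

grade_map = ordinal_swap_map(["A", "B", "C", "D"])        # even number of categories
size_map = ordinal_swap_map(["small", "medium", "large"])  # odd: NA is appended
masked_sizes = [size_map[s] for s in ["small", "large", "medium"]]
```

Because the mapping pairs values symmetrically, looking a masked value up in the same mapping returns the original value.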

2) Nominal:
A nominal attribute's values are names of things or symbols. Nominal implies "related to names." Because each value reflects a state, code or category, nominal attributes are also known as categorical. There is no discernible order to the values.
In computer science, the values are also known as enumerations. Hair color and marital status are two examples of nominal attributes. Black, blond, red, brown, auburn, grey, and white are all conceivable hair color values in our system. The marital status attribute can take the value single, married, divorced, or widowed. Occupation is another example of a nominal attribute, having values such as teacher, programmer, farmer, dentist, and so on [18].
In the nominal case, it is important to build a list of values so that the real value can be swapped randomly with one from this list. If the values are names of persons, a list of names is created. One value from this list is then selected randomly to replace the original one, and the index in the list is saved. This operation is repeated for every value. Using the saved index, the original data can be recovered.
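The nominal procedure can be sketched as below. The text does not state exactly which index is saved, so the sketch assumes that the data owner privately keeps, for every record, the list index of the original value; the function names and the sample names are hypothetical.

```python
import random

def nominal_swap(values, seed=None):
    """Replace each value with a random other entry of the value list.

    Returns (masked values, saved indices of the originals, lookup list).
    The saved indices and the lookup list are assumed to stay with the owner.
    """
    rng = random.Random(seed)
    lookup = sorted(set(values))
    position = {v: i for i, v in enumerate(lookup)}
    masked, key = [], []
    for v in values:
        original_index = position[v]
        key.append(original_index)
        # Choose a decoy different from the original whenever one exists.
        candidates = [i for i in range(len(lookup)) if i != original_index]
        decoy_index = rng.choice(candidates) if candidates else original_index
        masked.append(lookup[decoy_index])
    return masked, key, lookup

def nominal_restore(key, lookup):
    """Recover the original values from the saved indices."""
    return [lookup[i] for i in key]

names = ["Amal", "Omar", "Sara", "Omar", "Hana"]  # hypothetical values
masked_names, saved_key, name_list = nominal_swap(names, seed=1)
recovered = nominal_restore(saved_key, name_list)
```

The masked column still contains plausible names from the same list, while `recovered` equals the original column.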

IV. DATASETS
In this paper, we apply our model on different datasets according to sensitive attributes taking into consideration all data types.

A. Pollution Dataset
This dataset is about pollution in the United States. The EPA has well-documented pollution in the United States, but downloading all of the data and arranging it in a format that data scientists are interested in is a nuisance. As a result, I gathered data for four key pollutants (nitrogen dioxide, carbon monoxide, ozone, and sulphur dioxide) for every day between 2000 and 2016 and organized them in a CSV file. There are a total of twenty-eight fields. Each of the four pollutants (NO2, CO, O3, and SO2) has its own set of five columns. The dataset contains 1,746,661 observations. The city, date local, and CO mean are all sensitive attributes.

B. Prouni
This dataset covers scholarships awarded to Brazilian students by the Brazilian government under the Prouni program. It contains data from 2005 to 2019, and each row corresponds to a student who benefits or has benefited from the Prouni program, along with details about them. The dataset consists of 2,692,540 records.

V. RESULTS AND EVALUATIONS
This section lists the results and evaluation of applying the SNDB technique to the datasets for privacy preservation. Results are presented for binomial, year-date and full-date sensitive attributes; the remaining attribute types will be presented in the next paper.

A. Binomial Results
Assume that the sensitive attribute is BENEFICIARIO_DEFICIENTE_FISICO, which indicates whether the student has special needs. SNDB swaps the value nao (no) with the value sim (yes) and vice versa. Fig. 2 and Fig. 3 illustrate one sample of the Prouni dataset before applying the SNDB technique and another sample after applying it.
Fig. 4 shows statistics computed on the Prouni dataset before applying SNDB, giving the type of scholarship according to special needs. The number of students with special needs who received the BOLSA PARCIAL 50% scholarship is 4,947, while the number without special needs with the same scholarship is 808,557. The number of students with special needs who received the BOLSA INTEGRAL scholarship is 14,222, while the number without special needs with the same scholarship is 1,862,484. The number of students with special needs who received the BOLSA COMPLEMENTAR 25% scholarship is 4, while the number without special needs with the same scholarship is 2,326.
Fig. 5 shows the same statistics computed on the Prouni dataset after applying SNDB. The number of students with special needs who received the BOLSA PARCIAL 50% scholarship is 808,557, while the number without special needs with the same scholarship is 4,947. The number of students with special needs who received the BOLSA INTEGRAL scholarship is 1,862,484, while the number without special needs with the same scholarship is 14,222. The number of students with special needs who received the BOLSA COMPLEMENTAR 25% scholarship is 2,326, while the number without special needs with the same scholarship is 4.
Fig. 4 and Fig. 5 show that the results differ greatly before and after applying SNDB to the same dataset, which indicates the success of the algorithm when applied to a binomial sensitive attribute.
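The relation between the statistics before and after masking can be reproduced on a toy sample. The records below are invented, and the column name `scholarship_type` is hypothetical; only BENEFICIARIO_DEFICIENTE_FISICO and the values sim and nao come from the dataset description above.

```python
from collections import Counter

FLAG = "BENEFICIARIO_DEFICIENTE_FISICO"
FLIP = {"sim": "nao", "nao": "sim"}

def count_by(records, group_key, flag_key):
    """Count records per (group value, flag value) pair."""
    return Counter((r[group_key], r[flag_key]) for r in records)

# Invented toy records; "scholarship_type" is a hypothetical column name.
records = [
    {"scholarship_type": "BOLSA INTEGRAL", FLAG: "nao"},
    {"scholarship_type": "BOLSA INTEGRAL", FLAG: "nao"},
    {"scholarship_type": "BOLSA INTEGRAL", FLAG: "sim"},
    {"scholarship_type": "BOLSA PARCIAL 50%", FLAG: "nao"},
]

# Masking flips only the sensitive flag; every other field is left untouched.
masked = [{**r, FLAG: FLIP[r[FLAG]]} for r in records]

before = count_by(records, "scholarship_type", FLAG)
after = count_by(masked, "scholarship_type", FLAG)
```

Every count for a (scholarship, sim) pair before masking appears under (scholarship, nao) afterwards and vice versa, which is the swap visible between Fig. 4 and Fig. 5, and the number of records is unchanged.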

1) Year date: SNDB handles a sensitive attribute of type date in a special manner. Assume the sensitive attribute is ANO_CONCESSAO_BOLSA, the year of the scholarship, which consists of a year only. SNDB complements each year to the current year (2021). Fig. 6 and Fig. 7 illustrate one sample of the Prouni dataset before applying the SNDB technique and another sample after applying SNDB to ANO_CONCESSAO_BOLSA. Fig. 8 and Fig. 9 illustrate the gender distribution of the students who received the scholarship before and after applying SNDB. They show that the scholarship years in the original data run from 2005 to 2019, while after applying the SNDB technique they run from 2002 to 2016. Taking 2005 as an example, the original data contains 36,097 male students and 39,532 female students, while the data after applying SNDB contains 108,057 male students and 131,205 female students. Fig. 8 and Fig. 9 thus show that the results differ greatly before and after applying SNDB to ANO_CONCESSAO_BOLSA in the same dataset, which indicates the success of the algorithm when applied to a date sensitive attribute holding a year value only. The next subsection shows the result of applying SNDB to a different kind of date value.
2) Full date: Assume the sensitive attribute is date local. SNDB handles date local in a different manner: it changes the month only, because changing the month is sufficient to alter the original data. SNDB swaps January with December, February with November, March with October, April with September, May with August, and June with July, and vice versa in each case. Fig. 10 and Fig. 11 illustrate one sample of the pollution dataset before applying the SNDB technique and another sample after applying SNDB to the date local attribute. Fig. 12 and Fig. 13 illustrate the difference between the maximum values of the sulphur dioxide and nitrogen dioxide means before and after applying SNDB to date local. This paper takes the first five days of December 2000 as an example to show the difference between the original dataset and the dataset after applying the SNDB technique. Fig. 12 and Fig. 13 show a big difference between the values of the original and the SNDB datasets.
As seen from the result figures above, SNDB is valid for big data since the size of the data does not change before and after applying SNDB. Processing is fast compared with the traditional negative database. The degree of deception achieved by SNDB is high, which strengthens the privacy level: bad users or hackers cannot differentiate between the original data and the data after applying the SNDB technique. This makes decryption very hard for bad users, while it is very easy for the data owner to decrypt the SNDB data.
Table I

VI. CONCLUSION
This paper lists the most important big data challenges, focuses on the privacy challenge, and summarizes privacy violation situations. The author also provides a list of the most efficient and popular techniques used to protect data privacy, with their advantages and drawbacks. The technique proposed in this paper is SNDB, which builds on the negative database in a different manner.
SNDB is based on deceiving bad users and hackers by replacing only the sensitive attribute with its complement. SNDB takes into consideration all attribute types, such as binomial, numeric and polynomial. The SNDB technique is applied to different datasets according to the type of the sensitive attributes of each dataset. With this technique, a bad user cannot differentiate between the original data and the data after applying the technique, which enhances the level of privacy.
As seen from the results, SNDB can avoid the drawbacks of previous techniques since it has the advantage of high privacy protection in big data. SNDB is not time-consuming since it deals with the sensitive attribute only. It also preserves data integrity and data size, since no record is removed or added, and this advantage makes SNDB very suitable for big data. It also has low complexity since it only replaces the sensitive attribute value with its complement. After applying SNDB, the original data can easily be recovered by applying the complement another time according to the rules of the data owner.

VII. FUTURE WORK
Finally, the author has provided the results of applying SNDB to big datasets with binomial, year-date and full-date sensitive attributes. In future work, the author will introduce the results of applying SNDB to numeric, ordinal and nominal sensitive attributes. The author also intends to consider transposition techniques instead of replacing values with each other.