Increasing Randomization of Ciphertext in DNA Cryptography

Deoxyribonucleic acid (DNA) cryptography is an emerging area in message hiding, where DNA bases are used to encode binary data to enhance the randomness of the ciphertext. However, an extensive study of existing algorithms indicates that the encoded ciphertext has a low avalanche effect and therefore does not provide the confusion property desired of an encryption algorithm. This property is crucial to randomize the relationship between the plaintext and the ciphertext. Therefore, this research reassesses the security of existing DNA cryptography by modifying the steps in the DNA encryption technique and utilizing an existing DNA encoding/decoding table at a selected step in the algorithm to enhance the overall security of the cipher. The modified and base DNA cryptography techniques are evaluated for frequency analysis, entropy, avalanche effect, and hamming weight using 100 different plaintexts with high density, low density, and random input data. The modified technique performs well in frequency analysis, avalanche effect, and hamming weight, while its entropy remains approximately equal to that of the base technique. This work shows that the ciphertext generated from the modified model yields better randomization and can be adapted to transmit sensitive information.

Keywords: DNA cryptography; avalanche effect; frequency test; entropy; hamming weight


I. INTRODUCTION
With the rapid development of Deoxyribonucleic Acid (DNA) computing, DNA cryptography has become a new advancement in cryptography. DNA molecules are an integral part of a cell and act as genetic information carriers, but when applied in modern cryptography, they serve as a data manipulation tool [1].
The design of an encryption/decryption algorithm should be complex enough to withstand security attacks for a long time. The best way to reach such complexity in a system is to work towards scalability because this will ultimately lead to large-scale complexity. The idea behind increasing the complexity of a system by augmenting its size is to achieve a level of security that requires tremendous effort to attack successfully. These desired properties can be achieved by DNA cryptography as it offers huge parallelism and storage capacity simultaneously [2]. The power of DNA encryption lies not only in the molecules or the encoding but also in the positions where the data is saved to protect it from attacks for a longer time [3]. Cryptography is the procedure used to create such algorithms, whereas cryptanalysis is the procedure by which attackers or algorithm developers examine the cipher for vulnerabilities and improve it by giving insight for future directions [4]. Randomness [5], avalanche effect [6], and entropy per bit [7] are some of the desired properties used to evaluate the ciphertext. A cryptographic solution should satisfy at least these criteria to ensure safety.
The avalanche effect [5] is a compelling test, whereby changing one bit in the plaintext or key should change at least 50% of the bits in the ciphertext. This research work focuses on the change in ciphertext from a plaintext perspective. A detailed study of the DNA cryptographic encryption algorithm in [8] indicates that its avalanche effect is considerably lower, leading to security vulnerabilities. Specifically, the conversion of the binary data into DNA bases (00 to A, 01 to G, 10 to C, 11 to T) exhibits a poor avalanche effect, i.e., poor randomization of the ciphertext. This may allow an attacker to establish a relationship between a plaintext and its ciphertext. In this paper, a modified DNA encryption technique with an existing DNA encoding table used in [9] is introduced into the existing algorithm to overcome this security vulnerability. The proposed encryption technique allows the user to send encrypted information with an extra layer of security. The experimental results endorse the effectiveness of the proposed technique through a statistical comparison between the base technique and the proposed technique.
The study is organized into six sections. This introductory section is followed by a literature review in Section 2. Section 3 describes the encryption algorithm for the base and the proposed technique. Section 4 explains the tests used to measure the randomness of each technique. Section 5 gives a detailed analysis of the results, validating the effectiveness of the proposed technique. Section 6 presents the concluding remarks, considering improvements and limitations, and is followed by the cited references in a separate section.

II. RELATED WORK
DNA computing is an increasingly important area in applied cryptography, where the inherent capacity of DNA to store huge amounts of data is adopted along with DNA replication to introduce randomness in the cipher. DNA computing can be applied in various forms during the encryption-decryption process; it can be used as a complement operation, digital coding, polymerase chain reaction, or as a security alternative [10]. A large volume of published studies describes the role of DNA in cryptography. This research focuses on DNA digital coding only, with detailed insight into its security impact in cryptographic techniques. In 2012, Noorul Hussain [9] introduced a new concept based on DNA digital coding, where a dynamic DNA encoding table was presented. This encoding table is a 24 × 4 matrix of 96 American Standard Code for Information Interchange (ASCII) characters consisting of alphabets, numbers, and special characters. Later, this table was used and extended by other researchers [11]-[13]. These studies report that the utilization of this table in an algorithm improves the randomness of the ciphertext and consequently enhances the system security.
www.ijacsa.thesai.org
An extended version of the ASCII table was introduced with 256 ASCII character encoding [11]. For dynamic sequence table creation, all characters are initially allocated randomly to DNA base sequences, followed by an iterative rearrangement using a mathematical pattern. In the encryption process, the plaintext is first converted into DNA bases using the sequence table; the data is then divided into chunks, which are encrypted using an asymmetric cryptosystem and finally merged to form the ciphertext. The system of dynamic encoding coupled with an asymmetric cryptosystem raises the degree of data confidentiality. This is supported by a comparison with existing techniques and by the statistical suite of randomness tests defined by the National Institute of Standards and Technology (NIST).
In [12], a network traffic and intrusion detection system is proposed using DNA sequences, where DNA bases are used to encode the 41 attributes of the network. The attributes have been analyzed for experimentation purposes, and the results indicate a 15% improvement in accuracy, whereby a more complex encoding can effectively improve the accuracy of the intrusion detection system. In [13] and [14], a dynamic DNA sequence table is used in combination with a one-time pad (OTP) to improve data security. The attacker must try all possible DNA sequence variations before getting the original data, which is expected to be very difficult. The technique provides better security than other techniques, in particular against brute force attacks. The algorithm aims to transmit the OTP securely, but its execution time is higher than that of other similar techniques. In [8], a cryptographic system is designed in which a delayed Hopfield neural network generates the cryptographic key before the DNA encryption-decryption process. Specifically, the chaotic neural network generates a binary sequence, which is passed on to the permutation function, yielding the first-level key for encryption. The system's strength lies in the random selection of trajectories for the neural network, the delay function, and DNA cryptography. The authors claim that changing one byte can change 32 out of 128 bits in the ciphertext, which is significantly less than the expected change, where changing one bit in the plaintext or key should bring more than 50% change in the ciphertext.
All these research works endorse the fact that DNA encryption using DNA encoding can significantly improve the security of a cryptographic solution. In this work, the approach in [8] is extended with the existing DNA encoding table at a carefully selected location. The subsequent section explains the encryption process for the base technique, followed by the improved technique with an additional layer of the dynamic sequence table.

A. Base Technique
The hybrid chaotic neural network in [8] generates the key, while the DNA cryptography algorithm encrypts and decrypts the original data. This paper only concerns the application of DNA cryptography, so it discusses encryption and decryption without going into the details of the key generation process. Plaintext, key, and ciphertext are of equal length, i.e., 128 bits. The steps involved in the encryption process are as follows:
1) Take plaintext from the user and divide it into fixed-length sub-sequences.
2) A random binary sequence of equal length is produced by the key generation process.
3) The random binary sequence is permuted using a left cyclic shift, yielding the first-level key, where the number of bits to be shifted is pre-calculated by the key generation part.
4) The plaintext sub-sequence is subjected to a right cyclic shift, producing a shifted sub-sequence.
5) To produce the first-level encrypted text C_j′, an XOR operation is performed between the first-level key and the shifted plaintext sub-sequence.
6) The second level of encryption is performed on the binaries obtained from C_j′. Mapping "00" to "A", "01" to "G", "10" to "C", and "11" to "T" yields the DNA-encoded ciphertext.
The decryption process is the reverse of the encryption process: DNA decoding produces C_j′, and an XOR with the first-level key then recovers the shifted plaintext sub-sequence. This undergoes permutation and cyclic shift to give the plaintext.
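The base technique can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [8]: the function names are hypothetical, the shift amounts `key_shift` and `text_shift` stand in for values that the key generation part would supply, and the sketch assumes that the random binary sequence is left-shifted to form the key while the plaintext block is right-shifted. Only the cyclic shifts, the XOR, and the 00/01/10/11 to A/G/C/T mapping are taken from the description.

```python
# Sketch of the base technique on bit strings such as "0110...".
# Shift amounts are illustrative; in [8] they come from key generation.

BIN_TO_DNA = {"00": "A", "01": "G", "10": "C", "11": "T"}
DNA_TO_BIN = {base: bits for bits, base in BIN_TO_DNA.items()}


def rotate_left(bits: str, n: int) -> str:
    """Left cyclic shift of a bit string by n positions."""
    n %= len(bits)
    return bits[n:] + bits[:n]


def rotate_right(bits: str, n: int) -> str:
    """Right cyclic shift of a bit string by n positions."""
    n %= len(bits)
    return bits[-n:] + bits[:-n] if n else bits


def xor_bits(a: str, b: str) -> str:
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def encrypt_block(plain: str, random_seq: str, key_shift: int, text_shift: int) -> str:
    key = rotate_left(random_seq, key_shift)       # permuted key sequence
    shifted = rotate_right(plain, text_shift)      # shifted plaintext block
    first_level = xor_bits(shifted, key)           # first-level ciphertext
    pairs = (first_level[i:i + 2] for i in range(0, len(first_level), 2))
    return "".join(BIN_TO_DNA[p] for p in pairs)   # DNA-encoded ciphertext


def decrypt_block(dna: str, random_seq: str, key_shift: int, text_shift: int) -> str:
    first_level = "".join(DNA_TO_BIN[base] for base in dna)
    key = rotate_left(random_seq, key_shift)
    shifted = xor_bits(first_level, key)
    return rotate_left(shifted, text_shift)        # undo the right shift
```

Because every operation here is a fixed permutation or XOR followed by a fixed two-bit substitution, flipping one plaintext bit flips exactly one bit before DNA encoding, which is consistent with the low avalanche effect discussed in this paper.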

B. Proposed Technique
The proposed algorithm works the same way as [8] as far as key generation is concerned, but the encryption and decryption parts are improved, as shown in Fig. 1:
1) The user enters the plaintext, which goes directly to the DNA sequence table and gets encoded.
2) Binary coding is then applied to the DNA bases as A = 00, G = 01, C = 10, and T = 11.
3) The binary values are converted into the corresponding decimal values.
4) The decimal values are converted into ASCII characters.
5) Each string is split into equal-size blocks.
The encryption process then continues with steps 3-6 of the base technique. Fig. 1 gives a pictorial representation of the system for the encryption-decryption process. The encryption process is illustrated in green, the key generation process in yellow, and the decryption process in blue. These steps are similar to the original algorithm, and the improvements are added as new layers (in red) and displayed distinctively in the encryption and decryption process. Table I shows the DNA encoding/decoding table introduced to the algorithm, as applied in [9].
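The added pre-processing layer can be sketched in Python as below. The sequence table in this sketch is a hypothetical stand-in and not Table I: it simply pairs the 96 characters with ASCII codes 32 to 127 with 96 four-base sequences in an arbitrary order. The four-base codeword length, the 16-character block size, and all names are assumptions made for illustration only.

```python
from itertools import product

BASES = "AGCT"
DNA_TO_BIN = {"A": "00", "G": "01", "C": "10", "T": "11"}

# HYPOTHETICAL sequence table (not Table I from [9]): 96 characters are
# assigned to 96 distinct four-base codewords in an arbitrary fixed order.
CODEWORDS = ["".join(p) for p in product(BASES, repeat=4)]  # 256 codewords
CHARS = [chr(code) for code in range(32, 128)]              # 96 characters
SEQUENCE_TABLE = dict(zip(CHARS, reversed(CODEWORDS)))


def preprocess(plaintext: str) -> str:
    """Apply steps 1-4: table lookup, binary coding, decimal, character."""
    out = []
    for ch in plaintext:
        dna = SEQUENCE_TABLE[ch]                        # step 1: DNA encoding
        bits = "".join(DNA_TO_BIN[b] for b in dna)      # step 2: binary coding
        value = int(bits, 2)                            # step 3: decimal value
        out.append(chr(value))                          # step 4: character
    return "".join(out)


def split_blocks(text: str, size: int = 16) -> list[str]:
    """Step 5: split into equal-size blocks (16 characters = 128 bits)."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

With a table of this kind, the layer acts as a character-level substitution applied before the shift and XOR steps; the choice of table determines how far apart the substituted bit patterns of similar characters are.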
In this section, several tests from the literature are performed to evaluate the randomness of the ciphertext produced by the proposed algorithm and the base technique [5], [7], [15]-[17]. These tests can only be performed on binary sequences. Thus, the ciphertext is converted from a DNA sequence into binary to complete the evaluation. Three different datasets have been used as inputs to these tests, categorized as low density, high density, and random [18]-[20]. Low and high density are the biased datasets. A low-density plaintext has all zeros and only a single 1 bit in the string. A high-density plaintext is the exact opposite, with all ones and only a single 0 bit. The purpose of using biased data is to identify the actual randomness introduced in the ciphertext. When an algorithm is provided with random plaintexts, there is a high chance that the generated ciphertext will also be random. On the other hand, for a non-random (biased) dataset, the probability of obtaining a random ciphertext is relatively low. Therefore, the use of different categories of datasets can establish confidence in the improved scheme from a security perspective.
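The three dataset categories can be generated with a few lines of Python. This is a sketch for illustration; the function names and the use of Python's `random` module are assumptions and do not describe how the datasets in this study were produced.

```python
import random


def low_density(n_bits: int, one_pos: int) -> str:
    """All zeros except a single 1 bit at index one_pos."""
    bits = ["0"] * n_bits
    bits[one_pos] = "1"
    return "".join(bits)


def high_density(n_bits: int, zero_pos: int) -> str:
    """All ones except a single 0 bit at index zero_pos."""
    bits = ["1"] * n_bits
    bits[zero_pos] = "0"
    return "".join(bits)


def random_plaintext(n_bits: int, rng: random.Random) -> str:
    """Uniformly random bit string drawn from the given generator."""
    return "".join(rng.choice("01") for _ in range(n_bits))
```

For example, `low_density(128, 39)` gives a 128-bit plaintext whose only 1 bit is at index 39, and passing a seeded `random.Random` makes the random plaintexts reproducible.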

A. Frequency (Mono Bit) Test
The frequency test counts the number of 0's and 1's that appear in the ciphertext. The test determines whether the numbers of zeros and ones are approximately equal, as this is one of the desired properties of a ciphertext [5]. The level of significance for this test is 0.01, which means that about 1 out of 100 truly random sequences is expected to be rejected. Ideally, the resultant p-value should be 1, which means a perfect balance of 0's and 1's in the string. The test assesses how close the fraction of ones is to 1/2 of the total number of bits in the ciphertext, as it is ideal for the two counts to be equal. For this test, the preliminaries are n, the length of the bit string, and ε = ε_1, ε_2, ..., ε_n, the sequence of bits in the string. Each bit is converted to ±1 as in (1), the converted values are summed as in (2), and the test statistic is the absolute value of the summation S_n divided by the square root of n, as in (3).
$$X_i = 2\varepsilon_i - 1 \tag{1}$$
$$S_n = X_1 + X_2 + \cdots + X_n \tag{2}$$
$$s_{obs} = \frac{|S_n|}{\sqrt{n}} \tag{3}$$
Finally, the tail probability, i.e., the p-value, is calculated in (4), where erfc is the complementary error function.
$$P\text{-value} = \operatorname{erfc}\left(\frac{s_{obs}}{\sqrt{2}}\right) \tag{4}$$
If the computed p-value is less than 0.01, it is concluded that the given sequence is not random [15], [16]. On the other hand, if the p-value is at least 0.01, the string passes the test and can be declared a random string.
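As a brief illustration, the frequency (monobit) test as defined by NIST can be written in Python using only the standard library. The function name is hypothetical; the computation follows the standard definition, in which bits are mapped to ±1, summed, normalized by the square root of the length, and passed through the complementary error function.

```python
import math


def monobit_p_value(bits: str) -> float:
    """P-value of the NIST frequency (monobit) test for a bit string."""
    n = len(bits)
    # Map 1 -> +1 and 0 -> -1, then sum.
    s_n = sum(1 if b == "1" else -1 for b in bits)
    # Normalized absolute sum is the test statistic.
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))
```

A perfectly balanced string gives a p-value of 1, while a heavily biased string such as 128 consecutive ones gives a p-value far below the 0.01 threshold. For the short example sequence 1011010101 used in the NIST documentation, the function returns approximately 0.527.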

B. Avalanche Effect
A small change in the plaintext or key yielding a significant change in the ciphertext is called the avalanche effect, calculated as in (5).
$$\text{Avalanche effect} = \frac{\text{number of flipped bits in the ciphertext}}{\text{number of bits in the ciphertext}} \times 100\% \tag{5}$$
It is a highly desirable property in algorithm design: the higher the avalanche effect, the better the algorithm [21]-[27]. An avalanche effect above 50% makes the cipher more random and less predictable for attackers.
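The avalanche effect can be computed by comparing a reference ciphertext with the ciphertext obtained after flipping one plaintext bit. The following Python sketch uses a hypothetical function name and assumes that both ciphertexts are given as bit strings of equal length.

```python
def avalanche_effect(reference: str, modified: str) -> float:
    """Percentage of bit positions that differ between two ciphertexts."""
    if len(reference) != len(modified):
        raise ValueError("ciphertexts must have equal length")
    flipped = sum(a != b for a, b in zip(reference, modified))
    return 100.0 * flipped / len(reference)
```

For instance, 24 flipped bits in a 128-bit ciphertext correspond to an avalanche effect of 24/128 = 18.75%.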

C. Entropy
Shannon introduced the concept of entropy in bits in 1948 [7]; it is a measure of the uncertainty in the expected output bits. The uncertainty of the cipher is determined by the number of plaintext bits that must be recovered from the scrambled ciphertext to successfully obtain the original message [17]. Moreover, entropy is the weighted average of the optimal bit representation size, i.e., the average size of an encoded message. Mathematically, entropy can be defined as in (6).
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \tag{6}$$
Here, the entropy of X is calculated with X ∈ {0, 1}. Calculating for both values gives (7).
$$H(X) = -p(0)\log_2 p(0) - p(1)\log_2 p(1) \tag{7}$$
The highest uncertainty is only achieved when the values are equally distributed, i.e., when both values have probability 1/2, as in (8).
$$p(0) = p(1) = \tfrac{1}{2} \;\Rightarrow\; H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \tag{8}$$
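The entropy per bit of a binary ciphertext can be estimated from the observed frequencies of 0 and 1. The Python sketch below is illustrative; the function name is hypothetical, and the probabilities are simply the relative frequencies of the two symbols in the string.

```python
import math
from collections import Counter


def entropy_per_bit(bits: str) -> float:
    """Shannon entropy (in bits) of the 0/1 distribution in a bit string."""
    n = len(bits)
    counts = Counter(bits)
    # Symbols that never occur contribute nothing to the sum.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A balanced string reaches the maximum of 1 bit, and the value decreases towards 0 as one symbol dominates. For a string with three 1 bits for every 0 bit, the result is about 0.811.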

D. Hamming Weight
The hamming weight is the total number of positions at which two strings of equal length have different symbols [21], [26]. A higher value of the hamming weight represents better randomness of the binary sequence.

Tables II and III present the results for all of the tests described in Section IV. The plaintext is varied by toggling bits across the string while the key is held constant. The plaintexts in Table II are 128 bits long. The key is set to "00011000100101010010010000101011100001101010010001100011000011111100011010110001011110111100100100111110011011000001111100110010". The same key is used to produce the ciphertext for all further evaluations. As the frequency test is one of the NIST tests [28] and requires a string of at least 100 bits, the ciphertext bits are concatenated to apply this test. Here, 33 high-density, 33 low-density, and 34 random plaintexts are used for evaluation, and the average value over all these observations is calculated.

A. Frequency Test
As mentioned in the previous section, the frequency test counts the number of 0's and 1's that appear in the ciphertext. If the p-value calculated on the ciphertext is at least 0.01, the ciphertext is considered a random string; otherwise it is considered non-random. Table II shows the p-values of the ciphertext produced by the base and the improved algorithm using (1), (2), (3), and (4). High-density, low-density, and random plaintexts are used as inputs to both algorithms. Based on Table II, both techniques pass the NIST frequency test, as the average p-values for all three variants of plaintexts are greater than 0.01. The p-value (y-axis) of each ciphertext generated from those variants of plaintexts is also depicted in Fig. 2 and 3. The ideal p-value for this test is 1, and all the ciphertexts should have a value close to 1. Fig. 2 and 3 show that some outputs achieve a p-value of 1. However, the proportion of such outputs is much smaller for the base technique than for the improved technique. It can be seen from Table II that the average p-value of the improved technique (0.9877) is very close to 1.

B. Avalanche Effect
The avalanche effect is a very desirable property when it comes to randomness in the ciphertext. As mentioned earlier, the higher the avalanche effect, the better the security. In any scenario where the attacker has access to the ciphertext, the attacker tries to establish a relationship between the ciphertext and its plaintext. If changing one bit results in a change of more than 50% of the bits, it becomes challenging for the attacker to retrieve the original message. In Table II, the base technique has avalanche effect values of 37.99% for random plaintexts, and 38.06% and 40.6% for high- and low-density plaintexts, respectively. Meanwhile, the improved technique has values ranging from 52.9% to 55.7%, significantly higher than the base technique.
These values are calculated using (5) and presented in Fig. 4 and 5. As depicted in Fig. 4, changing 1 bit in the plaintext generally introduces a difference of 8% to 65% for the base technique, whereas in Fig. 5 the observed values for the improved technique range between 40% and 70%. Row 5 in Table II gives the average value of the avalanche effect; this value is 38.55% for the base technique and improves significantly to 54.64% for the improved technique.
Example scenarios of the avalanche effect are presented in Table III, where "CRYPTOGRAMMATIST" is used as the original plaintext and fed to the algorithm, and the produced ciphertext is used as a reference to calculate the number of flipped bits. For example, changing one bit at the 40th location of the binary sequence in the plaintext yields 24 flipped bits in the ciphertext for the base technique. Thus, the avalanche effect is 18.75%, considering that the length of the ciphertext is 128 bits. For the base algorithm, the results show that the avalanche effect ranges from 12.5% to 32.0312%, with an average of 19.72%, when changing one bit of the binary sequence in the plaintext at different locations (bold and underlined bit). Meanwhile, the improved technique has avalanche effects ranging from 53.9% to 61.75%, with an average of 57.4175%. The average avalanche effect therefore shows a significant improvement of about 37.7 percentage points. Thus, this new encryption/decryption technique can be used to improve security in an environment in which data sensitivity and randomness are essential.

C. Entropy
Equations (6) and (7) are applied to find the entropy of the ciphertext. In Table III, the entropy of the ciphertext has been calculated for the base and the proposed technique. It can be seen that the entropy of the ciphertext, which is the information content per bit, is nearly equal in both cases, with a value of 0.9963 for the base technique and 0.9931 for the proposed technique. So it can be said that the information content per bit has not materially decreased for the improved technique and has stayed near its optimum value throughout the observations. The ideal entropy in the given case is 1, as depicted in (8), and the observed entropy for both techniques is very close to 1.

D. Hamming Weight
Hamming weight has been calculated using (9). In Table II, it is observed that the hamming weight for the base technique ranges from 62.3 to 63.9, whereas for the improved technique, this value ranges from 63.5 to 64. The ideal expected value of the hamming weight for a binary string of 128 bits is 64. The average observed value over 100 plaintexts is 63.23 for the base technique and 63.87 for the improved technique.
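A minimal Python sketch of the hamming weight calculation is given below. It assumes the common reading in which the hamming weight of a binary string is its number of 1 bits, which equals the number of positions at which the string differs from the all-zero string of the same length; the function names are hypothetical, and the sketch is not a transcription of (9).

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_weight(bits: str) -> int:
    """Number of 1 bits, i.e., distance from the all-zero string."""
    return hamming_distance(bits, "0" * len(bits))
```

Under this reading, a 128-bit ciphertext with an equal number of 0's and 1's has a hamming weight of 64, which matches the ideal expected value stated above.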
In summary, Tables II and III confirm that the proposed technique performs better for frequency, avalanche effect, and hamming weight. The observed values are not only better than those of the base technique but also nearly equal to the ideal expected values. For entropy, the value of the improved technique has not improved, but the difference from the base technique is negligible. Hence, the improved technique is a better alternative to the base technique, offering enhanced security while retaining all the security considerations of the base technique.

VI. CONCLUSION
DNA cryptography has served as a better alternative to traditional systems in recent times. Advancement in the study helps to identify the security vulnerabilities in the existing systems. This research highlights, through careful examination, that the ciphertext produced by the base technique can be further improved in terms of avalanche effect. The average avalanche effect of the base technique is 38.55% when flipping one bit of the binary sequence in the plaintext, for 100 different plaintexts drawn from high-density, low-density, and random datasets. On the other hand, the average avalanche effect of the proposed technique increases to 54.64% by introducing a DNA encoding table. The work also includes the frequency, entropy, and hamming weight tests to assess the overall security of the improved system. The results show that the improved technique is better than the base technique in terms of the frequency test p-value, avalanche effect, and hamming weight. For entropy, the values produced by both algorithms are approximately equal. Hence, the improved technique is a better alternative to the base technique. A good future direction for this work would be defining new trajectories in key schedules and analyzing the impact of key changes on the ciphertext.