A Low Cost FPGA based Cryptosystem Design for High Throughput Area Ratio

For many years, Field Programmable Gate Arrays (FPGAs) have been used as target devices for prototyping and for cryptographic algorithm implementations. Owing to the parallel architecture of FPGAs, the flexibility of cryptographic algorithms can be exploited to achieve high throughput while occupying very little chip area. In this research, we propose a low cost FPGA based cryptosystem named Secure Cipher that targets a high throughput to area ratio. The proposed Secure Cipher is implemented using the full loop unroll technique in order to exploit the parallelism of the algorithm. The implementation achieves a throughput of 4600 Mbps for encryption. Its logic resource utilization is 802 logic elements (LE), which yields a throughput to area ratio of 5.735 Mbps/LE.
Keywords: Encryption; Cryptosystem; Secure Cipher; AES; FPGA; Full loop unroll


I. INTRODUCTION
Data security has been a topic of major interest for decades. With the development of communication systems, the techniques of data exchange have been revolutionized, and the need for data integrity and authenticity has grown accordingly. Various cryptosystems have been proposed in this regard. A cryptosystem is a software or hardware system that converts data from its original comprehensible form into a scrambled form in such a way that the original information can be disclosed only to selected persons [1], [2], [3]. Cryptosystems have evolved over the years from Caesar's cipher, which was based on simply shifting letters, to the modern AES (Advanced Encryption Standard) proposed by Joan Daemen and Vincent Rijmen [4].
Cryptographic hardware solutions have been another field of interest for many researchers [5], [6]. Various hardware cryptosystems have been proposed in which the choice of hardware may be a microcontroller, a microprocessor, or a custom ASIC. Each of these platforms has merits and demerits. For instance, a microcontroller based design might have low processing capability, but such a design usually has a short time to market. Similarly, an ASIC based solution can achieve very high data rates and power efficiency but requires a long time to market.
Hardware based designs can be compared on performance metrics such as power consumption, time to market, and Non Recurring Engineering (NRE) cost. Microcontroller based designs can be a choice for hardware implementation of cryptosystems, as they are low cost, low power solutions with a very short time to market, but their performance is also very low. For high performance requirements, a microprocessor based solution can be chosen, but such designs consume high power and their cost is also very high. Another class of microprocessor based solutions offers low cost and low power designs, but these also offer very low performance. Hardware solutions with high performance and low power can be designed on a custom ASIC platform. ASIC designs are usually produced in mass volumes, so their per unit cost is low, but these solutions have a long time to market because producing an ASIC design is a very complex process, and any error in the design requires a redesign, which increases the NRE cost. For a high performance solution with low cost and low power consumption, an FPGA based design is another candidate. Owing to the reconfigurability of FPGAs, these designs have a very short time to market and a very low NRE cost. The speed and efficiency of FPGAs combined with their flexibility make them very attractive for cryptographic applications. The ability to reconfigure an FPGA to use a different cryptographic algorithm on the fly, or to update, modify, or even replace an outdated algorithm, makes FPGAs very useful for designing cryptosystems. Likewise, the low power and high throughput that FPGAs are capable of make them very useful in high speed communication links or servers that often require security.

A. FPGA based Cryptosystem
There have been many FPGA based cryptosystem designs that focus on obtaining high throughput. These designs often fully unroll the iterative round structure of the cryptosystem and rely heavily on pipelining within each round to increase throughput. High throughput FPGA designs typically achieve throughput above 20 Gbps and are intended for use in solutions that need to handle multiple security sessions simultaneously.
An FPGA based implementation of AES proposed by T. Hoang used an iterative looping technique to implement AES for a block size of 128 bits [7]. In [8], another compact implementation of AES on FPGA is proposed. AES with a block size of 128 bits was targeted, and the key objective of that implementation was to keep the design as small as possible. The design achieved a throughput of 166 Mbps at the expense of 222 slices and 3 block RAMs of 4 Kbits each. In [9], the design decisions that lead to area/delay trade-offs in a single chip FPGA based cryptosystem are explored for AES. The design achieved a throughput of 23.57 Gbps with 16938 slices of hardware area. G. Rouvroy proposed an efficient solution that combines AES encryption and decryption in one FPGA design with a focus on a low area constraint [10]. The proposed design achieved a throughput of 208 Mbps using only 163 slices and 3 block RAMs. In [11], a high performance encryptor/decryptor core of AES is presented. The design was implemented on a single-chip FPGA using a fully pipelined technique. It uses 5677 slices and achieves a throughput of 4121 Mbps. Similarly, in [12], a fully pipelined, encryption only AES design is presented. The design, implemented on a single FPGA chip, achieved a throughput of 21.54 Gbps using 84 block RAMs and 5177 slices. In [13], a low power and low cost hardware core of the AES algorithm is proposed. The core was designed with a novel 8-bit architecture that supports encryption with a 128-bit key. The design produces a throughput of 121 Mbps at a clock frequency of 153 MHz. In [14], four different architectures for the implementation of the AES-128 algorithm are proposed. The four design techniques proposed in [14] are accurate floor-planning, unrolling, pipelining, and tiling. These architectures were derived for different area-delay trade-offs. In [15], an efficient pipelined hardware implementation of AES-128 is proposed. The implementation remains efficient even if the required number of rounds is increased to counter attacks. An AES architecture based on iterative looping with multi-stage sub-pipelining is proposed in [16]. The design achieved a throughput of 1.33 Gbps at an operating frequency of 425 MHz, with a logic resource utilization of 303 slices. Another low cost AES implementation was proposed in [17]. This implementation achieves high throughput by introducing parallel operation in a folded architecture. It produced a throughput of 37.1 Gbps at the maximum operating frequency of 505.5 MHz.
Besides AES, various other algorithms are also used to design FPGA based cryptosystems. S. Singh recently proposed a hardware implementation of the RSA algorithm [18]. The authors implemented RSA encryption using a left-to-right radix-2 Montgomery multiplier on a Xilinx Spartan-3 device. The design had a logic area utilization of 503 slices and achieved a maximum clock frequency of 79.546 MHz. In [19], an encryption scheme for real-time video streaming and its FPGA implementation is proposed.
The demand for lightweight cryptographic algorithms has greatly increased due to the development and use of low resource devices for communication. In [20], a lightweight cipher named HIGHT, which provides adequate security at limited resource utilization, is proposed along with its FPGA implementation. The authors presented pipelined and scalar (LUT) implementations of HIGHT and claimed 18 times higher throughput at 60% less power consumption for the pipelined design as compared to their LUT based design.
In [21], the Minalpher algorithm and its implementation on various FPGA devices with simple and pipelined architectures are proposed. The performance of the Minalpher algorithm was evaluated on resource constrained hardware.
The encryption process in standard algorithms is usually carried out by creating confusion and diffusion in the data. This objective is achieved by operations such as shifting, transposition, logical operations, and multiplication. Modern advancements in the field of data security suggest the use of algorithms that can be embedded in resource constrained devices such as smart phones, PDAs, etc. [22]. Such devices have limited on-board memory and chip area; therefore, it is suggested to use algorithms with the lowest possible complexity that still provide adequate security. For this purpose, many researchers have proposed lightweight ciphers. The hardware implementation of one such lightweight block cipher, named LEA, is proposed in [23]. The algorithm was generally intended for software efficiency; therefore, the S-BOX structure was designed to use simple addition, rotation, and XOR operations. The authors proposed a custom ASIC design which achieved throughputs of 533.3, 457.1, and 400 kbps for key sizes of 128, 192, and 256 bits respectively at an operating frequency of only 100 kHz. Furthermore, the design achieved a throughput of 800 Mbps at an operating frequency of 100 MHz for the key size of 256 bits. A full loop unroll architecture based FPGA implementation of a lightweight cryptographic algorithm named Secure Force is presented in [24]. The design achieved a throughput of 3.43 Gbps at an operating frequency of 53.5 MHz. In [25], an algorithm named Triple Hill Cipher, which can secure any binary data such as video, images, or audio, is proposed. The FPGA implementation of the algorithm achieved a maximum operating frequency of 528 MHz at the expense of 4636 slices.

B. Motivation and Organization of Paper
The ability of an FPGA to process data in parallel has attracted many researchers to use FPGAs as target devices for the implementation and prototyping of cryptosystems. Apart from keeping the algorithm efficient and lightweight, many design techniques can be adopted to achieve high throughput while keeping the chip area to a minimum. Such techniques include pipelining, full loop unrolling, sub-pipelining, and partial loop unrolling [26].
In this paper, we propose a novel cryptosystem named Secure Cipher and its FPGA implementation. The rest of the paper is organized as follows. In section II, the proposed algorithm and its implementation are discussed. The experimental setup, evaluation criteria, and results are discussed in section III, followed by the conclusion in section IV.

II. PROPOSED CRYPTOSYSTEM AND FPGA IMPLEMENTATION
The primary goals of any hardware cryptographic implementation are high throughput, low latency, low chip area, high operating frequency, and low power dissipation [27]. Since all these goals can never be achieved in a single hardware implementation, trade-offs are generally considered. These trade-offs are generally between delay (latency) and chip area (resource utilization).

A. Secure Cipher
Many lightweight encryption algorithms have been proposed that are computationally inexpensive [28]. The proposed Secure Cipher is a low complexity encryption algorithm based on the Feistel structure. It is a block cipher that consists of only 5 encryption rounds. Each encryption round consists of five logical and mathematical operations that operate on 8-bit data. This creates adequate confusion and diffusion in the data to withstand various types of attacks. The proposed cryptosystem consists of the following blocks.
An example of the substitution of data using an SBOX is shown in figure 5. Each substitution box is generated in such a way that the outputs of any two or more substitution boxes cannot be the same, even when they receive exactly the same selection byte. The SBOX operation takes place after the result obtained from the 32-bit data and the round key K_R is divided into four parts of 8 bits each. Each of these 8-bit parts goes through a left shift operation as shown in figure 3. The 8-bit data is then passed to the substitution boxes as the selection byte for the respective SBOX. The SBOX transformation works as follows: the 2 bits from the MSB end and the 2 bits from the LSB end of the selection byte are concatenated to give the row number of the SBOX, and the remaining 4 bits give the column number. In the example shown in figure 5, the output of the SBOX is 8C, which is the SB1(0,15) entry of SBOX1.
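The row and column selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sbox_index` is hypothetical, and the order of concatenation (the MSB pair forming the upper two bits of the row number, the LSB pair the lower two) is assumed, since the text does not fix it. The S-box contents themselves are not reproduced here.

```python
def sbox_index(selection_byte: int) -> tuple:
    """Map an 8-bit selection byte to a (row, column) pair of a 16x16 S-box.

    Assumption: the two MSBs form the upper half of the row number and the
    two LSBs form the lower half; the four middle bits form the column.
    """
    if not 0 <= selection_byte <= 0xFF:
        raise ValueError("selection byte must fit in 8 bits")
    msb_pair = (selection_byte >> 6) & 0b11     # two most significant bits
    lsb_pair = selection_byte & 0b11            # two least significant bits
    row = (msb_pair << 2) | lsb_pair            # 4-bit row number
    col = (selection_byte >> 2) & 0b1111        # remaining four middle bits
    return row, col


# Under this reading, a selection byte of 0x3C (0011 1100) addresses
# row 0, column 15, the position quoted for the figure 5 example.
print(sbox_index(0x3C))  # (0, 15)
```

Because the row and column together use all eight bits of the selection byte, each of the 256 possible bytes addresses a distinct entry of the 16×16 array.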

B. FPGA implementation
The overall hardware architecture of the Secure Cipher is based on the loop unrolling technique. It is reported in [29] that loop unrolling is the main technique for achieving higher degrees of parallelism in reconfigurable hardware such as FPGAs. It is also reported that loop unrolling increases the area but can also improve the throughput. The Secure Cipher is implemented on an Altera Cyclone II EP2C35F672C6N FPGA using Verilog HDL.
As stated earlier, the design was carried out using the full loop unroll technique. In this implementation, the iterations of ... Another operation included in the key expansion block is the fixed matrix multiplication. There are four fixed matrices of size 8×4, and these matrices hold fixed 8-bit integer values. As illustrated in figure 1, the output of the XNOR operation is arranged row-wise in a 4×8 matrix, on which a left shift operation is applied. Each of these shifted matrices of 32 bits is multiplied with the fixed matrix, which results in a 4×4 matrix of 128 bits. The obtained 4×4 matrix then goes through a left shift operation. We observed that the fixed matrix multiplication produces results from a finite set of numbers, because each product is a binary 1 or 0 multiplied by an 8-bit wide number. Therefore, instead of using hard multiplier blocks, we transformed the fixed matrix multiplication into fixed look up tables. Each entry of the result of the fixed matrix multiplication is defined by the equation shown as the output of the look up table presented in figure 6.
In figure 6, RS is defined as the row shifted matrix of size 4×8, and FM is defined as the fixed matrix of size 8×4. The result of the equation is defined as the (i, j)th entry of the matrix labelled as the fixed matrix multiplication output FM_o.
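The replacement of the multiplier by a look up table can be illustrated with a short Python sketch. Everything concrete in it is an assumption made for illustration: the values in `FM_COLUMN` are hypothetical (the actual fixed matrices are not reproduced in this text), the bits of an RS row are taken MSB first, and the sum of products is reduced to 8 bits, since each output FM_o is 8 bits wide but the exact accumulation rule of figure 6 is not available here.

```python
# Hypothetical column j of a fixed matrix FM: eight 8-bit integers.
FM_COLUMN = [0x1B, 0x02, 0xA7, 0x40, 0x09, 0xF3, 0x5C, 0x81]


def dot_product(rs_row_bits, fm_column):
    """Reference computation: sum over k of RS(i,k) * FM(k,j).

    Assumption: the result is truncated to 8 bits.
    """
    return sum(bit * value for bit, value in zip(rs_row_bits, fm_column)) & 0xFF


def row_to_bits(row_byte):
    """Expand an 8-bit RS row into a list of bits, MSB first (assumption)."""
    return [(row_byte >> (7 - k)) & 1 for k in range(8)]


def build_lut(fm_column):
    """Precompute the output for every one of the 256 possible RS rows."""
    return [dot_product(row_to_bits(select), fm_column) for select in range(256)]


LUT = build_lut(FM_COLUMN)

# At run time the 8-bit RS row is only a select line; no multiplication occurs.
rs_row = 0b10100001
assert LUT[rs_row] == dot_product(row_to_bits(rs_row), FM_COLUMN)
print(hex(LUT[rs_row]))  # 0x43
```

Since each RS entry is a single bit, every product is either 0 or the FM entry itself, so one 256-entry table per output position covers all cases and can be mapped directly onto FPGA look up tables.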
The hardware of the fixed matrix multiplication is illustrated in figure 6. The select line of the look up table is the 8-bit wide row of the left shifted matrix RS, which means that the 8-bit wide output FM_o of the look up table is selected from 256 possible input combinations. Each input is the product of the ith element of the row of the RS matrix with the jth element of the column of the fixed matrix FM.
Each encryption round takes the output of the previous round and a round key K_R as inputs. As mentioned earlier, the F function block displayed in figure 3 is the block of principal importance in encryption. Each SBOX in the F function is an array of size 16×16, which performs substitution. The hardware of the SBOX, as shown in figure 8, is also a look up table, which selects its output from 256 standard values. The selection of the output is displayed in figure 5. The encryption process is the same in all five rounds. At the end of the 5th round, the 32-bit wide outputs are concatenated to form the cipher text or encrypted message.

III. EXPERIMENTAL SETUP
The security evaluation of cryptosystems is done on well known parameters such as the key sensitivity test based on the strict avalanche criterion (SAC), entropy, histogram, and correlation [30], [31], [32], [33], [34]. The hardware designs of cryptographic algorithms are generally compared on the basis of their logic resource utilization or area, propagation delay or latency, throughput, power consumption, and maximum operating frequency [35], [36], [37].
The target device for the proposed cryptosystem implementation is a low cost Altera Cyclone II EP2C35F672C6N FPGA. The details of the aforementioned evaluation parameters are described in the following subsections.

A. Evaluation Parameters
The performance of the Secure Cipher is evaluated on the following performance metrics. The security related tests were performed in MATLAB. The hardware performance parameters such as area, propagation delay, and throughput were evaluated on an Altera Cyclone II FPGA using Quartus II 12.1 sp1 edition software.
1) Key Sensitivity: Key sensitivity of cryptosystems is tested on the basis of the Strict Avalanche Criterion (SAC). The SAC states that "If a function is to satisfy the strict avalanche criterion, then each of its output bits should change with a probability of one half whenever a single input bit is complemented" [38]. For the key sensitivity test, each bit of the cipher text should therefore change with a probability of 50% when a single key bit is changed.
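As a rough illustration of how a mean percentage avalanche value can be measured, the following Python sketch flips each key bit in turn and counts how many output bits change. The function `toy_cipher` is a stand-in built from SHA-256 and is not the Secure Cipher; the key, the plain text, and all function names are hypothetical. The paper's own tests were run in MATLAB, so this is only a sketch of the measurement, not the authors' procedure.

```python
import hashlib


def toy_cipher(key: bytes, plaintext: bytes) -> bytes:
    """Stand-in keyed function (NOT Secure Cipher): 64 bits of SHA-256."""
    return hashlib.sha256(key + plaintext).digest()[:8]


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with bit number `index` (MSB first) complemented."""
    out = bytearray(data)
    out[index // 8] ^= 0x80 >> (index % 8)
    return bytes(out)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which a and b differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def mean_avalanche_percent(cipher, key: bytes, plaintext: bytes) -> float:
    """Average percentage of output bits that change per single key-bit flip."""
    reference = cipher(key, plaintext)
    key_bits = 8 * len(key)
    out_bits = 8 * len(reference)
    changed = sum(
        hamming_distance(reference, cipher(flip_bit(key, i), plaintext))
        for i in range(key_bits)
    )
    return 100.0 * changed / (key_bits * out_bits)


key = bytes(range(16))       # hypothetical 128-bit key
plaintext = b"8 bytes!"      # hypothetical plain text block
print(f"{mean_avalanche_percent(toy_cipher, key, plaintext):.2f} %")
```

A value close to 50% indicates that, on average, half of the output bits change when one key bit is complemented, which is what the SAC asks for.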
2) Image Entropy and Correlation: Entropy is the measure of the information content of the data. The entropy of the encrypted data should be high so that the data cannot be recognized after encryption. Correlation is defined as the measure of similarity between the adjacent pixels of an image. For an efficient cryptosystem, the correlation of an encrypted image should be as low as possible so as to ensure that the data is scrambled adequately.
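Both measures can be sketched with the Python standard library. The sketch assumes 8-bit grayscale pixels, uses Shannon entropy in bits per pixel, and uses the Pearson coefficient over horizontally adjacent pixel pairs; the function names are hypothetical, and the exact formulas used in the paper's MATLAB tests may differ.

```python
import math
from collections import Counter


def shannon_entropy(pixels) -> float:
    """Shannon entropy in bits per pixel; 8.0 is the maximum for 8-bit data."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def adjacent_correlation(image) -> float:
    """Pearson correlation between each pixel and its right-hand neighbour.

    `image` is a list of rows of pixel values. The image must not be
    constant, otherwise the variance in the denominator is zero.
    """
    xs = [row[c] for row in image for c in range(len(row) - 1)]
    ys = [row[c + 1] for row in image for c in range(len(row) - 1)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# A smooth gradient is highly correlated, as natural images tend to be.
gradient = [list(range(16)) for _ in range(4)]
print(adjacent_correlation(gradient))

# Data in which all 256 intensity values are equally likely has maximal entropy.
print(shannon_entropy(bytes(range(256))))
```

For a well encrypted image, the entropy should approach 8 bits per pixel and the adjacent-pixel correlation should approach zero.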
3) Histogram: For the security related testing, we performed the tests on image data, since results in visual form can be understood easily. The histogram of an image before encryption shows the intensity variation of the image pixels. For an encrypted image, the distribution of pixel intensities should be uniform. This shows the randomness created in the image by encryption.

4) Area: The area in FPGAs is measured in terms of the logic units or circuits used by the design. For the Altera Cyclone II FPGA family, the resource utilization or area is measured in terms of the number of logic elements (LE), whereas for Xilinx Spartan FPGAs, the term logic circuits (LC) is used. A logic element (LE) contains a 4-input Look-Up Table (LUT), a D flip-flop, and a register for carry chain connection.
In [26], it is reported that the cryptosystems designed with full loop unroll technique may have larger area on hardware as compared to partial loop unrolled architectures, but such designs can achieve high throughputs.
5) Propagation Delay: Propagation delay is the time taken by the slowest signal to propagate from the input to the output of a given circuit. The propagation delay can be greater if the circuit has complex operations and a large area. In general, the propagation delay can be high for full loop unroll designs, but it can be kept low if the algorithm's flexibility is properly utilized. For instance, in the proposed algorithm, the fixed matrix multiplication is the most complex mathematical operation, and it can cause higher delays even if hardware multiplier blocks are used to perform the multiplication. Instead of using the multiplier blocks, we propose to implement this multiplication as simple look-up tables, which are much lower in complexity than the conventional multiplication operation. Such look-up table implementations cause much less propagation delay than hard multiplier blocks.
6) Throughput: Throughput is regarded as the primary measure of speed for a hardware based cryptosystem. For hardware implementations of algorithms, throughput is the amount of data (in bits) processed per unit time. Modern hardware cryptosystems possess high speed data links; therefore, their throughput should be on the order of Mbps to Gbps so as to utilize the high data link speeds.
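The relation between throughput, clock frequency, and area can be made concrete with a small Python sketch. The formula used is the usual one for block ciphers (bits per block times clock frequency, divided by the clock cycles needed per block); the function names and the 64-bit, 50 MHz example values are hypothetical and chosen only for illustration. The 4600 Mbps and 802 LE figures are the ones reported in this paper.

```python
def throughput_mbps(block_bits: int, clock_mhz: float, cycles_per_block: int = 1) -> float:
    """Throughput in Mbps: bits per block * clock (MHz) / cycles per block."""
    return block_bits * clock_mhz / cycles_per_block


def throughput_per_le(throughput: float, logic_elements: int) -> float:
    """Throughput to area ratio in Mbps per logic element."""
    return throughput / logic_elements


# Hypothetical 64-bit block cipher with 5 rounds, clocked at 50 MHz:
# an iterative design needs 5 cycles per block, a fully unrolled one needs 1.
print(throughput_mbps(64, 50.0, cycles_per_block=5))  # 640.0
print(throughput_mbps(64, 50.0, cycles_per_block=1))  # 3200.0

# Figures reported for the proposed implementation: 4600 Mbps on 802 LEs.
ratio = throughput_per_le(4600, 802)
print(f"{ratio:.4f} Mbps/LE")  # 5.7357 Mbps/LE
```

The quotient 4600/802 is approximately 5.7357, which corresponds to the 5.735 Mbps/LE reported for the design. In practice an unrolled design usually runs at a lower clock frequency than an iterative one because its combinational path is longer, so the gain is smaller than the ideal factor shown in the hypothetical example.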

B. Results
The evaluation parameters related to security have been described in section III-A. The proposed Secure Cipher performs adequately in terms of security. The visual results of image encryption using the proposed Secure Cipher are displayed in figure 9. The security related tests were performed on the images named Cameraman and Lena, each of size 256×256. It can be seen in figure 9 that the encrypted images are impossible to identify visually.
The key sensitivity is tested on the strict avalanche criterion (SAC). Based on the SAC, we calculated the mean percentage avalanche value for 1000 variations in the input key and plain text.
The proposed Secure Cipher was implemented on an Altera DE2 board with a Cyclone II EP2C35F672C6N FPGA. The design was synthesized using Quartus II 12.1 sp1 edition. The FPGA implementation results are listed in table II. In [26], it is reported that hardware implementations with full loop unroll architectures may occupy a large area. However, the proposed Secure Cipher has low algorithmic complexity, so its resource utilization is lower than that of the designs in [15], [25], [23], and [39]. The throughput to area ratio should be high for a hardware cryptosystem, as it shows the contribution of a single LE to the speed of the hardware design. It is evident from the results in table II that the proposed Secure Cipher has a higher throughput to area ratio than the designs presented in [15], [25], and [39].

IV. CONCLUSION
Reconfigurable hardware devices such as FPGAs play a vital role in assessing the performance of cryptographic block ciphers on hardware platforms. The proposed cryptosystem, named Secure Cipher, was designed on an FPGA using the full loop unroll architecture. The hardware performance results are promising in terms of area and throughput: the complete design was implemented on an Altera Cyclone II FPGA using only 802 LE, and it achieves a throughput of 4600 Mbps with a throughput to area ratio of 5.735 Mbps/LE. The proposed Secure Cipher also ensures adequate security, with a percentage SAC value of 54.55%. As future work, a pipelined design of the proposed cryptosystem can be implemented, which would help in evaluating the flexibility of the proposed Secure Cipher.

1) Key Generation Block: The key generation block generates five round keys, one for each encryption and decryption round. It takes a 128-bit key as input and generates a round key (K_R) of 32 bits for each encryption/decryption round. The key generation block performs logical operations such as XOR and XNOR, fixed matrix multiplication, and left shift. The notation for each operation is listed in Table I.
TABLE I: Notations and their Functions (operations: Multiplication, XOR, XNOR)
The input key (K) is an array of 128 bits, which is divided into four blocks of 32 bits each. Each block of 32 bits is arranged in the form of a 4×8 matrix. A shift row operation is applied to each of the 4 matrices. Each of the shifted matrices is then arranged column-wise in a 4×8 matrix, on which XNOR logical operations are performed. The results of the XNOR operation are stored column-wise in 4 matrices of size 4×8. These matrices then undergo a shift row operation and are then multiplied with 4 individual fixed matrices of size 8×4. The four fixed matrices, labelled FM_1, FM_2, FM_3, and FM_4, are defined in equations (1), (2), (3), and (4) respectively. The detailed diagram of the key generation block is shown in figure 1.

TABLE II: Comparison of Implementation Results