Area Efficient Implementation of Elliptic Curve Point Multiplication Algorithm

Abstract: Elliptic Curve Cryptography (ECC) has established itself as one of the most preferred and secure cryptographic algorithms for secure data transfer and secure data storage in embedded system environments. Efficient implementation of the point multiplication algorithm is crucial for designing area-efficient, low-footprint ECC cryptoprocessors. In this paper, an area-efficient implementation of a double point multiplication algorithm over binary elliptic curves is presented. Area analysis of the double point multiplication algorithm based on the differential addition chains method is carried out and an area report is generated. Area optimization is achieved by using a pipelined structure and by reusing idle resources from previous stages in the processing unit. The proposed architecture for double point multiplication is implemented on a Xilinx Virtex-4 FPGA device. The architecture is modeled in Verilog HDL, synthesized using Xilinx ISE 14.1 design software, and found to be more area-efficient than comparable existing architectures.

Victor Miller and Neal Koblitz proposed the concept of elliptic curve cryptography in the mid-1980s, and it was considered the next big step in public key cryptographic systems. Other public key algorithms include DSA and RSA. The main advantage of ECC over RSA is its shorter key length. Its drawback is that software implementations of ECC are very slow, whereas hardware implementations are much more efficient. Hence ECC is a good choice for cryptographic hardware implementation. Owing to these advantages, a number of hardware implementations have been proposed, and ECC has been included in many standards such as IEEE 1363 and NIST.
An operation called point addition is defined on an elliptic curve. In point addition, two points on the curve are added to obtain a third point, which also lies on the curve, as shown in figure 1. Importantly for cryptography, it is very hard to determine which two points were added. Furthermore, using consecutive point additions, an operation called "elliptic curve point multiplication" is defined. The most expensive finite field operation in point addition and point doubling is finite field inversion. However, by using projective coordinates, inversion can be replaced by less expensive finite field operations such as addition and multiplication. Elliptic curve point doubling and point multiplication are shown in figures 2 and 3.
A vast number of resource-constrained and high-performance embedded applications use ECC-based public key cryptography because of its shorter key sizes. The core operation in ECC systems is point multiplication. The security of cryptosystems like ECC depends mainly on the difficulty of the discrete logarithm problem (DLP). A commonly adopted method of solving the DLP is Pollard's rho technique [1]. A traditional technique for computing aP is to apply a variant of the double-and-add algorithm to the binary form of the secret scalar 'a'. Such an algorithm is vulnerable to power analysis attacks when the doubling and addition operations are distinguishable [2]. One method of giving Diffie-Hellman type protocols some level of protection against side channel attacks is to split the scalar as a = r + (a − r) for some secret random integer r, and to compute aP = rP + (a − r)P [3].
For the sake of generality, let G be an additive abelian group. Given an integer a and a point P belonging to G, a (single) point multiplication algorithm computes aP belonging to G. Given two integers a, b and two points P, Q belonging to G, a double point multiplication algorithm computes aP + bQ belonging to G. Having an efficient and secure double point multiplication algorithm is important for many cryptographic schemes. Another scenario where one needs efficient and secure double point multiplication is to speed up single point multiplication over elliptic curves with endomorphisms, as in [4], [5], [6].
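The double-and-add method and the scalar splitting a = r + (a − r) described above can be illustrated with a minimal sketch. It is not the paper's implementation: the integers modulo a small number `N` stand in for the group of curve points, and `N`, `add`, `double_and_add` and `split_multiply` are hypothetical names chosen for illustration.

```python
import secrets

# Assumption: integers modulo N under addition stand in for the elliptic
# curve group, so the "point" P is just an integer in this toy model.
N = 1009  # hypothetical group order

def add(p, q):
    """Group operation of the stand-in group."""
    return (p + q) % N

def double_and_add(a, p):
    """Left-to-right binary method: one doubling per bit, one addition per 1 bit."""
    acc = 0  # identity element
    for bit in bin(a)[2:]:
        acc = add(acc, acc)          # doubling, executed for every bit
        if bit == "1":
            acc = add(acc, p)        # addition, executed only for 1 bits (the leak)
    return acc

def split_multiply(a, p):
    """Compute aP as rP + (a - r)P for a fresh random r."""
    r = secrets.randbelow(N)
    return add(double_and_add(r, p), double_and_add((a - r) % N, p))

print(double_and_add(77, 5), split_multiply(77, 5))
```

The data-dependent branch inside `double_and_add` is what makes the operation sequence reveal the scalar bits; the split changes the scalars that are actually processed in each run.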
A simple way to implement double point multiplication is to perform two single point multiplications in parallel. Straus-Shamir's trick [7] and interleaving [8] are two methods that instead process both scalars simultaneously. Straus-Shamir type simultaneous double point multiplication algorithms are sensitive to side-channel analysis because the double and add operations are not executed in a regular pattern. Fortunately, recoding the scalars a and b allows one to use Straus-Shamir type algorithms in such a way that the same instructions are executed in the same order. Joye and Tunstall [9] introduced several techniques for regular recoding of scalars for regular point multiplication algorithms, which can immediately be adapted to yield regular simultaneous double point multiplication algorithms. In particular, their signed-digit recoding technique with the digit set {±1, ±3} generates a regular double point multiplication algorithm, referred to as the JT-{±1, ±3} algorithm. JT-{±1, ±3} costs 0.5 additions and 1 doubling per scalar bit. Adapting differential addition chains (DAC) is another technique for computing simultaneous double point multiplication [10], [11], [12]. The DAC method is attractive because it produces algorithms that are potentially resistant to simple power analysis, owing to the uniform pattern of operations executed, and it is particularly efficient in the elliptic curve setting because the double and add operations can be computed using x-coordinates only. Bernstein [12] proposed a double point multiplication algorithm based on the new binary chain, known as the B-NBC algorithm. B-NBC has a uniform structure, and costs 2 additions and 1 doubling per scalar bit. Recently, Azarderakhsh and Karabina [13] designed a simultaneous double point multiplication algorithm based on DAC, the AK-DAC algorithm. AK-DAC has a uniform structure, and costs 1.4 additions and 1.4 doublings per scalar bit.
The three double point multiplication algorithms mentioned above, JT-{±1, ±3}, B-NBC, and AK-DAC, are regular, and hence they are potentially resistant to power analysis attacks. Nevertheless, comparing these algorithms in terms of efficiency is not straightforward. Although JT-{±1, ±3} has the best per-bit cost, B-NBC and AK-DAC have the benefit of being based on DAC. For example, in the elliptic curve setting, one can implement B-NBC and AK-DAC using addition formulas that involve only the x-coordinates of the points and are much more efficient than their conventional counterparts. Moreover, JT-{±1, ±3} cannot be executed in parallel because an addition must always be performed after two successive doublings. Double and add operations can be completely parallelized in both B-NBC and AK-DAC. If one allocates 2 parallel addition/doubling units, then the per-bit costs of B-NBC and AK-DAC become 1A+1D and 1.4A, respectively. In the same way, if one allocates 3 parallel addition/doubling units, then the per-bit cost of B-NBC becomes 1A.
In this paper, a hardware architecture for area-efficient elliptic curve point multiplication using AK-DAC over standard Weierstrass binary elliptic curve groups is implemented and investigated for area occupancy. The aim is to realize this promising regular algorithm with a low hardware requirement (area).
The rest of the paper is organized as follows. Section 2 reviews recent research related to the proposed work, and Section 3 discusses the motivation and methodology of the research. Section 4 explains and analyzes the proposed architecture with diagrams and algorithms, and Section 5 reports the experimental results and compares them with existing works. Finally, Section 6 concludes the work.
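To make the idea of simultaneous double point multiplication concrete, the following is a minimal sketch of the basic (non-regular) Straus-Shamir trick. As an assumption for illustration, the integers modulo a small `N` stand in for the curve group; `N`, `add` and `straus_shamir` are hypothetical names, and the operation counters are there only to show the cost pattern.

```python
# Assumption: integers modulo N under addition stand in for curve points.
N = 1009  # hypothetical group order

def add(p, q):
    return (p + q) % N

def straus_shamir(a, b, p, q):
    """Compute aP + bQ with one shared doubling per bit position."""
    table = {(1, 0): p, (0, 1): q, (1, 1): add(p, q)}  # P + Q is precomputed
    acc, additions, doublings = 0, 0, 0
    for i in range(max(a.bit_length(), b.bit_length()) - 1, -1, -1):
        acc = add(acc, acc)
        doublings += 1
        bits = ((a >> i) & 1, (b >> i) & 1)
        if bits != (0, 0):            # irregular: no addition when both bits are 0
            acc = add(acc, table[bits])
            additions += 1
    return acc, additions, doublings

result, additions, doublings = straus_shamir(45, 27, 3, 7)
print(result, additions, doublings)
```

The skipped addition when both scalar bits are zero is the irregularity that regular recodings such as the one by Joye and Tunstall remove.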
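The per-bit costs quoted above translate into total operation counts once a scalar length is fixed. The short sketch below does this arithmetic; the 163-bit scalar length is an assumption borrowed from the GF(2^163) designs discussed in the related work, and `PER_BIT_COST` and `total_operations` are hypothetical names.

```python
# Per-bit costs (additions, doublings) as quoted in the text for one
# sequential addition/doubling unit.
PER_BIT_COST = {
    "JT-{±1,±3}": (0.5, 1.0),
    "B-NBC": (2.0, 1.0),
    "AK-DAC": (1.4, 1.4),
}

def total_operations(scalar_bits, algorithm):
    """Approximate number of additions and doublings for a given scalar length."""
    adds, doubles = PER_BIT_COST[algorithm]
    return adds * scalar_bits, doubles * scalar_bits

SCALAR_BITS = 163  # assumption: scalar length for a curve over GF(2^163)

for name in PER_BIT_COST:
    adds, doubles = total_operations(SCALAR_BITS, name)
    print(f"{name:12s} additions ~ {adds:6.1f}  doublings ~ {doubles:6.1f}")
```

These counts ignore the cost difference between conventional and x-coordinate-only formulas, which is exactly why the text notes that the comparison is not straightforward.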

II. RELATED WORK
The literature offers a wide range of VLSI architectures for point multiplication in ECC, and the existing architectures need to be understood before a new one is proposed. Reza Azarderakhsh and Koray Karabina [13] designed a new double point multiplication algorithm and applied it to binary elliptic curves with endomorphisms. Their algorithm is based on differential addition chains, has a uniform structure, and offers some degree of built-in resistance against side channel analysis attacks. The double point multiplication algorithm is an adaptation of Montgomery's PRAC algorithm. The work also demonstrated how double point multiplication can be employed to speed up the computation of single point multiplication on elliptic curves with efficiently computable endomorphisms. The reported speedups are 30% and 18% for computing single point multiplication with and without parallel multipliers, respectively.
www.ijacsa.thesai.org
Efficient elliptic curve point multiplication using digit-serial binary field operations was designed by Gustavo D. Sutter et al. [14]. They presented a new high-speed point multiplier for elliptic curve cryptography using either field-programmable gate array or application-specific integrated circuit technology. Their design adopted a digit-serial approach to GF multiplication and GF division in order to construct an efficient elliptic curve multiplier using projective coordinates. The design involves many basic arithmetic operations in the underlying finite field, and different acceleration techniques exist to improve the performance of the ECC operations. Their point multiplication technique used three algorithms: the Montgomery ladder algorithm, point multiplication, and point multiplication using three multipliers and one divider with precomputation of $x_P^{-1}$. This design achieved point multiplication over $GF(2^{163})$ in 19.38 μs on Virtex-E devices and in 5.48 μs on Virtex-5.
Efficient RNS implementation of elliptic curve point multiplication over GF(p) was designed by Mohammad Esmaeildoust et al. [15]. In this design, a new hardware architecture for ECPM over GF(p) based on the residue number system (RNS) was established. The architecture uses RNS bases with various word lengths to implement RNS Montgomery multiplication efficiently. Two versions, a fast design and an area-efficient design, of RNS Montgomery multiplication in six-stage and four-stage pipelined architectures were used. Compared to state-of-the-art implementations, their design achieved higher speeds and a better area-delay product.
Kimmo U. Järvinen et al. [16] suggested an efficient algorithm and architecture for elliptic curve cryptography in extremely constrained secure applications. They proposed an efficient implementation of point multiplication on Koblitz curves. The design adopted a Gaussian normal basis (GNB) representation of field elements and employed an efficient bit-level GNB multiplier. Thanks to the special property of the normal basis representation, squarings were implemented very efficiently in hardware as rewiring. A new technique was also introduced for point addition in affine coordinates that requires fewer registers. With this technique, an extremely small processor architecture for point multiplication was obtained. The architecture offered better results than previous works, making it suitable for extremely constrained, secure environments.
Theoretical modeling of an elliptic curve scalar multiplier on LUT-based FPGAs for area and speed efficiency was presented by Sujoy Sinha Roy et al. [17]. The primitives used in the elliptic curve scalar multiplier architecture (ECSMA) were modeled on k-input lookup table (LUT)-based field-programmable gate arrays in order to estimate their delays. The model was used to determine the optimal number of pipeline stages and the ideal placement of each stage in the ECSMA. Suitable scheduling was created to perform point addition and doubling in a pipelined data path. The three-stage pipelined architecture for double-and-add based scalar multiplication was implemented on Xilinx Virtex-5 platforms over $GF(2^{163})$. The implementation used a novel pipelined bit-parallel Karatsuba multiplier with subquadratic complexity. In that design, an efficient choice of scalar multiplication algorithm, optimized field primitives, balanced pipeline stages, and enhanced scheduling of point arithmetic resulted in a high-speed architecture with a significantly small area.
Hossein Mahdizadeh and Massoud Masoumi [18] designed a novel architecture for efficient FPGA implementation of an elliptic curve cryptographic processor over $GF(2^{163})$. The critical path of the Lopez-Dahab (LD) scalar point multiplication architecture was reorganized and reordered for maximum architectural and timing improvement, such that logic structures were implemented in parallel and operations on the critical path were diverted to noncritical paths. The execution delay of the LD algorithm was reduced by parallelizing the multipliers used in the projective coordinate calculations. The ECC processor was implemented in synthesizable VHDL, and synthesized, placed, and routed using Xilinx ISE 12.1. This design completes the computations in projective coordinates in 326 × [m/G1] + 1304 cycles and the coordinate conversion in 15 × [m/G2] + 214 cycles. With G1 = 33, their design was four times faster than other designs.
A hybrid binary-ternary number system for elliptic curve cryptosystems was designed by Jithra Adikari et al. [19]. The most computationally intensive operations in elliptic curve based cryptosystems are single and double scalar multiplications. Their performance was improved by means of integer recoding techniques that aim to minimize the density of nonzero digits in the scalars. The work presented three novel algorithms for both single and double scalar multiplication: w-HBTF, HBTJF, and RHBTJF. They are used to find a short and sparse representation for a single scalar or a joint representation for a pair of scalars. The results showed that the hybrid algorithms are almost always faster than the classical w-NAF and JSF methods.
Kazuo Sakiyama et al. [20] implemented a tripartite modular multiplication, a systematic approach that maximizes the level of parallelism in modular multiplication. The algorithm integrates three existing algorithms, namely classical modular multiplication based on Barrett reduction, modular multiplication with Montgomery reduction, and Karatsuba multiplication, in order to reduce the computational complexity and increase the potential for parallel processing. The algorithm is very effective in multiprocessor environments for both hardware and software implementations, and it achieves a higher speed than other algorithms for modular multiplication.

III. PROPOSED METHODOLOGY
Most of the methods implemented for point multiplication include a pre-computation stage before the actual point multiplication. The pre-computation stage computes intermediate points, which are then used to increase the throughput of the point multiplication process. Highly efficient elliptic curve point multiplication is therefore an important topic in cryptography. The traditional and least complex method for point multiplication is the binary method, well known as the double-and-add method. Another double point multiplication algorithm discussed in the literature is the naive method. However, all these methods only speed up the point multiplication, whereas for VLSI architectures hardware utilization is the major requirement, so the thrust should be on optimizing the area required by the proposed system. Thus, in this paper, an area-efficient implementation of double point multiplication over binary elliptic curves is presented. Area analysis of the double point multiplication algorithm based on the differential addition chains method is carried out. In this setting, the efficiency of a scheme is assessed primarily by the area it requires. The proposed architecture for double point multiplication is implemented on a Xilinx Virtex-4 FPGA device. The proposed architecture is modeled in Verilog HDL and synthesized using Xilinx ISE 14.1 design software.

IV. PROPOSED DOUBLE POINT MULTIPLICATION ALGORITHM
The proposed implementation of the elliptic curve double point multiplication algorithm is loosely based on Montgomery's PRAC algorithm [13]. The algorithm is simplified and modified in order to make the design more area-efficient than the existing design. The modified double point multiplication algorithm used in the proposed work is shown in figure 4 as a flowchart.
Fig. 4. Flowchart for the proposed modified double point multiplication.
The architecture of the proposed area-efficient point multiplication scheme using double point multiplication is shown in figure 5. The architecture includes a processing unit, a memory unit and a control unit.
Fig. 6. Modified data dependency graph for the processing unit.

A. Processing Unit
The processing unit is a combined architecture for differential point addition and differential point doubling. It occupies the major portion of the available slices because it contains the various finite field arithmetic units needed to compute the point addition and doubling outputs. The main contribution of this work is therefore the design of an area-efficient processing unit with a reduced number of arithmetic units. The modified area-efficient data dependency graph for the processing unit is shown in figure 6.
The proposed data dependency graph for computing double point multiplication employs area-efficient finite field multipliers, squarers, and adders, based on the differential point addition and doubling formulae given in [4]. The processing unit is designed with a 4-stage pipeline in order to reduce the number of arithmetic units used for computation.
The inputs to the processing unit are three points and the difference between two points (the input point values are selected based on the sequence from the control unit). The parameter 'a' is a constant from the equation of the elliptic curve used for cryptography. When the inputs are loaded into the processing unit, they are processed in 4 stages. After each stage completes, its values are stored temporarily in the respective registers (buffers), and only then does the next stage begin. Hence, in the proposed architecture, resources such as registers and arithmetic units that were used in a previous stage are reused. For example, in the data flow graph the multiplier used in the first stage of computation can be reused in stage four, and the squarer used in the fourth stage can be reused in the final output computation stage, thereby reducing the need for extra multipliers and squarers. Buffers that were used in a previous stage and are empty in later stages are also reused, making the processing unit area-efficient. The arithmetic units incorporated in the proposed resource-reusing combined architecture for differential point addition and differential point doubling are discussed in detail in the following sections.

B. Addition Unit
The addition performed in the processing unit is binary finite field addition, i.e., coefficient-wise addition modulo 2. Let 'A' and 'B' be two elements of the binary finite field.
Fig. 7. Finite field adder.
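Since addition of binary field elements is coefficient-wise addition modulo 2, it reduces to a bitwise XOR of the two bit vectors. A minimal sketch, assuming field elements are stored as Python integers whose bit i is the coefficient of $x^i$ (`gf2m_add` is a hypothetical name):

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two binary field elements: each coefficient pair is added modulo 2."""
    return a ^ b  # no carries propagate, so one XOR gate per bit suffices

# (x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 1
print(bin(gf2m_add(0b1011, 0b0110)))
```

Because no carry chain is involved, the same operation also serves as subtraction, and every element is its own additive inverse.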

C. Squaring Unit
Squaring an element 'A' in a binary finite field is simpler than finite field multiplication. Squaring involves two steps. In the first step, zeros are inserted between consecutive bits of the bit vector representing 'A', as shown in figure 8. In the second step, the bit vector obtained from the first step is reduced modulo $f(x)$, where $f(x)$ is a degree-$m$ irreducible polynomial. In hardware, the reduction can be done using XOR and shift operations. The squaring operation for 'A' is represented as
$$A^2 = \sum_{i=0}^{m-1} a_i x^{2i} \bmod f(x)$$
Fig. 8. Zero insertion for squaring.
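The two squaring steps, zero insertion followed by reduction modulo $f(x)$, can be sketched as follows. This is an illustration only: the toy field $GF(2^4)$ with $f(x) = x^4 + x + 1$ is an assumption made to keep the numbers small, and `spread_bits`, `reduce_poly` and `gf2m_square` are hypothetical names.

```python
M = 4          # assumption: toy field GF(2^4) instead of a cryptographic size
F = 0b10011    # assumption: f(x) = x^4 + x + 1, irreducible over GF(2)

def spread_bits(a, m=M):
    """Step 1: insert a zero between consecutive bits (a_i moves to position 2i)."""
    out = 0
    for i in range(m):
        if (a >> i) & 1:
            out |= 1 << (2 * i)
    return out

def reduce_poly(v, f=F, m=M):
    """Step 2: reduce modulo f(x) using only shifts and XORs."""
    for i in range(v.bit_length() - 1, m - 1, -1):
        if (v >> i) & 1:
            v ^= f << (i - m)   # cancel the term x^i
    return v

def gf2m_square(a, f=F, m=M):
    return reduce_poly(spread_bits(a, m), f, m)

# (x^3 + x + 1)^2 = x^6 + x^2 + 1, which reduces to x^3 + 1
print(bin(spread_bits(0b1011)), bin(gf2m_square(0b1011)))
```

No multiplier is needed at any point, which is why a squarer is much cheaper in area than a general field multiplier.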

D. Multiplication Unit
The design of finite field multipliers is a complex issue in the implementation of an ECC processor. A number of multipliers with different area and time complexities are reported in the literature. In this work, an area-efficient architecture for the Karatsuba multiplier, which incorporates a digit-level polynomial basis multiplier, is adopted.
The modified Karatsuba multiplier used in the proposed architecture for double point multiplication multiplies two finite field inputs 'A' and 'B' of m-bit length. In the Karatsuba multiplier, each operand is first split into two equal parts and then processed. The internal processor includes 3 multipliers and 4 adders.
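One level of Karatsuba splitting over binary polynomials can be sketched as below. The mapping of the "3 multipliers and 4 adders" onto the individual operations is one plausible reading and is an assumption, as are the names `clmul` and `karatsuba_gf2`; the sketch produces the unreduced product, so modular reduction would follow as a separate step.

```python
def clmul(a, b):
    """Schoolbook carry-less (binary polynomial) multiplication, used as the sub-multiplier."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, m):
    """One Karatsuba level for m-bit operands: 3 half-size products instead of 4."""
    h = (m + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h        # A = A1 * x^h + A0
    b0, b1 = b & mask, b >> h        # B = B1 * x^h + B0
    p0 = clmul(a0, b0)               # multiplier 1
    p2 = clmul(a1, b1)               # multiplier 2
    p1 = clmul(a0 ^ a1, b0 ^ b1)     # adders 1 and 2, multiplier 3
    middle = p1 ^ p0 ^ p2            # adders 3 and 4
    return (p2 << (2 * h)) ^ (middle << h) ^ p0

a, b = 0b11010111, 0b10110011
print(karatsuba_gf2(a, b, 8) == clmul(a, b))
```

Trading one sub-multiplier for a few XOR adders is the source of the area saving, since adders are far smaller than multipliers in a binary field.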
The architecture of the Karatsuba multiplier is shown in figure 9. The multiplier used here is a digit-level polynomial basis multiplier for computing the product of two field elements. The operand 'A' register and the operand 'B' register are each initially loaded with l bits. The D-block is an array of AND gates, as shown in figure 10.

Once the partial products are computed, the adder block adds all of them using an array of XOR gates, the same as that used for the addition of field elements. The main advantage of using this multiplier in the proposed technique is that it can operate at higher clock frequencies than the other multipliers reported in the literature. The architecture of the digit-level polynomial basis multiplier used in the proposed technique is shown in figure 11.
Fig. 10. Internal structure of the D-block.
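The behaviour of a digit-level polynomial basis multiplier, with an AND-gate D-block producing partial products and an XOR adder accumulating them, can be modelled in a few lines. This is a behavioural sketch rather than the paper's circuit: the most-significant-digit-first ordering, the toy field $GF(2^4)$ with $f(x) = x^4 + x + 1$, and the names `d_block` and `digit_serial_mul` are all assumptions.

```python
M = 4          # assumption: toy field GF(2^4)
F = 0b10011    # assumption: f(x) = x^4 + x + 1

def reduce_poly(v, f=F, m=M):
    """Reduce modulo f(x) with shifts and XORs."""
    for i in range(v.bit_length() - 1, m - 1, -1):
        if (v >> i) & 1:
            v ^= f << (i - m)
    return v

def d_block(a_bit, b):
    """AND array: every bit of B is ANDed with a single bit of A."""
    return b if a_bit else 0

def digit_serial_mul(a, b, d, f=F, m=M):
    """Process d bits of A per clock cycle, most significant digit first."""
    n_digits = -(-m // d)            # ceil(m / d) cycles
    acc, cycles = 0, 0
    for k in range(n_digits - 1, -1, -1):
        digit = (a >> (k * d)) & ((1 << d) - 1)
        partial = 0
        for i in range(d):           # d partial products, XOR-accumulated
            partial ^= d_block((digit >> i) & 1, b) << i
        acc = reduce_poly((acc << d) ^ partial, f, m)
        cycles += 1
    return acc, cycles

for d in (1, 2, 4):
    print(d, digit_serial_mul(0b1011, 0b0110, d))
```

A larger digit size `d` lowers the cycle count but widens the AND and XOR arrays, which matches the area versus speed trade-off discussed in the results.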

E. Inversion Unit
Inversion is the most expensive finite field arithmetic operation. In general, the inverse of a nonzero element $A \in GF(2^m)$ can be computed as
$$A^{-1} = A^{2^m - 2}$$

F. Control and Memory Unit
The control unit, designed with LUTs, generates the control signals according to the input rule and the flowchart given in figure 4. Based on the input rule, appropriate selector signals are generated and fed to the multiplexers. On each clock pulse, the selector signals are generated, and based on them the contents of the registers are fed to the processing unit. At the start of the process, Register $P_1$ and Register $P_2$ are initialized with the input points $P_1$ and $P_2$ respectively. The control unit is simple and occupies a smaller area than the other units in the architecture.
The block diagram of the control unit is shown in figure 12. Register files are used instead of RAM blocks for storing the points and all other data needed for the computation. This is because RAM blocks require communication between the memory unit and the processing unit, which is not required with register files.

V. RESULTS AND DISCUSSION
In this section, the proposed architecture for double point multiplication is implemented in order to analyze its area and power requirements.
Fig. 13. RTL schematic for the proposed double point multiplication architecture.
The Xilinx® Virtex™-4 xc4vlx200 device is used as the target FPGA. The proposed architecture is modeled in Verilog HDL and synthesized for different digit sizes using XST™ of Xilinx® ISE™ version 14.1 design software. All the experiments were performed on a 3.10 GHz Intel(R) i5 machine with 4.00 GB RAM and a 32-bit Windows 7 Professional operating system. Figure 13 shows a snapshot of the RTL schematic of the proposed double point multiplication architecture.

A. Area report of proposed scheme:
The area utilization of the proposed double point multiplication architecture is compared with that of other existing methods, namely the Naive Method, B-NBC, JT-{±1, ±3} and AK-DAC. The target device includes 89,088 slices (178,176 4-input LUTs and 178,176 slice FFs) and 960 bonded IOBs. Each slice contains 2 flip-flops (FFs) and 2 look-up tables (LUTs). The resource utilization comparison in table II shows that the proposed scheme uses far fewer slices than the other existing methods. The proposed implementation uses on average only 6.5% of the 89,088 slices available in the device, whereas all the other methods report a higher percentage of device utilization.
As the value of 'd' increases from 7 to 13, the footprint of the proposed architecture increases. The area comparison of the proposed method with other similar existing methods is shown in figure 14. For a digit size of 7, the proposed architecture uses 37.95% fewer slices than the Naive method, 25.9% fewer than B-NBC, 8% fewer than the JT-{±1, ±3} method and 6% fewer than the AK-DAC technique. For a digit size of 26, the proposed architecture uses 47.42% fewer slices than the Naive method, 37.22% fewer than B-NBC, 22.09% fewer than the JT-{±1, ±3} method and 21.14% fewer than the AK-DAC technique.
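The device figures quoted above can be cross-checked with a few lines of arithmetic. The sketch only uses numbers stated in the text; the variable names are hypothetical.

```python
TOTAL_SLICES = 89_088        # slices available on the target device (from the text)
LUTS_PER_SLICE = 2           # each slice has 2 LUTs and 2 flip-flops
FFS_PER_SLICE = 2
AVERAGE_UTILISATION = 0.065  # 6.5% average slice utilisation of the proposed design

total_luts = TOTAL_SLICES * LUTS_PER_SLICE
total_ffs = TOTAL_SLICES * FFS_PER_SLICE
average_slices_used = AVERAGE_UTILISATION * TOTAL_SLICES

print(total_luts, total_ffs)         # both should match the quoted 178,176
print(round(average_slices_used))    # approximate slice count behind the 6.5% figure
```

The slice totals agree with the stated LUT and flip-flop counts, and 6.5% of the device corresponds to roughly 5,800 slices.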

B. Power and Performance Report of proposed architecture
The total clock periods required for the computation, and the frequency and power needed by the implemented architecture, are tabulated in table III. Figure 15 plots 'd' along the x-axis against clock periods along the y-axis. With the increase in digit size of the multiplier, the clock periods (computation time) increase. A large digit size multiplier can boost the throughput of the architecture, but with the increase in digit size the need for registers, AND gates, XOR gates and shift logic also increases, which contributes to chip area. Since the thrust is on area reduction, a low digit size has been chosen for implementation.
Methods compared: Naive Method [13], B-NBC [12], JT-{±1, ±3} [9], AK-DAC [13], Proposed.
Fig. 15. Graph plot between 'd' and 'Clock Periods (ns)'.
From the graph shown in figure 16, it is observed that as the digit size of the multiplier rises, the operating clock frequency of the implementation decreases.
The power consumption of the proposed design is mainly due to leakage power and clock power. Figure 17 plots digit size (d) against the total power consumption of the proposed module. It can be observed that with the increase in digit size of the multiplier, the power consumed by the architecture increases. From the above analysis, if a low value of 'd' is selected, then area and power consumption decrease but the speed of computation also decreases. On the other hand, if 'd' is made high, then area and power consumption increase while the computation becomes faster. Hence, in order to make the proposed architecture efficient in terms of area, power and performance, a balanced, intermediate value of the digit size should be chosen.

VI. CONCLUSION
In this paper, an area-efficient elliptic curve point multiplication architecture using a double point multiplication technique is designed and implemented. Reuse of idle resources and a pipelined data path for data processing in the combined module for the differential point adder and point doubler were presented. The finite field arithmetic operators were designed to reduce area utilization. The complete architecture was synthesized and simulated using Xilinx ISE 14.1. Reports were generated in terms of area, power and time by varying the digit size of the multiplier. The results obtained from the area report were compared with other similar existing methods reported in the literature and were found to be better. For further area optimization of the proposed architecture, future research should focus on designing an area-optimized finite field multiplier.

The addition of 'A' and 'B' produces the result 'C' as $C = A \oplus B$, where the symbol '$\oplus$' denotes the XOR operator. Hence, in the hardware realization of the addition unit, an XOR gate array is used, as shown in figure 7, for adding two binary finite field elements. The addition utilizes only one clock cycle for storing the result in the respective output register.

Hence the input to the digit-level polynomial basis multiplier is of $l$-bit length. The D-block performs the operation $a_i B$. Hence the output of the D-block is available only if the bit value of the corresponding A-register is '1'; if it is '0', the output of the D-block becomes '0'. When all the $d$ partial products are computed, the adder block sums them.

The method proposed by Itoh and Tsujii (IT) [20] is the most efficient for inversion, and hence the same technique has been adopted for the hardware implementation of the inversion module in the proposed architecture. H is the number of ones in the binary representation of $m-1$, known as the 'Hamming weight'. The Itoh and Tsujii algorithm for inversion is given in Algorithm 1.
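The Itoh and Tsujii approach can be sketched in software as an addition chain on the exponent $2^{m-1} - 1$, followed by one final squaring. This is a generic textbook-style formulation, not a transcription of the paper's Algorithm 1; the helper names are hypothetical, and the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ is used as an assumed example field.

```python
def gf_mul(a, b, f, m):
    """Shift-and-add multiplication in GF(2^m) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r

def gf_sq(a, f, m):
    return gf_mul(a, a, f, m)

def itoh_tsujii_inverse(a, f, m):
    """Return (a^-1, number of multiplications) for nonzero a in GF(2^m)."""
    beta, k, mults = a, 1, 0          # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:        # remaining bits of m - 1 after the leading 1
        t = beta
        for _ in range(k):            # t = beta^(2^k), using k squarings
            t = gf_sq(t, f, m)
        beta = gf_mul(t, beta, f, m)  # beta = a^(2^(2k) - 1)
        mults += 1
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_sq(beta, f, m), a, f, m)  # beta = a^(2^(k+1) - 1)
            mults += 1
            k += 1
    return gf_sq(beta, f, m), mults   # a^(2^m - 2) = a^-1

M163 = 163
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # assumed example polynomial
inv, mults = itoh_tsujii_inverse(0x1234567, F163, M163)
print(gf_mul(0x1234567, inv, F163, M163), mults)
```

In this formulation the number of field multiplications equals the bit length of $m-1$ minus one, plus the Hamming weight of $m-1$ minus one; all remaining work is squarings, which are cheap in a binary field.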
TABLE II. AREA COMPARISONS OF DIFFERENT DOUBLE POINT