High Precision DCT CORDIC Architectures for Maximum PSNR

This paper proposes two optimal Cordic Loeffler based DCT (Discrete Cosine Transform) architectures: a fast and low-power DCT architecture and a high-PSNR DCT architecture. The CORDIC rotation parameters required for these architectures are calculated using a MATLAB script, which allows the precision of the angle to be varied from $10^{-1}$ to $10^{-4}$. The experimental results show that the fast and low-power DCT architecture corresponds to the precision $10^{-1}$. Its complexity is even lower than that of the BinDCT, which is a reference in terms of low complexity, and its power consumption is improved by 12 mW in comparison with the conventional Cordic Loeffler DCT. The experimental results also show that the high-PSNR DCT architecture corresponds to the precision $10^{-3}$, for which the PSNR is improved by 6.55 dB in comparison with the conventional Cordic Loeffler DCT. Finally, the hardware implementation and the generated RTL of some of the required Cordics are presented.

Keywords: Cordic Loeffler DCT; high quality architecture; low power architecture; Image Processing; DCT


I. INTRODUCTION
The Discrete Cosine Transform (DCT) was developed by Ahmed et al. in 1974 [1]. It is a robust approximation of the optimal Karhunen-Loeve Transform (KLT) [2]. It has become one of the most widely used transform techniques in digital signal processing.
Many works deal with the optimization of DCT architectures. Two principal axes are explored. The first one consists of enhancing the quality of the DCT in terms of precision, measured through the Peak Signal to Noise Ratio (PSNR) ([3], [4]). The reference in this case is the Loeffler-based DCT, which is the most precise architecture since it does not contain approximations.
The second axis consists of improving the DCT in terms of power consumption ([5], [6], [7]). In fact, it is well known that the DCT is one of the computationally intensive transforms, since it requires many multiplications and additions. Much research has been done on low-power DCT designs [8], [9]. As multiplications are energy-expensive operations, several algorithms are based on additions and shifts instead of multiplications.
In 2004, Jeong et al. [9] suggested improving a Cordic (COordinate Rotation DIgital Computer) based implementation of the DCT. CORDIC is an algorithm which can be used to evaluate various functions in signal processing [10], [11], [12]. In [9], the authors proposed a low-complexity CORDIC-based DCT algorithm based on the Flow Graph Algorithm (FGA), which is the commonly used way to represent the fast DCT. It requires only 38 add and 16 shift operations and consumes about 26.1% less power compared to [13], with a minor image quality degradation of 0.04 dB.
In the same direction, Sun et al. [14], [15] proposed a new flow graph for a Cordic-based Loeffler DCT implementation. A new table of parameters is obtained with a new choice of the elementary rotations. Their experimental results show that the Cordic-based Loeffler DCT consumes 16% of energy compared to [16], with a minor image quality degradation of 0.03 dB.
After this analysis of the state of the art, we remark that previous works have almost neglected the quality of the results provided by the DCT algorithm in order to decrease the energy consumption. In the aforementioned works, the precision degree reached is at most $10^{-4}$. We propose to remain in the same interval ($10^{-1}$ to $10^{-4}$) and provide two optimal architectures. The first one is a fast and low-power DCT architecture and the second one is a high-PSNR DCT architecture. The parameters of the two architectures are obtained from a MATLAB script which calculates the rotation parameters of the considered angles.

The contributions of this paper are:
• A MATLAB script which calculates the CORDIC parameters of the desired angles.
• A high-PSNR DCT architecture which is the closest to the reference in terms of image quality (the Loeffler-based DCT [16]), with significant power reduction.
• A fast and low-power DCT architecture which is the closest to the reference in terms of low complexity (the BinDCT [17]), with substantial PSNR improvement.
This paper is organized as follows. Section 2 briefly introduces the algorithms of the conventional Cordic-based DCT architecture. In Section 3, the proposed architectures and their Cordic parameters are presented. The experimental results are shown in Section 4, while Section 5 concludes this paper.

A. Cordic Algorithm
The conventional Cordic algorithm [10], [11] is a hardware-efficient algorithm used for the approximate computation of transcendental functions. It only uses shift and addition operations. The Cordic algorithm can operate in two modes, namely vectoring and rotation; this paper focuses on the first mode.
In the conventional Cordic algorithm, a rotation angle is decomposed into a combination of micro-rotation angles of arctangent radix. When the vector is rotated by an angle $\theta_i$, the coordinates change from $(X_i, Y_i)$ to $(X_{i+1}, Y_{i+1})$.
The value of the vector after this micro-rotation can be represented as in Equation (1), where $\theta_i = \arctan(2^{-i})$, $\sigma_i = \pm 1$ and $K_i = \cos(\theta_i)$.
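For reference, a micro-rotation consistent with these definitions can be written in the standard CORDIC form below. This is a reconstruction from the definitions of $\theta_i$, $\sigma_i$ and $K_i$; the sign convention ($\sigma_i = +1$ for an anticlockwise rotation) is an assumption.

```latex
\begin{equation*}
\begin{bmatrix} X_{i+1} \\ Y_{i+1} \end{bmatrix}
= K_i
\begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix}
\begin{bmatrix} X_i \\ Y_i \end{bmatrix},
\qquad
K_i = \cos\theta_i = \frac{1}{\sqrt{1 + 2^{-2i}}}
\end{equation*}
```

The matrix entries are either 1 or a power of two, which is why each micro-rotation reduces to two shifts and two additions once the factor $K_i$ is set aside.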
The circular rotation angle is expressed as the signed sum of the micro-rotation angles, $\theta = \sum_i \sigma_i \theta_i$.

Fig. 1. The direct implementation of Equation (1)

In Equation (1), only shift and add operations are required to perform the rotation described in Fig. 1. However, the results of the rotation iterations need to be scaled by a compensation (scale) factor K. This can be done by using the following iterative method.
The scale factor K, which can be interpreted as a constant gain (hence not data dependent), can be tolerated in many digital signal processing applications. Hence, it should be carefully investigated whether it is necessary to compensate for the scaling at all. If scale factor correction cannot be avoided, two possibilities are known. The first approach consists of performing a constant factor multiplication with $1/K_i$. The second method is based on extending the Cordic iteration in such a way that the resulting inverse of the scale factor takes a value that can be written as a sum of terms $2^{-i}$, where the indices i are determined so that the error is minimized. In the rotation mode, the angle accumulator is initialized with the desired rotation angle. The rotation decision at each iteration is made to diminish the magnitude of the residual angle in the angle accumulator. The decision at each iteration is therefore based on the sign of the residual angle after each step [10].
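The rotation mode described above can be sketched in a few lines of Python. This is a minimal floating-point illustration, not the hardware implementation: the function name `cordic_rotate`, the iteration count of 16 and the choice to apply the accumulated gain at the end are assumptions made for the example.

```python
import math

def cordic_rotate(x, y, theta, n_iter=16):
    """Rotate (x, y) by theta radians using rotation-mode CORDIC."""
    z = theta      # angle accumulator, initialised with the desired angle
    gain = 1.0     # running product of K_i = cos(theta_i)
    for i in range(n_iter):
        sigma = 1.0 if z >= 0 else -1.0   # decision from the sign of the residual angle
        t = 2.0 ** -i                     # a right shift by i bits in fixed-point hardware
        x, y = x - sigma * t * y, y + sigma * t * x
        z -= sigma * math.atan(t)
        gain *= math.cos(math.atan(t))
    # The gain does not depend on the data, so it can be applied once at the end
    return x * gain, y * gain, z

x, y, residual = cordic_rotate(1.0, 0.0, 3 * math.pi / 16)
```

Because every iteration is executed regardless of the sign decision, the gain is a constant for a fixed iteration count, which is what allows it to be folded into a later scaling stage.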

B. Cordic-Based DCT Architecture
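The one-dimensional DCT for 8×8 sub-images is defined as

$$X(t) = \frac{1}{2} C(t) \sum_{i=0}^{7} x(i) \cos\frac{(2i+1)t\pi}{16}, \qquad C(t) = \begin{cases} 1/\sqrt{2}, & t = 0 \\ 1, & \text{otherwise} \end{cases}$$

where x(i) is the input data and X(t) is the 1-D DCT transformed output data.
The two-dimensional DCT is a separable transform. It can be executed by applying the one-dimensional DCT in a serial manner, as shown in Fig. 2.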
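The separability can be illustrated with a short Python sketch that computes the 8-point DCT directly from its cosine definition and then applies it to rows and columns. The normalisation $\frac{1}{2}C(t)$ with $C(0) = 1/\sqrt{2}$ is the common convention and is assumed here; the function names are hypothetical.

```python
import math

def dct_1d(x):
    """8-point DCT with the common 1/2 * C(t) normalisation (assumed)."""
    out = []
    for t in range(8):
        c = 1.0 / math.sqrt(2.0) if t == 0 else 1.0
        s = sum(x[i] * math.cos((2 * i + 1) * t * math.pi / 16.0) for i in range(8))
        out.append(0.5 * c * s)
    return out

def dct_2d(block):
    """Row-column 2-D DCT of an 8x8 block: 1-D DCT on rows, then on columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[r][c] for r in range(8)]) for c in range(8)]
    # cols[c][t] holds coefficient (t, c); transpose back to row-major order
    return [[cols[c][t] for c in range(8)] for t in range(8)]

flat = [[100.0] * 8 for _ in range(8)]
coeffs = dct_2d(flat)   # a constant block has only a DC coefficient
```

For the constant block used here, all the energy ends up in `coeffs[0][0]`, which is a quick sanity check for any fast or approximate DCT architecture.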
The Cordic array performs the fixed-angle rotation in the DCT algorithm. Therefore, the general signal flow graph of the Cordic-based DCT is presented in Fig. 4.

III. PROPOSED HIGH PRECISION CORDIC-BASED LOEFFLER DCT ARCHITECTURE
In this section, the proposed MATLAB script which calculates the Cordic rotations is presented. The main result of this

A. Computation of Micro-Rotation decomposition
The proposed MATLAB script takes the rotation angle as input. We vary the precision degree from $10^{-1}$ to $10^{-4}$ to remain in the same interval exploited by the conventional architectures.

Input Theta (angle) and Epsilon (tolerance);
The MATLAB script

This approach provides the Cordic parameters (iterations and directions) corresponding to the angle and the selected precision. The iterations, in other words the micro-rotations, are identified together with their orientation, clockwise or anticlockwise. This method is applicable to angles within the range from 0 to π/4. Angles higher than π/4 can be decomposed into angles in this interval. For example, 3π/8 = π/4 + π/8. So, to determine the CORDIC parameters of this angle, we begin with the CORDIC parameters of π/4, followed by the CORDIC parameters of π/8.
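A decomposition of this kind can be sketched in Python as follows. This is not the authors' MATLAB script: the selection rule (take micro-rotation i only if it does not overshoot the remaining angle) is an assumption chosen for illustration, and the names `cordic_decompose` and `max_iter` are hypothetical.

```python
import math

def cordic_decompose(theta, epsilon, max_iter=32):
    """Return ([(i, sigma), ...], residual) with |residual| < epsilon if reached."""
    residual = theta
    steps = []
    for i in range(max_iter):
        if abs(residual) < epsilon:
            break
        theta_i = math.atan(2.0 ** -i)
        # Assumed rule: skip any micro-rotation larger than the remaining angle
        if theta_i <= abs(residual):
            sigma = 1 if residual > 0 else -1   # +1 anticlockwise, -1 clockwise
            steps.append((i, sigma))
            residual -= sigma * theta_i
    return steps, residual

steps, residual = cordic_decompose(3 * math.pi / 16, 1e-1)
# steps == [(1, 1), (3, 1)]
```

With this rule, the angle 3π/16 at a tolerance of $10^{-1}$ yields micro-rotations 1 and 3, both anticlockwise. A rule that allows overshoot would also produce clockwise steps and may select different iterations, so other angles can differ from the tabulated parameters.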

B. Cordic parameters corresponding to the angle 3π/16
For precision degrees of $10^{-1}$ and $10^{-3}$, the micro-rotations found are shown in Tables I and II, respectively. The rotation angle 3π/16 can be written as the weighted sum of micro-rotations, as seen in Equation (13):

$$\theta = \frac{3\pi}{16} = 0.589048 \approx \theta_1 + \theta_3 = 0.588002 \pm 10^{-1} \qquad (13)$$

Based on the previously computed micro-rotations of the 3π/16 angle, the Cordic architecture computing the 3π/16 angle is given in Fig. 5. The Cordic architecture computing 3π/16 angle is given in Fig. 6.
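The decomposition in Equation (13) can be unfolded into two fixed shift-add stages. The Python sketch below illustrates this on integer data; it is not the authors' RTL, and the function name, the integer test input and the decision to leave the gain uncompensated are assumptions.

```python
import math

def rotate_3pi16_deg1(x, y):
    """Unfolded shift-add rotation by theta_1 + theta_3 (integer inputs, unscaled)."""
    # Micro-rotation i = 1: operands shifted right by 1 bit
    x, y = x - (y >> 1), y + (x >> 1)
    # Micro-rotation i = 3: operands shifted right by 3 bits
    x, y = x - (y >> 3), y + (x >> 3)
    return x, y

# Gain introduced by the two micro-rotations, to be absorbed by a later scaling stage
GAIN = math.sqrt(1 + 2.0 ** -2) * math.sqrt(1 + 2.0 ** -6)

xr, yr = rotate_3pi16_deg1(1024, 0)
# (xr, yr) == (960, 640); atan2(640, 960) is theta_1 + theta_3, about 0.588002 rad
```

Only two shifters and four adders/subtractors are needed for this angle, which is the source of the complexity reduction at the $10^{-1}$ precision degree.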

C. Cordic parameters corresponding to the angle π/16
For precision degrees of $10^{-1}$ and $10^{-3}$, the micro-rotations found are shown in Tables III and IV, respectively.
The rotation angle π/16 can be written as the weighted sum of micro-rotations, as seen in Equation (15). The Cordic architecture computing the π/16 angle is given in Fig. 7. The generated RTL is shown in Fig. 8. As shown, it consists of a subsystem with 2 inputs and 2 outputs. The subsystem is composed of two shift operators (sh1 and sh2) and two add/sub operators (a and sub). The rotation angle π/16 estimated with a precision degree of $10^{-3}$ is shown in Eq. (16). The Cordic architecture computing the π/16 angle is given in Fig. 9.

D. Cordic parameters corresponding to the angle 3π/8
For precision degrees of $10^{-1}$ and $10^{-3}$, the micro-rotations found are shown in Tables V and VI, respectively.
The rotation angle 3π/8 estimated with a precision degree of $10^{-1}$ is shown in Eq. (17). The Cordic architecture computing the 3π/8 angle is given in Fig. 10. The generated RTL is shown in Fig. 11. As can be seen, it is composed of 4 add/sub operations (a1, a2, sub1 and sub2) and 2 shifters (sh1 and sh2). The rotation angle 3π/8 estimated with a precision degree of $10^{-3}$ is shown in Eq. (18). The Cordic architecture computing the 3π/8 angle is given in Fig. 12.

IV. EXPERIMENTAL RESULTS
In order to demonstrate the high-quality feature of the proposed DCT architectures, they have been evaluated within a JPEG2000 compression chain [18] using a well-known test image. Table VII compares the PSNR of the proposed DCT architectures, for precision degrees ranging from $10^{-1}$ to $10^{-4}$, with that of the other conventional DCT architectures. The checked results consider high-to-low quality compression (i.e. quantization factors from 95 to 70) using the Lena image. Fig. 13 gives the experimental results based on the Lena image.
It can easily be noticed from Table VII that Arch.Deg3 improves quality by about 6.55 dB for Q = 95 over the Cordic-based Loeffler DCT. As seen in Table VII (especially the last row, which corresponds to the average PSNR), Arch.Deg3 is the closest to the Loeffler DCT, which is considered as the reference and the target in terms of precision and image quality. It is also noticed that it is useless to go beyond $10^{-3}$, since the values remain stable. This is why Arch.Deg3 is considered the best architecture in terms of image quality.
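For readers who want to reproduce this kind of comparison, the PSNR between a reference image and a reconstructed one can be computed as in the sketch below. It assumes 8-bit images (peak value 255) stored as lists of rows; the function name `psnr` is hypothetical, and the measurement setup actually used for Table VII is not specified here.

```python
import math

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized images given as lists of rows."""
    n = 0
    sq_err = 0.0
    for row_r, row_t in zip(reference, test):
        for a, b in zip(row_r, row_t):
            sq_err += (a - b) ** 2
            n += 1
    mse = sq_err / n
    # Identical images have zero error, hence infinite PSNR
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

value = psnr([[0, 0], [0, 0]], [[1, 1], [1, 1]])   # MSE = 1
```

Since PSNR is logarithmic in the mean squared error, a gain of about 6.5 dB corresponds to roughly a 4.5-fold reduction of the error energy.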
The considered architectures have been implemented on a Virtex5 xc5vlx30-3ff676. The power consumption is measured with Xpower Analyzer at a 100 MHz clock frequency and a 1 V supply. The delay of each architecture is determined with the ISE Simulator (ISIM). The power consumption, the latency and the complexity of the different DCT architectures (the conventional and the proposed ones), with precision degrees ranging from $10^{-1}$ to $10^{-4}$, are shown in Table VIII.
As can be noticed, the most interesting architecture in terms of power consumption and execution delay is Arch.Deg1, which corresponds to a precision degree of $10^{-1}$. The complexity of this architecture is even lower than that of the BinDCT, which is a reference in terms of low complexity. The power consumption of Arch.Deg1 is almost the lowest: only the BinDCT consumes less, and this difference is minor when the significant enhancement made by Arch.Deg1 in terms of image quality in comparison with the BinDCT is considered.
The waveforms corresponding to Arch.Deg1 and Arch.Deg3 are shown in Fig. 14 and 15, respectively. In terms of number of cycles, Arch.Deg1 takes 80 cycles and Arch.Deg3 takes 88 cycles. This is expected, since Arch.Deg3 requires more shift/add operation layers than Arch.Deg1, so the process takes more time.
In comparison with the Loeffler DCT, Arch.Deg1 is somewhat slower, since the multiplication operation is replaced by several layers of shift/add operators, which leads to a slightly higher delay.
Comparing the conventional Cordic Loeffler based architecture, Arch.Deg1 and Arch.Deg2, one finds that the delay is the same even though the shift/add operation layers are not exactly similar. This is expected, since the delay depends essentially on the longest path and, in these three cases, the longest path passes through the 3π/16 Cordic.

Fig. 3. Hardware architecture of CORDIC-based 1-D DCT

According to Fig. 4, the signal flow can be represented by three major components: the butterfly operator, the fixed-angle CORDICs and the post-scaling factors of the 8-point DCT.

Fig. 10. Unfolded flow graph of the 3π/8 angle (Precision = $10^{-1}$)

As can be seen from Table VIII and Fig. 14 and 15, the execution time of a single column of an 8 × 8 image block is 95 ns for Arch.Deg1 and 105 ns for Arch.Deg3. In terms of number of cycles, this is equal to 10 cycles for Arch.Deg1 and 11 cycles for Arch.Deg3. Processing an entire 8 × 8 image block takes 905 ns for Arch.Deg1 and 985 ns for Arch.Deg3.

www.ijacsa.thesai.org

TABLE I. DETERMINING THE CORDIC PARAMETERS FOR 3π/16 CORRESPONDING TO A PRECISION DEGREE OF $10^{-1}$

TABLE II. DETERMINING THE CORDIC PARAMETERS FOR 3π/16 CORRESPONDING TO A PRECISION DEGREE OF $10^{-3}$

TABLE III. DETERMINING THE CORDIC PARAMETERS FOR π/16 CORRESPONDING TO A PRECISION DEGREE OF $10^{-1}$

TABLE IV. DETERMINING THE CORDIC PARAMETERS FOR π/16 CORRESPONDING TO A PRECISION DEGREE OF $10^{-3}$

TABLE V. DETERMINING THE CORDIC PARAMETERS FOR 3π/8 CORRESPONDING TO A PRECISION DEGREE OF $10^{-1}$

TABLE VI. DETERMINING THE CORDIC PARAMETERS FOR 3π/8 CORRESPONDING TO A PRECISION DEGREE OF $10^{-3}$