Comparative Analysis of ALU Implementation with RCA and Sklansky Adders In ASIC Design Flow

An Arithmetic Logic Unit (ALU) is the heart of every central processing unit (CPU); it performs basic operations such as addition, subtraction, multiplication, division and bitwise logic operations on binary numbers. This paper deals with the implementation of a basic ALU using two different types of adder circuits, a ripple carry adder (RCA) and a Sklansky adder. The ALU is designed as an application specific integrated circuit (ASIC) using the VHDL hardware description language and standard cells. The target process technology is 130-nm CMOS from the foundry STMicroelectronics. The Cadence EDA tools are used for the ASIC implementation. A comparative analysis of the two ALU circuits is provided in terms of area, power and timing requirements.

Keywords: Arithmetic Logic Unit; Ripple Carry Adder; Sklansky Adder; ASIC Design; EDA Tools

For ALU-RCA, two different designs were used in order to compare their area and power usage. The first design contains a demux that selects, based on the opcode, which block of the ALU is used; the blocks that are not used are shut down to save power. The second ALU-RCA design does not contain this demux: all three units of the ALU perform their respective operations, and only one of the three results is sent to the output. In both designs a mux placed before the output selects the result of one of the three blocks based on the opcode.
The rest of the paper is organized as follows. In the next section we describe the design and verification of the ALU circuits, followed by their basic synthesis. We then describe the design respin, power analysis, and place and route of the ALU circuits. Finally, we summarize our conclusions.
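To make the two adder structures compared in this paper concrete, the following is a minimal behavioral sketch in Python. It is not the VHDL used in this work; the function names and the bit-list representation are assumptions chosen for illustration. The ripple carry model passes the carry through all bit positions in series, while the Sklansky model merges generate/propagate pairs in roughly log2(width) levels.

```python
def ripple_carry_add(a, b, width=32, cin=0):
    """Ripple carry model: the carry passes through `width` full adders in series."""
    carry, total = cin, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i          # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))    # full-adder carry out
    return total, carry


def sklansky_add(a, b, width=32, cin=0):
    """Sklansky parallel-prefix model: about log2(width) levels of carry merging."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    half_sum = list(p)                                     # a XOR b, before carries
    g[0] |= p[0] & cin                                     # fold carry-in into bit 0
    level = 0
    while (1 << level) < width:
        for i in range(width):
            if (i >> level) & 1:                           # upper half of each block
                j = ((i >> level) << level) - 1            # top bit of the lower half
                g[i] |= p[i] & g[j]                        # merged group generate
                p[i] &= p[j]                               # merged group propagate
        level += 1
    total = 0
    for i in range(width):
        carry_in = cin if i == 0 else g[i - 1]             # g[i-1] is the carry into bit i
        total |= (half_sum[i] ^ carry_in) << i
    return total, g[width - 1]
```

Both functions return the sum and the carry out, so they can be checked against each other and against ordinary integer addition. The serial loop in the first model mirrors the long carry chain of the RCA, whereas the level loop in the second mirrors the shallow prefix tree that makes the Sklansky adder faster at the cost of more logic.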

II. ALU DESIGN-VERIFICATION
Initial verification of both ALUs, i.e. ALU-RCA and ALU-SKL, was performed with a waveform-based approach using the ModelSim software tool [47]. Since both ALUs were written as generic designs, we reduced the ALU size to 8 bits to keep this initial verification simple. Waveform-based verification is only useful during the initial phases of small designs and must always be followed by more comprehensive verification in the later design stages. Some small bugs were found in the code and corrected. The next task was to define a VHDL TestBench for more comprehensive verification of the ALU than the waveform approach allows. The TestBench reads the input values of A, B and the opcode from a stimuli file and compares the result with the expected output, which is also read from a stimuli file. For each test vector, false is written to a dump file if the result is incorrect and true otherwise. If the results are correct for all test vectors, a message reports at the end of the simulation that it was successful; otherwise an error message is displayed.
The first task in the verification was to functionally verify our 32-bit ALU, i.e. to check whether the design conforms to the specification and does what is intended. The test vectors can be viewed as a functional specification. The NCSIM logic simulator was used to verify our designs; logic simulation is the use of a computer program to simulate the operation of a digital circuit before it is actually built. The designs were simulated against 1000 randomized reference test vectors and no errors were found. To check the correctness of the TestBench itself, we changed some of the vectors in the stimuli files and simulated the design again; this time, as expected, errors were reported for the vectors that we had changed. We cannot say that this verification is complete, as we checked our design using only 1000 randomized test vectors while the number of possible input combinations in our case is 2^68, but it is a good starting point for the verification process and, as they say, verification is never complete.
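The stimuli-driven checking described above can be sketched in Python as follows. This is only an illustration of the idea, not the VHDL TestBench used in this work: the opcode numbering, the hexadecimal line format "A B opcode expected" and the function names are all hypothetical, and the stimuli are passed as a list of strings instead of being read from a file.

```python
MASK = 0xFFFFFFFF  # 32-bit result width


def reference_alu(a, b, opcode):
    """Reference model with a hypothetical opcode map (illustration only)."""
    if opcode == 0:
        return (a + b) & MASK
    if opcode == 1:
        return (a - b) & MASK
    if opcode == 2:
        return a & b
    if opcode == 3:
        return a | b
    if opcode == 4:
        return (a << (b & 31)) & MASK
    raise ValueError("unknown opcode")


def run_testbench(stimuli, dut):
    """Apply each vector to `dut` and record 'true'/'false' per vector."""
    dump = []
    for line in stimuli:
        a, b, opcode, expected = (int(tok, 16) for tok in line.split())
        dump.append("true" if dut(a, b, opcode) == expected else "false")
    passed = all(flag == "true" for flag in dump)
    return dump, ("simulation successful" if passed else "simulation failed")


vectors = ["00000001 00000002 0 00000003",   # 1 + 2
           "0000000F 000000F0 3 000000FF"]   # 0x0F OR 0xF0
print(run_testbench(vectors, reference_alu))
```

Deliberately corrupting the expected value of one vector and checking that exactly that entry turns to "false" corresponds to the sanity check of the TestBench described in the text.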

III. ALU DESIGN-BASIC SYNTHESIS
The next task was to synthesize the VHDL code of the ALUs using Cadence RTL Compiler [42][43][44][45]. The VHDL descriptions of the ALU are mapped to a specific process technology, in this case the 130-nm technology provided by STMicroelectronics [46]. After starting RTL Compiler, the technology files were added by giving the path and file names of the library files. The next step in the synthesis process was to read the VHDL files of ALU-RCA. Note that the TestBench was not included, since it is a behavioral description and not synthesizable. RTL Compiler was then instructed to assemble the VHDL descriptions of ALU-RCA into an internal representation, i.e. a network of logic gates, by issuing the elaboration command. The VHDL code of ALU-RCA was found to be synthesizable, as this had been kept in mind while writing it, and can therefore be called RTL code. Some time was spent studying the gate-level netlist produced during elaboration, later also using the GUI. Up to this point only initial logic synthesis had been done, in which no process technology is used; instead, RTL Compiler makes use of a virtual gate library. The next step was to assign our hardware descriptions to real standard cells, which is commonly known as technology mapping. The initial synthesis was done without any timing constraint and using low effort in order to study the intrinsic behavior of the implementation: we did not want the tool to optimize the timing at all, so that the timing report, obtained by Static Timing Analysis (STA), reflects the unoptimized design. The timing and area of the design were documented by giving the appropriate commands. The worst-case delay and estimated area of the implementation were found to be 5396 ps and 13305 µm², respectively.
The worst-case signal propagation path of ALU-RCA was found to run between the input and output registers through the chain of adders from index 0 to 31; this was observed in the GUI window. The design was re-synthesized with a new timing goal of 50% of the delay obtained in the last task, which comes to 2698 ps, using medium effort. Medium effort was used because we wanted the compiler to put more effort into meeting this timing constraint. The worst-case delay value and estimated area of the implementation are given in Fig. 2. RTL Compiler was able to meet this timing constraint at the expense of increased area. The worst-case timing path was still passing through the chain of adders. The ten longest signal propagation paths were documented and they all passed through the chain of full adders. The data books were studied to see which standard cells were being used. When the design was synthesized without a timing constraint it used one-bit full adder cells, but with the timing constraint of 2698 ps it uses half adders, which have less area than full adders. However, since the half adders are used in larger numbers than the full adders were, the total area of the design has increased.
Later it was found that the original intention of our ALU design is to put it inside an 800-MHz processor, which corresponds to a clock period of 1250 ps. The ALU-RCA was therefore re-synthesized using a 1250 ps timing constraint and medium effort. The ALU-RCA was unable to meet this timing constraint, as the slack was negative, so we cannot use ALU-RCA for an 800-MHz processor. The worst-case path was still passing through the chain of adders. This time the standard cells used in the implementation were different from those of the previous task, as the compiler tried to meet the timing constraint using fast standard cells. The worst-case delay and estimated area of the implementation were found to be 2150 ps and 15330 µm², respectively.
The compiler was only able to reach a delay of 2150 ps, with increased area, as it put extra effort into trying to meet this stricter timing constraint but was unsuccessful. The next step in the ASIC design flow is the verification of the synthesized netlist of ALU-RCA. The VHDL description of ALU-RCA was synthesized again, using medium effort, with the timing constraint of 2698 ps, for which correct function is guaranteed. The TestBench was used with the test vectors for the verification of the synthesized netlist. The clock half-period used in the TestBench was 50% of the timing constraint, i.e. 1349 ps, meaning that the clock in the TestBench is high for 1349 ps and low for 1349 ps. The netlist was verified without any errors, which supports the claim that the netlist has the same functionality as the VHDL description of the design. Fig. 2 shows the timing and area results of ALU-RCA at different settings.
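The timing figures quoted in this section follow from simple arithmetic, which the short Python sketch below reproduces. The helper names are hypothetical; the numbers are the ones reported above for ALU-RCA.

```python
def clock_period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz


def slack_ps(constraint_ps, worst_delay_ps):
    """Positive slack: constraint met. Negative slack: constraint violated."""
    return constraint_ps - worst_delay_ps


unconstrained_delay = 5396                    # ps, ALU-RCA without timing constraint
half_delay_goal = unconstrained_delay // 2    # 50% timing goal
print(half_delay_goal)                        # 2698
print(half_delay_goal // 2)                   # 1349, high and low time of the TestBench clock
print(clock_period_ps(800))                   # 1250.0, period of the 800-MHz target
print(slack_ps(1250, 2150))                   # -900, ALU-RCA misses the 1250 ps constraint
```

The negative slack of 900 ps is why ALU-RCA cannot be used at 800 MHz.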

IV. ALU DESIGN-DESIGN RESPIN AND POWER ANALYSIS
The ALU-SKL was synthesized without a timing constraint and using low effort to study the intrinsic implementation. The worst-case delay and estimated area of the implementation were found to be 2130 ps and 13646 µm², respectively. The ALU-SKL was re-synthesized using the stricter timing constraint corresponding to 800 MHz, with medium effort. The worst-case delay and estimated area of the implementation were found to be 1250 ps and 14578 µm², respectively. The ALU-SKL was able to meet this timing constraint at the expense of increased area, which means that we can use ALU-SKL for an 800-MHz processor. The worst-case timing path of ALU-SKL was found to pass through the shifter block. The main reason for using the Sklansky adder is that it is faster than the RCA. The worst-case path is consistent with this: for ALU-RCA the worst-case path passed through the chain of adders, whereas now it passes through the shifter block because of the higher speed of the Sklansky adder. The data books were studied again to see which standard cells the Sklansky adder uses, and it was found to use completely different cells than the RCA, e.g. 4-input NOR gates, which is why it achieves more speed at the expense of larger area. The ten longest paths were documented, and it was found that the first nine paths passed through the shifter while the tenth passed through the Sklansky adder. Fig. 3 shows the timing and area results of ALU-SKL at different settings.
Fig. 4 and Fig. 5 show that the area of ALU-RCA changes more rapidly than that of ALU-SKL: the tool has to put more effort into making ALU-RCA meet the stricter timing constraint, which increases its area, while ALU-SKL, being built on a fast adder, meets the stricter timing constraint with a comparatively small increase in area. The netlist of ALU-SKL was also successfully verified.
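The difference in area scaling can be quantified from the reported synthesis results. The sketch below uses a hypothetical helper, with the areas taken from the unconstrained runs and the runs targeting 1250 ps, and computes the relative area growth of each ALU. Note that ALU-RCA did not actually meet the 1250 ps constraint, so its figure is the area reached while trying.

```python
def area_growth_percent(area_relaxed, area_tight):
    """Relative area increase, in percent, when the timing goal is tightened."""
    return 100.0 * (area_tight - area_relaxed) / area_relaxed


rca = area_growth_percent(13305, 15330)   # ALU-RCA: unconstrained vs. 1250 ps target
skl = area_growth_percent(13646, 14578)   # ALU-SKL: unconstrained vs. 1250 ps target
print(f"ALU-RCA: {rca:.1f}%  ALU-SKL: {skl:.1f}%")   # ALU-RCA: 15.2%  ALU-SKL: 6.8%
```

On these numbers the area of ALU-RCA grows more than twice as much as that of ALU-SKL for the same timing target.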
The next task was to perform power analysis of both designs for a timing constraint that both ALUs satisfy. Both ALUs were therefore synthesized with a timing constraint of 2500 ps using medium effort. The estimated area of the implementation was found to be 14828 µm² for ALU-RCA and 13825 µm² for ALU-SKL. The ALU-RCA with DeMux was also synthesized using this timing constraint to compare its area and timing with the ALU-RCA without the DeMux. The initial power analysis was performed by assigning switching probabilities to the primary inputs. This is common practice, as test vectors are initially not available, but the correctness of the power analysis depends strongly on the test vectors used. Tables I and II clearly show that increasing the toggling probability increases the switching power, whereas the leakage power remains almost the same. Table III shows the clock power and capacitance of ALU-RCA and ALU-SKL. The ALU-RCA consumes more power and has more area than ALU-SKL for the stricter timing constraint of 2500 ps. This is because more effort is needed for ALU-RCA to meet this timing constraint, which results in more area and higher power, while ALU-SKL has no problem meeting it, which results in a more area- and power-efficient design. The individual power of the three blocks was also compared: the adder consumed the most power, followed by the logical block, and the shifter consumed the least. The power dissipated in the clock net of both ALU-RCA and ALU-SKL was documented and found to be in agreement with the common expression f · V_DD² · C; it should be noted that RTL Compiler shows the wrong unit for capacitance.
The next task was to relax the timing constraint for both ALUs to see its impact on area and power. The timing constraint was set to 4500 ps and both designs were re-synthesized using medium effort. The estimated area of the implementation was found to be 13468 µm² for ALU-RCA and 13819 µm² for ALU-SKL. Now the ALU-RCA uses less area and power than ALU-SKL. This means that it is better to use ALU-RCA if the timing constraint is not tight and an area- and power-efficient ALU is required. Tables IV and V show the power results for the timing constraint of 4500 ps.
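The clock-net expression mentioned above, frequency times supply voltage squared times capacitance, can be evaluated with a few lines of Python. The function name and all numerical values in the example are hypothetical placeholders, since the supply voltage and clock-net capacitance of the designs are not restated here; only the 2500 ps period is taken from the text.

```python
def clock_net_power_w(freq_hz, vdd_v, cap_f):
    """Dynamic power of the clock net: f * V_DD^2 * C (SI units, result in watts)."""
    return freq_hz * vdd_v ** 2 * cap_f


# Hypothetical example values, not measurements from this work:
freq = 1.0 / 2500e-12      # 2500 ps period, i.e. 400 MHz
vdd = 1.2                  # volts (assumed)
cap = 1e-12                # farads (assumed clock-net capacitance of 1 pF)
print(clock_net_power_w(freq, vdd, cap))   # roughly 0.58 mW for these assumed values
```

Because the expression is linear in f and C, halving the clock frequency or the clock-net capacitance halves the clock power, while the supply voltage enters quadratically.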
Table VI shows the power and area results for the two different ALU-RCA designs. The design with the DeMux is more power efficient, as switching only takes place in the block that is needed at that time depending on the opcode, but it has a small area overhead compared with the design without the DeMux. The power analysis was performed using a toggling probability of 0.1 ns on the primary inputs.
The next task was to perform power analysis of ALU-RCA using three different sets of test vectors since, as discussed earlier, the correctness of power analysis depends strongly on the test vectors used. The TestBench was used to generate a VCD file for each set of test vectors, using the netlist synthesized with the correct timing constraint; here the timing constraint was set to 2500 ps with medium effort. One important thing to note is that the clock half-period used in the TestBench should be 50% of the timing constraint. Table VII shows the power results using test vectors. The results show that the toggling probability of the Random test vectors is higher than that of the Regular and Real traces, because the random test vectors have the highest switching power, whereas the toggling probabilities of the regular and real traces are almost the same and close to 0.1 ns. As instructed, the TCF files were checked to compare the signals A[16] and B[15] in all three TCF files. In the regular trace test vectors, the signal A[16] was changing state from 0 to 1 and vice versa all the time and had a high-state toggling probability of almost 0.5 ns, whereas the signal B[15] was zero all the time and had a high-state toggling probability of almost 0.0 ns.
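The per-signal statistics discussed above can be illustrated with a small Python sketch that derives the fraction of time a signal is high and its number of toggles from a sampled trace. This is not a TCF or VCD parser; the function name and the two example traces are hypothetical and only mimic the reported behavior of A[16] (alternating) and B[15] (constant zero) in the regular trace.

```python
def signal_stats(samples):
    """Return (fraction of samples that are high, number of 0/1 transitions)."""
    high_fraction = sum(samples) / len(samples)
    toggles = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return high_fraction, toggles


a16_like = [0, 1] * 8      # alternates every sample, like A[16] in the regular trace
b15_like = [0] * 16        # stuck at zero, like B[15] in the regular trace
print(signal_stats(a16_like))   # (0.5, 15)
print(signal_stats(b15_like))   # (0.0, 0)
```

A signal that never toggles contributes no switching power, which is why the choice of test vectors has such a strong influence on the power estimate.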

V. ALU DESIGN-PLACE AND ROUTE
The final task in the ASIC design flow is the place-and-route step, which takes a considerable amount of experience to do well. The first step was to generate the netlist file of our own design using a timing constraint of 3.2 ns; the constraint file also had to be produced. We decided to work with our ALU-RCA design. These files were then placed in the proper directories as directed. The partitioning step was performed, followed by floorplanning, in which we placed the input registers on the left and the output registers on the right side of the core. Then pin placement and power routing were done. Standard cell placement was the next step, and it was found that the cells were placed according to our pre-placement constraint. Then the Clock Tree Synthesis (CTS) step was performed. After pre-CTS optimization the timing was checked and the timing constraint was not met. The pre-CTS step can be explained as mapping the design to logic gates without mapping to actual cells, i.e. buffers. The actual CTS step, which is like mapping the design to actual cells, was then performed. The positions of the clock buffers and the clock tree were checked and it was found that our design has one level of buffers. The timing was checked again after this step and we were still unable to meet the constraint. The last step of CTS, i.e. post-CTS optimization, was performed to optimize the design based on the existing clock tree. The timing was checked again and was found to have improved considerably: the slack was -0.466, much better than the earlier slack of -2.259. Routing and post-route optimization were then performed; the clock and reset signals should have the highest priority because they have to be provided to every block in the design and are therefore critical. Filler cells were used to fill the gaps and to connect them to the power rails. Layout verification was done and four MinCut violations were found, which were later removed using the fixMinCutVia command. The final timing analysis was performed and the slack was found to be -0.395.

VI. CONCLUSION
The aim of this research was to design an Arithmetic Logic Unit (ALU) for a 32-bit processor using two different adder circuits. The two ALUs were implemented in VHDL using Ripple Carry Adder and Sklansky Adder circuits. After the VHDL implementation, synthesis was performed using Cadence RTL Compiler to compare the two ALUs in terms of area, power and timing requirements. The VHDL descriptions of the ALUs were mapped to the 130-nm process technology provided by STMicroelectronics. The synthesis results show that the area of ALU-RCA changes more rapidly than that of ALU-SKL, as more effort is needed for ALU-RCA to meet the stricter timing constraint, at the expense of more area, while ALU-SKL, which uses a fast adder, meets the stricter timing constraint with little increase in area and power consumption. It was also observed that ALU-RCA uses less area and power than ALU-SKL when the timing constraint is relaxed, so it is better to use ALU-RCA if the timing constraint is not tight; in this way a more area- and power-efficient ALU design is obtained.

Figure 2: Timing and Area results of ALU-RCA

Figure 3: Timing and Area results of ALU-SKL

Figure 4: Scaling of Area of ALU-RCA with Timing Constraint

Table I: Power results of ALU-RCA and ALU-SKL

Table V: Power results of ALU-RCA and ALU-SKL

Table VI: Synthesis results of ALU-RCA

Table VII: Power results of ALU-RCA