Performance-Optimised Design of the RISC-V Five-Stage Pipelined Processor NRP


I. INTRODUCTION
The Instruction Set Architecture (ISA) is the foundation of computer architecture. Existing ISAs (such as x86 and ARM) have hindered the advancement and proliferation of technology through patent protection [1]. In 2010, the University of California, Berkeley, first released the RISC-V Instruction Set Architecture [2]. RISC-V is an open and free ISA.
This paper presents an optimized five-stage pipeline RV32I scalar processor, NRP (New RISC-V Processor). The main contributions of this paper are as follows. First, to improve performance, we modified and optimized the ID and EX stages of the processor, reducing the negative impact of dependency conflicts on the processor. Second, we implemented these optimizations in Verilog HDL and evaluated hardware resource utilization and processor performance. The evaluation results show that this processor outperforms the classic five-stage pipeline processor.

II. RELATED WORKS
Dependency conflicts are an important factor affecting the performance of a five-stage pipeline processor. They arise from data, control, and structural dependencies between instructions, which can lead to hazards in the pipeline and thereby reduce the processor's performance.
In study [14], the authors designed and implemented a Tournament Branch Predictor, which improved the accuracy of branch prediction and enhanced the processor's efficiency. In reference [15], the authors combined the instruction fetch stage with the pre-fetch stage into a two-stage pipeline, resulting in a 17.6% improvement in processor performance. In reference [16], the authors proposed reducing hazards through techniques such as data forwarding and branch prediction, leading to a 7.82% increase in processor performance. In reference [17], the authors optimized the instruction fetch unit, ALU, and data memory, increasing the processor's operating frequency.
The optimization strategies in references [14,15] resulted in significant performance improvements but increased the complexity and hardware resource usage of the branch predictor. The optimization strategies in references [16,17] had lower hardware overhead but led to smaller improvements in processor performance. This paper compares these optimization strategies and proposes a new optimization strategy that achieves performance improvements with minimal hardware overhead.

A. Processor Architectures
The NRP processor adopts a five-stage pipeline design. As shown in Fig. 1, instructions undergo the following five stages during execution: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB) [18]. The design of the ID and EX stages in the NRP processor differs from that of the classic five-stage pipeline processor. The ID stage of the NRP processor consists of both the instruction decode unit and the decode execute unit, whereas the classic five-stage pipeline processor only has the instruction decode unit. The EX stage of the NRP processor consists of the execute unit and the branch prediction auxiliary unit, while the classic five-stage pipeline processor only has the execute unit.

B. Instruction Execution Process
This paper categorizes all instructions in RV32I into special instructions and regular instructions. Branch jump instructions, and instructions whose computational operations are similar to those of branch jump instructions, are defined as special instructions. The computational operations of special instructions, originally executed in the EX stage, are now completed in the ID stage. Special instructions include ADD, ADDI, SUB, SLT, SLTU, SLTI, SLTIU, BEQ, BNE, BLT, BGE, BLTU, BGEU, JAL, JALR, LB, LH, LW, LBU, LHU, LWU, SB, SH, SW.
The ID and EX stages of the NRP processor differ from those of the classic five-stage pipeline processor, so instructions are executed differently in these two stages. Fig. 2 illustrates the main execution process of instructions in the ID and EX stages. The ID stage of the NRP processor consists of the instruction decode unit and the decode execute unit, with each functional unit's decoder responsible for decoding a portion of the instructions. In the ID stage, a special instruction is first decoded by the decode execute unit, which then completes the computational operation required by the instruction opcode within the same unit. The instruction decode unit is responsible for decoding regular instructions and forwarding the instruction decode information and source operands to the next stage. If a data dependency conflict that cannot be resolved in the ID stage occurs during the execution of a special instruction, the instruction is flagged and the conflict is then resolved through data forwarding in the EX stage.
The EX stage of the NRP processor consists of the execute unit and the branch prediction auxiliary unit. The execute unit performs operations based on the type of instruction, including regular instructions and flagged special instructions. The branch prediction auxiliary unit is responsible for handling unflagged special instructions and generating branch prediction auxiliary information.
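The routing described above can be summarized in a small behavioral sketch. It is written in Python rather than Verilog purely for illustration; the function names and the unit-name strings are hypothetical and do not come from the NRP source code. Only the list of special instructions and the flagged/unflagged rule are taken from the text.

```python
# Illustrative routing model. Function and unit names are hypothetical;
# the instruction list is the one given in this section.
SPECIAL_INSTRUCTIONS = frozenset({
    "ADD", "ADDI", "SUB", "SLT", "SLTU", "SLTI", "SLTIU",
    "BEQ", "BNE", "BLT", "BGE", "BLTU", "BGEU", "JAL", "JALR",
    "LB", "LH", "LW", "LBU", "LHU", "LWU", "SB", "SH", "SW",
})


def id_stage_unit(mnemonic: str) -> str:
    """Return the ID-stage unit that handles the instruction."""
    if mnemonic.upper() in SPECIAL_INSTRUCTIONS:
        return "decode_execute_unit"
    return "instruction_decode_unit"


def ex_stage_unit(mnemonic: str, flagged: bool) -> str:
    """Return the EX-stage unit that handles the instruction.

    `flagged` marks a special instruction whose data dependency could not
    be resolved in ID and is completed in EX through data forwarding.
    """
    is_special = mnemonic.upper() in SPECIAL_INSTRUCTIONS
    if is_special and not flagged:
        return "branch_prediction_auxiliary_unit"
    return "execute_unit"
```

For example, `ex_stage_unit("BEQ", flagged=False)` selects the branch prediction auxiliary unit, while the same instruction with `flagged=True` falls back to the execute unit, as a regular instruction would.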

IV. THE OPTIMIZATION DESIGN IN NRP
Dependency conflicts are a significant factor affecting the performance of a five-stage pipelined processor. These conflicts can lead to pipeline stalls, reducing the processor's performance. The optimization idea proposed in this paper aims to minimize the negative impact of dependency conflicts on processor performance. In this section, we describe the design and implementation of the optimization strategies for the NRP processor.

A. Optimization Design of the Decoding Stage
A control dependency conflict in a five-stage pipelined processor arises when the outcome of a branch instruction's condition is not yet determined while subsequent instructions are already entering the pipeline. If the branch prediction fails, the pipeline needs to be flushed and restarted, causing a stall and reducing processor performance.
The optimization design in the ID stage of the NRP processor aims to reduce the pipeline stall time caused by control dependency conflicts. In a classic RISC-V five-stage pipelined processor, when a branch prediction fails, a stall of two clock cycles is required for pipeline flushing. This paper introduces an additional decode and execute unit in the ID stage of the NRP processor, reducing the stall to just one clock cycle in the event of a branch prediction failure.
The optimization design in the ID stage allows branch instructions to learn the branch prediction result and determine whether a control dependency conflict will occur. The execution process of special instructions in the decode and execute unit is illustrated in Fig. 3. First, the decoder decodes the instruction to obtain instruction information. Then, based on the instruction opcode, it generates a 2-bit enable signal to activate the corresponding arithmetic unit. The arithmetic unit performs operations on the source operands and communicates using shared data. Finally, the instruction operation result and related information are passed to the EX stage, and the branch prediction result is transmitted to the branch predictor. In the event of a branch prediction failure, the correct PC is passed to the IF stage, and the pipeline pause signal is transmitted to the Ctrl module.
The decode and execute unit consists of a special instruction decoder, an adder, and a comparator. In the implementation, we virtually divide the full instruction decoder into a special instruction decoder and a regular instruction decoder. When the instruction decoder decodes a special instruction, the decode and execute unit is activated. When the instruction decoder decodes a regular instruction, the decode and execute unit is not activated. The primary hardware costs of our optimization design in the ID stage are the adder and the comparator.
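As a rough illustration of how an adder and a comparator suffice to resolve a conditional branch in the ID stage, the following Python sketch models the two arithmetic units behaviorally. The function names are hypothetical, the 2-bit enable signal is not modeled because its encoding is not given here, and the branch semantics follow the RV32I definitions (signed comparison for BLT/BGE, unsigned for BLTU/BGEU).

```python
# Behavioral sketch of branch resolution with an adder and a comparator.
# Names are illustrative; this is not the NRP RTL.
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def adder(a: int, b: int) -> int:
    """32-bit wrap-around addition."""
    return (a + b) & MASK32


def comparator(op: str, rs1: int, rs2: int) -> bool:
    """Evaluate the condition of an RV32I conditional branch."""
    ua, ub = rs1 & MASK32, rs2 & MASK32
    sa, sb = to_signed32(rs1), to_signed32(rs2)
    conditions = {
        "BEQ": ua == ub,
        "BNE": ua != ub,
        "BLT": sa < sb,
        "BGE": sa >= sb,
        "BLTU": ua < ub,
        "BGEU": ua >= ub,
    }
    return conditions[op]


def resolve_branch(op: str, pc: int, rs1: int, rs2: int, imm: int):
    """Return (taken, next_pc) for a conditional branch resolved in ID."""
    taken = comparator(op, rs1, rs2)
    next_pc = adder(pc, imm) if taken else adder(pc, 4)
    return taken, next_pc
```

With `rs1 = 0xFFFFFFFF` and `rs2 = 1`, `BLT` is taken (the operands compare as -1 < 1) while `BLTU` is not, which is the kind of result the ID stage can compare against the prediction one stage earlier than a classic pipeline.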

B. Branch Predictor Optimization
The branch predictor used in this paper is based on SonicBOOM's NLP (Next-Line Predictor), consisting of a BHT (Branch History Table), a BTB (Branch Target Buffer), and a RAS (Return Address Stack). We have optimized the BHT.
A traditional BHT records the state of each branch instruction based on its historical execution results. When a branch instruction is executed for the first time, it defaults to not taken because no historical execution results are available. The design proposed in this paper makes the opcode and the target address of the instruction available before its first execution, so that the branch instruction defaults to taken upon its first execution. JAL and JALR, as unconditional jump instructions, cause a jump on every execution, which the traditional BHT design cannot accommodate.
The workflow of the BHT designed in this paper is as follows: When a branch instruction is first recorded in the BHT, the value of the corresponding two-bit saturating counter table (2BC) is set to 2. If the branch instruction indeed jumps during execution and the jump target address is correct, the value of the two-bit saturating counter table is incremented by 1. If the branch instruction does not jump during execution, then the value of the two-bit saturating counter table is decremented by 1. If the value of the two-bit saturating counter table for the branch instruction is greater than or equal to 2, it is predicted that the instruction will jump.
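The counter workflow above can be sketched as a minimal behavioral model. This is an illustration under stated assumptions, not the NRP implementation: the class and method names are hypothetical, the table is a dictionary keyed by PC rather than an indexed hardware array, untracked branches are assumed to predict not taken, and the check that the jump target address is correct is folded into the single `taken` argument.

```python
# Minimal model of a BHT whose 2-bit saturating counters start at 2.
# Names and the dict-based table are assumptions for illustration.
class OptimizedBHT:
    INITIAL = 2      # value assigned when a branch is first recorded
    MAX_COUNT = 3    # a 2-bit counter saturates at 3 ...
    MIN_COUNT = 0    # ... and at 0

    def __init__(self) -> None:
        self.counters = {}

    def record(self, pc: int) -> None:
        """Record a branch before its first execution (no-op if known)."""
        self.counters.setdefault(pc, self.INITIAL)

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is greater than or equal to 2."""
        return self.counters.get(pc, self.MIN_COUNT) >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Increment on a taken branch, decrement otherwise, saturating."""
        count = self.counters.setdefault(pc, self.INITIAL)
        if taken:
            count = min(self.MAX_COUNT, count + 1)
        else:
            count = max(self.MIN_COUNT, count - 1)
        self.counters[pc] = count
```

Because a newly recorded entry starts at 2, `predict` returns taken on the first execution; a single not-taken outcome lowers the counter to 1 and flips the prediction.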
The branch prediction auxiliary module is crucial for implementing the BHT optimization design, as it makes the opcode and the target address of the instruction available before its execution. The NRP processor classifies instructions into regular and special instructions. Special instructions are decoded and executed in the ID stage, so when a special instruction reaches the EX stage, an idle clock cycle is generated. The branch prediction auxiliary module utilizes this idle clock cycle to perform simple decoding of the instruction and generate data for updating the branch predictor.
The branch prediction auxiliary module consists of a branch instruction decoder and an adder, and its specific workflow is illustrated in Fig. 4. First, the branch instruction decoder in the branch prediction auxiliary module decodes the instruction currently in the cache. If the instruction is a branch instruction, its instruction type and immediate value are obtained after decoding. Then, the PC value and the immediate value of the instruction are sent to the ALU for addition to obtain the jump target address. Finally, the PC value, instruction type, and jump target address of the instruction are sent to the branch predictor.
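The workflow above amounts to extracting the immediate of a branch instruction and adding it to the PC. The following Python sketch shows this for RV32I B-type branches and JAL, using the standard RISC-V immediate encodings. The function name `aux_decode` and the returned tuple layout are hypothetical. JALR is left out of the sketch because its target depends on a register value rather than on PC plus immediate.

```python
# Sketch of "decode the immediate, then add it to the PC".
# Opcodes and bit layouts follow the RV32I base ISA; names are illustrative.
MASK32 = 0xFFFFFFFF
OPCODE_BRANCH = 0b1100011
OPCODE_JAL = 0b1101111


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend a `bits`-wide two's-complement value."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def b_type_imm(instr: int) -> int:
    """imm[12|10:5] sits in bits 31:25, imm[4:1|11] in bits 11:7."""
    imm = (((instr >> 31) & 0x1) << 12) | (((instr >> 7) & 0x1) << 11) \
        | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1)
    return sign_extend(imm, 13)


def j_type_imm(instr: int) -> int:
    """imm[20|10:1|11|19:12] sits in bits 31:12."""
    imm = (((instr >> 31) & 0x1) << 20) | (((instr >> 12) & 0xFF) << 12) \
        | (((instr >> 20) & 0x1) << 11) | (((instr >> 21) & 0x3FF) << 1)
    return sign_extend(imm, 21)


def aux_decode(pc: int, instr: int):
    """Return (pc, kind, target) for a branch or JAL, otherwise None."""
    opcode = instr & 0x7F
    if opcode == OPCODE_BRANCH:
        return pc, "branch", (pc + b_type_imm(instr)) & MASK32
    if opcode == OPCODE_JAL:
        return pc, "jal", (pc + j_type_imm(instr)) & MASK32
    return None
```

For instance, the word `0x00000463` encodes `beq x0, x0, 8`, so decoding it at PC `0x100` yields the target `0x108`; a non-branch word such as `0x00000013` (`addi x0, x0, 0`) produces no predictor update.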
The primary hardware costs in the branch prediction optimization design are a branch instruction decoder and an adder, with the branch instruction decoder supporting only the decoding of branch instructions.

C. Optimization of Dependency Conflict
Fig. 5 illustrates how a classic five-stage pipelined processor uses data forwarding, pipeline stalling, and branch prediction to resolve various dependency conflicts, together with the resulting impacts.
In Fig. 5, we can observe the following scenarios. First, the classic five-stage pipelined processor uses data forwarding from the EX stage and MEM stage to the ID stage to resolve data dependency conflicts caused by non-load instructions, and a combination of pipeline stalling and data forwarding to resolve data dependency conflicts caused by load instructions. Second, the classic five-stage pipelined processor executes branch instructions in the EX stage, and in the event of a branch prediction failure, it requires flushing the pipeline for two clock cycles. Last, the unoptimized branch predictor defaults to not taking a branch on the first prediction of a branch instruction, so when the processor executes an immediate jump instruction for the first time, a branch prediction failure and pipeline flush are inevitable.
Fig. 6 illustrates how the NRP processor uses data forwarding, pipeline stalling, and branch prediction to resolve various dependency conflicts, together with the resulting impacts.
In Fig. 6, we can observe the following scenarios. First, in the NRP processor, the use of data forwarding is more extensive, including between ID and EX, ID and MEM, and EX and MEM. Second, the NRP processor executes branch instructions in the ID stage to obtain the branch prediction result, so in the event of a branch prediction failure, it requires flushing the pipeline for one clock cycle. Last, the NRP processor employs an optimized branch predictor, so when executing an immediate jump instruction for the first time, it correctly takes the jump, avoiding pipeline flushing.
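The effect of the shorter flush can be illustrated with a simple first-order cycle model. This is a back-of-the-envelope sketch rather than a simulator: the function name, the instruction counts, and the misprediction counts below are made-up illustrative values, and the model assumes one instruction issued per cycle with stalls added on top.

```python
# First-order cycle model: one instruction per cycle plus stall cycles.
# All workload numbers used with it are hypothetical, not measurements.
def total_cycles(instructions: int,
                 mispredictions: int,
                 flush_penalty: int,
                 load_use_stalls: int = 0,
                 pipeline_depth: int = 5) -> int:
    """Estimate the cycles needed to run `instructions` instructions."""
    fill = pipeline_depth - 1  # cycles before the first instruction retires
    return (instructions + fill
            + mispredictions * flush_penalty
            + load_use_stalls)


# Hypothetical workload: 1000 instructions with 50 mispredicted branches.
classic = total_cycles(1000, 50, flush_penalty=2)  # branch resolved in EX
nrp = total_cycles(1000, 50, flush_penalty=1)      # branch resolved in ID
saved = classic - nrp
```

In this model every misprediction costs one cycle less when branches are resolved in ID, so the saving grows linearly with the number of mispredictions. The optimized first-execution prediction would additionally lower the misprediction count itself, which the sketch keeps fixed.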

A. Functional Test
The compliance test suite officially released by RISC-V can test whether the design of a RISC-V core complies with the RISC-V standard [19]. In this paper, joint simulation tests were conducted using Vivado and ModelSim, and the test results indicate that the NRP complies with the RISC-V standard for core design. Fig. 7 and Fig. 8 show the simulation test results for the ADD instruction and the JAL instruction, respectively.

B. Performance Test
CoreMark is a straightforward yet sophisticated benchmark designed specifically to evaluate the performance of a processor core. In this paper, the CoreMark program and the NRP processor core were ported to Xilinx's Arty A7-35T development board using Vivado, and the clock function and serial print function were rewritten. The main frequency of the NRP processor core was set to 50 MHz for testing, and the results were transmitted to a PC for display via a serial tool.

C. Experimental Analysis
We implemented various versions of the NRP processor in Verilog HDL and evaluated their performance on Xilinx's Arty A7-35T development board. Based on the optimization level of the NRP processor, we categorized it into three versions. The version without any optimization design is defined as NRP-Original, the version with the optimization design only in the decode stage is defined as NRP-OptID, and the version with the optimization design in both the decode stage and the branch predictor is defined as NRP-Final.

VI. CONCLUSION
We have proposed a five-stage pipelined processor based on the RISC-V architecture. In this processor, we have employed instruction decoding unit optimization and branch prediction optimization as effective methods to improve processor performance. We implemented the proposed processor in Verilog using Vivado and conducted tests and evaluations of the processor's performance and hardware resource consumption. The CoreMark test results for the NRP processor after adopting the optimization strategies show a score of 3.11 CoreMark/MHz, representing an 11.07% improvement over the non-optimized design.

Fig. 1. A block diagram of the five-stage pipelined processor NRP.

Fig. 2. The execution process of instructions in the ID and EX stages.

Fig. 9 presents the serial print results, showing that the NRP processor achieved a final CoreMark score of 3.11 CoreMark/MHz.

Fig. 10 displays the CoreMark scores for each version of the NRP processor. After optimizing the design of the ID stage and the branch predictor, the performance of the NRP processor improved by 11.07%.
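The reported figures can be cross-checked with a short calculation. The sketch below derives two values that are implied by the numbers in the text but not stated in it, so they should be read as estimates: the CoreMark/MHz score of the non-optimized design, assuming the 11.07% improvement is relative to that baseline, and the raw CoreMark iterations per second at the 50 MHz test clock. The variable names are illustrative.

```python
# Arithmetic on the reported results; derived values are estimates.
final_score = 3.11     # CoreMark/MHz reported for the optimized NRP
improvement = 0.1107   # 11.07% improvement over the non-optimized design
clock_mhz = 50.0       # test clock frequency in MHz

# final = baseline * (1 + improvement)  =>  baseline = final / (1 + improvement)
baseline_score = final_score / (1.0 + improvement)  # about 2.80 CoreMark/MHz

# CoreMark/MHz is iterations per second divided by the clock in MHz.
iterations_per_second = final_score * clock_mhz     # about 155.5
```

Under these assumptions the non-optimized design would score roughly 2.80 CoreMark/MHz.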

Fig. 10. Performance test results of different versions of NRP.

Fig. 11 presents the CoreMark test results for other open-source processors, showing that the performance of the NRP processor is significantly better than that of other processors [21]-[27].