Design of Efficient Pipelined Router Architecture for 3D Network on Chip

As a communication structure for integrated circuits, the Network-on-Chip (NoC) architecture has attracted a wide range of research. Compared to conventional bus technology, NoC provides higher scalability and enhances system performance for future Systems-on-Chip (SoC). In earlier work, we presented a packet-switching router design for 2D NoC that supports the 2D mesh topology. Despite its benefits over conventional bus technology, the NoC architecture faces limitations such as high communication cost, high power consumption and inefficient router pipeline usage. One of the proposed solutions is 3D design. In this context, we propose a router architecture for 3D mesh NoC, a natural extension of our prior 2D router design. The proposal uses wormhole switching and employs the turn model negative-first routing algorithm, so deadlocks are avoided, and a dynamic arbiter is implemented to provide the Quality of Service (QoS) expected by the network. We also present an optimization technique for the router pipeline stages. We prototyped the proposal on FPGA and synthesized it with the Synopsys tool using a 28 nm technology. Results are reported and compared with other well-known works in terms of maximal clock frequency, area, power consumption and estimated peak performance.

Keywords: 3D network on chip; router optimization; turn model; parallel communication; router pipeline stages


I. INTRODUCTION
During the last decade, the evolution of technology has shrunk transistor dimensions and made it possible to integrate billions of transistors on the same chip. This increasing transistor density allows the integration of a large number of cores on a single chip. It therefore requires a powerful on-chip interconnection scheme to handle communication among the many cores on the chip [1]. As this communication plays a major role in determining system performance, traditional on-chip interconnection schemes are no longer suitable for Multi-Processor Systems on Chip (MPSoC) due to their lack of parallelism, scalability and resource management [2]. Recently, the Network on Chip (NoC) has been introduced as the best candidate to handle on-chip communication requirements and overcome the limitations of traditional interconnections [3]. NoC architectures are mainly composed of the router, which distributes packets along the network, the network interface, which grants network access, and the link, which connects NoC components. Based on a scalable architecture, NoC enables high bandwidth and improves overall system performance [4].
Concurrently, transistor densities keep increasing, and integrating hundreds of cores on a planar chip is not sufficient for future applications. These applications are becoming more complex and demand higher-performance systems that handle parallel computing and provide higher bandwidth. At the same time, the semiconductor industry is exploiting three-dimensional integrated circuits (3D IC), which provide shorter global interconnects, lower power consumption and higher performance [5]. Combining the 2D NoC architecture with 3D IC technology makes the design of 3D NoC possible: a stack of multiple dies along the vertical axis, interconnected by through-silicon vias (TSV) [6]. Compared to 2D NoC, 3D NoC offers higher performance and higher packaging density. Thus, 3D NoC satisfies the on-chip communication requirements of future MPSoC.
In this paper, we propose an extensible, flexible and efficient router architecture and its implementation for 2D and 3D mesh topologies. We present optimized router pipeline stages in order to reduce their dependencies and improve router efficiency. The proposal adopts the wormhole switching technique, the turn model negative-first routing algorithm to avoid deadlocks and a dynamic arbiter to improve the Quality of Service (QoS) expected by the 3D NoC. In order to evaluate the performance and the hardware cost, we prototype the proposal on FPGA and compare it with other designs.
The rest of the paper is organized as follows. Section 2 deals with related work. Section 3 gives an overview of the 2D router architecture. Section 4 presents the optimized pipeline stages of the 3D router in detail. Section 5 provides the evaluation results, and Section 6 concludes the paper.

II. RELATED WORK
Many works in the literature address the challenges of on-chip interconnection design. One promising solution is the extension from 2D NoC to 3D NoC architectures. Proposed 3D NoC designs generally use the 3D mesh topology because of its simplicity, regularity and scalability, as it is a direct extension of the 2D mesh topology. In [7], the authors investigated resilience and adaptivity against faults in 3D NoC. They proposed a fault-tolerant routing algorithm named 4NP-First. Compared to a stochastic random walk routing scheme, their turn-model-based routing algorithm shows better robustness against faults. However, their architecture implements two virtual channels, one for the original transmitted packet and one for the redundant packet. Such a technique has a negative impact on power consumption and hardware cost. In [8], the authors presented AFRA, a deadlock-free routing algorithm that tolerates faults in 3D mesh NoC. When no fault is detected, AFRA sends packets through ZXY routing. When a fault is detected, flits are forwarded through XZXY routing. This routing scheme shows good performance and robustness against faults. Nonetheless, AFRA focuses only on vertical link faults and ignores horizontal faults. Although it does not require any additional virtual channels to avoid deadlock, it needs some global information to be stored, which adds overhead to the router hardware complexity.
In [9], the authors proposed a 3D mesh NoC. Their scalable architecture adopts wormhole switching and implements a look-ahead routing algorithm. Their hardware implementation on FPGA shows good performance in terms of area and maximal clock frequency. However, deadlock may arise with any adaptive routing algorithm. In order to avoid deadlocks, they have to make their routing algorithm minimal or use virtual channels. In [10], the authors investigated topologies and routing algorithms for 3D NoC. They suggested a modified tree topology in order to reduce the degree and the diameter of the network, which are vital characteristics of a network topology that affect system performance. However, future applications require high throughput and low latency, which cannot be provided by a 3D tree topology. In [11], the authors introduced an adaptive router architecture for heterogeneous 3D NoC. They implement a deadlock-free adaptive routing algorithm. Compared with a homogeneous router, they modify the TSV selector and the routing logic blocks, which reduces hardware cost and improves performance. Only if the destination node is in a different layer does the TSV selector choose a valid router as a vertical hub for interlayer routing; otherwise the routing logic has a structure similar to that of a 2D router. However, a heterogeneous NoC has a fixed topology and cannot be customized to an application's requirements.
In [12], the authors presented a router architecture for symmetric 3D mesh NoC. They implement a dimension-order XYZ routing algorithm, adopt credit-based flow control and use virtual channels to avoid deadlock. Priority-based scheduling is used to support and manage the different levels of QoS. Their results show low latency and high bandwidth. However, their design suffers from area and power overheads. In [13], we proposed a router architecture for 3D mesh-based NoC. We implemented the turn model negative-first routing algorithm in order to avoid deadlocks. We adopted a packet priority scheme and a round-robin arbiter to ensure the QoS expected by the network. However, that proposal suffers from a dependency between router pipeline stages, which increases the hardware cost and affects the average latency of the network.
In this study, we propose a 3D NoC router based on our preceding designs [13], [14]. The proposal implements the negative-first 3D turn model routing algorithm, which applies routing restrictions to prevent deadlock. It also uses a dynamic arbiter to serve packets fairly and enhance the QoS. We describe the optimized router pipeline stages in detail, along with their impact on reducing the hardware cost and improving system performance in terms of bandwidth.

III. 2D ROUTER ARCHITECTURE
Fig. 1 illustrates the NoC topology, a 3x3 mesh using the wormhole switching policy and credit-based flow control. To locate and differentiate routers in the network, we give every router a unique address defined in XY coordinates. Each router can be connected to a maximum of four adjacent routers as well as the local intellectual property (IP) core. The number of ports per router depends on its position in the network. In order to reduce the chip area and the power consumption, we eliminate any unused ports.
On their way to the destination, packets must pass through three pipeline stages, as shown in Fig. 2. The first stage is the routing calculation (RC), in which the destination address is compared to the router address in order to determine the next output port. This information is then sent to the next stage, the switch allocation (SA). Based on its arbiter, this stage serves packets fairly for each destination. Finally, the information about the selected output port is sent to the semi-crossbar traversal (ST) stage, which ensures the traversal of packets toward their destinations [14].

A. Optimized Router Pipeline Stages
We optimize a previous version of our 3D router design [13]. Fig. 3 shows the 3D router pipeline stages. As for the 2D router, the pipeline stages are the routing calculation, the switch allocation and the semi-crossbar traversal.
In typical pipeline stages, we observe a dependency between the routing calculation stage and the switch allocation stage. Each packet must wait for control signals to move from one phase to the next, which increases the latency of the network. We break this dependency by executing the routing calculation and the switch allocation concurrently. Furthermore, in the previous routing calculation process, the output port was determined by first comparing the X address of the flit to the X address of the router, then the Y address of the flit to the Y address of the router, and finally the Z address of the flit to the Z address of the router. This makes the routing algorithm more complex and increases the hardware complexity of the router. Therefore, we compare the flit address with the router address directly, without decoding it into separate coordinate comparisons. With these optimizations, we aim to reduce the hardware cost and the communication latency of the network design.

B. Topology
As shown in Fig. 4, the NoC topology is a 3x3x3 mesh. Each router can contain up to seven bidirectional ports. One port is connected to the local IP, while the other six are connected to the adjacent routers in each direction of the network (north, east, south, west, up and down). Each router is identified by its XYZ coordinates. We chose the 3D mesh topology because it is the direct extension of the 2D mesh topology and because of its advantages over other topologies, such as simplicity of implementation, regularity and scalability.
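The relation between a router's position in the mesh and its number of ports can be made concrete with a minimal Python sketch. It assumes, as described for the 2D router, that ports facing outside the mesh are removed. The function name `port_count` and the `size` parameter are illustrative and do not come from the original design.

```python
def port_count(x, y, z, size=3):
    """Ports of the router at (x, y, z) in a size x size x size mesh.

    One local port plus one port per existing neighbour (assumption:
    ports that would face outside the mesh are eliminated).
    """
    neighbours = sum((c > 0) + (c < size - 1) for c in (x, y, z))
    return 1 + neighbours


# Count how many routers of each port count a 3x3x3 mesh contains.
histogram = {}
for x in range(3):
    for y in range(3):
        for z in range(3):
            n = port_count(x, y, z)
            histogram[n] = histogram.get(n, 0) + 1
```

Under these assumptions, the 3x3x3 mesh contains one 7-port router (the center of the cube), six 6-port routers (face centers), twelve 5-port routers (edge middles) and eight 4-port routers (vertices).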

C. Communication Flow
The proposal adopts the wormhole switching policy. As shown in Fig. 5, a packet is composed of two types of flits: the header flit and the body flit. The header flit is composed of 32 bits. The first six bits are allocated to the destination address. One bit is then allocated to the quality of service required by the NoC. The next four bits are allocated to the number of flits per packet. The next three bits are allocated to the packet priority, and the remaining bits constitute an extension field. The body flit carries a 32-bit data payload. The packet format and the flit size can be changed according to the application specifications.
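The header layout described above can be illustrated with a short Python sketch that packs and unpacks the fields. The field widths follow the text (6 + 1 + 4 + 3 bits, leaving 18 bits of extension). The bit ordering, with the destination address in the most significant bits, and all names (`FIELDS`, `pack_header`, `unpack_header`) are assumptions made for illustration.

```python
# Field widths taken from the header description; the bit ordering
# (destination in the most significant bits) is an assumption.
FIELDS = (("dest", 6), ("qos", 1), ("flit_count", 4), ("priority", 3), ("ext", 18))


def pack_header(dest, qos, flit_count, priority, ext=0):
    """Pack the header fields into one 32-bit word, first field on top."""
    values = (dest, qos, flit_count, priority, ext)
    word = 0
    for (name, width), value in zip(FIELDS, values):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Recover the fields from a 32-bit header word."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


header = pack_header(dest=13, qos=1, flit_count=5, priority=3)
```

Calling `unpack_header(header)` returns the original field values, which gives a quick round-trip check of a chosen layout before it is fixed in hardware.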

D. Routing Algorithm and Deadlock Avoidance
In [15], Glass presented the turn model for partially adaptive routing algorithms, targeting the mesh topology. This model yields wormhole routing algorithms that need no additional physical or virtual channels. The principle of this model is to study all the turns that packets can take in the network from source node to destination node, as well as the cycles formed by those turns. A turn is a 90-degree change in the direction of a packet, and a cycle is formed by four turns. Such cycles may trap packets in circular waiting dependencies, called deadlocks, which lead to network failure. Therefore, enough turns are eliminated to prevent cycles from forming, which makes the routing algorithm deadlock-free. Fig. 6 shows an example of deadlock involving four packets. Fig. 6(a) displays a deadlock situation between packets from different planes. Fig. 6(b) presents a deadlock situation between packets belonging to the same plane.
In a 3D mesh, when flits travel between routers to reach their destination, they can move in six directions: north, east, south, west, up and down. Each flit can make up to 24 turns, 8 turns in each of the planes (x,y), (x,z) and (y,z). In order to eliminate deadlock, we have to break cycles by prohibiting two turns in each plane. Glass [15] proposed three turn model routing algorithms: negative-first, west-first and north-last. We chose the negative-first routing algorithm since it extends simply to 3D, is symmetric and does not require packet ordering. The analysis of turns in the turn model is based on XYZ coordinates, whereas the directions of flits in the network are named north, east, south, west, up and down. Thus, to unify the terminology, +y is the north direction, +x is the east direction, -y is the south direction, -x is the west direction, +z is the up direction and -z is the down direction. The negative-first routing algorithm routes packets first adaptively along -x, -y and -z and then adaptively along +x, +y and +z.
Fig. 7(a) illustrates the prohibited turns in four different routers of the network. Depending on its position in the network, each router must eliminate the turns from a positive direction to a negative direction in order to avoid deadlock. As shown in Fig. 7(b), solid lines indicate the allowed turns and dashed lines indicate the prohibited turns of the negative-first routing algorithm in each plane of the 3D mesh. When flits are received by the input ports, each input port of each router performs the routing calculation independently of the others, so the router can handle up to seven flits at the same time. This distributed routing is used to decrease the router latency and the average latency of the network.
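The turn counts quoted above can be checked by enumeration. The following Python sketch lists every 90-degree turn between the six directions and marks as prohibited those that go from a positive to a negative direction, which is the negative-first restriction. The names `is_turn`, `is_prohibited` and `turns_in_plane` are illustrative only.

```python
from itertools import permutations

DIRECTIONS = ["+x", "-x", "+y", "-y", "+z", "-z"]


def is_turn(src, dst):
    """A turn is a 90-degree change of direction: the axis changes."""
    return src[1] != dst[1]


def is_prohibited(src, dst):
    """Negative-first forbids turning from a positive to a negative direction."""
    return is_turn(src, dst) and src[0] == "+" and dst[0] == "-"


# All ordered pairs of directions on different axes.
turns = [(s, d) for s, d in permutations(DIRECTIONS, 2) if is_turn(s, d)]
prohibited = [t for t in turns if is_prohibited(*t)]


def turns_in_plane(turn_list, plane):
    """Turns whose two axes are exactly those of the plane, e.g. 'xy'."""
    return [(s, d) for s, d in turn_list if {s[1], d[1]} == set(plane)]
```

The enumeration yields 24 turns in total, 8 per plane, of which 2 per plane are prohibited, leaving 6 allowed turns in each plane.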
In order to determine the output port, the routing calculation process compares the destination address of the flit with the current router address, taking into account the turn restrictions imposed by the negative-first routing algorithm to avoid deadlock:
• If the destination address is equal to the router address + 1, then the output port will be px, else the output port will be nx.
• If the destination address is equal to the router address + 3, then the output port will be py, else the output port will be ny.
• If the destination address is equal to the router address + 9, then the output port will be pz, else the output port will be nz.
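The offsets 1, 3 and 9 in the rules above are consistent with a linear address of the form x + 3y + 9z in a 3x3x3 mesh. That encoding is an assumption here; the text does not state it explicitly. Under that assumption, a minimal negative-first routing calculation can be sketched in Python as follows. It is a behavioural illustration with hypothetical names (`to_address`, `candidate_ports`, `route`), not the RTL of the proposed router, and it decodes the address into coordinates for readability.

```python
SIZE = 3  # routers per dimension in the 3x3x3 mesh


def to_address(x, y, z):
    """Assumed linear address: +1 per x step, +3 per y step, +9 per z step."""
    return x + SIZE * y + SIZE * SIZE * z


def to_coords(address):
    return address % SIZE, (address // SIZE) % SIZE, address // (SIZE * SIZE)


# Address change caused by leaving through each output port.
STEP = {"px": 1, "py": 3, "pz": 9, "nx": -1, "ny": -3, "nz": -9}


def candidate_ports(current, dest):
    """Output ports allowed by minimal negative-first routing."""
    cur, dst = to_coords(current), to_coords(dest)
    negative = [f"n{a}" for a, c, d in zip("xyz", cur, dst) if d < c]
    if negative:  # all negative hops are taken first
        return negative
    positive = [f"p{a}" for a, c, d in zip("xyz", cur, dst) if d > c]
    return positive or ["local"]


def route(src, dest):
    """Follow the first candidate port at each hop and list the ports used."""
    ports, current = [], src
    while True:
        port = candidate_ports(current, dest)[0]
        if port == "local":
            return ports
        ports.append(port)
        current += STEP[port]
```

For example, `route(to_address(2, 0, 1), to_address(0, 2, 2))` returns `['nx', 'nx', 'py', 'py', 'pz']`: both negative hops come first, so no turn from a positive to a negative direction ever occurs. Always taking the first candidate is a simplification; an adaptive router could pick any port in the candidate list.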

E. Switch Allocation
The router implements a distributed arbitration scheme. Thus, the switch allocator contains seven arbiter modules similar to the one presented in Fig. 8 (one arbiter for each input port). The arbiter controls the connection between input ports and output ports. In order to avoid conflicts, especially when flits from different input ports request access to the same output port at the same time, an arbitration scheme is necessary to serve flits fairly. We use a dynamic arbiter that is composed of a priority-based scheduler, C-element ports and a round-robin arbiter. The proposed arbiter prevents and resolves conflicting accesses to the output port using a priority comparator that compares the priorities of the incoming flits. It provides the highest-priority signals to the C-elements, which combine those signals with the corresponding requests to make sure that at least one flit is requesting access to the output port. The C-elements then send this information to the round-robin arbiter. In this manner, only flits that have the highest priority and request access to the output port are served.
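The behaviour of such a dynamic arbiter can be sketched in Python. This is a simplified behavioural model under stated assumptions: a larger number means a higher priority, the C-element stage is modelled as a plain AND of "requesting" and "has the highest priority", and the round-robin pointer starts so that input 0 is served first. The class name `DynamicArbiter` and its interface are hypothetical.

```python
class DynamicArbiter:
    """Priority comparison followed by round-robin among the top requesters."""

    def __init__(self, n_inputs=7):
        self.n = n_inputs
        self.last = n_inputs - 1  # so that input 0 is examined first

    def grant(self, requests, priorities):
        """Return the index of the granted input, or None if nobody requests."""
        active = [p for r, p in zip(requests, priorities) if r]
        if not active:
            return None
        top = max(active)  # priority comparator
        # C-element stage (modelled as AND): requesting and highest priority.
        eligible = [r and p == top for r, p in zip(requests, priorities)]
        # Round-robin: start right after the most recently served input.
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if eligible[i]:
                self.last = i
                return i


arbiter = DynamicArbiter(n_inputs=4)
requests = [True, True, False, True]
priorities = [2, 5, 7, 5]
grants = [arbiter.grant(requests, priorities) for _ in range(3)]
```

Here inputs 1 and 3 share the highest priority among the requesters, so they are served alternately (`grants == [1, 3, 1]`). Input 0 is never served while they keep requesting, and input 2 is ignored because it does not request.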

F. Semi-Crossbar Traversal
The final pipeline stage is the semi-crossbar, which acts as a bridge that interconnects the output port of the current router with the input port of the next router. The semi-crossbar waits for a signal from the switch allocator indicating the selected output port. Based on this information, the semi-crossbar establishes an interconnection and sends flits to the selected output port. The flit-number-per-packet signal informs the semi-crossbar when all flits have been transmitted and the channel is free to be used by another flit transmission. The semi-crossbar is based on multiplexer circuits. It uses seven multiplexers, one for each output port, as presented in Fig. 9. Because of the negative-first routing algorithm restrictions, we need two types of multiplexers: one for ports of type p, which can receive flits from any input port, as shown in Fig. 9(b), and the other for ports of type n, which can only receive flits from an input port of type n or from the local port, as shown in Fig. 9(c).
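The two multiplexer types can be summarized in a small Python sketch that lists which inputs each output multiplexer accepts. Port names follow the p/n convention of the text. Treating the local output like a type-p port is an assumption, since the text does not specify it, and the names `mux_inputs` and `traverse` are hypothetical.

```python
INPUTS = ["px", "py", "pz", "nx", "ny", "nz", "local"]


def mux_inputs(output):
    """Inputs wired to the multiplexer of the given output port."""
    if output.startswith("n"):
        # Type-n output: only type-n inputs and the local port.
        return [p for p in INPUTS if p.startswith("n") or p == "local"]
    # Type-p output (and, by assumption, the local output): any input.
    return list(INPUTS)


def traverse(flits, selection):
    """Forward flits according to the switch allocator's selection.

    flits:     mapping input port -> flit waiting on that input
    selection: mapping output port -> input port chosen for it
    """
    outputs = {}
    for output, source in selection.items():
        if source not in mux_inputs(output):
            raise ValueError(f"{source} is not wired to output {output}")
        outputs[output] = flits[source]
    return outputs
```

With this wiring, a type-n multiplexer has four inputs instead of seven, which is where the semi-crossbar saves hardware compared with a full crossbar.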

V. EXPERIMENTAL RESULTS
This section starts with an overview of the hardware complexity of the 2D and the proposed 3D router designs. Then, it provides a synopsis of the different router implementations. Finally, it compares the results of the proposed design with other designs. The proposed router is designed and simulated in VHDL at the RTL level. The implementation and evaluation target both FPGA and ASIC technologies, and results are provided in terms of maximal clock frequency, area, power consumption and estimated peak performance.

A. FPGA Based Design
The FPGA implementation of the proposed router has been performed on a Xilinx Virtex-5 XC5VFX70T FPGA board using the Xilinx ISE 13.1 design software. The parameters used to simulate the 2D and 3D router designs are presented in Table 1. Table 2 presents a comparison of the hardware evaluation results for both designs. The results indicate that the 3D router is 1.29 times faster than the 2D router. The estimated peak performance of the 3D router is 1.32 times greater than that of the 2D router. Compared to the 2D router, the hardware cost and the power consumption are decreased by 61.2% and 61.7%, respectively. Thanks to the optimization of the router pipeline stages, the results confirm a significant improvement in hardware complexity.
Table 3 shows an overview of the different router implementations. The number of ports of a router depends on its position in the network. Hence, we have four routers with different port counts: the 7-port router, which is located at the center of the cube, like router R(1,1,1) of Fig. 4; the 6-port router, which is situated at the center of a cube face, like router R(1,1,0) of Fig. 4; the 5-port router, which is found at the middle of a cube edge, like router R(1,0,1) of Fig. 4; and the 4-port router, which is positioned at a vertex of the cube, like router R(0,0,0) of Fig. 4.
As this table shows, the maximal clock frequency decreases as the number of ports per router rises. This decrease in frequency is caused by the growth of the arbiter scheduling phase. Another metric used to evaluate the proposal is the estimated peak performance per router, which depends on the maximal clock frequency, the flit size and the number of cycles needed to transmit one flit. Per port, it is given by:

$$PP_{\text{per port}} = \frac{F_{\max}}{T} \times \text{flit size}$$

where $F_{\max}$ is the maximal clock frequency and $T$ is the number of cycles needed to transmit one flit. As the maximal clock frequency decreases, the estimated peak performance falls, because it is proportional to the frequency of the design. The hardware resources and the power consumption increase as the number of ports per router rises, which can be explained by the growth of the router complexity.
In order to compare the proposed router with other designs, we implemented it in three different FPGA technologies, as illustrated in Table 4. Results of other routers prototyped on FPGA are provided in terms of maximal clock frequency and area, as shown in Table 5. The authors of [11] describe a router architecture for 3D heterogeneous NoC. Their architecture uses adaptive routing and was implemented on a Virtex-6 FPGA. Previously, in [13], we described a router architecture for 3D mesh-based NoC. That design adopts the negative-first turn model routing algorithm and was implemented on a Virtex-5 FPGA. The authors of [16] describe a buffer-less router architecture for 3D NoC. Their architecture uses minimal routing and was implemented on a Virtex-4 FPGA. The results demonstrate that, area-wise, the proposal outperforms all other designs: it is 1.02, 2.82 and 5.49 times smaller than the routers of [11], [13] and [16], respectively. In terms of maximal clock frequency, the proposal is 1.37 times faster than the router of [13]. However, it underperforms the routers of [11] and [16] in terms of maximal clock frequency because we use distributed routing and arbitration. In return, this allows the router to handle up to seven packets at the same time. The proposal increases the throughput while maintaining an area/speed trade-off. It gives even better performance with a more advanced FPGA technology such as Virtex-7, where the estimated clock frequency reaches 168 MHz and the estimated peak performance reaches 53.76 Gbit/s.
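The per-port peak performance formula given in this section can be evaluated with a few lines of Python. The input values below (100 MHz, one cycle per flit, 32-bit flits) are purely illustrative and are not taken from the paper's tables; the function name is hypothetical.

```python
def peak_performance_per_port(f_max_hz, cycles_per_flit, flit_size_bits):
    """Estimated peak performance of one port in bit/s: (Fmax / T) * flit size."""
    return f_max_hz / cycles_per_flit * flit_size_bits


# Illustrative values only: 100 MHz clock, 1 cycle per flit, 32-bit flits.
example_bits_per_s = peak_performance_per_port(100e6, 1, 32)
example_gbits_per_s = example_bits_per_s / 1e9
```

With these illustrative inputs, the result is 3.2 Gbit/s per port. Doubling the number of cycles per flit halves the figure, which shows why shortening the pipeline matters as much as raising the clock frequency.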

B. ASIC Based Design
In order to evaluate area overhead and power consumption, the 2D and the 3D routers are synthesized with the Synopsys Design Vision tool, using the FD-SOI 28 nm technology and assuming an operating frequency of 1 GHz and a supply voltage of 1 V. The resulting area, leakage power and dynamic power estimates for each router were extracted from the synthesized circuit and are summarized in Table 6. Compared to the 2D router, the 3D router shows an increase of about 2.2 times in area and about 2.53 times in power, due to the additional hardware requirements of the 3D router.
Table 7 compares the proposed 3D router with two other routers, presented in [17] and [18], in terms of area and power consumption. We chose these routers because of their remarkable performance. The authors of [17] describe a robust router architecture that implements an adaptive routing algorithm ensuring fault tolerance in both router components and network links. It also provides high throughput by avoiding deadlocks without any use of virtual channels. The authors of [18] describe a router architecture for vertically partially connected 3D NoC based on a stacked 2D mesh topology. Their router implements the Elevator-First routing algorithm, which avoids deadlocks by using only two virtual channels in the plane. The results show that, in terms of area, the proposed 3D router is 18.27 and 14.33 times smaller than the routers of [17] and [18], respectively.
The results indicate that the proposed 3D router has the best leakage power consumption, with a reduction of about 76.3% relative to [17]. In terms of dynamic power consumption, the proposed router outperforms the 2D router of [17] with a reduction of about 28.8%. The increase relative to the 3D router of [18] is only about 15.06%. This increase can be explained by the use of the dynamic arbiter, which needs more computation than the round-robin arbiter used in [18].

VI. CONCLUSION
In this paper, we presented a router design for the 3D mesh NoC topology that extends our former work. We described the 3D NoC router architecture in detail, presented the optimization of its pipeline stages, prototyped the architecture on FPGA and synthesized it with the Synopsys tool using a 28 nm technology. We assessed the performance of the proposal in terms of maximal clock frequency, area, power consumption and bandwidth, and compared it with other well-known works. Evaluation results show that, in terms of clock frequency, the proposal is 1.37 times faster than our preceding work, but it underperforms the routers of [11] and [16] because we use distributed routing and arbitration. In terms of hardware cost, the proposal is 18.27 and 14.33 times smaller than the routers of [17] and [18], respectively. The results also show a 76.3% reduction in leakage power consumption together with low dynamic power consumption. Additionally, the proposal shows a significant performance improvement compared to both the 2D router and earlier 3D router designs, thanks to the optimization method. The results also confirm the ability of the proposal to handle the cost/performance trade-off for 3D NoC.

Fig. 7. (a) Prohibited turns in four routers of the 3D NoC. (b) Six turns allowed (solid arrows) in negative-first routing.

TABLE I. SIMULATION PARAMETERS

TABLE III. RESULTS OF ROUTERS IMPLEMENTATION ON FPGA

TABLE V. PERFORMANCE COMPARISON OF THE PROPOSED ROUTER WITH OTHER STATE-OF-THE-ART ROUTERS

TABLE VI. AREA AND POWER CONSUMPTION OF THE 2D AND THE 3D ROUTERS IN ST FD-SOI 28 NM TECHNOLOGY

TABLE VII. COMPARISON IN TERMS OF AREA AND POWER CONSUMPTION WITH OTHER ROUTERS