Evaluation of Bufferless Network-On-Chip with Parallel Port Allocator

When a multicore network-on-chip (NoC) scales up to hundreds of nodes, energy consumption, design complexity and cost increase manyfold owing to the structure of the interconnect. Much research is under way on novel architectures for building efficient NoCs. This paper proposes an efficient bufferless design with a deflection containment technique that eliminates buffers and reduces latency. The high cost of buffers motivates a bufferless design; however, as network load increases, bufferless routing suffers from multiple deflections and flit loss between nodes. To overcome this, we have designed a bufferless architecture with a local bypass ring within each node to reduce deflection and packet loss. Deflection containment using the local bypass ring shortens the critical path and improves performance. The architecture of the proposed bufferless NoC is analysed, its components are implemented at RTL level with the Xilinx ISE design suite, and its operation is simulated in ModelSim SE. Our evaluation shows that bufferless routing with deflection containment reduces power dissipation without compromising performance.


Introduction
The interconnection fabric is an important design parameter in an NoC, since it connects the on-chip components. It ranks higher than the traditional bus interconnect in scalability and bandwidth [1]. In NoC design, buffers consume more power and occupy a larger area, leading to high cost. Figure 1 presents the architecture of an NoC in a multicore system. As shown, it connects all nodes within a chip through network interfaces and routers. Recent work on low-cost NoC design by eliminating buffers is discussed in [2][3][4][5][6][7][8][9][10]. In a buffered network, whenever there is congestion or the destination port of a packet is busy, the packet remains idle in the buffer and waits for an acknowledgement before it is transmitted. In this process, link bandwidth is not used unnecessarily. In bufferless NoCs, however, only pipeline registers are used. All incoming flits are transmitted through the pipeline registers, and whenever the destination port of a particular flit is busy, the flit is routed to a nearby port that is free.
When the nearer ports are also busy, the router sends the packet to whichever port is free, irrespective of its distance from the destined port. As the number of cores increases, the packet injection rate also increases, which makes deflection in a bufferless router cumbersome and reduces its performance [3]. Figure 2 shows the architecture of buffered and bufferless NoCs.

Fig.2.Architecture of Buffered and Bufferless NoC
However, subsequent evaluations have brought out problems with the adoption of bufferless routers [4,5]. The flit ranking and port allocation used in designs so far increase the critical path delay, as flits are allocated ports sequentially. Flit ranking methods used to avoid livelock during transmission, such as oldest-first (Table 1), have increased the complexity of bufferless router design. As shown in Figure 3, whenever the network load increases, the deflection rate increases, causing flit contention and reducing performance. A packet then takes unnecessary extra hops to reach its destination, which makes deflection containment a stumbling block in building efficient routers. ABNoC (Approximate Bufferless) [6] is a recent bufferless design in which network conflicts are reduced and packets are transmitted using an approximate allocation mechanism. Though it improves latency and bandwidth utilisation, as shown in Figure 4, its retransmission mechanism makes the design more complex than other designs.

2.1 Microarchitecture of Proposed Bufferless Router
We address the challenges of previous methods with small changes to the router architecture that improve its performance. Figure 5 shows the proposed router architecture. It uses a two-stage pipeline. In the first stage, the destination ports of incoming flits are calculated using conventional routing logic. Once the destination ports are known, the flits are passed to the sorting network logic, where they are ranked by priority. The sorting logic is designed so that the top channel carries the highest priority. Once the flits are sorted, they enter the second stage, where port allocation is done in parallel depending on the availability of destination ports. Unlike conventional bufferless NoCs, which deflect contending flits to other nodes, the proposed router sends contending flits to a local bypass channel formed within the subnetwork of that node. A flit in the bypass channel contends for its destined port after each clock cycle until it gets its turn for transmission. Injection and ejection continue until the channels are occupied, and winning flits move to their destined ports over the crossbar.

Flits Ranking By Priority
Ranking flits by priority is the first stage of the arbitration process and is performed by the sorting network shown in Figure 5. Flits from the four ports, West, East, South and North, are prioritised; flits in the bypass port are excluded to reduce hardware complexity. Since four ports are considered, we use a partial permutation network built from 2x2 bitonic sorting nodes. Bitonic sorting is a parallel sorting algorithm based on compare-and-exchange operations. Sorting is done in such a way that the top channel holds the highest priority value. As shown in Figure 6, inputs a, b, c and d from the west, east, north and south ports are sorted to yield outputs w, x, y and z. An example of bitonic sorting is shown in Figure 7.
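The ranking stage can be illustrated with a small behavioural model. The Python sketch below is not the RTL; it assumes that priority is held in the last 4 bits of an 8-bit flit (as in the simulation values reported in Results and Discussion) and that the second-stage nodes u2 and u3 receive the two first-stage winners and the two first-stage losers respectively. The function names and this wiring are assumptions made for illustration.

```python
from typing import Tuple

PRIORITY_MASK = 0x0F  # assumption: priority is the low 4 bits of an 8-bit flit


def compare_exchange(a: int, b: int) -> Tuple[int, int]:
    """Model of one 2x2 sorting node: (higher-priority flit, lower-priority flit)."""
    if (a & PRIORITY_MASK) >= (b & PRIORITY_MASK):
        return a, b
    return b, a


def rank_flits(east: int, west: int, north: int, south: int) -> Tuple[int, int, int, int]:
    """Two-stage network of four 2x2 nodes (wiring of u2/u3 is assumed)."""
    a1, b1 = compare_exchange(east, west)    # node u0
    c1, d1 = compare_exchange(north, south)  # node u1
    w, x = compare_exchange(a1, c1)          # node u2: first-stage winners
    y, z = compare_exchange(b1, d1)          # node u3: first-stage losers
    return w, x, y, z


if __name__ == "__main__":
    ranked = rank_flits(0b00001011, 0b00001000, 0b00001101, 0b00000111)
    print([format(f, "08b") for f in ranked])
    # prints ['00001101', '00001011', '00001000', '00000111']
```

With only four compare-exchange nodes the network always places the highest-priority flit on the top output, but it does not guarantee a full descending order for every input; for example, `rank_flits(4, 3, 2, 1)` returns `(4, 2, 3, 1)` under the assumed wiring.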

Parallel Port Allocator
In earlier designs, flits are allocated ports sequentially, one at a time, which increases the total transmission delay, as shown in Figure 8. To overcome this, we use a parallel port allocator in which ports are assigned in parallel. This is supported by the RTL synthesis presented in Results and Discussion. The allocator follows two steps:
Step 1: Check the status of the destined port of each flit. If the destined port is free and uncontended, the flit is transmitted to it.
Step 2: When the destined port of a flit is busy, the available ports are calculated from the destined ports of all other flits. A simple look-up table then gives the allocated port. The table follows a predefined order of Bypass, North, South, East and West, as shown in Figure 9. The first available port is given to the flit in the lower channel.

Fig.8.Port allocation (a) BLESS (b) Parallel port allocator
Consider an example where flit f0 is to be moved to south, flit f1 to north, flit f2 to east and flit f3 to north. As per the flowchart in Figure 9, the allocator first checks for non-contending ports. Flits f0 and f2 are non-contending and hence are transmitted to the south and east ports respectively. Flits f1 and f3 contend for the north port. As per our design, the first contending flit, f1, moves to the bypass port and f3 moves to the north port. Thus all flits are transmitted, with f1 reaching the north port through the bypass channel.
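The two allocation steps and the example above can be traced with a short behavioural sketch in Python. It is a literal reading of Step 1 and Step 2, not the RTL look-up table: the function name `allocate_ports`, the use of lowercase port names, and the rule that contending flits are served in channel order are assumptions made for illustration. The loops run sequentially in software, whereas the hardware evaluates all channels at once.

```python
from collections import Counter
from typing import Dict, List

# Predefined order taken from the text: Bypass, North, South, East, West.
PORT_ORDER = ["bypass", "north", "south", "east", "west"]


def allocate_ports(requests: List[str]) -> Dict[int, str]:
    """Return {channel index: granted port}; requests[i] is the port flit i wants."""
    demand = Counter(requests)
    grants: Dict[int, str] = {}

    # Step 1: a flit whose destined port is requested by no other flit gets it.
    for channel, port in enumerate(requests):
        if demand[port] == 1:
            grants[channel] = port

    # Step 2: contending flits, in channel order (assumed), take the first
    # port that is still free in the predefined order.
    free = [p for p in PORT_ORDER if p not in grants.values()]
    for channel in range(len(requests)):
        if channel not in grants:
            grants[channel] = free.pop(0)
    return grants


if __name__ == "__main__":
    # f0 -> south, f1 -> north, f2 -> east, f3 -> north
    print(allocate_ports(["south", "north", "east", "north"]))
```

For the example requests this grants south to f0, east to f2, the bypass port to f1 and north to f3, matching the walk-through. The text does not say what happens when more than one port is contested in the same cycle, so in this sketch that case simply falls through the predefined order and a flit may receive a port other than the one it requested.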

Fig.9.Flowchart of parallel port allocator

Ejection and Injection
So far we have discussed the transmission and reception of data along the four ports. As discussed, the local bypass ring calls for a fifth port, the bypass port. To avoid hardware complexity, local ejection and injection are done separately from the crossbar operation. With every cycle, data is ejected, and this continues as long as space exists in the subnetwork, i.e. free ports; if several ports are free during transmission, the predefined order is followed. To make the design more contention-free, ejection is performed before injection.

Deflection using local bypass ring
To reduce the critical path delay, we partition the network in every node into multiple subnetworks bridged by a bypass ring, as shown in Figure 10. The partition is done in such a way that the datapath width remains unchanged. This increases network path diversity and reduces the critical path delay. As shown in Figure 10, each node consists of M subnetworks of the same datapath width, connected through a unidirectional local bypass ring. Here we use two subnetworks for simplicity (M = 2).

Fig.10. Deflection using local bypass ring with two sub networks
A contending flit that would be deflected in a conventional NoC is instead ejected to the subnetwork through the local bypass ring, which avoids multiple hops and improves latency. As shown in Figure 11, flits f0 and f1 contend for the east port (3). In BLESS, f0 is routed to the east port while f1 is routed to the north port, from which it takes 2 more hops to reach the east port (3). In total, the transmission of f0 and f1 takes 3 hops. In our design, f0 is routed to the east port and f1 is routed to the bypass port, which sends f1 to the east port in the next cycle, for a total of 2 cycles of data transmission. This improves the path diversity and latency of the proposed architecture.

Results and Discussion
The proposed design is verified in ModelSim SE 6.3f. Here we analyse flit ranking with bitonic sorting and flit prioritisation with the parallel port allocator. The RTL implementation of the 2x2 bitonic sorting node is shown in Figure 12, and its output is shown in Figure 13. Each flit has 8 bits, of which the last 4 are the priority bits. Two values, "00011010" and "01011111", are given as inputs a and b. The priority bits "1010" and "1111" are compared; output x receives the flit with the higher value (01011111) and output y the flit with the lower value (00011010).
As seen in the router microarchitecture, the partial permutation network for flit prioritisation is built from four 2x2 bitonic nodes connected in crossbar fashion. Flits from the east and west ports are given as inputs a and b to node u0, whose outputs are a1 and b1. Flits from the north and south ports are given as inputs c and d to node u1, whose outputs are c1 and d1. The priority bits are compared and sorted, and the results are passed to u2 and u3, whose outputs are (w, x) and (y, z). Thus the flits received from the four ports are ranked and sorted. The RTL implementation is shown in Figure 14 and its simulation in Figure 15. From Figure 15 we can see that the flits "00001011", "00001000", "00001101" and "00000111" on the east, west, north and south ports respectively are ranked in descending order on the output ports: w (00001101), x (00001011), y (00001000) and z (00000111).
Each router has 4 input and 4 output ports (East, West, North, South) and one bypass port in the subnetwork for deflection containment. Assuming 8-bit data, the last three bits indicate the next hop of the flit. Consider that flit f0 (10000100) is to be moved to south, f1 (01000011) to north, f2 (00100001) to east and f3 (00010011) to north.
According to the DeC design, the allocator first checks for non-contention: f0 can be moved to the south port and f2 to the east port. Flits f1 and f3 then contend for the north port. Here f1 is moved to the bypass port and f3 is moved to the north port. The RTL implementation of the parallel port allocator is shown in Figure 16 and its simulation in Figure 17.
We make the case that bufferless NoCs can be an effective way to simplify the design while reducing latency and improving performance. With a simple bypass ring in every node, which is divided into subnetworks, deflection is reduced and unnecessary hops of flits are removed, improving performance. Moreover, parallel port allocation together with bitonic sorting can reduce delay and improve network bandwidth by allocating ports in parallel instead of sequentially. With reduced time and improved bandwidth, power consumption is reduced, and the removal of buffers reduces area, which aids scalability. We have evaluated our design in ModelSim and found that it performs better than previous works. We have run simulations with a 4x4 mesh topology; in future, the design will be tested with more cores and various topologies.