Special Issue: Advanced VLSI Architecture Design for Emerging Digital Systems
Design of Smart Power-Saving Architecture for Network on Chip
In a network-on-chip (NoC), transferring data through virtual channels avoids data loss and deadlock. Each input or output port of a router contains many virtual channels, and a router includes five I/O ports, so the power consumed by the virtual channels is an important concern. This paper proposes a novel architecture, Smart Power-Saving (SPS), for low power consumption and low area in the virtual channels of an NoC. The SPS architecture adapts to different environmental factors to dynamically save power and optimize area in the NoC. Compared with three related works, the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26% and reduces area by 49.4%, 25.5%, and 14.4%, respectively.
1. Introduction
In recent years, three-dimensional ICs and TSV (Through-Silicon Via) technology have been proposed to solve area issues. The three-dimensional IC of the Intel Ivy Bridge processor and a 16-core multicore architecture can be implemented in 22 nm. Multicore and heterogeneous systems have therefore become popular research topics in SoC (system-on-chip) design. These architectures require high throughput and performance to transfer data within a multicore SoC. The NoC (network-on-chip) has been proposed to meet this requirement, but it introduces new problems such as power consumption and area [2, 3].
The NoC architecture consists of processing elements (PEs), network interfaces (NIs), routers, and a topology, as shown in Figure 1. The PEs transfer information to the NI, and the NI packages the information into flits and passes them to the routers. There are three kinds of routers: the corner router (CR), the edge router (ER), and the router (R), which have three, four, and five I/O ports, respectively; each port includes virtual channels. A router includes a transmission channel, routing computation (RC), a virtual channel arbiter (VA), a switch arbiter (SA), and a crossbar (XBAR). A packet consists of header, body, and tail flits; the header flit carries the PE priority, source address, destination address, and so forth. The RC uses the header flit and a routing algorithm to find the transmission path. The VA uses two-stage arbitration to select the highest-priority packet and then signs up (registers) a transmission channel for it. The SA uses two-stage arbitration to select the body flits that enter the XBAR for transmission. The VA operates when a packet arrives, and the SA operates when a flit arrives. The tail flit is the last flit of a packet; once it has passed, the router unregisters the transmission channel. Router topologies include mesh, star, and fat tree [4, 5].
Yoon et al. analyzed virtual channels (VCs) and found that they can avoid routing and protocol deadlock and improve routing performance when packet traffic is congested. VCs ease the packet-switching problem, but they introduce power and area issues in the NoC.
Nicopoulos et al. proposed the IntelliBuffer architecture, which addresses PV (process variation) to reduce power consumption in layer 1. It differs from the conventional architecture in two fundamental ways. First, the slots use clock-gating to reduce power consumption when they are empty; to avoid losing data in transmission, the clock of one slot in each I/O port is kept running so that it can accept data. Second, the router creates a leakage classification register (LCR) table, and the write and read pointers always access the slots with the lowest power consumption according to the LCR table.
Taassori et al. proposed an adaptive data compression technique that reduces the number of packet bits in layer 3. Fewer bits mean fewer transmissions, which lowers the power consumption of the router. Palma et al. use the T-Bus-Invert technique to reduce the transition activity rate, measured by Hamming distance, and thereby lower power consumption. Jafarzadeh et al. use end-to-end data coding to minimize the switching activity rate and the routing path and thereby lower the power consumption of the NI.
Lee et al. proposed the buffer clock-gating architecture, which uses clock-gating to reduce transmission power consumption when slots are empty or full. Ezz-Eldin et al. proposed an adaptive virtual channel with two parts in layer 1. First, the work uses a hierarchical multiplexing tree for the virtual channels (VCs) to reduce area. Second, it uses clock-gating to reduce power consumption. Rosa et al. proposed dynamic frequency scaling in the PEs of an NoC; it considers the communication and loading rates to control the router frequency and reduce power consumption.
Huaxi et al. proposed a fat tree-based optical NoC; the architecture covers topology, placement, layout, and protocol. That work proposed a low-power, low-cost optical turnaround router to reduce power consumption. Gu et al. proposed the Cygnus router, which optimizes the router algorithms to reduce power consumption. Swaminathan et al. create two FIFOs in the NI and dynamically configure data access across the two FIFOs to improve throughput and power consumption.
In the next section we analyze the power consumption under different VC accesses. In Section 3 we introduce the topology and router packet architecture and add the SPS to the router to save power. In Section 4 we present the SPS with the router design. Section 5 contains experimental results, and Section 6 concludes this paper.
2. Power Issue with Virtual Channels
Multicore architectures and big data communication are becoming more common in next-generation systems. Traditional communication technologies cannot handle the large amount of traffic on multicore and heterogeneous chips. The NoC addresses this problem by using network transmission so that different cores can communicate at the same time. However, while the NoC solves the communication problem, heavy data access increases power consumption.
The router, composed of an arbitration unit and a transmission unit, is illustrated in Figure 2. The arbitration unit selects the highest-priority packet to send to the next router. It includes routing computation (RC), the VC arbiter (VA), and the switch arbiter (SA). The RC calculates routing paths and priorities. The VA contains a number of two-stage arbitrations that select packets and sign up VCs. The first stage selects the local highest-priority packet from the input VCs to the crossbar and signs up VCs. The second stage selects the global highest-priority packet from the crossbar inputs to the output VCs and signs up VCs. The SA also contains a number of two-stage arbitrations, which select flits for transmission. The first stage selects the local highest-priority flits from the input VCs to the crossbar. The second stage selects the global highest-priority flits from the crossbar inputs to the output VCs. The VA executes per packet and the SA executes per flit.
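The local-then-global selection described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' arbiter: the `Request` record, the port names, and the convention that a larger `priority` value wins are all assumptions of the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    in_port: str    # input physical channel holding the packet (hypothetical names)
    vc: int         # virtual channel index within that port
    out_port: str   # output port chosen by routing computation
    priority: int   # larger value = more urgent (assumed convention)

def two_stage_arbitrate(requests):
    """Return {out_port: winning Request} after local then global selection."""
    # Stage 1 (local): keep the highest-priority request of each input port.
    local = {}
    for r in requests:
        best = local.get(r.in_port)
        if best is None or r.priority > best.priority:
            local[r.in_port] = r
    # Stage 2 (global): among local winners, keep the highest per output port.
    grants = {}
    for r in local.values():
        best = grants.get(r.out_port)
        if best is None or r.priority > best.priority:
            grants[r.out_port] = r
    return grants
```

Note that a request which loses in stage 1 never reaches stage 2, so an output port can be granted to a lower-priority packet from another input port.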
The transmission unit of the router is illustrated in Figure 3. This unit includes the VCs, which buffer large packets on their way from the input physical channel to the output physical channel. The power consumption of the VCs is calculated as shown in (1). Its terms are the number of packets or flits accessed in the VCs, the access frequency of the VCs, the capacitance of the VCs, and the voltage of the VCs. Nicopoulos et al. and Katabami et al. proposed clock-gating to reduce this power consumption.
In this paper, we propose dynamic control of each virtual channel clock under different transmission environments. Whether or not a packet transfer is complete, the SPS can effectively reduce power consumption without affecting transmission performance.
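Equation (1) is not reproduced in this text. Assuming it takes the usual switching-power form, and writing $N$, $f$, $C$, and $V$ for the number of accesses, the access frequency, the capacitance, and the voltage (these symbol names are assumptions of this sketch, not the authors' notation), it could be written as:

```latex
P_{VC} = N \cdot f \cdot C \cdot V^{2}
```

Under this form, stopping the clock of an idle VC removes its contribution to the access frequency term, which is the quantity that clock-gating and the SPS act on.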
3. Router and Topology with SPS
3.1. Relation of Topology and Router
The relation between topology and router is illustrated in Figure 4. The router uses a transmission mode that depends on the topology; for example, the mesh uses XY routing. The XY routing flow chart for a 2 × 2 mesh is illustrated in Figure 5. When the MSB of the destination router address equals the MSB of the current router address and the LSBs of the two addresses are also equal, the flits have arrived. Otherwise, the XY routing algorithm proceeds in two stages. In stage one, the flits are sent along the X-axis routers until the X coordinate of the current router equals that of the destination. In stage two, the flits are sent to the destination along the Y-axis routers. The virtual channel is initialized when a packet is transmitted between two routers; the procedure is shown in Algorithm 1.
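As a hedged illustration of the two-stage flow, the sketch below implements dimension-ordered routing on a mesh using explicit (x, y) coordinates. The port names, the coordinate orientation, and the helper `xy_path` are assumptions made for the example; the paper's own Algorithm 1 and its bit-level address encoding are not reproduced here.

```python
def xy_next_port(cur, dst):
    """Return the output port for one hop of X-then-Y routing on a mesh.

    cur and dst are (x, y) router coordinates. The port names and the
    convention that y grows toward "N" are assumptions of this sketch.
    """
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                 # stage one: correct the X coordinate first
        return "E" if dx > cx else "W"
    if cy != dy:                 # stage two: then correct the Y coordinate
        return "N" if dy > cy else "S"
    return "LOCAL"               # both coordinates match: the flit has arrived

def xy_path(src, dst):
    """List the routers visited from src to dst, inclusive."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    path, cur = [src], src
    while (port := xy_next_port(cur, dst)) != "LOCAL":
        ox, oy = step[port]
        cur = (cur[0] + ox, cur[1] + oy)
        path.append(cur)
    return path
```

On a 2 × 2 mesh, `xy_path((0, 0), (1, 1))` visits `(0, 0)`, `(1, 0)`, and `(1, 1)`: the X coordinate is resolved before any move along Y.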
The control method of the arbiter architecture is designed according to the transmission mode. The VC arbiter and the switch bar follow the topology and priority on which the routing computation unit is designed. Algorithm 2 constructs the two-stage VC arbitration, which runs per packet. Stage 1 decides which high-priority packet from the local VCs (input VCs) enters the crossbar, at lines 3 to 4 and lines 8 to 10. Stage 2 decides the most important packet to transmit among the global VCs (output VCs), at lines 5 to 6 and lines 11 to 13.
Algorithm 3 constructs the two-stage arbitration that runs per flit. Stage 1 decides which high-priority flit from the local VCs (input VCs) enters the crossbar, at lines 3 to 4 and lines 8 to 10. Stage 2 decides the most important flit to transmit among the global VCs (output VCs), at lines 5 to 6 and lines 11 to 13.
In the transmission channel architecture, the router has four directions that connect to other routers and one local physical channel that connects to the PE. Every physical channel except the local one has VCs. The switch bar transmits the most important packet to the output channel. The SPS controls the power consumption of each VC when the channel status changes. The SPS architecture is introduced in the next section.
3.2. Topology Architecture
The topology defines the packet transmission paths formed by routers and links. The router connection topologies are shown in Figure 6; they include star, mesh, ring, and tree topologies. In the arbitration unit, the RC algorithm depends on the topology architecture, and the VA and SA algorithms depend on packet priority. In this paper, the topology is the 2 × 2 mesh, the RC algorithm is XY routing, and the VA and SA algorithms are lottery arbitration.
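For readers unfamiliar with lottery arbitration, the following minimal sketch shows the general idea: each requester holds a number of tickets and the grant is drawn at random with probability proportional to the ticket count. How tickets are derived from packet priority is not stated in this text, so the `tickets` mapping is a hypothetical input.

```python
import random

def lottery_pick(tickets, rng=random):
    """Pick one requester with probability proportional to its tickets.

    tickets maps a requester name to a non-negative ticket count; how the
    counts are derived from packet priority is an assumption left open here.
    """
    names = [n for n, t in tickets.items() if t > 0]
    if not names:
        return None                      # nobody is requesting
    weights = [tickets[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draws repeatable, which is convenient when the same sequence has to be replayed against a reference model.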
The connection between router and PE is shown in Figure 7; the PE and the router exchange information through the network interface (NI), which handles the information passed between them. The NI uses a two-level design, as shown in Figure 8. It contains three modules to meet the specifications of the different layers. The shell module needs to meet the IP specification. The kernel module needs to meet the NoC topology specification.
3.3. Flits with Router Architecture
The flit specification of the router is shown in Figure 9. The 2-bit flit type 00 represents a single-flit packet; this flit type does not sign up VCs. The type 01 represents the header flit, which includes routing information and the address; this flit type always signs up a channel. The type 10 represents the body flit, which carries the transmitted information; its payload records a segment of the packet. The type 11 represents the tail flit, which carries the last of the transmitted information; this flit not only records the last segment of the packet but also clears the VCs.
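The four type codes can be captured in a small encoder and decoder, sketched below. Only the 2-bit type values come from the text; placing the type field in the most significant bits and using a 16-bit payload (`PAYLOAD_BITS`) are assumptions of this example, as are the function names.

```python
from enum import IntEnum

class FlitType(IntEnum):
    SINGLE = 0b00   # one-flit packet, does not sign up a VC
    HEADER = 0b01   # carries routing information, signs up a VC
    BODY   = 0b10   # carries a segment of the packet
    TAIL   = 0b11   # last segment, also clears the VC

PAYLOAD_BITS = 16   # assumed width; the text only fixes the 2-bit type field

def pack_flit(ftype, payload):
    """Place the 2-bit type above the payload bits (assumed layout)."""
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit")
    return (int(ftype) << PAYLOAD_BITS) | payload

def unpack_flit(word):
    """Split a flit word back into (type, payload)."""
    return FlitType(word >> PAYLOAD_BITS), word & ((1 << PAYLOAD_BITS) - 1)

def holds_vc(ftype):
    """True while the VC stays registered after this flit is handled."""
    return ftype in (FlitType.HEADER, FlitType.BODY)
```

For example, `pack_flit(FlitType.TAIL, 0xABCD)` yields `0x3ABCD`, and `holds_vc(FlitType.TAIL)` is false because the tail flit releases the channel.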
4. SPS with Router Design
A VC contains many slots to access data, which leads to extra power consumption. In this paper, we propose the SPS architecture to reduce this power consumption.
4.1. Router with SPS Architecture
The proposed router with the SPS architecture is illustrated in Figure 10. The physical channel (PC) connects to other routers and accesses information. The input VCs (IVC) store information from the PCs; they are usually implemented as FIFOs or other sequential logic. The arbiter, which includes RC, VA, and SA, decides flit priority and controls the input switch logic (ISL) and output switch logic (OSL) to transmit flits. The crossbar (CR) connects the IVC to the OVC, with the switch signal coming from the arbiter. The output VCs (OVC) store information from the CR. The proposed SPS uses the transmission channel status to dynamically control the IVC and OVC clocks so that they run only when operation requires it.
The VCs with the SPS architecture are illustrated in Figure 11. The SPS controls the system clock fed to the input and output VCs to reduce power consumption. In this architecture, each VC contains multiple slots, numbered from 0, to access data.
4.2. Design of SPS Control Timing
The VC access timing diagrams of the SPS architecture are illustrated in Figure 12. Clock Block A indicates that the VCs have no information to transmit. Clock Block B indicates that the VCs are writing information. Clock Block C indicates that the data in the VCs are waiting to be transmitted. Our analysis of the architecture without clock-gating is shown in (2). Its terms are the power consumed while the slots access information, the power consumed while the slot contents are full, the power consumed while the slot contents are empty, and the remaining power consumption outside these three terms. The architecture without clock-gating does not control the clock of the sequential logic in the VCs, so the logic consumes power throughout a high-traffic transmission structure.
Figure 12(a): No clock-gating.
The clock-gating architecture consumes power in Clock Block B and Clock Block C. Our analysis of the clock-gating architecture is shown in (3), which introduces the power consumption of empty gating. The clock-gating architecture does not control the clock when the VCs are in the full stage, because the VCs always store flits that wait for transmission.
The SPS consumes power only in Clock Block B. Our analysis of the SPS architecture is shown in (4), which introduces the power consumption of the SPS. It saves the power that the VCs would otherwise consume in the empty and full stages.
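Equations (2) to (4) are not reproduced in this text. As a hedged reconstruction from the term descriptions above, and assuming each total is a plain sum of its terms, they could take the following form. All symbol names here ($P_{acc}$, $P_{full}$, $P_{empty}$, $P_{other}$, $P_{eg}$, $P_{SPS}$) are assumptions of this sketch rather than the authors' notation.

```latex
\begin{align}
P_{\text{no gating}}    &= P_{acc} + P_{full} + P_{empty} + P_{other} \tag{2}\\
P_{\text{clock-gating}} &= P_{acc} + P_{full} + P_{eg} + P_{other}    \tag{3}\\
P_{\text{SPS}}          &= P_{acc} + P_{SPS} + P_{other}              \tag{4}
\end{align}
```

Read this way, clock-gating replaces only the empty-slot term with the smaller gating overhead $P_{eg}$, while the SPS replaces both the full and the empty terms with the single overhead $P_{SPS}$.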
4.3. Design of SPS
The proposed SPS uses the VC status to dynamically control the clock of each VC. The CFSM of the SPS with VCs is illustrated in Figure 13; the architecture contains two CFSMs.
The first CFSM includes the initial, empty, full, and waiting statuses. Initial status: when the VC is reset, the structure enters the initial status and stays there until a flit arrives. Empty status: the structure enters this status when the user resets the VCs or when the flits have been transported to the next storage unit. Full status: the VC has no free slot left for storing flits. Waiting status: the structure enters this status when the user resets the VCs or when storing a flit is complete.
The VC algorithm used with the SPS is illustrated in Algorithm 4. At line 3, the VCs initialize the VC count and flags. At lines 4 to 9, the VCs access flits and change the VC count when a channel packet or an arbiter signal arrives. At lines 10 to 17, whenever the VC count changes, the VC flags are updated accordingly.
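The count-and-flag bookkeeping can be modeled roughly as follows. This is a sketch under assumptions, not a transcription of Algorithm 4: in particular, treating a partially filled VC as "waiting" and the method names `write` and `read` are choices made for the example.

```python
class VirtualChannel:
    """Slot counter that tracks the four statuses of the first CFSM."""

    def __init__(self, depth):
        self.depth = depth        # number of slots in this VC
        self.reset()

    def reset(self):
        self.count = 0            # number of flits currently stored
        self.status = "initial"

    def _update_status(self):
        # The flags follow the count: full, empty, or data waiting to leave.
        if self.count == self.depth:
            self.status = "full"
        elif self.count == 0:
            self.status = "empty"
        else:
            self.status = "waiting"   # assumed meaning: partially filled

    def write(self):
        """Store one flit arriving from the channel; False if no room."""
        if self.count == self.depth:
            return False
        self.count += 1
        self._update_status()
        return True

    def read(self):
        """Forward one flit when the arbiter grants it; False if none."""
        if self.count == 0:
            return False
        self.count -= 1
        self._update_status()
        return True
```

A VC with two slots moves from `initial` to `waiting` to `full` over two writes, then back through `waiting` to `empty` over two reads.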
The second CFSM includes the initial, clock-gating, and wake-up statuses. Initial status: this is the same as the initial state of the first CFSM. Clock-gating: when a VC becomes full or empty, the SPS disables the clock of that VC and enters this status. Wake-up: when a flit needs to be stored, one VC is woken up.
The SPS algorithm is illustrated in Algorithm 5. At line 3, the SPS initializes the VC clocks and reads the status of the VCs from the VC flags. At lines 4, 5, and 7, the slot priority from the LCR and the clock of each VC are initialized. At lines 8 to 17, the SPS controls the VC clocks to reduce VC power consumption whenever the VCs are accessed and the flags change.
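The gating and wake-up rules can be expressed as a small combinational function, sketched below. It is not Algorithm 5 itself: the status strings, the `store_request` input, and the policy of waking the lowest-indexed VC that still has room (instead of consulting an LCR-based priority) are assumptions of this example.

```python
GATED_STATUSES = {"empty", "full"}   # statuses in which the VC clock is disabled

def sps_clock_enables(statuses, store_request):
    """Return one clock-enable flag per VC.

    statuses: per-VC status string ("initial", "empty", "waiting", "full").
    store_request: True when a flit needs to be stored in this port.
    Waking the lowest-indexed VC with room is an assumed policy; the text
    orders candidates by an LCR-based priority instead.
    """
    # Clock-gating: a VC that is empty or full has its clock disabled.
    enables = [s not in GATED_STATUSES for s in statuses]
    # Wake-up: one VC that can still accept a flit gets its clock back.
    if store_request:
        for i, s in enumerate(statuses):
            if s != "full":
                enables[i] = True
                break
    return enables
```

With statuses `["full", "empty", "waiting", "empty"]` and no store request, only the third VC stays clocked; with a store request, the second VC is woken as well.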
5. Experimental Results
In this section, we propose an autotesting architecture for the router with SPS. The architecture includes four modules. The first module is the test-vector generator (TVG), whose FSM is illustrated in Figure 14. The idle status waits for a request to start testing; when the request arrives, the TVG changes from idle to generator, and when the request is cancelled, it changes from generator back to idle. The generator status produces the test-vector and the compare-vector, as illustrated in Figure 15. At control step 1, we use a programming language to generate lottery arbitration in the test-vector. At control step 2, we use HDL to design the conventional router, which takes its input pattern from the test-vector and generates the compare-vector. At control step 3, when the compare-vector and test-vector functions are complete, the status changes from generator to vector output (VO). The VO status transforms the test-vector and compare-vector into Xilinx memory IP files, so that the memory can output the data for testing and comparison in only one clock.
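The TVG state transitions described above can be summarized as a next-state function. This is a sketch only: the state names follow the text, but the input names and the return from vector output to idle are assumptions, since Figure 14 is not reproduced here.

```python
def tvg_next_state(state, start, vectors_done):
    """Next state of the test-vector generator FSM.

    state: "idle", "generator", or "vo" (vector output).
    start: True while the request to start testing is asserted.
    vectors_done: True once the test-vector and compare-vector are complete.
    """
    if state == "idle":
        return "generator" if start else "idle"
    if state == "generator":
        if not start:
            return "idle"            # the request was cancelled
        return "vo" if vectors_done else "generator"
    if state == "vo":
        return "idle"                # assumed: go back to idle after output
    raise ValueError(f"unknown state: {state}")
```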
The second module is the vector database (VD), whose control flow graph is illustrated in Figure 16; this module writes the vectors from the VO status into memory. The database holds two kinds of vectors for testing and analyzing the proposed circuit. The lottery database provides the test packets for the router with SPS. The compare database provides the reference for analyzing the router with SPS.
The third module is the router with SPS, which is exercised with the test-vectors supplied by the VD. The testing algorithm is illustrated in Algorithm 6. When the start signal from the I/O is set to one, the module starts testing and passes this signal to the VD, at lines 1 to 2. Once testing has started, the input signal is read from the VD, as shown at lines 3 and 4 of Algorithm 6. Reading a test-vector from the VD into the router with SPS takes one clock. The router with SPS computes on the VD test-vector at line 6. When the computation for this pattern is finished, the next pattern is read from the VD at line 6. When the computation for all test patterns is finished or the start signal is cancelled, test-start is set and testing stops, at lines 7 and 8.
The final module, the verification module, is illustrated in Figure 17; it verifies the function of the design. Function verification compares the compare-vector from the VD with the implementation results from the router with SPS. If a pattern is in error, the verification result returns an error signal.
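The comparison performed by the verification module amounts to an element-wise check, which might look like the following in software. The function name and the choice to report mismatching indices are assumptions for illustration; the hardware module returns an error signal.

```python
def verify(compare_vector, results):
    """Return the indices where the router output differs from the reference.

    compare_vector: expected outputs (from the conventional router model).
    results: outputs produced by the router under test.
    An empty list means every pattern matched.
    """
    if len(compare_vector) != len(results):
        raise ValueError("vector lengths differ")
    return [i for i, (expected, got) in enumerate(zip(compare_vector, results))
            if expected != got]
```

An error signal in the sense of the text corresponds to a non-empty return value.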
The hardware experimental environment uses a Xilinx FPGA, xc5vlx50t-1ff1136, to verify the SPS architecture. The software experimental environment uses Xilinx ISE 12.3, and the analysis tools are Modelsim 6.6, Xilinx Chipscope ILA, and Xpower 12.3, which are supported by Xilinx. The test environment uses a 2 × 2 mesh and XY routing; each PC has 4 VCs to access flits. The power consumption distribution is illustrated in Figure 18; the number of test packets ranges from 100 to 10000. The packet format is the flit, and the packet length is 18 bits.
As shown in Table 1, compared with the related works IntelliBuffer, adaptive data compression, and buffer clock-gating, the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26%, respectively, and reduces area by 49.4%, 25.5%, and 14.4%, respectively.
6. Conclusions
The Smart Power-Saving (SPS) architecture for network-on-chip was presented. A clock control circuit and the SPS algorithm were demonstrated to reduce the power consumption of the NoC architecture. According to the experimental results, the proposed SPS architecture reduces power consumption in the NoC architecture more effectively than IntelliBuffer, adaptive data compression, and buffer clock-gating.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for partially supporting this research.
References
[1] D. James, “Intel Ivy Bridge unveiled—the first commercial tri-gate, high-k, metal-gate CPU,” in Proceedings of the Custom Integrated Circuits Conference (CICC '12), pp. 9–12, September 2012.
[2] C. Nicopoulos, S. Srinivasan, A. Yanamandra et al., “On the effects of process variation in network-on-chip architectures,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 3, pp. 240–254, 2010.
[3] M. Taassori, M. Taassori, and M. Mossavi, “Adaptive data compression in NoC architectures for power optimization,” International Review on Computers and Software, vol. 5, no. 5, pp. 540–547, 2010.
[4] D. Bertozzi and L. Benini, “Xpipes: a network-on-chip architecture for gigascale systems-on-chip,” IEEE Circuits and Systems Magazine, vol. 4, no. 2, pp. 18–31, 2004.
[5] S. J. Lee, K. Lee, and H. J. Yoo, “Analysis and implementation of practical, cost-effective networks on chips,” IEEE Design and Test of Computers, vol. 22, no. 5, pp. 422–433, 2005.
[6] Y. J. Yoon, N. Concer, M. Petracca, and L. Carloni, “Virtual channels versus multiple physical networks: a comparative analysis,” in Proceedings of the 47th ACM/IEEE Design Automation Conference (DAC '10), pp. 162–165, June 2010.
[7] L. Benini and G. de Micheli, “Networks on chips: a new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.
[8] J. C. S. Palma, L. S. Indrusiak, F. G. Moraes, R. Reis, and M. Glesner, “Reducing the power consumption in networks-on-chip through data coding schemes,” in Proceedings of the 14th IEEE International Conference on Electronics, Circuits and Systems (ICECS '07), pp. 1007–1010, December 2007.
[9] N. Jafarzadeh, M. Palesi, A. Khademzadeh, and A. Afzali-Kusha, “Data encoding techniques for reducing energy consumption in network-on-chip,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 675–685, 2014.
[10] T. Y. Lee, C. H. Huang, and X. S. Lin, “Design of buffer clock-gating architecture for network-on-chip,” in Proceedings of the 22th VLSI Design/CAD Symposium, pp. 2–5, August 2011.
[11] R. Ezz-Eldin, M. A. El-Moursy, and A. M. Refaat, “Low leakage power NoC switch using AVC,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '12), pp. 2549–2552, Seoul, Republic of Korea, May 2012.
[12] T. R. da Rosa, V. Larrea, N. Calazans, and F. G. Moraes, “Power consumption reduction in MPSoCs through DFS,” in Proceedings of the 25th Symposium on Integrated Circuits and Systems Design (SBCCI '12), pp. 1–6, 2012.
[13] G. Huaxi, X. Jiang, and Z. Wei, “A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '09), pp. 3–8, April 2009.
[14] H. Gu, K. H. Mo, J. Xu, and W. Zhang, “A low-power low-cost optical router for optical networks-on-chip in multiprocessor systems-on-chip,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI '09), pp. 19–24, Tampa, Fla, USA, May 2009.
[15] K. Swaminathan, G. Lakshminarayanan, F. Lang, M. Fahmi, and S. B. Ko, “Design of a low power network interface for network on chip,” in Proceedings of the 26th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE '13), pp. 1–4, May 2013.
[16] R. Mullins, A. West, and S. Moore, “Low-latency virtual-channel routers for on-chip networks,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), pp. 188–197, 2004.
[17] H. Katabami, H. Saito, and T. Yoneda, “Design of a GALS-NoC using soft-cores on FPGAs,” in Proceeding of the Embedded Multicore Socs (MCSoC '13), pp. 26–28, September 2013.
[18] J. Wang, Y. Li, Q. Peng, and T. Tan, “A dynamic priority arbiter for network-on-chip,” in Proceedings of the IEEE International Symposium on Industrial Embedded Systems (SIES '09), pp. 253–256, July 2009.
[19] S. Saponara, L. Fanucci, and M. Coppola, “Design and coverage-driven verification of a novel network-interface IP macrocell for network-on-chip interconnects,” Journal of Microprocessors and Microsystems, vol. 35, no. 6, pp. 579–592, 2011.