Conference Paper

Network-on-chip: Current issues and challenges

Authors:
Manoj Singh Gaur
Professor at Computer
Engineering,
Malaviya National
Institute of Technology
MNIT, Jaipur, India
Email:
gaurms@gmail.com
Manoj Kumar
Malaviya National
Institute of Technology
MNIT, Jaipur, India
Vijay Laxmi
Associate Professor at
Computer Engineering,
Malaviya National
Institute of Technology
MNIT, Jaipur, India
Email:
vlaxmi@mnit.ac.in
Niyati Gupta
Malaviya National
Institute of Technology
MNIT, Jaipur, India
Mark Zwolinski
Professor in
the Electronic Systems
Design Group University
of Southampton,
Highfield, Southampton
SO17 1BJ
Email:
mz@ecs.soton.ac.uk
Ashish
Malaviya National
Institute of Technology
MNIT, Jaipur, India
ABSTRACT
Due to shrinking transistor sizes, the density of ICs roughly
doubles every two years, as predicted by Moore’s law. These advances in
VLSI integration density towards the nanoscale era have brought a paradigm
shift from computation-centric designs to communication-centric designs
incorporating a very large number of simple cores. Several traditional
interconnect schemes, such as point-to-point links, buses and crossbars, are
available to interconnect a small number of cores. Although point-to-point
schemes achieve fast and efficient communication, their wire density is
a barrier to adopting them in many-core architectures. Buses are
simpler in design, but they suffer from scalability and arbitration issues
along with a bandwidth bottleneck as the number of cores increases. Similarly,
the area and power requirements of a crossbar limit its applicability. Hence,
many-core architectures such as Chip Multiprocessors (CMPs) and
Multiprocessor Systems-on-Chip (MPSoCs) need an efficient
communication infrastructure, as traditional solutions fail to handle the
communication challenges.
Network-on-Chip (NoC), a scalable and modular design approach, has
been proposed as a promising alternative to traditional bus-based
architectures for inter-core communication. NoC has also been adopted in
industry (Tilera’s TILE-Gx72 and TILE64™ [1] processors and Intel’s terascale
processor [2]). NoCs are an attractive alternative to traditional shared
buses or dedicated wires for several reasons. First, NoCs are a
scalable solution for on-chip communication, because they provide
scalable bandwidth at low power and area overheads. Second, NoCs use wiring
efficiently by multiplexing many traffic flows on the
same channels, providing quality of service and higher bandwidth. Finally,
on-chip networks with regular topologies have short interconnects that can
be optimized and reused as regular iterative blocks, which makes
verification easier. For on-chip networks, the two-dimensional (2D) mesh
is the preferred topology due to its regularity, scalability, and
good fit to the physical layout of an actual chip. This tutorial focuses on NoC
routing algorithms, their implementations and issues. The main network
parameters affected by the routing algorithm include fault
tolerance, quality of service, communication performance (throughput and
latency) and power consumption. The main objectives of this
tutorial are:
Introduction to NoC [3]: In this part, we briefly discuss the various
design parameters of an NoC, such as topology, switching, flow control and
routing, and compare NoCs with existing interconnect mechanisms.
Routing Taxonomy [4]: In this part, we present a classification of
routing algorithms.
Deadlock and Livelock Freedom in Routing: One current issue in
NoC routing is the use of an acyclic channel dependency graph (ACDG)
for deadlock freedom, which prohibits certain routing turns. Thus, the ACDG
reduces the degree of adaptiveness. In this part, we discuss various
turn models [5] and how these turn models can be improved to
increase adaptivity while maintaining deadlock freedom.
Routing Implementations for NoC: Denser integration
makes chips more prone to failures (deep sub-micron effects,
manufacturing defects, etc.). These failures may disrupt
the regularity of 2D meshes, producing irregular topologies
derived from regular 2D meshes. Under these conditions, solutions designed
for regular 2D meshes may no longer work. In
this part, we discuss state-of-the-art routing implementation
techniques [6]–[8] for irregular 2D meshes under different failures.
Learning Methods to Handle Congestion in Routing: Reinforcement
Learning (RL) is a machine learning paradigm that has been widely
applied in many areas. Q-Learning has been used in NoCs to learn
the network traffic and make routing decisions accordingly. At
each node, a table stores values that represent the
congestion level of each link, and these values are updated after every
packet transfer. Although Q-Learning has improved network
performance, many challenges remain, which we discuss
in this part.
A brief hands-on session with a tool chain for NoC simulation will also be
provided towards the end.
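The per-node table described in the last tutorial item can be illustrated with a minimal sketch in the style of classic Q-routing. It is not taken from the tutorial or any cited work: the class name `QRouter`, the learning rate `alpha`, the port names and the delay values are all assumptions chosen for illustration. Each table entry estimates the latency to a destination through one output port, and it is nudged towards the newly observed value after every packet transfer.

```python
from collections import defaultdict


class QRouter:
    """Hypothetical per-node Q-table: q[dest][port] = estimated latency to dest."""

    def __init__(self, ports, alpha=0.5, initial=0.0):
        self.ports = list(ports)   # output ports of this router (assumed names)
        self.alpha = alpha         # learning rate (assumed value)
        # One row per destination, created lazily with an optimistic estimate.
        self.q = defaultdict(lambda: {p: initial for p in self.ports})

    def choose(self, dest):
        """Pick the port with the lowest estimated latency (ties: first port)."""
        row = self.q[dest]
        return min(self.ports, key=lambda p: row[p])

    def best_estimate(self, dest):
        """Value a neighbour would receive as feedback from this node."""
        return min(self.q[dest].values())

    def update(self, dest, port, local_delay, neighbour_estimate):
        """After one packet transfer, move the estimate towards the observation.

        local_delay: queueing plus link delay observed at this hop.
        neighbour_estimate: the downstream router's best_estimate(dest).
        """
        old = self.q[dest][port]
        target = local_delay + neighbour_estimate
        self.q[dest][port] = old + self.alpha * (target - old)
        return self.q[dest][port]


router = QRouter(["east", "north"], alpha=0.5)
first = router.choose("n15")                       # both estimates equal
after_east = router.update("n15", "east", 4.0, 6.0)
second = router.choose("n15")                      # east now looks congested
```

In this sketch a congested port accumulates a higher estimate and is avoided on later decisions. The choice of `alpha` and of the initial estimate trades off how quickly the table reacts against how much it oscillates.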
REFERENCES
[1] S. Bell et al., “TILE64 processor: A 64-core SoC with mesh interconnect,”
in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of
Technical Papers. IEEE International, Feb 2008, pp. 88–598.
[2] S. Vangal et al., “An 80-tile sub-100-W teraFLOPS processor in 65-nm
CMOS,” Solid-State Circuits, IEEE Journal of, vol. 43, no. 1, pp.
29–41, Jan 2008.
[3] M. Palesi and M. Daneshtalab, Routing Algorithms in Networks-on-Chip.
Springer Publishing Company, Incorporated, 2013.
[4] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks - An
Engineering Approach. Morgan Kaufmann, 2003.
[5] M. Kumar, V. Laxmi, M. Gaur, M. Daneshtalab, and M. Zwolinski, “A
novel non-minimal turn model for highly adaptive routing in 2D NoCs,” in
Very Large Scale Integration (VLSI-SoC), 2014 22nd International
Conference on, Oct 2014, pp. 1–6.
[6] R. Bishnoi, V. Laxmi, M. Gaur, R. Bin Ramlee, and M. Zwolinski, “CERI:
Cost-effective routing implementation technique for network-on-chip,” in
VLSI Design (VLSID), 2015 28th International Conference on, Jan 2015,
pp. 59–64.
[7] J. Flich and J. Duato, “Logic-based distributed routing for NoCs,”
Computer Architecture Letters, vol. 7, no. 1, pp. 13–16, 2008.
[8] S. Rodrigo et al., “Cost-efficient on-chip routing implementations for
CMP and MPSoC systems,” Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, vol. 30, no. 4, pp. 534–
547, 2011.
... In wormhole-switched NoCs, contention between packets causes head-of-line (HoL) blocking, leading to performance deterioration or communication failure; thus, deadlock freedom is a vital design objective. A deadlock in unicast is mainly caused by a cyclic dependency between packets and is easily avoided by restricting the turn models at the routing stage [15]. However, in multicast routing, different types of deadlocks owing to the branching of packets must also be entirely handled. ...
Article
Network-on-chip (NoC) has become the mainstream fabric architecture for chip multiprocessor (CMP) design. Owing to the market-driven advancement of modern applications in CMP, multicast traffic is aggressively increasing to support barrier synchronization, multithreading, and cache coherence protocols. Although multicast by branching of packets in the NoC router facilitates shortest path routing, additional branching-induced deadlocks must be circumvented. Existing NoC studies on deadlock-free minimal path routing in multicast traffic have typically deployed additional virtual channels or large buffers to hold entire packets, thereby significantly increasing the router area. Focusing on the area-efficient solution while sustaining the performance, we propose a novel multicast router using buffer sharing (MRBS) to guarantee deadlock-free multicast routing by exploiting the spatial diversity of the input buffer. MRBS ensures minimal path routing without requiring additional virtual channels or large buffers to hold entire packets. Extensive experiments were conducted by varying the buffer, packet, and network sizes, as well as the number of destinations per packet, under random multicast traffic with diverse injection rates. Simulation results show that MRBS achieves a 39.3 % improvement in the area-delay product on average for various network sizes compared to the conventional tree-based router.
Article
Universal interconnection networks are a prime performance bottleneck for high-performance SoCs (Systems-on-Chip). Since shrinking the size of ICs (Integrated Circuits) is the main aim, NoC (Network-on-Chip), being a modular and scalable design approach, is a promising substitute for outdated bus-based architectures. NoC combined with 3D routers and a label-switching technique can guarantee low power consumption and QoS along with lower latency. In the proposed work, 3D NoCs are shown to be more advantageous, achieving a 39.9% reduction in area, a 1.7% reduction in power consumption, and an 11.3% reduction in memory usage.
Chapter
The evolution of wireless communication systems over the last decades has led to a growing demand for more advanced high-speed communication systems. In this paper, we propose a hardware workflow developed for implementing the Long Term Evolution (LTE) communication system. This work studies the multiple-input, multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) LTE system. The main focus of this work is on the implementation of the OFDM modulation/demodulation functions, as they are the main contributors to processing time and latency in high-speed communication systems. To achieve this goal, a multicore low-latency OFDM LTE system is proposed. The multicore RTL code is generated using the ProNoC tool. The main contribution of this system is the achieved speed-up in OFDM LTE computation using parallel processing techniques on an NoC-based multicore system. The speed-up comparison for systems having different numbers of cores computing the IFFT task is reported in this paper. The proposed multicore system is also compared with a single-core system as a reference design. Systems having different LTE OFDM configurations are synthesized, implemented and verified using an Altera Stratix V GX FPGA. The application execution time and FPGA resource utilization are used as comparison metrics. For the proposed multicore LTE OFDM systems having 2 and 16 processing tiles computing IFFT tasks on different LTE channel bandwidths, the execution time is reduced by 24% and 76%, respectively, compared to a conventional LTE OFDM system running on a single-core system.
Article
The growing demand for wireless devices capable of performing complex communication processes has imposed an urgent need for high-speed communication systems and advanced network processors. This paper proposes a hardware workflow developed for the Long-Term Evolution (LTE) communication system. It studies the multiple-input, multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) LTE system. Specifically, this work focuses on the implementation of the OFDM block that dominates the execution time in high-speed communication systems. To achieve this goal, we have proposed an NoC-based low-latency OFDM LTE multicore system that leverages Inverse Fast Fourier Transform (IFFT) parallel computation on a variable number of processing cores. The proposed multicore system is implemented on an FPGA platform using the ProNoC tool, an automated rapid prototyping platform. Our obtained results show that LTE OFDM execution time is drastically reduced by increasing the number of processing cores. Nevertheless, the NoC’s parameters, such as routing algorithm and topology, have a negligible influence on the overall execution time. The implementation results show up to 24% and 76% execution time reduction for a system having 2 and 16 processing cores compared to conventional LTE OFDM implemented in a single-core, respectively. We have found that a 4×4 Mesh NoC with XY deterministic routing connected to 16 processing tiles computing IFFT task is the most efficient configuration for computing LTE OFDM. This configuration is 4.12 times faster than a conventional system running on a single-core processor.
Article
Network on chip (NoC) is regarded as an effective solution for multicore systems, replacing traditional bus architectures. In this paper, the operation of a system on chip that applies the network-on-chip concept is illustrated in full. The router architecture using a packet-switching mechanism, the network interfaces, and the core components are designed and implemented on an FPGA hardware platform. In addition, a graphical user interface is provided to monitor the operating state of the network from outside. The timing and power consumption results of the design are synthesized and analyzed with the Design Compiler tool and a 90 nm CMOS technology.
Article
Network-on-Chip (NoC) is emerging as a promising communication paradigm to overcome the bottleneck of traditional bus-based interconnects for future micro-architectures (MPSoC and CMP). One current issue in NoC routing is the use of an acyclic channel dependency graph (ACDG) for deadlock freedom, which prohibits certain routing turns. Thus, the ACDG reduces the degree of adaptiveness. In this paper, we propose a novel non-minimal turn model which allows cycles in the channel dependency graph provided that the extended channel dependency graph is acyclic. The proposed turn model reduces the number of restrictions on routing turns (especially 90-degree turns) and is hence able to provide additional minimal and non-minimal routes between source and destination. We also propose a non-minimal and congestion-aware adaptive routing algorithm based on the proposed turn model to demonstrate its advantages. The results show that the proposed method improves network performance by distributing the traffic load into the non-congested regions.
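The link between prohibited turns and an acyclic channel dependency graph can be checked mechanically. The sketch below is only an illustration under stated assumptions, not the turn model proposed in the paper: it builds the plain (not extended) channel dependency graph of a small 2D mesh for a given set of allowed turns and tests it for cycles. The function names `build_cdg` and `has_cycle`, the direction labels, and the 3x3 mesh size are hypothetical, and 180-degree turns are assumed to be disallowed.

```python
# Direction labels and their (dx, dy) offsets on the mesh (assumed convention).
DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

STRAIGHT = {(d, d) for d in DIRS}
# XY routing: a packet may turn from X to Y but never from Y back to X.
XY_TURNS = STRAIGHT | {("E", "N"), ("E", "S"), ("W", "N"), ("W", "S")}
# West-first turn model: only the two turns into the west direction are banned.
WEST_FIRST_TURNS = XY_TURNS | {("N", "E"), ("S", "E")}
# All eight 90-degree turns allowed: no restriction at all.
ALL_TURNS = WEST_FIRST_TURNS | {("N", "W"), ("S", "W")}


def build_cdg(width, height, allowed_turns):
    """Channel (x, y, d) is the link leaving node (x, y) in direction d.

    An edge c1 -> c2 means a packet holding c1 may request c2 next.
    """
    def in_mesh(x, y):
        return 0 <= x < width and 0 <= y < height

    cdg = {}
    for x in range(width):
        for y in range(height):
            for d, (dx, dy) in DIRS.items():
                nx, ny = x + dx, y + dy
                if not in_mesh(nx, ny):
                    continue  # no such link at the mesh boundary
                cdg[(x, y, d)] = [
                    (nx, ny, d2)
                    for d2, (dx2, dy2) in DIRS.items()
                    if (d, d2) in allowed_turns and in_mesh(nx + dx2, ny + dy2)
                ]
    return cdg


def has_cycle(graph):
    """Iterative depth-first search with white/grey/black colouring."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}
    for root in graph:
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if colour[nxt] == GREY:
                    return True  # back edge: cyclic dependency
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                colour[node] = BLACK
                stack.pop()
    return False


xy_cyclic = has_cycle(build_cdg(3, 3, XY_TURNS))
west_first_cyclic = has_cycle(build_cdg(3, 3, WEST_FIRST_TURNS))
unrestricted_cyclic = has_cycle(build_cdg(3, 3, ALL_TURNS))
```

Under these assumptions, XY and west-first both yield an acyclic graph, while allowing every 90-degree turn produces a cycle. West-first permits two more turns than XY, which shows how a less restrictive turn model gains adaptivity without losing acyclicity.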
Article
The high-performance computing domain is enriching with the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity; thus, efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof of concept.
Conference Paper
The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.
Article
This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
Book
This book provides a single-source reference to routing algorithms for Networks-on-Chip (NoCs), as well as in-depth discussions of advanced solutions applied to current and next generation, many core NoC-based Systems-on-Chip (SoCs). After a basic introduction to the NoC design paradigm and architectures, routing algorithms for NoC architectures are presented and discussed at all abstraction levels, from the algorithmic level to actual implementation. Coverage emphasizes the role played by the routing algorithm and is organized around key problems affecting current and next generation, many-core SoCs. A selection of routing algorithms is included, specifically designed to address key issues faced by designers in the ultra-deep sub-micron (UDSM) era, including performance improvement, power, energy, and thermal issues, fault tolerance and reliability. © Springer Science+Business Media New York 2014. All rights are reserved.
Conference Paper
To deal with the communication challenges of current and future many-core architectures, Network-on-Chip (NoC) has been proposed as a promising alternative. Regular 2D mesh topology is the most preferred design choice for NoCs. Hardware failures owing to manufacturing, wear-out, aging, etc., however, may disrupt the regularity of a 2D mesh. Sustaining routing under these circumstances becomes a challenge. Though the traditional table-based routing method is flexible enough to handle any irregularity, it is neither a scalable nor a cost-effective solution. Scalable distributed logic-based solutions like uLBDR have limited flexibility and work only in a restricted architectural space despite a complex switch design. To overcome these limitations, this paper presents CERI (Cost-Effective Routing Implementation), an efficient logic-based routing technique capable of handling failure-induced irregularities in 2D meshes. Implementation of the proposed approach does not require tables or a complex switch design. Performance analysis of CERI demonstrates its cost effectiveness, as area and power requirements are reduced by 14% and 16%, respectively, compared with the previously proposed logic-based solution uLBDR.
Article
The design of scalable and reliable interconnection networks for multicore chips (NoCs) introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are usually proposed for NoCs, heterogeneous cores, manufacturing defects, hard failures, and chip virtualization may lead to irregular topologies. In this context, efficient routing becomes a challenge. Although switches can be easily configured to support most routing algorithms and topologies by using routing tables, this solution does not scale in terms of latency and area. We propose a new circuit that removes the need for using routing tables. The new mechanism, referred to as logic-based distributed routing (LBDR), enables the implementation in NoCs of many routing algorithms for most of the practical topologies we might find in the near future in a multicore chip. From an initial topology and routing algorithm, a set of three bits per switch output port is computed. By using a small logic block, LBDR mimics (demonstrated by evaluation) the behavior of routing algorithms implemented with routing tables. This result is achieved both in regular and irregular topologies. Therefore, LBDR removes the need for using routing tables for distributed routing, thus enabling flexible, fast and power-efficient routing in NoCs.