
3A-2 Synthesis of Networks on Chips for 3D Systems on Chips

Srinivasan Murali*, Ciprian Seiculescu*, Luca Benini†, Giovanni De Micheli*
*LSI, EPFL, Lausanne, Switzerland,
{srinivasan.murali, ciprian.seiculescu, giovanni.demicheli}@epfl.ch
†DEIS, University of Bologna, Bologna, Italy, lbenini@deis.unibo.it
ABSTRACT
Three-dimensional stacking of silicon layers is emerging as a promising
solution to handle the design complexity and heterogeneity of Systems on
Chips (SoCs). Networks on Chips (NoCs) are necessary to efficiently handle
the 3D interconnect complexity. Designing power-efficient NoCs for 3D SoCs
that satisfy the application performance requirements while also meeting
the 3D technology constraints is a major challenge. In this work, we
address this problem and present a synthesis approach for designing
power-performance efficient 3D NoCs. We present methods to determine the
best topology, compute paths and perform placement of the NoC components
in each 3D layer. We perform experiments on varied, realistic SoC
benchmarks to validate the methods and also perform a comparative study of
the resulting 3D NoC designs with 3D optimized mesh topologies. The NoCs
designed by our synthesis method result in large interconnect power
reduction (average of 38%) and latency reduction (average of 25%) when
compared to traditional NoC designs.
Keywords
3D, networks on chip, topology synthesis, application-specific
1. INTRODUCTION
The 2D chip fabrication technology is facing many challenges in
utilizing the exponentially growing number of transistors on a chip.
Wire delay and power consumption are increasing dramatically,
and achieving interconnect design closure is becoming a challenge.
Designing the clock-tree network for a large chip is becoming very
challenging, and its power consumption is a significant fraction of
total chip power consumption. Moreover, diverse components that
are digital, analog, MEMS and RF are being integrated on the same
chip, resulting in large complexity for the 2D manufacturing
process [19].
Vertical stacking of multiple silicon layers, referred to as 3D
stacking, is emerging as an attractive solution to continue the pace
of growth of Systems on Chips (SoCs) [19]-[24]. The 3D technology
results in a smaller footprint in each layer and shorter vertical
wires that are implemented using Through Silicon Vias (TSVs)
across the layers. Heterogeneous systems can be built easily, with
each layer supporting a diverse technology [19]. The 3D technology
has been maturing over the years in addressing thermal issues
and achieving high yield [20].
To tackle the on-chip communication problem, a scalable communication
paradigm, Networks on Chips (NoCs), has recently evolved
[1]-[3]. NoCs are composed of switches and links and use circuit
or packet switching technology to transfer data inside a chip. They
provide better structure, modularity and scalability when compared
to traditional interconnect solutions.
NoCs are a necessity for 3D chips: they provide arbitrary scalability
of the interconnects across additional layers, efficiently parallelize
communication in each layer and help control the number
of vertical wires (and hence TSVs) needed for inter-layer communication.
The combined use of 3D integration technologies and
NoCs introduces new opportunities and challenges for designers.
Building power-efficient NoCs for 3D systems that satisfy the
performance requirements of applications while also meeting the
technology constraints is an important problem. To address this issue,
new architectures and design methods are needed. While the issue
of designing NoC architectures for 3D has received some attention
[30]-[33], there has been little work on design methods for 3D
NoCs. The design methods for 2D NoCs do not consider important
3D information, such as the technology constraints on the number
of TSVs that can be supported, constraints on communication between
adjacent layers, the layer assignment of switches
and the placement of switches in 3D.
In this work, we address this important problem and present a
synthesis approach for designing the most power-efficient 3D NoC
that meets application performance and technology constraints. The
approach determines the most power-efficient
topology for the application and finds paths for the traffic
flows that meet the TSV constraints. Our methods account for
power and delay of both switches and links. The assignment of
cores to different 3D layers and the floorplan of the cores in each
layer are taken as inputs to the synthesis process. To accurately
model the link delay and power consumption for the given core
positions, we present a method to determine the optimal positions of
switches in the floorplan of each layer. We then place the switches
on each layer, removing any overlap with the cores. Note
that the assignment of cores to the different layers and the floorplan
of each layer need to consider several performance and technological
constraints, such as thermal issues. Several works
address these issues [21]-[24], and our work is complementary
to them. Here, we only address the issue of designing the NoC
topology and determining the placement of the NoC switches.
Because the core positions in our output floorplan (after placing the
switches) are almost the same as in the input floorplan, we affect
these issues (such as thermal behavior) only minimally.
We perform experiments on varied, realistic SoC benchmarks to
validate the methods. Our results show that the topologies
synthesized by our method result in large interconnect power reduction
(an average of 38%) and latency reduction (25% on average), when
compared to optimized standard NoC topologies.
978-1-4244-2749-9/09/$25.00 ©2009 IEEE
3A-2
242
Figure 1: Proposed 3D NoC design approach
2. RELATED WORK
The use of NoCs to replace bus-based designs has been presented
in [1]-[2]. Several different NoC architectures and design methods
[4]-[5] have been developed over the past few years. A detailed
description of the important design issues and the current state-of-
the-art in NoC architectures and design methods is presented in [3].
Synthesis of bus and NoC architectures has been addressed by
several researchers for 2D systems. Mapping and placement of
cores onto standard NoC topologies has been explored in [6]-[9].
Synthesis of application specific NoC topologies has been addressed
in [10]-[17]. In [17], we presented a NoC synthesis method for 2D
SoCs and performed detailed comparisons with standard topologies
and other mapping tools. In this paper, we use the basic principles
from the 2D method to address the important issues in 3D NoC
design.
Several works have been presented on the 3D manufacturing pro-
cesses and interconnects [19], [20], [33]. A performance and cost
trade-off analysis of 3D integration is presented in [26]. Several
works have explored 3D floorplanning, placement and temperature
issues of cores [21]-[24]. These works do not consider the inter-
connect synthesis problem. Multi-dimensional topologies (such
as k-ary n-cubes, hypercubes) have been extensively explored in
the chip-to-chip interconnection field [18]. However, such works
only consider standard topologies suitable for homogeneous de-
signs. Most SoCs, especially in 3D, are heterogeneous in nature
and require application-specific interconnect architecture to opti-
mize power and performance. Moreover, such works do not ad-
dress the optimization of topologies based on traffic patterns.
Analysis and synthesis of NoCs for 3D technology is a relatively
new topic. Novel NoC switch architectures for 3D are presented in
[30] and [32]. In [31], the authors present the use of NoCs in 3D
multi-processors. In [33], the authors analyze the electrical charac-
teristics of vertical interconnects and show a back-end design flow
to implement 3D NoCs. In [27], the authors present an analyti-
cal model for cost metrics of 3D NoCs and compare them with 2D
NoCs. In [28], design of standard NoC topologies (such as mesh)
for 3D is analyzed. Mapping and placement of cores with thermal
constraints onto NoC topologies is presented in [29]. However,
none of these works address the issue of synthesizing application-
specific 3D NoC topologies.
3. DESIGN APPROACH
The approach used for topology synthesis is presented in Figure
1. In the core specification file, the names of the different cores and
their sizes and positions are obtained as inputs. The assignment of the
cores to the different layers in 3D is also obtained as an input. In
the communication specification file, the communication characteristics
of the application are specified. These include the bandwidth
of communication between different cores, latency constraints and
message type (request/response) of the different traffic flows.
To achieve high yield, the number of TSVs that can be established
across two layers may need to be restricted below a threshold
[25]. In the rest of the paper, we model the maximum TSV constraint
by using a constraint on the number of NoC links that can
cross two adjacent layers, denoted max_ill (for maximum number
of inter-layer links). For a particular link width, the maximum number
of links can be directly determined from the TSV constraints.
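As a rough illustration of that last step, the sketch below derives the inter-layer link limit from a TSV budget. The names `tsv_budget`, `link_width_bits` and `control_wires_per_link` are hypothetical, and the assumption that a link consumes one TSV per data bit plus a fixed number of control wires is ours; the paper does not give this model.

```python
def max_inter_layer_links(tsv_budget: int, link_width_bits: int,
                          control_wires_per_link: int = 0) -> int:
    """Return how many NoC links fit in the TSV budget between two layers.

    Assumption (not from the paper): each link uses one TSV per data bit
    plus `control_wires_per_link` TSVs for flow-control signals.
    """
    if tsv_budget < 0 or link_width_bits <= 0 or control_wires_per_link < 0:
        raise ValueError("budget must be >= 0, width > 0, control wires >= 0")
    tsvs_per_link = link_width_bits + control_wires_per_link
    # Integer division: a partially routed link is not usable.
    return tsv_budget // tsvs_per_link


# Illustrative numbers only: 600 TSVs, 32-bit links, 4 control wires each.
max_ill = max_inter_layer_links(600, 32, 4)
```

With a fixed link width the limit scales linearly with the TSV budget, which is why the rest of the flow can work with a link count instead of a raw TSV count.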
For the synthesis procedure, the power, area and timing models
of the NoC switches and links are also taken as inputs. We also take
the power consumption and latency values of the vertical intercon-
nects as inputs. The output of the topology synthesis procedure is a
set of Pareto design points of topologies that meet the constraints,
with different values of power, latency and design area. From the
resulting points, the designer can choose the optimal point for the
application. The synthesis procedure also produces a placement
of the switches in the 3D layers and the positions of the switches.
The TSV macros needed for establishing vertical links are directly
integrated in the switch input/output ports, as done in [33].
As the topology synthesis and mapping problem is NP-Hard [10],
we present efficient heuristics to synthesize the best topology for
the design. For achieving high yield, it is important to restrict the
number of vertical links used and to allow vertical connections only
across adjacent layers on the 3D chip [25]. Thus, in our procedure,
we connect cores in a layer only to switches in the same layer,
and ensure that switches of a layer are directly connected only to
switches in adjacent layers.
A NoC having fewer switches leads to longer core-to-switch links
and hence higher link power consumption. On the other hand,
when many smaller switches are used, the flows have to traverse
more switches, leading to larger switch power consumption. Thus,
we need to explore designs with different switch counts to obtain
the best solution, starting from one where all the cores in a layer
are connected to a single switch and ending at a design point where each
core is connected to a separate switch. For each switch count, we
determine the core-to-switch connectivity, as explained in Section 4.
Then, we determine connectivity across the different switches
(Section 5). Finally, we determine the optimal positions of the switches
on the floorplan (Section 6) and determine the wire lengths and link
power consumption.
4. ESTABLISHING NUMBER OF SWITCHES
In this section, we present methods for establishing connectivity
between the cores and switches. From the core specification file,
we obtain the core specifications:
DEFINITION 1. Let $n$ be the number of cores in the design. The
x and y co-ordinate positions of a core $i$ are represented by $xc_i$ and
$yc_i$ respectively, $\forall i \in 1 \cdots n$. The 3D layer to which the core $i$ is
assigned is represented by $layer_i$.
From the communication specification file, the communication
characteristics of the application are obtained and represented by a
graph [6], [9], defined as follows:
DEFINITION 2. The communication graph is a directed graph,
$G(V, E)$, with each vertex $v_i \in V$ representing a core and the
directed edge $(v_i, v_j)$ representing the communication between the
cores $v_i$ and $v_j$. The bandwidth of traffic flow from core $v_i$ to $v_j$
is represented by $bw_{i,j}$ and the latency constraint for the flow is
represented by $lat_{i,j}$.
Figure 2: Communication graph example
Figure 3: LPGs for the two layers
Figure 4: Two min-cut partitions of LPGs
We define the Local Partitioning Graph for each layer:
DEFINITION 3. A local partitioning graph, $LPG(Z, M, ly)$, is
a directed graph, with the set of vertices represented by $Z$ and
edges by $M$. Each vertex represents a core in the layer $ly$. An edge
connecting two vertices is similar to the edge connecting the
corresponding cores in the communication graph. The weight of the
edge $(m_i, m_j)$, defined by $h_{i,j}$, is set to a combination of the
bandwidth and the latency constraints of the traffic flow from core $m_i$ to
$m_j$:
$$h_{i,j} = \alpha \times \frac{bw_{i,j}}{max\_bw} + (1 - \alpha) \times \frac{min\_lat}{lat_{i,j}},$$
where $max\_bw$ is the maximum bandwidth value over all flows,
$min\_lat$ is the tightest latency constraint over all flows and $\alpha$ is
a weight parameter. For cores that do not communicate with any
other core in the same layer, edges with low weight (close to 0) are
added between the corresponding vertices to all other vertices in
the layer. This will allow the partitioning process to still consider
such isolated vertices.
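The edge weight of Definition 3 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `lpg_edge_weights`, the dictionary layout and the restriction of $\alpha$ to [0, 1] are our assumptions, and the caller is expected to pass all flows of the design so that the maximum bandwidth and tightest latency are taken over all of them.

```python
from typing import Dict, Tuple

Flow = Tuple[int, int]  # (source core, destination core)


def lpg_edge_weights(bw: Dict[Flow, float], lat: Dict[Flow, float],
                     alpha: float) -> Dict[Flow, float]:
    """Compute h[i,j] = alpha*bw/max_bw + (1-alpha)*min_lat/lat per flow."""
    if not 0.0 <= alpha <= 1.0:
        # Assumption: alpha blends the two terms, so it lies in [0, 1].
        raise ValueError("alpha must lie in [0, 1]")
    max_bw = max(bw.values())    # largest bandwidth over the given flows
    min_lat = min(lat.values())  # tightest latency constraint
    return {
        flow: alpha * bw[flow] / max_bw + (1.0 - alpha) * min_lat / lat[flow]
        for flow in bw
    }


# Two illustrative flows (values are made up): bandwidth in MB/s, latency in cycles.
flows_bw = {(0, 1): 400.0, (1, 2): 100.0}
flows_lat = {(0, 1): 4.0, (1, 2): 8.0}
weights = lpg_edge_weights(flows_bw, flows_lat, alpha=0.5)
```

Both terms are normalized to at most 1, so a flow that has the highest bandwidth and the tightest latency gets weight 1, and a min-cut partitioner will tend to keep its endpoints on the same switch.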
The LPGs for the two layers of the communication graph from
Figure 2 are shown in Figure 3. Since the LPGs are built layer by
layer, the graphs for the two layers are independent of one another.
Extra edges with low weights are added (dotted edges in the figure)
from the vertices that have no connections to the other vertices of
the LPG.
The algorithm for establishing core-to-switch connectivity is presented
in Algorithm 1. As the number of input/output ports of a
switch increases, the maximum frequency of operation that can be
supported by it reduces, as the combinational path inside the crossbar
and arbiter increases with size. In the first step of the algorithm,
for the required operating frequency of the NoC, the maximum size
of the switch (denoted by max_sw_size) that can support that
frequency is obtained as an input. Based on this and the number of
cores in each layer, in the next steps (2-4), we determine the minimum
number of switches needed in each layer. Then the local
partitioning graph for each layer is built.
Then, the number of switches in each layer is incremented (starting
from the initial count calculated in steps 2-4) every iteration,
until it equals the number of cores in the layer. The term $|LPG(Z, M, j)|$
represents the number of cores in layer $j$. For each switch count,
that many min-cut partitions of the LPG of the layer are obtained
(step 13). The cores in the same partition are connected to the same
switch. Two min-cut partitions of the LPGs of Figure 3 are shown
in Figure 4. Once the partitions for all the layers are obtained, the
core-to-switch connectivity is fully determined. The next step is to
determine switch-to-switch connectivity, by finding paths for the
inter-switch traffic flows. This is explained further in the next section.
Algorithm 1 Core-to-switch connectivity
1: Obtain maximum switch size $max\_sw\_size$ for current frequency
2: for each layer $j \in 1 \cdots lr$ do
3:   $ni_j = \lceil \text{number of cores in layer } j \,/\, max\_sw\_size \rceil$
4: end for
5: Build $LPG(Z, M, j)$ for each layer $j$.
6: for $i = 0$ to $\max_{\forall j \in 1 \cdots lr} \{ |LPG(Z, M, j)| - ni_j \}$ do
7:   for each layer $j \in 1 \cdots lr$ do
8:     if $ni_j + i \le |LPG(Z, M, j)|$ then
9:       $np = ni_j + i$
10:    else
11:      $np = |LPG(Z, M, j)|$
12:    end if
13:    Obtain $np$ min-cut partitions of $LPG(Z, M, j)$
14:  end for
15:  Compute paths for inter-switch flows (Section 5).
16:  If valid paths found, save the current design point
17: end for
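The switch-count sweep of Algorithm 1 (steps 2-4 and 6-12) can be sketched as follows. This covers only how the partition count $np$ evolves per layer; the min-cut partitioning (step 13) and path computation (step 15) are left out. The function name `switch_count_sweep` and the dictionary-based interface are hypothetical.

```python
from math import ceil
from typing import Dict, Iterator


def switch_count_sweep(cores_per_layer: Dict[int, int],
                       max_sw_size: int) -> Iterator[Dict[int, int]]:
    """Yield, for each outer iteration i, the partition count np per layer."""
    if not cores_per_layer or max_sw_size <= 0:
        raise ValueError("need at least one layer and a positive switch size")
    # Steps 2-4: minimum number of switches needed in each layer.
    ni = {layer: ceil(n / max_sw_size) for layer, n in cores_per_layer.items()}
    # Step 6: stop once every layer has reached one switch per core.
    last = max(cores_per_layer[layer] - ni[layer] for layer in cores_per_layer)
    for i in range(last + 1):
        # Steps 8-12: never ask for more partitions than there are cores.
        yield {layer: min(ni[layer] + i, cores_per_layer[layer])
               for layer in cores_per_layer}


# Illustrative input: layer 0 has 3 cores, layer 1 has 2, switches up to size 2.
points = list(switch_count_sweep({0: 3, 1: 2}, max_sw_size=2))
```

Each yielded dictionary corresponds to one candidate design point; in the full flow it would be passed to the partitioner and then to path computation, and kept only if valid paths are found.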
5. PATH COMPUTATION
The procedure to establish physical links and paths for traffic
flows is based on the power consumption increase and latency incurred
in using the link. This cost computation in the 3D case is similar to
the 2D case, such as those presented in [14], [17], but it needs to
account for the max_ill and max_switch_size constraints. Here,
we do not show the entire path computation algorithm, but only
present the steps needed to meet these constraints. In [14], [17],
the authors present methods to remove both routing and
message-dependent deadlocks when computing the paths. We use the
same methods to obtain paths that are free of deadlocks.
DEFINITION 4. Let $nsw$ be the total number of switches used
across all the layers and let $layer_i$ be the layer in which switch $i$ is
present. Let $ill(i, j)$ be the number of vertical links established
between layers $i$ and $j$. Let $switch\_size\_inp_i$ and $switch\_size\_out_i$
be the number of input and output ports of switch $i$. Let $cost_{i,j}$ be
the cost of establishing a physical link between switches $i$ and $j$.
In Algorithm 2, we show the use of hard and soft thresholds
when evaluating the cost of establishing a physical link between
switches $i$ and $j$. In steps 3 and 4, we assign a cost of INF for
establishing a link across switches in non-adjacent layers and for
switches in layers that have reached the maximum vertical link
(max_ill) threshold. To help meet the maximum link constraint,
we assign a very high cost (denoted by SOFT_INF) for
establishing links between switches that are in layers whose number
of vertical links is close to the max_ill value; this softer
threshold is denoted by soft_max_ill (steps 5 and 6). From
experiments, we found that a reasonable value for SOFT_INF is 10
times the maximum cost of any flow and that a reasonable value for
soft_max_ill is a few (2 to 3) links less than the max_ill
value. We use a similar technique to meet the maximum switch size
constraints (steps 7-10). By using these softer constraints first, we
make it easier for the path computation procedure to determine valid
paths than when only the hard constraints are used.
Algorithm 2 CHECK_CONSTRAINTS(i, j)
1: for $i = 1$ to $nsw$ do
2:   for $j = 1$ to $nsw$ do
3:     if $|layer_i - layer_j| \ge 2$ or $ill(layer_i, layer_j) \ge max\_ill$ then
4:       $cost_{i,j} = INF$
5:     else if $|layer_i - layer_j| = 1$ and $ill(layer_i, layer_j) \ge soft\_max\_ill$ then
6:       $cost_{i,j} = SOFT\_INF$
7:     else if $switch\_size\_inp_i + 1 \ge max\_switch\_size$ or $switch\_size\_out_j + 1 \ge max\_switch\_size$ then
8:       $cost_{i,j} = INF$
9:     else if $switch\_size\_inp_i + 1 \ge soft\_max\_switch\_size$ or $switch\_size\_out_j + 1 \ge soft\_max\_switch\_size$ then
10:      $cost_{i,j} = SOFT\_INF$
11:    end if
12:  end for
13: end for
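The hard and soft threshold checks can be sketched for a single candidate link as below. This is an illustrative reading, not the authors' code: the function name `link_cost_override` and its parameter names are hypothetical, the comparison operators are assumed to be "greater than or equal" (they are not legible in the extracted listing), and the caller is assumed to pass `ill=0` for links between switches in the same layer.

```python
import math
from typing import Optional

INF = math.inf


def link_cost_override(layer_i: int, layer_j: int, ill: int,
                       inp_ports_i: int, out_ports_j: int,
                       max_ill: int, soft_max_ill: int,
                       max_switch_size: int, soft_max_switch_size: int,
                       soft_inf: float) -> Optional[float]:
    """Return INF, soft_inf, or None (keep the normal cost) for link i -> j.

    `ill` is the number of vertical links already established between the
    two layers involved (assumed 0 when both switches share a layer).
    """
    gap = abs(layer_i - layer_j)
    # Hard limit: non-adjacent layers, or vertical link budget exhausted.
    if gap >= 2 or ill >= max_ill:
        return INF
    # Soft limit: adjacent layers that are close to the vertical link budget.
    if gap == 1 and ill >= soft_max_ill:
        return soft_inf
    # Hard limit on switch size once the new port is added.
    if inp_ports_i + 1 >= max_switch_size or out_ports_j + 1 >= max_switch_size:
        return INF
    # Soft limit on switch size.
    if (inp_ports_i + 1 >= soft_max_switch_size
            or out_ports_j + 1 >= soft_max_switch_size):
        return soft_inf
    return None


# Illustrative thresholds only.
limits = dict(max_ill=10, soft_max_ill=8, max_switch_size=11,
              soft_max_switch_size=9, soft_inf=1000.0)
far_apart = link_cost_override(0, 2, 0, 1, 1, **limits)
near_budget = link_cost_override(0, 1, 8, 1, 1, **limits)
unconstrained = link_cost_override(0, 1, 2, 1, 1, **limits)
```

Returning `None` for the unconstrained case lets the path computation fall back to its usual power and latency cost, so the thresholds only steer the search and do not replace the cost model.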
When paths are computed, if it is not feasible to meet the
max_switch_size constraints, we introduce new switches in the
topology that are used to connect the other switches together. These
indirect switches help in reducing the number of ports needed in the
direct switches. Due to space limitations, we do not explain in this
paper how the indirect switches are established.
6. SWITCH POSITION COMPUTATION
Once a topology for a particular switch count is obtained, the
next step is to find the latency and power consumption on the wires.
In order to do this, based on the input positions of the cores, the
optimal position of the switches needs to be determined. For this,
we model the problem as a Linear Program (LP) [34].
Let us consider a topology with $nsw$ switches. We denote the
co-ordinates of a switch $i$ by $(xs_i, ys_i)$, $\forall i \in 1 \cdots nsw$. The goal
of the LP is to determine the values of $xs_i$ and $ys_i$ for all switches
in the particular topology. The Manhattan distance
between a switch $i$ and a core $k$ is given by:
$$coredist_{i,k} = \begin{cases} |xs_i - xc_k| + |ys_i - yc_k|, & \text{if switch } i \text{ is connected to core } k \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
The Manhattan distance between a switch $i$ and a switch
$j$ to which it is connected is given by:
$$swdist_{i,j} = \begin{cases} |xs_i - xs_j| + |ys_i - ys_j|, & \text{if switch } i \text{ is connected to switch } j \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
The above equations can be easily represented as a set of linear
equations [34]. Let $bw\_sw2core_{i,k}$ and $bw\_sw2sw_{i,j}$ be the
total bandwidth of traffic flows between switch $i$ and core $k$ and
between switches $i$ and $j$, respectively. To minimize the total power
consumption of the links, we need to minimize the length of the links
weighted by their bandwidth values, so that higher bandwidth links
are shorter than lower bandwidth ones. Formulating the objective
function mathematically, we get:
$$obj = \sum_i \sum_k coredist_{i,k} \times bw\_sw2core_{i,k} + \sum_i \sum_j swdist_{i,j} \times bw\_sw2sw_{i,j} \tag{3}$$
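To make the objective concrete, the sketch below evaluates the bandwidth-weighted link length for a given set of switch positions. It only evaluates the objective and does not solve the LP. The function names and the convention that the bandwidth dictionaries contain entries only for connected pairs (so unconnected pairs contribute 0) are our assumptions.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """Manhattan distance between two (x, y) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def link_length_objective(switch_pos: Dict[int, Point],
                          core_pos: Dict[int, Point],
                          bw_sw2core: Dict[Tuple[int, int], float],
                          bw_sw2sw: Dict[Tuple[int, int], float]) -> float:
    """Sum of link lengths weighted by bandwidth, over connected pairs only."""
    total = 0.0
    for (i, k), bw in bw_sw2core.items():   # switch i connected to core k
        total += manhattan(switch_pos[i], core_pos[k]) * bw
    for (i, j), bw in bw_sw2sw.items():     # switch i connected to switch j
        total += manhattan(switch_pos[i], switch_pos[j]) * bw
    return total


# Made-up positions (mm) and bandwidths (MB/s) for two switches and two cores.
obj = link_length_objective(
    switch_pos={0: (1.0, 1.0), 1: (4.0, 1.0)},
    core_pos={0: (0.0, 0.0), 1: (5.0, 3.0)},
    bw_sw2core={(0, 0): 100.0, (1, 1): 50.0},
    bw_sw2sw={(0, 1): 200.0},
)
```

In this example the switch-to-switch link dominates the objective (600 of the total), which illustrates why high-bandwidth links pull their endpoints together.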
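The LP for optimization is written as follows:
$$\begin{aligned} \text{minimize} \quad & obj \\ \text{subject to} \quad & \text{Equations 1-3} \\ & xs_i, ys_i \ge 0, \; \forall i \in 1 \cdots nsw \end{aligned} \tag{4}$$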
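The paper does not spell out how the absolute values in Equations 1 and 2 are made linear. One standard way, which we assume is what the citation of [34] refers to, is to introduce one auxiliary variable per coordinate difference and bound it from both sides. The names $dx$, $dy$, $ex$ and $ey$ are hypothetical, and the constraints are written only for connected pairs:

```latex
\begin{aligned}
\text{minimize}\quad
 & \sum_i \sum_k (dx_{i,k} + dy_{i,k}) \times bw\_sw2core_{i,k}
   + \sum_i \sum_j (ex_{i,j} + ey_{i,j}) \times bw\_sw2sw_{i,j} \\
\text{subject to}\quad
 & dx_{i,k} \ge xs_i - xc_k, \quad dx_{i,k} \ge xc_k - xs_i, \\
 & dy_{i,k} \ge ys_i - yc_k, \quad dy_{i,k} \ge yc_k - ys_i, \\
 & ex_{i,j} \ge xs_i - xs_j, \quad ex_{i,j} \ge xs_j - xs_i, \\
 & ey_{i,j} \ge ys_i - ys_j, \quad ey_{i,j} \ge ys_j - ys_i, \\
 & xs_i, ys_i \ge 0 .
\end{aligned}
```

Because the bandwidth weights are non-negative and the objective is minimized, every auxiliary variable with a positive weight settles at the larger of its two lower bounds, which is exactly the absolute value it stands for.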
We use the lp_solve package [35] to obtain the optimum solution
for the switch co-ordinates. Even for big applications (65 cores,
tens of switches), the optimal solution is obtained in a few seconds.
However, the optimal positions can result in overlap of switches
among themselves or with the cores. To remove the overlaps, we
use the floorplanner Parquet [36], layer by layer. We feed the core
and switch positions as an input solution to the floorplanner. We
allow it to move the switches around the cores, maintaining the
relative positions of the cores and minimizing the movement of the
switches from the optimal positions computed by the LP. We also
pipeline long links to support full throughput on the NoC and add
Network Interfaces (NIs) to connect the cores to the network. The
resulting design is a valid floorplan of the NoC.
7. EXPERIMENTS AND CASE STUDIES
For our experiments, we use the switch and link libraries from
[5]. The power consumption and latency numbers of the compo-
nents of the library are obtained after post-layout analysis. We use
65nm low power technology libraries for the layout studies. For
the electrical characteristics of vertical interconnects, we use the
models from [33]. To obtain the electrical characteristics, a wafer-
to-wafer bonding technique is used as the underlying 3D integra-
tion technology. The vertical links are shown to have an order of
magnitude lower resistance and capacitance than a horizontal link
of the same dimension. This translates to a traversal delay of less
than 10% of a clock cycle for 1 GHz operation and negligible power
consumption on the vertical links.
7.1 Multimedia SoC case study
As an experimental case study, we consider a multimedia SoC,
the Triple Video Object Plane Decoder, which has 38 cores (D_38_tvopd).
The communication graph of the benchmark is presented in Figure
5, where each vertex represents a core and the weight on the edge
represents the bandwidth between the cores expressed in MB/s.
The application is highly heterogeneous in nature, having three
independent decoders working in parallel to improve performance.
Each decoder has 12 cores organized in a pipeline fashion. There
are two extra memories that are shared between the pipelines and
serve as input and output buffers. We consider the design
implemented on 3 layers in 3D. The assignment of cores to the
different layers and the floorplan of each layer were done manually,
such that the performance and manufacturing constraints (such as
thermal issues) are met. The processing cores are placed on the top
and bottom layers, so that they are close to the heat sink. The large
memory cores are all placed on the middle layer because they
produce less heat and because this allows the manufacturer to use an
efficient integration process for implementing the memories. The
floorplan of the design (along with the network components
synthesized by our procedure) is presented in Figure 7.
The data width of the NoC links is fixed to 32 bits, to match the
data width of the cores in the design. We allowed the synthesis
method to sweep the NoC frequency and obtain NoC design points
for different frequencies. From the resulting design points, we
found that the lowest operating frequency (500 MHz) resulted in the
least power consumption for this design. The power consumption
of NoC designs synthesized by our procedure for different switch
counts, at 500 MHz operation, is presented in Figure 8. In the
figure, we show the core-to-switch link power, the switch-to-switch
link power, the switch power and the total power consumption. The
plot starts with 5 switches (on the x-axis), as the maximum size of a
switch to support 500 MHz operation was 11x11 and the top and
bottom layers needed 2 switches each (topology shown in Figure
6), as they have more than 10 cores each. Because the number of
cores and the communication demand on each layer are different, we
obtain a different number of switches on each layer.
Figure 5: Communication graph for the D_38_tvopd benchmark
Figure 6: Most power-efficient topology
Figure 7: Resulting 3D floorplan with switches
Figure 8: Power consumption (x-axis: switch count, 5 to 35; y-axis: power consumption in mW, 0 to 120; curves: switch power, core-to-switch link power, switch-to-switch link power, total power)
Since the area of each 3D layer is small (approximately 20 mm^2),
the links are short and switch power has a higher impact on the
total power consumption. With increasing switch count, the switch
power increases significantly, leading to higher total power
consumption. For this design, the NoC with 5 switches is the most
power-efficient and the resulting floorplan is shown in Figure 7.
7.2 Comparisons with mesh
Custom topologies that match the application characteristics can
result in large power-performance improvement when compared to
standard topologies, such as mesh and torus [17]. For this
comparison we used the D_38_tvopd benchmark presented in Section
7.1 and five other benchmarks that model different traffic scenarios.
We consider 3 benchmarks, D_36_4, D_36_6 and D_36_8, with 36
cores, each core communicating with 4, 6 and 8 other cores,
respectively, modeling designs with multiple local memories. We also
consider a benchmark with shared memory bottleneck communication
(D_35_bot). For a larger design, we performed tests on
D_65_pipe, which has 65 processing elements distributed on three
layers and organized in a pipeline fashion. All of the benchmarks
are mapped onto 3 layers in 3D.
We compared the custom topologies generated for the benchmarks
against an optimized mesh topology. For the optimized mesh,
each core is connected to a switch and only the necessary links
among switches are opened. The results of the comparison between
the best custom topology and the optimized mesh are presented in
Figure 9. As can be seen from the results, the topologies synthesized
by our method result in large power savings (38% on average)
when compared to the optimized mesh topologies. The synthesized
topologies also resulted in a 24.5% reduction in average zero-load
latency when compared to the optimized mesh-based NoC.
7.3 Impact of inter-layer link constraint
Limiting the number of inter-layer links has a great impact on
power consumption and average latency. Reducing the number of
TSVs is desirable for improving the yield of a 3D design. However,
a very tight constraint on the number of inter-layer links can
lead to a significant increase in power consumption. To see the
impact of the constraint, we varied the value of the max_ill constraint
and performed topology synthesis for each value, for one of the
benchmarks (D_36_4). The power and latency values for the
different max_ill design points are shown in Figures 10 and 11. The
dotted line in the figures represents points where the max_ill
constraint was too tight to produce any feasible topologies. When there
is a tight constraint on the inter-layer links, more flows are routed
through existing inter-layer links instead of opening new ones. This
leads to traversing more intermediate switches and higher switch
activity, resulting in higher latency and power consumption. Note
that our synthesis algorithm also allows designers to explore
such power and latency trade-offs against yield early in the design
cycle.
Figure 9: Comparisons with mesh (x-axis: benchmarks D_36_4, D_36_6, D_36_8, D_35_bot, D_65_pipe, D_38_tvopd; y-axis: power consumption in mW, 0 to 400; series: 3D application-specific, 3D opt-mesh)
Figure 10: Impact of max_ill on power (x-axis: maximum number of inter-layer links (max_ill), 10 to 18; y-axis: minimum power consumption in mW)
Figure 11: Impact of max_ill on latency (x-axis: maximum number of inter-layer links (max_ill), 10 to 18; y-axis: minimum latency in cycles)
The synthesis algorithm explores a large solution space. However,
thanks to the efficient heuristic methods presented, the entire
topology design process completed in a few hours for all the
experiments, when run on a 2 GHz Linux workstation. Note
that the synthesis process is performed once at design time and the
computational time incurred is negligible.
8. ACKNOWLEDGEMENTS
This work is supported by the Swiss National Science Founda-
tion (FNS, Grant 20021-109450/1).
9. CONCLUSIONS
The use of Networks on Chips (NoCs) for communication in 3D
chips has posed new opportunities and challenges for designers.
One of the most important problems is to design the most power-
performance efficient NoC topology that satisfies the application
characteristics and 3D technology requirements. In this work, we
presented a synthesis approach to solve this problem. We also pre-
sented methods to place switches optimally on the 3D floorplan, so
that accurate power and delay numbers are obtained for the wires.
Our detailed comparisons with regular 3D optimized meshes show
that the custom 3D topologies lead to a large reduction in interconnect
power consumption. In the future, we plan to explore tuning the
link data widths to meet the TSV constraints and to improve the
yield of the 3D NoCs.
10. REFERENCES
[1] L. Benini and G. De Micheli, “Networks on Chips: A New SoC Paradigm”, IEEE
Computer, pp. 70-78, Jan. 2002.
[2] P. Guerrier, A. Greiner, “A generic architecture for on-chip packet switched
interconnections”, Proc. DATE, pp. 250-256, March 2000.
[3] G. De Micheli, L. Benini, “Networks on Chips: Technology and Tools”, Morgan
Kaufmann, First Edition, July, 2006.
[4] K. Goossens et al., “A Design Flow for Application-Specific Networks on Chip
with Guaranteed Performance to Accelerate SOC Design and Verification”,
DATE 2005.
[5] S. Stergiou et al., “×pipesLite: a Synthesis Oriented Design Library for
Networks on Chips”, pp. 1188-1193, Proc. DATE 2005.
[6] J. Hu, R. Marculescu, “Exploiting the Routing Flexibility for
Energy/Performance Aware Mapping of Regular NoC Architectures”, Proc.
DATE, March 2003.
[7] S. Murali, G. De Micheli, “SUNMAP: A Tool for Automatic Topology Selection
and Generation for NoCs”, Proc. DAC 2004.
[8] S. Murali, G. De Micheli, “Bandwidth Constrained Mapping of Cores onto NoC
Architectures”, Proc. DATE 2004.
[9] D. Bertozzi et al., “NoC Synthesis Flow for Customized Domain Specific
Multiprocessor Systems-on-Chip”, IEEE TPDS, Feb 2005.
[10] A. Pinto et al., “Efficient Synthesis of Networks on Chip”, ICCD 2003, pp.
146-150, Oct 2003.
[11] W. H. Ho, T. M. Pinkston, “A Methodology for Designing Efficient On-Chip
Interconnects on Well-Behaved Communication Patterns”, HPCA, 2003.
[12] T. Ahonen et al., “Topology Optimization for Application Specific Networks on
Chip”, Proc. SLIP 04.
[13] K. Srinivasan et al., “An Automated Technique for Topology and Route
Generation of Application Specific On-Chip Interconnection Networks”, Proc.
ICCAD ’05.
[14] A. Hansson et al., “A Unified Approach to Constrained Mapping and Routing
on Network-on-Chip Architectures”, Proc. CODES-ISSS, 2005.
[15] X.Zhu, S.Malik, “A Hierarchical Modeling Framework for On-Chip
Communication Architectures”, ICCD 2002, pp. 663-671, Nov 2002.
[16] J. Xu et al., “A design methodology for application-specific networks-on-chip”,
ACM TECS, May 2006.
[17] S. Murali et al., “Designing Application-Specific Networks on Chips with
Floorplan Information”, pp. 355-362, ICCAD 2006.
[18] W. J. Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”,
IEEE Transactions on Computers, Vol. 39, No. 6, pp. 775-785, 1990.
[19] K. Banerjee et al., “3-D ICs: ANovel Chip Design for Deep-Submicrometer
Interconnect Performance & SoC Integration”, Proc. of IEEE, 2001.
[20] B. Goplen and S. Sapatnekar, “Thermal Via Placement in 3D ICs”, Proc. Intl.
Symposium on Physical Design, pp. 167, 2005.
[21] J. Cong et al., “A thermal-driven floorplanning algorithm for 3D ICs”, ICCAD
2004.
[22] W.-L. Hung et al., “Interconnect and thermal-aware floorplanning for 3D
microprocessors”, Proc. ISQED, March 2006.
[23] S. K. Lim, “Physical Design for 3D System on Package”, IEEE Design & Test
of Computers, vol. 22(6), pp. 532539, 2005.
[24] P. Zhou et al., ”3D-STAF: Scalable temperature and leakage aware
floorplanning for three-dimensional integrated circuits”, ICCAD 2007.
[25] N. Miyakawa et al., New Multi-Layer Stacking Technology and Trial
Manufacture”, 3-D Architectures for Semiconductor Integration and Packaging,
Oct 2007.
[26] R. Weerasekara et al., “Extending Systems-on-Chip to the Third Dimension:
Performance, Cost and Technological Tradeoffs”, Proc. ICCAD, 2007.
[27] V. F. Pavlidis and E. G. Friedman, “Topologies for networks-onchip”, Proc.
SOCC, 2006.
[28] B. Feero and P. P. Pande, “Performance evaluation for three-dimensional
networks-on-chip”, Proc. ISVLSI, 2007.
[29] C. Addo-Quaye, “Thermal-Aware Mapping and Placement for 3-D NoC
Designs”, Proc. SOCC, 2005.
[30] J. Kim et al., “A novel dimensionally-decomposed router for on-chip
communication in 3d architectures”, ISCA, 2007.
[31] F. Li et al., “Design and Management of 3D Chip Multiprocessors Using
Network-in-Memory”, ISCA, pp. 130-141, 2006.
[32] D. Park et al., “MIRA: A Multi-Layered On-Chip Interconnect Router
Architecture”, Proc. ISCA, 2008.
[33] I. Loi, F. Angiolini, L. Benini, Supporting vertical links for 3D networks on
chip: toward an automated design and analysis flow, Proc. Nano-Nets, 2007.
[34] S. Boyd and L. Vandenberghe, “Convex Optimization”, Cambridge University
Press, 2004.
[35] Package available at: http://sourceforge.net/projects/lpsolve
[36] S. N. Adya, I. L. Markov, ”Fixed-outline Floorplanning : Enabling Hierarchical
Design”, IEEE TVLSI, Dec 2003.