
3A-2 Synthesis of Networks on Chips for 3D Systems on Chips

January 2009


DOI:10.1145/1509633.1509701

Authors:

Srinivasan Murali

École Polytechnique Fédérale de Lausanne

Luca Benini

University of Bologna



Synthesis of Networks on Chips for 3D Systems on Chips

Srinivasan Murali†, Ciprian Seiculescu†, Luca Benini‡, Giovanni De Micheli†

†LSI, EPFL, Lausanne, Switzerland,

{srinivasan.murali, ciprian.seiculescu, giovanni.demicheli}@epfl.ch

‡DEIS, University of Bologna, Bologna, Italy, lbenini@deis.unibo.it

ABSTRACT

Three-dimensional stacking of silicon layers is emerging as a promising solution to handle the design complexity and heterogeneity of Systems on Chips (SoCs). Networks on Chips (NoCs) are necessary to efficiently handle the 3D interconnect complexity. Designing power-efficient NoCs for 3D SoCs that satisfy the application performance requirements, while satisfying the 3D technology constraints, is a big challenge. In this work, we address this problem and present a synthesis approach for designing power-performance efficient 3D NoCs. We present methods to determine the best topology, compute paths and perform placement of the NoC components in each 3D layer. We perform experiments on varied, realistic SoC benchmarks to validate the methods and also perform a comparative study of the resulting 3D NoC designs with 3D optimized mesh topologies. The NoCs designed by our synthesis method result in large interconnect power reduction (average of 38%) and latency reduction (average of 25%) when compared to traditional NoC designs.

Keywords

3D, networks on chip, topology synthesis, application-speciﬁc

1. INTRODUCTION

The 2D chip fabrication technology is facing many challenges in

utilizing the exponentially growing number of transistors on a chip.

The wire delay and power consumption are increasing dramatically

and achieving interconnect design closure is becoming a challenge.

Designing the clock-tree network for a large chip is becoming very

challenging and its power consumption is a signiﬁcant fraction of

total chip power consumption. Moreover, diverse components that

are digital, analog, MEMS and RF are being integrated on the same

chip, resulting in large complexity for the 2D manufacturing pro-

cess [19].

Vertical stacking of multiple silicon layers, referred to as 3D

stacking, is emerging as an attractive solution to continue the pace

of growth of Systems on Chips (SoCs) [19]-[24]. The 3D tech-

nology results in smaller footprint in each layer and shorter verti-

cal wires that are implemented using Through Silicon Vias (TSVs)

across the layers. Heterogeneous systems can be built easily, with

each layer supporting a diverse technology [19]. The 3D technol-

ogy has been maturing over the years in addressing thermal issues

and achieving high yield [20].

To tackle the on-chip communication problem, a scalable com-

munication paradigm, Networks on Chips (NoCs) has recently evolved

[1]-[3]. NoCs are composed of switches and links and use circuit

or packet switching technology to transfer data inside a chip. They

provide better structure, modularity and scalability when compared

to traditional interconnect solutions.

NoCs are a necessity for 3D chips: they provide arbitrary scalability

of the interconnects across additional layers, efficiently parallelize

communication in each layer and help control the number

of vertical wires (and hence TSVs) needed for inter-layer commu-

nication. The combined use of 3D integration technologies and

NoCs introduces new opportunities and challenges for designers.

Building power-efﬁcient NoCs for 3D systems that satisfy the per-

formance requirements of applications, while satisfying the tech-

nology constraints is an important problem. To address this issue,

new architectures and design methods are needed. While the issue

of designing NoC architectures for 3D has received some atten-

tion [30]-[33], there has been little work on design methods for 3D

NoCs. The design methods for 2D NoCs do not consider important

3D information, such as the technology constraints on the number

of TSVs that can be supported, constraints on communication be-

tween adjacent layers, determining layer assignment for switches

and placement of switches in 3D.

In this work, we address this important problem and present a

synthesis approach for designing the most power efﬁcient 3D NoC

that meets application performance and technology constraints. We

present a synthesis approach to determine the most power efﬁcient

topology for the application and for ﬁnding paths for the trafﬁc

ﬂows that meet the TSV constraints. Our methods account for

power and delay of both switches and links. The assignment of

cores to different 3D layers and the ﬂoorplan of the cores in each

layer are taken as inputs to the synthesis process. To accurately

model the link delay and power consumption, for the given core po-

sitions, we present a method to determine the optimal positions of

switches in the ﬂoorplan in each layer. We then place the switches

on each layer, removing any overlap with the cores. Please note

that the assignment of cores to the different layers and the ﬂoorplan

of each layer need to consider several performance and technological

constraints, such as thermal issues. There are several works

that address these issues [21]-[24] and our work is complementary

to them. Here, we only address the issue of designing the NoC

topology and determining the placement of the NoC switches. Because

the core positions in our output floorplan (after placing the switches)

are almost the same as in the input floorplan, our method has minimal

impact on these issues (such as thermal behavior).

We perform experiments on varied, realistic SoC benchmarks to

validate the methods. Our results show that the topologies synthesized

by our method result in large interconnect power reduction

(an average of 38%) and latency reduction (25% on average), when

compared to optimized standard NoC topologies.


Figure 1: Proposed 3D NoC design approach

2. RELATED WORK

The use of NoCs to replace bus-based designs has been presented

in [1]-[2]. Several different NoC architectures and design methods

[4]-[5] have been developed over the past few years. A detailed

description of the important design issues and the current state-of-

the-art in NoC architectures and design methods is presented in [3].

Synthesis of bus and NoC architectures has been addressed by

several researchers for 2D systems. Mapping and placement of

cores onto standard NoC topologies has been explored in [6]-[9].

Synthesis of application speciﬁc NoC topologies has been addressed

in [10]-[17]. In [17], we presented a NoC synthesis method for 2D

SoCs and performed detailed comparisons with standard topologies

and other mapping tools. In this paper, we use the basic principles

from the 2D method to address the important issues in 3D NoC

design.

Several works have been presented on the 3D manufacturing pro-

cesses and interconnects [19], [20], [33]. A performance and cost

trade-off analysis of 3D integration is presented in [26]. Several

works have explored 3D ﬂoorplanning, placement and temperature

issues of cores [21]-[24]. These works do not consider the inter-

connect synthesis problem. Multi-dimensional topologies (such

as k-ary n-cubes, hypercubes) have been extensively explored in

the chip-to-chip interconnection ﬁeld [18]. However, such works

only consider standard topologies suitable for homogeneous de-

signs. Most SoCs, especially in 3D, are heterogeneous in nature

and require application-speciﬁc interconnect architecture to opti-

mize power and performance. Moreover, such works do not ad-

dress the optimization of topologies based on trafﬁc patterns.

Analysis and synthesis of NoCs for 3D technology is a relatively

new topic. Novel NoC switch architectures for 3D are presented in

[30] and [32]. In [31], the authors present the use of NoCs in 3D

multi-processors. In [33], the authors analyze the electrical charac-

teristics of vertical interconnects and show a back-end design ﬂow

to implement 3D NoCs. In [27], the authors present an analyti-

cal model for cost metrics of 3D NoCs and compare them with 2D

NoCs. In [28], design of standard NoC topologies (such as mesh)

for 3D is analyzed. Mapping and placement of cores with thermal

constraints on to NoC topologies is presented in [29]. However,

none of these works address the issue of synthesizing application-

speciﬁc 3D NoC topologies.

3. DESIGN APPROACH

The approach used for topology synthesis is presented in Figure

1. In the core speciﬁcation ﬁle, the name of the different cores, the

sizes and positions are obtained as inputs. The assignment of the

cores to the different layers in 3D is also obtained as an input. In

the communication speciﬁcation ﬁle, the communication character-

istics of the application are speciﬁed. This includes the bandwidth

of communication across different cores, latency constraints and

message type (request/response) of the different trafﬁc ﬂows.

To achieve high yield, the number of TSVs that can be estab-

lished across two layers may need to be restricted below a threshold

[25]. In the rest of the paper, we model the maximum TSV con-

straint by using a constraint on the number of NoC links that can

cross two adjacent layers, denoted max_ill (for maximum number

of inter-layer links). For a particular link width, the maximum num-

ber of links can be directly determined from the TSV constraints.
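As a rough illustration of that last step, the sketch below derives max_ill from a TSV budget. It assumes each link needs one TSV per data bit plus a fixed number of control wires; the function name, the control-wire count and the example budget are hypothetical and are not values from this paper.

```python
def max_inter_layer_links(max_tsvs: int, link_width: int, control_wires: int = 0) -> int:
    """Largest number of links whose TSVs fit in the budget between two adjacent layers."""
    if max_tsvs < 0 or link_width <= 0 or control_wires < 0:
        raise ValueError("expected max_tsvs >= 0, link_width > 0, control_wires >= 0")
    # Assumption: one TSV per wire of the link (data bits plus control wires).
    tsvs_per_link = link_width + control_wires
    return max_tsvs // tsvs_per_link


# Hypothetical budget: 600 TSVs, 32-bit links with 4 control wires each.
max_ill = max_inter_layer_links(max_tsvs=600, link_width=32, control_wires=4)
print(max_ill)  # 600 // 36 = 16
```

A wider link consumes more TSVs, so the same budget allows fewer inter-layer links.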

For the synthesis procedure, the power, area and timing models

of the NoC switches and links are also taken as inputs. We also take

the power consumption and latency values of the vertical intercon-

nects as inputs. The output of the topology synthesis procedure is a

set of Pareto design points of topologies that meet the constraints,

with different values of power, latency and design area. From the

resulting points, the designer can choose the optimal point for the

application. The synthesis procedure also produces a placement

of the switches in the 3D layers and the positions of the switches.

The TSV macros needed for establishing vertical links are directly

integrated in the switch input/output ports, as done in [33].
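The notion of a Pareto set of design points can be sketched as follows. This is only an illustrative filter, assuming that lower power, latency and area are all better; the DesignPoint type and the numbers in the example are hypothetical.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    switch_count: int
    power_mw: float
    latency_cycles: float
    area_mm2: float


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    a_metrics, b_metrics = a[1:], b[1:]  # skip switch_count, which is not a cost
    no_worse = all(x <= y for x, y in zip(a_metrics, b_metrics))
    better = any(x < y for x, y in zip(a_metrics, b_metrics))
    return no_worse and better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical design points for three switch counts.
candidates = [
    DesignPoint(5, 100.0, 3.5, 1.0),
    DesignPoint(6, 110.0, 3.4, 1.1),
    DesignPoint(7, 120.0, 3.6, 1.2),
]
front = pareto_front(candidates)
```

In this example the 7-switch point is dropped because the 5-switch point is better in every metric, while the 6-switch point survives because it trades power for latency.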

As the topology synthesis and mapping problem is NP-Hard [10],

we present efﬁcient heuristics to synthesize the best topology for

the design. For achieving high yield, it is important to restrict the

number of vertical links used and to allow vertical connections only

across adjacent layers on the 3D chip [25]. Thus, in our procedure,

we connect cores in a layer only to switches in the same layer,

and ensure that switches of a layer are directly connected only to

switches in adjacent layers.

A NoC having fewer switches leads to longer core to switch links

and hence, higher link power consumption. On the other hand,

when many smaller switches are used, the ﬂows have to traverse

more switches, leading to larger switch power consumption. Thus,

we need to explore designs with several different switch counts to

obtain the best solution, starting from one where all the cores are

connected to a single switch in a layer to a design point where each

core is connected to a separate switch. For each switch count, we

determine the core to switch connectivity, as explained in Section 4.

Then, we determine connectivity across the different switches (Sec-

tion 5). Then, we determine the optimal positions of the switches

on the ﬂoorplan (Section 6) and determine the wire lengths and link

power consumption.

4. ESTABLISHING NUMBER OF SWITCHES

In this section, we present methods for establishing connectivity

between the cores and switches. From the core speciﬁcation ﬁle,

we obtain the core speciﬁcations:

DEFINITION 1. Let $n$ be the number of cores in the design. The x and y co-ordinate positions of a core $i$ are represented by $xc_i$ and $yc_i$ respectively, $\forall i \in 1 \cdots n$. The 3D layer to which the core $i$ is assigned is represented by $layer_i$.

From the communication speciﬁcation ﬁle, the communication

characteristics of the application are obtained and represented by a

graph [6], [9], deﬁned as follows:

DEFINITION 2. The communication graph is a directed graph, $G(V, E)$, with each vertex $v_i \in V$ representing a core and the directed edge $(v_i, v_j)$ representing the communication between the cores $v_i$ and $v_j$. The bandwidth of traffic flow from core $v_i$ to $v_j$ is represented by $bw_{i,j}$ and the latency constraint for the flow is represented by $lat_{i,j}$.

Figure 2: Communication graph example

Figure 3: LPGs for the two layers

Figure 4: Two min-cut partitions of LPGs

We deﬁne the Local Partitioning Graph for each layer:

DEFINITION 3. A local partitioning graph, $LPG(Z, M, ly)$, is a directed graph, with the set of vertices represented by $Z$ and edges by $M$. Each vertex represents a core in the layer $ly$. An edge connecting two vertices is similar to the edge connecting the corresponding cores in the communication graph. The weight of the edge $(m_i, m_j)$, defined by $h_{i,j}$, is set to a combination of the bandwidth and the latency constraints of the traffic flow from core $m_i$ to $m_j$:

$$h_{i,j} = \alpha \times \frac{bw_{i,j}}{max\_bw} + (1 - \alpha) \times \frac{min\_lat}{lat_{i,j}},$$

where $max\_bw$ is the maximum bandwidth value over all flows, $min\_lat$ is the tightest latency constraint over all flows and $\alpha$ is a weight parameter. For cores that do not communicate with any other core in the same layer, edges with low weight (close to 0) are added from the corresponding vertices to all other vertices in the layer. This will allow the partitioning process to still consider such isolated vertices.
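A minimal sketch of how the LPG of one layer could be built from Definition 3 is given below. The dictionary-based data layout, the function names, the default value of alpha and the low weight epsilon are assumptions made for illustration; the definition only requires that the weight be close to 0.

```python
from typing import Dict, Tuple

Flow = Tuple[float, float]  # (bandwidth bw, latency constraint lat)
Edge = Tuple[int, int]      # (source core, destination core)


def edge_weight(bw: float, lat: float, max_bw: float, min_lat: float, alpha: float) -> float:
    """h = alpha * bw / max_bw + (1 - alpha) * min_lat / lat."""
    return alpha * bw / max_bw + (1 - alpha) * min_lat / lat


def build_lpg(flows: Dict[Edge, Flow], layer_of: Dict[int, int], layer: int,
              alpha: float = 0.5, epsilon: float = 1e-6) -> Dict[Edge, float]:
    """Return the weighted edges of the LPG for one layer."""
    # max_bw and min_lat are taken over all flows, not only those in this layer.
    max_bw = max(bw for bw, _ in flows.values())
    min_lat = min(lat for _, lat in flows.values())
    cores = [c for c, l in layer_of.items() if l == layer]
    in_layer = set(cores)

    lpg: Dict[Edge, float] = {}
    connected = set()
    for (src, dst), (bw, lat) in flows.items():
        if src in in_layer and dst in in_layer:  # keep intra-layer flows only
            lpg[(src, dst)] = edge_weight(bw, lat, max_bw, min_lat, alpha)
            connected.update((src, dst))

    # Isolated cores get low-weight edges to every other core in the layer.
    for core in cores:
        if core not in connected:
            for other in cores:
                if other != core:
                    lpg.setdefault((core, other), epsilon)
    return lpg


# Hypothetical example: cores 0-2 on layer 0, core 3 on layer 1.
flows = {(0, 1): (100.0, 10.0), (1, 3): (50.0, 5.0)}
layer_of = {0: 0, 1: 0, 2: 0, 3: 1}
lpg0 = build_lpg(flows, layer_of, layer=0)
```

Here the flow from core 1 to core 3 crosses layers and is left out of the layer-0 graph, while core 2, which has no intra-layer traffic, is tied to cores 0 and 1 with the low weight.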

The LPGs for the two layers of the communication graph from

Figure 2 are shown in Figure 3. Since the LPGs are built layer by

layer, the graphs for the two layers are independent of one another.

Extra edges with low weights are added (dotted edges in the ﬁgure)

from the vertices that have no connections to the other vertices of

the LPG.

The algorithm for establishing core to switch connectivity is pre-

sented in Algorithm 1. As the number of input/output ports of a

switch increases, the maximum frequency of operation that can be

supported by it reduces, as the combinational path inside the cross-

bar and arbiter increases with size. In the ﬁrst step of the algorithm,

for the required operating frequency of the NoC, the maximum size

of the switch (denoted by max_sw_size) that can support that

frequency is obtained as an input. Based on this and the number of

cores in each layer, in the next steps (2-4), we determine the min-

imum number of switches needed in each layer. Then the local

partitioning graph for each layer is built.

Then, the number of switches in each layer is incremented (start-

ing from the initial count calculated in steps 2-4) every iteration,

until it equals the number of cores in the layer. The term $|LPG(Z, M, j)|$

represents the number of cores in layer j. For each switch count,

that many min-cut partitions of the LPG of the layer are obtained

(step 13). The cores in the same partition are connected to the same

switch. Two min-cut partitions of the LPGs of Figure 3 are shown

in Figure 4. Once the partitions for all the layers are obtained, the

cores in a partition are attached to the same switch and hence the

core to switch connectivity is obtained. The next step is to deter-

mine switch to switch connectivity, by ﬁnding paths for the inter

switch trafﬁc ﬂows. This is explained further in the next section.

Algorithm 1 Core-to-switch connectivity

1: Obtain maximum switch size max_sw_size for current frequency
2: for each layer j ∈ 1···lr do
3:   ni_j = ⌈number of cores in layer j / max_sw_size⌉
4: end for
5: Build LPG(Z, M, j) for each layer j
6: for i = 0 to max_{∀j ∈ 1···lr} {|LPG(Z, M, j)| − ni_j} do
7:   for each layer j ∈ 1···lr do
8:     if ni_j + i ≤ |LPG(Z, M, j)| then
9:       np = ni_j + i
10:    else
11:      np = |LPG(Z, M, j)|
12:    end if
13:    Obtain np min-cut partitions of LPG(Z, M, j)
14:  end for
15:  Compute paths for inter-switch flows (Section 5)
16:  If valid paths found, save the current design point
17: end for
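The switch-count sweep of Algorithm 1 (steps 2-12) can be sketched as a small generator. The min-cut partitioning (step 13) and path computation (step 15) are deliberately left out, and the function names and the core counts in the example are hypothetical.

```python
import math
from typing import Dict, Iterator


def min_switches(cores_in_layer: int, max_sw_size: int) -> int:
    """Step 3: minimum number of switches needed for one layer."""
    return math.ceil(cores_in_layer / max_sw_size)


def switch_count_schedule(cores_per_layer: Dict[int, int],
                          max_sw_size: int) -> Iterator[Dict[int, int]]:
    """Yield, per iteration i, the number of partitions np for every layer."""
    ni = {layer: min_switches(n, max_sw_size) for layer, n in cores_per_layer.items()}
    # Step 6: iterate until every layer has reached one switch per core.
    last = max(n - ni[layer] for layer, n in cores_per_layer.items())
    for i in range(last + 1):
        # Steps 8-12: np grows by one per iteration, capped at the core count.
        yield {layer: min(ni[layer] + i, n) for layer, n in cores_per_layer.items()}


# Hypothetical design: 4 cores on layer 1, 2 cores on layer 2, switches of size 3.
schedule = list(switch_count_schedule({1: 4, 2: 2}, max_sw_size=3))
```

Each dictionary in the schedule corresponds to one candidate design point; layers that already have one switch per core stay at that count while the other layers keep growing.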

5. PATH COMPUTATION

The procedure to establish physical links and paths for trafﬁc

ﬂows is based on the power consumption increase and latency in

using the link. This cost computation in the 3D case is similar to

the 2D case, such as those presented in [14], [17], but it needs to

account for the max_ill and max_switch_size constraints. Here,

we do not show the entire path computation algorithm, but only

present the steps needed to meet these constraints. In [14], [17],

the authors present methods to remove both routing and message-

dependent deadlocks when computing the paths. We also use the

methods to obtain paths that are free of deadlocks.

DEFINITION 4. Let $nsw$ be the total number of switches used across all the layers and let $layer_i$ be the layer in which switch $i$ is present. Let $ill(i, j)$ be the number of vertical links established between layers $i$ and $j$. Let $switch\_size\_inp_i$ and $switch\_size\_out_i$ be the number of input and output ports of switch $i$. Let $cost_{i,j}$ be the cost of establishing a physical link between switches $i$ and $j$.

In Algorithm 2, we show the use of hard and soft thresholds when evaluating the cost of establishing a physical link between switches i and j. In steps 3 and 4, we assign a cost of INF for establishing a link across switches in non-adjacent layers and for switches in layers that have reached the maximum vertical link (max_ill) threshold. To help meet the maximum link constraint, we assign a very high cost (denoted by SOFT_INF) for establishing links between switches that are in layers whose number of vertical links is close to the max_ill value, a soft threshold denoted by soft_max_ill (steps 5 and 6). From experiments, we found a reasonable value for SOFT_INF to be 10 times the maximum cost of any flow and for soft_max_ill to be a few (2 to 3) links less than the max_ill value. We use a similar technique to meet the maximum switch size constraints (steps 7-10). By applying these softer constraints first, we make it easier for the path computation procedure to find valid paths than when only the hard constraints are used.

Algorithm 2 CHECK_CONSTRAINTS(i, j)

1: for i = 1 to nsw do
2:   for j = 1 to nsw do
3:     if |layer_i − layer_j| ≥ 2 or ill(layer_i, layer_j) ≥ max_ill then
4:       cost_{i,j} = INF
5:     else if |layer_i − layer_j| = 1 and ill(layer_i, layer_j) ≥ soft_max_ill then
6:       cost_{i,j} = SOFT_INF
7:     else if switch_size_inp_i + 1 ≥ max_switch_size or switch_size_out_j + 1 ≥ max_switch_size then
8:       cost_{i,j} = INF
9:     else if switch_size_inp_i + 1 ≥ soft_max_switch_size or switch_size_out_j + 1 ≥ soft_max_switch_size then
10:      cost_{i,j} = SOFT_INF
11:    end if
12:  end for
13: end for
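The threshold checks of Algorithm 2 for a single switch pair can be sketched as below. The Limits container, the function name and the numeric limits in the example are hypothetical. The sketch also assumes that the caller passes 0 for the inter-layer link count when both switches are in the same layer, and it returns None when no threshold applies so that the regular power and latency cost can be used.

```python
import math
from dataclasses import dataclass
from typing import Optional

INF = math.inf


@dataclass
class Limits:
    max_ill: int
    soft_max_ill: int
    max_switch_size: int
    soft_max_switch_size: int
    soft_inf: float  # e.g. 10 times the maximum cost of any flow


def link_cost_override(layer_i: int, layer_j: int, ill: int,
                       inp_i: int, out_j: int, limits: Limits) -> Optional[float]:
    """Return INF, SOFT_INF, or None when the link violates no threshold."""
    distance = abs(layer_i - layer_j)
    if distance >= 2 or ill >= limits.max_ill:                   # steps 3-4
        return INF
    if distance == 1 and ill >= limits.soft_max_ill:             # steps 5-6
        return limits.soft_inf
    if (inp_i + 1 >= limits.max_switch_size
            or out_j + 1 >= limits.max_switch_size):             # steps 7-8
        return INF
    if (inp_i + 1 >= limits.soft_max_switch_size
            or out_j + 1 >= limits.soft_max_switch_size):        # steps 9-10
        return limits.soft_inf
    return None


# Hypothetical limits: soft thresholds sit 2 below the hard ones.
limits = Limits(max_ill=10, soft_max_ill=8, max_switch_size=11,
                soft_max_switch_size=9, soft_inf=1000.0)
```

Because the soft cost is finite, a path through a nearly saturated layer boundary or switch is still possible, but only when every cheaper alternative has been exhausted.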

When paths are computed, if it is not feasible to meet the

max_switch_size constraints, we introduce new switches in the

topology that are used to connect the other switches together. These

indirect switches help in reducing the number of ports needed in the

direct switches. Due to space limitations, in this paper, we do not

explain the details of how the indirect switches are established.

6. SWITCH POSITION COMPUTATION

Once a topology for a particular switch count is obtained, the

next step is to ﬁnd the latency and power consumption on the wires.

In order to do this, based on the input positions of the cores, the

optimal position of the switches needs to be determined. For this,

we model the problem as a Linear Program (LP) [34].

Let us consider a topology with $nsw$ switches. We denote the co-ordinates of a switch $i$ by $(xs_i, ys_i)$, $\forall i \in 1 \cdots nsw$. The goal of the LP is to determine the values of $xs_i$ and $ys_i$ for all switches in the particular topology. The Manhattan distance between a switch $i$ and a core $k$ is given by:

$$coredist_{i,k} = \begin{cases} |xs_i - xc_k| + |ys_i - yc_k|, & \text{if switch } i \text{ is connected to core } k \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

The Manhattan distance between a switch $i$ and a switch $j$ to which it is connected is given by:

$$swdist_{i,j} = \begin{cases} |xs_i - xs_j| + |ys_i - ys_j|, & \text{if switch } i \text{ is connected to switch } j \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The above equations can be easily represented as a set of linear equations [34]. Let $bw\_sw2core_{i,k}$ and $bw\_sw2sw_{i,j}$ be the total bandwidth of traffic flows between switch $i$ and core $k$ and between switches $i$ and $j$, respectively. To minimize the total power consumption of the links, we need to minimize the length of the links weighted by their bandwidth values, so that higher bandwidth links are shorter than lower bandwidth ones. Formulating the objective function mathematically, we get:

$$obj = \sum_{\forall i} \sum_{\forall k} coredist_{i,k} \times bw\_sw2core_{i,k} + \sum_{\forall i} \sum_{\forall j} swdist_{i,j} \times bw\_sw2sw_{i,j} \qquad (3)$$
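To make the objective concrete, the sketch below evaluates Equation 3 for a given set of switch positions. It only evaluates the objective and does not solve the LP; the dictionary layout, the function names and the coordinates and bandwidths in the example are hypothetical. Pairs that are not connected are simply absent from the bandwidth dictionaries, which matches the zero distance in Equations 1 and 2.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """Manhattan distance between two points on the floorplan."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def link_length_objective(switch_pos: Dict[int, Point],
                          core_pos: Dict[int, Point],
                          bw_sw2core: Dict[Tuple[int, int], float],
                          bw_sw2sw: Dict[Tuple[int, int], float]) -> float:
    """Bandwidth-weighted sum of core-to-switch and switch-to-switch link lengths."""
    total = 0.0
    for (i, k), bw in bw_sw2core.items():
        total += manhattan(switch_pos[i], core_pos[k]) * bw
    for (i, j), bw in bw_sw2sw.items():
        total += manhattan(switch_pos[i], switch_pos[j]) * bw
    return total


# Hypothetical two-switch, two-core layout.
switch_pos = {0: (1.0, 1.0), 1: (3.0, 1.0)}
core_pos = {0: (0.0, 0.0), 1: (4.0, 2.0)}
bw_sw2core = {(0, 0): 100.0, (1, 1): 50.0}
bw_sw2sw = {(0, 1): 10.0}
obj = link_length_objective(switch_pos, core_pos, bw_sw2core, bw_sw2sw)
print(obj)  # 2*100 + 2*50 + 2*10 = 320.0
```

Moving a switch closer to its high-bandwidth core lowers the objective more than moving it the same distance toward a low-bandwidth neighbor, which is the behavior the weighting is meant to produce.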

The LP for optimization is written as follows:

$$\begin{aligned} &\text{minimize } obj \\ &\text{subject to Equations 1--3,} \\ &xs_i, ys_i \ge 0, \ \forall i \in 1 \cdots nsw \end{aligned} \qquad (4)$$
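One standard way to express the absolute values of Equation 1 with linear constraints is sketched below. This is a common LP encoding and is offered as an assumption about how the linearization could be done, not as the exact formulation used here; the auxiliary variables $dx_{i,k}$ and $dy_{i,k}$ are hypothetical names introduced for illustration.

```latex
\begin{aligned}
dx_{i,k} &\ge xs_i - xc_k, & dx_{i,k} &\ge xc_k - xs_i, \\
dy_{i,k} &\ge ys_i - yc_k, & dy_{i,k} &\ge yc_k - ys_i, \\
coredist_{i,k} &= dx_{i,k} + dy_{i,k}.
\end{aligned}
```

Since the objective minimizes $coredist_{i,k}$ weighted by a positive bandwidth, each auxiliary variable settles at the corresponding absolute value at the optimum. The switch-to-switch distance of Equation 2 can be treated in the same way, with $xs_j$ and $ys_j$ in place of $xc_k$ and $yc_k$.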

We use the lp_solve package [35] to obtain the optimum solution

for the switch co-ordinates. Even for big applications (65 cores,

tens of switches), the optimal solution is obtained in a few seconds.

However, the optimal positions can result in overlap of switches

among themselves or with the cores. To remove the overlaps, we

use the ﬂoorplanner, Parquet [36], layer by layer. We feed the core

and switch positions as an input solution to the ﬂoorplanner. We

allow it to move the switches around the cores, maintaining the

relative positions of the cores and minimizing the movement of the

switches from the optimal positions computed by the LP. We also

pipeline long links to support full throughput on the NoC and add

Network Interfaces (NIs) to connect the cores to the network. The

resulting design is a valid ﬂoorplan of the NoC.

7. EXPERIMENTS AND CASE STUDIES

For our experiments, we use the switch and link libraries from

[5]. The power consumption and latency numbers of the compo-

nents of the library are obtained after post-layout analysis. We use

65nm low power technology libraries for the layout studies. For

the electrical characteristics of vertical interconnects, we use the

models from [33]. To obtain the electrical characteristics, a wafer-

to-wafer bonding technique is used as the underlying 3D integra-

tion technology. The vertical links are shown to have an order of

magnitude lower resistance and capacitance than a horizontal link

of the same dimension. This translates to a traversal delay of less

than 10% of clock cycle for 1 GHz operation and negligible power

consumption on the vertical links.

7.1 Multimedia SoC case study

As an experimental case study, we consider a multimedia SoC,

the Triple Video Object Plane Decoder, which has 38 cores (D_38_tvopd).

The communication graph of the benchmark is presented in Figure

5, where each vertex represents a core and the weight on the edge

represents the bandwidth between the cores expressed in MB/s.

The application is highly heterogeneous in nature, having three in-

dependent decoders working in parallel to improve performance.

Each decoder has 12 cores organized in a pipeline fashion. There

are two extra memories that are shared between the pipelines that

serve as input and output buffers. We consider the design imple-

mented on to 3 layers in 3D. The assignment of cores to the dif-

ferent layers and the ﬂoorplan of each layer were done manually,

such that the performance and manufacturing constraints (such as

thermal issues) are met. The processing cores are placed on the top

and bottom layers, so that they are close to the heat sink. The large

memory cores are all placed on the middle layer because they pro-

duce less heat and because this allows the manufacturer to use an

efﬁcient integration process for implementing the memories. The

ﬂoorplan of the design (along with the network components syn-

thesized by our procedure) is presented in Figure 7.

The data width of the NoC links is fixed to 32 bits, to match the data width of the cores in the design. We allowed the synthesis method to sweep the NoC frequency and obtain NoC design points for different frequencies. From the resulting design points, we found that the lowest operating frequency (500 MHz) resulted in the least power consumption for this design. The power consumption of the NoC designs synthesized by our procedure for different switch counts, at 500 MHz operation, is presented in Figure 8. In the figure, we show the core-to-switch link power, the switch-to-switch link power, the switch power and the total power consumption. The plot starts with 5 switches (on the x-axis), as the maximum size of a switch that supports 500 MHz operation was 11x11 and the top and bottom layers needed 2 switches each (topology shown in Figure 6), as they have more than 10 cores each. Because the number of cores and the communication demand differ from layer to layer, we obtain a different number of switches on each layer.

Figure 5: Communication graph for the D_38_tvopd benchmark

Figure 6: Most power-efficient topology

Figure 7: Resulting 3D floorplan with switches

Figure 8: Power consumption (x-axis: switch count; y-axis: power consumption in mW; curves: switch power, core-to-switch link power, switch-to-switch link power, total power)

Since the area of each 3D layer is small (approximately 20 mm²),

the links are short and switch power has higher impact on the to-

tal power consumption. With increasing switch count, the switch

power increases signiﬁcantly, leading to higher power consump-

tion. For this design, the NoC with 5 switches is most power opti-

mal and the resulting ﬂoorplan is shown in Figure 7.

7.2 Comparisons with mesh

Custom topologies that match the application characteristics can result in large power-performance improvement when compared to the standard topologies, such as mesh and torus [17]. For this comparison, we used the D_38_tvopd benchmark presented in Section 7.1 and five other benchmarks that model different traffic scenarios. We consider three benchmarks, D_36_4, D_36_6 and D_36_8, with 36 cores, each core communicating with 4, 6 and 8 other cores, respectively, modeling designs with multiple local memories. We also consider a benchmark with shared memory bottleneck communication (D_35_bot). For a larger design, we performed tests on D_65_pipe, which has 65 processing elements distributed on three layers and organized in a pipeline fashion. All of the benchmarks are mapped onto 3 layers in 3D.

We compared the custom topologies generated for the benchmarks against an optimized mesh topology. For the optimized mesh, each core is connected to a switch and only the necessary links among switches are opened. The results of the comparison between the best custom topology and the optimized mesh are presented in Figure 9. As can be seen from the results, the topologies synthesized by our method result in large power savings (38% on average) when compared to the optimized mesh topologies. The synthesized topologies also resulted in a 24.5% reduction in average zero-load latency when compared to the optimized mesh-based NoC.

7.3 Impact on inter-layer link constraint

Limiting the number of inter-layer links has a great impact on power consumption and average latency. Reducing the number of TSVs is desirable for improving the yield of a 3D design. However, a very tight constraint on the number of inter-layer links can lead to a significant increase in power consumption. To see the impact of the constraint, we varied the value of the max_ill constraint and performed topology synthesis for each value, for one of the benchmarks (D_36_4). The power and latency values for the different max_ill design points are shown in Figures 10 and 11. The dotted lines in the figures represent points where the max_ill constraint was too tight to produce any feasible topologies. When there is a tight constraint on the inter-layer links, more flows are routed through existing inter-layer links instead of opening new ones. This leads to traversing more intermediate switches and higher switch activity, leading to higher latency and power consumption. Please note that our synthesis algorithm also allows designers to perform such power and latency trade-offs for yield early in the design cycle.

Figure 9: Comparisons with mesh (power consumption in mW of the 3D application-specific and 3D opt-mesh topologies for D_36_4, D_36_6, D_36_8, D_35_bot, D_65_pipe and D_38_tvopd)

Figure 10: Impact of max_ill on power (x-axis: maximum number of inter-layer links, max_ill; y-axis: minimum power consumption in mW)

Figure 11: Impact of max_ill on latency (x-axis: maximum number of inter-layer links, max_ill; y-axis: minimum latency in cycles)

The synthesis algorithm explores a large solution space. However, thanks to the efficient heuristic methods presented, the entire topology design process completed in a few hours for all the experiments, when run on a 2 GHz Linux workstation. Please note that the synthesis process is performed once at design time and the computational time incurred is negligible.

8. ACKNOWLEDGEMENTS

This work is supported by the Swiss National Science Founda-

tion (FNS, Grant 20021-109450/1).

9. CONCLUSIONS

The use of Networks on Chips (NoCs) for communication in 3D

chips has posed new opportunities and challenges for designers.

One of the most important problems is to design the most power-

performance efﬁcient NoC topology that satisﬁes the application

characteristics and 3D technology requirements. In this work, we

presented a synthesis approach to solve this problem. We also pre-

sented methods to place switches optimally on the 3D ﬂoorplan, so

that accurate power and delay numbers are obtained for the wires.

Our detailed comparisons with regular 3D optimized mesh show

that the custom 3D topologies lead to a large reduction in

interconnect power consumption. In the future, we plan to explore tuning the

link data widths to meet the TSV constraints and to improve the

yield of the 3D NoCs.

10. REFERENCES

[1] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, pp. 70-78, Jan. 2002.

[2] P. Guerrier and A. Greiner, "A generic architecture for on-chip packet switched interconnections", Proc. DATE, pp. 250-256, March 2000.

[3] G. De Micheli and L. Benini, "Networks on Chips: Technology and Tools", Morgan Kaufmann, First Edition, July 2006.

[4] K. Goossens et al., "A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SOC Design and Verification", Proc. DATE, 2005.

[5] S. Stergiou et al., "×pipesLite: a Synthesis Oriented Design Library for Networks on Chips", Proc. DATE, pp. 1188-1193, 2005.

[6] J. Hu and R. Marculescu, "Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures", Proc. DATE, March 2003.

[7] S. Murali and G. De Micheli, "SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs", Proc. DAC, 2004.

[8] S. Murali and G. De Micheli, "Bandwidth Constrained Mapping of Cores onto NoC Architectures", Proc. DATE, 2004.

[9] D. Bertozzi et al., "NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip", IEEE TPDS, Feb. 2005.

[10] A. Pinto et al., "Efficient Synthesis of Networks on Chip", Proc. ICCD, pp. 146-150, Oct. 2003.

[11] W. H. Ho and T. M. Pinkston, "A Methodology for Designing Efficient On-Chip Interconnects on Well-Behaved Communication Patterns", Proc. HPCA, 2003.

[12] T. Ahonen et al., "Topology Optimization for Application Specific Networks on Chip", Proc. SLIP, 2004.

[13] K. Srinivasan et al., "An Automated Technique for Topology and Route Generation of Application Specific On-Chip Interconnection Networks", Proc. ICCAD, 2005.

[14] A. Hansson et al., "A Unified Approach to Constrained Mapping and Routing on Network-on-Chip Architectures", Proc. CODES-ISSS, 2005.

[15] X. Zhu and S. Malik, "A Hierarchical Modeling Framework for On-Chip Communication Architectures", Proc. ICCD, pp. 663-671, Nov. 2002.

[16] J. Xu et al., "A design methodology for application-specific networks-on-chip", ACM TECS, May 2006.

[17] S. Murali et al., "Designing Application-Specific Networks on Chips with Floorplan Information", Proc. ICCAD, pp. 355-362, 2006.

[18] W. J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks", IEEE Transactions on Computers, Vol. 39, No. 6, pp. 775-785, 1990.

[19] K. Banerjee et al., "3-D ICs: A Novel Chip Design for Deep-Submicrometer Interconnect Performance & SoC Integration", Proc. of IEEE, 2001.

[20] B. Goplen and S. Sapatnekar, “Thermal Via Placement in 3D ICs”, Proc. Intl.

Symposium on Physical Design, pp. 167, 2005.

[21] J. Cong et al., “A thermal-driven ﬂoorplanning algorithm for 3D ICs”, ICCAD

2004.

[22] W.-L. Hung et al., “Interconnect and thermal-aware ﬂoorplanning for 3D

microprocessors”, Proc. ISQED, March 2006.

[23] S. K. Lim, “Physical Design for 3D System on Package”, IEEE Design & Test

of Computers, vol. 22(6), pp. 532539, 2005.

[24] P. Zhou et al., ”3D-STAF: Scalable temperature and leakage aware

ﬂoorplanning for three-dimensional integrated circuits”, ICCAD 2007.

[25] N. Miyakawa et al., “ New Multi-Layer Stacking Technology and Trial

Manufacture”, 3-D Architectures for Semiconductor Integration and Packaging,

Oct 2007.

[26] R. Weerasekara et al., “Extending Systems-on-Chip to the Third Dimension:

Performance, Cost and Technological Tradeoffs”, Proc. ICCAD, 2007.

[27] V. F. Pavlidis and E. G. Friedman, “Topologies for networks-onchip”, Proc.

SOCC, 2006.

[28] B. Feero and P. P. Pande, “Performance evaluation for three-dimensional

networks-on-chip”, Proc. ISVLSI, 2007.

[29] C. Addo-Quaye, “Thermal-Aware Mapping and Placement for 3-D NoC

Designs”, Proc. SOCC, 2005.

[30] J. Kim et al., “A novel dimensionally-decomposed router for on-chip

communication in 3d architectures”, ISCA, 2007.

[31] F. Li et al., “Design and Management of 3D Chip Multiprocessors Using

Network-in-Memory”, ISCA, pp. 130-141, 2006.

[32] D. Park et al., “MIRA: A Multi-Layered On-Chip Interconnect Router

Architecture”, Proc. ISCA, 2008.

[33] I. Loi, F. Angiolini, L. Benini, Supporting vertical links for 3D networks on

chip: toward an automated design and analysis ﬂow, Proc. Nano-Nets, 2007.

[34] S. Boyd and L. Vandenberghe, “Convex Optimization”, Cambridge University

Press, 2004.

[35] Package available at: http://sourceforge.net/projects/lpsolve

[36] S. N. Adya, I. L. Markov, ”Fixed-outline Floorplanning : Enabling Hierarchical

Design”, IEEE TVLSI, Dec 2003.
