ADAM: Run-time Agent-based Distributed Application
Mapping for on-chip Communication
Mohammad Abdullah Al Faruque, Rudolf Krist, and Jörg Henkel
University of Karlsruhe, Chair for Embedded Systems, Karlsruhe, Germany
{alfaruque, krist, henkel} @ informatik.uni-karlsruhe.de
ABSTRACT
Design-time decisions can often only cover certain scenarios and
lose efficiency when hard-to-predict system scenarios occur. This
drives the development of run-time adaptive systems. To the best
of our knowledge, we present the first scheme for run-time application
mapping performed in a distributed manner using agents, targeting
adaptive NoC-based heterogeneous multi-processor systems.
Our approach reduces the overall traffic produced to collect
the current state of the system (monitoring traffic), which is needed
for run-time mapping, compared to a centralized mapping scheme. In our
experiments, we obtain 10.7 times lower monitoring traffic compared
to the centralized mapping scheme proposed in [8] for a
64×64 NoC. Our proposed scheme also requires fewer execution cycles
than a non-clustered centralized approach: we achieve
on average 7.1 times lower computational effort for the mapping
algorithm compared to the simple nearest-neighbor (NN) heuristic
proposed in [6] in a 64×32 NoC. We demonstrate the advantage
of our scheme by means of a robot application and a set of multimedia
applications, and compare it to the state-of-the-art run-time
mapping schemes proposed in [6, 8, 19].
Categories and Subject Descriptors: C.3[Special-purpose and
application-based systems]: Real-time and embedded systems
General Terms: Algorithms, Design
Keywords: Agent-based application mapping, On-chip communi-
cation
1. INTRODUCTION AND RELATED WORK
Intel projects the availability of 100 billion transistors on a 300 mm²
die by 2015 [4], which allows thousands of processors
or equivalent logic gates to be integrated on a single die. Heterogeneous Processing
Elements (PEs), i.e. different types of instruction set processors or
reconfigurable hardware on such an architecture, have been proposed for
building energy-efficient systems [19]. Besides the low-power
concern regarding computation, communication in such an architecture
is another dominant factor, since a scalable yet light-weight
on-chip communication infrastructure is needed [4]. This motivates
the development of tile-based heterogeneous Multiprocessor
Systems on Chip (MPSoCs) interconnected by a Network on Chip
(NoC) [1, 7, 9, 13]. In general, related work proposes to design
an application-specific system where the parameters of the fabricated
chip are adjusted at design time.
The more complex a system grows, the more it must be able to
handle efficiently those situations that are unpredictable at design
time. In this case the system needs to adapt itself to the new situation
and therefore the System on Chip (SoC) needs to be designed
with self-adaptiveness in mind. Self-adaptation
in SoC design is relatively new. The idea of adaptivity in future
SoC design is introduced in [11, 14]. Carrying the same spirit into
NoC-based architecture design, we were the first to propose an adaptive
on-chip communication scheme in [11]. An adaptive system
needs to map the tasks of an application to various PEs at run-time
without interfering with the currently executing applications. Doing this
in a transparent way is a challenging research topic.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
DAC 2008, June 8–13, 2008, Anaheim, California, USA.
Copyright 2008 ACM 978-1-60558-115-6/08/0006 ...$5.00.
To solve the problem of mapping tasks to respective processing
elements, several design-time (off-line) mapping algorithms have
been proposed in related work. In [15], Branch and Bound-based,
in [16] Genetic Algorithm-based, and in [12] heuristic-based map-
ping algorithms are proposed. But an adaptive system that changes
its configuration over time requires a re-mapping/run-time mapping
of applications. Possible reasons for the necessity of a run-time
mapping are listed in Section 2. In [19] the authors extend the Min-
Weight algorithm proposed in [5] for solving the problem of run-
time task assignment on heterogeneous processors. The task graphs
are restricted to a small number of vertices or a large number of
vertices with a degree of no more than two. The authors of [6] investigate
the performance of several mapping heuristics promising for run-
time use in NoC-based MPSoCs with dynamic workloads, targeting
NoC congestion minimization. The work presented in [8] proposes
an efficient technique for run-time application mapping onto a ho-
mogeneous NoC platform with multiple voltage levels. Their work
is limited to a homogeneous architecture. A separate control net-
work besides the data network is used which represents an extra
overhead in terms of area and energy consumption. The state-of-the-art
run-time mapping work [6, 8, 19] uses a Centralized
Manager (CM) to conduct the mapping, which is not
scalable in the context of the hundreds or even thousands of cores that
may soon be integrated on a SoC. It suffers from a single point of
failure, a larger volume of monitoring traffic¹, a central point of
communication around the CM (hot-spot), and scalability issues.
The concept of task migration is an integral part of the run-time
application mapping. The study of task migration to move a cur-
rently executing task between different processors which are con-
nected by a network has already been a research focus in the dis-
tributed and parallel computing domain [20]. Now it is used to
facilitate run-time application mapping in adaptive heterogeneous
MPSoCs. The work presented in [2, 17] discusses the issues related to
task migration in MPSoC design, i.e. the cost of interrupting a given
task, saving its context, transmitting all data to a new IP, and restarting the
task on the new IP. In our work we use this approach, though the
details of task migration are beyond the scope of this paper.
The rest of the paper is organized as follows: In Section 2, we
present our motivation and novel contribution. In Section 3, we
introduce our ADAM architecture whereas in Section 4, our novel
clustering algorithm and agent-based run-time application mapping
are explained in detail. Experimental results are discussed in Sec-
tion 5 with Section 6 concluding the paper.
2. MOTIVATION AND NOVEL CONTRIBUTIONS
Let us motivate the need for an agent-based distributed application
mapping for NoCs by means of a simple scenario. We study a
32×32 NoC with a mesh topology. Some events that may require
a re-mapping at run-time in an adaptive system, where design-time
mapping algorithms fail, are given below:
•On-line detection of hardware faults.
•To minimize run-time system costs (i.e. to save energy be-
cause of the low battery status).
•When the user requirements change, e.g. the user wants to
switch video playback to a higher resolution.
¹Monitoring traffic is defined in this paper as the traffic caused by
collecting information about the state of the tiles ni ∈ N (see Def. 2).
•When an adaptive system tries to configure the underlying
NoC infrastructure (i.e. changing the routing algorithm and
the buffer assignment) and if it fails, then the mapping in-
stance of the application needs to be changed [11].
State-of-the-art run-time mapping is handled using a Centralized
Manager (CM) which may bear the following problems:
•Single point of failure.
•Higher computational cost to calculate mapping inside CM.
•Large volume of monitoring-traffic.
•Communication hot-spot: every tile sends the status of its PE to
the CM after every mapping instance, which increases the chance
of a bottleneck around the CM.
To solve the problem of a static design-time mapping algorithm,
which may require high computational effort, we need a scheme
that performs a low-cost (in terms of execution time) mapping
inside a virtual cluster (see Def. 3) constructed at run-time. We
solve the problems of a centralized mapping scheme by using a dis-
tributed mapping inside each virtual cluster. This distributed map-
ping is accomplished by software modules that are autonomous,
modifiable, and exhibit adaptation capabilities. To the best of our
knowledge we are the first to design an agent-based distributed ap-
plication mapping for a NoC platform. The system is analyzed
during run-time and self-adapts in terms of when and how a map-
ping algorithm should be invoked. Our novel contributions are
as follows:
(1) We provide a run-time agent-based distributed mapping algo-
rithm for next generation self-adaptive heterogeneous MPSoCs. Our
mapping algorithm is composed of two main parts: (a) virtual clus-
ter selection and cluster reorganization at run-time, and (b) a map-
ping algorithm inside a cluster at run-time.
(2) We propose a run-time cluster negotiation algorithm that gener-
ates virtual clusters to solve the problems of the centralized map-
ping algorithm.
(3) We present a heuristic-based mapping algorithm that is low-cost
in terms of execution cycles on any instruction set processor and that
minimizes the communication-related energy consumption.
3. OUR ADAM SCHEME
In the following we introduce our run-time Agent-based Dis-
tributed Application Mapping (ADAM) for a heterogeneous MP-
SoC with a NoC.
3.1 Some Definitions
Definitions necessary to explain our run-time ADAM concept
are described in the following:
Definition 1: An application communication task graph (CTG) is
a directed graph Gk = (T, F), where T is the set of all tasks of an
application and F is the set of all flows fi,j between connected
tasks ti and tj, annotated with the inter-task bandwidth requirement.
Definition 2: A heterogeneous MPSoC architecture on a NoC platform,
HMPSoC_NoC, is a directed graph P = (N, V), where the vertices
N are a set of tiles ni and vi,j ∈ V is an edge, the physical
channel between two tiles ni and nj. A tile ni ∈ N is composed
of: a heterogeneous PE, a network interface, a router, local memory,
and a cache.
Definition 3: A cluster is a subset Ci ⊆ N, where N is the set
of tiles nj that belong to the HMPSoC_NoC, and a virtual cluster
Cvi is a cluster with no fixed boundaries deciding
which tiles are included and which tiles are not. It can be created,
resized, and destroyed at run-time.
Definition 4: An agent Ag is a computational entity, which acts
on behalf of others. The construction of an agent is motivated from
[3] where agents are used for distributed network management. The
properties of an agent in our scheme are: an agent (1) is a smaller
task closer to the system, (2) it must do resource management, (3)
it may need memory to store state information for the resources,
(4) it must be executable on any processing element, (5) it must be
migratable, (6) it must be recoverable, and (7) it may be destroyed
if the cluster no longer exists. An agent-based mapping scheme
provides a flexible framework for run-time mapping because it has
the negotiation capability among the clusters distributed over the
whole chip and it is not dependent on the design-time parameters
(see above properties).
[Figure 1 shows the flow of our ADAM approach as a state machine:
a received mapping request triggers a search for the next suitable
cluster via global agent negotiation; if no suitable cluster exists
but migration or re-clustering is possible, the search continues for a
cluster that becomes suitable after task migration and then after
re-clustering, each step repeated while the QoS requirements are not
met and further migration or re-clustering is possible, until the
application is mapped successfully.]
Figure 1: Flow of our ADAM approach
Definition 5: A cluster agent CA ∈ Ag is an agent that is responsible
for mapping operations within the cluster Ci. The cluster
agent is located in the processing element pj^Ci, where the index j of
pj denotes that the cluster agent can be mapped to any PE of the
cluster. The CA stores the information about the cluster that the
agent is responsible for (see Tables 1, 2).
Definition 6: A global agent GA is an agent that stores the information
for performing the mapping operations to a selected cluster.
It stores information regarding the current usage of communication
and computation resources for each cluster, and this information is
used for the selection and re-organization of the clusters (see Table 1).
A GA is movable, and the stored information is light-weight and easily
recoverable (there are multiple instances of the global agents).
Definition 7: The application mapping function is given by m :
T → N, ti ↦ nj, and the run-time mapping function mrun maps
the instance of the task graph set Gt at time t to the HMPSoC_NoC.
Definition 8: A binding is a function b : T → Tps, ti ↦ tpPE,
where T is the set of all tasks of an application and Tps is the set
of PE types used on the HMPSoC_NoC. The function
assigns each task ti of the CTG a favorable type of PE. After the
binding operation is completed, the tasks are allowed to be mapped
only to PEs of the type given by the binding function b.
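To make Definitions 1 and 8 concrete, a minimal encoding could look as follows (an illustrative Python sketch; the data layout and the example energy numbers are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Flow:
    src: str      # source task id
    dst: str      # destination task id
    bw_req: int   # inter-task bandwidth requirement (Def. 1)

@dataclass
class CTG:
    tasks: set    # T: all tasks of the application
    flows: list   # F: flows between connected tasks

def bind_min_energy(ctg, energy):
    """A binding b : T -> Tps (Def. 8) that assigns each task the PE
    type with the lowest energy cost. energy[(task, pe_type)] holds
    illustrative per-task energy figures."""
    pe_types = {tp for (_, tp) in energy}
    return {t: min(pe_types, key=lambda tp: energy[(t, tp)])
            for t in ctg.tasks}

# Tiny example: two tasks, one flow, two PE types.
ctg = CTG(tasks={"t1", "t2"}, flows=[Flow("t1", "t2", bw_req=10)])
energy = {("t1", "tp1"): 2.8, ("t1", "tp2"): 5.0,
          ("t2", "tp1"): 4.9, ("t2", "tp2"): 5.0}
b = bind_min_energy(ctg, energy)   # both tasks bind to tp1 here
```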
3.2 The ADAM Flow
An overview of our ADAM system is presented in Fig. 1. The
run-time mapping in our scheme is achieved through a negotiation
policy among the Cluster Agents (CAs) and Global Agents (GAs)
that are distributed over the whole chip at a given instance of time.
In Fig. 1 an application mapping request is sent to the CA of the
requesting cluster, which receives all mapping requests and negotiates
with the GAs. There can be multiple instances of the GAs that are
synchronized over time. The GAs have global information about all the
clusters of the NoC in order to decide onto which cluster the
application should be mapped. Possible replies to this
mapping request are:
1. When a suitable cluster for the application exists, the GAs
inform the requesting source CA, which then asks the
destination CA for the actual mapping of the application.
2. When no suitable cluster is found by the GAs, they report the
next most promising cluster, to which the application can be
mapped after task migration; the migration is negotiated
between the GA and the CA to make this cluster suitable
for the mapping. The number of iterations is a configuration
parameter.
3. When neither a suitable cluster nor a candidate cluster for
task migration is found, the re-clustering concept is
used. It tries to acquire PEs from the neighboring clusters
(see Subsection 4.1). If the requirements are met after
re-clustering, the application may be mapped to that cluster.
This step is iterated a number of times specified by the
configuration.
If all the above-mentioned options do not lead to a successful map-
ping (the application and the system constraints are not met), then
the mapping request is refused and reported to the requester. The
requester waits until some resources are freed to proceed with the
mapping. In the next section the detailed description of the run-
time mapping algorithm using our ADAM concept is presented.
Algorithm 1 Suitable cluster negotiation
input: CTG, {nhistc[] | c is a cluster} (a), (f)
output: c, b[] (suitable cluster and binding) (f), (g)
u(tp, t): the comp. resource requirement when task t is bound to tp (c)
u[tp]: the total comp. resource requirement for the PE type in CTG
E(tpj, ti): computation energy when task ti is bound to tpj (d)
nloop: constant, number of matching loop iterations
1: for all ti ∈ CTG do // min-energy binding (d), thist calc.,
                        // and summarize u(tp) in CTG
2:   b[ti] = min over tpj of { E(tpj, ti) = u(tpj, ti) · (E[100](tpj) −
     E[0](tpj)) + E[0](tpj) } // initial binding, min. energy (d)
3:   u[b[ti]] = u[b[ti]] + u(b[ti], ti) // columns of res. req. profile (c)
4:   k = u(tpj, ti) · ncl
5:   thist[b[ti], k] = thist[b[ti], k] + 1 (e)
6: end for
7: sort thist by u[tp] desc
8: tpmax = max over tpj of { u[tpj] }
9: sort {c ⊆ N | c is a cluster} by uc[tpmax]
10: for all c ⊆ N, c is a cluster do
11:   sort nhistc by u[tp] asc
12:   match thist and nhistc (Eq. (1))
13:   store mismatch[c, iloop] = (tpj, kmis, qnt_tsk,mis)
14:   if matched or iloop = nloop then
15:     leave loop
16:   end if
17: end for
18: if iloop = nloop then
19:   for all c ⊆ N, c is a cluster (init: iloop = 0) do
20:     (tpj, kmis, qnt_tsk,mis) = mismatch[c, iloop]
21:     move qnt_tsk,mis tasks with max_t{u[b[t]]} from tpj to
        another PE type with min_tp{E(tp, tasks)}
22:     match thist and nhistc
23:     if not matched or iloop = nloop then
24:       restore b[] to min-energy binding; leave loop
25:     end if
26:   end for
27: end if
28: if not matched: find cluster and tasks to migrate
29: if not matched: find cluster and tasks to re-cluster
30: return b[], c
4. ALGORITHM FOR RUN-TIME MAPPING
In this section we present our detailed algorithm of run-time
Agent-based Distributed Application Mapping (ADAM) which has
the following two components: (1) a cluster negotiation algorithm
and (2) a mapping algorithm inside a virtual cluster.
4.1 Cluster Negotiation Algorithm
Here we present our run-time suitable cluster negotiation algo-
rithm (see Alg. 1). The algorithms (Alg. 1 and Alg. 2) have the
following important input and output data objects:
•The application CTG, G, with the required computational resource
profile for each task. G is given by a set of entries, one for each
flow: entry = (id_src, id_dst, bw_req, lat, RR_tp). Here, id_src
and id_dst are the ids of the source and destination task of
the flow, respectively, bw_req is the required bandwidth of the
flow, lat is the communication latency, and RR_tp is the resource
requirement on each PE type that is needed for a task
to ensure a successful execution.
•The state information about all clusters is stored in a summarized
format by the GAs (Table 1 and data object nhistc).
More detailed information is stored in the CAs (Table 2).
field    | req. memory | short description
tpPE     | log2 #Tps   | PE type id; #Tps = number of PE types
q_tiles  | log2 #Cmax  | #Cmax = number of tiles in a cluster
r_reqtot | log2 #Cmax  | total comp. resources req. by the PE type
q_cl0    | log2 #Cmax  | number of tiles in res. req. class (0, 1/n]
...      | ...         | ...
q_cln    | log2 #Cmax  | number of tiles in res. req. class ((n−1)/n, 1]
Table 1: Global agent: entry of the cluster PE type LUT
•Energy Model: To make a binding decision (see Def. 8),
the energy consumption of the different PE types at
different resource requirement levels is needed. To explain
the energy model we take an example from Fig. 2(b): for
the PE type tp2 the energy consumption is specified by
two values, tp2: (4X, 12X), meaning that each PE of type
tp2 consumes 4 units of energy (static energy consumption)
in a fixed time when it uses no processing resources and 12
units of energy when it uses the complete PE resources;
in between, E = u · (E[100%] − E[0%]) + E[0%].
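For illustration, this linear interpolation can be written as a small helper (the numbers are the tp2 example from Fig. 2(b); the function name is ours):

```python
def pe_energy(u, e_idle, e_full):
    """Linear energy model: interpolate between the static (0%
    utilization) and full-load (100%) energy of a PE type.
    E = u * (E[100%] - E[0%]) + E[0%]."""
    return u * (e_full - e_idle) + e_idle

# PE type tp2 from the paper's example: 4 units idle, 12 at full load.
half_load = pe_energy(0.5, 4, 12)   # 8.0 energy units at 50% utilization
```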
•thist[] and nhistc[] are two data objects that store the resource
requirement histograms within the local memory of
the CAs and GAs: thist for the resources required by the
tasks and nhistc for the actual PE resource usage status of
the cluster c (cf. Fig. 2(e), (f)). Each entry thist[tp, k] gives
the number of tasks of a given type that fall into resource
requirement class ((k−1)/ncl, k/ncl], and each entry nhistc[tp, k]
gives the actual number of tiles of a given type in resource
requirement class ((k−1)/ncl, k/ncl].
•The output data is the selected virtual cluster to which the application
will be mapped and the binding of the tasks to
the PEs, ∀ti ∈ T : b(ti) ∈ Tps (see Def. 8).
[Figure 2 gives a suitable-cluster and binding example: (a) a task
graph of five tasks with annotated flow bandwidths; (b) the energy
of four PE types at 0% and 100% resource usage, e.g. E[100%](tp3)
= 17X; (c) the resource requirement profile of the tasks per PE type;
(d) the resulting per-task energy consumption and the min-energy
binding; (e) the tasks' computational resource requirements by
classes; (f) the PE availability in a cluster by classes; and (g) the
final binding.]
Figure 2: Suitable cluster and binding example
The matching of the two data objects nhistc and thist is the
heart of Alg. 1 and is given by Eq. (1):

∀i ∈ {1, ..., ncl − 1}:  Σ_{j = ncl−i}^{ncl−1} thist[tp, j]  ≤  Σ_{j = 1}^{i} nhistc[tp, j]    (1)
In Fig. 2 we present an example of the cluster searching procedure.
The task graph of an application that is requested to be mapped
is shown in Fig. 2(a). The energy consumed by the various PE types
at different resource requirement levels is given in 2(b) and is
used to calculate the actual energy consumption of every
task on the different types of PEs (see 2(d)). The resource requirements
of the tasks are given in 2(c). Using tables 2(c) and 2(d),
the minimum-energy binding for the tasks of the application is derived.
Using this task binding, Fig. 2(e) shows the resource requirement
profile used to create the histogram corresponding to the data object
thist[]. Fig. 2(f) presents the histogram nhistc[] for a cluster. In
this example, task 2 needs to be rebound to a new PE type during the
algorithm's execution in order to find a suitable cluster with better
energy consumption. Finally, Fig. 2(g) presents the new binding and
the selection of the cluster. The complexity of our cluster negotiation
algorithm is O(m + r · log r), where m is the number of tasks
and r is the number of virtual clusters. Due to this low complexity,
this part of our approach is suitable for run-time use.
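The matching condition of Eq. (1) can be sketched as follows (an illustrative Python sketch; the histogram layout and names are ours, not the paper's implementation):

```python
def cluster_matches(thist, nhist, ncl):
    """Eq. (1): for each PE type tp and each i = 1..ncl-1, the number
    of tasks in the high requirement classes ncl-i .. ncl-1 must not
    exceed the number of tiles in the low usage classes 1 .. i.

    thist[tp] / nhist[tp] are lists indexed 1..ncl (index 0 unused)."""
    for tp in thist:
        for i in range(1, ncl):
            demand = sum(thist[tp][j] for j in range(ncl - i, ncl))
            supply = sum(nhist[tp][j] for j in range(1, i + 1))
            if demand > supply:
                return False
    return True

# ncl = 3 classes, one PE type. The first cluster can cover the tasks;
# the second cannot (2 tasks in class 2, but only 1 tile in class 1).
ok = cluster_matches({"tp1": [0, 1, 1, 0]}, {"tp1": [0, 2, 1, 0]}, 3)
bad = cluster_matches({"tp1": [0, 0, 2, 0]}, {"tp1": [0, 1, 5, 0]}, 3)
```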
[Figure 3 shows the migration protocol as a message sequence between
the parent task, the cluster agent, the source tile, the destination
tile, and the connected tiles: (1) the parent task sends a migration
request; (2) the CA freezes the source tile, the connected tiles, and
the destination tile; (3) the task and its context switch are migrated;
(4) success (Succ_mig) is reported; (5) the freezes are released;
(6) done.]
Figure 3: Task migration to support run-time application mapping
In case a suitable cluster cannot be found in Alg. 1, the algorithm
starts looking for clusters which support task migration. Task migration²
as an integral part of our run-time mapping algorithm is demonstrated
in Fig. 3. The parent task sends a migration request to the
CA, which upon receiving the request freezes the source tile, the tiles
connected to the source, and the destination tile for a successful and
transparent migration. Then the migration is performed with all
local data of the executing task, the state of the task, and even
the modified binary of the task (the binary of the application may
need to be changed to make it executable on a different instruction
set processor). Feedback is then provided to the CA.
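The freeze-migrate-release sequence of Fig. 3 can be sketched as follows (a minimal Python sketch with toy tile and agent classes; all class and method names are ours, and the real context save/restore follows [17]):

```python
class Tile:
    """Toy tile holding running tasks; suspend/resume stand in for
    the context save/restore of the migration mechanism."""
    def __init__(self, name):
        self.name, self.tasks = name, {}
    def suspend(self, task):
        return self.tasks.pop(task)        # context + local data
    def resume(self, task, state):
        self.tasks[task] = state           # binary may be re-targeted here

class ClusterAgent:
    def __init__(self):
        self.frozen = set()
    def freeze(self, tile):
        self.frozen.add(tile.name)         # step (2): block traffic to tile
    def release(self, tile):
        self.frozen.discard(tile.name)     # step (5)

def migrate_task(ca, task, src, dst, connected=()):
    """Steps (2)-(5) of Fig. 3, driven by the cluster agent."""
    involved = [src, dst, *connected]
    for t in involved:
        ca.freeze(t)
    try:
        dst.resume(task, src.suspend(task))  # step (3): move the task
        return True                          # step (4): report success
    finally:
        for t in involved:
            ca.release(t)                    # step (5): release freezes

ca = ClusterAgent()
a, b = Tile("a"), Tile("b")
a.tasks["t1"] = {"pc": 0}
ok = migrate_task(ca, "t1", a, b)
```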
[Figure 4 shows the re-clustering flow: the requesting cluster first
asks its neighbors for free PEs and takes any unoccupied PE; if no
free PEs are reported and neighbors are left, it requests the neighbors
to free PEs by migrating tasks and takes a freed PE; failing that, it
requests the neighbors' least utilized PEs to be shared. If a PE is
acquired and the QoS requirements are met, the application is mapped
(re-clustering successful); if no free PEs remain and no neighbors are
left, re-clustering fails and another cluster must be found.]
Figure 4: The re-clustering algorithm flow
When the migration of tasks does not deliver a suitable cluster,
the re-clustering operation shown in Fig. 4 is invoked. First,
a negotiation is done between neighboring clusters to see if there are
some unoccupied PEs that can be given away to the requesting cluster.
If no unoccupied PEs are available, the neighbors are requested
to migrate tasks from some PEs to other PEs of their cluster without
losing performance or violating run-time constraints. If that is not
successful either, the neighboring clusters are asked for their least
utilized PEs, which may be shared with the requesting cluster.
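The three escalation steps above can be sketched as follows (an illustrative Python sketch; the neighbor interface is hypothetical):

```python
class Neighbor:
    """Toy neighbor cluster; the three query methods mirror Fig. 4."""
    def __init__(self, free_pes=(), migratable=(), shareable=()):
        self.free = list(free_pes)       # unoccupied PEs
        self.mig = list(migratable)      # PEs freeable by task migration
        self.share = list(shareable)     # least utilized, shareable PEs
    def give_free_pe(self):
        return self.free.pop() if self.free else None
    def free_pe_by_migration(self):
        return self.mig.pop() if self.mig else None
    def least_utilized_pe(self):
        return self.share.pop() if self.share else None

def recluster(cluster, neighbors):
    """Try progressively more intrusive ways to acquire a PE."""
    for nb in neighbors:                 # 1) take an unoccupied PE
        pe = nb.give_free_pe()
        if pe is not None:
            cluster.add(pe)
            return "free"
    for nb in neighbors:                 # 2) free a PE via task migration
        pe = nb.free_pe_by_migration()
        if pe is not None:
            cluster.add(pe)
            return "migrated"
    for nb in neighbors:                 # 3) share the least utilized PE
        pe = nb.least_utilized_pe()
        if pe is not None:
            cluster.add(pe)
            return "shared"
    return None                          # re-clustering failed

cluster = set()
how = recluster(cluster, [Neighbor(), Neighbor(migratable=["pe7"])])
```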
4.2 The Mapping Algorithm
Our run-time mapping algorithm inside a cluster, managed by the
CA, is motivated by the static mapping algorithm presented in [12],
as it is light-weight in terms of execution cycles and provides a
near-optimal mapping solution. The original algorithm is executed
once at design time; to use it at run-time, it has to be modified
to keep the current instance of the mapping. It is then executed
in the background, reacting to mapping requests
²Details of task migration are beyond the scope of this paper; our
scheme uses the approach presented in [17].
Algorithm 2 Run-time mapping
CTG: input data, application CTG
mpng: output data, mapping of tasks to tiles
tileLUT,clu: state of the physical network
Tps ∈ tileLUT,clu: PE types contained in the model
tpPE: type of a tile's PE, tpPE ∈ Tps
rs_avail(tpPE): gives the available computational resources of
all PEs of the given type tpPE
binding: ∀ti ∈ CTG: ∃b(ti), b: see Definition 8
sorted: Tps, asc, by rs_avail(tpPE) // sorting by availability
                                    // of PE types
1: for all a ∈ Tps do
2:   fa = {fij ∈ tg | bound(ti, a) ∨ bound(tj, a)}
3:   sort(fa, desc, by bw_req(fij ∈ fa))
4:   for all fij ∈ fa do
5:     select ni, nj ∈ tileLUT,clu for ti, tj by min(cmp)
6:     insert(ni, nj into mpng)
7:   end for
8: end for
9: allocate(mpng); update(tileLUT,clu by mpng)
whenever the current instance of the mapping needs to be modi-
fied. The pseudo code of the run-time mapping algorithm inside
each cluster is presented in Alg. 2. The input data is the CTG of the
application and the model tileLU T,clu of the HMPSoCNoC that
stores the current state of the used computation and communication
resources of that particular cluster. The CTG contains the required
energy consumption for each task to be executed on a particular
PE type. The task binding is done in the cluster negotiation step
with the GAs, before the mapping step inside a virtual cluster. The
CTG contains the communication costs for each flow fij between
the tasks ti and tj. The tile-LUT tileLUT,clu contains each tile's
current computation resource usage, the type of the PE of this tile
(tpPE), and the current bandwidth usage of each link. The output
(mpng) is the mapping of tasks to tiles of the network, which is
used to allocate the tiles physically on the network and to update
tileLUT,clu with the added application.
[Figure 5 gives a run-time mapping example: (a) a task graph of five
tasks with flows 1-2 (bw 10), 1-3 (bw 7), 3-4 (bw 5), and 4-5 (bw 11);
(b) the tiles a-f of (part of) the cluster with their PE types; (c) the
tasks placed on the tiles; (d) the available computation resources per
PE type (tp1: 1730%, tp2: 210%, tp3: 370%, tp4: 530%, tp5: 505%)
and the resulting type order; (e) the flows grouped by PE type; (f)
the required computation cost per task (t1: 30%, t2: 25%, t3: 29%,
t4: 40%, t5: 38%); and (g) the computation resources currently in
use by tasks on the tiles.]
Figure 5: Run-time application mapping example
To decide onto which tile of a particular PE type a task should be
mapped, a heuristic described by the cost function c(ti, nj) is used
for the selection of a tile nj for a given task ti:

c(ti, nj) = α · (D(nj) + bwt(nj) + RR(nj)) + β · Σ_{k ∈ Tcon,m} d(k) · vol(k)

where D(n) = (1 / #tiles_clu) · Σ_{l ∈ N} d(n, l) is the average distance
of a tile to all other tiles of the cluster, d(n, l) is the Manhattan
distance between tiles n and l, Tcon,m is the set of all connected
and mapped tasks ti, d(k) is the Manhattan distance between the
mapped tasks, vol(k) is the communication volume between the connected
tasks, RR(nj) is the resource requirement of the PE that will be assigned
to the task, and bwt(nj) is the total bandwidth requirement
of the tasks on the tile.
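As an illustration, the cost function can be evaluated as follows (a Python sketch on a toy 2×2 cluster; the weights α and β are placeholders, since the paper does not fix their values here):

```python
def manhattan(a, b):
    """Manhattan distance d(n, l) between two tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def tile_cost(tile, tiles, bwt, rr, placed, alpha=1.0, beta=1.0):
    """Cost c(ti, nj) of placing a task on `tile`.

    tiles  : coordinates of all tiles in the cluster (for D(n))
    bwt    : total bandwidth requirement of tasks already on the tile
    rr     : resource requirement of the PE for this task
    placed : [(coord_of_mapped_connected_task, comm_volume), ...]"""
    d_avg = sum(manhattan(tile, l) for l in tiles) / len(tiles)  # D(n)
    comm = sum(manhattan(tile, k) * vol for k, vol in placed)
    return alpha * (d_avg + bwt + rr) + beta * comm

# 2x2 cluster; a connected, already-mapped task at (0, 0), volume 10.
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
placed = [((0, 0), 10)]
far = tile_cost((1, 1), tiles, bwt=0.2, rr=0.3, placed=placed)
near = tile_cost((0, 0), tiles, bwt=0.2, rr=0.3, placed=placed)
```

Placing the task next to its communication partner (tile (0, 0)) yields the lower cost, as intended by the heuristic.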
In the following, Alg. 2 is explained using an example (see Fig. 5).
In Fig. 5(a) we present a task graph whose tasks are grouped by
the binding function (shown in different colors) in the earlier
negotiation stage. Fig. 5(b) presents a part of the tiles of the
current cluster, and 5(g) shows the resources currently in use on some of these
[Figure 6 compares the computational effort of mapping against [6]:
(a) mapping computational effort for a fixed cluster size over NoC
sizes from 64 to 4096 tiles, comparing ADAM to the centralized NN,
MAC, and PL approaches from [6]; (b) the cycle breakdown of ADAM's
mapping components (preparation, match, rebind match, migration,
re-clustering, mapping) for 8×8 to 64×64 NoCs; (c) mapping
computational effort for a single cluster over cluster sizes from
64 to 4096 tiles.]
Figure 6: Computation complexity of mapping compared to [6]
tiles, and 5(f) presents the computational resource requirements of
each task of the task graph. In this example the availability of the
resources is presented by the ordered column in a table (Fig. 5(d)).
In Fig. 5(e) we see the first set of flows, ftp2, that connect PEs
of PE type 2: {f12, f13, f34}. The flows are sorted in decreasing
order of their bandwidth requirements. The result
of a successful mapping is illustrated in Fig. 5(c). To obtain a
mapping instance we iterate over the set of flows and select the tiles
onto which the previously unmapped tasks connected by the flows will
be mapped. Then the algorithm continues with the next set of flows,
ftp1, that connect PEs of type 1. The complexity of our
mapping algorithm is O(m · log m + m · n), where m is the number
of tasks and n is the number of tiles in a particular cluster. This
complexity is low compared to the heuristics in [6] when they are used
in a distributed manner, which is verified in the results section
(see Fig. 6).
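The flow-driven placement loop of Alg. 2 can be sketched as follows (a simplified Python sketch; the min-cost tile selection is reduced to picking the first free tile of the bound type, and all data shapes are ours):

```python
def map_cluster(flows, binding, avail, free_tiles):
    """Sketch of Alg. 2: iterate PE types by ascending available
    resources; within each type, place flows by descending bandwidth,
    putting each still-unmapped endpoint task on a free tile of its
    bound PE type (a stand-in for the min-cost pick of line 5)."""
    mpng = {}
    for tp in sorted(avail, key=avail.get):          # scarcest type first
        tp_flows = [f for f in flows
                    if binding[f["src"]] == tp or binding[f["dst"]] == tp]
        tp_flows.sort(key=lambda f: -f["bw"])        # largest flows first
        for f in tp_flows:
            for t in (f["src"], f["dst"]):
                if t in mpng:
                    continue                         # already placed
                cands = [n for n in free_tiles
                         if free_tiles[n] == binding[t]]
                if cands:
                    tile = cands[0]
                    mpng[t] = tile
                    del free_tiles[tile]             # tile now occupied
    return mpng

# Three tasks; t1, t2 bound to tp2 (the scarce type), t3 to tp1.
flows = [{"src": "t1", "dst": "t2", "bw": 10},
         {"src": "t1", "dst": "t3", "bw": 7}]
binding = {"t1": "tp2", "t2": "tp2", "t3": "tp1"}
avail = {"tp1": 1730, "tp2": 210}
tiles = {"a": "tp2", "b": "tp2", "c": "tp1"}
mpng = map_cluster(flows, binding, avail, tiles)
```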
field          | req. memory | short description
id             | log2 #N     | tile id, Def. 2
tpPE           | log2 #Tps   | type of tile's PE (Def. 8)
r_reqcomp      | log2 #Lv    | computation resource req.
bwused         |             | communication bw. usage,
 all directions| log2 #Lv    | per output port, e.g. North
q_vc           |             | virtual channel quantity,
 all directions| max. #VCs   | per output port, e.g. North
Table 2: Fine-grained tile information inside each cluster agent
We study which data objects are needed by the mapping algorithms and what kind of filtering mechanism may be used to reduce the amount of data stored in the GAs. The state information about the tiles and the links of the HMPSoCNoC has to be stored by agents on different levels (GAs, CAs). The CAs need the fine-grained information about their cluster, shown in Tables 1 and 2, to provide the distributed mapping. Table 1 contains the histogram of the computational resource requirements of the PEs. For each cluster there is also an instance of this PE type LUT stored in the GA. The filtering process is as follows: (1) take the “raw” data from the data object described by Table 2, (2) calculate from it the information stored in the data object described by Table 1, and (3) transmit this data from the CAs to the GAs. Another data object stored within each CA is the variable mpng, a LUT shown in Alg. 2. Each entry of this LUT consists of the ids of the source task, the destination task, the assigned tile, and the application, together with the resource requirements for execution, the communication volume, and the required latency.
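The three-step filtering process can be illustrated as below. The field names follow Table 2, but the concrete record layout and encodings are assumptions for illustration only:

```python
from collections import Counter
from dataclasses import dataclass, field

# Raw per-tile record kept by a CA (fields follow Table 2; the concrete
# Python types are an assumption, not the paper's encoding).
@dataclass
class TileInfo:
    tile_id: int                  # id (Def. 2)
    pe_type: int                  # tp_PE, type of the tile's PE (Def. 8)
    comp_level: int               # r_req_comp, computation resource level
    bw_used: dict = field(default_factory=dict)   # per output port, e.g. 'N'
    free_vcs: dict = field(default_factory=dict)  # per output port

def cluster_histogram(tiles):
    """Step (2) of the filtering: condense the raw Table-2 records into the
    per-PE-type histogram of computation resource levels (Table 1).
    Only this condensed histogram is transmitted from the CA to the GA
    in step (3), which is what keeps the GA's storage small."""
    hist = {}
    for t in tiles:
        hist.setdefault(t.pe_type, Counter())[t.comp_level] += 1
    return hist
```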
The run-time flexibility of the mapping algorithm compared to a design-time static mapping algorithm comes at some extra cost: a near-optimal instead of an optimal mapping solution (Fig. 8), extra computation at run-time (Fig. 6(b)), additional traffic to collect information about the current state of the chip (Fig. 7), and finally a monitoring infrastructure implemented in each router to collect information about the current state of the MPSoC. Monitoring hardware is already an integral part of our adaptive on-chip communication scheme presented in [11]. The monitoring module implemented for our adaptive router requires 46 slices on a Xilinx Virtex2 FPGA [21], an LUT (number of entries × 26 bits), an event input FIFO (5 × 12 bits), and a connection input FIFO (5 × 18 bits). The additional monitoring events for our ADAM scheme are added on top of this existing monitoring infrastructure and therefore increase the size of the LUT and the FIFOs. A detailed description of the monitoring module is beyond the scope of this paper.
5. RESULTS AND CASE STUDY ANALYSIS
We have evaluated our ADAM approach using different application scenarios: a robot application (Image Processing Line [18]), several multimedia applications, and applications from TGFF [10]. We show the performance in terms of execution time and the volume of the generated monitoring traffic and compare our results to state-of-the-art centralized approaches [6, 8, 19]. In addition, we compare our cluster-level mapping algorithm to an exhaustive off-line mapping algorithm in order to see how far it is from an optimal solution.
In Fig. 6(a) we compare our approach to the centralized one of [6]. We have partitioned our mapping computation into the several steps shown in Fig. 6(b). The configuration parameters for this experiment are as follows: the average cluster size is 64 and the number of tasks is 48. In this experiment the number of cycles needed to check whether a task can be mapped to a tile is represented by “X” (it may differ depending on the instruction set). We consider that each task has to be checked for a possible assignment to each tile inside a virtual cluster, whereas in the non-clustered approach the tiles of the whole NoC have to be considered. Therefore, our approach reduces the mapping computation complexity: e.g., on a 32×64 system we achieve approx. 7.1 times lower computational effort compared to the simple nearest-neighbor (NN) heuristics proposed in [6]. Fig. 6(c) shows that, when we do not use clustering in our algorithm, our approach scales in the same way as the non-clustered architecture.
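To first order, the dominant term of the mapping effort is the number of task-to-tile feasibility checks. A rough back-of-the-envelope comparison with the experiment's parameters looks as follows; note that this simple bound ignores the per-check cost X and the additional steps of Fig. 6(b), which is why the measured gain (7.1 times) is lower than the raw tile ratio:

```python
def feasibility_checks(num_tasks, num_tiles):
    """Upper bound on task-to-tile checks: in the worst case every task
    has to be tested against every candidate tile."""
    return num_tasks * num_tiles

tasks = 48
cluster_tiles = 64        # average cluster size in the experiment
noc_tiles = 32 * 64       # the non-clustered approach scans the whole NoC

clustered = feasibility_checks(tasks, cluster_tiles)
flat = feasibility_checks(tasks, noc_tiles)
ratio = flat / clustered  # raw bound of 32x; agent negotiation, cluster
                          # selection, and the other steps of Fig. 6(b)
                          # reduce the measured gain to the reported 7.1x
```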
Figure 7: Our ADAM approach compared to the approaches of [8, 19]. The plots show the traffic produced to collect the current MPSoC state (amount of data per mapping instance, in Kbytes, log scale) over the application size (10 to 500 tasks) for ADAM, the centralized approaches [8, 19], and a fully distributed scheme, each on 8×8, 32×32, and 64×64 NoCs.
Fig. 7 demonstrates the advantage of our approach when we consider the communication volume generated by the monitoring module of the router that is needed by the mapping algorithm. We compare our cluster-based distributed approach to a centralized approach [8, 19] and to a fully distributed approach (each tile acts as an individual cluster). The experimental setup is as follows: the number of classes and of PE types is 16, the resource requirement encoding requires 1 Byte, the task id encoding requires 4 Bytes, the number-of-tasks encoding requires 4 Bytes, and the bandwidth encoding requires 1 Byte of memory space. To calculate the mapping traffic produced by our approach we break down the communication into the following parts: (1) transmission of the task histogram thist[] to the GA, (2) transmission of the task graph to the CA of the suitable cluster, (3) reporting of the cluster state to the CA, and (4) transmission of the cluster state to the GA. The experiment shows that our approach noticeably reduces the communication volume caused by the mapping (10.7 times lower on a 64×64 NoC) when the HMPSoCNoC has many tiles.
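Under the encoding sizes of this experimental setup, the per-application share of the monitoring traffic can be estimated roughly as follows. This is a crude sketch: the sizes of parts (3) and (4) depend on the cluster and are modelled here only by the condensed histogram, and message-framing overhead is ignored:

```python
# Encoding sizes taken from the experimental setup above (in bytes).
RES_REQ = 1     # resource requirement
TASK_ID = 4     # task id
NUM_TASKS = 4   # number-of-tasks field
BW = 1          # bandwidth value

def adam_traffic_bytes(num_tasks, num_flows, pe_types=16, levels=16):
    """Rough estimate of the four ADAM communication parts:
    (1) task histogram thist[] to the GA,
    (2) task graph to the CA of the chosen cluster,
    (3)+(4) cluster state to the CA and condensed state to the GA,
        modelled here by the same histogram size as part (1)."""
    thist = pe_types * levels * NUM_TASKS                  # part (1)
    task_graph = (num_tasks * (TASK_ID + RES_REQ)
                  + num_flows * (2 * TASK_ID + BW))        # part (2)
    cluster_state = pe_types * levels * NUM_TASKS          # parts (3)+(4)
    return thist + task_graph + cluster_state
```

The centralized schemes of [8, 19], by contrast, must ship the fine-grained per-tile state of the entire NoC to one manager, which is what makes their curves in Fig. 7 grow with the chip size.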
Figure 8: Comparing ADAM to an exhaustive off-line mapping algorithm. The left part shows the resulting communication volume [MB/s] after mapping for the MPEG, VOPD, MWD, and Image Processing Line (× 1/100 MB/s) applications under ADAM and under exhaustive optimization; the right part shows the resulting mapping instance of the robotics application (tasks: Input, Gauss 1, Gauss 2, Grad, RGB 2 HSV, Skin Filter, Shirt Filter, Post, Output) obtained with ADAM.
Fig. 8 assesses the quality of our cluster-level mapping algorithm. It shows that our approach does not produce the optimal results obtainable by the off-line exhaustive algorithm, which requires a far higher computational effort. Relative to the computational effort spent, however, our approach provides a reasonable near-optimal solution. The communication volume serves as the optimization criterion of the mapping algorithm (reducing it lowers the communication-related energy consumption [15]), and we found on average a deviation of a mere 13.3% compared to the exhaustive mapping algorithm. To make the comparison to the off-line exhaustive mapping algorithm fair, homogeneous tiles have been considered. The near-optimal result is acceptable for run-time task mapping, as it may be traded off against the adaptivity and the lower computational effort.
We have also evaluated our mapping algorithm by means of the robot application presented in [18]. Our algorithm finds a near-optimal communication volume of 120.1 MB/s, whereas the exhaustive off-line mapping algorithm can reduce it to 106.9 MB/s. This result is acceptable since it is obtained at run-time using a heuristic algorithm that consumes 2 times fewer execution cycles than the NN heuristics: the Image Processing Line application takes only 11241 × X cycles using our ADAM algorithm, independent of the particular instruction set processor, compared to the 20480 × X cycles of the NN heuristics proposed in [6] on a 32×64 NoC. We thus observe that our run-time agent-based distributed application mapping approach reduces the overall monitoring traffic compared to a centralized mapping scheme and requires fewer execution cycles than a non-clustered centralized approach.
6. CONCLUSION
We have introduced the first scheme for run-time application mapping performed in a distributed manner using an agent-based approach. We target adaptive NoC-based heterogeneous multi-processor systems. The ADAM scheme generates 10.7 times lower monitoring traffic compared to centralized schemes like the ones proposed in [8, 19] on a 64×64 NoC. Our scheme also requires fewer execution cycles than a non-clustered centralized approach: in our experiments we achieve on average 7.1 times lower computational effort for the run-time mapping algorithm compared to the simple nearest-neighbor (NN) heuristics proposed in [6] on a 64×32 NoC. The flexibility of a run-time adaptive mapping, the 7.1 times lower computational effort, and the 10.7 times lower monitoring traffic counterbalance the slightly less optimal mapping result compared to an optimized run-time centralized mapping algorithm.
7. REFERENCES
[1] L. Benini and G. De Micheli. “Networks on Chips: A new
SoC paradigm”. IEEE Computer, 35(1):70–78, 2002.
[2] S. Bertozzi, A. Acquaviva, D. Bertozzi, and A. Poggiali.
“Supporting task migration in multi-processor systems-on-
chip: a feasibility study”. DATE’06: Proc. of the Conf. on
Design, Automation and Test in Europe, pages 15–20, 2006.
[3] A. Bieszczad, B. Pagurek, and T. White. “Mobile agents for
network management”. IEEE Comm. surveys and tutorials,
1(1):2–9, 1998.
[4] S. Borkar. “Thousand core chips – A technology perspective”.
DAC’07: Proc. of the 44th annual Conf. on Design Automa-
tion, pages 746–749, 2007.
[5] H. Broersma, D. Paulusma, G. J. M. Smit, F. Vlaardinger-
broek, and G. J. Woeginger. “The computational complex-
ity of the minimum weight processor assignment problem”.
WG’04: Proc. of the 30th int. Workshop on Graph-theoretic
concepts in computer science, pages 189–200, 2004.
[6] E. Carvalho, N. Calazans, and F. Moraes. “Heuristics for dy-
namic task mapping in NoC-based heterogeneous MPSoCs”.
RSP’07: Proc. of the 18th IEEE int. workshop on Rapid Sys-
tem Prototyping, pages 34–40, May 2007.
[7] J. Chan and S. Parameswaran. “NoCGEN: A template based
reuse methodology for networks on chip architecture”. VL-
SID’04: Proc. of the 17th int. Conf. on VLSI Design, pages
717–720, 2004.
[8] C.-L. Chou and R. Marculescu. “Incremental run-time appli-
cation mapping for homogeneous NoCs with multiple voltage
levels”. CODES+ISSS’07: Proc. of the 5th IEEE/ACM int.
Conf. on Hardware/software Codesign and system synthesis,
pages 161–166, 2007.
[9] W. J. Dally and B. Towles. “Route packets, not wires: On-
chip interconnection networks”. DAC’01: Proc. of the 38th
Conf. on Design Automation, pages 684–689, 2001.
[10] R. P. Dick, D. L. Rhodes, and W. Wolf. “TGFF: Task graphs
for free”. CODES/CASHE’98: Proc. of the 6th int. workshop
on Hardware/software Codesign, pages 97–101, 1998.
[11] M. A. A. Faruque, T. Ebi, and J. Henkel. “Run-time adaptive
on-chip communication scheme”. ICCAD ’07: Proc. of the
2007 IEEE/ACM int. Conf. on Computer-aided design, pages
26–31, 2007.
[12] A. Hansson, K. Goossens, and A. Rădulescu. “A unified approach to constrained mapping and routing on network-on-chip architectures”. CODES+ISSS’05: Proc. of the 3rd IEEE/ACM int. Conf. on Hardware/software Codesign and system synthesis, pages 75–80, 2005.
[13] J. Henkel, W. Wolf, and S. Chakradhar. “On-chip networks:
A scalable, communication-centric embedded system design
paradigm”. VLSID’04: Proc. of the 17th int. Conf. on VLSI
Design, pages 845–851, 2004.
[14] P. Horn. “Autonomic computing: IBM’s perspective on the
state of information technology”. IBM Corporation, 2001.
[15] J. Hu and R. Marculescu. “Exploiting the routing flexibility
for energy/performance aware mapping of regular NoC archi-
tectures”. DATE’03: Proc. of the Conf. on Design, Automa-
tion and Test in Europe, pages 10688–10693, 2003.
[16] T. Lei and S. Kumar. “A two-step genetic algorithm for map-
ping task graphs to a network on chip architecture”. DSD’03:
Proc. of the Euromicro symposium on Digital Systems Design,
pages 180–189, 2003.
[17] V. Nollet, T. Marescaux, P. Avasare, D. Verkest, and J.-Y.
Mignolet. “Centralized run-time resource management in a
network-on-chip containing reconfigurable hardware tiles”.
DATE’05: Proc. of the Conf. on Design, Automation and Test
in Europe, pages 234–239, March 2005.
[18] P. Azad, A. Ude, T. Asfour, G. Cheng, and R. Dillmann.
“Image-based markerless 3D human motion capture using
multiple cues”. Proc. of the int. workshop on Vision Based
Human-Robot Interaction, 2006.
[19] L. Smit, G. Smit, J. Hurink, H. Broersma, D. Paulusma, and
P. Wolkotte. “Run-time mapping of applications to a hetero-
geneous reconfigurable tiled system on chip architecture”.
FPL’04: Proc. of the IEEE int. Conf. on Field-Programmable
Technology, pages 421–424, 2004.
[20] P. Smith and N. C. Hutchinson. “Heterogeneous process mi-
gration: The Tui system”. Software – Practice and Experi-
ence, 28(6):611–639, 1998.
[21] Xilinx. ”Virtex2 datasheets”. http://www.xilinx.com/.