ADAM: Run-time Agent-based Distributed Application
Mapping for on-chip Communication
Mohammad Abdullah Al Faruque, Rudolf Krist, and Jörg Henkel
University of Karlsruhe, Chair for Embedded Systems, Karlsruhe, Germany
{alfaruque, krist, henkel} @ informatik.uni-karlsruhe.de
ABSTRACT
Design-time decisions can often only cover certain scenarios and fail in efficiency when hard-to-predict system scenarios occur. This drives the development of run-time adaptive systems. To the best of our knowledge, we present the first scheme for run-time application mapping in a distributed manner using agents, targeting adaptive NoC-based heterogeneous multi-processor systems. Our approach reduces the overall traffic produced to collect the current state of the system (monitoring traffic), needed for run-time mapping, compared to a centralized mapping scheme. In our experiments, we obtain 10.7 times lower monitoring traffic compared to the centralized mapping scheme proposed in [8] for a 64×64 NoC. Our proposed scheme also requires fewer execution cycles compared to a non-clustered centralized approach. We achieve on average 7.1 times lower computational effort for the mapping algorithm compared to the simple nearest-neighbor (NN) heuristic proposed in [6] in a 64×32 NoC. We demonstrate the advantage of our scheme by means of a robot application and a set of multimedia applications and compare it to the state-of-the-art run-time mapping schemes proposed in [6, 8, 19].
Categories and Subject Descriptors: C.3 [Special-purpose and application-based systems]: Real-time and embedded systems
General Terms: Algorithms, Design
Keywords: Agent-based application mapping, On-chip communi-
cation
1. INTRODUCTION AND RELATED WORK
Intel projects the availability of 100 billion transistors on a 300mm² die by 2015 [4], which allows the integration of thousands of processors or equivalent logic gates on a single die. Heterogeneous Processing Elements (PEs), i.e. different types of instruction set processors or reconfigurable hardware on such an architecture, are proposed for building energy-efficient systems [19]. Besides the low-power concern regarding computation, communication in such an architecture is another dominant factor, since a scalable but light-weight communication infrastructure is needed on-chip [4]. This motivates the development of tile-based heterogeneous Multiprocessor Systems on Chip (MPSoCs) interconnected by a Network on Chip (NoC) [1, 7, 9, 13]. In general, all related work proposes to design an application-specific system where the parameters of the fabricated chip are adjusted at design time.
The more complex a system grows, the more it must be able to efficiently handle situations that are unpredictable at design time. In this case the system needs to adapt itself to the new situation, and therefore the System on Chip (SoC) needs to be designed with the capability of self-adaptiveness in mind. Self-adaptation in SoC design is relatively new. The idea of adaptivity in future SoC design is introduced in [11, 14]. Taking the same spirit into NoC-based architecture design, we were the first to propose an adaptive on-chip communication scheme in [11]. An adaptive system needs to map the tasks of an application to various PEs at run-time without interfering with the currently executing applications. Doing this in a transparent way is a challenging research topic.
To solve the problem of mapping tasks to respective processing elements, several design-time (off-line) mapping algorithms have been proposed in related work: Branch-and-Bound-based in [15], Genetic Algorithm-based in [16], and heuristic-based in [12]. But an adaptive system that changes its configuration over time requires a re-mapping/run-time mapping of applications. Possible reasons for the necessity of a run-time mapping are listed in Section 2. In [19] the authors extend the MinWeight algorithm proposed in [5] for solving the problem of run-time task assignment on heterogeneous processors. The task graphs are restricted to a small number of vertices or a large number of vertices with a degree of no more than two. The authors of [6] investigate the performance of several mapping heuristics promising for run-time use in NoC-based MPSoCs with dynamic workloads, targeting NoC congestion minimization. The work presented in [8] proposes an efficient technique for run-time application mapping onto a homogeneous NoC platform with multiple voltage levels. Their work is limited to a homogeneous architecture, and a separate control network besides the data network is used, which represents an extra overhead in terms of area and energy consumption. The state-of-the-art run-time mapping work [6, 8, 19] uses a Centralized Manager (CM) for conducting the mapping, which is not scalable in the context of the hundreds or even thousands of cores that may soon be integrated on a SoC. It suffers from a single point of failure, a large volume of monitoring traffic¹, a central point of communication around the CM (hot-spot), and scalability issues.
The concept of task migration is an integral part of run-time application mapping. The study of task migration, i.e. moving a currently executing task between different processors connected by a network, has long been a research focus in the distributed and parallel computing domain [20]. Now it is used to facilitate run-time application mapping in adaptive heterogeneous MPSoCs. The work presented in [2, 17] discusses the issues related to task migration for MPSoC design, i.e. the cost to interrupt a given task, save its context, transmit all data to a new IP, and restart the task in the new IP. In our work we use this approach, though the details of task migration are beyond the scope of this paper.
The rest of the paper is organized as follows: In Section 2, we
present our motivation and novel contribution. In Section 3, we
introduce our ADAM architecture whereas in Section 4, our novel
clustering algorithm and agent-based run-time application mapping
are explained in detail. Experimental results are discussed in Sec-
tion 5 with Section 6 concluding the paper.
2. MOTIVATION AND NOVEL CONTRIBUTIONS
Let us motivate the need for an agent-based distributed application mapping for NoCs by means of a simple scenario. We study a 32×32 NoC with a mesh topology. Some events that may require a re-mapping at run-time for an adaptive system, where design-time mapping algorithms fail, are given below:
• On-line detection of hardware faults.
• To minimize run-time system costs (e.g. to save energy because of a low battery status).
• When the user requirements change, e.g. the user wants to switch video playback to a higher resolution.
¹Monitoring traffic is defined in this paper as the traffic which is caused by collecting information about the state of the tiles, n_i ∈ N (see Def. 2).
• When an adaptive system tries to configure the underlying NoC infrastructure (e.g. changing the routing algorithm and the buffer assignment) and fails, then the mapping instance of the application needs to be changed [11].
State-of-the-art run-time mapping is handled using a Centralized Manager (CM), which may bear the following problems:
• Single point of failure.
• Higher computational cost to calculate the mapping inside the CM.
• Large volume of monitoring traffic.
• A communication hot-spot, since every tile sends the status of its PE to the CM after every mapping instance, which increases the chance of a bottleneck around the CM.
To solve the problem of a static design-time mapping algorithm, which may require a high computational effort, we need a scheme that can perform a low-cost (in terms of execution time) mapping inside a virtual cluster (see Def. 3) constructed at run-time. We solve the problems of a centralized mapping scheme by using a distributed mapping inside each virtual cluster. This distributed mapping is accomplished by software modules that are autonomous, modifiable, and exhibit adaptation capabilities. To the best of our knowledge we are the first to design an agent-based distributed application mapping for a NoC platform. The system is analyzed during run-time and self-adapts in terms of when and how a mapping algorithm should be invoked. Our novel contributions are as follows:
(1) We provide a run-time agent-based distributed mapping algo-
rithm for next generation self-adaptive heterogeneous MPSoCs. Our
mapping algorithm is composed of two main parts: (a) virtual clus-
ter selection and cluster reorganization at run-time, and (b) a map-
ping algorithm inside a cluster at run-time.
(2) We propose a run-time cluster negotiation algorithm that gener-
ates virtual clusters to solve the problems of the centralized map-
ping algorithm.
(3) We present a heuristic-based mapping algorithm that is low-cost in terms of execution cycles on any instruction set processor and minimizes the communication-related energy consumption.
3. OUR ADAM SCHEME
In the following we introduce our run-time Agent-based Dis-
tributed Application Mapping (ADAM) for a heterogeneous MP-
SoC with a NoC.
3.1 Some Definitions
Definitions necessary to explain our run-time ADAM concept
are described in the following:
Definition 1: An application communication task graph (CTG) is a directed graph G_k = (T, F), where T is the set of all tasks of an application and f_{i,j} ∈ F is a flow between the connected tasks t_i and t_j, annotated by the inter-task bandwidth requirement.
Definition 2: A heterogeneous MPSoC architecture on a NoC platform, HMPSoC_NoC, is a directed graph P = (N, V), where the vertices N are a set of tiles n_i and v_{i,j} ∈ V presents an edge, the physical channel between two tiles n_i and n_j. A tile n_i ∈ N is composed of a heterogeneous PE, a network interface, a router, local memory, and a cache.
Definition 3: A cluster is a subset C_i ⊆ N, where N is the set of tiles n_j that belong to the HMPSoC_NoC, and a virtual cluster C_{v_i} is a cluster where there are no fixed boundaries to decide which tiles are included and which tiles are not. It can be created, resized, and destroyed at run-time.
Definition 4: An agent Ag is a computational entity which acts on behalf of others. The construction of an agent is motivated by [3], where agents are used for distributed network management. The properties of an agent in our scheme are: an agent (1) is a small task close to the system, (2) must perform resource management, (3) may need memory to store state information for the resources, (4) must be executable on any processing element, (5) must be migratable, (6) must be recoverable, and (7) may be destroyed if the cluster no longer exists. An agent-based mapping scheme provides a flexible framework for run-time mapping because it offers negotiation capability among the clusters distributed over the whole chip and it is not dependent on design-time parameters (see the above properties).
Figure 1: Flow of our ADAM approach (state diagram: upon a mapping request, the scheme searches for a suitable cluster via global agent negotiation; if none exists, it searches for a cluster that becomes suitable after task migration or, failing that, after re-clustering; each stage either maps the application when the QoS requirements are met or hands over to the next stage until no further migration or re-clustering is possible)
Definition 5: A cluster agent CA ∈ Ag is an agent that is responsible for mapping operations within the cluster C_i. The cluster agent is located in the processing element p_j^{C_i}, where the index j of p_j denotes that the cluster agent can be mapped to any PE of the cluster. The CA stores the information about the cluster that the agent is responsible for (see Tables 1 and 2).
Definition 6: A global agent GA is an agent that stores the information for performing the mapping operations to a selected cluster.
It stores information regarding the current usage of communication
and computation resources for each cluster and this information is
used for selection and re-organization of the clusters (see Table 1).
GA is movable and the stored information is light-weight and eas-
ily recoverable (there are multiple instances of the global agents).
Definition 7: The application mapping function is given by m: T ∋ t_i → n_j ∈ N, and the run-time mapping function m_run maps the instance of the task graph set G_t at time t to the HMPSoC_NoC.
Definition 8: A binding is a function b: T ∋ t_i → tp_PE ∈ Tps, where T is the set of all tasks of an application and Tps is the set of the PE types that are used on the HMPSoC_NoC. The function assigns each task t_i of the CTG to a favorable type of PE. After the binding operation is completed, the tasks are allowed to be mapped only to PEs of the type given by the binding function b.
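To make these definitions concrete, the following minimal sketch shows one possible plain-Python representation of the CTG, tiles, clusters, and binding; all field names are illustrative assumptions and not taken from the paper.

```python
# Illustrative data structures for Definitions 1-8 (field names are assumed).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Flow:                  # f_{i,j} in the CTG (Def. 1)
    src: int                 # id of the source task t_i
    dst: int                 # id of the destination task t_j
    bw_req: float            # inter-task bandwidth requirement

@dataclass
class CTG:                   # application communication task graph G_k = (T, F)
    tasks: List[int]         # T: all tasks of the application
    flows: List[Flow]        # F: all flows between connected tasks

@dataclass
class Tile:                  # tile n_i of the HMPSoC_NoC (Def. 2)
    tile_id: int
    pe_type: str             # heterogeneous PE type, tp_PE in Tps (Def. 8)
    res_used: float          # fraction of computational resources in use

# A binding b: T -> Tps (Def. 8) records one favorable PE type per task,
# and a (virtual) cluster (Def. 3) is simply a subset of tile ids that can
# grow, shrink, or disappear at run-time.
Binding = Dict[int, str]
Cluster = List[int]
```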
3.2 The ADAM Flow
An overview of our ADAM system is presented in Fig. 1. The run-time mapping in our scheme is achieved by using a negotiation policy among the Cluster Agents (CAs) and Global Agents (GAs) that are, at a certain instance of time, distributed over the whole chip. In Fig. 1 an application mapping request is sent to the CA of the requesting cluster, which receives all mapping requests and negotiates with the GAs. There can be multiple instances of the GAs that are synchronized over time. The GAs have global information about all the clusters of the NoC in order to decide onto which cluster the application should be mapped. Possible replies to this mapping request are:
1. When a suitable cluster for the application exists, the GAs inform the requesting source CA, and the requesting source CA asks the suitable destination CA for the actual mapping of the application.
2. When no suitable cluster is found by the GAs, they report the next most promising cluster to which the application can be mapped after a task migration, which is negotiated between the GA and the CA to make this cluster suitable for the mapping. The number of iterations is a configuration parameter.
3. When neither a suitable cluster nor a candidate cluster for task migration is found, then the re-clustering concept is used. It tries to acquire PEs from the neighboring clusters (see Subsection 4.1). If the requirements are met after re-clustering, then the application may be mapped to that cluster. This step is iterated for a number of times specified by the configuration.
If all the above-mentioned options do not lead to a successful map-
ping (the application and the system constraints are not met), then
the mapping request is refused and reported to the requester. The
requester waits until some resources are freed to proceed with the
mapping. In the next section the detailed description of the run-
time mapping algorithm using our ADAM concept is presented.
Algorithm 1 Suitable cluster negotiation
input: CTG, {nhist_c[] | c is a cluster} (a), (f)
output: c, b[] (suitable cluster and binding) (f), (g)
u(tp, t): the computational resource requirement when task t is bound to tp (c)
u[tp]: the total computational resource requirement for the PE type in the CTG
E(tp_j, t_i): computation energy when task t_i is bound to tp_j (d)
n_loop: constant, number of matching loop iterations
1: for all t_i ∈ CTG do   // min-energy binding (d) & thist calc. & summarize u(tp) in CTG
2:    b[t_i] = min_{tp_j} { E(tp_j, t_i) = u(tp_j, t_i) · (E[100%](tp_j) − E[0%](tp_j)) + E[0%](tp_j) }   // initial binding, min. energy (d)
3:    u[b[t_i]] = u[b[t_i]] + u(b[t_i], t_i)   // columns of res. req. profile (c)
4:    k = u(tp_j, t_i) · n_cl
5:    thist[b[t_i], k] = thist[b[t_i], k] + 1   (e)
6: end for
7: sort thist by u[tp] desc
8: tp_max = max_{tp_j} { u[tp_j] }
9: sort {c ⊆ N | c is a cluster} by u_c[tp_max]
10: for all c ⊆ N, c is a cluster do
11:    sort nhist_c by u[tp] asc
12:    match thist and nhist_c (Eq. (1))
13:    store mismatch[c, i_loop] = (tp_j, k_mis, qnt_tsk,mis)
14:    if matched or i_loop = n_loop then
15:       leave loop
16:    end if
17: end for
18: if i_loop = n_loop then
19:    for all c ⊆ N, c is a cluster, (init: i_loop = 0) do
20:       (tp_j, k_mis, qnt_tsk,mis) = mismatch[c, i_loop]
21:       move qnt_tsk,mis tasks with max_t{u[b[t]]} from tp_j to another PE type with min_tp{E(tp, tasks)}
22:       match thist and nhist_c
23:       if not matched or i_loop = n_loop then
24:          restore b[] to min-energy binding; leave loop
25:       end if
26:    end for
27: end if
28: if not matched: find cluster and tasks to migrate
29: if not matched: find cluster and tasks to re-cluster
30: return b[], c
4. ALGORITHM FOR RUN-TIME MAPPING
In this section we present our detailed algorithm of run-time
Agent-based Distributed Application Mapping (ADAM) which has
the following two components: (1) a cluster negotiation algorithm
and (2) a mapping algorithm inside a virtual cluster.
4.1 Cluster Negotiation Algorithm
Here we present our run-time suitable cluster negotiation algo-
rithm (see Alg. 1). The algorithms (Alg. 1 and Alg. 2) have the
following important input and output data objects:
• The application CTG, G, with required computational resource profiles for each task. G is given by a set of entries, one for each flow: entry = (id_src, id_dst, bw_req, lat, RR_tp). Here, id_src and id_dst are the ids of the source and destination task of the flow, respectively, bw_req is the required bandwidth of the flow, lat is the communication latency, and RR_tp is the resource requirement on each PE type that is needed for a task to ensure a successful execution.
• The state information about all clusters is stored in a summarized format by the GAs (Table 1 and the data object nhist_c). More detailed information is stored in the CA (Table 2).
field     | req. memory | short description
tp_PE     | log2 #Tps   | PE type id, #Tps = # of PE types
q_tiles   | log2 #Cmax  | #Cmax = # of tiles in a cluster
r_req_tot | log2 #Cmax  | total comp. resources req. by the PE type
q_cl_0    | log2 #Cmax  | # tiles in res. req. class (0, 1/n]
...       | ...         | ...
q_cl_n    | log2 #Cmax  | # tiles in res. req. class ((n−1)/n, 1]
Table 1: Global agent: entry of the cluster PE type LUT
• Energy Model: To make a binding decision (see Def. 8), the amount of energy consumption for different PE types at different resource requirement levels is needed. To explain the energy model we take an example from Fig. 2(b), where for the PE type tp2 the energy consumption is specified by two values: tp2: (4X, 12X). This means that each PE of type tp2 consumes 4 units of energy (static energy consumption) in a fixed time when it uses no processing resources and 12 units of energy when it consumes the complete PE resources; otherwise E = u · (E[100%] − E[0%]) + E[0%]. A small sketch of this interpolation and the resulting minimum-energy binding is given after this list.
• thist[] and nhist_c[] are two data objects that store the resource requirement histograms within the local memory of the CAs and GAs: thist for the required resources of the tasks and nhist_c for the actual PE resource usage status of the cluster c (i.e. Fig. 2(e), (f)). Each entry thist[tp, k] gives the number of tasks of a given type that are part of the resource requirement class ((k−1)/n_cl, k/n_cl], and each entry nhist_c[tp, k] gives the actual number of tiles of a given type in the resource requirement class ((k−1)/n_cl, k/n_cl].
• The output data are the selected virtual cluster to which the application will be mapped and the binding of the tasks to the PEs, ∀t_i ∈ T: b(t_i) ∈ Tps (see Def. 8).
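As a concrete illustration of the energy model and of lines 1–6 of Alg. 1, the sketch below interpolates the energy linearly between E[0%] and E[100%], picks the minimum-energy PE type per task, and fills the thist histogram. The data layout and the mapping of a resource requirement to a class index via a ceiling are assumptions; the numeric values are chosen in the spirit of Fig. 2(b)–(d) with only a subset of tasks and PE types.

```python
# Minimal sketch (assumed data layout) of the min-energy binding and the
# thist histogram construction, lines 1-6 of Alg. 1.
import math

def energy(u, e0, e100):
    """E = u * (E[100%] - E[0%]) + E[0%] for a task with resource req. u."""
    return u * (e100 - e0) + e0

def min_energy_binding(tasks, pe_types, n_cl):
    """tasks: {task: {pe_type: resource requirement u in (0, 1]}}
    pe_types: {pe_type: (E[0%], E[100%])}
    Returns the binding b[] and the histogram thist[pe_type][class 1..n_cl]."""
    binding = {}
    thist = {tp: [0] * (n_cl + 1) for tp in pe_types}
    for t, req in tasks.items():
        # line 2: pick the PE type with the lowest interpolated energy
        best = min(req, key=lambda tp: energy(req[tp], *pe_types[tp]))
        binding[t] = best
        # lines 4-5: increment the resource requirement class of that type
        k = math.ceil(req[best] * n_cl)          # class index (assumption)
        thist[best][k] += 1
    return binding, thist

# Two tasks and two PE types in the spirit of Fig. 2: tp2 = (4X, 12X), tp3 = (1X, 17X).
tasks = {"t1": {"tp2": 0.12, "tp3": 0.17}, "t3": {"tp2": 0.22, "tp3": 0.49}}
pe_types = {"tp2": (4.0, 12.0), "tp3": (1.0, 17.0)}
print(min_energy_binding(tasks, pe_types, n_cl=5))
# t1 is bound to tp3 (3.72X < 4.96X), t3 to tp2 (5.76X < 8.84X)
```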
Figure 2: Suitable cluster and binding example — (a) task graph, (b) energy by resource requirement (E[0%] and E[100%] per PE type), (c) resource requirement profile, (d) energy consumption / min-energy binding, (e) task computational resource requirements by class, (f) PE availability in a cluster, (g) binding
The matching of the two data objects nhist_c and thist is the heart of Alg. 1 and is given in Eq. (1):

∀ i ∈ {1, ..., n_cl − 1}:   Σ_{j = n_cl − i}^{n_cl − 1} thist[tp, j]  ≤  Σ_{j = 1}^{i} nhist_c[tp, j]        (1)
In Fig. 2 we present an example of the cluster searching procedure. The task graph of an application that is requested to be mapped is shown in Fig. 2(a). The energy consumed by the various PE types at different resource requirement levels is given in Fig. 2(b) and is used to calculate the actual required energy consumption for every task on different types of PEs (see Fig. 2(d)). The resource requirements of the tasks are given in Fig. 2(c). Using the tables in Fig. 2(c) and 2(d), the minimum-energy binding for the tasks of the application is derived. Using the task binding, Fig. 2(e) shows the resource requirement profile used to create a histogram corresponding to the data object thist[]. Fig. 2(f) presents the histogram nhist_c[] for a cluster. In this example, task 2 needs to be rebound to a new PE type during the algorithm execution in order to find a suitable cluster with better energy consumption. Finally, Fig. 2(g) presents the new binding and the selection of the cluster. The complexity of our cluster negotiation algorithm is O(m + r · log r), where m is the number of tasks and r is the number of virtual clusters. Due to this low complexity, this part of our approach is suitable to be applied at run-time.
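The matching test of Eq. (1) amounts to comparing cumulative sums of the two histograms. The sketch below is a direct transcription of Eq. (1) as reconstructed above; the class-index convention (class 1 = lowest, class n_cl = highest requirement) is an assumption for illustration.

```python
# Illustrative sketch of the per-PE-type matching test of Eq. (1).
def matches(thist_tp, nhist_tp, n_cl):
    """thist_tp[k]: number of tasks of one PE type in requirement class k,
    nhist_tp[k]: number of tiles of that type in class k; both use indices
    1..n_cl (index 0 unused). Returns True if the cluster can accommodate
    the tasks of this PE type according to Eq. (1)."""
    for i in range(1, n_cl):
        demand = sum(thist_tp[j] for j in range(n_cl - i, n_cl))
        supply = sum(nhist_tp[j] for j in range(1, i + 1))
        if demand > supply:
            return False
    return True

# Example with 5 classes: tasks in classes 1 and 4, tiles in classes 1 and 2.
print(matches([0, 2, 0, 0, 1, 0], [0, 3, 1, 0, 0, 0], n_cl=5))   # True
```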
Figure 3: Task migration to support run-time application mapping (the parent task sends a migration request to the cluster agent, which freezes the source tile, the connected tiles, and the destination tile, performs the migration, and then releases the freezes)
In case a suitable cluster cannot be found in Alg. 1, it starts looking for clusters which support task migration. Task migration² as an integral part of our run-time mapping algorithm is demonstrated in Fig. 3. The parent task sends a migration request to the CA, which, upon receiving the request, freezes the source tile, the tiles connected to the source, and the destination tile for a successful and transparent migration. Then the migration is performed with all local data of the executing task, the state of the task, and even the modified binary of the task (the binary of the application may need to be changed to make it executable on different instruction set processors). The feedback is then provided to the CA.
Figure 4: The re-clustering algorithm flow (the requesting cluster first asks its neighbors for free PEs, then asks them to free PEs by internal task migration, and finally asks for their least-utilized PEs to be shared; re-clustering fails when no free PEs are reported and no neighbors are left, and succeeds when the QoS requirements are met and the application can be mapped)
When the migration of tasks does not deliver a suitable cluster, the re-clustering operation shown in Fig. 4 is invoked. First, a negotiation is carried out with the neighboring clusters to see whether there are some unoccupied PEs that can be given away to the requesting cluster. If no unoccupied PEs are available, the neighbors are requested to migrate tasks from some PEs to other PEs of their cluster without violating their performance and run-time constraints. If that is not successful either, the neighboring clusters are asked for their least-utilized PEs, which may be shared with the requesting cluster.
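A sketch of this three-stage negotiation is given below; the cluster and neighbor representations are simplified assumptions used only to mirror the flow of Fig. 4.

```python
# Illustrative sketch of the three-stage re-clustering negotiation (Fig. 4).
def recluster(requesting_cluster, neighbours):
    """neighbours: list of dicts with 'free_pes', 'freeable_pes' (PEs a
    neighbour can free by internal task migration) and 'least_utilized_pe'.
    Returns the acquired or shared PE id, or None if re-clustering fails."""
    # stage 1: ask the neighbours for completely unoccupied PEs
    for nb in neighbours:
        if nb["free_pes"]:
            pe = nb["free_pes"].pop()
            requesting_cluster["tiles"].append(pe)       # take unoccupied PE
            return pe
    # stage 2: ask the neighbours to free a PE by migrating its tasks to other
    # PEs of their own cluster (without violating their constraints)
    for nb in neighbours:
        if nb["freeable_pes"]:
            pe = nb["freeable_pes"].pop()
            requesting_cluster["tiles"].append(pe)       # take freed PE
            return pe
    # stage 3: fall back to sharing a neighbour's least-utilized PE
    for nb in neighbours:
        if nb.get("least_utilized_pe") is not None:
            requesting_cluster["shared"].append(nb["least_utilized_pe"])
            return nb["least_utilized_pe"]               # share PE
    return None                                          # re-clustering failed
```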
4.2 The Mapping Algorithm
Our run-time mapping algorithm inside a cluster, managed by the CA, is motivated by the static mapping algorithm presented in [12], as it is light-weight in terms of execution cycles and provides a near-optimal mapping solution. The given algorithm is executed once at design time; to use it at run-time, it needs to be modified to keep the current instance of the mapping.
²Details of task migration are not discussed in the scope of this paper. Our scheme uses the approach presented in [17].
Algorithm 2 Run-time mapping
CTG: input data, application CTG
mpng: output data, mapping of tasks to tiles
tileLUT,clu: state of the physical network
Tps_tileLUT,clu: PE types contained in the model
tpPE: type of a tile's PE, tpPE ∈ Tps_model
rs_avail(tpPE): gives the available computational resources of all PEs of the given type tpPE
binding: ∀t_i ∈ CTG: b(t_i), b: see Definition 8
sorted: Tps, asc, by rs_avail(tpPE)   // sorting by availability of PE types
1: for all a ∈ Tps do
2:    f_a = {f_ij ∈ CTG | bound(t_i, a) ∨ bound(t_j, a)}
3:    sort(f_a, desc, by bw_req(f_ij ∈ f_a))
4:    for all f^k_ij ∈ f_a do
5:       select n_i, n_j ∈ tileLUT,clu for t_i, t_j by min(cmp)
6:       insert(n_i, n_j into mpng)
7:    end for
8: end for
9: allocate(mpng); update(tileLUT,clu by mpng)
The modified algorithm is then executed in the background, reacting to mapping requests whenever the current instance of the mapping needs to be modified. The pseudo code of the run-time mapping algorithm inside each cluster is presented in Alg. 2. The input data are the CTG of the application and the model tileLUT,clu of the HMPSoC_NoC that stores the current state of the used computation and communication resources of that particular cluster. The CTG contains the required energy consumption for each task to be executed on a particular PE type. The task binding is done in the cluster negotiation step with the GAs, before the mapping step inside a virtual cluster. The CTG also contains the communication costs for each flow f_ij between the tasks t_i and t_j. The tile LUT tileLUT,clu contains each tile's current computation resource usage, the type of the PE of this tile (tpPE), and the current bandwidth usage for each link. The output (mpng) is the mapping of tasks to tiles of the network, which is used to allocate the tiles physically on the network and to update tileLUT,clu with the added application.
Figure 5: Run-time application mapping example — (a) task graph, (b) tiles (part of the cluster), (c) tasks placed on tiles, (d) available computation resources per PE type, (e) flows by PE types, (f) required computation costs per task, (g) current computation resources in use by tasks on tiles
To decide to which tile of a particular PE type a task should be mapped, a heuristic is used, described by the cost function c(t, n), for the selection of a tile n_j for a given task t_i:

c(t_i, n_j) = α · (D(n_j) + bw_t(n_j) + RR(n_j)) + β · Σ_{k ∈ T_con,m} d(k) · vol(k)

where D(n) = (1/#tiles_clu) · Σ_{l ∈ N} d(n, l) is the average distance of a tile to all other tiles of the cluster, d(n, l) is the Manhattan distance between tiles n and l, T_con,m is the set of all connected and mapped tasks t_i, d(k) is the Manhattan distance between the mapped tasks, vol(k) is the communication volume between the connected tasks, RR(n_j) is the resource requirement of the PE that will be assigned to the task, and bw_t(n_j) is the total bandwidth requirement of the tasks on the tile.
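The cost function above translates almost directly into code. The following sketch (with assumed tile coordinates and data structures, and arbitrary default weights α = β = 1) shows how a candidate tile would be scored and how the cheapest tile is selected in line 5 of Alg. 2.

```python
# Illustrative sketch of the tile-selection heuristic c(t_i, n_j) of Alg. 2;
# data structures, coordinates, and the default weights are assumptions.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def tile_cost(tile, cluster_tiles, placed, connected_vol, rr, bw_total,
              alpha=1.0, beta=1.0):
    """tile, cluster_tiles: (x, y) coordinates of the candidate and of all
    tiles in the cluster; placed: {task: (x, y)} of already-mapped tasks;
    connected_vol: {task: communication volume} of the tasks connected to the
    task being placed; rr: resource requirement on this tile; bw_total: total
    bandwidth requirement of the tasks already mapped to this tile."""
    # D(n_j): average Manhattan distance to all other tiles of the cluster
    d_avg = sum(manhattan(tile, other) for other in cluster_tiles) / len(cluster_tiles)
    local = alpha * (d_avg + bw_total + rr)
    # distance-weighted communication to connected, already-mapped tasks
    comm = beta * sum(manhattan(tile, placed[t]) * vol
                      for t, vol in connected_vol.items() if t in placed)
    return local + comm

def select_tile(candidates, cluster_tiles, placed, connected_vol, rr_of, bw_of):
    """Line 5 of Alg. 2: pick the candidate tile with the minimum cost."""
    return min(candidates,
               key=lambda n: tile_cost(n, cluster_tiles, placed, connected_vol,
                                       rr_of[n], bw_of[n]))
```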
In the following, Alg. 2 is explained using an example (see Fig. 5). In Fig. 5(a) we present a task graph whose tasks are grouped by the binding function (shown in different colors) in the earlier negotiation stage. Fig. 5(b) presents a part of the tiles of the current cluster, Fig. 5(g) shows the current resources in use on some of these tiles, and Fig. 5(f) presents the computational resource requirements for each task of the task graph. In this example the availability of the resources is presented by the ordered column in a table (Fig. 5(d)). In Fig. 5(e) we see the first set of flows f_tp2 that connect PEs of PE type 2: {f_12, f_13, f_34}. The flows are sorted in decreasing order according to their bandwidth requirements. The result of a successful mapping is illustrated in Fig. 5(c). To achieve a mapping instance we iterate over the set of flows and select tiles onto which the previously un-mapped tasks connected by the flows will be mapped. Then the algorithm continues with the next set of flows f_tp1 that are connected to PEs of type 1. The complexity of our mapping algorithm is O(m · log m + m · n), where m is the number of tasks and n is the number of tiles in a particular cluster. The complexity is low compared to the heuristics in [6] when it is used in a distributed manner. This fact is verified in the results section (see Fig. 6).
Figure 6: Computation complexity of mapping compared to [6] — (a) mapping computational effort for a fixed cluster size over NoC sizes from 64 to 4096 tiles, (b) mapping components of ADAM (preparation, match, rebind match, migration, re-clustering, mapping) for 8x8 to 64x64 NoCs, (c) mapping computational effort for a single cluster over cluster sizes from 64 to 4096 tiles
field                    | req. memory | short description
id                       | log2 #N     | tile id (Def. 2)
tp_PE                    | log2 #Tps   | type of the tile's PE (Def. 8)
r_req_comp               | log2 #Lv    | computation resource req.
bw_used (all directions) | log2 #Lv    | communication bw. usage per output port, e.g. North
q_vc (all directions)    | max. #VCs   | virtual channel quantity per output port, e.g. North
Table 2: Fine-grained tile information inside each cluster agent
We study which data objects are needed by the mapping algorithms and what kind of filtering mechanism may be used to reduce the amount of data stored in the GAs. The state information about the tiles and the links of the HMPSoC_NoC has to be stored by agents at different levels (GAs, CAs). The CAs need the fine-grained information about the cluster to provide the distributed mapping, as shown in Tables 1 and 2. Table 1 contains the histogram of the computational resource requirements of the PEs. For each cluster there is also an instance of this PE type LUT stored in the GA. The filtering process is as follows: (1) take the “raw” data from the data object described by Table 2, (2) calculate the information stored in the data object described by Table 1, and (3) transmit this data from the CAs to the GAs; a sketch of this step is given below. Another data object stored within each CA is the variable mpng, a LUT shown in Alg. 2. The structure of each entry within this LUT consists of the ids of the source task, destination task, assigned tile, and application, the resource requirements for execution, the communication volume, and the required latency.
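The filtering step (1)–(3) can be illustrated as follows; the per-tile record mirrors Table 2 and the produced summary mirrors a Table 1 entry, but the concrete field names and the class computation are assumptions.

```python
# Illustrative sketch of the CA-to-GA filtering: condensing Table 2 style
# tile records into the Table 1 style per-PE-type histogram.
import math
from collections import defaultdict

def summarize_cluster(tiles, n_cl):
    """tiles: list of dicts with 'pe_type' and 'res_req' (fraction in (0, 1]),
    i.e. the fine-grained view kept by the CA. Returns, per PE type, the tile
    count, total resource requirement, and class histogram sent to the GA."""
    summary = defaultdict(lambda: {"q_tiles": 0, "r_req_tot": 0.0,
                                   "q_cl": [0] * (n_cl + 1)})
    for tile in tiles:
        entry = summary[tile["pe_type"]]
        entry["q_tiles"] += 1
        entry["r_req_tot"] += tile["res_req"]
        k = math.ceil(tile["res_req"] * n_cl)   # class ((k-1)/n_cl, k/n_cl]
        entry["q_cl"][k] += 1
    return dict(summary)

tiles = [{"pe_type": "tp1", "res_req": 0.10},
         {"pe_type": "tp1", "res_req": 0.40},
         {"pe_type": "tp2", "res_req": 0.33}]
print(summarize_cluster(tiles, n_cl=5))   # the summary travels CA -> GA
```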
The run-time flexibility of the mapping algorithm compared to a design-time static mapping algorithm comes with some extra penalties: a near-optimal (rather than optimal) mapping solution (Fig. 8), extra computation at run-time (Fig. 6(b)), additional traffic to collect information about the current state of the chip (Fig. 7), and finally a monitoring infrastructure implemented in each router to collect information about the current state of the MPSoC. Monitoring hardware is already an integral part of our adaptive on-chip communication scheme presented in [11]. The monitoring module implemented for our adaptive router requires 46 slices on a XILINX Virtex2 FPGA [21], an LUT (number of entries × 26 bits), an event input FIFO (5 × 12 bits), and a connection input FIFO (5 × 18 bits). The additional monitoring events for our ADAM scheme are added on top of this existing monitoring infrastructure and therefore increase the size of the LUT and FIFOs. A detailed description of the monitoring module is beyond the scope of this paper.
5. RESULTS AND CASE STUDY ANALYSIS
We have evaluated our ADAM approach using different applica-
tion scenarios: a robot application (Image Processing Line [18]),
some multi-media applications, and applications from TGFF [10].
We show the performance in terms of execution time and the vol-
ume of the generated monitoring traffic and compare our results to
state-of-the-art centralized approaches [6, 8, 19]. In addition, we
compare our cluster-level mapping algorithm to an exhaustive off-
line mapping algorithm in order to see how far it is off from an
optimum solution.
In Fig. 6(a) we compare our approach to the centralized one of [6]. We have partitioned our mapping computation into the several steps shown in Fig. 6(b). The configuration parameters for this experiment are as follows: the average cluster size is 64 and the number of tasks is 48. In this experiment the number of cycles needed to check whether a task can be mapped to a tile is represented by “X” (it may differ depending on the instruction set). We consider that each task has to be checked for a possible assignment to each tile inside a virtual cluster, while in the non-clustered approach the tiles of the whole NoC have to be considered. Therefore, our approach can reduce the mapping computation complexity; e.g. on a 32x64 system we obtain an approx. 7.1 times lower computational effort compared to the simple nearest-neighbor (NN) heuristic proposed in [6]. Fig. 6(c) shows that our approach scales in the same way as the non-clustered architecture when we do not consider the clustering approach in our algorithm.
Figure 7: Our ADAM approach compared to the approaches of [7, 19] — traffic produced to collect the current MPSoC state (amount of data per mapping instance [Kbytes] versus application size [tasks]) for the ADAM, centralized, and fully distributed schemes on 8x8, 32x32, and 64x64 NoCs
Fig. 7 demonstrates the advantage of our approach when we consider the communication volume generated by the monitoring module of the router that is needed by the mapping algorithm. We compare our cluster-based distributed approach to a centralized approach [8, 19] and a fully distributed approach (each tile acts as an individual cluster). The experimental setup is as follows: the number of classes and of PE types is 16; resource requirement encoding requires 1 byte, task id encoding 4 bytes, number-of-tasks encoding 4 bytes, and bandwidth encoding 1 byte of memory space. To calculate the mapping traffic produced by our approach, we break the communication down into the following parts: (1) transmission of the task histogram thist[] to the GA, (2) transmission of the task graph to the CA of the suitable cluster, (3) reporting of the cluster state to the CA, and (4) transmission of the cluster state to the GA. The experiment shows that our approach has noticeable advantages in reducing the communication volume caused by the mapping (10.7 times lower on a 64×64 NoC) when the HMPSoC_NoC has many tiles.
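For a rough intuition of where this traffic comes from, the back-of-the-envelope sketch below sums the four parts with the stated encoding sizes; the message layouts (one counter per type/class in the histograms, one entry per flow for the task graph) are our own assumptions and do not reproduce the exact accounting behind Fig. 7.

```python
# Back-of-the-envelope traffic model per mapping instance; the message
# layouts are assumptions and do not reproduce the exact numbers of Fig. 7.
def adam_traffic_bytes(n_flows, n_pe_types=16, n_classes=16,
                       res_req_bytes=1, task_id_bytes=4, count_bytes=4,
                       bw_bytes=1):
    # (1) task histogram thist[] to the GA: one counter per PE type and class
    thist = n_pe_types * n_classes * count_bytes
    # (2) task graph to the CA of the suitable cluster: one entry per flow
    task_graph = n_flows * (2 * task_id_bytes + bw_bytes + res_req_bytes)
    # (3)+(4) cluster state reported to the CA and summarized to the GA
    cluster_state = n_pe_types * n_classes * res_req_bytes
    return thist + task_graph + 2 * cluster_state

# e.g. a hypothetical application with 47 flows
print(adam_traffic_bytes(n_flows=47))
```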
Figure 8: Comparing ADAM to the exhaustive off-line mapping algorithm — resulting communication volume [MB/s] after mapping for the MPEG, VOPD, MWD, and Image Processing Line (× 1/100 MB/s) applications, and the mapping instance of the robotics application (stages including Gauss filters, gradient, RGB-to-HSV conversion, shirt filter, skin filter, and output)
In Fig. 8 we present a comparison showing the suitability of our cluster-level mapping algorithm. It shows that our approach does not produce the optimum results that can be produced by the off-line exhaustive algorithm, which, however, requires a far higher computational effort. Relative to the consumed computational effort, our approach provides a reasonable near-optimal solution. The communication volume serves as the optimization criterion for the mapping algorithm (it reduces the communication-related energy consumption [15]), and we found on average a deviation of a mere 13.3% compared to the exhaustive mapping algorithm. To make the comparison to the off-line exhaustive mapping algorithm realistic, homogeneous tiles have been considered. The near-optimal result is acceptable for run-time task mapping, as this result is traded off against the adaptivity and the lower computational effort.
We have also evaluated our mapping algorithm by means of the robot application presented in [18]. Our algorithm finds a near-optimal communication volume of 120.1 MB/s, whereas the exhaustive off-line mapping algorithm can reduce it to 106.9 MB/s. The result is acceptable, as we obtain it at run-time using a heuristic algorithm while consuming 2 times fewer execution cycles compared to the NN heuristic. The Image Processing Line application takes only 11241 × X cycles using our ADAM algorithm (orthogonal to any instruction set processor), compared to the NN heuristic proposed in [6], which takes 20480 × X cycles, on a 32×64 NoC. Therefore, we observe that our run-time agent-based distributed application mapping approach reduces the overall monitoring traffic compared to a centralized mapping scheme and requires fewer execution cycles compared to a non-clustered centralized approach.
6. CONCLUSION
We have introduced the first scheme for run-time application mapping in a distributed manner using an agent-based approach. We target adaptive NoC-based heterogeneous multi-processor systems. The ADAM scheme generates 10.7 times lower monitoring traffic compared to a centralized scheme like the ones proposed in [8, 19] on a 64×64 NoC. Our scheme also requires fewer execution cycles compared to a non-clustered centralized approach. In our experiments we achieve on average 7.1 times lower computational effort for the run-time mapping algorithm compared to the simple nearest-neighbor (NN) heuristic proposed in [6] on a 64×32 NoC. The flexibility of a run-time adaptive mapping, the 7.1 times lower computational effort, and the 10.7 times lower monitoring traffic counterbalance the slightly less optimal mapping result compared to an optimized run-time centralized mapping algorithm.
7. REFERENCES
[1] L. Benini and G. De Micheli. “Networks on Chips: A new
SoC paradigm”. IEEE Computer, 35(1):70–78, 2002.
[2] S. Bertozzi, A. Acquaviva, D. Bertozzi, and A. Poggiali.
“Supporting task migration in multi-processor systems-on-
chip: a feasibility study”. DATE’06: Proc. of the Conf. on
Design, Automation and Test in Europe, pages 15–20, 2006.
[3] A. Bieszczad, B. Pagurek, and T. White. “Mobile agents for
network management”. IEEE Comm. surveys and tutorials,
1(1):2–9, 1998.
[4] S. Borkar. “Thousand core chips – A technology perspective”.
DAC’07: Proc. of the 44th annual Conf. on Design Automa-
tion, pages 746–749, 2007.
[5] H. Broersma, D. Paulusma, G. J. M. Smit, F. Vlaardinger-
broek, and G. J. Woeginger. “The computational complex-
ity of the minimum weight processor assignment problem”.
WG’04: Proc. of the 30th int. Workshop on Graph-theoretic
concepts in computer science, pages 189–200, 2004.
[6] E. Carvalho, N. Calazans, and F. Moraes. “Heuristics for dy-
namic task mapping in NoC-based heterogeneous MPSoCs”.
RSP’07: Proc. of the 18th IEEE int. workshop on Rapid Sys-
tem Prototyping, pages 34–40, May 2007.
[7] J. Chan and S. Parameswaran. “NoCGEN: A template based
reuse methodology for networks on chip architecture”. VL-
SID’04: Proc. of the 17th int. Conf. on VLSI Design, pages
717–720, 2004.
[8] C.-L. Chou and R. Marculescu. “Incremental run-time appli-
cation mapping for homogeneous NoCs with multiple voltage
levels”. CODES+ISSS’07: Proc. of the 5th IEEE/ACM int.
Conf. on Hardware/software Codesign and system synthesis,
pages 161–166, 2007.
[9] W. J. Dally and B. Towles. “Route packets, not wires: On-
chip interconnection networks”. DAC’01: Proc. of the 38th
Conf. on Design Automation, pages 684–689, 2001.
[10] R. P. Dick, D. L. Rhodes, and W. Wolf. “TGFF: Task graphs
for free”. CODES/CASHE’98: Proc. of the 6th int. workshop
on Hardware/software Codesign, pages 97–101, 1998.
[11] M. A. A. Faruque, T. Ebi, and J. Henkel. “Run-time adaptive
on-chip communication scheme”. ICCAD ’07: Proc. of the
2007 IEEE/ACM int. Conf. on Computer-aided design, pages
26–31, 2007.
[12] A. Hansson, K. Goossens, and A. Rădulescu. “A unified approach to constrained mapping and routing on network-on-chip architectures”. CODES+ISSS’05: Proc. of the 3rd IEEE/ACM int. Conf. on Hardware/software Codesign and system synthesis, pages 75–80, 2005.
[13] J. Henkel, W. Wolf, and S. Chakradhar. “On-chip networks:
A scalable, communication-centric embedded system design
paradigm”. VLSID’04: Proc. of the 17th int. Conf. on VLSI
Design, pages 845–851, 2004.
[14] P. Horn. “Autonomic computing: IBM’s perspective on the
state of information technology”. IBM Corporation, 2001.
[15] J. Hu and R. Marculescu. “Exploiting the routing flexibility
for energy/performance aware mapping of regular NoC archi-
tectures”. DATE’03: Proc. of the Conf. on Design, Automa-
tion and Test in Europe, pages 10688–10693, 2003.
[16] T. Lei and S. Kumar. “A two-step genetic algorithm for map-
ping task graphs to a network on chip architecture”. DSD’03:
Proc. of the Euromicro symposium on Digital Systems Design,
pages 180–189, 2003.
[17] V. Nollet, T. Marescaux, P. Avasare, D. Verkest, and J.-Y.
Mignolet. “Centralized run-time resource management in a
network-on-chip containing reconfigurable hardware tiles”.
DATE’05: Proc. of the Conf. on Design, Automation and Test
in Europe, pages 234–239, March 2005.
[18] P. Azad, A. Ude, T. Asfour, G. Cheng, and R. Dillmann.
“Image-based markerless 3D human motion capture using
multiple cues”. Proc. of the int. workshop on Vision Based
Human-Robot Interaction, 2006.
[19] L. Smit, G. Smit, J. Hurink, H. Broersma, D. Paulusma, and
P. Wolkotte. “Run-time mapping of applications to a hetero-
geneous reconfigurable tiled system on chip architecture”.
FPL’04: Proc. of the IEEE int. Conf. on Field-Programmable
Technology, pages 421–424, 2004.
[20] P. Smith and N. C. Hutchinson. “Heterogeneous process mi-
gration: The Tui system”. Software – Practice and Experi-
ence, 28(6):611–639, 1998.
[21] Xilinx. “Virtex2 datasheets”. http://www.xilinx.com/.
... An example many-core system platform. . . . . . . 77 Figure 5. 4 Steps of dynamic resource allocation method benefiting from modal nature of applications. . . . 78 Figure 5. 5 Spanning tree construction for DemoCar. . . . . . . ...
... Step 2. Task fetching and schedulability analysis (lines [4][5]: The tasks input FIFO queue is checked if empty (line 4) and a task T i is fetched (line 5). ...
... Resource set Π; outputs : Task mapping; 1 Choose an initial random population of task mappings 2 while not termination condition do 3 Evaluate the number of deadline violations using IA; //criterion (i) 4 Evaluate the makespan using IA; //criterion (ii) 5 Create clusters of individuals with the same number of deadline violations; Perform tournament selection; //criterion (i) has higher priority than criterion (ii) 9 Generate individuals using crossover and mutation; ...
... In [16] introduced a run time agent-based mapping approach. A small task called an agent is one that may run on any NoC node. ...
Article
Full-text available
Multiprocessor System On Chip (MPSoC) with Networks-on-Chip (NoCs) have been proposed to address the communication challenges in modern dynamic applications. One of the key aspects of design exploration in NoC-based MPSoC is application mapping, which is critical for the parallel execution of multiple applications. However, mapping for dynamic workloads becomes challenging due to unpredictable arrival times of applications and the availability of resources. In this work, we propose a hybrid task mapping approach, HyDra, that combines design-time mapping and efficient run time remapping to reduce communication and energy costs. The proposed approach generates multiple application mappings during the design time phase by minimizing latency, energy, and communication costs. The diverse mapping possibilities produced at design time considers multiple performance metrics. However, we cannot predict the arrival time of applications and the availability of resources at design time. To further optimize the MPSoC performance, our dynamic mapping phase re configures the design time mappings based on run time availability of resources and applications. The simulation results show that HyDra reduces communication costs by 14% while using 15% less energy for small and large NoCs compared to state-of-the-art task mapping techniques. Furthermore, our approach provides an average of 19% reduction in end-to-end latency for applications. Our hybrid task allocation and scheduling approach effectively addresses communication issues in NoC-based MPSoCs for dynamic workloads. HyDra achieves improved performance by combining design-time and run time mapping, providing a promising solution for future MPSoC design.
... Their mapping approach takes bandwidth and load constraints into consideration and uses the best-neighbour strategy, which takes only the closest search space around a task into account. In [11] an agent-based run-time mapping approach for heterogeneous NoC architectures is presented. The system is based on global agents containing system state information and cluster agents which are responsible for assigning resources. ...
Article
Full-text available
Fail-operational behavior of safety-critical software for autonomous driving is essential as there is no driver available as a backup solution. In a failure scenario, safety-critical tasks can be restarted on other available hardware resources. Here, graceful degradation can be used as a cost-efficient solution where hardware resources are redistributed from non-critical to safety-critical tasks at run-time. We allow non-critical tasks to actively use resources that are reserved as a backup for critical tasks, which would be otherwise unused and which are only required in a failure scenario. However, in such a scenario, it is of paramount importance to achieve a predictable timing behavior of safety-critical applications to allow a safe operation. Here, it has to be ensured that even after the restart of safety-critical tasks a guarantee on execution times can be given. In this paper, we propose a graceful degradation approach using composable scheduling. We use our approach to present, for the first time, a performance analysis which is able to analyze timing constraints of fail-operational distributed applications using graceful degradation. Our method can verify that even during a critical Electronic Control Unit failure, there is always a backup solution available which adheres to end-to-end timing constraints. Furthermore, we present a dynamic decentralized mapping procedure which performs constraint solving at run-time using our analytical approach combined with a backtracking algorithm. We evaluate our approach by comparing mapping success rates to state-of-the-art approaches such as active redundancy and an approach based on resource availability. In our experimental setup our graceful degradation approach can fit about double the number of critical applications on the same architecture compared to an active redundancy approach. Combined, our approaches enable, for the first time, a dynamic and fail-operational behavior of gracefully degrading automotive systems with cost-efficient backup solutions for safety-critical applications.
... Their mapping approach takes bandwidth and load constraints into consideration and uses the best-neighbour strategy, which takes only the closest search space around a task into account. In [10] an agent-based run-time mapping approach for heterogeneous NoC architectures is presented. The system is based on global agents containing system state information and cluster agents which are responsible for assigning resources. ...
Preprint
Full-text available
Fail-operational behavior of safety-critical software for autonomous driving is essential as there is no driver available as a backup solution.In a failure scenario, safety-critical tasks can be restarted on other available hardware resources.Here, graceful degradation can be used as a cost-efficient solution where hardware resources are redistributed from non-critical to safety-critical tasks at run-time.We allow non-critical tasks to actively use resources that are reserved as a backup for critical tasks, which would be otherwise unused and which are only required in a failure scenario.However, in such a scenario, it is of paramount importance to achieve a predictable timing behavior of safety-critical applications to allow a safe operation. Here, it has to be ensured that even after the restart of safety-critical tasks a guarantee on execution times can be given. In this paper, we propose a graceful degradation approach using composable scheduling.We use our approach to present, for the first time, a performance analysis which is able to analyze timing constraints of fail-operational distributed applications using graceful degradation.Our method can verify that even during a critical ECU failure, there is always a backup solution available which adheres to end-to-end timing constraints.Furthermore, we present a dynamic decentralized mapping procedure which performs constraint solving at run-time using our analytical approach combined with a backtracking algorithm. We evaluate our approach by comparing mapping success rates to state-of-the-art approaches such as active redundancy and an approach based on resource availability. In our experimental setup our graceful degradation approach can fit about double the number of critical applications on the same architecture compared to an active redundancy approach. Combined, our approaches enable, for the first time, a dynamic and fail-operational behavior of gracefully degrading automotive systems with cost-efficient backup solutions for safety-critical applications.
... It tries to allocate tasks close to each other in order to reduce the communication latency. The authors of Al Faruque, Krist, and Henkel, 2008 proposed an agent-based dynamic mapping in which the task allocation algorithm follows a distributed approach. Agents are defined as a small tasks that collect information about a NoC region while they collaborate between them to find a suitable mapping schema. ...
Thesis
In this dissertation, we tackle the problem of execution complex multi-thread real-time applications on modern Network-on-Chip architectures. Network-on-Chip (NoC) is a promising technology that fits the increasing performance demands of Cyber-Physical Systems (CPS). The introduction of NoCs is justified by the fact that classical multi-core single-bus architectures fail to address the performance requirements and the predictability needs of modern CPS applications, especially as the number of cores increases. Even if the use of cache memories mitigates the bottleneck effect of single bus architectures, caches introduce unpredictable delays in accessing data, which in turn make it difficult to estimate the execution time of tasks. Most CPS applications are time-sensitive: tasks are assigned deadlines that must never exceed, otherwise a critical failure may occur. Such systems are denoted by hard real-time. Consequently, the communications that occur in the network, denoted by on-chip communications, must be predictable and as fast as possible to prevent deadline-missing. Since the task position on the NoC determines its communication cost, the allocation of the application tasks on the chip cores is a crucial problem. In this thesis, we address specifically the problem of allocating a set of real-time applications, each composed of several parallel tasks, whose structure is described by a Directed Acyclic Graph (DAG), onto a Network-on-Chip processor. First, we study the problem of bounding the communication cost depending on the different message scheduling policies at the router level. Then we address the problem of task scheduling and of verifying the schedulability of a certain allocation. Then, we propose an approach to reduce the complexity of the task allocation problem and its analysis cost. Moreover, we propose a task mapping strategy through a meta-heuristic which performs an effective design-space exploration for DAG (Directed Acyclic Graph) tasks. Lastly, in addition to on-chip communications, we studied the mapping problem when the off-chip communications are integrated into the model.
Article
Full-text available
Network-on-Chip (NoC) has been unfolded as a superior alternative for integrating a considerably greater extent of cores on a single chip. Recently, multi-core systems have become prevalent because of the increased processing demands for high-performance embedded applications. Application mapping techniques play a significant role in enhancing the extensive performance of such complex multi-core platforms. Developing and implementing efficient application mapping techniques are required for system design to meet the demand of such complicated multi-core systems. The paper primarily focuses on dynamic application mapping techniques, classifying them into a number of subcategories. It highlights such approaches and techniques that aim to enhance the performance of the NoC-based systems by optimizing them in terms of communication cost, latency, energy consumption, power, execution, and computational time. Future challenges, trends, and simulation tools have also been spotlighted.
Article
Hybrid Wireless Network-on-Chip (HWNoC) architecture has been introduced as a promising communication infrastructure for multicore systems. HWNoC-based multicore systems encounter extremely dynamic application workloads that are submitted at run-time. Mapping and scheduling of these applications are critical for system performance, especially for real-time applications. The existing resource allocation approaches either ignore the use of wireless links in task allocation on cores or ignore the timing characteristic of tasks. In this paper, we propose a new deadline-aware and energy-efficient dynamic task mapping and scheduling approach for the HWNoC-based multicore system. By using of core utilization threshold and tasks laxity time, the proposed approach aims to minimize communication energy consumption and satisfy the deadline of the real-time applications tasks. Through cycle-accurate simulation, the performance of the proposed approach has been compared with state-of-the-art approaches in terms of communication energy consumption, deadline violation rate, communication latency, and runtime overhead. The experimental results confirmed that the proposed approach is a very competitive approach among the alternative approaches.
Article
Due to advances in transistor technology, a single-chip processor can now have hundreds of cores. Network-on-Chip (NoC) has become the superior interconnect fabric for multi-/many-core on-chip systems because of its scalability and parallelism. With the rise of dark silicon following the end of Dennard scaling, it has become essential to design energy-efficient and high-performance heterogeneous NoC-based multi-/many-core architectures. Because the design space is large and complex, it is difficult to explore within a reasonable time for optimal energy-performance-reliability trade-offs. Furthermore, reactive resource management is not effective at preventing problems from arising in adaptive systems. Therefore, in this work, we explore machine learning techniques to design and configure NoC resources based on what is learned about the system and the application workloads. Machine learning can automatically learn from past experience and guide the NoC intelligently towards its performance, power, and reliability objectives. We present the challenges of NoC design and resource management and propose a generalized machine learning framework to uncover near-optimal solutions quickly. We propose and implement a NoC design and optimization solution enabled by neural networks, using the generalized machine learning framework. Simulation results demonstrate that the proposed neural-network-based design and optimization solution improves performance by 15% and reduces energy consumption by 6% compared to an existing non-machine-learning-based solution, while also improving NoC latency and throughput compared to two existing machine-learning-based NoC optimization solutions. The challenges of adopting machine learning techniques in multi-/many-core NoCs are also presented to guide future research.
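The cited framework is neural-network based; as a much simpler stand-in, the sketch below (all features, sample data, and the 1-nearest-neighbor predictor are assumptions, not the paper's method) illustrates the general "learn from past measurements, then pick a configuration" idea.

```python
# Minimal sketch (not the cited framework): predict NoC latency for candidate
# configurations from past measurements and pick the most promising one.
# Features, data, and the 1-nearest-neighbor predictor are illustrative only.

history = [  # (buffer_depth, num_virtual_channels) -> measured latency (cycles)
    ((2, 2), 38.0), ((4, 2), 31.5), ((4, 4), 27.0), ((8, 4), 26.1),
]

def predict_latency(cfg):
    """1-nearest-neighbor prediction from previously observed configurations."""
    nearest = min(history, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], cfg)))
    return nearest[1]

candidates = [(2, 4), (4, 4), (8, 2)]
best = min(candidates, key=predict_latency)
print(best, predict_latency(best))  # configure the NoC with the best-predicted option
```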
Article
Full-text available
Heterogeneous process migration is a technique whereby an active process is moved from one machine to another and must then continue normal execution and communication. The source and destination processors can have different architectures, that is, different instruction sets and data formats. Because of this heterogeneity, the entire process memory image must be translated during the migration. Tui is a migration system that is able to translate the memory image of a program (written in ANSI-C) between four common architectures (m68000, SPARC, i486 and PowerPC). This requires detailed knowledge of all data types and variables used within the program, which is not always possible in non-type-safe (but popular) languages such as ANSI-C, Pascal and Fortran. The important features of the Tui algorithm are discussed in detail, including the method by which a program's entire set of data values can be located and eventually reconstructed on the target processor. Performance figures demonstrating the viability of using Tui to migrate real applications are given. © 1998 John Wiley & Sons, Ltd.
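The kind of per-value translation such a system must perform, once every variable has been located, can be illustrated with a byte-order conversion. The sketch below is not part of Tui; the function name, types, and byte orders are illustrative assumptions.

```python
# Minimal sketch (not Tui itself): re-encoding a typed value between architectures
# that differ in byte order, the kind of per-value conversion heterogeneous
# migration needs after locating each variable. Types and orders are assumptions.

import struct

def translate_int32(raw_bytes, src_order="<", dst_order=">"):
    """Re-encode a 32-bit integer from the source to the destination byte order."""
    (value,) = struct.unpack(src_order + "i", raw_bytes)
    return struct.pack(dst_order + "i", value)

little = struct.pack("<i", 0x12345678)   # as stored on a little-endian source
big = translate_int32(little)            # as expected on a big-endian target
print(little.hex(), "->", big.hex())     # 78563412 -> 12345678
```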
Conference Paper
Full-text available
One of the key steps in Network-on-Chip (NoC) based design is the spatial mapping of cores and the routing of the communication between those cores. Known solutions to the mapping and routing problem first map cores onto a topology and then route the communication, using separate and possibly conflicting objective functions. In this paper we present a unified single-objective algorithm, called Unified MApping, Routing and Slot allocation (UMARS). As the main contribution, we show how to couple path selection, mapping of cores, and TDMA time-slot allocation such that the network required to meet the constraints of the application is minimized. The time complexity of UMARS is low, and experimental results indicate a run-time only 20% higher than that of path selection alone. We apply the algorithm to an MPEG decoder System-on-Chip (SoC), reducing area by 33%, power by 35%, and worst-case latency by a factor of four over a traditional multi-step approach.
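The essence of folding routing and slot allocation into one objective can be sketched as a single path score that penalizes both hop count and slot scarcity. This is not UMARS itself; the paths, slot tables, and weighting below are illustrative assumptions.

```python
# Minimal sketch (not UMARS): score candidate paths with one unified objective
# mixing hop count and TDMA slot scarcity, instead of optimizing them separately.
# Paths, slot tables, and the weights alpha/beta are illustrative assumptions.

def unified_cost(path, free_slots, alpha=1.0, beta=2.0):
    """Lower is better: few hops and plenty of free slots on every traversed link."""
    hops = len(path) - 1
    scarcity = sum(1.0 / max(free_slots[link], 1) for link in zip(path, path[1:]))
    return alpha * hops + beta * scarcity

free_slots = {((0, 0), (1, 0)): 3, ((1, 0), (1, 1)): 1,
              ((0, 0), (0, 1)): 4, ((0, 1), (1, 1)): 4}

candidates = [[(0, 0), (1, 0), (1, 1)], [(0, 0), (0, 1), (1, 1)]]
best = min(candidates, key=lambda p: unified_cost(p, free_slots))
print(best)  # the path whose links still have enough free slots wins
```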
Article
Full-text available
In this article we discuss the potential uses of mobile agents in network management and define software agents and a navigation model that determines agent mobility. We list a number of potential advantages and disadvantages of mobile agents and include a short commentary on the ongoing standardization activity. The core of this article comprises descriptions of several actual and potential applications of mobile agents in the five OSI functional areas of network management. A brief review of other research activity in the area and prospects for the future conclude the presentation.
Conference Paper
Full-text available
With the advent of multi-processor systems-on-chip, interest in process migration is again on the rise, both in research and in product development. New challenges in this scenario include increased sensitivity to implementation complexity, tight power budgets, requirements on execution predictability, and the lack of virtual memory support in many low-end MPSoCs. As a consequence, the effectiveness and applicability of traditional transparent migration mechanisms are called into question in this context. This paper proposes a task management software infrastructure that is well suited to the constraints of single-chip multiprocessors with distributed operating systems. Load balancing in the system is maintained by means of intelligent initial placement and task migration. We propose a user-managed migration scheme based on code checkpointing and user-level middleware support as an effective solution for many MPSoC application domains. In order to prove the practical viability of this scheme, we also propose a characterization methodology for task migration overhead. We derive the minimum execution time following a task migration event during which the system configuration should be frozen to make up for the migration cost.
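The break-even reasoning behind such a "freeze time" can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's characterization methodology; the linear cost model and all numbers are illustrative assumptions.

```python
# Minimal sketch (not the cited methodology): estimate how long the system should
# keep its configuration frozen after a migration so that the load-balancing gain
# amortizes the one-off migration cost. Model and numbers are assumptions.

def min_freeze_time(checkpoint_bytes, link_bw_bytes_per_ms,
                    restart_overhead_ms, gain_per_ms):
    """Time needed for the per-millisecond gain to pay back the migration cost."""
    migration_cost_ms = checkpoint_bytes / link_bw_bytes_per_ms + restart_overhead_ms
    return migration_cost_ms / gain_per_ms

# 256 KiB checkpoint over a 64 KiB/ms link, 2 ms restart, 0.1 ms saved per ms
print(min_freeze_time(256 * 1024, 64 * 1024, 2.0, 0.1))  # -> 60.0 ms
```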
Conference Paper
This paper presents a many-core architecture with hundreds to thousands of small cores that delivers unprecedented compute performance within an affordable power envelope. We discuss fine-grain power management, memory bandwidth, on-die networks, and system resiliency for the many-core system.
Conference Paper
In this paper, we propose an efficient technique for run-time application mapping onto Network-on-Chip (NoC) platforms with multiple voltage levels. Our technique consists of a region selection algorithm and a heuristic for run-time application mapping that minimizes the communication energy consumption while still providing the required performance guarantees. The proposed technique allows new applications to be easily added to the system platform with minimal inter-processor communication overhead. Moreover, our approach scales very well for large designs. Finally, the experimental results show as much as 50% communication energy savings compared to arbitrary mapping solutions.
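The sketch below is not the paper's region selection algorithm or heuristic; it only illustrates, with an assumed bit-energy model and an assumed greedy placement order, how tasks can be placed inside an already-selected region of free cores so that communication energy stays low.

```python
# Minimal sketch (not the cited algorithm): given a pre-selected region of free
# cores, place communicating tasks so the usual bit-energy model
#   E = volume x hops x E_bit
# stays small. Region size, energy constant, and greedy order are assumptions.

E_BIT = 0.43e-12  # assumed energy per bit per hop (joules), illustrative only

def comm_energy(volume_bits, src, dst):
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return volume_bits * hops * E_BIT

region = [(x, y) for x in range(2) for y in range(2)]    # 2x2 region of free cores
edges = [("t0", "t1", 8_000), ("t0", "t2", 4_000)]       # task graph, volumes in bits

placement, free = {"t0": region[0]}, set(region[1:])
for src, dst, vol in edges:                              # greedy: cheapest free core
    core = min(free, key=lambda c: comm_energy(vol, placement[src], c))
    placement[dst], free = core, free - {core}
print(placement)
```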
Conference Paper
In portable multimedia systems, a number of communicating tasks have to be performed on a set of heterogeneous processors, and this should be done in an energy-efficient way. We give the background of the problem and model it as a graph optimization problem, which we call the minimum weight processor assignment problem. We show that our setting generalizes several problems known in the literature, including minimum multiway cut, graph k-colorability, and minimum (generalized) vertex covering. We show that the minimum weight processor assignment problem is NP-hard, even when restricted to instances where the (process) graph is a bipartite graph with maximum degree at most 3, or with only two processors, or with arbitrarily small weight differences, or with only two different edge weights. For graphs with maximum degree at most 2 (or, in fact, the larger class of degree-2-contractible graphs) we give a polynomial-time algorithm. Finally, we generalize this algorithm into an exact (but not efficient) algorithm for general graphs.
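To fix ideas, the sketch below instantiates the problem as stated in the abstract: node weights model the cost of running a task on a given processor type, edge weights the cost of communication between tasks assigned to different processors. The exhaustive search is only a stand-in for the "exact but not efficient" route on tiny instances; the data and names are illustrative assumptions, not the authors' algorithm.

```python
# Minimal sketch (not the cited algorithm): the minimum weight processor assignment
# problem on a tiny instance, solved by exhaustive search. All data are illustrative.

from itertools import product

tasks = ["a", "b", "c"]
procs = ["dsp", "risc"]
exec_cost = {("a", "dsp"): 2, ("a", "risc"): 5, ("b", "dsp"): 4,
             ("b", "risc"): 1, ("c", "dsp"): 3, ("c", "risc"): 3}
comm_cost = {("a", "b"): 2, ("b", "c"): 4}   # paid only across different processors

def total_cost(assign):
    cost = sum(exec_cost[(t, p)] for t, p in assign.items())
    cost += sum(w for (u, v), w in comm_cost.items() if assign[u] != assign[v])
    return cost

best = min((dict(zip(tasks, choice)) for choice in product(procs, repeat=len(tasks))),
           key=total_cost)
print(best, total_cost(best))  # -> {'a': 'dsp', 'b': 'risc', 'c': 'risc'} 8
```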