Overcoming The "Memory Wall" By Improved System
Design Exploration And A Link To Process Technology
Options
Antonis Papanikolaou, Miguel Miranda and Francky Catthoor
IMEC, Kapeldreef 75, Leuven, Belgium
(F. Catthoor is also a professor at the Katholieke Universiteit Leuven, Belgium)
{papaniko,miranda,catthoor}@imec.be
ABSTRACT
Data transfer and storage issues “take the centre stage” in
information and communication systems because of the in-
creasing complexity and data dominance typically associated with them. In this paper, we summarise a systematic methodology for optimising power-critical storage modules such
as local SRAMs. We focus on their memory organisation
(access schedule and data assignment) together with an ex-
ploration of the effect that the interconnect technology may
have on the energy consumed by these local memory organ-
isations in deep sub-micron technologies.
Categories and Subject Descriptors
B.3.1 [Hardware]: Semiconductor Memories—Static Memory (SRAM); C.3 [Computer Systems Organization]: Real-time and embedded systems
General Terms
Design
Keywords
combined system design and process technology exploration,
optimal energy/delay trade-off exploration in memories
1. INTRODUCTION
Data transfer and storage issues “take the centre stage” [1]
in information and communication systems because of the
increasing complexity and data dominance typically associated with them. Additionally, while the evolution of process
technology scaling favours speed, this is not necessarily the
case for energy once the 100 nm barrier is crossed. This
is especially true for the back-end process technology such
as the interconnect. If these interconnect process options
are not well exploited the total energy budget for a given
application task may go up instead of the expected reduction. As a result, both the overall energy consumption and the performance of these systems suffer from a data transfer and storage bottleneck, one that especially impacts the embedded world.
At IMEC this bottleneck was identified in the late 1980s, and a system design methodology has been developed
for Data Transfer and Storage Exploration (DTSE) [2, 3].
This methodology aims at reducing the data and memory re-
lated energy consumption for a given application within real-
time constraints. It consists of two major parts, platform-
independent and platform-dependent optimisations. In the
first part the application source code is optimised, among other things, by making memory accesses more local and regular, independently of the target platform. In the second part, among other decisions, the
ordering of the memory accesses is decided along with the
mapping of the data on the memory organisation of the tar-
get architecture.
[Figure 1 shows arrays A (with copy candidates A’ and A”) and B (with copy candidate B’) mapped onto a two-layer organisation: an L2 memory (size 1 Mb, 10 uJ/access) and an L1 memory (size 1 Kb, 1 uJ/access), both feeding the register files and functional units of the scalar-level functionality.]
Figure 1: Optimal assignment of application data to memory layers
This summary paper is situated in the context of map-
ping application domains on predefined architecture plat-
forms. We do not consider general purpose processors, but
application-domain specific platforms, like modern DSP pro-
cessors and emerging multi-processor SoC platforms. The
target application domains include the ones that are repre-
sentative of applications for future embedded systems, namely
wireless and multi-media applications. Our main optimisa-
tion objective is the reduction of energy consumption of the
data memory organisation for a given task, while always
meeting the timing constraints imposed by the applications.
The platforms under consideration are (almost) fully prede-
fined and the designer has to take care of optimally map-
ping his applications on this platform in order to optimise
power consumption. Some reconfigurability can be present
in the memory organisation and its interconnect network.
Note that even a fully predefined architecture will still pro-
vide some flexibility, for example the communication net-
work between the memories and the processing elements usually contains some shared, programmable busses. Also
the way the individual data of the application are mapped
to the memory units and banks, or the so-called data layout
can be decided by the compiler or linker (possibly incorpo-
rating designer input).
We address the power-critical storage modules, such as local SRAMs, and focus on their memory organisation (access schedule and data assignment, see Figure 1), together with the effect that the interconnect technology has on the energy consumed by these modules. This is done both inside the SRAMs
(intra-memory) and between them (inter-memory). To ex-
pose the effects of the memory organisation and the inter-
connect in these modules, we have performed several ex-
periments using as drivers real-life applications of industrial
relevance that are representative of our target application domain, namely the Digital Audio Broadcast (DAB) receiver [4] and the Quad-tree Structured Difference Pulse Code Modulation (QSDPCM) video encoder [5].
Our approach for intra-memory interconnect energy op-
timisation is based on intentionally varying the width of
the interconnect wires while maintaining their relative pitch
to create interconnect structures inside the memories with
lower associated capacitance, hence with reduced energy.
The impact of this, a higher associated access delay, is com-
pensated at the system level by exploiting the available
parallelism in the data transfer at the architecture level.
This leads to memory instances that are significantly more
energy-efficient than the memories currently provided com-
mercially while they mainly differentiate in the dimensions
of the interconnect wires they use internally. Moreover, dif-
ferent implementations may exist for the same memory in-
stance, each having different access delay and energy per
access characteristics than the others, hence providing a va-
riety of Energy/Delay (E/D) trade-offs.
The next step in our flow aims at improving the hetero-
geneity of the memory organisation of the platform. The
way to achieve that is by propagating these E/D trade-offs
from the memory module level to the system-level mem-
ory organisation in order to save energy at that higher level
where the impact is significantly higher than the module
level. The goal is to construct the (local) memory organ-
isation by selecting those SRAM modules that allow just
meeting the system performance constraints. Hence, the
slack gained in system performance will translate into an over-
all energy gain. This last step is supported by the platform-
dependent mapping sub-stage of the Data Transfer and Stor-
age Exploration approach [3, 6, 7] and associated tool set
which is targeted to partially predefined memory organisa-
tions (still with a number of configuration parameters).
2. RELATED WORK
Apart from IMEC, three currently active research groups
exist that are internationally recognised for their contribu-
tions to data transfer and storage management related re-
search issues. One (co-operating) group is located at the
Universities of Bologna and Torino in Italy, the second one
at the University of California, Irvine, and the last one at Penn State University.
The groups cooperating in Torino/Bologna [8] are mainly
targeting memory design issues, hence their abstraction level
for exploration is at the hardware level and therefore diffi-
cult to port and re-target and also more limited in explo-
ration space. Indeed, by working at a lower abstraction level
(memory organisation/architecture) the exploration range is
more limited than at the system and/or application level as
in case of the DTSE approach. Therefore also the achiev-
able gains in implementation cost (size, speed, power) are
smaller.
The group at Irvine has been working on memory issues
in general [9] and on the EXPRESSION compiler [10] in particular. Their research is situated at the compiler level of
abstraction and has a quite complete memory view. Still,
it is not as retargetable as our focus, especially due to the
assumptions made on the distributed on-chip memory or-
ganisation or the SDRAM modules. They are highly fo-
cused on the typical programmable target processor’s local
memory organisations where a combination of caches and
scratch-pad memory are becoming available now.
The group at Penn State focuses on developing compiler
optimisations for data locality in order to exploit the avail-
able memory hierarchy [11, 12, 13]. They consider the use
of high-level transformations, which allows them to perform most of their optimisations at the source code level, hence their
approach is potentially retargetable. However, their tech-
niques, although systematic, do not explore Pareto-optimal
trade-offs. This limits the possibilities of using such an ap-
proach for memory organisation exploration in particular and
platform architecture exploration in general.
The high orthogonality, efficiency, portability and retar-
getability characteristics of IMEC’s DTSE approach [2, 3]
are unique when compared to the other past and currently
active data transfer and storage management research projects
outside IMEC.
3. MEMORY ORGANISATION TEMPLATE
It has already been shown that exploiting a data memory
hierarchy is very beneficial for power consumption, see Sec-
tion 4. In fact, power optimal memory organisations usually
have several layers of memories and each layer is distributed,
like the TI C55 family of DSPs. A distributed memory or-
ganisation can provide a larger bandwidth than a fully cen-
tralised one, which means that the application data can be
transferred much faster into the processing elements. Fur-
thermore, the fact that the total memory footprint is divided
between several memories results in a smaller energy per ac-
cess of each memory, compared to the centralised memory.
By exploiting these two axes of freedom we will see how
application energy consumption vs. application execution
time trade-offs can be created and using these trade-offs a
designer can find the power-optimal mapping for the given
application timing constraints, see Section 5.1.
To further optimise power consumption we can exploit
any potential heterogeneity in the memory layers. A dis-
tributed memory layer can contain memories that are iden-
tical or memories that differ from each other. If the platform
provides this kind of heterogeneity then we can use it to
further optimise the application power consumption by fine
tuning the energy consumption and delay characteristics of
the memories, so as to just meet the application timing con-
straints. Running as slow as possible (taking into account
the energy weight distribution), while still meeting the “real-
time” application constraints, translates into energy gains as
we will see in Section 8.2.
4. ENERGY EFFICIENT DATA MEMORY
HIERARCHY MANAGEMENT
Despite the recent architectural advances for multime-
dia aiming at improving computational efficiency (e.g., sub-
word parallel data level processing, reconfigurable comput-
ing, etc), the dominance in data storage and transfer of these
systems still remains one of the main bottlenecks for energy
and speed efficient implementations. The reason is the ever
increasing gap in speed and energy between the memory and
the data processing subsystems.
To cope with such a gap, the addition of more and more
layers to the memory hierarchy, from where data can be
efficiently accessed from smaller, faster and more energy ef-
ficient memories becomes mandatory. This is true both for
Systems on a Chip (SoC) platforms based on random ac-
cess memories as well as for cache memories. However, the
potential efficiency offered by these multi-layer memory or-
ganisations becomes attainable only on condition that the
storage and transfer of data between the different layers is
done in a quite optimised manner.
Memory hierarchy layers can contain software controlled
scratch-pad memories or caches. In order to guarantee an
efficient transfer of data along the data memory hierarchy
often this requires that smaller copies of the data are made
from the larger data arrays which can be stored in the smaller
layers [14]. Those copies must be selected such that they
minimise the overall transfer cost. In this context, any
transfer of data from a higher layer to the current one is
considered to be an overhead for the current layer.
This happens most efficiently under full software control
(e.g., a number of SRAM scratch-pad memories) because a
global view on the transfer can be obtained at design time.
In this case, copy operations should be explicitly present in
the application code. This is mostly possible for design time
analysable applications that are characterised via Pareto
curves collecting all optimal energy trade-offs for the dif-
ferent execution times. The decision of the selected Pareto
point can then already be made at design time for purely
static applications.
However, many real-life applications are dynamic in na-
ture and they cannot be completely characterised at de-
sign time. Traditionally, to cope with this dynamism, HW-
controlled caches are used instead of SW-controlled scratch-
pad memories. In this case, the hardware cache controller
will make the copies of signals at the moment they are ac-
cessed (and the copy is not present yet in the cache). How-
ever, this is inefficient because data present in the cache and required in the (near) future can be (wrongly) evicted in order to accommodate newly fetched data. This evicted data,
when needed again by the processor, will have to be brought
to the cache for a second time, hence leading to transfer and
power overhead. To minimise such overhead, architects tend
to use bigger caches with hardware controllers implementing
complex mapping policies. However, this is not efficient for
power given the extra overhead in every single access even
when these are cache hits. According to [15], a 1KByte 4-
way associative cache is between 4-5 times more inefficient
in terms of energy per access than a scratch-pad of the same
size and 2-3 times more than a (1-way associative) Direct Mapped cache (DM-cache), while the latter is only 60% less effi-
cient than a scratch-pad of the same size. This overhead
is due to accessing the tag array of the cache, whose size
(hence energy overhead) considerably increases with the as-
sociativity factor.
The Memory Hierarchy Layer Assignment (MHLA) methodology [16] is illustrated in Figure 1.
The assumed application contains two arrays, namely A and
B. For array A two possibilities exist for copying part of it to
smaller arrays and reusing these elements. For array B only
one such case was identified. The target platform is also
shown on the right. It is clear that the energy per access of
the L2 memory is much larger than the energy per access
of the L1 memory, an order of magnitude difference. In this
example, after tedious exploration the power-optimal choice
for assigning arrays to layers was to assign only array A to
the second layer. In the first layer, a small copy of part of A
is stored, along with B, which obviously was small enough
to fit in layer 1.
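To make the flavour of such a decision concrete, the following Python sketch enumerates the copy-candidate choices for two arrays and keeps the cheapest combination that fits the L1 capacity. It is a minimal illustration of the kind of search MHLA automates, not the actual tool; all sizes, access counts and per-access energies are invented for the example (the energy values loosely follow Figure 1).

    from itertools import product

    # Assumed per-access energies, loosely following Figure 1.
    E_L2, E_L1 = 10.0, 1.0        # uJ per access for the L2 and L1 layers
    L1_CAPACITY = 1024            # bytes available in the L1 layer

    # Total accesses per array and hypothetical copy candidates:
    # (copy size in bytes, accesses served from L1, remaining L2 transfers).
    # None means "no copy": every access goes to the L2 layer.
    total_accesses = {"A": 10000, "B": 4200}
    candidates = {
        "A": [None, (256, 9000, 1200), (512, 9600, 600)],   # A' and A''
        "B": [None, (128, 4000, 400)],                      # B'
    }

    def energy(choice):
        """Memory energy for one combination of copy candidates."""
        total = 0.0
        for arr, cand in choice.items():
            if cand is None:
                total += total_accesses[arr] * E_L2
            else:
                _size, l1_hits, l2_transfers = cand
                total += l1_hits * E_L1 + l2_transfers * E_L2
        return total

    best = None
    for combo in product(*candidates.values()):
        if sum(c[0] for c in combo if c) > L1_CAPACITY:
            continue                       # copies must fit in the L1 layer
        choice = dict(zip(candidates, combo))
        e = energy(choice)
        if best is None or e < best[0]:
            best = (e, choice)

    print(best)

Exhaustive search is feasible here only because the candidate sets are tiny; the real exploration must additionally schedule the copy transfers themselves, which is what makes tool support necessary.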
By applying the aforementioned techniques to a real-life
driver application we can show significant energy gains for
the complete memory organisation. The driver we have used
is the QSDPCM [5]. This application is a video encoder
and the memory organisation assumed contains two hierar-
chy layers. The size of the L1 layer differs slightly as we
can see in the results, see Figure 2, but the difference is so
small that these results can be directly translated to a pre-
defined architecture context. It is clear from the results that
by making some changes in the source code and introducing
copies of large arrays in the local layer most of the accesses
are now localised to this layer. This has a significant effect
on the energy that is consumed on the memory of the sec-
ond layer. By reducing the number of second layer accesses
by about 45% we notice a total energy reduction of about
35%. Note that the accesses to the local memory layer have
increased significantly, but the impact on the total L1 en-
ergy consumption is not very significant. The reason is that
an access to the second layer is much more costly than an
access to the local layer.
                                   Non-merged   Fully-merged   Partially-merged
  Number of L2 accesses (x10^3)           542            306                300
  Number of L1 accesses (x10^6)          1.14           1.19               1.19
  Layer size for L2 (Bytes)            114048          66352              63360
  Layer size for L1 (Bytes)               742            748                802
  L2 Energy                             58.18          32.85              32.17
  L1 Energy                             13.79          14.32              14.29
  Total Energy                          71.97          47.17              46.47

Figure 2: Localising the memory accesses by introducing copies in the code reduces total energy consumption
5. LOW-POWER DISTRIBUTED MEMORY ORGANISATIONS
5.1 How to meet real-time bandwidth
constraints
In data transfer intensive applications, a costly and dif-
ficult to solve issue is to get the data to the processor on
time. A certain amount of parallelism in the data trans-
fers is usually required to meet the application’s real-time
constraints. Parallel data transfers, however, can be very
costly. Therefore, the trade-off between data transfer band-
width and data transfer cost should be carefully explored.
This section describes the potential trade-offs involved, and
also introduces a new way to systematically trade off the data transfer cost against application run-time.
In our application domain, an overall target storage cy-
cle budget is typically imposed, corresponding to the overall
throughput. In addition, other real-time constraints can be
present which restrict the ordering freedom. In data trans-
fer and storage intensive applications, the memory accesses
are often the limiting factor to the execution speed, both
in custom “hardware” and instruction-set processors (“soft-
ware”).
Data processing can be easily sped up through pipelin-
ing and other forms of parallelism. Increasing the memory
bandwidth, on the other hand, is much more expensive and
requires the introduction of different hierarchical layers, typ-
ically involving also multi-port memories. These memories
cause a large penalty in area and energy though.
Because memory accesses are so important, it is even pos-
sible to make an initial system level performance evalua-
tion based solely on the memory accesses to complex data
types [2]. Data processing is then temporarily ignored ex-
cept for the fact that it introduces dependencies between
memory accesses. This section focuses on the trade-off be-
tween cycle budget distribution over different system com-
ponents and gains in the total system energy consumption.
Before going into details, a number of other issues need to
be introduced though.
Defining such a memory system based on a high-level spec-
ification is far from trivial when taking the real time con-
straints into account. High-level tools will have to support
the definition of the memory system [17]. A global data
transfer scheduling approach balancing the required mem-
ory bandwidth is needed to come up with a suitable memory
architecture within the given (timing) constraints (see sub-
section 5.1.1). As such, this is an impossible task to perform
manually, especially when taking sophisticated cost models
into account.
The cost models can be memory size (area) or power based
and allow us to have an impact on heterogeneous RAM-
based organisations [18]. Using such models will provide a
clear trade-off in the power, area and performance design
space.
To solve this practical design-time issue, we have devel-
oped design automation tools to support these decisions.
We use our tools to come up with Pareto curves to visualise
the useful trade-off space between the cycle budget assigned
to a given system submodule and its corresponding energy
and/or area consumption, i.e. involving three search space
axes. As far as we know, no other systematic (automatable)
approach is available in literature to solve this important
design problem.
5.1.1 Balancing memory bandwidth
The required bandwidth is as large as the maximum band-
width needed by the application (see Figure 3). When a very
high demand is present in a certain part of the code (e.g.
a certain loop nest), the entire application suffers. By flat-
tening this peak, a reduced overall bandwidth requirement
is obtained (see lower part of Figure 3).
Figure 3: Balancing the bandwidth lowers the cost.
The re-balancing of the bandwidth load is again performed
by reordering the memory accesses (also across loop scopes).
Moreover, the overall cycle distribution over all the loops
has a large impact on the peak bandwidth. Tool support is
indispensable here too for this difficult task. An accurate memory access ordering is needed to define a memory organisation that meets all system timing constraints.
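As a toy illustration of why balancing pays off, the Python sketch below compares the peak bandwidth of an equal-share cycle distribution against one where each loop gets cycles proportional to its access count; the loop profile and cycle budget are invented for the example.

    # Hypothetical profile: memory accesses issued by each loop nest.
    accesses = [12000, 3000, 500, 4500]
    cycle_budget = 10000              # assumed overall storage cycle budget

    # Naive distribution: every loop gets an equal share of the cycles.
    equal = [cycle_budget // len(accesses)] * len(accesses)
    peak_equal = max(a / c for a, c in zip(accesses, equal))

    # Balanced distribution: cycles proportional to each loop's accesses,
    # which flattens the bandwidth profile as in Figure 3.
    total = sum(accesses)
    balanced = [max(1, round(cycle_budget * a / total)) for a in accesses]
    peak_balanced = max(a / c for a, c in zip(accesses, balanced))

    print(f"peak bandwidth, equal shares: {peak_equal:.2f} accesses/cycle")
    print(f"peak bandwidth, balanced:     {peak_balanced:.2f} accesses/cycle")

With these numbers the balanced distribution needs roughly two ports' worth of peak bandwidth instead of nearly five, which translates directly into a cheaper memory organisation.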
5.1.2 Energy cost versus cycle budget trade-off
The data transfer and storage (“memory”) related cost
clearly increases when a higher memory bandwidth is needed.
In most cases, designers of real-time systems will assume
that the time for executing a given task is predefined in an
initial system decision step, where the system timing budget
is broken up based on ad hoc guidelines.
This subsection explains how to use the available trade-
offs between cycle budget for a given task and the memory
cost related to it. This almost necessarily leads to the use of
Pareto curves to exploit the opportunities really well. The
data transfer and storage related cycle budget (“memory cy-
cle budget”) is strongly coupled to the memory system cost
(both energy and memory size/area are important here, as
mentioned earlier). A Pareto curve is a powerful instrument
to be able to make the right trade-offs. The Pareto curve
only represents the potentially interesting points in a search
space with multiple axes and excludes all the solutions which
have an equally good or worse solution for all the axes.
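In code, the dominance filter that produces such a curve is straightforward. The sketch below keeps only the non-dominated points of a small invented set of (cycle budget, energy, area) candidates, with lower taken as better on every axis.

    def pareto_front(points):
        """Keep the points not dominated on all axes (lower is better).

        q dominates p when q is no worse than p on every axis and
        differs from p on at least one."""
        return [p for p in points
                if not any(all(qi <= pi for qi, pi in zip(q, p)) and q != p
                           for q in points)]

    # Invented (cycle budget, energy, area) candidate organisations.
    candidates = [(100, 9.0, 4.0), (150, 6.0, 4.5), (150, 7.0, 5.0),
                  (200, 5.5, 3.0), (250, 5.4, 3.2)]
    print(pareto_front(candidates))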
The memory cost increases when lowering the cycle bud-
get (see Figure 4). When lowering the cycle budget, a more
energy consuming and more complex (larger) memory archi-
tecture is needed to deliver the required higher bandwidth.
Note that the lower cycle budget is not obtained by reducing
the number of memory accesses, as in the case of the algo-
rithmic changes and data-flow transformations [2]. During
the currently discussed step in the system design trajectory,
the amount of data transferred to the data-path remains
equal over the complete cycle budget range. Nevertheless,
some control flow transformations and data reuse optimisa-
tions [3] can be beneficial for energy consumption and not
[Figure 4: sketch of the Pareto curve trading off memory cycle budget against cost; the budget axis runs from the minimum budget (full needed bandwidth, large cost) to the fully sequential ordering (low needed bandwidth, limited cost), the area above the curve marks uninteresting alternatives, and the span between the extremes is the available freedom.]
Figure 4: Pareto curve for trading off memory cycle budget vs. cost
for the cycle budget (or vice versa). This type of (largely)
platform-independent optimisations still has to be consid-
ered in the trade-off.
Figure 5 illustrates the concepts described above. If many
storage cycles are available to perform the data transfer then
the required bandwidth (shown in number of ports in this
figure) is small and no memory needs to be dual-port. If,
however, all the data transfers have to be executed in very
few cycles a large number of ports is required and, obviously,
the communication network should also be able to provide
the required bandwidth.
[Figure 5: two memory organisations feeding the functional units, plotted on storage cycles vs. memory energy axes.]
Figure 5: Number of cycles allocated for data transfer has a large impact on the required bandwidth
The interesting range of such a Pareto graph on the cycle
budget axis, should be defined from the critical path up
to the fully sequential memory access path. In the fully
sequential case all the memory accesses can be transferred
over a single port (lowest bandwidth). However, the number
of memories is not necessarily constrained to one memory.
We can still in a later phase exploit the distributed memory
organisation to minimise energy consumption. At the other extreme, for the critical path, only a limited set of memory access orderings is valid. Many memory accesses are then performed
in parallel.
6. EXPLOITING HETEROGENEITY IN PREDEFINED MEMORY ORGANISATIONS
A heterogeneous memory organisation has different kinds
(or flavours) of memory instances in a single layer. These
different flavours can come from having different sizes, dif-
ferent bit-widths, different numbers of ports, or due to different implementation options at the (internal) organisation, circuit and/or even the technology processing level. All
these enable the creation of different energy and delay char-
acteristics. For example, a large single-port memory will
consume more power per access and will be slower than a
smaller single-port memory, but a smaller dual-port mem-
ory may be even more power hungry than the original mem-
ory. These kinds of trade-offs are very difficult to explore
manually and we will show that they can be systematically
explored.
In a custom design context, tools already exist that can
decide on the number and the sizes of all the memories in the
memory organisation for minimum power consumption [17].
This is a difficult problem to solve, since all the arrays of
the application have to be assigned to one of the memories
in an optimal manner. The optimal number of memories
and the optimal assignment of arrays to these memories, is
not trivial, since information about the array bit-width and
access frequency has to be taken into account. Figure 6 illus-
trates how memory allocation and assignment can provide
power vs. area trade-offs for the memory organization. The
left-hand assignment is bad for power because the very fre-
quently accessed array A is stored in a large memory. Thus
the product of energy per access times access frequency is
very large. On the right-hand side, we can see a better choice
for power, but an overhead in area is present. The reason is
that the bit-widths of arrays B and C are different, forcing the right memory to be much larger than the sum of the sizes of the arrays. Similar trade-offs can be generated involving
delay, since memory access delay increases as memories grow
bigger.
Thus a methodology and tools [17] are needed to find the
power optimal solution given timing and/or area constraints.
In that case the number of available options is very large and
everything can be decided at design time where only the cho-
sen memories will be processed. The results of these tools
indicate once more that the most power-optimal organisa-
tion is distributed and heterogeneous.
[Figure 6: two alternative assignments of arrays A, B and C to two memories: one with low area but high power, the other with low power but an area overhead.]
Figure 6: Energy consumption vs. area and delay trade-offs involved in the Memory Allocation and Assignment step
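The sketch below mimics this allocation-and-assignment trade-off on three invented arrays and two memories. The cost models are deliberately crude stand-ins (a memory's word width is set by its widest array, so mixing bit-widths wastes area, and energy per access grows with capacity); the real tools of [17] use accurate memory models instead.

    from itertools import product

    # Invented arrays: name -> (words, bit-width, access count).
    arrays = {"A": (64, 8, 50000), "B": (256, 16, 2000), "C": (512, 6, 1500)}

    def memory_cost(assigned):
        """Toy model: capacity in bits and energy per access of one memory."""
        words = sum(arrays[a][0] for a in assigned)
        width = max(arrays[a][1] for a in assigned)   # widest array sets width
        bits = words * width
        e_acc = 0.1 + 0.001 * bits ** 0.5             # grows with capacity
        return bits, e_acc

    options = []
    for labels in product(range(2), repeat=len(arrays)):  # two memories
        groups = [[], []]
        for name, g in zip(arrays, labels):
            groups[g].append(name)
        if not all(groups):
            continue                                   # use both memories
        area = energy = 0.0
        for g in groups:
            bits, e_acc = memory_cost(g)
            area += bits
            energy += sum(arrays[a][2] for a in g) * e_acc
        options.append((energy, area, groups))

    for e, a, g in sorted(options):
        print(f"energy={e:9.1f}  area={a:6.0f} bits  assignment={g}")

Sorting by energy makes the Figure 6 effect visible: the lowest-energy assignments isolate the heavily accessed array A in a small memory of its own, at the price of extra area whenever arrays of different bit-widths end up sharing a memory.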
In the context of predefined architectures the focus is a
little different. The available memory options are (partly
as we will see later) already decided. This means that the
system design tools do not have the freedom any more to
allocate the best possible memory. However, if all the available memories are of the same type, then the optimisation
freedom of the tool is significantly reduced. On the other
hand, if some heterogeneity already exists in the architec-
ture then the tools can use this freedom for optimisation.
Finding which flavours of memories are the best candidates
for use in this context is not easy, but it is feasible within
limitations. Given the target application domain, it is possi-
ble to derive a class of memories that are likely candidates.
For example, small local memories can be used by appli-
cations that perform filtering of single- or two-dimensional
signals. Motion estimation kernels are such kinds of applica-
tions, where a small 2-D block traverses an image in order to
find its match. Exploiting the available reuse we can store
small parts of the image and the small 2-D block in local
memories. These memories will be very active, thus they
need to be as small as possible.
For very delay critical applications, perhaps some of these
small memories will have to be dual-port to provide the re-
quired bandwidth. By experimenting with the applications
included in this domain, one can find common memory re-
quirements such that a selection of a representative set of
heterogeneous memories is feasible. Note, though, that this
will incur an overhead in the area of the local memory layer.
In order to increase the freedom of the design tools several
additional memories should be available on top of the mini-
mal set. This implies that some redundancy will exist in this
local layer. As a result, the number of available memories
will be quite large. But a reasonable redundancy in these
small memories is acceptable. Overall, the area they occupy
is small (usually even negligible) compared to the second and
third layer memories which can contain tens or hundreds of
Mbits. So, for example, a 20% redundancy for the global first-level memory layer (which will contain less than 1 Mbit in
total) will incur a marginal increase in the total chip area.
Since this tiny area overhead can be exploited to generate a
significantly lower overall energy consumption for the same
system performance constraints, this is a very useful trade-
off. That is especially true for portable embedded systems,
where energy (or average power) consumption is the first
consideration. Further opportunities for optimisation can
be provided by also taking into account information about
the life-time of the different application arrays and trying
to map several arrays with non-overlapping life-times in the
same address space. These will not be further discussed in
this paper.
7. APPLICATION CASE STUDY ON A
SIMPLE ARCHITECTURE TEMPLATE
7.1 Target platform
In this subsection we will describe an example of the aforementioned predefined partly-programmable platform, see Fig-
ure 7. This example consists of one off-chip main memory,
a first layer of memories and the processing elements. Typ-
ically the off-chip memory is an SDRAM. The first layer
memories can be homogeneous (several identical memories)
but preferably should be heterogeneous (several different
memories). A heterogeneous memory layer implies memo-
ries that are different in size, but also in number of ports and
other characteristics. We will assume for this example that
our architecture is actually heterogeneous and that memo-
ries (or memory planes in the SDRAM) that are not used
can be powered-down to minimise static energy consump-
tion. This simple architecture already provides a memory hierarchy and a distributed local layer of memories.
[Figure 7: an off-chip SDRAM main memory connected through a configurable communication network to the on-chip L1 SRAM memories and the processing elements.]
Figure 7: Example architecture
7.2 Mapping process
The first problem to be solved is the ordering of the mem-
ory accesses. In case the local memory layer is heteroge-
neous, energy vs. number of cycles trade-offs can be gen-
erated by ordering the memory accesses in different ways.
Let’s assume, for example, that single- and dual-port memories exist in the local layer. Assigning an application array to a dual-port memory increases the available bandwidth of this array: two elements can be read simultaneously. This allows the scheduler to find an ordering which requires a larger bandwidth, for instance two elements per cycle, so that the number of cycles required to perform all the memory transfers is reduced. But using a dual-port memory induces a penalty in energy consumption, hence the energy vs. number of cycles trade-off, as the sketch below illustrates.
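A toy back-of-the-envelope version of this trade-off, with invented access counts and energy figures: doubling the ports of each array's memory halves the cycle count of the loop but raises the energy per access.

    import math

    # Invented per-array access counts for one loop nest.
    accesses = {"A": 600, "B": 600}

    def cycles_and_energy(ports, e_acc):
        """Cycles and energy when every array sits in a memory with
        `ports` ports and energy `e_acc` per access (assumed values).
        Different memories are accessed in parallel; accesses to one
        memory are limited by its port count."""
        cycles = max(math.ceil(n / ports) for n in accesses.values())
        energy = sum(accesses.values()) * e_acc
        return cycles, energy

    print("single-port:", cycles_and_energy(ports=1, e_acc=1.0))
    print("dual-port:  ", cycles_and_energy(ports=2, e_acc=1.6))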
The memory access schedule can also have an impact on
the energy consumption of the SDRAM. Typically SDRAMs
have a few different read modes, for example page, burst or
interleaved mode. The interleaved mode provides the largest
possible bandwidth, but in order to use it the data has to be distributed over several banks. This means that all the banks should be active and will be accessed quite often; the result is that static, and also dynamic, energy consumption is quite high. If, on the other hand, all the application data
are stored into one bank, then only this bank will be ac-
tive. This is beneficial for static energy consumption, but
the available bandwidth is reduced. [19] provides a system-
atic methodology and prototype tool to exploit this trade-off
with very promising experimental results.
The way to exploit these possibilities is by balancing the
bandwidth requirements over time. In the usual case they
vary significantly over time and at some point a lot of band-
width is required, while most of the time the requirements
are very relaxed. By balancing the bandwidth requirements
we avoid having to use all the SDRAM banks simultane-
ously, thus we can power-down some of them saving static
power.
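The following sketch quantifies this effect with an invented SDRAM model: given the accesses of one frame and the cycles available, it picks the smallest number of active banks that still meets the bandwidth requirement, showing how a balanced (relaxed) requirement lets banks be powered down. The constants are assumptions, not datasheet values; [19] describes the actual methodology.

    # Invented SDRAM model constants.
    E_STATIC_PER_BANK = 2.0     # static energy per active bank per frame
    E_PER_ACCESS = 0.05         # dynamic energy per access
    BANK_BANDWIDTH = 1.0        # accesses/cycle one bank can sustain

    def sdram_energy(accesses, cycles, banks):
        """Frame energy with `banks` active, or None if bandwidth fails."""
        if accesses / cycles > banks * BANK_BANDWIDTH:
            return None
        return banks * E_STATIC_PER_BANK + accesses * E_PER_ACCESS

    accesses = 3000
    for cycles in (1000, 2000, 4000):     # peaked vs. balanced requirements
        feasible = [(e, b) for b in (1, 2, 4)
                    if (e := sdram_energy(accesses, cycles, b)) is not None]
        energy, banks = min(feasible)
        print(f"{cycles} cycles: {banks} active bank(s), energy {energy:.1f}")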
The same concepts apply also to the local memories of
layer 1. If the maximum possible bandwidth is required then
the use of multi-port memories is indispensable, but the en-
ergy overhead is large. Furthermore, the same bandwidth is
required from the communication network. An application
with relaxed timing constraints, on the other hand, can use
a single bus and only multiple single-port memories which is
the optimal combination for the energy consumption of the
memory organisation.
Additionally, in order to exploit the available hierarchy in
the memory organisation some data reuse should be present
in the application source code. Reuse exists when the same
data element of an array is used multiple times during the
processing [14, 20]. In this case it is more efficient to copy
part of this array to a small memory and perform multiple
accesses to this memory, improving the total energy con-
sumption [16, 7, 21]. Let’s assume that the application that
is running on this architecture is a two-dimensional filtering
of an image. This operation involves two data structures,
one small two-dimensional array (the filter, for example 3x3
coefficients) and the image, which is typically a large two-
dimensional array. Due to their sizes the filter array is usu-
ally stored in one of the small layer 1 SRAMs and the image
is stored in the off-chip memory. To perform the filtering
operation the centre coefficient of the filter will traverse all
the pixels of the image in a row-wise manner. At each image
pixel the nine relevant image pixels will have to be fetched
from the off-chip memory and be temporarily stored in one
of the layer 1 memories or in registers. After the filtering
of one pixel is performed, the filter will move to the next
pixel to the right. But out of the nine image pixels that
were used in the filtering of the previous pixel, six can be
reused. This type of reuse can be exploited to optimise the
power consumption of the memory organisation. Instead
of always loading from the off-chip memory the nine cur-
rently involved image pixels we can keep the six that were
already loaded for the previous operation and retrieve from
the main memory only the three “new” ones. This way for
the filtering of each pixel we save six accesses to the off-
chip SDRAM and substitute them with accesses to the local
memories. Given the large difference in energy per access
between the SDRAM and on-chip SRAMs, this saving in
off-chip accesses can provide very significant power gains, as
demonstrated e.g. in [16].
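Under the simplifying assumption that a fresh 3x3 window costs nine off-chip fetches and a one-pixel move to the right costs only three, a few lines suffice to estimate the saving; border effects are ignored and the QCIF image size is just an example.

    # Off-chip fetch count for a 3x3 filter sweep over a W x H image.
    W, H = 176, 144                       # QCIF, as an example

    naive = W * H * 9                     # nine fetches per filter position
    # With reuse: a full window only at the start of each row, then
    # three "new" pixels per one-pixel step to the right.
    reuse = H * (9 + (W - 1) * 3)

    print(f"without reuse: {naive:,} off-chip fetches")
    print(f"with reuse:    {reuse:,} off-chip fetches")
    print(f"saving:        {1 - reuse / naive:.0%}")

The roughly two-thirds reduction in off-chip fetches corresponds to the six-out-of-nine pixel reuse described above; with the order-of-magnitude energy gap between SDRAM and on-chip SRAM accesses, most of that saving carries over to energy.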
In order to exploit the reuse available in the application
we have introduced some small arrays that are copies of
parts of larger arrays that are stored off-chip. Combined
with the original application signals they give us an idea
of the total size of the application data. Furthermore the
bandwidth requirements have been fully defined by the step
that performs the access ordering. The next step now is to
assign all these application data into the available memories.
In order to obtain the power-optimal assignment, several issues have to be combined. The most important is that heavily
accessed arrays should be placed in memories with small
energy per access. If the architecture provides a heteroge-
neous set of layer 1 memories, then they will have different
energy consumption characteristics. It is obvious that the
arrays which gather most of the activity should be stored
in memories which consume less energy per access. How-
ever, several constraints should be taken into account. The
application arrays have different sizes, different bit-widths
and different activation frequencies. The memories of the
architecture, on the other hand, are partly (see Section 8)
fixed. This power optimisation process is still quite complex
and given that some freedom in customising the memories
is still available, the tools have enough options in order to
find a good solution.
8. AN INTERCONNECT TECHNOLOGY BASED TECHNIQUE TO INCREASE THE HETEROGENEITY OF THE MEMORY LAYERS
It is clear that having heterogeneity in the memory or-
ganisation is crucial to minimise the energy consumption.
In this section we will discuss a technique that allows us to
have memories with the same size, bit-width and number of
ports, but with different energy-delay characteristics. This
variation can come from two sources. The first is the internal partitioning of the memory. Partitioning means splitting
the array of cells that store the data into smaller sub-arrays
and activating only one at each access. The second source
of variation we can exploit is the physical dimensions of the
interconnect wires that are used for the implementation of
the memories [22], see Section 8.1. The main energy con-
sumption gains come from the variation on the wire dimen-
sions given that memories are usually interconnect domi-
nated. Combined with the sub-array partitioning this leads
to very good Pareto ranges.
However these options have to be available also for prede-
fined memory organisations. Two ways exist to create these
effects in predefined platforms.
The first is structured ASICs. These are platforms where
the active area (silicon) and the lower metal layers have al-
ready been processed, thus fully defined. But the upper
metal layers have not yet been added. Since the freedom
to choose and route the wires inside the memories might
still be available, we can choose an implementation of the
upper metal layers where the dimensions of the wires are cus-
tom. Furthermore, potentially the partitioning can also be
decided at that stage if enough freedom is foreseen in the
phase of implementing the active area. Structured ASICs
are already provided by several companies like Synplic-
ity [23], Faraday Technology Corp. [24], AMI semiconduc-
tors [25] and others.
The second way to create these options is by having some
redundancy in the memory organisation. For example, the
predefined architecture may include several memory units or
sub-arrays of exactly the same size and number of ports, but
implemented in such a way that their energy consumption
and delay are different. Then during the application map-
ping phase the tools have the freedom to decide which of
the memory units is activated and which of the sub-arrays
are combined for a particular unit. That additional free-
dom is beneficial for optimising overall energy consumption.
It is clear that this redundancy has an area overhead but
in the lowest layer that is negligible, see Section 6. This
area overhead can be used to improve the heterogeneity of
the memory organisation, thus providing more freedom for
power optimisation to the designers.
8.1 Creating heterogeneity at the SRAM level
As indicated, one important way to create multiple in-
stances of memory sub-arrays or units which have different
energy consumption and access delay characteristics is by
customising the dimensions of the interconnect wires during
the processing of the chip.
It has been shown [22] that by varying these dimensions
energy/delay trade-offs can be achieved at the level of in-
dividual wires. We will first evaluate the impact of these
trade-offs on a complete memory and in a second phase use
them to create heterogeneous memory organisation layers.
The memory model we have used is based on a varia-
tion of the CACTI model [15] developed at the University
of Texas [26]. This model has improved scaling behaviour
compared to the original CACTI model, thus reflects better
the energy consumption and delay behaviour of current and
future embedded memories. We have further extracted an
embedded SRAM model from it, since it was originally a
model for on-chip caches and made some additional changes
that will be discussed in the next section.
[Figure 8: scatter plot of the explored implementations; axes: delay 0-10 ns vs. energy 0.00-0.08 nJ.]
Figure 8: Pareto exploration of a 32kbit memory instance, partitioning and interconnect parameters range
8.1.1 Memory module level organisation exploration
State-of-the-art SRAMs do not have one monolithic array
of cells, but the “storage area” is split into several partitions
or banks. Each time the memory is accessed only one parti-
tion has to be activated, thus memory energy and delay are
improved. By exploring, however, the partitioning we can
create energy-delay trade-offs at the level of memory mod-
ules. Partitioning is applied by dividing the bit-lines or the
word-lines of the array of cells into two pieces recursively.
But dividing the bit-lines has a different effect on total en-
ergy consumption and delay than dividing the word-lines. By
exploring these combinations we achieve trade-offs between
memory energy and delay.
On top of the exploration of possible partitioning schemes
that is included in CACTI we have added the exploration of
the dimensions of the interconnect wires.
Coupling the trade-off at the technology level to the mem-
ory model we can see its large influence on the entire mem-
ory at the level of IP block. In combination with the energy-
delay trade-off due to memory partitioning, we can now have
very good ranges in the energy-delay Pareto optimal trade-
off curve of the entire memory, see Figure 8.
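The sketch below emulates this combined exploration with deliberately crude analytical stand-ins for the CACTI-style models: for every (partitioning depth, wire width) pair it evaluates an energy and a delay figure, then keeps the Pareto points. All constants are invented; only the shape of the resulting trade-off curve is the point.

    def evaluate(partitions, width):
        """Toy energy/delay model of one memory implementation."""
        wire_len = 1.0 / partitions           # active sub-array wires shrink
        r = wire_len / width                  # resistance ~ length / width
        c = wire_len * (0.5 + width)          # capacitance grows with width
        delay = r * c + 0.2 * partitions ** 0.5   # plus decoder overhead
        energy = c + 0.05 * partitions            # plus periphery overhead
        return energy, delay

    points = {(p, w): evaluate(p, w)
              for p in (1, 2, 4, 8, 16)          # partitioning depths
              for w in (0.5, 1.0, 2.0, 4.0)}     # relative wire widths

    pareto = {k: v for k, v in points.items()
              if not any(o[0] <= v[0] and o[1] <= v[1] and o != v
                         for o in points.values())}

    for (p, w), (e, d) in sorted(pareto.items()):
        print(f"partitions={p:2d} width={w:3.1f} energy={e:.3f} delay={d:.3f}")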
The final output of this model is now an energy-delay-area
Pareto optimal trade-off curve which shows all the optimal
feasible operating points of the particular memory (Figure 8,
only energy-delay Pareto points are shown). Thus if some
redundancy is allowed in the memory organisation, then sev-
eral flavours of the same memory can be present. A fast,
power-hungry memory might be needed if the timing re-
quirements are very tight, but a slow, power-efficient ver-
sion of the same memory will significantly improve the total
power consumption if the timing constraints are not very
critical.
It is important to note here that not all these Pareto
points are necessary in order to get system level gains. What
is more important is the available range rather than the to-
tal number of points. It is not necessary to have all these
different memory flavours made available on the platform.
Having several memories which are heterogeneous in energy
and delay characteristics enables our design tools to optimise
energy consumption significantly, especially if the ranges in
energy and delay are large enough. This heterogeneity will
be the result of a combination of different size, bit-width,
number of ports and technology or circuit implementation
choices for each memory.
8.2 Propagation of memory level trade-offs to
application level
So far we have been dealing with the wires that exist
inside the memories of the memory organisation. Apart from these, wires are also used for the implementation of
the busses. However, the contribution of these wires can be
kept low by additional optimisation techniques. The first is
activity-aware power optimal floor-planning. The idea is to
place the heavily active memories close to the data-paths,
so that they have short connections. Less active memories
can be placed farther away. The second technique is to use
a bus segmentation approach for the implementation of the
busses, which partitions the bus into several segments and
only the necessary segments are activated for each memory
transfer. Combining these two techniques makes sure that
the inter-memory interconnect power consumption stays rel-
atively small and we can focus only on the intra-memory
power consumption. Experiments show that the power con-
sumed on the busses can be kept under 20% of the total
power of the memory organisation (memories and busses),
if the application has already been optimised for data transfer and storage, using e.g. DTSE [22, 27].
In order to assess the impact of the energy vs. delay
trade-offs on the application level we have to apply the mem-
ory module level trade-offs on an actual application. The
driver application we have used is a Digital Audio Broad-
cast (DAB) channel decoder [4] which has been optimised
for data transfer and storage management using the DTSE
methodology. After optimisations, the clock frequency re-
quired to implement the application while meeting the real-
time constraints is only 43 MHz.
These optimisations help to relax the timing constraints
on the individual memories significantly, thus creating an
opportunity to trade off delay slack for minimum energy
consumption. Thus, global system-wide trade-offs can be made, in which the optimal memory organisation is shaped by exploiting the per-memory trade-off space during the memory selection process, as the sketch below illustrates.
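A minimal sketch of this slack-driven selection, with invented Pareto points and slack bounds: each memory independently gets the slowest (hence most energy-efficient) implementation flavour whose access time still fits the cycle time the schedule allows it.

    # Invented (delay ns, energy per access) Pareto points of one SRAM;
    # along a Pareto curve, larger delay means lower energy.
    flavours = [(1.0, 0.080), (1.5, 0.055), (2.5, 0.040), (4.0, 0.030)]

    # Allowed access time per memory, derived from scheduling slack
    # (assumed; memories on the critical path get the tightest bound).
    allowed = {"M0": 1.2, "M1": 2.8, "M2": 4.5, "M3": 4.5}
    accesses = {"M0": 90000, "M1": 20000, "M2": 15000, "M3": 5000}

    total = 0.0
    for mem, bound in allowed.items():
        delay, e_acc = max((d, e) for d, e in flavours if d <= bound)
        total += accesses[mem] * e_acc   # slowest feasible = least energy
        print(f"{mem}: delay {delay} ns, {e_acc} energy/access")
    print(f"total memory energy: {total:.0f}")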
In order to see how the energy-delay trade-offs propagate
to the application level we have performed an experiment
using the fastest memory module implementation for the
SRAMs of the library.
The goal of the experiment is to find the optimal number
of memories that should be used in the memory organisation
to minimise its power consumption. We take into account
the number of memories, the power consumption for each
memory and the access frequency of each memory. The re-
sult is the optimal memory organisation and the total power
dissipated on it.
A second experiment has involved using the slowest possi-
ble, energy optimal, interconnect wires which meet the real-
time constraint of the application. This means that the wires
inside the memories will be customised to the delay require-
ments of each memory. Memories on the critical path will
have faster wires than the ones that have relaxed timing
constraints. The results are shown in Figure 9. Both exper-
iments have been performed assuming an identical ordering
for the memory accesses. The horizontal axis of the figure
refers to the number of local memories that are allocated
by the Memory Allocation and Assignment tool [17]. It is
clear that the more memories the tool can allocate (more
freedom) the better the final solution is in terms of power
consumption. A limit for the number of memories exists,
after which further allocation of memories does not impact
the power consumption, but it depends on the application.
By comparing the results of these two experiments we
can see that for the best memory allocation and assignment
case (10 1st layer memories), the selective use of slow, but
power-efficient, wires can save up to almost 40% in power
consumption of the memory organisation compared to a design based on the interconnect points proposed by the ITRS roadmap.
[Figure 9: power (mW, 0 to 1.4) versus the number of allocated memories (3 to 10), with one curve for the fast memories and one for the power-optimal memories.]
Figure 9: System power consumption using the ITRS and the power-optimal interconnect options
These experiments have been performed in a custom ASIC
context, where the memories can be fully designed and op-
timised at design time. The results, though, prove that
a homogeneous memory organisation is not suited for low
power operation. Heterogeneity gives additional freedom to
the design tools which is required to further reduce energy
consumption.
The technology node used for this practical experiment is
130nm because we did not have access to well-characterised
design data for lower technology nodes. At this technol-
ogy node, the impact of the wire trade-offs is still not very
significant. The majority of the power consumption and de-
lay of the chip can be attributed to the active part of the
chip (transistors). On the other hand, for smaller technol-
ogy nodes the gains are expected to be much higher. The
reason for this is that process scaling does not affect the
wires in a similar way as the transistors. Making the tran-
sistors smaller reduces their capacitance while their resis-
tance is not seriously degraded, thus their speed and energy consumption are improved. Making the wires smaller,
though, increases their resistance significantly while their
capacitance is also increased because the distance between
the wires becomes smaller [28, 29, 30]. Thus, their electrical
characteristics, energy consumption and delay are seriously
degraded. The combination of the two fore-mentioned ef-
fects explains why the energy gains from changing the wire
dimensions grow with technology scaling and also motivates
such an exploration for future deep sub-micron technologies.
9. CONCLUSIONS
The power consumption of the global data memory organ-
isation is fast becoming a major bottleneck in the design of
energy-efficient embedded systems. In this paper we have
presented a system design flow and a method to generate
a memory library that is targeted towards total system en-
ergy optimisation within real-time constraints, which can to a large extent alleviate this problem. Pure source code and
mapping transformations can already provide a significant
reduction in the memory organisation power consumption.
Any additional freedom to customise the platform architec-
ture and especially to increase its heterogeneity can result in
further gains, by providing additional freedom to the mem-
ory management design flow to find better overall solutions.
10. REFERENCES
[1] G.Lawton, “Storage technology takes the center
stage”, IEEE Computer Magazine, Vol.32, No.11,
pp.10-13, Nov. 1999.
[2] F.Catthoor, S.Wuytack, E.De Greef, F.Balasa,
L.Nachtergaele, A.Vandecappelle, “Custom Memory
Management Methodology – Exploration of Memory
Organisation for Embedded Multimedia System
Design”, ISBN 0-7923-8288-9, Kluwer Acad. Publ.,
Boston, 1998.
[3] F.Catthoor, K.Danckaert, C.Kulkarni, E.Brockmeyer,
P.G.Kjeldsberg, T.Van Achteren, T.Omnes, “Data access and storage management for embedded programmable processors”, ISBN 0-7923-7689-7, Kluwer
Acad. Publ., Boston, 2002.
[4] Radio broadcasting systems; digital audio
broadcasting to mobile, portable, and fixed receivers.
Standard RE/JTC-00DAB-4, ETSI, ETS 300 401,
May 1997.
[5] P.Strobach, “QSDPCM – A New Technique in Scene
Adaptive Coding,” Proc. 4th Eur. Signal Processing
Conf., EUSIPCO-88, Grenoble, France, Elsevier Publ.,
Amsterdam, pp.1141–1144, Sep. 1988.
[6] Multi-media compilation project (Acropolis) at IMEC
http://www.imec.be/acropolis/
[7] P.Panda, F.Catthoor, N.Dutt, K.Danckaert,
E.Brockmeyer, C.Kulkarni, A.Vandecappelle,
P.G.Kjeldsberg, “Data and Memory Optimizations for
Embedded Systems”, ACM Trans. on Design
Automation for Embedded Systems (TODAES), Vol.6,
No.2, pp.142-206, April 2001.
[8] A.Macii, L.Benini, M.Poncino, “Memory Design
Techniques for Low Energy Embedded Systems”,
ISBN 0-7923-7690-0, Kluwer Acad. Publ., Boston,
2002.
[9] P.R.Panda, N.D.Dutt, A.Nicolau, “Memory issues in
embedded systems-on-chip: optimization and
exploration”, Kluwer Acad. Publ., Boston, 1999.
[10] A.Halambi, P.Grun, V.Ganesh, A.Khare, N.Dutt and
A.Nicolau. “EXPRESSION: A Language for
Architecture Exploration through Compiler/Simulator
Retargetability”, Proc. 2nd ACM/IEEE Design and Test in Europe Conf. (DATE), Munich, Germany,
March 1999.
[11] http://www.cse.psu.edu/~kandemir/research.html
[12] M.Kandemir, J.Ramanujam, A.Choudhary,
“Improving cache locality by a combination of loop
and data transformations”, IEEE Trans. on
Computers, Vol.48, No.2, pp.159-167, Feb. 1999.
[13] J.Ramanujam, J.Hong, M.Kandemir, A.Narayan,
“Reducing memory requirements of nested loops for
embedded systems”, 38th ACM/IEEE Design
Automation Conf., Las Vegas NV, pp.359-364, June
2001.
[14] T.Van Achteren, R.Lauwereins, and F.Catthoor.
Systematic data reuse exploration techniques for
non-homogeneous access patterns. In Proc. 5th
ACM/IEEE Design and Test in Europe Conf.
(DATE), pages 428–435, Paris, France, Apr. 2002.
[15] S.J.E.Wilton, N.P.Jouppi, “CACTI : An enhanced
cache access and cycle time model”, IEEE J. of Solid
State Circuits, Vol.31, No.5, pp.677-688, May 1996.
[16] E.Brockmeyer, M.Miranda, F.Catthoor, H.Corporaal,
“Layer Assignment Techniques for Low Power in
Multi-layered Memory Organisations”, Proc. 6th
ACM/IEEE Design and Test in Europe Conf.
(DATE), Munich, Germany, pp.1070-1075, March
2003.
[17] A.Vandecappelle, M.Miranda, E.Brockmeyer, F.Catthoor, D.Verkest, “Global Multimedia System Design Exploration using Accurate Memory Organization Feedback”, Proc. 36th ACM/IEEE Design
Automation Conf., New Orleans LA, pp.327-332, June
1999.
[18] E.Brockmeyer, A.Vandecappelle, F.Catthoor,
“Systematic cycle budget versus system power trade-off: a new perspective on system exploration of real-time data-dominated applications”, Proc. IEEE Int. Symp. on
Low Power Electronics and Design, pages 137-142,
Rapallo, Italy, Aug. 2000.
[19] P.Marchal, J.I.Gomez, D.Bruni, F.Catthoor, M.Prieto,
L.Benini, H.Corporaal, “SDRAM-Energy-Aware
Memory Allocation for Dynamic Multi-Media
Applications on Multi-Processor Platforms”, Proc. 6th
ACM/IEEE Design and Test in Europe Conf.
(DATE), Munich, Germany, pp.516-521, March 2003.
[20] I.Issenin, E.Brockmeyer, M.Miranda, N.Dutt, “Data
reuse analysis technique for software-controlled
memory hierarchies”, Proc. 7th ACM/IEEE Design
and Test in Europe Conf. (DATE), Paris, France,
Feb. 2004.
[21] K.Masselos, F.Catthoor, A.Kakarudas, C.E.Goutis,
H.De Man, “Memory Hierarchy Layer Assignment for
Data Re-Use Exploitation in Multimedia Algorithms
Realized on Predefined Processor Architectures”,
Proc. Intnl. Conf. on Electronic Circuits and Systems
(ICECS), Malta, pp.I.285-I.288, Sep. 2001.
[22] A.Papanikolaou, M.Miranda, F.Catthoor,
H.Corporaal, H.De Man, D.De Roest, M.Stucchi,
K.Maex, “Global interconnect trade-off for technology
over memory modules to application level: case
study”, 5th ACM/IEEE Intnl. Wsh. on System Level
Interconnect Prediction, Monterey CA, April 2003.
[23] Synplicity - Structured ASICs
http://www.synplicity.com/products/structuredasic/
[24] Faraday Tech. Corp. - Structured ASICs,
http://www.faraday-tech.com/html/ASIC/IPsolution/structured.html
[25] AMI semiconductors - Structured ASICs
http://www.amis.com/asics/structured_asics/
[26] V. Agarwal, S. Keckler, D. Burger, “The effect of technology scaling on microarchitectural structures”, Technical Report TR2000-02, University of Texas at Austin.
[27] H.Wang, A.Papanikolaou, M.Miranda, F.Catthoor, “A
global bus power optimization methodology for
physical design of memory dominated systems by
coupling bus segmentation and activity driven block
placement”, Proc. IEEE Asia and South Pacific
Design Autom. Conf. (ASPDAC), Yokohama, Japan,
Jan. 2004.
[28] D. Sylvester, K. Keutzer, Impact of small process
geometries on microarchitectures in systems on a chip,
Proceedings of the IEEE, vol.89, no.4, p. 467, April
2001.
[29] J.A. Davis, R. Venkatesan, A. Kaloyeros, M.
Beylansky, S.J. Shouri, K. Banerjee, K.C. Saraswat,
A. Rahman, R. Reif, J.D. Meindl, Interconnect limits on gigascale integration (GSI) in the 21st century, Proceedings of the IEEE, vol.89, no.3, p.305, March
2001.
[30] D. Sylvester, C. Hu, O.S. Nakagawa, and S-Y. Oh,
Interconnect scaling: signal integrity and performance
in future high-speed CMOS designs, Proceedings Of
Symposium on VLSI Technology, pp. 42-43, 1998.