Overcoming The "Memory Wall" By Improved System
Design Exploration And A Link To Process Technology
Options
Antonis Papanikolaou, Miguel Miranda and Francky Catthoor
IMEC, Kapeldreef 75, Leuven, Belgium
(F. Catthoor is also a professor at the Katholieke Universiteit Leuven, Belgium)
{papaniko,miranda,catthoor}@imec.be
ABSTRACT
Data transfer and storage issues “take the centre stage” in
information and communication systems because of the in-
creasing complexity and data dominance typically associated with them. In this paper, we summarise a systematic methodology for optimising power-critical storage modules such
as local SRAMs. We focus on their memory organisation
(access schedule and data assignment) together with an ex-
ploration of the effect that the interconnect technology may
have on the energy consumed by these local memory organ-
isations in deep sub-micron technologies.
Categories and Subject Descriptors
B.3.1 [Hardware]: Semiconductor Memories—Static Memory (SRAM); C.3 [Computer Systems Organization]: Real-time and embedded systems
General Terms
Design
Keywords
combined system design and process technology exploration,
optimal energy/delay trade-off exploration in memories
1. INTRODUCTION
Data transfer and storage issues “take the centre stage” [1]
in information and communication systems because of the
increasing complexity and data dominance typically associated with them. Additionally, while the evolution of process
technology scaling favours speed, this is not necessarily the
case for energy once the 100 nm barrier is crossed. This
is especially true for the back-end process technology such
as the interconnect. If these interconnect process options
are not well exploited the total energy budget for a given
application task may go up instead of the expected reduction. As a result, both the overall energy consumption and the performance of these systems suffer from a data transfer and storage bottleneck, one that especially impacts the embedded world.
At IMEC this bottleneck was identified in the late 1980s, and a system design methodology has been developed
for Data Transfer and Storage Exploration (DTSE) [2, 3].
This methodology aims at reducing the data and memory re-
lated energy consumption for a given application within real-
time constraints. It consists of two major parts, platform-
independent and platform-dependent optimisations. In the
first part the application source code is optimised, among other things, by making memory accesses more local and regular, independently of the target platform. In the second part, among other decisions, the
ordering of the memory accesses is decided along with the
mapping of the data on the memory organisation of the tar-
get architecture.
[Figure 1 shows arrays A (with copy candidates A’ and A”) and B (with copy candidate B’) mapped onto a two-layer organisation: an L2 memory (size 1 Mb, 10 uJ/access) and an L1 memory (size 1 Kb, 1 uJ/access), both feeding the register files and functional units of the scalar-level functionality.]
Figure 1: Optimal assignment of application data to memory layers
This summary paper is situated in the context of map-
ping application domains on predefined architecture plat-
forms. We do not consider general purpose processors, but
application-domain specific platforms, like modern DSP pro-
cessors and emerging multi-processor SoC platforms. The
target application domains include the ones that are repre-
sentative of applications for future embedded systems, namely
wireless and multi-media applications. Our main optimisa-
tion objective is the reduction of energy consumption of the
data memory organisation for a given task, while always
meeting the timing constraints imposed by the applications.
The platforms under consideration are (almost) fully prede-
fined and the designer has to take care of optimally map-
ping his applications on this platform in order to optimise
power consumption. Some reconfigurability can be present
in the memory organisation and its interconnect network.
Note that even a fully predefined architecture will still pro-
vide some flexibility, for example the communication net-
work between the memories and the processing elements usually contains some shared, programmable busses. Also
the way the individual data of the application are mapped
to the memory units and banks, or the so-called data layout
can be decided by the compiler or linker (possibly incorpo-
rating designer input).
We address the power-critical storage modules, such as local SRAMs, and focus on their memory organisation (access schedule and data assignment, see Figure 1), together with the effect that the interconnect technology has on the energy consumed by these modules. This is done both inside the SRAMs
(intra-memory) and between them (inter-memory). To ex-
pose the effects of the memory organisation and the inter-
connect in these modules, we have performed several ex-
periments using as drivers real-life applications of industrial
relevance that are representative of our target application domain, namely the Digital Audio Broadcast (DAB) receiver [4] and the Quad-tree Structured Difference Pulse Code Modulation (QSDPCM) video encoder [5].
Our approach for intra-memory interconnect energy op-
timisation is based on intentionally varying the width of
the interconnect wires while maintaining their relative pitch
to create interconnect structures inside the memories with
lower associated capacitance, hence with reduced energy.
The impact of this, a higher associated access delay, is com-
pensated at the system level by exploiting the available
parallelism in the data transfer at the architecture level.
This leads to memory instances that are significantly more
energy-efficient than the memories currently provided com-
mercially while they mainly differentiate in the dimensions
of the interconnect wires they use internally. Moreover, dif-
ferent implementations may exist for the same memory in-
stance, each having different access delay and energy per
access characteristics than the others, hence providing a va-
riety of Energy/Delay (E/D) trade-offs.
The next step in our flow aims at improving the hetero-
geneity of the memory organisation of the platform. The
way to achieve that is by propagating these E/D trade-offs
from the memory module level to the system-level mem-
ory organisation in order to save energy at that higher level
where the impact is significantly higher than the module
level. The goal is to construct the (local) memory organ-
isation by selecting those SRAM modules that allow just
meeting the system performance constraints. Hence, the
slack gained in system performance will translate into an over-
all energy gain. This last step is supported by the platform-
dependent mapping sub-stage of the Data Transfer and Stor-
age Exploration approach [3, 6, 7] and associated tool set
which is targeted to partially predefined memory organisa-
tions (still with a number of configuration parameters).
2. RELATED WORK
Apart from IMEC, three currently active research groups
exist that are internationally recognised for their contribu-
tions to data transfer and storage management related re-
search issues. One (co-operating) group is located at the
Universities of Bologna and Torino in Italy, the second one
at the University of California, Irvine, and the last one at Penn State University.
The groups cooperating in Torino/Bologna [8] are mainly
targeting memory design issues, hence their abstraction level
for exploration is at the hardware level and therefore diffi-
cult to port and re-target and also more limited in explo-
ration space. Indeed, by working at a lower abstraction level
(memory organisation/architecture) the exploration range is
more limited than at the system and/or application level as
in case of the DTSE approach. Therefore also the achiev-
able gains in implementation cost (size, speed, power) are
smaller.
The group at Irvine has been working on memory issues
in general [9] and on the EXPRESSION compiler [10] in particular. Their research is situated at the compiler level of
abstraction and has a quite complete memory view. Still,
it is not as retargetable as our focus, especially due to the
assumptions made on the distributed on-chip memory or-
ganisation or the SDRAM modules. They are highly fo-
cused on the typical programmable target processor’s local
memory organisations where a combination of caches and
scratch-pad memory are becoming available now.
The group at Penn State focuses on developing compiler
optimisations for data locality in order to exploit the avail-
able memory hierarchy [11, 12, 13]. They consider the use
of high-level transformations, which allows them to perform most of their optimisations at the source code level, hence their
approach is potentially retargetable. However, their tech-
niques, although systematic, do not explore Pareto-optimal
trade-offs. This limits the possibilities of using such an ap-
proach for memory organisation exploration in particular and
platform architecture exploration in general.
The high orthogonality, efficiency, portability and retar-
getability characteristics of IMEC’s DTSE approach [2, 3]
are unique when compared to the other past and currently
active data transfer and storage management research projects
outside IMEC.
3. MEMORY ORGANISATION TEMPLATE
It has already been shown that exploiting a data memory
hierarchy is very beneficial for power consumption, see Sec-
tion 4. In fact, power optimal memory organisations usually
have several layers of memories and each layer is distributed,
like the TI C55 family of DSPs. A distributed memory or-
ganisation can provide a larger bandwidth than a fully cen-
tralised one, which means that the application data can be
transferred much faster into the processing elements. Fur-
thermore, the fact that the total memory footprint is divided
between several memories results in a smaller energy per ac-
cess of each memory, compared to the centralised memory.
By exploiting these two axes of freedom we will see how
application energy consumption vs. application execution
time trade-offs can be created and using these trade-offs a
designer can find the power-optimal mapping for the given
application timing constraints, see Section 5.1.
To further optimise power consumption we can exploit
any potential heterogeneity in the memory layers. A dis-
tributed memory layer can contain memories that are iden-
tical or memories that differ from each other. If the platform
provides this kind of heterogeneity then we can use it to
further optimise the application power consumption by fine
tuning the energy consumption and delay characteristics of
the memories, so as to just meet the application timing con-
straints. Running as slow as possible (taking into account
the energy weight distribution), while still meeting the “real-
time” application constraints, translates into energy gains as
we will see in Section 8.2.
4. ENERGY EFFICIENT DATA MEMORY
HIERARCHY MANAGEMENT
Despite the recent architectural advances for multime-
dia aiming at improving computational efficiency (e.g., sub-
word parallel data level processing, reconfigurable comput-
ing, etc), the dominance in data storage and transfer of these
systems still remains one of the main bottlenecks for energy
and speed efficient implementations. The reason is the ever
increasing gap in speed and energy between the memory and
the data processing subsystems.
To cope with such a gap, the addition of more and more
layers to the memory hierarchy, from where data can be
efficiently accessed from smaller, faster and more energy ef-
ficient memories becomes mandatory. This is true both for
Systems on a Chip (SoC) platforms based on random ac-
cess memories as well as for cache memories. However, the
potential efficiency offered by these multi-layer memory or-
ganisations becomes attainable only on condition that the
storage and transfer of data between the different layers is
done in a quite optimised manner.
Memory hierarchy layers can contain software controlled
scratch-pad memories or caches. In order to guarantee an
efficient transfer of data along the data memory hierarchy
often this requires that smaller copies of the data are made
from the larger data arrays which can be stored in the smaller
layers [14]. Those copies must be selected such that they
minimise the overall transfer cost. In this context, any
transfer of data from a higher layer to the current one is
considered to be an overhead for the current layer.
This happens most efficiently under full software control
(e.g., a number of SRAM scratch-pad memories) because a
global view on the transfer can be obtained at design time.
In this case, copy operations should be explicitly present in
the application code. This is mostly possible for design time
analysable applications that are characterised via Pareto
curves collecting all optimal energy trade-offs for the dif-
ferent execution times. The decision of the selected Pareto
point can then already be made at design time for purely
static applications.
However, many real-life applications are dynamic in na-
ture and they cannot be completely characterised at de-
sign time. Traditionally, to cope with this dynamism, HW-
controlled caches are used instead of SW-controlled scratch-
pad memories. In this case, the hardware cache controller
will make the copies of signals at the moment they are ac-
cessed (and the copy is not present yet in the cache). How-
ever, this is inefficient because data present in the cache and required in the (near) future can be (wrongly) evicted in order to accommodate newly fetched data. This evicted data,
when needed again by the processor, will have to be brought
to the cache for a second time, hence leading to transfer and
power overhead. To minimise such overhead, architects tend
to use bigger caches with hardware controllers implementing
complex mapping policies. However, this is not efficient for
power given the extra overhead in every single access even
when these are cache hits. According to [15], a 1KByte 4-
way associative cache is between 4-5 times more inefficient
in terms of energy per access than a scratch-pad of the same
size and 2-3 times more than a (1-way associative) Direct Mapped cache (DM-cache), while the latter is only 60% less effi-
cient than a scratch-pad of the same size. This overhead
is due to accessing the tag array of the cache, whose size
(hence energy overhead) considerably increases with the as-
sociativity factor.
The Memory Hierarchy Layer Assignment (MHLA) methodology [16] is illustrated in Figure 1.
The assumed application contains two arrays, namely A and
B. For array A two possibilities exist for copying part of it to
smaller arrays and reusing these elements. For array B only
one such case was identified. The target platform is also
shown on the right. It is clear that the energy per access of
the L2 memory is much larger than the energy per access
of the L1 memory, an order of magnitude difference. In this
example, after tedious exploration the power-optimal choice
for assigning arrays to layers was to assign only array A to
the second layer. In the first layer, a small copy of part of A
is stored, along with B, which obviously was small enough
to fit in layer 1.
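To make the flavour of such a decision concrete, the following Python sketch enumerates the copy-candidate choices for two arrays and keeps the cheapest combination that fits the L1 capacity. It is a minimal illustration of the kind of search MHLA automates, not the actual tool; all sizes, access counts and per-access energies are invented for the example (the energy values loosely follow Figure 1).

    from itertools import product

    # Assumed per-access energies, loosely following Figure 1.
    E_L2, E_L1 = 10.0, 1.0        # uJ per access for the L2 and L1 layers
    L1_CAPACITY = 1024            # bytes available in the L1 layer

    # Total accesses per array and hypothetical copy candidates:
    # (copy size in bytes, accesses served from L1, remaining L2 transfers).
    # None means "no copy": every access goes to the L2 layer.
    total_accesses = {"A": 10000, "B": 4200}
    candidates = {
        "A": [None, (256, 9000, 1200), (512, 9600, 600)],   # A' and A''
        "B": [None, (128, 4000, 400)],                      # B'
    }

    def energy(choice):
        """Memory energy for one combination of copy candidates."""
        total = 0.0
        for arr, cand in choice.items():
            if cand is None:
                total += total_accesses[arr] * E_L2
            else:
                _size, l1_hits, l2_transfers = cand
                total += l1_hits * E_L1 + l2_transfers * E_L2
        return total

    best = None
    for combo in product(*candidates.values()):
        if sum(c[0] for c in combo if c) > L1_CAPACITY:
            continue                       # copies must fit in the L1 layer
        choice = dict(zip(candidates, combo))
        e = energy(choice)
        if best is None or e < best[0]:
            best = (e, choice)

    print(best)

Exhaustive search is feasible here only because the candidate sets are tiny; the real exploration must additionally schedule the copy transfers themselves, which is what makes tool support necessary.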
By applying the aforementioned techniques to a real-life
driver application we can show significant energy gains for
the complete memory organisation. The driver we have used
is the QSDPCM [5]. This application is a video encoder
and the memory organisation assumed contains two hierar-
chy layers. The size of the L1 layer differs slightly as we
can see in the results, see Figure 2, but the difference is so
small that these results can be directly translated to a pre-
defined architecture context. It is clear from the results that
by making some changes in the source code and introducing
copies of large arrays in the local layer most of the accesses
are now localised to this layer. This has a significant effect
on the energy that is consumed on the memory of the sec-
ond layer. By reducing the number of second layer accesses
by about 45% we notice a total energy reduction of about
35%. Note that the accesses to the local memory layer have
increased significantly, but the impact on the total L1 en-
ergy consumption is not very significant. The reason is that
an access to the second layer is much more costly than an
access to the local layer.
                                   Non-merged   Fully-merged   Partially-merged
  Number of L2 accesses (x10^3)           542            306                300
  Number of L1 accesses (x10^6)          1.14           1.19               1.19
  Layer size for L2 (Bytes)            114048          66352              63360
  Layer size for L1 (Bytes)               742            748                802
  L2 Energy                             58.18          32.85              32.17
  L1 Energy                             13.79          14.32              14.29
  Total Energy                          71.97          47.17              46.47

Figure 2: Localising the memory accesses by introducing copies in the code reduces total energy consumption
5. LOW-POWER DISTRIBUTED MEMORY ORGANISATIONS
5.1 How to meet real-time bandwidth
constraints
In data transfer intensive applications, a costly and dif-
ficult to solve issue is to get the data to the processor on
time. A certain amount of parallelism in the data trans-
fers is usually required to meet the application’s real-time
constraints. Parallel data transfers, however, can be very
costly. Therefore, the trade-off between data transfer band-
width and data transfer cost should be carefully explored.
This section describes the potential trade-offs involved, and
also introduces a new way to systematically trade off the data transfer cost against application run-time.
In our application domain, an overall target storage cy-
cle budget is typically imposed, corresponding to the overall
throughput. In addition, other real-time constraints can be
present which restrict the ordering freedom. In data trans-
fer and storage intensive applications, the memory accesses
are often the limiting factor to the execution speed, both
in custom “hardware” and instruction-set processors (“soft-
ware”).
Data processing can be easily sped up through pipelin-
ing and other forms of parallelism. Increasing the memory
bandwidth, on the other hand, is much more expensive and
requires the introduction of different hierarchical layers, typ-
ically involving also multi-port memories. These memories
cause a large penalty in area and energy though.
Because memory accesses are so important, it is even pos-
sible to make an initial system level performance evalua-
tion based solely on the memory accesses to complex data
types [2]. Data processing is then temporarily ignored ex-
cept for the fact that it introduces dependencies between
memory accesses. This section focuses on the trade-off be-
tween cycle budget distribution over different system com-
ponents and gains in the total system energy consumption.
Before going into details, a number of other issues need to
be introduced though.
Defining such a memory system based on a high-level spec-
ification is far from trivial when taking the real time con-
straints into account. High-level tools will have to support
the definition of the memory system [17]. A global data
transfer scheduling approach balancing the required mem-
ory bandwidth is needed to come up with a suitable memory
architecture within the given (timing) constraints (see sub-
section 5.1.1). As such, this is an impossible task to perform
manually, especially when taking sophisticated cost models
into account.
The cost models can be memory size (area) or power based
and allow us to have an impact on heterogeneous RAM-
based organisations [18]. Using such models will provide a
clear trade-off in the power, area and performance design
space.
To solve this practical design-time issue, we have devel-
oped design automation tools to support these decisions.
We use our tools to come up with Pareto curves to visualise
the useful trade-off space between the cycle budget assigned
to a given system submodule and its corresponding energy
and/or area consumption, i.e. involving three search space
axes. As far as we know, no other systematic (automatable)
approach is available in literature to solve this important
design problem.
5.1.1 Balancing memory bandwidth
The required bandwidth is as large as the maximum band-
width needed by the application (see Figure 3). When a very
high demand is present in a certain part of the code (e.g.
a certain loop nest), the entire application suffers. By flat-
tening this peak, a reduced overall bandwidth requirement
is obtained (see lower part of Figure 3).
Figure 3: Balancing the bandwidth lowers the cost.
The re-balancing of the bandwidth load is again performed
by reordering the memory accesses (also across loop scopes).
Moreover, the overall cycle distribution over all the loops
has a large impact on the peak bandwidth. Tool support is
indispensable here too for this difficult task. An accurate memory access ordering is needed to define a memory organisation that meets all system timing constraints.
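As a toy illustration of why balancing pays off, the Python sketch below compares the peak bandwidth of an equal-share cycle distribution against one where each loop gets cycles proportional to its access count; the loop profile and cycle budget are invented for the example.

    # Hypothetical profile: memory accesses issued by each loop nest.
    accesses = [12000, 3000, 500, 4500]
    cycle_budget = 10000              # assumed overall storage cycle budget

    # Naive distribution: every loop gets an equal share of the cycles.
    equal = [cycle_budget // len(accesses)] * len(accesses)
    peak_equal = max(a / c for a, c in zip(accesses, equal))

    # Balanced distribution: cycles proportional to each loop's accesses,
    # which flattens the bandwidth profile as in Figure 3.
    total = sum(accesses)
    balanced = [max(1, round(cycle_budget * a / total)) for a in accesses]
    peak_balanced = max(a / c for a, c in zip(accesses, balanced))

    print(f"peak bandwidth, equal shares: {peak_equal:.2f} accesses/cycle")
    print(f"peak bandwidth, balanced:     {peak_balanced:.2f} accesses/cycle")

With these numbers the balanced distribution needs roughly two ports' worth of peak bandwidth instead of nearly five, which translates directly into a cheaper memory organisation.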
5.1.2 Energy cost versus cycle budget trade-off
The data transfer and storage (“memory”) related cost
clearly increases when a higher memory bandwidth is needed.
In most cases, designers of real-time systems will assume
that the time for executing a given task is predefined in an
initial system decision step, where the system timing budget
is broken up based on ad hoc guidelines.
This subsection explains how to use the available trade-
offs between cycle budget for a given task and the memory
cost related to it. This almost necessarily leads to the use of
Pareto curves to exploit the opportunities really well. The
data transfer and storage related cycle budget (“memory cy-
cle budget”) is strongly coupled to the memory system cost
(both energy and memory size/area are important here, as
mentioned earlier). A Pareto curve is a powerful instrument
to be able to make the right trade-offs. The Pareto curve
only represents the potentially interesting points in a search
space with multiple axes and excludes all the solutions which
have an equally good or worse solution for all the axes.
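In code, the dominance filter that produces such a curve is straightforward. The sketch below keeps only the non-dominated points of a small invented set of (cycle budget, energy, area) candidates, with lower taken as better on every axis.

    def pareto_front(points):
        """Keep the points not dominated on all axes (lower is better).

        q dominates p when q is no worse than p on every axis and
        differs from p on at least one."""
        return [p for p in points
                if not any(all(qi <= pi for qi, pi in zip(q, p)) and q != p
                           for q in points)]

    # Invented (cycle budget, energy, area) candidate organisations.
    candidates = [(100, 9.0, 4.0), (150, 6.0, 4.5), (150, 7.0, 5.0),
                  (200, 5.5, 3.0), (250, 5.4, 3.2)]
    print(pareto_front(candidates))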
The memory cost increases when lowering the cycle bud-
get (see Figure 4). When lowering the cycle budget, a more
energy consuming and more complex (larger) memory archi-
tecture is needed to deliver the required higher bandwidth.
Note that the lower cycle budget is not obtained by reducing
the number of memory accesses, as in the case of the algo-
rithmic changes and data-flow transformations [2]. During
the currently discussed step in the system design trajectory,
the amount of data transferred to the data-path remains
equal over the complete cycle budget range. Nevertheless,
some control flow transformations and data reuse optimisa-
tions [3] can be beneficial for energy consumption and not
[Figure 4: sketch of the Pareto curve trading off memory cycle budget against cost; the budget axis runs from the minimum budget (full needed bandwidth, large cost) to the fully sequential ordering (low needed bandwidth, limited cost), the area above the curve marks uninteresting alternatives, and the span between the extremes is the available freedom.]
Figure 4: Pareto curve for trading off memory cycle budget vs. cost
for the cycle budget (or vice versa). This type of (largely)
platform-independent optimisations still has to be consid-
ered in the trade-off.
Figure 5 illustrates the concepts described above. If many
storage cycles are available to perform the data transfer then
the required bandwidth (shown in number of ports in this
figure) is small and no memory needs to be dual-port. If,
however, all the data transfers have to be executed in very
few cycles a large number of ports is required and, obviously,
the communication network should also be able to provide
the required bandwidth.
[Figure 5: two memory organisations feeding the functional units, plotted on storage cycles vs. memory energy axes.]
Figure 5: Number of cycles allocated for data transfer has a large impact on the required bandwidth
The interesting range of such a Pareto graph on the cycle
budget axis, should be defined from the critical path up
to the fully sequential memory access path. In the fully
sequential case all the memory accesses can be transferred
over a single port (lowest bandwidth). However, the number
of memories is not necessarily constrained to one memory.
We can still in a later phase exploit the distributed memory
organisation to minimise energy consumption. At the other extreme, for the critical path, only a limited set of memory access orderings is valid. Many memory accesses are then performed
in parallel.
6. EXPLOITING HETEROGENEITY IN PREDEFINED MEMORY ORGANISATIONS
A heterogeneous memory organisation has different kinds
(or flavours) of memory instances in a single layer. These
different flavours can come from having different sizes, dif-
ferent bit-widths, different numbers of ports, or due to different implementation options at the (internal) organisation, circuit and/or even the technology processing level. All
these enable the creation of different energy and delay char-
acteristics. For example, a large single-port memory will
consume more power per access and will be slower than a
smaller single-port memory, but a smaller dual-port mem-
ory may be even more power hungry than the original mem-
ory. These kinds of trade-offs are very difficult to explore
manually and we will show that they can be systematically
explored.
In a custom design context, tools already exist that can
decide on the number and the sizes of all the memories in the
memory organisation for minimum power consumption [17].
This is a difficult problem to solve, since all the arrays of
the application have to be assigned to one of the memories
in an optimal manner. The optimal number of memories
and the optimal assignment of arrays to these memories, is
not trivial, since information about the array bit-width and
access frequency has to be taken into account. Figure 6 illus-
trates how memory allocation and assignment can provide
power vs. area trade-offs for the memory organization. The
left-hand assignment is bad for power because the very fre-
quently accessed array A is stored in a large memory. Thus
the product of energy per access times access frequency is
very large. On the right-hand side, we can see a better choice
for power, but an overhead in area is present. The reason is
that the bit-widths of arrays B and C are different, forcing the right memory to be much larger than the sum of the sizes of the arrays. Similar trade-offs can be generated involving
delay, since memory access delay increases as memories grow
bigger.
Thus a methodology and tools [17] are needed to find the
power optimal solution given timing and/or area constraints.
In that case the number of available options is very large and
everything can be decided at design time where only the cho-
sen memories will be processed. The results of these tools
indicate once more that the most power-optimal organisa-
tion is distributed and heterogeneous.
[Figure 6: two alternative assignments of arrays A, B and C to two memories: one with low area but high power, the other with low power but an area overhead.]
Figure 6: Energy consumption vs. area and delay trade-offs involved in the Memory Allocation and Assignment step
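The sketch below mimics this allocation-and-assignment trade-off on three invented arrays and two memories. The cost models are deliberately crude stand-ins (a memory's word width is set by its widest array, so mixing bit-widths wastes area, and energy per access grows with capacity); the real tools of [17] use accurate memory models instead.

    from itertools import product

    # Invented arrays: name -> (words, bit-width, access count).
    arrays = {"A": (64, 8, 50000), "B": (256, 16, 2000), "C": (512, 6, 1500)}

    def memory_cost(assigned):
        """Toy model: capacity in bits and energy per access of one memory."""
        words = sum(arrays[a][0] for a in assigned)
        width = max(arrays[a][1] for a in assigned)   # widest array sets width
        bits = words * width
        e_acc = 0.1 + 0.001 * bits ** 0.5             # grows with capacity
        return bits, e_acc

    options = []
    for labels in product(range(2), repeat=len(arrays)):  # two memories
        groups = [[], []]
        for name, g in zip(arrays, labels):
            groups[g].append(name)
        if not all(groups):
            continue                                   # use both memories
        area = energy = 0.0
        for g in groups:
            bits, e_acc = memory_cost(g)
            area += bits
            energy += sum(arrays[a][2] for a in g) * e_acc
        options.append((energy, area, groups))

    for e, a, g in sorted(options):
        print(f"energy={e:9.1f}  area={a:6.0f} bits  assignment={g}")

Sorting by energy makes the Figure 6 effect visible: the lowest-energy assignments isolate the heavily accessed array A in a small memory of its own, at the price of extra area whenever arrays of different bit-widths end up sharing a memory.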
In the context of predefined architectures the focus is a
little different. The available memory options are (partly
as we will see later) already decided. This means that the
system design tools do not have the freedom any more to
allocate the best possible memory. However, if all the available memories are of the same type, then the optimisation
freedom of the tool is significantly reduced. On the other
hand, if some heterogeneity already exists in the architec-
ture then the tools can use this freedom for optimisation.
Finding which flavours of memories are the best candidates
for use in this context is not easy, but it is feasible within
limitations. Given the target application domain, it is possi-
ble to derive a class of memories that are likely candidates.
For example, small local memories can be used by appli-
cations that perform filtering of single- or two-dimensional
signals. Motion estimation kernels are such kinds of applica-
tions, where a small 2-D block traverses an image in order to
find its match. Exploiting the available reuse we can store
small parts of the image and the small 2-D block in local
memories. These memories will be very active, thus they
need to be as small as possible.
For very delay critical applications, perhaps some of these
small memories will have to be dual-port to provide the re-
quired bandwidth. By experimenting with the applications
included in this domain, one can find common memory re-
quirements such that a selection of a representative set of
heterogeneous memories is feasible. Note, though, that this
will incur an overhead in the area of the local memory layer.
In order to increase the freedom of the design tools several
additional memories should be available on top of the mini-
mal set. This implies that some redundancy will exist in this
local layer. As a result, the number of available memories
will be quite large. But a reasonable redundancy in these
small memories is acceptable. Overall, the area they occupy
is small (usually even negligible) compared to the second and
third layer memories which can contain tens or hundreds of
Mbits. So, for example, a 20% redundancy for the global first-level memory layer (which will contain less than 1 Mbit in
total) will incur a marginal increase in the total chip area.
Since this tiny area overhead can be exploited to generate a
significantly lower overall energy consumption for the same
system performance constraints, this is a very useful trade-
off. That is especially true for portable embedded systems,
where energy (or average power) consumption is the first
consideration. Further opportunities for optimisation can
be provided by also taking into account information about
the life-time of the different application arrays and trying
to map several arrays with non-overlapping life-times in the
same address space. These will not be further discussed in
this paper.
7. APPLICATION CASE STUDY ON A
SIMPLE ARCHITECTURE TEMPLATE
7.1 Target platform
In this subsection we will describe an example of the aforementioned predefined partly-programmable platform, see Fig-
ure 7. This example consists of one off-chip main memory,
a first layer of memories and the processing elements. Typ-
ically the off-chip memory is an SDRAM. The first layer
memories can be homogeneous (several identical memories)
but preferably should be heterogeneous (several different
memories). A heterogeneous memory layer implies memo-
ries that are different in size, but also in number of ports and
other characteristics. We will assume for this example that
our architecture is actually heterogeneous and that memo-
ries (or memory planes in the SDRAM) that are not used
can be powered-down to minimise static energy consump-
tion. This simple architecture already provides a memory hierarchy and a distributed local layer of memories.
[Figure 7: an off-chip SDRAM main memory connected through a configurable communication network to the on-chip L1 SRAM memories and the processing elements.]
Figure 7: Example architecture
7.2 Mapping process
The first problem to be solved is the ordering of the mem-
ory accesses. In case the local memory layer is heteroge-
neous, energy vs. number of cycles trade-offs can be gen-
erated by ordering the memory accesses in different ways.
Let’s assume, for example, that single- and dual-port memories exist in the local layer. Assigning an application array to a dual-port memory increases the available bandwidth of this array: two elements can be read simultaneously. This allows the scheduler to find an ordering which requires a larger bandwidth, for instance two elements per cycle, so that the number of cycles required to perform all the memory transfers is reduced. But using a dual-port memory induces a penalty in energy consumption, hence the energy vs. number of cycles trade-off, as the sketch below illustrates.
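A toy back-of-the-envelope version of this trade-off, with invented access counts and energy figures: doubling the ports of each array's memory halves the cycle count of the loop but raises the energy per access.

    import math

    # Invented per-array access counts for one loop nest.
    accesses = {"A": 600, "B": 600}

    def cycles_and_energy(ports, e_acc):
        """Cycles and energy when every array sits in a memory with
        `ports` ports and energy `e_acc` per access (assumed values).
        Different memories are accessed in parallel; accesses to one
        memory are limited by its port count."""
        cycles = max(math.ceil(n / ports) for n in accesses.values())
        energy = sum(accesses.values()) * e_acc
        return cycles, energy

    print("single-port:", cycles_and_energy(ports=1, e_acc=1.0))
    print("dual-port:  ", cycles_and_energy(ports=2, e_acc=1.6))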
The memory access schedule can also have an impact on
the energy consumption of the SDRAM. Typically SDRAMs
have a few different read modes, for example page, burst or
interleaved mode. The interleaved mode provides the largest
possible bandwidth, but in order to use it the data has to be distributed over several banks. This means that all the banks should be active and will be accessed quite often; the result is that static, and also dynamic, energy consumption is quite high. If, on the other hand, all the application data
are stored into one bank, then only this bank will be ac-
tive. This is beneficial for static energy consumption, but
the available bandwidth is reduced. [19] provides a system-
atic methodology and prototype tool to exploit this trade-off
with very promising experimental results.
The way to exploit these possibilities is by balancing the
bandwidth requirements over time. In the usual case they
vary significantly over time and at some point a lot of band-
width is required, while most of the time the requirements
are very relaxed. By balancing the bandwidth requirements
we avoid having to use all the SDRAM banks simultane-
ously, thus we can power-down some of them saving static
power.
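The following sketch quantifies this effect with an invented SDRAM model: given the accesses of one frame and the cycles available, it picks the smallest number of active banks that still meets the bandwidth requirement, showing how a balanced (relaxed) requirement lets banks be powered down. The constants are assumptions, not datasheet values; [19] describes the actual methodology.

    # Invented SDRAM model constants.
    E_STATIC_PER_BANK = 2.0     # static energy per active bank per frame
    E_PER_ACCESS = 0.05         # dynamic energy per access
    BANK_BANDWIDTH = 1.0        # accesses/cycle one bank can sustain

    def sdram_energy(accesses, cycles, banks):
        """Frame energy with `banks` active, or None if bandwidth fails."""
        if accesses / cycles > banks * BANK_BANDWIDTH:
            return None
        return banks * E_STATIC_PER_BANK + accesses * E_PER_ACCESS

    accesses = 3000
    for cycles in (1000, 2000, 4000):     # peaked vs. balanced requirements
        feasible = [(e, b) for b in (1, 2, 4)
                    if (e := sdram_energy(accesses, cycles, b)) is not None]
        energy, banks = min(feasible)
        print(f"{cycles} cycles: {banks} active bank(s), energy {energy:.1f}")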
The same concepts apply also to the local memories of
layer 1. If the maximum possible bandwidth is required then
the use of multi-port memories is indispensable, but the en-
ergy overhead is large. Furthermore, the same bandwidth is
required from the communication network. An application
with relaxed timing constraints, on the other hand, can use
a single bus and only multiple single-port memories which is
the optimal combination for the energy consumption of the
memory organisation.
Additionally, in order to exploit the available hierarchy in
the memory organisation some data reuse should be present
in the application source code. Reuse exists when the same
data element of an array is used multiple times during the
processing [14, 20]. In this case it is more efficient to copy
part of this array to a small memory and perform multiple
accesses to this memory, improving the total energy con-
sumption [16, 7, 21]. Let’s assume that the application that
is running on this architecture is a two-dimensional filtering
of an image. This operation involves two data structures,
one small two-dimensional array (the filter, for example 3x3
coefficients) and the image, which is typically a large two-
dimensional array. Due to their sizes the filter array is usu-
ally stored in one of the small layer 1 SRAMs and the image
is stored in the off-chip memory. To perform the filtering
operation the centre coefficient of the filter will traverse all
the pixels of the image in a row-wise manner. At each image
pixel the nine relevant image pixels will have to be fetched
from the off-chip memory and be temporarily stored in one
of the layer 1 memories or in registers. After the filtering
of one pixel is performed, the filter will move to the next
pixel to the right. But out of the nine image pixels that
were used in the filtering of the previous pixel, six can be
reused. This type of reuse can be exploited to optimise the
power consumption of the memory organisation. Instead
of always loading from the off-chip memory the nine cur-
rently involved image pixels we can keep the six that were
already loaded for the previous operation and retrieve from
the main memory only the three “new” ones. This way for
the filtering of each pixel we save six accesses to the off-
chip SDRAM and substitute them with accesses to the local
memories. Given the large difference in energy per access
between the SDRAM and on-chip SRAMs, this saving in
off-chip accesses can provide very significant power gains, as
demonstrated e.g. in [16].
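Under the simplifying assumption that a fresh 3x3 window costs nine off-chip fetches and a one-pixel move to the right costs only three, a few lines suffice to estimate the saving; border effects are ignored and the QCIF image size is just an example.

    # Off-chip fetch count for a 3x3 filter sweep over a W x H image.
    W, H = 176, 144                       # QCIF, as an example

    naive = W * H * 9                     # nine fetches per filter position
    # With reuse: a full window only at the start of each row, then
    # three "new" pixels per one-pixel step to the right.
    reuse = H * (9 + (W - 1) * 3)

    print(f"without reuse: {naive:,} off-chip fetches")
    print(f"with reuse:    {reuse:,} off-chip fetches")
    print(f"saving:        {1 - reuse / naive:.0%}")

The roughly two-thirds reduction in off-chip fetches corresponds to the six-out-of-nine pixel reuse described above; with the order-of-magnitude energy gap between SDRAM and on-chip SRAM accesses, most of that saving carries over to energy.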
In order to exploit the reuse available in the application
we have introduced some small arrays that are copies of
parts of larger arrays that are stored off-chip. Combined
with the original application signals they give us an idea
of the total size of the application data. Furthermore the
bandwidth requirements have been fully defined by the step
that performs the access ordering. The next step now is to
assign all these application data into the available memories.
In order to obtain the power-optimal assignment, several issues have to be combined. The most important is that heavily
accessed arrays should be placed in memories with small
energy per access. If the architecture provides a heteroge-
neous set of layer 1 memories, then they will have different
energy consumption characteristics. It is obvious that the
arrays which gather most of the activity should be stored
in memories which consume less energy per access. How-
ever, several constraints should be taken into account. The
application arrays have different sizes, different bit-widths
and different activation frequencies. The memories of the
architecture, on the other hand, are partly (see Section 8)
fixed. This power optimisation process is still quite complex
and given that some freedom in customising the memories
is still available, the tools have enough options in order to
find a good solution.
8. AN INTERCONNECT TECHNOLOGY BASED TECHNIQUE TO INCREASE THE HETEROGENEITY OF THE MEMORY LAYERS
It is clear that having heterogeneity in the memory or-
ganisation is crucial to minimise the energy consumption.
In this section we will discuss a technique that allows us to
have memories with the same size, bit-width and number of
ports, but with different energy-delay characteristics. This
variation can come from two sources. The first is the internal partitioning of the memory. Partitioning means splitting
the array of cells that store the data into smaller sub-arrays
and activating only one at each access. The second source
of variation we can exploit is the physical dimensions of the
interconnect wires that are used for the implementation of
the memories [22], see Section 8.1. The main energy con-
sumption gains come from the variation on the wire dimen-
sions given that memories are usually interconnect domi-
nated. Combined with the sub-array partitioning this leads
to very good Pareto ranges.
However these options have to be available also for prede-
fined memory organisations. Two ways exist to create these
effects in predefined platforms.
The first is structured ASICs. These are platforms where
the active area (silicon) and the lower metal layers have al-
ready been processed, thus fully defined. But the upper
metal layers have not yet been added. Since the freedom
to choose and route the wires inside the memories might
still be available, we can choose an implementation of the
upper metal layers where the dimensions of the wires are cus-
tom. Furthermore, potentially the partitioning can also be
decided at that stage if enough freedom is foreseen in the
phase of implementing the active area. Structured ASICs
are already provided by several companies like Synplic-
ity [23], Faraday Technology Corp. [24], AMI semiconduc-
tors [25] and others.
The second way to create these options is by having some
redundancy in the memory organisation. For example, the
predefined architecture may include several memory units or
sub-arrays of exactly the same size and number of ports, but
implemented in such a way that their energy consumption
and delay are different. Then during the application map-
ping phase the tools have the freedom to decide which of
the memory units is activated and which of the sub-arrays
are combined for a particular unit. That additional free-
dom is beneficial for optimising overall energy consumption.
It is clear that this redundancy has an area overhead but
in the lowest layer that is negligible, see Section 6. This
area overhead can be used to improve the heterogeneity of
the memory organisation, thus providing more freedom for
power optimisation to the designers.
8.1 Creating heterogeneity at the SRAM level
As indicated, one important way to create multiple in-
stances of memory sub-arrays or units which have different
energy consumption and access delay characteristics is by
customising the dimensions of the interconnect wires during
the processing of the chip.
It has been shown [22] that by varying these dimensions
energy/delay trade-offs can be achieved at the level of in-
dividual wires. We will first evaluate the impact of these
trade-offs on a complete memory and in a second phase use
them to create heterogeneous memory organisation layers.
The memory model we have used is based on a varia-
tion of the CACTI model [15] developed at the University
of Texas [26]. This model has improved scaling behaviour
compared to the original CACTI model, thus reflects better
the energy consumption and delay behaviour of current and
future embedded memories. We have further extracted an
embedded SRAM model from it, since it was originally a
model for on-chip caches and made some additional changes
that will be discussed in the next section.
[Figure 8: scatter plot of the explored implementations; axes: delay 0-10 ns vs. energy 0.00-0.08 nJ.]
Figure 8: Pareto exploration of a 32kbit memory instance, partitioning and interconnect parameters range
8.1.1 Memory module level organisation exploration
State-of-the-art SRAMs do not have one monolithic array
of cells, but the “storage area” is split into several partitions
or banks. Each time the memory is accessed only one parti-
tion has to be activated, thus memory energy and delay are
improved. By exploring, however, the partitioning we can
create energy-delay trade-offs at the level of memory mod-
ules. Partitioning is applied by dividing the bit-lines or the
word-lines of the array of cells into two pieces recursively.
But dividing the bit-lines has a different effect on total en-
ergy consumption and delay than dividing the word-lines. By
exploring these combinations we achieve trade-offs between
memory energy and delay.
On top of the exploration of possible partitioning schemes
that is included in CACTI we have added the exploration of
the dimensions of the interconnect wires.
Coupling the trade-off at the technology level to the mem-
ory model we can see its large influence on the entire mem-
ory at the level of IP block. In combination with the energy-
delay trade-off due to memory partitioning, we can now have
very good ranges in the energy-delay Pareto optimal trade-
off curve of the entire memory, see Figure 8.
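The sketch below emulates this combined exploration with deliberately crude analytical stand-ins for the CACTI-style models: for every (partitioning depth, wire width) pair it evaluates an energy and a delay figure, then keeps the Pareto points. All constants are invented; only the shape of the resulting trade-off curve is the point.

    def evaluate(partitions, width):
        """Toy energy/delay model of one memory implementation."""
        wire_len = 1.0 / partitions           # active sub-array wires shrink
        r = wire_len / width                  # resistance ~ length / width
        c = wire_len * (0.5 + width)          # capacitance grows with width
        delay = r * c + 0.2 * partitions ** 0.5   # plus decoder overhead
        energy = c + 0.05 * partitions            # plus periphery overhead
        return energy, delay

    points = {(p, w): evaluate(p, w)
              for p in (1, 2, 4, 8, 16)          # partitioning depths
              for w in (0.5, 1.0, 2.0, 4.0)}     # relative wire widths

    pareto = {k: v for k, v in points.items()
              if not any(o[0] <= v[0] and o[1] <= v[1] and o != v
                         for o in points.values())}

    for (p, w), (e, d) in sorted(pareto.items()):
        print(f"partitions={p:2d} width={w:3.1f} energy={e:.3f} delay={d:.3f}")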
The final output of this model is now an energy-delay-area
Pareto optimal trade-off curve which shows all the optimal
feasible operating points of the particular memory (Figure 8,
only energy-delay Pareto points are shown). Thus if some
redundancy is allowed in the memory organisation, then sev-
eral flavours of the same memory can be present. A fast,
power-hungry memory might be needed if the timing re-
quirements are very tight, but a slow, power-efficient ver-
sion of the same memory will significantly improve the total
power consumption if the timing constraints are not very
critical.
It is important to note here that not all these Pareto
points are necessary in order to get system level gains. What
is more important is the available range rather than the to-
tal number of points. It is not necessary to have all these
different memory flavours made available on the platform.
Having several memories which are heterogeneous in energy
and delay characteristics enables our design tools to optimise
energy consumption significantly, especially if the ranges in
energy and delay are large enough. This heterogeneity will
be the result of a combination of different size, bit-width,
number of ports and technology or circuit implementation
choices for each memory.
8.2 Propagation of memory level trade-offs to
application level
So far we have been dealing with the wires that exist
inside the memories of the memory organisation. Apart from these, wires are also used for the implementation of
the busses. However, the contribution of these wires can be
kept low by additional optimisation techniques. The first is
activity-aware power optimal floor-planning. The idea is to
place the heavily active memories close to the data-paths,
so that they have short connections. Less active memories
can be placed farther away. The second technique is to use
a bus segmentation approach for the implementation of the
busses, which partitions the bus into several segments and
only the necessary segments are activated for each memory
transfer. Combining these two techniques makes sure that
the inter-memory interconnect power consumption stays rel-
atively small and we can focus only on the intra-memory
power consumption. Experiments show that the power con-
sumed on the busses can be kept under 20% of the total
power of the memory organisation (memories and busses),
if the application has already been optimised for data transfer and storage, using e.g. DTSE [22, 27].
In order to assess the impact of the energy vs. delay
trade-offs on the application level we have to apply the mem-
ory module level trade-offs on an actual application. The
driver application we have used is a Digital Audio Broad-
cast (DAB) channel decoder [4] which has been optimised
for data transfer and storage management using the DTSE
methodology. After optimisations, the clock frequency re-
quired to implement the application while meeting the real-
time constraints is only 43 MHz.
These optimisations help to relax the timing constraints
on the individual memories significantly, thus creating an
opportunity to trade off delay slack for minimum energy
consumption. Thus, global system-wide trade-offs can be made, in which the optimal memory organisation is shaped by exploiting the per-memory trade-off space during the memory selection process, as the sketch below illustrates.
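A minimal sketch of this slack-driven selection, with invented Pareto points and slack bounds: each memory independently gets the slowest (hence most energy-efficient) implementation flavour whose access time still fits the cycle time the schedule allows it.

    # Invented (delay ns, energy per access) Pareto points of one SRAM;
    # along a Pareto curve, larger delay means lower energy.
    flavours = [(1.0, 0.080), (1.5, 0.055), (2.5, 0.040), (4.0, 0.030)]

    # Allowed access time per memory, derived from scheduling slack
    # (assumed; memories on the critical path get the tightest bound).
    allowed = {"M0": 1.2, "M1": 2.8, "M2": 4.5, "M3": 4.5}
    accesses = {"M0": 90000, "M1": 20000, "M2": 15000, "M3": 5000}

    total = 0.0
    for mem, bound in allowed.items():
        delay, e_acc = max((d, e) for d, e in flavours if d <= bound)
        total += accesses[mem] * e_acc   # slowest feasible = least energy
        print(f"{mem}: delay {delay} ns, {e_acc} energy/access")
    print(f"total memory energy: {total:.0f}")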
In order to see how the energy-delay trade-offs propagate
to the application level we have performed an experiment
using the fastest memory module implementation for the
SRAMs of the library.
The goal of the experiment is to find the optimal number
of memories that should be used in the memory organisation
to minimise its power consumption. We take into account
the number of memories, the power consumption for each
memory and the access frequency of each memory. The re-
sult is the optimal memory organisation and the total power
dissipated on it.
A second experiment has involved using the slowest possi-
ble, energy optimal, interconnect wires which meet the real-
time constraint of the application. This means that the wires
inside the memories will be customised to the delay require-
ments of each memory. Memories on the critical path will
have faster wires than the ones that have relaxed timing
constraints. The results are shown in Figure 9. Both exper-
iments have been performed assuming an identical ordering
for the memory accesses. The horizontal axis of the figure
refers to the number of local memories that are allocated
by the Memory Allocation and Assignment tool [17]. It is
clear that the more memories the tool can allocate (more
freedom) the better the final solution is in terms of power
consumption. A limit for the number of memories exists,
after which further allocation of memories does not impact
the power consumption, but it depends on the application.
By comparing the results of these two experiments we
can see that for the best memory allocation and assignment
case (10 1st layer memories), the selective use of slow, but
power-efficient, wires can save up to almost 40% in power
consumption of the memory organisation compared to a design based on the interconnect points proposed by the ITRS roadmap.
[Figure 9: power (mW, 0 to 1.4) versus the number of allocated memories (3 to 10), with one curve for the fast memories and one for the power-optimal memories.]
Figure 9: System power consumption using the ITRS and the power-optimal interconnect options
These experiments have been performed in a custom ASIC
context, where the memories can be fully designed and op-
timised at design time. The results, though, prove that
a homogeneous memory organisation is not suited for low
power operation. Heterogeneity gives additional freedom to
the design tools which is required to further reduce energy
consumption.
The technology node used for this practical experiment is
130nm because we did not have access to well-characterised
design data for lower technology nodes. At this technol-
ogy node, the impact of the wire trade-offs is still not very
significant. The majority of the power consumption and de-
lay of the chip can be attributed to the active part of the
chip (transistors). On the other hand, for smaller technol-
ogy nodes the gains are expected to be much higher. The
reason for this is that process scaling does not affect the
wires in a similar way as the transistors. Making the tran-
sistors smaller reduces their capacitance while their resis-
tance is not seriously degraded, thus their speed and energy consumption are improved. Making the wires smaller,
though, increases their resistance significantly while their
capacitance is also increased because the distance between
the wires becomes smaller [28, 29, 30]. Thus, their electrical
characteristics, energy consumption and delay are seriously
degraded. The combination of the two fore-mentioned ef-
fects explains why the energy gains from changing the wire
dimensions grow with technology scaling and also motivates
such an exploration for future deep sub-micron technologies.
9. CONCLUSIONS
The power consumption of the global data memory organ-
isation is fast becoming a major bottleneck in the design of
energy-efficient embedded systems. In this paper we have
presented a system design flow and a method to generate
a memory library that is targeted towards total system en-
ergy optimisation within real-time constraints, which can to a large extent alleviate this problem. Pure source code and
mapping transformations can already provide a significant
reduction in the memory organisation power consumption.
Any additional freedom to customise the platform architec-
ture and especially to increase its heterogeneity can result in
further gains, by providing additional freedom to the mem-
ory management design flow to find better overall solutions.
10. REFERENCES
[1] G.Lawton, “Storage technology takes the center
stage”, IEEE Computer Magazine, Vol.32, No.11,
pp.10-13, Nov. 1999.
[2] F.Catthoor, S.Wuytack, E.De Greef, F.Balasa,
L.Nachtergaele, A.Vandecappelle, “Custom Memory
Management Methodology – Exploration of Memory
Organisation for Embedded Multimedia System
Design”, ISBN 0-7923-8288-9, Kluwer Acad. Publ.,
Boston, 1998.
[3] F.Catthoor, K.Danckaert, C.Kulkarni, E.Brockmeyer,
P.G.Kjeldsberg, T.Van Achteren, T.Omnes, “Data access and storage management for embedded programmable processors”, ISBN 0-7923-7689-7, Kluwer
Acad. Publ., Boston, 2002.
[4] Radio broadcasting systems; digital audio
broadcasting to mobile, portable, and fixed receivers.
Standard RE/JTC-00DAB-4, ETSI, ETS 300 401,
May 1997.
[5] P.Strobach, “QSDPCM – A New Technique in Scene
Adaptive Coding,” Proc. 4th Eur. Signal Processing
Conf., EUSIPCO-88, Grenoble, France, Elsevier Publ.,
Amsterdam, pp.1141–1144, Sep. 1988.
[6] Multi-media compilation project (Acropolis) at IMEC
http://www.imec.be/acropolis/
[7] P.Panda, F.Catthoor, N.Dutt, K.Danckaert,
E.Brockmeyer, C.Kulkarni, A.Vandecappelle,
P.G.Kjeldsberg, “Data and Memory Optimizations for
Embedded Systems”, ACM Trans. on Design
Automation for Embedded Systems (TODAES), Vol.6,
No.2, pp.142-206, April 2001.
[8] A.Macii, L.Benini, M.Poncino, “Memory Design
Techniques for Low Energy Embedded Systems”,
ISBN 0-7923-7690-0, Kluwer Acad. Publ., Boston,
2002.
[9] P.R.Panda, N.D.Dutt, A.Nicolau, “Memory issues in
embedded systems-on-chip: optimization and
exploration”, Kluwer Acad. Publ., Boston, 1999.
[10] A.Halambi, P.Grun, V.Ganesh, A.Khare, N.Dutt and
A.Nicolau. “EXPRESSION: A Language for
Architecture Exploration through Compiler/Simulator
Retargetability”, Proc. 2nd ACM/IEEE Design and Test in Europe Conf. (DATE), Munich, Germany,
March 1999.
[11] http://www.cse.psu.edu/~kandemir/research.html
[12] M.Kandemir, J.Ramanujam, A.Choudhary,
“Improving cache locality by a combination of loop
and data transformations”, IEEE Trans. on
Computers, Vol.48, No.2, pp.159-167, Feb. 1999.
[13] J.Ramanujam, J.Hong, M.Kandemir, A.Narayan,
“Reducing memory requirements of nested loops for
embedded systems”, 38th ACM/IEEE Design
Automation Conf., Las Vegas NV, pp.359-364, June
2001.
[14] T.Van Achteren, R.Lauwereins, and F.Catthoor.
Systematic data reuse exploration techniques for
non-homogeneous access patterns. In Proc. 5th
ACM/IEEE Design and Test in Europe Conf.
(DATE), pages 428–435, Paris, France, Apr. 2002.
[15] S.J.E.Wilton, N.P.Jouppi, “CACTI : An enhanced
cache access and cycle time model”, IEEE J. of Solid
State Circuits, Vol.31, No.5, pp.677-688, May 1996.
[16] E.Brockmeyer, M.Miranda, F.Catthoor, H.Corporaal,
“Layer Assignment Techniques for Low Power in
Multi-layered Memory Organisations”, Proc. 6th
ACM/IEEE Design and Test in Europe Conf.
(DATE), Munich, Germany, pp.1070-1075, March
2003.
[17] A.Vandecappelle, M.Miranda, E.Brockmeyer, F.Catthoor, D.Verkest, “Global Multimedia System Design Exploration using Accurate Memory Organization Feedback”, Proc. 36th ACM/IEEE Design
Automation Conf., New Orleans LA, pp.327-332, June
1999.
[18] E.Brockmeyer, A.Vandecappelle, F.Catthoor,
“Systematic cycle budget versus system power trade-off: a new perspective on system exploration of real-time data-dominated applications”, Proc. IEEE Int. Symp. on
Low Power Electronics and Design, pages 137-142,
Rapallo, Italy, Aug. 2000.
[19] P.Marchal, J.I.Gomez, D.Bruni, F.Catthoor, M.Prieto,
L.Benini, H.Corporaal, “SDRAM-Energy-Aware
Memory Allocation for Dynamic Multi-Media
Applications on Multi-Processor Platforms”, Proc. 6th
ACM/IEEE Design and Test in Europe Conf.
(DATE), Munich, Germany, pp.516-521, March 2003.
[20] I.Issenin, E.Brockmeyer, M.Miranda, N.Dutt, “Data
reuse analysis technique for software-controlled
memory hierarchies”, Proc. 7th ACM/IEEE Design
and Test in Europe Conf. (DATE), Paris, France,
Feb. 2004.
[21] K.Masselos, F.Catthoor, A.Kakarudas, C.E.Goutis,
H.De Man, “Memory Hierarchy Layer Assignment for
Data Re-Use Exploitation in Multimedia Algorithms
Realized on Predefined Processor Architectures”,
Proc. Intnl. Conf. on Electronic Circuits and Systems
(ICECS), Malta, pp.I.285-I.288, Sep. 2001.
[22] A.Papanikolaou, M.Miranda, F.Catthoor,
H.Corporaal, H.De Man, D.De Roest, M.Stucchi,
K.Maex, “Global interconnect trade-off for technology
over memory modules to application level: case
study”, 5th ACM/IEEE Intnl. Wsh. on System Level
Interconnect Prediction, Monterey CA, April 2003.
[23] Synplicity - Structured ASICs
http://www.synplicity.com/products/structuredasic/
[24] Faraday Tech. Corp. - Structured ASICs,
http://www.faraday-tech.com/html/ASIC/IPsolution/structured.html
[25] AMI semiconductors - Structured ASICs
http://www.amis.com/asics/structured_asics/
[26] V. Agarwal, S. Keckler, D. Burger, “The effect of technology scaling on microarchitectural structures”, Technical Report TR2000-02, University of Texas at Austin.
[27] H.Wang, A.Papanikolaou, M.Miranda, F.Catthoor, “A
global bus power optimization methodology for
physical design of memory dominated systems by
coupling bus segmentation and activity driven block
placement”, Proc. IEEE Asia and South Pacific
Design Autom. Conf. (ASPDAC), Yokohama, Japan,
Jan. 2004.
[28] D. Sylvester, K. Keutzer, Impact of small process
geometries on microarchitectures in systems on a chip,
Proceedings of the IEEE, vol.89, no.4, p. 467, April
2001.
[29] J.A. Davis, R. Venkatesan, A. Kaloyeros, M.
Beylansky, S.J. Shouri, K. Banerjee, K.C. Saraswat,
A. Rahman, R. Reif, J.D. Meindl, Interconnect limits on gigascale integration (GSI) in the 21st century, Proceedings of the IEEE, vol.89, no.3, p.305, March
2001.
[30] D. Sylvester, C. Hu, O.S. Nakagawa, and S-Y. Oh,
Interconnect scaling: signal integrity and performance
in future high-speed CMOS designs, Proceedings Of
Symposium on VLSI Technology, pp. 42-43, 1998.