
Run-Time Management of Logic Resources on Reconfigurable Systems.

March 2003


DOI:10.1109/DATE.2003.1253731

Source
DBLP

Conference: 2003 Design, Automation and Test in Europe Conference and Exposition (DATE 2003)
At: Munich, Germany

Authors:

Manuel Gericota

Polytechnic Institute of Porto

Gustavo Ribeiro Alves

Polytechnic Institute of Porto

Miguel L. Silva

University of Porto

José Manuel Martins Ferreira

University of South-Eastern Norway



Run-Time Management of Logic Resources on Reconfigurable Systems

Manuel G. Gericota, Gustavo R. Alves

Department of Electrical Engineering – ISEP

Rua Dr. António Bernardino de Almeida – 4200-072 Porto – PORTUGAL

Miguel L. Silva, José M. Ferreira

Department of Electrical and Computer Engineering – FEUP

Rua Dr. Roberto Frias – 4200-465 Porto – PORTUGAL

Abstract

Dynamically reconfigurable systems based on partial and dynamically reconfigurable FPGAs may have their functionality partially modified at run-time without stopping the operation of the whole system.

The efficient management of the logic space available is one of the biggest problems faced by these systems. When the sequence of reconfigurations to be performed is not predictable, resource allocation decisions have to be made on-line. A rearrangement may be necessary to get enough contiguous space to implement incoming functions, avoiding the spreading of their components and the resulting degradation of system performance.

A new software tool that helps to handle the problems posed by the consecutive reconfiguration of the same logic space is presented in this paper. This tool uses a novel on-line rearrangement procedure to solve fragmentation problems and to rearrange the logic space in a way completely transparent to the applications currently running.

1. Introduction♦

Reconfigurable computing has expanded considerably in the last few years, due in part to the fast run-time partial reconfiguration features offered by recent Field Programmable Gate Arrays (FPGAs). The Virtex and Spartan families from Xilinx, used to validate this work, are the most recent examples. Devices of this kind enabled the implementation of the concept of virtual hardware defined in [1] ten years ago: to use temporal partitioning to implement those applications whose area requirements exceed the reconfigurable logic space available (i.e. to assume the availability of unlimited hardware resources). The static implementation of a circuit is separated into two or more independent hardware contexts, which may be swapped during run-time [2].

♦ This work is supported by an FCT program under contract POCTI/33842/ESE/2000.

Extensive work was done to improve the multi-context handling capability of these devices, by storing several configurations and enabling quick context switching [3, 4]. The main goal was to improve the execution time by minimising external memory transfers, assuming that some amount of on-chip data storage was available in the reconfigurable architecture. However, this solution was only feasible if the functions implemented in hardware were mutually exclusive in the temporal domain, e.g. context-switching between coding/decoding schemes in communication, video or audio systems; otherwise, the length of the reconfiguration intervals would lead to unacceptable delays in most applications.

These restrictions have been overcome by higher levels of integration, due to the employment of sub-micron scales, and by the use of higher frequencies of operation. The increasing amount of logic available in FPGAs and the reduction in the reconfiguration time, partly owing to the possibility of partial reconfiguration, extended the concept of virtual hardware to the implementation of multiple applications sharing the same logic resources in the spatial and temporal domains.

An application comprises a set of functions that are predominantly executed sequentially, or with a low degree of parallelism, in which case their simultaneous availability is not required. On the other hand, the reconfiguration intervals offered by new FPGAs are sufficiently small to enable functions to be swapped in real time. If a proper floorplanning schedule is devised, it becomes feasible to use a single device to run a set of applications, which in total require far more than 100% of the available FPGA resources, by swapping functions in and out of the FPGA as needed.

1530-1591/03 $17.00 © 2003 IEEE

Partial reconfiguration times are on the order of a few milliseconds, depending on the configuration interface and on the complexity (and thus on the size) of the function being implemented. However, the reconfiguration time overhead may be virtually zero, if new functions are swapped in advance with those already out of use, as illustrated in figure 1. A number of applications share the same available reconfigurable logic space in both the temporal and spatial domains [5]. After the execution of a given function, a new function may be set up in its place during the interval rt, in order to be available when required by the application flow (rt should therefore not be mistaken for the reconfiguration time). Notice that an increase in the degree of parallelism may delay the reconfiguration of incoming functions, due to lack of space in the FPGA. Consequently, delays will be introduced in the application execution, systematically or not, depending on the application flow.
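The relation between the interval rt and the reconfiguration time can be illustrated with a minimal sketch. It assumes, as a simplification that the text does not state, that an incoming function is delayed only by the part of its reconfiguration that does not fit in the interval rt; the function name and the numbers are hypothetical.

```python
def visible_delay_ms(reconfig_time_ms: float, rt_ms: float) -> float:
    """Delay seen by the application when a function is configured in advance.

    rt_ms is the interval between the release of the logic space and the
    moment the incoming function is needed (assumed model, not from the paper).
    """
    if reconfig_time_ms < 0 or rt_ms < 0:
        raise ValueError("times must be non-negative")
    return max(0.0, reconfig_time_ms - rt_ms)


# Reconfiguration fits in the interval rt: no visible overhead.
print(visible_delay_ms(reconfig_time_ms=3.0, rt_ms=10.0))  # 0.0
# Space released late (e.g. higher parallelism): the application is delayed.
print(visible_delay_ms(reconfig_time_ms=3.0, rt_ms=1.0))   # 2.0
```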

[Figure 1: applications A, B and C running in the FPGA, with their functions (A1, A2; B1, B2; C1 to C4) placed in the available resource space along the time axis, starting from the initial configuration. Legend: rt - reconfiguration interval; data transfer between different functions.]

Fig. 1. Temporal scheduling of applications in the temporal and spatial domains

The main goal behind the temporal and spatial partitioning of the reconfigurable logic resources is to achieve the maximum efficiency of the reconfigurable systems, pushing up resource usage and taking the maximum advantage of their flexibility. However, this approach raises several problems. An incoming function may require the relocation of other functions already implemented and running, in order to release enough contiguous space for its configuration (see function C2 in figure 1). Since each of the multiple independent functions sharing the logic space occupies a different amount of resources, many small pools of resources are created as they are released. These unallocated areas tend to become so small that they fail to satisfy any request and for that reason remain unused, leading to a fragmentation of the FPGA logic space [6].
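The fragmentation effect can be made concrete with a minimal sketch. It assumes a one-dimensional model of the logic space (a row of cells that are either occupied or free), which is a simplification of the two-dimensional CLB array; all names are hypothetical.

```python
from typing import List


def free_runs(occupied: List[bool]) -> List[int]:
    """Lengths of the maximal runs of free cells, left to right."""
    runs, current = [], 0
    for cell in occupied:
        if cell:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def can_place(occupied: List[bool], size: int) -> bool:
    """True if some contiguous free run can hold a function of `size` cells."""
    return any(run >= size for run in free_runs(occupied))


# 1 = occupied, 0 = free: six free cells in total, but split into pools of two.
space = [bool(c) for c in [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]
print(free_runs(space))     # [2, 2, 2]
print(can_place(space, 4))  # False
```

In this example a function needing four cells cannot be placed although six cells are free, which is the situation the rearrangement is meant to resolve.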

Suitable arrangements can be designed if the requirements of each function and their sequence are known in advance, but an increase in the available resources will in most cases be necessary to cope with the allocation problem [7]. However, when placement decisions have to be made on-line, or the need for extra space is only temporary, an increase in the available resources is a poor solution, since it decreases the efficiency of the system.

The problem described may be solved through on-line management of the available resources, whereby the system tries to prevent a lack of contiguous free resources from blocking the configuration of new functions (provided that the total number of resources available is sufficient). Note that spreading the components of an incoming function, due to fragmentation of available resources, would degrade its performance, delaying tasks and reducing the utilisation of the FPGA. If a new function cannot be allocated immediately due to lack of contiguous free resources, a suitable rearrangement of a subset of the functions currently running may solve the problem. Three methods are proposed in [5] to find such (partial) rearrangements, in order to increase the rate at which waiting functions are allocated, while minimising disruptions to running functions that are to be relocated. However, no physical execution of these rearrangements is proposed other than halting those functions, stopping the normal system operation.
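As an illustration of what a rearrangement produces, the sketch below compacts a one-dimensional placement towards one end and lists the moves that would have to be executed. This is a deliberately naive compaction chosen for brevity; it is not one of the three methods of [5], and the data layout and names are hypothetical.

```python
from typing import Dict, List, Tuple

Placement = Dict[str, Tuple[int, int]]  # name -> (start, size)


def compact_left(placement: Placement) -> Tuple[Placement, List[Tuple[str, int, int]]]:
    """Slide every function towards cell 0, preserving their order.

    Returns the new placement and the list of (name, old_start, new_start)
    moves that a relocation mechanism would have to carry out.
    """
    new_placement: Placement = {}
    moves = []
    next_free = 0
    for name, (start, size) in sorted(placement.items(), key=lambda kv: kv[1][0]):
        if start != next_free:
            moves.append((name, start, next_free))
        new_placement[name] = (next_free, size)
        next_free += size
    return new_placement, moves


running = {"A2": (0, 1), "B2": (3, 2), "C3": (7, 1)}  # 10-cell space, 6 cells free
after, moves = compact_left(running)
print(moves)  # [('B2', 3, 1), ('C3', 7, 3)]
print(after)  # {'A2': (0, 1), 'B2': (1, 2), 'C3': (3, 1)}
```

After the two moves the six free cells form a single contiguous run.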

A mechanism to implement such rearrangements without disturbing the system operation is presented in this paper. To address this problem, a new concept, dynamic relocation, is introduced. It enables the relocation of each FPGA CLB (Configurable Logic Block) and of its associated interconnections, even if the CLB is part of a function that is actually being used by an application [8, 9]. This concept enables the rearrangement and defragmentation of the FPGA logic space on-line (i.e. concurrently with all applications currently running), without any time overheads.

We will start by describing the dynamic relocation of

each CLB, highlighting the constraints imposed by the

FPGA architecture, and then we will introduce the

relocation mechanism proposed. The relocation of routing

resources will then be considered, including the software

tool that was developed to automate the generation of the

required partial configuration files.

2. Dynamic CLB relocation

Conceptually, an FPGA could be described as an array

of uncommitted CLBs, surrounded by a periphery of

IOBs, which are interconnectable by configurable routing

resources, controlled by an underlying set of memory

cells.

Any on-line management strategy implies a dynamic relocation mechanism, whereby a CLB currently being used by a given function has its functionality transferred into another CLB, without disturbing system operation. This relocation mechanism does more than just copy the functional specification of the CLB to be replicated: the corresponding interconnections with the rest of the circuit have to be re-established; additionally, depending on its current functionality, internal state information may also have to be copied.

The transparent relocation of a CLB is not a trivial task

due to two major issues: i) configuration memory

organisation and ii) internal state information.

The configuration memory can be visualised as a rectangular array of bits, which are grouped into one-bit-wide vertical frames extending from the top to the bottom of the array. A frame is the smallest unit of configuration that can be written to or read from the configuration memory. Frames are grouped together into larger units called columns. Each CLB column corresponds to a configuration column with multiple frames, mixing internal CLB configuration and state information, and column routing and interconnect information. The partitioning of the entire FPGA configuration memory into frames enables on-line concurrent partial reconfiguration, facilitating the implementation of on-line rearrangement procedures. The configuration procedure is a sequential mechanism that spans some (or possibly all) CLB configuration columns. When the functionality of a given CLB is dynamically relocated, even into a CLB in the same column, more than one column may be affected, since its input and output signals (as well as those of its replica) may cross several columns before reaching their source or destination.
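The observation that a relocation within one column can still touch several configuration columns can be sketched as follows. The model is an assumption made for illustration only: every signal is taken to run straight across all the columns lying between its end points, and the function name and column indices are hypothetical.

```python
from typing import Iterable, Set


def affected_columns(src_col: int, dst_col: int, net_cols: Iterable[int]) -> Set[int]:
    """Columns whose frames may need rewriting when a CLB is relocated.

    net_cols holds the columns of the sources and destinations of the CLB's
    signals. Assumes (simplification) straight routing between end points.
    """
    cols = {src_col, dst_col}
    for end in net_cols:
        for clb in (src_col, dst_col):
            lo, hi = sorted((clb, end))
            cols.update(range(lo, hi + 1))
    return cols


# CLB moved within column 5, but its signals come from column 3 and go to column 8.
print(sorted(affected_columns(src_col=5, dst_col=5, net_cols=[3, 8])))
# [3, 4, 5, 6, 7, 8]
```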

Any reconfiguration action must therefore ensure that the signals from the original CLB are not broken before being totally re-established from its replica, otherwise its operation will be disturbed or even halted. It is also important to ensure that the functionality of the CLB replica is perfectly stable before its outputs are connected to the system, to prevent output glitches. A set of experiments performed with an XCV200 from Xilinx demonstrated that the only possible solution is to divide the relocation procedure into two phases, as illustrated in figure 2 (for reasons of intelligibility, only the relevant interconnections are represented).

In the first phase, the internal configuration of the CLB

is copied into the new location and the inputs of both

CLBs are placed in parallel. Due to the slowness of the

reconfiguration procedure when compared with the speed

of operation of current applications, the outputs of the

CLB replica are already perfectly stable when they are

connected to the circuit, in the second phase. To avoid

output glitches, both CLBs (the original and its replica)

must remain in parallel for at least one clock cycle. Notice

that rewriting the same configuration data does not

generate any transient signals, so this procedure does not

affect the remaining resources covered by the rewriting of

the configuration frames that are needed to carry out the

relocation procedure.
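The ordering constraint of the two-phase procedure can be sketched as a sequence of connection states. The sketch below only tracks which CLBs are tied to the input signals and which ones drive the output signal; it is an abstract model with hypothetical names, not the configuration sequence generated by any tool. The final release order (outputs first, then inputs) follows the order given in the text for the gated-clock case.

```python
def two_phase_relocation(original: str, replica: str):
    """Yield (step, input_loads, output_drivers) for a combinational CLB.

    input_loads: CLBs whose inputs are tied to the function's input signals.
    output_drivers: CLBs currently driving the function's output signal.
    """
    loads, drivers = {original}, {original}
    yield "start", set(loads), set(drivers)
    # Phase 1: copy the configuration and place the inputs in parallel.
    loads.add(replica)
    yield "phase 1: copy configuration, parallel inputs", set(loads), set(drivers)
    # Phase 2: place the outputs in parallel (replica output already stable).
    drivers.add(replica)
    yield "phase 2: parallel outputs", set(loads), set(drivers)
    # After at least one clock cycle: release the original, outputs first.
    drivers.discard(original)
    yield "disconnect original outputs", set(loads), set(drivers)
    loads.discard(original)
    yield "disconnect original inputs", set(loads), set(drivers)


for step, loads, drivers in two_phase_relocation("CLB_orig", "CLB_replica"):
    assert drivers, "the output signal must never be left undriven"
    print(f"{step:45s} inputs={sorted(loads)} outputs={sorted(drivers)}")
```

The assertion in the loop expresses the requirement that the signals are never broken before being re-established from the replica.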

Successful completion of the procedure described above cannot be achieved without correct transfer of state information. If the current CLB function is purely combinational, this two-phase relocation procedure will suffice to accomplish successful relocation. However, in the case of a sequential function, the internal state information must be preserved and no update operation may be lost during the copying phase. The solution to this problem depends on the type of implementation. In this paper we shall consider three implementation cases: synchronous free-running clock circuits, synchronous gated-clock circuits, and asynchronous circuits.

[Figure 2: arrays of CLBs showing the 1st phase and the 2nd phase of the relocation; only the relevant interconnections are represented.]

Fig. 2. Two-phase CLB relocation procedure

When dealing with synchronous free-running clock circuits, the two-phase relocation procedure described previously is a good solution. Between the first and the second phase the CLB replica has the same inputs as the original CLB, and all its flip-flops (FFs) acquire the same state information. The experimental replication of CLBs with FFs driven by a free-running clock has confirmed the effectiveness of this method. Neither loss of state information nor output glitches were observed.
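The way the replica FF acquires the state of the original can be illustrated with a small behavioural simulation. It assumes that, while the inputs are in parallel, any fed-back signal reaching the replica still comes from the original FF output (the replica outputs not being connected yet); the toggle logic and all names are hypothetical.

```python
def next_state(x: int, q: int) -> int:
    """Example combinational logic in front of the FF: a toggle enabled by x."""
    return q ^ x


def simulate(inputs, q_original=1, q_replica=0):
    """Clock both FFs with their inputs in parallel (first phase).

    The replica's logic sees the same signals as the original, including the
    fed-back output of the original FF (assumption stated above).
    """
    trace = []
    for x in inputs:
        d = next_state(x, q_original)  # same D value reaches both FFs
        q_original, q_replica = d, d   # free-running clock: both capture it
        trace.append((q_original, q_replica))
    return trace


print(simulate([1, 0, 1, 1]))  # [(0, 0), (0, 0), (1, 1), (0, 0)]
```

Although the two FFs start in different states, they agree from the first clock edge onwards.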

When using synchronous gated-clock circuits, where

input acquisition by the FFs is controlled by the state of

the clock enable signal (CE), the previous method does

not ensure that the CLB replica captures the correct state

information, because CE may not be active during the

relocation procedure. Besides, it is not feasible to simply

set this signal as part of the relocation procedure, because

the value present at the input of the replica FFs may

change in the meantime, and a coherency problem would

then occur.

An auxiliary relocation circuit needs to be implemented to solve this problem. This circuit manages the transfer of the state information from the original FFs to the replica FFs, while enabling their update by the circuit at any instant, without delaying the relocation procedure. The whole relocation scheme is represented in figure 3, where only one logic cell is shown, for reasons of simplicity. In the Virtex and Spartan families, each CLB comprises four of these cells; however, for the purpose of implementing this procedure, each CLB cell can be considered individually. The temporary transfer paths established between the original cells and their replicas do not affect their functionality, since only free routing resources are used. No changes in the cell structure are required to implement this procedure. The relocation control and the clock enable control signals are driven through the reconfiguration memory, so no extra external pins are required. The OR gate and the 2:1 multiplexer that are part of the auxiliary relocation circuit must be implemented during the relocation process in a nearby (free) CLB.

[Figure 3: one logic cell of the original CLB and one of the replica CLB, each with its combinational logic circuitry, linked through the auxiliary relocation circuit. Labelled signals: CLB inputs, clock enable signal, clock signal, relocation control, clock enable control, original registered output, replica combinational output, CLB output.]

Fig. 3. CLB relocation for synchronous gated-clock circuits

The inputs of the 2:1 multiplexer present in the

auxiliary relocation circuit receive one temporary transfer

path from the output of the original CLB FF and another

one from the output of the combinational logic circuitry of

the replica CLB, which, in normal operation, is applied to

the FF input. This multiplexer is controlled by the clock

enable signal of the original CLB FF. If this signal is not

active, the output of the original CLB FF is applied to the

input of the replica CLB FF. The clock enable control

signal is then activated, which forces the replica CLB FF

to capture the value coming from the original CLB FF.

If the clock enable signal is active, or is activated

during this process, the multiplexer selects the output of

the combinational block in the replica CLB and applies it

to its FF input. In this case, both the original and the

replica FFs are updated at the same time and with the

same values, guaranteeing state coherency.
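The behaviour of the auxiliary relocation circuit described above can be checked with a short behavioural model. The model assumes that the OR gate combines the clock enable signal of the circuit with the clock enable control signal to drive the clock enable of the replica FF; the text does not spell out the OR gate inputs, so this wiring, like the function and variable names, is an assumption.

```python
from itertools import product


def clock_edge(q_orig, q_rep, d, ce, ce_control):
    """One active clock edge while the auxiliary relocation circuit is connected.

    d is the output of the combinational logic, identical in both CLBs because
    their inputs are in parallel. Assumption: the OR gate combines the circuit's
    clock enable (ce) with the clock enable control signal.
    """
    mux_out = d if ce else q_orig     # 2:1 multiplexer, selected by ce
    rep_ce = ce or ce_control         # OR gate driving the replica FF enable
    new_q_orig = d if ce else q_orig  # original FF only updates when ce is active
    new_q_rep = mux_out if rep_ce else q_rep
    return new_q_orig, new_q_rep


# With the clock enable control active, one edge suffices for every combination.
for q_orig, q_rep, d, ce in product([0, 1], repeat=4):
    new_q_orig, new_q_rep = clock_edge(q_orig, q_rep, d, ce, ce_control=1)
    assert new_q_orig == new_q_rep
print("replica state matches the original in all 16 cases")
```

The exhaustive loop covers both cases of the text: with the clock enable inactive the replica copies the original FF, and with it active both FFs capture the same new value.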

After the state has been transferred, the input signals

involved in the execution of the relocation procedure are

placed in parallel, all the signals to and from the auxiliary

relocation circuit are disconnected, and the outputs of both

CLBs are also placed in parallel. After at least one

function clock cycle, the original CLB is disconnected

from the rest of the circuit (first the outputs and then the

inputs, in order to prevent any transient instability in the

output signals), and becomes part of the pool of free

resources.

Figure 4 represents the flow diagram of the proposed relocation procedure. Several relocation experiments were carried out on a group of circuits from the ITC’99 Benchmark Circuits from the Politecnico di Torino [10] implemented in a Virtex XCV200, proving the effectiveness of our approach. These circuits are purely synchronous with only one single-phase clock signal present. However, this approach is also applicable to multiple clock/multiple phase applications, since only one clock signal is involved in the relocation of each CLB (CLB relocation is performed individually, even if many of these blocks are replicated simultaneously). No loss of information or functional disturbance was observed during the execution of these experiments.

[Figure 4: flow diagram running from Begin to End. Action boxes: connect signals to the auxiliary relocation circuit and place CLB input signals in parallel; activate relocation and clock enable control; deactivate clock enable control; connect the clock enable inputs of both CLBs; disconnect all the auxiliary relocation circuit signals; place CLB outputs in parallel; disconnect the original CLB outputs; deactivate relocation control; disconnect the original CLB inputs. Decision boxes with Yes/No branches: "> 2 CLK pulse" and "> 1 CLK pulse".]

Fig. 4. Relocation procedure flow diagram

The average relocation time of each CLB implementing synchronous gated-clock circuits is about 22.6 ms, when the Boundary Scan [11] infrastructure is used to perform the reconfiguration, at a test clock frequency of 20 MHz.
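To give a feeling for what this figure means in terms of configuration traffic, the two numbers quoted above can be combined. The calculation assumes that at most one bit is shifted through the Boundary Scan interface per test clock cycle, so the result is only an upper bound that also includes protocol overhead; the variable names are hypothetical.

```python
relocation_time_s = 22.6e-3  # average relocation time per CLB (from the text)
tck_hz = 20e6                # Boundary Scan test clock frequency (from the text)

tck_cycles = round(relocation_time_s * tck_hz)
print(tck_cycles)            # 452000

# Assumption: at most one configuration bit is shifted per TCK cycle, so this
# is an upper bound on the data transferred, protocol overhead included.
print(f"{tck_cycles / 8 / 1024:.1f} KiB at most")  # 55.2 KiB at most
```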

This method is also effective when dealing with asynchronous circuits, where transparent data latches are used instead of FFs. In this case, the clock enable signal is replaced in the latch by the input control signal, with the value present at the latch input being stored when the control signal changes from ‘1’ to ‘0’. The same auxiliary relocation circuit is used and the same relocation sequence is followed.

In the Virtex family of FPGAs, LUTs (Look-Up Tables) can be configured as Distributed RAMs. However, it is not feasible to extend this on-line relocation concept to the relocation of those LUT/RAMs. The content of the LUT/RAMs could be read and written through the configuration memory, but the system would have to be stopped to ensure data coherency, in the case of a writing attempt during the relocation interval, as stated in [12]. Furthermore, since frames span an entire column of CLB slices, the same LUT bit in all of them is updated with a single write command. It must be ensured that all the remaining data in the slice is either constant, or is also modified externally through partial reconfiguration. Even when they are not being relocated, LUT/RAMs should not lie in any column that could be affected by the relocation procedure.

3. Rearranging routing resources

Due to the scarcity of routing resources, it might also be necessary to perform a rearrangement of the existing interconnections, to optimise the occupancy of such resources, after the relocation of one or more CLBs, and to increase the availability of routing paths to incoming functions. The relocation of routing resources does not pose any special problems, since the same two-phase relocation procedure is effective for the relocation of local and global interconnections. The interconnections involved are first duplicated in order to establish an alternative path, and then disconnected, becoming available to be reused, as illustrated in figure 5.

[Figure 5: CLB1 connected to CLB2 through an original path and a replica path.]

Fig. 5. Relocation of routing resources

A last remark must be made about the relocation of routing resources. Since different paths are used while paralleling the original and replica interconnections, each of them will have a different propagation delay. This means that if the signal level at the output of the source CLB changes, the signal at the input of the destination CLB will show an interval of fuzziness, as shown in figure 6. However, the impedance of the routing switches will limit the current flow in the interconnection, and hence this behaviour does not damage the FPGA. Nevertheless, for transient analysis, the propagation delay associated with the parallel interconnections shall be taken as that of the longer of the two paths.
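The timing rule stated above can be written down as a small helper. It assumes that the interval of fuzziness is simply the difference between the two path delays, which is an interpretation of the text rather than a measured result; the delay values and names are hypothetical.

```python
def parallel_paths(delay_original_ns: float, delay_replica_ns: float):
    """Timing of a signal carried by two paths connected in parallel.

    Returns (fuzziness_interval_ns, worst_case_delay_ns).
    """
    fuzziness = abs(delay_original_ns - delay_replica_ns)
    worst_case = max(delay_original_ns, delay_replica_ns)
    return fuzziness, worst_case


# Hypothetical delays: the replica path is 0.75 ns slower than the original.
print(parallel_paths(2.0, 2.75))  # (0.75, 2.75)
```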

[Figure 6: waveforms over time of the CLB1 output and of the CLB2 input, showing the signal propagation through the original path and through the replica path.]

Fig. 6. Propagation delay during the relocation of routing resources

The dynamic relocation of CLBs and interconnections should have a minimum influence (preferably none) on the system operation, as well as reduced overhead in terms of reconfiguration cost. This cost depends on the number of reconfiguration frames needed to relocate each CLB, since a great number of frames would imply a longer rearrangement time. The impact of the relocation procedure on those functions currently running is mainly related to the delays imposed by rerouted paths, since the relocation procedure might imply a longer path, therefore decreasing the maximum frequency of operation.

The placement algorithms (in an attempt to reduce path

delays) gather in the same area the logic that is needed to

implement the components of a given function. It is

unwise to disperse it, since it would generate longer paths

(and hence, an increase in path delays). On the other hand,

it would also put too much stress upon the limited routing

resources. Therefore, the relocation of the CLBs should be

performed to nearby CLBs. If necessary, the relocation of

a complete function may take place in several stages, to

avoid an excessive increase in path delays during the

relocation interval.
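A relocation carried out in several stages can be sketched as a sequence of short hops. The sketch below bounds each hop by a Manhattan distance and walks rows before columns; both choices are assumptions made for illustration, and the code does not check whether the intermediate CLBs are free, which a real planner would have to do.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of a CLB


def staged_moves(src: Coord, dst: Coord, max_hop: int) -> List[Coord]:
    """Intermediate CLB positions so that no stage moves more than
    `max_hop` CLBs (Manhattan distance). Rows are walked first, then columns."""
    if max_hop < 1:
        raise ValueError("max_hop must be at least 1")
    stages, (row, col) = [], src
    while (row, col) != dst:
        budget = max_hop
        step = max(-budget, min(budget, dst[0] - row))
        row += step
        budget -= abs(step)
        col += max(-budget, min(budget, dst[1] - col))
        stages.append((row, col))
    return stages


print(staged_moves(src=(2, 2), dst=(2, 9), max_hop=3))
# [(2, 5), (2, 8), (2, 9)]
```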

4. The FPGA rearrangement and programming tool

To support the implementation of this management process, a software tool was developed, based on the JBits software, a set of Java classes that provide an Application Programming Interface (API) to access the Xilinx FPGA bitstream [13]. This tool is responsible for the creation of the partial configuration files and carries out the partial and dynamic reconfiguration of the FPGA through the Boundary Scan interface. Since the partial configuration files that implement the rearrangements defined by the relocation procedure are generated automatically (without designer intervention), the usage of this tool becomes very straightforward. The input information may be provided in the form of a complete configuration file (generated by the traditional development tool with a new placement for the functions currently running or those that are about to be implemented) or by providing the co-ordinates, source and destination, of the CLB to be relocated. The tool implements a series of algorithms based on artificial intelligence techniques that manage the routing of the signals coming in and out of the CLBs that are relocated, optimising the usage of the routing resources. The program always keeps a complete copy of the current configuration, enabling system recovery in case of failure. The user interface is shown in figure 7.
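Two of the features described above, the relocation request given by source and destination co-ordinates and the copy of the configuration kept for recovery, can be sketched in a few lines. The sketch is purely illustrative: it does not use the JBits API, and the class names, the dictionary representation of the configuration and the method names are all hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Dict, Tuple

Coord = Tuple[int, int]


@dataclass(frozen=True)
class RelocationRequest:
    source: Coord       # CLB to be relocated
    destination: Coord  # free CLB that receives its functionality


class ConfigurationStore:
    """Keeps the last known-good configuration so that it can be restored."""

    def __init__(self, clb_contents: Dict[Coord, str]):
        self.current = dict(clb_contents)
        self._backup = copy.deepcopy(self.current)

    def apply(self, request: RelocationRequest) -> None:
        if request.destination in self.current:
            raise ValueError("destination CLB is not free")
        self._backup = copy.deepcopy(self.current)  # snapshot before changing
        self.current[request.destination] = self.current.pop(request.source)

    def recover(self) -> None:
        self.current = copy.deepcopy(self._backup)


store = ConfigurationStore({(0, 0): "f1", (0, 3): "f2"})
store.apply(RelocationRequest(source=(0, 3), destination=(0, 1)))
print(store.current)  # {(0, 0): 'f1', (0, 1): 'f2'}
store.recover()
print(store.current)  # {(0, 0): 'f1', (0, 3): 'f2'}
```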

Fig. 7. Interface of the FPGA Rearrangement and Programming tool

5. Conclusion

A novel relocation procedure to perform the dynamic

relocation of CLBs, without halting their operation, was

presented in this paper. The proposed procedure enables

the implementation of a truly on-line management of the

FPGA logic space, supporting the rearrangement of

running functions, releasing enough contiguous space for

the configuration of new incoming functions, and

performing defragmentation. Therefore, on-line dynamic

scheduling of tasks in the spatial and temporal domains

becomes possible, enabling the implementation of the

virtual hardware concept [1]. Several applications may

share the same hardware platform, with their respective

functions running and being swapped in and out of the

FPGA, without generating any time overhead to the

running applications, or disturbing their operation. The

software application that was developed to support the

implementation of the relocation procedure enables the

complete automation of the whole process and an

optimised management of the available resources. Further

work is under way to increase the functionality and

flexibility of this tool.

References

[1] X. P. Long, H. Amano, “WASMII: a Data Driven Computer on a Virtual Hardware”, Proc. 1st IEEE Workshop on FPGAs for Custom Computing Machines, 1993, pp. 33-42.

[2] J. M. P. Cardoso, H. C. Neto, “An Enhanced Static-List Scheduling Algorithm for Temporal Partitioning onto RPUs”, Proc. 10th Intl. Conf. on VLSI, 1999, pp. 485-496.

[3] R. Maestre, F. J. Kurdahi, R. Hermida, N. Bagherzadeh, H. Singh, “A Formal Approach to Context Scheduling for Multicontext Reconfigurable Architectures”, IEEE Trans. on VLSI Systems, Vol. 9, No. 1, Feb. 2001, pp. 173-185.

[4] M. Sanchez-Elez, M. Fernandez, R. Maestre, R. Hermida, N. Bagherzadeh, F. J. Kurdahi, “A Complete Data Scheduler for Multi-Context Reconfigurable Architectures”, Proc. Design, Automation and Test in Europe, 2002, pp. 547-552.

[5] O. Diessel, H. El Gindy, M. Middendorf, H. Schmeck, B. Schmidt, “Dynamic scheduling of tasks on partially reconfigurable FPGAs”, IEE Proc.-Computer Digital Technology, Vol. 147, No. 3, May 2000, pp. 181-188.

[6] M. Vasilko, “DYNASTY: A Temporal Floorplanning Based CAD Framework for Dynamically Reconfigurable Logic Systems”, Proc. 9th Intl. Workshop on Field-Programmable Logic and Applications, 1999, pp. 124-133.

[7] M. Teich, S. Fekete, J. Schepers, “Compile-time optimization of dynamic hardware reconfigurations”, Proc. Intl. Conf. on Parallel and Distributed Processing Techniques and Applications, 1999, pp. 1097-1103.

[8] M. G. Gericota, G. R. Alves, M. L. Silva, J. M. Ferreira, “Active Replication: Towards a Truly SRAM-based FPGA On-Line Concurrent Testing”, Proc. 8th IEEE Intl. On-Line Testing Workshop, 2002, pp. 165-169.

[9] M. G. Gericota, G. R. Alves, M. L. Silva, J. M. Ferreira, “On-line Defragmentation for Run-Time Partially Reconfigurable FPGAs”, Proc. 12th Intl. Conf. on Field Programmable Logic and Applications, 2002, pp. 302-311.

[10] Politecnico di Torino ITC’99 benchmarks. http://www.cad.polito.it/tools/itc99.html

[11] IEEE Std. Test Access Port and Boundary Scan Architecture (IEEE Std 1149.1), IEEE Std. Board, May 1990.

[12] W. Huang, E. J. McCluskey, “A Memory Coherence Technique for Online Transient Error Recovery of FPGA Configurations”, Proc. 9th ACM Intl. Symposium on Field-Programmable Gate Arrays, 2001, pp. 183-192.

[13] S. A. Guccione, D. Levi, P. Sundararajan, “JBits Java based interface for reconfigurable computing”, Proc. 2nd Military and Aerospace Appl. of Prog. Devices and Technologies Conf., 1999.

979

Applications of TORC: An Open Toolkit for Reconfigurable Computing

Article

On-line Scheduling of Real Time Tasks and Compact Placement Method for Dynamically Partially Reconfigurable Systems

Article

Jying-Lwen Chiu

Run-time Reconfigurable Hardware

Article

Jan 2007

A Framework for Mapping Dynamic Virtual Kernels onto Heterogeneous Reconfigurable Platforms

Conference Paper

May 2014

Field programmable Gate Arrays (FPGAs) promise a low power flexible alternative for today's market heterogeneous systems. In order to be widely accepted, novel solutions and approaches are required for fast and flexible application implementation. In this paper we propose a methodology, as well as the supporting toolflow targeting to provide fast implementation of multiple applications onto heterogeneous FPGAs through the use of virtual kernels. Experimental results prove the efficiency of the introduced solution, as we achieve application mapping 30x faster on average compared to a state-of-art approach, with negligible degradation in the quality of the mapping. Additionally, we enable the mapping of multiple applications onto a single FPGA with only a small penalty of 4.7% in the maximum operation frequency of those applications compared with our reference solution.

On Supporting Efficient Partial Reconfiguration with Just-In-Time Compilation

Conference Paper

May 2012

Partial reconfiguration is possible to deliver vir-tually unlimited hardware resources since it enables dynamic allocation and de-allocation of tasks onto a reconfigurable architecture, while the rest tasks continue to operate. However, in order to benefit from this flexibility, partial reconfigu-ration has to be appropriately applied. Among others, the placement of partial configuration data is a critical issue since it affects the fragmentation of hardware resources. In this paper we introduce a novel methodology for supporting partial reconfiguration with the usage of a Just-in-Time (JIT) Compilation framework. Experimental results with a number of benchmarks showed that the introduced solution performs application P&R 7.34× faster, as compared to the state-of-the-art tools, while it also leads to significant lower fragmentation of hardware resources.

IP Protection of FPGA Cores Through a Novel Public/Secret-Key Encryption Mechanism

Chapter

Jan 2016

Protecting the rights of Intellectual Property (IP) owners is extremely important to the expansion of the core-based design market. Currently, IP providers have no mechanism to guarantee the protection of their IP against over-deployment. We propose a system to guarantee that IP cores can only be deployed into devices agreed upon between the IP provider and the customer. The system is based on secured handshaking with encrypted device and design authentication information. It consists of hardware-supported design encryption and secured authentication protocols. It uses a combination of secret and public-key cryptographic functions devised for an uncomplicated trustable design exchange scenario. The public-key functions use modular squaring (Rabin Lock) on the FPGA chip instead of exponentiation to reduce the hardware complexity. The system limits the parties involved in the transaction to the IP provider and the customer only.

A Runtime FPGA Placement and Routing Using Low-Complexity Graph Traversal

Article

Apr 2015

Dynamic Partial Reconfiguration (DPaR) enables efficient allocation of logic resources by adding new functionalities or by sharing and/or multiplexing resources over time. Placement and routing (P&R) is one of the most time-consuming steps in the DPaR flow. P&R are two independent NP-complete problems, and, even for medium size circuits, traditional P&R algorithms are not capable of placing and routing hardware modules at runtime. We propose a novel runtime P&R algorithm for Field-Programmable Gate Array (FPGA)-based designs. Our algorithm models the FPGA as an implicit graph with a direct correspondence to the target FPGA. The P&R is performed as a graph mapping problem by exploring the node locality during a depth-first traversal. We perform the P&R using a greedy heuristic that executes in polynomial time. Unlike state-of-the-art algorithms, our approach does not try similar solutions, thus allowing the P&R to execute in milliseconds. Our algorithm is also suitable for P&R in fragmented regions. We generate results for a manufacturer-independent virtual FPGA. Compared with the most popular P&R tool running the same benchmark suite, our algorithm is up to three orders of magnitude faster.

A Survey of FPGA Dynamic Reconfiguration Design Methodology and Applications

Article

Aug 2014

FPGA Dynamic Partial Reconfiguration (DPR or PR) technology has emerged and gradually matured in recent years. It provides Time-Division Multiplexing (TDM) capability in utilizing on-chip resources and leads to significant benefits in comparison with conventional static designs. However, the partially reconfigurable design process imposes additional complexity and technical requirements on FPGA developers. Hence, PR design approaches are being widely explored and investigated to systematize the development methodology and ease the designers' work. In this paper, the authors collect several research and engineering projects in this area and present a survey of the design methodology and applications of PR. Research aspects are discussed in various hardware/software layers.

A run-time graph-based Polynomial Placement and routing algorithm for virtual FPGAs

Conference Paper

Sep 2013

Dynamic partial reconfiguration enables efficient use of hardware resources by multiplexing system functionality in time. However, many challenges arise from partial reconfiguration implementation. The placement and routing (P&R) of the hardware modules is a computationally intensive task, and the state-of-the-art algorithms are not suitable to place and route modules at run-time. This paper makes several contributions: (1) Single Placement at run-time: we propose a novel P&R algorithm based on a greedy heuristic where a single placement is performed at run-time in a few milliseconds. (2) Implicit Graph Model: the FPGA is modelled as an implicit graph with a direct correspondence to the physical FPGA, and the P&R is performed as a graph mapping problem by exploring the node locality during the depth-first traversal. (3) Polynomial Placement: we show that even a single placement can be routed without critical path degradation. (4) Fragmented Regions: the graph approach is flexible, and it allows efficient placement even onto fragmented FPGA areas. Compared with the most popular P&R tool running the same benchmark suite, our algorithm is on average 864x faster. Moreover, the bitstream for partial reconfiguration is also reduced by a factor of 4.

Run-time management for future MPSoC platforms

Article

Vincent Nollet

The Test Access Port and Boundary-Scan Architecture

Book

Jan 1990

From the Preface: "IEEE Std 1149.1 was developed for your use. As more engineers and more firms use it, it will become more valuable. The more expertise in ATEs, circuit design, catalog ICs, application-specific ICs (ASICs), etc. that is developed collectively, the more we all can benefit from reuse of generic solutions to common technological problems. As Harry Bleeker did in his foreword to this book, we urge you to use the standard, we urge you to participate in its further evolution, and we urge you to do so in the superbly constructive and cooperative spirit that has infused JTAG." "Guard the mysteries. Constantly reveal them." --Lew Welch, "Course College Courses: Religion"

Active replication: towards a truly SRAM-based FPGA on-line concurrent testing

Conference Paper

Jul 2002

The reusing of the same hardware resources to implement speed-critical algorithms, without interrupting system operation, is one of the main reasons for the increasing use of reconfigurable computing platforms, employing complex SRAM-based FPGAs. However, new semiconductor manufacturing technologies increase the probability of lifetime operation failures, requiring new on-line testing / fault-tolerance methods able to improve the dependability of the systems where they are included. The Active Replication technique presented in this paper consists of a set of procedures that enables the implementation of a truly non-intrusive structural on-line concurrent testing approach, detecting and avoiding permanent faults and correcting errors due to transient faults. In relation to a previous technique proposed by the authors as part of the DRAFT FPGA concurrent test methodology, the Active Replication technique extends the range of circuits that can be replicated, by introducing a novel method with very low silicon overhead.

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures

Conference Paper

Jan 2002

A new technique is presented in this paper to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve application execution time by minimizing external memory transfers. Some amount of on-chip data storage is assumed to be available in the reconfigurable architecture. Therefore the Complete Data Scheduler tries to optimally exploit this storage, saving data and result transfers between on-chip and external memories. In order to do this, specific algorithms for data placement and replacement have been designed. We also show that a suitable data scheduling could decrease the number of transfers required to implement the dynamic reconfiguration of the system.

An Enhanced Static-List Scheduling Algorithm for Temporal Partitioning onto RPUs

Conference Paper

Jan 1999

This paper presents a novel algorithm for temporal partitioning of graphs representing a behavioral description. The algorithm is based on an extension of the traditional static-list scheduling that tailors it to resolve both scheduling and temporal partitioning. The nodes to be mapped into a partition are selected based on a statically computed cost model. The cost for each node integrates communication effects, the critical path length, and the possibility of the critical path to hide the delay of parallel nodes. In order to alleviate the runtime there is no dynamic update of the costs. A comparison of the algorithm to other schedulers and with close-to-optimum results obtained with a simulated annealing approach is shown. The presented algorithm has been implemented and the results show that it is robust, effective, and efficient, and when compared to other methods finds very good results in small amounts of CPU time.

A memory coherence technique for online transient error recovery of FPGA configurations

Conference Paper

Feb 2001

The partial reconfiguration feature of some of the current-generation Field Programmable Gate Arrays (FPGAs) can improve dependability by detecting and correcting errors in on-chip configuration data. Such an error recovery process can be executed online with minimal interference with user applications. However, because Look-up Tables (LUTs) in Configurable Logic Blocks (CLBs) of FPGAs can also implement memory modules for user applications, a memory coherence issue arises such that memory contents in user applications may be altered by the online configuration data recovery process. In this paper, we investigate this memory coherence problem and propose a memory coherence technique that does not impose extra constraints on the placement of memory-configured LUTs. Theoretical analyses and simulation results show that the proposed technique guarantees memory coherence with a very small (on the order of 0.1%) execution time overhead in user applications.

Compile-time Optimization of Dynamic Hardware Reconfigurations.

Conference Paper

Jan 1999

Recent generations of Field Programmable Gate Arrays (FPGA) allow the dynamic reconfiguration of cells on the chip during run-time. For a given problem consisting of a set of tasks with computation requirements modeled by rectangles of cells, several optimization problems such as finding the array of minimal size to accomplish the tasks within a given time limit are considered. Existing approaches based on ILP formulations to solve these problems being related to multi-dimensional packing...

DYNASTY: A Temporal Floorplanning Based CAD Framework for Dynamically Reconfigurable Logic Systems

Conference Paper

Aug 1999

Milan Vasilko

This paper presents DYNASTY–a new CAD framework aimed at supporting research of design techniques, algorithms and methodologies for dynamically reconfigurable logic (DRL) systems. Design flow implemented in the DYNASTY Framework is based around a temporal floorplanning (TF) DRL design abstraction, which allows simultaneous DRL design space exploration in spatial and temporal dimensions. The paper introduces temporal floorplanning and its implementation in the DYNASTY Framework. Methodologies based on temporal floorplanning promise reduction of design time and elimination of costly design iterations present in traditional DRL design methodologies.

A complete data scheduler for multi-context reconfigurable architectures

Conference Paper

Feb 2002

WASMII: a data driven computer on a virtual hardware

Conference Paper

May 1993

Virtual hardware is a technique to realize a large digital circuit with a small amount of real hardware by using an extended Field Programmable Gate Array (FPGA) technology. Several configuration RAM modules are provided inside the FPGA chip, and the configuration of the gate array can be rapidly changed by replacing the active module. Data for configuration are transferred from an off-chip backup RAM to an unused configuration RAM module. A novel computation mechanism called the WASMII, which executes a target data flow graph directly, is proposed on the basis of the virtual hardware. A WASMII chip consists of the FPGA for virtual hardware and the additional mechanism to replace configuration RAM modules in a data-driven manner. Configuration data are preloaded in the order assigned in advance by a static scheduling preprocessor. By connecting a number of WASMII chips, a highly parallel system can be easily constructed.

Recommended publications

On-line Defragmentation for Run-Time Partially Reconfigurable FPGAs

Run-time Defragmentation for Dynamically Reconfigurable Hardware

A New Approach to Assess Defragmentation Strategies in Dynamically Reconfigurable FPGAs

DRAFT: An On-line Concurrent Test for Partial and Dynamically Reconfigurable FPGAs