Run-Time Management of Logic Resources on Reconfigurable Systems
Manuel G. Gericota, Gustavo R. Alves
Department of Electrical Engineering – ISEP
Rua Dr. António Bernardino de Almeida – 4200-072 Porto – PORTUGAL
Miguel L. Silva, José M. Ferreira
Department of Electrical and Computer Engineering – FEUP
Rua Dr. Roberto Frias – 4200-465 Porto – PORTUGAL
Abstract
Dynamically reconfigurable systems based on partial
and dynamically reconfigurable FPGAs may have their
functionality partially modified at run-time without
stopping the operation of the whole system.
The efficient management of the logic space available
is one of the biggest problems faced by these systems.
When the sequence of reconfigurations to be performed is
not predictable, resource allocation decisions have to be
made on-line. A rearrangement may be necessary to get
enough contiguous space to implement incoming
functions, avoiding the spreading of their components and
the resulting degradation of system performance.
A new software tool that helps to handle the problems
posed by the consecutive reconfiguration of the same logic
space is presented in this paper. This tool uses a novel
on-line rearrangement procedure to solve fragmentation
problems and to rearrange the logic space in a way
completely transparent to the applications currently
running.
1. Introduction
Reconfigurable computing experienced a considerable
expansion in the last few years, due in part to the fast
run-time partial reconfiguration features offered by recent
Field Programmable Gate Arrays (FPGAs). The Virtex
and Spartan families from Xilinx, used to validate this
work, are the most recent examples. Devices of this kind
enabled the implementation of the concept of virtual
hardware defined in [1] ten years ago: to use temporal
partitioning to implement those applications whose area
requirements exceed the reconfigurable logic space
available (i.e. to assume the availability of unlimited
hardware resources). The static implementation of a
circuit is separated into two or more independent hardware
contexts, which may be swapped during runtime [2].
(This work is supported by an FCT program under contract
POCTI/33842/ESE/2000.)
Extensive work was done to improve the multi-context
handling capability of these devices, by storing several
configurations and enabling quick context switching [3,
4]. The main goal was to improve the execution time by
minimising external memory transfers, assuming that
some amount of on-chip data storage was available in the
reconfigurable architecture. However, this solution was
only feasible if the functions implemented in hardware
were mutually exclusive in the temporal domain, e.g.
context-switching between coding/decoding schemes in
communication, video or audio systems; otherwise, the
length of the reconfiguration intervals would lead to
unacceptable delays in most applications.
These restrictions have been overcome by higher levels
of integration, due to the employment of sub-micron
scales, and by the use of higher frequencies of operation.
The increasing amount of logic available in FPGAs and
the reduction in the reconfiguration time, partly owing to
the possibility of partial reconfiguration, extended the
concept of virtual hardware to the implementation of
multiple applications sharing the same logic resources in
the spatial and temporal domains.
An application comprises a set of functions that are
predominantly executed sequentially, or with a low degree
of parallelism, in which case their simultaneous
availability is not required. On the other hand, the
reconfiguration intervals offered by new FPGAs are
sufficiently small to enable functions to be swapped in real
time. If a proper floorplanning schedule is devised, it
becomes feasible to use a single device to run a set of
applications, which in total require far more than 100% of
the FPGA available resources, by swapping functions in
and out of the FPGA as needed.
1530-1591/03 $17.00 2003 IEEE
Partial reconfiguration times are in the order of a few
milliseconds, depending on the configuration interface and
on the complexity (and thus on the size) of the function
being implemented. However, the reconfiguration time
overhead may be virtually zero, if new functions are
swapped in advance with those already out of use, as
illustrated in figure 1. A number of applications share the
same available reconfigurable logic space in both the
temporal and spatial domains [5]. After the execution of a
given function, a new function may be set up in its place
during the interval rt, in order to be available when
required by the application flow (rt should therefore not be
mistaken for the reconfiguration time). Notice that an
increase in the degree of parallelism may delay the
reconfiguration of incoming functions, due to lack of
space in the FPGA. Consequently, delays will be
introduced in the application execution, systematically or
not, depending on the application flow.
[Figure 1: applications A, B and C (functions A1, A2, B1, B2 and C1 to C4) sharing the available resource space over time, starting from the initial configuration; the legend marks rt as the reconfiguration interval and identifies data transfers between different functions.]
Fig. 1. Temporal scheduling of applications in the
temporal and spatial domains
The main goal behind the temporal and spatial
partitioning of the reconfigurable logic resources is to
achieve the maximum efficiency of the reconfigurable
systems, pushing up resource usage and taking
maximum advantage of their flexibility. However, this
approach raises several problems. An incoming
function may require the relocation of other functions
already implemented and running, in order to release
enough contiguous space for its configuration (see
function C2 in figure 1). Since each of the multiple
independent functions sharing the logic space occupies a
different amount of resources, many small pools of
resources are created as they are released. These
unallocated areas tend to become so small that they fail to
satisfy any request and for that reason remain unused,
leading to a fragmentation of the FPGA logic space [6].
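The fragmentation effect described above can be illustrated with a minimal sketch, assuming a one-dimensional logic space in which each cell is either in use or free. The names `free_runs` and `can_place`, and the 12-cell layout, are hypothetical and serve only as an illustration; a real FPGA logic space is two-dimensional.

```python
from typing import List

def free_runs(occupied: List[bool]) -> List[int]:
    """Return the lengths of the maximal runs of contiguous free cells."""
    runs, current = [], 0
    for cell in occupied:
        if cell:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs

def can_place(occupied: List[bool], size: int) -> bool:
    """True if some contiguous free run can hold a function of `size` cells."""
    return any(run >= size for run in free_runs(occupied))

# Hypothetical 12-cell logic space: True = cell in use, False = cell free.
space = [True, False, False, True, True, False,
         False, True, False, False, True, True]
total_free = space.count(False)

# Packing the used cells together (an idealised rearrangement) merges the
# free pools into one contiguous area.
compacted = sorted(space, reverse=True)
```

In this layout six cells are free in total, yet a function needing four contiguous cells cannot be placed until the used cells are packed together, which is the situation that motivates rearrangement.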
Suitable arrangements can be designed if the
requirements of each function and their sequence are
known in advance, but an increase in the available
resources will in most cases be necessary to cope with the
allocation problem [7]. However, when placement
decisions have to be made on-line, or the need for extra
space is only temporary, an increase in the available
resources is a poor solution, since it decreases the
efficiency of the system.
The problem described may be solved through on-line
management of the available resources, whereby the
system tries to prevent a lack of contiguous free
resources from blocking the configuration of new functions
(provided that the total number of resources available is
sufficient). Note that spreading the components of an
incoming function, due to fragmentation of available
resources, would degrade its performance, delaying tasks
and reducing the utilisation of the FPGA. If a new
function cannot be allocated immediately due to lack of
contiguous free resources, a suitable rearrangement of a
subset of the functions currently running may solve the
problem. Three methods are proposed in [5] to find such
(partial) rearrangements, in order to increase the rate at
which waiting functions are allocated, while minimising
disruptions to running functions that are to be relocated.
However, no physical execution of these rearrangements
is proposed other than halting those functions, stopping
the normal system operation.
A mechanism to implement such rearrangements
without disturbing the system operation is presented in
this paper. To address this problem, a new concept is
introduced, dynamic relocation, which enables the
relocation of each FPGA CLB (Configurable Logic
Block) and of its associated interconnections, even if the
CLB is part of a function that is actually being used by an
application [8, 9]. This concept enables the rearrangement
and defragmentation of the FPGA logic space on-line (i.e.
concurrently with all applications currently running),
without any time overheads.
We will start by describing the dynamic relocation of
each CLB, highlighting the constraints imposed by the
FPGA architecture, and then we will introduce the
relocation mechanism proposed. The relocation of routing
resources will then be considered, including the software
tool that was developed to automate the generation of the
required partial configuration files.
2. Dynamic CLB relocation
Conceptually, an FPGA could be described as an array
of uncommitted CLBs, surrounded by a periphery of
IOBs, which are interconnectable by configurable routing
resources, controlled by an underlying set of memory
cells.
Any on-line management strategy implies a dynamic
relocation mechanism, whereby a CLB currently being
used by a given function has its functionality transferred
into another CLB, without disturbing system operation.
This relocation mechanism does more than just copy
the functional specification of the CLB to be replicated:
the corresponding interconnections with the rest of the
circuit have to be re-established; additionally, depending on
its current functionality, internal state information may
also have to be copied.
The transparent relocation of a CLB is not a trivial task
due to two major issues: i) configuration memory
organisation and ii) internal state information.
The configuration memory can be visualised as a
rectangular array of bits, which are grouped into one-bit
wide vertical frames extending from the top to the bottom
of the array. A frame is the smallest unit of configuration
that can be written to or read from the configuration
memory. Frames are grouped together into larger units
called columns. Each CLB column corresponds to a
configuration column with multiple frames, mixing
internal CLB configuration and state information, and
column routing and interconnect information. The
partitioning of the entire FPGA configuration memory into
frames enables on-line concurrent partial reconfiguration,
facilitating the implementation of on-line rearrangement
procedures. The configuration procedure is a sequential
mechanism that spans some (or possibly all)
CLB configuration columns. When the functionality of a
given CLB is dynamically relocated, even into a CLB in
the same column, more than one column may be affected,
since its input and output signals (as well as those of its
replica) may cross several columns before reaching their
source or destination.
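A rough sketch of why a relocation within one column can still touch several configuration columns is given below. It assumes, purely for illustration, that each signal occupies routing in every column between the CLB and the other end of the net; the function names and the (row, column) coordinates are hypothetical and do not reflect the actual Virtex routing.

```python
from typing import Iterable, Set, Tuple

Coord = Tuple[int, int]  # (row, column) of a CLB

def columns_spanned(a: Coord, b: Coord) -> Set[int]:
    """Columns between two CLBs, both ends included (simplifying assumption)."""
    lo, hi = sorted((a[1], b[1]))
    return set(range(lo, hi + 1))

def affected_columns(original: Coord, replica: Coord,
                     endpoints: Iterable[Coord]) -> Set[int]:
    """Configuration columns rewritten when `original` is relocated to `replica`.

    `endpoints` lists the sources and destinations of the CLB signals.
    """
    cols = {original[1], replica[1]}
    for end in endpoints:
        cols |= columns_spanned(original, end)  # existing connections
        cols |= columns_spanned(replica, end)   # connections of the replica
    return cols

# Relocation inside column 3, with signals reaching CLBs in columns 1 and 6.
touched = affected_columns((5, 3), (8, 3), [(5, 1), (6, 6)])
```

Here `touched` covers columns 1 to 6, even though the original CLB and its replica share the same column.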
Any reconfiguration action must therefore ensure that
the signals from the original CLB are not broken before
being totally re-established from its replica, otherwise its
operation will be disturbed or even halted. It is also
important to ensure that the functionality of the CLB
replica is perfectly stable before its outputs are connected
to the system, to prevent output glitches. A set of
experiments performed with an XCV200 from Xilinx
demonstrated that the only possible solution is to divide
the relocation procedure into two phases, as illustrated in
figure 2 (for the sake of clarity, only the relevant
interconnections are represented).
In the first phase, the internal configuration of the CLB
is copied into the new location and the inputs of both
CLBs are placed in parallel. Due to the slowness of the
reconfiguration procedure when compared with the speed
of operation of current applications, the outputs of the
CLB replica are already perfectly stable when they are
connected to the circuit, in the second phase. To avoid
output glitches, both CLBs (the original and its replica)
must remain in parallel for at least one clock cycle. Notice
that rewriting the same configuration data does not
generate any transient signals, so this procedure does not
affect the remaining resources covered by the rewriting of
the configuration frames that are needed to carry out the
relocation procedure.
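The ordering constraint behind the two-phase procedure (connect the replica before disconnecting the original) can be sketched as a small model. This is only an illustration under simplifying assumptions: the set-based bookkeeping and the names `two_phase_relocation` and `never_broken` are hypothetical, and the final disconnection order (outputs first, then inputs) follows the description given later in this section.

```python
def two_phase_relocation():
    """Yield (step, inputs, outputs) after each reconfiguration step.

    `inputs` holds the CLBs whose inputs are tied to the circuit and
    `outputs` the CLBs whose outputs currently drive it.
    """
    inputs, outputs = {"original"}, {"original"}
    yield "initial state", set(inputs), set(outputs)

    # Phase 1: configuration copied, inputs of both CLBs placed in parallel.
    inputs.add("replica")
    yield "phase 1: inputs in parallel", set(inputs), set(outputs)

    # Phase 2: outputs of both CLBs placed in parallel.
    outputs.add("replica")
    yield "phase 2: outputs in parallel", set(inputs), set(outputs)

    # After at least one clock cycle the original CLB is released,
    # outputs first and inputs afterwards.
    outputs.discard("original")
    yield "original outputs disconnected", set(inputs), set(outputs)
    inputs.discard("original")
    yield "original inputs disconnected", set(inputs), set(outputs)

def never_broken(steps) -> bool:
    """Check that the circuit always has a driver that also receives the inputs."""
    return all(bool(outs) and outs <= ins for _, ins, outs in steps)

steps = list(two_phase_relocation())
```

At every step at least one CLB drives the circuit, and no CLB drives it without also receiving the inputs, which is the property the two phases are meant to preserve.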
Successful completion of the procedure described
above cannot be achieved without correct transfer of state
information. If the current CLB function is purely
combinational, this two-phase relocation procedure will
suffice to accomplish successful relocation. However, in
the case of a sequential function, the internal state
information must be preserved and no update operations
may be lost during the copying phase. The solution to
this problem depends on the type of implementation. In
this paper we shall consider three implementation cases:
synchronous free-running clock circuits, synchronous
gated-clock circuits, and asynchronous circuits.
[Figure 2: CLB array shown in the 1st phase and in the 2nd phase of the relocation.]
Fig. 2. Two-phase CLB relocation procedure
When dealing with synchronous free-running clock
circuits, the two-phase relocation procedure described
previously is a good solution. Between the first and the
second phase the CLB replica has the same inputs as the
original CLB, and all its flip-flops (FFs) acquire the same
state information. The experimental replication of CLBs
with FFs driven by a free-running clock has confirmed the
effectiveness of this method. Neither loss of state
information nor output glitches were observed.
When using synchronous gated-clock circuits, where
input acquisition by the FFs is controlled by the state of
the clock enable signal (CE), the previous method does
not ensure that the CLB replica captures the correct state
information, because CE may not be active during the
relocation procedure. Besides, it is not feasible to simply
set this signal as part of the relocation procedure, because
the value present at the input of the replica FFs may
change in the meantime, and a coherency problem would
then occur.
An auxiliary relocation circuit needs to be implemented
to solve this problem. This circuit manages the transfer of
the state information from the original FFs to the replica
FFs, while enabling their update by the circuit at any
instant, without delaying the relocation procedure. The
whole relocation scheme is represented in figure 3, where
only one logic cell is shown, for simplicity. In
the Virtex and Spartan families, each CLB comprises four
of these cells; however, and for the purpose of
implementing this procedure, each CLB cell can be
considered individually. The temporary transfer paths
established between the original cells and their replicas do
not affect their functionality, since only free routing
resources are used. No changes in the cell structure are
required to implement this procedure. The relocation
control and the clock enable control signals are driven
through the reconfiguration memory, so no extra external
pins are required. The OR gate and the 2:1 multiplexer
that are part of the auxiliary relocation circuit must be
implemented during the relocation process in a nearby
(free) CLB.
[Figure 3: original CLB and replica CLB, each with combinational logic circuitry and a D flip-flop with clock enable (CE), linked by the auxiliary relocation circuit. Signals shown: CLB inputs, clock enable signal, clock signal, relocation control, clock enable control, replica combinational output, original registered output and CLB output.]
Fig. 3. CLB relocation for synchronous gated-clock
circuits
The inputs of the 2:1 multiplexer present in the
auxiliary relocation circuit receive one temporary transfer
path from the output of the original CLB FF and another
one from the output of the combinational logic circuitry of
the replica CLB, which, in normal operation, is applied to
the FF input. This multiplexer is controlled by the clock
enable signal of the original CLB FF. If this signal is not
active, the output of the original CLB FF is applied to the
input of the replica CLB FF. The clock enable control
signal is then activated, which forces the replica CLB FF
to capture the value coming from the original CLB FF.
If the clock enable signal is active, or is activated
during this process, the multiplexer selects the output of
the combinational block in the replica CLB and applies it
to its FF input. In this case, both the original and the
replica FFs are updated at the same time and with the
same values, guaranteeing state coherency.
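The behaviour just described can be checked with a small cycle-level model. This is a sketch under stated assumptions rather than the circuit of figure 3 itself: each cell holds a single one-bit FF, the combinational logic is a toy XOR of one input with the original FF output, and the replica CE input is taken to be the OR of the clock enable signal and the clock enable control. The function names are hypothetical.

```python
import random

def clock_edge(q_orig, q_rep, x, ce, ce_control):
    """Advance both flip-flops by one active clock edge; return (q_orig, q_rep)."""
    d = x ^ q_orig                  # toy combinational logic, identical in both cells
    d_rep = d if ce else q_orig     # 2:1 multiplexer selected by the clock enable signal
    ce_rep = ce or ce_control       # OR gate feeding the replica CE input
    new_orig = d if ce else q_orig  # original FF loads only when CE is active
    new_rep = d_rep if ce_rep else q_rep
    return new_orig, new_rep

def transfer_is_coherent(trials=1000, seed=0):
    """Random check that one edge copies the state and that it then stays in step."""
    rng = random.Random(seed)
    for _ in range(trials):
        q_orig, q_rep = rng.randint(0, 1), rng.randint(0, 1)
        # One edge with the clock enable control active transfers the state.
        q_orig, q_rep = clock_edge(q_orig, q_rep,
                                   rng.randint(0, 1), rng.randint(0, 1), 1)
        if q_orig != q_rep:
            return False
        # With the control released, both FFs must keep tracking each other.
        for _ in range(8):
            q_orig, q_rep = clock_edge(q_orig, q_rep,
                                       rng.randint(0, 1), rng.randint(0, 1), 0)
            if q_orig != q_rep:
                return False
    return True
```

Whether CE is inactive (the replica captures the original FF output) or active (both FFs load the same combinational value), the two states coincide after a single clock edge in this model.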
After the state has been transferred, the input signals
involved in the execution of the relocation procedure are
placed in parallel, all the signals to and from the auxiliary
relocation circuit are disconnected, and the outputs of both
CLBs are also placed in parallel. After at least one
function clock cycle, the original CLB is disconnected
from the rest of the circuit (first the outputs and then the
inputs, in order to prevent any transient instability in the
output signals), and becomes part of the pool of free
resources.
Figure 4 represents the flow diagram of the proposed
relocation procedure. Several relocation experiments were
carried out on a group of circuits from the ITC’99
Benchmark Circuits from the Politecnico di Torino [10]
implemented in a Virtex XCV200, proving the
effectiveness of our approach. These circuits are purely
synchronous with only one single-phase clock signal
present. However, this approach is also applicable to
multiple clock/multiple phase applications, since only one
clock signal is involved in the relocation of each CLB
(CLB relocation is performed individually, even if many
of these blocks are replicated simultaneously). No loss
of information or functional disturbance was observed
during the execution of these experiments.
[Figure 4 flow: Begin; connect signals to the auxiliary relocation circuit and place CLB input signals in parallel; activate relocation and clock enable control; wait until > 2 CLK pulses; deactivate clock enable control; connect the clock enable inputs of both CLBs; deactivate relocation control; disconnect all the auxiliary relocation circuit signals; place CLB outputs in parallel; wait until > 1 CLK pulse; disconnect the original CLB outputs; disconnect the original CLB inputs; End.]
Fig. 4. Relocation procedure flow diagram
The average relocation time of each CLB implementing
synchronous gated-clock circuits is about 22.6 ms, when
the Boundary Scan [11] infrastructure is used to perform
the reconfiguration, at a test clock frequency of 20 MHz.
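As a quick worked figure, the relocation time can be converted into test clock (TCK) cycles. The helper below is a hypothetical illustration; since the Boundary Scan interface shifts at most one bit per TCK cycle, the result is only an upper bound on the number of bits shifted per relocated CLB, protocol overhead included.

```python
def tck_cycles(duration_us: int, tck_mhz: int) -> int:
    """Number of test clock cycles elapsed in `duration_us` microseconds."""
    return duration_us * tck_mhz  # 1 MHz corresponds to 1 cycle per microsecond

# 22.6 ms = 22 600 microseconds at a 20 MHz test clock.
cycles_per_clb = tck_cycles(22_600, 20)
```

This gives 452 000 TCK cycles per relocated CLB.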
This method is also effective when dealing with
asynchronous circuits, where transparent data latches are
used instead of FFs. In this case, the clock enable signal
is replaced in the latch by the input control signal, with the
value present at the input of the latch being stored when the
control signal changes from ‘1’ to ‘0’. The same auxiliary
relocation circuit is used and the same relocation sequence
is followed.
In the Virtex family of FPGAs, LUTs (Look-Up
Tables) can be configured as Distributed RAMs.
However, it is not feasible to extend this on-line relocation
concept to the relocation of those LUT/RAMs. The
content of the LUT/RAMs could be read and written
through the configuration memory, but the system would
have to be stopped to ensure data coherency, in the case of
a writing attempt during the relocation interval, as stated
in [12]. Furthermore, since frames span an entire column
of CLB slices, the same LUT bit in all of them is updated
with a single write command. It must be ensured that all
the remaining data in the slice is either constant, or it is
also modified externally through partial reconfiguration.
Even when they are not being relocated, LUT/RAMs should
not lie in any column that could be affected by the
relocation procedure.
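The placement constraint on LUT/RAMs amounts to a simple set check, sketched below. The function name and the column numbers are hypothetical; the set of affected columns is assumed to be known from the frames that the relocation rewrites.

```python
def lutram_conflicts(affected_columns, lutram_columns):
    """Columns that hold LUT/RAMs and would be rewritten by a relocation."""
    return sorted(set(affected_columns) & set(lutram_columns))

# A relocation rewriting columns 1 to 3 while LUT/RAMs sit in columns 3 and 7.
conflicts = lutram_conflicts({1, 2, 3}, {3, 7})
```

A non-empty result means that the relocation, as planned, would rewrite frames of a column holding LUT/RAM contents and should be routed or placed differently.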
3. Rearranging routing resources
Due to the scarcity of routing resources, it might also
be necessary to perform a rearrangement of the existing
interconnections, to optimise the occupancy of such
resources, after the relocation of one or more CLBs, and to
increase the availability of routing paths to incoming
functions. The relocation of routing resources does not
pose any special problems, since the same two-phase
relocation procedure is effective on the relocation of local
and global interconnections. The interconnections
involved are first duplicated in order to establish an
alternative path, and then disconnected, becoming
available to be reused, as illustrated in figure 5.
[Figure 5: CLB array showing the original path and the replica path between CLB1 and CLB2.]
Fig. 5. Relocation of routing resources
A last remark must be made about the relocation of
routing resources. Since different paths are used while
paralleling the original and replica interconnections, each
of them will have a different propagation delay. This
means that if the signal level at the output of the CLB
source changes, the signal at the input of the CLB
destination will show an interval of fuzziness, as shown in
figure 6. However, the impedance of the routing switches
will limit the current flow in the interconnection, and
hence this behaviour does not damage the FPGA.
Nevertheless, for transient analysis, the propagation
delay associated with the parallel interconnections shall be
that of the longer of the two paths.
[Figure 6: voltage (V) versus time at the CLB1 output and at the CLB2 input, showing signal propagation through the original path and through the replica path.]
Fig. 6. Propagation delay during the relocation of
routing resources
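The timing rule for paralleled interconnections can be written down directly. The sketch below assumes that the two path delays are known; the function name and the delay values are hypothetical.

```python
def parallel_path_timing(original_delay_ns: float, replica_delay_ns: float):
    """Return (fuzziness_interval_ns, worst_case_delay_ns) for two paralleled paths."""
    # The destination input is undefined between the arrival of the
    # faster and of the slower copy of the transition.
    fuzziness = abs(original_delay_ns - replica_delay_ns)
    # For transient analysis, the longer path sets the propagation delay.
    worst_case = max(original_delay_ns, replica_delay_ns)
    return fuzziness, worst_case

# Hypothetical delays: 2.0 ns through the original path, 3.5 ns through the replica.
fuzziness, worst_case = parallel_path_timing(2.0, 3.5)
```

With these example values the interval of fuzziness lasts 1.5 ns and the delay to be used in the analysis is 3.5 ns.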
The dynamic relocation of CLBs and interconnections
should have a minimum influence (preferably none) on the
system operation, as well as reduced overhead in terms of
reconfiguration cost. This cost depends on the number of
reconfiguration frames needed to relocate each CLB, since
a larger number of frames implies a longer
rearrangement time. The impact of the relocation
procedure in those functions currently running is mainly
related to the delays imposed by rerouted paths, since the
relocation procedure might imply a longer path, therefore
decreasing the maximum frequency of operation.
The placement algorithms (in an attempt to reduce path
delays) gather in the same area the logic that is needed to
implement the components of a given function. It is
unwise to disperse it, since it would generate longer paths
(and hence, an increase in path delays). On the other hand,
it would also put too much stress upon the limited routing
resources. Therefore, the relocation of the CLBs should be
performed to nearby CLBs. If necessary, the relocation of
a complete function may take place in several stages, to
avoid an excessive increase in path delays during the
relocation interval.
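One way to picture relocation in several stages is to bound the distance covered by each stage. The sketch below is an assumption-laden illustration, not the procedure used by the tool: it measures distance as Manhattan distance in CLB coordinates, moves along rows before columns, and ignores whether the intermediate CLBs are actually free. The name `staged_path` and the parameter `max_hop` are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of a CLB

def staged_path(src: Coord, dst: Coord, max_hop: int) -> List[Coord]:
    """Positions visited when moving from `src` to `dst` in hops of bounded length."""
    if max_hop < 1:
        raise ValueError("max_hop must be at least 1")
    row, col = src
    path = [src]
    while (row, col) != dst:
        budget = max_hop
        # Spend the hop budget on rows first, then on columns.
        step_r = max(-budget, min(budget, dst[0] - row))
        row += step_r
        budget -= abs(step_r)
        step_c = max(-budget, min(budget, dst[1] - col))
        col += step_c
        path.append((row, col))
    return path

# Moving a CLB from (0, 0) to (3, 5) with hops of at most two CLBs.
stages = staged_path((0, 0), (3, 5), 2)
```

For this example the CLB passes through (2, 0), (3, 1) and (3, 3) before reaching (3, 5), so no single stage lengthens a path by more than two CLBs.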
4. The FPGA rearrangement and
programming tool
To support the implementation of this management
process, a software tool was developed, based on the JBits
software, a set of Java classes that provide an
Application Programming Interface (API) to access the
Xilinx FPGA bitstream [13]. This tool is responsible for
the creation of the partial configuration files and carries
out the partial and dynamic reconfiguration of the FPGA
through the Boundary Scan interface. Since the partial
configuration files that implement the rearrangements
defined by the relocation procedure are generated
automatically (without designer intervention), the usage of
this tool becomes very straightforward. The input
information may be provided in the form of a complete
configuration file (generated by the traditional
development tool with a new placement for the functions
currently running or those that are about to be
implemented) or by providing the co-ordinates - source
and destination - of the CLB to be relocated. The tool
implements a series of algorithms based on artificial
intelligence techniques that manage the routing of the
signals coming in and out of the CLBs that are relocated,
optimising the usage of the routing resources. The
program always keeps a complete copy of the current
configuration, enabling system recovery in case of failure.
The user interface is shown in figure 7.
Fig. 7. Interface of the FPGA Rearrangement and
Programming tool
5. Conclusion
A novel relocation procedure to perform the dynamic
relocation of CLBs, without halting their operation, was
presented in this paper. The proposed procedure enables
the implementation of truly on-line management of the
FPGA logic space, supporting the rearrangement of
running functions, releasing enough contiguous space for
the configuration of new incoming functions, and
performing defragmentation. Therefore, on-line dynamic
scheduling of tasks in the spatial and temporal domains
becomes possible, enabling the implementation of the
virtual hardware concept [1]. Several applications may
share the same hardware platform, with their respective
functions running and being swapped in and out of the
FPGA, without generating any time overhead to the
running applications, or disturbing their operation. The
software application that was developed to support the
implementation of the relocation procedure enables the
complete automation of the whole process and an
optimised management of the available resources. Further
work is under way to increase the functionality and
flexibility of this tool.
References
[1] X.-P. Ling, H. Amano, “WASMII: a Data Driven Computer
on a Virtual Hardware”, Proc. 1st IEEE Workshop on
FPGAs for Custom Computing Machines, 1993, pp. 33-42.
[2] J. M. P. Cardoso, H. C. Neto, “An Enhanced Static-List
Scheduling Algorithm for Temporal Partitioning onto
RPUs”, Proc. 10th Intl. Conf. on VLSI, 1999, pp. 485-496.
[3] R. Maestre, F. J. Kurdahi, R. Hermida, N. Bagherzadeh, H.
Singh, “A Formal Approach to Context Scheduling for
Multicontext Reconfigurable Architectures”, IEEE Trans.
on VLSI Systems, Vol. 9, No. 1, Feb. 2001, pp. 173-185.
[4] M. Sanchez-Elez, M. Fernandez, R. Maestre, R. Hermida,
N. Bagherzadeh, F. J. Kurdahi, “A Complete Data
Scheduler for Multi-Context Reconfigurable Architectures”,
Proc. Design, Automation and Test in Europe, 2002,
pp. 547-552.
[5] O. Diessel, H. El Gindy, M. Middendorf, H. Schmeck, B.
Schmidt, “Dynamic scheduling of tasks on partially
reconfigurable FPGAs”, IEE Proc.-Computers and Digital
Techniques, Vol. 147, No. 3, May 2000, pp. 181-188.
[6] M. Vasilko, “DYNASTY: A Temporal Floorplanning Based
CAD Framework for Dynamically Reconfigurable Logic
Systems”, Proc. 9th Intl. Workshop on Field-Programmable
Logic and Applications, 1999, pp. 124-133.
[7] M. Teich, S. Fekete, J. Schepers, “Compile-time
optimization of dynamic hardware reconfigurations”, Proc.
Intl. Conf. on Parallel and Distributed Processing
Techniques and Applications, 1999, pp. 1097-1103.
[8] M. G. Gericota, G. R. Alves, M. L. Silva, J. M. Ferreira,
“Active Replication: Towards a Truly SRAM-based FPGA
On-Line Concurrent Testing”, Proc. 8th IEEE Intl. On-Line
Testing Workshop, 2002, pp. 165-169.
[9] M. G. Gericota, G. R. Alves, M. L. Silva, J. M. Ferreira,
“On-line Defragmentation for Run-Time Partially
Reconfigurable FPGAs”, Proc. 12th Intl. Conf. on Field
Programmable Logic and Applications, 2002, pp. 302-311.
[10] Politecnico di Torino ITC’99 benchmarks.
http://www.cad.polito.it/tools/itc99.html
[11] IEEE Std. Test Access Port and Boundary Scan Architecture
(IEEE Std 1149.1), IEEE Std. Board, May 1990.
[12] W. Huang, E. J. McCluskey, “A Memory Coherence
Technique for Online Transient Error Recovery of FPGA
Configurations”, Proc. 9th ACM Intl. Symposium on
Field-Programmable Gate Arrays, 2001, pp. 183-192.
[13] S. A. Guccione, D. Levi, P. Sundararajan, “JBits: Java based
interface for reconfigurable computing”, Proc. 2nd Military
and Aerospace Appl. of Prog. Devices and Technologies
Conf., 1999.
979
... 15: Far-field measurements with a wide-band antenna. ...
... This is utilized in software defined radios, and is becoming utilized in image processing algorithms[12][13].Many labs have attempted to perform non-traditional projects with Xilinx FPGAs. Some of these projects include ReCoBus at University of Erlangen-Nuremberg, and the RTR work at the University of Oporto[14][15]. To accomplish these projects these labs have had to write parsers and data structures for theXilinx intermediate files. These efforts are redundant and are limiting to the research potential of these labs. ...
... Two methods proposed by the Xilinx FPGA vendors for partially reconfiguration, namely Module-Based [15] and Difference-Based. The difference-based design could reference in famous Jbits [14], and the well-known PARBIT tool [12] and their placer [10]. ...
... the vertical axis means the column number of our reconfigurable device, while horizontal axis shows the system time (St) . In the illustration example, hardware task T1 occupies device's column [13,16], task T2 occupies device's column [1,8] when St = 0. Task T3 is scheduled to column [10,12] at simulated system time 2. When current time C(t) is at time 0, system starts to execute the tasks T1 and T2, so T1 and T2 are named executing tasks. The tasks have been scheduled but not been executed are named scheduled tasks, such as the tasks T3, T4, and T5 are scheduled tasks. ...
... Optimal solutions using a branch-and-bound technique [105] and heuristic approaches [106,107,108] for defragmentation are studied. In [109,110,111], implementation techniques for task relocation are presented. The idea is to replicate the corresponding configurable elements and routing resources. ...
... A relevant problem affects the case, where it is not possible to identify such a region onto the target architecture. Algorithms that deal with this problem perform re-arrangement of configured hardware resources [12] [13]. Unfortunately, these re-allocation algorithms are applicable almost exclusively at design-time (off-line), due to limitations such as the increased computational effort and the data hazard problems during the transfer of applications functionalities. ...
Conference Paper
Field programmable Gate Arrays (FPGAs) promise a low power flexible alternative for today's market heterogeneous systems. In order to be widely accepted, novel solutions and approaches are required for fast and flexible application implementation. In this paper we propose a methodology, as well as the supporting toolflow targeting to provide fast implementation of multiple applications onto heterogeneous FPGAs through the use of virtual kernels. Experimental results prove the efficiency of the introduced solution, as we achieve application mapping 30x faster on average compared to a state-of-art approach, with negligible degradation in the quality of the mapping. Additionally, we enable the mapping of multiple applications onto a single FPGA with only a small penalty of 4.7% in the maximum operation frequency of those applications compared with our reference solution.
... Apart from the increased flexibility imposed by these solutions, they introduce mentionable overheads in terms of execution run-time (for performing placement of configuration data over the reconfigurable architecture), as well as fragmentation of target FPGA. A methodology targeting to fuse portions of the unused area, resulting into larger contiguous area is discussed in [10], whereas [1] presents an algorithm for alleviating the problems posed by the consecutive reconfiguration of the same logic. An algorithm for predicting possible locations of the maximal empty device areas on a partially reconfigurable FPGA can be found in [6]. ...
Conference Paper
Partial reconfiguration can deliver virtually unlimited hardware resources, since it enables dynamic allocation and de-allocation of tasks onto a reconfigurable architecture while the remaining tasks continue to operate. However, in order to benefit from this flexibility, partial reconfiguration has to be appropriately applied. Among other issues, the placement of partial configuration data is critical, since it affects the fragmentation of hardware resources. In this paper we introduce a novel methodology for supporting partial reconfiguration with the usage of a Just-in-Time (JIT) Compilation framework. Experimental results with a number of benchmarks showed that the introduced solution performs application P&R 7.34× faster, as compared to the state-of-the-art tools, while it also leads to significantly lower fragmentation of hardware resources.
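To make the fragmentation issue mentioned above concrete, the following is a minimal sketch of one way to measure how much contiguous free logic space a device still offers. It is not the method of any paper listed here. It assumes the FPGA is modelled as a rectangular grid of cells, where `True` marks an occupied cell, and it finds the largest free rectangle with the standard histogram-and-stack technique. The function name `largest_free_rectangle` and the grid encoding are illustrative assumptions.

```python
from typing import List, Tuple


def largest_free_rectangle(occupied: List[List[bool]]) -> Tuple[int, int, int, int, int]:
    """Return (area, top_row, left_col, height, width) of the largest free rectangle.

    `occupied[r][c]` is True when the cell at row r, column c is in use.
    This grid model is an assumption made for illustration only.
    """
    if not occupied or not occupied[0]:
        return (0, 0, 0, 0, 0)

    cols = len(occupied[0])
    heights = [0] * cols          # free cells stacked upward from the current row
    best = (0, 0, 0, 0, 0)

    for r, row in enumerate(occupied):
        # Update the histogram of consecutive free cells ending at row r.
        for c in range(cols):
            heights[c] = 0 if row[c] else heights[c] + 1

        # Largest rectangle in the histogram, using a stack of column indices
        # whose heights are strictly increasing.
        stack: List[int] = []
        for c in range(cols + 1):
            h = heights[c] if c < cols else 0   # sentinel flushes the stack
            while stack and heights[stack[-1]] >= h:
                top = stack.pop()
                height = heights[top]
                left = stack[-1] + 1 if stack else 0
                width = c - left
                area = height * width
                if area > best[0]:
                    best = (area, r - height + 1, left, height, width)
            stack.append(c)

    return best


if __name__ == "__main__":
    grid = [
        [True,  False, False, False],
        [False, False, False, True],
        [False, False, False, False],
    ]
    print(largest_free_rectangle(grid))
```

An incoming function that needs more cells than the reported area cannot be placed contiguously, even if the total number of free cells is sufficient. That is the situation in which a rearrangement of the running functions becomes necessary.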
Chapter
Protecting the rights of Intellectual Property (IP) owners is extremely important to the expansion of the core-based design market. Currently, IP providers have no mechanism to guarantee the protection of their IP against over-deployment. We propose a system to guarantee that IP cores can only be deployed into devices agreed upon between the IP provider and the customer. The system is based on secured handshaking with encrypted device and design authentication information. It consists of hardware-supported design encryption and secured authentication protocols. It uses a combination of secret and public-key cryptographic functions devised for an uncomplicated trustable design exchange scenario. The public-key functions use modular squaring (Rabin Lock) on the FPGA chip instead of exponentiation to reduce the hardware complexity. The system limits the parties involved in the transaction to the IP provider and the customer only.
Article
Dynamic Partial Reconfiguration (DPaR) enables efficient allocation of logic resources by adding new functionalities or by sharing and/or multiplexing resources over time. Placement and routing (P&R) is one of the most time-consuming steps in the DPaR flow. P&R are two independent NP-complete problems, and, even for medium size circuits, traditional P&R algorithms are not capable of placing and routing hardware modules at runtime. We propose a novel runtime P&R algorithm for Field-Programmable Gate Array (FPGA)-based designs. Our algorithm models the FPGA as an implicit graph with a direct correspondence to the target FPGA. The P&R is performed as a graph mapping problem by exploring the node locality during a depth-first traversal. We perform the P&R using a greedy heuristic that executes in polynomial time. Unlike state-of-the-art algorithms, our approach does not try similar solutions, thus allowing the P&R to execute in milliseconds. Our algorithm is also suitable for P&R in fragmented regions. We generate results for a manufacturer-independent virtual FPGA. Compared with the most popular P&R tool running the same benchmark suite, our algorithm is up to three orders of magnitude faster.
Article
FPGA Dynamic Partial Reconfiguration (DPR or PR) technology has emerged and become gradually mature in the recent years. It provides the Time-Division Multiplexing (TDM) capability in utilizing on-chip resources and leads to significant benefits in comparison with conventional static designs. However, the partially reconfigurable design process features additional complexity and technical requirements to the FPGA developers. Hence, PR design approaches are being widely explored and investigated to systematize the development methodology and ease the designers. In this paper, the authors collect several research and engineering projects in this area and present a survey of the design methodology and applications of PR. Research aspects are discussed in various hardware/software layers.
Conference Paper
Dynamic partial reconfiguration enables efficient use of hardware resources by multiplexing system functionality in time. However, many challenges arise from partial reconfiguration implementation. The placement and routing (P&R) of the hardware modules is a computationally intensive task, and the state-of-art algorithms are not suitable to place and route modules at run-time. This paper makes several contributions: (1) Single Placement at run-time: we propose a novel P&R algorithm based on greedy heuristic where a single placement is performed at run-time in few milliseconds. (2) Implicit Graph Model: the FPGA is modelled as an implicit graph with a direct correspondence to the physical FPGA, and the P&R is performed as a graph mapping problem by exploring the node locality during the depth-first traversal. (3) Polynomial Placement: we show that even a single placement can be routed without critical path degradation. (4) Fragmented Regions: the graph approach is flexible, and it allows efficient placement even onto fragmented FPGA areas. Compared with the most popular P&R tool running the same benchmark suite our algorithm is on average 864x faster. Moreover, the bitstream for partial reconfiguration is also reduced by a factor of 4.
Book
From the Preface: "IEEE Std 1149.1 was developed for your use. As more engineers and more firms use it, it will become more valuable. The more expertise in ATEs, circuit design, catalog ICs, application-specific ICs (ASICs), etc. that is developed collectively, the more we all can benefit from reuse of generic solutions to common technological problems. As Harry Bleeker did in his foreword to this book, we urge you to use the standard, we urge you to participate in its further evolution, and we urge you to do so in the superbly constructive and cooperative spirit that has infused JTAG. "Guard the mysteries. "Constantly reveal them." --Lew Welch, "Course College Courses: Religion"
Conference Paper
The reusing of the same hardware resources to implement speed-critical algorithms, without interrupting system operation, is one of the main reasons for the increasing use of reconfigurable computing platforms, employing complex SRAM-based FPGAs. However, new semiconductor manufacturing technologies increase the probability of lifetime operation failures, requiring new on-line testing / fault-tolerance methods able to improve the dependability of the systems where they are included. The Active Replication technique presented in this paper consists of a set of procedures that enables the implementation of a truly non-intrusive structural on-line concurrent testing approach, detecting and avoiding permanent faults and correcting errors due to transient faults. In relation to a previous technique proposed by the authors as part of the DRAFT FPGA concurrent test methodology, the Active Replication technique extends the range of circuits that can be replicated, by introducing a novel method with very low silicon overhead.
Conference Paper
A new technique is presented in this paper to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve the applications execution time minimizing external memory transfers. Some amount of on-chip data storage is assumed to be available in the reconfigurable architecture. Therefore the Complete Data Scheduler tries to optimally exploit this storage, saving data and result transfers between on-chip and external memories. In order to do this, specific algorithms for data placement and replacement have been designed. We also show that a suitable data scheduling could decrease the number of transfers required to implement the dynamic reconfiguration of the system.
Conference Paper
This paper presents a novel algorithm for temporal partitioning of graphs representing a behavioral description. The algorithm is based on an extension of the traditional static-list scheduling that tailors it to resolve both scheduling and temporal partitioning. The nodes to be mapped into a partition are selected based on a statically computed cost model. The cost for each node integrates communication effects, the critical path length, and the possibility of the critical path to hide the delay of parallel nodes. In order to alleviate the runtime there is no dynamic update of the costs. A comparison of the algorithm to other schedulers and with close-to-optimum results obtained with a simulated annealing approach is shown. The presented algorithm has been implemented and the results show that it is robust, effective, and efficient, and when compared to other methods finds very good results in small amounts of CPU time.
Conference Paper
The partial reconfiguration feature of some of the current-generation Field Programmable Gate Arrays (FPGAs) can improve dependability by detecting and correcting errors in on-chip configuration data. Such an error recovery process can be executed online with minimal interference of user applications. However, because Look-up Tables (LUTs) in Configurable Logic Blocks (CLBs) of FPGAs can also implement memory modules for user applications, a memory coherence issue arises such that memory contents in user applications may be altered by the online configuration data recovery process. In this paper, we investigate this memory coherence problem and propose a memory coherence technique that does not impose extra constraints on the placement of memory-configured LUTs. Theoretical analyses and simulation results show that the proposed technique guarantees the memory coherence with a very small (on the order of 0.1%) execution time overhead in user applications.
Conference Paper
Recent generations of Field Programmable Gate Arrays (FPGA) allow the dynamic reconfiguration of cells on the chip during run-time. For a given problem consisting of a set of tasks with computation requirements modeled by rectangles of cells, several optimization problems such as finding the array of minimal size to accomplish the tasks within a given time limit are considered. Existing approaches based on ILP formulations to solve these problems being related to multi-dimensional packing...
Conference Paper
This paper presents DYNASTY, a new CAD framework aimed at supporting research of design techniques, algorithms and methodologies for dynamically reconfigurable logic (DRL) systems. The design flow implemented in the DYNASTY Framework is based around a temporal floorplanning (TF) DRL design abstraction, which allows simultaneous DRL design space exploration in spatial and temporal dimensions. The paper introduces temporal floorplanning and its implementation in the DYNASTY Framework. Methodologies based on temporal floorplanning promise reduction of design time and elimination of costly design iterations present in traditional DRL design methodologies.
Conference Paper
Virtual hardware is a technique to realize a large digital circuit with a small amount of real hardware by using an extended Field Programmable Gate Array (FPGA) technology. Several configuration RAM modules are provided inside the FPGA chip, and the configuration of the gate array can be rapidly changed by replacing the active module. Data for configuration are transferred from an off-chip backup RAM to an unused configuration RAM module. A novel computation mechanism called the WASMII, which executes a target data flow graph directly, is proposed on the basis of the virtual hardware. A WASMII chip consists of the FPGA for virtual hardware and the additional mechanism to replace configuration RAM modules in a data-driven manner. Configuration data are preloaded in the order assigned in advance by a static scheduling preprocessor. By connecting a number of WASMII chips, a highly parallel system can be easily constructed.