Figure 2.2: A block diagram showing the memory hierarchy of a computer. The reconfigurable fabric (RF) can be loosely coupled, tightly coupled, or attached as a co-processor. Source: Reconfigurable Computing [19].

Source publication
Conference Paper
Full-text available
In this paper we describe a new generic approach for accelerating software functions using a reconfigurable device connected through a high-speed link to a general-purpose system. As opposed to related ISA-extension approaches, we insert system calls into the original program to control the reconfigurable accelerator. The reconfigurable devic...
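As a rough picture of the system-call-based control flow described in the abstract, the sketch below drives a reconfigurable accelerator exposed as a character device. The device path, ioctl request codes, and job descriptor are hypothetical stand-ins, not the interface of the paper's actual driver.

```c
/* Minimal sketch (hypothetical device path and ioctl codes): the host program
 * delegates a function to the reconfigurable accelerator via system calls
 * instead of ISA extensions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define ACC_IOCTL_LOAD  0x1001  /* hypothetical: select/configure the accelerator   */
#define ACC_IOCTL_START 0x1002  /* hypothetical: start execution of the loaded job  */
#define ACC_IOCTL_WAIT  0x1003  /* hypothetical: block until the accelerator is done */

struct acc_job {                /* hypothetical job descriptor passed to the driver */
    const void *input;
    void       *output;
    size_t      length;
};

int main(void)
{
    int fd = open("/dev/reconf_acc0", O_RDWR);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    unsigned int in[256] = {0}, out[256] = {0};
    struct acc_job job = { .input = in, .output = out, .length = sizeof in };

    /* The compiler-inserted system calls would correspond to steps like these: */
    if (ioctl(fd, ACC_IOCTL_LOAD, "my_kernel") < 0 ||   /* configure */
        ioctl(fd, ACC_IOCTL_START, &job) < 0 ||         /* run       */
        ioctl(fd, ACC_IOCTL_WAIT, NULL) < 0)            /* synchronize */
        perror("ioctl");

    close(fd);
    return 0;
}
```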

Contexts in source publication

Context 1
... Section 2.2 we discuss how a reconfigurable fabric can be integrated into a traditional computing system. Subsequently, in Section 2.3 we discuss high performance computing with reconfigurable acceleration, and take a more ... Figure 2.1: An example of how computational blocks can be connected in a reconfigurable fabric. ...
Context 2
... reconfigurable fabric is one of the basic requirements for reconfigurable computing. The reconfigurable fabric is a specially designed chip consisting of computing elements and an interconnect, such as in Figure 2.1. The grid formation shown in the figure is not necessarily the way these devices are implemented, but the underlying principles are the same. ...
Context 3
... general purpose computing with reconfigurable acceleration, the reconfigurable fabric must somehow be connected to an existing host processor. Figure 2.2 shows the different levels at which the fabric can be connected in the memory hierarchy. ...
Context 4
... Altix provides support for reconfigurable computing in the form of its Reconfigurable Application Specific Computing (RASC) program [20]. In RASC, an FPGA (Figure 2.3) is connected to the SGI NUMAlink [20] interconnect as a co-processor. NUMAlink is a high-bandwidth, low-latency interconnect which is used to connect processors, memory, and other components in Altix machines. ...
Context 5
... Core Services block provides the interface between the Algorithmic block and the host system, and to do so implements the following features: it implements the Scalable System Port (SSP), which allows communication over the NUMAlink; provides read and write access to the SRAMs from both the host system and the Algorithmic block; allows single- and multi-step debugging of the Algorithmic block; and provides access to the algorithm's debug port and registers. Figure 2.4 shows that a device driver provides access to the FPGA's Core Services from software through system calls. ...
Context 6
... Convey architecture (Figure 2.5) consists of off-the-shelf Intel processors in combination with a reconfigurable co-processor. ...
Context 7
... the co-processor shares a cache-coherent view of the global virtual memory with the host processor. The co-processor consists of three components (Figure 2.6): the Application Engine Hub (AEH), Memory Controllers (MCs), and the Application Engines (AEs). The AEH is responsible for the interface to the host processor and I/O chipset, and for fetching instructions. ...
Context 8
... how to make an architecture where applications can be easily ported to different hardware using the same platform without major redesign. The MOLEN architecture (Figure 2.7) consists of a GPP and a reconfigurable accelerator which can communicate through a set of registers called Exchange Registers (XREG). A program is executed on the GPP, with certain computationally intensive functions implemented as accelerators. ...
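To make the register-based communication concrete, the sketch below shows how a software function call could be mirrored on the accelerator through a small exchange-register file. The register map, the start/done bits, and the memory-mapped access are assumptions made for illustration; the actual MOLEN organization drives the accelerator through dedicated instructions (such as SET and EXECUTE) rather than through a C routine like this.

```c
/* Illustrative sketch only: parameter exchange through a memory-mapped XREG
 * file. The base address, register layout, and control bits are hypothetical. */
#include <stdint.h>

#define XREG_BASE   ((volatile uint32_t *)0x80000000u)  /* assumed XREG window */
#define XREG_ARG0   0   /* argument registers                                  */
#define XREG_ARG1   1
#define XREG_RESULT 2   /* result register written back by the accelerator     */
#define XREG_CTRL   3   /* control/status: bit0 = start, bit1 = done (assumed) */

uint32_t accel_add(uint32_t a, uint32_t b)
{
    volatile uint32_t *xreg = XREG_BASE;
    xreg[XREG_ARG0] = a;                 /* pass operands through the XREGs */
    xreg[XREG_ARG1] = b;
    xreg[XREG_CTRL] = 1u;                /* start the hardware function     */
    while ((xreg[XREG_CTRL] & 2u) == 0)  /* busy-wait for completion        */
        ;
    return xreg[XREG_RESULT];            /* fetch the result                */
}
```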

Citations

... In [12], a reconfigurable architecture is proposed which employs a single 512-entry TLB for address translation alongside a Direct Memory Access (DMA) unit. On a miss, the TLB is locked and the FPGA is interrupted. ...
Article
This paper presents the integration issues of a proposed run-time configurable Memory Management Unit (MMU) for the COFFEE processor developed by our group at Tampere University of Technology. The MMU consists of three Translation Lookaside Buffers (TLBs) in two levels of hierarchy. The MMU and its integration into the processor are prototyped on a Field Programmable Gate Array (FPGA) device. Furthermore, analytical results of scaling the second-level Unified TLB (UTLB) to three configurations (with 16, 32, and 64 entries) with respect to the effect on overall hit rate as well as the energy consumption are shown. The critical path analysis of the logical design running on the target FPGA is presented together with a description of optimization techniques to improve static timing performance, which yields a 22.75% speed-up. We reached our target operating frequency of 200 MHz for the 64-entry UTLB and, thus, it is our preferred option. The 32-entry UTLB configuration provides a decent trade-off for resource-constrained or speed-critical hardware designs, while the 16-entry configuration shows unsatisfactory performance. Next, integration challenges and how to resolve each of them (such as employing a wrapper around the MMU, modifying the hardware description of the COFFEE core, etc.) are investigated in detail. This paper not only provides invaluable information regarding the implementation and integration of the MMU into a RISC processor, it also opens a new horizon for our processor to provide virtual memory for its operating system without degrading the operating frequency. This work is also intended as a general reference for future integrations into the COFFEE core as well as other similar processor architectures.
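As a rough illustration of a two-level lookup like the one described above, the following software model probes a first-level TLB and then the Unified TLB, refilling the first level on a second-level hit. The direct-mapped indexing, entry counts, and page size are simplifying assumptions, not details of the cited hardware design.

```c
/* Software model of a two-level TLB lookup (first-level TLB backed by a
 * Unified TLB). Direct-mapped indexing is assumed only to keep the sketch short. */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12        /* 4 KiB pages assumed            */
#define L1_ENTRIES   8         /* assumed first-level TLB size   */
#define UTLB_ENTRIES 64        /* the paper's preferred UTLB size */

typedef struct { uint32_t vpn, pfn; bool valid; } tlb_entry_t;

static tlb_entry_t l1[L1_ENTRIES];
static tlb_entry_t utlb[UTLB_ENTRIES];

/* Returns true on a hit and fills *paddr; false means both levels missed and
 * a miss handler (hardware or software walk) must refill the TLBs. */
bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t off = vaddr & ((1u << PAGE_SHIFT) - 1);

    tlb_entry_t *e = &l1[vpn % L1_ENTRIES];              /* level 1 lookup      */
    if (!(e->valid && e->vpn == vpn)) {
        e = &utlb[vpn % UTLB_ENTRIES];                   /* level 2: UTLB       */
        if (!(e->valid && e->vpn == vpn))
            return false;                                /* overall miss        */
        l1[vpn % L1_ENTRIES] = *e;                       /* promote to level 1  */
    }
    *paddr = (e->pfn << PAGE_SHIFT) | off;
    return true;
}
```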
... In [9], a reconfigurable architecture is proposed which employs a single 512-entry TLB for address translation alongside a Direct Memory Access (DMA) unit. On a miss, the TLB is locked and the FPGA is interrupted. ...
... They take advantage of the large logic fabrics provided by today's FPGA boards, which can be connected easily to host machines using fast links. Among these projects, Brandon et al. [5] proposed a generic platform solution for reconfigurable acceleration in a general-purpose system. They also focused on techniques to integrate a reconfigurable device more efficiently into a traditional computer system, addressing issues related to memory accesses, the programming interface, and system-level support for high performance computing, but not for real-time requirements. ...
Conference Paper
Full-text available
Real-time computing systems are increasingly used in the aerospace and avionics industries. In the face of the power wall and real-time requirements, hardware designers are directed towards reconfigurable computing with the usage of heterogeneous CPU/FPGA systems. However, there is a lack of real-time environments able to deal with the execution of applications on such heterogeneous systems dedicated to avionic Test and Simulation (T&S). This research investigates the problem of soft real-time environments for CPU/FPGA systems and first proposes a high-performance hardware architecture used to implement intimately coupled hardware and software avionic models. Second, this paper presents the description of an efficient real-time software environment for the models' execution, multi-core CPU monitoring, and runtime task re-allocation to avoid timing constraint violations. Experimental results underpin the industrial relevance of the presented approach for avionic T&S systems with real-time support.
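The monitoring and re-allocation idea can be pictured with a small host-side loop: measure each model step against its timing budget and move the task to another core when the budget is close to being violated. The step function, the 1 ms budget, and the migration policy below are placeholders, not the environment described in the paper.

```c
/* Illustrative monitoring loop using standard POSIX/Linux calls: time each
 * model step and migrate the task when it approaches its (assumed) budget. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void model_step(void) { /* placeholder for one simulation step */ }

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    const double budget_ms = 1.0;     /* assumed per-cycle timing budget */
    int core = 0;

    for (int cycle = 0; cycle < 1000; cycle++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        model_step();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (elapsed_ms(t0, t1) > 0.8 * budget_ms) {       /* budget nearly violated */
            core = (core + 1) % sysconf(_SC_NPROCESSORS_ONLN);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            sched_setaffinity(0, sizeof set, &set);        /* re-allocate the task  */
            fprintf(stderr, "cycle %d: migrated to core %d\n", cycle, core);
        }
    }
    return 0;
}
```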
... As opposed to previous approaches which require ISA extensions to the host processor, such as Molen [14], Garp [4] and others, we describe an approach that requires only an FPGA device driver, a few compiler extensions and a hardware wrapper in the FPGA. We extend our previous work on reconfigurable acceleration for general purpose computing, presented in [15], by adding dynamic partial self-reconfiguration support. The current version of our system includes hardware instances of functions in the executable binary, which are installed and executed on demand. ...
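The on-demand installation described here can be sketched as a dispatch routine that configures a function's partial bitstream only when that function is not already present on the fabric. Every name in the sketch (the bitstream array and the two driver calls) is hypothetical and stubbed so the example compiles.

```c
/* Hedged sketch of on-demand installation of a hardware function instance:
 * configure the partial bitstream only if that function is not already on the
 * fabric, then dispatch to it. All names below are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

static const unsigned char fir_bitstream[] = {0}; /* stand-in for a bitstream embedded in the binary */

static bool hw_install(int id, const unsigned char *bits, size_t len)
{   /* hypothetical driver call, stubbed so the sketch compiles */
    (void)id; (void)bits; (void)len; return true;
}

static void hw_execute(int id, const void *in, void *out, size_t n)
{   /* hypothetical driver call, stubbed so the sketch compiles */
    (void)id; (void)in; (void)out; (void)n;
}

static int installed_id = -1;                      /* function currently configured on the fabric */

void call_hw_function(int id, const void *in, void *out, size_t n)
{
    if (installed_id != id) {                      /* hardware instance not yet installed */
        if (!hw_install(id, fir_bitstream, sizeof fir_bitstream))
            return;                                /* reconfiguration failed: bail out here */
        installed_id = id;                         /* partial self-reconfiguration done    */
    }
    hw_execute(id, in, out, n);                    /* execute the installed hardware instance */
}
```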
... They take advantage of the large logic fabrics provided by today's FPGA boards, which can be connected easily to host machines using fast links. Among these projects, Brandon et al. [7] proposed a generic platform solution for reconfigurable acceleration in a general-purpose system. They also focused on techniques to integrate a reconfigurable device more efficiently into a traditional computer system, addressing issues related to memory accesses, the programming interface, and system-level support. ...
Conference Paper
Full-text available
In the face of the power wall and high performance requirements, designers of hardware architectures are directed more and more towards reconfigurable computing with the usage of heterogeneous CPU/FPGA systems. In such architectures, multi-core processors provide high computation rates while the reconfigurable logic offers high performance per watt and adaptability to the application constraints. However, the design of heterogeneous architectures faces extremely challenging requirements such as the appropriate programming model, design tools, and rapid system prototyping. Addressing this issue, we present a prototyping environment for heterogeneous CPU/FPGA systems. Within this environment, we conceived a generic and scalable architecture based on a multi-core processor tightly connected to an FPGA in order to meet performance, power, and flexibility goals. Furthermore, front-end interfaces are presented in order to establish communication, data sharing, and synchronisation between the different software and hardware processing units. Finally, we defined a design methodology that eases the development of applications onto heterogeneous systems. Our environment is built using a standard host machine coupled with a Xilinx Virtex 6 FPGA through the PCI Express standard bus. In the experimental part, we first evaluate the reliability of different CPU/FPGA communication solutions in order to bring real-time capabilities to our system. Second, we demonstrate the efficiency of the presented design methodology for heterogeneous systems through the FIR signal processing application.
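The FIR application used in the evaluation is a convenient kernel to picture as the offload candidate in such a CPU/FPGA system; a plain software reference version is given below, with the tap count and data types chosen arbitrarily.

```c
/* Reference FIR filter as it might look before being offloaded to the FPGA
 * side of a CPU/FPGA system; tap count and types are arbitrary choices. */
#include <stddef.h>

#define NTAPS 16

/* y[n] = sum over k of h[k] * x[n-k], treating x[n-k] as 0 for n-k < 0. */
void fir(const float *x, float *y, size_t n, const float h[NTAPS])
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < NTAPS && k <= i; k++)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```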
... Kelm et al. [22] used a model based on local input/output buffers on the accelerator with DMA support to access external memory. Brandon et al. [23] propose a platform-independent approach by managing the virtual address space inside their accelerator. Several commercially available machines, like the SGI Altix-4700 [24] or the Convey HC-1 [25], propose system-level models to accelerate application kernels using FPGAs. ...
Conference Paper
Full-text available
In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built using reconfigurable fabric, such as FPGAs, have a tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes the portability and reusability across designs difficult. In addition, generation of highly customized circuits does not integrate nicely with high level synthesis tools. In this work, we introduce TARCAD, a template architecture to design reconfigurable accelerators. TARCAD enables high customization in the data management and compute engines while retaining a programming model based on generic programming principles. The template provides generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels: MxM, Acoustic Wave Equation, FFT, SpMV and Smith Waterman. TARCAD is compared with other High Level Synthesis models and is evaluated against GPUs, a well-known architecture that is far less customizable and, therefore, also easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.
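Of the kernels listed, SpMV is the easiest to show compactly; the CSR-based routine below only indicates the kind of computation mapped onto such an accelerator template and is not TARCAD's implementation.

```c
/* Sparse matrix-vector multiply in CSR form, shown as a plain software kernel
 * of the kind mapped onto reconfigurable accelerator templates. */
#include <stddef.h>

/* y = A * x, with A stored as row_ptr/col_idx/val in CSR layout. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```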
... Another approach [BrSG10] that achieves full virtual memory integration attaches the HA to a host PC via AMD HyperTransport [Hype05]. To this end, the HA is co-located on a Virtex-4 100 FPGA with an open-source HTX interface core [SGNB08] and a separate TLB for virtual-to-physical address translation. ...
Article
We developed a point-to-point, low-latency, 3D torus Network Controller integrated in an FPGA-based PCIe board which implements a Remote Direct Memory Access (RDMA) communication protocol. RDMA requires the ability to directly access the remote node's application memory with minimal OS or CPU intervention. For this purpose, a key element is the design of a direct memory writing mechanism to address the destination buffers; on OSes supporting virtual memory this corresponds to a number of page-segmented DMAs. To minimally affect overall performance, mechanisms with the lowest possible latency are needed for both virtual-to-physical address translation and registered-buffer list scanning. In a first implementation these tasks ran on a soft-core on the FPGA, leading to a 1.6 µs latency to process a single packet and limiting the peak bandwidth. As a second trial, we present an accelerated version of these time-critical network functions exploiting an application-specific processor (ASIP) designed using a retargetable ASIP development toolsuite that allows architectural exploration. Benchmark results for the Buffer Search and Virtual-to-Physical tasks on the ASIP show up to ten times lower cycle cost compared with the soft-core.
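The two time-critical functions named above, buffer search and virtual-to-physical translation, amount to locating the registered buffer that contains a destination virtual address and splitting the transfer into per-page DMA segments. The following software model illustrates both; the buffer and descriptor layouts are invented for the sketch.

```c
/* Simplified model of the two time-critical RDMA functions: find the
 * registered buffer containing a destination virtual address, then emit one
 * DMA descriptor per page crossed. The descriptor layout is invented. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

typedef struct {                /* one registered (pinned) buffer       */
    uint64_t vaddr;             /* virtual base address                 */
    size_t   length;
    const uint64_t *pfn;        /* physical frame number for each page  */
} reg_buf_t;

typedef struct { uint64_t phys; size_t len; } dma_desc_t;  /* invented descriptor */

/* Linear scan of the registered-buffer list (the part the ASIP accelerates). */
const reg_buf_t *buffer_search(const reg_buf_t *bufs, size_t n, uint64_t va)
{
    for (size_t i = 0; i < n; i++)
        if (va >= bufs[i].vaddr && va < bufs[i].vaddr + bufs[i].length)
            return &bufs[i];
    return NULL;
}

/* Split a write of 'len' bytes at 'va' into page-segmented DMA descriptors. */
size_t build_dma(const reg_buf_t *b, uint64_t va, size_t len, dma_desc_t *out)
{
    size_t ndesc = 0;
    while (len > 0) {
        uint64_t off   = va - b->vaddr;
        uint64_t page  = off / PAGE_SIZE;
        uint64_t inpg  = off % PAGE_SIZE;
        size_t   chunk = PAGE_SIZE - inpg;
        if (chunk > len) chunk = len;
        out[ndesc].phys = b->pfn[page] * PAGE_SIZE + inpg;  /* virtual-to-physical */
        out[ndesc].len  = chunk;
        ndesc++;
        va  += chunk;
        len -= chunk;
    }
    return ndesc;
}
```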
Conference Paper
In the embedded era, reconfigurable components come in three forms of IP (Intellectual Property) cores: i) soft core, ii) firm core, and iii) hard core. This paper presents a new technique of embedding a multigrain parallel processing HPRC using an FPGA in the CPU/DSP unit of OR1200, a soft-core RISC processor. The core performance is increased by placing a multigrain parallel processing HPRC inside the Integer Execution Pipeline unit of the CPU/DSP core. Depending on the complexity/depth of the code, the dependency level of vertices (DL) is determined and a number of threads (N) is created to run the code in parallel in the HPRC. Multigrain parallel processing in the HPRC is achieved by two functions: i) HPRC_Parallel_Start to trigger the parallel threads, and ii) HPRC_Parallel_End to stop the threads. In the first phase of this paper a Verilog HDL functional code is developed and synthesised using Xilinx ISE, and in the second phase the CoreMark processor benchmark is used to test the performance of the reconfigured IP soft core.
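Since the abstract names HPRC_Parallel_Start and HPRC_Parallel_End but not their signatures, the sketch below assumes a minimal form (a thread-count argument and a void return) purely to show how application code would be bracketed by the two calls; the bodies are stubs so the example compiles.

```c
/* Assumed usage of the two bracketing calls named in the abstract; their true
 * signatures are not given in the text, so this form is a guess. */
void HPRC_Parallel_Start(int n_threads);   /* trigger N parallel threads in the HPRC */
void HPRC_Parallel_End(void);              /* stop the threads and rejoin            */

/* Stub bodies only so the sketch compiles; the real functions belong to the
 * HPRC hardware/runtime. */
void HPRC_Parallel_Start(int n_threads) { (void)n_threads; }
void HPRC_Parallel_End(void) {}

void process_block(int *data, int len)
{
    HPRC_Parallel_Start(4);                /* N derived from the code's dependency level */
    for (int i = 0; i < len; i++)          /* body intended to run in parallel in the HPRC */
        data[i] = data[i] * data[i];
    HPRC_Parallel_End();
}
```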