Dirk Koch's research while affiliated with Universität Heidelberg and other places


Publications (98)


Analysis of Process Variation Within Clock Regions of AMD-Xilinx UltraScale+ Devices
  • Chapter

March 2024 · 7 Reads · Dirk Koch

As semiconductor technology advances and transistor feature sizes shrink, the increasing significance of process variation poses critical challenges to the reliability of semiconductor devices. This paper thoroughly explores the impact of process variation within the Clock Regions (CRs) of AMD-Xilinx UltraScale+ devices. We employ a novel method to characterize process variation with significantly higher precision than conventional ring oscillator (RO)-based sensors. Our experimental findings on ZYNQ XCZU9EG reveal that the latency of resources during rising and falling transitions may differ. Additionally, the proximity of Interconnect (INT) tiles to various tile types can influence the latency of resources within a column in a given CR. Moreover, we demonstrate that specific segments within CRs consistently exhibit faster performance compared to other areas within the same CR.


Memory-Aware Scheduling for a Resource-Elastic FPGA Operating System

September 2023 · 4 Reads

The memory subsystem is often the main performance bottleneck in an FPGA acceleration system. This paper presents two memory-aware runtime schedulers that decide the order of running tasks to improve the system’s performance: memory model-aware (MMA) and memory access pattern-aware (MAPA) schedulers. The proposed approaches consider memory characteristics in scheduling decisions to alleviate the memory overhead and enhance the system’s performance. MMA considers the accessed memory regions when scheduling the tasks in a way that reduces the memory page miss rates. MAPA, on the other hand, alleviates the pressure on the memory subsystem by scheduling the tasks mainly based on their memory intensity and access patterns. The proposed runtime schedulers are implemented and evaluated on an Ultra96 FPGA board. On average, the presented approaches show approximately 10%, 22%, 12%, and 9% improvements in memory throughput, task execution time, makespan time, and job throughput, respectively, over an existing state-of-the-art memory-agnostic scheduler.




Figure 1: Bitstream frames order in Xilinx XC7V2000 FPGA. The bitstream starts at the white arrow on the left of the figure and ends at the white arrow on the right of the figure. Blue lines represent CLB frames (block type 0), while red lines represent BRAM contents frames (block type 1). Dashed lines represent "jumps" from the end of a clock region to the beginning of the next clock region. Clearly, the frame order is not linear for all devices.
Figure 2: A clock region (bitstream frame) height in Xilinx UltraScale+ consists of 60 CLBs (blue), 60 switch matrices (yellow), 12 BlockRAMs (red), or 12 DSPs (green). (a) presents the bitstream sizes in block type 0 (logic, clock, routing), while (b) presents the same piece of fabric's mapping in block type 1 (BlockRAM data). The resources' bitstream sizes are misproportioned and addressing individual frames requires careful offset calculations.
Figure 3: Bitstream coordinates in byteman are in a grid format as viewed in Xilinx Vivado device. Shown are coordinates for Xilinx XC7V2000 FPGA. Y addresses CLB elements (50 per clock region in Xilinx Virtex-7), while X addresses horizontally the bitstream-mapped resources of the device. The conversion to the actual bitstream coordinates (see Figure 1) is done automatically for the user.
byteman: A Bitstream Manipulation Framework
  • Conference Paper
  • Full-text available

November 2022 · 775 Reads · 3 Citations

From better resource pooling for FPGA cloud providers to building dynamic execution pipelines at runtime, the capabilities of partial reconfiguration (PR) are waiting to be fully explored. However, the community still fails to materialize PR at scale, and FPGAs are only used as updatable ASICs, thereby forgoing the opportunities offered by dynamically reconfiguring FPGAs at runtime. This work proposes a resourceful FPGA bitstream manipulation framework. The proposed tool provides means for parsing, modification, and generation of bitstream files, and it has been open-sourced and demonstrated in a working system. As a distinguishing feature, it supports multi-die FPGAs (among the 106 Xilinx 7 Series, UltraScale, and UltraScale+ devices), and enables datacenter FPGAs to be used for relocatable PR. The versatile tool's built-in (dis)assembler allows for manual bitstream manipulations. Bundled with an efficient bitstream manipulation core, the efficacy is demonstrated by two case studies where we observe 58-377x higher bitstream merging throughput than a current state-of-the-art tool.


Automated Generation and Orchestration of Stream Processing Pipelines on FPGAs

November 2022 · 202 Reads

FPGAs have demonstrated substantial performance and energy efficiency advantages for workloads that fit a stream processing model with direct module-to-module communications. However, when the dataflow processing system is required to adapt to runtime conditions, current static acceleration solutions are limited in how efficiently the FPGA can be utilized due to the inability to switch out idling modules. To better use FPGAs in dynamic scenarios, this paper proposes using partial reconfiguration to stitch together different physically implemented operator modules on-the-fly. Rather than using designated module slots, our system places all modules and routing wires into a shared region with more placement options to minimize fragmentation. Furthermore, we use a module library that provides different resource and performance trade-offs for faster execution while considering the configuration cost. Then our system finds the optimal set of modules while scheduling multiple acceleration requests and managing all constraints transparently to the end-user. We demonstrate that the overheads of the middleware are insignificant enough to form accelerators with end-to-end execution times equal to hand-crafted static systems with small datasets while being 7.2× faster when streaming large datasets. We exemplified our approach for database acceleration, where the whole operation is abstracted to execute SQL queries directly.



Fig. 1. Acceleration ecosystem reusing physically implemented modules. The number of FPGA regions, data routing, and module sizes are all flexible.
FPL Demo: Runtime Stream Processing with Resource-Elastic Pipelines on FPGAs

August 2022 · 181 Reads

FPGAs are efficient at dataflow applications, as demonstrated in various application domains, including machine learning, communication, and image processing. In this demo, we accelerate database management operations transparently to the user by stitching together partially reconfigurable stream processing modules that implement database operators. Our runtime system orchestrates this by building custom pipelines according to runtime conditions. This demo will showcase an acceleration of SQL queries using our dynamic stream processing system running on a ZCU102 FPGA board.




Citations (58)


... It also enables runtime bitstream manipulation by being 220 − 377× faster than related work. byteman has also been demonstrated in a database acceleration system that utilizes dynamic execution pipelines at runtime [40]. ...

Reference:

byteman: A Bitstream Manipulation Framework
FPL Demo: Runtime Stream Processing with Resource-Elastic Pipelines on FPGAs
  • Citing Conference Paper
  • August 2022

... There is a moderate amount of research on database acceleration focusing on analytics [20]. Similar to earlier works that accelerated joins [2], [21], there are tradeoffs when selecting between a sort and a hash-based solution [22], [23], each with separate adaptation challenges [24], [25]. For instance, even with a sorter-based pipeline, supporting arbitrary data in a join accelerator caused stalls and increased its memory requirements [21]. ...

Automated Generation and Orchestration of Stream Processing Pipelines on FPGAs
  • Citing Conference Paper
  • December 2022

... AMD/Xilinx's dynamic function exchange (DFX) introduces a technology for setting up PR areas within a static system, allowing users to allocate modules to these specific areas on FPGA fabrics [38]. However, the Vivado toolchain has several weaknesses, such as being too slow for real-time applications and the lack of support for bitstream relocation [18]. On the other hand, the opensource tools, such as Byteman [18], improved the efficiency and speed of performing PR from an embedded processor, making it suitable for real-time applications, such as countermeasures against side-channel attacks during runtime. ...

byteman: A Bitstream Manipulation Framework

... Academic studies on the virtualization of FPGA resources are abundant in the research community [7], [8], covering a wide range of FPGA types (including PCIe, network, and SoC) and methodologies. ...

The Future of FPGA Acceleration in Datacenters and the Cloud
  • Citing Article
  • February 2022

ACM Transactions on Reconfigurable Technology and Systems

Paul Chow · [...] · Russell Tessier

... In addition to the safety and security concerns raised by multi-tenant FPGAs, timely availability of resources to legitimate tenants is also of utmost importance. A malicious tenant can hide behind the facade of legitimacy, waiting to initiate DoS for requests generated by legitimate tenants or may try to damage the PDN of multi-tenant FPGAs in order to cause long-term damage [86,87]. In this section, we present a scenario in which a malicious tenant threatens the availability of resources to legitimate tenants, followed by a defense mechanism that can fend against such attempts. ...

Denial-of-Service on FPGA-based Cloud Infrastructures — Attack and Defense

IACR Transactions on Cryptographic Hardware and Embedded Systems

... Emerging memory technologies including resistive random access memory (ReRAM/RRAM) [1]- [4] provide potential solutions to the challenges faced by current memories due to their lower power consumption, scalability, non-volatility and high-speed operation. Apart from the typical use case as computer memory, such devices have other uses in applications such as in-memory computing [5], [6] and FPGAs [7], [8]. A common criteria of these applications is the use of large arrays with millions, billions or more of devices in a single chip [9]. ...

Memristor-based Pass Gate for FPGA Programmable Routing Switch
  • Citing Conference Paper
  • May 2021

... The virtual streams/channels are differentiated by StreamID and ChannelID signals of DSPI. ... Abstract: While FPGAs are becoming mainstream in the deployment of datacenters and cloud systems, they are mostly used as updatable ASICs. This thesis shows that it is feasible to achieve acceleration for runtime-only known problems using dynamically built stream processing pipelines if we efficiently exploit the given FPGA resources and utilize additional techniques such as resource elasticity. ...

Moving Compute towards Data in Heterogeneous multi-FPGA Clusters using Partial Reconfiguration and I/O Virtualisation
  • Citing Conference Paper
  • December 2020

... Field-Programmable System-on-Chips (FPSoC) achieves even tighter integration by integrating processors and FPGAs on the same chip, similar to Duet at a high level. Many commodity FPSoCs [26], [36], [37], [42], [55] and academic FPSoCs [28], [45], [53] support full or partial cache coherence. For example, the Xilinx Zynq-7000 employs the AXI4 ACP interface [3] which supports uni-directional cache coherence (I/O coherency). ...

FABulous: An Embedded FPGA Framework
  • Citing Conference Paper
  • February 2021

... module resource footprint variant: Implemented module for different FPGA resource footprint to maximize placement options. module stitching: Building an execution pipeline at runtime by placing module bitstreams. partial (re)configuration: Changing the loaded FPGA bitstream for a partial region of the FPGA. ...

Transparent Integration of a Dynamic FPGA Database Acceleration System
  • Citing Conference Paper
  • August 2020