Figure 2.2: The performance of a multiply-accumulate (MAC) operation changes depending on the target clock period. Assume the multiply operation takes 3 ns and the add operation takes 2 ns. Part a) has a clock period of 1 ns, so one MAC operation takes 5 cycles (3 for the multiply, 2 for the add), and the performance is 200 million MACs/sec. Part b) has a clock period of 2 ns; the MAC takes 3 cycles (2 for the multiply, 1 for the add), or 6 ns in total, resulting in approximately 167 million MACs/sec. Part c) has a clock period of 5 ns. By using operation chaining, a MAC operation takes 1 cycle, which again gives a performance of 200 million MACs/sec.
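The arithmetic behind the three cases can be reproduced with a short Python sketch. It is only an illustration, not code from the source: the function name `mac_throughput` and its parameters are hypothetical, and it assumes that one MAC finishes before the next begins (no pipelining), that each operation starts on a clock edge unless chaining is enabled, and that chaining lets the multiply and add share a cycle whenever their combined delay fits.

```python
import math


def mac_throughput(clock_ns, mul_ns=3.0, add_ns=2.0, chaining=False):
    """Return (cycles per MAC, million MACs/sec) for a given clock period.

    Hypothetical helper. Assumes a non-pipelined MAC: the next MAC starts
    only after the current one has finished.
    """
    if chaining:
        # Chaining: the add starts as soon as the multiply finishes, so only
        # the combined delay is rounded up to a whole number of cycles.
        cycles = math.ceil((mul_ns + add_ns) / clock_ns)
    else:
        # No chaining: each operation is rounded up to whole cycles on its own.
        cycles = math.ceil(mul_ns / clock_ns) + math.ceil(add_ns / clock_ns)
    latency_ns = cycles * clock_ns
    # One MAC per latency_ns nanoseconds equals 1000 / latency_ns million per second.
    return cycles, 1e3 / latency_ns


if __name__ == "__main__":
    for label, clock, chain in [("a", 1.0, False), ("b", 2.0, False), ("c", 5.0, True)]:
        cycles, rate = mac_throughput(clock, chaining=chain)
        print(f"Part {label}) clock {clock} ns: {cycles} cycle(s), {rate:.0f} million MACs/sec")
```

Under the same assumptions, calling `mac_throughput(5.0)` without chaining gives 2 cycles and 100 million MACs/sec, which is why chaining matters at the 5 ns clock period.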

Source publication

Parallel Programming for FPGAs

Preprint

May 2018

This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details are, of necessity, different from parallel programming for multicore processors or GPUs, many of the...
