Fig. 3. Application mapping flow. Note: DFG, DCR, and DRG.


Source publication
Article
Coarse-grained reconfigurable arrays (CGRAs) are a very promising platform, providing both up to 10-100 MOps/mW of power efficiency and software programmability. However, this promise of CGRAs critically hinges on the effectiveness of application mapping onto CGRA platforms. While previous solutions have greatly improved the computation speed, they...

Context in source publication

Context 1
... one with loop-carried data dependence. This can be rather easily solved by adding an additional cost, called BBC, to the PEs to which load/store operations have been mapped. We define the BBC for a PE p as BBC(p) = b·m(p), where b is a design parameter called the base balancing cost and m(p) is the number of memory operations already mapped onto PE p. Fig. 3 illustrates our compilation flow. The two analyses, performance bottleneck analysis and data reuse analysis, are performed before time-consuming modulo scheduling. Memory-aware modulo scheduling refers to the EMS algorithm extended by adding DROC and BBC to the existing cost function, which does not significantly increase the ...
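As a rough illustration of the cost function described in this excerpt, the sketch below shows how a BBC term of the form BBC(p) = b·m(p) could be folded into a per-PE placement cost during scheduling. The function and parameter names (placement_cost, base_cost, mem_ops_on_pe, b) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): folding a base balancing cost
# (BBC) into the per-PE placement cost used during modulo scheduling.
# BBC(p) = b * m(p), where b is a design parameter and m(p) is the number of
# memory operations already mapped onto PE p.

def placement_cost(op, pe, base_cost, mem_ops_on_pe, b=2.0):
    """Return the cost of placing `op` on `pe`.

    base_cost      -- routing/affinity cost from the underlying EMS-style mapper
    mem_ops_on_pe  -- dict: PE id -> number of memory ops already mapped there
    b              -- base balancing cost parameter (design-time constant)
    """
    cost = base_cost
    if op.get("is_memory_op"):
        # Penalize PEs that already host many load/store operations, so memory
        # operations spread evenly across the load/store-capable PEs.
        cost += b * mem_ops_on_pe.get(pe, 0)
    return cost


# Example: a load placed on a PE that already has 2 memory ops costs 4.0 extra.
print(placement_cost({"is_memory_op": True}, pe=3,
                     base_cost=1.0, mem_ops_on_pe={3: 2}))  # 5.0
```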

Similar publications

Conference Paper
FPGAs have the advantage that a single component can be configured post-fabrication to implement almost any computation. However, designing a one-size-fits-all memory architecture causes an inherent mismatch between the needs of the application and the memory sizes and placement on the architecture. Nonetheless, we show that an energy-balanced desi...
Conference Paper
The scaling operation, i.e. the division by a constant factor followed by rounding, is a commonly used technique for reducing the dynamic range in digital signal processing (DSP) systems. Usually, the constant is a power of two, and the implementation of the scaling is reduced to a right shift. This basic operation is not easily implementable in th...

Citations

... By efficiently performing memory operations in parallel, routing the reused data, and minimizing interconnect delays, [60] achieves better performance. Expanding on the concept of data reuse and memory bandwidth use, [61] proposed a technique to efficiently bank the data memory and update the mapping to better utilize the bank distribution. A DFG of a loop with memory operations is shown in Figure 3.4(a), and the CGRA architecture with double-buffered data memory is shown in Figure 3.4(b). ...
... A DFG of a loop with memory operations is shown in Figure 3.4(a), and the CGRA architecture with double-buffered data memory is shown in Figure 3.4(b). As shown in Figure 3.4(c), naive mapping underutilizes the bank distribution, whereas [61] utilizes the bank distribution and maps nodes that access the same memory parts to the PEs accessing the same bank. ...
Thesis
Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by a CGRA relies on the efficient mapping of the compute-intensive loops onto the CGRA by the CGRA compiler. The CGRA mapping problem, being NP-complete, is performed in a two-step process: scheduling and mapping. The scheduling algorithm allocates timeslots to the nodes of the DFG, and the mapping algorithm maps the scheduled nodes onto the PEs of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling (IMS) algorithm to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) schedule, even with an increased II it tends to generate a similar schedule that is not mappable, and it does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored, and an improved randomized algorithm for scheduling the application loop to be accelerated is proposed. When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtrack). Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented. The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-elses (conditionals). The existing hardware-software solutions to execute branches and conditionals are inefficient. A co-design approach that efficiently executes complicated loops on CGRAs is proposed. The compiler transforms complex loops, maps them to the CGRA, and lays them out in memory in a specific manner, such that the hardware can fetch and execute the instructions from the right path at runtime. Finally, an open-source CGRA compilation and simulation framework is presented. This open-source framework is based on LLVM and gem5 and is used to extract the loops, map them onto the CGRA architecture, and execute them on the CGRA as a co-processor to an ARM CPU.
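As a rough sketch of the two-step schedule-then-map flow this abstract describes (retry with a randomized schedule on a mapping failure, then increase the II), the following minimal Python driver models the control loop. The scheduler and mapper callables and all names here are placeholders, not the thesis' compiler API.

```python
# Minimal sketch (not the thesis' actual compiler) of the two-step
# schedule-then-map flow: on a mapping failure, retry with a new randomized
# schedule; if all retries fail, increase the initiation interval (II).
import random

def compile_loop(dfg, cgra, scheduler, mapper, min_ii=1, max_ii=16,
                 retries_per_ii=10):
    """scheduler(dfg, ii, seed) -> schedule; mapper(schedule, cgra) -> mapping or None."""
    for ii in range(min_ii, max_ii + 1):
        for _ in range(retries_per_ii):
            # Randomized scheduling explores more of the schedule space than a
            # deterministic resource-constrained ASAP schedule would.
            schedule = scheduler(dfg, ii, seed=random.random())
            mapping = mapper(schedule, cgra)
            if mapping is not None:
                return ii, mapping
        # All attempts at this II failed; relax the II constraint.
    raise RuntimeError("no mapping found up to max_ii")

# Toy usage with stand-in scheduler/mapper callables:
ii, m = compile_loop(dfg=["a", "b"], cgra=4,
                     scheduler=lambda dfg, ii, seed: list(dfg),
                     mapper=lambda sched, cgra: {op: i % cgra
                                                 for i, op in enumerate(sched)})
print(ii, m)  # 1 {'a': 0, 'b': 1}
```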
... In contrast, CGRA-related work that treats them as general-purpose architectures often focuses on the instruction mapping procedure of the accelerator [38]. As a result, CGRA architectures tend to exploit local data reuse [39], utilize a local register file of the PEs [40], and organize the instruction mapping process in an efficient way [41]. ...
Thesis
Modern computer architectures face a performance scaling wall, as the throughput and power-consumption bottleneck has shifted from the core pipeline towards DRAM latency and data transfer operations. This phenomenon can be partially attributed to the end of Dennard scaling and to the continuously shrinking size of transistors. As a result, the power density of integrated circuits has increased to a point where most of the cores in a multi-core architecture are forced to operate at near-threshold voltage levels. In order to address this issue, researchers tend to deviate from standard von Neumann architectures towards new computing models. In the last decade there has been a resurgence of the near-data processing (NDP) paradigm, under which instructions are executed on the DRAM die instead of in the core pipeline. The amount of CPU-DRAM transactions is therefore significantly decreased, which positively affects the power dissipation and the achievable throughput of the system. Under this premise, in this dissertation we explore the NDP paradigm for high-performance and for low-power computing. Regarding high-performance computing, we propose a novel approach that considers general-purpose loop execution. Our design employs an instruction scheduling methodology which issues each individual instruction on a custom integrated circuit acting as a loop accelerator located on the logic layer of an HMC DRAM. There, instructions are iteratively executed in parallel in a software-pipelining fashion, while intermediate results are forwarded through an on-chip interconnection network. Regarding low-power computing, we develop a novel timing analysis methodology that is based on the premises of STA, specifically for low-power, low-end pipelines. The proposed timing methodology considers the excitation of the timing paths for each instruction supported by the ISA, and calculates the worst-case slack for each individual instruction. As a result, we obtain timing information at the instruction level and exploit this knowledge to adaptively scale the clock frequency according to the instruction types that execute in the pipeline at any given time. We then employ the aforementioned BTWC methodology to co-design a pipeline from the ground up to support a clock-scaling mechanism with cycle-to-cycle granularity. We focus on general-purpose code execution and implement our design on the logic layer of an HMC DRAM in order to enable near-data execution. We evaluate both the high-performance and the low-power architectures on post-layout simulations in order to strengthen the validity of our designs. Results indicate a significant increase in throughput over the baseline processors, while power consumption is drastically reduced.
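The instruction-level clock-scaling idea in the low-power part of this dissertation can be illustrated with a tiny model: each instruction type is assigned its own worst-case clock period, and the runtime of a trace under per-instruction scaling is compared against a single worst-case clock. The period table and instruction mix below are invented numbers for illustration only, not results from the work.

```python
# Illustrative only: cycle-by-cycle clock scaling driven by per-instruction
# worst-case timing slack, as the abstract describes. The period table and the
# instruction mix are made-up values, not measured data.
WORST_CASE_PERIOD_NS = {   # hypothetical per-instruction-type critical paths
    "add": 0.8, "load": 1.2, "mul": 1.5, "branch": 0.9,
}
BASELINE_PERIOD_NS = 1.5   # conventional single worst-case clock period

def runtime_ns(instr_trace, adaptive=True):
    # Adaptive: each cycle runs at the period dictated by the instruction type
    # currently in the pipeline; baseline: every cycle pays the worst case.
    if adaptive:
        return sum(WORST_CASE_PERIOD_NS[i] for i in instr_trace)
    return BASELINE_PERIOD_NS * len(instr_trace)

trace = ["load", "add", "add", "mul", "branch", "add"]
print(runtime_ns(trace, adaptive=False), runtime_ns(trace, adaptive=True))
```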
... C. Management of Multi-Bank Memory 1) Concurrent accesses to memory banks: While single-bank memory can be easier to manage, it is infeasible to provide multiple ports for the PE array with just one bank [169]. Moreover, multi-port unified memory consumes very high power and incurs longer latency [170]. ...
Preprint
Machine learning (ML) models are widely used in many important domains. For efficiently processing these computational- and memory-intensive applications, tensors of these over-parameterized models are compressed by leveraging sparsity, size reduction, and quantization of tensors. Unstructured sparsity and tensors with varying dimensions yield irregular computation, communication, and memory access patterns; processing them on hardware accelerators in a conventional manner does not inherently leverage acceleration opportunities. This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators. In particular, it discusses enhancement modules in the architecture design and the software support; categorizes different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs; analyzes achievable accelerations for recent DNNs; highlights further opportunities in terms of hardware/software/model co-design optimizations (inter/intra-module). The takeaways from this paper include: understanding the key challenges in accelerating sparse, irregular-shaped, and quantized tensors; understanding enhancements in accelerator systems for supporting their efficient computations; analyzing trade-offs in opting for a specific design choice for encoding, storing, extracting, communicating, computing, and load-balancing the non-zeros; understanding how structured sparsity can improve storage efficiency and balance computations; understanding how to compile and map models with sparse tensors on the accelerators; understanding recent design trends for efficient accelerations and further opportunities. arXiv: https://arxiv.org/abs/2007.00864
... Specification of target accelerator: Our dataflow accelerator architecture features 16×16 PEs with 16-bit precision. PEs access private 1024B double-buffered RFs and a 128 kB shared double-buffered scratch-pad [20]. Each pipelined PE features a 2-stage multiplier and an adder. ...
... The SPM consists of 64 banks (2 kB each) that can be allocated to any data. Data is accessed from DRAM via DMA and managed in the SPM with double-buffering [41,45]. Our latency model for data transfers via DMA is the same as that of Cell processors, which featured SPMs [44]. ...
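The double-buffering mentioned in this snippet can be sketched as follows: while the PE array computes on one buffer, DMA fills the other, so transfer latency overlaps with computation. The tile/DMA/compute abstractions below are hypothetical stand-ins; in plain Python the overlap is only modeled as a schedule, not actually concurrent.

```python
# Sketch of the double-buffering idea: while one SPM buffer is computed on,
# the DMA fills the other buffer with the next tile. This sequential model
# only shows the schedule; real hardware performs the two steps in parallel.

def process_tiles(tiles, dma_fetch, compute):
    """dma_fetch(tile) -> data buffer; compute(buffer) -> result."""
    results = []
    next_buf = dma_fetch(tiles[0])                # prime the first buffer
    for i in range(len(tiles)):
        cur_buf = next_buf
        if i + 1 < len(tiles):
            next_buf = dma_fetch(tiles[i + 1])    # fill the other buffer "in flight"
        results.append(compute(cur_buf))          # compute on the current buffer
    return results

# Toy usage: tiles are integers, "DMA" duplicates them, "compute" sums a list.
print(process_tiles([1, 2, 3],
                    dma_fetch=lambda t: [t, t],
                    compute=sum))  # [2, 4, 6]
```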
Article
Dataflow accelerators feature simplicity, programmability, and energy efficiency and are viewed as a promising architecture for accelerating the perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner, a framework to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of the dMazeRunner framework is in: i) a holistic representation of the loop nests that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay Product (EDP) and 5.83× better in execution time than prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP compared to the optimal solution.
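In the spirit of the search-space pruning that dMazeRunner performs (though not its actual implementation), the sketch below enumerates candidate tilings of a loop nest, discards tilings whose data footprint does not fit an assumed on-chip buffer, skips candidates that are equivalent under a toy symmetric cost model, and keeps the cheapest one. All sizes and the cost model are illustrative assumptions.

```python
# Illustrative sketch of execution-method exploration with pruning: enumerate
# tile-size choices for each loop, drop invalid ones (footprint too large),
# deduplicate permuted tilings (equal cost under the toy symmetric model),
# and keep the lowest-cost candidate.
from itertools import product
from math import prod

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def explore(loop_bounds, buffer_words, cost):
    """loop_bounds: dict loop -> trip count; cost(tile_sizes) -> float,
    assumed symmetric in the loops, so permuted tilings can be deduplicated."""
    best, best_cost, seen = None, float("inf"), set()
    for tile in product(*(divisors(n) for n in loop_bounds.values())):
        if prod(tile) > buffer_words:      # invalid: tile data does not fit on-chip
            continue
        key = tuple(sorted(tile))          # permutations cost the same here
        if key in seen:
            continue
        seen.add(key)
        c = cost(tile)
        if c < best_cost:
            best, best_cost = dict(zip(loop_bounds, tile)), c
    return best, best_cost

# Toy cost: larger resident footprint means fewer DRAM refills (placeholder model).
best, c = explore({"oy": 8, "ox": 8, "k": 16}, buffer_words=256,
                  cost=lambda tile: 1.0 / prod(tile))
print(best, c)
```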
... Optimizing loop nests through transformation has always been a key method for achieving better performance [12]. Loop blocking is one of the most important loop optimization techniques and is used with different aims, such as improving data locality and exploiting coarse-grained and data parallelism [13,14]. ...
Article
With the growth of high-performance computing, the performance of data-driven programs is becoming more and more dependent on fast memory access, which can be improved through data locality. Data locality between a pair of loop nests is called inter-nest data locality. A very important class of loop nests that shows significant inter-nest data locality is stencils. In this paper, we propose a method, named EALB, to optimize inter-nest data locality in stencils. In the proposed method, the "compute" and "copy" loop nests within the time loop of a stencil are partitioned into blocks, and these blocks are executed in an interleaved fashion. The optimum block size is determined by an evolutionary algorithm that uses the cache miss rate and cache eviction rate. The experimental results show that EALB is significantly more effective than the original programs and outperforms the state-of-the-art approach, Pluto.
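The interleaving EALB performs can be sketched on a 1-D Jacobi-style stencil: the blocked "compute" and "copy" loop nests inside the time loop are executed interleaved, with the copy trailing the compute by one block so the original Jacobi semantics are preserved. The fixed block size and the specific trailing-copy scheme below are simplifying assumptions; EALB itself selects the block size with an evolutionary algorithm.

```python
# Sketch of interleaved, blocked "compute" and "copy" loop nests inside the
# time loop of a 1-D Jacobi stencil. The copy runs one block behind the
# compute, so no element is overwritten before a later compute block reads it.

def jacobi_1d_interleaved(a, timesteps, block=64):
    n = len(a)
    tmp = a[:]                                   # target of the "compute" nest
    for _ in range(timesteps):
        prev = None                              # block whose copy is still pending
        for lo in range(1, n - 1, block):
            hi = min(lo + block, n - 1)
            for i in range(lo, hi):              # blocked "compute" loop nest
                tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
            if prev is not None:                 # interleaved "copy" loop nest,
                a[prev[0]:prev[1]] = tmp[prev[0]:prev[1]]  # one block behind
            prev = (lo, hi)
        if prev is not None:
            a[prev[0]:prev[1]] = tmp[prev[0]:prev[1]]      # copy the final block
    return a

print(jacobi_1d_interleaved([0.0, 9.0, 0.0, 9.0, 0.0, 9.0], timesteps=1, block=2))
# -> [0.0, 3.0, 6.0, 3.0, 6.0, 9.0]
```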
... For reconfigurable architectures that allow a one-cycle context switch, such as ADRES [6], a variant of modulo scheduling can be used to find quality mappings within a reasonable time. Subsequently, memory issues were recognized as a dominant problem, making it necessary to consider data mapping in addition to computation mapping during scheduling [7,8]. Data distribution and mapping in the context of the RAW machine is considered by Barua et al. [9]. ...
... Generating code for this type of architecture requires careful evaluation of the implications of operation scheduling on data placement. One way to achieve that is, as suggested by [7], to add a heuristic function to the operation scheduling that penalizes allocating two or more memory operations onto different load-store PEs if the sets of data accessed by the operations share some array elements, because otherwise the shared array elements must be duplicated into different memory banks, potentially increasing data transfer time. ...
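A minimal sketch of the heuristic described in this snippet: when evaluating a candidate load/store PE for a memory operation, add a penalty for every already-placed memory operation that shares array elements with it but sits on a different load/store PE. The function and data-structure names are hypothetical, not the cited paper's code.

```python
# Illustrative sketch of the shared-data penalty: placing a memory operation
# on a different load/store PE than operations touching the same array
# elements would force duplicating those elements across banks, so it costs more.

def sharing_penalty(op_id, candidate_pe, access_sets, placement, penalty=10.0):
    """access_sets: op -> set of array elements it touches;
    placement: op -> load/store PE it has already been mapped to."""
    cost = 0.0
    for other, pe in placement.items():
        if pe != candidate_pe and access_sets[op_id] & access_sets[other]:
            cost += penalty          # shared elements would need duplication
    return cost

access = {"ld_a": {("A", 0), ("A", 1)}, "ld_b": {("A", 1), ("A", 2)}}
placed = {"ld_a": 0}                 # ld_a already mapped to load/store PE 0
print(sharing_penalty("ld_b", 0, access, placed))  # 0.0  (same PE, no duplication)
print(sharing_penalty("ld_b", 1, access, placed))  # 10.0 (shared element A[1])
```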
Article
Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising platform that provides fast turn-around time as well as very high energy efficiency for multimedia applications. One of the problems with CGRAs, however, is application mapping, which currently does not scale well with geometrically increasing numbers of cores. To mitigate the scalability problem, this paper discusses how to use the SIMD (Single Instruction Multiple Data) paradigm for CGRAs. While the idea of SIMD is not new, SIMD can complicate the mapping problem by adding an additional dimension of iteration mapping to the already complex, interdependent problems of operation and data mapping, and can thus significantly affect performance through memory bank conflicts. In this paper, based on a new architecture called SIMD reconfigurable architecture, which allows SIMD execution at multiple levels of granularity, we present how to minimize bank conflicts considering all three related sub-problems, for various RA organizations. We also present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of mapping problems. © 2015, Institute of Electronics Engineers of Korea. All rights reserved.
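The bank-conflict issue this abstract refers to can be illustrated with a toy model: SIMD lanes issue their memory accesses in the same cycle, and accesses that fall into the same bank serialize, so how data is tiled across banks directly determines the conflict count. The bank-mapping functions and sizes below are assumptions for illustration, not the paper's scheme.

```python
# Toy illustration (not the paper's algorithm) of memory bank conflicts under
# SIMD execution: lanes access memory in the same cycle, and two accesses to
# the same bank serialize. Interleaved placement spreads unit-stride accesses
# across banks; blocked placement concentrates them in one bank.

NUM_BANKS = 4
ARRAY_LEN = 64

def bank_of(addr, interleaved=True):
    # Interleaved: element i lives in bank i mod NUM_BANKS.
    # Blocked: the array is split into NUM_BANKS contiguous chunks.
    return addr % NUM_BANKS if interleaved else (addr * NUM_BANKS) // ARRAY_LEN

def conflicts_per_cycle(lane_addrs, interleaved):
    banks = [bank_of(a, interleaved) for a in lane_addrs]
    # Extra serialized cycles = accesses beyond the first to any single bank.
    return sum(banks.count(b) - 1 for b in set(banks))

# 4 SIMD lanes each read element base + lane (unit stride):
lanes = [8 + lane for lane in range(4)]
print(conflicts_per_cycle(lanes, interleaved=True))   # 0 conflicts
print(conflicts_per_cycle(lanes, interleaved=False))  # 3 conflicts (same chunk)
```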
... An early work [5] presented a multi-bank double-buffering memory architecture in which the PEs in a row can access the corresponding bank. Fig. 1 shows details of the CGRA, which contains a 4×4 PE array. ...
... Various speedups can be achieved when the right method is applied. Yongjoo et al. propose in [4] memory-aware application mapping to improve the data throughput on CGRAs. Memory bandwidth optimization is discussed in [5]. In this paper, a Hyper Pipelined Reconfigurable Architecture (HPRA) is proposed, which demonstrates improvements on the two aforementioned problems based on a CGRA. ...
Article
The well-known method C-Slow Retiming (CSR) can be used to automatically convert a given CPU into a multithreaded CPU with independent threads. These CPUs are then called streaming or barrel processors. System Hyper Pipelining (SHP) adds new flexibility on top of CSR by allowing a dynamic number of threads to be executed and by enabling threads to be stalled, bypassed, and reordered. SHP is now applied to the programming elements (PEs) of a coarse-grained reconfigurable architecture (CGRA). By using SHP, more performance can be achieved per PE. Fork-join operations can be implemented on a PE using the flexibility provided by SHP to dynamically adjust the number of threads per PE. Multiple threads can share the same data locally, which greatly reduces the data traffic load on the CGRA's routing structure. The paper shows the results of a CGRA using SHP-ed RISC-V cores as PEs, implemented on an FPGA.
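A conceptual model of the SHP-style time-multiplexing described above: each cycle the PE issues from the next ready thread in round-robin order, bypassing stalled threads, so the number of active threads can vary at runtime. This is a behavioral sketch only, not the paper's RISC-V/FPGA implementation.

```python
# Conceptual model (not the paper's RTL) of SHP-style thread interleaving on a
# PE: the pipeline issues one instruction per cycle from the next ready thread
# in round-robin order; stalled or finished threads are bypassed that cycle.
from collections import deque

def run_shp_pe(threads, cycles):
    """threads: dict tid -> iterator of instructions (None means the thread stalls)."""
    ready = deque(threads.keys())
    trace = []
    for _ in range(cycles):
        issued = False
        for _ in range(len(ready)):          # look for the next non-stalled thread
            tid = ready[0]
            ready.rotate(-1)                 # move it to the back (round-robin)
            instr = next(threads[tid], None)
            if instr is not None:
                trace.append((tid, instr))
                issued = True
                break
        if not issued:
            trace.append((None, "bubble"))   # no ready thread this cycle
    return trace

t = {0: iter(["add", "mul"]), 1: iter(["ld", "st", "add"])}
print(run_shp_pe(t, cycles=5))
# [(0, 'add'), (1, 'ld'), (0, 'mul'), (1, 'st'), (1, 'add')]
```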
... With its numerous hardware computing resources, the Reconfigurable Cell Array (RCA) has to handle a large number of high-speed parallel tasks, which creates great demand for massive data throughput in CGRAs. Therefore, a data transfer strategy that is efficient in both throughput and speed is particularly needed to satisfy this requirement [7]. ...
Article
The Coarse Grained Reconfigurable Architectures (CGRAs) are proposed as new choices for enhancing the ability of parallel processing. Data transfer throughput between Reconfigurable Cell Array (RCA) and on-chip local memory is usually the main performance bottleneck of CGRAs. In order to release this stress, we propose a novel data transfer strategy that is called Heuristic Data Prefetch and Reuse (HDPR), for the first time in the case of explicit CGRAs. The HDPR strategy provides not only the flexible data access schedule but also the high data throughput needed to realize fast pipelined implementations of various loop kernels. To improve the data utilization efficiency, a dual-bank cache-like data reuse structure is proposed. Furthermore, a heuristic data prefetch is also introduced to decrease the data access latency. Experimental results demonstrate that when compared with conventional explicit data transfer strategies, our work achieves a significant speedup improvement of, on average, 1.73 times at the expense of only 5.86% increase in area.