Fig. 3. Application mapping flow. Note: DFG, DCR, and DRG.


Source publication
Article
Coarse-grained reconfigurable arrays (CGRAs) are a very promising platform, providing both up to 10-100 MOps/mW of power efficiency and software programmability. However, this promise of CGRAs critically hinges on the effectiveness of application mapping onto CGRA platforms. While previous solutions have greatly improved the computation speed, they...

Context in source publication

Context 1
... one with loop-carried data dependence. This can be rather easily solved by adding an additional cost, called BBC, to the PEs to which load/store operations have been mapped. We define the BBC for a PE p as BBC(p) = b·m(p), where b is a design parameter called the base balancing cost and m(p) is the number of memory operations already mapped onto PE p. Fig. 3 illustrates our compilation flow. The two analyses, performance bottleneck analysis and data reuse analysis, are performed before time-consuming modulo scheduling. Memory-aware modulo scheduling refers to the EMS algorithm extended by adding DROC and BBC to the existing cost function, which does not significantly increase the ...
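As a rough illustration of the cost function described in this excerpt, the sketch below shows how a BBC term of the form BBC(p) = b·m(p) could be folded into a per-PE placement cost during scheduling. The function and parameter names (placement_cost, base_cost, mem_ops_on_pe, b) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): folding a base balancing cost
# (BBC) into the per-PE placement cost used during modulo scheduling.
# BBC(p) = b * m(p), where b is a design parameter and m(p) is the number of
# memory operations already mapped onto PE p.

def placement_cost(op, pe, base_cost, mem_ops_on_pe, b=2.0):
    """Return the cost of placing `op` on `pe`.

    base_cost      -- routing/affinity cost from the underlying EMS-style mapper
    mem_ops_on_pe  -- dict: PE id -> number of memory ops already mapped there
    b              -- base balancing cost parameter (design-time constant)
    """
    cost = base_cost
    if op.get("is_memory_op"):
        # Penalize PEs that already host many load/store operations, so memory
        # operations spread evenly across the load/store-capable PEs.
        cost += b * mem_ops_on_pe.get(pe, 0)
    return cost


# Example: a load placed on a PE that already has 2 memory ops costs 4.0 extra.
print(placement_cost({"is_memory_op": True}, pe=3,
                     base_cost=1.0, mem_ops_on_pe={3: 2}))  # 5.0
```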

Similar publications

Conference Paper
FPGAs have the advantage that a single component can be configured post-fabrication to implement almost any computation. However, designing a one-size-fits-all memory architecture causes an inherent mismatch between the needs of the application and the memory sizes and placement on the architecture. Nonetheless, we show that an energy-balanced desi...
Conference Paper
The scaling operation, i.e. the division by a constant factor followed by rounding, is a commonly used technique for reducing the dynamic range in digital signal processing (DSP) systems. Usually, the constant is a power of two, and the implementation of the scaling is reduced to a right shift. This basic operation is not easily implementable in th...

Citations

... By efficiently performing memory operations in parallel, routing the reused data, and minimizing interconnect delays, [60] achieves better performance. Expanding on the concept of data reuse and memory bandwidth use, [61] proposed a technique to efficiently bank the data memory and update the mapping to better utilize the bank distribution. A DFG of a loop with memory operations is shown in Figure 3.4(a), and the CGRA architecture with double-buffered data memory is shown in Figure 3.4(b). ...
... A DFG of a loop with memory operations is shown in Figure 3.4(a), and the CGRA architecture with double-buffered data memory is shown in Figure 3.4(b). As shown in Figure 3.4(c), naive mapping underutilizes the bank distribution, whereas [61] utilizes the bank distribution and maps nodes that access the same memory parts to the PEs accessing the same bank. ...
Thesis
Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by a CGRA relies on the efficient mapping of the compute-intensive loops onto the CGRA by the CGRA compiler. The CGRA mapping problem, being NP-complete, is performed in a two-step process: scheduling and mapping. The scheduling algorithm allocates timeslots to the nodes of the DFG, and the mapping algorithm maps the scheduled nodes onto the PEs of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling (IMS) algorithm to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) schedule, even with an increased II it tends to generate a similar schedule that is not mappable, and it does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored, and an improved randomized algorithm for scheduling the application loop to be accelerated is proposed. When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtrack). Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented. The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-elses (conditionals). The existing hardware-software solutions to execute branches and conditionals are inefficient. A co-design approach that efficiently executes complicated loops on CGRAs is proposed. The compiler transforms complex loops, maps them to the CGRA, and lays them out in memory in a specific manner, such that the hardware can fetch and execute the instructions from the right path at runtime. Finally, an open-source CGRA compilation and simulation framework is presented. This open-source framework is based on LLVM and gem5 and is used to extract the loops, map them onto the CGRA architecture, and execute them on the CGRA as a co-processor to an ARM CPU.
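As a rough sketch of the two-step schedule-then-map flow this abstract describes (retry with a randomized schedule on a mapping failure, then increase the II), the following minimal Python driver models the control loop. The scheduler and mapper callables and all names here are placeholders, not the thesis' compiler API.

```python
# Minimal sketch (not the thesis' actual compiler) of the two-step
# schedule-then-map flow: on a mapping failure, retry with a new randomized
# schedule; if all retries fail, increase the initiation interval (II).
import random

def compile_loop(dfg, cgra, scheduler, mapper, min_ii=1, max_ii=16,
                 retries_per_ii=10):
    """scheduler(dfg, ii, seed) -> schedule; mapper(schedule, cgra) -> mapping or None."""
    for ii in range(min_ii, max_ii + 1):
        for _ in range(retries_per_ii):
            # Randomized scheduling explores more of the schedule space than a
            # deterministic resource-constrained ASAP schedule would.
            schedule = scheduler(dfg, ii, seed=random.random())
            mapping = mapper(schedule, cgra)
            if mapping is not None:
                return ii, mapping
        # All attempts at this II failed; relax the II constraint.
    raise RuntimeError("no mapping found up to max_ii")

# Toy usage with stand-in scheduler/mapper callables:
ii, m = compile_loop(dfg=["a", "b"], cgra=4,
                     scheduler=lambda dfg, ii, seed: list(dfg),
                     mapper=lambda sched, cgra: {op: i % cgra
                                                 for i, op in enumerate(sched)})
print(ii, m)  # 1 {'a': 0, 'b': 1}
```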
... In contrast, CGRA-related work that treats them as general-purpose architectures often focuses on the instruction mapping procedure of the accelerator [38]. As a result, CGRA architectures tend to exploit local data reuse [39], utilize a local register file of the PEs [40], and organize the instruction mapping process in an efficient way [41]. ...
Thesis
Modern computer architectures face a performance scaling wall, as the throughput and power-consumption bottleneck has shifted from the core pipeline towards DRAM latency and data transfer operations. This phenomenon can be partially attributed to the end of Dennard scaling and to the continuously shrinking size of transistors. As a result, the power density of integrated circuits has increased to a point where most of the cores in a multi-core architecture are forced to operate at near-threshold voltage levels. In order to address this issue, researchers tend to deviate from standard von Neumann architectures towards new computing models. In the last decade there has been a resurgence of the near-data processing (NDP) paradigm, under which instructions are executed on the DRAM die instead of in the core pipeline. The amount of CPU-DRAM transactions is therefore significantly decreased, which positively affects the power dissipation and the achievable throughput of the system. Under this premise, in this dissertation we explore the NDP paradigm for high-performance and for low-power computing. Regarding high-performance computing, we propose a novel approach that considers general-purpose loop execution. Our design employs an instruction scheduling methodology which issues each individual instruction on a custom integrated circuit acting as a loop accelerator located on the logic layer of an HMC DRAM. There, instructions are iteratively executed in parallel in a software-pipelining fashion, while intermediate results are forwarded through an on-chip interconnection network. Regarding low-power computing, we develop a novel timing analysis methodology that is based on the premises of STA, specifically for low-power, low-end pipelines. The proposed timing methodology considers the excitation of the timing paths for each instruction supported by the ISA, and calculates the worst-case slack for each individual instruction. As a result, we obtain timing information at the instruction level and exploit this knowledge to adaptively scale the clock frequency according to the instruction types that execute in the pipeline at any given time. We then employ the aforementioned BTWC methodology to co-design a pipeline from the ground up to support a clock-scaling mechanism with cycle-to-cycle granularity. We focus on general-purpose code execution and implement our design on the logic layer of an HMC DRAM in order to enable near-data execution. We evaluate both the high-performance and the low-power architectures on post-layout simulations in order to strengthen the validity of our designs. Results indicate a significant increase in throughput over the baseline processors, while power consumption is drastically reduced.
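The instruction-level clock-scaling idea in the low-power part of this dissertation can be illustrated with a tiny model: each instruction type is assigned its own worst-case clock period, and the runtime of a trace under per-instruction scaling is compared against a single worst-case clock. The period table and instruction mix below are invented numbers for illustration only, not results from the work.

```python
# Illustrative only: cycle-by-cycle clock scaling driven by per-instruction
# worst-case timing slack, as the abstract describes. The period table and the
# instruction mix are made-up values, not measured data.
WORST_CASE_PERIOD_NS = {   # hypothetical per-instruction-type critical paths
    "add": 0.8, "load": 1.2, "mul": 1.5, "branch": 0.9,
}
BASELINE_PERIOD_NS = 1.5   # conventional single worst-case clock period

def runtime_ns(instr_trace, adaptive=True):
    # Adaptive: each cycle runs at the period dictated by the instruction type
    # currently in the pipeline; baseline: every cycle pays the worst case.
    if adaptive:
        return sum(WORST_CASE_PERIOD_NS[i] for i in instr_trace)
    return BASELINE_PERIOD_NS * len(instr_trace)

trace = ["load", "add", "add", "mul", "branch", "add"]
print(runtime_ns(trace, adaptive=False), runtime_ns(trace, adaptive=True))
```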
... C. Management of Multi-Bank Memory 1) Concurrent accesses to memory banks: While single-bank memory can be easier to manage, it is infeasible to provide multiple ports for the PE array with just one bank [169]. Moreover, multi-port unified memory consumes very high power and incurs longer latency [170]. ...
Preprint
Machine learning (ML) models are widely used in many important domains. For efficiently processing these computational- and memory-intensive applications, tensors of these over-parameterized models are compressed by leveraging sparsity, size reduction, and quantization of tensors. Unstructured sparsity and tensors with varying dimensions yield irregular computation, communication, and memory access patterns; processing them on hardware accelerators in a conventional manner does not inherently leverage acceleration opportunities. This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators. In particular, it discusses enhancement modules in the architecture design and the software support; categorizes different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs; analyzes achievable accelerations for recent DNNs; highlights further opportunities in terms of hardware/software/model co-design optimizations (inter/intra-module). The takeaways from this paper include: understanding the key challenges in accelerating sparse, irregular-shaped, and quantized tensors; understanding enhancements in accelerator systems for supporting their efficient computations; analyzing trade-offs in opting for a specific design choice for encoding, storing, extracting, communicating, computing, and load-balancing the non-zeros; understanding how structured sparsity can improve storage efficiency and balance computations; understanding how to compile and map models with sparse tensors on the accelerators; understanding recent design trends for efficient accelerations and further opportunities. arXiv: https://arxiv.org/abs/2007.00864
... Specification of target accelerator: Our dataflow accelerator architecture features 16×16 PEs with 16-bit precision. PEs access private 1024B double-buffered RFs and a 128 kB shared double-buffered scratch-pad [20]. Each pipelined PE features a 2-stage multiplier and an adder. ...
... The SPM consists of 64 banks (2 kB each) that can be allocated to any data. Data is accessed from DRAM via DMA and managed in the SPM with double-buffering [41,45]. Our latency model for data transfers via DMA is the same as that of Cell processors, which featured SPMs [44]. ...
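The double-buffering mentioned in this snippet can be sketched as follows: while the PE array computes on one buffer, DMA fills the other, so transfer latency overlaps with computation. The tile/DMA/compute abstractions below are hypothetical stand-ins; in plain Python the overlap is only modeled as a schedule, not actually concurrent.

```python
# Sketch of the double-buffering idea: while one SPM buffer is computed on,
# the DMA fills the other buffer with the next tile. This sequential model
# only shows the schedule; real hardware performs the two steps in parallel.

def process_tiles(tiles, dma_fetch, compute):
    """dma_fetch(tile) -> data buffer; compute(buffer) -> result."""
    results = []
    next_buf = dma_fetch(tiles[0])                # prime the first buffer
    for i in range(len(tiles)):
        cur_buf = next_buf
        if i + 1 < len(tiles):
            next_buf = dma_fetch(tiles[i + 1])    # fill the other buffer "in flight"
        results.append(compute(cur_buf))          # compute on the current buffer
    return results

# Toy usage: tiles are integers, "DMA" duplicates them, "compute" sums a list.
print(process_tiles([1, 2, 3],
                    dma_fetch=lambda t: [t, t],
                    compute=sum))  # [2, 4, 6]
```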
Article
Dataflow accelerators feature simplicity, programmability, and energy efficiency and are viewed as a promising architecture for accelerating the perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner, a framework to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of the dMazeRunner framework is in: i) a holistic representation of the loop nests that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay Product (EDP) and 5.83× better in execution time than prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP compared to the optimal solution.
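In the spirit of the search-space pruning that dMazeRunner performs (though not its actual implementation), the sketch below enumerates candidate tilings of a loop nest, discards tilings whose data footprint does not fit an assumed on-chip buffer, skips candidates that are equivalent under a toy symmetric cost model, and keeps the cheapest one. All sizes and the cost model are illustrative assumptions.

```python
# Illustrative sketch of execution-method exploration with pruning: enumerate
# tile-size choices for each loop, drop invalid ones (footprint too large),
# deduplicate permuted tilings (equal cost under the toy symmetric model),
# and keep the lowest-cost candidate.
from itertools import product
from math import prod

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def explore(loop_bounds, buffer_words, cost):
    """loop_bounds: dict loop -> trip count; cost(tile_sizes) -> float,
    assumed symmetric in the loops, so permuted tilings can be deduplicated."""
    best, best_cost, seen = None, float("inf"), set()
    for tile in product(*(divisors(n) for n in loop_bounds.values())):
        if prod(tile) > buffer_words:      # invalid: tile data does not fit on-chip
            continue
        key = tuple(sorted(tile))          # permutations cost the same here
        if key in seen:
            continue
        seen.add(key)
        c = cost(tile)
        if c < best_cost:
            best, best_cost = dict(zip(loop_bounds, tile)), c
    return best, best_cost

# Toy cost: larger resident footprint means fewer DRAM refills (placeholder model).
best, c = explore({"oy": 8, "ox": 8, "k": 16}, buffer_words=256,
                  cost=lambda tile: 1.0 / prod(tile))
print(best, c)
```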
... Optimizing loop nests through transformation has always been a key method for achieving better performance [12]. Loop blocking is one of the most important loop optimization techniques and is used with different aims, such as improving data locality and exploiting coarse-grained and data parallelism [13,14]. ...
Article
With the growth of high-performance computing, the performance of data-driven programs is becoming more and more dependent on fast memory access, which can be improved through data locality. Data locality between a pair of loop nests is called inter-nest data locality. A very important class of loop nests that shows significant inter-nest data locality is stencils. In this paper, we propose a method, named EALB, to optimize inter-nest data locality in stencils. In the proposed method, the "compute" and "copy" loop nests within the time loop of a stencil are partitioned into blocks, and these blocks are executed in an interleaved fashion. The optimum block size is determined by an evolutionary algorithm that uses the cache miss rate and cache eviction rate. The experimental results show that EALB is significantly more effective than the original programs and outperforms the state-of-the-art approach, Pluto.
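The interleaving EALB performs can be sketched on a 1-D Jacobi-style stencil: the blocked "compute" and "copy" loop nests inside the time loop are executed interleaved, with the copy trailing the compute by one block so the original Jacobi semantics are preserved. The fixed block size and the specific trailing-copy scheme below are simplifying assumptions; EALB itself selects the block size with an evolutionary algorithm.

```python
# Sketch of interleaved, blocked "compute" and "copy" loop nests inside the
# time loop of a 1-D Jacobi stencil. The copy runs one block behind the
# compute, so no element is overwritten before a later compute block reads it.

def jacobi_1d_interleaved(a, timesteps, block=64):
    n = len(a)
    tmp = a[:]                                   # target of the "compute" nest
    for _ in range(timesteps):
        prev = None                              # block whose copy is still pending
        for lo in range(1, n - 1, block):
            hi = min(lo + block, n - 1)
            for i in range(lo, hi):              # blocked "compute" loop nest
                tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
            if prev is not None:                 # interleaved "copy" loop nest,
                a[prev[0]:prev[1]] = tmp[prev[0]:prev[1]]  # one block behind
            prev = (lo, hi)
        if prev is not None:
            a[prev[0]:prev[1]] = tmp[prev[0]:prev[1]]      # copy the final block
    return a

print(jacobi_1d_interleaved([0.0, 9.0, 0.0, 9.0, 0.0, 9.0], timesteps=1, block=2))
# -> [0.0, 3.0, 6.0, 3.0, 6.0, 9.0]
```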
... For reconfigurable architectures that allow a one-cycle context switch, such as ADRES [6], a variant of modulo scheduling can be used to find quality mappings within a reasonable time. Subsequently, memory issues were recognized as a dominant problem, making it necessary to consider data mapping in addition to computation mapping during scheduling [7,8]. Data distribution and mapping in the context of the RAW machine is considered by Barua et al. [9]. ...
... Generating code for this type of architecture requires careful evaluation of the implications of operation scheduling on data placement. One way to achieve that is, as suggested by [7], to add a heuristic function to the operation scheduling that penalizes allocating two or more memory operations onto different load-store PEs if the sets of data accessed by the operations share some array elements, because otherwise the shared array elements must be duplicated into different memory banks, potentially increasing data transfer time. ...
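A minimal sketch of the heuristic described in this snippet: when evaluating a candidate load/store PE for a memory operation, add a penalty for every already-placed memory operation that shares array elements with it but sits on a different load/store PE. The function and data-structure names are hypothetical, not the cited paper's code.

```python
# Illustrative sketch of the shared-data penalty: placing a memory operation
# on a different load/store PE than operations touching the same array
# elements would force duplicating those elements across banks, so it costs more.

def sharing_penalty(op_id, candidate_pe, access_sets, placement, penalty=10.0):
    """access_sets: op -> set of array elements it touches;
    placement: op -> load/store PE it has already been mapped to."""
    cost = 0.0
    for other, pe in placement.items():
        if pe != candidate_pe and access_sets[op_id] & access_sets[other]:
            cost += penalty          # shared elements would need duplication
    return cost

access = {"ld_a": {("A", 0), ("A", 1)}, "ld_b": {("A", 1), ("A", 2)}}
placed = {"ld_a": 0}                 # ld_a already mapped to load/store PE 0
print(sharing_penalty("ld_b", 0, access, placed))  # 0.0  (same PE, no duplication)
print(sharing_penalty("ld_b", 1, access, placed))  # 10.0 (shared element A[1])
```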
Article
Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising platform that provides fast turn-around time as well as very high energy efficiency for multimedia applications. One of the problems with CGRAs, however, is application mapping, which currently does not scale well with geometrically increasing numbers of cores. To mitigate the scalability problem, this paper discusses how to use the SIMD (Single Instruction Multiple Data) paradigm for CGRAs. While the idea of SIMD is not new, SIMD can complicate the mapping problem by adding an additional dimension of iteration mapping to the already complex, interdependent problems of operation and data mapping, and can thus significantly affect performance through memory bank conflicts. In this paper, based on a new architecture called SIMD reconfigurable architecture, which allows SIMD execution at multiple levels of granularity, we present how to minimize bank conflicts considering all three related sub-problems, for various RA organizations. We also present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of mapping problems. © 2015, Institute of Electronics Engineers of Korea. All rights reserved.
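The bank-conflict issue this abstract refers to can be illustrated with a toy model: SIMD lanes issue their memory accesses in the same cycle, and accesses that fall into the same bank serialize, so how data is tiled across banks directly determines the conflict count. The bank-mapping functions and sizes below are assumptions for illustration, not the paper's scheme.

```python
# Toy illustration (not the paper's algorithm) of memory bank conflicts under
# SIMD execution: lanes access memory in the same cycle, and two accesses to
# the same bank serialize. Interleaved placement spreads unit-stride accesses
# across banks; blocked placement concentrates them in one bank.

NUM_BANKS = 4
ARRAY_LEN = 64

def bank_of(addr, interleaved=True):
    # Interleaved: element i lives in bank i mod NUM_BANKS.
    # Blocked: the array is split into NUM_BANKS contiguous chunks.
    return addr % NUM_BANKS if interleaved else (addr * NUM_BANKS) // ARRAY_LEN

def conflicts_per_cycle(lane_addrs, interleaved):
    banks = [bank_of(a, interleaved) for a in lane_addrs]
    # Extra serialized cycles = accesses beyond the first to any single bank.
    return sum(banks.count(b) - 1 for b in set(banks))

# 4 SIMD lanes each read element base + lane (unit stride):
lanes = [8 + lane for lane in range(4)]
print(conflicts_per_cycle(lanes, interleaved=True))   # 0 conflicts
print(conflicts_per_cycle(lanes, interleaved=False))  # 3 conflicts (same chunk)
```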
... An early work [5] presented a multi-bank double-buffering memory architecture in which the PEs in a row can access the corresponding bank. Fig. 1 shows details of the CGRA, which contains a 4×4 PE array. ...
... Various speedups can be achieved when the right method is applied. Yongjoo et al. propose in [4] memory-aware application mapping to improve the data throughput on CGRAs. Memory bandwidth optimization is discussed in [5]. In this paper, a Hyper Pipelined Reconfigurable Architecture (HPRA) is proposed, which demonstrates improvements on the two aforementioned problems based on a CGRA. ...
Article
The well-known method C-Slow Retiming (CSR) can be used to automatically convert a given CPU into a multithreaded CPU with independent threads. These CPUs are then called streaming or barrel processors. System Hyper Pipelining (SHP) adds new flexibility on top of CSR by allowing a dynamic number of threads to be executed and by enabling threads to be stalled, bypassed, and reordered. SHP is now applied to the programming elements (PEs) of a coarse-grained reconfigurable architecture (CGRA). By using SHP, more performance can be achieved per PE. Fork-join operations can be implemented on a PE using the flexibility provided by SHP to dynamically adjust the number of threads per PE. Multiple threads can share the same data locally, which greatly reduces the data traffic load on the CGRA's routing structure. The paper shows the results of a CGRA using SHP-ed RISC-V cores as PEs, implemented on an FPGA.
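A conceptual model of the SHP-style time-multiplexing described above: each cycle the PE issues from the next ready thread in round-robin order, bypassing stalled threads, so the number of active threads can vary at runtime. This is a behavioral sketch only, not the paper's RISC-V/FPGA implementation.

```python
# Conceptual model (not the paper's RTL) of SHP-style thread interleaving on a
# PE: the pipeline issues one instruction per cycle from the next ready thread
# in round-robin order; stalled or finished threads are bypassed that cycle.
from collections import deque

def run_shp_pe(threads, cycles):
    """threads: dict tid -> iterator of instructions (None means the thread stalls)."""
    ready = deque(threads.keys())
    trace = []
    for _ in range(cycles):
        issued = False
        for _ in range(len(ready)):          # look for the next non-stalled thread
            tid = ready[0]
            ready.rotate(-1)                 # move it to the back (round-robin)
            instr = next(threads[tid], None)
            if instr is not None:
                trace.append((tid, instr))
                issued = True
                break
        if not issued:
            trace.append((None, "bubble"))   # no ready thread this cycle
    return trace

t = {0: iter(["add", "mul"]), 1: iter(["ld", "st", "add"])}
print(run_shp_pe(t, cycles=5))
# [(0, 'add'), (1, 'ld'), (0, 'mul'), (1, 'st'), (1, 'add')]
```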
... With its numerous hardware computing resources, the Reconfigurable Cell Array (RCA) has to handle a large number of high-speed parallel tasks, which creates great demand for massive data throughput in CGRAs. Therefore, a data transfer strategy that is efficient in both throughput and speed is particularly needed to satisfy this requirement [7]. ...
Article
The Coarse Grained Reconfigurable Architectures (CGRAs) are proposed as new choices for enhancing the ability of parallel processing. Data transfer throughput between Reconfigurable Cell Array (RCA) and on-chip local memory is usually the main performance bottleneck of CGRAs. In order to release this stress, we propose a novel data transfer strategy that is called Heuristic Data Prefetch and Reuse (HDPR), for the first time in the case of explicit CGRAs. The HDPR strategy provides not only the flexible data access schedule but also the high data throughput needed to realize fast pipelined implementations of various loop kernels. To improve the data utilization efficiency, a dual-bank cache-like data reuse structure is proposed. Furthermore, a heuristic data prefetch is also introduced to decrease the data access latency. Experimental results demonstrate that when compared with conventional explicit data transfer strategies, our work achieves a significant speedup improvement of, on average, 1.73 times at the expense of only 5.86% increase in area.