Figure 5 - uploaded by Edwin Hsing-Mean Sha
(a) The CW and CCW regions relative to vector p; (b) The extreme CW and CCW vectors of vectors  


Source publication
Article
Full-text available
In this paper, a method combining the loop pipelining technique with data prefetching, called Partition Scheduling with Prefetching (PSP), is proposed. In PSP, the iteration space is first divided into regular partitions. Then a two-part schedule, consisting of the ALU and memory parts, is produced and balanced to achieve high throughput. These two...
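The balancing of the ALU and memory parts can be illustrated with a toy cost model (all parameter names and costs below are hypothetical, not taken from the paper): the ALU part grows with the number of iterations in a partition, while prefetch traffic grows only with the partition boundary, so enlarging the partition until the ALU part covers the memory part hides the memory latency.

```python
import math

def balanced_partition_size(ops_per_iter, alu_units, op_time,
                            prefetches_per_row, prefetch_time,
                            max_k=1024):
    """Smallest k x k partition whose ALU part covers its memory part.

    Toy cost model: the ALU part executes k*k iterations on alu_units
    functional units; the memory part prefetches data along one
    partition boundary (k rows). The two parts run concurrently, so
    memory latency is hidden once alu_time >= mem_time.
    """
    for k in range(1, max_k + 1):
        alu_time = math.ceil(k * k * ops_per_iter / alu_units) * op_time
        mem_time = k * prefetches_per_row * prefetch_time
        if alu_time >= mem_time:
            return k
    return None

# 2 ops/iteration on 2 ALUs, 4 prefetches per boundary row, 2 cycles each
print(balanced_partition_size(2, 2, 1, 4, 2))  # -> 8
```

With these made-up costs the ALU part grows quadratically in k and the memory part only linearly, which is why a large enough partition always balances the schedule.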

Similar publications

Conference Paper
Full-text available
Scheduling is the most important task in the high-level synthesis process, while pipelining is highly important for realising high-performance digital components. This paper presents a pipeline list-based scheduling algorithm, which performs forward and backward pipelining. The forward priority function is based on incorporating some information extrac...
Article
Full-text available
Software pipelining is an efficient method of loop optimization that allows for parallelism of operations related to different loop iterations. Currently, most commercial compilers use loop pipelining methods based on modulo scheduling algorithms. This paper reviews these algorithms and considers algorithmic solutions designed for overcoming the ne...
Conference Paper
Full-text available
Due to the increasing desire for safe and (semi-)automated parallelization of software, the scheduling of automatically generated task graphs becomes increasingly important. Previous static scheduling algorithms assume negligible run-time overhead of spawning and joining tasks. We show that this overhead is significant for small- to medium-sized ta...
Conference Paper
Full-text available
In many application domains, it is desirable to meet some user-defined performance requirement while minimizing resource usage and optimizing additional performance parameters. For example, application workflows with real-time constraints may have strict throughput requirements and desire a low latency or response-time. The structure of these workf...
Article
Full-text available
Many applications commonly found in digital signal processing and image processing can be represented by data-flow graphs (DFGs). In our previous work, we proposed a new technique, extended retiming, which can be combined with minimal unfolding to transform a DFG into one which is rate-optimal. The result, however, is a DFG with spli...

Citations

... Nested loops are often the performance bottleneck of a program [9]. When a nested loop is executed, the local memory cannot hold all the data the loop requires because of the large loop size. ...
... Note that we consider all computations within one iteration as a single entity, shown as the nodes inside the red circle in Fig. 2 (b). Furthermore, we adopt the multi-dimensional data flow graph (MDFG) [9] to represent a nested loop; Fig. 2 (c) shows the MDFG of the nested loop in this example. ...
Preprint
Full-text available
Non-volatile memory (NVM) is expected to serve as the second-level memory (called remote memory) in the two-level memory hierarchies of the future. However, NVM has limited write endurance, so it is vital to reduce the number of write operations on NVM. Meanwhile, in a two-level memory hierarchy, prefetching is widely used to fetch data before it is actually required, hiding the remote memory access latency. In general, large-scale nested loops are the performance bottleneck of a program because of the write operations on NVM caused by first-level memory (called local memory) misses and data reuse. Loop tiling is a key compiler technique that groups iterations to reduce communication with remote memory. In this paper, we propose a new loop tiling approach that minimizes the write operations on NVMs and completely hides the NVM access latency. Specifically, we introduce a series of theorems to guide loop tiling. Then, a legal tile shape and an optimal tile size selection strategy are proposed according to the data dependencies and the local memory capacity. Furthermore, we propose a pipeline scheduling policy to completely hide the remote memory latency. Extensive experiments show that the proposed techniques reduce write operations on NVMs by 95.1% on average while completely hiding NVM latency.
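The loop-tiling idea in the abstract — grouping iterations so that only one tile's data must sit in local memory at a time — can be sketched as follows (the grid and tile sizes are illustrative; this is not the paper's tile selection strategy):

```python
def tiled_sum(grid, ti, tj):
    """Sum a 2-D grid tile by tile (loop tiling sketch).

    Only one ti x tj tile of the grid needs to reside in local memory
    at any moment; an order-independent reduction gives the same
    result as the untiled loop nest.
    """
    n, m = len(grid), len(grid[0])
    total = 0
    for ii in range(0, n, ti):           # step over tile rows
        for jj in range(0, m, tj):       # step over tile columns
            for i in range(ii, min(ii + ti, n)):
                for j in range(jj, min(jj + tj, m)):
                    total += grid[i][j]
    return total

grid = [[i * 7 + j for j in range(7)] for i in range(5)]
print(tiled_sum(grid, 2, 3))  # -> 595, same as the untiled sum
```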
... These problems (e.g., a large class of DSP applications) are characterized by nested loops with uniform data dependencies. Loop pipelining techniques [4][5][6] and partition techniques [12,13] are widely used to exploit instruction-level parallelism, so that a good schedule with high performance can be obtained. ...
... Passos and Sha proposed a multilevel partition technique to reduce SPM data loss, minimizing the memory access overhead of multi-dimensional DSP applications [20]. Chen et al. presented a novel partition technique to reduce memory access overhead [4]. However, these two partition techniques cannot achieve full parallelism. ...
... In this section, we provide a motivational example that shows the effectiveness of the TLP technique. In the example, we compare the schedules generated by a classical scheduling technique, the Partition Scheduling with Prefetching (PSP) algorithm [4], with those generated by the TLP scheduling algorithm. For simplicity, we only illustrate the results of these two techniques without going into the details of how each result is generated. ...
Article
Most scientific and digital signal processing (DSP) applications are recursive or iterative. The execution of these applications on a chip multiprocessor (CMP) encounters two challenges. First, as most digital signal processing applications are both computation intensive and data intensive, an inefficient scheduling scheme may generate a huge number of Write operations, cost a lot of time, and consume a significant amount of energy. Second, because CPU speed has increased dramatically relative to memory speed, the slowness of memory hinders overall system performance. In this paper, we develop a Two-Level Partition (TLP) algorithm that minimizes Write operations while achieving full parallelism for multi-dimensional DSP applications running on CMPs that employ scratchpad memory (SPM) as on-chip memory (e.g., the IBM Cell processor). Experiments on DSP benchmarks demonstrate the effectiveness and efficiency of the TLP algorithm: it can completely hide memory latencies to achieve full parallelism and generates the fewest Write operations to main memory compared with previous approaches. Experimental results show that our proposed algorithm is superior to all known methods, including the list scheduling, rotation scheduling, Partition Scheduling with Prefetching (PSP), and Iterational Retiming with Partitioning (IRP) algorithms. Furthermore, the TLP scheduling algorithm reduces Write operations to main memory by 45.35% and the schedule length by 23.7% on average compared with the IRP scheduling algorithm, the best previously known algorithm.
... Xue et al. [6] introduced a data prefetching technique to hide the memory latency of accessing off-chip memory. Other previous works, such as [7][8][9], also address the data transfer problem. In this paper, our target on-chip memory is organized as a Virtually Shared SPM (VS-SPM) architecture [10], where accesses to remote SPMs cost many more clock cycles than accesses to local SPMs. ...
Article
Full-text available
One of the most critical components that determine the success of an MPSoC-based architecture is its on-chip memory. Scratch Pad Memory (SPM) is increasingly being applied to substitute for cache as the on-chip memory of embedded MPSoCs due to its superior chip area, power consumption, and timing predictability. SPM can be organized as a Virtually Shared SPM (VS-SPM) architecture that takes advantage of both shared and private SPM. However, making effective use of the VS-SPM architecture strongly depends on two inter-dependent problems: variable partitioning and task scheduling. In this paper, we decouple these two problems and solve them in a phase-ordered manner. We propose two variable partitioning heuristics based on an initial schedule: High Access Frequency First (HAFF) variable partitioning and Global View Prediction (GVP) variable partitioning. Then, we present a loop pipeline scheduling algorithm known as Rotation Scheduling with Variable Partitioning (RSVP) to improve overall throughput. Our experimental results obtained on MiBench show that the average performance improvements over IDAS (Integrated Data Assignment with Scheduling) are 23.74% for HAFF and 31.91% for GVP on a four-core MPSoC. The average schedule length generated by RSVP is 25.96% shorter than that of list scheduling with an optimal variable partition.
... Various techniques have been proposed to consider both scheduling and prefetching at the same time. Chen et al. [8,9] show the successful use of the partitioning idea in improving loop schedules. They presented the first available technique that combines loop pipelining, prefetching, and partitioning to optimize the overall loop schedule. ...
... With this type of architecture, the CELL processor gains more than 10 times the performance of a modern top-of-the-line CPU. Each edge carries a delay, which is also known as the dependence vector, and t is the computation time of each node. We use d(e) = (i_1, i_2) as a general formulation of any delay shown in a two-dimensional DFG (2DFG). ...
... (Table 1 column headings: Partition, IRP, List, PSP [9].) In Table 1, we assume that each processor operation takes 1 time unit, each prefetch takes 2 time units, and each write back takes 2 time units. We also assume that inside each processing core there are 2 ALU units available to execute computations. ...
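Under the snippet's stated assumptions (processor operation: 1 time unit, prefetch: 2, write back: 2, 2 ALU units per core), one partition's two-part schedule is bounded by the longer of its ALU and memory parts. A minimal sketch, with made-up operation counts for illustration:

```python
import math

def schedule_length(num_ops, num_prefetches, num_writebacks,
                    alu_units=2, op_time=1,
                    prefetch_time=2, writeback_time=2):
    """Length of a two-part (ALU + memory) partition schedule.

    The defaults mirror the costs assumed in the snippet's Table 1.
    The two parts execute concurrently, so the longer one bounds the
    partition's schedule length.
    """
    alu_part = math.ceil(num_ops / alu_units) * op_time
    mem_part = num_prefetches * prefetch_time + num_writebacks * writeback_time
    return max(alu_part, mem_part)

# 8 ops on 2 ALUs -> 4 units; 3 prefetches + 2 write backs -> 10 units
print(schedule_length(8, 3, 2))  # -> 10 (memory part dominates)
```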
Conference Paper
Full-text available
The widening gap between processor and memory performance is the main bottleneck preventing modern computer systems from achieving high processor utilization. In this paper, we propose a new loop scheduling with memory management technique, iterational retiming with partitioning (IRP), that can completely hide memory latencies for applications with multi-dimensional loops on architectures like the CELL processor (J.A. Kahle et al., 2005). In IRP, the iteration space is first partitioned carefully. Then a two-part schedule, consisting of processor and memory parts, is produced such that the execution time of the memory part never exceeds that of the processor part. These two parts are executed simultaneously, achieving complete memory latency hiding. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions as well as significant improvements over previous techniques.
... It implicitly uses the loop pipelining technique and a task remapping procedure to allocate nodes. In [4], a method combining the loop pipelining technique with data prefetching, called partition scheduling with prefetching, is proposed. In this algorithm, the iteration space is first divided into regular partitions. ...
... Given a loop task graph G = (V, A), we define a replicated task graph G_u = (V_u, A_u) as follows [4]: G_u is an acyclic task graph that represents the body of the loop after it has been unrolled using the unrolling vector u = {u_1, ..., u_n}. The set of nodes V_u is the set of tasks replicated by the unrolling process (|V_u| = |V| * prod_{i=1}^{n} (u_i + 1)). The set of arcs represents the original loop-independent dependencies (the arcs whose weights (dependence sets) have distance vectors with zero elements in G; for example, the arc between node T1 and node T2 in Fig. 1) and the loop-carried dependencies that have become loop-independent as a result of the unrolling. The weight of a replicated node remains the same as that of its original node. ...
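The node-count identity |V_u| = |V| * prod(u_i + 1) from the snippet can be checked directly; a minimal sketch:

```python
from math import prod

def replicated_node_count(num_nodes, u):
    """|V_u| for a loop body of num_nodes tasks and unrolling vector u.

    Each task of the original body appears once per copy of the loop
    body produced by unrolling with u = (u_1, ..., u_n): along
    dimension i the body is replicated u_i + 1 times.
    """
    return num_nodes * prod(ui + 1 for ui in u)

# a 3-task body unrolled once along each of two dimensions -> 3 * 2 * 2
print(replicated_node_count(3, (1, 1)))  # -> 12
```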
Article
Many approaches have been described for the parallel loop scheduling problem on shared-memory systems, but little work has been done on the data-dependent loop scheduling problem (nested loops with loop-carried dependencies). In this paper, we propose a general model for the data-dependent loop scheduling problem on distributed- as well as shared-memory systems. To achieve load balancing with low runtime scheduling and communication overhead, our model is based on a loop task graph and the notion of the critical path. In addition, we develop a heuristic algorithm based on our model and on genetic algorithms to test the reliability of the model. We test our approach on different scenarios and benchmarks. The results are very encouraging and suggest a future parallel compiler implementation based on our model.
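The critical-path notion the model builds on is the longest weighted path through the loop task graph; a minimal sketch (the task names and times are hypothetical):

```python
from functools import lru_cache

def critical_path_length(task_times, arcs):
    """Longest weighted path through an acyclic task graph.

    task_times: {task: computation time}; arcs: (pred, succ) pairs.
    The result lower-bounds any schedule length, which is why
    critical-path tasks are prioritised when balancing load.
    """
    succs = {t: [] for t in task_times}
    for pred, succ in arcs:
        succs[pred].append(succ)

    @lru_cache(maxsize=None)
    def longest_from(task):
        return task_times[task] + max(
            (longest_from(s) for s in succs[task]), default=0)

    return max(longest_from(t) for t in task_times)

tasks = {"load": 2, "mul": 3, "add": 1, "store": 2}
arcs = [("load", "mul"), ("load", "add"),
        ("mul", "store"), ("add", "store")]
print(critical_path_length(tasks, arcs))  # -> 7 (load -> mul -> store)
```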
... We also conduct experiments to explore the opportunities for making a code size/performance trade-off using our algorithm. Our code size reduction technique can easily be combined with optimization techniques that consider memory constraints and data prefetching, such as those in [5, 6, 24]. The rest of the paper is organized as follows: in Section 2, we introduce the necessary background for the CRED technique. ...
Article
Full-text available
The software pipelining technique is extensively used to exploit instruction-level parallelism in loops. However, this performance optimization technique results in code size expansion. For embedded systems with very limited on-chip memory resources, code size becomes one of the most important optimization concerns. This paper presents a fundamental understanding of the relationship between code size expansion and software pipelining based on retiming. We propose a general Code-size REDuction technique (CRED) for software-pipelined loops on various kinds of processors. Our CRED algorithms integrate the code size reduction procedure with software pipelining to produce minimal code size for a target schedule length. Experiments on a set of well-known benchmarks show the effectiveness of the CRED technique in both reducing the code size of software-pipelined loops and exploring the code size/performance trade-off space.
... In this paper, we present the theoretical foundation as well as a code size reduction framework for when software pipelining and unfolding techniques are used. The performance of software-pipelined applications after applying code size reduction can be further improved by optimization techniques that consider memory constraints and data prefetching [3,10]. ...
Conference Paper
Full-text available
Software pipelining and unfolding are commonly used techniques to increase parallelism in DSP applications. However, these techniques expand the code size of the application significantly. For most DSP systems with limited memory resources, code size becomes one of the most critical concerns for high-performance applications. In this paper, we present a code size reduction theory based on the retiming and unfolding concepts. We propose a code size reduction framework that achieves the optimal code size of software-pipelined and unfolded loops by using conditional registers. The experimental results on several well-known benchmarks show the effectiveness of our code size reduction technique in controlling the code size of optimized loops.
... To minimize the memory overhead caused by applying the padding-array scheme, a greedy method is developed to redeclare the arrays in a different order. The well-known loop tiling technique [3,5,12,16,17] and the cache partitioning and mapping scheme presented here are combined for the case in which the arrays in real applications are much larger than the partitioned cache region and data is reused only after many iterations. Atom is employed as a tool to simulate the behavior of the direct-mapped cache, demonstrating that the approach presented here effectively reduces the number of cache conflicts and exploits cache localities. ...
Article
Full-text available
This article presents an algorithm to reduce cache conflicts and improve cache localities. The proposed algorithm analyzes the locality reference space of each reference pattern, partitions the multi-level cache into several parts of different sizes, and then maps array data onto the scheduled cache positions to eliminate cache conflicts. A greedy method for rearranging array variables in declaration statements is also developed to reduce the memory overhead of mapping arrays onto a partitioned cache. In addition, loop tiling and the proposed schemes are combined to exploit opportunities for both temporal and spatial reuse. Atom is used as a tool to simulate the behavior of the direct-mapped cache, demonstrating that our approach is effective at reducing the number of cache conflicts and exploiting cache localities. Experimental results reveal that applying the cache partitioning scheme can greatly reduce cache conflicts and thus save program execution time in both single-level and multi-level cache hierarchies.
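The conflict-elimination effect can be seen with a tiny direct-mapped cache simulation (the cache geometry, array sizes, and addresses are illustrative, not taken from the article): two arrays whose base addresses are exactly one cache size apart evict each other on every interleaved access, while padding one base by a single cache line leaves only compulsory misses.

```python
def count_misses(addresses, num_sets=64, line_size=16):
    """Miss count of a direct-mapped cache over an address trace."""
    cache = {}  # set index -> resident tag
    misses = 0
    for addr in addresses:
        index = (addr // line_size) % num_sets
        tag = addr // (line_size * num_sets)
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag
    return misses

CACHE_BYTES = 64 * 16  # 64 sets of 16-byte lines
ELEM = 4               # 4-byte array elements
N = 256                # each array exactly fills the cache

def trace(a_base, b_base):
    # interleaved a[i], b[i] accesses, as in `for i: c[i] = a[i] + b[i]`
    out = []
    for i in range(N):
        out.append(a_base + i * ELEM)
        out.append(b_base + i * ELEM)
    return out

print(count_misses(trace(0, CACHE_BYTES)))       # -> 512: every access conflicts
print(count_misses(trace(0, CACHE_BYTES + 16)))  # -> 128: compulsory misses only
```

Without padding, a[i] and b[i] always map to the same set with different tags, so all 512 accesses miss; shifting b's base by one line places the two working lines in adjacent sets, leaving one miss per 16-byte line per array.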
Conference Paper
Full-text available
The paper presents an algorithm to reduce cache conflicts and improve cache localities. The proposed algorithm analyzes a unique locality reference space for each reference pattern, partitions the multi-level cache into several parts of different sizes, and then maps array data onto the scheduled cache positions so that cache conflicts can be eliminated. To reduce the memory overhead of mapping array variables onto the partitioned cache, a greedy method for rearranging array variables in declaration statements is also developed. In addition, we combine loop tiling and the proposed schemes to exploit both temporal and spatial reuse opportunities. To demonstrate that our approach is effective at reducing the number of cache conflicts and exploiting cache localities, we use Atom as a tool to develop a simulator of the behavior of a direct-mapped cache. Experimental results show that applying our cache partitioning scheme can greatly reduce cache conflicts and thus save program execution time in both one-level and multi-level cache hierarchies.