Figure 5 - uploaded by Edwin Hsing-Mean Sha
(a) The CW and CCW regions relative to vector p; (b) The extreme CW and CCW vectors of vectors  


Source publication
Article
Full-text available
In this paper, a method combining the loop pipelining technique with data prefetching, called Partition Scheduling with Prefetching (PSP), is proposed. In PSP, the iteration space is first divided into regular partitions. Then a two-part schedule, consisting of the ALU and memory parts, is produced and balanced to achieve high throughput. These two...
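The balancing of the ALU and memory parts can be illustrated with a toy cost model (all parameter names and costs below are hypothetical, not taken from the paper): the ALU part grows with the number of iterations in a partition, while prefetch traffic grows only with the partition boundary, so enlarging the partition until the ALU part covers the memory part hides the memory latency.

```python
import math

def balanced_partition_size(ops_per_iter, alu_units, op_time,
                            prefetches_per_row, prefetch_time,
                            max_k=1024):
    """Smallest k x k partition whose ALU part covers its memory part.

    Toy cost model: the ALU part executes k*k iterations on alu_units
    functional units; the memory part prefetches data along one
    partition boundary (k rows). The two parts run concurrently, so
    memory latency is hidden once alu_time >= mem_time.
    """
    for k in range(1, max_k + 1):
        alu_time = math.ceil(k * k * ops_per_iter / alu_units) * op_time
        mem_time = k * prefetches_per_row * prefetch_time
        if alu_time >= mem_time:
            return k
    return None

# 2 ops/iteration on 2 ALUs, 4 prefetches per boundary row, 2 cycles each
print(balanced_partition_size(2, 2, 1, 4, 2))  # -> 8
```

With these made-up costs the ALU part grows quadratically in k and the memory part only linearly, which is why a large enough partition always balances the schedule.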

Similar publications

Conference Paper
Full-text available
Scheduling is the most important task in the high-level synthesis process, while pipelining is highly important for realising high-performance digital components. This paper presents a pipeline list-based scheduling algorithm, which performs forward and backward pipelining. The forward priority function is based on incorporating some information extrac...
Article
Full-text available
Software pipelining is an efficient method of loop optimization that allows for parallelism of operations related to different loop iterations. Currently, most commercial compilers use loop pipelining methods based on modulo scheduling algorithms. This paper reviews these algorithms and considers algorithmic solutions designed for overcoming the ne...
Conference Paper
Full-text available
Due to the increasing desire for safe and (semi-)automated parallelization of software, the scheduling of automatically generated task graphs becomes increasingly important. Previous static scheduling algorithms assume negligible run-time overhead of spawning and joining tasks. We show that this overhead is significant for small- to medium-sized ta...
Conference Paper
Full-text available
In many application domains, it is desirable to meet some user-defined performance requirement while minimizing resource usage and optimizing additional performance parameters. For example, application workflows with real-time constraints may have strict throughput requirements and desire a low latency or response-time. The structure of these workf...
Article
Full-text available
Many applications commonly found in digital signal processing and image processing can be represented by data-flow graphs (DFGs). In our previous work, we proposed a new technique, extended retiming, which can be combined with minimal unfolding to transform a DFG into one which is rate-optimal. The result, however, is a DFG with spli...

Citations

... Nested loops are often the performance bottleneck of a program [9]. When a nested loop is executed, the local memory cannot hold all the data the loop requires because of the large loop size. ...
... Note that we consider all computations within one iteration as a single entity, shown as the nodes inside the red circle in Fig. 2 (b). Furthermore, we adopt the multi-dimensional data flow graph (MDFG) [9] to represent a nested loop; Fig. 2 (c) shows the MDFG of the nested loop in this example. ...
Preprint
Full-text available
Non-volatile memory (NVM) is expected to serve as the second-level memory (called remote memory) in the two-level memory hierarchies of the future. However, NVM has limited write endurance, so it is vital to reduce the number of write operations on NVM. Meanwhile, in a two-level memory hierarchy, prefetching is widely used to fetch data before it is actually required, hiding the remote memory access latency. In general, large-scale nested loops are the performance bottleneck of a program because of the write operations on NVM caused by first-level memory (called local memory) misses and data reuse. Loop tiling is a key compiler technique that groups iterations to reduce communication with remote memory. In this paper, we propose a new loop tiling approach that minimizes the write operations on NVMs and completely hides the NVM access latency. Specifically, we introduce a series of theorems to guide loop tiling. Then, a legal tile shape and an optimal tile size selection strategy are proposed according to the data dependencies and the local memory capacity. Furthermore, we propose a pipeline scheduling policy to completely hide the remote memory latency. Extensive experiments show that the proposed techniques reduce write operations on NVMs by 95.1% on average while completely hiding NVM latency.
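The loop-tiling idea in the abstract — grouping iterations so that only one tile's data must sit in local memory at a time — can be sketched as follows (the grid and tile sizes are illustrative; this is not the paper's tile selection strategy):

```python
def tiled_sum(grid, ti, tj):
    """Sum a 2-D grid tile by tile (loop tiling sketch).

    Only one ti x tj tile of the grid needs to reside in local memory
    at any moment; an order-independent reduction gives the same
    result as the untiled loop nest.
    """
    n, m = len(grid), len(grid[0])
    total = 0
    for ii in range(0, n, ti):           # step over tile rows
        for jj in range(0, m, tj):       # step over tile columns
            for i in range(ii, min(ii + ti, n)):
                for j in range(jj, min(jj + tj, m)):
                    total += grid[i][j]
    return total

grid = [[i * 7 + j for j in range(7)] for i in range(5)]
print(tiled_sum(grid, 2, 3))  # -> 595, same as the untiled sum
```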
... These problems (e.g., a large class of DSP applications) are characterized by nested loops with uniform data dependencies. Loop pipelining techniques [4][5][6] and partition techniques [12,13] are widely used to exploit instruction-level parallelism, so that a good schedule with high performance can be obtained. ...
... Passos and Sha proposed a multilevel partition technique to reduce SPM data loss, minimizing the memory access overhead of multi-dimensional DSP applications [20]. Chen et al. presented a novel partition technique to reduce memory access overhead [4]. However, these two partition techniques cannot achieve full parallelism. ...
... In this section, we provide a motivational example that shows the effectiveness of the TLP technique. In the example, we compare the schedules generated by a classical scheduling technique, the Partition Scheduling with Prefetching (PSP) algorithm [4], with those generated by the TLP scheduling algorithm. For simplicity, we only illustrate the results of these two techniques without going into the details of how each result is generated. ...
Article
Most scientific and digital signal processing (DSP) applications are recursive or iterative. The execution of these applications on a chip multiprocessor (CMP) encounters two challenges. First, as most digital signal processing applications are both computation intensive and data intensive, an inefficient scheduling scheme may generate a huge number of Write operations, cost a lot of time, and consume a significant amount of energy. Second, because CPU speed has increased dramatically relative to memory speed, the slowness of memory hinders overall system performance. In this paper, we develop a Two-Level Partition (TLP) algorithm that minimizes Write operations while achieving full parallelism for multi-dimensional DSP applications running on CMPs that employ scratchpad memory (SPM) as on-chip memory (e.g., the IBM Cell processor). Experiments on DSP benchmarks demonstrate the effectiveness and efficiency of the TLP algorithm: it can completely hide memory latencies to achieve full parallelism and generates the fewest Write operations to main memory compared with previous approaches. Experimental results show that our proposed algorithm is superior to all known methods, including the list scheduling, rotation scheduling, Partition Scheduling with Prefetching (PSP), and Iterational Retiming with Partitioning (IRP) algorithms. Furthermore, the TLP scheduling algorithm reduces Write operations to main memory by 45.35% and the schedule length by 23.7% on average compared with the IRP scheduling algorithm, the best previously known algorithm.
... Xue et al. [6] introduced a data prefetching technique to hide the memory latency of accessing off-chip memory. Other previous works, such as [7][8][9], also address the data transfer problem. In this paper, our target on-chip memory is organized as a Virtually Shared SPM (VS-SPM) architecture [10], where accesses to remote SPMs cost many more clock cycles than accesses to local SPMs. ...
Article
Full-text available
One of the most critical components that determine the success of an MPSoC-based architecture is its on-chip memory. Scratch Pad Memory (SPM) is increasingly being applied to substitute for cache as the on-chip memory of embedded MPSoCs due to its superior chip area, power consumption, and timing predictability. SPM can be organized as a Virtually Shared SPM (VS-SPM) architecture that takes advantage of both shared and private SPM. However, making effective use of the VS-SPM architecture strongly depends on two inter-dependent problems: variable partitioning and task scheduling. In this paper, we decouple these two problems and solve them in a phase-ordered manner. We propose two variable partitioning heuristics based on an initial schedule: High Access Frequency First (HAFF) variable partitioning and Global View Prediction (GVP) variable partitioning. Then, we present a loop pipeline scheduling algorithm known as Rotation Scheduling with Variable Partitioning (RSVP) to improve overall throughput. Our experimental results obtained on MiBench show that the average performance improvements over IDAS (Integrated Data Assignment with Scheduling) are 23.74% for HAFF and 31.91% for GVP on a four-core MPSoC. The average schedule length generated by RSVP is 25.96% shorter than that of list scheduling with an optimal variable partition.
... Various techniques have been proposed to consider both scheduling and prefetching at the same time. Chen et al. [8,9] show the successful use of the partitioning idea in improving loop schedules. They presented the first available technique that combines loop pipelining, prefetching, and partitioning to optimize the overall loop schedule. ...
... With this type of architecture, the CELL processor gains more than 10 times the performance of a modern top-of-the-line CPU. Each edge carries a delay, which is also known as the dependence vector, and t is the computation time of each node. We use d(e) = (i_1, i_2) as a general formulation of any delay shown in a two-dimensional DFG (2DFG). ...
... (Table 1 column headings: Partition, IRP, List, PSP [9].) In Table 1, we assume that each processor operation takes 1 time unit, each prefetch takes 2 time units, and each write back takes 2 time units. We also assume that inside each processing core there are 2 ALU units available to execute computations. ...
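Under the snippet's stated assumptions (processor operation: 1 time unit, prefetch: 2, write back: 2, 2 ALU units per core), one partition's two-part schedule is bounded by the longer of its ALU and memory parts. A minimal sketch, with made-up operation counts for illustration:

```python
import math

def schedule_length(num_ops, num_prefetches, num_writebacks,
                    alu_units=2, op_time=1,
                    prefetch_time=2, writeback_time=2):
    """Length of a two-part (ALU + memory) partition schedule.

    The defaults mirror the costs assumed in the snippet's Table 1.
    The two parts execute concurrently, so the longer one bounds the
    partition's schedule length.
    """
    alu_part = math.ceil(num_ops / alu_units) * op_time
    mem_part = num_prefetches * prefetch_time + num_writebacks * writeback_time
    return max(alu_part, mem_part)

# 8 ops on 2 ALUs -> 4 units; 3 prefetches + 2 write backs -> 10 units
print(schedule_length(8, 3, 2))  # -> 10 (memory part dominates)
```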
Conference Paper
Full-text available
The widening gap between processor and memory performance is the main bottleneck preventing modern computer systems from achieving high processor utilization. In this paper, we propose a new loop scheduling with memory management technique, iterational retiming with partitioning (IRP), that can completely hide memory latencies for applications with multi-dimensional loops on architectures like the CELL processor (J.A. Kahle et al., 2005). In IRP, the iteration space is first partitioned carefully. Then a two-part schedule, consisting of processor and memory parts, is produced such that the execution time of the memory part never exceeds that of the processor part. These two parts are executed simultaneously, achieving complete memory latency hiding. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions as well as significant improvements over previous techniques.
... It implicitly uses the loop pipelining technique and a task remapping procedure to allocate nodes. In [4], a method combining the loop pipelining technique with data prefetching, called partition scheduling with prefetching, is proposed. In this algorithm, the iteration space is first divided into regular partitions. ...
... Given a loop task graph G = (V, A), we define a replicated task graph G_u = (V_u, A_u) as follows [4]: G_u is an acyclic task graph that represents the body of the loop after it has been unrolled using the unrolling vector u = {u_1, ..., u_n}. The set of nodes V_u is the set of tasks replicated by the unrolling process (|V_u| = |V| * prod_{i=1}^{n} (u_i + 1)). The set of arcs represents the original loop-independent dependencies (the arcs whose weights (dependence sets) have distance vectors with zero elements in G; for example, the arc between node T1 and node T2 in Fig. 1) and the loop-carried dependencies that have become loop-independent as a result of the unrolling. The weight of a replicated node remains the same as that of its original node. ...
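The node-count identity |V_u| = |V| * prod(u_i + 1) from the snippet can be checked directly; a minimal sketch:

```python
from math import prod

def replicated_node_count(num_nodes, u):
    """|V_u| for a loop body of num_nodes tasks and unrolling vector u.

    Each task of the original body appears once per copy of the loop
    body produced by unrolling with u = (u_1, ..., u_n): along
    dimension i the body is replicated u_i + 1 times.
    """
    return num_nodes * prod(ui + 1 for ui in u)

# a 3-task body unrolled once along each of two dimensions -> 3 * 2 * 2
print(replicated_node_count(3, (1, 1)))  # -> 12
```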
Article
Many approaches have been described for the parallel loop scheduling problem on shared-memory systems, but little work has been done on the data-dependent loop scheduling problem (nested loops with loop-carried dependencies). In this paper, we propose a general model for the data-dependent loop scheduling problem on distributed- as well as shared-memory systems. To achieve load balancing with low runtime scheduling and communication overhead, our model is based on a loop task graph and the notion of the critical path. In addition, we develop a heuristic algorithm based on our model and on genetic algorithms to test the reliability of the model. We test our approach on different scenarios and benchmarks. The results are very encouraging and suggest a future parallel compiler implementation based on our model.
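The critical-path notion the model builds on is the longest weighted path through the loop task graph; a minimal sketch (the task names and times are hypothetical):

```python
from functools import lru_cache

def critical_path_length(task_times, arcs):
    """Longest weighted path through an acyclic task graph.

    task_times: {task: computation time}; arcs: (pred, succ) pairs.
    The result lower-bounds any schedule length, which is why
    critical-path tasks are prioritised when balancing load.
    """
    succs = {t: [] for t in task_times}
    for pred, succ in arcs:
        succs[pred].append(succ)

    @lru_cache(maxsize=None)
    def longest_from(task):
        return task_times[task] + max(
            (longest_from(s) for s in succs[task]), default=0)

    return max(longest_from(t) for t in task_times)

tasks = {"load": 2, "mul": 3, "add": 1, "store": 2}
arcs = [("load", "mul"), ("load", "add"),
        ("mul", "store"), ("add", "store")]
print(critical_path_length(tasks, arcs))  # -> 7 (load -> mul -> store)
```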
... We also conduct experiments to explore the opportunities for making a code size/performance trade-off using our algorithm. Our code size reduction technique can easily be combined with optimization techniques that consider memory constraints and data prefetching, such as those in [5, 6, 24]. The rest of the paper is organized as follows: in Section 2, we introduce the necessary background for the CRED technique. ...
Article
Full-text available
The software pipelining technique is extensively used to exploit instruction-level parallelism in loops. However, this performance optimization technique results in code size expansion. For embedded systems with very limited on-chip memory resources, code size becomes one of the most important optimization concerns. This paper presents a fundamental understanding of the relationship between code size expansion and software pipelining based on retiming. We propose a general Code-size REDuction technique (CRED) for software-pipelined loops on various kinds of processors. Our CRED algorithms integrate the code size reduction procedure with software pipelining to produce minimal code size for a target schedule length. Experiments on a set of well-known benchmarks show the effectiveness of the CRED technique in both reducing the code size of software-pipelined loops and exploring the code size/performance trade-off space.
... In this paper, we present the theoretical foundation as well as a code size reduction framework for when software pipelining and unfolding techniques are used. The performance of software-pipelined applications after applying code size reduction can be further improved by optimization techniques that consider memory constraints and data prefetching [3,10]. ...
Conference Paper
Full-text available
Software pipelining and unfolding are commonly used techniques to increase parallelism in DSP applications. However, these techniques expand the code size of the application significantly. For most DSP systems with limited memory resources, code size becomes one of the most critical concerns for high-performance applications. In this paper, we present a code size reduction theory based on the retiming and unfolding concepts. We propose a code size reduction framework that achieves the optimal code size of software-pipelined and unfolded loops by using conditional registers. The experimental results on several well-known benchmarks show the effectiveness of our code size reduction technique in controlling the code size of optimized loops.
... To minimize the memory overhead caused by applying the padding-array scheme, a greedy method is developed to redeclare the arrays in a different order. The well-known loop tiling technique [3,5,12,16,17] and the cache partitioning and mapping scheme presented here are combined for the case in which the arrays in real applications are much larger than the partitioned cache region and data is reused only after many iterations. Atom is employed as a tool to simulate the behavior of the direct-mapped cache, demonstrating that the approach presented here effectively reduces the number of cache conflicts and exploits cache localities. ...
Article
Full-text available
This article presents an algorithm to reduce cache conflicts and improve cache localities. The proposed algorithm analyzes the locality reference space of each reference pattern, partitions the multi-level cache into several parts of different sizes, and then maps array data onto the scheduled cache positions to eliminate cache conflicts. A greedy method for rearranging array variables in declaration statements is also developed to reduce the memory overhead of mapping arrays onto a partitioned cache. In addition, loop tiling and the proposed schemes are combined to exploit opportunities for both temporal and spatial reuse. Atom is used as a tool to simulate the behavior of the direct-mapped cache, demonstrating that our approach is effective at reducing the number of cache conflicts and exploiting cache localities. Experimental results reveal that applying the cache partitioning scheme can greatly reduce cache conflicts and thus save program execution time in both single-level and multi-level cache hierarchies.
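The conflict-elimination effect can be seen with a tiny direct-mapped cache simulation (the cache geometry, array sizes, and addresses are illustrative, not taken from the article): two arrays whose base addresses are exactly one cache size apart evict each other on every interleaved access, while padding one base by a single cache line leaves only compulsory misses.

```python
def count_misses(addresses, num_sets=64, line_size=16):
    """Miss count of a direct-mapped cache over an address trace."""
    cache = {}  # set index -> resident tag
    misses = 0
    for addr in addresses:
        index = (addr // line_size) % num_sets
        tag = addr // (line_size * num_sets)
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag
    return misses

CACHE_BYTES = 64 * 16  # 64 sets of 16-byte lines
ELEM = 4               # 4-byte array elements
N = 256                # each array exactly fills the cache

def trace(a_base, b_base):
    # interleaved a[i], b[i] accesses, as in `for i: c[i] = a[i] + b[i]`
    out = []
    for i in range(N):
        out.append(a_base + i * ELEM)
        out.append(b_base + i * ELEM)
    return out

print(count_misses(trace(0, CACHE_BYTES)))       # -> 512: every access conflicts
print(count_misses(trace(0, CACHE_BYTES + 16)))  # -> 128: compulsory misses only
```

Without padding, a[i] and b[i] always map to the same set with different tags, so all 512 accesses miss; shifting b's base by one line places the two working lines in adjacent sets, leaving one miss per 16-byte line per array.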
Conference Paper
Full-text available
The paper presents an algorithm to reduce cache conflicts and improve cache localities. The proposed algorithm analyzes a unique locality reference space for each reference pattern, partitions the multi-level cache into several parts of different sizes, and then maps array data onto the scheduled cache positions so that cache conflicts can be eliminated. To reduce the memory overhead of mapping array variables onto the partitioned cache, a greedy method for rearranging array variables in declaration statements is also developed. In addition, we combine loop tiling and the proposed schemes to exploit both temporal and spatial reuse opportunities. To demonstrate that our approach is effective at reducing the number of cache conflicts and exploiting cache localities, we use Atom as a tool to develop a simulator of the behavior of a direct-mapped cache. Experimental results show that applying our cache partitioning scheme can greatly reduce cache conflicts and thus save program execution time in both one-level and multi-level cache hierarchies.