Figure 1 - uploaded by Josep Llosa
Basic Code Generation Schema 

Source publication
Conference Paper
Full-text available
Modulo scheduling is a very effective instruction scheduling technique that exploits Instruction Level Parallelism (ILP) in loop bodies by overlapping the execution of successive iterations. Unfortunately, modulo scheduling has been shown to cause heavy code expansion. To avoid the penalties of code expansion, some processors have dedicated hardwar...

Contexts in source publication

Context 1
... this section we analyze the impact of the code size reduction on performance on a detailed benchmark-by-benchmark basis. We have excluded the non-speculative schemes because the number of loops for which a valid modulo schedule can be generated with those schemas is very limited, especially in the case of SpecInt2000. We have used IMS with a speculative code generation schema as an example of a code size insensitive modulo scheduler, our default modulo scheduler (LxMS), also speculative, to show the benefits of a code size sensitive modulo scheduler, and finally LxMS with all optimizations from section 6 enabled (this is the default for our compiler). All benchmarks have been run to completion and the output has been validated for all runs. Figure 9 shows the loop expansion ratio. Notice that almost all the individual results follow the trend of the global numbers previously shown. In all cases, LxMS causes significantly less code expansion than IMS, which is more pronounced for the BenchSuite benchmarks due to the larger loops. Also, for all benchmarks except adpcm (in which the extra copies are worse than the epilogue reduction), the proposed schemas reduce the code expansion even more. Notice that although the code size reduction mechanisms make a small contribution on average, there are a few benchmarks (like mp4dec and opendivx) where these mechanisms make an important contribution. Finally, there is the exceptional case of csc, where LxMS manages to schedule the three inner loops of the program with SC=1 and MVE=1, using the same II as in IMS, which results in no code expansion at all. For the SpecInt2000 benchmarks, the binary size reduction is minimal because they are mostly huge programs with very small loops. However, for the BenchSuite, the binary size reduction is very significant: 30% on average and 67.5% in the best case. Recall that BenchSuite is representative of the embedded Lx target domain, where binary size is sometimes critical.
Figure 10 shows the performance relative to IMS with no stalls, i.e., the ideal execution time assuming no cache miss penalty and no branch penalty. In general, the performance improvement/degradation is almost negligible, being in the -2% to +3% range in the most extreme cases. We have observed that although LxMS produces slightly smaller IIs, this is not a determinant factor on performance. The two cases (bmark and mp2audio) in which LxMS improves performance are due to the fact that loops having a single epilogue allow that epilogue to be scheduled in combination with the succeeding basic block. When MVE reduction and epilogue collapsing are applied, this effect is emphasized. The three cases (crypto, mp4dec and opendivx) where LxMS degrades performance are due to loops with short iteration counts that are executed many times. Our compiler schedules prologues and epilogues using a different pattern than in the kernel, benefiting from the lower resource requirements of these two pieces to generate more efficient schedules than what can be achieved with kernel-only code. If the loop performs very few iterations, the code in the prologue-epilogue can be more efficient for schedulers that speculate less (IMS). This situation is aggravated by the copies in copy blocks and the serialization of the epilogue to reduce its size. However, on average, the performance is not ...
Context 2
... additional problem that must be solved is the presence of loop variants with a lifetime longer than II. In this case, a new value of the loop variant is generated before the value generated in the previous iteration is consumed. One approach to fix this problem is renaming the register using rotating register files [4]. In the absence of such hardware support, Modulo Variable Expansion (MVE) [12] can solve this problem. When MVE is applied, the kernel is unrolled and registers are renamed to prevent successive lifetimes of the same original loop variant from overlapping in time. The minimum degree of unroll, K_min, is determined by the longest lifetime as K_min = ⌈(longest lifetime) / II⌉. Figure 1 shows the code schema to be generated for a loop with SC = 4, BS = 1, and K_min = 2. This schema can be simplified [19]; however, we choose not to do so to improve readability. ...
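The MVE unroll bound discussed above can be sketched as a one-line computation. This is a minimal illustration, not code from the paper; the function name and the cycle-count inputs are hypothetical.

```python
import math

def k_min(lifetimes, ii):
    """Minimum kernel unroll degree for Modulo Variable Expansion.

    A variant whose lifetime exceeds II must survive across kernel
    iterations, so the kernel is unrolled and the variant renamed in
    each copy. The longest lifetime dictates the unroll degree:
    K_min = ceil(max lifetime / II).
    """
    return max(1, math.ceil(max(lifetimes) / ii))

# Hypothetical loop matching the Figure 1 example: II = 3 cycles and a
# longest lifetime of 5 cycles give K_min = 2.
print(k_min([2, 5, 3], 3))  # prints 2
```

Variants whose lifetimes fit within II contribute nothing beyond the `max(1, ...)` floor, which is why only the longest lifetime matters.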
Context 3
... Figure 11 shows the real execution time. We have observed that the D-Cache stall cycles are practically the same for the three cases. Branch penalty cycles are slightly higher for LxMS than for IMS (due to smaller MVE) and even higher for LxMS-mve-ep; however, they have a minimal impact on performance. The most important factor that varies significantly with code size reduction is I-Cache Stalls. There are three benchmarks where performance is significantly improved by reducing loop code expansion (mp4dec +20%, mpeg2 +14% and opendivx +20%). Of these, in mp4dec and opendivx the code reduction schemas play a critical role. There are several other benchmarks where reducing loop expansion produces a small improvement in I-Cache stalls. Finally, there are a few benchmarks where reducing code size actually increases I-Cache stalls. The most notable case is perlbmk where the code size reduction produced by LxMS over IMS reduces I-Cache capacity misses, but causes two critical functions to conflict in the I-Cache, thus increasing the conflict misses; when code size is further reduced they do not conflict any longer, and we benefit from reduced I-Cache capacity misses. A number of ...
Context 4
... that in Figure 1 and Figure 3 the epilogues corresponding to the kernel and the last prologue exits are identical. They only differ in the registers they read. It is relatively easy to make the compiler assign the same registers in the prologue and in the different kernel replications. In that way, all epilogues will be identical and could be collapsed together into a single epilogue. However, the long lifetimes that span for more than II cycles cannot be assigned to the same register. Note that this is the reason for MVE in the first ...
Context 5
... we have evaluated the proposed techniques to reduce code expansion with different schedulers, with and without speculation, and with different machine configurations. Lx corresponds to the ST210 processor used throughout this paper. LxP corresponds to a processor where the load latency (32, 16 and 8 bit) has been increased to 4 cycles, multiplies take 3 cycles, and the compare-to-branch latency is 3 cycles. LxSP is like LxP but with 8-cycle loads. Figure 12 shows the code size expansion for these more aggressive configurations for both IMS and LxMS. For each processor configuration and each scheduler there are four columns: Basic corresponds to the basic code schema, and +mve+ep corresponds to the schema with all epilogues collapsed and copies inserted to reduce MVE. Finally, the Spec columns correspond to the equivalent speculative ...
Context 6
... Figure 6, we can see that the partial prologue-epilogue actually corresponds to a subset of the full kernel epilogue. Notice that this is also true for the non-speculative case of Figure 1. In general, any partial epilogue corresponding to an early exit is a subset of the next partial epilogue. The code generated for one partial epilogue can be reused by the next epilogue if the completion of iterations is done in a sequential way, instead of a parallel way. Of course, the lifetimes are still a problem and copy blocks may be required for the early exits. Figure 7 shows the corresponding schema for our ...
Context 7
... this section, we analyze which factors contribute to code size, and which heuristics can help to minimize code size in a modulo scheduler. From Figure 1, it can be seen that there are three factors contributing to code expansion. The number of stages determines the size of the prologue and the size of the epilogues. A reduced number of stages will produce a smaller prologue and smaller epilogues. The stage in which the loop branch instruction is scheduled partially determines the number of epilogues, since each early exit in the prologue requires a partial epilogue. The later the branch is scheduled, the fewer epilogues are required. Finally, the K_min degree of MVE determines how many copies of the kernel are required. A smaller K_min will require less code expansion due to the kernel. In addition, by reducing the number of kernel replications we are also reducing the number of epilogues. Table 3 shows the average K_min, the average Stage Count, and the average number of early exits for the three schedulers considered. LxMS, despite producing schedules with smaller IIs, results in the smallest loop size expansion because it succeeds in minimizing all three factors, while SMS results in an intermediate loop size expansion because it fails to reduce the stage count. (SMS performs worse on SpecInt2000 due to one pathological case, see the comment on early/late start limits in the Stage Count section; if this one case were removed, the code size would also be between the code size for IMS and LxMS.) Next, we outline the characteristics that modulo scheduling heuristics must have in order to improve each one of these factors. The above modulo schedulers are used as practical ...
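The three factors can be combined into a rough back-of-the-envelope model. This is only a sketch under the simplifying assumptions stated in the comments; none of these names come from the paper.

```python
def expansion_factors(sc, k_min, branch_stage):
    """Rough model of the three code-expansion factors.

    sc           -- stage count; the prologue contains sc - 1 stages
    k_min        -- MVE degree; the kernel is replicated k_min times
    branch_stage -- 0-based stage in which the loop branch is placed;
                    prologue stages from branch_stage onward each take
                    an early exit needing its own partial epilogue
    Returns (prologue stages, kernel copies, epilogue count), assuming
    one epilogue per kernel replication plus one per early exit.
    """
    prologue_stages = sc - 1
    early_exits = max(0, prologue_stages - branch_stage)
    epilogues = early_exits + k_min
    return prologue_stages, k_min, epilogues

# Scheduling the branch in the last stage (stage sc - 1) removes all
# early exits, leaving only the kernel-exit epilogues.
print(expansion_factors(4, 2, 3))  # prints (3, 2, 2)
```

The model makes the trade-offs visible: lowering `sc`, lowering `k_min`, or scheduling the branch later each shrinks one term of the total.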
Context 8
... the loop can eliminate the multiple epilogues in Figure 1. A non-software-pipelined version of the loop is first executed until the number of remaining iterations for the loop kernel is a multiple of K_min. In that way, only one epilogue is required at the exit of the kernel. Likewise, the preconditioning loop can also be used when the trip count of the loop is smaller than SC, so that the early exit epilogues can also be removed. Figure 3 shows the code schema to be generated for our hypothetical loop. However, notice that in terms of code size the preconditioning loop contributes a full copy of the original loop body that must also be considered. The main drawback of this schema is that the first few iterations executed by the preconditioning loop are executed at a lower efficiency rate, reducing the overall performance. The code schemas in Figure 1 and Figure 3 require the loop trip count to be known before entering the loop. This is acceptable for numeric applications; however, for general applications it is interesting to be able to modulo schedule loops with conditional exits. This can be achieved if speculative modulo scheduling is implemented. In that case, iterations are started speculatively before knowing that they must be executed. Once the exit branch is executed, the iterations started speculatively are not completed in the epilogue, since they should never be executed. As can be seen in Figure 4, this leads to a code schema where the epilogues have reduced size, since only the non-speculative iterations must be completed. Figure 5 compares the loop expansion ratio of the basic code schema of Figure 1 (Basic) with loop preconditioning (Prec) and speculative modulo scheduling (Spec). We observed that for some loops, the code size contribution of the preconditioning loop was larger than the size of the epilogues it removed. We also compare the loop expansion ratio of applying loop preconditioning only when it helps to reduce code size (Prec++).
For the default Lx scheduler (LxMS), speculation is consistently superior to all other code schemas in terms of code size. This is because, as mentioned in the previous section, LxMS tends to schedule the branch as late as possible in the schedule. In other words, LxMS performs aggressive speculation, minimizing the size of the ...
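The iteration-count bookkeeping behind loop preconditioning can be sketched as follows. This is a simplified model: the exact peel condition in a real compiler also depends on details of the schedule, and the names are hypothetical.

```python
def precondition_iterations(trip_count, sc, k_min):
    """Iterations to peel into the scalar preconditioning loop.

    If the trip count cannot even fill the pipeline (trip_count < sc),
    every iteration runs in the scalar loop and the pipelined code is
    skipped, which removes the early-exit epilogues. Otherwise we peel
    just enough iterations that the remainder is a multiple of k_min,
    so a single epilogue suffices at the kernel exit.
    """
    if trip_count < sc:
        return trip_count
    return trip_count % k_min

for n in (3, 10, 11):
    print(n, precondition_iterations(n, sc=4, k_min=2))
# prints: 3 3 / 10 0 / 11 1
```

Note the code-size cost the text mentions: however few iterations end up peeled at run time, the scalar preconditioning loop is a full static copy of the loop body.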

Similar publications

Conference Paper
Full-text available
As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performanc...

Citations

... This method is also called register renaming. Considering the example of Listing 1 for allocating a[i], we use 3 registers and perform move operations at the end of each iteration [14], [15], [13]: a[i] in register R1, a[i + 1] in register R2 and a[i + 2] in register R3. Then we use move operations to shift registers across the register file at every iteration as shown in Listing 2. However, it is easy to see that if variable v spans d iterations, we have to insert d − 1 extra move operations at each iteration. ...
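The move overhead described in this citation can be reproduced with a small simulation. This is illustrative only; the helper names are ours, not those of Listing 2.

```python
def pipeline_with_renaming(a, span):
    """Simulate move-based register renaming for a value that stays
    alive across `span` iterations (span = 3 models a[i], a[i+1] and
    a[i+2] occupying R1..R3 at once). Returns the values consumed and
    the total number of move operations: span - 1 moves per iteration.
    """
    regs = list(a[:span])          # R1..Rspan preloaded
    consumed, moves = [], 0
    for i in range(len(a) - span + 1):
        consumed.append(regs[0])   # the oldest value, a[i], is in R1
        # shift registers: R1 <- R2, ..., R_{span-1} <- R_span,
        # then load the next element (if any) into R_span
        nxt = a[i + span] if i + span < len(a) else None
        regs = regs[1:] + [nxt]
        moves += span - 1
    return consumed, moves

vals, moves = pipeline_with_renaming(list(range(6)), 3)
print(vals, moves)  # prints [0, 1, 2, 3] 8
```

With span = 3 and 4 overlapped iterations, the 8 moves are exactly the d − 1 = 2 extra operations per iteration the citation counts.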
Conference Paper
Software pipelining is a powerful technique to expose fine-grain parallelism, but it results in variables staying alive across more than one kernel iteration. It requires periodic register allocation and is challenging for code generation: the lack of a reliable solution currently restricts the applicability of software pipelining. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori [11], [10]. However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations, but this may increase the initiation interval (II), which nullifies the benefits of software pipelining. This article aims at tightly controlling the post-pass loop unrolling necessary to generate code. We study the potential of live range splitting to reduce kernel loop unrolling, introducing additional move instructions without increasing the II. We provide a complete formalisation of the problem, an algorithm, and extensive experiments. Our algorithm yields low unrolling degrees in most cases - with no increase of the II.
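One way to see why the a-posteriori unrolling degree "may reach absurd levels": if each variant v needs its renaming pattern to repeat every k_v kernel copies, some periodic allocation schemes only close after the least common multiple of all the k_v. The computation below is an illustrative sketch of that combinatorial blow-up, not the cited paper's algorithm.

```python
from functools import reduce
from math import gcd

def allocation_unroll(k_values):
    """Unroll degree at which every variant's periodic renaming
    pattern repeats simultaneously: the lcm of the individual
    degrees k_v."""
    return reduce(lambda a, b: a * b // gcd(a, b), k_values, 1)

# Three hypothetical variants needing 2, 3 and 5 copies individually
# force 30 kernel copies, far beyond max(k_v) = 5.
print(allocation_unroll([2, 3, 5]))  # prints 30
```

Live range splitting attacks exactly this: inserting a few moves reshapes the individual k_v so their lcm stays small without raising the II.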
... Modulo scheduling is a class of software pipelining algorithms that achieve high quality solutions and have been implemented in production compilers [58]. A number of extensions to modulo scheduling have been proposed to increase the quality of the solution, including reducing register requirements [14, 28, 40] and code size [41]. Reducing register requirements is most closely related to accelerator cost reduction. ...
Article
As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. 
Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation. Ph.D. Computer Science & Engineering University of Michigan, Horace H. Rackham School of Graduate Studies http://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pdf
... At present, commercial compilers often use the method of modulo scheduling because there are efficient algorithms that find optimal or nearly optimal solutions of the problem for modern processor architectures [9][10][11][12]. This method can be applied to a wide class of loops, including those with conditional constructs, loops with the number of iterations unknown before the execution, and loops with multiple exits. ...
... To overcome these negative features, some general-purpose microprocessor architectures have special hardware features, such as rotating register files and predicated execution. However, microprocessor architectures designed for embedded systems and digital signal processing normally have no such features because their implementation leads to considerably more complicated architectures and a larger chip size, as well as to other problems [11]. Therefore, it is important to conduct research and algorithmic investigations aimed at reducing the code size and decreasing the register requirements of loop pipelining. ...
... The method of modulo scheduling produces a schedule for performing iterations of the loop (taking into account the dependences between instructions) such that the iterations run with a constant initiation interval II and have no resource conflicts [9][10][11]. There are approaches based on a search for an optimal schedule through the whole space of schedules, as well as heuristic scheduling algorithms. ...
Article
Full-text available
Software pipelining is an efficient method of loop optimization that allows for parallelism of operations related to different loop iterations. Currently, most commercial compilers use loop pipelining methods based on modulo scheduling algorithms. This paper reviews these algorithms and considers algorithmic solutions designed for overcoming the negative effects of software pipelining of loops (a significant growth in code size and increase in the register pressure) as well as methods making it possible to use some hardware features of a target architecture. The paper considers global-scheduling mechanisms allowing one to pipeline loops containing a few basic blocks and loops in which the number of iterations is not known before the execution.
... Modulo scheduling is a class of software pipelining algorithms that achieve high quality solutions and have been implemented in production compilers [27]. A number of extensions to modulo scheduling have been proposed to increase the quality of the solution, including reducing register requirements [15, 9, 18] and code size [19]. Reducing register requirements is most closely related to accelerator cost reduction. ...
Conference Paper
Scheduling algorithms used in compilers traditionally focus on goals such as reducing schedule length and register pressure or producing compact code. In the context of a hardware synthesis system where the schedule is used to determine various components of the hardware, including datapath, storage, and interconnect, the goals of a scheduler change drastically. In addition to achieving the traditional goals, the scheduler must proactively make decisions to ensure efficient hardware is produced. This paper proposes two exact solutions for cost sensitive modulo scheduling, one based on an integer linear programming formulation and another based on branch-and-bound search. To achieve reasonable compilation times, decomposition techniques to break down the complex scheduling problem into phase ordered sub-problems are proposed. The decomposition techniques work either by partitioning the dataflow graph into smaller subgraphs and optimally scheduling the subgraphs, or by splitting the scheduling problem into two phases, time slot and resource assignment. The effectiveness of cost sensitive modulo scheduling in minimizing the costs of function units, register structures, and interconnection wires is evaluated within a fully automatic synthesis system for loop accelerators. The cost sensitive modulo scheduler increases the efficiency of the resulting hardware significantly compared to both traditional cost unaware and greedy cost aware modulo schedulers.
... The code generation approach for modulo scheduling in the Cydra-5 compiler has been discussed in [3]. Code size reduction for software pipelined loops has been discussed in [7,4]. All these works consider software pipelining only for the innermost loop. ...
Conference Paper
Full-text available
Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to the outer loops. We proposed a scheduling method, called single-dimension software pipelining (SSP), to software pipeline a multidimensional loop nest at an arbitrary loop level. We describe our solution to SSP code generation. In contrast to traditional software pipelining, SSP handles two distinct repetitive patterns, and thus requires new code generation algorithms. Further, these two distinct repetitive patterns complicate register assignment and require two levels of register renaming. As rotating registers support renaming at only one level, our solution is based on a combination of dynamic register renaming (using rotating registers) and static register renaming (using code replication). Finally, code size increase, an even more important issue for SSP than for traditional software-pipelining, is also addressed. Optimizations are proposed to reduce code size without significant performance degradation. We first present a code generation scheme and subsequently implement it for the IA-64 architecture, making effective use of rotating registers and predicated execution. We present some initial experimental results, which demonstrate not only the feasibility and correctness of our code generation scheme, but also its code quality.
... Alternative ordering heuristics could be supported in the future. (See figure 7 [1] for the swing ordering algorithm). ...
... 1. Calculate several timing bounds and properties for each node in the dependence graph (earliest/latest times for scheduling according to predecessors/successors; see subsection 4.1 of [1]). ...
... One possibility is to add an exit branch out of each iteration of the prolog, targeting a different epilog. This is complicated and increases the code size (see [1]). Another approach is to keep an original copy of the loop to be executed if the loop-count is too small to reach the kernel, and otherwise execute a branch-less prolog followed by the kernel and a single epilog. ...
Article
Full-text available
Software pipelining is a technique that improves the scheduling of instructions in loops by overlapping instructions from different iterations. Modulo scheduling is an approach for constructing software pipelines that focuses on minimizing the cycle count of the loops and thereby optimizes performance. In this paper we describe our implementation of Swing Modulo Scheduling in GCC, which is a Modulo Scheduling technique that also focuses on reducing register pressure. Several key issues are discussed, including the use and adaptation of GCC's machine-model infrastructure for scheduling (DFA) and data-dependence graph construction. We also present directions for future enhancements.
... Modulo renaming is supported by architectures like the IA64 through rotating registers [66]. On architectures without rotating registers like the ST200, modulo expansion is implemented by kernel unrolling [47,20,52], as illustrated in case (b) of Figure 10. ...
Article
Full-text available
and instruction scheduling problems on modern VLIW processors such as the STMicroelectronics ST200. Our motivations are to apply the machine scheduling techniques that are relevant to instruction scheduling in VLIW compilers, and to understand how processor micro-architecture features impact advanced instruction scheduling techniques. Based on this discussion, we present our theoretical contributions to the field of instruction scheduling that are applied in the STMicroelectronics ST200 production compiler, and we introduce a new time-indexed formulation for the register constrained instruction scheduling problems.
... Software pipelining is an effective technique to exploit instruction-level parallelism in loops [1,4,10]. It can significantly reduce runtime and is used extensively in VLIW DSP [2,3,6,8,11]. However, software pipelining expands code size due to the introduction of prelude and postlude. ...
... Recently researchers [2,6,11] have tackled the code size reduction problem for software pipelined loops which do not rely on special-purpose hardware. [2] proposes a prelude and postlude collapsing technique for Texas Instruments' TMS320C62 DSP and reports an average of 30% loop code size reduction. ...
... [11] uses a code size reduction technique based on a re-timing concept to collapse prelude and postlude, and achieves similar results as [2]. [6] combines scheduling heuristics, postlude collapsing schemas and speculative modulo scheduling, and again realizes a code size reduction of 30% on average with larger benchmark programs. ...
Article
Full-text available
Software pipelining is an effective technique to reduce cycle count by exploiting instruction level parallelism in loops. It has been implemented in most VLIW DSP compilers. However, software pipelining expands the code size due to the introduction of prelude and postlude. To address this problem, many VLIW DSP compilers include certain code size reduction features. During compilation, a user is given limited options of exercising these code reduction features. As a result, the tradeoff options between cycle count and code size are also limited. Yet today's software development often requires an optimum balance between code size and cycle count, which in turn requires a much wider tradeoff space. This paper presents a new heuristic code-size-constraint loop optimization approach to extend the tradeoff space. Preliminary experimental results indicate that the new approach can significantly widen the tradeoff space, thus providing DSP users with more flexibility to meet their various design criteria.
Chapter
Loop parallelization, and particularly the parallelization of innermost loops, is the most critical aspect of any parallelizing compiler. Trace scheduling can be applied to loops, but has the disadvantage that loops must be unrolled to exploit parallelism between loop iterations, which can lead to code explosion and in general still does not exploit all of the parallelism available. In this chapter we discuss modulo scheduling, which was the first technique to address the scheduling of loops (both within and across iterations) directly and is still very widely used in practice. The original modulo scheduling technique applies only to loops where the loop body is a single basic block; we also present extensions to modulo scheduling that allow the technique to be applied to more general loops.