Conference Paper

An Investigation Of The Performance Of Various Dynamic Scheduling Techniques

Abstract

Not Available


... Acyclic task graphs .......................................... 12
2.1.3 Parallel loops ............................................. 15
2.2 Dynamic Scheduling of HTGs ................................... 15
2.2.1 Scheduling parallel loops .................................. 17
2.2.2 Acyclic task graphs ........................................ 18
2.2.3 High-level scheduling ...................................... 19
3 DYNAMIC SCHEDULING ALGORITHMS FOR ATGS ......................... 20
3.1 Evaluating Execution Conditions .............................. 20
3.1.1 Form of execution conditions ............................... 21 ...
... += iteration_i.path
      endif
    end for
endif

ATG:
for each task v_i ∈ node do
    Evaluate_Node(v_i)
end for
node.work := 0
node.path ...
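The excerpted pseudocode appears to aggregate per-task values into node-level totals (summed work, longest path) when evaluating a hierarchical task-graph node. A minimal C sketch of that aggregation follows; every type and field name is assumed for illustration rather than taken from the thesis.

    #include <stddef.h>

    /* Hypothetical structures; the thesis's actual representation is not shown. */
    typedef struct Task {
        double work;   /* execution cost of this task              */
        double path;   /* critical-path length ending at this task */
    } Task;

    typedef struct Node {
        Task  *tasks;
        size_t ntasks;
        double work;   /* total work of the node        */
        double path;   /* critical path through the node */
    } Node;

    static void evaluate_node(Node *node)
    {
        node->work = 0.0;
        node->path = 0.0;
        for (size_t i = 0; i < node->ntasks; i++) {
            node->work += node->tasks[i].work;      /* work accumulates       */
            if (node->tasks[i].path > node->path)   /* longest path dominates */
                node->path = node->tasks[i].path;
        }
    }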
... Thus, SWITCH is simply a single instruction:

jsr.n r1    ; pc <- r1, r1 <- old pc+2,
            ; execute delay slot at old pc+1

(The suffix .n on mc88000 instructions indicates use of the optional branch delay slot [75].) This is not possible on the i860 because of a restriction that r1 not be the target register of a calli instruction ([53], pp. 5-19). The delay slot of the jsr instruction can still be used if it is moved ahead one instruction from where it normally would have been issued. ...
Article
This thesis examines nonloop parallelism at both fine and coarse levels of granularity in numerical FORTRAN programs. Measurements of the extent of this functional parallelism in a number of FORTRAN codes are presented, as well as compiler and run-time algorithms designed to exploit it. Hardware and software embodiments of the dynamic scheduling algorithms are developed, along with the compiler optimizations necessary to make these practical. The impact of fine grain functional parallelism on instruction-level architecture is explored, and it is shown that dynamic instruction scheduling hardware based on the functional parallelism scheduling algorithms can yield a significant improvement over static scheduling on conventional RISC processors when the latency of memory accesses is highly variable. Measurements of the characteristics of a set of FORTRAN benchmark programs indicate that such a hardware realization is feasible in practice.
... In particular, once a functional unit becomes available, the select logic then directs a suitable instruction to that unit for execution by asserting the corresponding grant signal. Many selection policies have been presented for the case where the number of ready instructions exceeds the capacity of the available functional units [12], [13], for instance, the oldest first selection algorithm [12]. ...
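The oldest-first selection mentioned in this excerpt can be stated compactly. Below is a behavioral C sketch (a sequential stand-in for what is really combinational select logic); the window layout and field names are invented for illustration.

    #include <stdbool.h>

    #define WINDOW_SIZE 32

    typedef struct {
        bool     ready;  /* request line: all source operands available */
        unsigned age;    /* smaller value = older in program order      */
    } WindowEntry;

    /* Return the index of the granted entry, or -1 if none requests. */
    int select_oldest_first(const WindowEntry win[WINDOW_SIZE])
    {
        int best = -1;
        for (int i = 0; i < WINDOW_SIZE; i++) {
            if (win[i].ready && (best < 0 || win[i].age < win[best].age))
                best = i;
        }
        return best;   /* hardware would assert grant[best] instead */
    }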
Article
This paper presents two effective wakeup designs that improve the speed, power, area, and scalability of dynamic instruction schedulers without loss of instructions per cycle (IPC). First, a wakeup design is proposed that aims at reducing the power consumption and wakeup latency. This design removes the read of the destination tags from the wakeup path by matching the source tags directly with the grant lines. Moreover, this design eliminates the redundant matches during the wakeup operations by matching the source tags with only the selected grant lines. Next, the second design explores a metric called wakeup locality to further reduce the area cost of the wakeup logic. By limiting the wakeup ranges for the instructions in the issue window, this design not only reduces the area requirement but also improves the scalability. The experimental results show that this range-limited-wakeup design saves 76%-94% of the power consumption and reduces the wakeup latency by 29%-77% compared to the conventional CAM-based scheme, with only 5%-44% of the area cost of a traditional RAM-based scheme. The results also show that this design scales well with the increase of both the issue width and the window size.
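For contrast, the conventional CAM-based wakeup that this design is measured against can be sketched behaviorally in C: each completing instruction broadcasts its destination tag, and every valid entry compares the broadcast against each of its source tags. This is a sketch of the conventional baseline, not the paper's design; the entry layout and names are assumptions.

    #include <stdbool.h>

    #define NUM_SRC 2

    typedef struct {
        unsigned src_tag[NUM_SRC];
        bool     src_ready[NUM_SRC];
        bool     valid;
    } IQEntry;

    /* One destination-tag broadcast wakes every matching source operand. */
    void wakeup_broadcast(IQEntry *iq, int n, unsigned dest_tag)
    {
        for (int e = 0; e < n; e++) {
            if (!iq[e].valid)
                continue;
            for (int s = 0; s < NUM_SRC; s++) {
                if (!iq[e].src_ready[s] && iq[e].src_tag[s] == dest_tag)
                    iq[e].src_ready[s] = true;   /* tag match: operand ready */
            }
        }
    }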
... Once a functional unit becomes available, the select logic directs a suitable instruction to that unit for execution by asserting the corresponding grant signal. Many selection policies, for instance, the oldest first selection algorithm [7], have been presented for the case where the number of ready instructions exceeds the capacity of the available functional units [7], [8]. ...
Article
In a high-performance superscalar processor, the instruction scheduler often comes with poor scalability and high complexity due to the expensive wake-up operation. From detailed simulation-based analyses, we find that 95 percent of the wake-up distances between two dependent instructions are short, in the range of 16 instructions, and 99 percent are in the range of 31 instructions. We apply this wake-up spatial locality to the design of conventional CAM-based and matrix-based wakeup logic, respectively. By limiting the wake-up coverage to i + 16 instructions, where 0 ≤ i ≤ 15 for 16-entry segments, the proposed wake-up designs confine the wake-up operation to two matrix-based or three CAM-based 16-entry segments no matter how large the issue window size is. The experimental results show that, for an issue window of 128 entries (IW128) or 256 entries (IW256), the proposed CAM-based wake-up locality design saves 65 percent (IW128) and 76 percent (IW256) of the power consumption and reduces the wake-up latency by 44 percent (IW128) and 78 percent (IW256) compared to the conventional CAM-based design with almost no performance loss. For the matrix-based wake-up logic, applying wake-up locality to the design drastically reduces the area cost. Extensive simulation results, including comparisons with previous works, show that the wake-up spatial locality is the key element to achieving scalability for future sophisticated instruction schedulers.
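A behavioral sketch of the wake-up locality idea described above, assuming a circular window in which the broadcast reaches only a bounded range of younger entries; sizes and names are illustrative, not the paper's parameters.

    #include <stdbool.h>

    #define IQ_SIZE      128
    #define WAKEUP_RANGE  16   /* wake only entries producer+1 .. producer+16 */

    typedef struct {
        unsigned src_tag[2];
        bool     src_ready[2];
        bool     valid;
    } Entry;

    void wakeup_local(Entry iq[IQ_SIZE], int producer, unsigned dest_tag)
    {
        for (int d = 1; d <= WAKEUP_RANGE; d++) {
            Entry *e = &iq[(producer + d) % IQ_SIZE];   /* younger neighbors only */
            if (!e->valid)
                continue;
            for (int s = 0; s < 2; s++) {
                if (!e->src_ready[s] && e->src_tag[s] == dest_tag)
                    e->src_ready[s] = true;
            }
        }
    }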
... After this condition is satisfied, it commits: it updates the architectural state and is deallocated from the reservation stations. 1 Note that after an instruction is selected for execution, several cycles pass before it completes execution. During this time, instructions dependent on it may be scheduled (woken up and selected) for execution. ...
... The select logic for each functional unit grants execution to one ready instruction. If more than one instruction requests execution, heuristics may be used for choosing which instruction receives the grant [1]. The inputs to the select logic are the request signals from each of the functional unit's RSEs, plus any additional information needed for scheduling heuristics such as priority information. ...
Article
A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, Palacharla, Jouppi, and Smith [3] warned that the dynamic instruction scheduling logic for current machines performs an atomic operation: either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles, or you sacrifice clock frequency by not pipelining it, performing this atomic operation in a single long cycle. Both alternatives are unacceptable for high performance.
... As a result, the instruction age order does not match the priority order that the select logic considers. It is widely known that high IPC is obtained using the age-ordered issue priority [23]. Therefore, the random IQ alone suffers from low IPC. ...
Article
The high energy consumption of current processors causes several problems, including a limited clock frequency, short battery lifetime, and reduced device reliability. It is therefore important to reduce the energy consumption of the processor. Among resources in a processor, the issue queue (IQ) is a large consumer of energy, much of which is consumed by the wakeup logic. Within the wakeup logic, the tag comparison that checks source operand readiness consumes a significant amount of energy. This paper proposes an energy reduction scheme for tag comparison, called double-stage tag comparison. This scheme first compares the lower bits of the tag and then, only if these match, compares the higher bits. Because the energy consumption of tag comparison is roughly proportional to the total number of bits compared, energy is saved by reducing this number. However, this sequential comparison increases the delay of the IQ, thereby increasing the clock cycle time. Although this can be avoided by allocating an extra cycle to the issue operation, this in turn degrades the IPC. To avoid IPC degradation, we reconfigure a small number of IQ entries, where the several oldest instructions (those most likely to have an adverse effect on performance) reside, to single-stage tag comparison. Our evaluation results for SPEC2017 benchmark programs show that the double-stage tag comparison achieves on average a 21% reduction in the energy consumed by the wakeup logic (15% when including the overhead) with only 3.0% performance degradation.
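The double-stage comparison lends itself to a compact sketch: compare the low bits first and touch the high bits only on a low-bit match, counting compared bits as a crude energy proxy. The 3-bit split and all names below are assumptions, not the paper's parameters.

    #include <stdbool.h>
    #include <stdint.h>

    #define LOW_BITS 3
    #define LOW_MASK ((1u << LOW_BITS) - 1u)

    static unsigned bits_compared;   /* crude energy proxy */

    bool tag_match_double_stage(uint32_t src_tag, uint32_t dest_tag,
                                unsigned tag_width)
    {
        bits_compared += LOW_BITS;                 /* stage 1: low bits only   */
        if ((src_tag & LOW_MASK) != (dest_tag & LOW_MASK))
            return false;                          /* most mismatches end here */
        bits_compared += tag_width - LOW_BITS;     /* stage 2: remaining bits  */
        return (src_tag >> LOW_BITS) == (dest_tag >> LOW_BITS);
    }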
... Butler et al. investigated the effect of several select policies of the IQ, including random selection and selection based on the number of dependent instructions [8]. Their evaluation results showed that the select policies they investigated delivered almost the same performance for the INT programs (SPEC89 benchmark) but the performance differed by up to 20% for the FP programs. ...
Conference Paper
The improvement of single-thread performance is much needed. Among the many structures that comprise a processor, the issue queue (IQ) is one of the most important structures influencing single-thread performance. Correctly assigning the issue priority and providing high capacity efficiency are key features, but no conventional IQ organization provides both sufficiently. In this paper, we propose an IQ called the switching issue queue (SWQUE), which dynamically configures the IQ as a modified circular queue (CIRC-PC) or a random queue with an age matrix (AGE) by responding to the degree of capacity demand. CIRC-PC corrects the issue priority when wrap-around occurs by exploiting the finding that instructions that are wrapped around are latency-tolerant. CIRC-PC is used for phases in which capacity efficiency is less important and correct priority is more important; AGE is used for phases in which capacity efficiency is more important. Our evaluation results using SPEC2017 benchmark programs show that SWQUE achieved higher performance by averages of 9.7% and 2.9% (up to 24.4% and 10.6%) for integer and floating-point programs, respectively, compared with AGE, which is widely used in current processors.
... This buffer can be unified (all entries are linked to all functional units), split (all entries are linked to only one functional unit), or mixed. A similar mechanism is called the node table by Butler and Patt in [BuPa92]. In their paper, they show that most issuing techniques give almost the same performance for a wide-issue processor. ...
Article
Full-text available
This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-point programs from the SPEC 92 suite. We first evaluate the number of instructions allowed to be concurrently processed by the execution stages of the pipeline. We then apply some restrictions on the execution issue of different instruction classes in order to define these configurations. We conclude that five to nine functional units are necessary to exploit Instruction-Level Parallelism. An important point is that several data cache ports are required in a processor of degree 4 or more. Finally, we report on complementary results on the utilization rate of the functional units. Keywords: Instruction-level parallelism, Superscalar microprocessor, Out-of-Order Execution, and Functional Units. [Truncated comparison table of processors: PPC 604, MC 88110, Ultra Sparc, DEC 211...]
... We assume an oldest-first issue policy for our model. Prior work by Butler and Patt [1992] shows that scheduling policy has very low impact on IPC. We therefore expect that our methodology can be applied to IQs with other scheduling policies. ...
Article
Full-text available
Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF.
... The selection process identifies those instructions whose source operands are ready and whose required resources are available, and issues them for execution. When more than one instruction competes for the same resource, the selection logic chooses one among them according to some heuristics [24]. ...
... Since the data reference stream is not strongly dependent on the instruction-set architecture it is reasonable to use a trace from an existing processor. A second common application is the analysis of a superscalar processor with the same instruction-set architecture as the machine on which the trace was generated [Smith89,Butler92]. This approach is relatively simple to implement but has the drawback that the traces do not depend on the target machine. ...
Article
We present a framework for an efficient instruction-level machine simulator which can be used with existing software tools to develop and analyze programs for a proposed processor architecture. The simulator exploits similarities between the instruction sets of the emulated machine and the host machine to provide fast simulation. Furthermore, existing program development tools on the host machine such as debuggers and profilers can be used without modification on the emulated program running under the simulator. The simulator can therefore be used to debug and tune application code for the new processor without building a whole new set of program development tools. The technique has applicability to a diverse set of simulation problems. We show how the framework has been used to build simulators for a shared-memory multiprocessor, a superscalar processor with support for speculative execution, and a dual-issue embedded processor.
... Thus, our selection policy is oldest-first. Others have observed that variations in selection policy do not cause any significant changes in performance [6]. In order to handle multiple functional units of the same type, the arbiters are stacked as shown in Fig. 2(b). ...
Article
Full-text available
Pipelined processor cores are conventionally designed to accommodate the critical paths in the critical pipeline stage(s) in a single clock cycle, to ensure correctness. Such conservative design is wasteful in many cases since critical paths are rarely exercised. Thus, configuring the pipeline to operate correctly for rarely used critical paths targets the uncommon case instead of optimizing for the common case. In this study, we describe Trifecta, an architectural technique that completes common-case, subcritical path operations in a single cycle but uses two cycles when the critical path is exercised. This increases slack for both single- and two-cycle operations and offers a unique advantage under process variation. In contrast with existing mechanisms that trade power or performance for yield, Trifecta improves yield while preserving performance and power. We applied this technique to the critical pipeline stages of a superscalar out-of-order (OoO) and a single-issue in-order processor, namely instruction issue and execute, respectively. Our experiments show that the rare two-cycle operations result in a small decrease (5% for integer and 2% for floating-point benchmarks of SPEC2000) in instructions per cycle. However, the increased delay slack causes an improvement in yield-adjusted throughput by 20% (12.7%) for an in-order (InO) processor configuration.
... However, compaction entails shifting instructions around in the queue every cycle and, depending on the instruction word width, may therefore be a source of considerable power consumption. Studies have shown that overall performance is largely independent of what selection policy is used (oldest first, position based, etc.) [10]. As such, the compaction strategy may not be best suited for low power operation; nor is it critical to achieving good performance. ...
Conference Paper
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
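Behaviorally, the statistics-gathering idea reduces to interval sampling plus a resize rule. A hedged C sketch follows, with invented thresholds and names; the paper's mechanism gathers its statistics in dedicated circuitry, not software.

    #define IQ_MAX     32
    #define IQ_MIN      8
    #define INTERVAL 10000   /* samples per adaptation interval */

    typedef struct {
        int      size;        /* currently enabled entries     */
        unsigned occupancy;   /* running sum of entries in use */
        unsigned samples;
    } AdaptiveIQ;

    void iq_sample(AdaptiveIQ *q, unsigned entries_in_use)
    {
        q->occupancy += entries_in_use;
        if (++q->samples < INTERVAL)
            return;
        unsigned avg = q->occupancy / q->samples;
        if (avg > (unsigned)(3 * q->size / 4) && q->size < IQ_MAX)
            q->size *= 2;    /* high pressure: enable more entries */
        else if (avg < (unsigned)(q->size / 4) && q->size > IQ_MIN)
            q->size /= 2;    /* mostly idle: gate off half         */
        q->occupancy = q->samples = 0;
    }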
... Early in the era of out-of-order execution, Butler and Patt [6] studied the efficacy of different picker criteria such as number of dependents, whether the instruction feeds a branch, the dataflow-chain length, and others. They concluded that performance was largely independent of the heuristic used, and simpler was thus better for real designs. ...
Conference Paper
From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is quite evident in schedulers, which need to be large and single-cycle for maximum performance on out-of-order cores. In this work we present two straightforward modifications to a matrix scheduler implementation which greatly strengthen its scalability. Both are based on the simple observation that the wakeup and picker matrices are sparse, even at small sizes; thus small indirection tables can be used to greatly reduce their width and latency. This technique can be used to create quicker iso-performance schedulers (17-58% reduced critical path) or larger iso-timing schedulers (7-26% IPC increase). Importantly, the power and area requirements of the additional hardware are likely offset by the greatly reduced matrix sizes and subsuming the functionality of the power-hungry allocation CAMs.
... The obvious issue is, how much ILP is available? Current processor implementations attempt to use 4- or perhaps 8-way instruction issues each cycle [5, 3, 8]. At least at this time, it seems problematic that ILP compilers will efficiently support more than these levels. ...
Conference Paper
CMOS technology should, over the next few years, reach lithography of under 0.1 μ. This provides a die area improvement of a factor of 10 over today's technology. What is the best use of this area? Multiprocessors, very high level superscalar processors, VLIW, processor-memory combinations, or simpler processors with very large caches? Some have made the case that pin bandwidth is an important limit that must be confronted by any organization. This presentation reviews the scalability of our current technology and critically analyzes the various alternatives to the best use of silicon area in the era of 0.1 μ technologies. The rapid advance of technology has made possible a number of processor improvements which would have been thought impossible just a few years ago. These improvements include cycle time, cache size, and an extraordinary increase in the complexity of the processor; this complexity is required for the management of relatively large scales of instruction level parallelism
... An example selection policy is oldest first: the ready instruction that occurs earliest in program order is granted the functional unit. Butler and Patt [5] studied various policies for scheduling ready instructions and found that overall performance is largely independent of the selection policy. The HP PA-8000 uses a selection policy that is based on the location of the instruction in the window. ...
... When more than one instruction competes for the same resource, the selection logic chooses one of them according to some heuristic. Overall, the issue logic's main source of complexity and power dissipation is the many tag comparisons it must perform every cycle. Researchers have proposed several approaches to improve the issue logic's power efficiency. ...
Article
Full-text available
The improved performance of current microprocessors brings with it increasingly complex and power-dissipating issue logic. Recent proposals introduce a range of mechanisms for tackling this problem.
... [Figure: generic instruction window cell.] According to [14], the easiest-to-implement location-based policy [63] results in the highest energy efficiency. ...
Article
In recent years, reducing power has become an important design goal for high-performance microprocessors. This work attempts to bring the power issue to the earliest phases of microprocessor development, in particular, the stage of defining a chip microarchitecture. We investigate power-optimization techniques of superscalar microprocessors at the microarchitecture level that do not compromise performance. First, major targets for power reduction are identified within the microarchitecture, where power is heavily consumed or will be heavily consumed in next-generation superscalar processors. Then, a new, energy-efficient version of a multicluster microarchitecture is developed that reduces energy at the identified critical design points with minimal performance impact. A methodology is developed for energy-performance optimization at the microarchitecture level that generates, for a microarchitecture, a set of energy-efficient configurations, forming a convex hull in the power-performance space. Detailed simulation of the baseline and proposed multicluster architectures has been performed using the developed optimization methodology. A comparison of the two microarchitectures, both optimized for energy efficiency, shows that the multicluster architecture is potentially up to twice as energy efficient for wide-issue processors, with an advantage that grows with the issue width. Conversely, at the same power dissipation level, the multicluster architecture supports configurations with measurably higher performance than equivalent conventional designs.
... To save power, the scheduling window is non-compacting: instructions stay in the same location for the duration of the time they are in the window. While oldest-first selection logic is not used, previous work has shown that the selection priority has little effect on IPC [4]. Since an instruction's consumers may reside in any cluster, it broadcasts its destination tag and result to its local cluster in addition to all other clusters. ...
Article
The register file and result bypass network are large sources of power consumption in high-performance processors. This paper introduces a technique called Demand-Only Broadcast that reduces the power consumption of these structures in a clustered execution core. With this technique, an instruction's result is only broadcast to remote clusters if it is needed by dependants in those clusters. Demand-Only Broadcast was evaluated using a performance--power simulator of a high-performance clustered processor which already employs techniques for reducing register file and instruction window power. By eliminating 59% of the register file writes and data bypasses, the total processor power consumption (including the hardware needed by this mechanism) is reduced by 10%, while having less than a 1% impact on performance.
... An example selection policy is oldest ready first: the ready instruction that occurs earliest in program order is granted the functional unit. Butler and Patt [5] studied various policies for scheduling ready instructions and found that overall performance is largely independent of the selection policy. For example, the HP PA-8000 [19] uses a selection policy that is based on the location of the instruction in the window. ...
Article
To characterize future performance limitations of superscalar processors, the delays of key pipeline structures in superscalar processors are studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice-simulated for several feature sizes.
... By the end of the cycle, the starting address of the next instruction block must be generated. In some of the processors, the I-cache access time is longer than the cycle time, leading to a pipeline structure depicted in figure 6(a). For instance, the Intel PentiumPro [8] features a pipelined I-cache access completed within two and a half cycles. ...
Article
Full-text available
A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. To overcome the instruction fetch bottleneck shown in wide-dispatch "brainiac" processors, this paper presents a novel cost-effective mechanism called the multiple-block ahead branch predictor, which predicts the addresses of multiple basic blocks in a single cycle in an efficient way. Moreover, and unlike previous multiple-predictor schemes, the multiple-block ahead branch predictor can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high performance on superscalar processors. Finally, we show that pipelining the branch prediction process can be done by means of our predictor for "speed demon" processors to achieve a higher clock rate.
... However, compaction entails shifting instructions around in the queue every cycle and, depending on the instruction word width, may therefore be a source of considerable power consumption. Studies have shown that overall performance is largely independent of what selection policy is used (oldest first, position based, etc.) [4]. As such, the compaction strategy may not be best suited for low power operation; nor is it critical to achieving good performance. ...
Article
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
... This observation led us to a sampling method, using correct-path instructions to simulate the wrong path. A similar technique was used in [2]. ...
Article
The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small. 1. Introduction. Processor performance is strongly correlated with the clock cycle. Shorter clock cycle has been allowed by both improvements in silicon technology and careful processor design. As a consequence of this evolution, the IPC (average number of ins...
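The prescheduling principle can be illustrated with a simple dataflow-depth calculation: each instruction's predicted issue cycle is the latest ready cycle of its sources, and instructions enter the issue buffer ordered by that depth. The depth rule in this C sketch is an assumption for illustration, not the paper's exact heuristic.

    #define MAX_REGS 64

    static unsigned ready_cycle[MAX_REGS];  /* predicted cycle each register is produced */

    typedef struct {
        int      src1, src2;   /* source registers, -1 if unused       */
        int      dest;         /* destination register, -1 if none     */
        unsigned latency;
        unsigned depth;        /* predicted issue cycle (presched key) */
    } Inst;

    void preschedule(Inst *in)
    {
        unsigned d = 0;
        if (in->src1 >= 0 && ready_cycle[in->src1] > d) d = ready_cycle[in->src1];
        if (in->src2 >= 0 && ready_cycle[in->src2] > d) d = ready_cycle[in->src2];
        in->depth = d;                    /* earliest cycle all inputs are ready */
        if (in->dest >= 0)
            ready_cycle[in->dest] = d + in->latency;
        /* instructions then enter the issue buffer in increasing 'depth' order */
    }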
... This is in agreement with the earlier discussion about how the fifo steering policy dynamically adapts to the number of clusters being used based on the parallelism in the instruction stream, thus resulting in fewer inter-cluster bypasses. Butler and Patt [BP92] also report significant performance degradation when the "head-only" (fifo) scheduling policy is used with distributed reservation stations. ...
Article
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice-simulated for several feature sizes. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future.
Article
Spatial architectures are more efficient than traditional Out-of-Order (OOO) processors for computationally intensive programs. However, spatial architectures require mapping a program, either statically or dynamically, onto the spatial fabric. Static methods can generate efficient mappings, but they cannot adapt to changing workloads and are not compatible across hardware generations. Current dynamic methods are adaptive and compatible, but do not optimize as well due to their limited use of speculation and small mapping scopes. To overcome the limitations of existing dynamic mapping methods for spatial architectures, while minimizing the inefficiencies inherent in OOO superscalar processors, this paper presents DynaSpAM (Dynamic Spatial Architecture Mapping), a framework that tightly couples a spatial fabric with an OOO pipeline. DynaSpAM coaxes the OOO processor into producing an optimized mapping with a simple modification to the processor's scheduler. The insight behind DynaSpAM is that today's powerful OOO processors do for themselves most of the work necessary to produce a highly optimized mapping for a spatial architecture, including aggressively speculating control and memory dependences, and scheduling instructions using a large window. Evaluation of DynaSpAM shows a geomean speedup of 1.42x for 11 benchmarks from the Rodinia benchmark suite with a geomean 23.9% reduction in energy consumption compared to an 8-issue OOO pipeline.
Article
Full-text available
The evolution of microprocessor architecture depends upon the changing aspects of technology. As die density and speed increase, memory and program behavior become increasingly important in defining architecture tradeoffs. While technology enables increasingly complex processor implementations, there are physical and program behavior limits to the usefulness of this complexity. Physical limits include device limits as well as practical limits on power and cost. Program behavior limits result from unpredictable events occurring during execution. Architectures and implementations that span these limits are vital to the continued evolution of the microprocessor. Keywords: microprocessor architecture, scaling, memory wall, cache limits, program behavior limits. Successful microprocessor implementations depend upon the processor architect's ability to predict trends and advances in both technology and user behavior. Selecting an approach for a microprocessor implementation depend...
Article
In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support and its ability to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
Conference Paper
In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support and its ability to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
Article
Increasing levels of power dissipation threaten to limit the performance gains of future high-end, out-of-order issue microprocessors. Therefore, it is imperative that designers devise techniques that significantly reduce the power dissipation of the key hardware structures on the chip without unduly compromising performance. Such a key structure in out-of-order designs is the issue queue. Although crucial in achieving high performance, the issue queues are often a major contributor to the overall power consumption of the chip, potentially affecting both thermal issues related to hot spots and energy issues related to battery life. In this chapter, we present two techniques that significantly reduce issue queue power while maintaining high performance operation. First, we evaluate the power savings achieved by implementing a CAM/RAM structure for the issue queue as an alternative to the more power-hungry latch-based issue queue used in many designs. We then present the microarchitecture and circuit design of an adaptive issue queue that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. We compare two different dynamic adaptation algorithms that use issue queue utilization and parallelism metrics in order to size the issue queue on-the-fly during execution. Together, these two techniques provide over a 70% average reduction in issue queue power dissipation for a collection of the SPEC CPU2000 integer benchmarks, with only a 3% overall performance degradation.
Conference Paper
This paper investigates and quantifies the impacts of several aggressive performance-boosting techniques designed for superscalar processors on the performance of SMT architectures. First, we examine the synergy of multithreading and speculative execution. Second, we quantify the performance impact of not supporting the load-hit speculation. Finally, we consider the impact of pipelining instruction scheduling logic over two cycles. The general conclusion of our studies is that while speculative execution is still important to achieve high SMT performance, scheduler-related mechanisms can be relaxed because the pipeline bubbles created in the execution schedule of one thread are often filled by the instructions from other threads.
Article
Execution along mispredicted paths may or may not affect the accuracy of subsequent branch predictions if recovery mechanisms are not provided to undo the erroneous information that is acquired by the branch prediction storage structures. In this paper, we study four elements of the Two-Level Branch Predictor: the Branch Target Buffer (BTB), the Branch History Register (BHR), the Pattern History Tables (PHTs), and the Return Address Stack (RAS). For each we determine whether a recovery mechanism is needed, and, if so, show how to design a cost-effective one. Using five benchmarks from the SPECint92 suite, we show that there is no need to provide recovery mechanisms for the BTB and the PHTs, but that performance is degraded by an average of 30% if recovery mechanisms are not provided for the BHR and RAS.
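A minimal sketch, under assumed structures, of the kind of recovery the abstract calls for: checkpoint the BHR and RAS at each predicted branch and restore both on a misprediction. Copying the whole RAS, as done here, is the naive form; a cost-effective design would checkpoint less state.

    #include <stdint.h>
    #include <string.h>

    #define RAS_DEPTH 16

    typedef struct {
        uint32_t bhr;                 /* global branch history register */
        uint32_t ras[RAS_DEPTH];      /* return address stack contents  */
        int      ras_top;
    } PredictorCheckpoint;

    void checkpoint_take(PredictorCheckpoint *cp, uint32_t bhr,
                         const uint32_t ras[RAS_DEPTH], int ras_top)
    {
        cp->bhr = bhr;
        memcpy(cp->ras, ras, sizeof cp->ras);
        cp->ras_top = ras_top;
    }

    void checkpoint_restore(const PredictorCheckpoint *cp, uint32_t *bhr,
                            uint32_t ras[RAS_DEPTH], int *ras_top)
    {
        *bhr = cp->bhr;                         /* undo wrong-path history bits */
        memcpy(ras, cp->ras, RAS_DEPTH * sizeof *ras);
        *ras_top = cp->ras_top;                 /* undo wrong-path pushes/pops  */
    }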
Article
Thesis (Ph. D.)--University of Toronto, 1997. Includes bibliographical references.
Conference Paper
As the technology scales, reduction in transistor size creates many opportunities for increased circuit capabilities in reduced chip area. In modern wide-issue processors, performance of the processor is directly impacted by the time delay complexity of the dynamic scheduling logic. In this paper, we analyze the scaling of the time delay of instruction select logic at submicron technologies, and also present novel designs that provide a single selection tree for two similar functional units. The designs are based on a tree structure using arbiter cells of two and four inputs which can handle one or two functional units. The effects of technology and design decisions are shown based on simulations using four submicron technologies. The delays in the select logic trees are shown to decrease by an average of 60% from 130 nm technology to 45 nm technology when servicing a single functional unit. The double-grant arbiter cells are shown to build a tree that serves multiple functional units simultaneously with 65% less delay compared to multiple single-grant trees.
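Behaviorally, a tree of arbiter cells reduces to recursive priority arbitration: requests propagate up, one grant propagates down, and (arbitrarily in this C sketch) the left subtree wins ties. This illustrates the structure only, not the paper's circuit or its double-grant cells.

    #include <stdbool.h>

    /* Arbitrate over req[lo..hi); return the granted index, or -1 if none. */
    int arbitrate(const bool *req, int lo, int hi)
    {
        if (hi - lo == 1)
            return req[lo] ? lo : -1;        /* leaf cell: grant iff requesting */
        int mid = (lo + hi) / 2;
        int g = arbitrate(req, lo, mid);     /* left half has fixed priority    */
        return (g >= 0) ? g : arbitrate(req, mid, hi);
    }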
Conference Paper
The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small.
Conference Paper
A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, S. Palacharla et al. (1997) warned that the dynamic instruction scheduling logic for current machines performs an atomic operation: either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles, or you sacrifice clock frequency by not pipelining it, performing this atomic operation in a single long cycle. Both alternatives are unacceptable for high performance. The paper offers a third, acceptable, alternative: pipelined scheduling with speculative wakeup. This technique pipelines the scheduling logic without eliminating its ability to execute dependent instructions in consecutive cycles. With this technique, you sacrifice little IPC, and no clock frequency. Our results show that on the SPECint95 benchmarks, a machine using this technique has an average IPC that is 13% greater than the IPC of a baseline machine that pipelines the scheduling logic but sacrifices the ability to execute dependent instructions in consecutive cycles, and within 2% of the IPC of a conventional machine that uses single-cycle scheduling logic.
Conference Paper
Full-text available
Out-of-order issue mechanisms increase performance by dynamically rescheduling instructions that cannot be statically reordered by the compiler. Such mechanisms are effective but expensive in terms of both complexity and silicon area. It is therefore desirable to find cost-effective alternatives which can provide similar performance gains. In this paper we present delayed issue, a novel technique which allows instructions to be executed out-of-order without the hardware complexity of dynamic out-of-order issue. Instructions are inserted into per-functional unit delay queues using delays specified by the compiler. Instructions within a queue are issued in order; out of order execution results from different instructions being inserted into the queues at various delays. In addition to improving performance, delayed issue reduces code bloat when loops are pipelined
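The delay-queue mechanism can be sketched in a few lines of C: the compiler tags each instruction with a delay, the instruction lands in its functional unit's queue at that slot (sliding later if the slot is taken), and the queue drains strictly in order. The slot semantics here are assumptions about the scheme, not its definitive rules.

    #define QLEN 16

    typedef struct {
        int slots[QLEN];   /* instruction ids; -1 = empty */
    } DelayQueue;

    /* Insert at the compiler-specified delay, sliding later if occupied. */
    int dq_insert(DelayQueue *q, int inst_id, int delay)
    {
        for (int s = delay; s < QLEN; s++) {
            if (q->slots[s] < 0) {
                q->slots[s] = inst_id;
                return s;
            }
        }
        return -1;   /* queue full: stall insertion */
    }

    /* Each cycle the head slot issues (in order) and the queue shifts up. */
    int dq_issue(DelayQueue *q)
    {
        int head = q->slots[0];
        for (int s = 0; s < QLEN - 1; s++)
            q->slots[s] = q->slots[s + 1];
        q->slots[QLEN - 1] = -1;
        return head;   /* -1 means nothing issues this cycle */
    }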
Conference Paper
Full-text available
This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-point programs from the SPEC 92 suite. We first evaluate the number of instructions allowed to be concurrently processed by the execution stages of the pipeline. We then apply some restrictions on the execution issue of different instruction classes in order to define these configurations. We conclude that five to nine functional units are necessary to exploit instruction-level parallelism. An important point is that several data cache ports are required in a processor of degree 4 or more. Finally, we report on complementary results on the utilization rate of the functional units.
Conference Paper
New techniques are increasing the degree of instruction-level parallelism exploited by processors. Recent superscalar implementations include multiple functional units, allowing the parallel execution of several instructions from the same application program. The trend towards an expansion of the number of hardware resources is likely to continue in future superscalar designs, and in order to maximize the processor throughput, the computational load must be balanced among these resources by the dynamic instruction-issuing algorithm. We investigate the effect on performance caused by the way instructions are distributed among the functional units of superscalar processors. Our results show that a performance gain of up to 38% can be obtained when the instructions are evenly distributed among the functional units
Conference Paper
Full-text available
In out-of-order issue superscalar microprocessors, instructions must be buffered before they are issued. This buffer can be either unified (one buffer linked to all functional units) as in the P6, distributed among the units as in the PowerPC 620, or semi-unified (a few buffers each shared by several units) as in the MIPS R10000. Of course, the size of this buffer also plays a leading role in the performance of the processor. Intensive trace-driven simulations on the SPEC92 suite have been made in order to determine the best designs and relevant choices are pointed out according to the dispatch width of the processor
Conference Paper
A study was initiated that investigated detractors to parallelism and implementation constraints associated with the critical paths in the design of fine grain parallel machines. The outcome of the research is a new machine organization that facilitates and improves parallel instruction issue, and possibly cycle time, by improving instruction-level parallelism using specialized hardware. The authors describe the attributes of the proposed machine organization related to the analysis of instruction sequences for parallel issue and execution. They also describe the permanent preprocessing in the cache that allows for the determination of instructions for parallel execution prior to the instruction fetch and issue.
Article
While in the last decade image and video processing (IVP) have gradually moved from special purpose computer architectures based on massive parallelism (MP) to general purpose computer architectures based on instruction-level parallelism (ILP), a new challenge is now to be faced by the IVP community, namely the application of IVP also in small-size embedded systems (e.g., video players, smart cameras, digital diaries, etc.) based on ILP processors. Because of the requirements of low size, weight, and power consumption, these embedded systems do not take advantage of processors that feature advanced dynamic code optimization mechanisms such as those based on instruction reordering and register renaming. On the other hand, the compile time techniques of present generation compilers do not appear to be aggressive enough to exploit the massive parallelism of IVP tasks in ILP architectures, thus leading to inefficient programs. This paper analyzes the efficiency of IVP programs on ILP CPUs. In particular it presents: (1) a reference model for the efficient design and implementation of highly parallel programs, such as the ones of the IVP domain; (2) an analysis of the inefficiencies of IVP programs implemented on ILP processors; and (3) a set of techniques, deriving from the reference model, that overcome these inefficiencies. These techniques are based on a novel computing paradigm called bucket processing.
Article
Full-text available
Out-of-order execution and branch prediction are two mechanisms that can be used profitably in the design of supercomputers to increase performance. Proper exception handling and branch prediction miss handling in an out-of-order execution machine do require some kind of repair mechanism which can restore the machine to a known previous state. In this paper we present a class of repair mechanisms using the concept of checkpointing. We derive several properties of checkpoint repair mechanisms. In addition, we provide algorithms for performing checkpoint repair that incur little overhead in time and modest cost in hardware. We also note that our algorithms require no additional complexity or time for use with write-back cache memory systems than they do with write-through cache memory systems, contrary to statements made by previous researchers.
Conference Paper
Full-text available
HPS is a new model for a high performance microarchitecture which is targeted for implementing very dissimilar ISP architectures. It derives its performance from executing the operations within a restricted window of a program out-of-order, asynchronously, and concurrently whenever possible. Before the model can be reduced to an effective working implementation of a particular target architecture, several issues need to be resolved. This paper discusses these issues, both in general and in the context of architectures with specific characteristics.
Conference Paper
Full-text available
Our recent work in microarchitecture has identified a new model of execution, restricted data flow, in which data flow techniques are used to coordinate out-of-order execution of sequential instruction streams. We believe that the restricted data flow model has great potential for implementing very high performance computing engines. This paper defines a minimal functionality variant of our model, which we are calling HPSm. The instruction set, data path, timing and control of HPSm are all described. A simulator for HPSm has been written, and some of the Berkeley RISC benchmarks have been executed on the simulator. We report the measurements obtained from these benchmarks, along with the measurements obtained for the Berkeley RISC II. The results are encouraging.
Article
The large latency of memory accesses is a major impediment to achieving high performance in large scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of memory accesses with other computation and memory accesses. Previous studies on relaxed models have shown that the latency of write accesses can be hidden by buffering writes and allowing reads to bypass pending writes. Hiding the latency of reads by exploiting the overlap allowed by relaxed models is inherently more difficult, however, simply because the processor depends on the return value for its future computation. This paper explores the use of dynamically scheduled processors to exploit the overlap allowed by relaxed models for hiding the latency of reads. Our results are based on detailed simulation studies of several parallel applications. The results show that a substantial fraction of the read latency can be hidden using this technique. However, the major improvements in performance are achieved only at large instruction window sizes.
Conference Paper
The authors model the execution of the SPEC benchmarks under differing resource constraints. They repeat the work of the previous researchers, and under the hardware resource constraints imposed, similar results are obtained. On the other hand, when all constraints are removed except those required by the semantics of the program, degrees of parallelism in excess of 17 instructions per cycle are found. Finally, it is shown that if the hardware is properly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a processor that is reasonable to design today.
The Design and Implementation of the 4.3BSD UNIX Operating System
  • S. J. Leffler
  • M. K. McKusick
  • M. J. Karels
  • J. S. Quarterman
HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality
  • W. Hwu
  • Y. N. Patt