Conference Paper

An Investigation Of The Performance Of Various Dynamic Scheduling Techniques

Abstract

Not Available


... Acyclic task graphs .......................................... 12
2.1.3 Parallel loops ............................................. 15
2.2 Dynamic Scheduling of HTGs ................................... 15
2.2.1 Scheduling parallel loops .................................. 17
2.2.2 Acyclic task graphs ........................................ 18
2.2.3 High-level scheduling ...................................... 19
3 DYNAMIC SCHEDULING ALGORITHMS FOR ATGS ......................... 20
3.1 Evaluating Execution Conditions .............................. 20
3.1.1 Form of execution conditions ............................... 21 ...
... += iteration_i.path
      endif
    end for
endif

ATG:
for each task v_i ∈ node do
    Evaluate_Node(v_i)
end for
node.work := 0
node.path ...
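The excerpted pseudocode appears to aggregate per-task values into node-level totals (summed work, longest path) when evaluating a hierarchical task-graph node. A minimal C sketch of that aggregation follows; every type and field name is assumed for illustration rather than taken from the thesis.

    #include <stddef.h>

    /* Hypothetical structures; the thesis's actual representation is not shown. */
    typedef struct Task {
        double work;   /* execution cost of this task              */
        double path;   /* critical-path length ending at this task */
    } Task;

    typedef struct Node {
        Task  *tasks;
        size_t ntasks;
        double work;   /* total work of the node        */
        double path;   /* critical path through the node */
    } Node;

    static void evaluate_node(Node *node)
    {
        node->work = 0.0;
        node->path = 0.0;
        for (size_t i = 0; i < node->ntasks; i++) {
            node->work += node->tasks[i].work;      /* work accumulates       */
            if (node->tasks[i].path > node->path)   /* longest path dominates */
                node->path = node->tasks[i].path;
        }
    }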
... Thus, SWITCH is simply a single instruction:

jsr.n r1    ; pc <- r1, r1 <- old pc+2,
            ; execute delay slot at old pc+1

(The suffix .n on mc88000 instructions indicates use of the optional branch delay slot [75].) This is not possible on the i860 because of a restriction that r1 not be the target register of a calli instruction ([53], pp. 5-19). The delay slot of the jsr instruction can still be used if it is moved ahead one instruction from where it normally would have been issued. ...
Article
This thesis examines nonloop parallelism at both fine and coarse levels of granularity in numerical FORTRAN programs. Measurements of the extent of this functional parallelism in a number of FORTRAN codes are presented, as well as compiler and run-time algorithms designed to exploit it. Hardware and software embodiments of the dynamic scheduling algorithms are developed, along with the compiler optimizations necessary to make these practical. The impact of fine grain functional parallelism on instruction-level architecture is explored, and it is shown that dynamic instruction scheduling hardware based on the functional parallelism scheduling algorithms can yield a significant improvement over static scheduling on conventional RISC processors when the latency of memory accesses is highly variable. Measurements of the characteristics of a set of FORTRAN benchmark programs indicate that such a hardware realization is feasible in practice.
... In particular, once a functional unit becomes available, the select logic then directs a suitable instruction to that unit for execution by asserting the corresponding grant signal. Many selection policies have been presented for the case where the number of ready instructions exceeds the capacity of the available functional units [12], [13], for instance, the oldest first selection algorithm [12]. ...
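The oldest-first selection mentioned in this excerpt can be stated compactly. Below is a behavioral C sketch (a sequential stand-in for what is really combinational select logic); the window layout and field names are invented for illustration.

    #include <stdbool.h>

    #define WINDOW_SIZE 32

    typedef struct {
        bool     ready;  /* request line: all source operands available */
        unsigned age;    /* smaller value = older in program order      */
    } WindowEntry;

    /* Return the index of the granted entry, or -1 if none requests. */
    int select_oldest_first(const WindowEntry win[WINDOW_SIZE])
    {
        int best = -1;
        for (int i = 0; i < WINDOW_SIZE; i++) {
            if (win[i].ready && (best < 0 || win[i].age < win[best].age))
                best = i;
        }
        return best;   /* hardware would assert grant[best] instead */
    }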
Article
This paper presents two effective wakeup designs that improve the speed, power, area, and scalability of dynamic instruction schedulers without loss of instructions per cycle (IPC). First, a wakeup design is proposed that aims at reducing the power consumption and wakeup latency. This design removes the read of the destination tags from the wakeup path by matching the source tags directly with the grant lines. Moreover, this design eliminates the redundant matches during the wakeup operations by matching the source tags with only the selected grant lines. Next, the second design explores a metric called wakeup locality to further reduce the area cost of the wakeup logic. By limiting the wakeup ranges for the instructions in the issue window, this design not only reduces the area requirement but also improves the scalability. The experimental results show that this range-limited-wakeup design saves 76%-94% of the power consumption and reduces the wakeup latency by 29%-77% compared to the conventional CAM-based scheme, with only 5%-44% of the area cost of a traditional RAM-based scheme. The results also show that this design scales well with the increase of both the issue width and the window size.
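For contrast, the conventional CAM-based wakeup that this design is measured against can be sketched behaviorally in C: each completing instruction broadcasts its destination tag, and every valid entry compares the broadcast against each of its source tags. This is a sketch of the conventional baseline, not the paper's design; the entry layout and names are assumptions.

    #include <stdbool.h>

    #define NUM_SRC 2

    typedef struct {
        unsigned src_tag[NUM_SRC];
        bool     src_ready[NUM_SRC];
        bool     valid;
    } IQEntry;

    /* One destination-tag broadcast wakes every matching source operand. */
    void wakeup_broadcast(IQEntry *iq, int n, unsigned dest_tag)
    {
        for (int e = 0; e < n; e++) {
            if (!iq[e].valid)
                continue;
            for (int s = 0; s < NUM_SRC; s++) {
                if (!iq[e].src_ready[s] && iq[e].src_tag[s] == dest_tag)
                    iq[e].src_ready[s] = true;   /* tag match: operand ready */
            }
        }
    }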
... Once a functional unit becomes available, the select logic directs a suitable instruction to that unit for execution by asserting the corresponding grant signal. Many selection policies, for instance, the oldest first selection algorithm [7], have been presented for the case where the number of ready instructions exceeds the capacity of the available functional units [7], [8]. ...
Article
In a high-performance superscalar processor, the instruction scheduler often comes with poor scalability and high complexity due to the expensive wake-up operation. From detailed simulation-based analyses, we find that 95 percent of the wake-up distances between two dependent instructions are short, in the range of 16 instructions, and 99 percent are in the range of 31 instructions. We apply this wake-up spatial locality to the design of conventional CAM-based and matrix-based wakeup logic, respectively. By limiting the wake-up coverage to i + 16 instructions, where 0 ≤ i ≤ 15 for 16-entry segments, the proposed wake-up designs confine the wake-up operation to two matrix-based or three CAM-based 16-entry segments no matter how large the issue window size is. The experimental results show that, for an issue window of 128 entries (IW128) or 256 entries (IW256), the proposed CAM-based wake-up locality design saves 65 percent (IW128) and 76 percent (IW256) of the power consumption and reduces the wake-up latency by 44 percent (IW128) and 78 percent (IW256) compared to the conventional CAM-based design with almost no performance loss. For the matrix-based wake-up logic, applying wake-up locality to the design drastically reduces the area cost. Extensive simulation results, including comparisons with previous works, show that the wake-up spatial locality is the key element to achieving scalability for future sophisticated instruction schedulers.
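A behavioral sketch of the wake-up locality idea described above, assuming a circular window in which the broadcast reaches only a bounded range of younger entries; sizes and names are illustrative, not the paper's parameters.

    #include <stdbool.h>

    #define IQ_SIZE      128
    #define WAKEUP_RANGE  16   /* wake only entries producer+1 .. producer+16 */

    typedef struct {
        unsigned src_tag[2];
        bool     src_ready[2];
        bool     valid;
    } Entry;

    void wakeup_local(Entry iq[IQ_SIZE], int producer, unsigned dest_tag)
    {
        for (int d = 1; d <= WAKEUP_RANGE; d++) {
            Entry *e = &iq[(producer + d) % IQ_SIZE];   /* younger neighbors only */
            if (!e->valid)
                continue;
            for (int s = 0; s < 2; s++) {
                if (!e->src_ready[s] && e->src_tag[s] == dest_tag)
                    e->src_ready[s] = true;
            }
        }
    }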
... After this condition is satisfied, it commits: it updates the architectural state and is deallocated from the reservation stations. 1 Note that after an instruction is selected for execution, several cycles pass before it completes execution. During this time, instructions dependent on it may be scheduled (woken up and selected) for execution. ...
... The select logic for each functional unit grants execution to one ready instruction. If more than one instruction requests execution, heuristics may be used for choosing which instruction receives the grant [1]. The inputs to the select logic are the request signals from each of the functional unit's RSEs, plus any additional information needed for scheduling heuristics such as priority information. ...
Article
A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, Palacharla, Jouppi, and Smith [3] warned that the dynamic instruction scheduling logic for current machines performs an atomic operation: either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles, or you sacrifice clock frequency by not pipelining it, performing this atomic operation in a single long cycle. Both alternatives are unacceptable for high performance.
... As a result, the instruction age order does not match the priority order that the select logic considers. It is widely known that high IPC is obtained using the age-ordered issue priority [23]. Therefore, the random IQ alone suffers from low IPC. ...
Article
The high energy consumption of current processors causes several problems, including a limited clock frequency, short battery lifetime, and reduced device reliability. It is therefore important to reduce the energy consumption of the processor. Among resources in a processor, the issue queue (IQ) is a large consumer of energy, much of which is consumed by the wakeup logic. Within the wakeup logic, the tag comparison that checks source operand readiness consumes a significant amount of energy. This paper proposes an energy reduction scheme for tag comparison, called double-stage tag comparison. This scheme first compares the lower bits of the tag and then, only if these match, compares the higher bits. Because the energy consumption of tag comparison is roughly proportional to the total number of bits compared, energy is saved by reducing this number. However, this sequential comparison increases the delay of the IQ, thereby increasing the clock cycle time. Although this can be avoided by allocating an extra cycle to the issue operation, this in turn degrades the IPC. To avoid IPC degradation, we reconfigure a small number of IQ entries, where the several oldest instructions (those most likely to have an adverse effect on performance) reside, to single-stage tag comparison. Our evaluation results for SPEC2017 benchmark programs show that the double-stage tag comparison achieves on average a 21% reduction in the energy consumed by the wakeup logic (15% when including the overhead) with only 3.0% performance degradation.
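The double-stage comparison lends itself to a compact sketch: compare the low bits first and touch the high bits only on a low-bit match, counting compared bits as a crude energy proxy. The 3-bit split and all names below are assumptions, not the paper's parameters.

    #include <stdbool.h>
    #include <stdint.h>

    #define LOW_BITS 3
    #define LOW_MASK ((1u << LOW_BITS) - 1u)

    static unsigned bits_compared;   /* crude energy proxy */

    bool tag_match_double_stage(uint32_t src_tag, uint32_t dest_tag,
                                unsigned tag_width)
    {
        bits_compared += LOW_BITS;                 /* stage 1: low bits only   */
        if ((src_tag & LOW_MASK) != (dest_tag & LOW_MASK))
            return false;                          /* most mismatches end here */
        bits_compared += tag_width - LOW_BITS;     /* stage 2: remaining bits  */
        return (src_tag >> LOW_BITS) == (dest_tag >> LOW_BITS);
    }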
... Butler et al. investigated the effect of several select policies of the IQ, including random selection and selection based on the number of dependent instructions [8]. Their evaluation results showed that the select policies they investigated delivered almost the same performance for the INT programs (SPEC89 benchmark) but the performance differed by up to 20% for the FP programs. ...
Conference Paper
The improvement of single-thread performance is much needed. Among the many structures that comprise a processor, the issue queue (IQ) is one of the most important structures influencing single-thread performance. Correctly assigning the issue priority and providing high capacity efficiency are key features, but no conventional IQ organization provides both sufficiently. In this paper, we propose an IQ called the switching issue queue (SWQUE), which dynamically configures the IQ as a modified circular queue (CIRC-PC) or a random queue with an age matrix (AGE) by responding to the degree of capacity demand. CIRC-PC corrects the issue priority when wrap-around occurs by exploiting the finding that instructions that are wrapped around are latency-tolerant. CIRC-PC is used for phases in which capacity efficiency is less important and correct priority is more important; AGE is used for phases in which capacity efficiency is more important. Our evaluation results using SPEC2017 benchmark programs show that SWQUE achieved higher performance by averages of 9.7% and 2.9% (up to 24.4% and 10.6%) for integer and floating-point programs, respectively, compared with AGE, which is widely used in current processors.
... This buffer can be unified (all entries are linked to all functional units), split (all entries are linked to only one functional unit), or mixed. A similar mechanism is called the node table by Butler and Patt in [BuPa92]. In their paper, they show that most issuing techniques give almost the same performance for a wide-issue processor. ...
Article
Full-text available
This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-point programs from the SPEC 92 suite. We first evaluate the number of instructions allowed to be concurrently processed by the execution stages of the pipeline. We then apply some restrictions on the execution issue of different instruction classes in order to define these configurations. We conclude that five to nine functional units are necessary to exploit Instruction-Level Parallelism. An important point is that several data cache ports are required in a processor of degree 4 or more. Finally, we report on complementary results on the utilization rate of the functional units. Keywords: Instruction-level parallelism, Superscalar microprocessor, Out-of-Order Execution, and Functional Units. [Truncated comparison table of processors: PPC 604, MC 88110, Ultra Sparc, DEC 211...]
... We assume an oldest-first issue policy for our model. Prior work by Butler and Patt [1992] shows that scheduling policy has very low impact on IPC. We therefore expect that our methodology can be applied to IQs with other scheduling policies. ...
Article
Full-text available
Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF.
... The selection process identifies those instructions whose source operands are ready and whose required resources are available, and issues them for execution. When more than one instruction competes for the same resource, the selection logic chooses one among them according to some heuristics [24]. ...
... Since the data reference stream is not strongly dependent on the instruction-set architecture it is reasonable to use a trace from an existing processor. A second common application is the analysis of a superscalar processor with the same instruction-set architecture as the machine on which the trace was generated [Smith89,Butler92]. This approach is relatively simple to implement but has the drawback that the traces do not depend on the target machine. ...
Article
We present a framework for an efficient instruction-level machine simulator which can be used with existing software tools to develop and analyze programs for a proposed processor architecture. The simulator exploits similarities between the instruction sets of the emulated machine and the host machine to provide fast simulation. Furthermore, existing program development tools on the host machine such as debuggers and profilers can be used without modification on the emulated program running under the simulator. The simulator can therefore be used to debug and tune application code for the new processor without building a whole new set of program development tools. The technique has applicability to a diverse set of simulation problems. We show how the framework has been used to build simulators for a shared-memory multiprocessor, a superscalar processor with support for speculative execution, and a dual-issue embedded processor.
... Thus, our selection policy is oldest-first. Others have observed that variations in selection policy do not cause any significant changes in performance [6]. In order to handle multiple functional units of the same type, the arbiters are stacked as shown in Fig. 2(b). ...
Article
Full-text available
Pipelined processor cores are conventionally designed to accommodate the critical paths in the critical pipeline stage(s) in a single clock cycle, to ensure correctness. Such conservative design is wasteful in many cases since critical paths are rarely exercised. Thus, configuring the pipeline to operate correctly for rarely used critical paths targets the uncommon case instead of optimizing for the common case. In this study, we describe Trifecta, an architectural technique that completes common-case, subcritical path operations in a single cycle but uses two cycles when the critical path is exercised. This increases slack for both single- and two-cycle operations and offers a unique advantage under process variation. In contrast with existing mechanisms that trade power or performance for yield, Trifecta improves yield while preserving performance and power. We applied this technique to the critical pipeline stages of a superscalar out-of-order (OoO) and a single-issue in-order processor, namely instruction issue and execute, respectively. Our experiments show that the rare two-cycle operations result in a small decrease (5% for integer and 2% for floating-point benchmarks of SPEC2000) in instructions per cycle. However, the increased delay slack causes an improvement in yield-adjusted throughput by 20% (12.7%) for an in-order (InO) processor configuration.
... However, compaction entails shifting instructions around in the queue every cycle and, depending on the instruction word width, may therefore be a source of considerable power consumption. Studies have shown that overall performance is largely independent of what selection policy is used (oldest first, position based, etc.) [10]. As such, the compaction strategy may not be best suited for low power operation; nor is it critical to achieving good performance. ...
Conference Paper
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
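Behaviorally, the statistics-gathering idea reduces to interval sampling plus a resize rule. A hedged C sketch follows, with invented thresholds and names; the paper's mechanism gathers its statistics in dedicated circuitry, not software.

    #define IQ_MAX     32
    #define IQ_MIN      8
    #define INTERVAL 10000   /* samples per adaptation interval */

    typedef struct {
        int      size;        /* currently enabled entries     */
        unsigned occupancy;   /* running sum of entries in use */
        unsigned samples;
    } AdaptiveIQ;

    void iq_sample(AdaptiveIQ *q, unsigned entries_in_use)
    {
        q->occupancy += entries_in_use;
        if (++q->samples < INTERVAL)
            return;
        unsigned avg = q->occupancy / q->samples;
        if (avg > (unsigned)(3 * q->size / 4) && q->size < IQ_MAX)
            q->size *= 2;    /* high pressure: enable more entries */
        else if (avg < (unsigned)(q->size / 4) && q->size > IQ_MIN)
            q->size /= 2;    /* mostly idle: gate off half         */
        q->occupancy = q->samples = 0;
    }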
... Early in the era of out-of-order execution, Butler and Patt [6] studied the efficacy of different picker criteria such as number of dependents, whether the instruction feeds a branch, the dataflow-chain length, and others. They concluded that performance was largely independent of the heuristic used, and simpler was thus better for real designs. ...
Conference Paper
From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is quite evident in schedulers, which need to be large and single-cycle for maximum performance on out-of-order cores. In this work we present two straightforward modifications to a matrix scheduler implementation which greatly strengthen its scalability. Both are based on the simple observation that the wakeup and picker matrices are sparse, even at small sizes; thus small indirection tables can be used to greatly reduce their width and latency. This technique can be used to create quicker iso-performance schedulers (17-58% reduced critical path) or larger iso-timing schedulers (7-26% IPC increase). Importantly, the power and area requirements of the additional hardware are likely offset by the greatly reduced matrix sizes and subsuming the functionality of the power-hungry allocation CAMs.
... The obvious issue is, how much ILP is available? Current processor implementations attempt to use 4- or perhaps 8-way instruction issues each cycle [5, 3, 8]. At least at this time, it seems problematic that ILP compilers will efficiently support more than these levels. ...
Conference Paper
CMOS technology should, over the next few years, reach lithography of under 0.1 μ. This provides a die area improvement of a factor of 10 over today's technology. What is the best use of this area? Multiprocessors, very high level superscalar processors, VLIW, processor-memory combinations, or simpler processors with very large caches? Some have made the case that pin bandwidth is an important limit that must be confronted by any organization. This presentation reviews the scalability of our current technology and critically analyzes the various alternatives to the best use of silicon area in the era of 0.1 μ technologies. The rapid advance of technology has made possible a number of processor improvements which would have been thought impossible just a few years ago. These improvements include cycle time, cache size, and an extraordinary increase in the complexity of the processor; this complexity is required for the management of relatively large scales of instruction level parallelism
... An example selection policy is oldest first: the ready instruction that occurs earliest in program order is granted the functional unit. Butler and Patt [5] studied various policies for scheduling ready instructions and found that overall performance is largely independent of the selection policy. The HP PA-8000 uses a selection policy that is based on the location of the instruction in the window. ...
... When more than one instruction competes for the same resource, the selection logic chooses one of them according to some heuristic. Overall, the issue logic's main source of complexity and power dissipation is the many tag comparisons it must perform every cycle. Researchers have proposed several approaches to improve the issue logic's power efficiency. ...
Article
Full-text available
The improved performance of current microprocessors brings with it increasingly complex and power-dissipating issue logic. Recent proposals introduce a range of mechanisms for tackling this problem.
... [Figure: generic instruction window cell.] According to [14], the easiest-to-implement location-based policy [63] results in the highest energy efficiency. ...
Article
In recent years, reducing power has become an important design goal for high-performance microprocessors. This work attempts to bring the power issue to the earliest phases of microprocessor development, in particular, the stage of defining a chip microarchitecture. We investigate power-optimization techniques of superscalar microprocessors at the microarchitecture level that do not compromise performance. First, major targets for power reduction are identified within the microarchitecture, where power is heavily consumed or will be heavily consumed in next-generation superscalar processors. Then, a new, energy-efficient version of a multicluster microarchitecture is developed that reduces energy at the identified critical design points with minimal performance impact. A methodology is developed for energy-performance optimization at the microarchitecture level that generates, for a microarchitecture, a set of energy-efficient configurations, forming a convex hull in the power-performance space. Detailed simulation of the baseline and proposed multicluster architectures has been performed using the developed optimization methodology. A comparison of the two microarchitectures, both optimized for energy efficiency, shows that the multicluster architecture is potentially up to twice as energy efficient for wide-issue processors, with an advantage that grows with the issue width. Conversely, at the same power dissipation level, the multicluster architecture supports configurations with measurably higher performance than equivalent conventional designs.
... To save power, the scheduling window is non-compacting: instructions stay in the same location for the duration of the time they are in the window. While oldest-first selection logic is not used, previous work has shown that the selection priority has little effect on IPC [4]. Since an instruction's consumers may reside in any cluster, it broadcasts its destination tag and result to its local cluster in addition to all other clusters. ...
Article
The register file and result bypass network are large sources of power consumption in high-performance processors. This paper introduces a technique called Demand-Only Broadcast that reduces the power consumption of these structures in a clustered execution core. With this technique, an instruction's result is only broadcast to remote clusters if it is needed by dependants in those clusters. Demand-Only Broadcast was evaluated using a performance--power simulator of a high-performance clustered processor which already employs techniques for reducing register file and instruction window power. By eliminating 59% of the register file writes and data bypasses, the total processor power consumption (including the hardware needed by this mechanism) is reduced by 10%, while having less than a 1% impact on performance.
... An example selection policy is oldest ready first: the ready instruction that occurs earliest in program order is granted the functional unit. Butler and Patt [5] studied various policies for scheduling ready instructions and found that overall performance is largely independent of the selection policy. For example, the HP PA-8000 [19] uses a selection policy that is based on the location of the instruction in the window. ...
Article
To characterize future performance limitations of superscalar processors, the delays of key pipeline structures in superscalar processors are studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice-simulated for several feature sizes.
... By the end of the cycle, the starting address of the next instruction block must be generated. In some of the processors, the I-cache access time is longer than the cycle time, leading to a pipeline structure depicted in figure 6(a). For instance, the Intel PentiumPro [8] features a pipelined I-cache access completed within two and a half cycles. ...
Article
Full-text available
A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. To overcome the instruction fetch bottleneck shown in wide-dispatch "brainiac" processors, this paper presents a novel cost-effective mechanism called the multiple-block ahead branch predictor, which predicts the addresses of multiple basic blocks in a single cycle in an efficient way. Moreover, and unlike previous multiple-predictor schemes, the multiple-block ahead branch predictor can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high performance on superscalar processors. Finally, we show that pipelining the branch prediction process can be done by means of our predictor for "speed demon" processors to achieve a higher clock rate.
... However, compaction entails shifting instructions around in the queue every cycle and, depending on the instruction word width, may therefore be a source of considerable power consumption. Studies have shown that overall performance is largely independent of what selection policy is used (oldest first, position based, etc.) [4]. As such, the compaction strategy may not be best suited for low power operation; nor is it critical to achieving good performance. ...
Article
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
... This observation led us to a sampling method, using correct-path instructions to simulate the wrong path. A similar technique was used in [2]. ...
Article
The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small. 1. Introduction. Processor performance is strongly correlated with the clock cycle. Shorter clock cycle has been allowed by both improvements in silicon technology and careful processor design. As a consequence of this evolution, the IPC (average number of ins...
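The prescheduling principle can be illustrated with a simple dataflow-depth calculation: each instruction's predicted issue cycle is the latest ready cycle of its sources, and instructions enter the issue buffer ordered by that depth. The depth rule in this C sketch is an assumption for illustration, not the paper's exact heuristic.

    #define MAX_REGS 64

    static unsigned ready_cycle[MAX_REGS];  /* predicted cycle each register is produced */

    typedef struct {
        int      src1, src2;   /* source registers, -1 if unused       */
        int      dest;         /* destination register, -1 if none     */
        unsigned latency;
        unsigned depth;        /* predicted issue cycle (presched key) */
    } Inst;

    void preschedule(Inst *in)
    {
        unsigned d = 0;
        if (in->src1 >= 0 && ready_cycle[in->src1] > d) d = ready_cycle[in->src1];
        if (in->src2 >= 0 && ready_cycle[in->src2] > d) d = ready_cycle[in->src2];
        in->depth = d;                    /* earliest cycle all inputs are ready */
        if (in->dest >= 0)
            ready_cycle[in->dest] = d + in->latency;
        /* instructions then enter the issue buffer in increasing 'depth' order */
    }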
... This is in agreement with the earlier discussion about how the fifo steering policy dynamically adapts to the number of clusters being used based on the parallelism in the instruction stream, thus resulting in fewer inter-cluster bypasses. Butler and Patt [BP92] also report significant performance degradation when the "head-only" (fifo) scheduling policy is used with distributed reservation stations. ...
Article
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice-simulated for several feature sizes. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future.
Article
Spatial architectures are more efficient than traditional Out-of-Order (OOO) processors for computationally intensive programs. However, spatial architectures require mapping a program, either statically or dynamically, onto the spatial fabric. Static methods can generate efficient mappings, but they cannot adapt to changing workloads and are not compatible across hardware generations. Current dynamic methods are adaptive and compatible, but do not optimize as well due to their limited use of speculation and small mapping scopes. To overcome the limitations of existing dynamic mapping methods for spatial architectures, while minimizing the inefficiencies inherent in OOO superscalar processors, this paper presents DynaSpAM (Dynamic Spatial Architecture Mapping), a framework that tightly couples a spatial fabric with an OOO pipeline. DynaSpAM coaxes the OOO processor into producing an optimized mapping with a simple modification to the processor's scheduler. The insight behind DynaSpAM is that today's powerful OOO processors do for themselves most of the work necessary to produce a highly optimized mapping for a spatial architecture, including aggressively speculating control and memory dependences, and scheduling instructions using a large window. Evaluation of DynaSpAM shows a geomean speedup of 1.42x for 11 benchmarks from the Rodinia benchmark suite with a geomean 23.9% reduction in energy consumption compared to an 8-issue OOO pipeline.
Article
Full-text available
The evolution of microprocessor architecture depends upon the changing aspects of technology. As die density and speed increase, memory and program behavior become increasingly important in defining architecture tradeoffs. While technology enables increasingly complex processor implementations, there are physical and program behavior limits to the usefulness of this complexity. Physical limits include device limits as well as practical limits on power and cost. Program behavior limits result from unpredictable events occurring during execution. Architectures and implementations that span these limits are vital to the continued evolution of the microprocessor. Keywords: microprocessor architecture, scaling, memory wall, cache limits, program behavior limits. Successful microprocessor implementations depend upon the processor architect's ability to predict trends and advances in both technology and user behavior. Selecting an approach for a microprocessor implementation depend...
Article
In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support and its ability to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
Conference Paper
In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support and its ability to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
Article
Increasing levels of power dissipation threaten to limit the performance gains of future high-end, out-of-order issue microprocessors. Therefore, it is imperative that designers devise techniques that significantly reduce the power dissipation of the key hardware structures on the chip without unduly compromising performance. Such a key structure in out-of-order designs is the issue queue. Although crucial in achieving high performance, the issue queues are often a major contributor to the overall power consumption of the chip, potentially affecting both thermal issues related to hot spots and energy issues related to battery life. In this chapter, we present two techniques that significantly reduce issue queue power while maintaining high performance operation. First, we evaluate the power savings achieved by implementing a CAM/RAM structure for the issue queue as an alternative to the more power-hungry latch-based issue queue used in many designs. We then present the microarchitecture and circuit design of an adaptive issue queue that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. We compare two different dynamic adaptation algorithms that use issue queue utilization and parallelism metrics in order to size the issue queue on-the-fly during execution. Together, these two techniques provide over a 70% average reduction in issue queue power dissipation for a collection of the SPEC CPU2000 integer benchmarks, with only a 3% overall performance degradation.
Conference Paper
This paper investigates and quantifies the impacts of several aggressive performance-boosting techniques designed for superscalar processors on the performance of SMT architectures. First, we examine the synergy of multithreading and speculative execution. Second, we quantify the performance impact of not supporting the load-hit speculation. Finally, we consider the impact of pipelining instruction scheduling logic over two cycles. The general conclusion of our studies is that while speculative execution is still important to achieve high SMT performance, scheduler-related mechanisms can be relaxed because the pipeline bubbles created in the execution schedule of one thread are often filled by the instructions from other threads.
Article
Execution along mispredicted paths may or may not affect the accuracy of subsequent branch predictions if recovery mechanisms are not provided to undo the erroneous information that is acquired by the branch prediction storage structures. In this paper, we study four elements of the Two-Level Branch Predictor: the Branch Target Buffer (BTB), the Branch History Register (BHR), the Pattern History Tables (PHTs), and the Return Address Stack (RAS). For each we determine whether a recovery mechanism is needed, and, if so, show how to design a cost-effective one. Using five benchmarks from the SPECint92 suite, we show that there is no need to provide recovery mechanisms for the BTB and the PHTs, but that performance is degraded by an average of 30% if recovery mechanisms are not provided for the BHR and RAS.
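A minimal sketch, under assumed structures, of the kind of recovery the abstract calls for: checkpoint the BHR and RAS at each predicted branch and restore both on a misprediction. Copying the whole RAS, as done here, is the naive form; a cost-effective design would checkpoint less state.

    #include <stdint.h>
    #include <string.h>

    #define RAS_DEPTH 16

    typedef struct {
        uint32_t bhr;                 /* global branch history register */
        uint32_t ras[RAS_DEPTH];      /* return address stack contents  */
        int      ras_top;
    } PredictorCheckpoint;

    void checkpoint_take(PredictorCheckpoint *cp, uint32_t bhr,
                         const uint32_t ras[RAS_DEPTH], int ras_top)
    {
        cp->bhr = bhr;
        memcpy(cp->ras, ras, sizeof cp->ras);
        cp->ras_top = ras_top;
    }

    void checkpoint_restore(const PredictorCheckpoint *cp, uint32_t *bhr,
                            uint32_t ras[RAS_DEPTH], int *ras_top)
    {
        *bhr = cp->bhr;                         /* undo wrong-path history bits */
        memcpy(ras, cp->ras, RAS_DEPTH * sizeof *ras);
        *ras_top = cp->ras_top;                 /* undo wrong-path pushes/pops  */
    }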
Article
Thesis (Ph. D.)--University of Toronto, 1997. Includes bibliographical references.
Conference Paper
As the technology scales, reduction in transistor size creates many opportunities for increased circuit capabilities in reduced chip area. In modern wide-issue processors, performance of the processor is directly impacted by the time delay complexity of the dynamic scheduling logic. In this paper, we analyze the scaling of the time delay of instruction select logic at submicron technologies, and also present novel designs that provide a single selection tree for two similar functional units. The designs are based on a tree structure using arbiter cells of two and four inputs which can handle one or two functional units. The effects of technology and design decisions are shown based on simulations using four submicron technologies. The delays in the select logic trees are shown to decrease by an average of 60% from 130 nm technology to 45 nm technology when servicing a single functional unit. The double-grant arbiter cells are shown to build a tree that serves multiple functional units simultaneously with 65% less delay compared to multiple single-grant trees.
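Behaviorally, a tree of arbiter cells reduces to recursive priority arbitration: requests propagate up, one grant propagates down, and (arbitrarily in this C sketch) the left subtree wins ties. This illustrates the structure only, not the paper's circuit or its double-grant cells.

    #include <stdbool.h>

    /* Arbitrate over req[lo..hi); return the granted index, or -1 if none. */
    int arbitrate(const bool *req, int lo, int hi)
    {
        if (hi - lo == 1)
            return req[lo] ? lo : -1;        /* leaf cell: grant iff requesting */
        int mid = (lo + hi) / 2;
        int g = arbitrate(req, lo, mid);     /* left half has fixed priority    */
        return (g >= 0) ? g : arbitrate(req, mid, hi);
    }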
Conference Paper
The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small.
Conference Paper
A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, S. Palacharla et al. (1997) warned that the dynamic instruction scheduling logic for current machines performs an atomic operation: either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles, or you sacrifice clock frequency by not pipelining it, performing this atomic operation in a single long cycle. Both alternatives are unacceptable for high performance. The paper offers a third, acceptable, alternative: pipelined scheduling with speculative wakeup. This technique pipelines the scheduling logic without eliminating its ability to execute dependent instructions in consecutive cycles. With this technique, you sacrifice little IPC, and no clock frequency. Our results show that on the SPECint95 benchmarks, a machine using this technique has an average IPC that is 13% greater than the IPC of a baseline machine that pipelines the scheduling logic but sacrifices the ability to execute dependent instructions in consecutive cycles, and within 2% of the IPC of a conventional machine that uses single-cycle scheduling logic.
Conference Paper
Full-text available
Out-of-order issue mechanisms increase performance by dynamically rescheduling instructions that cannot be statically reordered by the compiler. Such mechanisms are effective but expensive in terms of both complexity and silicon area. It is therefore desirable to find cost-effective alternatives which can provide similar performance gains. In this paper we present delayed issue, a novel technique which allows instructions to be executed out-of-order without the hardware complexity of dynamic out-of-order issue. Instructions are inserted into per-functional unit delay queues using delays specified by the compiler. Instructions within a queue are issued in order; out of order execution results from different instructions being inserted into the queues at various delays. In addition to improving performance, delayed issue reduces code bloat when loops are pipelined
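The delay-queue mechanism can be sketched in a few lines of C: the compiler tags each instruction with a delay, the instruction lands in its functional unit's queue at that slot (sliding later if the slot is taken), and the queue drains strictly in order. The slot semantics here are assumptions about the scheme, not its definitive rules.

    #define QLEN 16

    typedef struct {
        int slots[QLEN];   /* instruction ids; -1 = empty */
    } DelayQueue;

    /* Insert at the compiler-specified delay, sliding later if occupied. */
    int dq_insert(DelayQueue *q, int inst_id, int delay)
    {
        for (int s = delay; s < QLEN; s++) {
            if (q->slots[s] < 0) {
                q->slots[s] = inst_id;
                return s;
            }
        }
        return -1;   /* queue full: stall insertion */
    }

    /* Each cycle the head slot issues (in order) and the queue shifts up. */
    int dq_issue(DelayQueue *q)
    {
        int head = q->slots[0];
        for (int s = 0; s < QLEN - 1; s++)
            q->slots[s] = q->slots[s + 1];
        q->slots[QLEN - 1] = -1;
        return head;   /* -1 means nothing issues this cycle */
    }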
Conference Paper
Full-text available
This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-point programs from the SPEC 92 suite. We first evaluate the number of instructions allowed to be concurrently processed by the execution stages of the pipeline. We then apply some restrictions on the execution issue of different instruction classes in order to define these configurations. We conclude that five to nine functional units are necessary to exploit instruction-level parallelism. An important point is that several data cache ports are required in a processor of degree 4 or more. Finally, we report on complementary results on the utilization rate of the functional units.
Conference Paper
New techniques are increasing the degree of instruction-level parallelism exploited by processors. Recent superscalar implementations include multiple functional units, allowing the parallel execution of several instructions from the same application program. The trend towards an expansion of the number of hardware resources is likely to continue in future superscalar designs, and in order to maximize the processor throughput, the computational load must be balanced among these resources by the dynamic instruction-issuing algorithm. We investigate the effect on performance caused by the way instructions are distributed among the functional units of superscalar processors. Our results show that a performance gain of up to 38% can be obtained when the instructions are evenly distributed among the functional units
Conference Paper
Full-text available
In out-of-order issue superscalar microprocessors, instructions must be buffered before they are issued. This buffer can be either unified (one buffer linked to all functional units) as in the P6, distributed among the units as in the PowerPC 620, or semi-unified (a few buffers each shared by several units) as in the MIPS R10000. Of course, the size of this buffer also plays a leading role in the performance of the processor. Intensive trace-driven simulations on the SPEC92 suite have been made in order to determine the best designs and relevant choices are pointed out according to the dispatch width of the processor
Conference Paper
A study was initiated that investigated detractors to parallelism and implementation constraints associated with the critical paths in the design of fine grain parallel machines. The outcome of the research is a new machine organization that facilitates and improves parallel instruction issue, and possibly cycle time, by improving instruction-level parallelism using specialized hardware. The authors describe the attributes of the proposed machine organization related to the analysis of instruction sequences for parallel issue and execution. They also describe the permanent preprocessing in the cache that allows for the determination of instructions for parallel execution prior to the instruction fetch and issue.
Article
While in the last decade image and video processing (IVP) have gradually moved from special purpose computer architectures based on massive parallelism (MP) to general purpose computer architectures based on instruction-level parallelism (ILP), a new challenge is now to be faced by the IVP community, namely the application of IVP also in small-size embedded systems (e.g., video players, smart cameras, digital diaries, etc.) based on ILP processors. Because of the requirements of low size, weight, and power consumption, these embedded systems do not take advantage of processors that feature advanced dynamic code optimization mechanisms such as those based on instruction reordering and register renaming. On the other hand, the compile time techniques of present generation compilers do not appear to be aggressive enough to exploit the massive parallelism of IVP tasks in ILP architectures, thus leading to inefficient programs. This paper analyzes the efficiency of IVP programs on ILP CPUs. In particular it presents: (1) a reference model for the efficient design and implementation of highly parallel programs, such as the ones of the IVP domain; (2) an analysis of the inefficiencies of IVP programs implemented on ILP processors; and (3) a set of techniques, deriving from the reference model, that overcome these inefficiencies. These techniques are based on a novel computing paradigm called bucket processing.
Article
Full-text available
Out-of-order execution and branch prediction are two mechanisms that can be used profitably in the design of supercomputers to increase performance. Proper exception handling and branch prediction miss handling in an out-of-order execution machine do require some kind of repair mechanism which can restore the machine to a known previous state. In this paper we present a class of repair mechanisms using the concept of checkpointing. We derive several properties of checkpoint repair mechanisms. In addition, we provide algorithms for performing checkpoint repair that incur little overhead in time and modest cost in hardware. We also note that our algorithms require no additional complexity or time for use with write-back cache memory systems than they do with write-through cache memory systems, contrary to statements made by previous researchers.
Conference Paper
Full-text available
HPS is a new model for a high performance microarchitecture which is targeted for implementing very dissimilar ISP architectures. It derives its performance from executing the operations within a restricted window of a program out-of-order, asynchronously, and concurrently whenever possible. Before the model can be reduced to an effective working implementation of a particular target architecture, several issues need to be resolved. This paper discusses these issues, both in general and in the context of architectures with specific characteristics.
Conference Paper
Full-text available
Our recent work in microarchitecture has identified a new model of execution, restricted data flow, in which data flow techniques are used to coordinate out-of-order execution of sequential instruction streams. We believe that the restricted data flow model has great potential for implementing very high performance computing engines. This paper defines a minimal functionality variant of our model, which we are calling HPSm. The instruction set, data path, timing and control of HPSm are all described. A simulator for HPSm has been written, and some of the Berkeley RISC benchmarks have been executed on the simulator. We report the measurements obtained from these benchmarks, along with the measurements obtained for the Berkeley RISC II. The results are encouraging.
Article
The large latency of memory accesses is a major impediment to achieving high performance in large scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of memory accesses with other computation and memory accesses. Previous studies on relaxed models have shown that the latency of write accesses can be hidden by buffering writes and allowing reads to bypass pending writes. Hiding the latency of reads by exploiting the overlap allowed by relaxed models is inherently more difficult, however, simply because the processor depends on the return value for its future computation. This paper explores the use of dynamically scheduled processors to exploit the overlap allowed by relaxed models for hiding the latency of reads. Our results are based on detailed simulation studies of several parallel applications. The results show that a substantial fraction of the read latency can be hidden using this technique. However, the major improvements in performance are achieved only at large instruction window sizes.
Conference Paper
The authors model the execution of the SPEC benchmarks under differing resource constraints. They repeat the work of the previous researchers, and under the hardware resource constraints imposed, similar results are obtained. On the other hand, when all constraints are removed except those required by the semantics of the program, degrees of parallelism in excess of 17 instructions per cycle are found. Finally, it is shown that if the hardware is properly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a processor that is reasonable to design today.
The Design and Implementation of the 4.3BSD UNIX Operating System
  • S. J. Leffler
  • M. K. McKusick
  • M. J. Karels
  • J. S. Quarterman
HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality
  • W. Hwu
  • Y. N. Patt