Article

The PowerPC 604 RISC microprocessor.


... Until the introduction of out-of-order processing, instructions were scheduled for execution in program order. Out-of-order processors identify independent instructions within a finite instruction window and schedule them out of order [82,126,51,110,55,6,32,81,99,37,112]. By exploiting a larger degree of instruction-level parallelism and memory-level parallelism, these processors tend to achieve high performance and thus became popular for commercial use [28,9,102,20,33,58,50,101,34]. ...
... Later, Smith presented dynamic branch prediction strategies that could achieve high prediction accuracy [106]. The bimodal predictor proposed in that paper was used in many commercial processors [6,110,61]. While the bimodal predictor uses the branch instruction address alone, Yeh and Patt [127] propose including global branch history and local histories when predicting branches. ...
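The bimodal scheme mentioned above is small enough to sketch: one 2-bit saturating counter per table entry, indexed by the branch address alone. The table size, the initial "weakly taken" counter state, and the index hash below are illustrative assumptions, not the parameters of any shipping processor.

```python
# Minimal sketch of a bimodal branch predictor with 2-bit saturating counters.
# Table size, initial state, and indexing are assumptions for illustration.

class BimodalPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries    # states 0..3; 2 = weakly taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, wrap into table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop-closing branch: taken nine times, then falls through once.
p = BimodalPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x4000) == taken:
        correct += 1
    p.update(0x4000, taken)
# The 2-bit hysteresis predicts all nine taken outcomes correctly and
# mispredicts only the loop exit, so correct == 9.
```

The hysteresis is the point of the second bit: a single not-taken outcome moves the counter from 3 to 2, so the predictor still says "taken" when the loop is re-entered.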
Thesis
Full-text available
Continuing advances in branch prediction provide a promising avenue for mitigating the impact of control dependences on extracting instruction-level parallelism in high-performance processors. However, the rate of improvement in prediction rates has slowed significantly and may be approaching an asymptotic upper bound, particularly once practical constraints on the predictor’s cycle time, energy, and area are taken into consideration. To reach higher levels of performance, future processors must not just reduce the number of mispredictions, but should employ mechanisms that reduce the performance penalty of each misprediction. This thesis presents EMR — an approach that boosts performance by reducing the misprediction penalty and by amplifying instruction delivery bandwidth to the execution core. EMR implements a novel in-place misprediction recovery that minimizes latency to activate instructions from the correct control path, while also utilizing control independence to avoid unnecessary re-execution of instructions from beyond control flow joins. Performance analysis shows that EMR outperforms the baseline by up to 81% with a mean performance gain of 23% in CINT2006. Additionally, using EMR speeds up MiBench, Graph500, and CFP2006 by 20%, 10.5%, and 4% respectively. Energy analysis shows that EMR consumes 16% lower energy than the baseline in CINT2006. As we scale up the window sizes, EMR can, with its amplified instruction delivery, commit up to seven µops per cycle without increasing frontend bandwidth.
... The first class has been implemented since early dynamically scheduled processors such as the Tomasulo-based IBM 360/91 [103], and it groups instructions that use the same execution units, i.e. by instruction types, such as integer, floating point, memory, branches, etc. Other examples of these are the MIPS R10000 [110], the Alpha 21264 [42,53], the PowerPC 604 [94], etc. The second class groups program segments made of consecutive control-dependent instructions (called tasks, threads or traces), and assigns them to a different cluster or processing element (PE) for parallel execution. ...
... The first model does the issue and register read in a single stage (like it occurs in short pipeline layouts), while the other two have 1 and 2 read stages respectively. The first model also corresponds to an architecture that reads the register file before inserting the instruction into the issue queue, like a PowerPC [94]. Figure 5-5b shows the value prediction speedups for these three pipeline depths. ...
Article
Over the past decade superscalar microprocessors have achieved enormous improvements in computing power by exploiting higher levels of parallelism in many different ways. High-performance superscalar processors have experienced remarkable increases in processor width, pipeline depth and speculative execution. All of these trends have come at an extremely high cost in hardware complexity and chip resource consumption. Until recently, their main limitation has been the availability of such resources on the chip, but with current technology shrinks and increases in transistor budgets, other limiting factors have become preeminent, such as power consumption, temperature and wire delays. These new problems greatly compromise the scalability of conventional superscalar designs. Many previous works have demonstrated the effectiveness of partitioning the layout of several critical hardware components as a means to keep most of the parallelism while improving the scalability. Some of these components are the register file, the issue queue and the bypass network. Their partitioning is the basis for the so-called clustered architectures. A clustered processor core, made up of several low complexity blocks or clusters, can efficiently
... It is possible to determine the legality of branch hoisting at the source level, but it is difficult to express the transformation in conventional source-level representations such as abstract syntax trees. Hence we perform branch hoisting on a low-level form of intermediate code (such as gcc's RTX), after most standard local and loop optimizations, but before physical register allocation or instruction scheduling. ...

... Statement Hoist Criterion: A statement s can be hoisted from a point p to a set of points {p0, p1, ..., pN} in the program provided strong domination holds: any path from the start of the program to p, or from p back to itself, that does not otherwise include p, passes exactly once through one and only one of the points in {p0, p1, ..., pN}. (Instruction scheduling must be aware of future branches, to try to ensure that no stalls occur due to a future branch that is still in the instruction pipeline. Gupta [5] introduced generalized dominators: a set of nodes that dominate a point p. However, our strong dominators are "stronger", as we require uniqueness and consider paths from p to itself.) ...
Conference Paper
Full-text available
The performance and hardware complexity of superscalar architectures is hindered by conditional branch instructions. When conditional branches are encountered in a program, the instruction fetch unit must rapidly predict the branch predicate and begin speculatively fetching instructions with no loss of instruction throughput. Speculative execution has a high hardware cost, is limited by dynamic branch prediction accuracy, and does not scale well for increasingly superscalar architectures. The conditional branch bottleneck would be solved if we could somehow move branch condition evaluation far forward in the instruction stream and provide a new branch instruction that encoded both the source and target address of a branch. This paper summarizes the hardware extensions to support just such a Future Branch, then gives a compiler algorithm for hoisting branch evaluation across many blocks. The algorithm is applicable to other optimizations for parallelism, such as prefetching data.
... Nowadays the CPUs of the latest generation of computers have multiple functional units and are able to issue multiple independent instructions per clock cycle; they are called superscalar processors [2,4,7,10,12,13,16,18,20]. In these CPUs instructions are not necessarily executed in order: an instruction is executed when the processor resources are available. ...

... Some modern designs use deeper pipelines to obtain higher clock frequencies. There are works in the literature that support the use of shorter pipelines, which better fit the needs of today's personal computers [11,12]. ...
... An FPU (Floating Point Unit) is a principal component in graphics accelerators [1,2], digital signal processors, and high performance computer systems. As the chip integration density increases due to the advances in semiconductor technology, it has become possible for an FPU to be placed on a single chip together with the integer unit, allowing the FPU to exceed its original supplementary function and becoming a principal element in a CPU [2,3,4,5]. In recent microprocessors, a floating point division unit is built on a chip to speed up the floating point division operation. ...
... Because the proposed floating point divider can perform quotient conversion and rounding within only one cycle, the complete floating point division takes 30 cycles in the radix-4 case, 21 cycles in the radix-8 case, and 15 cycles in the radix-16 case. As shown in Table 2, in the radix-4 case the PowerPC 604e [3] takes 31 cycles, the PA-RISC 8000 [15] takes 31 cycles, and the Pentium [7] takes 33 cycles to complete the floating point division operation. In the radix-8 case, the UltraSPARC [4] takes 22 cycles. ...
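The cycle counts above track how many quotient bits each recurrence step retires: a radix-2^k divider retires k bits per cycle, so doubling the radix roughly halves the iteration count. A rough model of this relationship (the 56-bit quotient width and the 2-cycle overhead are assumptions chosen to illustrate the trend, not figures from the paper; real designs, like the 15-cycle radix-16 case, differ by a cycle or two):

```python
import math

def srt_division_cycles(radix, quotient_bits=56, overhead=2):
    # Each SRT recurrence step retires log2(radix) quotient bits; a couple of
    # extra cycles cover initialization, conversion/rounding, and normalization.
    bits_per_step = int(math.log2(radix))
    return math.ceil(quotient_bits / bits_per_step) + overhead

cycles = {r: srt_division_cycles(r) for r in (4, 8, 16)}
# → {4: 30, 8: 21, 16: 16}: fewer iterations at higher radix, at the price
# of a more complex quotient-digit selection function per step.
```

The model captures why moving from radix-4 to radix-8 saves ~9 cycles while radix-8 to radix-16 saves only ~5: the savings shrink as ceil(bits/k) flattens.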
Conference Paper
Full-text available
Processing floating point division generally consists of SRT recurrence, quotient conversion, rounding, and normalization steps. In the rounding step, a high speed adder is required for the increment operation, increasing the overall execution time. In this paper, a floating point divider performing quotient conversion and rounding in parallel is presented by analyzing the operational characteristics of floating point division. The proposed floating point divider does not require any additional execution time, nor does it need any high speed adder for the rounding step. The proposed divider can execute quotient conversion, rounding, and normalization within one cycle. To support design efficiency, the quotient conversion/rounding unit of the proposed divider can be shared efficiently with the addition/rounding hardware of a floating point multiplier.
... Superscalar processors differ in the mechanisms adopted to handle data dependences and control dependences [3]. Some processors, such as the PowerPC 604 [4] and 620 [5] and the MIPS R10000 [6], have mechanisms that perform dynamic instruction scheduling, allowing subsequent instructions to execute while a dependence is being resolved. ...
Conference Paper
Sophisticated mechanisms for dynamic instruction scheduling, dynamic branch prediction, and speculative execution are found in superscalar architectures aimed at applications that demand high performance. Nevertheless, these architectures still deliver performance well below what an ideal superscalar architecture would achieve. This paper presents experimental results that identify some of the main causes of the discrepancy between the performance of a real superscalar architecture and that of an ideal one. A new mechanism, based on multiple instruction streams, that aims to reduce the limitations found is also presented.
... within the home appliance). As the computational power is also reduced ([14], [15], [16]), this first data mining may be mere statistics, such as correlations, histograms and boolean algebra, or simple algorithms, such as small neural networks ([17], [18]), decision trees [19] or regression analysis [20]. A pertinent comparison of the three approaches is made by Tso and Yau in [21]. ...
... The most common state machine for one-level schemes is the two-bit counter predictor,(12) which is implemented in several contemporary processors, including the Intel Pentium(10) and the PowerPC 604.(11) The size of the one-level BTB buffer affects branch prediction accuracy, since several branches may contend for the same entry in the buffer. The effect of the buffer size (S) on the number of contentions is examined and plotted in Fig. 2. The knee of the majority of the curves occurs near the 512- to 1024-entry region. ...
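The contention effect described above can be sketched with a direct-mapped BTB index (the indexing scheme is an assumption for illustration; associative organizations reduce, but do not eliminate, such conflicts):

```python
# Sketch of BTB index contention: branches whose addresses map to the same
# entry evict each other, and larger buffers reduce such collisions.
# Direct-mapped indexing by (pc >> 2) mod entries is an assumption.

def contentions(branch_pcs, entries):
    seen = {}
    conflicts = 0
    for pc in branch_pcs:
        idx = (pc >> 2) % entries
        if idx in seen and seen[idx] != pc:
            conflicts += 1          # a different branch occupies this entry
        seen[idx] = pc
    return conflicts

# 16 distinct branches spaced 64 bytes apart, visited in two passes:
pcs = [0x1000 + i * 64 for i in range(16)] * 2
small = contentions(pcs, 8)    # 8-entry BTB: every access after the first
                               # aliases to an entry held by another branch
large = contentions(pcs, 512)  # 512-entry BTB: each branch keeps its entry
```

With the tiny buffer every branch evicts its predecessor (31 conflicts here), while the 512-entry buffer resolves all 16 branches without a single collision, mirroring the knee observed in the plotted curves.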
Article
Full-text available
Profile based optimization can be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling that uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%-4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.
... In this paper, we present the definition of two models for testing units of modern microprocessors, construction of test code, and checking coverage measurement. Both proposed models are based on the architecture of the PowerPC microprocessor [4,10]. The models have been verified using a random code generator and `real world' applications, while measuring and analyzing the coverage results. ...
Article
The design verification of state-of-the-art high-performance microprocessors has become a significant challenge for test engineers. Deep pipelines, multiple execution units, out-of-order and speculative execution techniques, typically found in such microprocessors, contribute much to this complexity. Conventional methods, which treat the processor as a logic state machine or apply architectural level tests, fail to provide coverage of all possible corner cases in the design. This paper presents a functional verification method for modern microprocessors, which is based on innovative models of the microprocessor architecture, intended to cover the testing of all corner cases. In order to test the models presented in this work, an architecture independent coverage measurement system has been developed. The models were tested with both random code and real world applications in order to determine which of the two achieves higher coverage.
... As the intention of this project is to physically implement a prototype system, rather than produce a VHDL simulation model, a Reduced Instruction Set Computer (RISC) [20,18,23] with a Harvard bus architecture that is easy to implement has been chosen as our general purpose CPU. This led to the choice of a design that is compatible with the MIPS R3000 [17,14] processor, mainly due to its widely available documentation. ...
Conference Paper
This paper analyses the performance of a custom compute machine that performs electrostatic plasma simulations using Field Programmable Gate Arrays (FPGAs). Although FPGAs run at slower clock speeds than their off-the-shelf counterparts, the processing power lost in the reduced number of clock cycles per second is quickly recovered in the high degree of spatial parallelism that is achievable within the devices. We describe the development of the architecture of the machine and its support for the C programming language via the use of a cross-compiler. Results are presented and a discussion is given on the constraints of FPGAs in particular and the hardware design process in general.
... This is illustrated in the right diagram of the same figure, showing an increase in CPL of 1 stage. The left diagram is expected to be found on e.g., a PowerPC 604 [SDC94] or a PA 7100LC [BKQW95], while the right diagram resembles the situation on a Pentium processor, except that a Pentium cannot execute a rotate over more than 1 bit in parallel with any other instruction, resulting in a further increase of the CPL. (Figure caption: The first step of MD4's round 2 implemented on a two-way superscalar architecture.) ...
Conference Paper
To enhance system performance computer architectures tend to incorporate an increasing number of parallel execution units. This paper shows that the new generation of MD4-based customized hash functions (RIPEMD-128, RIPEMD-160, SHA-1) contains much more software parallelism than any of these computer architectures is currently able to provide. It is conjectured that the parallelism found in SHA-1 is a design principle. The critical path of SHA-1 is twice as short as that of its closest contender RIPEMD-160, but realizing it would require a 7-way multiple-issue architecture. It will also be shown that, due to the organization of RIPEMD-160 in two independent lines, it will probably be easier for future architectures to exploit its software parallelism.
... All data reported in this paper are generated by a fully functional cycle equivalent model, based on the PowerPC 604. The model is based on published reports [6,8,19] and accurately models all aspects of the microarchitecture. An enhanced PowerPC 604 model and the benchmark set are discussed in the next two subsections. ...
Conference Paper
In order to achieve high performance, contemporary microprocessors must effectively process the four major instruction types: ALU, branch, load, and store instructions. This paper focuses on the reduction of load instruction execution latency. Load execution latency is dependent on memory access latency, pipeline depth, and data dependencies. Through load effective address prediction both data dependencies and deep pipeline effects can potentially be removed from the overall execution time. If a load effective address is correctly predicted, the data cache can be speculatively accessed prior to execution, thus effectively reducing the latency of load execution. A hybrid load effective address prediction technique is proposed, using three basic predictors: Last Address Predictor (LAP), Stride Predictor (SP), and Global Dynamic Predictor (GDP). In addition to improving load address prediction, this work explores the balance of data ports in the cache memory hierarchy, and the effects of load and store aliasing in wide superscalar machines. Results: Using a realistic hybrid load address predictor, load address prediction rates range from 32% to 77% at an average of 51% for SPECint95 and 60% to 96% at an average of 87% for SPECfp95. For a wide superscalar machine with a significant number of execution resources, this prediction rate increases IPC by 12% and 19% for SPECint95 and SPECfp95 respectively. It is also shown that load/store aliasing decreases the average IPC by 33% for SPECint95 and 24% for SPECfp95.
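The Stride Predictor (SP) component summarized above can be sketched as a per-PC table of (last address, stride) pairs; the table organization here is an illustrative assumption, not the paper's exact design:

```python
# Sketch of a per-PC stride address predictor: the predicted effective
# address is last_address + stride, with the stride learned from the
# difference between consecutive addresses produced by the same load PC.

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> (last_address, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None              # cold entry: no prediction
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, address):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (address, address - last)  # retrain the stride
        else:
            self.table[pc] = (address, 0)

sp = StridePredictor()
hits = 0
for addr in range(0x1000, 0x1000 + 8 * 10, 8):  # array walked with stride 8
    if sp.predict(0x400) == addr:
        hits += 1
    sp.update(0x400, addr)
# Two warm-up accesses (cold entry, then stride still 0), then the remaining
# eight addresses are predicted correctly.
```

With a correct prediction the cache can be probed before the load even issues, which is exactly the latency the hybrid scheme is after; the LAP and GDP components would cover the non-strided access patterns this table misses.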
... As pointed out above, the performance of a reuse scheme is determined not only by the percentage of reused instructions but also by the reuse latency. Let us assume a dynamically scheduled processor with a microarchitecture that keeps speculative results in the reorder buffer or rename buffers and is pipelined in the stages shown in Figure 4.a (this example is based on the PowerPC 604 [15], although the conclusions are the same for other pipelines). Every instruction is fetched and then it is decoded and the physical location that holds the last definition of each source operand (if available) is identified. ...
Conference Paper
Full-text available
A mechanism for dynamic instruction-level reuse in superscalar microprocessors is presented. The underlying concept that the mechanism exploits is the run-time removal of redundant computations, and in particular the elimination of common subexpressions and invariants. Removing redundant computation is a target of optimizing compilers but sometimes they do not succeed due to their limited knowledge of the data. Moreover, the proposed mechanism can also remove quasi-redundant computations, such as subexpressions that often produce the same result but sometimes they differ, depending on the data values, and thus they cannot be eliminated by the compiler. Experimental results for the Spec95 show that on average the mechanism can avoid the execution of about 32% of the dynamic instructions and provides a 1.10 speedup in a superscalar microprocessor. An extensive evaluation of different configurations and a comparison with previous schemes is presented, as well as the performance potential ...
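The reuse idea can be illustrated with a table keyed by opcode and operand values that short-circuits re-execution of redundant computations; the dictionary organization below is a deliberate simplification of the hardware mechanism described in the abstract:

```python
# Sketch of dynamic instruction reuse: results of prior executions are
# recorded per (opcode, operand values); a matching entry lets the
# processor skip the functional unit entirely.

class ReuseBuffer:
    def __init__(self):
        self.table = {}
        self.hits = 0        # computations satisfied from the buffer
        self.executions = 0  # computations actually performed

    def execute(self, op, a, b):
        key = (op, a, b)
        if key in self.table:
            self.hits += 1               # reuse: bypass the functional unit
            return self.table[key]
        self.executions += 1             # miss: execute and record the result
        result = {'add': a + b, 'mul': a * b}[op]
        self.table[key] = result
        return result

rb = ReuseBuffer()
for _ in range(5):
    rb.execute('mul', 7, 6)  # a loop-invariant multiply: same operands
# Only the first instance executes; the other four hit the reuse buffer.
```

This captures the common-subexpression and invariant cases the abstract mentions; the quasi-redundant case is handled naturally too, since an operand change simply produces a new key and a fresh execution.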
... PRF-style machines have been around for more than a decade, starting with the MIPS R10000 [19], Alpha 21264 [12][13][5], IBM Power4 [17] and Power5 [16], and Intel Pentium4 [9]. ARF-style machines have a similar history, used in machines like the PowerPC 604 [15] and the Intel P6 [4][8] architecture, which formed the basis for the Intel Pentium Pro [14] and Pentium III [11]. ARF are also used in current generation microprocessors such as the AMD K8 [6][10] family, which includes AMD Athlon and Opteron, Intel Pentium M [7], as well as the recently launched Intel Core family. ...
Conference Paper
Full-text available
Based on operand delivery, existing microprocessors can be categorized into architected register file (ARF) or physical register file (PRF) machines, both with or without payload RAM (PL). Though many previous generation microprocessors use a PRF without PL, the trend of newer microprocessors targeting lower power environments seems to be moving towards ARF with PL. We quantitatively analyze power consumption of different machine styles: ARF with PL, ARF without PL, PRF with PL, and PRF only machines. Our result shows that PRF without PL consumes the least amount of power and is fundamentally the best approach for building power-aware out-of-order microprocessors.
... All data reported in this paper are generated using the new fMW [1] tool which integrates a functional simulator and a cycle-accurate timing simulator that is built based on a validated PowerPC 604 model [2]. The machine model is based on published reports [4,7,10] and accurately models all aspects of the microarchitecture. The machine model and the simulation framework used in this paper are discussed in the next two subsections. ...
Conference Paper
As superscalar pipelines become wider and deeper, the percentage of dynamic instructions fetched into the machine from the mispredicted path significantly increases. This paper discusses how a new cycle-accurate performance simulator is used to accurately measure mispredicted path effects on the cache hierarchy. Previously published results based on less accurate tools indicated that mispredicted path instructions have the serendipitous positive effect of doing memory prefetching. Our results show that while such prefetching does occur for some benchmarks, it does not occur consistently for all benchmarks. Furthermore the IPC impact varies widely among the benchmarks. SPECint95 benchmarks show IPC changes ranging from -8% to +12%.
Article
Accurate instruction fetch and branch prediction is increasingly important on today's wide-issue architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. Several researchers have proposed very effective fetch and branch prediction mechanisms including branch target buffers (BTB) that store the target addresses of taken branches. An alternative approach fetches the instruction following a branch by using an index into the cache instead of a branch target address. We call such an index a next cache line and set (NLS) predictor. A NLS predictor is a pointer into the instruction cache, indicating the target instruction of a branch. In this paper we examine the use of NLS predictors for efficient and accurate fetch and branch prediction. Previous studies associated each NLS predictor with a cache line and provided only one-bit conditional branch predictors. Our study examines the use of NLS predictors with highly accurate two-level correlated conditional branch architectures. We examine the performance of decoupling the NLS predictors from the cache line and storing them in a separate tag-less memory buffer. Our results show that the decoupled architecture performs better than associating the NLS predictors with the cache line, that the NLS architecture benefits from reduced cache miss rates, and it is particularly effective for programs containing many branches. We also provide an in-depth comparison between the NLS and BTB architectures, showing that the NLS architecture is a competitive alternative to the BTB design.
Article
In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.
Chapter
There is a wealth of technological alternatives that can be incorporated into a cache or processor design. The memory configuration and the cache size, associativity and block size for each of the components in the hierarchy are some of these applicable to memory subsystems. For processors, these include branch handling strategies, functional unit duplication, and instruction fetch, issue, completion and retirement policies. The deciding factor between the various choices available is a function of the performance each adds versus the cost each incurs. The large number of available design choices inflates the design space. The profitability of a given design is measured through the execution of application programs and other workloads. Trace-driven simulation is used to simplify this process.
Chapter
For an instruction pipeline to attain its maximum performance, it is, at the very least, necessary that it be supplied with instructions at a rate that matches its maximum processing rate. The main impediment to ensuring adequate instruction-supply is usually the high access time (relative to the pipeline cycle time) of the memory from which instructions are fetched. At any given moment, the addresses of the next instructions required are easy to determine if there are no branch (control-transfer)1 instructions involved: simply incrementing the program counter, or similar addressing register, suffices. A branch instruction, on the other hand, presents a problem, since the addresses of the following instructions cannot be known with absolute certainty until after the branch has been executed; furthermore, the execution may depend on a condition yet to be determined by preceding instructions. Consequently, unless special measures are taken, a branch instruction will introduce a gap — the delay of which we shall term the branch latency — in the flow of instructions. In this chapter we shall discuss a number of measures for dealing with this, which is arguably the hardest problem in the design of high-performance instruction pipelines [116].
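The cost of the branch-latency gap described above can be quantified with the usual back-of-the-envelope model: effective CPI = base CPI + branch frequency × misprediction rate × branch latency. The numbers below are illustrative assumptions, not figures from this chapter:

```python
# Simple model of the performance cost of branch latency: every branch whose
# outcome is not resolved (or correctly predicted) in time inserts stall
# cycles into the instruction flow, amortized over all instructions.

def effective_cpi(base_cpi, branch_freq, mispredict_rate, branch_latency):
    return base_cpi + branch_freq * mispredict_rate * branch_latency

# Assumed workload: 20% branches, 10% of them mispredicted, and a 3-cycle
# branch latency on an otherwise ideal scalar pipeline (base CPI of 1.0).
cpi = effective_cpi(1.0, 0.20, 0.10, 3)
# → 1.06, i.e. a 6% slowdown from branch latency alone.
```

The model also shows why the problem is "arguably the hardest" on wide-issue machines: as base CPI drops well below 1, the same absolute stall term becomes a much larger fraction of total execution time.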
Chapter
A pipeline can fail to achieve its maximum speedup if there are discontinuities in the supply of instructions or data. Discontinuities in the flow of instructions have been covered in the preceding chapter; in this chapter, we shall discuss the problem of discontinuities in the flow of data as well as corresponding solutions. The data-flow discontinuities arise mainly from two sources: one is a mismatch between the rate at which the pipeline requests data and the rate at which the data is delivered to the pipeline; the other is data hazards (or data dependences) that occur between instructions in the pipeline when one instruction cannot proceed because its progress depends on that of another instruction. The basic problem of mismatching rates is largely solved by the use of appropriate high-speed intermediate storage (cache, registers, etc.) and other techniques discussed in Chapter 3 and will not be considered further. This chapter is therefore devoted to just the hazards and related issues.
Chapter
We shall begin by introducing the main issues in the design and implementation of pipelined and superscalar computers, in which the exploitation of low-level parallelism constitute the main means for high performance. The first section of the chapter consists of a discussion of the basic principles underlying the design of such computers. The second section gives a taxonomy for the classification of pipelined machines and introduces a number of commonly used terms. The third and fourth section deal with the performance of pipelines: ideal performance and impediments to achieving this are examined. The fifth section consists of some examples of practical pipelines; these pipelines form the basis for detailed case studies in subsequent chapters. The last section is a summary.
Chapter
This chapter deals with the handling of interrupts in pipelined machines and with recovery in the event of branch mispredictions. The first section is a discussion of basic implementation techniques, the second consists of a number of case studies, and the third is a summary.
Article
This paper explores the microarchitecture of a simultaneous multithreaded processor with multimedia enhancements and evaluates its performance with a multimedia program as workload. Novel features are the combination of a simultaneous multithreaded processor with multimedia instructions and its evaluation with a realistic multimedia workload. The workload is a hand-optimized MPEG-2 video decompression algorithm that extensively uses multimedia units and moreover applies cooperative multithreading to exploit coarse-grain parallelism. Various processor configurations were simulated to identify and eliminate potential bottlenecks. Our simulation results show that a simultaneous multithreaded processor with multimedia enhancements is able to yield a threefold IPC (instructions per cycle) increase over a comparable (single-threaded) superscalar processor.
Book
To satisfy the higher requirements of digitally converged embedded systems, this book describes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and greater flexibility and superior performance per watt can then be achieved. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores including CPU cores and special-purpose processor cores that achieve highly arithmetic-level parallelism. The authors developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book. Provides readers an overview and practical discussion of heterogeneous multicore technologies from both a hardware and software point of view; Discusses a new, high-performance and energy efficient approach to designing SoCs for digitally converged, embedded systems; Covers hardware issues such as architecture and chip implementation, as well as software issues such as compilers, operating systems, and application programs; Describes three chips developed according to the defined heterogeneous multicore architecture, including chip implementations, software environments, and working applications. © 2012 Springer Science+Business Media New York. All rights are reserved.
Chapter
The processor cores described in this chapter are well tuned for embedded systems. They are SuperH™ RISC engine family processor cores (SH cores) as typical embedded CPU cores, flexible engine/generic ALU array (FE–GA, or shortly FE, as flexible engine) as a reconfigurable processor core, MX core as a massively parallel SIMD-type processor, and video processing unit (VPU) as a video processing accelerator. We can implement heterogeneous multicore processor chips with them, and three implemented prototype chips, RP-1, RP-2, and RP-X, are introduced in Chap. 4.
Conference Paper
There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only natural that architectural features that benefit only multiprocessors are less likely to be adopted in commodity microprocessors. In this paper, we explore multiple-context processors, an architectural technique proposed to hide the large memory latency in multiprocessors. We show that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments. We propose an alternative design that combines the best features of two existing approaches, and present simulation results that show it yields better performance for both multiprogrammed workloads on a workstation and parallel applications on a multiprocessor. By addressing the needs of the workstation environment, our proposal makes multiple contexts more attractive for commodity microprocessors.
Article
Full-text available
Recent processors issue multiple instructions per cycle and employ multiple functional units and hardware scheduling techniques to achieve maximum parallelism at the instruction level. To exploit maximum efficiency, such multiple-issue processors must be fed by high instruction fetch bandwidth. The instruction fetch unit must fetch enough instructions every cycle to keep the functional units busy. No clock cycle should go idle, and thus several instructions need to be fetched at every clock cycle. In a conventional multi-way set-associative I-cache, an instruction is read in the following way. The address generated by the processor is divided into two parts: tag and index. The index selects the set of the I-cache to be accessed. The tag is compared simultaneously with the tags of all the blocks in the set. The data is read from the block whose tag matches the instruction tag. If none of the stored tags matches the instruction tag, then a cache miss occurs and the instruction is read from the other levels of the memory hierarchy. With very fast clock cycles, this whole procedure requires more than one clock cycle to complete. It is expected that future processors, which are likely to have very deep pipelines, will require several pipeline stages to fetch instructions. The two main factors that affect the performance of a multi-issue processor are cache access time and fetching correct instructions every cycle. In this paper we address both these problems by predicting the instruction-cache address from which the next instruction has to be fetched.
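The tag/index lookup described in the abstract can be sketched as follows. The block size, set count, associativity, and FIFO refill policy are assumptions for illustration, not parameters from the paper:

```python
# Illustrative sketch of a 4-way set-associative I-cache lookup:
# the address splits into tag / index / offset, the index selects a set,
# and the tag is compared against every block tag in that set.

BLOCK_SIZE = 32      # bytes per cache block (assumed)
NUM_SETS   = 64      # number of sets (assumed)

def split_address(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr % BLOCK_SIZE
    index  = (addr // BLOCK_SIZE) % NUM_SETS
    tag    = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset

class SetAssociativeICache:
    def __init__(self, ways=4):
        # Each set holds up to `ways` (tag, block) entries; FIFO refill.
        self.ways = ways
        self.sets = [[] for _ in range(NUM_SETS)]

    def lookup(self, addr):
        """Return True on a hit; on a miss, refill the block and return False."""
        tag, index, _ = split_address(addr)
        way = self.sets[index]
        if any(t == tag for t, _ in way):   # tags compared "simultaneously"
            return True
        if len(way) == self.ways:           # evict the oldest block on conflict
            way.pop(0)
        way.append((tag, None))             # fetched from the next memory level
        return False
```

Two fetches that fall in the same block hit after the first miss fills the line, which is the behavior the hardware tag compare provides in a single (or pipelined) access.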
Article
A major obstacle in designing superscalar processors is the size and port requirement of the register file. Multiple register files of a scalar processor can be used in a superscalar processor if results are renamed when they are written to the register file. Consequently, a scalable register file architecture can be implemented without performance degradation. Another benefit is that the cycle time of the register file is significantly shortened, potentially producing an increase in the speed of the processor.
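The write-time renaming idea above can be sketched with a toy rename map. The table sizes and the immediate recycling of old mappings are simplifying assumptions (a real design frees a physical register only after the renaming instruction commits):

```python
# Toy sketch of register renaming: each architectural destination is
# mapped to a fresh physical register when its result is written, so
# multiple register files can hold results without false conflicts.

class RenameMap:
    def __init__(self, arch_regs=8, phys_regs=16):
        self.map = {r: r for r in range(arch_regs)}    # arch -> phys mapping
        self.free = list(range(arch_regs, phys_regs))  # free physical registers

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a new result."""
        old = self.map[arch_reg]
        new = self.free.pop(0)
        self.map[arch_reg] = new
        self.free.append(old)   # naive recycling; real hardware waits for commit
        return new

    def read_src(self, arch_reg):
        """Source operands read the latest mapping."""
        return self.map[arch_reg]
```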
Article
Full-text available
The principles of the RISC architecture guided the design of the previous generation of processors. These principles have accelerated the performance gains of these processors over their predecessors. The current generation of CPUs is poised to continue this rapid acceleration of performance. In this paper we assert that the performance gains of the current generation are due to changes that are decidedly not RISC. We call this current generation of processors Post-RISC, and we describe the characteristics of Post-RISC processors. In addition, we survey six processors of the current generation which highlight the principles of Post-RISC.
Article
The PowerPC 620 RISC microprocessor is the first chip for the application server and technical workstation product line within the PowerPC family. It utilizes a high performance microarchitecture with many advanced superscalar features to exploit instruction level parallelism. It is the first 64-bit implementation of the PowerPC architecture supporting both 32- and 64-bit application software, and is compatible with the PowerPC 601, PowerPC 603, and PowerPC 604 microprocessors.
Article
Full-text available
In the quest for higher performance, virtually all recent models of significant superscalar lines have introduced both shelving and register renaming. The implementation of these advanced features has had major implications on the datapath of the processors. As a consequence, the datapaths of recent superscalars differ substantially from those used in previous models. Although the internal details of most new microarchitectures have been disclosed, the design space of their datapath has not yet been described. In this paper we contribute to this area by identifying generic alternatives of the kernel of the datapath in advanced superscalar processors. Our work is based on the exploration of the design spaces of both shelving and register renaming, since the manner of their implementation is decisive for the layout of the kernel of the datapath. From the design spaces mentioned, we first point out those aspects which are relevant for the generic alternatives of the kernel of the datapath and review the related design choices. Then from the feasible combinations of the indicated design choices we derive and present 24 generic datapath alternatives. In this framework we also show which generic alternatives have been chosen in recent superscalars.
Article
This paper describes an interactive animation tool, SATSim, which conveys superscalar architecture concepts. It has been used in an advanced undergraduate computer architecture course to visualize the complicated behavioral patterns of superscalar architectures, such as out-of-order execution, in-order commitment, and the impact of branch mispredictions and cache misses. SATSim allows students to interactively change hardware configuration parameters and to observe their effects visually in a more accessible manner than is currently possible with existing simulators or with traditional static media.
Chapter
Resource-constrained software pipelining has played an increasingly significant role in exploiting instruction-level parallelism and has been drawing intensive academic and industrial interest. The challenge is to find a schedule that is optimal: i.e., the fastest possible schedule under given resource constraints while keeping register usage minimal. One interesting problem is open: the design of an algorithm which ensures that such an optimal schedule will always be found and with a cost which is manageable in practice. In this paper, we present a novel modulo scheduling algorithm which provides a solution to this open problem. The proposed algorithm has been successfully implemented and tested under MOST — the Modulo Scheduling Testbed developed at McGill University. Unlike the existing optimal modulo scheduling approach based on integer linear programming (ILP) [6], our approach employs a search procedure which directly exploits the program structure in terms of its dependence graphs. Experimental results on more than 1000 loops from popular benchmark programs show that our method often finds a schedule faster. In addition, with our approach many loops require that a surprisingly small number of schedules be searched to obtain an optimal solution, thus making the approach quite feasible.
Article
If a high-performance superscalar processor is to realise its full potential, the compiler must re-order or schedule the object code at compile time. This scheduling creates groups of adjacent instructions that are independent and which therefore can be issued and executed in parallel at run time. This paper provides an overview of the Hatfield Superscalar Architecture (HSA), a multiple-instruction-issue architecture developed at the University of Hertfordshire to support the development of high-performance instruction schedulers. The long-term objective of the HSA project is to develop the scheduling technology to realise an order of magnitude performance improvement over traditional RISC designs. The paper also presents results from the first HSA instruction scheduler that currently achieves a speedup of over three compared to a classic RISC processor.
Article
We consider the floating point microarchitecture support in RISC superscalar processors. We briefly review the fundamental performance trade-offs in the design of such microarchitectures. We propose a simple, yet effective bounds model to deduce the "best-case" loop performance limits for these processors. We compare these bounds to simulated and real performance measurements. From this study, we identify several loop tuning opportunities. In particular, we illustrate the use of this analysis in suggesting loop unrolling and scheduling heuristics. We report our experimental results in the context of a set of application-based loop test cases. These are designed to stress various resource limits in the core (infinite cache) microarchitecture.
Article
A fast low power four-way set-associative translation lookaside buffer (TLB) is proposed. The proposed TLB allows single clock phase accesses at clock frequencies above 1 GHz. Comparisons to a conventional fully associative CAM tagged TLB, which is the type most commonly used in embedded processors, with the same number of entries on a 65 nm low standby power process show that the proposed design has 28% lower delay and up to 50% lower energy delay product. Unlike previous set-associative TLBs, which replicate the TLB to support multiple page sizes, multiple page sizes are supported on a way-by-way basis. Alternative conventional CAM tagged and set-associative TLBs are investigated with regard to access latency, power dissipation and hit rates.
Conference Paper
This paper deals with two-level on-chip cache memories. We show the impact of three different relationships between the contents of these levels on the system performance. In addition to the classical Inclusion contents management, we propose two alternatives, namely Exclusion and Demand, developing for them the necessary coherence support and quantifying their relative performance in a design space (sizes, latencies, ...) in agreement with the constraints imposed by integration. Two performance metrics are considered: the second-level cache miss ratio and the system CPI. The experiments have been carried out running a set of integer and floating point SPEC'92 benchmarks. We conclude by showing the superiority of our improved version of Exclusion throughout all the sizing and workload spectrum studied.
Conference Paper
We study the cache performance of a set of ML programs, compiled by the Standard ML of New Jersey compiler. We find that more than half of the reads are for objects that have just been allocated. We also consider the effects of varying software (garbage collection frequency) and hardware (cache) parameters. Confirming results of related experiments, we found that ML programs can have good cache performance when there is no penalty for allocation. Even on caches that have an allocation penalty, we found that ML programs can have lower miss ratios than the C and Fortran SPEC92 benchmarks.
Conference Paper
Functional (i.e., logic) verification of the current generation of complex, super-scalar microprocessors such as the PowerPC 604 microprocessor presents significant challenges to a project's verification participants. Simple architectural level tests are insufficient to gain confidence in the quality of the design. Detailed planning must be combined with a broad collection of methods and tools to ensure that design defects are detected as early as possible in a project's life-cycle. This paper discusses the methodology applied to the functional verification of the PowerPC 604 microprocessor
Conference Paper
In modern processors, deep pipelines couple with superscalar techniques to allow each pipe stage to process multiple instructions. When such a pipe must be flushed and refilled, as when predicted program flow beyond a branch is subsequently recognized as wrong, the temporary performance loss is significant. While modern branch target buffer (BTB) technology makes this flush/refill penalty fairly rare, the penalty that accrues from the remaining branch mispredictions is a serious impediment to even higher processor performance. Advanced mechanisms that can reduce this residual misprediction penalty can be of enormous value in future microprocessor designs. One promising new mechanism, the Misprediction Recovery Cache (MRC), was proposed previously. In this paper, we focus especially on MRC integration into existing pipelines.
Conference Paper
This paper presents a multithreaded superscalar processor that permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Instructions can simultaneously be issued from up to 8 threads with a total issue bandwidth of 8 instructions per cycle. Our results show that the 8-threaded 8-issue processor reaches a throughput of 4.2 instructions per cycle. Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar technique that issues up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism up to 8 instructions per cycle, or more. However, the instruction-level parallelism found in a conventional instruction stream is limited. The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the...
Conference Paper
Speculative execution is a key concept to increase the performance of superscalar processors. Given accurate branch prediction mechanisms, the efficiency of speculative execution is mainly determined by the speculation depth. In this work, we evaluate the pressure of speculative execution on the resource requirements of a typical superscalar architecture.
Article
Instruction-level parallelism (ILP) consists of a family of processor and compiler design techniques that speed up execution by causing individual operations to execute in parallel. For control-intensive programs, however, there is usually insufficient ILP in a basic block for effective exploitation by ILP processors. Speculative execution is the execution of instructions before it is known whether these instructions should be executed. Thus, it can be used to alleviate the effects of conditional branches. To effectively exploit ILP, the compiler must boost instructions across branches. However, a hazard may be introduced by speculatively executed instructions that incorrectly overwrite a value when a branch is mispredicted. To eliminate such side effects, we propose a solution in this paper which uses scheduling techniques. If the result of a boosted instruction can be stored in another location until the branch is committed, the undesired side effect can be avoided. A scheduling algorithm called LESS with a renaming function is proposed for passing different identifiers along different execution paths to maintain the correctness of the program until the data is no longer used or modified. The hardware implementation for this method is relatively simple and rather efficient. Simulation results show that the speedup achieved using LESS is better than that obtained using other existing methods. For example, LESS achieves a performance improvement of 12%, on average, compared to the CRF scheme, a solution proposed recently which uses the concept of shadow register pairs.
Conference Paper
Full-text available
The precise interrupt problem in pipelined processors is described, and five solutions are discussed in detail. The first forces instructions to complete and modify the process state in architectural order. The other four allow instructions to complete in any order, but additional hardware is used so that a precise state can be restored when an interrupt occurs. All the methods are discussed in the context of a parallel pipeline structure. Simulation results based on the CRAY-1S scalar architecture show that, at best, the first solution results in a performance degradation of about 16%. The remaining four solutions offer similar performance, and three of them result in as little as a 3% performance loss. Several extensions, including virtual memory and linear pipeline structures, are briefly discussed.
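The first solution above — letting results finish out of order while state updates drain strictly in architectural order — is the reorder-buffer idea. A minimal sketch, with the buffer structure and method names assumed for illustration:

```python
# Minimal sketch of in-order completion for precise interrupts: results
# may finish out of program order, but the architectural register file is
# updated only from the head of a reorder buffer, so the state at any
# interrupt is precise.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()          # [dest, value_or_None] in program order

    def dispatch(self, dest):
        """Allocate an entry when the instruction enters the pipeline."""
        self.entries.append([dest, None])

    def finish(self, dest, value):
        """Record a result; this may happen out of program order."""
        for entry in self.entries:
            if entry[0] == dest and entry[1] is None:
                entry[1] = value
                break

    def retire(self, regfile):
        """Update architectural state strictly from the head; return retirees."""
        retired = []
        while self.entries and self.entries[0][1] is not None:
            dest, val = self.entries.popleft()
            regfile[dest] = val
            retired.append(dest)
        return retired
```

A later instruction that finishes early simply waits in the buffer until everything ahead of it has completed, which is exactly why an interrupt taken between retirements sees a consistent state.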
Article
To produce a marketable PowerPC™ microprocessor on a short development schedule, the logic had to be designed in a manner flexible enough to allow quick modifications without sacrificing high performance and density when customized cells were required. This was accomplished for the PowerPC 601™ microprocessor (601) with a high-level design-language description, which was synthesized for a gate-level implementation and simulated for functional verification. In a similar way, the physical design strategy for the 601 struck an attractive balance between a highly automated, flexible floorplan and the additional density that had to be available for limited, well-conceived manual placements. Finally, a rigorous test strategy was implemented, which has proved very useful in analyzing the processor and in assembling 601-based systems. Careful adherence to this methodology led to a successful first-pass physical implementation, leaving the second iteration for additional customer requests.
Article
This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units, efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by 'looking ahead' about eight instructions, automatically optimizes the program execution on a local basis. The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. It may be used in almost any computer having multiple execution units and one or more 'accumulators.' Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.
Conference Paper
The PowerPC 603 microprocessor is the second member of the PowerPC microprocessor family. The 603 is a superscalar implementation featuring low power operation of less than 3 watts while maintaining high performance of 75 SPECint92 (estimated) at 80 MHz. The 7.4 mm by 11.5 mm design is implemented in 0.5 μm, four-level metal CMOS technology. The 603 features dual 8-kByte instruction and data caches and a 32/64-bit system bus. Peak instruction rates of 3 instructions per clock cycle give outstanding performance to notebook and portable applications
Conference Paper
A highly integrated, single-chip microprocessor is described that combines a powerful reduced instruction set computer (RISC) architecture with a superscalar machine organization and a versatile, high-performance bus interface. The PowerPC 601 microprocessor contains a 32-kbyte cache and is capable of dispatching, executing, and completing up to three instructions per cycle. The bus interface can be configured for a wide range of system bus interfaces, including pipelined, nonpipelined, and split transactions. In addition, the processor is equipped with features suitable for symmetric multiprocessing applications. The result is a cost-effective, general-purpose microprocessor solution that offers very competitive performance
Article
For part I, see ibid., vol.13, no.4, p.8-16 (1993). Two processors that compete in the workstation/server markets are compared. The 62.5-MHz IBM RISC System/6000 Model 580 exemplifies a moderate clock rate design. The 133-/200-MHz DEC Alpha processor represents an aggressive clock rate design. The performance implications of the memory subsystems and the effect of instruction sets on path length are described. It is shown that performance measurements on many systems support the initial claim that cycle time is not sufficient to determine performance.
Article
Two processors that compete in the workstation/server markets are compared. The 62.5-MHz IBM RISC System/6000 Model 580 (RS1) exemplifies a moderate clock rate design. As the highest SPECmark89/MHz system it can be viewed as maximizing the work performed per cycle. The 133-/200-MHz DEC Alpha processor represents an aggressive clock rate design. At 200 MHz, the Alpha has the highest MHz rate in the market. The authors discuss clock rate goals, how they influence design choices, and performance implications. The primary advantage of the Alpha design appears to be the high clock rate. The RS1 design includes a significant amount of hardware to increase superscalar capability, especially on floating-point codes. RS1 has a significant infinite-cache CPI advantage on floating-point applications. Infinite-cache CPI for the two designs seems comparable on fixed-point codes.
Article
The Metaflow architecture, a unified approach to maximizing the performance of superscalar microprocessors, is introduced. The Metaflow architecture exploits inherent instruction-level parallelism in conventional sequential programs by hardware means, without relying on optimizing compilers. It is based on a unified structure, the DRIS (deferred-scheduling, register-renaming instruction shelf), that manages out-of-order execution and most of the attendant problems. Coupling the DRIS with a speculative-execution mechanism that avoids conditional branch stalls results in performance limited only by inherent instruction-level parallelism and available execution resources. Although presented in the context of superscalar machines, the technique is equally applicable to a superpipelined implementation. Lightning, the first implementation of the Metaflow architecture, which executes the Sparc RISC instruction set, is described.
Article
One of the major problems in designing a CPU pipeline is to ensure a steady flow of instructions to the initial stage of the pipeline. The paper studies the 'branch problem': the execution of a branch instruction, which consists of causing the instruction fetch unit to select a different instruction as the next instruction to execute. The branch target buffer is a small associative memory that retains the addresses of recently executed branches and their targets (destinations). The buffer is used to predict whether the branch will be taken this time, and if so, what its target will be. This article presents a systematic approach to selecting good prediction strategies, which is based on 26 program address traces grouped into four IBM 370 workloads (scientific, commercial, compiler, supervisor) and CDC 6400 and DEC PDP-11 workloads.
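A branch target buffer in the spirit described above can be sketched with a 2-bit saturating counter per entry — the scheme used by several of the commercial processors cited in this thesis. The table size, direct-mapped indexing, and counter initialization are assumptions for the sketch:

```python
# Sketch of a branch target buffer (BTB): a small table that remembers
# recently executed branches and their targets, with a 2-bit saturating
# counter deciding the taken/not-taken prediction.

class BranchTargetBuffer:
    def __init__(self, size=256):
        self.size = size
        self.table = {}        # slot -> [2-bit counter, target address]

    def predict(self, pc):
        """Return (taken?, target); unknown branches predict not-taken."""
        entry = self.table.get(pc % self.size)
        if entry is None:
            return False, None
        counter, target = entry
        return counter >= 2, target        # 2 or 3 means predict taken

    def update(self, pc, taken, target):
        """Train the saturating counter toward the actual outcome."""
        slot = pc % self.size
        counter, _ = self.table.get(slot, [1, None])   # init weakly not-taken
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[slot] = [counter, target]
```

The hysteresis of the 2-bit counter is what makes the scheme robust: a single anomalous outcome (e.g. a loop exit) does not immediately flip a strongly established prediction.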
M.S. Allen, et al., "Overview of the PowerPC Bus Interface," IEEE Micro, this issue, pp. 42-51.
C. Moore, "The PowerPC 601 Microprocessor," IBM RISC System/6000 Technology: Volume II.