Article

The PowerPC 604 RISC microprocessor.


... Until the introduction of out-of-order processing, instructions were scheduled for execution in program order. Out-of-order processors identify independent instructions within a finite instruction window and schedule them out of order [82,126,51,110,55,6,32,81,99,37,112]. By exploiting a larger degree of instruction-level parallelism and memory-level parallelism, these processors tend to achieve high performance and thus became popular for commercial use [28,9,102,20,33,58,50,101,34]. ...
... Later, Smith presented dynamic branch prediction strategies that could achieve high prediction accuracy [106]. The bimodal predictor proposed in that paper was used in many commercial processors [6,110,61]. While the bimodal predictor uses the branch instruction address alone, Yeh and Patt [127] propose including global branch history and local histories when predicting branches. ...
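The bimodal scheme mentioned above is small enough to sketch: one 2-bit saturating counter per table entry, indexed by the branch address alone. The table size, the initial "weakly taken" counter state, and the index hash below are illustrative assumptions, not the parameters of any shipping processor.

```python
# Minimal sketch of a bimodal branch predictor with 2-bit saturating counters.
# Table size, initial state, and indexing are assumptions for illustration.

class BimodalPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries    # states 0..3; 2 = weakly taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, wrap into table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop-closing branch: taken nine times, then falls through once.
p = BimodalPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x4000) == taken:
        correct += 1
    p.update(0x4000, taken)
# The 2-bit hysteresis predicts all nine taken outcomes correctly and
# mispredicts only the loop exit, so correct == 9.
```

The hysteresis is the point of the second bit: a single not-taken outcome moves the counter from 3 to 2, so the predictor still says "taken" when the loop is re-entered.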
Thesis
Full-text available
Continuing advances in branch prediction provide a promising avenue for mitigating the impact of control dependences on extracting instruction-level parallelism in high-performance processors. However, the rate of improvement in prediction rates has slowed significantly and may be approaching an asymptotic upper bound, particularly once practical constraints on the predictor’s cycle time, energy, and area are taken into consideration. To reach higher levels of performance, future processors must not just reduce the number of mispredictions, but should employ mechanisms that reduce the performance penalty of each misprediction. This thesis presents EMR — an approach that boosts performance by reducing the misprediction penalty and by amplifying instruction delivery bandwidth to the execution core. EMR implements a novel in-place misprediction recovery that minimizes latency to activate instructions from the correct control path, while also utilizing control independence to avoid unnecessary re-execution of instructions from beyond control flow joins. Performance analysis shows that EMR outperforms the baseline by up to 81% with a mean performance gain of 23% in CINT2006. Additionally, using EMR speeds up MiBench, Graph500, and CFP2006 by 20%, 10.5%, and 4% respectively. Energy analysis shows that EMR consumes 16% lower energy than the baseline in CINT2006. As we scale up the window sizes, EMR can, with its amplified instruction delivery, commit up to seven µops per cycle without increasing frontend bandwidth.
... The first class has been implemented since early dynamically scheduled processors such as the Tomasulo-based IBM 360/91 [103], and it groups instructions that use the same execution units, i.e. by instruction types, such as integer, floating point, memory, branches, etc. Other examples of these are the MIPS R10000 [110], the Alpha 21264 [42,53], the PowerPC 604 [94], etc. The second class groups program segments made of consecutive control-dependent instructions (called tasks, threads or traces), and assigns them to a different cluster or processing element (PE) for parallel execution. ...
... The first model does the issue and register read in a single stage (like it occurs in short pipeline layouts), while the other two have 1 and 2 read stages respectively. The first model also corresponds to an architecture that reads the register file before inserting the instruction into the issue queue, like a PowerPC [94]. Figure 5-5b shows the value prediction speedups for these three pipeline depths. ...
Article
Over the past decade superscalar microprocessors have achieved enormous improvements in computing power by exploiting higher levels of parallelism in many different ways. High-performance superscalar processors have experienced remarkable increases in processor width, pipeline depth and speculative execution. All of these trends have come at an extremely high cost in hardware complexity and chip resource consumption. Until recently, their main limitation has been the availability of such resources on the chip, but with current technology shrinks and increases in transistor budgets, other limiting factors have become preeminent, such as power consumption, temperature and wire delays. These new problems greatly compromise the scalability of conventional superscalar designs. Many previous works have demonstrated the effectiveness of partitioning the layout of several critical hardware components as a means to keep most of the parallelism while improving the scalability. Some of these components are the register file, the issue queue and the bypass network. Their partitioning is the basis for the so-called clustered architectures. A clustered processor core, made up of several low complexity blocks or clusters, can efficiently
... It is possible to determine the legality of branch hoisting at the source level, but it is difficult to express the transformation in conventional source-level representations such as abstract syntax trees. Hence we perform branch hoisting on a low-level form of intermediate code (such as gcc's RTX), after most standard local and loop optimizations, but before physical register allocation or instruction scheduling. ...

... Statement Hoist Criterion: A statement s can be hoisted from a point p to a set of points {p0, p1, ..., pN} in the program provided strong domination holds: any path from the start of the program to p, or from p back to itself, that does not otherwise include p, passes exactly once through one and only one of the points in {p0, p1, ..., pN}. (Instruction scheduling must be aware of future branches, to try to ensure that no stalls occur due to a future branch that is still in the instruction pipeline. Gupta [5] introduced generalized dominators: a set of nodes that dominate a point p. However, our strong dominators are "stronger", as we require uniqueness and consider paths from p to itself.) ...
Conference Paper
Full-text available
The performance and hardware complexity of superscalar architectures is hindered by conditional branch instructions. When conditional branches are encountered in a program, the instruction fetch unit must rapidly predict the branch predicate and begin speculatively fetching instructions with no loss of instruction throughput. Speculative execution has a high hardware cost, is limited by dynamic branch prediction accuracy, and does not scale well for increasingly superscalar architectures. The conditional branch bottleneck would be solved if we could somehow move branch condition evaluation far forward in the instruction stream and provide a new branch instruction that encoded both the source and target address of a branch. This paper summarizes the hardware extensions to support just such a Future Branch, then gives a compiler algorithm for hoisting branch evaluation across many blocks. The algorithm is applicable to other optimizations for parallelism, such as prefetching data.
... Nowadays the CPUs of the latest generation of computers have multiple functional units and are able to issue multiple independent instructions per clock cycle; they are called superscalar processors [2,4,7,10,12,13,16,18,20]. In these CPUs instructions are not necessarily executed in order: an instruction is executed when the processor resources are available. ...

... Some modern designs use deeper pipelines to obtain higher clock frequencies. There are works in the literature that support the use of shorter pipelines, which better fit the needs of today's personal computers [11,12]. ...
... An FPU (Floating Point Unit) is a principal component in graphics accelerators [1,2], digital signal processors, and high performance computer systems. As the chip integration density increases due to the advances in semiconductor technology, it has become possible for an FPU to be placed on a single chip together with the integer unit, allowing the FPU to exceed its original supplementary function and becoming a principal element in a CPU [2,3,4,5]. In recent microprocessors, a floating point division unit is built on a chip to speed up the floating point division operation. ...
... Because the proposed floating point divider can perform quotient conversion and rounding within only one cycle, the complete floating point division takes 30 cycles in the radix-4 case, 21 cycles in the radix-8 case, and 15 cycles in the radix-16 case. As shown in Table 2, in the radix-4 case the PowerPC 604e [3] takes 31 cycles, the PA-RISC 8000 [15] takes 31 cycles, and the Pentium [7] takes 33 cycles to complete the floating point division operation. In the radix-8 case, the UltraSPARC [4] takes 22 cycles. ...
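The cycle counts above track how many quotient bits each recurrence step retires: a radix-2^k divider retires k bits per cycle, so doubling the radix roughly halves the iteration count. A rough model of this relationship (the 56-bit quotient width and the 2-cycle overhead are assumptions chosen to illustrate the trend, not figures from the paper; real designs, like the 15-cycle radix-16 case, differ by a cycle or two):

```python
import math

def srt_division_cycles(radix, quotient_bits=56, overhead=2):
    # Each SRT recurrence step retires log2(radix) quotient bits; a couple of
    # extra cycles cover initialization, conversion/rounding, and normalization.
    bits_per_step = int(math.log2(radix))
    return math.ceil(quotient_bits / bits_per_step) + overhead

cycles = {r: srt_division_cycles(r) for r in (4, 8, 16)}
# → {4: 30, 8: 21, 16: 16}: fewer iterations at higher radix, at the price
# of a more complex quotient-digit selection function per step.
```

The model captures why moving from radix-4 to radix-8 saves ~9 cycles while radix-8 to radix-16 saves only ~5: the savings shrink as ceil(bits/k) flattens.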
Conference Paper
Full-text available
Processing floating point division generally consists of SRT recurrence, quotient conversion, rounding, and normalization steps. In the rounding step, a high speed adder is required for the increment operation, increasing the overall execution time. In this paper, a floating point divider performing quotient conversion and rounding in parallel is presented by analyzing the operational characteristics of floating point division. The proposed floating point divider does not require any additional execution time, nor does it need any high speed adder for the rounding step. The proposed divider can execute quotient conversion, rounding, and normalization within one cycle. To support design efficiency, the quotient conversion/rounding unit of the proposed divider can be shared efficiently with the addition/rounding hardware of a floating point multiplier.
... Superscalar processors differ in the mechanisms adopted to handle data dependences and control dependences [3]. Some processors, such as the PowerPC 604 [4] and 620 [5] and the MIPS R10000 [6], have mechanisms that perform dynamic instruction scheduling, allowing subsequent instructions to execute while a dependence is being resolved. ...
Conference Paper
Sophisticated mechanisms for dynamic instruction scheduling, dynamic branch prediction, and speculative execution are found in superscalar architectures aimed at applications that demand high performance. Nevertheless, these architectures still deliver performance well below what an ideal superscalar architecture would achieve. This paper presents experimental results that identify some of the main causes of the discrepancy between the performance of a real superscalar architecture and that of an ideal one. A new mechanism, based on multiple instruction streams, that aims to reduce the limitations found is also presented.
... within the home appliance). As the computational power is also reduced ([14], [15], [16]), this first data mining may be mere statistics, such as correlations, histograms and boolean algebra, or simple algorithms, such as small neural networks ([17], [18]), decision trees [19] or regression analysis [20]. A pertinent comparison of the three approaches is made by Tso and Yau in [21]. ...
... The most common state machine for one-level schemes is the two-bit counter predictor,(12) which is implemented in several contemporary processors, including the Intel Pentium(10) and the PowerPC 604.(11) The size of the one-level BTB buffer affects branch prediction accuracy, since several branches may contend for the same entry in the buffer. The effect of the buffer size (S) on the number of contentions is examined and plotted in Fig. 2. The knee of the majority of the curves occurs near the 512- to 1024-entry region. ...
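The contention effect described above can be sketched with a direct-mapped BTB index (the indexing scheme is an assumption for illustration; associative organizations reduce, but do not eliminate, such conflicts):

```python
# Sketch of BTB index contention: branches whose addresses map to the same
# entry evict each other, and larger buffers reduce such collisions.
# Direct-mapped indexing by (pc >> 2) mod entries is an assumption.

def contentions(branch_pcs, entries):
    seen = {}
    conflicts = 0
    for pc in branch_pcs:
        idx = (pc >> 2) % entries
        if idx in seen and seen[idx] != pc:
            conflicts += 1          # a different branch occupies this entry
        seen[idx] = pc
    return conflicts

# 16 distinct branches spaced 64 bytes apart, visited in two passes:
pcs = [0x1000 + i * 64 for i in range(16)] * 2
small = contentions(pcs, 8)    # 8-entry BTB: every access after the first
                               # aliases to an entry held by another branch
large = contentions(pcs, 512)  # 512-entry BTB: each branch keeps its entry
```

With the tiny buffer every branch evicts its predecessor (31 conflicts here), while the 512-entry buffer resolves all 16 branches without a single collision, mirroring the knee observed in the plotted curves.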
Article
Full-text available
Profile based optimization can be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling that uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%-4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.
... In this paper, we present the definition of two models for testing units of modern microprocessors, construction of test code, and checking coverage measurement. Both proposed models are based on the architecture of the PowerPC microprocessor [4,10]. The models have been verified using a random code generator and `real world' applications, while measuring and analyzing the coverage results. ...
Article
The design verification of state-of-the-art high-performance microprocessors has become a significant challenge for test engineers. Deep pipelines, multiple execution units, out-of-order and speculative execution techniques, typically found in such microprocessors, contribute much to this complexity. Conventional methods, which treat the processor as a logic state machine or apply architectural level tests, fail to provide coverage of all possible corner cases in the design. This paper presents a functional verification method for modern microprocessors, which is based on innovative models of the microprocessor architecture, intended to cover the testing of all corner cases. In order to test the models presented in this work, an architecture independent coverage measurement system has been developed. The models were tested with both random code and real world applications in order to determine which of the two achieves higher coverage.
... As the intention of this project is to physically implement a prototype system, rather than produce a VHDL simulation model, a Reduced Instruction Set Computer (RISC) [20,18,23] with a Harvard bus architecture that is easy to implement has been chosen as our general purpose CPU. This led to the choice of a design that is compatible with the MIPS R3000 [17,14] processor, mainly due to its widely available documentation. ...
Conference Paper
This paper analyses the performance of a custom compute machine that performs electrostatic plasma simulations using Field Programmable Gate Arrays (FPGAs). Although FPGAs run at slower clock speeds than their off-the-shelf counterparts, the processing power lost in the reduced number of clock cycles per second is quickly recovered in the high degree of spatial parallelism that is achievable within the devices. We describe the development of the architecture of the machine and its support for the C programming language via the use of a cross-compiler. Results are presented and a discussion is given on the constraints of FPGAs in particular and the hardware design process in general.
... This is illustrated in the right diagram of the same figure, showing an increase in CPL of 1 stage. The left diagram is expected to be found on e.g., a PowerPC 604 [SDC94] or a PA 7100LC [BKQW95], while the right diagram resembles the situation on a Pentium processor, except that a Pentium cannot execute a rotate over more than 1 bit in parallel with any other instruction, resulting in a further increase of the CPL. (Figure caption: The first step of MD4's round 2 implemented on a two-way superscalar architecture.) ...
Conference Paper
To enhance system performance computer architectures tend to incorporate an increasing number of parallel execution units. This paper shows that the new generation of MD4-based customized hash functions (RIPEMD-128, RIPEMD-160, SHA-1) contains much more software parallelism than any of these computer architectures is currently able to provide. It is conjectured that the parallelism found in SHA-1 is a design principle. The critical path of SHA-1 is twice as short as that of its closest contender RIPEMD-160, but realizing it would require a 7-way multiple-issue architecture. It will also be shown that, due to the organization of RIPEMD-160 in two independent lines, it will probably be easier for future architectures to exploit its software parallelism.
... All data reported in this paper are generated by a fully functional cycle equivalent model, based on the PowerPC 604. The model is based on published reports [6,8,19] and accurately models all aspects of the microarchitecture. An enhanced PowerPC 604 model and the benchmark set are discussed in the next two subsections. ...
Conference Paper
In order to achieve high performance, contemporary microprocessors must effectively process the four major instruction types: ALU, branch, load, and store instructions. This paper focuses on the reduction of load instruction execution latency. Load execution latency is dependent on memory access latency, pipeline depth, and data dependencies. Through load effective address prediction both data dependencies and deep pipeline effects can potentially be removed from the overall execution time. If a load effective address is correctly predicted, the data cache can be speculatively accessed prior to execution, thus effectively reducing the latency of load execution. A hybrid load effective address prediction technique is proposed, using three basic predictors: Last Address Predictor (LAP), Stride Predictor (SP), and Global Dynamic Predictor (GDP). In addition to improving load address prediction, this work explores the balance of data ports in the cache memory hierarchy, and the effects of load and store aliasing in wide superscalar machines. Results: Using a realistic hybrid load address predictor, load address prediction rates range from 32% to 77% at an average of 51% for SPECint95 and 60% to 96% at an average of 87% for SPECfp95. For a wide superscalar machine with a significant number of execution resources, this prediction rate increases IPC by 12% and 19% for SPECint95 and SPECfp95 respectively. It is also shown that load/store aliasing decreases the average IPC by 33% for SPECint95 and 24% for SPECfp95.
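The Stride Predictor (SP) component summarized above can be sketched as a per-PC table of (last address, stride) pairs; the table organization here is an illustrative assumption, not the paper's exact design:

```python
# Sketch of a per-PC stride address predictor: the predicted effective
# address is last_address + stride, with the stride learned from the
# difference between consecutive addresses produced by the same load PC.

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> (last_address, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None              # cold entry: no prediction
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, address):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (address, address - last)  # retrain the stride
        else:
            self.table[pc] = (address, 0)

sp = StridePredictor()
hits = 0
for addr in range(0x1000, 0x1000 + 8 * 10, 8):  # array walked with stride 8
    if sp.predict(0x400) == addr:
        hits += 1
    sp.update(0x400, addr)
# Two warm-up accesses (cold entry, then stride still 0), then the remaining
# eight addresses are predicted correctly.
```

With a correct prediction the cache can be probed before the load even issues, which is exactly the latency the hybrid scheme is after; the LAP and GDP components would cover the non-strided access patterns this table misses.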
... As pointed out above, the performance of a reuse scheme is determined not only by the percentage of reused instructions but also by the reuse latency. Let us assume a dynamically scheduled processor with a microarchitecture that keeps speculative results in the reorder buffer or rename buffers and is pipelined in the stages shown in Figure 4.a (this example is based on the PowerPC 604 [15], although the conclusions are the same for other pipelines). Every instruction is fetched and then it is decoded and the physical location that holds the last definition of each source operand (if available) is identified. ...
Conference Paper
Full-text available
A mechanism for dynamic instruction-level reuse in superscalar microprocessors is presented. The underlying concept that the mechanism exploits is the run-time removal of redundant computations, and in particular the elimination of common subexpressions and invariants. Removing redundant computation is a target of optimizing compilers but sometimes they do not succeed due to their limited knowledge of the data. Moreover, the proposed mechanism can also remove quasi-redundant computations, such as subexpressions that often produce the same result but sometimes they differ, depending on the data values, and thus they cannot be eliminated by the compiler. Experimental results for the Spec95 show that on average the mechanism can avoid the execution of about 32% of the dynamic instructions and provides a 1.10 speedup in a superscalar microprocessor. An extensive evaluation of different configurations and a comparison with previous schemes is presented, as well as the performance potential ...
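The reuse idea can be illustrated with a table keyed by opcode and operand values that short-circuits re-execution of redundant computations; the dictionary organization below is a deliberate simplification of the hardware mechanism described in the abstract:

```python
# Sketch of dynamic instruction reuse: results of prior executions are
# recorded per (opcode, operand values); a matching entry lets the
# processor skip the functional unit entirely.

class ReuseBuffer:
    def __init__(self):
        self.table = {}
        self.hits = 0        # computations satisfied from the buffer
        self.executions = 0  # computations actually performed

    def execute(self, op, a, b):
        key = (op, a, b)
        if key in self.table:
            self.hits += 1               # reuse: bypass the functional unit
            return self.table[key]
        self.executions += 1             # miss: execute and record the result
        result = {'add': a + b, 'mul': a * b}[op]
        self.table[key] = result
        return result

rb = ReuseBuffer()
for _ in range(5):
    rb.execute('mul', 7, 6)  # a loop-invariant multiply: same operands
# Only the first instance executes; the other four hit the reuse buffer.
```

This captures the common-subexpression and invariant cases the abstract mentions; the quasi-redundant case is handled naturally too, since an operand change simply produces a new key and a fresh execution.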
... PRF-style machines have been around for more than a decade, starting with the MIPS R10000 [19], Alpha 21264 [12][13][5], IBM Power4 [17] and Power5 [16], and Intel Pentium4 [9]. ARF-style machines have a similar history, used in machines like the PowerPC 604 [15] and the Intel P6 [4][8] architecture, which formed the basis for the Intel Pentium Pro [14] and Pentium III [11]. ARF are also used in current generation microprocessors such as the AMD K8 [6][10] family, which includes AMD Athlon and Opteron, Intel Pentium M [7], as well as the recently launched Intel Core family. ...
Conference Paper
Full-text available
Based on operand delivery, existing microprocessors can be categorized into architected register file (ARF) or physical register file (PRF) machines, both with or without payload RAM (PL). Though many previous generation microprocessors use a PRF without PL, the trend of newer microprocessors targeting lower power environments seems to be moving towards ARF with PL. We quantitatively analyze power consumption of different machine styles: ARF with PL, ARF without PL, PRF with PL, and PRF only machines. Our result shows that PRF without PL consumes the least amount of power and is fundamentally the best approach for building power-aware out-of-order microprocessors.
... All data reported in this paper are generated using the new fMW [1] tool which integrates a functional simulator and a cycle-accurate timing simulator that is built based on a validated PowerPC 604 model [2]. The machine model is based on published reports [4,7,10] and accurately models all aspects of the microarchitecture. The machine model and the simulation framework used in this paper are discussed in the next two subsections. ...
Conference Paper
As superscalar pipelines become wider and deeper, the percentage of dynamic instructions fetched into the machine from the mispredicted path significantly increases. This paper discusses how a new cycle-accurate performance simulator is used to accurately measure mispredicted path effects on the cache hierarchy. Previously published results based on less accurate tools indicated that mispredicted path instructions have the serendipitous positive effect of doing memory prefetching. Our results show that while such prefetching does occur for some benchmarks, it does not occur consistently for all benchmarks. Furthermore the IPC impact varies widely among the benchmarks. SPECint95 benchmarks show IPC changes ranging from -8% to +12%.
Article
Accurate instruction fetch and branch prediction is increasingly important on today's wide-issue architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. Several researchers have proposed very effective fetch and branch prediction mechanisms including branch target buffers (BTB) that store the target addresses of taken branches. An alternative approach fetches the instruction following a branch by using an index into the cache instead of a branch target address. We call such an index a next cache line and set (NLS) predictor. A NLS predictor is a pointer into the instruction cache, indicating the target instruction of a branch. In this paper we examine the use of NLS predictors for efficient and accurate fetch and branch prediction. Previous studies associated each NLS predictor with a cache line and provided only one-bit conditional branch predictors. Our study examines the use of NLS predictors with highly accurate two-level correlated conditional branch architectures. We examine the performance of decoupling the NLS predictors from the cache line and storing them in a separate tag-less memory buffer. Our results show that the decoupled architecture performs better than associating the NLS predictors with the cache line, that the NLS architecture benefits from reduced cache miss rates, and it is particularly effective for programs containing many branches. We also provide an in-depth comparison between the NLS and BTB architectures, showing that the NLS architecture is a competitive alternative to the BTB design.
Article
In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.
Chapter
There is a wealth of technological alternatives that can be incorporated into a cache or processor design. The memory configuration and the cache size, associativity and block size for each of the components in the hierarchy are some of these applicable to memory subsystems. For processors, these include branch handling strategies, functional unit duplication, and instruction fetch, issue, completion and retirement policies. The deciding factor between the various choices available is a function of the performance each adds versus the cost each incurs. The large number of available design choices inflates the design space. The profitability of a given design is measured through the execution of application programs and other workloads. Trace-driven simulation is used to simplify this process.
Chapter
For an instruction pipeline to attain its maximum performance, it is, at the very least, necessary that it be supplied with instructions at a rate that matches its maximum processing rate. The main impediment to ensuring adequate instruction-supply is usually the high access time (relative to the pipeline cycle time) of the memory from which instructions are fetched. At any given moment, the addresses of the next instructions required are easy to determine if there are no branch (control-transfer)1 instructions involved: simply incrementing the program counter, or similar addressing register, suffices. A branch instruction, on the other hand, presents a problem, since the addresses of the following instructions cannot be known with absolute certainty until after the branch has been executed; furthermore, the execution may depend on a condition yet to be determined by preceding instructions. Consequently, unless special measures are taken, a branch instruction will introduce a gap — the delay of which we shall term the branch latency — in the flow of instructions. In this chapter we shall discuss a number of measures for dealing with this, which is arguably the hardest problem in the design of high-performance instruction pipelines [116].
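The cost of the branch-latency gap described above can be quantified with the usual back-of-the-envelope model: effective CPI = base CPI + branch frequency × misprediction rate × branch latency. The numbers below are illustrative assumptions, not figures from this chapter:

```python
# Simple model of the performance cost of branch latency: every branch whose
# outcome is not resolved (or correctly predicted) in time inserts stall
# cycles into the instruction flow, amortized over all instructions.

def effective_cpi(base_cpi, branch_freq, mispredict_rate, branch_latency):
    return base_cpi + branch_freq * mispredict_rate * branch_latency

# Assumed workload: 20% branches, 10% of them mispredicted, and a 3-cycle
# branch latency on an otherwise ideal scalar pipeline (base CPI of 1.0).
cpi = effective_cpi(1.0, 0.20, 0.10, 3)
# → 1.06, i.e. a 6% slowdown from branch latency alone.
```

The model also shows why the problem is "arguably the hardest" on wide-issue machines: as base CPI drops well below 1, the same absolute stall term becomes a much larger fraction of total execution time.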
Chapter
A pipeline can fail to achieve its maximum speedup if there are discontinuities in the supply of instructions or data. Discontinuities in the flow of instructions have been covered in the preceding chapter; in this chapter, we shall discuss the problem of discontinuities in the flow of data as well as corresponding solutions. The data-flow discontinuities arise mainly from two sources: one is a mismatch between the rate at which the pipeline requests data and the rate at which the data is delivered to the pipeline; the other is data hazards (or data dependences) that occur between instructions in the pipeline when one instruction cannot proceed because its progress depends on that of another instruction. The basic problem of mismatching rates is largely solved by the use of appropriate high-speed intermediate storage (cache, registers, etc.) and other techniques discussed in Chapter 3 and will not be considered further. This chapter is therefore devoted to just the hazards and related issues.
Chapter
We shall begin by introducing the main issues in the design and implementation of pipelined and superscalar computers, in which the exploitation of low-level parallelism constitute the main means for high performance. The first section of the chapter consists of a discussion of the basic principles underlying the design of such computers. The second section gives a taxonomy for the classification of pipelined machines and introduces a number of commonly used terms. The third and fourth section deal with the performance of pipelines: ideal performance and impediments to achieving this are examined. The fifth section consists of some examples of practical pipelines; these pipelines form the basis for detailed case studies in subsequent chapters. The last section is a summary.
Chapter
This chapter deals with the handling of interrupts in pipelined machines and with recovery in the event of branch mispredictions. The first section is a discussion of basic implementation techniques, the second consists of a number of case studies, and the third is a summary.
Article
This paper explores the microarchitecture of a simultaneous multithreaded processor with multimedia enhancements and evaluates its performance with a multimedia program as workload. Novel features are the combination of a simultaneous multithreaded processor with multimedia instructions and its evaluation with a realistic multimedia workload. The workload is a hand-optimized MPEG-2 video decompression algorithm that extensively uses multimedia units and moreover applies cooperative multithreading to exploit coarse-grain parallelism. Various processor configurations were simulated to identify and eliminate potential bottlenecks. Our simulation results show that a simultaneous multithreaded processor with multimedia enhancements is able to yield a threefold IPC (instructions per cycle) increase over a comparable (single-threaded) superscalar processor.
Book
To satisfy the higher requirements of digitally converged embedded systems, this book describes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and greater flexibility and superior performance per watt can then be achieved. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores including CPU cores and special-purpose processor cores that achieve highly arithmetic-level parallelism. The authors developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book. Provides readers an overview and practical discussion of heterogeneous multicore technologies from both a hardware and software point of view; Discusses a new, high-performance and energy efficient approach to designing SoCs for digitally converged, embedded systems; Covers hardware issues such as architecture and chip implementation, as well as software issues such as compilers, operating systems, and application programs; Describes three chips developed according to the defined heterogeneous multicore architecture, including chip implementations, software environments, and working applications. © 2012 Springer Science+Business Media New York. All rights are reserved.
Chapter
The processor cores described in this chapter are well tuned for embedded systems. They are SuperH™ RISC engine family processor cores (SH cores) as typical embedded CPU cores, flexible engine/generic ALU array (FE–GA, or shortly FE, as flexible engine) as a reconfigurable processor core, MX core as a massively parallel SIMD-type processor, and video processing unit (VPU) as a video processing accelerator. We can implement heterogeneous multicore processor chips with them, and three implemented prototype chips, RP-1, RP-2, and RP-X, are introduced in Chap. 4.
Conference Paper
There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only natural that architectural features that benefit only multiprocessors are less likely to be adopted in commodity microprocessors. In this paper, we explore multiple-context processors, an architectural technique proposed to hide the large memory latency in multiprocessors. We show that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments. We propose an alternative design that combines the best features of two existing approaches, and present simulation results that show it yields better performance for both multiprogrammed workloads on a workstation and parallel applications on a multiprocessor. By addressing the needs of the workstation environment, our proposal makes multiple contexts more attractive for commodity microprocessors.
Article
Full-text available
Recent processors issue multiple instructions per cycle and employ multiple functional units and hardware scheduling techniques to achieve maximum parallelism at the instruction level. To exploit maximum efficiency, such multiple-issue processors must be fed by high instruction fetch bandwidth. The instruction fetch unit must fetch enough instructions every cycle to keep the functional units busy. No clock cycle should go idle, and thus several instructions need to be fetched at every clock cycle. In a conventional multi-way set-associative I-cache, an instruction is read in the following way. The address generated by the processor is divided into two parts: tag and index. The index selects the set of the I-cache to be accessed. The tag is compared simultaneously with the tags of all the blocks in the set. The data is read from the block whose tag matches the instruction tag. If none of the stored tags matches the instruction tag, then a cache miss occurs and the instruction is read from the other levels of the memory hierarchy. With very fast clock cycles, this whole procedure requires more than one clock cycle to complete. It is expected that future processors, which are likely to have very deep pipelines, will require several pipeline stages to fetch instructions. The two main factors that affect the performance of a multi-issue processor are cache access time and fetching correct instructions every cycle. In this paper we address both these problems by predicting the instruction-cache address from which the next instruction has to be fetched.
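The tag/index lookup described in the abstract can be sketched as follows. The block size, set count, associativity, and FIFO refill policy are assumptions for illustration, not parameters from the paper:

```python
# Illustrative sketch of a 4-way set-associative I-cache lookup:
# the address splits into tag / index / offset, the index selects a set,
# and the tag is compared against every block tag in that set.

BLOCK_SIZE = 32      # bytes per cache block (assumed)
NUM_SETS   = 64      # number of sets (assumed)

def split_address(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr % BLOCK_SIZE
    index  = (addr // BLOCK_SIZE) % NUM_SETS
    tag    = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset

class SetAssociativeICache:
    def __init__(self, ways=4):
        # Each set holds up to `ways` (tag, block) entries; FIFO refill.
        self.ways = ways
        self.sets = [[] for _ in range(NUM_SETS)]

    def lookup(self, addr):
        """Return True on a hit; on a miss, refill the block and return False."""
        tag, index, _ = split_address(addr)
        way = self.sets[index]
        if any(t == tag for t, _ in way):   # tags compared "simultaneously"
            return True
        if len(way) == self.ways:           # evict the oldest block on conflict
            way.pop(0)
        way.append((tag, None))             # fetched from the next memory level
        return False
```

Two fetches that fall in the same block hit after the first miss fills the line, which is the behavior the hardware tag compare provides in a single (or pipelined) access.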
Article
A major obstacle in designing superscalar processors is the size and port requirement of the register file. Multiple register files of a scalar processor can be used in a superscalar processor if results are renamed when they are written to the register file. Consequently, a scalable register file architecture can be implemented without performance degradation. Another benefit is that the cycle time of the register file is significantly shortened, potentially producing an increase in the speed of the processor.
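The write-time renaming idea above can be sketched with a toy rename map. The table sizes and the immediate recycling of old mappings are simplifying assumptions (a real design frees a physical register only after the renaming instruction commits):

```python
# Toy sketch of register renaming: each architectural destination is
# mapped to a fresh physical register when its result is written, so
# multiple register files can hold results without false conflicts.

class RenameMap:
    def __init__(self, arch_regs=8, phys_regs=16):
        self.map = {r: r for r in range(arch_regs)}    # arch -> phys mapping
        self.free = list(range(arch_regs, phys_regs))  # free physical registers

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a new result."""
        old = self.map[arch_reg]
        new = self.free.pop(0)
        self.map[arch_reg] = new
        self.free.append(old)   # naive recycling; real hardware waits for commit
        return new

    def read_src(self, arch_reg):
        """Source operands read the latest mapping."""
        return self.map[arch_reg]
```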
Article
Full-text available
The principles of the RISC architecture guided the design of the previous generation of processors. These principles have accelerated the performance gains of these processors over their predecessors. The current generation of CPUs is poised to continue this rapid acceleration of performance. In this paper we assert that the performance gains of the current generation are due to changes that are decidedly not RISC. We call this current generation of processors Post-RISC, and we describe the characteristics of Post-RISC processors. In addition, we survey six processors of the current generation which highlight the principles of Post-RISC.
Article
The PowerPC 620 RISC microprocessor is the first chip for the application server and technical workstation product line within the PowerPC family. It utilizes a high performance microarchitecture with many advanced superscalar features to exploit instruction level parallelism. It is the first 64-bit implementation of the PowerPC architecture supporting both 32- and 64-bit application software, and is compatible with the PowerPC 601, PowerPC 603, and PowerPC 604 microprocessors.
Article
Full-text available
In the quest for higher performance, virtually all recent models of significant superscalar lines have introduced both shelving and register renaming. The implementation of these advanced features has had major implications on the datapath of the processors. As a consequence, the datapaths of recent superscalars differ substantially from those used in previous models. Although the internal details of most new microarchitectures have been disclosed, the design space of their datapath has not yet been described. In this paper we contribute to this area by identifying generic alternatives of the kernel of the datapath in advanced superscalar processors. Our work is based on the exploration of the design spaces of both shelving and register renaming, since the manner of their implementation is decisive for the layout of the kernel of the datapath. From the design spaces mentioned, we first point out those aspects which are relevant for the generic alternatives of the kernel of the datapath and review the related design choices. Then from the feasible combinations of the indicated design choices we derive and present 24 generic datapath alternatives. In this framework we also show which generic alternatives have been chosen in recent superscalars.
Article
This paper describes an interactive animation tool, SATSim, which conveys superscalar architecture concepts. It has been used in an advanced undergraduate computer architecture course to visualize the complicated behavioral patterns of superscalar architectures, such as out-of-order execution, in-order commitment, and the impact of branch mispredictions and cache misses. SATSim allows students to interactively change hardware configuration parameters and to observe their effects visually in a more accessible manner than is currently possible with existing simulators or with traditional static media.
Chapter
Resource-constrained software pipelining has played an increasingly significant role in exploiting instruction-level parallelism and has been drawing intensive academic and industrial interest. The challenge is to find a schedule that is optimal: i.e., the fastest possible schedule under given resource constraints while keeping register usage minimal. One interesting problem is open: the design of an algorithm which ensures that such an optimal schedule will always be found and with a cost which is manageable in practice. In this paper, we present a novel modulo scheduling algorithm which provides a solution to this open problem. The proposed algorithm has been successfully implemented and tested under MOST — the Modulo Scheduling Testbed developed at McGill University. Unlike the existing optimal modulo scheduling approach based on integer linear programming (ILP) [6], our approach employs a search procedure which directly exploits the program structure in terms of its dependence graphs. Experimental results on more than 1000 loops from popular benchmark programs show that our method often finds a schedule faster. In addition, with our approach many loops require that a surprisingly small number of schedules be searched to obtain an optimal solution, thus making the approach quite feasible.
Article
If a high-performance superscalar processor is to realise its full potential, the compiler must re-order or schedule the object code at compile time. This scheduling creates groups of adjacent instructions that are independent and which therefore can be issued and executed in parallel at run time. This paper provides an overview of the Hatfield Superscalar Architecture (HSA), a multiple-instruction-issue architecture developed at the University of Hertfordshire to support the development of high-performance instruction schedulers. The long-term objective of the HSA project is to develop the scheduling technology to realise an order of magnitude performance improvement over traditional RISC designs. The paper also presents results from the first HSA instruction scheduler that currently achieves a speedup of over three compared to a classic RISC processor.
Article
We consider the floating point microarchitecture support in RISC superscalar processors. We briefly review the fundamental performance trade-offs in the design of such microarchitectures. We propose a simple, yet effective bounds model to deduce the "best-case" loop performance limits for these processors. We compare these bounds to simulated and real performance measurements. From this study, we identify several loop tuning opportunities. In particular, we illustrate the use of this analysis in suggesting loop unrolling and scheduling heuristics. We report our experimental results in the context of a set of application-based loop test cases. These are designed to stress various resource limits in the core (infinite cache) microarchitecture.
Article
A fast low power four-way set-associative translation lookaside buffer (TLB) is proposed. The proposed TLB allows single clock phase accesses at clock frequencies above 1 GHz. Comparisons to a conventional fully associative CAM tagged TLB, which is the type most commonly used in embedded processors, with the same number of entries on a 65 nm low standby power process show that the proposed design has 28% lower delay and up to 50% lower energy delay product. Unlike previous set-associative TLBs, which replicate the TLB to support multiple page sizes, multiple page sizes are supported on a way-by-way basis. Alternative conventional CAM tagged and set-associative TLBs are investigated with regard to access latency, power dissipation and hit rates.
Conference Paper
This paper deals with two-level on-chip cache memories. We show the impact of three different relationships between the contents of these levels on the system performance. In addition to the classical Inclusion contents management, we propose two alternatives, namely Exclusion and Demand, developing for them the necessary coherence support and quantifying their relative performance in a design space (sizes, latencies, ...) in agreement with the constraints imposed by integration. Two performance metrics are considered: the second-level cache miss ratio and the system CPI. The experiments have been carried out running a set of integer and floating point SPEC'92 benchmarks. We conclude by showing the superiority of our improved version of Exclusion throughout all the sizing and workload spectrum studied.
Conference Paper
We study the cache performance of a set of ML programs, compiled by the Standard ML of New Jersey compiler. We find that more than half of the reads are for objects that have just been allocated. We also consider the effects of varying software (garbage collection frequency) and hardware (cache) parameters. Confirming results of related experiments, we found that ML programs can have good cache performance when there is no penalty for allocation. Even on caches that have an allocation penalty, we found that ML programs can have lower miss ratios than the C and Fortran SPEC92 benchmarks.
Conference Paper
Functional (i.e., logic) verification of the current generation of complex, super-scalar microprocessors such as the PowerPC 604 microprocessor presents significant challenges to a project's verification participants. Simple architectural level tests are insufficient to gain confidence in the quality of the design. Detailed planning must be combined with a broad collection of methods and tools to ensure that design defects are detected as early as possible in a project's life-cycle. This paper discusses the methodology applied to the functional verification of the PowerPC 604 microprocessor
Conference Paper
In modern processors, deep pipelines couple with superscalar techniques to allow each pipe stage to process multiple instructions. When such a pipe must be flushed and refilled, as when predicted program flow beyond a branch is subsequently recognized as wrong, the temporary performance loss is significant. While modern branch target buffer (BTB) technology makes this flush/refill penalty fairly rare, the penalty that accrues from the remaining branch mispredictions is a serious impediment to even higher processor performance. Advanced mechanisms that can reduce this residual misprediction penalty can be of enormous value in future microprocessor designs. One promising new mechanism, the Misprediction Recovery Cache (MRC), was proposed previously. In this paper, we focus especially on MRC integration into existing pipelines.
Conference Paper
This paper presents a multithreaded superscalar processor that permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Instructions can simultaneously be issued from up to 8 threads with a total issue bandwidth of 8 instructions per cycle. Our results show that the 8-threaded 8-issue processor reaches a throughput of 4.2 instructions per cycle. Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar technique that issues up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism up to 8 instructions per cycle, or more. However, the instruction-level parallelism found in a conventional instruction stream is limited. The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the...
Conference Paper
Speculative execution is a key concept to increase the performance of superscalar processors. Given accurate branch prediction mechanisms, the efficiency of speculative execution is mainly determined by the speculation depth. In this work, we evaluate the pressure of speculative execution on the resource requirements of a typical superscalar architecture.
Article
Instruction-level parallelism (ILP) consists of a family of processor and compiler design techniques that speed up execution by causing individual operations to execute in parallel. For control-intensive programs, however, there is usually insufficient ILP in a basic block for effective exploitation by ILP processors. Speculative execution is the execution of instructions before it is known whether these instructions should be executed. Thus, it can be used to alleviate the effects of conditional branches. To effectively exploit ILP, the compiler must boost instructions across branches. However, a hazard may be introduced by speculatively executed instructions that incorrectly overwrite a value when a branch is mispredicted. To eliminate such side effects, we propose a solution in this paper which uses scheduling techniques. If the result of a boosted instruction can be stored in another location until the branch is committed, the undesired side effect can be avoided. A scheduling algorithm called LESS with a renaming function is proposed for passing different identifiers along different execution paths to maintain the correctness of the program until the data is no longer used or modified. The hardware implementation for this method is relatively simple and rather efficient. Simulation results show that the speedup achieved using LESS is better than that obtained using other existing methods. For example, LESS achieves a performance improvement of 12%, on average, compared to the CRF scheme, a solution proposed recently which uses the concept of shadow register pairs.
Conference Paper
Full-text available
The precise interrupt problem in pipelined processors is described, and five solutions are discussed in detail. The first forces instructions to complete and modify the process state in architectural order. The other four allow instructions to complete in any order, but additional hardware is used so that a precise state can be restored when an interrupt occurs. All the methods are discussed in the context of a parallel pipeline structure. Simulation results based on the CRAY-1S scalar architecture show that, at best, the first solution results in a performance degradation of about 16%. The remaining four solutions offer similar performance, and three of them result in as little as a 3% performance loss. Several extensions, including virtual memory and linear pipeline structures, are briefly discussed.
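The first solution above — letting results finish out of order while state updates drain strictly in architectural order — is the reorder-buffer idea. A minimal sketch, with the buffer structure and method names assumed for illustration:

```python
# Minimal sketch of in-order completion for precise interrupts: results
# may finish out of program order, but the architectural register file is
# updated only from the head of a reorder buffer, so the state at any
# interrupt is precise.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()          # [dest, value_or_None] in program order

    def dispatch(self, dest):
        """Allocate an entry when the instruction enters the pipeline."""
        self.entries.append([dest, None])

    def finish(self, dest, value):
        """Record a result; this may happen out of program order."""
        for entry in self.entries:
            if entry[0] == dest and entry[1] is None:
                entry[1] = value
                break

    def retire(self, regfile):
        """Update architectural state strictly from the head; return retirees."""
        retired = []
        while self.entries and self.entries[0][1] is not None:
            dest, val = self.entries.popleft()
            regfile[dest] = val
            retired.append(dest)
        return retired
```

A later instruction that finishes early simply waits in the buffer until everything ahead of it has completed, which is exactly why an interrupt taken between retirements sees a consistent state.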
Article
To produce a marketable PowerPC™ microprocessor on a short development schedule, the logic had to be designed in a manner flexible enough to allow quick modifications without sacrificing high performance and density when customized cells were required. This was accomplished for the PowerPC 601™ microprocessor (601) with a high-level design-language description, which was synthesized for a gate-level implementation and simulated for functional verification. In a similar way, the physical design strategy for the 601 struck an attractive balance between a highly automated, flexible floorplan and the additional density that had to be available for limited, well-conceived manual placements. Finally, a rigorous test strategy was implemented, which has proved very useful in analyzing the processor and in assembling 601-based systems. Careful adherence to this methodology led to a successful first-pass physical implementation, leaving the second iteration for additional customer requests.
Article
This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units, efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by 'looking ahead' about eight instructions, automatically optimizes the program execution on a local basis. The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. It may be used in almost any computer having multiple execution units and one or more 'accumulators.' Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.
Conference Paper
The PowerPC 603 microprocessor is the second member of the PowerPC microprocessor family. The 603 is a superscalar implementation featuring low power operation of less than 3 watts while maintaining high performance of 75 SPECint92 (estimated) at 80 MHz. The 7.4 mm by 11.5 mm design is implemented in 0.5 μm, four-level metal CMOS technology. The 603 features dual 8-kByte instruction and data caches and a 32/64-bit system bus. Peak instruction rates of 3 instructions per clock cycle give outstanding performance to notebook and portable applications
Conference Paper
A highly integrated, single-chip microprocessor is described that combines a powerful reduced instruction set computer (RISC) architecture with a superscalar machine organization and a versatile, high-performance bus interface. The PowerPC 601 microprocessor contains a 32-kbyte cache and is capable of dispatching, executing, and completing up to three instructions per cycle. The bus interface can be configured for a wide range of system bus interfaces, including pipelined, nonpipelined, and split transactions. In addition, the processor is equipped with features suitable for symmetric multiprocessing applications. The result is a cost-effective, general-purpose microprocessor solution that offers very competitive performance
Article
For part I, see ibid., vol.13, no.4, p.8-16 (1993). Two processors that compete in the workstation/server markets are compared. The 62.5-MHz IBM RISC System/6000 Model 580 exemplifies a moderate clock rate design. The 133-/200-MHz DEC Alpha processor represents an aggressive clock rate design. The performance implications of the memory subsystems and the effect of instruction sets on path length are described. It is shown that performance measurements on many systems support the initial claim that cycle time is not sufficient to determine performance.
Article
Two processors that compete in the workstation/server markets are compared. The 62.5-MHz IBM RISC System/6000 Model 580 (RS1) exemplifies a moderate clock rate design. As the highest SPECmark89/MHz system it can be viewed as maximizing the work performed per cycle. The 133-/200-MHz DEC Alpha processor represents an aggressive clock rate design. At 200 MHz, the Alpha has the highest MHz rate in the market. The authors discuss clock rate goals, how they influence design choices, and performance implications. The primary advantage of the Alpha design appears to be the high clock rate. The RS1 design includes a significant amount of hardware to increase superscalar capability, especially on floating-point codes. RS1 has a significant infinite-cache CPI advantage on floating-point applications. Infinite-cache CPI for the two designs seems comparable on fixed-point codes.
Article
The Metaflow architecture, a unified approach to maximizing the performance of superscalar microprocessors, is introduced. The Metaflow architecture exploits inherent instruction-level parallelism in conventional sequential programs by hardware means, without relying on optimizing compilers. It is based on a unified structure, the DRIS (deferred-scheduling, register-renaming instruction shelf), that manages out-of-order execution and most of the attendant problems. Coupling the DRIS with a speculative-execution mechanism that avoids conditional branch stalls results in performance limited only by inherent instruction-level parallelism and available execution resources. Although presented in the context of superscalar machines, the technique is equally applicable to a superpipelined implementation. Lightning, the first implementation of the Metaflow architecture, which executes the Sparc RISC instruction set, is described.
Article
One of the major problems in designing a CPU pipeline is to ensure a steady flow of instructions to the initial stage of the pipeline. The paper studies the 'branch problem': the execution of a branch instruction, which consists of causing the instruction fetch unit to select a different instruction as the next instruction to execute. The branch target buffer is a small associative memory that retains the addresses of recently executed branches and their targets (destinations). The buffer is used to predict whether the branch will be taken this time, and if so, what its target will be. This article presents a systematic approach to selecting good prediction strategies, which is based on 26 program address traces grouped into four IBM 370 workloads (scientific, commercial, compiler, supervisor) and CDC 6400 and DEC PDP-11 workloads.
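A branch target buffer in the spirit described above can be sketched with a 2-bit saturating counter per entry — the scheme used by several of the commercial processors cited in this thesis. The table size, direct-mapped indexing, and counter initialization are assumptions for the sketch:

```python
# Sketch of a branch target buffer (BTB): a small table that remembers
# recently executed branches and their targets, with a 2-bit saturating
# counter deciding the taken/not-taken prediction.

class BranchTargetBuffer:
    def __init__(self, size=256):
        self.size = size
        self.table = {}        # slot -> [2-bit counter, target address]

    def predict(self, pc):
        """Return (taken?, target); unknown branches predict not-taken."""
        entry = self.table.get(pc % self.size)
        if entry is None:
            return False, None
        counter, target = entry
        return counter >= 2, target        # 2 or 3 means predict taken

    def update(self, pc, taken, target):
        """Train the saturating counter toward the actual outcome."""
        slot = pc % self.size
        counter, _ = self.table.get(slot, [1, None])   # init weakly not-taken
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[slot] = [counter, target]
```

The hysteresis of the 2-bit counter is what makes the scheme robust: a single anomalous outcome (e.g. a loop exit) does not immediately flip a strongly established prediction.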
M.S. Allen, et al., "Overview of the PowerPC Bus Interface," IEEE Micro, this issue, pp. 42-51.
C. Moore, "The PowerPC 601 Microprocessor," IBM RISC System/6000 Technology: Volume II.