Article

Intel's P6 Uses Decoupled Superscalar Design

... Decoder-based instruction macro-expansion is used in many CISC (e.g., IA32) processors to present the execution engine with a finer-granularity, easier-to-execute RISC-like instruction stream [10,12,14,15]. DISE performs a logically different function (it adds functionality to an executing program by inserting additional ISA instructions) but uses similar hardware mechanisms. ...
... Basic structures. The instruction to instruction-sequence expansion performed by DISE is similar in spirit (and thus in implementation) to the instruction to micro-instruction-sequence expansion performed by many IA32 processors to convert complex instructions to a more regular (three-register) internal form [10,12,14,15]. ...
... While this logical organization can be mirrored physically (with DISE feeding an ACF-augmented CISC stream to an unmodified and completely microarchitectural CISC-to-RISC decoder), the two facilities are so similar structurally that they may be combined into a single complex. Figure 2 (bottom) illustrates this organization using the P6 decoder [14] as a rough guide. CISC to internal RISC translation is performed in two ways. ...
Conference Paper
Dynamic instruction stream editing (DISE) is a cooperative software-hardware scheme for efficiently adding customization functionality (e.g., safety/security checking, profiling, dynamic code decompression, and dynamic optimization) to an application. In DISE, application customization functions (ACFs) are formulated as rules for macro-expanding certain instructions into parameterized instruction sequences. The processor executes the rules on the fetched instructions, feeding the execution engine an instruction stream that contains ACF code. Dynamic instruction macro-expansion is widely used in many of today's processors to convert a complex ISA to an easier-to-execute, finer-grained internal form. DISE co-opts this technology and adds a programming interface to it. DISE unifies the implementation of a large class of ACFs that would otherwise require either special-purpose hardware widgets or static binary rewriting. We show DISE implementations of two ACFs, memory fault isolation and dynamic code decompression, and their composition. Simulation shows that DISE ACFs have better performance than their software counterparts, and more flexibility (which sometimes translates into performance) than hardware implementations.
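The rule-based macro-expansion the abstract describes can be illustrated with a minimal sketch. The opcode names, rule format, and the bounds-check rule below are invented for illustration; they are not DISE's actual encoding.

```python
# Minimal sketch of DISE-style macro-expansion: a rule table maps a
# triggering opcode to a parameterized replacement sequence whose
# templates refer to the trigger's operands by position.

def expand(instr, rules):
    """Expand one fetched instruction using a matching rule, if any."""
    op, *args = instr
    if op in rules:
        # Substitute the trigger's operands into each template.
        return [(t_op, *[args[a] if isinstance(a, int) else a for a in t_args])
                for (t_op, *t_args) in rules[op]]
    return [instr]  # no rule: pass the instruction through unchanged

# Hypothetical ACF rule: instrument every store with a bounds check.
rules = {
    "store": [
        ("check_bounds", 1),   # operand 1 = address register
        ("store", 0, 1),       # the original store itself
    ],
}

stream = [("add", "r1", "r2"), ("store", "r3", "r4")]
expanded = [u for i in stream for u in expand(i, rules)]
```

The execution engine then sees the expanded stream, with ACF code interleaved into the original program.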
... Mechanisms that leverage this low utilization and enable sharing of EUs like FPUs (which can account for 20% of the total core logic in simple cores (Jain et al., 2012; Gwennap, 1995)) have been successfully applied to processors with up to a dozen cores to save both on-chip area and static power (Meltzer, 1999; Ahmad and Arslan, 2005; Kakoee et al., 2013; Castells-Rufas et al., 2011). A similar sharing of EUs (that have low utilization) could be implemented in current and future manycore systems to reduce on-chip power consumption and area. ...
... The main micro-architectural parameters of the components on the logic layer are shown in Table 6.1. Our cores implement a 2-way in-order issue, out-of-order execution superscalar pipeline that is based on a scaled-up version of a Pentium Pro processor (Gwennap, 1995) with a contemporary x87 FP execution environment based on Intel Nehalem (Kurd et al., 2008). The implementation of the FP units in our target system meets the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) (IEE, 1985). ...
Thesis
Full-text available
During the past decade, the very large scale integration (VLSI) community has migrated towards incorporating multiple cores on a single chip to sustain the historic performance improvement in computing systems. As the core count continuously increases, the performance of network-on-chip (NoC), which is responsible for the communication between cores, caches and memory controllers, is increasingly becoming critical for sustaining the performance improvement. In this dissertation, we propose several methods to improve the energy efficiency of both electrical and silicon-photonic NoCs. Firstly, for electrical NoC, we propose a flow control technique, Express Virtual Channel with Taps (EVC-T), to transmit both broadcast and data packets efficiently in a mesh network. A low-latency notification tree network is included to maintain the order of broadcast packets. The EVC-T technique improves the NoC latency by 24% and the system energy efficiency in terms of energy-delay product (EDP) by 13%. In the near future, the silicon-photonic links are projected to replace the electrical links for global on-chip communication due to their lower data-dependent power and higher bandwidth density, but the high laser power can more than offset these advantages. Therefore, we propose a silicon-photonic multi-bus NoC architecture and a methodology that can reduce the laser power by 49% on average through bandwidth reconfiguration at runtime based on the variations in bandwidth requirements of applications. We also propose a technique to reduce the laser power by dynamically activating/deactivating the L2 cache banks and switching ON/OFF the corresponding silicon-photonic links in a crossbar NoC. This cache-reconfiguration based technique can save laser power by 23.8% and improves system EDP by 5.52% on average.
In addition, we propose a methodology for placing and sharing on-chip laser sources by jointly considering the bandwidth requirements, thermal constraints and physical layout constraints. Our proposed methodology for placing and sharing of on-chip laser sources reduces laser power. In addition to reducing the laser power to improve the energy efficiency of silicon-photonic NoCs, we propose to leverage the large bandwidth provided by silicon-photonic NoC to share computing resources. The global sharing of floating-point units can save system area by 13.75% and system power by 10%.
... Computer architectures are organizations of components in a single computing system [3], involving processors, memory, and input/output devices, among others, taking into account their types, characteristics, and connections. One can say that the fundamental architecture is Von Neumann's [1], which was the precursor of today's superscalar computers such as Intel's Pentium [4] and Motorola's PowerPC [5]. The Von Neumann model was proposed by John Von Neumann in 1946 and provides the main elements for the execution of sequential programs. ...
Article
Commercial processors are designed as superscalar architectures, able to predict branches, execute instructions speculatively, and extract the implicit instruction-level parallelism. These architectures arose as the evolution of pipelined architectures, which in turn evolved from the sequential Von Neumann model. Thus, investigating superscalar architectures consequently involves investigating their precursor architectures. In this context, hardware description languages (HDLs) have been used successfully. An HDL allows describing digital circuits, hardware components, and even complete architectures at a software level. The main objective of the present work is to propose an evolutionary modeling of superscalar architectures using HDL, going through their precursor models. The HDL prototypes are presented and compared.
... For each machine width, we model two different pipeline depths. For the shallow pipeline, which corresponds to a processor like the PowerPC 604 [13], the front-end pipeline depth (which includes the fetch, decode and dispatch pipeline stages) is 4 stages deep, and for the deep pipeline, which corresponds to a processor like the Intel P6 [14], it is 12 stages deep. The instruction execution latencies vary with the pipeline depth and are shown in Table 2. ...
... In-order speculative microarchitectures like Sun's UltraSparc-III that use working (future) register files indexed by architectural register number both disallow arbitrary assignments of physical results to architectural names and overwrite the mis-speculated instructions' results during recovery. Intel's P6 [10] core processors and HAL's SPARC64 V [7] keep speculative results in the re-order buffer, preventing their preservation past a mis-speculation recovery. IBM's Power [19] processors and (we believe) AMD's K7 [5] have physical register files separate from the re-order buffer, but also have an architectural register file and require that physical registers be allocated and freed in-order. ...
Conference Paper
Register integration (or simply integration) is a mechanism for incorporating speculative results directly into a sequential execution using data-dependence relationships. In this paper, we use integration to implement squash reuse, the salvaging of instruction results that were needlessly discarded during the course of sequential recovery from a control- or data- mis- speculation. To implement integration, we first allow the results of squashed instructions to remain in the physical register file past mis-speculation recovery. As the processor re-traces por- tions of the squashed path, integration logic examines each instruction as it is being renamed. Using an auxiliary table, this circuit searches the physical register file for the physical register belonging to the corresponding squashed instance of the instruction. If this register is found, integration succeeds and the squashed result is re-validated by a simple update of the rename table. Once integrated, an instruction is complete and may bypass the out-of-order core of the machine entirely. Integration reduces contention for queuing and execution resources, collapses dependent chains of instructions and accelerates the resolution of branches. It achieves this using only rename-table manipulations; no additional values are read from or written to the physical registers. Our preliminary evaluation shows that a minimal integration configuration can provide performance improvements of up to 8% when applied to current-generation micro-architectures and up to 11.5% when applied to more aggressive micro- architectures. Integration also reduces the amount of wasteful speculation in the machine, cutting the number of instructions executed by up to 15% and the number of instructions fetched along mis-speculated paths by as much as 6%.
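The mechanism the abstract describes can be rendered as a toy model: an integration table keyed by (PC, input physical registers) remembers where each squashed result lives, and the rename stage reuses a matching register instead of re-executing the instruction. The table organization, keys, and register names below are invented for illustration.

```python
# Toy model of register integration (squash reuse): after a squash,
# results stay in the physical register file, and an integration table
# remembers which physical register holds each squashed result. On
# re-rename, a table hit lets rename reuse that register (no execution).

class Integrator:
    def __init__(self):
        self.table = {}          # (pc, input phys regs) -> result phys reg

    def squash(self, pc, in_regs, out_reg):
        """Record a squashed instruction's result register."""
        self.table[(pc, in_regs)] = out_reg

    def rename(self, pc, in_regs, alloc):
        """Return (phys_reg, integrated?) for a re-traced instruction."""
        key = (pc, in_regs)
        if key in self.table:                  # same PC, same renamed inputs
            return self.table.pop(key), True   # reuse: bypass the OoO core
        return alloc(), False                  # miss: allocate fresh register

integ = Integrator()
integ.squash(pc=0x40, in_regs=("p3", "p7"), out_reg="p12")

fresh = iter(["p20", "p21"])
reg1, hit1 = integ.rename(0x40, ("p3", "p7"), lambda: next(fresh))
reg2, hit2 = integ.rename(0x44, ("p12",), lambda: next(fresh))
```

On a hit, only the rename table is updated; the physical registers themselves are neither read nor written, matching the paper's claim.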
... Roughly 20%-30% of chip power is consumed by the data cache [34], [14]. Many techniques have been proposed in the past to help reduce data cache power consumption. ...
Conference Paper
Full-text available
Conventional high-performance processors utilize register renaming and complex broadcast-based scheduling logic to steer instructions into a small number of heavily-pipelined execution lanes. This requires multiple complex structures and repeated dependency resolution, imposing a significant dynamic power overhead. This paper advocates in-place execution of instructions, a power-saving, pipeline-free approach that consolidates rename, issue, and bypass logic into one structure---the CRIB---while simultaneously eliminating the need for a multiported register file, instead storing architected state in a simple rank of latches. CRIB achieves the high IPC of an out-of-order machine while keeping the execution core clean, simple, and low power. The datapath within a CRIB structure is purely combinational, eliminating most of the clocked elements in the core while keeping a fully synchronous yet high-frequency design. Experimental results match the IPC and cycle time of a baseline out-of-order design while reducing dynamic energy consumption by more than 60% in affected structures.
... Due to the dominance of x86 processors in the computer world, many manufacturers have entered the market with their own x86-compatible processors. Examples include Intel's Pentium [10], Cyrix's M1 [24], AMD's K6 [5], and Rise's mP6 [13]. The majority of these processors adopt the decoupled decode/encode architecture by translating a complex instruction into a number of simple RISC-like instructions. ...
Article
Full-text available
Functional-level simulators have become an indispensable tool in designing today's processors. They help to exploit the design space and evaluate various design options so as to derive a suitable processor microarchitecture. Although Intel's x86 series processors are the most popular CPU in the computers, there are only a few simulation tools available for studying these processors. This paper introduces such a simulation tool. Internally it simulates a decoupled decode/encode architecture and has a RISC core. It is trace-driven and thus has a tracing system, a trace sampling system, and a processor simulator. We will describe the internal workings of the simulation tool and demonstrate how it can be used in evaluating a specific x86-compatible processor.
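The decoupled decode/encode translation such a simulator models (cracking one complex x86-style instruction into RISC-like micro-ops) can be sketched minimally. The instruction tuples and the load/op/store cracking rules below are illustrative, not the simulator's actual encoding.

```python
# Sketch of CISC-to-RISC cracking: a memory operand turns into an
# explicit load (and, for a memory destination, a read-modify-write
# load/op/store triple) feeding simple register-register micro-ops.

def crack(instr):
    """Translate one CISC-style instruction into RISC-like micro-ops."""
    op, dst, src = instr
    uops = []
    if src.startswith("["):                  # memory source: load first
        uops.append(("load", "tmp", src))
        src = "tmp"
    if dst.startswith("["):                  # memory destination: RMW
        uops.append(("load", "tmp2", dst))
        uops.append((op, "tmp2", src))
        uops.append(("store", dst, "tmp2"))
    else:
        uops.append((op, dst, src))
    return uops

simple = crack(("add", "eax", "ebx"))        # register-register: 1 uop
mem = crack(("add", "[mem]", "eax"))         # memory destination: 3 uops
```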
... The response time of a complex IAG is longer than a single cycle since the mid 90's (e.g. PentiumPro [5], Alpha 21264 [7] and EV8 [12]). The solution that has been used so far is to rely on a hierarchy of IAGs. ...
Conference Paper
On an N-way issue superscalar processor, the front-end instruction fetch engine must deliver instructions to the execution core at a sustained rate higher than N instructions per cycle. This means that the instruction address generator/predictor (IAG) has to predict the instruction flow at an even higher rate while the prediction accuracy cannot be sacrificed. Achieving high accuracy on this prediction becomes more and more critical since the overall pipeline is becoming deeper and deeper with each new generation of processors. Very complex IAGs featuring different predictors for jumps, returns, conditional and unconditional branches and complex logic are therefore used. Usually, the IAG uses information (branch histories, fetch addresses, ...) available at a cycle to predict the next fetch address(es). Unfortunately, a complex IAG cannot deliver a prediction within a short cycle. Therefore, processors rely on a hierarchy of IAGs with increasing accuracies but also increasing latencies: the accurate but slow IAG is used to correct the fast but less accurate IAG. A significant part of the potential instruction bandwidth is often wasted in pipeline bubbles due to these corrections. As an alternative to the use of a hierarchy of IAGs, it is possible to initiate the instruction address generation several cycles ahead of its use. We explore such an ahead-pipelined IAG in detail. The example illustrated shows that, even when the instruction address generation is (partially) initiated five cycles ahead of its use, it is possible to reach approximately the same prediction accuracy as that of a conventional one-block-ahead complex IAG. The solution presented delivers a sustained address generation rate close to one instruction block per cycle with state-of-the-art accuracy.
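The hierarchy-of-IAGs baseline that this work contrasts against can be caricatured in a few lines: a fast but weak predictor answers every cycle, and a slower accurate one overrides it some cycles later, each override costing fetch bubbles. The predictor functions, branch target table, and override latency below are all invented for illustration.

```python
# Caricature of a two-level IAG hierarchy: count the pipeline bubbles
# injected every time the slow, accurate IAG disagrees with (and so
# must override) the fast, inaccurate one.

def bubbles_for(pcs, fast, slow, override_latency=2):
    """Count fetch bubbles caused by slow-IAG overrides."""
    bubbles = 0
    for pc in pcs:
        if fast(pc) != slow(pc):         # slow IAG corrects the fast one
            bubbles += override_latency
    return bubbles

targets = {100: 200}                     # one taken branch at pc=100
fast = lambda pc: pc + 4                 # fast IAG: always fall through
slow = lambda pc: targets.get(pc, pc + 4)  # accurate but slow IAG

bubbles = bubbles_for([96, 100, 200], fast, slow)
```

The ahead-pipelining proposal removes these override bubbles by starting the accurate prediction early enough that its latency is hidden.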
... High-end x86 processors [4] ...
Conference Paper
Modern microprocessors offer more instruction-level parallelism than most programs and compilers can currently exploit. The resulting disparity between a machine's peak and actual performance, while frustrating for computer architects and chip manufacturers, opens the exciting possibility of low-cost instrumentation for measurement, simulation, or emulation. Instrumentation code that executes in previously unused processor cycles is effectively hidden. On two superscalar SPARC processors, a simple, local scheduler hid an average of 13% of the overhead cost of profiling instrumentation in the SPECINT benchmarks and an average of 33% of the profiling cost in the SPECFP benchmarks
... If there are not sufficient empty ROB slots, we handle the interrupt as normal. If the architecture uses reservation stations in addition to an ROB [4], [29] (an implementation choice that reduces the number of result-bus drops), we also have to ensure enough reservation stations for the handler, otherwise handle interrupts as normal. We call this scheme a nonspeculative in-line interrupt-handling facility because the hardware knows the length of the handler a priori. ...
Article
Full-text available
The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought in to the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline - representing significant work that is wasted. In addition, an overhead of several cycles and wastage of energy (per exception detected) can be expected in refetching and reexecuting the instructions flushed. This paper concentrates on improving the performance of precisely handling software managed translation look-aside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first level interrupt-handlers of TLBs are usually small, they could potentially fit in the reorder buffer along with the user-level code already there. In doing so, the instructions that would otherwise be flushed from the pipe need not be refetched and reexecuted. Additionally, it allows for instructions independent of the exceptional instruction to continue to execute in parallel with the handler code. By in-lining the TLB interrupt handler, this provides lock-up free TLBs. This paper proposes the prepend and append schemes of in-lining the interrupt handler into the available reorder buffer space. The two schemes are implemented on a performance model of the Alpha 21264 processor built by Alpha designers at the Palo Alto Design Center (PADC), California. We compare the overhead and performance impact of handling TLB interrupts by the traditional scheme, the append in-lined scheme, and the prepend in-lined scheme. 
For small, medium, and large memory footprints, the overhead is quantified by comparing the number and pipeline state of instructions flushed, the energy savings, and the performance improvements. We find that lock-up free TLBs reduce the overhead of refetching and reexecuting the instructions flushed by 30-95 percent, reduce the execution time by 5-25 percent, and also reduce the energy wasted by 30-90 percent.
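The in-lining decision described above can be sketched as a simple admission check: the handler is injected into the reorder buffer only when enough ROB slots (and, on reservation-station designs, stations) are free; otherwise the processor falls back to the usual flush. The function name and sizes are invented; the real decision involves considerably more pipeline state.

```python
# Sketch of the nonspeculative in-line interrupt-handling decision:
# the hardware knows the handler length a priori and in-lines it only
# if it fits in the free ROB (and reservation station) space.

def handle_interrupt(rob_free, rs_free, handler_len, uses_rs=True):
    """Return 'inline' if the handler fits, else 'flush'."""
    if rob_free >= handler_len and (not uses_rs or rs_free >= handler_len):
        return "inline"      # prepend/append handler inside the ROB
    return "flush"           # fall back to the traditional scheme
```

In-lining preserves the in-flight user instructions, which is where the reported refetch and energy savings come from.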
... On the other hand, the Intel Pentium Pro [13], the PowerPC 604 [7], and the HAL SPARC64 [12] are based on the reservation station model shown in Figure 2. There are two main differences between the two models. ...
Article
To characterize future performance limitations of superscalar processors, the delays of key pipeline structures in superscalar processors are studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and simulated in SPICE across a range of feature sizes.
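The window wakeup logic whose delay this work models can be sketched as a broadcast-and-compare loop over window entries: a completing result's tag is matched against every unready source, and entries whose sources are all ready become selectable. The dictionary-based window below stands in for the CAM structure and is purely illustrative.

```python
# Minimal sketch of broadcast wakeup: each result tag is compared
# against the outstanding source tags of every (unissued) window entry.

def wakeup(window, tag):
    """Broadcast a completing result tag; return newly ready ops."""
    ready = []
    for entry in window:
        if entry["issued"]:
            continue
        entry["srcs"].discard(tag)        # CAM match clears that source
        if not entry["srcs"]:             # all operands ready
            entry["issued"] = True        # hand off to the select logic
            ready.append(entry["op"])
    return ready

window = [
    {"op": "add", "srcs": {"p1"}, "issued": False},
    {"op": "mul", "srcs": {"p1", "p2"}, "issued": False},
]
sel1 = wakeup(window, "p1")   # add becomes ready
sel2 = wakeup(window, "p2")   # mul becomes ready
```

The paper's point is that this broadcast path (tag drive, comparators, ready OR) grows with window size and issue width, limiting cycle time.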
... These tradeoffs vary greatly across different shared-memory implementations, and are particularly difficult to predict for the future. For example, the recent trend of increasingly aggressive processors (e.g., MIPS R10000 [44], HP PA-8000 [32], Intel Pentium Pro [33]) is increasing the relative gains possible through more complex optimizations. Similarly, several recently proposed systems use software to manage part or all of the shared-memory communication [9] [19] [22] [34] [39]. ...
Article
The memory consistency model of a shared-memory system is a formal specification of the semantics of shared memory. The most commonly assumed model, sequential consistency, provides simple semantics but is not easily amenable to high performance. Researchers have proposed several methods to alleviate this performance disadvantage. This paper focuses on the method of obtaining information from the programmer to determine the memory operations on which an optimization can be applied without violating sequential consistency. Several researchers have applied the programmer-based method to current relaxed consistency models, determining program information to allow the optimizations of the relaxed models without violating sequential consistency. Previous efforts (by us and others), however, involved an ad hoc process to postulate the information, and complex proofs to prove the correctness of the information. Further, they do not provide insight on possible future optimizations or useful prog...
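The tension between sequential consistency and hardware optimization can be made concrete with the classic store-buffer litmus test. The sketch below enumerates every interleaving that keeps each thread's operations in program order and shows that the outcome where both threads read 0 never occurs under sequential consistency; relaxed models that let a load bypass the thread's own buffered store do admit it. The operation encoding is invented for illustration.

```python
# Store-buffer litmus test, x and y initially 0:
#   Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
# Under SC, (r1, r2) == (0, 0) is impossible.
from itertools import permutations

T1 = [("w", "x"), ("r", "y", "r1")]
T2 = [("w", "y"), ("r", "x", "r2")]

def run(schedule):
    """Execute one total order of the four operations."""
    mem, regs = {"x": 0, "y": 0}, {}
    for op in schedule:
        if op[0] == "w":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    return (regs["r1"], regs["r2"])

def sc_outcomes():
    """All results when each thread's ops stay in program order (SC)."""
    results = set()
    for sched in permutations(T1 + T2):
        pos = {op: i for i, op in enumerate(sched)}
        if pos[T1[0]] < pos[T1[1]] and pos[T2[0]] < pos[T2[1]]:
            results.add(run(sched))
    return results
```

Optimizations such as store buffering break this guarantee, which is why the programmer-based approach asks which operations may safely be relaxed.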
... For each machine width, we model two different pipeline depths. For the shallow pipeline, which corresponds to a processor like the PowerPC 604 [13], the front-end pipeline depth (which includes the fetch, decode and dispatch pipeline stages) is 4 stages deep, and for the deep pipeline, which corresponds to a processor like the Intel P6 [14], it is 12 stages deep. The instruction execution latencies vary with the pipeline depth and are shown in Table 2. ...
Article
This paper presents the concept of dynamic control independence (DCI) and shows how it can be detected and exploited in an out-of-order superscalar processor to reduce the performance penalties of branch mispredictions. We show how DCI can be leveraged during branch misprediction recovery to reduce the number of instructions squashed on a misprediction as well as how it can be used to avoid predicting unpredictable branches by fetching instructions out-of-order. A practical implementation is described and evaluated using six SPECint95 benchmarks. We show that exploiting DCI during branch misprediction recovery improves performance by 0.9-9.9% on a 4-wide processor, by 1.8-11.2% on an 8-wide processor and by 1.9-15.3% on a 12-wide processor. We also show that using DCI information to fetch instructions out-of-order when an unpredictable branch is encountered potentially improves performance by 0.9-15.2% on a 4-wide processor, by 2.0-14.8% on an 8-wide processor and by 2.6-16...
... The response time of a complex IAG is longer than a single cycle since the mid 90's (e.g. PentiumPro [5], Alpha 21264 [7] and EV8 [12]). The solution that has been used so far is to rely on a hierarchy of IAGs. ...
Article
On an N-way issue superscalar processor, the front-end instruction fetch engine must deliver instructions to the execution core at a sustained rate higher than N instructions per cycle. This means that the instruction address generator/predictor (IAG) has to predict the instruction flow at an even higher rate while the prediction accuracy cannot be sacrificed. Achieving high accuracy on this prediction becomes more and more critical since the overall pipeline is becoming deeper and deeper with each new generation of processors. Very complex IAGs featuring different predictors for jumps, returns, conditional and unconditional branches and complex logic are therefore used. Usually, the IAG uses information (branch histories, fetch addresses, ...) available at a cycle to predict the next fetch address(es). Unfortunately, a complex IAG cannot deliver a prediction within a short cycle. Therefore, processors rely on a hierarchy of IAGs with increasing accuracies but also increasing latencies: the accurate but slow IAG is used to correct the fast but less accurate IAG. A significant part of the potential instruction bandwidth is often wasted in pipeline bubbles due to these corrections. As an alternative to the use of a hierarchy of IAGs, it is possible to initiate the instruction address generation several cycles ahead of its use. In this paper, we explore such an ahead-pipelined IAG in detail. The example illustrated in this paper shows that, even when the instruction address generation is (partially) initiated five cycles ahead of its use, it is possible to reach approximately the same prediction accuracy as that of a conventional one-block-ahead complex IAG. The solution presented in this paper delivers a sustained address generation rate close to one instruction block per cycle with state-of-the-art accuracy.
Article
Conventional high-performance processors utilize register renaming and complex broadcast-based scheduling logic to steer instructions into a small number of heavily-pipelined execution lanes. This requires multiple complex structures and repeated dependency resolution, imposing a significant dynamic power overhead. This paper advocates in-place execution of instructions, a power-saving, pipeline-free approach that consolidates rename, issue, and bypass logic into one structure---the CRIB---while simultaneously eliminating the need for a multiported register file, instead storing architected state in a simple rank of latches. CRIB achieves the high IPC of an out-of-order machine while keeping the execution core clean, simple, and low power. The datapath within a CRIB structure is purely combinational, eliminating most of the clocked elements in the core while keeping a fully synchronous yet high-frequency design. Experimental results match the IPC and cycle time of a baseline out-of-order design while reducing dynamic energy consumption by more than 60% in affected structures.
Article
This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates, then uses these results to show how soft processor microarchitectures should differ from those of hard processors. We find that the ratios of custom CMOS versus FPGA area for different building blocks vary considerably more than the speed ratios; thus, area ratios have more impact on microarchitecture choices. Complete processor cores on an FPGA use 17-27× more area ("area ratio") than the same design implemented in custom CMOS. Building blocks with dedicated hardware support on FPGAs such as SRAMs, adders, and multipliers are particularly area-efficient (2-7×), while multiplexers and content-addressable memories (CAM) are particularly area-inefficient (>100×). Applying these results, we find out-of-order soft processors should use physical register file organizations to minimize CAM size.
Article
Global Signal Vulnerability (GSV) analysis is a novel method for assessing the susceptibility of modern microprocessor state elements to failures in the field of operation. In order to effectively allocate design for reliability resources, GSV analysis takes into account the high degree of architectural masking exhibited in modern microprocessors and ranks state elements accordingly. The novelty of this method lies in the way this ranking is computed. GSV analysis operates either at the Register Transfer (RT-) or at the Gate-Level, offering increased accuracy in contrast to methods which compute the architectural vulnerability of registers through high-level simulations on performance models. Moreover, it does not rely on extensive Statistical Fault Injection (SFI) campaigns and lengthy executions of workloads to completion in RT- or Gate-Level designs, which would make such analysis prohibitive. Instead, it monitors the behavior of key global microprocessor signals in response to a progressive stuck-at fault injection method during partial workload execution. Experimentation with the Scheduler and Reorder Buffer modules of an Alpha-like microprocessor and a modern Intel microprocessor corroborates that GSV analysis generates a near-optimal ranking, yet is several orders of magnitude faster than existing RT- or Gate-Level approaches.
Article
Full-text available
We present a new technique for verification of complex hardware devices that allows both generality and a high degree of automation. The technique is based on our new way of constructing a “light-weight” completion function together with new encoding of uninterpreted functions called reference file representation. Our technique combines our completion function method and reference file representation with compositional model checking and theorem proving. This extends the state of the art in two directions. First, we obtain a more general verification methodology. Second, it is easier to use, since it has a higher degree of automation. As a benchmark, we take Tomasulo's algorithm for scheduling out-of-order instruction execution used in many modern superscalar processors like the Pentium-II and the PowerPC 604. The algorithm is parameterized by the processor configuration, and our approach allows us to prove its correctness in general, independent of any actual design.
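Tomasulo's algorithm, the scheduling scheme being verified above, can be rendered as a toy model: a reservation station holds either operand values or tags naming the stations that will produce them, and a completing station broadcasts its tag and value on a common data bus. The station names and tag encoding below are invented for illustration.

```python
# Toy rendering of Tomasulo-style scheduling: a common-data-bus
# broadcast replaces every waiting tag with the produced value, waking
# dependent reservation stations.

def broadcast(stations, tag, value):
    """CDB broadcast: replace every waiting tag with the produced value."""
    for st in stations.values():
        st["ops"] = [value if op == ("tag", tag) else op for op in st["ops"]]

stations = {
    "rs1": {"ops": [3, 4]},                  # both operands ready
    "rs2": {"ops": [("tag", "rs1"), 10]},    # waits for rs1's result
}

result = sum(stations["rs1"]["ops"])         # rs1 executes: 3 + 4
broadcast(stations, "rs1", result)           # wake rs2 via the CDB
ready2 = all(not isinstance(op, tuple) for op in stations["rs2"]["ops"])
```

Verifying that every such out-of-order schedule is equivalent to in-order execution, independent of the station count, is exactly what the parameterized proof above targets.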
Article
We propose a new universal High-Level Information (HLI) format to effectively integrate front-end and back-end compilers by passing front-end information to the back-end compiler. Importing this information into an existing back-end leverages the state-of-the-art analysis and transformation capabilities of existing front-end compilers to allow the back-end greater optimization potential than it has when relying on only locally-extracted information. A version of the HLI has been implemented in the SUIF parallelizing compiler and the GCC back-end compiler. Experimental results with the SPEC benchmarks show that HLI can provide GCC with substantially more accurate data dependence information than it can obtain on its own. Our results show that the number of dependence edges in GCC can be reduced substantially for the benchmark programs studied, which provides greater flexibility to GCC's code scheduling pass, common subexpression elimination pass, loop invariant code removal pass and register local allocation pass. Even with GCC's optimizations limited to basic blocks, the use of HLI produces moderate speedups compared to using only GCC's dependence tests when the optimized programs are executed on a MIPS R10000 processor.
Article
Various levels of parallelism have recently been introduced in advanced microprocessors to meet the demanding computing need in digital video processing and other multimedia applications. Because many imaging algorithms are easily parallelizable, these architectural features and their wide availability at low cost have become a powerful tool in tackling both existing and new imaging applications. At the lowest level, the subword parallelism is used in the new instructions aimed at processing multiple multimedia data simultaneously. Instruction-level parallelism including subword parallelism is realized in either very long instruction word or superscalar architectures, while on-chip and/or off-chip multiprocessing capability is available for easier multiprocessor system designs. One of the difficulties in maximizing the computing throughput via parallelism has been the level of programming in that to obtain the optimal performance, assembly-level programming has typically been required. We review the architectural features in several modern microprocessors such as TMS320C60, TM-1000, PowerPC 604, Pentium II, R10000, Alpha 21264, PA-RISC 8200, UltraSPARC-II, and TMS320C80. Various obstacles to obtaining the best performance from these microprocessors with high-level and assembly languages are discussed, and several approaches to overcome these difficulties in diverse imaging applications are presented. © 1998 John Wiley & Sons, Inc. Int J Imaging Syst Technol 9: 407–415, 1998
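The subword parallelism surveyed above can be mimicked in plain integer arithmetic (the SWAR idiom). The sketch below adds four packed 8-bit lanes in one word-wide operation while suppressing inter-lane carries; it is a pure-Python stand-in for an instruction in the spirit of a packed-byte add, not any particular ISA's exact semantics. The masks are the usual SWAR constants.

```python
# Subword-parallel add: four unsigned 8-bit lanes per 32-bit word.
# Add the low 7 bits of each lane carry-free, then restore the top
# bit of each lane with XOR so no carry crosses a lane boundary.

def packed_add8(x, y):
    """Add four unsigned 8-bit lanes of x and y, wrapping per lane."""
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)   # low 7 bits per lane
    return low ^ ((x ^ y) & 0x80808080)          # top bits, carries dropped
```

One such operation does the work of four scalar byte adds, which is exactly the throughput leverage the multimedia instruction sets provide in silicon.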
Article
An important challenge concerning the design of future microprocessors is that current design methodologies are becoming impractical due to long simulation runs and due to the fact that chip layout considerations are not incorporated in early design stages. In this paper, we show that statistical modeling can be used to speed up the architectural simulations and is thus viable for early design stage explorations of new microarchitectures. In addition, we argue that processor layouts should be considered in early design stages in order to tackle the growing importance of interconnects in future technologies. In order to show the applicability of our methodology which combines statistical modeling and processor layout considerations in an early design stage, we have applied our method on a novel architectural paradigm, namely a fixed-length block structured architecture. A fixed-length block structured architecture is an answer to the scalability problem of current architectures. Two important factors prevent contemporary out-of-order architectures from being scalable to higher levels of parallelism in future deep-submicron technologies: the increased complexity and the growing domination of interconnect delays. In this paper, we show by using statistical modeling and processor layout considerations, that a fixed-length block structured architecture is a viable architectural paradigm for future microprocessors in future technologies thanks to the introduction of decentralization and a reduced register file pressure.
Conference Paper
We study the direct cost of virtual function calls in C++ programs, assuming the standard implementation using virtual function tables. We measure this overhead experimentally for a number of large benchmark programs, using a combination of executable inspection and processor simulation. Our results show that the C++ programs measured spend a median of 5.2% of their time and 3.7% of their instructions in dispatch code. For "all virtuals" versions of the programs, the median overhead rises to 13.7% (13% of the instructions). The "thunk" variant of the virtual function table implementation reduces the overhead by a median of 21% relative to the standard implementation. On future processors, these overheads are likely to increase moderately.
Conference Paper
Full-text available
The trend in microprocessor design is to extend instruction-set architectures with features—such as parallelism annotations, predication, speculative memory access, or multimedia instructions—that allow the compiler or programmer to express more instruction-level parallelism than the microarchitecture is willing to derive. In this paper we show how these instruction-set extensions can be put to use when formally verifying the correctness of a microarchitectural model. Inspired by Intel's IA-64, we develop an explicitly parallel instruction-set architecture and a clustered microarchitectural model. We then describe how to formally verify that the model implements the instruction set. The contribution of this paper is a specification and verification method that facilitates the decomposition of microarchitectural correctness proofs using instruction-set extensions.
Conference Paper
Full-text available
There is a large class of circuits (including pipeline and out-of-order execution components) which can be formally verified while completely ignoring the precise characteristics (e.g. word size) of the data manipulated by the circuits. In the literature, this is often described as the use of uninterpreted functions, implying that the concrete operations applied to the data are abstracted into unknown and featureless functions. In this paper, we briefly introduce an abstract unifying model for such data-insensitive circuits, and claim that the development of such models, perhaps even a theory of circuit schemas, can significantly contribute to the development of efficient and comprehensive verification algorithms combining deductive as well as enumerative methods. As a case study, we present in this paper an algorithm for out-of-order execution with in-order retirement and show it to be a refinement of the sequential instruction execution algorithm. Refinement is established by deductively proving (using PVS) that the register files of the out-of-order algorithm and the sequential algorithm agree at all times if the two systems are synchronized at instruction retirement time.
Conference Paper
Full-text available
In this paper we describe and compare two methodologies for verifying the correctness of a speculative out-of-order execution system with interrupts. Both methods are deductive (we use PVS) and are based on refinement. The first proof is by direct refinement to a sequential system; the second proof combines refinement with induction over the number of retirement buffer slots.
Conference Paper
Full-text available
Commodity microprocessors uniformly apply branch prediction and single-path speculative execution to all kinds of program branches and suffer from the high misprediction penalty which is caused by branches with low prediction accuracy and, in particular, by branches that are unpredictable. The Simultaneous Speculation Scheduling (S3) technique removes such penalties by a combination of compiler and architectural techniques that enable speculative dual-path execution after program branches...
Conference Paper
Within two or three technology generations, processor architects will face a number of major challenges. Wire delays will become critical, and power considerations will temper the availability of billions of transistors. Many important applications will be object-oriented, multithreaded, and will consist of many separately compiled and dynamically linked parts. To accommodate these shifts in both technology and applications, microarchitectures will process instruction streams in a distributed fashion - instruction level distributed processing (ILDP). ILDP will be implemented in a variety of ways, including both homogeneous and heterogeneous elements. To help find run-time parallelism, orchestrate distributed hardware resources, implement power conservation strategies, and to provide fault-tolerant features, an additional layer of abstraction - the virtual machine layer - will likely become an essential ingredient. Finally, new instruction sets may be necessary to better focus on instruction level communication and dependence, rather than computation and independence as is commonly done today.
Book
This lecture presents a study of the microarchitecture of contemporary microprocessors. The focus is on implementation aspects, with discussions on their implications in terms of performance, power, and cost of state-of-the-art designs. The lecture starts with an overview of the different types of microprocessors and a review of the microarchitecture of cache memories. Then, it describes the implementation of the fetch unit, where special emphasis is made on the required support for branch prediction. The next section is devoted to instruction decode with special focus on the particular support to decoding x86 instructions. The next chapter presents the allocation stage and pays special attention to the implementation of register renaming. Afterward, the issue stage is studied. Here, the logic to implement out-of-order issue for both memory and non-memory instructions is thoroughly described. The following chapter focuses on the instruction execution and describes the different functional units that can be found in contemporary microprocessors, as well as the implementation of the bypass network, which has an important impact on the performance. Finally, the lecture concludes with the commit stage, where it describes how the architectural state is updated and recovered in case of exceptions or misspeculations. This lecture is intended for an advanced course on computer architecture, suitable for graduate students or senior undergrads who want to specialize in the area of computer architecture. It is also intended for practitioners in the industry in the area of microprocessor design. The book assumes that the reader is familiar with the main concepts regarding pipelining, out-of-order execution, cache memories, and virtual memory.
Article
Full-text available
The potential performance of superscalar processors can be exploited only when the processor is fed with sufficient instruction bandwidth. The front-end units, the Instruction Stream Buffer (ISB) and the fetcher, are the key elements for achieving this goal. Current ISBs cannot support instruction streaming beyond a basic block, and in x86 processors the split-line instruction problem worsens this situation. In this paper, we propose a basic-block-reassembling ISB. By cooperating with the proposed Line-Weighted Branch Target Buffer (LWBTB), the ISB can predict upcoming branch information and reassemble the current cache line together with the other cache line containing instructions for the next basic block. The fetcher can therefore fetch more instructions per cycle with the assistance of the proposed ISB. Simulation results show that a cache line size over 54 bytes gives a good chance of two basic blocks being present in a reassembled cache line, and that fetch efficiency is about 90% when the fetch capacity is under 6.
Article
Full-text available
A major hurdle of recent x86 superscalar processor designs is limited instruction issue rate due to the overly complex x86 instruction formats. To alleviate this problem, the machine states must be preserved and the instruction address routing paths must be simplified. We propose an instruction address queue, whose queue size has been estimated to handle saving of instruction addresses with three operations: allocation, access, and retirement. The instruction address queue will supply the stored instruction addresses as data for three mechanisms: changing instruction flow, updating the BTB, and handling exceptions. It can also be used for internal snooping to solve self-modified-code problems. Two CISC hazards in the x86 architectures, the variable instruction length and the complex addressing mode, have been considered in this design. Instead of the simple fully associative storing method used in lower-degree (< 4) superscalar systems, the line-offset method is used in this address queue. This reduces the storage space by 1/3 for a degree-5 superscalar x86 processor with even smaller access latency. We use synthesis tools to analyze the design, and show that it produces optimized results. Because the address queue design can keep two different line addresses in an instruction access per cycle, this method can be extended for designing a multiple-instruction-block issue system, such as the trace processor.
Article
Execution along mispredicted paths may or may not affect the accuracy of subsequent branch predictions if recovery mechanisms are not provided to undo the erroneous information that is acquired by the branch prediction storage structures. In this paper, we study four elements of the Two-Level Branch Predictor: the Branch Target Buffer (BTB), the Branch History Register (BHR), the Pattern History Tables (PHTs), and the Return Address Stack (RAS). For each we determine whether a recovery mechanism is needed, and, if so, show how to design a cost-effective one. Using five benchmarks from the SPECint92 suite, we show that there is no need to provide recovery mechanisms for the BTB and the PHTs, but that performance is degraded by an average of 30% if recovery mechanisms are not provided for the BHR and RAS.
Article
Implementors of graphics application programming interfaces (APIs) and algorithms are often required to support a plethora of options for several stages at the back end of the geometry and rasterization pipeline. Implementing these options in high-level programming languages such as C leads to code with many branches and large object modules due to indiscriminate duplication of very similar code. This reduces the speed of execution of the program. This paper examines the problems of branches and code size in detail and presents techniques for transforming typical code sequences in graphics programs to reduce the number of branches and to reduce to the size of the program. A set of branch-free basis functions is defined first. The application of these functions to common geometric queries, geometry pipeline computations, rasterization, and pixel processing algorithms is then discussed.
Article
Thesis (Ph.D.)--University of Wisconsin--Madison, 2006 Includes bibliographical references (p. 159-169). Photocopy.
Article
Thesis (Ph. D.)--University of Toronto, 1997. Includes bibliographical references.
Conference Paper
Full-text available
A dynamic binary translation system for a co-designed virtual machine is described and evaluated. The underlying hardware directly executes an accumulator-oriented instruction set that exposes instruction dependence chains (strands) to a distributed microarchitecture containing a simple instruction pipeline. To support conventional program binaries, a source instruction set (Alpha in our study) is dynamically translated to the target accumulator instruction set. The binary translator identifies chains of inter-instruction dependences and assigns them to dependence-carrying accumulators. Because the underlying superscalar microarchitecture is capable of dynamic instruction scheduling, the binary translation system does not perform aggressive optimizations or re-schedule code; this significantly reduces binary translation overhead. Detailed timing simulation of the dynamically translated code running on an accumulator-based distributed microarchitecture shows the overall system is capable of achieving similar performance to an ideal out-of-order superscalar processor, ignoring the significant clock frequency advantages that the accumulator-based hardware is likely to have. As part of the study, we evaluate an instruction set modification that simplifies precise trap implementation. This approach significantly reduces the number of instructions required for register state copying, thereby improving performance. We also observe that translation chaining methods can have substantial impact on the performance, and we evaluate a number of chaining methods.
Conference Paper
This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own mini-instruction set, that operates on the core processor's instructions to transform them into an internal format that can be more efficiently executed. It is located off the critical path of the core processor to ensure that it does not negatively impact the core processor's cycle time or pipeline depth. An I-COP is highly versatile and can be used to implement different types of instruction transformations to enhance the IPC of the core processor. We study four potential applications of the I-COP to demonstrate the feasibility of this concept and investigate the design issues of such a coprocessor. A prototype instruction set for the I-COP is presented along with an implementation framework that facilitates achieving high I-COP performance. Initial results indicate that the I-COP is able to efficiently implement the trace cache fill unit as well as the register move, stride data prefetching and linked data structure prefetching trace optimizations.
Conference Paper
The understanding of instruction set usage in typical DOS/Windows applications plays a very important role in designing high-performance x86-compatible microprocessors. This paper presents the tools for such analysis, the analysis results, and their implications on the design of a superscalar processor, based on a RISC core, for efficient x86 instruction execution. The analyzed results are used to optimize the execution of frequently executed instructions and micro-operations.
Conference Paper
AMULET2e is an asynchronous microprocessor system based on the AMULET2 processor core. In addition to the processor it incorporates a number of distinct subsystems, the most notable of which is an asynchronous on-chip cache. This includes several novel features which exploit the asynchronous design style to increase throughput and reduce power consumption. These features are evident at a number of levels in the design. For example, the cache is micropipelined to increase its throughput, at the architectural level there is an arbitration free non-blocking line fetch mechanism while at the circuit design level there is a power-saving RAM sense amplifier control circuit. A significant property of the cache system is its ability to cycle in a data dependent way which allows the system to approach average case performance
Article
Designing a wholly new microprocessor is difficult and expensive. To justify this effort, a major new microarchitecture must improve performance one and a half or two times over the previous-generation microarchitecture, when evaluated on equivalent process technology. In addition, semiconductor process technology continues to evolve while the processor design is in progress. The previous-generation microarchitecture increases in clock speed and performance due to compactions and conversion to newer technology. A new microarchitecture must “intercept” the process technology to achieve a compounding of process and microarchitectural speedups. This paper looks at a large microprocessor development project which reveals some of the reasoning (for goals, changes, trade-offs, and performance simulation) that lay behind its final form
Article
The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.
Article
Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Trade-offs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed trade-offs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected
Article
Many contemporary multiple-issue processors employ out-of-order scheduling hardware in the processor pipeline. Such scheduling hardware can yield good performance without relying on compile-time scheduling. The hardware can also schedule around unexpected run-time occurrences such as cache misses. As issue widths increase, the complexity of such scheduling hardware increases considerably and can have an impact on the cycle time of the processor. This paper presents the design of a multiple-issue processor that uses an alternative approach called miss path scheduling. Scheduling hardware is removed from the processor pipeline altogether and placed on the path between the instruction cache and the next level of memory. Scheduling is performed at cache miss time, as instructions are received from memory. Scheduled blocks of instructions are issued to an aggressively clocked in-order execution core. Details of a hardware scheduler that can perform speculation are outlined and ...
Article
The Multiscalar architecture executes a single sequential program following multiple flows of control. In the Multiscalar hardware, a global sequencer, with help from the compiler, takes large steps through the program's control flow graph (CFG) speculatively, starting a new thread of control (task) at each step. This is inter-task control flow speculation. Within a task, traditional control flow speculation is used to extract instruction level parallelism. This is intra-task control flow speculation. This paper focuses on mechanisms to implement inter-task control flow speculation (task prediction) in a Multiscalar implementation. This form of speculation has fundamental differences from traditional branch prediction. We look in detail at the issues of prediction automata, history generation and target buffers. We present implementations in each of these areas that offer good accuracy, size and performance characteristics. Keywords: Multiscalar Architecture, Control-flow Speculation,...
Article
Full-text available
Processor cycle times are currently much faster than memory cycle times, and the trend has been for this gap to increase over time. The problem of increasing memory latency, relative to processor speed, has been dealt with by adding high-speed cache memory. However, it is difficult to make a cache both large and fast, so cache misses are expected to continue to have a significant performance impact. Prediction caches use a history of recent cache misses to predict future misses, and to reduce the overall cache miss rate. This paper describes several prediction caches, and introduces a new kind of prediction cache, which combines the features of prefetching and victim caching. This new cache is shown to be more effective at reducing miss rate and improving performance than existing prediction caches. Key Words and Phrases: Dynamic scheduling, Memory latency, Stream buffer, Victim cache, Prediction cache. Copyright © 1996 by James E. Bennett and Michael J. Flynn.
Article
Full-text available
While a number of dynamically scheduled processors have recently been brought to market, work on hardware techniques for memory latency tolerance has mostly targeted statically scheduled processors. This paper attempts to remedy this situation by examining the applicability of hardware latency tolerance techniques to dynamically scheduled processors. The results so far indicate that the inherent ability of the dynamically scheduled processor to tolerate memory latency reduces the need for additional hardware such as stream buffers or stride prediction tables. However, the technique of victim caching, while not usually considered as a latency tolerating technique, proves to be quite effective in aiding the dynamically scheduled processor in tolerating memory latency. For a fixed size investment in microprocessor chip area, the victim cache outperforms both stream buffers and stride prediction. Key Words and Phrases: Dynamic scheduling, Memory latency, Stream buffer, Stride prediction, ...
Article
Computer Architecture Support for Database Applications. Ph.D. dissertation by Kimberly Kristine Keeton, Computer Science, University of California at Berkeley; Professor David A. Patterson, Chair. Database workloads are an important class of applications, responsible for one-third of the symmetric multiprocessor (SMP) server market. Despite their importance, they are seldom used in computer architecture performance evaluations, which favor technical applications, such as SPEC. Database applications are often avoided because they are difficult to study in fully-scaled configurations, for reasons including large hardware requirements and complicated software configuration and tuning issues. This dissertation addresses several of the challenges posed by database workloads. First, we characterize the architectural behavior of two standard database workloads, namely online transaction processing (OLTP) and decision support (DSS), running on a commercial database on a commodity Intel...