Article

The 21264: a superscalar Alpha processor with out-of-order execution

... The structures identified above were presented in the context of the baseline superscalar model shown in Figure 1. The MIPS R10000 [22] and the DEC 21264 [10] are real implementations that directly fit this model. Hence, the structures identified above apply to these two processors. ...
... An alternate scheme for register renaming uses a CAM (content-addressable memory) [19] to store the current mappings. Such a scheme is implemented in the HAL SPARC [2] and the DEC 21264 [10]. The number of entries in the CAM is equal to the number of physical registers. ...
... For an 8-way machine with deep pipelines, this would exclude a large number of bypass paths. Another solution is to generalize the method used in the DEC 21264 [10] and use multiple copies of the register file. This is the "cluster" method referred to in Section 4.4. ...
... As further interpretations, the mapping S^n → S^n (for S = {0, 1}) covers boolean networks, which have been widely studied both for their theoretical interest in computer science and for their potential applications in nature (genetic networks [Kum96], neural networks, etc.) and in the social sciences [Kel96]. In the context of chip design, such approaches (like in situ) are considered important and are discussed in electronics-oriented papers (see for example [RAS97]). ...
... Computer architects have introduced several hardware-based run-time techniques. These enable the processor to run any ready instruction from an instruction window [Cen, Kel96, Kum96, MO56, Sch73, Tom67]. ...
Thesis
Full-text available
This thesis aims at developing strategies to enhance the power of sequential computation and distributed systems; in particular, it deals with the sequential breakdown of operations and with decentralized collaborative editing systems. In this thesis, we introduce a precision-control indexing method that generates the unique identifiers used for indexed communication in distributed systems, particularly in decentralized collaborative editing systems. These identifiers are still real numbers, but with a specific, controlled pattern of precision. The set of identifiers is kept finite, which makes it possible to compute local as well as global cardinality. This property plays an important role in dealing with indexed communication. Besides this, some other properties, including order preservation, are observed. The indexing method was successfully tested and verified by experimentation, and it leads to the design of a decentralized collaborative editing system. Dealing with the sequential breakdown of operations, we explore the limitations of existing strategies and extend the idea by introducing new strategies. These strategies lead towards optimization (of the processor, compiler, memory, and code). This style of decomposition invites further investigation and practical implementation by the research community, which could lead towards the design of an arithmetic unit.
Article
This thesis aims at developing strategies to enhance the power of sequential computation and distributed systems; in particular, it deals with the sequential breakdown of operations and with decentralized collaborative editing systems. The rapid growth in the use of modern computer technology increases the demand for higher performance in all areas of computing, and this demand for ever-greater performance has led to growth in hardware performance and to architecture evolution, which in turn puts stress on compiler technology and the research community. Since a high-performance microprocessor is at the heart of every general-purpose computer, from servers, to desktop and laptop PCs, to cell-phone platforms such as the iPhone, increasing performance is a permanent challenge in computer science. Similarly, the rapid development of computer networks has inspired the advancement of real-time collaborative editing. Real-time Collaborative Editors (RCE) enable a group of users to simultaneously edit a shared document from physically dispersed sites interconnected by a computer network. In such distributed systems, one of the challenges is to avoid communication conflicts, and indexed communication is an essential requirement for this purpose. In this thesis, a precision-control indexing method is introduced that generates the unique identifiers used for indexed communication in distributed systems, particularly in decentralized collaborative editing systems. These identifiers are still real numbers, but with a specific, controlled pattern of precision. The set of identifiers is kept finite, which makes it possible to compute local as well as global cardinality. This property plays an important role in dealing with indexed communication. Besides this, some other properties, including order preservation, are observed. The indexing method was successfully tested and verified by experimentation, and it leads to the design of a decentralized collaborative editing system. Dealing with the sequential breakdown of operations, we explore the limitations of existing strategies and extend the idea by introducing new strategies. These strategies lead towards optimization (of the processor, compiler, memory, and code). This style of decomposition invites further investigation and practical implementation by the research community, which could lead towards the design of an arithmetic unit.
... To increase instruction-level parallelism, many architectures have been proposed; they can be classified into two main approaches, the superscalar [12][13] and the VLIW architectures [1][2][3][4][5][6][7]. The superscalar architectures perform instruction grouping and functional-unit assignment in hardware, which has the advantage of instruction compatibility. ...
Article
Digital Signal Processors (DSPs) have been widely used in processing video and audio streaming data. Due to the huge amount of streaming data, increasing throughput is the key issue in designing a DSP architecture. One way to increase the throughput of a DSP is to increase its instruction-level parallelism. To increase instruction-level parallelism, many architectures have been proposed; they can be classified into two main approaches, the superscalar and the VLIW architectures. Among these, the VLIW attracts a lot of attention due to its low hardware complexity. However, the VLIW architecture suffers from an explosion of instruction memory due to the overhead of instruction grouping. In this paper, we propose a novel DSP architecture that contains three pipelines and performs dynamic instruction grouping in hardware. The experimental results show that our architecture can reduce the memory requirement by 13% on average while maintaining the same performance.
... Figure 3. Extension to the CAM Register Mapping:

Physical:     1 2 3 4 5 6 7 ...
Logical:      3 4 2 1 8 9 7 ...
Valid:        1 1 1 1 0 0 0 ...
Future Free:  0 0 0 0 0 0 0 ...
Free List:    0 0 0 0 1 1 1 ...

In Figure 3 we see an example of how our register mapping works. Our register mapping structure follows the CAM scheme, such as in the Alpha 21264 [17]. ...
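To make the mapping concrete, here is a minimal C sketch of a CAM-style register map with the fields shown in Figure 3. It is an illustration only: the sizes, names, and allocation policy are assumptions, not details of the 21264 or of the cited design.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_PHYS 80                /* illustrative; one CAM entry per physical register */

typedef struct {
    uint8_t logical[NUM_PHYS];     /* logical register named by this physical register */
    bool    valid[NUM_PHYS];       /* this entry is the current mapping */
    bool    future_free[NUM_PHYS]; /* superseded mapping, reclaimable at retire */
    bool    free_list[NUM_PHYS];   /* physical register available for allocation */
} cam_map_t;

/* Source-operand lookup: in hardware all entries are compared in parallel;
 * this loop models the associative match on (logical, valid). */
static int cam_lookup(const cam_map_t *m, uint8_t logical_reg)
{
    for (int p = 0; p < NUM_PHYS; p++)
        if (m->valid[p] && m->logical[p] == logical_reg)
            return p;
    return -1;                     /* no current mapping */
}

/* Destination rename: take a free physical register, make it the new valid
 * mapping, and mark the old mapping future-free so it returns to the free
 * list when the renaming instruction retires. */
static int cam_rename(cam_map_t *m, uint8_t logical_reg)
{
    int old = cam_lookup(m, logical_reg);
    for (int p = 0; p < NUM_PHYS; p++) {
        if (!m->free_list[p])
            continue;
        m->free_list[p] = false;
        m->logical[p]   = logical_reg;
        m->valid[p]     = true;
        if (old >= 0) {
            m->valid[old]       = false;
            m->future_free[old] = true;
        }
        return p;
    }
    return -1;                     /* no free register: rename stalls */
}
```

One reason the CAM form is often cited as attractive is recovery: restoring a checkpoint amounts to restoring the per-entry valid bits rather than copying back an entire logical-to-physical table.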
Conference Paper
Full-text available
Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications, where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue, and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms, our processor suffers a performance degradation of only 10% on SPEC2000fp relative to a conventional processor requiring more than an order of magnitude more entries in the ROB and instruction queues, and achieves about a 200% improvement over a current processor with a similar number of entries.
... So far we have discussed microarchitectures that distribute the instruction window based on task or trace boundaries. Dependence-based clustering is an interesting alternative [14][15]. Similar to trace processors, the window and execution resources are distributed among multiple smaller clusters. ...
Conference Paper
Traces are dynamic instruction sequences constructed and cached by hardware. A microarchitecture organized around traces is presented as a means for efficiently executing many instructions per cycle. Trace processors exploit both control flow and data flow hierarchy to overcome complexity and architectural limitations of conventional superscalar processors by (1) distributing execution resources based on trace boundaries and (2) applying control and data prediction at the trace level rather than at individual branches or instructions. Three sets of experiments using the SPECInt95 benchmarks are presented. (i) A detailed evaluation of trace processor configurations: the results affirm that significant instruction-level parallelism can be exploited in integer programs (2 to 6 instructions per cycle). We also isolate the impact of distributed resources, and quantify the value of successively doubling the number of distributed elements. (ii) A trace processor with data prediction applied to inter-trace dependences: the potential performance improvement with perfect prediction is around 45% for all benchmarks. With realistic prediction, gcc achieves an actual improvement of 10%. (iii) An evaluation of aggressive control flow: some benchmarks benefit from control independence by as much as 10%.
... In contrast, the sliced register file means that each lane needs only 4R+2W ports for its functional units. Note that, as opposed to clustering techniques [11], writes into one register file lane do not need to be made visible to other lanes. ...
Conference Paper
Full-text available
Tarantula is an aggressive floating-point machine targeted at technical, scientific, and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second-level cache with a peak bandwidth of sixty-four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) it provides high bandwidth for non-unit-stride memory accesses, (3) it supports gather/scatter instructions efficiently, (4) it fully integrates with the EV8 core through a narrow, streamlined interface, rather than acting as a co-processor, (5) it can achieve a peak of 104 operations per cycle, and (6) it achieves excellent "real computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter-intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
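For readers unfamiliar with the gather/scatter instructions mentioned in point (3), the following C fragment gives their scalar semantics; the function names and the vector-length parameter are illustrative, not Tarantula's actual ISA.

```c
/* Scalar model of vector gather/scatter semantics. */
void gather(double *dst, const double *base, const long *index, int vl)
{
    for (int i = 0; i < vl; i++)
        dst[i] = base[index[i]];    /* dst[i] <- mem[base + index[i]] */
}

void scatter(const double *src, double *base, const long *index, int vl)
{
    for (int i = 0; i < vl; i++)
        base[index[i]] = src[i];    /* mem[base + index[i]] <- src[i] */
}
```

Supporting these efficiently is hard precisely because the index vector turns one instruction into vl independent, possibly conflicting cache accesses, which is why the Radix Sort result above is notable.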
... [AMG+95] and the DEC 21264 [Kel96]. The number of entries in the CAM is equal to the number of physical registers. ...
Article
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice-simulated for feature sizes of 0.8µm, 0.35µm, and 0.18µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic, as well as operand bypass logic, are likely to be the most critical in the future.
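As a rough illustration of the wakeup and selection logic analyzed above, the sketch below models a small instruction window in C. The entry format and the pick-lowest-index selection policy are assumptions made for the example, not the paper's circuit model.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_SIZE 32             /* illustrative window size */

typedef struct {
    uint8_t src_tag[2];            /* physical-register tags of the two sources */
    bool    src_ready[2];          /* operand availability */
    bool    occupied;              /* entry holds a waiting instruction */
} window_entry_t;

/* Wakeup: on each result broadcast, every entry compares the produced tag
 * against both of its source tags. In hardware this is the CAM match whose
 * delay grows with window size and issue width. */
void wakeup(window_entry_t win[WINDOW_SIZE], uint8_t result_tag)
{
    for (int e = 0; e < WINDOW_SIZE; e++) {
        if (!win[e].occupied)
            continue;
        for (int s = 0; s < 2; s++)
            if (win[e].src_tag[s] == result_tag)
                win[e].src_ready[s] = true;
    }
}

/* Selection: an entry requests issue once both operands are ready; here the
 * lowest-index requester wins, standing in for a real selection policy. */
int select_one(const window_entry_t win[WINDOW_SIZE])
{
    for (int e = 0; e < WINDOW_SIZE; e++)
        if (win[e].occupied && win[e].src_ready[0] && win[e].src_ready[1])
            return e;
    return -1;                     /* nothing ready to issue */
}
```

Even this toy makes the paper's point visible: every broadcast tag is compared against every source tag in the window, so wakeup delay scales with both window size and issue width.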
Conference Paper
The continued increase in microprocessor clock frequency that has come from advancements in fabrication technology and reductions in feature size creates challenges in maintaining both manufacturing yield rates and long-term reliability of devices. Methods based on defect detection and reduction may not offer a scalable solution due to the cost of eliminating contaminants in the manufacturing process and increasing chip complexity. This paper proposes to use the inherent redundancy available in existing and future chip microarchitectures to improve yield and enable graceful performance degradation in fail-in-place systems. We introduce a new yield metric called performance averaged yield (Ypav), which accounts both for fully functional chips and for those that exhibit some performance degradation. Our results indicate that at 250nm we are able to increase the Ypav of a uniprocessor with only redundant rows in its caches from a base value of 85% to 98% using microarchitectural redundancy. Given constant chip area, shrinking feature sizes increase fault susceptibility and reduce the base Ypav to 60% at 50nm, which exploiting microarchitectural redundancy then increases to 99.6%.
Article
Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme — reference counting. We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders. We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables.
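A minimal sketch of the bit-matrix idea described above, with illustrative sizes and a GCC/Clang builtin standing in for the priority encoder:

```c
#include <stdint.h>

#define NUM_REGS  128              /* illustrative: physical registers (columns) */
#define NUM_ROWS  64               /* illustrative: entities that can hold a register */
#define REG_WORDS (NUM_REGS / 64)  /* 64 register columns packed per word */

/* ref[r][w] bit i set: row r (e.g. an in-flight instruction) references
 * physical register w*64 + i. */
static uint64_t ref[NUM_ROWS][REG_WORDS];

/* A register is free when no row references it: NOR all rows together. */
void compute_free_list(uint64_t free_vec[REG_WORDS])
{
    for (int w = 0; w < REG_WORDS; w++) {
        uint64_t any = 0;
        for (int r = 0; r < NUM_ROWS; r++)
            any |= ref[r][w];
        free_vec[w] = ~any;        /* NOR: referenced by no one */
    }
}

/* Allocation picks the first set bit of the free vector; the ctz builtin
 * models the priority encoder. The caller then records its reference by
 * setting the bit in its own row. */
int allocate_reg(void)
{
    uint64_t free_vec[REG_WORDS];
    compute_free_list(free_vec);
    for (int w = 0; w < REG_WORDS; w++)
        if (free_vec[w])
            return w * 64 + __builtin_ctzll(free_vec[w]);
    return -1;                     /* no free physical register */
}
```

Incrementing a reference is setting one bit (ref[row][reg / 64] |= 1ull << (reg % 64)) and decrementing is clearing it, so a register frees itself as soon as every row's bit is clear; this is what lets the scheme compose naturally with checkpointing, move elimination, and the other techniques listed above.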
Chapter
The cache hierarchy prevalent in today's high-performance processors has to be taken into account in order to design algorithms that perform well in practice. We advocate adapting external-memory algorithms to this purpose. We exemplify this approach and the practical issues involved by engineering a fast priority queue suited to external memory and cached memory, based on k-way merging. It improves previous external-memory algorithms by constant factors that are crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.
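The k-way merge at the core of such a priority queue is easy to sketch. The version below uses a linear scan over the run heads where the engineered data structure would use a loser tree; all names are illustrative.

```c
#include <stddef.h>

typedef struct {
    const int *data;               /* one sorted (ascending) run */
    size_t     len;
    size_t     pos;                /* next unconsumed element */
} run_t;

/* Repeatedly emit the smallest head among the k runs until the output
 * buffer fills or every run is exhausted; returns elements written. */
size_t kway_merge(run_t *runs, int k, int *out, size_t out_cap)
{
    size_t n = 0;
    while (n < out_cap) {
        int best = -1;
        for (int i = 0; i < k; i++) {
            if (runs[i].pos >= runs[i].len)
                continue;          /* run exhausted */
            if (best < 0 ||
                runs[i].data[runs[i].pos] < runs[best].data[runs[best].pos])
                best = i;
        }
        if (best < 0)
            break;                 /* all runs exhausted */
        out[n++] = runs[best].data[runs[best].pos++];
    }
    return n;
}
```

The cache friendliness comes from the access pattern: each of the k runs is read strictly sequentially, so the merge touches k resident head elements plus streaming data instead of the scattered accesses a binary heap performs.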
Conference Paper
In this paper, we compare automatically optimized codes against hand-optimized codes. The automatically optimized codes have been generated using our own tool, which implements compiler techniques proposed in our previous work. Our compiler techniques focus on applying multilevel tiling to non-rectangular loop nests. This type of loop nest is commonly found in linear algebra algorithms, typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in Dayde and Duff (1998). Results show how compiler technology can make it possible for non-rectangular loop nests to achieve as high performance as hand-optimized codes on modern microprocessors.
Conference Paper
Full-text available
Dynamically trace scheduled VLIW (DTSVLIW) architectures can be used to implement machines that execute code of current RISC or CISC instruction set architectures in a VLIW fashion, delivering instruction-level parallelism (ILP) with backward code compatibility. This paper presents the effect of multicycle instructions on the performance of a DTSVLIW architecture running the SPECint95 benchmarks. The classic approaches to providing ILP are VLIW and superscalar architectures. With VLIW, the compiler is required to extract the parallelism from the program and to build very long instruction words for execution. This leads to fast and (relatively) simple hardware, but puts a heavy responsibility on the compiler, and object-code compatibility (4) is a problem. In superscalar, the extraction of parallelism is done by the hardware, which dynamically schedules the sequential instruction stream onto the functional units. The hardware is much more complex, and therefore slower, than a corresponding VLIW design, and the peak instruction feed into the functional units is lower. A number of approaches (1)(2)(3) have been examined that marry the advantages of the contrasting designs: the superscalar dynamic extraction of ILP and the simplicity of VLIW architectures. The approach presented here follows that first presented by Nair and Hopkins (3). Our architecture, the dynamically trace scheduled VLIW architecture (DTSVLIW) (4), demonstrates similar results to theirs in providing significant parallelism, but with a simpler architecture that should be much easier to implement. In our earlier work (5), we had zero-latency load/store instructions; here we present results with more realistic latencies. The DTSVLIW has two execution engines, the Primary Processor and the VLIW Engine, and two caches for instructions, the Instruction Cache and the VLIW Cache. The Primary Processor, a simple pipelined processor, fetches instructions and performs the first execution of this code. The instruction trace it produces is dynamically scheduled by the Scheduler Unit into VLIW instructions, saved in blocks to the VLIW Cache for re-execution by the VLIW Engine. The Primary Processor, executing the Sparc-7 ISA, provides object-code compatibility; the VLIW Engine provides VLIW performance and simplicity. The Scheduler Unit works in parallel with the Primary Processor, so scheduling does not impact VLIW performance as it does in superscalar.
Conference Paper
Increasing diversity in packet-processing applications and rapid increases in channel bandwidth lead to greater complexity in communication protocols. These factors result in larger computational loads for packet-processing engines, which makes high-performance microprocessor designs an important solution. This paper presents an exhaustive simulation study exploring the performance of instruction-level parallel superscalar processors executing packet-processing applications. Based on the simulation results, a design space exploration has been used to derive a performance-efficient, application-specific superscalar processor architecture based on the MIPS instruction set architecture. The SimpleScalar architecture toolset has been used for the design space exploration, and network applications have been investigated to guide the architecture exploration. The optimizations achieve up to an 80% improvement in performance for representative packet-processing applications.
Conference Paper
Full-text available
Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e. predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques.
Article
Full-text available
Dissertation (Ph.D.)--University of Michigan.
Conference Paper
Existing schemes for cache energy optimization incorporate a limited degree of dynamic associativity: either direct-mapped or the full available associativity (say, 4-way). In this paper, we explore a more general design space for dynamic associativity (for a 4-way associative cache, consider 1-way, 2-way, and 4-way associative accesses). The other major departure is in the associativity control mechanism. We use the actual instruction-level parallelism exhibited by the instructions surrounding a given load to classify it as an IPC-k load (for 1 ≤ k ≤ IW, with an issue width of IW) in a superscalar architecture. The lookup schedule is fixed in advance for each IPC classifier 1 ≤ k ≤ IW. The schedules are as way-disjoint as possible for load/stores with different IPC classifications. The energy savings over the SPEC2000 CPU benchmarks average 28.6% for a 32 kB, 4-way, L1 data cache. The resulting performance (IPC) degradation from the dynamic way schedule is restricted to less than 2.25%, mainly because IPC-based placement ends up being an excellent classifier.
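A sketch of what such a fixed, per-class lookup schedule could look like for a 4-way cache with issue width 4. The table contents are invented for illustration, chosen only so that different IPC classes probe different ways first; they are not the paper's schedules.

```c
#define IW   4                     /* issue width: IPC classes 1..IW */
#define WAYS 4                     /* 4-way set-associative cache */

/* schedule[k-1][step]: bitmask of ways enabled at probe `step` of an
 * IPC-k load; probing stops at the first hit. */
static const unsigned schedule[IW][WAYS] = {
    { 0x1, 0x2, 0x4, 0x8 },        /* IPC 1: one way at a time, way 0 first */
    { 0x2, 0x1, 0x8, 0x4 },        /* IPC 2: disjoint first probe (way 1) */
    { 0x4, 0x8, 0x1, 0x2 },        /* IPC 3: disjoint first probe (way 2) */
    { 0xF, 0x0, 0x0, 0x0 },        /* IPC 4: full 4-way lookup at once */
};

/* Returns the set of ways to enable on probe `step` of an IPC-k load;
 * assumes 1 <= ipc_class <= IW and 0 <= step < WAYS. */
unsigned ways_to_probe(int ipc_class, int step)
{
    return schedule[ipc_class - 1][step];
}
```

Enabling fewer ways per probe is what saves lookup energy; the cost is an occasional extra probe cycle when the first enabled way misses, which is the IPC degradation the abstract bounds at 2.25%.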
Conference Paper
Full-text available
An increasingly large portion of scheduler latency is derived from the monolithic content addressable memory arrays accessed during instruction wake-up. The performance of the scheduler can be improved by decreasing the number of tag comparisons necessary to schedule instructions. Using detailed simulation-based analyses, we find that most instructions enter the window with at least one of their input operands already available. By putting these instructions into specialized windows with fewer tag comparators, load capacitance on the scheduler critical path can be reduced, with only very small effects on program throughput. For instructions with multiple unavailable operands, we introduce a last-tag speculation mechanism that eliminates all remaining tag comparators except those for the last-arriving input operand. By combining these two tag-reduction schemes, we are able to construct dynamic schedulers with approximately one quarter of the tag comparators found in conventional designs. Conservative circuit-level timing analyses indicate that the optimized designs are 20-45% faster and require 10-20% less power, depending on instruction window size.
Conference Paper
Full-text available
Next-generation wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches, and dedicated memory ports while minimizing the cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging, and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512-kByte L2 cache, board-level caches provide significant enough performance gains to justify their higher cost.
Article
Several proposed techniques including CPR (checkpoint processing and recovery) and NoSQ (no store queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although previously described in terms of binary counters, we find that reference counts are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations.
Article
In this paper, we evaluate the performance of high-bandwidth cache organizations employing multiple cache ports, multiple-cycle hit times, and cache port efficiency enhancements, such as load-all and a line buffer, to find the organization that provides the best processor performance. Using a dynamic superscalar processor running realistic benchmarks that include operating system references, we use execution time to measure processor performance. When the cache is limited to a single cache port without enhancements, we find that two cache ports increase processor performance by 25 percent. With the addition of a line buffer and load-all to a single-ported cache, the processor achieves 91 percent of the performance of the same processor containing a cache with two ports. When the processor is not limited to a single cache port, the results show that a large dual-ported multicycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and using a line buffer provide the bandwidth needed by a dynamic superscalar processor. The line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
Article
In this paper, we examine some critical design features of a trace cache fetch engine for a 16-wide-issue processor and evaluate their effects on performance. We evaluate path associativity, partial matching, and inactive issue, all of which are straightforward extensions to the trace cache. We examine features such as the fill unit and branch predictor design. In our final analysis, we show that the trace cache mechanism attains a 28 percent performance improvement over an aggressive single-block fetch mechanism and a 15 percent improvement over a sequential multiblock mechanism.
Article
In high-performance processors, increasing the number of instructions fetched and executed in parallel is becoming increasingly complex, and the peak bandwidth is often underutilized due to control and data dependences. A trace processor 1) efficiently sequences through programs in large units, called traces, and allocates trace-sized units of work to distributed processing elements (PEs), and 2) uses aggressive speculation to partially alleviate the effects of control and data dependences. A trace is a dynamic sequence of instructions, typically 16 to 32 instructions in length, which embeds any number of taken or not-taken branch instructions. The hierarchical, trace-based approach to increasing parallelism overcomes basic inefficiencies of managing fetch and execution resources on an individual instruction basis. This thesis shows the trace processor is a good microarchitecture for implementing wide-issue machines. Three key points support this conclusion. 1. Trace processors perfo...
Article
Full-text available
Vector architectures have long been the architecture of choice for building supercomputers. They first appeared in the early seventies and had a long period of unquestioned dominance from the time the CRAY-1 was introduced in 1976 until the appearance of the "killer micros" in 1991. They still have a foothold in the supercomputer marketplace, although their continued viability, in the face of micro-based parallel systems, is being seriously questioned. We present a brief history of supercomputing and discuss the merits of vector architectures. Then we relate the advantages of vector architectures to current trends in computer system and device technology. Although the viability of vector supercomputers is indeed questionable, largely because of cost issues, we argue that vector architectures have a long future ahead of them, with new applications and commodity implementations. Vector instruction sets have many fundamental advantages and deserve serious consideration for implementation in next-generation computer systems, where graphics and other multimedia applications will abound.