Conference Paper

A 14-port 3.8 ns 116-word 64b read-renaming register file


Abstract

A 116-word×64b register file with ten read ports and four write ports is part of a four-issue superscalar, register-renamed, four-window, V9 SPARC-architecture CPU operating at 154 MHz. Since the register file combines a register-rename function with the register-read operation, the CPU pipeline is one stage shorter than other register-renaming architectures.

... and the HAL Sparc [4]. This figure shows the typical CAM structure of a renaming mechanism plus another bit per entry which we call Future Free bit. ...
Conference Paper
Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.
... An alternate scheme for register renaming uses a CAM (content-addressable memory) [19] to store the current mappings. Such a scheme is implemented in the HAL SPARC [2] and the DEC 21264 [10]. The number of entries in the CAM is equal to the number of physical registers. ...
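The CAM-based scheme in the snippet above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the HAL SPARC or DEC 21264 design: the names `CamEntry`, `CamRenameMap`, `lookup`, and `rename_dest` are hypothetical, the parallel tag compare of a hardware CAM is replaced by a linear scan, and freeing of old physical registers at commit is omitted. The one property taken from the text is that the table has one entry per physical register.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CamEntry:
    logical: Optional[int] = None  # logical register tag stored in this entry
    valid: bool = False            # True if this entry is the current mapping


class CamRenameMap:
    """Toy CAM rename map with one entry per physical register (hypothetical)."""

    def __init__(self, num_physical: int) -> None:
        self.entries: List[CamEntry] = [CamEntry() for _ in range(num_physical)]
        # Physical registers not yet allocated (assumed simple free list).
        self.free: List[int] = list(range(num_physical))

    def lookup(self, logical: int) -> Optional[int]:
        # Hardware compares every entry at once; this model scans instead.
        for phys, entry in enumerate(self.entries):
            if entry.valid and entry.logical == logical:
                return phys
        return None

    def rename_dest(self, logical: int) -> int:
        # Invalidate the previous mapping so only one entry matches the tag.
        old = self.lookup(logical)
        if old is not None:
            self.entries[old].valid = False
        # Raises IndexError when no physical register is free; real hardware
        # would stall instead, and would recycle 'old' once it is safe.
        phys = self.free.pop(0)
        self.entries[phys] = CamEntry(logical=logical, valid=True)
        return phys
```

Renaming the same logical register twice in this model leaves two entries holding the same tag, but only the newer one has its valid bit set, so a source lookup always returns the most recent mapping.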
... In this case, both the FX-mapping table and the merged register file have 10 read and 4 write ports while its FP-counterpart has 6 read and 3 write ports. According to Asato et al. [63], this 14-port 116-word 64-bit merged register file needs 371K transistors, far more than the entire Intel 8086 processor (about 30K transistors) or slightly more than the i386 (about 275K transistors) [64]. ...
Article
Register renaming is a technique to remove false data dependencies, write after read (WAR) and write after write (WAW), that occur in straight-line code between register operands of subsequent instructions. By eliminating related precedence requirements in the execution sequence of the instructions, renaming increases the average number of instructions that are available for parallel execution per cycle. This results in increased IPC (number of instructions executed per cycle). The identification and exploration of the design space of register renaming lead to a comprehensive understanding of this intricate technique. As this article shows, the design space of register renaming is spanned by four main dimensions: the scope of register renaming, the layout of the rename buffers, the method of register mapping, and the rename rate. Relevant aspects of the design space give rise to eight basic alternatives for register renaming. In addition, the kind of operand fetch policy significantly affects how the processor carries out the rename process, which doubles the eight basic alternatives to 16 possible implementation schemes. The article indicates which basic implementation scheme is used in relevant superscalar processors. As register renaming is usually implemented in conjunction with shelving, the underlying microarchitecture is assumed to employ shelving.
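The way renaming removes WAR and WAW dependences in straight-line code can be shown with a short sketch. It assumes a simplified three-operand instruction format `(dest, src1, src2)` and an unlimited supply of fresh physical names `p0, p1, ...`; the helpers `rename` and `false_dependences` are hypothetical illustrations and do not correspond to any of the implementation schemes the article enumerates.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str]  # (dest, src1, src2), an assumed toy format


def rename(program: List[Instr]) -> List[Instr]:
    """Give every destination a fresh physical name; sources follow the map."""
    mapping: Dict[str, str] = {}
    renamed: List[Instr] = []
    next_phys = 0
    for dest, s1, s2 in program:
        # Sources read the latest mapping; unmapped names are live-in values.
        r1 = mapping.get(s1, s1)
        r2 = mapping.get(s2, s2)
        new_dest = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = new_dest
        renamed.append((new_dest, r1, r2))
    return renamed


def false_dependences(program: List[Instr]) -> int:
    """Count WAW and WAR pairs between each instruction and every later one."""
    count = 0
    for i, (d_i, a_i, b_i) in enumerate(program):
        for d_j, _, _ in program[i + 1:]:
            if d_j == d_i:          # WAW: later write to the same register
                count += 1
            if d_j in (a_i, b_i):   # WAR: later write to a register read here
                count += 1
    return count


program = [("r1", "r2", "r3"), ("r2", "r1", "r4"), ("r1", "r5", "r6")]
renamed = rename(program)
```

In this example the original sequence has two WAR pairs and one WAW pair, and the renamed sequence has none, while the true (read after write) dependence of the second instruction on the first is preserved through `p0`.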
... Bit lines are connected to the cell in a vertical direction, and a differential bit line scheme is used. A single bit line scheme for read is widely used in a multiport memory cell [7]-[9], but a differential scheme is selected because ECRL uses differential signals [6] and recovers the charge of bit line. ...
Article
A 32×32-b adiabatic register file with one read port and one write port is designed. A four-phase clock generator is also designed to provide supply clocks for adiabatic circuits. All the word line and bit line charge on the capacitive interconnections is recovered to save energy. Adiabatic circuits are based on efficient charge recovery logic (ECRL) and are integrated using 0.8 μm complementary metal-oxide-semiconductor (CMOS) technology. Measurement results show that power consumption of the core is significantly reduced by a factor of up to 3.5 compared with a conventional circuit.
... An alternative scheme for register renaming uses a CAM (content-addressable memory [32]) to store the current mappings. Such a scheme is implemented in the HAL SPARC [2] and the DEC 21264 [18]. The number of entries in the CAM is equal to the number of physical registers. ...
Article
To characterize future performance limitations of superscalar processors, the delays of key pipeline structures in superscalar processors are studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for several feature sizes.
Conference Paper
The implementation of a superscalar, speculative execution SPARC-V9 microprocessor incorporating restricted data flow principles required many design trade-offs. Consideration was given to both performance and cost. Performance is largely a function of cycle time and instructions executed per cycle while cost is primarily a function of die area. Here we describe our restricted data flow implementation and the means by which we arrived at its configuration. Future semiconductor technology advances will allow these trade-offs to be relaxed and higher performance restricted data flow machines to be built.
Conference Paper
The HaL PM1 CPU is the first implementation of the 64-bit SPARC Version 9 instruction set architecture. The processor utilizes superscalar instruction issue, register renaming, and a dataflow model of execution. Instructions can complete out-of-order and are later committed in order. The PM1 CPU maintains precise state. The processor has a higher level of reliability than is currently available in desktop computers for the commercial marketplace.
Article
We report the first implementation of the new SPARC V9 64-b instruction set architecture. The HaL processor called SPARC64 is a ceramic Multi-Chip Module (MCM) that contains one CPU chip, one Memory Management Unit (MMU) chip, and four 64 KB Cache chips. Together, they implement a unique three-level address translation scheme that efficiently supports using virtual addresses spread anywhere in the full 64-b address range. The processor assigns a serial number to each issued instruction to track up to 64 in-progress instructions and can speculatively issue through up to 16 branches. It issues up to 4 instructions per cycle and utilizes superscalar instruction issue, register renaming, and dataflow (potentially out-of-order) execution to fully exploit instruction-level parallelism. The processor maintains a precise-state execution model, and commits in-order, up to 9 instructions in a cycle. In a HaL R1 system, a production SPARC64 running at 143 MHz has a performance of 230 SPECint92 and 300 SPECfp92 and dissipates 50 W from a 3.3 V supply.
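The serial-number tracking and in-order commit described in this abstract can be sketched as a small model. The limits of 64 in-progress instructions and 9 commits per cycle come from the abstract; everything else is an assumption made for illustration, including the class name `InFlightWindow`, its methods, and the use of unbounded serial numbers (hardware would presumably reuse a fixed range).

```python
from collections import deque
from typing import Deque, Dict, List

MAX_IN_FLIGHT = 64  # in-progress instructions tracked (from the abstract)
COMMIT_WIDTH = 9    # maximum in-order commits per cycle (from the abstract)


class InFlightWindow:
    """Toy model: out-of-order completion, in-order commit (hypothetical)."""

    def __init__(self) -> None:
        self.order: Deque[int] = deque()  # serial numbers in issue order
        self.done: Dict[int, bool] = {}   # completion flag per serial number
        self.next_serial = 0

    def issue(self) -> int:
        if len(self.order) >= MAX_IN_FLIGHT:
            raise RuntimeError("window full: issue must stall")
        serial = self.next_serial
        self.next_serial += 1
        self.order.append(serial)
        self.done[serial] = False
        return serial

    def complete(self, serial: int) -> None:
        # Execution may finish in any order.
        self.done[serial] = True

    def commit_cycle(self) -> List[int]:
        # Retire completed instructions from the oldest end only, so an
        # unfinished older instruction blocks all younger ones.
        committed: List[int] = []
        while (self.order and len(committed) < COMMIT_WIDTH
               and self.done[self.order[0]]):
            serial = self.order.popleft()
            del self.done[serial]
            committed.append(serial)
        return committed
```

Committing only from the oldest end is what keeps the architectural state precise in this model: a younger instruction that has already completed stays in the window until everything older has retired.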
Article
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for several feature sizes. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future.
Conference Paper
This processor is the first implementation of the SPARC V9 64b instruction set architecture and has an estimated performance exceeding 256 SPECint92 and 330 SPECfp92 at 154 MHz. The R1 processor consists of one CPU chip, one memory management unit (MMU), four cache chips, and one clock chip mounted on a ceramic multi-chip module (MCM). The processor utilizes superscalar instruction issue, register renaming, and data flow execution to exploit instruction-level parallelism.