Conference Paper

Architectural effects on dual instruction issue with interlock collapsing ALUs


Abstract

The authors present an evaluation of an innovative interlock collapsing arithmetic logic unit (ALU) in combination with several dual instruction issue processor organizations for two very different example architectures, IBM S/370 and MIPS R2000. The interlock collapsing ALU collapses execution interlocks between some integer operations as well as between address generation operations, without increasing the cycle time of the base machine. This allows two execution-dependent ALU instructions to run in parallel in a single cycle instead of being executed sequentially. Results demonstrate that the overall contribution of the various processor organization design alternatives to the increase in instruction-level parallelism is remarkably similar for the two example processors, considering how different the architectures are, while the contribution of the individual design alternatives varies.
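
As a rough illustration of the interlock collapsing idea (a sketch added here, not code from the paper), the fragment below contrasts serial execution of two dependent additions on a conventional 2-1 ALU with a single collapsed three-operand evaluation. The register names and the one-cycle framing are illustrative assumptions.

```c
/* Minimal sketch (not from the paper): two dependent ALU instructions,
 *   r1 = r2 + r3        (first instruction)
 *   r4 = r1 + r5        (second instruction depends on r1)
 * executed serially vs. "collapsed" into one three-operand operation.
 * Function and variable names are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Conventional 2-1 ALU: two cycles, the second waits on the first. */
static uint32_t serial_add(uint32_t r2, uint32_t r3, uint32_t r5) {
    uint32_t r1 = r2 + r3;   /* cycle 1 */
    return r1 + r5;          /* cycle 2 (interlock on r1) */
}

/* Hypothetical 3-1 interlock collapsing ALU: one cycle, three operands. */
static uint32_t collapsed_add(uint32_t r2, uint32_t r3, uint32_t r5) {
    return r2 + r3 + r5;     /* single pass through a 3-input ALU */
}

int main(void) {
    assert(serial_add(7, 9, 4) == collapsed_add(7, 9, 4));
    printf("collapsed result: %u\n", collapsed_add(7, 9, 4));
    return 0;
}
```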


... For instruction compression, trace filtering was used in [8]. It works by discarding parts of the ITF that have little impact on the simulation. ...
Conference Paper
Full-text available
Testing the performance of a new computational component is costly due to the need to prototype different setups. Therefore, trace-driven hardware simulations are used. Instruction Trace Files (ITFs) are files containing traces of the instructions executed in a program's run and are used as input for hardware simulations. ITFs tend to be large, which creates a storage challenge. Many trace reduction techniques exist to deal with the ITFs' storage challenge. In this paper we introduce ITFComp, a compression algorithm that combines general purpose compression methods with knowledge about the structure of ARM architecture ITFs to reduce their size. ITFComp also compresses the data memory addresses accessed by instructions within ITFs to further reduce their size. Results show a reduction of 600 times on average when combined with the LZMA compression algorithm. This reduction is 4 times better than when using LZMA alone, and 10 times better than when using DEFLATE. ITFComp introduces a negligible overhead in the decompression time (less than 1%).
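
A minimal sketch of the general principle behind such trace compressors (not the ITFComp algorithm itself, whose details are not given here): exploit the trace's structure by delta-encoding successive instruction addresses, so that a general-purpose compressor such as LZMA, applied afterwards, sees long runs of small repeated values. All names below are illustrative.

```c
/* Sketch: replace each traced instruction address with its difference
 * from the previous one; sequential code then yields mostly small,
 * highly repetitive deltas that compress very well.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static void delta_encode(uint64_t *addrs, size_t n) {
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t cur = addrs[i];
        addrs[i] = cur - prev;   /* straight-line code gives deltas of +4 */
        prev = cur;
    }
}

int main(void) {
    uint64_t trace[] = {0x1000, 0x1004, 0x1008, 0x1100, 0x1104};
    delta_encode(trace, 5);
    for (size_t i = 0; i < 5; i++)
        printf("%llx ", (unsigned long long)trace[i]);
    printf("\n");   /* prints: 1000 4 4 f8 4 */
    return 0;
}
```
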
... We simulated a two-way compounding scheme, maximum compounding of two instructions, to evaluate whether such a scheme holds promise for potential computer implementations. Other evaluations, including the parallelism increase and the architectural effects of RISC and CISC on parallelism using interlock collapsing ALUs, are reported elsewhere [36,37]. Still other evaluations are entirely possible. ...
Article
In this paper we describe a machine organization suitable for RISC and CISC architectures. The proposed organization reduces hardware complexity in parallel instruction fetch and issue logic by minimizing possible increases in cycle time caused by parallel instruction issue decisions in the instruction buffer. Furthermore, it improves instruction-level parallelism by means of special features. The improvements are achieved by analyzing instruction sequences and deciding which instructions will issue and execute in parallel prior to actual instruction fetch and issue, by incorporating preprocessed information for parallel issue and execution of instructions in the cache, by categorizing instructions for parallel issue and execution on the basis of hardware utilization rather than opcode description, by attempting to avoid memory interlocks through the preprocessing mechanism, and by eliminating execution interlocks with specialized hardware.
... The longer the two instructions can travel together in the pipeline, the fewer resources are needed and the less power is consumed. In particular, in cases where we can build an execution unit that can execute dependent fused instructions together, we can reduce the program critical path and gain performance [38]. ...
Article
Full-text available
In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up this pace. New process technology requires more expensive megafabs, and new performance levels require larger die, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture has to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges.
Article
A 32-bit 3-1 interlock collapsing ALU, proposed to allow the execution of two interlocked ALU-type instructions in one machine cycle using an instruction-level parallel machine implementation, is shown to produce results equivalent to a serial execution of the instructions using a 2-1 ALU. The equivalence is shown by deriving tables which represent all possible requirements for the serial execution of the instructions followed by the generalization of the table to represent sets of instructions rather than the individual instructions themselves. Consequently, the equivalence of the 3-1 interlock collapsing ALU operations with these generalized requirements of the serial execution of the instructions is shown. The correctness of a proposed high-speed interlock collapsing ALU is thereby demonstrated.
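
In the spirit of the equivalence argument above, the toy check below compares serial 2-1 ALU execution of one interlocked pair (add followed by subtract) against a single collapsed three-operand evaluation, exhaustively over all 8-bit operands. It is an illustration of the claim, not the paper's derivation.

```c
/* Exhaustive 8-bit check: two's-complement (modular) arithmetic makes the
 * serial and the collapsed orderings agree bit-for-bit for this pair.
 */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            for (unsigned c = 0; c < 256; c++) {
                uint8_t serial    = (uint8_t)((uint8_t)(a + b) - c); /* 2-1 ALU, two steps */
                uint8_t collapsed = (uint8_t)(a + b - c);            /* 3-1 ALU, one step  */
                if (serial != collapsed) {
                    printf("mismatch at %u %u %u\n", a, b, c);
                    return 1;
                }
            }
    printf("serial and collapsed results agree for all 8-bit operands\n");
    return 0;
}
```
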
Conference Paper
A study was initiated that investigated detractors to parallelism and implementation constraints associated with the critical paths in the design of fine grain parallel machines. The outcome of the research is a new machine organization that facilitates parallel instruction issue by minimizing possible increases in cycle time and by improving instruction-level parallelism using specialized hardware. The authors describe the attributes of the proposed machine organization related to the analysis of instruction sequences for parallel issue and execution. They also describe the permanent preprocessing in the cache that allows instructions for parallel execution to be determined prior to instruction fetch and issue.
Article
Full-text available
Methods of achieving look-ahead in processing units are discussed. An optimality criterion is proposed, and several schemes are compared against the optimum under varying assumptions. These schemes include existing and proposed machine organizations, and theoretical treatments not mentioned before in this context. The problems of eliminating associative searches in the processor control and the handling of loop-forming decisions are also considered. The inherent limitations of such processors are discussed. Finally, a number of enhancements to look-ahead processors are qualitatively surveyed.
Article
Growing interest in ambitious multiple-issue machines and heavily pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common.
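
A minimal sketch of the kind of trace-based parallelism measurement described above, under simplifying assumptions (unit latency, unlimited issue width, no memory or control dependences): each instruction is scheduled one cycle after its latest-producing source, and the available parallelism is the instruction count divided by the dependence-chain depth. The tiny trace and register encoding are made up for illustration.

```c
/* Estimate available ILP from a register-dependence trace. */
#include <stdio.h>

#define NREGS 8

struct instr { int dst, src1, src2; };   /* -1 means "no register" */

int main(void) {
    struct instr trace[] = {
        {1, 2, 3},   /* r1 = r2 op r3                           */
        {4, 1, 5},   /* r4 = r1 op r5   (depends on instr 0)    */
        {6, 2, 3},   /* r6 = r2 op r3   (independent)           */
        {7, 4, 6},   /* r7 = r4 op r6   (depends on 1 and 2)    */
    };
    int n = sizeof trace / sizeof trace[0];
    int ready[NREGS] = {0};              /* cycle each register becomes available */
    int depth = 0;

    for (int i = 0; i < n; i++) {
        int t1 = trace[i].src1 >= 0 ? ready[trace[i].src1] : 0;
        int t2 = trace[i].src2 >= 0 ? ready[trace[i].src2] : 0;
        int issue = (t1 > t2 ? t1 : t2) + 1;   /* one cycle after last source */
        ready[trace[i].dst] = issue;
        if (issue > depth) depth = issue;
    }
    printf("ILP estimate: %.2f instructions/cycle\n", (double)n / depth);
    return 0;
}
```
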
Article
Neural networks "compute," though not in the way that traditional computers do. It is necessary to accept their weaknesses in order to use their strengths. We discuss some of the assumptions and constraints that govern the operation of neural nets, describe one particular ...
Article
In this paper, we describe some of the attributes of the SCISM organization, a multiple instruction-issuing machine, the outcome of five years of research at the IBM Glendale Laboratory, in Endicott, New York. The proposed organization embodies a number of mechanisms, including the analysis of instruction sequences and deciding which instructions will execute in parallel prior to instruction fetch and issue, the incorporation of permanent preprocessing of instructions to be executed in parallel, the categorization of instructions for parallel execution on the basis of hardware utilization rather than opcode description, the avoidance of memory interlocks through the preprocessing mechanism, and the elimination of execution interlocks with specialized hardware. It is shown that by incorporating these mechanisms, a SCISM capable of issuing and executing two instructions per cycle can achieve more than 90% of the theoretical maximum performance of an idealized, dual instruction issue superscalar machine.
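
The sketch below illustrates the flavor of such a pre-decode compounding step: adjacent instructions are classified by the hardware they use and pairs are tagged for dual issue. The categories and the pairing rule here are assumptions made for illustration, not the SCISM rules.

```c
/* Illustrative pre-decode pass: classify instructions by functional unit
 * and tag adjacent pairs that could be issued together; the tags could
 * then be stored alongside the cache line.
 */
#include <stdio.h>

enum unit { ALU, LOADSTORE, BRANCH };

struct instr { const char *mnemonic; enum unit unit; };

/* Hypothetical pairing rule: compound if the instructions use different
 * functional units, or if both are ALU ops (a 3-1 collapsing ALU could
 * absorb the interlock). */
static int can_compound(struct instr a, struct instr b) {
    return a.unit != b.unit || (a.unit == ALU && b.unit == ALU);
}

int main(void) {
    struct instr stream[] = {
        {"add",  ALU},       {"sub",  ALU},
        {"load", LOADSTORE}, {"add",  ALU},
        {"load", LOADSTORE}, {"load", LOADSTORE},
    };
    int n = sizeof stream / sizeof stream[0];
    for (int i = 0; i + 1 < n; i += 2)
        printf("%-5s %-5s -> %s\n", stream[i].mnemonic, stream[i + 1].mnemonic,
               can_compound(stream[i], stream[i + 1]) ? "compound" : "issue serially");
    return 0;
}
```
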
Article
An innovative technique has been developed that permits the collapsing of execution interlocks between integer ALU operations as well as between address generation operations, allowing parallel execution of two instructions, having true dependencies, in a single cycle. Given that the proposed scheme has been shown not to increase the machine cycle time, it potentially provides an attractive means for increasing the instruction-level parallelism. Preliminary results show that within the basic blocks, the geometric mean of the speedup from this new design technique is up to 10% in the integer SPEC benchmarks. The geometric mean of the speedup including floating point benchmarks is up to 6%. The results also suggest that depending on the application environment this new design may be used as an alternative to the relatively more expensive out-of-order instruction issue approach.
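
For reference, a geometric-mean speedup like the figures quoted above is computed as the n-th root of the product of the per-benchmark speedups; the numbers in the sketch below are made up for illustration.

```c
/* Compute a geometric-mean speedup from hypothetical per-benchmark figures. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double speedup[] = {1.12, 1.08, 1.05, 1.15};   /* hypothetical per-benchmark speedups */
    int n = sizeof speedup / sizeof speedup[0];
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(speedup[i]);                /* average in the log domain */
    printf("geometric mean speedup: %.3f\n", exp(log_sum / n));
    return 0;
}
```
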
Article
The IBM RISC System/6000 processor is a second-generation RISC processor which reduces the execution pipeline penalties caused by branch instructions and also provides high floating-point performance. It employs multiple functional units which operate concurrently to maximize the instruction execution rate. By employing these advanced machine-organization techniques, it can execute up to four instructions simultaneously. Approximately 11 MFLOPS are achieved on the LINPACK benchmarks.
Conference Paper
The authors model the execution of the SPEC benchmarks under differing resource constraints. They repeat the work of the previous researchers, and under the hardware resource constraints imposed, similar results are obtained. On the other hand, when all constraints are removed except those required by the semantics of the program, degrees of parallelism in excess of 17 instructions per cycle are found. Finally, it is shown that if the hardware is properly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a processor that is reasonable to design today.
Conference Paper
Execution dependence between sequential instructions is one of the factors that limits the level of parallelism which can be exploited by fine grain parallel machines. Several architectural, compiler and machine organization techniques that have been used to alleviate this restriction are examined. They are compared against a relatively new mechanism that simply eliminates the execution dependency. The dependency elimination is achieved by using a novel integer arithmetic logic unit (ALU) design, which performs arithmetic and logical operations on three operands in a single cycle, but without extending the cycle time of the base machine
Conference Paper
The architecture for issuing multiple instructions per clock in the NonStop Cyclone processor is described. Pairs of instructions are fetched and decoded by a dual two-stage prefetch pipeline and passed to a dual six-stage pipeline for execution. Dynamic branch prediction is used to reduce branch penalties. A unique microcode routine for each pair is stored in the large duplexed control store. The microcode controls parallel data paths optimized for executing the most frequent instruction pairs. Other features of the architecture include cache support for unaligned double-precision accesses, a virtually addressed main memory, and a novel precise exception mechanism
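
As a generic illustration of dynamic branch prediction (a common two-bit saturating-counter scheme, not a description of the Cyclone hardware), the sketch below predicts a branch taken when its counter is in one of the two upper states and nudges the counter toward the actual outcome after each resolution.

```c
/* Two-bit saturating-counter branch predictor: states 0-1 predict
 * not-taken, states 2-3 predict taken.
 */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024

static uint8_t counters[TABLE_SIZE];   /* all zero: strongly not-taken */

static int predict(uint32_t pc) {
    return counters[pc % TABLE_SIZE] >= 2;          /* predict taken? */
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    uint32_t pc = 0x4000;
    int outcomes[] = {1, 1, 1, 0, 1, 1};   /* a mostly-taken loop branch */
    int hits = 0, n = sizeof outcomes / sizeof outcomes[0];
    for (int i = 0; i < n; i++) {
        hits += (predict(pc) == outcomes[i]);
        update(pc, outcomes[i]);
    }
    printf("correct predictions: %d/%d\n", hits, n);
    return 0;
}
```
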
Article
A device capable of executing interlocked fixed point arithmetic logic unit (ALU) instructions in parallel with other instructions causing the execution interlock is presented. The device incorporates the design of a 3-1 ALU and can execute two's complement, unsigned binary, and binary logical operations. It is shown that status for ALU operations using a 3-1 ALU can be determined in a parallel fashion, resulting in the compliance of the proposed device with predetermined architectural behavior of single instruction execution. The device requires no more logic stages than does a 3-1 binary adder using a carry-save adder (CSA) followed by a carry-lookahead adder (CLA) design. Design considerations using a commonly available CMOS technology are also reported, indicating that the device will not increase the machine cycle of an implementation. It is suggested that the device can maintain full architectural compatibility
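
A bit-level sketch of the three-operand addition structure the abstract alludes to: a carry-save stage reduces the three inputs to a sum vector and a carry vector, and a single carry-propagate addition (standing in here for the carry-lookahead adder) produces the final result. This illustrates the arithmetic only, not the device's logic design.

```c
/* Add three 32-bit operands via a carry-save reduction followed by one
 * carry-propagate add (the role played by the CLA in hardware).
 */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

static uint32_t add3_csa(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t sum   = a ^ b ^ c;                      /* per-bit sum       */
    uint32_t carry = (a & b) | (a & c) | (b & c);    /* per-bit carry out */
    return sum + (carry << 1);                       /* final adder stage */
}

int main(void) {
    assert(add3_csa(100, 200, 300) == 600);
    assert(add3_csa(0xFFFFFFFFu, 1, 1) == 1);        /* wraps modulo 2^32 */
    printf("3-1 CSA + carry-propagate addition checks passed\n");
    return 0;
}
```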