Conference Paper

Architectural effects on dual instruction issue with interlock collapsing ALUs


Abstract

The authors present an evaluation of an innovative interlock collapsing arithmetic logic unit (ALU) in combination with several dual instruction issue processor organizations for two very different example architectures, IBM S/370 and MIPS R2000. The interlock collapsing ALU collapses execution interlocks between some integer operations as well as between address generation operations, without increasing the cycle time of the base machine. This allows two execution-dependent ALU instructions to run in parallel in a single cycle instead of being executed sequentially. Results demonstrate that the overall contribution of the various processor organization design alternatives to the increase in instruction-level parallelism is remarkably similar for the two example processors, considering how different the architectures are, while the contribution of the individual design alternatives varies.
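
As a rough illustration of the interlock collapsing idea (a sketch added here, not code from the paper), the fragment below contrasts serial execution of two dependent additions on a conventional 2-1 ALU with a single collapsed three-operand evaluation. The register names and the one-cycle framing are illustrative assumptions.

```c
/* Minimal sketch (not from the paper): two dependent ALU instructions,
 *   r1 = r2 + r3        (first instruction)
 *   r4 = r1 + r5        (second instruction depends on r1)
 * executed serially vs. "collapsed" into one three-operand operation.
 * Function and variable names are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Conventional 2-1 ALU: two cycles, the second waits on the first. */
static uint32_t serial_add(uint32_t r2, uint32_t r3, uint32_t r5) {
    uint32_t r1 = r2 + r3;   /* cycle 1 */
    return r1 + r5;          /* cycle 2 (interlock on r1) */
}

/* Hypothetical 3-1 interlock collapsing ALU: one cycle, three operands. */
static uint32_t collapsed_add(uint32_t r2, uint32_t r3, uint32_t r5) {
    return r2 + r3 + r5;     /* single pass through a 3-input ALU */
}

int main(void) {
    assert(serial_add(7, 9, 4) == collapsed_add(7, 9, 4));
    printf("collapsed result: %u\n", collapsed_add(7, 9, 4));
    return 0;
}
```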


... For instruction compression, trace filtering was used in [8]. It works by discarding parts of the ITF that have little impact on the simulation. ...
Conference Paper
Full-text available
Testing the performance of a new computational component is costly due to the need to prototype different setups. Therefore, trace-driven hardware simulations are used. Instruction Trace Files (ITFs) are files containing traces of the instructions executed in a program's run and are used as input for hardware simulations. ITFs tend to be large, which creates a storage challenge. Many trace reduction techniques exist to deal with the ITFs' storage challenge. In this paper we introduce ITFComp, a compression algorithm that combines general purpose compression methods with knowledge about the structure of ARM architecture ITFs to reduce their size. ITFComp also compresses the data memory addresses accessed by instructions within ITFs to further reduce their size. Results show a reduction of 600 times on average when combined with the LZMA compression algorithm. This reduction is 4 times better than when using LZMA alone, and 10 times better than when using DEFLATE. ITFComp introduces a negligible overhead in the decompression time (less than 1%).
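
A minimal sketch of the general principle behind such trace compressors (not the ITFComp algorithm itself, whose details are not given here): exploit the trace's structure by delta-encoding successive instruction addresses, so that a general-purpose compressor such as LZMA, applied afterwards, sees long runs of small repeated values. All names below are illustrative.

```c
/* Sketch: replace each traced instruction address with its difference
 * from the previous one; sequential code then yields mostly small,
 * highly repetitive deltas that compress very well.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static void delta_encode(uint64_t *addrs, size_t n) {
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t cur = addrs[i];
        addrs[i] = cur - prev;   /* straight-line code gives deltas of +4 */
        prev = cur;
    }
}

int main(void) {
    uint64_t trace[] = {0x1000, 0x1004, 0x1008, 0x1100, 0x1104};
    delta_encode(trace, 5);
    for (size_t i = 0; i < 5; i++)
        printf("%llx ", (unsigned long long)trace[i]);
    printf("\n");   /* prints: 1000 4 4 f8 4 */
    return 0;
}
```
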
... We simulated a two-way compounding scheme, maximum compounding of two instructions, to evaluate whether such a scheme holds promise for potential computer implementations. Other evaluations, including the parallelism increase and the architectural effects of RISC and CISC on parallelism using interlock collapsing ALUs, are reported elsewhere [36,37]. Still other evaluations are entirely possible. ...
Article
In this paper we describe a machine organization suitable for RISC and CISC architectures. The proposed organization reduces hardware complexity in parallel instruction fetch and issue logic by minimizing possible increases in cycle time caused by parallel instruction issue decisions in the instruction buffer. Furthermore, it improves instruction-level parallelism by means of special features. The improvements are achieved by analyzing instruction sequences and deciding which instructions will issue and execute in parallel prior to actual instruction fetch and issue, by incorporating preprocessed information for parallel issue and execution of instructions in the cache, by categorizing instructions for parallel issue and execution on the basis of hardware utilization rather than opcode description, by attempting to avoid memory interlocks through the preprocessing mechanism, and by eliminating execution interlocks with specialized hardware.
... The longer the two instructions can travel together in the pipeline, the fewer resources are needed and the less power is consumed. In particular, in cases where we can build an execution unit that can execute dependent fused instructions together, we can reduce the program critical path and gain performance [38]. ...
Article
Full-text available
In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up this pace. New process technology requires more expensive megafabs, and new performance levels require larger die, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture has to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges.
Article
A 32-bit 3-1 interlock collapsing ALU, proposed to allow the execution of two interlocked ALU-type instructions in one machine cycle using an instruction-level parallel machine implementation, is shown to produce results equivalent to a serial execution of the instructions using a 2-1 ALU. The equivalence is shown by deriving tables which represent all possible requirements for the serial execution of the instructions followed by the generalization of the table to represent sets of instructions rather than the individual instructions themselves. Consequently, the equivalence of the 3-1 interlock collapsing ALU operations with these generalized requirements of the serial execution of the instructions is shown. The correctness of a proposed high-speed interlock collapsing ALU is thereby demonstrated.
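
In the spirit of the equivalence argument above, the toy check below compares serial 2-1 ALU execution of one interlocked pair (add followed by subtract) against a single collapsed three-operand evaluation, exhaustively over all 8-bit operands. It is an illustration of the claim, not the paper's derivation.

```c
/* Exhaustive 8-bit check: two's-complement (modular) arithmetic makes the
 * serial and the collapsed orderings agree bit-for-bit for this pair.
 */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            for (unsigned c = 0; c < 256; c++) {
                uint8_t serial    = (uint8_t)((uint8_t)(a + b) - c); /* 2-1 ALU, two steps */
                uint8_t collapsed = (uint8_t)(a + b - c);            /* 3-1 ALU, one step  */
                if (serial != collapsed) {
                    printf("mismatch at %u %u %u\n", a, b, c);
                    return 1;
                }
            }
    printf("serial and collapsed results agree for all 8-bit operands\n");
    return 0;
}
```
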
Conference Paper
A study was initiated that investigated detractors to parallelism and implementation constraints associated with the critical paths in the design of fine grain parallel machines. The outcome of the research is a new machine organization that facilitates parallel instruction issue by minimizing possible increases in cycle time and by improving instruction-level parallelism using specialized hardware. The authors describe the attributes of the proposed machine organization related to the analysis of instruction sequences for parallel issue and execution. They also describe the permanent preprocessing in the cache that allows instructions for parallel execution to be determined prior to instruction fetch and issue.
Article
Full-text available
Methods of achieving look-ahead in processing units are discussed. An optimality criterion is proposed, and several schemes are compared against the optimum under varying assumptions. These schemes include existing and proposed machine organizations, and theoretical treatments not mentioned before in this context. The problems of eliminating associative searches in the processor control and the handling of loop-forming decisions are also considered. The inherent limitations of such processors are discussed. Finally, a number of enhancements to look-ahead processors are qualitatively surveyed.
Article
Growing interest in ambitious multiple-issue machines and heavily pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common.
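
A minimal sketch of the kind of trace-based parallelism measurement described above, under simplifying assumptions (unit latency, unlimited issue width, no memory or control dependences): each instruction is scheduled one cycle after its latest-producing source, and the available parallelism is the instruction count divided by the dependence-chain depth. The tiny trace and register encoding are made up for illustration.

```c
/* Estimate available ILP from a register-dependence trace. */
#include <stdio.h>

#define NREGS 8

struct instr { int dst, src1, src2; };   /* -1 means "no register" */

int main(void) {
    struct instr trace[] = {
        {1, 2, 3},   /* r1 = r2 op r3                           */
        {4, 1, 5},   /* r4 = r1 op r5   (depends on instr 0)    */
        {6, 2, 3},   /* r6 = r2 op r3   (independent)           */
        {7, 4, 6},   /* r7 = r4 op r6   (depends on 1 and 2)    */
    };
    int n = sizeof trace / sizeof trace[0];
    int ready[NREGS] = {0};              /* cycle each register becomes available */
    int depth = 0;

    for (int i = 0; i < n; i++) {
        int t1 = trace[i].src1 >= 0 ? ready[trace[i].src1] : 0;
        int t2 = trace[i].src2 >= 0 ? ready[trace[i].src2] : 0;
        int issue = (t1 > t2 ? t1 : t2) + 1;   /* one cycle after last source */
        ready[trace[i].dst] = issue;
        if (issue > depth) depth = issue;
    }
    printf("ILP estimate: %.2f instructions/cycle\n", (double)n / depth);
    return 0;
}
```
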
Article
Neural networks "compute," though not in the way that traditional computers do. It is necessary to accept their weaknesses in order to use their strengths. We discuss some of the assumptions and constraints that govern the operation of neural nets, describe one particular ...
Article
In this paper, we describe some of the attributes of the SCISM organization, a multiple instruction-issuing machine, the outcome of five years of research at the IBM Glendale Laboratory, in Endicott, New York. The proposed organization embodies a number of mechanisms, including the analysis of instruction sequences and deciding which instructions will execute in parallel prior to instruction fetch and issue, the incorporation of permanent preprocessing of instructions to be executed in parallel, the categorization of instructions for parallel execution on the basis of hardware utilization rather than opcode description, the avoidance of memory interlocks through the preprocessing mechanism, and the elimination of execution interlocks with specialized hardware. It is shown that by incorporating these mechanisms, a SCISM capable of issuing and executing two instructions per cycle can achieve more than 90% of the theoretical maximum performance of an idealized, dual instruction issue superscalar machine.
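
The sketch below illustrates the flavor of such a pre-decode compounding step: adjacent instructions are classified by the hardware they use and pairs are tagged for dual issue. The categories and the pairing rule here are assumptions made for illustration, not the SCISM rules.

```c
/* Illustrative pre-decode pass: classify instructions by functional unit
 * and tag adjacent pairs that could be issued together; the tags could
 * then be stored alongside the cache line.
 */
#include <stdio.h>

enum unit { ALU, LOADSTORE, BRANCH };

struct instr { const char *mnemonic; enum unit unit; };

/* Hypothetical pairing rule: compound if the instructions use different
 * functional units, or if both are ALU ops (a 3-1 collapsing ALU could
 * absorb the interlock). */
static int can_compound(struct instr a, struct instr b) {
    return a.unit != b.unit || (a.unit == ALU && b.unit == ALU);
}

int main(void) {
    struct instr stream[] = {
        {"add",  ALU},       {"sub",  ALU},
        {"load", LOADSTORE}, {"add",  ALU},
        {"load", LOADSTORE}, {"load", LOADSTORE},
    };
    int n = sizeof stream / sizeof stream[0];
    for (int i = 0; i + 1 < n; i += 2)
        printf("%-5s %-5s -> %s\n", stream[i].mnemonic, stream[i + 1].mnemonic,
               can_compound(stream[i], stream[i + 1]) ? "compound" : "issue serially");
    return 0;
}
```
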
Article
An innovative technique has been developed that permits the collapsing of execution interlocks between integer ALU operations as well as between address generation operations, allowing parallel execution of two instructions, having true dependencies, in a single cycle. Given that the proposed scheme has been shown not to increase the machine cycle time, it potentially provides an attractive means for increasing the instruction-level parallelism. Preliminary results show that within the basic blocks, the geometric mean of the speedup from this new design technique is up to 10% in the integer SPEC benchmarks. The geometric mean of the speedup including floating point benchmarks is up to 6%. The results also suggest that depending on the application environment this new design may be used as an alternative to the relatively more expensive out-of-order instruction issue approach.
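
For reference, a geometric-mean speedup like the figures quoted above is computed as the n-th root of the product of the per-benchmark speedups; the numbers in the sketch below are made up for illustration.

```c
/* Compute a geometric-mean speedup from hypothetical per-benchmark figures. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double speedup[] = {1.12, 1.08, 1.05, 1.15};   /* hypothetical per-benchmark speedups */
    int n = sizeof speedup / sizeof speedup[0];
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(speedup[i]);                /* average in the log domain */
    printf("geometric mean speedup: %.3f\n", exp(log_sum / n));
    return 0;
}
```
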
Article
The IBM RISC System/6000 processor is a second-generation RISC processor which reduces the execution pipeline penalties caused by branch instructions and also provides high floating-point performance. It employs multiple functional units which operate concurrently to maximize the instruction execution rate. By employing these advanced machine-organization techniques, it can execute up to four instructions simultaneously. Approximately 11 MFLOPS are achieved on the LINPACK benchmarks.
Conference Paper
The authors model the execution of the SPEC benchmarks under differing resource constraints. They repeat the work of the previous researchers, and under the hardware resource constraints imposed, similar results are obtained. On the other hand, when all constraints are removed except those required by the semantics of the program, degrees of parallelism in excess of 17 instructions per cycle are found. Finally, it is shown that if the hardware is properly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a processor that is reasonable to design today.
Conference Paper
Execution dependence between sequential instructions is one of the factors that limits the level of parallelism which can be exploited by fine grain parallel machines. Several architectural, compiler and machine organization techniques that have been used to alleviate this restriction are examined. They are compared against a relatively new mechanism that simply eliminates the execution dependency. The dependency elimination is achieved by using a novel integer arithmetic logic unit (ALU) design, which performs arithmetic and logical operations on three operands in a single cycle, but without extending the cycle time of the base machine
Conference Paper
The architecture for issuing multiple instructions per clock in the NonStop Cyclone processor is described. Pairs of instructions are fetched and decoded by a dual two-stage prefetch pipeline and passed to a dual six-stage pipeline for execution. Dynamic branch prediction is used to reduce branch penalties. A unique microcode routine for each pair is stored in the large duplexed control store. The microcode controls parallel data paths optimized for executing the most frequent instruction pairs. Other features of the architecture include cache support for unaligned double-precision accesses, a virtually addressed main memory, and a novel precise exception mechanism
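
As a generic illustration of dynamic branch prediction (a common two-bit saturating-counter scheme, not a description of the Cyclone hardware), the sketch below predicts a branch taken when its counter is in one of the two upper states and nudges the counter toward the actual outcome after each resolution.

```c
/* Two-bit saturating-counter branch predictor: states 0-1 predict
 * not-taken, states 2-3 predict taken.
 */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024

static uint8_t counters[TABLE_SIZE];   /* all zero: strongly not-taken */

static int predict(uint32_t pc) {
    return counters[pc % TABLE_SIZE] >= 2;          /* predict taken? */
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    uint32_t pc = 0x4000;
    int outcomes[] = {1, 1, 1, 0, 1, 1};   /* a mostly-taken loop branch */
    int hits = 0, n = sizeof outcomes / sizeof outcomes[0];
    for (int i = 0; i < n; i++) {
        hits += (predict(pc) == outcomes[i]);
        update(pc, outcomes[i]);
    }
    printf("correct predictions: %d/%d\n", hits, n);
    return 0;
}
```
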
Article
A device capable of executing interlocked fixed point arithmetic logic unit (ALU) instructions in parallel with other instructions causing the execution interlock is presented. The device incorporates the design of a 3-1 ALU and can execute two's complement, unsigned binary, and binary logical operations. It is shown that status for ALU operations using a 3-1 ALU can be determined in a parallel fashion, resulting in the compliance of the proposed device with predetermined architectural behavior of single instruction execution. The device requires no more logic stages than does a 3-1 binary adder using a carry-save adder (CSA) followed by a carry-lookahead adder (CLA) design. Design considerations using a commonly available CMOS technology are also reported, indicating that the device will not increase the machine cycle of an implementation. It is suggested that the device can maintain full architectural compatibility
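
A bit-level sketch of the three-operand addition structure the abstract alludes to: a carry-save stage reduces the three inputs to a sum vector and a carry vector, and a single carry-propagate addition (standing in here for the carry-lookahead adder) produces the final result. This illustrates the arithmetic only, not the device's logic design.

```c
/* Add three 32-bit operands via a carry-save reduction followed by one
 * carry-propagate add (the role played by the CLA in hardware).
 */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

static uint32_t add3_csa(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t sum   = a ^ b ^ c;                      /* per-bit sum       */
    uint32_t carry = (a & b) | (a & c) | (b & c);    /* per-bit carry out */
    return sum + (carry << 1);                       /* final adder stage */
}

int main(void) {
    assert(add3_csa(100, 200, 300) == 600);
    assert(add3_csa(0xFFFFFFFFu, 1, 1) == 1);        /* wraps modulo 2^32 */
    printf("3-1 CSA + carry-propagate addition checks passed\n");
    return 0;
}
```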