Article

A superscalar Alpha processor with out-of-order execution

... The structures identified above were presented in the context of the baseline superscalar model shown in Figure 1. The MIPS R10000 [22] and the DEC 21264 [10] are real implementations that directly fit this model. Hence, the structures identified above apply to these two processors. ...
... An alternate scheme for register renaming uses a CAM (content-addressable memory) [19] to store the current mappings. Such a scheme is implemented in the HAL SPARC [2] and the DEC 21264 [10]. The number of entries in the CAM is equal to the number of physical registers. ...
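To make the CAM-based mapping scheme described above concrete, the following Python sketch models a rename map with one entry per physical register, each holding a logical-register tag and a valid bit; a lookup matches against all entries, and renaming a destination allocates a fresh physical register and clears the old mapping's valid bit. This is a minimal illustration only; the class name, table sizes, and free-list handling are assumptions, not details of the 21264 or HAL SPARC designs.

```python
# Minimal sketch of CAM-style register renaming (names and sizes are
# illustrative, not taken from the 21264 or HAL SPARC documentation).

class CamRenameMap:
    """One entry per physical register: (logical reg, valid bit)."""

    def __init__(self, num_physical, num_logical):
        self.entries = [{"logical": None, "valid": False}
                        for _ in range(num_physical)]
        self.free = list(range(num_logical, num_physical))
        # Initially map logical register i to physical register i.
        for i in range(num_logical):
            self.entries[i] = {"logical": i, "valid": True}

    def lookup(self, logical):
        # A real CAM compares all entries in parallel; here we simply scan.
        for phys, e in enumerate(self.entries):
            if e["valid"] and e["logical"] == logical:
                return phys
        raise KeyError(f"logical register {logical} is unmapped")

    def rename_dest(self, logical):
        # Allocate a new physical register and make it the current mapping;
        # the previous mapping's valid bit is cleared.
        old = self.lookup(logical)
        self.entries[old]["valid"] = False
        new = self.free.pop()
        self.entries[new] = {"logical": logical, "valid": True}
        return old, new


rmap = CamRenameMap(num_physical=80, num_logical=32)
print(rmap.lookup(5))          # -> 5 (initial identity mapping)
print(rmap.rename_dest(5))     # -> (5, newly allocated physical register)
```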
... For an 8-way machine with deep pipelines, this would exclude a large number of bypass paths. Another solution is to generalize the method used in the DEC 21264 [10] and use multiple copies of the register file. This is the "cluster" method referred to in Section 4.4. ...
... So far, no attention has been given to dynamic techniques to improve the accuracy of data dependence speculation. This is because in the small instruction window sizes of modern dynamically scheduled processors [12,11,14] , the probability of a mis-speculation is small, and furthermore, the net performance loss that is due to erroneous data dependence speculation is small. ...
... Consequently, to gain the most out of data dependence speculation we would like to use it as aggressively as possible while keeping the net cost of mis-speculation as low as possible. The modern dynamically-scheduled processors that use data dependence speculation [11,12,14] do so blindly (i.e., a load is speculated whenever possible). No explicit attempt is made to reduce the net cost of mis-speculation. ...
Article
... As we demonstrate, the choice of a split versus a continuous window greatly impacts the effectiveness of each method, creating different tradeoffs and options. There are several reasons why a study assuming a continuous window is warranted: While future, high-performance processors may be forced to use split, distributed windows [7,27,21,13,30,31,24,6,29,8], virtually all modern processors use centralized, continuous windows. Moreover, traditionally, techniques developed initially for high-performance processors sooner or later find their way into other classes of processors (e.g., many of the techniques initially used for mainframes in the 60's and 70's found their way into microprocessors during the 80's and 90's). ...
... The previous studies on memory dependence speculation have primarily assumed split window models. In [19], a model of the Multiscalar architecture [7,27] was used, while, in [4], a model of the Alpha 21264 processor [13,14] was used. While quite dissimilar, both models use distributed instruction schedulers that may not use program order priority in scheduling store and load address calculations. ...
Conference Paper
We consider a variety of dynamic, hardware-based methods for exploiting load/store parallelism, including mechanisms that use memory dependence speculation. While previous work has also investigated such methods, this has been done primarily for split, distributed window processor models. We focus on centralized continuous-window processor models (the common configuration today). We confirm that exploiting load/store parallelism can greatly improve performance. Moreover, we show that much of this performance potential can be captured if addresses of the memory locations accessed by both loads and stores can be used to schedule loads. However, using addresses to schedule load execution may not always be an option due to complexity, latency, and cost considerations. For this reason, we also consider configurations that use just memory dependence speculation to guide load execution. We consider a variety of methods and show that speculation/synchronization can be used to effectively exploit virtually all load/store parallelism. We demonstrate that this technique is competitive with or better than the one that uses addresses for scheduling loads. We conclude by discussing why our findings differ, in part, from those reported for split, distributed window processor models.
... The importance of function unit clustering is evident in new microprocessors. In particular, the recently-announced Digital 21264 ([3], [4]) implements two integer clusters, each with its own register file, integer logic, and address calculation unit. The architects estimated a 1% performance degradation due to a one-cycle communication delay between clusters. ...
... Previous work, particularly [5], uses universal, unit-latency function units, whereas we use limited function units within the clusters. This additional restriction means that we must determine what mixture of function units is good to put in the clusters. ...
Article
As more function units are integrated into wide-issue superscalar processors and as cycle times decrease, result-forwarding delays will become worse relative to processor cycle time. Physical distance and capacitive effects of smaller geometry wires are the main reason for this increase in delay. Thus, a full bypass network, able to forward results from any function unit to any other function unit, cannot be realized without increasing the cycle time. Since increasing the cycle time is undesirable, this paper examines clustering of function units as an approximation to an ideal design where there is no inter-function unit communication delay. A cluster of function units is simply a grouping of neighboring function units with fast intra-cluster communication delay. Communication between clusters is assessed a higher penalty because of the distance between the clusters. This complicates function unit selection over the ideal case where there is no delay between function units. The goal of...
... There is a trend toward deeper processor pipelining; for example, the latest generation DEC processor, the 21264 [7], and the latest Intel processor, the Pentium Pro [4], are both more deeply pipelined than their predecessors. As levels of processor pipelining increase, it is likely that the cycles lost for each mispredicted branch will also increase. ...
... The basic pipeline of the proposed implementation, shown in Fig. 15, is similar to that used in current superscalar microprocessors, for example the MIPS R10000 or DEC 21264 [5, 7]. Instructions from both paths are fetched in the first pipeline stage. ...
Article
Selective Dual Path Execution (SDPE) reduces branch misprediction penalties by selectively forking a second path and executing instructions from both paths following a conditional branch instruction. SDPE restricts the number of simultaneously executed paths to two, and uses a branch prediction confidence mechanism to fork selectively only for branches that are more likely to be mispredicted. A branch forking policy defines the behavior of SDPE when a low confidence branch prediction is encountered while two paths are already being executed. Trace driven simulation is used to evaluate the effectiveness of SDPE with three different forking policies. SDPE can reduce the cycles lost to branch mispredictions by 34 to 50%, resulting in an approximate 10% reduction in overall execution time. However, it is shown that both branch mispredictions and low confidence predictions tend to occur in clusters, limiting the effectiveness of SDPE. A number of design parameters are studied via simulation. These include prediction and confidence table sizes. Finally, a number of implementation issues are discussed, with emphasis on instruction fetching mechanisms and register renaming.
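As a rough illustration of the forking decision SDPE makes, the sketch below forks the alternate path only when the branch confidence estimate is low and fewer than two paths are active. The function name, threshold, and encoding are assumptions for illustration; the paper itself evaluates three distinct forking policies.

```python
# An illustrative forking decision for selective dual-path execution.
# The threshold and two-path limit are assumed example parameters.

def should_fork(confidence, active_paths, threshold=2, max_paths=2):
    """Fork the alternate path only for low-confidence branch predictions
    and only when the machine is not already executing two paths."""
    return confidence < threshold and active_paths < max_paths


print(should_fork(confidence=0, active_paths=1))   # True: low confidence, room to fork
print(should_fork(confidence=3, active_paths=1))   # False: prediction is trusted
print(should_fork(confidence=0, active_paths=2))   # False: both paths already in use
```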
... First, we invoke a 4*4 GARNET network [1] on GEM5 with a 2D Mesh topology (for more detail see Table 1). All cores are of the ALPHA 21264 type [17]. Our main purpose is to evaluate the power and latency of the network with different cache coherence protocols. ...
Conference Paper
Full-text available
Shared-memory multiprocessors suffer from a significant problem: accessing shared resources in a shared memory results in longer latencies, which in turn degrades system performance. To address the increased access latency caused by a large number of processors sharing memory, caches are used. Every processor has its own private cache, so it can update or access data efficiently, but this leads to another serious issue: cache coherency. The magnitude of the potential performance difference between the various cache coherency approaches indicates that the choice of coherence solution is very important in the design of an efficient shared-bus multiprocessor, since it may limit the number of processors in the system. In this paper we evaluate a typical multiprocessor system in terms of power and latency with different cache coherence protocols, using the GEM5 simulator. Traffic is generated with five injection rates (0.1, 0.2, 0.3, 0.4 and 0.5). Power and latency figures are presented in the experimental results. The results show that MOESI_CMP_token has the maximum latency and power consumption.
... In clustered architectures [22,23,24], the processor resources are split into two or more clusters. Each cluster is simpler, faster, and consumes less power than a monolithic architecture. ...
... We loosely based our simulator baseline on the Alpha 21264 microprocessor, setting the resource configuration and their latencies to values published by Compaq [22]. An exception is the 21264's clustered microarchitecture, which is not modeled by SimpleScalar. ...
... The mainstream model of computation used by algorithm designers in the last half century [von Neumann 1945] assumes a sequential processor with unit memory access cost. However, the mainstream computers sitting on our desktops have increasingly deviated from this model in the last decade [Hennessy and Patterson 1996; Intel Corporation 1997; Keller 1996; MIPS Technologies, Inc. 1998; Sun Microsystems 1997]. In particular, we usually distinguish at least four levels of memory hierarchy: a file of multiported registers can be accessed in parallel in every clock cycle. ...
Chapter
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We advocate the approach of adapting external memory algorithms to this purpose. We exemplify this approach and the practical issues involved by engineering a fast priority queue suited to external memory and cached memory which is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.
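The k-way merging that this priority queue is built around can be sketched in a few lines of Python: each sorted run is scanned strictly sequentially (the access pattern caches and external memory handle well) while a small heap holds one head element per run. This shows only the merging building block under an assumed list-of-lists input, not the paper's full cache-aware priority queue design.

```python
# A minimal sketch of the k-way merging building block such a cache-aware
# priority queue relies on (not the paper's full design; names are illustrative).
import heapq

def k_way_merge(sorted_runs):
    """Merge k sorted lists using a small heap holding one head element per run.

    Each run is consumed strictly sequentially, which is the access pattern
    that caches and external memory handle efficiently.
    """
    heap = []
    for run_id, run in enumerate(sorted_runs):
        it = iter(run)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, run_id, it))
    while heap:
        value, run_id, it = heapq.heappop(heap)
        yield value
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, run_id, it))


runs = [[1, 4, 9], [2, 3, 10], [0, 7, 8]]
print(list(k_way_merge(runs)))   # [0, 1, 2, 3, 4, 7, 8, 9, 10]
```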
... All our measurements were taken on a uniprocessor system with an ALPHA 21164 processor [3] (workstation 500au), on a uniprocessor system with an ALPHA 21264 processor [17] (workstation XP1000) and on a single MIPS R10000 processor [26] of a multiprocessor system (SGI Origin 2000). The three different microarchitectures are briefly described in Table 2. ...
Conference Paper
In this paper, we compare automatically optimized codes against hand-optimized codes. The automatically optimized codes have been generated using our own tool, which implements compiler techniques proposed in our previous work. Our compiler techniques focus on applying multilevel tiling to non-rectangular loop nests. This type of loop nest is commonly found in linear algebra algorithms, typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in Dayde and Duff (1998). Results will show how compiler technology can make it possible for non-rectangular loop nests to achieve performance as high as that of hand-optimized codes on modern microprocessors.
... In Figure 3 we see an example of how our Register Mapping works. Our register mapping structure follows the CAM scheme used in the Alpha 21264 [17] and the HAL Sparc [4]. From this diagram we can see that at the present moment only four physical registers are mapped, i.e. the ones with the valid bit set to one. ...
Conference Paper
Full-text available
Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.
... Single cycle stores are not so important. Low load/store latency (2 cycles) is achievable with a high frequency clock, as demonstrated in the DEC Alpha [7]. Averaged across all our benchmark results, 98.57% of cycles per program are VLIW cycles with the 8x16-block geometry, strongly suggesting that the DTSVLIW architecture is effective in taking advantage of its VLIW Engine. ...
Conference Paper
Full-text available
Dynamically trace scheduled VLIW (DTSVLIW) architectures can be used to implement machines that execute code of current RISC or CISC instruction set architectures in a VLIW fashion, delivering instruction level parallelism (ILP) with backward code compatibility. This paper presents the effect of multicycle instructions on the performance of a DTSVLIW architecture running the SPECint95 benchmarks. The classic approaches to providing ILP are VLIW and superscalar architectures. With VLIW, the compiler is required to extract the parallelism from the program and to build Very Long Instruction Words for execution. This leads to fast and (relatively) simple hardware, but puts a heavy responsibility on the compiler, and object code compatibility(4) is a problem. In Superscalar, the extraction of parallelism is done by the hardware which dynamically schedules the sequential instruction stream onto the functional units. The hardware is much more complex, and therefore slower than a corresponding VLIW design. The peak instruction feed into the functional units is lower for Superscalar. A number of approaches(1)(2)(3) have been examined that marry the advantages of the contrasting designs: the Superscalar dynamic extraction of ILP, the simplicity of the VLIW architectures. The approach presented here follows that first presented by Nair and Hopkins(3). Our architecture, the dynamically trace scheduled VLIW architecture (DTSVLIW)(4), demonstrates similar results to theirs in providing significant parallelism, but with a simpler architecture that should be much easier to implement. In our earlier work(5), we had zero-latency load/store instructions. Here we present results with more realistic latencies. 2 The DTSVLIW Architecture The DTSVLIW has two execution engines: the Primary Processor and the VLIW Engine, and two caches for instructions: the Instruction Cache and the VLIW Cache. The Primary Processor, a simple pipelined processor, fetches instructions and does the first execution of this code. The instruction trace it produces is dynamically scheduled by the Scheduler Unit into VLIW instructions, saved in blocks to the VLIW Cache for re-execution by the VLIW Engine. The Primary Engine, executing the Sparc-7 ISA, provides object-code compatibility; the VLIW Engine, VLIW performance and simplicity. The Scheduler unit works in parallel with the Primary Engine. Scheduling does not impact on VLIW performance as it does in Superscalar.
... Therefore the delays of these components will become relatively long compared with those of most execution units, as issue widths and window sizes are increased and feature sizes are reduced. But the relative increase of the delays of these components does not directly prevent the reduction of the cycle time, because the delays per cycle can be drastically reduced by pipelining and/or clustering [7, 9]. In other words, these decentralization techniques are indispensable for future processors with larger issue widths and window sizes and smaller feature sizes. ...
Conference Paper
The wakeup logic is part of the issue window and is responsible for managing the ready flags of the operands for dynamic instruction scheduling. The conventional wakeup logic is based on association, and is composed of a RAM and a CAM. Since the logic is not pipelinable and the delays of these memories are dominated by wire delays, the logic will become more critical with deeper pipelines and smaller feature sizes. This paper describes a new scheduling scheme based not on association but on matrices which represent the dependences between instructions. Since the update logic of the matrices detects the dependences between instructions as the register renaming logic does, the wakeup operation is realized by just reading the matrices. This paper also describes a technique to reduce the effective size of the matrices for small IPC penalties. We designed the layouts of the logic guided by a 0.18 µm CMOS design rule provided by Fujitsu Limited, and calculated the delays. We also evaluated the penalties by cycle-level simulation. The results show that our scheme achieves a 2.7 GHz clock speed for an IPC degradation of about 1%.
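A functional sketch of the matrix idea: each window slot owns a row of dependence bits, completion of a producer clears one column across all rows (instead of broadcasting a tag to a CAM), and an instruction is ready once its row is empty. The Python below is illustrative only; class and method names, the window size, and the use of sets instead of bit vectors are assumptions.

```python
# A small functional sketch of matrix-based wakeup (sizes and encoding are
# illustrative; real hardware uses wired-OR/NOR columns, not Python sets).

class MatrixScheduler:
    def __init__(self, window_size):
        self.window_size = window_size
        # deps[i] holds the window slots whose results instruction i waits on.
        self.deps = [set() for _ in range(window_size)]
        self.valid = [False] * window_size

    def insert(self, slot, producer_slots):
        """Dispatch: record a row of dependence bits for the new instruction."""
        self.deps[slot] = set(producer_slots)
        self.valid[slot] = True

    def wakeup(self, completed_slot):
        """Completion: clear one column across all rows (no tag broadcast/CAM)."""
        for row in self.deps:
            row.discard(completed_slot)

    def ready(self):
        """An instruction is ready when its whole row has been cleared."""
        return [i for i in range(self.window_size)
                if self.valid[i] and not self.deps[i]]


sched = MatrixScheduler(window_size=8)
sched.insert(0, [])        # no outstanding producers
sched.insert(1, [0])       # waits on slot 0
sched.insert(2, [0, 1])    # waits on slots 0 and 1
print(sched.ready())       # [0]
sched.wakeup(0)
print(sched.ready())       # [0, 1]  (slot 0 stays listed until it is issued)
```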
... In clustered architectures [22, 23, 24], the processor resources are split into two or more clusters. Each cluster is simpler, faster, and consumes less power than a monolithic architecture. ...
Conference Paper
We propose a subordinate threading scheme in which the main thread skips instructions that are guaranteed to be correctly executed by the subordinate thread. Speeding up the main thread increases the overall speed of the processor. Also, a faster main thread can detect the subordinate thread's mispredictions earlier, thereby cutting down the amount of time the subordinate thread spends on wrong-path instructions. Hence, the subordinate thread is now free to do more aggressive speculations. We developed a cycle-accurate simulator and evaluated our symbiotic subordinate threading scheme for the SPEC2000 integer benchmarks. Our results show an average performance improvement of 21% over a base subordinate threading scheme that does not let the main thread skip any instructions.
... In contrast, the sliced register file ensures that each lane needs only 4R+2W ports for the functional units. Note that, as opposed to clustering techniques [11], writes into one register file lane do not need to be made visible to other lanes. ...
Conference Paper
Full-text available
Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
... In CPR, a given physical register appears at most once in any map table. The reader may also recognize the RMap bitvector and its checkpoints as components of the Alpha 21264's CAM-style register renaming and map table checkpointing mechanisms, respectively [4]. Here, we use them for reference counting. ...
Article
Several proposed techniques including CPR (checkpoint processing and recovery) and NoSQ (no store queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although previously described in terms of binary counters, we find that reference counts are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations.
... We felt it was important to validate the performance of the baseline SimpleScalar model, so we ran the identical code on a 600 MHz Alpha 21264 workstation (running Tru64 Unix) with 1 GB of main memory. We loosely based our simulator baseline on the Alpha 21264 microprocessor, setting the resource configuration and their latencies to values published by Compaq [17]. The correlation in absolute performance was quite close; all except Mars and Twofish were within 10% of the actual machine tests. ...
Article
Full-text available
The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms necessary to implement accountability, accuracy, and confidentiality in communication. As demands for secure communication bandwidth grow, efficient cryptographic processing will become increasingly vital to good system performance. In this paper, we explore techniques to improve the performance of symmetric key cipher algorithms. Eight popular strong encryption algorithms are examined in detail. Analysis reveals the algorithms are computationally complex and contain little parallelism. Overall throughput on a high-end microprocessor is quite poor: a 600 MHz processor is incapable of saturating a T3 communication line with 3DES (triple DES) encrypted data. We introduce new instructions that improve the efficiency of the analyzed algorithms. Our approach adds instruction set support for fast substitutions, general permutations, rotates, and modular arithmetic. Performance analysis of the optimized ciphers shows an overall speedup of 59% over a baseline machine with rotate instructions and 74% speedup over a baseline without rotates. Even higher speedups are demonstrated with optimized substitutions (SBOXes) and additional functional unit resources. Our analyses of the original and optimized algorithms suggest future directions for the design of high-performance programmable cryptographic processors.
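Rotates are one of the operations such instruction-set support accelerates; as a small, hedged illustration of what a rotate instruction computes, here is a plain 32-bit rotate-left helper in Python (the function name and operand width are illustrative, not part of the paper's proposed extensions).

```python
# A 32-bit rotate-left, the software equivalent of a single rotate instruction.
# Ciphers such as those studied in the paper use rotates heavily in their rounds.

def rotl32(x, n):
    n &= 31  # keep the shift amount in range, as a hardware rotate would
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


print(hex(rotl32(0x80000001, 1)))   # -> 0x3 (the top bit wraps around to bit 0)
```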
... We do argue, however, that our techniques can be used as a potentially lower complexity, shorter clock cycle alternative to scheduling loads/stores, by incorporating the load/store scheduling functionality in the existing register scheduler. While memory dependence mispeculations are not an issue for a centralized, continuous window processor, future processors may utilize more aggressive front-ends and may have to rely on partitioning to balance between short clock cycles and larger instruction windows [26,82,66,44,87,90,72,25,85,32]. Under this different set of assumptions, instructions are not necessarily fetched in program order, and moreover, enforcing program order priority in the scheduler may not be possible. ...
Article
As the existing techniques that empower the modern high-performance processors are being refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, memory dependence prediction, that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique to guess whether a load or a store will experience a dependence. It exploits regularity in the memory dependence stream of ordinary programs, a phenomenon which is also identified in this thesis. To demonstrate the utility of memory dependence prediction, this thesis also presents the following three novel microarchitectural techniques: 1. Dynamic Speculation/Synchronization of Memory Dependences: this thesis demonstrates that, to exploit parallelism over larger regions of code, waiting to determine the dependences a load has is not the best-performing option. Higher performance is possible if memory dependence speculation is used, especially if memory dependence prediction is used to guide this speculation.
... In the case of a dynamic optimizer, on the other hand, profile data is generated and consumed in the very same run, and no data is written out for use offline or during a later run. Hardware implementations of dynamic optimizers are now commonplace in modern microprocessors [Kumar 1996; Song et al. 1995; Papworth 1996; Keller 1996]. The optimization unit is a fixed-size instruction window, with the optimization logic operating on the critical execution path. ...
Article
Full-text available
Dynamic optimization refers to the runtime optimization of a native program binary. This paper describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language, compiler, operating system or hardware support. The program binary is not instrumented and is left untouched during Dynamo's operation. Dynamo observes the program's behavior through interpretation to dynamically select hot instruction traces from the running program. The hot traces are optimized using low-overhead optimization techniques and emitted into a software code cache. Subsequent instances of these traces cause the cached version to be executed, resulting in a performance boost. Contrary to intuition, we ...
... Most current superscalar processors [17, 18, 16, 4] are based on the microarchitecture shown in Figure 1. The instruction fetch unit reads multiple instructions from the instruction cache, decodes them, and places them in instruction buffers for execution by the integer and floating-point subsystems. ...
Article
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.
... The mainstream model of computation used by algorithm designers in the last half century [von Neumann 1945] assumes a sequential processor with unit memory access cost. However, the mainstream computers sitting on our desktops have increasingly deviated from this model in the last decade [Hennessy and Patterson 1996; Intel Corporation 1997; Keller 1996; MIPS Technologies, Inc. 1998; Sun Microsystems 1997]. In particular, we usually distinguish at least four levels of memory hierarchy: a file of multiported registers can be accessed in parallel in every clock cycle. ...
Article
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We advocate the approach of adapting external memory algorithms to this purpose. We exemplify this approach and the practical issues involved by engineering a fast priority queue suited to external memory and cached memory which is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.
Article
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.
Article
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.
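A highly simplified sketch of the two ideas in this abstract, prediction and synchronization: a load speculates freely until it is caught violating a dependence, after which it is remembered together with the conflicting store and made to wait for that store in later executions. The table organization, method names, and PC-keyed indexing below are assumptions for illustration and do not reproduce the paper's Multiscalar-based mechanism.

```python
# A rough sketch of prediction-guided memory dependence speculation
# (illustrative only; the predictor organization here is an assumption).

class DependencePredictor:
    def __init__(self):
        # Load PCs that mis-speculated before, mapped to the store PC they
        # conflicted with; such loads are synchronized instead of speculated.
        self.conflicts = {}

    def should_speculate(self, load_pc):
        return load_pc not in self.conflicts

    def record_violation(self, load_pc, store_pc):
        self.conflicts[load_pc] = store_pc

    def may_issue(self, load_pc, completed_store_pcs):
        """Synchronization: a predicted-dependent load waits for its store."""
        if self.should_speculate(load_pc):
            return True
        return self.conflicts[load_pc] in completed_store_pcs


pred = DependencePredictor()
print(pred.should_speculate(0x40))         # True: speculate the first time
pred.record_violation(0x40, 0x20)          # squash detected -> remember the pair
print(pred.may_issue(0x40, set()))         # False: wait for the store at 0x20
print(pred.may_issue(0x40, {0x20}))        # True: store done, load may issue
```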
Conference Paper
The continued increase in microprocessor clock frequency that has come from advancements in fabrication technology and reductions in feature size creates challenges in maintaining both manufacturing yield rates and long-term reliability of devices. Methods based on defect detection and reduction may not offer a scalable solution due to the cost of eliminating contaminants in the manufacturing process and increasing chip complexity. This paper proposes to use the inherent redundancy available in existing and future chip microarchitectures to improve yield and enable graceful performance degradation in fail-in-place systems. We introduce a new yield metric called performance averaged yield (Ypav) which accounts both for fully functional chips and those that exhibit some performance degradation. Our results indicate that at 250nm we are able to increase the Ypav of a uniprocessor with only redundant rows in its caches from a base value of 85% to 98% using microarchitectural redundancy. Given constant chip area, shrinking feature sizes increase fault susceptibility and reduce the base Ypav to 60% at 50nm, which exploiting microarchitectural redundancy then increases to 99.6%.
Article
Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme — reference counting. We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders. We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables.
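The bit-matrix representation is easy to mimic in software: one row per holder (e.g., the map table or an in-flight instruction), one column per physical register, and a register is free exactly when its column is all zeros, which hardware obtains by NOR-ing the column. The sketch below is illustrative; sizes, names, and the explicit free-list scan are assumptions rather than the paper's circuit design.

```python
# A minimal sketch of bit-matrix register reference counting (sizes are
# illustrative; hardware would NOR each column to form a free-list bitvector).

class RefCountMatrix:
    def __init__(self, num_regs, num_holders):
        # matrix[holder][reg] == 1 means that holder references that register.
        self.matrix = [[0] * num_regs for _ in range(num_holders)]
        self.num_regs = num_regs

    def acquire(self, holder, reg):
        self.matrix[holder][reg] = 1

    def release(self, holder, reg):
        self.matrix[holder][reg] = 0

    def free_list(self):
        """A register is free when its entire column is zero (the NOR)."""
        return [r for r in range(self.num_regs)
                if not any(row[r] for row in self.matrix)]


rc = RefCountMatrix(num_regs=8, num_holders=4)
rc.acquire(holder=0, reg=3)    # e.g. the map table points at p3
rc.acquire(holder=1, reg=3)    # e.g. an in-flight instruction also reads p3
print(3 in rc.free_list())     # False: two rows still reference p3
rc.release(0, 3)
rc.release(1, 3)
print(3 in rc.free_list())     # True: column 3 is all zeros again
```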
Article
Digital signal processors (DSPs) have been widely used in processing video and audio streaming data. Due to the huge amount of streaming data, increasing throughput is the key issue in designing DSP architectures. One way to increase the throughput of a DSP is to increase its instruction level parallelism. To increase instruction level parallelism, many architectures have been proposed; they can be classified into two main approaches, the superscalar and the VLIW architectures. Among the hardware architectures, the VLIW attracts a lot of attention due to its simple hardware complexity. However, the VLIW architecture suffers from an explosion of instruction memory due to the overhead of instruction grouping. In this paper, we propose a novel DSP architecture which contains three pipelines and performs dynamic instruction grouping in hardware. The experimental results show that our architecture can reduce memory requirements by 13% on average while maintaining the same performance.
Conference Paper
This article describes the application of advanced manufacturing technology in the national tourism industry, covering the digital development, exhibition, and cultural heritage protection of ethnic characteristic tourism products, and introduces specific uses of computer-aided design, reverse engineering, rapid prototyping manufacturing technology and information network technology in national tourism product development. This has far-reaching significance for improving the efficiency of product development and for the development of the national characteristic tourism industry.
Conference Paper
Traces are dynamic instruction sequences constructed and cached by hardware. A microarchitecture organized around traces is presented as a means for efficiently executing many instructions per cycle. Trace processors exploit both control flow and data flow hierarchy to overcome complexity and architectural limitations of conventional superscalar processors by (1) distributing execution resources based on trace boundaries and (2) applying control and data prediction at the trace level rather than individual branches or instructions. Three sets of experiments using the SPECInt95 benchmarks are presented. (i) A detailed evaluation of trace processor configurations: the results affirm that significant instruction-level parallelism can be exploited in integer programs (2 to 6 instructions per cycle). We also isolate the impact of distributed resources, and quantify the value of successively doubling the number of distributed elements. (ii) A trace processor with data prediction applied to inter-trace dependences: potential performance improvement with perfect prediction is around 45% for all benchmarks. With realistic prediction, gcc achieves an actual improvement of 10%. (iii) Evaluation of aggressive control flow: some benchmarks benefit from control independence by as much as 10%
Conference Paper
Increasing diversity in packet-processing applications and rapid increases in channel bandwidth lead to greater complexity in communication protocols. These factors result in larger computational loads for packet-processing engines, which makes high-performance microprocessor designs an important solution. This paper presents an exhaustive simulation for exploring the performance of instruction-level parallel superscalar processors executing packet-processing applications. Based on the simulation results, a design space exploration has been used to derive a performance-efficient, application-specific superscalar processor architecture based on the MIPS instruction set architecture. The SimpleScalar architecture toolset has been used for design space exploration, and network applications have been investigated to guide the architecture exploration. The optimizations achieve up to 80% improvement in performance for representative packet-processing applications.
Conference Paper
Full-text available
Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e. predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques
Conference Paper
Not Available
Article
Thesis (Ph. D.)--University of Wisconsin--Madison, 2003. Includes bibliographical references (p. 239-244). Photocopy.
Conference Paper
The use of CC-NUMA multiprocessors complicates the placement of physical memory pages. Memory closest to a processor provides the best access time, but optimal memory page placement is a difficult problem with process movement, multiple processes requiring access to the same physical memory page, and application behavior changing over execution time. We use dynamic page placement to move memory pages where needed for the database benchmark TPC-C executing on a four node CC-NUMA multiprocessor. Dynamic page placement achieves local memory accesses up to 73% of the time instead of the static page placement results of 34% locality achieved with first touch and 25% with round robin. This can result in a 17% improvement in performance.
Conference Paper
Existing schemes for cache energy optimization incorporate a limited degree of dynamic associativity: either direct-mapped or the full available associativity (say, 4-way). In this paper, we explore a more general design space for dynamic associativity (for a 4-way associative cache, consider 1-way, 2-way, and 4-way associative accesses). The other major departure is in the associativity control mechanism. We use the actual instruction level parallelism exhibited by the instructions surrounding a given load to classify it as an IPC k load (for 1≤k≤IW with an issue width of IW) in a superscalar architecture. The lookup schedule is fixed in advance for each IPC classifier 1≤k≤IW. The schedules are as way-disjoint as possible for load/stores with different IPC classifications. The energy savings over SPEC2000 CPU benchmarks average 28.6% for a 32 kB, 4-way L1 data cache. The resulting performance (IPC) degradation from the dynamic way schedule is restricted to less than 2.25%, mainly because IPC based placement ends up being an excellent classifier.
Conference Paper
The continued increase in microprocessor clock frequency that has come from advancements in fabrication technology and reductions in feature size creates challenges in maintaining both manufacturing yield rates and long-term reliability of devices. Methods based on defect detection and reduction may not offer a scalable solution due to the cost of eliminating contaminants in the manufacturing process and increasing chip complexity. We propose to use the inherent redundancy available in existing and future chip microarchitectures to improve yield and enable graceful performance degradation in fail-in-place systems. We introduce a new yield metric called performance averaged yield (Ypav), which accounts both for fully functional chips and those that exhibit some performance degradation. Our results indicate that at 250nm we are able to increase the Ypav of a uniprocessor with only redundant rows in its caches from a base value of 85% to 98% using microarchitectural redundancy. Given constant chip area, shrinking feature sizes increase fault susceptibility and reduce the base Ypav to 60% at 50nm, which exploiting microarchitectural redundancy then increases to 99.6%.
Conference Paper
The rename map table (RMT) access and the dependence check logic (DCL) delays scale unfavorably with the dispatch width (DW) of a superscalar processor. It is a well-known program property that the results of most instructions are consumed within the following 4-6 instruction window. This program behavior can be exploited to reduce the rename delay by reducing the number of read/write ports in the RMT to significantly below the current 3*DW. We propose an algorithm to dynamically allocate a reduced number of RMT ports to instructions in the current dispatch window, matching dispatch resources to average needs rather than peak needs. This results in shorter RMT access delays as well as lower energy in the dispatch stage. The IPC reduction due to rename map table read/write port contention in the proposed scheme stays within 2-4%. The cycle time saved can also be leveraged to support wider dispatch in the same cycle time in order to offset this degradation.
Conference Paper
This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which inhibit soft errors in combinational logic. We quantify the SER due to high-energy neutrons in SRAM cells, latches, and logic circuits for feature sizes from 600 nm to 50 nm and clock periods from 16 to 6 fan-out-of-4 inverter delays. Our model predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes that computer system designers must address the risks of soft errors in logic circuits for future designs.
Conference Paper
Full-text available
An increasingly large portion of scheduler latency is derived from the monolithic content addressable memory arrays accessed during instruction wake-up. The performance of the scheduler can be improved by decreasing the number of tag comparisons necessary to schedule instructions. Using detailed simulation-based analyses, we find that most instructions enter the window with at least one of their input operands already available. By putting these instructions into specialized windows with fewer tag comparators, load capacitance on the scheduler critical path can be reduced, with only very small effects on program throughput. For instructions with multiple unavailable operands, we introduce a last-tag speculation mechanism that eliminates all remaining tag comparators except those for the last arriving input operand. By combining these two tag-reduction schemes, we are able to construct dynamic schedulers with approximately one quarter of the tag comparators found in conventional designs. Conservative circuit-level timing analyses indicate that the optimized designs are 20-45% faster and require 10-20% less power depending on instruction window size
Conference Paper
Full-text available
The growth of the Internet as a vehicle for secure communication and electronic commerce has brought cryptographic processing performance to the forefront of high throughput system design. This trend will be further underscored with the widespread adoption of secure protocols such as secure IP (IPSEC) and virtual private networks (VPNs). In this paper, we introduce the CryptoManiac processor, a fast and flexible co-processor for cryptographic workloads. Our design is extremely efficient; we present analysis of a 0.25 µm physical design that runs the standard Rijndael cipher algorithm 2.25 times faster than a 600 MHz Alpha 21264 processor. Moreover, our implementation requires 1/100th the area and power in the same technology. We demonstrate that the performance of our design rivals a state-of-the-art dedicated hardware implementation of the 3DES (triple DES) algorithm, while retaining the flexibility to simultaneously support multiple cipher algorithms. Finally, we define a scalable system architecture that combines CryptoManiac processing elements to exploit inter-session and inter-packet parallelism available in many communication protocols. Using I/O traces and detailed timing simulation, we show that chip multiprocessor configurations can effectively service high throughput applications including secure web and disk I/O processing.
Conference Paper
DS (Decoupled-Superscalar) is a new microarchitecture that combines decoupled and superscalar techniques to exploit instruction level parallelism. Issue bandwidth is increased while circuit complexity growth is controlled with little negative impact on performance. Programs for DS are compiled into two instruction substreams: the dominant substream navigates the control flow and the rest of computational task is shared between the dominant and subsidiary substreams. Each substream is processed by a separate superscalar core realizable with current VLSI technology. DS machines are binary compatible with superscalar machines having the same instruction set, and a family of DS machines is binary compatible without recompilation. DS run time behavior is examined with an analytical model. A novel technique for controlling slip between substreams is introduced. Code partitioning issues of instruction count balancing and residence time balancing, important to any split-stream scheme, are discussed. Simulation shows DS achieves performance comparable to an aggressive superscalar, but with potentially less complex hardware and faster clock rate
Conference Paper
Branch prediction is crucial to maintaining high performance in modern superscalar processors. Conditional branches can occur as frequently as one in every 5 or 6 instructions in nonnumeric programs, leading to heavy misprediction penalties in current superpipelined, superscalar architectures. Realization of this fact has led to an upsurge in the number of recent research publications in this area. This has been accompanied by the processor manufacturers implementing more complex branch prediction hardware in recent processors. This paper surveys the status of research in static and dynamic branch prediction and also how these research findings are being employed by current processor designers. The paper ends with an evaluation of future directions of branch prediction.
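As a concrete instance of the dynamic schemes such surveys cover, here is the textbook two-bit saturating-counter predictor in Python; the table size, indexing by low PC bits, and initial counter state are conventional choices assumed for illustration, not taken from this paper.

```python
# A textbook two-bit saturating-counter branch predictor, included only as a
# concrete example of a simple dynamic prediction scheme.

class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)     # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # True means "predict taken"

    def update(self, pc, taken):
        idx = pc & self.mask
        if taken:
            self.table[idx] = min(3, self.table[idx] + 1)
        else:
            self.table[idx] = max(0, self.table[idx] - 1)


bp = TwoBitPredictor()
for outcome in [True, True, False, True]:        # a mostly-taken branch at pc 0x100
    print(bp.predict(0x100), outcome)            # prediction vs. actual outcome
    bp.update(0x100, outcome)
```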
Conference Paper
Increasingly wider superscalar processors are experiencing diminishing performance returns while requiring larger portions of die area dedicated to control rather than datapath. As an alternative to using these processors to exploit parallelism effectively, we are investigating the viability of using single-chip vector microprocessors. This paper presents some initial results of our investigation where we compare the performance and cost of vector microprocessors to that of aggressive, out-of-order superscalar microprocessors. On the performance side, we show that vector processors are able to execute a highly parallel, integer-based application 1.5-7.3 times faster than superscalar processors can by exploiting parallelism more effectively. This ability stems from the use of vector instructions. Vector instructions exploit parallelism across loop iterations by implicitly re-scheduling operations and temporally localizing the parallelism. Vector instructions also reduce instruction bandwidth by more than an order of magnitude because they express an abundance of parallelism in a compact encoding. On the cost side we show that, to achieve these performance gains, highly parallel, integer-based vector microprocessors are no more costly to implement than existing in-order and out-of-order superscalar microprocessors. One reason for this is that the organization of a vector register file provides tremendous bandwidth without incurring a large area penalty. A second reason is that the control logic for issuing vector instructions is relatively simple. Both the performance gains and cost savings are possible because vector processors rely on a vectorizing compiler, rather than hardware, to detect parallelism and to express it in a compact form to the hardware. These initial results suggest that transferring this functionality to the compiler offers a tremendous performance/cost benefit
Conference Paper
The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. It is capable of supplying multiple fetch blocks each cycle. We examine two fetch and issue techniques, partial matching and inactive issue, that improve the overall performance of the trace cache by improving the effective fetch rate. We show that for the SPECint95 benchmarks partial matching increases the overall performance by 12% and adding inactive issue increases performance by 15%. Furthermore we apply these two techniques to issue blocks from trace segments which contain multiple execution paths. We conclude with a performance comparison between a trace cache implementing partial matching and inactive issue and an aggressive single block fetch mechanism. The trace cache increases performance by an average of 25% over the instruction cache
Conference Paper
Full-text available
Next-generation wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches and dedicated memory ports while minimizing the cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512-kByte L2 cache, board-level caches provide significant enough performance gains to justify their higher cost
Article
In this paper, we evaluate the performance of high bandwidth cache organizations employing multiple cache ports, multiple cycle hit times, and cache port efficiency enhancements, such as load all and line buffer, to find the organization that provides the best processor performance. Using a dynamic superscalar processor running realistic benchmarks that include operating system references, we use execution time to measure processor performance. When the cache is limited to a single cache port without enhancements, we find that two cache ports increase processor performance by 25 percent. With the addition of line buffer and load all to a single-ported cache, the processor achieves 91 percent of the performance of the same processor containing a cache with two ports. When the processor is not limited to a single cache port, the results show that a large dual-ported multicycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and using a line buffer provide the bandwidth needed by a dynamic superscalar processor. The line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
Article
Full-text available
Much emphasis is now being placed on chip-multiprocessor (CMP) architectures for exploiting thread-level parallelism in applications. In such architectures, speculation may be employed to execute applications that cannot be parallelized statically. In this paper, we present an efficient CMP architecture for the speculative execution of sequential binaries without source recompilation. We present software support that enables the identification of threads from a sequential binary. The hardware includes a memory disambiguation mechanism that enables the detection of interthread memory dependence violations during speculative execution. This hardware is different from past proposals in that it does not rely on a snoopy-based cache-coherence protocol. Instead, it uses an approach similar to a directory-based scheme. Furthermore, the architecture includes a simple and efficient hardware mechanism to enable register-level communication between on-chip processors. Evaluation of this software-hardware approach shows that it is quite effective in achieving high performance when running sequential binaries
Article
In this paper, we examine some critical design features of a trace cache fetch engine for a 16-wide issue processor and evaluate their effects on performance. We evaluate path associativity, partial matching, and inactive issue, all of which are straightforward extensions to the trace cache. We examine features such as the fill unit and branch predictor design. In our final analysis, we show that the trace cache mechanism attains a 28 percent performance improvement over an aggressive single block fetch mechanism and a 15 percent improvement over a sequential multiblock mechanism
Article
Full-text available
Next generation, wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches, and dedicated memory ports while minimizing cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging, and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512KB L2 cache, board-level caches provide significant enough performance gains to justify their higher cost. 1 Introduction Increasing attention has been focused on the memory hierarchy performance of microprocessor-based systems due to the growing processor/memory speed gap and memory bandwidth demands of modern superscalar designs. In order to prevent the next-generation of wide-issue, highly-integrated m...
Conference Paper
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is not self-evident since caches have a fixed and quite restrictive algorithm choosing the content of the cache. We investigate the impact of this restriction for the frequently occurring case of access to multiple sequences. We show that any access pattern to k = Θ(M/B^(1+1/a)) sequential data streams can be efficiently supported on an a-way set associative cache with capacity M and line size B. The bounds are tight up to lower order terms.
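For readability, the bound quoted in the abstract can be restated with its symbols spelled out (k concurrently accessed sequences, cache capacity M, line size B, associativity a):

```latex
% Bound from the abstract: the number of sequential streams that an
% a-way set-associative cache of capacity M and line size B supports efficiently.
\[
  k \;=\; \Theta\!\left(\frac{M}{B^{\,1+1/a}}\right)
\]
```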