Article

A superscalar Alpha processor with out-of-order execution

... The structures identified above were presented in the context of the baseline superscalar model shown in Figure 1. The MIPS R10000 [22] and the DEC 21264 [10] are real implementations that directly fit this model. Hence, the structures identified above apply to these two processors. ...
... An alternate scheme for register renaming uses a CAM (content-addressable memory) [19] to store the current mappings. Such a scheme is implemented in the HAL SPARC [2] and the DEC 21264 [10]. The number of entries in the CAM is equal to the number of physical registers. ...
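To make the CAM-based mapping scheme described above concrete, the following Python sketch models a rename map with one entry per physical register, each holding a logical-register tag and a valid bit; a lookup matches against all entries, and renaming a destination allocates a fresh physical register and clears the old mapping's valid bit. This is a minimal illustration only; the class name, table sizes, and free-list handling are assumptions, not details of the 21264 or HAL SPARC designs.

```python
# Minimal sketch of CAM-style register renaming (names and sizes are
# illustrative, not taken from the 21264 or HAL SPARC documentation).

class CamRenameMap:
    """One entry per physical register: (logical reg, valid bit)."""

    def __init__(self, num_physical, num_logical):
        self.entries = [{"logical": None, "valid": False}
                        for _ in range(num_physical)]
        self.free = list(range(num_logical, num_physical))
        # Initially map logical register i to physical register i.
        for i in range(num_logical):
            self.entries[i] = {"logical": i, "valid": True}

    def lookup(self, logical):
        # A real CAM compares all entries in parallel; here we simply scan.
        for phys, e in enumerate(self.entries):
            if e["valid"] and e["logical"] == logical:
                return phys
        raise KeyError(f"logical register {logical} is unmapped")

    def rename_dest(self, logical):
        # Allocate a new physical register and make it the current mapping;
        # the previous mapping's valid bit is cleared.
        old = self.lookup(logical)
        self.entries[old]["valid"] = False
        new = self.free.pop()
        self.entries[new] = {"logical": logical, "valid": True}
        return old, new


rmap = CamRenameMap(num_physical=80, num_logical=32)
print(rmap.lookup(5))          # -> 5 (initial identity mapping)
print(rmap.rename_dest(5))     # -> (5, newly allocated physical register)
```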
... For an 8-way machine with deep pipelines, this would exclude a large number of bypass paths. Another solution is to generalize the method used in the DEC 21264 [10] and use multiple copies of the register file. This is the "cluster" method referred to in Section 4.4. ...
... So far, no attention has been given to dynamic techniques to improve the accuracy of data dependence speculation. This is because in the small instruction window sizes of modern dynamically scheduled processors [12,11,14] , the probability of a mis-speculation is small, and furthermore, the net performance loss that is due to erroneous data dependence speculation is small. ...
... Consequently, to gain the most out of data dependence speculation we would like to use it as aggressively as possible while keeping the net cost of mis-speculation as low as possible. The modern dynamically-scheduled processors that use data dependence speculation [11,12,14] do so blindly (i.e., a load is speculated whenever possible). No explicit attempt is made to reduce the net cost of mis-speculation. ...
Article
... As we demonstrate, the choice of a split versus a continuous window greatly impacts the effectiveness of each method, creating different tradeoffs and options. There are several reasons why a study assuming a continuous window is warranted: While future, high-performance processors may be forced to use split, distributed windows [7,27,21,13,30,31,24,6,29,8], virtually all modern processors use centralized, continuous windows. Moreover, traditionally, techniques developed initially for high-performance processors sooner or later find their way into other classes of processors (e.g., many of the techniques initially used for mainframes in the 60's and 70's found their way into microprocessors during the 80's and 90's). ...
... The previous studies on memory dependence speculation have primarily assumed split window models. In [19], a model of the Multiscalar architecture [7,27] was used, while, in [4], a model of the Alpha 21264 processor [13,14] was used. While quite dissimilar, both models use distributed instruction schedulers that may not use program order priority in scheduling store and load address calculations. ...
Conference Paper
We consider a variety of dynamic, hardware-based methods for exploiting load/store parallelism, including mechanisms that use memory dependence speculation. While previous work has also investigated such methods, this has been done primarily for split, distributed window processor models. We focus on centralized continuous-window processor models (the common configuration today). We confirm that exploiting load/store parallelism can greatly improve performance. Moreover, we show that much of this performance potential can be captured if addresses of the memory locations accessed by both loads and stores can be used to schedule loads. However, using addresses to schedule load execution may not always be an option due to complexity, latency, and cost considerations. For this reason, we also consider configurations that use just memory dependence speculation to guide load execution. We consider a variety of methods and show that speculation/synchronization can be used to effectively exploit virtually all load/store parallelism. We demonstrate that this technique is competitive with or better than the one that uses addresses for scheduling loads. We conclude by discussing why our findings differ, in part, from those reported for split, distributed window processor models.
... The importance of function unit clustering is evident in new microprocessors. In particular, the recently-announced Digital 21264 ([3], [4]) implements two integer clusters, each with its own register file, integer logic, and address calculation unit. The architects estimated a 1% performance degradation due to a one-cycle communication delay between clusters. ...
... Previous work, particularly [5], uses universal, unit-latency function units, whereas we use limited function units within the clusters. This additional restriction means that we must determine what mixture of function units is good to put in the clusters. ...
Article
As more function units are integrated into wide-issue superscalar processors and as cycle times decrease, result-forwarding delays will become worse relative to processor cycle time. Physical distance and capacitive effects of smaller geometry wires are the main reason for this increase in delay. Thus, a full bypass network, able to forward results from any function unit to any other function unit, cannot be realized without increasing the cycle time. Since increasing the cycle time is undesirable, this paper examines clustering of function units as an approximation to an ideal design where there is no inter-function unit communication delay. A cluster of function units is simply a grouping of neighboring function units with fast intra-cluster communication delay. Communication between clusters is assessed a higher penalty because of the distance between the clusters. This complicates function unit selection over the ideal case where there is no delay between function units. The goal of...
... There is a trend toward deeper processor pipelining; for example, the latest generation DEC processor, the 21264 [7], and the latest Intel processor, the Pentium Pro [4], are both more deeply pipelined than their predecessors. As levels of processor pipelining increase, it is likely that the cycles lost for each mispredicted branch will also increase. ...
... The basic pipeline of the proposed implementation, shown in Fig. 15, is similar to that used in current superscalar microprocessors, for example the MIPS R10000 or DEC 21264 [5, 7]. Instructions from both paths are fetched in the first pipeline stage. ...
Article
Selective Dual Path Execution (SDPE) reduces branch misprediction penalties by selectively forking a second path and executing instructions from both paths following a conditional branch instruction. SDPE restricts the number of simultaneously executed paths to two, and uses a branch prediction confidence mechanism to fork selectively only for branches that are more likely to be mispredicted. A branch forking policy defines the behavior of SDPE when a low confidence branch prediction is encountered while two paths are already being executed. Trace driven simulation is used to evaluate the effectiveness of SDPE with three different forking policies. SDPE can reduce the cycles lost to branch mispredictions by 34 to 50%, resulting in an approximate 10% reduction in overall execution time. However, it is shown that both branch mispredictions and low confidence predictions tend to occur in clusters, limiting the effectiveness of SDPE. A number of design parameters are studied via simulation. These include prediction and confidence table sizes. Finally, a number of implementation issues are discussed, with emphasis on instruction fetching mechanisms and register renaming.
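As a rough illustration of the forking decision SDPE makes, the sketch below forks the alternate path only when the branch confidence estimate is low and fewer than two paths are active. The function name, threshold, and encoding are assumptions for illustration; the paper itself evaluates three distinct forking policies.

```python
# An illustrative forking decision for selective dual-path execution.
# The threshold and two-path limit are assumed example parameters.

def should_fork(confidence, active_paths, threshold=2, max_paths=2):
    """Fork the alternate path only for low-confidence branch predictions
    and only when the machine is not already executing two paths."""
    return confidence < threshold and active_paths < max_paths


print(should_fork(confidence=0, active_paths=1))   # True: low confidence, room to fork
print(should_fork(confidence=3, active_paths=1))   # False: prediction is trusted
print(should_fork(confidence=0, active_paths=2))   # False: both paths already in use
```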
... First, we invoke a 4*4 GARNET network [1] on GEM5 with a 2D Mesh topology (for more detail see Table 1). All cores are of the ALPHA 21264 type [17]. Our main purpose is to evaluate the power and latency of the network with different cache coherence protocols. ...
Conference Paper
Full-text available
Shared-memory multiprocessors suffer from a significant problem: accessing shared resources in a shared memory results in longer latencies, which in turn degrades system performance. To address the increased access latency caused by a large number of processors sharing memory, caches are used. Every processor has its own private cache, so it can update or access data efficiently, but this leads to another serious issue: cache coherency. The magnitude of the potential performance difference between the various cache coherency approaches indicates that the choice of coherence solution is very important in the design of an efficient shared-bus multiprocessor, since it may limit the number of processors in the system. In this paper we evaluate a typical multiprocessor system in terms of power and latency with different cache coherence protocols, using the GEM5 simulator. Traffic is generated with five injection rates (0.1, 0.2, 0.3, 0.4 and 0.5). Power and latency figures are presented in the experimental results. The results show that MOESI_CMP_token has the maximum latency and power consumption.
... In clustered architectures [22,23,24], the processor resources are split into two or more clusters. Each cluster is simpler, faster, and consumes less power than a monolithic architecture. ...
... We loosely based our simulator baseline on the Alpha 21264 microprocessor, setting the resource configuration and their latencies to values published by Compaq [22]. An exception is the 21264's clustered microarchitecture, which is not modeled by SimpleScalar. ...
... The mainstream model of computation used by algorithm designers in the last half century [von Neumann 1945] assumes a sequential processor with unit memory access cost. However, the mainstream computers sitting on our desktops have increasingly deviated from this model in the last decade [Hennessy and Patterson 1996; Intel Corporation 1997; Keller 1996; MIPS Technologies, Inc. 1998; Sun Microsystems 1997]. In particular, we usually distinguish at least four levels of memory hierarchy: a file of multiported registers can be accessed in parallel in every clock cycle. ...
Chapter
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We advocate the approach of adapting external memory algorithms to this purpose. We exemplify this approach and the practical issues involved by engineering a fast priority queue suited to external memory and cached memory which is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.
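The k-way merging that this priority queue is built around can be sketched in a few lines of Python: each sorted run is scanned strictly sequentially (the access pattern caches and external memory handle well) while a small heap holds one head element per run. This shows only the merging building block under an assumed list-of-lists input, not the paper's full cache-aware priority queue design.

```python
# A minimal sketch of the k-way merging building block such a cache-aware
# priority queue relies on (not the paper's full design; names are illustrative).
import heapq

def k_way_merge(sorted_runs):
    """Merge k sorted lists using a small heap holding one head element per run.

    Each run is consumed strictly sequentially, which is the access pattern
    that caches and external memory handle efficiently.
    """
    heap = []
    for run_id, run in enumerate(sorted_runs):
        it = iter(run)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, run_id, it))
    while heap:
        value, run_id, it = heapq.heappop(heap)
        yield value
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, run_id, it))


runs = [[1, 4, 9], [2, 3, 10], [0, 7, 8]]
print(list(k_way_merge(runs)))   # [0, 1, 2, 3, 4, 7, 8, 9, 10]
```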
... All our measurements were taken on a uniprocessor system with an ALPHA 21164 processor [3] (workstation 500au), on a uniprocessor system with an ALPHA 21264 processor [17] (workstation XP1000) and on a single MIPS R10000 processor [26] of a multiprocessor system (SGI Origin 2000). The three different microarchitectures are briefly described in Table 2. ...
Conference Paper
In this paper, we compare automatically optimized codes against hand-optimized codes. The automatically optimized codes have been generated using our own tool, which implements compiler techniques proposed in our previous work. Our compiler techniques focus on applying multilevel tiling to non-rectangular loop nests. This type of loop nest is commonly found in linear algebra algorithms, typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in Dayde and Duff (1998). Results will show how compiler technology can make it possible for non-rectangular loop nests to achieve performance as high as that of hand-optimized codes on modern microprocessors.
... In Figure 3 we see an example of how our Register Mapping works. Our register mapping structure follows the CAM scheme used in the Alpha 21264 [17] and the HAL Sparc [4]. From this diagram we can see that at the present moment only four physical registers are mapped, i.e. the ones with the valid bit set to one. ...
Conference Paper
Full-text available
Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.
... Single cycle stores are not so important. Low load/store latency (2 cycles) is achievable with a high frequency clock, as demonstrated in the DEC Alpha [7]. Averaged across all our benchmark results, 98.57% of cycles per program are VLIW cycles with the 8x16-block geometry, strongly suggesting that the DTSVLIW architecture is effective in taking advantage of its VLIW Engine. ...
Conference Paper
Full-text available
Dynamically trace scheduled VLIW (DTSVLIW) architectures can be used to implement machines that execute code of current RISC or CISC instruction set architectures in a VLIW fashion, delivering instruction level parallelism (ILP) with backward code compatibility. This paper presents the effect of multicycle instructions on the performance of a DTSVLIW architecture running the SPECint95 benchmarks. The classic approaches to providing ILP are VLIW and superscalar architectures. With VLIW, the compiler is required to extract the parallelism from the program and to build Very Long Instruction Words for execution. This leads to fast and (relatively) simple hardware, but puts a heavy responsibility on the compiler, and object code compatibility(4) is a problem. In Superscalar, the extraction of parallelism is done by the hardware which dynamically schedules the sequential instruction stream onto the functional units. The hardware is much more complex, and therefore slower than a corresponding VLIW design. The peak instruction feed into the functional units is lower for Superscalar. A number of approaches(1)(2)(3) have been examined that marry the advantages of the contrasting designs: the Superscalar dynamic extraction of ILP, the simplicity of the VLIW architectures. The approach presented here follows that first presented by Nair and Hopkins(3). Our architecture, the dynamically trace scheduled VLIW architecture (DTSVLIW)(4), demonstrates similar results to theirs in providing significant parallelism, but with a simpler architecture that should be much easier to implement. In our earlier work(5), we had zero-latency load/store instructions. Here we present results with more realistic latencies. 2 The DTSVLIW Architecture The DTSVLIW has two execution engines: the Primary Processor and the VLIW Engine, and two caches for instructions: the Instruction Cache and the VLIW Cache. The Primary Processor, a simple pipelined processor, fetches instructions and does the first execution of this code. The instruction trace it produces is dynamically scheduled by the Scheduler Unit into VLIW instructions, saved in blocks to the VLIW Cache for re-execution by the VLIW Engine. The Primary Engine, executing the Sparc-7 ISA, provides object-code compatibility; the VLIW Engine, VLIW performance and simplicity. The Scheduler unit works in parallel with the Primary Engine. Scheduling does not impact on VLIW performance as it does in Superscalar.
... Therefore the delays of these components will become relatively long compared with those of most execution units, as issue widths and window sizes are increased and feature sizes are reduced. But the relative increase of the delays of these components does not directly prevent the reduction of the cycle time, because the delays per cycle can be drastically reduced by pipelining and/or clustering [7, 9]. In other words, these decentralization techniques are indispensable for future processors with larger issue widths and window sizes and smaller feature sizes. ...
Conference Paper
The wakeup logic is part of the issue window and is responsible for managing the ready flags of the operands for dynamic instruction scheduling. The conventional wakeup logic is based on association, and is composed of a RAM and a CAM. Since the logic is not pipelinable and the delays of these memories are dominated by wire delays, the logic will become more critical with deeper pipelines and smaller feature sizes. This paper describes a new scheduling scheme based not on association but on matrices which represent the dependences between instructions. Since the update logic of the matrices detects the dependences between instructions as the register renaming logic does, the wakeup operation is realized by just reading the matrices. This paper also describes a technique to reduce the effective size of the matrices for small IPC penalties. We designed the layouts of the logic guided by a 0.18 µm CMOS design rule provided by Fujitsu Limited, and calculated the delays. We also evaluated the penalties by cycle-level simulation. The results show that our scheme achieves a 2.7 GHz clock speed for an IPC degradation of about 1%.
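A functional sketch of the matrix idea: each window slot owns a row of dependence bits, completion of a producer clears one column across all rows (instead of broadcasting a tag to a CAM), and an instruction is ready once its row is empty. The Python below is illustrative only; class and method names, the window size, and the use of sets instead of bit vectors are assumptions.

```python
# A small functional sketch of matrix-based wakeup (sizes and encoding are
# illustrative; real hardware uses wired-OR/NOR columns, not Python sets).

class MatrixScheduler:
    def __init__(self, window_size):
        self.window_size = window_size
        # deps[i] holds the window slots whose results instruction i waits on.
        self.deps = [set() for _ in range(window_size)]
        self.valid = [False] * window_size

    def insert(self, slot, producer_slots):
        """Dispatch: record a row of dependence bits for the new instruction."""
        self.deps[slot] = set(producer_slots)
        self.valid[slot] = True

    def wakeup(self, completed_slot):
        """Completion: clear one column across all rows (no tag broadcast/CAM)."""
        for row in self.deps:
            row.discard(completed_slot)

    def ready(self):
        """An instruction is ready when its whole row has been cleared."""
        return [i for i in range(self.window_size)
                if self.valid[i] and not self.deps[i]]


sched = MatrixScheduler(window_size=8)
sched.insert(0, [])        # no outstanding producers
sched.insert(1, [0])       # waits on slot 0
sched.insert(2, [0, 1])    # waits on slots 0 and 1
print(sched.ready())       # [0]
sched.wakeup(0)
print(sched.ready())       # [0, 1]  (slot 0 stays listed until it is issued)
```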
... In clustered architectures [22, 23, 24], the processor resources are split into two or more clusters. Each cluster is simpler, faster, and consumes less power than a monolithic architecture. ...
Conference Paper
We propose a subordinate threading scheme in which the main thread skips instructions that are guaranteed to be correctly executed by the subordinate thread. Speeding up the main thread increases the overall speed of the processor. Also, a faster main thread can detect the subordinate thread's mispredictions earlier, thereby cutting down the amount of time the subordinate thread spends on wrong-path instructions. Hence, the subordinate thread is now free to do more aggressive speculations. We developed a cycle-accurate simulator and evaluated our symbiotic subordinate threading scheme for the SPEC2000 integer benchmarks. Our results show an average performance improvement of 21% over a base subordinate threading scheme that does not let the main thread skip any instructions.
... In contrast, the sliced register file ensures that each lane needs only 4R+2W ports for the functional units. Note that, as opposed to clustering techniques [11], writes into one register file lane do not need to be made visible to other lanes. ...
Conference Paper
Full-text available
Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
... In CPR, a given physical register appears at most once in any map table. The reader may also recognize the RMap bitvector and its checkpoints as components of the Alpha 21264's CAM-style register renaming and map table checkpointing mechanisms, respectively [4]. Here, we use them for reference counting. ...
Article
Several proposed techniques including CPR (checkpoint processing and recovery) and NoSQ (no store queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although previously described in terms of binary counters, we find that reference counts are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations.
... We felt it was important to validate the performance of the baseline SimpleScalar model, so we ran the identical code on a 600 MHz Alpha 21264 workstation (running Tru64 Unix) with 1 GB of main memory. We loosely based our simulator baseline on the Alpha 21264 microprocessor, setting the resource configuration and their latencies to values published by Compaq [17]. The correlation in absolute performance was quite close; all except Mars and Twofish were within 10% of the actual machine tests. ...
Article
Full-text available
The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms necessary to implement accountability, accuracy, and confidentiality in communication. As demands for secure communication bandwidth grow, efficient cryptographic processing will become increasingly vital to good system performance. In this paper, we explore techniques to improve the performance of symmetric key cipher algorithms. Eight popular strong encryption algorithms are examined in detail. Analysis reveals the algorithms are computationally complex and contain little parallelism. Overall throughput on a high-end microprocessor is quite poor: a 600 MHz processor is incapable of saturating a T3 communication line with 3DES (triple DES) encrypted data. We introduce new instructions that improve the efficiency of the analyzed algorithms. Our approach adds instruction set support for fast substitutions, general permutations, rotates, and modular arithmetic. Performance analysis of the optimized ciphers shows an overall speedup of 59% over a baseline machine with rotate instructions and 74% speedup over a baseline without rotates. Even higher speedups are demonstrated with optimized substitutions (SBOXes) and additional functional unit resources. Our analyses of the original and optimized algorithms suggest future directions for the design of high-performance programmable cryptographic processors.
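Rotates are one of the operations such instruction-set support accelerates; as a small, hedged illustration of what a rotate instruction computes, here is a plain 32-bit rotate-left helper in Python (the function name and operand width are illustrative, not part of the paper's proposed extensions).

```python
# A 32-bit rotate-left, the software equivalent of a single rotate instruction.
# Ciphers such as those studied in the paper use rotates heavily in their rounds.

def rotl32(x, n):
    n &= 31  # keep the shift amount in range, as a hardware rotate would
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


print(hex(rotl32(0x80000001, 1)))   # -> 0x3 (the top bit wraps around to bit 0)
```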
... We do argue, however, that our techniques can be used as a potentially lower complexity, shorter clock cycle alternative to scheduling loads/stores, by incorporating the load/store scheduling functionality in the existing register scheduler. While memory dependence mispeculations are not an issue for a centralized, continuous window processor, future processors may utilize more aggressive front-ends and may have to rely on partitioning to balance between short clock cycles and larger instruction windows [26,82,66,44,87,90,72,25,85,32]. Under this different set of assumptions, instructions are not necessarily fetched in program order, and moreover, enforcing program order priority in the scheduler may not be possible. ...
Article
As the existing techniques that empower the modern high-performance processors are being refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, memory dependence prediction, that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique to guess whether a load or a store will experience a dependence. It exploits regularity in the memory dependence stream of ordinary programs, a phenomenon which is also identified in this thesis. To demonstrate the utility of memory dependence prediction, this thesis also presents the following three novel microarchitectural techniques: 1. Dynamic Speculation/Synchronization of Memory Dependences: this thesis demonstrates that, to exploit parallelism over larger regions of code, waiting to determine the dependences a load has is not the best-performing option. Higher performance is possible if memory dependence speculation is used, especially if memory dependence prediction is used to guide this speculation.
... In the case of a dynamic optimizer, on the other hand, profile data is generated and consumed in the very same run, and no data is written out for use offline or during a later run. Hardware implementations of dynamic optimizers are now commonplace in modern microprocessors [Kumar 1996; Song et al. 1995; Papworth 1996; Keller 1996]. The optimization unit is a fixed-size instruction window, with the optimization logic operating on the critical execution path. ...
Article
Full-text available
Dynamic optimization refers to the runtime optimization of a native program binary. This paper describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language, compiler, operating system or hardware support. The program binary is not instrumented and is left untouched during Dynamo's operation. Dynamo observes the program's behavior through interpretation to dynamically select hot instruction traces from the running program. The hot traces are optimized using low-overhead optimization techniques and emitted into a software code cache. Subsequent instances of these traces cause the cached version to be executed, resulting in a performance boost. Contrary to intuition, we ...
... Most current superscalar processors [17, 18, 16, 4] are based on the microarchitecture shown in Figure 1. The instruction fetch unit reads multiple instructions from the instruction cache, decodes them, and places them in instruction buffers for execution by the integer and floating-point subsystems. ...
Article
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.
... The mainstream model of computation used by algorithm designers in the last half century [von Neumann 1945] assumes a sequential processor with unit memory access cost. However, the mainstream computers sitting on our desktops have increasingly deviated from this model in the last decade [Hennessy and Patterson 1996; Intel Corporation 1997; Keller 1996; MIPS Technologies, Inc. 1998; Sun Microsystems 1997]. In particular, we usually distinguish at least four levels of memory hierarchy: a file of multiported registers can be accessed in parallel in every clock cycle. ...
Article
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We advocate the approach of adapting external memory algorithms to this purpose. We exemplify this approach and the practical issues involved by engineering a fast priority queue suited to external memory and cached memory which is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.
Article
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.
Article
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.
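A highly simplified sketch of the two ideas in this abstract, prediction and synchronization: a load speculates freely until it is caught violating a dependence, after which it is remembered together with the conflicting store and made to wait for that store in later executions. The table organization, method names, and PC-keyed indexing below are assumptions for illustration and do not reproduce the paper's Multiscalar-based mechanism.

```python
# A rough sketch of prediction-guided memory dependence speculation
# (illustrative only; the predictor organization here is an assumption).

class DependencePredictor:
    def __init__(self):
        # Load PCs that mis-speculated before, mapped to the store PC they
        # conflicted with; such loads are synchronized instead of speculated.
        self.conflicts = {}

    def should_speculate(self, load_pc):
        return load_pc not in self.conflicts

    def record_violation(self, load_pc, store_pc):
        self.conflicts[load_pc] = store_pc

    def may_issue(self, load_pc, completed_store_pcs):
        """Synchronization: a predicted-dependent load waits for its store."""
        if self.should_speculate(load_pc):
            return True
        return self.conflicts[load_pc] in completed_store_pcs


pred = DependencePredictor()
print(pred.should_speculate(0x40))         # True: speculate the first time
pred.record_violation(0x40, 0x20)          # squash detected -> remember the pair
print(pred.may_issue(0x40, set()))         # False: wait for the store at 0x20
print(pred.may_issue(0x40, {0x20}))        # True: store done, load may issue
```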
Conference Paper
The continued increase in microprocessor clock frequency that has come from advancements in fabrication technology and reductions in feature size creates challenges in maintaining both manufacturing yield rates and long-term reliability of devices. Methods based on defect detection and reduction may not offer a scalable solution due to the cost of eliminating contaminants in the manufacturing process and increasing chip complexity. This paper proposes to use the inherent redundancy available in existing and future chip microarchitectures to improve yield and enable graceful performance degradation in fail-in-place systems. We introduce a new yield metric called performance averaged yield (Ypav) which accounts both for fully functional chips and those that exhibit some performance degradation. Our results indicate that at 250nm we are able to increase the Ypav of a uniprocessor with only redundant rows in its caches from a base value of 85% to 98% using microarchitectural redundancy. Given constant chip area, shrinking feature sizes increase fault susceptibility and reduce the base Ypav to 60% at 50nm, which exploiting microarchitectural redundancy then increases to 99.6%.
Article
Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme — reference counting. We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders. We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables.
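The bit-matrix representation is easy to mimic in software: one row per holder (e.g., the map table or an in-flight instruction), one column per physical register, and a register is free exactly when its column is all zeros, which hardware obtains by NOR-ing the column. The sketch below is illustrative; sizes, names, and the explicit free-list scan are assumptions rather than the paper's circuit design.

```python
# A minimal sketch of bit-matrix register reference counting (sizes are
# illustrative; hardware would NOR each column to form a free-list bitvector).

class RefCountMatrix:
    def __init__(self, num_regs, num_holders):
        # matrix[holder][reg] == 1 means that holder references that register.
        self.matrix = [[0] * num_regs for _ in range(num_holders)]
        self.num_regs = num_regs

    def acquire(self, holder, reg):
        self.matrix[holder][reg] = 1

    def release(self, holder, reg):
        self.matrix[holder][reg] = 0

    def free_list(self):
        """A register is free when its entire column is zero (the NOR)."""
        return [r for r in range(self.num_regs)
                if not any(row[r] for row in self.matrix)]


rc = RefCountMatrix(num_regs=8, num_holders=4)
rc.acquire(holder=0, reg=3)    # e.g. the map table points at p3
rc.acquire(holder=1, reg=3)    # e.g. an in-flight instruction also reads p3
print(3 in rc.free_list())     # False: two rows still reference p3
rc.release(0, 3)
rc.release(1, 3)
print(3 in rc.free_list())     # True: column 3 is all zeros again
```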
Article
Digital signal processors (DSPs) have been widely used in processing video and audio streaming data. Due to the huge amount of streaming data, increasing throughput is the key issue in designing DSP architectures. One way to increase the throughput of a DSP is to increase its instruction level parallelism. To increase instruction level parallelism, many architectures have been proposed; they can be classified into two main approaches, the superscalar and the VLIW architectures. Among the hardware architectures, the VLIW attracts a lot of attention due to its simple hardware complexity. However, the VLIW architecture suffers from an explosion of instruction memory due to the overhead of instruction grouping. In this paper, we propose a novel DSP architecture which contains three pipelines and performs dynamic instruction grouping in hardware. The experimental results show that our architecture can reduce memory requirements by 13% on average while maintaining the same performance.
Conference Paper
This article describes the application of advanced manufacturing technology in the national tourism industry, covering the digital development, exhibition, and cultural heritage protection of ethnic characteristic tourism products, and introduces specific uses of computer-aided design, reverse engineering, rapid prototyping manufacturing technology and information network technology in national tourism product development. This has far-reaching significance for improving the efficiency of product development and for the development of the national characteristic tourism industry.
Conference Paper
Traces are dynamic instruction sequences constructed and cached by hardware. A microarchitecture organized around traces is presented as a means for efficiently executing many instructions per cycle. Trace processors exploit both control flow and data flow hierarchy to overcome complexity and architectural limitations of conventional superscalar processors by (1) distributing execution resources based on trace boundaries and (2) applying control and data prediction at the trace level rather than individual branches or instructions. Three sets of experiments using the SPECInt95 benchmarks are presented. (i) A detailed evaluation of trace processor configurations: the results affirm that significant instruction-level parallelism can be exploited in integer programs (2 to 6 instructions per cycle). We also isolate the impact of distributed resources, and quantify the value of successively doubling the number of distributed elements. (ii) A trace processor with data prediction applied to inter-trace dependences: potential performance improvement with perfect prediction is around 45% for all benchmarks. With realistic prediction, gcc achieves an actual improvement of 10%. (iii) Evaluation of aggressive control flow: some benchmarks benefit from control independence by as much as 10%
Conference Paper
Increasing diversity in packet-processing applications and rapid increases in channel bandwidth lead to greater complexity in communication protocols. These factors result in larger computational loads for packet-processing engines, which makes high-performance microprocessor designs an important solution. This paper presents an exhaustive simulation for exploring the performance of instruction-level parallel superscalar processors executing packet-processing applications. Based on the simulation results, a design space exploration has been used to derive a performance-efficient, application-specific superscalar processor architecture based on the MIPS instruction set architecture. The SimpleScalar architecture toolset has been used for design space exploration, and network applications have been investigated to guide the architecture exploration. The optimizations achieve up to 80% improvement in performance for representative packet-processing applications.
Conference Paper
Full-text available
Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e. predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques
Conference Paper
Not Available
Article
Thesis (Ph. D.)--University of Wisconsin--Madison, 2003. Includes bibliographical references (p. 239-244). Photocopy.
Conference Paper
The use of CC-NUMA multiprocessors complicates the placement of physical memory pages. Memory closest to a processor provides the best access time, but optimal memory page placement is a difficult problem with process movement, multiple processes requiring access to the same physical memory page, and application behavior changing over execution time. We use dynamic page placement to move memory pages where needed for the database benchmark TPC-C executing on a four node CC-NUMA multiprocessor. Dynamic page placement achieves local memory accesses up to 73% of the time instead of the static page placement results of 34% locality achieved with first touch and 25% with round robin. This can result in a 17% improvement in performance.
Conference Paper
Existing schemes for cache energy optimization incorporate a limited degree of dynamic associativity: either direct-mapped or the full available associativity (say, 4-way). In this paper, we explore a more general design space for dynamic associativity (for a 4-way associative cache, consider 1-way, 2-way, and 4-way associative accesses). The other major departure is in the associativity control mechanism. We use the actual instruction level parallelism exhibited by the instructions surrounding a given load to classify it as an IPC k load (for 1≤k≤IW with an issue width of IW) in a superscalar architecture. The lookup schedule is fixed in advance for each IPC classifier 1≤k≤IW. The schedules are as way-disjoint as possible for load/stores with different IPC classifications. The energy savings over SPEC2000 CPU benchmarks average 28.6% for a 32 kB, 4-way L1 data cache. The resulting performance (IPC) degradation from the dynamic way schedule is restricted to less than 2.25%, mainly because IPC based placement ends up being an excellent classifier.
Conference Paper
The continued increase in microprocessor clock frequency that has come from advancements in fabrication technology and reductions in feature size creates challenges in maintaining both manufacturing yield rates and long-term reliability of devices. Methods based on defect detection and reduction may not offer a scalable solution due to the cost of eliminating contaminants in the manufacturing process and increasing chip complexity. We propose to use the inherent redundancy available in existing and future chip microarchitectures to improve yield and enable graceful performance degradation in fail-in-place systems. We introduce a new yield metric called performance averaged yield (Ypav), which accounts both for fully functional chips and those that exhibit some performance degradation. Our results indicate that at 250nm we are able to increase the Ypav of a uniprocessor with only redundant rows in its caches from a base value of 85% to 98% using microarchitectural redundancy. Given constant chip area, shrinking feature sizes increase fault susceptibility and reduce the base Ypav to 60% at 50nm, which exploiting microarchitectural redundancy then increases to 99.6%.
Conference Paper
The rename map table (RMT) access and the dependence check logic (DCL) delays scale unfavorably with the dispatch width (DW) of a superscalar processor. It is a well-known program property that the results of most instructions are consumed within the following 4-6 instruction window. This program behavior can be exploited to reduce the rename delay by reducing the number of read/write ports in the RMT to significantly below the current 3*DW. We propose an algorithm to dynamically allocate a reduced number of RMT ports to instructions in the current dispatch window, matching dispatch resources to average needs rather than peak needs. This results in shorter RMT access delays as well as lower energy in the dispatch stage. The IPC reduction due to rename map table read/write port contention in the proposed scheme stays within 2-4%. The cycle time saved can also be leveraged to support wider dispatch in the same cycle time in order to offset this degradation.
Conference Paper
This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which inhibit soft errors in combinational logic. We quantify the SER due to high-energy neutrons in SRAM cells, latches, and logic circuits for feature sizes from 600 nm to 50 nm and clock periods from 16 to 6 fan-out-of-4 inverter delays. Our model predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes that computer system designers must address the risks of soft errors in logic circuits for future designs.
Conference Paper
Full-text available
An increasingly large portion of scheduler latency is derived from the monolithic content addressable memory arrays accessed during instruction wake-up. The performance of the scheduler can be improved by decreasing the number of tag comparisons necessary to schedule instructions. Using detailed simulation-based analyses, we find that most instructions enter the window with at least one of their input operands already available. By putting these instructions into specialized windows with fewer tag comparators, load capacitance on the scheduler critical path can be reduced, with only very small effects on program throughput. For instructions with multiple unavailable operands, we introduce a last-tag speculation mechanism that eliminates all remaining tag comparators except those for the last arriving input operand. By combining these two tag-reduction schemes, we are able to construct dynamic schedulers with approximately one quarter of the tag comparators found in conventional designs. Conservative circuit-level timing analyses indicate that the optimized designs are 20-45% faster and require 10-20% less power depending on instruction window size
Conference Paper
Full-text available
The growth of the Internet as a vehicle for secure communication and electronic commerce has brought cryptographic processing performance to the forefront of high throughput system design. This trend will be further underscored with the widespread adoption of secure protocols such as secure IP (IPSEC) and virtual private networks (VPNs). In this paper, we introduce the CryptoManiac processor, a fast and flexible co-processor for cryptographic workloads. Our design is extremely efficient; we present analysis of a 0.25 µm physical design that runs the standard Rijndael cipher algorithm 2.25 times faster than a 600 MHz Alpha 21264 processor. Moreover, our implementation requires 1/100th the area and power in the same technology. We demonstrate that the performance of our design rivals a state-of-the-art dedicated hardware implementation of the 3DES (triple DES) algorithm, while retaining the flexibility to simultaneously support multiple cipher algorithms. Finally, we define a scalable system architecture that combines CryptoManiac processing elements to exploit inter-session and inter-packet parallelism available in many communication protocols. Using I/O traces and detailed timing simulation, we show that chip multiprocessor configurations can effectively service high throughput applications including secure web and disk I/O processing.
Conference Paper
DS (Decoupled-Superscalar) is a new microarchitecture that combines decoupled and superscalar techniques to exploit instruction level parallelism. Issue bandwidth is increased while circuit complexity growth is controlled with little negative impact on performance. Programs for DS are compiled into two instruction substreams: the dominant substream navigates the control flow and the rest of computational task is shared between the dominant and subsidiary substreams. Each substream is processed by a separate superscalar core realizable with current VLSI technology. DS machines are binary compatible with superscalar machines having the same instruction set, and a family of DS machines is binary compatible without recompilation. DS run time behavior is examined with an analytical model. A novel technique for controlling slip between substreams is introduced. Code partitioning issues of instruction count balancing and residence time balancing, important to any split-stream scheme, are discussed. Simulation shows DS achieves performance comparable to an aggressive superscalar, but with potentially less complex hardware and faster clock rate
Conference Paper
Branch prediction is crucial to maintaining high performance in modern superscalar processors. Conditional branches can occur as frequently as one in every 5 or 6 instructions in nonnumeric programs, leading to heavy misprediction penalties in current superpipelined, superscalar architectures. Realization of this fact has led to an upsurge in the number of recent research publications in this area. This has been accompanied by the processor manufacturers implementing more complex branch prediction hardware in recent processors. This paper surveys the status of research in static and dynamic branch prediction and also how these research findings are being employed by current processor designers. The paper ends with an evaluation of future directions of branch prediction.
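As a concrete instance of the dynamic schemes such surveys cover, here is the textbook two-bit saturating-counter predictor in Python; the table size, indexing by low PC bits, and initial counter state are conventional choices assumed for illustration, not taken from this paper.

```python
# A textbook two-bit saturating-counter branch predictor, included only as a
# concrete example of a simple dynamic prediction scheme.

class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)     # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # True means "predict taken"

    def update(self, pc, taken):
        idx = pc & self.mask
        if taken:
            self.table[idx] = min(3, self.table[idx] + 1)
        else:
            self.table[idx] = max(0, self.table[idx] - 1)


bp = TwoBitPredictor()
for outcome in [True, True, False, True]:        # a mostly-taken branch at pc 0x100
    print(bp.predict(0x100), outcome)            # prediction vs. actual outcome
    bp.update(0x100, outcome)
```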
Conference Paper
Increasingly wider superscalar processors are experiencing diminishing performance returns while requiring larger portions of die area dedicated to control rather than datapath. As an alternative to using these processors to exploit parallelism effectively, we are investigating the viability of using single-chip vector microprocessors. This paper presents some initial results of our investigation where we compare the performance and cost of vector microprocessors to that of aggressive, out-of-order superscalar microprocessors. On the performance side, we show that vector processors are able to execute a highly parallel, integer-based application 1.5-7.3 times faster than superscalar processors can by exploiting parallelism more effectively. This ability stems from the use of vector instructions. Vector instructions exploit parallelism across loop iterations by implicitly re-scheduling operations and temporally localizing the parallelism. Vector instructions also reduce instruction bandwidth by more than an order of magnitude because they express an abundance of parallelism in a compact encoding. On the cost side we show that, to achieve these performance gains, highly parallel, integer-based vector microprocessors are no more costly to implement than existing in-order and out-of-order superscalar microprocessors. One reason for this is that the organization of a vector register file provides tremendous bandwidth without incurring a large area penalty. A second reason is that the control logic for issuing vector instructions is relatively simple. Both the performance gains and cost savings are possible because vector processors rely on a vectorizing compiler, rather than hardware, to detect parallelism and to express it in a compact form to the hardware. These initial results suggest that transferring this functionality to the compiler offers a tremendous performance/cost benefit
Conference Paper
The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. It is capable of supplying multiple fetch blocks each cycle. We examine two fetch and issue techniques, partial matching and inactive issue, that improve the overall performance of the trace cache by improving the effective fetch rate. We show that for the SPECint95 benchmarks partial matching increases the overall performance by 12% and adding inactive issue increases performance by 15%. Furthermore we apply these two techniques to issue blocks from trace segments which contain multiple execution paths. We conclude with a performance comparison between a trace cache implementing partial matching and inactive issue and an aggressive single block fetch mechanism. The trace cache increases performance by an average of 25% over the instruction cache
Conference Paper
Full-text available
Next-generation wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches and dedicated memory ports while minimizing the cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512-kByte L2 cache, board-level caches provide significant enough performance gains to justify their higher cost
Article
In this paper, we evaluate the performance of high bandwidth cache organizations employing multiple cache ports, multiple cycle hit times, and cache port efficiency enhancements, such as load all and line buffer, to find the organization that provides the best processor performance. Using a dynamic superscalar processor running realistic benchmarks that include operating system references, we use execution time to measure processor performance. When the cache is limited to a single cache port without enhancements, we find that two cache ports increase processor performance by 25 percent. With the addition of line buffer and load all to a single-ported cache, the processor achieves 91 percent of the performance of the same processor containing a cache with two ports. When the processor is not limited to a single cache port, the results show that a large dual-ported multicycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and using a line buffer provide the bandwidth needed by a dynamic superscalar processor. The line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
Article
Full-text available
Much emphasis is now being placed on chip-multiprocessor (CMP) architectures for exploiting thread-level parallelism in applications. In such architectures, speculation may be employed to execute applications that cannot be parallelized statically. In this paper, we present an efficient CMP architecture for the speculative execution of sequential binaries without source recompilation. We present software support that enables the identification of threads from a sequential binary. The hardware includes a memory disambiguation mechanism that enables the detection of interthread memory dependence violations during speculative execution. This hardware is different from past proposals in that it does not rely on a snoopy-based cache-coherence protocol. Instead, it uses an approach similar to a directory-based scheme. Furthermore, the architecture includes a simple and efficient hardware mechanism to enable register-level communication between on-chip processors. Evaluation of this software-hardware approach shows that it is quite effective in achieving high performance when running sequential binaries
Article
In this paper, we examine some critical design features of a trace cache fetch engine for a 16-wide issue processor and evaluate their effects on performance. We evaluate path associativity, partial matching, and inactive issue, all of which are straightforward extensions to the trace cache. We examine features such as the fill unit and branch predictor design. In our final analysis, we show that the trace cache mechanism attains a 28 percent performance improvement over an aggressive single block fetch mechanism and a 15 percent improvement over a sequential multiblock mechanism
Article
Full-text available
Next generation, wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches, and dedicated memory ports while minimizing cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging, and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512KB L2 cache, board-level caches provide significant enough performance gains to justify their higher cost. 1 Introduction Increasing attention has been focused on the memory hierarchy performance of microprocessor-based systems due to the growing processor/memory speed gap and memory bandwidth demands of modern superscalar designs. In order to prevent the next-generation of wide-issue, highly-integrated m...
Conference Paper
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is not self-evident since caches have a fixed and quite restrictive algorithm choosing the content of the cache. We investigate the impact of this restriction for the frequently occurring case of access to multiple sequences. We show that any access pattern to k = Θ(M/B^(1+1/a)) sequential data streams can be efficiently supported on an a-way set associative cache with capacity M and line size B. The bounds are tight up to lower order terms.
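For readability, the bound quoted in the abstract can be restated with its symbols spelled out (k concurrently accessed sequences, cache capacity M, line size B, associativity a):

```latex
% Bound from the abstract: the number of sequential streams that an
% a-way set-associative cache of capacity M and line size B supports efficiently.
\[
  k \;=\; \Theta\!\left(\frac{M}{B^{\,1+1/a}}\right)
\]
```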