Conference Paper

Exceeding the Dataflow Limit via Value Prediction.


Abstract

For decades, the serialization constraints imposed by true data dependences have been regarded as an absolute limit (the dataflow limit) on the parallel execution of serial programs. This paper proposes a new technique, value prediction, for exceeding that limit by allowing data-dependent instructions to issue and execute in parallel without violating program semantics. The technique is built on the concept of value locality, which describes the likelihood that a previously-seen value will recur within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values, and we find that the register values written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 can exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% (depending on machine model) by exceeding the dataflow limit.
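The value-locality claim above can be illustrated with a small measurement sketch. The trace format and the program counters below are hypothetical, chosen only to show how per-instruction value recurrence would be quantified:

```python
def value_locality(trace):
    """Fraction of dynamic results that match the previous result
    produced by the same static instruction (depth-1 value locality)."""
    last = {}            # static instruction PC -> last result value
    hits = total = 0
    for pc, value in trace:
        if pc in last:
            total += 1
            if last[pc] == value:
                hits += 1
        last[pc] = value
    return hits / total if total else 0.0

# Hypothetical trace of (PC, produced register value) pairs:
# a loop-invariant load (0x10) and a loop counter increment (0x14).
trace = [(0x10, 7), (0x14, 1), (0x10, 7), (0x14, 2), (0x10, 7), (0x14, 3)]
print(value_locality(trace))  # 0.5: the load repeats, the counter does not
```

A real study would gather such statistics over full benchmark traces; the point here is only that locality is a per-static-instruction property.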
... Parallelism increases each time the value of an instruction is correctly predicted. The technique was proposed between 1995 and 1997 by four distinct groups in [10][11][12][13][14][15]. ...
... In [12], the authors proposed a simple VP, called last-VP, which uses the last value computed by a previous execution of the instruction as its prediction. Such a predictor decides whether to predict using a fairly rudimentary confidence-counter mechanism, and achieves a performance gain of 4.5% to 23%. ...
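The last-VP mechanism described above can be sketched in a few lines. The table organization and the 2-bit counter thresholds here are illustrative assumptions, not the published design:

```python
class LastValuePredictor:
    """Last-value predictor gated by a 2-bit saturating confidence
    counter: speculate with the last seen value only after it has
    repeated enough times."""
    def __init__(self, predict_threshold=2):
        self.table = {}  # PC -> (last_value, confidence 0..3)
        self.threshold = predict_threshold

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry and entry[1] >= self.threshold:
            return entry[0]   # confident: issue dependents speculatively
        return None           # not confident: do not speculate

    def update(self, pc, actual):
        value, conf = self.table.get(pc, (None, 0))
        if value == actual:
            conf = min(conf + 1, 3)   # saturate upward on a repeat
        else:
            conf = max(conf - 1, 0)   # back off after a change
        self.table[pc] = (actual, conf)  # always remember the last value

p = LastValuePredictor()
for observed in [7, 7, 7]:       # a loop-invariant result stream
    p.update(0x40, observed)
print(p.predict(0x40))  # 7: confidence reached, so the value is predicted
```

The confidence gate is what keeps misprediction (and hence recovery) cost bounded: unstable instructions simply never cross the threshold.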
... Microarchitectural optimizations widely implemented in commercial processors, like branch predictors [4], [5], [49], [51], [100], caches [68], [136], [188] and prefetchers [1], [40], [44], [66], [158], [175], among others, are prone to attacks. A recent paper [146] demonstrates that several optimizations proposed in the literature but not yet known to be commercially implemented, such as value prediction [111], are also vulnerable. This underscores the importance of conducting a thorough review of microarchitectural optimizations with an emphasis on security. ...
... Value prediction: This is a speculative optimization that aims to increase instruction-level parallelism (ILP) and hide memory access latency by predicting values for load misses, thereby breaking instruction dependencies [111], [112], [135], [154], [156]. Accurate predictions can improve ILP by increasing the overlap between memory accesses and useful computation, while mispredictions lead to pipeline squashes and re-execution of instructions. ...
Preprint
Full-text available
Microarchitectural optimizations are expected to play a crucial role in ensuring performance scalability in future technology nodes. However, recent attacks have demonstrated that microarchitectural optimizations, which were assumed to be secure, can be exploited. Moreover, new attacks surface at a rapid pace, limiting the scope of existing defenses. These developments prompt the need to review microarchitectural optimizations with an emphasis on security, understand the attack landscape and the potential defense strategies. We analyze timing-based side-channel attacks targeting a diverse set of microarchitectural optimizations. We provide a framework for analyzing non-transient and transient attacks, which highlights the similarities. Through our systematic analysis, we identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation, and information flow. Our key insight is that a subset (or all) of the root causes are exploited by attacks, and eliminating any of the exploited root causes, in any attack step, is enough to provide protection. Leveraging our framework, we systematize existing defenses and show that they target these root causes in the different attack steps.
... This is a drawback because the flushing process will take additional cycles. The technique was proposed between 1995 and 1997 by four distinct groups [19][20][21][22][23]. In [24], an extensive review of the existing value predictors was carried out, also covering the security and data consistency implications of this speculative technique. ...
Article
Full-text available
Benchmarks play an essential role in the performance evaluation of novel research concepts. Their effectiveness diminishes if they fail to exploit the available hardware of the evaluated microprocessor or, more broadly, if they are not consistent in comparing various systems. An empirical analysis of the well-established Splash-2 benchmark suite vs. the latest version, Splash-4, was performed. It was shown that on a 64-core configuration, half of the simulated benchmarks reach temperatures well beyond the critical threshold of 105 °C, emphasizing the necessity of a multi-objective evaluation from at least the following perspectives: energy consumption, performance, chip temperature, and integration area. During the analysis, it was observed that the cores spend a large amount of time in the idle state, around 45% on average in some configurations. This can be exploited by implementing a predictive dynamic voltage and frequency scaling (DVFS) technique called the Simple Core State Predictor (SCSP) to enhance the Intel Nehalem architecture and to simulate it using Sniper. The aim was to decrease overall energy consumption by reducing power consumption at the core level while maintaining the same performance. Moreover, the SCSP technique, which operates on core-level abstract information, was applied in parallel with a Value Predictor (VP) or a Dynamic Instruction Reuse (DIR) technique, which rely on instruction-level information. Using the SCSP alone, a 9.95% reduction in power consumption and an energy reduction of 10.54% were achieved while maintaining performance. By combining the SCSP with the VP technique, a performance increase of 8.87% was obtained while reducing power and energy consumption by 3.13% and 8.48%, respectively.
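The core-state prediction idea above can be sketched as a small simulation. The abstract does not give the SCSP internals, so the history length, majority rule, and frequency values below are assumptions made purely for illustration:

```python
from collections import deque

class SimpleCoreStatePredictor:
    """Illustrative per-core idle/busy predictor driving DVFS, loosely
    modeled on the SCSP idea; not the published design."""
    def __init__(self, history_len=4):
        self.history = deque(maxlen=history_len)

    def observe(self, is_idle):
        self.history.append(is_idle)   # record the last interval's state

    def predict_idle(self):
        # Majority vote over recent intervals; default to busy.
        return sum(self.history) > len(self.history) / 2

def next_frequency(pred, f_low=1.2e9, f_high=2.66e9):
    # Lower the voltage/frequency pair when the core is expected to idle.
    return f_low if pred.predict_idle() else f_high

p = SimpleCoreStatePredictor()
for state in [True, True, True, False]:   # mostly-idle core
    p.observe(state)
print(next_frequency(p))  # 1200000000.0: predicted idle -> low frequency
```

The point of operating on such core-level abstract state is that it composes with instruction-level techniques like VP or DIR, as the abstract notes.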
Preprint
Full-text available
Load instructions often limit instruction-level parallelism (ILP) in modern processors due to data and resource dependences they cause. Prior techniques like Load Value Prediction (LVP) and Memory Renaming (MRN) mitigate load data dependence by predicting the data value of a load instruction. However, they fail to mitigate load resource dependence as the predicted load instruction gets executed nonetheless. Our goal in this work is to improve ILP by mitigating both load data dependence and resource dependence. To this end, we propose a purely-microarchitectural technique called Constable, that safely eliminates the execution of load instructions. Constable dynamically identifies load instructions that have repeatedly fetched the same data from the same load address. We call such loads likely-stable. For every likely-stable load, Constable (1) tracks modifications to its source architectural registers and memory location via lightweight hardware structures, and (2) eliminates the execution of subsequent instances of the load instruction until there is a write to its source register or a store or snoop request to its load address. Our extensive evaluation using a wide variety of 90 workloads shows that Constable improves performance by 5.1% while reducing the core dynamic power consumption by 3.4% on average over a strong baseline system that implements MRN and other dynamic instruction optimizations (e.g., move and zero elimination, constant and branch folding). In presence of 2-way simultaneous multithreading (SMT), Constable's performance improvement increases to 8.8% over the baseline system. When combined with a state-of-the-art load value predictor (EVES), Constable provides an additional 3.7% and 7.8% average performance benefit over the load value predictor alone, in the baseline system without and with 2-way SMT, respectively.
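The likely-stable load mechanism described above can be sketched as a small table model. The threshold, entry format, and method names are illustrative assumptions; the real design is a hardware structure, not software:

```python
class LikelyStableLoadTable:
    """Toy model of Constable-style load elimination: once a static load
    has repeatedly returned the same value from the same address, later
    instances are satisfied from the table until a write to a source
    register, or a store/snoop to the tracked address, invalidates it."""
    STABLE_AFTER = 2  # identical repetitions before elimination (assumed)

    def __init__(self):
        self.entries = {}  # pc -> {"addr", "value", "count", "srcs"}

    def observe_load(self, pc, addr, value, src_regs):
        e = self.entries.get(pc)
        if e and e["addr"] == addr and e["value"] == value:
            e["count"] += 1                      # same data, same address
        else:
            self.entries[pc] = {"addr": addr, "value": value,
                                "count": 1, "srcs": set(src_regs)}

    def eliminated_value(self, pc, addr):
        e = self.entries.get(pc)
        if e and e["addr"] == addr and e["count"] >= self.STABLE_AFTER:
            return e["value"]   # skip execution, reuse the tracked value
        return None

    def on_store_or_snoop(self, addr):
        self.entries = {pc: e for pc, e in self.entries.items()
                        if e["addr"] != addr}

    def on_register_write(self, reg):
        self.entries = {pc: e for pc, e in self.entries.items()
                        if reg not in e["srcs"]}

t = LikelyStableLoadTable()
t.observe_load(pc=0x1c, addr=0x8000, value=42, src_regs=["r1"])
t.observe_load(pc=0x1c, addr=0x8000, value=42, src_regs=["r1"])
print(t.eliminated_value(0x1c, 0x8000))  # 42: load execution skipped
t.on_store_or_snoop(0x8000)
print(t.eliminated_value(0x1c, 0x8000))  # None: entry invalidated
```

Unlike value prediction, nothing here is speculative: elimination happens only while the invalidation conditions provably have not fired, which is why no recovery path is needed.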
Article
In recent years, natural, flexible, and contactless vision-based gesture recognition has received significant attention in human-computer interaction. However, employing convolutional neural networks (CNNs) for RGB or RGB-D gestures can result in excessive power consumption and poor energy efficiency, making them unsuitable for embedded systems. In this paper, we propose a lightweight sparse binarized neural network (sBNN) model for edge gesture recognition that achieves an accuracy of 89.43%-99.92% on four open-source gesture datasets with ≤20.26 million operations (MOP) and ≤15.83 KB of parameters. We find high channel-level sparsity in the activation maps of the sBNN when edge gestures are used as inputs. The sparse activation maps contain multiple identical activation vectors, called sparse activation vectors (SAVs), which lead to highly repeated calculations. To avoid this issue, we propose a two-stage value prediction approach to skip these calculations, achieving a speedup of 1.03x-1.83x. Moreover, to reduce on-chip memory, a compression technique is applied to the sparse activation maps, providing a compression rate of 1.72x-3.45x. Finally, we implement an energy-efficient sparse BNN accelerator (SBA) on an embedded field-programmable gate array (FPGA). The experimental results show that our SBA has a latency of 26.3-46.8 µs, a power consumption of 0.807 W, and an energy efficiency of 536.22-952.70 GOPS/W at a 50-MHz frequency. Our SBA offers lower latency, lower power consumption, and higher energy efficiency than previous state-of-the-art gesture recognition accelerators.
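The computation-skipping idea behind SAVs can be illustrated with a memoization sketch. The dot-product workload and the data below are simplified stand-ins, not the paper's two-stage predictor:

```python
def conv_channel_outputs(activation_vectors, weights):
    """Skip repeated work on identical sparse activation vectors (SAVs):
    each distinct vector is computed once and the result reused."""
    cache = {}
    outputs, computed = [], 0
    for vec in activation_vectors:
        key = tuple(vec)
        if key not in cache:
            cache[key] = sum(a * w for a, w in zip(vec, weights))
            computed += 1                 # actual multiply-accumulate work
        outputs.append(cache[key])        # repeats are free
    return outputs, computed

# Binarized activations repeat heavily; only 2 of the 4 rows are computed.
acts = [[1, -1, 1], [1, -1, 1], [-1, -1, 1], [1, -1, 1]]
outs, n = conv_channel_outputs(acts, [2, 3, 1])
print(outs, n)  # [0, 0, -4, 0] 2
```

In hardware this lookup would be a prediction structure rather than a software dictionary, but the saving is the same: the speedup scales with how often activation vectors repeat.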
Conference Paper
This work presents the SEMPRE architecture, designed to execute multiple processes rather than multiple threads, thereby taking advantage of the large number of processes present on shared workstations and network servers. The architecture employs novel mechanisms that enable early context switching between processes, as well as the fast removal of inconvenient instructions. In addition, it provides facilities for the operating system to manage processes with minimal effort.
Chapter
This chapter delves into memory consistency models (a.k.a. memory models) that define the behavior of shared memory systems for programmers and implementors. These models define correctness so that programmers know what to expect and implementors know what to provide. We first motivate the need to define memory behavior (Section 3.1), say what a memory consistency model should do (Section 3.2), and compare and contrast consistency and coherence (Section 3.3).
Article
Full-text available
This paper describes the architectural and organizational tradeoffs made during the design of the MultiTitan and provides data supporting the decisions made. These decisions covered the entire space of processor design, from the instruction set and virtual memory architecture through the pipeline and organization of the machine. In particular, some of the tradeoffs involved the use of an on-chip instruction cache with an off-chip TLB and floating-point unit, the use of direct-mapped instead of associative caches, the choice of a 64-bit vs. a 32-bit data bus, and the implementation of hardware pipeline interlocks.
Article
We present an approach, called software prefetching, to reducing cache miss latencies. By providing a nonblocking prefetch instruction that causes data at a specified memory address to be brought into cache, the compiler can overlap the memory latency with other computation. Our simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.
Article
To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This paper introduces a simple hardware mechanism, referred to as the memory conflict buffer, which facilitates static code scheduling in the presence of memory store/load dependences. Correct program execution is ensured by the memory conflict buffer and repair code provided by the compiler. With this addition, significant speedup over an aggressive code scheduling model can be achieved for both non-numerical and numerical programs.
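The conflict-detection-plus-repair protocol above can be sketched as a toy model. The structure here is a software stand-in for the hardware buffer, and the method names are assumptions for illustration:

```python
class MemoryConflictBuffer:
    """Toy model of a memory conflict buffer: a speculatively hoisted
    load records its address; an intervening store to the same address
    flags a conflict so compiler-inserted repair code can re-read."""
    def __init__(self):
        self.speculative = {}   # address -> conflict flag

    def speculative_load(self, mem, addr):
        self.speculative[addr] = False   # begin tracking this address
        return mem[addr]

    def store(self, mem, addr, value):
        mem[addr] = value
        if addr in self.speculative:
            self.speculative[addr] = True   # a load was moved above us

    def check(self, mem, addr, spec_value):
        # Compiler-inserted check at the load's original position.
        conflict = self.speculative.pop(addr, False)
        return mem[addr] if conflict else spec_value   # repair if needed

mem = {0x100: 5}
mcb = MemoryConflictBuffer()
v = mcb.speculative_load(mem, 0x100)   # load hoisted above the store
mcb.store(mem, 0x100, 9)               # ambiguous store lands later
v = mcb.check(mem, 0x100, v)           # repair code re-reads on conflict
print(v)  # 9
```

The key property is that the common (no-conflict) case pays only the cheap check, while correctness in the rare aliasing case is preserved by the repair path.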
Article
Growing interest in ambitious multiple-issue machines and heavily-pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common.
Article
Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be hardware-based, software-directed, or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching, whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing memory latency with the least overhead.
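The benefit of hardware prefetching on linear array references, noted above, can be shown with a miss-counting sketch. A fully associative, never-evicting cache and a one-block-lookahead prefetcher are assumed purely for illustration:

```python
def misses(addresses, line_size=64, prefetch_next=False):
    """Count cache-line misses for an access stream, optionally with a
    one-block-lookahead (next-line) hardware prefetcher."""
    cache = set()
    miss = 0
    for a in addresses:
        line = a // line_size
        if line not in cache:
            miss += 1
            cache.add(line)
        if prefetch_next:
            cache.add(line + 1)   # fetch the sequentially next line early
    return miss

stream = list(range(0, 1024, 8))        # linear array walk, 8-byte elements
print(misses(stream), misses(stream, prefetch_next=True))  # 16 1
```

A real evaluation would also have to charge the prefetcher for the extra memory traffic it generates, which is exactly the tradeoff the simulations above measure.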
Article
Ambiguous memory references have always been one of the main sources of performance bottlenecks. Many papers have addressed this problem using static disambiguation. These methods work extremely well when the memory access pattern is linear and predictable. However, they are ineffective when the memory access pattern is nonlinear or when the access pattern cannot be determined statically. For these difficult problems, this paper presents speculative disambiguation, a compilation technique for architectures supporting instruction-level parallelism and either speculative execution or conditional execution (or both). This technique produces specialized code at compile time to disambiguate memory references at run time. It is shown that on machines with sufficient resources, the technique will always result in lower execution time. Speculative disambiguation has been implemented for a VLIW architecture with guarded execution. Preliminary results indicate that it can help bridge a significant fraction of the performance gap between a good and a perfect static disambiguator; occasionally it can outperform the perfect static disambiguator.
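The "specialized code plus run-time check" pattern described above can be sketched as follows. A flat list standing in for memory, with base offsets as addresses, is an assumption made for illustration:

```python
def runs_disjoint(base_a, base_b, n):
    # Run-time disambiguation check the compiler would emit.
    return base_a + n <= base_b or base_b + n <= base_a

def add_arrays(mem, dst, a, b, n):
    """Speculative disambiguation sketch: the compiler emits two code
    versions, and a run-time address check selects the reorderable
    (parallel-friendly) one when the regions cannot alias."""
    if runs_disjoint(dst, a, n) and runs_disjoint(dst, b, n):
        # Fast path: loads may be freely reordered or executed in parallel.
        vals = [mem[a + i] + mem[b + i] for i in range(n)]
        mem[dst:dst + n] = vals
    else:
        # Safe path: preserve the original load/store order.
        for i in range(n):
            mem[dst + i] = mem[a + i] + mem[b + i]

mem = list(range(12))
add_arrays(mem, dst=0, a=4, b=8, n=4)   # disjoint regions: fast path
print(mem[:4])  # [12, 14, 16, 18]
```

Both versions compute the same result when the regions are disjoint; the check exists precisely for the cases static analysis cannot decide, which is why the technique can approach a perfect static disambiguator.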