Conference Paper

Exceeding the Dataflow Limit via Value Prediction.


Abstract

For decades, the serialization constraints imposed by true data dependences have been regarded as an absolute limit (the dataflow limit) on the parallel execution of serial programs. This paper proposes a new technique, value prediction, for exceeding that limit by allowing data-dependent instructions to issue and execute in parallel without violating program semantics. The technique is built on the concept of value locality, which describes the likelihood that a previously-seen value will recur within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values, and we find that the register values written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 can exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% (depending on machine model) by exceeding the dataflow limit.
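The value-locality claim above can be illustrated with a small measurement sketch. The trace format and the program counters below are hypothetical, chosen only to show how per-instruction value recurrence would be quantified:

```python
def value_locality(trace):
    """Fraction of dynamic results that match the previous result
    produced by the same static instruction (depth-1 value locality)."""
    last = {}            # static instruction PC -> last result value
    hits = total = 0
    for pc, value in trace:
        if pc in last:
            total += 1
            if last[pc] == value:
                hits += 1
        last[pc] = value
    return hits / total if total else 0.0

# Hypothetical trace of (PC, produced register value) pairs:
# a loop-invariant load (0x10) and a loop counter increment (0x14).
trace = [(0x10, 7), (0x14, 1), (0x10, 7), (0x14, 2), (0x10, 7), (0x14, 3)]
print(value_locality(trace))  # 0.5: the load repeats, the counter does not
```

A real study would gather such statistics over full benchmark traces; the point here is only that locality is a per-static-instruction property.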
... Parallelism increases each time the value of an instruction is correctly predicted. The technique was proposed between 1995 and 1997 by four distinct groups in [10][11][12][13][14][15]. ...
... In [12], the authors proposed a simple VP, called last-VP, which uses the last value computed by a previous execution of the instruction as its prediction. Such a predictor decides whether to predict using a fairly rudimentary confidence-counter mechanism, and achieves a performance gain of 4.5% to 23%. ...
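The last-VP mechanism described above can be sketched in a few lines. The table organization and the 2-bit counter thresholds here are illustrative assumptions, not the published design:

```python
class LastValuePredictor:
    """Last-value predictor gated by a 2-bit saturating confidence
    counter: speculate with the last seen value only after it has
    repeated enough times."""
    def __init__(self, predict_threshold=2):
        self.table = {}  # PC -> (last_value, confidence 0..3)
        self.threshold = predict_threshold

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry and entry[1] >= self.threshold:
            return entry[0]   # confident: issue dependents speculatively
        return None           # not confident: do not speculate

    def update(self, pc, actual):
        value, conf = self.table.get(pc, (None, 0))
        if value == actual:
            conf = min(conf + 1, 3)   # saturate upward on a repeat
        else:
            conf = max(conf - 1, 0)   # back off after a change
        self.table[pc] = (actual, conf)  # always remember the last value

p = LastValuePredictor()
for observed in [7, 7, 7]:       # a loop-invariant result stream
    p.update(0x40, observed)
print(p.predict(0x40))  # 7: confidence reached, so the value is predicted
```

The confidence gate is what keeps misprediction (and hence recovery) cost bounded: unstable instructions simply never cross the threshold.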
... Microarchitectural optimizations widely implemented in commercial processors, like branch predictors [4], [5], [49], [51], [100], caches [68], [136], [188] and prefetchers [1], [40], [44], [66], [158], [175], among others, are prone to attacks. A recent paper [146] demonstrates that several optimizations proposed in the literature but not yet known to be commercially implemented, such as value prediction [111], are also vulnerable. This underscores the importance of conducting a thorough review of microarchitectural optimizations with an emphasis on security. ...
... Value prediction: This is a speculative optimization that aims to increase instruction-level parallelism (ILP) and hide memory access latency by predicting values for load misses, thereby breaking instruction dependencies [111], [112], [135], [154], [156]. Accurate predictions can improve ILP by increasing the overlap between memory accesses and useful computation, while mispredictions lead to pipeline squashes and re-execution of instructions. ...
Preprint
Full-text available
Microarchitectural optimizations are expected to play a crucial role in ensuring performance scalability in future technology nodes. However, recent attacks have demonstrated that microarchitectural optimizations, which were assumed to be secure, can be exploited. Moreover, new attacks surface at a rapid pace, limiting the scope of existing defenses. These developments prompt the need to review microarchitectural optimizations with an emphasis on security, understand the attack landscape and the potential defense strategies. We analyze timing-based side-channel attacks targeting a diverse set of microarchitectural optimizations. We provide a framework for analyzing non-transient and transient attacks, which highlights the similarities. Through our systematic analysis, we identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation, and information flow. Our key insight is that a subset (or all) of the root causes are exploited by attacks, and eliminating any of the exploited root causes, in any attack step, is enough to provide protection. Leveraging our framework, we systematize existing defenses and show that they target these root causes in the different attack steps.
... This is a drawback because the flushing process will take additional cycles. The technique was proposed between 1995 and 1997 by four distinct groups [19][20][21][22][23]. In [24], an extensive review of the existing value predictors was carried out, also covering the security and data consistency implications of this speculative technique. ...
Article
Full-text available
Benchmarks play an essential role in the performance evaluation of novel research concepts. Their effectiveness diminishes if they fail to exploit the available hardware of the evaluated microprocessor or, more broadly, if they are not consistent in comparing various systems. An empirical analysis of the well-established Splash-2 benchmark suite vs. the latest version, Splash-4, was performed. It was shown that on a 64-core configuration, half of the simulated benchmarks reach temperatures well beyond the critical threshold of 105 °C, emphasizing the necessity of a multi-objective evaluation from at least the following perspectives: energy consumption, performance, chip temperature, and integration area. During the analysis, it was observed that the cores spend a large amount of time in the idle state, around 45% on average in some configurations. This can be exploited by implementing a predictive dynamic voltage and frequency scaling (DVFS) technique called the Simple Core State Predictor (SCSP) to enhance the Intel Nehalem architecture and to simulate it using Sniper. The aim was to decrease overall energy consumption by reducing power consumption at the core level while maintaining the same performance. Moreover, the SCSP technique, which operates on core-level abstract information, was applied in parallel with a Value Predictor (VP) or a Dynamic Instruction Reuse (DIR) technique, which rely on instruction-level information. Using the SCSP alone, a 9.95% reduction in power consumption and an energy reduction of 10.54% were achieved while maintaining performance. By combining the SCSP with the VP technique, a performance increase of 8.87% was obtained while reducing power and energy consumption by 3.13% and 8.48%, respectively.
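The core-state prediction idea above can be sketched as a small simulation. The abstract does not give the SCSP internals, so the history length, majority rule, and frequency values below are assumptions made purely for illustration:

```python
from collections import deque

class SimpleCoreStatePredictor:
    """Illustrative per-core idle/busy predictor driving DVFS, loosely
    modeled on the SCSP idea; not the published design."""
    def __init__(self, history_len=4):
        self.history = deque(maxlen=history_len)

    def observe(self, is_idle):
        self.history.append(is_idle)   # record the last interval's state

    def predict_idle(self):
        # Majority vote over recent intervals; default to busy.
        return sum(self.history) > len(self.history) / 2

def next_frequency(pred, f_low=1.2e9, f_high=2.66e9):
    # Lower the voltage/frequency pair when the core is expected to idle.
    return f_low if pred.predict_idle() else f_high

p = SimpleCoreStatePredictor()
for state in [True, True, True, False]:   # mostly-idle core
    p.observe(state)
print(next_frequency(p))  # 1200000000.0: predicted idle -> low frequency
```

The point of operating on such core-level abstract state is that it composes with instruction-level techniques like VP or DIR, as the abstract notes.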
Preprint
Full-text available
Load instructions often limit instruction-level parallelism (ILP) in modern processors due to data and resource dependences they cause. Prior techniques like Load Value Prediction (LVP) and Memory Renaming (MRN) mitigate load data dependence by predicting the data value of a load instruction. However, they fail to mitigate load resource dependence as the predicted load instruction gets executed nonetheless. Our goal in this work is to improve ILP by mitigating both load data dependence and resource dependence. To this end, we propose a purely-microarchitectural technique called Constable, that safely eliminates the execution of load instructions. Constable dynamically identifies load instructions that have repeatedly fetched the same data from the same load address. We call such loads likely-stable. For every likely-stable load, Constable (1) tracks modifications to its source architectural registers and memory location via lightweight hardware structures, and (2) eliminates the execution of subsequent instances of the load instruction until there is a write to its source register or a store or snoop request to its load address. Our extensive evaluation using a wide variety of 90 workloads shows that Constable improves performance by 5.1% while reducing the core dynamic power consumption by 3.4% on average over a strong baseline system that implements MRN and other dynamic instruction optimizations (e.g., move and zero elimination, constant and branch folding). In presence of 2-way simultaneous multithreading (SMT), Constable's performance improvement increases to 8.8% over the baseline system. When combined with a state-of-the-art load value predictor (EVES), Constable provides an additional 3.7% and 7.8% average performance benefit over the load value predictor alone, in the baseline system without and with 2-way SMT, respectively.
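The likely-stable load mechanism described above can be sketched as a small table model. The threshold, entry format, and method names are illustrative assumptions; the real design is a hardware structure, not software:

```python
class LikelyStableLoadTable:
    """Toy model of Constable-style load elimination: once a static load
    has repeatedly returned the same value from the same address, later
    instances are satisfied from the table until a write to a source
    register, or a store/snoop to the tracked address, invalidates it."""
    STABLE_AFTER = 2  # identical repetitions before elimination (assumed)

    def __init__(self):
        self.entries = {}  # pc -> {"addr", "value", "count", "srcs"}

    def observe_load(self, pc, addr, value, src_regs):
        e = self.entries.get(pc)
        if e and e["addr"] == addr and e["value"] == value:
            e["count"] += 1                      # same data, same address
        else:
            self.entries[pc] = {"addr": addr, "value": value,
                                "count": 1, "srcs": set(src_regs)}

    def eliminated_value(self, pc, addr):
        e = self.entries.get(pc)
        if e and e["addr"] == addr and e["count"] >= self.STABLE_AFTER:
            return e["value"]   # skip execution, reuse the tracked value
        return None

    def on_store_or_snoop(self, addr):
        self.entries = {pc: e for pc, e in self.entries.items()
                        if e["addr"] != addr}

    def on_register_write(self, reg):
        self.entries = {pc: e for pc, e in self.entries.items()
                        if reg not in e["srcs"]}

t = LikelyStableLoadTable()
t.observe_load(pc=0x1c, addr=0x8000, value=42, src_regs=["r1"])
t.observe_load(pc=0x1c, addr=0x8000, value=42, src_regs=["r1"])
print(t.eliminated_value(0x1c, 0x8000))  # 42: load execution skipped
t.on_store_or_snoop(0x8000)
print(t.eliminated_value(0x1c, 0x8000))  # None: entry invalidated
```

Unlike value prediction, nothing here is speculative: elimination happens only while the invalidation conditions provably have not fired, which is why no recovery path is needed.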
Article
In recent years, natural, flexible, and contactless vision-based gesture recognition has received significant attention in human-computer interaction. However, employing convolutional neural networks (CNNs) for RGB or RGB-D gestures can result in excessive power consumption and poor energy efficiency, making them unsuitable for embedded systems. In this paper, we propose a lightweight sparse binarized neural network (sBNN) model for edge gesture recognition that achieves an accuracy of 89.43%-99.92% on four open-source gesture datasets with ≤20.26 million operations (MOP) and ≤15.83 KB of parameters. We find high channel-level sparsity in the activation maps of the sBNN when edge gestures are used as inputs. The sparse activation maps contain multiple identical activation vectors, called sparse activation vectors (SAVs), which lead to highly repeated calculations. To avoid this issue, we propose a two-stage value prediction approach to skip these calculations, achieving a speedup of 1.03x-1.83x. Moreover, to reduce on-chip memory, a compression technique is applied to the sparse activation maps, providing a compression rate of 1.72x-3.45x. Finally, we implement an energy-efficient sparse BNN accelerator (SBA) on an embedded field-programmable gate array (FPGA). The experimental results show that our SBA has a latency of 26.3-46.8 µs, a power consumption of 0.807 W, and an energy efficiency of 536.22-952.70 GOPS/W at a 50-MHz frequency. Our SBA offers lower latency, lower power consumption, and higher energy efficiency than previous state-of-the-art gesture recognition accelerators.
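The computation-skipping idea behind SAVs can be illustrated with a memoization sketch. The dot-product workload and the data below are simplified stand-ins, not the paper's two-stage predictor:

```python
def conv_channel_outputs(activation_vectors, weights):
    """Skip repeated work on identical sparse activation vectors (SAVs):
    each distinct vector is computed once and the result reused."""
    cache = {}
    outputs, computed = [], 0
    for vec in activation_vectors:
        key = tuple(vec)
        if key not in cache:
            cache[key] = sum(a * w for a, w in zip(vec, weights))
            computed += 1                 # actual multiply-accumulate work
        outputs.append(cache[key])        # repeats are free
    return outputs, computed

# Binarized activations repeat heavily; only 2 of the 4 rows are computed.
acts = [[1, -1, 1], [1, -1, 1], [-1, -1, 1], [1, -1, 1]]
outs, n = conv_channel_outputs(acts, [2, 3, 1])
print(outs, n)  # [0, 0, -4, 0] 2
```

In hardware this lookup would be a prediction structure rather than a software dictionary, but the saving is the same: the speedup scales with how often activation vectors repeat.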
Conference Paper
This work presents the SEMPRE architecture, designed to execute multiple processes rather than multiple threads, thereby taking advantage of the large number of processes present on shared workstations and network servers. The architecture employs novel mechanisms that enable early context switching between processes, as well as the fast removal of inconvenient instructions. In addition, it provides facilities for the operating system to manage processes with minimal effort.
Chapter
This chapter delves into memory consistency models (a.k.a. memory models) that define the behavior of shared memory systems for programmers and implementors. These models define correctness so that programmers know what to expect and implementors know what to provide. We first motivate the need to define memory behavior (Section 3.1), say what a memory consistency model should do (Section 3.2), and compare and contrast consistency and coherence (Section 3.3).
Article
Full-text available
This paper describes the architectural and organizational tradeoffs made during the design of the MultiTitan and provides data supporting the decisions made. These decisions covered the entire space of processor design, from the instruction set and virtual memory architecture through the pipeline and organization of the machine. In particular, some of the tradeoffs involved the use of an on-chip instruction cache with an off-chip TLB and floating-point unit, the use of direct-mapped instead of associative caches, the choice of a 64-bit vs. a 32-bit data bus, and the implementation of hardware pipeline interlocks.
Article
We present an approach, called software prefetching, to reducing cache miss latencies. By providing a nonblocking prefetch instruction that causes data at a specified memory address to be brought into cache, the compiler can overlap the memory latency with other computation. Our simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.
Article
To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This paper introduces a simple hardware mechanism, referred to as the memory conflict buffer, which facilitates static code scheduling in the presence of memory store/load dependences. Correct program execution is ensured by the memory conflict buffer and repair code provided by the compiler. With this addition, significant speedup over an aggressive code scheduling model can be achieved for both non-numerical and numerical programs.
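The conflict-detection-plus-repair protocol above can be sketched as a toy model. The structure here is a software stand-in for the hardware buffer, and the method names are assumptions for illustration:

```python
class MemoryConflictBuffer:
    """Toy model of a memory conflict buffer: a speculatively hoisted
    load records its address; an intervening store to the same address
    flags a conflict so compiler-inserted repair code can re-read."""
    def __init__(self):
        self.speculative = {}   # address -> conflict flag

    def speculative_load(self, mem, addr):
        self.speculative[addr] = False   # begin tracking this address
        return mem[addr]

    def store(self, mem, addr, value):
        mem[addr] = value
        if addr in self.speculative:
            self.speculative[addr] = True   # a load was moved above us

    def check(self, mem, addr, spec_value):
        # Compiler-inserted check at the load's original position.
        conflict = self.speculative.pop(addr, False)
        return mem[addr] if conflict else spec_value   # repair if needed

mem = {0x100: 5}
mcb = MemoryConflictBuffer()
v = mcb.speculative_load(mem, 0x100)   # load hoisted above the store
mcb.store(mem, 0x100, 9)               # ambiguous store lands later
v = mcb.check(mem, 0x100, v)           # repair code re-reads on conflict
print(v)  # 9
```

The key property is that the common (no-conflict) case pays only the cheap check, while correctness in the rare aliasing case is preserved by the repair path.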
Article
Growing interest in ambitious multiple-issue machines and heavily-pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common.
Article
Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be hardware-based, software-directed, or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching, whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing memory latency with the least overhead.
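The benefit of hardware prefetching on linear array references, noted above, can be shown with a miss-counting sketch. A fully associative, never-evicting cache and a one-block-lookahead prefetcher are assumed purely for illustration:

```python
def misses(addresses, line_size=64, prefetch_next=False):
    """Count cache-line misses for an access stream, optionally with a
    one-block-lookahead (next-line) hardware prefetcher."""
    cache = set()
    miss = 0
    for a in addresses:
        line = a // line_size
        if line not in cache:
            miss += 1
            cache.add(line)
        if prefetch_next:
            cache.add(line + 1)   # fetch the sequentially next line early
    return miss

stream = list(range(0, 1024, 8))        # linear array walk, 8-byte elements
print(misses(stream), misses(stream, prefetch_next=True))  # 16 1
```

A real evaluation would also have to charge the prefetcher for the extra memory traffic it generates, which is exactly the tradeoff the simulations above measure.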
Article
Ambiguous memory references have always been one of the main sources of performance bottlenecks. Many papers have addressed this problem using static disambiguation. These methods work extremely well when the memory access pattern is linear and predictable. However, they are ineffective when the memory access pattern is nonlinear or when the access pattern cannot be determined statically. For these difficult problems, this paper presents speculative disambiguation, a compilation technique for architectures supporting instruction-level parallelism and either speculative execution or conditional execution (or both). This technique produces specialized code at compile time to disambiguate memory references at run time. It is shown that on machines with sufficient resources, the technique will always result in lower execution time. Speculative disambiguation has been implemented for a VLIW architecture with guarded execution. Preliminary results indicate that it can help bridge a significant fraction of the performance gap between a good and a perfect static disambiguator; occasionally it can outperform the perfect static disambiguator.
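The "specialized code plus run-time check" pattern described above can be sketched as follows. A flat list standing in for memory, with base offsets as addresses, is an assumption made for illustration:

```python
def runs_disjoint(base_a, base_b, n):
    # Run-time disambiguation check the compiler would emit.
    return base_a + n <= base_b or base_b + n <= base_a

def add_arrays(mem, dst, a, b, n):
    """Speculative disambiguation sketch: the compiler emits two code
    versions, and a run-time address check selects the reorderable
    (parallel-friendly) one when the regions cannot alias."""
    if runs_disjoint(dst, a, n) and runs_disjoint(dst, b, n):
        # Fast path: loads may be freely reordered or executed in parallel.
        vals = [mem[a + i] + mem[b + i] for i in range(n)]
        mem[dst:dst + n] = vals
    else:
        # Safe path: preserve the original load/store order.
        for i in range(n):
            mem[dst + i] = mem[a + i] + mem[b + i]

mem = list(range(12))
add_arrays(mem, dst=0, a=4, b=8, n=4)   # disjoint regions: fast path
print(mem[:4])  # [12, 14, 16, 18]
```

Both versions compute the same result when the regions are disjoint; the check exists precisely for the cases static analysis cannot decide, which is why the technique can approach a perfect static disambiguator.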