Figure 1: Bochs CPU loop state diagram

Source publication
Conference Paper
In recent years, dynamic binary translation has emerged as an important tool with many real-world applications. Besides supporting legacy binary code and ISA virtualization, it enables innovative co-designed microarchitectures and allows transparent binary instrumentation. The dynamic nature of the translation usually incurs extra execution overhead...

Citations

... It also enables basic software development before the hardware can be obtained. Several factors may influence the efficiency of a binary translator, including the overhead of initialization before translation, the overhead of code translation and optimization, and the overall quality of the generated code [5,21,25]. Code quality holds particular significance. ...
Chapter
The shortage of applications has become a major concern for any new Instruction Set Architecture (ISA). Binary translation is a common solution to this challenge. However, the performance of binary translation depends heavily on the quality of the translated code. To achieve high-quality translation, recent studies focus on integrating binary translators with compiler optimization methods. Such integration faces two main challenges. First, it is hard to employ complex compiler optimizations in a dynamic binary translator (DBT) without introducing significant runtime overhead. Second, it is difficult to implement register mapping in the compiler, which would reduce the expensive memory access instructions generated to maintain the guest CPU state. To resolve these challenges, we propose MFHBT, a hybrid binary translation system with multi-stage feedback that combines a dynamic and a static binary translator. This system eliminates the runtime overhead caused by compilation optimization. Additionally, we introduce a mechanism to implement register mapping through inline constraints and stack variables in the compiler. We implement a prototype of this system on top of LLVM. Experimental results show an 81% decrease in the number of memory access instructions and a performance improvement of 3.28x compared to QEMU.
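As an aside on the register-mapping point above: the following hand-written C sketch (not code from MFHBT, QEMU or LLVM; all names are invented) illustrates why keeping guest registers in a memory-resident CPU-state structure generates the expensive loads and stores the authors mention, and how mapping hot guest registers onto host registers removes most of that traffic inside a translated region.

/* Minimal, hypothetical illustration of why register mapping matters in a DBT.
 * Guest state kept in memory forces loads/stores around every emulated
 * instruction; mapping guest registers to host registers avoids them.
 * This is a conceptual sketch, not code from MFHBT or QEMU. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t regs[32];   /* guest general-purpose registers */
} GuestCPU;

/* Unmapped style: every guest "add r1, r2, r3" becomes two loads and a store. */
static void emulate_add_unmapped(GuestCPU *cpu, int rd, int rs1, int rs2)
{
    uint64_t a = cpu->regs[rs1];   /* load  */
    uint64_t b = cpu->regs[rs2];   /* load  */
    cpu->regs[rd] = a + b;         /* store */
}

/* Mapped style: inside a translated region, hot guest registers are kept in
 * host registers (modeled here as locals) and written back only at exits. */
static uint64_t run_mapped_region(GuestCPU *cpu)
{
    uint64_t r1 = cpu->regs[1], r2 = cpu->regs[2], r3 = cpu->regs[3];
    for (int i = 0; i < 1000; i++)
        r1 = r2 + r3;              /* no memory traffic inside the region */
    cpu->regs[1] = r1;             /* single write-back at region exit */
    return r1;
}

int main(void)
{
    GuestCPU cpu = { .regs = { [2] = 40, [3] = 2 } };
    for (int i = 0; i < 1000; i++)
        emulate_add_unmapped(&cpu, 1, 2, 3);   /* ~3000 memory accesses */
    printf("unmapped result: %llu\n", (unsigned long long)cpu.regs[1]);
    printf("mapped result:   %llu\n", (unsigned long long)run_mapped_region(&cpu));
    return 0;
}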
... DBT is subject to the same constraints as dynamic compilation: every cycle spent optimizing (or translating) code must be compensated by the speedup the optimization brings. The overheads caused by dynamic binary translation were studied by Borin et al. [13], who attempted to characterize this overhead in order to identify its different causes. ...
Thesis
This thesis addresses the use of hardware acceleration techniques in the design of processors based on dynamic binary optimization. In this type of machine, the instructions of the program executed by the processor are translated and optimized on the fly by a dynamic compilation tool integrated into the processor. This approach makes it possible to better exploit the resources of the target processor, but it is delicate to use because the time spent on this recompilation very significantly impacts the overall benefit of the optimizations. In this thesis, we show that using hardware accelerators for certain key steps of this compilation (construction of the intermediate representation, instruction scheduling) brings the compilation time down to very low values (on average 6 cycles per instruction, compared with several hundred in a classical implementation). We also show how these techniques can be exploited to offer better performance/energy trade-offs for certain types of compute kernels. The thesis also led to the developed compiler being made available to the research community.
... A DBT engine usually starts by interpreting the code and then, after warming up (translating all hot regions), it spends most of its time executing translated regions instead. Thus, the quality and performance of these translated regions account for most of the DBT engine's performance [Borin and Wu 2009], and two DBT design choices affect translation quality the most: (1) the DBT's Region Formation Technique (RFT), which defines the shape of the translation units [Smith and Nair 2005], and (2) the characteristics of the guest and host ISAs, which can make translation harder or easier [Auler and Borin 2017]. While the RFT design choice is well explored in the literature, the translation quality of each guest/host ISA pair needs to be researched and retested for every new ISA. One approach to understanding the quality and difficulty of code translation for a pair of ISAs is to implement a Static Binary Translation (SBT) engine [Auler and Borin 2017]. ...
... Therefore, the quality of these translated regions is crucial to the final performance of a DBT engine. In fact, this is evidenced by the low overhead of same-ISA DBT engines [Borin and Wu 2009], also known as binary optimizers, since they always execute code of the same or better quality than the native binary (this happens because same-ISA emulation does not actually require translation, only optimization). Designing and implementing high-performance cross-ISA DBT engines, on the other hand, is more challenging because the performance of translated code depends heavily on the characteristics of the guest (source) and the host (target) ISA. ...
... RISC-V has most of the characteristics pointed out by the authors as making an ISA easy to emulate: it is simple, it hardly uses the status register, and it has a small number of instructions, with 66% of them being equivalent to OpenISA instructions, all indicating that RISC-V is also an easy-to-emulate ISA. For same-ISA emulation, the best performance achieved is from StarDBT by Borin and Wu [Borin and Wu 2009]: emulation only 1.09x slower than native x86 execution. ...
Conference Paper
RISC-V is an open ISA that has been attracting attention worldwide due to its fast growth and adoption; it is already supported by GCC, Clang and the Linux kernel. Moreover, several emulators and simulators for RISC-V have arisen recently, but none of them with good performance. In this paper, we investigate whether faster emulators for RISC-V could be created. Since the most common and also fastest technique for implementing an emulator, Dynamic Binary Translation (DBT), depends directly on good translation quality to achieve good performance, we investigate whether a high-quality translation of RISC-V binaries is feasible. To do so, we used Static Binary Translation (SBT) to test the quality that can be achieved when translating RISC-V to x86 and ARM. Our experimental results indicate that our SBT is able to produce high-quality code when translating RISC-V binaries to x86 and ARM, with only 12%/35% overhead compared to native x86/ARM code, a better result than well-known RISC-V DBT engines such as RV8 or QEMU. Since the performance of a DBT is strongly related to translation quality, our SBT engine evidences the opportunity to create RISC-V DBT emulators with higher performance than the current ones.
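The citation contexts above outline the usual DBT workflow: interpret cold code, count executions, and switch to translated regions once they turn hot. Below is a minimal, self-contained C sketch of that dispatch shape; the toy guest program, the block table and the HOT_THRESHOLD value are invented for illustration and do not come from StarDBT, QEMU or RV8.

/* Toy sketch of a DBT dispatch loop: interpret cold blocks, "translate" hot
 * ones, and run them from a block table afterwards. The guest program, the
 * threshold and all helper names are invented for illustration only. */
#include <stdio.h>
#include <stdint.h>

#define HOT_THRESHOLD 10          /* executions before a block is "translated" */
#define NUM_BLOCKS    4

typedef uint64_t guest_addr_t;
typedef guest_addr_t (*translated_fn)(guest_addr_t pc);

/* Toy "guest": each block adds its own id to a counter, then falls through to
 * the next block; the last block jumps back to block 0 for a fixed number of
 * rounds and then signals exit with (guest_addr_t)-1. */
static uint64_t guest_counter;
static int rounds_left = 100;

static guest_addr_t run_block_semantics(guest_addr_t pc)
{
    guest_counter += pc;                       /* the block's "work" */
    if (pc == NUM_BLOCKS - 1)
        return (--rounds_left > 0) ? 0 : (guest_addr_t)-1;
    return pc + 1;
}

/* Slow path: decode-and-execute one block. */
static guest_addr_t interpret_block(guest_addr_t pc) { return run_block_semantics(pc); }

/* "JIT": a real DBT would emit host code into a code cache; here it just
 * hands back a direct function pointer, which is the fast-path shape. */
static translated_fn translate_block(guest_addr_t pc) { (void)pc; return run_block_semantics; }

typedef struct { unsigned exec_count; translated_fn code; } BlockEntry;
static BlockEntry table[NUM_BLOCKS];

int main(void)
{
    guest_addr_t pc = 0;
    while (pc != (guest_addr_t)-1) {
        BlockEntry *e = &table[pc];
        if (e->code)
            pc = e->code(pc);                              /* hot path */
        else if (++e->exec_count >= HOT_THRESHOLD)
            pc = (e->code = translate_block(pc))(pc);      /* promote */
        else
            pc = interpret_block(pc);                      /* cold path */
    }
    printf("guest result = %llu\n", (unsigned long long)guest_counter);
    return 0;
}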
... All these tools face the same cold-code execution problem: when the tool first translates a code snippet, it cannot guess whether that snippet will be executed often enough to recoup the cost of an optimization stage. This problem is often identified as one of the bottlenecks of dynamic compilation [20], [21]. To handle this issue, most dynamic compilation tools are decomposed into several optimization levels, which are triggered whenever a region of binary code becomes hot. ...
Article
In order to provide dynamic adaptation of the performance/energy trade-off, systems today rely on heterogeneous multi-core architectures (different micro-architectures on a chip). These systems are limited to single-ISA approaches to enable transparent migration between the different cores. To offer more trade-offs, we can integrate statically scheduled micro-architectures and use Dynamic Binary Translation (DBT) for task migration. However, in a system where performance and energy consumption are a prime concern, the translation overhead has to be kept as low as possible. In this paper we present Hybrid-DBT, an open-source, hardware-accelerated DBT system targeting VLIW cores. Three different hardware accelerators have been designed to speed up critical steps of the translation process. An experimental study shows that the accelerated steps are two orders of magnitude faster than their software equivalents. The impact on the total execution time of applications and the quality of the generated binaries are also measured.
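The cold-code discussion above (every cycle spent optimizing must be recouped) boils down to a simple break-even calculation. The small C program below illustrates it with purely hypothetical cycle counts; the numbers are not measurements from Hybrid-DBT or any other engine.

/* Back-of-the-envelope break-even model for tiered dynamic compilation:
 * promoting a region to a higher optimization level only pays off once the
 * cycles saved per execution have amortized the compilation cost. */
#include <stdio.h>

int main(void)
{
    /* Hypothetical per-execution cost of one region, in cycles. */
    double t_interp = 900.0;     /* interpreted */
    double t_tier1  = 120.0;     /* quick, cheap translation */
    double t_tier2  = 80.0;      /* aggressive optimization */

    /* Hypothetical one-time compilation cost of each tier, in cycles. */
    double c_tier1 = 20000.0;
    double c_tier2 = 400000.0;

    /* Executions needed before each tier recoups its own cost
     * relative to the level below it. */
    double breakeven1 = c_tier1 / (t_interp - t_tier1);
    double breakeven2 = c_tier2 / (t_tier1 - t_tier2);

    printf("tier 1 pays off after ~%.0f executions\n", breakeven1);
    printf("tier 2 pays off after ~%.0f more executions\n", breakeven2);
    return 0;
}

With these illustrative numbers, cheap translation pays off after a few dozen executions while aggressive optimization needs thousands, which is why tiered designs reserve the expensive level for very hot regions.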
... Instead of proposing a complete system, other works focus on the characterization of particular solutions. Examples of works targeting control flow handling include [32], [23], [20], [21], [22]. Code cache management is discussed in [33]. ...
Conference Paper
HW/SW co-designed processors are currently seeing renewed interest due to their capability to boost performance without running into the power and complexity walls. By employing a software layer that performs dynamic binary translation and applies aggressive optimizations by exploiting runtime behaviour, these hybrid architectures provide better performance/watt. However, a poorly designed software layer can result in significant translation/optimization overheads that may offset its benefits. This work presents a detailed characterization of the software layer of a HW/SW co-designed processor using a variety of benchmark suites. We observe that the performance of the software layer is very sensitive to the characteristics of the emulated application, with a variance of more than 50%. We also show that the interaction between the software layer and the emulated application, while sharing the microarchitectural resources, can have a 0-20% impact on performance. Finally, we identify some key elements which should be further investigated to reduce the observed variations in performance. The paper provides critical insights to improve the software layer design.
... Our previous experience with ISA emulation and design [27,14,18] has shown us that a clean ISA, similar to MIPS-1, allows us to build a high-performance emulator capable of emulating guest applications on ARM and x86 host processors at near-native performance. As a result, we are designing OpenISA, an ISA that aims to be emulation friendly, empowering most of the devices in the world to execute the same set of applications. ...
Conference Paper
In the face of the high number of different hardware platforms that we need to program in the Internet of Things (IoT), Virtual Machines (VMs) are a promising technology to enable a program-once, deploy-everywhere strategy. Unfortunately, many existing VMs are too heavy to run on resource-constrained IoT devices. We present COISA, a compact virtual platform that relies on OpenISA, an Instruction Set Architecture (ISA) that strives for easy emulation, to allow a single program to be deployed on many platforms, including tiny microcontrollers. Our experimental results indicate that COISA is easily portable and is capable of running unmodified guest applications on highly heterogeneous host platforms, including one with only 2 kB of RAM.
... On the other hand, our mechanism makes it possible to obtain the actual behavior of the code execution, reflecting input data and dynamic control flow, via a dynamic binary transformation mechanism applied during code execution [23]. This means that we can obtain performance gains via dynamic tuning or dynamic parallelization without feedback to the source code. ...
Article
This paper presents a mechanism for detecting dynamic loop and procedure nesting on the fly during actual program execution. This mechanism aims primarily at enabling better strategies for performance tuning or parallelization. Using a pre-compiled application executable (machine code) as input, our mechanism statically generates simple but precise markers that indicate loop entries and loop exits, and dynamically monitors the loop nesting that appears during the actual execution together with the call context tree. To keep loop structures precise at all times, we monitor indirect jumps that enter loop regions and the setjmp/longjmp functions that cause irregular function call transfers. We also present a novel representation called the Loop-Call Context Graph that can keep track of inter-procedural loop nests. We implement our mechanism and evaluate it using the SPEC CPU2006 benchmark suite. The results confirm that our mechanism can successfully reveal the precise inter-procedural loop nest structures of all SPEC CPU2006 benchmark executions without any particular compiler support. The results also show that it can reduce runtime loop detection overheads compared with the existing loop profiling method.
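To make the marker mechanism more concrete, here is a minimal C sketch of runtime loop-nesting tracking driven by loop-entry/loop-exit markers; the marker API, loop IDs and stack policy are invented and only approximate the approach described in the abstract.

/* Sketch of runtime loop-nesting tracking driven by statically placed
 * loop-entry/loop-exit markers. Marker placement, IDs and the unwinding
 * policy for irregular exits are illustrative only. */
#include <stdio.h>

#define MAX_DEPTH 64

static int loop_stack[MAX_DEPTH];
static int depth;

static void loop_entry(int loop_id)           /* emitted at each loop header */
{
    loop_stack[depth++] = loop_id;
    printf("%*senter loop %d (depth %d)\n", depth * 2, "", loop_id, depth);
}

static void loop_exit(int loop_id)            /* emitted at each loop exit edge */
{
    while (depth > 0 && loop_stack[depth - 1] != loop_id)
        depth--;                              /* unwind irregular exits */
    if (depth > 0)
        depth--;
    printf("%*sexit  loop %d (depth %d)\n", (depth + 1) * 2, "", loop_id, depth);
}

int main(void)
{
    /* Instrumented version of: for i { for j { ... } } */
    loop_entry(1);
    for (int i = 0; i < 2; i++) {
        loop_entry(2);
        for (int j = 0; j < 2; j++)
            ;                                  /* loop body */
        loop_exit(2);
    }
    loop_exit(1);
    return 0;
}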
... Borin and Wu [26] study the overhead of the Intel research dynamic binary translator StarDBT when emulating the SPEC CPU 2000 benchmarks [29]. They found that branch handling and code duplication represent more than 64% of the StarDBT overhead, and that cold code translation and hot trace building together account for 34% of the DBT overhead. ...
Conference Paper
Virtual machines are versatile systems that can support innovative solutions to many problems. These systems usually rely on emulation techniques, such as interpretation and dynamic binary translation, to execute guest application code. In order to select the best emulation technique for each code segment, the system must predict whether the code is worth compiling (frequently executed) or not, known as hotness prediction. In this paper we show that the threshold-based hot-code predictor frequently mispredicts code hotness and, as a result, the VM's emulation performance becomes dominated by miscompilations. To do so, we developed a mathematical model that simulates the behavior of such a predictor, and we use it to quantify and characterize the impact of mispredictions on several benchmarks. We also show how the threshold choice can affect the predictor, what the major overhead components are, and how using SPEC to analyze VM performance can lead to misleading results.
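A threshold-based hotness predictor and the miscompilations it can cause are easy to sketch. The C toy below compiles a region after a fixed number of executions and counts how many compiled regions never run long enough afterwards to pay the compilation back; the threshold, break-even figure and execution counts are synthetic, and this is not the paper's actual mathematical model.

/* Sketch of how a threshold-based hotness predictor can miscompile: a region
 * is compiled after THRESHOLD executions, but if it barely runs afterwards
 * the compilation never pays off. All numbers are illustrative. */
#include <stdio.h>

#define THRESHOLD     50      /* executions before a region is compiled */
#define BREAK_EVEN   500      /* post-compilation executions needed to recoup */

int main(void)
{
    /* Total execution counts of a few hypothetical regions. */
    long counts[] = { 10, 60, 55, 10000, 51, 300000, 70, 49 };
    int n = (int)(sizeof counts / sizeof counts[0]);

    int compiled = 0, miscompiled = 0;
    for (int i = 0; i < n; i++) {
        if (counts[i] < THRESHOLD)
            continue;                          /* stays interpreted: fine */
        compiled++;
        long after = counts[i] - THRESHOLD;    /* executions of compiled code */
        if (after < BREAK_EVEN)
            miscompiled++;                     /* prediction did not pay off */
    }
    printf("%d regions compiled, %d of them miscompiled\n", compiled, miscompiled);
    return 0;
}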
... The quality of a DBT (Dynamic Binary Translator) is measured by its ability to translate computer programs efficiently and transparently [3,10]. Figure 1 shows the emulation overhead (slowdown) reported by HPT for different confidence levels. ...
Conference Paper
Several aspects of a computer system cause performance measurements to include random errors. Moreover, these systems are typically composed of a non-trivial combination of individual components that may cause one system to perform better or worse than another depending on the workload. Hence, properly measuring and comparing computer systems' performance are non-trivial tasks. The majority of the work published at recent major computer architecture conferences does not report the random errors measured in the experiments. The few remaining authors have been using only confidence intervals or standard deviations to quantify and factor out random errors. Recent publications claim that this approach can still lead to misleading conclusions. In this work, we reproduce and discuss the results obtained in a previous study. Finally, we propose SToCS, a tool that integrates several statistical frameworks and facilitates the analysis of computer science experiments.
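For reference, the baseline approach the paper critiques can be sketched in a few lines of C: summarize repeated runtime measurements with a sample mean and a normal-approximation 95% confidence interval. This is a generic illustration, not the SToCS tool, and with few repetitions a Student-t quantile should replace the 1.96 constant.

/* Minimal summary of repeated performance measurements: sample mean, sample
 * standard deviation and a normal-approximation 95% confidence interval.
 * Link with -lm. The run times below are hypothetical. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical execution times of one benchmark, in seconds. */
    double runs[] = { 12.1, 12.4, 11.9, 12.6, 12.2, 12.3, 12.0, 12.5 };
    int n = (int)(sizeof runs / sizeof runs[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    double sd = sqrt(ss / (n - 1));            /* sample standard deviation */

    double half = 1.96 * sd / sqrt((double)n); /* 95% CI half-width (normal approx.) */
    printf("mean = %.3f s, 95%% CI = [%.3f, %.3f]\n", mean, mean - half, mean + half);
    return 0;
}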
... The backend module is responsible for hot trace optimizations. In order to isolate the code optimization benefits from the optimization overhead, we modified the backend module to dump hot traces into files, and to load them back in future runs [10]. The dumping process occurs at the end of the first run. ...
Article
Dynamic Binary Translation (DBT) has been used as an approach to transparently run code on different architectures and generally makes use of runtime information to perform effective dynamic compiler optimization. Dynamic optimization techniques are hard to design, as they have to improve code performance under stringent runtime constraints. In this paper we present preliminary work on a code optimization technique called Hole Allocation. Our goal with this paper is to introduce the technique, highlight its potential, and point out new directions for improvement. Hole Allocation uses runtime information, collected by DBTs, to identify free register ranges (holes) which can be targets of register promotion. Preliminary experiments conducted with the SPEC CPU2000 benchmark, using only one memory access pattern, show that Hole Allocation can achieve moderate speed-ups for some program runs, but also reveal that performance can suffer in some cases, which still needs to be addressed.
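The idea of a register "hole" can be illustrated with a toy trace scan: find, for each register, the longest instruction range that never touches it, which becomes a candidate range for promoting a frequently accessed memory location into that register. The trace format and the scan below are invented for illustration and are not the paper's algorithm.

/* Toy sketch of the "hole" idea: scan a recorded instruction trace for a
 * range where some register is never read or written, so a memory location
 * could be promoted into it over that range. */
#include <stdio.h>

#define NUM_REGS 8

typedef struct {
    int reads[NUM_REGS];    /* 1 if the instruction reads the register  */
    int writes[NUM_REGS];   /* 1 if the instruction writes the register */
} TraceInsn;

/* Report, for each register, the longest run of instructions that never
 * touch it: a candidate hole for register promotion. */
static void find_holes(const TraceInsn *trace, int len)
{
    for (int r = 0; r < NUM_REGS; r++) {
        int best = 0, best_start = -1, run = 0;
        for (int i = 0; i < len; i++) {
            if (trace[i].reads[r] || trace[i].writes[r]) {
                run = 0;
            } else if (++run > best) {
                best = run;
                best_start = i - run + 1;
            }
        }
        if (best > 0)
            printf("r%d: hole of %d insns starting at %d\n", r, best, best_start);
    }
}

int main(void)
{
    /* Tiny synthetic trace: r1 is busy everywhere, r2 is free in the middle. */
    static TraceInsn trace[5];
    for (int i = 0; i < 5; i++) trace[i].reads[1] = 1;
    trace[0].writes[2] = 1;
    trace[4].reads[2] = 1;
    find_holes(trace, 5);
    return 0;
}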