Figure 1: Bochs CPU loop state diagram

Source publication
Conference Paper
In recent years, dynamic binary translation has emerged as an important tool with many real-world applications. Besides supporting legacy binary code and ISA virtualization, it enables innovative co-designed microarchitectures and allows transparent binary instrumentation. The dynamic nature of the translation usually incurs extra execution overhead...

Citations

... It also enables basic software development before the hardware can be obtained. Several factors may influence the efficiency of a binary translator, including the overhead of initialization before translation, the overhead of code translation and optimization, and the overall quality of the generated code [5,21,25]. Code quality holds particular significance. ...
Chapter
The shortage of applications has become a major concern for any new Instruction Set Architecture (ISA). Binary translation is a common solution to this challenge. However, the performance of binary translation depends heavily on the quality of the translated code. To achieve high-quality translation, recent studies focus on integrating binary translators with compiler optimization methods. Such integration faces two main challenges. First, it is hard to employ complex compiler optimizations in a dynamic binary translator (DBT) without introducing significant runtime overhead. Second, it is difficult to implement register mapping in the compiler, which would reduce the expensive memory access instructions generated to maintain the guest CPU state. To resolve these challenges, we propose MFHBT, a hybrid binary translation system with multi-stage feedback that combines a dynamic and a static binary translator. This system eliminates the runtime overhead caused by compilation optimization. Additionally, we introduce a mechanism to implement register mapping through inline constraints and stack variables in the compiler. We implement a prototype of this system on top of LLVM. Experimental results show an 81% decrease in the number of memory access instructions and a performance improvement of 3.28x compared to QEMU.
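As an aside on the register-mapping point above: the following hand-written C sketch (not code from MFHBT, QEMU or LLVM; all names are invented) illustrates why keeping guest registers in a memory-resident CPU-state structure generates the expensive loads and stores the authors mention, and how mapping hot guest registers onto host registers removes most of that traffic inside a translated region.

/* Minimal, hypothetical illustration of why register mapping matters in a DBT.
 * Guest state kept in memory forces loads/stores around every emulated
 * instruction; mapping guest registers to host registers avoids them.
 * This is a conceptual sketch, not code from MFHBT or QEMU. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t regs[32];   /* guest general-purpose registers */
} GuestCPU;

/* Unmapped style: every guest "add r1, r2, r3" becomes two loads and a store. */
static void emulate_add_unmapped(GuestCPU *cpu, int rd, int rs1, int rs2)
{
    uint64_t a = cpu->regs[rs1];   /* load  */
    uint64_t b = cpu->regs[rs2];   /* load  */
    cpu->regs[rd] = a + b;         /* store */
}

/* Mapped style: inside a translated region, hot guest registers are kept in
 * host registers (modeled here as locals) and written back only at exits. */
static uint64_t run_mapped_region(GuestCPU *cpu)
{
    uint64_t r1 = cpu->regs[1], r2 = cpu->regs[2], r3 = cpu->regs[3];
    for (int i = 0; i < 1000; i++)
        r1 = r2 + r3;              /* no memory traffic inside the region */
    cpu->regs[1] = r1;             /* single write-back at region exit */
    return r1;
}

int main(void)
{
    GuestCPU cpu = { .regs = { [2] = 40, [3] = 2 } };
    for (int i = 0; i < 1000; i++)
        emulate_add_unmapped(&cpu, 1, 2, 3);   /* ~3000 memory accesses */
    printf("unmapped result: %llu\n", (unsigned long long)cpu.regs[1]);
    printf("mapped result:   %llu\n", (unsigned long long)run_mapped_region(&cpu));
    return 0;
}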
... DBT is subject to the same constraints as dynamic compilation: every cycle spent optimizing (or translating) code must be compensated by the speedup the optimization brings. The overheads caused by dynamic binary translation were studied by Borin et al. [13], who attempted to characterize this overhead in order to identify its different causes. ...
Thesis
This thesis addresses the use of hardware acceleration techniques in the design of processors based on dynamic binary optimization. In this type of machine, the instructions of the program executed by the processor are translated and optimized on the fly by a dynamic compilation tool integrated into the processor. This approach makes it possible to better exploit the resources of the target processor, but it is delicate to use because the time spent on this recompilation very significantly impacts the overall benefit of the optimizations. In this thesis, we show that using hardware accelerators for certain key steps of this compilation (construction of the intermediate representation, instruction scheduling) brings the compilation time down to very low values (on average 6 cycles per instruction, compared with several hundred in a classical implementation). We also show how these techniques can be exploited to offer better performance/energy trade-offs for certain types of compute kernels. The thesis also led to the developed compiler being made available to the research community.
... A DBT engine usually starts by interpreting the code and then, after warming up (translating all hot regions), it spends most of its time executing translated regions instead. Thus, the quality and performance of these translated regions account for most of the DBT engine's performance [Borin and Wu 2009], and two DBT design choices affect translation quality the most: (1) the DBT's Region Formation Technique (RFT), which defines the shape of the translation units [Smith and Nair 2005], and (2) the characteristics of the guest and host ISAs, which can make translation harder or easier [Auler and Borin 2017]. While the RFT design choice is well explored in the literature, the translation quality of each guest/host ISA pair needs to be researched and retested for every new ISA. One approach to understanding the quality and difficulty of code translation for a pair of ISAs is to implement a Static Binary Translation (SBT) engine [Auler and Borin 2017]. ...
... Therefore, the quality of these translated regions is crucial to the final performance of a DBT engine. In fact, this is evidenced by the low overhead of same-ISA DBT engines [Borin and Wu 2009], also known as binary optimizers, since they always execute code of the same or better quality than the native binary (this happens because same-ISA emulation does not actually require translation, only optimization). Designing and implementing high-performance cross-ISA DBT engines, on the other hand, is more challenging because the performance of translated code depends heavily on the characteristics of the guest (source) and the host (target) ISA. ...
... RISC-V has most of the characteristics pointed out by the authors as making an ISA easy to emulate: it is simple, it hardly uses the status register, and it has a small number of instructions, with 66% of them being equivalent to OpenISA instructions, all indicating that RISC-V is also an easy-to-emulate ISA. For same-ISA emulation, the best performance achieved is from StarDBT by Borin and Wu [Borin and Wu 2009]: emulation only 1.09x slower than native x86 execution. ...
Conference Paper
RISC-V is an open ISA that has been attracting attention worldwide due to its fast growth and adoption; it is already supported by GCC, Clang and the Linux kernel. Moreover, several emulators and simulators for RISC-V have arisen recently, but none of them with good performance. In this paper, we investigate whether faster emulators for RISC-V could be created. Since the most common and also fastest technique for implementing an emulator, Dynamic Binary Translation (DBT), depends directly on good translation quality to achieve good performance, we investigate whether a high-quality translation of RISC-V binaries is feasible. To do so, we used Static Binary Translation (SBT) to test the quality that can be achieved when translating RISC-V to x86 and ARM. Our experimental results indicate that our SBT is able to produce high-quality code when translating RISC-V binaries to x86 and ARM, with only 12%/35% overhead compared to native x86/ARM code, a better result than well-known RISC-V DBT engines such as RV8 or QEMU. Since the performance of a DBT is strongly related to translation quality, our SBT engine evidences the opportunity to create RISC-V DBT emulators with higher performance than the current ones.
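The citation contexts above outline the usual DBT workflow: interpret cold code, count executions, and switch to translated regions once they turn hot. Below is a minimal, self-contained C sketch of that dispatch shape; the toy guest program, the block table and the HOT_THRESHOLD value are invented for illustration and do not come from StarDBT, QEMU or RV8.

/* Toy sketch of a DBT dispatch loop: interpret cold blocks, "translate" hot
 * ones, and run them from a block table afterwards. The guest program, the
 * threshold and all helper names are invented for illustration only. */
#include <stdio.h>
#include <stdint.h>

#define HOT_THRESHOLD 10          /* executions before a block is "translated" */
#define NUM_BLOCKS    4

typedef uint64_t guest_addr_t;
typedef guest_addr_t (*translated_fn)(guest_addr_t pc);

/* Toy "guest": each block adds its own id to a counter, then falls through to
 * the next block; the last block jumps back to block 0 for a fixed number of
 * rounds and then signals exit with (guest_addr_t)-1. */
static uint64_t guest_counter;
static int rounds_left = 100;

static guest_addr_t run_block_semantics(guest_addr_t pc)
{
    guest_counter += pc;                       /* the block's "work" */
    if (pc == NUM_BLOCKS - 1)
        return (--rounds_left > 0) ? 0 : (guest_addr_t)-1;
    return pc + 1;
}

/* Slow path: decode-and-execute one block. */
static guest_addr_t interpret_block(guest_addr_t pc) { return run_block_semantics(pc); }

/* "JIT": a real DBT would emit host code into a code cache; here it just
 * hands back a direct function pointer, which is the fast-path shape. */
static translated_fn translate_block(guest_addr_t pc) { (void)pc; return run_block_semantics; }

typedef struct { unsigned exec_count; translated_fn code; } BlockEntry;
static BlockEntry table[NUM_BLOCKS];

int main(void)
{
    guest_addr_t pc = 0;
    while (pc != (guest_addr_t)-1) {
        BlockEntry *e = &table[pc];
        if (e->code)
            pc = e->code(pc);                              /* hot path */
        else if (++e->exec_count >= HOT_THRESHOLD)
            pc = (e->code = translate_block(pc))(pc);      /* promote */
        else
            pc = interpret_block(pc);                      /* cold path */
    }
    printf("guest result = %llu\n", (unsigned long long)guest_counter);
    return 0;
}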
... All these tools face the same cold-code execution problem: when the tool first translates a code snippet, it cannot guess whether that snippet will be executed often enough to recoup the cost of an optimization stage. This problem is often identified as one of the bottlenecks of dynamic compilation [20], [21]. To handle this issue, most dynamic compilation tools are decomposed into several optimization levels, which are triggered whenever a region of binary code becomes hot. ...
Article
In order to provide dynamic adaptation of the performance/energy trade-off, systems today rely on heterogeneous multi-core architectures (different micro-architectures on a chip). These systems are limited to single-ISA approaches to enable transparent migration between the different cores. To offer more trade-offs, we can integrate statically scheduled micro-architectures and use Dynamic Binary Translation (DBT) for task migration. However, in a system where performance and energy consumption are a prime concern, the translation overhead has to be kept as low as possible. In this paper we present Hybrid-DBT, an open-source, hardware-accelerated DBT system targeting VLIW cores. Three different hardware accelerators have been designed to speed up critical steps of the translation process. An experimental study shows that the accelerated steps are two orders of magnitude faster than their software equivalents. The impact on the total execution time of applications and the quality of the generated binaries are also measured.
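The cold-code discussion above (every cycle spent optimizing must be recouped) boils down to a simple break-even calculation. The small C program below illustrates it with purely hypothetical cycle counts; the numbers are not measurements from Hybrid-DBT or any other engine.

/* Back-of-the-envelope break-even model for tiered dynamic compilation:
 * promoting a region to a higher optimization level only pays off once the
 * cycles saved per execution have amortized the compilation cost. */
#include <stdio.h>

int main(void)
{
    /* Hypothetical per-execution cost of one region, in cycles. */
    double t_interp = 900.0;     /* interpreted */
    double t_tier1  = 120.0;     /* quick, cheap translation */
    double t_tier2  = 80.0;      /* aggressive optimization */

    /* Hypothetical one-time compilation cost of each tier, in cycles. */
    double c_tier1 = 20000.0;
    double c_tier2 = 400000.0;

    /* Executions needed before each tier recoups its own cost
     * relative to the level below it. */
    double breakeven1 = c_tier1 / (t_interp - t_tier1);
    double breakeven2 = c_tier2 / (t_tier1 - t_tier2);

    printf("tier 1 pays off after ~%.0f executions\n", breakeven1);
    printf("tier 2 pays off after ~%.0f more executions\n", breakeven2);
    return 0;
}

With these illustrative numbers, cheap translation pays off after a few dozen executions while aggressive optimization needs thousands, which is why tiered designs reserve the expensive level for very hot regions.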
... Instead of proposing a complete system, other works focus on the characterization of particular solutions. Examples of works targeting control flow handling include [32], [23], [20], [21], [22]. Code cache management is discussed in [33]. ...
Conference Paper
HW/SW co-designed processors are currently seeing renewed interest due to their capability to boost performance without running into the power and complexity walls. By employing a software layer that performs dynamic binary translation and applies aggressive optimizations by exploiting runtime behaviour, these hybrid architectures provide better performance/watt. However, a poorly designed software layer can result in significant translation/optimization overheads that may offset its benefits. This work presents a detailed characterization of the software layer of a HW/SW co-designed processor using a variety of benchmark suites. We observe that the performance of the software layer is very sensitive to the characteristics of the emulated application, with a variance of more than 50%. We also show that the interaction between the software layer and the emulated application, while sharing the microarchitectural resources, can have a 0-20% impact on performance. Finally, we identify some key elements which should be further investigated to reduce the observed variations in performance. The paper provides critical insights to improve the software layer design.
... Our previous experience with ISA emulation and design [27,14,18] has shown us that a clean ISA, similar to MIPS-1, allows us to build a high-performance emulator capable of emulating guest applications on ARM and x86 host processors at near-native performance. As a result, we are designing OpenISA, an ISA that aims to be emulation friendly, empowering most of the devices in the world to execute the same set of applications. ...
Conference Paper
In the face of the high number of different hardware platforms that we need to program in the Internet of Things (IoT), Virtual Machines (VMs) are a promising technology to enable a program-once, deploy-everywhere strategy. Unfortunately, many existing VMs are too heavy to run on resource-constrained IoT devices. We present COISA, a compact virtual platform that relies on OpenISA, an Instruction Set Architecture (ISA) that strives for easy emulation, to allow a single program to be deployed on many platforms, including tiny microcontrollers. Our experimental results indicate that COISA is easily portable and is capable of running unmodified guest applications on highly heterogeneous host platforms, including one with only 2 kB of RAM.
... On the other hand, our mechanism makes it possible to obtain the actual behavior of the code execution, reflecting input data and dynamic control flow, via a dynamic binary transformation mechanism applied during code execution [23]. This means that we can obtain performance gains via dynamic tuning or dynamic parallelization without feedback to the source code. ...
Article
This paper presents a mechanism for detecting dynamic loop and procedure nesting on the fly during actual program execution. This mechanism aims primarily at enabling better strategies for performance tuning or parallelization. Using a pre-compiled application executable (machine code) as input, our mechanism statically generates simple but precise markers that indicate loop entries and loop exits, and dynamically monitors the loop nesting that appears during the actual execution together with the call context tree. To keep loop structures precise at all times, we monitor indirect jumps that enter loop regions and the setjmp/longjmp functions that cause irregular function call transfers. We also present a novel representation called the Loop-Call Context Graph that can keep track of inter-procedural loop nests. We implement our mechanism and evaluate it using the SPEC CPU2006 benchmark suite. The results confirm that our mechanism can successfully reveal the precise inter-procedural loop nest structures of all SPEC CPU2006 benchmark executions without any particular compiler support. The results also show that it can reduce runtime loop detection overheads compared with the existing loop profiling method.
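To make the marker mechanism more concrete, here is a minimal C sketch of runtime loop-nesting tracking driven by loop-entry/loop-exit markers; the marker API, loop IDs and stack policy are invented and only approximate the approach described in the abstract.

/* Sketch of runtime loop-nesting tracking driven by statically placed
 * loop-entry/loop-exit markers. Marker placement, IDs and the unwinding
 * policy for irregular exits are illustrative only. */
#include <stdio.h>

#define MAX_DEPTH 64

static int loop_stack[MAX_DEPTH];
static int depth;

static void loop_entry(int loop_id)           /* emitted at each loop header */
{
    loop_stack[depth++] = loop_id;
    printf("%*senter loop %d (depth %d)\n", depth * 2, "", loop_id, depth);
}

static void loop_exit(int loop_id)            /* emitted at each loop exit edge */
{
    while (depth > 0 && loop_stack[depth - 1] != loop_id)
        depth--;                              /* unwind irregular exits */
    if (depth > 0)
        depth--;
    printf("%*sexit  loop %d (depth %d)\n", (depth + 1) * 2, "", loop_id, depth);
}

int main(void)
{
    /* Instrumented version of: for i { for j { ... } } */
    loop_entry(1);
    for (int i = 0; i < 2; i++) {
        loop_entry(2);
        for (int j = 0; j < 2; j++)
            ;                                  /* loop body */
        loop_exit(2);
    }
    loop_exit(1);
    return 0;
}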
... Borin and Wu [26] study the overhead of the Intel research dynamic binary translator StarDBT when emulating the SPEC CPU 2000 benchmarks [29]. They found that branch handling and code duplication represent more than 64% of the StarDBT overhead, and that cold code translation and hot trace building together account for 34% of the DBT overhead. ...
Conference Paper
Virtual machines are versatile systems that can support innovative solutions to many problems. These systems usually rely on emulation techniques, such as interpretation and dynamic binary translation, to execute guest application code. In order to select the best emulation technique for each code segment, the system must predict whether the code is worth compiling (frequently executed) or not, known as hotness prediction. In this paper we show that the threshold-based hot-code predictor frequently mispredicts code hotness and, as a result, the VM's emulation performance becomes dominated by miscompilations. To do so, we developed a mathematical model that simulates the behavior of such a predictor, and we use it to quantify and characterize the impact of mispredictions on several benchmarks. We also show how the threshold choice can affect the predictor, what the major overhead components are, and how using SPEC to analyze VM performance can lead to misleading results.
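A threshold-based hotness predictor and the miscompilations it can cause are easy to sketch. The C toy below compiles a region after a fixed number of executions and counts how many compiled regions never run long enough afterwards to pay the compilation back; the threshold, break-even figure and execution counts are synthetic, and this is not the paper's actual mathematical model.

/* Sketch of how a threshold-based hotness predictor can miscompile: a region
 * is compiled after THRESHOLD executions, but if it barely runs afterwards
 * the compilation never pays off. All numbers are illustrative. */
#include <stdio.h>

#define THRESHOLD     50      /* executions before a region is compiled */
#define BREAK_EVEN   500      /* post-compilation executions needed to recoup */

int main(void)
{
    /* Total execution counts of a few hypothetical regions. */
    long counts[] = { 10, 60, 55, 10000, 51, 300000, 70, 49 };
    int n = (int)(sizeof counts / sizeof counts[0]);

    int compiled = 0, miscompiled = 0;
    for (int i = 0; i < n; i++) {
        if (counts[i] < THRESHOLD)
            continue;                          /* stays interpreted: fine */
        compiled++;
        long after = counts[i] - THRESHOLD;    /* executions of compiled code */
        if (after < BREAK_EVEN)
            miscompiled++;                     /* prediction did not pay off */
    }
    printf("%d regions compiled, %d of them miscompiled\n", compiled, miscompiled);
    return 0;
}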
... The quality of a DBT (Dynamic Binary Translator) is measured by its ability to translate computer programs efficiently and transparently [3,10]. Figure 1 shows the emulation overhead (slowdown) reported by HPT for different confidence levels. ...
Conference Paper
Several aspects of a computer system cause performance measurements to include random errors. Moreover, these systems are typically composed of a non-trivial combination of individual components that may cause one system to perform better or worse than another depending on the workload. Hence, properly measuring and comparing computer systems' performance are non-trivial tasks. The majority of the work published at recent major computer architecture conferences does not report the random errors measured in the experiments. The few remaining authors have been using only confidence intervals or standard deviations to quantify and factor out random errors. Recent publications claim that this approach can still lead to misleading conclusions. In this work, we reproduce and discuss the results obtained in a previous study. Finally, we propose SToCS, a tool that integrates several statistical frameworks and facilitates the analysis of computer science experiments.
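For reference, the baseline approach the paper critiques can be sketched in a few lines of C: summarize repeated runtime measurements with a sample mean and a normal-approximation 95% confidence interval. This is a generic illustration, not the SToCS tool, and with few repetitions a Student-t quantile should replace the 1.96 constant.

/* Minimal summary of repeated performance measurements: sample mean, sample
 * standard deviation and a normal-approximation 95% confidence interval.
 * Link with -lm. The run times below are hypothetical. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical execution times of one benchmark, in seconds. */
    double runs[] = { 12.1, 12.4, 11.9, 12.6, 12.2, 12.3, 12.0, 12.5 };
    int n = (int)(sizeof runs / sizeof runs[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += runs[i];
    double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (runs[i] - mean) * (runs[i] - mean);
    double sd = sqrt(ss / (n - 1));            /* sample standard deviation */

    double half = 1.96 * sd / sqrt((double)n); /* 95% CI half-width (normal approx.) */
    printf("mean = %.3f s, 95%% CI = [%.3f, %.3f]\n", mean, mean - half, mean + half);
    return 0;
}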
... The backend module is responsible for hot trace optimizations. In order to isolate the code optimization benefits from the optimization overhead, we modified the backend module to dump hot traces into files, and to load them back in future runs [10]. The dumping process occurs at the end of the first run. ...
Article
Dynamic Binary Translation (DBT) has been used as an approach to transparently run code on different architectures and generally makes use of runtime information to perform effective dynamic compiler optimization. Dynamic optimization techniques are hard to design, as they have to improve code performance under stringent runtime constraints. In this paper we present preliminary work on a code optimization technique called Hole Allocation. Our goal with this paper is to introduce the technique, highlight its potential, and point out new directions for improvement. Hole Allocation uses runtime information, collected by DBTs, to identify free register ranges (holes) which can be targets of register promotion. Preliminary experiments conducted with the SPEC CPU2000 benchmark, using only one memory access pattern, show that Hole Allocation can achieve moderate speed-ups for some program runs, but also reveal that performance can suffer in some cases, which still needs to be addressed.
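The idea of a register "hole" can be illustrated with a toy trace scan: find, for each register, the longest instruction range that never touches it, which becomes a candidate range for promoting a frequently accessed memory location into that register. The trace format and the scan below are invented for illustration and are not the paper's algorithm.

/* Toy sketch of the "hole" idea: scan a recorded instruction trace for a
 * range where some register is never read or written, so a memory location
 * could be promoted into it over that range. */
#include <stdio.h>

#define NUM_REGS 8

typedef struct {
    int reads[NUM_REGS];    /* 1 if the instruction reads the register  */
    int writes[NUM_REGS];   /* 1 if the instruction writes the register */
} TraceInsn;

/* Report, for each register, the longest run of instructions that never
 * touch it: a candidate hole for register promotion. */
static void find_holes(const TraceInsn *trace, int len)
{
    for (int r = 0; r < NUM_REGS; r++) {
        int best = 0, best_start = -1, run = 0;
        for (int i = 0; i < len; i++) {
            if (trace[i].reads[r] || trace[i].writes[r]) {
                run = 0;
            } else if (++run > best) {
                best = run;
                best_start = i - run + 1;
            }
        }
        if (best > 0)
            printf("r%d: hole of %d insns starting at %d\n", r, best, best_start);
    }
}

int main(void)
{
    /* Tiny synthetic trace: r1 is busy everywhere, r2 is free in the middle. */
    static TraceInsn trace[5];
    for (int i = 0; i < 5; i++) trace[i].reads[1] = 1;
    trace[0].writes[2] = 1;
    trace[4].reads[2] = 1;
    find_holes(trace, 5);
    return 0;
}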