Figure 2 - uploaded by Peng Wu
Examples of call stack comparison

Source publication
Conference Paper
Full-text available
Trace-based compilation is a promising technique for language compilers and binary translators. It offers the potential to expand the compilation scopes that have traditionally been limited by method boundaries. Detecting repeating cyclic execution paths and capturing the detected repetitions into traces is a key requirement for trace selection alg...

Contexts in source publication

Context 1
... fact the call stacks are longer in actual execution (there are call stack elements corresponding to the callers of method f), but only the relevant parts are shown in Figure 2, and the irrelevant parts are shown as boxes with "...". The irrelevant parts are always excluded from the top k elements compared. ...
Context 2
... element of a call stack contains a return address. For example, the top element B of the call stack "...-B" in the first row of Figure 2 (b) means that control will return to B after the next method return. ...
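The comparison described in these contexts can be sketched in Python. This is an illustrative reconstruction under assumed data structures (a call stack as a list of return addresses, top of stack last), not the authors' implementation:

```python
def same_cyclic_context(stack_a, stack_b, k=2):
    """Compare the top k return addresses of two call stacks.

    Hypothetical sketch: two occurrences of the same program point are
    treated as a repetition of the same cyclic path only if their calling
    contexts (the top k elements of the call stack) match; the deeper,
    irrelevant parts of the stack are excluded from the comparison.
    """
    return stack_a[-k:] == stack_b[-k:]

# Mirroring Figure 2's notation: stacks "...-A-B" and "...-C-B" share the
# top element B (control returns to B next) but differ one level down,
# so with k=2 they are not considered the same context.
print(same_cyclic_context(["...", "A", "B"], ["...", "C", "B"], k=2))  # False
print(same_cyclic_context(["...", "A", "B"], ["...", "A", "B"], k=2))  # True
```

With k=1 the first pair would compare equal, which illustrates why looking at more than the topmost element matters.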

Similar publications

Article
Full-text available
The paper concerns the selection of an optimal decision from a set of n decisions provided by experts. Three-dimensional, or spatial, decisions describe three attributes of the chosen problem. The paper introduces the notion of the closeness of two three-dimensional decisions and of the closeness of a decision to a set of mutually close decisions. Decisions that are not close are eliminat...
Article
Full-text available
We examined two overarching topics: What are the Fundamental Interpersonal Relations Orientation-Behavior (FIRO-B), Perceived Stress Scale (PSS), and Self-rating Depression Scale (SDS) scores in medical students? Do their interpersonal needs correlate with stress and depression? FIRO-B, PSS-10, and SDS were administered to 82 freshmen in College of...
Conference Paper
Full-text available
Monitoring enterprise applications that consist of multiple heterogeneous components executing in different runtimes is a challenging problem particularly from a business centric perspective. We propose a business centric monitoring approach that involves using business information fields (invariants) to relate service activity to business composit...

Citations

... • Call Stack Comparison: To detect recursive functions, Hayashizaki et al. (2011) inspect the call stack to determine when the target of the current function call is already on the stack. ...
... imprecise) in the presence of higher-order functions and the possibility of storing functions as values in the heap. In this section, we describe two runtime approaches to detecting appropriate loops, both of which are dynamic variants of the "false loop filtering" technique of Hayashizaki et al. (2011). ...
... Such a "phase shift" of the loop has no impact on performance. This approach is a simplified version of the one proposed by Hayashizaki et al. (2011). Their solution looks at more levels of the stack to decide whether something is a false loop. ...
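The call-stack inspection these citations describe can be sketched as follows; this is a minimal illustration of the idea (all names are hypothetical), not the cited implementation:

```python
def is_recursive_call(call_stack, callee):
    """Return True if the callee is already active on the call stack.

    Hypothetical sketch: if the target of the current call already
    appears on the stack, the back edge comes from recursion rather
    than a genuine loop, so trace selection can treat it differently
    (a dynamic variant of false loop filtering).
    """
    return callee in call_stack

stack = ["main", "f", "g"]
print(is_recursive_call(stack, "f"))  # True: f is already active
print(is_recursive_call(stack, "h"))  # False: h would be a fresh call
```

A fuller variant would, as noted above, examine more levels of the stack before classifying something as a false loop.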
Article
Full-text available
We present Pycket, a high-performance tracing JIT compiler for Racket. Pycket supports a wide variety of the sophisticated features in Racket such as contracts, continuations, classes, structures, dynamic binding, and more. On average, over a standard suite of benchmarks, Pycket outperforms existing compilers, both Racket's JIT and other highly-optimizing Scheme compilers. Further, Pycket provides much better performance for Racket proxies than existing systems, dramatically reducing the overhead of contracts and gradual typing. We validate this claim with performance evaluation on multiple existing benchmark suites. The Pycket implementation is of independent interest as an application of the RPython meta-tracing framework (originally created for PyPy), which automatically generates tracing JIT compilers from interpreters. Prior work on meta-tracing focuses on bytecode interpreters, whereas Pycket is a high-level interpreter based on the CEK abstract machine and operates directly on abstract syntax trees. Pycket supports proper tail calls and first-class continuations. In the setting of a functional language, where recursion and higher-order functions are more prevalent than explicit loops, the most significant performance challenge for a tracing JIT is identifying which control flows constitute a loop; we discuss two strategies for identifying loops and measure their impact.
... The performance obtained was not comparable to that achieved by Lee et al., but the technique proved advantageous for applying various code optimizations [37]. Five years later, Hayashizaki et al. proposed a technique for detecting and removing false loops within an execution trace and achieved a 37% performance improvement on the programs evaluated [38]. ...
... The next trend in JIT compiler construction will probably use trace-based selection of execution paths to detect more efficient critical regions. Hayashizaki et al. [38] demonstrated that removing false loops from the detected traces is a good strategy for improving performance, although another work, by Inoue et al. [70], showed that detecting large execution traces and scheduling them for compilation is also a good strategy. Additionally, the advantage of keeping large execution traces is that the larger the trace, the greater the optimization opportunities [37], which translate into more efficient code. ...
... Recently, trace-based compilation has gained popularity in dynamic scripting languages [5,11] and high-level language virtual machines [12,16,17]. Inoue et al. [16,17] employ trace-based optimization based only on variations of NET in their trace-based Java virtual machine. ...
Conference Paper
Full-text available
Most dynamic binary translators (DBT) and optimizers (DBO) target binary traces, i.e. frequently executed paths, as code regions to be translated and optimized. Code region formation is the most important first step in all DBTs and DBOs. The quality of the dynamically formed code regions determines the extent and the types of optimization opportunities that can be exposed to DBTs and DBOs, and thus determines the ultimate quality of the final optimized code. The Next-Executing-Tail (NET) trace formation method used in HP Dynamo is an early example of such techniques. Many existing trace formation schemes are variants of NET. They work very well for most binary traces, but they also suffer from a major problem: the formed traces may contain a large number of early exits that may be taken during execution. If this happens frequently, the program execution will spend more time in the slow binary interpreter or in the unoptimized code regions than in the optimized traces in the code cache. The benefit of the trace optimization is thus lost. Traces/regions with frequently taken early exits are called delinquent traces/regions. Our empirical study shows that at least 8 of the 12 SPEC CPU2006 integer benchmarks have delinquent traces. In this paper, we propose a lightweight region formation technique called Early-Exit Guided Region Formation (EEG) to improve the quality of the formed traces/regions. It iteratively identifies and merges delinquent regions into larger code regions. We have implemented our EEG algorithm in two LLVM-based multi-threaded DBTs targeting the ARM and IA32 instruction-set architectures (ISAs), respectively. Using the SPEC CPU2006 benchmark suite with reference inputs, our results show that compared to a NET variant currently used in QEMU, a state-of-the-art retargetable DBT, EEG can achieve a significant performance improvement of up to 72% (27% on average) for IA32 and up to 49% (23% on average) for ARM.
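The delinquent-trace notion in this abstract can be illustrated with a short sketch. This is an assumed, simplified model (the names, counter representation, and 0.5 threshold are all illustrative, not from the EEG paper):

```python
def find_delinquent_traces(exit_counts, completion_counts, threshold=0.5):
    """Flag traces whose early-exit rate exceeds a threshold.

    Hypothetical sketch: a trace is delinquent when executions leave it
    through an early side exit more often than they run it to completion.
    An EEG-style scheme would select such traces as candidates for
    merging into larger code regions.
    """
    delinquent = []
    for trace_id, exits in exit_counts.items():
        total = exits + completion_counts.get(trace_id, 0)
        if total and exits / total > threshold:
            delinquent.append(trace_id)
    return delinquent

# t1 exits early 90% of the time; t2 almost always completes.
print(find_delinquent_traces({"t1": 90, "t2": 5}, {"t1": 10, "t2": 95}))  # ['t1']
```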
... We relax such a backward-branch constraint, and stop trace formation only when the same program counter (PC) is executed again. This relaxed algorithm is similar to the cyclic-path-based repetition detection scheme in [14]: counter ← counter + 1; if counter ≥ threshold then enable_predict ← TRUE ...
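The relaxed stop condition quoted above can be sketched as follows; this is an illustrative reconstruction (the `execute_one` callback and the toy control flow are assumptions), not the cited implementation:

```python
def record_trace(execute_one, start_pc):
    """Record instructions until an already-seen PC is executed again.

    Hypothetical sketch: instead of ending trace recording only at a
    backward branch, recording stops as soon as any program counter (PC)
    repeats, which closes a cyclic path.
    """
    seen = set()
    trace = []
    pc = start_pc
    while pc not in seen:
        seen.add(pc)
        trace.append(pc)
        pc = execute_one(pc)  # callback returning the next PC
    return trace

# Toy control flow: 0 -> 1 -> 2 -> 1; PC 1 repeats, ending the trace.
nxt = {0: 1, 1: 2, 2: 1}
print(record_trace(lambda pc: nxt[pc], 0))  # [0, 1, 2]
```

In a real system the recorded trace would only be compiled once a hotness counter for its head reaches a threshold, as the fragment above indicates.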
Article
Full-text available
Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation and security. However, several factors often impede its performance: (1) emulation overhead before translation; (2) translation and optimization overhead; and (3) translated code quality. For the translator itself, a further issue is retargetability: supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of ubiquitous multicore platforms and use a multithreaded approach to implement DBT. By running the translators and the dynamic binary optimizers on different threads on different cores, we can off-load the overhead caused by DBT from the target applications, thus affording DBT more sophisticated optimization techniques as well as support for retargetability. Using QEMU (a popular retargetable DBT for system virtualization) and LLVM (Low Level Virtual Machine) as our building blocks, we demonstrated in a multi-threaded DBT prototype, called HQEMU, that it could improve QEMU performance by a factor of 2.4X and 4X on the SPEC 2006 integer and floating point benchmarks for x86 to x86-64 emulation, respectively; i.e., it is only 2.5X and 2.1X slower than native execution of the same benchmarks on x86-64, as opposed to the 6X and 8.4X slowdown of QEMU. For ARM to x86-64 emulation, HQEMU gains a factor of 2.4X speedup over QEMU for the SPEC 2006 integer benchmarks.
... To provide users with a low-overhead system for infinite loop detection and escape, we have designed Jolt around two components: Compiler: Jolt's compiler enables a developer or user to compile the source code of his or her application to obtain a binary executable that is amenable to infinite loop detection. In particular, Jolt's compiler adds lightweight instrumentation to the source of the application to identify the boundaries of loops, which can be difficult to identify accurately from a binary executable [15, 34]. Detector: Jolt's detector can, at the user's request, dynamically attach to and analyze a running instance of an application that the user believes is caught in an infinite loop (if the application has been compiled with Jolt's compiler). ...
Conference Paper
Full-text available
Infinite loops can make applications unresponsive. Potential problems include lost work or output, denied access to application functionality, and a lack of responses to urgent events. We present Jolt, a novel system for dynamically detecting and escaping infinite loops. At the user’s request, Jolt attaches to an application to monitor its progress. Specifically, Jolt records the program state at the start of each loop iteration. If two consecutive loop iterations produce the same state, Jolt reports to the user that the application is in an infinite loop. At the user’s option, Jolt can then transfer control to a statement following the loop, thereby allowing the application to escape the infinite loop and ideally continue its productive execution. The immediate goal is to enable the application to execute long enough to save any pending work, finish any in-progress computations, or respond to any urgent events. We evaluated Jolt by applying it to detect and escape eight infinite loops in five benchmark applications. Jolt was able to detect seven of the eight infinite loops (the eighth changes the state on every iteration). We also evaluated the effect of escaping an infinite loop as an alternative to terminating the application. In all of our benchmark applications, escaping an infinite loop produced a more useful output than terminating the application. Finally, we evaluated how well escaping from an infinite loop approximated the correction that the developers later made to the application. For two out of our eight loops, escaping the infinite loop produced the same output as the corrected version of the application.
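Jolt's detection rule, as stated in this abstract, can be sketched in a few lines. This is a simplified model (state snapshots as hashable values, offline rather than attached to a live process; all names are illustrative):

```python
def detect_infinite_loop(states):
    """Report the first iteration whose loop-head state repeats its predecessor.

    Hypothetical sketch of the rule above: record the program state at
    the start of each loop iteration; if two consecutive iterations
    produce the same state, the loop can make no further progress and
    is reported as infinite.
    """
    prev = object()  # sentinel that never equals a real snapshot
    for i, state in enumerate(states):
        if state == prev:
            return i  # this iteration saw no state change
        prev = state
    return None  # no repetition observed

# State stops changing at the fourth recorded iteration (index 3):
print(detect_infinite_loop([(0,), (1,), (2,), (2,)]))  # 3
```

As the abstract notes, a loop that changes some state on every iteration (the eighth benchmark loop) evades this check by construction.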
Chapter
Static flow analyses compute a safe approximation of a program’s dataflow without executing it. Dynamic flow analyses compute a similar safe approximation by running the program on test data such that it achieves sufficient coverage. We design and implement a dynamic flow analysis for JavaScript. Our formalization and implementation observe a program’s execution in a training run and generate flow constraints from the observations. We show that a solution of the constraints yields a safe approximation to the program’s dataflow if each path in every function is executed at least once in the training run. As a by-product, we can reconstruct types for JavaScript functions from the results of the flow analysis. Our implementation shows that dynamic flow analysis is feasible for JavaScript. While our formalization concentrates on a core language, the implementation covers full JavaScript. We evaluated the implementation using the SunSpider benchmark.
Chapter
In widely-used actor-based programming languages, such as Erlang, sequential execution performance is as important as scalability of concurrency. In order to improve sequential performance of Erlang, we develop Pyrlang, an Erlang virtual machine with a just-in-time (JIT) compiler by applying an existing meta-tracing JIT compiler. In this paper, we overview our implementation and present the optimization techniques for Erlang programs, most of which heavily rely on function recursion. Our preliminary evaluation showed approximately 38% speedup over the standard Erlang interpreter.
Article
Full-text available
Region formation is an important step in dynamic binary translation to select hot code regions for translation and optimization. The quality of the formed regions determines the extent of optimizations and thus determines the final execution performance. Moreover, the overall performance is very sensitive to the formation overhead, because region formation can have a non-trivial cost. For addressing the dual issues of region quality and region formation overhead, this article presents a lightweight region formation method guided by processor tracing, e.g., Intel PT. We leverage the branch history information stored in the processor to reconstruct the program execution profile and effectively form high-quality regions with low cost. Furthermore, we present the designs of lightweight hardware performance monitoring sampling and the branch instruction decode cache to minimize region formation overhead. Using ARM64 to x86-64 translations, the experiment results show that our method achieves a performance speedup of up to 1.53× (1.16× on average) for SPEC CPU2006 benchmarks with reference inputs, compared to the well-known software-based trace formation method, Next Executing Tail (NET). The performance results of x86-64 to ARM64 translations also show a speedup of up to 1.25× over NET for CINT2006 benchmarks with reference inputs. The comparison with a relaxed NETPlus region formation method further demonstrates that our method achieves the best performance and lowest compilation overhead.
Conference Paper
In widely-used actor-based programming languages, such as Erlang, sequential execution performance is as important as scalability of concurrency. We are developing a virtual machine called Pyrlang for the Erlang BEAM bytecode with a just-in-time (JIT) compiler. By using RPython's tracing JIT compiler, our preliminary evaluation showed approximately twice the speedup over the standard Erlang interpreter. In this poster, we overview the design of Pyrlang and the techniques for applying RPython's tracing JIT compiler to BEAM bytecode programs written in Erlang's functional style of programming.
Conference Paper
With the growing number of both parallel architectures and related programming models, benchmarking becomes very tricky, since parallel programming requires architecture-dependent compilers and languages as well as high programming expertise. Beyond comparing architectures with synthetic benchmarks, benchmarking is increasingly used to design specialized systems composed of heterogeneous computing resources that optimize the performance or performance/watt ratio (e.g. embedded systems designers build a System-on-Chip (SoC) out of dedicated and well-chosen components). In the High-Performance-Computing (HPC) domain, systems are designed with symmetric and scalable computing nodes built to deliver the highest performance on a wide variety of applications. However, HPC is now facing cost and power consumption issues which motivate the design of heterogeneous systems. This is one of the rationales of the European FiPS project, which proposes to develop a hardware architecture and a software methodology that ease the design of such systems. Thus, a fair comparison between architectures with respect to a given application is of growing importance. Unfortunately, porting the application to all available architectures using the related programming models is impossible. To tackle this challenge, we introduce a novel methodology to evaluate and compare parallel architectures in order to ease the work of the programmer. Based on the use of micro-benchmarks, code profiling and characterization tools, this methodology introduces a semi-automatic prediction of the performance of sequential applications on a set of parallel architectures. In addition, the performance estimation is correlated with the cost of other criteria such as power or portability effort. Introduced for targeting vision-based embedded applications, our methodology is currently being extended to target more complex applications from the HPC world.
This paper extends our work with new experiments and early results on a real HPC application of DNA sequencing.