Fig 2 - uploaded by Marc Casas
Input signal. Sampling frequency = α.

Source publication
Article
Full-text available
In this paper we present an automatic system able to detect the internal structure of executions of high-performance computing applications. This automatic system is able to rule out non-significant regions of executions, to detect redundancies, and, finally, to select small but significant execution regions. This automatic detection process is bas...

Context in source publication

Context 1
... example, if λ = 0.3 and δ = 10, we select all of the coefficients equal to or greater than 30% of the maximum and, after that, for each selected coefficient, we take 10 coefficients from the right neighborhood and 10 more from the left neighborhood. Figure 2 gives an example of a signal. It is clear that this signal has three well-defined phases: first, a phase with low-frequency behavior, then a second phase with high-frequency behavior, and, finally, a low-frequency phase. ...
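This selection rule lends itself to a short illustration. The sketch below is a generic NumPy rendering under the assumption that the wavelet coefficients are available as a 1-D array; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def select_coefficients(coeffs, lam=0.3, delta=10):
    """Keep every coefficient >= lam * max(|coeffs|), plus delta neighbors on each side."""
    coeffs = np.asarray(coeffs, dtype=float)
    threshold = lam * np.max(np.abs(coeffs))
    keep = np.zeros(coeffs.size, dtype=bool)
    for i in np.flatnonzero(np.abs(coeffs) >= threshold):
        # mark the selected coefficient and its neighborhood on both sides
        keep[max(0, i - delta):min(coeffs.size, i + delta + 1)] = True
    return np.flatnonzero(keep)  # indices of the retained coefficients
```

Called with lam=0.3 and delta=10, this reproduces the example above: every coefficient at or above 30% of the maximum is kept, together with its 10 neighbors on each side.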

Similar publications

Article
Full-text available
Conventional data simplification algorithms depend heavily on scanning technology. However, with the development of scanning technology, conventional algorithms are unable to process the large amounts of redundant data, leading to increased noise in the point cloud data, failure to locate the split center, and low simplification efficiency. In order to remove the...
Article
Full-text available
The benefits and limitations inherent to the 2D post-processing of measurements from Brillouin optical time-domain analyzers are investigated from a fundamental point of view. In a preliminary step, the impact of curve fitting on the precision of the estimated Brillouin frequency shift is analyzed, enabling a fair comparison between the representat...
Article
Full-text available
In this paper the tunable Q-factor wavelet transform is applied to damage identification. A fixed-fixed beam damage identification problem is demonstrated. Translational and rotational mode shapes are used as the input signal; the TQWT algorithm depends on the Q-factor and asymptotic redundancy, and when it matches the oscillatory behavior of the input si...
Chapter
Full-text available
In this study we demonstrate the use of the Wavelet Transform (WT) as an improvement tool for the horizontal-to-vertical spectral ratio (HVSR) technique. Since the use of non-stationary transients in microtremor signals remains an open question, we investigate the effect of long-duration transients, undetectable by amplitude-thresholding methods, in HVS...
Article
Full-text available
In this paper a coupled dictionary learning mechanism with a mapping function is proposed in the wavelet domain for the task of Single-Image Super-Resolution. Sparsity is used as the invariant feature for achieving super-resolution. Instead of using a single dictionary, multiple compact dictionaries are proposed in the wavelet domain. Such dictionarie...

Citations

... In the field of performance analysis, Casas et al. [8] proposed to construct signals of metrics (e.g. number of active processes, amount of communicated data, etc.), and then to apply the discrete wavelet transform to keep the highest-frequency portions, and auto-correlation to find the frequency of the application's phases. ...
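As a rough, generic illustration of the auto-correlation step (not the authors' code), the sketch below estimates the dominant period of a per-timestep metric signal such as the number of active processes; the function name and signal layout are assumptions.

```python
import numpy as np

def estimate_phase_period(signal):
    """Estimate the dominant period of a 1-D metric signal via its autocorrelation."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]  # lags 0 .. N-1
    if acf[0] == 0:
        return None                                      # constant signal, no period
    acf = acf / acf[0]                                   # normalize so acf[0] == 1
    # the first local maximum after lag 0 approximates the period of the phases
    for lag in range(1, acf.size - 1):
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag
    return None  # no periodicity detected
```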
Preprint
Characterizing the temporal I/O behavior of an HPC application is a challenging task, but informing the system about it can be valuable for techniques such as I/O scheduling, burst buffer management, and many more, especially if provided online. In this work, we focus on the most commonly discussed temporal I/O behavior aspect: the periodicity of I/O. Specifically, we propose to examine the periodicity of the I/O phases using a signal processing technique, namely the Discrete Fourier Transform (DFT). Our approach, named FTIO, also provides metrics that quantify how far the signal is from being periodic, and hence convey confidence in the DFT-obtained period. We validate our approach with large-scale experiments on a production system and examine its limitations extensively.
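A minimal sketch of the underlying idea, assuming the I/O activity has been resampled into an evenly spaced bandwidth-over-time signal; this is a generic DFT-based period estimate, not FTIO itself, and it omits the confidence metrics described in the abstract.

```python
import numpy as np

def dominant_period(signal, sample_spacing=1.0):
    """Return the period (in the units of sample_spacing) of the strongest frequency."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # remove the DC component
    amplitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=sample_spacing)
    k = np.argmax(amplitude[1:]) + 1          # skip the zero-frequency bin
    return 1.0 / freqs[k]
```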
... The mere detection of communication patterns may still generate a lot of data that is hard for a software analyst to grasp. To alleviate this problem, researchers have proposed the concept of trace segmentation, in which a large trace is partitioned into distinct segments that depict execution phases of the traced scenario, and communication patterns are detected within each segment [5][6][7]. ...
... The main objective is to provide a way for the software analyst to only focus on parts of the trace of interest instead of browsing the whole trace. Casas et al. [5] and Chetsa et al. [9] showed how MPI trace execution phases can help with performance optimization tasks by uncovering regions in a trace with the highest latency. Isaacs et al. [6] presented a trace visualization and analysis tool that logically orders and visualizes the MPI communication behavior into fine-grained phases to determine the lateness in program operations using temporal metrics and visual inspection. ...
... The Solve phase is labeled as the phase containing most of the parallel iterations. In [5], Casas et al. used CPU computing bursts in conjunction with communication events (signals) in order to define several metrics that could be used in the identification of sub-phases in the Solve (computational) phase. Our approach distinctly identifies the different phases in the program using information theory principles. ...
Article
Full-text available
High Performance Computing (HPC) systems are used in a variety of industrial and research sectors to solve complex problems that require powerful computing platforms. For these systems to remain reliable, we should be able to debug and analyze their behavior in order to detect root causes of potential poor performance. Execution traces hold important information regarding the events and interactions among communicating processes, which are essential for the debugging of inter-process communication. Traces, however, tend to be considerably large, hindering their applicability. In previous work, we presented an approach for automatically detecting communication patterns and segmenting large HPC traces into execution phases. The goal is to reduce the effort of analyzing traces by allowing software analysts to focus on smaller parts of interest. In this paper, we propose an approach for detecting and localizing inefficient communication patterns using statistical and trace segmentation methods. In addition, we use the Analytic Hierarchy Process to categorize slow communication patterns based on their severity and complexity levels. Using our approach, an analyst can quickly locate slow communication patterns that may be the cause of important performance problems. We show the effectiveness of our approach by applying it to large traces from three HPC systems.
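For context, the Analytic Hierarchy Process derives priority weights from a pairwise comparison matrix; the sketch below shows that standard computation with hypothetical criteria and judgment values, not figures from the paper.

```python
import numpy as np

# Hypothetical pairwise comparison matrix for two criteria: severity vs. complexity.
# A[i, j] states how strongly criterion i is preferred over criterion j (Saaty scale).
A = np.array([
    [1.0, 3.0],        # severity judged 3x more important than complexity
    [1.0 / 3.0, 1.0],
])

# AHP priority weights: the normalized principal eigenvector of A.
eigenvalues, eigenvectors = np.linalg.eig(A)
principal = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
weights = principal / principal.sum()
print(weights)  # approximately [0.75, 0.25]
```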
... We can see that the execution can be divided into three execution paths: Main → FunctionA (Path 1), Main → FunctionB → FunctionC (Path 2), and Main → FunctionB → FunctionD (Path 3). The iterative structure shown in Application 1 has been selected because it appears in many long-running applications [26][27][28]. Figure 1(a) shows the traditional approach of energy estimation by doing a full profiling of the whole execution of the application. ...
Article
Full-text available
The computation power in data center facilities is increasing significantly. This brings with it an increase in power consumption in data centers. Techniques such as power budgeting or resource management are used in data centers to increase energy efficiency. These techniques require knowing the energy consumption beforehand, obtained through a full profiling of the applications. This is not feasible in scenarios with long-running applications that have long execution times. To tackle this problem, we present a fast energy estimation framework for long-running applications. The framework is able to estimate the dynamic CPU and memory energy of the application without the need to perform a complete execution. For that purpose, we leverage the concept of application signature. The application signature is a reduced version, in terms of execution time, of the original application. Our fast energy estimation framework is validated with a set of long-running applications and obtains RMS values of 11.4% and 12.8% for the CPU and memory energy estimation errors, respectively. We define the concept of Compression Ratio as an indicator of the acceleration of the energy estimation process. Our framework is able to obtain Compression Ratio values in the range of 10.1 to 191.2.
... Offline phase classification techniques use execution characteristics, such as basic blocks and execution traces, which are gathered during compile time or through a priori application analysis. These techniques are generally easier to develop than online phase classification techniques, since online techniques have stringent constraints where negative impacts on application execution time must be minimized. ...
Table excerpt: metrics (instruction execution [75]; function execution [40]; instructions per cycle [22]) and analysis method (Manhattan distance [75]; pattern analysis [40]; digital signal processing [22]).
... Casas et al. [22] used a signal processing technique, a discrete wavelet transform, to classify phases in multithreaded applications. They found that, by using a signal processing technique on data from a trace file, they could detect only the most relevant frequencies of an application's execution, that is, the frequencies most strongly related to the loops of an application's source code. ...
Preprint
Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.
... The main challenge in sampling simulations of multi-threaded programs is to ensure that at the beginning of each detailed simulation interval all threads have made the same amount of progress as in a full detailed simulation. A technique proposed by Casas et al. [11] detects periodic behavior in parallel applications using signal processing techniques. Carlson et al. [8] fast-forward different threads at different rates between intervals of detailed simulation. ...
Article
Sampled simulation is a mature technique for reducing simulation time of single-threaded programs. Nevertheless, current sampling techniques do not take advantage of other execution models, like task-based execution, to provide both more accurate and faster simulation. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals, we employ a novel fast-forwarding mechanism for dynamically scheduled programs. We evaluate different automatic techniques for clustering task instances and show that DBSCAN clustering combined with analytical performance modeling provides the best trade-off of simulation speed and accuracy. TaskPoint is the first technique combining sampled simulation and analytical modeling and provides a new way to trade off simulation speed and accuracy. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 8 simulated threads by an average factor of 220x at an average error of 0.5% and a maximum error of 7.9%.
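As an illustration of the clustering step alone, the sketch below groups task instances by simple per-instance features with scikit-learn's DBSCAN; the features, parameter values, and data are assumptions rather than TaskPoint's actual configuration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical per-task-instance features: (instruction count, memory accesses).
features = np.array([
    [1_200_000,  45_000],
    [1_180_000,  47_000],
    [3_500_000, 160_000],
    [3_520_000, 158_000],
])

# Scale the features so DBSCAN's distance threshold treats both dimensions equally.
scaled = StandardScaler().fit_transform(features)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(scaled)

# Simulate one representative instance per cluster in detail and
# fast-forward the remaining instances of that cluster.
for cluster_id in sorted(set(labels) - {-1}):
    representative = int(np.flatnonzero(labels == cluster_id)[0])
    print(f"cluster {cluster_id}: simulate task instance {representative} in detail")
```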
... This is simple to achieve, as intuition and experiments show that it is enough to include the initialization tasks and the first couple of computation phases of the application (iterations in the case of iterative algorithms). An alternative solution is to apply existing techniques for automatic detection of the phases [9] and use this information to decide at execution time what the correct window size is. Another possible approach is to edit the application source code, adding a call to the runtime system API to indicate that the partition must be done at that specific point. ...
Conference Paper
Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52X and average improvements of 1.12X with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28X on average with respect to the state-of-the-art.
... Casas et al. [7] identify phases and iterations in MPI applications. They convert time-series data into signals and apply signal processing algorithms to detect program structures. ...
Conference Paper
Full-text available
Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While profiles of parallel programs can identify the presence of performance problems, often developers need to analyze execution behavior over time to understand how and why parallel inefficiencies arise. With the growing scale of supercomputers, such manual analysis is becoming increasingly difficult. In many cases, performance problems of interest only appear at larger scales. Manual analysis of time series data from executions on extreme-scale parallel systems is daunting as the volume of data across processors and time makes it difficult to assimilate. To address this problem, we have developed an automated analysis framework that generates compact summaries of time series data for parallel program executions. These summaries provide users with high-level insight into patterns in the performance data and can quickly direct a user's attention to potential performance bottlenecks. We demonstrate the effectiveness of our framework by applying it to time-series measurements of two scientific codes.
... This is a crucial property for the execution of the MAPE loop, since the manager can assume that the decisions taken at the current control step will still be valid at the next control step, because in the meantime the characteristics of the computation have not changed significantly. Many real applications fall into this category [25,26,24,27,10]. It is worth noting that the application may still be characterized by different phases, unless there are extreme behaviors like a different phase at each control step. ...
Article
Full-text available
Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of resources to allocate to the application in order to meet performance and power consumption requirements. This is particularly relevant in Fog Applications, where data is generated by a number of devices at a varying rate, according to users' activity. By dynamically selecting the appropriate number of resources, it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. Implementing such algorithms may be a complex task, due to low-level interactions with the underlying hardware and to the need for non-intrusive, low-overhead monitoring of the applications. For these reasons, in this paper we propose Nornir, a C++-based framework, which can be used to enforce performance and power consumption constraints on parallel applications running on shared memory multicores. The framework can be easily customized by algorithm designers to implement new self-adaptive policies. By instrumenting the applications in the PARSEC benchmark, we provide strategy designers with a wide set of applications already interfaced with Nornir. In addition to this, to prove its flexibility, we implemented and compared several state-of-the-art existing policies, showing that Nornir can also be used to easily analyze different algorithms and to provide useful insights on them.
... In this approach various application regions are monitored to determine the application's communication structure, the amount of time it spends performing various operations, and the number of events, such as cache misses, that occur during each operation. While these measurements can be used to derive non-trivial information about HPC applications, like the internal structure of their executions [7] or the reasons behind performance slowdowns [6,5], they can only enable indirect inference about how the properties of a network relate to application performance. ...
Technical Report
Full-text available
Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, in spite of their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make the network a common performance bottleneck. To achieve high performance in spite of network limitations, application developers require tools to measure their applications' network utilization and to inform them about how the network's communication capacity relates to the performance of their applications. This paper presents a new performance measurement and analysis methodology based on empirical measurements of network behavior. Our approach uses two benchmarks that inject extra network communication. The first probes the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second aggressively injects network traffic while a software component runs to evaluate its performance on less capable networks or when it shares the network with other software components. We combine information from the two measurement benchmarks to predict the performance slowdown experienced by multiple software components when they share a single network. We consider several training sets that use raw data from the measurement benchmarks to produce our predictions. The sensitivity of the training set size is evaluated by considering 12 different scenarios. Our results find the optimum training set size to be around 200 training points. When optimal data sets are used, the proposed methodology provides predictions with an average error of 9.58% considering 36 scenarios.
... Detailed parametric models of the performance of HPC applications when run on large-scale architectures include network parameters like latency or bandwidth [19]. ...