Fig 2 - uploaded by Marc Casas
Input signal. Sampling frequency = α.

Source publication
Article
Full-text available
In this paper we present an automatic system able to detect the internal structure of executions of high-performance computing applications. This automatic system is able to rule out non-significant regions of executions, to detect redundancies, and, finally, to select small but significant execution regions. This automatic detection process is bas...

Context in source publication

Context 1
... example, if λ = 0.3 and δ = 10, we select all of the coefficients equal to or greater than 30% of the maximum and, after that, for each selected coefficient, we take 10 coefficients from the right neighborhood and 10 more from the left neighborhood. Figure 2 gives an example of a signal. It is clear that this signal has three well-defined phases: first, a phase with low-frequency behavior, then a second phase with high-frequency behavior, and, finally, a low-frequency phase. ...
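This selection rule lends itself to a short illustration. The sketch below is a generic NumPy rendering under the assumption that the wavelet coefficients are available as a 1-D array; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def select_coefficients(coeffs, lam=0.3, delta=10):
    """Keep every coefficient >= lam * max(|coeffs|), plus delta neighbors on each side."""
    coeffs = np.asarray(coeffs, dtype=float)
    threshold = lam * np.max(np.abs(coeffs))
    keep = np.zeros(coeffs.size, dtype=bool)
    for i in np.flatnonzero(np.abs(coeffs) >= threshold):
        # mark the selected coefficient and its neighborhood on both sides
        keep[max(0, i - delta):min(coeffs.size, i + delta + 1)] = True
    return np.flatnonzero(keep)  # indices of the retained coefficients
```

Called with lam=0.3 and delta=10, this reproduces the example above: every coefficient at or above 30% of the maximum is kept, together with its 10 neighbors on each side.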

Similar publications

Article
Full-text available
Conventional data simplification algorithms depend heavily on scanning technology. However, with the development of scanning technology, conventional algorithms are unable to process the large amounts of redundant data, leading to increased noise in the point cloud data, failure to locate the split center, and low simplification efficiency. In order to remove the...
Article
Full-text available
The benefits and limitations inherent to the 2D post-processing of measurements from Brillouin optical time-domain analyzers are investigated from a fundamental point of view. In a preliminary step, the impact of curve fitting on the precision of the estimated Brillouin frequency shift is analyzed, enabling a fair comparison between the representat...
Article
Full-text available
In this paper the tunable Q-factor wavelet transform is applied to damage identification. A fixed-fixed beam damage identification problem is demonstrated. Translational and rotational mode shapes are used as the input signal; the TQWT algorithm depends on the Q-factor and asymptotic redundancy, and when it matches the oscillatory behavior of the input si...
Chapter
Full-text available
In this study we demonstrate the use of the Wavelet Transform (WT) as an improvement tool for the horizontal-to-vertical spectral ratio (HVSR) technique. Since the use of non-stationary transients in microtremor signals remains an open question, we investigate the effect of long-duration transients, undetectable by amplitude-thresholding methods, in HVS...
Article
Full-text available
In this paper a coupled dictionary learning mechanism with a mapping function is proposed in the wavelet domain for the task of Single-Image Super-Resolution. Sparsity is used as the invariant feature for achieving super-resolution. Instead of using a single dictionary, multiple compact dictionaries are proposed in the wavelet domain. Such dictionarie...

Citations

... In the field of performance analysis, Casas et al. [8] proposed to construct signals of metrics (e.g. number of active processes, amount of communicated data, etc.), and then to apply the discrete wavelet transform to keep the highest-frequency portions, and auto-correlation to find the frequency of the application's phases. ...
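As a rough, generic illustration of the auto-correlation step (not the authors' code), the sketch below estimates the dominant period of a per-timestep metric signal such as the number of active processes; the function name and signal layout are assumptions.

```python
import numpy as np

def estimate_phase_period(signal):
    """Estimate the dominant period of a 1-D metric signal via its autocorrelation."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]  # lags 0 .. N-1
    if acf[0] == 0:
        return None                                      # constant signal, no period
    acf = acf / acf[0]                                   # normalize so acf[0] == 1
    # the first local maximum after lag 0 approximates the period of the phases
    for lag in range(1, acf.size - 1):
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag
    return None  # no periodicity detected
```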
Preprint
Characterizing the temporal I/O behavior of an HPC application is a challenging task, but informing the system about it can be valuable for techniques such as I/O scheduling, burst buffer management, and many more, especially if provided online. In this work, we focus on the most commonly discussed temporal I/O behavior aspect: the periodicity of I/O. Specifically, we propose to examine the periodicity of the I/O phases using a signal processing technique, namely the Discrete Fourier Transform (DFT). Our approach, named FTIO, also provides metrics that quantify how far the signal is from being periodic, and hence convey confidence in the DFT-obtained period. We validate our approach with large-scale experiments on a production system and examine its limitations extensively.
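A minimal sketch of the underlying idea, assuming the I/O activity has been resampled into an evenly spaced bandwidth-over-time signal; this is a generic DFT-based period estimate, not FTIO itself, and it omits the confidence metrics described in the abstract.

```python
import numpy as np

def dominant_period(signal, sample_spacing=1.0):
    """Return the period (in the units of sample_spacing) of the strongest frequency."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # remove the DC component
    amplitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=sample_spacing)
    k = np.argmax(amplitude[1:]) + 1          # skip the zero-frequency bin
    return 1.0 / freqs[k]
```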
... The mere detection of communication patterns may still generate a lot of data that is hard for a software analyst to grasp. To alleviate this problem, researchers have proposed the concept of trace segmentation, in which a large trace is partitioned into distinct segments that depict execution phases of the traced scenario, and communication patterns are detected within each segment [5][6][7]. ...
... The main objective is to provide a way for the software analyst to only focus on parts of the trace of interest instead of browsing the whole trace. Casas et al. [5] and Chetsa et al. [9] showed how MPI trace execution phases can help with performance optimization tasks by uncovering regions in a trace with the highest latency. Isaacs et al. [6] presented a trace visualization and analysis tool that logically orders and visualizes the MPI communication behavior into fine-grained phases to determine the lateness in program operations using temporal metrics and visual inspection. ...
... The Solve phase is labeled as the phase containing most of the parallel iterations. In [5], Casas et al. used CPU computing bursts in conjunction with communication events (signals) in order to define several metrics that could be used in the identification of sub-phases in the Solve (computational) phase. Our approach distinctly identifies the different phases in the program using information theory principles. ...
Article
Full-text available
High Performance Computing (HPC) systems are used in a variety of industrial and research sectors to solve complex problems that require powerful computing platforms. For these systems to remain reliable, we should be able to debug and analyze their behavior in order to detect root causes of potential poor performance. Execution traces hold important information regarding the events and interactions among communicating processes, which are essential for the debugging of inter-process communication. Traces, however, tend to be considerably large, hindering their applicability. In previous work, we presented an approach for automatically detecting communication patterns and segmenting large HPC traces into execution phases. The goal is to reduce the effort of analyzing traces by allowing software analysts to focus on smaller parts of interest. In this paper, we propose an approach for detecting and localizing inefficient communication patterns using statistical and trace segmentation methods. In addition, we use the Analytic Hierarchy Process to categorize slow communication patterns based on their severity and complexity levels. Using our approach, an analyst can quickly locate slow communication patterns that may be the cause of important performance problems. We show the effectiveness of our approach by applying it to large traces from three HPC systems.
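For context, the Analytic Hierarchy Process derives priority weights from a pairwise comparison matrix; the sketch below shows that standard computation with hypothetical criteria and judgment values, not figures from the paper.

```python
import numpy as np

# Hypothetical pairwise comparison matrix for two criteria: severity vs. complexity.
# A[i, j] states how strongly criterion i is preferred over criterion j (Saaty scale).
A = np.array([
    [1.0, 3.0],        # severity judged 3x more important than complexity
    [1.0 / 3.0, 1.0],
])

# AHP priority weights: the normalized principal eigenvector of A.
eigenvalues, eigenvectors = np.linalg.eig(A)
principal = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
weights = principal / principal.sum()
print(weights)  # approximately [0.75, 0.25]
```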
... We can see that the execution can be divided into three execution paths: Main → FunctionA (Path 1), Main → FunctionB → FunctionC (Path 2), and Main → FunctionB → FunctionD (Path 3). The iterative structure shown in Application 1 has been selected because it appears in many long-running applications [26][27][28]. Figure 1(a) shows the traditional approach of energy estimation by doing a full profiling of the whole execution of the application. ...
Article
Full-text available
The computation power in data center facilities is increasing significantly. This brings with it an increase in power consumption in data centers. Techniques such as power budgeting or resource management are used in data centers to increase energy efficiency. These techniques require knowing the energy consumption beforehand, obtained through a full profiling of the applications. This is not feasible in scenarios with long-running applications that have long execution times. To tackle this problem, we present a fast energy estimation framework for long-running applications. The framework is able to estimate the dynamic CPU and memory energy of the application without the need to perform a complete execution. For that purpose, we leverage the concept of application signature. The application signature is a reduced version, in terms of execution time, of the original application. Our fast energy estimation framework is validated with a set of long-running applications and obtains RMS values of 11.4% and 12.8% for the CPU and memory energy estimation errors, respectively. We define the concept of Compression Ratio as an indicator of the acceleration of the energy estimation process. Our framework is able to obtain Compression Ratio values in the range of 10.1 to 191.2.
... Offline phase classification techniques use execution characteristics, such as basic blocks and execution traces, which are gathered during compile time or through a priori application analysis. These techniques are generally easier to develop than online phase classification techniques, since online techniques have stringent constraints where negative impacts on application execution time must be minimized. ...
Table excerpt: metrics (instruction execution [75]; function execution [40]; instructions per cycle [22]) and analysis method (Manhattan distance [75]; pattern analysis [40]; digital signal processing [22]).
... Casas et al. [22] used a signal processing technique, a discrete wavelet transform, to classify phases in multithreaded applications. They found that, by using a signal processing technique on data from a trace file, they could detect only the most relevant frequencies of an application's execution, that is, the frequencies most strongly related to the loops of an application's source code. ...
Preprint
Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.
... The main challenge in sampling simulations of multi-threaded programs is to ensure that at the beginning of each detailed simulation interval all threads have made the same amount of progress as in a full detailed simulation. A technique proposed by Casas et al. [11] detects periodic behavior in parallel applications using signal processing techniques. Carlson et al. [8] fast-forward different threads at different rates between intervals of detailed simulation. ...
Article
Sampled simulation is a mature technique for reducing simulation time of single-threaded programs. Nevertheless, current sampling techniques do not take advantage of other execution models, like task-based execution, to provide both more accurate and faster simulation. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals, we employ a novel fast-forwarding mechanism for dynamically scheduled programs. We evaluate different automatic techniques for clustering task instances and show that DBSCAN clustering combined with analytical performance modeling provides the best trade-off of simulation speed and accuracy. TaskPoint is the first technique combining sampled simulation and analytical modeling and provides a new way to trade off simulation speed and accuracy. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 8 simulated threads by an average factor of 220x at an average error of 0.5% and a maximum error of 7.9%.
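As an illustration of the clustering step alone, the sketch below groups task instances by simple per-instance features with scikit-learn's DBSCAN; the features, parameter values, and data are assumptions rather than TaskPoint's actual configuration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical per-task-instance features: (instruction count, memory accesses).
features = np.array([
    [1_200_000,  45_000],
    [1_180_000,  47_000],
    [3_500_000, 160_000],
    [3_520_000, 158_000],
])

# Scale the features so DBSCAN's distance threshold treats both dimensions equally.
scaled = StandardScaler().fit_transform(features)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(scaled)

# Simulate one representative instance per cluster in detail and
# fast-forward the remaining instances of that cluster.
for cluster_id in sorted(set(labels) - {-1}):
    representative = int(np.flatnonzero(labels == cluster_id)[0])
    print(f"cluster {cluster_id}: simulate task instance {representative} in detail")
```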
... This is simple to achieve, as intuition and experiments show that it is enough to include the initialization tasks and the first couple of computation phases of the application (iterations in the case of iterative algorithms). An alternative solution is to apply existing techniques for automatic detection of the phases [9] and use this information to decide at execution time what the correct window size is. Another possible approach is to edit the application source code, adding a call to the runtime system API to indicate that the partition must be done at that specific point. ...
Conference Paper
Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52X and average improvements of 1.12X with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28X on average with respect to the state-of-the-art.
... Casas et al. [7] identify phases and iterations in MPI applications. They convert time-series data into signals and apply signal processing algorithms to detect program structures. ...
Conference Paper
Full-text available
Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While profiles of parallel programs can identify the presence of performance problems, often developers need to analyze execution behavior over time to understand how and why parallel inefficiencies arise. With the growing scale of supercomputers, such manual analysis is becoming increasingly difficult. In many cases, performance problems of interest only appear at larger scales. Manual analysis of time series data from executions on extreme-scale parallel systems is daunting as the volume of data across processors and time makes it difficult to assimilate. To address this problem, we have developed an automated analysis framework that generates compact summaries of time series data for parallel program executions. These summaries provide users with high-level insight into patterns in the performance data and can quickly direct a user's attention to potential performance bottlenecks. We demonstrate the effectiveness of our framework by applying it to time-series measurements of two scientific codes.
... This is a crucial property for the execution of the MAPE loop, since the manager can assume that the decisions taken at the current control step will still be valid at the next control step, because in the meantime the characteristics of the computation have not changed significantly. Many real applications fall into this category [25,26,24,27,10]. It is worth noting that the application may still be characterized by different phases, unless there are extreme behaviors like a different phase at each control step. ...
Article
Full-text available
Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of resources to allocate to the application in order to meet performance and power consumption requirements. This is particularly relevant in Fog Applications, where data is generated by a number of devices at a varying rate, according to users' activity. By dynamically selecting the appropriate number of resources, it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. Implementing such algorithms may be a complex task, due to low-level interactions with the underlying hardware and to the need for non-intrusive, low-overhead monitoring of the applications. For these reasons, in this paper we propose Nornir, a C++-based framework, which can be used to enforce performance and power consumption constraints on parallel applications running on shared memory multicores. The framework can be easily customized by algorithm designers to implement new self-adaptive policies. By instrumenting the applications in the PARSEC benchmark, we provide strategy designers with a wide set of applications already interfaced with Nornir. In addition to this, to prove its flexibility, we implemented and compared several state-of-the-art existing policies, showing that Nornir can also be used to easily analyze different algorithms and to provide useful insights on them.
... In this approach various application regions are monitored to determine the application's communication structure, the amount of time it spends performing various operations, and the number of events, such as cache misses, that occur during each operation. While these measurements can be used to derive non-trivial information about HPC applications, like the internal structure of their executions [7] or the reasons behind performance slowdowns [6,5], they can only enable indirect inference about how the properties of a network relate to application performance. ...
Technical Report
Full-text available
Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, in spite of their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make the network a common performance bottleneck. To achieve high performance in spite of network limitations, application developers require tools to measure their applications' network utilization and to inform them about how the network's communication capacity relates to the performance of their applications. This paper presents a new performance measurement and analysis methodology based on empirical measurements of network behavior. Our approach uses two benchmarks that inject extra network communication. The first probes the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second aggressively injects network traffic while a software component runs to evaluate its performance on less capable networks or when it shares the network with other software components. We combine information from the two measurement benchmarks to predict the performance slowdown experienced by multiple software components when they share a single network. We consider several training sets that use raw data from the measurement benchmarks to produce our predictions. The sensitivity of the training set size is evaluated by considering 12 different scenarios. Our results find the optimum training set size to be around 200 training points. When optimal data sets are used, the proposed methodology provides predictions with an average error of 9.58% considering 36 scenarios.
... Detailed parametric models of the performance of HPC applications when run on large-scale architectures include network parameters like latency or bandwidth [19]. ...