Figure 1. Data dependence test simple example.

Source publication
Conference Paper
Full-text available
In the new multicore architecture arena, the problem of improving the performance of a code lies more on the software side than on the hardware side. However, optimizing codes based on irregular dynamic data structures for such architectures is not easy, either by hand or with compiler assistance. Regarding this last approach, shape analysis is a static technique...

Contexts in source publication

Context 1
... us illustrate the main idea behind our dependence test. The code in fig. 1 creates a singly-linked list and then traverses it, writing in a list element the data field of the previous one. Our test symbolically executes the code, abstracting the data structures by shape graphs, sg. In the example, sg1 is the abstraction of the list created at statement S1. Using abstract interpretation, the abstract ...
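To make the idea concrete, below is a minimal C sketch of the kind of code fig. 1 describes; the statement labels S1-S4 and the exact list layout are illustrative assumptions, not the paper's verbatim example.

#include <stdlib.h>

typedef struct node {
    int data;
    struct node *next;
} node;

int main(void) {
    /* S1: build a singly-linked list (hypothetical statement labels). */
    node *list = NULL;
    for (int i = 0; i < 10; i++) {
        node *n = malloc(sizeof(node));
        n->data = i;
        n->next = list;
        list = n;
    }
    /* S2: traverse the list. */
    for (node *p = list; p->next != NULL; p = p->next) {
        int val = p->data;    /* S3: read the node pointed to by p (tagged R_S3) */
        p->next->data = val;  /* S4: write into the next node (tagged W_S4)      */
        /* The node written by S4 in one iteration is read by S3 in the next,
           which is the loop-carried RAW dependence the test reports. */
    }
    return 0;
}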
Context 2
... When it is abstractly interpreted, the corresponding node is annotated with that information. Later, the data dependence test checks if a node has actually been written and read by statements that could produce LCDs, and in that case a data dependence (and the type of dependence: RAW, WAR or WAW) can be reported. In the example of fig. 1, statement S3 produces the annotation of the node pointed to by p with the RS3 tag, whereas S4 annotates with WS4 (DepTouch pseudostatements are not shown in the code for simplicity). That information is then used to detect a RAW LCD. An interprocedural shape analysis technique must also be able to deal with function calls and return ...
Context 3
... and then reverses it with a recursive function, rev(). Let us assume now that the memory configuration for the 4-element list of fig. 3 is used as input for the rev() function. The list is then traversed in a sequence of recursive calls. The memory configuration that results at the 4th invocation of the rev() function (line 9) is shown in fig. 10(a), where the Activation Record Stack (ARS) has been included to maintain the pointer links for x in previous calls of rev(). The information within the ARS is required when we go back to the return site of an enclosing call, so that we know the destination of the corresponding pointer formal parameters and local pointers. ...
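For reference, here is a hedged sketch of a recursive list reversal in the spirit of the rev() function of fig. 9; the formal parameter x and the local pointers y and z follow the names used in the surrounding text, but the structure and statement numbering (e.g. the recursive call at line 9) are assumptions, not the paper's exact code.

typedef struct node {
    int data;
    struct node *next;
} node;

/* Recursively reverse a singly-linked list and return the new head. */
node *rev(node *x) {
    node *y, *z;
    if (x != NULL && x->next != NULL) {  /* the if branch guards the recursion */
        z = x->next;     /* actual parameter for the recursive call            */
        y = rev(z);      /* recursive call; the callee's result is kept in y   */
        z->next = x;     /* hook the old successor back onto x                 */
        x->next = NULL;  /* x becomes the new tail                             */
        return y;        /* y points to the head of the reversed list          */
    }
    return x;            /* base case: empty or single-element list            */
}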
Context 4
... along the recursive, interprocedural control flow. A recursive flow pointer link points to the same memory location/node where the traced pointer was pointing in the immediately previous call in a stack of recursive calls, while a recursive flow selector link points to the locations/nodes beyond the immediately previous activation record. Fig. 10(b) shows the memory configuration from (a), exchanging the ARS for the needed recursive flow links, which are shown in dashed lines. In rfplc and rfslc, the last "c" stands for "concrete". The location/node denoted by • represents the NULL location for the recursive flow path. Following the trace through a recursive flow selector ...
Context 5
... for "concrete". The location/node denoted by • is representing the NULL location for the recursive flow path. Following the trace through a recursive flow selector link with • as destination would not correspond to any acti- vation record in the succession of recursive calls, and there- fore would not render any realistic memory configuration. Fig. 10(c) shows the abstraction, in the abstract domain, of the memory configuration in (b). Here memory locations l1 and l2 are summarized by n1. The selector links are up- dated accordingly as shown in the ...
Context 6
... where they were pointing in the previous context. For that, the matching of returned pointer and assigned pointer at the call site, as well as the matching between formals and actuals, are used. The rest of the local pointers are reassigned according to the existing recursive flow links. These context change rules are better understood by example. Fig. 11(a) shows how the CTS_r rule acts upon shape graph sg2_St12, the abstraction of a 4-element singly-linked list that reaches the third call to rev(), by taking the if branch in the code of fig. 9. The result of applying the CTS_r rule is named sg3_St9. Null-assigned or uninitialized pointers are not shown. Links are only displayed ...
Context 7
... is now pointed to by x in sg3_St9. The x_rfptr pointer is updated to point to n2, the node pointed to by x in the previous context, and the recursive flow selector from n2 to n1 in sg3_St9 leaves the trace to the node pointed to by x_rfptr in the previous context. Local pointer y is not defined before the recursive call, so no tracing is necessary. Fig. 11(b) shows how the RTC_r rule works on sg2_St18, the graph resulting from the third call to rev(). The result is sg3_St13, which exchanges actual parameter z for its matching formal parameter x. Pointer y is assigned over itself, so it causes no change. Pointer x is reassigned to the node pointed to by x_rfptr, which is updated by following ...
Context 8
... interesting fact about the use of coexistent links sets in our abstraction can also be observed in fig. 11(b): n3 is pointed to from both n2 and n4, both links coexisting in its cls (shown for sg2_St18). We say n3 is shared, because it can be directly reached from more than one node. This is highlighted graphically by shading the oval depicting the node. The cls information lets us know that not only can n3 be accessed by following two different ...
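As a side note, the sharing situation described here is easy to reproduce; the following tiny C fragment (illustrative only, not taken from the paper) builds a node that is directly reachable from two others, which is exactly the case a cls-based abstraction would flag as shared.

#include <stdlib.h>

typedef struct node { struct node *next; } node;

int main(void) {
    node *a = malloc(sizeof(node));
    node *b = malloc(sizeof(node));
    node *c = malloc(sizeof(node));
    a->next = c;    /* first incoming link to c                   */
    b->next = c;    /* second incoming link: c is now "shared",   */
    c->next = NULL; /* i.e. directly reachable from both a and b  */
    free(a);
    free(b);
    free(c);
    return 0;
}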
Context 9
... in this paper within our optimizing-compiler framework [3]. We focus on the analysis of sequential C programs. All the needed preprocessing passes are performed with custom-made passes built upon Cetus [11], a versatile source-to-source compiler framework. We have also implemented a GUI to enable friendly use of our shape analyzer tool. In fig. 12 we can see one of the available windows, in which the "Graphs" tab is selected. That tab shows the analyzed code with each statement annotated with information regarding the number of times that statement has been symbolically executed and the number of sg's associated with it. We also provide the links to each graph and ...
Context 10
... have conducted three kinds of experiments. First, fig. 13 shows a group of programs based on dynamic data structures. These programs were analyzed as a test of the ability of the technique to capture several types of dynamic data structures. These structures include singly-linked lists, doubly-linked lists, binary trees, n-ary trees, and sparse matrices or sparse vectors built based on ...
Context 11
... on dynamic data structures. These programs were analyzed as a test of the ability of the technique to capture several types of dynamic data structures. These structures include singly-linked lists, doubly-linked lists, binary trees, n-ary trees, and sparse matrices or sparse vectors built based on singly- or doubly-linked lists. The codes in fig. 13 do not include recursive functions. Programs 1 to 4 in fig. 13 create and traverse their corresponding data structures. Programs 5 to 8 in the same figure implement the product of a sparse matrix by a sparse vector, or the product of two sparse matrices. Program 9 is em3d from the Olden suite [2]. All the structures tested were ...
Context 12
... a test of the ability of the technique to capture several types of dynamic data structures. These structures include singly-linked lists, doubly-linked lists, binary trees, n-ary trees, and sparse matrices or sparse vectors built based on singly- or doubly-linked lists. The codes in fig. 13 do not include recursive functions. Programs 1 to 4 in fig. 13 create and traverse their corresponding data structures. Programs 5 to 8 in the same figure implement the product of a sparse matrix by a sparse vector, or the product of two sparse matrices. Program 9 is em3d from the Olden suite [2]. All the structures tested were captured ...
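For context, the following is a hedged sketch of the kind of pointer-based sparse structure these benchmarks rely on: a sparse vector stored as a singly-linked list of (index, value) pairs, with a dot-product-style traversal. The names and layout are illustrative and are not taken from the programs in fig. 13.

#include <stddef.h>

typedef struct elem {
    int idx;            /* index of a nonzero entry                 */
    double val;         /* value of that entry                      */
    struct elem *next;  /* next nonzero, in increasing index order  */
} elem;

/* Dot product of two sparse vectors stored as sorted singly-linked lists. */
double sparse_dot(const elem *a, const elem *b) {
    double acc = 0.0;
    while (a != NULL && b != NULL) {
        if (a->idx == b->idx) {
            acc += a->val * b->val;
            a = a->next;
            b = b->next;
        } else if (a->idx < b->idx) {
            a = a->next;
        } else {
            b = b->next;
        }
    }
    return acc;
}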
Context 13
... the node property "DepTouch" and symbolically executing the loops of a code, we can annotate which memory locations are read or written and detect loop-carried data dependences (LCDs). This test has been applied to the sparse matrix-vector, the sparse matrix-matrix and em3d codes of figure 13. The matrix codes were based on doubly-linked lists to store the sparse matrices and vectors. ...

Citations

... Cetus is already in use by a number of research groups in the U.S. and worldwide [83][84][85][86][87][88][89][90][91][92]. ...
Thesis
Full-text available
In today's multicore era, with the persistently improved fabrication technology, the new challenge is to find applications (i.e., killer apps) that exploit the increased computational power. Automatic parallelization of sequential programs combined with tuning techniques is an alternative to manual parallelization that saves programmer time and effort. Hand parallelization is a tedious, error-prone process. A key difficulty is that parallelizing compilers are generally unable to estimate the performance impact of an optimization on a whole program or a program section at compile time; hence, the ultimate performance decision today rests with the developer. Building an autotuning system to remedy this situation is not a trivial task. Automatic parallelization concentrates on finding any possible parallelism in the program, whereas tuning systems help identify efficient parallel code segments and profitable optimization techniques. A key limitation of advanced optimizing compilers is their lack of runtime information, such as the program input data. With the renewed relevance of autoparallelizers, a comprehensive evaluation will identify strengths and weaknesses in the underlying techniques and direct researchers as well as engineers to potential improvements. No comprehensive study has been conducted on modern parallelizing compilers for today's multicore systems. Such a study needs to evaluate different levels of techniques and their interactions, which requires efficiently navigating over a large search space of optimization variants. With the recently revealed non-trivial parallel architectures, a programmer needs to learn the behavior of these systems with respect to their programs in order to orchestrate them for maximized utilization of the gazillion CPU cycles available. In this dissertation, we go on a journey through parallel applications and parallel architectures with a quantitative approach. This work presents a portable empirical autotuning system that operates at program-section granularity and partitions the compiler options into groups that can be tuned independently. To our knowledge, this is the first approach delivering an autoparallelization system that ensures performance improvements for nearly all programs, eliminating the users' need to "experiment" with such tools to strive for the highest application performance. This method has the potential to substantially increase productivity and is thus of critical importance for exploiting the increased computational power of today's multicores. We present an experimental methodology for comprehensively evaluating the effectiveness of parallelizing compilers and their underlying optimization techniques. The methodology takes advantage of the proposed customizable tuning system that can efficiently evaluate a large space of optimization variants. We applied the proposed methodology to five modern parallelizing compilers and their tuning capabilities; we reported speedups, parallel coverage, and the number of parallel loops, using the NAS Benchmarks as a program suite. As there is an extensive body of proposed compiler analyses and transformations for parallelization, the question of the importance of the techniques arises. This work evaluates the impact of the individual optimization techniques on the overall program performance and discusses their mutual interactions. We study the differences between polyhedral-model-based compilers and abstract-syntax-tree-based compilers.
We also study the scalability of the IBM BlueGene/Q and Intel MIC architectures as representatives of modern multicore systems. We found parallelizers to be reasonably successful in about half of the given science-engineering programs. Advanced versions of some of the techniques identified as most successful in previous generations of compilers are also most important today, while other techniques have risen significantly in impact. An important finding is also that some techniques substitute for each other. Furthermore, we found that automatic tuning can lead to significant additional performance and sometimes matches or outperforms hand-parallelized programs. We analyze specific reasons for the measured performance and the potential for improvement of automatic parallelization. On average over all programs, the BlueGene/Q and MIC systems could achieve a scalability factor of 1.5.
... Cetus is already in use by a number of research groups in the U.S. and worldwide [17,5,3,2,11,33,34,21,27]. In our ongoing work, we are applying the infrastructure for creating translators that convert shared-memory programs written in OpenMP into other models, such as message-passing and CUDA (for Graphics Processing Units) [20]. ...
Article
Full-text available
This paper provides an overview and an evaluation of the Cetus source-to-source compiler infrastructure. The original goal of the Cetus project was to create an easy-to-use compiler for research in automatic parallelization of C programs. In the meantime, Cetus has been used for many additional program transformation tasks. It serves as a compiler infrastructure for many projects in the US and internationally. Recently, Cetus has been supported by the National Science Foundation to build a community resource. The compiler has gone through several iterations of benchmark studies and implementations of those techniques that could improve the parallel performance of these programs. These efforts have resulted in a system that favorably compares with state-of-the-art parallelizers, such as Intel’s ICC. A key limitation of advanced optimizing compilers is their lack of runtime information, such as the program input data. We will discuss and evaluate several techniques that support dynamic optimization decisions. Finally, as there is an extensive body of proposed compiler analyses and transformations for parallelization, the question of the importance of the techniques arises. This paper evaluates the impact of the individual Cetus techniques on overall program performance.
... Our shape analysis quite accurately captures the program heap storage in the form of an abstract bounded graph. In a previous study, and based on this shape analysis, we developed a client data dependence test [1] that annotates the read and write accesses in the graph's nodes during the abstract interpretation process. These annotations are what allow us to detect flow, anti- and output data dependences in C codes that traverse and modify complex heap data structures. ...
... Due to this, the conflict detection step (CD) now takes as input all the shape graphs that reach the entry of the corresponding intervals. On the other hand, the column "DDT_AI" represents the total times required by our previous dependence test [1]. Recall that this older test was designed as a client analysis that annotated the read and write accesses in the graph nodes during the abstract interpretation process of the shape analysis. ...
... This distinction can be useful for loop parallelization in the case of carried antidependences being detected.¹ This lack of precision is due to ... ¹ We should note that our analysis based on conflict detection can correctly ascertain carried output dependences. ...
Article
We propose a data dependence detection test based on a new conflict analysis algorithm for C codes which make intensive use of recursive data structures dynamically allocated in the heap. This algorithm requires two pieces of information from the code section under analysis (a loop or a recursive function): (i) abstract shape graphs that represent the state of the heap at the code section; and (ii) path expressions that collect the traversing information for each statement. Our algorithm projects the path expressions on the shape graphs and checks over the graphs to ascertain whether one of the sites reached by a write statement matches one of the sites reached by another statement on a different loop iteration (or on a different call instance in a recursive function), in which case a conflict between the two statements is reported. Although our algorithm presents exponential complexity, we have found that in practice the parameters that dominate the computational cost have very low values, and to the best of our knowledge, all the other related studies involve higher costs. In fact, our experimental results show reductions in the data dependence analysis times of one or two orders of magnitude in some of the studied benchmarks when compared to a previous data dependence algorithm. Thanks to the information on uncovered data dependences, we have manually parallelized these codes, achieving speedups of 2.19 to 3.99 in four cores.
... Several US and worldwide research groups already use Cetus. [3][4][5] In our ongoing work, we apply the infrastructure for creating translators that convert shared-memory programs written in OpenMP to other models, such as message-passing 6 and CUDA (for graphics processing units). 7 ...
Article
Full-text available
The Cetus tool provides an infrastructure for research on multicore compiler optimizations that emphasizes automatic parallelization. The compiler infrastructure, which targets C programs, supports source-to-source transformations, is user-oriented and easy to handle, and provides the most important parallelization passes as well as the underlying enabling techniques.
... Several US and worldwide research groups already use Cetus. [3][4][5] In our ongoing work, we apply the infrastructure for creating translators that convert shared-memory programs written in OpenMP to other models, such as message-passing 6 and CUDA (for graphics processing units). 7 ...
Article
Full-text available
We describe the Cetus compiler infrastructure and its use in a number of transformation tasks for multicore architectures. The original intent of Cetus was to serve as a parallelizing compiler. In addition, the infrastructure has been used to build translators for programs written in the OpenMP directive language to be compiled onto multicore architectures. They include a direct OpenMP translator for current multicores, an OpenMP to MPI translator for many-cores exhibiting disjoint address spaces, and a translator for OpenMP onto GPU architectures. We are also building autotuning capabilities into Cetus, which can defer compile-time optimization decisions to runtime. This feature is especially important for heterogeneous multicore architectures. We will describe the organization of the Cetus infrastructure and present preliminary results of several application projects.
... Cetus is already in use by a number of research groups in the U.S. and worldwide [3][4][5][6]. Increasing the user community is one goal of this paper. To this end, this paper describes the Cetus Community Portal in Section 2, the Cetus internal organization in Section 3, current analysis and transformation capabilities in Sections 4 and 5, respectively, and application projects in Section 6. ...
Conference Paper
We describe the Cetus compiler infrastructure and its use in a number of transformation tasks for multicore architectures. The original intent of Cetus was to serve as a parallelizing compiler. In addition, the infrastructure has been used to build translators for programs written in the OpenMP directive language to be compiled onto multicore architectures. They include a direct OpenMP translator for current multicores, an OpenMP to MPI translator for many-cores exhibiting disjoint address spaces, and a translator for OpenMP onto GPU architectures. We are also building autotuning capabilities into Cetus, which can defer compile-time optimization decisions to runtime. This feature is especially important for heterogeneous multicore architectures. We will describe the organization of the Cetus infrastructure and present preliminary results of several application projects.
Preprint
Full-text available
Performance is a key quality of modern software. Although recent years have seen a spike in research on automated improvement of software's execution time, energy, memory consumption, etc., there is a noticeable lack of standard benchmarks for such work. It is also unclear how representative such benchmarks are of current software. Furthermore, non-functional properties of software are frequently targeted for improvement one at a time, neglecting the potential negative impact on other properties. In order to facilitate more research on automated improvement of non-functional properties of software, we conducted a survey gathering benchmarks used in previous work. We considered 5 major online repositories of software engineering work: ACM Digital Library, IEEE Xplore, Scopus, Google Scholar, and arXiv. We gathered 5000 publications (3749 unique), which were systematically reviewed to identify work that empirically improves non-functional properties of software. We identified 386 relevant papers. We find that execution time is the most frequently targeted property for improvement (in 62% of relevant papers), while multi-objective improvement is rarely considered (5%). Static approaches are prevalent (in 53% of papers), with exploratory approaches (evolutionary in 18% and non-evolutionary in 14% of papers) increasingly popular in the last 10 years. Only 40% of the 386 papers describe work that uses benchmark suites rather than a single program; among those, SPEC is the most popular (covered in 33 papers). We also provide recommendations for the choice of benchmarks in future work, noting, e.g., the lack of work that covers Python or JavaScript. We provide all programs found in the 386 papers on our dedicated webpage at https://bloa.github.io/nfunc_survey/. We hope that this effort will facilitate more research on the topic of automated improvement of software's non-functional properties.
Conference Paper
Present-day automatic optimization relies on powerful static (i.e., compile-time) analysis and transformation methods. One popular platform for automatic optimization is the polyhedron model. Yet, after several decades of development, there remains a lack of empirical evidence of the model's benefits for real-world software systems. We report on an empirical study in which we analyzed a set of popular software systems, distributed across various application domains. We found that polyhedral analysis at compile time often lacks the information necessary to exploit the potential for optimization of a program's execution. However, when conducted also at run time, polyhedral analysis shows greater relevance for real-world applications. On average, the share of the execution time amenable to polyhedral optimization is increased by a factor of nearly 3. Based on our experimental results, we discuss the merits and potential of polyhedral optimization at compile time and run time.