Tor E. Jeremiassen's research while affiliated with the University of Washington, Seattle, and other places

Publications (6)

Conference Paper
This paper presents a static algorithm, based on standard iterative dataflow techniques, for computing per-process memory references to shared data in coarse-grained parallel programs. The algorithm constructs control flow graphs for families of processes by recognizing predicates used in control statements whose values are invariant relative to an...
Article
Many coarse-grained, explicitly parallel programs execute in phases delimited by barriers to preserve sets of cross-process data dependencies. One of the major obstacles to optimizing these programs is the necessity to conservatively assume that any two statements in the program may execute concurrently. Consequently, compilers fail to take advan...
Article
We have developed compiler algorithms that analyze explicitly parallel programs and restructure their shared data to reduce the number of false sharing misses. The algorithms analyze per-process shared data accesses, pinpoint the data structures that are susceptible to false sharing and choose an appropriate transformation to reduce it. The transfo...
Article
This paper presents a static algorithm, based on standard iterative data-flow techniques, for computing per-process memory references to shared data in coarse-grained parallel programs. The algorithm constructs control flow graphs for families of processes by recognizing predicates used in control statements whose values are invariant relative to a...
Article
We have developed compiler algorithms that analyze explicitly parallel programs and restructure their shared data to reduce the number of false sharing misses. The algorithms analyze per-process shared data accesses, pinpoint the data structures that are susceptible to false sharing and choose an appropriate transformation to reduce it. The transfo...

Citations

... Cache false sharing is a well-studied issue in the architecture world [25] [26], and there have been several efforts over the years to address the problem: profiling and manually tuning the application, compiler techniques, and data layout transformations [27]. However, cache false sharing is often overlooked by developers during system development, and in many situations it cannot be addressed automatically by the compiler. ...
... Computing the patterns is straightforward, since features that make control flow and data flow analysis difficult, such as ''goto'' statements, global variables, and pointers have been intentionally omitted from Orca. Also, process types are part of the language syntax, so techniques such as separating control flow graphs are not needed [Jeremiassen and Eggers 1992]. The only difficulty is handling recursive functions. ...
... Upon every mutex invocation, we can look up the hash table to find the pointer to the actual data, and then update it accordingly. However, this approach introduces significant overhead due to the hash table lookup (and possible lock protection) on every synchronization operation, and the possible cache coherence messages to update the shared data (true/false sharing effect) [16,21]. This is especially problematic when there is a significant number of acquisitions. ...
... All processes must execute the same number of barriers to ensure correct synchronization. Jeremiassen et al. [116] present initial work for verifying correct barrier synchronization for SPMD programs, based on named barriers and control flow graph reachability. Aiken and Gay present the seminal Barrier Inference analysis [5] for verifying barrier synchronization. ...
... False sharing [19] is a well-known problem in manycore systems, where multiple processing elements working on independent variables are falsely considered sharers because the variables reside in the same cache line, leading to application performance deterioration. The work in [20] uses a compiler-driven approach to transform data, thereby minimizing false sharing. Several works by Liu et al. [21,22] cover detecting false sharing using compiler-driven approaches and also resolving it at runtime [23] by migrating data. ...
... Given these inputs the execution of the program is simulated on the given architecture and statistics are generated. The workload spans a wide range of problem domains and comprises two types of explicitly parallel programs: data locality-optimized (either by the programmer or the compiler [13, 14]), and locality-unoptimized (Table 1). These two sets of programs were written using different parallel programming paradigms. ...