Figure 2 - uploaded by Michael Klemm
Native memory layout vs. DCL-managed application memory layout.


Source publication
Article
Multicore designers often add a small local memory close to each core to speed up access and to reduce off-chip IO. But this approach puts a burden on the programmer, the compiler, and the runtime system, since this memory lacks hardware support (cache logic, MMU, ...) and hence needs to be managed in software to exploit its performance potential....

Similar publications

Article
This paper describes a novel approach to reduce the memory consumption of Java programs, by reducing the string memory waste in the runtime. In recent Java applications, string data occupies a large amount of the heap area. For example, more than 30% of the live heap area is used for string data when WebSphere Application Server with Trade6 is run...
Article
Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., pa...

Citations

... Hammacher et al. [4] present an approach that analyzes dynamic data dependences of a program run and uses that information to identify independent computation paths that could have been handled by individual cores in a multicore machine. Werth et al. [5] describe an automated technique for creating overlays to improve performance on certain specialized architectures. Schaefer [6] proposes a method to improve auto-tuning of concurrent programs using knowledge of the patterns used to implement the program. ...
Conference Paper
Microprocessor performance can no longer be greatly improved by simply increasing clock frequencies; instead, higher performance will have to come from parallelism. As multi/manycore processors with multiple CPUs on a chip become standard and affordable for everyone, software engineers face the challenge of parallelizing applications of all sorts. However, compared to sequential applications, our repertoire of tools and methods for cost-effectively developing reliable, parallel applications is spotty. The mission of this workshop is to bring together researchers and practitioners with diverse backgrounds in order to advance the state of the art in software engineering for multi/manycore parallel applications. This is the second in a series of workshops specifically focusing on software engineering challenges of multi/manycore.
Conference Paper
The potential of heterogeneous multicores, like the Cell BE, can only be exploited if the host and the accelerator cores are used in parallel and if the specific features of the cores are considered. Parallel programming, especially when applied to irregular task-parallel problems, is challenging itself. However, heterogeneous multicores add to that complexity due to their memory hierarchy and specialized accelerators. As a solution for these issues we present CellCilk, a prototype implementation of Cilk for heterogeneous multicores with a host/accelerator design, using the Cell BE in particular. CellCilk introduces a new keyword (spu spawn) for task creation on the accelerator cores. Task scheduling and load balancing are done by a novel dynamic cross-hierarchy work-stealing regime. Furthermore, the CellCilk runtime employs a garbage collection mechanism for distributed data structures that are created during scheduling. On benchmarks we achieve a good speedup and reasonable runtimes, even when compared to manually parallelized codes.
Article
Memory is a key parameter in embedded systems since both the code complexity of embedded applications and the amount of data they process are increasing. While it is true that the memory capacity of embedded systems is continuously increasing, the increases in application complexity and dataset sizes are far greater. As a consequence, the memory space demand of code and data should be kept to a minimum. To reduce the memory space consumption of embedded systems, this paper proposes a control flow graph (CFG) based technique. Specifically, it tracks the lifetime of instructions at the basic block level. Based on the CFG analysis, if a basic block is known to be not accessible in the rest of the program execution, the instruction memory space allocated to this basic block is reclaimed. On the other hand, if the memory allocated to this basic block cannot be reclaimed, we try to compress this basic block. This way, it is possible to effectively use the available on-chip memory, thereby satisfying most instruction/data requests from the on-chip memory. Our experiments with this framework show that it outperforms the previously proposed CFG-based memory reduction approaches.