Figure 1: Block diagram of the Sun UltraSPARC T2 processor (see text for details). Picture by courtesy of Sun Microsystems.


Source publication
Article
Processor and system architectures that feature multiple memory controllers and/or ccNUMA characteristics are prone to show bottlenecks and erratic performance numbers on scientific codes. Although cache thrashing, aliasing conflicts, and ccNUMA locality and contention problems are well known for many types of systems, they take on peculiar forms o...

Contexts in source publication

Context 1
... high single core performance for a highly parallel single chip architecture is the basic idea of T2, as can be seen in Fig. 1: Eight simple in-order SPARC cores (running at 1.2 or 1.4 GHz) are connected to a shared, banked L2 cache and four independently operating dual-channel FB-DIMM memory controllers through a non-blocking switch, thereby providing UMA access characteristics with scalable bandwidth. Such features were previously only available in ...
Context 2
... Fig. 10 we present two-socket T2+ performance data. Obviously, the small cache size per thread starts to show very early when all 128 threads are used, so that 64 threads with static scheduling are best to use at small to intermediate problem sizes. For N ≳ 8000, however, 64- and 128-thread performance with static scheduling coincide, as could be ...
Context 3
... scheduling coincide, as could be expected from the low-level bandwidth results. Interestingly, in contrast to the situation on a single socket, static scheduling yields better performance for intermediate N. In order to identify a possible reason for this, we removed the write operation from the relaxation iteration. The result is shown in the inset of Fig. 10 and confirms that the mediocre performance, strong fluctuations and characteristic "jumps" up to N ≈ 8000 are a consequence of write traffic. Whether the corresponding increase in coherence (snoop) activity or ccNUMA boundary effects are responsible for the performance characteristics cannot be answered as of today; a more thorough ...
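To make the write-removal experiment concrete, the following C/OpenMP sketch shows a Jacobi-type relaxation sweep with static scheduling, plus a write-free variant that folds the result into a reduction variable instead of storing it. The 2D N x N grid, the array names, and the reduction trick are illustrative assumptions; the paper's actual relaxation kernel is only described in the text above.

    /* Sketch of a Jacobi-type relaxation sweep with OpenMP static scheduling
     * (names and the 2D grid are assumptions, not the paper's exact kernel). */
    void relax(double *restrict unew, const double *restrict u, long N)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 1; i < N - 1; ++i)
            for (long j = 1; j < N - 1; ++j)
                /* one write stream (unew) plus the implicit RFO traffic */
                unew[i*N + j] = 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j]
                                      + u[i*N + j-1]   + u[i*N + j+1]);
    }

    /* Write-free variant to isolate the effect of write/RFO traffic: the
     * result is accumulated into a reduction variable instead of stored. */
    double relax_nowrite(const double *restrict u, long N)
    {
        double s = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:s)
        for (long i = 1; i < N - 1; ++i)
            for (long j = 1; j < N - 1; ++j)
                s += 0.25 * (u[(i-1)*N + j] + u[(i+1)*N + j]
                           + u[i*N + j-1]   + u[i*N + j+1]);
        return s;
    }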
Context 4
... Cartesian array holding the distribution functions. On cache-based architectures the propagation-optimized "IJKv" data layout, often referred to as "structure of arrays", is usually the best choice, where I, J, and K are Cartesian coordinates and v denotes the distribution function index. The computational kernel using this layout is sketched in Fig. 11. Evidently, the 19 read and 19 write streams are traversed with unit stride in this ...
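As a rough illustration of the layout idea (not a reproduction of the kernel in Fig. 11), the C sketch below stores each of the 19 D3Q19 distribution functions in its own contiguous block, so a sweep touches 19 unit-stride read streams and 19 unit-stride write streams. The placeholder copy body and all names are assumptions.

    #define Q 19  /* D3Q19: 19 distribution functions per lattice site */

    /* Structure-of-arrays ("IJKv"-style) layout: one contiguous NI*NJ*NK block
     * per distribution function.  The copy body stands in for the real
     * collide-and-propagate step. */
    void lbm_sweep(double *restrict fdst, const double *restrict fsrc,
                   long NI, long NJ, long NK)
    {
        long nsites = NI * NJ * NK;
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < NI; ++i)
            for (long j = 0; j < NJ; ++j)
                for (long k = 0; k < NK; ++k) {
                    long s = (i * NJ + j) * NK + k;   /* k runs with unit stride */
                    for (int v = 0; v < Q; ++v)
                        /* each v addresses its own block: 19 read and
                         * 19 write streams, all unit stride in k */
                        fdst[(long)v * nsites + s] = fsrc[(long)v * nsites + s];
                }
    }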
Context 5
... STREAM copy memory bandwidth (≈18 GB/s including RFO) and the required load/store traffic for a single lattice site update (456 bytes including RFO), one would expect an LBM performance of roughly 40 MLUPs/s. These kinds of estimates usually give good approximations for standard multi-core architectures [10] if the kernel is really memory-bound. Fig. 12 shows performance results in MLUPs/s for LBM on a cubic domain of extent N³ for the standard IJKv layout as well as for an alternative IvJK layout. Obviously, the latter choice yields twice the performance of IJKv and also smoother behaviour over a wide range of domain sizes. As the loop nest is parallelized on the outer level, the ...
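The quoted estimate checks out arithmetically: 19 reads, 19 writes, and 19 read-for-ownership transfers of 8 bytes each are consistent with 3 · 19 · 8 = 456 bytes per lattice site update, and 18 GB/s divided by 456 bytes gives about 39.5 million updates per second. A minimal check in C, using only the values quoted in the excerpt above:

    #include <stdio.h>

    int main(void)
    {
        double bw    = 18.0e9;  /* STREAM copy bandwidth incl. RFO, bytes/s */
        double bytes = 456.0;   /* traffic per lattice site update incl. RFO */
        printf("expected LBM performance: %.1f MLUPs/s\n", bw / bytes / 1.0e6);
        return 0;               /* prints roughly 39.5 MLUPs/s */
    }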
Context 6
... array dimension. Second, the sawtooth-like performance pattern is a "modulo effect" which emerges from N not being a multiple of the number of threads. A simple way to remove the pattern is to coalesce several outer loop levels in order to lengthen the OpenMP parallel loop. Results for up to 64 threads and two-way coalescing are also shown in Fig. 12 and corroborate the call for extensions of the OpenMP standard towards more flexible options for parallel execution of loop nests. Luckily, the recently adopted OpenMP 3.0 standard provides basic support for ...
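A sketch of the two options mentioned here, in C/OpenMP with placeholder loop bounds and a dummy per-site routine (none of which are taken from the paper): manual two-way coalescing of the outer loops, and the collapse clause introduced with OpenMP 3.0 for the same purpose.

    /* dummy per-site update standing in for the real kernel body */
    void work(long i, long j, long k);

    /* (a) manual two-way coalescing: one long parallel loop over i and j */
    void sweep_coalesced(long NI, long NJ, long NK)
    {
        #pragma omp parallel for schedule(static)
        for (long ij = 0; ij < NI * NJ; ++ij) {
            long i = ij / NJ, j = ij % NJ;
            for (long k = 0; k < NK; ++k)
                work(i, j, k);
        }
    }

    /* (b) the same effect via the collapse clause of OpenMP 3.0 */
    void sweep_collapse(long NI, long NJ, long NK)
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (long i = 0; i < NI; ++i)
            for (long j = 0; j < NJ; ++j)
                for (long k = 0; k < NK; ++k)
                    work(i, j, k);
    }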
Context 7
... Fig. 13 we compare the best possible LBM variant on T2 (64 threads, lower graph) with the same code on T2+, using 128 threads. Performance saturates at ≈43 MLUPs/s, a 65 % boost versus T2. This is much more than could be expected from the STREAM triad or even STREAM copy comparisons presented in Sect. 2.1.2. In the previous section it was ...
Context 8
... a dual-socket T2+ node we have demonstrated that most of the aliasing problems have vanished, at the price of a doubled number of threads and the typical ccNUMA performance features. On both architectures, but especially for T2+, OpenMP startup overhead can play a dominant role at small problem sizes due to the large thread numbers. At large prob- ... Fig. 12) versus T2+. There is significant performance gain from using twice the number of cores at the same theoretical memory ...
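On a ccNUMA system like the two-socket T2+, locality is usually established by parallel first-touch initialization. The sketch below shows this standard technique in C/OpenMP; the function and variable names are illustrative and not taken from the paper.

    #include <stdlib.h>

    /* Allocate an array and initialize it in parallel with the same static
     * schedule as the later compute sweeps: the first write ("first touch")
     * maps each page into the locality domain of the touching thread. */
    double *alloc_and_place(long n)
    {
        double *a = malloc((size_t)n * sizeof *a);
        if (a == NULL) return NULL;
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            a[i] = 0.0;
        return a;
    }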

Similar publications

Conference Paper
Computational mechanics methods, such as finite element analysis or multibody dynamics, are usually employed toward the end of the design phase of a total joint implant system. It is of greater benefit, however, to utilize these methods early in the design process as a benchmarking tool to compare competitive products, as a screening tool to elimin...
Preprint
Address translation and protection play important roles in today's processors, supporting multiprocessing and enforcing security. Historically, the design of the address translation mechanisms has been closely tied to the instruction set. In contrast, RISC-V defines its privileged specification in a way that permits a variety of designs. An importa...

Citations

... The L2 cache is divided into 8 banks, with each pair of banks connected to one of the four memory controllers (MCs) (Figure 1.5). Further details concerning the system architecture can be found in [HZW08] and [Sun07]. ...
... Ignoring this architectural detail leaves memory bandwidth underutilized (see [HZW08]). This case is briefly discussed in Section 1.3.3. ...