Figure 2. AMD Opteron Design (Image courtesy of AMD).

Source publication
Conference Paper
Oak Ridge National Laboratory recently received delivery of a 5,294-processor Cray XT3. The XT3 is Cray's third-generation massively parallel processing system. The system builds on a single-processor node, built around the AMD Opteron, and uses a custom chip, called SeaStar, to provide interprocessor communication. In addition, the system use...

Context in source publication

Context 1
... ORNL XT3 uses Opteron model 150 processors. As Figure 2 shows, this model includes an Opteron core, integrated memory controller, three 16b-wide 800 MHz HyperTransport (HT) links, and L1 and L2 caches. The Opteron core has three integer units and one floating point unit capable of two floating-point operations per cycle [3]. ...
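Taking the figures in this context at face value, the node's headline numbers can be reproduced with a short back-of-the-envelope sketch. The 2.4 GHz clock of the Opteron model 150 and the double-data-rate signaling of the 800 MHz HyperTransport links are assumptions not stated in the excerpt; the two floating-point operations per cycle are from the excerpt.

    /* Back-of-the-envelope peak numbers for the Opteron 150 node described
     * above. The 2.4 GHz core clock and the double-data-rate (DDR) signaling
     * of the HT links are assumptions, not taken from the excerpt. */
    #include <stdio.h>

    int main(void) {
        double clock_ghz      = 2.4;   /* assumed Opteron model 150 clock  */
        double flops_per_cyc  = 2.0;   /* one FPU, two FP ops per cycle [3] */
        double ht_clock_mhz   = 800.0; /* stated HT link clock              */
        double ht_width_bytes = 2.0;   /* 16-bit-wide link                  */

        double peak_gflops    = clock_ghz * flops_per_cyc;
        double ht_gbs_per_dir = ht_clock_mhz * 1e6 * 2.0 /* DDR */
                                * ht_width_bytes / 1e9;

        printf("peak FP per processor: %.1f GFLOPS\n", peak_gflops);
        printf("HT bandwidth per link: %.1f GB/s each way\n", ht_gbs_per_dir);
        return 0;
    }

Under those assumptions this works out to 4.8 GFLOPS per processor and 3.2 GB/s per HyperTransport link in each direction.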

Similar publications

Conference Paper
The rise of chip multiprocessors (CMPs) as a promising trend in state-of-the-art high-performance processor design has made a scalable cache-directory organization, along with a simple cache-coherence protocol, a hot research area. While thousands of cores are expected to fit on a single chip soon, the previously proposed cache di...
Article
Graphics processing units (GPUs) can provide excellent speedups on some, but not all, general-purpose workloads. Using a set of computational GPU kernels as examples, the authors show how to adapt kernels to utilize the architectural features of a GeForce 8800 GPU and what finally limits the achievable performance.
Conference Paper
This paper proposes and evaluates a new microarchitecture for out-of-order processors that supports speculative renaming. We use the term speculative renaming for the speculative omission of physical register allocation along with the speculative early release of physical registers. These renaming policies may cause a register operand not to be kept in the...
Conference Paper
In this paper we present implementation and experimental results for a digital vision chip that operates in a mixed asynchronous/synchronous mode. The mixed configuration benefits from full programmability (discrete-time mode) and high operational performance in global image-processing operations (continuous-time mode), thus extending the application fiel...

Citations

... The FBFLY topology is derived by combining the routers in each row of a conventional butterfly topology while preserving the inter-router connections [14], connecting farthest nodes in each row and column. The Intel Teraflops Research Chip [15], Cray XT3 [16], IBM Blue Gene [17], and Tilera TILE-Gx [18] are examples of several commercial chips implementing NoCs. ...
... The many challenges in implementing LU efficiently present tradeoffs influenced by the interplay of processor microarchitecture, node design, and interconnection network. Recent architectural trends have led to supercomputers with increasing numbers of cores per node: successive generations of Blue Gene have 2, 4, and 16 compute cores [15]; Cray's XT3 had 1-2 cores [29], while the XE6 has 24 cores [7]; the bullx S6010 nodes in today's 9th-ranked Tera-100 cluster can support up to 40 cores [3]. The computational capacity of each core has increased greatly in that time, but per-core cache and memory bandwidth have not kept up. ...
Article
Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations, including the reference code HPL, use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. We show how the critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivot identification and exchange with the computation of rank-k updates. By shifting this trade-off, a modified block-cyclic distribution can beneficially exploit more available parallelism on the critical path, and reduce panel factorization's memory hierarchy contention on now-ubiquitous multi-core architectures. The missed parallelism in traditional block-cyclic distributions arises because active panel factorization, triangular solves, and subsequent broadcasts are spread over single process columns or rows (respectively) of the process grid. Increasing one dimension of the process grid decreases the number of distinct processes in the other dimension. To increase parallelism in both dimensions, periodic 'rotation' is applied to the process grid to recover the row-parallelism lost by a tall process grid. During active panel factorization, rank-1 updates stream through memory with minimal reuse. In a column-major process grid, the performance of this access pattern degrades as too many streaming processors contend for access to memory. A block-cyclic mapping in the more popular row-major order does not encounter this problem, but consequently sacrifices node and network locality in the critical pivoting steps. We introduce 'striding' to vary between the two extremes of row- and column-major process grids. As a test-bed for further mapping experiments, we describe a dense LU implementation that allows a block distribution to be defined as a general function of block to processor. Other mappings can be tested with only small, local changes to the code.
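The block-cyclic mapping at the heart of this abstract can be made concrete in a few lines of code. The following is a minimal sketch, not the paper's implementation; the function names and example grid sizes are illustrative. It shows how a block with block-coordinates (I, J) is assigned to a process on a P x Q grid, and how the two common grid orderings (column-major and row-major) linearize that grid position into a rank.

    /* Sketch of a 2-D block-cyclic block-to-process mapping of the kind
     * used by HPL-style LU factorizations. Names are illustrative only. */
    #include <stdio.h>

    /* Grid coordinates of the owner of block (I, J) on a P x Q process grid. */
    static void owner_coords(int I, int J, int P, int Q, int *p, int *q) {
        *p = I % P;   /* process row    */
        *q = J % Q;   /* process column */
    }

    /* Linear rank of grid position (p, q) for the two common orderings. */
    static int rank_col_major(int p, int q, int P) { return p + q * P; }
    static int rank_row_major(int p, int q, int Q) { return q + p * Q; }

    int main(void) {
        int P = 4, Q = 8;    /* example process grid       */
        int I = 10, J = 3;   /* example block coordinates  */
        int p, q;
        owner_coords(I, J, P, Q, &p, &q);
        printf("block (%d,%d) -> grid (%d,%d): col-major rank %d, row-major rank %d\n",
               I, J, p, q, rank_col_major(p, q, P), rank_row_major(p, q, Q));
        return 0;
    }

The 'rotation' and 'striding' variants described above can then be expressed as alternative block-to-process functions, which is the kind of generality the paper's test-bed exposes.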
... Much work has been put into the evaluation of novel supercomputers [11], [12], [13], [14], [15], [16] and nontraditional systems [17], [18], [19], [15], [20] for scientific and business computing. There has been a recent spur of research activity in assessing the performance of virtualized resources, in the cloud computing environments [7], [8], [9], [21], [22] and in other traditional networks [6], [23], [24], [25], [26], [27], [28]. ...
Article
A highly available network is becoming essential for high-growth, disruptive technology companies. For businesses that must migrate to networks supporting scalability and high availability, it is important to analyze the relevant factors and the cost effectiveness of the candidate solutions in order to choose the optimal one. The current work considers this important problem and presents an analysis of the main factors influencing the decision. Network availability is discussed in terms of the internal and external risk factors of the network. A production-network risk matrix is proposed, and a scheme to compute the overall risk is presented. A case study is presented in which four possible network configurations are analyzed and the most suitable solution is identified. This study provides a paradigm and a useful framework for analyzing cloud computing services.
... Alam et al. (Alam et al., 2008) perform an initial evaluation of BlueGene/P, the second generation of IBM BlueGene solutions, using benchmarks, Linpack, and scientific applications, such as ocean modeling, climate modeling and combustion. The Cray XT architectures are evaluated with micro-benchmarks and benchmarks in (Vetter et al., 2006) for Cray XT3, (Alam et al., 2007) for Cray XT4, and (Worley et al., 2009) for Cray XT5. ...
Article
This paper presents a description and evaluation of the Netuno supercomputer, a high-performance cluster installed at the Federal University of Rio de Janeiro in Brazil. Results for the High Performance Linpack (HPL) benchmark and two real applications are reported. Since building a high-performance cluster to run a wide range of applications is a non-trivial task, some lessons learned from assembling and operating this cluster, such as the excellent performance of the OpenMPI library and the importance of using an efficient parallel file system instead of the traditional NFS, can be useful knowledge to support the design of new systems. Currently, Netuno is heavily used to run large-scale simulations in the areas of ocean modeling, meteorology, engineering, physics, and geophysics.
... Much work has been put into the evaluation of novel supercomputers [27], [29], [30], [31], [45], [46] and non-traditional systems [5], [32], [37], [47], [66] for scientific computing. We share much of our methodology with previous work; we see this as an advantage in that our results are readily comparable with existing results. ...
Article
Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing facilities by companies and institutes alike. Through the use of virtualization and resource time sharing, clouds serve with a single set of physical resources a large user base with different needs. Thus, clouds have the potential to provide to their owners the benefits of an economy of scale and, at the same time, become an alternative for scientists to clusters, grids, and parallel production environments. However, the current commercial clouds have been built to support web and small database workloads, which are very different from typical scientific computing workloads. Moreover, the use of virtualization and resource time sharing may introduce significant performance penalties for the demanding scientific computing workloads. In this work, we analyze the performance of cloud computing services for scientific computing workloads. We quantify the presence in real scientific computing workloads of Many-Task Computing (MTC) users, that is, of users who employ loosely coupled applications comprising many tasks to achieve their scientific goals. Then, we perform an empirical evaluation of the performance of four commercial cloud computing services including Amazon EC2, which is currently the largest commercial cloud. Last, we compare through trace-based simulation the performance characteristics and cost models of clouds and other scientific computing platforms, for general and MTC-based scientific computing workloads. Our results indicate that the current clouds need an order of magnitude in performance improvement to be useful to the scientific community, and show which improvements should be considered first to address this discrepancy between offer and demand.
... Every node in the Cray XT5 is connected into a 3D torus via the SeaStar 2+ interconnect chip (or NIC). Much has been written about the XT network since its introduction in Sandia's Red Storm machine ( [7], [10], [6] for example), so we will only briefly touch upon the details here. Each SeaStar NIC also acts as a router for the network, and has six independent, full duplex links to the rest of the system. ...
Article
As storage systems get larger to meet the demands of petascale systems, careful planning must be applied to avoid congestion points and extract the maximum performance. In addition, the large size of the data sets generated by such systems makes it desirable for all compute resources in a center to have common access to this data without needing to copy it to each machine. This paper describes a method of placing I/O close to the storage nodes to minimize contention on Cray's SeaStar2+ network, and extends it to a routed Lustre configuration to gain the same benefits when running against a center-wide file system. Our experiments show performance improvements for both direct-attached and routed file systems.
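The six full-duplex SeaStar links per node mentioned in the context above correspond to the +/-X, +/-Y, and +/-Z neighbors of a node in the 3D torus. A minimal sketch of that neighbor computation follows; the torus dimensions and the example node are illustrative, not ORNL's actual configuration.

    /* Sketch: the six torus neighbors of node (x, y, z) in an NX x NY x NZ
     * 3D torus, one per SeaStar link. Dimensions are illustrative only. */
    #include <stdio.h>

    static int wrap(int v, int n) { return (v % n + n) % n; }

    int main(void) {
        int NX = 8, NY = 8, NZ = 8;   /* illustrative torus dimensions */
        int x = 0, y = 3, z = 7;      /* example node coordinates      */
        int d[6][3] = { {+1,0,0}, {-1,0,0},
                        {0,+1,0}, {0,-1,0},
                        {0,0,+1}, {0,0,-1} };

        for (int i = 0; i < 6; i++) {
            printf("link %d -> node (%d, %d, %d)\n", i,
                   wrap(x + d[i][0], NX),
                   wrap(y + d[i][1], NY),
                   wrap(z + d[i][2], NZ));
        }
        return 0;
    }

The wrap-around links at the torus edges are what distinguish a torus from a plain mesh and keep worst-case hop counts low.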
... " Cray Inc . has described a strategy based on their XT3 system (Vetter et al. , 2006 ), derived from Sandia National Laboratories ' Red Storm . Such future systems using an AMD Opteron -based and meshinterconnected Massively Parallel Processing (MPP) structure will provide the means to support accelerators such as a possible future vector -based processor, or even possibly Field Programmable Gate Arrays (FPGA) devices. ...
Book
An analytical overview of the state of the art, open problems, and future trends in heterogeneous parallel and distributed computing. This book provides an overview of the ongoing academic research, development, and uses of heterogeneous parallel and distributed computing in the context of scientific computing. Presenting the state of the art in this challenging and rapidly evolving area, the book is organized in five distinct parts: Heterogeneous Platforms: Taxonomy, Typical Uses, and Programming Issues. Performance Models of Heterogeneous Platforms and Design of Heterogeneous Algorithms. Performance: Implementation and Software. Applications. Future Trends. High Performance Heterogeneous Computing is a valuable reference for researchers and practitioners in the area of high performance heterogeneous computing. It also serves as an excellent supplemental text for graduate and postgraduate courses in related areas.
... " Cray Inc . has described a strategy based on their XT3 system (Vetter et al. , 2006 ), derived from Sandia National Laboratories ' Red Storm . Such future systems using an AMD Opteron -based and meshinterconnected Massively Parallel Processing (MPP) structure will provide the means to support accelerators such as a possible future vector -based processor, or even possibly Field Programmable Gate Arrays (FPGA) devices. ...
... The current generation of petascale systems is composed of processing elements (PEs) or nodes with homogeneous and heterogeneous multi-cores on single or multiple sockets, deeper memory hierarchies, and a complex interconnection network infrastructure. Even current systems with peak performance in the hundreds of teraflops, such as the Cray XT and IBM Blue Gene series, offer 4 cores or execution units per PE, multiple levels of unified and shared caches, and a regular communication topology, along with support for distributed-memory (message-passing MPI) and hybrid (MPI and shared-memory OpenMP or pthreads) programming models [1,2,3,4,5,6,7,8,9]. As we demonstrate in this paper, it has become extremely challenging to sustain, let alone improve, performance efficiency or scientific productivity on existing systems; it requires application developers to adopt a hierarchical view in which memory and network performance follow a regular but non-uniform access model. ...
... With 5,212 compute nodes, the peak performance of the XT3 was just over 25 TFLOPS. An evaluation of this configuration was presented in [5,6,7]. Jaguar processors were upgraded to dual-core Opteron model 100 2.6 GHz processors in July, 2006, with memory per node doubled in order to maintain 2 GBytes per core. ...
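Taking the single-core, 2.4 GHz Opteron 150 assumed earlier (2 flops per cycle, hence 4.8 GFLOPS per node), the quoted peak is consistent: 5,212 nodes × 4.8 GFLOPS per node ≈ 25.0 TFLOPS, i.e. "just over 25 TFLOPS".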
Conference Paper
The Petascale Cray XT5 system at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility (LCF) shares a number of system and software features with its predecessor, the Cray XT4 system, including the quad-core AMD processor and a multi-core-aware MPI library. We analyze the performance of scalable scientific applications on the quad-core Cray XT4 system as part of early system access, using a combination of micro-benchmarks and Petascale-ready applications. In particular, we evaluate the impact of key changes introduced by the dual-core to quad-core processor upgrade on application behavior, and provide projections for next-generation massively parallel platforms with multi-core processors, specifically the proposed Petascale Cray XT5 system. We compare and contrast the quad-core XT4 system features with the upcoming XT5 system and discuss strategies for improving scaling and performance for our target applications.
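The hybrid programming model mentioned in the context above (message-passing MPI across nodes combined with shared-memory OpenMP or pthreads within a node) is illustrated by the minimal, generic sketch below; it is not taken from any of the cited applications.

    /* Minimal hybrid MPI + OpenMP sketch: one MPI rank per node or socket,
     * several OpenMP threads per rank. Generic illustration only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;

        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

On a quad-core node this might be launched as one rank with four OpenMP threads, or as four single-threaded ranks, which is the kind of trade-off such multi-core evaluations explore.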
... For example, the IBM Blue Gene systems use a tree-structured network for collective communication and a three-dimensional torus for point-to-point communication (Almási et al., 2005). The Cray XT series machines make use of a mesh network for collective and point-to-point communication (Vetter et al., 2006). ...
... On large distributed-memory systems, measurement tools need a mechanism to store observed performance data. The largest supercomputers increasingly use diskless nodes (Vetter et al., 2006), so there may be no local storage on which to archive observed data. Large machines typically are connected to a high-performance Input/Output (I/O) system, but compute nodes typically communicate with the I/O system through the same network used by applications. ...
Article
Concurrency levels in large-scale, distributed-memory supercomputers are rising exponentially. Modern machines may contain 100,000 or more microprocessor cores, and the largest of these, IBM's Blue Gene/L, contains over 200,000 cores. Future systems are expected to support millions of concurrent tasks. In this dissertation, we focus on efficient techniques for measuring and analyzing the performance of applications running on very large parallel machines. Tuning the performance of large-scale applications can be a subtle and time-consuming task because application developers must measure and interpret data from many independent processes. While the volume of the raw data scales linearly with the number of tasks in the running system, the number of tasks is growing exponentially, and data for even small systems quickly becomes unmanageable. Transporting performance data from so many processes over a network can perturb application performance and make measurements inaccurate, and storing such data would require a prohibitive amount of space. Moreover, even if it were stored, analyzing the data would be extremely time-consuming. In this dissertation, we present novel methods for reducing performance data volume. The first draws on multi-scale wavelet techniques from signal processing to compress systemwide, time-varying load-balance data. The second uses statistical sampling to select a small subset of running processes to generate low-volume traces. A third approach combines sampling and wavelet compression to stratify performance data adaptively at run-time and to reduce further the cost of sampled tracing. We have integrated these approaches into Libra, a toolset for scalable load-balance analysis. We present Libra and show how it can be used to analyze data from large scientific applications scalably.
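As a rough illustration of the sampling idea described in this abstract, the sketch below lets each process decide, from a shared seed and its own rank, whether it belongs to the traced subset, with no communication required. It is a deliberately simplified stand-in, not the dissertation's stratified, adaptive scheme; all names and parameters are illustrative.

    /* Simplified sketch of sampling a subset of processes for tracing.
     * Each rank decides independently but reproducibly from a shared seed.
     * This is an illustration, not the Libra toolset's actual algorithm. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Return 1 if `rank` should emit a trace, selecting roughly a
     * `fraction` of all ranks. */
    static int sampled(int rank, double fraction, unsigned seed) {
        srand(seed + (unsigned)rank);              /* per-rank, reproducible */
        return ((double)rand() / RAND_MAX) < fraction;
    }

    int main(void) {
        int nranks = 16;        /* illustrative job size */
        for (int r = 0; r < nranks; r++)
            if (sampled(r, 0.25, 42u))
                printf("rank %d is in the traced subset\n", r);
        return 0;
    }

Keeping the traced subset small is what bounds the trace volume as the number of tasks grows, which is the central concern of the dissertation.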