Figure 2. AMD Opteron Design (Image courtesy of AMD).

Source publication
Conference Paper
Oak Ridge National Laboratory recently received delivery of a 5,294-processor Cray XT3. The XT3 is Cray's third-generation massively parallel processing system. The system builds on a single-processor node, built around the AMD Opteron, and uses a custom chip, called SeaStar, to provide interprocessor communication. In addition, the system use...

Context in source publication

Context 1
... ORNL XT3 uses Opteron model 150 processors. As Figure 2 shows, this model includes an Opteron core, integrated memory controller, three 16b-wide 800 MHz HyperTransport (HT) links, and L1 and L2 caches. The Opteron core has three integer units and one floating point unit capable of two floating-point operations per cycle [3]. ...
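Taking the figures in this context at face value, the node's headline numbers can be reproduced with a short back-of-the-envelope sketch. The 2.4 GHz clock of the Opteron model 150 and the double-data-rate signaling of the 800 MHz HyperTransport links are assumptions not stated in the excerpt; the two floating-point operations per cycle are from the excerpt.

    /* Back-of-the-envelope peak numbers for the Opteron 150 node described
     * above. The 2.4 GHz core clock and the double-data-rate (DDR) signaling
     * of the HT links are assumptions, not taken from the excerpt. */
    #include <stdio.h>

    int main(void) {
        double clock_ghz      = 2.4;   /* assumed Opteron model 150 clock  */
        double flops_per_cyc  = 2.0;   /* one FPU, two FP ops per cycle [3] */
        double ht_clock_mhz   = 800.0; /* stated HT link clock              */
        double ht_width_bytes = 2.0;   /* 16-bit-wide link                  */

        double peak_gflops    = clock_ghz * flops_per_cyc;
        double ht_gbs_per_dir = ht_clock_mhz * 1e6 * 2.0 /* DDR */
                                * ht_width_bytes / 1e9;

        printf("peak FP per processor: %.1f GFLOPS\n", peak_gflops);
        printf("HT bandwidth per link: %.1f GB/s each way\n", ht_gbs_per_dir);
        return 0;
    }

Under those assumptions this works out to 4.8 GFLOPS per processor and 3.2 GB/s per HyperTransport link in each direction.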

Similar publications

Conference Paper
The rise of chip multiprocessors (CMPs) as a promising trend in state-of-the-art high-performance processor design has made a scalable cache-directory organization, along with a simple cache-coherence protocol, a hot research area. While thousands of cores are expected to fit on a single chip soon, the previously proposed cache di...
Article
Graphics processing units (GPUs) can provide excellent speedups on some, but not all, general-purpose workloads. Using a set of computational GPU kernels as examples, the authors show how to adapt kernels to utilize the architectural features of a GeForce 8800 GPU and what finally limits the achievable performance.
Conference Paper
This paper proposes and evaluates a new microarchitecture for out-of-order processors that supports speculative renaming. We use the term speculative renaming for the speculative omission of physical register allocation along with the speculative early release of physical registers. These renaming policies may cause a register operand not to be kept in the...
Conference Paper
In this paper we present implementation and experimental results for a digital vision chip that operates in a mixed asynchronous/synchronous mode. The mixed configuration benefits from full programmability (discrete-time mode) and high operational performance in global image-processing operations (continuous-time mode), thus extending the application fiel...

Citations

... The FBFLY topology is derived by combining the routers in each row of a conventional butterfly topology while preserving the inter-router connections [14], connecting farthest nodes in each row and column. The Intel Teraflops Research Chip [15], Cray XT3 [16], IBM Blue Gene [17], and Tilera TILE-Gx [18] are examples of several commercial chips implementing NoCs. ...
... The many challenges in implementing LU efficiently present tradeoffs influenced by the interplay of processor microarchitecture, node design, and interconnection network. Recent architectural trends have led to supercomputers with increasing numbers of cores per node: successive generations of Blue Gene have 2, 4, and 16 compute cores [15]; Cray's XT3 had 1-2 cores [29], while the XE6 has 24 cores [7]; the bullx S6010 nodes in today's 9th-ranked Tera-100 cluster can support up to 40 cores [3]. The computational capacity of each core has increased greatly in that time, but per-core cache and memory bandwidth have not kept up. ...
Article
Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations, including the reference code HPL, use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. We show how the critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivot identification and exchange with the computation of rank-k updates. By shifting this trade-off, a modified block-cyclic distribution can beneficially exploit more available parallelism on the critical path, and reduce panel factorization's memory hierarchy contention on now-ubiquitous multi-core architectures. The missed parallelism in traditional block-cyclic distributions arises because active panel factorization, triangular solves, and subsequent broadcasts are spread over single process columns or rows (respectively) of the process grid. Increasing one dimension of the process grid decreases the number of distinct processes in the other dimension. To increase parallelism in both dimensions, periodic 'rotation' is applied to the process grid to recover the row-parallelism lost by a tall process grid. During active panel factorization, rank-1 updates stream through memory with minimal reuse. In a column-major process grid, the performance of this access pattern degrades as too many streaming processors contend for access to memory. A block-cyclic mapping in the more popular row-major order does not encounter this problem, but consequently sacrifices node and network locality in the critical pivoting steps. We introduce 'striding' to vary between the two extremes of row- and column-major process grids. As a test-bed for further mapping experiments, we describe a dense LU implementation that allows a block distribution to be defined as a general function of block to processor. Other mappings can be tested with only small, local changes to the code.
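The block-cyclic mapping at the heart of this abstract can be made concrete in a few lines of code. The following is a minimal sketch, not the paper's implementation; the function names and example grid sizes are illustrative. It shows how a block with block-coordinates (I, J) is assigned to a process on a P x Q grid, and how the two common grid orderings (column-major and row-major) linearize that grid position into a rank.

    /* Sketch of a 2-D block-cyclic block-to-process mapping of the kind
     * used by HPL-style LU factorizations. Names are illustrative only. */
    #include <stdio.h>

    /* Grid coordinates of the owner of block (I, J) on a P x Q process grid. */
    static void owner_coords(int I, int J, int P, int Q, int *p, int *q) {
        *p = I % P;   /* process row    */
        *q = J % Q;   /* process column */
    }

    /* Linear rank of grid position (p, q) for the two common orderings. */
    static int rank_col_major(int p, int q, int P) { return p + q * P; }
    static int rank_row_major(int p, int q, int Q) { return q + p * Q; }

    int main(void) {
        int P = 4, Q = 8;    /* example process grid       */
        int I = 10, J = 3;   /* example block coordinates  */
        int p, q;
        owner_coords(I, J, P, Q, &p, &q);
        printf("block (%d,%d) -> grid (%d,%d): col-major rank %d, row-major rank %d\n",
               I, J, p, q, rank_col_major(p, q, P), rank_row_major(p, q, Q));
        return 0;
    }

The 'rotation' and 'striding' variants described above can then be expressed as alternative block-to-process functions, which is the kind of generality the paper's test-bed exposes.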
... Much work has been put into the evaluation of novel supercomputers [11], [12], [13], [14], [15], [16] and nontraditional systems [17], [18], [19], [15], [20] for scientific and business computing. There has been a recent spur of research activity in assessing the performance of virtualized resources, in the cloud computing environments [7], [8], [9], [21], [22] and in other traditional networks [6], [23], [24], [25], [26], [27], [28]. ...
Article
A highly available network is becoming essential for high-growth, disruptive technology companies. For businesses that must migrate to networks supporting scalability and high availability, it is important to analyze the relevant factors and the cost effectiveness of the candidate solutions in order to choose the optimal one. The current work considers this important problem and presents an analysis of the main factors influencing the decision. Network availability is discussed in terms of the internal and external risk factors of the network. A production-network risk matrix is proposed, and a scheme to compute the overall risk is presented. A case study is presented in which four possible network configurations are analyzed and the most suitable solution is identified. This study provides a paradigm and a useful framework for analyzing cloud computing services.
... Alam et al. (Alam et al., 2008) perform an initial evaluation of BlueGene/P, the second generation of IBM BlueGene solutions, using benchmarks, Linpack, and scientific applications, such as ocean modeling, climate modeling and combustion. The Cray XT architectures are evaluated with micro-benchmarks and benchmarks in (Vetter et al., 2006) for Cray XT3, (Alam et al., 2007) for Cray XT4, and (Worley et al., 2009) for Cray XT5. ...
Article
This paper presents a description and evaluation of the Netuno supercomputer, a high-performance cluster installed at the Federal University of Rio de Janeiro in Brazil. Results for the High Performance Linpack (HPL) benchmark and two real applications are reported. Since building a high-performance cluster to run a wide range of applications is a non-trivial task, some lessons learned from assembling and operating this cluster, such as the excellent performance of the OpenMPI library and the importance of using an efficient parallel file system instead of the traditional NFS, can be useful knowledge to support the design of new systems. Currently, Netuno is heavily used to run large-scale simulations in the areas of ocean modeling, meteorology, engineering, physics, and geophysics.
... Much work has been put into the evaluation of novel supercomputers [27], [29], [30], [31], [45], [46] and non-traditional systems [5], [32], [37], [47], [66] for scientific computing. We share much of our methodology with previous work; we see this as an advantage in that our results are readily comparable with existing results. ...
Article
Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing facilities by companies and institutes alike. Through the use of virtualization and resource time sharing, clouds serve with a single set of physical resources a large user base with different needs. Thus, clouds have the potential to provide to their owners the benefits of an economy of scale and, at the same time, become an alternative for scientists to clusters, grids, and parallel production environments. However, the current commercial clouds have been built to support web and small database workloads, which are very different from typical scientific computing workloads. Moreover, the use of virtualization and resource time sharing may introduce significant performance penalties for the demanding scientific computing workloads. In this work, we analyze the performance of cloud computing services for scientific computing workloads. We quantify the presence in real scientific computing workloads of Many-Task Computing (MTC) users, that is, of users who employ loosely coupled applications comprising many tasks to achieve their scientific goals. Then, we perform an empirical evaluation of the performance of four commercial cloud computing services including Amazon EC2, which is currently the largest commercial cloud. Last, we compare through trace-based simulation the performance characteristics and cost models of clouds and other scientific computing platforms, for general and MTC-based scientific computing workloads. Our results indicate that the current clouds need an order of magnitude in performance improvement to be useful to the scientific community, and show which improvements should be considered first to address this discrepancy between offer and demand.
... Every node in the Cray XT5 is connected into a 3D torus via the SeaStar 2+ interconnect chip (or NIC). Much has been written about the XT network since its introduction in Sandia's Red Storm machine ( [7], [10], [6] for example), so we will only briefly touch upon the details here. Each SeaStar NIC also acts as a router for the network, and has six independent, full duplex links to the rest of the system. ...
Article
As storage systems get larger to meet the demands of petascale systems, careful planning must be applied to avoid congestion points and extract the maximum performance. In addition, the large size of the data sets generated by such systems makes it desirable for all compute resources in a center to have common access to this data without needing to copy it to each machine. This paper describes a method of placing I/O close to the storage nodes to minimize contention on Cray's SeaStar2+ network, and extends it to a routed Lustre configuration to gain the same benefits when running against a center-wide file system. Our experiments show performance improvements for both direct-attached and routed file systems.
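The six full-duplex SeaStar links per node mentioned in the context above correspond to the +/-X, +/-Y, and +/-Z neighbors of a node in the 3D torus. A minimal sketch of that neighbor computation follows; the torus dimensions and the example node are illustrative, not ORNL's actual configuration.

    /* Sketch: the six torus neighbors of node (x, y, z) in an NX x NY x NZ
     * 3D torus, one per SeaStar link. Dimensions are illustrative only. */
    #include <stdio.h>

    static int wrap(int v, int n) { return (v % n + n) % n; }

    int main(void) {
        int NX = 8, NY = 8, NZ = 8;   /* illustrative torus dimensions */
        int x = 0, y = 3, z = 7;      /* example node coordinates      */
        int d[6][3] = { {+1,0,0}, {-1,0,0},
                        {0,+1,0}, {0,-1,0},
                        {0,0,+1}, {0,0,-1} };

        for (int i = 0; i < 6; i++) {
            printf("link %d -> node (%d, %d, %d)\n", i,
                   wrap(x + d[i][0], NX),
                   wrap(y + d[i][1], NY),
                   wrap(z + d[i][2], NZ));
        }
        return 0;
    }

The wrap-around links at the torus edges are what distinguish a torus from a plain mesh and keep worst-case hop counts low.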
... " Cray Inc . has described a strategy based on their XT3 system (Vetter et al. , 2006 ), derived from Sandia National Laboratories ' Red Storm . Such future systems using an AMD Opteron -based and meshinterconnected Massively Parallel Processing (MPP) structure will provide the means to support accelerators such as a possible future vector -based processor, or even possibly Field Programmable Gate Arrays (FPGA) devices. ...
Book
An analytical overview of the state of the art, open problems, and future trends in heterogeneous parallel and distributed computing. This book provides an overview of the ongoing academic research, development, and uses of heterogeneous parallel and distributed computing in the context of scientific computing. Presenting the state of the art in this challenging and rapidly evolving area, the book is organized in five distinct parts: Heterogeneous Platforms: Taxonomy, Typical Uses, and Programming Issues. Performance Models of Heterogeneous Platforms and Design of Heterogeneous Algorithms. Performance: Implementation and Software. Applications. Future Trends. High Performance Heterogeneous Computing is a valuable reference for researchers and practitioners in the area of high performance heterogeneous computing. It also serves as an excellent supplemental text for graduate and postgraduate courses in related areas.
... " Cray Inc . has described a strategy based on their XT3 system (Vetter et al. , 2006 ), derived from Sandia National Laboratories ' Red Storm . Such future systems using an AMD Opteron -based and meshinterconnected Massively Parallel Processing (MPP) structure will provide the means to support accelerators such as a possible future vector -based processor, or even possibly Field Programmable Gate Arrays (FPGA) devices. ...
... The current generation of petascale systems is composed of processing elements (PEs) or nodes with homogeneous and heterogeneous multi-cores on single or multiple sockets, deeper memory hierarchies, and a complex interconnection network infrastructure. Even current systems with peak performance in the hundreds of teraflops, such as the Cray XT and IBM Blue Gene series, offer 4 cores or execution units per PE, multiple levels of unified and shared caches, and a regular communication topology, along with support for distributed-memory (message-passing MPI) and hybrid (MPI and shared-memory OpenMP or pthreads) programming models [1,2,3,4,5,6,7,8,9]. As we demonstrate in this paper, it has become extremely challenging to sustain, let alone improve, performance efficiency or scientific productivity on existing systems; it requires application developers to adopt a hierarchical view in which memory and network performance follow a regular but non-uniform access model. ...
... With 5,212 compute nodes, the peak performance of the XT3 was just over 25 TFLOPS. An evaluation of this configuration was presented in [5,6,7]. Jaguar processors were upgraded to dual-core Opteron model 100 2.6 GHz processors in July, 2006, with memory per node doubled in order to maintain 2 GBytes per core. ...
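Taking the single-core, 2.4 GHz Opteron 150 assumed earlier (2 flops per cycle, hence 4.8 GFLOPS per node), the quoted peak is consistent: 5,212 nodes × 4.8 GFLOPS per node ≈ 25.0 TFLOPS, i.e. "just over 25 TFLOPS".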
Conference Paper
The Petascale Cray XT5 system at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility (LCF) shares a number of system and software features with its predecessor, the Cray XT4 system, including the quad-core AMD processor and a multi-core-aware MPI library. We analyze the performance of scalable scientific applications on the quad-core Cray XT4 system as part of early system access, using a combination of micro-benchmarks and Petascale-ready applications. In particular, we evaluate the impact of key changes introduced by the dual-core to quad-core processor upgrade on application behavior, and provide projections for next-generation massively parallel platforms with multi-core processors, specifically the proposed Petascale Cray XT5 system. We compare and contrast the quad-core XT4 system features with the upcoming XT5 system and discuss strategies for improving scaling and performance for our target applications.
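The hybrid programming model mentioned in the context above (message-passing MPI across nodes combined with shared-memory OpenMP or pthreads within a node) is illustrated by the minimal, generic sketch below; it is not taken from any of the cited applications.

    /* Minimal hybrid MPI + OpenMP sketch: one MPI rank per node or socket,
     * several OpenMP threads per rank. Generic illustration only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;

        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

On a quad-core node this might be launched as one rank with four OpenMP threads, or as four single-threaded ranks, which is the kind of trade-off such multi-core evaluations explore.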
... For example, the IBM Blue Gene systems use a tree-structured network for collective communication and a three-dimensional torus for point-to-point communication (Almási et al., 2005). The Cray XT series machines make use of a mesh network for collective and point-to-point communication (Vetter et al., 2006). ...
... On large distributed-memory systems, measurement tools need a mechanism to store observed performance data. The largest supercomputers increasingly use diskless nodes (Vetter et al., 2006), so there may be no local storage on which to archive observed data. Large machines typically are connected to a high-performance Input/Output (I/O) system, but compute nodes typically communicate with the I/O system through the same network used by applications. ...
Article
Concurrency levels in large-scale, distributed-memory supercomputers are rising exponentially. Modern machines may contain 100,000 or more microprocessor cores, and the largest of these, IBM's Blue Gene/L, contains over 200,000 cores. Future systems are expected to support millions of concurrent tasks. In this dissertation, we focus on efficient techniques for measuring and analyzing the performance of applications running on very large parallel machines. Tuning the performance of large-scale applications can be a subtle and time-consuming task because application developers must measure and interpret data from many independent processes. While the volume of the raw data scales linearly with the number of tasks in the running system, the number of tasks is growing exponentially, and data for even small systems quickly becomes unmanageable. Transporting performance data from so many processes over a network can perturb application performance and make measurements inaccurate, and storing such data would require a prohibitive amount of space. Moreover, even if it were stored, analyzing the data would be extremely time-consuming. In this dissertation, we present novel methods for reducing performance data volume. The first draws on multi-scale wavelet techniques from signal processing to compress systemwide, time-varying load-balance data. The second uses statistical sampling to select a small subset of running processes to generate low-volume traces. A third approach combines sampling and wavelet compression to stratify performance data adaptively at run-time and to reduce further the cost of sampled tracing. We have integrated these approaches into Libra, a toolset for scalable load-balance analysis. We present Libra and show how it can be used to analyze data from large scientific applications scalably.
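As a rough illustration of the sampling idea described in this abstract, the sketch below lets each process decide, from a shared seed and its own rank, whether it belongs to the traced subset, with no communication required. It is a deliberately simplified stand-in, not the dissertation's stratified, adaptive scheme; all names and parameters are illustrative.

    /* Simplified sketch of sampling a subset of processes for tracing.
     * Each rank decides independently but reproducibly from a shared seed.
     * This is an illustration, not the Libra toolset's actual algorithm. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Return 1 if `rank` should emit a trace, selecting roughly a
     * `fraction` of all ranks. */
    static int sampled(int rank, double fraction, unsigned seed) {
        srand(seed + (unsigned)rank);              /* per-rank, reproducible */
        return ((double)rand() / RAND_MAX) < fraction;
    }

    int main(void) {
        int nranks = 16;        /* illustrative job size */
        for (int r = 0; r < nranks; r++)
            if (sampled(r, 0.25, 42u))
                printf("rank %d is in the traced subset\n", r);
        return 0;
    }

Keeping the traced subset small is what bounds the trace volume as the number of tasks grows, which is the central concern of the dissertation.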