Figure - available from: Scientific Programming
This figure shows the feasible core and memory clock ranges for computing one thousand iterations of the CoMD EAM force kernel on a Fermi GPU within one second. Green markers indicate evaluation points that satisfy the constraints, and red markers indicate infeasible clock settings.

Source publication
Article
Full-text available
Architects and applications scientists often use performance models to explore a multidimensional design space of architectural characteristics, algorithm designs, and application parameters. With traditional performance modeling tools, these explorations force users to first develop a performance model and then repeatedly evaluate and analyze the...
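As a rough illustration of the kind of constraint check shown in the figure above, the sketch below sweeps a grid of hypothetical core and memory clock settings against a toy max(compute, memory) runtime model and flags which settings would finish 1,000 kernel iterations within one second. Every constant in it (work per iteration, core count, bus width) is an invented placeholder, not a value from the source publication or from the CoMD EAM kernel.

```python
# Hedged sketch: feasibility sweep over hypothetical core/memory clock
# settings. All constants (work per iteration, core count, bus width)
# are illustrative placeholders, not measured values.

ITERATIONS = 1000
TIME_BUDGET_S = 1.0

FLOPS_PER_ITER = 1.0e9        # assumed floating-point work per iteration
BYTES_PER_ITER = 5.0e7        # assumed DRAM traffic per iteration
CORES = 512                   # Fermi-class core count (illustrative)
FLOPS_PER_CORE_PER_CYCLE = 2
BUS_BYTES_PER_CYCLE = 48      # assumed memory bus width in bytes

def iteration_time_s(core_mhz, mem_mhz):
    """Toy max(compute, memory) time model for one kernel iteration."""
    compute_s = FLOPS_PER_ITER / (CORES * FLOPS_PER_CORE_PER_CYCLE * core_mhz * 1e6)
    memory_s = BYTES_PER_ITER / (BUS_BYTES_PER_CYCLE * mem_mhz * 1e6)
    return max(compute_s, memory_s)

for core_mhz in range(600, 1601, 250):
    for mem_mhz in range(800, 2001, 300):
        total_s = ITERATIONS * iteration_time_s(core_mhz, mem_mhz)
        status = "feasible" if total_s <= TIME_BUDGET_S else "infeasible"
        print(f"core={core_mhz} MHz, mem={mem_mhz} MHz -> {total_s:.2f} s ({status})")
```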

Citations

... Analytical approaches in general can be said to be convenient for the first part of design space exploration, because the problem can be reduced to calculations [26]. A limitation, however, is that reality cannot easily be converted into a mathematical formula or equation, and the random effects such formulas leave out compound the longer they are ignored. ...
Conference Paper
Full-text available
The advancement of computer architecture has led to a massive need for memory, and sustaining such leaps has required examining solutions for the design space exploration of memory. Given design space exploration and its relevance to the fundamental makeup of cache memory, there are several possible solutions to the problem of needing more and faster memory sooner rather than later. The design space exploration techniques of simulation, analytical modeling, configurable approaches, and evolutionary methods all provide different advantages and clearly enhance cache memory design; however, each has drawbacks that should be considered depending on the intent of its application. This paper presents an overview of the different techniques and evaluates their effectiveness with regard to their advantages and disadvantages. A suggested best solution could be a hybrid that allows the different techniques to compensate for one another's deficiencies.
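As a minimal illustration of the "analytical" style of exploration contrasted above with simulation, the sketch below estimates average memory access time (AMAT) for a few hypothetical cache configurations. The hit times, miss rates, and miss penalty are assumed numbers, not measurements from any design discussed in the paper.

```python
# Hedged sketch of an analytical cache design-space step: estimate
# average memory access time (AMAT) for hypothetical configurations.
# Hit times, miss rates, and the miss penalty are assumed values.

MISS_PENALTY_NS = 80.0  # assumed DRAM access latency

# (label, hit time in ns, assumed miss rate)
candidates = [
    ("32 KiB, 2-way",  1.0, 0.060),
    ("64 KiB, 4-way",  1.2, 0.040),
    ("128 KiB, 8-way", 1.5, 0.030),
]

for label, hit_ns, miss_rate in candidates:
    amat = hit_ns + miss_rate * MISS_PENALTY_NS  # classic AMAT formula
    print(f"{label}: AMAT = {amat:.2f} ns")
```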
... To make it easier to develop power-performance-optimized applications, the use of a DSL has already been considered. ASPEN [22] is a DSL for describing both system architecture and application characteristics in order to model the performance of an application on a system. ASPEN is good at estimating the performance of a system described with it, but its power consumption figure is only an estimate based on given parameters. ...
Conference Paper
Full-text available
To design exascale HPC systems, power limitation is one of the most crucial and unavoidable issues; it is also necessary to optimize the power-performance of user applications while keeping the power consumption of the HPC system below a given power budget. For this kind of power-performance optimization of HPC applications, it is indispensable to have sufficient information about, and a good understanding of, both the system specifications (what kinds of hardware resources are included in the system, which components can be used as "power-knobs", how to control those knobs, etc.) and the user applications (which parts of the application are CPU-intensive, memory-intensive, and so on). Because this situation forces both the users and administrators of power-constrained HPC systems to spend considerable effort and cost, a simple framework that automates the power-performance optimization process, together with a simple user interface to it, is in high demand. To tackle these concerns, we propose and implement a versatile framework to help carry out power management and performance optimization on power-constrained HPC systems. Within this framework, we also propose a simple DSL as an interface to the framework. We believe this is a key to effectively utilizing HPC systems under a limited power budget.
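A toy sketch of the kind of decision such a framework could automate follows: pick the fastest frequency setting (one possible "power-knob") whose estimated node power fits a given budget. The frequency, power, and performance numbers are illustrative and are not taken from the cited framework or its DSL.

```python
# Hedged sketch: choose the fastest frequency setting whose estimated
# node power fits a given budget. The table below is illustrative.

POWER_BUDGET_W = 180.0

# (frequency in GHz, estimated node power in W, relative performance)
knob_settings = [
    (1.2, 120.0, 0.55),
    (1.8, 150.0, 0.75),
    (2.4, 185.0, 0.90),
    (3.0, 230.0, 1.00),
]

feasible = [s for s in knob_settings if s[1] <= POWER_BUDGET_W]
# Fall back to the lowest-power setting if nothing fits the budget.
best = max(feasible, key=lambda s: s[2]) if feasible else min(knob_settings, key=lambda s: s[1])
freq, watts, perf = best
print(f"Selected {freq} GHz: ~{watts} W, {perf:.0%} of peak performance")
```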
Article
The Abisko project aims to develop an energy-efficient spiking neural network (SNN) computing architecture and software system capable of autonomous learning and operation. The SNN architecture explores novel neuromorphic devices that are based on resistive-switching materials, such as memristors and electrochemical RAM. Equally important, Abisko uses a deep codesign approach to pursue this goal by engaging experts from across the entire range of disciplines: materials, devices and circuits, architectures and integration, software, and algorithms. The key objectives of our Abisko project are threefold. First, we are designing an energy-optimized high-performance neuromorphic accelerator based on SNNs. This architecture is being designed as a chiplet that can be deployed in contemporary computer architectures and we are investigating novel neuromorphic materials to improve its design. Second, we are concurrently developing a productive software stack for the neuromorphic accelerator that will also be portable to other architectures, such as field-programmable gate arrays and GPUs. Third, we are creating a new deep codesign methodology and framework for developing clear interfaces, requirements, and metrics between each level of abstraction to enable the system design to be explored and implemented interchangeably with execution, measurement, a model, or simulation. As a motivating application for this codesign effort, we target the use of SNNs for an analog event detector for a high-energy physics sensor.
Article
Domain-specific SoCs (DSSoCs) are an attractive solution for domains with extremely stringent power, performance, and area constraints. However, DSSoCs suffer from two fundamental complexities. On the one hand, their many specialized hardware blocks result in complex systems and thus high development effort. On the other hand, their many system knobs expand the complexity of the design space, making the search for the optimal design difficult. Thus, to reach prevalence, taming such complexities is necessary. To address these challenges, in this work we identify the necessary features of an early-stage design space exploration (DSE) framework that targets the complex design space of DSSoCs and provide an instance of one such framework that we refer to as FARSI. FARSI provides an agile system-level simulator with a speedup of 8,400× and an accuracy of 98.5% compared to Synopsys Platform Architect. FARSI also provides an efficient exploration heuristic and achieves up to 62× and 35× improvements in convergence time compared to classic simulated annealing (SA) and the modern Multi-Objective Optimistic Search (MOOS). This is done by augmenting SA with architectural reasoning such as locality exploitation and bottleneck relaxation. Furthermore, we embed various co-design capabilities and show that, on average, they have a 32% impact on the convergence rate. Finally, we demonstrate that using development-cost-aware policies can lower system complexity, both in terms of component count and variation, by as much as 60% and 82% (e.g., for the Network-on-a-Chip subsystem), respectively. This paper targets the Special Issue on Domain-Specific System-on-Chip Architectures and Run-Time Management Techniques.
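For readers unfamiliar with the classic simulated annealing baseline that FARSI augments, the sketch below shows a generic SA loop over a toy two-knob design point. The cost model and neighborhood moves are placeholders standing in for a real simulator or analytical model; this is not FARSI's heuristic.

```python
# Hedged sketch of classic simulated annealing over a toy design point
# (number of processing elements, bus width in bits). The cost function
# is a made-up placeholder, not a real performance/area model.
import math
import random

def cost(design):
    pes, bus = design
    latency = 100.0 / pes + 50.0 / bus   # toy latency model
    area = 2.0 * pes + 0.5 * bus         # toy area model
    return latency + 0.1 * area

def neighbor(design):
    pes, bus = design
    if random.random() < 0.5:
        pes = max(1, pes + random.choice([-1, 1]))
    else:
        bus = max(4, bus + random.choice([-4, 4]))
    return (pes, bus)

def anneal(start, temp=10.0, cooling=0.95, steps=500):
    current, best = start, start
    for _ in range(steps):
        candidate = neighbor(current)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temp *= cooling
    return best

print("Best design found:", anneal((4, 16)))
```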
Book
This book constitutes the refereed proceedings of the 4th Asian Supercomputing Conference, SCFA 2018, held in Singapore in March 2018. Supercomputing Frontiers will be rebranded as Supercomputing Frontiers Asia (SCFA), which serves as the technical programme for SCA18. The technical programme for SCA18 consists of four tracks: Application, Algorithms & Libraries; Programming System Software; Architecture, Network/Communications & Management; and Data, Storage & Visualisation. The 20 papers presented in this volume were carefully reviewed and selected from 60 submissions.
Article
With the anticipation of exascale architectures, energy consumption is becoming one of the critical design parameters, especially in light of the energy budget of 20-30 megawatts set by the U.S. Department of Energy. Understanding an application's execution pattern and its energy footprint is critical to improving application operation on a diverse heterogeneous architecture. Applying application-specific performance optimization can consequently improve energy consumption. However, this approach is only applicable to current systems. As we enter a new era of exascale architectures, projected to contain more complex memory hierarchies, increased levels of parallelism, heterogeneity in hardware, and complex programming models and techniques, energy and performance management is becoming more cumbersome. We therefore propose techniques that predict the energy consumption beforehand or at runtime to enable proactive tuning. Such energy prediction approaches must be generic and adapt themselves at runtime to changing application and hardware configurations. Most existing energy estimation and prediction approaches are empirical in nature and thus tied to current systems. To overcome this limitation, we propose two energy estimation techniques: ACEE (Algorithmic and Categorical Energy Estimation), which uses a combination of analytical and empirical modeling techniques; and AEEM (Aspen's Embedded Energy Estimation), a system-level analytical energy estimation technique. Both of these models incorporate the Aspen domain-specific language for performance modeling. We present the methodologies of these two models and test their accuracy using five proxy applications. We also describe three use cases.
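The system-level analytical style described here can be reduced, in its simplest form, to energy = power x time with a basic CMOS power model. The sketch below illustrates that arithmetic with assumed coefficients; it does not reproduce ACEE or AEEM.

```python
# Hedged sketch of a system-level analytical energy estimate:
# E = (static power + dynamic power) * predicted runtime.
# All coefficients and counts are illustrative assumptions.

def dynamic_power_w(capacitance_f, voltage_v, freq_hz, activity=0.5):
    """Classic CMOS dynamic power: alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

def estimate_energy_j(flops, flops_per_sec, static_w, dyn_w):
    runtime_s = flops / flops_per_sec       # analytical runtime estimate
    return (static_w + dyn_w) * runtime_s   # energy = power * time

dyn = dynamic_power_w(capacitance_f=2.0e-8, voltage_v=1.0, freq_hz=2.5e9)
energy = estimate_energy_j(flops=4.0e12, flops_per_sec=2.0e11,
                           static_w=35.0, dyn_w=dyn)
print(f"Dynamic power ~{dyn:.1f} W, predicted energy ~{energy:.1f} J")
```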
Article
Full-text available
Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track resource usage. We focus on asymmetric processing models built using conventional CPUs and a family of special-purpose QPUs that employ quantum computing principles. Our performance models account for the translation of a classical optimization problem into the physical representation required by the quantum processor while also accounting for hardware limitations and conventional processor speed and memory. We conclude that the bottleneck in this split-execution computing system lies at the quantum-classical interface and that the primary time cost is independent of quantum processor behavior.