Figure 3: Core micro-architecture.

Multiple Levels of Abstraction in the Simulation of Microthreaded Many-Core Architectures

Article

Oct 2015

Irfan Uddin

Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with low complexity, high simulation speed and reasonable accuracy are desired. These simulators should also have a short development time, and changes in the design should require little implementation effort, so that experiments can be run quickly to see the effects of those changes. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstraction in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy of the simulation without significantly affecting the complexity and simulation speed.

High-level simulation of concurrency operations in microthreaded many-core architectures

Article

Sep 2015

Irfan Uddin

Computer architects are always interested in analyzing the complex interactions amongst dynamically allocated resources. Generally, a detailed simulator with a cycle-accurate simulation of the execution time is used. However, the cycle-accurate simulator executes at a rate of about 100K instructions per second, divided over the number of simulated cores. This means that the evaluation of a complex application with complex concurrency interactions on a contemporary multi-core machine can be very slow. To perform efficient design space exploration, we present a co-simulation environment in which the detailed execution of concurrency instructions in the pipeline of microthreaded cores and the interactions amongst the hardware components are abstracted. We present the evaluation of the high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator and has reasonable accuracy.
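To make the cost of the cycle-accurate rate concrete, the following is a minimal arithmetic sketch. It takes the 100K instructions per second figure from the abstract; the core count (128) and the workload size (one billion instructions) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
# Rough cost estimate for cycle-accurate simulation.
# The 100K instructions/second aggregate rate comes from the abstract;
# the core count and workload size below are illustrative assumptions.

AGGREGATE_RATE_IPS = 100_000  # simulated instructions per second, all cores together


def per_core_rate(aggregate_rate_ips: float, cores: int) -> float:
    """Simulated instructions per second available to each core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return aggregate_rate_ips / cores


def simulation_seconds(total_instructions: float,
                       aggregate_rate_ips: float = AGGREGATE_RATE_IPS) -> float:
    """Wall-clock seconds needed to simulate the given number of instructions."""
    return total_instructions / aggregate_rate_ips


if __name__ == "__main__":
    cores = 128                       # assumed core count
    workload = 1_000_000_000          # assumed total instruction count
    print(per_core_rate(AGGREGATE_RATE_IPS, cores))   # instructions/second per core
    print(simulation_seconds(workload) / 3600.0)      # hours of wall-clock time
```

Under these assumptions each simulated core advances by fewer than a thousand instructions per second, and the whole workload takes a few hours of wall-clock time, which is the motivation for abstracting the detailed execution.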

Cache-based high-level simulation of microthreaded many-core architectures

Article

Jun 2014
J SYST ARCHITECT

The accuracy of simulated cycles in high-level simulators is generally lower than in detailed simulators for single-core systems, because high-level simulators simulate the behavior of components rather than the components themselves, as detailed simulators do. The simulation problem becomes more challenging when simulating many-core systems, where many cores are executing instructions concurrently. In these systems data may be accessed from multiple caches, and the abstraction of the instruction execution has to consider the dynamic resource sharing on the whole chip. The problem becomes even more challenging in microthreaded many-core systems, because there may exist concurrent hardware threads, which means that the latency of long-latency operations can be tolerated, reducing their effective cost from many cycles to just a few cycles. We have previously presented a simulation technique to improve the accuracy in high-level simulation of microthreaded many-core systems, known as the Signature-based high-level simulator, which adapts the throughput of the program based on the type of instructions, the number of instructions and the number of active threads in the pipeline. However, it disregards the access to different levels of the caches on the many-core system. Accessing the L1 cache has far less latency than accessing off-chip memory, and if the core is not able to tolerate latency, different levels of caches cannot be treated equally. The distributed cache network, along with the synchronization-aware coherency protocol in the Microgrid, is a complicated memory architecture, and it is difficult to simulate its behavior at a high level. In this article we present a high-level cache model, which aims to improve the accuracy of high-level simulators for general-purpose many-core systems by adding little complexity to the simulator and without affecting the simulation speed.
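The point that cache levels cannot be treated equally can be illustrated with a small weighted-latency calculation. This is a minimal sketch and not the cache model from the article: the function name, the level breakdown, the access fractions and the latencies are all hypothetical values chosen for illustration.

```python
# Expected memory access latency when accesses are served by different
# levels of the memory hierarchy. This is NOT the article's cache model;
# all fractions and latencies below are hypothetical.

def expected_access_latency(levels):
    """Return the average latency in cycles.

    levels: list of (fraction_of_accesses_served, latency_in_cycles) pairs.
    The fractions must sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(fraction * latency for fraction, latency in levels)


if __name__ == "__main__":
    # Hypothetical breakdown: L1 cache, on-chip cache network, off-chip memory.
    hierarchy = [(0.90, 2), (0.08, 20), (0.02, 200)]
    flat = [(1.00, 2)]  # what a model sees if every access is treated as an L1 hit
    print(expected_access_latency(hierarchy))
    print(expected_access_latency(flat))
```

With these assumed numbers the hierarchy-aware average is several times larger than the flat L1-only estimate, even though only 2% of accesses go off chip. A simulator that ignores cache levels would underestimate cycles in this way whenever the core cannot hide the latency.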

Apple-CORE: Harnessing general-purpose many-cores with hardware concurrency management

Article

Nov 2013
MICROPROCESS MICROSY

To harness the potential of CMPs for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. Its implementation in hardware provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are components in distributed systems on chip that consider both clusters of small cores and optional, larger sequential cores as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate irregular long latencies on chip, a scale-invariant programming model, a distributed chip resource model, and the transparent performance scaling of a single program binary code across multiple cluster sizes. This article describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip and its software environment. This article also presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good efficiency in terms of the utilisation of hardware despite the high-latency memory accesses and good scalability across relatively large clusters of cores.

Microgrid - The microthreaded many-core architecture

Technical Report

Sep 2013

Irfan Uddin

Traditional processors use the von Neumann execution model, while some other processors in the past have used the dataflow execution model. A combination of the von Neumann model and the dataflow model has also been tried in the past, and the resulting model is referred to as the hybrid dataflow execution model. We describe a hybrid dataflow model known as microthreading. It provides constructs for the creation of, synchronization of and communication between threads in an intermediate language. The microthreading model is an abstract programming and machine model for many-core architectures. A particular instance of this model is named the microthreaded architecture, or the Microgrid. This architecture implements all the concurrency constructs of the microthreading model, and their management, in hardware.

MGSim---A simulation Environment for Multi-Core Research and Education

Conference Paper

Jul 2013

This article presents MGSim, an open source discrete event simulator for on-chip hardware components developed at the University of Amsterdam. MGSim is used as a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core chips with and without hardware multithreading. MGSim's component library includes support for core models with different instruction sets, a configurable multi-core interconnect, multiple configurable cache and memory models, a dedicated I/O subsystem, and comprehensive monitoring and interaction facilities. The default model configuration shipped with MGSim implements Microgrids, a multi-core architecture with hardware concurrency management. MGSim is written mostly in C++ and uses object classes to represent chip components. It is optimized for architecture models that can be described as process networks.

MGSim - Simulation tools for multi-core processor architectures

Technical Report

Feb 2013

MGSim is an open source discrete event simulator for on-chip hardware components, developed at the University of Amsterdam. It is intended to be a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core and hardware multithreaded processors. It includes support for core models with different instruction sets, a configurable multi-core interconnect, multiple configurable cache and memory models, a dedicated I/O subsystem, and comprehensive monitoring and interaction facilities. The default model configuration shipped with MGSim implements Microgrids, a many-core architecture with hardware concurrency management. MGSim is furthermore written mostly in C++ and uses object classes to represent chip components. It is optimized for architecture models that can be described as process networks.

Apple-CORE: Microgrids of SVP Cores -- Flexible, General-Purpose, Fine-Grained Hardware Concurrency Management

Conference Paper

Sep 2012

To harness the potential of CMPs for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. The corresponding hardware implementation provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are intended to be used as components in distributed systems on chip that consider both clusters of small cores and optional larger cores optimized towards sequential performance as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate operations with irregular long latencies, a scale-invariant programming model, a distributed vision of the chip's structure, and the transparent performance scaling of a single program binary code across multiple cluster sizes. This paper describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip and its software environment. The reference chip parameters include 128 cores, a 4 MB on-chip distributed cache network and four DDR3-1600 memory channels. This paper presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good efficiency in terms of the utilization of hardware despite the high-latency memory accesses and good scalability across relatively large clusters of cores.

SL: a "quick and dirty" but working intermediate language for SVP systems

Article

Aug 2012

Raphael Poss

The CSA group at the University of Amsterdam has developed SVP, a framework to manage and program many-core and hardware multithreaded processors. In this article, we introduce the intermediate language SL, a common vehicle to program SVP platforms. SL is designed as an extension to the standard C language (ISO C99/C11). It includes primitive constructs to bulk create threads, bulk synchronize on termination of threads, and communicate using word-sized dataflow channels between threads. It is intended for use as a target language for higher-level parallelizing compilers. SL is a research vehicle; as of this writing, it is the only interface language to program a main SVP platform, the new Microgrid chip architecture. This article provides an overview of the language, to complement a detailed specification available separately.

One-IPC high-level simulation of microthreaded many-core architectures

Article

May 2015
INT J HIGH PERFORM C

Irfan Uddin

The accuracy of simulated cycles in high-level simulators is generally lower than in detailed simulators for single-core systems, because high-level simulators simulate the behaviour of components rather than the components themselves, as detailed simulators do. The simulation problem becomes more challenging when simulating many-core systems, where many cores are executing instructions concurrently. In these systems data may be accessed from multiple caches, and the abstraction of the instruction execution has to consider the dynamic resource sharing on the whole chip. The problem becomes even more challenging in microthreaded many-core systems, because there may exist concurrent hardware threads, which means that the latency of long-latency operations can be tolerated, reducing their effective cost from many cycles to just a few cycles. We have previously presented a simulation technique to improve the accuracy in high-level simulation of microthreaded many-core systems, known as the Signature-based high-level simulator, which adapts the throughput of the program based on the type of instructions, the number of instructions and the number of active threads in the pipeline. However, it disregards the access to different levels of the caches on the many-core system. Accessing the L1 cache has far less latency than accessing off-chip memory, and if the core is not able to tolerate latency, different levels of caches cannot be treated equally. The distributed cache network, along with the synchronization-aware coherency protocol in the Microgrid, is a complicated memory architecture, and it is difficult to simulate its behaviour at a high level. In this article we present a high-level cache model, which aims to improve the accuracy of high-level simulators for general-purpose many-core systems by adding little complexity to the simulator and without affecting the simulation speed.
