Figure 3 - uploaded by Raphael Poss
Content may be subject to copyright.
Core micro-architecture.

Core micro-architecture.

Source publication
Conference Paper
Full-text available
The EU Apple-CORE project has explored the design and implementation of novel general-purpose many-core chips featuring hardware microthreading and hardware support for concurrency management. The introduction of the latter in the cores ISA has required simultaneous investigation into compilers and multiple layers of the software stack, including o...

Citations

... The SL system for the Microgrid supports a range of OS services with familiar encapsulation making migration from existing systems relatively simple. MGSim is the cycle-accurate simulation framework [9]-[11] and HLSim is the high-level simulation framework [12] [13] of the Microgrid. HLSim is developed to make quick and reasonably accurate design decisions in the evaluation of the microthreaded architecture using multiple runs of benchmarks which may comprise the execution of billions of instructions. ...
Article
Full-text available
Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with less complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that changes in the design require less effort in the implementation in order to perform experiments and see the effects of changes in the design. These simulators are termed high-level simu-lators in the context of computer architecture. In this paper, we present multiple levels of abstractions in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy in simulation without significantly affecting the complexity and simulation speed.
... Every core in the Microgrid supports fine-grained multi-threading and implements the concurrency management of the model in its instruction set (ISA). There exists a cycleaccurate simulator for the Microgrid, named as MGSim [20], [26], [19], [25], [20]. It simulates all the low-level details of the architecture and therefore becomes very slow in the execution of large applications. ...
Article
Full-text available
Computer architects are always interested in analyzing the complex interactions amongst the dynamically allocated resources. Generally a detailed simulator with a cycle-accurate simulation of the execution time is used. However, the cycle-accurate simulator can execute at the rate of 100K instructions per second, divided over the number of simulated cores. This means that the evaluation of a complex application with complex concurrency interactions on contemporary multi-core machine can be very slow. To perform efficient design space exploration we present a co-simulation environment, where the detailed execution of concurrency instructions in the pipeline of microthreaded cores and the interactions amongst the hardware components are abstracted. We present the evaluation of the high-level simulation framework against the cycle-accurate simulation framework. The results show that high-level simulator is faster and less complicated than cycle-accurate simulator and has reasonable accuracy.
... section 2). We have developed a cycle-accurate simulator for the Microgrid, named as MGSim [23,28,21,27]. However, MGSim simulates all the low-level details of the architecture and therefore becomes very slow in the execution of large applications. ...
Article
The accuracy of simulated cycles in high-level simulators is generally less than the accuracy in detailed simulators for a single-core systems, because high-level simulators simulate the behavior of components rather than the components themselves as in detailed simulators. The simulation problem becomes more challenging when simulating many-core systems, where many cores are executing instructions concurrently. In these systems data may be accessed from multiple caches and the abstraction of the instruction execution has to consider the dynamic resource sharing on the whole chip. The problem becomes even more challenging in microthreaded many-core systems, because there may exist concurrent hardware threads. Which means that the latency of long latency operations can be tolerated from many cycles to just few cycles. We have previously presented a simulation technique to improve the accuracy in high-level simulation of microthreaded many-core systems, known as Signature-based high- level simulator, which adapts the throughput of the program based on the type of instructions, number of instructions and number of active threads in the pipeline. However, it disregards the access to different levels of the caches on the many-core system. Accessing L1-cache has far less latency than accessing off-chip memory and if the core is not able to tolerate latency, different levels of caches can not be treated equally. The distributed cache network along with the synchronization-aware coherency protocol in the Microgrid is a complicated memory architecture and it is difficult to simulate its behavior at a high-level. In this article we present a high-level cache model, which aims to improve the accuracy in high-level simulators for general-purpose many-core systems by adding little complexity to the simulator and without affecting the simulation speed.
... precompiled device drivers) could be accessed remotely through the on-chip network (NoC) via lightweight remote procedure calls. The corresponding system architecture has been published previously [28]; it is inspired from the heterogeneous OS integration in the Cray XMT. The software produced in the Apple-CORE project is summarized in Figure 1. ...
... In both platforms, the Microgrid hardware shares the NoC with a small companion core able to run OS services that cannot be implemented directly on the Microgrid cores, as discussed in [28]. Virtual memory is implemented using a MMU shared by all cores, so that all threads appear to run in the same logical address space. ...
Article
To harness the potential of CMPs for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. Its implementation in hardware provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are components in distributed systems on chip that consider both clusters of small cores and optional, larger sequential cores as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate irregular long latencies on chip, a scale-invariant programming model, a distributed chip resource model, and the transparent performance scaling of a single program binary code across multiple cluster sizes. This article describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip and its software environment. This article also presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good efficiency in terms of the utilisation of hardware despite the high-latency memory accesses and good scalability across relatively large clusters of cores.
... The instruction set of the I/O core supports the I/O infrastructure and are connected to the delegation network of the Microgrid. We do not simulate I/O cores in the current implementation of HLSim, therefore we refer readers to [16,21,31] for the I/O management in the Microgrid. ...
Technical Report
Traditional processors use the von Neumann execution model, some other processors in the past have used the dataflow execution model. A combination of von Neuman model and dataflow model is also tried in the past and the resultant model is referred as hybrid dataflow execution model. We describe a hybrid dataflow model known as the microthreading. It provides constructs for creation, synchronization and communication between threads in an intermediate language. The microthreading model is an abstract programming and machine model for many-core architecture. A particular instance of this model is named as the microthreaded architecture or the Microgrid. This architecture implements all the concurrency constructs of the microthreading model in the hardware with the management of these constructs in the hardware.
... 2, 6 and 7. The origin of this specific configuration and its relationship with the FPGA UTLEON3 implementation was described above in section III and previously published in [3], [20]. ...
Conference Paper
Full-text available
This article presents MGSim, an open source discrete event simulator for on-chip hardware components developed at the University of Amsterdam. MGSim is used as research and teaching vehicle to study the fine-grained hardware/software interactions on many-core chips with and without hardware multithreading. MGSim's component library includes support for core models with different instruction sets, a configurable multi-core interconnect, multiple configurable cache and memory models, a dedicated I/O subsystem, and comprehensive monitoring and interaction facilities. The default model configuration shipped with MGSim implements Microgrids, a multi-core architecture with hardware concurrency management. MGSim is furthermore written mostly in C++ and uses object classes to represent chip components. It is optimized for architecture models that can be described as process networks.
... This model separates I/O from memory operations to separate (logical) networks; this choice was made to enable separate study of memory-related and I/O-related traffic in the initial applications of MGSim. The overall design of the I/O system and its motivations have been published in [14]. ...
Technical Report
Full-text available
MGSim is an open source discrete event simulator for on-chip hardware components, developed at the University of Amsterdam. It is intended to be a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core and hardware multithreaded processors. It includes support for core models with different instruction sets, a configurable multi-core interconnect, multiple configurable cache and memory models, a dedicated I/O subsystem, and comprehensive monitoring and interaction facilities. The default model configuration shipped with MGSim implements Microgrids, a many-core architecture with hardware concurrency management. MGSim is furthermore written mostly in C++ and uses object classes to represent chip components. It is optimized for architecture models that can be described as process networks.
... The corresponding distributed cache network has 32 L2 caches of 128KiB each, connected in 4 rings themselves connected to a single top-level ring with 4 DDR3-1600 memory channels. The cluster shares the NoC with a small companion core able to run OS services that cannot be implemented directly on the Microgrid cores, as discussed in [30]. Virtual memory is implemented using a MMU shared by all cores, so that all threads appear to run in the same logical address space. ...
Conference Paper
Full-text available
To harness the potential of CMPs for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. The corresponding hardware implementation provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are intended to be used as components in distributed systems on chip that consider both clusters of small cores and optional larger cores optimized towards sequential performance as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate operations with irregular long latencies, a scale-invariant programming model, a distributed vision of the chip's structure, and the transparent performance scaling of a single program binary code across multiple cluster sizes. This paper describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip and its software environment. The reference chip parameters include 128 cores, a 4 MB on-chip distributed cache network and four DDR3-1600 memory channels. This paper presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good efficiency in terms of the utilization of hardware despite the high-latency memory accesses and good scalability across relatively large clusters of cores.
... SL is currently used as the intermediate representation for most Microgridrelated research at the University of Amsterdam and technology partners. Prior to the publication of[18], multiple published works have pretended to use µTC for evaluation whereas their results were actually based on the SL technology[3,14,13,26,23,21,28]; this substitution was judged acceptable by their authors due to the resemblance between the two languages. SL has also been targeted successfully by the SAC2C compiler for the array functional language SingleAssignment C[12,27], as well as a parallelizing C compiler[25,26]. ...
Article
Full-text available
The CSA group at the University of Amsterdam has developed SVP, a framework to manage and program many-core and hardware multithreaded processors. In this article, we introduce the intermediate language SL, a common vehicle to program SVP platforms. SL is designed as an extension to the standard C language (ISO C99/C11). It includes primitive constructs to bulk create threads, bulk synchronize on termination of threads, and communicate using word-sized dataflow channels between threads. It is intended for use as target language for higher-level parallelizing compilers. SL is a research vehicle; as of this writing, it is the only interface language to program a main SVP platform, the new Microgrid chip architecture. This article provides an overview of the language, to complement a detailed specification available separately.
Article
Abstract The accuracy of simulated cycles in high-level simulators is generally less than the accuracy in detailed simulators for a single-core systems, because high-level simulators simulate the behaviour of components rather than the components themselves as in detailed simulators. The simulation problem becomes more challenging when simulating many-core systems, where many cores are executing instructions concurrently. In these systems data may be accessed from multiple caches and the abstraction of the instruction execution has to consider the dynamic resource sharing on the whole chip. The problem becomes even more challenging in microthreaded many-core systems, because there may exist concurrent hardware threads. Which means that the latency of long latency operations can be tolerated from many cycles to just few cycles. We have previously presented a simulation technique to improve the accuracy in high-level simulation of microthreaded many-core systems, known as Signature-based high- level simulator, which adapts the throughput of the program based on the type of instructions, number of instructions and number of active threads in the pipeline. However, it disregards the access to different levels of the caches on the many-core system. Accessing L1-cache has far less latency than accessing off-chip memory and if the core is not able to tolerate latency, different levels of caches can not be treated equally. The distributed cache network along with the synchronization-aware coherency protocol in the Microgrid is a complicated memory architecture and it is difficult to simulate its behaviour at a high-level. In this article we present a high-level cache model, which aims to improve the accuracy in high-level simulators for general-purpose many-core systems by adding little complexity to the simulator and without affecting the simulation speed.