Figure 9 - uploaded by Thierry Grandpierre
Temporal representation of the memory map of RAM R1


Source publication
Conference Paper
Full-text available
This paper presents a seamless flow of transformations which performs dedicated, distributed executive generation from a high-level specification of an algorithm/architecture pair. This work is based upon graph models and graph transformations and is part of the AAA methodology. We present an original architecture model, which makes it possible to perform a...

Context in source publication

Context 1
... a diagram enables the designer to predict the memory space required to store and execute the whole application. Figure 9 is a temporal memory allocation diagram of RAM R1 of figure 8 (memory map). This information is used by our heuristic to perform off-line memory re-allocation, making it possible to safely and deterministically save memory space. ...
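The kind of saving such a temporal map enables can be pictured with a toy allocator (a minimal sketch of the general idea, not the heuristic of the paper): buffers whose lifetimes never overlap in the static schedule may be assigned the same address range.

```python
# Toy off-line allocator (illustrative only, not the paper's heuristic):
# buffers with disjoint lifetimes may reuse the same address range in RAM.
from typing import NamedTuple

class Buffer(NamedTuple):
    name: str
    size: int    # bytes
    start: int   # first schedule step where the data is alive
    end: int     # last schedule step where the data is alive

def lifetimes_overlap(a: Buffer, b: Buffer) -> bool:
    """Two buffers conflict only if their [start, end] intervals intersect."""
    return a.start <= b.end and b.start <= a.end

def first_fit_allocate(buffers):
    """Place each buffer at the lowest offset that clashes with no
    already-placed buffer that is simultaneously alive."""
    placements = {}  # name -> offset
    for buf in sorted(buffers, key=lambda b: b.start):
        offset, done = 0, False
        while not done:
            done = True
            for other in buffers:
                if other.name not in placements or not lifetimes_overlap(buf, other):
                    continue
                o_off = placements[other.name]
                if offset < o_off + other.size and o_off < offset + buf.size:
                    offset = o_off + other.size  # skip past the conflicting range
                    done = False
                    break
        placements[buf.name] = offset
    return placements

buffers = [Buffer("A", 256, 0, 3), Buffer("B", 128, 2, 5), Buffer("C", 256, 4, 7)]
print(first_fit_allocate(buffers))  # {'A': 0, 'B': 256, 'C': 0} -- C reuses A's range
```

Because all lifetimes are known from the static schedule, a decision of this kind can be made off-line and deterministically, which is what makes the memory saving safe.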

Citations

... Grandpierre and Sorel [35] presented a prototyping methodology and a software tool (SynDEx) that aims to optimize implementations of real-time image and signal processing onto heterogeneous multiprocessor architectures. This methodology has been extended to integrated circuits [36,37] and to VHDL code generation [38], and it has been the subject of improvements in terms of power consumption [39]. ...
... Additionally, a set of parameter dependencies defines the hierarchy of the parameters and the influence of a given parameter on its child and parent parameters. Inherited from IBSDF and also taken from the Algorithm Architecture Matching methodology [35,85], the PiSDF model includes the concept of special source and sink nodes that interact with the outside world. These special nodes define the inputs and outputs of the system, making the model easier to instantiate in any design [64]. ...
... The Algorithm Architecture Matching methodology (AAM) [35,85] covers all the steps of rapid prototyping in a seamless flow of graph transformations, from algorithm and architecture specification to the automatic generation of distributed executives [86]. An algorithm (application) is represented by a Data Dependence Graph. ...
Thesis
Coarse-Grained Reconfigurable Architectures (CGRA) are designed to deliver high performance while drastically reducing the latency of the computing system. There are several types of CGRA, differing in structure, application domain, type of resources, and memory infrastructure. We focus our work on a subset of CGRA designs that we call Software Programmable Streaming Coarse-Grained Reconfigurable Architectures (SPS-CGRA). An SPS-CGRA is a more or less complex array of coarse-grained heterogeneous hardware resources with a coarser granularity than classical CGRAs. An SPS-CGRA can perform spatial and temporal computations at low latency. Its stream-based processing provides high performance while maintaining a level of flexibility. Although they are often highly domain-specifically optimized, SPS-CGRAs keep several levels of custom post-fabrication programmability, given by a set of parameters, so that they can be reused. However, their reuse is generally limited by the complexity of identifying the best allocation of the processing tasks onto the hardware resources. Another limiting point is the complexity of producing a reliable performance analysis for each new implementation, since no mature tool exists. To solve these problems, we propose a complete mapping and scheduling framework that targets SPS-CGRA. We introduce a generic hardware model allowing one to express these intrinsically custom levels of flexibility without neglecting data access and system configuration control. We also propose a performance estimation analysis based on resource latency descriptions, allowing an upper bound of the computing cost to be obtained. Finally, we present four different solutions for the mapping and scheduling problem: a list-based algorithm with backtracking, a lookahead-based heuristic, a Bayesian-based heuristic, and a Q-learning mapping algorithm. We evaluate and compare our solutions against an exhaustive approach on a real-life example and illustrate the benefits and efficiency of the proposed framework.
... In the co-design context, where designed systems have both hardware and software parts, developer inputs typically comprise a model of the application, a model of the targeted architecture, and a set of constraints for the deployment of the application on the architecture. As presented in the Algorithm-Architecture Adequation (AAA) methodology [GS03], the separation between the three inputs of the design flow ensures their independence, which eases the deployment of an application on several architectures or the use of a single architecture to deploy several applications. ...
Thesis
The complexity of MPSoC (Multiprocessor System-on-Chip) architectures is growing exponentially to meet the ever-increasing demand for computing power of DSP (Digital Signal Processing) applications. Modern MPSoC architectures, such as multi-core architectures, already incorporate hundreds of processing elements (PEs) on a single chip and are expected to integrate up to a thousand PEs in the near future. Consequently, programming modern MPSoC architectures with traditional thread-based programming languages has become increasingly complex. In this context, dataflow Models of Computation (MoCs) have become popular programming paradigms, offering both strong analyzability and an intuitive, task-graph-based expression of the parallelism of a DSP application. In this thesis, we propose new techniques for evaluating the maximum throughput and minimum latency of the IBSDF (Interface-Based Synchronous Dataflow) model, targeting MPSoC architectures with unlimited resources. The IBSDF model is a hierarchical model with static behavior that allows certain performance metrics to be evaluated at design time. However, classical evaluation methods flatten the hierarchy of the IBSDF model into a non-hierarchical dataflow graph with an exponential number of tasks, making its evaluation difficult or even impossible. The new techniques we propose evaluate the performance of IBSDF graphs in a modular way, without flattening their hierarchy. As a result, we were able to evaluate very large IBSDF graphs in a few seconds, whereas the classical approach fails to produce a result.
... Then, the implementation is formalized as a set of graph transformations that distribute and schedule the application on the platform. Several heuristics are proposed to explore solutions that optimize the response time and resource allocation while satisfying real-time constraints [103,54]. ...
Thesis
The increasing complexity of embedded applications in modern cars has increased the need for computing power. To meet this need, the European automotive standard AUTOSAR has introduced the use of multi-core platforms. However, using multi-core platforms for critical automotive applications raises several issues. In particular, it is necessary to respect the functional specification and to deterministically guarantee the data exchanges between cores. In this thesis, we consider multi-periodic systems specified and validated with MATLAB/Simulink, and we developed a framework to deploy MATLAB/Simulink applications on AUTOSAR multi-core platforms. This framework guarantees functional and temporal determinism and exploits parallelism. Our contribution is threefold. First, we identify the communication mechanisms in MATLAB/Simulink. Then, we prove that the dataflow in a multi-periodic MATLAB/Simulink system is modeled by an SDFG. The SDFG formalism is an excellent analysis tool for exploiting parallelism; it is very popular in the literature and widely studied for the deployment of dataflow applications on multi/many-core platforms. Then, we develop methods to realize the dataflow expressed by the SDFG under preemptive real-time scheduling. These methods use theoretical results on SDFGs to guarantee deterministic precedence constraints without using blocking synchronization mechanisms. As such, both functional and temporal determinism are guaranteed. Finally, we characterize the impact of dataflow requirements on tasks and propose a partitioning technique that minimizes this impact. We show that this technique promotes the construction of a partitioning and a feasible scheduling when it is used to initiate multi-objective search and optimization algorithms, thereby reducing the number of design iterations and shortening the design time.
... For this reason, we opted for a methodology that includes an optimization phase and allows the user to specify the algorithm and explore several possible implementations for different FPGA targets. The AAA methodology [13] is an effective means of rapid prototyping dedicated to optimizing implementations of real-time signal and image processing applications onto heterogeneous multiprocessor architectures. The extension of the AAA methodology to circuits targets the design of real-time applications on integrated circuits [14,15]. ...
... The algorithm to implement is first specified as a Factorized and Conditioned Data Dependence Graph (FCDDG) [13]. The FCDDG is an extension of the classical direct data dependence model with dedicated nodes that specify repetition patterns (like the ''for''/''while'' operators of classical imperative languages) and conditioning (''if…then…else''). ...
... The FCDDG is an extension of the classical direct data dependence model with dedicated nodes that specify repetition patterns (like the ''for''/''while'' operators of classical imperative languages) and conditioning (''if…then…else''). Factorization is the graph-model extension that replaces a repeated sub-graph of connected operations with a single instance, marking each edge crossing the pattern frontier with dedicated factorization nodes in order to reduce the size of the specification [13]. Its boundary, named the Factorization Frontier (FF), is represented by a dotted line. ...
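The size reduction brought by factorization can be pictured with a minimal sketch (node names and structure are invented for illustration; this is not the FCDDG implementation): a repeated operation is kept only once, and the repetition factor is carried by boundary nodes on the factorization frontier.

```python
# Minimal illustration of the factorization idea (invented names, not the FCDDG model):
# N repetitions of the same operation collapse into one instance between frontier
# nodes that carry the repetition factor.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    kind: str        # "op", "fork" (frontier in), "join" (frontier out)
    repeat: int = 1  # repetition factor carried on the factorization frontier

def flat_spec(n):
    """Unfactorized specification: one node per repetition."""
    return [Node(f"mac_{i}", "op") for i in range(n)]

def factorized_spec(n):
    """Factorized specification: a single op instance inside the frontier."""
    return [Node("F", "fork", repeat=n),
            Node("mac", "op", repeat=n),
            Node("J", "join", repeat=n)]

print(len(flat_spec(1024)), "nodes unfactorized vs", len(factorized_spec(1024)), "nodes factorized")
# 1024 nodes unfactorized vs 3 nodes factorized
```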
Article
Full-text available
The development of hardware platforms for artificial neural networks (ANN) has been hindered by the high consumption of power and hardware resources. In this paper, we present a methodology for the optimized implementation of an ANN of the learning vector quantization (LVQ) type on a field-programmable gate array (FPGA) device. The aim was to provide an intelligent embedded system for real-time vigilance state classification of a subject from an analysis of the electroencephalogram signal. The approach consists of applying an extension of the algorithm architecture adequacy (AAA) methodology with an arithmetic accuracy constraint, allowing the optimized implementation of the LVQ on the FPGA. This extension improves the optimization phase of the AAA methodology by taking into account the word length required by each operation and creating groups of operations with approximately equal word lengths, where the operations in the same group are performed by the same operator. This LVQ implementation yields considerable gains in circuit resources, power, and maximum frequency while respecting the timing and accuracy constraints. To validate our approach, the LVQ implementation has been tested for several network topologies on two Virtex devices. The accuracy–success rate relation has been studied and reported.
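A rough sketch of the grouping step described above (the tolerance parameter and all names are assumptions, not the published algorithm): operations with close word-length requirements are clustered so that each group can be served by one operator sized for its widest member.

```python
# Hedged sketch of word-length grouping (tolerance and names are assumptions):
# cluster operations whose required word lengths are close, so one operator
# sized for the widest member can serve the whole group.
def group_by_wordlength(required_bits, tolerance=2):
    """required_bits: dict op_name -> minimum word length in bits.
    Returns a list of (operator_width, [op_names]) groups."""
    groups = []
    for op, bits in sorted(required_bits.items(), key=lambda kv: kv[1]):
        if groups and bits - groups[-1][0] <= tolerance:
            width, ops = groups[-1]
            groups[-1] = (max(width, bits), ops + [op])  # widen the shared operator
        else:
            groups.append((bits, [op]))                  # start a new group
    return groups

ops = {"sub1": 9, "sub2": 10, "mul1": 16, "mul2": 18, "acc": 24}
print(group_by_wordlength(ops))
# [(10, ['sub1', 'sub2']), (18, ['mul1', 'mul2']), (24, ['acc'])]
```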
... In the 1990s, models of parallel computation, such as the ones overviewed by Maggs et al. in [1], were designed to comprehensively represent a system including hardware- and software-related features. Since the early 2000s, rapid prototyping initiatives like the Algorithm-Architecture Matching (AAA) methodology [2] have fostered the separation of algorithm and architecture models in order to automate Design Space Exploration (DSE). Separation of concerns plays a major part in mitigating the design complexity of systems. ...
Article
Current trends in high performance and embedded computing include design of increasingly complex hardware architectures with high parallelism, heterogeneous processing elements and non-uniform communication resources. In order to take hardware and software design decisions, early evaluations of the system non-functional properties are needed. These evaluations of system efficiency require Electronic System-Level (ESL) information on both the algorithms and the architecture. Contrary to algorithm models for which a major body of work has been conducted on defining formal Models of Computation (MoCs), architecture models from the literature are mostly empirical models from which reproducible experimentation requires the accompanying software. In this paper, a precise definition of a Model of Architecture (MoA) is proposed that focuses on reproducibility and abstraction and removes the overlap previously existing between the notions of MoA and MoC. A first MoA, called the Linear System-Level Architecture Model (LSLA), is presented. To demonstrate the generic nature of the proposed new architecture modeling concepts, we show that the LSLA Model can be integrated flexibly with different MoCs. LSLA is then used to model the energy consumption of a State-of-the-Art Multiprocessor System-on-Chip (MPSoC) when running an application described using the Synchronous Dataflow (SDF) MoC. A method to automatically learn LSLA model parameters from platform measurements is introduced. Despite the high complexity of the underlying hardware and software, a simple LSLA model is demonstrated to estimate the energy consumption of the MPSoC with a fidelity of 86%.
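The abstract above mentions automatically learning the parameters of a linear model from platform measurements. A minimal way to do this, assuming a purely linear cost and using invented measurement data (this is an illustration, not the method of the article), is an ordinary least-squares fit:

```python
# Least-squares fit of a linear cost model from measurements
# (the activity features and energy values below are invented for illustration).
import numpy as np

# Each row: [tokens processed, bytes transferred, 1 (constant/static term)]
activity = np.array([
    [1000,  2048, 1],
    [2000,  2048, 1],
    [2000,  8192, 1],
    [4000, 16384, 1],
], dtype=float)

# Measured energy per run in millijoules (e.g. from on-board power sensors).
energy = np.array([12.1, 20.3, 23.5, 42.0])

# Solve for the per-token, per-byte and static cost coefficients.
coeffs, *_ = np.linalg.lstsq(activity, energy, rcond=None)
print("per-token %.4f mJ, per-byte %.6f mJ, static %.2f mJ" % tuple(coeffs))
```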
... In this context, significant research has been done to improve schedule quality. The SynDEx approach [6], for example, is system-level CAD software that takes as input a model of the application associated with a model of the architecture. It executes a list-scheduling algorithm that orders tasks and maps the algorithm onto the multiprocessor architecture. ...
... Further, the authors presented the SynDEx approach in [6]. SynDEx is system-level CAD software. ...
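As a rough illustration of the list-scheduling idea mentioned in these excerpts (a generic sketch that ignores inter-processor communication costs, not the actual SynDEx heuristic): tasks are visited in priority order, and each one is mapped to the processor that lets it finish earliest.

```python
# Generic list-scheduling sketch (not the SynDEx heuristic; communication
# costs are ignored): each task, taken in priority order, is mapped to the
# processor on which it finishes earliest.
def list_schedule(tasks, deps, durations, processors):
    """tasks: names in priority order (e.g. a topological order of the graph).
    deps: task -> list of predecessor tasks.
    durations: (task, processor) -> execution time.
    Returns task -> (processor, start, finish)."""
    proc_free = {p: 0 for p in processors}   # time at which each processor is idle again
    placed = {}
    for t in tasks:
        best = None
        for p in processors:
            ready = max((placed[d][2] for d in deps.get(t, [])), default=0)
            start = max(ready, proc_free[p])
            finish = start + durations[(t, p)]
            if best is None or finish < best[0]:
                best = (finish, p, start)
        finish, p, start = best
        placed[t] = (p, start, finish)
        proc_free[p] = finish
    return placed

tasks = ["in", "fir", "fft", "out"]
deps = {"fir": ["in"], "fft": ["in"], "out": ["fir", "fft"]}
durations = {("in", "P1"): 1, ("in", "P2"): 2, ("fir", "P1"): 4, ("fir", "P2"): 3,
             ("fft", "P1"): 5, ("fft", "P2"): 5, ("out", "P1"): 1, ("out", "P2"): 1}
print(list_schedule(tasks, deps, durations, ["P1", "P2"]))
# {'in': ('P1', 0, 1), 'fir': ('P2', 1, 4), 'fft': ('P1', 1, 6), 'out': ('P1', 6, 7)}
```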
... In the 1990s, models of parallel computation, such as the ones overviewed by Maggs et al. in [1], were designed to represent a global system including hardware- and software-related features. Since the early 2000s, rapid prototyping initiatives such as the Algorithm-Architecture Matching (AAA) methodology [2] have fostered the separation of algorithm and architecture models in order to automate design space exploration. ...
... In multi-core rapid prototyping tools, each community uses a different custom architecture model, often associated with a specific syntax and specific requirements. For example, custom models for multicore scheduling have been proposed, such as the High-level Virtual Platform (HVP) [33], the SynDEx architecture model [2], and the System-Level Architecture Model (S-LAM) [38]. These models are time-oriented models for automated multicore scheduling. ...
Technical Report
Full-text available
The current trend in high performance and embedded computing consists of designing increasingly complex heterogeneous hardware architectures with non-uniform communication resources. In order to take hardware and software design decisions, early evaluations of the system non-functional properties are needed. These evaluations of system efficiency require high-level information on both the algorithms and the architecture. In state-of-the-art Model-Driven Engineering (MDE) methods, different communities have developed custom architecture models associated with languages of substantial complexity. This fact contrasts with Models of Computation (MoCs), which provide abstract representations of an algorithm's behavior as well as tool interoperability. In this report, we define the notion of Model of Architecture (MoA) and study the combination of a MoC and an MoA to provide a design space exploration environment for the study of algorithmic and architectural choices. An MoA provides reproducible cost computation for evaluating the efficiency of a system. A new MoA called the Linear System-Level Architecture Model (LSLA) is introduced and compared to state-of-the-art models. LSLA aims at representing hardware efficiency with a linear model. The computed cost results from the mapping of an application, represented by a model conforming to a MoC, onto an architecture represented by a model conforming to an MoA. The cost is composed of a processing-related part and a communication-related part. It is an abstract scalar value to be minimized and can represent any non-functional requirement of a system such as memory, energy, throughput, or latency.
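A toy cost evaluation in the spirit of what is described here (the constants, routing table, and additive combination rule are assumptions, not the LSLA definition): the abstract cost of a mapping is the sum of a processing part and a communication part, both linear in the activity.

```python
# Toy mapping cost with a processing part and a communication part, both linear
# in the activity (constants and routing are invented; this is not the LSLA model).
def mapping_cost(mapping, token_counts, pe_unit_cost, link_unit_cost, routes):
    """mapping: actor -> processing element (PE)
    token_counts: (producer, consumer) -> tokens exchanged per iteration
    pe_unit_cost: PE -> cost of processing one token
    link_unit_cost: link -> cost of moving one token over that link
    routes: (PE, PE) -> links traversed between the two PEs ([] if local)."""
    processing = sum(pe_unit_cost[mapping[prod]] * n
                     for (prod, _), n in token_counts.items())
    communication = sum(link_unit_cost[link] * n
                        for (prod, cons), n in token_counts.items()
                        for link in routes[(mapping[prod], mapping[cons])])
    return processing + communication

pe_unit_cost = {"CPU": 1.0, "DSP": 0.4}
link_unit_cost = {"bus": 0.2}
routes = {("CPU", "CPU"): [], ("DSP", "DSP"): [],
          ("CPU", "DSP"): ["bus"], ("DSP", "CPU"): ["bus"]}
tokens = {("src", "fir"): 128, ("fir", "snk"): 128}
print(mapping_cost({"src": "CPU", "fir": "DSP", "snk": "CPU"},
                   tokens, pe_unit_cost, link_unit_cost, routes))  # ≈ 230.4
```

Comparing such scalar costs across candidate mappings is what allows a design space exploration tool to rank implementation choices without executing them.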
... Preesm [134,135] is a rapid-prototyping framework for static dataflow applications that has been inspired by the Algorithm Architecture Matching methodology (AAM, also sometimes called AAA) [136]. Preesm makes use of a parameterized and interfaced dataflow meta-model (PiMM) [137] representation of the application, together with a System-Level Architecture Model (S-LAM) for the high-level architecture description. ...
... SynDEx [145] is graphical and interactive software implementing the Algorithm Architecture Matching methodology (AAM, also sometimes called AAA) [136]. Within this environment, the designer defines an algorithm graph, an architecture graph, and system constraints. ...
Article
Full-text available
All computing platforms, from mobile devices to supercomputers, are becoming more and more heterogeneous and massively parallel. While they can provide higher power efficiency and computation throughput, effective and confident use of these systems always requires knowledge of low-level programming. The average time necessary to develop and optimize a design on heterogeneous platforms is ever higher compared to typical homogeneous systems. Dataflow models of computation (MoCs) are quickly becoming common practice in heterogeneous systems development. In domains such as signal processing and multimedia communication, dataflow MoCs have become accepted as a standard. However, the shift from a sequential and architecture-specific MoC to a dataflow MoC still uncovers several programming and development challenges. The Cal Actor Language (CAL) is a recently specified dataflow and actor-based language capable of concisely expressing complex and general-purpose parallel applications. However, design tools supporting this language are generally not adequate to fully exploit its features and expressive power. In fact, they generally restrict its MoC in order to reduce the design space exploration (DSE) effort. The objective of this thesis is to provide a DSE methodology where all the features of CAL and dynamic dataflow MoCs can be exploited in a more general and effective manner. This dissertation illustrates a novel profiling, analysis, and performance estimation methodology for the DSE of dynamic dataflow programs. The main research contributions of this thesis are: the formalization of a graph-based representation of the program execution called an execution trace graph (ETG); the formalization of a systematic methodology for profiling generic dynamic dataflow programs through their code interpretation; and the formalization of a complete DSE methodology for dynamic dataflow programs in order to efficiently identify close-to-optimal design points according to various tailored performance merit functions. In particular, the following design space optimization problems for dynamic dataflow programs are addressed: the analysis of the hotspots and algorithmic bottlenecks of a parallel program; the bounding and optimization of the buffer size configuration for complex designs; and the minimization of the dynamic power dissipation of programs implemented on multi-clock-domain architectures. Furthermore, theoretical concepts such as the design space critical path and the potential speedup of a dataflow application have been defined and revisited, respectively. The thesis also presents a DSE framework developed to demonstrate the effectiveness of this design methodology.
... • Some systems with a mixed event-driven/time-driven execution model, such as those synthesized by SynDEx [Grandpierre and Sorel, 2003]. ...
... The optimal scheduling of such specifications onto platforms with multiple, heterogeneous execution and communication resources (distributed, parallel, multicore) is NP-hard regardless of the optimality criterion (throughput, makespan, etc.) [Garey and Johnson, 1979]. Existing scheduling techniques and tools [Zheng et al., 2005, Grandpierre and Sorel, 2003, Potop-Butucaru et al., 2010, Eles et al., 2000] heuristically solve the simpler problem of synthesizing a scheduling table of minimal length which implements one generic cycle of the embedded control algorithm. The algorithm briefly presented in Section 4.2.2.3 follows this trend: it maps a CG specification to an architecture while trying to minimize the length of the obtained scheduling table. ...
... To optimize makespan, the first phase of our approach employs existing scheduling techniques that were specifically designed for this purpose [Zheng et al., 2005, Grandpierre and Sorel, 2003, Potop-Butucaru et al., 2010, Eles et al., 2000], such as the one used to create the scheduling table of Section 4.2.2.3. But the contribution described in this chapter concerns the second phase of our flow, which takes the scheduling table computed in phase 1 and optimizes its throughput while keeping its makespan unchanged. ...
Article
Full-text available
There is a long-standing separation between the fields of compiler construction and real-time scheduling. While both fields have the same objective, the construction of correct implementations, the separation was historically justified by significant differences in the models and methods that were used. Nevertheless, with the ongoing increase in the complexity of applications and of the hardware of the execution platforms, the objects and problems studied in these two fields now largely overlap. In this thesis, we focus on automatic code generation for embedded control systems with complex constraints, including hard real-time requirements. To this end, we advocate the need for a reconciled research effort between the compilation and real-time systems communities. By adapting a technique usually used in compilers (software pipelining) to the system-level problem of multiprocessor scheduling of hard real-time applications, we shed light on the difficulties of this unified research effort, but also show how it can lead to real advances. Indeed, we explain how adapting techniques to the optimization of new objectives, in a different context, allows us to develop systems of better quality more easily than was possible until now. In this adaptation process, we propose to use synchronous formalisms and languages as a common formal ground. These can naturally be seen as extensions of classical models coming from both real-time scheduling (dependent task graphs) and compilation (static single assignment and data dependency graphs), but they also provide powerful techniques for manipulating complex control structures. We implemented our results in the LoPhT compiler.
... and a set of constraints for the deployment of the application on the architecture (Section 3.2.3). As presented in the Algorithm-Architecture Adequation (AAA) methodology [GS03], the separation between the three inputs of the design flow ensures the independence between them, which eases the deployment of an application on several architectures or the use of a single architecture to deploy several applications. ...
... The creation of the Preesm rapid prototyping framework has been inspired by the Algorithm-Architecture Adequation (AAA) methodology [GS03]. AAA consists of simultaneously searching the best software and hardware configurations for respecting system constraints. ...
Article
The development of embedded Digital Signal Processing (DSP) applications for Multiprocessor Systems-on-Chips (MPSoCs) is a complex task requiring the consideration of many constraints, including real-time requirements, power consumption restrictions, and limited hardware resources. To satisfy these constraints, it is critical to understand the general characteristics of a given application: its behavior and its requirements in terms of MPSoC resources. In particular, the memory requirements of an application strongly impact the quality and performance of an embedded system, as the silicon area occupied by the memory can be as large as 80% of a chip and may be responsible for a major part of its power consumption. Despite this large overhead, limited memory resources remain an important constraint that considerably increases the development time of embedded systems. Dataflow Models of Computation (MoCs) are widely used for the specification, analysis, and optimization of DSP applications. The popularity of dataflow MoCs is due to their great analyzability and their natural expression of the parallelism of a DSP application. The abstraction of time in dataflow MoCs is particularly suitable for exploiting the parallelism offered by heterogeneous MPSoCs. In this thesis, we propose a complete method to study the important memory characteristics of a DSP application modeled with a dataflow graph. The proposed method spans from the theoretical, architecture-independent memory characterization of an application to its quasi-optimal static memory allocation on a real shared-memory MPSoC. The proposed method, implemented as part of a rapid prototyping framework, is extensively tested on a set of state-of-the-art applications from the computer vision, telecommunication, and multimedia domains. Then, because the dataflow MoC used in our method cannot model applications with a dynamic behavior, we introduce a new dataflow meta-model to address the important challenge of managing dynamics in DSP-oriented representations. The new reconfigurable and composable dataflow meta-model strengthens the predictability, conciseness, and readability of application descriptions.
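One simple, architecture-independent quantity of the kind discussed above can be sketched as follows (a generic peak-liveness bound, not the characterization developed in the thesis): at any point of a schedule, memory must hold at least every buffer that is live at that point, so the peak of that sum is a lower bound on the memory needed.

```python
# Generic peak-liveness bound (illustrative, not the thesis' characterization):
# memory must hold at least the sum of sizes of all simultaneously-live buffers.
def peak_live_memory(buffers):
    """buffers: iterable of (size, start, end) lifetimes.
    Returns the peak of the summed sizes of simultaneously-live buffers."""
    events = []
    for size, start, end in buffers:
        events.append((start, size))      # buffer becomes live
        events.append((end + 1, -size))   # buffer dies right after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Same toy lifetimes as in the allocation sketch near the figure: the peak is
# 384 bytes, even though the three buffers total 640 bytes.
print(peak_live_memory([(256, 0, 3), (128, 2, 5), (256, 4, 7)]))  # 384
```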