Figure 9 - uploaded by Thierry Grandpierre
Temporal representation of the memory map of RAM R1


Source publication
Conference Paper
Full-text available
This paper presents a seamless flow of transformations which performs dedicated, distributed executive generation from a high-level specification of an algorithm/architecture pair. This work is based upon graph models and graph transformations and is part of the AAA methodology. We present an original architecture model, which makes it possible to perform a...

Context in source publication

Context 1
... a diagram enables the designer to predict the memory space required to store and execute the whole application. Figure 9 is a temporal memory allocation diagram of RAM R1 of figure 8 (memory map). This information is used by our heuristic to perform off-line memory re-allocation, making it possible to safely and deterministically save memory space. ...
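The kind of saving such a temporal map enables can be pictured with a toy allocator (a minimal sketch of the general idea, not the heuristic of the paper): buffers whose lifetimes never overlap in the static schedule may be assigned the same address range.

```python
# Toy off-line allocator (illustrative only, not the paper's heuristic):
# buffers with disjoint lifetimes may reuse the same address range in RAM.
from typing import NamedTuple

class Buffer(NamedTuple):
    name: str
    size: int    # bytes
    start: int   # first schedule step where the data is alive
    end: int     # last schedule step where the data is alive

def lifetimes_overlap(a: Buffer, b: Buffer) -> bool:
    """Two buffers conflict only if their [start, end] intervals intersect."""
    return a.start <= b.end and b.start <= a.end

def first_fit_allocate(buffers):
    """Place each buffer at the lowest offset that clashes with no
    already-placed buffer that is simultaneously alive."""
    placements = {}  # name -> offset
    for buf in sorted(buffers, key=lambda b: b.start):
        offset, done = 0, False
        while not done:
            done = True
            for other in buffers:
                if other.name not in placements or not lifetimes_overlap(buf, other):
                    continue
                o_off = placements[other.name]
                if offset < o_off + other.size and o_off < offset + buf.size:
                    offset = o_off + other.size  # skip past the conflicting range
                    done = False
                    break
        placements[buf.name] = offset
    return placements

buffers = [Buffer("A", 256, 0, 3), Buffer("B", 128, 2, 5), Buffer("C", 256, 4, 7)]
print(first_fit_allocate(buffers))  # {'A': 0, 'B': 256, 'C': 0} -- C reuses A's range
```

Because all lifetimes are known from the static schedule, a decision of this kind can be made off-line and deterministically, which is what makes the memory saving safe.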

Citations

... Grandpierre and Sorel [35] presented a prototyping methodology and a software tool (SynDEx) that aims to optimize implementations of real-time image and signal processing onto heterogeneous multiprocessor architectures. This methodology has been extended to integrated circuits [36,37] and to VHDL code generation [38], and it has been the subject of improvements in terms of power consumption [39]. ...
... Additionally, a set of parameter dependencies defines the hierarchy of the parameters and the influence of a given parameter on its child and parent parameters. Inherited from IBSDF and also taken from the Algorithm Architecture Matching methodology [35,85], the PiSDF model includes the concept of special source and sink nodes that interact with the outside world. These special nodes define the inputs and outputs of the system, making the model easier to instantiate in any design [64]. ...
... The Algorithm Architecture Matching methodology (AAM) [35,85] covers all the steps of rapid prototyping in a seamless flow of graph transformations, from algorithm and architecture specification to the automatic generation of distributed executives [86]. An algorithm (application) is represented by a Data Dependence Graph. ...
Thesis
Coarse-Grained Reconfigurable Architectures (CGRA) are designed to deliver high performance while drastically reducing the latency of the computing system. There are several types of CGRA, differing in structure, application domain, type of resources, and memory infrastructure. We focus our work on a subset of CGRA designs that we call Software Programmable Streaming Coarse-Grained Reconfigurable Architectures (SPS-CGRA). An SPS-CGRA is a more or less complex array of coarse-grained heterogeneous hardware resources with a coarser granularity than classical CGRAs. An SPS-CGRA can perform spatial and temporal computations at low latency. Its stream-based processing provides high performance while maintaining a level of flexibility. Although they are often highly domain-specifically optimized, SPS-CGRAs keep several levels of custom post-fabrication programmability, given by a set of parameters, so that they can be reused. However, their reuse is generally limited by the complexity of identifying the best allocation of the processing tasks onto the hardware resources. Another limiting point is the complexity of producing a reliable performance analysis for each new implementation, since no mature tool exists. To solve these problems, we propose a complete mapping and scheduling framework that targets SPS-CGRA. We introduce a generic hardware model allowing one to express these intrinsically custom levels of flexibility without neglecting data access and system configuration control. We also propose a performance estimation analysis based on resource latency descriptions, allowing an upper bound of the computing cost to be obtained. Finally, we present four different solutions for the mapping and scheduling problem: a list-based algorithm with backtracking, a lookahead-based heuristic, a Bayesian-based heuristic, and a Q-learning mapping algorithm. We evaluate and compare our solutions against an exhaustive approach on a real-life example and illustrate the benefits and efficiency of the proposed framework.
... In the co-design context, where designed systems have both hardware and software parts, developer inputs typically comprise a model of the application, a model of the targeted architecture, and a set of constraints for the deployment of the application on the architecture. As presented in the Algorithm-Architecture Adequation (AAA) methodology [GS03], the separation between the three inputs of the design flow ensures their independence, which eases the deployment of an application on several architectures or the use of a single architecture to deploy several applications. ...
Thesis
The complexity of MPSoC (Multiprocessor System-on-Chip) architectures is growing exponentially to meet the ever-increasing demand for computing power of DSP (Digital Signal Processing) applications. Modern MPSoC architectures, such as multi-core architectures, already incorporate hundreds of processing elements (PEs) on a single chip and are expected to integrate up to a thousand PEs in the near future. Consequently, programming modern MPSoC architectures with traditional thread-based programming languages has become increasingly complex. In this context, dataflow Models of Computation (MoCs) have become popular programming paradigms, offering both strong analyzability and an intuitive, task-graph-based expression of the parallelism of a DSP application. In this thesis, we propose new techniques for evaluating the maximum throughput and minimum latency of the IBSDF (Interface-Based Synchronous Dataflow) model, targeting MPSoC architectures with unlimited resources. The IBSDF model is a hierarchical model with static behavior that allows certain performance metrics to be evaluated at design time. However, classical evaluation methods flatten the hierarchy of the IBSDF model into a non-hierarchical dataflow graph with an exponential number of tasks, making its evaluation difficult or even impossible. The new techniques we propose evaluate the performance of IBSDF graphs in a modular way, without flattening their hierarchy. As a result, we were able to evaluate very large IBSDF graphs in a few seconds, whereas the classical approach fails to produce a result.
... Then, the implementation is formalized as a set of graph transformations that distribute and schedule the application on the platform. Several heuristics are proposed to explore solutions that optimize the response time and resource allocation while satisfying real-time constraints [103,54]. ...
Thesis
The increasing complexity of embedded applications in modern cars has increased the need for computing power. To meet this need, the European automotive standard AUTOSAR has introduced the use of multi-core platforms. However, using multi-core platforms for critical automotive applications raises several issues. In particular, it is necessary to respect the functional specification and to deterministically guarantee the data exchanges between cores. In this thesis, we consider multi-periodic systems specified and validated with MATLAB/Simulink, and we developed a framework to deploy MATLAB/Simulink applications on AUTOSAR multi-core platforms. This framework guarantees functional and temporal determinism and exploits parallelism. Our contribution is threefold. First, we identify the communication mechanisms in MATLAB/Simulink. Then, we prove that the dataflow in a multi-periodic MATLAB/Simulink system is modeled by an SDFG. The SDFG formalism is an excellent analysis tool for exploiting parallelism; it is very popular in the literature and widely studied for the deployment of dataflow applications on multi/many-core platforms. Then, we develop methods to realize the dataflow expressed by the SDFG under preemptive real-time scheduling. These methods use theoretical results on SDFGs to guarantee deterministic precedence constraints without using blocking synchronization mechanisms. As such, both functional and temporal determinism are guaranteed. Finally, we characterize the impact of dataflow requirements on tasks and propose a partitioning technique that minimizes this impact. We show that this technique promotes the construction of a partitioning and a feasible scheduling when it is used to initiate multi-objective search and optimization algorithms, thereby reducing the number of design iterations and shortening the design time.
... For this reason, we opted for a methodology that includes an optimization phase and allows the user to specify the algorithm and explore several possible implementations for different FPGA targets. The AAA methodology [13] is an effective means of rapid prototyping dedicated to optimizing implementations of real-time signal and image processing applications onto heterogeneous multiprocessor architectures. The extension of the AAA methodology to circuits targets the design of real-time applications on integrated circuits [14,15]. ...
... The algorithm to implement is first specified as a Factorized and Conditioned Data Dependence Graph (FCDDG) [13]. The FCDDG is an extension of the classical direct data dependence model with dedicated nodes that specify repetition patterns (like the ''for''/''while'' operators of classical imperative languages) and conditioning (''if…then…else''). ...
... The FCDDG is an extension of the classical direct data dependence model with dedicated nodes that specify repetition patterns (like the ''for''/''while'' operators of classical imperative languages) and conditioning (''if…then…else''). Factorization is the graph-model extension that replaces a repeated sub-graph of connected operations with a single instance, marking each edge crossing the pattern frontier with dedicated factorization nodes in order to reduce the size of the specification [13]. Its boundary, named the Factorization Frontier (FF), is represented by a dotted line. ...
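The size reduction brought by factorization can be pictured with a minimal sketch (node names and structure are invented for illustration; this is not the FCDDG implementation): a repeated operation is kept only once, and the repetition factor is carried by boundary nodes on the factorization frontier.

```python
# Minimal illustration of the factorization idea (invented names, not the FCDDG model):
# N repetitions of the same operation collapse into one instance between frontier
# nodes that carry the repetition factor.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    kind: str        # "op", "fork" (frontier in), "join" (frontier out)
    repeat: int = 1  # repetition factor carried on the factorization frontier

def flat_spec(n):
    """Unfactorized specification: one node per repetition."""
    return [Node(f"mac_{i}", "op") for i in range(n)]

def factorized_spec(n):
    """Factorized specification: a single op instance inside the frontier."""
    return [Node("F", "fork", repeat=n),
            Node("mac", "op", repeat=n),
            Node("J", "join", repeat=n)]

print(len(flat_spec(1024)), "nodes unfactorized vs", len(factorized_spec(1024)), "nodes factorized")
# 1024 nodes unfactorized vs 3 nodes factorized
```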
Article
Full-text available
The development of hardware platforms for artificial neural networks (ANN) has been hindered by the high consumption of power and hardware resources. In this paper, we present a methodology for the optimized implementation of an ANN of the learning vector quantization (LVQ) type on a field-programmable gate array (FPGA) device. The aim was to provide an intelligent embedded system for real-time vigilance state classification of a subject from an analysis of the electroencephalogram signal. The approach consists of applying an extension of the algorithm architecture adequacy (AAA) methodology with an arithmetic accuracy constraint, allowing the optimized implementation of the LVQ on the FPGA. This extension improves the optimization phase of the AAA methodology by taking into account the word length required by each operation and creating groups of operations with approximately equal word lengths, where the operations in the same group are performed by the same operator. This LVQ implementation yields considerable gains in circuit resources, power, and maximum frequency while respecting the timing and accuracy constraints. To validate our approach, the LVQ implementation has been tested for several network topologies on two Virtex devices. The accuracy–success rate relation has been studied and reported.
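A rough sketch of the grouping step described above (the tolerance parameter and all names are assumptions, not the published algorithm): operations with close word-length requirements are clustered so that each group can be served by one operator sized for its widest member.

```python
# Hedged sketch of word-length grouping (tolerance and names are assumptions):
# cluster operations whose required word lengths are close, so one operator
# sized for the widest member can serve the whole group.
def group_by_wordlength(required_bits, tolerance=2):
    """required_bits: dict op_name -> minimum word length in bits.
    Returns a list of (operator_width, [op_names]) groups."""
    groups = []
    for op, bits in sorted(required_bits.items(), key=lambda kv: kv[1]):
        if groups and bits - groups[-1][0] <= tolerance:
            width, ops = groups[-1]
            groups[-1] = (max(width, bits), ops + [op])  # widen the shared operator
        else:
            groups.append((bits, [op]))                  # start a new group
    return groups

ops = {"sub1": 9, "sub2": 10, "mul1": 16, "mul2": 18, "acc": 24}
print(group_by_wordlength(ops))
# [(10, ['sub1', 'sub2']), (18, ['mul1', 'mul2']), (24, ['acc'])]
```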
... In the 1990s, models of parallel computation, such as the ones overviewed by Maggs et al. in [1], were designed to comprehensively represent a system including hardware- and software-related features. Since the early 2000s, rapid prototyping initiatives like the Algorithm-Architecture Matching (AAA) methodology [2] have fostered the separation of algorithm and architecture models in order to automate Design Space Exploration (DSE). Separation of concerns plays a major part in mitigating the design complexity of systems. ...
Article
Current trends in high performance and embedded computing include design of increasingly complex hardware architectures with high parallelism, heterogeneous processing elements and non-uniform communication resources. In order to take hardware and software design decisions, early evaluations of the system non-functional properties are needed. These evaluations of system efficiency require Electronic System-Level (ESL) information on both the algorithms and the architecture. Contrary to algorithm models for which a major body of work has been conducted on defining formal Models of Computation (MoCs), architecture models from the literature are mostly empirical models from which reproducible experimentation requires the accompanying software. In this paper, a precise definition of a Model of Architecture (MoA) is proposed that focuses on reproducibility and abstraction and removes the overlap previously existing between the notions of MoA and MoC. A first MoA, called the Linear System-Level Architecture Model (LSLA), is presented. To demonstrate the generic nature of the proposed new architecture modeling concepts, we show that the LSLA Model can be integrated flexibly with different MoCs. LSLA is then used to model the energy consumption of a State-of-the-Art Multiprocessor System-on-Chip (MPSoC) when running an application described using the Synchronous Dataflow (SDF) MoC. A method to automatically learn LSLA model parameters from platform measurements is introduced. Despite the high complexity of the underlying hardware and software, a simple LSLA model is demonstrated to estimate the energy consumption of the MPSoC with a fidelity of 86%.
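The abstract above mentions automatically learning the parameters of a linear model from platform measurements. A minimal way to do this, assuming a purely linear cost and using invented measurement data (this is an illustration, not the method of the article), is an ordinary least-squares fit:

```python
# Least-squares fit of a linear cost model from measurements
# (the activity features and energy values below are invented for illustration).
import numpy as np

# Each row: [tokens processed, bytes transferred, 1 (constant/static term)]
activity = np.array([
    [1000,  2048, 1],
    [2000,  2048, 1],
    [2000,  8192, 1],
    [4000, 16384, 1],
], dtype=float)

# Measured energy per run in millijoules (e.g. from on-board power sensors).
energy = np.array([12.1, 20.3, 23.5, 42.0])

# Solve for the per-token, per-byte and static cost coefficients.
coeffs, *_ = np.linalg.lstsq(activity, energy, rcond=None)
print("per-token %.4f mJ, per-byte %.6f mJ, static %.2f mJ" % tuple(coeffs))
```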
... In this context, significant research has been done to improve schedule quality. The SynDEx approach [6], for example, is system-level CAD software that takes as input a model of the application associated with a model of the architecture. It executes a list-scheduling algorithm that orders tasks and maps the algorithm onto the multiprocessor architecture. ...
... Further, the authors presented the SynDEx approach in [6]. SynDEx is system-level CAD software. ...
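As a rough illustration of the list-scheduling idea mentioned in these excerpts (a generic sketch that ignores inter-processor communication costs, not the actual SynDEx heuristic): tasks are visited in priority order, and each one is mapped to the processor that lets it finish earliest.

```python
# Generic list-scheduling sketch (not the SynDEx heuristic; communication
# costs are ignored): each task, taken in priority order, is mapped to the
# processor on which it finishes earliest.
def list_schedule(tasks, deps, durations, processors):
    """tasks: names in priority order (e.g. a topological order of the graph).
    deps: task -> list of predecessor tasks.
    durations: (task, processor) -> execution time.
    Returns task -> (processor, start, finish)."""
    proc_free = {p: 0 for p in processors}   # time at which each processor is idle again
    placed = {}
    for t in tasks:
        best = None
        for p in processors:
            ready = max((placed[d][2] for d in deps.get(t, [])), default=0)
            start = max(ready, proc_free[p])
            finish = start + durations[(t, p)]
            if best is None or finish < best[0]:
                best = (finish, p, start)
        finish, p, start = best
        placed[t] = (p, start, finish)
        proc_free[p] = finish
    return placed

tasks = ["in", "fir", "fft", "out"]
deps = {"fir": ["in"], "fft": ["in"], "out": ["fir", "fft"]}
durations = {("in", "P1"): 1, ("in", "P2"): 2, ("fir", "P1"): 4, ("fir", "P2"): 3,
             ("fft", "P1"): 5, ("fft", "P2"): 5, ("out", "P1"): 1, ("out", "P2"): 1}
print(list_schedule(tasks, deps, durations, ["P1", "P2"]))
# {'in': ('P1', 0, 1), 'fir': ('P2', 1, 4), 'fft': ('P1', 1, 6), 'out': ('P1', 6, 7)}
```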
... In the 1990s, models of parallel computation, such as the ones overviewed by Maggs et al. in [1], were designed to represent a global system including hardware- and software-related features. Since the early 2000s, rapid prototyping initiatives such as the Algorithm-Architecture Matching (AAA) methodology [2] have fostered the separation of algorithm and architecture models in order to automate design space exploration. ...
... In multi-core rapid prototyping tools, each community uses a different custom architecture model, often associated with a specific syntax and specific requirements. For example, custom models for multicore scheduling have been proposed, such as the High-level Virtual Platform (HVP) [33], the SynDEx architecture model [2], and the System-Level Architecture Model (S-LAM) [38]. These models are time-oriented models for automated multicore scheduling. ...
Technical Report
Full-text available
The current trend in high performance and embedded computing consists of designing increasingly complex heterogeneous hardware architectures with non-uniform communication resources. In order to take hardware and software design decisions, early evaluations of the system non-functional properties are needed. These evaluations of system efficiency require high-level information on both the algorithms and the architecture. In state-of-the-art Model-Driven Engineering (MDE) methods, different communities have developed custom architecture models associated with languages of substantial complexity. This fact contrasts with Models of Computation (MoCs), which provide abstract representations of an algorithm's behavior as well as tool interoperability. In this report, we define the notion of Model of Architecture (MoA) and study the combination of a MoC and an MoA to provide a design space exploration environment for the study of algorithmic and architectural choices. An MoA provides reproducible cost computation for evaluating the efficiency of a system. A new MoA called the Linear System-Level Architecture Model (LSLA) is introduced and compared to state-of-the-art models. LSLA aims at representing hardware efficiency with a linear model. The computed cost results from the mapping of an application, represented by a model conforming to a MoC, onto an architecture represented by a model conforming to an MoA. The cost is composed of a processing-related part and a communication-related part. It is an abstract scalar value to be minimized and can represent any non-functional requirement of a system such as memory, energy, throughput, or latency.
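A toy cost evaluation in the spirit of what is described here (the constants, routing table, and additive combination rule are assumptions, not the LSLA definition): the abstract cost of a mapping is the sum of a processing part and a communication part, both linear in the activity.

```python
# Toy mapping cost with a processing part and a communication part, both linear
# in the activity (constants and routing are invented; this is not the LSLA model).
def mapping_cost(mapping, token_counts, pe_unit_cost, link_unit_cost, routes):
    """mapping: actor -> processing element (PE)
    token_counts: (producer, consumer) -> tokens exchanged per iteration
    pe_unit_cost: PE -> cost of processing one token
    link_unit_cost: link -> cost of moving one token over that link
    routes: (PE, PE) -> links traversed between the two PEs ([] if local)."""
    processing = sum(pe_unit_cost[mapping[prod]] * n
                     for (prod, _), n in token_counts.items())
    communication = sum(link_unit_cost[link] * n
                        for (prod, cons), n in token_counts.items()
                        for link in routes[(mapping[prod], mapping[cons])])
    return processing + communication

pe_unit_cost = {"CPU": 1.0, "DSP": 0.4}
link_unit_cost = {"bus": 0.2}
routes = {("CPU", "CPU"): [], ("DSP", "DSP"): [],
          ("CPU", "DSP"): ["bus"], ("DSP", "CPU"): ["bus"]}
tokens = {("src", "fir"): 128, ("fir", "snk"): 128}
print(mapping_cost({"src": "CPU", "fir": "DSP", "snk": "CPU"},
                   tokens, pe_unit_cost, link_unit_cost, routes))  # ≈ 230.4
```

Comparing such scalar costs across candidate mappings is what allows a design space exploration tool to rank implementation choices without executing them.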
... Preesm [134,135] is a rapid-prototyping framework for static dataflow applications that has been inspired by the Algorithm Architecture Matching methodology (AAM, also sometimes called AAA) [136]. Preesm makes use of a parameterized and interfaced dataflow meta-model (PiMM) [137] representation of the application, together with a System-Level Architecture Model (S-LAM) for the high-level architecture description. ...
... SynDEx [145] is graphical and interactive software implementing the Algorithm Architecture Matching methodology (AAM, also sometimes called AAA) [136]. Within this environment, the designer defines an algorithm graph, an architecture graph, and system constraints. ...
Article
Full-text available
All computing platforms, from mobile devices to supercomputers, are becoming more and more heterogeneous and massively parallel. While they can provide higher power efficiency and computation throughput, effective and confident use of these systems always requires knowledge of low-level programming. The average time necessary to develop and optimize a design on heterogeneous platforms is ever higher compared to typical homogeneous systems. Dataflow models of computation (MoCs) are quickly becoming common practice in heterogeneous systems development. In domains such as signal processing and multimedia communication, dataflow MoCs have become accepted as a standard. However, the shift from a sequential and architecture-specific MoC to a dataflow MoC still uncovers several programming and development challenges. The Cal Actor Language (CAL) is a recently specified dataflow and actor-based language capable of concisely expressing complex and general-purpose parallel applications. However, design tools supporting this language are generally not adequate to fully exploit its features and expressive power. In fact, they generally restrict its MoC in order to reduce the design space exploration (DSE) effort. The objective of this thesis is to provide a DSE methodology where all the features of CAL and dynamic dataflow MoCs can be exploited in a more general and effective manner. This dissertation illustrates a novel profiling, analysis, and performance estimation methodology for the DSE of dynamic dataflow programs. The main research contributions of this thesis are: the formalization of a graph-based representation of the program execution called an execution trace graph (ETG); the formalization of a systematic methodology for profiling generic dynamic dataflow programs through their code interpretation; and the formalization of a complete DSE methodology for dynamic dataflow programs in order to efficiently identify close-to-optimal design points according to various tailored performance merit functions. In particular, the following design space optimization problems for dynamic dataflow programs are addressed: the analysis of the hotspots and algorithmic bottlenecks of a parallel program; the bounding and optimization of the buffer size configuration for complex designs; and the minimization of the dynamic power dissipation of programs implemented on multi-clock-domain architectures. Furthermore, theoretical concepts such as the design space critical path and the potential speedup of a dataflow application have been defined and revisited, respectively. The thesis also presents a DSE framework developed to demonstrate the effectiveness of this design methodology.
... • Some systems with a mixed event-driven/time-driven execution model, such as those synthesized by SynDEx [Grandpierre and Sorel, 2003]. ...
... The optimal scheduling of such specifications onto platforms with multiple, heterogeneous execution and communication resources (distributed, parallel, multicore) is NP-hard regardless of the optimality criterion (throughput, makespan, etc.) [Garey and Johnson, 1979]. Existing scheduling techniques and tools [Zheng et al., 2005, Grandpierre and Sorel, 2003, Potop-Butucaru et al., 2010, Eles et al., 2000] heuristically solve the simpler problem of synthesizing a scheduling table of minimal length which implements one generic cycle of the embedded control algorithm. The algorithm briefly presented in Section 4.2.2.3 follows this trend: it maps a CG specification to an architecture while trying to minimize the length of the obtained scheduling table. ...
... To optimize makespan, the first phase of our approach employs existing scheduling techniques that were specifically designed for this purpose [Zheng et al., 2005, Grandpierre and Sorel, 2003, Potop-Butucaru et al., 2010, Eles et al., 2000], such as the one used to create the scheduling table of Section 4.2.2.3. But the contribution described in this chapter concerns the second phase of our flow, which takes the scheduling table computed in phase 1 and optimizes its throughput while keeping its makespan unchanged. ...
Article
Full-text available
There is a long-standing separation between the fields of compiler construction and real-time scheduling. While both fields have the same objective, the construction of correct implementations, the separation was historically justified by significant differences in the models and methods that were used. Nevertheless, with the ongoing increase in the complexity of applications and of the hardware of the execution platforms, the objects and problems studied in these two fields now largely overlap. In this thesis, we focus on automatic code generation for embedded control systems with complex constraints, including hard real-time requirements. To this end, we advocate the need for a reconciled research effort between the compilation and real-time systems communities. By adapting a technique usually used in compilers (software pipelining) to the system-level problem of multiprocessor scheduling of hard real-time applications, we shed light on the difficulties of this unified research effort, but also show how it can lead to real advances. Indeed, we explain how adapting techniques to the optimization of new objectives, in a different context, allows us to develop systems of better quality more easily than was possible until now. In this adaptation process, we propose to use synchronous formalisms and languages as a common formal ground. These can naturally be seen as extensions of classical models coming from both real-time scheduling (dependent task graphs) and compilation (static single assignment and data dependency graphs), but they also provide powerful techniques for manipulating complex control structures. We implemented our results in the LoPhT compiler.
... and a set of constraints for the deployment of the application on the architecture (Section 3.2.3). As presented in the Algorithm-Architecture Adequation (AAA) methodology [GS03], the separation between the three inputs of the design flow ensures the independence between them, which eases the deployment of an application on several architectures or the use of a single architecture to deploy several applications. ...
... The creation of the Preesm rapid prototyping framework has been inspired by the Algorithm-Architecture Adequation (AAA) methodology [GS03]. AAA consists of simultaneously searching the best software and hardware configurations for respecting system constraints. ...
Article
The development of embedded Digital Signal Processing (DSP) applications for Multiprocessor Systems-on-Chips (MPSoCs) is a complex task requiring the consideration of many constraints, including real-time requirements, power consumption restrictions, and limited hardware resources. To satisfy these constraints, it is critical to understand the general characteristics of a given application: its behavior and its requirements in terms of MPSoC resources. In particular, the memory requirements of an application strongly impact the quality and performance of an embedded system, as the silicon area occupied by the memory can be as large as 80% of a chip and may be responsible for a major part of its power consumption. Despite this large overhead, limited memory resources remain an important constraint that considerably increases the development time of embedded systems. Dataflow Models of Computation (MoCs) are widely used for the specification, analysis, and optimization of DSP applications. The popularity of dataflow MoCs is due to their great analyzability and their natural expression of the parallelism of a DSP application. The abstraction of time in dataflow MoCs is particularly suitable for exploiting the parallelism offered by heterogeneous MPSoCs. In this thesis, we propose a complete method to study the important memory characteristics of a DSP application modeled with a dataflow graph. The proposed method spans from the theoretical, architecture-independent memory characterization of an application to its quasi-optimal static memory allocation on a real shared-memory MPSoC. The proposed method, implemented as part of a rapid prototyping framework, is extensively tested on a set of state-of-the-art applications from the computer vision, telecommunication, and multimedia domains. Then, because the dataflow MoC used in our method cannot model applications with a dynamic behavior, we introduce a new dataflow meta-model to address the important challenge of managing dynamics in DSP-oriented representations. The new reconfigurable and composable dataflow meta-model strengthens the predictability, conciseness, and readability of application descriptions.
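One simple, architecture-independent quantity of the kind discussed above can be sketched as follows (a generic peak-liveness bound, not the characterization developed in the thesis): at any point of a schedule, memory must hold at least every buffer that is live at that point, so the peak of that sum is a lower bound on the memory needed.

```python
# Generic peak-liveness bound (illustrative, not the thesis' characterization):
# memory must hold at least the sum of sizes of all simultaneously-live buffers.
def peak_live_memory(buffers):
    """buffers: iterable of (size, start, end) lifetimes.
    Returns the peak of the summed sizes of simultaneously-live buffers."""
    events = []
    for size, start, end in buffers:
        events.append((start, size))      # buffer becomes live
        events.append((end + 1, -size))   # buffer dies right after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Same toy lifetimes as in the allocation sketch near the figure: the peak is
# 384 bytes, even though the three buffers total 640 bytes.
print(peak_live_memory([(256, 0, 3), (128, 2, 5), (256, 4, 7)]))  # 384
```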