A general view of the Parallel Ultra-Low Power (PULP) architecture (a) and the layout of the PULPv3 chip used for performance and power characterization (b). 


Source publication
Article
Ultra-low power operation and extreme energy efficiency are strong requirements for a number of high-growth application areas requiring near-sensor processing, including the processing of biosignals. Parallel near-threshold computing is emerging as an approach to achieve significant improvements in energy efficiency while overcoming the performance de...

Contexts in source publication

Context 1
... section describes the PULP platform, focusing on the architecture of its third embodiment, fabricated in 28-nm UTBB FD-SOI technology [1]. The computational engine of the SoC is a cluster with a parametric number (2-16) of cores (Figure 1a). The processors are based on a power-optimized, four-stage pipeline micro-architecture implementing the OpenRISC ISA [29], featuring full forwarding, with single stalls only on load-use dependencies and mispredicted branches. The original OpenRISC ISA and the micro-architecture of the core have been enhanced for energy-efficient digital signal processing, supporting zero-overhead hardware loops, an L0 buffer, load and store operations with embedded pointer arithmetic, and power management instructions. The platform supports the optional integration of floating-point units [30], suitable for applications requiring high precision and a wide dynamic range. The cluster features a shared instruction cache with an L0 buffer and support for instruction broadcasting, which greatly reduces the pressure on the cache banks with respect to a private solution, resulting in much higher energy efficiency [31]. For data accesses, the cluster relies on an explicitly managed, multi-banked L1 Tightly-Coupled Data Memory (TCDM), avoiding the memory coherency overhead of data caches and greatly increasing area and energy efficiency. Each logical bank can be implemented as a heterogeneous memory, composed of either SRAM or latch-based Standard Cell Memory (SCM) banks. Instruction caches can also be implemented with SCMs. Using SCMs for the most frequently accessed memory banks significantly improves energy efficiency, since the energy per access of SCMs is much lower than that of SRAMs for the relatively small memory cuts needed in the L1 instruction and data memories [32]. Depending on the availability of low-voltage memories in the targeted implementation technology, different ratios of SCM and SRAM memory can be instantiated at design time. ...
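As an illustration of the kind of code these ISA extensions target, the following is a minimal C sketch (not taken from the paper) of a fixed-point dot product: the constant-trip-count loop with sequential accesses is the pattern a PULP-aware compiler can map onto a zero-overhead hardware loop and loads with embedded pointer arithmetic. The function name, Q15 format and window length are illustrative assumptions.

#include <stdint.h>

#define N_SAMPLES 256  /* assumed window length, for illustration only */

/* Fixed-point (Q15) dot product: the constant trip count and sequential
   pointer accesses let the compiler emit a zero-overhead hardware loop and
   loads with embedded (post-increment) pointer arithmetic. */
int32_t dot_q15(const int16_t *a, const int16_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < N_SAMPLES; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc >> 15;  /* renormalize the accumulated Q30 products to Q15 */
}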
Context 2
... The on-line CHB-MIT (Children's Hospital Boston-Massachusetts Institute of Technology) dataset is used as the source of the EEG data used in the simulation [45]. It contains recordings from 24 pediatric subjects suffering from intractable seizures, acquired at 16-bit resolution with a sampling frequency of 256 Hz. EEG data during regular (no seizure) and abnormal (seizure) brain activity are taken from four patients randomly chosen from the dataset. To test the processing chain, we initially implemented and executed the application in MATLAB, verifying the accuracy of seizure recognition. Models are created using 30% of each subject's data (half no-seizure, half seizure) and tested on the remaining 70%. The accuracy reached by the algorithm is around 92-99%. After validation, both the fixed-point and floating-point versions of the processing chain were implemented and evaluated on the PULP platform. Both versions are parallelized across multiple cores (2, 4 and 8) to evaluate the scaling capabilities of the system and the related performance. To obtain reliable classification-accuracy results, an implementation using 30 windows of 256 × 23 samples, randomly chosen from the testing set and belonging to the two classes (i.e., 70% Class 0, 30% Class 1), was simulated. The virtual platform of PULP was used to estimate the execution time and the energy consumption of the application. To characterize the power numbers of the PULP architecture, data were extracted from measurements on the silicon prototype of the PULPv3 SoC, shown in Figure 1b, fitted with post-layout simulations to break down the power consumption among the components of the cluster, and adapted to the configurations employed in the exploration by annotating the energy numbers on the virtual platform. Table 2 shows the precision and accuracy percentages derived from the execution of the floating-point and fixed-point processing chains on PULP. Precision is evaluated by comparing the energy matrices obtained at the output of the DWT + ENERGY kernel, while accuracy is evaluated at the output of the SVM classifier. The MATLAB implementation, used as the reference golden model, employs a 64-bit double-precision format, while PULP employs a 32-bit single-precision format. Comparing the energy matrices obtained at the output of the DWT + ENERGY kernel in MATLAB and in the floating-point version on the PULP platform, the loss of precision is negligible (i.e., less than 0.1%). On the other hand, the fixed-point conversion causes a slight loss of precision due to rounding and saturation during the execution of the processing chain. The fixed-point application shows an average loss of precision of 6.5% compared to the floating-point version; the average accuracy is 92% for the floating-point application and 89% for the fixed-point version, which is aligned with the state of the art [46][47][48]. Tables 3 and 4 summarize the execution time (clock cycles) of the seizure detection application on the PULP platform, in its floating-point and fixed-point embodiments, respectively. In the sequential implementation, PCA requires 80% of the overall computation time, while DWT, energy and SVM account for the remaining 20%.
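As a rough illustration of how such a precision figure can be computed, the sketch below compares a golden-model energy matrix against the PULP output as a mean relative error. The exact metric used in the paper is not given in this excerpt, so the function and its name are assumptions.

#include <math.h>
#include <stddef.h>

/* Mean relative error between the golden-model energy matrix (MATLAB, double
   precision) and the PULP output (single precision), flattened to n elements.
   A result below 0.001 corresponds to a precision loss below 0.1%. */
double mean_relative_error(const double *golden, const float *test, size_t n)
{
    double acc = 0.0;
    size_t counted = 0;
    for (size_t i = 0; i < n; i++) {
        if (golden[i] != 0.0) {            /* skip zero reference entries */
            acc += fabs((double)test[i] - golden[i]) / fabs(golden[i]);
            counted++;
        }
    }
    return counted ? acc / (double)counted : 0.0;
}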
When the algorithm runs exploiting parallel processing over the multiple cores of PULP, the execution time is reduced by up to 5× in the floating-point version and 3.44× in the fixed-point version. Analyzing first Table 3, which shows the results of the floating-point implementation, we can notice that the kernels with higher parallelism, such as mean value + covariance, compute PC and SVM, accounting for more than 75% of the overall computational load, feature nearly ideal speed-ups. On the other hand, kernels such as Householder reduction and accumulate require frequent parallel computations on small data chunks interleaved with synchronization points, which increases the overhead of the OpenMP runtime and degrades the parallel performance. Finally, the DWT + energy kernel is affected by workload imbalance, since it requires the elaboration of nine principal components on eight processors, limiting the speed-up of this kernel to 4.38×. Diagonalize is an iterative kernel affected by a pathological Amdahl bottleneck caused by the dependencies between matrix elements calculated during the iterations, which force most of the kernel to be executed sequentially, leading to a very poor parallel speed-up (i.e., 2× on eight cores). The situation is even worse in the fixed-point implementation, due to the frequent scaling (Section 4.3) and to the difficulty of converging in the diagonalize kernel caused by the loss of precision. Although the poor parallelization affinity of some kernels of the processing chain degrades performance with respect to the ideal case (i.e., an 8× speed-up when executing on eight cores), preventing further voltage and frequency scaling for a given real-time requirement, the PULP platform tackles this problem through fine-grained clock gating of the idle processors of the cluster, as described in Section 3.3. This is highlighted in both Tables 3 and 4, in the columns showing the energy savings achieved by activating the fine-grained power management techniques employed in the PULP platform. Kernels featuring almost ideal speed-ups do not show any advantage, since in these kernels all of the resources of the cluster are (almost) fully utilized. On the other hand, the energy saving of poorly parallelizable kernels, such as diagonalize, can be up to 45%. Table 5 shows a comparison between the number of cycles required by the fixed-point and floating-point processing chains on the PULP platform. It is noteworthy that, if we consider execution time (i.e., number of cycles) as the comparison metric, the floating-point implementation provides better performance than the fixed-point implementation, even though floating-point operations require multiple cycles (i.e., two cycles for floating-point additions and multiplications). This is due to the significant overhead required to perform complex fixed-point operations, such as 64-bit multiplications and divisions, and to the additional overhead required to scale the samples, which adjusts the dynamic range of the data at the boundary of some of the kernels. As shown in Table 5, the Householder reduction and diagonalize kernels are the ones with the highest ratio between float and fixed execution times. Indeed, these kernels are the most computationally dense, hence the overhead of the function calls needed to perform multiplications, divisions and square roots significantly affects performance.
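The workload imbalance of the DWT + energy kernel can be seen in a minimal OpenMP sketch like the one below. It is not the paper's code; dwt_energy_component() is a hypothetical per-component kernel. With a static schedule, one of the eight threads processes two of the nine principal components while the others process one, so the parallel region lasts two component-times instead of 9/8 of one, which is why the speed-up saturates near 4.5× (4.38× reported).

#include <omp.h>

#define N_COMPONENTS 9   /* nine principal components from PCA */

void dwt_energy_component(int pc_idx);   /* hypothetical per-component kernel */

/* Nine independent components on eight cores: with a static schedule one
   thread gets two components while the others get one, so the region takes
   two component-times and the speed-up is capped near 9/2 = 4.5x. */
void dwt_energy_all(void)
{
    #pragma omp parallel for schedule(static) num_threads(8)
    for (int pc = 0; pc < N_COMPONENTS; pc++) {
        dwt_energy_component(pc);
    }
}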
Furthermore, the diagonalize kernel is computed with an iterative method that ends when the convergence condition is reached. Thus, using fixed-point arithmetic implies a higher execution time needed to ...
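To make the fixed-point overheads discussed above more concrete, the following is a minimal C sketch, not taken from the paper: a Q31 multiplication needs a 64-bit intermediate product and a renormalizing shift, and an iterative kernel such as diagonalization stops on an explicit tolerance check that quantization error can delay. The q31_t format, function names and tolerance handling are assumptions for illustration.

#include <stdint.h>

typedef int32_t q31_t;   /* assumed Q1.31 fixed-point format */

/* 32 x 32 -> 64-bit product followed by a renormalizing shift: the 64-bit
   multiply and the shift are the extra work compared to one FPU multiply. */
static inline q31_t q31_mul(q31_t a, q31_t b)
{
    return (q31_t)(((int64_t)a * (int64_t)b) >> 31);
}

/* Convergence test of an iterative kernel (e.g., diagonalization) on a
   fixed-point residual: quantization error can delay, or even prevent,
   reaching the tolerance. */
static inline int q31_converged(q31_t off_diag_norm, q31_t tol)
{
    int64_t mag = (off_diag_norm < 0) ? -(int64_t)off_diag_norm
                                      : (int64_t)off_diag_norm;
    return mag <= (int64_t)tol;
}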

Citations

... An alternative is to port FP operations to low-cost fixed-point operations [13], but porting an FP application to its fixed-point equivalent is laborious and demands in-depth knowledge of the application domain. Besides, porting an FP application to fixed-point is not always energy-efficient, due to the execution of the instructions required to organize and normalize such operations so as to cover the dynamic range of FP numbers, resulting in significant overheads [14]. ...
... This Event Unit also controls clock gating between the proposed CGRA and the RI5CY cores. The RI5CY cores share an FPU cluster, where the ratio between SFU [61,56] cores and RI5CY cores is 2:1, and the scheduling of FP operations is deterministic [14]. In Chapter 6, a heterogeneous cluster is introduced to further improve the performance of PULP-Cluster [91] for near-sensor processing and embedded ML. ...
Thesis
There is an ever-increasing demand for energy efficiency (EE) in rapidly evolving Internet-of-Things end nodes. This pushes researchers and engineers to develop solutions that provide both Application-Specific Integrated Circuit-like EE and Field-Programmable Gate Array-like flexibility. One such solution is Coarse Grain Reconfigurable Array (CGRA). Over the past decades, CGRAs have evolved and are competing to become mainstream hardware accelerators, especially for accelerating Digital Signal Processing (DSP) applications. Due to the over-specialization of computing architectures, the focus is shifting towards fitting an extensive data representation range into fewer bits, e.g., a 32-bit space can represent a more extensive data range with floating-point (FP) representation than an integer representation. Computation using FP representation requires numerous encodings and leads to complex circuits for the FP operators, decreasing the EE of the entire system. This thesis presents the design of an EE ultra-low-power CGRA with native support for FP computation by leveraging an emerging paradigm of approximate computing called transprecision computing. We also present the contributions in the compilation toolchain and system-level integration of CGRA in a System-on-Chip, to envision the proposed CGRA as an EE hardware accelerator. Finally, an extensive set of experiments using real-world algorithms employed in near-sensor processing applications are performed, and results are compared with state-of-the-art (SoA) architectures. It is empirically shown that our proposed CGRA provides better results w.r.t. SoA architectures in terms of power, performance, and area.
... The exponentiation inherent in floating-point computation, such as single- or double-precision IEEE Standard 754 floating-point numbers [12], assures a much larger dynamic range (the largest and smallest numbers that can be represented), which is especially important when processing extremely large data sets or where the range may be unpredictable. For instance, in [21], it is shown that the fixed-point implementation of bio-signal processing applications consumes more energy than the floating-point execution in hardware. Supporting floating-point operations thus becomes an essential requirement for CGRA-based architectures to be a "truly flexible" accelerator for applications with different data types. ...
Article
Coarse Grained Reconfigurable Arrays (CGRAs) are emerging as energy-efficient accelerators providing a high grade of flexibility in both academia and industry. However, with the recent advancements in algorithms and the performance requirements of applications, supporting only integer and logical arithmetic limits the interest of classical/traditional CGRAs. In this paper, we propose a novel CGRA architecture and associated compilation flow supporting both integer and floating-point computations for energy-efficient acceleration of DSP applications. Experimental results show that the proposed accelerator achieves a maximum of 4.61× speedup compared to a DSP-optimized, ultra-low-power RISC-V based CPU while executing seizure detection, a representative of a wide range of EEG signal processing applications, with an area overhead of 1.9×. The proposed CGRA achieves a maximum of 6.5× energy efficiency compared to the single-core CPU. When compared with the multi-core CPU with 8 cores, the proposed CGRA achieves up to a 4.4× energy gain.
... It can be used with a regular power source, which may lead to ripples riding on top of the signal. d) Detrending filter: this filter can help remove low-frequency artifacts which may lead to improper amplitudes, such as breathing noise during ECG signal detection [8]. ...
... As a result, such solutions do not scale up easily with the complexity of the system (i.e., the number of EEG channels), especially because of the limited bandwidth of low-power data links such as Bluetooth Low Energy or Zigbee [11], [12], [13]. ...
Article
Advancements in Digital Signal Processing (DSP) and machine learning techniques have boosted the popularity of Brain-Computer Interfaces (BCIs), where electroencephalography (EEG) is a widely accepted method to enable intuitive human-machine interaction. Nevertheless, the evolution of such interfaces is currently hampered by the unavailability of embedded platforms capable of delivering the required computational power at high energy efficiency and allowing for a small and unobtrusive form factor. To fill this gap, we developed BioWolf, a highly wearable (40 × 20 × 2 mm) BCI platform based on Mr. Wolf, a Parallel Ultra Low Power system on chip featuring nine RISC-V cores with DSP-oriented instruction set extensions. BioWolf also integrates a commercial 8-channel medical-grade analog-to-digital converter, and an ARM Cortex-M4 microcontroller unit (MCU) with Bluetooth Low Energy connectivity. To demonstrate the capabilities of the system, we implemented and tested a BCI featuring Canonical Correlation Analysis (CCA) of steady-state visual evoked potentials (SSVEP). The system achieves an average information transfer rate of 1.46 bits per second (bps), aligned with the state of the art of bench-top systems. Thanks to the reduced power envelope of the digital computational platform, which consumes less than the analog front-end, the total power budget is just 6.31 mW, providing up to 38 hours of operation (65 mAh battery). To our knowledge, our design is the first to explore the significant energy boost of a parallel MCU with respect to single-core MCUs for CCA-based BCIs.
... For SVM, a fixed-point approach is used to avoid executing all the computation in floating-point. It has already been demonstrated [13] that this approach leads to the best performance while preserving accuracy. As shown, the HD classifier achieves ≈2× faster execution and lower power at iso-accuracy compared to the SVM on the ARM Cortex-M4. ...
Conference Paper
Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a brain-inspired alternative to computing with scalars. Key properties of HD computing include a well-defined set of arithmetic operations on hypervectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operations. HD computing is about manipulating and comparing large patterns---binary hypervectors with 10,000 dimensions---making its efficient realization on minimalistic ultra-low-power platforms challenging. This paper describes HD computing's acceleration and its optimization of memory accesses and operations on a silicon prototype of the PULPv3 4-core platform (1.5 mm², 2 mW), surpassing the state-of-the-art classification accuracy (on average 92.4%) with a simultaneous 3.7× end-to-end speed-up and 2× energy saving compared to its single-core execution. We further explore the scalability of our accelerator by increasing the number of inputs and the classification window on a new generation of the PULP architecture featuring bit-manipulation instruction extensions and a larger number of cores (eight). These together enable a near-ideal speed-up of 18.4× compared to the single-core PULPv3.
Article
Timely and accurate seizure detection is of great importance for the diagnosis and treatment of epilepsy patients. Existing seizure detection models are often complex and time-consuming, highlighting the urgent need for lightweight seizure detection. Additionally, existing methods often neglect the key characteristic channels and spatial regions of electroencephalography (EEG) signals. To solve these issues, we propose a lightweight EEG-based seizure detection model named lightweight inverted residual attention network (LRAN). Specifically, we employ a four-stage inverted residual mobile block (iRMB) to effectively extract the hierarchical features from EEG. The convolutional block attention module (CBAM) is introduced to make the model focus on important feature channels and spatial information, thereby enhancing the discrimination of the learned features. Finally, convolution operations are used to capture local information and spatial relationships between features. We conduct intra-subject and inter-subject experiments on a publicly available dataset. Intra-subject experiments obtain 99.25% accuracy in segment-based detection and a 0.36/h false detection rate (FDR) in event-based detection, respectively. Inter-subject experiments obtain 84.32% accuracy. Both sets of experiments maintain high classification accuracy with a low number of parameters, where the multiply-accumulate operations (MACs) are 25.86 M and the number of parameters is 0.57 M.
Chapter
High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach for parallel computing enabling the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OMP-opt). We benchmarked these libraries on a Parallel Ultra-Low Power (PULP) MCU, highlighting that hardware-specific optimizations improve OMP-base performance by up to 69%. At the same time, OMP-SPMD leads to an extra improvement of up to 178%.