Execution of floating-point seizure detection on the embedded computing platform. 

Source publication
Article
Ultra-low-power operation and extreme energy efficiency are strong requirements for a number of high-growth application areas requiring near-sensor processing, including the processing of biosignals. Parallel near-threshold computing is emerging as an approach to achieve significant improvements in energy efficiency while overcoming the performance de...

Citations

... An alternative is to port FP operations to low-cost fixed-point operations [13], but porting an FP application to its fixed-point equivalent is laborious and demands in-depth knowledge of the application domain. Moreover, porting an FP application to fixed point is not always energy-efficient: the extra instructions required to organize and normalize such operations so as to handle the dynamic range of FP numbers introduce significant overheads [14]. ...
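The normalization overhead mentioned in this excerpt can be made concrete with a minimal Q-format sketch (a toy illustration, not code from the cited works; the Q1.15 format and all names are assumptions):

```python
# Minimal sketch of porting an FP multiply to Q1.15 fixed point.
# The explicit shift after the multiply is the "normalization"
# work that FP hardware performs implicitly.

Q = 15  # fractional bits in the Q1.15 format

def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a Q1.15 integer."""
    return int(round(x * (1 << Q)))

def q15_mul(a: int, b: int) -> int:
    """Fixed-point multiply: widen, then renormalize by shifting."""
    return (a * b) >> Q  # renormalization step, absent in FP code

def from_q15(x: int) -> float:
    """Convert a Q1.15 integer back to a float."""
    return x / (1 << Q)

a, b = to_q15(0.5), to_q15(0.25)
print(from_q15(q15_mul(a, b)))  # prints 0.125
```

Choosing the number of fractional bits is exactly the laborious, application-specific analysis the excerpt refers to: it must be redone whenever the dynamic range of the data changes.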
... This Event Unit also controls clock gating between the proposed CGRA and the RI5CY cores. The RI5CY cores share an FPU cluster; the ratio between SFU [61,56] cores and RI5CY cores is 2:1, and the scheduling of FP operations is deterministic [14]. In Chapter 6, a heterogeneous cluster is introduced to further improve the performance of the PULP cluster [91] for near-sensor processing and embedded ML. ...
Thesis
There is an ever-increasing demand for energy efficiency (EE) in rapidly evolving Internet-of-Things end nodes. This pushes researchers and engineers to develop solutions that provide both Application-Specific Integrated Circuit-like EE and Field-Programmable Gate Array-like flexibility. One such solution is the Coarse Grain Reconfigurable Array (CGRA). Over the past decades, CGRAs have evolved and are competing to become mainstream hardware accelerators, especially for accelerating Digital Signal Processing (DSP) applications. Due to the over-specialization of computing architectures, the focus is shifting towards fitting an extensive data representation range into fewer bits; e.g., a 32-bit space can represent a more extensive data range with floating-point (FP) representation than with an integer representation. Computation using FP representation requires numerous encodings and leads to complex circuits for the FP operators, decreasing the EE of the entire system. This thesis presents the design of an EE ultra-low-power CGRA with native support for FP computation by leveraging an emerging paradigm of approximate computing called transprecision computing. We also present contributions to the compilation toolchain and the system-level integration of the CGRA in a System-on-Chip, to envision the proposed CGRA as an EE hardware accelerator. Finally, an extensive set of experiments using real-world algorithms employed in near-sensor processing applications is performed, and the results are compared with state-of-the-art (SoA) architectures. It is empirically shown that the proposed CGRA provides better results than SoA architectures in terms of power, performance, and area.
... The exponent inherent in floating-point representations such as single- or double-precision IEEE 754 floating-point numbers [12] assures a much larger dynamic range (the largest and smallest numbers that can be represented), which is especially important when processing extremely large data sets or when the range may be unpredictable. For instance, in [21], it is shown that the fixed-point implementation of bio-signal processing applications consumes more energy than the floating-point execution in hardware. Supporting floating-point operations thus becomes an essential requirement for CGRA-based architectures to be "truly flexible" accelerators for applications with different data types. ...
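The dynamic-range argument can be checked directly in Python, whose built-in float is an IEEE 754 binary64 (a quick illustration of the claim, not material from the cited papers):

```python
import sys

# A 32-bit signed integer spans roughly +/-2.1e9 ...
int32_max = 2**31 - 1

# ...whereas IEEE 754 trades significand precision for an exponent,
# giving a vastly larger dynamic range in the same bit budget.
f = sys.float_info  # properties of binary64 (double precision)
print(f"int32 max:      {int32_max:.3e}")   # about 2.147e+09
print(f"binary64 max:   {f.max:.3e}")       # about 1.798e+308
print(f"binary64 min>0: {f.min:.3e}")       # smallest normal positive
```

Even 32-bit binary32 reaches about ±3.4e38, far beyond any 32-bit integer, which is the property the excerpt relies on.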
Article
Coarse Grained Reconfigurable Arrays (CGRAs) are emerging as energy-efficient accelerators providing a high degree of flexibility in both academia and industry. However, with the recent advancements in algorithms and the performance requirements of applications, supporting only integer and logical arithmetic limits the interest of classical/traditional CGRAs. In this paper, we propose a novel CGRA architecture and an associated compilation flow supporting both integer and floating-point computations for the energy-efficient acceleration of DSP applications. Experimental results show that the proposed accelerator achieves a maximum of 4.61× speedup compared to a DSP-optimized, ultra-low-power RISC-V based CPU while executing seizure detection, a representative of a wide range of EEG signal processing applications, with an area overhead of 1.9×. The proposed CGRA achieves a maximum of 6.5× energy efficiency compared to the single-core CPU. When comparing against the multi-core CPU with 8 cores, the proposed CGRA achieves up to 4.4× energy gain.
... It can be used with a regular power source, which may lead to ripples riding on top of the signal. d) Detrending filter: this filter helps remove low-frequency artifacts that may lead to improper amplitudes, such as filtering out breathing noise when detecting an ECG signal [8]. ...
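A detrending filter like the one named in this excerpt can be sketched as subtracting a moving-average baseline from the raw trace (a toy illustration with a synthetic signal; the window length and signal parameters are assumptions, not values from the cited work):

```python
import math

# Minimal detrending sketch: subtract a moving average to remove
# slow baseline drift (e.g. breathing artifacts under an ECG trace).

def moving_average(x, w):
    """Centered moving average with edge clamping."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def detrend(x, w):
    """Remove the slow baseline estimated by the moving average."""
    baseline = moving_average(x, w)
    return [xi - bi for xi, bi in zip(x, baseline)]

# Synthetic trace: a 5 Hz "signal" riding on a slow linear drift,
# sampled at 100 Hz for 5 seconds.
t = [i / 100 for i in range(500)]
signal = [math.sin(2 * math.pi * 5 * ti) for ti in t]
drift = [0.5 * ti for ti in t]
raw = [s + d for s, d in zip(signal, drift)]
clean = detrend(raw, w=101)  # ~1 s window, longer than the signal period
```

Away from the edges the moving average tracks the linear drift exactly and averages the sine to nearly zero, so the residual is close to the drift-free signal.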
... As a result, such solutions do not scale up easily with the complexity of the system (i.e. the number of EEG channels), especially because of the limited bandwidth of low-power data links such as Bluetooth Low Energy or Zigbee [11], [12], [13]. ...
Article
Advancements in Digital Signal Processing (DSP) and machine learning techniques have boosted the popularity of Brain-Computer Interfaces (BCIs), where electroencephalography (EEG) is a widely accepted method to enable intuitive human-machine interaction. Nevertheless, the evolution of such interfaces is currently hampered by the unavailability of embedded platforms capable of delivering the required computational power at high energy efficiency while allowing for a small and unobtrusive form factor. To fill this gap, we developed BioWolf, a highly wearable (40×20×2 mm) BCI platform based on Mr. Wolf, a Parallel Ultra-Low-Power system on chip featuring nine RISC-V cores with DSP-oriented instruction-set extensions. BioWolf also integrates a commercial 8-channel medical-grade analog-to-digital converter and an ARM Cortex-M4 microcontroller unit (MCU) with Bluetooth Low Energy connectivity. To demonstrate the capabilities of the system, we implemented and tested a BCI featuring Canonical Correlation Analysis (CCA) of steady-state visual evoked potentials (SSVEP). The system achieves an average information transfer rate of 1.46 bits per second (bps), aligned with the state of the art of bench-top systems. Thanks to the reduced power envelope of the digital computational platform, which consumes less than the analog front-end, the total power budget is just 6.31 mW, providing up to 38 hours of operation on a 65 mAh battery. To our knowledge, our design is the first to explore the significant energy boost of a parallel MCU with respect to single-core MCUs for CCA-based BCIs.
... For SVM, a fixed-point approach is used to avoid executing all the computation in floating point. It has already been demonstrated [13] that this approach yields the best performance while preserving accuracy. As shown, the HD classifier achieves ≈2× faster execution and lower power at iso-accuracy compared to the SVM on the ARM Cortex-M4. ...
Conference Paper
Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a brain-inspired alternative to computing with scalars. Key properties of HD computing include a well-defined set of arithmetic operations on hypervectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operations. HD computing is about manipulating and comparing large patterns (binary hypervectors with 10,000 dimensions), making its efficient realization on minimalistic ultra-low-power platforms challenging. This paper describes HD computing's acceleration and its optimization of memory accesses and operations on a silicon prototype of the PULPv3 4-core platform (1.5 mm², 2 mW), surpassing the state-of-the-art classification accuracy (on average 92.4%) with a simultaneous 3.7× end-to-end speed-up and 2× energy saving compared to its single-core execution. We further explore the scalability of our accelerator by increasing the number of inputs and the classification window on a new generation of the PULP architecture featuring bit-manipulation instruction extensions and a larger number of cores (8). These together enable a near-ideal speed-up of 18.4× compared to the single-core PULPv3.
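The core hypervector operations this abstract refers to (binding, bundling, and similarity comparison) can be sketched in a few lines; this is a generic toy illustration of HD computing, not the paper's optimized implementation, and the dimensionality is shrunk from 10,000 for brevity:

```python
import random

D = 1000  # toy dimensionality; the paper uses 10,000-bit hypervectors
random.seed(0)

def rand_hv():
    """Random binary hypervector; unrelated pairs differ in ~50% of bits."""
    return [random.randint(0, 1) for _ in range(D)]

def bind(a, b):
    """Binding = elementwise XOR; associates two hypervectors."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(hvs):
    """Bundling = elementwise majority vote; superimposes a set."""
    return [1 if sum(bits) * 2 > len(hvs) else 0 for bits in zip(*hvs)]

def hamming(a, b):
    """Normalized Hamming distance: ~0.5 for unrelated hypervectors."""
    return sum(x != y for x, y in zip(a, b)) / D

a, b, c = rand_hv(), rand_hv(), rand_hv()
proto = bundle([a, b, c])  # class prototype from three examples
# The prototype stays close to its members (~0.25) but remains
# near 0.5 from an unrelated random hypervector.
print(hamming(proto, a), hamming(proto, rand_hv()))
```

Every operation is a bitwise, element-independent loop, which is why HD computing maps so naturally onto the parallel cores and bit-manipulation extensions described in the abstract.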
... For SVM, a fixed-point approach is used to avoid executing all the computation in floating point. It has already been demonstrated [13] that this approach yields the best performance while preserving accuracy. As shown, the HD classifier achieves ≈2× faster execution and lower power at iso-accuracy compared to the SVM on the ARM Cortex-M4. ...
Article
Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a brain-inspired alternative to computing with scalars. Key properties of HD computing include a well-defined set of arithmetic operations on hypervectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operations. HD computing is about manipulating and comparing large patterns (binary hypervectors with 10,000 dimensions), making its efficient realization on minimalistic ultra-low-power platforms challenging. This paper describes HD computing's acceleration and its optimization of memory accesses and operations on a silicon prototype of the PULPv3 4-core platform (1.5 mm², 2 mW), surpassing the state-of-the-art classification accuracy (on average 92.4%) with a simultaneous 3.7× end-to-end speed-up and 2× energy saving compared to its single-core execution. We further explore the scalability of our accelerator by increasing the number of inputs and the classification window on a new generation of the PULP architecture featuring bit-manipulation instruction extensions and a larger number of cores (8). These together enable a near-ideal speed-up of 18.4× compared to the single-core PULPv3.
Article
Timely and accurate seizure detection is of great importance for the diagnosis and treatment of epilepsy patients. Existing seizure detection models are often complex and time-consuming, highlighting the urgent need for lightweight seizure detection. Additionally, existing methods often neglect the key characteristic channels and spatial regions of electroencephalography (EEG) signals. To solve these issues, we propose a lightweight EEG-based seizure detection model named the lightweight inverted residual attention network (LRAN). Specifically, we employ a four-stage inverted residual mobile block (iRMB) to effectively extract hierarchical features from the EEG. A convolutional block attention module (CBAM) is introduced to make the model focus on important feature channels and spatial information, thereby enhancing the discrimination of the learned features. Finally, convolution operations are used to capture local information and spatial relationships between features. We conduct intra-subject and inter-subject experiments on a publicly available dataset. The intra-subject experiments obtain 99.25% accuracy in segment-based detection and a 0.36/h false detection rate (FDR) in event-based detection, respectively. The inter-subject experiments obtain 84.32% accuracy. Both sets of experiments maintain high classification accuracy with a low number of parameters: the multiply-accumulate operations (MACs) amount to 25.86 M and the number of parameters is 0.57 M.
Chapter
High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach for parallel computing enabling the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OMP-opt). We benchmarked these libraries on a Parallel Ultra-Low-Power (PULP) MCU, highlighting that hardware-specific optimizations improve OMP-base performance by up to 69%. At the same time, OMP-SPMD leads to an extra improvement of up to 178%.