Execution of floating-point seizure detection on the embedded computing platform. 

Source publication
Article
Ultra-low-power operation and extreme energy efficiency are strong requirements for a number of high-growth application areas requiring near-sensor processing, including the processing of biosignals. Parallel near-threshold computing is emerging as an approach to achieve significant improvements in energy efficiency while overcoming the performance de...

Citations

... An alternative is to port FP operations to low-cost fixed-point operations [13], but porting an FP application to its fixed-point equivalent is laborious and demands in-depth knowledge of the application domain. Moreover, porting an FP application to fixed point is not always energy-efficient: the extra instructions required to organize and normalize such operations so as to handle the dynamic range of FP numbers introduce significant overheads [14]. ...
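The normalization overhead mentioned in this excerpt can be made concrete with a minimal Q-format sketch (a toy illustration, not code from the cited works; the Q1.15 format and all names are assumptions):

```python
# Minimal sketch of porting an FP multiply to Q1.15 fixed point.
# The explicit shift after the multiply is the "normalization"
# work that FP hardware performs implicitly.

Q = 15  # fractional bits in the Q1.15 format

def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a Q1.15 integer."""
    return int(round(x * (1 << Q)))

def q15_mul(a: int, b: int) -> int:
    """Fixed-point multiply: widen, then renormalize by shifting."""
    return (a * b) >> Q  # renormalization step, absent in FP code

def from_q15(x: int) -> float:
    """Convert a Q1.15 integer back to a float."""
    return x / (1 << Q)

a, b = to_q15(0.5), to_q15(0.25)
print(from_q15(q15_mul(a, b)))  # prints 0.125
```

Choosing the number of fractional bits is exactly the laborious, application-specific analysis the excerpt refers to: it must be redone whenever the dynamic range of the data changes.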
... This Event Unit also controls clock gating between the proposed CGRA and the RI5CY cores. The RI5CY cores share an FPU cluster; the ratio between SFU [61,56] cores and RI5CY cores is 2:1, and the scheduling of FP operations is deterministic [14]. In Chapter 6, a heterogeneous cluster is introduced to further improve the performance of the PULP cluster [91] for near-sensor processing and embedded ML. ...
Thesis
There is an ever-increasing demand for energy efficiency (EE) in rapidly evolving Internet-of-Things end nodes. This pushes researchers and engineers to develop solutions that provide both Application-Specific Integrated Circuit-like EE and Field-Programmable Gate Array-like flexibility. One such solution is the Coarse Grain Reconfigurable Array (CGRA). Over the past decades, CGRAs have evolved and are competing to become mainstream hardware accelerators, especially for accelerating Digital Signal Processing (DSP) applications. Due to the over-specialization of computing architectures, the focus is shifting towards fitting an extensive data representation range into fewer bits; e.g., a 32-bit space can represent a more extensive data range with floating-point (FP) representation than with an integer representation. Computation using FP representation requires numerous encodings and leads to complex circuits for the FP operators, decreasing the EE of the entire system. This thesis presents the design of an EE ultra-low-power CGRA with native support for FP computation by leveraging an emerging paradigm of approximate computing called transprecision computing. We also present contributions to the compilation toolchain and the system-level integration of the CGRA in a System-on-Chip, to envision the proposed CGRA as an EE hardware accelerator. Finally, an extensive set of experiments using real-world algorithms employed in near-sensor processing applications is performed, and the results are compared with state-of-the-art (SoA) architectures. It is empirically shown that the proposed CGRA provides better results than SoA architectures in terms of power, performance, and area.
... The exponent inherent in floating-point representations such as single- or double-precision IEEE 754 floating-point numbers [12] assures a much larger dynamic range (the largest and smallest numbers that can be represented), which is especially important when processing extremely large data sets or when the range may be unpredictable. For instance, in [21], it is shown that the fixed-point implementation of bio-signal processing applications consumes more energy than the floating-point execution in hardware. Supporting floating-point operations thus becomes an essential requirement for CGRA-based architectures to be "truly flexible" accelerators for applications with different data types. ...
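The dynamic-range argument can be checked directly in Python, whose built-in float is an IEEE 754 binary64 (a quick illustration of the claim, not material from the cited papers):

```python
import sys

# A 32-bit signed integer spans roughly +/-2.1e9 ...
int32_max = 2**31 - 1

# ...whereas IEEE 754 trades significand precision for an exponent,
# giving a vastly larger dynamic range in the same bit budget.
f = sys.float_info  # properties of binary64 (double precision)
print(f"int32 max:      {int32_max:.3e}")   # about 2.147e+09
print(f"binary64 max:   {f.max:.3e}")       # about 1.798e+308
print(f"binary64 min>0: {f.min:.3e}")       # smallest normal positive
```

Even 32-bit binary32 reaches about ±3.4e38, far beyond any 32-bit integer, which is the property the excerpt relies on.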
Article
Coarse Grained Reconfigurable Arrays (CGRAs) are emerging as energy-efficient accelerators providing a high degree of flexibility in both academia and industry. However, with the recent advancements in algorithms and the performance requirements of applications, supporting only integer and logical arithmetic limits the interest of classical/traditional CGRAs. In this paper, we propose a novel CGRA architecture and an associated compilation flow supporting both integer and floating-point computations for the energy-efficient acceleration of DSP applications. Experimental results show that the proposed accelerator achieves a maximum of 4.61× speedup compared to a DSP-optimized, ultra-low-power RISC-V based CPU while executing seizure detection, a representative of a wide range of EEG signal processing applications, with an area overhead of 1.9×. The proposed CGRA achieves a maximum of 6.5× energy efficiency compared to the single-core CPU. When comparing against the multi-core CPU with 8 cores, the proposed CGRA achieves up to 4.4× energy gain.
... It can be used with a regular power source, which may lead to ripples riding on top of the signal. d) Detrending filter: this filter helps remove low-frequency artifacts that may lead to improper amplitudes, such as filtering out breathing noise when detecting an ECG signal [8]. ...
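A detrending filter like the one named in this excerpt can be sketched as subtracting a moving-average baseline from the raw trace (a toy illustration with a synthetic signal; the window length and signal parameters are assumptions, not values from the cited work):

```python
import math

# Minimal detrending sketch: subtract a moving average to remove
# slow baseline drift (e.g. breathing artifacts under an ECG trace).

def moving_average(x, w):
    """Centered moving average with edge clamping."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def detrend(x, w):
    """Remove the slow baseline estimated by the moving average."""
    baseline = moving_average(x, w)
    return [xi - bi for xi, bi in zip(x, baseline)]

# Synthetic trace: a 5 Hz "signal" riding on a slow linear drift,
# sampled at 100 Hz for 5 seconds.
t = [i / 100 for i in range(500)]
signal = [math.sin(2 * math.pi * 5 * ti) for ti in t]
drift = [0.5 * ti for ti in t]
raw = [s + d for s, d in zip(signal, drift)]
clean = detrend(raw, w=101)  # ~1 s window, longer than the signal period
```

Away from the edges the moving average tracks the linear drift exactly and averages the sine to nearly zero, so the residual is close to the drift-free signal.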
... As a result, such solutions do not scale up easily with the complexity of the system (i.e. the number of EEG channels), especially because of the limited bandwidth of low-power data links such as Bluetooth Low Energy or Zigbee [11], [12], [13]. ...
Article
Advancements in Digital Signal Processing (DSP) and machine learning techniques have boosted the popularity of Brain-Computer Interfaces (BCIs), where electroencephalography (EEG) is a widely accepted method to enable intuitive human-machine interaction. Nevertheless, the evolution of such interfaces is currently hampered by the unavailability of embedded platforms capable of delivering the required computational power at high energy efficiency while allowing for a small and unobtrusive form factor. To fill this gap, we developed BioWolf, a highly wearable (40×20×2 mm) BCI platform based on Mr. Wolf, a Parallel Ultra-Low-Power system on chip featuring nine RISC-V cores with DSP-oriented instruction-set extensions. BioWolf also integrates a commercial 8-channel medical-grade analog-to-digital converter and an ARM Cortex-M4 microcontroller unit (MCU) with Bluetooth Low Energy connectivity. To demonstrate the capabilities of the system, we implemented and tested a BCI featuring Canonical Correlation Analysis (CCA) of steady-state visual evoked potentials (SSVEP). The system achieves an average information transfer rate of 1.46 bits per second (bps), aligned with the state of the art of bench-top systems. Thanks to the reduced power envelope of the digital computational platform, which consumes less than the analog front-end, the total power budget is just 6.31 mW, providing up to 38 hours of operation on a 65 mAh battery. To our knowledge, our design is the first to explore the significant energy boost of a parallel MCU with respect to single-core MCUs for CCA-based BCIs.
... For SVM, a fixed-point approach is used to avoid executing all the computation in floating point. It has already been demonstrated [13] that this approach yields the best performance while preserving accuracy. As shown, the HD classifier achieves ≈2× faster execution and lower power at iso-accuracy compared to the SVM on the ARM Cortex-M4. ...
Conference Paper
Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a brain-inspired alternative to computing with scalars. Key properties of HD computing include a well-defined set of arithmetic operations on hypervectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operations. HD computing is about manipulating and comparing large patterns (binary hypervectors with 10,000 dimensions), making its efficient realization on minimalistic ultra-low-power platforms challenging. This paper describes HD computing's acceleration and its optimization of memory accesses and operations on a silicon prototype of the PULPv3 4-core platform (1.5 mm², 2 mW), surpassing the state-of-the-art classification accuracy (on average 92.4%) with a simultaneous 3.7× end-to-end speed-up and 2× energy saving compared to its single-core execution. We further explore the scalability of our accelerator by increasing the number of inputs and the classification window on a new generation of the PULP architecture featuring bit-manipulation instruction extensions and a larger number of cores (8). These together enable a near-ideal speed-up of 18.4× compared to the single-core PULPv3.
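The core hypervector operations this abstract refers to (binding, bundling, and similarity comparison) can be sketched in a few lines; this is a generic toy illustration of HD computing, not the paper's optimized implementation, and the dimensionality is shrunk from 10,000 for brevity:

```python
import random

D = 1000  # toy dimensionality; the paper uses 10,000-bit hypervectors
random.seed(0)

def rand_hv():
    """Random binary hypervector; unrelated pairs differ in ~50% of bits."""
    return [random.randint(0, 1) for _ in range(D)]

def bind(a, b):
    """Binding = elementwise XOR; associates two hypervectors."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(hvs):
    """Bundling = elementwise majority vote; superimposes a set."""
    return [1 if sum(bits) * 2 > len(hvs) else 0 for bits in zip(*hvs)]

def hamming(a, b):
    """Normalized Hamming distance: ~0.5 for unrelated hypervectors."""
    return sum(x != y for x, y in zip(a, b)) / D

a, b, c = rand_hv(), rand_hv(), rand_hv()
proto = bundle([a, b, c])  # class prototype from three examples
# The prototype stays close to its members (~0.25) but remains
# near 0.5 from an unrelated random hypervector.
print(hamming(proto, a), hamming(proto, rand_hv()))
```

Every operation is a bitwise, element-independent loop, which is why HD computing maps so naturally onto the parallel cores and bit-manipulation extensions described in the abstract.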
... For SVM, a fixed-point approach is used to avoid executing all the computation in floating point. It has already been demonstrated [13] that this approach yields the best performance while preserving accuracy. As shown, the HD classifier achieves ≈2× faster execution and lower power at iso-accuracy compared to the SVM on the ARM Cortex-M4. ...
Article
Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a brain-inspired alternative to computing with scalars. Key properties of HD computing include a well-defined set of arithmetic operations on hypervectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operations. HD computing is about manipulating and comparing large patterns (binary hypervectors with 10,000 dimensions), making its efficient realization on minimalistic ultra-low-power platforms challenging. This paper describes HD computing's acceleration and its optimization of memory accesses and operations on a silicon prototype of the PULPv3 4-core platform (1.5 mm², 2 mW), surpassing the state-of-the-art classification accuracy (on average 92.4%) with a simultaneous 3.7× end-to-end speed-up and 2× energy saving compared to its single-core execution. We further explore the scalability of our accelerator by increasing the number of inputs and the classification window on a new generation of the PULP architecture featuring bit-manipulation instruction extensions and a larger number of cores (8). These together enable a near-ideal speed-up of 18.4× compared to the single-core PULPv3.
Article
Timely and accurate seizure detection is of great importance for the diagnosis and treatment of epilepsy patients. Existing seizure detection models are often complex and time-consuming, highlighting the urgent need for lightweight seizure detection. Additionally, existing methods often neglect the key characteristic channels and spatial regions of electroencephalography (EEG) signals. To solve these issues, we propose a lightweight EEG-based seizure detection model named the lightweight inverted residual attention network (LRAN). Specifically, we employ a four-stage inverted residual mobile block (iRMB) to effectively extract hierarchical features from the EEG. A convolutional block attention module (CBAM) is introduced to make the model focus on important feature channels and spatial information, thereby enhancing the discrimination of the learned features. Finally, convolution operations are used to capture local information and spatial relationships between features. We conduct intra-subject and inter-subject experiments on a publicly available dataset. The intra-subject experiments obtain 99.25% accuracy in segment-based detection and a 0.36/h false detection rate (FDR) in event-based detection, respectively. The inter-subject experiments obtain 84.32% accuracy. Both sets of experiments maintain high classification accuracy with a low number of parameters: the multiply-accumulate operations (MACs) amount to 25.86 M and the number of parameters is 0.57 M.
Chapter
High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach for parallel computing enabling the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OMP-opt). We benchmarked these libraries on a Parallel Ultra-Low-Power (PULP) MCU, highlighting that hardware-specific optimizations improve OMP-base performance by up to 69%. At the same time, OMP-SPMD leads to an extra improvement of up to 178%.