Server demo based on PIM computing architecture.

Source publication
Article
Full-text available
The widespread application of the wireless Internet of Things (IoT) is one of the leading factors in the emergence of Big Data. Huge amounts of data need to be transferred and processed. The bandwidth and latency of data transfers have posed a new challenge for traditional computing systems. Under Big Data application scenarios, the movement of lar...

Similar publications

Article
Full-text available
This paper presents the framework of a Malay isolated-digit speech recognition system. The framework design is based on a neural network (NN), one of the most widely used methods for developing speech recognition systems. An NN is a computational paradigm consisting of interconnected nerve cells.
Article
Full-text available
Parameter estimation of Direction of Arrival (DOA) using deterministic and stochastic computing paradigms is an enabling development for underwater acoustic signal processing, besides its applications in seismology, astronomy, earthquake monitoring, and biomedicine. In this work, the comparative study between state-of-the-art deterministic and heur...
Preprint
Edge computing (EC) is a promising solution for enabling next-generation delay-critical network services that are not conceivable in the traditional cloud-based architecture. EC brings computing and storage resources closer to end-users at the edge of the network to eliminate the propagation delays caused by geographical distance. Howeve...
Article
Full-text available
In current power grids, a massive amount of power equipment raises various emerging requirements, e.g., data perception, information transmission, and real-time control. The existing cloud computing paradigm struggles to address issues and challenges such as rapid response and local autonomy. Microgrids contain diverse and adjustable power compon...
Article
Full-text available
In today’s era, the adoption of IoT-based edge devices is growing exponentially, which creates challenges in data acquisition, processing, and communication. In the edge computing paradigm, intelligence is shifted from the center to the edge by performing specific processing and prediction locally. Existing methods are not well computed with time...

Citations

... Considering the information asymmetry in the latency of the macro's I/O, caused by the huge gap in the quantity of data movement between input data and weight data when comparing in-sensor computing and in-memory computing (owing to their input-stationary and weight-stationary dataflows) [30][31][32], Table I focuses on the results of recent works on in/near-imager-computing CNN accelerator circuits for image processing, to provide a comprehensive comparison. In this work, we proposed a novel global-parallel processing concept to achieve ultralow latency by processing convolution operations in 2D. ...
Article
Full-text available
This paper presents an innovative approach to achieve ultralow-latency convolutional neural network (CNN) processing, which is critical for real-time image processing applications such as autonomous driving and virtual reality. Traditional CNN accelerators employing in/near-array-computing (inclusive of in/near-memory-computing and in/near-sensor-computing) architectures have struggled to meet real-time requirements due to latency bottlenecks encountered with conventional column-parallel processing for image processing. To address this challenge, we propose a novel, all-digital in-imager global-parallel binary convolutional neural network (IIGP-BNN) accelerator. This new approach employs a global-parallel processing concept, which enables multiply-and-accumulate operations (MACs) to be executed simultaneously within the imager array in a 2D manner, eliminating the additional latency associated with row-by-row processing and data access from random access memories (RAMs). In this design, convolution and subsampling operations using a 3 × 3 kernel are completed within just nine steps of global-parallel processing, regardless of image size. This results in a theoretical reduction of over 88.5% of repeated row scans compared to conventional column-parallel processing architectures, thus significantly reducing computing latency. We have designed and prototyped a 30 × 30 integrated imager and IIGP-BNN accelerator IC using a 0.18 μm CMOS process. This prototype achieved a latency of 3.22 μs/kernel on the first layer convolution at a power supply of 1 V and a clock frequency of 35.7 MHz. This represents a latency reduction of 35.6% compared to the state-of-the-art in/near-imager-computing works. This proposed global-parallel processing concept opens up the potential for processing high-resolution images in 4K and 8K with the same ultralow latency, marking a significant advancement in high-speed image processing.
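As an aside for readers, the global-parallel idea above can be illustrated with a short NumPy sketch (illustrative only, not the authors' imager circuit): a valid 3 × 3 convolution is decomposed into nine whole-array multiply-accumulate steps, one per kernel tap, so the number of steps depends on the kernel size rather than the image size. The 30 × 30 test size and variable names are assumptions for the demo.

```python
import numpy as np

def conv3x3_global_parallel(img, w):
    """Valid 3x3 convolution computed as nine whole-array
    multiply-accumulate steps (one per kernel tap): the step count
    scales with kernel size, not image size."""
    H, W = img.shape
    acc = np.zeros((H - 2, W - 2), dtype=img.dtype)
    for di in range(3):                      # nine steps in total
        for dj in range(3):
            # each step updates every output pixel at once
            acc += w[di, dj] * img[di:di + H - 2, dj:dj + W - 2]
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 2, size=(30, 30)).astype(np.int32)   # 30x30, like the prototype
    w = rng.integers(-1, 2, size=(3, 3)).astype(np.int32)      # small binary-style weights
    ref = np.array([[np.sum(img[i:i + 3, j:j + 3] * w)
                     for j in range(28)] for i in range(28)])
    assert np.array_equal(conv3x3_global_parallel(img, w), ref)
```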
... In the realm of waveform design and deployment, Wildman et al. [23] split Common-modem Hardware Integrated Library (CHIL) [24] software and hardware onto separate boards and connected them using AXI Chip2Chip cores. Yang et al. [25] proposed a Processing-In-Memory (PIM) architecture. In their setup, the master board accesses the slave board with a bridge built using AXI Chip2Chip ...
Article
Full-text available
In this paper, we present the Prism Bridge, a soft IP core developed to bridge FPGA-MPSoC systems using high-speed serial links. Considering the current trend of ubiquitous serial transceivers with staggeringly increasing line rates, minimizing overhead and maximizing data throughput becomes paramount. Hence, our main design goal is to maximize bandwidth utilization for AXI data, which we realize through an advanced packetization mechanism. We give an overview of the Prism Bridge’s design and analyze its half-duplex bandwidth utilization. Additionally, we discuss the results of the experiments we conducted to assess its real-world performance, including measurements of throughput and latency of various combinations of line rates, link-layer cores, and bridge cores. Using a serial link with a 16.375 Gbit/s line rate, the Prism Bridge with an advanced packetizing mechanism achieved an AXI write throughput of 1368.82 MiB/s and an AXI read throughput of 1376.62 MiB/s, an increase of 46.20% and 45.86%, respectively, compared with the de-facto industry-standard core. The advanced packetization mechanism had negligible impact on latency but required 69.15%–73.91% more LUTs and 33.62%–36.19% more flip-flops. We conclude that for most designs that support inter-chip AXI transactions and will not be limited to short transaction lengths, the higher data throughput of the Prism Bridge with an advanced packetization mechanism is worth its cost in additional logic resource utilization.
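As a quick sanity check on the reported figures (a sketch using only numbers quoted in the abstract; the raw line rate includes link-layer and line-coding overhead, so the exact attainable ceiling cannot be derived from the abstract alone), the measured AXI throughput corresponds to roughly 70% of the 16.375 Gbit/s raw line rate:

```python
# Rough bandwidth-utilization check from the figures quoted above.
LINE_RATE_GBPS = 16.375                                   # raw serial line rate (Gbit/s)
THROUGHPUT_MIBPS = {"write": 1368.82, "read": 1376.62}    # reported AXI throughput (MiB/s)

def mibps_to_gbps(mibps: float) -> float:
    """Convert MiB/s to Gbit/s (1 MiB = 2**20 bytes, 1 Gbit = 1e9 bits)."""
    return mibps * 2**20 * 8 / 1e9

for label, mibps in THROUGHPUT_MIBPS.items():
    gbps = mibps_to_gbps(mibps)
    # ~11.5 Gbit/s of AXI payload over a 16.375 Gbit/s raw link, i.e. ~70%
    print(f"{label}: {gbps:.2f} Gbit/s ({gbps / LINE_RATE_GBPS:.1%} of raw line rate)")
```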
... On the other hand, near-data architectures require refined control logic in order to designate which instructions will execute on the DRAM side and which will execute on the processor side [5]. Previous works in near-data processing demonstrate significant improvements in the domains of IoT [6], big data applications [7], and machine learning [8]. ...
Article
Full-text available
Smart sensing technologies and their inherent data-processing techniques have drawn considerable research and industrial attention in recent years. Recent developments in nanometer CMOS technologies have shown great potential to deal with the increasing demand of processing power that arises in these sensing technologies, from IoT applications to complicated medical devices. Moreover, circuit implementation, which could be based on a full analog or digital approach or, in most cases, on a mixed-signal approach, possesses a fundamental role in exploiting the full capabilities of sensing technologies. In addition, all circuit design methodologies include the optimization of several performance metrics, such as low power, low cost, small area, and high throughput, which impose critical challenges in the field of sensor design. This Special Issue aims to highlight advances in the development, modeling, simulation, and implementation of integrated circuits for sensing technologies, from the component level to complete sensing systems.
... In a recent bibliography, middleware-based applications focus on multimedia [33], intelligent service processing [34], high performance computing [35], mobile edge computing [36], fog computing [37], data transmission [38], the automotive industry [39], big data [40], web ontology [41], context-awareness [42], semantics [43] and service-oriented [44], microservice-oriented [45] or software-oriented [46] architectures. Since middleware is case specific, several studies have been conducted regarding design and implementation applications in IoT [47], environmental monitoring [48], and urban activities [49]. ...
Article
Full-text available
Map-Reduce is a programming model and an associated implementation for processing and generating large data sets. This model has a single point of failure: the master, which coordinates the work in a cluster. In contrast, wireless sensor networks (WSNs) are distributed systems that scale and feature large numbers of small, computationally limited, low-power, unreliable nodes. In this article, we provide a top-down approach explaining the architecture, implementation and rationale of a distributed fault-tolerant IoT middleware. Specifically, this middleware consists of multiple mini-computing devices (Raspberry Pi) connected in a WSN which implement the Map-Reduce algorithm. First, we explain the tools used to develop this system. Second, we focus on the Map-Reduce algorithm implemented to overcome common network connectivity issues, as well as to enhance operation availability and reliability. Lastly, we provide benchmarks for our middleware as a crowd tracking application for a preserved building in Greece (i.e., M. Hatzidakis' residence). The results of this study show that IoT middleware built from low-power and low-cost components is a viable solution for medium-sized cloud computing distributed and parallel computing centres. Potential uses of this middleware include monitoring buildings and indoor structures, in addition to crowd tracking to prevent the spread of COVID-19.
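For readers unfamiliar with the programming model referenced above, here is a minimal, generic Map-Reduce word-count sketch in Python; it illustrates the map/shuffle/reduce phases only and is not the middleware's actual API, which the abstract does not specify:

```python
from collections import defaultdict
from itertools import chain

def map_words(chunk: str):
    """Map phase: each node turns its input chunk into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_counts(pairs):
    """Shuffle + reduce phase: group pairs by key and sum the values."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

if __name__ == "__main__":
    chunks = ["sensor reading ok", "sensor reading failed", "sensor ok"]
    mapped = chain.from_iterable(map_words(c) for c in chunks)   # one call per node
    print(reduce_counts(mapped))   # {'sensor': 3, 'reading': 2, 'ok': 2, 'failed': 1}
```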
... In that work, the authors deploy a near-data graph processing accelerator that achieves high throughput and reduces power consumption, while also maintaining memory coherence by defining non-cacheable memory spaces. Another work [12] focuses on bitwise operation execution in memory, while the authors of [13] propose an NDP framework for IoT applications. Big data applications are also suitable for NDP due to the increased memory bandwidth requirements they exhibit [14]. ...
Thesis
Modern computer architectures face a performance scaling wall, as the throughput and power consumption bottleneck has shifted from the core pipeline towards DRAM latency and data transfer operations. This phenomenon can be partially attributed to the end of Dennard scaling and to the continuously shrinking size of transistors. As a result, the power density of integrated circuits has increased to a point where most of the cores in a multi-core architecture are forced to operate at near-threshold voltage levels. In order to address this issue, researchers tend to deviate from standard von Neumann architectures towards new computing models. In the last decade there has been a resurgence of the NDP paradigm, under which instructions are executed on the DRAM die instead of in the core pipeline. The amount of CPU-DRAM transactions is therefore significantly decreased, which positively affects the power dissipation and the achievable throughput of the system. Under this premise, in this dissertation we explore the NDP paradigm for high-performance and for low-power computing.

Regarding high-performance computing, we propose a novel approach that considers general-purpose loop execution. Our design employs an instruction scheduling methodology which issues each individual instruction on a custom integrated circuit acting as a loop accelerator located on the logic layer of a Hybrid Memory Cube (HMC) DRAM. There, instructions are iteratively executed in parallel in a software-pipelining fashion, while intermediate results are forwarded through an on-chip interconnection network.

Regarding low-power computing, we develop a novel timing analysis methodology based on the premises of static timing analysis (STA), specifically for low-power, low-end pipelines. The proposed timing methodology considers the excitation of the timing paths for each instruction supported by the ISA, and calculates the worst-case slack for each individual instruction. As a result, we obtain timing information at the instruction level and exploit this knowledge to adaptively scale the clock frequency according to the instruction types executing in the pipeline at any given time. We then employ the aforementioned better-than-worst-case (BTWC) methodology to co-design a pipeline from the ground up to support a clock scaling mechanism with cycle-to-cycle granularity. We focus on general-purpose code execution and implement our design on the logic layer of an HMC DRAM in order to enable near-data execution.

We evaluate both the high-performance and the low-power architectures with post-layout simulations in order to strengthen the validity of our designs. Results indicate a significant performance increase in terms of throughput over the baseline processors, while power consumption levels are critically reduced.
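The cycle-to-cycle clock scaling described above can be sketched in a few lines (a highly simplified software model; the delay values and instruction classes are hypothetical, not taken from the dissertation): the clock period for the next cycle is chosen from the worst-case delay of the instruction class about to execute, instead of the single worst case across all instructions.

```python
# Hypothetical worst-case path delays (ns) per instruction class, as an
# instruction-aware timing analysis might report them.
WORST_CASE_DELAY_NS = {"alu": 1.10, "load": 1.45, "store": 1.30, "branch": 0.95}
MARGIN_NS = 0.05   # guard band added on top of the analysed delay

def cycle_period_ns(instr_class: str) -> float:
    """Clock period for the next cycle, chosen per instruction class."""
    return WORST_CASE_DELAY_NS[instr_class] + MARGIN_NS

def run_time_ns(trace):
    """Execution time under adaptive scaling vs. a fixed worst-case clock."""
    adaptive = sum(cycle_period_ns(i) for i in trace)
    fixed = len(trace) * (max(WORST_CASE_DELAY_NS.values()) + MARGIN_NS)
    return adaptive, fixed

if __name__ == "__main__":
    trace = ["alu", "alu", "branch", "load", "alu", "store", "alu"]
    adaptive, fixed = run_time_ns(trace)
    print(f"adaptive: {adaptive:.2f} ns vs fixed worst-case clock: {fixed:.2f} ns")
```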
... The main processor is an ARM Cortex-A15 running at 2 GHz, with a 32 KB L1 instruction cache, a 64 KB L1 data cache, a 2 MB L2 cache, and 512 MB of memory. A single PIM processor is used and operates at 2 GHz [29,30]. 64 MB of memory is allocated for the RDT, which is sufficient for the test cases in our experiment. ...
Preprint
Data-Flow Integrity (DFI) is a well-known approach to effectively detecting a wide range of software attacks. However, its real-world application has been quite limited so far because of the prohibitive performance overhead it incurs. Moreover, the overhead is enormously difficult to overcome without substantially lowering the DFI criterion. In this work, an analysis is performed to understand the main factors contributing to the overhead. Accordingly, a hardware-assisted parallel approach is proposed to tackle the overhead challenge. Simulations on the SPEC CPU 2006 benchmark show that the proposed approach can completely verify the DFI defined in the original seminal work while reducing the performance overhead by 4x on average.
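To make the checked property concrete, the following is a minimal software model of classic DFI (in the spirit of the original scheme the abstract refers to, not the proposed hardware-assisted design): every store records the ID of the writing site in a runtime definitions table, and every load checks that the recorded writer belongs to the statically computed set of definitions allowed for that load. Site names and the example policy are illustrative.

```python
# Minimal model of data-flow integrity (DFI) checking.
# Static analysis would normally compute ALLOWED; here it is hand-written.
ALLOWED = {
    "load_balance": {"store_deposit", "store_withdraw"},  # permitted writers per load site
}

rdt = {}   # runtime definitions table: address -> ID of last writing store site

def checked_store(site: str, addr: int, memory: dict, value) -> None:
    memory[addr] = value
    rdt[addr] = site                        # record who wrote this address

def checked_load(site: str, addr: int, memory: dict):
    writer = rdt.get(addr)
    if writer not in ALLOWED[site]:         # definition not permitted for this load
        raise RuntimeError(f"DFI violation at {site}: data written by {writer}")
    return memory[addr]

if __name__ == "__main__":
    mem = {}
    checked_store("store_deposit", 0x100, mem, 42)
    print(checked_load("load_balance", 0x100, mem))    # ok -> 42
    checked_store("store_attacker", 0x100, mem, 666)   # e.g. an out-of-bounds write
    checked_load("load_balance", 0x100, mem)           # raises: DFI violation
```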
... Several previous works have already implemented distinct PIM architectures aiming both to exploit the abundant internal memory bandwidth and to reduce data movement through the memory system [2,5,7,18,27,47]. In particular, previous works such as [9,22,33,49,61,63] have taken advantage of these new memory architectures to accelerate unsupervised learning and IoT applications in distinct ways. ...
... In [62], the authors presented a PIM simulator that relies on the integration of three memory simulators to support different memory technologies and one architectural simulator to provide the interconnection and description of Central Processing Unit (CPU) architectures. Likewise, [63] presents a PIM architecture for wireless IoT applications that relies on the integration of one simulator, used for both the PIM and host processing elements, and a tool for estimating power consumption. Coupling several simulators to represent the desired computing system incurs drawbacks in the design life-cycle, making this simulation approach prohibitive. ...
... Also, since the involved simulators may have different accuracy levels, system modeling patterns, and technological constraint representations, the result of the simulation might not provide the desired precision. Although [63] utilizes the same architectural simulator for all the hardware components, components with different simulation accuracy levels are instantiated to compose the whole system. Thus, the simulation approach followed by [63] not only needs a particular synchronization mechanism but also does not reflect a real scenario, since the host processor is represented by a detailed event-driven processor description while the PIM elements are described only with atomic, zero-delay operations. ...
Article
Full-text available
Smart devices based on the Internet of Things (IoT) and Cyber-Physical Systems (CPS) are emerging as an important and complex set of applications in the modern world. These systems can generate massive amounts of data, due to the enormous quantity of sensors used in modern applications, which can either stress the communication mechanisms or require extra resources to process data locally. In the era of efficient smart devices, the idea of transmitting huge amounts of data is prohibitive. Furthermore, implementing traditional architectures imposes limits on achieving the required efficiency. Within area, power, and energy constraints, Processing-in-Memory (PIM) has emerged as a solution for efficiently processing big data. By using PIM, the generated data can be processed locally, with reduced power and energy costs, providing an efficient solution for the CPS and IoT data management problem. However, two main tools are fundamental in this scenario: a simulator that allows architectural performance and behavior analysis, which is essential within a project life-cycle, and a compiler able to automatically generate code for the targeted architecture, with obvious productivity improvements. Also, with the emergence of new technologies, the ability to simulate PIM coupled to the latest memory technologies is important. This work presents a framework able to simulate and automatically generate code for IoT PIM-based systems. Supported by the presented framework, this work also proposes an architecture that demonstrates an efficient IoT PIM system able to compute a real image recognition application. The proposed architecture is able to process 6× more frames per second than the baseline, while improving energy efficiency by 30×.
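From the two headline figures quoted above, one can back out the implied power relation (a back-of-the-envelope sketch; it assumes "energy efficiency" means frames per joule, which the abstract does not state explicitly):

```python
# Implied power ratio from the reported speedup and efficiency gain.
FPS_GAIN = 6.0          # frames per second: proposed / baseline
EFFICIENCY_GAIN = 30.0  # frames per joule:  proposed / baseline (assumed metric)

# power = (frames/s) / (frames/J), so the power ratio is FPS_GAIN / EFFICIENCY_GAIN
power_ratio = FPS_GAIN / EFFICIENCY_GAIN
print(f"proposed power ≈ {power_ratio:.2f}× the baseline "
      f"(about {1 / power_ratio:.0f}× lower)")   # ≈ 0.20×, i.e. about 5× lower
```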
Article
This study aims to explore the application status of Internet of Things and multimedia technology in English online teaching and its optimization measures. The Internet of Things is a technology that connects objects and enables information exchange through the network, while multimedia technology refers to the integration and processing of different forms of information, including text, images, sound, and video. Firstly, the development status of Internet of Things and multimedia technology domestically and internationally, as well as the development trends and demands of online English teaching, are analyzed. Secondly, through a questionnaire survey, data on the teaching methods, content, student feedback, and teachers’ teaching methods and effects in English online teaching were collected. Through in-depth data analysis and empirical research, this paper discusses how to integrate the Internet of Things and multimedia technology to build a more efficient online English teaching model. On this basis, the problems existing in the use of the Internet of Things and multimedia technology in English online teaching are summarized, such as uneven application of technology, lack of targeted teaching design, and insufficient interactivity. In response to these problems, a series of optimization measures are proposed, including balancing the application of technology, personalized teaching design, improving interactivity, cultivating independent learning ability and solving technical problems. Finally, the future development of English online teaching with the application of Internet of Things and multimedia technology is discussed, focusing on technological innovation and application, personalized and intelligent teaching, innovation in teaching modes and methods, the changing role of teachers, and the construction of evaluation systems.
Article
During the last few years, deep learning techniques have frequently been applied in large-scale image processing, detection in a variety of computer vision and cognitive tasks, and information analysis applications. The execution of deep learning algorithms such as CNNs and FCNNs requires high-dimensional matrix multiplication, which demands significant computational power. The frequent data movement between memory and core is one of the main reasons for considerable power consumption and latency, hence becoming a major performance bottleneck for conventional computing systems. To address this challenge, we propose an in-memory computing array that can perform computation directly within the memory, hence reducing the overhead associated with data movement. The proposed Random-Access Memory with in-situ Processing (RAMP) array reconfigures emerging magnetic random-access memory to realize logic and arithmetic functions inside the memory. Furthermore, the array supports independent operations over multiple rows and columns, which helps accelerate the execution of matrix operations. To validate the functionality and evaluate the performance of the proposed array, we perform extensive SPICE simulations. At 45 nm, the proposed array takes 5.39 ns, 0.68 ns, 0.68 ns, and 0.7 ns, and consumes 2.2 pJ/bit, 0.21 pJ/bit, 0.23 pJ/bit, and 0.7 pJ/bit, while performing memory write, memory read, logic, and arithmetic operations, respectively.
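Using only the per-bit figures quoted in the abstract, the sketch below scales them to an illustrative 32-bit word (an assumption: it takes the per-bit energies at face value, ignores peripheral and array-level overheads, and the word width is not specified in the abstract):

```python
# Per-operation figures quoted in the abstract (45 nm): latency and energy per bit.
OPS = {
    #              latency_ns  pJ_per_bit
    "write":       (5.39,      2.2),
    "read":        (0.68,      0.21),
    "logic":       (0.68,      0.23),
    "arithmetic":  (0.70,      0.7),
}
WORD_BITS = 32   # illustrative word width

for op, (latency_ns, pj_per_bit) in OPS.items():
    # naive scaling: energy grows with word width; latency taken per access
    print(f"{op:>10}: {latency_ns:.2f} ns, "
          f"{pj_per_bit * WORD_BITS:.1f} pJ per {WORD_BITS}-bit word")
```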