Server demo based on PIM computing architecture.

Source publication
Article
Full-text available
The widespread application of the wireless Internet of Things (IoT) is one of the leading factors in the emergence of Big Data. Huge amounts of data need to be transferred and processed. The bandwidth and latency of data transfers have posed a new challenge for traditional computing systems. Under Big Data application scenarios, the movement of lar...

Similar publications

Article
Full-text available
This paper presents the framework of a Malay isolated-digit speech recognition system. The framework design is based on a neural network (NN), one of the most widely used methods for developing speech recognition systems. An NN is a computational paradigm consisting of interconnected nerve cells.
Article
Full-text available
Parameter estimation of Direction of Arrival (DOA) using deterministic and stochastic computing paradigms is an enabling development for underwater acoustic signal processing, besides its applications in seismology, astronomy, earthquake monitoring, and biomedicine. In this work, the comparative study between state-of-the-art deterministic and heur...
Preprint
Edge computing (EC) is a promising solution for enabling next-generation delay-critical network services that are not conceivable in the traditional cloud-based architecture. EC brings computing and storage resources closer to end-users at the edge of the network to eliminate the propagation delays caused by geographical distance. Howeve...
Article
Full-text available
In current power grids, a massive amount of power equipment raises various emerging requirements, e.g., data perception, information transmission, and real-time control. The existing cloud computing paradigm struggles to address issues and challenges such as rapid response and local autonomy. Microgrids contain diverse and adjustable power compon...
Article
Full-text available
In today’s era, the adoption of IoT-based edge devices is growing exponentially, which creates challenges in data acquisition, processing, and communication. In the edge computing paradigm, intelligence is shifted from the center to the edge by performing specific processing and prediction locally. Existing methods are not well computed with time...

Citations

... Considering the information asymmetry in the latency of the macro's I/O, caused by the huge gap in the quantity of data movement between input data and weight data when comparing in-sensor computing and in-memory computing (owing to their input-stationary and weight-stationary dataflows) [30][31][32], Table I focuses on the results of recent works on in/near-imager-computing CNN accelerator circuits for image processing, to provide a comprehensive comparison. In this work, we proposed a novel global-parallel processing concept to achieve ultralow latency by processing convolution operations in 2D. ...
Article
Full-text available
This paper presents an innovative approach to achieve ultralow-latency convolutional neural network (CNN) processing, which is critical for real-time image processing applications such as autonomous driving and virtual reality. Traditional CNN accelerators employing in/near-array-computing (inclusive of in/near-memory-computing and in/near-sensor-computing) architectures have struggled to meet real-time requirements due to latency bottlenecks encountered with conventional column-parallel processing for image processing. To address this challenge, we propose a novel, all-digital in-imager global-parallel binary convolutional neural network (IIGP-BNN) accelerator. This new approach employs a global-parallel processing concept, which enables multiply-and-accumulate operations (MACs) to be executed simultaneously within the imager array in a 2D manner, eliminating the additional latency associated with row-by-row processing and data access from random access memories (RAMs). In this design, convolution and subsampling operations using a 3 × 3 kernel are completed within just nine steps of global-parallel processing, regardless of image size. This results in a theoretical reduction of over 88.5% of repeated row scans compared to conventional column-parallel processing architectures, thus significantly reducing computing latency. We have designed and prototyped a 30 × 30 integrated imager and IIGP-BNN accelerator IC using a 0.18 μm CMOS process. This prototype achieved a latency of 3.22 μs/kernel on the first layer convolution at a power supply of 1 V and a clock frequency of 35.7 MHz. This represents a latency reduction of 35.6% compared to the state-of-the-art in/near-imager-computing works. This proposed global-parallel processing concept opens up the potential for processing high-resolution images in 4K and 8K with the same ultralow latency, marking a significant advancement in high-speed image processing.
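As an aside for readers, the global-parallel idea above can be illustrated with a short NumPy sketch (illustrative only, not the authors' imager circuit): a valid 3 × 3 convolution is decomposed into nine whole-array multiply-accumulate steps, one per kernel tap, so the number of steps depends on the kernel size rather than the image size. The 30 × 30 test size and variable names are assumptions for the demo.

```python
import numpy as np

def conv3x3_global_parallel(img, w):
    """Valid 3x3 convolution computed as nine whole-array
    multiply-accumulate steps (one per kernel tap): the step count
    scales with kernel size, not image size."""
    H, W = img.shape
    acc = np.zeros((H - 2, W - 2), dtype=img.dtype)
    for di in range(3):                      # nine steps in total
        for dj in range(3):
            # each step updates every output pixel at once
            acc += w[di, dj] * img[di:di + H - 2, dj:dj + W - 2]
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 2, size=(30, 30)).astype(np.int32)   # 30x30, like the prototype
    w = rng.integers(-1, 2, size=(3, 3)).astype(np.int32)      # small binary-style weights
    ref = np.array([[np.sum(img[i:i + 3, j:j + 3] * w)
                     for j in range(28)] for i in range(28)])
    assert np.array_equal(conv3x3_global_parallel(img, w), ref)
```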
... In the realm of waveform design and deployment, Wildman et al. [23] split Common-modem Hardware Integrated Library (CHIL) [24] software and hardware onto separate boards and connected them using AXI Chip2Chip cores. Yang et al. [25] proposed a Processing-In-Memory (PIM) architecture. In their setup, the master board accesses the slave board with a bridge built using AXI Chip2Chip ...
Article
Full-text available
In this paper, we present the Prism Bridge, a soft IP core developed to bridge FPGA-MPSoC systems using high-speed serial links. Considering the current trend of ubiquitous serial transceivers with staggeringly increasing line rates, minimizing overhead and maximizing data throughput becomes paramount. Hence, our main design goal is to maximize bandwidth utilization for AXI data, which we realize through an advanced packetization mechanism. We give an overview of the Prism Bridge’s design and analyze its half-duplex bandwidth utilization. Additionally, we discuss the results of the experiments we conducted to assess its real-world performance, including measurements of throughput and latency of various combinations of line rates, link-layer cores, and bridge cores. Using a serial link with a 16.375 Gbit/s line rate, the Prism Bridge with an advanced packetizing mechanism achieved an AXI write throughput of 1368.82 MiB/s and an AXI read throughput of 1376.62 MiB/s, an increase of 46.20% and 45.86%, respectively, compared with the de-facto industry-standard core. The advanced packetization mechanism had negligible impact on latency but required 69.15%–73.91% more LUTs and 33.62%–36.19% more flip-flops. We conclude that for most designs that support inter-chip AXI transactions and will not be limited to short transaction lengths, the higher data throughput of the Prism Bridge with an advanced packetization mechanism is worth its cost in additional logic resource utilization.
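As a quick sanity check on the reported figures (a sketch using only numbers quoted in the abstract; the raw line rate includes link-layer and line-coding overhead, so the exact attainable ceiling cannot be derived from the abstract alone), the measured AXI throughput corresponds to roughly 70% of the 16.375 Gbit/s raw line rate:

```python
# Rough bandwidth-utilization check from the figures quoted above.
LINE_RATE_GBPS = 16.375                                   # raw serial line rate (Gbit/s)
THROUGHPUT_MIBPS = {"write": 1368.82, "read": 1376.62}    # reported AXI throughput (MiB/s)

def mibps_to_gbps(mibps: float) -> float:
    """Convert MiB/s to Gbit/s (1 MiB = 2**20 bytes, 1 Gbit = 1e9 bits)."""
    return mibps * 2**20 * 8 / 1e9

for label, mibps in THROUGHPUT_MIBPS.items():
    gbps = mibps_to_gbps(mibps)
    # ~11.5 Gbit/s of AXI payload over a 16.375 Gbit/s raw link, i.e. ~70%
    print(f"{label}: {gbps:.2f} Gbit/s ({gbps / LINE_RATE_GBPS:.1%} of raw line rate)")
```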
... On the other hand, near-data architectures require refined control logic in order to designate which instructions will execute on the DRAM side and which will execute on the processor side [5]. Previous works in near-data processing demonstrate significant improvements in the domains of IoT [6], big data applications [7], and machine learning [8]. ...
Article
Full-text available
Smart sensing technologies and their inherent data-processing techniques have drawn considerable research and industrial attention in recent years. Recent developments in nanometer CMOS technologies have shown great potential to deal with the increasing demand of processing power that arises in these sensing technologies, from IoT applications to complicated medical devices. Moreover, circuit implementation, which could be based on a full analog or digital approach or, in most cases, on a mixed-signal approach, possesses a fundamental role in exploiting the full capabilities of sensing technologies. In addition, all circuit design methodologies include the optimization of several performance metrics, such as low power, low cost, small area, and high throughput, which impose critical challenges in the field of sensor design. This Special Issue aims to highlight advances in the development, modeling, simulation, and implementation of integrated circuits for sensing technologies, from the component level to complete sensing systems.
... In a recent bibliography, middleware-based applications focus on multimedia [33], intelligent service processing [34], high performance computing [35], mobile edge computing [36], fog computing [37], data transmission [38], the automotive industry [39], big data [40], web ontology [41], context-awareness [42], semantics [43] and service-oriented [44], microservice-oriented [45] or software-oriented [46] architectures. Since middleware is case specific, several studies have been conducted regarding design and implementation applications in IoT [47], environmental monitoring [48], and urban activities [49]. ...
Article
Full-text available
Map-Reduce is a programming model and an associated implementation for processing and generating large data sets. This model has a single point of failure: the master, which coordinates the work in a cluster. In contrast, wireless sensor networks (WSNs) are distributed systems that scale and feature large numbers of small, computationally limited, low-power, unreliable nodes. In this article, we provide a top-down approach explaining the architecture, implementation and rationale of a distributed fault-tolerant IoT middleware. Specifically, this middleware consists of multiple mini-computing devices (Raspberry Pi) connected in a WSN which implement the Map-Reduce algorithm. First, we explain the tools used to develop this system. Second, we focus on the Map-Reduce algorithm implemented to overcome common network connectivity issues, as well as to enhance operation availability and reliability. Lastly, we provide benchmarks for our middleware as a crowd tracking application for a preserved building in Greece (i.e., M. Hatzidakis' residence). The results of this study show that IoT middleware built from low-power and low-cost components is a viable solution for medium-sized cloud computing distributed and parallel computing centres. Potential uses of this middleware include monitoring buildings and indoor structures, in addition to crowd tracking to prevent the spread of COVID-19.
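For readers unfamiliar with the programming model referenced above, here is a minimal, generic Map-Reduce word-count sketch in Python; it illustrates the map/shuffle/reduce phases only and is not the middleware's actual API, which the abstract does not specify:

```python
from collections import defaultdict
from itertools import chain

def map_words(chunk: str):
    """Map phase: each node turns its input chunk into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_counts(pairs):
    """Shuffle + reduce phase: group pairs by key and sum the values."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

if __name__ == "__main__":
    chunks = ["sensor reading ok", "sensor reading failed", "sensor ok"]
    mapped = chain.from_iterable(map_words(c) for c in chunks)   # one call per node
    print(reduce_counts(mapped))   # {'sensor': 3, 'reading': 2, 'ok': 2, 'failed': 1}
```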
... In that work, the authors deploy a near-data graph processing accelerator that achieves high throughput and reduces power consumption, while also maintaining memory coherence by defining non-cacheable memory spaces. Another work [12] focuses on bitwise operation execution in memory, while the authors of [13] propose an NDP framework for IoT applications. Big data applications are also suitable for NDP due to the increased memory bandwidth requirements they exhibit [14]. ...
Thesis
Modern computer architectures face a performance scaling wall, as the throughput and power consumption bottleneck has shifted from the core pipeline towards DRAM latency and data transfer operations. This phenomenon can be partially attributed to the end of Dennard scaling and to the continuously shrinking size of transistors. As a result, the power density of integrated circuits has increased to a point where most of the cores in a multi-core architecture are forced to operate at near-threshold voltage levels. In order to address this issue, researchers tend to deviate from standard von Neumann architectures towards new computing models. In the last decade there has been a resurgence of the NDP paradigm, under which instructions are executed on the DRAM die instead of in the core pipeline. The amount of CPU-DRAM transactions is therefore significantly decreased, which positively affects the power dissipation and the achievable throughput of the system. Under this premise, in this dissertation we explore the NDP paradigm for high-performance and for low-power computing.

Regarding high-performance computing, we propose a novel approach that considers general-purpose loop execution. Our design employs an instruction scheduling methodology which issues each individual instruction on a custom integrated circuit acting as a loop accelerator located on the logic layer of a Hybrid Memory Cube (HMC) DRAM. There, instructions are iteratively executed in parallel in a software-pipelining fashion, while intermediate results are forwarded through an on-chip interconnection network.

Regarding low-power computing, we develop a novel timing analysis methodology based on the premises of static timing analysis (STA), specifically for low-power, low-end pipelines. The proposed timing methodology considers the excitation of the timing paths for each instruction supported by the ISA, and calculates the worst-case slack for each individual instruction. As a result, we obtain timing information at the instruction level and exploit this knowledge to adaptively scale the clock frequency according to the instruction types executing in the pipeline at any given time. We then employ the aforementioned better-than-worst-case (BTWC) methodology to co-design a pipeline from the ground up to support a clock scaling mechanism with cycle-to-cycle granularity. We focus on general-purpose code execution and implement our design on the logic layer of an HMC DRAM in order to enable near-data execution.

We evaluate both the high-performance and the low-power architectures with post-layout simulations in order to strengthen the validity of our designs. Results indicate a significant performance increase in terms of throughput over the baseline processors, while power consumption levels are critically reduced.
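The cycle-to-cycle clock scaling described above can be sketched in a few lines (a highly simplified software model; the delay values and instruction classes are hypothetical, not taken from the dissertation): the clock period for the next cycle is chosen from the worst-case delay of the instruction class about to execute, instead of the single worst case across all instructions.

```python
# Hypothetical worst-case path delays (ns) per instruction class, as an
# instruction-aware timing analysis might report them.
WORST_CASE_DELAY_NS = {"alu": 1.10, "load": 1.45, "store": 1.30, "branch": 0.95}
MARGIN_NS = 0.05   # guard band added on top of the analysed delay

def cycle_period_ns(instr_class: str) -> float:
    """Clock period for the next cycle, chosen per instruction class."""
    return WORST_CASE_DELAY_NS[instr_class] + MARGIN_NS

def run_time_ns(trace):
    """Execution time under adaptive scaling vs. a fixed worst-case clock."""
    adaptive = sum(cycle_period_ns(i) for i in trace)
    fixed = len(trace) * (max(WORST_CASE_DELAY_NS.values()) + MARGIN_NS)
    return adaptive, fixed

if __name__ == "__main__":
    trace = ["alu", "alu", "branch", "load", "alu", "store", "alu"]
    adaptive, fixed = run_time_ns(trace)
    print(f"adaptive: {adaptive:.2f} ns vs fixed worst-case clock: {fixed:.2f} ns")
```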
... The main processor is an ARM Cortex-A15 running at 2 GHz, with a 32 KB L1 instruction cache, a 64 KB L1 data cache, a 2 MB L2 cache, and 512 MB of memory. A single PIM processor is used and operates at 2 GHz [29,30]. 64 MB of memory is allocated for the RDT, which is sufficient for the test cases in our experiment. ...
Preprint
Data-Flow Integrity (DFI) is a well-known approach to effectively detecting a wide range of software attacks. However, its real-world application has been quite limited so far because of the prohibitive performance overhead it incurs. Moreover, the overhead is enormously difficult to overcome without substantially lowering the DFI criterion. In this work, an analysis is performed to understand the main factors contributing to the overhead. Accordingly, a hardware-assisted parallel approach is proposed to tackle the overhead challenge. Simulations on the SPEC CPU 2006 benchmark show that the proposed approach can completely verify the DFI defined in the original seminal work while reducing the performance overhead by 4x on average.
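To make the checked property concrete, the following is a minimal software model of classic DFI (in the spirit of the original scheme the abstract refers to, not the proposed hardware-assisted design): every store records the ID of the writing site in a runtime definitions table, and every load checks that the recorded writer belongs to the statically computed set of definitions allowed for that load. Site names and the example policy are illustrative.

```python
# Minimal model of data-flow integrity (DFI) checking.
# Static analysis would normally compute ALLOWED; here it is hand-written.
ALLOWED = {
    "load_balance": {"store_deposit", "store_withdraw"},  # permitted writers per load site
}

rdt = {}   # runtime definitions table: address -> ID of last writing store site

def checked_store(site: str, addr: int, memory: dict, value) -> None:
    memory[addr] = value
    rdt[addr] = site                        # record who wrote this address

def checked_load(site: str, addr: int, memory: dict):
    writer = rdt.get(addr)
    if writer not in ALLOWED[site]:         # definition not permitted for this load
        raise RuntimeError(f"DFI violation at {site}: data written by {writer}")
    return memory[addr]

if __name__ == "__main__":
    mem = {}
    checked_store("store_deposit", 0x100, mem, 42)
    print(checked_load("load_balance", 0x100, mem))    # ok -> 42
    checked_store("store_attacker", 0x100, mem, 666)   # e.g. an out-of-bounds write
    checked_load("load_balance", 0x100, mem)           # raises: DFI violation
```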
... Several previous works have already implemented distinct PIM architectures aiming both to exploit the abundant internal memory bandwidth and to reduce data movement through the memory system [2,5,7,18,27,47]. In particular, previous works such as [9,22,33,49,61,63] have taken advantage of these new memory architectures to accelerate unsupervised learning and IoT applications in distinct ways. ...
... In [62], the authors presented a PIM simulator that relies on the integration of three memory simulators to support different memory technologies and one architectural simulator to provide the interconnection and description of Central Processing Unit (CPU) architectures. Likewise, [63] presents a PIM architecture for wireless IoT applications that relies on the integration of one simulator, used for both the PIM and host processing elements, and a tool for estimating power consumption. Coupling several simulators to represent the desired computing system incurs drawbacks in the design life-cycle, making this simulation approach prohibitive. ...
... Also, since the involved simulators may have different accuracy levels, system modeling patterns, and technological constraint representations, the result of the simulation might not provide the desired precision. Although [63] utilizes the same architectural simulator for all the hardware components, components with different simulation accuracy levels are instantiated to compose the whole system. Thus, the simulation approach followed by [63] not only needs a particular synchronization mechanism but also does not reflect a real scenario, since the host processor is represented by a detailed event-driven processor description while the PIM elements are described only with atomic, zero-delay operations. ...
Article
Full-text available
Smart devices based on the Internet of Things (IoT) and Cyber-Physical Systems (CPS) are emerging as an important and complex set of applications in the modern world. These systems can generate massive amounts of data, due to the enormous quantity of sensors used in modern applications, which can either stress the communication mechanisms or require extra resources to process data locally. In the era of efficient smart devices, the idea of transmitting huge amounts of data is prohibitive. Furthermore, implementing traditional architectures imposes limits on achieving the required efficiency. Within area, power, and energy constraints, Processing-in-Memory (PIM) has emerged as a solution for efficiently processing big data. By using PIM, the generated data can be processed locally, with reduced power and energy costs, providing an efficient solution for the CPS and IoT data management problem. However, two main tools are fundamental in this scenario: a simulator that allows architectural performance and behavior analysis, which is essential within a project life-cycle, and a compiler able to automatically generate code for the targeted architecture, with obvious productivity improvements. Also, with the emergence of new technologies, the ability to simulate PIM coupled to the latest memory technologies is important. This work presents a framework able to simulate and automatically generate code for IoT PIM-based systems. Supported by the presented framework, this work also proposes an architecture that demonstrates an efficient IoT PIM system able to compute a real image recognition application. The proposed architecture is able to process 6× more frames per second than the baseline, while improving energy efficiency by 30×.
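From the two headline figures quoted above, one can back out the implied power relation (a back-of-the-envelope sketch; it assumes "energy efficiency" means frames per joule, which the abstract does not state explicitly):

```python
# Implied power ratio from the reported speedup and efficiency gain.
FPS_GAIN = 6.0          # frames per second: proposed / baseline
EFFICIENCY_GAIN = 30.0  # frames per joule:  proposed / baseline (assumed metric)

# power = (frames/s) / (frames/J), so the power ratio is FPS_GAIN / EFFICIENCY_GAIN
power_ratio = FPS_GAIN / EFFICIENCY_GAIN
print(f"proposed power ≈ {power_ratio:.2f}× the baseline "
      f"(about {1 / power_ratio:.0f}× lower)")   # ≈ 0.20×, i.e. about 5× lower
```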
Article
This study aims to explore the application status of Internet of Things and multimedia technology in English online teaching and its optimization measures. The Internet of Things is a technology that connects objects and enables information exchange through the network, while multimedia technology refers to the integration and processing of different forms of information, including text, images, sound, and video. Firstly, the development status of Internet of Things and multimedia technology domestically and internationally, as well as the development trends and demands of online English teaching, are analyzed. Secondly, through a questionnaire survey, data on the teaching methods, content, student feedback, and teachers’ teaching methods and effects in English online teaching were collected. Through in-depth data analysis and empirical research, this paper discusses how to integrate the Internet of Things and multimedia technology to build a more efficient online English teaching model. On this basis, the problems existing in the use of the Internet of Things and multimedia technology in English online teaching are summarized, such as uneven application of technology, lack of targeted teaching design, and insufficient interactivity. In response to these problems, a series of optimization measures are proposed, including balancing the application of technology, personalized teaching design, improving interactivity, cultivating independent learning ability and solving technical problems. Finally, the future development of English online teaching with the application of Internet of Things and multimedia technology is discussed, focusing on technological innovation and application, personalized and intelligent teaching, innovation in teaching modes and methods, the changing role of teachers, and the construction of evaluation systems.
Article
During the last few years, deep learning techniques have frequently been applied in large-scale image processing, detection in a variety of computer vision and cognitive tasks, and information analysis applications. The execution of deep learning algorithms such as CNNs and FCNNs requires high-dimensional matrix multiplication, which demands significant computational power. The frequent data movement between memory and core is one of the main reasons for considerable power consumption and latency, hence becoming a major performance bottleneck for conventional computing systems. To address this challenge, we propose an in-memory computing array that can perform computation directly within the memory, hence reducing the overhead associated with data movement. The proposed Random-Access Memory with in-situ Processing (RAMP) array reconfigures emerging magnetic random-access memory to realize logic and arithmetic functions inside the memory. Furthermore, the array supports independent operations over multiple rows and columns, which helps accelerate the execution of matrix operations. To validate the functionality and evaluate the performance of the proposed array, we perform extensive SPICE simulations. At 45 nm, the proposed array takes 5.39 ns, 0.68 ns, 0.68 ns, and 0.7 ns, and consumes 2.2 pJ/bit, 0.21 pJ/bit, 0.23 pJ/bit, and 0.7 pJ/bit, while performing memory write, memory read, logic, and arithmetic operations, respectively.
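Using only the per-bit figures quoted in the abstract, the sketch below scales them to an illustrative 32-bit word (an assumption: it takes the per-bit energies at face value, ignores peripheral and array-level overheads, and the word width is not specified in the abstract):

```python
# Per-operation figures quoted in the abstract (45 nm): latency and energy per bit.
OPS = {
    #              latency_ns  pJ_per_bit
    "write":       (5.39,      2.2),
    "read":        (0.68,      0.21),
    "logic":       (0.68,      0.23),
    "arithmetic":  (0.70,      0.7),
}
WORD_BITS = 32   # illustrative word width

for op, (latency_ns, pj_per_bit) in OPS.items():
    # naive scaling: energy grows with word width; latency taken per access
    print(f"{op:>10}: {latency_ns:.2f} ns, "
          f"{pj_per_bit * WORD_BITS:.1f} pJ per {WORD_BITS}-bit word")
```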