Overview of the ISAAC architecture (adapted from [38]).

Source publication
Article
Full-text available
Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as Near-Data Processing...

Citations

... The power efficiency of the ACC decreases because of the additional power and processing time spent transferring NN weight data and temporary data. As a countermeasure, placing local memory for these data near the ACC is known to be effective [4][5][6][7]. Among the data stored in the local memory, the weight data need not be rewritten as long as the NN is unchanged. ...
Article
Full-text available
Using a 3-D monolithic stacking memory technology based on crystalline oxide semiconductor (OS) transistors, we fabricated a test chip containing AI accelerator (ACC) memory for the weight data of a neural network (NN), backup memory for flip-flops (FFs), and CPU memory storing instructions and data. These memories are composed of two layers of OS transistors on Si CMOS, with the memories in each layer forming a bank. In this structure, bank switching of the ACC memory and the FF backup memory work together, so inference can be switched between different NNs with low latency and low power, extending the power-gating standby time. Consequently, a 92% reduction in power consumption is achieved for inference at a frame rate of 60 fps compared with a chip using static random access memory (SRAM) as the ACC memory.
... Figure 4 shows the structure of a ReRAM crossbar performing an analog MAC operation (a) and a Matrix Vector Multiplication (MVM) (b). In ReRAM memory arrays, each bitline (BL) is connected to each wordline (WL) via a ReRAM cell, which usually consists of a memristor with an access transistor, also known as the 1T1R cell design [21]. Applying a voltage V_i to a ReRAM cell with conductance G_i results in a current V_i × G_i flowing from the cell into the bitline. ...
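In the ideal case, ignoring wire resistance, device variation, and ADC quantization, this analog MVM reduces to a dot product of the wordline voltages with the cell conductance matrix. A minimal sketch of that ideal model, with illustrative names and values:

```python
import numpy as np

def crossbar_mvm(voltages, conductances):
    """Ideal analog MVM in a ReRAM crossbar (illustrative model).

    voltages:     (n,) wordline input voltages V_i, in volts
    conductances: (n, m) programmed cell conductances G_ij, in siemens
    returns:      (m,) bitline currents I_j = sum_i V_i * G_ij, in amperes
    """
    # Each cell contributes I = V_i * G_ij (Ohm's law); the currents on a
    # shared bitline add up by Kirchhoff's current law.
    return voltages @ conductances

# Example: 4 wordlines driving 3 bitlines
V = np.array([0.2, 0.0, 0.2, 0.1])              # input vector as voltages
G = np.random.uniform(1e-6, 1e-4, size=(4, 3))  # weights as conductances
I = crossbar_mvm(V, G)                          # three analog dot products
```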
... PIM ReRAM-based Accelerators. In recent years, PIM architectures [21] have regained popularity and attracted the attention of researchers due to the high memory requirements of modern machine learning applications and the limitations of traditional systems. Implementing DNN accelerators using memristors is quite effective; hence, most recent works leverage ReRAM crossbars to accelerate cognitive computing algorithms [4], [5], [7], [11], [12], [33], [34], [35], [36], [37], [38]. ...
Preprint
Full-text available
The primary operation in DNNs is the dot product of quantized input activations and weights. Prior works have proposed the design of memory-centric architectures based on the Processing-In-Memory (PIM) paradigm. Resistive RAM (ReRAM) technology is especially appealing for PIM-based DNN accelerators due to its high density for storing weights, low leakage energy, low read latency, and high performance in computing the DNN dot products massively in parallel within the ReRAM crossbars. However, the main bottleneck of these architectures is the energy-hungry analog-to-digital conversions (ADCs) required to perform analog computations in-ReRAM, which penalize the efficiency and performance benefits of PIM. To improve the energy efficiency of in-ReRAM analog dot-product computations, we present ReDy, a hardware accelerator that implements a ReRAM-centric Dynamic quantization scheme to take advantage of the bit-serial streaming and processing of activations. The energy consumption of ReRAM-based DNN accelerators is directly proportional to the numerical precision of the input activations of each DNN layer. In particular, ReDy exploits the fact that activations of CONV layers from Convolutional Neural Networks (CNNs), a subset of DNNs, are commonly grouped according to the size of their filters and the size of the ReRAM crossbars. ReDy then quantizes each group of activations on the fly with a different numerical precision, based on a novel heuristic that takes into account the statistical distribution of each group. Overall, ReDy greatly reduces the activity of the ReRAM crossbars and the number of A/D conversions compared to a static 8-bit uniform quantization. We evaluate ReDy on a popular set of modern CNNs. On average, ReDy provides 13% energy savings over an ISAAC-like accelerator with negligible accuracy loss and area overhead.
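ReDy's actual grouping and precision-selection heuristic is defined in the paper; the sketch below only illustrates the general idea of choosing a per-group bit-width from the group's value distribution and quantizing on the fly. The percentile-based rule and all names here are assumptions for illustration:

```python
import numpy as np

def pick_bits(group, coverage=0.999, max_bits=8):
    # Illustrative rule (not ReDy's heuristic): use the bits needed to
    # cover `coverage` of the group's 8-bit activations; rare large
    # outliers are clipped rather than paying full precision for them.
    peak = np.quantile(group, coverage)
    return int(min(max_bits, max(1, np.ceil(np.log2(peak + 1)))))

def quantize_group(group, bits):
    # Clip to the reduced range; fewer bits mean fewer bit-serial cycles
    # and fewer A/D conversions per input.
    return np.minimum(group, 2 ** bits - 1)

# Example: one group of CONV-layer activations, already 8-bit quantized
acts = np.random.poisson(6, size=512).clip(0, 255)
bits = pick_bits(acts)                  # e.g. 5 instead of a static 8
q = quantize_group(acts, bits)
```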
... DL acceleration being such a prolific and rapidly evolving field, we do not claim to cover exhaustively all the research works that have appeared so far; instead, we focus on the most influential contributions. Moreover, this survey can serve as a connecting point to previous surveys of accelerators in the AI and DL field [40,82,108,231] and to other surveys focused on more specific aspects of DL, such as the architecture-oriented optimization of sparse matrices [232] and Neural Architecture Search [46]. ...
... Compared to 2D memories, 3D stacking increases memory capacity and bandwidth, reduces access latency thanks to shorter on-chip interconnects and the use of wider buses, and can improve the performance and power efficiency of the system. In fact, 3D-stacked DRAM provides an order of magnitude higher bandwidth and up to 5× better energy efficiency than conventional 2D solutions, making the technology an excellent option for meeting the high-throughput and low-energy requirements of DNN accelerators [108]. Two main 3D-stacked memory standards have recently been proposed: the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM). ...
... Apart from the above-mentioned technological challenges, embedding processing elements into the memory and moving the computation closer to it requires rethinking the optimization of the system design to take into account the proximity of the processing logic to the main memory. Depending on the use case, this might involve redesigning the on-chip buffers in the logic die to exploit the lower latency and energy cost of accesses to main memory, as well as adopting new approaches for representing, partitioning, and mapping the dataflow of the application in order to exploit the highly parallel system enabled by the availability of multiple channels [108]. ...
Preprint
Full-text available
Recent trends in deep learning (DL) have established hardware accelerators as the most viable solution for several classes of high-performance computing (HPC) applications such as image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent advances in designing DL accelerators capable of meeting the performance requirements of HPC applications. In particular, it highlights the most advanced approaches to supporting deep learning acceleration, including not only GPU- and TPU-based accelerators but also design-specific hardware accelerators such as FPGA- and ASIC-based accelerators, Neural Processing Units, open-hardware RISC-V-based accelerators, and co-processors. The survey also describes accelerators based on emerging memory technologies and computing paradigms, such as 3D-stacked Processor-In-Memory, non-volatile memories (mainly Resistive RAM and Phase Change Memory) for in-memory computing, Neuromorphic Processing Units, and accelerators based on Multi-Chip Modules. It classifies the most influential architectures and technologies proposed in recent years, with the purpose of offering the reader a comprehensive perspective on the rapidly evolving field of deep learning. Finally, it provides some insights into future challenges for DL accelerators, such as quantum accelerators and photonics.
... For example, resistive memory-based designs are very sensitive to device-to-device variations, and they also require conversion of data between the analog and digital domains, which further drives up the energy cost and reduces throughput and accuracy. Furthermore, such analog computation-based technology requires significant changes in chip design and thereby incurs large design and manufacturing costs [8], [9]. ...
... RaceTrack and ReRAM, being yet to mature, lack the research and optimization that traditional SRAM has received. Being analog in nature, they also require digital-analog-digital conversions, which impair the signal, increase power use, and limit throughput [8], [9]. Our approach is instead based on conventional digital SRAM. ...
Preprint
Full-text available
DNNs are one of the most widely used Deep Learning models. The matrix multiplication operations for DNNs incur significant computational costs and are bottlenecked by data movement between the memory and the processing elements. Many specialized accelerators have been proposed to optimize matrix multiplication operations. One popular idea is to use Processing-in-Memory, where computations are performed by the memory storage elements, thereby reducing the overhead of data movement between processor and memory. However, most PIM solutions rely either on novel memory technologies that have yet to mature or on bit-serial computations, which have significant performance overhead and scalability issues. In this work, an in-SRAM digital multiplier is proposed to take the best of both worlds, i.e., performing GEMM in memory using only conventional SRAMs, without the drawbacks of bit-serial computations. This allows the user to design systems with significant performance gains using existing technologies with little to no modification. We first design a novel approximate bit-parallel multiplier that approximates multiplications with bitwise OR operations by leveraging multiple-wordline activation in the SRAM. We then propose DAISM - Digital Approximate In-SRAM Multiplier architecture, an accelerator for convolutional neural networks based on our novel multiplier. This is followed by a comprehensive analysis of trade-offs in area, accuracy, and performance. We show that under similar design constraints, DAISM reduces energy consumption by 25% and the number of cycles by 43% compared to state-of-the-art baselines.
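As a rough bit-level illustration of the OR-approximation idea (the real DAISM design operates on SRAM wordlines; this software model is an assumption for illustration), combining the shifted partial products with bitwise OR instead of addition drops the carries:

```python
def or_approx_mul(a: int, b: int) -> int:
    """Approximate a*b by OR-ing shifted partial products (no carries).

    A bit-level model of the effect of activating multiple SRAM wordlines
    at once, where overlapping bits merge as a bitwise OR rather than
    adding with carries. Illustrative only, not the DAISM circuit.
    """
    result, k = 0, 0
    while b:
        if b & 1:                  # partial product for bit k of b
            result |= a << k       # OR instead of a carry-propagating add
        b >>= 1
        k += 1
    return result

# Since OR(x, y) <= x + y for non-negative values, the approximation can
# only underestimate, and it is exact when no partial products overlap:
for a, b in [(3, 3), (5, 6), (12, 10)]:
    print(a, b, a * b, or_approx_mul(a, b))
```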
... Accordingly, for future research, the DEASE approach can be extended to uncertain data, including fuzzy [76][77][78][79][80][81][82][83][84][85][86][87][88][89][90], stochastic [91][92][93][94][95][96][97][98][99][100][101][102], and interval [103][104][105][106][107][108][109][110][111][112][113] data. Additionally, the DEASE approach can be combined with machine learning approaches for the prediction of input and output data and, consequently, the evaluation of the future performance of DMUs [114][115][116][117][118][119][120][121][122][123][124][125]. ...
Article
Full-text available
The purpose of this study is to provide an efficient method for the selection of input–output indicators in the data envelopment analysis (DEA) approach, in order to improve the discriminatory power of the DEA method in the evaluation and performance analysis of homogeneous decision-making units (DMUs) in the presence of negative values and data. For this purpose, the Shannon entropy technique is used as one of the most important methods for determining the weights of indicators. Moreover, due to the presence of negative data in some indicators, the range directional measure (RDM) model is used as the basic model of the research. Finally, to demonstrate the applicability of the proposed approach, the food and beverage industry was selected from the Tehran stock exchange (TSE) as a case study, and data related to 15 stocks were extracted from this industry. The numerical and experimental results indicate the efficacy of the hybrid data envelopment analysis–Shannon entropy (DEASE) approach in evaluating stocks under negative data. Furthermore, the discriminatory power of the proposed DEASE approach is greater than that of a classical DEA model.
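The abstract does not spell out the weighting formulas; the sketch below shows the textbook Shannon-entropy weighting commonly used in such hybrid DEA pipelines, with the normalization choices assumed for illustration:

```python
import numpy as np

def entropy_weights(X):
    """Standard Shannon-entropy indicator weighting (illustrative).

    X: (n_units, n_indicators) non-negative performance matrix.
    Indicators whose values are more dispersed across units carry
    more information and therefore receive larger weights.
    """
    P = X / X.sum(axis=0, keepdims=True)           # column-wise shares
    k = 1.0 / np.log(X.shape[0])
    with np.errstate(divide="ignore", invalid="ignore"):
        E = -k * np.nansum(P * np.log(P), axis=0)  # entropy per indicator
    d = 1.0 - E                                    # degree of diversification
    return d / d.sum()

# Example: 5 DMUs (stocks) scored on 3 indicators, shifted to be
# non-negative first (e.g. via the RDM translation for negative data)
X = np.array([[3., 1., 4.],
              [2., 2., 4.],
              [5., 1., 4.],
              [4., 3., 4.],
              [1., 2., 4.]])
print(entropy_weights(X))   # the constant third indicator gets ~0 weight
```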
Article
Full-text available
In this study, a binarized neural network (BNN) of silicon diode arrays achieved vector–matrix multiplication (VMM) between binarized weights and inputs in these arrays. The diodes, which operate with a positive-feedback loop in their p⁺-n-p-n⁺ device structure, possess steep switching and bistable characteristics with an extremely low subthreshold swing (below 1 mV) and a high current ratio (approximately 10⁸). Moreover, the arrays show self-rectifying functionality and outstanding linearity, with an R-squared value of 0.99986, which allows a synaptic cell to be composed of a single diode. A 2 × 2 diode array can perform matrix multiply-accumulate operations for various binarized weight matrices with different input vectors, in close agreement with the ideal VMM, owing to the high reliability and uniformity of the diodes. Moreover, the disturbance-free, nondestructive readout and semi-permanent retention characteristics of the diode arrays support the feasibility of implementing the BNN.
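For reference, the digital equivalent of the VMM that the diode array performs is a sign-valued matrix product; a minimal sketch, assuming a ±1 encoding for both inputs and weights:

```python
import numpy as np

# Digital reference for the binarized VMM: inputs and weights take values
# in {-1, +1}, one diode-based synaptic cell per weight entry.
x = np.array([1, -1])                 # binarized input vector
W = np.array([[ 1, -1],               # 2x2 binarized weight matrix
              [-1, -1]])
y = x @ W                             # y_j = sum_i x_i * W_ij
print(y)                              # -> [2 0]
```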