Fig 1 - uploaded by Nhan Do
SST's 55 nm ESF3 NOR flash memory cells: (a) schematic view, and (b) TEM image of the cross-section of a "supercell" incorporating two floating-gate transistors with a common source (S) and erase gate (EG) [7].


Contexts in source publication

Context 1
... alternative way forward has been enabled by the progress of industrial flash memory technology, now featuring highly optimized floating-gate cells, which may be embedded into CMOS integrated circuits. For example, Fig. 1 shows the "supercell" of the advanced commercial 55 nm ESF3 NOR flash memory from SST Inc. [7]. ...
Context 2
... SST NOR flash memory is based on "supercells" with two floating-gate transistors sharing the source (S) and the erase gate (EG) but controlled by different word-line (WL) and coupling (CG) gates - see Fig. 1. In the original ESF3 memory arrays, the cells are connected as Fig. 2a shows, with six row lines per supercell, connecting transistor sources, erase gates, coupling gates, and word-line gates, while each column has only one ("bit") line connecting transistor drains ...
Context 3
... weight, and w_b is the "bias weight", which may be optimized to suppress the temperature dependence of the new output current. A straightforward analysis of this scheme, using Eq. (2), shows that after such optimization, the temperature drift of the output may be reduced to less than 1% over the [25 °C, 85 °C] interval, for any weight 0 < w_ij < 1. Fig. 11 shows the results of our preliminary experiments with this mode, showing drifts not exceeding 2.7% in that temperature interval. ...

Similar publications

Article
Full-text available
With the emergence of the big data era, various technologies have been proposed to cope with the exascale of data. For a considerably large volume of data, a single machine does not comprise enough resources to store the complete data. Hadoop distributed file system (HDFS) enables large datasets to be stored across the big data environment consisti...
Article
Full-text available
Natural local self-boosting (NLSB) was analyzed according to the location of a selected word-line (WL) where potential boosting occurs. When the same pattern occurred, it was found that the top cells (WL11 through WL15) and bottom cells (WL0 through WL4) have identically symmetrical potential boosting. In addition, in the region of the middle cells...

Citations

... Furthermore, this nonlinearity depends on process, voltage, and temperature (PVT). The weighting operation is essentially an analog multiplication, and the PVT-variable nonlinearity produces an imprecise multiplication, which eventually degrades the classification accuracy of the hardware DNN (Guo et al., 2017). Alternatively, encoding the analog input as a bi-level sequence of pulses or spikes alleviates the effect of such nonlinearity. ...
Article
Full-text available
We increasingly rely on deep learning algorithms to process colossal amounts of unstructured visual data. Commonly, these deep learning algorithms are deployed as software models on digital hardware, predominantly in data centers. The intrinsically high energy consumption of cloud-based deployment of deep neural networks (DNNs) has inspired researchers to look for alternatives, resulting in high interest in Spiking Neural Networks (SNNs) and dedicated mixed-signal neuromorphic hardware. As a result, there is an emerging challenge to transfer DNN architecture functionality to energy-efficient spiking non-volatile memory (NVM)-based hardware with minimal loss in the accuracy of visual data processing. The Convolutional Neural Network (CNN) is the staple choice of DNN for visual data processing. However, the lack of analog-friendly spiking implementations and alternatives for some core CNN functions, such as MaxPool, hinders the conversion of CNNs into the spike domain, thus hampering neuromorphic hardware development. To address this gap, in this work we propose MaxPool with temporal multiplexing for Spiking CNNs (SCNNs), which is amenable to implementation in mixed-signal circuits. We leverage the temporal dynamics of the internal membrane potential of Integrate & Fire neurons to enable MaxPool decision-making in the spiking domain. The proposed MaxPool models are implemented and tested within the SCNN architecture using a modified version of the aihwkit framework, a PyTorch-based toolkit for modeling and simulating hardware-based neural networks. The proposed spiking MaxPool scheme can decide even before the complete spatiotemporal input is applied, thus selectively trading off latency with accuracy.
It is observed that by allocating just 10% of the spatiotemporal input window for a pooling decision, the proposed spiking MaxPool achieves up to 61.74% accuracy in the CIFAR10 classification task with a 2-bit weight resolution (chosen to reflect foundry-integrated ReRAM limitations) after training with backpropagation, only about a 1% drop compared to the 62.78% accuracy of the full (100%) spatiotemporal window case at the same resolution. In addition, we propose the realization of one of the proposed spiking MaxPool techniques in an NVM crossbar array along with periphery circuits designed in a 130 nm CMOS technology. The energy-efficiency estimation results show competitive performance compared to recent neuromorphic chip designs.
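The early-decision idea behind this spiking MaxPool can be illustrated in a few lines. This is a toy sketch, not the paper's aihwkit implementation; the unit-weight integration and all names are illustrative:

```python
import numpy as np

def spiking_maxpool(spike_trains, window_frac=1.0):
    """Toy spike-domain MaxPool: each input drives an Integrate & Fire
    neuron (here, a plain accumulator), and the input whose membrane
    potential is highest after the observation window is selected.
    window_frac < 1 truncates the spatiotemporal window for an early
    decision, trading accuracy for latency as described above."""
    spike_trains = np.asarray(spike_trains)
    n_steps = spike_trains.shape[1]
    steps = max(1, int(window_frac * n_steps))   # early-decision window
    v = spike_trains[:, :steps].sum(axis=1)      # integrated membrane potentials
    return int(np.argmax(v))                     # index of the "max" input
```

When the densest spike train is dense from the start, even a 10% window yields the same winner as the full window, which is the latency/accuracy trade-off described in the abstract.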
... The most prominent emerging technology proposals, including those based on emerging dense analog memory device circuits, are grouped according to the targeted low-level neuromorphic functionality -see, e.g. reviews in [85,86,87,88] and original work utilizing volatile [89,90,91,92,93,94,95,96,97,98,99] and nonvolatile [100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,96,116] memristors, phase change memories (PCM) [117,118,119,120,121,122,123], and nonvolatile NOR [124,125,126,114], and NAND [127,128,129], and organic volatile [130] floating gate memories, as well as multiferroic and spintronic [131,132,133,134,135], photonic [136,137,123,138,139,140,141,142,143,144,145,146], and superconductor [147,144,148] circuits. Many emerging devices and circuit technologies are currently being explored for neuromorphic hardware implementations. ...
Preprint
Cutting-edge detectors push sensing technology by further improving spatial and temporal resolution, increasing detector area and volume, and generally reducing backgrounds and noise. This has led to an explosion of data being generated in next-generation experiments. Therefore, near-sensor processing at the data source, with more powerful algorithms, is becoming increasingly important to more efficiently capture the right experimental data, reduce downstream system complexity, and enable faster and lower-power feedback loops. In this paper, we discuss the motivations and potential applications for on-detector AI. Furthermore, the unique requirements of particle physics can drive the development of novel AI hardware and design tools. We describe existing modern work for particle physics in this area. Finally, we outline a number of areas of opportunity where we can advance machine learning techniques, co-design workflows, and future microelectronics technologies, which will accelerate design, performance, and implementations for next-generation experiments.
... Analog Vector-By-Matrix Multiplication: The emergence of dense analog-grade nonvolatile memories in the past two decades renewed interest in analog-circuit implementations of vectorby-matrix multiplication (VMMs) (Widrow and Angel, 1962;Mead, 1990;Holmes et al., 1993;Chawla et al., 2004;Alibart et al., 2012;Bayat et al., 2015;Guo et al., 2017b), which is the most common and frequently performed operation of any neural network in training or inference (Hertz et al., 1991;Gerstner and Kistler, 2002). In the simplest case, such a circuit is comprised of a matrix of memory cells that serve as configurable resistors for encoding the matrix (synaptic) weights and peripheral sense amplifiers playing the role of neurons (Figure 10). ...
... However, these devices have relatively large areas (>10 3 F 2 , where F is the minimum feature size), leading to higher interconnect capacitance and hence larger time delays. More recent work focused on implementing mixed-signal networks with much denser (∼40 F 2 ) commercial NOR-flash memory arrays redesigned for analog computing applications Guo et al., 2017b). For example, a prototype of a 100k+-cell two-layer perceptron network fabricated in a 180-nm process with modified NOR-flash memory technology was reported in Guo et al. (2017a). ...
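The crossbar VMM principle described in the snippets above can be captured in one line of linear algebra. This is an idealized model with hypothetical conductance values; real arrays add sense amplifiers and non-idealities:

```python
import numpy as np

def analog_vmm(G, V):
    """Idealized crossbar vector-by-matrix multiply: each memory cell is a
    programmable conductance G[i, j] encoding a synaptic weight, the input
    vector is applied as row voltages V[i], and each column wire sums its
    cell currents I[j] = sum_i G[i, j] * V[i] (Kirchhoff's current law),
    so the whole VMM happens in a single analog step."""
    return np.asarray(V) @ np.asarray(G)   # output currents, one per column
```

The device-level parallelism is the point: every multiply-accumulate in the matrix product happens simultaneously in the physics of the array, rather than sequentially in an ALU.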
Article
Full-text available
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science—the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
... On one side, many emerging resistive switching devices exhibit state-dependent programming variation, where lower conductance states are associated with lower variations [24], [37], which is less detrimental to inference accuracy. On the other side, programming variations in multi-bit Flash memories are generally more state-independent while also suffering from additional non-linear behaviors [38], [39]. In Level-2 and Level-3 inference pipelines, where signed weights are represented in two columns ( Figure 2-2), the variations are applied independently to each cell. ...
Thesis
Deep neural networks (DNNs) have achieved unprecedented capabilities in tasks such as analysis and recognition of images and voices, leading to their widespread adoption. However, the computation requirements and associated energy consumption of neural network implementations have been growing rapidly. In addition, traditional computing architectures are ineffective for DNN workloads due to their high memory access demands, making it even more challenging to meet these computational requirements. The most important limiting factor for DNN computing is the transfer of data between processors and off-chip memories, due to the limited density of existing on-chip memory technologies. In-memory computing (IMC) systems, utilizing the density advantage of emerging memory technologies like RRAM, can potentially store entire DNN models on-chip, thus eliminating off-chip memory access. In particular, analog IMC systems that utilize memory device properties to directly perform vector-matrix multiplication (VMM) operations allow device-level parallelism that promises drastic improvements in energy efficiency. However, analog computation makes computation accuracy an additional concern, even though neural networks are known for their fault tolerance. In general, for analog computing systems, computation accuracy needs to be ensured before any benefit in energy efficiency can materialize. This dissertation presents studies on the implementation of DNNs in realistic analog IMC systems from an accuracy perspective, under realistic memory device and system non-idealities. In this work, memory device performance requirements were established, and methods to mitigate the impact of the non-idealities were also developed. Deterministic error sources including memory device on/off ratio, programming precision, array size limitation, and ADC characteristics were considered.
Stochastic error sources including device programming variation, device defects, and ADC noise were also considered. In particular, a tiled architecture was developed to mitigate the effects of limited practical memory array sizes, and the consequences of this architecture were carefully studied. First, inference operation on analog IMC systems is investigated. An architecture-aware training method was developed to mitigate the deterministic error sources, and noise injection was used to mitigate device programming variation. Using these mitigation methods, inference accuracies similar to those of the floating-point baselines were achieved on simulated realistic analog IMC systems for large-scale neural networks. Minimum requirements for device defect rate and programming variation were established. Second, DNN training on analog IMC systems was also explored. A mixed-precision training method was used, where weight updates are accumulated in software and only programmed onto memory devices when a certain threshold is reached. This drastically reduces device programming cycles during the training process while leveraging analog IMC systems for efficient computation in the forward and backward propagation passes. DNN training was shown to be effective in the simulated analog IMC system, even achieving better-than-floating-point-baseline validation accuracies in some situations.
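The threshold-based mixed-precision update described in this abstract can be sketched as follows. Names and constants are illustrative, and the dissertation's actual scheme may differ in detail:

```python
import numpy as np

def mixed_precision_update(device_w, accum, grad, lr=0.1, threshold=0.05):
    """One step of a threshold-based mixed-precision update. Weight updates
    are accumulated in a high-precision software buffer (`accum`); an analog
    device programming pulse is issued only for cells whose accumulator
    magnitude crosses `threshold`, which drastically cuts the number of
    device programming cycles during training."""
    accum = accum - lr * grad                       # high-precision accumulation
    program = np.abs(accum) >= threshold            # cells to actually program
    device_w = device_w + np.where(program, accum, 0.0)
    accum = np.where(program, 0.0, accum)           # reset programmed entries
    return device_w, accum
```

Small gradients keep accumulating in software across many steps, so the (slow, wear-prone) device write happens only when the pending change is large enough to matter.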
... From the point of view of implementing a deep neural network, conventional computing systems have an inherent limitation due to data transmission between the physically separated memory and processor. To overcome this von Neumann bottleneck, also called the memory wall, neuromorphic computing systems that can perform computations directly inside the memory array have been drawing attention from many researchers [1][2][3][4][5][6][7][8][9][10][11][12][13][14][26][27][28][29][30][31][32][33][34]. In a neuromorphic system, a DNN can be mapped onto a crossbar array in which the synaptic weight (w) is stored in the conductance state of the memory device. ...
... To achieve superior performance and reliability of neuromorphic systems, it is essential to develop a synapse device and an array architecture with high-density integration, low-power operation, good reliability, precise synaptic weight modulation, and CMOS process compatibility. Many research groups have recently focused on implementing synapse arrays using semiconductor memory devices such as static random-access memory (SRAM), resistive random-access memory (RRAM), phase-change memory (PCM), spin-transfer torque random-access memory (STT-MRAM), floating-gate (FG) memory, and charge-trap flash (CTF) memory [14][15][16][17][18][19][20][32][33][34]. A specific comparison of synaptic memory devices is summarized in Table I. ...
Article
Full-text available
In this work, we proposed a three-dimensional (3-D) channel-stacked array architecture based on charge-trap flash (CTF) memory for an artificial neural network accelerator. The proposed synapse array architecture could be a promising solution for efficiently implementing a large-size artificial neural network on a limited-size hardware chip. We designed a full array architecture including a stacked-layer selection circuit. In addition, we investigated the synaptic characteristics of the CTF device by using technology computer-aided design (TCAD) simulation. We demonstrated the feasibility of the synapse array for neural network accelerators through a system-level MATLAB simulation with the Modified National Institute of Standards and Technology (MNIST) database.
... To that purpose, different memory solutions have been investigated for their adoption in neuromorphic systems and presented in literature. They include works based on crossbar arrays of resistive elements, mainly resistive switching random access memories (RRAM) [3][4][5] and phase change memories (PCM) [6,7], or based on memory arrays of charge storage devices, such as NAND and NOR Flash memory arrays [8][9][10][11][12][13][14][15]. ...
... For this reason, different implementations of neuromorphic systems based on NOR Flash memory arrays have been analyzed, including both supervised and unsupervised networks, which usually rely on some modification to the cell design [18][19][20], to the array design [2,[8][9][10] or to cells program/erase voltage schemes [11][12][13] to make the memory array operation compliant with the desired application. Notably, in [2] a fully integrated three-layer ANN (with dimensions 784 × 64 × 10) was implemented and tested for handwritten digits recognition via the gradient-descent method based on the backpropagation algorithm [21] reaching a 94.7% classification fidelity with a single-pattern classification time and energy equal to 1 µs and less than 20 nJ, respectively. ...
Article
Full-text available
In this work, we investigate the implementation of a neuromorphic digit classifier based on NOR Flash memory arrays as artificial synaptic arrays and exploiting a pulse-width modulation (PWM) scheme. Its performance in the presence of various noise sources is compared against what is achieved when a classical pulse-amplitude modulation (PAM) scheme is employed. First, by modeling the cell threshold voltage (VT) placement affected by program noise during a program-and-verify scheme based on incremental step pulse programming (ISPP), we show that the classifier truthfulness degradation due to the limited program accuracy achieved in the PWM case is considerably lower than that obtained with the PAM approach. Then, a similar analysis is carried out to investigate the classifier behavior after program in the presence of cell VT instabilities due to random telegraph noise (RTN) and to temperature variations, again leading to results in favor of the PWM approach. In light of these results, the present work suggests a viable solution to overcome some of the more serious reliability issues of NOR Flash-based artificial neural networks, paving the way to the implementation of highly reliable, noise-resilient neuromorphic systems.
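The PWM weighting principle discussed above, encoding the input in the duration of a fixed-amplitude pulse so that the cell's programmed current integrates into a charge proportional to weight times input, can be sketched as follows (hypothetical names and values):

```python
def pwm_multiply(cell_current, x, t_max=1e-6):
    """PWM-style analog multiply sketch: the normalized input x in [0, 1]
    sets the pulse width t = x * t_max, and the cell's programmed current
    (the weight) integrates over that pulse into an output charge
    Q = I * t, i.e. a charge proportional to weight * input."""
    t_pulse = x * t_max             # pulse width encodes the input
    return cell_current * t_pulse   # output charge in coulombs
```

Because the amplitude is fixed and only the timing varies, the multiplication is less sensitive to amplitude-domain noise on the cell current than a PAM scheme, which is the intuition behind the comparison in this abstract.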
... In this section, the most prominent emerging technology proposals, including those based on emerging dense analog memory device circuits, are grouped according to the targeted low-level neuromorphic functionality -see, e.g. reviews in [550][551][552][553] and original work utilizing volatile [554][555][556][557][558][559][560][561][562][563][564] and nonvolatile [565-580, 561, 581] memristors, phase change memories (PCM) [582][583][584][585][586][587][588], and nonvolatile NOR [589][590][591]579], and NAND [592][593][594], and organic volatile [595] floating gate memories, as well as multiferroic and spintronic [596][597][598][599][600], photonic [601,602,588,[603][604][605][606][607][608][609][610][611], and superconductor [612,609,613] circuits. More discussion is devoted to analog vector-by-matrix multiplication circuits in the following subsection because of their immediate value for today's state-of-the-art algorithms. ...
... The emergence of dense analog-grade nonvolatile memories in the past two decades renewed interest in analog-circuit implementations of vector-by-matrix multiplication (VMMs) [547,616,565,617,618,589,591], which is the most common and frequently performed operation of any neural network in training or inference [619,620]. In the simplest case, such a circuit is comprised of a matrix of memory cells that serve as configurable resistors for encoding the matrix (synaptic) weights and peripheral sense amplifiers playing the role of neurons (Fig. 13). ...
... However, these devices have relatively large areas (>10 3 F 2 , where F is the minimum feature size), leading to higher interconnect capacitance and hence larger time delays. More recent work focused on implementing mixed-signal networks with much denser (∼40 F 2 ) commercial NOR-flash memory arrays redesigned for analog computing applications [591,589]. For example, a prototype of a 100k+-cell two-layer perceptron network fabricated in a 180-nm process with modified NOR-flash memory technology was reported in Ref. [590]. ...
Preprint
Full-text available
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
... This opens the possibility of performing computation in the analog domain by exploiting device physics and circuit properties (e.g. Kirchhoff's laws) [9], [10], [12], [13]. It is well known that analog processing blocks are usually affected by circuit non-idealities such as noise and non-linearity, which limit the effective number of bits of the arithmetic operations. ...
... Programmable mirrors are realized with three-terminal floating-gate (FG) non-volatile memories, and the magnification factor depends on the threshold voltage difference ΔVth between the two FG non-volatile memories of the same mirror, programmed by injecting charge into the floating gate of the CM output device. All transistors operate in the sub-threshold region, which reduces the power consumption and allows a weight variation range larger than two orders of magnitude [9], [13], considering that the weight can be expressed as
... Mismatch between devices can be mitigated during the programming phase, given that the threshold voltage variation can be fully compensated with appropriate tuning of the charge injected into the FGs. This work is mainly focused on the inference phase: the experimental demonstration of programming analog weights is provided in a few papers, either based on CMOS technology [9], [12], [13], [22] or on 2D materials [17], using different injection mechanisms. ...
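The exponential weight dependence implied above follows from the subthreshold current law I ∝ exp((V_GS − V_th)/(n·V_T)); with both mirror devices at the same gate bias, the current ratio depends only on the programmed ΔVth. A minimal sketch, with an assumed slope factor and room-temperature thermal voltage:

```python
import math

def mirror_weight(delta_vth, n=1.5, v_t=0.0257):
    """Weight of a programmable current mirror in weak inversion: the
    output/input current ratio is exp(delta_vth / (n * v_t)), where
    delta_vth is the programmed threshold-voltage difference between the
    two FG devices, n is the subthreshold slope factor (assumed 1.5 here),
    and v_t = kT/q (about 25.7 mV at room temperature)."""
    return math.exp(delta_vth / (n * v_t))
```

A ΔVth of roughly 0.18 V already spans more than two orders of magnitude in weight, consistent with the range quoted in the snippet.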
Article
Full-text available
Embedding advanced cognitive capabilities in battery-constrained edge devices requires specialized hardware with new circuit architectures and - in the medium/long term - new device technology. We evaluate the potential of recently investigated devices based on 2D materials for the realization of analog deep neural networks by comparing the performance of neural networks based on the same circuit architecture using three different device technologies for transistors and analog memories. As a reference, an implementation on a standard 0.18 μm CMOS technology is also included in the comparison. Our architecture of choice makes use of current-mode analog vector-matrix multipliers based on programmable current mirrors consisting of transistors and floating-gate non-volatile memories. We consider experimentally demonstrated transistors and memories based on a monolayer molybdenum disulfide channel and ideal devices based on heterostructures of multilayer-monolayer PtSe2. Following a consistent methodology for device-circuit co-design and optimization, we estimate layout area, energy efficiency, and throughput as a function of the equivalent number of bits (ENOB), which is strictly correlated with classification accuracy. System-level tradeoffs are apparent: for a small ENOB, experimental MoS2 floating-gate devices are already very promising; in our comparison, a larger ENOB (7 bits) is only achieved with CMOS, signaling the necessity to improve the linearity and electrostatics of devices with 2D materials.
... The eFlash chip, fabricated in a GlobalFoundries 55 nm LPe process, includes a 12 × 10 redesigned industry-grade split-gate memory array. The packaged chip was previously used for developing a high-performance dot-product engine 52. Agilent B1500A and B1530A tools are used for measurements and pulse generation. ...
... We have developed a custom-made switch matrix on a printed circuit board, controlled via a lightweight microprocessor, to interface the Agilent tools with the chip. More details on the experimental setup, programming, erase, redesigned layout structure, half-select disturbance immunity, retention, and endurance characteristics are available in Ref. 52. All eFlash memories are programmed to their targeted states at V_WL = 1.5 V, V_CG = 2.5 V, V_BL = 1 V, V_SL = 0 V, and V_EG = 0 V, and are operated at the same biasing condition. ...
... Further, the devices are tuned one at a time by progressively increasing voltage pulses and using the write-verify algorithm. We have discussed the details of pulse amplitudes and durations in the programming phase in Ref. 52 . ...
Article
Full-text available
The increasing utility of specialized circuits and growing applications of optimization call for the development of efficient hardware accelerators for solving optimization problems. The Hopfield neural network is a promising approach for solving combinatorial optimization problems due to recent demonstrations of efficient mixed-signal implementations based on emerging non-volatile memory devices. Such mixed-signal accelerators also enable very efficient implementation of various annealing techniques, which are essential for finding optimal solutions. Here we propose a "weight annealing" approach, whose main idea is to ease convergence to the global minima by keeping the network close to its ground state. This is achieved by initially setting all synaptic weights to zero, thus ensuring a quick transition of the Hopfield network to its trivial global minimum state, and then gradually introducing the weights during the annealing process. Extensive numerical simulations show that our approach leads to better solutions, on average, for several representative combinatorial problems compared to prior Hopfield neural network solvers with chaotic or stochastic annealing. As a proof of concept, a 13-node graph partitioning problem and a 7-node maximum-weight independent set problem are solved experimentally using mixed-signal circuits based on, correspondingly, a 20 × 20 analog-grade TiO2 memristive crossbar and a 12 × 10 eFlash memory array.
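The weight-annealing idea above (start from the all-zero weight matrix, whose ground state is trivial, and scale the true weights back in gradually while the network relaxes) can be sketched as a toy implementation with illustrative parameters:

```python
import numpy as np

def hopfield_weight_annealing(W, n_stages=10, sweeps_per_stage=5, seed=0):
    """Sketch of "weight annealing" for a Hopfield network with symmetric
    weight matrix W (zero diagonal). The weights are introduced gradually,
    scaling W by stage/n_stages, and the network relaxes via asynchronous
    sign updates at each stage, keeping it near its current ground state."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    s = rng.choice([-1.0, 1.0], size=n)          # random initial spin state
    for stage in range(1, n_stages + 1):
        Wa = W * (stage / n_stages)              # partially introduced weights
        for _ in range(sweeps_per_stage):
            for i in rng.permutation(n):         # asynchronous updates
                s[i] = 1.0 if Wa[i] @ s >= 0 else -1.0
    return s
```

Asynchronous updates with a symmetric, zero-diagonal weight matrix never increase the Hopfield energy, so each stage settles into a local minimum of the partially introduced landscape before more weight is added.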
... Due to the nature of NAND flash memory, which lacks the random-access capability [1] of NOR flash memory [2,3] or other memories such as DRAM (Dynamic Random Access Memory) and PCM (Phase Change Memory), reading and writing operations on one cell inevitably accompany simultaneous operations on the other cells in a target NAND string [4,5]. Various combinations of the operation scheme, such as bit line voltage (V_BL), read voltage (V_READ), pass voltage (V_PASS), etc., are typically tested, and the optimal set is finally chosen by product engineers to minimize the threshold voltage (Vt) variation for the given as-fab-out chips [6][7][8][9]. ...
Article
Full-text available
Minimizing the variation in the threshold voltage (Vt) of programmed cells is required to an extreme level for realizing multi-level cells, recently storing as many as 5 bits per cell. In this work, a recent program scheme that writes the cells from the top (for instance, the 170th layer) to the bottom (the 1st layer), called the T-B scheme, in vertical NAND (VNAND) Flash memory is investigated to minimize Vt variation by reducing Z-interference. With the aid of Technology Computer-Aided Design (TCAD), the Z-interference for T-B (84 mV) is found to be better than for B-T (105 mV). Moreover, under scaled cell dimensions (e.g., Lg: 31→24 nm), the improvement becomes more pronounced (T-B: 126 mV and B-T: 162 mV), emphasizing the significance of the T-B program scheme for next-generation VNAND products with higher bit density.