Fig 1 - uploaded by Nhan Do
SST's 55 nm ESF3 NOR flash memory cells: (a) schematic view, and (b) TEM image of the cross-section of a "supercell" incorporating two floating-gate transistors with a common source (S) and erase gate (EG) [7].


Contexts in source publication

Context 1
... alternative way forward has been enabled by the progress of industrial flash memory technology, now featuring highly optimized floating-gate cells, which may be embedded into CMOS integrated circuits. For example, Fig. 1 shows the "supercell" of the advanced commercial 55 nm ESF3 NOR flash memory from SST Inc. [7]. ...
Context 2
... SST NOR flash memory is based on "supercells" with two floating-gate transistors sharing the source (S) and the erase gate (EG) but controlled by different word-line (WL) and coupling (CG) gates - see Fig. 1. In the original ESF3 memory arrays, the cells are connected as Fig. 2a shows, with six row lines per supercell, connecting transistor sources, erase gates, coupling gates, and word-line gates, while each column has only one ("bit") line connecting transistor drains ...
Context 3
... weight, and w_b is the "bias weight", which may be optimized to suppress the temperature dependence of the new output current. A straightforward analysis of this scheme, using Eq. (2), shows that after such optimization, the temperature drift of the output may be reduced to less than 1% over the [25 °C, 85 °C] interval, for any weight 0 < w_ij < 1. Fig. 11 shows the results of our preliminary experiments with this mode, showing drifts not exceeding 2.7% in that temperature interval. ...

Similar publications

Article
Full-text available
With the emergence of the big data era, various technologies have been proposed to cope with the exascale of data. For a considerably large volume of data, a single machine does not comprise enough resources to store the complete data. Hadoop distributed file system (HDFS) enables large datasets to be stored across the big data environment consisti...
Article
Full-text available
Natural local self-boosting (NLSB) was analyzed according to the location of a selected word-line (WL) where potential boosting occurs. When the same pattern occurred, it was found that the top cells (WL11 through WL15) and bottom cells (WL0 through WL4) have identically symmetrical potential boosting. In addition, in the region of the middle cells...

Citations

... Furthermore, this nonlinearity depends on process, voltage, and temperature (PVT). The weighting operation is essentially an analog multiplication, and the PVT-variable nonlinearity produces an imprecise multiplication, which eventually degrades the classification accuracy of the hardware DNN (Guo et al., 2017). Alternatively, encoding the analog input as a bi-level sequence of pulses or spikes alleviates the effect of such nonlinearity. ...
Article
Full-text available
We increasingly rely on deep learning algorithms to process colossal amounts of unstructured visual data. Commonly, these deep learning algorithms are deployed as software models on digital hardware, predominantly in data centers. The intrinsically high energy consumption of cloud-based deployment of deep neural networks (DNNs) has inspired researchers to look for alternatives, resulting in high interest in Spiking Neural Networks (SNNs) and dedicated mixed-signal neuromorphic hardware. As a result, there is an emerging challenge to transfer DNN architecture functionality to energy-efficient spiking non-volatile memory (NVM)-based hardware with minimal loss in the accuracy of visual data processing. The Convolutional Neural Network (CNN) is the staple choice of DNN for visual data processing. However, the lack of analog-friendly spiking implementations and alternatives for some core CNN functions, such as MaxPool, hinders the conversion of CNNs into the spike domain, thus hampering neuromorphic hardware development. To address this gap, in this work we propose MaxPool with temporal multiplexing for Spiking CNNs (SCNNs), which is amenable to implementation in mixed-signal circuits. We leverage the temporal dynamics of the internal membrane potential of Integrate & Fire neurons to enable MaxPool decision-making in the spiking domain. The proposed MaxPool models are implemented and tested within the SCNN architecture using a modified version of the aihwkit framework, a PyTorch-based toolkit for modeling and simulating hardware-based neural networks. The proposed spiking MaxPool scheme can decide even before the complete spatiotemporal input is applied, thus selectively trading off latency with accuracy.
It is observed that by allocating just 10% of the spatiotemporal input window for a pooling decision, the proposed spiking MaxPool achieves up to 61.74% accuracy in the CIFAR10 classification task with a 2-bit weight resolution (chosen to reflect foundry-integrated ReRAM limitations) after training with backpropagation, only about a 1% drop compared to the 62.78% accuracy of the full (100%) spatiotemporal window case at the same resolution. In addition, we propose the realization of one of the proposed spiking MaxPool techniques in an NVM crossbar array along with periphery circuits designed in a 130 nm CMOS technology. The energy-efficiency estimation results show competitive performance compared to recent neuromorphic chip designs.
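The early-decision idea behind this spiking MaxPool can be illustrated in a few lines. This is a toy sketch, not the paper's aihwkit implementation; the unit-weight integration and all names are illustrative:

```python
import numpy as np

def spiking_maxpool(spike_trains, window_frac=1.0):
    """Toy spike-domain MaxPool: each input drives an Integrate & Fire
    neuron (here, a plain accumulator), and the input whose membrane
    potential is highest after the observation window is selected.
    window_frac < 1 truncates the spatiotemporal window for an early
    decision, trading accuracy for latency as described above."""
    spike_trains = np.asarray(spike_trains)
    n_steps = spike_trains.shape[1]
    steps = max(1, int(window_frac * n_steps))   # early-decision window
    v = spike_trains[:, :steps].sum(axis=1)      # integrated membrane potentials
    return int(np.argmax(v))                     # index of the "max" input
```

When the densest spike train is dense from the start, even a 10% window yields the same winner as the full window, which is the latency/accuracy trade-off described in the abstract.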
... The most prominent emerging technology proposals, including those based on emerging dense analog memory device circuits, are grouped according to the targeted low-level neuromorphic functionality -see, e.g. reviews in [85,86,87,88] and original work utilizing volatile [89,90,91,92,93,94,95,96,97,98,99] and nonvolatile [100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,96,116] memristors, phase change memories (PCM) [117,118,119,120,121,122,123], and nonvolatile NOR [124,125,126,114], and NAND [127,128,129], and organic volatile [130] floating gate memories, as well as multiferroic and spintronic [131,132,133,134,135], photonic [136,137,123,138,139,140,141,142,143,144,145,146], and superconductor [147,144,148] circuits. Many emerging devices and circuit technologies are currently being explored for neuromorphic hardware implementations. ...
Preprint
Cutting-edge detectors push sensing technology by further improving spatial and temporal resolution, increasing detector area and volume, and generally reducing backgrounds and noise. This has led to an explosion of data being generated in next-generation experiments. Therefore, near-sensor processing at the data source, with more powerful algorithms, is becoming increasingly important to more efficiently capture the right experimental data, reduce downstream system complexity, and enable faster and lower-power feedback loops. In this paper, we discuss the motivations and potential applications for on-detector AI. Furthermore, the unique requirements of particle physics can drive the development of novel AI hardware and design tools. We describe existing modern work for particle physics in this area. Finally, we outline a number of areas of opportunity where we can advance machine learning techniques, co-design workflows, and future microelectronics technologies, which will accelerate design, performance, and implementations for next-generation experiments.
... Analog Vector-By-Matrix Multiplication: The emergence of dense analog-grade nonvolatile memories in the past two decades renewed interest in analog-circuit implementations of vectorby-matrix multiplication (VMMs) (Widrow and Angel, 1962;Mead, 1990;Holmes et al., 1993;Chawla et al., 2004;Alibart et al., 2012;Bayat et al., 2015;Guo et al., 2017b), which is the most common and frequently performed operation of any neural network in training or inference (Hertz et al., 1991;Gerstner and Kistler, 2002). In the simplest case, such a circuit is comprised of a matrix of memory cells that serve as configurable resistors for encoding the matrix (synaptic) weights and peripheral sense amplifiers playing the role of neurons (Figure 10). ...
... However, these devices have relatively large areas (>10 3 F 2 , where F is the minimum feature size), leading to higher interconnect capacitance and hence larger time delays. More recent work focused on implementing mixed-signal networks with much denser (∼40 F 2 ) commercial NOR-flash memory arrays redesigned for analog computing applications Guo et al., 2017b). For example, a prototype of a 100k+-cell two-layer perceptron network fabricated in a 180-nm process with modified NOR-flash memory technology was reported in Guo et al. (2017a). ...
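The crossbar VMM principle described in the snippets above can be captured in one line of linear algebra. This is an idealized model with hypothetical conductance values; real arrays add sense amplifiers and non-idealities:

```python
import numpy as np

def analog_vmm(G, V):
    """Idealized crossbar vector-by-matrix multiply: each memory cell is a
    programmable conductance G[i, j] encoding a synaptic weight, the input
    vector is applied as row voltages V[i], and each column wire sums its
    cell currents I[j] = sum_i G[i, j] * V[i] (Kirchhoff's current law),
    so the whole VMM happens in a single analog step."""
    return np.asarray(V) @ np.asarray(G)   # output currents, one per column
```

The device-level parallelism is the point: every multiply-accumulate in the matrix product happens simultaneously in the physics of the array, rather than sequentially in an ALU.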
Article
Full-text available
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science—the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
... On one side, many emerging resistive switching devices exhibit state-dependent programming variation, where lower conductance states are associated with lower variations [24], [37], which is less detrimental to inference accuracy. On the other side, programming variations in multi-bit Flash memories are generally more state-independent while also suffering from additional non-linear behaviors [38], [39]. In Level-2 and Level-3 inference pipelines, where signed weights are represented in two columns ( Figure 2-2), the variations are applied independently to each cell. ...
Thesis
Deep neural networks (DNNs) have achieved unprecedented capabilities in tasks such as analysis and recognition of images and voices, leading to their widespread adoption. However, the computation requirements and associated energy consumption of neural network implementations have been growing rapidly. In addition, traditional computing architectures are ineffective for DNN workloads due to their high memory access demands, making it even more challenging to meet these computational requirements. The most important limiting factor for DNN computing is the transfer of data between processors and off-chip memories, due to the limited density of existing on-chip memory technologies. In-memory computing (IMC) systems, utilizing the density advantage of emerging memory technologies like RRAM, can potentially store entire DNN models on-chip, thus eliminating off-chip memory access. In particular, analog IMC systems that utilize memory device properties to directly perform vector-matrix multiplication (VMM) operations allow device-level parallelism that promises drastic improvements in energy efficiency. However, analog computation makes computation accuracy an additional concern, even though neural networks are known for their fault tolerance. In general, for analog computing systems, computation accuracy needs to be ensured before any benefit in energy efficiency can materialize. This dissertation presents studies on the implementation of DNNs in realistic analog IMC systems from an accuracy perspective, under realistic memory device and system non-idealities. In this work, memory device performance requirements were established, and methods to mitigate the impact of the non-idealities were also developed. Deterministic error sources including memory device on/off ratio, programming precision, array size limitation, and ADC characteristics were considered.
Stochastic error sources including device programming variation, device defects, and ADC noise were also considered. In particular, a tiled architecture was developed to mitigate the effects of limited practical memory array sizes, and the consequences of this architecture were carefully studied. First, inference operation on analog IMC systems is investigated. An architecture-aware training method was developed to mitigate the deterministic error sources, and noise injection was used to mitigate device programming variation. Using these mitigation methods, inference accuracies similar to those of the floating-point baselines were achieved on simulated realistic analog IMC systems for large-scale neural networks. Minimum requirements for device defect rate and programming variation were established. Second, DNN training on analog IMC systems was also explored. A mixed-precision training method was used, where weight updates are accumulated in software and only programmed onto memory devices when a certain threshold is reached. This drastically reduces device programming cycles during the training process while leveraging analog IMC systems for efficient computation in the forward and backward propagation passes. DNN training was shown to be effective in the simulated analog IMC system, even achieving better-than-floating-point-baseline validation accuracies in some situations.
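The threshold-based mixed-precision update described in this abstract can be sketched as follows. Names and constants are illustrative, and the dissertation's actual scheme may differ in detail:

```python
import numpy as np

def mixed_precision_update(device_w, accum, grad, lr=0.1, threshold=0.05):
    """One step of a threshold-based mixed-precision update. Weight updates
    are accumulated in a high-precision software buffer (`accum`); an analog
    device programming pulse is issued only for cells whose accumulator
    magnitude crosses `threshold`, which drastically cuts the number of
    device programming cycles during training."""
    accum = accum - lr * grad                       # high-precision accumulation
    program = np.abs(accum) >= threshold            # cells to actually program
    device_w = device_w + np.where(program, accum, 0.0)
    accum = np.where(program, 0.0, accum)           # reset programmed entries
    return device_w, accum
```

Small gradients keep accumulating in software across many steps, so the (slow, wear-prone) device write happens only when the pending change is large enough to matter.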
... From the point of view of implementing a deep neural network, conventional computing systems have an inherent limitation due to data transmission between the physically separated memory and processor. To overcome this von Neumann bottleneck, also called the memory wall, neuromorphic computing systems that can perform computations directly inside the memory array have been drawing attention from many researchers [1][2][3][4][5][6][7][8][9][10][11][12][13][14][26][27][28][29][30][31][32][33][34]. In a neuromorphic system, a DNN can be mapped onto a crossbar array in which the synaptic weight (w) is stored in the conductance state of the memory device. ...
... To achieve superior performance and reliability of neuromorphic systems, it is essential to develop a synapse device and an array architecture with high-density integration, low-power operation, good reliability, precise synaptic weight modulation, and CMOS process compatibility. Many research groups have recently focused on implementing synapse arrays using semiconductor memory devices such as static random-access memory (SRAM), resistive random-access memory (RRAM), phase-change memory (PCM), spin-transfer torque random-access memory (STT-MRAM), floating-gate (FG) memory, and charge-trap flash (CTF) memory [14][15][16][17][18][19][20][32][33][34]. A specific comparison of synaptic memory devices is summarized in Table I. ...
Article
Full-text available
In this work, we proposed a three-dimensional (3-D) channel-stacked array architecture based on charge-trap flash (CTF) memory for an artificial neural network accelerator. The proposed synapse array architecture could be a promising solution for efficiently implementing a large-size artificial neural network on a limited-size hardware chip. We designed a full array architecture including a stacked-layer selection circuit. In addition, we investigated the synaptic characteristics of the CTF device by using technology computer-aided design (TCAD) simulation. We demonstrated the feasibility of the synapse array for neural network accelerators through a system-level MATLAB simulation with the Modified National Institute of Standards and Technology (MNIST) database.
... To that purpose, different memory solutions have been investigated for their adoption in neuromorphic systems and presented in literature. They include works based on crossbar arrays of resistive elements, mainly resistive switching random access memories (RRAM) [3][4][5] and phase change memories (PCM) [6,7], or based on memory arrays of charge storage devices, such as NAND and NOR Flash memory arrays [8][9][10][11][12][13][14][15]. ...
... For this reason, different implementations of neuromorphic systems based on NOR Flash memory arrays have been analyzed, including both supervised and unsupervised networks, which usually rely on some modification to the cell design [18][19][20], to the array design [2,[8][9][10] or to cells program/erase voltage schemes [11][12][13] to make the memory array operation compliant with the desired application. Notably, in [2] a fully integrated three-layer ANN (with dimensions 784 × 64 × 10) was implemented and tested for handwritten digits recognition via the gradient-descent method based on the backpropagation algorithm [21] reaching a 94.7% classification fidelity with a single-pattern classification time and energy equal to 1 µs and less than 20 nJ, respectively. ...
Article
Full-text available
In this work, we investigate the implementation of a neuromorphic digit classifier based on NOR Flash memory arrays as artificial synaptic arrays and exploiting a pulse-width modulation (PWM) scheme. Its performance in the presence of various noise sources is compared against what is achieved when a classical pulse-amplitude modulation (PAM) scheme is employed. First, by modeling the cell threshold voltage (VT) placement affected by program noise during a program-and-verify scheme based on incremental step pulse programming (ISPP), we show that the classifier truthfulness degradation due to the limited program accuracy achieved in the PWM case is considerably lower than that obtained with the PAM approach. Then, a similar analysis is carried out to investigate the classifier behavior after program in the presence of cell VT instabilities due to random telegraph noise (RTN) and to temperature variations, again leading to results in favor of the PWM approach. In light of these results, the present work suggests a viable solution to overcome some of the more serious reliability issues of NOR Flash-based artificial neural networks, paving the way to the implementation of highly reliable, noise-resilient neuromorphic systems.
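The PWM weighting principle discussed above, encoding the input in the duration of a fixed-amplitude pulse so that the cell's programmed current integrates into a charge proportional to weight times input, can be sketched as follows (hypothetical names and values):

```python
def pwm_multiply(cell_current, x, t_max=1e-6):
    """PWM-style analog multiply sketch: the normalized input x in [0, 1]
    sets the pulse width t = x * t_max, and the cell's programmed current
    (the weight) integrates over that pulse into an output charge
    Q = I * t, i.e. a charge proportional to weight * input."""
    t_pulse = x * t_max             # pulse width encodes the input
    return cell_current * t_pulse   # output charge in coulombs
```

Because the amplitude is fixed and only the timing varies, the multiplication is less sensitive to amplitude-domain noise on the cell current than a PAM scheme, which is the intuition behind the comparison in this abstract.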
... In this section, the most prominent emerging technology proposals, including those based on emerging dense analog memory device circuits, are grouped according to the targeted low-level neuromorphic functionality -see, e.g. reviews in [550][551][552][553] and original work utilizing volatile [554][555][556][557][558][559][560][561][562][563][564] and nonvolatile [565-580, 561, 581] memristors, phase change memories (PCM) [582][583][584][585][586][587][588], and nonvolatile NOR [589][590][591]579], and NAND [592][593][594], and organic volatile [595] floating gate memories, as well as multiferroic and spintronic [596][597][598][599][600], photonic [601,602,588,[603][604][605][606][607][608][609][610][611], and superconductor [612,609,613] circuits. More discussion is devoted to analog vector-by-matrix multiplication circuits in the following subsection because of their immediate value for today's state-of-the-art algorithms. ...
... The emergence of dense analog-grade nonvolatile memories in the past two decades renewed interest in analog-circuit implementations of vector-by-matrix multiplication (VMMs) [547,616,565,617,618,589,591], which is the most common and frequently performed operation of any neural network in training or inference [619,620]. In the simplest case, such a circuit is comprised of a matrix of memory cells that serve as configurable resistors for encoding the matrix (synaptic) weights and peripheral sense amplifiers playing the role of neurons (Fig. 13). ...
... However, these devices have relatively large areas (>10 3 F 2 , where F is the minimum feature size), leading to higher interconnect capacitance and hence larger time delays. More recent work focused on implementing mixed-signal networks with much denser (∼40 F 2 ) commercial NOR-flash memory arrays redesigned for analog computing applications [591,589]. For example, a prototype of a 100k+-cell two-layer perceptron network fabricated in a 180-nm process with modified NOR-flash memory technology was reported in Ref. [590]. ...
Preprint
Full-text available
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
... This opens the possibility of performing computation in the analog domain by exploiting device physics and circuit properties (e.g. Kirchhoff's laws) [9], [10], [12], [13]. It is well known that analog processing blocks are usually affected by circuit non-idealities such as noise and non-linearity, which limit the effective number of bits of the arithmetic operations. ...
... Programmable mirrors are realized with three-terminal floating-gate (FG) non-volatile memories, and the magnification factor depends on the threshold voltage difference ΔVth between the two FG non-volatile memories of the same mirror, programmed by injecting charge into the floating gate of the CM output device. All transistors operate in the sub-threshold region, which reduces the power consumption and allows a weight variation range larger than two orders of magnitude [9], [13], considering that the weight can be expressed as
... Mismatch between devices can be mitigated during the programming phase, given that the threshold voltage variation can be fully compensated with appropriate tuning of the charge injected into the FGs. This work is mainly focused on the inference phase: the experimental demonstration of programming analog weights is provided in a few papers, either based on CMOS technology [9], [12], [13], [22] or on 2D materials [17], using different injection mechanisms. ...
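The exponential weight dependence implied above follows from the subthreshold current law I ∝ exp((V_GS − V_th)/(n·V_T)); with both mirror devices at the same gate bias, the current ratio depends only on the programmed ΔVth. A minimal sketch, with an assumed slope factor and room-temperature thermal voltage:

```python
import math

def mirror_weight(delta_vth, n=1.5, v_t=0.0257):
    """Weight of a programmable current mirror in weak inversion: the
    output/input current ratio is exp(delta_vth / (n * v_t)), where
    delta_vth is the programmed threshold-voltage difference between the
    two FG devices, n is the subthreshold slope factor (assumed 1.5 here),
    and v_t = kT/q (about 25.7 mV at room temperature)."""
    return math.exp(delta_vth / (n * v_t))
```

A ΔVth of roughly 0.18 V already spans more than two orders of magnitude in weight, consistent with the range quoted in the snippet.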
Article
Full-text available
Embedding advanced cognitive capabilities in battery-constrained edge devices requires specialized hardware with new circuit architectures and - in the medium/long term - new device technology. We evaluate the potential of recently investigated devices based on 2D materials for the realization of analog deep neural networks by comparing the performance of neural networks based on the same circuit architecture using three different device technologies for transistors and analog memories. As a reference, an implementation on a standard 0.18 μm CMOS technology is also included in the comparison. Our architecture of choice makes use of current-mode analog vector-matrix multipliers based on programmable current mirrors consisting of transistors and floating-gate non-volatile memories. We consider experimentally demonstrated transistors and memories based on a monolayer molybdenum disulfide channel and ideal devices based on heterostructures of multilayer-monolayer PtSe2. Following a consistent methodology for device-circuit co-design and optimization, we estimate layout area, energy efficiency, and throughput as a function of the equivalent number of bits (ENOB), which is strictly correlated with classification accuracy. System-level tradeoffs are apparent: for a small ENOB, experimental MoS2 floating-gate devices are already very promising; in our comparison, a larger ENOB (7 bits) is only achieved with CMOS, signaling the necessity to improve the linearity and electrostatics of devices with 2D materials.
... The eFlash chip, fabricated in a GlobalFoundries 55 nm LPe process, includes a 12 × 10 redesigned industry-grade split-gate memory array. The packaged chip was previously used for developing a high-performance dot-product engine 52. Agilent B1500A and B1530A tools are used for measurements and pulse generation. ...
... We have developed a custom-made switch matrix on a printed circuit board, controlled via a lightweight microprocessor, to interface the Agilent tools with the chip. More details on the experimental setup, programming, erase, redesigned layout structure, half-select disturbance immunity, retention, and endurance characteristics are available in Ref. 52. All eFlash memories are programmed to their targeted states at V_WL = 1.5 V, V_CG = 2.5 V, V_BL = 1 V, V_SL = 0 V, and V_EG = 0 V, and are operated at the same biasing condition. ...
... Further, the devices are tuned one at a time by progressively increasing voltage pulses and using the write-verify algorithm. We have discussed the details of pulse amplitudes and durations in the programming phase in Ref. 52 . ...
Article
Full-text available
The increasing utility of specialized circuits and growing applications of optimization call for the development of efficient hardware accelerators for solving optimization problems. The Hopfield neural network is a promising approach for solving combinatorial optimization problems due to recent demonstrations of efficient mixed-signal implementations based on emerging non-volatile memory devices. Such mixed-signal accelerators also enable very efficient implementation of various annealing techniques, which are essential for finding optimal solutions. Here we propose a "weight annealing" approach, whose main idea is to ease convergence to the global minima by keeping the network close to its ground state. This is achieved by initially setting all synaptic weights to zero, thus ensuring a quick transition of the Hopfield network to its trivial global minimum state, and then gradually introducing the weights during the annealing process. Extensive numerical simulations show that our approach leads to better solutions, on average, for several representative combinatorial problems compared to prior Hopfield neural network solvers with chaotic or stochastic annealing. As a proof of concept, a 13-node graph partitioning problem and a 7-node maximum-weight independent set problem are solved experimentally using mixed-signal circuits based on, correspondingly, a 20 × 20 analog-grade TiO2 memristive crossbar and a 12 × 10 eFlash memory array.
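The weight-annealing idea above (start from the all-zero weight matrix, whose ground state is trivial, and scale the true weights back in gradually while the network relaxes) can be sketched as a toy implementation with illustrative parameters:

```python
import numpy as np

def hopfield_weight_annealing(W, n_stages=10, sweeps_per_stage=5, seed=0):
    """Sketch of "weight annealing" for a Hopfield network with symmetric
    weight matrix W (zero diagonal). The weights are introduced gradually,
    scaling W by stage/n_stages, and the network relaxes via asynchronous
    sign updates at each stage, keeping it near its current ground state."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    s = rng.choice([-1.0, 1.0], size=n)          # random initial spin state
    for stage in range(1, n_stages + 1):
        Wa = W * (stage / n_stages)              # partially introduced weights
        for _ in range(sweeps_per_stage):
            for i in rng.permutation(n):         # asynchronous updates
                s[i] = 1.0 if Wa[i] @ s >= 0 else -1.0
    return s
```

Asynchronous updates with a symmetric, zero-diagonal weight matrix never increase the Hopfield energy, so each stage settles into a local minimum of the partially introduced landscape before more weight is added.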
... Due to the nature of NAND flash memory, which lacks the random-access capability [1] of NOR flash memory [2,3] or other memories such as DRAM (Dynamic Random Access Memory) and PCM (Phase Change Memory), reading and writing operations on one cell inevitably accompany simultaneous operations on the other cells in a target NAND string [4,5]. Various combinations of the operation scheme, such as bit line voltage (V_BL), read voltage (V_READ), pass voltage (V_PASS), etc., are typically tested, and the optimal set is finally chosen by product engineers to minimize the threshold voltage (Vt) variation for the given as-fab-out chips [6][7][8][9]. ...
Article
Full-text available
Minimizing the variation in the threshold voltage (Vt) of programmed cells is required to an extreme level for realizing multi-level cells, recently storing as many as 5 bits per cell. In this work, a recent program scheme that writes the cells from the top (for instance, the 170th layer) to the bottom (the 1st layer), called the T-B scheme, in vertical NAND (VNAND) Flash memory is investigated to minimize Vt variation by reducing Z-interference. With the aid of Technology Computer-Aided Design (TCAD), the Z-interference for T-B (84 mV) is found to be better than for B-T (105 mV). Moreover, under scaled cell dimensions (e.g., Lg: 31→24 nm), the improvement becomes more pronounced (T-B: 126 mV and B-T: 162 mV), emphasizing the significance of the T-B program scheme for next-generation VNAND products with higher bit density.