Fig 6 - uploaded by Akash Levy
RADAR algorithm implementation. Note that there are limits on the number of pulses in both the coarse and fine control phases (not specified in the flowchart), beyond which the program operation is declared "fail." Cells with the same target range can be programmed at the same time.


Source publication
Article
HfO₂-based resistive RAM (RRAM) is an emerging nonvolatile memory technology that has recently been shown capable of storing multiple bits-per-cell. The energy/delay costs of an RRAM write operation are dependent on the number of pulses required for RRAM programming. The pulse count is often large when existing programming approaches are used for m...

Contexts in source publication

Context 1
... BL voltage V_BL (for fine SET) or the SL voltage V_SL (for fine RESET). This phase is similar to the GSR algorithm, except that the parameters are range-dependent, i.e., tuned specifically for each target range. Details about the choice of parameters for both resistance control phases are discussed in Section IV. The RADAR algorithm is shown in Fig. ...
Context 2
... range. We thus chose the simplest option, in which each coarse range has the largest possible width, from 0 to 35 kΩ. With this choice, after a single SET pulse in the Coarse Control phase, the algorithm switches to the Fine Control phase immediately. For this scenario, RADAR can be simplified by removing the red "Coarse Reset pulse" part of Fig. 6 (belonging to the Coarse Control phase). The coarse ranges did not provide significant benefit to pulse count in the technology we used but may be beneficial in other RRAM ...
Context 3
... voltage of 4 V in the Fine Control phase (without the cell achieving the target range), the WL voltage V_WL (which starts from the value used in the Coarse Control phase for each range) is stepped up with a 10-mV step size (about two or three steps were needed for the devices we tested). This case is indicated in blue (in the Fine Control phase) in Fig. 6. The reason to use V_BL in the fine SET phase can be explained as follows. Suppose the target range is Range 1, and after the coarse SET pulse (V_BL = 2 V, V_WL = 2.86 V, as shown in Table I), the cell resistance is higher than the target of 4.58 kΩ. In the fine SET phase [where the cell configuration is like a common source amplifier, ...
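The contexts above can be illustrated with a minimal Python sketch of the simplified flow (one coarse SET pulse, then fine SET/RESET with verify reads). It is only an illustration under stated assumptions, not the authors' implementation: `ToyCell`, its linear pulse response, the 0.1 V ramp step, the fine-phase start voltage, and the 4.0 to 6.0 kΩ target range are all hypothetical. Only the 2 V / 2.86 V coarse SET values, the 4 V ceiling on V_BL, the 10-mV V_WL step, and the existence of a pulse limit come from the excerpts.

```python
from dataclasses import dataclass

V_BL_MAX = 4.0     # fine SET ceiling on the BL voltage, from the text (V)
V_WL_STEP = 0.010  # WL step once V_BL saturates, from the text (V)
V_STEP = 0.1       # assumed BL/SL ramp step (V); not given in the excerpt


@dataclass
class ToyCell:
    """Hypothetical deterministic 1T1R cell; real cells are stochastic."""
    resistance_kohm: float = 20.0

    def set_pulse(self, v_bl: float, v_wl: float) -> None:
        # Toy rule: a SET pulse lowers the resistance.
        self.resistance_kohm = max(1.0, self.resistance_kohm - 0.5 * v_bl * v_wl)

    def reset_pulse(self, v_sl: float) -> None:
        # Toy rule: a RESET pulse raises the resistance.
        self.resistance_kohm += 0.5 * v_sl


def program_cell(cell, lo_kohm, hi_kohm, v_wl=2.86, v_start=2.0,
                 max_fine_pulses=50):
    """Return (success, total_pulse_count) for one programming attempt."""
    # Coarse Control, simplified to a single SET pulse (widest coarse range).
    cell.set_pulse(v_start, v_wl)
    v_bl = v_sl = v_start  # assumed starting point of the fine ramps
    fine_pulses = 0
    while True:
        r = cell.resistance_kohm  # verify read
        if lo_kohm <= r <= hi_kohm:
            return True, 1 + fine_pulses
        if fine_pulses >= max_fine_pulses:
            return False, 1 + fine_pulses  # pulse limit hit: declare "fail"
        if r > hi_kohm:
            # Fine SET: ramp V_BL; once it reaches the ceiling, step V_WL.
            cell.set_pulse(v_bl, v_wl)
            if v_bl < V_BL_MAX:
                v_bl = min(v_bl + V_STEP, V_BL_MAX)
            else:
                v_wl += V_WL_STEP
        else:
            # Fine RESET: ramp V_SL to raise the resistance.
            cell.reset_pulse(v_sl)
            v_sl += V_STEP
        fine_pulses += 1


ok, pulses = program_cell(ToyCell(), lo_kohm=4.0, hi_kohm=6.0)
print(ok, pulses)
```

With this toy model the default call reaches the target range after one coarse and four fine pulses; lowering `max_fine_pulses` to 2 makes the same attempt report failure, which mirrors the pulse limit mentioned in the caption.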

Citations

... Resistive random access memory (RRAM) is a suitable on-chip memory technology for edge computing because of its back-end-of-line compatibility with CMOS, high unit cell density, multiple-bits-per-cell storage, nonvolatility, and low-energy operation versus embedded flash [1]. Multiple-bits-per-cell RRAM ideally increases storage density by a factor of n [where n = log_2(# of levels per cell)] over single-bit-per-cell [2] but faces several challenges, illustrated in Fig. 1: (1) large read/write peripheral circuits attenuate storage density improvements, (2) for multiple-bits-per-cell operation, the decreased low-conductance state slows down the read operation, (3) the increased high-conductance state increases read energy, (4) conductance relaxation [3] can lead to high bit error rates (BERs) as narrow conductance distributions broaden with time, and (5) narrow conductance levels require more write pulses to target and hence cost more write energy. Finally, (6) to our knowledge, no prior RRAM storage macro fully integrates the circuitry needed for both multiple-bits-per-cell readout and write-verify operation. ...
... READ and WRITE commands enable multiple-bits-per-cell read and write-verify operations. During read, the sense amplifier's reference conductance is set to each level's upper readout threshold. As the level index is increased, it can be inferred which level a cell is in when the sense amplifier output changes from 1 (greater than reference) to 0 (less than reference). ...
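The readout described in this snippet amounts to sweeping the reference upward until the comparison flips. A small Python sketch of that idea follows; the function name `read_level`, the microsiemens units, and the threshold values are hypothetical and chosen only for illustration.

```python
def read_level(cell_conductance_us, upper_thresholds_us):
    """Infer the stored level by sweeping the reference conductance upward.

    upper_thresholds_us holds the assumed upper readout threshold of each
    level except the top one, in increasing order (microsiemens).
    """
    for level, reference in enumerate(upper_thresholds_us):
        # Sense amplifier output: 1 if the cell is above the reference.
        sense_out = 1 if cell_conductance_us > reference else 0
        if sense_out == 0:
            # Output flipped from 1 to 0: the cell sits in this level.
            return level
    # The output never flipped, so the cell is in the top level.
    return len(upper_thresholds_us)


# Hypothetical thresholds for a 4-level (2 b/cell) configuration.
thresholds = [10.0, 40.0, 80.0]
print(read_level(50.0, thresholds))
```

A linear sweep like this needs up to one comparison per threshold; the snippet does not say whether the macro uses a linear or some other search order, so the sweep order here is an assumption.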
Article
Designing compact and energy-efficient resistive RAM (RRAM) macros is challenging due to: 1) large read/write circuits that decrease storage density; 2) low-conductance cells that increase read latency; and 3) the pronounced effects of routing parasitics on high-conductance cell read energy. Multiple-bits-per-cell RRAM can boost storage density but has further challenges resulting from reliability problems due to conductance relaxation and slow write due to narrow conductance levels. This work presents a multiple-bits-per-cell RRAM macro called Efficient Multiple-Bits-per-Cell Embedded RRAM (EMBER), which: 1) demonstrates read/write circuit compaction through constrained optimization of driver and pass gate transistor sizes; 2) introduces a common-mode bleed conductance at the sense amplifier inputs, reducing read settling time by for low-conductance cells, and 3) cuts read path capacitance to further reduce read access time and energy. To address reliability and write speed, EMBER contains a configurable on-chip read/write controller. We present a level allocation scheme that uses array-level characterization data to find sufficiently reliable allocations, while simultaneously maximizing write bandwidth. EMBER is the first embedded RRAM storage macro to achieve fully integrated multiple-bits-per-cell readout and write-verification without any off-chip reference generation or sensing. The macro operates at with 64k $\times$ 48 $=$ 3 M cells in TSMC 40-nm CMOS, achieving 1 b/cell read operation with energy at, and 2 b/cell read with at . 1 b/cell write-verify operates with energy at (BER $<$ ), and 2 b/cell write-verify operates with at (BER $<$ ). The array-level endurance is found to be 10 K for 1–2 b/cell. Normalizing for process scaling, the macro demonstrates the highest effective RRAM cell density to date of for 1 b/cell and for 2 b/cell, an improvement of and, respectively, over the best prior work.
... RRAM is capable of storing multiple bits per cell [43,1,58]. This can be achieved by tuning the conductance using carefully crafted SET/RESET pulses, along with read operations to verify the conductance state. ...
... An insight from early RRAM work was that the write pulse supply's compliance current determined the final cell conductance after SET while the final cell conductance after RESET was determined by pulse amplitude [67]. Multiple-bits-per-cell write algorithms implement the idea of adjusting the compliance current by introducing RRAM unit cell WL write pulse amplitude modulation [89,58]. Coupling WL voltage modification with coarse and fine-grained pulse amplitude adjustments further reduces the pulse count needed to write a cell to a target conductance level compared to ISPP. ...
... Relaxation-aware write-verify range allocation schemes account for post-write cell conductance variation. These schemes allow less stable storage levels to relax more than stable levels [58]. A recent effort attempts to correct the time-based conductance variation by forcing the cell conductance to remain stable for a post-write time window [90]. ...
Thesis
In this dissertation, I present techniques for improving the power, performance, and area of integrated circuits (ICs) through 3-D integration of two emerging nanotechnologies: (1) resistive random-access memory (RRAM), a non-volatile memory with multiple-bits-per-cell storage capability, and (2) nanoelectromechanical (NEM) relays, nano-scale mechanical relays that can be actuated electrostatically. In modern ICs for edge computing, data movement between on and off-chip memories typically consumes a large fraction of the total power. Dense, non-volatile embedded memory can reduce/eliminate off-chip data movement by keeping frequently-read application data always on chip. RRAM is a good candidate for such a memory, especially because it can store multiple bits per cell, achieving high density on-chip storage. However, efficient and reliable operation with multiple-bits-per-cell RRAM has been a challenge due to (1) stochastic device behavior during programming that results in large pulse counts with traditional write-verify methods, and (2) reliability issues arising from resistance relaxation. Towards the goal of achieving efficient and reliable multiple-bits-per-cell RRAM, I present three contributions: (1) range-dependent adaptive resistance (RADAR) tuning, a fast and energy-efficient programming method for multiple-bits-per-cell RRAM that uses an adaptive combination of coarse- and fine-grained cell resistance tuning, yielding a 2.4x reduction in pulse count over prior methods, (2) characterization of resistance relaxation behavior in three RRAM technologies and analysis of its implications for multiple-bits-per-cell storage, and (3) efficient multiple-bits-per-cell embedded RRAM (EMBER), the first demonstration of a fully-integrated multiple-bits-per-cell RRAM macro. 
EMBER contains a multiple-bits-per-cell read and write controller with a high degree of flexibility that enables good level allocation (mitigating reliability issues from resistance relaxation) and programming scheme optimization (yielding low-energy, low-latency multiple-bits-per-cell writes). Finally, in reconfigurable ICs, in addition to the memories, the routing fabric consumes a large fraction of the overall area and power. I demonstrate that replacing CMOS routing switches with 3-D integrated multi-pole nanoelectromechanical (NEM) relays in a coarse-grained reconfigurable array (CGRA) can achieve 19% lower area and 10% lower power at iso-performance.
... Programming methods can usually be split into two phases: the first assigns conductance values to devices, referred to as weight mapping, and the second modulates the conductance state of each device to its assigned target conductance value, referred to as weight programming. Weight programming is usually performed using an iterative read-write verify (RW-verify) process [18], [19], [20], [21], as single-shot programming cannot be utilized to reliably reach a target conductance value. However, it is possible to use single-shot programming to place a device in either its highest, that is, SET, or lowest, that is, RESET, conductance state. ...
Article
Analog in-memory computing (AIMC) using memristive devices is considered a promising Non-von Neumann approach for deep learning (DL) inference tasks. However, inaccuracies in the programming of devices, that are attributed to conductance variations, pose a key challenge toward achieving sufficient compute precision for DL inference. Fortunately, conduction variations in memristive devices, such as phase-change memory (PCM) devices, exhibit a strong state dependence. This state dependence can be exploited in synaptic unit cells that comprise more than one memristive device, to encode positive or negative weights. In such multi-memristive unit cells, we propose a method that optimally maps the weights to the device conductance values, by maximizing the number of devices at the stable SET and RESET states. We demonstrate that this method reduces the matrix-vector multiplication (MVM) error and is more resilient to non-ideal device retention characteristics. With this approach, we increase the mean experimental inference accuracy of a network trained for MNIST classification by 0.71% on two PCM-based AIMC cores, and the hardware-realistic simulated top-1 accuracy of a network trained for ImageNet classification by 0.28%, while significantly reducing variability across multiple experiment instances.
... In the case of machine learning accelerators or VMM accelerators, the VCM cells are initially programmed during the training phase and then read out over a long time scale during the inference [22,44]. The programming is achieved via program-verify algorithms [45][46][47]. These algorithms adapt the resistance of individual 1T1R cells by repeatedly performing SET and RESET operations to bring the resistance into a previously specified range. ...
Chapter
Smart computing has demonstrated huge potential for various application sectors such as personalized healthcare and smart robotics. Smart computing aims to bring computing close to the source where the data is generated or stored. Memristor-based Computation-In-Memory (CIM) has the potential to realize such smart computing for data and computation intensive applications. This paper presents an overview and design present of CIM, covering from the architecture and circuit level down to the device level. On the circuit and device level, accelerators for machine learning will be presented and discussed, focusing on variability and reliability effects. We will discuss these aspects for Redox-based Resistive Random Access Memories (ReRAM) based on the Valence Change Mechanism (VCM) by employing the compact model JART VCM v1b.
... validated with realistic behaviors [13]. Furthermore, the write-verify process can be optimized for the RRAM technology that is characterized. ...
... Currently, RRAM devices embedded in CMOS nodes require a forming voltage higher than 3 V, exceeding the safe operating area (SOA) of core devices in sub-100-nm technologies [4], [14]. The voltage required in the other operations is lower, but still not compatible with devices of scaled CMOS technologies [13], [15]. The current is another key aspect in the writing phase, since it must be limited to avoid damaging the RRAM device during forming, and constrained in order to set the resistance value in the set-programming procedure. ...
Article
This paper presents the Flipped (F)-2T2R RRAM compute cell enhancing the performance of RRAM-based mixed-signal accelerators for deep neural networks (DNNs) in machine learning (ML) applications. The F-2T2R cell is designed to exploit the features of the FD-SOI technology and it achieves a large increase in cell output impedance, compared to the standard 1T1R cell. The paper also describes the modelling of an F-2T2R-based accelerator and its transistor-level implementation in a 22-nm FD-SOI technology. The modelling results and the accelerator performance are validated by simulation. The proposed design can achieve an energy efficiency of up to 1260 1b-TOPS/W, with a memory array of 256 rows and columns. From the results of our analytical framework, a ResNet18, mapped on the accelerator, can obtain an accuracy reduction below 2%, with respect to the floating point baseline, on the CIFAR-10 dataset.
... Previous works on connected RRAM device structures were only focusing on parallel connections between the electrodes [10]-[12] or have been using multiple cells to reduce variability [13]-[15]. There are also macros which use more than one memristive device in parallel to represent negative weights [16] although the mapping of negative weights which act as inhibitory neuronal connections can also be solved with other mapping techniques [17]. ...
... RRAMs and other emerging non-volatile memories have typically not been able to outperform SRAM and DRAM in benchmarking with in-memory computing specifications [11]-[13]. This is primarily due to the array's peripheral circuit design complexity to enable control and readout of state in the presence of significant variability [14], [15]. Many oxide/electrode materials have been explored to tackle this issue but integration and reliability remain a concern [1], [16]. ...
... Another ongoing line of research into improving density and energy efficiency is the use of memristive synapses with analog weights [48], [49], [50], [52], [53], [51], [54], [19]. Using analog weights for synapses can potentially reduce the number of synapses and neurons required to perform the same computations [21]. ...
... Using analog weights for synapses can potentially reduce the number of synapses and neurons required to perform the same computations [21]. At the moment there are some published works reported in exploiting multi-level analog values per memristor, however as of today, for good separation, low resistances are required [52], [53] and also there is a stochastic relaxation behaviour that takes several seconds after programming, thus slowing down the time efficiency of analog programming [54], [19]. ...
Article
The advent of nanoscale memristors raised hopes of being able to build CMOL (CMOS/nanowire/molecular) type ultra-dense in-memory-computing circuit architectures. In CMOL, nanoscale memristors would be fabricated at the inter-section of nanowires. The CMOL concept can be exploited in neuromorphic hardware by fabricating lower density neurons on CMOS and placing massive analog synaptic connectivity with nanowire and nanoscale-memristor fabric post-fabricated on top. However, technical problems have hindered such developments for presently available reliable commercial monolithic CMOS-memristor technologies. On one hand, each memristor needs a MOS selector transistor in series to guarantee forming and programming operations in large arrays. This results in compound MOS-memristor synapses (called 1T1R) which are no longer synapses at the crossing of nanowires. On the other hand, memristors do not yet constitute highly reliable, stable analog memories for massive analog-weight synapses with gradual learning. Here we demonstrate a pseudo-CMOL monolithic chip core that circumvents the two technical problems mentioned above by: (a) exploiting a CMOL-like geometrical chip layout technique to improve density despite the 1T1R limitation, and (b) exploiting a binary weight stochastic Spike-Timing-Dependent-Plasticity (STDP) learning rule that takes advantage of the more reliable binary memory capability of the memristors used. Experimental results are provided for a spiking neural network (SNN) CMOL-core with 64 input neurons, 64 output neurons and 4096 1T1R synapses, fabricated in 130nm CMOS with 200nm-sized Ti/HfOx/TiN memristors on top. The CMOL-core uses query-driven event read-out, which allows for memristor variability insensitive computations. 
Experimental system-level demonstrations are provided for plain template matching tasks, as well as regularized stochastic binary STDP feature-extraction learning, obtaining perfect recognition in hardware for a 4-letter recognition experiment.
... This behavior is typical for filamentary VCM cells [30], [31], [32]. It was observed that the states relax after programming, which leads to a widening of the distribution [33], [34], [35]. Moreover, it was shown that reprogramming the tail bits (i.e. ...
... Moreover, the fast switching devices may be over-programmed leading to failed re-write attempts. Thus, in most cases, a so-called write-verify process is used to program desired resistance states [33], [44]. [33] demonstrates such programming of eight different resistance states on a large array. ...
... Thus, in most cases, a so-called write-verify process is used to program desired resistance states [33], [44]. [33] demonstrates such programming of eight different resistance states on a large array. Resistance relaxation or retention describes changes in the programmed resistance distributions and therefore also introduces reliability concerns. ...
Article
Analog compute schemes as well as compute-in-memory have emerged in an effort to reduce the increasing power hunger of convolutional neural networks, which exceeds the constraints of edge devices. Memristive device types are a relatively new offering with interesting opportunities for unexplored circuit concepts. In this work, the use of memristive devices in cascaded time domain compute-in-memory is introduced with the primary goal of reducing the size of fully unrolled architectures. The different effects influencing determinism in memristive devices are outlined together with reliability concerns. Architectures for binary as well as multi-bit multiply and accumulate cells are presented and evaluated. As more involved circuits offer more accurate compute results, a trade-off between design effort and accuracy comes into the picture. To further evaluate this trade-off, the impact of variations on overall compute accuracy is discussed. The presented cells reach an Energy/OP of 0.23 fJ at a size of 1.2 μm² for binary and 6.04 fJ at 3.2 μm² for 4x4 bit multiply and accumulate operations.