Figure 2 - uploaded by Yuanqing Cheng
Illustration of 3T-3MTJ based memory bank, the schematic of 3T-3MTJ cell and peripheral circuit. 

Source publication
Conference Paper
The STT-RAM (Spin-Transfer Torque Magnetic RAM) technology is a promising candidate for cache memory because of its high density, low standby power, and non-volatility. As technology scales, especially below the 40nm technology node, read disturbance becomes severe because the read current approaches the switching current. In addition, the...

Contexts in source publication

Context 1
... two-stage sensing scheme is proposed as shown in Fig. 2. Each sense amplifier (SA) contains two parts, i.e., a current sensing part and an amplification part, which are shown in the red dashed box of Fig. 2. We adopt the current sensing scheme proposed in [9] in the first part. It senses the difference between the two branches. However, due to the small signal difference generated by the amplifier, it is ...
Context 2
... two-stage sensing scheme is proposed as shown in Fig. 2. Each sense amplifier (SA) contains two parts, i.e., a current sensing part and an amplification part, which are shown in the red dashed box of Fig. 2. We adopt the current sensing scheme proposed in [9] in the first part. It senses the difference between the two branches. However, due to the small signal difference generated by the amplifier, it is necessary to amplify the signal amplitude to the full swing to drive the following logic circuit. We adopt the PCSA circuit [23] in the second part ...
Context 3
... read operation and working principle of the sensing circuit are described as follows. In the reading process, a read voltage is applied between the bitline and the source line. When the word line is asserted, current flows through each sensing branch, as shown in the right part of the red dashed box in Fig. 2. The current sensing circuit converts the current difference to a voltage difference (e.g., the difference between V0 and V1). Then, the two signals are fed into the amplification circuit in the left part of the red dashed box. It amplifies the signal difference to the full swing, denoted as OUT-1 and OUT-1. If bits '00' or '11' are stored in ...
Context 4
... early sensing stage in these two cases. After that, it finally outputs a random result depending on the resistance difference of two MTJs induced by process variations. Therefore, we could take advantage of this period to decide whether it is necessary to activate the second stage sensing. Finally, depending on data patterns, multiplexers in Fig. 2 are used to control which SA's output is used to generate the output ...
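The read path described in Contexts 3 and 4 can be approximated by a small behavioral model. The sketch below is only an illustration under stated assumptions, not the circuit from the source publication: the function name `sense_two_stage`, the linear load `r_load`, the default values of `v_read` and `vdd`, and the ideal comparator standing in for the amplification stage are all hypothetical. It shows how a difference in branch resistance becomes a current difference, then a voltage difference (V0, V1), and finally a full-swing complementary output, and why two branches with equal resistance give no deterministic result.

```python
def sense_two_stage(r_branch0, r_branch1, v_read=0.1, r_load=1000.0, vdd=1.0):
    """Toy behavioral model of a two-stage differential read.

    All parameters are illustrative assumptions:
      r_branch0, r_branch1 -- resistances (ohms) of the two sensing branches
      v_read               -- read voltage applied across each branch
      r_load               -- hypothetical linear load that turns current into voltage
      vdd                  -- full-swing output level
    """
    # Stage 1: each branch current follows Ohm's law, then is converted to a voltage.
    i0 = v_read / r_branch0
    i1 = v_read / r_branch1
    v0 = i0 * r_load
    v1 = i1 * r_load

    # Stage 2: an ideal comparator stands in for the amplification part.
    if v0 == v1:
        # Equal branches: in a real circuit the outcome would be set by
        # process variation, so the model reports it as undetermined.
        return v0, v1, None
    outputs = (vdd, 0.0) if v0 > v1 else (0.0, vdd)
    return v0, v1, outputs


if __name__ == "__main__":
    print(sense_two_stage(2000.0, 4000.0))  # lower resistance on branch 0
    print(sense_two_stage(3000.0, 3000.0))  # equal branches, undetermined
```

In this model the undetermined case plays the role of the period mentioned in Context 4 in which the first stage cannot resolve the data and a second sensing stage may be needed.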
Context 5
... compare read performance of different cell structures, a 2KB array is constructed as shown in the left part of Fig. 2. It is composed of 4 subarrays, and the capacity of each subarray is 256×8b. In addition to the 45nm technology node, three other technology nodes, i.e., 90nm, 65nm, and 25nm, are also considered to investigate the scalability of the different cell designs. The read latencies obtained by HSPICE simulation are plotted in Fig. 9. The access latency ...

Similar publications

Article
Magnetic skyrmions are currently the most promising option to realize current-driven magnetic shift registers. A variety of concepts to create skyrmions were proposed and demonstrated. However, none of the reported experiments show controlled creation of single skyrmions using integrated designs. Here, we demonstrate that skyrmions can be generated...
Article
We address the nature of spin-orbit torques at the magnetic surfaces of topological insulators using the linear-response theory. We find that the so-called Dirac torques in such systems possess a different symmetry compared to their Rashba counterpart, as well as a high anisotropy as a function of the magnetization direction. In particular, the dam...

Citations

... However, they comprise non-silicon components that are expensive and require additional fabrication procedures. Moreover, owing to the high off-current, ReRAM and STT-RAM require high supply voltages and peripheral circuits to guarantee a sufficient sensing margin [11,12]. Additionally, although FeFETs exhibit a relatively high ON/OFF current ratio, reducing the gate voltage based on the high voltage drop across the interface oxide is a challenge [13], one that limits the possibility of achieving high endurance. ...
Article
In this paper, we propose a logic-in-memory (LIM) inverter comprising a silicon nanowire (SiNW) n-channel feedback field-effect transistor (n-FBFET) and a SiNW p-channel metal oxide semiconductor field-effect transistor (p-MOSFET). The hybrid logic and memory operations of the LIM inverter were investigated by mixed-mode technology computer-aided design simulations. Our LIM inverter exhibited a high voltage gain of 296.8 (V/V) when transitioning from logic ‘1’ to ‘0’ and 7.9 (V/V) when transitioning from logic ‘0’ to ‘1’, while holding calculated logic at zero input voltage. The energy band diagrams of the n-FBFET structure demonstrated that the holding operation of the inverter was implemented by controlling the positive feedback loop. Moreover, the output logic can remain constant without any supply voltage, resulting in zero static power consumption.
... III) STT-MRAM-based memories suffer from the read disturb problem, in which a read current may unintentionally change the magnetization direction of an MTJ free layer, i.e., the data is corrupted during the sensing process [21]. This problem does not apply to Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM), as it uses different write and read current paths. ...
Article
This paper presents a high-performance and energy-efficient processor exploiting a Magnetoresistive-based Computing-in-Memory array architecture (the so-called MagCiM processor) to perform Boolean logic functions on operands stored in a memory array. The proposed processor efficiently addresses the memory wall and the leakage power consumption problems in conventional processors. The MagCiM processor utilizes mCell memory, a class of Magnetoresistive memory employing only Magnetic Tunnel Junction (MTJ) devices, to realize both computation-in-memory and on-chip instruction and data memories. The mCell memory is characterized by almost zero leakage power, high integration density, high level of reliability, and compatibility with the CMOS VLSI fabrication process. The circuit-level simulation results through comparisons with the previous work reveal that the MagCiM processor provides low occupation area, low power and energy consumption, and offers Normally-off instant-on computing capability, which makes it very suitable for embedded system applications. Based on our evaluations, a conventional processor based on the well-known MIPS architecture consumes about 13 times more energy while having 1.5 times more delay than the MagCiM processor.
... The problems with conventional CMOS TCAMs have motivated efforts to design more energy efficient and denser TCAMs. As an alternative, emerging devices such as resistive RAM (ReRAM) and magnetic tunneling junction (MTJ) [3], [4] enable the design of TCAM cells with a reduced area footprint and/or improvements to figures of merit (FOM) such as energy and delay compared to CMOS counterparts. However, there are still some challenges associated with MTJ and ReRAM-based TCAMs. ...
... While TCAMs have obvious utility, conventional CMOS TCAMs suffer from low area density and high energy consumption. To combat these challenges, research efforts have considered TCAM designs based on emerging resistive devices, such as resistive RAM (ReRAM), phase change memory (PCM) and spin transfer torque RAM (STT-RAM) [7], [8] which may enable TCAM cells with a reduced area footprint and/or improvements to figures of merit (FOM) such as energy and delay. However, as the aforementioned technologies encode binary '1's/'0's with low resistance states (LRS)/high resistance states (HRS), these emerging technologies still incur relatively high search and write energies, as well as high search delays due to low variable resistances, low HRS/LRS ratios and two-terminal structures. ...
Article
With the growing abundance of data-centric applications, researchers are increasingly looking for ways to co-locate logic and memory elements to improve area, energy and delay. Ternary content addressable memories (TCAMs) represent a form of logic-in-memory (LiM), and are currently widely used in routers, caches and efficient machine learning models. From a technology perspective, researchers have begun to consider various non-volatile (NV) memory technologies to design NV TCAMs that may offer improvements with respect to figures of merit (FOM) such as energy and delay when compared to conventional CMOS designs. Among these devices, ferroelectric field effect transistors (FeFETs) stand out due to their high ION/IOFF ratio, efficient voltage-driven write mechanism, low-cost, and CMOS-compatible fabrication process. We propose a 2FeFET TCAM design based on a state-of-the-art, experimentally calibrated FeFET model. We evaluate and compare our design with other TCAMs at the cell and array levels. Our results suggest that a 2FeFET TCAM requires 3.5X/3200X less write energy than CMOS/resistive random access memory (ReRAM) TCAMs, respectively. The cell area is 13% of that of a CMOS TCAM, and is on par with ReRAM designs. The search energy-delay-product (EDP) of a 2FeFET TCAM is also 4.1X/2.8X less than CMOS/ReRAM TCAMs, respectively.
... We are especially interested in how ferroelectric field-effect transistors (FeFETs) [11] that 1) are compatible with current CMOS technologies [12] and 2) have been experimentally demonstrated [12]-[17], can lead to more efficient LiMs. Researchers have been investigating LiM designs based on resistive random access memories (ReRAMs), and spin-transfer torque random access memories (STT-RAMs) [18], [19]. Both devices use high-resistance states (HRSs) and low-resistance states (LRSs) to encode binary states. ...
... However, these technologies face challenges. For example, STT-RAM-based memories may have low variable resistance (from 10 to 100k in general [19], [20]), low HRS/LRS ratios, and two terminal structures. These shortcomings can lead to relatively high energy consumption and extra transistors for write operations and to maintain acceptable output swings. ...
Article
Among the beyond-complementary metal-oxide-semiconductor (CMOS) devices being explored, ferroelectric field-effect transistors (FeFETs) are considered as one of the most promising. FeFETs are being studied by all major semiconductor manufacturers, and experimentally, FeFETs are making rapid progress. FeFETs also stand out with the unique hysteretic Ids-Vgs characteristic that allows a device to function as both a switch and a nonvolatile (NV) storage element. We exploit this FeFET property to build two categories of fine-grained logic-in-memory (LiM) circuits: 1) ternary content addressable memory (TCAM) which integrates efficient and compact logic/processing elements into various levels of memory hierarchy; 2) basic logic function units for constructing larger and more complex LiM circuits. Two writing schemes (with and without negative supply voltages respectively) for FeFETs are introduced in our LiM designs. The resulting designs are compared with existing LiM approaches based on CMOS, magnetic tunnel junctions (MTJs), resistive random access memories (ReRAMs), ferroelectric tunnel junctions (FTJs), etc., that afford the same circuit-level functionality. Simulation results show that FeFET-based NV TCAMs offer lower area overhead than MTJ (79%) and CMOS (42% less) equivalents, as well as better search energy-delay products (EDPs) than TCAM designs based on MTJ (149x), ReRAM (1.7x), and CMOS (1.3x) in array evaluations. NV FeFET-based LiM basic circuit blocks are also more efficient than functional equivalents based on MTJs in terms of propagation delay (4.2x) and dynamic power (2.5x). A case study for an FeFET-based LiM accumulator further demonstrates that by employing FeFET as both a switch and an NV storage element, the FeFET-based accumulator can save area (36%) and power consumption (40%) when compared with a conventional CMOS accumulator with the same structure.
... While CMOS-based TCAMs have obviously been implemented, they frequently suffer from low density and high energy consumption when compared to static random access memories (SRAMs) [3], or dynamic random access memories (DRAMs). Researchers have also been investigating TCAM designs based on resistive RAM (ReRAM), and spin torque transfer RAM (STT-RAM) based on magnetic tunnel junctions (MTJs) [4], [5]. Both devices use high resistance states (HRS) and low resistance states (LRS) to encode binary states. ...
... The embedding size for MNIST is 16, and the embedding size for EMNIST is 64. As shown in Fig. 14(a), the Hamming distance between the test example (TE) '0' and the three learning examples (LEs) '0' is 0, and the distance between the TEs '2' and '9' and the LEs '0' is 10 (the maximum possible distance in this case is 16). As shown in Fig. 14(b), the Hamming distance between the TE 'C' and the three LEs 'C' is 0. However, the distance between the TE 'e' and LEs 'C' is also small as they are visually similar. ...
Article
Memory-augmented neural networks (MANNs) require large external memories to enable long-term memory storage and retrieval. Content-addressable memory (CAM) is a type of memory used for high-speed searching applications and is well-suited for MANNs. Recent advances in exploratory non-volatile devices have spurred the development of non-volatile CAMs. However, these devices suffer from poor on-off ratio, large write voltages, and long write times. This work proposes a non-volatile ternary CAM using magneto-electric field effect transistors (MEFETs). The energy and delay of the various operations are simulated using the ASAP 7-nm predictive technology for the transistors and a Verilog-A model of the MEFET. The proposed structure achieves orders of magnitude improvement in search energy and > 45× improvement in search energy-delay product compared to prior works. The write energy and delay are also improved by 8× and 12×, respectively, compared to CAMs designed with other non-volatile devices. A variability analysis is performed to study the effect of process variations on the CAM. The proposed CAM is then used to build a one-shot learning MANN and is benchmarked with the MNIST, EMNIST, and Labeled Faces in the Wild (LFW) datasets with binary embeddings, giving > 99% accuracy on MNIST, a top-3 accuracy of 97.11% on the EMNIST dataset, and > 97% accuracy on the LFW dataset, with embedding sizes of 16, 64, and 512, respectively. The proposed CAM is shown to be fast, energy-efficient, and scalable, making it suitable for MANNs.
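The Hamming-distance comparison between test and learning examples mentioned in the citing context can be sketched in software as follows. This is a minimal illustration, not the MEFET-based CAM itself: the helper names `hamming_distance` and `nearest_label` and the 16-bit patterns are hypothetical, chosen only to match the 16-bit embedding size quoted for MNIST.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_label(query, memory):
    """Return the label of the stored embedding closest to the query.

    memory is a list of (label, embedding) pairs; a CAM would evaluate all
    rows in parallel, whereas this sketch simply scans them in order.
    """
    return min(memory, key=lambda item: hamming_distance(query, item[1]))[0]


if __name__ == "__main__":
    # Hypothetical 16-bit binary embeddings for two stored classes.
    memory = [("0", [0] * 16), ("1", [1] * 16)]
    query = [1, 1, 1] + [0] * 13  # differs from class "0" in three positions
    print(hamming_distance(query, memory[0][1]))
    print(nearest_label(query, memory))
```

With 16-bit embeddings the largest possible distance is 16, which is the bound quoted in the context above.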
Article
Quantized neural networks (QNNs), which perform multiply-accumulate (MAC) operations with low-precision weights or activations, have been widely exploited to reduce energy consumption. QNNs usually have a trade-off between energy consumption and accuracy depending on the quantized precision, so that it is necessary to select an appropriate precision for energy efficiency. Nevertheless, the conventional hardware accelerators such as Google TPU are typically designed and optimized for a specific precision (e.g., 8-bit), which may degrade energy efficiency for other precisions. Though an analog-based computing-in-memory (CIM) technology supporting variable precision has been proposed to improve energy efficiency, its implementation requires extremely large and power-consuming analog-to-digital converters (ADCs). In this paper, we propose Scale-CIM, a precision-scalable CIM architecture which supports MAC operations based on digital computations (not analog computations). Scale-CIM performs binary MAC operations with high parallelism, by executing digital-based multiplication operations in the CIM array and accumulation operations in the peripheral logic. In addition, Scale-CIM supports multi-bit MAC operations without ADCs, based on the binary MAC operations and shift operations depending on the precision. Since Scale-CIM fully utilizes the CIM array for various quantized precisions (not for a specific precision), it achieves high compute-throughput. Consequently, Scale-CIM enables precision-scalable CIM-based MAC operations with high parallelism. Our simulation results show that Scale-CIM achieves 1.5∼15.8 × speedup and reduces system energy consumption by 53.7∼95.7% across different quantized precisions, compared to the state-of-the-art precision-scalable accelerator.
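The idea of building multi-bit MAC operations from binary MAC operations and shifts can be illustrated with a generic bit-plane decomposition. The sketch below assumes unsigned integer operands and is not claimed to reproduce the Scale-CIM dataflow; the function names `bit_plane`, `binary_mac`, and `multibit_mac` are hypothetical.

```python
def bit_plane(values, k):
    """Extract bit k of every value, giving one binary vector."""
    return [(v >> k) & 1 for v in values]


def binary_mac(w_bits, x_bits):
    """1-bit multiply (AND) followed by accumulation (population count)."""
    return sum(w & x for w, x in zip(w_bits, x_bits))


def multibit_mac(weights, activations, w_width, x_width):
    """Compute sum(w * x) for unsigned operands using only binary MACs and shifts."""
    total = 0
    for j in range(w_width):
        w_plane = bit_plane(weights, j)
        for k in range(x_width):
            x_plane = bit_plane(activations, k)
            # Each partial result is weighted by 2**(j + k) through a left shift.
            total += binary_mac(w_plane, x_plane) << (j + k)
    return total


if __name__ == "__main__":
    weights = [3, 5, 2]
    activations = [1, 4, 7]
    print(multibit_mac(weights, activations, w_width=3, x_width=3))
    print(sum(w * x for w, x in zip(weights, activations)))
```

Changing `w_width` and `x_width` changes the number of binary MAC passes, which is how the same 1-bit primitive can serve several precisions in this simplified picture.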
Article
As one type of associative memory, content-addressable memory (CAM) has become a critical component in several applications, including caches, routers, and pattern matching. Compared with the conventional CAM that could only deliver a “matched or not-matched” result, emerging multilevel CAM (ML-CAM) is capable of delivering “the degree of match” with multilevel distance calculation. This feature has been desired in applications that need beyond-Boolean matching results. However, existing ML-CAM designs are limited by the bit-cell device discharging current mismatch and vulnerability to the timing of sensing operations for distance calculation. This inherent constraint makes it difficult to further improve the accuracy and scalability toward higher accuracy and higher dimension matching. In this work, we propose CapCAM, a multilevel Capacitive Content Addressable Memory. It could be implemented based on either static random-access memory (SRAM) or emerging technologies, e.g., the ferroelectric field-effect transistor (FeFET). CapCAM could provide linear and stable voltage drop scaled by the match degree and need no strict timing for result sensing, which embraces the high-accuracy and high-scalability search. The inherent enabler of CapCAM is the charge-domain computing mechanism. This article will present the basic concept, operating mechanisms, detailed circuit designs, and circuit-level simulations of CapCAM. Besides, we apply CapCAM to few-shot learning applications and compare CapCAM with the current-domain TCAM designs. Results show 99.2% accuracy for a five-way five-shot classification task with our proposed CapCAM design while considering 1-fF capacitors, 20-domain FeFETs, and 256 columns. In contrast, the prior work based on discharging dynamics requires strict timing controls and suffers from accuracy degradation under the same configuration, which demonstrates CapCAM’s capability of low-power, accurate, and scalable multilevel CAM (ML-CAM) computing.
Article
Ternary content addressable memory (TCAM) is one type of associative memory and has been widely used in caches, routers, and many other mapping-aware applications. While the conventional SRAM-based TCAM is high speed and bulky, there have been denser but slower and less reliable nonvolatile TCAMs using nonvolatile memory (NVM) devices. Meanwhile, some CMOS TCAMs using dynamic memories have been also proposed. Although dynamic TCAM could be denser than the 16T SRAM TCAM and more reliable than the nonvolatile TCAMs, CMOS dynamic TCAMs still suffer from the row-by-row refresh energy and time overheads. In this article, we propose dynamic TCAM using nanoelectromechanical (NEM) relays (DyTAN), and utilize one-shot refresh (OSR) to solve the memory refresh problem. By exploiting the unique NEM relay characteristics, DyTAN outperforms the existing works in the balance between density, speed, and power efficiency. Compared with the 16T SRAM-based TCAM, the 5T CMOS dynamic TCAM, the 2T2R TCAM, and the 2FeFET TCAM, evaluations show that the proposed DyTAN reduces the write energy by up to 2.3×, 1.3×, 131×, and 13.5×, and improves the search energy-delay-product (EDP) by up to 12.7×, 1.7×, 1.3×, and 2.8×, respectively.