Fig. 1 Schematic and SIM picture of the 6T cell for 90, 65, and 45 nm [39]


Source publication
Book
This book describes the various tradeoffs system designers face when designing embedded memory. Readers designing multi-core systems and systems on chip will benefit from the discussion of topics ranging from memory architecture and array organization to circuit design techniques and design for test. The presentation enables a multi-disciplinary approach ...

Citations

... In a digital system, testing is often conducted by comparing the logical behavior of the system under test with that of a known-good system. Therefore, all physical failures that occur during manufacturing need to be modeled as logical faults [5]. Figure 1 shows the general architecture of a random access memory (RAM) [29], [30]. It normally consists of memory cells, an address decoder, a write driver, and a sense amplifier. ...
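To make that architecture concrete, the following is a minimal behavioral sketch in Python. The class and method names are illustrative, and the decoder, write driver, and sense amplifier are reduced to plain stores and loads rather than circuit-level models.

```python
# Minimal behavioral model of the RAM organization described above:
# a cell array addressed through a one-hot decoder, with write-driver
# and sense-amplifier behavior reduced to plain stores and loads.
# All names are illustrative; a real array adds timing and analog detail.

class SimpleRAM:
    def __init__(self, num_words: int, word_bits: int):
        self.word_bits = word_bits
        self.cells = [0] * num_words  # one stored word per row

    def _decode(self, address: int) -> list[int]:
        # One-hot address decoder: exactly one word line is raised.
        return [1 if row == address else 0 for row in range(len(self.cells))]

    def write(self, address: int, data: int) -> None:
        # Write driver forces the selected row's cells to the new value.
        word_lines = self._decode(address)
        row = word_lines.index(1)
        self.cells[row] = data & ((1 << self.word_bits) - 1)

    def read(self, address: int) -> int:
        # Sense amplifier resolves the selected row's stored value.
        word_lines = self._decode(address)
        return self.cells[word_lines.index(1)]

ram = SimpleRAM(num_words=8, word_bits=8)
ram.write(3, 0xA5)
assert ram.read(3) == 0xA5
```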
Article
Testing embedded memories in a chip can be very challenging due to their high density and their manufacture in very deep submicron (VDSM) technologies. In this review paper, functional fault models that may exist in a memory are described in terms of their definition and detection requirements. Several memory testing algorithms used in memory built-in self-test (BIST) are discussed in terms of test operation sequences, fault detection ability, and test complexity. The studies show that tests with a complexity of 22N, such as March SS and March AB, are needed to detect all static unlinked or simple faults within the memory cells. The N in the algorithm complexity refers to Nx*Ny*Nz, where Nx is the number of rows, Ny the number of columns, and Nz the number of banks. This paper also looks into optimizations and further improvements achievable on existing March test algorithms to increase fault coverage or reduce test complexity.
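As a quick illustration of the complexity figures quoted above, the sketch below computes the total operation count for a 22N March test; the array geometry in the example is an assumption, not a figure from the paper.

```python
# Back-of-envelope test length for a 22N March algorithm (e.g. March SS
# or March AB, as cited above). N = Nx * Ny * Nz; the per-bank geometry
# below is an assumed example, not taken from the paper.

def march_operations(nx: int, ny: int, nz: int, complexity: int = 22) -> int:
    """Total memory operations for a March test of the given complexity."""
    return complexity * nx * ny * nz

# Example: 1024 rows x 512 columns x 4 banks under March SS (22N).
print(march_operations(1024, 512, 4))  # 46_137_344 operations
```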
... The array design uses a discharged-low bit-line, and all non-selected word lines are biased to 0 V; this makes the design similar to a normal SRAM array, where the word-line address decode is one-hot (only one word line transitions high while the other word lines are grounded). A bit-line capacitance of 20 fF is used to account for interconnect and readout-circuit capacitance [14]. The read operation is based on charging the bit-line capacitance, including any parasitic capacitance, through the cell resistance of the selected cell, for example M1. ...
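A small sketch of that read mechanism, under assumed values: only the 20 fF bit-line capacitance comes from the excerpt, while the cell resistance and supply voltage are placeholders for illustration.

```python
# Read-out sketch for the bit-line charging behavior described above:
# the selected cell's resistance R charges the 20 fF bit-line capacitance,
# so V_BL(t) = VDD * (1 - exp(-t / (R * C))). The resistance and supply
# values are assumed; only C = 20 fF comes from the text.

import math

def bitline_voltage(t: float, r_cell: float, c_bl: float = 20e-15,
                    vdd: float = 1.0) -> float:
    """Bit-line voltage after charging for t seconds through r_cell."""
    return vdd * (1.0 - math.exp(-t / (r_cell * c_bl)))

# Example: an assumed 100 kOhm cell resistance gives tau = 2 ns, so the
# bit-line reaches ~63% of VDD after one time constant.
tau = 100e3 * 20e-15
print(f"{bitline_voltage(tau, r_cell=100e3):.2f} V")  # ~0.63 V
```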
... Traditionally, SRAM is used for this purpose. However, new low-power and faster non-volatile memories [61] are now being developed to replace SRAM [62,63], which will also preserve the data in the case of accidental power loss. The on-chip memory method is very demanding in terms of chip area and significantly increases the length of global interconnects and hence dynamic power. ...
... They have a similar store-restore operation; however, the data is now copied over the fast local interconnects. There are trade-offs in terms of area, timing overhead, power, clocking, and design complexity when choosing between retention registers and on-chip memories [62]. Recent years have seen efforts to implement non-volatile retention registers via ferroelectric and ferromagnetic elements, which can furthermore be switched off after the store operation [65][66][67][68]. ...
Research Proposal
Problem: With the widespread adoption of technology, the exponential increase in transistor (fundamental computing unit) count per chip, now on the scale of the human population, is approaching the scaling limits of the technology, at which leakage current is unacceptably high and causes serious energy and power issues. The status quo, per available data, is as follows: (a) data servers globally consumed ~275 billion kWh in 2011, a figure expected to increase by another 50% by the end of this decade; (b) a laptop drains its entire battery in 3-10 hours, limiting its mobility and applicability; (c) medical equipment, automobiles, and aeroplanes lose critical data during accidents or accidental power disruption; (d) contemporary bio-electronic implants drain their entire battery in 5-7 years, which necessitates an operation to replace the device, and repeated operations jeopardise the patient's life. Although a high-end VLSI (very-large-scale integration) chip typically operates only 20-50% of its transistors at a given time to keep power dissipation from burning the chip, the remaining 50-80% (dark silicon) is forced into a standby mode. These standby modes fall short of the power targets for markets such as data servers, portable devices, IoT (Internet of Things), and medical implants: the chips still dissipate unacceptably large amounts of heat and substantially reduce the lifetime of both the chip and the embedded battery, and in some cases endanger patients as well.

Solution: The objective is to achieve zero standby power dissipation by realizing non-volatile logic (NVL) in these chips, especially in applications that operate the chip, or major sections of it, at a low duty cycle. This proposal pioneers a new scheme of non-volatile interconnects (NVI) toward this goal, making selected interconnect wiring non-volatile by empowering it with spintronics technology. This would enable dark silicon to be completely powered down with no standby leakage, thereby reducing energy consumption to less than 1% of the existing amount, enabling retrieval of data in the case of accidental power loss, and proportionally lengthening battery-charging intervals.
... This helps decouple the memory from the logic and enables non-memory supply voltage scaling. Yet, it does not solve the issue of memory supply scaling, and it adds complexity to the physical design by requiring level shifters and other specialized cells [15]. ...
... If there is an access to one of the locations whose mask bit is set by the BIST, it is reported as a miss in the case of a read. The cache subsystem (hardware) handles it the same way it does a miss case [15]. ...
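A hypothetical sketch of that masking behavior: addresses flagged by BIST always report a read miss and are serviced through the normal miss path. The data structures and refill policy here are illustrative stand-ins, not the cited design.

```python
# Sketch of the read-miss masking described above: addresses whose mask
# bit was set by BIST are forced to report a miss, and the cache then
# services them through its ordinary miss path.

class MaskedCache:
    def __init__(self):
        self.lines = {}        # address -> cached data
        self.faulty = set()    # addresses flagged by BIST at power-up

    def mark_faulty(self, address: int) -> None:
        self.faulty.add(address)
        self.lines.pop(address, None)  # never serve data from a bad line

    def read(self, address: int, backing_store: dict) -> int:
        if address in self.faulty or address not in self.lines:
            # Faulty or absent: treated exactly like an ordinary miss,
            # fetched from the next memory level.
            data = backing_store[address]
            if address not in self.faulty:
                self.lines[address] = data  # only refill healthy lines
            return data
        return self.lines[address]

memory = {0x40: 7}
cache = MaskedCache()
cache.mark_faulty(0x40)
print(cache.read(0x40, memory))  # 7, served via the miss path
```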
Article
This paper comprises two new methodologies to improve yield and reduce system-on-a-chip power. The first methodology is based on detecting faulty static random-access memory (SRAM) cells and resizing the cache. The key advantage of this approach is that it enables the end user to tune the system's parameters for error tolerance. Furthermore, this technique enables aggressive voltage scaling, which causes parametric (soft) failures in SRAM-based memory. As such, the proposed methodology can be used to trade cache size for lower power or better yield. In the second methodology, data from faulty cells are treated as imposed noise. Depending on the application, this error percentage (imposed noise) can be mitigated through three options: first, ignore the error if its percentage is tolerable; second, apply simple hardware filtration; finally, apply software-based filtration. The viability of this approach is that it allows the SRAM supply voltage to be scaled aggressively below the traditional point of 100% correct operation, which results in a substantial reduction of power, trading off quality for power. For both approaches, BIST is used as part of the power-up sequence to identify the faulty memory addresses per voltage level and compute the percentage of faulty cells. Furthermore, the proposed methodologies help improve reliability and counteract long-term effects on memory cell stability and lifetime degradation caused by negative bias temperature instability.
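The power-up flow described above might look roughly like the following sketch; the percentage thresholds separating the three mitigation options are assumed placeholders, since the abstract gives no numeric cut-offs.

```python
# Sketch of the power-up flow described above: BIST identifies faulty
# addresses per supply voltage, the faulty-cell percentage is computed,
# and one of the three mitigation options is chosen. The thresholds are
# assumed placeholders; the paper does not give numeric cut-offs.

def choose_mitigation(faulty_addresses: set, total_cells: int,
                      tolerable_pct: float = 0.1,
                      hw_filter_pct: float = 1.0) -> str:
    error_pct = 100.0 * len(faulty_addresses) / total_cells
    if error_pct <= tolerable_pct:
        return "ignore"            # option 1: error is tolerable
    if error_pct <= hw_filter_pct:
        return "hardware-filter"   # option 2: simple hardware filtration
    return "software-filter"       # option 3: software-based filtration

print(choose_mitigation({12, 99, 1024}, total_cells=65536))  # -> "ignore"
```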
... The memory architecture used for each technology is shown in Table VIII, where all transistors are 45 nm. For example, the SRAM write procedure is similar to the read one [50]. Sections II and III explained in detail the write mechanism and the power consumed in both the memristor and STT-RAM. ...
Article
Conventional charge-based memories face major challenges in low-power applications. Some of these challenges are leakage current for static random access memory (SRAM) and dynamic random access memory (DRAM), the additional refresh operation for DRAM, and the high programming voltage for Flash. In this paper, two emerging resistive random access memory (ReRAM) technologies are investigated, memristor and spin-transfer torque (STT)-RAM, as potential universal memory candidates to replace traditional ones. Both of these nonvolatile memories support zero leakage and low-voltage operation during read access, which makes them ideal for devices with long sleep time. To date, the high write energy of both memristor and STT-RAM is one of the major inhibitors to adopting these technologies. The primary contribution of this paper is centered on addressing the high write energy issue by trading off retention time with noise margin. In doing so, the memristor and STT-RAM power has been compared with the power of traditional six-transistor-SRAM-based memory, and potential application in wireless sensor nodes is explored. This paper uses 45-nm foundry process technology data for SRAM and physics-based mathematical models derived from real devices for the memristor and STT-RAM. The simulations are conducted using MATLAB, and the results show a potential power savings of 87% and 77% when using the memristor and STT-RAM, respectively, at a 1% duty cycle.
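The duty-cycle argument above can be made concrete with a simple average-power model; the active and standby power numbers below are assumptions for illustration, not the paper's measured data.

```python
# Duty-cycle power model behind the comparison above: average power is
# P_avg = D * P_active + (1 - D) * P_standby, and a nonvolatile array
# contributes (near) zero standby power. The numeric powers below are
# assumed for illustration only.

def average_power(duty: float, p_active: float, p_standby: float) -> float:
    """Average power for a block active a fraction `duty` of the time."""
    return duty * p_active + (1.0 - duty) * p_standby

p_sram = average_power(0.01, p_active=100e-6, p_standby=10e-6)  # leaky SRAM
p_nvm = average_power(0.01, p_active=300e-6, p_standby=0.0)     # costlier writes, no leakage
print(f"savings: {100 * (1 - p_nvm / p_sram):.0f}%")            # savings: 72%
```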
... This helps decouple the memory from the logic and enables non-memory supply voltage scaling. Yet, it does not solve the issue of memory supply scaling, and it adds complexity to the physical design by requiring level shifters and other specialized cells [15]. ...
... The cache subsystem (hardware) handles it the same way it does a miss case [15]. 2) Write Operation: A multiple-way associative cache requires an algorithm to determine which way to write for a given cache line. ...
Article
The increased size of embedded memory for system-on-chip (SoC) and multicore processors has a positive impact on performance, yet poses a big challenge for chip yield, power consumption, and overall cost. A big percentage of today's processor and SoC area, in both 2-D (planar) and 3-D technologies such as through-silicon via (TSV), is dedicated to memory. Most of today's embedded memories are not simply a storage area with a single interface of data, address, and control; rather, they comprise complex logic on their interface due to timing constraints and interconnect technologies (NoC and TSV). Memory core testing is well understood and has mature tools and methodologies, such as built-in self-test (BIST), to screen for defects. In addition, core and logic-based testing using scan and automatic test pattern generation (ATPG) tools and methodologies is intended for flop-based design. However, interface logic and complex interconnect like that in 3-D chips are not thoroughly tested using BIST or ATPG, as those are not designed for such logic. This becomes even more important for 3-D chips, where a stacked memory may have a testing strategy different from that of the base-layer core interfacing with it. This brief presents a design-for-test methodology to achieve good coverage on interface logic for embedded and stacked memory. The proposed approach uses a modified ATPG and scan methodology to test the memory logic interface with minimum impact on the existing design.
Article
Bias temperature instability (BTI) is a serious reliability concern in nanoscale technologies. BTI gradually increases the absolute value of the threshold voltage (Vth) of MOS transistors. The main consequence of the Vth shift of the SRAM cell transistors is static noise margin (SNM) degradation. The SNM degradation of SRAM cells results in bit-flip occurrences due to transient faults and should be monitored accurately. This paper proposes a sensor called the write current-based BTI sensor (WCBS) to assess the BTI-aging state of SRAM cells. The WCBS measures BTI-induced SNM degradation of SRAM cells by monitoring the shifts in maximum write current due to BTI. The observations show that the maximum current consumption during a write operation is an effective identifier for measuring Vth and SNM shifts. A BTI-assessment granularity of one cell up to a row of memory can be achieved by writing special bit patterns to the memory block during the test. We evaluated the sensor through SPICE-level simulations at the 32-nm technology node. The precision of the WCBS is about ±1.25 mV (±3.2% error). One sensor is enough for an entire SRAM memory block, with negligible area/power overhead (less than 1%). The effects of process variation and temperature changes on the WCBS are investigated in detail.
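A rough sketch of the sensing idea, under an assumed linear current-to-Vth sensitivity; the real WCBS is calibrated from SPICE characterization rather than the placeholder constant used here.

```python
# Rough illustration of the sensing idea above: BTI-induced Vth shift is
# estimated from the drop in maximum write current relative to a fresh
# reference. The linear sensitivity is an assumed placeholder; the actual
# WCBS calibration comes from SPICE characterization.

def estimate_vth_shift(i_write_max: float, i_write_fresh: float,
                       sensitivity_a_per_v: float = 1e-4) -> float:
    """Estimate Vth shift (V) from the reduction in max write current (A)."""
    return (i_write_fresh - i_write_max) / sensitivity_a_per_v

# Example: a 5 uA drop with an assumed 100 uA/V sensitivity -> 50 mV shift.
print(f"{estimate_vth_shift(95e-6, 100e-6) * 1000:.1f} mV")
```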
Thesis
Embedded microelectronic systems are used in many areas of daily life. Integrating an increasing number of processor cores on a single microchip (on-chip multiprocessor, MPSoC) allows the computing performance and resource efficiency of these systems to be increased. The CoreVA-MPSoC, developed in the AG Kognitronik und Sensorik at Bielefeld University, couples resource-efficient VLIW processor cores via a hierarchical interconnect structure. Tightly coupling several processor cores in a cluster enables high-bandwidth, low-latency communication. The main contribution of this work is the development and design-space exploration of a resource-efficient CPU cluster for use in the CoreVA-MPSoC. Abstract modeling of the hardware and software components of the CPU cluster, together with a highly automated design flow, enables fast analysis of a large design space. The design-space exploration examines different topologies, bus standards, and memory architectures. In particular, the interplay of the hardware architecture with the programming model and synchronization is essential for high resource efficiency and good exploitation of the available computing performance by the application developer. To this end, a block-based synchronization method tailored to the hardware architecture is presented. This method is used by compilers for the languages StreamIt, C, and OpenCL to map applications onto different configurations of the CPU cluster. Nine representative streaming applications show an average speedup by a factor of 13.3 when mapped onto a cluster with 16 CPUs, compared with execution on a single CPU. In addition, a tightly coupled shared L1 data memory with multiple memory banks is integrated into the CPU cluster, allowing all CPUs low-latency access. Furthermore, the use of different instruction memories and caches is evaluated, and the energy requirements of communication and synchronization within the CPU cluster are considered. This work shows that a CPU cluster with 16 CPU cores represents a good compromise between the area requirements of the cluster interconnect and the performance of the cluster. A CPU cluster with 16 2-slot VLIW CPUs and a total of 512 kB of memory occupies 2.63 mm² in a prototype implementation in a 28-nm FD-SOI standard-cell library. At a clock frequency of 760 MHz, the average power consumption is 440 mW. An FPGA-based emulation on a Xilinx Virtex-7 FPGA allows the evaluation of a CoreVA-MPSoC with up to 24 CPUs at a maximum clock frequency of up to 124 MHz. As a further application scenario, a CoreVA-MPSoC with up to four CPUs is mapped onto the FPGA of the autonomous mini-robot AMiRo.
Article
This paper proposes a configurable analog front-end (AFE) chain architecture for use in an EEG detection system. The proposed chain consists of three stages: the first and third stages are an instrumentation amplifier and a programmable gain amplifier, respectively, and the second stage is a notch filter with a low-pass feature. The proposed architecture relaxes the design of the notch filter and the analog-to-digital converter (ADC) used in building the EEG detection system. A basic building block is the digitally programmable balanced-output operational transconductance amplifier (DPOTA), which is proposed to realize the AFE blocks. A successive-approximation ADC (SA-ADC) architecture is built mostly from digital circuits in order to lower the power dissipation. Based on this, PSpice post-layout simulation results for the overall EEG detection system using 0.25-µm CMOS technology are also given. The overall configurable gain/filtering chain architecture has a total gain ranging from 61 to 84 dB, a total power dissipation of 32 µW, and an input-referred noise spectral density of 4 µV/\(\sqrt{\text{Hz}}\).