
A 256Gb NAND flash memory stack with 300MB/s HLNAND interface chip for point-to-point ring topology

Authors:
  • Novachips

Abstract and Figures

A 256Gb NAND flash device includes eight stacked 32Gb MLC die and a 16.2mm² HLNAND interface chip providing a 300MB/s synchronous DDR point-to-point ring topology system interface. Four internal busses supporting either 40MHz asynchronous NAND or 133MHz toggle mode NAND allow independent, concurrent operation of the MLC die. The device features data truncation power savings, programmable page size, and command packet error detection.
A 256Gb NAND Flash Memory Stack with 300MB/s
HLNAND Interface Chip for Point-to-Point Ring
Topology
Peter Gillingham, Jin-Ki Kim, Roland Schuetz, Hong-Beom Pyeon, HakJune Oh, Don Macdonald, Eric Choi, David Chinn
MOSAID Technologies Incorporated
11 Hines Road, Suite 203, Kanata, Ontario, CANADA K2K 2X1
gillingham@mosaid.com
Abstract- A 256Gb NAND flash device includes eight stacked
32Gb MLC die and a 16.2mm² HLNAND interface chip
providing a 300MB/s synchronous DDR point-to-point ring
topology system interface. Four internal busses supporting either
40MHz asynchronous NAND or 133MHz toggle mode NAND
allow independent, concurrent operation of the MLC die. The
device features data truncation power savings, programmable
page size, and command packet error detection.
Keywords- HyperLink NAND, HLNAND, High speed NAND
flash, DDR, MCP, and SSD
I. INTRODUCTION
In recent years there has been a surge of growth in
applications for NAND flash, such as solid-state drives
and enterprise storage class memory devices, that rely on high
storage capacity and benefit from a high speed interface.
Conventional NAND flash suffers from speed limitations due
to bus loading and presents the controller with large page sizes
that may yet increase [1,2]. An interface chip isolates the multi-
drop NAND interface from the memory system channel and
provides, instead, a 300MB/s synchronous DDR interface for
connection to a point-to-point ring topology supporting up to
255 devices without speed degradation. An I/O throughput 7x
faster than conventional asynchronous NAND interfaces and
vastly more scalable than emerging multi-drop DDR NAND
interfaces is achieved [3].
II. POINT-TO-POINT RING TOPOLOGY
A serial daisy-chain ring, shown in Fig. 1, provides a
unidirectional flow of data and commands from the controller,
through each memory device and back to the controller, similar
to RamLink [4]. A single load is seen by each device regardless
of the number of devices in the ring. The HyperLink protocol
[5] defines a multiple byte command packet where the first
byte is a target device address, imposing a maximum of 255
devices in the ring with one broadcast address. Input signals
consist of serial command strobe CSI, data strobe DSI, status
STI, and a user-configurable 1 to 8 bit data bus D[7:0], as well
as differential clock CK/CK#, reset RST# and chip enable CE#
signals distributed in parallel. Each device regenerates the
serial signals to provide outputs CSO, DSO, STO, and Q[7:0]
to the next device. The command strobe is used to demarcate
commands and write data to be programmed to the cell array or
on-chip registers while the data strobe is used to demarcate
memory or register read data to be output onto the ring by a
selected device as shown in Fig. 2. The serial status ring allows
any device to raise a flag indicating an event such as
completion of page read, page program, or block erase
operations.
Figure 1: HyperLink Point-to-Point Ring Topology
Figure 2: Command & Write Packet and Data Packet
978-1-4577-0226-6/11/$26.00 ©2011 IEEE
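The addressing scheme of Section II can be illustrated with a minimal sketch. It assumes a software model in which each device inspects the first byte of a command packet and always regenerates the packet for the next device. The class name Device, the function send, and the broadcast value 0xFF are hypothetical choices for illustration; the paper states only that one address is reserved for broadcast.

```python
# Minimal model of command packet addressing on the ring.
# Assumptions (not from the paper): Device, send, and BROADCAST = 0xFF.

BROADCAST = 0xFF                 # assumed value of the reserved broadcast address
MAX_DEVICES = 256 - 1            # one address byte, one value reserved for broadcast


class Device:
    def __init__(self, address: int):
        self.address = address
        self.accepted = []       # command bodies this device acted on

    def handle(self, packet: bytes) -> bytes:
        target = packet[0]       # first byte is the target device address
        if target in (self.address, BROADCAST):
            self.accepted.append(packet[1:])
        return packet            # command packets are regenerated for the next device


def send(ring, packet: bytes) -> bytes:
    """Pass a packet from the controller through every device and back."""
    for device in ring:
        packet = device.handle(packet)
    return packet


ring = [Device(a) for a in range(3)]
returned = send(ring, bytes([1, 0xAA]))           # addressed to device 1 only
send(ring, bytes([BROADCAST, 0xBB]))              # accepted by every device
```

In this model the controller sees every command packet return unchanged, which is what lets it observe the ring as a single closed path regardless of the number of devices.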
III. INTERFACE CHIP ARCHITECTURE AND ERROR
DETECTION CODE
Fig. 3 shows high level architectural details of the interface
chip. The chip is divided into an I/O control block and four
independent NAND flash interface blocks each provisioned
with 8KB data and mask SRAMs. All core logic operates from
a nominal 1.8V supply. The I/O control block provides user-
configurable data bus width, command and data pass through,
device identification, data truncation, Error Detection Code
(EDC) for commands, read data output muxing, and
programmable NAND flash clock generation. The I/O pins use
1.8V Low-Voltage CMOS (LVCMOS) signaling, which is fully
sufficient for operation at 300MB/s. Output drivers provide
user selectable 35Ω or 50Ω source termination to deliver
optimized signal integrity without static power consumption
and an output access time of 2.3ns. Input setup time is 0.3ns.
The interface block forwards all 3-6 byte command packets
from input pins to output pins on the next rising or falling clock
edge. The I/O control block decodes the device address within
command packets and, upon a matching address, inhibits data
from being forwarded to the output pins. Data truncation
reduces I/O power by an average 50% and allows simultaneous
read and write data transfer when the write device is upstream
of the read device, achieving 600MB/s total throughput.
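The 50% average figure for data truncation can be reproduced with a simple counting model. The sketch below assumes a ring of N devices with N+1 driven links (controller output plus one output per device), a target device chosen uniformly at random, and I/O power proportional to the number of links that carry the data. These modelling assumptions and the function name are illustrative and are not taken from the paper.

```python
from fractions import Fraction


def average_active_fraction(n_devices: int, op: str) -> Fraction:
    """Average share of ring links that toggle for one data transfer.

    Assumed model: with truncation, write data travels only from the
    controller to the target, and read data only from the target back
    to the controller. Without truncation all links would toggle.
    """
    links = n_devices + 1                      # controller->dev1 ... devN->controller
    total = Fraction(0)
    for k in range(1, n_devices + 1):          # k = position of the target in the ring
        active = k if op == "write" else links - k
        total += Fraction(active, links)
    return total / n_devices


for n in (1, 8, 255):
    print(n, average_active_fraction(n, "write"), average_active_fraction(n, "read"))
```

Under these assumptions the average is exactly one half for both reads and writes, independent of ring length. The same picture explains the simultaneous transfer case: write data to an upstream device and read data from a downstream device occupy disjoint sets of links.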
The I/O control block also includes an error detection
mechanism that monitors the integrity of command packets.
The final byte of each command packet is a Hamming code
calculated on the preceding bytes of the packet. Upon detection
of an error, command execution is inhibited and a status
register flag is set. Upon detecting the error, the controller
broadcasts a status register read command to determine
whether the error occurred before or after the target device. If
the error occurred before the target device, the command should
be reissued. On-chip error detection prevents erroneous
commands from being executed by the target device before the
controller is able to detect and terminate them. Failure to
terminate a command in a timely manner could have serious
consequences if, for example, a device mistakenly interprets an
incoming read command as a program or erase command.
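The check-byte idea can be sketched as follows. The paper does not give the construction of its Hamming code, so this example substitutes a textbook Hamming layout: data bits occupy the non-power-of-two positions and the check value is the XOR of the positions of all set data bits. The function names and the example payload bytes are hypothetical.

```python
def data_positions(n_bits: int):
    """First n_bits codeword positions that are not powers of two."""
    positions, pos = [], 1
    while len(positions) < n_bits:
        if pos & (pos - 1):                    # non-zero only for non-powers of two
            positions.append(pos)
        pos += 1
    return positions


def hamming_check_byte(payload: bytes) -> int:
    """XOR of the positions of all set data bits (textbook Hamming check bits)."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    check = 0
    for bit, pos in zip(bits, data_positions(len(bits))):
        if bit:
            check ^= pos
    return check                               # below 64 for payloads of up to 5 bytes


def make_packet(payload: bytes) -> bytes:
    return payload + bytes([hamming_check_byte(payload)])


def packet_ok(packet: bytes) -> bool:
    """A device would inhibit execution and set a status flag when this is False."""
    return hamming_check_byte(packet[:-1]) == packet[-1]


packet = make_packet(bytes([0x05, 0x10, 0x00, 0x20]))   # hypothetical 5-byte packet
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]      # single-bit error in the address
print(packet_ok(packet), packet_ok(corrupted))
```

Because every data bit maps to a distinct non-zero position, any single-bit error changes the recomputed check value, so the receiving device can refuse the command before acting on it.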
IV. MEASUREMENT RESULTS
Figure 3. Interface Chip Architecture (block diagram; external HyperLink interface: CK/CK#, RST#, CE#, CSI, DSI, STI, D[7:0], CSO, DSO, STO, Q[7:0]; internal NAND interfaces: I/O0[7:0] to I/O3[7:0], CLE[3:0], ALE[3:0], WE#[3:0], RE#[3:0], WP#[3:0])
Four internal NAND flash interface blocks each contain a
command converter, SRAM for intermediate storage of data,
and timing control for a configurable 40MHz asynchronous
NAND or 133MHz toggle mode NAND I/O port. These allow
simultaneous data transfers and independent, concurrent flash
commands to be carried out on each of the eight flash die. The
interface chip also supports several technology nodes from
different manufacturers selectable by bond option. A clock for
the NAND flash interface block is generated from external
CK/CK# by a programmable clock divider. The SRAM
supports a page size of 8192 bytes plus 448, 512, or 640 bytes
of spare data. Before executing a program command the
program data is first loaded, via the external interface, into one
of the four SRAMs. Upon receiving a program command the
interface chip transfers the SRAM data to the target NAND
flash device and then issues a page program command to the
target NAND device. To reduce the amount of time spent
transferring read data to and from the NAND flash devices,
pages may be subdivided into smaller sub-pages ranging in size
from 2048 bytes to the full physical page. Programmable sub-
pages provide increased I/O operations per second (IOPS), a
key system level performance indicator. The subdivision size is
programmable by register and all internal read data transfers
occur automatically based on the programmed sub-page size.
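The benefit of sub-pages can be seen from the internal transfer time alone. The sketch below assumes one byte per clock cycle on an 8-bit internal bus, so that the 40MHz asynchronous mode moves 40MB/s; that rate, the function name, and the choice to ignore spare bytes and array access time are simplifying assumptions, not figures from the paper.

```python
ASYNC_MB_PER_S = 40.0        # assumption: one byte per cycle at 40MHz on an 8-bit bus
PAGE_BYTES = 8192            # main page size supported by the SRAM (spare bytes ignored)


def transfer_time_us(n_bytes: int, mb_per_s: float) -> float:
    """Time to move n_bytes over the internal NAND bus, in microseconds.

    With 1MB = 10**6 bytes, bytes divided by MB/s gives microseconds.
    """
    return n_bytes / mb_per_s


for sub_page in (2048, 4096, 8192):
    per_page = PAGE_BYTES // sub_page
    print(sub_page, per_page, transfer_time_us(sub_page, ASYNC_MB_PER_S))
```

Under these assumptions a 2048-byte sub-page occupies the internal bus for a quarter of the time of a full 8192-byte page, which is why smaller sub-pages help workloads dominated by small random reads.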
Program and read throughput are a function of the page
size, the external interface throughput, the internal transfer
time, and the page program and page read times of the specific
NAND flash devices packaged with the chip. Although an
individual NAND die may provide only 5MB/s program
throughput, all 8 die may be operated independently and
simultaneously to achieve 40MB/s within a single HLNAND
MCP. Only 8 MCPs are required to fully saturate the
HyperLink ring. Fig. 4 shows test results indicating 308MB/s
throughput with a 6.5ns DDR clock.
Figure 4: Output Data Shmoo Plot (tCK=6.5ns @ Vdd=1.8V; 154MHz, 308MB/s DDR)
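The throughput figures in this section follow from simple arithmetic, sketched below. The function names are illustrative; the inputs (6.5ns clock, byte-wide DDR bus, 5MB/s per die, 8 die per MCP, 300MB/s ring) are the values quoted in the text.

```python
import math


def ddr_throughput_mb_s(t_ck_ns: float, bus_bytes: int = 1) -> float:
    """DDR transfers data on both clock edges: two transfers per cycle."""
    return 2 * bus_bytes * 1000.0 / t_ck_ns


def mcp_program_mb_s(die_mb_s: float, die_per_mcp: int) -> float:
    """Aggregate program throughput when all die operate concurrently."""
    return die_mb_s * die_per_mcp


def mcps_to_saturate(ring_mb_s: float, mcp_mb_s: float) -> int:
    return math.ceil(ring_mb_s / mcp_mb_s)


ring = ddr_throughput_mb_s(6.5)                 # about 307.7MB/s, quoted as 308MB/s
mcp = mcp_program_mb_s(5.0, 8)                  # 40MB/s per MCP
print(round(ring), mcp, mcps_to_saturate(300.0, mcp))
```

The result of 8 MCPs holds for both the nominal 300MB/s interface rate and the measured 308MB/s, since 7 MCPs would supply only 280MB/s of program data.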
V. CONCLUSION
The HLNAND MCP connects through a ring topology to
provide high throughput, increased scalability, reduced I/O
power and flexible page size, delivering important system level
benefits. Placing the high speed interface on a small separate
logic die avoids the cost that a larger die and the process
enhancements needed for higher performance I/O transistors
would otherwise add to every NAND die. Isolating the internal
NAND busses from the external interface dramatically reduces
loading and CV power. Fig. 5 shows a die photo of a prototype
interface chip and an X-ray cross section of a 4-die stack MCP.
The key features of the production version interface chip and
MCP supporting up to 8 NAND die are summarized in Table 1.
Figure 5: Interface Chip Die Photo and X-ray cross section of 4
NAND die stack MCP
Technology (Interface Chip)   0.18um CMOS 1P6M
Chip Size (Interface Chip)    16.2mm²
Organization                  8192 bytes x 128 pages x 4096 blocks x 2 LUNs x 4 banks
Power Supply                  2.7V ~ 3.6V & 1.8V
Read Time                     96us ~ 146us (2KB ~ full page)
Program Time                  2ms (Typ)
Erase Time                    1.5ms (Typ)
Clock Cycle Time              6.5ns
I/O Width                     x1, x2, x4, and x8
Package                       18mm x 14mm 100-Ball BGA
Table 1. 256Gb NAND Flash MCP Key Features
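The organization can be cross-checked against the stated 256Gb capacity. The sketch below uses the 8192-byte main page size given in Section IV and excludes spare bytes; the function name is illustrative.

```python
def capacity_bits(page_bytes: int, pages: int, blocks: int, luns: int, banks: int) -> int:
    """Total main-array capacity in bits for the given organization."""
    return page_bytes * pages * blocks * luns * banks * 8


total = capacity_bits(page_bytes=8192, pages=128, blocks=4096, luns=2, banks=4)
print(total // 2**30, "Gb")                    # capacity in units of 2**30 bits
print(total // 8 // (2 * 4) // 2**30, "GB per die")
```

With 2 LUNs on each of 4 banks, the 8 die each contribute 32Gb (4GB), matching the eight stacked 32Gb MLC die described in the abstract.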
ACKNOWLEDGMENTS
The authors thank Dick Foss, Steven Przybylski, Roelof
Salters, and John Lindgren for technical suggestions and
support.
REFERENCES
[1] J.-K. Kim, K. Sakui, et al., “A 120mm² 64Mb NAND Flash Memory
Achieving 180ns/Byte Effective Program Speed,” Symp. on VLSI
Circuits, Digest of Technical Papers, Jun. 1996, pp. 168-169.
[2] R. Cernea, L. Pham, et al., “A 34MB/s-Program-Throughput 16Gb MLC
NAND with All-Bitline Architecture in 56nm,” ISSCC Dig. Tech.
Papers, Feb. 2008, pp. 420-421.
[3] D. Nobunaga, E. Abedifard, et al., “A 50nm 8Gb Flash Memory with
100MB/s Program Throughput and 200MB/s DDR Interface,” ISSCC
Dig. Tech. Papers, Feb. 2008, pp. 426-427.
[4] H. Wiggers, D. Gustavson, et al., “IEEE Standard for High-Bandwidth
Memory Interface Based on Scalable Coherent Interface (SCI) Signaling
Technology (RamLink),” IEEE Std 1596.4-1996.
[5] R. Schuetz, H. J. Oh, et al., “HyperLink NAND Flash Architecture for
Mass Storage Applications,” IEEE NVSMW, Aug. 2007, pp. 3-4.