
Design of on-chip error correction systems for multilevel NOR and NAND flash memories

IET Circuits, Devices & Systems

July 2007
1(3):241 - 249

DOI:10.1049/iet-cds:20060275

Source
IEEE Xplore

Authors:

K. Rose

Rensselaer Polytechnic Institute

Tong Zhang

Rensselaer Polytechnic Institute



Design of on-chip error correction systems for

multilevel NOR and NAND ﬂash memories

F. Sun, S. Devarajan, K. Rose and T. Zhang

Abstract: This paper concerns the design of on-chip error correction systems for multilevel code-storage NOR flash and data-storage NAND flash memories. The concept of trellis coded modulation (TCM) is used to design an on-chip error correction system for NOR flash. This is motivated by the non-trivial modulation process in multilevel memory storage and the effectiveness of TCM in integrating coding with modulation to provide better performance at relatively short block length. The effectiveness of TCM-based systems, in terms of error-correcting performance, coding redundancy, silicon cost and operational latency, is demonstrated. Meanwhile, the potential of using strong Bose–Chaudhuri–Hocquenghem (BCH) codes to improve multilevel data-storage NAND flash memory capacity is investigated. Current multilevel flash memories store 2 bits in each cell. Further storage capacity may be achieved by increasing the number of storage levels per cell, which nevertheless will correspondingly degrade the raw storage reliability. It is demonstrated that strong BCH codes can effectively enable the use of a larger number of storage levels per cell and hence improve the effective NAND flash memory storage capacity by up to 59.1% without degradation of cell programming time. Furthermore, a scheme to leverage strong BCH codes to improve memory defect tolerance at the cost of increased NAND flash cell programming time is proposed.

1 Introduction

Driven by the ever increasing demand for on-chip/board non-volatile data storage, flash memory has become one of the fastest growing segments in the global semiconductor industry [1]. Flash memories are categorised into two families, NOR flash and NAND flash [2]: NOR flash memories are mainly used for code storage and have relatively short block length, for example, 16 or 64 user bits per block, whereas NAND flash memories are mainly used for massive data storage and have relatively long block length, for example, 8192 or 16,384 user bits (i.e. 1024 or 2048 user bytes) per block. With its well-demonstrated effectiveness for increasing flash memory storage capacity, the multilevel concept, that is, to store more than 1 bit in each cell (or floating-gate MOS transistor) by programming the cell threshold voltage into one of l > 2 voltage windows, is being widely used in both NOR and NAND flash memories [3–7]. Owing to the inherently reduced operational margin, multilevel flash memories are increasingly relying upon on-chip error correction to ensure the storage reliability [8–10]. In current practice, most multilevel NOR and NAND flash memories store 2 bits in each memory cell and employ classical linear block error correction codes (ECCs) such as Hamming and Bose–Chaudhuri–Hocquenghem (BCH) codes to realise on-chip error correction.

This work addresses the design of multilevel flash memory on-chip error correction systems that may outperform the current practice by realising superior reliability and/or enabling higher effective storage capacity. Because of the significant difference in block length between NOR and NAND flash memories, we consider these two types of flash memories separately. In the context of NOR flash, we investigate the use of the trellis coded modulation (TCM) [11] technique to realise on-chip error correction. The motivation is 2-fold: (1) The more-than-two-levels-per-cell storage capacity makes the modulation process non-trivial and an integral part of the on-chip ECC. (2) TCM can effectively integrate ECC with modulation to realise better error correction performance when the block length is relatively small. We note that, although the use of TCM in multilevel memory was first proposed in [12], the incurred hardware implementation cost and latency overhead have not been addressed, which leaves its practical feasibility a missing link. Furthermore, the TCM-based approach is only applicable to NOR flash because its advantage over conventional linear block codes quickly diminishes as the block length increases, which however was not pointed out in [12]. To evaluate the silicon cost of using TCM-based on-chip error correction, we implemented the read datapath consisting of high-precision sensing circuits and a TCM decoder. The results suggest that TCM-based systems can achieve encouraging memory cell savings at small operational latency and silicon cost.

In the context of NAND flash, we investigated the use of very strong BCH codes to enable higher storage capacity. Currently, most multilevel NAND flash memories store 2 bits (or four levels) in each cell, for which a weak ECC that can only correct a few (e.g. one or two) errors is typically used [13]. Higher storage capacity may be realised by further increasing l, which will make it increasingly difficult to ensure storage reliability. In this regard, solutions may be pursued along two directions: (i) improve the programming scheme to accordingly tighten each threshold voltage window and (ii) use much stronger ECC. Along the first direction, researchers have developed high-accuracy programming techniques to realise 3 bits/cell and even 4 bits/cell storage capacity [14, 15], which however complicates the design of the peripheral programming mixed-signal circuits and degrades the programming throughput.

© The Institution of Engineering and Technology 2007

doi:10.1049/iet-cds:20060275

Paper first received 3rd September 2006 and in revised form 7th April 2007

The authors are with the ECSE Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

E-mail: sunf@rpi.edu

IET Circuits Devices Syst., 2007, 1, (3), pp. 241–249

To the best of our knowledge, the potential of using much stronger ECC to improve NAND flash storage capacity has not been addressed in the open literature. This work attempts to fill this gap by investigating the use of strong BCH codes to enable a relatively large l (6, 8 and 12 in this work). Although it has the advantages of simplifying the programming circuits and maintaining or even increasing the programming throughput, the use of strong ECC is subject to two main drawbacks: (i) strong ECC requires a higher coding redundancy that will inevitably degrade the storage capacity improvement gained by a larger l and (ii) the ECC decoder may incur non-negligible silicon area overhead and increase read latency. In general, to realise the same error correction performance (or to achieve the same coding gain), the longer the ECC code length, the lower the relative coding redundancy (or the higher the code rate). Therefore strong ECCs are only suitable for NAND flash memories, which have long data block length and hence may tolerate longer read latency. Using 2 bits/cell NAND flash memories that employ single-error-correcting Hamming codes as a benchmark, we investigated the effectiveness of using strong BCH codes to ensure storage reliability when increasing the value of l to 6, 8 and 12, respectively. With the same programming scheme, and hence the same threshold voltage distribution characteristics as the 2 bits/cell benchmark, a larger value of l will result in a worse raw storage reliability and demand a stronger BCH code. To investigate the trade-off between design complexities and storage capacity improvements, we designed BCH decoders using 0.13 μm complementary metal–oxide–semiconductor (CMOS) standard cell and static random access memory (SRAM) libraries. The results show that strong BCH codes can enable a relatively large increase of the number of storage levels per cell and hence a potentially significant memory storage capacity improvement. Finally, a scheme is proposed to leverage strong BCH codes to improve NAND flash memory defect tolerance by trading off the memory cell programming time.

The paper is organised as follows: We briefly present the basics of multilevel flash memories in Section 2. The proposed TCM-based on-chip error correction system for multilevel NOR flash memories is presented in Section 3, and Section 4 discusses the use of strong BCH codes in multilevel NAND flash memories. Conclusions are drawn in Section 5.

2 Multilevel ﬂash memories

This section briefly presents some basics of multilevel flash memory programming/read and the memory cell threshold voltage distribution model to be used in this work. Interested readers are referred to [3] for a comprehensive discussion on multilevel flash memories. Multilevel flash memory programming is typically realised by combining a program-and-verify technique with a staircase $V_{pp}$ ramp as illustrated in Fig. 1. The tightness of each programming threshold voltage window is proportional to the step-up voltage $\Delta V_{pp}$, whereas the cell programming time is roughly proportional to $1/\Delta V_{pp}$. The read circuit in l levels/cell NAND flash memories usually has a serial sensing structure that takes l − 1 cycles to finish the read operation. Higher read speed can be realised by increasing the sensing parallelism at the cost of silicon area, which is typically preferred in latency-critical NOR flash memories.

On the basis of the results published in [16] for 2 bits/cell NOR flash memory, the cell threshold voltage approximately follows a Gaussian distribution as illustrated in Fig. 2: the two inner distributions have the same standard deviation, denoted as $\sigma$; the standard deviations of the two outer distributions are $4\sigma$ and $2\sigma$, respectively. The locations of the means of the two inner distributions are determined to minimise the raw bit error rate. Let $V_{max}$ denote the voltage difference between the means of the two outer distributions. We assume that this model is also valid for NAND flash memories.

3 TCM-based on-chip error correction for NOR flash

3.1 TCM system structure

The basic idea of TCM is to jointly design trellis codes (i.e. convolutional codes) and signal mapping (i.e. modulation) processes to maximise the free Euclidean distance between coded signal sequences. (Similar to the Hamming distance of linear block codes, free Euclidean distance determines the error correction capability of convolutional codes, that is, a convolutional code with free Euclidean distance of $d_{free}$ can correct at least $\lfloor (d_{free} - 1)/2 \rfloor$ code symbol errors.) As illustrated in Fig. 3, given an l-level/cell memory core, an m-dimensional TCM encoder receives a sequence of n-bit input data, adds r-bit redundancy and hence generates a sequence of (n + r)-bit data, where each (n + r)-bit data are stored in m memory cells and $2^{n+r} \le l^m$. The encoding process can be outlined as follows: (1) A convolutional encoder convolves the input k-bit sequence with r linear algebraic functions and generates k + r coded bits. (2) Each k + r coded bits select one of the $2^{k+r}$ subsets of an m-D signal constellation, where each subset contains $2^{n-k}$ signal points. (3) The additional n − k uncoded bits select an individual m-D signal point from the selected subset.
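The bookkeeping behind this encoder structure can be checked with a few lines of Python. The sketch below is only an illustration: the function name `tcm_parameters` is hypothetical, and the values passed in the usage line (n = 7, k = 2, r = 1, m = 4, l = 4) are the ones adopted later in Section 3.2.

```python
def tcm_parameters(n, k, r, m, levels):
    """Derive subset counts for an m-D TCM scheme on `levels`-level cells.

    Raises ValueError if the (n + r) coded bits do not fit into m cells,
    i.e. if 2^(n+r) > levels^m.
    """
    constellation_size = levels ** m
    if 2 ** (n + r) > constellation_size:
        raise ValueError("2^(n+r) exceeds the number of m-D signal points")
    return {
        "constellation_size": constellation_size,   # l^m signal points
        "subsets": 2 ** (k + r),                     # selected by the coded bits
        "points_per_subset": 2 ** (n - k),           # selected by the uncoded bits
        "uncoded_bits": n - k,
    }


params = tcm_parameters(n=7, k=2, r=1, m=4, levels=4)
print(params)
```

With these values the 256-point 4D constellation splits into 8 subsets of 32 points, so 5 uncoded bits address a point inside the chosen subset.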

Fig. 1 Schematic illustration of program-and-verify cell programming

Fig. 2 Approximate cell threshold voltage distribution model in 2 bits/cell memory

Let s denote the memory order of the convolutional code encoder. To protect an N-bit data block, the TCM encoder receives N + s bits in total, including s zero bits for convolutional code termination. If N + s is not divisible by n, the last input to the encoder will contain fewer than n bits, for which the m-D modulation may be simplified to a modulation with a lower dimension. As illustrated in Fig. 3, the TCM decoder contains an m-D demodulator that provides $2^{k+r}$ branch metrics and branch symbol decisions to the Viterbi decoder for trellis decoding.

3.2 System design and performance evaluation

Targeting 2 bits/cell NOR flash memories, we designed three TCM-based systems that protect 16-bit, 32-bit and 64-bit user data in one codeword. These three systems share the same system design parameters (refer to Fig. 3): n = 7, k = 2, r = 1, m = 4 and the memory order of the convolutional code s = 3. The signal read from each memory cell is quantised by 12 levels. We decided to use 12-level quantisation mainly on the basis of our finite-precision computer simulations, which suggested that 12-level quantisation provides a good trade-off between implementation cost and error correction performance. To realise 4D modulation, we use the scheme proposed by Wei [17] that hierarchically partitions the 4D rectangular lattice formed by four memory cells into eight 4D sub-lattices. Each coded 3-bit data from the convolutional code encoder selects one out of the eight 4D sub-lattices. The 4D signal space partition is described as follows:

First, we partition each 1D signal constellation (corresponding to one memory cell) into two subsets E and F, as shown in Fig. 4, where the signal points labelled as e and f belong to the subsets E and F, respectively.

Next, we partition each 2D signal constellation into four subsets A = (E, F), B = (F, E), C = (E, E) and D = (F, F), as shown in Fig. 4, where the signal points labelled as a, b, c and d belong to the subsets A, B, C and D, respectively.

Finally, we partition the 4D signal constellation into $2^{k+r} = 8$ 4D subsets formed as listed in Table 1.
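The three partition steps can be reproduced programmatically. The following Python sketch is illustrative only: it assumes that the four levels of a cell are labelled alternately (E = {0, 2}, F = {1, 3}), which is the usual set-partitioning choice but is not stated explicitly in the text, and all identifiers are hypothetical. The eight 4D subsets are entered in the order of Table 1.

```python
from itertools import product

# Assumption: alternating labelling of the four cell levels.
ONE_D = {"E": (0, 2), "F": (1, 3)}
# 2D subsets as pairs of 1D subsets (first cell, second cell).
TWO_D = {"A": ("E", "F"), "B": ("F", "E"), "C": ("E", "E"), "D": ("F", "F")}
# 4D subsets: each is the union of two concatenations of 2D subsets (Table 1 order).
FOUR_D = (
    (("A", "A"), ("B", "B")), (("C", "C"), ("D", "D")),
    (("A", "B"), ("B", "A")), (("C", "D"), ("D", "C")),
    (("A", "C"), ("B", "D")), (("C", "B"), ("D", "A")),
    (("A", "D"), ("B", "C")), (("C", "A"), ("D", "B")),
)


def points_2d(name):
    """All (level, level) pairs belonging to one 2D subset."""
    first, second = TWO_D[name]
    return set(product(ONE_D[first], ONE_D[second]))


def points_4d(index):
    """All 4-tuples of cell levels belonging to the 4D subset `index` (0-based)."""
    points = set()
    for front, back in FOUR_D[index]:
        points |= {p + q for p in points_2d(front) for q in points_2d(back)}
    return points


subsets = [points_4d(j) for j in range(8)]
print([len(s) for s in subsets])          # size of each 4D subset
print(len(set().union(*subsets)))         # total number of distinct 4D points
```

Under this labelling every 4D subset holds 32 points and the eight subsets tile the full 256-point constellation without overlap, which matches the 5 uncoded bits per 4D symbol.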

To protect 16-bit user data, the TCM encoder receives 19 bits (including 3 zero bits for termination) and finishes encoding in three steps: during each of the first two steps, it receives 7 bits and maps the coded 8 bits onto four memory cells through 4D modulation; in the last step, it receives 5 bits and maps the coded 6 bits onto three memory cells through 3D modulation that is obtained by collapsing one 2D constellation in the original 4D modulation into a 1D constellation. Therefore this system is denoted as (11, 8) TCM, that is, one codeword occupies 11 memory cells and protects 2 × 8 = 16-bit user data (notice that each memory cell stores 2 bits). For the purpose of comparison, we considered two other ECC schemes using linear block codes: (i) a (11, 8, 1) 4-ary shortened Hamming code and (ii) a (13, 8, 2) 4-ary shortened two-error-correcting BCH code [18].

Fig. 5a shows the performance comparison of these three schemes. Although the performance curves of the two linear block codes can be analytically derived, we have to rely on extensive computer simulations to obtain the performance curve of the (11, 8) TCM system, for which the solid part is obtained by computer simulation and the dashed part is estimated following the trend of the simulation results. With the same coding redundancy, (11, 8) TCM can achieve about five orders of magnitude better performance than the (11, 8, 1) Hamming code. Compared with the (13, 8, 2) BCH code, (11, 8) TCM can achieve almost the same performance while realising a saving of 2/13 (15.4%) memory cells.

To protect 32-bit user data, the TCM encoder receives 35 bits and finishes encoding in five steps; each step maps 8 bits onto four memory cells through 4D modulation. Hence, this is denoted as (20, 16) TCM, that is, one codeword occupies 20 memory cells and protects 32-bit user data. The (20, 16) TCM is compared with two linear block codes: (i) a (19, 16, 1) 4-ary shortened Hamming code and (ii) a (23, 16, 2) 4-ary shortened BCH code. Fig. 5b shows their performance comparison.

To protect 64-bit user data, the TCM encoder receives 67 bits and finishes encoding in 10 steps: during each of the first nine steps, it receives 7 bits and maps the coded 8 bits onto four memory cells through 4D modulation; in the last step, it receives 4 bits that bypass the convolutional code encoder and directly map onto two memory cells through 2D modulation that is a constituent of the original 4D modulation. Hence, this is denoted as (38, 32) TCM. We compared it with two linear block codes: (i) a (36, 32, 1) 4-ary shortened Hamming code and (ii) a (39, 32, 2) 4-ary shortened BCH code. Fig. 5c shows their performance comparison.

Fig. 3 Block diagram of TCM-based on-chip error correction system

Fig. 4 16-point 2D signal constellation partition

Table 1: Partition of the 4D signal constellation

4D subset    Concatenation form
P_1          (A, A) ∪ (B, B)
P_2          (C, C) ∪ (D, D)
P_3          (A, B) ∪ (B, A)
P_4          (C, D) ∪ (D, C)
P_5          (A, C) ∪ (B, D)
P_6          (C, B) ∪ (D, A)
P_7          (A, D) ∪ (B, C)
P_8          (C, A) ∪ (D, B)

Table 2 summarises the comparison between the TCM and the other ECC schemes discussed above in terms of coding redundancy and error-correcting performance. Notice that the positive and negative numbers in the third column mean positive and negative saving of memory cells, respectively, and the performance gain in the fourth column is measured when the bit error rate (BER) of the TCM-based system approaches $10^{-14}$.
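The memory cell savings follow directly from the codeword lengths given in the text. As a quick check, the Python sketch below recomputes them; the function name `cell_saving_percent` is illustrative, and a negative result means that the TCM codeword occupies more cells than the competing code.

```python
def cell_saving_percent(tcm_cells, competing_cells):
    """Relative saving of memory cells when TCM replaces a competing code."""
    return 100.0 * (competing_cells - tcm_cells) / competing_cells


# (TCM codeword length in cells, competing code, its length in cells)
comparisons = [
    (11, "(11, 8, 1) Hamming", 11),
    (11, "(13, 8, 2) BCH", 13),
    (20, "(19, 16, 1) Hamming", 19),
    (20, "(23, 16, 2) BCH", 23),
    (38, "(36, 32, 1) Hamming", 36),
    (38, "(39, 32, 2) BCH", 39),
]

savings = [round(cell_saving_percent(t, c), 1) for t, _, c in comparisons]
for (tcm, name, _), saving in zip(comparisons, savings):
    print(f"TCM with {tcm} cells vs {name}: {saving:+.1f}%")
```

The 15.4% figure quoted for (11, 8) TCM against the (13, 8, 2) BCH code is reproduced, and the two longer Hamming codes come out slightly shorter than their TCM counterparts.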

3.3 Silicon implementation

The above result shows the effectiveness of TCM-based on-chip error correction in terms of coding redundancy and error-correcting performance. However, to be a promising candidate for multilevel NOR flash memory, it should be able to achieve small latency and negligible silicon area compared with the overall memory die size. In the following, we present proof-of-concept implementation results for the above three TCM-based systems for protecting 16-bit, 32-bit and 64-bit user data. Clearly, TCM encoders are very simple and can easily achieve very small latency with negligible silicon cost. Hence we only focus on TCM decoders.

The TCM decoding datapath contains high-precision sensing circuits, a 4D demodulator and a Viterbi decoder. The sensing circuit realises 12-level quantisation, instead of 4-level quantisation as in the conventional linear-block-code-based ECC scheme. Using Cadence tools with IBM 0.18 μm 7WL technology, we designed a current-mode 12-level parallel sensing circuit following the structure proposed in [19]. Fig. 6 shows the general structure of a 12-level current-mode parallel sensing circuit that mainly contains 11 current comparators. 12-level quantisation is realised by comparing the current from the selected memory cell with the reference currents from the 11 reference cells, which are appropriately programmed. The silicon area of one 12-level current-mode parallel sensing circuit is estimated as 0.006 mm². The simulation results show that the worst-case sensing latency (i.e. the input current is equal to one of the reference currents) is about 300 ps.
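Functionally, a bank of 11 parallel comparators produces a thermometer code whose number of asserted outputs is the quantisation level. A minimal behavioural sketch in Python follows; it is not a circuit model, and the function name and the evenly spaced reference currents are purely hypothetical.

```python
def quantise_parallel(cell_current, reference_currents):
    """Behavioural model of parallel sensing.

    Each comparator asserts when the cell current exceeds its reference;
    the number of asserted comparators is the quantisation level.
    With 11 references the result lies in 0..11, i.e. 12 levels.
    """
    thermometer = [cell_current > ref for ref in reference_currents]
    return sum(thermometer)


# Hypothetical reference currents in arbitrary units, one per reference cell.
references = [float(i) for i in range(1, 12)]

print(quantise_parallel(0.5, references))
print(quantise_parallel(5.5, references))
print(quantise_parallel(11.5, references))
```

Because all comparisons are made at once, the read takes a single sensing step, in contrast with the serial l − 1 cycle structure described in Section 2.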

Upon receiving the data from four 12-level sensing circuits, the 4D demodulator finds the most likely point in each 4D signal subset and calculates the corresponding log-likelihood metrics as the branch metrics sent to the Viterbi decoder. The output branch metrics are represented with 6 bits. The 4D demodulator receives a set of data denoted as $\mathbf{Z} = \{\hat{z}_1, \ldots, \hat{z}_4\}$ from the four analogue-to-digital converters (ADCs), where each $\hat{z}_i$ is the digitised data read from one cell. Given $\mathbf{Z}$, the 4D demodulator should calculate

\[
\text{4D\_Metric\_}j = \max_{p \in P_j} \bigl( \log P(\mathbf{Z} \mid p) \bigr), \quad \text{for } j = 1, 2, \ldots, 8 \tag{1}
\]

where each 4D_Metric_j represents the log-likelihood value of the most likely point in each 4D subset and p represents a point in the 4D subset $P_j$. As each point in each 4D subset is represented by 5 bits, the 4D demodulator should also generate eight 5-bit data representing the most likely points in the eight 4D subsets. Leveraging the hierarchical structure of the 4D modulation, as shown in Fig. 7, the demodulation is realised in a hierarchical manner.

Fig. 5 BER performance when protecting 16-bit, 32-bit and 64-bit user data. In TCM schemes, the signal read from each memory cell is quantised by 12 levels

Table 2: Comparison between TCM and the other ECC schemes based on linear block code

TCM        Competing ECC          Savings of cells, %    Performance gain
(11, 8)    (11, 8, 1) Hamming     0                      10^5
           (13, 8, 2) BCH         15.4                   1
(20, 16)   (19, 16, 1) Hamming    −5.3                   10^(…)
           (23, 16, 2) BCH        13                     10^(…)
(38, 32)   (36, 32, 1) Hamming    −5.6                   > 10^(…)
           (39, 32, 2) BCH        2.6                    > 10^(…)

Fig. 6 Structure of a 12-level parallel sensing circuit

The 4D demodulator starts with finding the closest point in each 1D subset and its metric. Each 1D demodulator receives the data $\hat{z}_i$ read from one cell and calculates

\[
\text{Metric\_E} = \max_{p \in E} \bigl( \log P(\hat{z}_i \mid p) \bigr), \qquad
\text{Metric\_F} = \max_{p \in F} \bigl( \log P(\hat{z}_i \mid p) \bigr) \tag{2}
\]

where Metric_E and Metric_F represent the log-likelihood value of the most likely point in the two 1D subsets E and F, respectively. The 1D demodulator also generates 1 bit (note that we have two signal points in each 1D subset) to represent the closest point in each 1D subset. This operation can be implemented using a simple look-up table.

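As discussed in Section 3.2, each 2D signal constellation is partitioned into four subsets A = (E, F), B = (F, E), C = (E, E) and D = (F, F). Upon the received data $\{\hat{z}_i, \hat{z}_{i+1}\}$ for i = 1, 3, the 2D demodulator generates

\[
\begin{aligned}
\text{Metric\_A} &= \max_{p \in A} \bigl( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \bigr) = \text{Metric\_E} + \text{Metric\_F} \\
\text{Metric\_B} &= \max_{p \in B} \bigl( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \bigr) = \text{Metric\_F} + \text{Metric\_E} \\
\text{Metric\_C} &= \max_{p \in C} \bigl( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \bigr) = \text{Metric\_E} + \text{Metric\_E} \\
\text{Metric\_D} &= \max_{p \in D} \bigl( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \bigr) = \underbrace{\text{Metric\_F}}_{\text{1st 1D}} + \underbrace{\text{Metric\_F}}_{\text{2nd 1D}}
\end{aligned} \tag{3}
\]

In a similar way, the 4D demodulator generates eight demodulation metrics as follows

\[
\begin{aligned}
\text{4D\_Metric\_1} &= \max_{p \in P_1} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_A} + \text{Metric\_A},\ \text{Metric\_B} + \text{Metric\_B}) \\
\text{4D\_Metric\_2} &= \max_{p \in P_2} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_C} + \text{Metric\_C},\ \text{Metric\_D} + \text{Metric\_D}) \\
\text{4D\_Metric\_3} &= \max_{p \in P_3} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_A} + \text{Metric\_B},\ \text{Metric\_B} + \text{Metric\_A}) \\
\text{4D\_Metric\_4} &= \max_{p \in P_4} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_C} + \text{Metric\_D},\ \text{Metric\_D} + \text{Metric\_C}) \\
\text{4D\_Metric\_5} &= \max_{p \in P_5} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_A} + \text{Metric\_C},\ \text{Metric\_B} + \text{Metric\_D}) \\
\text{4D\_Metric\_6} &= \max_{p \in P_6} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_C} + \text{Metric\_B},\ \text{Metric\_D} + \text{Metric\_A}) \\
\text{4D\_Metric\_7} &= \max_{p \in P_7} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\text{Metric\_A} + \text{Metric\_D},\ \text{Metric\_B} + \text{Metric\_C}) \\
\text{4D\_Metric\_8} &= \max_{p \in P_8} \bigl( \log P(\mathbf{Z} \mid p) \bigr) = \max(\underbrace{\text{Metric\_C}}_{\text{1st 2D}} + \underbrace{\text{Metric\_A}}_{\text{2nd 2D}},\ \underbrace{\text{Metric\_D}}_{\text{1st 2D}} + \underbrace{\text{Metric\_B}}_{\text{2nd 2D}})
\end{aligned} \tag{4}
\]

The generated eight metric values will be sent to the Viterbi decoder as the branch metrics for final decoding.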
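The hierarchy of Equations (2) to (4) maps naturally onto three small functions. The Python sketch below is a behavioural illustration under several assumptions that are not taken from the paper: the cell levels are labelled alternately (E = {0, 2}, F = {1, 3}), the read-back noise is Gaussian with an arbitrary standard deviation of 0.3, metrics are kept as floating-point numbers instead of 6-bit values, and all identifiers are hypothetical.

```python
LEVELS = (0.0, 1.0, 2.0, 3.0)            # hypothetical nominal read-back values
ONE_D = {"E": (0, 2), "F": (1, 3)}       # assumed alternating labelling
TWO_D = {"A": ("E", "F"), "B": ("F", "E"), "C": ("E", "E"), "D": ("F", "F")}
FOUR_D = (                               # same order as Equation (4)
    (("A", "A"), ("B", "B")), (("C", "C"), ("D", "D")),
    (("A", "B"), ("B", "A")), (("C", "D"), ("D", "C")),
    (("A", "C"), ("B", "D")), (("C", "B"), ("D", "A")),
    (("A", "D"), ("B", "C")), (("C", "A"), ("D", "B")),
)


def log_likelihood(z, level, sigma=0.3):
    """Gaussian log-likelihood up to an additive constant (assumed noise model)."""
    return -((z - LEVELS[level]) ** 2) / (2.0 * sigma ** 2)


def demod_1d(z):
    """Equation (2): best metric and best point in each 1D subset."""
    return {name: max((log_likelihood(z, p), (p,)) for p in points)
            for name, points in ONE_D.items()}


def demod_2d(first, second):
    """Equation (3): add the 1D metrics of the two cells of a 2D subset."""
    return {name: (first[s1][0] + second[s2][0], first[s1][1] + second[s2][1])
            for name, (s1, s2) in TWO_D.items()}


def demod_4d(z):
    """Equation (4): eight (metric, point) pairs, one per 4D subset."""
    one_d = [demod_1d(zi) for zi in z]
    front = demod_2d(one_d[0], one_d[1])
    back = demod_2d(one_d[2], one_d[3])
    return [max((front[a][0] + back[b][0], front[a][1] + back[b][1])
                for a, b in pairs)
            for pairs in FOUR_D]


results = demod_4d((0.1, 2.9, 1.2, 1.8))
best = max(range(8), key=lambda j: results[j][0])
print(best + 1, results[best])
```

Each stage only adds and compares metrics produced by the stage below it, which is why the hardware can obtain all eight branch metrics without enumerating the 256 4D points.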

The last block on the decoding datapath is a Viterbi decoder. To minimise the decoding latency, we use a state-parallel register-exchange Viterbi decoder architecture. As Viterbi decoder implementation has been extensively addressed in the open literature, we will not elaborate on the decoder architecture details. Interested readers are referred to [20]. Here we note that, for the scenario of protecting 16-bit user data, as the Viterbi decoder finishes the decoding in only three steps, we directly unrolled the recursive datapath of the original Viterbi decoder and fully optimised the circuit structure, which reduces both silicon area and decoding latency. Table 3 summarises the read datapath implementation metrics of the three TCM systems. We note that the TCM systems protecting 32-bit and 64-bit user data contain four sensing circuits and one 4D demodulator, whereas the TCM system protecting 16-bit user data contains 11 sensing circuits and two 4D and one 3D demodulators in order to match the parallelism of the unrolled Viterbi decoder. Hence the silicon area of the TCM system protecting 16-bit user data is comparable to that of the other two scenarios, even though it has the shortest block length.

4 BCH-based on-chip error correction for NAND flash

4.1 Binary BCH codes

Binary BCH code construction and encoding/decoding are based on binary Galois fields. A binary Galois field with degree of m is represented as GF($2^m$). For any $m \ge 3$ and $t < 2^{m-1}$, there exists a primitive binary BCH code over GF($2^m$), which has the code length $n = 2^m - 1$ and information bit length $k \ge 2^m - 1 - mt$ and can correct up to (or slightly more than) t errors. A primitive t-error-correcting (n, k, t) BCH code can be shortened (i.e. eliminate a certain number, say s, of information bits) to construct a t-error-correcting (n − s, k − s, t) BCH code with fewer information bits and shorter code length but the same redundancy. Given the raw BER $p_{raw}$, an (n, k, t) binary BCH code can achieve a codeword error rate of

\[
P_e = \sum_{i=t+1}^{n} \binom{n}{i}\, p_{raw}^{\,i}\, (1 - p_{raw})^{\,n-i} \tag{5}
\]

Fig. 7 Data flow of the 4D demodulator

Binary BCH encoding can be realised efficiently using linear shift registers, whereas binary BCH decoding is much more complex. Various BCH decoding algorithms have been proposed [21]. In Section 4.3, we will elaborate on the binary BCH decoding algorithm and decoder architecture used in this work.
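Equation (5) is awkward to evaluate directly for the code lengths of interest here, because the binomial coefficients overflow while the probabilities underflow. One possible workaround, sketched below in Python, is to accumulate the terms in the log domain. The function name, the `max_terms` truncation and the numbers in the usage line are illustrative assumptions and are not parameters from Table 4.

```python
import math


def codeword_error_rate(n, t, p_raw, max_terms=500):
    """Evaluate Equation (5) for an (n, k, t) binary BCH code.

    Terms are computed as exp(log C(n, i) + i log p + (n - i) log(1 - p)).
    The sum is truncated after `max_terms` terms, which is an approximation
    that relies on the terms decaying quickly when n * p_raw is small.
    Requires 0 < p_raw < 1.
    """
    log_p = math.log(p_raw)
    log_q = math.log1p(-p_raw)
    total = 0.0
    for i in range(t + 1, min(n, t + max_terms) + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(i + 1)
                     - math.lgamma(n - i + 1))
        total += math.exp(log_binom + i * log_p + (n - i) * log_q)
    return total


# Hypothetical example: a code of length 8400 correcting t = 10 errors.
print(codeword_error_rate(n=8400, t=10, p_raw=1e-5))
```

For short codes the result can be cross-checked against a direct evaluation with exact binomial coefficients.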

4.2 BCH codes for multilevel NAND ﬂash

We first investigate the potential storage capacity improvement by increasing l from 4 to 6, 8 and 12, respectively. Assuming the same programming scheme (i.e. the same step-up voltage $\Delta V_{pp}$ and hence the same cell programming time) as the 2 bits/cell memory, we have the cell threshold voltage distributions for l = 6, 8 and 12 as illustrated in Fig. 8 and described as follows: the l − 2 inner distributions have the same standard deviation $\sigma$; the standard deviations of the two outer distributions are $4\sigma$ and $2\sigma$, respectively. The locations of the means of the l − 2 inner distributions are determined to minimise the raw BER. It should be pointed out that, as the value of l increases, some factors such as floating-gate interference [22] and source line noise [23] might degrade the threshold voltage distribution (or increase the standard deviation). As currently no data are available in the open literature to model such possible deviation degradation and we expect that such degradation should not be significant, we assume that the standard deviation is independent of l in this work.

We set $V_{max}$, the voltage difference between the means of the two outer distributions, as 6.5 V [24] and $\sigma$ as 1. For l of 6, 8 and 12, we store 5 bits per two cells, 3 bits per cell and 7 bits per two cells, respectively. Accordingly, the raw BERs are about $8 \times 10^{-12}$ (l = 4), $5 \times 10^{(\ldots)}$ (l = 6), $5 \times 10^{(\ldots)}$ (l = 8) and $2 \times 10^{(\ldots)}$ (l = 12), respectively. Because the cell programming time remains the same as the 2 bits/cell benchmark, the programming throughput may approximately increase by 25, 50 and 75%, respectively.
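The bit allocations and throughput gains quoted above follow from grouping cells in pairs: two l-level cells offer l² combinations, of which the largest power of two is used. A short Python sketch of this arithmetic is given below; the function names are illustrative.

```python
def bits_per_cell_pair(levels):
    """Largest number of whole bits that fit into two `levels`-level cells."""
    combinations = levels ** 2
    return combinations.bit_length() - 1      # floor(log2(levels^2))


def throughput_gain_percent(levels, benchmark_levels=4):
    """Gain in bits written per programming operation over the benchmark."""
    return 100.0 * (bits_per_cell_pair(levels) / bits_per_cell_pair(benchmark_levels) - 1.0)


for l in (6, 8, 12):
    print(l, bits_per_cell_pair(l) / 2, throughput_gain_percent(l))
```

These gains are before ECC redundancy is accounted for; the stronger BCH codes needed at larger l consume part of the extra capacity.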

To protect 8192 and 16,384 user bits per codeword with a target codeword error rate of lower than $10^{-14}$, single-error-correcting Hamming codes will be sufficient to ensure the storage reliability for l = 4. For larger values of l, binary BCH codes are constructed by shortening primitive binary BCH codes under GF($2^{14}$) and GF($2^{15}$), respectively. Table 4 lists the BCH code parameters and the corresponding codeword error rates. Table 4 also shows the percentages of the user bits storage gain over the 2 bits/cell benchmark, given the same number of memory cells.

4.3 BCH code decoder architecture and ASIC design

To evaluate decoder silicon implementation metrics for the above BCH codes, we carried out application-specific integrated circuit (ASIC) design using 0.13 μm CMOS standard cell and SRAM libraries. In the following, we first briefly describe the BCH decoder architecture and then present the silicon implementation results. A syndrome-based binary BCH code decoder consists of three blocks, as shown in Fig. 9. For an (n, k, t) binary BCH code constructed under a Galois field with the primitive element $\alpha$, the overall decoder architecture is described as follows.

Table 3: Summary of implementation metrics

                Silicon area, mm²    Latency, ns
(11, 8) TCM     0.12                 8.3
(20, 16) TCM    0.10                 24.3
(38, 32) TCM    0.12                 44.3

Latency includes latency of sensing circuits and TCM decoding

Fig. 8 Approximate flash memory cell threshold voltage distribution model: a l = 6, b l = 8, c l = 12

4.3.1 Syndrome computation: Given the received bit vector r, it computes 2t syndromes as $S_i = \sum_{j=0}^{n-1} r_j \alpha^{(i+1)j}$ for i = 0, 1, ..., 2t − 1. As pointed out in [25] for binary BCH codes, we have $S_{2i+1} = S_i^2$, so only t parallel syndrome generators are required to explicitly calculate the syndromes associated with the odd powers of $\alpha$, followed by much simpler square circuits. For a decoder with parallelism of p (i.e. the syndrome computation block receives p input bits in each clock cycle), each syndrome generator has the structure as shown in Fig. 10.
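The saving from the squaring identity can be seen in a small software model. The Python sketch below works in GF(2^4) with the primitive polynomial x^4 + x + 1, a toy field chosen only to keep the example short; the much larger fields used for the NAND flash codes, the bit-serial ordering and the p-parallel hardware structure of Fig. 10 are not modelled, and all identifiers are hypothetical.

```python
M = 4
PRIM_POLY = 0b10011            # x^4 + x + 1, primitive over GF(2)
N = (1 << M) - 1               # code length of the unshortened code

# Exponential and logarithm tables for GF(2^M).
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
value = 1
for power in range(N):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY
for power in range(N, 2 * N):
    EXP[power] = EXP[power - N]


def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes_direct(r, t):
    """S_i = sum_j r_j * alpha^((i+1) j), computed explicitly for all 2t values."""
    return [_evaluate(r, i + 1) for i in range(2 * t)]


def syndromes_with_squaring(r, t):
    """Compute only every other syndrome explicitly; derive the rest by squaring."""
    out = []
    for i in range(2 * t):
        if i % 2 == 0:
            out.append(_evaluate(r, i + 1))          # odd power of alpha
        else:
            prev = out[(i - 1) // 2]
            out.append(gf_mul(prev, prev))           # S_{2i'+1} = S_{i'}^2
    return out


def _evaluate(r, power):
    """Evaluate the received polynomial r(x) at alpha^power."""
    s = 0
    for j, bit in enumerate(r):
        if bit:
            s ^= EXP[(power * j) % N]
    return s


received = [0] * N
received[3] = received[10] = 1     # all-zero codeword with two bit errors
print(syndromes_direct(received, 2))
print(syndromes_with_squaring(received, 2))
```

Both routines return the same values, but the second one evaluates the received polynomial only t times.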

4.3.2 Error locator calculation: Based on the 2t syndromes, we calculate the error locator polynomial $\Lambda(x) = 1 + \lambda_1 x + \cdots + \lambda_t x^t$ using the inversion-free Berlekamp–Massey algorithm [26]. To minimise the silicon area cost, a fully serial architecture is used, which takes t(t + 3)/2 clock cycles to finish the calculation. It mainly contains three Galois field multipliers and two first-in first-out (FIFO) buffers with lengths of t and t + 1, respectively.
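To make the idea of an inversion-free update concrete, the Python sketch below implements one textbook form of the Berlekamp–Massey recursion in which the division by the previous discrepancy is replaced by a cross-multiplication. It is a software illustration only: it is not claimed to match the exact formulation of [26] or the serial three-multiplier datapath described above, it runs in the toy field GF(2^4), and all identifiers are hypothetical. The result is a non-zero scalar multiple of the error locator polynomial, which has the same roots.

```python
M = 4
PRIM_POLY = 0b10011            # x^4 + x + 1
N = (1 << M) - 1

EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
value = 1
for power in range(N):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY
for power in range(N, 2 * N):
    EXP[power] = EXP[power - N]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(coeffs, x):
    """Horner evaluation; coeffs[i] is the coefficient of x^i."""
    result = 0
    for c in reversed(coeffs):
        result = gf_mul(result, x) ^ c
    return result


def inversion_free_bm(syndromes):
    """Return a scalar multiple of the error locator polynomial.

    syndromes[i] must equal r(alpha^(i+1)).
    """
    c, b_poly = [1], [1]       # current and previous connection polynomials
    length, shift, b = 0, 1, 1
    for n, _ in enumerate(syndromes):
        # Discrepancy uses all coefficients because c[0] is no longer fixed to 1.
        d = 0
        for i in range(length + 1):
            d ^= gf_mul(c[i], syndromes[n - i])
        if d == 0:
            shift += 1
            continue
        # c_new(x) = b * c(x) + d * x^shift * b_poly(x), with no field inversion.
        new = [gf_mul(b, coeff) for coeff in c]
        new += [0] * max(0, len(b_poly) + shift - len(new))
        for i, coeff in enumerate(b_poly):
            new[i + shift] ^= gf_mul(d, coeff)
        if 2 * length <= n:
            b_poly, b, length, shift = c, d, n + 1 - length, 1
        else:
            shift += 1
        c = new
    return c


def syndromes_for_errors(positions, t):
    """Syndromes of an all-zero codeword corrupted at the given bit positions."""
    out = []
    for i in range(2 * t):
        s = 0
        for e in positions:
            s ^= EXP[((i + 1) * e) % N]
        out.append(s)
    return out


locator = inversion_free_bm(syndromes_for_errors([3, 10], t=2))
roots = [i for i in range(N) if poly_eval(locator, EXP[i]) == 0]
print(sorted((N - i) % N for i in roots))
```

The positions recovered from the roots are the injected error positions, which the Chien search of the next subsection finds in hardware.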

4.3.3 Chien search: Upon receiving the error locator polynomial $\Lambda(x)$, it exhaustively examines whether $\alpha^i$ is a root of $\Lambda(x)$ for i = 0, 1, ..., n − 1, that is, it checks whether $\Lambda(\alpha^i) = \sum_{j=1}^{t} \Lambda_j \alpha^{ij} + 1$ is zero or not. It outputs an error vector e in such a way that, if $\alpha^i$ is a root, then $e_{n-i} = 1$, otherwise $e_{n-i} = 0$. The overall decoder output is obtained by r + e as illustrated in Fig. 9. Fig. 11 shows the Chien search architecture with the parallelism factor of p that generates a p-bit output each clock cycle.
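The Chien search step can be sketched in software as an exhaustive root test followed by the r + e correction. This is a bit-serial illustration under assumptions (GF(2^4) with x^4 + x + 1, a full-length n = 15 code, hypothetical function names), not the p-parallel structure of Fig. 11. The index n − i is taken modulo n so that i = 0 maps to position 0; this wrap-around is a convenience of the sketch.

```python
# Bit-serial Chien search sketch over GF(2^4); illustrative only.
M, PRIM_POLY = 4, 0b10011            # assumed small field, x^4 + x + 1
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
value = 1
for i in range(N):
    EXP[i], LOG[value] = value, i
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]


def chien_search(locator, n):
    """Return the error vector e with e[(n - i) % n] = 1 when alpha^i is a root."""
    e = [0] * n
    for i in range(n):
        acc = 0
        for j, coeff in enumerate(locator):     # sum_j Lambda_j * alpha^(i j)
            if coeff:
                acc ^= EXP[(LOG[coeff] + i * j) % N]
        if acc == 0:
            e[(n - i) % n] = 1
    return e


def correct(received, locator):
    """Decoder output r + e (bitwise XOR over GF(2))."""
    e = chien_search(locator, len(received))
    return [r ^ b for r, b in zip(received, e)]


# Locator for errors at positions 3 and 10: (1 + alpha^3 x)(1 + alpha^10 x).
LOCATOR = [1, EXP[3] ^ EXP[10], EXP[13]]
# g(x) = 1 + x^4 + x^6 + x^7 + x^8 of the (15, 7, 2) BCH code as a valid codeword.
CODEWORD = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
received = CODEWORD[:]
received[3] ^= 1
received[10] ^= 1
error_vector = chien_search(LOCATOR, N)
```

In this example `error_vector` is non-zero only at positions 3 and 10, and `correct(received, LOCATOR)` returns the original codeword.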

4.3.4 Decoder ASIC design: For the BCH codes listed in Table 4, we designed decoders with the following configurations: the syndrome computation and Chien search blocks have a parallelism factor of 4; the error locator calculation block is fully serial and takes t(t + 3)/2 clock cycles. Therefore the syndrome computation and Chien search blocks always have the same latency (in terms of the number of clock cycles), whereas the latency of the error locator calculation block depends on the value of t. To improve the decoding throughput and minimise the decoding latency, these BCH decoders support pipelined operation, summarised as follows:

• For l = 6, 8: The BCH codes have relatively small values of t, so the corresponding error locator calculation blocks have much less latency than the other two blocks. Therefore we use a one-stage pipelined decoder structure in which the syndrome computation and error locator calculation blocks operate on one codeword, whereas the Chien search block operates on another codeword in parallel.

• For l = 12: The BCH codes have relatively large values of t, so the corresponding error locator calculation blocks have similar or even slightly longer latency than the other two blocks. Therefore we use a two-stage pipelined decoder structure in which these three blocks operate in parallel on three consecutive codewords.

Furthermore, the decoder FIFO as shown in Fig. 9 is realised by SRAMs to minimise the silicon area cost. These BCH decoders are designed with Chartered 0.13 µm CMOS standard cell and SRAM libraries, where Synopsys tools are used throughout the design hierarchy down to place and route. We set the power supply to 1.08 V and the number of metal layers to four in place and route. Post-layout results verify that the decoders can operate at 400 MHz and hence support about 1.6 Gbps decoding throughput because of the decoder parallelism factor of 4. Such throughput appears to be sufficient in real-life applications [27]. The silicon area and decoding latency are listed in Table 5.
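The throughput and latency figures can be cross-checked with a simple timing model. The sketch below is an assumption of this note rather than the authors' stated method: it supposes that every pipeline stage is paced by the slowest stage, that the one-stage pipeline groups syndrome computation with error locator calculation, and that latencies are expressed in microseconds. Under these assumptions the estimates land within rounding of the latencies listed in Table 5. Function and constant names are hypothetical.

```python
import math

CLOCK_MHZ = 400        # post-layout clock frequency reported in the text
PARALLELISM = 4        # bits per cycle in syndrome computation and Chien search


def error_locator_cycles(t):
    """Clock cycles of the fully serial error locator block: t(t + 3)/2."""
    return t * (t + 3) // 2


def throughput_gbps():
    """Peak decoding throughput: parallelism times clock frequency."""
    return CLOCK_MHZ * PARALLELISM / 1000.0


def decoder_latency_us(n, t, two_stage_pipeline):
    """Estimated decoding latency in microseconds (assumed timing model)."""
    io_cycles = math.ceil(n / PARALLELISM)      # syndrome computation or Chien search
    locator_cycles = error_locator_cycles(t)
    if two_stage_pipeline:                      # l = 12: three blocks on three codewords
        stages, stage_cycles = 3, max(io_cycles, locator_cycles)
    else:                                       # l = 6, 8: (syndrome + locator) | Chien
        stages, stage_cycles = 2, io_cycles + locator_cycles
    return stages * stage_cycles / CLOCK_MHZ


CODES = [  # (l, n, t) taken from Table 4
    (6, 8262, 5), (8, 8360, 12), (12, 9130, 67),
    (6, 16459, 5), (8, 16609, 15), (12, 17914, 102),
]
for l, n, t in CODES:
    latency = decoder_latency_us(n, t, two_stage_pipeline=(l == 12))
    print(l, n, t, round(latency, 2))
```

For l = 12 the error locator block (2345 or 5355 cycles) is slightly slower than the other two blocks, so the sustained throughput in that configuration would be marginally below the 1.6 Gbps peak under this model.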

To demonstrate the overall NAND flash memory storage capacity improvement potential, we carried out the following estimation for 70-nm CMOS technology. The effective NAND flash memory cell size is 0.024 µm² at 70-nm CMOS technology [7]. We scale down the BCH decoder silicon area by a factor of (130/70)² ≈ 3.4 to estimate the decoder silicon area at 70-nm CMOS technology. Accordingly, Table 6 shows the estimated total numbers of user bits that can be stored in a NAND flash memory core of 100 mm² while considering the BCH decoder silicon area cost. The effective storage capacity improvement is obtained by comparing against the 2 bits/cell benchmark.
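The estimation can be reproduced with a few lines of arithmetic. In the sketch below, the number of coded bits per cell for each l (2.5, 3 and 3.5 bits for l = 6, 8 and 12) is an assumption inferred here because it matches the tabulated gains; it is not stated in this section. The decoder areas are the 0.13 µm post-layout values from Table 5, and all names are hypothetical.

```python
CORE_AREA_MM2 = 100.0      # NAND flash core area considered in the text
CELL_AREA_UM2 = 0.024      # effective cell size at 70 nm [7]
AREA_SCALE = 3.4           # (130/70)^2, rounded as in the text

# Assumed coded bits per cell for l storage levels; not stated in this section.
BITS_PER_CELL = {4: 2.0, 6: 2.5, 8: 3.0, 12: 3.5}


def stored_user_gbits(levels, n=None, k=None, decoder_area_mm2=0.0):
    """User bits (in Gbits) that fit in the core after subtracting the decoder area."""
    array_area_um2 = (CORE_AREA_MM2 - decoder_area_mm2 / AREA_SCALE) * 1e6
    cells = array_area_um2 / CELL_AREA_UM2
    code_rate = 1.0 if n is None else k / n    # no BCH code for the benchmark
    return cells * BITS_PER_CELL[levels] * code_rate / 1e9


def improvement_percent(gbits, baseline_gbits):
    """Effective storage capacity improvement over the benchmark, in percent."""
    return (gbits / baseline_gbits - 1.0) * 100.0


baseline = stored_user_gbits(4)                                  # 2 bits/cell benchmark
best = stored_user_gbits(12, n=17914, k=16384, decoder_area_mm2=2.14)
print(round(baseline, 2), round(best, 2), round(improvement_percent(best, baseline), 1))
```

Small differences in the last digit of the improvement figure can arise depending on whether the percentages are computed from rounded or unrounded capacities.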

4.4 Integration with defect tolerance

We assumed above that the cell programming time (and hence threshold voltage distribution) remains the same for various l and that BCH codes are solely used for compensating threshold voltage distribution-induced (TVDI) errors. It is intuitive that, if we improve the programming accuracy by reducing the step-up programming voltage V at the cost of increased programming time, the TVDI error rates will correspondingly reduce. This will leave a certain degree of BCH code error correction capability available for compensating memory defects. This can be considered as a trade-off between programming time and defect tolerance.

Table 4: BCH codes parameters and performance

l    (n, k, t) BCH codes       Codeword error rate   User bits storage gain, %
6    (8262, 8192, 5)           1.1 × 10⁻¹⁷           24.0
8    (8360, 8192, 12)          3.1 × 10⁻¹⁵           47.0
12   (9130, 8192, 67)          2.8 × 10⁻¹⁵           57.0
6    (16 459, 16 384, 5)       7.0 × 10⁻¹⁶           24.5
8    (16 609, 16 384, 15)      3.2 × 10⁻¹⁵           48.0
12   (17 914, 16 384, 102)     7.2 × 10⁻¹⁵           60.1

Fig. 9 Binary BCH code decoder structure

Fig. 10 Structure of one syndrome generator with parallelism factor of p

Fig. 11 Structure of Chien search with the parallelism factor of p


Following this intuition, we investigate such a trade-off in NAND flash memories with l of 6, 8 and 12, respectively. On the basis of the cell threshold voltage distribution model as presented above, if we reduce the step-up programming voltage V to improve the programming accuracy, the standard deviation of the threshold voltage distribution will accordingly reduce. In this work, we simply assume that the standard deviation σ is inversely proportional to the cell programming time. If a t-error-correcting binary BCH code needs to compensate up to $d_{def}$ defective memory cells, it will only be able to correct up to $t_{TVDI} = t - d_{def}$ TVDI errors. To accommodate such TVDI error correction capability loss, we have to accordingly reduce the TVDI error rates by improving the programming accuracy and hence reducing the standard deviation parameter σ. For the BCH codes listed in Table 4, we show the trade-offs between σ and $d_{def}$ in Table 7.
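Under the stated assumption that the standard deviation is inversely proportional to the cell programming time, the programming time penalty attached to each row of Table 7 follows directly. The sketch below additionally assumes that the tabulated standard deviations are normalised to the baseline value (so that 1.0 means unchanged programming time); that interpretation and the function name are assumptions of this example.

```python
def relative_programming_time(sigma_normalised):
    """Programming time relative to the baseline, assuming sigma is proportional to 1/time."""
    if not 0.0 < sigma_normalised <= 1.0:
        raise ValueError("expected a normalised standard deviation in (0, 1]")
    return 1.0 / sigma_normalised


# Rows of Table 7 for the (9130, 8192, 67) code: (sigma, d_def, t_TVDI).
ROWS = [(0.950, 4, 53), (0.900, 7, 42), (0.800, 12, 24), (0.571, 17, 6)]
for sigma, d_def, t_tvdi in ROWS:
    factor = relative_programming_time(sigma)
    print(f"d_def={d_def:2d}  t_TVDI={t_tvdi:2d}  sigma={sigma:.3f}  time x{factor:.2f}")
```

For instance, tolerating 12 defective cells with this code corresponds to a normalised standard deviation of 0.800, i.e. a programming time that is 1.25 times the baseline under the proportionality assumption.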

Based on the above discussion, we further propose a modified multilevel flash memory defect-tolerant strategy that combines the conventional spare rows/columns repair with BCH codes. As illustrated in Fig. 12, we first check whether the available spare rows/columns can repair all the defects in one memory block. If not, we carry out a certain repair algorithm to use the spare rows/columns to repair as many defects as possible so that the number of residual defective cells is minimised. Then we calculate how to adjust the threshold voltage distribution deviation parameter σ in order to compensate for the TVDI error correction capability loss. Finally, we check whether the target σ is feasible, subject to some practical constraints such as circuit precision and minimum allowable cell programming time.
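The decision flow described above can be sketched as follows. This is a heavily simplified illustration, not the procedure of Fig. 12 itself: it assumes that each spare resource repairs exactly one defect (a real repair algorithm allocates whole rows and columns), it looks up the target standard deviation from the Table 7 rows of the (8360, 8192, 12) code, and it condenses the practical constraints into a single hypothetical lower bound `min_sigma`. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Table 7 rows for the (8360, 8192, 12) code: tolerated defective cells -> sigma.
SIGMA_FOR_DEFECTS = {1: 0.930, 3: 0.719}


@dataclass
class RepairPlan:
    repaired_by_spares: int
    residual_defects: int
    target_sigma: Optional[float]   # None if no adjustment is needed or none is tabulated
    feasible: bool


def plan_block(defects, spares, sigma_table=SIGMA_FOR_DEFECTS, min_sigma=0.7):
    """Decide how one memory block is handled (simplified sketch)."""
    repaired = min(defects, spares)          # assumption: one spare repairs one defect
    residual = defects - repaired
    if residual == 0:                        # spares alone repair the block
        return RepairPlan(repaired, 0, None, True)
    # Smallest tabulated defect budget that still covers the residual defects.
    candidates = [d for d in sigma_table if d >= residual]
    if not candidates:                       # beyond what the BCH code can absorb
        return RepairPlan(repaired, residual, None, False)
    target = sigma_table[min(candidates)]
    # Feasibility check; min_sigma stands in for circuit precision and timing limits.
    return RepairPlan(repaired, residual, target, target >= min_sigma)


print(plan_block(defects=6, spares=5))
print(plan_block(defects=8, spares=5))
print(plan_block(defects=9, spares=5))
```

A block with more residual defects than any tabulated budget, or one whose target standard deviation falls below `min_sigma`, is reported as infeasible in this sketch.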

5 Conclusions

This paper presented on-chip error correction system design approaches for multilevel code-storage NOR flash and data-storage NAND flash memories. We applied the TCM concept to design an on-chip error correction system for multilevel NOR flash memories. Compared with the conventional practice using linear block codes, the TCM-based design solution can provide better trade-offs between coding redundancy and error-correcting performance. Targeting 2 bits/cell NOR flash, we designed TCM-based systems for three scenarios where the number of user bits per block is 16, 32 and 64, respectively. Compared with the systems using two-error-correcting BCH codes, the TCM-based systems can achieve one order of magnitude better BER while saving 15.3% (16-bit), 13.0% (32-bit) and 2.6% (64-bit) memory cells, respectively. Cadence and Synopsys tools were used to implement the read datapath including mixed-signal sensing circuits and digital TCM demodulation and decoding circuits. The latency and silicon area are 8.3 ns and 0.12 mm² (16-bit), 24.3 ns and 0.10 mm² (32-bit), and 44.3 ns and 0.12 mm² (64-bit), respectively.

Table 5: BCH decoder ASIC design post-layout results

l    (n, k, t) BCH codes       Silicon area, mm²   Latency, µs
6    (8262, 8192, 5)           0.21                10.4
8    (8360, 8192, 12)          0.32                10.9
12   (9130, 8192, 67)          1.43                17.6
6    (16 459, 16 384, 5)       0.25                20.7
8    (16 609, 16 384, 15)      0.38                21.4
12   (17 914, 16 384, 102)     2.14                40.2

Table 6: Estimated storage capacity for a 100 mm² NAND flash memory core

l    (n, k, t) BCH codes       Stored user bits, Gbits   Effective storage capacity improvement, %
4    –                         8.33                      –
6    (8262, 8192, 5)           10.32                     23.9
8    (8360, 8192, 12)          12.24                     46.9
12   (9130, 8192, 67)          13.03                     56.4
6    (16 459, 16 384, 5)       10.36                     24.4
8    (16 609, 16 384, 15)      12.32                     47.9
12   (17 914, 16 384, 102)     13.25                     59.1

Table 7: Trading TVDI error correction capability for defect tolerance

l    (n, k, t) BCH codes       Standard deviation σ   d_def   t_TVDI
6    (8262, 8192, 5)           0.833                  1       2
8    (8360, 8192, 12)          0.930                  1       9
                               0.719                  3       3
12   (9130, 8192, 67)          0.950                  4       53
                               0.900                  7       42
                               0.800                  12      24
                               0.571                  17      6
6    (16 459, 16 384, 5)       0.800                  1       2
8    (16 609, 16 384, 15)      0.950                  1       12
                               0.700                  4       3
12   (17 914, 16 384, 102)     0.950                  6       81
                               0.900                  12      60
                               0.800                  20      32
                               0.571                  27      6

Fig. 12 Flow diagram using BCH codes for defect tolerance

In the context of NAND flash memory, we demonstrated the promise of using strong BCH codes to further improve multilevel data-storage NAND flash memory capacity without degrading memory programming time. Targeting a codeword error rate lower than 10⁻¹⁴, we constructed BCH codes with 8192 and 16 384 user bits per codeword, respectively. The results show that, given the same number of memory cells, up to 60% more user bits can be stored compared with the 2 bits/cell benchmark. To evaluate decoder silicon area and achievable decoding throughput/latency, we implemented BCH decoders using 0.13 µm CMOS standard cell and SRAM libraries. Post-layout results verify that the decoders occupy (much) less than 2.5 mm² silicon area and achieve (much) less than 41 µs decoding latency and 1.6 Gbps decoding throughput. On the basis of the published results for NAND flash effective cell area and a simple scaling rule, we estimate that, under 70-nm CMOS technology and 100 mm² core area, up to 59.1% effective storage capacity improvement can be realised compared with the 2 bits/cell benchmark.

Furthermore, we propose a design strategy that can leverage the large error correction capability of strong BCH codes to improve memory defect tolerance by trading off the memory cell programming time.

6 Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and suggestions, which have largely improved the quality and presentation of this paper.

7 References

1 Hwang, C.: ‘Nanotechnology enables a new memory growth model’, Proc. IEEE, 2003, 91, pp. 1765–1771
2 Bez, R., Camerlenghi, E., Modelli, A., and Visconti, A.: ‘Introduction to Flash memory’, Proc. IEEE, 2003, 91, pp. 489–502
3 Ricco, B., et al.: ‘Nonvolatile multilevel memories for digital applications’, Proc. IEEE, 1998, 86, pp. 2399–2423
4 Lee, S., et al.: ‘A 3.3 V 4 Gb four-level NAND flash memory with 90 nm CMOS technology’. Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2004, pp. 52–513
5 Sim, S.-P., et al.: ‘A 90 nm generation NOR flash multilevel cell (MLC) with 0.44 µm²/bit cell size’. IEEE VLSI-TSA Int. Symp. on VLSI Technology, April 2005, pp. 35–36
6 Servalli, G., et al.: ‘A 65nm NOR flash technology with 0.042 µm² cell size for high performance multilevel application’. IEEE Int. Electron Devices Meeting, December 2005, pp. 849–852
7 Hara, T., et al.: ‘A 146-mm² 8-Gb multi-level NAND flash memory with 70-nm CMOS technology’, IEEE J. Solid-State Circuits, 2006, 41, pp. 161–169
8 Gregori, S., Cabrini, A., Khouri, O., and Torelli, G.: ‘On-chip error correcting techniques for new-generation Flash memories’, Proc. IEEE, 2003, 91, pp. 602–616
9 Silvagni, A., Fusillo, G., Ravasio, R., Picca, M., and Zanardi, S.: ‘An overview of logic architectures inside Flash memory devices’, Proc. IEEE, 2003, 91, pp. 569–580
10 Rossi, D., Metra, C., and Ricco, B.: ‘Fast and compact error correcting scheme for reliable multilevel flash memories’. Proc. Eighth IEEE Int. On-Line Testing Workshop, July 2002, pp. 221–225
11 Ungerboeck, G.: ‘Trellis-coded modulation with redundant signal sets. Parts I and II’, IEEE Commun. Mag., 1987, 25, pp. 5–21
12 Lou, H.-L., and Sundberg, C.-E.W.: ‘Increasing storage capacity in multilevel memory cells by means of communications and signal processing techniques’, IEE Proc., Circuits Devices Syst., 2000, 147, pp. 229–236
13 Tanzawa, T., et al.: ‘A compact on-chip ECC for low cost flash memories’, IEEE J. Solid-State Circuits, 1997, 32, pp. 662–669
14 Nobukata, H., et al.: ‘A 144-Mb, eight-level NAND flash memory with optimized pulsewidth programming’, IEEE J. Solid-State Circuits, 2000, 35, pp. 682–690
15 Grossi, M., Lanzoni, M., and Ricco, B.: ‘A novel algorithm for high-throughput programming of multilevel flash memories’, IEEE Trans. Electron Devices, 2003, 50, pp. 1290–1296
16 Atwood, G., Fazio, A., Mills, D., and Reaves, B.: ‘Intel StrataFlash memory technology overview’, Intel Technology Journal, 4th Quarter 1997, pp. 1–8
17 Wei, L.F.: ‘Trellis-coded modulation with multidimensional constellations’, IEEE Trans. Inf. Theory, 1987, 33, pp. 483–501
18 Sun, F., Devarajan, S., Rose, K., and Zhang, T.: ‘Multilevel flash memory on-chip error correction based on trellis coded modulation’. IEEE Int. Symp. Circuits and Systems (ISCAS), May 2006
19 Calligaro, C., Gastaldi, R., Manstretta, A., and Torelli, G.: ‘A high-speed parallel sensing scheme for multi-level nonvolatile memories’. Proc. Int. Workshop on Memory Technology, Design and Testing, August 1997, pp. 96–101
20 Fettweis, G., and Meyr, H.: ‘High-speed parallel Viterbi decoding: algorithm and VLSI-architecture’, IEEE Commun. Mag., 1991, 29, pp. 46–55
21 Blahut, R.E.: ‘Algebraic codes for data transmission’ (Cambridge University Press, 2003)
22 Lee, J.-D., Hur, S.-H., and Choi, J.-D.: ‘Effects of floating-gate interference on NAND flash memory cell operation’, IEEE Trans. Electron Devices, 2002, 23, pp. 264–266
23 Takeuchi, K., Tanaka, T., and Nakamura, H.: ‘A double-level-Vth select gate array architecture for multilevel NAND flash memories’, IEEE J. Solid-State Circuits, 1996, 31, pp. 602–609
24 Micheloni, R., et al.: ‘A 0.13-µm CMOS NOR flash memory experimental chip for 4-b/cell digital storage’. Proc. 28th European Solid-State Circuits Conf., September 2002, pp. 131–134
25 Chen, Y., and Parhi, K.K.: ‘Area efficient parallel decoder architecture for long BCH codes’. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, May 2004, pp. V-73–V-76
26 Burton, H.O.: ‘Inversionless decoding of binary BCH codes’, IEEE Trans. Inf. Theory, 1971, 17, (4), pp. 464–466
27 Micheloni, R., et al.: ‘A 4 Gb 2b/cell NAND flash memory with embedded 5b BCH ECC for 36 MB/s system read throughput’. IEEE Int. Solid-State Circuits Conf., February 2006, pp. 497–506


On the Design of SSRS and RS Codes for Enhancing the Integrity of Information Storage in NAND Flash Memories

Article

Full-text available

Jan 2023

The revolution in the field of information processing systems has created a huge demand for reliable and enhanced data storage capabilities. This demand is being met by advances in channel coding algorithms along with upward scaling of the capacities of hardware devices. NAND Flash memory is a type of non-volatile memory. Scaling of the size of flash memories from Single Level Cell (SLC) devices to Multilevel cell (MLC) devices has increased the storage capacity. However, these multi-bit per cell architectures are characterized by significantly higher Raw Bit Error Rate (RBER) values when compared with SLC architectures. The requirement of low Undetected Bit Error Rate (UBER) values has motivated us to synthesize powerful channel codes for enhancing the integrity of information Storage in multi-level NAND Flash Memory devices. This paper describes the synthesis of novel Subfield Subcodes of Reed Solomon Codes (SSRS) and Reed-Solomon (RS) codes which are matched to multi-bit per cell architectures. UBER values have been calculated for each of the synthesized codes described in this paper. This allows the determination of the performance and the improvement in data storage integrity brought by using these codes. We have shown that the synthesized SSRS and RS codes can provide very low UBER even when the corresponding RBER values are appreciable. As RS codes permit the detection and correction of a greater number of errors for a given code length, their performance is superior to that of SSRS codes. This improved performance is obtained at the cost of greater complexity of encoding and decoding processes.

Enhancing Data Storage Reliability and Error Correction in Multilevel NOR and NAND Flash Memories through Optimal Design of BCH Codes

Preprint

Full-text available

Jul 2023

The size reduction of transistors in the latest flash memory generation has resulted in programming and data erasure issues within these designs. Consequently, ensuring reliable data storage has become a significant challenge for these memory structures. To tackle this challenge, error-correcting codes like BCH (Bose-Chaudhuri-Hocquenghem) codes are employed in the controllers of these memories. When decoding BCH codes, two crucial factors are the delay in error correction and the hardware requirements of each sub-block. This article proposes an effective solution to enhance error correction speed and optimize the decoder circuit's efficiency. It suggests implementing a parallel architecture for the BCH decoder's sub-blocks and utilizing pipeline techniques. Moreover, to reduce the hardware requirements of the BCH decoder block, an algorithm based on XOR sharing is introduced to eliminate redundant gates in the search Chien block. The proposed decoder is simulated using the VHDL hardware description language and subsequently synthesized with Xilinx ISE software. Simulation results indicate that the proposed algorithm not only significantly reduces error correction time but also achieves a noticeable reduction in the hardware overhead of the BCH decoder block compared to similar methods.

Channel Coding for Flash Memories

Thesis

Oct 2019

Jens Spinner

Flash memories are non-volatile memory devices. The rapid development of flash technologies leads to higher storage density, but also to higher error rates. This dissertation considers this reliability problem of flash memories and investigates suitable error correction codes, e.g. BCH-codes and concatenated codes. First, the flash cells, their functionality and error characteristics are explained. Next, the mathematics of the employed algebraic code are discussed. Subsequently, generalized concatenated codes (GCC) are presented. Compared to the commonly used BCH codes, concatenated codes promise higher code rates and lower implementation complexity. This complexity reduction is achieved by dividing a long code into smaller components, which require smaller Galois-Field sizes. The algebraic decoding algorithms enable analytical determination of the block error rate. Thus, it is possible to guarantee very low residual error rates for flash memories. Besides the complexity reduction, general concatenated codes can exploit soft information. This so-called soft decoding is not practicable for long BCH-codes. In this dissertation, two soft decoding methods for GCC are presented and analyzed. These methods are based on the Chase decoding and the stack algorithm. The last method explicitly uses the generalized concatenated code structure, where the component codes are nested subcodes. This property supports the complexity reduction. Moreover, the two-dimensional structure of GCC enables the correction of error patterns with statistical dependencies. One chapter of the thesis demonstrates how the concatenated codes can be used to correct two-dimensional cluster errors. Therefore, a two-dimensional interleaver is designed with the help of Gaussian integers. This design achieves the correction of cluster errors with the best possible radius. Large parts of this works are dedicated to the question, how the decoding algorithms can be implemented in hardware. 
These hardware architectures, their throughput and logic size are presented for long BCH-codes and generalized concatenated codes. The results show that generalized concatenated codes are suitable for error correction in flash memories, especially for three-dimensional NAND memory systems used in industrial applications, where low residual errors must be guaranteed.

A Flexible Hybrid BCH Decoder for Modern NAND Flash Memories Using General Purpose Graphical Processing Units (GPGPUs)

Article

Full-text available

May 2019

Bose–Chaudhuri–Hocquenghem (BCH) codes are broadly used to correct errors in flash memory systems and digital communications. These codes are cyclic block codes and have their arithmetic fixed over the splitting field of their generator polynomial. There are many solutions proposed using CPUs, hardware, and Graphical Processing Units (GPUs) for the BCH decoders. The performance of these BCH decoders is of ultimate importance for systems involving flash memory. However, it is essential to have a flexible solution to correct multiple bit errors over the different finite fields (GF(2 m )). In this paper, we propose a pragmatic approach to decode BCH codes over the different finite fields using hardware circuits and GPUs in tandem. We propose to employ hardware design for a modified syndrome generator and GPUs for a key-equation solver and an error corrector. Using the above partition, we have shown the ability to support multiple bit errors across different BCH block codes without compromising on the performance. Furthermore, the proposed method to generate modified syndrome has zero latency for scenarios where there are no errors. When there is an error detected, the GPUs are deployed to correct the errors using the iBM and Chien search algorithm. The results have shown that using the modified syndrome approach, we can support different multiple finite fields with high throughput.

The Error-Correcting Coding in Information Storage Modules with Increased Radiation Resistance

Article

Sep 2018

This article is devoted to the study and analysis of various noise-resistant code structures, which are designed for use in miniature memory drives on spacecrafts. Error-correcting coding is aimed for correcting memory errors that occur due to ionizing radiation. The first part of the article provides information about the general memory architecture using error-correcting coding. The second part considers linear code constructions, such as Hamming code, convolutional code, PC and LDPC code, as well as nonlinear code constructions, which are promising means of correcting memory errors (Vasiliev code, Phelps code, switching code, AMD-code). Based on the research and analysis data, the conclusion is made about the most suitable code design for the development of the information storage module. It should be noted that the determining requirement for choosing the code for the drive used on the spacecraft is the presence of simple decoding algorithms that allow high decoding speed and low energy consumption.

High-Speed LFSR Decoder Architectures for BCH and GII Codes

Article

Jan 2023

Yingquan Wu

In literature, PIBMA, a linear-feedback-shift-register (LFSR) decoder, has been shown to be the most efficient high-speed decoder for Reed-Solomon (RS) codes. In this work, we follow the same design principles and present two high-speed LFSR decoder architectures for binary BCH codes, both achieving the critical path of one multiplier and one adder. We identify a key insight of the Berlekamp algorithm that iterative discrepancy computation involves only even-degree terms. The first decoder separates the even and odd-degree terms of the error-locator polynomial to iterate homogeneously with discrepancy computation. The resulting LFSR decoder architecture, dubbed PIBA, has $\lfloor {}\frac {3t}{2}\rfloor +1$ processing elements (PEs), each containing two registers, two multipliers, one adder, and two multiplexers (same as that of PIBMA), which compares favorably against the best existing architecture composed by $2t+1$ PEs. The second one, dubbed pPIBA, squeezes the entire error-locator polynomial into the even-term array of the first one to iterate along with discrepancy computation, which comes at the cost of a controlled defect rate. pPIBA employs $t+1+f$ systolic units with a defect probability of $2^{-q(f+1)}$ , where $q$ denotes the finite field dimension and $f$ is a design parameter, which significantly reduces the number of PEs for a large correcting power $t$ . The proposed architectures can be arbitrarily folded to trade off complexity with latency, due to the systolic nature. GII decoding has been notorious for the composition of many seemly irrelevant functional blocks. We are motivated by the unified framework UPIBA which can be reconfigured to carry out both error-only and error-and-erasure decoding of RS codes in the most efficient manner. We devise a unified LFSR decoder for GII-RS, GII-ERS (referring to erasure correction of GII-RS codes), and GII-BCH codes, respectively. 
Each LFSR decoder can be reconfigured (but not multiplexed) to execute different functional blocks, and moreover achieves the same critical path of one multiplier, one adder, and one multiplexer. The resulting GII-RS/BCH decoder contains only four functional blocks, which are literally the same as the decoder for single RS/BCH codes. For GII-RS and GII-BCH decoding, we also incorporate the original mechanism by Tang and $\text{K}\ddot {\text {o}}$ tter to minimize the miscorrection rate, which comes surprisingly at a negligible cost. Our proposed high-speed low-complexity GII-ERS decoder renders the multi-layer GII codes highly attractive against other locally recoverable codes.

Narrowband Data Waveform Development and Simulation for Achieving Low BER in Wireless Communication

Chapter

Jul 2023

This paper presents a two step methodology for the development of narrowband (NB) data test waveform to achieve low bit error rate (BER). The physical layer of the NB waveform consists of many baseband processing blocks like digital modulation schemes, channel coding schemes, pulse shaping filters, synchronization techniques, and so more. In first step, two or more options of the important baseband processing blocks have been chosen through literature survey. In second step, these options have been simulated in Additive White Gaussian Noise (AWGN) and Rayleigh fading channel model. Comparative analysis of different options has been done on the basis of BER versus Energy per bit to noise power spectral density ratio ($E_{b}/N_{0}$) performance during simulation to select the best options for the optimized baseband processing block chain. At last, the complete baseband processing chain composed of the selected option has been simulated, and the BER of $10^{-6}$ has been achieved at 13 dB of $E_{b}/N_{0}$.KeywordsNBWFBER $E_{b}/N_{0}$ BCHFPGAQPSKQAM

An Efficient Decoding Architecture with Improved Error Correcting Technique for NB-LDPC Code

Article

May 2020

Dr Kamalakannan S

Due to higher integration densities, technology scaling and variation in parameters, the performance failures may occur for every application. The memory applications are also prone to single event upsets and transient errors which may lead to malfunctions. This paper proposed a novel error detection and correction method using EG-LDPC. This is useful as majority logic decoding can be implemented serially with simple hardware but requires a large decoding time. For memory applications, this increases the memory access time. The method detects whether a word has errors in the first iterations of majority logic decoding, and when there are no errors the decoding ends without completing the rest of the iterations. Also, errors affecting more than five bits were detected with a probability very close to one. The probability of undetected errors was also found to decrease as the code block length increased. For a billion error patterns only a few errors (or sometimes none) were undetected. This may be sufficient for some applications. Error commonly occurs in the Flash memory while employing LDPC decoding. The SRMMU actually suggests to use a the VTVI design by introducing the Context Number register, however also a PTPI or VTPI design could be implemented that complies to the SRMMU standard. The VTVI design with a physical write buffer and a combined I/D Cache TLB is the simplest design to implement. This will give error correction in minimum cyclic period using LDPC method.

On the Capacity of the Flash Memory Channel with Inter-cell Interference

Conference Paper

Jul 2019

Management of Next-Generation NAND Flash to Achieve Enterprise-Level Endurance and Latency Targets

Article

Dec 2018

Despite its widespread use in consumer devices and enterprise storage systems, NAND flash faces a growing number of challenges. While technology advances have helped to increase the storage density and reduce costs, they have also led to reduced endurance and larger block variations, which cannot be compensated solely by stronger ECC or read-retry schemes but have to be addressed holistically. Our goal is to enable low-cost NAND flash in enterprise storage for cost efficiency. We present novel flash-management approaches that reduce write amplification, achieve better wear leveling, and enhance endurance without sacrificing performance. We introduce block calibration, a technique to determine optimal read-threshold voltage levels that minimize error rates, and novel garbage-collection as well as data-placement schemes that alleviate the effects of block health variability and show how these techniques complement one another and thereby achieve enterprise storage requirements. By combining the proposed schemes, we improve endurance by up to 15× compared to the baseline endurance of NAND flash without using a stronger ECC scheme. The flash-management algorithms presented herein were designed and implemented in simulators, hardware test platforms, and eventually in the flash controllers of production enterprise all-flash arrays. Their effectiveness has been validated across thousands of customer deployments since 2015.

Multilevel flash memory on-chip error correction based on trellis coded modulation

Conference Paper

Full-text available

Jun 2006

This paper presents a multilevel (ML) flash memory on-chip error correction system design based on the concept of trellis coded modulation (TCM). This is motivated by the non-trivial modulation process in ML memory storage and the effectiveness of TCM on integrating coding with modulation to provide better performance. Using code storage 2bits/cell flash memory as a test vehicle, the effectiveness of TCM-based systems, in terms of error-correcting performance, coding redundancy, silicon cost, and operation latency, has been successfully demonstrated

TRELLIS-CODED MODULATION WITH MULTIDIMENSIONAL CONSTELLATIONS

Article

Jul 1987

LF WEI

Effects of floating-gate interference on NAND flash memory cell operation

Article

May 2002

We have introduced the concept of floating-gate interference in flash memory cells for the first time. The floating-gate interference causes V-T shift of a cell proportional to the V-T change of the adjacent cells. It results from capacitive coupling via parasitic capacitors around the floating gate. The coupling ratio defined in the previous works should be modified to include the floating-gate interference. In a 0.12-mum design-rule NAND flash cell, the floating-gate interference corresponds to about 0.2 V shift in multilevel cell operation. Furthermore, the adjacent word-line voltages affect the programming speed via parasitic capacitors.

A 65nm NOR flash technology with 0.042/spl mu/m/sup 2/ cell size for high performance multilevel application

Article

Jan 2005

A 65nm NOR flash technology, featuring a true 10lambda2 , 0.042mum2 cell, is presented for the first time for 1bit/cell and 2bit/cell products. Advanced 193nm lithography, floating gate self aligned STI, cobalt salicide and three levels of copper metallization allow the integration with a high density and high performance 1.8V CMOS

Trellis - coded modulation with redundant signal sets: Parts I & II

Article

Jan 1987

Gottfried Ungerboeck

Algebraic Codes For Data Transmission

Article

Feb 2003

Richard E. Blahut

The need to transmit and store massive amounts of data reliably and without error is a vital part of modern communications systems. Error-correcting codes play a fundamental role in minimising data corruption caused by defects such as noise, interference, crosstalk and packet loss. This book provides an accessible introduction to the basic elements of algebraic codes, and discusses their use in a variety of applications. The author describes a range of important coding techniques, including Reed-Solomon codes, BCH codes, trellis codes, and turbocodes. Throughout the book, mathematical theory is illustrated by reference to many practical examples. The book was first published in 2003 and is aimed at graduate students of electrical and computer engineering, and at practising engineers whose work involves communications or signal processing.

A high-speed parallel sensing scheme for multi-level non-volatile memories

Article

Jan 1997

A parallel sensing scheme for multi-level non-volatile memories (ML NVM) is presented. A single comparison step is used to achieve high sensing speed. To this end, a high-speed low-voltage current comparator is used. Experimental evaluations on a 0.6-μm EPROM test chip demonstrated the feasibility of 4-level-cell ML NVMs from the sensing standpoint. A read throughput of 12 MB/s is achieved with the proposed 4-level-cell memory architecture. Multi-level storage is achieved by using a program-verified scheme to obtain tight cell threshold voltage distribution. Overall sensing area overhead for a 32-Mbit chip is in the range of 1%.
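A single-step parallel read of a 4-level cell can be pictured as comparing the cell current against all reference currents at once and decoding the resulting thermometer code. The sketch below is a behavioural model only: the three reference values, the function names and the Gray mapping from level to bits are assumptions made for illustration, not details given in the abstract.

```python
def sense_parallel(cell_current, references):
    """Return the stored level index from one parallel comparison step.

    Each comparator outputs 1 when the cell current exceeds its reference,
    so the outputs form a thermometer code whose sum is the level index.
    """
    thermometer = [1 if cell_current > ref else 0 for ref in sorted(references)]
    return sum(thermometer)


def level_to_bits(level):
    """Map a level index 0..3 to two bits using a Gray code (assumed mapping)."""
    gray = level ^ (level >> 1)
    return (gray >> 1) & 1, gray & 1


# Hypothetical reference currents (arbitrary units) separating four levels.
REFS = [10, 20, 30]
level = sense_parallel(25, REFS)
print(level, level_to_bits(level))
```

With a Gray mapping, misreading a cell as an adjacent level corrupts only one of the two bits, which is one common reason to prefer it; the cited paper may use a different mapping.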

A 4Gb 2b/cell NAND flash memory with embedded 5b BCH ECC for 36MB/s system read throughput

Conference Paper

Mar 2006

A 4Gb 2b/cell NAND flash memory designed in a 90nm CMOS technology incorporates a 25MHz BCH ECC architecture, correcting up to 5 errors over a flexible data field (1B to 2102B). Two alternative Chien circuits are used depending on the number of errors (1 to 5) thus minimizing latency time. ECC area overhead is less than 1%
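For a sense of scale, the redundancy of a binary BCH code that corrects t errors is at most m·t parity bits when the code is built over GF(2^m). The sketch below estimates that bound under two assumptions that the abstract does not confirm: that the 2102 B figure counts data bits only, and that the smallest field that fits the codeword is used. The function name is hypothetical.

```python
def bch_parity_bound(data_bits, t):
    """Return (m, max_parity_bits) for a shortened binary BCH code.

    Picks the smallest m such that data plus m*t parity bits fit within
    the natural code length 2**m - 1, then applies the m*t upper bound.
    """
    m = 1
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return m, m * t


# Largest data field mentioned in the abstract: 2102 bytes, t = 5 errors.
m, parity = bch_parity_bound(2102 * 8, 5)
print(f"GF(2^{m}), at most {parity} parity bits")
```

A real design would normally fix one field size for all data-field lengths rather than recompute it per length, so this is only an estimate of the order of magnitude.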

Fast and Compact Error Correcting Scheme for Reliable Multilevel Flash Memories.

Conference Paper

Jul 2002

This paper presents a method to reduce area and timing overhead due to the implementation of standard single symbol correcting codes to provide ML flash memories with error correction capability. In particular, the proposed method is based on the manipulation of the parity check matrix which defines a code, which allows one to minimize the matrix weight and the maximum row weight. Furthermore, we show that a minimal increase in the redundancy, with respect to the standard case, allows a further considerable reduction of the impact on the memory access time, as well as on the area overhead due to the error correction circuitry.
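The two quantities this abstract minimizes, the total weight of the parity-check matrix and its maximum row weight, are easy to compute. The sketch below uses a binary Hamming (7,4) parity-check matrix as a stand-in example; the cited paper targets symbol-correcting codes, so the matrix, the function name and the gate-count estimates (two-input XOR gates, no sharing of intermediate terms) are illustrative assumptions.

```python
def matrix_weights(H):
    """Return (total_weight, max_row_weight) of a binary parity-check matrix."""
    row_weights = [sum(row) for row in H]
    return sum(row_weights), max(row_weights)


# Parity-check matrix of the binary Hamming (7,4) code (stand-in example).
H_HAMMING_7_4 = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

total, max_row = matrix_weights(H_HAMMING_7_4)
# Each row of weight w needs w - 1 two-input XORs to form its syndrome bit.
xor_gates = total - len(H_HAMMING_7_4)
# A balanced tree of two-input XORs over w inputs has ceil(log2(w)) levels.
xor_depth = (max_row - 1).bit_length()
print(total, max_row, xor_gates, xor_depth)
```

In this simplified view, the total weight tracks the area of the syndrome logic and the maximum row weight tracks its delay, which is why reducing both lowers area and access-time overhead.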

A 90nm generation NOR flash multilevel cell (MLC) with 0.44 μm²/bit cell size

Conference Paper

May 2005

A 256Mb NOR MLC flash memory with 90nm technology has been successfully developed. Through judicious integration to control the cell dispersion and charge loss/gain with cycling, we confirm a successful MLC operation up to 10K cycling for 0.44 μm²/bit cell size. In this paper, the key features governing multilevel cell (MLC) operation below the 90nm technology node are discussed with experimental results.
