Design of on-chip error correction systems for multilevel NOR and NAND flash memories
F. Sun, S. Devarajan, K. Rose and T. Zhang
Abstract: This paper concerns the design of on-chip error correction systems for multilevel code-storage NOR flash and data-storage NAND flash memories. The concept of trellis coded modulation (TCM) has been used to design an on-chip error correction system for NOR flash. This is motivated by the non-trivial modulation process in multilevel memory storage and the effectiveness of TCM in integrating coding with modulation to provide better performance at relatively short block length. The effectiveness of TCM-based systems, in terms of error-correcting performance, coding redundancy, silicon cost and operational latency, has been successfully demonstrated. Meanwhile, the potential of using strong Bose–Chaudhuri–Hocquenghem (BCH) codes to improve multilevel data-storage NAND flash memory capacity is investigated. Current multilevel flash memories store 2 bits in each cell. Further storage capacity may be achieved by increasing the number of storage levels per cell, which nevertheless will correspondingly degrade the raw storage reliability. It is demonstrated that strong BCH codes can effectively enable the use of a larger number of storage levels per cell and hence improve the effective NAND flash memory storage capacity by up to 59.1% without degradation of cell programming time. Furthermore, a scheme to leverage strong BCH codes to improve memory defect tolerance at the cost of increased NAND flash cell programming time is proposed.
1 Introduction
Driven by the ever-increasing demand for on-chip/board non-volatile data storage, flash memory has become one of the fastest growing segments of the global semiconductor industry [1]. Flash memories are categorised into two families, NOR flash and NAND flash [2]: NOR flash memories are mainly used for code storage and have relatively short block length, for example, 16 or 64 user bits per block, whereas NAND flash memories are mainly used for massive data storage and have relatively long block length, for example, 8192 or 16 384 user bits (i.e. 1024 or 2048 user bytes) per block. With its well-demonstrated effectiveness for increasing flash memory storage capacity, the multilevel concept, that is, storing more than 1 bit in each cell (or floating-gate MOS transistor) by programming the cell threshold voltage into one of l > 2 voltage windows, is being widely used in both NOR and NAND flash memories [3-7]. Owing to the inherently reduced operational margin, multilevel flash memories increasingly rely upon on-chip error correction to ensure storage reliability [8-10]. In current practice, most multilevel NOR and NAND flash memories store 2 bits in each memory cell and employ classical linear block error correction codes (ECCs), such as Hamming and Bose–Chaudhuri–Hocquenghem (BCH) codes, to realise on-chip error correction.
This work considers the design of multilevel flash memory on-chip error correction systems that may outperform the current practice by realising superior reliability and/or enabling higher effective storage capacity.
Because of the significant difference in block length between NOR and NAND flash memories, we consider these two types of flash memories separately. In the context of NOR flash, we investigate the use of the trellis coded modulation (TCM) [11] technique to realise on-chip error correction. The motivation is twofold: (1) the more-than-two-levels-per-cell storage makes the modulation process non-trivial and an integral part of the on-chip ECC; (2) TCM can effectively integrate ECC with modulation to realise better error correction performance when the block length is relatively small. We note that, although the use of TCM in multilevel memory was first proposed in [12], the incurred hardware implementation cost and latency overhead have not been addressed, which leaves its practical feasibility a missing link. Furthermore, the TCM-based approach is only applicable to NOR flash because its advantage over conventional linear block codes quickly diminishes as the block length increases, which however was not pointed out in [12]. To evaluate the silicon cost of TCM-based on-chip error correction, we implemented the read datapath consisting of high-precision sensing circuits and a TCM decoder. The results suggest that TCM-based systems can achieve encouraging memory cell savings at small operational latency and silicon cost.
In the context of NAND flash, we investigated the use of very strong BCH codes to enable higher storage capacity. Currently, most multilevel NAND flash memories store 2 bits (or four levels) in each cell, for which a weak ECC that can only correct a few (e.g. one or two) errors is typically used [13]. Higher storage capacity may be realised by further increasing l, which will make it increasingly more difficult to ensure storage reliability. In this regard, solutions may be pursued along two directions: (i) improve the
programming scheme to accordingly tighten each threshold voltage window and (ii) use much stronger ECC. Along the first direction, researchers have developed high-accuracy programming techniques to realise 3 bits/cell and even 4 bits/cell storage capacity [14, 15], which however complicates the design of the peripheral mixed-signal programming circuits and degrades the programming throughput.
To the best of our knowledge, the potential of using much stronger ECC to improve NAND flash storage capacity has not been addressed in the open literature. This work attempts to fill this gap by investigating the use of strong BCH codes to enable a relatively large l (6, 8 and 12 in this work). While strong ECC has the advantages of simplifying the programming circuits and maintaining or even increasing the programming throughput, it is subject to two main drawbacks: (i) strong ECC requires higher coding redundancy, which inevitably erodes the storage capacity improvement gained by a larger l, and (ii) the ECC decoder may incur non-negligible silicon area overhead and increase read latency. In general, to realise the same error correction performance (or achieve the same coding gain), the longer the ECC code length, the lower the relative coding redundancy (i.e. the higher the code rate). Therefore strong ECCs are only suitable for NAND flash memories, which have long data block lengths and hence may tolerate longer read latency. Using 2 bits/cell NAND flash memories that employ single-error-correcting Hamming codes as a benchmark, we investigated the effectiveness of using strong BCH codes to ensure storage reliability when increasing the value of l to 6, 8 and 12, respectively. With the same programming scheme, and hence the same threshold voltage distribution characteristics as the 2 bits/cell benchmark, a larger value of l results in worse raw storage reliability and demands a stronger BCH code. To investigate the trade-off between design complexity and storage capacity improvement, we designed BCH decoders using 0.13 μm complementary metal–oxide–semiconductor (CMOS) standard cell and static random access memory (SRAM) libraries. The results show that strong BCH codes can enable a relatively large increase in the number of storage levels per cell and hence a potentially significant memory storage capacity improvement. Finally, a scheme is proposed that leverages strong BCH codes to improve NAND flash memory defect tolerance by trading off the memory cell programming time.
The paper is organised as follows: we briefly present the basics of multilevel flash memories in Section 2. The proposed TCM-based on-chip error correction system for multilevel NOR flash memories is presented in Section 3, and Section 4 discusses the use of strong BCH codes in multilevel NAND flash memories. Conclusions are drawn in Section 5.
2 Multilevel flash memories
This section briefly presents some basics of multilevel flash memory programming/read and the memory cell threshold voltage distribution model to be used in this work. Interested readers are referred to [3] for a comprehensive discussion on multilevel flash memories. Multilevel flash memory programming is typically realised by combining a program-and-verify technique with a staircase V_pp ramp, as illustrated in Fig. 1. The tightness of each programmed threshold voltage window is proportional to V_pp, whereas the cell programming time is roughly proportional to 1/V_pp. The read circuit in l levels/cell NAND flash memories usually has a serial sensing structure that takes l − 1 cycles to finish the read operation. Higher read speed can be realised by increasing the sensing parallelism at the cost of silicon area, which is typically preferred in latency-critical NOR flash memories.
On the basis of the results published in [16] for 2 bits/cell NOR flash memory, the cell threshold voltage approximately follows a Gaussian distribution as illustrated in Fig. 2: the two inner distributions have the same standard deviation, denoted as σ; the standard deviations of the two outer distributions are 4σ and 2σ, respectively. The locations of the means of the two inner distributions are determined to minimise the raw bit error rate. Let V_max denote the voltage difference between the means of the two outer distributions. We assume that this model is also valid for NAND flash memories.
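To make the threshold voltage model concrete, the short sketch below (ours, not part of the original paper) estimates the raw symbol error rate of an l-level cell whose levels follow the Gaussian model above; the level means, σ value and decision thresholds are hypothetical placeholders rather than the paper's calibrated values.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def raw_symbol_error_rate(means, sigmas, thresholds):
    """Average probability that a cell programmed to level i is sensed
    outside its own decision window (thresholds[i-1], thresholds[i])."""
    l = len(means)
    p_err = 0.0
    for i in range(l):
        p = 0.0
        if i > 0:                                  # lower decision boundary
            p += q_func((means[i] - thresholds[i - 1]) / sigmas[i])
        if i < l - 1:                              # upper decision boundary
            p += q_func((thresholds[i] - means[i]) / sigmas[i])
        p_err += p
    return p_err / l                               # equiprobable levels assumed

# Hypothetical 4-level example mirroring the shape of Fig. 2: outer levels
# with 4*sigma and 2*sigma spreads, inner levels with sigma.
sigma, v_max = 0.25, 6.5
means = [0.0, 2.6, 4.2, v_max]                     # illustrative placements only
sigmas = [4 * sigma, sigma, sigma, 2 * sigma]
thresholds = [(means[i] + means[i + 1]) / 2 for i in range(len(means) - 1)]
print(raw_symbol_error_rate(means, sigmas, thresholds))
```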
3 TCM-based on-chip error correction for NOR flash
3.1 TCM system structure
The basic idea of TCM is to jointly design the trellis code (i.e. convolutional code) and the signal mapping (i.e. modulation) process to maximise the free Euclidean distance between coded signal sequences. (Similar to the Hamming distance of a linear block code, the free Euclidean distance determines the error correction capability of a convolutional code, that is, a convolutional code with free distance d_free can correct at least ⌊(d_free − 1)/2⌋ code symbol errors.) As illustrated in Fig. 3, given an l-level/cell memory core, an m-dimensional TCM encoder receives a sequence of n-bit input data, adds r bits of redundancy and hence generates a sequence of (n + r)-bit data, where each (n + r)-bit datum is stored in m memory cells and 2^(n+r) ≤ l^m. The encoding process can be outlined as follows: (1) a convolutional encoder convolves the k-bit input sequence with r linear algebraic functions and generates k + r coded bits; (2) each group of k + r coded bits selects one of the 2^(k+r) subsets of an m-D signal constellation, where each subset contains 2^(n−k) signal points; (3) the additional n − k uncoded bits select an individual m-D signal point from the selected subset.
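As an illustration of these three steps, here is a minimal software sketch (ours). The convolutional taps, state update and point-to-level table are placeholders chosen only to satisfy the bit-count bookkeeping of the n = 7, k = 2, r = 1, m = 4, l = 4 configuration used later (so 2^(n+r) = l^m = 256); they are not the code or mapping actually used in the paper, and termination and the multi-step block encoding of Section 3.2 are omitted.

```python
M = 4                            # m = 4 cells per 4D symbol, l = 4 levels/cell

state = 0                        # encoder state, memory order 3 (8 states)

def conv_encode_step(b1, b0):
    """Rate-2/3 feedforward convolutional step with placeholder taps."""
    global state
    p = b1 ^ b0 ^ (state & 1) ^ ((state >> 2) & 1)   # hypothetical parity taps
    state = ((state << 1) | (b1 ^ b0)) & 0b111       # hypothetical state update
    return (b1 << 2) | (b0 << 1) | p                 # k + r = 3 coded bits

def tcm_encode(bits7):
    """Steps (1)-(3): map n = 7 input bits onto m = 4 four-level cells."""
    subset = conv_encode_step(bits7[0], bits7[1])    # (1)+(2): one of 8 subsets
    point = int("".join(map(str, bits7[2:])), 2)     # (3): n-k = 5 uncoded bits
    symbol = (subset << 5) | point                   # 8 bits = one 4D symbol
    # Placeholder point-to-level table: a real design realises the 4D set
    # partition of Table 1 here; we merely split the 8 bits over 4 cells.
    return [(symbol >> (2 * c)) & 0b11 for c in range(M)]

print(tcm_encode([1, 0, 1, 1, 0, 0, 1]))             # four cell levels in 0..3
```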
Fig. 1 Schematic illustration of program-and-verify cell programming

Fig. 2 Approximate cell threshold voltage distribution model in 2 bits/cell memory
Let ν denote the memory order of the convolutional encoder. To protect an N-bit data block, the TCM encoder receives N + ν bits in total, including ν zero bits for convolutional code termination. If N + ν is not divisible by n, the last input to the encoder will contain fewer than n bits, for which the m-D modulation may be simplified to a modulation of lower dimension. As illustrated in Fig. 3, the TCM decoder contains an m-D demodulator that provides 2^(k+r) branch metrics and branch symbol decisions to the Viterbi decoder for trellis decoding.
3.2 System design and performance evaluation
Targeting 2 bits/cell NOR flash memories, we designed
three TCM-based systems that protect 16-bit, 32-bit and
64-bit user data in one codeword. These three systems
share the same system design parameters (referring to Fig. 3): n = 7, k = 2, r = 1, m = 4 and memory order of the convolutional code ν = 3. The signal read from each memory cell is quantised to 12 levels. We chose 12-level quantisation mainly on the basis of our finite-precision computer simulations, which suggested that it provides a good trade-off between implementation cost and error correction performance. To
realise 4D modulation, we use the scheme proposed by Wei [17], which hierarchically partitions the 4D rectangular lattice formed by four memory cells into eight 4D sub-lattices. Each 3-bit coded output of the convolutional encoder selects one of the eight 4D sub-lattices. The 4D signal space partition is described as follows:
First, we partition each 1D signal constellation (corresponding to one memory cell) into two subsets E and F, as shown in Fig. 4, where the signal points labelled e and f belong to the subsets E and F, respectively.
Next, we partition each 2D signal constellation into four subsets A = (E, F), B = (F, E), C = (E, E) and D = (F, F), as shown in Fig. 4, where the signal points labelled a, b, c and d belong to the subsets A, B, C and D, respectively.
Finally, we partition the 4D signal constellation into 2^(k+r) = 8 4D subsets, formed as listed in Table 1.
To protect 16-bit user data, the TCM encoder receives 19 bits (including 3 zero bits for termination) and finishes encoding in three steps: during each of the first two steps, it receives 7 bits and maps the coded 8 bits onto four memory cells through 4D modulation; in the last step, it receives 5 bits and maps the coded 6 bits onto three memory cells through 3D modulation, which is obtained by collapsing one 2D constellation in the original 4D modulation into a 1D constellation. Therefore this system is denoted as (11, 8) TCM, that is, one codeword occupies 11 memory cells and protects 2 × 8 = 16 bits of user data (notice that each memory cell stores 2 bits). For the purpose of comparison, we considered two other ECC schemes using linear block codes: (i) an (11, 8, 1) 4-ary shortened Hamming code and (ii) a (13, 8, 2) 4-ary shortened two-error-correcting BCH code [18].
Fig. 5a shows the performance comparison of these three schemes. Although the performance curves of the two linear block codes can be analytically derived, we have to rely on extensive computer simulations to obtain the performance curve of the (11, 8) TCM system, for which the solid part is obtained by computer simulation and the dashed part is estimated following the trend of the simulation results. With the same coding redundancy, (11, 8) TCM achieves about five orders of magnitude better performance than the (11, 8, 1) Hamming code. Compared with the (13, 8, 2) BCH code, (11, 8) TCM achieves almost the same performance while saving 2/13 (15.4%) of the memory cells.
To protect 32-bit user data, the TCM encoder receives 35 bits and finishes encoding in five steps, each of which maps 8 bits onto four memory cells through 4D modulation. Hence, this system is denoted as (20, 16) TCM, that is, one codeword occupies 20 memory cells and protects 32-bit user data. The (20, 16) TCM is compared with two linear block codes: (i) a (19, 16, 1) 4-ary shortened Hamming code and (ii) a (23, 16, 2) 4-ary shortened BCH code.
Fig. 5b shows their performance comparison.
To protect 64-bit user data, the TCM encoder receives 67 bits and finishes encoding in 10 steps: during each of the first nine steps, it receives 7 bits and maps the coded 8 bits onto four memory cells through 4D modulation; in the last step, it receives 4 bits that bypass the convolutional encoder and map directly onto two memory cells through 2D modulation, which is a constituent of the original 4D modulation. Hence, this system is denoted as (38, 32) TCM. We compared it with two linear block codes: (i) a (36, 32, 1) 4-ary shortened Hamming code and (ii) a (39, 32, 2) 4-ary shortened BCH code. Fig. 5c shows their performance comparison.

Fig. 3 Block diagram of TCM-based on-chip error correction system

Fig. 4 16-point 2D signal constellation partition

Table 1: Partition of the 4D signal constellation

4D subset   Concatenation form
P1          (A, A) ∪ (B, B)
P2          (C, C) ∪ (D, D)
P3          (A, B) ∪ (B, A)
P4          (C, D) ∪ (D, C)
P5          (A, C) ∪ (B, D)
P6          (C, B) ∪ (D, A)
P7          (A, D) ∪ (B, C)
P8          (C, A) ∪ (D, B)
Table 2 summarises the comparison between the TCM and the other ECC schemes discussed above in terms of coding redundancy and error-correcting performance. Notice that the positive and negative numbers in the third column mean positive and negative savings of memory cells, respectively, and the performance gain in the fourth column is measured where the bit error rate (BER) of the TCM-based system approaches 10^−14.
3.3 Silicon implementation
The above results show the effectiveness of TCM-based on-chip error correction in terms of coding redundancy and error-correcting performance. However, to be a promising candidate for multilevel NOR flash memory, it should also achieve small latency and negligible silicon area compared with the overall memory die size. In the following, we present proof-of-concept implementation results for the above three TCM-based systems protecting 16-bit, 32-bit and 64-bit user data. Clearly, TCM encoders are very simple and can easily achieve very small latency with negligible silicon cost; hence we focus only on the TCM decoders.
The TCM decoding datapath contains high-precision sensing circuits, a 4D demodulator and a Viterbi decoder. The sensing circuit realises 12-level quantisation, instead of the 4-level quantisation used in the conventional linear-block-code-based ECC scheme. Using Cadence tools with the IBM 0.18 μm 7WL technology, we designed a current-mode 12-level parallel sensing circuit following the structure proposed in [19]. Fig. 6 shows the general structure of a 12-level current-mode parallel sensing circuit, which mainly contains 11 current comparators. 12-level quantisation is realised by comparing the current from the selected memory cell with the reference currents from 11 appropriately programmed reference cells. The silicon area of one 12-level current-mode parallel sensing circuit is estimated as 0.006 mm². The simulation results show that the worst-case sensing latency (i.e. when the input current is equal to one of the reference currents) is about 300 ps.
Upon receiving the data from four 12-level sensing circuits, the 4D demodulator finds the most likely point in each 4D signal subset and calculates the corresponding log-likelihood metrics as the branch metrics sent to the Viterbi decoder. The output branch metrics are represented with 6 bits. The 4D demodulator receives a set of data denoted as Ẑ = {ẑ_1, ẑ_2, ẑ_3, ẑ_4} from the four analogue-to-digital converters (ADCs), where each ẑ_i is the digitised datum read from one cell. Given Ẑ, the 4D demodulator should calculate

$$\mathrm{4D\_Metric}_j = \max_{p \in P_j} \big( \log P(\hat{Z} \mid p) \big), \quad j = 1, 2, \ldots, 8 \qquad (1)$$

where each 4D_Metric_j represents the log-likelihood value of the most likely point in the 4D subset P_j and p ranges over the points of P_j. As each point in each 4D subset is represented by 5 bits, the 4D demodulator should also generate eight 5-bit values representing the most likely points in the eight 4D subsets. Leveraging the hierarchical structure of the 4D modulation, as shown in Fig. 7, the demodulation is realised in a hierarchical manner.
Fig. 5 BER performance when protecting 16-bit, 32-bit and 64-bit user data. In the TCM schemes, the signal read from each memory cell is quantised to 12 levels

Table 2: Comparison between TCM and the other ECC schemes based on linear block codes

TCM        Competing ECC          Savings of cells, %   Performance gain
(11, 8)    (11, 8, 1) Hamming     0                     10^5
           (13, 8, 2) BCH         15.4                  1
(20, 16)   (19, 16, 1) Hamming    −5.3                  10^5
           (23, 16, 2) BCH        13.0                  10
(38, 32)   (36, 32, 1) Hamming    −5.6                  >10^5
           (39, 32, 2) BCH        2.6                   >10

Fig. 6 Structure of a 12-level parallel sensing circuit

The 4D demodulator starts with finding the
closest point in each 1D subset and its metric. Each 1D demodulator receives the datum ẑ_i read from one cell and calculates

$$\mathrm{Metric\_E} = \max_{p \in E} \big( \log P(\hat{z}_i \mid p) \big), \quad \mathrm{Metric\_F} = \max_{p \in F} \big( \log P(\hat{z}_i \mid p) \big) \qquad (2)$$

where Metric_E and Metric_F represent the log-likelihood values of the most likely points in the two 1D subsets E and F, respectively. The 1D demodulator also generates 1 bit (note that there are two signal points in each 1D subset) to represent the closest point in each 1D subset. This operation can be implemented using a simple look-up table.
As discussed in Section 3.2, each 2D signal constellation is partitioned into four subsets A = (E, F), B = (F, E), C = (E, E) and D = (F, F). Upon receiving the data {ẑ_i, ẑ_{i+1}} for i = 1, 3, the 2D demodulator generates

$$\begin{aligned} \mathrm{Metric\_A} &= \max_{p \in A} \big( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \big) = \mathrm{Metric\_E} + \mathrm{Metric\_F} \\ \mathrm{Metric\_B} &= \max_{p \in B} \big( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \big) = \mathrm{Metric\_F} + \mathrm{Metric\_E} \\ \mathrm{Metric\_C} &= \max_{p \in C} \big( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \big) = \mathrm{Metric\_E} + \mathrm{Metric\_E} \\ \mathrm{Metric\_D} &= \max_{p \in D} \big( \log P(\{\hat{z}_i, \hat{z}_{i+1}\} \mid p) \big) = \underbrace{\mathrm{Metric\_F}}_{\text{1st 1D}} + \underbrace{\mathrm{Metric\_F}}_{\text{2nd 1D}} \end{aligned} \qquad (3)$$

where, in each sum, the first term comes from the first 1D demodulator and the second term from the second 1D demodulator.
In a similar way, the 4D demodulator generates the eight demodulation metrics 4D_Metric_j = max_{p∈P_j}(log P(Ẑ | p)) as follows

$$\begin{aligned} \mathrm{4D\_Metric}_1 &= \max(\mathrm{Metric\_A} + \mathrm{Metric\_A},\ \mathrm{Metric\_B} + \mathrm{Metric\_B}) \\ \mathrm{4D\_Metric}_2 &= \max(\mathrm{Metric\_C} + \mathrm{Metric\_C},\ \mathrm{Metric\_D} + \mathrm{Metric\_D}) \\ \mathrm{4D\_Metric}_3 &= \max(\mathrm{Metric\_A} + \mathrm{Metric\_B},\ \mathrm{Metric\_B} + \mathrm{Metric\_A}) \\ \mathrm{4D\_Metric}_4 &= \max(\mathrm{Metric\_C} + \mathrm{Metric\_D},\ \mathrm{Metric\_D} + \mathrm{Metric\_C}) \\ \mathrm{4D\_Metric}_5 &= \max(\mathrm{Metric\_A} + \mathrm{Metric\_C},\ \mathrm{Metric\_B} + \mathrm{Metric\_D}) \\ \mathrm{4D\_Metric}_6 &= \max(\mathrm{Metric\_C} + \mathrm{Metric\_B},\ \mathrm{Metric\_D} + \mathrm{Metric\_A}) \\ \mathrm{4D\_Metric}_7 &= \max(\mathrm{Metric\_A} + \mathrm{Metric\_D},\ \mathrm{Metric\_B} + \mathrm{Metric\_C}) \\ \mathrm{4D\_Metric}_8 &= \max\big(\underbrace{\mathrm{Metric\_C}}_{\text{1st 2D}} + \underbrace{\mathrm{Metric\_A}}_{\text{2nd 2D}},\ \underbrace{\mathrm{Metric\_D}}_{\text{1st 2D}} + \underbrace{\mathrm{Metric\_B}}_{\text{2nd 2D}}\big) \end{aligned} \qquad (4)$$

where, in each max, the first metric comes from the first 2D demodulator and the second from the second 2D demodulator. The generated eight metric values are sent to the Viterbi decoder as the branch metrics for the final decoding.
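The hierarchy of (2)-(4) is easy to express in software. The sketch below (ours) assumes a hypothetical labelling in which the four cell levels 0-3 alternate between the subsets E = {0, 2} and F = {1, 3}, and takes fabricated per-level log-likelihoods as input; only the metric bookkeeping mirrors the demodulator described above.

```python
E, F = (0, 2), (1, 3)            # assumed 1D subset labelling of the 4 levels

def demod_1d(loglik_z):
    """Return the best (metric, point) pair of each 1D subset, cf. (2)."""
    metric_e = max((loglik_z[v], v) for v in E)
    metric_f = max((loglik_z[v], v) for v in F)
    return metric_e, metric_f

def demod_4d(logliks):
    """logliks[c][v] = log P(z_c | level v) for the four cells c, cf. (1)."""
    ms = [demod_1d(lz) for lz in logliks]            # four 1D demodulations
    def demod_2d(i):                                 # cf. (3), cells i and i+1
        (e1, f1), (e2, f2) = ms[i], ms[i + 1]
        return {"A": e1[0] + f2[0], "B": f1[0] + e2[0],
                "C": e1[0] + e2[0], "D": f1[0] + f2[0]}
    d1, d2 = demod_2d(0), demod_2d(2)
    # 4D subsets P1..P8 as the unions of 2D-subset pairs listed in Table 1.
    pairs = [("AA", "BB"), ("CC", "DD"), ("AB", "BA"), ("CD", "DC"),
             ("AC", "BD"), ("CB", "DA"), ("AD", "BC"), ("CA", "DB")]
    return [max(d1[x] + d2[y] for x, y in p) for p in pairs]

# Fabricated log-likelihood rows for four read cells.
row = [-0.1, -2.0, -3.0, -4.0]
print(demod_4d([row, row, row, row]))                # eight branch metrics
```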
The last block on the decoding datapath is a Viterbi decoder. To minimise the decoding latency, we use a state-parallel register-exchange Viterbi decoder architecture. As Viterbi decoder implementation has been extensively addressed in the open literature, we will not elaborate on the decoder architecture details; interested readers are referred to [20]. Here we note that, for the scenario of protecting 16-bit user data, as the Viterbi decoder finishes decoding in only three steps, we directly unrolled the recursive datapath of the original Viterbi decoder and fully optimised the circuit structure, which reduces both silicon area and decoding latency. Table 3 summarises the read datapath implementation metrics of the three TCM systems. We note that the TCM systems protecting 32-bit and 64-bit user data contain four sensing circuits and one 4D demodulator, whereas the TCM system protecting 16-bit user data contains 11 sensing circuits, two 4D demodulators and one 3D demodulator in order to match the parallelism of the unrolled Viterbi decoder; hence the silicon area of the TCM system protecting 16-bit user data is comparable to that of the other two scenarios, even though it has the shortest block length.
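For completeness, one add-compare-select step of such a Viterbi decoder can be sketched as follows (ours); the next-state and branch-to-subset tables are placeholders for illustration only, not the paper's actual 8-state trellis.

```python
NUM_STATES, NUM_BRANCHES = 8, 4          # 2^3 states, 2^k = 4 branches/state

# Placeholder trellis tables (hypothetical, for illustration only).
next_state = [[(4 * s + b) % NUM_STATES for b in range(NUM_BRANCHES)]
              for s in range(NUM_STATES)]
branch_subset = [[(s + b) % 8 for b in range(NUM_BRANCHES)]
                 for s in range(NUM_STATES)]

def viterbi_acs_step(path_metric, subset_metrics):
    """Extend all 8 x 4 branches with the 4D branch metrics of (4) and keep
    the best predecessor per state (register-exchange survivors omitted)."""
    new_metric = [float("-inf")] * NUM_STATES
    survivor = [None] * NUM_STATES
    for s in range(NUM_STATES):
        for b in range(NUM_BRANCHES):
            ns = next_state[s][b]
            m = path_metric[s] + subset_metrics[branch_subset[s][b]]
            if m > new_metric[ns]:
                new_metric[ns], survivor[ns] = m, (s, b)
    return new_metric, survivor

pm, sv = viterbi_acs_step([0.0] * 8, [-1.0, -2.0, -0.5, -3.0,
                                      -4.0, -2.5, -1.5, -0.2])
print(pm)
```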
Fig. 7 Data flow of the 4D demodulator

4 BCH-based on-chip error correction for NAND flash

4.1 Binary BCH codes

Binary BCH code construction and encoding/decoding are based on binary Galois fields. A binary Galois field of degree m is represented as GF(2^m). For any m ≥ 3 and t < 2^(m−1), there exists a primitive binary BCH code over GF(2^m) that has code length n = 2^m − 1 and information bit length k ≥ 2^m − 1 − mt and can correct up to (or slightly more than) t errors. A primitive t-error-correcting (n, k, t) BCH code can be shortened (i.e. a certain number, say s, of information bits can be eliminated) to construct a t-error-correcting (n − s, k − s, t) BCH code with fewer information bits and a shorter code length but the same redundancy. Given the raw BER p_raw, an (n, k, t) binary BCH code can achieve a codeword error rate of
the raw BER p
raw
,an(n, k , t) binary BCH code can achieve
a codeword error rate of
P
e
¼
X
n
i¼tþ1
n
i

p
i
raw
(1 p
raw
)
ni
(5)
Binary BCH encoding can be realised efficiently using linear shift registers, whereas binary BCH decoding is much more complex. Various BCH decoding algorithms have been proposed [21]. In Section 4.3, we will elaborate on the binary BCH decoding algorithm and decoder architecture used in this work.
4.2 BCH codes for multilevel NAND flash
We first investigate the potential storage capacity improvement from increasing l from 4 to 6, 8 and 12, respectively. Assuming the same programming scheme (i.e. the same step-up voltage V_pp and hence the same cell programming time) as the 2 bits/cell memory, the cell threshold voltage distributions for l = 6, 8 and 12 are as illustrated in Fig. 8 and described as follows: the l − 2 inner distributions have the same standard deviation σ; the standard deviations of the two outer distributions are 4σ and 2σ, respectively. The locations of the means of the l − 2 inner distributions are determined to minimise the raw BER. It should be pointed out that, as the value of l increases, factors such as floating-gate interference [22] and source line noise [23] might degrade the threshold voltage distribution (i.e. increase the standard deviation). As currently no data are available in the open literature to model such possible deviation degradation, and we expect that such degradation should not be significant, we assume that the standard deviation is independent of l in this work.
We set V_max, the voltage difference between the means of the two outer distributions, to 6.5 V [24] and σ to 1. For l of 6, 8 and 12, we store 5 bits per two cells, 3 bits per cell and 7 bits per two cells, respectively. Accordingly, the raw BERs are about 8 × 10^−12 (l = 4), 5 × 10^−7 (l = 6), 5 × 10^−5 (l = 8) and 2 × 10^−3 (l = 12), respectively. Because the cell programming time remains the same as in the 2 bits/cell benchmark, the programming throughput approximately increases by 25, 50 and 75%, respectively.
To protect 8192 and 16 384 user bits per codeword with a target codeword error rate lower than 10^−14, single-error-correcting Hamming codes are sufficient to ensure the storage reliability for l = 4. For larger values of l, binary BCH codes are constructed by shortening primitive binary BCH codes over GF(2^14) and GF(2^15), respectively. Table 4 lists the BCH code parameters and the corresponding codeword error rates. Table 4 also shows the percentage user-bit storage gain over the 2 bits/cell benchmark, given the same number of memory cells.
4.3 BCH code decoder architecture and ASIC design

To evaluate decoder silicon implementation metrics for the above BCH codes, we carried out application-specific integrated circuit (ASIC) designs using 0.13 μm CMOS standard cell and SRAM libraries. In the following, we first briefly describe the BCH decoder architecture and then present the silicon implementation results. A syndrome-based binary BCH code decoder consists of three blocks, as shown in Fig. 9. For an (n, k, t) binary BCH code constructed over a Galois field with primitive element α, the overall decoder architecture is described as follows.
Table 3: Summary of implementation metrics

              Silicon area, mm²   Latency*, ns
(11, 8) TCM   0.12                 8.3
(20, 16) TCM  0.10                24.3
(38, 32) TCM  0.12                44.3

* Includes latency of sensing circuits and TCM decoding

Fig. 8 Approximate flash memory cell threshold voltage distribution model: (a) l = 6, (b) l = 8, (c) l = 12

4.3.1 Syndrome computation: Given the received bit vector r, this block computes the 2t syndromes

$$S_i = \sum_{j=0}^{n-1} r_j \alpha^{ij}, \quad i = 0, 1, \ldots, 2t - 1$$

As pointed out in [25], for binary BCH codes we have S_{2j} = S_j^2, so only t parallel syndrome generators are required to explicitly calculate the odd-indexed syndromes, followed by much simpler squaring circuits. For a decoder with parallelism of p (i.e. the syndrome computation block receives p input bits in each clock cycle), each syndrome generator has the structure shown in Fig. 10.
4.3.2 Error locator calculation: Based on the 2t syn-
dromes, we calculate the error locator polynomial
L
(x) ¼ 1 þ
L
1
x þ
L
2
x
2
þ
...
þ
L
t
x
t
using the inversion-
free Berlekamp Massey algorithm [26]. To minimise the
silicon area cost, a fully serial architecture is used, which
takes t(t þ 3)/ 2 clock cycles to finish the calculation. It
mainly contains three Galois field multipliers and two first-
input first-output (FIFO) buffers with lengths of t and t þ 1,
respectively.
4.3.3 Chien search: Upon receiving the error locator polynomial Λ(x), this block exhaustively examines whether α^i is a root of Λ(x) for i = 0, 1, ..., n − 1, that is, it checks whether

$$\Lambda(\alpha^i) = \sum_{j=1}^{t} \Lambda_j \alpha^{ij} + 1$$

is zero or not. It outputs an error vector e such that, if α^i is a root, then e_{n−i} = 1, otherwise e_{n−i} = 0. The overall decoder output is obtained as r + e, as illustrated in Fig. 9. Fig. 11 shows the Chien search architecture with parallelism factor p, which generates a p-bit output each clock cycle.
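The two remaining blocks can likewise be checked behaviourally. The sketch below (ours) runs Burton's inversionless Berlekamp–Massey iteration followed by an exhaustive Chien search over the same toy GF(2^4) field as before; the error positions and t are fabricated, and hardware details such as the serial scheduling and the parallelism factor p are deliberately ignored.

```python
M_DEG, PRIM_POLY = 4, 0b10011
N = (1 << M_DEG) - 1

exp_t, log_t = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    exp_t[i] = exp_t[i + N] = x
    log_t[x] = i
    x <<= 1
    if x & (1 << M_DEG):
        x ^= PRIM_POLY

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else exp_t[log_t[a] + log_t[b]]

def inversionless_bm(S, t):
    """S[1..2t] -> error locator Lambda(x), up to a harmless scalar factor."""
    lam = [1] + [0] * (2 * t)
    b = [1] + [0] * (2 * t)
    L, gamma = 0, 1
    for k in range(1, 2 * t + 1):
        delta = 0                                  # discrepancy
        for j in range(L + 1):
            delta ^= gf_mul(lam[j], S[k - j])
        xb = [0] + b[:-1]                          # x * B(x)
        new_lam = [gf_mul(gamma, u) ^ gf_mul(delta, v) for u, v in zip(lam, xb)]
        if delta != 0 and 2 * L <= k - 1:
            b, L, gamma = lam, k - L, delta        # B(x) <- old Lambda(x)
        else:
            b = xb
        lam = new_lam
    return lam[:t + 1]

def chien_search(lam):
    """Return all i in 0..N-1 with Lambda(alpha^i) = 0."""
    roots = []
    for i in range(N):
        val = 0
        for j, c in enumerate(lam):
            val ^= gf_mul(c, exp_t[(i * j) % N])
        if val == 0:
            roots.append(i)
    return roots

# Fabricated example: two errors whose locators are alpha^3 and alpha^10.
t, err = 2, [3, 10]
S = [0] * (2 * t + 1)
for i in range(1, 2 * t + 1):
    for e in err:
        S[i] ^= exp_t[(i * e) % N]
lam = inversionless_bm(S, t)
print(sorted((N - i) % N for i in chien_search(lam)))   # -> [3, 10]
```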
4.3.4 Decoder ASIC design: For the BCH codes listed in Table 4, we designed decoders with the following configurations: the syndrome computation and Chien search blocks have a parallelism factor of 4; the error locator calculation block is fully serial and takes t(t + 3)/2 clock cycles. Therefore the syndrome computation and Chien search blocks always have the same latency (in terms of the number of clock cycles), whereas the latency of the error locator calculation block depends on the value of t. To improve the decoding throughput and minimise the decoding latency, these BCH decoders support pipelined operation, summarised as follows:
For l = 6, 8: the BCH codes have relatively small values of t, so the corresponding error locator calculation blocks have much lower latency than the other two blocks. Therefore we use a one-stage pipelined decoder structure in which the syndrome computation and error locator calculation blocks operate on one codeword, whereas the Chien search block operates on another codeword in parallel.

For l = 12: the BCH codes have relatively large values of t, so the corresponding error locator calculation blocks have similar or even slightly longer latency than the other two blocks. Therefore we use a two-stage pipelined decoder structure in which the three blocks operate in parallel on three consecutive codewords.
Furthermore, the decoder FIFO shown in Fig. 9 is realised with SRAM to minimise the silicon area cost. These BCH decoders are designed with Chartered 0.13 μm CMOS standard cell and SRAM libraries, and Synopsys tools are used throughout the design hierarchy down to place and route. We set the power supply to 1.08 V and the number of metal layers to four in the place and route. Post-layout results verify that the decoders operate at 400 MHz and hence support about 1.6 Gbps decoding throughput because of the decoder parallelism factor of 4. Such throughput appears to be sufficient in real-life applications [27]. The silicon area and decoding latency are listed in Table 5.
To demonstrate the overall NAND flash memory storage capacity improvement potential, we carried out the following estimation for 70-nm CMOS technology: the effective NAND flash memory cell size is 0.024 μm² at 70-nm CMOS technology [7]. We scale the BCH decoder silicon area by (130/70)² ≈ 3.4 to estimate the decoder silicon area at 70-nm CMOS technology. Accordingly, Table 6 shows the estimated total numbers of user bits that can be stored in a NAND flash memory core of 100 mm² while accounting for the BCH decoder silicon area cost. The effective storage capacity improvement is obtained by comparing against the 2 bits/cell benchmark.
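The arithmetic behind Table 6 can be reproduced directly; the sketch below (ours) follows the stated recipe (cell area, decoder area scaled down by (130/70)², code rate k/n) and lands close to the published entries, which we take as a sanity check rather than an exact reconstruction.

```python
CELL_UM2 = 0.024                 # effective NAND cell size at 70 nm [7]
CORE_MM2 = 100.0                 # memory core area considered in the text
SCALE = (130.0 / 70.0) ** 2      # area scaling from 130 nm to 70 nm, ~3.4

def stored_user_gbits(bits_per_cell, k, n, decoder_mm2_at_130nm):
    """User bits (in units of 1e9) fitting next to the scaled-down decoder."""
    avail_um2 = (CORE_MM2 - decoder_mm2_at_130nm / SCALE) * 1e6
    cells = avail_um2 / CELL_UM2
    return cells * bits_per_cell * (k / n) / 1e9

base = stored_user_gbits(2, 1, 1, 0.0)        # 2 bits/cell benchmark, ECC ignored
l8 = stored_user_gbits(3, 8192, 8360, 0.32)   # l = 8, (8360, 8192, 12) decoder
print(round(base, 2), round(l8, 2))
# -> about 8.33 and 12.24, i.e. a ~47% improvement, cf. the Table 6 row
```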
4.4 Integration with defect tolerance
We assumed above that the cell programming time (and hence the threshold voltage distribution) remains the same for the various l and that the BCH codes are solely used for compensating threshold voltage distribution-induced (TVDI) errors. Intuitively, if we improve the programming accuracy by reducing the step-up programming voltage V_pp at the cost of increased programming time, the TVDI error rates will correspondingly reduce. This leaves a certain degree of BCH code error correction capability available for compensating memory defects, which can be considered as a trade-off between programming time and defect tolerance.
Table 4: BCH code parameters and performance

l    (n, k, t) BCH codes      Codeword error rate   User bits storage gain, %
6    (8262, 8192, 5)          1.1 × 10^−17          24.0
8    (8360, 8192, 12)         3.1 × 10^−15          47.0
12   (9130, 8192, 67)         2.8 × 10^−15          57.0
6    (16 459, 16 384, 5)      7.0 × 10^−16          24.5
8    (16 609, 16 384, 15)     3.2 × 10^−15          48.0
12   (17 914, 16 384, 102)    7.2 × 10^−15          60.1

Fig. 9 Binary BCH code decoder structure

Fig. 10 Structure of one syndrome generator with parallelism factor of p

Fig. 11 Structure of Chien search with parallelism factor of p
Following this intuition, we investigate such a trade-off in NAND flash memories with l of 6, 8 and 12, respectively. On the basis of the cell threshold voltage distribution model presented above, if we reduce the step-up programming voltage V_pp to improve the programming accuracy, the standard deviation of the threshold voltage distribution will accordingly reduce. In this work, we simply assume that the standard deviation σ is inversely proportional to the cell programming time. If a t-error-correcting binary BCH code needs to compensate for up to d_def defective memory cells, it will only be able to correct up to t_TVDI = t − d_def TVDI errors. To accommodate such TVDI error correction capability loss, we have to accordingly reduce the TVDI error rates by improving the programming accuracy and hence reducing the standard deviation parameter σ. For the BCH codes listed in Table 4, the resulting trade-offs between σ and d_def are shown in Table 7.
On the basis of the above discussion, we further propose a modified multilevel flash memory defect-tolerant strategy that combines conventional spare row/column repair with BCH codes. As illustrated in Fig. 12, we first check whether the available spare rows/columns can repair all the defects in one memory block; if not, we carry out a repair algorithm that uses the spare rows/columns to repair as many defects as possible, so that the number of residual defective cells is minimised. We then calculate how to adjust the threshold voltage distribution deviation parameter σ in order to compensate for the TVDI error correction capability loss. Finally, we check whether the target σ is feasible, subject to practical constraints such as circuit precision and the minimum allowable cell programming time.
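One bookkeeping detail is worth making explicit. The Table 7 entries are consistent (to within one) with an accounting, ours, in which each defective cell costs the code roughly as many bit-correction units as the bits it stores (e.g. 3.5 bits per cell for l = 12), rather than the literal cell-count reading of t_TVDI = t − d_def. The sketch below (ours) spells out that assumed accounting; it is an interpretation, not the paper's stated formula.

```python
import math

def t_tvdi(t, d_def, bits_per_cell):
    """Correction capability left for TVDI errors after reserving enough
    capability to absorb d_def wholly erroneous cells (our assumed reading
    of the Table 7 accounting)."""
    return t - math.ceil(d_def * bits_per_cell)

# (9130, 8192, 67) code with l = 12 (3.5 bits/cell), cf. Table 7:
for d in (4, 7, 12, 17):
    print(d, t_tvdi(67, d, 3.5))   # -> 53, 42, 25, 7 (Table 7: 53, 42, 24, 6)
```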
5 Conclusions

This paper presented on-chip error correction system design approaches for multilevel code-storage NOR flash and data-storage NAND flash memories. We applied the TCM concept to design an on-chip error correction system for multilevel NOR flash memories. Compared with the conventional practice of using linear block codes, the TCM-based design solution provides a better trade-off between coding redundancy and error-correcting performance. Targeting 2 bits/cell NOR flash, we designed TCM-based systems for three scenarios in which the number of user bits per block is 16, 32 and 64, respectively. Compared with the systems using two-error-correcting BCH codes, the TCM-based systems achieve one order of magnitude better BER while saving 15.4% (16-bit), 13.0% (32-bit) and 2.6% (64-bit) of the memory cells, respectively. Cadence and Synopsys tools were used to implement the read datapath, including the mixed-signal sensing circuits and the digital TCM demodulation and decoding circuits. The latency and silicon area are 8.3 ns and 0.12 mm² (16-bit), 24.3 ns and 0.10 mm² (32-bit), and 44.3 ns and 0.12 mm² (64-bit), respectively.
In the context of NAND flash memory, we demonstrated the promise of using strong BCH codes to further improve multilevel data-storage NAND flash memory capacity without degrading memory programming time. Targeting a codeword error rate lower than 10^−14, we constructed BCH codes with 8192 and 16 384 user bits per codeword, respectively. We showed that, given the same number of memory cells, up to 60% more user bits can be stored compared with the 2 bits/cell benchmark.
Table 5: BCH decoder ASIC design post-layout results

l    (n, k, t) BCH codes      Silicon area, mm²   Latency, μs
6    (8262, 8192, 5)          0.21                10.4
8    (8360, 8192, 12)         0.32                10.9
12   (9130, 8192, 67)         1.43                17.6
6    (16 459, 16 384, 5)      0.25                20.7
8    (16 609, 16 384, 15)     0.38                21.4
12   (17 914, 16 384, 102)    2.14                40.2
Table 6: Estimated storage capacity for a 100 mm² NAND flash memory core

l    (n, k, t) BCH codes      Stored user bits, Gbits   Effective storage capacity improvement, %
4    —                        8.33                      —
6    (8262, 8192, 5)          10.32                     23.9
8    (8360, 8192, 12)         12.24                     46.9
12   (9130, 8192, 67)         13.03                     56.4
6    (16 459, 16 384, 5)      10.36                     24.4
8    (16 609, 16 384, 15)     12.32                     47.9
12   (17 914, 16 384, 102)    13.25                     59.1
Table 7: Trading TVDI error correction capability for defect tolerance

l    (n, k, t) BCH codes      Standard deviation σ   d_def   t_TVDI
6    (8262, 8192, 5)          0.833                  1       2
8    (8360, 8192, 12)         0.930                  1       9
                              0.719                  3       3
12   (9130, 8192, 67)         0.950                  4       53
                              0.900                  7       42
                              0.800                  12      24
                              0.571                  17      6
6    (16 459, 16 384, 5)      0.800                  1       2
8    (16 609, 16 384, 15)     0.950                  1       12
                              0.700                  4       3
12   (17 914, 16 384, 102)    0.950                  6       81
                              0.900                  12      60
                              0.800                  20      32
                              0.571                  27      6
Fig. 12 Flow diagram using BCH codes for defect tolerance
To evaluate decoder silicon area and achievable decoding throughput/latency, we implemented BCH decoders using 0.13 μm CMOS standard cell and SRAM libraries. Post-layout results verify that the decoders occupy (much) less than 2.5 mm² of silicon area and achieve (much) less than 41 μs decoding latency at a 1.6 Gbps decoding throughput. On the basis of the published results for NAND flash effective cell area and a simple scaling rule, we estimate that, under 70-nm CMOS technology and a 100 mm² core area, up to 59.1% effective storage capacity improvement can be realised compared with the 2 bits/cell benchmark.

Furthermore, we proposed a design strategy that leverages the large error correction capability of strong BCH codes to improve memory defect tolerance by trading off the memory cell programming time.
6 Acknowledgments
The authors thank the anonymous reviewers for their valuable comments and suggestions, which have greatly improved the quality and presentation of this paper.
7 References
1 Hwang, C.: 'Nanotechnology enables a new memory growth model', Proc. IEEE, 2003, 91, pp. 1765–1771
2 Bez, R., Camerlenghi, E., Modelli, A., and Visconti, A.: 'Introduction to flash memory', Proc. IEEE, 2003, 91, pp. 489–502
3 Ricco, B., et al.: 'Nonvolatile multilevel memories for digital applications', Proc. IEEE, 1998, 86, pp. 2399–2423
4 Lee, S., et al.: 'A 3.3 V 4 Gb four-level NAND flash memory with 90 nm CMOS technology'. Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2004, pp. 52–53
5 Sim, S.-P., et al.: 'A 90 nm generation NOR flash multilevel cell (MLC) with 0.44 μm²/bit cell size'. IEEE VLSI-TSA Int. Symp. on VLSI Technology, April 2005, pp. 35–36
6 Servalli, G., et al.: 'A 65 nm NOR flash technology with 0.042 μm² cell size for high performance multilevel application'. IEEE Int. Electron Devices Meeting, December 2005, pp. 849–852
7 Hara, T., et al.: 'A 146-mm² 8-Gb multi-level NAND flash memory with 70-nm CMOS technology', IEEE J. Solid-State Circuits, 2006, 41, pp. 161–169
8 Gregori, S., Cabrini, A., Khouri, O., and Torelli, G.: 'On-chip error correcting techniques for new-generation flash memories', Proc. IEEE, 2003, 91, pp. 602–616
9 Silvagni, A., Fusillo, G., Ravasio, R., Picca, M., and Zanardi, S.: 'An overview of logic architectures inside flash memory devices', Proc. IEEE, 2003, 91, pp. 569–580
10 Rossi, D., Metra, C., and Ricco, B.: 'Fast and compact error correcting scheme for reliable multilevel flash memories'. Proc. Eighth IEEE Int. On-Line Testing Workshop, July 2002, pp. 221–225
11 Ungerboeck, G.: 'Trellis-coded modulation with redundant signal sets. Parts I and II', IEEE Commun. Mag., 1987, 25, pp. 5–21
12 Lou, H.-L., and Sundberg, C.-E.W.: 'Increasing storage capacity in multilevel memory cells by means of communications and signal processing techniques', IEE Proc., Circuits Devices Syst., 2000, 147, pp. 229–236
13 Tanzawa, T., et al.: 'A compact on-chip ECC for low cost flash memories', IEEE J. Solid-State Circuits, 1997, 32, pp. 662–669
14 Nobukata, H., et al.: 'A 144-Mb, eight-level NAND flash memory with optimized pulsewidth programming', IEEE J. Solid-State Circuits, 2000, 35, pp. 682–690
15 Grossi, M., Lanzoni, M., and Ricco, B.: 'A novel algorithm for high-throughput programming of multilevel flash memories', IEEE Trans. Electron Devices, 2003, 50, pp. 1290–1296
16 Atwood, G., Fazio, A., Mills, D., and Reaves, B.: 'Intel StrataFlash™ memory technology overview', Intel Technology Journal, 4th Quarter 1997, pp. 1–8
17 Wei, L.F.: 'Trellis-coded modulation with multidimensional constellations', IEEE Trans. Inf. Theory, 1987, 33, pp. 483–501
18 Sun, F., Devarajan, S., Rose, K., and Zhang, T.: 'Multilevel flash memory on-chip error correction based on trellis coded modulation'. IEEE Int. Symp. Circuits and Systems (ISCAS), May 2006
19 Calligaro, C., Gastaldi, R., Manstretta, A., and Torelli, G.: 'A high-speed parallel sensing scheme for multi-level nonvolatile memories'. Proc. Int. Workshop on Memory Technology, Design and Testing, August 1997, pp. 96–101
20 Fettweis, G., and Meyr, H.: 'High-speed parallel Viterbi decoding: algorithm and VLSI-architecture', IEEE Commun. Mag., 1991, 29, pp. 46–55
21 Blahut, R.E.: 'Algebraic codes for data transmission' (Cambridge University Press, 2003)
22 Lee, J.-D., Hur, S.-H., and Choi, J.-D.: 'Effects of floating-gate interference on NAND flash memory cell operation', IEEE Electron Device Lett., 2002, 23, pp. 264–266
23 Takeuchi, K., Tanaka, T., and Nakamura, H.: 'A double-level-Vth select gate array architecture for multilevel NAND flash memories', IEEE J. Solid-State Circuits, 1996, 31, pp. 602–609
24 Micheloni, R., et al.: 'A 0.13-μm CMOS NOR flash memory experimental chip for 4-b/cell digital storage'. Proc. 28th European Solid-State Circuits Conf., September 2002, pp. 131–134
25 Chen, Y., and Parhi, K.K.: 'Area efficient parallel decoder architecture for long BCH codes'. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, May 2004, pp. V-73–V-76
26 Burton, H.O.: 'Inversionless decoding of binary BCH codes', IEEE Trans. Inf. Theory, 1971, 17, (4), pp. 464–466
27 Micheloni, R., et al.: 'A 4 Gb 2b/cell NAND flash memory with embedded 5b BCH ECC for 36 MB/s system read throughput'. IEEE Int. Solid-State Circuits Conf., February 2006, pp. 497–506
... The Raw Bit Error Rate (RBER) is defined as the Bit error rate returned by the device before the employment of error detection and correction by a suitable error correction code [8]- [10]. Multi-level cell architectures are characterized by significantly higher RBER values when compared with SLC architecture. ...
... Since the number of codewords possessed by the codes synthesized in this paper is very large (greater than or equal to 4 4096 in the case of MLC NAND flash memory with Architecture-1), it is not practical to determine the weight distribution by computing the codewords and employing (10). Hence, we have employed the upper bounds specified in (9) to estimate the value of the probability of undetected error, P u (E) associated with the codes synthesized for this application. ...
Article
Full-text available
The revolution in the field of information processing systems has created a huge demand for reliable and enhanced data storage capabilities. This demand is being met by advances in channel coding algorithms along with upward scaling of the capacities of hardware devices. NAND Flash memory is a type of non-volatile memory. Scaling of the size of flash memories from Single Level Cell (SLC) devices to Multilevel cell (MLC) devices has increased the storage capacity. However, these multi-bit per cell architectures are characterized by significantly higher Raw Bit Error Rate (RBER) values when compared with SLC architectures. The requirement of low Undetected Bit Error Rate (UBER) values has motivated us to synthesize powerful channel codes for enhancing the integrity of information Storage in multi-level NAND Flash Memory devices. This paper describes the synthesis of novel Subfield Subcodes of Reed Solomon Codes (SSRS) and Reed-Solomon (RS) codes which are matched to multi-bit per cell architectures. UBER values have been calculated for each of the synthesized codes described in this paper. This allows the determination of the performance and the improvement in data storage integrity brought by using these codes. We have shown that the synthesized SSRS and RS codes can provide very low UBER even when the corresponding RBER values are appreciable. As RS codes permit the detection and correction of a greater number of errors for a given code length, their performance is superior to that of SSRS codes. This improved performance is obtained at the cost of greater complexity of encoding and decoding processes.
... To address this issue, an efficient approach based on parallel architecture has been implemented for each of the sub-blocks responsible for generating syndromes and conducting search Chien. This implementation reduces the time delay significantly and speeds up the overall computation process [16]. Moving forward, we will discuss the introduction of parallel architecture for each of the sub-blocks involved in syndrome generation and Chien search. ...
Preprint
Full-text available
The size reduction of transistors in the latest flash memory generation has resulted in programming and data erasure issues within these designs. Consequently, ensuring reliable data storage has become a significant challenge for these memory structures. To tackle this challenge, error-correcting codes like BCH (Bose-Chaudhuri-Hocquenghem) codes are employed in the controllers of these memories. When decoding BCH codes, two crucial factors are the delay in error correction and the hardware requirements of each sub-block. This article proposes an effective solution to enhance error correction speed and optimize the decoder circuit's efficiency. It suggests implementing a parallel architecture for the BCH decoder's sub-blocks and utilizing pipeline techniques. Moreover, to reduce the hardware requirements of the BCH decoder block, an algorithm based on XOR sharing is introduced to eliminate redundant gates in the search Chien block. The proposed decoder is simulated using the VHDL hardware description language and subsequently synthesized with Xilinx ISE software. Simulation results indicate that the proposed algorithm not only significantly reduces error correction time but also achieves a noticeable reduction in the hardware overhead of the BCH decoder block compared to similar methods.
... For flash memories, various concatenated coding schemes were proposed to enable soft-input decoding, e.g. product codes [75] and concatenated coding schemes based on trellis coded modulation and outer BCH oder RS codes [64,41,55]. For the presented simulation results we assumed a quantized additive white Gaussian noise channel. ...
Thesis
Flash memories are non-volatile memory devices. The rapid development of flash technologies leads to higher storage density, but also to higher error rates. This dissertation considers this reliability problem of flash memories and investigates suitable error correction codes, e.g. BCH-codes and concatenated codes. First, the flash cells, their functionality and error characteristics are explained. Next, the mathematics of the employed algebraic code are discussed. Subsequently, generalized concatenated codes (GCC) are presented. Compared to the commonly used BCH codes, concatenated codes promise higher code rates and lower implementation complexity. This complexity reduction is achieved by dividing a long code into smaller components, which require smaller Galois-Field sizes. The algebraic decoding algorithms enable analytical determination of the block error rate. Thus, it is possible to guarantee very low residual error rates for flash memories. Besides the complexity reduction, general concatenated codes can exploit soft information. This so-called soft decoding is not practicable for long BCH-codes. In this dissertation, two soft decoding methods for GCC are presented and analyzed. These methods are based on the Chase decoding and the stack algorithm. The last method explicitly uses the generalized concatenated code structure, where the component codes are nested subcodes. This property supports the complexity reduction. Moreover, the two-dimensional structure of GCC enables the correction of error patterns with statistical dependencies. One chapter of the thesis demonstrates how the concatenated codes can be used to correct two-dimensional cluster errors. Therefore, a two-dimensional interleaver is designed with the help of Gaussian integers. This design achieves the correction of cluster errors with the best possible radius. Large parts of this works are dedicated to the question, how the decoding algorithms can be implemented in hardware. These hardware architectures, their throughput and logic size are presented for long BCH-codes and generalized concatenated codes. The results show that generalized concatenated codes are suitable for error correction in flash memories, especially for three-dimensional NAND memory systems used in industrial applications, where low residual errors must be guaranteed.
... BCH decoders can be categorized by the place of the decoders. The decoders can be either located on-chip within memory device [18] or outside the memory device [19]. The focus of this paper is on the decoder being outside the memory device. ...
Article
Full-text available
Bose–Chaudhuri–Hocquenghem (BCH) codes are broadly used to correct errors in flash memory systems and digital communications. These codes are cyclic block codes and have their arithmetic fixed over the splitting field of their generator polynomial. There are many solutions proposed using CPUs, hardware, and Graphical Processing Units (GPUs) for the BCH decoders. The performance of these BCH decoders is of ultimate importance for systems involving flash memory. However, it is essential to have a flexible solution to correct multiple bit errors over the different finite fields (GF(2 m )). In this paper, we propose a pragmatic approach to decode BCH codes over the different finite fields using hardware circuits and GPUs in tandem. We propose to employ hardware design for a modified syndrome generator and GPUs for a key-equation solver and an error corrector. Using the above partition, we have shown the ability to support multiple bit errors across different BCH block codes without compromising on the performance. Furthermore, the proposed method to generate modified syndrome has zero latency for scenarios where there are no errors. When there is an error detected, the GPUs are deployed to correct the errors using the iBM and Chien search algorithm. The results have shown that using the modified syndrome approach, we can support different multiple finite fields with high throughput.
... Based on this equation, the Hamming code can correct all single-digit errors and detect two-digit errors. In individual cases according to the number of control bits, more than a single-bit error can be corrected [9]. Advantages and disadvantages of Hamming codes : -simple design and are easy to decode; -can only fix single errors. ...
Article
This article is devoted to the study and analysis of various noise-resistant code structures, which are designed for use in miniature memory drives on spacecrafts. Error-correcting coding is aimed for correcting memory errors that occur due to ionizing radiation. The first part of the article provides information about the general memory architecture using error-correcting coding. The second part considers linear code constructions, such as Hamming code, convolutional code, PC and LDPC code, as well as nonlinear code constructions, which are promising means of correcting memory errors (Vasiliev code, Phelps code, switching code, AMD-code). Based on the research and analysis data, the conclusion is made about the most suitable code design for the development of the information storage module. It should be noted that the determining requirement for choosing the code for the drive used on the spacecraft is the presence of simple decoding algorithms that allow high decoding speed and low energy consumption.
Article
In literature, PIBMA, a linear-feedback-shift-register (LFSR) decoder, has been shown to be the most efficient high-speed decoder for Reed-Solomon (RS) codes. In this work, we follow the same design principles and present two high-speed LFSR decoder architectures for binary BCH codes, both achieving the critical path of one multiplier and one adder. We identify a key insight of the Berlekamp algorithm that iterative discrepancy computation involves only even-degree terms. The first decoder separates the even and odd-degree terms of the error-locator polynomial to iterate homogeneously with discrepancy computation. The resulting LFSR decoder architecture, dubbed PIBA, has $\lfloor {}\frac {3t}{2}\rfloor +1$ processing elements (PEs), each containing two registers, two multipliers, one adder, and two multiplexers (same as that of PIBMA), which compares favorably against the best existing architecture composed by $2t+1$ PEs. The second one, dubbed pPIBA, squeezes the entire error-locator polynomial into the even-term array of the first one to iterate along with discrepancy computation, which comes at the cost of a controlled defect rate. pPIBA employs $t+1+f$ systolic units with a defect probability of $2^{-q(f+1)}$ , where $q$ denotes the finite field dimension and $f$ is a design parameter, which significantly reduces the number of PEs for a large correcting power $t$ . The proposed architectures can be arbitrarily folded to trade off complexity with latency, due to the systolic nature. GII decoding has been notorious for the composition of many seemly irrelevant functional blocks. We are motivated by the unified framework UPIBA which can be reconfigured to carry out both error-only and error-and-erasure decoding of RS codes in the most efficient manner. We devise a unified LFSR decoder for GII-RS, GII-ERS (referring to erasure correction of GII-RS codes), and GII-BCH codes, respectively. Each LFSR decoder can be reconfigured (but not multiplexed) to execute different functional blocks, and moreover achieves the same critical path of one multiplier, one adder, and one multiplexer. The resulting GII-RS/BCH decoder contains only four functional blocks, which are literally the same as the decoder for single RS/BCH codes. For GII-RS and GII-BCH decoding, we also incorporate the original mechanism by Tang and $\text{K}\ddot {\text {o}}$ tter to minimize the miscorrection rate, which comes surprisingly at a negligible cost. Our proposed high-speed low-complexity GII-ERS decoder renders the multi-layer GII codes highly attractive against other locally recoverable codes.
Chapter
This paper presents a two-step methodology for the development of a narrowband (NB) data test waveform that achieves a low bit error rate (BER). The physical layer of the NB waveform consists of many baseband processing blocks, such as digital modulation schemes, channel coding schemes, pulse shaping filters, and synchronization techniques. In the first step, two or more options for the important baseband processing blocks have been chosen through a literature survey. In the second step, these options have been simulated over Additive White Gaussian Noise (AWGN) and Rayleigh fading channel models. Comparative analysis of the different options has been done on the basis of BER versus energy-per-bit to noise power spectral density ratio (\(E_{b}/N_{0}\)) performance during simulation, to select the best options for the optimized baseband processing chain. Finally, the complete baseband processing chain composed of the selected options has been simulated, and a BER of \(10^{-6}\) has been achieved at 13 dB of \(E_{b}/N_{0}\). Keywords: NBWF, BER, \(E_{b}/N_{0}\), BCH, FPGA, QPSK, QAM.
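As a minimal illustration of the second (simulation) step, the sketch below estimates BER against \(E_{b}/N_{0}\) for one candidate modulation, BPSK over AWGN; this is a generic baseline for such comparisons, not the specific block chain of the paper:

# Minimal Monte-Carlo BER-vs-Eb/N0 estimate for BPSK over AWGN,
# illustrating the comparative-simulation step described above.
import numpy as np

rng = np.random.default_rng(0)

def ber_bpsk_awgn(ebn0_db, n_bits=1_000_000):
    bits = rng.integers(0, 2, n_bits)
    symbols = 1 - 2 * bits                    # 0 -> +1, 1 -> -1
    ebn0 = 10 ** (ebn0_db / 10)
    noise = rng.normal(0, np.sqrt(1 / (2 * ebn0)), n_bits)
    decisions = (symbols + noise) < 0         # hard threshold detector
    return np.mean(decisions != bits)

for ebn0_db in range(0, 11, 2):
    print(f"Eb/N0 = {ebn0_db:2d} dB  ->  BER ~ {ber_bpsk_awgn(ebn0_db):.2e}")

Each candidate option of a baseband block would be swept the same way, and the resulting BER curves compared to select the best chain.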
Article
Due to higher integration densities, technology scaling and parameter variation, performance failures may occur in any application. Memory applications are also prone to single-event upsets and transient errors, which may lead to malfunctions. This paper proposes a novel error detection and correction method using EG-LDPC codes. This is useful because majority-logic decoding can be implemented serially with simple hardware but requires a long decoding time; for memory applications, this increases the memory access time. The method detects whether a word has errors in the first iterations of majority-logic decoding, and when there are no errors the decoding ends without completing the remaining iterations. Errors affecting more than five bits were detected with a probability very close to one, and the probability of undetected errors was found to decrease as the code block length increased. For a billion error patterns, only a few errors (or sometimes none) were undetected, which may be sufficient for some applications. Errors commonly occur in flash memory when employing LDPC decoding. The SRMMU actually suggests using the VTVI design by introducing the Context Number register, although a PTPI or VTPI design complying with the SRMMU standard could also be implemented. The VTVI design with a physical write buffer and a combined I/D cache TLB is the simplest design to implement. This gives error correction in the minimum cycle period using the LDPC method.
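The early-termination idea can be sketched generically: compute the check sums of the first majority-logic iteration and stop immediately if they are all zero. The sketch below assumes a plain binary parity-check matrix and a simple majority-vote bit flip; the EG-LDPC orthogonal check-set structure of the actual scheme is not modelled:

# Sketch of the early-termination idea: run the check sums of the first
# majority-logic iteration and stop if the word is already error-free.
# Assumes a generic binary parity-check matrix H; the EG-LDPC orthogonal
# check-set structure of the actual scheme is not modelled.
import numpy as np

def ml_decode_early_exit(H, word, max_iters=10):
    w = word.copy()
    degree = H.sum(axis=0)              # number of checks touching each bit
    for it in range(max_iters):
        checks = H @ w % 2              # all check sums for this iteration
        if not checks.any():            # first iteration clean -> early exit
            return w, it
        votes = H.T @ checks            # unsatisfied-check count per bit
        w = (w + (votes > degree / 2)) % 2   # majority-vote bit flip
    return w, max_iters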
Article
Despite its widespread use in consumer devices and enterprise storage systems, NAND flash faces a growing number of challenges. While technology advances have helped to increase the storage density and reduce costs, they have also led to reduced endurance and larger block variations, which cannot be compensated solely by stronger ECC or read-retry schemes but have to be addressed holistically. Our goal is to enable the use of low-cost NAND flash in enterprise storage. We present novel flash-management approaches that reduce write amplification, achieve better wear leveling, and enhance endurance without sacrificing performance. We introduce block calibration, a technique to determine optimal read-threshold voltage levels that minimize error rates, as well as novel garbage-collection and data-placement schemes that alleviate the effects of block health variability, and we show how these techniques complement one another to achieve enterprise storage requirements. By combining the proposed schemes, we improve endurance by up to 15× compared to the baseline endurance of NAND flash without using a stronger ECC scheme. The flash-management algorithms presented herein were designed and implemented in simulators, hardware test platforms, and eventually in the flash controllers of production enterprise all-flash arrays. Their effectiveness has been validated across thousands of customer deployments since 2015.
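Block calibration as described above is, at its core, a search for the read-threshold offset that minimizes the observed error count. A minimal sketch of that search, with read_page as a hypothetical device-access hook and a known reference pattern:

# Sketch of block calibration: pick the read-threshold offset that
# minimizes the bit-error count against a known reference pattern.
# read_page(block, offset) is a hypothetical device-access hook.

def calibrate_block(block, offsets, reference, read_page):
    def errors(offset):
        data = read_page(block, offset)
        return sum(bin(a ^ b).count("1") for a, b in zip(data, reference))
    return min(offsets, key=errors)

# Example: scan a small window of threshold offsets around the default.
# best = calibrate_block(blk, offsets=range(-3, 4), reference=ref,
#                        read_page=dev_read)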
Conference Paper
Full-text available
This paper presents a multilevel (ML) flash memory on-chip error correction system design based on the concept of trellis coded modulation (TCM). This is motivated by the non-trivial modulation process in ML memory storage and the effectiveness of TCM in integrating coding with modulation to provide better performance. Using code-storage 2 bits/cell flash memory as a test vehicle, the effectiveness of TCM-based systems, in terms of error-correcting performance, coding redundancy, silicon cost, and operation latency, has been successfully demonstrated.
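The mapping side of such a TCM scheme can be illustrated with Ungerboeck-style set partitioning of the four threshold levels: subsets {0, 2} and {1, 3} double the intra-subset level spacing, a coded bit (produced by a convolutional encoder, not reproduced here) selects the subset, and an uncoded bit selects the level within it. A minimal sketch, not the paper's exact design:

# Set-partitioning sketch for 4-level (2 bits/cell) TCM storage:
# subsets {0, 2} and {1, 3} double the intra-subset level spacing.
# The subset-select bits would come from a convolutional encoder (not
# shown); the uncoded bit picks the level inside the chosen subset.
SUBSETS = {0: (0, 2), 1: (1, 3)}

def tcm_map(subset_bits, uncoded_bits):
    return [SUBSETS[s][u] for s, u in zip(subset_bits, uncoded_bits)]

print(tcm_map([0, 1, 1, 0], [1, 1, 0, 0]))   # -> [2, 3, 1, 0]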
Article
We have introduced the concept of floating-gate interference in flash memory cells for the first time. The floating-gate interference causes a threshold-voltage (V_T) shift of a cell proportional to the V_T change of the adjacent cells. It results from capacitive coupling via parasitic capacitors around the floating gate. The coupling ratio defined in previous works should be modified to include the floating-gate interference. In a 0.12-μm design-rule NAND flash cell, the floating-gate interference corresponds to about a 0.2 V shift in multilevel cell operation. Furthermore, the adjacent word-line voltages affect the programming speed via parasitic capacitors.
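To first order, the interference admits a simple capacitive-coupling model: assuming the victim cell's floating gate couples to each neighbouring floating gate through a parasitic capacitance \(C_i\), the induced shift is approximately

\[ \Delta V_{T,\mathrm{victim}} \approx \sum_i \frac{C_i}{C_{\mathrm{total}}}\, \Delta V_{T,i}, \]

where \(\Delta V_{T,i}\) is the threshold shift programmed into neighbour \(i\) after the victim was programmed and \(C_{\mathrm{total}}\) is the total capacitance seen by the victim's floating gate; the quoted 0.2 V figure corresponds to the aggregate of these terms in the 0.12-μm cell.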
Article
A 65nm NOR flash technology, featuring a true 10λ², 0.042 μm² cell, is presented for the first time for 1 bit/cell and 2 bits/cell products. Advanced 193nm lithography, floating-gate self-aligned STI, cobalt salicide and three levels of copper metallization allow the integration with a high-density and high-performance 1.8V CMOS.
Article
The need to transmit and store massive amounts of data reliably and without error is a vital part of modern communications systems. Error-correcting codes play a fundamental role in minimising data corruption caused by impairments such as noise, interference, crosstalk and packet loss. This book provides an accessible introduction to the basic elements of algebraic codes, and discusses their use in a variety of applications. The author describes a range of important coding techniques, including Reed-Solomon codes, BCH codes, trellis codes, and turbo codes. Throughout the book, mathematical theory is illustrated by reference to many practical examples. The book was first published in 2003 and is aimed at graduate students of electrical and computer engineering, and at practising engineers whose work involves communications or signal processing.
Article
A parallel sensing scheme for multilevel non-volatile memories (ML NVMs) is presented. A single comparison step is used to achieve high sensing speed; to this purpose, a high-speed low-voltage current comparator is used. Experimental evaluations on a 0.6-μm EPROM test chip demonstrated the feasibility of 4-level-cell ML NVMs from the sensing standpoint. A read throughput of 12 MB/s is achieved with the proposed 4-level-cell memory architecture. Multilevel storage is achieved by using a program-verify scheme to obtain tight cell threshold-voltage distributions. The overall sensing area overhead for a 32-Mbit chip is in the range of 1%.
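Behaviourally, the single-comparison-step read amounts to comparing the cell current against all reference currents in parallel and encoding the resulting thermometer code into 2 bits. A sketch with hypothetical reference values:

# Behavioural sketch of parallel 4-level sensing: the cell current is
# compared against all three references simultaneously and the
# thermometer code is encoded into 2 bits (reference values assumed).
REFS = (10e-6, 20e-6, 30e-6)    # three reference currents, in amperes

def sense(cell_current):
    thermometer = [cell_current > r for r in REFS]   # parallel comparators
    level = sum(thermometer)                          # 0..3
    return level >> 1, level & 1                      # 2-bit output

print(sense(25e-6))   # -> (1, 0), i.e. level 2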
Conference Paper
A 4Gb 2b/cell NAND flash memory designed in a 90nm CMOS technology incorporates a 25MHz BCH ECC architecture correcting up to 5 errors over a flexible data field (1B to 2102B). Two alternative Chien-search circuits are used depending on the number of errors (1 to 5), thus minimizing latency. The ECC area overhead is less than 1%.
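The Chien search at the heart of such a decoder simply evaluates the error-locator polynomial at the field element associated with each bit position and reports the roots as error locations. A minimal software sketch over GF(2^4) (field, tables and example polynomial chosen for illustration; the chip's dual-circuit latency optimisation is not modelled):

# Minimal Chien-style root search over GF(2^4), primitive polynomial
# x^4 + x + 1: evaluate the error-locator polynomial at alpha^(-i) for
# every bit position i and report the roots as error locations.
N = 15                              # field size minus one, GF(2^4)
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
v = 1
for i in range(N):
    EXP[i] = EXP[i + N] = v
    LOG[v] = i
    v <<= 1
    if v & 0x10:
        v ^= 0x13                   # reduce modulo x^4 + x + 1

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def poly_eval(poly, x):
    acc, xp = 0, 1
    for c in poly:                  # poly = [c0, c1, c2, ...]
        acc ^= gf_mul(c, xp)
        xp = gf_mul(xp, x)
    return acc

def chien_search(locator):
    return [i for i in range(N) if poly_eval(locator, EXP[(N - i) % N]) == 0]

# Errors at positions 2 and 5 give locator 1 + alpha*x + alpha^7*x^2:
print(chien_search([1, 2, 11]))     # -> [2, 5]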
Conference Paper
This paper presents a method to reduce the area and timing overhead incurred when standard single-symbol-correcting codes are implemented to provide ML flash memories with error-correction capability. In particular, the proposed method is based on manipulating the parity-check matrix that defines a code, which makes it possible to minimize both the total matrix weight and the maximum row weight. Furthermore, we show that a minimal increase in redundancy, with respect to the standard case, allows a further considerable reduction of the impact on the memory access time, as well as on the area overhead due to the error-correction circuitry.
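The effect of such parity-check-matrix manipulation can be reproduced on a toy scale: adding one row of H to another over GF(2) leaves the code unchanged but can lower the total weight and the maximum row weight, which translate directly into XOR-gate count and checker depth. A greedy sketch on an illustrative matrix:

# Toy illustration: row operations over GF(2) preserve the code defined
# by H but can lower total weight and maximum row weight (i.e. XOR
# count and checker depth). Matrix chosen for illustration only.
import numpy as np

def greedy_row_reduce(H):
    H = H.copy()
    improved = True
    while improved:
        improved = False
        for i in range(len(H)):
            for j in range(len(H)):
                if i != j:
                    candidate = (H[i] + H[j]) % 2
                    if candidate.sum() < H[i].sum():
                        H[i] = candidate    # same row space, lower weight
                        improved = True
    return H

H = np.array([[1, 1, 1, 1, 0, 0],
              [1, 1, 0, 0, 1, 0],
              [0, 0, 1, 1, 0, 1]])
Hr = greedy_row_reduce(H)
print(H.sum(), "->", Hr.sum())                          # total weight drops
print(H.sum(axis=1).max(), "->", Hr.sum(axis=1).max())  # max row weight drops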
Conference Paper
A 256Mb NOR MLC flash memory in 90nm technology has been successfully developed. Through judicious integration to control the cell dispersion and charge loss/gain with cycling, we confirm successful MLC operation up to 10K cycles for a 0.44 μm²/bit cell size. In this paper, the key features governing multilevel cell (MLC) operation below the 90nm technology node are discussed with experimental results.