INVITED PAPER
Iterative Decoding of
Concatenated Convolutional
Codes: Implementation Issues
The speed of decoding can be increased by raising the decoder clock frequency,
increasing the use of parallel hardware, and judiciously limiting the
number of decoding iterations.
By Emmanuel Boutillon, Catherine Douillard, and Guido Montorsi
ABSTRACT | This tutorial paper gives an overview of the implementation aspects related to turbo decoders, where the term turbo generally refers to iterative decoders intended for parallel concatenated convolutional codes as well as for serial concatenated convolutional codes. We start by considering the general structure of iterative decoders and the main features of the soft-input soft-output algorithm that forms the heart of iterative decoders. Then, we show that very efficient parallel architectures are available for all types of turbo decoders, allowing high-speed implementations. Other implementation aspects, like quantization issues and stopping rules used in conjunction with buffering for increasing throughput, are considered. Finally, we perform an evaluation of the complexities of the turbo decoders as a function of the main parameters of the code.
KEYWORDS | Concatenated convolutional codes; hardware complexity; iterative decoders; parallel architectures; quantization; stopping rules
I. INTRODUCTION
In 1993, at a time when few people believed in the practicality of capacity-approaching codes, the presentation of turbo codes [1] was a revival for the channel coding research community. Furthermore, the performance claimed in this seminal paper was soon confirmed with a practical hardware implementation [2].
Historical turbo codes, also sometimes called parallel
concatenated convolutional codes (PCCC), are based on a
parallel concatenation of two recursive systematic con-
volutional codes separated by an interleaver. They are
called turbo in reference to the analogy of their decoding
principle with the turbo principle of a turbo-compressed
engine, which reuses the exhaust gas in order to improve
efficiency. The turbo decoding principle calls for an
iterative algorithm involving two component decoders
exchanging information in order to improve the error
correction performance with the decoding iterations.
This iterative decoding principle was soon applied to
other concatenations of codes separated by interleavers,
such as serial concatenated convolutional codes (SCCC)
[3], [4], sometimes called serial turbo codes, or concatena-
tion of block codes, also named block turbo codes [5], [6].
The near-capacity performance of turbo codes and their
suitability for practical implementation explain their
adoption in various communication standards as early as
the late 1990s. Firstly, they were chosen in the telemetry
coding standard by the Consultative Committee for Space
Data Systems (CCSDS) [7] and for the medium to high
data rate transmissions in the third generation mobile
communication 3GPP/UMTS standard [8]. They have
further been adopted as part of the digital video broadcast–
return channel satellite and terrestrial (DVB–RCS and
DVB–RCT) links [9], [10], thus enabling broadband
interactive satellite and terrestrial services. More recently,
they were also selected for the next generation of 3GPP2/
cdma2000 wireless communication systems [11] as well as
Manuscript received June 20, 2006; revised February 3, 2007. This work was supported by the E.U. under the Network of Excellence in Wireless Communications (NEWCOM), Project 507325.
E. Boutillon is with LESTER, CNRS, Université de Bretagne Sud, 56321 Lorient Cedex, France (e-mail: emmanuel.boutillon@univ-ubs.fr).
C. Douillard is with the Electronics Department, CNRS, GET-ENST Bretagne, Technopôle Brest-Iroise, 29238 Brest Cedex 3, France (e-mail: catherine.douillard@enst-bretagne.fr).
G. Montorsi is with the Dipartimento di Elettronica, Politecnico di Torino, 10129 Torino, Italy (e-mail: guido.montorsi@polito.it).
Digital Object Identifier: 10.1109/JPROC.2007.895202
Proceedings of the IEEE, Vol. 95, No. 6, June 2007. 0018-9219/$25.00 © 2007 IEEE
for the IEEE 802.16 standard (WiMAX) [12] intended
for broadband connections over long distances. Turbo
codes are used in several Inmarsat’s communication
systems, too, such as in the new Broadband Global Area
Network (BGAN [13]) that entered service in 2006. A
serial turbo code was also adopted in 2003 by the
European Space Agency for the implementation of a very
high speed (1 Gb/s) adaptive coded modulation modem for
satellite applications [14].
This paper deals with the implementation issues of
iterative decoders for concatenated convolutional codes.
Both parallel and serial concatenated convolutional codes
are addressed, and the corresponding iterative decoders are equally referred to as turbo decoders. Since the most recent applications increasingly demand high data throughput, special attention is paid to the design of high-speed hardware decoder architectures, an important issue for industry. The
remainder of the paper is divided into six parts. A survey of
the general structure of turbo decoders is presented in
Section II. Section III reviews the soft-input soft-output
(SISO) algorithms that are used as fundamental building
blocks of iterative decoders. Section IV deals with the issue
of architectures dedicated to high throughput services.
Particular stress is laid on the increase of the parallelism in
the decoder architecture. Section V presents a review of
the different stopping rules that can be applied in order to
decrease the average number of iterations, thus also
increasing the average decoding speed of the decoder. The
fixed-point implementation of decoders, which is desirable
for hardware implementations, is dealt with in Section VI.
Finally, complexity issues, related to implementation
choices, are discussed in Section VII.
II. CONCATENATION OF
CONVOLUTIONAL CODES AND
ITERATIVE DECODING
A. Short History of Concatenated Coding
and Decoding
Code concatenation is a multilevel coding method
allowing codes with good asymptotic as well as practical
properties to be constructed. The idea dates back to Elias’s
product code construction in the mid 1950s [15]. A major advance was made by Forney in his thesis work on concatenated codes ten years later [16]. As stated by
Forney, concatenation is a method of building long codes
out of shorter ones in order to resolve the problem of
decoding complexity by breaking the required computa-
tion into manageable segments according to the divide and
conquer strategy. The principle of concatenation is
applicable to any type of codes, convolutional or block
codes. This first version of code concatenation is now
called serial concatenation (SC) of codes. Before the
invention of turbo codes, the most famous example of
concatenated codes was the concatenation of an outer
algebraic code, such as a Reed–Solomon code, with an
inner convolutional code, which has been used in
numerous applications, ranging from space communica-
tions to digital broadcasting of television.
As far as the SC of convolutional codes is concerned,
the straightforward way of decoding involves the use of
two conventional Viterbi algorithm (VA) decoders in a
concatenated way. The first drawback that prevents such a
receiver from performing efficiently is that the inner VA
produces bursts of errors at its output, that the outer VA
has difficulty correcting. This can be circumvented by
inserting an interleaver between the inner and outer VAs,
as illustrated in Fig. 1.
The second main drawback is that the inner VA
provides hard decisions, thus preventing the outer VA
from using its ability to accept soft samples at its input. In
order to work properly, the inner decoder has to provide
the outer decoder with soft information. Among the
different attempts to achieve this goal, the soft-output
Viterbi algorithm (SOVA) approach calls for a modifica-
tion of the VA in order to deliver a reliability value for
each decoded bit [17], [18]. Like the VA, the SOVA
provides a maximum likelihood trellis¹ decoding of the
convolutional code which minimizes the probability of a
sequence error. However, it is suboptimal with respect to
bit or symbol error probability. The minimal bit or symbol
error probability can be achieved with symbol-by-symbol
Fig. 1. Transmission scheme with insertion of interleaver into a serial concatenation of convolutional codes.
¹ The trellis diagram is a temporal representation of the code. It represents all the possible transitions between the states of the encoder as a function of time. The length of the trellis is equal to the number of time instants required to encode the whole information sequence.
maximum a posteriori (MAP) decoding using the BCJR
algorithm [19].
The ultimate progress in decoding concatenated codes
occurred when it was observed that a SISO decoder could
be regarded as an SNR amplifier, thus allowing common
concepts in amplifiers such as the feedback principle to be
implemented. This observation gave birth to the so-called
turbo decoding principle of concatenated codes [20] in the
early 1990s. Thanks to this decoding concept, the 1.5-dB
gap still remaining between what theory promised and
what the state of the art in error control coding was able to
offer at the time² was almost removed. In Fig. 1, the
receiver clearly exploits the received samples in a
suboptimal way, even in the case where information
passed from the inner decoder to the outer decoder is soft
information. The overall decoder works in an asymmetri-
cal way: both decoders work towards decoding the same
data. The outer decoder takes advantage of the inner
decoder work but the contrary is not true. The basic idea of
turbo decoding involves a symmetric information ex-
change between both SISO decoders, so that they can
converge to the same probabilistic decision, as a global
decoder would. The issue of stability, which is crucial in
feedback systems, was solved by introducing the notion of
extrinsic information, which prevents the decoder from
being a positive feedback amplifier. In the case where the
component decoders compute the Logarithm of Likelihood
Ratios (LLR) related to information data, the extrinsic
information can be obtained with a simple subtraction
between the output and the input of the decoder as
described in Fig. 2.
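In code, the extrinsic computation of Fig. 2 amounts to one subtraction per symbol. The following is a minimal sketch in the LLR domain; the function name is ours:

```python
def extrinsic_llr(app_out, llr_in):
    # Extrinsic = a posteriori output LLR minus the decoder's own
    # input LLR, so that a decoder never feeds its own input
    # information back to itself (avoiding positive feedback).
    return [o - i for o, i in zip(app_out, llr_in)]
```
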
B. Parallel and Serial Concatenation
of Convolutional Codes
1) Parallel Versus Serial Concatenation of Convolutional
Codes: The introduction of turbo codes is also the origin of
the concept of parallel concatenation (PC). Fig. 3(a) shows
the PC of two convolutional codes, as commonly used in
classical turbo codes. With the PC of codes, the message to
be transmitted is encoded twice in a separate fashion: the
first encoder processes the data message in the order it is
delivered by the source, while the second one encodes the
Fig. 2. General principle of turbo decoding in the logarithmic domain: the extrinsic information symmetrically exchanged between inner and outer SISO decoders can be obtained with a simple subtraction between the output and input of the decoders.
Fig. 3. General structures of (a) PC and (b) SC of two convolutional encoders. In (a), the encoded sequence c is obtained through concatenation of the information or systematic sequence and the redundancy sequences provided by the two constituent encoders: c = (u, y_1, y_2).
² In the early 1990s, the state of the art in error control coding was the concatenation of a Reed–Solomon code and a memory-14 convolutional code proposed for the Galileo space probe [21]. This scheme was decoded using a four-stage iterative scheme where hard decisions were fed back from the Reed–Solomon decoder to the Viterbi decoder.
same sequence in a different order obtained by way of an
interleaver.
The overall coding rates for SC and PC, R_s and R_p, are equal to

$$R_s = R_i R_o \quad\text{and}\quad R_p = \frac{R_1 R_2}{R_1 + R_2 - R_1 R_2} = \frac{R_1 R_2}{1 - (1 - R_1)(1 - R_2)}$$

where R_i and R_o refer to the coding rates of the inner and outer constituent codes in the SC scheme, and R_1 and R_2 refer to the rates of code 1 and code 2 in the PC scheme.
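As a quick numerical check of these rate formulas, here is a small sketch; the helper names are ours:

```python
def rate_serial(r_inner, r_outer):
    # R_s = R_i * R_o
    return r_inner * r_outer

def rate_parallel(r1, r2):
    # R_p = R1*R2 / (R1 + R2 - R1*R2),
    # equivalently R1*R2 / (1 - (1 - R1)*(1 - R2))
    return (r1 * r2) / (r1 + r2 - r1 * r2)
```

For two rate-1/2 constituent codes, the parallel concatenation yields rate 1/3, the rate of the original turbo code of [1], while the serial concatenation yields rate 1/4.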
Historical turbo codes [1], [22] are based on the PC of two recursive systematic convolutional (RSC) codes. Firstly, the choice of RSC codes instead of classical nonrecursive convolutional (NRC) codes is justified by comparing their respective error correction performance. An example of such memory length ν = 3 codes is shown in Fig. 4. Since they have the same minimum Hamming distance,³ the observed error correcting performance is very similar for both codes at medium and low error rates. However, the RSC code performs significantly better at low signal-to-noise ratios (SNRs) [22].
However, the reason why using RSC codes is essential in the construction of concatenated convolutional codes was explained through bounding techniques. It was shown in [23] and [24] that, under the assumption of uniform interleaving,⁴ the bit error probability in the low error region P_b varies asymptotically with the interleaver gain

$$P_b \propto N^{\alpha_{\max}} \tag{1}$$

where N is the interleaver length and α_max is the maximum exponent of N in the asymptotic union bound approximation. In the PC case, the maximum exponent is equal to α_max = 1 − w_min, where w_min is the minimum input Hamming weight of finite error events. This result shows that there is an interleaving gain only if w_min > 1, which is true for RSC codes (w_min = 2 and α_max = −1) but not for NRC codes (w_min = 1 and α_max = 0). As for SC schemes, if both constituent codes are NRC, the same problem appears. It was shown that, for SCCC, at least the inner code has to be recursive [25] in order to ensure an interleaver gain. Provided this condition is satisfied, the maximum exponent is equal to α_max = −⌊(d_min,o + 1)/2⌋, where d_min,o is the minimum Hamming distance of the outer code.
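The exponents above can be made concrete with a small helper, a sketch under the stated uniform-interleaver assumption; the function names are ours:

```python
import math

def alpha_max_pc(w_min):
    # Parallel concatenation: alpha_max = 1 - w_min, so an
    # interleaving gain (alpha_max < 0) requires w_min > 1.
    return 1 - w_min

def alpha_max_sc(d_min_outer):
    # Serial concatenation with a recursive inner code:
    # alpha_max = -floor((d_min_outer + 1) / 2)
    return -math.floor((d_min_outer + 1) / 2)
```

RSC constituents (w_min = 2) give α_max = −1 for PC, while an outer code with d_min,o = 5 gives α_max = −3 for SC, i.e., a much steeper decay of P_b with the interleaver length N.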
This analysis also allows us to compare the SC and PC performance at low error rates: serial turbo codes perform better than parallel turbo codes in this region because their interleaver gain is larger. Conversely, the opposite behavior can be observed at high error rates, for the same overall coding rate. This can be explained through the analysis of extrinsic information transfer characteristics based on mutual information, the so-called EXIT charts [26].
2) Block Coding With Convolutional Codes: Convolutional codes are not a priori well suited to encoding information transmitted in block form. Nevertheless, most practical applications require the transmission of data in block fashion, the size of the transmitted blocks sometimes being reduced to less than 100 bits (see, e.g., the 3GPP turbo code [8] with a 40-bit minimum block length). In order to properly decode data blocks at the receiver side, the decoder needs some information about the encoder state at the beginning and the end of the encoding process. The knowledge of the initial state of the encoder is not a problem, since the "all zero" state is, in general, forced. However, the decoder has no special available information regarding the final state of the encoder and its trellis. Several methods can solve this problem.
1) Do nothing: that is, no information concerning
the final states of the trellises is provided to the
decoder. The trellis is truncated at the end of each
block. The decoding process is less effective for
the last encoded data and the asymptotic coding
gain may be reduced. This degradation is a
function of the block length and may be low
enough to be accepted for a given application.
Fig. 4. (a) Classical nonrecursive nonsystematic convolutional code with ν = 3 memory units (eight-state code). (b) Equivalent recursive systematic version of code (a).
³ The minimum Hamming distance d_min of a code is the smallest Hamming distance between any two different encoded sequences. The correcting capability of the code is directly related to the value of d_min.
⁴ A uniform interleaver is a probabilistic device that maps a given sequence of length N and Hamming weight w into all distinct permutations of length N and Hamming weight w with equal probability. The uniform interleaver is representative of the average performance over all possible deterministic interleavers.
2) Force the encoder state at the end of the encoding phase, for one or all constituent codes: This solution was adopted by the CCSDS and UMTS turbo codes [7], [8]. The trellis termination of a constituent code involves encoding extra bits, called tail bits, in order to make the encoder return to the all-zero state. These tail bits are then sent to the decoder. This method presents two drawbacks. Firstly, the spectral efficiency of the transmission is slightly decreased. Nevertheless, this reduction is negligible except for very short blocks. Next, for parallel turbo codes, the tail bits are not identical for the termination of both constituent codes, or in other words, they are not turbo encoded. Consequently, symbols placed at the block end have a weaker protection. As for serial turbo codes, the tail bits used for the termination of the inner coder are not taken into account in the turbo decoding process, thus leading to a similar problem. However, the resulting loss in performance is very small and can be acceptable in most applications.
3) Adopt tail-biting [27]: This technique allows any state of the encoder as the initial state. The encoding task is performed so that the final state of the encoder is equal to its initial state. The code trellis can then be viewed as a circle, without any state discontinuity. Tail-biting presents two main advantages in comparison with trellis termination using tail bits to drive the encoder to the all-zero state. Firstly, no extra bits have to be added and transmitted. Next, with tail-biting RSC codes, only codewords with minimum input weight 2 have to be considered. In other words, tail-biting encoding avoids any side effects, unlike classical termination. This is particularly attractive for highly parallel hardware implementations of the decoder, since the block sides do not require any specific processing. In practice, the straightforward circular encoding of a data block consists of a two-step process. In the first step, the information sequence is encoded from the all-zero state and the final state is stored; during this first step, the output bits are ignored. The second step is the actual encoding, whose initial state is a function of the final state previously stored. This double encoding operation represents the main drawback of the method, but in most cases it can be performed at a frequency much higher than the data rate. An increased amount of memory is also required to store the state information related to the start and the end of the block between iterations.
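The two-step circular encoding can be sketched for a toy rate-1/2 RSC code (feedback 1 + D + D², parity 1 + D²); all names are ours. A real encoder would map the stored final state to the circulation state through a small precomputed table; here, for this 4-state toy code, we simply try every initial state, which is equivalent:

```python
def rsc_step(state, u):
    # One step of the toy RSC: feedback 1+D+D^2, parity 1+D^2.
    s1, s2 = state
    a = u ^ s1 ^ s2          # feedback bit entering the shift register
    return (a, s1), a ^ s2   # (next state, parity bit)

def run_encoder(state, bits):
    parity = []
    for u in bits:
        state, p = rsc_step(state, u)
        parity.append(p)
    return state, parity

def tailbiting_encode(bits):
    # Step 1 (dry run): probe the circulation state; outputs ignored.
    # Step 2: re-encode from the circulation state, i.e., the unique
    # initial state that the encoder returns to at the end of the block.
    for s0 in ((0, 0), (0, 1), (1, 0), (1, 1)):
        final, _ = run_encoder(s0, bits)
        if final == s0:
            return s0, run_encoder(s0, bits)[1]
    raise ValueError("block length incompatible with the code period")
```

A circulation state exists (and is unique) only when the block length is not a multiple of the period of the feedback polynomial (3 for 1 + D + D²), which is why systems using tail-biting constrain the admissible block sizes.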
C. Iterative Decoders for Concatenated
Convolutional Codes
The decoding principle of PCCC and SCCC is shown in Figs. 5 and 6. The SISO decoders are assumed to process LLRs at their inputs λ(·; I) and outputs λ(·; O) (the notations used in the figures are those adopted in Section III).

In the PC scheme of Fig. 5, each SISO decoder computes the extrinsic LLRs related to the information symbols, λ(u_1; O) and λ(u_2; O), using the observation of the associated systematic and parity symbols coming from the transmission channel, λ(c_1; I) and λ(c_2; I), and the a priori LLRs λ(u_2; I) and λ(u_1; I). Since no a priori LLRs are available from the decoding process at the beginning of the iterations, they are initially set to zero. For the subsequent iterations, the extrinsic LLRs coming from the other decoder are used as a priori LLRs for the current SISO

Fig. 5. Turbo decoding principle in the case of parallel turbo codes. Notations are taken from Fig. 3(a). λ(·; I) and λ(·; O) refer to LLRs at the input and output of the SISO decoders.
decoder. The decisions can be computed from any of the
decoders. In the PC case, the turbo decoder structure is
symmetrical with respect to both constituent decoders.
However, in practice, the SISO processes are executed in a
sequential fashion; the decoding process starts arbitrarily
with either one decoder, SISO1 for example. After SISO1
processing is completed, SISO2 starts processing and
so on. In the SC scheme of Fig. 6, the decoding diagram is no longer symmetrical. On the one hand, the inner SISO decoder computes the extrinsic LLRs λ(u_i; O) related to the inner code information symbols, using the observation of the associated coded symbols coming from the transmission channel, λ(c_i; I), and the extrinsic LLRs coming from the other SISO decoder, λ(u_i; I). On the other hand, the outer SISO decoder computes the extrinsic LLRs λ(c_o; O) related to the outer code symbols using the extrinsic LLRs provided by the inner decoder. The decisions are computed as a posteriori LLRs λ(u_o; O) related to the information symbols by the outer SISO decoder. Although the overall decoding principle depends on the type of concatenation, both turbo decoders can be constructed from the same basic SISO building blocks, as described in Section III.
For digital implementations of turbo decoders, the
different processing stages present a nonzero internal
delay, with the result that turbo decoding can only be
implemented through an iterative process. Each SISO
decoder processes its own data and passes it to the other
SISO decoder. One iteration corresponds to one pass
through each of all the SISO decoders. One pass through a
single SISO decoder is sometimes referred to as half an
iteration of decoding.
The concatenated encoders and decoders can work
continuously or block-wise. Since convolutional codes are
naturally better suited to encode information in a
continuous fashion, the very first turbo encoders and
decoders [2], [28] were stream oriented. In this case, there
is no constraint about the termination of the constituent
encoders, and the best interleavers turned out to be
periodic or convolutional interleavers [29], [30]. The
corresponding decoders call for a modular pipelined
structure, as illustrated in Fig. 7 in the case of parallel
turbo codes. The basic decoding structure has to be
replicated as many times as the number of iterations. In
order to ensure the correct timing of the decoding process,
delay lines have to be inserted into the decoder. The de-
cision computation, not shown in the figure, is performed
at the output of SISO decoder 2, at the last iteration stage.
However, as far as block decoding is concerned, the simplest decoding architecture is based on the use of a single SISO decoder, which alternately processes both constituent codes. This standard architecture requires three storage units: a memory for the received data at the channel and decoder outputs (LLR_in memory), a memory for extrinsic information at the SISO output (EXT memory), and a memory for the decoded data (LLR_out memory). The instantiations of this architecture in the PC
and SC cases are illustrated in Figs. 8 and 9. In the parallel
scheme, the decoding architecture is the same for both
component codes, since they play the same role in the
overall decoding process. The SISO decoder decodes code 1 or 2 using the corresponding channel data from the LLR_in memory and the a priori information stored in the EXT memory at the previous half-iteration. The resulting extrinsic information is stored in the EXT memory and then used as a priori information at the SISO input when the other code is processed, at the next half-iteration. The decoded data are written into the LLR_out memory at the last half-iteration of the decoding process. On the contrary, in the SC scheme, the architecture elements are not used in the same fashion for inner and outer code processing. The inner decoding process, shown in Fig. 9(a), is similar to the elementary decoding process in the PC case; whereas the outer decoding process, shown in Fig. 9(b), does not use the LLR_in memory contents or the a priori information input λ(u_o; I). The decoded data are written into the LLR_out memory after the last processing round of the outer decoder.
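The block-wise half-iteration schedule described above for the PC case can be sketched as follows. This is a minimal sketch: `siso` stands for any SISO decoder returning extrinsic LLRs, and all names and conventions are ours:

```python
def turbo_decode(llr_sys, llr_par1, llr_par2, interleave, deinterleave,
                 siso, n_iter=8):
    # One iteration = two half-iterations sharing a single SISO unit.
    n = len(llr_sys)
    ext1 = [0.0] * n
    ext2 = [0.0] * n          # a priori for the first half-iteration: zero
    for _ in range(n_iter):
        # Half-iteration 1: decode code 1 in the natural order.
        ext1 = siso(llr_sys, llr_par1, ext2)
        # Half-iteration 2: decode code 2 in the interleaved order.
        ext2 = deinterleave(siso(interleave(llr_sys), llr_par2,
                                 interleave(ext1)))
    # A posteriori LLR = channel + both extrinsics; with the convention
    # lambda = log P(x=1)/P(x=0), a positive value decides bit 1.
    app = [s + e1 + e2 for s, e1, e2 in zip(llr_sys, ext1, ext2)]
    return [1 if x > 0 else 0 for x in app]
```

The EXT memory of Figs. 8 and 9 corresponds to `ext1`/`ext2`, and the LLR_in memory to the three channel LLR arrays.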
III. SOFT INPUT SOFT
OUTPUT ALGORITHMS
SISO algorithms are the fundamental building blocks of
iterative decoders. A SISO is, in general, a block that
accepts messages (to be defined later) about both the input
and output symbols of an encoder and provides extrinsic
messages about the same symbols. The extrinsic message is
generated by considering the a priori constraints that exist
between the input and output sequences of the encoder. In
this section, we will give more precise definitions of the
"message" and on the input-output relationships of a SISO block. Moreover, we will show efficient algorithms to perform SISO decoding for some special types of encoders.

Fig. 6. Turbo decoding principle in the case of serial codes. Notations are taken from Fig. 3(b). λ(·; I) and λ(·; O) refer to LLRs at the input and output of the SISO decoders.
A. Definition of the Input and Output Metrics
A SISO module generally works in association with a known mapping f (encoding) between input and output alphabets

$$c = f(u) = \big(f_1(u), \ldots, f_n(u)\big), \qquad u \in U,\; c \in C \tag{2}$$

where u = (u_1, ..., u_k) and c = (c_1, ..., c_n) are the input and output sequences of the encoder, respectively. The alphabets U and C are generic finite alphabets and the mapping is not necessarily invertible.
A SISO module is a four-port device that accepts messages about the input and output symbols of the encoder and provides homologous extrinsic output messages.
Fig. 8. Standard architecture for block-wise decoding of
parallel turbo codes.
Fig. 9. Standard architecture for block-wise decoding of serial
turbo codes: (a) inner decoder and (b) outer decoder.
Fig. 7. Pipelined architecture of turbo decoder of Fig. 5, in the case of continuous decoding of PCCC. A priori LLRs at input of first iteration
stage are set to zero. Decision computation, performed at last iteration stage, is not shown.
We will consider the following two types of normalized messages.

1) Likelihood ratio (LR):

$$L(x) = \frac{P(X = x)}{P(X = 0)}$$

represents the ratio between the likelihood of the symbol being x and the likelihood of it being zero.

2) Log-likelihood ratio (LLR):

$$\lambda(x) = \log \frac{P(X = x)}{P(X = 0)} \;\Longrightarrow\; e^{\lambda(x)} = L(x)$$

is its logarithmic version.

Normalized messages are usually preferred to unnormalized ones as they allow the decoder to save one value in the representation. In fact, by definition L(0) = 1 and λ(0) = 0. In particular, when the variable x is binary (x ∈ {0, 1}), LRs and LLRs can be represented with a scalar L_x = L(x = 1)⁵ and λ_x = λ(x = 1), so that one can write

$$L(x) = (L_x)^x, \qquad \lambda(x) = x\,\lambda_x. \tag{3}$$

The sequence of LRs is always assumed to be independent at the input of a SISO,⁶ so that the likelihood ratio of the sequences u and c is the product of the likelihoods of their constituent symbols

$$L(u) = \prod_{i=1}^{k} L(u_i), \qquad L(c) = \prod_{j=1}^{n} L(c_j)$$

and equivalently, for the LLRs,

$$\lambda(u) = \sum_{i=1}^{k} \lambda(u_i), \qquad \lambda(c) = \sum_{j=1}^{n} \lambda(c_j).$$

Furthermore, assuming that the sequences of LRs of input and output symbols are mutually independent, the LR of a pair (u, c), with the constraint of being a valid correspondence, can be obtained as

$$L(u, c) = \begin{cases} L(u)\,L\big(f(u)\big), & c = f(u) \\ 0, & \text{otherwise.} \end{cases}$$
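In code, the normalization and the additivity of sequence LLRs look like this (a minimal sketch; the helper names are ours):

```python
import math

def binary_llr(p1):
    # lambda_x = log P(X=1)/P(X=0); the normalization makes
    # lambda(0) = 0, so a single scalar describes a binary symbol.
    return math.log(p1 / (1.0 - p1))

def sequence_llr(symbol_llrs):
    # Under the independence assumption, the LLR of a sequence is
    # the sum of the LLRs of its symbols.
    return sum(symbol_llrs)
```

`binary_llr(0.5)` is 0 (no information), and the sign of the LLR indicates the more likely bit value.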
B. General SISO Relationships
The formal SISO input-output relationships are obtained with the independence assumption, constraining the set of input/output sequences to be in the set of possible mapping correspondences. Using LR messages, the relationships are

$$L(u_i; O)\,L(u_i; I) = \frac{\sum_{u' : u'_i = u_i} L(u'; I)\,L\big(f(u'); I\big)}{\sum_{u' : u'_i = 0} L(u'; I)\,L\big(f(u'); I\big)} \tag{4}$$

$$L(c_j; O)\,L(c_j; I) = \frac{\sum_{u' : f_j(u') = c_j} L(u'; I)\,L\big(f(u'); I\big)}{\sum_{u' : f_j(u') = 0} L(u'; I)\,L\big(f(u'); I\big)} \tag{5}$$

where we have introduced the letters "I" and "O" to distinguish between input and output messages.
In the logarithmic domain (LLRs), products translate to sums, and sums are mapped to the max* operator

$$\max{}^*(\lambda_1, \lambda_2) = \log\big(e^{\lambda_1} + e^{\lambda_2}\big) = \max(\lambda_1, \lambda_2) + \log\big(1 + e^{-|\lambda_1 - \lambda_2|}\big) \tag{6}$$
and (4) and (5) become, in terms of LLRs,

$$\lambda(u_i; O) + \lambda(u_i; I) = \max^*_{u' : u'_i = u_i}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] - \max^*_{u' : u'_i = 0}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] \tag{7}$$

$$\lambda(c_j; O) + \lambda(c_j; I) = \max^*_{u' : f_j(u') = c_j}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] - \max^*_{u' : f_j(u') = 0}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] \tag{8}$$
The algorithm obtained from the first type of messages (LRs) is called multiplicative (sum-prod), while the second (LLRs) is called additive (max*-sum), or log-MAP.

The max* operator requires, in general, two sums and a look-up table. The look-up table size depends on the
⁵ We have kept the index x with the binary LLR to remind the reader of the name of the underlying symbol.
⁶ This is actually the basic assumption of iterative decoding, which otherwise would be optimum.
required accuracy, and the whole correction term can be avoided, giving rise to a simpler and suboptimal version of the SISO (max-sum or max-log-MAP). To compensate for the effect of neglecting the look-up table, several strategies are possible, such as scaling or offsetting the messages. These techniques are described in [31] and [32].
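As an illustration, the max* operator of (6) and its max-log-MAP simplification can be sketched as follows (a minimal sketch; the function names are ours):

```python
import math

def max_star(a, b):
    # Exact Jacobian logarithm of (6): log(e^a + e^b), computed
    # stably as the max plus a correction term (the look-up table
    # in a hardware implementation).
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_approx(a, b):
    # max-log-MAP: drop the correction term entirely, trading a
    # small loss in accuracy for a much simpler operator.
    return max(a, b)
```

The correction term is at most log 2 ≈ 0.693 (reached when the two arguments are equal), which is why neglecting it, or replacing it with a small look-up table, costs only a fraction of a decibel.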
Independently of the metric used, the input-output relationships (4), (5) or (7), (8) have a complexity that grows with the size of the code. This complexity can be affordable for very simple mappings but becomes impractical for most of the mappings used as encoders.
In the following sections, we will describe some
particular cases where this computation can be
simplified.
C. Binary Mappings

As we have seen, for binary variables LRs and LLRs have the appealing feature of being representable as single values. In particular, using relationships (3) we have

$$\lambda_{u_l}(O) + \lambda_{u_l}(I) = \max^*_{u' : u'_l = 1}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) - \max^*_{u' : u'_l = 0}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) \tag{9}$$

$$\lambda_{c_l}(O) + \lambda_{c_l}(I) = \max^*_{u' : f_l(u') = 1}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) - \max^*_{u' : f_l(u') = 0}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) \tag{10}$$
D. SISO Relationships on Trellises
In this section, we will explain how to simplify the computation of (4) and (5), or their logarithmic counterparts (7) and (8), when the mapping is represented over a trellis. As the correspondence between the multiplicative and additive domains only requires the use of different operators, we will describe the algorithm in the multiplicative domain.
A trellis is an object characterized by the concatenation of $L$ trellis sections $T_l$. Each trellis section (see Fig. 10) consists of a set of starting states $s_l \in S_l$, an input set $x_l \in X_l$, the set of edges defined as the pairs $e_l = (s_l, x_l) \in E_l = S_l \times X_l$, and two functions $p_l(e_l)$ and $y_l(e_l)$ that assign to each edge a final state in $S_{l+1}$ and an output symbol in $Y_l$.
A trellis is associated to a mapping (2) by making a correspondence of the $L$ trellis sections' input and output alphabets with the $k$ input and $n$ output alphabets of the mapping
\[
\bigotimes_{l=1}^{L} X_l = \bigotimes_{i=1}^{k} U_i \qquad \bigotimes_{l=1}^{L} Y_l = \bigotimes_{j=1}^{n} C_j
\]
where $\bigotimes$ denotes the Cartesian product of sets.
The alphabets of the trellis sections must then be the Cartesian product of a set of subalphabets of the original mapping
\[
X_l = \bigotimes_{i \in I_l} U_i \qquad Y_l = \bigotimes_{j \in J_l} C_j
\]
where $\{I_l\}$ and $\{J_l\}$ are a partition of the sets of indexes of the input and output alphabets. Furthermore, the cardinality of $S_1$ and $S_{L+1}$ is always one.
Any mapping (2) in general admits several time-varying trellis representations; in particular, the number of trellis sections $L$ and the alphabets defining each trellis section can be arbitrarily fixed (see Fig. 11). Finding a trellis with minimal complexity (i.e., a minimal number of edges, as explained in [33]) is a rather complex task and is outside the scope of this tutorial paper.
A trellis representation for a mapping allows us to
interpret the mapping itself as paths in the representing
trellis. This correspondence allows us to build efficient
encoders based on time-varying finite-state machines.
More importantly, this same structure can be exploited for
the efficient evaluation of expressions like those in (4) and
(5) or (7) and (8) that involve associative and distributive
operators.
In fact, due to the distributive and associative
properties of the operators appearing in (4) and (5), it is
possible to compute the required output extrinsic LRs with
the following algorithm [19].
1) From the likelihoods of the alphabets of the mapping $L(u_i; I)$ and $L(c_j; I)$, compute the likelihoods of the alphabets of the trellis sections and, from them, the likelihoods of the edges (branch metrics)
\[
\Gamma_l(x_l) = \prod_{i \in I_l} L(u_i; I) \quad \forall x_l \in X_l \tag{11}
\]
\[
\Gamma_l(y_l) = \prod_{j \in J_l} L(c_j; I) \quad \forall y_l \in Y_l
\;\Rightarrow\;
G_l(e_l) = \Gamma_l(x_l)\, \Gamma_l\big(y_l(e_l)\big). \tag{12}
\]
2) Compute the forward and backward recursions according to
\[
A_{l+1}(s) = \sum_{e_l:\, p_l(e_l) = s} A_l(s_l)\, G_l(e_l) \qquad l = 1, \ldots, L-1 \tag{13}
\]
\[
B_l(s) = \sum_{e_l:\, s_l = s} B_{l+1}\big(p_l(e_l)\big)\, G_l(e_l) \qquad l = L, \ldots, 2 \tag{14}
\]
with initializations
\[
A_1(0) = B_{L+1}(0) = 1.
\]
Periodic normalization may be introduced at this point to avoid overflows. The normalization can be performed by dividing all state metrics by a reference one. Note, however, that in the additive version normalization can be avoided by using a two's-complement representation of the state metrics, a technique that can be found in the literature on Viterbi decoders.
3) Compute the a posteriori likelihoods of the edges
\[
D_l(e_l) = A_l(s_l)\, G_l(e_l)\, B_{l+1}\big(p_l(e_l)\big) \qquad \forall e_l,\; l = 1, \ldots, L.
\]
4) Finally, compute the desired extrinsic output LRs as
\[
L(u_i; O) = \frac{1}{L(u_i; I)}\,
\frac{\sum_{e_l:\, u_i(e_l) = u_i} D(e_l)}{\sum_{e_l:\, u_i(e_l) = 0} D(e_l)} \tag{15}
\]
\[
L(c_j; O) = \frac{1}{L(c_j; I)}\,
\frac{\sum_{e_l:\, c_j(e_l) = c_j} D(e_l)}{\sum_{e_l:\, c_j(e_l) = 0} D(e_l)} \tag{16}
\]
where $l$ in (15) [respectively, (16)] is the index of the trellis section associated to the symbol $u_i$ (respectively, $c_j$).
If no information is available for some of the symbols involved in the mapping, their corresponding LR should be set to the vector 1 and their LLR to the vector 0. If no extrinsic information is required for some of the symbols involved in the mapping, the corresponding final relationships (15) or (16) can be avoided.
Due to the finite memory of the trellis, the forward and backward recursions (13) and (14) are generally such that $A_{l+W}(s)$ and $B_{l-W}(s)$ are independent of $A_l(s)$ [respectively, $B_l(s)$] for sufficiently large $W$. This fact also allows the algorithm to work when messages are available only for a contiguous subset of trellis sections. See Section III-F for more details.
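The four steps above can be sketched in the additive (max-log) domain on a toy trellis. The 2-state code used below (systematic bit u, parity c = u XOR s, next state c) is a hypothetical example chosen only to keep the trellis small, and the backward recursion is initialized uniformly, as for an unterminated trellis:

```python
NUM_STATES = 2

def edges(s):
    """Edges leaving state s of a toy 2-state trellis section:
    (start state, input bit, next state, output bits (u, c))."""
    return [(s, u, u ^ s, (u, u ^ s)) for u in (0, 1)]

def siso_maxlog(llr_u, llr_c):
    """Max-log SISO on the toy trellis: additive forms of (11)-(15).
    llr_u[l] is the a priori LLR of input bit l; llr_c[l][j] is the
    channel LLR of output bit j at step l. Returns extrinsic input LLRs."""
    L, NEG = len(llr_u), -1e9

    def gamma(l, e):  # branch metric: sum of the LLRs of the '1' bits
        _, u, _, ys = e
        return u * llr_u[l] + sum(y * llr_c[l][j] for j, y in enumerate(ys))

    # forward recursion (13), starting in state 0
    alpha = [[NEG] * NUM_STATES for _ in range(L + 1)]
    alpha[0][0] = 0.0
    for l in range(L):
        for s in range(NUM_STATES):
            for e in edges(s):
                m = alpha[l][s] + gamma(l, e)
                alpha[l + 1][e[2]] = max(alpha[l + 1][e[2]], m)

    # backward recursion (14), unterminated: all end states allowed
    beta = [[NEG] * NUM_STATES for _ in range(L)] + [[0.0] * NUM_STATES]
    for l in range(L - 1, -1, -1):
        for s in range(NUM_STATES):
            for e in edges(s):
                m = beta[l + 1][e[2]] + gamma(l, e)
                beta[l][s] = max(beta[l][s], m)

    # edge a posteriori metrics and extrinsic outputs (15), additive form
    ext = []
    for l in range(L):
        best = {0: NEG, 1: NEG}
        for s in range(NUM_STATES):
            for e in edges(s):
                d = alpha[l][s] + gamma(l, e) + beta[l + 1][e[2]]
                best[e[1]] = max(best[e[1]], d)
        ext.append(best[1] - best[0] - llr_u[l])
    return ext
```

For instance, `siso_maxlog([0.0, 0.0], [[4.0, 4.0], [-4.0, 4.0]])` returns extrinsic LLRs whose signs follow the channel observations.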
Fig. 10. Trellis section and related functions: starting state, ending state, input symbol, and output symbol. Note that an edge is defined as the pair (starting state, input symbol).
In Fig. 12, we show as an example the block diagram of the forward recursion in its multiplicative and additive forms. Note that in the additive form we denote with lowercase Greek letters the logarithmic counterparts of the variables defined for the multiplicative version.
The described SISO algorithm is completely general
and can be used for any mapping and trellis. In the
following we will consider some special cases of mappings
where the algorithm can be further simplified.
1) Convolutional encoders: The trellis of a convolutional encoder has trellis sections that do not depend on the time index. A convolutional encoder has a constant set of states, constant input and output alphabets, and constant functions $p(e)$ and $y(e)$. Convolutional encoders define a mapping between semi-infinite sequences when the starting state is fixed
\[
\bigotimes_{i=0}^{\infty} U_i \longrightarrow \bigotimes_{j=0}^{\infty} C_j.
\]
2) Binary convolutional encoders: These have the additional constraint that $U_i = C_j = \mathbb{Z}_2$. For them, each trellis section is characterized by a set of $k_0$ input bits and $n_0$ output bits. LR and LLR
Fig. 11. Correspondence between a mapping and a time-varying trellis.
Fig. 12. Implementation of forward recursion in additive and
multiplicative version.
are single quantities, as shown in (3), so that the branch metrics can be computed simply as
\[
\gamma_l(x) = \sum_{i=0}^{k_0 - 1} u_i\, \lambda_{u_i}(I) \qquad
\gamma_l(y) = \sum_{j=0}^{n_0 - 1} c_j\, \lambda_{c_j}(I).
\]
3) Linear binary convolutional encoders: For linear encoders we have the additional linearity property
\[
f(u_1 \oplus u_2) = f(u_1) \oplus f(u_2).
\]
The linearity of an encoder does not simplify the SISO algorithm. However, linear encoders admit dual encoders, and the SISO algorithm can be performed with modified metrics (as we will see in the next section) on the trellis of the dual code. This fact may lead to considerable savings, especially for high-rate codes, which have simpler dual trellises.
4) Systematic encoders: Systematic encoders are encoders for which the output symbol $y_l$ is obtained by concatenating the input symbol $x_l$ with a redundancy symbol $r_l$:
\[
y_l = (x_l, r_l).
\]
For them, the computation of the metrics (11) and (12) is simplified, as the metric on $x_l$ can be incorporated into that of $y_l$:
\[
\Gamma_l(y_l) = \prod_{i \in I_l} L(u_i; I)\, L(c_i; I) \prod_{j \in J_l \setminus I_l} L(c_j; I)
\]
where the first product refers to the systematic part of the label $y_l$ and the second to the redundancy $r_l$.
E. SISO Algorithm for Dual Codes
In the previous section, we saw that the linearity of the code does not simplify the SISO algorithm. However, when the encoder is linear and binary,7 we recall a fundamental result, first derived in [34] and restated here.
Define the new binary messages, called reflection coefficients $R_x$, from the LR $L_x$ as
\[
R_x = \frac{1 - L_x}{1 + L_x}
\]
and the corresponding sequence reflection coefficients as
\[
R(c) = \prod_{j=1}^{n} \left(R_{c_j}\right)^{c_j}.
\]
The following relationship holds true:
\[
R(c_j; O)\, R(c_j; I) =
\frac{\sum_{u':\, f^{\perp}_j(u') = c_j} R\big(f^{\perp}(u'); I\big)}
{\sum_{u':\, f^{\perp}_j(u') = 0} R\big(f^{\perp}(u'); I\big)} \tag{17}
\]
where $f^{\perp}$ is the mapping that defines the dual encoder, i.e., the set of sequences orthogonal to the code. This relationship, formally identical to (5), can be used to perform SISO decoding of high-rate linear binary encoders with the complexity associated with their duals [35]–[37]. A very important case where this property is exploited is LDPC decoding, for the efficient SISO computation at the check nodes, since the dual code of the $(n, n-1)$ parity check code is the simple two-word $(n, 1)$ repetition code.
Although elegant and simple, this approach may sometimes lead to numerical problems in fixed-point implementations, and in these cases the approach based on puncturing a low-rate mother code is still preferred.
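For the single parity check code, (17) reduces to the familiar check-node rule of LDPC decoding. A minimal Python sketch follows; the sign convention (reflection coefficient written as tanh of half the LLR) and the clipping constant are assumptions of this sketch, not taken from the paper:

```python
import math

def parity_check_siso(llrs):
    """Extrinsic LLRs of an (n, n-1) single parity check code computed
    through its dual, the (n, 1) repetition code: the output reflection
    coefficient of bit j is the product of the other input coefficients."""
    r = [math.tanh(x / 2.0) for x in llrs]  # reflection coefficients
    out = []
    for j in range(len(llrs)):
        p = 1.0
        for i, ri in enumerate(r):
            if i != j:
                p *= ri
        p = max(min(p, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
        out.append(2.0 * math.atanh(p))
    return out
```

The clipping step hints at the numerical fragility mentioned above: in fixed point, products of near-unit reflection coefficients are hard to represent accurately.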
F. Initialization of Forward and Backward
Recursions and Windowing
For SISO decoding of convolutional codes, the initialization of the forward and backward recursions, as well as the order in which these recursions are performed, leads to different solutions. In Fig. 13, we show, pictorially, the most relevant approaches.
In the upper left part of the figure, we represent a possible solution when a single SISO processes a block of data. The data is divided into adjacent, equally sized "windows" and the SISO processes the set of windows sequentially in natural order. It first performs the forward recursion, storing the result in a temporary buffer, and then performs the backward recursion and the computation of the outputs at the same time. Since the SISO processes the windows sequentially and in natural order, the forward recursion results can be propagated to the next window for correct initialization. The backward recursion, on the other hand, needs to be properly initialized on each window and
7
The result can be easily extended more generally to finite fields.
an additional unit must be deployed to perform this task
(dashed lines).
Other scheduling possibilities, such as performing first the backward and then the forward recursion, are examined in [38] and give rise to similar overheads.
When parallel processors are used (top right), initialization of the forward recursions is needed and an additional unit must be deployed to perform this task.
A more elegant and efficient solution, which eliminates the overhead of the initializations, is reported in the bottom part of the figure. In this case, we exploit the fact that the SISO processors are employed in an iterative decoder. The results of the forward and backward recursions at a given iteration are propagated to the adjacent (previous or next) window in the next iteration. This approach leads to negligible performance losses provided that the window size is large enough.
IV. PARALLEL ARCHITECTURES FOR TURBO DECODERS
In this section, we consider the basic SISO decoder architecture with four processing units (branch metric unit, forward recursion unit, backward recursion unit, and output computation unit; see Fig. 21). This SISO is able to perform one trellis step during one clock cycle. The maximum throughput achievable by the turbo decoder using this SISO is $f_{\mathrm{clk}} / (\Phi\, n_{\mathrm{it}})$, where $n_{\mathrm{it}}$ is the number of iterations of the decoding process and $f_{\mathrm{clk}}$ is the clock frequency of the architecture. The factor $\Phi$ indicates the minimum number of trellis stages per information bit ($\Phi = 2$ for a PCCC turbo decoder, $\Phi = 2 + 1/R_i$ for an SCCC turbo decoder).8
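As a numerical illustration of this throughput bound, the sketch below evaluates it for hypothetical parameter values (the clock frequency and iteration count are invented for the example; the factor returned by the two helper functions is the number of trellis stages per information bit):

```python
def phi_pccc():
    """Minimum trellis stages per information bit for a PCCC decoder."""
    return 2.0

def phi_sccc(inner_rate):
    """Same factor for an SCCC decoder with inner code rate R_i."""
    return 2.0 + 1.0 / inner_rate

def max_throughput(f_clk_hz, n_it, phi):
    """Maximum information throughput (bit/s) of one serial SISO that
    processes one trellis step per clock cycle over n_it iterations."""
    return f_clk_hz / (phi * n_it)
```

With a hypothetical 200 MHz clock and 8 iterations, a PCCC decoder would thus be bounded at 12.5 Mbit/s.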
There are three solutions to increase the throughput of the decoder: increasing the parallelism of the decoder, increasing the clock frequency, and, finally, decreasing the number of iterations. This last solution will be considered separately in Section V.
Theoretically, the increase of parallelism can be obtained at all levels of the hierarchy of the turbo decoding algorithm: first at the turbo decoder level, second at the SISO level (duplication of the processing unit performing an iteration), third at the half-iteration level (duplication of the hardware to speed up a SISO processing), and, finally, at the trellis stage level.
A. Codeword Pipeline and Parallelization
The first method proposed in the literature to increase the throughput of a turbo decoder [2] is the simplest one: it dedicates a processor to each half iteration. Thus, the $2n_{\mathrm{it}}$ processors work in a linear systolic way. While the first one processes the first half iteration of the newest received
8 This maximum decoding rate is obtained when there is no idle cycle (to empty and/or initialize the pipeline) between two half iterations. In [39], a technique to obtain this condition is presented.
Fig. 13. Possible solutions for initialization of forward and backward recursions.
codeword of index $k$, the second one processes the second half iteration of the previously received codeword (index $k-1$), and so on, up to the $2n_{\mathrm{it}}$th processor, which performs the last iteration of the codeword of index $k - 2n_{\mathrm{it}}$. Once the processing of a half iteration is finished, all codewords are shifted in the linear processor array. This method is efficient in the sense that the increase in throughput is proportional to the increase in hardware: $2n_{\mathrm{it}}$ processors working in parallel increase the throughput by a factor of $2n_{\mathrm{it}}$. Another efficient alternative involves instantiating several turbo decoders working in parallel on independent codewords using demultiplexing.
Nevertheless, both methods imply a large decoding latency and also require the duplication of the memories for extrinsic and intrinsic information. To avoid memory duplication, it is far more efficient to use parallelism at the SISO level in order to speed up the execution time of an iteration.
B. Parallel SISO Architecture
The idea is to use parallelism to perform SISO decoding, i.e., to use several independent SISO decoders working on the same codeword. To do so, the frame of size $N$ ($N = k/R_o$ in the case of the inner code of the SCCC scheme; $N = k$ otherwise) is sliced into $P$ slices of size $M = N/P$, and each slice is processed in parallel by $P$ independent SISO decoders (in the following, we assume that the size $N$ of the frame is a multiple of $P$). This technique implies two types of problems that should be solved: first, the problem of data dependency within an iteration; second, the problem of memory collisions in the parallel memory accesses.
Problem of Data Dependency: Since forward processing is performed sequentially from the first symbol $u_0$ to the last symbol $u_{N-1}$, the question arises of how to start simultaneously $P$ independent forward recursions from the symbols $u_0, u_M, u_{2M}, \ldots, u_{N-M}$. The same question also arises for the backward recursion in a symmetrical way. An elegant and simple solution to this problem was proposed independently by Blankenship et al. [40] and Giulietti et al. [41] in 2002. This solution is derived from the sliding-window technique described in Section III-F. The idea is to relax the constraint of performing the entire forward (respectively, backward) processing within a single iteration (see Fig. 13). Thus, the $P$ final state metrics of the forward processing of the $P$ slices obtained during the $j$th iteration are used as initial states, after a circular right shift, of the forward processing of iteration $j+1$ ($j$ varying from 1 to $n_{\mathrm{it}}$). The same principle also stands for the backward recursion.
Memory Organization: In the natural order, the organization of the memory can be very simple: the first $M$ data ($u_0$ to $u_{M-1}$) are stored in a first memory block (MB), then the next $M$ data ($u_M, \ldots, u_{2M-1}$) are stored in a second MB, and so on. This mapping is called direct mapping. With direct mapping, the processing of the first dimension is straightforward: each SISO unit has direct access to its own memory. For the interleaved dimension, the problem is more complex. In fact, the first SISO needs to process sequentially the symbols $u_{\pi(1)}, u_{\pi(2)}, \ldots, u_{\pi(M-1)}$. Since, at a given time $l$, $u_{\pi(l)}$ can be stored in any of the $P$ memories, a network has to be created in order to connect the first SISO unit to the $P$ memory banks. Similarly, a network should also be created to connect the $P$ memory banks to the $P$ SISO units. However, a conflict in the memory accesses can appear if, at a given time, two SISO units need to access data stored in the same memory bank.
The problem of memory conflicts has been widely studied in the literature, and so far three kinds of solutions have been proposed: solutions at the execution stage, at the compilation stage, and at the design stage. These three solutions are described as follows.
Formulation of the Problem: Since each stage of a trellis needs inputs (intrinsic information, a priori information, and the associated redundancy) and generates an output (extrinsic and/or LLR information), the execution of $P$ trellis stages at each clock cycle requires a memory bandwidth of $P$ read/write accesses per clock cycle. Assuming that a memory bank can be read and written at each clock cycle, at least $P$ memory banks of size $M$ are required to provide both the memory capacity and the memory bandwidth. Let us describe the sequence of reads/writes in the memory banks in the natural and interleaved orders. At time $l$, the $P$ symbols accessed by the $P$ SISO processors are, in the natural order, $V^1_l = \{l, l+M, \ldots, l+(P-1)M\}$ and, in the interleaved order, $V^2_l = \{\pi(l), \pi(l+M), \ldots, \pi(l+(P-1)M)\}$. The memory organization should allow, for all $l = 1 \ldots M$, a parallel read/write access for both sets $V^1_l$ and $V^2_l$. In the remainder of the paper, we assume that the memory address $i$ corresponds to the bank $\lfloor i/M \rfloor$ at address $(i \bmod M)$, where $\lfloor \cdot \rfloor$ denotes the integer part function.
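With this bank/address convention, checking whether an interleaver is conflict-free for a given parallelism amounts to verifying that, at every time l, the P interleaved accesses fall in P distinct banks. A minimal sketch (the function names are ours, not from the paper):

```python
def bank(i, M):
    """Direct mapping: index i is stored in bank i // M at address i % M."""
    return i // M

def collision_times(perm, P, M):
    """Times l at which the parallel interleaved accesses
    perm[l], perm[l+M], ..., perm[l+(P-1)M] hit a common memory bank."""
    bad = []
    for l in range(M):
        banks = [bank(perm[l + p * M], M) for p in range(P)]
        if len(set(banks)) < P:
            bad.append(l)
    return bad
```

The identity permutation (natural order) is trivially conflict-free under direct mapping, while an arbitrary interleaver generally is not, which is exactly the problem the three families of solutions below address.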
The generic parallel architecture is shown in Fig. 14. An iteration on this architecture works in two steps. When decoding the first encoder, the $p$th memory bank is accessed thanks to its associated address generator (AG). The address generator delivers a sequence of addresses $\{\pi^1_p(l)\}_{l=1 \ldots M}$. The data coming from the memory banks are then sent to the SISO units through the permutation network. The permutation network shuffles the data according to a permutation defined at each cycle $l$ by $\Pi^1(l)$. The outputs of the SISO units are stored in the memory banks in a symmetrical way, thanks to the permutation network.9 At the end of an iteration, the final forward recursion metrics $f_m$ (respectively, backward metrics $b_m$)
9
The permutation network is composed, in fact, of two shuffle networks: one to shuffle the data between the SISO units and the memory banks during a write access, and the other to shuffle the data between the memory banks and the SISO units during a read access.
are sent to their right (respectively, left) neighbor. Those metrics are stored temporarily during the processing of the second encoder. They are used as the initial state of the trellis at the beginning of the next iteration, when the first encoder is processed again. Note that when a tail-biting code is used, the left-most and right-most SISOs also exchange their $f_m$ and $b_m$ values. The decoding of the second encoder is similar: the $\{\pi^1_p(l)\}_{l=1 \ldots M}$ are replaced by $\{\pi^2_p(l)\}_{l=1 \ldots M}$, and $\Pi^1(l)$ by $\Pi^2(l)$.
1) Solution at the Execution Stage: In this family of solutions, we encompass all solutions using direct mapping with some extra hardware or features to tackle the problem of memory conflicts during the interleaved-dimension processing. Two kinds of solutions have been proposed. The first one is technological: it uses memories with a higher bandwidth than necessary in order to avoid memory conflicts. For example, if two read/write accesses in a memory can be performed in a single SISO clock cycle, then all double-access memory conflicts are solved. Note that this solution is not efficient in terms of area and power dissipation. A more efficient solution was proposed by Thul et al. [42]. It relies on a "smart" permutation network that contains small FIFO modules to smooth the memory accesses. These FIFO modules allow a fraction of the write accesses to be delayed and a fraction of the read accesses to be anticipated so that, at each cycle, the whole memory bandwidth is used. In order to limit the size of the FIFO modules, Thul et al. also proposed to "freeze" the SISO modules in order to solve the remaining memory conflicts. This solution is generic and efficient but requires some additional hardware and extra latency in the decoding process [43].
2) Solution at the Compilation Stage: This solution was proposed by Tarable et al. [44]. The authors show that, regardless of the interleaver and the number $P$ of SISO units, there is always a memory mapping free of read/write access conflicts. A memory mapping is a permutation $\phi$ on the set $\{0 \ldots N-1\}$ that associates to the index $l$ the memory location $i = \phi(l)$. This nontrivial mapping ($\phi \neq \mathrm{Id}$) implies nontrivial spatial permutations in both the natural order ($\Pi^1(l) \neq \mathrm{Id}$, $l = 0 \ldots M$) and the interleaved order ($\Pi^2(l) \neq \mathrm{Id}$, $l = 0 \ldots M$). This method is general but has two main drawbacks: the complexity of the network and the amount of memory needed to store the different configurations of the network during the decoding process. In fact, the network should perform all possible permutations. It can be implemented in one stage by a crossbar, at the cost of a very high area, or in several stages by an ad hoc network (a Benes network, for example [45]). In this latter case, both the complexity and the latency of the network are multiplied by a factor of 2 compared to the simple barrel shifter (see Section IV-B3). Moreover, memories are required to store the address generator sequences and the $2M$ permutations $\{\Pi^1(l)\}_{l=1 \ldots M}$ and $\{\Pi^2(l)\}_{l=1 \ldots M}$. Assuming a multiple-frame and multiple-rate decoder, the size of the memory to store each interleaver configuration may be prohibitive.
In conclusion to this section, [44] proposes to deal with the problem of turbo decoders of different sizes by defining a single interleaver of maximum size. Codes of shorter length are then generated by simply pruning the original interleaver. This type of interleaver is named prunable collision-free interleaver.
3) Solution at the Design Stage: The idea is to define jointly the interleaver and the architecture in order to solve a priori the memory conflicts while keeping a simple architecture. Fig. 14 presents the interleaver structure used to construct a parallel interleaver constrained by the decoding parallelism.
With this kind of technique, the memory mapping is the natural one, i.e., $\phi = \mathrm{Id}$. Thus, in the natural order, the spatial permutation is the identity, $\Pi^1_l = \mathrm{Id}$, and the temporal permutations are also the identity, $(\pi^1_p = \mathrm{Id})_{p=1 \ldots P}$. In the interleaved order, the spatial permutation at time $l$ is simply a rotation of index $r(l)$, i.e., $\Pi^2_l(p) = (p + r(l)) \bmod P$. All the temporal permutations $\pi^2_p$, for $p = 0 \ldots P-1$, are equal to a unique temporal permutation $\pi^2$ (the same address is used for all memory blocks). Moreover, the expression of $\pi^2$ can be computed on the fly as the
Fig. 14. Generic architecture of a parallel turbo decoder. At the end of each iteration, the final states $f_m$ of the forward processing (respectively, $b_m$ of the backward processing) are exchanged with the right (respectively, left) SISO decoder in a circular way. Parameter $P$ represents the number of parallel SISO decoders. Parameter $M$ represents the memory size of each slice.
sum of a linear congruence expression and a periodic offset: $\pi^2(l) = (a \cdot l + \beta(l \bmod \omega)) \bmod M$, where $a$ is the linear factor, prime with $M$, $\omega$ is the periodicity of the offset (generally, $\omega = 4$), and $\beta$ is an array of size $\omega$ that contains the offset values.
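A minimal sketch of such a temporal permutation follows. The parameter values used in the example (M = 16, a = 5, and the offset array) are illustrative only and are not taken from any standard:

```python
import math

def arp_temporal(M, a, beta):
    """Temporal permutation pi2(l) = (a*l + beta[l mod w]) mod M,
    with the linear factor a prime with M (ARP-like construction)."""
    assert math.gcd(a, M) == 1
    w = len(beta)
    return [(a * l + beta[l % w]) % M for l in range(M)]
```

For instance, `arp_temporal(16, 5, [0, 4, 8, 12])` yields a valid permutation of {0, ..., 15}; because only a handful of parameters define the interleaver, an exhaustive search over (a, beta) configurations is feasible.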
At first glance, one could think that such constraints on the interleaver would lead to poor performance. On the contrary, the interleavers of this type are among the best known ones. Since the number of parameters needed to define such an interleaver is low, an exhaustive search among all parameter configurations can be done. Note that the impulse method defined by Berrou [46] and its derivatives [47] are efficient tools to optimize the construction of such interleavers.
One should note that the almost regular permutation [48] defined for the DVB-RCS standard, the dithered relatively prime interleaver [49], and the slice turbo code [50] belong to the family of "design stage interleavers."
Problem of Activity of the Parallel Architecture: However, another issue arises when dealing with parallel architectures: the efficient usage of the computational resources. The idea is that it is inefficient to increase the number of SISOs if the computational power of these SISOs is used only partially. This issue is particularly critical in the context of the sequential decoding of turbo codes, where half-iterations are processed sequentially. Due to data dependencies and pipeline stages in hardware implementations, the sliding-window algorithm leads to idle time at the beginning and end of the processing of each subblock. This idle time can waste a significant part of the total processing power when the subblock size is short, i.e., with short codeword lengths and/or a high degree of parallelism. This problem is quite complex and not easy to solve in the general case. However, it can be tackled by the use of a joint interleaver/decoder design (design stage solution), as proposed in [39], or by going back to the pipeline solution of Section IV-A, as proposed recently in [51] for very high speed turbo decoders.
C. Parallel Trellis Stage
In a given recursion (forward or backward), each trellis stage implies the computation, for each node of the trellis, of the recursive equations (13) and (14). It is easy to associate a processing unit to each node of the trellis so that a trellis stage can be processed at each clock cycle. The question is now: is it possible to increase the parallelism? Since the forward (or backward) recursion contains a loop, it is not possible to increase the parallelism directly. Some authors propose a technique called trellis compacting. This technique is based on restructuring the conventional trellis by grouping two consecutive trellis stages into a single one. In other words, instead of serially computing the forward metrics $\alpha_{l+1}$ from the $\alpha_l$ metrics and the branch metrics $\gamma_l$, and then computing $\alpha_{l+2}$ from $\alpha_{l+1}$ and $\gamma_{l+1}$, $\alpha_{l+2}$ is directly computed from $\alpha_l$, $\gamma_{l+1}$, and $\gamma_l$.
This technique was proposed initially in the context of the Viterbi decoder, where a speed improvement by a factor of 1.7 was reported [52]. It can be directly applied when the max-log-MAP algorithm is implemented. In fact, in this case, the forward and backward recursions are equivalent to the Viterbi recursion (the so-called Add-Compare-Select unit). Trellis compaction can also be adapted, thanks to a few approximations, when the log-MAP algorithm is implemented, as shown in [53] and [54].10
It is worth mentioning that trellis compaction leads to a trellis equivalent to that of a double binary code [56]. This is one explanation, together with its good performance for medium and low rate codes, of the success of double binary codes in several standards.
D. Increase of Clock Frequency
A direct way to increase the decoding throughput of a turbo decoder is to increase the clock frequency. To do so, the critical path of the turbo decoder should be reduced. A simple analysis shows that the BM units as well as the OCU units can be pipelined as needed. The critical path is in the forward or backward recursion loop. There are not many solutions to reduce this path directly: use a fast adder (at the cost of an increase in area), reduce the number of bits used to code the forward and backward metrics (at the cost of a decrease in performance), and, if the log-MAP algorithm is implemented, delay by one cycle the addition of the correcting offset in order to reduce the critical path (see [38] for more details).
Another architecturally efficient solution for reducing this critical path consists of adding a pipeline register in a recursion unit and then interleaving two (or more) independent SISOs on the same hardware. With the pipeline register, the critical path can almost be halved; thus, the hardware can operate at double the frequency compared to the direct solution. The overhead is low, since only a single additional pipeline stage is introduced in the forward (or backward) metric recursion loop, after the first adder stage of Fig. 17.
V. STOPPING RULES AND BUFFERING
An efficient way to increase the throughput of the decoder is to exploit the randomness of the number of required iterations when the decoder is embedded with some stopping criterion.
Assume, as in Fig. 15, that we have a decoder that is capable of performing $n_{\min} \geq 1$ iterations while receiving a frame. We add in front of it a FIFO buffer of size $(1 + \varepsilon)2N$, where $N$ is the codeword size and $\varepsilon \geq 0$ is a constant that measures the memory overhead. If $\varepsilon = 0$, the decoder has no possibility of changing the number of iterations, as the FIFO memory only stores the following
10
One can note that trellis compaction is a special case of the general
method presented in [55] to break the ACS bottleneck.
frame while decoding the current one. If instead $\varepsilon > 0$, the time available for decoding ranges from $n_{\min}$ to $n_{\max} = (1 + 2\varepsilon) n_{\min}$, depending on the status of the FIFO.
In order to stop iterative decoding, the decoder is embedded with a stopping rule, so that the number of iterations needed for decoding is described by a (memoryless) random variable $x$ with distribution $f(x)$. We represent the status of the FIFO as an integer ranging between $n_{\min}$ and $n_{\max}$, which is the number of available iterations at any given time. The transition probability matrix $P$ of the underlying Markov chain has the following elements:
\[
p_{i \to j} =
\begin{cases}
0, & j - i \geq n_{\min} \\
f^{+}(i - j + n_{\min}), & j = n_{\min},\; i \geq n_{\min} \\
f^{-}(i - j + n_{\min}), & j = n_{\max},\; i > n_{\max} - n_{\min} \\
f(i - j + n_{\min}), & \text{otherwise}
\end{cases}
\qquad \forall i, j = n_{\min}, \ldots, n_{\max} \tag{18}
\]
where we have defined
\[
f^{+}(x) = \sum_{k=x}^{\infty} f(k) \qquad \text{and} \qquad f^{-}(x) = \sum_{k=0}^{x} f(k).
\]
This Markov chain is irreducible and aperiodic. Consequently, a steady-state probability vector, defined as
\[
\mathbf{p}_S = \lim_{n \to \infty} \mathbf{S} P^{n} \qquad \forall \mathbf{S} \tag{19}
\]
exists.
From the vector $\mathbf{p}_S(x)$, which represents the probability of having $x$ available iterations, the frame error probability is then computed as
\[
P_F = \sum_{x = n_{\min}}^{n_{\max}} p_S(x)\, P_F(x) \tag{20}
\]
where $P_F(x)$ is the frame error probability when $x$ iterations are available. The steady-state distribution $p_S$ typically shows a rather sharp transition. As a consequence, when the average number of iterations $n$ is below the minimum number of available iterations $n_{\min}$, the FIFO is typically empty and
\[
P_F \simeq P_F(n_{\max}).
\]
On the other side, if $n > n_{\min}$, the FIFO is typically full and
\[
P_F \simeq P_F(n_{\min}).
\]
The procedure to design a FIFO buffer is then the
following.
1) Fix a stopping criterion, choosing one of those
listed in Section V-A below.
2) Run a single simulation with a large (virtually infinite)
number of iterations. In this simulation collect,
for each desired E_b/N_0, the statistics f(x) of the
number of iterations required and the frame error
probability P_F(x) obtained when x iterations are available.
3) For all desired pairs (n_min, n_max), compute the
matrix P through (18) and, from P, the steady-state
distribution of the FIFO p_S(x) using (19).
From the steady-state distribution, compute the
frame error probability as in (20).
4) Generally, the decoder speed must be such that n_min
is slightly larger than the average number of iterations
required by the chosen stopping rule at the target error
rate. The maximum number of iterations n_max must be set
according to the desired target number of iterations.
The memory overhead is then obtained as

    (1/2) (n_max / n_min - 1).
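The chain of (18)-(20) can be set up numerically. The sketch below (Python; the function name and the illustrative distribution f(k) are ours, not from the paper) builds the transition matrix exactly as in (18) and extracts the steady-state distribution by power iteration:

```python
import numpy as np

def fifo_steady_state(f, n_min, n_max):
    """Steady-state distribution of the number of available iterations.

    f: dict mapping k (iterations needed to decode a frame) -> probability.
    States are x = n_min .. n_max available iterations, as in (18)-(19).
    """
    states = list(range(n_min, n_max + 1))
    idx = {x: t for t, x in enumerate(states)}
    f_plus = lambda x: sum(p for k, p in f.items() if k >= x)   # tail sum f^+
    f_minus = lambda x: sum(p for k, p in f.items() if k <= x)  # head sum f^-
    P = np.zeros((len(states), len(states)))
    for i in states:
        for j in states:
            k = i - j + n_min              # iterations consumed on this frame
            if j == n_min:
                P[idx[i], idx[j]] = f_plus(k)    # FIFO saturates at the bottom
            elif j == n_max:
                P[idx[i], idx[j]] = f_minus(k)   # FIFO saturates at the top
            else:
                P[idx[i], idx[j]] = f.get(k, 0.0)
    # steady state p_S = lim S P^n, approximated by power iteration
    pi = np.ones(len(states)) / len(states)
    for _ in range(2000):
        pi = pi @ P
    return dict(zip(states, pi))
```

The resulting dictionary plays the role of p_S(x) in (20); combining it with measured P_F(x) values gives the overall frame error probability of the buffered decoder.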
Fig. 15. Block diagram of an iterative decoder with input and output FIFO to improve throughput.
Boutillon et al.: Iterative Decoding of Concatenated Convolutional Codes: Implementation Issues
Vol. 95, No. 6, June 2007 | Proceedings of the IEEE 1217
A. List of Stopping Rules
Several stopping rules have been proposed in the lit-
erature, see for example [57]–[62] and references
therein. In the following, we list the most efficient and
simple rules.
1) Hard rule 1: The signs of the LLRs at the input
and at the output of a constituent SISO module are
compared, and the iterative decoder is stopped if
all signs agree. Note that the output of a SISO is
always an extrinsic LLR.
2) Hard rule 2: To improve the reliability of the
stopping rule, the previous check has to be passed
for two successive iterations.
3) Soft rule 1: The minimum absolute value of all the
extrinsic LLRs at the output of a SISO is compared
against a threshold. Increasing the threshold
increases the reliability of the rule but also
increases the average number of iterations.
4) Soft rule 2: The minimum absolute value of all
the total LLRs is compared against a threshold.
Note that the total LLR is the sum of the input and
output LLRs of a SISO module.
The choice of the stopping rule, and possibly of the
corresponding threshold for soft rules, is mainly dictated
by the complexity of its evaluation and by the probability of
false stops, which induce an error floor in the performance;
for more details see [57].
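The two simplest rules above amount to a few comparisons per half-iteration. The following sketch (Python; illustrative only, with floating-point LLR vectors, and function names of our own choosing) shows hard rule 1 and soft rule 1:

```python
def hard_rule_1(llr_in, llr_ext):
    """Hard rule 1: stop when the hard decisions taken on the input LLRs
    and on the extrinsic output LLRs of a SISO all agree."""
    return all((a >= 0) == (b >= 0) for a, b in zip(llr_in, llr_ext))

def soft_rule_1(llr_ext, threshold):
    """Soft rule 1: stop when even the least reliable extrinsic LLR
    exceeds the threshold (larger threshold = more reliable stop,
    but more iterations on average)."""
    return min(abs(x) for x in llr_ext) > threshold
```

Hard rule 2 is obtained by requiring hard_rule_1 to succeed on two consecutive iterations, and soft rule 2 by applying soft_rule_1 to the total (input plus extrinsic) LLRs.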
In Fig. 16, we report the FER performance of a rate-0.35
SCCC with four-state constituent encoders and an
interleaver size of 8640, with a fixed number of iterations,
together with the performance obtained with the structure of
Fig. 15 and the first hard stopping rule.
Solid lines refer to the decoder with a fixed number of
iterations (5, 6, 7, 10), while dashed curves show the
performance obtained using stopping rules and finite
buffering. For these curves, n_max has been kept fixed to 10
and n_min takes the label values 5, 6, 7, and 10. As
anticipated, it can be seen that all the dashed curves start
from the performance of the corresponding n_min and, beyond a
given point, converge to the performance relative to n_max
iterations. The threshold point corresponds to the situation
for which n̄ ≈ n_min. Note that since the average number of
iterations decreases with the SNR, the maximum gap between
the curves is obtained at low values of the SNR.
The pair (10, 6), corresponding to a memory overhead
of 33% and a speed-up of 66%, shows a maximum
penalty of 0.15 dB at FER = 10^-1 that becomes 0.01 dB at
10^-5. The pair (10, 5), corresponding to a memory
overhead of 50% and a speed-up of 100%, shows instead
a maximum penalty of 0.3 dB at FER = 10^-1 that becomes
0.1 dB at 10^-5.
VI. QUANTIZATION ISSUES IN
TURBO DECODERS
The problem of fixed-point implementation is an impor-
tant one since hardware complexity increases linearly with
the internal bit width representation of the data. The
tradeoff can be formulated as follows: what is the
Fig. 16. Comparison of FER performance of an SCCC with four state constituent encoders and interleaver size of 8640 with
fixed number of iterations with those obtained using structure of Fig. 15 and first hard stopping rule.
minimum bit-width internal representation that leads to an
acceptable degradation of performance? It is interesting to
note that the very function of the turbo-decoding process is
the suppression of channel noise. This robustness against
channel noise also implies, as a beneficial side effect,
robustness against the internal quantization noise. Thus,
compared to a classical DSP application, the internal
precision of a turbo decoder can be very low without
significant degradation of the performance.
In this section, we give a brief survey of the
problem of fixed-point implementation in the case of a
binary turbo decoder. First, we discuss the problem of
optimal quantization of the input signal and the resulting
internal precision. Then, the problem of the scaling of the
extrinsic messages is discussed. Finally, we conclude this
section by presenting a not yet published pragmatic method
to optimize the complexity versus performance tradeoff of a
turbo decoder.
A. Internal Precision of a Turbo Decoder
Let us consider a binary turbo code associated with a
BPSK modulation. The received symbol at time l is thus
equal to y_l = x_l + w_l, where x_l equals -1 if c_l = 0 and
+1 otherwise; w_l is white Gaussian noise of variance σ².
The LLR λ(c_l; I) is then equal to

    λ(c_l; I) = 2 y_l / σ².   (21)
The quantization of λ(c_l; I) on b_LLR bits is a key issue
that impacts both the performance and the complexity of
the design. In the following, we assume that the
quantized value λ(c; I)_Q of λ(c; I) on b_LLR bits is given by
λ(c; I)_Q = Q(λ(c; I)), where the quantization function Q
is defined as

    x_Q = Q(x) = sat( floor( x (2^{b_LLR - 1} - 1) / A + 0.5 ),
                      2^{b_LLR - 1} - 1 )   (22)

where sat(a, b) = a if a belongs to [-b, b], and sat(a, b) =
sign(a) b otherwise; A is the dynamic range of the
quantization (data are quantized between [-A, A]). Symmetrical
quantization is needed to avoid giving a systematic
advantage to one bit value over the other; such an imbalance
would decrease the performance of the turbo decoder.
One can note that if A is very large, most of the inputs
will be quantized to zero, i.e., an erasure. In that
case, the decoding process will fail. On the other hand, a
too small value of A leads to saturation most of the
time, and the soft quantization would thus be equivalent to
a hard decision. Clearly, for a given code rate and a given
SNR, there is an optimal value of A. As a rule of thumb, for
a rate-1/2 turbo code, the optimal value of A is around 1.2.
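The quantizer of (22) is a one-liner in software. The sketch below (Python; the function name is ours) follows the formula term by term:

```python
import math

def quantize_llr(x, b_llr, A):
    """Symmetric LLR quantizer of (22): b_llr bits, dynamic range [-A, A]."""
    levels = 2 ** (b_llr - 1) - 1           # e.g. 7 levels for b_llr = 4
    q = math.floor(x * levels / A + 0.5)    # scale to integers, round
    return max(-levels, min(levels, q))     # sat(., levels)
```

With b_LLR = 4 and A = 1.2, inputs beyond the dynamic range saturate at ±7, while inputs much smaller than A collapse to 0, which is the erasure behavior described above for an oversized A.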
Equations (21) and (22) show that the input value λ(c; I)_Q
of the turbo decoder depends on the channel observation
and the value A, as mentioned above, but also on the SNR
of the signal, i.e., on the variance of the noise σ².
The SNR is not available at the receiver and should
be estimated from the statistics of the input signal. In
practical cases, the estimation of the SNR is not
mandatory. When the max-log-MAP algorithm is used,
i.e., when the max* operation is approximated by the simple
max operator, its estimation is not necessary. In fact, the
term 2/σ² is just, in the logarithmic domain, a scale factor
that impacts both input and output of the max unit. When
the log-MAP algorithm is used, a pragmatic solution is to
replace the real standard deviation σ of (21) by the
maximum value σ_o leading to a bit-error rate (BER) or a
frame-error rate (FER) acceptable for the application.
Note that when the effective noise standard deviation is below
σ_o, the decoding process becomes suboptimal due to an
underestimation of the λ(c; I). Nevertheless, the BER (or
the FER) still decreases and thus remains within the
functionality domain of the application.
The number of bits b_ext used to code the extrinsic messages
can be deduced from b_LLR. In many reported publications,
b_ext = b_LLR + 1, i.e., extrinsic messages are quantized in
the interval [-2A, 2A]. Note that if the OCU delivers a
value out of the range [-2^{b_ext - 1} + 1, 2^{b_ext - 1} - 1], a
saturation needs to be performed.
Once b_LLR and b_ext are chosen, the number of bits b_fm needed to
code the forward and backward recursion metrics can be
derived automatically. In the following, we only consider
the case of the forward recursion metrics α. The same results
also hold for the backward recursion unit.
According to (11) and (12), in the logarithmic domain,
the branch metric γ_l is a finite sum of bounded values. Let
us assume that Δ is a bound on the absolute value of the
branch metric γ_l (Δ is a function of the code, b_LLR and b_ext).
Then, it is shown in [63] that, at any time l,^11

    max_s(α_l(s)) - min_s(α_l(s)) ≤ ν Δ   (23)

where the parameter ν is the memory depth of the
convolutional encoder.
Assuming that min_s(α_l(s)) is maintained equal to zero
by dedicated hardware, (23) shows that b_fm =
ceil(log2(ν Δ)) bits are sufficient to code the α metrics.
In practice, this solution is not efficient. In fact,
maintaining min_s(α_l(s)) equal to zero implies additional
hardware in the recursion loop to perform the determination
of min_s(α_l(s)) and to subtract it from all the
metrics. This hardware increases both the complexity and
the critical path of the FR unit.

^11 The exact bound is derived in [38].
A more efficient solution is to replace these systematic
operations by the subtraction of a fixed value
when needed. Let us define β = ceil(log2(Δ(ν + 1))).
Then, coding the forward metrics on b_fm = 1 + β bits leads to a
very simple scheme. In fact, since Δ is the maximum
dynamic of the branch metrics, (23) gives

    max_s(α_{l+1}(s)) - min_{s'}(α_l(s')) ≤ Δ(ν + 1).

This inequality proves that, if at time l, min_s(α_l(s)) is
below 2^β, then at time l + 1, max_s(α_{l+1}(s)) also remains
below 2^{1+β} - 1 = 2^{b_fm} - 1. If at time l + 1, min_s(α_{l+1}(s))
reaches 2^β, then all the forward metrics
range between 2^β and 2^{β+1}. In that case, a rescaling
operation is performed by subtracting 2^β from all the
forward metrics. This situation is detected thanks to an
AND gate connected to the most significant bits (MSBs) of
the metrics, and the subtraction is simply realized by
setting all the MSBs to zero. Thanks to this rescaling
process, the dynamic needed to represent the ever-increasing
forward metrics is very limited. Note that
other efficient methods have been proposed in the
literature, such as more elaborate rescaling techniques or
simply avoiding the rescaling operation altogether using
modulo arithmetic [64].
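The MSB-clearing trick can be mimicked in software. The following is a behavioral sketch (not a hardware description; the helper name is ours): when every metric lies in [2^β, 2^{β+1}), i.e., every MSB is set, clearing that bit subtracts 2^β from all of them at once:

```python
def rescale_metrics(metrics, beta):
    """Behavioral model of the MSB-clearing rescaling: if all forward
    metrics lie in [2**beta, 2**(beta + 1)), clear their common MSB,
    which is equivalent to subtracting 2**beta from each of them."""
    if all(m >> beta == 1 for m in metrics):          # the AND of the MSBs
        return [m & ((1 << beta) - 1) for m in metrics]  # set MSBs to zero
    return metrics
```

Since only relative differences between state metrics matter in the recursion, subtracting the same constant from every metric leaves the decoding decisions unchanged.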
Typical values of b_LLR are between 3 and 6 bits. For an
eight-state turbo code, the corresponding values of b_fm are
around 7 to 10 bits. An example of the impact of b_LLR on
performance is given in Fig. 19. The reader can find in [65]
a deeper analysis of the bit-width precision of the turbo
decoder.
B. Practical Implementation of the max* Algorithm
In Fig. 17, we report a block diagram of the
implementation of the fundamental associative operator
max* according to its definition (6). The look-up table
performs the computation of the correcting factor
given by

    f(x) = ln(1 + e^{-|x|}).   (24)
Fig. 18 shows the plot of the function f(x). The
maximum value of this function is ln(2) and the function
decreases rapidly toward zero. In hardware, all real values
are replaced by integer values thanks to (22). Thus, the
maximum quantized value of f is given by Q(ln(2)).
Moreover, from (22), it is possible to compute the maximum
integer m_Q such that x_Q > m_Q implies f_Q(x_Q) = 0. As
an example, for A = 1.2 and b_LLR = 4, the maximum
quantized value of f_Q is 4 (thus f_Q takes its values
between 0 and 4 and requires 3 bits to be coded) and
m_Q = 14 (see Fig. 18). The hardware realization of the
computation of the offset factor is thus simple. A first test
determines whether |x_Q| is below 15. If the test is
positive, the 5 least significant bits (LSBs) of x_Q are used
as input to a 3-bit-output look-up table that contains the
precomputed values of f_Q(x_Q); otherwise, the offset is set
to zero. Note that it is also possible to compute the
absolute value of x_Q before the access to the LUT in order
to reduce its size by half.
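The LUT contents and the cutoff m_Q follow directly from (22) and (24). The sketch below (Python; function names are ours) builds the table in the quantized domain and uses it in a max* operator; with b_LLR = 4 and A = 1.2 it reproduces the values quoted above (offset at most 4, m_Q = 14):

```python
import math

def build_offset_lut(b_llr, A):
    """Precompute f_Q(x_Q) = Q(ln(1 + exp(-x_Q / scale))) for the max*
    correcting factor, plus the largest input m_Q with a nonzero offset."""
    scale = (2 ** (b_llr - 1) - 1) / A
    lut, m_q, x_q = [], 0, 0
    while True:
        f_q = math.floor(math.log(1 + math.exp(-x_q / scale)) * scale + 0.5)
        if f_q == 0:          # offset has decayed to zero: stop the table
            break
        lut.append(f_q)
        m_q = x_q
        x_q += 1
    return lut, m_q

def max_star(a_q, b_q, lut, m_q):
    """max*(a, b) = max(a, b) + f(|a - b|), with f read from the LUT."""
    d = abs(a_q - b_q)
    return max(a_q, b_q) + (lut[d] if d <= m_q else 0)
```

Dropping the LUT term entirely turns max_star into a plain max, i.e., the max-log-MAP approximation discussed next.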
There are many other implementations of the offset
function in the literature. For example, [66]
proposes a linear approximation of the f(x) function,
while [67] proposes a very coarse, but efficient,
approximation. When the offset factor is omitted, the
hardware is simplified at the cost of a performance
degradation. This point is discussed in the next section.
Note that, when the offset factor is omitted, the algorithm
is referred to in the literature as the min-sum or the
max-log-MAP algorithm.
C. Rescaling of the Extrinsic Messages

Fig. 17. Block diagram of the max* operator.

Fig. 18. Plot of the function f(x). The quantized version of this function
for A = 1.2 and b_LLR = 4 is also shown (red dots).

The use of the max-log-MAP algorithm leads to an
overestimation of the extrinsic messages and thus degrades
the performance of the turbo decoder by approximately
0.5 dB. This penalty can be significantly reduced if the
overestimation of the extrinsic messages is compensated,
on average, by a systematic scaling down of the extrinsic
messages between two consecutive half-iterations. This
scaling is performed by means of a multiplication by a
scaling factor a_i, where the value of a_i depends on the
index of the half-iteration. Typically, during the first
iteration, the value of a_i is low (around 0.5), and it
increases up to one for the last two half-iterations. For a
classical turbo decoder, this technique reduces the
degradation from 0.5 to 0.2 dB. For a double-binary code, it
reduces the degradation from 0.5 to 0.05 dB. Note that the
scaling factor is also beneficial with respect to the
decorrelation of the extrinsic messages: even if the
log-MAP algorithm is used, the scaling factors help the
decoder to converge.
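A possible software model of such a schedule is sketched below (Python; the linear ramp from 0.5 to 1.0 is our own illustrative choice, since the actual a_i values are design parameters tuned per code):

```python
def extrinsic_scale(half_iter, n_half_iters):
    """Illustrative max-log-MAP scaling schedule: start near 0.5 on the
    first half-iteration and reach 1.0 for the last two half-iterations."""
    if half_iter >= n_half_iters - 2:
        return 1.0                                   # last two: no damping
    return 0.5 + 0.5 * half_iter / max(1, n_half_iters - 2)
```

Each extrinsic message produced by a SISO is multiplied by extrinsic_scale(i, 2 * n_iterations) before being passed to the other constituent decoder.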
D. Method of Optimization
As seen previously, many parameters impact both the
performance and the complexity of the turbo decoder: the
number of quantization bits, the maximum number of
iterations, the scaling factors of the extrinsic messages,
the value of the quantization dynamic A, and also the
targeted BER and FER. For example, Fig. 19 shows the BER
obtained for a turbo code of rate 1/2 and size N = 512
decoded with different numbers of iterations and different
values of b_LLR.
All those parameters interact in a nonlinear way (see
the effect of n_it and b_LLR in Fig. 19, for example). The
problem of finding a good performance-complexity tradeoff
is thus a complex task. Moreover, the evaluation of each
configuration generally requires a CPU-intensive Monte
Carlo simulation in order to obtain an accurate estimation of
the BER. In order to avoid such simulations, we propose
an efficient pragmatic method:
1) Define the search space by defining the
range of search for each parameter of the decoder.
Define a complexity model for each parameter.^12
Define also the maximum allowable complexity
of the design.
2) Define the "worst case" configuration by individually
setting each parameter to the value that
degrades performance most.
3) Using this configuration, perform a Monte Carlo
simulation at the SNR of interest. Each time a
received codeword fails to be decoded, store the
codeword (or the information needed to reconstruct it,
i.e., the seeds of the pseudo-random generators) in a
set S. Stop the process when the cardinality of the
set S is high enough (typically around 1000). Note
that this operation can be very CPU consuming,
but it has to be done only once.
4) Perform an optimization in order to find the set
of parameters that minimizes the BER (or the
FER) over the set S, with the constraint that the
overall complexity of the decoder remains below
a given value.
5) Perform a normal Monte Carlo simulation in order
to verify a posteriori the real performance of the
selected parameters. Go back to Step 1) with a different
optimization scenario if needed.
This method is efficient since the test of one
configuration can be a few orders of magnitude faster than
the direct method. For example, for an FER of 10^-4, the
test of a configuration with a classical Monte Carlo
simulation requires, on average, the simulation of 10^6
codewords. In contrast, with the proposed method, testing a
new configuration requires only the decoding of the 10^3
codewords of S. An improvement of the simulation speed by a
factor of 10^3 is then obtained.
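Steps 3) and 4) can be sketched as follows (Python; all function names and the stubbed decoders are hypothetical placeholders, since the actual decoder interface is implementation-specific):

```python
def collect_failure_set(decode_worst, gen_codeword, target_size=1000):
    """Step 3: run the worst-case configuration once and keep the seeds
    of the frames it fails to decode (decode_* return True on success)."""
    failures, seed = [], 0
    while len(failures) < target_size:
        if not decode_worst(gen_codeword(seed)):
            failures.append(seed)   # a seed is enough to rebuild the frame
        seed += 1
    return failures

def score_config(decode_cfg, gen_codeword, failures):
    """Step 4: error rate of a candidate configuration, measured over the
    stored failure set only, instead of a full Monte Carlo run."""
    errors = sum(0 if decode_cfg(gen_codeword(s)) else 1 for s in failures)
    return errors / len(failures)
```

Because every configuration under test is at least as good as the worst case, frames outside S are (almost) always decoded correctly, which is what justifies scoring candidates on S alone.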
VII. EVALUATION OF COMPLEXITY OF
ITERATIVE DECODERS
In order to provide a high-level evaluation of the
complexity of iterative decoders, we will use as
reference the architecture reported in Fig. 20. Constituent
encoders are assumed to be binary, and messages are stored
in the form of LLRs.
Iterative decoders are generally built around two
sets of processors and two memories. The messages
coming from the channel are stored in a buffer, as they
will be accessed several times during the iterative process.
The extrinsic messages are instead stored in a temporary
memory denoted by EXT in the figure.
^12 See [31] and Section VII for an example of modeling of the
hardware complexity of a turbo decoder.
Fig. 19. BER = f(SNR) curves for a 2/3-rate turbo code of length
N = 512 for different b_LLR values and numbers of decoding iterations.
The value of A is equal to 1.39.
The high level algorithm of the decoder can be
summarized with the following steps.
1) Initialize the inner memory to null messages.
2) Apply the first set of constraints A using the EXT
and LLR, write the updated messages in EXT.
3) Apply the second set of constraints B using
the EXT and LLR, write the updated messages
in EXT.
4) Iterate until some stopping criterion is satisfied
or the maximum number of iterations is
reached.
The applications of constraints A and B are here executed
serially. Different schedulings can be applied, especially for
LDPC codes, but this is the common approach.
As we have seen in Section IV, PCCC, as well as SCCC
and LDPC, admit efficient highly parallel structures, so
that the throughput can be arbitrarily increased by
increasing the number of parallel processors without
requiring additional memory. The tradeoff between area
and throughput of the decoder is then fully under the
designer's control. In this section, we will focus on C,
defined as the number of elementary operations required for
decoding one information bit per iteration, as a function of
the main design parameters of the code.
The throughput T of the implemented decoder can be
well approximated by

    T = f C_dep / (N_it C)

where C_dep is the number of deployed operators running
at frequency f, and N_it is the number of required
iterations.
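The throughput formula above is trivial to evaluate; the sketch below (Python, with illustrative parameter values of our own) makes the units explicit:

```python
def throughput_bps(f_hz, c_deployed, n_iters, c_per_bit):
    """T = f * C_dep / (N_it * C): decoded information bits per second,
    given the operator clock f, the number of deployed operators C_dep,
    the iteration count N_it, and C operations per bit per iteration."""
    return f_hz * c_deployed / (n_iters * c_per_bit)
```

For example, 64 operators at 200 MHz, 8 iterations and C = 100 operations per bit per iteration sustain 16 Mb/s, showing how throughput scales linearly with the deployed parallelism.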
For iterative decoders with messages in the form of
LLRs, the computational complexity can be expressed in
terms of the number of sums and max* operations. Note
that max is substituted for max* when the max-sum version
of the SISO processors is used instead of the max*-sum
version.
A. LDPC
For LDPC codes, each variable node processor of degree d_v
requires 2 d_v sums to compute the updated messages. Thus,
summing over all nodes, the variable node
processing requires two sums per edge.
For check nodes, the number of operations needed to
update the EXT messages depends linearly on the check
degree d_c, with a factor that depends on the approximation
used. In [68], a comprehensive summary of the
available optimal and approximated techniques for check
node updates is presented, together with their corresponding
complexity and performance tradeoffs. Here, we
assume the optimal and numerically stable algorithm of (6)
and (7) in [68], which requires 6(d_c - 2) max* operators for
a check node of degree d_c. The reader is warned, however,
that the factor 6 can be reduced by using other suboptimal
techniques.
Summing over the sets of variable and check nodes,
we get the following complexity:

    C = 2 Γ / R                 sums
        6 ((Γ - 2) / R + 2)     max*   (25)

per decoded bit and per iteration. In (25), we introduced
the fundamental parameter Γ, the average variable node
degree, which measures the density of the LDPC parity-check
matrix; R is the rate of the code.
In Table 1, we show the normalized complexity C
required for some values of the two relevant parameters
Γ and R.
Note that LDPC decoders have a complexity that is
inversely proportional to the rate of the code and
proportional to the parameter Γ. The parameter Γ also has
an impact on the performance of the code and takes
typical values in the range 3-5.
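Equation (25) can be evaluated directly. The sketch below (Python; the function name is ours, and Γ is written gamma) returns the per-bit, per-iteration operation counts:

```python
def ldpc_complexity(gamma, rate):
    """Operations per information bit per iteration for an LDPC decoder,
    as in (25): gamma = average variable node degree, rate = code rate.
    Assumes the 6*(d_c - 2) max* check-node update of [68]."""
    sums = 2 * gamma / rate
    max_star_ops = 6 * ((gamma - 2) / rate + 2)
    return sums, max_star_ops
```

For Γ = 3 and R = 1/2 this gives 12 sums and 24 max* operations per bit per iteration, and both counts grow as the code rate drops, consistent with the 1/R dependence noted above.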
B. PCCC and SCCC

Fig. 20. General architecture of an iterative decoder. Shaded blocks
are processors; white blocks are memories.

The complexity of PCCC and SCCC is strictly related
to the complexity of their constituent SISO decoders.
Here, we consider rate-k/n binary constituent encoders.
In Fig. 21, we report a block diagram of the SISO module
showing its basic architecture according to the algorithm
described in Section III. In the figure, we have reported
the sections of the architecture corresponding to the four
operations of the algorithm described in
Section III-D. Light blocks are computation blocks, while
the dark block refers to memory, which can be organized as
LIFO or FIFO RAM. In this SISO structure, we have
assumed that the initializations of the forward and backward
state metrics are propagated across iterations, so that no
overhead is required for this operation.
We will consider two versions of SISO. The first
(inner SISO), which is used in PCCC and as the inner
SISO for SCCC, gets messages on information and coded
bits and provides messages on input bits. The second
(outer SISO), which is used as outer SISO in SCCC, gets
messages only on coded bits and provides updated
messages on both information (user’s data) and coded
bits. In Fig. 21, we can identify the following units.
1) The Branch Metric Computer (BMC) is responsible
for computing the LR or LLR to be associated
with each trellis edge according to (11)
and (12). The number of sums required is
2^n - n - 1 for the outer SISO and 2^n - n - 1 + k for
the inner SISO.
2) The Forward Recursion (FR) computer is
responsible for computing the forward recursion
metrics A according to (13). Each state metric
update requires 2^k - 1 max* operations between the
incoming edges. The metrics of the 2^k N_s edges are
obtained by summing the previously computed
branch metrics with the path metrics
(2^k N_s sums).
3) The Backward Recursion (BR) computer, identical
to the forward one, is responsible for
computing the backward recursion metrics B
according to (14). As the recursion proceeds in
the backward direction, the input branch metrics
must be reversed in time, which is the reason for
the LIFO in front of it.
4) The edge LRs computed according to (3) require
the forward, backward, and branch metrics. As the
backward metrics are provided in reversed order,
a LIFO may also be inserted on the line coming
from the FR. Note that the edge LRs are then
produced in reversed time order. For both FR
and BR, as well as for the inner and outer SISO,
the complexity is 2^k N_s sums and
(2^k - 1) N_s max* operations.
5) The Output Computation Unit (OCU) computes
the a posteriori LRs by applying (15) and/or (16),
depending on the needs. The input LRs are used to
compute the extrinsic information. An efficient
algorithm to compute the updated messages requires
2^k N_s + 2n + k sums and 2^k N_s + 2^{n+1} - 2n - 4
max* operations. As the edge LRs are provided in reversed
time order, the computed LRs are also reversed in
order, so that a LIFO may be necessary to
provide the updated LRs in the same order as the
inputs. The correct ordering of the messages,
however, can also be handled implicitly when
storing them in the RAM.

Table 1. Complexity of the LDPC Decoder in Number of Elementary
Operations per Information Bit and per Iteration.
In Table 2, we give a summary of the operations
required for the inner and outer SISO, while in Table 3 we
report the numerical values for some typical values of k, n,
and N_s.
Having determined the complexities per information
bit of the inner and outer SISO decoders, C_I and C_O,
the complexities of the PCCC and SCCC can be evaluated as^13

    C_SCCC = C_O + (1 / r_o) C_I
    C_PCCC = 2 C_I

where r_o is the rate of the outer encoder of the SCCC.

^13 We assume here that the two constituent encoders are identical;
the generalization to different encoders is straightforward.
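The two totals above combine in one line each. The sketch below (Python; the function name is ours) follows the identical-constituent assumption of the text:

```python
def concat_complexity(c_inner, c_outer=None, r_outer=None, scheme="PCCC"):
    """Operations per information bit per iteration for the concatenation:
    C_PCCC = 2 * C_I, and C_SCCC = C_O + C_I / r_o, where r_o is the rate
    of the outer encoder (identical constituent encoders assumed)."""
    if scheme == "PCCC":
        return 2 * c_inner
    return c_outer + c_inner / r_outer
```

The 1/r_o factor reflects that the inner SISO of an SCCC works on the outer coded bits, of which there are K/r_o per K information bits.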
Fig. 21. General architecture of the SISO module. The module uses four
logic units (FR, BR, OCU, and BMC). The sections of the block diagram
refer to the four main steps of the algorithm described in Section III.

Table 2. Summary of Complexity per Trellis Step for the SISO Algorithm
on Binary LLRs.
C. Memory Requirements
The memory requirement of an iterative decoder is
the sum of the memory required for the storage of the channel
messages, which is N for all types of decoders, and the
memory for the storage of the extrinsic messages, which
depends on the encoding scheme.
For LDPC, the number of extrinsic messages is given by
NΓ, for SCCC by (K/r_o)n, and for PCCC by K.
Memory requirements are usually not negligible and,
in some important cases, e.g., for low-throughput
implementations and/or large block sizes, they can dominate the
computational requirements considered in the previous
sections.
D. Nonbinary Decoders
Slightly different conclusions are obtained when using
nonbinary constituent decoders. The main consequence
of this approach is that messages are no longer scalars but
vectors of dimension equal to the cardinality of the
alphabet used minus one. The dimension of the message memory
must then be increased accordingly.
The BMC and OCU operators are slightly changed,
while FR and BR remain the same.
Nonbinary decoders also yield different performance.
An important application of nonbinary decoding is the
DVB-RCS/RCT and WiMAX eight-state double-binary turbo
code and its extension to 16 states [69].
VIII. CONCLUSION
We have presented an overview of implementation issues
for the design of iterative decoders in the context of the
concatenation of convolutional codes. Hardware archi-
tectures were described and their complexity was as-
sessed. Low throughput (below 10 Mb/s) turbo decoders
have already been widely implemented, either in software
or hardware, and are commercially available. In this
paper, we have laid stress on the different methods al-
lowing the throughput of a turbo decoder to be increased.
We have particularly investigated parallel architectures
and stopping criteria. As for architecture optimization, a
joint design of the concatenated code and the architec-
ture was favored, especially concerning interleaver de-
sign. Such an approach has already been introduced in
standards such as DVB-RCS, DVB-RCT, and WiMAX.
Table 3. Complexity per Information Bit of the SISO Algorithm for Some
Typical Values of k, n, and N_s.

Among the main challenges in the years to come,
low-energy-consumption receiver design will represent a
crucial one. Significant progress in this field will probably
require a real technological breakthrough. Some answers
to this problem are currently emerging, such as the
analog decoding concept [70], [71], which allows the
iterative process to be removed, the SISO decoders being
directly wired together in order to implement the feed-
back connections. h
REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima,
BNear Shannon limit error-correcting
coding and decoding: Turbo-codes,[ in
Proc. ICC, Geneva, Switzerland, May 1993,
pp. 1064–1070.
[2] Comatlas, CAS5093: Turbo Encoder/Decoder,
Nov. 1993, datasheet.
[3] S. Benedetto and G. Montorsi, BIterative
decoding of serially concatenated
convolutional codes,[ Electron. Lett.,
vol. 32, no. 13, pp. 1186–1187, Jun. 1996.
[4] S. Benedetto, D. Divsalar, G. Montorsi, and
F. Pollara, BSerial concatenation of
interleaved codes: Performance analysis,
design, and iterative decoding,[ IEEE Trans.
Inform. Theory, vol. 44, no. 5, pp. 909–926,
May 1998.
[5] S. Benedetto and G. Montorsi, BSerial
concatenation of block and convolutional
codes,[ Electron. Lett., vol. 32, no. 10,
pp. 887–888, May 1996.
[6] R. M. Pyndiah, BNear-optimum decoding
of product codes: Block turbo codes,[
IEEE Trans. Commun., vol. 46, no. 8,
pp. 1003–1010, Aug. 1998.
[7] CCSDS, Recommendation for Space Data
System Standards. TM Synchronization
and Channel Coding, Sep. 2003, 131.0-B-1,
Blue Book.
[8] Third generation partnership project (3GPP)
Technical Specification Group, Multiplexing
and Channel Coding (FDD), Jun. 1999,
TS 25.212, v2.0.0.
[9] DVB, Interaction Channel for Satellite
Distribution Systems, ETSI EN 301 790,
2000, v. 1.2.2.
[10] DVB, Interaction Channel for Digital Terrestrial
Television, ETSI EN 301 958, 2001, v. 1.1.1.
[11] Third generation partnership project 2
(3GPP2), Physical Layer Standard for
cdma2000 spread spectrum systems, Release D,
Feb. 2004, 3GPP2 C.S0002-D, ver. 1.0.
[12] IEEE, IEEE Standard for Local and Metropolitan
Area Networks. Part 16: Air Interface for
Fixed Broadband Wireless Access Systems,
IEEE 802.16-2004, Nov. 2004.
[13] A. Franchi and J. Sengupta, BTechnology
trends and market drivers for broadband
mobile via satellite: Inmarsat BGAN,[ in
Proc. DSP 2001, 7th Int. Workshop Digital Signal
Processing Techniques Space Communications,
Sesimbra, Portugal, Oct. 2001.
[14] S. Benedetto, R. Garello, G. Montorsi,
C. Berrou, C. Douillard, A. Ginesi, L. Giugno,
and M. Luise, BMHOMS: High-speed
ACM for satellite applications,[ IEEE Trans.
Wireless Commun., vol. 12, no. 2, pp. 66–77,
Apr. 2005.
[15] P. Elias, BError-free coding,[ IRE Trans.
Inform. Theory, vol. 4, pp. 29–37, Sep. 1954.
[16] G. D. Forney, Jr., Concatenated Codes.
Cambridge, MA: MIT Press, 1966.
[17] G. Battail, BWeighting the symbols decoded
by the Viterbi algorithm,[ (in French),
Ann. Telecommun., vol. 42, pp. 31–38,
Jan.–Feb. 1987.
[18] J. Hagenauer and P. Hoeher, BA Viterbi
algorithm with soft-decision outputs and its
applications," in Proc. IEEE GLOBECOM'89,
Dallas, TX, Nov. 1989, pp. 47.1.1–47.1.7.
[19] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv,
"Optimal decoding of linear codes for
minimizing symbol error rate," IEEE Trans.
Inform. Theory, vol. 20, no. 2, pp. 284–287,
Mar. 1974.
[20] C. Berrou and A. Glavieux, "Reflections
on the prize paper: 'Near optimum
error-correcting coding and decoding:
Turbo codes'," IEEE IT Soc. Newslett.,
vol. 48, no. 2, Jun. 1998.
[21] S. Dolinar and M. Belongie, "Enhanced
decoding for the Galileo low-gain antenna
mission: Viterbi redecoding with four
decoding stages," in JPL TDA Progr. Rep.,
vol. 42-121, pp. 96–109, May 1995.
[22] C. Berrou and A. Glavieux, "Near optimum
error correcting coding and decoding:
Turbo-codes," IEEE Trans. Commun.,
vol. 44, no. 10, pp. 1261–1271, Oct. 1996.
[23] S. Benedetto and G. Montorsi, "Unveiling
turbo codes: Some results on parallel
concatenated coding schemes," IEEE Trans.
Inform. Theory, vol. 42, no. 2, pp. 409–428,
Mar. 1996.
[24] S. Benedetto and G. Montorsi, "Design
of parallel concatenated convolutional
codes," IEEE Trans. Commun., vol. 44,
no. 5, pp. 591–600, May 1996.
[25] S. Benedetto, D. Divsalar, G. Montorsi, and
F. Pollara, "Serial concatenation of
interleaved codes: Performance analysis,
design, and iterative decoding," IEEE Trans.
Inform. Theory, vol. 44, no. 3, pp. 909–926,
May 1998.
[26] S. ten Brink, "Convergence behavior of
iteratively decoded parallel concatenated
codes," IEEE Trans. Commun., vol. 49, no. 10,
pp. 1727–1737, Oct. 2001.
[27] C. Weiss, C. Bettstetter, and S. Riedel,
"Code construction and decoding of parallel
concatenated tail-biting codes," IEEE Trans.
Inform. Theory, vol. 47, no. 1, pp. 366–386,
Jan. 2001.
[28] M. Jézéquel, C. Berrou, C. Douillard, and
P. Pénard, "Characteristics of a sixteen-state
turbo-encoder/decoder (turbo4)," in Proc. Int.
Symp. Turbo Codes, Brest, France, Sep. 1997,
pp. 280–283.
[29] S. Benedetto and G. Montorsi, "Performance
of continuous and blockwise decoded turbo
codes," IEEE Commun. Lett., vol. 1, no. 3,
pp. 77–79, May 1997.
[30] E. K. Hall and S. G. Wilson, "Stream-oriented
turbo codes," IEEE Trans. Inform. Theory,
vol. 47, no. 7, pp. 1813–1831, Jul. 2001.
[31] G. Masera, G. Piccinini, M. Roch, and
M. Zamboni, "VLSI architectures for turbo
codes," IEEE Trans. VLSI Syst., vol. 7, no. 5,
pp. 369–379, Sep. 1999.
[32] C.-M. Wu, M.-D. Shieh, C.-H. Wu,
Y.-T. Hwang, and J.-H. Chen, "VLSI
architectural design tradeoffs for
sliding-window log-MAP decoders,"
IEEE Trans. VLSI Syst., vol. 13, no. 2,
pp. 439–447, Apr. 2005.
[33] R. McEliece, "On the BCJR trellis for linear
block codes," IEEE Trans. Inform. Theory,
vol. 42, no. 4, pp. 1072–1092, Jul. 1996.
[34] C. Hartmann and L. Rudolph, "An optimum
symbol-by-symbol decoding rule for linear
codes," IEEE Trans. Inform. Theory, vol. 22,
no. 5, pp. 514–517, Sep. 1976.
[35] S. Riedel, "MAP decoding of convolutional
codes using reciprocal dual codes," IEEE
Trans. Inform. Theory, vol. 44, no. 3,
pp. 1176–1187, Mar. 1998.
[36] S. Riedel, "Symbol-by-symbol MAP decoding
algorithm for high-rate convolutional codes
that use reciprocal dual codes," IEEE J.
Select. Areas Commun., vol. 16, no. 1,
pp. 175–185, Jan. 1998.
[37] G. Montorsi and S. Benedetto, "An additive
version of the SISO algorithm for the dual
code," in Proc. IEEE Int. Symp. Information
Theory, 2001, p. 27.
[38] E. Boutillon, W. J. Gross, and P. G. Gulak,
"VLSI architectures for the MAP algorithm,"
IEEE Trans. Commun., vol. 51, no. 2,
pp. 175–185, Feb. 2003.
[39] D. Gnaedig, E. Boutillon, J. Tousch, and
M. Jézéquel, "Towards an optimal parallel
decoding of turbo codes," in Proc. 4th Int.
Symp. Turbo Codes Related Topics, Munich,
Germany, Apr. 2006.
[40] T. Blankenship, B. Classon, and V. Desai,
"High-throughput turbo decoding techniques
for 4G," in Proc. Int. Conf. 3G Wireless and
Beyond, San Francisco, CA, Jun. 2005,
pp. 137–142.
[41] A. Giulietti, L. van der Perre, and A. Strum,
"Parallel turbo coding interleavers: Avoiding
collisions in accesses to storage elements,"
Electron. Lett., vol. 38, no. 5, pp. 232–234,
Feb. 2002.
[42] M. J. Thul, N. Wehn, and L. P. Rao,
"Enabling high speed turbo-decoding
through concurrent interleaving," in
Proc. Int. Symp. Circuits and Systems
(ISCAS'02), Phoenix, AZ, May 2002,
pp. 897–900.
[43] M. J. Thul, F. Gilbert, and N. Wehn,
"Concurrent interleaving architecture
for high-throughput channel coding," in
Proc. ICASSP'03, Apr. 2003, pp. 613–616.
[44] A. Tarable, S. Benedetto, and G. Montorsi,
"Mapping interleaving laws to parallel
turbo and LDPC decoder architectures,"
IEEE Trans. Inform. Theory, vol. 50, no. 9,
Sep. 2004.
[45] V. Benes, "Optimal rearrangeable multistage
connecting networks," Bell Syst. Tech. J.,
vol. 43, pp. 1641–1656, 1964.
[46] C. Berrou, S. Vaton, M. Jézéquel, and
C. Douillard, "Computing the minimum
distances of linear codes by the error
impulse method," in Proc. IEEE
GLOBECOM'02, Taipei, Taiwan,
Nov. 2002, pp. 1017–1020.
[47] S. Crozier, P. Guinand, and A. Hunt,
"Estimating the minimum distance
of turbo-codes using double and triple
impulse methods," IEEE Commun. Lett.,
vol. 9, no. 6, pp. 631–633, Jun. 2005.
[48] C. Berrou, Y. Saouter, C. Douillard,
S. Kérouedan, and M. Jézéquel,
"Designing good permutations for
turbo codes: Towards a single model," in
Proc. ICC'04, Paris, France, Jun. 2004,
pp. 341–345.
[49] S. Crozier and P. Guinand,
"High-performance low-memory
interleaver banks for turbo-codes," in
Proc. VTC2001, Rhodes, Greece, Oct. 2001,
pp. 2394–2398.
Boutillon et al.: Iterative Decoding of Concatenated Convolutional Codes: Implementation Issues
1226 Proceedings of the IEEE | Vol. 95, No. 6, June 2007
[50] D. Gnaedig, E. Boutillon, V. C. Gaudet,
M. Jézéquel, and P. G. Gulak, "On multiple
slice turbo codes," in Proc. 3rd Int. Symp.
Turbo Codes and Related Topics, Brest, France,
Sep. 2003, pp. 153–157.
[51] O. Muller, A. Baghdadi, and M. Jézéquel,
"Exploring parallel processing levels for
convolutional turbo decoding," in Proc.
2nd ICTTA Conf., Damascus, Syria,
Apr. 2006.
[52] P. Black and T. Meng, "A 140-Mb/s, 32-state,
radix-4 Viterbi decoder," IEEE J. Solid-State
Circuits, vol. 27, no. 12, pp. 1877–1885,
Dec. 1992.
[53] T. Miyauchi, K. Yamamoto, T. Yokokawa,
M. Kan, Y. Mizutani, and M. Hattori,
"High-performance programmable
SISO decoder VLSI implementation for
decoding turbo codes," in Proc. IEEE Global
Telecommunications Conf., GLOBECOM '01,
San Antonio, TX, Nov. 2001, pp. 305–309.
[54] M. Bickerstaff, L. Davis, C. Thomas,
D. Garrett, and C. Nicol, "A 24 Mb/s radix-4
logMAP turbo decoder for 3GPP-HSDPA
mobile wireless," in IEEE Solid-State Circuits
Conf. Dig. Tech. Papers, San Francisco, CA,
Feb. 2003.
[55] G. Fettweis and H. Meyr, "Parallel Viterbi
algorithm implementation: Breaking the
ACS-bottleneck," IEEE Trans. Commun.,
vol. 37, no. 8, pp. 785–790, Aug. 1989.
[56] C. Berrou and M. Jézéquel, "Non-binary
convolutional codes for turbo coding,"
Electron. Lett., vol. 35, no. 1, pp. 39–40,
Jan. 1999.
[57] A. Matache, S. Dolinar, and F. Pollara,
"Stopping rules for turbo decoders," in
JPL TMO Progress Rep., vol. 42-142, pp. 1–22,
Aug. 2000.
[58] R. Y. Shao, S. Lin, and M. Fossorier,
"Two simple stopping criteria for
turbo decoding," IEEE Trans. Commun.,
vol. 47, pp. 1117–1120, 1999.
[59] A. Shibutani, H. Suda, and F. Adachi,
"Reducing average number of turbo
decoding iterations," Electron. Lett.,
vol. 35, pp. 701–702, 1999.
[60] A. Shibutani, H. Suda, and F. Adachi,
"Complexity reduction of turbo
decoding," in Proc. Vehicular Technology
Conf. VTC'1999, Ottawa, ON, Canada,
1999, vol. 3, pp. 1570–1574.
[61] K. Gracie, S. Crozier, and A. Hunt,
"Performance of a low-complexity turbo
decoder with a simple early stopping
criterion implemented on a SHARC
processor," in Proc. 6th Int. Mobile
Satellite Conf. IMSC 99, Ottawa, ON,
Canada, 1999, pp. 281–286.
[62] B. Kim and H. S. Lee, "Reduction of the
number of iterations in turbo decoding
using extrinsic information," in Proc.
IEEE TENCON 99, Inchon, South Korea,
1999, pp. 494–497.
[63] J. Cain, "CMOS VLSI implementation of
r = 1/2, k = 7 decoder," in Proc. IEEE Nat.
Aerospace Electron. Conf., NAECON'84,
1984, pp. 20–27.
[64] A. Hekstra, "An alternative to metric
rescaling in Viterbi decoders," IEEE Trans.
Commun., vol. 37, no. 11, pp. 1220–1222,
Nov. 1989.
[65] S. Pietrobon, "Implementation and
performance of a turbo/MAP decoder,"
Int. J. Satellite Commun., vol. 16, pp. 23–46,
Jan.–Feb. 1998.
[66] J.-F. Cheng and T. Ottosson, "Linearly
approximated log-MAP algorithm for
turbo decoding," in Proc. Vehicular
Technol. Conf. VTC'2000, Tokyo, Japan,
2000, pp. 2252–2256.
[67] W. J. Gross and P. G. Gulak, "Simplified
MAP algorithm suitable for implementation
of turbo decoder," Electron. Lett., vol. 34,
no. 16, pp. 1577–1578, Aug. 1998.
[68] J. Chen, A. Dholakia, E. Eleftheriou,
M. Fossorier, and X.-Y. Hu,
"Reduced-complexity decoding of
LDPC codes," IEEE Trans. Commun.,
vol. 53, no. 8, pp. 1288–1299,
Aug. 2005.
[69] C. Douillard and C. Berrou, "Turbo
codes with rate-m/(m+1) constituent
convolutional codes," IEEE Trans.
Commun., vol. 53, no. 10, pp. 1630–1638,
Oct. 2005.
[70] H.-A. Loeliger, F. Tarköy, F. Lustenberger,
and M. Helfenstein, "Decoding in analog
VLSI," IEEE Commun. Mag., vol. 37,
pp. 99–101, Apr. 1999.
[71] J. Hagenauer, "Decoding of binary codes
with analog networks," in Proc. 1998
Information Theory Workshop, San Diego,
CA, Feb. 1998, pp. 13–14.
ABOUT THE AUTHORS
Emmanuel Boutillon was born in Chatou, France,
on November 2, 1966. He received the Engineering
Diploma and Ph.D. degree from the École Nationale
Supérieure des Télécommunications (ENST),
Paris, France, in 1990 and 1995, respectively.
In 1991, he worked as an Assistant Professor in
the École Multinationale Supérieure des Télécommunications,
Dakar, Senegal. In 1992, he joined ENST
as a Research Engineer, where he conducted
research in the field of VLSI for digital communications.
In 1998, he spent a sabbatical year at the University of Toronto,
ON, Canada. Since 2000, he has been a Professor at the University of
South Brittany, Lorient, France. His current research interests include the
interactions between algorithm and architecture in the field of wireless
communications. In particular, he works on turbo codes and LDPC
decoders.
Catherine Douillard was born in Fontenay-le-Comte,
France, on July 13, 1965. She received the
engineering degree in telecommunications from
the École Nationale Supérieure des Télécommunications
(ENST) de Bretagne, Brest, France, in
1988, and the Ph.D. degree in electrical engineering
from the Université de Bretagne Occidentale,
Brest, in 1992.
In 1991, she joined ENST Bretagne, where she is
currently a Professor in the Electronics Department.
Her main interests are turbo codes and iterative decoding, iterative
detection, and the efficient combination of high spectral efficiency
modulation and turbo coding schemes.
Guido Montorsi was born in Turin, Italy, on
January 1, 1965. He received the Laurea in
Ingegneria Elettronica in 1990 from Politecnico
di Torino, Turin, Italy, with a master's thesis
concerning the study and design of coding
schemes for HDTV, developed at the RAI Research
Center, Turin. He received the Ph.D. degree in
telecommunications from the Dipartimento di
Elettronica of Politecnico di Torino, in 1994.
In 1992, he spent the year as a Visiting Scholar
in the Department of Electrical Engineering, Rensselaer Polytechnic
Institute, Troy, NY. Since December 1997, he has been an Assistant
Professor at the Politecnico di Torino. His current interests are in the area
of channel coding, particularly on the analysis and design of concatenated
coding schemes and study of iterative decoding strategies.