INVITED PAPER
Iterative Decoding of
Concatenated Convolutional
Codes: Implementation Issues
The speed of decoding can be increased by raising the decoder clock frequency,
increasing the use of parallel hardware, and judiciously limiting the
number of decoding iterations.
By Emmanuel Boutillon, Catherine Douillard, and Guido Montorsi
ABSTRACT | This tutorial paper gives an overview of the implementation aspects related to turbo decoders, where the term turbo generally refers to iterative decoders intended for parallel concatenated convolutional codes as well as for serial concatenated convolutional codes. We start by considering the general structure of iterative decoders and the main features of the soft-input soft-output algorithm that forms the heart of iterative decoders. Then, we show that very efficient parallel architectures are available for all types of turbo decoders, allowing high-speed implementations. Other implementation aspects, like quantization issues and stopping rules used in conjunction with buffering for increasing throughput, are considered. Finally, we perform an evaluation of the complexities of the turbo decoders as a function of the main parameters of the code.
KEYWORDS | Concatenated convolutional codes; hardware complexity; iterative decoders; parallel architectures; quantization; stopping rules
I. INTRODUCTION
In 1993, at a time when few people believed in the practicality of capacity-approaching codes, the presentation of turbo codes [1] was a revival for the channel coding research community. Furthermore, the performance claimed in this seminal paper was soon confirmed with a practical hardware implementation [2].
Historical turbo codes, also sometimes called parallel
concatenated convolutional codes (PCCC), are based on a
parallel concatenation of two recursive systematic con-
volutional codes separated by an interleaver. They are
called turbo in reference to the analogy of their decoding
principle with the turbo principle of a turbo-compressed
engine, which reuses the exhaust gas in order to improve
efficiency. The turbo decoding principle calls for an
iterative algorithm involving two component decoders
exchanging information in order to improve the error
correction performance with the decoding iterations.
This iterative decoding principle was soon applied to
other concatenations of codes separated by interleavers,
such as serial concatenated convolutional codes (SCCC)
[3], [4], sometimes called serial turbo codes, or concatena-
tion of block codes, also named block turbo codes [5], [6].
The near-capacity performance of turbo codes and their
suitability for practical implementation explain their
adoption in various communication standards as early as
the late 1990s. Firstly, they were chosen in the telemetry
coding standard by the Consultative Committee for Space
Data Systems (CCSDS) [7] and for the medium to high
data rate transmissions in the third generation mobile
communication 3GPP/UMTS standard [8]. They have
further been adopted as part of the digital video broadcast–
return channel satellite and terrestrial (DVB–RCS and
DVB–RCT) links [9], [10], thus enabling broadband
interactive satellite and terrestrial services. More recently,
they were also selected for the next generation of 3GPP2/
cdma2000 wireless communication systems [11] as well as
Manuscript received June 20, 2006; revised February 3, 2007. This work was supported by the E.U. under the Network of Excellence in Wireless Communications (NEWCOM), Project 507325.
E. Boutillon is with LESTER, CNRS, Université de Bretagne Sud, 56321 Lorient Cedex, France (e-mail: emmanuel.boutillon@univ-ubs.fr).
C. Douillard is with the Electronics Department, CNRS, GET-ENST Bretagne, Technopôle Brest-Iroise, 29238 Brest Cedex 3, France (e-mail: catherine.douillard@enst-bretagne.fr).
G. Montorsi is with the Dipartimento di Elettronica, Politecnico di Torino, 10129 Torino, Italy (e-mail: guido.montorsi@polito.it).
Digital Object Identifier: 10.1109/JPROC.2007.895202
Proceedings of the IEEE, Vol. 95, No. 6, June 2007. 0018-9219/$25.00 © 2007 IEEE
for the IEEE 802.16 standard (WiMAX) [12] intended
for broadband connections over long distances. Turbo
codes are used in several Inmarsat’s communication
systems, too, such as in the new Broadband Global Area
Network (BGAN [13]) that entered service in 2006. A
serial turbo code was also adopted in 2003 by the
European Space Agency for the implementation of a very
high speed (1 Gb/s) adaptive coded modulation modem for
satellite applications [14].
This paper deals with the implementation issues of
iterative decoders for concatenated convolutional codes.
Both parallel and serial concatenated convolutional codes
are addressed, and the corresponding iterative decoders are equally referred to as turbo decoders. Since the most recent applications increasingly demand high data throughput, special attention is paid to the design of high-speed hardware decoder architectures, an important issue for industry. The
remainder of the paper is divided into six parts. A survey of
the general structure of turbo decoders is presented in
Section II. Section III reviews the soft-input soft-output
(SISO) algorithms that are used as fundamental building
blocks of iterative decoders. Section IV deals with the issue
of architectures dedicated to high throughput services.
Particular stress is laid on the increase of the parallelism in
the decoder architecture. Section V presents a review of
the different stopping rules that can be applied in order to
decrease the average number of iterations, thus also
increasing the average decoding speed of the decoder. The
fixed-point implementation of decoders, which is desirable
for hardware implementations, is dealt with in Section VI.
Finally, complexity issues, related to implementation
choices, are discussed in Section VII.
II. CONCATENATION OF
CONVOLUTIONAL CODES AND
ITERATIVE DECODING
A. Short History of Concatenated Coding
and Decoding
Code concatenation is a multilevel coding method
allowing codes with good asymptotic as well as practical
properties to be constructed. The idea dates back to Elias’s
product code construction in the mid 1950s [15]. A major advance was made by Forney in his thesis work on concatenated codes ten years later [16]. As stated by
Forney, concatenation is a method of building long codes
out of shorter ones in order to resolve the problem of
decoding complexity by breaking the required computa-
tion into manageable segments according to the divide and
conquer strategy. The principle of concatenation is
applicable to any type of codes, convolutional or block
codes. This first version of code concatenation is now
called serial concatenation (SC) of codes. Before the
invention of turbo codes, the most famous example of
concatenated codes was the concatenation of an outer
algebraic code, such as a Reed–Solomon code, with an
inner convolutional code, which has been used in
numerous applications, ranging from space communica-
tions to digital broadcasting of television.
As far as the SC of convolutional codes is concerned,
the straightforward way of decoding involves the use of
two conventional Viterbi algorithm (VA) decoders in a
concatenated way. The first drawback that prevents such a
receiver from performing efficiently is that the inner VA
produces bursts of errors at its output, that the outer VA
has difficulty correcting. This can be circumvented by
inserting an interleaver between the inner and outer VAs,
as illustrated in Fig. 1.
The second main drawback is that the inner VA
provides hard decisions, thus preventing the outer VA
from using its ability to accept soft samples at its input. In
order to work properly, the inner decoder has to provide
the outer decoder with soft information. Among the
different attempts to achieve this goal, the soft-output
Viterbi algorithm (SOVA) approach calls for a modifica-
tion of the VA in order to deliver a reliability value for
each decoded bit [17], [18]. Like the VA, the SOVA
provides a maximum likelihood trellis¹ decoding of the
convolutional code which minimizes the probability of a
sequence error. However, it is suboptimal with respect to
bit or symbol error probability. The minimal bit or symbol
error probability can be achieved with symbol-by-symbol
Fig. 1. Transmission scheme with insertion of interleaver into a serial concatenation of convolutional codes.
¹ The trellis diagram is a temporal representation of the code. It represents all the possible transitions between the states of the encoder as a function of time. The length of the trellis is equal to the number of time instants required to encode the whole information sequence.
maximum a posteriori (MAP) decoding using the BCJR
algorithm [19].
The ultimate progress in decoding concatenated codes
occurred when it was observed that a SISO decoder could
be regarded as an SNR amplifier, thus allowing common
concepts in amplifiers such as the feedback principle to be
implemented. This observation gave birth to the so-called
turbo decoding principle of concatenated codes [20] in the
early 1990s. Thanks to this decoding concept, the 1.5-dB
gap still remaining between what theory promised and
what the state of the art in error control coding was able to
offer at the time² was almost removed. In Fig. 1, the
receiver clearly exploits the received samples in a
suboptimal way, even in the case where information
passed from the inner decoder to the outer decoder is soft
information. The overall decoder works in an asymmetri-
cal way: both decoders work towards decoding the same
data. The outer decoder takes advantage of the inner
decoder work but the contrary is not true. The basic idea of
turbo decoding involves a symmetric information ex-
change between both SISO decoders, so that they can
converge to the same probabilistic decision, as a global
decoder would. The issue of stability, which is crucial in
feedback systems, was solved by introducing the notion of
extrinsic information, which prevents the decoder from
being a positive feedback amplifier. In the case where the
component decoders compute the Logarithm of Likelihood
Ratios (LLR) related to information data, the extrinsic
information can be obtained with a simple subtraction
between the output and the input of the decoder as
described in Fig. 2.
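In code, the extrinsic computation of Fig. 2 amounts to one subtraction per symbol. The following is a minimal sketch in the LLR domain; the function name is ours:

```python
def extrinsic_llr(app_out, llr_in):
    # Extrinsic = a posteriori output LLR minus the decoder's own
    # input LLR, so that a decoder never feeds its own input
    # information back to itself (avoiding positive feedback).
    return [o - i for o, i in zip(app_out, llr_in)]
```
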
B. Parallel and Serial Concatenation
of Convolutional Codes
1) Parallel Versus Serial Concatenation of Convolutional
Codes: The introduction of turbo codes is also the origin of
the concept of parallel concatenation (PC). Fig. 3(a) shows
the PC of two convolutional codes, as commonly used in
classical turbo codes. With the PC of codes, the message to
be transmitted is encoded twice in a separate fashion: the
first encoder processes the data message in the order it is
delivered by the source, while the second one encodes the
Fig. 2. General principle of turbo decoding in the logarithmic domain: the extrinsic information symmetrically exchanged between inner and outer SISO decoders can be obtained with a simple subtraction between the output and input of the decoders.
Fig. 3. General structures of (a) PC and (b) SC of two convolutional encoders. In (a), the encoded sequence c is obtained through concatenation of the information or systematic sequence and the redundancy sequences provided by the two constituent encoders: c = (u, y_1, y_2).
² In the early 1990s, the state of the art in error control coding was the concatenation of a Reed–Solomon code and a memory-14 convolutional code proposed for the Galileo space probe [21]. This scheme was decoded using a four-stage iterative scheme where hard decisions were fed back from the Reed–Solomon decoder to the Viterbi decoder.
same sequence in a different order obtained by way of an
interleaver.
The overall coding rates for SC and PC, R_s and R_p, are equal to

$$R_s = R_i R_o \quad\text{and}\quad R_p = \frac{R_1 R_2}{R_1 + R_2 - R_1 R_2} = \frac{R_1 R_2}{1 - (1 - R_1)(1 - R_2)}$$

where R_i and R_o refer to the coding rates of the inner and outer constituent codes in the SC scheme, and R_1 and R_2 refer to the rates of code 1 and code 2 in the PC scheme.
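As a quick numerical check of these rate formulas, here is a small sketch; the helper names are ours:

```python
def rate_serial(r_inner, r_outer):
    # R_s = R_i * R_o
    return r_inner * r_outer

def rate_parallel(r1, r2):
    # R_p = R1*R2 / (R1 + R2 - R1*R2),
    # equivalently R1*R2 / (1 - (1 - R1)*(1 - R2))
    return (r1 * r2) / (r1 + r2 - r1 * r2)
```

For two rate-1/2 constituent codes, the parallel concatenation yields rate 1/3, the rate of the original turbo code of [1], while the serial concatenation yields rate 1/4.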
Historical turbo codes [1], [22] are based on the PC of two recursive systematic convolutional (RSC) codes. Firstly, the choice of RSC codes instead of classical nonrecursive convolutional (NRC) codes is justified by comparing their respective error correction performance. An example of such memory length ν = 3 codes is shown in Fig. 4. Since they have the same minimum Hamming distance,³ the observed error correcting performance is very similar for both codes at medium and low error rates. However, the RSC code performs significantly better at low signal-to-noise ratios (SNRs) [22].
However, the reason why using RSC codes is essential in the construction of concatenated convolutional codes was explained through bounding techniques. It was shown in [23] and [24] that, under the assumption of uniform interleaving,⁴ the bit error probability in the low error region P_b varies asymptotically with the interleaver gain

$$P_b \propto N^{\alpha_{\max}} \tag{1}$$

where N is the interleaver length and α_max is the maximum exponent of N in the asymptotic union bound approximation. In the PC case, the maximum exponent is equal to α_max = 1 − w_min, where w_min is the minimum input Hamming weight of finite error events. This result shows that there is an interleaving gain only if w_min > 1, which is true for RSC codes (w_min = 2 and α_max = −1) but not for NRC codes (w_min = 1 and α_max = 0). As for SC schemes, if both constituent codes are NRC, the same problem appears. It was shown that, for SCCC, at least the inner code has to be recursive [25] in order to ensure an interleaver gain. Provided this condition is satisfied, the maximum exponent is equal to α_max = −⌊(d_min,o + 1)/2⌋, where d_min,o is the minimum Hamming distance of the outer code.
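The exponents above can be made concrete with a small helper, a sketch under the stated uniform-interleaver assumption; the function names are ours:

```python
import math

def alpha_max_pc(w_min):
    # Parallel concatenation: alpha_max = 1 - w_min, so an
    # interleaving gain (alpha_max < 0) requires w_min > 1.
    return 1 - w_min

def alpha_max_sc(d_min_outer):
    # Serial concatenation with a recursive inner code:
    # alpha_max = -floor((d_min_outer + 1) / 2)
    return -math.floor((d_min_outer + 1) / 2)
```

RSC constituents (w_min = 2) give α_max = −1 for PC, while an outer code with d_min,o = 5 gives α_max = −3 for SC, i.e., a much steeper decay of P_b with the interleaver length N.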
This analysis also allows us to compare the SC and PC performance at low error rates: serial turbo codes perform better than parallel turbo codes in this region because their interleaver gain is larger. Conversely, the opposite behavior can be observed at high error rates, for the same overall coding rate. This can be explained through the analysis of extrinsic information transfer characteristics based on mutual information, the so-called EXIT charts [26].
2) Block Coding With Convolutional Codes: Convolutional codes are not a priori well suited to encoding information transmitted in block form. Nevertheless, most practical applications require the transmission of data in block fashion, the size of the transmitted blocks sometimes being reduced to less than 100 bits (see, e.g., the 3GPP turbo code [8] with a 40-bit minimum block length). In order to properly decode data blocks at the receiver side, the decoder needs some information about the encoder state at the beginning and the end of the encoding process. The knowledge of the initial state of the encoder is not a problem, since the "all zero" state is, in general, forced. However, the decoder has no special available information regarding the final state of the encoder and its trellis. Several methods can solve this problem.
1) Do nothing: that is, no information concerning
the final states of the trellises is provided to the
decoder. The trellis is truncated at the end of each
block. The decoding process is less effective for
the last encoded data and the asymptotic coding
gain may be reduced. This degradation is a
function of the block length and may be low
enough to be accepted for a given application.
Fig. 4. (a) Classical nonrecursive nonsystematic convolutional code with ν = 3 memory units (eight-state code). (b) Equivalent recursive systematic version of code (a).
³ The minimum Hamming distance d_min of a code is the smallest Hamming distance between any two different encoded sequences. The correcting capability of the code is directly related to the value of d_min.
⁴ A uniform interleaver is a probabilistic device that maps a given sequence of length N and Hamming weight w into all distinct permutations of length N and Hamming weight w with equal probability. The uniform interleaver is representative of the average performance over all possible deterministic interleavers.
2) Force the encoder state at the end of the encoding phase, for one or all constituent codes: This solution was adopted by the CCSDS and UMTS turbo codes [7], [8]. The trellis termination of a constituent code involves encoding extra bits, called tail bits, in order to make the encoder return to the all-zero state. These tail bits are then sent to the decoder. This method presents two drawbacks. Firstly, the spectral efficiency of the transmission is slightly decreased. Nevertheless, this reduction is negligible except for very short blocks. Next, for parallel turbo codes, the tail bits are not identical for the termination of both constituent codes, or in other words, they are not turbo encoded. Consequently, symbols placed at the block end have a weaker protection. As for serial turbo codes, the tail bits used for the termination of the inner coder are not taken into account in the turbo decoding process, thus leading to a similar problem. However, the resulting loss in performance is very small and can be acceptable in most applications.
3) Adopt tail-biting [27]: This technique allows any state of the encoder as the initial state. The encoding task is performed so that the final state of the encoder is equal to its initial state. The code trellis can then be viewed as a circle, without any state discontinuity. Tail-biting presents two main advantages in comparison with trellis termination using tail bits to drive the encoder to the all-zero state. Firstly, no extra bits have to be added and transmitted. Next, with tail-biting RSC codes, only codewords with minimum input weight 2 have to be considered. In other words, tail-biting encoding avoids any side effects, unlike classical termination. This is particularly attractive for highly parallel hardware implementations of the decoder, since the block sides do not require any specific processing. In practice, the straightforward circular encoding of a data block consists of a two-step process. In the first step, the information sequence is encoded from the all-zero state and the final state is stored; during this first step, the output bits are ignored. The second step is the actual encoding, whose initial state is a function of the final state previously stored. This double encoding operation represents the main drawback of the method, but in most cases it can be performed at a frequency much higher than the data rate. An increased amount of memory is also required to store the state information related to the start and the end of the block between iterations.
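The two-step circular encoding can be sketched for a toy rate-1/2 RSC code (feedback 1 + D + D², parity 1 + D²); all names are ours. A real encoder would map the stored final state to the circulation state through a small precomputed table; here, for this 4-state toy code, we simply try every initial state, which is equivalent:

```python
def rsc_step(state, u):
    # One step of the toy RSC: feedback 1+D+D^2, parity 1+D^2.
    s1, s2 = state
    a = u ^ s1 ^ s2          # feedback bit entering the shift register
    return (a, s1), a ^ s2   # (next state, parity bit)

def run_encoder(state, bits):
    parity = []
    for u in bits:
        state, p = rsc_step(state, u)
        parity.append(p)
    return state, parity

def tailbiting_encode(bits):
    # Step 1 (dry run): probe the circulation state; outputs ignored.
    # Step 2: re-encode from the circulation state, i.e., the unique
    # initial state that the encoder returns to at the end of the block.
    for s0 in ((0, 0), (0, 1), (1, 0), (1, 1)):
        final, _ = run_encoder(s0, bits)
        if final == s0:
            return s0, run_encoder(s0, bits)[1]
    raise ValueError("block length incompatible with the code period")
```

A circulation state exists (and is unique) only when the block length is not a multiple of the period of the feedback polynomial (3 for 1 + D + D²), which is why systems using tail-biting constrain the admissible block sizes.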
C. Iterative Decoders for Concatenated
Convolutional Codes
The decoding principle of PCCC and SCCC is shown in Figs. 5 and 6. The SISO decoders are assumed to process LLRs at their inputs λ(·; I) and outputs λ(·; O) (the notations used in the figures are those adopted in Section III).

In the PC scheme of Fig. 5, each SISO decoder computes the extrinsic LLRs related to the information symbols, λ(u_1; O) and λ(u_2; O), using the observation of the associated systematic and parity symbols coming from the transmission channel, λ(c_1; I) and λ(c_2; I), and the a priori LLRs λ(u_2; I) and λ(u_1; I). Since no a priori LLRs are available from the decoding process at the beginning of the iterations, they are initially set to zero. For the subsequent iterations, the extrinsic LLRs coming from the other decoder are used as a priori LLRs for the current SISO

Fig. 5. Turbo decoding principle in the case of parallel turbo codes. Notations are taken from Fig. 3(a). λ(·; I) and λ(·; O) refer to LLRs at the input and output of the SISO decoders.
decoder. The decisions can be computed from any of the
decoders. In the PC case, the turbo decoder structure is
symmetrical with respect to both constituent decoders.
However, in practice, the SISO processes are executed in a
sequential fashion; the decoding process starts arbitrarily
with either one decoder, SISO1 for example. After SISO1
processing is completed, SISO2 starts processing and
so on. In the SC scheme of Fig. 6, the decoding diagram is no longer symmetrical. On the one hand, the inner SISO decoder computes the extrinsic LLRs λ(u_i; O) related to the inner code information symbols, using the observation of the associated coded symbols coming from the transmission channel, λ(c_i; I), and the extrinsic LLRs coming from the other SISO decoder, λ(u_i; I). On the other hand, the outer SISO decoder computes the extrinsic LLRs λ(c_o; O) related to the outer code symbols using the extrinsic LLRs provided by the inner decoder. The decisions are computed as a posteriori LLRs λ(u_o; O) related to the information symbols by the outer SISO decoder. Although the overall decoding principle depends on the type of concatenation, both turbo decoders can be constructed from the same basic SISO building blocks, as described in Section III.
For digital implementations of turbo decoders, the
different processing stages present a nonzero internal
delay, with the result that turbo decoding can only be
implemented through an iterative process. Each SISO
decoder processes its own data and passes it to the other
SISO decoder. One iteration corresponds to one pass
through each of all the SISO decoders. One pass through a
single SISO decoder is sometimes referred to as half an
iteration of decoding.
The concatenated encoders and decoders can work
continuously or block-wise. Since convolutional codes are
naturally better suited to encode information in a
continuous fashion, the very first turbo encoders and
decoders [2], [28] were stream oriented. In this case, there
is no constraint about the termination of the constituent
encoders, and the best interleavers turned out to be
periodic or convolutional interleavers [29], [30]. The
corresponding decoders call for a modular pipelined
structure, as illustrated in Fig. 7 in the case of parallel
turbo codes. The basic decoding structure has to be
replicated as many times as the number of iterations. In
order to ensure the correct timing of the decoding process,
delay lines have to be inserted into the decoder. The de-
cision computation, not shown in the figure, is performed
at the output of SISO decoder 2, at the last iteration stage.
However, as far as block decoding is concerned, the simplest decoding architecture is based on the use of a single SISO decoder, which alternately processes both constituent codes. This standard architecture requires three storage units: a memory for the received data at the channel and decoder outputs (LLR_in memory), a memory for extrinsic information at the SISO output (EXT memory), and a memory for the decoded data (LLR_out memory). The instantiations of this architecture in the PC
and SC cases are illustrated in Figs. 8 and 9. In the parallel
scheme, the decoding architecture is the same for both
component codes, since they play the same role in the
overall decoding process. The SISO decoder decodes code 1 or 2 using the corresponding channel data from the LLR_in memory and the a priori information stored in the EXT memory at the previous half-iteration. The resulting extrinsic information is stored in the EXT memory and then used as a priori information at the SISO input when the other code is processed, at the next half-iteration. The decoded data are written into the LLR_out memory at the last half-iteration of the decoding process. On the contrary, in the SC scheme, the architecture elements are not used in the same fashion for inner and outer code processing. The inner decoding process, shown in Fig. 9(a), is similar to the elementary decoding process in the PC case; whereas the outer decoding process, shown in Fig. 9(b), does not use the LLR_in memory contents or the a priori information input λ(u_o; I). The decoded data are written into the LLR_out memory after the last processing round of the outer decoder.
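The block-wise half-iteration schedule described above for the PC case can be sketched as follows. This is a minimal sketch: `siso` stands for any SISO decoder returning extrinsic LLRs, and all names and conventions are ours:

```python
def turbo_decode(llr_sys, llr_par1, llr_par2, interleave, deinterleave,
                 siso, n_iter=8):
    # One iteration = two half-iterations sharing a single SISO unit.
    n = len(llr_sys)
    ext1 = [0.0] * n
    ext2 = [0.0] * n          # a priori for the first half-iteration: zero
    for _ in range(n_iter):
        # Half-iteration 1: decode code 1 in the natural order.
        ext1 = siso(llr_sys, llr_par1, ext2)
        # Half-iteration 2: decode code 2 in the interleaved order.
        ext2 = deinterleave(siso(interleave(llr_sys), llr_par2,
                                 interleave(ext1)))
    # A posteriori LLR = channel + both extrinsics; with the convention
    # lambda = log P(x=1)/P(x=0), a positive value decides bit 1.
    app = [s + e1 + e2 for s, e1, e2 in zip(llr_sys, ext1, ext2)]
    return [1 if x > 0 else 0 for x in app]
```

The EXT memory of Figs. 8 and 9 corresponds to `ext1`/`ext2`, and the LLR_in memory to the three channel LLR arrays.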
III. SOFT INPUT SOFT
OUTPUT ALGORITHMS
SISO algorithms are the fundamental building blocks of
iterative decoders. A SISO is, in general, a block that
accepts messages (to be defined later) about both the input
and output symbols of an encoder and provides extrinsic
messages about the same symbols. The extrinsic message is
generated by considering the a priori constraints that exist
between the input and output sequences of the encoder. In
this section, we will give more precise definitions of the
"message" and on the input-output relationships of a SISO block. Moreover, we will show efficient algorithms to perform SISO decoding for some special types of encoders.

Fig. 6. Turbo decoding principle in the case of serial codes. Notations are taken from Fig. 3(b). λ(·; I) and λ(·; O) refer to LLRs at the input and output of the SISO decoders.
A. Definition of the Input and Output Metrics
A SISO module generally works in association with a known mapping f (encoding) between input and output alphabets

$$c = f(u) = \big(f_1(u), \ldots, f_n(u)\big), \qquad u \in U,\; c \in C \tag{2}$$

where u = (u_1, ..., u_k) and c = (c_1, ..., c_n) are the input and output sequences of the encoder, respectively. The alphabets U and C are generic finite alphabets and the mapping is not necessarily invertible.
A SISO module is a four-port device that accepts messages about the input and output symbols of the encoder and provides homologous extrinsic output messages.
Fig. 8. Standard architecture for block-wise decoding of
parallel turbo codes.
Fig. 9. Standard architecture for block-wise decoding of serial
turbo codes: (a) inner decoder and (b) outer decoder.
Fig. 7. Pipelined architecture of turbo decoder of Fig. 5, in the case of continuous decoding of PCCC. A priori LLRs at input of first iteration
stage are set to zero. Decision computation, performed at last iteration stage, is not shown.
We will consider the following two types of normalized messages.

1) Likelihood ratio (LR):

$$L(x) = \frac{P(X = x)}{P(X = 0)}$$

represents the ratio between the likelihood of the symbol being x and the likelihood of it being zero.

2) Log-likelihood ratio (LLR):

$$\lambda(x) = \log \frac{P(X = x)}{P(X = 0)} \;\Longrightarrow\; e^{\lambda(x)} = L(x)$$

is its logarithmic version.

Normalized messages are usually preferred to unnormalized ones as they allow the decoder to save one value in the representation. In fact, by definition L(0) = 1 and λ(0) = 0. In particular, when the variable x is binary (x ∈ {0, 1}), LRs and LLRs can be represented with a scalar L_x = L(x = 1)⁵ and λ_x = λ(x = 1), so that one can write

$$L(x) = (L_x)^x, \qquad \lambda(x) = x\,\lambda_x. \tag{3}$$

The sequence of LRs is always assumed to be independent at the input of a SISO,⁶ so that the likelihood ratio of the sequences u and c is the product of the likelihoods of their constituent symbols

$$L(u) = \prod_{i=1}^{k} L(u_i), \qquad L(c) = \prod_{j=1}^{n} L(c_j)$$

and equivalently, for the LLRs,

$$\lambda(u) = \sum_{i=1}^{k} \lambda(u_i), \qquad \lambda(c) = \sum_{j=1}^{n} \lambda(c_j).$$

Furthermore, assuming that the sequences of LRs of input and output symbols are mutually independent, the LR of a pair (u, c), with the constraint of being a valid correspondence, can be obtained as

$$L(u, c) = \begin{cases} L(u)\,L\big(f(u)\big), & c = f(u) \\ 0, & \text{otherwise.} \end{cases}$$
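In code, the normalization and the additivity of sequence LLRs look like this (a minimal sketch; the helper names are ours):

```python
import math

def binary_llr(p1):
    # lambda_x = log P(X=1)/P(X=0); the normalization makes
    # lambda(0) = 0, so a single scalar describes a binary symbol.
    return math.log(p1 / (1.0 - p1))

def sequence_llr(symbol_llrs):
    # Under the independence assumption, the LLR of a sequence is
    # the sum of the LLRs of its symbols.
    return sum(symbol_llrs)
```

`binary_llr(0.5)` is 0 (no information), and the sign of the LLR indicates the more likely bit value.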
B. General SISO Relationships
The formal SISO input-output relationships are obtained with the independence assumption, constraining the set of input/output sequences to be in the set of possible mapping correspondences. Using LR messages, the relationships are

$$L(u_i; O)\,L(u_i; I) = \frac{\sum_{u' : u'_i = u_i} L(u'; I)\,L\big(f(u'); I\big)}{\sum_{u' : u'_i = 0} L(u'; I)\,L\big(f(u'); I\big)} \tag{4}$$

$$L(c_j; O)\,L(c_j; I) = \frac{\sum_{u' : f_j(u') = c_j} L(u'; I)\,L\big(f(u'); I\big)}{\sum_{u' : f_j(u') = 0} L(u'; I)\,L\big(f(u'); I\big)} \tag{5}$$

where we have introduced the letters "I" and "O" to distinguish between input and output messages.
In the logarithmic domain (LLRs), products translate to sums, and sums are mapped to the max* operator

$$\max{}^*(\lambda_1, \lambda_2) = \log\big(e^{\lambda_1} + e^{\lambda_2}\big) = \max(\lambda_1, \lambda_2) + \log\big(1 + e^{-|\lambda_1 - \lambda_2|}\big) \tag{6}$$
and (4) and (5) become, in terms of LLRs,

$$\lambda(u_i; O) + \lambda(u_i; I) = \max^*_{u' : u'_i = u_i}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] - \max^*_{u' : u'_i = 0}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] \tag{7}$$

$$\lambda(c_j; O) + \lambda(c_j; I) = \max^*_{u' : f_j(u') = c_j}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] - \max^*_{u' : f_j(u') = 0}\big[\lambda(u'; I) + \lambda\big(f(u'); I\big)\big] \tag{8}$$
The algorithm obtained from the first type of messages (LRs) is called multiplicative (sum-prod), while the second (LLRs) is called additive (max*-sum), or log-MAP.

The max* operator requires, in general, two sums and a look-up table. The look-up table size depends on the
⁵ We have kept the index x with the binary LLR to remind the reader of the name of the underlying symbol.
⁶ This is actually the basic assumption of iterative decoding, which otherwise would be optimum.
required accuracy, and the whole correction term can be avoided, giving rise to a simpler and suboptimal version of the SISO (max-sum or max-log-MAP). To compensate for the effect of neglecting the look-up table, several strategies are possible, such as scaling or offsetting the messages. These techniques are described in [31] and [32].
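As an illustration, the max* operator of (6) and its max-log-MAP simplification can be sketched as follows (a minimal sketch; the function names are ours):

```python
import math

def max_star(a, b):
    # Exact Jacobian logarithm of (6): log(e^a + e^b), computed
    # stably as the max plus a correction term (the look-up table
    # in a hardware implementation).
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_approx(a, b):
    # max-log-MAP: drop the correction term entirely, trading a
    # small loss in accuracy for a much simpler operator.
    return max(a, b)
```

The correction term is at most log 2 ≈ 0.693 (reached when the two arguments are equal), which is why neglecting it, or replacing it with a small look-up table, costs only a fraction of a decibel.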
Independently of the metric used, the input-output relationships (4), (5) or (7), (8) have a complexity that grows with the size of the code. This complexity can be affordable for very simple mappings but becomes impractical for most of the mappings used as encoders.
In the following sections, we will describe some
particular cases where this computation can be
simplified.
C. Binary Mappings

As we have seen, for binary variables LRs and LLRs have the appealing feature of being representable as single values. In particular, using relationships (3) we have

$$\lambda_{u_l}(O) + \lambda_{u_l}(I) = \max^*_{u' : u'_l = 1}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) - \max^*_{u' : u'_l = 0}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) \tag{9}$$

$$\lambda_{c_l}(O) + \lambda_{c_l}(I) = \max^*_{u' : f_l(u') = 1}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) - \max^*_{u' : f_l(u') = 0}\left(\sum_{i=1}^{k} u'_i\,\lambda_{u_i}(I) + \sum_{j=1}^{n} f_j(u')\,\lambda_{c_j}(I)\right) \tag{10}$$
D. SISO Relationships on Trellises
In this section, we will explain how to simplify the computation of (4) and (5), or their logarithmic counterparts (7) and (8), when the mapping is represented over a trellis. As the correspondence between the multiplicative and additive domains only requires the use of different operators, we will describe the algorithm in the multiplicative domain.
A trellis is an object characterized by the concatenation of $L$ trellis sections $T_l$. Each trellis section (see Fig. 10) consists of a set of starting states $s_l \in S_l$, an input set $x_l \in X_l$, the set of edges defined as the pairs $e_l = (s_l, x_l) \in E_l = S_l \times X_l$, and two functions $p_l(e_l)$ and $y_l(e_l)$ that assign to each edge a final state in $S_{l+1}$ and an output symbol in $Y_l$.
A trellis is associated to a mapping (2) by making a correspondence of the $L$ trellis sections' input and output alphabets with the $k$ input and $n$ output alphabets of the mapping
\[
\bigotimes_{l=1}^{L} X_l = \bigotimes_{i=1}^{k} U_i \qquad \bigotimes_{l=1}^{L} Y_l = \bigotimes_{j=1}^{n} C_j
\]
where $\bigotimes$ denotes the Cartesian product of sets.
The alphabets of the trellis sections must then be the Cartesian product of a set of subalphabets of the original mapping
\[
X_l = \bigotimes_{i \in I_l} U_i \qquad Y_l = \bigotimes_{j \in J_l} C_j
\]
where $\{I_l\}$ and $\{J_l\}$ are a partition of the sets of indexes of the input and output alphabets. Furthermore, the cardinality of $S_1$ and $S_{L+1}$ is always one.
Any mapping (2) in general admits several time-varying trellis representations; in particular, the number of trellis sections $L$ and the alphabets defining each trellis section can be arbitrarily fixed (see Fig. 11). Finding a trellis with minimal complexity (i.e., a minimal number of edges, as explained in [33]) is a rather complex task and is outside the scope of this tutorial paper.
A trellis representation for a mapping allows us to
interpret the mapping itself as paths in the representing
trellis. This correspondence allows us to build efficient
encoders based on time-varying finite-state machines.
More importantly, this same structure can be exploited for
the efficient evaluation of expressions like those in (4) and
(5) or (7) and (8) that involve associative and distributive
operators.
In fact, due to the distributive and associative
properties of the operators appearing in (4) and (5), it is
possible to compute the required output extrinsic LRs with
the following algorithm [19].
1) From the likelihoods of the alphabets of the mapping $L(u_i; I)$ and $L(c_j; I)$, compute the likelihoods of the alphabets of the trellis sections and, from them, the likelihoods of the edges (branch metrics)
\[
\Gamma_l(x_l) = \prod_{i \in I_l} L(u_i; I) \quad \forall x_l \in X_l \tag{11}
\]
\[
\Gamma_l(y_l) = \prod_{j \in J_l} L(c_j; I) \quad \forall y_l \in Y_l
\;\Rightarrow\;
G_l(e_l) = \Gamma_l(x_l)\, \Gamma_l\big(y_l(e_l)\big). \tag{12}
\]
2) Compute the forward and backward recursions according to
\[
A_{l+1}(s) = \sum_{e_l:\, p_l(e_l) = s} A_l(s_l)\, G_l(e_l) \qquad l = 1, \ldots, L-1 \tag{13}
\]
\[
B_l(s) = \sum_{e_l:\, s_l = s} B_{l+1}\big(p_l(e_l)\big)\, G_l(e_l) \qquad l = L, \ldots, 2 \tag{14}
\]
with initializations
\[
A_1(0) = B_{L+1}(0) = 1.
\]
Periodic normalization may be introduced at this point to avoid overflows. The normalization can be performed by dividing all state metrics by a reference one. Note, however, that in the additive version normalization can be avoided by using a two's-complement representation of the state metrics, a technique that can be found in the literature on Viterbi decoders.
3) Compute the a posteriori likelihoods of the edges
\[
D_l(e_l) = A_l(s_l)\, G_l(e_l)\, B_{l+1}\big(p_l(e_l)\big) \qquad \forall e_l,\; l = 1, \ldots, L.
\]
4) Finally, compute the desired extrinsic output LRs as
\[
L(u_i; O) = \frac{1}{L(u_i; I)}\,
\frac{\sum_{e_l:\, u_i(e_l) = u_i} D(e_l)}{\sum_{e_l:\, u_i(e_l) = 0} D(e_l)} \tag{15}
\]
\[
L(c_j; O) = \frac{1}{L(c_j; I)}\,
\frac{\sum_{e_l:\, c_j(e_l) = c_j} D(e_l)}{\sum_{e_l:\, c_j(e_l) = 0} D(e_l)} \tag{16}
\]
where $l$ in (15) [respectively, (16)] is the index of the trellis section associated to the symbol $u_i$ (respectively, $c_j$).
If no information is available for some of the symbols involved in the mapping, their corresponding LR should be set to the vector 1 and their LLR to the vector 0. If no extrinsic information is required for some of the symbols involved in the mapping, the corresponding final relationships (15) or (16) can be avoided.
Due to the finite memory of the trellis, the forward and backward recursions (13) and (14) are generally such that $A_{l+W}(s)$ and $B_{l-W}(s)$ are independent of $A_l(s)$ [respectively, $B_l(s)$] for sufficiently large $W$. This fact also allows the algorithm to work when messages are available only for a contiguous subset of trellis sections. See Section III-F for more details.
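The four steps above can be sketched in the additive (max-log) domain on a toy trellis. The 2-state code used below (systematic bit u, parity c = u XOR s, next state c) is a hypothetical example chosen only to keep the trellis small, and the backward recursion is initialized uniformly, as for an unterminated trellis:

```python
NUM_STATES = 2

def edges(s):
    """Edges leaving state s of a toy 2-state trellis section:
    (start state, input bit, next state, output bits (u, c))."""
    return [(s, u, u ^ s, (u, u ^ s)) for u in (0, 1)]

def siso_maxlog(llr_u, llr_c):
    """Max-log SISO on the toy trellis: additive forms of (11)-(15).
    llr_u[l] is the a priori LLR of input bit l; llr_c[l][j] is the
    channel LLR of output bit j at step l. Returns extrinsic input LLRs."""
    L, NEG = len(llr_u), -1e9

    def gamma(l, e):  # branch metric: sum of the LLRs of the '1' bits
        _, u, _, ys = e
        return u * llr_u[l] + sum(y * llr_c[l][j] for j, y in enumerate(ys))

    # forward recursion (13), starting in state 0
    alpha = [[NEG] * NUM_STATES for _ in range(L + 1)]
    alpha[0][0] = 0.0
    for l in range(L):
        for s in range(NUM_STATES):
            for e in edges(s):
                m = alpha[l][s] + gamma(l, e)
                alpha[l + 1][e[2]] = max(alpha[l + 1][e[2]], m)

    # backward recursion (14), unterminated: all end states allowed
    beta = [[NEG] * NUM_STATES for _ in range(L)] + [[0.0] * NUM_STATES]
    for l in range(L - 1, -1, -1):
        for s in range(NUM_STATES):
            for e in edges(s):
                m = beta[l + 1][e[2]] + gamma(l, e)
                beta[l][s] = max(beta[l][s], m)

    # edge a posteriori metrics and extrinsic outputs (15), additive form
    ext = []
    for l in range(L):
        best = {0: NEG, 1: NEG}
        for s in range(NUM_STATES):
            for e in edges(s):
                d = alpha[l][s] + gamma(l, e) + beta[l + 1][e[2]]
                best[e[1]] = max(best[e[1]], d)
        ext.append(best[1] - best[0] - llr_u[l])
    return ext
```

For instance, `siso_maxlog([0.0, 0.0], [[4.0, 4.0], [-4.0, 4.0]])` returns extrinsic LLRs whose signs follow the channel observations.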
Fig. 10. Trellis section and related functions: starting state, ending state, input symbol, and output symbol. Note that an edge is defined as the pair (starting state, input symbol).
In Fig. 12, we show as an example the block diagram of the forward recursion in its multiplicative and additive forms. Note that in the additive form we denote with lowercase Greek letters the logarithmic counterparts of the variables defined for the multiplicative version.
The described SISO algorithm is completely general
and can be used for any mapping and trellis. In the
following we will consider some special cases of mappings
where the algorithm can be further simplified.
1) Convolutional encoders: The trellis of a convolutional encoder has trellis sections that do not depend on the time index. A convolutional encoder has a constant set of states, constant input and output alphabets, and constant functions $p(e)$ and $y(e)$. Convolutional encoders define a mapping between semi-infinite sequences when the starting state is fixed
\[
\bigotimes_{i=0}^{\infty} U_i \longrightarrow \bigotimes_{j=0}^{\infty} C_j.
\]
2) Binary convolutional encoders: These have the additional constraint that $U_i = C_j = \mathbb{Z}_2$. For them, each trellis section is characterized by a set of $k_0$ input bits and $n_0$ output bits. LR and LLR
Fig. 11. Correspondence between a mapping and a time-varying trellis.
Fig. 12. Implementation of forward recursion in additive and
multiplicative version.
are single quantities, as shown in (3), so that the branch metrics can be computed simply as
\[
\gamma_l(x) = \sum_{i=0}^{k_0 - 1} u_i\, \lambda_{u_i}(I) \qquad
\gamma_l(y) = \sum_{j=0}^{n_0 - 1} c_j\, \lambda_{c_j}(I).
\]
3) Linear binary convolutional encoders: For linear encoders we have the additional linearity property
\[
f(u_1 \oplus u_2) = f(u_1) \oplus f(u_2).
\]
The linearity of an encoder does not simplify the SISO algorithm. However, linear encoders admit dual encoders, and the SISO algorithm can be performed with modified metrics (as we will see in the next section) on the trellis of the dual code. This fact may lead to considerable savings, especially for high-rate codes, which have simpler dual trellises.
4) Systematic encoders: Systematic encoders are encoders for which the output symbol $y_l$ is obtained by concatenating the input symbol $x_l$ with a redundancy symbol $r_l$:
\[
y_l = (x_l, r_l).
\]
For them, the computation of the metrics (11) and (12) is simplified, as the metric on $x_l$ can be incorporated into that of $y_l$:
\[
\Gamma_l(y_l) = \prod_{i \in I_l} L(u_i; I)\, L(c_i; I) \prod_{j \in J_l \setminus I_l} L(c_j; I)
\]
where the first product refers to the systematic part of the label $y_l$ and the second to the redundancy $r_l$.
E. SISO Algorithm for Dual Codes
In the previous section, we saw that the linearity of the code does not simplify the SISO algorithm. However, when the encoder is linear and binary,7 we recall a fundamental result, first derived in [34] and restated here.
Define the new binary messages, called reflection coefficients $R_x$, from the LR $L_x$ as
\[
R_x = \frac{1 - L_x}{1 + L_x}
\]
and the corresponding sequence reflection coefficients as
\[
R(c) = \prod_{j=1}^{n} \left(R_{c_j}\right)^{c_j}.
\]
The following relationship holds true:
\[
R(c_j; O)\, R(c_j; I) =
\frac{\sum_{u':\, f^{\perp}_j(u') = c_j} R\big(f^{\perp}(u'); I\big)}
{\sum_{u':\, f^{\perp}_j(u') = 0} R\big(f^{\perp}(u'); I\big)} \tag{17}
\]
where $f^{\perp}$ is the mapping that defines the dual encoder, i.e., the set of sequences orthogonal to the code. This relationship, formally identical to (5), can be used to perform SISO decoding of high-rate linear binary encoders with the complexity associated with their duals [35]–[37]. A very important case where this property is exploited is LDPC decoding, for the efficient SISO computation at the check nodes, since the dual code of the $(n, n-1)$ parity check code is the simple two-word $(n, 1)$ repetition code.
Although elegant and simple, this approach may sometimes lead to numerical problems in fixed-point implementations, and in these cases the approach based on puncturing a low-rate mother code is still preferred.
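For the single parity check code, (17) reduces to the familiar check-node rule of LDPC decoding. A minimal Python sketch follows; the sign convention (reflection coefficient written as tanh of half the LLR) and the clipping constant are assumptions of this sketch, not taken from the paper:

```python
import math

def parity_check_siso(llrs):
    """Extrinsic LLRs of an (n, n-1) single parity check code computed
    through its dual, the (n, 1) repetition code: the output reflection
    coefficient of bit j is the product of the other input coefficients."""
    r = [math.tanh(x / 2.0) for x in llrs]  # reflection coefficients
    out = []
    for j in range(len(llrs)):
        p = 1.0
        for i, ri in enumerate(r):
            if i != j:
                p *= ri
        p = max(min(p, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
        out.append(2.0 * math.atanh(p))
    return out
```

The clipping step hints at the numerical fragility mentioned above: in fixed point, products of near-unit reflection coefficients are hard to represent accurately.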
F. Initialization of Forward and Backward
Recursions and Windowing
For SISO decoding of convolutional codes, the initialization of the forward and backward recursions, as well as the order in which these recursions are performed, leads to different solutions. In Fig. 13, we show, pictorially, the most relevant approaches.
In the upper left part of the figure, we represent a possible solution when a single SISO processes a block of data. The data is divided into adjacent, equally sized "windows" and the SISO processes the set of windows sequentially in natural order. It first performs the forward recursion, storing the result in a temporary buffer, and then performs the backward recursion and the computation of the outputs at the same time. Since the SISO processes the windows sequentially and in natural order, the forward recursion results can be propagated to the next window for correct initialization. The backward recursion, on the other hand, needs to be properly initialized on each window and
7
The result can be easily extended more generally to finite fields.
an additional unit must be deployed to perform this task
(dashed lines).
Other scheduling possibilities, such as performing first the backward and then the forward recursion, are examined in [38] and give rise to similar overheads.
When parallel processors are used (top right), initialization of the forward recursions is needed and an additional unit must be deployed to perform this task.
A more elegant and efficient solution, which eliminates the overhead of the initializations, is reported in the bottom part of the figure. In this case, we exploit the fact that the SISO processors are employed in an iterative decoder. The results of the forward and backward recursions at a given iteration are propagated to the adjacent (previous or next) window in the next iteration. This approach leads to negligible performance losses provided that the window size is large enough.
IV. PARALLEL ARCHITECTURES FOR TURBO DECODERS
In this section, we consider the basic SISO decoder architecture with four processing units (branch metric unit, forward recursion unit, backward recursion unit, and output computation unit; see Fig. 21). This SISO is able to perform one trellis step during one clock cycle. The maximum throughput achievable by the turbo decoder using this SISO is $f_{\mathrm{clk}} / (\Phi\, n_{\mathrm{it}})$, where $n_{\mathrm{it}}$ is the number of iterations of the decoding process and $f_{\mathrm{clk}}$ is the clock frequency of the architecture. The factor $\Phi$ indicates the minimum number of trellis stages per information bit ($\Phi = 2$ for a PCCC turbo decoder, $\Phi = 2 + 1/R_i$ for an SCCC turbo decoder).8
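As a numerical illustration of this throughput bound, the sketch below evaluates it for hypothetical parameter values (the clock frequency and iteration count are invented for the example; the factor returned by the two helper functions is the number of trellis stages per information bit):

```python
def phi_pccc():
    """Minimum trellis stages per information bit for a PCCC decoder."""
    return 2.0

def phi_sccc(inner_rate):
    """Same factor for an SCCC decoder with inner code rate R_i."""
    return 2.0 + 1.0 / inner_rate

def max_throughput(f_clk_hz, n_it, phi):
    """Maximum information throughput (bit/s) of one serial SISO that
    processes one trellis step per clock cycle over n_it iterations."""
    return f_clk_hz / (phi * n_it)
```

With a hypothetical 200 MHz clock and 8 iterations, a PCCC decoder would thus be bounded at 12.5 Mbit/s.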
There are three solutions to increase the throughput of the decoder: increasing the parallelism of the decoder, increasing the clock frequency, and, finally, decreasing the number of iterations. This last solution will be considered separately in Section V.
Theoretically, the increase of parallelism can be obtained at all levels of the hierarchy of the turbo decoding algorithm: first at the turbo decoder level, second at the SISO level (duplication of the processing unit performing an iteration), third at the half-iteration level (duplication of the hardware to speed up a SISO processing), and, finally, at the trellis stage level.
A. Codeword Pipeline and Parallelization
The first method proposed in the literature to increase the throughput of a turbo decoder [2] is the simplest one: it dedicates a processor to each half iteration. Thus, the $2n_{\mathrm{it}}$ processors work in a linear systolic way. While the first one processes the first half iteration of the newest received
8 This maximum decoding rate is obtained when there is no idle cycle (to empty and/or initialize the pipeline) between two half iterations. In [39], a technique to obtain this condition is presented.
Fig. 13. Possible solutions for initialization of forward and backward recursions.
codeword of index $k$, the second one processes the second half iteration of the previously received codeword (index $k-1$), and so on, up to the $2n_{\mathrm{it}}$th processor, which performs the last iteration of the codeword of index $k - 2n_{\mathrm{it}}$. Once the processing of a half iteration is finished, all codewords are shifted in the linear processor array. This method is efficient in the sense that the increase in throughput is proportional to the increase in hardware: $2n_{\mathrm{it}}$ processors working in parallel increase the throughput by a factor of $2n_{\mathrm{it}}$. Another efficient alternative involves instantiating several turbo decoders working in parallel on independent codewords using demultiplexing.
Nevertheless, both methods imply a large decoding latency and also require the duplication of the memories for extrinsic and intrinsic information. To avoid memory duplication, it is far more efficient to use parallelism at the SISO level in order to speed up the execution time of an iteration.
B. Parallel SISO Architecture
The idea is to use parallelism to perform SISO decoding, i.e., to use several independent SISO decoders working on the same codeword. To do so, the frame of size $N$ ($N = k/R_o$ in the case of the inner code of the SCCC scheme; $N = k$ otherwise) is sliced into $P$ slices of size $M = N/P$, and each slice is processed in parallel by $P$ independent SISO decoders (in the following, we assume that the size $N$ of the frame is a multiple of $P$). This technique implies two types of problems that should be solved: first, the problem of data dependency within an iteration; second, the problem of memory collisions in the parallel memory accesses.
Problem of Data Dependency: Since forward processing is performed sequentially from the first symbol $u_0$ to the last symbol $u_{N-1}$, the question arises of how to start simultaneously $P$ independent forward recursions from the symbols $u_0, u_M, u_{2M}, \ldots, u_{N-M}$. The same question also arises for the backward recursion in a symmetrical way. An elegant and simple solution to this problem was proposed independently by Blankenship et al. [40] and Giulietti et al. [41] in 2002. This solution is derived from the sliding-window technique described in Section III-F. The idea is to relax the constraint of performing the entire forward (respectively, backward) processing within a single iteration (see Fig. 13). Thus, the $P$ final state metrics of the forward processing of the $P$ slices obtained during the $j$th iteration are used as initial states, after a circular right shift, of the forward processing of iteration $j+1$ ($j$ varying from 1 to $n_{\mathrm{it}}$). The same principle also stands for the backward recursion.
Memory Organization: In the natural order, the organization of the memory can be very simple: the first $M$ data ($u_0$ to $u_{M-1}$) are stored in a first memory block (MB), then the next $M$ data ($u_M, \ldots, u_{2M-1}$) are stored in a second MB, and so on. This mapping is called direct mapping. With direct mapping, the processing of the first dimension is straightforward: each SISO unit has direct access to its own memory. For the interleaved dimension, the problem is more complex. In fact, the first SISO needs to process sequentially the symbols $u_{\pi(1)}, u_{\pi(2)}, \ldots, u_{\pi(M-1)}$. Since, at a given time $l$, $u_{\pi(l)}$ can be stored in any of the $P$ memories, a network has to be created in order to connect the first SISO unit to the $P$ memory banks. Similarly, a network should also be created to connect the $P$ memory banks to the $P$ SISO units. However, a conflict in the memory accesses can appear if, at a given time, two SISO units need to access data stored in the same memory bank.
The problem of memory conflicts has been widely studied in the literature, and so far three kinds of solutions have been proposed: solutions at the execution stage, at the compilation stage, and at the design stage. These three solutions are described as follows.
Formulation of the Problem: Since each stage of a trellis needs inputs (intrinsic information, a priori information, and the associated redundancy) and generates an output (extrinsic and/or LLR information), the execution of $P$ trellis stages at each clock cycle requires a memory bandwidth of $P$ read/write accesses per clock cycle. Assuming that a memory bank can be read and written at each clock cycle, at least $P$ memory banks of size $M$ are required to provide both the memory capacity and the memory bandwidth. Let us describe the sequence of reads/writes in the memory banks in the natural and interleaved orders. At time $l$, the $P$ symbols accessed by the $P$ SISO processors are, in the natural order, $V^1_l = \{l, l+M, \ldots, l+(P-1)M\}$ and, in the interleaved order, $V^2_l = \{\pi(l), \pi(l+M), \ldots, \pi(l+(P-1)M)\}$. The memory organization should allow, for all $l = 1 \ldots M$, a parallel read/write access for both sets $V^1_l$ and $V^2_l$. In the remainder of the paper, we assume that the memory address $i$ corresponds to the bank $\lfloor i/M \rfloor$ at address $(i \bmod M)$, where $\lfloor \cdot \rfloor$ denotes the integer part function.
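With this bank/address convention, checking whether an interleaver is conflict-free for a given parallelism amounts to verifying that, at every time l, the P interleaved accesses fall in P distinct banks. A minimal sketch (the function names are ours, not from the paper):

```python
def bank(i, M):
    """Direct mapping: index i is stored in bank i // M at address i % M."""
    return i // M

def collision_times(perm, P, M):
    """Times l at which the parallel interleaved accesses
    perm[l], perm[l+M], ..., perm[l+(P-1)M] hit a common memory bank."""
    bad = []
    for l in range(M):
        banks = [bank(perm[l + p * M], M) for p in range(P)]
        if len(set(banks)) < P:
            bad.append(l)
    return bad
```

The identity permutation (natural order) is trivially conflict-free under direct mapping, while an arbitrary interleaver generally is not, which is exactly the problem the three families of solutions below address.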
The generic parallel architecture is shown in Fig. 14. An iteration on this architecture works in two steps. When decoding the first encoder, the $p$th memory bank is accessed thanks to its associated address generator (AG). The address generator delivers a sequence of addresses $\{\pi^1_p(l)\}_{l=1 \ldots M}$. The data coming from the memory banks are then sent to the SISO units through the permutation network. The permutation network shuffles the data according to a permutation defined at each cycle $l$ by $\Pi^1(l)$. The outputs of the SISO units are stored in the memory banks in a symmetrical way, thanks to the permutation network.9 At the end of an iteration, the final forward recursion metrics $f_m$ (respectively, backward metrics $b_m$)
9
The permutation network is composed, in fact, of two shuffle networks: one to shuffle the data between the SISO units and the memory banks during a write access, and the other to shuffle the data between the memory banks and the SISO units during a read access.
are sent to their right (respectively, left) neighbor. Those metrics are stored temporarily during the processing of the second encoder. They are used as the initial state of the trellis at the beginning of the next iteration, when the first encoder is processed again. Note that when a tail-biting code is used, the left-most and right-most SISOs also exchange their $f_m$ and $b_m$ values. The decoding of the second encoder is similar: the $\{\pi^1_p(l)\}_{l=1 \ldots M}$ are replaced by $\{\pi^2_p(l)\}_{l=1 \ldots M}$, and $\Pi^1(l)$ by $\Pi^2(l)$.
1) Solution at the Execution Stage: In this family of solutions, we encompass all solutions using direct mapping with some extra hardware or features to tackle the problem of memory conflicts during the interleaved-dimension processing. Two kinds of solutions have been proposed. The first one is technological: it uses memories with a higher bandwidth than necessary in order to avoid memory conflicts. For example, if two read/write accesses in a memory can be performed in a single SISO clock cycle, then all double-access memory conflicts are solved. Note that this solution is not efficient in terms of area and power dissipation. A more efficient solution was proposed by Thul et al. [42]. It relies on a "smart" permutation network that contains small FIFO modules to smooth the memory accesses. These FIFO modules allow a fraction of the write accesses to be delayed and a fraction of the read accesses to be anticipated so that, at each cycle, the whole memory bandwidth is used. In order to limit the size of the FIFO modules, Thul et al. also proposed to "freeze" the SISO modules in order to solve the remaining memory conflicts. This solution is generic and efficient but requires some additional hardware and extra latency in the decoding process [43].
2) Solution at the Compilation Stage: This solution was proposed by Tarable et al. [44]. The authors show that, regardless of the interleaver and the number $P$ of SISO units, there is always a memory mapping free of read/write access conflicts. A memory mapping is a permutation $\phi$ on the set $\{0 \ldots N-1\}$ that associates to the index $l$ the memory location $i = \phi(l)$. This nontrivial mapping ($\phi \neq \mathrm{Id}$) implies nontrivial spatial permutations in both the natural order ($\Pi^1(l) \neq \mathrm{Id}$, $l = 0 \ldots M$) and the interleaved order ($\Pi^2(l) \neq \mathrm{Id}$, $l = 0 \ldots M$). This method is general but has two main drawbacks: the complexity of the network and the amount of memory needed to store the different configurations of the network during the decoding process. In fact, the network should perform all possible permutations. It can be implemented in one stage by a crossbar, at the cost of a very high area, or in several stages by an ad hoc network (a Benes network, for example [45]). In this latter case, both the complexity and the latency of the network are multiplied by a factor of 2 compared to the simple barrel shifter (see Section IV-B3). Moreover, memories are required to store the address generator sequences and the $2M$ permutations $\{\Pi^1(l)\}_{l=1 \ldots M}$ and $\{\Pi^2(l)\}_{l=1 \ldots M}$. Assuming a multiple-frame and multiple-rate decoder, the size of the memory to store each interleaver configuration may be prohibitive.
In conclusion to this section, [44] proposes to deal with the problem of turbo decoders of different sizes by defining a single interleaver of maximum size. Codes of shorter length are then generated by simply pruning the original interleaver. This type of interleaver is named prunable collision-free interleaver.
3) Solution at the Design Stage: The idea is to define jointly the interleaver and the architecture in order to solve a priori the memory conflicts while keeping a simple architecture. Fig. 14 presents the interleaver structure used to construct a parallel interleaver constrained by the decoding parallelism.
With this kind of technique, the memory mapping is the natural one, i.e., $\phi = \mathrm{Id}$. Thus, in the natural order, the spatial permutation is the identity, $\Pi^1_l = \mathrm{Id}$, and the temporal permutations are also the identity, $(\pi^1_p = \mathrm{Id})_{p=1 \ldots P}$. In the interleaved order, the spatial permutation at time $l$ is simply a rotation of index $r(l)$, i.e., $\Pi^2_l(p) = (p + r(l)) \bmod P$. All the temporal permutations $\pi^2_p$, for $p = 0 \ldots P-1$, are equal to a unique temporal permutation $\pi^2$ (the same address is used for all memory blocks). Moreover, the expression of $\pi^2$ can be computed on the fly as the
Fig. 14. Generic architecture of a parallel turbo decoder. At the end of each iteration, the final states $f_m$ of the forward processing (respectively, $b_m$ of the backward processing) are exchanged with the right (respectively, left) SISO decoder in a circular way. Parameter $P$ represents the number of parallel SISO decoders. Parameter $M$ represents the memory size of each slice.
sum of a linear congruence expression and a periodic offset: $\pi^2(l) = (a \cdot l + \beta(l \bmod \omega)) \bmod M$, where $a$ is the linear factor, prime with $M$, $\omega$ is the periodicity of the offset (generally, $\omega = 4$), and $\beta$ is an array of size $\omega$ that contains the offset values.
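A minimal sketch of such a temporal permutation follows. The parameter values used in the example (M = 16, a = 5, and the offset array) are illustrative only and are not taken from any standard:

```python
import math

def arp_temporal(M, a, beta):
    """Temporal permutation pi2(l) = (a*l + beta[l mod w]) mod M,
    with the linear factor a prime with M (ARP-like construction)."""
    assert math.gcd(a, M) == 1
    w = len(beta)
    return [(a * l + beta[l % w]) % M for l in range(M)]
```

For instance, `arp_temporal(16, 5, [0, 4, 8, 12])` yields a valid permutation of {0, ..., 15}; because only a handful of parameters define the interleaver, an exhaustive search over (a, beta) configurations is feasible.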
At first glance, one could think that such constraints on the interleaver would lead to poor performance. On the contrary, the interleavers of this type are among the best known ones. Since the number of parameters needed to define such an interleaver is low, an exhaustive search among all parameter configurations can be done. Note that the impulse method defined by Berrou [46] and its derivatives [47] are efficient tools to optimize the construction of such interleavers.
One should note that the almost regular permutation [48] defined for the DVB-RCS standard, the dithered relatively prime interleaver [49], and the slice turbo code [50] belong to the family of "design stage interleavers."
Problem of Activity of the Parallel Architecture: However, another issue arises when dealing with parallel architectures: the efficient usage of the computational resources. The idea is that it is inefficient to increase the number of SISOs if the computational power of these SISOs is used only partially. This issue is particularly critical in the context of the sequential decoding of turbo codes, where half-iterations are processed sequentially. Due to data dependencies and pipeline stages in hardware implementations, the sliding-window algorithm leads to idle time at the beginning and end of the processing of each subblock. This idle time can waste a significant part of the total processing power when the subblock size is short, i.e., with short codeword lengths and/or a high degree of parallelism. This problem is quite complex and not easy to solve in the general case. However, it can be tackled by the use of a joint interleaver/decoder design (design stage solution), as proposed in [39], or by going back to the pipeline solution of Section IV-A, as proposed recently in [51] for very high speed turbo decoders.
C. Parallel Trellis Stage
In a given recursion (forward or backward), each trellis stage implies the computation, for each node of the trellis, of the recursive equations (13) and (14). It is easy to associate a processing unit to each node of the trellis so that a trellis stage can be processed at each clock cycle. The question is now: is it possible to increase the parallelism? Since the forward (or backward) recursion contains a loop, it is not possible to increase the parallelism directly. Some authors propose a technique called trellis compacting. This technique is based on restructuring the conventional trellis by grouping two consecutive trellis stages into a single one. In other words, instead of serially computing the forward metrics $\alpha_{l+1}$ from the $\alpha_l$ metrics and the branch metrics $\gamma_l$, and then computing $\alpha_{l+2}$ from $\alpha_{l+1}$ and $\gamma_{l+1}$, $\alpha_{l+2}$ is directly computed from $\alpha_l$, $\gamma_{l+1}$, and $\gamma_l$.
This technique was proposed initially in the context of the Viterbi decoder, where a speed improvement by a factor of 1.7 was reported [52]. It can be directly applied when the max-log-MAP algorithm is implemented. In fact, in this case, the forward and backward recursions are equivalent to the Viterbi recursion (the so-called Add-Compare-Select unit). Trellis compaction can also be adapted, thanks to a few approximations, when the log-MAP algorithm is implemented, as shown in [53] and [54].10
It is worth mentioning that trellis compaction leads to a trellis equivalent to that of a double binary code [56]. This is one explanation, together with its good performance for medium and low rate codes, of the success of double binary codes in several standards.
D. Increase of Clock Frequency
A direct way to increase the decoding throughput of a turbo decoder is to increase the clock frequency. To do so, the critical path of the turbo decoder should be reduced. A simple analysis shows that the BM units as well as the OCU units can be pipelined as needed. The critical path is in the forward or backward recursion loop. There are not many solutions to reduce this path directly: use a fast adder (at the cost of an increase in area), reduce the number of bits used to code the forward and backward metrics (at the cost of a decrease in performance), and, if the log-MAP algorithm is implemented, delay by one cycle the addition of the correcting offset in order to reduce the critical path (see [38] for more details).
Another architecturally efficient solution for reducing this critical path consists of adding a pipeline register in a recursion unit and then interleaving two (or more) independent SISOs on the same hardware. With the pipeline register, the critical path can almost be halved; thus, the hardware can operate at double the frequency compared to the direct solution. The overhead is low, since only a single additional pipeline stage is introduced in the forward (or backward) metric recursion loop, after the first adder stage of Fig. 17.
V. STOPPING RULES AND BUFFERING
An efficient way to increase the throughput of the decoder is to exploit the randomness of the number of required iterations when the decoder is embedded with some stopping criterion.
Assume, as in Fig. 15, that we have a decoder that is capable of performing $n_{\min} \geq 1$ iterations while receiving a frame. We add in front of it a FIFO buffer of size $(1 + \varepsilon)2N$, where $N$ is the codeword size and $\varepsilon \geq 0$ is a constant that measures the memory overhead. If $\varepsilon = 0$, the decoder has no possibility of changing the number of iterations, as the FIFO memory only stores the following
10
One can note that trellis compaction is a special case of the general
method presented in [55] to break the ACS bottleneck.
frame while decoding the current one. If instead $\varepsilon > 0$, the time available for decoding ranges from $n_{\min}$ to $n_{\max} = (1 + 2\varepsilon) n_{\min}$, depending on the status of the FIFO.
In order to stop iterative decoding, the decoder is embedded with a stopping rule, so that the number of iterations needed for decoding is described by a (memoryless) random variable $x$ with distribution $f(x)$. We represent the status of the FIFO as an integer ranging between $n_{\min}$ and $n_{\max}$, which is the number of available iterations at any given time. The transition probability matrix $P$ of the underlying Markov chain has the following elements:
\[
p_{i \to j} =
\begin{cases}
0, & j - i \geq n_{\min} \\
f^{+}(i - j + n_{\min}), & j = n_{\min},\; i \geq n_{\min} \\
f^{-}(i - j + n_{\min}), & j = n_{\max},\; i > n_{\max} - n_{\min} \\
f(i - j + n_{\min}), & \text{otherwise}
\end{cases}
\qquad \forall i, j = n_{\min}, \ldots, n_{\max} \tag{18}
\]
where we have defined
\[
f^{+}(x) = \sum_{k=x}^{\infty} f(k) \qquad \text{and} \qquad f^{-}(x) = \sum_{k=0}^{x} f(k).
\]
This Markov chain is irreducible and aperiodic. Consequently, a steady-state probability vector, defined as
\[
\mathbf{p}_S = \lim_{n \to \infty} \mathbf{S} P^{n} \qquad \forall \mathbf{S} \tag{19}
\]
exists.
From the vector $\mathbf{p}_S(x)$, which represents the probability of having $x$ available iterations, the frame error probability is then computed as
\[
P_F = \sum_{x = n_{\min}}^{n_{\max}} p_S(x)\, P_F(x) \tag{20}
\]
where $P_F(x)$ is the frame error probability when $x$ iterations are available. The steady-state distribution $p_S$ typically shows a rather sharp transition. As a consequence, when the average number of iterations $n$ is below the minimum number of available iterations $n_{\min}$, the FIFO is typically empty and
\[
P_F \simeq P_F(n_{\max}).
\]
On the other side, if $n > n_{\min}$, the FIFO is typically full and
\[
P_F \simeq P_F(n_{\min}).
\]
The procedure to design a FIFO buffer is then the
following.
1) Fix a stopping criterion, choosing one of those
listed in Section V-A below.
2) Run a single simulation with a large (virtually infinite)
number of iterations. In this simulation collect,
for each desired E_b/N_0, the statistics f(x) of the
number of iterations required and the frame error
probability P_F(x) obtained when x iterations are available.
3) For all desired pairs (n_min, n_max), compute the
matrix P through (18) and, from P, the steady-state
distribution of the FIFO p_S(x) using (19).
From the steady-state distribution, compute the
frame error probability as in (20).
4) Generally, the decoder speed must be such that n_min
is slightly larger than the average number of iterations
required by the chosen stopping rule at the target error
rate. The maximum number of iterations n_max must be set
according to the desired target number of iterations.
The memory overhead is then obtained as

    (1/2) (n_max / n_min - 1).
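The chain of (18)-(20) can be set up numerically. The sketch below (Python; the function name and the illustrative distribution f(k) are ours, not from the paper) builds the transition matrix exactly as in (18) and extracts the steady-state distribution by power iteration:

```python
import numpy as np

def fifo_steady_state(f, n_min, n_max):
    """Steady-state distribution of the number of available iterations.

    f: dict mapping k (iterations needed to decode a frame) -> probability.
    States are x = n_min .. n_max available iterations, as in (18)-(19).
    """
    states = list(range(n_min, n_max + 1))
    idx = {x: t for t, x in enumerate(states)}
    f_plus = lambda x: sum(p for k, p in f.items() if k >= x)   # tail sum f^+
    f_minus = lambda x: sum(p for k, p in f.items() if k <= x)  # head sum f^-
    P = np.zeros((len(states), len(states)))
    for i in states:
        for j in states:
            k = i - j + n_min              # iterations consumed on this frame
            if j == n_min:
                P[idx[i], idx[j]] = f_plus(k)    # FIFO saturates at the bottom
            elif j == n_max:
                P[idx[i], idx[j]] = f_minus(k)   # FIFO saturates at the top
            else:
                P[idx[i], idx[j]] = f.get(k, 0.0)
    # steady state p_S = lim S P^n, approximated by power iteration
    pi = np.ones(len(states)) / len(states)
    for _ in range(2000):
        pi = pi @ P
    return dict(zip(states, pi))
```

The resulting dictionary plays the role of p_S(x) in (20); combining it with measured P_F(x) values gives the overall frame error probability of the buffered decoder.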
Fig. 15. Block diagram of an iterative decoder with input and output FIFO to improve throughput.
Boutillon et al.: Iterative Decoding of Concatenated Convolutional Codes: Implementation Issues
Vol. 95, No. 6, June 2007 | Proceedings of the IEEE 1217
A. List of Stopping Rules
Several stopping rules have been proposed in the lit-
erature, see for example [57]–[62] and references
therein. In the following, we list the most efficient and
simple rules.
1) Hard rule 1: The signs of the LLRs at the input
and at the output of a constituent SISO module are
compared, and the iterative decoder is stopped if
all signs agree. Note that the output of a SISO is
always an extrinsic LLR.
2) Hard rule 2: To improve the reliability of the
stopping rule, the previous check has to be passed
for two successive iterations.
3) Soft rule 1: The minimum absolute value of all the
extrinsic LLRs at the output of a SISO is compared
against a threshold. Increasing the threshold
increases the reliability of the rule but also
increases the average number of iterations.
4) Soft rule 2: The minimum absolute value of all
the total LLRs is compared against a threshold.
Note that the total LLR is the sum of the input and
output LLRs of a SISO module.
The choice of the stopping rule, and possibly of the
corresponding threshold for soft rules, is mainly dictated
by the complexity of its evaluation and by the probability of
false stops, which induce an error floor in the performance;
for more details see [57].
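The two simplest rules above amount to a few comparisons per half-iteration. The following sketch (Python; illustrative only, with floating-point LLR vectors, and function names of our own choosing) shows hard rule 1 and soft rule 1:

```python
def hard_rule_1(llr_in, llr_ext):
    """Hard rule 1: stop when the hard decisions taken on the input LLRs
    and on the extrinsic output LLRs of a SISO all agree."""
    return all((a >= 0) == (b >= 0) for a, b in zip(llr_in, llr_ext))

def soft_rule_1(llr_ext, threshold):
    """Soft rule 1: stop when even the least reliable extrinsic LLR
    exceeds the threshold (larger threshold = more reliable stop,
    but more iterations on average)."""
    return min(abs(x) for x in llr_ext) > threshold
```

Hard rule 2 is obtained by requiring hard_rule_1 to succeed on two consecutive iterations, and soft rule 2 by applying soft_rule_1 to the total (input plus extrinsic) LLRs.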
In Fig. 16, we report the FER performance of a rate-0.35
SCCC with four-state constituent encoders and an
interleaver size of 8640, with a fixed number of iterations,
together with the performance obtained with the structure of
Fig. 15 and the first hard stopping rule.
Solid lines refer to the decoder with a fixed number of
iterations (5, 6, 7, 10), while dashed curves show the
performance obtained using stopping rules and finite
buffering. For these curves, n_max has been kept fixed to 10
and n_min takes the label values 5, 6, 7, and 10. As
anticipated, it can be seen that all the dashed curves start
from the performance of the corresponding n_min and, beyond a
given point, converge to the performance relative to n_max
iterations. The threshold point corresponds to the situation
for which n̄ ≈ n_min. Note that since the average number of
iterations decreases with the SNR, the maximum gap between
the curves is obtained at low values of the SNR.
The pair (10, 6), corresponding to a memory overhead
of 33% and a speed-up of 66%, shows a maximum
penalty of 0.15 dB at FER = 10^-1 that becomes 0.01 dB at
10^-5. The pair (10, 5), corresponding to a memory
overhead of 50% and a speed-up of 100%, shows instead
a maximum penalty of 0.3 dB at FER = 10^-1 that becomes
0.1 dB at 10^-5.
VI. QUANTIZATION ISSUES IN
TURBO DECODERS
The problem of fixed-point implementation is an impor-
tant one since hardware complexity increases linearly with
the internal bit width representation of the data. The
tradeoff can be formulated as follows: what is the
Fig. 16. Comparison of FER performance of an SCCC with four state constituent encoders and interleaver size of 8640 with
fixed number of iterations with those obtained using structure of Fig. 15 and first hard stopping rule.
minimum bit-width internal representation that leads to an
acceptable degradation of performance? It is interesting to
note that the very function of the turbo-decoding process is
the suppression of channel noise. This robustness against
channel noise also implies, as a beneficial side effect,
robustness against the internal quantization noise. Thus,
compared to a classical DSP application, the internal
precision of a turbo decoder can be very low without
significant degradation of the performance.
In this section, we give a brief survey of the
problem of fixed-point implementation in the case of a
binary turbo decoder. First, we discuss the problem of
optimal quantization of the input signal and the resulting
internal precision. Then, the problem of the scaling of the
extrinsic messages is discussed. Finally, we conclude this
section by presenting a not yet published pragmatic method
to optimize the complexity versus performance tradeoff of a
turbo decoder.
A. Internal Precision of a Turbo Decoder
Let us consider a binary turbo code associated with a
BPSK modulation. The received symbol at time l is thus
equal to y_l = x_l + w_l, where x_l equals -1 if c_l = 0 and
+1 otherwise; w_l is white Gaussian noise of variance σ².
The LLR λ(c_l; I) is then equal to

    λ(c_l; I) = 2 y_l / σ².   (21)
The quantization of λ(c_l; I) on b_LLR bits is a key issue
that impacts both the performance and the complexity of
the design. In the following, we assume that the
quantized value λ(c; I)_Q of λ(c; I) on b_LLR bits is given by
λ(c; I)_Q = Q(λ(c; I)), where the quantization function Q
is defined as

    x_Q = Q(x) = sat( floor( x (2^{b_LLR - 1} - 1) / A + 0.5 ),
                      2^{b_LLR - 1} - 1 )   (22)

where sat(a, b) = a if a belongs to [-b, b], and sat(a, b) =
sign(a) b otherwise; A is the dynamic range of the
quantization (data are quantized between [-A, A]). Symmetrical
quantization is needed to avoid giving a systematic
advantage to one bit value over the other; such an imbalance
would decrease the performance of the turbo decoder.
One can note that if A is very large, most of the inputs
will be quantized to zero, i.e., an erasure. In that
case, the decoding process will fail. On the other hand, a
too small value of A leads to saturation most of the
time, and the soft quantization would thus be equivalent to
a hard decision. Clearly, for a given code rate and a given
SNR, there is an optimal value of A. As a rule of thumb, for
a rate-1/2 turbo code, the optimal value of A is around 1.2.
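The quantizer of (22) is a one-liner in software. The sketch below (Python; the function name is ours) follows the formula term by term:

```python
import math

def quantize_llr(x, b_llr, A):
    """Symmetric LLR quantizer of (22): b_llr bits, dynamic range [-A, A]."""
    levels = 2 ** (b_llr - 1) - 1           # e.g. 7 levels for b_llr = 4
    q = math.floor(x * levels / A + 0.5)    # scale to integers, round
    return max(-levels, min(levels, q))     # sat(., levels)
```

With b_LLR = 4 and A = 1.2, inputs beyond the dynamic range saturate at ±7, while inputs much smaller than A collapse to 0, which is the erasure behavior described above for an oversized A.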
Equations (21) and (22) show that the input value λ(c; I)_Q
of the turbo decoder depends on the channel observation
and the value A, as mentioned above, but also on the SNR
of the signal, i.e., on the variance of the noise σ².
The SNR is not available at the receiver and should
be estimated from the statistics of the input signal. In
practical cases, the estimation of the SNR is not
mandatory. When the max-log-MAP algorithm is used,
i.e., when the max* operation is approximated by the simple
max operator, its estimation is not necessary. In fact, the
term 2/σ² is just, in the logarithmic domain, a scale factor
that impacts both input and output of the max unit. When
the log-MAP algorithm is used, a pragmatic solution is to
replace the real standard deviation σ of (21) by the
maximum value σ_o leading to a bit-error rate (BER) or a
frame-error rate (FER) acceptable for the application.
Note that when the effective noise standard deviation is below
σ_o, the decoding process becomes suboptimal due to an
underestimation of the λ(c; I). Nevertheless, the BER (or
the FER) still decreases and thus remains within the
functionality domain of the application.
The number of bits b_ext used to code the extrinsic messages
can be deduced from b_LLR. In many reported publications,
b_ext = b_LLR + 1, i.e., extrinsic messages are quantized in
the interval [-2A, 2A]. Note that if the OCU delivers a
value out of the range [-2^{b_ext - 1} + 1, 2^{b_ext - 1} - 1], a
saturation needs to be performed.
Once b_LLR and b_ext are chosen, the number of bits b_fm needed to
code the forward and backward recursion metrics can be
derived automatically. In the following, we only consider
the case of the forward recursion metrics α. The same results
also hold for the backward recursion unit.
According to (11) and (12), in the logarithmic domain,
the branch metric γ_l is a finite sum of bounded values. Let
us assume that Δ is a bound on the absolute value of the
branch metric γ_l (Δ is a function of the code, b_LLR and b_ext).
Then, it is shown in [63] that, at any time l,^11

    max_s(α_l(s)) - min_s(α_l(s)) ≤ ν Δ   (23)

where the parameter ν is the memory depth of the
convolutional encoder.
Assuming that min_s(α_l(s)) is maintained equal to zero
by dedicated hardware, (23) shows that b_fm =
ceil(log2(ν Δ)) bits are sufficient to code the α metrics.
In practice, this solution is not efficient. In fact,
maintaining min_s(α_l(s)) equal to zero implies additional
hardware in the recursion loop to perform the determination
of min_s(α_l(s)) and to subtract it from all the
metrics. This hardware increases both the complexity and
the critical path of the FR unit.

^11 The exact bound is derived in [38].
A more efficient solution is to replace these systematic
operations by the subtraction of a fixed value
when needed. Let us define β = ceil(log2(Δ(ν + 1))).
Then, coding the forward metrics on b_fm = 1 + β bits leads to a
very simple scheme. In fact, since Δ is the maximum
dynamic of the branch metrics, (23) gives

    max_s(α_{l+1}(s)) - min_{s'}(α_l(s')) ≤ Δ(ν + 1).

This inequality proves that, if at time l, min_s(α_l(s)) is
below 2^β, then at time l + 1, max_s(α_{l+1}(s)) also remains
below 2^{1+β} - 1 = 2^{b_fm} - 1. If at time l + 1, min_s(α_{l+1}(s))
reaches 2^β, then all the forward metrics
range between 2^β and 2^{β+1}. In that case, a rescaling
operation is performed by subtracting 2^β from all the
forward metrics. This situation is detected thanks to an
AND gate connected to the most significant bits (MSBs) of
the metrics, and the subtraction is simply realized by
setting all the MSBs to zero. Thanks to this rescaling
process, the dynamic needed to represent the ever-increasing
forward metrics is very limited. Note that
other efficient methods have been proposed in the
literature, such as more elaborate rescaling techniques or
simply avoiding the rescaling operation altogether using
modulo arithmetic [64].
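The MSB-clearing trick can be mimicked in software. The following is a behavioral sketch (not a hardware description; the helper name is ours): when every metric lies in [2^β, 2^{β+1}), i.e., every MSB is set, clearing that bit subtracts 2^β from all of them at once:

```python
def rescale_metrics(metrics, beta):
    """Behavioral model of the MSB-clearing rescaling: if all forward
    metrics lie in [2**beta, 2**(beta + 1)), clear their common MSB,
    which is equivalent to subtracting 2**beta from each of them."""
    if all(m >> beta == 1 for m in metrics):          # the AND of the MSBs
        return [m & ((1 << beta) - 1) for m in metrics]  # set MSBs to zero
    return metrics
```

Since only relative differences between state metrics matter in the recursion, subtracting the same constant from every metric leaves the decoding decisions unchanged.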
Typical values of b_LLR are between 3 and 6 bits. For an
eight-state turbo code, the corresponding values of b_fm are
around 7 to 10 bits. An example of the impact of b_LLR on
performance is given in Fig. 19. The reader can find in [65]
a deeper analysis of the bit-width precision of the turbo
decoder.
B. Practical Implementation of the max* Algorithm
In Fig. 17, we report a block diagram of the
implementation of the fundamental associative operator
max* according to its definition (6). The look-up table
performs the computation of the correcting factor
given by

    f(x) = ln(1 + e^{-|x|}).   (24)
Fig. 18 shows the plot of the function f(x). The
maximum value of this function is ln(2) and the function
decreases rapidly toward zero. In hardware, all real values
are replaced by integer values thanks to (22). Thus, the
maximum quantized value of f is given by Q(ln(2)).
Moreover, from (22), it is possible to compute the maximum
integer m_Q such that x_Q > m_Q implies f_Q(x_Q) = 0. As
an example, for A = 1.2 and b_LLR = 4, the maximum
quantized value of f_Q is 4 (thus f_Q takes its values
between 0 and 4 and requires 3 bits to be coded) and
m_Q = 14 (see Fig. 18). The hardware realization of the
computation of the offset factor is thus simple. A first test
determines whether |x_Q| is below 15. If the test is
positive, the 5 least significant bits (LSBs) of x_Q are used
as input to a 3-bit-output look-up table that contains the
precomputed values of f_Q(x_Q); otherwise, the offset is set
to zero. Note that it is also possible to compute the
absolute value of x_Q before the access to the LUT in order
to reduce its size by half.
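The LUT contents and the cutoff m_Q follow directly from (22) and (24). The sketch below (Python; function names are ours) builds the table in the quantized domain and uses it in a max* operator; with b_LLR = 4 and A = 1.2 it reproduces the values quoted above (offset at most 4, m_Q = 14):

```python
import math

def build_offset_lut(b_llr, A):
    """Precompute f_Q(x_Q) = Q(ln(1 + exp(-x_Q / scale))) for the max*
    correcting factor, plus the largest input m_Q with a nonzero offset."""
    scale = (2 ** (b_llr - 1) - 1) / A
    lut, m_q, x_q = [], 0, 0
    while True:
        f_q = math.floor(math.log(1 + math.exp(-x_q / scale)) * scale + 0.5)
        if f_q == 0:          # offset has decayed to zero: stop the table
            break
        lut.append(f_q)
        m_q = x_q
        x_q += 1
    return lut, m_q

def max_star(a_q, b_q, lut, m_q):
    """max*(a, b) = max(a, b) + f(|a - b|), with f read from the LUT."""
    d = abs(a_q - b_q)
    return max(a_q, b_q) + (lut[d] if d <= m_q else 0)
```

Dropping the LUT term entirely turns max_star into a plain max, i.e., the max-log-MAP approximation discussed next.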
There are many other implementations of the offset
function in the literature. For example, [66]
proposes a linear approximation of the f(x) function,
while [67] proposes a very coarse, but efficient,
approximation. When the offset factor is omitted, the
hardware is simplified at the cost of a performance
degradation. This point is discussed in the next section.
Note that, when the offset factor is omitted, the algorithm
is referred to in the literature as the min-sum or the
max-log-MAP algorithm.
C. Rescaling of the Extrinsic Messages

Fig. 17. Block diagram of the max* operator.

Fig. 18. Plot of the function f(x). The quantized version of this function
for A = 1.2 and b_LLR = 4 is also shown (red dots).

The use of the max-log-MAP algorithm leads to an
overestimation of the extrinsic messages and thus degrades
the performance of the turbo decoder by approximately
0.5 dB. This penalty can be significantly reduced if the
overestimation of the extrinsic messages is compensated,
on average, by a systematic scaling down of the extrinsic
messages between two consecutive half-iterations. This
scaling is performed by means of a multiplication by a
scaling factor a_i, where the value of a_i depends on the
index of the half-iteration. Typically, during the first
iteration, the value of a_i is low (around 0.5), and it
increases up to one for the last two half-iterations. For a
classical turbo decoder, this technique reduces the
degradation from 0.5 to 0.2 dB. For a double-binary code, it
reduces the degradation from 0.5 to 0.05 dB. Note that the
scaling factor is also beneficial with respect to the
decorrelation of the extrinsic messages: even if the
log-MAP algorithm is used, the scaling factors help the
decoder to converge.
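A possible software model of such a schedule is sketched below (Python; the linear ramp from 0.5 to 1.0 is our own illustrative choice, since the actual a_i values are design parameters tuned per code):

```python
def extrinsic_scale(half_iter, n_half_iters):
    """Illustrative max-log-MAP scaling schedule: start near 0.5 on the
    first half-iteration and reach 1.0 for the last two half-iterations."""
    if half_iter >= n_half_iters - 2:
        return 1.0                                   # last two: no damping
    return 0.5 + 0.5 * half_iter / max(1, n_half_iters - 2)
```

Each extrinsic message produced by a SISO is multiplied by extrinsic_scale(i, 2 * n_iterations) before being passed to the other constituent decoder.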
D. Method of Optimization
As seen previously, many parameters impact both the
performance and the complexity of the turbo decoder: the
number of quantization bits, the maximum number of
iterations, the scaling factors of the extrinsic messages,
the value of the quantization dynamic A, and also the
targeted BER and FER. For example, Fig. 19 shows the BER
obtained for a turbo code of rate 1/2 and size N = 512
decoded with different numbers of iterations and different
values of b_LLR.
All those parameters interact in a nonlinear way (see
the effect of n_it and b_LLR in Fig. 19, for example). The
problem of finding a good performance-complexity tradeoff
is thus a complex task. Moreover, the evaluation of each
configuration generally requires a CPU-intensive Monte
Carlo simulation in order to obtain an accurate estimation of
the BER. In order to avoid such simulations, we propose
an efficient pragmatic method:
1) Define the search space by defining the
range of search for each parameter of the decoder.
Define a complexity model for each parameter.^12
Define also the maximum allowable complexity
of the design.
2) Define the "worst case" configuration by individually
setting each parameter to the value that
degrades performance most.
3) Using this configuration, perform a Monte Carlo
simulation at the SNR of interest. Each time a
received codeword fails to be decoded, store the
codeword (or the information needed to reconstruct it,
i.e., the seeds of the pseudo-random generators) in a
set S. Stop the process when the cardinality of the
set S is high enough (typically around 1000). Note
that this operation can be very CPU consuming,
but it has to be done only once.
4) Perform an optimization in order to find the set
of parameters that minimizes the BER (or the
FER) over the set S, with the constraint that the
overall complexity of the decoder remains below
a given value.
5) Perform a normal Monte Carlo simulation in order
to verify a posteriori the real performance of the
selected parameters. Go back to Step 1) with a different
optimization scenario if needed.
This method is efficient since the test of one
configuration can be a few orders of magnitude faster than
the direct method. For example, for an FER of 10^-4, the
test of a configuration with a classical Monte Carlo
simulation requires, on average, the simulation of 10^6
codewords. In contrast, with the proposed method, testing a
new configuration requires only the decoding of the 10^3
codewords of S. An improvement of the simulation speed by a
factor of 10^3 is then obtained.
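Steps 3) and 4) can be sketched as follows (Python; all function names and the stubbed decoders are hypothetical placeholders, since the actual decoder interface is implementation-specific):

```python
def collect_failure_set(decode_worst, gen_codeword, target_size=1000):
    """Step 3: run the worst-case configuration once and keep the seeds
    of the frames it fails to decode (decode_* return True on success)."""
    failures, seed = [], 0
    while len(failures) < target_size:
        if not decode_worst(gen_codeword(seed)):
            failures.append(seed)   # a seed is enough to rebuild the frame
        seed += 1
    return failures

def score_config(decode_cfg, gen_codeword, failures):
    """Step 4: error rate of a candidate configuration, measured over the
    stored failure set only, instead of a full Monte Carlo run."""
    errors = sum(0 if decode_cfg(gen_codeword(s)) else 1 for s in failures)
    return errors / len(failures)
```

Because every configuration under test is at least as good as the worst case, frames outside S are (almost) always decoded correctly, which is what justifies scoring candidates on S alone.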
VII. EVALUATION OF COMPLEXITY OF
ITERATIVE DECODERS
In order to provide a high-level evaluation of the
complexity of iterative decoders, we will use as
reference the architecture reported in Fig. 20. Constituent
encoders are assumed to be binary, and messages are stored
in the form of LLRs.
Iterative decoders are generally built around two
sets of processors and two memories. The messages
coming from the channel are stored in a buffer, as they
will be accessed several times during the iterative process.
The extrinsic messages are instead stored in a temporary
memory denoted by EXT in the figure.
^12 See [31] and Section VII for an example of modeling of the
hardware complexity of a turbo decoder.
Fig. 19. BER = f(SNR) curves for a 2/3-rate turbo code of length
N = 512 for different b_LLR values and numbers of decoding iterations.
The value of A is equal to 1.39.
The high level algorithm of the decoder can be
summarized with the following steps.
1) Initialize the inner memory to null messages.
2) Apply the first set of constraints A using the EXT
and LLR, write the updated messages in EXT.
3) Apply the second set of constraints B using
the EXT and LLR, write the updated messages
in EXT.
4) Iterate until some stopping criterion is satisfied
or the maximum number of iterations is
reached.
The applications of constraints A and B are here executed
serially. Different schedulings can be applied, especially for
LDPC codes, but this is the common approach.
As we have seen in Section IV, PCCC, as well as SCCC
and LDPC, admit efficient highly parallel structures, so
that the throughput can be arbitrarily increased by
increasing the number of parallel processors without
requiring additional memory. The tradeoff between area
and throughput of the decoder is then fully under the
designer's control. In this section, we will focus on C,
defined as the number of elementary operations required for
decoding one information bit per iteration, as a function of
the main design parameters of the code.
The throughput T of the implemented decoder can be
well approximated by

    T = f C_dep / (N_it C)

where C_dep is the number of deployed operators running
at frequency f, and N_it is the number of required
iterations.
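The throughput formula above is trivial to evaluate; the sketch below (Python, with illustrative parameter values of our own) makes the units explicit:

```python
def throughput_bps(f_hz, c_deployed, n_iters, c_per_bit):
    """T = f * C_dep / (N_it * C): decoded information bits per second,
    given the operator clock f, the number of deployed operators C_dep,
    the iteration count N_it, and C operations per bit per iteration."""
    return f_hz * c_deployed / (n_iters * c_per_bit)
```

For example, 64 operators at 200 MHz, 8 iterations and C = 100 operations per bit per iteration sustain 16 Mb/s, showing how throughput scales linearly with the deployed parallelism.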
For iterative decoders with messages in the form of
LLRs, the computational complexity can be expressed in
terms of the number of sums and max* operations. Note
that max is substituted for max* when the max-sum version
of the SISO processors is used instead of the max*-sum
version.
A. LDPC
For LDPC codes, each variable node processor of degree d_v
requires 2 d_v sums to compute the updated messages. Thus,
summing over all nodes, the variable node
processing requires two sums per edge.
For check nodes, the number of operations needed to
update the EXT messages depends linearly on the check
degree d_c, with a factor that depends on the approximation
used. In [68], a comprehensive summary of the
available optimal and approximated techniques for check
node updates is presented, together with their corresponding
complexity and performance tradeoffs. Here, we
assume the optimal and numerically stable algorithm of (6)
and (7) in [68], which requires 6(d_c - 2) max* operators for
a check node of degree d_c. The reader is warned, however,
that the factor 6 can be reduced by using other suboptimal
techniques.
Summing over the sets of variable and check nodes,
we get the following complexity:

    C = 2 Γ / R                 sums
        6 ((Γ - 2) / R + 2)     max*   (25)

per decoded bit and per iteration. In (25), we introduced
the fundamental parameter Γ, the average variable node
degree, which measures the density of the LDPC parity-check
matrix; R is the rate of the code.
In Table 1, we show the normalized complexity C
required for some values of the two relevant parameters
Γ and R.
Note that LDPC decoders have a complexity that is
inversely proportional to the rate of the code and
proportional to the parameter Γ. The parameter Γ also has
an impact on the performance of the code and takes
typical values in the range 3-5.
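Equation (25) can be evaluated directly. The sketch below (Python; the function name is ours, and Γ is written gamma) returns the per-bit, per-iteration operation counts:

```python
def ldpc_complexity(gamma, rate):
    """Operations per information bit per iteration for an LDPC decoder,
    as in (25): gamma = average variable node degree, rate = code rate.
    Assumes the 6*(d_c - 2) max* check-node update of [68]."""
    sums = 2 * gamma / rate
    max_star_ops = 6 * ((gamma - 2) / rate + 2)
    return sums, max_star_ops
```

For Γ = 3 and R = 1/2 this gives 12 sums and 24 max* operations per bit per iteration, and both counts grow as the code rate drops, consistent with the 1/R dependence noted above.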
B. PCCC and SCCC

Fig. 20. General architecture of an iterative decoder. Shaded blocks
are processors; white blocks are memories.

The complexity of PCCC and SCCC is strictly related
to the complexity of their constituent SISO decoders.
Here, we consider rate-k/n binary constituent encoders.
In Fig. 21, we report a block diagram of the SISO module
showing its basic architecture according to the algorithm
described in Section III. In the figure, we have reported
the sections of the architecture corresponding to the four
operations of the algorithm described in
Section III-D. Light blocks are computation blocks, while
the dark block refers to memory, which can be organized as
LIFO or FIFO RAM. In this SISO structure, we have
assumed that the initializations of the forward and backward
state metrics are propagated across iterations, so that no
overhead is required for this operation.
We will consider two versions of SISO. The first
(inner SISO), which is used in PCCC and as the inner
SISO for SCCC, gets messages on information and coded
bits and provides messages on input bits. The second
(outer SISO), which is used as outer SISO in SCCC, gets
messages only on coded bits and provides updated
messages on both information (user’s data) and coded
bits. In Fig. 21, we can identify the following units.
1) The Branch Metric Computer (BMC) is responsible
for computing the LR or LLR to be associated
with each trellis edge according to (11)
and (12). The number of sums required is
2^n - n - 1 for the outer SISO and 2^n - n - 1 + k for
the inner SISO.
2) The Forward Recursion (FR) computer is
responsible for computing the forward recursion
metrics A according to (13). Each state metric
update requires 2^k - 1 max* operations between the
incoming edges. The metrics of the 2^k N_s edges are
obtained by summing the previously computed
branch metrics with the path metrics
(2^k N_s sums).
3) The Backward Recursion (BR) computer, identical
to the forward one, is responsible for
computing the backward recursion metrics B
according to (14). As the recursion proceeds in
the backward direction, the input branch metrics
must be reversed in time, which is the reason for
the LIFO in front of it.
4) The edge LRs computed according to (3) require
the forward, backward, and branch metrics. As the
backward metrics are provided in reversed order,
a LIFO may also be inserted on the line coming
from the FR. Note that the edge LRs are then
produced in reversed time order. For both FR
and BR, as well as for the inner and outer SISO,
the complexity is 2^k N_s sums and
(2^k - 1) N_s max* operations.
5) The Output Computation Unit (OCU) computes
the a posteriori LRs by applying (15) and/or (16),
depending on the needs. The input LRs are used to
compute the extrinsic information. An efficient
algorithm to compute the updated messages requires
2^k N_s + 2n + k sums and 2^k N_s + 2^{n+1} - 2n - 4
max* operations. As the edge LRs are provided in reversed
time order, the computed LRs are also reversed in
order, so that a LIFO may be necessary to
provide the updated LRs in the same order as the
inputs. The correct ordering of the messages,
however, can also be handled implicitly when
storing them in the RAM.

Table 1. Complexity of the LDPC Decoder in Number of Elementary
Operations per Information Bit and per Iteration.
In Table 2, we give a summary of the operations
required for the inner and outer SISO, while in Table 3 we
report the numerical values for some typical values of k, n,
and N_s.
Having determined the complexities per information
bit of the inner and outer SISO decoders, C_I and C_O,
the complexities of the PCCC and SCCC can be evaluated as^13

    C_SCCC = C_O + (1 / r_o) C_I
    C_PCCC = 2 C_I

where r_o is the rate of the outer encoder of the SCCC.

^13 We assume here that the two constituent encoders are identical;
the generalization to different encoders is straightforward.
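The two totals above combine in one line each. The sketch below (Python; the function name is ours) follows the identical-constituent assumption of the text:

```python
def concat_complexity(c_inner, c_outer=None, r_outer=None, scheme="PCCC"):
    """Operations per information bit per iteration for the concatenation:
    C_PCCC = 2 * C_I, and C_SCCC = C_O + C_I / r_o, where r_o is the rate
    of the outer encoder (identical constituent encoders assumed)."""
    if scheme == "PCCC":
        return 2 * c_inner
    return c_outer + c_inner / r_outer
```

The 1/r_o factor reflects that the inner SISO of an SCCC works on the outer coded bits, of which there are K/r_o per K information bits.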
Fig. 21. General architecture of the SISO module. The module uses four
logic units (FR, BR, OCU, and BMC). The sections of the block diagram
refer to the four main steps of the algorithm described in Section III.

Table 2. Summary of Complexity per Trellis Step for the SISO Algorithm
on Binary LLRs.
C. Memory Requirements
The memory requirement of an iterative decoder is
the sum of the memory required for the storage of the channel
messages, which is N for all types of decoders, and the
memory for the storage of the extrinsic messages, which
depends on the encoding scheme.
For LDPC, the number of extrinsic messages is given by
NΓ, for SCCC by (K/r_o)n, and for PCCC by K.
Memory requirements are usually not negligible and,
in some important cases, e.g., for low-throughput
implementations and/or large block sizes, they can dominate the
computational requirements considered in the previous
sections.
D. Nonbinary Decoders
Slightly different conclusions are obtained when using
nonbinary constituent decoders. The main consequence
of this approach is that messages are no longer scalars but
vectors of dimension equal to the cardinality of the
alphabet used minus one. The dimension of the message memory
must then be increased accordingly.
The BMC and OCU operators are slightly changed,
while FR and BR remain the same.
Nonbinary decoders also yield different performance.
An important application of nonbinary decoding is the
DVB-RCS/RCT and WiMAX eight-state double-binary turbo
code and its extension to 16 states [69].
VIII. CONCLUSION
We have presented an overview of implementation issues
for the design of iterative decoders in the context of the
concatenation of convolutional codes. Hardware archi-
tectures were described and their complexity was as-
sessed. Low throughput (below 10 Mb/s) turbo decoders
have already been widely implemented, either in software
or hardware, and are commercially available. In this
paper, we have laid stress on the different methods al-
lowing the throughput of a turbo decoder to be increased.
We have particularly investigated parallel architectures
and stopping criteria. As for architecture optimization, a
joint design of the concatenated code and the architec-
ture was favored, especially concerning interleaver de-
sign. Such an approach has already been introduced in
standards such as DVB-RCS, DVB-RCT, and WiMAX.
Table 3. Complexity per Information Bit of the SISO Algorithm for Some
Typical Values of k, n, and N_s.

Among the main challenges in the years to come,
low-energy-consumption receiver design will represent a
crucial one. Significant progress in this field will probably
require a real technological breakthrough. Some answers
to this problem are currently emerging, such as the
analog decoding concept [70], [71], which allows the
iterative process to be removed, the SISO decoders being
directly wired together in order to implement the feed-
back connections. h
REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima,
BNear Shannon limit error-correcting
coding and decoding: Turbo-codes,[ in
Proc. ICC, Geneva, Switzerland, May 1993,
pp. 1064–1070.
[2] Comatlas, CAS5093: Turbo Encoder/Decoder,
Nov. 1993, datasheet.
[3] S. Benedetto and G. Montorsi, BIterative
decoding of serially concatenated
convolutional codes,[ Electron. Lett.,
vol. 32, no. 13, pp. 1186–1187, Jun. 1996.
[4] S. Benedetto, D. Divsalar, G. Montorsi, and
F. Pollara, BSerial concatenation of
interleaved codes: Performance analysis,
design, and iterative decoding,[ IEEE Trans.
Inform. Theory, vol. 44, no. 5, pp. 909–926,
May 1998.
[5] S. Benedetto and G. Montorsi, BSerial
concatenation of block and convolutional
codes,[ Electron. Lett., vol. 32, no. 10,
pp. 887–888, May 1996.
[6] R. M. Pyndiah, BNear-optimum decoding
of product codes: Block turbo codes,[
IEEE Trans. Commun., vol. 46, no. 8,
pp. 1003–1010, Aug. 1998.
[7] CCSDS, Recommendation for Space Data
System Standards. TM Synchronization
and Channel Coding, Sep. 2003, 131.0-B-1,
Blue Book.
[8] Third generation partnership project (3GPP)
Technical Specification Group, Multiplexing
and Channel Coding (FDD), Jun. 1999,
TS 25.212, v2.0.0.
[9] DVB, Interaction Channel for Satellite
Distribution Systems, ETSI EN 301 790,
2000, v. 1.2.2.
[10] DVB, Interaction Channel for Digital Terrestrial
Television, ETSI EN 301 958, 2001, v. 1.1.1.
[11] Third generation partnership project 2
(3GPP2), Physical Layer Standard for
cdma2000 spread spectrum systems, Release D,
Feb. 2004, 3GPP2 C.S0002-D, ver. 1.0.
[12] IEEE, IEEE Standard for Local and Metropolitan
Area Networks. Part 16: Air Interface for
Fixed Broadband Wireless Access Systems,
IEEE 802.16-2004, Nov. 2004.
[13] A. Franchi and J. Sengupta, BTechnology
trends and market drivers for broadband
mobile via satellite: Inmarsat BGAN,[ in
Proc. DSP 2001, 7th Int. Workshop Digital Signal
Processing Techniques Space Communications,
Sesimbra, Portugal, Oct. 2001.
[14] S. Benedetto, R. Garello, G. Montorsi,
C. Berrou, C. Douillard, A. Ginesi, L. Giugno,
and M. Luise, BMHOMS: High-speed
ACM for satellite applications,[ IEEE Trans.
Wireless Commun., vol. 12, no. 2, pp. 66–77,
Apr. 2005.
[15] P. Elias, BError-free coding,[ IRE Trans.
Inform. Theory, vol. 4, pp. 29–37, Sep. 1954.
[16] G. D. Forney, Jr., Concatenated Codes.
Cambridge, MA: MIT Press, 1966.
[17] G. Battail, BWeighting the symbols decoded
by the Viterbi algorithm,[ (in French),
Ann. Telecommun., vol. 42, pp. 31–38,
Jan.–Feb. 1987.
[18] J. Hagenauer and P. Hoeher, BA Viterbi
algorithm with soft-decision outputs and its
applications," in Proc. IEEE GLOBECOM'89,
Dallas, TX, Nov. 1989, pp. 47.1.1–47.1.7.
[19] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv,
"Optimal decoding of linear codes for
minimizing symbol error rate," IEEE Trans.
Inform. Theory, vol. 20, no. 2, pp. 284–287,
Mar. 1974.
[20] C. Berrou and A. Glavieux, "Reflections
on the prize paper: 'Near optimum
error-correcting coding and decoding:
Turbo codes'," IEEE IT Soc. Newslett.,
vol. 48, no. 2, Jun. 1998.
[21] S. Dolinar and M. Belongie, "Enhanced
decoding for the Galileo low-gain antenna
mission: Viterbi redecoding with four
decoding stages," in JPL TDA Progr. Rep.,
vol. 42-121, pp. 96–109, May 1995.
[22] C. Berrou and A. Glavieux, "Near optimum
error correcting coding and decoding:
Turbo-codes," IEEE Trans. Commun.,
vol. 44, no. 10, pp. 1261–1271, Oct. 1996.
[23] S. Benedetto and G. Montorsi, "Unveiling
turbo codes: Some results on parallel
concatenated coding schemes," IEEE Trans.
Inform. Theory, vol. 42, no. 2, pp. 409–428,
Mar. 1996.
[24] S. Benedetto and G. Montorsi, "Design
of parallel concatenated convolutional
codes," IEEE Trans. Commun., vol. 44,
no. 5, pp. 591–600, May 1996.
[25] S. Benedetto, D. Divsalar, G. Montorsi, and
F. Pollara, "Serial concatenation of
interleaved codes: Performance analysis,
design, and iterative decoding," IEEE Trans.
Inform. Theory, vol. 44, no. 3, pp. 909–926,
May 1998.
[26] S. ten Brink, "Convergence behavior of
iteratively decoded parallel concatenated
codes," IEEE Trans. Commun., vol. 49, no. 10,
pp. 1727–1737, Oct. 2001.
[27] C. Weiss, C. Bettstetter, and S. Riedel,
"Code construction and decoding of parallel
concatenated tail-biting codes," IEEE Trans.
Inform. Theory, vol. 47, no. 1, pp. 366–386,
Jan. 2001.
[28] M. Jézéquel, C. Berrou, C. Douillard, and
P. Pénard, "Characteristics of a sixteen-state
turbo-encoder/decoder (turbo4)," in Proc. Int.
Symp. Turbo Codes, Brest, France, Sep. 1997,
pp. 280–283.
[29] S. Benedetto and G. Montorsi, "Performance
of continuous and blockwise decoded turbo
codes," IEEE Commun. Lett., vol. 1, no. 3,
pp. 77–79, May 1997.
[30] E. K. Hall and S. G. Wilson, "Stream-oriented
turbo codes," IEEE Trans. Inform. Theory,
vol. 47, no. 7, pp. 1813–1831, Jul. 2001.
[31] G. Masera, G. Piccinini, M. Roch, and
M. Zamboni, "VLSI architectures for turbo
codes," IEEE Trans. VLSI Syst., vol. 7, no. 5,
pp. 369–379, Sep. 1999.
[32] C.-M. Wu, M.-D. Shieh, C.-H. Wu,
Y.-T. Hwang, and J.-H. Chen, "VLSI
architectural design tradeoffs for
sliding-window log-MAP decoders,"
IEEE Trans. VLSI Syst., vol. 13, no. 2,
pp. 439–447, Apr. 2005.
[33] R. McEliece, "On the BCJR trellis for linear
block codes," IEEE Trans. Inform. Theory,
vol. 42, no. 4, pp. 1072–1092, Jul. 1996.
[34] C. Hartmann and L. Rudolph, "An optimum
symbol-by-symbol decoding rule for linear
codes," IEEE Trans. Inform. Theory, vol. 22,
no. 5, pp. 514–517, Sep. 1976.
[35] S. Riedel, "MAP decoding of convolutional
codes using reciprocal dual codes," IEEE
Trans. Inform. Theory, vol. 44, no. 3,
pp. 1176–1187, Mar. 1998.
[36] S. Riedel, "Symbol-by-symbol MAP decoding
algorithm for high-rate convolutional codes
that use reciprocal dual codes," IEEE J.
Select. Areas Commun., vol. 16, no. 1,
pp. 175–185, Jan. 1998.
[37] G. Montorsi and S. Benedetto, "An additive
version of the SISO algorithm for the dual
code," in Proc. IEEE Int. Symp. Information
Theory, 2001, p. 27.
[38] E. Boutillon, W. J. Gross, and P. G. Gulak,
"VLSI architectures for the MAP algorithm,"
IEEE Trans. Commun., vol. 51, no. 2,
pp. 175–185, Feb. 2003.
[39] D. Gnaedig, E. Boutillon, J. Tousch, and
M. Jézéquel, "Towards an optimal parallel
decoding of turbo codes," in Proc. 4th Int.
Symp. Turbo Codes Related Topics, Munich,
Germany, Apr. 2006.
[40] T. Blankenship, B. Classon, and V. Desai,
"High-throughput turbo decoding techniques
for 4G," in Proc. Int. Conf. 3G Wireless and
Beyond, San Francisco, CA, Jun. 2005,
pp. 137–142.
[41] A. Giulietti, L. van der Perre, and A. Strum,
"Parallel turbo coding interleavers: Avoiding
collisions in accesses to storage elements,"
Electron. Lett., vol. 38, no. 5, pp. 232–234,
Feb. 2002.
[42] M. J. Thul, N. Wehn, and L. P. Rao,
"Enabling high speed turbo-decoding
through concurrent interleaving," in
Proc. Int. Symp. Circuits and Systems
(ISCAS'02), Phoenix, AZ, May 2002,
pp. 897–900.
[43] M. J. Thul, F. Gilbert, and N. Wehn,
"Concurrent interleaving architecture
for high-throughput channel coding," in
Proc. ICASSP'03, Apr. 2003, pp. 613–616.
[44] A. Tarable, S. Benedetto, and G. Montorsi,
"Mapping interleaving laws to parallel
turbo and LDPC decoder architectures,"
IEEE Trans. Inform. Theory, vol. 50, no. 9,
Sep. 2004.
[45] V. Benes, "Optimal rearrangeable multistage
connecting networks," Bell Syst. Tech. J.,
vol. 43, pp. 1641–1656, 1964.
[46] C. Berrou, S. Vaton, M. Jézéquel, and
C. Douillard, "Computing the minimum
distances of linear codes by the error
impulse method," in Proc. IEEE
GLOBECOM'02, Taipei, Taiwan,
Nov. 2002, pp. 1017–1020.
[47] S. Crozier, P. Guinand, and A. Hunt,
"Estimating the minimum distance
of turbo-codes using double and triple
impulse methods," IEEE Commun. Lett.,
vol. 9, no. 6, pp. 631–633, Jun. 2005.
[48] C. Berrou, Y. Saouter, C. Douillard,
S. Kérouedan, and M. Jézéquel,
"Designing good permutations for
turbo codes: Towards a single model," in
Proc. ICC'04, Paris, France, Jun. 2004,
pp. 341–345.
[49] S. Crozier and P. Guinand,
"High-performance low-memory
interleaver banks for turbo-codes," in
Proc. VTC2001, Rhodes, Greece, Oct. 2001,
pp. 2394–2398.
Boutillon et al.: Iterative Decoding of Concatenated Convolutional Codes: Implementation Issues
1226 Proceedings of the IEEE | Vol. 95, No. 6, June 2007
[50] D. Gnaedig, E. Boutillon, V. C. Gaudet,
M. Jézéquel, and P. G. Gulak, "On multiple
slice turbo codes," in Proc. 3rd Int. Symp.
Turbo Codes and Related Topics, Brest, France,
Sep. 2003, pp. 153–157.
[51] O. Muller, A. Baghdadi, and M. Jézéquel,
"Exploring parallel processing levels for
convolutional turbo decoding," in Proc.
2nd ICTTA Conf., Damascus, Syria,
Apr. 2006.
[52] P. Black and T. Meng, "A 140-Mb/s, 32-state,
radix-4 Viterbi decoder," IEEE J. Solid-State
Circuits, vol. 27, no. 12, pp. 1877–1885,
Dec. 1992.
[53] T. Miyauchi, K. Yamamoto, T. Yokokawa,
M. Kan, Y. Mizutani, and M. Hattori,
"High-performance programmable
SISO decoder VLSI implementation for
decoding turbo codes," in Proc. IEEE Global
Telecommunications Conf., GLOBECOM '01,
San Antonio, TX, Nov. 2001, pp. 305–309.
[54] M. Bickerstaff, L. Davis, C. Thomas,
D. Garrett, and C. Nicol, "A 24 Mb/s radix-4
logMAP turbo decoder for 3GPP-HSDPA
mobile wireless," in IEEE Solid-State Circuits
Conf. Dig. Tech. Papers, San Francisco, CA,
Feb. 2003.
[55] G. Fettweis and H. Meyr, "Parallel Viterbi
algorithm implementation: Breaking the
ACS-bottleneck," IEEE Trans. Commun.,
vol. 37, no. 8, pp. 785–790, Aug. 1989.
[56] C. Berrou and M. Jézéquel, "Non-binary
convolutional codes for turbo coding,"
Electron. Lett., vol. 35, no. 1, pp. 39–40,
Jan. 1999.
[57] A. Matache, S. Dolinar, and F. Pollara,
"Stopping rules for turbo decoders," in
JPL TMO Progress Rep., vol. 42-142, pp. 1–22,
Aug. 2000.
[58] R. Y. Shao, S. Lin, and M. Fossorier,
"Two simple stopping criteria for
turbo decoding," IEEE Trans. Commun.,
vol. 47, pp. 1117–1120, 1999.
[59] A. Shibutani, H. Suda, and F. Adachi,
"Reducing average number of turbo
decoding iterations," Electron. Lett.,
vol. 35, pp. 701–702, 1999.
[60] A. Shibutani, H. Suda, and F. Adachi,
"Complexity reduction of turbo
decoding," in Proc. Vehicular Technology
Conf. VTC'1999, Ottawa, ON, Canada,
1999, vol. 3, pp. 1570–1574.
[61] K. Gracie, S. Crozier, and A. Hunt,
"Performance of a low-complexity turbo
decoder with a simple early stopping
criterion implemented on a SHARC
processor," in Proc. 6th Int. Mobile
Satellite Conf. IMSC 99, Ottawa, ON,
Canada, 1999, pp. 281–286.
[62] B. Kim and H. S. Lee, "Reduction of the
number of iterations in turbo decoding
using extrinsic information," in Proc.
IEEE TENCON 99, Inchon, South Korea,
1999, pp. 494–497.
[63] J. Cain, "CMOS VLSI implementation of
r = 1/2, k = 7 decoder," in Proc. IEEE Nat.
Aerospace Electron. Conf., NAECON'84,
1984, pp. 20–27.
[64] A. Hekstra, "An alternative to metric
rescaling in Viterbi decoders," IEEE Trans.
Commun., vol. 37, no. 11, pp. 1220–1222,
Nov. 1989.
[65] S. Pietrobon, "Implementation and
performance of a turbo/MAP decoder,"
Int. J. Satellite Commun., vol. 16, pp. 23–46,
Jan.–Feb. 1998.
[66] J.-F. Cheng and T. Ottosson, "Linearly
approximated log-MAP algorithm for
turbo decoding," in Proc. Vehicular
Technol. Conf. VTC'2000, Tokyo, Japan,
2000, pp. 2252–2256.
[67] W. J. Gross and P. G. Gulak, "Simplified
MAP algorithm suitable for implementation
of turbo decoder," Electron. Lett., vol. 34,
no. 16, pp. 1577–1578, Aug. 1998.
[68] J. Chen, A. Dholakia, E. Eleftheriou,
M. Fossorier, and X.-Y. Hu,
"Reduced-complexity decoding of
LDPC codes," IEEE Trans. Commun.,
vol. 53, no. 8, pp. 1288–1299,
Aug. 2005.
[69] C. Douillard and C. Berrou, "Turbo
codes with rate-m/(m+1) constituent
convolutional codes," IEEE Trans.
Commun., vol. 53, no. 10, pp. 1630–1638,
Oct. 2005.
[70] H.-A. Loeliger, F. Tarköy, F. Lustenberger,
and M. Helfenstein, "Decoding in analog
VLSI," IEEE Commun. Mag., vol. 37,
pp. 99–101, Apr. 1999.
[71] J. Hagenauer, "Decoding of binary codes
with analog networks," in Proc. 1998
Information Theory Workshop, San Diego,
CA, Feb. 1998, pp. 13–14.
ABOUT THE AUTHORS
Emmanuel Boutillon was born in Chatou, France,
on November 2, 1966. He received the Engineering
Diploma and Ph.D. degree from the École Nationale
Supérieure des Télécommunications (ENST),
Paris, France, in 1990 and 1995, respectively.
In 1991, he worked as an Assistant Professor in
the École Multinationale Supérieure des Télécommunications,
Dakar, Senegal. In 1992, he joined ENST
as a Research Engineer, where he conducted
research in the field of VLSI for digital communications.
In 1998, he spent a sabbatical year at the University of Toronto,
ON, Canada. Since 2000, he has been a Professor at the University of
South Brittany, Lorient, France. His current research interests include the
interactions between algorithm and architecture in the field of wireless
communications. In particular, he works on turbo codes and LDPC
decoders.
Catherine Douillard was born in Fontenay-le-Comte,
France, on July 13, 1965. She received the
engineering degree in telecommunications from
the École Nationale Supérieure des Télécommunications
(ENST) de Bretagne, Brest, France, in
1988, and the Ph.D. degree in electrical engineering
from the Université de Bretagne Occidentale,
Brest, in 1992.
In 1991, she joined ENST Bretagne, where she is
currently a Professor in the Electronics Department.
Her main interests are turbo codes and iterative decoding, iterative
detection, and the efficient combination of high spectral efficiency
modulation and turbo coding schemes.
Guido Montorsi was born in Turin, Italy, on
January 1, 1965. He received the Laurea in
Ingegneria Elettronica in 1990 from Politecnico
di Torino, Turin, Italy, with a master's thesis
concerning the study and design of coding
schemes for HDTV, developed at the RAI Research
Center, Turin. He received the Ph.D. degree in
telecommunications from the Dipartimento di
Elettronica of Politecnico di Torino, in 1994.
In 1992, he spent the year as a Visiting Scholar
in the Department of Electrical Engineering, Rensselaer Polytechnic
Institute, Troy, NY. Since December 1997, he has been an Assistant
Professor at the Politecnico di Torino. His current interests are in the area
of channel coding, particularly on the analysis and design of concatenated
coding schemes and study of iterative decoding strategies.