
CCSDS 131.2-B-1 Serial Concatenated Convolutional Turbo Decoder Architecture for Efficient FPGA Implementation

January 2023
IEEE Access PP(99):1-1


DOI:10.1109/ACCESS.2023.3235966

License
CC BY 4.0

Authors:

Miguel Ángel Pérez Naranjo

University Carlos III de Madrid

Víctor P. Gil Jiménez

University Carlos III de Madrid




MIGUEL ÁNGEL PÉREZ NARANJO¹, VÍCTOR P. GIL JIMÉNEZ² (Senior Member, IEEE)

¹Department of Signal Theory and Communications, Universidad Carlos III de Madrid, 28911 Leganés, Madrid (e-mail: mpnaranjo@tsc.uc3m.es)

²Department of Signal Theory and Communications, Universidad Carlos III de Madrid, 28911 Leganés, Madrid (e-mail: vgil@ing.uc3m.es)

Corresponding author: Miguel Ángel Pérez Naranjo (e-mail: mpnaranjo@tsc.uc3m.es).

This work was partly funded by Project “IRENE” (PID2020-115323RB-C33) (MINECO/AEI/FEDER, UE) and project MFOC Madrid Flight on Chip - Innovation Cooperative Projects Comunidad of Madrid - HUBS 2018/ Madrid Flight on Chip.

ABSTRACT Most of the turbo coding schemes in current standards are parallel concatenated, so many architectures for their efficient implementation can be found in the literature. Serial turbo decoders, however, are much less common. A serial scheme is used in the CCSDS 131.2-B-1 standard. This standard has recently attracted considerable attention because of its high performance in satellite communications. In this paper, an efficient architecture for its decoder is proposed and analyzed. The aim is to present an architecture that can be modeled in a hardware description language (such as VHDL or Verilog) so that it can be easily implemented on a Field Programmable Gate Array (FPGA). This work describes the architecture in detail. It first explains the encoding operations performed at the transmitter and then shows how they are undone at the receiver. The proposed algorithm uses independent components to divide the tasks, which yields a pipelined architecture that improves efficiency. The results of simulating and implementing the proposed architecture on a Xilinx Zynq UltraScale+ RFSoC ZCU28DR board with an XCZU28DR-2FFVG1517E RFSoC are shown. The final results demonstrate that the hardware operations give results equivalent to those of the software simulation. They also show that the design does not consume board resources as aggressively as turbo decoders usually do.

INDEX TERMS DSP, Coding, FEC, serial turbo decoder, Pipeline, Convolutional, BCJR, VHDL, Synthesis, Implementation, Hardware, FPGA, CCSDS.

I. INTRODUCTION

FORWARD Error Correction (FEC) is nowadays an essential technique in the design of high-performance communication systems. It allows errors in the transmitted information to be detected and corrected, which makes it possible to approach the well-known Shannon limit [1].

Among all channel coding techniques, turbo encoding/decoding [2] is one of the most promising strategies for improving performance [3], although it comes at the cost of increased complexity. For this reason, many standards use turbo encoding/decoding [4]–[6]. The idea of turbo encoding is to concatenate two simple convolutional encoders, preferably recursive systematic convolutional (RSC) encoders, and to decode them with an iterative algorithm. This improves the error correction capability and achieves very low error rates without requiring a large number of shift registers in the encoder, which would exponentially increase the complexity of the decoding process [3].

The origins of turbo decoding technology date back to the late 1980s. In 1989, Hagenauer and Hoeher proposed a modification of the Viterbi algorithm called the Soft-Output Viterbi Algorithm (SOVA) [7], [8]. SOVA provides soft outputs after decoding a single convolutional code. This led to the observation that soft-input soft-output (SISO) decoders [9]–[11] provide a gain in terms of signal-to-noise ratio (SNR).

As the subsequent stages of the general turbo decoder structure were developed, the concatenation of the encoders made SOVA impractical for decoding. The BCJR algorithm [12], [13] then began to be used instead. It is also known as the forward-backward algorithm and follows the Maximum a Posteriori (MAP) criterion. It was introduced in 1974 by Bahl, Cocke, Jelinek, and Raviv, and was adapted to iterative turbo decoding in 1993.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3235966

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Perez-Naranjo and Gil Jiménez: CCSDS 131.2-B-1 SCC turbo decoder architecture for efficient FPGA implementation

The first commercial use of turbo codes came in 1997 with Inmarsat's M4 satellite multimedia service. This service used the Turbo4 circuit [14] (the successor of the CAS 5093) with 16-QAM modulation. It enabled users to communicate with Inmarsat-3 spot-beam satellites from a terminal at 64 kbit/s. This narrowband technology, based on 16-QAM constellation mapping and turbo coding, reduces the bandwidth required for mobile satellite channels significantly (by more than 50 %) while improving satellite power efficiency [15].

This was the beginning of turbo codes in commercial applications. It required considerable effort from several teams in hardware implementation [16], [17], and communication standards had to be adapted so that these technologies could be realized in commercial electronics [18], [19]. As hardware technology evolved and the resources available on the target devices increased, it also became possible to implement more complex turbo decoder variants. This was needed because ever higher transmission rates are required and decoding is applied to much longer symbol blocks. Examples are the IEEE 802.16 family of wireless communication standards [20] and HomePlug AV, which became an IEEE standard in 2010 [21]. Turbo decoding has also played a key role in more recent technologies such as faster-than-Nyquist (FTN) signaling [22]–[24] and coherent decoding [25], [26].

In satellite communications, the performance of the turbo decoder combined with Frequency-Hopped Spread Spectrum (FH-SS) has been analyzed in [27]. That work compares it with the combined performance of several dynamic power allocation algorithms. It also modifies the classical structure to obtain a new iterative algorithm for channel variance and carrier phase estimation (side information), which was shown to outperform the case where no side information is available [28]. A turbo trellis-coded modulation combined with continuous phase modulation has also been used in a frequency-hopping packet radio structure to further reduce the error probability [29]. Another interesting result is the high throughput achievable with error correction algorithms based on newer techniques such as Turbo-Hadamard coding [30].

However, none of these algorithms has yet been validated in a direct hardware implementation. They remain complex to implement by transcribing them directly into a hardware description language, either VHDL or Verilog, without relying on other elements, such as a microprocessor or a Universal Software Radio Peripheral (USRP), that allow parts of the algorithm to be implemented in software. That is why turbo decoder designs usually avoid adding blocks beyond those of the classical schemes [31]. These are the RSC encoders, an interleaver and, in some cases, a puncturing block to achieve the code rates recommended in the standards.

This leads to the specification recommended by the Consultative Committee for Space Data Systems (CCSDS) in [32]. There, combinations of coding rates and frame lengths are chosen to use bandwidth efficiently and to maximize spectral efficiency.

The aim of this work is to present a valid and efficient architecture for decoding the Serial Concatenated Convolutional Code (SCCC) block. In the standard this block is referred to as the Serial Concatenated Convolutional (SCC) Turbo Coding Scheme; it is proposed in the aforementioned standard and described in [33]. The paper shows the pipelined components of the complete decoder assembly and the connections between them.

Detailed simulation results of the VHDL code are also presented, corresponding to the waveforms that the proposed circuit would produce. These results validate the likelihoods computed in each iteration and the final bits obtained after the hard decision, through comparison with a software simulation in MATLAB. Finally, synthesis and implementation figures are reported for the evaluation board, a Xilinx Zynq UltraScale+ RFSoC ZCU28DR with an XCZU28DR-2FFVG1517E RFSoC, to evaluate resource usage and the efficiency of the proposed architecture.

In other words, this paper presents a novel architecture, not previously explored in the literature, that provides a decoding scheme tailored to [32]. The architecture is optimized for electronics to facilitate its implementation in real systems. The main contributions of this paper are summarized as follows:

• A complete description of the proposed architecture, including each component with its block diagram and the connections and signals between components, ready for subsequent hardware implementation. In addition, the optimized algorithm for decoding each RSC individually is specified.

• A software simulation comparing the theoretical and optimized versions of the algorithm. It shows that, with the appropriate number of iterations, only 0.5 dB of $E_b/N_0$ is lost between the two versions, while the optimized version is much simpler to implement in hardware and more efficient in terms of FPGA resource consumption.

• An exhaustive comparison of the software simulation results with the hardware simulation results, showing that the values are identical and that the architecture works correctly.

• Synthesis results of the proposed architecture, used to evaluate the resources it consumes on the evaluation board.

• Implementation results of the synthesized circuit, used to evaluate its timing and temperature performance and to certify that it can be realized on the FPGA.

This paper is organized as follows. Section II describes the SCC Turbo Coding block specified in the reference standard and gives an overview of the BCJR algorithm for a single RSC. It also describes methods for simplifying this algorithm so that the signal processing can be carried out efficiently in hardware. Section III presents the proposed turbo decoder architecture. It details the pipelined structure, each of its components, and how the data flow is controlled to avoid information mismatches in the memory blocks, so that the design fits the different input and output frame lengths generically. Section IV presents and discusses the hardware simulation, synthesis and implementation results. Finally, conclusions are drawn in Section V.

FIGURE 1. Block Diagram of the SCC Turbo Coding Scheme in the transmitter. Extracted from [32].

II. SYSTEM MODEL

A. SCC TURBO CODING SCHEME DESCRIPTION

The classic turbo encoder architecture is based on two RSC encoders concatenated through a memory called an interleaver, which scrambles the input data to avoid bursts of consecutive errors. Most commonly, the encoders are arranged in parallel: the original input and the input permuted by the interleaver are encoded at the same time. The RSC encoders can also be placed in series with the interleaver in between. In that case the interleaver permutes the already encoded information, so it needs a larger memory than in the parallel case.

The advantages and disadvantages of each variant are discussed in [34]. One important advantage of the serial configuration over the parallel one is that both the data bits and the parity bits can exploit the extrinsic information. This is one of the main reasons why the reference standard of this work adopts a modification of the serial turbo scheme, the above-mentioned SCCC configuration.

The SCCC is intended mainly for high data rate applications. Its Forward Error Correction (FEC) scheme is based on the concatenation of two simple four-state encoders. The SCCC scheme implies a physical layer frame of constant length, with pilots inserted in fixed positions. This structure simplifies the synchronization procedure and allows fast and efficient acquisition at very high rates at the receiver [32].

As explained above, the turbo encoding proposed in this standard is serial, which requires a completely different architecture at the decoder side. The following paragraphs describe the blocks that make up the complete encoder. As can be seen in Fig. 1, the puncturing operation after CC1 removes only one bit out of every four, and that bit is always a redundancy bit, so the systematic bits are always transmitted.
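The sketch below shows a periodic puncturing pattern and how the receiver undoes it. The pattern used here, (1, 1, 1, 0) applied over consecutive [systematic, parity, systematic, parity] bits, is a hypothetical example. It was chosen only because it matches the description of dropping one bit in four and never a systematic bit; the actual CC1 puncturing pattern is defined in [32]. At the receiver, each punctured position is refilled with a neutral LLR of 0.0, since nothing is known about the deleted bit.

```python
# HYPOTHETICAL puncturing pattern (not taken from [32]): in each group
# [u_k, p_k, u_k+1, p_k+1] keep the first three bits and drop the last parity bit.
PATTERN = (1, 1, 1, 0)


def puncture(sys_bits, par_bits, pattern=PATTERN):
    """Multiplex systematic/parity pairs and drop positions where the pattern is 0."""
    coded = [b for pair in zip(sys_bits, par_bits) for b in pair]
    return [b for k, b in enumerate(coded) if pattern[k % len(pattern)]]


def depuncture(llrs, n_coded, pattern=PATTERN):
    """Receiver side: re-insert a neutral LLR (0.0) at every punctured position."""
    received = iter(llrs)
    return [next(received) if pattern[k % len(pattern)] else 0.0 for k in range(n_coded)]
```

With this assumed pattern, 4 systematic and 4 parity bits (8 coded bits) become 6 transmitted bits. The rate-1/2 output of CC1 therefore becomes rate 2/3.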

Fig. 1 shows the block diagram of the complete encoder. It consists of:

• two convolutional encoders, the outer one referred to as CC1 and the inner one as CC2;

• a fixed puncturing block;

• an interleaver block;

• a demultiplexer that splits the systematic and parity bits produced by CC2, so that they can be reorganized to enter a second interleaver, called the Row-Column interleaver.

The Row-Column interleaver has its own puncturing patterns, which affect the systematic and the parity information separately. This again prevents possible concatenation errors and makes the system more robust.

It should be emphasized that three patterns depend on the mode of operation selected among those available in the reference standard: the puncturing block pattern, the interleaving pattern, and the additional puncturing patterns of the Row-Column interleaver. These modes of operation also determine the lengths of the frames exchanged between the different blocks of the SCC Turbo Coding Scheme. A detailed explanation of how each block works is beyond the scope of this paper. Some additional information about the encoders and the Row-Column interleaver is nevertheless needed to understand the proposed decoding architecture, and it is given below.

Both encoders have the same features: coding rate 1/2 and 4 possible states. Fig. 2 shows the architecture of this type of encoder. The boxes with a D inside represent shift registers, and the circles enclosing the '+' symbol denote the exclusive OR (XOR) operation. The registers are initialized to '0', and encoding takes place with the switch in the upper position. Once the frame has been encoded, the switch moves to the lower position for the final two bit times, so that the encoder input is taken from the register feedback. This feedback cancels the same feedback sent (unswitched) to the leftmost XOR gate, so both registers are filled with zeros after the final two bit times. In this way a terminated trellis is obtained, which simplifies the decoding process.
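The following sketch gives a bit-level model of the encoder in Fig. 2, including the two-bit trellis termination. The tap positions are an assumption: the classic 4-state RSC with feedback polynomial 1 + D + D^2 and feedforward polynomial 1 + D^2 is used for illustration only, and the exact connections of CC1 and CC2 must be taken from [32]. The termination logic follows the description above. During the tail the input equals the feedback, so the value entering the first register is always 0 and both registers are flushed after two steps.

```python
def rsc_encode(bits, terminate=True):
    """Rate-1/2, 4-state RSC encoder sketch with trellis termination.

    ASSUMPTION: feedback 1 + D + D^2 and feedforward 1 + D^2 (illustrative;
    the actual CC1/CC2 polynomials of CCSDS 131.2-B-1 may differ).
    Returns (systematic, parity, final_state), including the two tail bits.
    """
    s1 = s2 = 0                              # both registers initialised to '0'
    sys_out, par_out = [], []
    n_tail = 2 if terminate else 0
    for k in range(len(bits) + n_tail):
        fb = s1 ^ s2                         # feedback taken from the registers
        u = bits[k] if k < len(bits) else fb # switch: data (upper) or feedback (lower)
        a = u ^ fb                           # value entering the first register (0 in the tail)
        sys_out.append(u)
        par_out.append(a ^ s2)               # feedforward 1 + D^2
        s1, s2 = a, s1                       # shift the registers
    return sys_out, par_out, (s1, s2)
```

For the input [1, 0, 1, 1], this returns the systematic bits [1, 0, 1, 1, 0, 1], where the last two are the tail, together with the parity bits [1, 1, 0, 0, 1, 1]. The final state is (0, 0), as the termination requires.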

FIGURE 2. Convolutional Encoder Block Diagram for CC1 and CC2. Extracted from [32].

The Row-Column interleaver reorganizes the information so that its output consists first of all the systematic bits, to which a certain puncturing pattern has been applied, and then of all the parity bits of CC2, after an additional puncturing operation. Fig. 3 gives a graphical representation of the reverse operation, which is implemented in the receiver in a block called the Row-Column de-interleaver. The figure shows the systematic and parity bits of CC2 (orange and blue positions, respectively) interleaved. After passing through the block, all the systematic bits are grouped together first, followed by all the parity bits. The block also has independent puncturing patterns for the systematic and the parity information. It applies them to restore the positions deleted in the transmitter by the original puncturing, shown as black-filled positions in Fig. 3. The number of systematic and parity bits depends on the transmission mode.

FIGURE 3. Schematic representation of the action performed by the Row-Column de-interleaver block.
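The de-puncturing part of the Row-Column de-interleaver can be sketched as follows. The sketch assumes the received frame carries all surviving systematic values first and then all surviving parity values, as described for the transmitter output. The two masks are hypothetical placeholders for the mode-dependent patterns of [32]. Any additional reordering that the standard's Row-Column interleaver may apply is not modeled. Punctured (black-filled) positions are refilled with a neutral LLR of 0.0, and the values are returned as (systematic, parity) pairs for the CC2 decoder.

```python
def rc_deinterleave(frame, n_pairs, sys_mask, par_mask):
    """Rebuild (systematic, parity) LLR pairs from a grouped, punctured frame.

    frame     -- received LLRs: surviving systematic values, then surviving parity values
    n_pairs   -- number of CC2 output pairs before puncturing
    sys_mask, par_mask -- HYPOTHETICAL periodic masks, 1 = transmitted, 0 = punctured
    """
    n_sys = sum(sys_mask[k % len(sys_mask)] for k in range(n_pairs))
    sys_vals, par_vals = iter(frame[:n_sys]), iter(frame[n_sys:])
    pairs = []
    for k in range(n_pairs):
        s = next(sys_vals) if sys_mask[k % len(sys_mask)] else 0.0  # 0.0 = no information
        p = next(par_vals) if par_mask[k % len(par_mask)] else 0.0
        pairs.append((s, p))
    return pairs
```

For example, rc_deinterleave([0.1, 0.2, 0.3, 0.4, -1.0, -3.0], 4, (1,), (1, 0)) keeps every systematic value. It refills every second parity position with 0.0.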

B. BCJR ALGORITHM OVERVIEW

As mentioned in the introduction, the decoding algorithm of a turbo decoder must be selected as a trade-off between decoding performance and implementation complexity. The BCJR algorithm works with soft inputs and soft outputs, which gives more accurate results than making hard decisions at its output. Soft outputs also make the structure more flexible, which is what enables an iterative architecture.

Consider a sequence $\mathbf{x} = x_1 x_2 \ldots x_N$ of $N$ $n$-bit coded symbols. Let $u_i$ be a binary random variable with possible values $\{0, 1\}$ that represents the information (message) bit associated with symbol $x_i$. From now on, bit '1' is mapped to $+1$ and bit '0' to $-1$. Taking into account the a priori probability $P(u_i)$ of the input bit, we define the log-likelihood ratio (LLR)

$$L(u_i) = \log\frac{P(u_i=+1)}{P(u_i=-1)}. \tag{1}$$

This LLR is zero at the beginning of the algorithm. No previous information is available at that point, and the input bits are assumed i.i.d., so $P(u_i=+1) = P(u_i=-1) = 1/2$.
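As a quick numerical check of (1), the helpers below convert between $P(u_i=+1)$ and $L(u_i)$ and apply the bit-to-symbol mapping used in the text. With no prior information ($P = 1/2$) the LLR is exactly zero. The function names are illustrative only.

```python
import math


def to_symbol(bit):
    """Mapping used in the text: bit '1' -> +1, bit '0' -> -1."""
    return 1 if bit == 1 else -1


def llr_from_prob(p_plus):
    """Eq. (1): L(u) = log(P(u=+1) / P(u=-1)), with P(u=-1) = 1 - P(u=+1)."""
    return math.log(p_plus / (1.0 - p_plus))


def prob_from_llr(llr):
    """Inverse of (1): P(u=+1) = 1 / (1 + exp(-L))."""
    return 1.0 / (1.0 + math.exp(-llr))
```

A positive LLR favors $u_i = +1$ (bit '1') and a negative one favors $u_i = -1$ (bit '0'). The magnitude measures the confidence of the decision.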

To estimate each bit, the BCJR algorithm uses the sequence of received symbols, denoted as $\mathbf{y}$. From it, the algorithm computes the a posteriori LLR

$$L(u_i\mid\mathbf{y}) = \log\frac{P(u_i=+1\mid\mathbf{y})}{P(u_i=-1\mid\mathbf{y})}. \tag{2}$$

Consider the transition from the previous state $\psi'$ to the next state $\psi$. Define two sets, $U_1$ and $U_0$, containing the transitions from state $S_{i-1}=\psi'$ to state $S_i=\psi$ originated by $u_i=+1$ and by $u_i=-1$, respectively, with $i = 1, 2, \ldots, N$. The a posteriori LLR can then be expressed as

$$L(u_i\mid\mathbf{y}) = \log\frac{P(u_i=+1\mid\mathbf{y})}{P(u_i=-1\mid\mathbf{y})} = \log\frac{\sum_{U_1} P(\psi',\psi\mid\mathbf{y})}{\sum_{U_0} P(\psi',\psi\mid\mathbf{y})}. \tag{3}$$

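The sums in (3) are valid because the transitions are mutually exclusive. Applying Bayes' theorem to (3) and cancelling the common factor $P(\mathbf{y})$, we finally obtain

$$\begin{aligned} L(u_i\mid\mathbf{y}) &= \log\frac{\sum_{U_1} P(\psi',\psi,\mathbf{y})/P(\mathbf{y})}{\sum_{U_0} P(\psi',\psi,\mathbf{y})/P(\mathbf{y})}\\ &= \log\frac{\sum_{U_1} P(\psi',\psi,\mathbf{y})}{\sum_{U_0} P(\psi',\psi,\mathbf{y})}. \end{aligned} \tag{4}$$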

Equation (4) involves the joint probability of receiving the sequence $\mathbf{y}$, being in state $\psi'$ at time $i-1$, and being in state $\psi$ at the current time $i$. The LLR is therefore a ratio of two sums of joint probabilities. The numerator sums them over the transitions originated by $u_i=+1$, and the denominator over those originated by $u_i=-1$.

Each joint probability can in turn be factored as the product of three probabilities that reflect the time structure of the trellis diagram of a convolutional encoder. At the $i$-th position of the trellis, these factors refer to the past, the present and the future of that position. Applying Bayes' theorem again and assuming a memoryless channel, in which the current symbol does not depend on past information, the joint probability is decomposed into the three time subsequences as follows


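$$\begin{aligned} P(\psi',\psi,\mathbf{y}) &= P(\mathbf{y}_{>i}\mid\psi',\psi,\mathbf{y}_{<i},y_i)\,P(\psi',\psi,\mathbf{y}_{<i},y_i)\\ &= P(\mathbf{y}_{>i}\mid\psi)\,P(\psi',\psi,\mathbf{y}_{<i},y_i)\\ &= P(\mathbf{y}_{>i}\mid\psi)\,P(y_i,\psi\mid\psi',\mathbf{y}_{<i})\,P(\psi',\mathbf{y}_{<i})\\ &= P(\mathbf{y}_{>i}\mid\psi)\,P(y_i,\psi\mid\psi')\,P(\psi',\mathbf{y}_{<i})\\ &= \beta_i(\psi)\,\gamma_i(\psi',\psi)\,\alpha_{i-1}(\psi'). \end{aligned} \tag{5}$$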

According to (5), the three time subsequences are described by the following conditional and joint probabilities:

• $\beta_i(\psi) = P(\mathbf{y}_{>i}\mid\psi)$ is the probability that the future received sequence is $\mathbf{y}_{>i}$, given that the current state is $\psi$.

• $\gamma_i(\psi',\psi) = P(y_i,\psi\mid\psi')$ is the probability that the next state is $\psi$ and the received symbol is $y_i$, given that the previous state is $\psi'$.

• $\alpha_{i-1}(\psi') = P(\psi',\mathbf{y}_{<i})$ is the joint probability that the state at time $i-1$ is $\psi'$ and the sequence received up to then is $\mathbf{y}_{<i}$.

In terms of the trellis diagram, $\beta_i(\psi)$ refers to the future transitions, after the $i$-th instant, and is called the backward metric. $\gamma_i(\psi',\psi)$ refers to the current transition and is called the branch metric. $\alpha_{i-1}(\psi')$ refers to the past transitions, before the $i$-th instant, and is called the forward metric. Finally, inserting (5) into (4) yields the LLR expression that is implemented in the FPGA to compute the a posteriori probabilities of each decoder in the proposed architecture:

$$L(u_i\mid\mathbf{y}) = \log\frac{\sum_{U_1}\beta_i(\psi)\,\gamma_i(\psi',\psi)\,\alpha_{i-1}(\psi')}{\sum_{U_0}\beta_i(\psi)\,\gamma_i(\psi',\psi)\,\alpha_{i-1}(\psi')}. \tag{6}$$

The following paragraphs derive an approximate closed-form expression for each of the time subsequences. These expressions ease the simplification process discussed in the next subsection (II-C), which gives the final expressions to be implemented in VHDL.

1) Branch metric computation

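Applying the definition of conditional probability, the branch metric can be written as

$$\gamma_i(\psi',\psi) = P(y_i,\psi\mid\psi') = P(y_i\mid\psi,\psi')\,P(\psi\mid\psi') = P(y_i\mid\psi,\psi')\,P(u_i). \tag{7}$$

The last step holds because the transition from $\psi'$ to $\psi$ is determined by the input bit $u_i$.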

Consider the first factor, $P(y_i\mid\psi,\psi')$. The joint occurrence of the consecutive states $S_{i-1}=\psi'$ and $S_i=\psi$ is equivalent to the occurrence of the corresponding coded symbol $x_i$ at the transmitter, so $P(y_i\mid\psi,\psi') = P(y_i\mid x_i)$. Substituting this into (7) gives

$$\gamma_i(\psi',\psi) = P(y_i\mid x_i)\,P(u_i). \tag{8}$$

To obtain an expression for $P(y_i\mid x_i)$, note that the successive transmissions in a memoryless channel are statistically independent. Therefore, for an $n$-bit symbol,

$$P(y_i\mid x_i) = \prod_{m=1}^{n} P(y_{im}\mid x_{im}). \tag{9}$$

In this work, to facilitate the hardware implementation, the channel is approximated as an Additive White Gaussian Noise (AWGN) channel. For this channel, [35] shows that (9) becomes

$$P(y_i\mid x_i) = C^{(0)}_i \exp\left(\frac{2FRE_b}{N_0}\,(x_i\cdot y_i)\right), \tag{10}$$

where:

• $C^{(0)}_i$ is a constant that depends on the channel characteristics, such as fading, and on the received sequence, but not on $x_i$;

• $F$ is the channel fading amplitude;

• $R$ is the coding rate;

• $E_b/N_0$ is the energy per bit to noise power spectral density ratio of the system. It is usually quoted in dB but must be used in linear units in (10).

The computation of $P(u_i)$ is much simpler. Solving (1) together with $P(u_i=+1)+P(u_i=-1)=1$ gives

$$P(u_i=\pm1) = \frac{\exp\{u_iL(u_i)\}}{1+\exp\{u_iL(u_i)\}}, \tag{11}$$

and, after regrouping the terms,

$$P(u_i) = C^{(1)}_i \exp\left(\frac{u_iL(u_i)}{2}\right), \tag{12}$$

where $C^{(1)}_i = e^{-L(u_i)/2}/\left(1+e^{-L(u_i)}\right)$ is the same for $u_i=+1$ and $u_i=-1$.

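Finally, substituting (10) and (12) into (8) gives the final expression of the branch metrics

$$\begin{aligned} \gamma_i(\psi',\psi) &= C^{(0)}_iC^{(1)}_i\exp\left(\frac{2FRE_b}{N_0}(x_i\cdot y_i)\right)\exp\left(\frac{u_iL(u_i)}{2}\right)\\ &= C_i\exp\left(\frac{2FRE_b}{N_0}(x_i\cdot y_i)+\frac{u_iL(u_i)}{2}\right), \end{aligned} \tag{13}$$

where $C_i = C^{(0)}_iC^{(1)}_i$.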

For other channel models, the metric can be adapted accordingly, or (13) can be used as it is at the cost of a slight performance degradation.
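Take the logarithm of the product of (10) and (12) and drop the constant. The per-transition quantity that remains is 2FR(Eb/N0)(x_i · y_i) + u_i L(u_i)/2. The sketch below evaluates it for one transition, converting Eb/N0 from dB to linear units first. The function names and argument order are illustrative, and x is the ±1-mapped coded symbol hypothesized by the transition.

```python
def ebn0_linear(ebn0_db):
    """Convert Eb/N0 from dB to a linear ratio."""
    return 10.0 ** (ebn0_db / 10.0)


def log_branch_metric(x, y, u, l_apriori, fading, rate, ebn0_db):
    """log(gamma_i) up to the constant log(C_i).

    x         -- +/-1 coded bits of the symbol hypothesised by this transition
    y         -- received soft values for the same n bits
    u         -- +/-1 information bit of the transition
    l_apriori -- a priori LLR L(u_i)
    """
    corr = sum(xm * ym for xm, ym in zip(x, y))  # x_i . y_i, summed over the n bits
    return 2.0 * fading * rate * ebn0_linear(ebn0_db) * corr + u * l_apriori / 2.0
```

The term (x_i · y_i) is a correlation between the hypothesized and the received values. The metric therefore grows when they agree in sign.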

2) Forward and Backward metric computation

Let us now define the expressions to be implemented for the past and future transitions with respect to the $i$-th state. The subsequence corresponding to the past was defined above as

$$\alpha_{i-1}(\psi') = P(\psi',\mathbf{y}_{<i}) \;\Leftrightarrow\; \alpha_i(\psi) = P(\psi,\mathbf{y}_{<i},y_i). \tag{14}$$

Applying basic definitions of probability theory [36] and again assuming a memoryless channel, the forward metrics are computed recursively as


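$$\begin{aligned} \alpha_i(\psi) &= P(\psi,\mathbf{y}_{<i},y_i)\\ &= \sum_{\psi'} P(\psi,\psi',\mathbf{y}_{<i},y_i)\\ &= \sum_{\psi'} P(\psi,y_i\mid\psi')\,P(\psi',\mathbf{y}_{<i})\\ &= \sum_{\psi'} \gamma_i(\psi',\psi)\,\alpha_{i-1}(\psi'). \end{aligned} \tag{15}$$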

It is worth noting that the memoryless-channel assumption is not mandatory for good performance; it is used only to simplify the explanation. The process is analogous for the subsequences corresponding to the future. Starting from the probabilistic definition

$$\beta_i(\psi) = P(\mathbf{y}_{>i}\mid\psi) \;\Leftrightarrow\; \beta_{i-1}(\psi') = P(\mathbf{y}_{>i-1}\mid\psi'), \tag{16}$$

and under the same assumptions as before, the backward metrics are computed recursively as

$$\begin{aligned} \beta_{i-1}(\psi') &= P(\mathbf{y}_{>i-1}\mid\psi')\\ &= \sum_{\psi} P(\psi,y_i,\mathbf{y}_{>i}\mid\psi')\\ &= \sum_{\psi} P(\mathbf{y}_{>i}\mid\psi',\psi,y_i)\,P(\psi,y_i\mid\psi')\\ &= \sum_{\psi} P(\mathbf{y}_{>i}\mid\psi)\,P(\psi,y_i\mid\psi')\\ &= \sum_{\psi} \beta_i(\psi)\,\gamma_i(\psi',\psi). \end{aligned} \tag{17}$$

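Since (15) and (17) are recursive, initial values must be defined at $i=0$ for the forward metrics and at $i=N$ for the backward metrics. The encoder starts in the all-zero state and the trellis is terminated, as explained in Section II-A, so in both cases the initial value is

$$\alpha_0(\psi) = \begin{cases} 1, & \psi = 0,\\ 0, & \psi \neq 0, \end{cases} \tag{18}$$

$$\beta_N(\psi) = \begin{cases} 1, & \psi = 0,\\ 0, & \psi \neq 0. \end{cases} \tag{19}$$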

However, as discussed in [37], all these expressions are too expensive to implement in hardware. Numerical methods have therefore been developed that give results similar to those of the previous expressions at a lower computational cost.

C. BCJR ALGORITHM SIMPLIFICATION METHODS

To implement the BCJR-based decoder of an RSC code, (13) must be computed first. Then come the recursions (15) and (17), preferably in a parallel structure, initialized with (18) and (19).

To achieve this goal efficiently, [38] proposes working with natural logarithms, which transforms the expressions above into

$$L^{(\gamma)}_i(\psi',\psi) = \log\gamma_i(\psi',\psi) = \log C_i + \frac{2FRE_b}{N_0}(x_i\cdot y_i) + \frac{u_iL(u_i)}{2}, \tag{20}$$

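$$L^{(\alpha)}_i(\psi) = \log\alpha_i(\psi) \approx \max_{\psi'}\left[L^{(\alpha)}_{i-1}(\psi') + L^{(\gamma)}_i(\psi',\psi)\right], \tag{21}$$

$$L^{(\beta)}_{i-1}(\psi') = \log\beta_{i-1}(\psi') \approx \max_{\psi}\left[L^{(\beta)}_i(\psi) + L^{(\gamma)}_i(\psi',\psi)\right]. \tag{22}$$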

From (20), (21) and (22) it can be seen that the main advantage is that multiplications become additions and sums over states become the $\max(\cdot)$ function. This is simply the ordinary maximum, evaluated for each trellis state over the branches that reach (or leave) it. It is simple to implement in VHDL-93 and is already predefined in VHDL-2008. This simplification is known as the max-log-MAP algorithm. Since logarithms are applied to the recursive expressions, they must also be applied to the initial values (18) and (19), which gives

$$L^{(\alpha)}_0(\psi) = \begin{cases} 0, & \psi = 0,\\ -\infty, & \text{otherwise}, \end{cases} \tag{23}$$

$$L^{(\beta)}_N(\psi) = \begin{cases} 0, & \psi = 0,\\ -\infty, & \text{otherwise}. \end{cases} \tag{24}$$

The drawback of this approximation is that the resulting values are slightly less accurate than those obtained with the original expressions. The performance degradation is small, however, so in practice this suboptimal solution is perfectly viable thanks to its simplicity and computational efficiency. In hardware, the original expressions would require auxiliary Digital Signal Processor (DSP) blocks. In this implementation those operations are replaced by adders, which are much cheaper and faster, and the DSPs remain free for other receiver tasks.

Another disadvantage of the original BCJR algorithm is the numerical instability of (15) and (17). These expressions must be normalized at each time $i$ to keep their values within the representable range. The normalization requires dividing by previous results, which must be stored for the next step of the recursion. It therefore entails extra memory and auxiliary DSPs in hardware. The max-log-MAP algorithm solves this problem and does not require such normalization.


This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and

content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3235966

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Perez-Naranjo and Gil Jiménez: CCSDS 131.2-B-1 SCC turbo decoder architecture for efﬁcient FPGA implementation

Algorithm 1: Max-log-MAP algorithm steps
Data: $u_i$, $L(u_i)$, $i = 0$
Result: $L(u_i \mid y) \in \mathbb{R}$
/* Branch metrics computation */
while $i \neq N$ do
    Compute (20) for each input $u_i$ and $L(u_i)$;
    $i$++;
$i = 0$;
/* Forward and backward metrics computation */
while $i \neq N$ do
    Compute (21) with $L^{(\gamma)}_i(\psi', \psi)$;
    Compute (22) with $L^{(\gamma)}_{N-i}(\psi', \psi)$;
    $i$++;
$i = 0$;
/* A posteriori probability computation */
while $i \neq N$ do
    Compute (26) with $L^{(\alpha)}_i(\psi')$, $L^{(\gamma)}_i(\psi', \psi)$ and $L^{(\beta)}_{N-i}(\psi)$;
    $i$++;
$i = 0$;

Note that [39] proposes another simplification that gives the same numerical results as the original solution with less computational expense. It uses the max* function, which is based on the Jacobian logarithm and defined as

$$\max{}^{*}(\theta, \phi) = \max(\theta, \phi) + \log\left(1 + e^{-|\theta - \phi|}\right). \tag{25}$$

This variant is known as the log-MAP algorithm. Although it reproduces the exact results of the original BCJR algorithm, it requires external auxiliary modules based on Look-Up Tables (LUTs). These modules would consume more board resources and add extra clock cycles of delay. Integrating such a block with the rest of the structure would therefore complicate both the complete turbo decoder and the receiver itself, since the receiver also uses counter-based LUTs to synchronize the transfer of information from one block to another. For these reasons we prefer the max-log-MAP version: it is simpler to implement and allows the calculations to be performed in the same clock cycle.
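The difference between (25) and the max-log-MAP rule can be checked numerically. The Python sketch below uses the standard form of the Jacobian logarithm, $\max^*(\theta,\phi)=\max(\theta,\phi)+\log(1+e^{-|\theta-\phi|})$, which equals $\log(e^{\theta}+e^{\phi})$ exactly. The max-log-MAP rule keeps only the first term. The function names are illustrative and do not come from the paper. The dropped correction term lies between 0 and $\log 2$ and depends only on $|\theta-\phi|$, which is why a hardware log-MAP decoder needs the small LUT discussed above.

```python
import math


def max_star(theta: float, phi: float) -> float:
    """Jacobian logarithm (log-MAP): equals log(exp(theta) + exp(phi)) exactly."""
    return max(theta, phi) + math.log1p(math.exp(-abs(theta - phi)))


def max_log(theta: float, phi: float) -> float:
    """max-log-MAP approximation: keeps only the max, drops the correction term."""
    return max(theta, phi)


for theta, phi in [(0.0, 0.0), (2.0, -1.0), (5.0, 4.5)]:
    exact = math.log(math.exp(theta) + math.exp(phi))
    print(f"theta={theta:4.1f} phi={phi:4.1f}  exact={exact:.4f}  "
          f"max*={max_star(theta, phi):.4f}  max={max_log(theta, phi):.4f}")
```

The error of the max-only rule is largest when the two metrics are equal, at exactly $\log 2$, and vanishes as they move apart.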

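Finally, the changes made in (20), (21) and (22) also affect (6), where the computation of the a posteriori probabilities is simplified as

$$L(u_i \mid y) = \max_{U_1}\left[L^{(\alpha)}_i(\psi') + L^{(\gamma)}_i(\psi', \psi) + L^{(\beta)}_{N-i}(\psi)\right] - \max_{U_0}\left[L^{(\alpha)}_i(\psi') + L^{(\gamma)}_i(\psi', \psi) + L^{(\beta)}_{N-i}(\psi)\right]. \tag{26}$$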

For implementation purposes, the steps of the max-log-MAP algorithm are summarized in Algorithm 1. The while statements in Algorithm 1 do not describe loops in hardware. They only indicate that the implementation should be a pipelined structure with flow control.

FIGURE 4. SCC-Decoder full hardware block schematic.

FIGURE 5. Main block diagram of the two major stages of the proposed turbo decoder architecture.
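The following Python sketch runs max-log-MAP on a small terminated trellis and can serve as a software reference for Algorithm 1. It rests on several assumptions. The 4-state $(1, 5/7)_8$ recursive systematic code is a hypothetical stand-in, not CC1 or CC2 of CCSDS 131.2-B-1. LLRs follow the convention that a positive value favours bit '1', which matches the hard-decision rule of Section III. The branch metric uses the common ½ scaling and drops the constant $C_i$ of (20), because that constant cancels in (26). The backward metric is indexed conventionally as $\beta_{i+1}$ rather than with the $N-i$ index of (22). The loops here run sequentially, whereas the hardware pipelines them.

```python
import math

NEG_INF = -math.inf
N_STATES = 4  # hypothetical memory-2 RSC, NOT the CCSDS 131.2-B-1 CC1/CC2 codes


def bipolar(bit: int) -> int:
    """Map bit 0 -> -1 and bit 1 -> +1 (a positive LLR favours bit 1)."""
    return 2 * bit - 1


def rsc_step(state: int, u: int) -> tuple[int, int]:
    """One step of an illustrative (1, 5/7)_8 recursive systematic encoder.

    state = (d[k-1] << 1) | d[k-2]; returns (next_state, parity_bit).
    """
    d1, d2 = state >> 1, state & 1
    d = u ^ d1 ^ d2              # feedback polynomial 1 + D + D^2 (octal 7)
    parity = d ^ d2              # feedforward polynomial 1 + D^2 (octal 5)
    return (d << 1) | d1, parity


def encode(bits: list[int]) -> tuple[list[int], list[int]]:
    """Encode, then append 2 tail bits that drive the trellis back to state 0."""
    state, sys_bits, par_bits = 0, [], []
    for k in range(len(bits) + 2):
        u = bits[k] if k < len(bits) else (state >> 1) ^ (state & 1)
        state, p = rsc_step(state, u)
        sys_bits.append(u)
        par_bits.append(p)
    return sys_bits, par_bits


def max_log_map(l_sys: list[float], l_par: list[float], l_apr: list[float]) -> list[float]:
    """Return L(u_i|y) for every trellis step (tail steps included)."""
    n = len(l_sys)

    def branch(i: int, s: int, u: int) -> tuple[int, float]:
        # Branch metric in the spirit of (20): constant C_i dropped, 1/2 scaling assumed.
        nxt, p = rsc_step(s, u)
        return nxt, 0.5 * (bipolar(u) * (l_sys[i] + l_apr[i]) + bipolar(p) * l_par[i])

    alpha = [[NEG_INF] * N_STATES for _ in range(n + 1)]
    beta = [[NEG_INF] * N_STATES for _ in range(n + 1)]
    alpha[0][0] = 0.0            # (23): trellis starts in state 0
    beta[n][0] = 0.0             # (24): terminated trellis ends in state 0

    for i in range(n):           # forward recursion, cf. (21)
        for s in range(N_STATES):
            for u in (0, 1):
                nxt, g = branch(i, s, u)
                alpha[i + 1][nxt] = max(alpha[i + 1][nxt], alpha[i][s] + g)

    for i in reversed(range(n)):  # backward recursion, cf. (22)
        for s in range(N_STATES):
            for u in (0, 1):
                nxt, g = branch(i, s, u)
                beta[i][s] = max(beta[i][s], beta[i + 1][nxt] + g)

    llr = []
    for i in range(n):           # a posteriori LLR, cf. (26)
        best = [NEG_INF, NEG_INF]  # best[u]: best path metric through an input-u branch
        for s in range(N_STATES):
            for u in (0, 1):
                nxt, g = branch(i, s, u)
                best[u] = max(best[u], alpha[i][s] + g + beta[i + 1][nxt])
        llr.append(best[1] - best[0])
    return llr


# Usage: noiseless BPSK with an arbitrary reliability of 4 per LLR.
info = [1, 0, 1, 1, 0, 0, 1, 0]
sys_bits, par_bits = encode(info)
l_sys = [4.0 * bipolar(b) for b in sys_bits]
l_par = [4.0 * bipolar(b) for b in par_bits]
llr = max_log_map(l_sys, l_par, [0.0] * len(l_sys))
print([1 if x >= 0 else 0 for x in llr[: len(info)]])
```

On this noiseless frame the printed decisions reproduce `info`. Replacing one systematic LLR with a weak value of the wrong sign (for example `l_sys[2] = -1.0`) still decodes correctly, because every competing terminated path differs in at least five coded bits.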

III. PROPOSED PIPELINED ARCHITECTURE

This section presents the schematics of the proposed design for implementing the complete turbo decoder structure in hardware. First, Fig. 4 shows a schematic of the inputs and outputs of the modules described above. The Row-Column de-interleaver input is also the input of the complete SCC-decoding block. It receives the output values of the preceding soft demodulator, together with a trigger signal that is active while those a priori soft values are being computed. The demodulator is designed to provide, on the same clock edge, the soft bit corresponding to the systematic bit and the soft bit corresponding to the parity bit from the CC2 encoder, according to the transmission scheme shown in Fig. 2. While the trigger signal from the demodulator is enabled, the storing circuit is also active and stores the incoming information. This continues until the end of the frame is reached and the demodulator is deactivated. At that point the incoming trigger signal to our block is no longer enabled, and the storing circuit stops so that the block can perform its de-interleaving and depuncturing operations.

The proposal for the next block can be summarized by the diagram in Fig. 5, where the decoder operation has been divided into two main stages. The first stage decodes by inverting the steps of the encoding scheme. A block implementing the max-log-MAP algorithm for CC2 in VHDL is applied first. The effect of the interleaver and of the fixed puncturing block is then undone, and finally the max-log-MAP algorithm for CC1 is applied to the processed information. These four main blocks are grouped in the purple area of Fig. 5, named the feedforward stage. The next step is to check the current iteration. If it is the last one, the calculated likelihoods are taken out and a hard decisor is applied to obtain the estimated bits; otherwise, the second stage follows. The second stage updates the a priori observations of the first decoding block. The inputs of the second decoding block are taken, their values are updated, and puncturing is applied to adjust them to the frame size of the first decoding block. That size is fixed because the inputs of the first decoding block are always the observations from the Row-Column de-interleaver, which are stored in auxiliary block RAMs. The updated a priori observations are also permuted by the original interleaver. This stage corresponds to the green area of Fig. 5, named the feedback stage, and its data always return to the feedforward stage.

FIGURE 6. Full block diagram for the hardware components and connections corresponding to the Row-Column de-interleaver block.

FIGURE 7. Complete block diagram of the proposed hardware architecture for the corresponding turbo decoder block.

Further details of the Row-Column de-interleaver can be seen in Fig. 6, where its pipelined components are shown in yellow. The same figure shows several dark turquoise rectangular prisms, which represent the RAM blocks that store the information processed by the blocks. As Fig. 6 shows, the block contains a master process, called deinterleaver synchronizer, which activates each component at the right time. Because a parallel architecture has been followed and the blocks are independent, FPGA resources are allocated more efficiently at implementation. The master process activates the storing circuit so that the input information is stored in two RAM blocks. When no more data arrive, the de-interleaving + puncturing process is activated to perform the behavior shown in Fig. 3. For this, the data are stored in another auxiliary RAM block in the appropriate order. The puncturing patterns corresponding to the operation mode are then read. These patterns are implemented as LUTs that store '1' if the corresponding position was not deleted and '0' if a position punctured in transmission has to be re-inserted, and the corresponding positions are added accordingly. Finally, when the end of the puncturing-pattern LUT is reached, the information stored in the auxiliary RAM block is sent to the turbo decoder block.
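The storing, de-interleaving and LUT-driven depuncturing steps can be modelled in software as follows. This sketch rests on two assumptions. First, the transmitter is taken to write the block interleaver row by row and read it column by column; the exact mapping is the one in Fig. 3 and in the standard. Second, the dimensions and the 0/1 pattern are toy values, not the per-mode LUTs of CCSDS 131.2-B-1. A re-inserted position receives LLR 0, which carries no information about that bit.

```python
def row_column_deinterleave(values: list[float], n_rows: int, n_cols: int) -> list[float]:
    """Invert a block interleaver that (by assumption) wrote rows and read columns."""
    if len(values) != n_rows * n_cols:
        raise ValueError("frame length must equal n_rows * n_cols")
    out = [0.0] * len(values)
    for k, v in enumerate(values):
        col, row = divmod(k, n_rows)   # k-th received value was read from (row, col)
        out[row * n_cols + col] = v
    return out


def lut_depuncture(values: list[float], pattern: list[int], erasure: float = 0.0) -> list[float]:
    """Rebuild the unpunctured frame from a 0/1 LUT pattern.

    pattern[j] == 1: position j was transmitted, so take the next received value.
    pattern[j] == 0: position j was punctured, so insert a neutral LLR.
    """
    out, k = [], 0
    for keep in pattern:
        if keep:
            out.append(values[k])
            k += 1
        else:
            out.append(erasure)
    if k != len(values):
        raise ValueError("pattern and received frame lengths do not match")
    return out


# Usage with toy sizes (not the CCSDS 131.2-B-1 dimensions or patterns).
frame = list(range(12))
sent = [frame[r * 4 + c] for c in range(4) for r in range(3)]  # row-in, column-out
print(row_column_deinterleave(sent, 3, 4))
print(lut_depuncture([5.0, -3.0, 7.0], [1, 0, 1, 1]))
```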

Fig. 7 presents in more detail the separate components that make up the turbo decoder block (a hardware-oriented, component-based implementation of the complete decoding algorithm), together with the connections between them that control the data flow between the two stages. As in the Row-Column de-interleaver architecture, a master process called turbodecoder synchronizer enables each of the independent components that perform the operations mentioned above. In the feedforward stage these are the decoders and the de-interleaver and depuncturer circuits; in the feedback stage they are the updater, puncturer and interleaver circuits. As before, every pipelined component is shown in yellow and every RAM block as a dark turquoise rectangular prism. In addition, two data flows are highlighted in accordance with Fig. 5: a purple path for the route followed in the feedforward stage and a green path for the route followed in the feedback stage. The remaining blue markers refer to the trigger signals of the independent components, which are handled by the turbodecoder synchronizer process.

Before explaining the control logic of the data flow, note that the frame dimensions inside the block vary with the transmission mode. These dimensions are given in the reference standard and can be summarized in this block by just two variables, both referring to the transmitter components [32]. The integer S is the number of systematic bits generated by the complete encoder, and the integer K is the encoder input frame size. Accordingly, the frame dimensions in terms of these variables have been added to all the paths of the two main stages in the block diagram of Fig. 7.

The operation is as follows. In the first iteration, the synchronizing signals between the Row-Column de-interleaver and the turbo decoder are active only while information is sent from the former to the latter, where it is stored in RAM blocks. Once these signals are deactivated, the 1st convolutional decoder (corresponding to CC2, applying the max-log-MAP algorithm) starts operating with null a priori observations. This is because i.i.d. equiprobable inputs are assumed, so their LLRs are zero. The decoder produces the first set of a posteriori observations of the initial input information. These observations are reordered by reversing the original interleaving pattern, which is stored in a LUT selected by the transmission mode. The result is a frame two observations shorter than the input, since the interleaver length is S-2 according to the standard, regardless of the transmission mode. The de-interleaved observations then pass through the depuncturer block. This block enlarges the incoming frame to undo the fixed puncturing pattern applied in transmission and to match the length of the CC1 output frame in the transmitter. As explained above, this puncturing pattern is fixed in the standard. Since the transmitter removes one of every four observations coming out of CC1, the receiver block generates an extra observation for every three received from the de-interleaver block. Next, the 2nd convolutional decoder (corresponding to CC1, also applying the max-log-MAP algorithm) is applied, and the first set of a posteriori observations of the complete turbo decoder is obtained. This completes the first iteration of the feedforward stage. This decoder also works with null a priori likelihoods. Unlike those of the first decoder, however, they are never updated, because the information about the changes made by the interleaver and puncturer blocks is used only to update the 1st decoder at the beginning of a new iteration; that is, the modifications are included in the first decoder. This yields a simpler and more efficient architecture, since no extra computation is needed for this block. At this point, the decoder checks whether the configured maximum number of iterations has been reached. If so, the hard decision is made on the last a posteriori likelihoods obtained. Since we are still in the first iteration, however, the process moves on to the feedback stage. The a posteriori likelihoods (frame of length K) are discarded, and the updated input of the 2nd convolutional decoder (frame of length 2(K+2)) is passed from the 2nd convolutional decoder block to the feedback stage. There, the original puncturing pattern and the interleaving operation are applied to these observations. The observations finally return to the 1st convolutional decoder block as the a priori observations to be added to the original inputs in the next iteration.
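The frame-length bookkeeping between the two decoders can be checked with a short Python model. Following the text, the de-interleaver uses a mode-dependent permutation LUT of length $S-2$, and the depuncturer inserts one observation for every three received. Its output therefore has $4(S-2)/3$ values, which by the same reasoning should equal the CC1 output length $2(K+2)$. The sketch assumes, hypothetically, that the last observation of each group of four is the punctured one (`PUNCT_POS`), and it uses a toy permutation. The feedback functions apply the same pattern and permutation in the forward direction.

```python
PUNCT_POS = 3  # hypothetical: index (0..3) of the observation removed in each group of 4


def interleave(x: list[float], perm: list[int]) -> list[float]:
    """Output j takes input perm[j]; perm plays the role of the mode-dependent LUT."""
    return [x[p] for p in perm]


def deinterleave(y: list[float], perm: list[int]) -> list[float]:
    """Inverse of interleave(): put y[j] back at position perm[j]."""
    x = [0.0] * len(y)
    for j, p in enumerate(perm):
        x[p] = y[j]
    return x


def depuncture_3to4(values: list[float], erasure: float = 0.0) -> list[float]:
    """Feedforward stage: re-insert one neutral LLR for every 3 received values."""
    if len(values) % 3:
        raise ValueError("length must be a multiple of 3")
    out = []
    for g in range(0, len(values), 3):
        group = list(values[g:g + 3])
        group.insert(PUNCT_POS, erasure)
        out.extend(group)
    return out


def puncture_4to3(values: list[float]) -> list[float]:
    """Feedback stage: drop the same position from every group of 4."""
    if len(values) % 4:
        raise ValueError("length must be a multiple of 4")
    return [v for k, v in enumerate(values) if k % 4 != PUNCT_POS]


# Toy round trip: the feedback path (puncture, interleave) undone by the feedforward path.
perm = [2, 0, 5, 1, 4, 3]              # hypothetical 6-entry interleaver LUT
coded = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
tx = interleave(puncture_4to3(coded), perm)
rx = depuncture_3to4(deinterleave(tx, perm))
print(rx)
```

The printed frame restores the six surviving observations in their original positions and marks the two punctured positions with neutral zeros.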

This process is repeated until the maximum number of iterations is reached. Then, as mentioned above, a hard decision is applied to the a posteriori likelihoods coming from the 2nd decoder: a '0' is chosen if the likelihood is negative, and a '1' otherwise. An efficient way to perform this operation in hardware is simply to invert the sign bit of the value being evaluated. If an observation is negative as a real number, its sign bit is always '1', so the circuit only has to apply a NOT gate to this bit, obtaining '0'. The same operation applies to a positive likelihood, whose sign bit '0' yields a '1'.
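The sign-bit trick can be modelled bit-accurately in Python. The sketch assumes that the likelihoods are stored as two's-complement words of `width` bits; the word length used in the actual design is not stated in this section. The function returns the NOT of the sign bit, so a zero LLR maps to '1', consistent with the rule "'0' if negative, '1' otherwise".

```python
def hard_decision(llr: int, width: int) -> int:
    """Return NOT(sign bit) of a `width`-bit two's-complement word.

    Negative LLR -> sign bit 1 -> bit 0; zero or positive LLR -> sign bit 0 -> bit 1.
    """
    word = llr & ((1 << width) - 1)      # raw bit pattern as it sits in the register
    sign_bit = (word >> (width - 1)) & 1
    return sign_bit ^ 1                  # the NOT gate


llrs = [-37, 12, 0, -1, 127, -128]     # example 8-bit LLR values (made up)
print([hard_decision(v, 8) for v in llrs])
```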

IV. RESULTS AND ANALYSIS

This section describes the results obtained with the proposed architecture. Before analyzing the simulation and implementation results, it is important to note the degradation caused by the max-log-MAP approximation with respect to the optimal version described by (25). For this purpose, Fig. 8 shows software simulation results for the Bit Error Rate (BER) obtained with the complete architecture, for different numbers of iterations and over a low range of Eb/N0 values. The figure shows the difference in error correction gain between the optimized algorithm and the theoretical, or optimal, algorithm, which is much more expensive and harder to implement in hardware. Both versions eventually correct all frame errors, but at different values of Eb/N0 (in dB).

Two groups of curves can be seen in the figure: the solid curves correspond to the optimal version of the BCJR algorithm, and the dashed curves to the max-log-MAP algorithm. When only one iteration of the turbo decoder is performed (the two blue curves), there is no difference between the optimal version and the simplified approach. The system has not yet been fed back with updated a priori information, so the differences between the two versions of the algorithm cannot be appreciated. When a second iteration is performed, however, the difference in system performance is already noticeable, as shown by the red curves. This trend continues as the number of iterations increases, and the system corrects all errors more sharply with more iterations. For 7 and 10 iterations (magenta and black solid lines, respectively), the optimal system converges and corrects all errors at Eb/N0 = -1 dB, while its simplified version (dashed curves of the same colors) shows the same behavior at -0.5 dB. Therefore, the system has been implemented with 7 iterations instead of 10, which makes it more efficient: it spends less processing time and fewer FPGA resources.

FIGURE 8. BER results using the optimal version of the BCJR algorithm (solid lines) and using the simplified version based on the max-log-MAP algorithm (dashed lines) for the complete turbo decoder architecture. [Plot: Bit Error Rate (10^-7 to 10^0) versus Eb/N0 (dB, -3 to 3); legend: Optimum and max-log-MAP with 1, 2, 4, 7 and 10 iterations.]

A. HDL SIMULATION

This subsection shows the hardware simulation of the VHDL implementation of the architecture described in this paper. The simulation processes a complete frame of transmission mode 1, consisting of 16000 observations from the soft demodulator, with the turbo decoder set to 7 iterations and a clock frequency of 100 MHz. To make the results easy for the reader to verify, integer arithmetic is considered. A random synthetic signal of logic vectors is generated and displayed using Vivado's conversion to signed integers. The same input signal is replicated in MATLAB to show that the likelihoods and the final logic signal are correct, and both tools give the same results.
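The comparison between the HDL simulation and the MATLAB model reduces to reading each logic vector as a signed integer, as Vivado's signed-decimal display does. The minimal Python sketch below assumes that the waveform samples are exported as MSB-first bit strings. That export format is hypothetical; the paper does not describe one.

```python
def to_signed(bits: str) -> int:
    """Interpret an MSB-first logic-vector string as a two's-complement integer."""
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError(f"not a binary vector: {bits!r}")
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value


def mismatches(hw_vectors: list[str], reference: list[int]) -> list[int]:
    """Indices where the HDL samples and the software reference disagree."""
    if len(hw_vectors) != len(reference):
        raise ValueError("frames have different lengths")
    return [k for k, (v, r) in enumerate(zip(hw_vectors, reference)) if to_signed(v) != r]


# Hypothetical exported samples and matching reference values.
print(mismatches(["11110110", "00000111", "10000000"], [-10, 7, -128]))
```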

The results of the hardware simulation of the Row-Column de-interleaver block are shown in Figs. 9 and 10. Fig. 9 shows the input of the Row-Column de-interleaver, which is mapped to the samples received from the soft demodulator. The inputs of this block (yellow marker) correspond to the outputs of the demodulator, named 'LL1' and 'LL2' in the figure. Fig. 10 presents the timing of all the operations of this complete block. The yellow vertical time marker indicates the beginning of the de-interleaving and puncturing process. The blue marker indicates its end and the beginning of the transfer of the modified information to the turbo decoder. The figure also shows the beginning and the end of the modified Row-Column de-interleaver frame, with blue arrows marking the positions where extra values have been added by the depuncturing process. The beginning of the frame is delimited by a red dashed line and its end by an orange dashed line.

FIGURE 9. Beginning of the Row-Column de-interleaver input frame in the hardware simulation.

FIGURE 10. Action time of the block corresponding to the Row-Column de-interleaver.

FIGURE 11. Numeric values of the beginning of the Row-Column de-interleaver input frame (left) and of the beginning (middle) and end (right) of the Row-Column de-interleaver output frame.

These results can be validated against the software results shown in Fig. 11, where the systematic observations are highlighted in blue and the unmarked ones correspond to parity. Blue arrows again indicate the action of the depuncturer on the corresponding positions.

FIGURE 12. Waveforms corresponding to the turbo decoder 1st iteration

process.

Fig. 12 shows all the trigger signals of the subsystems involved in the decoding during a single iteration. The signals 'y1' and 'y2' are the outputs of the Row-Column de-interleaver, while the signals highlighted with an orange box indicate the activation of each of the blocks that make up the turbo decoder architecture. These signals are activated sequentially to maintain the flow of the complete turbo decoding algorithm. The end of the second decoder, where the max-log-MAP algorithm is applied for the second time in this first iteration, is marked with a circle and a red arrow. This is where the likelihoods of interest are obtained after a fixed number of iterations (7, according to the gain simulations discussed at the end of the second paragraph of Section IV). Finally, separated by red and orange dashed lines, the figure shows the numerical values at the beginning and end, respectively, of the final likelihoods obtained in the first iteration of the architecture. These values are compared with those of the software simulation to check that the architecture works correctly in hardware.

These trigger signals enable each of the sub-circuits marked in yellow in Fig. 7 in the appropriate order, maintaining the correct flow of the algorithm. Table 1 summarizes, for each signal label used in Fig. 12, which sub-circuit it activates and which function it performs.
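The trigger ordering described above and in Table 1 can be expressed as a behavioural model of the turbodecoder synchronizer. This model is our interpretation of the text, not the authors' VHDL. It lists only the enable pulses in the order described: SET-UP-F once, the feedforward chain in every iteration, the feedback chain between iterations, and HARD-D-F at the end. DEC-IN-F and NEW-IT-F are omitted because their exact timing is not specified here.

```python
FEEDFORWARD = ["DCDR-1-F", "DEINTL-F", "DEPUN-F", "DCDR-2-F"]
FEEDBACK = ["UPDATE-F", "PUNCTU-F", "FEEDBK-F"]


def trigger_sequence(n_iterations: int) -> list[str]:
    """Order in which a synchronizer could pulse the Table 1 triggers (behavioural model)."""
    if n_iterations < 1:
        raise ValueError("at least one iteration is required")
    seq = ["SET-UP-F"]                 # store the initial observations once
    for it in range(n_iterations):
        seq += FEEDFORWARD             # CC2 decoder, de-interleaver, depuncturer, CC1 decoder
        if it == n_iterations - 1:
            seq.append("HARD-D-F")     # last iteration: hard decision
        else:
            seq += FEEDBACK            # otherwise update, puncture, interleave and loop back
    return seq


print(trigger_sequence(2))
```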

Fig. 13 presents the numerical values at the output of the first decoder for each iteration in the software simulation, which are used to validate the data produced by the algorithm written in VHDL. The left column shows the values at the beginning of the frame and the right column those at the end of the frame.

Fig. 14 shows the waveforms obtained once the complete decoding process has finished, after reaching the configured maximum number of iterations. An orange grid has been added over the trigger signals to separate the activations of each iteration. The activation patterns repeat in every iteration, following the behavior seen in Fig. 12. In addition, one trigger signal is held at '1' during the remaining iterations and returns to '0' when the iterations are finished and the hard decision is about to be made. Below, separated by a dashed red line, the figure shows how the likelihoods of interest are updated. In the last iteration, hard demodulation is applied to these likelihoods to obtain the estimated bits of the frame. The behavior of the previous figure is maintained without anomalies in every iteration. At the beginning, a red circle marks the activation of the signal 'SET-UP-F', which is activated only in the first iteration and remains off in the rest. This signal only indicates that the original input values to be iterated on have been stored in memory.

TABLE 1. Associated circuits and functions of each trigger signal shown in Fig. 12.

Stage | Signal name | Fig. 7 sub-circuit | Operation done
FEEDFORWARD | SET-UP-F | SET UP BRAMs | Starts the process; saves the initial observations in memory for use at the beginning of each iteration.
FEEDFORWARD | DCDR-1-F | OUTER DECODER | Enables the outer decoder circuit.
FEEDFORWARD | DEC-IN-F | OUTER DECODER | Prevents information from being lost after the feedback stage.
FEEDFORWARD | DEINTL-F | DEINTERLEAVER | Enables the de-interleaver circuit.
FEEDFORWARD | DEPUN-F | DEPUNCTURER | Enables the depuncturer circuit.
FEEDFORWARD | DCDR-2-F | INNER DECODER | Enables the inner decoder circuit.
FEEDFORWARD | HARD-D-F | HARD DECISION | Enables the hard decision circuit at the end of the algorithm.
FEEDBACK | NEW-IT-F | NEW ITERATOR FLAG | Signals that a new iteration is in progress (non-first or final iteration).
FEEDBACK | UPDATE-F | UPDATE | Enables the full feedback action with the INNER DECODER results (a posteriori estimated likelihoods).
FEEDBACK | PUNCTU-F | PUNCTURER | Applies the original puncturing pattern.
FEEDBACK | FEEDBK-F | FEEDBACK INTERLEAVER | Applies the original interleaver pattern and feeds back the OUTER DECODER.

FIGURE 13. Beginning (left) and end (right) of the numerical values frame for the feedforward stage output observations for all iterations.

FIGURE 14. Waveforms corresponding to performing the turbo decoder iterations.

FIGURE 15. Bits obtained after performing the hard decision process in software simulation. Initial (left) and final values (right).


FIGURE 16. Beginning of the bit frame obtained at the output of the SCC-decoder block.

FIGURE 17. End of the bit frame obtained at the output of the SCC-decoder block.

Likewise, a purple circle marks the end of the last iteration, where only the signal 'HARD-D-F' is activated. This signal applies hard demodulation to the last calculated likelihoods to obtain bits, using the method explained in the last paragraph of Section III. The behavior is therefore shown to be correct, as confirmed by the markers at the beginning and end of the architecture's operation.

Finally, the results of the hard decision process are shown in Figs. 15-17. Fig. 15 shows the result of applying the proposed software architecture to the signed integer values used in this simulation. Figs. 16 and 17 present the start and end, respectively, of the bit frame obtained after the complete decoding process. In these figures, the value of the corresponding bit at each clock edge has also been marked to make the results easier to compare.

B. SYNTHESIS RESULTS

Once the hardware description of the proposed architecture has been shown to produce the expected waveforms, we move on to synthesizing the circuit to assess the resources it consumes. The synthesis tool is Vivado and, as mentioned before, the FPGA is the Xilinx Zynq UltraScale+ RFSoC ZCU28DR model. The synthesis results are summarized in Table 2.

TABLE 2. Resources used by the FPGA after synthesising the proposed

architecture.

Resource Estimation Available Utilization (%)

LUT 6914 425280 1.63

FF 2542 850560 0.3

BRAM 273.5 1080 25.33

IO 28 347 8.07

BUFG 1 696 0.14

As Table 2 shows, the proposed architecture as a whole uses 6914 LUTs, 2542 Flip-Flops (FF), 273.5 block RAMs (BRAM), 28 input-output (IO) nets and 1 clock buffer (BUFG). The BRAM figure refers to the whole set of memories in the design. The architecture uses 39 memories in total: 12 RAMs of 207 Kbits, 12 RAMs of 180 Kbits, 2 RAMs of 174 Kbits, 3 RAMs of 168 Kbits, 3 RAMs of 126 Kbits, 2 RAMs of 92 Kbits, 2 RAMs of 87 Kbits and 3 RAMs of 84 Kbits. These sizes follow from the turbo decoder parameters of the standard, which define the frame lengths the decoder operates on. In this way, each RAM can be tailored to the circuit and use exactly the number of bits needed. As a result, the decoder, a component that usually consumes a large share of the receiver resources, has been optimized to use about 25% of the FPGA's available BRAMs. The remaining resources used are negligible compared with those available.
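A quick arithmetic check on the memory list above: the 39 memories add up to 6484 Kbit. Dividing by the 36 Kb capacity of one UltraScale+ RAMB36 tile gives a lower bound of 181 tiles if the memories could be packed perfectly. This assumes that the Kbit figures use the same 1024-bit unit as the tile capacity. The 273.5 BRAMs reported in Table 2 exceed this bound. That is plausible, since each memory is mapped onto whole tiles by width and depth, but this interpretation is ours and is not stated in the paper.

```python
import math

# (number of RAMs, size in Kbit) as listed in the text above
memories = [(12, 207), (12, 180), (2, 174), (3, 168), (3, 126), (2, 92), (2, 87), (3, 84)]
RAMB36_KBIT = 36           # capacity of one UltraScale+ 36 Kb block RAM tile

n_memories = sum(n for n, _ in memories)
total_kbit = sum(n * kbit for n, kbit in memories)
lower_bound_tiles = math.ceil(total_kbit / RAMB36_KBIT)   # assumes perfect packing

print(n_memories, total_kbit, lower_bound_tiles)
```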

The synthesis results can be complemented graphically with the RTL models that Vivado generates from the VHDL code implementing the proposed architecture. Figure 18 shows the two major components of the architecture: the Row-Column de-interleaver (marked in green) and the turbodecoder block (marked in red).

FIGURE 18. Main view of synthesized circuit. The Row-Column De-interleaver is marked in green and the turbodecoder block is marked in red.

The submodules of Figure 7 are implemented within the turbodecoder block, and some of them are shown in Figure 19. For the system to work correctly, however, a large number of primitives are generated that clutter the canvas, so a zoomed view of the complete block is provided. In this figure, the circuits implementing the max-log-MAP algorithm are marked in green. The initial RAMs are marked in red; they store the initial frames coming from the Row-Column de-interleaver so that these frames can be used at the beginning of each iteration of the algorithm and are not lost in the following clock edges. Other blocks of the architecture, such as the de-interleaver and the depuncturer, can also be seen, together with the connections between them, i.e., how the blocks are connected and how the signals enter according to the workflow shown in Figure 7. Without cropping and zooming, the large number of generated primitives occupies the entire canvas, as mentioned above. This case is shown in Figure 20, from which very little information can be extracted about the full RTL implementation. This does not matter at the hardware level, because the FPGA's ability to work in parallel allows different areas of the device to perform operations and host modules at the same time. Therefore, to keep the paper readable and avoid excessive implementation detail, only selected details of the implementation are shown to give a more precise view.

FIGURE 19. Some implemented and encapsulated sub-circuits of Figure 7 inside the turbodecoder block.

TABLE 3. Summary of results of timing test on the proposed implemented architecture running at 100 MHz.

Design timing summary | Setup | Hold | Pulse Width
Worst slack | WNS = 2.354 ns | WHS = 0.004 ns | WPWS = 4.458 ns
Total slack | TNS = 0 ns | THS = 0 ns | TPWS = 0 ns
Number of failing endpoints | 0 | 0 | 0
Total number of endpoints | 16447 | 16447 | 3119

(WNS: Worst Negative Slack; TNS: Total Negative Slack; WHS: Worst Hold Slack; THS: Total Hold Slack; WPWS: Worst Pulse Width Slack; TPWS: Total Pulse Width Slack.)

FIGURE 20. Full RTL implementation of turbodecoder block.

C. IMPLEMENTATION RESULTS

Finally, the results of implementing the architecture described in this work are presented. Resource usage after implementation is the same as shown in Table 2, except that the implemented design uses 6807 LUTs. The results of the timing-constraint evaluation are shown in Table 3; recall that the implementation targets a clock frequency of 100 MHz. These results show that the timing requirements are met with ample margin in the WNS and WPWS measurements, so higher clock frequencies could be used. The WHS margin is tighter, but this is to be expected, since hold slack reflects the worst-case hold condition across the FPGA resources, and the design uses many BRAM components that require numerous connections between LUTs, nets and these blocks, which affects timing. However, as the results show, this is not a problem for the implementation. Regarding power consumption, the results are quite good, as there is still a large thermal margin. These results are shown in Table 4.
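A common first-order way to read the positive WNS in Table 3 as frequency headroom is to take the critical-path delay as the clock period minus the WNS and invert it. The short Python sketch below performs this arithmetic; the function name is ours, and the estimate is optimistic, because clock uncertainty, skew and the placement chosen by the tool would all change if the design were re-implemented with a tighter constraint.

```python
def estimate_fmax_mhz(clock_mhz: float, wns_ns: float) -> tuple[float, float]:
    """Return (critical-path delay in ns, optimistic f_max in MHz).

    First-order model (our assumption): critical path = period - WNS.
    It ignores changes in clock uncertainty, skew and placement that a
    re-implementation at a tighter constraint would bring.
    """
    period_ns = 1e3 / clock_mhz             # 100 MHz -> 10 ns
    critical_path_ns = period_ns - wns_ns
    if critical_path_ns <= 0:
        raise ValueError("WNS cannot exceed the clock period")
    return critical_path_ns, 1e3 / critical_path_ns


# Setup slack reported in Table 3 for the 100 MHz implementation.
path_ns, fmax_mhz = estimate_fmax_mhz(clock_mhz=100.0, wns_ns=2.354)
print(f"critical path ~ {path_ns:.3f} ns, f_max ~ {fmax_mhz:.1f} MHz")
# critical path ~ 7.646 ns, f_max ~ 130.8 MHz
```

With the Table 3 values this gives a critical path of about 7.6 ns and an f_max of roughly 131 MHz, which is consistent with the statement that higher frequencies could be used; only a new implementation run at that constraint would confirm the figure.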

TABLE 4. Summary of results of the power test carried out on the proposed implemented architecture.

Total On-Chip Power: 1.386 W
Junction Temperature: 26.2 °C
Thermal Margin: 73.8 °C (84.7 W)
Effective ΘJA: 0.8 °C/W
Power Supplied to Off-Chip Devices: 0 W
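The junction temperature in Table 4 can be related to the total on-chip power through the usual steady-state model T_J = T_A + Θ_JA · P. The Python sketch below back-solves the ambient temperature implied by Table 4 and recomputes the thermal margin. Two assumptions are ours: the maximum junction temperature is taken as 100 °C (simply the sum of the reported junction temperature and margin), and Θ_JA is used as rounded in the table, so the results are approximate. The watt value in parentheses is reported by the tool and is not modelled here.

```python
def implied_ambient_c(tj_c: float, theta_ja_c_per_w: float, power_w: float) -> float:
    """Back-solve T_A from T_J = T_A + theta_JA * P."""
    return tj_c - theta_ja_c_per_w * power_w


def thermal_margin_c(tj_c: float, tj_max_c: float) -> float:
    """Headroom between the junction temperature and its assumed maximum."""
    return tj_max_c - tj_c


# Values from Table 4; theta_JA is rounded to one decimal in the table.
TJ_C, THETA_JA, POWER_W = 26.2, 0.8, 1.386
TJ_MAX_C = 100.0  # assumption: 26.2 °C + 73.8 °C reported margin

print(f"implied T_A ~ {implied_ambient_c(TJ_C, THETA_JA, POWER_W):.1f} °C")
print(f"margin ~ {thermal_margin_c(TJ_C, TJ_MAX_C):.1f} °C")
# implied T_A ~ 25.1 °C
# margin ~ 73.8 °C
```

The small on-chip power (about 1.4 W) raises the junction only about 1.1 °C above the implied ambient, which is why the margin is so large.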

FIGURE 21. Summary of power expended by each on-chip component.

In addition, since the resources used in the implemented design have been analysed, a breakdown of the power consumed by each of these on-chip resources is given in Fig. 21. As expected, the BRAM blocks account for the highest power consumption on the FPGA, because they make up the largest share of the components used in the proposed design. All these implementation results follow from the placement and routing shown in Fig. 22, a schematic of the FPGA in which the primitives assigned to each of the processes of the proposed architecture are marked in blue. It can be seen that the implementation tool has placed the primitives in the northern part of the device, i.e., in the areas with higher Y coordinates with respect to the different clock regions.

FIGURE 22. Implemented design on the FPGA device.

V. CONCLUSIONS

The work presented in this paper proposes a simple architecture for implementing, on the Xilinx Zynq UltraScale+ RFSoC ZCU28DR evaluation board, a decoding scheme valid for the SCC Turbo Coding Scheme block defined in the CCSDS 131.2-B-1 standard. The algorithms to be implemented in the design have been detailed, together with complete, explained schemes of the data-flow treatment and of the net connections needed for a hardware implementation. In addition, a structure consisting of independent components has been presented that favours a pipeline architecture, separating as much as possible the FPGA resources used by each task, which benefits the efficiency of the design. Results of a hardware simulation based on integer arithmetic, chosen to simplify the validation, have been presented; they match the software simulation results, demonstrating the effectiveness of the proposed architecture. Finally, it has been verified that the design is synthesizable and passes the timing and thermal tests with the device running at 100 MHz. The FPGA resources consumed, which are a very small percentage of those provided by this board, have been reported, together with a schematic of the implemented design on the device. Although the scheme proposed in this paper targets the CCSDS 131.2-B-1 standard, the ideas and architecture can be easily extrapolated to other serial-based turbo decoder schemes, which increases the value of the contribution of this paper.
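The integer-arithmetic validation mentioned above fits naturally with the max-log MAP algorithm used by the decoder: the max-log approximation replaces the Jacobian logarithm of log-MAP, ln(e^a + e^b), with a plain maximum, so state-metric updates need only additions and comparisons and remain integers when their inputs are integers. The Python sketch below illustrates this; it is not the authors' implementation, and the two-branch update alpha_step and its metric values are hypothetical.

```python
import math


def max_star(a: float, b: float) -> float:
    """Exact Jacobian logarithm ln(e^a + e^b), as used by log-MAP."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: int, b: int) -> int:
    """Max-log approximation: the log-MAP correction term is dropped."""
    return max(a, b)


def alpha_step(alpha1: int, gamma1: int, alpha2: int, gamma2: int) -> int:
    """Hypothetical forward-metric update for a state with two incoming
    branches: max over (previous state metric + branch metric)."""
    return max_log(alpha1 + gamma1, alpha2 + gamma2)


# Integer inputs give an integer output: only additions and a comparison.
alpha = alpha_step(12, 4, -3, 9)      # max(16, 6)
print(alpha, type(alpha).__name__)    # 16 int

# The dropped correction is largest (ln 2) when both arguments are equal.
print(round(max_star(5, 5) - max_log(5, 5), 4))    # 0.6931
print(round(max_star(16, 6) - max_log(16, 6), 6))  # about 4.5e-05
```

The correction term discarded by max-log is bounded by ln 2, which is the price paid for a datapath whose integer results can be reproduced exactly by a software reference model.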

REFERENCES

[1] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[2] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1,” in Proceedings of ICC ’93 - IEEE International Conference on Communications, vol. 2, pp. 1064–1070, 1993.

[3] C. Berrou and A. Glavieux, Turbo Codes. John Wiley & Sons, Ltd, 2003.

[4] “TM synchronization and channel coding: Blue book,” standard, CCSDS 131.2-B-0, Consultative Committee for Space Data Systems, Mar. 2013.

[5] “Low-density parity-check codes for use in near-Earth and deep space applications. Experimental specification,” standard, CCSDS 131.1-O-0.4, Consultative Committee for Space Data Systems, Mar. 2006.

[6] “Use of DVB-S2 ETSI standard in high data rate telemetry for near space-Earth transmissions. Experimental specification,” standard, CCSDS 131.1-O-0.4, Consultative Committee for Space Data Systems, Mar. 2006.

[7] G. Battail, “Weighting of the symbols decoded by the Viterbi algorithm (in French),” Ann. Télécommun., vol. 42, pp. 31–38, Jan. 1987.

[8] J. Heller and I. Jacobs, “Viterbi decoding for satellite and space communication,” IEEE Transactions on Communication Technology, vol. 19, no. 5, pp. 835–848, 1971.

[9] S. Benedetto, G. Montorsi, D. Divsalar, and F. Pollara, “A soft-input soft-output maximum a posteriori (MAP) module to decode parallel and serial concatenated codes,” Telecommunications and Data Acquisition Progress Report, vol. 42, Jul. 1996.

[10] A. Viterbi, “An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes,” IEEE Journal on Selected Areas in Communications, vol. 16, no. 2, pp. 260–264, 1998.

[11] S. Benedetto and G. Montorsi, “Performance of continuous and blockwise decoded turbo codes,” IEEE Communications Letters, vol. 1, no. 3, pp. 77–79, 1997.

[12] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate (corresp.),” IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 284–287, 1974.

[13] P. Robertson, P. Hoeher, and E. Villebrun, “Optimal and sub-optimal maximum a posteriori algorithms suitable for turbo decoding,” European Transactions on Telecommunications, vol. 8, no. 2, pp. 119–125, 1997.

[14] M. Jezequel, C. Berrou, C. Douillard, and P. Penard, “Characteristics of a sixteen-state turbo-encoder/decoder (Turbo4),” in International Symposium on Turbo Codes & Related Topics, (Brest, France), pp. 280–283, Télécom Bretagne, Sept. 1997.

[15] S. Barbulescu, Turbo Codes on Satellite Communications, p. 44. Jan. 2005.

[16] S. Barbulescu, W. Farrell, P. Gray, and M. Rice, “Bandwidth efficient turbo coding for high speed mobile satellite communications,” Nov. 1997.

[17] S. S. Pietrobon, “Implementation and performance of a turbo/MAP decoder,” International Journal of Satellite Communications, vol. 16, no. 1, pp. 23–46, 1998.

[18] “Introduction to cdma2000 standards for spread spectrum systems,” standard, 3rd Generation Partnership Project 2, July 1999.

[19] “Physical layer standard for cdma2000 spread spectrum systems,” standard, 3rd Generation Partnership Project 2, July 1999.

[20] D. Wisdom, E. Ajayi, U. Arinze, O. Aladesote, A. Ganya, H. Idris, and D. Wisdom, “IEEE Computer Society Nigeria technical paper series: A comprehensive survey on power saving schemes (CSPSS) in IEEE 802.16e/m networks,” Jul. 2021.

[21] H. Latchman, S. Katar, L. Yonge, and S. Gavette, HomePlug AV and IEEE 1901: A handbook for PLC designers and users. Sep. 2013.

[22] S. Li, B. Bai, J. Zhou, P. Chen, and Z. Yu, “Reduced-complexity equalization for faster-than-Nyquist signaling: New methods based on Ungerboeck observation model,” IEEE Transactions on Communications, vol. 66, no. 3, pp. 1190–1204, 2018.

[23] F.-L. Luo and C. Zhang, Faster-than-Nyquist Signaling for 5G Communication, pp. 24–46. 2016.


[24] J. Fan, S. Guo, X. Zhou, Y. Ren, G. Li, and X. Chen, “Faster-than-Nyquist signaling: An overview,” IEEE Access, vol. PP, pp. 1–1, Feb. 2017.

[25] V. K. Veludandi, “BCJR vs SOVA for a practical coherent turbo coded OFDM system,” in 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–5, 2019.

[26] K. Vasudevan, “Coherent detection of turbo-coded OFDM signals transmitted through frequency selective Rayleigh fading channels with receiver diversity and increased throughput,” Wireless Personal Communications, vol. 82, pp. 1623–1642, Feb. 2015.

[27] J. Kang and W. Stark, “Turbo codes for noncoherent FH-SS with partial band interference,” IEEE Transactions on Communications, vol. 46, no. 11, pp. 1451–1458, 1998.

[28] H. El Gamal and E. Geraniotis, “Turbo codes with channel estimation and dynamic power allocation for anti-jam FH/SSMA,” in IEEE Military Communications Conference. Proceedings. MILCOM 98 (Cat. No.98CH36201), vol. 1, pp. 170–175, 1998.

[29] J. Gass, P. Curry, and C. Langford, “An application of turbo trellis-coded modulation to tactical communications,” in MILCOM 1999. IEEE Military Communications. Conference Proceedings (Cat. No.99CH36341), vol. 1, pp. 530–533, 1999.

[30] S. Jiang, P. W. Zhang, F. Lau, C.-W. Sham, and K. Huang, “A turbo-Hadamard encoder/decoder system with hundreds of Mbps throughput,” in 2018 IEEE 10th International Symposium on Turbo Codes & Iterative Information Processing (ISTC), pp. 1–5, 2018.

[31] A. Louliej, Y. Jabrane, V. Gil Jiménez, and A. Garcia Armada, “Practical guidelines for approaching the implementation of neural networks on FPGA for PAPR reduction in vehicular networks,” Sensors, vol. 19, p. 116, Dec. 2018.

[32] “Flexible advanced coding and modulation scheme for high rate telemetry applications: Blue book,” standard, CCSDS 131.2-B-1, Consultative Committee for Space Data Systems, Mar. 2012.

[33] A. Lamoral Coines and V. P. G. Jiménez, “CCSDS 131.2-B-1 transmitter design on FPGA with adaptive coding and modulation schemes for satellite communications,” Electronics, vol. 10, no. 20, 2021.

[34] E. Boutillon, C. Douillard, and G. Montorsi, “Iterative decoding of concatenated convolutional codes: Implementation issues,” Proceedings of the IEEE, vol. 95, no. 6, pp. 1201–1227, 2007.

[35] P. Robertson, “Illuminating the structure of code and decoder of parallel concatenated recursive systematic (turbo) codes,” in 1994 IEEE GLOBECOM. Communications: The Global Bridge, vol. 3, pp. 1298–1303, 1994.

[36] F. Dekking, C. Kraaikamp, H. Lopuhaä, and L. Meester, A Modern Introduction to Probability and Statistics: Understanding Why and How. Springer Texts in Statistics, Springer, 2005.

[37] J. Erfanian, S. Pasupathy, and G. Gulak, “Reduced complexity symbol detectors with parallel structure for ISI channels,” IEEE Transactions on Communications, vol. 42, no. 234, pp. 1661–1671, 1994.

[38] W. Koch and A. Baier, “Optimum and sub-optimum detection of coded data disturbed by time-varying intersymbol interference (applicable to digital mobile radio receivers),” in [Proceedings] GLOBECOM ’90: IEEE Global Telecommunications Conference and Exhibition, vol. 3, pp. 1679–1684, 1990.

[39] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain,” in Proceedings IEEE International Conference on Communications ICC ’95, vol. 2, pp. 1009–1013, 1995.

MIGUEL ÁNGEL PÉREZ NARANJO received the B.S. and M.S. degrees in telecommunication from the University Carlos III of Madrid in 2019 and 2021, respectively. In addition to working in the private sector in Spain, he has spent most of his research career as a senior research assistant at University Carlos III of Madrid, from 2020 to 2022, where he has also participated in international projects. His research interests include advanced beamforming techniques and hardware implementation of hybrid algorithms applied to satellite communications.

VÍCTOR P. GIL JIMÉNEZ (Senior Member, IEEE) received the B.S. degree (Hons.) in telecommunication from the University of Alcalá in 1998 and the M.S. degree (Hons.) in telecommunication and the Ph.D. degree (Hons.) from the University Carlos III of Madrid in 2001 and 2005, respectively. He was a member of the communications staff at the Spanish Antarctic Base in 1999. He visited the University of Leeds, U.K., in 2003, Chalmers Technical University, Sweden, in 2004, and the Instituto de Telecomunicações, Portugal, from 2008 to 2010. He is currently an Associate Professor with the Department of Signal Theory and Communications, University Carlos III of Madrid. He has also led several private and national Spanish projects and has participated in several European and international projects. He holds one patent. He has published over 80 journal articles/conference papers and 8 book chapters. His research interests include advanced multicarrier systems for wireless radio, satellite and visible light communications. He was Chair of the IEEE Spanish Communications and Signal Processing Joint Chapter from 2015 to 2022. He received the Master Thesis Award and the Ph.D. Thesis Award from the Professional Association of Telecommunication Engineers of Spain in 1998 and 2006, respectively.

