Fig 1 - uploaded by Joao Andrade
Generic Tanner graph and iterative decoding using a thread-per-node parallel approach.

Source publication
Conference Paper
Full-text available
It is well known that LDPC decoding is computationally demanding and one of the hardest signal processing operations to parallelize. Beyond the data dependencies that restrict the decoding of a single word, it requires a large number of memory accesses. In this paper we propose parallel algorithms for performing on CPUs the most demanding case of irregular and lo...

Contexts in source publication

Context 1
... are usually represented by bipartite Tanner graphs [14], connecting Bit Nodes (BN) and Check Nodes (CN). The information received from the channel is propagated, processed and exchanged between neighboring nodes of the graph, as depicted by the arrows in figure 1. If at the end of an iteration the codeword does not verify all parity-check equations, a new iteration is launched until the maximum number of allowed iterations occurs or a valid codeword is reached. ...
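The stopping rule described in this excerpt (declare success as soon as every parity-check equation holds) amounts to checking that H·ĉ = 0 over GF(2). A minimal sketch, using a hypothetical toy 3×6 parity-check matrix that is not taken from the paper:

```python
# Hypothetical toy parity-check matrix: 3 check nodes, 6 bit nodes.
# Each 1-entry is an edge of the Tanner graph connecting a check node
# (row) to a bit node (column).
H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]

def is_valid_codeword(H, word):
    """A word is a valid codeword iff every parity-check equation
    (one per row of H) sums to 0 modulo 2."""
    return all(sum(h * b for h, b in zip(row, word)) % 2 == 0 for row in H)

print(is_valid_codeword(H, [0, 0, 0, 0, 0, 0]))  # True: all-zero word
print(is_valid_codeword(H, [1, 0, 0, 0, 0, 0]))  # False: checks 1 and 3 fail
```

In the decoder loop this test runs once per iteration on the hard-decision word, terminating early when it passes or when the iteration budget is exhausted.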
Context 2
... Min-Sum algorithm was adopted in this work to perform the decoding of computationally intensive long LDPC codes because it is less complex than the well-known Sum-Product algorithm [14]. As figure 1 indicates, the inputs of the decoder are log-likelihood ratios (LLRs), which represent the logarithm of the ratio of two complementary probabilities (LLR(x) = ln(p(x = 0)/p(x = 1))) at the input of the decoder [14]. ...
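A minimal, illustrative sketch of the two ingredients named above: the LLR definition, and the Min-Sum check-node update, which replaces the Sum-Product tanh-based kernel with a cheaper sign/minimum rule. The function names and toy values are assumptions for illustration, not taken from the paper:

```python
import math

def llr(p0):
    """LLR(x) = ln(p(x=0)/p(x=1)); positive values mean bit 0 is more likely."""
    return math.log(p0 / (1.0 - p0))

def min_sum_check_update(incoming):
    """Min-Sum CN update: the message sent back to each neighbor is the
    product of the signs of all *other* incoming messages, scaled by the
    minimum of their magnitudes (an approximation of the Sum-Product kernel)."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        out.append(sign * min(abs(m) for m in others))
    return out

print(round(llr(0.9), 3))                      # 2.197: bit 0 strongly favored
print(min_sum_check_update([2.0, -1.5, 0.5]))  # [-0.5, 0.5, -1.5]
```

The min/sign rule is what makes Min-Sum attractive on parallel hardware: each outgoing message needs only comparisons and sign flips, with no transcendental functions.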

Similar publications

Article
Full-text available
When video is transmitted over 3G networks, the video quality might suffer from impairments caused by packet losses. Video quality feature extraction involves a set of algorithms, among which the inverse discrete cosine transform is an important one. To improve the performance and be suitable for evaluating 3G video quality in real-t...
Data
Full-text available
Although the use of iterative algorithms for image reconstruction in 3D Positron Emission Tomography (PET) has been shown to produce images with better quality than analytical methods, they are computationally expensive. New Graphics Processing Units (GPUs) provide high performance at low cost and programming tools that make it possible to execute paralle...
Article
Full-text available
Over the past few years, clusters equipped with GPUs have become attractive tools for high performance computing. In this thesis, we have designed parallel iterative algorithms for solving large sparse linear and nonlinear systems on GPU clusters. First, we have focused on solving sparse linear systems using the CG and GMRES iterative methods. The ex...

Citations

... Fountain codes first appeared for the distribution of bulk data in 1998 [33], and LDPC codes were later used in standards such as WiMAX IEEE 802.16 [34,35], 10GBase-T Ethernet IEEE 802.3 [36], Wi-Fi 802.11n [34] and the DVB-S2 Digital Video Broadcasting standard [37,38]. After such communication models, researchers dug deeper into the use of these LDPC codes in data storage systems. ...
Article
Full-text available
The need for highly scalable and reliable big data storage systems stems from the explosive growth of data everywhere, particularly data generated by social networking sites and IoT technology for various applications. Hence, most information technology and medical organizations, large industries, social networking companies, and government organizations (such as ISRO) require storage capacities of 100 PB (petabytes) of data. To store this kind of large data securely and efficiently, research on the application of erasure codes for both cloud storage and network or distributed storage systems has recently been considered. The traditional triple-replication method, which stores 3 copies of every file and requires an extra 200% storage overhead, is highly expensive. Erasure codes are considered for parallel storage systems as an alternative to traditional storage systems. This paper presents the various techniques of applying LDPC codes for big data storage and identifies the research gap in the application of LDPC codes for big data storage.
... The disadvantages of LDPC codes include higher encoding complexity, longer latency than turbo codes, and poorer performance compared to turbo codes when the code length is short [12]. LDPC codes have been adopted in several standards, including IEEE 802.16 (WiMAX) [27], IEEE 802.3 (10GBase-T Ethernet) [28] and DVB-S2 (satellite transmission of digital television) [29], [30]. The algorithm to decode LDPC codes is known under different names; the most common are the belief propagation algorithm, the message passing algorithm and the sum-product algorithm. ...
Article
Full-text available
Forward error detection and correction codes have been widely used for many years, both in storage applications and for data transferred over wireline or wireless communication systems. Due to unreliable wireless links, the broadcast nature of wireless transmissions, interference, noisy transmission channels, frequent topology changes, and varying wireless channel quality, it is challenging to provide high-data-rate service, high throughput, a high packet delivery ratio (PDR), low end-to-end delay and reliable services. To address these challenges, several channel coding schemes have been proposed. In this paper, a detailed overview of the major concepts in error detection and correction codes is presented. The paper covers the fundamentals of Low Density Parity Check (LDPC) codes and provides a comprehensive survey of binary and non-binary LDPC codes.
... However, [12], [42] demonstrated that turbo decoding is the most processor-intensive operation of base-station processing, requiring at least 64% of the processing resources used for receiving a message frame, where the remaining 36% includes the FFT, demapping, demodulation and other operations. Motivated by this, a number of previous research efforts [13], [14], [17], [18], [21], [38], [39], [43]-[45] have proposed GPGPU implementations dedicated to turbo decoding, as shown in Figure 3. Additionally, the authors of [27]-[30], [36], [40], [46], [47] have proposed GPGPU implementations of LDPC decoders. ...
Article
Full-text available
Turbo codes comprising a parallel concatenation of upper and lower convolutional codes are widely employed in state-of-the-art wireless communication standards, since they facilitate transmission throughputs that closely approach the channel capacity. However, this necessitates high processing throughputs in order for the turbo code to support real-time communications. In state-of-the-art turbo code implementations, the processing throughput is typically limited by the data dependencies that occur within the forward and backward recursions of the Log-BCJR algorithm, which is employed during turbo decoding. In contrast to the highly-serial Log-BCJR turbo decoder, we have recently proposed a novel Fully Parallel Turbo Decoder (FPTD) algorithm, which can eliminate the data dependencies and perform fully parallel processing. In this paper, we propose an optimized FPTD algorithm, which reformulates the operation of the FPTD algorithm so that the upper and lower decoders have identical operation, in order to support Single Instruction Multiple Data (SIMD) operation. This allows us to develop a novel General Purpose Graphics Processing Unit (GPGPU) implementation of the FPTD, which has application in Software-Defined Radios (SDRs) and virtualized Cloud-Radio Access Networks (C-RANs). As a benefit of its higher degree of parallelism, we show that our FPTD improves the processing throughput of the Log-BCJR turbo decoder by a factor of between 2.3 and 9.2, when employing a high-specification GPGPU. However, this is achieved at the cost of a moderate increase in overall complexity, by a factor of between 1.7 and 3.3.
... The memory footprint is not the most pressing issue in programmable architectures, as the memory addressing space of modern CPU and GPU systems can be larger than the Tanner graph indexing memory footprint (1) (2) (3). However, the indexing method becomes a source of memory contention if every computed message requires loading a memory index location, reducing the overall bandwidth to memory; it contributes to poorer cache hit ratios on CPUs [75] and adds further pressure to GPU memory engines [30]. The best-performing LDPC decoders are those employing structured sparse storage that exploits the Tanner graph structure, as opposed to a generic sparse matrix storage method. ...
... The best-performing LDPC decoders are those employing structured sparse storage that exploits the Tanner graph structure, as opposed to a generic sparse matrix storage method. For instance, LDPC decoders implementing the former methodology achieve much higher throughputs than those implementing the latter [30]. ...
... The decoder forcibly defined a 2-D texture mapping of the log-likelihood ratio (LLR) messages, which contributed to the poor performance yielded [31]. Under a more general-purpose computing memory mapping, the authors were able to elevate the decoding throughputs to ∼87 Mbit/s for the normal-frame DVB-S2 codes [30], and to ∼40 Mbit/s for rate-1/2 Mackay codes (1024 to 20000 bits) [32]. The difference in the attained performance shows how data-parallelism design decisions and Tanner graph indexing methods are pivotal to elevating the decoding throughputs attained by GPU-based LDPC decoders. ...
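The indexing trade-off discussed in these excerpts can be illustrated by contrasting a generic sparse (COO-style) edge list, which loads one index pair from memory per message, with a structure-aware layout for a hypothetical regular code, where neighbor addresses are computed rather than loaded. The matrices and connection pattern below are illustrative assumptions, not the schemes used in the cited decoders:

```python
# Generic sparse (COO-style) storage: one explicit (check, bit) pair per
# Tanner-graph edge. Every message computed requires fetching an index
# from memory -- the contention issue noted in the excerpt above.
coo_edges = [(0, 0), (0, 3), (1, 1), (1, 4), (2, 2), (2, 5)]

def bits_of_check_coo(edges, check):
    """Gather the bit-node neighbors of a check node by scanning indices."""
    return [b for c, b in edges if c == check]

# Structure-aware storage for a hypothetical regular code in which check
# node c is connected to bits c and c + N_CHECKS: neighbor addresses are
# computed on the fly, so no index array has to be loaded at all.
N_CHECKS = 3

def bits_of_check_structured(check):
    return [check, check + N_CHECKS]

# Both layouts describe the same Tanner graph.
for c in range(N_CHECKS):
    assert bits_of_check_coo(coo_edges, c) == bits_of_check_structured(c)
```

On a GPU the structured variant also yields predictable, coalesced access patterns, which is one reason structure-exploiting decoders outperform generic sparse-matrix ones.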
Article
Full-text available
Low-density parity-check (LDPC) block codes are popular forward error correction schemes due to their capacity-approaching characteristics. However, the realization of LDPC decoders that meet both low latency and high throughput is not a trivial challenge. Usually, this has been solved with ASIC and FPGA technology that enables meeting the decoder design constraints. But the rise of parallel architectures, such as graphics processing units, and the scaling of CPU streaming extensions, has shown that multi-core and many-core technology can provide a flexible alternative to the development of dedicated LDPC decoders for the compute-intensive prototyping phase of the design of new codes. In this light, this paper surveys the most relevant publications of the past decade on programmable LDPC decoders. It looks at the advantages and disadvantages of parallel architectures and data-parallel programming models, and assesses how the design space exploration is pursued regarding key characteristics of the underlying code and decoding algorithm features. The paper concludes with a set of open problems in the field of communication systems on parallel programmable and reconfigurable architectures.
... However, as the DVB-S2 codes have been specifically designed to enable high-speed and low-complexity implementation, decoding speeds can be very high. Reported decoding speeds are around 80-200 Mb/s on a GPU [20] and around 300 Mb/s on an ASIC [21]. Encoding speeds are much faster. ...
Article
Full-text available
This paper investigates the design of low-complexity error correction codes for the verification step in continuous variable quantum key distribution (CVQKD) systems. We design new coding schemes based on quasi-cyclic repeat-accumulate codes which demonstrate good performances for CVQKD reconciliation.
... Two application examples are Quantum Key Distribution (QKD), a cryptographic primitive that applies quantum mechanics for establishing secure communications, and video transmission systems that are capable of working over the erasure channel under severe Signal-to-Noise Ratio (SNR) conditions. Although we have witnessed the recent development of binary LDPC decoders on Graphics Processing Units (GPUs) [3, 4, 5, 6, 7, 8], the importance of the non-binary case still seems to be underestimated. In this paper, we propose a parallel decoder based on the Fast-Fourier Transform Sum-Product Algorithm (FFT-SPA) that exploits the multithread capabilities of the GPU to parallelize the intensive computation of Variable Nodes (VNs) and, more importantly, of Check Nodes (CNs). ...
Conference Paper
Full-text available
It is well known that non-binary LDPC codes outperform the BER performance of binary LDPC codes for the same code length. The superior BER performance of non-binary codes comes at the expense of more complex decoding algorithms that demand higher computational power. In this paper, we propose parallel signal processing algorithms for performing the FFT-SPA and the corresponding decoding of non-binary LDPC codes over GF(q). The constraints imposed by the complex nature of the associated subsystems and kernels, in particular the Check Nodes, present computational challenges regarding multicore systems. Experimental results obtained on GPU for a variety of GF(q) show throughputs on the order of 2 Mbps, which is far above the minimum throughput required, for example, for real-time video applications that can benefit from such error-correcting capabilities.
Thesis
Digital communication systems are ubiquitous in our daily lives. Evolving needs drive the research and development of innovative solutions for future communication systems. In satellite communications, most satellites use radio-frequency links to communicate with Earth. To limit bandwidth usage and increase data rates, digital communication over optical links is an attractive alternative. These technologies use lasers for transmission and telescopes for reception. However, the light energy is absorbed or deflected by particles in the Earth's atmosphere. These disturbances raise new problems, and new coding schemes must be devised to overcome them. LDPC codes are a family of error-correcting codes. Their performance close to the Shannon limit makes them very attractive for digital communication systems. They have notably been selected for the Wi-Fi standard and for 5G, enabling very high throughputs (several Gbit/s). They have also been adopted by the CCSDS and DVB-S2 standards for space applications. This thesis studies the hardware implementation of coding schemes for satellite digital communications over optical links. The first contribution is the study of a coding scheme for an optical downlink with soft-input channel decoding on the ground. As part of this study, a hardware architecture was developed that implements the decoding process on an FPGA and reaches an expected throughput of 10 Gbit/s. A second contribution concerns the optical uplink, involving a hard-input channel decoder embedded in a satellite.
The resulting constraints led to rethinking the extended Gallager B algorithm. This enabled the design of a new architecture that performs hard-input decoding efficiently while meeting the space constraints on hardware complexity, heat dissipation, and throughput (10 Gbit/s).
Conference Paper
Demodulation and decoding of second-generation terrestrial digital video broadcasting (DVB-T2) signals on general-purpose processor platforms is challenging in terms of both complexity and power. FPGA-based runtime acceleration for DVB-T2 allows unwrapping the iterative structures of modern channel decoding schemes by using parallel hardware designs. Additionally, due to the sequential nature of the DVB-T2 receiver chain, we can use partial reconfiguration to switch between different decoding modules. We show in a theoretical analysis that this time-multiplexing approach can be used to realize resource-efficient DVB-T2 receiver chains at much lower resource and power consumption compared to solely processor-based solutions.
Article
Because layered low-density parity-check (LDPC) decoding algorithm was proposed, one can exploit the diversity gain to achieve performance comparable to the traditional two-phase message passing (TPMP) decoding but with about twice faster decoding convergence compared to TPMP. In order to reduce the decoding time of layered LDPC decoder, a graphics processing unit (GPU) is exploited as the modem processor so that the decoding procedure can be processed in parallel using numerous threads in the GPU. In this paper, we present the parallel algorithms and efficient implementations on the GPU for two different layered message passing schemes, the row-layered and column-layered decoding. In the experiments, the quasicyclic LDPC codes for WiFi (802.11n) and WiMAX (802.16e) are decoded by the proposed layered LDPC decoders. The experimental results show that our decoder has good bit error ratio (BER) performance comparable to TPMP decoder. The peak throughput is 712 Mbps, which is about two orders of magnitude faster than that of CPU implementation and comparable to the dedicated hardware solutions. Compared to the existing fastest GPU-based implementation, the presented decoder can achieve a performance improvement of 2.3 times. Copyright © 2013 John Wiley & Sons, Ltd.