Figure 1 - uploaded by Francisco Igual
C66x DSP Core Block Diagram.

Source publication
Article
Full-text available
Digital Signal Processors (DSP) are commonly employed in embedded systems. The increase of processing needs in cellular base stations, radio controllers and industrial/medical imaging systems has led to the development of multi-core DSPs as well as the inclusion of floating-point operations while maintaining low power dissipation. The eight-core...

Context in source publication

Context 1
... The C66x core architecture [9], based on the Very Long Instruction Word (VLIW) paradigm, is shown in Figure 1. The architecture takes advantage of several levels of parallelism. ...

Similar publications

Article
Full-text available
We present an architecture-portable and performant implementation of the atmospheric dynamical core (High-Order Methods Modeling Environment, HOMME) of the Energy Exascale Earth System Model (E3SM). The original Fortran implementation is highly performant and scalable on conventional architectures using the Message Passing Interface (MPI) and Open...
Conference Paper
Full-text available
Shared-memory and message-passing are two opposite models to develop parallel computations. The shared-memory model, adopted by existing frameworks such as OpenMP, represents a de facto standard on multi-/many-core architectures. However, message-passing deserves to be studied for its inherent properties in terms of portability and flexibility as w...
Article
Full-text available
w-Projection is a wide-field imaging technique that is widely used in radio synthesis arrays. Processing the wide-field big data generated by the future Square Kilometre Array (SKA) will require significant updates to current methods to significantly reduce the time consumed on data processing. Data loading and gridding are found to be two major ti...
Conference Paper
Full-text available
There are several parallel applications that are implemented using a Master/Worker parallel/distributed programming paradigm. Applications using this predefined programming structure can be easily implemented using message passing programming libraries (MPI). Moreover, the multicore features present nowadays on CPU architecture can be exploited a...
Article
Full-text available
This paper focuses on the parallelization of an ocean model applying current multicore processor-based cluster architectures to an irregular computational mesh. The aim is to maximize the efficiency of the computational resources used. To make the best use of the resources offered by these architectures, this parallelization has been addressed at a...

Citations

... It has the advantage of ultra-low power consumption due to its very long instruction words and on-chip temporary memory [6]. GPDSPs that have been introduced for high-performance computing include TI's C66X series [7,8] and Phytium's Matrix series [9][10][11][12][13]. ...
Article
Full-text available
Structured grid-based sparse matrix-vector multiplication and Gauss–Seidel iterations are very important kernel functions in scientific and engineering computations, both of which are memory intensive and bandwidth-limited. The GPDSP is a general-purpose digital signal processor, a significant embedded processor that has been introduced into high-performance computing. In this paper, we designed various optimization methods for structured grid-based SpMV and Gauss–Seidel iterations on the GPDSP, which included a blocking method to improve data locality and increase memory access efficiency, a multicolor reordering method to develop Gauss–Seidel fine-grained parallelism, a data partitioning method designed for GPDSP memory structures, and a double buffering method to overlap computation and memory accesses. Finally, we combined the above optimization methods to design a multicore vectorization algorithm. We tested the matrices generated with structured grids of different sizes on the GPDSP platform and obtained speedups of up to 41× and 47× compared to the unoptimized SpMV and Gauss–Seidel iterations, with maximum bandwidth efficiencies of 72% and 81%, respectively. The experimental results show that our algorithms could fully utilize the external memory bandwidth. We also implemented the commonly used mixed-precision algorithm on the GPDSP and obtained speedups of 1.60× and 1.45× for the SpMV and Gauss–Seidel iterations, respectively.
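The multicolor reordering idea can be illustrated with its simplest two-color (red-black) form on a 2-D structured grid: each point of one color depends only on points of the other color, so all updates within a color are independent and can be vectorized or partitioned across cores. Below is a minimal C sketch for the 5-point Poisson stencil; it is a generic illustration, not the paper's GPDSP code, and the grid layout with a one-cell halo is an assumption.

/* Red-black Gauss-Seidel sweep for the 2-D 5-point Poisson stencil
 * (discretization of Laplacian(u) = f with mesh width h).
 * u and f are (n+2)x(n+2) row-major grids with a one-cell halo.
 * Points with (i + j) even ("red") are swept first, then odd ("black");
 * within one color every update is independent, so each inner loop can
 * be vectorized or split across cores. */
void gs_redblack_sweep(double *u, const double *f, int n, double h)
{
    const int ld = n + 2;                        /* row stride incl. halo */
    for (int color = 0; color < 2; ++color) {
        for (int i = 1; i <= n; ++i) {
            int j0 = 1 + ((i + 1 + color) & 1);  /* first j of this color */
            for (int j = j0; j <= n; j += 2)
                u[i * ld + j] = 0.25 * (u[(i - 1) * ld + j]
                                      + u[(i + 1) * ld + j]
                                      + u[i * ld + j - 1]
                                      + u[i * ld + j + 1]
                                      - h * h * f[i * ld + j]);
        }
    }
}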
... Ali et al. [30] describe an implementation of BLAS functions on a multi-core DSP. GEMM performance is optimized by partitioning both input matrices and processing the partitions in parallel. ...
Article
Full-text available
Matrix multiplication is an important operation for many engineering applications. Sometimes new features that include matrix multiplication must be added to existing, and even out-of-date, embedded platforms. In this paper, an unusual problem is considered: how to implement matrix multiplication of 32-bit signed integers and fixed-point numbers on a DSP that has SIMD instructions for 16-bit integers only. For the examined tasks, matrix sizes may vary from several tens to two hundred. The proposed mathematical approach for dense rectangular matrix multiplication of 32-bit numbers comprises decomposition of the 32-bit matrices into matrices of 16-bit numbers, four matrix multiplications of 16-bit unsigned integers via outer product, and correction of the outcome for signed integers and fixed-point numbers. Several tricks for performance optimization are analyzed. In addition, ways for block-wise and parallel implementations are described. An implementation of the proposed method by means of 16-bit vector instructions is faster than matrix multiplication using 32-bit scalar instructions and demonstrates performance close to the theoretically achievable limit. The described technique can be generalized to matrix multiplication of n-bit integers and fixed-point numbers via handling matrices of n/2-bit integers. In conclusion, recommendations are presented for practitioners who work on implementing matrix multiplication for various DSPs.
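The decomposition at the heart of the method follows from splitting each 32-bit entry into 16-bit halves: with a = ah*2^16 + al, the product is a*b = (ah*bh)*2^32 + (ah*bl + al*bh)*2^16 + al*bl, so one 32-bit matrix product reduces to four matrix products of 16-bit operands plus shifts and adds. A scalar C sketch of this recombination follows; note the paper multiplies unsigned halves and applies a sign correction afterwards, while here the sign handling is folded into the high halves, and the names are illustrative.

#include <stdint.h>

/* Scalar reference for the decomposition: each 32-bit entry a is split
 * as a = ah*2^16 + al with ah = a >> 16 (signed) and al = a & 0xFFFF
 * (unsigned).  Four "half" dot products are accumulated separately and
 * recombined; the recombination is performed modulo 2^64, which is
 * exact whenever the true integer result fits in 64 bits. */
void mm32_via_16bit(const int32_t *A, const int32_t *B, int64_t *C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            int64_t hh = 0, hl = 0, lh = 0, ll = 0;
            for (int k = 0; k < n; ++k) {
                int32_t a = A[i * n + k], b = B[k * n + j];
                int32_t  ah = a >> 16;               /* signed high half  */
                uint32_t al = (uint32_t)a & 0xFFFFu; /* unsigned low half */
                int32_t  bh = b >> 16;
                uint32_t bl = (uint32_t)b & 0xFFFFu;
                hh += (int64_t)ah * bh;
                hl += (int64_t)ah * (int64_t)bl;
                lh += (int64_t)al * bh;
                ll += (int64_t)al * (int64_t)bl;
            }
            /* a*b = (ah*bh)<<32 + (ah*bl + al*bh)<<16 + al*bl */
            C[i * n + j] = (int64_t)(((uint64_t)hh << 32)
                                   + ((uint64_t)(hl + lh) << 16)
                                   + (uint64_t)ll);
        }
}

On the target DSP, each of the four half products maps onto 16-bit SIMD multiply instructions, which is where the speedup over 32-bit scalar multiplication comes from.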
... • We develop an instance of gemm optimized for 8-bit integer (INT8) arithmetic that can be easily adapted for fixed point arithmetic. • Similarly to [13], we orchestrate a careful sequence of data transfers between the different memory areas of the GAP8 via DMA transfers, integrating these movements into the BLIS packing routines. • We perform an experimental evaluation of the gemm realization in the FC (fabric controller) in the GAP8 platform. ...
Article
Full-text available
We address the efficient realization of matrix multiplication (gemm), with application in the convolution operator for machine learning, for the RISC-V core present in the GreenWaves GAP8 processor. Our approach leverages BLIS (Basic Linear Algebra Instantiation Software) to develop an implementation that (1) re-organizes the gemm algorithm, adapting its micro-kernel to exploit the hardware-supported dot product kernel in the GAP8; (2) explicitly orchestrates the data transfers across the hierarchy of scratchpad memories via DMA (direct memory access); and (3) operates with integer arithmetic.
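The overall shape of such a micro-kernel can be sketched as follows: A and B are packed into contiguous micro-panels so the inner loop streams both operands, and each accumulator update maps onto the hardware dot product. This is a generic BLIS-style illustration in plain C, not the GAP8 code; dot4, MR/NR and the packing layout are assumptions standing in for the hardware-supported primitive and the paper's actual blocking choices.

#include <stdint.h>

enum { MR = 4, NR = 4 };   /* micro-tile of C computed per kernel call */

/* Stand-in for a hardware-supported 4-way int8 dot product. */
static inline int32_t dot4(const int8_t *a, const int8_t *b)
{
    return (int32_t)a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* BLIS-style micro-kernel: Ap holds MR rows of length kc, Bp holds NR
 * columns of length kc, both packed contiguously so the k-loop streams
 * them with unit stride; kc is assumed to be a multiple of 4. */
void ukernel_int8(int kc, const int8_t *Ap, const int8_t *Bp,
                  int32_t *C, int ldc)
{
    int32_t acc[MR][NR] = {{0}};
    for (int k = 0; k < kc; k += 4)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += dot4(&Ap[i * kc + k], &Bp[j * kc + k]);
    for (int i = 0; i < MR; ++i)        /* write back the micro-tile */
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}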
... This approach allows the parallel execution of matrix transposition and calculation. Several works leverage this feature to achieve high performance for several important workloads, including SAR image processing [18], signal processing kernels [19] and BLAS routines [20]. A limitation of these DMA controllers [5]-[10], [21] is that they can only transfer one matrix element per clock cycle, which significantly degrades the bandwidth utilization. ...
Article
Full-text available
Matrix transposition plays a critical role in digital signal processing. However, existing matrix transposition implementations have significant limitations. A traditional design uses load and store instructions to accomplish matrix transposition. Depending on the number of load/store units, this design typically transposes up to one matrix element per clock cycle. More seriously, this design cannot perform matrix transposition and data calculations in parallel. Modern digital signal processors (DSPs) integrate support for matrix transposition into the direct memory access (DMA) controller; the matrix can be transposed during data movements. This allows the parallel execution of matrix transposition and data calculations. Yet, its bandwidth utilization is limited; it can only transfer one matrix element per clock cycle. To address the limitations of existing designs, we propose MT-DMA, which supports efficient matrix transposition in DMA controllers. It can transpose multiple matrix elements per clock cycle to improve bandwidth utilization. Compared with existing designs, MT-DMA achieves up to a 23.9× performance improvement on micro-benchmarks. It is also more energy efficient. Since MT-DMA effectively hides the latency of matrix transposition behind data calculations, it performs very close to an ideal design on real applications.
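The overlap that DMA-based designs enable is essentially a double-buffering pipeline: while the controller transposes the next tile into one buffer, the cores compute on the previously transposed tile in the other. A schematic C sketch follows; dma_transpose_start, dma_wait (assumed to wait for the most recently issued transfer) and process_tile are hypothetical placeholders, not the MT-DMA or any vendor API.

/* Hypothetical DMA interface: start an asynchronous transpose of a
 * rows x cols tile, and wait for the most recently issued transfer. */
void dma_transpose_start(const float *src, float *dst, int rows, int cols);
void dma_wait(void);
void process_tile(const float *tile, int rows, int cols);

/* Double-buffered pipeline: the DMA engine transposes tile t+1 while
 * the core computes on the already-transposed tile t. */
void transpose_compute_pipeline(const float *tiles[], int ntiles,
                                int rows, int cols,
                                float *buf0, float *buf1)
{
    float *bufs[2] = { buf0, buf1 };
    dma_transpose_start(tiles[0], bufs[0], rows, cols);  /* prime pipe  */
    for (int t = 0; t < ntiles; ++t) {
        dma_wait();                                      /* tile t done */
        if (t + 1 < ntiles)                              /* kick t + 1  */
            dma_transpose_start(tiles[t + 1], bufs[(t + 1) & 1],
                                rows, cols);
        process_tile(bufs[t & 1], cols, rows);  /* transposed extents  */
    }
}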
... In essence, GEPP is decomposed into multiple calls to block-panel multiplication (GEBP). Since there is an L3 cache in the 64-bit ARMv8 architecture, we assume that a k_c × n_c panel of B will always reside fully in the L3 cache [12]. GEBP, the inner kernel handled at layer 4, updates an m_c × n_c panel of C as the product of an m_c × k_c block of A and a k_c × n_c panel of B. To ensure consecutive accesses, OpenBLAS packs a block (panel) of A (B) into contiguous buffers. ...
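The loop structure being described is the classic Goto-style blocking used by OpenBLAS and BLIS, sketched below in C. Block sizes m_c, k_c, n_c are chosen so the packed block of A stays resident in L2 and the packed panel of B in L3; pack_A, pack_B and gebp are placeholders for the packing routines and the inner kernel, and edge handling for partial blocks is omitted.

/* Placeholders for packing and the inner kernel (bodies omitted). */
void pack_A(const double *A, int lda, double *Abuf, int mc, int kc);
void pack_B(const double *B, int ldb, double *Bbuf, int kc, int nc);
void gebp(const double *Abuf, const double *Bbuf, double *C, int ldc,
          int mc, int nc, int kc);

/* Goto-style blocked GEMM (row-major); for simplicity m, n, k are
 * assumed to be multiples of mc, nc, kc. */
void gemm_blocked(int m, int n, int k,
                  const double *A, const double *B, double *C,
                  int mc, int kc, int nc, double *Abuf, double *Bbuf)
{
    for (int jc = 0; jc < n; jc += nc)            /* panels of B/C      */
        for (int pc = 0; pc < k; pc += kc) {      /* GEPP: rank-kc step */
            pack_B(&B[pc * n + jc], n, Bbuf, kc, nc);     /* -> L3      */
            for (int ic = 0; ic < m; ic += mc) {          /* GEBP calls */
                pack_A(&A[ic * k + pc], k, Abuf, mc, kc); /* -> L2      */
                gebp(Abuf, Bbuf, &C[ic * n + jc], n, mc, nc, kc);
            }
        }
}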
... Figure 6 illustrates our register allocation scheme for #0 and #1, together with the order in which the computations are performed in each case. The optimal distance of 7 from solving (12) ... When scheduling the instructions for each copy #i of the original register kernel, we also need to consider their WAR (write-after-read) and RAW (read-after-write) dependences. In each copy #i, 24 floating-point FMA instructions (fmla) and 7 load instructions (ldr) are executed, together with one prefetching instruction (prfm). ...
Research
Full-text available
Accepted by ICPP 2015, Beijing. Design and Implementation of a Highly Efficient DGEMM for 64-bit ARMv8 Multi-Core Processors.
... From this perspective, it is more similar to general-purpose architectures such as ARM or Intel x86. The BLAS implementation from TI [20,26] has been extensively used in our implementation when possible. In general, both inside the TI BLAS implementation and in the rest of the code, three different levels of parallelism are exploited: ...
Article
Full-text available
Power consumption is emerging as one of the main concerns in the High Performance Computing (HPC) field. As a growing number of bioinformatics applications require HPC techniques and parallel architectures to meet performance requirements, power consumption arises as an additional limitation when accelerating them. In this paper, we present a comparative study of optimized implementations of Non-negative Matrix Factorization (NMF), which is widely used in many fields of bioinformatics, taking into account both performance and power consumption. We target a wide range of state-of-the-art parallel architectures, including general-purpose low-power processors and specific-purpose accelerators like GPUs, DSPs or the Intel Xeon Phi. From our study, we gain insights into both performance and energy consumption for each of them under a number of experimental conditions, and conclude that the most appropriate architecture is usually a trade-off between performance and energy consumption for a given experimental setup and dataset.
... Unfortunately, we found no works in the literature which map these kernels on a CGRA and provide detailed energy results, allowing only coarse comparisons in terms of overall power efficiency or power density [3]. Only aggregate results could be found also for GPGPU solutions [7] and a DSP [8]. In [9], the authors presented a novel FPGA-based fine-grained reconfigurable architecture to map several numerical linear algebra kernels and compared it with the Intel Xeon Woodcrest processor, reporting a 10-150× speed-up/energy-efficiency improvement. ...
Conference Paper
Full-text available
A scalable mapping is proposed for three important kernels from the Numerical Linear Algebra domain, to exploit architectural features and reach asymptotically optimal efficiency and low energy consumption. Performance and power evaluations were done with input data set matrix sizes ranging from 64×64 to 16384×16384. Twelve architectural variants with up to 10×10 processing elements were used to explore scalability of the mapping and the architecture, achieving < 10% energy increase for architectures up to 8×8 PEs coupled with performance speed-ups of more than an order of magnitude. This enables a clean area-performance trade-off on the Layers architecture while keeping energy constant over the variants.
... GPUs, digital signal processors, and field-programmable gate arrays are typically significantly more energy efficient than conventional desktop or server CPUs; their deployment often leads to 4× increases in energy efficiency over traditional CPUs [1,2]. With mobile processor architectures becoming ever more versatile, the market is gradually moving away from custom silicon solutions toward energy-efficient CPUs. ...
Article
Full-text available
In recent years, a new generation of ultralow-power processors has emerged that is aimed primarily at signal processing in mobile computing. However, their architecture could make some of these useful for other applications. Algorithms originally developed for scientific computing are used increasingly in signal conditioning and emerging fields such as computer vision, increasing the demand for computing power in mobile systems. In this article, the authors describe the design and implementation of dense matrix multiplication on the Movidius Myriad architecture and evaluate its performance and energy efficiency. The authors demonstrate a performance of 8.11 Gflops on the Myriad I processor and a performance/watt ratio of 23.17 Gflops/W for a key computational kernel. These results show significant potential for scientific-computing tasks and invite further research.
... It provides a peak performance of 160 GFLOPS for floating point and 320 GMAC for fixed point at only 10 watts. It has been used recently by many research communities to build high-performance and low-power real-time signal processing systems [2,5-11]. Additionally, SRIO is integrated in the C6678 DSP as a peripheral device. ...
Article
Full-text available
Pulse-Doppler radars require high computing power. In this paper, a massively parallel machine has been developed to implement a Pulse-Doppler radar signal processing chain in real time. The proposed machine consists of two C6678 digital signal processors (DSPs), each with eight DSP cores, interconnected by a Serial RapidIO (SRIO) bus. In this study, each individual core is considered the basic processing element; hence, the proposed parallel machine contains 16 processing elements. A straightforward model has been adopted to distribute the Pulse-Doppler radar signal processing chain. This model provides low latency, but communication inefficiency limits system performance. This paper proposes several optimizations that greatly reduce the inter-processor communication of the straightforward model and improve the parallel efficiency of the system. A use case of the Pulse-Doppler radar signal processing chain has been used to illustrate and validate the concept of the proposed mapping model. Experimental results show that the parallel efficiency of the proposed parallel machine is about 90%.
... Embedded accelerators such as the TI C6678 DSP have been proven to provide high GFLOPS/Watt for HPC applications [1,2]. As a result there has been considerable interest in utilizing low-power SoCs containing these accelerators to build supercomputers capable of higher energy efficiency compared to current systems. ...
... Several approaches have been used to perform blocking/panelling of matrices to maximize utilization of caches and scratchpad RAM. In [1,2], GEMM was written for the C6678 DSP, and the performance measured across 8 DSP cores was 79.4 GFLOPS for SGEMM and 21 GFLOPS for DGEMM with the DSPs running at 1 GHz. To port GEMM to the 66AK2H, the same algorithm, inner kernel and matrix panelling scheme were used. ...
... With respect to the peak performance of the DSP cluster, using the accelerator model on the 66AK2H achieves 70.05% efficiency for SGEMM and 74.18% for DGEMM, including all overheads. The original implementation for the C6678 achieved 62% for SGEMM and 65% for DGEMM [1]. The increase in efficiency can be attributed to the larger L2 and MSMC SRAM available in the 66AK2H. ...
Conference Paper
Full-text available
The TI Keystone II architecture provides a unique combination of ARM Cortex-A15 processors with high-performance TI C66x floating-point DSPs on a single low-power System-on-Chip (SoC). Commercially available systems such as the HP ProLiant m800 and nCore BrownDwarf are based on this ARM-DSP SoC. The Keystone II architecture promises to deliver high GFLOPS/Watt and is of increasing interest as it provides an alternate building block for future exascale systems. However, the success of this architecture is intimately related to the ease of migrating existing HPC applications for maximum performance. Effective use of all ARM and DSP cores and DMA co-processors is crucial for maximizing performance/watt. This paper explores issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP, to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model. Single-precision (SGEMM) matrix multiplication performance of 110.11 GFLOPS and double-precision (DGEMM) performance of 29.15 GFLOPS were achieved on the TI Keystone II Evaluation Module Revision 3.0 (EVM). Trade-offs and factors affecting performance are discussed.
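For reference, an offload in the OpenMP 4.0 accelerator model described here takes roughly the following shape: the ARM host maps the matrices to the DSP device and dispatches the kernel there. The pragma syntax is standard OpenMP 4.0 array-section mapping; dsp_sgemm_panel is a hypothetical placeholder for the DSP-side GEMM, not TI's actual runtime interface.

/* Placeholder for the DSP-side GEMM entry point. */
void dsp_sgemm_panel(int m, int n, int k,
                     const float *A, const float *B, float *C);

void sgemm_offload(int m, int n, int k,
                   const float *A, const float *B, float *C)
{
    /* Map operands to the DSP device (standard OpenMP 4.0 array
     * sections) and execute the kernel there; C is copied back. */
    #pragma omp target map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
    dsp_sgemm_panel(m, n, k, A, B, C);
}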