Figure 1 - uploaded by Francisco Igual
C66x DSP Core Block Diagram.

Source publication
Article
Full-text available
Digital Signal Processors (DSP) are commonly employed in embedded systems. The increase of processing needs in cellular base stations, radio controllers and industrial/medical imaging systems has led to the development of multi-core DSPs as well as the inclusion of floating-point operations while maintaining low power dissipation. The eight-core...

Context in source publication

Context 1
... The C66x core architecture [9], based on the Very Long Instruction Word (VLIW) paradigm, is shown in Figure 1. The architecture takes advantage of several levels of parallelism. ...

Similar publications

Article
Full-text available
We present an architecture-portable and performant implementation of the atmospheric dynamical core (High-Order Methods Modeling Environment, HOMME) of the Energy Exascale Earth System Model (E3SM). The original Fortran implementation is highly performant and scalable on conventional architectures using the Message Passing Interface (MPI) and Open...
Conference Paper
Full-text available
Shared-memory and message-passing are two opposite models to develop parallel computations. The shared-memory model, adopted by existing frameworks such as OpenMP, represents a de facto standard on multi-/many-core architectures. However, message-passing deserves to be studied for its inherent properties in terms of portability and flexibility as w...
Article
Full-text available
w-Projection is a wide-field imaging technique that is widely used in radio synthesis arrays. Processing the wide-field big data generated by the future Square Kilometre Array (SKA) will require significant updates to current methods to significantly reduce the time consumed on data processing. Data loading and gridding are found to be two major ti...
Conference Paper
Full-text available
There are several parallel applications that are implemented using a Master/Worker parallel/distributed programming paradigm. Applications using this predefined programming structure can be easily implemented using message passing programming libraries (MPI). Moreover, the multicore features present nowadays on CPU architecture can be exploited a...
Article
Full-text available
This paper focuses on the parallelization of an ocean model applying current multicore processor-based cluster architectures to an irregular computational mesh. The aim is to maximize the efficiency of the computational resources used. To make the best use of the resources offered by these architectures, this parallelization has been addressed at a...

Citations

... It has the advantage of ultra-low power consumption due to its very long instruction words and on-chip temporary memory [6]. GPDSPs that have been introduced for high-performance computing include TI's C66X series [7,8] and Phytium's Matrix series [9][10][11][12][13]. ...
Article
Full-text available
Structured grid-based sparse matrix-vector multiplication and Gauss–Seidel iterations are very important kernel functions in scientific and engineering computations, both of which are memory intensive and bandwidth-limited. The GPDSP is a general-purpose digital signal processor, a significant embedded processor that has been introduced into high-performance computing. In this paper, we designed various optimization methods for structured grid-based SpMV and Gauss–Seidel iterations on the GPDSP, which included a blocking method to improve data locality and increase memory access efficiency, a multicolor reordering method to develop Gauss–Seidel fine-grained parallelism, a data partitioning method designed for GPDSP memory structures, and a double buffering method to overlap computation and memory accesses. Finally, we combined the above optimization methods to design a multicore vectorization algorithm. We tested the matrices generated with structured grids of different sizes on the GPDSP platform and obtained speedups of up to 41× and 47× compared to the unoptimized SpMV and Gauss–Seidel iterations, with maximum bandwidth efficiencies of 72% and 81%, respectively. The experimental results show that our algorithms could fully utilize the external memory bandwidth. We also implemented the commonly used mixed-precision algorithm on the GPDSP and obtained speedups of 1.60× and 1.45× for the SpMV and Gauss–Seidel iterations, respectively.
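The multicolor reordering idea can be illustrated with its simplest two-color (red-black) form on a 2-D structured grid: each point of one color depends only on points of the other color, so all updates within a color are independent and can be vectorized or partitioned across cores. Below is a minimal C sketch for the 5-point Poisson stencil; it is a generic illustration, not the paper's GPDSP code, and the grid layout with a one-cell halo is an assumption.

/* Red-black Gauss-Seidel sweep for the 2-D 5-point Poisson stencil
 * (discretization of Laplacian(u) = f with mesh width h).
 * u and f are (n+2)x(n+2) row-major grids with a one-cell halo.
 * Points with (i + j) even ("red") are swept first, then odd ("black");
 * within one color every update is independent, so each inner loop can
 * be vectorized or split across cores. */
void gs_redblack_sweep(double *u, const double *f, int n, double h)
{
    const int ld = n + 2;                        /* row stride incl. halo */
    for (int color = 0; color < 2; ++color) {
        for (int i = 1; i <= n; ++i) {
            int j0 = 1 + ((i + 1 + color) & 1);  /* first j of this color */
            for (int j = j0; j <= n; j += 2)
                u[i * ld + j] = 0.25 * (u[(i - 1) * ld + j]
                                      + u[(i + 1) * ld + j]
                                      + u[i * ld + j - 1]
                                      + u[i * ld + j + 1]
                                      - h * h * f[i * ld + j]);
        }
    }
}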
... Ali et al. [30] describe an implementation of BLAS functions on a multi-core DSP. GEMM performance is optimized by partitioning both input matrices and processing the partitions in parallel. ...
Article
Full-text available
Matrix multiplication is an important operation for many engineering applications. Sometimes new features that include matrix multiplication must be added to existing, and even out-of-date, embedded platforms. In this paper, an unusual problem is considered: how to implement matrix multiplication of 32-bit signed integers and fixed-point numbers on a DSP that has SIMD instructions for 16-bit integers only. For the examined tasks, matrix sizes may vary from several tens to two hundred. The proposed mathematical approach for dense rectangular matrix multiplication of 32-bit numbers comprises decomposition of the 32-bit matrices into matrices of 16-bit numbers, four matrix multiplications of 16-bit unsigned integers via outer product, and correction of the outcome for signed integers and fixed-point numbers. Several tricks for performance optimization are analyzed. In addition, ways for block-wise and parallel implementations are described. An implementation of the proposed method by means of 16-bit vector instructions is faster than matrix multiplication using 32-bit scalar instructions and demonstrates performance close to the theoretically achievable limit. The described technique can be generalized to matrix multiplication of n-bit integers and fixed-point numbers via handling matrices of n/2-bit integers. In conclusion, recommendations are presented for practitioners who work on implementing matrix multiplication for various DSPs.
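The decomposition at the heart of the method follows from splitting each 32-bit entry into 16-bit halves: with a = ah*2^16 + al, the product is a*b = (ah*bh)*2^32 + (ah*bl + al*bh)*2^16 + al*bl, so one 32-bit matrix product reduces to four matrix products of 16-bit operands plus shifts and adds. A scalar C sketch of this recombination follows; note the paper multiplies unsigned halves and applies a sign correction afterwards, while here the sign handling is folded into the high halves, and the names are illustrative.

#include <stdint.h>

/* Scalar reference for the decomposition: each 32-bit entry a is split
 * as a = ah*2^16 + al with ah = a >> 16 (signed) and al = a & 0xFFFF
 * (unsigned).  Four "half" dot products are accumulated separately and
 * recombined; the recombination is performed modulo 2^64, which is
 * exact whenever the true integer result fits in 64 bits. */
void mm32_via_16bit(const int32_t *A, const int32_t *B, int64_t *C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            int64_t hh = 0, hl = 0, lh = 0, ll = 0;
            for (int k = 0; k < n; ++k) {
                int32_t a = A[i * n + k], b = B[k * n + j];
                int32_t  ah = a >> 16;               /* signed high half  */
                uint32_t al = (uint32_t)a & 0xFFFFu; /* unsigned low half */
                int32_t  bh = b >> 16;
                uint32_t bl = (uint32_t)b & 0xFFFFu;
                hh += (int64_t)ah * bh;
                hl += (int64_t)ah * (int64_t)bl;
                lh += (int64_t)al * bh;
                ll += (int64_t)al * (int64_t)bl;
            }
            /* a*b = (ah*bh)<<32 + (ah*bl + al*bh)<<16 + al*bl */
            C[i * n + j] = (int64_t)(((uint64_t)hh << 32)
                                   + ((uint64_t)(hl + lh) << 16)
                                   + (uint64_t)ll);
        }
}

On the target DSP, each of the four half products maps onto 16-bit SIMD multiply instructions, which is where the speedup over 32-bit scalar multiplication comes from.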
... • We develop an instance of gemm optimized for 8-bit integer (INT8) arithmetic that can be easily adapted for fixed point arithmetic. • Similarly to [13], we orchestrate a careful sequence of data transfers between the different memory areas of the GAP8 via DMA transfers, integrating these movements into the BLIS packing routines. • We perform an experimental evaluation of the gemm realization in the FC (fabric controller) in the GAP8 platform. ...
Article
Full-text available
We address the efficient realization of matrix multiplication (gemm), with application in the convolution operator for machine learning, for the RISC-V core present in the GreenWaves GAP8 processor. Our approach leverages BLIS (Basic Linear Algebra Instantiation Software) to develop an implementation that (1) re-organizes the gemm algorithm, adapting its micro-kernel to exploit the hardware-supported dot product kernel in the GAP8; (2) explicitly orchestrates the data transfers across the hierarchy of scratchpad memories via DMA (direct memory access); and (3) operates with integer arithmetic.
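The overall shape of such a micro-kernel can be sketched as follows: A and B are packed into contiguous micro-panels so the inner loop streams both operands, and each accumulator update maps onto the hardware dot product. This is a generic BLIS-style illustration in plain C, not the GAP8 code; dot4, MR/NR and the packing layout are assumptions standing in for the hardware-supported primitive and the paper's actual blocking choices.

#include <stdint.h>

enum { MR = 4, NR = 4 };   /* micro-tile of C computed per kernel call */

/* Stand-in for a hardware-supported 4-way int8 dot product. */
static inline int32_t dot4(const int8_t *a, const int8_t *b)
{
    return (int32_t)a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* BLIS-style micro-kernel: Ap holds MR rows of length kc, Bp holds NR
 * columns of length kc, both packed contiguously so the k-loop streams
 * them with unit stride; kc is assumed to be a multiple of 4. */
void ukernel_int8(int kc, const int8_t *Ap, const int8_t *Bp,
                  int32_t *C, int ldc)
{
    int32_t acc[MR][NR] = {{0}};
    for (int k = 0; k < kc; k += 4)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += dot4(&Ap[i * kc + k], &Bp[j * kc + k]);
    for (int i = 0; i < MR; ++i)        /* write back the micro-tile */
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}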
... This approach allows the parallel execution of matrix transposition and calculation. Several works leverage this feature to achieve high performance for several important workloads, including SAR image processing [18], signal processing kernels [19] and BLAS routines [20]. A limitation of these DMA controllers [5]-[10], [21] is that they can only transfer one matrix element per clock cycle, which significantly degrades the bandwidth utilization. ...
Article
Full-text available
Matrix transposition plays a critical role in digital signal processing. However, existing matrix transposition implementations have significant limitations. A traditional design uses load and store instructions to accomplish matrix transposition. Depending on the number of load/store units, this design typically transposes up to one matrix element per clock cycle. More seriously, this design cannot perform matrix transposition and data calculations in parallel. Modern digital signal processors (DSPs) integrate support for matrix transposition into the direct memory access (DMA) controller; the matrix can be transposed during data movements. This allows the parallel execution of matrix transposition and data calculations. Yet, its bandwidth utilization is limited; it can only transfer one matrix element per clock cycle. To address the limitations of existing designs, we propose MT-DMA, which supports efficient matrix transposition in DMA controllers. It can transpose multiple matrix elements per clock cycle to improve bandwidth utilization. Compared with existing designs, MT-DMA achieves up to a 23.9× performance improvement on micro-benchmarks. It is also more energy efficient. Since MT-DMA effectively hides the latency of matrix transposition behind data calculations, it performs very close to an ideal design on real applications.
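The overlap that DMA-based designs enable is essentially a double-buffering pipeline: while the controller transposes the next tile into one buffer, the cores compute on the previously transposed tile in the other. A schematic C sketch follows; dma_transpose_start, dma_wait (assumed to wait for the most recently issued transfer) and process_tile are hypothetical placeholders, not the MT-DMA or any vendor API.

/* Hypothetical DMA interface: start an asynchronous transpose of a
 * rows x cols tile, and wait for the most recently issued transfer. */
void dma_transpose_start(const float *src, float *dst, int rows, int cols);
void dma_wait(void);
void process_tile(const float *tile, int rows, int cols);

/* Double-buffered pipeline: the DMA engine transposes tile t+1 while
 * the core computes on the already-transposed tile t. */
void transpose_compute_pipeline(const float *tiles[], int ntiles,
                                int rows, int cols,
                                float *buf0, float *buf1)
{
    float *bufs[2] = { buf0, buf1 };
    dma_transpose_start(tiles[0], bufs[0], rows, cols);  /* prime pipe  */
    for (int t = 0; t < ntiles; ++t) {
        dma_wait();                                      /* tile t done */
        if (t + 1 < ntiles)                              /* kick t + 1  */
            dma_transpose_start(tiles[t + 1], bufs[(t + 1) & 1],
                                rows, cols);
        process_tile(bufs[t & 1], cols, rows);  /* transposed extents  */
    }
}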
... In essence, GEPP is decomposed into multiple calls to block-panel multiplication (GEBP). Since there is an L3 cache in the 64-bit ARMv8 architecture, we assume that a k_c × n_c panel of B will always reside fully in the L3 cache [12]. GEBP, the inner kernel handled at layer 4, updates an m_c × n_c panel of C as the product of an m_c × k_c block of A and a k_c × n_c panel of B. To ensure consecutive accesses, OpenBLAS packs a block (panel) of A (B) into contiguous buffers. ...
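The loop structure being described is the classic Goto-style blocking used by OpenBLAS and BLIS, sketched below in C. Block sizes m_c, k_c, n_c are chosen so the packed block of A stays resident in L2 and the packed panel of B in L3; pack_A, pack_B and gebp are placeholders for the packing routines and the inner kernel, and edge handling for partial blocks is omitted.

/* Placeholders for packing and the inner kernel (bodies omitted). */
void pack_A(const double *A, int lda, double *Abuf, int mc, int kc);
void pack_B(const double *B, int ldb, double *Bbuf, int kc, int nc);
void gebp(const double *Abuf, const double *Bbuf, double *C, int ldc,
          int mc, int nc, int kc);

/* Goto-style blocked GEMM (row-major); for simplicity m, n, k are
 * assumed to be multiples of mc, nc, kc. */
void gemm_blocked(int m, int n, int k,
                  const double *A, const double *B, double *C,
                  int mc, int kc, int nc, double *Abuf, double *Bbuf)
{
    for (int jc = 0; jc < n; jc += nc)            /* panels of B/C      */
        for (int pc = 0; pc < k; pc += kc) {      /* GEPP: rank-kc step */
            pack_B(&B[pc * n + jc], n, Bbuf, kc, nc);     /* -> L3      */
            for (int ic = 0; ic < m; ic += mc) {          /* GEBP calls */
                pack_A(&A[ic * k + pc], k, Abuf, mc, kc); /* -> L2      */
                gebp(Abuf, Bbuf, &C[ic * n + jc], n, mc, nc, kc);
            }
        }
}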
... Figure 6 illustrates our register allocation scheme for #0 and #1, together with the order in which the computations are performed in each case. The optimal distance of 7 from solving (12) ... When scheduling the instructions for each copy #i of the original register kernel, we also need to consider their WAR (write-after-read) and RAW (read-after-write) dependences. In each copy #i, 24 floating-point FMA instructions (fmla) and 7 load instructions (ldr) are executed, together with one prefetching instruction (prfm). ...
Research
Full-text available
Accepted by ICPP 2015, Beijing. Design and Implementation of a Highly Efficient DGEMM for 64-bit ARMv8 Multi-Core Processors.
... From this perspective, it is more similar to general-purpose architectures such as ARM or Intel x86. The BLAS implementation from TI [20,26] has been extensively used in our implementation when possible. In general, both inside the TI BLAS implementation and in the rest of the code, three different levels of parallelism are exploited: ...
Article
Full-text available
Power consumption is emerging as one of the main concerns in the High Performance Computing (HPC) field. As a growing number of bioinformatics applications require HPC techniques and parallel architectures to meet performance requirements, power consumption arises as an additional limitation when accelerating them. In this paper, we present a comparative study of optimized implementations of Non-negative Matrix Factorization (NMF), which is widely used in many fields of bioinformatics, taking into account both performance and power consumption. We target a wide range of state-of-the-art parallel architectures, including general-purpose low-power processors and specific-purpose accelerators like GPUs, DSPs or the Intel Xeon Phi. From our study, we gain insights into both performance and energy consumption for each of them under a number of experimental conditions, and conclude that the most appropriate architecture is usually a trade-off between performance and energy consumption for a given experimental setup and dataset.
... Unfortunately, we found no works in the literature which map these kernels on a CGRA and provide detailed energy results, allowing only coarse comparisons in terms of overall power efficiency or power density [3]. Only aggregate results could be found also for GPGPU solutions [7] and a DSP [8]. In [9], the authors presented a novel FPGA-based fine-grained reconfigurable architecture to map several numerical linear algebra kernels and compared it with the Intel Xeon Woodcrest processor, reporting a 10-150× speed-up/energy-efficiency improvement. ...
Conference Paper
Full-text available
A scalable mapping is proposed for three important kernels from the Numerical Linear Algebra domain, to exploit architectural features and reach asymptotically optimal efficiency and low energy consumption. Performance and power evaluations were done with input data set matrix sizes ranging from 64×64 to 16384×16384. Twelve architectural variants with up to 10×10 processing elements were used to explore scalability of the mapping and the architecture, achieving < 10% energy increase for architectures up to 8×8 PEs coupled with performance speed-ups of more than an order of magnitude. This enables a clean area-performance trade-off on the Layers architecture while keeping energy constant over the variants.
... GPUs, digital signal processors, and field-programmable gate arrays are typically significantly more energy efficient than conventional desktop or server CPUs; their deployment often leads to 4× increases in energy efficiency over traditional CPUs [1,2]. With mobile processor architectures becoming ever more versatile, the market is gradually moving away from custom silicon solutions toward energy-efficient CPUs. ...
Article
Full-text available
In recent years, a new generation of ultralow-power processors has emerged that is aimed primarily at signal processing in mobile computing. However, their architecture could make some of these useful for other applications. Algorithms originally developed for scientific computing are used increasingly in signal conditioning and emerging fields such as computer vision, increasing the demand for computing power in mobile systems. In this article, the authors describe the design and implementation of dense matrix multiplication on the Movidius Myriad architecture and evaluate its performance and energy efficiency. The authors demonstrate a performance of 8.11 Gflops on the Myriad I processor and a performance/watt ratio of 23.17 Gflops/W for a key computational kernel. These results show significant potential for scientific-computing tasks and invite further research.
... It provides a peak performance of 160 GFLOPS for floating point and 320 GMAC for fixed point at only 10 watts. It has been used recently by many research communities to build high-performance and low-power real-time signal processing systems [2,5-11]. Additionally, SRIO is integrated in the C6678 DSP as a peripheral device. ...
Article
Full-text available
Pulse-Doppler radars require high computing power. In this paper, a massively parallel machine has been developed to implement a Pulse-Doppler radar signal processing chain in real time. The proposed machine consists of two C6678 digital signal processors (DSPs), each with eight DSP cores, interconnected by a Serial RapidIO (SRIO) bus. In this study, each individual core is considered the basic processing element; hence, the proposed parallel machine contains 16 processing elements. A straightforward model has been adopted to distribute the Pulse-Doppler radar signal processing chain. This model provides low latency, but communication inefficiency limits system performance. This paper proposes several optimizations that greatly reduce the inter-processor communication of the straightforward model and improve the parallel efficiency of the system. A use case of the Pulse-Doppler radar signal processing chain has been used to illustrate and validate the concept of the proposed mapping model. Experimental results show that the parallel efficiency of the proposed parallel machine is about 90%.
... Embedded accelerators such as the TI C6678 DSP have been proven to provide high GFLOPS/Watt for HPC applications [1,2]. As a result there has been considerable interest in utilizing low-power SoCs containing these accelerators to build supercomputers capable of higher energy efficiency compared to current systems. ...
... Several approaches have been used to perform blocking/panelling of matrices to maximize utilization of caches and scratchpad RAM. In [1,2], GEMM was written for the C6678 DSP, and the performance measured across 8 DSP cores was 79.4 GFLOPS for SGEMM and 21 GFLOPS for DGEMM with the DSPs running at 1 GHz. To port GEMM to the 66AK2H, the same algorithm, inner kernel and matrix panelling scheme were used. ...
... With respect to the peak performance of the DSP cluster, using the accelerator model on the 66AK2H achieves 70.05% efficiency for SGEMM and 74.18% for DGEMM, including all overheads. The original implementation for the C6678 achieved 62% for SGEMM and 65% for DGEMM [1]. The increase in efficiency can be attributed to the larger L2 and MSMC SRAM available in the 66AK2H. ...
Conference Paper
Full-text available
The TI Keystone II architecture provides a unique combination of ARM Cortex-A15 processors with high-performance TI C66x floating-point DSPs on a single low-power System-on-Chip (SoC). Commercially available systems such as the HP ProLiant m800 and nCore BrownDwarf are based on this ARM-DSP SoC. The Keystone II architecture promises to deliver high GFLOPS/Watt and is of increasing interest as it provides an alternate building block for future exascale systems. However, the success of this architecture is intimately related to the ease of migrating existing HPC applications for maximum performance. Effective use of all ARM and DSP cores and DMA co-processors is crucial for maximizing performance/watt. This paper explores issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP, to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model. Single-precision (SGEMM) matrix multiplication performance of 110.11 GFLOPS and double-precision (DGEMM) performance of 29.15 GFLOPS were achieved on the TI Keystone II Evaluation Module Revision 3.0 (EVM). Trade-offs and factors affecting performance are discussed.
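For reference, an offload in the OpenMP 4.0 accelerator model described here takes roughly the following shape: the ARM host maps the matrices to the DSP device and dispatches the kernel there. The pragma syntax is standard OpenMP 4.0 array-section mapping; dsp_sgemm_panel is a hypothetical placeholder for the DSP-side GEMM, not TI's actual runtime interface.

/* Placeholder for the DSP-side GEMM entry point. */
void dsp_sgemm_panel(int m, int n, int k,
                     const float *A, const float *B, float *C);

void sgemm_offload(int m, int n, int k,
                   const float *A, const float *B, float *C)
{
    /* Map operands to the DSP device (standard OpenMP 4.0 array
     * sections) and execute the kernel there; C is copied back. */
    #pragma omp target map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
    dsp_sgemm_panel(m, n, k, A, B, C);
}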