Fig 1 - uploaded by Ben H. H. Juurlink
(a) A register file with eight 96-bit registers, 2 read ports, and 1 write port, (b) the implementation of two read ports and one write port for a matrix register file with 8 96-bit registers as well as a partitioned ALU for subword parallel processing
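The "partitioned ALU for subword parallel processing" in panel (b) can be illustrated in software with a SWAR (SIMD-within-a-register) sketch: four 8-bit lanes packed into one 32-bit word, added lane-wise with carries blocked at lane boundaries. This is a minimal illustrative model, not the paper's hardware design; all names and the 4x8-bit lane split are assumptions.

```python
# SWAR sketch of a partitioned adder: four independent 8-bit lanes in
# one 32-bit word, with carry propagation suppressed across lane
# boundaries (illustrative stand-in for a hardware-partitioned ALU).

LANE_MASK = 0x7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH_BITS = 0x80808080  # the top bit of every lane

def packed_add_u8x4(a: int, b: int) -> int:
    """Lane-wise modular addition of four packed 8-bit values."""
    low = (a & LANE_MASK) + (b & LANE_MASK)   # add low 7 bits; carries stay inside each lane
    return low ^ ((a ^ b) & HIGH_BITS)        # restore top bits without cross-lane carries

def pack(lanes):
    """Pack a list of four 8-bit values into one 32-bit word (lane 0 lowest)."""
    out = 0
    for i, v in enumerate(lanes):
        out |= (v & 0xFF) << (8 * i)
    return out

def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```

Note how lane 0 wraps modulo 256 without disturbing lane 1, which is exactly the behavior a partitioned ALU provides in hardware at no extra cost.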


Source publication
Conference Paper
Full-text available
SIMD extension is one of the most common and effective techniques to exploit data-level parallelism in today's processor designs. However, the performance of SIMD architectures is limited by constraints such as the mismatch between the storage and the computational formats and the use of data permutation instructions during vectorization. In our previ...

Similar publications

Article
Full-text available
Matrix multiplication is one of the most important kernels in dense linear algebra codes. It is a computationally intensive kernel that demands exploiting all available forms of parallelism to improve its performance. In this paper, ILP, DLP, TLP, and MPI are exploited to accelerate the execution of matrix multiplication on a cluster of comp...
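The layered parallelism this abstract names can be seen in a plain triple-loop multiply once it is blocked: the outer row blocks are the natural unit for TLP or MPI ranks, while the innermost loop is the SIMD-friendly (DLP) one. A pure-Python sketch under those assumptions; the block size and function name are illustrative, not from the paper.

```python
# Row-blocked matrix multiply. Each row block is an independent unit of
# work (candidate for one thread or MPI rank); the innermost j-loop
# touches contiguous data and is what a vectorizer would turn into
# packed SIMD operations. Illustrative sketch only.

def matmul_blocked(A, B, block=2):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):                 # row blocks: TLP/MPI granularity
        for i in range(i0, min(i0 + block, n)):
            for p in range(k):
                a = A[i][p]
                for j in range(m):                # contiguous inner loop: DLP granularity
                    C[i][j] += a * B[p][j]
    return C
```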
Conference Paper
Full-text available
This paper presents an architecture called "multi-streaming SIMD" that enables one SIMD instruction to simultaneously manipulate multiple data streams. To efficiently and flexibly realize the proposed architecture, an operation cell is designed by fusing the logic gates and the storage cells together. The operation cells are then used to compose a...
Article
Full-text available
In this paper, we examine the single instruction, multiple data (SIMD) architecture, a method of parallel computing. Most modern processor designs include SIMD in order to increase computer performance. The aim of this work is to describe the classification of SIMD architectures in computer systems, which depends on implementation...
Article
Full-text available
The carrier-frequency (CF) and intermediate-frequency (IF) pulse-width modulators (PWMs) based on delay lines are proposed, where baseband signals are conveyed by both positions and pulse widths or densities of the carrier clock. By combining IF-PWM and precorrected CF-PWM, a fully digital transmitter with unit-delay autocalibration is implemented...
Conference Paper
Full-text available
For most of the applications that make use of a vector coprocessor, the resources are not highly utilized due to the lack of sustained data parallelism, which sometimes occurs due to vector-length changes in dynamic environments. The motivation of our work stems from (a) the mandate for multicore designs to make efficient use of the on-chip resourc...

Citations

... After a year, SSE2 added 144 new instructions for integer and double-precision floating-point operations. SSE3 then introduced 13 new instructions, followed by Supplemental SSE3 (SSSE3) with 32 new instructions that accelerate several types of computations on packed S8, S16, and S32 data. Note that 68 of the 144 SSE2 instructions operate on 128-bit packed integers in XMM registers, wide versions of the 64-bit MMX/SSE integer instructions [84]. ... times faster compared to the corresponding sequential implementation when k is equal to the size of the data type in bits. ...
... simultaneously operate on a block of 128-bit data that is partitioned into four 32-bit integers. SIMD architectures include not only common arithmetic instructions, but also other instructions, such as data alignment, data type conversion, data reorganization, etc., that are also needed to prepare the data in a proper format for SIMD execution [2]. ...
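The data-reorganization instructions mentioned in this citation can be pictured with a classic case: converting interleaved array-of-structures data (e.g., RGB pixels) into structure-of-arrays form so each component can be processed with one packed operation. A pure-Python stand-in for SIMD unpack/shuffle instructions; the RGB example and names are assumptions for illustration.

```python
# Sketch of the data-reorganization step that precedes SIMD execution:
# AoS -> SoA deinterleaving (what shuffle/unpack instructions do in
# hardware) and the inverse re-interleave for writing results back.

def deinterleave_rgb(pixels):
    """AoS -> SoA: [r0, g0, b0, r1, g1, b1, ...] -> ([r...], [g...], [b...])."""
    return pixels[0::3], pixels[1::3], pixels[2::3]

def interleave_rgb(r, g, b):
    """SoA -> AoS: reassemble interleaved output, as pack/shuffle would."""
    out = []
    for triple in zip(r, g, b):
        out.extend(triple)
    return out
```

Once the channels are deinterleaved, a single packed instruction can process a full vector of, say, red components, which is why such reorganization instructions are as important to SIMD throughput as the arithmetic ones.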
Article
More and more modern processors support SIMD instructions to improve performance in media applications. Programmers usually need detailed target-specific knowledge to use SIMD instructions directly, so an auto-vectorization compiler that automatically generates efficient SIMD instructions is urgently needed. We implement automatic superword vectorization based on the LLVM compiler infrastructure, to which an auto-vectorization pass and an alignment-analysis pass have been added. The superword auto-vectorization pass exploits data parallelism and converts IR instructions from primitive types to vector types. Then, in the code generator, the alignment-analysis pass analyzes every memory access with respect to those vector instructions and generates the alignment information used to emit target-specific alignment instructions. In this paper, we use UltraSPARC as our experimental platform and two realignment instructions to perform misaligned accesses. We also present preliminary experimental results, which reveal that our optimization generates vectorized code with speedups of 4% up to 35%. Index Terms—Auto-vectorization, UltraSPARC T2, VIS Instruction Set.
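The two-realignment-instruction scheme this abstract describes can be emulated in plain Python: a misaligned vector access becomes two aligned loads plus a combine step that selects the wanted window. This is a conceptual sketch of that general technique, not the paper's UltraSPARC/VIS code; `VLEN` and the helper names are illustrative.

```python
# Emulating a misaligned vector load with two aligned loads plus an
# align/combine step, mirroring the realignment approach used by
# vectorizers on targets without native unaligned vector access.

VLEN = 4  # elements per "vector register" (illustrative)

def aligned_load(mem, base):
    """Load one aligned vector starting at an aligned index."""
    return mem[base:base + VLEN]

def misaligned_load(mem, addr):
    """Load VLEN elements starting at an arbitrary index."""
    base = (addr // VLEN) * VLEN         # round down to the alignment boundary
    off = addr - base
    lo = aligned_load(mem, base)          # first aligned load
    hi = aligned_load(mem, base + VLEN)   # second aligned load
    return (lo + hi)[off:off + VLEN]      # "align" step selects the window
```

When `addr` is already aligned the second load is redundant, which is why compilers try to prove alignment first and fall back to this two-load pattern only for accesses they cannot align.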
... The impact of SPE vectorization on computationally intensive kernels has been extensively analyzed. Vectorized kernels used in video encoding include: motion estimation [35], deblocking filter [9], discrete cosine transformation [5], and interpolation [12]. Vectorization usually introduces additional overhead for packing and unpacking data into vectors, but yields lower overall execution time than the scalar versions of the kernels. ...
Article
Full-text available
Modern multi-core processors with explicitly managed local memories, such as the Cell Broadband Engine (Cell), constitute in many ways a significant departure from traditional high-performance CPU designs. Such CPUs, on the one hand, bear the potential of higher performance in certain application domains and, on the other hand, require extensive application modifications. We design and implement x264, a complete H.264 video encoder for the Cell processor, based on an open-source H.264 library, c264. Our implementation achieves speedups of 4.5x on six synergistic processing elements (SPEs), compared to the serial version running on the power processing element (PPE). Our work considers all parts of the encoding process and reveals related limitations. x264 constitutes an extensive redesign of the original c264 code to employ fine-grain parallelization to cope with the small size of the local memory in the SPEs and to achieve replication and privatization of shared data structures due to the non-coherent Cell architecture. Our analysis allows us to identify the main limitations to further scaling H.264 video encoding on future multi-cores: (a) overheads for task management place a heavy burden on the single master processor, (b) complex control flow in the code limits effective parallelism, and (c) small on-chip memories limit the overlap of communication and computation.