Fig 1 - uploaded by Ben H. H. Juurlink
(a) A register file with eight 96-bit registers, 2 read ports, and 1 write port, (b) the implementation of two read ports and one write port for a matrix register file with 8 96-bit registers as well as a partitioned ALU for subword parallel processing
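The "partitioned ALU for subword parallel processing" in panel (b) can be illustrated in software with a SWAR (SIMD-within-a-register) sketch: four 8-bit lanes packed into one 32-bit word, added lane-wise with carries blocked at lane boundaries. This is a minimal illustrative model, not the paper's hardware design; all names and the 4x8-bit lane split are assumptions.

```python
# SWAR sketch of a partitioned adder: four independent 8-bit lanes in
# one 32-bit word, with carry propagation suppressed across lane
# boundaries (illustrative stand-in for a hardware-partitioned ALU).

LANE_MASK = 0x7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH_BITS = 0x80808080  # the top bit of every lane

def packed_add_u8x4(a: int, b: int) -> int:
    """Lane-wise modular addition of four packed 8-bit values."""
    low = (a & LANE_MASK) + (b & LANE_MASK)   # add low 7 bits; carries stay inside each lane
    return low ^ ((a ^ b) & HIGH_BITS)        # restore top bits without cross-lane carries

def pack(lanes):
    """Pack a list of four 8-bit values into one 32-bit word (lane 0 lowest)."""
    out = 0
    for i, v in enumerate(lanes):
        out |= (v & 0xFF) << (8 * i)
    return out

def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```

Note how lane 0 wraps modulo 256 without disturbing lane 1, which is exactly the behavior a partitioned ALU provides in hardware at no extra cost.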


Source publication
Conference Paper
Full-text available
SIMD extension is one of the most common and effective techniques to exploit data-level parallelism in today's processor designs. However, the performance of SIMD architectures is limited by constraints such as the mismatch between the storage and the computational formats and the use of data permutation instructions during vectorization. In our previ...

Similar publications

Article
Full-text available
Matrix multiplication is one of the most important kernels in dense linear algebra codes. It is a computationally intensive kernel that demands exploiting all available forms of parallelism to improve its performance. In this paper, ILP, DLP, TLP, and MPI are exploited to accelerate the execution of matrix multiplication on a cluster of comp...
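The layered parallelism this abstract names can be seen in a plain triple-loop multiply once it is blocked: the outer row blocks are the natural unit for TLP or MPI ranks, while the innermost loop is the SIMD-friendly (DLP) one. A pure-Python sketch under those assumptions; the block size and function name are illustrative, not from the paper.

```python
# Row-blocked matrix multiply. Each row block is an independent unit of
# work (candidate for one thread or MPI rank); the innermost j-loop
# touches contiguous data and is what a vectorizer would turn into
# packed SIMD operations. Illustrative sketch only.

def matmul_blocked(A, B, block=2):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):                 # row blocks: TLP/MPI granularity
        for i in range(i0, min(i0 + block, n)):
            for p in range(k):
                a = A[i][p]
                for j in range(m):                # contiguous inner loop: DLP granularity
                    C[i][j] += a * B[p][j]
    return C
```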
Conference Paper
Full-text available
This paper presents an architecture called "multi-streaming SIMD" that enables one SIMD instruction to simultaneously manipulate multiple data streams. To efficiently and flexibly realize the proposed architecture, an operation cell is designed by fusing the logic gates and the storage cells together. The operation cells are then used to compose a...
Article
Full-text available
In this paper, we examine the single instruction, multiple data (SIMD) architecture, a method of parallel computing. Most modern processor designs include SIMD in order to increase computer performance. The aim of this work is to describe the classification of SIMD architectures in computer systems, which depends on implementation...
Article
Full-text available
The carrier-frequency (CF) and intermediate-frequency (IF) pulse-width modulators (PWMs) based on delay lines are proposed, where baseband signals are conveyed by both positions and pulse widths or densities of the carrier clock. By combining IF-PWM and precorrected CF-PWM, a fully digital transmitter with unit-delay autocalibration is implemented...
Conference Paper
Full-text available
For most of the applications that make use of a vector coprocessor, the resources are not highly utilized due to the lack of sustained data parallelism, which sometimes occurs due to vector-length changes in dynamic environments. The motivation of our work stems from (a) the mandate for multicore designs to make efficient use of the on-chip resourc...

Citations

... After a year, SSE2 added 144 new instructions for integer and double-precision floating-point operations. SSE3 then introduced 13 new instructions, followed by Supplemental SSE3 (SSSE3) with 32 new instructions that accelerate several types of computations on packed S8, S16, and S32 data. Note that 68 of the 144 SSE2 instructions operate on 128-bit packed integers in XMM registers, wide versions of the 64-bit MMX/SSE integer instructions [84]. ... times faster compared to the corresponding sequential implementation when k is equal to the size of the data type in bits. ...
... simultaneously operate on a block of 128-bit data that is partitioned into four 32-bit integers. SIMD architectures include not only common arithmetic instructions, but also other instructions, such as data alignment, data type conversion, data reorganization, etc., that are also needed to prepare the data in a proper format for SIMD execution [2]. ...
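The data-reorganization instructions mentioned in this citation can be pictured with a classic case: converting interleaved array-of-structures data (e.g., RGB pixels) into structure-of-arrays form so each component can be processed with one packed operation. A pure-Python stand-in for SIMD unpack/shuffle instructions; the RGB example and names are assumptions for illustration.

```python
# Sketch of the data-reorganization step that precedes SIMD execution:
# AoS -> SoA deinterleaving (what shuffle/unpack instructions do in
# hardware) and the inverse re-interleave for writing results back.

def deinterleave_rgb(pixels):
    """AoS -> SoA: [r0, g0, b0, r1, g1, b1, ...] -> ([r...], [g...], [b...])."""
    return pixels[0::3], pixels[1::3], pixels[2::3]

def interleave_rgb(r, g, b):
    """SoA -> AoS: reassemble interleaved output, as pack/shuffle would."""
    out = []
    for triple in zip(r, g, b):
        out.extend(triple)
    return out
```

Once the channels are deinterleaved, a single packed instruction can process a full vector of, say, red components, which is why such reorganization instructions are as important to SIMD throughput as the arithmetic ones.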
Article
More and more modern processors support SIMD instructions to improve performance in media applications. Programmers usually need detailed target-specific knowledge to use SIMD instructions directly, so an auto-vectorization compiler that automatically generates efficient SIMD instructions is urgently needed. We implement automatic superword vectorization based on the LLVM compiler infrastructure, to which an auto-vectorization pass and an alignment-analysis pass have been added. The superword auto-vectorization pass exploits data parallelism and converts IR instructions from primitive types to vector types. Then, in the code generator, the alignment-analysis pass analyzes every memory access with respect to those vector instructions and generates the alignment information used to emit target-specific alignment instructions. In this paper, we use UltraSPARC as our experimental platform and two realignment instructions to perform misaligned accesses. We also present preliminary experimental results, which reveal that our optimization generates vectorized code with speedups of 4% up to 35%. Index Terms—Auto-vectorization, UltraSPARC T2, VIS Instruction Set.
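The two-realignment-instruction scheme this abstract describes can be emulated in plain Python: a misaligned vector access becomes two aligned loads plus a combine step that selects the wanted window. This is a conceptual sketch of that general technique, not the paper's UltraSPARC/VIS code; `VLEN` and the helper names are illustrative.

```python
# Emulating a misaligned vector load with two aligned loads plus an
# align/combine step, mirroring the realignment approach used by
# vectorizers on targets without native unaligned vector access.

VLEN = 4  # elements per "vector register" (illustrative)

def aligned_load(mem, base):
    """Load one aligned vector starting at an aligned index."""
    return mem[base:base + VLEN]

def misaligned_load(mem, addr):
    """Load VLEN elements starting at an arbitrary index."""
    base = (addr // VLEN) * VLEN         # round down to the alignment boundary
    off = addr - base
    lo = aligned_load(mem, base)          # first aligned load
    hi = aligned_load(mem, base + VLEN)   # second aligned load
    return (lo + hi)[off:off + VLEN]      # "align" step selects the window
```

When `addr` is already aligned the second load is redundant, which is why compilers try to prove alignment first and fall back to this two-load pattern only for accesses they cannot align.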
... The impact of SPE vectorization on computationally intensive kernels has been extensively analyzed. Vectorized kernels used in video encoding include: motion estimation [35], deblocking filter [9], discrete cosine transformation [5], and interpolation [12]. Vectorization usually introduces additional overhead for packing and unpacking data into vectors, but yields lower overall execution time than the scalar versions of the kernels. ...
Article
Full-text available
Modern multi-core processors with explicitly managed local memories, such as the Cell Broadband Engine (Cell), constitute in many ways a significant departure from traditional high-performance CPU designs. Such CPUs, on the one hand, bear the potential of higher performance in certain application domains and, on the other hand, require extensive application modifications. We design and implement x264, a complete H.264 video encoder for the Cell processor, based on an open-source H.264 library, c264. Our implementation achieves speedups of 4.5x on six synergistic processing elements (SPEs), compared to the serial version running on the power processing element (PPE). Our work considers all parts of the encoding process and reveals related limitations. x264 constitutes an extensive redesign of the original c264 code to employ fine-grain parallelization to cope with the small size of the local memory in the SPEs and to achieve replication and privatization of shared data structures due to the non-coherent Cell architecture. Our analysis allows us to identify the main limitations to further scaling H.264 video encoding on future multi-cores: (a) overheads for task management place a heavy burden on the single master processor, (b) complex control flow in the code limits effective parallelism, and (c) small on-chip memories limit the overlap of communication and computation.