Altivec technology: A Second Generation SIMD Microprocessor Architecture

Compiler-Based Data Prefetching and Streaming Non-temporal Store Generation for the Intel® Xeon Phi™ Coprocessor

Conference Paper

May 2013

The Intel® Xeon Phi™ coprocessor has software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations. In this work, we provide details on compiler-based generation of these instructions and evaluate their impact on the performance of the Intel® Xeon Phi™ coprocessor using a wide range of parallel applications with different characteristics. Our results show that the Intel® Composer XE 2013 compiler can make effective use of these mechanisms to achieve significant performance improvements.

A Programmable Processor with 4096 Processing Units for Media Applications

Article

Apr 2001

Over the past few years, technology drivers for processor designs have changed significantly. Media data delivery and processing (such as telecommunications, networking, video processing, speech recognition and 3D graphics) is increasing in importance and will soon dominate the processing cycles consumed in computer-based systems. This paper describes a processor, called Linedancer, that provides high media performance with low energy consumption by integrating associative SIMD parallel processing with embedded microprocessor technology. The major innovation in Linedancer is the integration of thousands of processing units on a single chip that are capable of supporting software-programmable, high-performance mathematical functions as well as abstract data processing. In addition to 4096 processing units, Linedancer integrates on a single chip a RISC controller that implements the SPARC architecture, 128 Kbytes of data memory, and I/O interfaces. The SIMD processing in Linedancer implements the ASProCore architecture, a proprietary implementation of SIMD processing, and operates at 266 MHz with program instructions issued by the RISC controller. The device also integrates a 64-bit synchronous main memory interface operating at 133 MHz (double data rate, DDR) and a 64-bit 66 MHz PCI interface.

TriMedia CPU64 Architecture

Article

Jul 2000

We present a new VLIW core as a successor to the TriMedia TM1000. The processor is targeted for embedded use in media-processing devices like DTVs and set-top boxes. Intended as a core, its design must be supplemented with on-chip co-processors to obtain a cost-effective system. Good performance is obtained through a uniform 64-bit 5 issue-slot VLIW design, supporting subword parallelism with an extensive instruction set optimized with respect to media-processing. Multi-slot 'super-ops' allow powerful multi-argument and multi-result operations. As an example, an IDCT algorithm shows a very low instruction count in comparison with other processors. To achieve good performance, critical sections in the application program source code need to be rewritten with vector data types and function calls for media operations. Benchmarking with several media applications was used to tune the instruction set and study cache behavior. This resulted in a VLIW architecture with wide data paths and relatively simple CPU control.

TriMedia CPU64 architecture

Conference Paper

Feb 1999

We present a new VLIW core as a successor to the TriMedia TM1000. The processor is targeted for embedded use in media-processing devices like DTVs and set-top boxes. Intended as a core, its design must be supplemented with on-chip co-processors to obtain a cost-effective system. Good performance is obtained through a uniform 64-bit 5 issue-slot VLIW design, supporting subword parallelism with an extensive instruction set optimized with respect to media-processing. Multi-slot 'super-ops' allow powerful multi-argument and multi-result operations. As an example, the IDCT algorithm shows a very low instruction count in comparison with other processors. To achieve good performance, critical sections in the application program source code need to be rewritten with vector data types and function calls for media operations. Benchmarking with several media applications was used to tune the instruction set and study cache behaviour. This resulted in a VLIW architecture with wide data paths and relatively simple CPU control.

