A Programmable Processor with 4096 Processing Units for Media Applications

April 2001


A. Krikelis, I. P. Jalowiecki, D. Bean, R. Bishop, M. Facey, D. Boughton, S. Murphy, and M. Whitaker

Aspex Technology Ltd.

Brunel Science Park

Kingston Lane

Uxbridge, UB8 3PH

United Kingdom

argy.krikelis@aspex.co.uk

Abstract

Over the past few years, technology drivers for processor designs have changed significantly. Media data delivery and processing, such as telecommunications, networking, video processing, speech recognition and 3D graphics, is increasing in importance and will soon dominate the processing cycles consumed in computer-based systems. This paper describes a processor, called Linedancer, that provides high media performance with low energy consumption by integrating associative SIMD parallel processing with embedded microprocessor technology. The major innovation in the Linedancer is the integration on a single chip of thousands of processing units that can support software-programmable, high-performance mathematical functions as well as abstract data processing. In addition to 4096 processing units, Linedancer integrates on a single chip a RISC controller that is an implementation of the SPARC architecture, 128 Kbytes of Data Memory, and I/O interfaces. The SIMD processing in Linedancer implements the ASProCore architecture, a proprietary implementation of SIMD processing, and operates at 266 MHz with program instructions issued by the RISC controller. The device also integrates a 64-bit synchronous main memory interface operating at 133 MHz (double data rate, DDR) and a 64-bit 66 MHz PCI interface.

Introduction

Over the past few years, technology drivers for microprocessors have changed significantly. High-end systems for technical and scientific applications used to direct the evolution of processor architecture. Now, consumer-level systems drive technology, due to their large volume and attendant profits. Within this environment, important application and technology trends have evolved. Media data delivery and processing, such as telecommunications, networking, video processing, speech recognition and 3D graphics, is increasing in importance and will soon dominate the processing cycles consumed in computer-based systems [1]. SIMD extensions to existing processor architectures [2, 3] for supporting DSP-type operations are essentially narrow vector designs without support for vector memory operations. They have limited scalability because each instruction specifies a fixed number of operations. Most extensions do not support SIMD memory operations and therefore expose data alignment to user software [4]. Certain instructions, such as random permutations, will not scale well because of interconnect delay.

This paper presents the architecture of the Linedancer DSP processor. The Linedancer provides high multimedia performance with low energy consumption by integrating associative SIMD parallel processing with embedded microprocessor technology. Its major innovation is the integration on a single chip of thousands of processing units that can support software-programmable, high-performance mathematical functions as well as abstract data processing.

Linedancer Processor

Linedancer is a DSP processor for media processing that integrates 4096 processing units. With 15.6M transistors, the processor is implemented in a 0.18 µm, 5-metal-layer CMOS process. It is packaged in a 560-pin EBGA package. The core of the device operates at 1.8 V, while the I/Os operate at 3.3 V. The device, which will be sampling in the first quarter of 2001, dissipates less than 5 W at peak at the typical operating frequency of 266 MHz. Table I details the Linedancer device specifications.

Parameter                                             Value
Technology                                            0.18 µm, 5-layer metal CMOS
Clock                                                 266 MHz
Power Supply                                          1.8 V core and 3.3 V I/O
Transistor Count                                      15.6 million
Power Dissipation                                     4 Watts (peak)
Die Size                                              111 mm²
Package                                               680 EBGA
Testability                                           Full-scan
Data Transfer: Synchronous Memory Interface           Up to 2,100 Mbytes/sec
PCI Interface                                         8 bytes @ 66 MHz
Integer Performance (8-bit add/subtract)              126,000 MOPS
Integer Performance (8-bit x 8-bit multiplies)        13,500 MOPS
Integer Performance (16-bit x 16-bit multiplies)      3,800 MOPS
Floating Point Performance (IEEE Single Precision)    1,700 MFLOPS

Table I: Linedancer device specifications
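As a rough reading of Table I, the peak rates can be related to the number of processing units and the clock frequency. The sketch below is a back-of-envelope calculation, not a figure taken from the design: it assumes that all 4096 processing units are busy on every 266 MHz cycle and that the quoted rates are aggregate peaks across the array. The function name `implied_cycles_per_op` is hypothetical.

```python
# Back-of-envelope reading of Table I.
# Assumption: every processing unit is busy on every clock cycle, and the
# quoted peak rates are totals for the whole array.

PROCESSING_UNITS = 4096
CLOCK_HZ = 266e6

# Peak rates from Table I, in millions of operations per second.
PEAK_MOPS = {
    "8-bit add/subtract": 126_000,
    "8-bit x 8-bit multiply": 13_500,
    "16-bit x 16-bit multiply": 3_800,
    "IEEE single-precision floating point": 1_700,
}


def implied_cycles_per_op(mops: float) -> float:
    """Average clock cycles one processing unit spends per operation."""
    unit_cycles_per_second = PROCESSING_UNITS * CLOCK_HZ
    return unit_cycles_per_second / (mops * 1e6)


if __name__ == "__main__":
    for name, mops in PEAK_MOPS.items():
        cycles = implied_cycles_per_op(mops)
        print(f"{name}: about {cycles:.1f} cycles per operation")
```

Under these assumptions an 8-bit addition costs each unit between eight and nine cycles, which seems consistent with an ALU that handles one bit position per cycle.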

The Linedancer processor is one of the devices that incorporate ASProCore (Associative String Processor Core), a software-programmable associative SIMD structure that provides extremely efficient support for the parallel processing required for media applications.

Figure 1 depicts the block diagram of the Linedancer processor. Its major functional units include the ASProCore, a RISC controller that is an implementation of the SPARC architecture, 128 Kbytes of Data Memory, and I/O interfaces. The ASProCore, which is a proprietary implementation of SIMD processing, is designed to operate at 266 MHz with program instructions issued by the RISC controller; the controller block includes a 128-Kbyte program memory. The device also integrates a 64-bit synchronous main memory interface operating at 133 MHz (double data rate, DDR) and a 64-bit 66 MHz PCI interface.
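The interface widths and clock rates above determine the peak transfer rates quoted in Table I. The following sketch works through that arithmetic; the helper name `bus_bandwidth_mbytes` is hypothetical, and it assumes one transfer per clock edge used (two per clock for DDR) with 1 Mbyte taken as 10^6 bytes.

```python
def bus_bandwidth_mbytes(width_bits: int, clock_mhz: float,
                         transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in Mbytes/sec for a parallel bus.

    width_bits / 8 gives bytes per transfer; multiplying by the clock in MHz
    and the number of transfers per clock gives millions of bytes per second.
    """
    return width_bits / 8 * clock_mhz * transfers_per_clock


# 64-bit synchronous memory interface at 133 MHz, double data rate.
memory_bandwidth = bus_bandwidth_mbytes(64, 133, transfers_per_clock=2)

# 64-bit PCI interface at 66 MHz, single data rate.
pci_bandwidth = bus_bandwidth_mbytes(64, 66)

if __name__ == "__main__":
    print(f"Memory interface: {memory_bandwidth:.0f} Mbytes/sec")
    print(f"PCI interface:    {pci_bandwidth:.0f} Mbytes/sec")
```

The memory figure of 2,128 Mbytes/sec matches the "up to 2,100 Mbytes/sec" entry in Table I, and the PCI figure corresponds to 8 bytes per 66 MHz cycle.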

The Data Transfer block provides the application programmer with the support required to develop application software that overlaps

application processing with data transfers. In inner loops operating over large sets of data, the programmer can program it to move strips,

blocks or patches of data on and off the chip. The overall design of the processor allows SIMD processing in ASProCore to fully overlap

with data movement.
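The benefit of overlapping data movement with processing can be illustrated with a simple timing model. This is only a sketch under stated assumptions: each strip has a fixed transfer time and a fixed compute time, only the inbound transfer is modelled, and the function name `total_time` and its parameters are hypothetical rather than part of the Linedancer software.

```python
def total_time(num_strips: int, transfer_time: float, compute_time: float,
               overlapped: bool) -> float:
    """Time to process num_strips strips of data.

    Without overlap, each strip is transferred and then processed in turn.
    With overlap, strip i+1 is transferred while strip i is processed, so
    after the first transfer the slower of the two stages sets the pace.
    """
    if num_strips <= 0:
        return 0.0
    if not overlapped:
        return num_strips * (transfer_time + compute_time)
    return (transfer_time
            + (num_strips - 1) * max(transfer_time, compute_time)
            + compute_time)


if __name__ == "__main__":
    serial = total_time(10, transfer_time=2.0, compute_time=3.0, overlapped=False)
    hidden = total_time(10, transfer_time=2.0, compute_time=3.0, overlapped=True)
    print(f"serial: {serial}, overlapped: {hidden}")
```

In this model, when the compute time per strip exceeds the transfer time, all transfers except the first are hidden behind processing.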

ASProCore

The ASProCore (Associative String Processor Core) part of the Linedancer is a programmable, homogeneous and fault-tolerant SIMD parallel processor core incorporating a string of identical processing units, a reconfigurable Intercommunication Network, and a Vector Data Buffer for fully overlapped data input-output, as indicated in Figure 2.

As shown in Figure 2, each processing unit incorporates a Data Register and a bit-serial ALU. The size of the Data Register is 200 bits. The Data Register, in addition to storing data for arithmetic operations involving the local ALU, can support associative processing operations (i.e. direct support for logical and relational operations).
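To make the idea of a bit-serial ALU per processing unit concrete, the sketch below simulates an element-wise addition that advances one bit position per step while all units work in parallel. It is an illustrative model only: the function name `bit_serial_add`, the 8-bit default width and the wrap-around behaviour are assumptions, and the inner Python loop merely stands in for hardware units that would operate simultaneously.

```python
def bit_serial_add(a_values, b_values, width=8):
    """Add two vectors element-wise, one bit position per step.

    Each list index models one processing unit holding its own operands and
    its own carry bit. Inputs are assumed to be non-negative integers; the
    result wraps modulo 2**width.
    """
    if len(a_values) != len(b_values):
        raise ValueError("operand vectors must have the same length")

    units = len(a_values)
    carry = [0] * units
    result = [0] * units

    for bit in range(width):            # serial over bit positions
        for unit in range(units):       # conceptually simultaneous in hardware
            a_bit = (a_values[unit] >> bit) & 1
            b_bit = (b_values[unit] >> bit) & 1
            total = a_bit + b_bit + carry[unit]
            result[unit] |= (total & 1) << bit
            carry[unit] = total >> 1
    return result


if __name__ == "__main__":
    print(bit_serial_add([1, 200, 255], [2, 100, 1]))
```

The number of steps depends on the operand width rather than on the number of units, which is why wide arrays of simple units suit this style of arithmetic.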

[Figure 1 block labels: ASProCore, RISC Controller, Data Memory, Data Transfer and PCI Interface, connected by the Linedancer Internal Bus; interface labels 64 bits @ 66 MHz and 64 bits @ 266 MHz.]

Figure 1: Linedancer Block Diagram

The processing units are connected via the Intercommunication Network, a flexible network that supports data transfers and navigation of data structures. It can be dynamically reconfigured, in a programmable and user-transparent way, thus providing a cost-effective emulation of common network topologies. The interconnection strategy supports two modes of inter-processing communication:

• asynchronous bi-directional single-bit communication, which connects source processing units to the corresponding destination processing units with high-speed activation signals, implementing a fully-connected, dynamically-configured (programmer-transparent) permutation and broadcast network for processing element selection and inter-processing element routing functions;

• synchronous bi-directional multi-bit communication, via a high-speed bit-serial shift register, for data/message transfer between processing unit groups.
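The two communication modes above can be pictured with a very simplified model of a linear string of processing units. The sketch is an assumption-laden illustration, not the actual network: it models only a fixed-distance shift along the string with a fill value at the ends, and the function names `shift_along_string` and `activate_from_sources` are hypothetical.

```python
def shift_along_string(values, distance, fill=0):
    """Move each unit's value `distance` positions along the string.

    A positive distance moves data toward higher-numbered units, a negative
    one toward lower-numbered units. Positions that receive nothing take
    the fill value; data shifted past either end is dropped.
    """
    units = len(values)
    shifted = [fill] * units
    for source in range(units):
        destination = source + distance
        if 0 <= destination < units:
            shifted[destination] = values[source]
    return shifted


def activate_from_sources(match_flags, offset):
    """Let each matching (source) unit activate the unit `offset` away."""
    return [bool(flag) for flag in shift_along_string(match_flags, offset, fill=False)]


if __name__ == "__main__":
    print(shift_along_string([1, 2, 3, 4], 1))
    print(activate_from_sources([True, False, False, True], 1))
```

The first function corresponds loosely to multi-bit data transfer through the shift register, and the second to single-bit activation signals sent from source units to destination units.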

While being served with control and sequential data via the Instruction and Data Interface, the ASProCore can support parallel data I/O via the Vector Data Buffer. Data is loaded into the Vector Data Buffer word-sequentially and bit-parallel, overlapped with SIMD parallel processing. It can subsequently be exchanged with the data stored in the Data Register of the local processing unit in a word-parallel, bit-sequential manner. For data-parallel operations, data are distributed over the processing units and stored in the local Data Registers. Successive computational tasks are performed on the stored data and the results are dumped. The ASProCore supports a form of set associative processing, in which a subset of active processing units (i.e. those which associatively match broadcast scalar information) support scalar-vector (i.e. between a scalar and Data Registers) and vector-vector (i.e. within Data Registers) operations. Matching processing units are either directly activated or source inter-processing element communications to indirectly activate other processing units. The control interface provides feedback on whether none or some processing units match. The instruction set for the ASProCore is based on four basic operations: match, add, read and write. More complicated functionality can be performed by combining these operations. A more detailed discussion of the ASProCore architecture can be found in [5].
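The associative style of processing described above, built on match, add, read and write, can be sketched as a small simulation. This is a toy model under stated assumptions: each processing unit holds a single integer rather than fields of a 200-bit Data Register, matching is by equality only, and the class `AssociativeString` and its method semantics are hypothetical rather than the ASProCore instruction set.

```python
class AssociativeString:
    """Toy model of a string of processing units with activity flags."""

    def __init__(self, values):
        self.registers = list(values)
        self.active = [False] * len(self.registers)

    def match(self, key):
        """Broadcast a scalar; units whose register equals it become active.

        Returns True if some unit matched and False if none did, modelling
        the feedback provided by the control interface.
        """
        self.active = [value == key for value in self.registers]
        return any(self.active)

    def add(self, scalar):
        """Scalar-vector add, applied only in the active units."""
        self.registers = [value + scalar if on else value
                          for value, on in zip(self.registers, self.active)]

    def write(self, scalar):
        """Write a broadcast scalar into the active units."""
        self.registers = [scalar if on else value
                          for value, on in zip(self.registers, self.active)]

    def read(self):
        """Return the register contents of the active units."""
        return [value for value, on in zip(self.registers, self.active) if on]


if __name__ == "__main__":
    string = AssociativeString([3, 7, 3, 9])
    if string.match(3):
        string.add(10)
    print(string.registers)
```

Selecting units by content rather than by address is what lets a single broadcast instruction update every matching data element in one step.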

Software Programmable

Conceptually, the Linedancer processor can be viewed as a general-purpose RISC processor with a tightly coupled data-parallel co-processor (ASProCore) and a DMA unit for moving data between memory and the co-processor. In inner loops operating over large sets of data, the programmer can program the DMA unit to move strips, blocks or patches of data on and off the chip. Applications consist of a single program in which instructions for the RISC and the co-processor can be freely intermixed.

The Linedancer may be programmed using one or both of the following methods:

• Application Programming Interface (API)

An API provides a suite of specialised functions that can be used to build solutions for particular applications. The functions are callable from a standard C or C++ program without the need for the programmer to have specialised hardware or computer architecture knowledge.

• Using an extended version of C

Linedancer processors can also be programmed using an extended version of C, which has additional language statements that allow operations to be performed in parallel on all the data elements stored in the processors.

[Figure 2 block labels: Processing Unit (Data Register, ALU, Data Buffer), Intercommunication Network, Instruction & Data Interface, Vector Data Buffer, Concurrent Data I/O.]

Figure 2: ASProCore Architecture

As indicated in Figure 3, which depicts the software and hardware layers of Linedancer, the overall programming environment includes software libraries that provide functions supporting mathematical operations (square root, raising to a power, etc.), standard C functions for I/O, memory management, etc., and hardware interface functions for cache control, DMA programming, interrupt handling, etc.

Performance

Table II presents the sustained performance of a Linedancer device for a number of media-related processing tasks, such as signal processing, image processing and graphics.

Media-related Task                                                           Linedancer Performance
FIR (8 taps, 16 bits)                                                        284.4 Msamples/sec
FIR (16 taps, 16 bits)                                                       142.4 Msamples/sec
Convolution (3x3 kernel, 8 bits)                                             1400 Gpixels/sec
Median filtering (3x3 kernel, 8 bits)                                        600 Mpixels/sec
DCT/IDCT                                                                     1100 Mpixels/sec
Motion Estimation (vector for 8x8 block over 32x32 area)                     400 Kvectors/sec
Line Resize (1024 pixel line or column, 12-tap filter, 8 sets of filters)    1400 Mpixels/sec
3D Graphics (geometry & light transformations)                               33 Mvertices/sec
3D Graphics (Gouraud rendering, 25 pixels/triangle)                          25.7 Mtriangles/sec
3D Graphics (Gouraud rendering, 50 pixels/triangle)                          12.8 Mtriangles/sec
3D Volumetric Visualisation (transformation & rendering)                     450 Mvoxels/sec
Colour transformation (RGB to CMYK)                                          104 Mpixels/sec

Table II: Linedancer processor performance for a number of media processing tasks
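Some rows of Table II can be cross-checked against each other with simple arithmetic. The sketch below is a reader's consistency check, not data from the paper beyond the table values themselves: it assumes one multiply-accumulate per tap per output sample for the FIR rows and uses the stated pixels per triangle for the Gouraud rows. The helper names are hypothetical.

```python
# Values copied from Table II.
FIR_MSAMPLES = {8: 284.4, 16: 142.4}        # taps -> Msamples/sec
GOURAUD_MTRIANGLES = {25: 25.7, 50: 12.8}   # pixels/triangle -> Mtriangles/sec


def fir_mac_rate(taps: int, msamples_per_sec: float) -> float:
    """Millions of multiply-accumulates per second (one MAC per tap per sample)."""
    return taps * msamples_per_sec


def pixel_fill_rate(pixels_per_triangle: int, mtriangles_per_sec: float) -> float:
    """Millions of rendered pixels per second."""
    return pixels_per_triangle * mtriangles_per_sec


if __name__ == "__main__":
    for taps, rate in FIR_MSAMPLES.items():
        print(f"FIR {taps} taps: {fir_mac_rate(taps, rate):.1f} M MACs/sec")
    for pixels, rate in GOURAUD_MTRIANGLES.items():
        print(f"Gouraud {pixels} px/triangle: {pixel_fill_rate(pixels, rate):.1f} Mpixels/sec")
```

Both FIR rows work out to roughly 2,275 to 2,280 million multiply-accumulates per second, and both Gouraud rows to roughly 640 Mpixels/sec, so within each pair the throughput scales inversely with the work per output item.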

Summary

The Linedancer digital signal processor, with its software-programmable, very high performance, is a revolutionary component for meeting the computing demands of media applications in areas such as digital TV, digital imaging office products, digital wired and wireless communications, and networking. The extremely high degree of data parallelism available in a single device, together with associative processing of data, places the Linedancer in a position to exploit the ever-increasing demand for sophisticated high-performance devices, instead of the dual-device solutions employed with competing DSP devices. With full programmability in high-level languages and application programming interfaces, the Linedancer creates a flexible solution for the constantly evolving media standards and applications. The Linedancer’s scalable architecture and its portable media processing code enable a wide range of products for handheld, consumer, and professional environments.

References

[1] Diefendorff, K., and Dubey, P., “How Multimedia Workloads Will Change Processor Design”, IEEE Computer, Vol. 30, No. 9, pp. 43-45, September 1997.

[2] Peleg, A., and Weiser, U., “MMX Technology Extension to the Intel Architecture”, IEEE Micro, Vol. 16, No. 4, pp. 42-50, August 1996.

[3] Phillip, M., “A Second Generation SIMD Microprocessor Architecture”, Proceedings of the Hot Chips X Symposium, August 1998.

[4] Ranganathan, P., Adve, S., and Jouppi, N., “Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions”, Proceedings of the 26th International Symposium on Computer Architecture, May 1999.

[5] Krikelis, A., “A Modular Massively Parallel Computing Approach to Image-related Processing”, Proceedings of the IEEE, Vol. 84, No. 7, pp. 988-1004, July 1996.

[Figure 3 labels: software (S/W) layers Application, Application Specific APIs, Basic Functions library and Hardware Programming Interface (HPI); hardware (H/W) layers Customer H/W and Linedancer Device, the latter comprising ASProCore, RISC core, DMA and other devices.]

Figure 3: Linedancer software
