A Programmable Processor with 4096 Processing Units for Media Applications

A. Krikelis, I. P. Jalowiecki, D. Bean, R. Bishop, M. Facey, D. Boughton, S. Murphy, and M. Whitaker
Aspex Technology Ltd.
Brunel Science Park
Kingston Lane
Uxbridge, UB8 3PH
United Kingdom
argy.krikelis@aspex.co.uk
Abstract
Over the past few years, technology drivers for processor designs have changed significantly. Media data delivery and processing – such
as telecommunications, networking, video processing, speech recognition and 3D graphics – is increasing in importance and will soon
dominate the processing cycles consumed in computer-based systems. This paper describes a processor, called Linedancer, that provides
high media performance with low energy consumption by integrating associative SIMD parallel processing with embedded microprocessor
technology. The major innovation in the Linedancer is the integration on a single chip of thousands of processing units that are capable of
supporting software-programmable, high-performance mathematical functions as well as abstract data processing. In addition to the 4096
processing units, Linedancer integrates on a single chip a RISC controller that is an implementation of the SPARC architecture, 128
Kbytes of Data Memory, and I/O interfaces. The SIMD processing in Linedancer implements the ASProCore architecture, a proprietary
implementation of SIMD processing that operates at 266 MHz with program instructions issued by the RISC controller. The device
also integrates a 64-bit synchronous main memory interface operating at 133 MHz (double data rate, DDR) and a 64-bit 66 MHz PCI
interface.
Introduction
Over the past few years, technology drivers for microprocessors have changed significantly. High-end systems for technical and scientific
applications used to direct the evolution of processor architecture. Now, consumer-level systems drive technology, due to their large
volume and attendant profits. Within this environment, important application and technology trends have evolved. Media data delivery and
processing – such as telecommunications, networking, video processing, speech recognition and 3D graphics – is increasing in importance
and will soon dominate the processing cycles consumed in computer-based systems [1]. SIMD extensions to existing processor
architectures [2, 3] that support DSP-type operations are essentially narrow vector designs without support for vector memory
operations. They have limited scalability because each instruction specifies a fixed number of operations. Most extensions do not support
SIMD memory operations, and therefore expose data alignment to user software [4]. Certain instructions, such as random permutations, will
not scale well because interconnect delay grows as the design is widened.
This paper presents the architecture of the Linedancer DSP processor. The Linedancer provides high multimedia performance with low
energy consumption by integrating associative SIMD parallel processing with embedded microprocessor technology. The major
innovation in the Linedancer is the integration on a single chip of thousands of processing units that are capable of supporting
software-programmable, high-performance mathematical functions as well as abstract data processing.
Linedancer Processor
Linedancer is a DSP processor for media processing that integrates 4096 processing units. With 15.6M transistors, the processor is
implemented in a 0.18 µm 5-metal-layer CMOS process. It is packaged in a 560-pin EBGA package. The core of the device operates at 1.8
Volts, while the I/Os operate at 3.3 Volts. The device, which will be sampling in the first quarter of 2001, dissipates less than 5 Watts
at peak at the typical operating frequency of 266 MHz. Table I details the Linedancer device specifications.
Parameter                                             Value
Technology                                            0.18 µm, 5-layer metal CMOS
Clock                                                 266 MHz
Power Supply                                          1.8 V Core, and 3.3 V I/O
Transistor Count                                      15.6 million
Power Dissipation                                     4 Watts (peak)
Die Size                                              111 mm²
Package                                               680 EBGA
Testability                                           Full-scan
Data Transfer – Synchronous Memory Interface          Up to 2,100 Mbytes/sec
PCI Interface                                         8 bytes @ 66 MHz
Integer Performance (8-bit add/subtract)              126,000 MOPS
Integer Performance (8-bit x 8-bit multiplies)        13,500 MOPS
Integer Performance (16-bit x 16-bit multiplies)      3,800 MOPS
Floating Point Performance (IEEE Single Precision)    1,700 MFLOPS
Table I: Linedancer device specifications
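Some of the Table I figures can be cross-checked with simple arithmetic. The sketch below is not part of the original specification. It assumes that peak bus bandwidth is bus width times clock rate (times two for DDR) and that the quoted device-level MOPS are spread evenly over the 4096 processing units; the function names are illustrative only.

```python
# Figures quoted in the text and in Table I.
CLOCK_MHZ = 266          # ASProCore clock
NUM_UNITS = 4096         # processing units
ADD8_MOPS = 126_000      # 8-bit add/subtract rate for the whole device


def ddr_bandwidth_mbytes(bus_bytes, clock_mhz):
    """Peak DDR bandwidth in Mbytes/sec: two transfers per clock (assumption)."""
    return bus_bytes * clock_mhz * 2


def sdr_bandwidth_mbytes(bus_bytes, clock_mhz):
    """Peak single-data-rate bandwidth in Mbytes/sec: one transfer per clock."""
    return bus_bytes * clock_mhz


def cycles_per_op(device_mops, units=NUM_UNITS, clock_mhz=CLOCK_MHZ):
    """Average clock cycles one unit spends per operation (even-spread assumption)."""
    per_unit_mops = device_mops / units
    return clock_mhz / per_unit_mops


mem_bw = ddr_bandwidth_mbytes(8, 133)   # 64-bit bus at 133 MHz DDR
pci_bw = sdr_bandwidth_mbytes(8, 66)    # 64-bit PCI at 66 MHz
add8_cycles = cycles_per_op(ADD8_MOPS)

print(mem_bw, pci_bw, round(add8_cycles, 2))
```

Under these assumptions the memory interface peaks at 2,128 Mbytes/sec, in line with the "up to 2,100 Mbytes/sec" entry, and an 8-bit add takes a little under 9 cycles per processing unit, which is roughly what one would expect from an ALU that handles about one bit per cycle.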
The Linedancer processor is one of the devices that incorporates ASProCore (Associative String Processor Core), a software
programmable associative SIMD structure that provides extremely efficient support for the parallel processing required for media
applications.
Figure 1 depicts the block diagram of the Linedancer processor. The Linedancer's major functional units include the ASProCore, a
RISC controller that is an implementation of the SPARC architecture, 128 Kbytes of Data Memory, and I/O interfaces. The ASProCore,
which is a proprietary implementation of SIMD processing, is designed to operate at 266 MHz with program instructions issued by the
RISC controller; the controller block includes a 128-Kbyte program memory. The device also integrates a 64-bit synchronous main
memory interface operating at 133 MHz (double data rate, DDR) and a 64-bit 66 MHz PCI interface.
The Data Transfer block provides the application programmer with the support required to develop application software that overlaps
application processing with data transfers. In inner loops operating over large sets of data, the programmer can program it to move strips,
blocks or patches of data on and off the chip. The overall design of the processor allows SIMD processing in ASProCore to fully overlap
with data movement.
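The overlap of processing and data transfer described above is commonly organised as a double-buffered (ping-pong) schedule. The sketch below only models that schedule sequentially in software: `process_strips` and its arguments are hypothetical names, not part of any Linedancer API, and on real hardware the load of the next strip would run concurrently with the computation rather than before it.

```python
def process_strips(strips, compute):
    """Model a ping-pong schedule: strip i is processed while strip i+1 is loaded.

    Returns the per-strip results and a log of (strip computing, strip loading)
    pairs, one pair per time step.
    """
    results = []
    schedule = []
    if not strips:
        return results, schedule

    buffers = [strips[0], None]      # prologue: first strip must be loaded up front
    schedule.append((None, 0))

    for i in range(len(strips)):
        cur = i % 2                  # buffer holding the strip to process
        nxt = 1 - cur                # buffer free to receive the next strip
        loading = i + 1 if i + 1 < len(strips) else None
        if loading is not None:
            buffers[nxt] = strips[loading]   # stands in for a background transfer
        results.append(compute(buffers[cur]))
        schedule.append((i, loading))

    return results, schedule


results, schedule = process_strips([[1, 2], [3, 4], [5]], sum)
print(results)
print(schedule)
```

After the first load, every step pairs one computation with one transfer, so transfer time is hidden whenever it does not exceed compute time.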
ASProCore
The ASProCore (Associative String Processor Core) part of the Linedancer is a programmable, homogeneous and fault-tolerant SIMD
parallel processor core incorporating a string of identical processing units, a reconfigurable Intercommunication Network, and a Vector
Data Buffer for fully overlapped data input-output as indicated in Figure 2.
As shown in Figure 2, each processing unit incorporates a Data Register and a bit-serial ALU. The size of the Data Register is 200 bits. The
Data Register, in addition to storing data for arithmetic operations involving the local ALU, can support associative processing operations
(i.e. it directly supports logical and relational operations).
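To illustrate what a bit-serial ALU implies, the sketch below adds two fields in every processing unit one bit position at a time. It is a generic software model of bit-serial addition, not a description of the actual ASProCore circuitry; the function name and the list-based representation of the processing units are assumptions made for the example.

```python
def bit_serial_add(a_values, b_values, width):
    """Add two fields held by every processing unit, one bit position per step.

    Returns the per-unit sums (truncated to `width` bits) and the final carries.
    """
    n = len(a_values)
    carry = [0] * n
    total = [0] * n
    for bit in range(width):              # bit-sequential: least significant bit first
        for pe in range(n):               # word-parallel: all units act in the same step
            a = (a_values[pe] >> bit) & 1
            b = (b_values[pe] >> bit) & 1
            s = a ^ b ^ carry[pe]         # full-adder sum bit
            carry[pe] = (a & b) | (carry[pe] & (a ^ b))
            total[pe] |= s << bit
    return total, carry


sums, carries = bit_serial_add([1, 200, 255], [2, 100, 1], width=8)
print(sums, carries)
```

The number of steps grows with the field width rather than with the number of processing units, which is why throughput comes from the length of the string rather than from the speed of any single ALU.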
Figure 1: Linedancer Block Diagram (blocks shown: ASProCore, RISC Controller, Linedancer Internal Bus, Data Transfer, PCI Interface, Data Memory; links labelled 64 bits @ 66 MHz and 64 bits @ 266 MHz)
The processing units are connected via the Intercommunication Network. The Intercommunication Network is a flexible network that
supports data transfers and navigation of data structures. It can be dynamically reconfigured, in a programmable and user-transparent way,
thus providing a cost-effective emulation of common network topologies. The interconnection strategy supports 2 modes of
inter-processing-unit communication:
• asynchronous bi-directional single-bit communication, which connects the source processing units of high-speed activation signals to
the corresponding destination processing units, implementing a fully connected, dynamically configured (transparently to the programmer)
permutation and broadcast network for processing element selection and inter-processing-element routing functions;
• synchronous bi-directional multi-bit communication, via a high-speed bit-serial shift register, for data/message transfer between
processing unit groups.
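The second mode can be pictured as shifting data along the string of processing units. The sketch below is a simplified software model under stated assumptions: values vacated at the ends of the string are filled with a constant, and a positive distance means a shift toward higher-numbered units. Neither convention is taken from the paper, and `shift_string` is a hypothetical helper.

```python
def shift_string(values, distance, fill=0):
    """Shift per-unit values along the string by `distance` positions.

    Positive distances move data toward higher-numbered units; positions
    left empty are set to `fill` (an assumption about end-of-string behaviour).
    """
    n = len(values)
    if distance == 0:
        return list(values)
    if abs(distance) >= n:
        return [fill] * n
    if distance > 0:
        return [fill] * distance + list(values[:n - distance])
    return list(values[-distance:]) + [fill] * (-distance)


pixels = [1, 2, 3, 4]
left = shift_string(pixels, 1)     # each unit receives its lower neighbour's value
right = shift_string(pixels, -1)   # each unit receives its upper neighbour's value
neighbour_sum = [l + c + r for l, c, r in zip(left, pixels, right)]
print(neighbour_sum)
```

Neighbourhood operations such as small filters can be built from a few shifts followed by purely local arithmetic in each unit.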
While being served with control and sequential data via the Instruction and Data Interface, the ASProCore can support parallel data I/O via
the Vector Data Buffer. Data is loaded into the Vector Data Buffer word-sequentially and bit-parallel, overlapped with SIMD parallel
processing. It can subsequently be exchanged with the data stored in the Data Register of the local processing unit in a word-parallel,
bit-sequential manner. For data-parallel operations, data are distributed over the processing units and stored in the local Data Register.
Successive computational tasks are performed on the stored data and the results are output. The ASProCore supports a form of set
associative processing, in which a sub-set of active processing units (i.e. those which associatively match broadcast scalar information)
support scalar-vector (i.e. between a scalar and Data Registers) and vector-vector (i.e. within Data Registers) operations. Matching
processing units are either directly activated or source inter-processing-element communications to indirectly activate other processing
units. The control interface provides feedback on whether none or some processing units match. The instruction set for the ASProCore is
based on 4 basic operations: match, add, read and write. More complicated functionality can be performed by combining these operations.
A more detailed discussion of the ASProCore architecture can be found in [5].
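The match, add, read and write operations can be illustrated with a small software model. This is a minimal sketch of the associative style of processing described above, assuming dictionary records in place of Data Register fields; the class and method names are hypothetical and do not reproduce the ASProCore instruction set.

```python
class AssociativeString:
    """Toy model of a string of processing units with per-unit activity flags."""

    def __init__(self, records):
        self.records = [dict(r) for r in records]   # one record per processing unit
        self.active = [False] * len(self.records)

    def match(self, field, value):
        """Broadcast a scalar; units whose field equals it become active.

        Returns True if some unit matched, False if none did (the feedback signal).
        """
        self.active = [r[field] == value for r in self.records]
        return any(self.active)

    def add(self, field, scalar):
        """Scalar-vector add, applied only in the active units."""
        for record, on in zip(self.records, self.active):
            if on:
                record[field] += scalar

    def write(self, field, value):
        """Write a broadcast value into the active units."""
        for record, on in zip(self.records, self.active):
            if on:
                record[field] = value

    def read(self, field):
        """Read the field from the active units, in string order."""
        return [r[field] for r, on in zip(self.records, self.active) if on]


units = AssociativeString([
    {"label": 1, "value": 5},
    {"label": 2, "value": 7},
    {"label": 1, "value": 9},
])
found = units.match("label", 1)    # select units by content, not by address
units.add("value", 10)             # update only the selected units
selected = units.read("value")
print(found, selected)
```

Selection by content replaces explicit indexing: the program never names unit 0 or unit 2, it only states the property that the units of interest share.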
Software Programmable
Conceptually the Linedancer processor can be viewed as a general purpose RISC processor with a tightly coupled data parallel co-
processor (ASProCore) and a DMA unit for moving data between memory and the co-processor. In inner loops operating over large sets of
data, the programmer can program it to move strips, blocks or patches of data on and off the chip. Applications consist of a single program
where instructions for the RISC and co-processor can be freely intermixed.
The Linedancer may be programmed using one or both of the following methods:
• Application Programming Interface (API)
An API provides a suite of specialised functions that can be used to build solutions for particular applications. The functions are
callable from a standard C or C++ program without the need for the programmer to have specialised hardware or computer architecture
knowledge.
• An extended version of C
Linedancer processors can also be programmed using an extended version of C, which has additional language statements that allow
operations to be performed in parallel on all the data elements stored in the processing units.
Figure 2: ASProCore Architecture (labels shown: Processing Unit, each with a Data Register and an ALU; Vector Data Buffer made up of Data Buffers; Concurrent Data I/O; Intercommunication Network; Instruction & Data Interface)
As indicated in Figure 3, which depicts the software and hardware layers of Linedancer, the overall programming environment includes
software libraries that provide functions supporting mathematical operations such as square root and raising to a power, standard C
functions for I/O, memory management etc., and hardware interface functions for cache control, DMA programming, interrupt handling etc.
Performance
Table II presents the sustained performance of a Linedancer device for a number of media-related processing tasks, i.e. signal processing,
image processing, graphics, etc.
Media-related Task                                                          Linedancer Performance
FIR (8 taps, 16 bits)                                                       284.4 Msamples/sec
FIR (16 taps, 16 bits)                                                      142.4 Msamples/sec
Convolution (3x3 kernel, 8 bits)                                            1400 Mpixels/sec
Median filtering (3x3 kernel, 8 bits)                                       600 Mpixels/sec
DCT/IDCT                                                                    1100 Mpixels/sec
Motion Estimation (vector for 8x8 block over 32x32 area)                    400 Kvectors/sec
Line Resize (1024 pixel line or column, 12-tap filter, 8 sets of filters)   1400 Mpixels/sec
3D Graphics (geometry & light transformations)                              33 Mvertices/sec
3D Graphics (Gouraud rendering, 25 pixels/triangle)                         25.7 Mtriangles/sec
3D Graphics (Gouraud rendering, 50 pixels/triangle)                         12.8 Mtriangles/sec
3D Volumetric Visualisation (transformation & rendering)                    450 Mvoxels/sec
Colour transformation (RGB to CMYK)                                         104 Mpixels/sec
Table II: Linedancer processor performance for a number of media processing tasks
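Two pairs of rows in Table II can be checked against each other. The short calculation below is a derived observation, not a result reported in the paper: it multiplies the FIR sample rates by the number of taps and the Gouraud triangle rates by the pixels per triangle, on the assumption that the work per tap and per pixel is what limits throughput.

```python
# Rates copied from Table II.
fir_msamples = {8: 284.4, 16: 142.4}     # taps -> Msamples/sec
gouraud_mtris = {25: 25.7, 50: 12.8}     # pixels per triangle -> Mtriangles/sec

# Tap evaluations per second (millions): taps x samples/sec.
fir_tap_rate = {taps: taps * rate for taps, rate in fir_msamples.items()}

# Rendered pixels per second (millions): pixels/triangle x triangles/sec.
fill_rate = {px: px * rate for px, rate in gouraud_mtris.items()}

print(fir_tap_rate)
print(fill_rate)
```

Both FIR rows come to about 2,275 to 2,278 million tap evaluations per second, and both Gouraud rows to about 640 million pixels per second, so within each pair the quoted throughput scales inversely with the work per output.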
Summary
The Linedancer digital signal processor, with its software-programmable, very high performance, is a revolutionary component for meeting
the computing demands of media applications in areas such as digital TV, digital imaging office products, digital wired and wireless
communications and networking. The extremely high degree of data parallelism available in a single device, together with associative
processing of data, positions the Linedancer to exploit the ever-increasing demand for sophisticated high-performance devices, in place
of the dual-device solutions employed with competing DSP devices. With full programmability in high-level
languages and application programming interfaces, the Linedancer creates a flexible solution for the constantly evolving media standards
and applications. The Linedancer’s scalable architecture and its portable media processing code enable a wide range of products for
handheld, consumer, and professional environments.
References
[1] Diefendorff, K. and Dubey, P., “How Multimedia Workloads Will Change Processor Design”, IEEE Computer, Vol. 30, No. 9,
pp. 43–45, September 1997.
[2] Peleg, A. and Weiser, U., “MMX Technology Extension to the Intel Architecture”, IEEE Micro, Vol. 16, No. 4, pp. 42–50,
August 1996.
[3] Phillip, M., “A Second Generation SIMD Microprocessor Architecture”, Proceedings of the Hot Chips X Symposium, August
1998.
[4] Ranganathan, P., Adve, S. and Jouppi, N., “Performance of Image and Video Processing with General-Purpose Processors
and Media ISA Extensions”, Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[5] Krikelis, A., “A Modular Massively Parallel Computing Approach to Image-related Processing”, Proceedings of the IEEE, Vol.
84, No. 7, pp. 988–1004, July 1996.
Figure 3: Linedancer software (layers shown: Application; Application Specific APIs; Basic Functions library; Hardware Programming Interface (HPI); Customer H/W; Linedancer Device comprising ASProCore, RISC core, DMA and other devices)