To appear in the Proceedings of the 26th International Symposium on Computer Architecture. May 1999
Performance of Image and Video Processing with
General-Purpose Processors and Media ISA Extensions
Parthasarathy Ranganathan, Sarita Adve, and Norman P. Jouppi†

Electrical and Computer Engineering        †Western Research Laboratory
Rice University                             Compaq Computer Corporation
{parthas,sarita}@rice.edu                   jouppi@pa.dec.com
Abstract
This paper aims to provide a quantitative understanding
of the performance of image and video processing applica-
tions on general-purpose processors, without and with me-
dia ISA extensions. We use detailed simulation of 12 bench-
marks to study the effectiveness of current architectural fea-
tures and identify future challenges for these workloads.
Our results show that conventional techniques in current
processors to enhance instruction-level parallelism (ILP)
provide a factor of 2.3X to 4.2X performance improve-
ment. The Sun VIS media ISA extensions provide an ad-
ditional 1.1X to 4.2X performance improvement. The ILP
features and media ISA extensions significantly reduce the
CPU component of execution time, making 5 of the image
processing benchmarks memory-bound.
The memory behavior of our benchmarks is character-
ized by large working sets and streaming data accesses. In-
creasing the cache size has no impact on 8 of the bench-
marks. The remaining benchmarks require relatively large
cache sizes (dependent on the display sizes) to exploit data
reuse, but derive less than 1.2X performance benefits with
the larger caches. Software prefetching provides 1.4X to
2.5X performance improvement in the image processing
benchmarks where memory is a significant problem. With
the addition of software prefetching, all our benchmarks re-
vert to being compute-bound.
1 Introduction
In the near future, media processing is expected to be-
come one of the dominant computing workloads [6, 13].
This work is supported in part by an IBM Partnership award, Intel Cor-
poration, the National Science Foundation under Grant Nos. CCR-9410457,
CCR-9502500, CDA-9502791, and CDA-9617383, and the Texas Ad-
vanced Technology Program under Grant No. 003604-025. Sarita Adve
is also supported by an Alfred P. Sloan Research Fellowship.
Media processing refers to the computing required for the
creation, encoding/decoding, processing, display, and com-
munication of digital multimedia information such as im-
ages, audio, video, and graphics. The last few years
have seen significant advances in this area, but the true
promise of media processing will be seen only when ap-
plications such as collaborative teleconferencing, distance
learning, and high-quality media-rich content channels ap-
pear in ubiquitously available commodity systems. Fur-
ther out, advanced human-computer interfaces, telepres-
ence, and immersive and interactive virtual environments
hold even greater promise.
One obstacle in achieving this promise is the high com-
putational demands imposed by these applications. These
requirements arise from the computationally expensive na-
ture of the algorithms, the stringent real-time constraints,
and the need to run many such tightly synchronized appli-
cations at the same time on the same system. For exam-
ple, a video teleconferencing system may need to run video
processing including encoding/decoding, audio processing,
and a software modem simultaneously. As a result, such
applications currently display images of only a few square
inches at a few frames per second when running on general-
purpose processors. Full-screen images at 20-30 frames per
second could require more than two orders of magnitude
more performance.
To meet the high computational requirements of emerg-
ing media applications, current systems use a combination
of general-purpose processors accelerated with DSP (or me-
dia) processors and ASICs performing specialized compu-
tations. However, benefits offered by general-purpose pro-
cessors in terms of ease of programming, higher perfor-
mance growth, easier upgrade paths between generations,
and cost considerations argue for increasing use of general-
purpose processors for media processing applications [6,
13]. The most visible evidence of this trend has been
the SIMD-style media instruction-set architecture (ISA) ex-
tensions announced for most high-performance general-
purpose processors (e.g., 3DNow! [15], AltiVec [19],
MAX [12], MDMX and MIPSV [9], MMX [18], MVI [4],
VIS [23]).
Unfortunately, in spite of the large amount of recent at-
tention given to media processing [5, 6, 13], there is very
little quantitative understanding of the performance of such
applications on general-purpose systems. A major chal-
lenge for such studies has been the large number of ap-
plication classes in this domain (e.g., image, video, au-
dio, speech, communication, graphics, etc.), and the ab-
sence of any standardized representative benchmark sets.
Consequently, in contrast to the much-researched SPEC,
SPLASH, and (more recently) TPC benchmarks, a number
of fundamental questions still remain unanswered for me-
dia processing workloads. For example, is computation or
memory the primary bottleneck in these applications? How
effective are current architectural designs and media ISA
extensions? What are the future challenges for these work-
loads? Given the lack of understanding of such issues, it is
not surprising that the media instruction set extensions an-
nounced by different processor vendors vary widely from
13 instructions in MVI for Alpha [4] to 162 instructions in
AltiVec for PowerPC [19].
This paper is a first step in understanding the above is-
sues to determine if and how we need to change the way we
design general-purpose systems to support media process-
ing applications. We focus on image and video workloads,
an important class of media processing workloads, and at-
tempt to cover the spectrum of the key tasks in this class.
Our benchmark suite consists of 12 kernels and applications
covering image processing, image source coding, and video
source coding. We use detailed simulation to study a va-
riety of general-purpose-processor architectural configura-
tions, both with and without the use of Sun’s visual instruc-
tion set (VIS) media ISA extensions. VIS shares a number
of fundamental similarities with the media ISA extensions
proposed for other processors, and is representative of the
benefits and limitations of current media ISA extensions.
We start with a base single-issue in-order processor. In
this system, all the benchmarks are primarily compute-
bound. We find that conventional techniques in current
processors to enhance instruction-level parallelism or ILP
(multiple issue and out-of-order issue) provide a factor of
2.3X to 4.2X performance improvement for the benchmarks
studied. The VIS media ISA extensions provide an addi-
tional 1.1X to 4.2X performance improvement. Our de-
tailed analysis indicates the sources and limitations of the
performance benefits due to VIS. The conventional ILP
techniques and the VIS extensions together significantly re-
duce the CPU component of execution time, making five of
the image processing benchmarks memory-bound.
The memory behavior of these workloads is character-
ized by large working sets and streaming data accesses. In-
creasing the cache size has no impact on 8 of the bench-
marks. The remaining benchmarks reuse data, but require relatively
large cache sizes (dependent on the display sizes) to ex-
ploit the reuse and derive a performance benefit of less than
1.2X. Software-inserted prefetching provides 1.4X to 2.5X
performance improvement in the image processing bench-
marks where memory stall time is significant. With the ad-
dition of software prefetching, all of our benchmarks revert
to being compute-bound.
The rest of the paper is organized as follows. Section 2
describes our workloads, the architectures modeled, and the
simulation methodology. Section 3 presents our results on
the impact of ILP features and VIS media extensions. Sec-
tion 4 studies the performance of the cache system and the
impact of software prefetching. Section 5 discusses related
work. Section 6 concludes the paper.
2 Methodology
2.1 Workloads
We attempt to cover the spectrum of key tasks in im-
age and video processing workloads. The kernels and ap-
plications in our benchmark suite form significant compo-
nents of many current and future real-world workloads such
as collaborative teleconferencing, scene-visualization, dis-
tance learning, streaming video across the internet, digi-
tal broadcasting, real-time flight imaging and radar sens-
ing, content-based storage and retrieval, online video cata-
loging, and medical tomography [8]. Future standards such
as JPEG2000 and MPEG4 are likely to build on a number
of components of our benchmark suite.
Table 1 summarizes the 12 benchmarks that we use
in this paper, and is divided into image processing (Sec-
tion 2.1.1), image source coding (Section 2.1.2), and video
source coding (Section 2.1.3). These benchmarks are simi-
lar to some of the benchmarks used in the image and video
parts of the Intel Media Benchmark (described at the Intel
web site) and the UCLA MediaBench [11].¹
All the image benchmarks were run with 1024x640 pixel
3-band (i.e., channel) input images obtained from the Intel
Media Benchmark. The video benchmarks were run with
the mei16v2 test bit stream from the MPEG Software Sim-
ulation Group that operates on 352x240 sized 3-band im-
ages. We did not study larger (full-screen) sizes because
they were not readily available and would have required im-
practical simulation time.
¹We did not use the Intel Media Benchmark or the UCLA MediaBench
directly because the former does not provide source code and the latter
does not include image processing applications.
Image processing
Addition Addition of two images (sf16.ppm,rose16.ppm) using mean of two pixel values
Blend Alpha blending of two images (sf16.ppm, rose16.ppm) with another alpha image (winter16.ppm); the operation
performed is dst = alpha × src1 + (255 − alpha) × src2.
Conv General 3x3 image convolution of an image (sf16.ppm). The operation performed includes a saturation sum-
mation of 9 product terms. Each term corresponds to multiplying the pixel values in a moving 3x3 window
across the image dimensions with the values of a 3x3 kernel matrix.
Dotprod 16x16 dot product of a randomly-initialized 1048576-element linear array
Scaling Linear image scaling of an image (sf16.ppm)
Thresh Double-limit thresholding of an image (sf16.ppm). If the pixel band value falls within the low and high values
for that band, the destination is set to the map value for that band; otherwise, the destination is set to be the
same as the source pixel value.
Image source coding
Cjpeg JPEG progressive encoding (rose16.ppm)
Djpeg JPEG progressive decoding (rose16.jpg)
Cjpeg-np JPEG non-progressive encoding (rose16.ppm)
Djpeg-np JPEG non-progressive decoding (rose16.jpg)
Video source coding
Mpeg-enc MPEG2 encoding of 4 frames (I-B-B-P frames) of the mei16v2rec bit stream. Properties of the bit stream
include frame rate of 30fps, bit rate of 5Mbps at the Main profile@Main level configuration. The image is
352x240 pixels in the 4:2:0 YUV chroma format, and is scaled to a 704x480 display. The quantization tables
and the motion estimation search parameters are set to the default parameters specified by the MPEG group.
Mpeg-dec MPEG2 decoding of the mei16v2rec video bit stream into separate YUV components.
Table 1. Summary of the benchmarks used in this study.
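As a concrete reference for the kinds of operations in Table 1, the blend kernel can be written as a scalar C loop. This is a minimal sketch, not the VSDK code: the function name is illustrative, and the final division by 255 (to keep the result within 8 bits) is our assumption about the scaling.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of the Blend kernel: per-pixel alpha blending of
   src1 and src2 under a per-pixel alpha image. The division by 255
   keeps the result in the unsigned 8-bit range (assumed scaling;
   the exact VSDK rounding may differ). */
void blend(const uint8_t *src1, const uint8_t *src2,
           const uint8_t *alpha, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = alpha[i];
        dst[i] = (uint8_t)((a * src1[i] + (255 - a) * src2[i]) / 255);
    }
}
```

Loops of exactly this shape, with no loop-carried dependences and narrow data types, are the ones that map well onto the packed VIS instructions discussed in Section 2.2.2.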
2.1.1 Image Processing
Our image processing benchmarks are taken from the Sun
VIS Software Development Kit (VSDK), which includes 14
image processing kernels. These kernels include common
image processing tasks such as one-band and three-band
(i.e., channel) alpha blending (used in image compositing),
single-limit and double-limit thresholding (used in chroma-
keying, image masking, and blue screening), and functions
such as general and separable convolution, copying, inver-
sion, addition, dot product, and scaling (used in the core
of many image processing codes like blurring, sharpening,
edge detection, embossing, etc.). We study all 14 of the
VSDK kernels, but due to space constraints, we report re-
sults for only 6 representative benchmarks (addition, blend,
conv, dotprod, scaling, and thresh).
2.1.2 Image Source Coding
We focus on the Joint Photographic Experts Group (JPEG)
standard and study the performance of the Release 6a codec
(encoder/decoder) from the Independent JPEG Group. We
study two different commonly used codecs specified in the
standard, a progressive JPEG codec (cjpeg encoder and
djpeg decoder), and a non-progressive JPEG codec (cjpeg-
np encoder and djpeg-np decoder).
The JPEG encoding process consists of a number of
phases many of which exploit properties of the human vi-
sual system to reduce the number of bits required to spec-
ify the image. First, the color conversion and chroma-
decimation phases convert the source image from a 24-bit
RGB representation domain to a 12 bit 4:2:0 YUV repre-
sentation. Next, a linear DCT image transform phase con-
verts the image into the frequency domain. The quantiza-
tion phase then scales the frequency domain values by a
quantization value (either constant or variable). The zig-
zag scanning and variable-length (Huffman) coding phases
then reorder the resulting data into streams of bits and en-
code them as a stream of variable-length symbols based on
statistical analysis of the frequency of symbols.
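The quantization phase above can be sketched in C as follows. This is an illustrative sketch only: the function name, the flat 8x8 block layout, and the truncating division are our assumptions, not the IJG codec's exact rounding rules.

```c
#include <stdint.h>

/* Sketch of the JPEG quantization phase: each of the 64 DCT
   coefficients of an 8x8 block is divided by the corresponding
   entry of a quantization table. C's integer division truncates
   toward zero; the actual codec's rounding differs in detail. */
void quantize_block(const int16_t dct[64], const uint8_t qtable[64],
                    int16_t out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = (int16_t)(dct[i] / qtable[i]);
}
```

Because most high-frequency coefficients quantize to zero, the subsequent zig-zag scan and Huffman coding can encode the block in few bits.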
Progressive image compression uses a compression al-
gorithm that performs multiple Huffman coding passes on
the image to encode it as multiple scans of increasing pic-
ture quality (leading to the perception of gradual focusing
of images seen on many web pages).
The decoding process performs the inverse of the opera-
tions for the encoding process in the reverse order to obtain
the original image from the compressed image.
2.1.3 Video Source Coding
We focus on the MPEG2 video coding standard from the
Moving Picture Experts Group, and study the performance of the
version 1.1 codec from the MPEG Software Simulation
Group.
The first part of the video compression process consists
of spatial compression similar to that described for JPEG
Processor speed 1 GHz
Issue width 4-way
Instruction window size 64
Memory queue size 32
Branch prediction
Bimodal agree predictor size 2K
Return-address stack size 32
Taken branches per cycle 1
Simultaneous speculated branches 16
Functional unit counts
Integer arithmetic units 2
Floating-point units 2
Address generation units 2
VIS multipliers 1
VIS adders 1
Functional unit latencies (cycles)
Default integer/address generation 1/1
Integer multiply/divide 7/12
Default floating point 4
FP moves/converts/divides 4/4/12
Default VIS 1
VIS 8-bit loads/multiply/pdist 1/3/3
Table 2. Default processor parameters.
Cache line size 64 bytes
L1 data cache size (on-chip) 64 KB
L1 data cache associativity 2-way
L1 data cache request ports 2
L1 data cache hit time 2 ns
Number of L1 MSHRs 12
L2 cache size (off-chip) 128K
L2 cache associativity 4-way
L2 request ports 1
L2 hit time (pipelined) 20 ns
Number of L2 MSHRs 12
Max. outstanding misses per MSHR 8
Total memory latency for L2 misses 100 ns
Memory interleaving 4-way
Table 3. Default memory system parameters.
and includes the color conversion, chroma decimation,
frequency transformation, quantization, zig-zag coding,
and run-length coding phases. Additionally, MPEG2 has
an inter-frame predictive-compression motion-estimation
phase that uses difference vectors to encode temporal redun-
dancy between macroblocks in a frame and macroblocks in
the following and preceding frames. Motion estimation is
the most compute-intensive part of mpeg-encode.
The video decompression process performs the inverse
of the various encode operations in reverse order to get the
decoded bit stream from the input compressed video. The
mei16v2 bit stream is already in the YUV format, and con-
sequently, our MPEG simulations do not go through the
color conversion phase discussed in Section 2.1.2.
2.2 Architectures Modeled
2.2.1 Processor and Memory System
We study two processor models: an in-order processor
model (similar to the Compaq Alpha 21164, Intel Pen-
tium, and Sun UltraSPARC-II processors) and an out-of-
order processor model (similar to the Compaq Alpha 21264,
HP PA8000, IBM PowerPC, Intel Pentium Pro, and MIPS
R10000 processors). Both the processor models support
non-blocking loads and stores.
For the experiments with software prefetching, the pro-
cessor models provide support for software-controlled non-
binding prefetches into the first-level cache.
The base system uses a 64KB two-way associative first-
level (L1) write-back cache and a 128KB 4-way associa-
tive second-level (L2) write-back cache. Section 4.1 dis-
cusses the impact of varying the cache sizes. All the caches
are non-blocking and allow support for multiple outstand-
ing misses. At each cache, 12 miss status holding regis-
ters (MSHRs) reserve space for outstanding cache misses
and combine a maximum of 8 multiple requests to the same
cache line.
Tables 2 and 3 summarize the parameters used for the
processor and memory subsystems. When studying the per-
formance of a 1-way issue processor, we scale the number
of functional units to 1 of each type. The functional unit
latencies were chosen based on the Alpha 21264 processor.
All functional units are fully pipelined except the floating-
point divide (non-pipelined).
2.2.2 VIS Media ISA Extensions
The VIS media ISA extensions to the SPARC V9 architec-
ture are a set of instructions targeted at accelerating media
processing [10, 23]. Both our in-order and out-of-order pro-
cessor models include support for VIS.
The VIS extensions define the packed byte, packed word,
and packed double data types, which allow concurrent oper-
ations on eight bytes, four words (16 bits each), or two dou-
ble words of fixed-point data in a 64-bit register. These data
types allow VIS instructions to exploit single-instruction-
multiple-data (SIMD) parallelism at the subword level.
Most of the VIS instructions operate on packed words or
packed doubles; loads, stores, and pdist instructions op-
erate on packed bytes. Many of the VIS instructions make
implicit assumptions about rounding and the number of sig-
nificant bits in the fixed-point data. Hence, their use requires
ensuring that they do not lead to incorrect outputs. We next
provide a short overview of the VIS instructions (summa-
rized in Table 4).
Packed arithmetic and logical operations. The packed
arithmetic VIS instructions allow SIMD-style parallelism to
be exploited for add, subtract, and multiply instructions. To
Packed arithmetic and logical operations
Packed addition
Packed subtraction
Packed multiplication
Logical operations
Subword rearrangement and realignment
Data packing and expansion
Data merging
Data alignment
Partitioned compares and edge operations
Partitioned compares
Mask generation for edge effects
Memory-related operations
Partial stores
Short loads and stores
Blocked loads and stores
Special-purpose operations
Pixel distance computation
Array address conversion for data reuse
Access to the graphics status register
Table 4. Classification of VIS instructions.
minimize implementation complexity, VIS uses a pipelined
series of two 8x16 multiplies and one add instruction to em-
ulate packed 16x16-bit multiplication. The VIS logical in-
structions allow logical operations on the floating-point data
path.
Subword rearrangement and alignment. To facilitate
conversion between different data types, VIS supports sub-
word rearrangement and alignment using pack, expand,
merge (interleave), and align instructions. The subword re-
arrangement instructions also include support for implicitly
handling saturation arithmetic (limiting data values to the
minimum or maximum instead of the default wrap-around).
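The saturation behavior described above can be modeled by a small scalar helper. This is an illustrative sketch of clamping a 16-bit intermediate to the unsigned 8-bit range, not the full semantics of the VIS pack instructions.

```c
#include <stdint.h>

/* Sketch of the saturation applied during subword packing:
   a 16-bit intermediate value is clamped to [0, 255] instead
   of wrapping around, so an overflowing pixel computation
   saturates to white rather than wrapping to a dark value. */
static inline uint8_t sat_pack_u8(int16_t x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}
```

Handling saturation implicitly in hardware removes the compare-and-branch sequences that a scalar implementation would need per pixel.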
Partitioned compares and edge operations. For branches,
VIS supports a partitioned compare that performs four 16-
bit or two 32-bit compares in parallel to produce a mask that
can be used in subsequent instructions. VIS also supports
the edge instruction to generate masks for partial stores that
can eliminate special branch code to handle boundary con-
ditions in media processing applications.
Memory-related operations. For memory instructions,
VIS supports partial stores that selectively write to parts of
the 64-bit output based on an input mask. Short loads
and stores transfer 1 or 2 bytes of memory to the register
file. Blocked loads and stores transfer 64 bytes of data
between memory and a group of eight consecutive VIS reg-
isters without causing allocations in the cache.
Special-purpose operations. The pixel distance computa-
tion (pdist) instruction is primarily targeted at motion es-
timation and computes the sum of the absolute differences
between corresponding 8-bit components in two packed
bytes. The array instruction is mainly targeted at 3D graph-
ics rendering applications and converts 3D fixed-point co-
ordinates into a blocked byte address that allows for greater
cache reuse. VIS also defines instructions to manipulate
the graphics status register, a special-purpose register that
stores additional data for various media instructions.
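The pdist computation can be modeled in scalar C as follows. This is an illustrative sketch: the real instruction operates on two packed-byte registers and, in the VIS usage for motion estimation, adds the result into a running accumulator.

```c
#include <stdint.h>

/* Scalar model of pdist: the sum of absolute differences (SAD)
   between corresponding 8-bit components of two 8-byte groups.
   Motion estimation evaluates this over candidate macroblock
   positions and picks the one with the smallest SAD. */
static uint32_t pdist8(const uint8_t a[8], const uint8_t b[8])
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    return sum;
}
```

Collapsing eight subtract/absolute-value/add sequences into one instruction is why pdist gives mpeg-enc much of its VIS benefit.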
Overall, the functionality discussed above for VIS is sim-
ilar to that of fixed-point media ISA extensions in other
general-purpose processors (e.g., MAX [12], MMX [18],
MVI [4], MDMX [9], AltiVec [19]). The various ISA ex-
tensions mainly differ in the number, types, and latencies of
the individual instructions (e.g., MMX implements direct
support for 16x16 multiply), whether they are implemented
in the integer or floating-point data path, and in the width
of the data path. The most different ISA extension, the pro-
posed PowerPC AltiVec ISA, adds support for a separate
128-bit vector multimedia unit in the processor.
Our VIS implementation is closely modeled after
the UltraSPARC-II implementation and operates on the
floating-point register file with latencies comparable to the
UltraSPARC-II [23] (Table 2).² The increase in chip area
associated with the VIS instructions was estimated to be less
than 3% for the UltraSPARC-II [10].
2.3 Methodology
2.3.1 Simulation Environment
We use the RSIM simulator [16] to simulate the in-order and
out-of-order processors described in Section 2.2. RSIM is
a user-level execution-driven simulator that models the pro-
cessor pipeline and memory hierarchy in detail including
contention for all resources. To assess the impact of not
modeling system level code, we profiled the benchmarks on
an UltraSPARC-II-based Sun Enterprise server. We found
that the time spent on operating system kernel calls is less
than 2% on all the benchmarks. The time spent on I/O is
less than 15% on all the benchmarks except mpeg-dec. This
benchmark experiences an inflated I/O component (45%)
because of its high frequency of file writes. In a typical sys-
tem, however, these writes would be handled by a graphics
accelerator, significantly reducing this component. Since
our applications have small instruction footprints, our sim-
ulations assume all instructions hit in the instruction cache.
All the applications³ were compiled with the SPARC
SC4.2 compiler with the -xO4 -xtarget=ultra1/170
-xarch=v8plusa -dalign options to produce optimized code
for the in-order UltraSPARC processor.
²Our VIS multiplier has a lower latency compared to the integer (64-
bit) multiplier because it operates on 16-bit data.
³We changed the 14 image processing kernels from the Sun VSDK to
skew the starting addresses of concurrent array accesses and unroll small
innermost loops. This reduced cache conflicts and branch mispredictions
leading to 1.2X to 6.7X performance benefits. To facilitate modifying the
applications for VIS, we replaced some of the key routines in the JPEG
and MPEG applications with equivalent routines from the Sun MediaLib
library.
2.3.2 VIS Usage Methodology
We are not aware of any compiler that automatically modi-
fies media processing applications to use media ISA exten-
sions. For our experiments studying the impact of VIS, we
manually modified our benchmarks to use VIS instructions
based on the methodology detailed below.
We profiled the applications to identify key procedures
and manually examined these procedures for loops that sat-
isfied the following three conditions: (1) The loop body
should have no loop-carried dependences or control depen-
dences that cannot be converted to data dependences (other
than the loop branch). (2) The key computation in the loop
body must be replaceable with a set of equivalent fixed-
point VIS instructions. The loss in accuracy in this stage,
if any, should be visually imperceptible. (3) The poten-
tial benefit from VIS should be more than the overhead
of adding VIS; VIS overhead can result from subword re-
arrangement instructions to convert between packed data
types or from alignment-related instructions.
For loops that satisfied the above criteria, we strip-mined
or unrolled the loop to isolate multiple iterations of the loop
body that we then replaced with equivalent VIS instruc-
tions. We used the inline assembly-code macros from the
Sun VSDK for the VIS instructions; this minimizes code
perturbation and allows the use of regular compiler opti-
mizations. Wherever possible, we tried to use procedures
available from the Sun VSDK and the Sun MediaLib
library routines that were already optimized for VIS.
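The strip-mining step described above can be illustrated with the addition kernel (which, per Table 1, averages two pixel values). The 8-pixel chunk width matches the packed-byte type; in the VIS version, the inner chunk body would be replaced by packed loads, a packed add, and a packed store. The code is a sketch, not the VSDK kernel.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of strip-mining for VIS conversion: the loop is split
   into 8-pixel chunks (the packed-byte width) plus a scalar
   remainder for trailing pixels. Only the chunked body is a
   candidate for replacement with packed VIS instructions. */
void add_images(const uint8_t *a, const uint8_t *b,
                uint8_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)          /* VIS-replaceable chunk */
        for (size_t j = 0; j < 8; j++)
            dst[i + j] = (uint8_t)((a[i + j] + b[i + j]) / 2);
    for (; i < n; i++)                  /* scalar remainder */
        dst[i] = (uint8_t)((a[i] + b[i]) / 2);
}
```

The edge instruction described in Section 2.2.2 can often absorb the remainder loop as well, by masking the partial store at the image boundary.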
Our benchmarks use all the VIS instructions except for
the array and blocked load/store instructions. Array in-
structions are targeted at 3D array accesses in graphics
loops, and are not applicable for our applications. Blocked
loads and stores are primarily targeted at transfers of large
blocks of data between buffers without affecting the cache
(e.g., in operating system buffer management, networking,
memory-mapped I/O). We did not use these instructions
since the Sun VSDK does not provide inline assembly-code
macros to support them. The alternative of hand-coded as-
sembly could result in lower performance since it is hard to
emulate the compiler optimizations associated with modern
superscalar processors by hand [22]. Note that both the ar-
ray and blocked load/store instructions are unique to VIS
and are not supported by other general-purpose ISA exten-
sions.
2.3.3 Software Prefetching Algorithm
We studied the applicability of software prefetching for the
benchmarks where the cache miss stall time is a significant
component of the total execution time (>20%). We identi-
fied the memory accesses that dominate the cache miss stall
time, and inserted prefetches by hand for these accesses.
We followed the well known software prefetching compiler
algorithm developed by Mowry et al. [14].
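The insertion pattern can be sketched as follows, using GCC's __builtin_prefetch as a stand-in for the non-binding prefetch into the first-level cache; the prefetch distance of one cache line is an illustrative choice, not the paper's tuned value.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of Mowry-style software prefetching for a streaming
   access: issue a non-binding prefetch some distance ahead so
   the line arrives by the time it is used. Prefetches past the
   end of the array are harmless because they are non-faulting. */
#define PDIST 64  /* elements ahead: one 64-byte line of pixels */

uint32_t sum_stream(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#ifdef __GNUC__
        __builtin_prefetch(&src[i + PDIST], 0 /* read */, 0 /* streaming */);
#endif
        sum += src[i];
    }
    return sum;
}
```

In practice the prefetch would be hoisted so that one prefetch covers all accesses to a line, rather than being issued every iteration as in this simplified sketch.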
2.3.4 Performance Metrics
We use the execution time of the system as the primary met-
ric to evaluate the performance of the system, while also re-
porting the individual components of execution time. With
out-of-order processors, an instruction can potentially be
overlapped with instructions preceding and following it. We
therefore use the following convention to identify the differ-
ent components of execution time. At every cycle, the frac-
tion of instructions retired that cycle relative to the maximum
retire rate is attributed to busy time; the remaining fraction
is attributed as stall time to the first instruction that could
not be retired that cycle. We also study other metrics such
as dynamic instruction counts, branch misprediction rates,
cache miss rates, MSHR occupancies, and prefetch counts
for further insights into the system behavior.
3 Improving Processor Performance
For each benchmark, Figure 1 presents execution times
for three variations of our base architecture, each without
VIS (the first set of three bars) and with VIS (the second
set of three bars). The three architecture variations are (i)
in-order and single issue, (ii) in-order and 4-way issue, and
(iii) out-of-order and 4-way issue. On the VIS-enhanced
architecture, we use the VIS-enhanced version of the appli-
cation as mentioned in Section 2. The execution times are
normalized to the time with the in-order single-issue pro-
cessor. For all the benchmarks, the execution time is di-
vided into the busy component, the functional unit stall (FU
stall) component, and the memory component. The mem-
ory component is shown divided into the L1 miss and L1 hit
components.
3.1 Impact of Conventional ILP Features
This section focuses on the system without the VIS me-
dia ISA extensions (the left three bars for each benchmark
in Figure 1).
Overall results. Both multiple issue and out-of-order issue
provide substantial reductions in execution time for most of
our benchmarks. Compared to a single-issue in-order pro-
cessor, on average, multiple issue improves performance
by a factor of 1.2X (range of 1.1X to 1.4X), while the com-
bination of multiple issue and out-of-order issue improves
performance by a factor of 3.1X (range of 2.3X-4.2X).
Figure 1. Performance of image and video benchmarks.
[Bar chart: for each benchmark, the three left bars (without VIS) and the three right bars (with VIS) show the 1-way in-order, 4-way in-order, and 4-way out-of-order (ooo) processors, normalized to the in-order single-issue time; each bar is divided into Busy, FU stall, L1 miss, and L1 hit components. The normalized totals are:]

                 Without VIS                  With VIS
              1-way  4-way  4-way(ooo)  1-way  4-way  4-way(ooo)
  Addition    100.0   71.2    43.6       39.8   36.7    15.4
  Blend       100.0   77.5    33.3       28.7   26.9    10.1
  Conv        100.0   87.5    24.1       15.9   12.5     5.7
  Dotprod     100.0   88.2    32.4       74.6   68.0    28.7
  Scaling     100.0   88.8    26.0       18.3   15.9    10.3
  Thresh      100.0   89.6    37.7       32.0   26.5    14.9
  Cjpeg       100.0   84.4    29.3       87.6   74.2    26.6
  Djpeg       100.0   78.8    33.2       69.2   53.8    26.8
  Cjpeg-np    100.0   78.0    30.1       66.8   50.6    23.1
  Djpeg-np    100.0   74.7    27.6       58.9   41.1    18.9
  Mpeg-enc    100.0   89.6    43.2       33.5   26.2    12.2
  Mpeg-dec    100.0   77.3    32.0       65.0   49.2    24.4

Analysis. Compared to the single issue processor, we find
that multiple issue achieves most of its benefits by reducing
the busy CPU component of execution time. Data, control,
and structural dependences prevent the CPU component from
attaining an ideal speedup of 4 from a 4-way issue processor,
reflected in the increased functional unit and L1 hit memory
stall time.
Some of the benchmarks see additional memory laten-
cies when the number of outstanding misses to one cache
line exceeds the maximum of 8 accesses that each miss
status holding register (MSHR) can track. This is caused by
the heavy use of small data types in media applications,
which leads to a high frequency of accesses to each cache
line (e.g., 64 pixel writes in a 64-byte line). Since the
processors do not stall on writes, this leads to a backup of
multiple writes in benchmarks with small loop bodies (e.g.,
addition, cjpeg, djpeg).
This backup leads to contention for the MSHR that even-
tually prevents other accesses from being serviced at the
cache.
Out-of-order issue, on the other hand, improves perfor-
mance by reducing both functional unit stall time and mem-
ory stall time. A large fraction of the stall times due to data,
control, and structural dependences, as well as MSHR con-
tention, is now overlapped with other useful work. This is
seen in the reduction in the FU stall and L1 hit components
of execution time. Additionally, out-of-order issue can bet-
ter exploit the non-blocking loads feature of the system by
allowing the latency of multiple long-latency load misses to
be overlapped with one another. Our results examining the
MSHR occupancies at the cache indicate that while there is
increased load miss overlap in all 12 benchmarks, only
2 to 3 misses are overlapped in most cases. The total capac-
ity of 12 MSHRs is never fully utilized for load misses in
any of our benchmarks.
Overall, the impact of the various ILP features is qualita-
tively consistent with that described in previous studies for
scientific and database workloads. Quantitatively, these ILP
features are substantially more effective for the image and
video benchmarks than for previously reported online trans-
action processing (OLTP) workloads [21], and comparable
in benefit to previously reported scientific and decision sup-
port system (DSS) workloads [1, 17, 21].
It must be noted that the performance of the in-order is-
sue processor is dependent on the quality of the compiler
used to schedule the code. Our experiments use the com-
mercial SPARC SC4.2 compiler with maximum optimiza-
tions turned on for the in-order UltraSPARC processor. To
try to isolate compiler scheduling effects, we studied two
other processor configurations with single-cycle functional
unit latencies and functional unit latencies comparable to
the UltraSPARC processor. In both these configurations,
our results continued to be qualitatively similar; the out-
of-order processor continues to significantly outperform the
in-order processor. The impact of future, more advanced,
compiler optimizations, however, is still an open question.
Interestingly, a recent position paper [6] on the impact of
multimedia workloads on general-purpose processors con-
jectures that complex out-of-order issue techniques devel-
oped for scientific and engineering workloads (e.g., SPEC)
may not be needed for multimedia workloads. Our results
show that, on the contrary, out-of-order issue can provide
significant performance benefits for the image and video
workloads.
3.2 Impact of VIS Media ISA Extensions
This section discusses the performance impact of the
VIS ISA extensions (comparing the left-hand three bars and
right-hand three bars for each benchmark in Figure 1).
3.2.1 Overall Results
The VIS media ISA extensions provide significant perfor-
mance improvements for all the benchmarks (factors of
1.1X to 4.0X for the out-of-order system, 1.1X to 7X across
all configurations). On average, the addition of VIS im-
proves the performance of the single-issue in-order system
by a factor of 2.0X, the performance of the 4-way-issue in-
order system by a factor of 2.1X, and the performance of
the 4-way issue out-of-order system by a factor of 1.8X.
Multiple issue and out-of-order issue are beneficial even
with VIS. On average, with VIS, compared to a single-issue
in-order processor, multiple issue achieves a factor of 1.2X
performance improvement, while the combination of mul-
tiple issue and out-of-order issue achieves a factor of 2.7X
performance improvement. The reasons for these perfor-
mance benefits from ILP features are the same as for the
systems without VIS.
3.2.2 Benefits from VIS
Figure 2 presents some additional data showing the distribu-
tion of the dynamic (retired) instructions for the 4-way out-
of-order processor without and with VIS, normalized to the
former. Each bar divides the instructions into the Functional
unit (FU, combines ALU and FPU), Branch, Memory, and
VIS categories. The use of VIS instructions provides a sig-
nificant reduction in the dynamic instruction count for all
the benchmarks. The reductions in the dynamic instruc-
tion count correlate well with the performance benefits from
VIS. We next discuss the sources for the reductions in the
dynamic instructions.
Reductions in FU instructions. The VIS packed arith-
metic and logical instructions allow multiple (typically
four) arithmetic instructions to be replaced with one VIS
instruction. Consequently, all the benchmarks see signif-
icant reductions in the FU instructions with correspond-
ing, smaller, increases in the VIS instruction count. Ad-
ditionally, the SIMD VIS instructions replace multiple it-
erations in the original loop with one equivalent VIS it-
eration. This reduces iteration-specific loop-overhead in-
structions that increment index and address values and com-
pute branch conditions, further reducing the FU instruction
count.
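To illustrate this effect, the following is a behavioral sketch in Python (purely for exposition; the function names and the 16-bit lane width are ours, not VIS mnemonics): one partitioned add replaces four scalar adds by operating on four 16-bit lanes packed into a 64-bit word, with no carry propagating across lane boundaries.

```python
def pack16(*lanes):
    """Pack four 16-bit values into one 64-bit word (lane 0 lowest)."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def packed_add16(a, b):
    """Emulate a 4-way partitioned add: four 16-bit lanes in one 64-bit
    word are added in a single operation, with per-lane wraparound (no
    carry crosses a lane boundary)."""
    mask = 0xFFFF
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane_sum << shift
    return result
```

A loop over an image then advances four pixels per iteration, which is also what removes the per-iteration index-increment and branch-condition instructions mentioned above.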
Reductions in branch instructions. All the benchmarks
use the edge masking and partial store instructions to elim-
inate testing for edge boundaries and selective writes. They
also use loop unrolling when replacing multiple iterations
with one equivalent VIS iteration. These lead to a reduc-
tion in the branch instruction count for all the benchmarks.
For some applications, branch instruction counts are also
reduced because of the elimination of the code to explic-
itly perform saturation arithmetic (mainly in conv^4 and the
JPEG applications), and the use of partitioned SIMD com-
pares (mainly in thresh).
Many of the branches eliminated are hard-to-predict
branches (e.g., saturation, thresholding, selective writes),
leading to significant improvements in the hardware branch
misprediction rates for some of our benchmarks (the branch
misprediction rate decreases from 10% to 0% for conv and
from 6% to 0% for thresh).
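The three branch-elimination idioms above (saturation, thresholding via partitioned compares, and selective writes via partial stores) can be sketched behaviorally as follows (Python for illustration only; the names are ours, and real VIS operates on packed registers rather than Python lists):

```python
def saturating_add8(a, b):
    # Unsigned 8-bit add that clamps at 255 instead of wrapping.
    # Packed VIS arithmetic applies this clamp in hardware; scalar code
    # needs an explicit compare-and-branch per pixel.
    return min(a + b, 255)

def threshold_masked(pixels, t):
    # Branch-free thresholding in the style of a partitioned SIMD
    # compare: -(p > t) is an all-ones/all-zeros mask selecting 255 or 0.
    return [-(p > t) & 255 for p in pixels]

def edge_mask(start, end, width=8):
    # Bitmask for a partial store: lane i is written only if pixel i of
    # the chunk lies inside [start, end). VIS edge instructions derive
    # such masks from addresses, removing per-pixel boundary tests.
    return sum(1 << i for i in range(width) if start <= i < end)

def partial_store(dst, offset, values, mask):
    # Write only the lanes selected by the mask (a selective write).
    for i, v in enumerate(values):
        if (mask >> i) & 1:
            dst[offset + i] = v
```

In each case a data-dependent branch per pixel is replaced by straight-line data movement, which is why the eliminated branches tend to be exactly the hard-to-predict ones.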
Reductions in memory instructions. With VIS, memory
accesses operate on packed data as opposed to individual
media data. Consequently, most of the benchmarks see sig-
nificant reductions in the number of memory instructions
(and associated cache accesses). This reduces the MSHR
contention discussed in Section 3.1.
Most of the memory instructions eliminated are cache
hits, without a proportional decrease in the number of
misses, causing higher L1 cache miss rates with VIS. The
higher miss rate and the lower instruction count allow addi-
tional load misses to appear together within the instruction
window and be overlapped with each other. However, the
system still rarely sees more than 3 load misses overlapped
concurrently.

^4 The original source code from the Sun VSDK checks for saturation
only in the conv code. The add, blend, and dotprod kernels are written in
the non-saturation mode. These could, however, potentially be rewritten to
check for saturation, in which case they would also see similar benefits.

[Figure 2. Impact of VIS on dynamic (retired) instruction count. Relative
to the base (100), the VIS versions retire 26.2 (addition), 17.6 (blend),
25.4 (conv), 88.5 (dotprod), 18.0 (scaling), 30.5 (thresh), 85.5 (cjpeg),
66.3 (djpeg), 66.9 (cjpeg-np), 58.1 (djpeg-np), 32.7 (mpeg-enc), and 66.4
(mpeg-dec). Each bar is divided into FU, Branch, Memory, and VIS
instruction categories.]
Pixel distance computation for mpeg-enc. mpeg-enc
achieves additional benefits from the special-purpose pixel
distance computation (pdist) instruction in the motion es-
timation phase. The pdist instruction allows a sequence
of 48 instructions to be reduced to one instruction [23], sig-
nificantly reducing the FU, branch, and memory instruction
counts. The elimination of hard-to-predict branches to per-
form comparisons and saturation improves the branch mis-
prediction rate from 27% to 10%.
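The operation pdist performs is a sum-of-absolute-differences accumulation; a behavioral sketch (our naming; the actual instruction operates on eight packed 8-bit pixels and a 64-bit accumulator in one step):

```python
def pdist(block_a, block_b, acc=0):
    """Accumulate the sum of absolute differences (SAD) between two
    rows of 8-bit pixels into acc -- the motion-estimation metric.
    In scalar code each pixel pair costs loads, a subtract, an
    absolute value (a compare/branch), and an add; the VIS pdist
    instruction performs the whole row at once."""
    for a, b in zip(block_a, block_b):
        acc += abs(a - b)
    return acc
```

Because the absolute value is computed internally, the data-dependent compare per pixel disappears, which is consistent with the branch misprediction improvement reported above.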
3.2.3 Limitations of VIS
Examining the variation of performance benefits across the
benchmarks, we observe that the JPEG applications, mpeg-
dec, and dotprod exhibit relatively lower performance ben-
efits (factors of 1.1X to 1.5X) compared to the remaining
benchmarks (factors of 2.8X to 4.2X). We next discuss fac-
tors limiting the benefits from VIS.
Inapplicability of VIS. The VIS media ISA extensions are
not applicable to a number of key procedures in the JPEG
applications and mpeg-dec. For example, the JPEG appli-
cations (especially the progressive versions) spend a large
fraction of their time in the variable-length Huffman cod-
ing phase. This phase is inherently sequential and operates
on variable-length data types, and consequently cannot be
optimized using VIS instructions.^5 Other examples of code
segments in the JPEG and MPEG applications where VIS
could not be applied include bit-level input/output stream
manipulation, scatter-gather addressing, quantization, and
saturation arithmetic operations not embedded in a loop.
VIS overhead. All our benchmarks use subword rear-
rangement and alignment instructions to get the data into
a form that the VIS instructions can operate on (e.g., pack-
ing/unpacking between packed-byte pixels and packed-
word operands). This results in extra overhead that limits
the performance benefits from VIS (on average, for our
benchmarks, 41% of the VIS instructions are for subword
rearrangement and alignment). Overhead also increases
when multiple VIS instructions are needed to emulate one
operation (e.g., the 16x16 multiply in dotprod) or when the
data need to be reordered to exploit SIMD (e.g., byte
reordering in the color conversion phase in JPEG).

^5 It is instructive to note that many media processors (e.g., the Mit-
subishi VLIW Media Processor and Samsung Media Signal Processor)
have a special-purpose hardware unit to handle the variable-length coding.
Limited parallelism and scheduling constraints. Most of
the packed arithmetic instructions operate only on packed
words or packed double words. This ensures enough bits to
maintain the precision of intermediate values. However, this
limits the maximum parallelism to 4 (on 16-bit data types),
even in cases when the operations are known to be per-
formed on smaller data types (e.g., 8-bit pixels). The limit
on the SIMD parallel path,^6 in combination with contention
for VIS functional units, limits the benefits from VIS on
some of our benchmarks (most significantly in mpeg-enc).
Cache miss stall time. As discussed earlier, the reductions
in memory instructions mainly occur for cache hits. The
VIS instructions do not directly target cache misses though
there are indirect benefits associated with increased load
miss overlap due to instruction count reduction (discussed
in Section 3.2.2).
3.3 Combination of ILP and VIS
The combination of conventional ILP features and VIS
extensions achieves an average factor of 5.5X perfor-
mance improvement (range of 3.5X to 18X) over the base
single-issue in-order processor. The benefits from VIS are
achieved with a much smaller increase in the die area com-
pared to the ILP features.
On the base single-issue in-order processor, all the
benchmarks are primarily compute-bound. With ILP fea-
tures and VIS extensions, cjpeg-np, djpeg-np, and mpeg-enc
continue to spend most of their execution time in the
processor sub-system (87% to 97%). Five of the image
processing kernels, however, now spend 55% to 66% of
their total time in memory stalls. The strong compute-centric
performance benefits from ILP features and VIS extensions
shift the bottleneck to the memory sub-system for these
benchmarks. The remaining 4 applications (conv, cjpeg,
djpeg, and mpeg-dec) spend between 20% and 30% of their
total time on memory stalls.

^6 The MIPS MDMX provides support for a larger-size accumulator that
allows greater parallelism without losing the precision of the intermediate
result [9]. The PowerPC AltiVec supports a larger 128-bit data path to
increase the parallelism [19].
4 Improving Memory System Performance
This section studies memory system performance for
the benchmarks. Section 4.1 discusses the effectiveness of
caches, and Section 4.2 discusses the impact of software
prefetching.
4.1 Impact of Caches
Impact of varying L2 cache size. We varied the L2 cache
size from 128K to 2M, keeping the L1 cache fixed at 64K.
Our results (not shown here due to lack of space) showed
that increasing the size of the L2 cache has no impact on
the performance of the 6 image processing kernels and the
cjpeg-np and djpeg-np applications. The remaining four
applications, cjpeg, djpeg, mpeg-enc, and mpeg-dec, reuse
data, but the cache size needed to exploit the reuse depends
on the size of the display. For our input image sizes, a 2M
L2 cache captures the entire working sets of all four
benchmarks and provides a 1.1X to 1.2X performance
improvement over the default 128K L2 cache. With the 2M
cache size, memory stall time is between 7% and 9% for all
the applications and is dominated by L1 hit time (due to MSHR
contention) and L2 hit time.
The image processing kernels have streaming data ac-
cesses to a large image buffer with no reuse and low com-
putation per cache miss. Consequently, they exhibit high
memory stall times unaffected by larger caches. cjpeg-np
and djpeg-np do not see any variation in performance with
larger caches because of their negligible memory compo-
nents. These applications implement a blocked pipeline al-
gorithm that performs all the computation phases on 8x8-
sized blocks at a time, reducing the bandwidth requirements
and increasing the computation per miss. The cjpeg and
djpeg applications differ from their non-progressive coun-
terparts in the progressive coding phase where they perform
a multi-pass traversal of the buffer storing the DCT coeffi-
cients. The low computation per miss in this phase com-
bined with the reuse of the image-sized (1024x640 pixels)
buffer results in a 1.2X performance benefit from increas-
ing the cache size to 2M. Larger images would increase
the working set requiring larger caches. For example, a
1024x1024 image would require a 4M cache size. mpeg-
enc and mpeg-dec perform inter-block distance vector op-
erations and therefore need to operate on multiple image-
sized buffers (as opposed to block-sized buffers in cjpeg-np
and djpeg-np). The reuse of these 352x240 buffers across
the frames in the video leads to 1.1X performance bene-
fits with 512K (mpeg-dec) and 1M (mpeg-enc) cache sizes.
Larger image sizes would require larger caches; for exam-
ple, a 1024x1024 image would require almost a 12X in-
crease in cache size.
Impact of varying L1 caches. We also performed ex-
periments varying the size of the L1 cache from 1K to
64K while keeping the L2 cache fixed at 128K. Our results
showed that the L1 cache size had no impact on five of the
image processing kernels. On the remaining benchmarks,
a 64K L1 configuration outperforms the 1K L1 configura-
tion by factors of 1.1X to 1.3X; 4K-16K L1 caches achieve
within 3% of the performance of the 64K L1 cache con-
figuration. Small data structures other than the main data,
such as tables for convolution, quantization, color conver-
sion, and saturation clipping, are responsible for these small
first-level working sets. At 64K L1 caches, memory stall
time is mainly due to L1 hits (mainly related to MSHR con-
tention) or to L2 misses.
4.2 Impact of Software Prefetching
Figure 3 summarizes the execution time reductions from
software prefetching relative to the base system with VIS
(with 64K L1 and 128K L2 caches). We do not report
results for cjpeg-np, djpeg-np, and mpeg-enc since these
benchmarks spend less than 6% of their total time on L1
cache misses. Our results show that software prefetching
achieves high performance improvements for the six image
processing benchmarks (an average of 1.9X and a range of
1.4X to 2.5X). The cjpeg, djpeg, and mpeg-dec benchmarks
exhibit relatively small performance improvements. Over-
all, after applying software prefetching, all our benchmarks
revert to being compute bound.
For the image processing kernels, a significant fraction
of the prefetches are useful in completely or partially hid-
ing the latency of the cache miss with computation or with
other misses. The addition of software prefetching also in-
creases the utilization of cache MSHRs; in many of the im-
age processing kernels, more than 5 MSHRs are used for a
large fraction of the time. The remaining memory stall time
is mainly due to late prefetches and resource contention.
Late prefetches (prefetches that arrive after the demand ac-
cess) arise mainly because of inadequate computation in the
loop bodies to overlap the miss latencies. Contention for
resources occurs when multiple prefetches are outstanding
at a time. These effects are similar to those discussed in
previous studies with scientific applications for ILP-based
processors [20].
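The late-prefetch problem follows directly from the standard prefetch-distance calculation used by such algorithms (a sketch under our naming; the cycle counts in the usage example are hypothetical):

```python
from math import ceil

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Number of loop iterations ahead of the demand access at which a
    software prefetch must be issued so the line arrives in time.
    Short media loop bodies (small cycles_per_iteration) force large
    distances; when the loop cannot cover the latency, prefetches
    arrive late and the demand access still stalls."""
    return ceil(miss_latency_cycles / cycles_per_iteration)
```

For instance, a 100-cycle miss latency over an 8-cycle loop body requires prefetching 13 iterations ahead, whereas a 50-cycle body needs only 2; the image processing kernels sit near the first case.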
[Figure 3. Effect of software-inserted prefetching. Bars show execution
time normalized to the VIS system without prefetching (divided into Busy,
FU stall, L1 miss, and L1 hit components): addition 56.3, blend 53.2,
conv 72.3, dotprod 40.6, scaling 44.5, thresh 42.8, cjpeg 98.1, djpeg 98.1,
mpeg-dec 95.0.]

The other benchmarks (cjpeg, djpeg, and mpeg-dec) see
lower performance benefits primarily because the fraction
of memory stall time is relatively low and includes an L1
hit component (mainly due to MSHR contention). Soft-
ware prefetches do not address the L1 component. Sec-
ond, in cjpeg and djpeg, the prefetches are to memory lo-
cations that are indirectly addressed (of the form A[B[i]]).
Consequently, the prefetching algorithm is unable to dis-
tinguish between hits and misses and is constrained to is-
sue prefetches for all accesses. The resulting overhead due
to address calculation and cache contention limits perfor-
mance (seen as increased Busy and FU stall components^7).
Finally, as before, late prefetches and resource contention
also contribute to the lower benefits from prefetching.
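The indirect-access problem can be sketched as follows (illustrative names; a 64-byte line holding 64 one-byte pixels, as in Section 3.1): for a streaming access A[i] one prefetch per cache line suffices, but for A[B[i]] the target line is unknown until B[i] is loaded, so a prefetch and its address calculation must be issued on every iteration.

```python
def prefetch_slots_direct(n, line_pixels=64):
    """For a streaming access A[i], consecutive pixels share a cache
    line, so one prefetch every line_pixels iterations suffices."""
    return [i for i in range(n) if i % line_pixels == 0]

def prefetch_slots_indirect(n):
    """For A[B[i]], the line is only known after B[i] is loaded, so a
    prefetch (plus extra address arithmetic) is issued every iteration,
    whether the eventual access hits or misses."""
    return list(range(n))
```

The 64X difference in issued prefetches is the overhead that shows up as the increased Busy and FU stall components noted above.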
5 Related Work
Most of the papers discussing instruction-set extensions
for multimedia processing have focused on detailed descrip-
tions of the additional instructions and examples of their
use [4, 9, 12, 15, 18, 19, 23]. The performance character-
ization in these papers is usually limited to a few sample
code segments and/or a brief mention of the benefits an-
ticipated on larger applications. Eyre studies the applica-
bility of general-purpose processors for DSP applications;
however, the study only reports high-level metrics such as
MIPS, power efficiency, and cost [7].
Daniel Rice presents a detailed description of VIS and 8
image processing applications without and with VIS [22].
The study reports speedups of 2.7X to 10.5X on an actual
UltraSPARC-based system, but does not analyze the cause
of performance benefits, the remaining bottlenecks, or the
impact of alternative architectures.
Yang et al. look at the benefits of packed floating-point
formats and instructions for graphics but assume a per-
fect memory system [24]. Bhargava et al. study some
MMX-enhanced applications on Pentium-based sys-
tems; but again, no detailed characterization of performance
bottlenecks or the impact of other architectures is done [2].
^7 Some of the image processing kernels see a reduction in the CPU com-
ponent because of the reduction in instructions and better scheduling when
loops are unrolled for the prefetching algorithm [14].

Zucker et al. study MPEG video decode applications and
show the benefits from I/O prefetching, software restruc-
turing to use SIMD without hardware support, and profile-
driven software prefetching [25, 26]. However, the studies
assume a simplistic processor model with blocking loads
and do not study the effect of media ISA extensions. Bilas
et al. develop two parallel versions of the MPEG decoder
and present results for multiprocessor speedup, memory re-
quirements, load balance, synchronization, and locality [3].
Similar to our results, they also find that the miss rates for
352x240 images on “realistic” cache sizes are negligible.
6 Conclusions
Media processing is a workload of increasing importance
for desktop processors. This paper focuses on image and
video processing, an important class of media processing,
and aims to provide a quantitative understanding of the per-
formance of these workloads on general-purpose proces-
sors. We use detailed simulation to study 12 representa-
tive benchmarks on a variety of architectural configurations,
both with and without the use of Sun’s visual instruction set
(VIS) media ISA extensions.
Our results show that conventional techniques in current
processors to enhance ILP (multiple issue and out-of-order
issue) provide a factor of 2.3X to 4.2X performance im-
provement for the image and video benchmarks. The Sun
VIS media ISA extensions provide an additional 1.1X to
4.2X performance improvement. The benefits from VIS are
achieved with a much smaller increase in the die area com-
pared to the ILP features.
Our detailed analysis indicates the sources and limita-
tions of the performance benefits due to VIS. VIS is very
effective in exploiting SIMD parallelism using packed data
types, and can eliminate a number of potentially hard-to-
predict branches using instructions targeted towards satu-
ration arithmetic, boundary detection, and partial writes.
Special-purpose instructions such as pdist achieve high ben-
efits on the targeted application, but are too specialized to
use in other cases. Routines that are sequential and operate
on variable data types, VIS instruction overhead, cache miss
stall times, and the fixed parallelism in the packed arith-
metic instructions limit the benefits on the benchmarks.
On our base single-issue in-order processor, all the
benchmarks are primarily compute-bound. Conventional
ILP features and the VIS instructions together significantly
reduce the CPU component of execution time, making 5
of our image processing benchmarks memory-bound. The
memory behavior of these workloads is characterized by
large working sets and streaming data accesses. Increas-
ing the cache size has no impact on the image processing
kernels and the non-progressive JPEG applications. This is
particularly interesting considering current trends towards
large on-chip and off-chip caches. The remaining bench-
marks require relatively large cache sizes (dependent on the
display sizes) to exploit data reuse, but derive less than 1.2X
performance benefits with the larger caches. Software-
inserted prefetching achieves 1.4X to 2.5X performance
benefits on the image processing kernels where memory
stall time is significant.
With the addition of software prefetching, all our bench-
marks revert to being compute-bound. Architectural opti-
mizations that improve computation time (e.g., multipro-
cessing) may be useful to exploit greater parallelism. Such
efforts are likely to expose the memory system bottleneck
yet again, possibly requiring additional novel memory sys-
tem techniques beyond conventional software prefetching.
In the future, we plan to explore new architectural tech-
niques for general-purpose processors to support media pro-
cessing workloads. We also plan to expand our study to in-
clude other media processing applications such as speech,
audio, communication, and natural language interaction.
7 Acknowledgments
We would like to thank Behnaam Aazhang, Mohit Aron,
Rich Baraniuk, Joe Cavallaro, Tim Dorney, Aria Nostra-
tinia, and Jan Odegard for numerous discussions on the be-
havior of media processing workloads. We would also like
to thank Partha Tirumalai, Ahmad Zandi and Tony Zhang
from Sun for useful pointers on enhancing the applications
with VIS. We also thank Vijay Pai, Barton Sano, Chaitali
Sengupta, and the anonymous reviewers for their valuable
comments on earlier drafts of the paper.
References
[1] D. Bhandarkar and J. Ding. Performance characterization of
the Pentium Pro processor. In HPCA-3, pages 288–297, Feb
1997.
[2] R. Bhargava et al. Evaluating MMX Technology Using DSP
and Multimedia Applications. In MICRO-31, Dec 1998.
[3] A. Bilas et al. Real-time Parallel MPEG-2 Decoding in Soft-
ware. In IPPS-11, April 1997.
[4] D. A. Carlson et al. Multimedia Extensions for a 550MHz
RISC Microprocessor. In IEEE Journal of Solid-State Cir-
cuits, 1997.
[5] T. M. Conte et al. Challenges to Combining General-
Purpose and Multimedia Processors. In IEEE Computer,
pages 33–37, Dec 1997.
[6] K. Diefendorff and P. K. Dubey. How Multimedia Work-
loads Will Change Processor Design. In IEEE Micro, pages
43–45, Sep 1997.
[7] J. Eyre. Assessing General-Purpose Processors for DSP Ap-
plications. Berkeley Design Technology Inc. presentation,
1998.
[8] International Organisation for Standardisation, ISO/IEC
JTC1/SC29/WG11 MPEG 98/N2457. MPEG-4 Applications
Document, 1998.
[9] E. Killian. MIPS Extension for Digital Media with 3D.
Slides presented at Microprocessor Forum, October 1996.
[10] L. Kohn et al. The Visual Instruction Set (VIS) in Ultra-
SPARC. In COMPCON Digest of Papers, March 1995.
[11] C. Lee et al. MediaBench: A Tool for Evaluating and
Synthesizing Multimedia and Communications Systems. In
MICRO-30, 1997.
[12] R. B. Lee. Subword Parallelism with MAX-2. In IEEE
Micro, volume 16(4), pages 51–59, August 1996.
[13] R. B. Lee and M. D. Smith. Media Processing: A New De-
sign Target. In IEEE MICRO, pages 6–9, Aug 1996.
[14] T. Mowry. Tolerating Latency through Software-controlled
data prefetching. PhD thesis, Stanford University, 1994.
[15] S. Oberman et al. AMD 3DNow! Technology and the K6-2
Microprocessor. In HOTCHIPS10, 1998.
[16] V. S. Pai et al. RSIM: A Simulator for Shared-Memory Mul-
tiprocessor and Uniprocessor Systems that Exploit ILP. In
Proc. 3rd Workshop on Computer Architecture Education,
1997.
[17] V. S. Pai et al. The Impact of Instruction Level Parallelism
on Multiprocessor Performance and Simulation Methodol-
ogy. In HPCA-3, pages 72–83, 1997.
[18] A. Peleg and U. Weiser. MMX Technology Extension to
the Intel Architecture. In IEEE Micro, volume 16(4), pages
51–59, Aug 1996.
[19] M. Phillip et al. AltiVec Technology: Accelerating Media
Processing Across the Spectrum. In HOTCHIPS10, Aug
1998.
[20] P. Ranganathan et al. The Interaction of Software Prefetch-
ing with ILP Processors in Shared-Memory Systems. In
ISCA24, pages 144–156, 1997.
[21] P. Ranganathan et al. Performance of Database Workloads
on Shared-Memory Systems with Out-of-Order Processors.
In ASPLOS8, pages 307–318, 1998.
[22] D. S. Rice. High-Performance Image Processing Using
Special-Purpose CPU Instructions: The UltraSPARC Visual
Instruction Set. Master’s thesis, Stanford University, 1996.
[23] M. Tremblay et al. VIS Speeds New Media Processing. In
IEEE Micro, volume 16(4), pages 51–59, Aug 1996.
[24] C.-L. Yang et al. Exploiting Instruction-Level Parallelism in
Geometry Processing for Three Dimensional Graphics Ap-
plications. In Micro31, 1998.
[25] D. F. Zucker. Architecture and Arithmetic for Multimedia
Enhanced Processors. PhD thesis, Department of Electrical
Engineering, Stanford University, June 1997.
[26] D. F. Zucker et al. An Automated Method for Software Con-
trolled Cache Prefetching. In Proc. of the 31st Hawaii Intl.
Conf. on System Sciences, Jan 1998.
... A camera system in a car that is detecting obstacles may already be too complex to give hard timing guarantees. This depends on the performance of the computing unit and the applied detection algorithm which performs differently based on the complexity of the scene to be analyzed [6]. ...
Article
Time-triggered systems provide dependable and deterministic communication based on strict time boundaries for tasks and messages. On the other hand, event-triggered communication is less strict on requirements and more flexible, but it is difficult to provide fault isolation. Thus, the combination of time-triggered and event-triggered communication is desirably extending a time-triggered communication system by a dynamic property. This paper improves a novel mixed-triggered communication approach based on ESBs which adds limited delay tolerance to an otherwise strict real-time communication. The presented approach overcomes the problem of delayed respectively missing messages by adding fail-operational behavior to the ESB-based communication. This is achieved by extending the basic mechanism using additional fault-tolerant features. The presented mechanism allows the system to stay operable even when messages occasionally violate their planned slot boundaries and provides fault isolation for timing violations beyond a predefined tolerance budget. The introduced fault-tolerant features are analyzed via Monte-Carlo simulations where the resulting data throughput is compared to two straightforward hard real-time communication approaches. As our results show, the proposed guards enable the system to handle extended message reception delays while, compared to strictly bounded communication, having a better performance of successfully transmitted messages.
... Horta et al. in 2012 [51] incorporate dynamic reconfiguration with plug-ins and implement on-chip system programmable with structured interconnections for dynamic reconfiguration. The performance of image and video processing was evaluated by [52] using general purpose processor and ISA extension. They also made analysis with ISA extension in both in-order and out-of-order processors resulting in significant reduction in overall execution time. ...
Article
Full-text available
The prospective need of SIMD (Single Instruction and Multiple Data) applications like video and image processing systems requires a processor with greater flexibility and computation to deliver high quality real time output. The main goal of work is to offer a wider survey over high performance processor through various reconfigurable techniques targeting on real time SIMD dataset. The real-time multimedia streaming with extensive parallel data demands high computational processors with low power consumption. The modern processors with flexible computation supports on-the-fly system redesign with reconfiguration techniques at reduced cost. It also accommodates new functionalities sustaining the increasing demand for processor with less NREs (Non-Recurring Engineering) cost and shorter time-to-market. The processors with flexible platform permit the designer to incorporate the demanding changes in the existing standards of any application like wireless communication, telecommunication etc. Adaptive computing is the new processor paradigm evolved in early 90s to bridge the space which exists between the generic processor and application specific processor. It also supports the inclusion of specific hardware accelerator in the existing RISC (Reduced Instruction Set Computer) based architecture to improve the performance of particular application. In the last decade, processor based embedded systems like SoC (System-on-Chip) have become more prevalent due to their high performance at low power. Many embedded application designers like mobile and smart phone developers namely Philips N-Experia, Intel PXA etc., explored SoC as computational devices. Thus this work summarizes a literature survey of all the related works such as RC (Reconfigurable Computing) , HPRC (High Performance Reconfigurable Computing), FPGA based processors, open source soft-core processor, reconfigurable architecture, parallel computing and SIMD. 
A detailed literature survey analyzing the impact of current techniques on processor performance is presented.
... For instance, the SIMD implementations of the MPEG/JPEG codecs using the VIS ISA require on average 41% overhead instructions such as packing/unpacking and data re-shuffling. The execution of this large number of the SIMD overhead instructions decreases the performance and increases pressure on the fetch and decode steps [80]. ...
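The packing/unpacking overhead described in this snippet can be illustrated with a minimal scalar sketch in plain C (the function names here are illustrative, not taken from the VIS codecs): adding two rows of 8-bit pixels requires widening each element to a 16-bit intermediate and then re-packing with saturation, which is exactly the extra work that packed SIMD instructions either provide in one operation or impose as separate overhead instructions.

```c
#include <stdint.h>

/* Saturating pack: clamp a 16-bit intermediate result into an 8-bit pixel.
 * Media ISA extensions such as VIS provide this as a single packed
 * instruction; in scalar code each element costs explicit compare/clamp
 * work -- the kind of "overhead instruction" the text above refers to. */
static uint8_t pack_sat_u8(int16_t x) {
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Add two 8-bit pixel rows through 16-bit intermediates, then pack back. */
void add_rows_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, int n) {
    for (int i = 0; i < n; i++) {
        int16_t sum = (int16_t)a[i] + (int16_t)b[i]; /* "unpack" to 16 bits */
        out[i] = pack_sat_u8(sum);                   /* "pack" with saturation */
    }
}
```

The unpack/pack pair around the single useful add is the reason codec kernels can spend a large share of their dynamic instruction count on data re-shuffling rather than arithmetic.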
... Ranganathan et al. [18]. A comprehensive analysis of the performance and optimisation of an image processing algorithm, Scale Invariant Feature Transform (SIFT) [19], in a multi-core system was done by Zhang et al. [20]. ...
Thesis
Full-text available
Obstacle detection is an important feature in autonomous and semi-autonomous mobile robots and demands high performance from the underlying hardware to help navigate a moving object on a given path in real time. In this thesis we have optimised the computational performance of an image processing pipeline that can be used for obstacle detection. The image processing pipeline performs edge detection, removes straight-line edges from a tiled floor, and highlights obstacle boundaries on a given input image. Various optimisation strategies were applied to enable the image processing pipeline to perform optimally on diverse hardware such as CPUs, NVIDIA GPUs, and embedded boards. Further, the efficacy of expressing the image processing pipeline through programming abstractions such as Halide and Thrust2D was evaluated. Our experiments suggest that there was minimal performance degradation when the image processing pipeline was written using Thrust2D compared to a native implementation of the same pipeline on the GPU.
... They also analyzed the ISA extensions on both in-order and out-of-order processors, showing a significant reduction in overall execution time. The combination of multiple issue and out-of-order issue increases overall performance by more than 3 times [9]. To operate on extremely long SIMD data streams, reconfigurable computing for a soft-core processor with out-of-order execution, implemented in FPGA hardware, has been proposed. ...
Article
Objective: The prospective need of SIMD (Single Instruction, Multiple Data) applications such as video and image processing in a single system requires greater flexibility in computation to deliver high-quality real-time data. This paper analyzes an FPGA (Field Programmable Gate Array) based high-performance Reconfigurable OpenRISC1200 (ROR) soft-core processor for SIMD. Methods: The ROR1200 ensures performance improvement through data-level parallelism, executing SIMD instructions simultaneously in HPRC (High-Performance Reconfigurable Computing) at reduced resource utilization through an RRF (Reconfigurable Register File) with multiple core functionalities. This work analyzes the functionality of the reconfigurable architecture by illustrating the implementation of two different image processing operations: image convolution and image quality improvement. The MAC (Multiply-Accumulate) unit of the ROR1200 is used to perform image convolution, and the execution unit with HPRC is used for image quality improvement. Result: With parallel execution across multiple cores, the proposed processor improves image quality while doubling the frame rate up to 60 fps (frames per second) with a peak power consumption of 400 mW. The processor achieves a computational cost of 12 ms at a refresh rate of 60 Hz, with a MAC critical-path delay of 1.29 ns. Conclusion: This FPGA-based processor is a feasible solution for portable embedded SIMD-based applications that need high performance at reduced power consumption.
... In addition, the architecture often defines a new register set. Several papers study the performance of multimedia applications with different SIMD instruction set extensions [5], [9], [10], [19], [22], [23] and show that these SIMD extensions provide a performance improvement. ...
Article
Full-text available
Media processing has become one of the dominant computing workloads. In this context, SIMD instructions have been introduced in current processors to raise performance, often the main goal of microprocessor designers. Today, however, designers have become concerned with power consumption, and in some cases low power is the main design goal (e.g., in laptops). In this paper, we show that SIMD ISA extensions on a superscalar processor can be one solution for reducing power consumption while keeping a high performance level. We reduce the average power consumption by decreasing the number of instructions and the number of cache references, and by using dynamic power management to transform the performance speedup into a power consumption reduction.
Article
Full-text available
In the past, a patient went to the room where an ultrasound imaging device was installed and was examined there by a doctor. Today, however, a doctor can visit and examine a patient in any room using a handheld ultrasound device. Such handheld devices, though, implement only basic functions and cannot meet the high performance required by the ultrasound beam focusing algorithm, which determines the quality of the ultrasound image. In addition, low energy consumption must be satisfied for a mobile ultrasound device. To satisfy these requirements, this paper proposes a high-performance, low-power single instruction, multiple data (SIMD) based multi-core processor that supports a representative beamforming algorithm among the several focusing methods for mobile ultrasound image signals. The proposed SIMD multi-core processor, which consists of 16 processing elements (PEs), satisfies the high performance required by the beamforming algorithm by exploiting the considerable data-level parallelism inherent in ultrasound echo image data. Experimental results showed that the proposed multi-core processor outperforms a commercial high-performance processor, the TI DSP C6416, in terms of execution time (15.8 times better), energy efficiency (6.9 times better), and area efficiency (10 times better).
Conference Paper
Advanced Earth observation technologies now produce a greater variety of huge datasets. To derive information from such datasets in a timely manner, remote sensing scientists need a better and more powerful computing and storage platform. A cloud computing platform can be a good option, since it provides the required computing power at the lowest cost on a pay-as-you-go basis. To determine which existing platform is suitable for the complex analysis of huge remote sensing data, we present a comparative study of the most commonly used cloud platforms: Amazon, Microsoft, and CloudSigma. Based on the limiting factors of the satellite image-processing task, we considered flexibility, scalability, management, and pricing. Flexibility means how resilient the hardware architecture is; scalability describes how well the application can utilize available computing resources to continue functioning; management looks at the availability of dashboards and control panels to manage cloud resources; and pricing includes the cost of development and the running cost of the service on top of the cloud platform. The comparison showed that Amazon Web Services surpassed all competitors, especially in big-data processing and attainable scalability options.
Article
Application-specific instruction processor (ASIP) chips give the high performance and low power levels of application-specific integrated circuit (ASIC) chips and the flexibility of digital signal processor (DSP) chips. A video codec system-on-chip (SoC) was designed based on an ASIP architecture, with the encode and decode parts of the video codec SoC sharing the same media signal processing architecture. The encode and decode parts each consist of an 8-issue very long instruction word (VLIW) DSP core and user-defined hardware for data transfer and variable-length coding. Simulations using a 0.13 μm process show that the chip can accomplish 15 f/s QCIF H.263 baseline encoding at 17 MHz, and 15 f/s QCIF H.263 baseline decoding at 10 MHz.
Conference Paper
Full-text available
In this paper, we characterize the performance of several business and technical benchmarks on a Pentium® Pro processor based system. Various architectural data are collected using a performance-monitoring counter tool. Results show that the Pentium Pro processor achieves significantly lower cycles per instruction than the Pentium processor due to its out-of-order and speculative execution and its non-blocking cache and memory system. Its higher clock frequency also contributes to even higher performance.
Article
Full-text available
MAX-2 illustrates how a small set of instruction extensions can provide subword parallelism to accelerate media processing and other data-parallel programs. This article proposes that subword parallelism, parallel computation on lower-precision data packed into a word, is an efficient and effective solution for accelerating media processing. As an example, it describes MAX-2, a very lean, RISC-like set of media acceleration primitives included in the 64-bit PA-RISC 2.0 architecture. Because MAX-2 strives to be a minimal set of instructions, the article discusses both the instructions included and those excluded. Several examples illustrate the use of MAX-2 instructions, which provide subword parallelism in a word-oriented general-purpose processor at essentially no incremental cost.
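The subword-parallelism idea behind MAX-2 can be sketched even without any special instructions, using the classic SWAR (SIMD-within-a-register) masking trick in plain C. This is not MAX-2 itself, just an illustration of the concept: four 8-bit lanes are added inside one 32-bit word, with masks that stop carries from crossing lane boundaries.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes in parallel inside a 32-bit word, with
 * per-lane wraparound (modulo-256) semantics. The mask H holds the high
 * bit of each byte lane: the low 7 bits of every lane are added directly,
 * and the lane's high bit is recombined via XOR so that no carry can
 * propagate from one lane into the next. */
uint32_t swar_add_u8x4(uint32_t a, uint32_t b) {
    const uint32_t H = 0x80808080u; /* high bit of each 8-bit lane */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
```

Hardware subword instructions such as those in MAX-2 or VIS perform this lane isolation implicitly (and add variants like saturating arithmetic), which is why they cost essentially no extra datapath over a word-wide adder.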
Article
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering workloads. Given the radically different behavior of database workloads (especially OLTP), it is important to re-evaluate key system design decisions in the context of this important class of applications. This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance improvements. Our study is based on detailed simulations of the Oracle commercial database engine. The results show that the combination of out-of-order execution and multiple instruction issue is indeed effective in improving performance of database workloads, providing gains of 1.5 and 2.6 times over an in-order single-issue processor for OLTP and DSS, respectively. In addition, speculative techniques enable optimized implementations of memory consistency models that significantly improve the performance of stricter consistency models, bringing the performance to within 10--15% of the performance of more relaxed models. The second part of our study focuses on the more challenging OLTP workload. We show that an instruction stream buffer is effective in reducing the remaining instruction stalls in OLTP, providing a 17% reduction in execution time (approaching a perfect instruction cache to within 15%). Furthermore, our characterization shows that a large fraction of the data communication misses in OLTP exhibit migratory behavior; our preliminary results show that software prefetch and writeback/flush hints can be used for this data to further reduce execution time by 12%.
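Software prefetching of the kind mentioned in the abstract above (and in the cited paper's streaming image benchmarks) can be sketched in C with the GCC/Clang builtin `__builtin_prefetch`. This is a compiler builtin, not the specific mechanism used in either study; the prefetch distance below is an illustrative, tunable assumption.

```c
#include <stddef.h>

/* Sum a large array while issuing software prefetches a fixed distance
 * ahead of the current access, so memory latency overlaps with compute.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin and a
 * pure hint: removing it changes performance, never the result. */
long prefetched_sum(const int *data, size_t n) {
    const size_t DIST = 16; /* prefetch distance in elements (tunable) */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&data[i + DIST], 0 /* read */, 0 /* streaming */);
        sum += data[i];
    }
    return sum;
}
```

Because streaming accesses have no reuse, the low-locality hint (last argument 0) tells the hardware not to keep the prefetched lines in cache longer than necessary.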
Article
The evolution and refinement of media-processing hardware has just begun. As programmable processors or coprocessors with media-processing enhancements gradually replace fixed-function, special-purpose devices, compiler support for these features will also improve. Today, an application developer who organizes program and data structures to exploit media-processing hardware achieves the best performance. Eventually, language extensions will probably emerge to support improved programmer efficiency without loss of application performance. Media processing, with its almost limitless appetite for computational power, provides an exciting new target for hardware and software design innovation.