To appear in the Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
Performance of Image and Video Processing with
General-Purpose Processors and Media ISA Extensions
Parthasarathy Ranganathan, Sarita Adve, and Norman P. Jouppi†
Electrical and Computer Engineering          †Western Research Laboratory
Rice University                              Compaq Computer Corporation
{parthas,sarita}@rice.edu                    jouppi@pa.dec.com
Abstract
This paper aims to provide a quantitative understanding
of the performance of image and video processing applica-
tions on general-purpose processors, without and with me-
dia ISA extensions. We use detailed simulation of 12 bench-
marks to study the effectiveness of current architectural fea-
tures and identify future challenges for these workloads.
Our results show that conventional techniques in current
processors to enhance instruction-level parallelism (ILP)
provide a factor of 2.3X to 4.2X performance improve-
ment. The Sun VIS media ISA extensions provide an ad-
ditional 1.1X to 4.2X performance improvement. The ILP
features and media ISA extensions significantly reduce the
CPU component of execution time, making 5 of the image
processing benchmarks memory-bound.
The memory behavior of our benchmarks is character-
ized by large working sets and streaming data accesses. In-
creasing the cache size has no impact on 8 of the bench-
marks. The remaining benchmarks require relatively large
cache sizes (dependent on the display sizes) to exploit data
reuse, but derive less than 1.2X performance benefits with
the larger caches. Software prefetching provides 1.4X to
2.5X performance improvement in the image processing
benchmarks where memory is a significant problem. With
the addition of software prefetching, all our benchmarks re-
vert to being compute-bound.
1 Introduction
In the near future, media processing is expected to be-
come one of the dominant computing workloads [6, 13].
This work is supported in part by an IBM Partnership award, Intel Cor-
poration, the National Science Foundation under Grant No. CCR-9410457,
CCR-9502500, CDA-9502791, and CDA-9617383, and the Texas Ad-
vanced Technology Program under Grant No. 003604-025. Sarita Adve
is also supported by an Alfred P. Sloan Research Fellowship.
Media processing refers to the computing required for the
creation, encoding/decoding, processing, display, and com-
munication of digital multimedia information such as im-
ages, audio, video, and graphics. The last few years
have seen significant advances in this area, but the true
promise of media processing will be seen only when ap-
plications such as collaborative teleconferencing, distance
learning, and high-quality media-rich content channels ap-
pear in ubiquitously available commodity systems. Fur-
ther out, advanced human-computer interfaces, telepres-
ence, and immersive and interactive virtual environments
hold even greater promise.
One obstacle in achieving this promise is the high com-
putational demands imposed by these applications. These
requirements arise from the computationally expensive na-
ture of the algorithms, the stringent real-time constraints,
and the need to run many such tightly synchronized appli-
cations at the same time on the same system. For exam-
ple, a video teleconferencing system may need to run video
processing including encoding/decoding, audio processing,
and a software modem simultaneously. As a result, such
applications currently display images of only a few square
inches at a few frames per second when running on general-
purpose processors. Full-screen images at 20-30 frames per
second could require more than two orders of magnitude
more performance.
To meet the high computational requirements of emerg-
ing media applications, current systems use a combination
of general-purpose processors accelerated with DSP (or me-
dia) processors and ASICs performing specialized compu-
tations. However, benefits offered by general-purpose pro-
cessors in terms of ease of programming, higher perfor-
mance growth, easier upgrade paths between generations,
and cost considerations argue for increasing use of general-
purpose processors for media processing applications [6,
13]. The most visible evidence of this trend has been
the SIMD-style media instruction-set architecture (ISA) ex-
tensions announced for most high-performance general-
purpose processors (e.g., 3DNow! [15], AltiVec [19],
MAX [12], MDMX and MIPSV [9], MMX [18], MVI [4],
VIS [23]).
Unfortunately, in spite of the large amount of recent at-
tention given to media processing [5, 6, 13], there is very
little quantitative understanding of the performance of such
applications on general-purpose systems. A major chal-
lenge for such studies has been the large number of ap-
plication classes in this domain (e.g., image, video, au-
dio, speech, communication, graphics, etc.), and the ab-
sence of any standardized representative benchmark sets.
Consequently, in contrast to the much-researched SPEC,
SPLASH, and (more recently) TPC benchmarks, a number
of fundamental questions still remain unanswered for me-
dia processing workloads. For example, is computation or
memory the primary bottleneck in these applications? How
effective are current architectural designs and media ISA
extensions? What are the future challenges for these work-
loads? Given the lack of understanding of such issues, it is
not surprising that the media instruction set extensions an-
nounced by different processor vendors vary widely – from
13 instructions in MVI for Alpha [4] to 162 instructions in
AltiVec for PowerPC [19].
This paper is a first step in understanding the above is-
sues to determine if and how we need to change the way we
design general-purpose systems to support media process-
ing applications. We focus on image and video workloads,
an important class of media processing workloads, and at-
tempt to cover the spectrum of the key tasks in this class.
Our benchmark suite consists of 12 kernels and applications
covering image processing, image source coding, and video
source coding. We use detailed simulation to study a va-
riety of general-purpose-processor architectural configura-
tions, both with and without the use of Sun’s visual instruc-
tion set (VIS) media ISA extensions. VIS shares a number
of fundamental similarities with the media ISA extensions
proposed for other processors, and is representative of the
benefits and limitations of current media ISA extensions.
We start with a base single-issue in-order processor. In
this system, all the benchmarks are primarily compute-
bound. We find that conventional techniques in current
processors to enhance instruction-level parallelism or ILP
(multiple issue and out-of-order issue) provide a factor of
2.3X to 4.2X performance improvement for the benchmarks
studied. The VIS media ISA extensions provide an addi-
tional 1.1X to 4.2X performance improvement. Our de-
tailed analysis indicates the sources and limitations of the
performance benefits due to VIS. The conventional ILP
techniques and the VIS extensions together significantly re-
duce the CPU component of execution time, making five of
the image processing benchmarks memory-bound.
The memory behavior of these workloads is character-
ized by large working sets and streaming data accesses. In-
creasing the cache size has no impact on 8 of the bench-
marks. The remaining benchmarks reuse data, but require
relatively large cache sizes (dependent on the display sizes)
to exploit the reuse and derive a performance benefit of less than
1.2X. Software-inserted prefetching provides 1.4X to 2.5X
performance improvement in the image processing bench-
marks where memory stall time is significant. With the ad-
dition of software prefetching, all of our benchmarks revert
to being compute-bound.
The rest of the paper is organized as follows. Section 2
describes our workloads, the architectures modeled, and the
simulation methodology. Section 3 presents our results on
the impact of ILP features and VIS media extensions. Sec-
tion 4 studies the performance of the cache system and the
impact of software prefetching. Section 5 discusses related
work. Section 6 concludes the paper.
2 Methodology
2.1 Workloads
We attempt to cover the spectrum of key tasks in im-
age and video processing workloads. The kernels and ap-
plications in our benchmark suite form significant compo-
nents of many current and future real-world workloads such
as collaborative teleconferencing, scene-visualization, dis-
tance learning, streaming video across the internet, digi-
tal broadcasting, real-time flight imaging and radar sens-
ing, content-based storage and retrieval, online video cata-
loging, and medical tomography [8]. Future standards such
as JPEG2000 and MPEG4 are likely to build on a number
of components of our benchmark suite.
Table 1 summarizes the 12 benchmarks that we use
in this paper, and is divided into image processing (Sec-
tion 2.1.1), image source coding (Section 2.1.2), and video
source coding (Section 2.1.3). These benchmarks are simi-
lar to some of the benchmarks used in the image and video
parts of the Intel Media Benchmark (described at the Intel
web site) and the UCLA MediaBench [11].¹
All the image benchmarks were run with 1024x640 pixel
3-band (i.e., channel) input images obtained from the Intel
Media Benchmark. The video benchmarks were run with
the mei16v2 test bit stream from the MPEG Software Sim-
ulation Group that operates on 352x240 sized 3-band im-
ages. We did not study larger (full-screen) sizes because
they were not readily available and would have required im-
practical simulation time.
¹We did not use the Intel Media Benchmark or the UCLA MediaBench
directly because the former does not provide source code and the latter
does not include image processing applications.
Image processing
Addition Addition of two images (sf16.ppm,rose16.ppm) using mean of two pixel values
Blend Alpha blending of two images (sf16.ppm, rose16.ppm) with another alpha image (winter16.ppm); the operation
performed is dst = alpha × src1 + (255 − alpha) × src2.
Conv General 3x3 image convolution of an image (sf16.ppm). The operation performed includes a saturation sum-
mation of 9 product terms. Each term corresponds to multiplying the pixel values in a moving 3x3 window
across the image dimensions with the values of a 3x3 kernel matrix.
Dotprod 16x16 dot product of a randomly-initialized 1048576-element linear array
Scaling Linear image scaling of an image (sf16.ppm)
Thresh Double-limit thresholding of an image (sf16.ppm). If the pixel band value falls within the low and high values
for that band, the destination is set to the map value for that band; otherwise, the destination is set to be the
same as the source pixel value.
Image source coding
Cjpeg JPEG progressive encoding (rose16.ppm)
Djpeg JPEG progressive decoding (rose16.jpg)
Cjpeg-np JPEG non-progressive encoding (rose16.ppm)
Djpeg-np JPEG non-progressive decoding (rose16.jpg)
Video source coding
Mpeg-enc MPEG2 encoding of 4 frames (I-B-B-P frames) of the mei16v2rec bit stream. Properties of the bit stream
include a frame rate of 30 fps and a bit rate of 5 Mbps at the Main Profile@Main Level configuration. The image is
352x240 pixels in the 4:2:0 YUV chroma format, and is scaled to a 704x480 display. The quantization tables
and the motion estimation search parameters are set to the default parameters specified by the MPEG group.
Mpeg-dec MPEG2 decoding of the mei16v2rec video bit stream into separate YUV components.
Table 1. Summary of the benchmarks used in this study.
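For concreteness, the Blend entry in Table 1 can be sketched as scalar C for one band. The renormalizing division by 255 is our assumption to keep the result in the 8-bit range; the table gives only the unscaled expression.

```c
#include <stdint.h>

/* One band of the Blend kernel: dst = alpha*src1 + (255-alpha)*src2,
   renormalized by 255 (the divisor is our assumption; Table 1 gives
   only the unscaled expression). */
uint8_t blend_pixel(uint8_t alpha, uint8_t s1, uint8_t s2)
{
    return (uint8_t)((alpha * s1 + (255 - alpha) * s2) / 255);
}
```

With alpha = 0 the destination is src2; with alpha = 255 it is src1, matching the usual compositing convention.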
2.1.1 Image Processing
Our image processing benchmarks are taken from the Sun
VIS Software Development Kit (VSDK), which includes 14
image processing kernels. These kernels include common
image processing tasks such as one-band and three-band
(i.e., channel) alpha blending (used in image compositing),
single-limit and double-limit thresholding (used in chroma-
keying, image masking, and blue screening), and functions
such as general and separable convolution, copying, inver-
sion, addition, dot product, and scaling (used in the core
of many image processing codes like blurring, sharpening,
edge detection, embossing, etc.). We study all 14 of the
VSDK kernels, but due to space constraints, we report re-
sults for only 6 representative benchmarks (addition, blend,
conv, dotprod, scaling, and thresh).
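As an illustration, the double-limit thresholding kernel described in Table 1 can be sketched in scalar C for a single band; the function name is ours, not the VSDK's.

```c
#include <stdint.h>

/* Double-limit thresholding for one band: pixels inside [lo, hi]
   are replaced by the band's map value; others pass through. */
uint8_t thresh_pixel(uint8_t p, uint8_t lo, uint8_t hi, uint8_t map)
{
    return (p >= lo && p <= hi) ? map : p;
}
```

The per-pixel compare-and-select structure is what makes this kernel a natural target for the partitioned compare and partial store instructions discussed in Section 2.2.2.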
2.1.2 Image Source Coding
We focus on the Joint Photographic Experts Group (JPEG)
standard and study the performance of the Release 6a codec
(encoder/decoder) from the Independent JPEG Group. We
study two different commonly used codecs specified in the
standard, a progressive JPEG codec (cjpeg encoder and
djpeg decoder), and a non-progressive JPEG codec (cjpeg-
np encoder and djpeg-np decoder).
The JPEG encoding process consists of a number of
phases many of which exploit properties of the human vi-
sual system to reduce the number of bits required to spec-
ify the image. First, the color conversion and chroma-
decimation phases convert the source image from a 24-bit
RGB representation to a 12-bit 4:2:0 YUV representation.
Next, a linear DCT image transform phase con-
verts the image into the frequency domain. The quantiza-
tion phase then scales the frequency domain values by a
quantization value (either constant or variable). The zig-
zag scanning and variable-length (Huffman) coding phases
then reorder the resulting data into streams of bits and en-
code them as a stream of variable-length symbols based on
statistical analysis of the frequency of symbols.
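The quantization phase above amounts to dividing each frequency-domain coefficient by its quantizer and rounding; a minimal sketch follows. The round-half-away-from-zero policy is our simplification, a stand-in for the codec's actual descale-and-round arithmetic.

```c
/* Quantize one frequency-domain coefficient by quantizer q, rounding
   to nearest with halves away from zero (our simplified policy). */
int quantize(int coef, int q)
{
    return coef >= 0 ? (coef + q / 2) / q : -((-coef + q / 2) / q);
}

/* Dequantization during decoding simply multiplies back. */
int dequantize(int level, int q)
{
    return level * q;
}
```

Quantization is the lossy step: dequantize(quantize(coef, q), q) recovers only a multiple of q, which is why larger quantizers give smaller files at lower quality.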
Progressive image compression uses a compression al-
gorithm that performs multiple Huffman coding passes on
the image to encode it as multiple scans of increasing pic-
ture quality (leading to the perception of gradual focusing
of images seen on many web pages).
The decoding process performs the inverse of the opera-
tions for the encoding process in the reverse order to obtain
the original image from the compressed image.
2.1.3 Video Source Coding
We focus on the Motion Picture Experts Group-2 (MPEG2)
video coding standard, and study the performance of the
version 1.1 codec from the MPEG Software Simulation
Group.
The first part of the video compression process consists
of spatial compression similar to that described for JPEG
Processor speed 1 GHz
Issue width 4-way
Instruction window size 64
Memory queue size 32
Branch prediction
Bimodal agree predictor size 2K
Return-address stack size 32
Taken branches per cycle 1
Simultaneous speculated branches 16
Functional unit counts
Integer arithmetic units 2
Floating-point units 2
Address generation units 2
VIS multipliers 1
VIS adders 1
Functional unit latencies (cycles)
Default integer/address generation 1/1
Integer multiply/divide 7/12
Default floating point 4
FP moves/converts/divides 4/4/12
Default VIS 1
VIS 8-bit loads/multiply/pdist 1/3/3
Table 2. Default processor parameters.
Cache line size 64 bytes
L1 data cache size (on-chip) 64 KB
L1 data cache associativity 2-way
L1 data cache request ports 2
L1 data cache hit time 2 ns
Number of L1 MSHRs 12
L2 cache size (off-chip) 128 KB
L2 cache associativity 4-way
L2 request ports 1
L2 hit time (pipelined) 20 ns
Number of L2 MSHRs 12
Max. outstanding misses per MSHR 8
Total memory latency for L2 misses 100 ns
Memory interleaving 4-way
Table 3. Default memory system parameters.
and includes the color conversion, chroma decimation,
frequency transformation, quantization, zig-zag coding,
and run-length coding phases. Additionally, MPEG2 has
an inter-frame predictive-compression motion-estimation
phase that uses difference vectors to encode temporal redun-
dancy between macroblocks in a frame and macroblocks in
the following and preceding frames. Motion estimation is
the most compute-intensive part of mpeg-encode.
The video decompression process performs the inverse
of the various encode operations in reverse order to get the
decoded bit stream from the input compressed video. The
mei16v2 bit stream is already in the YUV format, and con-
sequently, our MPEG simulations do not go through the
color conversion phase discussed in Section 2.1.2.
2.2 Architectures Modeled
2.2.1 Processor and Memory System
We study two processor models – an in-order processor
model (similar to the Compaq Alpha 21164, Intel Pen-
tium, and Sun UltraSPARC-II processors) and an out-of-
order processor model (similar to the Compaq Alpha 21264,
HP PA8000, IBM PowerPC, Intel Pentium Pro, and MIPS
R10000 processors). Both the processor models support
non-blocking loads and stores.
For the experiments with software prefetching, the pro-
cessor models provide support for software-controlled non-
binding prefetches into the first-level cache.
The base system uses a 64KB two-way associative first-
level (L1) write-back cache and a 128KB 4-way associa-
tive second-level (L2) write-back cache. Section 4.1 dis-
cusses the impact of varying the cache sizes. All the caches
are non-blocking and allow support for multiple outstand-
ing misses. At each cache, 12 miss status holding regis-
ters (MSHRs) reserve space for outstanding cache misses
and combine a maximum of 8 multiple requests to the same
cache line.
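The MSHR policy just described can be captured in a toy model: 12 entries, each tracking one 64-byte line and combining up to 8 requests to it. This is our illustration of the bookkeeping, not RSIM code.

```c
#include <stdint.h>

#define NUM_MSHRS   12   /* per Table 3 */
#define MAX_COMBINE 8    /* max outstanding misses per MSHR */
#define LINE_BYTES  64

typedef struct { uint64_t line; int pending; } Mshr;
static Mshr mshrs[NUM_MSHRS];

/* Present a miss request for a byte address.
   Returns 0 if accepted (allocated or combined), 1 if it must stall. */
int mshr_request(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].pending > 0 && mshrs[i].line == line) {
            if (mshrs[i].pending == MAX_COMBINE)
                return 1;            /* cannot combine any further */
            mshrs[i].pending++;      /* combine with in-flight miss */
            return 0;
        }
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].pending == 0) { /* allocate a free MSHR */
            mshrs[i].line = line;
            mshrs[i].pending = 1;
            return 0;
        }
    return 1;                        /* all MSHRs occupied: stall */
}
```

A ninth request to the same line stalls, which is exactly the write-backup effect analyzed for the small-loop benchmarks in Section 3.1.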
Tables 2 and 3 summarize the parameters used for the
processor and memory subsystems. When studying the per-
formance of a 1-way issue processor, we scale the number
of functional units to 1 of each type. The functional unit
latencies were chosen based on the Alpha 21264 processor.
All functional units are fully pipelined except the floating-
point divide (non-pipelined).
2.2.2 VIS Media ISA Extensions
The VIS media ISA extensions to the SPARC V9 architec-
ture are a set of instructions targeted at accelerating media
processing [10, 23]. Both our in-order and out-of-order pro-
cessor models include support for VIS.
The VIS extensions define the packed byte, packed word
and packed double data types which allow concurrent oper-
ations on eight bytes, four words (16-bits each) or two dou-
ble words of fixed-point data in a 64-bit register. These data
types allow VIS instructions to exploit single-instruction-
multiple-data (SIMD) parallelism at the subword level.
Most of the VIS instructions operate on packed words or
packed doubles; loads, stores, and pdist instructions op-
erate on packed bytes. Many of the VIS instructions make
implicit assumptions about rounding and the number of sig-
nificant bits in the fixed-point data. Hence, their use requires
ensuring that they do not lead to incorrect outputs. We next
provide a short overview of the VIS instructions (summa-
rized in Table 4).
Packed arithmetic and logical operations
  Packed addition
  Packed subtraction
  Packed multiplication
  Logical operations
Subword rearrangement and realignment
  Data packing and expansion
  Data merging
  Data alignment
Partitioned compares and edge operations
  Partitioned compares
  Mask generation for edge effects
Memory-related operations
  Partial stores
  Short loads and stores
  Blocked loads and stores
Special-purpose operations
  Pixel distance computation
  Array address conversion for data reuse
  Access to the graphics status register
Table 4. Classification of VIS instructions.
Packed arithmetic and logical operations. The packed
arithmetic VIS instructions allow SIMD-style parallelism to
be exploited for add, subtract, and multiply instructions. To
minimize implementation complexity, VIS uses a pipelined
series of two 8x16 multiplies and one add instruction to em-
ulate packed 16x16-bit multiplication. The VIS logical in-
structions allow logical operations on the floating-point data
path.
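The subword SIMD idea can be modeled in portable C with a 64-bit integer standing in for a VIS register. This sketch mimics a packed 16-bit add with wrap-around (the non-saturating case); it models the semantics only, not the instruction's encoding or latency.

```c
#include <stdint.h>

/* Model of a packed 16-bit add: four independent 16-bit lanes
   inside one 64-bit word, with wrap-around on overflow. */
uint64_t fpadd16_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        /* the cast truncates each lane's sum to 16 bits */
        uint16_t lane = (uint16_t)((a >> (16 * i)) + (b >> (16 * i)));
        r |= (uint64_t)lane << (16 * i);
    }
    return r;
}
```

One such operation does the work of four scalar adds, which is the source of the SIMD speedups reported in Section 3.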
Subword rearrangement and alignment. To facilitate
conversion between different data types, VIS supports sub-
word rearrangement and alignment using pack, expand,
merge (interleave), and align instructions. The subword re-
arrangement instructions also include support for implicitly
handling saturation arithmetic (limiting data values to the
minimum or maximum instead of the default wrap-around).
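Saturation can be modeled by clamping each intermediate result to the 8-bit pixel range before packing. The sketch below is our scalar model of this behavior, not the pack instruction itself (which also applies a scale factor from the graphics status register).

```c
#include <stdint.h>

/* Clamp an intermediate result to the unsigned 8-bit pixel range
   (saturation arithmetic, instead of the default wrap-around). */
uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Pack four 16-bit lanes into four saturated bytes. */
uint32_t pack16_model(const int16_t lanes[4])
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (uint32_t)clamp_u8(lanes[i]) << (8 * i);
    return r;
}
```

Saturation matters for pixels: wrap-around would turn a slightly-too-bright value into a dark one, a visually obvious artifact.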
Partitioned compares and edge operations. For branches,
VIS supports a partitioned compare that performs four 16-
bit or two 32-bit compares in parallel to produce a mask that
can be used in subsequent instructions. VIS also supports
the edge instruction to generate masks for partial stores that
can eliminate special branch code to handle boundary con-
ditions in media processing applications.
Memory-related operations. For memory instructions,
VIS supports partial stores that selectively write to parts of
the 64-bit output based on an input mask. Short loads
and stores transfer 1 or 2 bytes of memory to the register
file. Blocked loads and stores transfer 64 bytes of data
between memory and a group of eight consecutive VIS reg-
isters without causing allocations in the cache.
Special-purpose operations. The pixel distance computa-
tion (pdist) instruction is primarily targeted at motion es-
timation and computes the sum of the absolute differences
between corresponding 8-bit components in two packed
bytes. The array instruction is mainly targeted at 3D graph-
ics rendering applications and converts 3D fixed-point co-
ordinates into a blocked byte address that allows for greater
cache reuse. VIS also defines instructions to manipulate
the graphics status register, a special-purpose register that
stores additional data for various media instructions.
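A scalar model of the pdist computation follows: the sum of absolute differences of the eight byte components of two packed-byte values. The real instruction also accumulates into its destination register across calls, which this model omits.

```c
#include <stdint.h>

/* Scalar model of pdist: sum of absolute differences between the
   eight 8-bit components of two packed-byte values. The hardware
   instruction additionally accumulates into its destination. */
int pdist_model(uint64_t a, uint64_t b)
{
    int sum = 0;
    for (int i = 0; i < 8; i++) {
        int ai = (int)((a >> (8 * i)) & 0xFF);
        int bi = (int)((b >> (8 * i)) & 0xFF);
        sum += ai > bi ? ai - bi : bi - ai;
    }
    return sum;
}
```

This is the inner step of motion estimation: the encoder picks the candidate macroblock minimizing the accumulated sum of absolute differences, so collapsing eight subtract/abs/add sequences into one instruction directly attacks the most compute-intensive part of mpeg-enc.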
Overall, the functionality discussed above for VIS is sim-
ilar to that of fixed-point media ISA extensions in other
general purpose processors (e.g., MAX [12], MMX [18],
MVI [4], MDMX [9], AltiVec [19]). The various ISA ex-
tensions mainly differ in the number, types, and latencies of
the individual instructions (e.g., MMX implements direct
support for 16x16 multiply), whether they are implemented
in the integer or floating-point data path, and in the width
of the data path. The most different ISA extension, the pro-
posed PowerPC AltiVec ISA, adds support for a separate
128-bit vector multimedia unit in the processor.
Our VIS implementation is closely modeled after
the UltraSPARC-II implementation and operates on the
floating-point register file with latencies comparable to the
UltraSPARC-II [23] (Table 2).² The increase in chip area
associated with the VIS instructions was estimated to be less
than 3% for the UltraSPARC-II [10].
2.3 Methodology
2.3.1 Simulation Environment
We use the RSIM simulator [16] to simulate the in-order and
out-of-order processors described in Section 2.2. RSIM is
a user-level execution-driven simulator that models the pro-
cessor pipeline and memory hierarchy in detail including
contention for all resources. To assess the impact of not
modeling system level code, we profiled the benchmarks on
an UltraSPARC-II-based Sun Enterprise server. We found
that the time spent on operating system kernel calls is less
than 2% on all the benchmarks. The time spent on I/O is
less than 15% on all the benchmarks except mpeg-dec. This
benchmark experiences an inflated I/O component (45%)
because of its high frequency of file writes. In a typical sys-
tem, however, these writes would be handled by a graphics
accelerator, significantly reducing this component. Since
our applications have small instruction footprints, our sim-
ulations assume all instructions hit in the instruction cache.
All the applications³ were compiled with the SPARC
SC4.2 compiler with the -xO4 -xtarget=ultra1/170
-xarch=v8plusa -dalign options to produce optimized code
for the in-order UltraSPARC processor.
²Our VIS multiplier has a lower latency compared to the integer (64-
bit) multiplier because it operates on 16-bit data.
³We changed the 14 image processing kernels from the Sun VSDK to
skew the starting addresses of concurrent array accesses and unroll small
innermost loops. This reduced cache conflicts and branch mispredictions
leading to 1.2X to 6.7X performance benefits. To facilitate modifying the
applications for VIS, we replaced some of the key routines in the JPEG
and MPEG applications with equivalent routines from the Sun MediaLib
library.
2.3.2 VIS Usage Methodology
We are not aware of any compiler that automatically modi-
fies media processing applications to use media ISA exten-
sions. For our experiments studying the impact of VIS, we
manually modified our benchmarks to use VIS instructions
based on the methodology detailed below.
We profiled the applications to identify key procedures
and manually examined these procedures for loops that sat-
isfied the following three conditions: (1) The loop body
should have no loop-carried dependences or control depen-
dences that cannot be converted to data dependences (other
than the loop branch). (2) The key computation in the loop
body must be replaceable with a set of equivalent fixed-
point VIS instructions. The loss in accuracy in this stage,
if any, should be visually imperceptible. (3) The poten-
tial benefit from VIS should be more than the overhead
of adding VIS; VIS overhead can result from subword re-
arrangement instructions to convert between packed data
types or from alignment-related instructions.
For loops that satisfied the above criteria, we strip-mined
or unrolled the loop to isolate multiple iterations of the loop
body that we then replaced with equivalent VIS instruc-
tions. We used the inline assembly-code macros from the
Sun VSDK for the VIS instructions; this minimizes code
perturbation and allows the use of regular compiler opti-
mizations. Wherever possible, we tried to use procedures
available from the Sun VSDK Kit and the SUN MediaLib
library routines that were already optimized for VIS.
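As an illustration of the strip-mining step, the VSDK addition kernel (mean of two pixel values, per Table 1) can be unrolled so that each 8-wide body corresponds to the work one packed VIS operation on a 64-bit register would perform. The scalar body here is only a stand-in for the replaced VIS instructions.

```c
#include <stdint.h>

/* Mean of two 8-bit pixels: the scalar body of the addition kernel. */
uint8_t add_pixel(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) / 2);
}

/* Strip-mined loop: each 8-wide inner body marks where one packed
   VIS operation would replace eight scalar operations. */
void add_images(const uint8_t *a, const uint8_t *b, uint8_t *dst, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8)
        for (int j = 0; j < 8; j++)   /* one packed operation's worth */
            dst[i + j] = add_pixel(a[i + j], b[i + j]);
    for (; i < n; i++)                /* epilogue for leftover pixels */
        dst[i] = add_pixel(a[i], b[i]);
}
```

The scalar epilogue handles image widths that are not multiples of eight; in the VIS versions, edge masks and partial stores can serve the same purpose without branches.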
Our benchmarks use all the VIS instructions except for
the array and blocked load/store instructions. Array in-
structions are targeted at 3D array accesses in graphics
loops, and are not applicable for our applications. Blocked
loads and stores are primarily targeted at transfers of large
blocks of data between buffers without affecting the cache
(e.g., in operating system buffer management, networking,
memory-mapped I/O). We did not use these instructions
since the Sun VSDK does not provide inline assembly-code
macros to support them. The alternative of hand-coded as-
sembly could result in lower performance since it is hard to
emulate the compiler optimizations associated with modern
superscalar processors by hand [22]. Note that both the ar-
ray and blocked load/store instructions are unique to VIS
and are not supported by other general-purpose ISA exten-
sions.
2.3.3 Software Prefetching Algorithm
We studied the applicability of software prefetching for the
benchmarks where the cache miss stall time is a significant
component of the total execution time (
>
20%). We identi-
fied the memory accesses that dominate the cache miss stall
time, and inserted prefetches by hand for these accesses.
We followed the well known software prefetching compiler
algorithm developed by Mowry et al. [14].
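The prefetch insertion can be sketched as follows for a streaming loop. Here __builtin_prefetch is a GCC/Clang builtin standing in for the simulated non-binding prefetch instruction; the function and the prefetch distance of 16 iterations are illustrative choices, not the paper's code.

```c
/* Streaming sum with software prefetching in the style of Mowry et al.:
   issue a non-binding prefetch a fixed distance ahead of the use.
   __builtin_prefetch is a GCC/Clang builtin; the distance is a tuning
   knob that should cover the memory latency at the loop's issue rate. */
enum { PREFETCH_DIST = 16 };   /* iterations ahead; illustrative */

long sum_stream(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST],
                               /*rw=*/0, /*locality=*/0);
        s += a[i];
    }
    return s;
}
```

Locality hint 0 requests no cache retention after use, which suits the streaming, no-reuse access pattern these benchmarks exhibit.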
2.3.4 Performance Metrics
We use the execution time of the system as the primary met-
ric to evaluate the performance of the system, while also re-
porting the individual components of execution time. With
out-of-order processors, an instruction can potentially be
overlapped with instructions preceding and following it. We
therefore use the following convention to identify the differ-
ent components of execution time. At every cycle, the frac-
tion of instructions retired that cycle to the maximum retire
rate is attributed to the busy time; the remaining fraction
is attributed as stall time to the first instruction that could
not be retired that cycle. We also study other metrics such
as dynamic instruction counts, branch misprediction rates,
cache miss rates, MSHR occupancies, and prefetch counts
for further insights into the system behavior.
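The attribution convention above can be stated as a small function; the names and structure are ours.

```c
/* Per-cycle attribution: the fraction of the maximum retire rate
   actually retired is busy time; the remainder is stall time,
   charged to the first instruction that could not retire that cycle. */
typedef struct { double busy; double stall; } CycleSplit;

CycleSplit attribute_cycle(int retired, int max_retire)
{
    CycleSplit c;
    c.busy  = (double)retired / (double)max_retire;
    c.stall = 1.0 - c.busy;
    return c;
}
```

Summing these fractions over all cycles yields the busy and stall components plotted per benchmark in Figure 1.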
3 Improving Processor Performance
For each benchmark, Figure 1 presents execution times
for three variations of our base architecture, each without
VIS (the first set of three bars) and with VIS (the second
set of three bars). The three architecture variations are (i)
in-order and single issue, (ii) in-order and 4-way issue, and
(iii) out-of-order and 4-way issue. On the VIS-enhanced
architecture, we use the VIS-enhanced version of the appli-
cation as mentioned in Section 2. The execution times are
normalized to the time with the in-order single-issue pro-
cessor. For all the benchmarks, the execution time is di-
vided into the busy component, the functional unit stall (FU
stall) component, and the memory component. The mem-
ory component is shown divided into the L1 miss and L1 hit
components.
3.1 Impact of Conventional ILP Features
This section focuses on the system without the VIS me-
dia ISA extensions (the left three bars for each benchmark
in Figure 1).
Overall results. Both multiple issue and out-of-order issue
provide substantial reductions in execution time for most of
our benchmarks. Compared to a single-issue in-order pro-
cessor, on the average, multiple issue improves performance
by a factor of 1.2X (range of 1.1X to 1.4X), while the com-
bination of multiple issue and out-of-order issue improves
performance by a factor of 3.1X (range of 2.3X-4.2X).
Analysis. Compared to the single issue processor, we find
that multiple issue achieves most of its benefits by reduc-
ing the busy CPU component of execution time. Data, con-
trol, and structural dependences prevent the CPU compo-
nent from attaining an ideal speedup of 4 from a 4-way issue
processor, reflected in the increased functional unit and
L1 hit memory stall time.

[Figure 1: normalized execution times for each benchmark on the
in-order 1-way, in-order 4-way, and out-of-order 4-way processors,
without and with VIS; each bar in the original figure is divided
into busy, FU stall, L1 miss, and L1 hit components.]

Benchmark   Without VIS                With VIS
            1-way  4-way  4-way ooo    1-way  4-way  4-way ooo
Addition    100.0   71.2   43.6         39.8   36.7   15.4
Blend       100.0   77.5   33.3         28.7   26.9   10.1
Conv        100.0   87.5   24.1         15.9   12.5    5.7
Dotprod     100.0   88.2   32.4         74.6   68.0   28.7
Scaling     100.0   88.8   26.0         18.3   15.9   10.3
Thresh      100.0   89.6   37.7         32.0   26.5   14.9
Cjpeg       100.0   84.4   29.3         87.6   74.2   26.6
Djpeg       100.0   78.8   33.2         69.2   53.8   26.8
Cjpeg-np    100.0   78.0   30.1         66.8   50.6   23.1
Djpeg-np    100.0   74.7   27.6         58.9   41.1   18.9
Mpeg-enc    100.0   89.6   43.2         33.5   26.2   12.2
Mpeg-dec    100.0   77.3   32.0         65.0   49.2   24.4

Figure 1. Performance of image and video benchmarks.
Some of the benchmarks see additional memory laten-
cies when the number of outstanding misses to one cache
line (MSHR) increases beyond the maximum allowed limit
of 8. This is caused by the heavy use of small data types
in media applications, which leads to a high frequency
of accesses to each cache line (e.g., 64 pixel
writes in a 64-byte line). Since the processors do not stall
on writes, in benchmarks with small loop bodies (e.g., addi-
tion,cjpeg,djpeg), this leads to a backup of multiple writes.
This backup leads to contention for the MSHR that even-
tually prevents other accesses from being serviced at the
cache.
Out-of-order issue, on the other hand, improves perfor-
mance by reducing both functional unit stall time and mem-
ory stall time. A large fraction of the stall times due to data,
control, and structural dependences, as well as MSHR con-
tention, is now overlapped with other useful work. This is
seen in the reduction in the FU stall and L1 hit components
of execution time. Additionally, out-of-order issue can bet-
ter exploit the non-blocking loads feature of the system by
allowing the latency of multiple long-latency load misses to
be overlapped with one another. Our results examining the
MSHR occupancies at the cache indicate that while there is
increased load miss overlap in all the 12 benchmarks, only
2 to 3 misses are overlapped in most cases. The total capacity of 12 MSHRs is never fully utilized for load misses in
any of our benchmarks.
Overall, the impact of the various ILP features is qualita-
tively consistent with that described in previous studies for
scientific and database workloads. Quantitatively, these ILP
features are substantially more effective for the image and
video benchmarks than for previously reported online trans-
action processing (OLTP) workloads [21], and comparable
in benefit to previously reported scientific and decision sup-
port system (DSS) workloads [1, 17, 21].
It must be noted that the performance of the in-order is-
sue processor is dependent on the quality of the compiler
used to schedule the code. Our experiments use the com-
mercial SPARC SC4.2 compiler with maximum optimiza-
tions turned on for the in-order UltraSPARC processor. To
try to isolate compiler scheduling effects, we studied two
other processor configurations with single-cycle functional
unit latencies and functional unit latencies comparable to
the UltraSPARC processor. In both these configurations,
our results continued to be qualitatively similar; the out-
of-order processor continues to significantly outperform the
in-order processor. The impact of future, more advanced,
compiler optimizations, however, is still an open question.
Interestingly, a recent position paper [6] on the impact of
multimedia workloads on general-purpose processors con-
jectures that complex out-of-order issue techniques devel-
oped for scientific and engineering workloads (e.g., SPEC)
may not be needed for multimedia workloads. Our results
show that, on the contrary, out-of-order issue can provide
significant performance benefits for the image and video
workloads.
3.2 Impact of VIS Media ISA Extensions
This section discusses the performance impact of the
VIS ISA extensions (comparing the left-hand three bars and
right-hand three bars for each benchmark in Figure 1).
3.2.1 Overall Results
The VIS media ISA extensions provide significant perfor-
mance improvements for all the benchmarks (factors of
1.1X to 4.0X for the out-of-order system, 1.1X to 7X across
all configurations). On average, the addition of VIS im-
proves the performance of the single-issue in-order system
by a factor of 2.0X, the performance of the 4-way-issue in-
order system by a factor of 2.1X, and the performance of
the 4-way issue out-of-order system by a factor of 1.8X.
Multiple issue and out-of-order issue are beneficial even
with VIS. On average, with VIS, compared to a single-issue
in-order processor, multiple issue achieves a factor of 1.2X
performance improvement, while the combination of mul-
tiple issue and out-of-order issue achieves a factor of 2.7X
performance improvement. The reasons for these perfor-
mance benefits from ILP features are the same as for the
systems without VIS.
3.2.2 Benefits from VIS
Figure 2 presents some additional data showing the distribu-
tion of the dynamic (retired) instructions for the 4-way out-
of-order processor without and with VIS, normalized to the
former. Each bar divides the instructions into the Functional
unit (FU, combines ALU and FPU), Branch, Memory, and
VIS categories. The use of VIS instructions provides a sig-
nificant reduction in the dynamic instruction count for all
the benchmarks. The reductions in the dynamic instruc-
tion count correlate well with the performance benefits from
VIS. We next discuss the sources for the reductions in the
dynamic instructions.
Reductions in FU instructions. The VIS packed arith-
metic and logical instructions allow multiple (typically
four) arithmetic instructions to be replaced with one VIS
instruction. Consequently, all the benchmarks see signif-
icant reductions in the FU instructions with correspond-
ing, smaller, increases in the VIS instruction count. Ad-
ditionally, the SIMD VIS instructions replace multiple it-
erations in the original loop with one equivalent VIS it-
eration. This reduces iteration-specific loop-overhead in-
structions that increment index and address values and com-
pute branch conditions, further reducing the FU instruction
count.
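The loop-overhead effect can be seen in a plain-C sketch (our own illustration, not VIS code): processing four elements per iteration quarters the per-element branch and index-update work, on top of the arithmetic that a partitioned instruction would collapse.

```c
#include <stdint.h>

/* Scalar version: one multiply, one index increment, and one branch
 * per element. */
void scale_scalar(int16_t *x, int n, int16_t k)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * k;
}

/* SIMD-style version (n assumed a multiple of 4): one iteration does
 * the work of four scalar iterations, so the loop executes a quarter
 * of the branches and index updates. A real VIS implementation would
 * further replace the four multiplies with one partitioned multiply. */
void scale_packed4(int16_t *x, int n, int16_t k)
{
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     * k;
        x[i + 1] = x[i + 1] * k;
        x[i + 2] = x[i + 2] * k;
        x[i + 3] = x[i + 3] * k;
    }
}
```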
Reductions in branch instructions. All the benchmarks
use the edge masking and partial store instructions to elim-
inate testing for edge boundaries and selective writes. They
also use loop unrolling when replacing multiple iterations
with one equivalent VIS iteration. These lead to a reduc-
tion in the branch instruction count for all the benchmarks.
For some applications, branch instruction counts are also
reduced because of the elimination of the code to explicitly perform saturation arithmetic (mainly in conv4 and the
JPEG applications), and the use of partitioned SIMD compares (mainly in thresh).
Many of the branches eliminated are hard-to-predict
branches (e.g., saturation, thresholding, selective writes),
leading to significant improvements in the hardware branch
misprediction rates for some of our benchmarks (the branch
misprediction rate decreases from 10% to 0% for conv and
from 6% to 0% for thresh).
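Saturation is the clearest case. A scalar implementation needs a data-dependent, hence hard-to-predict, branch per pixel, which a partitioned saturating add removes entirely; a C sketch of the scalar form (our illustration, not benchmark source):

```c
#include <stdint.h>

/* Scalar saturating add of two 8-bit pixels: the clipping test is a
 * data-dependent branch that mispredicts easily on real image data.
 * VIS-style packed saturating arithmetic clips in hardware,
 * eliminating the branch altogether. */
uint8_t add_sat(uint8_t a, uint8_t b)
{
    int sum = a + b;
    return (sum > 255) ? 255 : (uint8_t)sum;  /* hard-to-predict branch */
}
```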
Reductions in memory instructions. With VIS, memory
accesses operate on packed data as opposed to individual
media data. Consequently, most of the benchmarks see sig-
nificant reductions in the number of memory instructions
(and associated cache accesses). This reduces the MSHR
contention discussed in Section 3.1.
Most of the memory instructions eliminated are cache
hits, without a proportional decrease in the number of
4The original source code from the Sun VSDK checks for saturation
only in the conv code. The add, blend, and dotprod kernels are written in
the non-saturation mode. These could, however, potentially be rewritten to
check for saturation, in which case they would also see similar benefits.
[Figure 2 graphic: for each benchmark, two stacked bars (Base = 100 and VIS) break the normalized dynamic instruction count into FU, Branch, Memory, and VIS categories; VIS reduces the count to between 17.6% (blend) and 88.5% (dotprod) of the base.]
Figure 2. Impact of VIS on dynamic (retired) instruction count.
misses, causing higher L1 cache miss rates with VIS. The
higher miss rate and the lower instruction count allow additional load misses to appear together within the instruction
window and be overlapped with each other. However, the
system still rarely sees more than 3 load misses overlapped
concurrently.
Pixel distance computation for mpeg-enc. mpeg-enc
achieves additional benefits from the special-purpose pixel
distance computation (pdist) instruction in the motion estimation phase. The pdist instruction allows a sequence
of 48 instructions to be reduced to one instruction [23], significantly reducing the FU, branch, and memory instruction
counts. The elimination of hard-to-predict branches to perform comparisons and saturation improves the branch misprediction rate from 27% to 10%.
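For reference, the scalar computation that pdist collapses is a sum of absolute differences over eight pixels (a sketch of the generic motion-estimation inner loop, not the Sun VSDK source):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over one 8-pixel row, the core of
 * block-matching motion estimation. Scalar code needs a subtract,
 * an absolute value (a compare/branch or conditional move), and an
 * accumulate per pixel; the VIS pdist instruction performs the
 * whole row in a single operation. */
int sad8(const uint8_t *a, const uint8_t *b)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += abs((int)a[i] - (int)b[i]);
    return sum;
}
```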
3.2.3 Limitations of VIS
Examining the variation of performance benefits across the
benchmarks, we observe that the JPEG applications, mpeg-
dec, and dotprod exhibit relatively lower performance ben-
efits (factors of 1.1X to 1.5X) compared to the remaining
benchmarks (factors of 2.8X to 4.2X). We next discuss fac-
tors limiting the benefits from VIS.
Inapplicability of VIS. The VIS media ISA extensions are
not applicable to a number of key procedures in the JPEG
applications and mpeg-dec. For example, the JPEG appli-
cations (especially the progressive versions) spend a large
fraction of their time in the variable-length Huffman cod-
ing phase. This phase is inherently sequential and operates
on variable-length data types, and consequently cannot be
optimized using VIS instructions.5 Other examples of code
segments in the JPEG and MPEG applications where VIS
could not be applied include bit-level input/output stream
manipulation, scatter-gather addressing, quantization, and
saturation arithmetic operations not embedded in a loop.
VIS overhead. All our benchmarks use subword rearrangement and alignment instructions to get the data into
a form that the VIS instructions can operate on (e.g., packing/unpacking between packed-byte pixels and packed-word
operands). This results in extra overhead that limits the
performance benefits from VIS (on average, for our benchmarks, 41% of the VIS instructions are for subword rearrangement and alignment). Overhead is also increased when
multiple VIS instructions are used to emulate one operation
(e.g., the 16x16 multiply in dotprod) or when the data need
to be reordered to exploit SIMD (e.g., byte reordering in the
color conversion phase in JPEG).
5It is instructive to note that many media processors (e.g., the Mitsubishi VLIW Media Processor and Samsung Media Signal Processor)
have a special-purpose hardware unit to handle the variable-length coding.
Limited parallelism and scheduling constraints. Most of
the packed arithmetic instructions operate only on packed
words or packed double words. This ensures enough bits to
maintain the precision of intermediate values. However, this
limits the maximum parallelism to 4 (on 16-bit data types),
even in cases when the operations are known to be per-
formed on smaller data types (e.g., 8-bit pixels). The limit
on the SIMD parallel path,6 in combination with contention
for VIS functional units, limits the benefits from VIS on
some of our benchmarks (most significantly in mpeg-enc).
Cache miss stall time. As discussed earlier, the reductions
in memory instructions mainly occur for cache hits. The
VIS instructions do not directly target cache misses though
there are indirect benefits associated with increased load
miss overlap due to instruction count reduction (discussed
in Section 3.2.2).
3.3 Combination of ILP and VIS
The combination of conventional ILP features and VIS
extensions achieves an average factor of 5.5X perfor-
mance improvement (range of 3.5X to 18X) over the base
single-issue in-order processor. The benefits from VIS are
achieved with a much smaller increase in the die area com-
pared to the ILP features.
On the base single-issue in-order processor, all the
benchmarks are primarily compute-bound. With ILP features and VIS extensions, cjpeg-np, djpeg-np, and mpeg-enc
continue to spend most of their execution time in the processor sub-system (87% to 97%). Five of the image processing
kernels, however, now spend 55% to 66% of their total
time in memory stalls. The strong compute-centric performance benefits from ILP features and VIS extensions shift
the bottleneck to the memory sub-system for these benchmarks. The remaining four applications (conv, cjpeg, djpeg,
and mpeg-dec) spend between 20% and 30% of their total
time on memory stalls.
6The MIPS MDMX provides support for a larger accumulator that
allows greater parallelism without losing the precision of the intermediate
result [9]. The PowerPC AltiVec supports a larger 128-bit data path to
increase the parallelism [19].
4 Improving Memory System Performance
This section studies memory system performance for
the benchmarks. Section 4.1 discusses the effectiveness of
caches, and Section 4.2 discusses the impact of software
prefetching.
4.1 Impact of Caches
Impact of varying L2 cache size. We varied the L2 cache
size from 128K to 2M, keeping the L1 cache fixed at 64K.
Our results (not shown here due to lack of space) showed
that increasing the size of the L2 cache has no impact on
the performance of the 6 image processing kernels and the
cjpeg-np and djpeg-np applications. The remaining four
applications, cjpeg, djpeg, mpeg-enc, and mpeg-dec, reuse
data; but the cache size needed to exploit the reuse depends
on the size of the display. For our input image sizes, L2
cache sizes of 2M capture the entire working sets for all
four benchmarks, and provide 1.1X to 1.2X performance
improvement over the default 128K L2 cache. With the 2M
cache sizes, memory stall time is between 7% and 9% on all
the applications and is dominated by L1 hit time (due to MSHR
contention) and L2 hit time.
The image processing kernels have streaming data ac-
cesses to a large image buffer with no reuse and low com-
putation per cache miss. Consequently, they exhibit high
memory stall times unaffected by larger caches. cjpeg-np
and djpeg-np do not see any variation in performance with
larger caches because of their negligible memory compo-
nents. These applications implement a blocked pipeline al-
gorithm that performs all the computation phases on 8x8-
sized blocks at a time, reducing the bandwidth requirements
and increasing the computation per miss. The cjpeg and
djpeg applications differ from their non-progressive coun-
terparts in the progressive coding phase where they perform
a multi-pass traversal of the buffer storing the DCT coeffi-
cients. The low computation per miss in this phase com-
bined with the reuse of the image-sized (1024x640 pixels)
buffer results in a 1.2X performance benefit from increas-
ing the cache size to 2M. Larger images would increase
the working set requiring larger caches. For example, a
1024x1024 image would require a 4M cache size. mpeg-enc
and mpeg-dec perform inter-block distance vector operations and therefore need to operate on multiple image-sized
buffers (as opposed to block-sized buffers in cjpeg-np
and djpeg-np). The reuse of these 352x240 buffers across
the frames in the video leads to 1.1X performance benefits with 512K (mpeg-dec) and 1M (mpeg-enc) cache sizes.
Larger image sizes would require larger caches; for example, a 1024x1024 image would require almost a 12X increase in cache size.
Impact of varying L1 caches. We also performed ex-
periments varying the size of the L1 cache from 1K to
64K while keeping the L2 cache fixed at 128K. Our results
showed that the L1 cache size had no impact on five of the
image processing kernels. On the remaining benchmarks,
a 64K L1 configuration outperforms the 1K L1 configura-
tion by factors of 1.1X to 1.3X; 4K-16K L1 caches achieve
within 3% of the performance of the 64K L1 cache con-
figuration. Small data structures other than the main data,
such as tables for convolution, quantization, color conver-
sion, and saturation clipping, are responsible for these small
first-level working sets. At 64K L1 caches, memory stall
time is mainly due to L1 hits (mainly related to MSHR con-
tention) or to L2 misses.
4.2 Impact of Software Prefetching
Figure 3 summarizes the execution time reductions from
software prefetching relative to the base system with VIS
(with 64K L1 and 128K L2 caches). We do not report
results for cjpeg-np, djpeg-np, and mpeg-enc since these
benchmarks spend less than 6% of their total time on L1
cache misses. Our results show that software prefetching
achieves high performance improvements for the six image
processing benchmarks (an average of 1.9X and a range of
1.4X to 2.5X). The cjpeg, djpeg, and mpeg-dec benchmarks
exhibit relatively small performance improvements. Overall, after applying software prefetching, all our benchmarks
revert to being compute-bound.
For the image processing kernels, a significant fraction
of the prefetches are useful in completely or partially hid-
ing the latency of the cache miss with computation or with
other misses. The addition of software prefetching also in-
creases the utilization of cache MSHRs; in many of the im-
age processing kernels, more than 5 MSHRs are used for a
large fraction of the time. The remaining memory stall time
is mainly due to late prefetches and resource contention.
Late prefetches (prefetches that arrive after the demand ac-
cess) arise mainly because of inadequate computation in the
loop bodies to overlap the miss latencies. Contention for
resources occurs when multiple prefetches are outstanding
at a time. These effects are similar to those discussed in
previous studies with scientific applications for ILP-based
processors [20].
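A streaming kernel of this kind would be prefetched roughly as follows (a generic sketch using the GCC/Clang `__builtin_prefetch` intrinsic and an assumed prefetch distance, not the compiler algorithm of [14]):

```c
#include <stdint.h>

#define PREFETCH_AHEAD 256  /* four 64-byte lines ahead; a tuning assumption */

/* Streaming image addition with software prefetch: each iteration
 * requests data a few cache lines ahead, so the miss latency overlaps
 * with the (small) per-pixel computation and with other outstanding
 * misses. __builtin_prefetch compiles to a non-faulting prefetch
 * instruction where the ISA provides one. The bounds check keeps the
 * sketch portable; tuned code often peels the loop tail instead. */
void add_stream(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0);
        }
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```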
The other benchmarks (cjpeg, djpeg, and mpeg-dec) see
[Figure 3 graphic: for each benchmark, two stacked bars (VIS = 100 and +PF) break normalized execution time into Busy, FU stall, L1 miss, and L1 hit components; prefetching reduces execution time to 40.6-72.3 for the image processing kernels (addition 56.3, blend 53.2, conv 72.3, dotprod 40.6, scaling 44.5, thresh 42.8) but only to 98.1 for cjpeg and djpeg and 95.0 for mpeg-dec.]
Figure 3. Effect of software-inserted prefetching.
lower performance benefits for several reasons. First, the fraction
of memory stall time is relatively low and includes an L1
hit component (mainly due to MSHR contention) that software prefetches do not address. Second, in cjpeg and djpeg, the prefetches are to memory locations that are indirectly addressed (of the form A[B[i]]).
Consequently, the prefetching algorithm is unable to distinguish between hits and misses and is constrained to issue prefetches for all accesses. The resulting overhead due
to address calculation and cache contention limits performance (seen as increased Busy and FU stall components7).
Finally, as before, late prefetches and resource contention
also contribute to the lower benefits from prefetching.
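The indirect case can be sketched as follows (a generic gather kernel of our own, with an assumed prefetch distance): the prefetch address itself depends on a load of the index array, so the compiler cannot tell hits from misses and pays the address-calculation cost on every access.

```c
/* Prefetching an indirectly addressed stream A[B[i]]: computing the
 * prefetch address requires first loading B[i + d], so a prefetch
 * (plus the extra address calculation) must be issued for every
 * access, whether or not A[B[i]] would actually miss. */
void gather_add(int *dst, const int *A, const int *B, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&A[B[i + 8]], 0, 0);  /* needs B[i+8] first */
        dst[i] += A[B[i]];
    }
}
```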
5 Related Work
Most of the papers discussing instruction-set extensions
for multimedia processing have focused on detailed descrip-
tions of the additional instructions and examples of their
use [4, 9, 12, 15, 18, 19, 23]. The performance character-
ization in these papers is usually limited to a few sample
code segments and/or a brief mention of the benefits an-
ticipated on larger applications. Eyre studies the applica-
bility of general-purpose processors for DSP applications;
however, the study only reports high-level metrics such as
MIPS, power efficiency, and cost [7].
Daniel Rice presents a detailed description of VIS and 8
image processing applications without and with VIS [22].
The study reports speedups of 2.7X to 10.5X on an actual
UltraSPARC-based system, but does not analyze the cause
of performance benefits, the remaining bottlenecks, or the
impact of alternative architectures.
Yang et al. look at the benefits of packed floating-point
formats and instructions for graphics but assume a per-
fect memory system [24]. Bharghava et al. study some
MMX-enhanced applications based on Pentium-based sys-
tems; but again, no detailed characterization of performance
bottlenecks or the impact of other architectures is done [2].
Zucker et al. study MPEG video decode applications and
show the benefits from I/O prefetching, software restructuring to use SIMD without hardware support, and profile-driven software prefetching [25, 26]. However, the studies
assume a simplistic processor model with blocking loads
and do not study the effect of media ISA extensions. Bilas
et al. develop two parallel versions of the MPEG decoder
and present results for multiprocessor speedup, memory requirements, load balance, synchronization, and locality [3].
Similar to our results, they also find that the miss rates for
352x240 images on “realistic” cache sizes are negligible.
7Some of the image processing kernels see a reduction in the CPU component because of the reduction in instructions and better scheduling when
loops are unrolled for the prefetching algorithm [14].
6 Conclusions
Media processing is a workload of increasing importance
for desktop processors. This paper focuses on image and
video processing, an important class of media processing,
and aims to provide a quantitative understanding of the per-
formance of these workloads on general-purpose proces-
sors. We use detailed simulation to study 12 representa-
tive benchmarks on a variety of architectural configurations,
both with and without the use of Sun’s visual instruction set
(VIS) media ISA extensions.
Our results show that conventional techniques in current
processors to enhance ILP (multiple issue and out-of-order
issue) provide a factor of 2.3X to 4.2X performance im-
provement for the image and video benchmarks. The Sun
VIS media ISA extensions provide an additional 1.1X to
4.2X performance improvement. The benefits from VIS are
achieved with a much smaller increase in the die area com-
pared to the ILP features.
Our detailed analysis indicates the sources and limita-
tions of the performance benefits due to VIS. VIS is very
effective in exploiting SIMD parallelism using packed data
types, and can eliminate a number of potentially hard-to-
predict branches using instructions targeted towards satu-
ration arithmetic, boundary detection, and partial writes.
Special-purpose instructions such as pdist achieve high ben-
efits on the targeted application, but are too specialized to
use in other cases. Routines that are sequential and operate
on variable-length data types, VIS instruction overhead, cache miss
stall times, and the fixed parallelism in the packed arith-
metic instructions limit the benefits on the benchmarks.
On our base single-issue in-order processor, all the
benchmarks are primarily compute-bound. Conventional
ILP features and the VIS instructions together significantly
reduce the CPU component of execution time, making 5
of our image processing benchmarks memory-bound. The
memory behavior of these workloads is characterized by
large working sets and streaming data accesses. Increas-
ing the cache size has no impact on the image processing
kernels and the non-progressive JPEG applications. This is
particularly interesting considering current trends towards
large on-chip and off-chip caches. The remaining bench-
marks require relatively large cache sizes (dependent on the
display sizes) to exploit data reuse, but derive less than 1.2X
performance benefits with the larger caches. Software-
inserted prefetching achieves 1.4X to 2.5X performance
benefits on the image processing kernels where memory
stall time is significant.
With the addition of software prefetching, all our bench-
marks revert to being compute-bound. Architectural opti-
mizations that improve computation time (e.g., multipro-
cessing) may be useful to exploit greater parallelism. Such
efforts are likely to expose the memory system bottleneck
yet again, possibly requiring additional novel memory sys-
tem techniques beyond conventional software prefetching.
In the future, we plan to explore new architectural tech-
niques for general-purpose processors to support media pro-
cessing workloads. We also plan to expand our study to in-
clude other media processing applications such as speech,
audio, communication, and natural language interaction.
7 Acknowledgments
We would like to thank Behnaam Aazhang, Mohit Aron,
Rich Baraniuk, Joe Cavallaro, Tim Dorney, Aria Nostra-
tinia, and Jan Odegard for numerous discussions on the be-
havior of media processing workloads. We would also like
to thank Partha Tirumalai, Ahmad Zandi and Tony Zhang
from Sun for useful pointers on enhancing the applications
with VIS. We also thank Vijay Pai, Barton Sano, Chaitali
Sengupta, and the anonymous reviewers for their valuable
comments on earlier drafts of the paper.
References
[1] D. Bhandarkar and J. Ding. Performance Characterization of
the Pentium Pro Processor. In HPCA-3, pages 288–297, Feb
1997.
[2] R. Bhargava et al. Evaluating MMX Technology Using DSP
and Multimedia Applications. In MICRO-31, Dec 1998.
[3] A. Bilas et al. Real-time Parallel MPEG-2 Decoding in Soft-
ware. In IPPS-11, April 1997.
[4] D. A. Carlson et al. Multimedia Extensions for a 550MHz
RISC Microprocessor. In IEEE Journal of Solid-State Cir-
cuits, 1997.
[5] T. M. Conte et al. Challenges to Combining General-
Purpose and Multimedia Processors. In IEEE Computer,
pages 33–37, Dec 1997.
[6] K. Diefendorff and P. K. Dubey. How Multimedia Work-
loads Will Change Processor Design. In IEEE Micro, pages
43–45, Sep 1997.
[7] J. Eyre. Assessing General-Purpose Processors for DSP Ap-
plications. Berkeley Design Technology Inc. presentation,
1998.
[8] International Organisation for Standardisation – ISO/IEC
JTC1/SC29/WG11 MPEG 98/N2457. MPEG-4 Applications
Document, 1998.
[9] E. Killian. MIPS Extension for Digital Media with 3D.
Slides presented at Microprocessor Forum, October 1996.
[10] L. Kohn et al. The Visual Instruction Set (VIS) in Ultra-
SPARC. In COMPCON Digest of Papers, March 1995.
[11] C. Lee et al. MediaBench: A Tool for Evaluating and
Synthesizing Multimedia and Communications Systems. In
MICRO-30, 1997.
[12] R. B. Lee. Subword Parallelism with MAX-2. In IEEE
Micro, volume 16(4), pages 51–59, August 1996.
[13] R. B. Lee and M. D. Smith. Media Processing: A New De-
sign Target. In IEEE MICRO, pages 6–9, Aug 1996.
[14] T. Mowry. Tolerating Latency through Software-controlled
data prefetching. PhD thesis, Stanford University, 1994.
[15] S. Oberman et al. AMD 3DNow! Technology and the K6-2
Microprocessor. In HOTCHIPS10, 1998.
[16] V. S. Pai et al. RSIM: A Simulator for Shared-Memory Mul-
tiprocessor and Uniprocessor Systems that Exploit ILP. In
Proc. 3rd Workshop on Computer Architecture Education,
1997.
[17] V. S. Pai et al. The Impact of Instruction Level Parallelism
on Multiprocessor Performance and Simulation Methodol-
ogy. In HPCA-3, pages 72–83, 1997.
[18] A. Peleg and U. Weiser. MMX Technology Extension to
the Intel Architecture. In IEEE Micro, volume 16(4), pages
51–59, Aug 1996.
[19] M. Phillip et al. AltiVec Technology: Accelerating Media
Processing Across the Spectrum. In HOTCHIPS10, Aug
1998.
[20] P. Ranganathan et al. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. In
ISCA-24, pages 144–156, 1997.
[21] P. Ranganathan et al. Performance of Database Workloads
on Shared-Memory Systems with Out-of-Order Processors.
In ASPLOS-8, pages 307–318, 1998.
[22] D. S. Rice. High-Performance Image Processing Using
Special-Purpose CPU Instructions: The UltraSPARC Visual
Instruction Set. Master’s thesis, Stanford University, 1996.
[23] M. Tremblay et al. VIS Speeds New Media Processing. In
IEEE Micro, volume 16(4), pages 51–59, Aug 1996.
[24] C.-L. Yang et al. Exploiting Instruction-Level Parallelism in
Geometry Processing for Three Dimensional Graphics Applications. In MICRO-31, 1998.
[25] D. F. Zucker. Architecture and Arithmetic for Multimedia
Enhanced Processors. PhD thesis, Department of Electrical
Engineering, Stanford University, June 1997.
[26] D. F. Zucker et al. An Automated Method for Software Con-
trolled Cache Prefetching. In Proc. of the 31st Hawaii Intl.
Conf. on System Sciences, Jan 1998.