To appear in the Proceedings of the 26th International Symposium on Computer Architecture. May 1999
Performance of Image and Video Processing with
General-Purpose Processors and Media ISA Extensions
Parthasarathy Ranganathan, Sarita Adve, and Norman P. Jouppi†

Electrical and Computer Engineering        †Western Research Laboratory
Rice University                             Compaq Computer Corporation
{parthas,sarita}@rice.edu                   jouppi@pa.dec.com
Abstract
This paper aims to provide a quantitative understanding
of the performance of image and video processing applica-
tions on general-purpose processors, without and with me-
dia ISA extensions. We use detailed simulation of 12 bench-
marks to study the effectiveness of current architectural fea-
tures and identify future challenges for these workloads.
Our results show that conventional techniques in current
processors to enhance instruction-level parallelism (ILP)
provide a factor of 2.3X to 4.2X performance improve-
ment. The Sun VIS media ISA extensions provide an ad-
ditional 1.1X to 4.2X performance improvement. The ILP
features and media ISA extensions significantly reduce the
CPU component of execution time, making 5 of the image
processing benchmarks memory-bound.
The memory behavior of our benchmarks is character-
ized by large working sets and streaming data accesses. In-
creasing the cache size has no impact on 8 of the bench-
marks. The remaining benchmarks require relatively large
cache sizes (dependent on the display sizes) to exploit data
reuse, but derive less than 1.2X performance benefits with
the larger caches. Software prefetching provides 1.4X to
2.5X performance improvement in the image processing
benchmarks where memory is a significant problem. With
the addition of software prefetching, all our benchmarks re-
vert to being compute-bound.
1 Introduction
In the near future, media processing is expected to be-
come one of the dominant computing workloads [6, 13].
This work is supported in part by an IBM Partnership award, Intel Cor-
poration, the National Science Foundation under Grant Nos. CCR-9410457,
CCR-9502500, CDA-9502791, and CDA-9617383, and the Texas Ad-
vanced Technology Program under Grant No. 003604-025. Sarita Adve
is also supported by an Alfred P. Sloan Research Fellowship.
Media processing refers to the computing required for the
creation, encoding/decoding, processing, display, and com-
munication of digital multimedia information such as im-
ages, audio, video, and graphics. The last few years
have seen significant advances in this area, but the true
promise of media processing will be seen only when ap-
plications such as collaborative teleconferencing, distance
learning, and high-quality media-rich content channels ap-
pear in ubiquitously available commodity systems. Fur-
ther out, advanced human-computer interfaces, telepres-
ence, and immersive and interactive virtual environments
hold even greater promise.
One obstacle in achieving this promise is the high com-
putational demands imposed by these applications. These
requirements arise from the computationally expensive na-
ture of the algorithms, the stringent real-time constraints,
and the need to run many such tightly synchronized appli-
cations at the same time on the same system. For exam-
ple, a video teleconferencing system may need to run video
processing including encoding/decoding, audio processing,
and a software modem simultaneously. As a result, such
applications currently display images of only a few square
inches at a few frames per second when running on general-
purpose processors. Full-screen images at 20-30 frames per
second could require more than two orders of magnitude
more performance.
To meet the high computational requirements of emerg-
ing media applications, current systems use a combination
of general-purpose processors accelerated with DSP (or me-
dia) processors and ASICs performing specialized compu-
tations. However, benefits offered by general-purpose pro-
cessors in terms of ease of programming, higher perfor-
mance growth, easier upgrade paths between generations,
and cost considerations argue for increasing use of general-
purpose processors for media processing applications [6,
13]. The most visible evidence of this trend has been
the SIMD-style media instruction-set architecture (ISA) ex-
tensions announced for most high-performance general-
purpose processors (e.g., 3DNow! [15], AltiVec [19],
MAX [12], MDMX and MIPSV [9], MMX [18], MVI [4],
VIS [23]).
Unfortunately, in spite of the large amount of recent at-
tention given to media processing [5, 6, 13], there is very
little quantitative understanding of the performance of such
applications on general-purpose systems. A major chal-
lenge for such studies has been the large number of ap-
plication classes in this domain (e.g., image, video, au-
dio, speech, communication, graphics, etc.), and the ab-
sence of any standardized representative benchmark sets.
Consequently, in contrast to the much-researched SPEC,
SPLASH, and (more recently) TPC benchmarks, a number
of fundamental questions still remain unanswered for me-
dia processing workloads. For example, is computation or
memory the primary bottleneck in these applications? How
effective are current architectural designs and media ISA
extensions? What are the future challenges for these work-
loads? Given the lack of understanding of such issues, it is
not surprising that the media instruction set extensions an-
nounced by different processor vendors vary widely from
13 instructions in MVI for Alpha [4] to 162 instructions in
AltiVec for PowerPC [19].
This paper is a first step in understanding the above is-
sues to determine if and how we need to change the way we
design general-purpose systems to support media process-
ing applications. We focus on image and video workloads,
an important class of media processing workloads, and at-
tempt to cover the spectrum of the key tasks in this class.
Our benchmark suite consists of 12 kernels and applications
covering image processing, image source coding, and video
source coding. We use detailed simulation to study a va-
riety of general-purpose-processor architectural configura-
tions, both with and without the use of Sun’s visual instruc-
tion set (VIS) media ISA extensions. VIS shares a number
of fundamental similarities with the media ISA extensions
proposed for other processors, and is representative of the
benefits and limitations of current media ISA extensions.
We start with a base single-issue in-order processor. In
this system, all the benchmarks are primarily compute-
bound. We find that conventional techniques in current
processors to enhance instruction-level parallelism or ILP
(multiple issue and out-of-order issue) provide a factor of
2.3X to 4.2X performance improvement for the benchmarks
studied. The VIS media ISA extensions provide an addi-
tional 1.1X to 4.2X performance improvement. Our de-
tailed analysis indicates the sources and limitations of the
performance benefits due to VIS. The conventional ILP
techniques and the VIS extensions together significantly re-
duce the CPU component of execution time, making five of
the image processing benchmarks memory-bound.
The memory behavior of these workloads is character-
ized by large working sets and streaming data accesses. In-
creasing the cache size has no impact on 8 of the bench-
marks. The remaining benchmarks reuse data, but require relatively
large cache sizes (dependent on the display sizes) to ex-
ploit the reuse and derive a performance benefit of less than
1.2X. Software-inserted prefetching provides 1.4X to 2.5X
performance improvement in the image processing bench-
marks where memory stall time is significant. With the ad-
dition of software prefetching, all of our benchmarks revert
to being compute-bound.
The rest of the paper is organized as follows. Section 2
describes our workloads, the architectures modeled, and the
simulation methodology. Section 3 presents our results on
the impact of ILP features and VIS media extensions. Sec-
tion 4 studies the performance of the cache system and the
impact of software prefetching. Section 5 discusses related
work. Section 6 concludes the paper.
2 Methodology
2.1 Workloads
We attempt to cover the spectrum of key tasks in im-
age and video processing workloads. The kernels and ap-
plications in our benchmark suite form significant compo-
nents of many current and future real-world workloads such
as collaborative teleconferencing, scene-visualization, dis-
tance learning, streaming video across the internet, digi-
tal broadcasting, real-time flight imaging and radar sens-
ing, content-based storage and retrieval, online video cata-
loging, and medical tomography [8]. Future standards such
as JPEG2000 and MPEG4 are likely to build on a number
of components of our benchmark suite.
Table 1 summarizes the 12 benchmarks that we use
in this paper, and is divided into image processing (Sec-
tion 2.1.1), image source coding (Section 2.1.2), and video
source coding (Section 2.1.3). These benchmarks are simi-
lar to some of the benchmarks used in the image and video
parts of the Intel Media Benchmark (described at the Intel
web site) and the UCLA MediaBench [11].¹
All the image benchmarks were run with 1024x640 pixel
3-band (i.e., channel) input images obtained from the Intel
Media Benchmark. The video benchmarks were run with
the mei16v2 test bit stream from the MPEG Software Sim-
ulation Group that operates on 352x240 sized 3-band im-
ages. We did not study larger (full-screen) sizes because
they were not readily available and would have required im-
practical simulation time.
¹We did not use the Intel Media Benchmark or the UCLA MediaBench
directly because the former does not provide source code and the latter
does not include image processing applications.
Image processing
Addition Addition of two images (sf16.ppm,rose16.ppm) using mean of two pixel values
Blend Alpha blending of two images (sf16.ppm, rose16.ppm) with another alpha image (winter16.ppm); the operation
performed is dst = alpha × src1 + (255 − alpha) × src2.
Conv General 3x3 image convolution of an image (sf16.ppm). The operation performed includes a saturation sum-
mation of 9 product terms. Each term corresponds to multiplying the pixel values in a moving 3x3 window
across the image dimensions with the values of a 3x3 kernel matrix.
Dotprod 16x16 dot product of a randomly-initialized 1048576-element linear array
Scaling Linear image scaling of an image (sf16.ppm)
Thresh Double-limit thresholding of an image (sf16.ppm). If the pixel band value falls within the low and high values
for that band, the destination is set to the map value for that band; otherwise, the destination is set to be the
same as the source pixel value.
Image source coding
Cjpeg JPEG progressive encoding (rose16.ppm)
Djpeg JPEG progressive decoding (rose16.jpg)
Cjpeg-np JPEG non-progressive encoding (rose16.ppm)
Djpeg-np JPEG non-progressive decoding (rose16.jpg)
Video source coding
Mpeg-enc MPEG2 encoding of 4 frames (I-B-B-P frames) of the mei16v2rec bit stream. Properties of the bit stream
include frame rate of 30fps, bit rate of 5Mbps at the Main profile@Main level configuration. The image is
352x240 pixels in the 4:2:0 YUV chroma format, and is scaled to a 704x480 display. The quantization tables
and the motion estimation search parameters are set to the default parameters specified by the MPEG group.
Mpeg-dec MPEG2 decoding of the mei16v2rec video bit stream into separate YUV components.
Table 1. Summary of the benchmarks used in this study.
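As a concrete reference for the kinds of operations in Table 1, the blend kernel can be written as a scalar C loop. This is a minimal sketch, not the VSDK code: the function name is illustrative, and the final division by 255 (to keep the result within 8 bits) is our assumption about the scaling.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of the Blend kernel: per-pixel alpha blending of
   src1 and src2 under a per-pixel alpha image. The division by 255
   keeps the result in the unsigned 8-bit range (assumed scaling;
   the exact VSDK rounding may differ). */
void blend(const uint8_t *src1, const uint8_t *src2,
           const uint8_t *alpha, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = alpha[i];
        dst[i] = (uint8_t)((a * src1[i] + (255 - a) * src2[i]) / 255);
    }
}
```

Loops of exactly this shape, with no loop-carried dependences and narrow data types, are the ones that map well onto the packed VIS instructions discussed in Section 2.2.2.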
2.1.1 Image Processing
Our image processing benchmarks are taken from the Sun
VIS Software Development Kit (VSDK), which includes 14
image processing kernels. These kernels include common
image processing tasks such as one-band and three-band
(i.e., channel) alpha blending (used in image compositing),
single-limit and double-limit thresholding (used in chroma-
keying, image masking, and blue screening), and functions
such as general and separable convolution, copying, inver-
sion, addition, dot product, and scaling (used in the core
of many image processing codes like blurring, sharpening,
edge detection, embossing, etc.). We study all 14 of the
VSDK kernels, but due to space constraints, we report re-
sults for only 6 representative benchmarks (addition, blend,
conv, dotprod, scaling, and thresh).
2.1.2 Image Source Coding
We focus on the Joint Photographic Experts Group (JPEG)
standard and study the performance of the Release 6a codec
(encoder/decoder) from the Independent JPEG Group. We
study two different commonly used codecs specified in the
standard, a progressive JPEG codec (cjpeg encoder and
djpeg decoder), and a non-progressive JPEG codec (cjpeg-
np encoder and djpeg-np decoder).
The JPEG encoding process consists of a number of
phases many of which exploit properties of the human vi-
sual system to reduce the number of bits required to spec-
ify the image. First, the color conversion and chroma-
decimation phases convert the source image from a 24-bit
RGB representation domain to a 12 bit 4:2:0 YUV repre-
sentation. Next, a linear DCT image transform phase con-
verts the image into the frequency domain. The quantiza-
tion phase then scales the frequency domain values by a
quantization value (either constant or variable). The zig-
zag scanning and variable-length (Huffman) coding phases
then reorder the resulting data into streams of bits and en-
code them as a stream of variable-length symbols based on
statistical analysis of the frequency of symbols.
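The quantization phase above can be sketched in C as follows. This is an illustrative sketch only: the function name, the flat 8x8 block layout, and the truncating division are our assumptions, not the IJG codec's exact rounding rules.

```c
#include <stdint.h>

/* Sketch of the JPEG quantization phase: each of the 64 DCT
   coefficients of an 8x8 block is divided by the corresponding
   entry of a quantization table. C's integer division truncates
   toward zero; the actual codec's rounding differs in detail. */
void quantize_block(const int16_t dct[64], const uint8_t qtable[64],
                    int16_t out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = (int16_t)(dct[i] / qtable[i]);
}
```

Because most high-frequency coefficients quantize to zero, the subsequent zig-zag scan and Huffman coding can encode the block in few bits.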
Progressive image compression uses a compression al-
gorithm that performs multiple Huffman coding passes on
the image to encode it as multiple scans of increasing pic-
ture quality (leading to the perception of gradual focusing
of images seen on many web pages).
The decoding process performs the inverse of the opera-
tions for the encoding process in the reverse order to obtain
the original image from the compressed image.
2.1.3 Video Source Coding
We focus on the MPEG2 video coding standard from the
Moving Picture Experts Group, and study the performance of the
version 1.1 codec from the MPEG Software Simulation
Group.
The first part of the video compression process consists
of spatial compression similar to that described for JPEG
Processor speed 1 GHz
Issue width 4-way
Instruction window size 64
Memory queue size 32
Branch prediction
Bimodal agree predictor size 2K
Return-address stack size 32
Taken branches per cycle 1
Simultaneous speculated branches 16
Functional unit counts
Integer arithmetic units 2
Floating-point units 2
Address generation units 2
VIS multipliers 1
VIS adders 1
Functional unit latencies (cycles)
Default integer/address generation 1/1
Integer multiply/divide 7/12
Default floating point 4
FP moves/converts/divides 4/4/12
Default VIS 1
VIS 8-bit loads/multiply/pdist 1/3/3
Table 2. Default processor parameters.
Cache line size 64 bytes
L1 data cache size (on-chip) 64 KB
L1 data cache associativity 2-way
L1 data cache request ports 2
L1 data cache hit time 2 ns
Number of L1 MSHRs 12
L2 cache size (off-chip) 128K
L2 cache associativity 4-way
L2 request ports 1
L2 hit time (pipelined) 20 ns
Number of L2 MSHRs 12
Max. outstanding misses per MSHR 8
Total memory latency for L2 misses 100 ns
Memory interleaving 4-way
Table 3. Default memory system parameters.
and includes the color conversion, chroma decimation,
frequency transformation, quantization, zig-zag coding,
and run-length coding phases. Additionally, MPEG2 has
an inter-frame predictive-compression motion-estimation
phase that uses difference vectors to encode temporal redun-
dancy between macroblocks in a frame and macroblocks in
the following and preceding frames. Motion estimation is
the most compute-intensive part of mpeg-encode.
The video decompression process performs the inverse
of the various encode operations in reverse order to get the
decoded bit stream from the input compressed video. The
mei16v2 bit stream is already in the YUV format, and con-
sequently, our MPEG simulations do not go through the
color conversion phase discussed in Section 2.1.2.
2.2 Architectures Modeled
2.2.1 Processor and Memory System
We study two processor models: an in-order processor
model (similar to the Compaq Alpha 21164, Intel Pen-
tium, and Sun UltraSPARC-II processors) and an out-of-
order processor model (similar to the Compaq Alpha 21264,
HP PA8000, IBM PowerPC, Intel Pentium Pro, and MIPS
R10000 processors). Both the processor models support
non-blocking loads and stores.
For the experiments with software prefetching, the pro-
cessor models provide support for software-controlled non-
binding prefetches into the first-level cache.
The base system uses a 64KB two-way associative first-
level (L1) write-back cache and a 128KB 4-way associa-
tive second-level (L2) write-back cache. Section 4.1 dis-
cusses the impact of varying the cache sizes. All the caches
are non-blocking and allow support for multiple outstand-
ing misses. At each cache, 12 miss status holding regis-
ters (MSHRs) reserve space for outstanding cache misses
and combine a maximum of 8 multiple requests to the same
cache line.
Tables 2 and 3 summarize the parameters used for the
processor and memory subsystems. When studying the per-
formance of a 1-way issue processor, we scale the number
of functional units to 1 of each type. The functional unit
latencies were chosen based on the Alpha 21264 processor.
All functional units are fully pipelined except the floating-
point divide (non-pipelined).
2.2.2 VIS Media ISA Extensions
The VIS media ISA extensions to the SPARC V9 architec-
ture are a set of instructions targeted at accelerating media
processing [10, 23]. Both our in-order and out-of-order pro-
cessor models include support for VIS.
The VIS extensions define the packed byte, packed word,
and packed double data types, which allow concurrent oper-
ations on eight bytes, four words (16 bits each), or two dou-
ble words of fixed-point data in a 64-bit register. These data
types allow VIS instructions to exploit single-instruction-
multiple-data (SIMD) parallelism at the subword level.
Most of the VIS instructions operate on packed words or
packed doubles; loads, stores, and pdist instructions op-
erate on packed bytes. Many of the VIS instructions make
implicit assumptions about rounding and the number of sig-
nificant bits in the fixed-point data. Hence, their use requires
ensuring that they do not lead to incorrect outputs. We next
provide a short overview of the VIS instructions (summa-
rized in Table 4).
Packed arithmetic and logical operations. The packed
arithmetic VIS instructions allow SIMD-style parallelism to
be exploited for add, subtract, and multiply instructions. To
Packed arithmetic and logical operations
Packed addition
Packed subtraction
Packed multiplication
Logical operations
Subword rearrangement and realignment
Data packing and expansion
Data merging
Data alignment
Partitioned compares and edge operations
Partitioned compares
Mask generation for edge effects
Memory-related operations
Partial stores
Short loads and stores
Blocked loads and stores
Special-purpose operations
Pixel distance computation
Array address conversion for data reuse
Access to the graphics status register
Table 4. Classification of VIS instructions.
minimize implementation complexity, VIS uses a pipelined
series of two 8x16 multiplies and one add instruction to em-
ulate packed 16x16-bit multiplication. The VIS logical in-
structions allow logical operations on the floating-point data
path.
Subword rearrangement and alignment. To facilitate
conversion between different data types, VIS supports sub-
word rearrangement and alignment using pack, expand,
merge (interleave), and align instructions. The subword re-
arrangement instructions also include support for implicitly
handling saturation arithmetic (limiting data values to the
minimum or maximum instead of the default wrap-around).
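The saturation behavior described above can be modeled by a small scalar helper. This is an illustrative sketch of clamping a 16-bit intermediate to the unsigned 8-bit range, not the full semantics of the VIS pack instructions.

```c
#include <stdint.h>

/* Sketch of the saturation applied during subword packing:
   a 16-bit intermediate value is clamped to [0, 255] instead
   of wrapping around, so an overflowing pixel computation
   saturates to white rather than wrapping to a dark value. */
static inline uint8_t sat_pack_u8(int16_t x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}
```

Handling saturation implicitly in hardware removes the compare-and-branch sequences that a scalar implementation would need per pixel.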
Partitioned compares and edge operations. For branches,
VIS supports a partitioned compare that performs four 16-
bit or two 32-bit compares in parallel to produce a mask that
can be used in subsequent instructions. VIS also supports
the edge instruction to generate masks for partial stores that
can eliminate special branch code to handle boundary con-
ditions in media processing applications.
Memory-related operations. For memory instructions,
VIS supports partial stores that selectively write to parts of
the 64-bit output based on an input mask. Short loads
and stores transfer 1 or 2 bytes of memory to the register
file. Blocked loads and stores transfer 64 bytes of data
between memory and a group of eight consecutive VIS reg-
isters without causing allocations in the cache.
Special-purpose operations. The pixel distance computa-
tion (pdist) instruction is primarily targeted at motion es-
timation and computes the sum of the absolute differences
between corresponding 8-bit components in two packed
bytes. The array instruction is mainly targeted at 3D graph-
ics rendering applications and converts 3D fixed-point co-
ordinates into a blocked byte address that allows for greater
cache reuse. VIS also defines instructions to manipulate
the graphics status register, a special-purpose register that
stores additional data for various media instructions.
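The pdist computation can be modeled in scalar C as follows. This is an illustrative sketch: the real instruction operates on two packed-byte registers and, in the VIS usage for motion estimation, adds the result into a running accumulator.

```c
#include <stdint.h>

/* Scalar model of pdist: the sum of absolute differences (SAD)
   between corresponding 8-bit components of two 8-byte groups.
   Motion estimation evaluates this over candidate macroblock
   positions and picks the one with the smallest SAD. */
static uint32_t pdist8(const uint8_t a[8], const uint8_t b[8])
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    return sum;
}
```

Collapsing eight subtract/absolute-value/add sequences into one instruction is why pdist gives mpeg-enc much of its VIS benefit.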
Overall, the functionality discussed above for VIS is sim-
ilar to that of fixed-point media ISA extensions in other
general-purpose processors (e.g., MAX [12], MMX [18],
MVI [4], MDMX [9], AltiVec [19]). The various ISA ex-
tensions mainly differ in the number, types, and latencies of
the individual instructions (e.g., MMX implements direct
support for 16x16 multiply), whether they are implemented
in the integer or floating-point data path, and in the width
of the data path. The most different ISA extension, the pro-
posed PowerPC AltiVec ISA, adds support for a separate
128-bit vector multimedia unit in the processor.
Our VIS implementation is closely modeled after
the UltraSPARC-II implementation and operates on the
floating-point register file with latencies comparable to the
UltraSPARC-II [23] (Table 2).² The increase in chip area
associated with the VIS instructions was estimated to be less
than 3% for the UltraSPARC-II [10].
2.3 Methodology
2.3.1 Simulation Environment
We use the RSIM simulator [16] to simulate the in-order and
out-of-order processors described in Section 2.2. RSIM is
a user-level execution-driven simulator that models the pro-
cessor pipeline and memory hierarchy in detail including
contention for all resources. To assess the impact of not
modeling system level code, we profiled the benchmarks on
an UltraSPARC-II-based Sun Enterprise server. We found
that the time spent on operating system kernel calls is less
than 2% on all the benchmarks. The time spent on I/O is
less than 15% on all the benchmarks except mpeg-dec. This
benchmark experiences an inflated I/O component (45%)
because of its high frequency of file writes. In a typical sys-
tem, however, these writes would be handled by a graphics
accelerator, significantly reducing this component. Since
our applications have small instruction footprints, our sim-
ulations assume all instructions hit in the instruction cache.
All the applications³ were compiled with the SPARC
SC4.2 compiler with the -xO4 -xtarget=ultra1/170
-xarch=v8plusa -dalign options to produce optimized code
for the in-order UltraSPARC processor.
²Our VIS multiplier has a lower latency compared to the integer (64-
bit) multiplier because it operates on 16-bit data.
³We changed the 14 image processing kernels from the Sun VSDK to
skew the starting addresses of concurrent array accesses and unroll small
innermost loops. This reduced cache conflicts and branch mispredictions
leading to 1.2X to 6.7X performance benefits. To facilitate modifying the
applications for VIS, we replaced some of the key routines in the JPEG
and MPEG applications with equivalent routines from the Sun MediaLib
library.
2.3.2 VIS Usage Methodology
We are not aware of any compiler that automatically modi-
fies media processing applications to use media ISA exten-
sions. For our experiments studying the impact of VIS, we
manually modified our benchmarks to use VIS instructions
based on the methodology detailed below.
We profiled the applications to identify key procedures
and manually examined these procedures for loops that sat-
isfied the following three conditions: (1) The loop body
should have no loop-carried dependences or control depen-
dences that cannot be converted to data dependences (other
than the loop branch). (2) The key computation in the loop
body must be replaceable with a set of equivalent fixed-
point VIS instructions. The loss in accuracy in this stage,
if any, should be visually imperceptible. (3) The poten-
tial benefit from VIS should be more than the overhead
of adding VIS; VIS overhead can result from subword re-
arrangement instructions to convert between packed data
types or from alignment-related instructions.
For loops that satisfied the above criteria, we strip-mined
or unrolled the loop to isolate multiple iterations of the loop
body that we then replaced with equivalent VIS instruc-
tions. We used the inline assembly-code macros from the
Sun VSDK for the VIS instructions; this minimizes code
perturbation and allows the use of regular compiler opti-
mizations. Wherever possible, we tried to use procedures
available from the Sun VSDK and the Sun MediaLib
library routines that were already optimized for VIS.
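The strip-mining step described above can be illustrated with the addition kernel (which, per Table 1, averages two pixel values). The 8-pixel chunk width matches the packed-byte type; in the VIS version, the inner chunk body would be replaced by packed loads, a packed add, and a packed store. The code is a sketch, not the VSDK kernel.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of strip-mining for VIS conversion: the loop is split
   into 8-pixel chunks (the packed-byte width) plus a scalar
   remainder for trailing pixels. Only the chunked body is a
   candidate for replacement with packed VIS instructions. */
void add_images(const uint8_t *a, const uint8_t *b,
                uint8_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)          /* VIS-replaceable chunk */
        for (size_t j = 0; j < 8; j++)
            dst[i + j] = (uint8_t)((a[i + j] + b[i + j]) / 2);
    for (; i < n; i++)                  /* scalar remainder */
        dst[i] = (uint8_t)((a[i] + b[i]) / 2);
}
```

The edge instruction described in Section 2.2.2 can often absorb the remainder loop as well, by masking the partial store at the image boundary.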
Our benchmarks use all the VIS instructions except for
the array and blocked load/store instructions. Array in-
structions are targeted at 3D array accesses in graphics
loops, and are not applicable for our applications. Blocked
loads and stores are primarily targeted at transfers of large
blocks of data between buffers without affecting the cache
(e.g., in operating system buffer management, networking,
memory-mapped I/O). We did not use these instructions
since the Sun VSDK does not provide inline assembly-code
macros to support them. The alternative of hand-coded as-
sembly could result in lower performance since it is hard to
emulate the compiler optimizations associated with modern
superscalar processors by hand [22]. Note that both the ar-
ray and blocked load/store instructions are unique to VIS
and are not supported by other general-purpose ISA exten-
sions.
2.3.3 Software Prefetching Algorithm
We studied the applicability of software prefetching for the
benchmarks where the cache miss stall time is a significant
component of the total execution time (>20%). We identi-
fied the memory accesses that dominate the cache miss stall
time, and inserted prefetches by hand for these accesses.
We followed the well known software prefetching compiler
algorithm developed by Mowry et al. [14].
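The insertion pattern can be sketched as follows, using GCC's __builtin_prefetch as a stand-in for the non-binding prefetch into the first-level cache; the prefetch distance of one cache line is an illustrative choice, not the paper's tuned value.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of Mowry-style software prefetching for a streaming
   access: issue a non-binding prefetch some distance ahead so
   the line arrives by the time it is used. Prefetches past the
   end of the array are harmless because they are non-faulting. */
#define PDIST 64  /* elements ahead: one 64-byte line of pixels */

uint32_t sum_stream(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#ifdef __GNUC__
        __builtin_prefetch(&src[i + PDIST], 0 /* read */, 0 /* streaming */);
#endif
        sum += src[i];
    }
    return sum;
}
```

In practice the prefetch would be hoisted so that one prefetch covers all accesses to a line, rather than being issued every iteration as in this simplified sketch.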
2.3.4 Performance Metrics
We use the execution time of the system as the primary met-
ric to evaluate the performance of the system, while also re-
porting the individual components of execution time. With
out-of-order processors, an instruction can potentially be
overlapped with instructions preceding and following it. We
therefore use the following convention to identify the differ-
ent components of execution time. At every cycle, the frac-
tion of instructions retired that cycle relative to the maximum
retire rate is attributed to busy time; the remaining fraction
is attributed as stall time to the first instruction that could
not be retired that cycle. We also study other metrics such
as dynamic instruction counts, branch misprediction rates,
cache miss rates, MSHR occupancies, and prefetch counts
for further insights into the system behavior.
3 Improving Processor Performance
For each benchmark, Figure 1 presents execution times
for three variations of our base architecture, each without
VIS (the first set of three bars) and with VIS (the second
set of three bars). The three architecture variations are (i)
in-order and single issue, (ii) in-order and 4-way issue, and
(iii) out-of-order and 4-way issue. On the VIS-enhanced
architecture, we use the VIS-enhanced version of the appli-
cation as mentioned in Section 2. The execution times are
normalized to the time with the in-order single-issue pro-
cessor. For all the benchmarks, the execution time is di-
vided into the busy component, the functional unit stall (FU
stall) component, and the memory component. The mem-
ory component is shown divided into the L1 miss and L1 hit
components.
3.1 Impact of Conventional ILP Features
This section focuses on the system without the VIS me-
dia ISA extensions (the left three bars for each benchmark
in Figure 1).
Overall results. Both multiple issue and out-of-order issue
provide substantial reductions in execution time for most of
our benchmarks. Compared to a single-issue in-order pro-
cessor, on average, multiple issue improves performance
by a factor of 1.2X (range of 1.1X to 1.4X), while the com-
bination of multiple issue and out-of-order issue improves
performance by a factor of 3.1X (range of 2.3X-4.2X).
Figure 1. Performance of image and video benchmarks.
[Bar chart: for each benchmark, the three left bars (without VIS) and the three right bars (with VIS) show the 1-way in-order, 4-way in-order, and 4-way out-of-order (ooo) processors, normalized to the in-order single-issue time; each bar is divided into Busy, FU stall, L1 miss, and L1 hit components. The normalized totals are:]

                 Without VIS                  With VIS
              1-way  4-way  4-way(ooo)  1-way  4-way  4-way(ooo)
  Addition    100.0   71.2    43.6       39.8   36.7    15.4
  Blend       100.0   77.5    33.3       28.7   26.9    10.1
  Conv        100.0   87.5    24.1       15.9   12.5     5.7
  Dotprod     100.0   88.2    32.4       74.6   68.0    28.7
  Scaling     100.0   88.8    26.0       18.3   15.9    10.3
  Thresh      100.0   89.6    37.7       32.0   26.5    14.9
  Cjpeg       100.0   84.4    29.3       87.6   74.2    26.6
  Djpeg       100.0   78.8    33.2       69.2   53.8    26.8
  Cjpeg-np    100.0   78.0    30.1       66.8   50.6    23.1
  Djpeg-np    100.0   74.7    27.6       58.9   41.1    18.9
  Mpeg-enc    100.0   89.6    43.2       33.5   26.2    12.2
  Mpeg-dec    100.0   77.3    32.0       65.0   49.2    24.4

Analysis. Compared to the single issue processor, we find
that multiple issue achieves most of its benefits by reducing
the busy CPU component of execution time. Data, control,
and structural dependences prevent the CPU component from
attaining an ideal speedup of 4 from a 4-way issue processor,
reflected in the increased functional unit and L1 hit memory
stall time.
Some of the benchmarks see additional memory laten-
cies when the number of outstanding misses to one cache
line exceeds the maximum of 8 accesses that each miss
status holding register (MSHR) can track. This is caused by
the heavy use of small data types in media applications,
which leads to a high frequency of accesses to each cache
line (e.g., 64 pixel writes in a 64-byte line). Since the
processors do not stall on writes, this leads to a backup of
multiple writes in benchmarks with small loop bodies (e.g.,
addition, cjpeg, djpeg).
This backup leads to contention for the MSHR that even-
tually prevents other accesses from being serviced at the
cache.
Out-of-order issue, on the other hand, improves perfor-
mance by reducing both functional unit stall time and mem-
ory stall time. A large fraction of the stall times due to data,
control, and structural dependences, as well as MSHR con-
tention, is now overlapped with other useful work. This is
seen in the reduction in the FU stall and L1 hit components
of execution time. Additionally, out-of-order issue can bet-
ter exploit the non-blocking loads feature of the system by
allowing the latency of multiple long-latency load misses to
be overlapped with one another. Our results examining the
MSHR occupancies at the cache indicate that while there is
increased load miss overlap in all 12 benchmarks, only
2 to 3 misses are overlapped in most cases. The total capac-
ity of 12 MSHRs is never fully utilized for load misses in
any of our benchmarks.
Overall, the impact of the various ILP features is qualita-
tively consistent with that described in previous studies for
scientific and database workloads. Quantitatively, these ILP
features are substantially more effective for the image and
video benchmarks than for previously reported online trans-
action processing (OLTP) workloads [21], and comparable
in benefit to previously reported scientific and decision sup-
port system (DSS) workloads [1, 17, 21].
It must be noted that the performance of the in-order is-
sue processor is dependent on the quality of the compiler
used to schedule the code. Our experiments use the com-
mercial SPARC SC4.2 compiler with maximum optimiza-
tions turned on for the in-order UltraSPARC processor. To
try to isolate compiler scheduling effects, we studied two
other processor configurations with single-cycle functional
unit latencies and functional unit latencies comparable to
the UltraSPARC processor. In both these configurations,
our results continued to be qualitatively similar; the out-
of-order processor continues to significantly outperform the
in-order processor. The impact of future, more advanced,
compiler optimizations, however, is still an open question.
Interestingly, a recent position paper [6] on the impact of
multimedia workloads on general-purpose processors con-
jectures that complex out-of-order issue techniques devel-
oped for scientific and engineering workloads (e.g., SPEC)
may not be needed for multimedia workloads. Our results
show that, on the contrary, out-of-order issue can provide
significant performance benefits for the image and video
workloads.
3.2 Impact of VIS Media ISA Extensions
This section discusses the performance impact of the
VIS ISA extensions (comparing the left-hand three bars and
right-hand three bars for each benchmark in Figure 1).
3.2.1 Overall Results
The VIS media ISA extensions provide significant perfor-
mance improvements for all the benchmarks (factors of
1.1X to 4.0X for the out-of-order system, 1.1X to 7X across
all configurations). On average, the addition of VIS im-
proves the performance of the single-issue in-order system
by a factor of 2.0X, the performance of the 4-way-issue in-
order system by a factor of 2.1X, and the performance of
the 4-way issue out-of-order system by a factor of 1.8X.
Multiple issue and out-of-order issue are beneficial even
with VIS. On average, with VIS, compared to a single-issue
in-order processor, multiple issue achieves a factor of 1.2X
performance improvement, while the combination of mul-
tiple issue and out-of-order issue achieves a factor of 2.7X
performance improvement. The reasons for these perfor-
mance benefits from ILP features are the same as for the
systems without VIS.
3.2.2 Benefits from VIS
Figure 2 presents some additional data showing the distribu-
tion of the dynamic (retired) instructions for the 4-way out-
of-order processor without and with VIS, normalized to the
former. Each bar divides the instructions into the Functional
unit (FU, combines ALU and FPU), Branch, Memory, and
VIS categories. The use of VIS instructions provides a sig-
nificant reduction in the dynamic instruction count for all
the benchmarks. The reductions in the dynamic instruc-
tion count correlate well with the performance benefits from
VIS. We next discuss the sources for the reductions in the
dynamic instructions.
Reductions in FU instructions. The VIS packed arith-
metic and logical instructions allow multiple (typically
four) arithmetic instructions to be replaced with one VIS
instruction. Consequently, all the benchmarks see signif-
icant reductions in the FU instructions with correspond-
ing, smaller, increases in the VIS instruction count. Ad-
ditionally, the SIMD VIS instructions replace multiple it-
erations in the original loop with one equivalent VIS it-
eration. This reduces iteration-specific loop-overhead in-
structions that increment index and address values and com-
pute branch conditions, further reducing the FU instruction
count.
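To illustrate this effect, the following is a behavioral sketch in Python (purely for exposition; the function names and the 16-bit lane width are ours, not VIS mnemonics): one partitioned add replaces four scalar adds by operating on four 16-bit lanes packed into a 64-bit word, with no carry propagating across lane boundaries.

```python
def pack16(*lanes):
    """Pack four 16-bit values into one 64-bit word (lane 0 lowest)."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def packed_add16(a, b):
    """Emulate a 4-way partitioned add: four 16-bit lanes in one 64-bit
    word are added in a single operation, with per-lane wraparound (no
    carry crosses a lane boundary)."""
    mask = 0xFFFF
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane_sum << shift
    return result
```

A loop over an image then advances four pixels per iteration, which is also what removes the per-iteration index-increment and branch-condition instructions mentioned above.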
Reductions in branch instructions. All the benchmarks
use the edge masking and partial store instructions to elim-
inate testing for edge boundaries and selective writes. They
also use loop unrolling when replacing multiple iterations
with one equivalent VIS iteration. These lead to a reduc-
tion in the branch instruction count for all the benchmarks.
For some applications, branch instruction counts are also
reduced because of the elimination of the code to explic-
itly perform saturation arithmetic (mainly in conv^4 and the
JPEG applications), and the use of partitioned SIMD com-
pares (mainly in thresh).
Many of the branches eliminated are hard-to-predict
branches (e.g., saturation, thresholding, selective writes),
leading to significant improvements in the hardware branch
misprediction rates for some of our benchmarks (the branch
misprediction rate decreases from 10% to 0% for conv and
from 6% to 0% for thresh).
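The three branch-elimination idioms above (saturation, thresholding via partitioned compares, and selective writes via partial stores) can be sketched behaviorally as follows (Python for illustration only; the names are ours, and real VIS operates on packed registers rather than Python lists):

```python
def saturating_add8(a, b):
    # Unsigned 8-bit add that clamps at 255 instead of wrapping.
    # Packed VIS arithmetic applies this clamp in hardware; scalar code
    # needs an explicit compare-and-branch per pixel.
    return min(a + b, 255)

def threshold_masked(pixels, t):
    # Branch-free thresholding in the style of a partitioned SIMD
    # compare: -(p > t) is an all-ones/all-zeros mask selecting 255 or 0.
    return [-(p > t) & 255 for p in pixels]

def edge_mask(start, end, width=8):
    # Bitmask for a partial store: lane i is written only if pixel i of
    # the chunk lies inside [start, end). VIS edge instructions derive
    # such masks from addresses, removing per-pixel boundary tests.
    return sum(1 << i for i in range(width) if start <= i < end)

def partial_store(dst, offset, values, mask):
    # Write only the lanes selected by the mask (a selective write).
    for i, v in enumerate(values):
        if (mask >> i) & 1:
            dst[offset + i] = v
```

In each case a data-dependent branch per pixel is replaced by straight-line data movement, which is why the eliminated branches tend to be exactly the hard-to-predict ones.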
Reductions in memory instructions. With VIS, memory
accesses operate on packed data as opposed to individual
media data. Consequently, most of the benchmarks see sig-
nificant reductions in the number of memory instructions
(and associated cache accesses). This reduces the MSHR
contention discussed in Section 3.1.
Most of the memory instructions eliminated are cache
hits, without a proportional decrease in the number of
misses, causing higher L1 cache miss rates with VIS. The
higher miss rate and the lower instruction count allow addi-
tional load misses to appear together within the instruction
window and be overlapped with each other. However, the
system still rarely sees more than 3 load misses overlapped
concurrently.

^4 The original source code from the Sun VSDK checks for saturation
only in the conv code. The add, blend, and dotprod kernels are written in
the non-saturation mode. These could, however, potentially be rewritten to
check for saturation, in which case they would also see similar benefits.

[Figure 2. Impact of VIS on dynamic (retired) instruction count. Relative
to the base (100), the VIS versions retire 26.2 (addition), 17.6 (blend),
25.4 (conv), 88.5 (dotprod), 18.0 (scaling), 30.5 (thresh), 85.5 (cjpeg),
66.3 (djpeg), 66.9 (cjpeg-np), 58.1 (djpeg-np), 32.7 (mpeg-enc), and 66.4
(mpeg-dec). Each bar is divided into FU, Branch, Memory, and VIS
instruction categories.]
Pixel distance computation for mpeg-enc. mpeg-enc
achieves additional benefits from the special-purpose pixel
distance computation (pdist) instruction in the motion es-
timation phase. The pdist instruction allows a sequence
of 48 instructions to be reduced to one instruction [23], sig-
nificantly reducing the FU, branch, and memory instruction
counts. The elimination of hard-to-predict branches to per-
form comparisons and saturation improves the branch mis-
prediction rate from 27% to 10%.
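The operation pdist performs is a sum-of-absolute-differences accumulation; a behavioral sketch (our naming; the actual instruction operates on eight packed 8-bit pixels and a 64-bit accumulator in one step):

```python
def pdist(block_a, block_b, acc=0):
    """Accumulate the sum of absolute differences (SAD) between two
    rows of 8-bit pixels into acc -- the motion-estimation metric.
    In scalar code each pixel pair costs loads, a subtract, an
    absolute value (a compare/branch), and an add; the VIS pdist
    instruction performs the whole row at once."""
    for a, b in zip(block_a, block_b):
        acc += abs(a - b)
    return acc
```

Because the absolute value is computed internally, the data-dependent compare per pixel disappears, which is consistent with the branch misprediction improvement reported above.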
3.2.3 Limitations of VIS
Examining the variation of performance benefits across the
benchmarks, we observe that the JPEG applications, mpeg-
dec, and dotprod exhibit relatively lower performance ben-
efits (factors of 1.1X to 1.5X) compared to the remaining
benchmarks (factors of 2.8X to 4.2X). We next discuss fac-
tors limiting the benefits from VIS.
Inapplicability of VIS. The VIS media ISA extensions are
not applicable to a number of key procedures in the JPEG
applications and mpeg-dec. For example, the JPEG appli-
cations (especially the progressive versions) spend a large
fraction of their time in the variable-length Huffman cod-
ing phase. This phase is inherently sequential and operates
on variable-length data types, and consequently cannot be
optimized using VIS instructions.^5 Other examples of code
segments in the JPEG and MPEG applications where VIS
could not be applied include bit-level input/output stream
manipulation, scatter-gather addressing, quantization, and
saturation arithmetic operations not embedded in a loop.
VIS overhead. All our benchmarks use subword rear-
rangement and alignment instructions to get the data into
a form that the VIS instructions can operate on (e.g., pack-
ing/unpacking between packed-byte pixels and packed-
word operands). This results in extra overhead that limits
the performance benefits from VIS (on average, for our
benchmarks, 41% of the VIS instructions are for subword
rearrangement and alignment). Overhead also increases
when multiple VIS instructions are needed to emulate one
operation (e.g., the 16x16 multiply in dotprod) or when the
data need to be reordered to exploit SIMD (e.g., byte
reordering in the color conversion phase in JPEG).

^5 It is instructive to note that many media processors (e.g., the Mit-
subishi VLIW Media Processor and Samsung Media Signal Processor)
have a special-purpose hardware unit to handle the variable-length coding.
Limited parallelism and scheduling constraints. Most of
the packed arithmetic instructions operate only on packed
words or packed double words. This ensures enough bits to
maintain the precision of intermediate values. However, this
limits the maximum parallelism to 4 (on 16-bit data types),
even in cases when the operations are known to be per-
formed on smaller data types (e.g., 8-bit pixels). The limit
on the SIMD parallel path,^6 in combination with contention
for VIS functional units, limits the benefits from VIS on
some of our benchmarks (most significantly in mpeg-enc).
Cache miss stall time. As discussed earlier, the reductions
in memory instructions mainly occur for cache hits. The
VIS instructions do not directly target cache misses though
there are indirect benefits associated with increased load
miss overlap due to instruction count reduction (discussed
in Section 3.2.2).
3.3 Combination of ILP and VIS
The combination of conventional ILP features and VIS
extensions achieves an average factor of 5.5X perfor-
mance improvement (range of 3.5X to 18X) over the base
single-issue in-order processor. The benefits from VIS are
achieved with a much smaller increase in the die area com-
pared to the ILP features.
On the base single-issue in-order processor, all the
benchmarks are primarily compute-bound. With ILP fea-
tures and VIS extensions, cjpeg-np, djpeg-np, and mpeg-enc
continue to spend most of their execution time in the
processor sub-system (87% to 97%). Five of the image
processing kernels, however, now spend 55% to 66% of
their total time in memory stalls. The strong compute-centric
performance benefits from ILP features and VIS extensions
shift the bottleneck to the memory sub-system for these
benchmarks. The remaining 4 applications (conv, cjpeg,
djpeg, and mpeg-dec) spend between 20% and 30% of their
total time on memory stalls.

^6 The MIPS MDMX provides support for a larger-size accumulator that
allows greater parallelism without losing the precision of the intermediate
result [9]. The PowerPC AltiVec supports a larger 128-bit data path to
increase the parallelism [19].
4 Improving Memory System Performance
This section studies memory system performance for
the benchmarks. Section 4.1 discusses the effectiveness of
caches, and Section 4.2 discusses the impact of software
prefetching.
4.1 Impact of Caches
Impact of varying L2 cache size. We varied the L2 cache
size from 128K to 2M, keeping the L1 cache fixed at 64K.
Our results (not shown here due to lack of space) showed
that increasing the size of the L2 cache has no impact on
the performance of the 6 image processing kernels and the
cjpeg-np and djpeg-np applications. The remaining four
applications, cjpeg, djpeg, mpeg-enc, and mpeg-dec, reuse
data, but the cache size needed to exploit the reuse depends
on the size of the display. For our input image sizes, a 2M
L2 cache captures the entire working sets of all four
benchmarks and provides a 1.1X to 1.2X performance
improvement over the default 128K L2 cache. With the 2M
cache size, memory stall time is between 7% and 9% for all
the applications and is dominated by L1 hit time (due to MSHR
contention) and L2 hit time.
The image processing kernels have streaming data ac-
cesses to a large image buffer with no reuse and low com-
putation per cache miss. Consequently, they exhibit high
memory stall times unaffected by larger caches. cjpeg-np
and djpeg-np do not see any variation in performance with
larger caches because of their negligible memory compo-
nents. These applications implement a blocked pipeline al-
gorithm that performs all the computation phases on 8x8-
sized blocks at a time, reducing the bandwidth requirements
and increasing the computation per miss. The cjpeg and
djpeg applications differ from their non-progressive coun-
terparts in the progressive coding phase where they perform
a multi-pass traversal of the buffer storing the DCT coeffi-
cients. The low computation per miss in this phase com-
bined with the reuse of the image-sized (1024x640 pixels)
buffer results in a 1.2X performance benefit from increas-
ing the cache size to 2M. Larger images would increase
the working set requiring larger caches. For example, a
1024x1024 image would require a 4M cache size. mpeg-
enc and mpeg-dec perform inter-block distance vector op-
erations and therefore need to operate on multiple image-
sized buffers (as opposed to block-sized buffers in cjpeg-np
and djpeg-np). The reuse of these 352x240 buffers across
the frames in the video leads to 1.1X performance bene-
fits with 512K (mpeg-dec) and 1M (mpeg-enc) cache sizes.
Larger image sizes would require larger caches; for exam-
ple, a 1024x1024 image would require almost a 12X in-
crease in cache size.
Impact of varying L1 caches. We also performed ex-
periments varying the size of the L1 cache from 1K to
64K while keeping the L2 cache fixed at 128K. Our results
showed that the L1 cache size had no impact on five of the
image processing kernels. On the remaining benchmarks,
a 64K L1 configuration outperforms the 1K L1 configura-
tion by factors of 1.1X to 1.3X; 4K-16K L1 caches achieve
within 3% of the performance of the 64K L1 cache con-
figuration. Small data structures other than the main data,
such as tables for convolution, quantization, color conver-
sion, and saturation clipping, are responsible for these small
first-level working sets. At 64K L1 caches, memory stall
time is mainly due to L1 hits (mainly related to MSHR con-
tention) or to L2 misses.
4.2 Impact of Software Prefetching
Figure 3 summarizes the execution time reductions from
software prefetching relative to the base system with VIS
(with 64K L1 and 128K L2 caches). We do not report
results for cjpeg-np, djpeg-np, and mpeg-enc since these
benchmarks spend less than 6% of their total time on L1
cache misses. Our results show that software prefetching
achieves high performance improvements for the six image
processing benchmarks (an average of 1.9X and a range of
1.4X to 2.5X). The cjpeg, djpeg, and mpeg-dec benchmarks
exhibit relatively small performance improvements. Over-
all, after applying software prefetching, all our benchmarks
revert to being compute bound.
For the image processing kernels, a significant fraction
of the prefetches are useful in completely or partially hid-
ing the latency of the cache miss with computation or with
other misses. The addition of software prefetching also in-
creases the utilization of cache MSHRs; in many of the im-
age processing kernels, more than 5 MSHRs are used for a
large fraction of the time. The remaining memory stall time
is mainly due to late prefetches and resource contention.
Late prefetches (prefetches that arrive after the demand ac-
cess) arise mainly because of inadequate computation in the
loop bodies to overlap the miss latencies. Contention for
resources occurs when multiple prefetches are outstanding
at a time. These effects are similar to those discussed in
previous studies with scientific applications for ILP-based
processors [20].
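The late-prefetch problem follows directly from the standard prefetch-distance calculation used by such algorithms (a sketch under our naming; the cycle counts in the usage example are hypothetical):

```python
from math import ceil

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Number of loop iterations ahead of the demand access at which a
    software prefetch must be issued so the line arrives in time.
    Short media loop bodies (small cycles_per_iteration) force large
    distances; when the loop cannot cover the latency, prefetches
    arrive late and the demand access still stalls."""
    return ceil(miss_latency_cycles / cycles_per_iteration)
```

For instance, a 100-cycle miss latency over an 8-cycle loop body requires prefetching 13 iterations ahead, whereas a 50-cycle body needs only 2; the image processing kernels sit near the first case.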
[Figure 3. Effect of software-inserted prefetching. Bars show execution
time normalized to the VIS system without prefetching (divided into Busy,
FU stall, L1 miss, and L1 hit components): addition 56.3, blend 53.2,
conv 72.3, dotprod 40.6, scaling 44.5, thresh 42.8, cjpeg 98.1, djpeg 98.1,
mpeg-dec 95.0.]

The other benchmarks (cjpeg, djpeg, and mpeg-dec) see
lower performance benefits primarily because the fraction
of memory stall time is relatively low and includes an L1
hit component (mainly due to MSHR contention). Soft-
ware prefetches do not address the L1 component. Sec-
ond, in cjpeg and djpeg, the prefetches are to memory lo-
cations that are indirectly addressed (of the form A[B[i]]).
Consequently, the prefetching algorithm is unable to dis-
tinguish between hits and misses and is constrained to is-
sue prefetches for all accesses. The resulting overhead due
to address calculation and cache contention limits perfor-
mance (seen as increased Busy and FU stall components^7).
Finally, as before, late prefetches and resource contention
also contribute to the lower benefits from prefetching.
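The indirect-access problem can be sketched as follows (illustrative names; a 64-byte line holding 64 one-byte pixels, as in Section 3.1): for a streaming access A[i] one prefetch per cache line suffices, but for A[B[i]] the target line is unknown until B[i] is loaded, so a prefetch and its address calculation must be issued on every iteration.

```python
def prefetch_slots_direct(n, line_pixels=64):
    """For a streaming access A[i], consecutive pixels share a cache
    line, so one prefetch every line_pixels iterations suffices."""
    return [i for i in range(n) if i % line_pixels == 0]

def prefetch_slots_indirect(n):
    """For A[B[i]], the line is only known after B[i] is loaded, so a
    prefetch (plus extra address arithmetic) is issued every iteration,
    whether the eventual access hits or misses."""
    return list(range(n))
```

The 64X difference in issued prefetches is the overhead that shows up as the increased Busy and FU stall components noted above.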
5 Related Work
Most of the papers discussing instruction-set extensions
for multimedia processing have focused on detailed descrip-
tions of the additional instructions and examples of their
use [4, 9, 12, 15, 18, 19, 23]. The performance character-
ization in these papers is usually limited to a few sample
code segments and/or a brief mention of the benefits an-
ticipated on larger applications. Eyre studies the applica-
bility of general-purpose processors for DSP applications;
however, the study only reports high-level metrics such as
MIPS, power efficiency, and cost [7].
Daniel Rice presents a detailed description of VIS and 8
image processing applications without and with VIS [22].
The study reports speedups of 2.7X to 10.5X on an actual
UltraSPARC-based system, but does not analyze the cause
of performance benefits, the remaining bottlenecks, or the
impact of alternative architectures.
Yang et al. look at the benefits of packed floating-point
formats and instructions for graphics but assume a per-
fect memory system [24]. Bhargava et al. study some
MMX-enhanced applications on Pentium-based sys-
tems; but again, no detailed characterization of performance
bottlenecks or the impact of other architectures is done [2].
^7 Some of the image processing kernels see a reduction in the CPU com-
ponent because of the reduction in instructions and better scheduling when
loops are unrolled for the prefetching algorithm [14].

Zucker et al. study MPEG video decode applications and
show the benefits from I/O prefetching, software restruc-
turing to use SIMD without hardware support, and profile-
driven software prefetching [25, 26]. However, the studies
assume a simplistic processor model with blocking loads
and do not study the effect of media ISA extensions. Bilas
et al. develop two parallel versions of the MPEG decoder
and present results for multiprocessor speedup, memory re-
quirements, load balance, synchronization, and locality [3].
Similar to our results, they also find that the miss rates for
352x240 images on “realistic” cache sizes are negligible.
6 Conclusions
Media processing is a workload of increasing importance
for desktop processors. This paper focuses on image and
video processing, an important class of media processing,
and aims to provide a quantitative understanding of the per-
formance of these workloads on general-purpose proces-
sors. We use detailed simulation to study 12 representa-
tive benchmarks on a variety of architectural configurations,
both with and without the use of Sun’s visual instruction set
(VIS) media ISA extensions.
Our results show that conventional techniques in current
processors to enhance ILP (multiple issue and out-of-order
issue) provide a factor of 2.3X to 4.2X performance im-
provement for the image and video benchmarks. The Sun
VIS media ISA extensions provide an additional 1.1X to
4.2X performance improvement. The benefits from VIS are
achieved with a much smaller increase in the die area com-
pared to the ILP features.
Our detailed analysis indicates the sources and limita-
tions of the performance benefits due to VIS. VIS is very
effective in exploiting SIMD parallelism using packed data
types, and can eliminate a number of potentially hard-to-
predict branches using instructions targeted towards satu-
ration arithmetic, boundary detection, and partial writes.
Special-purpose instructions such as pdist achieve high ben-
efits on the targeted application, but are too specialized to
use in other cases. Routines that are sequential and operate
on variable data types, VIS instruction overhead, cache miss
stall times, and the fixed parallelism in the packed arith-
metic instructions limit the benefits on the benchmarks.
On our base single-issue in-order processor, all the
benchmarks are primarily compute-bound. Conventional
ILP features and the VIS instructions together significantly
reduce the CPU component of execution time, making 5
of our image processing benchmarks memory-bound. The
memory behavior of these workloads is characterized by
large working sets and streaming data accesses. Increas-
ing the cache size has no impact on the image processing
kernels and the non-progressive JPEG applications. This is
particularly interesting considering current trends towards
large on-chip and off-chip caches. The remaining bench-
marks require relatively large cache sizes (dependent on the
display sizes) to exploit data reuse, but derive less than 1.2X
performance benefits with the larger caches. Software-
inserted prefetching achieves 1.4X to 2.5X performance
benefits on the image processing kernels where memory
stall time is significant.
With the addition of software prefetching, all our bench-
marks revert to being compute-bound. Architectural opti-
mizations that improve computation time (e.g., multipro-
cessing) may be useful to exploit greater parallelism. Such
efforts are likely to expose the memory system bottleneck
yet again, possibly requiring additional novel memory sys-
tem techniques beyond conventional software prefetching.
In the future, we plan to explore new architectural tech-
niques for general-purpose processors to support media pro-
cessing workloads. We also plan to expand our study to in-
clude other media processing applications such as speech,
audio, communication, and natural language interaction.
7 Acknowledgments
We would like to thank Behnaam Aazhang, Mohit Aron,
Rich Baraniuk, Joe Cavallaro, Tim Dorney, Aria Nostra-
tinia, and Jan Odegard for numerous discussions on the be-
havior of media processing workloads. We would also like
to thank Partha Tirumalai, Ahmad Zandi and Tony Zhang
from Sun for useful pointers on enhancing the applications
with VIS. We also thank Vijay Pai, Barton Sano, Chaitali
Sengupta, and the anonymous reviewers for their valuable
comments on earlier drafts of the paper.
References
[1] D. Bhandarkar and J. Ding. Performance characterization of
the Pentium Pro processor. In HPCA-3, pages 288–297, Feb
1997.
[2] R. Bhargava et al. Evaluating MMX Technology Using DSP
and Multimedia Applications. In MICRO-31, Dec 1998.
[3] A. Bilas et al. Real-time Parallel MPEG-2 Decoding in Soft-
ware. In IPPS-11, April 1997.
[4] D. A. Carlson et al. Multimedia Extensions for a 550MHz
RISC Microprocessor. In IEEE Journal of Solid-State Cir-
cuits, 1997.
[5] T. M. Conte et al. Challenges to Combining General-
Purpose and Multimedia Processors. In IEEE Computer,
pages 33–37, Dec 1997.
[6] K. Diefendorff and P. K. Dubey. How Multimedia Work-
loads Will Change Processor Design. In IEEE Micro, pages
43–45, Sep 1997.
[7] J. Eyre. Assessing General-Purpose Processors for DSP Ap-
plications. Berkeley Design Technology Inc. presentation,
1998.
[8] International Organisation for Standardisation, ISO/IEC
JTC1/SC29/WG11 MPEG 98/N2457. MPEG-4 Applications
Document, 1998.
[9] E. Killian. MIPS Extension for Digital Media with 3D.
Slides presented at Microprocessor Forum, October 1996.
[10] L. Kohn et al. The Visual Instruction Set (VIS) in Ultra-
SPARC. In COMPCON Digest of Papers, March 1995.
[11] C. Lee et al. MediaBench: A Tool for Evaluating and
Synthesizing Multimedia and Communications Systems. In
MICRO-30, 1997.
[12] R. B. Lee. Subword Parallelism with MAX-2. In IEEE
Micro, volume 16(4), pages 51–59, August 1996.
[13] R. B. Lee and M. D. Smith. Media Processing: A New De-
sign Target. In IEEE MICRO, pages 6–9, Aug 1996.
[14] T. Mowry. Tolerating Latency through Software-controlled
data prefetching. PhD thesis, Stanford University, 1994.
[15] S. Oberman et al. AMD 3DNow! Technology and the K6-2
Microprocessor. In HOTCHIPS10, 1998.
[16] V. S. Pai et al. RSIM: A Simulator for Shared-Memory Mul-
tiprocessor and Uniprocessor Systems that Exploit ILP. In
Proc. 3rd Workshop on Computer Architecture Education,
1997.
[17] V. S. Pai et al. The Impact of Instruction Level Parallelism
on Multiprocessor Performance and Simulation Methodol-
ogy. In HPCA-3, pages 72–83, 1997.
[18] A. Peleg and U. Weiser. MMX Technology Extension to
the Intel Architecture. In IEEE Micro, volume 16(4), pages
51–59, Aug 1996.
[19] M. Phillip et al. AltiVec Technology: Accelerating Media
Processing Across the Spectrum. In HOTCHIPS10, Aug
1998.
[20] P. Ranganathan et al. The Interaction of Software Prefetch-
ing with ILP Processors in Shared-Memory Systems. In
ISCA24, pages 144–156, 1997.
[21] P. Ranganathan et al. Performance of Database Workloads
on Shared-Memory Systems with Out-of-Order Processors.
In ASPLOS8, pages 307–318, 1998.
[22] D. S. Rice. High-Performance Image Processing Using
Special-Purpose CPU Instructions: The UltraSPARC Visual
Instruction Set. Master’s thesis, Stanford University, 1996.
[23] M. Tremblay et al. VIS Speeds New Media Processing. In
IEEE Micro, volume 16(4), pages 51–59, Aug 1996.
[24] C.-L. Yang et al. Exploiting Instruction-Level Parallelism in
Geometry Processing for Three Dimensional Graphics Ap-
plications. In Micro31, 1998.
[25] D. F. Zucker. Architecture and Arithmetic for Multimedia
Enhanced Processors. PhD thesis, Department of Electrical
Engineering, Stanford University, June 1997.
[26] D. F. Zucker et al. An Automated Method for Software Con-
trolled Cache Prefetching. In Proc. of the 31st Hawaii Intl.
Conf. on System Sciences, Jan 1998.
... A camera system in a car that is detecting obstacles may already be too complex to give hard timing guarantees. This depends on the performance of the computing unit and the applied detection algorithm which performs differently based on the complexity of the scene to be analyzed [6]. ...
Article
Time-triggered systems provide dependable and deterministic communication based on strict time boundaries for tasks and messages. On the other hand, event-triggered communication is less strict on requirements and more flexible, but it is difficult to provide fault isolation. Thus, the combination of time-triggered and event-triggered communication is desirably extending a time-triggered communication system by a dynamic property. This paper improves a novel mixed-triggered communication approach based on ESBs which adds limited delay tolerance to an otherwise strict real-time communication. The presented approach overcomes the problem of delayed respectively missing messages by adding fail-operational behavior to the ESB-based communication. This is achieved by extending the basic mechanism using additional fault-tolerant features. The presented mechanism allows the system to stay operable even when messages occasionally violate their planned slot boundaries and provides fault isolation for timing violations beyond a predefined tolerance budget. The introduced fault-tolerant features are analyzed via Monte-Carlo simulations where the resulting data throughput is compared to two straightforward hard real-time communication approaches. As our results show, the proposed guards enable the system to handle extended message reception delays while, compared to strictly bounded communication, having a better performance of successfully transmitted messages.
... Horta et al. in 2012 [51] incorporate dynamic reconfiguration with plug-ins and implement on-chip system programmable with structured interconnections for dynamic reconfiguration. The performance of image and video processing was evaluated by [52] using general purpose processor and ISA extension. They also made analysis with ISA extension in both in-order and out-of-order processors resulting in significant reduction in overall execution time. ...
Article
Full-text available
The prospective need of SIMD (Single Instruction and Multiple Data) applications like video and image processing systems requires a processor with greater flexibility and computation to deliver high quality real time output. The main goal of work is to offer a wider survey over high performance processor through various reconfigurable techniques targeting on real time SIMD dataset. The real-time multimedia streaming with extensive parallel data demands high computational processors with low power consumption. The modern processors with flexible computation supports on-the-fly system redesign with reconfiguration techniques at reduced cost. It also accommodates new functionalities sustaining the increasing demand for processor with less NREs (Non-Recurring Engineering) cost and shorter time-to-market. The processors with flexible platform permit the designer to incorporate the demanding changes in the existing standards of any application like wireless communication, telecommunication etc. Adaptive computing is the new processor paradigm evolved in early 90s to bridge the space which exists between the generic processor and application specific processor. It also supports the inclusion of specific hardware accelerator in the existing RISC (Reduced Instruction Set Computer) based architecture to improve the performance of particular application. In the last decade, processor based embedded systems like SoC (System-on-Chip) have become more prevalent due to their high performance at low power. Many embedded application designers like mobile and smart phone developers namely Philips N-Experia, Intel PXA etc., explored SoC as computational devices. Thus this work summarizes a literature survey of all the related works such as RC (Reconfigurable Computing) , HPRC (High Performance Reconfigurable Computing), FPGA based processors, open source soft-core processor, reconfigurable architecture, parallel computing and SIMD. 
A detailed literature survey analyzing the impact of current techniques on processor performance is presented.
... For instance, the SIMD implementations of the MPEG/JPEG codecs using the VIS ISA require on average 41% overhead instructions such as packing/unpacking and data re-shuffling. The execution of this large number of the SIMD overhead instructions decreases the performance and increases pressure on the fetch and decode steps [80]. ...
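The packing/unpacking overhead described in this snippet can be illustrated with a minimal scalar sketch in plain C (the function names here are illustrative, not taken from the VIS codecs): adding two rows of 8-bit pixels requires widening each element to a 16-bit intermediate and then re-packing with saturation, which is exactly the extra work that packed SIMD instructions either provide in one operation or impose as separate overhead instructions.

```c
#include <stdint.h>

/* Saturating pack: clamp a 16-bit intermediate result into an 8-bit pixel.
 * Media ISA extensions such as VIS provide this as a single packed
 * instruction; in scalar code each element costs explicit compare/clamp
 * work -- the kind of "overhead instruction" the text above refers to. */
static uint8_t pack_sat_u8(int16_t x) {
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Add two 8-bit pixel rows through 16-bit intermediates, then pack back. */
void add_rows_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, int n) {
    for (int i = 0; i < n; i++) {
        int16_t sum = (int16_t)a[i] + (int16_t)b[i]; /* "unpack" to 16 bits */
        out[i] = pack_sat_u8(sum);                   /* "pack" with saturation */
    }
}
```

The unpack/pack pair around the single useful add is the reason codec kernels can spend a large share of their dynamic instruction count on data re-shuffling rather than arithmetic.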
... Ranganathan et al. [18]. A comprehensive analysis of the performance and optimisation of an image processing algorithm, Scale Invariant Feature Transform (SIFT) [19], in a multi-core system was done by Zhang et al. [20]. ...
Thesis
Full-text available
Obstacle detection is an important feature in autonomous and semi-autonomous mobile robots and demands high performance from the underlying hardware to help navigate a moving object on a given path in real time. In this thesis we have optimised the computational performance of an image processing pipeline that can be used for obstacle detection. The image processing pipeline performs edge detection, removes straight-line edges from a tiled floor, and highlights obstacle boundaries on a given input image. Various optimisation strategies were applied to enable the image processing pipeline to perform optimally on diverse hardware such as CPUs, NVIDIA GPUs, and embedded boards. Further, the efficacy of expressing the image processing pipeline through programming abstractions such as Halide and Thrust2D was evaluated. Our experiments suggest that there was minimal performance degradation when the image processing pipeline was written using Thrust2D compared to a native implementation of the same pipeline on the GPU.
... They also analyzed the ISA extensions on both in-order and out-of-order processors, showing a significant reduction in overall execution time. The combination of multiple issue and out-of-order issue increases overall performance by more than 3 times [9]. To operate on extremely long SIMD data streams, reconfigurable computing for a soft-core processor with out-of-order execution, implemented in FPGA hardware, has been proposed. ...
Article
Objective: The prospective need of SIMD (Single Instruction, Multiple Data) applications such as video and image processing in a single system requires greater flexibility in computation to deliver high-quality real-time data. This paper analyzes an FPGA (Field Programmable Gate Array) based high-performance Reconfigurable OpenRISC1200 (ROR) soft-core processor for SIMD. Methods: The ROR1200 ensures performance improvement through data-level parallelism, executing SIMD instructions simultaneously in HPRC (High-Performance Reconfigurable Computing) at reduced resource utilization through an RRF (Reconfigurable Register File) with multiple core functionalities. This work analyzes the functionality of the reconfigurable architecture by illustrating the implementation of two different image processing operations: image convolution and image quality improvement. The MAC (Multiply-Accumulate) unit of the ROR1200 is used to perform image convolution, and the execution unit with HPRC is used for image quality improvement. Result: With parallel execution across multiple cores, the proposed processor improves image quality while doubling the frame rate up to 60 fps (frames per second) with a peak power consumption of 400 mW. The processor achieves a computational cost of 12 ms at a refresh rate of 60 Hz, with a MAC critical-path delay of 1.29 ns. Conclusion: This FPGA-based processor is a feasible solution for portable embedded SIMD-based applications that need high performance at reduced power consumption.
... In addition, the architecture often defines a new register set. Several papers study the performance of multimedia applications with different SIMD instruction set extensions [5], [9], [10], [19], [22], [23] and show that these SIMD extensions provide a performance improvement. ...
Article
Full-text available
Media processing has become one of the dominant computing workloads. In this context, SIMD instructions have been introduced in current processors to raise performance, often the main goal of microprocessor designers. Today, however, designers have become concerned with power consumption, and in some cases low power is the main design goal (e.g., in laptops). In this paper, we show that SIMD ISA extensions on a superscalar processor can be one solution for reducing power consumption while keeping a high performance level. We reduce the average power consumption by decreasing the number of instructions and the number of cache references, and by using dynamic power management to transform the performance speedup into a power consumption reduction.
Article
Full-text available
In the past, a patient went to the room where an ultrasound imaging device was installed and was examined there by a doctor. Today, however, a doctor can visit and examine a patient in any room using a handheld ultrasound device. Such handheld devices, though, implement only basic functions and cannot meet the high performance required by the ultrasound beam focusing algorithm, which determines the quality of the ultrasound image. In addition, low energy consumption must be satisfied for a mobile ultrasound device. To satisfy these requirements, this paper proposes a high-performance, low-power single instruction, multiple data (SIMD) based multi-core processor that supports a representative beamforming algorithm among the several focusing methods for mobile ultrasound image signals. The proposed SIMD multi-core processor, which consists of 16 processing elements (PEs), satisfies the high performance required by the beamforming algorithm by exploiting the considerable data-level parallelism inherent in ultrasound echo image data. Experimental results showed that the proposed multi-core processor outperforms a commercial high-performance processor, the TI DSP C6416, in terms of execution time (15.8 times better), energy efficiency (6.9 times better), and area efficiency (10 times better).
Conference Paper
Advanced Earth observation technologies now produce a greater variety of huge datasets. To derive information from such datasets in a timely manner, remote sensing scientists need a better and more powerful computing and storage platform. A cloud computing platform can be a good option, since it provides the required computing power at the lowest cost on a pay-as-you-go basis. To determine which existing platform is suitable for the complex analysis of huge remote sensing data, we present a comparative study of the most commonly used cloud platforms: Amazon, Microsoft, and CloudSigma. Based on the limiting factors of the satellite image-processing task, we considered flexibility, scalability, management, and pricing. Flexibility means how resilient the hardware architecture is; scalability describes how well the application can utilize available computing resources to continue functioning; management looks at the availability of dashboards and control panels to manage cloud resources; and pricing includes the cost of development and the running cost of the service on top of the cloud platform. The comparison showed that Amazon Web Services surpassed all competitors, especially in big-data processing and attainable scalability options.
Article
Application-specific instruction processor (ASIP) chips give the high performance and low power levels of application-specific integrated circuit (ASIC) chips and the flexibility of digital signal processor (DSP) chips. A video codec system-on-chip (SoC) was designed based on an ASIP architecture, with the encode and decode parts of the video codec SoC sharing the same media signal processing architecture. The encode and decode parts each consist of an 8-issue very long instruction word (VLIW) DSP core and user-defined hardware for data transfer and variable-length coding. Simulations using a 0.13 μm process show that the chip can accomplish 15 f/s QCIF H.263 baseline encoding at 17 MHz, and 15 f/s QCIF H.263 baseline decoding at 10 MHz.
Conference Paper
Full-text available
In this paper, we characterize the performance of several business and technical benchmarks on a Pentium® Pro processor based system. Various architectural data are collected using a performance-monitoring counter tool. Results show that the Pentium Pro processor achieves significantly lower cycles per instruction than the Pentium processor due to its out-of-order and speculative execution and its non-blocking cache and memory system. Its higher clock frequency also contributes to even higher performance.
Article
Full-text available
MAX-2 illustrates how a small set of instruction extensions can provide subword parallelism to accelerate media processing and other data-parallel programs. This article proposes that subword parallelism, parallel computation on lower-precision data packed into a word, is an efficient and effective solution for accelerating media processing. As an example, it describes MAX-2, a very lean, RISC-like set of media acceleration primitives included in the 64-bit PA-RISC 2.0 architecture. Because MAX-2 strives to be a minimal set of instructions, the article discusses both the instructions included and those excluded. Several examples illustrate the use of MAX-2 instructions, which provide subword parallelism in a word-oriented general-purpose processor at essentially no incremental cost.
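The subword-parallelism idea behind MAX-2 can be sketched even without any special instructions, using the classic SWAR (SIMD-within-a-register) masking trick in plain C. This is not MAX-2 itself, just an illustration of the concept: four 8-bit lanes are added inside one 32-bit word, with masks that stop carries from crossing lane boundaries.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes in parallel inside a 32-bit word, with
 * per-lane wraparound (modulo-256) semantics. The mask H holds the high
 * bit of each byte lane: the low 7 bits of every lane are added directly,
 * and the lane's high bit is recombined via XOR so that no carry can
 * propagate from one lane into the next. */
uint32_t swar_add_u8x4(uint32_t a, uint32_t b) {
    const uint32_t H = 0x80808080u; /* high bit of each 8-bit lane */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
```

Hardware subword instructions such as those in MAX-2 or VIS perform this lane isolation implicitly (and add variants like saturating arithmetic), which is why they cost essentially no extra datapath over a word-wide adder.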
Article
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering workloads. Given the radically different behavior of database workloads (especially OLTP), it is important to re-evaluate key system design decisions in the context of this important class of applications. This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance improvements. Our study is based on detailed simulations of the Oracle commercial database engine. The results show that the combination of out-of-order execution and multiple instruction issue is indeed effective in improving performance of database workloads, providing gains of 1.5 and 2.6 times over an in-order single-issue processor for OLTP and DSS, respectively. In addition, speculative techniques enable optimized implementations of memory consistency models that significantly improve the performance of stricter consistency models, bringing the performance to within 10--15% of the performance of more relaxed models. The second part of our study focuses on the more challenging OLTP workload. We show that an instruction stream buffer is effective in reducing the remaining instruction stalls in OLTP, providing a 17% reduction in execution time (approaching a perfect instruction cache to within 15%). Furthermore, our characterization shows that a large fraction of the data communication misses in OLTP exhibit migratory behavior; our preliminary results show that software prefetch and writeback/flush hints can be used for this data to further reduce execution time by 12%.
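Software prefetching of the kind mentioned in the abstract above (and in the cited paper's streaming image benchmarks) can be sketched in C with the GCC/Clang builtin `__builtin_prefetch`. This is a compiler builtin, not the specific mechanism used in either study; the prefetch distance below is an illustrative, tunable assumption.

```c
#include <stddef.h>

/* Sum a large array while issuing software prefetches a fixed distance
 * ahead of the current access, so memory latency overlaps with compute.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin and a
 * pure hint: removing it changes performance, never the result. */
long prefetched_sum(const int *data, size_t n) {
    const size_t DIST = 16; /* prefetch distance in elements (tunable) */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&data[i + DIST], 0 /* read */, 0 /* streaming */);
        sum += data[i];
    }
    return sum;
}
```

Because streaming accesses have no reuse, the low-locality hint (last argument 0) tells the hardware not to keep the prefetched lines in cache longer than necessary.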
Article
The evolution and refinement of media-processing hardware has just begun. As programmable processors or coprocessors with media-processing enhancements gradually replace fixed-function, special-purpose devices, compiler support for these features will also improve. Today, an application developer who organizes program and data structures to exploit media-processing hardware achieves the best performance. Eventually, language extensions will probably emerge to support improved programmer efficiency without loss of application performance. Media processing, with its almost limitless appetite for computational power, provides an exciting new target for hardware and software design innovation.