Semantic Video Adaptation based on Automatic
Annotation of Sport Videos
Marco Bertini, Alberto Del Bimbo
Dipartimento di Sistemi e Informatica
University of Florence
Via S. Marta, 3
Firenze, Italy
{bertini,delbimbo}@dsi.unifi.it
Rita Cucchiara, Andrea Prati
Dipartimento di Ingegneria dell’Informazione
University of Modena and Reggio Emilia
Via Vignolese, 905
Modena, Italy
{cucchiara.rita, prati.andrea}@unimore.it
ABSTRACT
Semantic video adaptation improves traditional adaptation by taking into account the degree of relevance of the different portions of the content. It employs solutions to detect the significant parts of the video and applies different compression ratios to elements that have different importance. Performance of semantic adaptation heavily depends on the precision of the automatic annotation and on the way of operation of the codec which is used to perform adaptation at the event or object level. In this paper, we discuss critical factors that affect performance of automatic annotation and define new performance measures of semantic adaptation, Viewing Quality Loss and Bitrate Cost Increase, that are obtained from classical PSNR and Bit Rate, but relate the results of semantic adaptation with the user's preferences and expectations. The new measures are discussed in detail for a system of sport annotation and adaptation with reference to different user profiles.
Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries—
Systems issues, User issues; H.2.4 [Systems]: Multimedia databases;
I.4.2 [Compression (Coding)]
General Terms
Performance, Human factors
Keywords
Video adaptation, automatic video annotation, transcoding
1. INTRODUCTION
Universal multimedia access is becoming more and more popular due to the diffusion of new devices that can access multimedia data from any place. Among multimedia data, videos are probably the most challenging, since they require high bandwidth to preserve as much as possible of the original quality. However,
meeting the constraints of the device and the requirements of the user, while at the same time keeping the costs of the transmission low (in terms of data transferred and time required), is not a trivial task.
Video adaptation techniques have been widely studied in the last years [13, 8] in order to enable Universal Multimedia Access (UMA) from any place and also with devices with limited resources. Most of the video adaptation techniques provide syntactic video adaptation, performing scaling, color subsampling, temporal downscaling or changing the compression factor [9]. As a result, the whole video is adapted uniformly. Therefore there is, on the one side, bandwidth waste for preserving the quality of useless parts of the video, and, on the other side, excessive degradation of meaningful parts.
As a consequence, many researchers have recently concentrated their efforts on defining new “semantics-based” or “content-based” video adaptation approaches. The rationale is that the user can elicit relevant video elements (either objects or events of interest) and define for each of them a degree of relevance. Relevant elements should be detected automatically in the video, possibly with computer vision-based annotation modules, and the quality of their transmission should be adapted to their user-defined relevance. This selective adaptation can be done at object level (connected regions in a frame) or at event level (sequences of frames with common meaning). For example, in the transmission of a video of a soccer game, we can send good quality video only for the frames where interesting actions take place, or, within the individual frames, provide high resolution sampling only for the most relevant objects (e.g., the regions surrounding the players).
Video adaptation in terms of the relevance of the objects detected in each frame has been addressed by [14] and [2] for video surveillance applications. In [14], Vetro et al. presented an object-based transcoding framework that uses dynamic programming or metadata for the allocation of bits among the multiple objects in the scene. In [7] the advantages of representing visual data, and thus semantics, in terms of regions corresponding to objects are clearly evidenced. Chang et al. [6] have filtered live video content according to events and highlights. In [2] we have developed a prototype system for annotation and adaptation of soccer sport videos, with adaptation based on objects and events. However, a still open problem is the choice of the granularity of the elements to be exploited for the adaptation, that is, deciding whether to work at object or event level. A detailed comparison of the possible approaches has been discussed in [3].
In addition, there is the need of a reliable and consistent performance evaluation of content-based video adaptation systems. Most of the measures for performance evaluation of video adaptation systems are, however, still based on the PSNR (Peak Signal-to-Noise Ratio) [6, 2], with some noticeable exceptions that take into account non-linear distortion effects on the human perception system [14, 5]. However, in the case of content-based video adaptation, none of them can take into account the user's satisfaction and how much it is affected by errors in the video annotation system. A few approaches in this direction have been proposed recently. A weighted PSNR has been defined in [2] to include the user's preferences. Chang et al. [6] have defined a function that takes into account both quality in the video transfer (by means of PSNR) and the consumed bandwidth (using bit rate, BR).
In this paper we present a new metric for performance evaluation of content-based video adaptation systems that takes into account the overall user's satisfaction by merging the effects of annotation errors and adaptation distortions. The new performance measures, Viewing Quality Loss and Bitrate Cost Increase, are obtained from classical PSNR and Bit Rate, but relate the results of semantic adaptation with the user's preferences and expectations. They can be used with any annotation system and any content-based adaptation module.
2. ANNOTATION AND SEMANTIC
ADAPTATION SYSTEMS
The reference framework is a system resulting from the integration of an automatic annotation engine and a content-based adaptation module. Video annotation has been widely studied over the last few years, resulting in many research prototypes and several commercial tools. Among the possible application contexts, sports annotation is very widespread, due to its deployment in broadcasting, post-production logging, indexing, and so on [1, 10, 16]. Known context is usually structured in an ontology, the definition of which is beneficial not only in the annotation process, but also for information retrieval. When video annotation is associated with video access and delivery, and thus with content adaptation, the most common frameworks for knowledge representation come from the MPEG-7 and MPEG-21 standards [12, 11]. In MPEG-7 the description schemes (DS) are modeled on XML schemas, easing the use of parsing tools for indexing, querying, and retrieving information. Furthermore, efforts have been made to standardize techniques and rules for modeling the users' requests and preferences. Recently, the MPEG-21 standardization committee has addressed the UMA-related problems by including a Digital Item Adaptation (DIA) section in Part 7 of the standard (ISO/IEC 21000-7) in order to adapt the media content to the device's limitations [12].
2.1 Ontology
According to the MPEG-7 terminology, each frame of a video
can be divided into spatial segments, which aresets of not neces-
sarily connected pixelsof a frame. Within them, we call theregions
with associated semantics ROIs, regions of interest. Thus, we can
define the set of meaningful objects of a video as
O={ROIi}∪{o};O={ROI1,ROI
2, ..., R OIn}∪{o}
where oare the parts of a frame that do notbelong to any ROI. The
ROIs are segmented by means of visual descriptors able to extract
and classify objects, and to perform temporaland spatial reasoning
on the scene.
Then, we shall use the concept of temporal segments as defined by MPEG-7. A temporal segment in MPEG-7 is a set of not necessarily contiguous frames. We shall use the term events to define the types of temporal segments with a specific meaning. Unlike MPEG-7, which uses the word “event” to define the condition that connects objects to each other in any instant, in our case “event” refers only to the continuous presence of a fact along the time sequence. In practice, we consider a set of events $E$ defined as:

$$E = \{h_i\} \cup \{e\}; \qquad E = \{h_1, h_2, \ldots, h_m\} \cup \{e\}$$

where each $h_i$ can be viewed as a highlight, while the category $e$ comprises all the non-relevant parts of the video. The ontology is thus defined in terms of objects, events, and their relationships, as described by means of acyclic graphs.
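As a purely illustrative aid, the following Python sketch shows one possible in-memory representation of the sets $O$ and $E$ defined above; the class and field names are our own assumptions and do not correspond to the MPEG-7 description schemes or to the actual implementation.

```python
# Minimal sketch (illustrative only) of the ontology sets defined above:
# objects O = {ROI_1..ROI_n} U {o} and events E = {h_1..h_m} U {e}.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ROI:
    """A region of interest: a spatial segment with attached semantics."""
    label: str                      # e.g. "playfield", "player"
    pixels: List[Tuple[int, int]]   # (x, y) coordinates belonging to the region

@dataclass
class Event:
    """A temporal segment with a specific meaning (a highlight h_i)."""
    label: str                      # e.g. "shot_on_goal"
    frames: List[int]               # frame indices covered by the event

@dataclass
class FrameAnnotation:
    rois: List[ROI] = field(default_factory=list)
    # Everything not covered by a ROI implicitly belongs to the residual object "o".

@dataclass
class VideoAnnotation:
    frames: List[FrameAnnotation]
    events: List[Event]
    # Frames not covered by any event implicitly belong to the residual event "e".
```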
2.2 User device’s requirements
The device's requirements represent the constraints that the client device imposes on the access to the video content. For instance, the maximum resolution of the device's display limits the spatial dimension of the video. In this case, spatial downscaling is mandatory and has the positive knock-on effect of reducing the required bandwidth. Furthermore, current mobile devices have limited color resolution (typically, no more than 65,535 colors). Consequently, a reduction in color depth might be necessary in order to adapt to the handset's capabilities. Although these two alterations are unavoidable and bring benefits in terms of required bandwidth, they may also entail notable image degradation, especially with regard to color reduction. Tests on basic adaptation techniques have been carried out in [4]. Mobile devices normally have a limited available memory and a computational capability that sometimes circumscribes the possibility to run sophisticated codecs and browsers. Thus, the video adaptation server should supply different encoded versions of the video, for instance with the MPEG-2 or MPEG-4 standards. Another requirement that must be taken into account is the maximum bandwidth available for the connection. Current telecommunication standards for mobile devices are GPRS (General Packet Radio Service) and UMTS (Universal Mobile Telecommunications System), whose maximum dedicated bandwidths are about 115 kbps and 2 Mbps, respectively. Since typical bandwidth requirements for videos at PAL/NTSC frame rate are much higher, suitable and effective compression techniques must be employed. In particular, the selective adaptation of the compression based on content and user's interests can improve performance considerably.
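To give a rough idea of the compression ratios involved, the following back-of-envelope calculation (ours, not taken from the paper or from any standard) compares the raw bandwidth of an uncompressed PAL stream with the nominal GPRS and UMTS figures above.

```python
# Back-of-envelope check of why aggressive, content-aware compression is
# needed to fit GPRS/UMTS links (assumes uncompressed 4:2:0 PAL at 25 fps).
pal_width, pal_height, fps = 720, 576, 25
bits_per_pixel = 12                                        # 4:2:0 chroma subsampling
raw_bps = pal_width * pal_height * bits_per_pixel * fps    # ~124 Mbps uncompressed

gprs_bps = 115_000        # nominal GPRS bandwidth (~115 kbps)
umts_bps = 2_000_000      # nominal UMTS bandwidth (~2 Mbps)

print(f"raw PAL:        {raw_bps / 1e6:.1f} Mbps")
print(f"ratio for GPRS: {raw_bps / gprs_bps:.0f}:1")   # on the order of 1000:1
print(f"ratio for UMTS: {raw_bps / umts_bps:.0f}:1")   # on the order of 60:1
```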
2.3 User’s interests
The user's interests can be basically defined in terms of viewing quality and service costs. Therefore, the basic performance analysis parameters that can be used are PSNR and BR. In [15], a utility function has been defined showing the relationships between different types of resources (bandwidth, display, etc.) and utilities (objective or subjective quality, user's satisfaction, etc.). Here, bitrate and PSNR are the straightforward parameters adopted for measuring the costs and the quality of the video output.

The quality adaptation can be improved by exploiting semantic annotation and the user's interests. In particular, we may define a set $C$ of classes of relevance which groups together the parts of the video that are of the same degree of interest to the user.

Specifically, a class of relevance groups entities of the ontology (objects and events) with the same degree of relevance for the user. Formally, given the set of classes of relevance ordered by ascending relevance, $C = \{C_0, \ldots, C_{N_{CL}}\}$, each element is defined as:

$$C_i = \langle o_i, e_i \rangle \quad \text{with } o_i \subseteq O,\ e_i \subseteq E \qquad (1)$$
The relevance associated to each class is quantified by means of a weight assigned by the user. In this paper, we employed three classes as an example, namely $C_0$, $C_1$ and $C_2$ of low, medium, and high quality, respectively. The user can assign a relative weight to each class, indicating the respective ratios in the relevance, that will basically map onto the compression levels. As an example, setting the weights to $\{w_{C_0}, w_{C_1}, w_{C_2}\} = \{0.1, 0.4, 1.0\}$ means that the quality of class $C_2$ should be ten times better than that of class $C_0$. In this case the performance evaluation depends on the user's interests. Actually, the user can select his preferences according to the semantics of the video (e.g., s/he could be more interested in a shot on goal than in a placed kick). The user gives the relative interest of each class w.r.t. the others and the degree of quality (and consequent cost) needed for the most interesting class. The system selects the compression level of the classes of relevance accordingly. Then, the final performance parameters, such as PSNR and BR, are in accordance with the user's satisfaction. Nevertheless, while the variation of PSNR and BR as a function of the compression is almost known, the effect of annotation errors on the final performance cannot be estimated a priori.
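The exact rule that maps the class weights onto compression levels depends on the codec; the following minimal sketch only illustrates the idea, under the assumption that the quantizer of the less relevant classes is coarsened in inverse proportion to their weight (the function name and the base quantizer value are ours, not the rule actually used by the system).

```python
# Sketch (assumption, not the paper's exact rule) of mapping user relevance
# weights, e.g. {w_C0, w_C1, w_C2} = {0.1, 0.4, 1.0}, onto MPEG-2 quantization
# scales: the most relevant class gets a base quantizer and the others are
# coarsened in inverse proportion to their weight.
def quantizer_for_class(weight: float, base_qs: int = 2, qs_max: int = 31) -> int:
    """Return the quantization scale for a class of relevance.

    weight  : relative relevance in (0, 1], 1.0 = most relevant class
    base_qs : quantizer assigned to the most relevant class
    """
    qs = round(base_qs / weight)          # lower weight -> coarser quantization
    return max(1, min(qs_max, qs))

weights = {"C0": 0.1, "C1": 0.4, "C2": 1.0}
print({c: quantizer_for_class(w) for c, w in weights.items()})
# e.g. {'C0': 20, 'C1': 5, 'C2': 2} -> C2 quantized 10x more finely than C0
```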
3. ANNOTATION AND ADAPTATION OF
SOCCER VIDEOS
In sports videos, users are usually interested in watching certain areas of the images, such as the playfield or the zone around the goal box in soccer, or the zone near the start or the arrival in a race. These regions of interest are extracted by the automatic annotation system for two purposes: the first one is to provide a selective compression at object level, preserving as much quality as possible for the objects in which the user is more interested; the second purpose is to use the objects as inputs for the classification of events. In fact, in sports certain events can happen only in given areas and under given conditions (think, for instance, of the shot on goal in soccer).

The objects that are detected and extracted in soccer videos are the playfield (PF), the players and the ball (PL):

$$O_{soccer} = \{PF, PL\} \cup \{o\}$$

where $o$ is the area outside the playfield (e.g., the crowd), which is of no interest for the detection of highlights nor to the viewer of the video.
The playfield shape is obtained by applying color analysis and binarization to the video frames. The frame bitmap is processed using K-fill and flood fill, followed by erosion and dilation. The shape of the playfield is represented as a polygon for the purpose of automatic annotation, while for the purpose of video adaptation the polygon is used for soccer videos and a bitmap representation is used for swimming videos. This difference is due to the fact that accurate detection of the playfield shape and polygonal approximation are obtained precisely if the color of the playfield area is uniform, and playfield lines and player “blobs” are of a small size: the soccer field is a typical example in which polygonal shapes can be extracted reliably in most frames. The portion of playfield that is framed (and hence the playfield zone where the play takes place) is identified by the aspect of the playfield shape and the playfield lines extracted from the edge image; recognition is performed using Naïve Bayes classifiers. Player and ball blobs are extracted by color differencing and represented as “binary blobs”. Constraints on the side ratio and area of the blobs' bounding boxes are used to discard non-player blobs. In order to provide users with a better understanding of the video content, the blobs of players and ball are enlarged in order to include a small part of the area around them.
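For illustration only, the sketch below approximates this kind of color-based segmentation with standard OpenCV operations; the HSV thresholds, morphological kernel and blob constraints are assumptions chosen for a grass-colored soccer field, and K-fill is replaced by ordinary closing/opening, so this is not the processing chain actually used by the annotation engine.

```python
# Illustrative sketch of color-based playfield segmentation and player-blob
# extraction; all thresholds and parameters are assumed values.
import cv2
import numpy as np

def segment_playfield(frame_bgr: np.ndarray) -> np.ndarray:
    """Return a binary mask (255 = playfield) of the grass-colored field."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))    # assumed green range
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill field lines, small blobs
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # drop isolated noise
    return mask

def player_blobs(field_mask: np.ndarray, min_area: int = 30,
                 max_aspect: float = 4.0, margin: int = 2) -> list:
    """Return slightly enlarged bounding boxes of candidate player/ball blobs."""
    contours, _ = cv2.findContours(field_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    # Fill the convex hull of the field so that players become "holes" in it.
    hull = cv2.convexHull(np.vstack(contours))
    field_filled = np.zeros_like(field_mask)
    cv2.fillConvexPoly(field_filled, hull, 255)
    candidates = cv2.bitwise_and(field_filled, cv2.bitwise_not(field_mask))
    blobs, _ = cv2.findContours(candidates, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in blobs:
        x, y, w, h = cv2.boundingRect(c)
        if cv2.contourArea(c) < min_area:
            continue                                    # too small to be a player
        if max(w, h) / max(1, min(w, h)) > max_aspect:
            continue                                    # discard line-like blobs
        boxes.append((x - margin, y - margin, w + 2 * margin, h + 2 * margin))
    return boxes
```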
The problem of modeling highlights can be seen as part of the problem of detecting special occurrences within the temporal sequences. In fact, a generic highlight can be regarded as a concatenation of consecutive phases of the competition. Each phase typically occurs in a distinct zone of the playfield, while transitions between phases are related to the movement of objects such as the athletes and/or the ball. In our approach, highlights are modeled using finite state machines (FSMs). Each highlight is described using a directed graph that models the relevant steps in the progression of the game or race, such as moving from one part of the playfield to another, accelerating or decelerating, etc.
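As an illustrative example (the states, cues and thresholds below are ours, not the models actually encoded in the system), a highlight FSM can be sketched as a small state machine driven per frame by the recognized playfield zone and by a ball-motion cue:

```python
# Illustrative finite-state sketch of a "shot on goal" highlight model, driven
# per frame by the framed playfield zone and a ball-motion cue (assumed cues).
from dataclasses import dataclass

@dataclass
class FrameCues:
    zone: str                 # e.g. "midfield", "goal_area"
    ball_toward_goal: bool

class ShotOnGoalFSM:
    def __init__(self, min_goal_frames: int = 10):
        self.state = "idle"
        self.goal_frames = 0
        self.min_goal_frames = min_goal_frames

    def step(self, cues: FrameCues) -> bool:
        """Consume one frame's cues; return True when the highlight is recognized."""
        if self.state == "idle" and cues.zone == "midfield" and cues.ball_toward_goal:
            self.state = "advancing"
        elif self.state == "advancing":
            if cues.zone == "goal_area":
                self.state = "near_goal"
                self.goal_frames = 1
            elif not cues.ball_toward_goal:
                self.state = "idle"                 # attack broke down
        elif self.state == "near_goal":
            if cues.zone == "goal_area":
                self.goal_frames += 1
                if self.goal_frames >= self.min_goal_frames:
                    self.state = "idle"
                    return True                     # shot on goal detected
            else:
                self.state = "idle"
        return False
```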
Meaningful events that are extracted by the annotation subsystem are the most important highlights. In particular, for soccer, the highlights that have been modeled are: forward launches (FL), shots on goal (SG), spot kicks such as penalty kicks (PK), free kicks near the goal post, and corner kicks, as well as attack actions (AA) and other plays that may lead to a shot on goal. In this paper, we use an ontology as follows:

$$E_{soccer} = \{FL, SG, PK, AA\} \cup \{e\} \qquad (2)$$

where $e$ indicates that no highlight is present in the video stream being processed.

Table 1 reports the Detection Rate (DR) and False Alarm Rate (FAR) figures for playfield zones and players, in terms of pixels classified as belonging to these objects.
Sports video     Object     DR      FAR
Soccer videos    Playfield  99.9%   0.16%
                 Players    99.8%   5.51%

Table 1: Performance figures of object automatic detection over 90' of soccer video.
Table 2 reports the confusion matrix, showing the precision in highlight detection and the errors in highlight classification. The percentage in the “other” column indicates false highlight detections. Finally, Table 3 reports the percentages of missed detections.

The adaptation module performs content-based video adaptation according to the bandwidth requirements and the weights of the classes of relevance. Different compression techniques have been implemented that perform coding at the semantic level.
The first one exploits the standard adaptive quantization of MPEG-2 to select the quantization scale $QS_i$ ($QS_i \in [0, 31]$) of each macroblock $i$ of each frame of the video. This approach is referred to as S-MPEG2. For each $i$, the dominant class of relevance and the corresponding $QS_i$ are computed, depending on which objects and events are involved.
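A minimal sketch of this per-macroblock rule is given below; it assumes that the annotation engine supplies a per-pixel map of class indices, and it interprets the “dominant” class of a macroblock as the highest-relevance class that touches it (both the interpretation and the names are ours, not the actual codec integration).

```python
# Sketch of the S-MPEG2 rule: each 16x16 macroblock takes the quantization
# scale of the dominant (here: highest-relevance) class among its pixels.
import numpy as np

def macroblock_quantizers(class_map: np.ndarray, qs_per_class: dict,
                          mb: int = 16) -> np.ndarray:
    """class_map    : HxW array of class indices (0 = least relevant class).
    qs_per_class : e.g. {0: 20, 1: 5, 2: 2}, quantization scale per class.
    Returns one quantization scale per (complete) macroblock."""
    h, w = class_map.shape
    rows, cols = h // mb, w // mb
    qs = np.empty((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            block = class_map[r * mb:(r + 1) * mb, c * mb:(c + 1) * mb]
            dominant = int(block.max())       # highest-relevance class in the block
            qs[r, c] = qs_per_class[dominant]
    return qs
```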
Two other coding policies have been implemented based on MPEG-4 and, particularly, on the Xvid open source software (http://www.xvid.org). Differently from MPEG-2, in MPEG-4 the quantization values for the macroblocks within the same Video Object Plane (VOP) are sent in a differential format: each value for a macroblock (except for the first) is coded as {-2, -1, 1, 2} with respect to the base value of the VOP. This allows MPEG-4 to reduce the bandwidth required for the adaptive quantization (2 bits for each quantization value w.r.t. 5 bits), but restricts the flexibility, practically preventing the use of different quantization scales for the macroblocks.
The most straightforward way is to employ the MPEG-4 Simple Profile (S-MPEG4-SP): it does not consider objects (and, thus, does not allow different quantization factors within the same frame), but only events, i.e. different quantization scales are used in different groups of frames. Instead, working at object level, the Core Profile of MPEG-4 can be used (S-MPEG4-CP), creating a different VOP for each object extracted by the annotation system.
                    True highlight
Recog. highlight    Fwd. Launch  Shot on goal  Placed kick  Attack act.  Other
Fwd. Launch         89.75%       1.67%         0%           0%           8.58%
Shot on goal        1.525%       93.9%         0%           0%           4.575%
Placed kick         0%           0%            89.75%       0%           10.25%
Attack action       1.6%         1.0%          0%           97.4%        1.0%

Table 2: Performance figures of highlight automatic detection over 90' of soccer video: precision and misclassification errors.
                      True highlight
Recognized highlight  Fwd. Launch  Shot on goal  Placed kick  Attack act.
Misses                5%           13%           7%           25%

Table 3: Soccer highlight miss percentages.
Compr. Techn.  Avg. bandwidth  Standard   Semantic
MPEG-2         530.30 kbps     32.67 dB   35.57 dB
MPEG-4         179.94 kbps     33.47 dB   36.22 dB

Table 4: Average PSNR for MPEG-2 and MPEG-4, both standard and semantic approaches, over 90' of soccer video.
In this way, we can assign a different quantization scale to each object depending on its relevance for the user. However, this approach has proven not to be suitable in the case of sports videos [3].
In Table 4, we provide a comparison of the performance of the S-MPEG2 and S-MPEG4-SP techniques. Results have been obtained under the hypothesis of an ideal (error-free) annotation engine (events and objects are detected manually), from DV source videos. According to the weights assigned by the user, we select different compression factors for the objects and events of interest w.r.t. the non-interesting elements.
The average PSNR is calculated at fixed bandwidth. In order to maintain the frame rate of 10 fps, and comparable viewing quality, compressed outputs have been obtained at an average bandwidth of 530 kbps for the MPEG-2-based solutions and 180 kbps for the MPEG-4-based solutions (note that the MPEG-4-based solutions achieve similar viewing quality with less bandwidth). The average PSNR improvement with semantic adaptation is about 8.5%. Results have been obtained with a reference user profile (see Table 5). The videos included in the test set take into consideration different sources from different broadcasters and different conditions, and they are selected considering the typical average percentage of highlights in a soccer match, as provided by the UEFA organization.
4. PERFORMANCE MEASURE FOR
ANNOTATION AND ADAPTATION
Let us consider the case of access to a whole soccer game (90 minutes) from a mobile device connected via GPRS, whose real average bandwidth can be considered 35 kbps. The use of semantic adaptation makes it possible to achieve acceptable quality for the significant entities even with this very strict limit. The 90 minutes of video are downscaled from the PAL format to a 220x176 frame size, with reference to an off-the-shelf latest-generation cellular phone (Motorola V525). An example of the quality of a relevant frame is reported in Fig. 1(b) (and zoomed on the player in Fig. 1(d)), corresponding to about 32.7 dB. If a standard approach is employed, without exploiting the semantics, results are much poorer, as demonstrated in Figs. 1(a) and 1(c), corresponding to about 30.4 dB.
Nevertheless, standard metrics, such as PSNR and BR for the adaptation module and DR and FAR for the annotation engine, present two main drawbacks for our purposes:

- these metrics evaluate the performance of the single module (annotation or adaptation), but not of the integrated system; in particular, our proposal aims at evaluating how much the annotation errors affect the overall performance of the system;

- these metrics do not take the user's preferences into account; for instance, degrading the quality of different parts of the video can have different impacts on the user, depending on the relevance that those parts have for her/him; standard PSNR does not consider this.
The errors of automatic annotation can affect the user's satisfaction. Since objects and events are divided into classes of relevance by the users, errors can cause under- or over-estimation of objects or events. In particular, under-estimation and miss conditions have a negative impact on the user's satisfaction from the viewpoint of viewing quality loss. In fact, in this case, events and/or objects are compressed more than necessary. Instead, the costs paid by the user are lowered, since under-estimated objects and events are more compressed. On the other hand, over-estimation and false detection conditions negatively affect the user's satisfaction with respect to the cost paid by the user (for transmission, downloading, and storage). These two effects could compensate each other: two differently annotated videos could be compressed with the same PSNR and the same BR, but with a large negative impact on the user's satisfaction: the user can lose details of interest and waste bits on useless parts.
Starting from the usual figures of PSNR at the pixel level and Bit Rate, we can derive new indexes of performance that do not take into account the parts correctly annotated and adapted, but only the errors: i) Viewing Quality Loss (VQL), resulting from the over-compression due to under-estimation and miss conditions occurred in the annotation; ii) Bitrate Cost Increase (BCI), resulting from the higher bitrate due to over-estimations and false detections. Let us call $Err^t_Q$ the set of points of frame $t$ that have been under-estimated, i.e. all the points that are supposed to belong to a class $C_i$ and are, instead, detected as belonging to a class $C_j$, with $j < i$. Correspondingly, let us call $Err^t_C$ the set of points of frame $t$ that have been over-estimated.
The VQL is evaluated on the pixels that result under-estimated for each frame $I_t$. Using the standard PSNR definition on this set $Err^t_Q$, a comparison between the ideal (error-free) and the actual annotation is provided.
Profile       C2                                C1                       C0         Weights (w_C2, w_C1, w_C0)
Profile Ref.  <{SG, FL}, *>                     <{PK, AA}, players>      residuals  (1.0, 0.3, 0.1)
Profile A     <SG, *>                           <{FL, PK, AA}, players>  residuals  (1.0, 0.3, 0.1)
Profile B     <{SG, FL}, *>                     <{PK, AA}, players>      residuals  (1.0, 0.6, 0.5)
Profile C     <{SG, FL}, players>               <{PK, AA}, players>      residuals  (1.0, 0.3, 0.1)
Profile D     <{SG, FL}, {playfield, players}>  <{PK, AA}, players>      residuals  (1.0, 0.3, 0.1)

Table 5: User profiles used to evaluate average VQL and BCI.
The PSNR of the under-estimated pixels in the case of the actual annotation is denoted by $PSNR_{Err^t_Q}$ and defined as:

$$PSNR_{Err^t_Q} = 10 \log_{10} \frac{V_{MAX}^2}{MSE_{Err^t_Q}} \qquad (3)$$
where $V_{MAX}$ is the maximum (peak-to-peak) value of the signal to be measured and $MSE_{Err^t_Q}$ is the Mean Square Error of the frame (limited to $Err^t_Q$), defined as follows:

$$MSE_{Err^t_Q} = \frac{\sum_{p \in Err^t_Q} d^2(p)}{|Err^t_Q|} \qquad (4)$$

with $d(p)$ a properly defined distance measuring the error between the original and the distorted image. As distance, we used the Euclidean distance in the RGB color space.
The same measure of Eq. 3 can be carried out for the ideal (ground-truthed) annotation: $PSNR^{ID}_{Err^t_Q}$ is also computed on the set $Err^t_Q$ and is affected by a non-null $MSE^{ID}_{Err^t_Q}$, due only to the selected compression standard and quantization scale. The viewing quality loss of the frame is thus defined as:

$$VQL_t = 1 - \frac{PSNR_{Err^t_Q}}{PSNR^{ID}_{Err^t_Q}} \qquad (5)$$

Since $PSNR_{Err^t_Q}$ is computed only on the under-estimated pixels of the frame, its value is lower than or equal to that obtained in the case of error-free annotation ($PSNR^{ID}_{Err^t_Q}$). Consequently, the ratio in Eq. 5 is between 1 (ideal annotation) and 0 (maximum distortion due to the annotation and adaptation processes).
Similarly, the bitrate cost increase, for objects and events, is defined for a frame $I_t$ in terms of the ratio between the bandwidth requested in the ideal and in the actual case, computed on the set of over-estimated pixels $Err^t_C$:

$$BCI_t = 1 - \frac{BR^{ID}_{Err^t_C}}{BR_{Err^t_C}} \qquad (6)$$
Viewing quality loss (VQL) and bitrate cost increase (BCI) at the video level are obtained directly by averaging $VQL_t$ and $BCI_t$:

$$VQL = \frac{\sum_{t=0}^{N} VQL_t}{N}; \qquad BCI = \frac{\sum_{t=0}^{N} BCI_t}{N} \qquad (7)$$

where $N$ is equal to the number of frames of true highlights plus the number of frames falsely detected.
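A compact sketch of Eqs. 3-7 is given below for reference. It assumes that, for each frame, per-pixel Euclidean RGB distances from the original frame are available for the video adapted with the actual and with the ideal annotation, together with the mask of under-estimated pixels and the bits spent on the over-estimated pixels in the two cases; the variable and function names are ours.

```python
# Sketch of the per-frame measures of Eqs. (3)-(7); inputs are assumed.
import numpy as np

V_MAX = 255.0 * np.sqrt(3.0)   # peak Euclidean distance in RGB space

def psnr_on_set(dist: np.ndarray, err_mask: np.ndarray) -> float:
    """PSNR restricted to the pixels selected by err_mask (Eqs. 3-4)."""
    mse = np.mean(dist[err_mask] ** 2)
    return 10.0 * np.log10(V_MAX ** 2 / mse)

def vql_frame(dist_actual: np.ndarray, dist_ideal: np.ndarray,
              under_mask: np.ndarray) -> float:
    """Viewing Quality Loss of one frame (Eq. 5)."""
    return 1.0 - psnr_on_set(dist_actual, under_mask) / psnr_on_set(dist_ideal, under_mask)

def bci_frame(bits_ideal: float, bits_actual: float) -> float:
    """Bitrate Cost Increase of one frame (Eq. 6), on over-estimated pixels."""
    return 1.0 - bits_ideal / bits_actual

def video_level(per_frame_values) -> float:
    """Average over the N frames of true plus falsely detected highlights (Eq. 7)."""
    return float(np.mean(per_frame_values))
```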
The graphs reported in Fig. 2 compare the performance analysis achievable with classical metrics (such as PSNR and Bit Rate) and with the new metrics, VQL and BCI, for a sample case. The reported example presents a set of annotation errors. The ideal event annotation should detect a “forward launch” (FL) event (associated with class of relevance C1) between frames 0 and 42, and a “shot on goal” (SG) event (of class C2) between frames 282 and 375. Annotation with the actual system results in a FL detected between frames 12 and 26, leading to two partial misses (represented by frame intervals 1 and 3 in Fig. 2), and in a SG between frames 251 and 322, resulting in a partial false detection (number 4) and a partial missed detection (number 6). In addition to the errors in event detection, the actual system makes some errors in the segmentation of the objects (the playfield in this case) that result in the small cost increase in intervals 2 and 5, and in a more relevant (especially in interval 5) loss of viewing quality. Please note that the descending PSNR in intervals 4 and 5 is due to the decreasing area of the playfield, since the amount of playfield in the image decreases while approaching the goal. It is also worth noting that the effect of a missed event in the case of FL (intervals 1 and 3) and in that of SG (interval 6) is different, since the relevance of the missed event is different. The false event in interval 4 results in a BCI of about 80%. In fact, in this interval, the average occupation of a frame in the case of correct annotation is about 100 Kbits, which grows to 650 Kbits since the actual system misclassifies the frame.
From the graphs of Fig. 2 it is evident that the use of classic metrics is not sufficient. The PSNR reported in the upper graph is computed over the whole frame $I_t$ and can mix both quality and cost effects of incorrect annotation. As a limit case, these two effects can neutralize each other, resulting in two videos with the same average PSNR but very different user satisfaction levels. From PSNR and BR alone it is not possible, for instance, to understand how much of the PSNR decrease in interval 5 is due to annotation errors and how much is due to the reduced playfield size.
According to the definitions above, both viewing quality loss and bitrate cost increase depend on the statistics of the objects and events present in the video, the performance of the annotation (measured in terms of misses, misclassifications and losses), the performance of the adaptation, and, ultimately, the way in which objects and events are clustered into each class of relevance and the relative importance weights. Thus it appears that the most important conditions potentially influencing user satisfaction are those related to events. However, player objects are also important for their impact on viewing quality, especially in the presence of meaningful actions. Among the event conditions, the most critical one for performance is event miss. In fact, in this case all the frames during the whole duration of the event are compressed at a lower rate, proportional to the relevance weight of the residual $C_0$ class that comprises the events that are less interesting for the user.
5. PERFORMANCE EVALUATION
To evaluate how VQL and BCI change according to different sets of user preferences (classes of relevance and weights) and to the performance of the automatic annotation system (event and object misses, and wrongly recognized highlights), selected soccer videos taken from the video test set have been manually annotated, performing object and event segmentation, to precisely estimate the under- and over-estimation of objects and events. A reference user profile has been compared against 4 other possible user profiles, defined as reported in Table 5.
[Figure 1: Examples of standard compression compared with the semantic approach. (a) An example frame of standard compression at GPRS bandwidth; (b) an example frame of semantic compression at GPRS bandwidth; (c) zoomed portion of (a); (d) zoomed portion of (b).]
     Ref.     Profile A  Profile B  Profile C  Profile D
FL   4974.95  711.21     4994.88    1222.39    3904.62
SG   5497.50  5497.50    5502.80    1476.17    3683.73
AA   598.89   598.89     797.74     598.89     598.89
PK   485.72   485.72     713.20     485.72     485.72

Table 6: Bitrate of the video obtained with the actual annotation (figures are in kbps).
The test set consists in a set of selected clips from different soccer videos: in total, about 6000 frames have been annotated, with 2361 highlight frames (624 frames of forward launches, 760 of shots on goal, 320 of attack actions, 650 of placed kicks). The annotation at object level has also been provided, with players and playfield. Both the manually annotated and the automatically annotated versions have been adapted with S-MPEG2, according to the classes of relevance of the 5 users. S-MPEG2 has been preferred to S-MPEG4-SP for two main reasons: first, because MPEG-2 is less computationally intensive and, thus, more suitable for low-power devices; second, as stated above, MPEG-2 enables selective compression at both object and event level.
     Ref.   Profile A  Profile B  Profile C  Profile D
FL   35.47  31.92      35.47      31.85      32.55
SG   33.64  33.64      33.64      31.35      32.39
AA   32.29  32.29      34.01      32.29      32.29
PK   30.88  30.88      32.57      30.88      30.88

Table 7: Peak Signal-to-Noise Ratio (PSNR) of the video obtained with the actual annotation (figures are in dB).
The content-based adaptation system provides compression at constant quality for the pixels belonging to the highest classes of relevance. For this reason, similarly to the Constant Quality (CQ) method of MPEG, we define this approach as Constant Best Quality (CBQ). Tables 6 and 7 report the average values divided by type of highlight. For instance, User Ref and User A differ only regarding FL. User B has the same PSNR and BR for the FL and SG highlights (both of class C2), while obtaining a higher overall quality (PSNR) than User Ref in the AA and PK highlights (both of class C1). User C and User D have an apparently lower PSNR than Ref because it is averaged over all the pixels of the frame. These profiles, indeed, ask for the best quality only for the regions of interest (players and playfield, respectively). The video quality in the interesting areas is almost the same, but a decrease in the required bandwidth is evident.
Figure 2: Comparison between PSNR-BR classical metrics and newly defined VQL and BCI.
Finally, Tables 8 and 9 report the results achieved with the new metrics. Here, many considerations about the goodness of the whole annotation and adaptation system can be drawn. First, there is a high bitrate cost increase due to the errors in shots on goal. This is due to the high number of false positives and to the average length of a shot on goal, i.e. to the number of frames erroneously classified as SG. It is worth noting that, from Table 2, FL also presents a higher false alarm rate at the event level than SG, but its BCI is always lower than that of SG. This can be explained because shots on goal are highlights that last more frames, and thus the over-estimated frames in the case of erroneous events are more numerous (the average length of a shot on goal is about 140 frames, while that of a forward launch is 60 frames). Another interesting result provided by the BCI measure is the one obtained by comparing User Ref and User D. It can be easily noted that, at least for FL and SG, the BCI is higher for User D. The BCI is due to two factors: false/over-estimated events and false/over-estimated objects. In the first case, the BCI is similar to that of User Ref, since the playfield is usually very large in SG and FL actions.
     Ref.    Profile A  Profile B  Profile C  Profile D
FL   9.07%   1.23%      8.14%      5.58%      9.39%
SG   11.78%  11.78%     11.07%     8.10%      13.76%
AA   0.45%   0.45%      0.56%      0.45%      0.45%
PK   0.27%   0.27%      0.30%      0.27%      0.27%

Table 8: Bitrate Cost Increase.
     Ref.   Profile A  Profile B  Profile C  Profile D
FL   2.75%  0.61%      1.47%      2.49%      3.60%
SG   1.31%  1.31%      0.83%      3.04%      2.54%
AA   1.34%  1.34%      1.14%      1.34%      1.34%
PK   0.24%  0.24%      0.24%      0.24%      0.24%

Table 9: Viewing Quality Loss.
In addition, in the case of User D, there is the over-estimation of the playfield that contributes to increasing the BCI. A similar consideration holds for User C, but in this case the error due to missed/over-estimated events is lower, since the objects (players) are smaller and the number of over-estimated pixels is smaller. Thus, the overall BCI is lower than in the case of User Ref. and User D.

Regarding VQL, the error in quality is limited, especially for User B. In fact, User B accepts higher costs (due to a higher bitrate) that limit the effects of miss detections. User A, who is less interested in FL, is not affected by significant errors. AA and PK highlights are always considered of average importance.
Acknowledgments
The work has been carried out in the context of DELOS, the Network of Excellence in digital libraries of the European VI Framework Programme.
6. REFERENCES
[1] J. Assfalg, M. Bertini, C. Colombo, A. Del Bimbo, and W. Nunziati. Semantic annotation of soccer videos: automatic highlights identification. Computer Vision and Image Understanding, 92(2-3):285–305, November-December 2003.
[2] M. Bertini, R. Cucchiara, A. Del Bimbo, and A. Prati. An integrated framework for semantic annotation and transcoding. Multimedia Tools and Applications, to appear.
[3] M. Bertini, R. Cucchiara, A. Del Bimbo, and A. Prati. Object-based and event-based semantic video adaptation. In Proceedings of Int'l Conference on Pattern Recognition, to appear, Aug. 2004.
[4] R. Cucchiara, C. Grana, and A. Prati. Semantic video transcoding using classes of relevance. International Journal of Image and Graphics, 3(1):145–169, Jan. 2003.
[5] N. Damera-Venkata, T. D. Kite, W. S. Geisler, B. L. Evans, and A. C. Bovik. Image quality assessment based on a degradation model. IEEE Transactions on Image Processing, 9(4):636–650, Apr. 2000.
[6] J.-G. Kim, Y. Wang, and S.-F. Chang. Content-adaptive utility-based video adaptation. In Proc. of IEEE Int'l Conference on Multimedia & Expo, pages 281–284, July 2003.
[7] M. Kunt. Object-based Video Coding, chapter 6.3, pages 585–596. In 'Handbook of Image and Video Processing'. Academic Press, 2000.
[8] R. Mohan, J. Smith, and C. Li. Adapting multimedia internet content for universal access. IEEE Transactions on Multimedia, 1(1):104–114, March 1999.
[9] T. Shanableh and M. Ghanbari. Heterogeneous video transcoding to lower spatio-temporal resolution and different encoding formats. IEEE Transactions on Multimedia, 2(2):101–110, June 2000.
[10] S. Nepal, U. Srinivasan, and G. Reynolds. Automatic detection of 'goal' segments in basketball videos. In Proc. of ACM Multimedia, pages 261–269, 2001.
[11] B. L. Tseng, C.-Y. Lin, and J. R. Smith. Using MPEG-7 and MPEG-21 for personalizing video. IEEE Multimedia, 11(1):42–52, Jan.-Mar. 2004.
[12] A. Vetro. MPEG-21 digital item adaptation: enabling universal multimedia access. IEEE Multimedia, 11(1):84–87, Jan.-Mar. 2004.
[13] A. Vetro, C. Christopoulos, and H. Sun. Video transcoding architectures and techniques: An overview. IEEE Signal Processing Magazine, 20(2):18–29, Mar. 2003.
[14] A. Vetro, T. Haga, K. Sumi, and H. Sun. Object-based coding for long-term archive of surveillance video. In Proc. of IEEE Int'l Conference on Multimedia & Expo, volume 2, pages 417–420, 2003.
[15] Y. Wang, J.-G. Kim, and S.-F. Chang. Content-based utility function prediction for real-time MPEG-4 video transcoding. In Proc. of IEEE Int'l Conference on Image Processing, volume 1, pages 189–192, 2003.
[16] W. Zhou, A. Vellaikal, and C. Kuo. Rule-based video classification system for basketball video indexing. In Proc. ACM Multimedia 2000 Workshop, pages 213–216, 2000.