Predicting Interestingness of Visual Content
Claire-Hélène Demarty, Mats Sjöberg, Mihai Gabriel Constantin,
Ngoc Q.K. Duong, Bogdan Ionescu, Thanh-Toan Do, and Hanli Wang
Abstract The ability of multimedia data to attract and keep people’s interest
for longer periods of time is gaining more and more importance in the fields of
information retrieval and recommendation, especially in the context of the ever
growing market value of social media and advertising. In this chapter we introduce
a benchmarking framework (dataset and evaluation tools) designed specifically
for assessing the performance of media interestingness prediction techniques. We
release a dataset which consists of excerpts from 78 movie trailers of Hollywood-
like movies. These data are annotated by human assessors according to their degree
of interestingness. A real-world use scenario is targeted, namely interestingness is
defined in the context of selecting visual content for illustrating a Video on Demand
(VOD) website. We provide an in-depth analysis of the human aspects of this task,
i.e., the correlation between perceptual characteristics of the content and the actual
data, as well as of the machine aspects by overviewing the participating systems of
the 2016 MediaEval Predicting Media Interestingness campaign. After discussing
the state-of-the-art achievements, valuable insights, current capabilities as well
as future challenges are presented.
C.-H. Demarty • N.Q.K. Duong
Technicolor R&I, Rennes, France
e-mail: claire-helene.demarty@technicolor.com; quang-khanh-ngoc.duong@technicolor.com
M. Sjöberg
Helsinki Institute for Information Technology HIIT, Department of Computer Science,
University of Helsinki, Helsinki, Finland
e-mail: mats.sjoberg@helsinki.fi
M.G. Constantin • B. Ionescu
LAPI, University Politehnica of Bucharest, Bucharest, Romania
e-mail: mgconstantin@imag.pub.ro; bionescu@imag.pub.ro
T.-T. Do
Singapore University of Technology and Design, Singapore, Singapore
University of Science, Ho Chi Minh City, Vietnam
e-mail: thanhtoan_do@sutd.edu.sg
H. Wang
Department of Computer Science and Technology, Tongji University, Shanghai, China
e-mail: hanliwang@tongji.edu.cn
© Springer International Publishing AG 2017
J. Benois-Pineau, P. Le Callet (eds.), Visual Content Indexing and Retrieval
with Psycho-Visual Models, Multimedia Systems and Applications,
DOI 10.1007/978-3-319-57687-9_10
1 Introduction
With the increased popularity of amateur and professional digital multimedia
content, accessing relevant information is now dependent on effective tools for
managing and browsing, due to the huge amount of data. Managing content often
involves filtering parts of it to extract what corresponds to specific requests or
applications. Fine filtering is impossible however without a clear understanding of
the content’s semantic meaning. To this end, current research in multimedia and
computer vision has moved towards modeling of more complex semantic notions,
such as emotions, complexity, memorability and interestingness of content, thus
going closer to human perception.
Being able to assess, for instance, the interestingness level of an image or
a video has several direct applications: from personal and professional content
retrieval, content management, to content summarization and story telling, selective
encoding, or even education. Although it has already raised huge interest in the
research community, a common and clear definition of multimedia interestingness
has not yet been proposed, nor does a common benchmark exist for the assessment
of the different techniques for its automatic prediction.
MediaEval¹ is a benchmarking initiative which focuses on the multi-modal
aspects of multimedia content, i.e., it is dedicated to the evaluation of new
algorithms for multimedia access and retrieval. MediaEval emphasizes the multi-
modal character of the data, e.g., speech, audio, visual content, tags, users and
context. In 2016, the Predicting Media Interestingness Task² was proposed as a new
track in the MediaEval benchmark. The purpose of the task is to answer a
real and professional-oriented interestingness prediction use case, formulated by
Technicolor.³ Technicolor is a creative technology company and a provider of
services in multimedia entertainment and solutions, in particular, providing also
solutions for helping users select the most appropriate content according to, for
example, their profile. In this context, the selected use case for interestingness
consists in helping professionals to illustrate a Video on Demand (VOD) web site
by selecting some interesting frames and/or video excerpts for the posted movies.
Although the targeted application is well-defined and confined to the illustration
of a VOD web site, the task remains highly challenging. Firstly, it raises the question
of the subjectivity of interestingness, which may vary from one person to another.
Furthermore, the semantic nature of interestingness constrains its modeling to be
able to bridge the semantic gap between the notion of interestingness and the
statistical features that can be extracted from the content. Lastly, by placing the
task in the field of the understanding of multi-modal content, i.e., audio and video,
we push the challenge even further by adding a new dimension to the task. The
¹ http://www.multimediaeval.org/.
² http://www.multimediaeval.org/mediaeval2016/mediainterestingness/.
³ http://www.technicolor.com.
choice of Hollywood movies as targeted content also adds potential difficulties, in
the sense that the systems will have to cope with different movie genres and potential
editing and special effects (i.e., alteration of the content).
Nevertheless, although highly challenging, the task was built in response to
the absence of such benchmarks. It provides a common dataset and a common
definition of interestingness. To the best of our knowledge, the MediaEval 2016
Predicting Media Interestingness Task is the first attempt to address this issue in the
research community. Even though still in its infancy, the task has, in this first year,
been a source of meaningful insights for the future of the field.
This chapter focuses on a detailed description of the benchmarking framework,
together with a thorough analysis of its results, both in terms of the performance
of the submitted systems and in what concerns the produced annotated dataset. We
identify the following main contributions:
• an overview of the current interestingness literature, both from the perspective
of the psychological implications and also from the multimedia/computer vision
side;
• the introduction of the first benchmark framework for the validation of the
techniques for predicting the interestingness of video (image and audio) content,
formulated around a real-world use case, which allows for disambiguating the
definition of interestingness;
• the public release of a specially designed annotated dataset, accompanied
by an analysis of its perceptual characteristics;
• an overview of the current capabilities via the analysis of the submitted runs;
• an in-depth discussion on the remaining issues and challenges for the prediction
of the interestingness of content.
The rest of the chapter is organized as follows. Section 2 presents a comprehensive
state of the art on interestingness prediction from both the psychological and
computational points of view. It is followed by a detailed description of the Media-
Eval Predicting Media Interestingness Task, its definition, dataset, annotations and
evaluation rules, in Sect. 3. Section 4 gives an overview of the different submitted
systems and trends for this first year of the benchmark. We also analyze the produced
dataset and annotations, their qualities and limitations. Finally, Sect. 5 discusses the
future challenges and the conclusions.
2 A Review of the Literature
The prediction and detection of multimedia data interestingness has been analyzed
in the literature from the human perspective, involving psychological studies, and
also from the computational perspective, where machines are taught to replicate the
human process. Content interestingness has gained importance with the increasing
popularity of social media, on-demand video services and recommender systems.
These different research directions try to create a general model for human interest,
go beyond the subjectivity of interestingness and detect some objective features that
appeal to the majority of subjects. In the following, we present an overview of these
directions.
2.1 Visual Interestingness as a Psychological Concept
Psychologists and neuroscientists have extensively studied the subjective perception
of visual content. The basis of the psychological interestingness studies was
established in [5]. It was revealed that interest is determined by certain factors
and their combinations, like “novelty”, “uncertainty”, “conflict” and “complexity”.
More recent studies have also developed the idea that interest is a result of
appraisal structures [58]. Psychological experiments determined two components,
namely: “novelty-complexity”—a structure that indicates the interest shown for
new and complex events; and “coping potential”—a structure that measures a
subject’s ability to discern the meaning of a certain event. The influence of each
appraisal component was further studied in [59], proving that personality traits
could influence the appraisals that define interest. Subjects with a high "openness"
trait, who are sensation seeking, curious, and open to experiences [47], were more
attracted by the novelty-complexity structure. In contrast, those not belonging
to that personality category were influenced more by their coping potential. Some
of these factors were confirmed in numerous other studies based on image or video
interestingness [11,22,54,61].
The importance of objects was also analyzed as a central interestingness
cue [20,62]. The saliency maps used by the authors in [20] were able to predict
interesting objects in a scene with an accuracy of more than 43%. They introduced
and demonstrated the idea that, when asked to describe a scene, humans tend to
talk about the most interesting objects in that scene first. Experiments show that
there was a strong consistency between different users [62]. Eye movement, another
behavioral cue, was used by the authors in [9] to detect the level of interest shown in
segments of images or whole images. The authors used saccades, the eye movements
that continuously contribute to the building of a mental map of the viewed scene.
The authors in [4] studied the object attributes that could influence importance and
draw attention, and found that animated, unusual or rare events tend to be more
interesting for the viewer.
In [65], the authors conducted an interestingness study on 77 subjects, using
artworks as visual data. The participants were asked to give ratings on different
scales to opposing attributes for the images, including: "interesting-uninteresting",
"enjoyable-unenjoyable", "cheerful-sad", "pleasing-displeasing". The results show
that disturbing images can still be classified as interesting, therefore negating the
need for pleasantness in human visual interest stimulation. Another analysis [11] led
to several conclusions regarding the influences on interest: instant enjoyment
was found to be an important factor, exploration intent and novelty had a positive
effect, and challenge had a small effect. The authors in [13] studied the influence
of familiarity with the presented image on the concept of interestingness. They
concluded that for general scenes, unfamiliar context positively influenced interest,
while photos of familiar faces (including self photos) were more interesting than
those of unfamiliar people.
It is also interesting to observe the correlation between different attributes and
interestingness. The authors in [23] performed such a study on a specially designed and
annotated dataset of images. The positively correlated attributes were found to be
"assumed memorability", "aesthetics", "pleasant", "exciting", "famous", "unusual",
"makes happy", "expert photo", "mysterious", "outdoor-natural", "arousing",
"strange", "historical" or "cultural place".
2.2 Visual Interestingness from a Computational Perspective
Besides the vast literature of psychological studies, the concept of visual inter-
estingness has been studied from the perspective of automatic, machine-based,
approaches. The idea is to replicate human capabilities via computational means.
For instance, the authors in [23] studied a large set of attributes: RGB values,
GIST features [50], spatial pyramids of SIFT histograms [39], colorfulness [17],
complexity, contrast and edge distributions [35], arousal [46] and composition of
parts [6] to model different cues related to interestingness. They investigated the
role of these cues in varying viewing contexts: different datasets were used,
from arbitrarily selected and very different images (weak context) to images issued
from similar webcam streams (strong context). They found that the concept of
"unusualness", defined as the degree of novelty of a certain image when compared
to the whole dataset, was related to interestingness in the case of a strong context.
Unusualness was calculated by clustering performed on the images using the Local
Outlier Factor [8], with RGB values, GIST and SIFT as features, composition of
parts, and complexity interpreted as the JPEG image size. In the case of a weak context,
personal preferences of the user, modeled by pixel values, GIST, SIFT and Color
Histogram features and classified with a Support Vector Regression (SVR)
with an RBF kernel, performed best. Continuing this work, the author in [61] noticed
that a regression with sparse approximation of the data performed better with the
features defined by Gygli et al. [23] than the SVR approach.
Another approach [19] selected three types of attributes for determining image
interestingness: compositional, image content and sky-illumination. The composi-
tional attributes were: rule of thirds, low depth of field, opposing colors and salient
objects; the image content attributes were: the presence of people, animals and faces,
indoor/outdoor classifiers; and finally the sky-illumination attributes consisted of
scene classification as cloudy, clear or sunset/sunrise. Classification of interesting
content is performed with Support Vector Machines (SVM). As baseline, the authors
used the low-level attributes proposed in [35], namely average hue, color, contrast,
brightness, blur and simplicity interpreted as distribution of edges; and the Naïve
Bayes and SVM for classification. Results show that high-level attributes tend to
perform better than the baseline. However, the combination of the two was able to
achieve even better results.
Other approaches focused on subcategories of interestingness. For instance, the
authors in [27] determined “social interestingness” based on social media ranking
and “visual interestingness” via crowdsourcing. The Pearson correlation coefficient
between these two subcategories had low values, e.g., 0.015 to 0.195, indicating
that there is a difference between what people share on social networks and what
has a high pure visual interest. The features used for predicting these concepts were
color descriptors determined on the HSV color space, texture information via Local
Binary Patterns, saliency [25] and edge information captured with Histogram of
Oriented Gradients.
Individual frame interestingness was calculated by the authors in [43]. They
used web photo collections of interesting landmarks from Flickr as estimators of
human interest. The proposed approach involved calculating a similarity measure
between each frame from YouTube travel videos and the Flickr image collection
of the landmarks presented in the videos, used as interesting examples. SIFT
features were computed and the number of features shared between the frame
and the image collection baseline, and their spatial arrangement similarity were
the components that determined the interestingness measure. Finally the authors
showed that their algorithm achieved the desired results, tending to classify full
images of the landmarks as interesting.
Another interesting approach is the one proposed in [31]. Authors used audio,
video and high-level features for predicting video shot interestingness, e.g., color
histograms, SIFT [45], HOG [15,68], SSIM Self-Similarities [55], GIST [50],
MFCC [63], Spectrogram SIFT [34], Audio-Six, Classemes [64], ObjectBank [41]
and the 14 photographic styles described in [48]. The system was trained via
Joachims’ Ranking SVM [33]. The final results showed that audio and visual
features performed well, and that their fusion performed even better on the two
user-annotated datasets used, giving a final accuracy of 78.6% on the 1200 Flickr
videos and 71.7% on the 420 YouTube videos. Fusion with the high-level attributes
provided a better result only on the Flickr dataset, with overall precisions of 79.7%
and 71.4%, respectively.
Low- and high-level features were used in [22] to detect the most interesting
frames in image sequences. The selected low-level features were: raw pixel values,
color histogram, HOG, GIST and image self-similarity. The high-level features
were grouped in several categories: emotion predicted from raw pixel values [66],
complexity defined as the size of the compressed PNG image, novelty computed
through a Local Outlier Factor [8], and a learning feature computed using an SVR
classifier with an RBF kernel on the GIST features. Each one of these features
performed above the baseline (i.e., random selection), and their combination also
showed improvements over each individual one. The tests were performed on a
database consisting of 20 image sequences, each containing 159 color images taken
from various webcams and surveillance scenarios; the final results for the combination
of features gave an average precision score of 0.35 and a Top-3 score of 0.59.
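The "complexity as compressed image size" proxy mentioned above is simple enough to illustrate directly: a frame that compresses poorly under a lossless codec is treated as more complex. This is a minimal sketch, not the exact implementation of the cited work.

```python
# Minimal sketch of the complexity-as-PNG-size proxy: re-encode the image
# losslessly and use the byte count as a rough complexity measure.
import io
from PIL import Image

def png_complexity(image_path: str) -> int:
    """Return the size in bytes of the losslessly compressed (PNG) image."""
    buf = io.BytesIO()
    Image.open(image_path).save(buf, format="PNG")
    return buf.getbuffer().nbytes
```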
2.3 Datasets for Predicting Interestingness
A critical point in building and evaluating any machine learning system is the availability
of labeled data. Although the literature on automatic interestingness prediction is
still at an early stage, there have been some attempts to construct evaluation data. In
the following, we introduce the most relevant initiatives.
Many of the authors have chosen to create their own datasets for evaluating
their methods. Various sources of information were used, mainly coming from
social media, e.g., Flickr [19, 27, 31, 43, 61], Pinterest [27], YouTube [31, 43].
The data consisted of the results returned by search queries. Annotations were
determined either automatically, by exploiting the available social media metadata
and statistics such as Flickr's "interestingness measure" in [19, 31], or manually, via
crowdsourcing in [27] or local human assessors in [31].
The authors in [19] used a dataset composed of 40,000 images, and kept the top
10%, ordered according to the Flickr interestingness score, as positive interesting
examples and the last 10% as negative, non interesting examples. Half of this dataset
was used for training and half for testing. The same top and last 10% of Flickr results
was used in [31], generating 1200 videos retrieved with 15 keyword queries, e.g.,
"basketball", "beach", "bird", "birthday", "cat", "dancing". In addition to these,
the authors in [31] also used 30 YouTube advertisement videos from each of 14 categories,
such as "accessories", "clothing&shoes", "computer&website", "digital products",
"drink". The videos had an average duration of 36 s and were annotated by human
assessors, thus generating a baseline interestingness score.
Apart from the individual datasets, there were also initiatives of grouping several
datasets of different compositions. The authors in [23] associated an internal
context to the data: a strong context dataset proposed in [22], where the images
in 20 publicly available webcam streams are consistently related to one another,
thus generating a collection of 20 image sequences each containing 159 images; a
weak context dataset introduced in [50], which consists of 2688 fixed-size images
grouped in 8 scene categories: "coast", "mountain", "forest", "open country",
"street", "inside city", "tall buildings" and "highways"; and a no context dataset
which consists of the 2222 image memorability dataset proposed in [29,30], with
no context or story behind the pictures.
3 The Predicting Media Interestingness Task
This section describes the Predicting Media Interestingness Task, which was
proposed in the context of the 2016 MediaEval international evaluation campaign.
This section addresses the task definition (Sect. 3.1), the description of the provided
data with its annotations (Sect. 3.2), and the evaluation protocol (Sect. 3.3).
3.1 Task Definition
Interestingness of media content is a perceptual and highly semantic notion that
remains very subjective and dependent on the user and the context. Nevertheless,
experiments show that there is, in general, an average and common interestingness
level, shared by most of the users [10]. This average interestingness level provides
evidence to envision that the building of a model for the prediction of interestingness
is feasible. Starting from this basic assumption and constraining the concept to a
clearly defined use case serves to disambiguate the notion and reduce the level
of subjectivity.
In the proposed benchmark, interestingness is assessed according to a practical
use case originated from Technicolor, where the goal is to help professionals
to illustrate a Video on Demand (VOD) web site by selecting some interesting
frames and/or video excerpts for the movies. We adopt the following definition of
interestingness: an image/video excerpt is interesting in the context of helping a user
to make his/her decision about whether he/she is interested in watching the movie
it represents. The proposed data is naturally adapted to this specific scenario, and
consists of professional content, i.e., Hollywood-like movies.
Given this data and use case, the task requires participants to develop systems
which can automatically select images and/or video segments which are considered
to be the most interesting according to the aforementioned definition. Interesting-
ness of the media is to be judged by the systems based on visual appearance, audio
information and text accompanying the data. Therefore, the challenge is inherently
multi-modal.
As presented in numerous studies in the literature, predicting the interestingness
level of images and videos often requires significantly different perspectives. Images
are self-contained and the information is captured in the scene composition and
colors, whereas videos are lower-quality images in motion, whose purpose is to
transmit the action via the movement of the objects. Therefore, to address the two
cases, two benchmarking scenarios (subtasks) are proposed:
• predicting image interestingness: given a set of key-frames extracted from a
movie, the systems are required to automatically identify those images that
viewers report to be the most interesting for the given movie. To solve the task,
participants can make use of visual content as well as external metadata, e.g.,
Internet data about the movie, social media information, etc.;
• predicting video interestingness: given the video shots of a movie, the systems
are required to automatically identify those shots that viewers report to be the
most interesting in the given movie. To solve the task, participants can make use
of visual and audio data as well as external data, e.g., subtitles, Internet data, etc.
A special feature of the provided data is the fact that it is extracted from the same
source movies, i.e., the key-frames are extracted from the provided video shots of
the movies. Therefore, this will allow for comparison between the two tasks, namely
to assess to which extent image and video interestingness are linked.
Furthermore, we proposed a binary scenario, where data can be either interesting
or not (two cases). Nevertheless, a confidence score is also required for each
decision, so that the final evaluation measure can be computed in a ranking
fashion. This is more closely related to a real-world usage scenario, where results
are provided in order of decreasing interestingness level.
3.2 Data Description
As mentioned in the previous section, the video and image subtasks are based on
a common dataset, which consists of Creative Commons trailers of Hollywood-like
movies, so as to allow redistribution. The dataset, its annotations, and accompanying
features, as described in the following subsections, are publicly available.⁴
The use of trailers, instead of full movies, has several motivations. Firstly, there is
the need for content that can be freely and publicly distributed, as opposed
to, e.g., full movies, which have much stronger restrictions on distribution. Basically,
each copyrighted movie would require an individual permission for distribution.
Secondly, using full movies is not practically feasible for the highly demanding
segmentation and annotation steps with limited time and resources, as the number
of images/video excerpts to process is enormous, in the order of millions. Finally,
running on full movies, even if the aforementioned problems were solved, would not
allow for a high diversification of the content, as only a few movies could
have been used. Trailers allow for selecting a larger number of movies and thus
diversifying the content.
Trailers are by definition representative of the main content and quality of the full
movies. However, it is important to note that trailers are already the result of some
manual filtering of the movie to find the most interesting scenes, but without spoiling
the movie's key elements. In practice, most trailers also contain less interesting or
slower-paced shots to balance their content. We therefore believe that this is a good
compromise for the practicality of the data/task.
The proposed dataset is split into development data, intended for designing and
training the algorithms which is based on 52 trailers; and testing data which is used
for the actual evaluation of the systems, and is based on 26 trailers.
The data for the video subtask was created by segmenting the trailers into video
shots. The same video shots were also used for the image subtask, but here each shot
is represented by a single key-frame image. The task is thus to classify the shots, or
key-frames, of a particular trailer, into interesting and non interesting samples.
⁴ http://www.technicolor.com/en/innovation/scientific-community/scientific-data-sharing/interestingness-dataset.
3.2.1 Shot Segmentation and Key-Frame Extraction
Video shot segmentation was carried out manually using a custom-made software
tool. Here we define a video shot as a continuous video sequence recorded between
a turn-on and a turn-off of the camera. For an edited video sequence, a shot is
delimited between two video transitions. Typical video transitions include sharp
transitions or cuts (direct concatenation of two shots), and gradual transitions
like fades (gradual disappearance/appearance of a frame to/from a black frame)
and dissolves (gradual transformation of one frame into another). In the process,
we discarded movie credits and title shots. Whenever possible, gradual transitions
were kept as shots of their own, as they are presumably very uninteresting. In a few
cases, shots in between two gradual transitions were too short to be segmented.
In that case, they were merged with their surrounding transitions, resulting in one
single shot.
The segmentation process resulted in 5054 shots for the development dataset, and
2342 shots for the test dataset, with an average duration of 1 s in each case. These
shots were used for the video subtask. For the image subtask, we extracted a single
key-frame for each shot. The key-frame was chosen as the middle frame, as it is
likely to capture the most representative information of the shot.
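The middle-frame rule is straightforward to reproduce. The sketch below, which assumes shot boundaries are available as (start_frame, end_frame) index pairs, extracts the key-frame of one shot with OpenCV; it is an illustration rather than the tool actually used to build the dataset.

```python
# Minimal sketch of the key-frame selection rule: for each manually segmented
# shot, keep the middle frame.
import cv2

def extract_middle_keyframe(video_path: str, start: int, end: int):
    """Return the middle frame of the shot [start, end] as a BGR image array."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start + end) // 2)  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```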
3.2.2 Ground-Truth Annotation
All video shots and key-frames were manually annotated in terms of interestingness
by human assessors. The annotation process was performed separately for the video
and image subtasks, to allow us to study the correlation between the two. Indeed we
would like to answer the question: Does image interestingness automatically imply
video interestingness, and vice versa?
A dedicated web-based tool was developed to assist the annotation process. The
tool has been released as free and open source software, so that others can benefit
from it and contribute improvements.⁵
We use the following annotation protocol. Instead of asking annotators to assign
an interestingness value to each shot/key-frame, we used a pair-wise comparison
protocol where the annotators were asked to select the more interesting shot/key-
frame from a pair of examples taken from the same trailer. Annotators were provided
with the clips for the shots and the images for the key-frames, presented side by
side. Also, they were informed about the Video on Demand use case and asked
to consider that "the selected video excerpts/key-frames should be suitable in
terms of helping a user to make his/her decision about whether he/she is interested
in watching a movie". Figure 1 illustrates the pair-wise decision stage of the user
interface.
⁵ https://github.com/mvsjober/pair-annotate.
Fig. 1 Web user interface for pair-wise annotations
The choice of a pair-wise annotation protocol instead of direct rating was based
on our previous experience with annotating multimedia for affective content and
interestingness [3,10,60]. Assigning a rating is a cognitively very demanding task,
requiring the annotator to understand, and constantly keep in mind, the full range
of the interestingness scale [70]. Making a single comparison is a much easier task
as one only needs to compare the interestingness of two items, and not consider
the full range. Directly assigning a rating value is also problematic since different
annotators may use different ranges, and even for the same annotator the values may
not be easily interpreted [51]. For example, is an increase from 0.3 to 0.4 the same
as the one from 0.8 to 0.9? Finally, it has been shown that pairwise comparisons are
less influenced by the order in which the annotations are displayed than with direct
rating [71].
However, annotating all possible pairs is not feasible due to the sheer number of
comparisons required. For instance, n shots/key-frames would require n(n − 1)/2
comparisons to be made for a full coverage. Instead, we adopted the adaptive square
design method [40], where the shots/key-frames are placed in a square design and
only pairs on the same row or column are compared. This reduces the number of
comparisons to n(√n − 1). For example, for n = 100 we need to perform only 900
comparisons instead of 4950 (full coverage). Finally, the Bradley-Terry-Luce (BTL)
model [7] was used to convert the paired comparison data to a scalar value (a sketch
of this conversion is given after the procedure listed below).
We modified the adaptive square design setup so that comparisons were taken
by many users simultaneously until all the required pairs had been covered. For the
rest, we proceeded according to the scheme in [40]:
1. Initialization: shots/key-frames are randomly assigned positions in the square
matrix;
2. Perform a single annotation round according to the shot/key-frame pairs given
by the square (across rows, columns);
3. Calculate the BTL scores based on the annotations;
4. Re-arrange the square matrix so that shots/key-frames are ranked according to
their BTL scores, and placed in a spiral. This arrangement ensures that mostly
similar shots/key-frames are compared row-wise and column-wise;
5. Repeat steps 2 to 4 until convergence.
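As referenced above, step 3 converts the paired comparisons into scalar scores with the BTL model. The following is an illustrative sketch (not the organizers' exact implementation) using the classic minorization-maximization update for BTL maximum likelihood.

```python
# Illustrative sketch of Bradley-Terry-Luce fitting from pairwise annotations,
# using the standard minorization-maximization (MM) update.
import numpy as np

def btl_scores(wins: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """wins[i, j] = number of times item i was preferred over item j."""
    n = wins.shape[0]
    p = np.ones(n)                     # BTL "strength" of each shot/key-frame
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()        # total wins of item i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()                   # normalize to keep the scale fixed
    return p
```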
For practical reasons, we considered by default that convergence is achieved
after five rounds and thus terminated the process once the five rounds were
finished. The final binary interestingness decisions were obtained with a heuristic
method that tried to detect a "jumping point" in the normalized distribution of the
BTL values for each movie separately. The underlying motivation for this empirical
rule is the assumption that the distribution is a sum of two underlying distributions:
non interesting shots/key-frames, and interesting shots/key-frames.
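The exact "jumping point" rule is not spelled out here; the sketch below is one plausible variant only, placing the cut at the largest gap between consecutive sorted, normalized BTL values of a movie and labeling everything above the cut as interesting.

```python
# Hypothetical sketch of a "jumping point" binarization over per-movie BTL
# values; this is an assumed heuristic, not the organizers' exact rule.
import numpy as np

def binarize_btl(btl_values: np.ndarray) -> np.ndarray:
    v = (btl_values - btl_values.min()) / (np.ptp(btl_values) + 1e-12)  # normalize to [0, 1]
    order = np.argsort(v)
    gaps = np.diff(v[order])                # gaps between consecutive sorted values
    cut = v[order][gaps.argmax()]           # value just below the largest jump
    return (v > cut).astype(int)            # 1 = interesting, 0 = non interesting
```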
Overall, 315 annotators participated in the annotation for the video data and 100
for the images. They came from 29 different countries around the
world. The average reported age of the annotators was 32, with a standard deviation
of around 13. Roughly 66% were male, 32% female, and 2% did not specify their
gender.
3.2.3 Additional Features
Apart from the data and its annotations, to broaden the targeted communities, we
also provide some pre-computed content descriptors, namely:
• Dense SIFT, computed following the original work in [45], except that
the local frame patches are densely sampled instead of using interest point detectors.
A codebook of 300 codewords is used in the quantization process with a spatial
pyramid of three layers [39].
• HoG descriptors, i.e., Histograms of Oriented Gradients [15], computed over
densely sampled patches. Following [68], HoG descriptors in a 2×2 neighborhood
are concatenated to form a descriptor of higher dimension.
• LBP, i.e., Local Binary Patterns as proposed in [49].
• GIST, computed based on the output energy of several Gabor-like filters (eight
orientations and four scales) over a dense frame grid like in [50].
• Color histogram, computed in the HSV space (Hue-Saturation-Value).
• MFCC, computed over 32 ms time-windows with 50% overlap. The cepstral
vectors are concatenated with their first and second derivatives.
• CNN features, i.e., the fc7 layer (4096 dimensions) and prob layer (1000
dimensions) of AlexNet [32].
• Mid-level face detection and tracking features, obtained by face tracking-by-
detection in each video shot via a HoG detector [15] and the correlation tracker
proposed in [16].
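Two of these descriptor types are simple enough to sketch directly. The snippet below computes an HSV color histogram with OpenCV and MFCCs with first and second derivatives with librosa; the bin counts, number of cepstral coefficients and sampling rate are assumptions, since the text does not fix them.

```python
# Sketch of two of the released descriptor types, with assumed parameter values.
import cv2
import librosa
import numpy as np

def hsv_color_histogram(bgr_image: np.ndarray, bins=(8, 8, 8)) -> np.ndarray:
    """Normalized 3-D color histogram in HSV space, flattened to a vector."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def mfcc_with_deltas(audio_path: str) -> np.ndarray:
    """MFCCs over 32 ms windows with 50% overlap, stacked with their deltas."""
    y, sr = librosa.load(audio_path, sr=16000)          # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,  # assumed 13 coefficients
                                n_fft=int(0.032 * sr), hop_length=int(0.016 * sr))
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2])                    # concatenated cepstral vectors
```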
3.3 Evaluation Rules
As for other tasks in MediaEval, participants were allowed to submit a total of up
to 5 runs for the video and image subtasks. To provide the reader with a complete
picture of the evaluation process and help the understanding of the achieved results, we
replicate here the exact conditions given to the participants.
Each task had a required run, namely: for predicting image interestingness,
classification had to be achieved with the use of the visual information only, no
external data was allowed; for predicting video interestingness, classification had to
be achieved with the use of both audio and visual information; no external data was
allowed. External data was considered to be any of the following: additional datasets
and annotations which were specifically designed for interestingness classification;
the use of pre-trained models, features, detectors obtained from such dedicated
additional datasets; additional metadata from the Internet (e.g., from IMDb). On the
contrary, CNN features trained on generic datasets such as ImageNet were allowed
for use in the required runs. By generic datasets, we mean datasets that were not
explicitly designed to support research in interestingness prediction. Additionally,
datasets dedicated to study memorability or other aspects of media were allowed,
as long as these concepts are different from interestingness, although a correlation
may exist.
To assess performance, several metrics were computed. The official evaluation
metric was the mean average precision (MAP) computed over all trailers, whereas
average precision was to be computed on a per trailer basis, over all ranked
images/video shots. MAP was computed with the trec_eval tool.6In addition to
MAP, several other secondary metrics were provided, namely: accuracy, precision,
recall and f-score for each class, and the class confusion matrix.
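The official metric is easy to reproduce in a few lines: average precision is computed per trailer over the ranked shots/key-frames, and MAP is the mean over trailers. The campaign itself used trec_eval; scikit-learn is used here only for illustration.

```python
# Minimal sketch of MAP over trailers, given binary labels and confidence scores.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(labels_per_trailer, scores_per_trailer) -> float:
    """Both arguments: lists (one entry per trailer) of binary labels / confidence scores."""
    aps = [average_precision_score(y, s)
           for y, s in zip(labels_per_trailer, scores_per_trailer)
           if np.any(y)]            # AP is undefined for trailers with no positive samples
    return float(np.mean(aps))
```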
4 Results and Analysis of the First Benchmark
4.1 Official Results
The 2016 Predicting Media Interestingness Task received more than 30 registrations
and 12 teams coming from 9 countries all over the world submitted runs in the end
(see Fig. 2). The task attracted a lot of interest from the community, which shows
the importance of this topic.
Tables 1 and 2 provide an overview of the official results for the two subtasks
(video and image interestingness prediction). A total of 54 runs were received,
⁶ http://trec.nist.gov/trec_eval/.
Fig. 2 2016 Predicting Media Interestingness task’s participation at different stages
Table 1 Official results for image interestingness prediction evaluated by MAP

Team  Run name  MAP
TUD-MMC [42]  me16in_tudmmc2_image_histface  0.2336
Technicolor [56]  me16in_technicolor_image_run1_SVM_rbf  0.2336
Technicolor  me16in_technicolor_image_run2_DNNresampling06_100  0.2315
MLPBOON [52]  me16in_MLPBOON_image_run5  0.2296
BigVid [69]  me16in_BigVid_image_run5FusionCNN  0.2294
MLPBOON  me16in_MLPBOON_image_run1  0.2205
TUD-MMC  me16in_tudmmc2_image_hist  0.2202
MLPBOON  me16in_MLPBOON_image_run4  0.217
HUCVL [21]  me16in_HUCVL_image_run1  0.2125
HUCVL  me16in_HUCVL_image_run2  0.2121
UIT-NII [38]  me16in_UITNII_image_FA  0.2115
RUC [12]  me16in_RUC_image_run2  0.2035
MLPBOON  me16in_MLPBOON_image_run2  0.2023
HUCVL  me16in_HUCVL_image_run3  0.2001
RUC  me16in_RUC_image_run3  0.1991
RUC  me16in_RUC_image_run1  0.1987
ETH-CVL [67]  me16in_ethcvl1_image_run2  0.1952
MLPBOON  me16in_MLPBOON_image_run3  0.1941
HKBU [44]  me16in_HKBU_image_baseline  0.1868
ETH-CVL  me16in_ethcvl1_image_run1  0.1866
ETH-CVL  me16in_ethcvl1_image_run3  0.1858
HKBU  me16in_HKBU_image_drbaseline  0.1839
BigVid  me16in_BigVid_image_run4SVM  0.1789
UIT-NII  me16in_UITNII_image_V1  0.1773
LAPI [14]  me16in_lapi_image_runf1  0.1714
UNIGECISA [53]  me16in_UNIGECISA_image_ReglineLoF  0.1704
Baseline  0.16556
LAPI  me16in_lapi_image_runf2  0.1398
Table 2 Official results for video interestingness prediction evaluated by MAP

Team  Run name  MAP
UNIFESP [1]  me16in_unifesp_video_run1  0.1815
HKBU [44]  me16in_HKBU_video_drbaseline  0.1735
UNIGECISA [53]  me16in_UNIGECISA_video_RegsrrLoF  0.171
RUC [12]  me16in_RUC_video_run2  0.1704
UIT-NII [38]  me16in_UITNII_video_A1  0.169
UNIFESP  me16in_unifesp_video_run4  0.1656
RUC  me16in_RUC_video_run1  0.1647
UIT-NII  me16in_UITNII_video_F1  0.1641
LAPI [14]  me16in_lapi_video_runf5  0.1629
Technicolor [56]  me16in_technicolor_video_run5_CSP_multimodal_80_epoch7  0.1618
UNIFESP  me16in_unifesp_video_run2  0.1617
UNIFESP  me16in_unifesp_video_run3  0.1617
ETH-CVL [67]  me16in_ethcvl1_video_run2  0.1574
LAPI  me16in_lapi_video_runf3  0.1574
LAPI  me16in_lapi_video_runf4  0.1572
TUD-MMC [42]  me16in_tudmmc2_video_histface  0.1558
TUD-MMC  me16in_tudmmc2_video_hist  0.1557
BigVid [69]  me16in_BigVid_video_run3RankSVM  0.154
HKBU  me16in_HKBU_video_baseline  0.1521
BigVid  me16in_BigVid_video_run2FusionCNN  0.1511
UNIGECISA  me16in_UNIGECISA_video_RegsrrGiFe  0.1497
Baseline  0.1496
BigVid  me16in_BigVid_video_run1SVM  0.1482
Technicolor  me16in_technicolor_video_run3_LSTM_U19_100_epoch5  0.1465
UNIFESP  me16in_unifesp_video_run5  0.1435
UNIGECISA  me16in_UNIGECISA_video_SVRloAudio  0.1367
Technicolor  me16in_technicolor_video_run4_CSP_video_80_epoch9  0.1365
ETH-CVL  me16in_ethcvl1_video_run1  0.1362
equally distributed between the two subtasks. As a general conclusion, the achieved
MAP values were low, which again demonstrates the challenging nature of this problem.
Slightly higher values were obtained for image interestingness prediction.
To serve as a baseline for comparison, we generated a random ranking run, i.e.,
samples were ranked randomly five times and we took the average MAP. Compared
to this baseline, the results of the image subtask hold up well, being almost all above it.
For the video subtask, on the other hand, the value range is smaller and a few systems
did worse than the baseline. In the following we present the participating systems and
analyze the achieved results in detail.
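The random baseline can be reproduced as follows, reusing the mean_average_precision helper sketched in Sect. 3.3; the seed and shuffling strategy are assumptions.

```python
# Sketch of the random-ranking baseline: score every shot/key-frame randomly
# five times and average the resulting MAP values.
import numpy as np

def random_baseline_map(labels_per_trailer, repeats: int = 5, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    maps = []
    for _ in range(repeats):
        scores = [rng.random(len(y)) for y in labels_per_trailer]
        maps.append(mean_average_precision(labels_per_trailer, scores))
    return float(np.mean(maps))
```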
4.2 Participating Systems and Global Trends
Numerous approaches have been investigated by the participating teams to tackle
both image and video interestingness prediction. In the following, we will firstly
summarize the general techniques used by the teams and their key features
(Sect. 4.2.1), and secondly present the global insights of the results (Sect. 4.2.2).
4.2.1 Participants’ Approaches
A summary of the features and classification techniques used by each participating
system is presented in Table 3(image interestingness) and Table 4(video inter-
estingness). Below, we present the main characteristics of each approach. Unless
otherwise specified, each team participated in both subtasks.
Table 3 Overview of the characteristics of the submitted systems for predicting image interestingness

BigVid [69]: features: denseSIFT + CNN + Style Attributes + SentiBank; classification: SVM (run4), Regularized DNN (run5).
ETH-CVL [67]: DNN-based Visual Semantic Embedding Model.
HKBU [44]: features: ColorHist + denseSIFT + GIST + HOG + LBP (run1), features from run1 + dimension reduction (run2); classification: nearest neighbor and SVR.
HUCVL [21]: features: CNN (run1, run3), MemNet (run2); classification: MLP (run1, run2), deep triplet network (run3).
LAPI [14]: features: ColorHist + GIST (run1), denseSIFT + GIST (run2); classification: SVM.
MLPBOON [52]: features: CNN, PCA for dimension reduction; classification: logistic regression.
RUC [12]: features: GIST + LBP + CNN prob (run1), ColorHist + GIST + CNN prob (run2), ColorHist + GIST + LBP + CNN prob (run3); classification: Random Forest (run1, run2), SVM (run3).
Technicolor [56]: features: CNN (AlexNet fc7); classification: SVM (run1), MLP (run2).
TUD-MMC [42]: features: face-related ColorHist (run1), face-related ColorHist + face area (run2); classification: normalized histogram-based confidence score (NHCS, run1), NHCS + normalized face area score (run2).
UIT-NII [38]: features: CNN (AlexNet + VGG) (run1), CNN (VGG) + GIST + HOG + denseSIFT (run2); classification: SVM with late fusion.
UNIGECISA [53]: features: multilingual visual sentiment ontology (MVSO) + CNN; classification: linear regression.
Table 4 Overview of the characteristics of the submitted systems for predicting video interestingness

BigVid [69]: features: denseSIFT, CNN, Style Attributes, SentiBank; classification: SVM (run1), Regularized DNN (run2), SVM/Ranking-SVM (run3); multi-modality: no.
ETH-CVL [67]: DNN-based Video2GIF (run1), Video2GIF + Visual Semantic Embedding Model (run2); multi-modality: text + visual.
HKBU [44]: features: ColorHist + denseSIFT + GIST + HOG + LBP (run1), features from run1 + dimension reduction (run2); classification: nearest neighbor and SVR; multi-modality: no.
LAPI [14]: features: GIST + CNN prob (run3), ColorHist + CNN (run4), denseSIFT + CNN prob (run5); classification: SVM; multi-modality: no.
RUC [12]: features: acoustic statistics + GIST (run4), MFCC with Fisher Vector encoding + GIST (run5); classification: SVM; multi-modality: audio + visual.
Technicolor [56]: features: CNN + MFCC; classification: LSTM-ResNet + MLP (run3), proposed RNN-based model (run4, run5); multi-modality: audio + visual.
TUD-MMC [42]: features: ColorHist (run1), ColorHist + face area (run2); classification: normalized histogram-based confidence score (NHCS) (run3), NHCS + normalized face area score (run4); multi-modality: no.
UIT-NII [38]: features: CNN (AlexNet) + MFCC (run3), CNN (VGG) + GIST (run4); classification: SVM with late fusion; multi-modality: audio + visual.
UNIFESP [1]: features: histogram of motion patterns (HMP) [2]; classification: majority voting of pairwise ranking methods (Ranking SVM, RankNet, RankBoost, ListNet); multi-modality: no.
UNIGECISA [53]: features: MVSO + CNN (run2), baseline visual features [18] (run3), emotionally-motivated audio features (run4); classification: SVR (run2), SPARROW (run3, run4); multi-modality: audio + visual.
BigVid [69] (Fudan University, China): explored various low-level features
(from visual and audio modalities) and high-level semantic attributes, as well as
the fusion of these features for classification. Both SVM and recent deep learning
methods were tested as classifiers. The results proved that the high-level attributes
are complementary to visual features since the combination of these features
increases the overall performance.
ETH-CVL [67] (ETH Zurich, Switzerland): participated in the video subtask
only. Two models were presented: (1) a frame-based model that uses textual side
information (external data) and (2) a generic predictor for finding video highlights
in the form of segments. For the frame-based model, they learned a joint embedding
space for image and text, which allows to measure relevance of a frame with
regard to some text such as the video title. For video interestingness prediction, the
approach in [24] was used, where a deep RankNet is trained to rank the segments of
a video based upon their suitability as animated GIFs. Note that RankNet captures
the spatio-temporal aspect of video segments via the use of 3D convolutional neural
networks (C3D).
HKBU [44] (Hong Kong Baptist University, China): used two dimensional-
ity reduction methods, named Neighborhood MinMax Projections (NMMP) and
Supervised Manifold Regression (SMR), to extract features of lower dimension
from a set of baseline low-level visual features (Color Histogram, dense SIFT,
GIST, HOG, LBP). Then nearest neighbor (NN) classifier and Support Vector
Regressor (SVR) were exploited for interestingness classification. They found
that after dimensionality reduction, the performance of the reduced features was
comparable to that of their original features, which indicated that the reduced
features successfully captured most of the discriminant information of the data.
HUCVL [21] (Hacettepe University, Turkey): participated in image interesting-
ness prediction only. They investigated three different Deep Neural Network (DNN)
models. The first two models were based on fine-tuning two pre-trained models,
namely AlexNet and MemNet. Note that MemNet was trained on the image memo-
rability dataset proposed in [36], the idea being to see if memorability can be gener-
alized to the interestingness concept. The third model, on the other hand, depends on
a proposed triplet network which comprised three instances with shared weights of
the same feed-forward network. The results demonstrated that all these models pro-
vide relatively similar and promising results on the image interestingness subtask.
LAPI [14] (University Politehnica of Bucharest, Romania, co-organizer of the
task): investigated a classic descriptor-classification scheme, namely the combi-
nation of different low-level features (HoG, dense SIFT, LBP, GIST, AlexNet fc7
layer features (hereafter referred as CNN features), Color Histogram, Color Naming
Histogram) and use of SVM, with different kernel types, as classifier. For video,
frame features were averaged to obtain a global video descriptor.
MLPBOON [52] (Indian Institute of Technology, Bombay, India): participated
only in image interestingness prediction and studied various baseline visual features
provided by the organizers [18], and classifiers on the development dataset. Principal
component analysis (PCA) was used for reducing the feature dimension. Their final
system involved the use of PCA on CNN features for the input representation
and logistic regression (LR) as classifier. Interestingly, they observed that the
combination of CNN features with GIST and Color Histogram features gave similar
performance to the use of CNN features only. Overall, this simple, yet effective,
system obtained quite high MAP values for the image subtask.
RUC [12] (Renmin University, China): investigated the use of CNN features
and AlexNet probabilistic layer (referred as CNN prob), and hand-crafted visual
features including Color Histogram, GIST, LBP, HOG, dense SIFT. Classifiers
were SVM and Random Forest. They found that semantic-level features, i.e., CNN
prob, and low-level appearance features are complementary. However, concate-
nating CNN features with hand-crafted features did not bring any improvement.
This finding is coherent with the statement from the MLPBOON team [52]. For
predicting video interestingness, the audio modality offered superior performance to the
visual modality, and the early fusion of the two modalities further boosted the
performance.
Technicolor [56] (Technicolor R&D France, co-organizer of the task): used
CNN features as visual features (for both the image and video subtasks), and
MFCC as audio feature (for the video subtask) and investigated the use of both
SVM and different Deep Neural Networks (DNN) as classification techniques.
For the image subtask, a simple system with CNN features and SVM resulted
in the best MAP, 0.2336. For the video subtask, multi-modality as a mid-level
fusion of audio and visual features, was taken into account within the DNN
framework. Additionally, a novel DNN architecture based on multiple Recurrent
Neural Networks (RNN) was proposed for modeling the temporal aspect of the
video, and a resampling/upsampling technique was used to deal with the unbalanced
dataset.
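A generic sketch of the descriptor-plus-classifier pipeline that several teams reported (precomputed AlexNet fc7 features fed to an RBF-kernel SVM whose decision values serve as interestingness confidences) is given below. The hyperparameters are placeholders, not the values used in any actual submission.

```python
# Hedged sketch of a CNN-features + SVM interestingness ranker.
import numpy as np
from sklearn.svm import SVC

def train_and_score(train_feats, train_labels, test_feats):
    """Fit an RBF SVM on fc7-style features and return confidence scores."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
    clf.fit(train_feats, train_labels)
    return clf.decision_function(test_feats)   # higher = more interesting

# Usage: rank the test key-frames of one trailer by decreasing score
# scores = train_and_score(X_train, y_train, X_test)
# ranking = np.argsort(scores)[::-1]
```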
TUD-MMC [42] (Delft University of Technology, Netherlands): investigated
MAP values obtained on the development set by swapping and submitting ground-
truth annotations of image and video to the video and image subtasks respectively,
i.e., using the video ground-truth as submission on the image subtask and the
image ground-truth as submission on the video subtask. They concluded that the
correlation between the image interestingness and video interestingness concepts is low.
Their simple visual features took into account the human face information (color
and sizes) in the image and video with the assumption that clear human faces should
attract the viewer’s attention and thus make the image/video more interesting. One
of their submitted runs, only rule-based, obtained the best MAP value of 0.2336 for
the image subtask.
UIT-NII [38] (University of Science, Vietnam; University of Information Tech-
nology, Vietnam; National Institute of Informatics, Japan): used SVM to predict
three different scores given the three types of input features: (1) low-level visual
features provided by the organizers [18], (2) CNN features (AlexNet and VGG),
and (3) MFCC as audio feature. Late fusion of these scores was used for computing
the final interestingness levels. Interestingly, their system tends to output a higher
rank on images of beautiful women. Furthermore, they found that images from dark
scenes were often considered as more interesting.
UNIFESP [1] (Federal University of Sao Paulo, Brazil): participated only in the
video subtask. Their approach was based on combining learning-to-rank algorithms
for predicting the interestingness of videos by using their visual content only. For
this purpose, Histogram of Motion Patterns (HMP) [2] were used. A simple majority
voting scheme was used for combining four pairwise machine learned rankers
(Ranking SVM, RankNet, RankBoost, ListNet) and predicting the interestingness
of videos. This simple, yet effective, method obtained the best MAP of 0.1815 for
the video subtask.
UNIGECISA [53] (University of Geneva, Switzerland): used mid-level semantic
visual sentiment features, which are related to the emotional content of images
and were shown to be effective in recognizing interestingness in GIFs [24]. They
found that these features outperform the baseline low-level ones provided by the
organizers [18]. They also investigated the use of emotionally-motivated audio
features (eGeMAPS) for the video subtask and showed the significance of the audio
modality. Three regression models were reported to predict interestingness levels:
linear regression (LR), SVR with linear kernel, and sparse approximation weighted
regression (SPARROW).
4.2.2 Analysis of This Year's Trends and Outputs
This section provides an in-depth analysis of the results and discusses the global
trends found in the submitted systems.
Low-Level vs. High-Level Description The conventional low-level visual fea-
tures, such as dense SIFT, GIST, LBP, Color Histogram, were still being used by
many of the systems for both image and video interestingness prediction [12, 14,
38,44,69]. However, deep features like CNN features (i.e., Alexnet fc7 or VGG)
have become dominant and are exploited by the majority of the systems. This
shows the effectiveness and popularity of deep learning. Some teams investigated
the combination of hand crafted features with deep features, i.e., conventional and
CNN features. A general finding is that such a combination did not really bring
any benefit to the prediction results [12,44,52]. Some systems combined low-level
features with some high-level attributes such as emotional expressions, human faces,
CNN visual concept predictions [12,69]. In this case, the resulting conclusion was
that low-level appearance features and semantic-level features are complementary,
as the combination in general offered better prediction results.
Standard vs. Deep Learning-Based Classification As it can be seen in Tables 3
and 4, SVM was used by a large number of systems for both prediction
tasks. In addition, regression techniques such as linear regression, logistic
regression, and support vector regression were also widely reported. Contrary to
CNN features, which were used by most of the systems, deep learning
classification techniques were investigated less (see [21, 56, 67, 69] for image
interestingness and [56, 67, 69] for video interestingness). This may be due to
the fact that the datasets are not large enough to justify a deep learning approach.
Conventional classifiers were preferred here.
Use of External Data Some systems investigated the use of external data to
improve the results. For instance, Flickr images with social-driven interestingness
labels were used for model selection in the image interestingness subtask by the
Technicolor team [56]. The HUCVL team [21] submitted a run with a fine-tuning of
the MemNet model, which was trained for image memorability prediction. Although
memorability and interestingness are not the same concept, the authors expected
that fine-tuning a model related to an intrinsic property of images could be helpful
in learning better high-level features for image interestingness prediction. The ETH-
CVL team [67] exploited movie titles, as textual side information related to movies,
for both subtasks. In addition, ETH-CVL also investigated the use of the deep
RankNet model, which was trained on the Video2GIF dataset [24], and the Visual
Semantic Embedding model, which was trained on the MSR Clickture dataset [28].
Dealing with Small and Unbalanced Data As the development data provided for
the two subtasks are not very large, some systems, e.g., [1,56], used the whole
image and video development sets for training when building the final models. To
cope with the imbalance of the two classes in the dataset, the Technicolor team [56]
proposed to use classic resampling and upsampling strategies so that the positive
samples are used multiple times during training.
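The resampling strategy just described amounts to replicating the minority (interesting) samples so that they are seen several times during training. A minimal sketch, with an assumed replication factor:

```python
# Sketch of simple oversampling of the positive (interesting) class.
import numpy as np

def oversample_positives(features: np.ndarray, labels: np.ndarray, factor: int = 5):
    """Replicate positive samples `factor` times in total and reshuffle."""
    pos = np.where(labels == 1)[0]
    idx = np.concatenate([np.arange(len(labels))] + [pos] * (factor - 1))
    np.random.shuffle(idx)
    return features[idx], labels[idx]
```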
Multi-Modality Specific to video interestingness, multi-modal approaches were exploited by half of the teams for at least one of their runs, as shown in Table 4. Four teams combined audio and visual information [12,38,53,56], and one team combined text with visual information [67]. The fusion of modalities was performed either at an early stage [12,53], a middle stage [56], or a late stage [38] of the processing workflow. Note that the combination of text and visual information was also reported in [67] for image interestingness prediction. The general finding here was that multi-modality brings benefits to the prediction results.
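The two most commonly reported fusion schemes can be sketched as follows: early fusion concatenates per-modality descriptors before a single classifier, while late fusion averages the scores of per-modality classifiers. The weights below are illustrative assumptions, not tuned values from the benchmark systems.

```python
# Sketch: early fusion (feature concatenation) vs. late fusion (score averaging).
import numpy as np

def early_fusion(audio_feats, visual_feats):
    """Concatenate per-modality descriptors before a single classifier."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

def late_fusion(audio_scores, visual_scores, w_audio=0.5, w_visual=0.5):
    """Weighted average of per-modality prediction scores."""
    return w_audio * np.asarray(audio_scores) + w_visual * np.asarray(visual_scores)
```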
Temporal Modeling for Video Although the temporal dimension is an important property of video, most systems did not exploit any temporal modeling for video interestingness prediction. They mainly considered a video as a sequence of frames, and a global video descriptor was computed simply by averaging frame descriptors over each shot. As an example, the HKBU team [44] treated each frame as a separate image and computed the average and standard deviation of the frame features over all frames in a shot to build a global feature vector for each video. Only two teams incorporated temporal modeling in their submitted systems, namely Technicolor [56], who used long short-term memory (LSTM) networks in their deep learning-based framework, and ETH-CVL [67], who used 3D convolutional neural networks (C3D) in their video highlight detector, trained on the Video2GIF dataset.
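The frame-pooling strategy used instead of temporal modeling can be sketched in a few lines: a shot descriptor is built from the mean and standard deviation of its frame descriptors, in the spirit of the HKBU system [44]. The array layout is an assumption for illustration.

```python
# Sketch: mean/std pooling of frame descriptors into a single shot descriptor.
import numpy as np

def pool_shot(frame_descriptors):
    """frame_descriptors: (n_frames, dim) array -> (2*dim,) shot descriptor."""
    frame_descriptors = np.asarray(frame_descriptors)
    return np.concatenate([frame_descriptors.mean(axis=0),
                           frame_descriptors.std(axis=0)])
```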
4.3 In-Depth Analysis of the Data and Annotations
The purpose of this section is to give some insights on the characteristics of the
produced data, i.e., the dataset and its annotations.
4.3.1 Quality of the Dataset
In general, the overall results obtained during the 2016 campaign show low MAP values (see Figs. 1 and 2), especially for the video interestingness prediction subtask. For comparison, we provide examples of MAP values obtained by other multi-modal tasks from the literature. Of course, these were obtained on other datasets which differ fundamentally from ours, both in the nature of the data and in the use case scenario. A direct comparison is therefore not possible; however, these numbers give an idea of current classification capabilities for video:
– ILSVRC 2015, object detection with provided training data, 200 fully labeled categories: best MAP is 0.62; object detection from videos with provided training data, 30 fully labeled categories: best MAP is 0.67;
– TRECVID 2015, semantic indexing of concepts such as airplane, kitchen, flags, etc.: best MAP is 0.37;
– TRECVID 2015, multi-modal event detection, e.g., somebody cooking on an outdoor grill: best MAP is less than 0.35.
Although these values are higher than the MAP obtained for the Predicting Media Interestingness Task, it must be noted that for the more difficult tasks, such as multi-modal event detection, the difference in performance is not that large, especially given that the proposed challenge is far more subjective than the tasks we are referring to.
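For reference, the metric being compared here can be computed as follows: average precision (AP) over a ranked list of shots or key-frames, averaged across trailers to give MAP. The grouping by trailer reflects our reading of the task setup and is stated here as an assumption.

```python
# Sketch: average precision over a ranked list, averaged over trailers (MAP).
import numpy as np

def average_precision(ranked_labels):
    """ranked_labels: binary ground-truth labels ordered by predicted rank."""
    ranked_labels = np.asarray(ranked_labels)
    hits = np.cumsum(ranked_labels)
    precisions = hits / (np.arange(len(ranked_labels)) + 1)
    n_pos = ranked_labels.sum()
    return float((precisions * ranked_labels).sum() / n_pos) if n_pos else 0.0

def mean_average_precision(ranked_labels_per_trailer):
    return float(np.mean([average_precision(r) for r in ranked_labels_per_trailer]))
```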
Nevertheless, we may wonder, especially for the video interestingness subtask, whether the quality of the dataset and its annotations partly affects the prediction performance. Firstly, although the dataset size is sufficient for classic learning techniques and required a huge annotation effort, it may not be sufficient for deep learning, with only a few thousand samples for both subtasks.
Furthermore, the dataset is highly unbalanced, with 8.3% and 9.6% of interesting content for the development set and test set, respectively. Coping with this imbalance has been shown to increase the performance of some systems [56,57]. This leads to the conclusion that, although the imbalance reflects reality, i.e., interesting content corresponds to only a small part of the data, it makes the task even more difficult, as systems have to take this characteristic into account.
Finally, in Sect. 3.2, we explained that the final annotations were determined with
an iterative process which required the convergence of the results. Due to limited
time and human resources, this process was limited to five rounds. More rounds
would certainly have resulted in better convergence of the inter-annotator ratings.
To give an idea of the subjective quality of the ground-truth rankings, Figs. 3 and 4 illustrate some image examples from the image interestingness subtask, together with the rankings obtained by one of the best systems and by the second worst performing system, for both interesting and non interesting images.
The figures show that the results obtained by the best system for the most interesting images are coherent with the selection proposed by the ground-truth, whereas the second worst performing system places at the top ranks more images which do not really contain any information, e.g., black or uniform frames, blurred frames, or objects and persons that are only partially visible.
Fig. 3 Examples of interesting images from different videos of the test set. Images are ranked from left to right by decreasing interestingness. (a) Interesting images according to the ground-truth. (b) Interesting images selected by the best system. (c) Interesting images selected by the second worst performing system (Color figure online)
Fig. 4 Examples of non interesting images from different videos of the test set. Images are ranked from left to right by increasing interestingness. (a) Non interesting images according to the ground-truth. (b) Non interesting images selected by the best system (Color figure online)
These observations converge to the idea that both the provided ground-truth and the best performing systems have managed to capture the interestingness of images. They also confirm that the obtained MAP values, although quite low, nevertheless correspond to real differences in interestingness prediction performance.
The images classified as non interesting (Fig. 4) are also a source of interesting insights. According to the ground-truth and also to the best performing systems, non interesting images tend to be those that are mostly uniform, of low quality or without meaningful information. The amount of information contained in the non interesting images then increases with the level of interestingness. Note that, unlike for the interesting images, we do not show the images classified as non interesting by the second worst performing system, because there were too few of them (only 7 images out of 25 videos in this example) to draw any conclusion.
We also calculated Krippendorff's alpha metric (α), which is a measure of inter-observer agreement [26,37], and obtained α = 0.059 for image interestingness and α = 0.063 for video interestingness. This result would indicate that there is no inter-observer agreement. However, as our method (by design) produced very few duplicate comparisons, it is not clear if this result is reliable.
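For readers who wish to reproduce such an agreement figure, the sketch below assumes the `krippendorff` PyPI package; the reliability matrix has one row per annotator and one column per compared pair, with NaN where an annotator did not rate that pair. The toy values are purely illustrative.

```python
# Sketch: Krippendorff's alpha on annotator-by-item data (krippendorff package assumed).
import numpy as np
import krippendorff

reliability_data = np.array([
    [1, 0, np.nan, 1],   # annotator 1's binary choices (toy values)
    [1, np.nan, 0, 1],   # annotator 2
    [np.nan, 0, 0, 1],   # annotator 3
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```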
As a last insight, it is worth noting that the two experienced teams [53,67], i.e., the two teams that had worked on predicting content interestingness before the MediaEval benchmark, did not achieve particularly good results on either subtask, and especially not on the image subtask. This raises the question of the generalization ability of their systems to different types of content, unless this difference in performance comes from their choice of different use cases as working context. If the latter is true, it would indicate that different use cases correspond to different interpretations of the interestingness concept.
4.3.2 Correlation Between the Two Subtasks
The Predicting Media Interestingness task was designed so that a comparison between the interestingness prediction for images and videos would be possible afterwards. Indeed, the same videos were used to extract both the shots and the key-frames to be classified in each subtask, each key-frame corresponding to the middle of a shot. Thanks to this, we studied the potential correlation between image interestingness and video interestingness.
Figure 5 shows the annotated video rankings against their key-frame rankings for several videos in the development set. None of the curves exhibits a correlation (the coefficient of determination R², obtained when fitting a regression line to the data, is lower than 0.03 in all cases), leading to the conclusion that the two concepts differ, in the sense that we cannot use video interestingness to infer image interestingness, or the other way round, on this data and use case scenario.
Fig. 5 Representation of image rankings vs. video rankings from the ground-truth for several videos of the development set. (a) Video 0, (b) Video 4, (c) Video 7, (d) Video 10, (e) Video 14, (f) Video 51
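The R² values mentioned above can be obtained as in the sketch below, which fits a linear regression of key-frame (image) ranks against shot (video) ranks for one trailer; scipy is assumed and the rank arrays are illustrative.

```python
# Sketch: coefficient of determination between image and video rankings.
import numpy as np
from scipy import stats

def rank_r_squared(video_ranks, image_ranks):
    """Both inputs: one rank value per shot of a given trailer."""
    slope, intercept, r_value, p_value, std_err = stats.linregress(video_ranks,
                                                                   image_ranks)
    return r_value ** 2

# Example with random ranks (R^2 is near 0, as observed on the real data).
rng = np.random.default_rng(0)
ranks = np.arange(30)
print(rank_r_squared(ranks, rng.permutation(ranks)))
```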
This conclusion is in line with the findings of [42], where the authors assessed the ground-truth ranking of the image subtask against the ground-truth ranking of the video subtask and vice versa. The MAP value achieved by the video ground-truth on the image subtask was 0.1747, while for the image ground-truth on the video subtask it was 0.1457, i.e., in the range of, or even lower than, the random baseline in both cases. Videos obviously contain more information
than a single image, and this additional information can be conveyed by other channels such as audio and motion. Because of it, a video might be globally considered as interesting while a single key-frame extracted from the same video is considered as non interesting. This can explain, in some cases, the observed discrepancy between image and video interestingness.
4.3.3 Link with Perceptual Content Characteristics
To infer potential links between the interestingness concept and perceptual content characteristics, we studied how low-level characteristics such as shot length, average luminance, blur, and the presence of high-quality faces influence the interestingness prediction of images and videos.
A first qualitative inspection of the interesting and non interesting image sets in the development and test sets shows that uniformly black and very blurry images were mostly classified as non interesting. So were the majority of images carrying no real information, i.e., close-ups of common objects, partly cut faces or objects, etc., as can be seen in Fig. 4.
Figure 6 shows the distributions of interestingness values for both the development and test sets in the video interestingness subtask, compared to the same distributions restricted to shots with fewer than 10 frames. In both cases, the distributions for short shots can simply be superimposed on the complete distributions, meaning that shot length does not seem to influence the interestingness of video segments, even for very short durations. On the contrary, Fig. 7 shows the two same types of distributions for the image interestingness subtask, this time assessing the influence of the average luminance value on interestingness. Here, the distributions of interestingness levels for the images with low average luminance appear slightly shifted toward lower interestingness levels. This might lead us to the conclusion that low average luminance values tend to decrease the interestingness level of a given image, contrary to the conclusion in [38].
Fig. 6 Video interestingness and shot length: distribution of interestingness levels (in blue: all shots considered; in green: shots with length smaller than 10 frames). (a) Development set, (b) test set (Color figure online)
Fig. 7 Image interestingness and average luminance: distribution of interestingness levels (in blue: all key-frames considered; in green: key-frames with luminance values lower than 25). (a) Development set, (b) test set (Color figure online)
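The luminance analysis behind Fig. 7 can be sketched as follows: the average luminance of a key-frame is approximated by the mean of its grayscale values, and frames below a threshold (25 here, as in the figure) form the "dark" subset. OpenCV is assumed for image loading and color conversion.

```python
# Sketch: average luminance of a key-frame and thresholding into a dark subset.
import cv2

def average_luminance(image_path: str) -> float:
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return float(gray.mean())

def split_by_luminance(image_paths, threshold=25.0):
    """Return the paths whose average luminance falls below `threshold`."""
    return [p for p in image_paths if average_luminance(p) < threshold]
```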
We also investigated a potential correlation between the presence of high-quality faces in frames and the interestingness level. By high-quality faces, we mean relatively large faces, frontal or profile, with no motion blur, closed eyes or grimaces. This mid-level characteristic was assessed manually by counting the number of high-quality faces present in both the interesting and non interesting images of the image interestingness subtask. The proportion of images containing high-quality faces on the development set was found to be 48.2% for the images annotated as interesting and 33.9% for the images annotated as non interesting. For the test set, 56.0% of the interesting images and 36.7% of the non interesting images contain high-quality faces. The difference in favor of the interesting sets suggests that this characteristic has a positive influence on the interestingness assessment. This was confirmed by the results of the TUD-MMC team [42], who based their system only on the detection of such high-quality faces and achieved the best MAP value for the image subtask.
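The statistics above were obtained manually; the sketch below is only a rough automatic approximation using OpenCV's Haar cascade, with a minimum detection size standing in for the "relatively large face" criterion. The size threshold is an assumption, and this is neither the method used in our annotation nor the TUD-MMC system.

```python
# Sketch: approximate the proportion of images containing a sufficiently large face.
import cv2

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def has_large_face(image_path: str, min_size=(80, 80)) -> bool:
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5, minSize=min_size)
    return len(faces) > 0

def face_proportion(image_paths) -> float:
    """Fraction of images containing at least one detected face above min_size."""
    return sum(has_large_face(p) for p in image_paths) / max(len(image_paths), 1)
```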
As a general conclusion, we may say that perceptual quality plays an important role when assessing the interestingness of images, although it is not the only cue for assessing the interestingness of content. Among semantic objects, the presence of good-quality human faces appears to be correlated with interestingness.
5 Conclusions and Future Challenges
In this chapter we introduced a specially designed evaluation framework for
assessing the performance of automatic techniques for predicting image and video
interestingness. We described the released dataset and its annotations. Content
interestingness was defined in a multi-modal scenario and for a real-world, specific
use case defined by Technicolor R&D France, namely the selection of interesting
images and video excerpts for helping professionals to illustrate a Video on Demand
(VOD) web site.
The proposed framework was validated during the 2016 Predicting Media Interestingness Task, organized within the MediaEval Benchmarking Initiative for Multimedia Evaluation. It received participation from 12 teams, which submitted a total of 54 runs. The highest MAP obtained for the image interestingness data was 0.2336, whereas for video interestingness prediction it was only 0.1815. Although a wide range of approaches was tested, from standard classifiers and descriptors to deep learning and the use of pre-trained models, the results show the difficulty of this task.
From the experience with this data, we can draw some general conclusions that will help shape future datasets in this area. Firstly, one should note that generating data and ground truth for such a subjective task is a huge effort, and effective methods should be devised to reduce the complexity of annotation. In our approach we relied on a pair-wise comparison protocol, applied in an adaptive square design to avoid comparing all possible pairs. This has limitations, as it still requires a great number of annotators and resulted in low inter-annotator agreement. A potential improvement may consist in directly ranking series of images/videos. We could also consider crowd-sourcing the key-frames/videos returned by the participants' systems to extract the most interesting samples, and evaluating the performance of the systems against these samples only.
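For context, one standard way to turn sparse pair-wise comparisons into a global ranking is a Bradley-Terry model [7], here fitted with the classic minorization-maximization updates. This is a hedged sketch of the general technique; the data layout is assumed for illustration and it does not restate the exact aggregation procedure of Sect. 3.2.

```python
# Sketch: Bradley-Terry ranking from pairwise win counts (MM updates).
import numpy as np

def bradley_terry_ranking(wins, n_iter=100):
    """wins: (n_items, n_items) matrix where wins[i, j] counts how often item i
    beat item j. Returns item indices sorted from most to least preferred."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    comparisons = wins + wins.T            # total comparisons per pair
    strength = np.ones(n)
    for _ in range(n_iter):
        total_wins = wins.sum(axis=1)
        denom = np.zeros(n)
        for i in range(n):
            mask = comparisons[i] > 0
            denom[i] = (comparisons[i, mask] /
                        (strength[i] + strength[mask])).sum()
        strength = np.where(denom > 0,
                            total_wins / np.maximum(denom, 1e-12), strength)
        strength /= strength.sum()         # normalize for numerical stability
    return np.argsort(-strength)
```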
Secondly, the source of the data is key for a solid evaluation. In our approach we selected movie trailers, due to their Creative Commons licenses which allow redistribution; other movies are, in almost all cases, closed content for the community. On the other hand, trailers are edited content, which limits to some extent the naturalness of the task, but they offer a good compromise given the circumstances. Future improvements could consist of selecting the data from full movies, as a few Creative Commons movies are indeed available. This would require a greater annotation effort but might provide a better separation between interesting and non interesting content.
Thirdly, a clear definition of image/video interestingness is mandatory. The concept of content interestingness is already very subjective and highly user dependent, even compared to other video concepts exploited in the TRECVID or ImageCLEF benchmarks. A well-founded definition allows for a focused evaluation and disambiguates the information need. In our approach, we defined interestingness in the context of selecting video content for illustrating a web site, where interesting means an image/video which would be interesting enough to convince the user to watch the source movie. As a future challenge, we might want to compare the results of interestingness prediction for different use scenarios, or even test the generalization power of the approaches.
Finally, although the image and video data were by design specifically correlated, i.e., images were selected as key-frames from the videos, the results show that predicting image interestingness and predicting video interestingness are actually two completely different tasks. This had more or less been shown in the literature; however, in those cases, images and videos were not chosen to be correlated. A future perspective might therefore be to separate the two tasks, while focusing on more representative data for each.
Acknowledgements We would like to thank Yu-Gang Jiang and Baohan Xu from Fudan University, China, and Hervé Bredin from LIMSI, France, for providing the features that accompany the released data, and Frédéric Lefebvre, Alexey Ozerov and Vincent Demoulin for their valuable input to the task definition. We would also like to thank our anonymous annotators for their contribution to building the ground-truth for the datasets. Part of this work was funded under project SPOTTER PN-III-P2-2.1-PED-2016-1065, contract 30PED/2017.
References
1. Almeida, J.: UNIFESP at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
2. Almeida, J., Leite, N.J., Torres, R.S.: Comparison of video sequences with histograms of
motion patterns. In: IEEE ICIP International Conference on Image Processing, pp. 3673–3676
(2011)
3. Baveye, Y., Dellandréa, E., Chamaret, C., Chen, L.: Liris-accede: a video database for affective
content analysis. IEEE Trans. Affect. Comput. 6(1), 43–55 (2015)
4. Berg, A.C., Berg, T.L., Daume, H., Dodge, J., Goyal, A., Han, X., Mensch, A., Mitchell, M.,
Sood, A., Stratos, K., et al.: Understanding and predicting importance in images. In: IEEE
CVPR International Conference on Computer Vision and Pattern Recognition, pp. 3562–3569.
IEEE, Providence (2012)
5. Berlyne, D.E.: Conflict, Arousal and Curiosity. Mc-Graw-Hill, New York (1960)
6. Boiman, O., Irani, M.: Detecting irregularities in images and in video. Int. J. Comput. Vis.
74(1), 17–31 (2007)
7. Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: the method of paired
comparisons. Biometrika 39(3-4), 324–345 (1952)
8. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers.
In: ACM Sigmod Record, vol. 29, pp. 93–104. ACM, New York (2000)
9. Bulling, A., Roggen, D.: Recognition of visual memory recall processes using eye movement
analysis. In: Proceedings of the 13th international conference on Ubiquitous Computing, pp.
455–464. ACM, New York (2011)
10. Chamaret, C., Demarty, C.H., Demoulin, V., Marquant, G.: Experiencing the interestingness
concept within and between pictures. In: Proceeding of SPIE, Human Vision and Electronic
Imaging (2016)
11. Chen, A., Darst, P.W., Pangrazi, R.P.: An examination of situational interest and its sources.
Br. J. Educ. Psychol. 71(3), 383–400 (2001)
12. Chen, S., Dian, Y., Jin, Q.: RUC at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
13. Chu, S.L., Fedorovskaya, E., Quek, F., Snyder, J.: The effect of familiarity on perceived
interestingness of images. In: Proceedings of SPIE, vol. 8651, p. 86511C (2013). doi:10.1117/12.2008551
14. Constantin, M.G., Boteanu, B., Ionescu, B.: LAPI at MediaEval 2016 Predicting Media
Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition (2005)
16. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M.: Accurate scale estimation for robust visual
tracking. In: British Machine Vision Conference (2014)
17. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a
computational approach. In: IEEE ECCV European Conference on Computer Vision, pp. 288–
301. Springer, Berlin (2006)
18. Demarty, C.H., Sjöberg, M., Ionescu, B., Do, T.T., Wang, H., Duong, N.Q.K., Lefebvre, F.:
Mediaeval 2016 Predicting Media Interestingness Task. In: Proceedings of the MediaEval
Workshop, Hilversum (2016)
19. Dhar, S., Ordonez, V., Berg, T.L.: High level describable attributes for predicting aesthetics
and interestingness. In: IEEE International Conference on Computer Vision and Pattern
Recognition (2011)
20. Elazary, L., Itti, L.: Interesting objects are visually salient. J. Vis. 8(3), 3–3 (2008)
21. Erdogan, G., Erdem, A., Erdem, E.: HUCVL at MediaEval 2016: predicting interesting key
frames with deep models. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
22. Grabner, H., Nater, F., Druey, M., Gool, L.V.: Visual interestingness in image sequences.
In: ACM International Conference on Multimedia, pp. 1017–1026. ACM, New York (2013).
doi:10.1145/2502081.2502109
23. Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., van Gool, L.: The interestingness of
images. In: ICCV International Conference on Computer Vision (2013)
24. Gygli, M., Song, Y., Cao, L.: Video2gif: automatic generation of animated gifs from video.
CoRR abs/1605.04850 (2016). http://arxiv.org/abs/1605.04850
25. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural
Information Processing Systems, pp. 545–552 (2006)
26. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding
data. Commun. Methods Meas. 1(1), 77–89 (2007). doi:10.1080/19312450709336664
27. Hsieh, L.C., Hsu, W.H., Wang, H.C.: Investigating and predicting social and visual image
interestingness on social media by crowdsourcing. In: 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4309–4313. IEEE, Providence (2014)
28. Hua, X.S., Yang, L., Wang, J., Wang, J., Ye, M., Wang, K., Rui, Y., Li, J.: Clickage:
towards bridging semantic and intent gaps via mining click logs of search engines. In: ACM
International Conference on Multimedia (2013)
29. Isola, P., Parikh, D., Torralba, A., Oliva, A.: Understanding the intrinsic memorability of
images. In: Advances in Neural Information Processing Systems, pp. 2429–2437 (2011)
30. Isola, P., Xiao, J., Torralba, A., Oliva, A.: What makes an image memorable? In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition, pp. 145–152. IEEE,
Providence (2011)
31. Jiang, Y.G., Wang, Y., Feng, R., Xue, X., Zheng, Y., Yan, H.: Understanding and predicting
interestingness of videos. In: AAAI Conference on Artificial Intelligence (2013)
32. Jiang, Y.G., Dai, Q., Mei, T., Rui, Y., Chang, S.F.: Super fast event recognition in internet
videos. IEEE Trans. Multimedia 177(8), 1–13 (2015)
33. Joachims, T.: Optimizing search engines using clickthrough data. In: ACM SIGKDD
international conference on Knowledge discovery and data mining, pp. 133–142. ACM, New
York (2002)
34. Ke, Y., Hoiem, D., Sukthankar, R.: Computer vision for music identification. In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 597–604.
IEEE, Providence (2005)
35. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In:
IEEE CVPR International Conference on Computer Vision and Pattern Recognition, vol. 1, pp.
419–426. IEEE, Providence (2006)
36. Khosla, A., Raju, A.S., Torralba, A., Oliva, A.: Understanding and predicting image memora-
bility at a large scale. In: International Conference on Computer Vision (ICCV) (2015)
37. Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 3rd edn. Sage,
Thousand Oaks (2013)
38. Lam, V., Do, T., Phan, S., Le, D.D., Satoh, S., Duong, D.: NII-UIT at MediaEval 2016 Pre-
dicting Media Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum
(2016)
39. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for
recognizing natural scene categories. In: IEEE CVPR International Conference on Computer
Vision and Pattern Recognition, pp. 2169–2178 (2006)
40. Li, J., Barkowsky, M., Le Callet, P.: Boosting paired comparison methodology in measuring
visual discomfort of 3dtv: performances of three different designs. In: Proceedings of SPIE
Electronic Imaging, Stereoscopic Displays and Applications, vol. 8648 (2013)
41. Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: a high-level image representation for scene
classification & semantic feature sparsification. In: Advances in Neural Information Processing
Systems, pp. 1378–1386 (2010)
42. Liem, C.: TUD-MMC at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
43. Liu, F., Niu, Y., Gleicher, M.: Using web photos for measuring video frame interestingness.
In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2058–2063
(2009)
44. Liu, Y., Gu, Z., Cheung, Y.M.: Supervised manifold learning for media interestingness
prediction. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
45. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60,
91–110 (2004)
46. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychol-
ogy and art theory. In: ACM International Conference on Multimedia, pp. 83–92. ACM, New
York (2010). doi:10.1145/1873951.1873965
47. McCrae, R.R.: Aesthetic chills as a universal marker of openness to experience. Motiv. Emot.
31(1), 5–11 (2007)
48. Murray, N., Marchesotti, L., Perronnin, F.: Ava: a large-scale database for aesthetic visual anal-
ysis. In: IEEE CVPR International Conference on Computer Vision and Pattern Recognition,
pp. 2408–2415. IEEE, Providence (2012)
49. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7),
971–987 (2002)
50. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial
envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
51. Ovadia, S.: Ratings and rankings: reconsidering the structure of values and their measurement.
Int. J. Soc. Res. Methodol. 7(5), 403–414 (2004). doi:10.1080/1364557032000081654
52. Parekh, J., Parekh, S.: The MLPBOON Predicting Media Interestingness System for MediaE-
val 2016. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
53. Rayatdoost, S., Soleymani, M.: Ranking images and videos on visual interestingness by visual
sentiment features. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
54. Schaul, T., Pape, L., Glasmachers, T., Graziano, V., Schmidhuber, J.: Coherence progress:
a measure of interestingness based on fixed compressors. In: International Conference on
Artificial General Intelligence, pp. 21–30. Springer, Berlin (2011)
55. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: 2007
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Providence
(2007)
56. Shen, Y., Demarty, C.H., Duong, N.Q.K.: Technicolor@MediaEval 2016 Predicting Media
Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
57. Shen, Y., Demarty, C.H., Duong, N.Q.K.: Deep learning for multimodal-based video interest-
ingness prediction. In: IEEE International Conference on Multimedia and Expo, ICME’17
(2017)
58. Silvia, P.J.: What is interesting? Exploring the appraisal structure of interest. Emotion 5(1), 89
(2005)
59. Silvia, P.J., Henson, R.A., Templin, J.L.: Are the sources of interest the same for everyone?
using multilevel mixture models to explore individual differences in appraisal structures.
Cognit. Emot. 23(7), 1389–1406 (2009)
60. Sjöberg, M., Baveye, Y., Wang, H., Quang, V.L., Ionescu, B., Dellandréa, E., Schedl, M.,
Demarty, C.H., Chen, L.: The mediaeval 2015 affective impact of movies task. In: Proceedings
of the MediaEval Workshop, CEUR Workshop Proceedings (2015)
61. Soleymani, M.: The quest for visual interest. In: ACM International Conference on Multime-
dia, pp. 919–922. ACM, New York (2015). doi:10.1145/2733373.2806364
62. Spain, M., Perona, P.: Measuring and predicting object importance. Int. J. Comput. Vis. 91(1),
59–76 (2011)
63. Stein, B.E., Stanford, T.R.: Multisensory integration: current issues from the perspective of the
single neuron. Nat. Rev. Neurosci. 9(4), 255–266 (2008)
64. Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using
classemes. In: IEEE ECCV European Conference on Computer Vision, pp. 776–789. Springer,
Berlin (2010)
65. Turner, S.A. Jr, Silvia, P.J.: Must interesting things be pleasant? A test of competing appraisal
structures. Emotion 6(4), 670 (2006)
66. Valdez, P., Mehrabian, A.: Effects of color on emotions. J. Exp. Psychol. Gen. 123(4), 394
(1994)
67. Vasudevan, A.B., Gygli, M., Volokitin, A., Gool, L.V.: ETH-CVL @ MediaEval 2016: textual-visual embeddings and Video2GIF for video interestingness. In: Proceedings of the MediaEval
Workshop, Hilversum (2016)
68. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: Sun database: large-scale scene
recognition from abbey to zoo. In: IEEE CVPR International Conference on Computer Vision
and Pattern Recognition, pp. 3485–3492 (2010)
69. Xu, B., Fu, Y., Jiang, Y.G.: BigVid at MediaEval 2016: predicting interestingness in images
and videos. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
70. Yang, Y.H., Chen, H.H.: Ranking-based emotion recognition for music organization and
retrieval. IEEE Trans. Audio Speech Lang. Process. 19(4), 762–774 (2011)
71. Yannakakis, G.N., Hallam, J.: Ranking vs. preference: a comparative study of self-reporting.
In: International Conference on Affective Computing and Intelligent Interaction, pp. 437–446.
Springer, Berlin (2011)