Predicting Interestingness of Visual Content
Claire-Hélène Demarty, Mats Sjöberg, Mihai Gabriel Constantin,
Ngoc Q.K. Duong, Bogdan Ionescu, Thanh-Toan Do, and Hanli Wang
Abstract The ability of multimedia data to attract and keep people’s interest
for longer periods of time is gaining more and more importance in the fields of
information retrieval and recommendation, especially in the context of the ever
growing market value of social media and advertising. In this chapter we introduce
a benchmarking framework (dataset and evaluation tools) designed specifically
for assessing the performance of media interestingness prediction techniques. We
release a dataset which consists of excerpts from 78 movie trailers of Hollywood-
like movies. These data are annotated by human assessors according to their degree
of interestingness. A real-world use scenario is targeted, namely interestingness is
defined in the context of selecting visual content for illustrating a Video on Demand
(VOD) website. We provide an in-depth analysis of the human aspects of this task,
i.e., the correlation between perceptual characteristics of the content and the actual
data, as well as of the machine aspects by overviewing the participating systems of
the 2016 MediaEval Predicting Media Interestingness campaign. After discussing
the state-of-the-art achievements, valuable insights, current capabilities as well
as future challenges are presented.
C.-H. Demarty • N.Q.K. Duong
Technicolor R&I, Rennes, France
e-mail: claire-helene.demarty@technicolor.com; quang-khanh-ngoc.duong@technicolor.com
M. Sjöberg
Helsinki Institute for Information Technology HIIT, Department of Computer Science,
University of Helsinki, Helsinki, Finland
e-mail: mats.sjoberg@helsinki.fi
M.G. Constantin • B. Ionescu
LAPI, University Politehnica of Bucharest, Bucharest, Romania
e-mail: mgconstantin@imag.pub.ro; bionescu@imag.pub.ro
T.-T. Do
Singapore University of Technology and Design, Singapore, Singapore
University of Science, Ho Chi Minh City, Vietnam
e-mail: thanhtoan_do@sutd.edu.sg
H. Wang
Department of Computer Science and Technology, Tongji University, Shanghai, China
e-mail: hanliwang@tongji.edu.cn
© Springer International Publishing AG 2017
J. Benois-Pineau, P. Le Callet (eds.), Visual Content Indexing and Retrieval
with Psycho-Visual Models, Multimedia Systems and Applications,
DOI 10.1007/978-3-319-57687-9_10
1 Introduction
With the increased popularity of amateur and professional digital multimedia
content, accessing relevant information is now dependent on effective tools for
managing and browsing, due to the huge amount of data. Managing content often
involves filtering parts of it to extract what corresponds to specific requests or
applications. Fine filtering is impossible however without a clear understanding of
the content’s semantic meaning. To this end, current research in multimedia and
computer vision has moved towards modeling of more complex semantic notions,
such as emotions, complexity, memorability and interestingness of content, thus
going closer to human perception.
Being able to assess, for instance, the interestingness level of an image or
a video has several direct applications: from personal and professional content
retrieval, content management, to content summarization and story telling, selective
encoding, or even education. Although it has already raised huge interest in the
research community, a common and clear definition of multimedia interestingness
has not yet been proposed, nor does a common benchmark exist for the assessment
of the different techniques for its automatic prediction.
MediaEval¹ is a benchmarking initiative which focuses on the multi-modal
aspects of multimedia content, i.e., it is dedicated to the evaluation of new
algorithms for multimedia access and retrieval. MediaEval emphasizes the multi-
modal character of the data, e.g., speech, audio, visual content, tags, users and
context. In 2016, the Predicting Media Interestingness Task² was proposed as a new
track in the MediaEval benchmark. The purpose of the task is to answer a
real and professional-oriented interestingness prediction use case, formulated by
Technicolor.³ Technicolor is a creative technology company and a provider of
services in multimedia entertainment and solutions, in particular, providing also
solutions for helping users select the most appropriate content according to, for
example, their profile. In this context, the selected use case for interestingness
consists in helping professionals to illustrate a Video on Demand (VOD) web site
by selecting some interesting frames and/or video excerpts for the posted movies.
Although the targeted application is well-defined and confined to the illustration
of a VOD web site, the task remains highly challenging. Firstly, it raises the question
of the subjectivity of interestingness, which may vary from one person to another.
Furthermore, the semantic nature of interestingness constrains its modeling to be
able to bridge the semantic gap between the notion of interestingness and the
statistical features that can be extracted from the content. Lastly, by placing the
task in the field of the understanding of multi-modal content, i.e., audio and video,
we push the challenge even further by adding a new dimension to the task. The
¹ http://www.multimediaeval.org/.
² http://www.multimediaeval.org/mediaeval2016/mediainterestingness/.
³ http://www.technicolor.com.
choice of Hollywood movies as targeted content also adds potential difficulties, in
the sense that the systems will have to cope with different movie genres and potential
editing and special effects (i.e., alteration of the content).
Nevertheless, although highly challenging, the task was built in response to
the absence of such benchmarks. It provides a common dataset and a common
definition of interestingness. To the best of our knowledge, the MediaEval 2016
Predicting Media Interestingness Task is the first attempt to address this issue in the
research community. Even though still in its infancy, the task has, in this first year,
been a source of meaningful insights for the future of the field.
This chapter focuses on a detailed description of the benchmarking framework,
together with a thorough analysis of its results, both in terms of the performance
of the submitted systems and in what concerns the produced annotated dataset. We
identify the following main contributions:
• an overview of the current interestingness literature, both from the perspective
of the psychological implications and also from the multimedia/computer vision
side;
• the introduction of the first benchmark framework for the validation of the
techniques for predicting the interestingness of video (image and audio) content,
formulated around a real-world use case, which allows for disambiguating the
definition of interestingness;
• the public release of a specially designed annotated dataset, accompanied
by an analysis of its perceptual characteristics;
• an overview of the current capabilities via the analysis of the submitted runs;
• an in-depth discussion on the remaining issues and challenges for the prediction
of the interestingness of content.
The rest of the chapter is organized as follows. Section 2 presents a comprehensive
state of the art on interestingness prediction from both the psychological and
computational points of view. It is followed by a detailed description of the Media-
Eval Predicting Media Interestingness Task, its definition, dataset, annotations and
evaluation rules, in Sect. 3. Section 4 gives an overview of the different submitted
systems and trends for this first year of the benchmark. We also analyze the produced
dataset and annotations, their qualities and limitations. Finally, Sect. 5 discusses the
future challenges and the conclusions.
2 A Review of the Literature
The prediction and detection of multimedia data interestingness has been analyzed
in the literature from the human perspective, involving psychological studies, and
also from the computational perspective, where machines are taught to replicate the
human process. Content interestingness has gained importance with the increasing
popularity of social media, on-demand video services and recommender systems.
These different research directions try to create a general model for human interest,
go beyond the subjectivity of interestingness and detect some objective features that
appeal to the majority of subjects. In the following, we present an overview of these
directions.
2.1 Visual Interestingness as a Psychological Concept
Psychologists and neuroscientists have extensively studied the subjective perception
of visual content. The basis of the psychological interestingness studies was
established in [5]. It was revealed that interest is determined by certain factors
and their combinations, like “novelty”, “uncertainty”, “conflict” and “complexity”.
More recent studies have also developed the idea that interest is a result of
appraisal structures [58]. Psychological experiments determined two components,
namely: “novelty-complexity”—a structure that indicates the interest shown for
new and complex events; and “coping potential”—a structure that measures a
subject’s ability to discern the meaning of a certain event. The influence of each
appraisal component was further studied in [59], proving that personality traits
could influence the appraisals that define interest. Subjects with a high "openness"
trait, who are sensation seeking, curious, and open to experiences [47], were more
attracted by the novelty-complexity structure. In contrast, those not belonging
to that personality category were influenced more by their coping potential. Some
of these factors were confirmed in numerous other studies based on image or video
interestingness [11,22,54,61].
The importance of objects was also analyzed as a central interestingness
cue [20,62]. The saliency maps used by the authors in [20] were able to predict
interesting objects in a scene with an accuracy of more than 43%. They introduced
and demonstrated the idea that, when asked to describe a scene, humans tend to
talk about the most interesting objects in that scene first. Experiments show that
there was a strong consistency between different users [62]. Eye movement, another
behavioral cue, was used by the authors in [9] to detect the level of interest shown in
segments of images or whole images. The authors used saccades, the eye movements
that continuously contribute to the building of a mental map of the viewed scene.
The authors in [4] studied the object attributes that could influence importance and
draw attention, and found that animated, unusual or rare events tend to be more
interesting for the viewer.
In [65], the authors conducted an interestingness study on 77 subjects, using
artworks as visual data. The participants were asked to give ratings on different
scales to opposing attributes for the images, including: "interesting-uninteresting",
"enjoyable-unenjoyable", "cheerful-sad", "pleasing-displeasing". The results show
that disturbing images can still be classified as interesting, therefore negating the
need for pleasantness in human visual interest stimulation. Another analysis [11] led
to several conclusions regarding the influences on interest: instant enjoyment
was found to be an important factor, exploration intent and novelty had a positive
effect, and challenge had a small effect. The authors in [13] studied the influence
of familiarity with the presented image on the concept of interestingness. They
concluded that for general scenes, unfamiliar context positively influenced interest,
while photos of familiar faces (including self photos) were more interesting than
those of unfamiliar people.
It is also interesting to observe the correlation between different attributes and
interestingness. The authors in [23] performed such a study on a specially designed and
annotated dataset of images. The positively correlated attributes were found to be
"assumed memorability", "aesthetics", "pleasant", "exciting", "famous", "unusual",
"makes happy", "expert photo", "mysterious", "outdoor-natural", "arousing",
"strange", "historical" or "cultural place".
2.2 Visual Interestingness from a Computational Perspective
Besides the vast literature of psychological studies, the concept of visual inter-
estingness has been studied from the perspective of automatic, machine-based,
approaches. The idea is to replicate human capabilities via computational means.
For instance, the authors in [23] studied a large set of attributes: RGB values,
GIST features [50], spatial pyramids of SIFT histograms [39], colorfulness [17],
complexity, contrast and edge distributions [35], arousal [46] and composition of
parts [6] to model different cues related to interestingness. They investigated the
role of these cues in varying viewing contexts: different datasets were used,
from arbitrarily selected and very different images (weak context) to images issued
from similar webcam streams (strong context). They found that the concept of
"unusualness", defined as the degree of novelty of a certain image when compared
to the whole dataset, was related to interestingness in the case of a strong context.
Unusualness was calculated by clustering performed on the images using the Local
Outlier Factor [8], with RGB values, GIST and SIFT as features, composition of
parts, and complexity interpreted as the JPEG image size. In the case of a weak context,
personal preferences of the user, modeled by pixel values, GIST, SIFT and Color
Histogram features and classified with a Support Vector Regression (SVR)
with an RBF kernel, performed best. Continuing this work, the author in [61] noticed
that a regression with sparse approximation of the data performed better with the
features defined by Gygli et al. [23] than the SVR approach.
Another approach [19] selected three types of attributes for determining image
interestingness: compositional, image content and sky-illumination. The composi-
tional attributes were: rule of thirds, low depth of field, opposing colors and salient
objects; the image content attributes were: the presence of people, animals and faces,
indoor/outdoor classifiers; and finally the sky-illumination attributes consisted of
scene classification as cloudy, clear or sunset/sunrise. Classification of interesting
content is performed with Support Vector Machines (SVM). As baseline, the authors
used the low-level attributes proposed in [35], namely average hue, color, contrast,
brightness, blur and simplicity interpreted as distribution of edges; and the Naïve
Bayes and SVM for classification. Results show that high-level attributes tend to
perform better than the baseline. However, the combination of the two was able to
achieve even better results.
Other approaches focused on subcategories of interestingness. For instance, the
authors in [27] determined “social interestingness” based on social media ranking
and “visual interestingness” via crowdsourcing. The Pearson correlation coefficient
between these two subcategories had low values, e.g., 0.015 to 0.195, indicating
that there is a difference between what people share on social networks and what
has a high pure visual interest. The features used for predicting these concepts were
color descriptors determined on the HSV color space, texture information via Local
Binary Patterns, saliency [25] and edge information captured with Histogram of
Oriented Gradients.
Individual frame interestingness was calculated by the authors in [43]. They
used web photo collections of interesting landmarks from Flickr as estimators of
human interest. The proposed approach involved calculating a similarity measure
between each frame from YouTube travel videos and the Flickr image collection
of the landmarks presented in the videos, used as interesting examples. SIFT
features were computed and the number of features shared between the frame
and the image collection baseline, and their spatial arrangement similarity were
the components that determined the interestingness measure. Finally the authors
showed that their algorithm achieved the desired results, tending to classify full
images of the landmarks as interesting.
Another interesting approach is the one proposed in [31]. Authors used audio,
video and high-level features for predicting video shot interestingness, e.g., color
histograms, SIFT [45], HOG [15,68], SSIM Self-Similarities [55], GIST [50],
MFCC [63], Spectrogram SIFT [34], Audio-Six, Classemes [64], ObjectBank [41]
and the 14 photographic styles described in [48]. The system was trained via
Joachims’ Ranking SVM [33]. The final results showed that audio and visual
features performed well, and that their fusion performed even better on the two
user-annotated datasets used, giving a final accuracy of 78.6% on the 1200 Flickr
videos and 71.7% on the 420 YouTube videos. Fusion with the high-level attributes
provided a better result only on the Flickr dataset, with overall precisions of 79.7%
and 71.4%, respectively.
Low- and high-level features were used in [22] to detect the most interesting
frames in image sequences. The selected low-level features were: raw pixel values,
color histogram, HOG, GIST and image self-similarity. The high-level features
were grouped in several categories: emotion predicted from raw pixel values [66],
complexity defined as the size of the compressed PNG image, novelty computed
through a Local Outlier Factor [8], and a learning feature computed using an SVR
classifier with an RBF kernel on the GIST features. Each one of these features
performed above the baseline (i.e., random selection), and their combination also
showed improvements over each individual one. The tests were performed on a
database consisting of 20 image sequences, each containing 159 color images taken
from various webcams and surveillance scenarios; the final results for the combination
of features gave an average precision score of 0.35 and a Top-3 score of 0.59.
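The "complexity as compressed image size" proxy mentioned above is simple enough to illustrate directly: a frame that compresses poorly under a lossless codec is treated as more complex. This is a minimal sketch, not the exact implementation of the cited work.

```python
# Minimal sketch of the complexity-as-PNG-size proxy: re-encode the image
# losslessly and use the byte count as a rough complexity measure.
import io
from PIL import Image

def png_complexity(image_path: str) -> int:
    """Return the size in bytes of the losslessly compressed (PNG) image."""
    buf = io.BytesIO()
    Image.open(image_path).save(buf, format="PNG")
    return buf.getbuffer().nbytes
```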
2.3 Datasets for Predicting Interestingness
A critical point in building and evaluating any machine learning system is the availability
of labeled data. Although the literature on automatic interestingness prediction is
still at an early stage, there have been some attempts to construct evaluation data. In
the following, we introduce the most relevant initiatives.
Many of the authors have chosen to create their own datasets for evaluating
their methods. Various sources of information were used, mainly coming from
social media, e.g., Flickr [19, 27, 31, 43, 61], Pinterest [27], YouTube [31, 43].
The data consisted of the results returned by search queries. Annotations were
determined either automatically, by exploiting the available social media metadata
and statistics such as Flickr's "interestingness measure" in [19, 31], or manually, via
crowdsourcing in [27] or local human assessors in [31].
The authors in [19] used a dataset composed of 40,000 images, and kept the top
10%, ordered according to the Flickr interestingness score, as positive interesting
examples and the last 10% as negative, non interesting examples. Half of this dataset
was used for training and half for testing. The same top and last 10% of Flickr results
was used in [31], generating 1200 videos retrieved with 15 keyword queries, e.g.,
"basketball", "beach", "bird", "birthday", "cat", "dancing". In addition to these,
the authors in [31] also used 30 YouTube advertisement videos from each of 14 categories,
such as "accessories", "clothing&shoes", "computer&website", "digital products",
"drink". The videos had an average duration of 36 s and were annotated by human
assessors, thus generating a baseline interestingness score.
Apart from the individual datasets, there were also initiatives of grouping several
datasets of different compositions. The authors in [23] associated an internal
context to the data: a strong context dataset proposed in [22], where the images
in 20 publicly available webcam streams are consistently related to one another,
thus generating a collection of 20 image sequences each containing 159 images; a
weak context dataset introduced in [50], which consists of 2688 fixed-size images
grouped in 8 scene categories: "coast", "mountain", "forest", "open country",
"street", "inside city", "tall buildings" and "highways"; and a no context dataset
which consists of the 2222 image memorability dataset proposed in [29,30], with
no context or story behind the pictures.
3 The Predicting Media Interestingness Task
This section describes the Predicting Media Interestingness Task, which was
proposed in the context of the 2016 MediaEval international evaluation campaign.
This section addresses the task definition (Sect. 3.1), the description of the provided
data with its annotations (Sect. 3.2), and the evaluation protocol (Sect. 3.3).
3.1 Task Definition
Interestingness of media content is a perceptual and highly semantic notion that
remains very subjective and dependent on the user and the context. Nevertheless,
experiments show that there is, in general, an average and common interestingness
level, shared by most of the users [10]. This average interestingness level provides
evidence to envision that the building of a model for the prediction of interestingness
is feasible. Starting from this basic assumption and constraining the concept to a
clearly defined use case serves to disambiguate the notion and reduce the level
of subjectivity.
In the proposed benchmark, interestingness is assessed according to a practical
use case originated from Technicolor, where the goal is to help professionals
to illustrate a Video on Demand (VOD) web site by selecting some interesting
frames and/or video excerpts for the movies. We adopt the following definition of
interestingness: an image/video excerpt is interesting in the context of helping a user
to make his/her decision about whether he/she is interested in watching the movie
it represents. The proposed data is naturally adapted to this specific scenario, and
consists of professional content, i.e., Hollywood-like movies.
Given this data and use case, the task requires participants to develop systems
which can automatically select images and/or video segments which are considered
to be the most interesting according to the aforementioned definition. Interesting-
ness of the media is to be judged by the systems based on visual appearance, audio
information and text accompanying the data. Therefore, the challenge is inherently
multi-modal.
As presented in numerous studies in the literature, predicting the interestingness
level of images and videos often requires significantly different perspectives. Images
are self-contained and the information is captured in the scene composition and
colors, whereas videos are lower-quality images in motion, whose purpose is to
transmit the action via the movement of the objects. Therefore, to address the two
cases, two benchmarking scenarios (subtasks) are proposed:
• predicting image interestingness: given a set of key-frames extracted from a
movie, the systems are required to automatically identify those images that
viewers report to be the most interesting for the given movie. To solve the task,
participants can make use of visual content as well as external metadata, e.g.,
Internet data about the movie, social media information, etc.;
• predicting video interestingness: given the video shots of a movie, the systems
are required to automatically identify those shots that viewers report to be the
most interesting in the given movie. To solve the task, participants can make use
of visual and audio data as well as external data, e.g., subtitles, Internet data, etc.
A special feature of the provided data is the fact that it is extracted from the same
source movies, i.e., the key-frames are extracted from the provided video shots of
the movies. Therefore, this will allow for comparison between the two tasks, namely
to assess to which extent image and video interestingness are linked.
Furthermore, we proposed a binary scenario, where data can be either interesting
or not (two cases). Nevertheless, a confidence score is also required for each
decision, so that the final evaluation measure can be computed in a ranking
fashion. This is more closely related to a real-world usage scenario, where results
are provided in order of decreasing interestingness level.
3.2 Data Description
As mentioned in the previous section, the video and image subtasks are based on
a common dataset, which consists of Creative Commons trailers of Hollywood-like
movies, so as to allow redistribution. The dataset, its annotations, and accompanying
features, as described in the following subsections, are publicly available.⁴
The use of trailers, instead of full movies, has several motivations. Firstly, there is
the need for content that can be freely and publicly distributed, as opposed
to, e.g., full movies, which have much stronger restrictions on distribution. Basically,
each copyrighted movie would require an individual permission for distribution.
Secondly, using full movies is not practically feasible for the highly demanding
segmentation and annotation steps with limited time and resources, as the number
of images/video excerpts to process is enormous, in the order of millions. Finally,
running on full movies, even if the aforementioned problems were solved, would not
allow for a high diversification of the content, as only a few movies could
have been used. Trailers allow for selecting a larger number of movies and thus
diversifying the content.
Trailers are by definition representative of the main content and quality of the full
movies. However, it is important to note that trailers are already the result of some
manual filtering of the movie to find the most interesting scenes, but without spoiling
the movie's key elements. In practice, most trailers also contain less interesting or
slower-paced shots to balance their content. We therefore believe that this is a good
compromise for the practicality of the data/task.
The proposed dataset is split into development data, intended for designing and
training the algorithms which is based on 52 trailers; and testing data which is used
for the actual evaluation of the systems, and is based on 26 trailers.
The data for the video subtask was created by segmenting the trailers into video
shots. The same video shots were also used for the image subtask, but here each shot
is represented by a single key-frame image. The task is thus to classify the shots, or
key-frames, of a particular trailer, into interesting and non interesting samples.
⁴ http://www.technicolor.com/en/innovation/scientific-community/scientific-data-sharing/interestingness-dataset.
3.2.1 Shot Segmentation and Key-Frame Extraction
Video shot segmentation was carried out manually using a custom-made software
tool. Here we define a video shot as a continuous video sequence recorded between
a turn-on and a turn-off of the camera. For an edited video sequence, a shot is
delimited between two video transitions. Typical video transitions include sharp
transitions or cuts (direct concatenation of two shots), and gradual transitions
like fades (gradual disappearance/appearance of a frame to/from a black frame)
and dissolves (gradual transformation of one frame into another). In the process,
we discarded movie credits and title shots. Whenever possible, gradual transitions
were kept as shots of their own, as they are presumably very uninteresting. In a few
cases, shots in between two gradual transitions were too short to be segmented.
In that case, they were merged with their surrounding transitions, resulting in one
single shot.
The segmentation process resulted in 5054 shots for the development dataset, and
2342 shots for the test dataset, with an average duration of 1 s in each case. These
shots were used for the video subtask. For the image subtask, we extracted a single
key-frame for each shot. The key-frame was chosen as the middle frame, as it is
likely to capture the most representative information of the shot.
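The middle-frame rule is straightforward to reproduce. The sketch below, which assumes shot boundaries are available as (start_frame, end_frame) index pairs, extracts the key-frame of one shot with OpenCV; it is an illustration rather than the tool actually used to build the dataset.

```python
# Minimal sketch of the key-frame selection rule: for each manually segmented
# shot, keep the middle frame.
import cv2

def extract_middle_keyframe(video_path: str, start: int, end: int):
    """Return the middle frame of the shot [start, end] as a BGR image array."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start + end) // 2)  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```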
3.2.2 Ground-Truth Annotation
All video shots and key-frames were manually annotated in terms of interestingness
by human assessors. The annotation process was performed separately for the video
and image subtasks, to allow us to study the correlation between the two. Indeed we
would like to answer the question: Does image interestingness automatically imply
video interestingness, and vice versa?
A dedicated web-based tool was developed to assist the annotation process. The
tool has been released as free and open source software, so that others can benefit
from it and contribute improvements.⁵
We use the following annotation protocol. Instead of asking annotators to assign
an interestingness value to each shot/key-frame, we used a pair-wise comparison
protocol where the annotators were asked to select the more interesting shot/key-
frame from a pair of examples taken from the same trailer. Annotators were provided
with the clips for the shots and the images for the key-frames, presented side by
side. Also, they were informed about the Video on Demand use case and asked
to consider that "the selected video excerpts/key-frames should be suitable in
terms of helping a user to make his/her decision about whether he/she is interested
in watching a movie". Figure 1 illustrates the pair-wise decision stage of the user
interface.
⁵ https://github.com/mvsjober/pair-annotate.
Fig. 1 Web user interface for pair-wise annotations
The choice of a pair-wise annotation protocol instead of direct rating was based
on our previous experience with annotating multimedia for affective content and
interestingness [3,10,60]. Assigning a rating is a cognitively very demanding task,
requiring the annotator to understand, and constantly keep in mind, the full range
of the interestingness scale [70]. Making a single comparison is a much easier task
as one only needs to compare the interestingness of two items, and not consider
the full range. Directly assigning a rating value is also problematic since different
annotators may use different ranges, and even for the same annotator the values may
not be easily interpreted [51]. For example, is an increase from 0.3 to 0.4 the same
as the one from 0.8 to 0.9? Finally, it has been shown that pairwise comparisons are
less influenced by the order in which the annotations are displayed than with direct
rating [71].
However, annotating all possible pairs is not feasible due to the sheer number of
comparisons required. For instance, n shots/key-frames would require n(n − 1)/2
comparisons to be made for a full coverage. Instead, we adopted the adaptive square
design method [40], where the shots/key-frames are placed in a square design and
only pairs on the same row or column are compared. This reduces the number of
comparisons to n(√n − 1). For example, for n = 100 we need to perform only 900
comparisons instead of 4950 (full coverage). Finally, the Bradley-Terry-Luce (BTL)
model [7] was used to convert the paired comparison data to a scalar value (a sketch
of this conversion is given after the procedure listed below).
We modified the adaptive square design setup so that comparisons were taken
by many users simultaneously until all the required pairs had been covered. For the
rest, we proceeded according to the scheme in [40]:
1. Initialization: shots/key-frames are randomly assigned positions in the square
matrix;
2. Perform a single annotation round according to the shot/key-frame pairs given
by the square (across rows, columns);
3. Calculate the BTL scores based on the annotations;
4. Re-arrange the square matrix so that shots/key-frames are ranked according to
their BTL scores, and placed in a spiral. This arrangement ensures that mostly
similar shots/key-frames are compared row-wise and column-wise;
5. Repeat steps 2 to 4 until convergence.
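As referenced above, step 3 converts the paired comparisons into scalar scores with the BTL model. The following is an illustrative sketch (not the organizers' exact implementation) using the classic minorization-maximization update for BTL maximum likelihood.

```python
# Illustrative sketch of Bradley-Terry-Luce fitting from pairwise annotations,
# using the standard minorization-maximization (MM) update.
import numpy as np

def btl_scores(wins: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """wins[i, j] = number of times item i was preferred over item j."""
    n = wins.shape[0]
    p = np.ones(n)                     # BTL "strength" of each shot/key-frame
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()        # total wins of item i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()                   # normalize to keep the scale fixed
    return p
```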
For practical reasons, we considered by default that convergence is achieved
after five rounds and thus terminated the process once the five rounds were
finished. The final binary interestingness decisions were obtained with a heuristic
method that tried to detect a "jumping point" in the normalized distribution of the
BTL values for each movie separately. The underlying motivation for this empirical
rule is the assumption that the distribution is a sum of two underlying distributions:
non interesting shots/key-frames, and interesting shots/key-frames.
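The exact "jumping point" rule is not spelled out here; the sketch below is one plausible variant only, placing the cut at the largest gap between consecutive sorted, normalized BTL values of a movie and labeling everything above the cut as interesting.

```python
# Hypothetical sketch of a "jumping point" binarization over per-movie BTL
# values; this is an assumed heuristic, not the organizers' exact rule.
import numpy as np

def binarize_btl(btl_values: np.ndarray) -> np.ndarray:
    v = (btl_values - btl_values.min()) / (np.ptp(btl_values) + 1e-12)  # normalize to [0, 1]
    order = np.argsort(v)
    gaps = np.diff(v[order])                # gaps between consecutive sorted values
    cut = v[order][gaps.argmax()]           # value just below the largest jump
    return (v > cut).astype(int)            # 1 = interesting, 0 = non interesting
```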
Overall, 315 annotators participated in the annotation for the video data and 100
for the images. They came from 29 different countries around the
world. The average reported age of the annotators was 32, with a standard deviation
of around 13. Roughly 66% were male, 32% female, and 2% did not specify their
gender.
3.2.3 Additional Features
Apart from the data and its annotations, to broaden the targeted communities, we
also provide some pre-computed content descriptors, namely:
• Dense SIFT, computed following the original work in [45], except that
the local frame patches are densely sampled instead of using interest point detectors.
A codebook of 300 codewords is used in the quantization process with a spatial
pyramid of three layers [39].
• HoG descriptors, i.e., Histograms of Oriented Gradients [15], computed over
densely sampled patches. Following [68], HoG descriptors in a 2×2 neighborhood
are concatenated to form a descriptor of higher dimension.
• LBP, i.e., Local Binary Patterns as proposed in [49].
• GIST, computed based on the output energy of several Gabor-like filters (eight
orientations and four scales) over a dense frame grid like in [50].
• Color histogram, computed in the HSV space (Hue-Saturation-Value).
• MFCC, computed over 32 ms time-windows with 50% overlap. The cepstral
vectors are concatenated with their first and second derivatives.
• CNN features, i.e., the fc7 layer (4096 dimensions) and prob layer (1000
dimensions) of AlexNet [32].
• Mid-level face detection and tracking features, obtained by face tracking-by-
detection in each video shot via a HoG detector [15] and the correlation tracker
proposed in [16].
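Two of these descriptor types are simple enough to sketch directly. The snippet below computes an HSV color histogram with OpenCV and MFCCs with first and second derivatives with librosa; the bin counts, number of cepstral coefficients and sampling rate are assumptions, since the text does not fix them.

```python
# Sketch of two of the released descriptor types, with assumed parameter values.
import cv2
import librosa
import numpy as np

def hsv_color_histogram(bgr_image: np.ndarray, bins=(8, 8, 8)) -> np.ndarray:
    """Normalized 3-D color histogram in HSV space, flattened to a vector."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def mfcc_with_deltas(audio_path: str) -> np.ndarray:
    """MFCCs over 32 ms windows with 50% overlap, stacked with their deltas."""
    y, sr = librosa.load(audio_path, sr=16000)          # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,  # assumed 13 coefficients
                                n_fft=int(0.032 * sr), hop_length=int(0.016 * sr))
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2])                    # concatenated cepstral vectors
```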
3.3 Evaluation Rules
As for other tasks in MediaEval, participants were allowed to submit a total of up
to 5 runs for the video and image subtasks. To provide the reader with a complete
picture of the evaluation process and help the understanding of the achieved results, we
replicate here the exact conditions given to the participants.
Each task had a required run, namely: for predicting image interestingness,
classification had to be achieved with the use of the visual information only, no
external data was allowed; for predicting video interestingness, classification had to
be achieved with the use of both audio and visual information; no external data was
allowed. External data was considered to be any of the following: additional datasets
and annotations which were specifically designed for interestingness classification;
the use of pre-trained models, features, detectors obtained from such dedicated
additional datasets; additional metadata from the Internet (e.g., from IMDb). On the
contrary, CNN features trained on generic datasets such as ImageNet were allowed
for use in the required runs. By generic datasets, we mean datasets that were not
explicitly designed to support research in interestingness prediction. Additionally,
datasets dedicated to study memorability or other aspects of media were allowed,
as long as these concepts are different from interestingness, although a correlation
may exist.
To assess performance, several metrics were computed. The official evaluation
metric was the mean average precision (MAP) computed over all trailers, whereas
average precision was to be computed on a per trailer basis, over all ranked
images/video shots. MAP was computed with the trec_eval tool.6In addition to
MAP, several other secondary metrics were provided, namely: accuracy, precision,
recall and f-score for each class, and the class confusion matrix.
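The official metric is easy to reproduce in a few lines: average precision is computed per trailer over the ranked shots/key-frames, and MAP is the mean over trailers. The campaign itself used trec_eval; scikit-learn is used here only for illustration.

```python
# Minimal sketch of MAP over trailers, given binary labels and confidence scores.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(labels_per_trailer, scores_per_trailer) -> float:
    """Both arguments: lists (one entry per trailer) of binary labels / confidence scores."""
    aps = [average_precision_score(y, s)
           for y, s in zip(labels_per_trailer, scores_per_trailer)
           if np.any(y)]            # AP is undefined for trailers with no positive samples
    return float(np.mean(aps))
```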
4 Results and Analysis of the First Benchmark
4.1 Official Results
The 2016 Predicting Media Interestingness Task received more than 30 registrations
and 12 teams coming from 9 countries all over the world submitted runs in the end
(see Fig. 2). The task attracted a lot of interest from the community, which shows
the importance of this topic.
Tables 1 and 2 provide an overview of the official results for the two subtasks
(video and image interestingness prediction). A total of 54 runs were received,
⁶ http://trec.nist.gov/trec_eval/.
Fig. 2 2016 Predicting Media Interestingness task’s participation at different stages
Table 1 Official results for image interestingness prediction evaluated by MAP

Team  Run name  MAP
TUD-MMC [42]  me16in_tudmmc2_image_histface  0.2336
Technicolor [56]  me16in_technicolor_image_run1_SVM_rbf  0.2336
Technicolor  me16in_technicolor_image_run2_DNNresampling06_100  0.2315
MLPBOON [52]  me16in_MLPBOON_image_run5  0.2296
BigVid [69]  me16in_BigVid_image_run5FusionCNN  0.2294
MLPBOON  me16in_MLPBOON_image_run1  0.2205
TUD-MMC  me16in_tudmmc2_image_hist  0.2202
MLPBOON  me16in_MLPBOON_image_run4  0.217
HUCVL [21]  me16in_HUCVL_image_run1  0.2125
HUCVL  me16in_HUCVL_image_run2  0.2121
UIT-NII [38]  me16in_UITNII_image_FA  0.2115
RUC [12]  me16in_RUC_image_run2  0.2035
MLPBOON  me16in_MLPBOON_image_run2  0.2023
HUCVL  me16in_HUCVL_image_run3  0.2001
RUC  me16in_RUC_image_run3  0.1991
RUC  me16in_RUC_image_run1  0.1987
ETH-CVL [67]  me16in_ethcvl1_image_run2  0.1952
MLPBOON  me16in_MLPBOON_image_run3  0.1941
HKBU [44]  me16in_HKBU_image_baseline  0.1868
ETH-CVL  me16in_ethcvl1_image_run1  0.1866
ETH-CVL  me16in_ethcvl1_image_run3  0.1858
HKBU  me16in_HKBU_image_drbaseline  0.1839
BigVid  me16in_BigVid_image_run4SVM  0.1789
UIT-NII  me16in_UITNII_image_V1  0.1773
LAPI [14]  me16in_lapi_image_runf1  0.1714
UNIGECISA [53]  me16in_UNIGECISA_image_ReglineLoF  0.1704
Baseline  0.16556
LAPI  me16in_lapi_image_runf2  0.1398
Table 2 Official results for video interestingness prediction evaluated by MAP

Team  Run name  MAP
UNIFESP [1]  me16in_unifesp_video_run1  0.1815
HKBU [44]  me16in_HKBU_video_drbaseline  0.1735
UNIGECISA [53]  me16in_UNIGECISA_video_RegsrrLoF  0.171
RUC [12]  me16in_RUC_video_run2  0.1704
UIT-NII [38]  me16in_UITNII_video_A1  0.169
UNIFESP  me16in_unifesp_video_run4  0.1656
RUC  me16in_RUC_video_run1  0.1647
UIT-NII  me16in_UITNII_video_F1  0.1641
LAPI [14]  me16in_lapi_video_runf5  0.1629
Technicolor [56]  me16in_technicolor_video_run5_CSP_multimodal_80_epoch7  0.1618
UNIFESP  me16in_unifesp_video_run2  0.1617
UNIFESP  me16in_unifesp_video_run3  0.1617
ETH-CVL [67]  me16in_ethcvl1_video_run2  0.1574
LAPI  me16in_lapi_video_runf3  0.1574
LAPI  me16in_lapi_video_runf4  0.1572
TUD-MMC [42]  me16in_tudmmc2_video_histface  0.1558
TUD-MMC  me16in_tudmmc2_video_hist  0.1557
BigVid [69]  me16in_BigVid_video_run3RankSVM  0.154
HKBU  me16in_HKBU_video_baseline  0.1521
BigVid  me16in_BigVid_video_run2FusionCNN  0.1511
UNIGECISA  me16in_UNIGECISA_video_RegsrrGiFe  0.1497
Baseline  0.1496
BigVid  me16in_BigVid_video_run1SVM  0.1482
Technicolor  me16in_technicolor_video_run3_LSTM_U19_100_epoch5  0.1465
UNIFESP  me16in_unifesp_video_run5  0.1435
UNIGECISA  me16in_UNIGECISA_video_SVRloAudio  0.1367
Technicolor  me16in_technicolor_video_run4_CSP_video_80_epoch9  0.1365
ETH-CVL  me16in_ethcvl1_video_run1  0.1362
equally distributed between the two subtasks. As a general conclusion, the achieved
MAP values were low, which again demonstrates the challenging nature of this problem.
Slightly higher values were obtained for image interestingness prediction.
To serve as a baseline for comparison, we generated a random ranking run, i.e.,
samples were ranked randomly five times and we took the average MAP. Compared
to this baseline, the results of the image subtask hold up well, being almost all above it.
For the video subtask, on the other hand, the value range is smaller and a few systems
did worse than the baseline. In the following we present the participating systems and
analyze the achieved results in detail.
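The random baseline can be reproduced as follows, reusing the mean_average_precision helper sketched in Sect. 3.3; the seed and shuffling strategy are assumptions.

```python
# Sketch of the random-ranking baseline: score every shot/key-frame randomly
# five times and average the resulting MAP values.
import numpy as np

def random_baseline_map(labels_per_trailer, repeats: int = 5, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    maps = []
    for _ in range(repeats):
        scores = [rng.random(len(y)) for y in labels_per_trailer]
        maps.append(mean_average_precision(labels_per_trailer, scores))
    return float(np.mean(maps))
```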
4.2 Participating Systems and Global Trends
Numerous approaches have been investigated by the participating teams to tackle
both image and video interestingness prediction. In the following, we will firstly
summarize the general techniques used by the teams and their key features
(Sect. 4.2.1), and secondly present the global insights of the results (Sect. 4.2.2).
4.2.1 Participants’ Approaches
A summary of the features and classification techniques used by each participating
system is presented in Table 3(image interestingness) and Table 4(video inter-
estingness). Below, we present the main characteristics of each approach. Unless
otherwise specified, each team participated in both subtasks.
Table 3 Overview of the characteristics of the submitted systems for predicting image interestingness

BigVid [69]: features: denseSIFT + CNN + Style Attributes + SentiBank; classification: SVM (run4), Regularized DNN (run5).
ETH-CVL [67]: DNN-based Visual Semantic Embedding Model.
HKBU [44]: features: ColorHist + denseSIFT + GIST + HOG + LBP (run1), features from run1 + dimension reduction (run2); classification: nearest neighbor and SVR.
HUCVL [21]: features: CNN (run1, run3), MemNet (run2); classification: MLP (run1, run2), deep triplet network (run3).
LAPI [14]: features: ColorHist + GIST (run1), denseSIFT + GIST (run2); classification: SVM.
MLPBOON [52]: features: CNN, PCA for dimension reduction; classification: logistic regression.
RUC [12]: features: GIST + LBP + CNN prob (run1), ColorHist + GIST + CNN prob (run2), ColorHist + GIST + LBP + CNN prob (run3); classification: Random Forest (run1, run2), SVM (run3).
Technicolor [56]: features: CNN (AlexNet fc7); classification: SVM (run1), MLP (run2).
TUD-MMC [42]: features: face-related ColorHist (run1), face-related ColorHist + face area (run2); classification: normalized histogram-based confidence score (NHCS, run1), NHCS + normalized face area score (run2).
UIT-NII [38]: features: CNN (AlexNet + VGG) (run1), CNN (VGG) + GIST + HOG + denseSIFT (run2); classification: SVM with late fusion.
UNIGECISA [53]: features: multilingual visual sentiment ontology (MVSO) + CNN; classification: linear regression.
Table 4 Overview of the characteristics of the submitted systems for predicting video interestingness

BigVid [69]: features: denseSIFT, CNN, Style Attributes, SentiBank; classification: SVM (run1), Regularized DNN (run2), SVM/Ranking-SVM (run3); multi-modality: no.
ETH-CVL [67]: DNN-based Video2GIF (run1), Video2GIF + Visual Semantic Embedding Model (run2); multi-modality: text + visual.
HKBU [44]: features: ColorHist + denseSIFT + GIST + HOG + LBP (run1), features from run1 + dimension reduction (run2); classification: nearest neighbor and SVR; multi-modality: no.
LAPI [14]: features: GIST + CNN prob (run3), ColorHist + CNN (run4), denseSIFT + CNN prob (run5); classification: SVM; multi-modality: no.
RUC [12]: features: acoustic statistics + GIST (run4), MFCC with Fisher Vector encoding + GIST (run5); classification: SVM; multi-modality: audio + visual.
Technicolor [56]: features: CNN + MFCC; classification: LSTM-ResNet + MLP (run3), proposed RNN-based model (run4, run5); multi-modality: audio + visual.
TUD-MMC [42]: features: ColorHist (run1), ColorHist + face area (run2); classification: normalized histogram-based confidence score (NHCS) (run3), NHCS + normalized face area score (run4); multi-modality: no.
UIT-NII [38]: features: CNN (AlexNet) + MFCC (run3), CNN (VGG) + GIST (run4); classification: SVM with late fusion; multi-modality: audio + visual.
UNIFESP [1]: features: histogram of motion patterns (HMP) [2]; classification: majority voting of pairwise ranking methods (Ranking SVM, RankNet, RankBoost, ListNet); multi-modality: no.
UNIGECISA [53]: features: MVSO + CNN (run2), baseline visual features [18] (run3), emotionally-motivated audio features (run4); classification: SVR (run2), SPARROW (run3, run4); multi-modality: audio + visual.
BigVid [69] (Fudan University, China): explored various low-level features
(from visual and audio modalities) and high-level semantic attributes, as well as
the fusion of these features for classification. Both SVM and recent deep learning
methods were tested as classifiers. The results proved that the high-level attributes
are complementary to visual features since the combination of these features
increases the overall performance.
ETH-CVL [67] (ETH Zurich, Switzerland): participated in the video subtask
only. Two models were presented: (1) a frame-based model that uses textual side
information (external data) and (2) a generic predictor for finding video highlights
in the form of segments. For the frame-based model, they learned a joint embedding
space for image and text, which allows to measure relevance of a frame with
regard to some text such as the video title. For video interestingness prediction, the
approach in [24] was used, where a deep RankNet is trained to rank the segments of
a video based upon their suitability as animated GIFs. Note that RankNet captures
the spatio-temporal aspect of video segments via the use of 3D convolutional neural
networks (C3D).
HKBU [44] (Hong Kong Baptist University, China): used two dimensional-
ity reduction methods, named Neighborhood MinMax Projections (NMMP) and
Supervised Manifold Regression (SMR), to extract features of lower dimension
from a set of baseline low-level visual features (Color Histogram, dense SIFT,
GIST, HOG, LBP). Then nearest neighbor (NN) classifier and Support Vector
Regressor (SVR) were exploited for interestingness classification. They found
that after dimensionality reduction, the performance of the reduced features was
comparable to that of their original features, which indicated that the reduced
features successfully captured most of the discriminant information of the data.
HUCVL [21] (Hacettepe University, Turkey): participated in image interesting-
ness prediction only. They investigated three different Deep Neural Network (DNN)
models. The first two models were based on fine-tuning two pre-trained models,
namely AlexNet and MemNet. Note that MemNet was trained on the image memo-
rability dataset proposed in [36], the idea being to see if memorability can be gener-
alized to the interestingness concept. The third model, on the other hand, depends on
a proposed triplet network which comprised three instances with shared weights of
the same feed-forward network. The results demonstrated that all these models pro-
vide relatively similar and promising results on the image interestingness subtask.
LAPI [14] (University Politehnica of Bucharest, Romania, co-organizer of the
task): investigated a classic descriptor-classification scheme, namely the combi-
nation of different low-level features (HoG, dense SIFT, LBP, GIST, AlexNet fc7
layer features (hereafter referred as CNN features), Color Histogram, Color Naming
Histogram) and use of SVM, with different kernel types, as classifier. For video,
frame features were averaged to obtain a global video descriptor.
MLPBOON [52] (Indian Institute of Technology, Bombay, India): participated
only in image interestingness prediction and studied various baseline visual features
provided by the organizers [18], and classifiers on the development dataset. Principal
component analysis (PCA) was used for reducing the feature dimension. Their final
system involved the use of PCA on CNN features for the input representation
and logistic regression (LR) as classifier. Interestingly, they observed that the
combination of CNN features with GIST and Color Histogram features gave similar
performance to the use of CNN features only. Overall, this simple, yet effective,
system obtained quite high MAP values for the image subtask.
RUC [12] (Renmin University, China): investigated the use of CNN features
and AlexNet probabilistic layer (referred as CNN prob), and hand-crafted visual
features including Color Histogram, GIST, LBP, HOG, dense SIFT. Classifiers
were SVM and Random Forest. They found that semantic-level features, i.e., CNN
prob, and low-level appearance features are complementary. However, concate-
nating CNN features with hand-crafted features did not bring any improvement.
This finding is coherent with the statement from the MLPBOON team [52]. For
predicting video interestingness, the audio modality offered superior performance to the
visual modality, and the early fusion of the two modalities further boosted the
performance.
Technicolor [56] (Technicolor R&D France, co-organizer of the task): used
CNN features as visual features (for both the image and video subtasks), and
MFCC as audio feature (for the video subtask) and investigated the use of both
SVM and different Deep Neural Networks (DNN) as classification techniques.
For the image subtask, a simple system with CNN features and SVM resulted
in the best MAP, 0.2336. For the video subtask, multi-modality as a mid-level
fusion of audio and visual features, was taken into account within the DNN
framework. Additionally, a novel DNN architecture based on multiple Recurrent
Neural Networks (RNN) was proposed for modeling the temporal aspect of the
video, and a resampling/upsampling technique was used to deal with the unbalanced
dataset.
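A generic sketch of the descriptor-plus-classifier pipeline that several teams reported (precomputed AlexNet fc7 features fed to an RBF-kernel SVM whose decision values serve as interestingness confidences) is given below. The hyperparameters are placeholders, not the values used in any actual submission.

```python
# Hedged sketch of a CNN-features + SVM interestingness ranker.
import numpy as np
from sklearn.svm import SVC

def train_and_score(train_feats, train_labels, test_feats):
    """Fit an RBF SVM on fc7-style features and return confidence scores."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
    clf.fit(train_feats, train_labels)
    return clf.decision_function(test_feats)   # higher = more interesting

# Usage: rank the test key-frames of one trailer by decreasing score
# scores = train_and_score(X_train, y_train, X_test)
# ranking = np.argsort(scores)[::-1]
```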
TUD-MMC [42] (Delft University of Technology, Netherlands): investigated
MAP values obtained on the development set by swapping and submitting ground-
truth annotations of image and video to the video and image subtasks respectively,
i.e., using the video ground-truth as submission on the image subtask and the
image ground-truth as submission on the video subtask. They concluded that the
correlation between the image interestingness and video interestingness concepts is low.
Their simple visual features took into account the human face information (color
and sizes) in the image and video with the assumption that clear human faces should
attract the viewer’s attention and thus make the image/video more interesting. One
of their submitted runs, only rule-based, obtained the best MAP value of 0.2336 for
the image subtask.
UIT-NII [38] (University of Science, Vietnam; University of Information Tech-
nology, Vietnam; National Institute of Informatics, Japan): used SVM to predict
three different scores given the three types of input features: (1) low-level visual
features provided by the organizers [18], (2) CNN features (AlexNet and VGG),
and (3) MFCC as audio feature. Late fusion of these scores was used for computing
the final interestingness levels. Interestingly, their system tends to output a higher
rank on images of beautiful women. Furthermore, they found that images from dark
scenes were often considered as more interesting.
UNIFESP [1] (Federal University of Sao Paulo, Brazil): participated only in the
video subtask. Their approach was based on combining learning-to-rank algorithms
for predicting the interestingness of videos by using their visual content only. For
this purpose, Histogram of Motion Patterns (HMP) [2] were used. A simple majority
voting scheme was used for combining four pairwise machine learned rankers
(Ranking SVM, RankNet, RankBoost, ListNet) and predicting the interestingness
of videos. This simple, yet effective, method obtained the best MAP of 0.1815 for
the video subtask.
UNIGECISA [53] (University of Geneva, Switzerland): used mid-level semantic
visual sentiment features, which are related to the emotional content of images
and were shown to be effective in recognizing interestingness in GIFs [24]. They
found that these features outperform the baseline low-level ones provided by the
organizers [18]. They also investigated the use of emotionally-motivated audio
features (eGeMAPS) for the video subtask and showed the significance of the audio
modality. Three regression models were reported to predict interestingness levels:
linear regression (LR), SVR with linear kernel, and sparse approximation weighted
regression (SPARROW).
4.2.2 Analysis of This Year's Trends and Outputs
This section provides an in-depth analysis of the results and discusses the global
trends found in the submitted systems.
Low-Level vs. High-Level Description The conventional low-level visual fea-
tures, such as dense SIFT, GIST, LBP, Color Histogram, were still being used by
many of the systems for both image and video interestingness prediction [12, 14,
38,44,69]. However, deep features like CNN features (i.e., Alexnet fc7 or VGG)
have become dominant and are exploited by the majority of the systems. This
shows the effectiveness and popularity of deep learning. Some teams investigated
the combination of hand crafted features with deep features, i.e., conventional and
CNN features. A general finding is that such a combination did not really bring
any benefit to the prediction results [12,44,52]. Some systems combined low-level
features with some high-level attributes such as emotional expressions, human faces,
CNN visual concept predictions [12,69]. In this case, the resulting conclusion was
that low-level appearance features and semantic-level features are complementary,
as the combination in general offered better prediction results.
Standard vs. Deep Learning-Based Classification As it can be seen in Tables 3
and 4, SVM was used by a large number of systems for both prediction
tasks. In addition, regression techniques such as linear regression, logistic
regression, and support vector regression were also widely reported. Contrary to
CNN features, which were used by most of the systems, deep learning
classification techniques were investigated less (see [21, 56, 67, 69] for image
interestingness and [56, 67, 69] for video interestingness). This may be due to
the fact that the datasets are not large enough to justify a deep learning approach.
Conventional classifiers were preferred here.
Use of External Data Some systems investigated the use of external data to
improve the results. For instance, Flickr images with social-driven interestingness
labels were used for model selection in the image interestingness subtask by the
Technicolor team [56]. The HUCVL team [21] submitted a run with a fine-tuning of
the MemNet model, which was trained for image memorability prediction. Although
memorability and interestingness are not the same concept, the authors expected
that fine-tuning a model related to an intrinsic property of images could be helpful
in learning better high-level features for image interestingness prediction. The ETH-
CVL team [67] exploited movie titles, as textual side information related to movies,
for both subtasks. In addition, ETH-CVL also investigated the use of the deep
RankNet model, which was trained on the Video2GIF dataset [24], and the Visual
Semantic Embedding model, which was trained on the MSR Clickture dataset [28].
Dealing with Small and Unbalanced Data As the development data provided for
the two subtasks are not very large, some systems, e.g., [1,56], used the whole
image and video development sets for training when building the final models. To
cope with the imbalance of the two classes in the dataset, the Technicolor team [56]
proposed to use classic resampling and upsampling strategies so that the positive
samples are used multiple times during training.
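The resampling strategy just described amounts to replicating the minority (interesting) samples so that they are seen several times during training. A minimal sketch, with an assumed replication factor:

```python
# Sketch of simple oversampling of the positive (interesting) class.
import numpy as np

def oversample_positives(features: np.ndarray, labels: np.ndarray, factor: int = 5):
    """Replicate positive samples `factor` times in total and reshuffle."""
    pos = np.where(labels == 1)[0]
    idx = np.concatenate([np.arange(len(labels))] + [pos] * (factor - 1))
    np.random.shuffle(idx)
    return features[idx], labels[idx]
```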
Multi-Modality Specific to video interestingness, multi-modal approaches were exploited by half of the teams for at least one of their runs, as shown in Table 4. Four teams combined audio and visual information [12,38,53,56], and one team combined text with visual information [67]. The fusion of modalities was performed either at an early stage [12,53], a middle stage [56], or a late stage [38] of the processing workflow. Note that the combination of text and visual information was also reported in [67] for image interestingness prediction. The general finding here was that multi-modality brings benefits to the prediction results.
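The two most commonly reported fusion schemes can be sketched as follows: early fusion concatenates per-modality descriptors before a single classifier, while late fusion averages the scores of per-modality classifiers. The weights below are illustrative assumptions, not tuned values from the benchmark systems.

```python
# Sketch: early fusion (feature concatenation) vs. late fusion (score averaging).
import numpy as np

def early_fusion(audio_feats, visual_feats):
    """Concatenate per-modality descriptors before a single classifier."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

def late_fusion(audio_scores, visual_scores, w_audio=0.5, w_visual=0.5):
    """Weighted average of per-modality prediction scores."""
    return w_audio * np.asarray(audio_scores) + w_visual * np.asarray(visual_scores)
```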
Temporal Modeling for Video Although the temporal dimension is an important property of video, most systems did not exploit any temporal modeling for video interestingness prediction. They mainly considered a video as a sequence of frames, and a global video descriptor was computed simply by averaging frame descriptors over each shot. As an example, the HKBU team [44] treated each frame as a separate image and computed the average and standard deviation of the frame features over all frames in a shot to build a global feature vector for each video. Only two teams incorporated temporal modeling in their submitted systems, namely Technicolor [56], who used long short-term memory (LSTM) networks in their deep learning-based framework, and ETH-CVL [67], who used 3D convolutional neural networks (C3D) in their video highlight detector, trained on the Video2GIF dataset.
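The frame-pooling strategy used instead of temporal modeling can be sketched in a few lines: a shot descriptor is built from the mean and standard deviation of its frame descriptors, in the spirit of the HKBU system [44]. The array layout is an assumption for illustration.

```python
# Sketch: mean/std pooling of frame descriptors into a single shot descriptor.
import numpy as np

def pool_shot(frame_descriptors):
    """frame_descriptors: (n_frames, dim) array -> (2*dim,) shot descriptor."""
    frame_descriptors = np.asarray(frame_descriptors)
    return np.concatenate([frame_descriptors.mean(axis=0),
                           frame_descriptors.std(axis=0)])
```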
4.3 In-Depth Analysis of the Data and Annotations
The purpose of this section is to give some insights on the characteristics of the
produced data, i.e., the dataset and its annotations.
4.3.1 Quality of the Dataset
In general, the overall results obtained during the 2016 campaign show low MAP values (see Figs. 1 and 2), especially for the video interestingness prediction subtask. For comparison, we provide examples of MAP values obtained by other multi-modal tasks from the literature. Of course, these were obtained on other datasets which differ fundamentally from ours, both in the nature of the data and in the use case scenario. A direct comparison is therefore not possible; however, these numbers give an idea of current classification capabilities for video:
– ILSVRC 2015, object detection with provided training data, 200 fully labeled categories: best MAP is 0.62; object detection from videos with provided training data, 30 fully labeled categories: best MAP is 0.67;
– TRECVID 2015, semantic indexing of concepts such as airplane, kitchen, flags, etc.: best MAP is 0.37;
– TRECVID 2015, multi-modal event detection, e.g., somebody cooking on an outdoor grill: best MAP is less than 0.35.
Although these values are higher than the MAP obtained for the Predicting Media Interestingness Task, it must be noted that for the more difficult tasks, such as multi-modal event detection, the difference in performance is not that large, especially given that the proposed challenge is far more subjective than the tasks we are referring to.
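For reference, the metric being compared here can be computed as follows: average precision (AP) over a ranked list of shots or key-frames, averaged across trailers to give MAP. The grouping by trailer reflects our reading of the task setup and is stated here as an assumption.

```python
# Sketch: average precision over a ranked list, averaged over trailers (MAP).
import numpy as np

def average_precision(ranked_labels):
    """ranked_labels: binary ground-truth labels ordered by predicted rank."""
    ranked_labels = np.asarray(ranked_labels)
    hits = np.cumsum(ranked_labels)
    precisions = hits / (np.arange(len(ranked_labels)) + 1)
    n_pos = ranked_labels.sum()
    return float((precisions * ranked_labels).sum() / n_pos) if n_pos else 0.0

def mean_average_precision(ranked_labels_per_trailer):
    return float(np.mean([average_precision(r) for r in ranked_labels_per_trailer]))
```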
Nevertheless, we may wonder, especially for the video interestingness subtask, whether the quality of the dataset and its annotations partly affects the prediction performance. Firstly, although the dataset size is sufficient for classic learning techniques and required a huge annotation effort, it may not be sufficient for deep learning, with only a few thousand samples for both subtasks.
Furthermore, the dataset is highly unbalanced, with 8.3% and 9.6% of interesting content for the development set and test set, respectively. Coping with this imbalance has been shown to increase the performance of some systems [56,57]. This leads to the conclusion that, although the imbalance reflects reality, i.e., interesting content corresponds to only a small part of the data, it makes the task even more difficult, as systems have to take this characteristic into account.
Finally, in Sect. 3.2, we explained that the final annotations were determined with
an iterative process which required the convergence of the results. Due to limited
time and human resources, this process was limited to five rounds. More rounds
would certainly have resulted in better convergence of the inter-annotator ratings.
To give an idea of the subjective quality of the ground-truth rankings, Figs. 3 and 4 illustrate some image examples from the image interestingness subtask, together with the rankings obtained by one of the best systems and by the second worst performing system, for both interesting and non interesting images.
The figures show that the results obtained by the best system for the most interesting images are coherent with the selection proposed by the ground-truth, whereas the second worst performing system places at the top ranks more images which do not really contain any information, e.g., black or uniform frames, blurred frames, or objects and persons that are only partially visible.
Fig. 3 Examples of interesting images from different videos of the test set. Images are ranked from left to right by decreasing interestingness. (a) Interesting images according to the ground-truth. (b) Interesting images selected by the best system. (c) Interesting images selected by the second worst performing system (Color figure online)
Fig. 4 Examples of non interesting images from different videos of the test set. Images are ranked from left to right by increasing interestingness. (a) Non interesting images according to the ground-truth. (b) Non interesting images selected by the best system (Color figure online)
These observations converge to the idea that both the provided ground-truth and the best performing systems have managed to capture the interestingness of images. They also confirm that the obtained MAP values, although quite low, nevertheless correspond to real differences in interestingness prediction performance.
The images classified as non interesting (Fig. 4) are also a source of interesting insights. According to the ground-truth and also to the best performing systems, non interesting images tend to be those that are mostly uniform, of low quality or without meaningful information. The amount of information contained in the non interesting images then increases with the level of interestingness. Note that, unlike for the interesting images, we do not show the images classified as non interesting by the second worst performing system, because there were too few of them (only 7 images out of 25 videos in this example) to draw any conclusion.
We also calculated Krippendorff's alpha metric (α), which is a measure of inter-observer agreement [26,37], and obtained α = 0.059 for image interestingness and α = 0.063 for video interestingness. This result would indicate that there is no inter-observer agreement. However, as our method (by design) produced very few duplicate comparisons, it is not clear if this result is reliable.
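For readers who wish to reproduce such an agreement figure, the sketch below assumes the `krippendorff` PyPI package; the reliability matrix has one row per annotator and one column per compared pair, with NaN where an annotator did not rate that pair. The toy values are purely illustrative.

```python
# Sketch: Krippendorff's alpha on annotator-by-item data (krippendorff package assumed).
import numpy as np
import krippendorff

reliability_data = np.array([
    [1, 0, np.nan, 1],   # annotator 1's binary choices (toy values)
    [1, np.nan, 0, 1],   # annotator 2
    [np.nan, 0, 0, 1],   # annotator 3
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```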
As a last insight, it is worth noting that the two experienced teams [53,67], i.e., the two teams that had worked on predicting content interestingness before the MediaEval benchmark, did not achieve particularly good results on either subtask, and especially not on the image subtask. This raises the question of the generalization ability of their systems to different types of content, unless this difference in performance comes from their choice of different use cases as working context. If the latter is true, it would indicate that different use cases correspond to different interpretations of the interestingness concept.
4.3.2 Correlation Between the Two Subtasks
The Predicting Media Interestingness task was designed so that a comparison between the interestingness prediction for images and videos would be possible afterwards. Indeed, the same videos were used to extract both the shots and the key-frames to be classified in each subtask, each key-frame corresponding to the middle of a shot. Thanks to this, we studied the potential correlation between image interestingness and video interestingness.
Figure 5 shows the annotated video rankings against their key-frame rankings for several videos in the development set. None of the curves exhibits a correlation (the coefficient of determination R², obtained when fitting a regression line to the data, is lower than 0.03 in all cases), leading to the conclusion that the two concepts differ, in the sense that we cannot use video interestingness to infer image interestingness, or the other way round, on this data and use case scenario.
Fig. 5 Representation of image rankings vs. video rankings from the ground-truth for several videos of the development set. (a) Video 0, (b) Video 4, (c) Video 7, (d) Video 10, (e) Video 14, (f) Video 51
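The R² values mentioned above can be obtained as in the sketch below, which fits a linear regression of key-frame (image) ranks against shot (video) ranks for one trailer; scipy is assumed and the rank arrays are illustrative.

```python
# Sketch: coefficient of determination between image and video rankings.
import numpy as np
from scipy import stats

def rank_r_squared(video_ranks, image_ranks):
    """Both inputs: one rank value per shot of a given trailer."""
    slope, intercept, r_value, p_value, std_err = stats.linregress(video_ranks,
                                                                   image_ranks)
    return r_value ** 2

# Example with random ranks (R^2 is near 0, as observed on the real data).
rng = np.random.default_rng(0)
ranks = np.arange(30)
print(rank_r_squared(ranks, rng.permutation(ranks)))
```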
This conclusion is in line with the findings of [42], where the authors assessed the ground-truth ranking of the image subtask against the ground-truth ranking of the video subtask and vice versa. The MAP value achieved by the video ground-truth on the image subtask was 0.1747, while for the image ground-truth on the video subtask it was 0.1457, i.e., in the range of, or even lower than, the random baseline in both cases. Videos obviously contain more information
than a single image, and this additional information can be conveyed by other channels such as audio and motion. Because of it, a video might be globally considered as interesting while a single key-frame extracted from the same video is considered as non interesting. This can explain, in some cases, the observed discrepancy between image and video interestingness.
4.3.3 Link with Perceptual Content Characteristics
To infer potential links between the interestingness concept and perceptual content characteristics, we studied how low-level characteristics such as shot length, average luminance, blur, and the presence of high-quality faces influence the interestingness prediction of images and videos.
A first qualitative inspection of the interesting and non interesting image sets in the development and test sets shows that uniformly black and very blurry images were mostly classified as non interesting. So were the majority of images carrying no real information, i.e., close-ups of common objects, partly cut faces or objects, etc., as can be seen in Fig. 4.
Figure 6 shows the distributions of interestingness values for both the development and test sets in the video interestingness subtask, compared to the same distributions restricted to shots with fewer than 10 frames. In both cases, the distributions for short shots can simply be superimposed on the complete distributions, meaning that shot length does not seem to influence the interestingness of video segments, even for very short durations. On the contrary, Fig. 7 shows the two same types of distributions for the image interestingness subtask, this time assessing the influence of the average luminance value on interestingness. Here, the distributions of interestingness levels for the images with low average luminance appear slightly shifted toward lower interestingness levels. This might lead us to the conclusion that low average luminance values tend to decrease the interestingness level of a given image, contrary to the conclusion in [38].
Fig. 6 Video interestingness and shot length: distribution of interestingness levels (in blue: all shots considered; in green: shots with length smaller than 10 frames). (a) Development set, (b) test set (Color figure online)
Fig. 7 Image interestingness and average luminance: distribution of interestingness levels (in blue: all key-frames considered; in green: key-frames with luminance values lower than 25). (a) Development set, (b) test set (Color figure online)
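The luminance analysis behind Fig. 7 can be sketched as follows: the average luminance of a key-frame is approximated by the mean of its grayscale values, and frames below a threshold (25 here, as in the figure) form the "dark" subset. OpenCV is assumed for image loading and color conversion.

```python
# Sketch: average luminance of a key-frame and thresholding into a dark subset.
import cv2

def average_luminance(image_path: str) -> float:
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return float(gray.mean())

def split_by_luminance(image_paths, threshold=25.0):
    """Return the paths whose average luminance falls below `threshold`."""
    return [p for p in image_paths if average_luminance(p) < threshold]
```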
We also investigated a potential correlation between the presence of high-quality faces in frames and the interestingness level. By high-quality faces, we mean relatively large faces, frontal or profile, with no motion blur, closed eyes or grimaces. This mid-level characteristic was assessed manually by counting the number of high-quality faces present in both the interesting and non interesting images of the image interestingness subtask. The proportion of images containing high-quality faces on the development set was found to be 48.2% for the images annotated as interesting and 33.9% for the images annotated as non interesting. For the test set, 56.0% of the interesting images and 36.7% of the non interesting images contain high-quality faces. The difference in favor of the interesting sets suggests that this characteristic has a positive influence on the interestingness assessment. This was confirmed by the results of the TUD-MMC team [42], who based their system only on the detection of such high-quality faces and achieved the best MAP value for the image subtask.
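The statistics above were obtained manually; the sketch below is only a rough automatic approximation using OpenCV's Haar cascade, with a minimum detection size standing in for the "relatively large face" criterion. The size threshold is an assumption, and this is neither the method used in our annotation nor the TUD-MMC system.

```python
# Sketch: approximate the proportion of images containing a sufficiently large face.
import cv2

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def has_large_face(image_path: str, min_size=(80, 80)) -> bool:
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5, minSize=min_size)
    return len(faces) > 0

def face_proportion(image_paths) -> float:
    """Fraction of images containing at least one detected face above min_size."""
    return sum(has_large_face(p) for p in image_paths) / max(len(image_paths), 1)
```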
As a general conclusion, we may say that perceptual quality plays an important role when assessing the interestingness of images, although it is not the only cue for assessing the interestingness of content. Among semantic objects, the presence of good-quality human faces appears to be correlated with interestingness.
5 Conclusions and Future Challenges
In this chapter we introduced a specially designed evaluation framework for
assessing the performance of automatic techniques for predicting image and video
interestingness. We described the released dataset and its annotations. Content
interestingness was defined in a multi-modal scenario and for a real-world, specific
use case defined by Technicolor R&D France, namely the selection of interesting
images and video excerpts for helping professionals to illustrate a Video on Demand
(VOD) web site.
The proposed framework was validated during the 2016 Predicting Media Interestingness Task, organized within the MediaEval Benchmarking Initiative for Multimedia Evaluation. It received participation from 12 teams, which submitted a total of 54 runs. The highest MAP obtained for the image interestingness data was 0.2336, whereas for video interestingness prediction it was only 0.1815. Although a wide range of approaches was tested, from standard classifiers and descriptors to deep learning and the use of pre-trained models, the results show the difficulty of this task.
From the experience with this data, we can draw some general conclusions that will help shape future datasets in this area. Firstly, one should note that generating data and ground truth for such a subjective task is a huge effort, and effective methods should be devised to reduce the complexity of annotation. In our approach we relied on a pair-wise comparison protocol, applied in an adaptive square design to avoid comparing all possible pairs. This has limitations, as it still requires a great number of annotators and resulted in low inter-annotator agreement. A potential improvement may consist in directly ranking series of images/videos. We could also consider crowd-sourcing the key-frames/videos returned by the participants' systems to extract the most interesting samples, and evaluating the performance of the systems against these samples only.
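For context, one standard way to turn sparse pair-wise comparisons into a global ranking is a Bradley-Terry model [7], here fitted with the classic minorization-maximization updates. This is a hedged sketch of the general technique; the data layout is assumed for illustration and it does not restate the exact aggregation procedure of Sect. 3.2.

```python
# Sketch: Bradley-Terry ranking from pairwise win counts (MM updates).
import numpy as np

def bradley_terry_ranking(wins, n_iter=100):
    """wins: (n_items, n_items) matrix where wins[i, j] counts how often item i
    beat item j. Returns item indices sorted from most to least preferred."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    comparisons = wins + wins.T            # total comparisons per pair
    strength = np.ones(n)
    for _ in range(n_iter):
        total_wins = wins.sum(axis=1)
        denom = np.zeros(n)
        for i in range(n):
            mask = comparisons[i] > 0
            denom[i] = (comparisons[i, mask] /
                        (strength[i] + strength[mask])).sum()
        strength = np.where(denom > 0,
                            total_wins / np.maximum(denom, 1e-12), strength)
        strength /= strength.sum()         # normalize for numerical stability
    return np.argsort(-strength)
```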
Secondly, the source of the data is key for a solid evaluation. In our approach we selected movie trailers, due to their Creative Commons licenses which allow redistribution; other movies are, in almost all cases, closed content for the community. On the other hand, trailers are edited content, which limits to some extent the naturalness of the task, but they offer a good compromise given the circumstances. Future improvements could consist of selecting the data from full movies, as a few Creative Commons movies are indeed available. This would require a greater annotation effort but might provide a better separation between interesting and non interesting content.
Thirdly, a clear definition of image/video interestingness is mandatory. The concept of content interestingness is already very subjective and highly user dependent, even compared to other video concepts exploited in the TRECVID or ImageCLEF benchmarks. A well-founded definition allows for a focused evaluation and disambiguates the information need. In our approach, we defined interestingness in the context of selecting video content for illustrating a web site, where interesting means an image/video which would be interesting enough to convince the user to watch the source movie. As a future challenge, we might want to compare the results of interestingness prediction for different use scenarios, or even test the generalization power of the approaches.
Finally, although the image and video data were by design specifically correlated, i.e., images were selected as key-frames from the videos, the results show that predicting image interestingness and predicting video interestingness are actually two completely different tasks. This had more or less been shown in the literature; however, in those cases, images and videos were not chosen to be correlated. A future perspective might therefore be to separate the two tasks, while focusing on more representative data for each.
Acknowledgements We would like to thank Yu-Gang Jiang and Baohan Xu from Fudan University, China, and Hervé Bredin from LIMSI, France, for providing the features that accompany the released data, and Frédéric Lefebvre, Alexey Ozerov and Vincent Demoulin for their valuable input to the task definition. We would also like to thank our anonymous annotators for their contribution to building the ground-truth for the datasets. Part of this work was funded under project SPOTTER PN-III-P2-2.1-PED-2016-1065, contract 30PED/2017.
References
1. Almeida, J.: UNIFESP at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
2. Almeida, J., Leite, N.J., Torres, R.S.: Comparison of video sequences with histograms of
motion patterns. In: IEEE ICIP International Conference on Image Processing, pp. 3673–3676
(2011)
3. Baveye, Y., Dellandréa, E., Chamaret, C., Chen, L.: Liris-accede: a video database for affective
content analysis. IEEE Trans. Affect. Comput. 6(1), 43–55 (2015)
4. Berg, A.C., Berg, T.L., Daume, H., Dodge, J., Goyal, A., Han, X., Mensch, A., Mitchell, M.,
Sood, A., Stratos, K., et al.: Understanding and predicting importance in images. In: IEEE
CVPR International Conference on Computer Vision and Pattern Recognition, pp. 3562–3569.
IEEE, Providence (2012)
5. Berlyne, D.E.: Conflict, Arousal and Curiosity. Mc-Graw-Hill, New York (1960)
6. Boiman, O., Irani, M.: Detecting irregularities in images and in video. Int. J. Comput. Vis.
74(1), 17–31 (2007)
7. Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: the method of paired
comparisons. Biometrika 39(3-4), 324–345 (1952)
8. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers.
In: ACM Sigmod Record, vol. 29, pp. 93–104. ACM, New York (2000)
9. Bulling, A., Roggen, D.: Recognition of visual memory recall processes using eye movement
analysis. In: Proceedings of the 13th international conference on Ubiquitous Computing, pp.
455–464. ACM, New York (2011)
10. Chamaret, C., Demarty, C.H., Demoulin, V., Marquant, G.: Experiencing the interestingness
concept within and between pictures. In: Proceeding of SPIE, Human Vision and Electronic
Imaging (2016)
11. Chen, A., Darst, P.W., Pangrazi, R.P.: An examination of situational interest and its sources.
Br. J. Educ. Psychol. 71(3), 383–400 (2001)
12. Chen, S., Dian, Y., Jin, Q.: RUC at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
13. Chu, S.L., Fedorovskaya, E., Quek, F., Snyder, J.: The effect of familiarity on perceived
interestingness of images. In: Proceedings of SPIE, vol. 8651, p. 86511C (2013). doi:10.1117/12.2008551
14. Constantin, M.G., Boteanu, B., Ionescu, B.: LAPI at MediaEval 2016 Predicting Media
Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition (2005)
16. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M.: Accurate scale estimation for robust visual
tracking. In: British Machine Vision Conference (2014)
17. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a
computational approach. In: IEEE ECCV European Conference on Computer Vision, pp. 288–
301. Springer, Berlin (2006)
18. Demarty, C.H., Sjöberg, M., Ionescu, B., Do, T.T., Wang, H., Duong, N.Q.K., Lefebvre, F.:
Mediaeval 2016 Predicting Media Interestingness Task. In: Proceedings of the MediaEval
Workshop, Hilversum (2016)
19. Dhar, S., Ordonez, V., Berg, T.L.: High level describable attributes for predicting aesthetics
and interestingness. In: IEEE International Conference on Computer Vision and Pattern
Recognition (2011)
20. Elazary, L., Itti, L.: Interesting objects are visually salient. J. Vis. 8(3), 3–3 (2008)
21. Erdogan, G., Erdem, A., Erdem, E.: HUCVL at MediaEval 2016: predicting interesting key
frames with deep models. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
22. Grabner, H., Nater, F., Druey, M., Gool, L.V.: Visual interestingness in image sequences.
In: ACM International Conference on Multimedia, pp. 1017–1026. ACM, New York (2013).
doi:10.1145/2502081.2502109
23. Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., van Gool, L.: The interestingness of
images. In: ICCV International Conference on Computer Vision (2013)
24. Gygli, M., Song, Y., Cao, L.: Video2gif: automatic generation of animated gifs from video.
CoRR abs/1605.04850 (2016). http://arxiv.org/abs/1605.04850
25. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural
Information Processing Systems, pp. 545–552 (2006)
26. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding
data. Commun. Methods Meas. 1(1), 77–89 (2007). doi:10.1080/19312450709336664
27. Hsieh, L.C., Hsu, W.H., Wang, H.C.: Investigating and predicting social and visual image
interestingness on social media by crowdsourcing. In: 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4309–4313. IEEE, Providence (2014)
28. Hua, X.S., Yang, L., Wang, J., Wang, J., Ye, M., Wang, K., Rui, Y., Li, J.: Clickage:
towards bridging semantic and intent gaps via mining click logs of search engines. In: ACM
International Conference on Multimedia (2013)
29. Isola, P., Parikh, D., Torralba, A., Oliva, A.: Understanding the intrinsic memorability of
images. In: Advances in Neural Information Processing Systems, pp. 2429–2437 (2011)
30. Isola, P., Xiao, J., Torralba, A., Oliva, A.: What makes an image memorable? In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition, pp. 145–152. IEEE,
Providence (2011)
31. Jiang, Y.G., Wang, Y., Feng, R., Xue, X., Zheng, Y., Yan, H.: Understanding and predicting
interestingness of videos. In: AAAI Conference on Artificial Intelligence (2013)
32. Jiang, Y.G., Dai, Q., Mei, T., Rui, Y., Chang, S.F.: Super fast event recognition in internet
videos. IEEE Trans. Multimedia 177(8), 1–13 (2015)
33. Joachims, T.: Optimizing search engines using clickthrough data. In: ACM SIGKDD
international conference on Knowledge discovery and data mining, pp. 133–142. ACM, New
York (2002)
34. Ke, Y., Hoiem, D., Sukthankar, R.: Computer vision for music identification. In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 597–604.
IEEE, Providence (2005)
35. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In:
IEEE CVPR International Conference on Computer Vision and Pattern Recognition, vol. 1, pp.
419–426. IEEE, Providence (2006)
36. Khosla, A., Raju, A.S., Torralba, A., Oliva, A.: Understanding and predicting image memora-
bility at a large scale. In: International Conference on Computer Vision (ICCV) (2015)
37. Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 3rd edn. Sage,
Thousand Oaks (2013)
38. Lam, V., Do, T., Phan, S., Le, D.D., Satoh, S., Duong, D.: NII-UIT at MediaEval 2016 Pre-
dicting Media Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum
(2016)
39. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for
recognizing natural scene categories. In: IEEE CVPR International Conference on Computer
Vision and Pattern Recognition, pp. 2169–2178 (2006)
40. Li, J., Barkowsky, M., Le Callet, P.: Boosting paired comparison methodology in measuring
visual discomfort of 3dtv: performances of three different designs. In: Proceedings of SPIE
Electronic Imaging, Stereoscopic Displays and Applications, vol. 8648 (2013)
41. Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: a high-level image representation for scene
classification & semantic feature sparsification. In: Advances in Neural Information Processing
Systems, pp. 1378–1386 (2010)
42. Liem, C.: TUD-MMC at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
43. Liu, F., Niu, Y., Gleicher, M.: Using web photos for measuring video frame interestingness.
In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2058–2063
(2009)
44. Liu, Y., Gu, Z., Cheung, Y.M.: Supervised manifold learning for media interestingness
prediction. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
45. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60,
91–110 (2004)
46. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychol-
ogy and art theory. In: ACM International Conference on Multimedia, pp. 83–92. ACM, New
York (2010). doi:10.1145/1873951.1873965
47. McCrae, R.R.: Aesthetic chills as a universal marker of openness to experience. Motiv. Emot.
31(1), 5–11 (2007)
48. Murray, N., Marchesotti, L., Perronnin, F.: Ava: a large-scale database for aesthetic visual anal-
ysis. In: IEEE CVPR International Conference on Computer Vision and Pattern Recognition,
pp. 2408–2415. IEEE, Providence (2012)
49. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7),
971–987 (2002)
50. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial
envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
51. Ovadia, S.: Ratings and rankings: reconsidering the structure of values and their measurement.
Int. J. Soc. Res. Methodol. 7(5), 403–414 (2004). doi:10.1080/1364557032000081654
52. Parekh, J., Parekh, S.: The MLPBOON Predicting Media Interestingness System for MediaE-
val 2016. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
53. Rayatdoost, S., Soleymani, M.: Ranking images and videos on visual interestingness by visual
sentiment features. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
54. Schaul, T., Pape, L., Glasmachers, T., Graziano, V., Schmidhuber, J.: Coherence progress:
a measure of interestingness based on fixed compressors. In: International Conference on
Artificial General Intelligence, pp. 21–30. Springer, Berlin (2011)
55. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: 2007
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Providence
(2007)
56. Shen, Y., Demarty, C.H., Duong, N.Q.K.: Technicolor@MediaEval 2016 Predicting Media
Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
57. Shen, Y., Demarty, C.H., Duong, N.Q.K.: Deep learning for multimodal-based video interest-
ingness prediction. In: IEEE International Conference on Multimedia and Expo, ICME’17
(2017)
58. Silvia, P.J.: What is interesting? Exploring the appraisal structure of interest. Emotion 5(1), 89
(2005)
59. Silvia, P.J., Henson, R.A., Templin, J.L.: Are the sources of interest the same for everyone?
using multilevel mixture models to explore individual differences in appraisal structures.
Cognit. Emot. 23(7), 1389–1406 (2009)
60. Sjöberg, M., Baveye, Y., Wang, H., Quang, V.L., Ionescu, B., Dellandréa, E., Schedl, M.,
Demarty, C.H., Chen, L.: The mediaeval 2015 affective impact of movies task. In: Proceedings
of the MediaEval Workshop, CEUR Workshop Proceedings (2015)
61. Soleymani, M.: The quest for visual interest. In: ACM International Conference on Multime-
dia, pp. 919–922. ACM, New York (2015). doi:10.1145/2733373.2806364
62. Spain, M., Perona, P.: Measuring and predicting object importance. Int. J. Comput. Vis. 91(1),
59–76 (2011)
63. Stein, B.E., Stanford, T.R.: Multisensory integration: current issues from the perspective of the
single neuron. Nat. Rev. Neurosci. 9(4), 255–266 (2008)
64. Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using
classemes. In: IEEE ECCV European Conference on Computer Vision, pp. 776–789. Springer,
Berlin (2010)
65. Turner, S.A. Jr, Silvia, P.J.: Must interesting things be pleasant? A test of competing appraisal
structures. Emotion 6(4), 670 (2006)
66. Valdez, P., Mehrabian, A.: Effects of color on emotions. J. Exp. Psychol. Gen. 123(4), 394
(1994)
67. Vasudevan, A.B., Gygli, M., Volokitin, A., Gool, L.V.: ETH-CVL @ MediaEval 2016: textual-visual embeddings and Video2GIF for video interestingness. In: Proceedings of the MediaEval
Workshop, Hilversum (2016)
68. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: Sun database: large-scale scene
recognition from abbey to zoo. In: IEEE CVPR International Conference on Computer Vision
and Pattern Recognition, pp. 3485–3492 (2010)
69. Xu, B., Fu, Y., Jiang, Y.G.: BigVid at MediaEval 2016: predicting interestingness in images
and videos. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
70. Yang, Y.H., Chen, H.H.: Ranking-based emotion recognition for music organization and
retrieval. IEEE Trans. Audio Speech Lang. Process. 19(4), 762–774 (2011)
71. Yannakakis, G.N., Hallam, J.: Ranking vs. preference: a comparative study of self-reporting.
In: International Conference on Affective Computing and Intelligent Interaction, pp. 437–446.
Springer, Berlin (2011)