Using Abstract Anchors to Aid The Development of Multimedia Applications With Sensory Effects
Raphael Abreu
CEFET/RJ
raphael.abreu@eic.cefet-rj.br
Joel A. F. dos Santos
CEFET/RJ
jsantos@eic.cefet-rj.br
ABSTRACT
Declarative multimedia authoring languages allow authors to combine multiple media objects, generating a range of multimedia presentations. Novel multimedia applications, focusing on improving user experience, extend multimedia applications with multisensory content. The idea is to synchronize sensory effects with the audiovisual content being presented. The usual approach for specifying such synchronization is to mark the content of a main media object (e.g., a main video) indicating the moments when a given effect has to be executed. For example, a mark may represent when snow appears in the main video so that a cold wind may be synchronized with it. Declarative multimedia authoring languages provide a way to mark subparts of a media object through anchors. An anchor indicates its begin and end times (video frames or audio samples) in relation to its parent media object. The manual definition of anchors in the above scenario is both inefficient and error-prone (i) when the main media object size increases, (ii) when a given scene component appears several times, and (iii) when the application requires marking several scene components.
This paper tackles this problem by providing an approach for creating abstract anchors in declarative multimedia documents. An abstract anchor represents (possibly) several media anchors, indicating the moments when a given scene component appears in a media object's content. The author, therefore, is able to define the application behavior through relationships among, for example, sensory effects and abstract anchors. Prior to execution, abstract anchors are automatically instantiated for each moment a given element appears, and relationships are cloned so that the application behavior is maintained.
This paper presents an implementation of the proposed approach using NCL (Nested Context Language) as the target language. The abstract anchor processor is implemented in Lua and uses available APIs for video recognition in order to identify the begin and end times for abstract anchor instances. We also present an evaluation of our approach using a real-world use case.
CCS CONCEPTS
• Applied computing → Markup languages; • Human-centered computing → Hypertext / hypermedia; • Software and its engineering → Translator writing systems and compiler generators;
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DocEng'17, September 4–7, 2017, Valletta, Malta.
© 2017 ACM. 978-1-4503-4689-4/17/09...$15.00
DOI: http://dx.doi.org/10.1145/3103010.3103014
KEYWORDS
Anchors; Multimedia authoring; Multisensory Content; Mulsemedia; NCL; Video Recognition
ACM Reference format:
Raphael Abreu and Joel A. F. dos Santos. 2017. Using Abstract Anchors to Aid The Development of Multimedia Applications With Sensory Effects. In Proceedings of DocEng'17, September 4–7, 2017, Valletta, Malta, 8 pages.
DOI: http://dx.doi.org/10.1145/3103010.3103014
1 INTRODUCTION
The recent advances in human-computer interaction ([4, 12, 18]) offer many opportunities to enrich the multimedia experience with new features. Since the beginning of this decade there has been significant commercial interest in more immersive technologies (3D displays, VR, etc.). Such interest resulted in increased efforts of the multimedia community to develop new methods to enhance user immersion in multimedia applications [21].
New kinds of immersive multimedia applications have been proposed, giving rise to multiple sensorial media (mulsemedia) applications [9], where traditional media content (text, image, audio, video, etc.) can be related to media objects that target other human senses (e.g., smell, haptics, etc.). To enable these applications, one can use physical sensing devices (sensors) to identify the ambient state (e.g., temperature, room size, user feedback) and actuators to generate sensory effects (e.g., wind, mist, heat) for the user.
Traditional declarative multimedia authoring languages, authoring languages for short, specify interactive multimedia applications focusing on the definition of media object synchronization, independent of their content. Examples of authoring languages are SMIL (Synchronized Multimedia Integration Language) [23] and NCL (Nested Context Language) [11]. In the above scenario, it is interesting to take advantage of those languages' abstractions for media and relationship specification in order to provide synchronization among both traditional and multisensory content.
An approach for synchronizing traditional and multisensory content is to represent sensors and actuators as media objects and create relationships among parts of a main media object (e.g., a main video) and those media objects representing multisensory content. In order to do so, authors have to mark the main media object indicating when, for example, an explosion occurs, so the corresponding sensory effect can be synchronized with it.
In this paper, we call a scene component a given element (rock,
tree, dog, person, etc.) or concept (happy, crowded, dark, etc.) that
appears in the main media object content.
The usual approach for marking when a given scene component appears in a given media object is to execute such media object and create anchors related to those components. Relationships among such anchors and the related multisensory content, therefore, define the intended synchronization.
When the application size grows, or when several scene components shall be synchronized with multisensory content, authors are required to create several anchors. The manual definition of such anchors, however, is not efficient. Moreover, such an approach can be error-prone, given the size of the resulting code. This problem was presented in [22], where the authors emphasize the need for automating this process.
This paper presents an approach for automating the creation of anchors in multimedia authoring languages. Our approach is to provide a way for the author to define abstract anchors in multimedia documents. An abstract anchor represents (possibly) several media anchors, indicating the moments when a given scene component appears in a media object's content. Relationships in the document are defined considering such abstract anchors. Prior to execution, a document with abstract anchors is processed so that abstract anchors are automatically instantiated for each moment a given scene component appears and relationships are cloned so the application behavior is maintained.
The proposed approach was implemented using NCL as the target language. NCL is a standard for digital TV [1] and IPTV [11] services. It provides anchors for media objects, whose definition indicates their begin and end times in relation to their parent media object. In this work, NCL anchors were extended so they can indicate the scene component they refer to. The Abstract Anchor Processor, AAP for short, uses available APIs for video recognition in order to identify when a given scene component appears in the video content. An instance of a given abstract anchor is created for each time the element appears. In sequence, document relationships are cloned for each anchor instance, maintaining the document behavior. AAP was implemented in Lua [10] and is available for download and use at https://github.com/raphael-abreu/NCLAAP.
Using NCL with Abstract Anchors, NCLAA for short, reduces authoring effort, since anchors and document relationships are created only once for each different scene component. In order to support our claim, we present an evaluation of our approach using a real-world use case.
The remainder of the paper is organized as follows. Section 2 presents related work regarding approaches for reducing the authoring effort for multimedia and mulsemedia applications. Section 3 discusses the concept of abstract anchors, their creation in NCL, and the steps for processing abstract anchors. Section 4 presents the implementation of the abstract anchor processor. Section 5 presents our approach evaluation results. Section 6 concludes the paper and presents future work.
2 RELATED WORK
A lot of attention has been devoted to reducing the authoring effort of multimedia and mulsemedia applications. Two common approaches are to provide authoring tools or template languages for those applications.
A template language allows the author to specify reusable components (placeholders) that should later be replaced by instances in the target language. More precisely, templates define generic components and express relationships between generic components that can later be expanded into a target language by a template processor before runtime. The template processor ensures that the generic components are correctly instantiated in the target language. This section presents works focusing on templates for multimedia applications.
XTemplate [6] is a modular approach for creating templates for NCL documents. The proposed template language represents generic components and relationships among them. XTemplate specifies composite templates, which define spatio-temporal semantics to be reused by (possibly) several document compositions. Along with the template specification, a template processor was proposed. The processor receives as input a set of templates and a document using them and returns an NCL-compliant document that can run on any standard NCL player. A similar approach is provided in [16], where the authors propose the TAL template language and its associated processor.
Some template languages support not only placeholders, but also loops and conditions, which are often lacking in declarative multimedia languages. This is the case of Luar [3]. The authors focus on authors with programming expertise, providing a way to embed Lua code in NCL documents. The Luar processor executes the Lua code embedded in the NCL document, producing an NCL-compliant document.
Another approach to reduce the authoring effort is to develop visual authoring tools. These tools help the user by providing a graphical user interface (GUI) that eases or removes the need to write code. In general, such approaches target non-expert authors, aiding the application development.
Examples of authoring tools for multimedia documents are [2, 5, 19, 20]. [2] proposes NCL Composer, an authoring tool presenting to the user a structural, a textual, and a layout view of an NCL document. It allows authors to interact with the document's logical structure by representing media objects as nodes and the relationships among them as edges.
A similar approach is presented in [19], where the NEXT tool is proposed. The difference is that NEXT is focused on templates, also providing a template view where authors may create documents using XTemplate templates.
LimSee [5] also uses templates for document authoring, in a similar approach to the one presented in [19]. Finally, xSMART [20] is used to create wizards to guide the creation of a multimedia document.
In the mulsemedia domain, much of the authoring effort lies in specifying scene components for synchronizing audiovisual content with sensory effects [25]. Usually, the authoring effort is to tie scene components to the sensory effects that a human should experience when they are presented [22], such as feeling cold when a snow scene is presented or feeling heat when a beach scene is presented.
In [24] the authors present an authoring tool designed for authoring mulsemedia applications, called SEVino (Sensory Effect Video Annotation Tool). SEVino provides the author with an interface that presents a video timeline. Such video represents the main audiovisual content with which sensory effects are to be synchronized. The tool creates cells representing sensory effects (e.g., fog, wind, temperature, etc.) and, for a given time interval, users can select a cell representing a sensory effect to be executed. After the authoring phase, the tool generates descriptions compatible with the MPEG-V standard [26], which is a standard for information exchange between the digital world and the real world. The MPEG-V descriptions generated by SEVino represent the sensory effects to be executed on physical devices.
Despite the advances in tools and templates for easing the authoring effort, the process of authoring a mulsemedia application is still very expensive in terms of effort and time, especially when a great deal of synchronization among the audiovisual content and sensory effects is required.
Such a problem gave rise to research proposing semi-automatic or automatic video description. A video description indicates, for each instant of the video, the scene components that are present. Such approaches should require minimal to no author interaction at all for providing a video description, as well as for generating events based on that description.
The SEVino authors have also developed a media player capable of automatically gathering a video description and producing events in the ambient. More specifically, the proposed player can synchronize ambient lighting effects with a video presentation [24]. To achieve such synchronization, the player gathers pixel color information from a video frame (usually the borders) and sends the same color information to a nearby array of LED lights. This player removes the need for the user to specify the lighting effects in the multimedia document; however, the approach is restricted to only one kind of effect, in this case, lighting effects.
The work presented in this paper differs from related work as follows. (i) It enables the author to describe their application abstracting the video description, using abstract anchors. (ii) It enables the author to define abstract anchors for multiple videos in a document, and not just one as in the above approaches. (iii) It enables authors to synchronize any sensory effect with the application, by providing relationships among them and abstract anchors.
Although in this paper we present an approach for video description, the Abstract Anchor Processor (AAP) architecture is independent of the tool used for describing a media object's content. Therefore, it could also be used for defining abstract anchors for audio objects.
3 ABSTRACT ANCHORS
Multimedia applications are described by multimedia documents. A document specification is described using some multimedia authoring language. Common entities for multimedia authoring languages are nodes, representing the document content, and relationships, representing the synchronization to be performed in an application.
Different languages, such as NCL [11], provide temporal anchors for representing a subpart of a node's content. Temporal anchors represent a subpart of a node's content in the time axis, for example, a sequence of frames of a video node or a sequence of samples in an audio node. Usually, temporal anchors are defined by a begin and an end, with respect to the node content.
By allowing the author to define anchors, multimedia languages enable the definition of relationships taking into account parts of a node's content, thus providing fine-grained synchronization.
As discussed in Section 2, template authoring languages enable the user to abstract some steps of the authoring process in favor of a more generic description. After authoring, at processing time, the template processor "fills in the blanks" with document-specific content.
With that in mind, this work enables the author to make use of abstract anchors (NCLAA) to represent subparts of a node's content without explicitly describing them. It is similar to a template approach, in the sense that it enables another level of abstraction in the authoring phase.
An abstract anchor represents (possibly) several different node anchors that are related by the node content being presented while they are active. In our approach, abstract anchors are related to scene components, such that all of their instances represent when the scene component they are associated with is being presented. Figure 1 depicts this idea, where media nodes are represented as circles and node anchors are represented as squares. Dashed lines associate an anchor with a node and solid lines represent document relationships.
Figure 1: Abstract anchor definition and processing
The upper part of Figure 1 presents a document where media video1 has three anchors: sea, snow, and sun. Each anchor represents a given scene component. Relationships among such anchors and the media objects wind effect and heat effect define when those media objects shall be presented.
NCL [11], the target language used in this work, provides the element media for defining nodes representing media objects. It also enables the definition of anchors using the element area, child of element media. Listing 1 presents an example of media and anchor specification.
<media id="video1" src="video.mp4">
  <area tag="sea"/>
  <area tag="sun"/>
</media>
Listing 1: NCL media and anchor specification example
In order to provide the definition of abstract anchors, we extend NCL such that area elements have a new attribute tag. Such attribute indicates the scene component related to that anchor. In the example presented in Listing 1, two abstract anchors are created, one representing the instants when the sea appears in the video and the other representing the instants when the sun appears. Additionally, the author can set the tag to an asterisk (*) if it should match every scene component in a document.
NCL is an event-based language such that synchronization relationships are defined based on events. NCL provides causal relationships such that when an event specified as its condition happens, one or more actions are triggered. Relationships in NCL are defined using link-connector element pairs. Connectors [15] define a general relation that is instantiated by links to a given set of participants. Listing 2 presents an example of link specification.
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sea"/>
  <bind role="start" component="wind"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sun"/>
  <bind role="start" component="heat"/>
</link>
Listing 2: NCL link specification example
The example presented in Listing 2 defines two links. The first specifies that whenever anchor sea of video1 starts, media wind shall be started. The second specifies that whenever anchor sun of video1 starts, media heat shall be started. Two links are also created to stop the wind and the heat when the related anchor stops. For simplicity, they are not presented in Listing 2.
It is worth noticing that bind elements inside NCL links indicate the participants in a relationship. Attribute component indicates the participant node, and an optional attribute interface restricts the participation to a given node interface, i.e., a node anchor or property. In order to enable links to be defined over abstract anchors, we extend NCL such that attribute interface may indicate a tag attribute value instead of an anchor id.
Prior to execution, a document using abstract anchors shall be processed into a final document following the NCL standard. The processing performed for abstract anchors is similar to that performed for template languages. The first step of the process is to instantiate the abstract anchors for the scene components they specify. The second step is to duplicate links for each instance of a given abstract anchor. The whole process is shown in Figure 1.
The anchor instantiation step is performed using tools for scene recognition, as presented in Section 4.3. It recognizes the time instants a given scene component is presented in the video content and creates anchor instances marked with their temporal definition. Therefore, our approach requires from authors little (or even no) prior knowledge about the media content. The anchors' temporal definition is built entirely from data acquired by the recognition software.
4 ARCHITECTURE
The architecture of the Abstract Anchor Processor (AAP) is depicted
in Figure 2.
AAP receives as input a document containing abstract anchors defined by the author. It parses the document, identifying nodes that define abstract anchors and the links related to them. At this step, the processor also extracts media content from those nodes. For the example in Listing 1, the processor identifies node video1 as a node defining abstract anchors and extracts its content (file video.mp4).
The extracted media content is sent to external software for scene recognition. As can be seen in Figure 2, the recognition software is decoupled from the processor. Such an approach gives more freedom to the author, allowing one to use different scene recognition software. The scene recognition step results in a set of tags (we use the same nomenclature as the scene recognition software; they should not be confused with XML tags) that are equivalent to the ones identified in the abstract anchors defined by the author. These tags represent the scene components along with timing information about when they appear in the video.
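To make the collection step concrete, the following minimal Lua sketch (an illustration under our own assumptions, not the actual AAP code) scans an NCL document with string patterns and gathers, for each media node, its source and the tags of its abstract anchors; a real implementation would rely on a proper XML parser rather than patterns.

-- Sketch of the collection step: find <media> nodes whose <area> children
-- carry a "tag" attribute and record the media source and its tags.
local function collect_abstract_anchors(ncl_source)
  local result = {}  -- maps media id -> { src = ..., tags = { ... } }
  for media_attrs, body in ncl_source:gmatch("<media(.-)>(.-)</media>") do
    local id  = media_attrs:match('id%s*=%s*"(.-)"')
    local src = media_attrs:match('src%s*=%s*"(.-)"')
    local tags = {}
    for tag in body:gmatch('<area%s+tag%s*=%s*"(.-)"') do
      tags[#tags + 1] = tag
    end
    if #tags > 0 then
      result[id] = { src = src, tags = tags }
    end
  end
  return result
end

-- For Listing 1 this would yield:
-- { video1 = { src = "video.mp4", tags = { "sea", "sun" } } }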
4.1 Anchor Instantiation
According to the tags received from the scene recognition software, AAP instantiates the abstract anchors. The process of anchor instantiation is performed as follows. According to the scene components specified in the abstract anchor, the processor checks, in the set of received tags, the time instants when those components were present. It identifies adjacent instants, defining intervals where scene components are present. For each resulting interval, one anchor instance is created. Listing 3 presents the result of the anchor instantiation step for the example in Listing 1.
<media src="video.mp4" id="video1">
  <area id="sea_1" begin="01s" end="09s"/>
  <area id="sea_2" begin="17s" end="19s"/>
  <area id="sun_1" begin="01s" end="19s"/>
  <area id="sun_2" begin="28s" end="32s"/>
</media>
Listing 3: Anchor instantiation step result for the example in Listing 1
Figure 2: Abstract anchor processor architecture
In the example presented in Listing 3, the scene component sea was identified in the video in the intervals [1, 9] and [17, 19] seconds of the video. Thus two anchor instances were created, sea_1 for the first interval and sea_2 for the second one. The same is done for scene component sun, which was identified in the video inside intervals [1, 19] and [28, 32], generating anchor instances sun_1 and sun_2.
It is worth noticing that, in the resulting document, the attribute tag was removed from the anchor instances. Anchor ids, which are mandatory in NCL, are created according to the tag attribute value. In order to keep the output compatible with the NCL standard, each anchor id is also incremented so that it is unique in the whole document.
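As an illustration of this step, the following Lua sketch (an assumption about the implementation, not the actual AAP code) merges the per-second detections reported for a given tag into contiguous intervals and emits one area instance per interval with a unique id.

-- Sketch of the anchor instantiation step: given the seconds in which the
-- recognition software detected a scene component, merge adjacent seconds
-- into intervals and build one <area> element per interval.
local function instantiate_anchors(tag, seconds)
  table.sort(seconds)
  local intervals = {}
  for _, s in ipairs(seconds) do
    local last = intervals[#intervals]
    if last and s == last.finish + 1 then
      last.finish = s                               -- extend current interval
    else
      intervals[#intervals + 1] = { begin = s, finish = s }
    end
  end
  local areas = {}
  for i, it in ipairs(intervals) do
    areas[#areas + 1] = string.format(
      '<area id="%s_%d" begin="%02ds" end="%02ds"/>', tag, i, it.begin, it.finish)
  end
  return areas
end

-- Example: detections of "sea" at seconds 1..9 and 17..19 would yield the
-- two instances sea_1 [01s, 09s] and sea_2 [17s, 19s], as in Listing 3.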
4.2 Link Instantiation
After the anchor instantiation process, AAP is able to instantiate links that refer to abstract anchors.
For each link marked at the beginning of the processing as using an abstract anchor, the processor examines each of its binds in order to determine its target element. Two outcomes are possible:
• The bind targets a media node as a whole or a regular anchor. In that case, nothing has to be done.
• The bind targets an abstract anchor of a media node. In that case, the link has to be duplicated for each instance of the abstract anchor.
This process continues until no link bind targets an abstract anchor. Listing 4 presents the result of the link instantiation step for the example in Listing 2.
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sea_1"/>
  <bind role="start" component="wind"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sea_2"/>
  <bind role="start" component="wind"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sun_1"/>
  <bind role="start" component="heat"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sun_2"/>
  <bind role="start" component="heat"/>
</link>
Listing 4: Link instantiation step result for the example in Listing 2
In the example presented in Listing 4, the first link from Listing 2 was instantiated for both instances of the abstract anchor sea. The resulting links now target anchors sea_1 and sea_2, respectively. The same process was applied to the second link from Listing 2, which was instantiated for anchors sun_1 and sun_2.
It is worth noticing that the steps of anchor instantiation and link instantiation may be executed at distinct moments. It is possible for the author to use AAP to first instantiate the anchors, continue working on the document, and perform the link instantiation step later.
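The following Lua sketch illustrates the duplication step (an assumption about the implementation, not the actual AAP code): links are represented as plain tables, and only the bind that referenced the abstract anchor is rewritten for each anchor instance, keeping every other bind intact.

-- Sketch of the link instantiation step. A link is assumed to be a table
-- { xconnector = ..., abstract_tag = ..., binds = { {role, component, interface}, ... } },
-- where abstract_tag holds the tag value the link originally referenced.
local function instantiate_links(link, anchor_instances)
  local new_links = {}
  for _, instance_id in ipairs(anchor_instances) do
    local clone = { xconnector = link.xconnector, binds = {} }
    for _, bind in ipairs(link.binds) do
      local b = { role = bind.role, component = bind.component,
                  interface = bind.interface }
      -- only the bind that referenced the abstract anchor is rewritten
      if b.interface == link.abstract_tag then
        b.interface = instance_id
      end
      clone.binds[#clone.binds + 1] = b
    end
    new_links[#new_links + 1] = clone
  end
  return new_links
end

-- Duplicating the first link of Listing 2 over { "sea_1", "sea_2" } produces
-- the first two links of Listing 4.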
4.3 Scene recognition
Given a set of abstract anchors previously defined by the author, AAP collects the anchors' tag attribute values along with their parent element's source. The resulting tags must be instantiated with temporal information that identifies when that tag appears in the content. Here we call this process scene recognition.
Scene recognition is achieved by submitting all the tag attribute values to the recognition system, i.e., a system that employs algorithms able to detect scene components in media content (e.g., video, audio, text analysis). Such approaches return a set of tags describing the media content. Although static media (image and text) can also be analysed, this work focuses on continuous media objects, which are frequently used as the basis for sensory effect synchronization.
The scene recognition phase is decoupled from the processor to enable its adaptation to novel ways of recognizing features in any media format. The author can adapt the AAP settings for another recognition system. The only requirement is that the recognition system has to return a list of independent tags with their temporal data, according to the notation used by the processor.
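As an illustration only, the snippet below shows one possible notation for such temporal data as a Lua table; the concrete format is an assumption here, since it depends on the AAP configuration and on the recognition service actually used.

-- Hypothetical per-second tag list returned by a recognition back end.
local recognition_output = {
  { second = 1,  tags = { "sea", "sun" } },
  { second = 2,  tags = { "sea", "sun" } },
  -- ...
  { second = 17, tags = { "sea" } },
  { second = 28, tags = { "sun" } },
}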
In our implementation we used a video recognition API (https://clarifai.com) based on Convolutional Neural Networks (CNNs) [14]. These neural networks have been shown to be an effective method for understanding video content ([13, 28]). Figure 3 shows the result of an image recognition using such software.
Figure 3: Image recognition result
The example in Figure 3 shows a set of tags indicating the scene components present in the image. Each tag is followed by the neural network prediction probability. The API can identify objects (e.g., boat) as well as individual concepts (e.g., reflection).
To recognize video content, the neural network works in a similar way to image recognition. One approach is to treat the video as a series of images. However, as pointed out by [17], this approach does not account for the temporal information between frames and can lead to irrelevant concepts emerging from the scene. Nonetheless, one advantage of this method is that it requires less computation time to analyse the video.
Another approach is to consider the temporal relationship between the frames and deduce the tags by analysing relationships as time passes. An advantage of this method is that it decreases the probability of returning irrelevant tags from the video and keeps only the ones that persist through the entire interval. However, this approach is shown to be difficult to compute [17].
For the video recognition API we used in this work, content description is performed for every second of video content. Therefore, after the instantiation phase, the events described in the multimedia document will also have a one-second time step.
Describing scenes one second at a time may seem to introduce a great deal of delay in the specification of sensory effect synchronization with audiovisual content. However, for mulsemedia applications, works published in the literature show that user perception of a sensory effect happens in a time window of 1 s for haptic effects [27], 2 s for heat effects [7], 3 s for wind effects [7, 27], and 25 s for scent effects [8].
Given the above results, we consider that the content description of a media object with a one-second step should not pose a threat to the user's quality of experience. A future work is to investigate an approach to reduce such a time step.
5 EVALUATION
For the purpose of evaluating our approach, we introduce a usage scenario to highlight how AAP supports the development of a mulsemedia application. We developed an NCL application that combines video and sensory effects to enrich the user experience. The application, called "environments around the world", consists of scenes of different environments that are presented to the user.
A timeline representation of the video content and its synchronization with sensory effects is presented in Figure 4. It presents a set of key frames of the video and three of the tags recognized in that part of the video (images and videos are licensed as Creative Commons CC0 and were obtained from Pixabay, https://pixabay.com). At the moment of each scene, the NCL application starts an actuator to perform a sensory effect related to that scene.
Table 1 describes the sensory effects to be synchronized when a given tag is found in the video. They range from scent effects to wind, heat, and cold effects. The effects also vary in intensity according to the scene components. One should notice that effects can be played at the same time. This occurs when two tags are found in the video at the same time. Thus both area elements related to those tags will be active and, as a consequence of the NCL links, so will the sensory effects.
Table 1: Sensory effects generated by each scene component
Tag      Sensory effects
Summer   wind 50%, heat 50%
Snow     cold 100%
Forest   forest scent 100%, wind 25%
Flower   flower scent 100%, wind 25%
Storm    wind 100%, cold 50%, air humidifier 100%
Sea      wind 50%, heat 50%, air humidifier 50%
Hot      wind 50%, heat 100%
The video was described in NCL with abstract anchors indicating the scene components of interest. They cover components present in all environments. Listing 5 presents the abstract anchor specification.
Figure 4: Sensory effects generated on a video timeline
<media id="video" src="video.mp4">
  <area tag="summer"/>
  <area tag="snow"/>
  <area tag="forest"/>
  <area tag="flower"/>
  <area tag="storm"/>
  <area tag="sea"/>
  <area tag="hot"/>
</media>
Listing 5: NCL abstract anchors for the application "environments around the world"
The behavior of the application is defined by a group of 7 link elements (one for each abstract anchor). Listing 6 presents a link specification for one of the abstract anchors.
1 <link xconnector="onBeginStartSet">
2   <bind role="onBegin" component="video" interface="summer"/>
3   <bind role="start" component="wind">
4     <bindParam name="intensity" value="50%"/>
5   </bind>
6   <bind role="start" component="heat">
7     <bindParam name="intensity" value="50%"/>
8   </bind>
9 </link>
Listing 6: NCL link specification with intensity parameters
The link presented in Listing 6 synchronizes the scene component summer with the sensory effects wind and heat. Both sensory effects are represented as media nodes in the application and correspond to Lua scripts that control the actuators responsible for each effect. The scripts have an intensity parameter whose value is defined in NCL by parameters (lines 4 and 7). The intensity is expressed as a percentage of the maximum intensity the actuator can provide.
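For illustration, the sketch below shows how such a control script might look as an NCLua media object, assuming the Ginga NCLua event module; the send_to_actuator helper is hypothetical and stands for whatever mechanism (serial port, socket, driver) actually drives the physical device.

-- Minimal sketch of an actuator-control script exposed as an NCL media node.
local intensity = 0

local function send_to_actuator(command, value)
  -- placeholder: here the script would talk to the wind (or heat) actuator
end

event.register(function(evt)
  if evt.class ~= "ncl" then return end
  if evt.type == "attribution" and evt.name == "intensity"
     and evt.action == "start" then
    -- the bindParam value (e.g. "50%") arrives as a property attribution
    intensity = tonumber(tostring(evt.value or ""):match("%d+")) or intensity
  elseif evt.type == "presentation" and evt.action == "start" then
    send_to_actuator("on", intensity)    -- anchor started: turn the effect on
  elseif evt.type == "presentation" and evt.action == "stop" then
    send_to_actuator("off", 0)           -- anchor stopped: turn the effect off
  end
end)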
The author of this application, using NCLAA, has to declare 7 abstract anchors and 7 links. The application has a total of 74 lines of code to describe its behavior.
After processing, according to the video content, the document has 45 anchor instances and also 45 link instances. The processed document has a total of 362 lines of code to perform the behavior described with the abstract anchors.
As can be seen in this example, using abstract anchors, the author had to declare around 15% of the resulting number of anchors and links and around 20% of the resulting lines of code. Moreover, without the use of AAP the author would have to not only define the anchors and links, but also carefully watch the video to recognize scene components and their timing in order to describe the anchors and their synchronization with the sensory effects. As intended, we can see a great decrease in the authoring effort with respect to manual authoring.
It is worth noting that the same code described using NCLAA is maintained even if the video length changes. Given that the abstract anchors are not directly related to the video length (and timing), but only to the scene components it contains, the application code does not have to change when the video changes. This result is also favorable to the author, as the number of anchor instances may increase with the video length.
6 CONCLUSION
This paper proposed an approach to describe multimedia applications with abstract anchors. Abstract anchors represent intervals when a given scene component is presented in the media node content. Thus, a mulsemedia application author does not need to have complete knowledge of a node's content for defining its synchronization with other content.
Such an approach is intended to be used in a mulsemedia context, where it is common to perform sensory effect synchronization in relation to audiovisual content. The approach, however, is not restricted to it and can be used for traditional multimedia application specification.
Together with the abstract anchors, the abstract anchor processor (AAP) allows the automatic generation of node anchors based on node content. It gathers information about the document and uses scene recognition software to identify the temporal information for anchors. This approach allows automatic media synchronization to be done based on video recognition.
A positive side effect of our approach is that, given that the abstract anchors are not directly related to the video length (and timing), but only to the scene components it contains, the application code does not have to change in case the video length changes.
Since the AAP processor has broad applications with different media types, a first future work is to integrate audio recognition software into it. The idea is to identify scene components, e.g., according to the background sound, and use such information for anchor instantiation. A use case could be the automatic synchronization of subtitles in NCL applications.
Another future work is to enhance AAP with the ability to infer synonyms of the words used to describe abstract anchors. The current approach for identifying scene concepts can be error-prone. Sometimes it can be difficult to guess which concepts the recognition software can handle. There are several recognition tools available and they may not follow a common standard for concept naming.
Finally, one interesting future work is to improve our approach so that it can be used for live content. AAP would have to be able to perform anchor and link instantiation at runtime. Besides, some kind of caching strategy has to be used for performing the scene recognition step. The challenge in that approach is related to Quality of Experience (QoE) preservation in multimedia applications, which may be compromised by the processing latency of some scene recognition software.
REFERENCES
[1] ABNT. 2011. Digital terrestrial television - Data coding and transmission specification for digital broadcasting - Part 2: Ginga-NCL for fixed and mobile receivers - XML application language for application coding. ABNT NBR 15606-2:2011 standard.
[2] Roberto Gerson A. Azevedo, Eduardo Cruz Araújo, Bruno Lima, Luiz Fernando G. Soares, and Marcelo F. Moreno. 2014. Composer: meeting non-functional aspects of hypermedia authoring environment. Multimedia Tools and Applications 70, 2 (2014), 1199–1228. DOI: http://dx.doi.org/10.1007/s11042-012-1216-8
[3] Diogo Henrique Duarte Bezerra, Denio Mariz Timóteo Sousa, Guido Lemos de Souza Filho, Aquiles Medeiros Filgueira Burlamaqui, and Igor Rosberg Medeiros Silva. 2012. Luar: A Language for Agile Development of NCL Templates and Documents. In Proceedings of the 18th Brazilian Symposium on Multimedia and the Web (WebMedia '12). ACM, New York, NY, USA, 395–402. DOI: http://dx.doi.org/10.1145/2382636.2382718
[4] Carolina Cruz-Neira, Daniel J. Sandin, Thomas A. DeFanti, Robert V. Kenyon, and John C. Hart. 1992. The CAVE: Audio Visual Experience Automatic Virtual Environment. Commun. ACM 35, 6 (June 1992), 64–72. DOI: http://dx.doi.org/10.1145/129888.129892
[5] Romain Deltour and Cécile Roisin. 2006. The LimSee3 multimedia authoring model. In Proceedings of the 2006 ACM Symposium on Document Engineering. ACM, 173–175.
[6] Joel André Ferreira dos Santos and Débora Christina Muchaluat Saade. 2010. XTemplate 3.0: Adding Semantics to Hypermedia Compositions and Providing Document Structure Reuse. In Proceedings of the 2010 ACM Symposium on Applied Computing (SAC '10). ACM, New York, NY, USA, 1892–1897. DOI: http://dx.doi.org/10.1145/1774088.1774490
[7] Felix Hülsmann, Nikita Mattar, and Julia Fröhlich. 2014. Simulating Wind and Warmth in Virtual Reality: Conception, Realization and Evaluation for a CAVE Environment. 11, 10 (2014).
[8] Gheorghita Ghinea and Oluwakemi A. Ademoye. 2010. Perceived synchronization of olfactory multimedia. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans 40, 4 (2010), 657–663. DOI: http://dx.doi.org/10.1109/TSMCA.2010.2041224
[9] Gheorghita Ghinea, Christian Timmerer, Weisi Lin, and Stephen R. Gulliver. 2014. Mulsemedia: State of the Art, Perspectives, and Challenges. ACM Transactions on Multimedia Computing, Communications, and Applications 11, 1s (2014), 1–23. DOI: http://dx.doi.org/10.1145/2617994
[10] Roberto Ierusalimschy. 2006. Programming in Lua (2nd ed.). Roberto Ierusalimschy.
[11] ITU. 2009. Nested Context Language (NCL) and Ginga-NCL for IPTV services. http://www.itu.int/rec/T-REC-H.761-200904-S. ITU-T Recommendation H.761.
[12] Alejandro Jaimes and Nicu Sebe. 2007. Multimodal human–computer interaction: A survey. Computer Vision and Image Understanding 108, 1–2 (2007), 116–134. DOI: http://dx.doi.org/10.1016/j.cviu.2006.10.019. Special Issue on Vision for Human-Computer Interaction.
[13] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14). IEEE Computer Society, Washington, DC, USA, 1725–1732. DOI: http://dx.doi.org/10.1109/CVPR.2014.223
[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1, 4 (Dec. 1989), 541–551. DOI: http://dx.doi.org/10.1162/neco.1989.1.4.541
[15] D. C. Muchaluat-Saade and L. F. G. Soares. 2002. XConnector & XTemplate: Improving the Expressiveness and Reuse in Web Authoring Languages. The New Review of Hypermedia and Multimedia Journal 8, 1 (2002), 139–169.
[16] Carlos de Salles Soares Neto, Luiz Fernando Gomes Soares, and Clarisse Sieckenius de Souza. 2012. TAL - Template Authoring Language. Journal of the Brazilian Computer Society 18, 3 (2012), 185–199. DOI: http://dx.doi.org/10.1007/s13173-012-0073-7
[17] Joe Yue-Hei Ng, Matthew J. Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond Short Snippets: Deep Networks for Video Classification. CoRR abs/1503.08909 (2015). http://arxiv.org/abs/1503.08909
[18] Sharon Oviatt. 2003. The Human-Computer Interaction Handbook. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, Chapter Multimodal Interfaces, 286–304. http://dl.acm.org/citation.cfm?id=772072.772093
[19] Douglas Paulo de Mattos, Júlia Varanda da Silva, and Débora Christina Muchaluat-Saade. 2013. NEXT: graphical editor for authoring NCL documents supporting composite templates. In Proceedings of the 11th European Conference on Interactive TV and Video. ACM, 89–98.
[20] A. Scherp and S. Boll. 2005. Context-driven Smart Authoring of Multimedia Content with xSMART. In 13th ACM Multimedia.
[21] Y. Sulema. 2016. Mulsemedia vs. Multimedia: State of the art and future trends. In 2016 International Conference on Systems, Signals and Image Processing (IWSSIP). 1–5. DOI: http://dx.doi.org/10.1109/IWSSIP.2016.7502696
[22] Christian Timmerer, Markus Waltl, Benjamin Rainer, and Hermann Hellwagner. 2012. Assessing the quality of sensory experience for multimedia presentations. Signal Processing: Image Communication 27, 8 (2012), 909–916. DOI: http://dx.doi.org/10.1016/j.image.2012.01.016
[23] W3C. 2008. Synchronized Multimedia Integration Language - SMIL 3.0 Specification. http://www.w3c.org/TR/SMIL3. World Wide Web Consortium Recommendation.
[24] Markus Waltl, Benjamin Rainer, Christian Timmerer, and Hermann Hellwagner. 2013. An end-to-end tool chain for Sensory Experience based on MPEG-V. Signal Processing: Image Communication 28, 2 (2013), 136–150. DOI: http://dx.doi.org/10.1016/j.image.2012.10.009
[25] K. Yoon, B. Choi, E. S. Lee, and T. B. Lim. 2010. 4-D broadcasting with MPEG-V. In 2010 IEEE International Workshop on Multimedia Signal Processing. 257–262. DOI: http://dx.doi.org/10.1109/MMSP.2010.5662029
[26] Kyoungro Yoon, Sang-Kyun Kim, Jae Joon Han, Seungju Han, and Marius Preda. 2015. MPEG-V: Bridging the Virtual and Real World (1st ed.). Academic Press.
[27] Zhenhui Yuan, Shengyang Chen, Gheorghita Ghinea, and Gabriel-Miro Muntean. 2014. User Quality of Experience of Mulsemedia Applications. ACM Transactions on Multimedia Computing, Communications, and Applications 11, 1s (2014), 1–19. DOI: http://dx.doi.org/10.1145/2661329
[28] Matthew D. Zeiler and Rob Fergus. 2013. Visualizing and Understanding Convolutional Networks. CoRR abs/1311.2901 (2013). http://arxiv.org/abs/1311.2901
... This is a very costly activity in terms of effort and time, besides being error-prone. Thus, accelerating and simplifying the authoring process is paramount to encourage community adoption of such applications [2]. ...
... These tools provide a sophisticated graphical editing interface for synchronizing a set of media objects with sensory effects. However, they still require a long and relatively complex authoring process, since the aforementioned "media content inspection" is still necessary [2]. ...
... 1 In the current implementation of STEVEML, a cloudbased DNN service for video recognition was used. 2 The chosen neural network API provides a free plan that allows the recognition of 5,000 seconds of video per month. It used the general recognition model which can return over 11,000 different labels. ...
... Para criar uma aplicação mulsemídia é necessário um esforço de autoria para realizar a sincronização de efeitos sensoriais com o conteúdo audiovisual [2]. Isto é, um autor de tais aplicações deve cuidadosamente inspecionar o conteúdo de mídia para identificar e marcar os momentos de início e fim de um dado efeito sensorial. ...
... Isto é, um autor de tais aplicações deve cuidadosamente inspecionar o conteúdo de mídia para identificar e marcar os momentos de início e fim de um dado efeito sensorial. Este processo de autoria manual é custoso e pode induzir a erros [2]. Sendo assim, uma forma de incentivar a autoria de aplicações mulsemídia é diminuir a carga de autoria manual, em especial utilizando sistemas inteligentes que possam automatizar o processo de autoria de efeitos sensoriais. ...
... No contexto de aplicações mulsemídia, grande parte do esforço de autoria está na sincronização temporal dos efeitos sensoriais com conteúdos audiovisuais [2]. Portanto, uma forma fundamental de incentivar autores a criarem aplicações mulsemídia é utilizar ferramentas de autoria que permitem a definição e sincronização de efeitos sensoriais graficamente. ...
Conference Paper
Full-text available
Synchronization of sensory effects with multimedia content is a non-trivial and error-prone task that can discourage authoring of mulsemedia applications. Although there are authoring tools that assist in the specification of sensory effect metadata in an automated way, the forms of analysis used by them are not general enough to identify complex components that may be related to sensory effects. In this work, we present an intelligent component, which allows the semi-automatic definition of sensory effects. This component uses a neural network to extract information from video scenes. This information is used to set sensory effects synchronously to related videos. The proposed component was implemented in STEVE 2.0 authoring tool, helping the authoring of sensory effects in a graphical interface.
... In [9], we define a scene component as, a given element (rock, tree, dog, person, etc.) or concept (happy, crowded, dark, etc.) that appears in the content of a media object. In the application example we presented earlier, scene components may refer to the sun, the beach, trees, flowers, and other elements that may appear in the Rio de Janeiro's sights presented in the touristic program. ...
... In previous work [9], [10], we tackle the problem of automatically recognizing scene components in audiovisual objects, in order to assist the realization of the synchronization task in mulsemedia applications. We proposed an architecture capable of identifying the presence of scene components in video and audio objects and defining, in a semi-supervised manner, the synchronization among sensory effects and an application main video and/or audio. ...
Preprint
Full-text available
In mulsemedia applications, traditional media content (text, image, audio, video, etc.) can be related to media objects that target other human senses (e.g., smell, haptics, taste). Such applications aim at bridging the virtual and real worlds through sensors and actuators. Actuators are responsible for the execution of sensory effects (e.g., wind, heat, light), which produce sensory stimulations on the users. In these applications sensory stimulation must happen in a timely manner regarding the other traditional media content being presented. For example, at the moment in which an explosion is presented in the audiovisual content, it may be adequate to activate actuators that produce heat and light. It is common to use some declarative multimedia authoring language to relate the timestamp in which each media object is to be presented to the execution of some sensory effect. One problem in this setting is that the synchronization of media objects and sensory effects is done manually by the author(s) of the application, a process which is time-consuming and error prone. In this paper, we present a bimodal neural network architecture to assist the synchronization task in mulsemedia applications. Our approach is based on the idea that audio and video signals can be used simultaneously to identify the timestamps in which some sensory effect should be executed. Our learning architecture combines audio and video signals for the prediction of scene components. For evaluation purposes, we construct a dataset based on Google's AudioSet. We provide experiments to validate our bimodal architecture. Our results show that the bimodal approach produces better results when compared to several variants of unimodal architectures.
... The insufficiency of knowledge and guidelines on how to benefit from the computer-aided tools requires a culture building in higher education (HE) systems [15], [16]. The challenge becomes more critical when it is compulsory to shift to partially or completely online learning platforms [17]. ...
Article
One-dimensional (1-D) demonstrations, e.g., the black-box systems, have become popular in teaching materials for engineering modules due to the high complexity of the system's multidimensional (e.g., 2-D and 3-D) identities. The need for multidimensional explanations on how multiphysics equations and systems work is vital for engineering students, whose learning experience must gain a cognitive process understanding for utilizing such multiphysics-focused equations into a pragmatic dimension. The lack of knowledge and expertise in creating animations for visualizing sequent processes and operations in academia can result in an ineffective learning experience for engineering students. This study explores the benefits of animation, which can eventually improve the teaching and student learning experiences. In this article, the use of computer-aided animation tools is evaluated based on their capabilities. Based on their strengths and weaknesses, the study offered some insights for selecting the investigated tools. To verify the effectiveness of animations in teaching and learning, a survey was conducted for undergraduate and postgraduate cohorts and automotive engineering academics. Based on the survey's data, some analytics and discussion have offered more quantitative results. The historic data (2012-2020) analysis has validated the animations efficacy as achievements of the study, where the average mark of both modules has significantly improved, with the reduced rate of failure.
... Such elements may have a virtual anchor, called RecognitionAnchor, that triggers a recognition event when an expected interaction is recognized from the input device. Abreu and Santos [16] propose the AbstractAnchor, which is an anchor type that represents parts of a content node where concepts are detected. Then, during the document parsing, the processor analyses all the media and create the timestamps relative to the time interval where expected concepts are recognized in each media. ...
Article
Full-text available
Recently the Brazilian DTV system standards have been upgraded, called TV 2.5, in order to provide a better integration between broadcast and broadband services. The next Brazilian DTV system evolution, called TV 3.0, will address more deeply this convergence of TV systems not only at low-level network layers but also at the application layer. One of the new features to be addressed by this future application layer is the use of Artificial Intelligence technologies. Recently, there have been practical applications using Artificial Intelligence (AI) deployed to improve TV production efficiency and correlated cost reduction. The success in operationalize and evaluate these applications is a strong indication of the interest and relevance of AI in TV. This paper presents TeleMídia Lab’s future vision on interactive and intelligent TV Systems, with particular focus on edge AI. Edge AI means use in-device capabilities to run AI applications instead of running them in cloud.
... The introduction of multimedia technology into the classroom is an important part of the modernization of education [1]. As a means of teaching organization, multimedia teaching uses multimedia to process text, image, sound, animation, and other information to form visualized teachings of sound, image, picture, and text [2]. It not only stimulates students' interest in learning, but also helps students understand and master the content of art teaching. ...
Article
Full-text available
Symmetries play a vital role in multimedia-aided art teaching activities. The relevant teaching systems designed with a social network, including the optimized teaching methods, are on the basis of symmetry principles. In order to study art teaching, from the perspective of the teaching organization form, combined with the survey method, multimedia-aided art classroom teaching was explained in detail. Based on the symmetrical thinking in art teaching, the multimedia-aided teaching mode of art classroom was discussed. The reasons for the misunderstanding of multimedia-aided art teaching were analyzed, and the core factors affecting the use of multimedia art teaching were found. In art teaching, more real pictures were shown aided by multimedia; students could experience the beauty of symmetrical things in real life and were guided to find the artistic characteristics of these kinds of graphics, analyze them, and summarize them. The results showed that this method enriched the art multimedia teaching theory and improved the efficiency of art teaching. The blind use of multimedia technology by teachers in art classroom teaching was avoided. Therefore, the method can develop individualized teaching, develop students’ potential, and cultivate innovative consciousness and practical ability.
... Abreu and Santos [3] propose the AbstractAnchor, which is an anchor type that represents parts of a content node where concepts are detected. They implement an Abstract Anchor Processor (AAP) that use an API of image classification to analyze video frames. ...
Chapter
Full-text available
Deep learning research has allowed significant advances in several areas of multimedia, especially in tasks related to speech processing, hearing, and computational vision. Particularly, recent usage scenarios in hypermedia domain already use such deep learning tasks to build applications that are sensitive to its media content semantics. However, the development of such scenarios is usually done from scratch. In particular, current hypermedia standards such as HTML do not fully support such kind of development. To support such development, we propose that a hypermedia language should be extended to support: (1) describe learning using structured media datasets; (2) recognize content semantics of the media elements in presentation time; (3) use the recognized semantics elements as events in during the multimedia. To illustrate our approach, we extended the NCL language, and its model NCM, to support such features. NCL (Nested Context Language) is the declarative language for developing interactive applications for Brazilian Digital TV and an ITU-T Recommendation for IPTV services. As a result of the work, it is presented a usage scenario to highlight how the extended NCL supports the development of content-aware hypermedia presentations, attesting the expressiveness and applicability of the model.
Chapter
The Fog of Things (FoT) proposes a paradigm that uses the Fog Computing concept to deploy Internet of Things (IoT) applications. The FoT exploits the processing, storage, and network capacity of local resources, allowing the integration of different devices into a seamless IoT architecture, and it defines the components that compose the FoT paradigm, describing their characteristics. This chapter presents the FoT paradigm and relates it to the IoT architecture, describing the main characteristics and concepts, from sensor and actuator communication to gateways and local and cloud servers. Lastly, this chapter presents the SOFT-IoT platform as a concrete implementation of FoT, which uses a microservice infrastructure distributed across devices in the IoT system.
Chapter
Model-driven Engineering (MDE) is an approach that considers models the main artifacts in software development. Models are generally built using domain-specific languages, such as UML and XML, which are defined by their own metamodels. In this context, this chapter presents the basics of MDE as well as the key frameworks and languages available to support it, providing the necessary background for setting up an environment in which models can be built in accordance with a particular metamodel. Models built in this environment can then be used to document and maintain systems from different domains.
Article
Full-text available
User Quality of Experience (QoE) is of fundamental importance in multimedia applications and has been extensively studied for decades. However, user QoE in the context of emerging multiple-sensorial media (mulsemedia) services, which involve media components different from those of traditional multimedia applications, has not been comprehensively studied. This article presents the results of subjective tests that investigated user perception of mulsemedia content. In particular, the impact of the intensity of certain mulsemedia components, including haptic and airflow effects, on user-perceived experience is studied. Results demonstrate that by making use of mulsemedia the overall user enjoyment levels increased by up to 77%.
Article
Full-text available
Mulsemedia – multiple sensorial media – captures a wide variety of research efforts and applications. This paper presents a historical perspective on mulsemedia work and reviews current developments in the area. These take place across the traditional multimedia spectrum – from virtual reality applications to computer games – as well as in efforts in the arts, gastronomy, and therapy, to mention a few. We also describe standardization efforts, via the MPEG-V standard, and identify future developments and exciting challenges the community needs to overcome.
Conference Paper
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggest a multiresolution, foveated architecture as a promising way of speeding up training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
Article
Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation, and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling architectures, examining the design choices that need to be made when adapting a CNN for this task. The second method explicitly models the video as an ordered sequence of frames, employing a recurrent neural network with Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports-1M dataset (73.1% vs. 60.9%) and on the UCF-101 dataset both with (88.6% vs. 87.9%) and without (82.6% vs. 72.8%) additional optical flow information.
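For intuition on the first of those two methods, the snippet below gives a deliberately simplified, hypothetical illustration of temporal pooling in Lua: per-frame class scores (as a CNN might produce) are averaged into a single video-level prediction. The cited work evaluates far richer pooling architectures and an LSTM variant; the class names and scores here are invented.

    -- Averages per-frame class scores into one video-level score table.
    local function pool_scores(frame_scores)
      local pooled, n = {}, #frame_scores
      for _, scores in ipairs(frame_scores) do
        for class, s in pairs(scores) do
          pooled[class] = (pooled[class] or 0) + s / n
        end
      end
      return pooled
    end

    -- Returns the class with the highest pooled score.
    local function argmax(scores)
      local best_class, best = nil, -math.huge
      for class, s in pairs(scores) do
        if s > best then best_class, best = class, s end
      end
      return best_class, best
    end

    -- Invented per-frame scores for a three-frame clip.
    local frame_scores = {
      { soccer = 0.7, tennis = 0.3 },
      { soccer = 0.6, tennis = 0.4 },
      { soccer = 0.8, tennis = 0.2 },
    }
    print(argmax(pool_scores(frame_scores)))  -- soccer, approximately 0.7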
Conference Paper
Using the Ginga-NCL middleware, interactive multimedia applications for the Brazilian digital TV system are written in NCL (Nested Context Language). Although programming skills are not required when using a declarative authoring language, authors still need at least a basic knowledge of the language in order to develop an application. Aiming at facilitating and spreading the use of NCL, this paper presents a graphical editor that allows authors with no knowledge of the language to develop NCL documents. The proposed editor is called NEXT (NCL Editor Supporting XTemplate). To provide that facility, the editor uses hypermedia composite templates, which represent generic structures for NCL programs and are specified in the XTemplate 3.0 language. In addition, NEXT offers other functionalities, such as creating and editing NCL documents in different views, which facilitate the development of digital TV applications. Those functionalities are provided as a set of plugins, which makes the tool extensible and adaptable to different author skills.
Conference Paper
In the development of applications described in NCL, we have observed the reuse of certain models and document structures, achieved by repeating common code across applications. Thus, we see the need to generalize this kind of NCL development, a need that has also been observed by other developers who aim at reusing the structure of existing documents. This paper introduces Luar, an authoring language for NCL templates. The Luar language was conceived through the analysis of the behavior of iDTV applications. Luar has a template processor developed in the Lua language and a library to maintain and aggregate template collections, sharing them among developers. The entire template system aims to facilitate the design and development of interactive applications described in NCL through reuse.