Using Abstract Anchors to Aid The Development of Multimedia
Applications With Sensory Effects
Raphael Abreu
CEFET/RJ
raphael.abreu@eic.cefet-rj.br
Joel A. F. dos Santos
CEFET/RJ
jsantos@eic.cefet-rj.br
ABSTRACT
Declarative multimedia authoring languages allow authors to combine multiple media objects, generating a range of multimedia presentations. Novel multimedia applications, focused on improving the user experience, extend multimedia applications with multisensory content. The idea is to synchronize sensory effects with the audiovisual content being presented. The usual approach for specifying such synchronization is to mark the content of a main media object (e.g. a main video), indicating the moments when a given effect has to be executed. For example, a mark may represent when snow appears in the main video so that a cold wind may be synchronized with it. Declarative multimedia authoring languages provide a way to mark subparts of a media object through anchors. An anchor indicates its begin and end times (video frames or audio samples) in relation to its parent media object. The manual definition of anchors in the above scenario is both inefficient and error-prone (i) when the main media object's size increases, (ii) when a given scene component appears several times and (iii) when the application requires marking several scene components.
This paper tackles this problem by providing an approach for creating abstract anchors in declarative multimedia documents. An abstract anchor represents (possibly) several media anchors, indicating the moments when a given scene component appears in a media object's content. The author is therefore able to define the application behavior through relationships among, for example, sensory effects and abstract anchors. Prior to execution, abstract anchors are automatically instantiated for each moment a given element appears, and relationships are cloned so the application behavior is maintained.
This paper presents an implementation of the proposed approach using NCL (Nested Context Language) as the target language. The abstract anchor processor is implemented in Lua and uses available APIs for video recognition in order to identify the begin and end times for abstract anchor instances. We also present an evaluation of our approach using a real-world use case.
CCS CONCEPTS
• Applied computing → Markup languages; • Human-centered computing → Hypertext / hypermedia; • Software and its engineering → Translator writing systems and compiler generators;
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DocEng'17, September 4–7, 2017, Valletta, Malta.
© 2017 ACM. 978-1-4503-4689-4/17/09...$15.00
DOI: http://dx.doi.org/10.1145/3103010.3103014
KEYWORDS
Anchors; Multimedia authoring; Multisensory Content; Mulsemedia; NCL; Video Recognition
ACM Reference format:
Raphael Abreu and Joel A. F. dos Santos. 2017. Using Abstract Anchors to Aid The Development of Multimedia Applications With Sensory Effects. In Proceedings of DocEng'17, September 4–7, 2017, Valletta, Malta, 8 pages.
DOI: http://dx.doi.org/10.1145/3103010.3103014
1 INTRODUCTION
The recent advances in human-computer interaction [4, 12, 18] offer many opportunities to enrich the multimedia experience with new features. Since the beginning of this decade there has been significant commercial interest in more immersive technologies (3D displays, VR, etc.). Such interest resulted in increased efforts by the multimedia community to develop new methods to enhance user immersion in multimedia applications [21].
New kinds of immersive multimedia applications have been proposed, giving rise to multiple sensorial media (Mulsemedia) applications [9], where traditional media content (text, image, audio, video, etc.) can be related to media objects that target other human senses (e.g. smell, haptics, etc.). To enable these applications, one can use physical sensing devices (sensors) to identify the ambient state (e.g. temperature, room size, user feedback) and actuators to generate sensory effects (e.g. wind, mist, heat) for the user.
Traditional declarative multimedia authoring languages (authoring languages, for short) specify interactive multimedia applications focusing on the definition of media object synchronization, independent of their content. Examples of authoring languages are SMIL (Synchronized Multimedia Integration Language) [23] and NCL (Nested Context Language) [11]. In the above scenario, it is interesting to take advantage of those languages' abstractions for media and relationship specification in order to provide synchronization among both traditional and multisensory content.
An approach for synchronizing traditional and multisensory content is to represent sensors and actuators as media objects and create relationships among parts of a main media object (e.g. a main video) and those media objects representing multisensory content. In order to do so, authors have to mark the main media object, indicating when, for example, an explosion occurs, so the corresponding sensory effect can be synchronized with it.
In this paper, we call a scene component a given element (rock,
tree, dog, person, etc.) or concept (happy, crowded, dark, etc.) that
appears in the main media object content.
The usual approach for marking when a given scene component
appears in a given media object is to execute such media object and
create anchors related to those components. Relationships among
such anchors and the related multisensory content, therefore, define
the intended synchronization.
When the application size grows, or when several scene components shall be synchronized with multisensory content, authors are required to create several anchors. The manual definition of such anchors, however, is not efficient. Moreover, such an approach can be error-prone, given the size of the resulting code. This problem was presented in [22], where the authors emphasize the need for automating this process.
This paper presents an approach for automating the creation of anchors in multimedia authoring languages. Our approach is to provide a way for the author to define abstract anchors in multimedia documents. An abstract anchor represents (possibly) several media anchors, indicating the moments when a given scene component appears in a media object's content. Relationships in the document are defined considering such abstract anchors. Prior to execution, a document with abstract anchors is processed so that abstract anchors are automatically instantiated for each moment a given scene component appears, and relationships are cloned so the application behavior is maintained.
The proposed approach was implemented using NCL as the target language. NCL is a standard for digital TV [1] and IPTV [11] services. It provides anchors for media objects, whose definition indicates their begin and end times in relation to their parent media object. In this work, NCL anchors were extended so they can indicate the scene component they refer to. The Abstract Anchor Processor, AAP for short, uses available APIs for video recognition in order to identify when a given scene component appears in the video content. An instance of a given abstract anchor is created for each time the element appears. In sequence, document relationships are cloned for each anchor instance, maintaining the document behavior. AAP was implemented in Lua [10] and is available for download and use¹.
Using NCL with Abstract Anchors, NCLAA for short, reduces the authoring effort, since anchors and document relationships are created only once for each different scene component. In order to support our claim, we present an evaluation of our approach using a real-world use case.
The remainder of the paper is organized as follows. Section 2
presents related work regarding approaches for reducing the authoring
effort for multimedia and mulsemedia applications. Section 3
discusses the concept of abstract anchors, their creation in NCL
and the steps for processing abstract anchors. Section 4 presents
the implementation of the abstract anchor processor. Section 5
presents our approach evaluation results. Section 6 concludes the
paper and presents future work.
2 RELATED WORK
A lot of attention has been devoted to reducing the authoring effort of multimedia and mulsemedia applications. Two common
approaches are to provide authoring tools or template languages
for those applications.
A template language allows the author to specify reusable components (placeholders) that should later be replaced by instances in the target language. More precisely, templates define generic components and express relationships between generic components, which can later be expanded into the target language by a template processor before runtime. The template processor ensures that the generic components are correctly instantiated in the target language. This section presents works focusing on templates for multimedia applications.
¹ https://github.com/raphael-abreu/NCLAAP
XTemplate [6] is a modular approach for creating templates for NCL documents. The proposed template language represents generic components and relationships among them. XTemplate specifies composite templates, which define spatio-temporal semantics to be reused by (possibly) several document compositions. Along with the template specification, a template processor was proposed. The processor receives as input a set of templates and a document using them, and returns an NCL-compliant document that can run on any standard NCL player. A similar approach is provided in [16], where the authors propose the TAL template language and its associated processor.
Some template languages support not only placeholders, but also loops and conditions, which are often lacking in declarative multimedia languages. This is the case of Luar [3]. The authors focus on authors with programming expertise, providing a way to embed Lua code in NCL documents. The Luar processor executes the Lua code embedded in the NCL document, producing an NCL-compliant document.
Another approach to reducing the authoring effort is to develop visual authoring tools. These tools help the user by providing a graphical user interface (GUI) that eases or removes the need to write code. In general, such approaches target non-expert authors, aiding the application development.
Examples of authoring tools for multimedia documents are [2, 5, 19, 20]. [2] proposes NCL Composer, an authoring tool presenting to the user a structural, a textual, and a layout view of an NCL document. It allows authors to interact with the document's logical structure by representing media objects as nodes and the relationships among them as edges.
A similar approach is presented in [19], where the NEXT tool is proposed. The difference is that NEXT is focused on templates, also providing a template view where authors may create documents using XTemplate templates.
LimSee [5] also uses templates for document authoring, in an approach similar to the one presented in [19]. Finally, xSMART [20] is used to create wizards that guide the creation of a multimedia document.
In the mulsemedia domain, much of the authoring effort lies in specifying scene components for synchronizing audiovisual content with sensory effects [25]. Usually, the authoring effort is to tie scene components to the sensory effects that a human should experience when they are presented [22], such as feeling cold when a snow scene is presented or heat when a beach scene is presented.
In [24] the authors present an authoring tool designed for authoring mulsemedia applications, called SEVino (Sensory Effect Video Annotation Tool). SEVino provides the author an interface that presents a video timeline. Such video represents the main audiovisual content with which sensory effects are to be synchronized. The tool creates cells representing sensory effects (e.g. fog, wind, temperature, etc.) and, for a given time interval, users can select a cell representing a sensory effect to be executed. After the authoring phase, the tool generates descriptions compatible with the MPEG-V standard [26], which is a standard for information exchange between the digital world and the real world. The MPEG-V descriptions generated by SEVino represent the sensory effects to be executed on physical devices.
Despite the advances in tools and templates for easing the authoring effort, authoring a mulsemedia application is still very expensive in terms of effort and time, especially when a great deal of synchronization between the audiovisual content and sensory effects is required.
This problem gave rise to research proposing semi-automatic or automatic video description. A video description indicates, for each instant of the video, the scene components that are present. Such approaches should require minimal to no author interaction for providing a video description, as well as for generating events based on that description.
The SEVino authors have also developed a media player capable of automatically gathering a video description and producing events in the ambient. More specifically, the proposed player can synchronize ambient lighting effects with a video presentation [24]. To achieve such synchronization, the player gathers pixel color information from a video frame (usually the borders) and sends the same color information to a nearby array of LED lights. This player removes the need for the user to specify the lighting effects in the multimedia document; however, the approach is restricted to only one kind of effect, in this case, lighting effects.
The work presented in this paper differs from related work as follows. (i) It enables the author to describe their application abstracting the video description, using abstract anchors. (ii) It enables the author to define abstract anchors for multiple videos in a document, and not just one as in the above approaches. (iii) It enables authors to synchronize any sensory effect with the application, by providing relationships among them and abstract anchors.
Although in this paper we present an approach for video description, the Abstract Anchor Processor (AAP) architecture is independent of the tool used for describing a media object's content. Therefore, it could also be used for defining abstract anchors for audio objects.
3 ABSTRACT ANCHORS
Multimedia applications are described by multimedia documents. A document specification is written using some multimedia authoring language. Common entities in multimedia authoring languages are nodes, representing the document content, and relationships, representing the synchronization to be performed in an application. Different languages, such as NCL [11], provide temporal anchors for representing a subpart of a node's content. Temporal anchors represent a subpart of a node's content on the time axis, for example, a sequence of frames of a video node or a sequence of samples in an audio node. Usually, temporal anchors are defined by a begin and an end time, with respect to the node content.
By allowing the author to define anchors, multimedia languages enable the definition of relationships taking into account parts of a node's content, thus providing fine-grained synchronization.
As discussed in Section 2, template authoring languages enable the user to abstract some steps of the authoring process in favor of a more generic description. After authoring, at processing time, the template processor "fills in the blanks" with document-specific content.
With that in mind, this work enables the author to make use of abstract anchors (NCLAA) to represent subparts of a node's content without explicitly describing them. It is similar to a template approach, in the sense that it enables another level of abstraction in the authoring phase.
An abstract anchor represents (possibly) several different node anchors that are related by the node content being presented while they are active. In our approach, abstract anchors are related to scene components, such that all of its instances represent when the scene component it is associated with is being presented. Figure 1 depicts this idea, where media nodes are represented as circles and node anchors as squares. Dashed lines associate an anchor to a node and solid lines represent document relationships.
Figure 1: Abstract anchor definition and processing
The upper part of Figure 1 presents a document where media video1 has three anchors: sea, snow and sun. Each anchor represents a given scene component. Relationships among such anchors and media objects wind effect and heat effect define when such media objects shall be presented.
NCL [11], the target language used in this work, provides element media for defining nodes representing media objects. It also enables the definition of anchors using element area, a child of element media. Listing 1 presents an example of media and anchor specification.

<media id="video1" src="video.mp4">
  <area tag="sea"/>
  <area tag="sun"/>
</media>

Listing 1: NCL media and anchor specification example
In order to provide the definition of abstract anchors, we extend NCL such that area elements have a new attribute, tag. This attribute indicates the scene component related to that anchor. In the example presented in Listing 1, two abstract anchors are created, one representing the instants when the sea appears in the video and the other representing the instants when the sun appears. Additionally, the author can set the tag attribute to an asterisk (*) if it should match every scene component in a document.
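The matching rule for the tag attribute, including the asterisk wildcard, can be sketched as follows. This is an illustrative Python fragment, not part of the AAP itself (which is written in Lua); the function name `matching_components` is ours.

```python
def matching_components(anchor_tag, recognized_tags):
    """Return the recognized scene components an abstract anchor matches:
    all of them for the "*" wildcard, otherwise only the exact tag."""
    if anchor_tag == "*":
        return set(recognized_tags)
    return {t for t in recognized_tags if t == anchor_tag}

print(sorted(matching_components("*", {"sea", "sun"})))  # → ['sea', 'sun']
print(matching_components("sea", {"sea", "sun"}))        # → {'sea'}
```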
NCL is an event-based language, such that synchronization relationships are defined based on events. NCL provides causal relationships such that when an event specified as its condition happens, one or more actions are triggered. Relationships in NCL are defined using link-connector element pairs. Connectors [15] define a general relation that is instantiated by links for a given set of participants. Listing 2 presents an example of link specification.

<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sea"/>
  <bind role="start" component="wind"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sun"/>
  <bind role="start" component="heat"/>
</link>

Listing 2: NCL link specification example
The example presented in Listing 2 defines two links. The first specifies that whenever anchor sea of video1 starts, media wind shall be started. The second specifies that whenever anchor sun of video1 starts, media heat shall be started. Two links are also created to stop the wind and the heat when the related anchor stops. For simplicity, they are not presented in Listing 2.
It is worth noticing that bind elements inside NCL links indicate the participants in a relationship. Attribute component indicates the participant node, and an optional attribute interface restricts it to a given node interface, i.e., a node anchor or property. In order to enable links to be defined over abstract anchors, we extend NCL such that attribute interface may indicate a tag attribute value instead of an anchor id.
Prior to execution, a document using abstract anchors shall be processed into a final document following the NCL standard. The processing performed for abstract anchors is similar to that performed for template languages. The first step of the process is to instantiate the abstract anchors for the scene components they specify. The second step is to duplicate links for each instance of a given abstract anchor. The whole process is shown in Figure 1.
The anchor instantiation step is performed using tools for scene recognition, as presented in Section 4.3. It recognizes the time instants a given scene component is present in the video content and creates anchor instances marked with a temporal description. Therefore, our approach requires from authors little (or even no) prior knowledge about the media content. The anchors' temporal definition is performed entirely with data acquired by the recognition software.
4 ARCHITECTURE
The architecture of the Abstract Anchor Processor (AAP) is depicted
in Figure 2.
AAP receives as input a document containing abstract anchors defined by the author. It parses the document, identifying nodes that define abstract anchors and links related to them. At this step, the processor also extracts media content from those nodes. For the example in Listing 1, the processor identifies node video1 as a node defining abstract anchors and shall extract its content (file video.mp4).
The extracted media content is sent to external software for scene recognition. As can be seen in Figure 2, the recognition software is decoupled from the processor. Such an approach gives more freedom to the author, allowing one to use different scene recognition software. The scene recognition step results in a set of tags² that are equivalent to the ones identified in the abstract anchors defined by the author. These tags represent the scene components along with timing information about when they appear in the video.
4.1 Anchor Instantiation
According to the tags received from the scene recognition software, AAP instantiates the abstract anchors. The process of anchor instantiation is performed as follows. According to the scene components specified in the abstract anchor, the processor checks in the set of received tags the time instants when those components were present. It identifies adjacent instants, defining intervals where scene components are present. For each resulting interval, one anchor instance is created. Listing 3 presents the result of the anchor instantiation step for the example in Listing 1.

<media src="video.mp4" id="video1">
  <area id="sea_1" begin="01s" end="09s"/>
  <area id="sea_2" begin="17s" end="19s"/>
  <area id="sun_1" begin="01s" end="19s"/>
  <area id="sun_2" begin="28s" end="32s"/>
</media>

Listing 3: Anchor instantiation step result for the example in Listing 1

² We use the same nomenclature as the scene recognition software. It shall not be confused with XML tags.
Figure 2: Abstract anchor processor architecture
In the example presented in Listing 3, the scene component sea was identified in the video in the intervals [1, 9] and [17, 19] seconds of the video. Thus two anchor instances were created, sea_1 for the first interval and sea_2 for the second one. The same is done for scene component sun, which was identified in the video inside intervals [1, 19] and [28, 32], generating anchor instances sun_1 and sun_2.
It is worth noticing that in the resulting document, the attribute tag was removed from the anchor instances. Anchor ids, which are mandatory in NCL, are created according to the tag attribute value. In order to maintain the output's compatibility with the NCL standard, each anchor id is also incremented to be unique in the whole document.
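The interval-merging logic of this step can be sketched as follows. This is an illustrative Python sketch, not the AAP's actual Lua code; it assumes the recognition step yields, per scene component, the set of seconds in which it was detected, and reproduces the grouping of adjacent seconds into anchor instances described above.

```python
def instantiate_anchors(tag, seconds):
    """Merge adjacent detection seconds into [begin, end] intervals and
    emit one <area> element per interval, with ids tag_1, tag_2, ..."""
    intervals = []
    for s in sorted(seconds):
        if intervals and s == intervals[-1][1] + 1:
            intervals[-1][1] = s          # extend the current interval
        else:
            intervals.append([s, s])      # start a new interval
    return ['<area id="%s_%d" begin="%02ds" end="%02ds"/>'
            % (tag, i + 1, b, e) for i, (b, e) in enumerate(intervals)]

print(instantiate_anchors("sea", {1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 18, 19}))
# → ['<area id="sea_1" begin="01s" end="09s"/>',
#    '<area id="sea_2" begin="17s" end="19s"/>']
```

Running it on the detections behind Listing 3 reproduces the sea_1 and sea_2 instances shown there.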
4.2 Link Instantiation
After the anchor instantiation process, AAP is able to instantiate links that refer to abstract anchors.
For each link marked at the beginning of the processing as using an abstract anchor, the processor examines each of its binds in order to determine its target element. Two outcomes are possible:
• The bind targets a media node as a whole or a regular anchor. In that case nothing has to be done.
• The bind targets an abstract anchor of a media node. In that case the link has to be duplicated for each instance of the abstract anchor.
This process continues until no link bind targets an abstract anchor. Listing 4 presents the result of the link instantiation step for the example in Listing 2.
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sea_1"/>
  <bind role="start" component="wind"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sea_2"/>
  <bind role="start" component="wind"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sun_1"/>
  <bind role="start" component="heat"/>
</link>
<link xconnector="onBeginStart">
  <bind role="onBegin" component="video1" interface="sun_2"/>
  <bind role="start" component="heat"/>
</link>

Listing 4: Link instantiation step result for the example in Listing 2
In the example presented in Listing 4, the first link from Listing 2 was instantiated for both instances of the abstract anchor sea. The resulting links now target anchors sea_1 and sea_2, respectively. The same process was done for the second link from Listing 2, which was instantiated for anchors sun_1 and sun_2.
It is worth noticing that the steps of anchor instantiation and link instantiation may be executed at distinct moments. It is possible for the author to use AAP to first instantiate the anchors, continue working on the document, and perform the link instantiation step later.
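The duplication rule of Section 4.2 can be sketched as follows. Again an illustrative Python sketch (the AAP is written in Lua): links are modeled as dictionaries holding a list of binds, `instances` maps each abstract tag to its concrete anchor ids, and every bind that targets an abstract anchor multiplies the link by the number of instances, while other binds pass through untouched.

```python
def instantiate_links(links, instances):
    """Clone each link once per instance of every abstract anchor its
    binds target; binds to whole nodes or regular anchors are kept."""
    result = []
    for link in links:
        expansions = [link["binds"]]
        for i, bind in enumerate(link["binds"]):
            tag = bind.get("interface")
            if tag in instances:  # this bind targets an abstract anchor
                expansions = [
                    binds[:i] + [dict(bind, interface=anchor_id)] + binds[i + 1:]
                    for binds in expansions
                    for anchor_id in instances[tag]
                ]
        result += [{"binds": binds} for binds in expansions]
    return result

link = {"binds": [
    {"role": "onBegin", "component": "video1", "interface": "sea"},
    {"role": "start", "component": "wind"},
]}
print(len(instantiate_links([link], {"sea": ["sea_1", "sea_2"]})))  # → 2
```

Applied to the first link of Listing 2, this yields the two wind links of Listing 4, one per instance of sea.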
4.3 Scene recognition
Given a set of abstract anchors previously defined by the author, AAP collects the anchors' tag attribute values along with their parent element's source. The resulting tags must be instantiated with temporal information that identifies where that tag appeared in the scene. Here we call this process scene recognition.
Scene recognition is achieved by submitting all the tag attribute values to the recognition system, which is a system that employs algorithms that can detect scene components in media content (e.g. video, audio, text analysis). These approaches return a set of tags describing the media content. Although static media (image and text) can also be analysed, this work focuses on continuous media objects, which are frequently used as the basis for sensory effect synchronization.
The scene recognition phase is decoupled from the processor to enable its adaptation to novel ways of recognizing features in any media format. The author can adapt the AAP settings for another recognition system. The only requirement is that the recognition system has to return a list of independent tags with their temporal data, according to the notation used by the processor.
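The paper does not fix the concrete notation, so the shape below is only one plausible reading of "a list of independent tags with their temporal data": a flat list of (tag, second) detections, hand-written here for illustration, which a simple grouping turns into the per-component time sets that the anchor instantiation step consumes.

```python
from collections import defaultdict

# Hypothetical recognition output: one (tag, second) pair per detection.
detections = [
    ("sea", 1), ("sun", 1), ("sea", 2), ("sun", 2),
    ("sun", 28), ("sun", 29),
]

# Group detections by tag to obtain each component's detection seconds.
by_tag = defaultdict(set)
for tag, second in detections:
    by_tag[tag].add(second)

print(sorted(by_tag["sun"]))  # → [1, 2, 28, 29]
```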
In our implementation we used a video recognition API³ based on Convolutional Neural Networks (CNNs) [14]. These neural networks have been shown to be an effective method for understanding video content [13, 28]. Figure 3 shows the result of an image recognition using such software.
Figure 3: Image recognition result
The example in Figure 3 returns a set of tags indicating the scene components present in the image. Each tag is followed by the neural network's prediction probability. The API can identify objects (e.g. boat), as well as individual concepts (e.g. reflection).
To recognize video content, the neural network works in a similar way to image recognition. One approach is to treat video as a series of images. However, as pointed out by [17], this approach does not account for the temporal information between frames and can lead to irrelevant concepts emerging from the scene. Nonetheless, one advantage of this method is that it requires less computation time to analyse the video.
Another approach is to consider the temporal relationship between the frames and deduce the tags by analysing relationships as time passes. An advantage of this method is that it decreases the probability of returning irrelevant tags from the video and keeps only the ones that persisted through the entire time. However, this approach is shown to be difficult to compute [17].
³ https://clarifai.com
In the video recognition API we used in this work, content description is performed for every second of video content. Therefore, after the instantiation phase, the events described in the multimedia document will also have a 1-second time-step.
The description of scenes one second at a time may seem to introduce a great deal of delay in the specification of sensory effect synchronization with audiovisual content. However, for mulsemedia applications, works published in the literature show that user perception of a sensory effect happens in a time window of ≈1 s for haptic effects [27], ≈2 s for heat effects [7], ≈3 s for wind effects [7, 27] and ≈25 s for scent effects [8].
Given the above results, we consider that the content description of a media object with a one-second step should not pose a threat to the user's quality of experience. A future work is to investigate an approach to reduce such a time step.
5 EVALUATION
For the purpose of evaluating our approach, we introduce a usage scenario to highlight how AAP supports the development of a mulsemedia application. We developed an NCL application that combines video and sensory effects to enrich the user experience. The application, called "environments around the world", consists of scenes about different environments that are presented to the user.
A timeline representation of the video content and its synchronization with sensory effects is presented in Figure 4. It presents a set of key frames of the video⁴ and three of the tags recognized in that part of the video. At the moment of each scene, the NCL application starts an actuator to perform a sensory effect related to that scene.
Table 1 describes the sensory effects to be synchronized when a given tag is found in the video. The effects vary from scent effects to wind, heat and cold effects. The effects also vary in intensity according to the scene components. One should notice that effects can be played at the same time. This shall occur when both tags are found in the video at the same time. Thus both area elements related to those tags will be active and, as a consequence of NCL links, so shall be the sensory effects.

Table 1: Sensory effects generated by each scene component

Tag      Sensory effects
Summer   wind 50%, heat 50%
Snow     cold 100%
Forest   forest scent 100%, wind 25%
Flower   flower scent 100%, wind 25%
Storm    wind 100%, cold 50%, air humidifier 100%
Sea      wind 50%, heat 50%, air humidifier 50%
Hot      wind 50%, heat 100%
The video was described in NCL with abstract anchors indicating the scene components of interest. They cover components present in all environments. Listing 5 presents the abstract anchor specification.
⁴ Images and videos are licensed as Creative Commons CC0 and were found at Pixabay. https://pixabay.com
Figure 4: Sensory eects generated on a video timeline
<media id="video" src="video.mp4">
  <area tag="summer"/>
  <area tag="snow"/>
  <area tag="forest"/>
  <area tag="flower"/>
  <area tag="storm"/>
  <area tag="sea"/>
  <area tag="hot"/>
</media>

Listing 5: NCL abstract anchors for the application "environments around the world"
The behavior of the application is defined by a group of 7 link elements (one for each abstract anchor). Listing 6 presents the link specification for one of the abstract anchors.

1 <link xconnector="onBeginStartSet">
2   <bind role="onBegin" component="video" interface="summer"/>
3   <bind role="start" component="wind">
4     <bindParam name="intensity" value="50%"/>
5   </bind>
6   <bind role="start" component="heat">
7     <bindParam name="intensity" value="50%"/>
8   </bind>
9 </link>

Listing 6: NCL link specification with intensity parameters
The link presented in Listing 6 synchronizes the scene component
summer to the sensory effects wind and heat. Both sensory effects
are represented as media nodes in the application and correspond
to Lua scripts that control the actuators responsible for each effect.
The scripts have an intensity parameter whose value is defined in
NCL by link parameters (lines 4 and 7). The intensity is expressed as
a percentage of the maximum intensity the actuator can provide.
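The percentage-to-level conversion such a script has to perform can be illustrated as follows. This is a hedged Python sketch: the 0–255 device range and the function name are illustrative assumptions, since the actual Lua scripts and actuator protocol are not shown here.

```python
def intensity_to_level(percent, max_level=255):
    """Map an NCL intensity parameter (0-100%) to a device level.

    The 0-255 device range is an illustrative assumption; real
    actuators expose their own command ranges.
    """
    if not 0 <= percent <= 100:
        raise ValueError("intensity must be between 0 and 100")
    return round(percent / 100 * max_level)

# The 'summer' link of Listing 6 starts wind and heat at 50%,
# i.e. half of the actuator's maximum capability.
print(intensity_to_level(50))
```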
The author of this application, using NCLAA, has to declare 7
abstract anchors and 7 links. The application has a total of 74 lines
of code to describe its behavior.
After processing, according to the video content, the document
has 45 anchor instances and 45 link instances. The processed
document has a total of 362 lines of code to perform the behavior
described by the abstract anchors.
As can be seen in this example, using abstract anchors the author
had to declare around 15% of the resulting number of anchors and
links and around 20% of the resulting lines of code. Moreover,
without the AAP the author would have to not only define the
anchors and links, but also carefully watch the video to recognize
scene components and their timing in order to describe the anchors
and their synchronization with the sensory effects. As intended, we
can see a great decrease in authoring effort with respect to manual
authoring.
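The reported savings can be checked directly from the counts given in this section, as the following small Python sketch shows (the numbers are taken from the text):

```python
# Counts reported for the "environments around the world" application.
authored_anchors, authored_links, authored_loc = 7, 7, 74
generated_anchors, generated_links, generated_loc = 45, 45, 362

# Fraction of anchors/links and lines of code the author had to write
# relative to what the AAP generates.
anchor_ratio = (authored_anchors + authored_links) / (generated_anchors + generated_links)
loc_ratio = authored_loc / generated_loc

print(f"anchors/links authored: {anchor_ratio:.1%}")  # around 15%
print(f"lines of code authored: {loc_ratio:.1%}")     # around 20%
```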
It is worth noting that the same code described using NCLAA
is maintained even if the video length changes. Given that the
abstract anchors are not directly related to the video length (and
timing), but only to the scene components it contains, the application
code does not have to change when the video length changes. This
result also favors the author, as the number of anchor instances
may increase with the video length.
6 CONCLUSION
This paper proposed an approach to describe multimedia applications
with abstract anchors. Abstract anchors represent intervals
when a given scene component is presented in the media node
content. Thus, a mulsemedia application author does not need to have
complete knowledge of a node's content to define its synchronization
with other content.
Such an approach is intended to be used in a mulsemedia context,
where it is common to synchronize sensory effects with
audiovisual content. The approach, however, is not restricted
to it and can also be used for traditional multimedia application
specification.
Together with the abstract anchors, the abstract anchor processor
(AAP) allows the automatic generation of node anchors based
on node content. It gathers information about the document and uses
scene recognition software to identify the temporal information
for anchors. This approach allows automatic media synchronization
to be performed based on video recognition.
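The instantiation step the AAP performs can be sketched as follows. This is a simplified Python sketch under stated assumptions: the recognition output format (a list of tag/begin/end detections) and the generated NCL fragments are illustrative, not the AAP's actual implementation, and a real generated link would carry all the binds of the abstract link, not just the trigger bind shown here.

```python
def instantiate_anchors(media_id, detections):
    """Expand abstract anchors into concrete NCL area elements and
    links, one pair per detected (tag, begin, end) interval.

    `detections` is assumed to come from a scene recognition tool,
    with times in seconds.
    """
    areas, links = [], []
    for i, (tag, begin, end) in enumerate(detections, start=1):
        anchor_id = f"{tag}_{i}"
        areas.append(
            f'<area id="{anchor_id}" begin="{begin}s" end="{end}s"/>'
        )
        # One link per instance, reusing the connector of the
        # abstract link (effect binds omitted in this sketch).
        links.append(
            f'<link xconnector="onBeginStartSet">'
            f'<bind role="onBegin" component="{media_id}" '
            f'interface="{anchor_id}"/></link>'
        )
    return areas, links

# Three detections yield three concrete anchors and three links.
areas, links = instantiate_anchors(
    "video", [("summer", 0, 12), ("snow", 30, 41), ("summer", 60, 75)]
)
print(len(areas), len(links))
```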
A positive side effect of our approach is that, since the
abstract anchors are not directly related to the video length (and
timing), but only to the scene components it contains, the application
code does not have to change when the video length changes.
Since the AAP has broad applicability to different
media types, a first future work is to integrate audio
recognition software into it. The idea is to identify scene components,
e.g., according to the background sound, and use such information
for anchor instantiation. A use case could be the automatic
synchronization of subtitles in NCL applications.
Another future work is to enhance the AAP with the ability to infer
synonyms of the words used to describe abstract anchors. The
current approach for identifying scene concepts can be error prone:
it can be difficult to guess which concepts the recognition
software can handle. There are several recognition software packages
available and they may not follow a common standard for concept
naming.
Finally, one interesting future work is to improve our approach
so that it can be used for live content. The AAP has to be able to
perform anchor and link instantiation at runtime. Besides, some kind
of caching strategy has to be used for the scene recognition
step. The challenge of that approach is related to preserving Quality of
Experience (QoE) in multimedia applications, which
may be lost due to the processing latency of some scene recognition
software.